# **Question 1. [Index] S&P 500 Stocks Added to the Index**

**Which year had the highest number of additions?**

Using the list of S&P 500 companies from Wikipedia's [S&P 500 companies page](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies), download the data including the year each company was added to the index.

Hint: you can use [pandas.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) to scrape the data into a DataFrame.

Steps:
1. Create a DataFrame with company tickers, names, and the year they were added.
2. Extract the year from the addition date and calculate the number of stocks added each year.
3. Which year had the highest number of additions (1957 doesn't count, as it was the year when the S&P 500 index was founded)? Write down this year as your answer (the most recent one, if you have several records).

*Context*:
> "Following the announcement, all four new entrants saw their stock prices rise in extended trading on Friday" - recent examples of S&P 500 additions include DASH, WSM, EXE, TKO in 2025 ([Nasdaq article](https://www.nasdaq.com/articles/sp-500-reshuffle-dash-tko-expe-wsm-join-worth-buying)).

*Additional*: How many current S&P 500 stocks have been in the index for more than 20 years? When stocks are added to the S&P 500, they usually experience a price bump as investors and index funds buy shares following the announcement.

In [178]:
!pip install -q yfinance

In [2]:
import pandas as pd

# URL of the Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

# Read all tables from the Wikipedia page
tables = pd.read_html(url)
sp500_table = tables[0]

# Select columns
sp500_df = sp500_table[['Symbol', 'Security', 'Date added']].copy()
sp500_df.columns = ['Ticker', 'Company', 'Date Added']

# Extract the year
sp500_df['Year Added'] = pd.to_datetime(sp500_df['Date Added'], errors='coerce').dt.year

sp500_df.head()


Unnamed: 0,Ticker,Company,Date Added,Year Added
0,MMM,3M,1957-03-04,1957
1,AOS,A. O. Smith,2017-07-26,2017
2,ABT,Abbott Laboratories,1957-03-04,1957
3,ABBV,AbbVie,2012-12-31,2012
4,ACN,Accenture,2011-07-06,2011


In [3]:
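# Exclude 1957, the year the index was founded
filtered_df = sp500_df[sp500_df['Year Added'] != 1957]

# Count additions per year
year_counts = filtered_df['Year Added'].value_counts()

# Keep the years tied for the highest count, most recent first
top_years = year_counts[year_counts == year_counts.max()].sort_index(ascending=False)

print(top_years.head(1))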


Year Added
2017    23
Name: count, dtype: int64
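
The *Additional* question about tenure can be approached the same way. Below is a minimal sketch on a hand-made table shaped like `sp500_df`; the tickers, dates, and the `as_of` reference date are illustrative assumptions, not values from the Wikipedia table.

```python
import pandas as pd

# Hand-made sample in the same shape as sp500_df; tickers and dates are illustrative only
sample = pd.DataFrame({
    "Ticker": ["AAA", "BBB", "CCC", "DDD"],
    "Date Added": ["1957-03-04", "2004-06-30", "2012-12-31", "2017-07-26"],
})
sample["Date Added"] = pd.to_datetime(sample["Date Added"], errors="coerce")

# Reference date is an assumption; swap in pd.Timestamp.today() for a live count
as_of = pd.Timestamp("2025-05-01")
cutoff = as_of - pd.DateOffset(years=20)

# A stock has been in the index for more than 20 years if it was added before the cutoff
long_tenured = sample[sample["Date Added"] < cutoff]
print(f"In the index for more than 20 years: {len(long_tenured)}")
```

Rows whose date fails to parse become `NaT` and are excluded by the comparison, so they are not counted either way.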


# **Question 2. [Macro] Indexes YTD (as of 1 May 2025)**

**How many indexes (out of 10) have better year-to-date returns than the US (S&P 500) as of May 1, 2025?**

Using Yahoo Finance World Indices data, compare the year-to-date (YTD) performance (1 January-1 May 2025) of major stock market indexes for the following countries:
* United States - S&P 500 (^GSPC)
* China - Shanghai Composite (000001.SS)
* Hong Kong - HANG SENG INDEX (^HSI)
* Australia - S&P/ASX 200 (^AXJO)
* India - Nifty 50 (^NSEI)
* Canada - S&P/TSX Composite (^GSPTSE)
* Germany - DAX (^GDAXI)
* United Kingdom - FTSE 100 (^FTSE)
* Japan - Nikkei 225 (^N225)
* Mexico - IPC Mexico (^MXX)
* Brazil - Ibovespa (^BVSP)

*Hint*: use start_date='2025-01-01' and end_date='2025-05-01' when downloading daily data in yfinance

Context:
> [Global Valuations: Who's Cheap, Who's Not?](https://simplywall.st/article/beyond-the-us-global-markets-after-yet-another-tariff-update) article suggests "Other regions may be growing faster than the US and you need to diversify."

Reference: Yahoo Finance World Indices - https://finance.yahoo.com/world-indices/

*Additional*: How many of these indexes have better returns than the S&P 500 over 3, 5, and 10 year periods? Do you see the same trend?
(Note: For simplicity, ignore currency conversion effects.)

In [4]:
import yfinance as yf
import pandas as pd

tickers = {
    "US - S&P 500": "^GSPC",
    "China - Shanghai Composite": "000001.SS",
    "Hong Kong - HANG SENG": "^HSI",
    "Australia - S&P/ASX 200": "^AXJO",
    "India - Nifty 50": "^NSEI",
    "Canada - S&P/TSX Composite": "^GSPTSE",
    "Germany - DAX": "^GDAXI",
    "UK - FTSE 100": "^FTSE",
    "Japan - Nikkei 225": "^N225",
    "Mexico - IPC Mexico": "^MXX",
    "Brazil - Ibovespa": "^BVSP"
}

start_date = "2025-01-01"
end_date = "2025-05-01"

adj_close_df = pd.DataFrame()

for country, ticker in tickers.items():
    data = yf.download(ticker, start=start_date, end=end_date)

    if 'Adj Close' in data.columns:
        adj_close_df[country] = data['Adj Close']
    elif 'Close' in data.columns:
        adj_close_df[country] = data['Close']
    else:
        print("No Data")

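# Use each index's own first and last available price: markets have different holidays,
# so a single row may hold NaN for some of them
first_prices = adj_close_df.bfill().iloc[0]
last_prices = adj_close_df.ffill().iloc[-1]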

ytd_returns = (last_prices - first_prices) / first_prices

us_return = ytd_returns.get("US - S&P 500", None)
print(f"\nUS - S&P 500:{us_return}")

better_than_us = ytd_returns[(ytd_returns > us_return) & (ytd_returns.index != "US - S&P 500")]
count_better = len(better_than_us)
print(better_than_us.head(10))
print(f"Total indexes: {len(better_than_us)}")




YF.download() has changed argument auto_adjust default to True


US - S&P 500:-0.0510330074824504
China - Shanghai Composite    0.005048
Hong Kong - HANG SENG         0.127200
Australia - S&P/ASX 200      -0.009145
India - Nifty 50              0.006017
Canada - S&P/TSX Composite   -0.002261
Germany - DAX                 0.123464
UK - FTSE 100                 0.028426
Mexico - IPC Mexico           0.130494
Brazil - Ibovespa             0.124387
dtype: float64
Total indexes: 9
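
For the *Additional* question on 3, 5, and 10 year periods, the same first-to-last price ratio can be applied to longer windows. The sketch below uses a synthetic series rather than downloaded data; `trailing_return` is a hypothetical helper, and the 1% monthly growth is an assumption chosen so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Synthetic monthly index levels growing 1% per month (illustrative, not market data)
dates = pd.date_range("2015-05-01", "2025-05-01", freq="MS")
levels = pd.Series(100 * 1.01 ** np.arange(len(dates)), index=dates)

def trailing_return(series: pd.Series, years: int, as_of: pd.Timestamp) -> float:
    """Simple return over the `years` ending at `as_of` (hypothetical helper)."""
    start = as_of - pd.DateOffset(years=years)
    # Label-based slicing keeps every observation between start and as_of, inclusive
    window = series.loc[start:as_of].dropna()
    return window.iloc[-1] / window.iloc[0] - 1

as_of = pd.Timestamp("2025-05-01")
for years in (3, 5, 10):
    print(f"{years}-year return: {trailing_return(levels, years, as_of):.2%}")
```

Applied to each index's closing prices, the resulting figures could be compared against the S&P 500 value for the same horizon, in the same way as the YTD comparison.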





# **Question 3. [Index] S&P 500 Market Corrections Analysis**


**Calculate the median duration (in days) of significant market corrections in the S&P 500 index.**

For this task, define a correction as an event when a stock index goes down by **at least 5%** from the most recent all-time high.

Steps:
1. Download S&P 500 historical data (1950-present) using yfinance
2. Identify all-time high points (where price exceeds all previous prices)
3. For each pair of consecutive all-time highs, find the minimum price in between
4. Calculate drawdown percentages: (high - low) / high × 100
5. Filter for corrections with at least 5% drawdown
6. Calculate the duration in days for each correction period
7. Determine the 25th, 50th (median), and 75th percentiles for correction durations

*Context:*
> * Investors often wonder about the typical length of market corrections when deciding "when to buy the dip" ([Reddit discussion](https://www.reddit.com/r/investing/comments/1jrqnte/when_are_you_buying_the_dip/?rdt=64135)).

> * [A Wealth of Common Sense - How Often Should You Expect a Stock Market Correction?](https://awealthofcommonsense.com/2022/01/how-often-should-you-expect-a-stock-market-correction/)

*Hint (use this data to compare with your results)*: Here is the list of top 10 largest corrections by drawdown:
* 2007-10-09 to 2009-03-09: 56.8% drawdown over 517 days
* 2000-03-24 to 2002-10-09: 49.1% drawdown over 929 days
* 1973-01-11 to 1974-10-03: 48.2% drawdown over 630 days
* 1968-11-29 to 1970-05-26: 36.1% drawdown over 543 days
* 2020-02-19 to 2020-03-23: 33.9% drawdown over 33 days
* 1987-08-25 to 1987-12-04: 33.5% drawdown over 101 days
* 1961-12-12 to 1962-06-26: 28.0% drawdown over 196 days
* 1980-11-28 to 1982-08-12: 27.1% drawdown over 622 days
* 2022-01-03 to 2022-10-12: 25.4% drawdown over 282 days
* 1966-02-09 to 1966-10-07: 22.2% drawdown over 240 days


In [53]:
import yfinance as yf
import pandas as pd

# Step 1: Download S&P 500 historical data (1950-present) using yfinance
ticker = '^GSPC'
sp500 = yf.download(ticker, start='1950-01-01', progress=False)

sp500 = sp500[['Close']].copy()  # copy, so that columns added below do not modify a slice
sp500.columns = ['Price']

# Step 2: Identify all-time high points
sp500['AllTimeHigh'] = sp500['Price'].cummax()

all_time_highs = sp500[sp500['Price'] == sp500['AllTimeHigh']]
all_time_highs = all_time_highs.reset_index()

# Step 3: For each pair of consecutive all-time highs, find the minimum price in between
lows_between_highs = []

for i in range(len(all_time_highs) - 1):
    start_date = all_time_highs.loc[i, 'Date']
    end_date = all_time_highs.loc[i + 1, 'Date']
    high_price = all_time_highs.loc[i, 'Price']

    mask = (sp500.index > start_date) & (sp500.index < end_date)
    price_between = sp500.loc[mask, 'Price']

    if not price_between.empty:
        min_price = price_between.min()
        min_date = price_between.idxmin()
        lows_between_highs.append((start_date.date(), end_date.date(), min_date.date(), min_price, high_price))

results = pd.DataFrame(lows_between_highs, columns=['Start_High', 'End_High', 'Min_Date_Between', 'Min_Price', 'High_Price'])

# Step 4: Calculate drawdown percentages: (high - low) / high × 100
results['Drawdown_pct'] = ((results['High_Price'] - results['Min_Price']) / results['High_Price']) * 100

# Step 5: Filter for corrections with at least 5% drawdown
filtered_results = results[results['Drawdown_pct'] >= 5].copy()

# Step 6: Calculate the duration in days for each correction period
filtered_results['Start_High'] = pd.to_datetime(filtered_results['Start_High'])
filtered_results['Min_Date_Between'] = pd.to_datetime(filtered_results['Min_Date_Between'])
filtered_results['Correction_Duration_Days'] = (filtered_results['Min_Date_Between'] - filtered_results['Start_High']).dt.days

top_10 = filtered_results.sort_values(by='Drawdown_pct', ascending=False).head(10)
print('TOP 10:\n',top_10[['Start_High', 'Min_Date_Between', 'Correction_Duration_Days', 'Drawdown_pct']])

# Step 7: Determine the 25th, 50th (median), and 75th percentiles for correction durations
percentiles = filtered_results['Correction_Duration_Days'].quantile([0.25, 0.5, 0.75])
print('Percentiles:\n',percentiles)



TOP 10:
     Start_High Min_Date_Between  Correction_Duration_Days  Drawdown_pct
448 2007-10-09       2009-03-09                       517     56.775388
443 2000-03-24       2002-10-09                       929     49.146948
206 1973-01-11       1974-10-03                       630     48.203593
193 1968-11-29       1970-05-26                       543     36.061641
574 2020-02-19       2020-03-23                        33     33.924960
292 1987-08-25       1987-12-04                       101     33.509515
133 1961-12-12       1962-06-26                       196     27.973568
219 1980-11-28       1982-08-12                       622     27.113582
620 2022-01-03       2022-10-12                       282     25.425097
176 1966-02-09       1966-10-07                       240     22.177335
Percentiles:
 0.25    21.5
0.50    39.0
0.75    89.0
Name: Correction_Duration_Days, dtype: float64
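
The pairing of consecutive all-time highs can also be expressed without an explicit date loop, by labelling every day with the number of all-time highs seen so far and grouping on that label. This is a sketch on made-up prices, not a replacement for the solution above; the variable names are illustrative.

```python
import pandas as pd

# Made-up closing prices, for illustration only (not real S&P 500 data)
prices = pd.Series(
    [100.0, 102.0, 95.0, 98.0, 103.0, 104.0, 97.0, 99.0, 105.0],
    index=pd.date_range("2024-01-01", periods=9, freq="D"),
    name="Price",
)

# A day is an all-time high when its price equals the running maximum
is_ath = prices == prices.cummax()

# Every all-time high starts a new episode; the following days inherit its id
episode_id = is_ath.cumsum()

rows = []
for _, episode in prices.groupby(episode_id):
    high_price = episode.iloc[0]  # each episode opens at an all-time high
    low_price = episode.min()
    rows.append({
        "Start_High": episode.index[0],
        "Min_Date": episode.idxmin(),
        "Drawdown_pct": (high_price - low_price) / high_price * 100,
    })

episodes = pd.DataFrame(rows)
episodes["Duration_Days"] = (episodes["Min_Date"] - episodes["Start_High"]).dt.days

# Keep only episodes with a drawdown of at least 5%
corrections = episodes[episodes["Drawdown_pct"] >= 5]
print(corrections)
print(corrections["Duration_Days"].quantile([0.25, 0.5, 0.75]))
```

One difference to keep in mind: grouping this way also produces an episode for the still-open period after the latest all-time high, which the loop over consecutive pairs leaves out. Dropping the last row of `episodes` would align the two approaches.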


# **Question 4. [Stocks] Earnings Surprise Analysis for Amazon (AMZN)**


**Calculate the median 2-day percentage change in stock prices following positive earnings surprise days.**

Steps:
1. Load earnings data from CSV ([ha1_Amazon.csv](ha1_Amazon.csv)) containing earnings dates, EPS estimates, and actual EPS. Make sure you use the correct delimiter when reading the data, for example `pandas.read_csv("ha1_Amazon.csv", delimiter=';')`.
2. Download complete historical price data using yfinance
3. Calculate 2-day percentage changes for all historical dates: for each sequence of 3 consecutive trading days (Day 1, Day 2, Day 3), compute the *return* as Close_Day3 / Close_Day1 - 1. (Assume Day 2 may correspond to the earnings announcement.)
4. Identify positive earnings surprises (where "actual EPS > estimated EPS" OR "Surprise (%)>0")
5. Calculate 2-day percentage changes following positive earnings surprises. Show your answer in % (closest number to the 2nd digit): *return* * 100.0
6. (Optional) Compare the median 2-day percentage change for positive surprises vs. all historical dates. Do you see the difference?
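
The 3-day window in step 3 can be attached to a row in two ways: the row can be Day 1 (the return looks two trading days ahead), or the row can be Day 2 (the return spans the day before to the day after). The sketch below shows both with `shift` on made-up prices; which alignment fits the earnings data depends on whether the announcement comes before or after the market close, which is an assumption to check against the CSV.

```python
import pandas as pd

# Made-up closing prices for six consecutive trading days
close = pd.Series(
    [100.0, 110.0, 121.0, 110.0, 99.0, 108.0],
    index=pd.bdate_range("2025-01-06", periods=6),
    name="Close",
)

# Row is Day 1: compare the close two trading days later with this row's close
ret_from_day1 = close.shift(-2) / close - 1

# Row is Day 2: compare the next close with the previous close
ret_around_day2 = close.shift(-1) / close.shift(1) - 1

print(pd.DataFrame({"from_day1": ret_from_day1, "around_day2": ret_around_day2}).round(4))
```

The two columns hold the same values shifted by one row, so the choice only changes which date each return is matched to.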

Context: Earnings announcements, especially when they exceed analyst expectations, can significantly impact stock prices in the short term.

Reference: Yahoo Finance earnings calendar - https://finance.yahoo.com/calendar/earnings?symbol=AMZN

*Additional*: Is there a correlation between the magnitude of the earnings surprise and the stock price reaction? Does the market react differently to earnings surprises during bull vs. bear markets?

In [176]:
import pandas as pd
import yfinance as yf
import numpy as np

# Step 1: Load earnings data from CSV
earnings_url = "https://raw.githubusercontent.com/DataTalksClub/stock-markets-analytics-zoomcamp/main/cohorts/2025/ha1_Amazon.csv"
earnings_df = pd.read_csv(earnings_url, delimiter=';')

# Step 2: Download complete historical price data using yfinance
ticker = "AMZN"
amazon_data = yf.download(ticker, period="max")
amazon_data = amazon_data[['Close']].reset_index()
amazon_data.columns = [col[0] for col in amazon_data.columns.values]

# Step 3: Calculate 2-day percentage changes for all historical dates: for each sequence of 3 consecutive trading days
amazon_data['Close_Day1'] = amazon_data['Close']
amazon_data['Close_Day3'] = amazon_data['Close'].shift(-2)
amazon_data['2d_return'] = (amazon_data['Close_Day3'] / amazon_data['Close_Day1']) - 1

# Step 4: Identify positive earnings surprises (where "actual EPS > estimated EPS" OR "Surprise (%)>0")
earnings_df['EPS Estimate'] = earnings_df['EPS Estimate'].astype(str).str.replace(r'\?{3}', '1', regex=True)
earnings_df['Reported EPS'] = earnings_df['Reported EPS'].astype(str).str.replace(r'\?{3}', '1', regex=True)
earnings_df['EPS_Estimate_clean'] = pd.to_numeric(earnings_df['EPS Estimate'], errors='coerce')
earnings_df['Reported_EPS_clean'] = pd.to_numeric(earnings_df['Reported EPS'], errors='coerce')

earnings_df['Surprise_clean'] = earnings_df['Surprise (%)'].str.replace('+', '', regex=False).str.strip()
earnings_df['Surprise_clean'] = pd.to_numeric(earnings_df['Surprise_clean'], errors='coerce')

positive_surprises_df = earnings_df[(earnings_df['Reported_EPS_clean'] > earnings_df['EPS_Estimate_clean'])|(earnings_df['Surprise_clean'] > 0)].copy()

# Step 5: Calculate the median 2-day percentage change in stock prices following positive earnings surprises days.
positive_surprises_df['Earnings_Date_clean'] = pd.to_datetime(positive_surprises_df['Earnings Date'].str.split(' at').str[0],errors='coerce')

trading_dates = amazon_data['Date'].reset_index(drop=True)
returns = []

for dt in positive_surprises_df['Earnings_Date_clean']:
    # Position of the first trading day on or after the earnings date
    idx = trading_dates.searchsorted(dt)

    if idx < len(trading_dates):
        next_day = trading_dates.iloc[idx]
        match = amazon_data[amazon_data['Date'] == next_day]
        if not match.empty:
            returns.append(match['2d_return'].values[0])

median_return = np.median(returns) * 100.0
print(f"\nMedian 2-day return after positive earnings surprises: {median_return:.2f}%")



Median 2-day return after positive earnings surprises: 0.27%
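
The cleaning in step 4 can be illustrated on a few made-up rows. This is a minimal sketch, assuming missing values appear as a placeholder string such as `???` (as the regex in the solution suggests); the numbers are invented for the example. `pd.to_numeric(..., errors="coerce")` maps anything non-numeric to `NaN`, and comparisons against `NaN` are `False`, so incomplete rows are never flagged as positive surprises.

```python
import pandas as pd

# Made-up rows; "???" stands in for a missing value
raw = pd.DataFrame({
    "EPS Estimate": ["0.52", "???", "1.10"],
    "Reported EPS": ["0.60", "???", "1.00"],
    "Surprise (%)": ["+15.38", None, "-9.09"],
})

# errors="coerce" turns anything non-numeric into NaN
estimate = pd.to_numeric(raw["EPS Estimate"], errors="coerce")
reported = pd.to_numeric(raw["Reported EPS"], errors="coerce")
surprise = pd.to_numeric(
    raw["Surprise (%)"].str.replace("+", "", regex=False), errors="coerce"
)

# Comparisons against NaN evaluate to False, so incomplete rows drop out
is_positive = (reported > estimate) | (surprise > 0)
print(is_positive.tolist())
```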



