## Module 1 Homework (2025 cohort)

In this homework, we will download financial data from various sources and run simple calculations and analyses.

---
### Question 1. [Index] S&P 500 Stocks Added to the Index

**Which year had the highest number of additions?**

Using the list of S&P 500 companies from Wikipedia's [S&P 500 companies page](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies), download the data including the year each company was added to the index.

Hint: you can use [pandas.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) to scrape the data into a DataFrame.

Steps:
1. Create a DataFrame with company tickers, names, and the year they were added.
2. Extract the year from the addition date and calculate the number of stocks added each year.
3. Which year had the highest number of additions (1957 doesn't count, as it was the year when the S&P 500 index was founded)? Write down this year as your answer (the most recent one, if you have several records).

*Context*:
> "Following the announcement, all four new entrants saw their stock prices rise in extended trading on Friday" - recent examples of S&P 500 additions include DASH, WSM, EXE, TKO in 2025 ([Nasdaq article](https://www.nasdaq.com/articles/sp-500-reshuffle-dash-tko-expe-wsm-join-worth-buying)).

*Additional*: How many current S&P 500 stocks have been in the index for more than 20 years? When stocks are added to the S&P 500, they usually experience a price bump as investors and index funds buy shares following the announcement.
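For the *Additional* question, one possible approach is to parse `Date added` as a date and compare it with a cutoff 20 years before a reference date. The sketch below is only an illustration: the sample rows, the `count_long_term_members` helper, and the reference date are assumptions, not part of the homework data.

```python
import pandas as pd

# Hypothetical sample rows in the shape of the Wikipedia table (not real membership data).
sample = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC", "DDD"],
    "Date added": ["1957-03-04", "2004-12-31", "2015-07-01", "2024-06-24"],
})

def count_long_term_members(df, as_of, min_years=20):
    """Count rows whose 'Date added' is more than `min_years` years before `as_of`."""
    # Unparseable dates become NaT and are never counted.
    added = pd.to_datetime(df["Date added"], errors="coerce")
    cutoff = pd.Timestamp(as_of) - pd.DateOffset(years=min_years)
    return int((added < cutoff).sum())

n_long_term = count_long_term_members(sample, as_of="2025-05-01")
print(n_long_term)
```

On the real table, the same function could be called on the DataFrame returned by `pd.read_html`, with `as_of` set to the date the page was scraped.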

In [1]:
# !pip install yfinance==0.2.61

In [2]:
import pandas as pd
import numpy as np
import yfinance as yf
import time
from datetime import datetime

**Guide**: [Tutorial: Using pandas read_html() to read tables in webpages](https://scrapingant.com/blog/pandas-read-html-table#:~:text=read_html()%3F-,%E2%80%8B,with%20limited%20web%20scraping%20experience.)

In [3]:
# Extract a list of dataframes made up of tables found in the url
link_sp500 = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(link_sp500)
print(f'Total tables extracted from S&P 500: {len(tables)}')

Total tables extracted from S&P 500: 2


In [4]:
df = tables[0]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Symbol                 503 non-null    object
 1   Security               503 non-null    object
 2   GICS Sector            503 non-null    object
 3   GICS Sub-Industry      503 non-null    object
 4   Headquarters Location  503 non-null    object
 5   Date added             503 non-null    object
 6   CIK                    503 non-null    int64 
 7   Founded                503 non-null    object
dtypes: int64(1), object(7)
memory usage: 31.6+ KB


In [5]:
# 1. Create a DataFrame with company tickers, names, and the date they were added.
# .copy() makes df an independent DataFrame, so adding a column does not raise SettingWithCopyWarning.
df = df[['Symbol', 'Security', 'Date added']].copy()
# 2. Extract the year from the addition date (the first four characters of 'YYYY-MM-DD').
df['year'] = df['Date added'].str.slice(0, 4)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Symbol      503 non-null    object
 1   Security    503 non-null    object
 2   Date added  503 non-null    object
 3   year        503 non-null    object
dtypes: object(4)
memory usage: 15.8+ KB



In [6]:
# 3. Which year had the highest number of additions (1957 doesn't count, as it was the year when the S&P 500 index was founded)?
# Write down this year as your answer (the most recent one, if you have several records).

# Group by year, count the additions per year, and show the top 5 because some years share the same count
df.groupby('year')[['Security']].count().sort_values(by=['Security', 'year'], ascending=False)[:5]


year,Security
1957,53
2017,23
2016,23
2019,22
2008,17
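Step 3 asks for the year with the most additions, ignoring 1957 and preferring the most recent year when several tie. A minimal sketch of that selection rule, using the five counts shown in the output above (the `top_addition_year` helper is a hypothetical name introduced here):

```python
import pandas as pd

# Counts copied from the top-5 output above (year -> number of additions).
counts = pd.Series({"1957": 53, "2017": 23, "2016": 23, "2019": 22, "2008": 17})

def top_addition_year(counts, exclude=("1957",)):
    """Return the most recent year with the highest count, ignoring excluded years."""
    eligible = counts.drop(labels=list(exclude), errors="ignore")
    best = eligible[eligible == eligible.max()]
    # Four-digit year strings sort chronologically, so max() picks the latest tie.
    return max(best.index)

answer_year = top_addition_year(counts)
print(answer_year)  # 2017
```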



### Question 2. [Macro] Indexes YTD (as of 1 May 2025)

**How many indexes (out of 10) have better year-to-date returns than the US (S&P 500) as of May 1, 2025?**

Using Yahoo Finance World Indices data, compare the year-to-date (YTD) performance (1 January-1 May 2025) of major stock market indexes for the following countries:
* United States - S&P 500 (^GSPC)
* China - Shanghai Composite (000001.SS)
* Hong Kong - HANG SENG INDEX (^HSI)
* Australia - S&P/ASX 200 (^AXJO)
* India - Nifty 50 (^NSEI)
* Canada - S&P/TSX Composite (^GSPTSE)
* Germany - DAX (^GDAXI)
* United Kingdom - FTSE 100 (^FTSE)
* Japan - Nikkei 225 (^N225)
* Mexico - IPC Mexico (^MXX)
* Brazil - Ibovespa (^BVSP)

*Hint*: use start_date='2025-01-01' and end_date='2025-05-01' when downloading daily data in yfinance

Context:
> [Global Valuations: Who's Cheap, Who's Not?](https://simplywall.st/article/beyond-the-us-global-markets-after-yet-another-tariff-update) article suggests "Other regions may be growing faster than the US and you need to diversify."

Reference: Yahoo Finance World Indices - https://finance.yahoo.com/world-indices/

*Additional*: How many of these indexes have better returns than the S&P 500 over 3, 5, and 10 year periods? Do you see the same trend?
Note: For simplicity, ignore currency conversion effects.
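The YTD return used here is the percentage change from the first to the last close in the window. As a rough sketch of that arithmetic, the example below uses the first and last closes of three indexes from the results table later in this notebook. The `ytd_returns` and `count_outperformers` helpers are hypothetical names, and the two-row layout is a simplification of real daily data.

```python
import pandas as pd

# First and last closes for three indexes, taken from the results table in this notebook.
closes = pd.DataFrame(
    {
        "^GSPC": [5868.549805, 5569.060059],
        "^HSI": [19623.320312, 22119.410156],
        "^N225": [39307.050781, 36045.378906],
    },
    index=["first", "last"],
)

def ytd_returns(close_df):
    """Percentage change from each column's first to last non-missing close."""
    first = close_df.bfill().iloc[0]   # first valid value per column
    last = close_df.ffill().iloc[-1]   # last valid value per column
    return (last / first - 1) * 100

def count_outperformers(returns, benchmark="^GSPC"):
    """Number of indexes whose return is strictly above the benchmark's."""
    return int((returns.drop(benchmark) > returns[benchmark]).sum())

returns = ytd_returns(closes)
n_better = count_outperformers(returns)
print(returns.round(2))
print(n_better)
```

Using `bfill`/`ffill` means a market that was closed on the first or last calendar day of the window still gets a return, based on its own first and last trading days.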

In [7]:
start_date = datetime(2025, 1, 1)
end_date = datetime(2025, 5, 1)
tickers = ["^GSPC", "000001.SS", "^HSI", "^AXJO", "^NSEI", "^GSPTSE", "^GDAXI", "^FTSE", "^N225", "^MXX", "^BVSP"]
first_close_list = []
last_close_list = []
ytd_list = []

In [8]:
for t in tickers:
    ticker_obj = yf.Ticker(t)
    # Daily prices for the YTD window; yfinance treats `end` as exclusive.
    history = ticker_obj.history(start=start_date.strftime('%Y-%m-%d'),
                                 end=end_date.strftime('%Y-%m-%d'),
                                 interval="1d")
    if not history.empty:
        first_close = history['Close'].iloc[0]
        last_close = history['Close'].iloc[-1]
        ytd_return = (last_close - first_close) / first_close * 100
        first_close_list.append(first_close)
        last_close_list.append(last_close)
        ytd_list.append(ytd_return)
    else:
        # No data returned: record NaN so the index is not counted as a 0% return.
        first_close_list.append(np.nan)
        last_close_list.append(np.nan)
        ytd_list.append(np.nan)

    # Pause between requests to reduce the risk of being rate-limited.
    time.sleep(10)


In [9]:
df_world_indices = pd.DataFrame({'Index': tickers,
                                 'FirstDay': first_close_list,
                                 'LastDay': last_close_list,
                                 'YTD': ytd_list})
df_world_indices

,Index,FirstDay,LastDay,YTD
0,^GSPC,5868.549805,5569.060059,-5.103301
1,000001.SS,3262.561035,3279.031006,0.504817
2,^HSI,19623.320312,22119.410156,12.720018
3,^AXJO,8201.200195,8126.200195,-0.9145
4,^NSEI,23742.900391,24334.199219,2.490424
5,^GSPTSE,24898.0,24841.699219,-0.226126
6,^GDAXI,20024.660156,22496.980469,12.346378
7,^FTSE,8260.099609,8494.900391,2.84259
8,^N225,39307.050781,36045.378906,-8.297931
9,^MXX,49765.199219,56259.28125,13.049444


In [10]:
# df_world_indices.sort_values(['YTD'], ascending=False)
sp500_ytd = df_world_indices.loc[df_world_indices['Index']=='^GSPC', 'YTD'].values[0]
total = df_world_indices.loc[df_world_indices['YTD'] > sp500_ytd, 'YTD'].count()
print(f'Total world indices with better returns: {total}')


Total world indices with better returns: 9


### Question 3. [Index] S&P 500 Market Corrections Analysis


**Calculate the median duration (in days) of significant market corrections in the S&P 500 index.**

For this task, define a correction as an event in which the index falls by **more than 5%** from its most recent all-time high.

Steps:
1. Download S&P 500 historical data (1950-present) using yfinance
2. Identify all-time high points (where price exceeds all previous prices)
3. For each pair of consecutive all-time highs, find the minimum price in between
4. Calculate drawdown percentages: (high - low) / high × 100
5. Filter for corrections with at least 5% drawdown
6. Calculate the duration in days for each correction period
7. Determine the 25th, 50th (median), and 75th percentiles for correction durations
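To check the logic of steps 2 to 6 before running it on 75 years of data, it can help to trace it on a tiny series. The sketch below uses hypothetical prices and a hypothetical `find_corrections` helper. It measures duration in calendar days from the all-time high to the low, which is an assumption that matches the durations in the hint list below.

```python
import pandas as pd

def find_corrections(close, threshold_pct=5.0):
    """One row per pair of consecutive all-time highs whose interim drawdown exceeds the threshold."""
    is_ath = close == close.cummax()
    ath_dates = close.index[is_ath]
    rows = []
    for start, end in zip(ath_dates[:-1], ath_dates[1:]):
        window = close.loc[start:end]          # inclusive of both highs
        low_date = window.idxmin()
        drawdown = (close.loc[start] - close.loc[low_date]) / close.loc[start] * 100
        if drawdown > threshold_pct:
            rows.append({
                "ath_date": start,
                "low_date": low_date,
                "drawdown_pct": drawdown,
                "duration_days": (low_date - start).days,
            })
    return pd.DataFrame(rows)

# Hypothetical prices, chosen only to exercise the logic.
dates = pd.date_range("2024-01-01", periods=8, freq="D")
toy_close = pd.Series([100, 104, 98, 94, 101, 106, 103, 107], index=dates, dtype=float)

corrections = find_corrections(toy_close)
print(corrections)
```

In this toy series only the dip from 104 to 94 qualifies (about 9.6% over 2 days). The later dip from 106 to 103 is below the 5% threshold and is dropped.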

*Context:*
> * Investors often wonder about the typical length of market corrections when deciding "when to buy the dip" ([Reddit discussion](https://www.reddit.com/r/investing/comments/1jrqnte/when_are_you_buying_the_dip/?rdt=64135)).

> * [A Wealth of Common Sense - How Often Should You Expect a Stock Market Correction?](https://awealthofcommonsense.com/2022/01/how-often-should-you-expect-a-stock-market-correction/)

*Hint (use this data to compare with your results)*: Here is the list of top 10 largest corrections by drawdown:
* 2007-10-09 to 2009-03-09: 56.8% drawdown over 517 days
* 2000-03-24 to 2002-10-09: 49.1% drawdown over 929 days
* 1973-01-11 to 1974-10-03: 48.2% drawdown over 630 days
* 1968-11-29 to 1970-05-26: 36.1% drawdown over 543 days
* 2020-02-19 to 2020-03-23: 33.9% drawdown over 33 days
* 1987-08-25 to 1987-12-04: 33.5% drawdown over 101 days
* 1961-12-12 to 1962-06-26: 28.0% drawdown over 196 days
* 1980-11-28 to 1982-08-12: 27.1% drawdown over 622 days
* 2022-01-03 to 2022-10-12: 25.4% drawdown over 282 days
* 1966-02-09 to 1966-10-07: 22.2% drawdown over 240 days

In [11]:
# 1. Download S&P 500 historical data (1950-present) using yfinance
sp500_ticker = yf.Ticker('^GSPC')
sp500_history = sp500_ticker.history(start='1950-01-01', interval='1d')

In [12]:
sp500_history.head()

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
1950-01-03 00:00:00-05:00,16.66,16.66,16.66,16.66,1260000,0.0,0.0
1950-01-04 00:00:00-05:00,16.85,16.85,16.85,16.85,1890000,0.0,0.0
1950-01-05 00:00:00-05:00,16.93,16.93,16.93,16.93,2550000,0.0,0.0
1950-01-06 00:00:00-05:00,16.98,16.98,16.98,16.98,2010000,0.0,0.0
1950-01-09 00:00:00-05:00,17.08,17.08,17.08,17.08,2520000,0.0,0.0


In [13]:
# Use plain 'YYYY-MM-DD' strings as the index so date ranges can be sliced by label.
sp500_history.index = sp500_history.index.strftime('%Y-%m-%d')
# 2. Identify all-time high points (where price exceeds all previous prices)
# The cumulative max of Close is the running all-time high.
sp500_history['All Time High'] = sp500_history['Close'].cummax()
# Keep only the days on which the close equals the running all-time high.
sp500_ath = sp500_history[sp500_history['Close'] == sp500_history['All Time High']]

In [14]:
sp500_ath.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1474 entries, 1950-01-03 to 2025-02-19
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Open           1474 non-null   float64
 1   High           1474 non-null   float64
 2   Low            1474 non-null   float64
 3   Close          1474 non-null   float64
 4   Volume         1474 non-null   int64  
 5   Dividends      1474 non-null   float64
 6   Stock Splits   1474 non-null   float64
 7   All Time High  1474 non-null   float64
dtypes: float64(7), int64(1)
memory usage: 103.6+ KB


In [15]:
sp500_history.head()

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,All Time High
1950-01-03,16.66,16.66,16.66,16.66,1260000,0.0,0.0,16.66
1950-01-04,16.85,16.85,16.85,16.85,1890000,0.0,0.0,16.85
1950-01-05,16.93,16.93,16.93,16.93,2550000,0.0,0.0,16.93
1950-01-06,16.98,16.98,16.98,16.98,2010000,0.0,0.0,16.98
1950-01-09,17.08,17.08,17.08,17.08,2520000,0.0,0.0,17.08


In [16]:
ath_dates = sp500_ath.index.to_list()
ath_highs = sp500_ath['All Time High'].to_list()
ath_periods = []

# 3. For each pair of consecutive all-time highs, find the minimum price in between
for i in range(len(ath_dates) - 1):
    start = ath_dates[i]
    end = ath_dates[i + 1]

    # Label-based slicing is inclusive, so the window contains both all-time highs.
    subset = sp500_history.loc[start:end]
    if len(subset) <= 1:
        continue
    min_index = subset['Close'].idxmin()
    min_price = subset.loc[min_index, 'Close']

    ath_periods.append({'ath_start': start,
                        'ath_end': end,
                        'prev_ath': ath_highs[i],
                        'min_date': min_index,
                        'min_price': min_price})

sp500_corr = pd.DataFrame(ath_periods)

In [17]:
sp500_corr.tail()

,ath_start,ath_end,prev_ath,min_date,min_price
1468,2024-12-03,2024-12-04,6049.879883,2024-12-03,6049.879883
1469,2024-12-04,2024-12-06,6086.490234,2024-12-05,6075.109863
1470,2024-12-06,2025-01-23,6090.27002,2025-01-10,5827.040039
1471,2025-01-23,2025-02-18,6118.709961,2025-02-03,5994.569824
1472,2025-02-18,2025-02-19,6129.580078,2025-02-18,6129.580078


In [18]:
# 4. Calculate drawdown percentages: (high - low) / high × 100
sp500_corr['drawdown_pct'] = (sp500_corr['prev_ath'] - sp500_corr['min_price']) / sp500_corr['prev_ath'] * 100
# 5. Keep corrections with a drawdown of more than 5%
# .copy() lets the next cell add columns without raising SettingWithCopyWarning.
sp500_corr_filtered = sp500_corr[sp500_corr['drawdown_pct'] > 5].copy()

In [19]:
# 6. Calculate the duration in days for each correction period
sp500_corr_filtered['ath_start'] = pd.to_datetime(sp500_corr_filtered['ath_start'])
sp500_corr_filtered['min_date'] = pd.to_datetime(sp500_corr_filtered['min_date'])
sp500_corr_filtered['duration'] = (sp500_corr_filtered['min_date'] - sp500_corr_filtered['ath_start']) / np.timedelta64(1, 'D')


In [20]:
sp500_corr_filtered

,ath_start,ath_end,prev_ath,min_date,min_price,drawdown_pct,duration
40,1950-06-12,1950-09-22,19.400000,1950-07-17,16.680000,14.020615,35.0
47,1950-11-24,1950-12-28,20.320000,1950-12-04,19.000000,6.496062,10.0
68,1951-05-03,1951-08-02,22.809999,1951-06-29,20.959999,8.110480,57.0
83,1951-10-15,1952-01-03,23.850000,1951-11-23,22.400000,6.079668,39.0
91,1952-01-22,1952-06-25,24.660000,1952-02-20,23.090000,6.366584,29.0
...,...,...,...,...,...,...,...
1331,2020-09-02,2020-11-13,3580.840088,2020-09-23,3236.919922,9.604455,21.0
1396,2021-09-02,2021-10-21,4536.950195,2021-10-04,4300.459961,5.212538,32.0
1413,2022-01-03,2024-01-19,4796.560059,2022-10-12,3577.030029,25.425097,282.0
1435,2024-03-28,2024-05-15,5254.350098,2024-04-19,4967.229980,5.464427,22.0


In [21]:
# 7. Determine the 25th, 50th (median), and 75th percentiles for correction durations
sp500_corr_filtered['duration'].describe()

,duration
count,71.0
mean,113.098592
std,179.073341
min,7.0
25%,21.5
50%,39.0
75%,89.0
max,929.0


### Question 4. [Stocks] Earnings Surprise Analysis for Amazon (AMZN)


**Calculate the median 2-day percentage change in stock price following positive earnings surprises (actual EPS > estimated EPS).**

Steps:
1. Load earnings data from CSV ([ha1_Amazon.csv](ha1_Amazon.csv)) containing earnings dates, EPS estimates, and actual EPS. Make sure you use the correct delimiter when reading the data, for example `pandas.read_csv("ha1_Amazon.csv", delimiter=';')`.
2. Download complete historical price data using yfinance
3. Calculate 2-day percentage changes for all historical dates: for each sequence of 3 consecutive trading days (Day 1, Day 2, Day 3), compute the *return* as Close_Day3 / Close_Day1 - 1. (Assume Day 2 may correspond to the earnings announcement.)
4. Identify positive earnings surprises (where "actual EPS > estimated EPS"). Both fields should be present in the file. You should obtain 36 data points for use in the descriptive analysis (median) later.
5. Calculate 2-day percentage changes following positive earnings surprises. Report the median as a percentage, rounded to two decimal places: *return* * 100.0
6. (Optional) Compare the median 2-day percentage change for positive surprises vs. all historical dates. Do you see the difference?
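Step 3 can be expressed with two `shift` calls. The sketch below uses hypothetical closes and assumes each row is treated as Day 2, so Day 1 is the previous trading day and Day 3 the next one. Other alignments are possible, and the `ret_2d` column name is introduced here only for illustration.

```python
import pandas as pd

# Hypothetical closes for five consecutive trading days.
toy = pd.DataFrame(
    {"Close": [100.0, 102.0, 105.0, 103.0, 108.15]},
    index=pd.to_datetime(
        ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-08"]
    ),
)

# Row t is treated as Day 2: shift(1) is Day 1 (previous row), shift(-1) is Day 3 (next row).
toy["ret_2d"] = toy["Close"].shift(-1) / toy["Close"].shift(1) - 1
print(toy)
```

The first and last rows have no neighbour on one side, so their `ret_2d` is NaN. For 2024-01-03 the value is 105 / 100 - 1 = 0.05, that is 5%.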

Context: Earnings announcements, especially when they exceed analyst expectations, can significantly impact stock prices in the short term.

Reference: Yahoo Finance earnings calendar - https://finance.yahoo.com/calendar/earnings?symbol=AMZN

*Additional*: Is there a correlation between the magnitude of the earnings surprise and the stock price reaction? Does the market react differently to earnings surprises during bull vs. bear markets?


In [22]:
# 1. Load earnings data from CSV (ha1_Amazon.csv) containing earnings dates, EPS estimates, and actual EPS.
# The file is semicolon-delimited, so pass delimiter=';' to pandas.read_csv.
amzn_file = 'https://raw.githubusercontent.com/DataTalksClub/stock-markets-analytics-zoomcamp/refs/heads/main/cohorts/2025/ha1_Amazon.csv'
df_amzn = pd.read_csv(amzn_file, delimiter=';')

In [24]:
df_amzn.head(10)

,Symbol,Company,Earnings Date,EPS Estimate,Reported EPS,Surprise (%)
0,AMZN,Amazon.com Inc,"April 29, 2026 at 6 AM EDT",-,-,-
1,AMZN,Amazon.com Inc,"February 4, 2026 at 4 PM EST",-,-,-
2,AMZN,Amazon.com Inc,"October 29, 2025 at 6 AM EDT",-,-,-
3,AMZN,Amazon.com Inc,"July 30, 2025 at 4 PM EDT",-,-,-
4,AMZN,"Amazon.com, Inc.","May 1, 2025 at 4 PM EDT",???.36,???.59,+16.74
5,AMZN,"Amazon.com, Inc.","February 6, 2025 at 4 PM EST",???.49,???.86,+24.47
6,AMZN,"Amazon.com, Inc.","October 31, 2024 at 4 PM EDT",???.14,???.43,+25.17
7,AMZN,"Amazon.com, Inc.","August 1, 2024 at 4 PM EDT",01.???,???.26,+22.58
8,AMZN,"Amazon.com, Inc.","April 30, 2024 at 4 PM EDT",0.83,0.98,+17.91
9,AMZN,"Amazon.com, Inc.","February 1, 2024 at 4 PM EST",0.8,1,+24.55


In [None]:
# Download complete historical price data using yfinance
amzn_ticker = yf.Ticker('AMZN')
amzn_history = amzn_ticker.history(start='1950-01-01', interval='1d')
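The notebook stops after the price download. One possible way to continue with step 4 is sketched below on a few rows copied from `df_amzn.head(10)`: strip the time-of-day suffix from `Earnings Date`, coerce the EPS fields to numbers, and keep the rows where reported EPS exceeds the estimate. The `positive_surprises` helper and the `date` column are hypothetical names. Treating placeholder values such as `-` or `???.36` as missing is an assumption about how the file should be cleaned.

```python
import pandas as pd

# A few rows copied from df_amzn.head(10) above; EPS fields are kept as raw strings.
earnings = pd.DataFrame({
    "Earnings Date": [
        "July 30, 2025 at 4 PM EDT",
        "May 1, 2025 at 4 PM EDT",
        "April 30, 2024 at 4 PM EDT",
        "February 1, 2024 at 4 PM EST",
    ],
    "EPS Estimate": ["-", "???.36", "0.83", "0.8"],
    "Reported EPS": ["-", "???.59", "0.98", "1"],
})

def positive_surprises(df):
    """Keep rows where both EPS fields parse as numbers and reported > estimate."""
    out = df.copy()
    # Drop the time-of-day suffix and parse the calendar date only.
    out["date"] = pd.to_datetime(
        out["Earnings Date"].str.split(" at ").str[0], format="%B %d, %Y"
    )
    # Placeholders such as "-" or "???.36" become NaN instead of raising.
    for col in ["EPS Estimate", "Reported EPS"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    out = out.dropna(subset=["EPS Estimate", "Reported EPS"])
    return out[out["Reported EPS"] > out["EPS Estimate"]]

surprises = positive_surprises(earnings)
print(surprises[["date", "EPS Estimate", "Reported EPS"]])
```

To look up 2-day returns for these dates, the price history index would need to be comparable with `date`, for example by removing its timezone and time component before joining.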

### Question 5. [Exploratory, optional] Brainstorm a potential idea for your capstone project

**Free text answer**

Describe the capstone project you would like to pursue, considering your aspirations, ML model predictions, and prior knowledge. Even if you are unsure at this stage, try to generate an idea you would like to explore, such as a specific asset class, country, industry vertical, or investment strategy. Be as specific as possible.

*Example: I want to build a short-term prediction model for the US/India/Brazil stock markets, focusing on the largest stocks over a 30-day investment horizon. I plan to use RSI and MACD technical indicators and news coverage data to generate predictions.*

**Answer**:

I would like to compare the stock performance of Nvidia and AMD, given how similar their businesses are. Nvidia stock is currently expensive, so I wonder whether AMD has the potential to grow like Nvidia and is therefore worth buying at its current price.

### Question 6. [Exploratory, optional] Investigate new metrics

**Free text answer**

Using the data sources we have covered (or any others you find relevant), download and explore a few additional metrics or time series that could be valuable for your project. Briefly explain why you think each metric is useful. This does not need to be a comprehensive list; focus on demonstrating your ability to generate data requests based on your project description, identify and locate the necessary data, and explain how you would retrieve it using Python.

## Submitting the solutions

Form for submitting: https://courses.datatalks.club/sma-zoomcamp-2025/homework/hw01