# Question 1: [Index] S&P 500 Stocks Added to the Index

**Which year had the highest number of additions?**

Using the list of S&P 500 companies from Wikipedia's [S&P 500 companies page](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies), download the data including the year each company was added to the index.

Hint: you can use [pandas.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) to scrape the data into a DataFrame.

Steps:

1. Create a DataFrame with company tickers, names, and the year they were added.
2. Extract the year from the addition date and calculate the number of stocks added each year.
3. Which year had the highest number of additions (1957 doesn't count, as it was the year when the S&P 500 index was founded)? Write down this year as your answer (the most recent one, if you have several records).

*Context*:

"Following the announcement, all four new entrants saw their stock prices rise in extended trading on Friday." Recent examples of S&P 500 additions include DASH, WSM, EXE, and TKO in 2025 ([Nasdaq article](https://www.nasdaq.com/articles/sp-500-reshuffle-dash-tko-expe-wsm-join-worth-buying)).

*Additional*: How many current S&P 500 stocks have been in the index for more than 20 years? When stocks are added to the S&P 500, they usually experience a price bump as investors and index funds buy shares following the announcement.
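The *Additional* question can be approached with a small helper. This is a minimal sketch, assuming a DataFrame with a `Date added` column like the Wikipedia table. The function name `count_long_term_members`, the `as_of` parameter, and the five sample rows are illustrative assumptions, not part of the original solution.

```python
import pandas as pd


def count_long_term_members(df, as_of, years=20):
    """Count rows whose 'Date added' is more than `years` years before `as_of`."""
    added = pd.to_datetime(df["Date added"], errors="coerce")
    cutoff = pd.Timestamp(as_of) - pd.DateOffset(years=years)
    # Comparisons with NaT evaluate to False, so unparseable dates are not counted
    return int((added < cutoff).sum())


# Hypothetical sample: five rows in the shape of the Wikipedia table
sample = pd.DataFrame(
    {
        "Symbol": ["MMM", "AOS", "ABT", "ABBV", "ACN"],
        "Date added": ["1957-03-04", "2017-07-26", "1957-03-04", "2012-12-31", "2011-07-06"],
    }
)

long_term_count = count_long_term_members(sample, as_of="2025-05-01")
print(long_term_count)
```

On the full table, the same call would take the scraped DataFrame in place of `sample`; the reference date is a free choice.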

In [None]:
# Import libraries
from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup
from IPython.display import display

# Define the URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

# Define headers with a user-agent to mimic a web browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}

# Send a GET request to the URL with headers
response = requests.get(url, headers=headers)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the webpage
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the table tag containing the list of companies (based on the class name)
    table = soup.find("table", {"class": "wikitable sortable sticky-header"})

    # Read the table into a DataFrame; wrapping the HTML string in StringIO avoids
    # the pandas deprecation warning for passing literal HTML to read_html
    df = pd.read_html(StringIO(str(table)))[0]  # soup.find returns a single table, so take the first result

    # Select relevant columns: Symbol, Security, and Date added
    # (.copy() creates an independent DataFrame and avoids the SettingWithCopyWarning)
    sp500_df = df[['Symbol', 'Security', 'Date added']].copy()

    # Extract the year from the 'Date added' column
    sp500_df['Year Added'] = pd.to_datetime(sp500_df['Date added']).dt.year

    # Display the DataFrame
    print("DataFrame with S&P 500 companies and year added:")
    display(sp500_df.head())

    # Calculate the number of stocks added each year, excluding 1957
    additions_per_year = sp500_df[sp500_df['Year Added'] != 1957]['Year Added'].value_counts().sort_index()

    # Display the DataFrame
    print("\nNumber of stocks added each year (excluding 1957):")
    display(additions_per_year.sort_values(ascending=False))

    # Find the year with the highest number of additions; if several years tie,
    # take the most recent one, as the task requires (idxmax alone returns the first match)
    max_additions = additions_per_year.max()
    year_with_most_additions = additions_per_year[additions_per_year == max_additions].index.max()
    print(f"\nYear with the highest number of additions (excluding 1957): {year_with_most_additions}")

else:
    print("Failed to retrieve data from the Wikipedia page. Status code:", response.status_code)

DataFrame with S&P 500 companies and year added:

,Symbol,Security,Date added,Year Added
0,MMM,3M,1957-03-04,1957
1,AOS,A. O. Smith,2017-07-26,2017
2,ABT,Abbott Laboratories,1957-03-04,1957
3,ABBV,AbbVie,2012-12-31,2012
4,ACN,Accenture,2011-07-06,2011


Number of stocks added each year (excluding 1957):

Year Added,count
2016,23
2017,23
2019,22
2008,17
2024,16
2022,16
2023,15
2021,15
2015,14
2018,14


Year with the highest number of additions (excluding 1957): 2017


# Question 2. [Macro] Indexes YTD (as of 1 May 2025)

**How many indexes (out of 10) have better year-to-date returns than the US (S&P 500) as of May 1, 2025?**

Using Yahoo Finance World Indices data, compare the year-to-date (YTD) performance (1 January-1 May 2025) of major stock market indexes for the following countries:

* United States - S&P 500 (^GSPC)
* China - Shanghai Composite (000001.SS)
* Hong Kong - HANG SENG INDEX (^HSI)
* Australia - S&P/ASX 200 (^AXJO)
* India - Nifty 50 (^NSEI)
* Canada - S&P/TSX Composite (^GSPTSE)
* Germany - DAX (^GDAXI)
* United Kingdom - FTSE 100 (^FTSE)
* Japan - Nikkei 225 (^N225)
* Mexico - IPC Mexico (^MXX)
* Brazil - Ibovespa (^BVSP)

*Hint*: use start_date='2025-01-01' and end_date='2025-05-01' when downloading daily data in yfinance

Context:

 [Global Valuations: Who's Cheap, Who's Not?](https://simplywall.st/article/beyond-the-us-global-markets-after-yet-another-tariff-update) article suggests "Other regions may be growing faster than the US and you need to diversify."

Reference: Yahoo Finance World Indices - https://finance.yahoo.com/world-indices/

Additional: How many of these indexes have better returns than the S&P 500 over 3, 5, and 10 year periods? Do you see the same trend? (Note: For simplicity, ignore currency conversion effects.)
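One way to approach the multi-period comparison is to compute trailing returns from a table of closing prices and count the columns that beat the benchmark. The sketch below is illustrative only: `period_return`, `count_outperformers`, and the synthetic monthly prices are assumptions, and it expects a timezone-naive `DatetimeIndex` (prices downloaded with yfinance would need their timezone removed first).

```python
import pandas as pd


def period_return(close, years, as_of):
    """Percentage return of a price series over the trailing `years` years."""
    end = pd.Timestamp(as_of)
    start = end - pd.DateOffset(years=years)
    window = close.loc[start:end]
    return (window.iloc[-1] / window.iloc[0] - 1) * 100


def count_outperformers(prices, benchmark, years, as_of):
    """Count columns whose trailing return is higher than the benchmark column's."""
    returns = {name: period_return(prices[name], years, as_of) for name in prices.columns}
    return sum(
        1 for name, value in returns.items()
        if name != benchmark and value > returns[benchmark]
    )


# Synthetic month-start prices, for illustration only (not market data)
dates = pd.date_range("2015-05-01", "2025-05-01", freq="MS")
n = len(dates)
prices = pd.DataFrame(
    {
        "S&P 500": [100.0 + i for i in range(n)],      # steady growth
        "Index A": [100.0 + 2 * i for i in range(n)],  # grows faster
        "Index B": [100.0] * n,                        # flat
    },
    index=dates,
)

counts = {
    years: count_outperformers(prices, "S&P 500", years, "2025-05-01")
    for years in (3, 5, 10)
}
print(counts)
```

Only `Index A` beats the benchmark in this toy data, so each horizon reports one outperformer.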

In [None]:
# Import library
import yfinance as yf

# Create ticker objects for each index
sp500_ticker = yf.Ticker("^GSPC")
shanghai_composite_ticker = yf.Ticker("000001.SS")
hang_seng_ticker = yf.Ticker("^HSI")
asx200_ticker = yf.Ticker("^AXJO")
nifty50_ticker = yf.Ticker("^NSEI")
tsx_composite_ticker = yf.Ticker("^GSPTSE")
dax_ticker = yf.Ticker("^GDAXI")
ftse100_ticker = yf.Ticker("^FTSE")
nikkei225_ticker = yf.Ticker("^N225")
ipc_mexico_ticker = yf.Ticker("^MXX")
ibovespa_ticker = yf.Ticker("^BVSP")

# Define the start and end dates for YTD calculation
start_date = "2025-01-01"
end_date = "2025-05-01"

# Dictionary to store YTD returns
ytd_returns = {}

# List of tickers and their names
tickers = {
    "^GSPC": "S&P 500",
    "000001.SS": "Shanghai Composite",
    "^HSI": "HANG SENG INDEX",
    "^AXJO": "S&P/ASX 200",
    "^NSEI": "Nifty 50",
    "^GSPTSE": "S&P/TSX Composite",
    "^GDAXI": "DAX",
    "^FTSE": "FTSE 100",
    "^N225": "Nikkei 225",
    "^MXX": "IPC Mexico",
    "^BVSP": "Ibovespa",
}

# Calculate YTD return for each index
for ticker_symbol, index_name in tickers.items():
    try:
        ticker_data = yf.Ticker(ticker_symbol)
        history = ticker_data.history(start=start_date, end=end_date)  # Retrieve daily prices (yfinance treats `end` as exclusive)

        if not history.empty:
            # Calculate the YTD percentage change from the first available Open price to the last available Close price
            ytd_return = ((history['Close'].iloc[-1] - history['Open'].iloc[0]) / history['Open'].iloc[0]) * 100
            ytd_returns[index_name] = ytd_return # Store the data
            print(f"{index_name} YTD Return: {ytd_return:.2f}%")
        else:
            print(f"Could not retrieve data for {index_name} ({ticker_symbol})")
            ytd_returns[index_name] = None # Store None if data is not available

    except Exception as e:
        print(f"An error occurred while processing {index_name} ({ticker_symbol}): {e}")
        ytd_returns[index_name] = None

# Get S&P 500 YTD return and handle cases where the key might not exist
sp500_ytd = ytd_returns.get("S&P 500", None)

if sp500_ytd is not None:
    # Count indexes with better YTD returns than S&P 500
    better_than_sp500_count = 0
    better_than_sp500_list = []

    # Exclude S&P 500 from the comparison count
    for index_name, ytd_return in ytd_returns.items():
        if index_name != "S&P 500" and ytd_return is not None and ytd_return > sp500_ytd:
            better_than_sp500_count += 1
            better_than_sp500_list.append(index_name)

    print(f"\nS&P 500 YTD Return: {sp500_ytd:.2f}%")
    print(f"Number of indexes with better YTD returns than S&P 500: {better_than_sp500_count}")
    if better_than_sp500_count > 0:
        print("Indexes with better YTD returns:", ", ".join(better_than_sp500_list))
else:
    print("\nCould not determine S&P 500 YTD return for comparison.")


S&P 500 YTD Return: -5.66%
Shanghai Composite YTD Return: -2.06%
HANG SENG INDEX YTD Return: 10.97%
S&P/ASX 200 YTD Return: -0.40%
Nifty 50 YTD Return: 2.95%
S&P/TSX Composite YTD Return: 0.08%
DAX YTD Return: 12.92%
FTSE 100 YTD Return: 3.94%
Nikkei 225 YTD Return: -9.76%
IPC Mexico YTD Return: 13.41%
Ibovespa YTD Return: 12.29%

S&P 500 YTD Return: -5.66%
Number of indexes with better YTD returns than S&P 500: 9
Indexes with better YTD returns: Shanghai Composite, HANG SENG INDEX, S&P/ASX 200, Nifty 50, S&P/TSX Composite, DAX, FTSE 100, IPC Mexico, Ibovespa


# Question 3. [Index] S&P 500 Market Corrections Analysis

**Calculate the median duration (in days) of significant market corrections in the S&P 500 index.**

For this task, define a correction as an event in which the index falls by **more than 5%** from its most recent all-time high.

Steps:

1. Download S&P 500 historical data (1950-present) using yfinance
2. Identify all-time high points (where price exceeds all previous prices)
3. For each pair of consecutive all-time highs, find the minimum price in between
4. Calculate drawdown percentages: (high - low) / high × 100
5. Filter for corrections with a drawdown of more than 5%
6. Calculate the duration in days for each correction period
7. Determine the 25th, 50th (median), and 75th percentiles for correction durations
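The steps above can be tried on a tiny hand-made series before running them on 75 years of data. This is a minimal sketch: the function name `find_corrections` and the eight toy prices are assumptions for illustration, and the index is timezone-naive so that durations are whole calendar days.

```python
import pandas as pd


def find_corrections(close, threshold_pct=5.0):
    """Return peak-to-trough corrections between consecutive all-time highs."""
    is_ath = close == close.cummax()
    ath_dates = close.index[is_ath]
    rows = []
    for start, end in zip(ath_dates[:-1], ath_dates[1:]):
        between = close.loc[start:end].iloc[1:]  # exclude the starting high itself
        trough_date = between.idxmin()
        drawdown = (close.loc[start] - between.min()) / close.loc[start] * 100
        if drawdown > threshold_pct:
            rows.append(
                {
                    "start": start,
                    "trough": trough_date,
                    "drawdown_pct": drawdown,
                    "duration_days": (trough_date - start).days,
                }
            )
    return pd.DataFrame(rows)


# Toy prices: all-time highs at 100, 101, 102 and 103, with a dip to 90 after 102
toy_close = pd.Series(
    [100.0, 101.0, 99.0, 102.0, 95.0, 90.0, 97.0, 103.0],
    index=pd.date_range("2024-01-01", periods=8, freq="D"),
)

corrections_df = find_corrections(toy_close)
print(corrections_df)
print(corrections_df["duration_days"].quantile([0.25, 0.5, 0.75]))
```

Only the dip from 102 to 90 exceeds the 5% threshold here; the dip from 101 to 99 is about 2% and is filtered out.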

*Context:*

* Investors often wonder about the typical length of market corrections when deciding "when to buy the dip" ([Reddit discussion](https://www.reddit.com/r/investing/comments/1jrqnte/when_are_you_buying_the_dip/?rdt=64135)).
* [A Wealth of Common Sense - How Often Should You Expect a Stock Market Correction?](https://awealthofcommonsense.com/2022/01/how-often-should-you-expect-a-stock-market-correction/)

*Hint (use this data to compare with your results)*: Here is the list of top 10 largest corrections by drawdown:

* 2007-10-09 to 2009-03-09: 56.8% drawdown over 517 days
* 2000-03-24 to 2002-10-09: 49.1% drawdown over 929 days
* 1973-01-11 to 1974-10-03: 48.2% drawdown over 630 days
* 1968-11-29 to 1970-05-26: 36.1% drawdown over 543 days
* 2020-02-19 to 2020-03-23: 33.9% drawdown over 33 days
* 1987-08-25 to 1987-12-04: 33.5% drawdown over 101 days
* 1961-12-12 to 1962-06-26: 28.0% drawdown over 196 days
* 1980-11-28 to 1982-08-12: 27.1% drawdown over 622 days
* 2022-01-03 to 2022-10-12: 25.4% drawdown over 282 days
* 1966-02-09 to 1966-10-07: 22.2% drawdown over 240 days

In [None]:
# E.g. the trough between 1950-01-11 and 1950-02-02 is 16.67 and 15 days, although it is below the threshold: (17.09 - 16.67) / 17.09 * 100 ≈ 2.46%
import yfinance as yf
import pandas as pd

sp500_ticker = yf.Ticker("^GSPC")
sp500_data = sp500_ticker.history(start="1950-01-01")

sp500_data = sp500_data.sort_index()

sp500_data['All_Time_High'] = sp500_data['Close'].cummax() # Pandas method to calculate maximum value up to a point in time, including the current row
sp500_data['Is_ATH'] = sp500_data['Close'] == sp500_data['All_Time_High']

sp500_data.head(10)

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,All_Time_High,Is_ATH
1950-01-03 00:00:00-05:00,16.66,16.66,16.66,16.66,1260000,0.0,0.0,16.66,True
1950-01-04 00:00:00-05:00,16.85,16.85,16.85,16.85,1890000,0.0,0.0,16.85,True
1950-01-05 00:00:00-05:00,16.93,16.93,16.93,16.93,2550000,0.0,0.0,16.93,True
1950-01-06 00:00:00-05:00,16.98,16.98,16.98,16.98,2010000,0.0,0.0,16.98,True
1950-01-09 00:00:00-05:00,17.08,17.08,17.08,17.08,2520000,0.0,0.0,17.08,True
1950-01-10 00:00:00-05:00,17.030001,17.030001,17.030001,17.030001,2160000,0.0,0.0,17.08,False
1950-01-11 00:00:00-05:00,17.09,17.09,17.09,17.09,2630000,0.0,0.0,17.09,True
1950-01-12 00:00:00-05:00,16.76,16.76,16.76,16.76,2970000,0.0,0.0,17.09,False
1950-01-13 00:00:00-05:00,16.67,16.67,16.67,16.67,3330000,0.0,0.0,17.09,False
1950-01-16 00:00:00-05:00,16.719999,16.719999,16.719999,16.719999,1460000,0.0,0.0,17.09,False


In [None]:
# E.g. 1950-01-10 is not an all-time high, so it does not appear in the list
ath_dates_test = sp500_data[sp500_data['Is_ATH']].index.tolist()
ath_dates_test[:10]

[Timestamp('1950-01-03 00:00:00-0500', tz='America/New_York'),
 Timestamp('1950-01-04 00:00:00-0500', tz='America/New_York'),
 Timestamp('1950-01-05 00:00:00-0500', tz='America/New_York'),
 Timestamp('1950-01-06 00:00:00-0500', tz='America/New_York'),
 Timestamp('1950-01-09 00:00:00-0500', tz='America/New_York'),
 Timestamp('1950-01-11 00:00:00-0500', tz='America/New_York'),
 Timestamp('1950-02-02 00:00:00-0500', tz='America/New_York'),
 Timestamp('1950-02-03 00:00:00-0500', tz='America/New_York'),
 Timestamp('1950-02-06 00:00:00-0500', tz='America/New_York'),
 Timestamp('1950-03-06 00:00:00-0500', tz='America/New_York')]

In [None]:
# Test hint: 2020-02-19 to 2020-03-23: 33.9% drawdown over 33 days
start_date_test = "2020-02-19" # 3386.149902
end_date_test = "2020-03-23" # 2237.399902

date_range_test = sp500_data.loc[start_date_test:end_date_test]

peak_price_test = date_range_test['Close'].max()  # highest close in the range (a price, not a date)
trough_price_test = date_range_test['Close'].min()  # lowest close in the range

(peak_price_test - trough_price_test) / peak_price_test * 100

33.92496000265327
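The hint lists 33 days for this correction. When a duration is computed by subtracting timezone-aware timestamps, a daylight-saving change inside the interval makes the elapsed time an hour shorter or longer than a whole number of days, which can shift `.days` by one. The short check below illustrates this with the two dates from the hint; the variable names are illustrative.

```python
import pandas as pd

# Midnight timestamps in the exchange timezone, as returned by yfinance
start = pd.Timestamp("2020-02-19", tz="America/New_York")  # EST (UTC-5)
end = pd.Timestamp("2020-03-23", tz="America/New_York")    # EDT (UTC-4), after the 8 March clock change

# Elapsed time is 33 days minus 1 hour, so .days is floored to 32
aware_days = (end - start).days

# Dropping the timezone keeps the wall-clock dates and gives the calendar-day count
naive_days = (end.tz_localize(None) - start.tz_localize(None)).days

print(aware_days, naive_days)
```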

In [None]:
# Import libraries
import yfinance as yf
import pandas as pd

# Download S&P 500 historical data
sp500_ticker = yf.Ticker("^GSPC")
sp500_data = sp500_ticker.history(start="1950-01-01")

# Ensure the data is sorted by date
sp500_data = sp500_data.sort_index()

# Identify all-time highs and the dates they occurred
sp500_data['All_Time_High'] = sp500_data['Close'].cummax() # Pandas method to calculate maximum value up to a point in time, including the current row
sp500_data['Is_ATH'] = sp500_data['Close'] == sp500_data['All_Time_High']

# Find the dates of all-time highs
ath_dates = sp500_data[sp500_data['Is_ATH']].index.tolist()

corrections = []

# Iterate through the all-time high dates to find corrections between them
for i in range(len(ath_dates) - 1):
    start_date = ath_dates[i]
    end_date_of_period = ath_dates[i+1]

    # Get the data for the period between two consecutive all-time highs
    period_data = sp500_data.loc[start_date:end_date_of_period].copy()

    # Find the minimum price and its date in this period (excluding the start ATH date)
    if len(period_data) > 1: # Ensure there's data after the starting ATH
        # Find the lowest price point between two consecutive all-time high dates (the "trough" of a potential correction)
        trough_data = period_data.iloc[1:].loc[period_data.iloc[1:]['Close'].idxmin()] # Pandas method to find the index with the minimum value
        trough_date = trough_data.name # The date of the trough
        trough_price = trough_data['Close'] # the 'Close' price from the trough_data

        # Get the price at the start all-time high
        ath_price_at_start = sp500_data.loc[start_date, 'Close']

        # Calculate the drawdown from the start ATH to the trough
        drawdown_percentage = ((ath_price_at_start - trough_price) / ath_price_at_start) * 100

        # If the drawdown is greater than 5%, consider it a significant correction
        if drawdown_percentage > 5:
            # Drop the timezone before subtracting: with tz-aware timestamps, a daylight-saving
            # change inside the period leaves the difference an hour short of (or over) a whole
            # number of days, which can shift .days by one
            duration = (trough_date.tz_localize(None) - start_date.tz_localize(None)).days
            corrections.append({
                'Start Date': start_date,
                'End Date': trough_date, # End date is the trough date
                'Duration': duration,
                'Max Drawdown': drawdown_percentage
            })

# Sort corrections by maximum drawdown, largest first
corrections.sort(key=lambda x: x['Max Drawdown'], reverse=True)

# Print the corrections
print("Significant Market Corrections (Drawdown > 5% - Peak to Trough):")
for correction in corrections:
    start_date_str = correction['Start Date'].strftime('%Y-%m-%d')
    end_date_str = correction['End Date'].strftime('%Y-%m-%d')
    drawdown_percent = correction['Max Drawdown']
    duration_days = correction['Duration']
    print(f"{start_date_str} to {end_date_str}: {drawdown_percent:.1f}% drawdown over {duration_days} days")

# Calculate percentiles of correction durations
if corrections:
    durations = [c['Duration'] for c in corrections]
    percentiles = pd.Series(durations).quantile([0.25, 0.5, 0.75])
    print("\nCorrection Duration Percentiles (in days - Peak to Trough):")
    print(f"25th Percentile: {percentiles[0.25]:.2f}")
    print(f"Median (50th Percentile): {percentiles[0.5]:.2f}")
    print(f"75th Percentile: {percentiles[0.75]:.2f}")
else:
    print("No significant market corrections found (drawdown > 5%).")

Significant Market Corrections (Drawdown > 5% - Peak to Trough):
2007-10-09 to 2009-03-09: 56.8% drawdown over 517 days
2000-03-24 to 2002-10-09: 49.1% drawdown over 929 days
1973-01-11 to 1974-10-03: 48.2% drawdown over 630 days
1968-11-29 to 1970-05-26: 36.1% drawdown over 543 days
2020-02-19 to 2020-03-23: 33.9% drawdown over 33 days
1987-08-25 to 1987-12-04: 33.5% drawdown over 101 days
1961-12-12 to 1962-06-26: 28.0% drawdown over 196 days
1980-11-28 to 1982-08-12: 27.1% drawdown over 622 days
2022-01-03 to 2022-10-12: 25.4% drawdown over 282 days
1966-02-09 to 1966-10-07: 22.2% drawdown over 240 days
1956-08-03 to 1957-10-22: 21.5% drawdown over 445 days
1990-07-16 to 1990-10-11: 19.9% drawdown over 87 days
2018-09-20 to 2018-12-24: 19.8% drawdown over 95 days
1998-07-17 to 1998-08-31: 19.3% drawdown over 45 days
1953-01-05 to 1953-09-14: 14.8% drawdown over 252 days
1983-10-10 to 1984-07-24: 14.4% drawdown over 288 days
2015-05-21 to 2016-02-11: 14.2% drawdown over 266 days
1950

# Question 4. [Stocks] Earnings Surprise Analysis for Amazon (AMZN)

**Calculate the median 2-day percentage change in stock prices following positive earnings surprises.**

Steps:

1. Load earnings data from CSV ([ha1_Amazon.csv](https://github.com/DataTalksClub/stock-markets-analytics-zoomcamp/blob/main/cohorts/2025/ha1_Amazon.csv)) containing earnings dates, EPS estimates, and actual EPS (Reported EPS)
2. Download complete historical price data using yfinance
3. Calculate 2-day percentage changes for all historical dates: for each sequence of 3 consecutive trading days (Day 1, Day 2, Day 3), compute the return as Close_Day3 / Close_Day1 - 1. (Assume Day 2 may correspond to the earnings announcement.)
4. Identify positive earnings surprises (where "actual EPS > estimated EPS" OR "Surprise (%)>0")
5. Calculate 2-day percentage changes following positive earnings surprises
6. Compare the median 2-day percentage change for positive surprises vs. all historical dates

Context: Earnings announcements, especially when they exceed analyst expectations, can significantly impact stock prices in the short term.

Reference: Yahoo Finance earnings calendar - https://finance.yahoo.com/calendar/earnings?symbol=AMZN

*Additional*: Is there a correlation between the magnitude of the earnings surprise and the stock price reaction? Does the market react differently to earnings surprises during bull vs. bear markets?
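For the correlation question, one possible starting point is to put the surprise size and the subsequent 2-day change side by side and compute both a linear (Pearson) and a rank-based (Spearman) correlation. The numbers below are made up for illustration and are not taken from `ha1_Amazon.csv`; the column names are assumptions as well.

```python
import pandas as pd

# Hypothetical values, for illustration only
events = pd.DataFrame(
    {
        "surprise_pct": [5.0, 10.0, 20.0, 40.0, 80.0],
        "two_day_change_pct": [-1.0, 0.5, 2.0, 1.5, 6.0],
    }
)

# Pearson correlation measures the linear relationship
pearson = events["surprise_pct"].corr(events["two_day_change_pct"])

# Spearman correlation is the Pearson correlation of the ranks,
# which is less sensitive to a few extreme surprises
spearman = events["surprise_pct"].rank().corr(events["two_day_change_pct"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

With roughly a hundred earnings dates, either coefficient is a rough indication at best, and a scatter plot is a useful complement.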

In [None]:
import pandas as pd
import yfinance as yf
import numpy as np
# https://finance.yahoo.com/calendar/earnings?symbol=AMZN&offset=0&size=100
earnings_df = pd.read_csv('/content/ha1_Amazon.csv', sep=';')
earnings_df.head(10)

,Symbol,Company,Earnings Date,EPS Estimate,Reported EPS,Surprise (%)
0,AMZN,"Amazon.com, Inc.","May 1, 2025 at 4 PM EDT",1.36,1.59,16.73
1,AMZN,"Amazon.com, Inc.","February 6, 2025 at 4 PM EST",1.49,1.86,24.47
2,AMZN,"Amazon.com, Inc.","October 31, 2024 at 4 PM EDT",1.14,1.43,25.17
3,AMZN,"Amazon.com, Inc.","August 1, 2024 at 4 PM EDT",1.03,1.26,22.58
4,AMZN,"Amazon.com, Inc.","April 30, 2024 at 4 PM EDT",0.83,0.98,17.91
5,AMZN,"Amazon.com, Inc.","February 1, 2024 at 4 PM EST",0.8,1.0,24.55
6,AMZN,"Amazon.com, Inc.","October 26, 2023 at 4 PM EDT",0.58,0.94,60.85
7,AMZN,"Amazon.com, Inc.","August 3, 2023 at 4 PM EDT",0.35,0.65,85.73
8,AMZN,"Amazon.com, Inc.","April 27, 2023 at 4 PM EDT",0.21,0.31,46.36
9,AMZN,"Amazon.com, Inc.","February 2, 2023 at 4 PM EST",0.18,0.25,42.56


In [None]:
amzn_ticker = yf.Ticker("AMZN")
amzn_history = amzn_ticker.history(period="max")
amzn_history.head()

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
1997-05-15 00:00:00-04:00,0.121875,0.125,0.096354,0.097917,1443120000,0.0,0.0
1997-05-16 00:00:00-04:00,0.098438,0.098958,0.085417,0.086458,294000000,0.0,0.0
1997-05-19 00:00:00-04:00,0.088021,0.088542,0.08125,0.085417,122136000,0.0,0.0
1997-05-20 00:00:00-04:00,0.086458,0.0875,0.081771,0.081771,109344000,0.0,0.0
1997-05-21 00:00:00-04:00,0.081771,0.082292,0.06875,0.071354,377064000,0.0,0.0


In [None]:
amzn_history.tail(25)

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
2025-04-25 00:00:00-04:00,187.619995,189.940002,185.490005,188.990005,36414300,0.0,0.0
2025-04-28 00:00:00-04:00,190.110001,190.220001,184.889999,187.699997,33224700,0.0,0.0
2025-04-29 00:00:00-04:00,183.990005,188.020004,183.679993,187.389999,41667300,0.0,0.0
2025-04-30 00:00:00-04:00,182.169998,185.050003,178.850006,184.419998,55176500,0.0,0.0
2025-05-01 00:00:00-04:00,190.630005,191.809998,187.5,190.199997,74266000,0.0,0.0
2025-05-02 00:00:00-04:00,191.440002,192.880005,186.399994,189.979996,77903500,0.0,0.0
2025-05-05 00:00:00-04:00,186.509995,188.179993,185.529999,186.350006,35217500,0.0,0.0
2025-05-06 00:00:00-04:00,184.570007,187.929993,183.850006,185.009995,29314100,0.0,0.0
2025-05-07 00:00:00-04:00,185.559998,190.990005,185.009995,188.710007,43948600,0.0,0.0
2025-05-08 00:00:00-04:00,191.429993,194.330002,188.820007,192.080002,41043600,0.0,0.0


In [None]:
# Strip the time-of-day suffix, parse the earnings dates, and make them timezone-aware (UTC)
earnings_df['Earnings Date'] = earnings_df['Earnings Date'].str.replace(r' at \d+ (AM|PM) (EDT|EST)', '', regex=True)
earnings_df['Earnings Date'] = pd.to_datetime(earnings_df['Earnings Date'], format="%B %d, %Y")
earnings_df['Earnings Date'] = earnings_df['Earnings Date'].dt.tz_localize('UTC')

earnings_df.tail()

,Symbol,Company,Earnings Date,EPS Estimate,Reported EPS,Surprise (%)
107,AMZN,"Amazon.com, Inc.",1998-07-22 00:00:00+00:00,-,-,1.34
108,AMZN,"Amazon.com, Inc.",1998-04-27 00:00:00+00:00,-,-,13.92
109,AMZN,"Amazon.com, Inc.",1998-01-22 00:00:00+00:00,-,-,11.41
110,AMZN,"Amazon.com, Inc.",1997-10-27 00:00:00+00:00,-,-,13.29
111,AMZN,"Amazon.com, Inc.",1997-07-10 00:00:00+00:00,-,-,13.33


In [None]:
amzn_history['2Day_Change'] = amzn_history['Close'].pct_change(periods=2) * 100
amzn_history.head()

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,2Day_Change
1997-05-15 00:00:00-04:00,0.121875,0.125,0.096354,0.097917,1443120000,0.0,0.0,
1997-05-16 00:00:00-04:00,0.098438,0.098958,0.085417,0.086458,294000000,0.0,0.0,
1997-05-19 00:00:00-04:00,0.088021,0.088542,0.08125,0.085417,122136000,0.0,0.0,-12.76591
1997-05-20 00:00:00-04:00,0.086458,0.0875,0.081771,0.081771,109344000,0.0,0.0,-5.421125
1997-05-21 00:00:00-04:00,0.081771,0.082292,0.06875,0.071354,377064000,0.0,0.0,-16.463936


In [None]:
earnings_df.head()

,Symbol,Company,Earnings Date,EPS Estimate,Reported EPS,Surprise (%)
0,AMZN,"Amazon.com, Inc.",2025-05-01 00:00:00+00:00,1.36,1.59,16.73
1,AMZN,"Amazon.com, Inc.",2025-02-06 00:00:00+00:00,1.49,1.86,24.47
2,AMZN,"Amazon.com, Inc.",2024-10-31 00:00:00+00:00,1.14,1.43,25.17
3,AMZN,"Amazon.com, Inc.",2024-08-01 00:00:00+00:00,1.03,1.26,22.58
4,AMZN,"Amazon.com, Inc.",2024-04-30 00:00:00+00:00,0.83,0.98,17.91


In [None]:
earnings_df['Reported EPS'] = pd.to_numeric(earnings_df['Reported EPS'], errors='coerce')
earnings_df['EPS Estimate'] = pd.to_numeric(earnings_df['EPS Estimate'], errors='coerce')
earnings_df['Surprise (%)'] = pd.to_numeric(earnings_df['Surprise (%)'], errors='coerce')

In [None]:
earnings_df.head()

,Symbol,Company,Earnings Date,EPS Estimate,Reported EPS,Surprise (%)
0,AMZN,"Amazon.com, Inc.",2025-05-01 00:00:00+00:00,1.36,1.59,16.73
1,AMZN,"Amazon.com, Inc.",2025-02-06 00:00:00+00:00,1.49,1.86,24.47
2,AMZN,"Amazon.com, Inc.",2024-10-31 00:00:00+00:00,1.14,1.43,25.17
3,AMZN,"Amazon.com, Inc.",2024-08-01 00:00:00+00:00,1.03,1.26,22.58
4,AMZN,"Amazon.com, Inc.",2024-04-30 00:00:00+00:00,0.83,0.98,17.91


## 1) Number of positive surprises: 86

In [None]:
import pandas as pd
import yfinance as yf
import numpy as np

# Load earnings data from CSV (does not contain actual stock prices (OHLCV) to calculate price changes)
earnings_df = pd.read_csv('/content/ha1_Amazon.csv', sep=';')

# Transform Earnings Date columns and make them timezone-aware (UTC)
earnings_df['Earnings Date'] = earnings_df['Earnings Date'].str.replace(r' at \d+ (AM|PM) (EDT|EST)', '', regex=True)
earnings_df['Earnings Date'] = pd.to_datetime(earnings_df['Earnings Date'], format="%B %d, %Y").dt.tz_localize('UTC')

# Download historical price data for AMZN (to determine how the stock price reacted to an earnings surprise)
amzn_ticker = yf.Ticker("AMZN")
amzn_history = amzn_ticker.history(period="max")# Retrieve the maximum available historical data

# Calculate 2-day %-changes for all historical dates and account for pct_change decimal returns
all_two_day_changes = amzn_history['Close'].pct_change(periods=2).dropna() * 100 # Pandas method to calculate %-change between current and prior series elements
print(f"%-change of closing price on date D to date D-2: {all_two_day_changes.loc['2025-05-01']}") #%-change of closing price on date D to date D-2

# Convert relevant columns to numeric
earnings_df['Reported EPS'] = pd.to_numeric(earnings_df['Reported EPS'], errors='coerce')
earnings_df['EPS Estimate'] = pd.to_numeric(earnings_df['EPS Estimate'], errors='coerce')
earnings_df['Surprise (%)'] = pd.to_numeric(earnings_df['Surprise (%)'], errors='coerce')

# Identify positive earnings surprises
positive_surprises = earnings_df[
    (earnings_df['Reported EPS'] > earnings_df['EPS Estimate']) | (earnings_df['Surprise (%)'] > 0)
]
print(f"Number of positive surprises: {len(positive_surprises)}")

# Calculate 2-day %-changes (Day 3 in the 3-day sequence) following positive surprises (Day 2 in the 3-day sequence)
positive_surprise_changes = []

# Convert the index of amzn_history to a list
amzn_dates = amzn_history.index.tolist()

# Calculate 2-day percentage changes (Day 1 to day 3 in the 3-day sequence)
for earnings_date in positive_surprises['Earnings Date']:
    try:
        # Find Day 1: the earnings date is midnight UTC, i.e. the previous evening in New York,
        # so this lookup returns the last trading day before the earnings date
        day1_index = amzn_history.index.searchsorted(earnings_date, side='right') - 1

        # Ensure there is a trading day before the earnings date
        if day1_index >= 0:
            day1_date = amzn_history.index[day1_index]

            # Find the trading day two days after Day 1 (Day 3)
            day3_index = day1_index + 2

            # Ensure there are at least two trading days after Day 1
            if day3_index < len(amzn_history):
                day3_date = amzn_history.index[day3_index]

                # Calculate the 2-day percentage change
                day1_close = amzn_history.loc[day1_date, 'Close']
                day3_close = amzn_history.loc[day3_date, 'Close']

                if day1_close != 0: # Avoid division by zero
                     two_day_change = ((day3_close / day1_close) - 1) * 100
                     positive_surprise_changes.append(two_day_change)
                else:
                    print(f"Day 1 close price is zero for earnings date {earnings_date}. Skipping.")
            else:
                 print(f"Not enough historical data after Day 1 ({day1_date}) for earnings date {earnings_date}. Skipping.")
        else:
            print(f"Could not find a trading day on or before earnings date {earnings_date}. Skipping.")

    except Exception as e:
        print(f"An error occurred processing earnings date {earnings_date}: {e}")

# Print all individual 2-day percentage changes following positive earnings surprises
print("Individual 2-day percentage changes following positive earnings surprises:")
for change in positive_surprise_changes:
    print(f"{change:.4f}%")

# Calculate the median 2-day percentage change for positive surprises
median_positive_surprise_change = pd.Series(positive_surprise_changes).median()

# Calculate the median 2-day percentage change for all historical dates
median_all_dates_change = all_two_day_changes.median()

# Print the median results
print(f"\nMedian 2-day percentage change following positive earnings surprises: {median_positive_surprise_change:.4f}")
print(f"Median 2-day percentage change for all historical dates: {median_all_dates_change:.4f}")

%-change of closing price on date D to date D-2: 1.4995451025915152
Number of positive surprises: 86
Individual 2-day percentage changes following positive earnings surprises:
3.0149%
-2.9724%
2.6981%
-10.2043%
-1.0831%
10.7023%
5.2311%
8.8605%
0.4477%
-1.6738%
11.5566%
4.6656%
-8.3389%
0.2579%
-0.9079%
-4.0038%
4.3233%
8.1119%
2.5703%
-2.6460%
-1.2853%
-2.4866%
7.7012%
-1.4433%
13.1605%
1.7266%
-2.6611%
3.0054%
8.7409%
7.7838%
8.4277%
14.1868%
16.6562%
-6.3929%
-5.1976%
9.3389%
16.6804%
-6.6042%
4.2297%
-2.4232%
6.5923%
-1.9122%
2.1670%
26.8358%
-2.5904%
6.6414%
16.7991%
0.1792%
15.8158%
-2.3995%
5.0710%
-2.8042%
20.1282%
26.8930%
-0.7433%
14.5985%
0.0000%
-12.8467%
15.0198%
1.2467%
-8.8901%
-9.3304%
13.5296%
13.9206%
4.4402%
-2.2785%
-9.0323%
17.3305%
22.7362%
-12.8848%
-28.9753%
-0.6790%
-13.9752%
6.2500%
-16.6113%
0.8343%
-12.6154%
-11.0822%
-18.2756%
11.7909%
8.7588%
-3.3601%
12.6658%
-2.6695%
-1.5543%
-1.3457%

Median 2-day percentage change following positive earnings surprises:

## 2) Number of positive surprises: 37

In [None]:
import pandas as pd
import yfinance as yf
import numpy as np

# Load earnings data from CSV (does not contain actual stock prices (OHLCV) to calculate price changes)
earnings_df = pd.read_csv('/content/ha1_Amazon2.csv', sep=';')

# Transform Earnings Date columns and make them timezone-aware (UTC)
earnings_df['Earnings Date'] = earnings_df['Earnings Date'].str.replace(r' at \d+ (AM|PM) (EDT|EST)', '', regex=True)
earnings_df['Earnings Date'] = pd.to_datetime(earnings_df['Earnings Date'], format="%B %d, %Y").dt.tz_localize('UTC')

# Download historical price data for AMZN (to determine how the stock price reacted to an earnings surprise)
amzn_ticker = yf.Ticker("AMZN")
amzn_history = amzn_ticker.history(period="max")# Retrieve the maximum available historical data

# Calculate 2-day %-changes for all historical dates and account for pct_change decimal returns
all_two_day_changes = amzn_history['Close'].pct_change(periods=2).dropna() * 100 # Pandas method to calculate %-change between current and prior series elements
print(f"%-change of closing price on date D to date D-2: {all_two_day_changes.loc['2025-05-01']}") #%-change of closing price on date D to date D-2

# Convert relevant columns to numeric
earnings_df['Reported EPS'] = pd.to_numeric(earnings_df['Reported EPS'], errors='coerce')
earnings_df['EPS Estimate'] = pd.to_numeric(earnings_df['EPS Estimate'], errors='coerce')
earnings_df['Surprise (%)'] = pd.to_numeric(earnings_df['Surprise (%)'], errors='coerce')

# Identify positive earnings surprises
positive_surprises = earnings_df[earnings_df['Reported EPS'] > earnings_df['EPS Estimate']]
print(f"Number of positive surprises: {len(positive_surprises)}")

# Calculate 2-day %-changes (Day 3 in the 3-day sequence) following positive surprises (Day 2 in the 3-day sequence)
positive_surprise_changes = []

# Convert the index of amzn_history to a list
amzn_dates = amzn_history.index.tolist()

# Calculate 2-day percentage changes (Day 1 to day 3 in the 3-day sequence)
for earnings_date in positive_surprises['Earnings Date']:
    try:
        # Find Day 1: the earnings date is midnight UTC, i.e. the previous evening in New York,
        # so this lookup returns the last trading day before the earnings date
        day1_index = amzn_history.index.searchsorted(earnings_date, side='right') - 1

        # Ensure there is a trading day before the earnings date
        if day1_index >= 0:
            day1_date = amzn_history.index[day1_index]

            # Find the trading day two days after Day 1 (Day 3)
            day3_index = day1_index + 2

            # Ensure there are at least two trading days after Day 1
            if day3_index < len(amzn_history):
                day3_date = amzn_history.index[day3_index]

                # Calculate the 2-day percentage change
                day1_close = amzn_history.loc[day1_date, 'Close']
                day3_close = amzn_history.loc[day3_date, 'Close']

                if day1_close != 0: # Avoid division by zero
                     two_day_change = ((day3_close / day1_close) - 1) * 100
                     positive_surprise_changes.append(two_day_change)
                else:
                    print(f"Day 1 close price is zero for earnings date {earnings_date}. Skipping.")
            else:
                 print(f"Not enough historical data after Day 1 ({day1_date}) for earnings date {earnings_date}. Skipping.")
        else:
            print(f"Could not find a trading day on or before earnings date {earnings_date}. Skipping.")

    except Exception as e:
        print(f"An error occurred processing earnings date {earnings_date}: {e}")

# Print all individual 2-day percentage changes following positive earnings surprises
print("Individual 2-day percentage changes following positive earnings surprises:")
for change in positive_surprise_changes:
    print(f"{change:.4f}%")

# Calculate the median 2-day percentage change for positive surprises
median_positive_surprise_change = pd.Series(positive_surprise_changes).median()

# Calculate the median 2-day percentage change for all historical dates
median_all_dates_change = all_two_day_changes.median()

# Print the median results
print(f"\nMedian 2-day percentage change following positive earnings surprises: {median_positive_surprise_change:.4f}")
print(f"Median 2-day percentage change for all historical dates: {median_all_dates_change:.4f}")

%-change of closing price on date D to date D-2: 1.4995451025915152
Number of positive surprises: 37
Individual 2-day percentage changes following positive earnings surprises:
3.0149%
-2.9724%
2.6981%
-10.2043%
-1.0831%
10.7023%
5.2311%
8.8605%
0.4477%
-1.6738%
11.5566%
4.6656%
-8.3389%
0.2579%
-0.9079%
-4.0038%
4.3233%
8.1119%
2.5703%
-2.6460%
-1.2853%
-2.4866%
7.7012%
-1.4433%
1.7266%
-2.6611%
3.0054%
8.7409%
7.7838%
8.4277%
16.6562%
-6.6042%
-2.4232%
6.5923%
16.7991%
15.8158%
-12.8467%

Median 2-day percentage change following positive earnings surprises: 2.5703
Median 2-day percentage change for all historical dates: 0.1646
