# S&P 500 Monthly Seasonality Analysis

## Overview
This notebook studies how the S&P 500 performs after a negative return in a given calendar month (default: September; the run shown below uses November), measured over the following 30, 60, and 90 calendar days. The dataset covers the full 1789–2025 S&P 500 series, so the seasonal results span many economic cycles.

## Key Inputs
- **Data**: Daily S&P 500 prices aggregated to month-end observations.
- **Target Month**: Filtered for negative monthly returns in the selected calendar month (September by default; the run shown sets `M = 11`, November).
- **Forward Horizons**: Analyze cumulative log returns over the next 30, 60, and 90 calendar days after month-end.

## Analysis Steps
1. Load and clean the S&P 500 price history.
2. Resample to month-end prices and compute monthly returns.
3. Identify all negative months for the target calendar month.
4. For those years, compute forward cumulative log returns for each horizon.
5. Summarize descriptive statistics, success rates, and distributional behavior for the forward periods.

## Considerations
- Results emphasize historical seasonality and do not imply persistence.
- Forward windows are measured in calendar days from the month-end date (`pd.offsets.MonthEnd` plus `pd.Timedelta`); weekends and holidays simply contribute no rows to a window.
- Maintain transparency on sample sizes and regime shifts when interpreting the findings.
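
The horizons can be read as calendar days or as trading days, and the two are not interchangeable. The forward-return code in this notebook uses a calendar-day `pd.Timedelta`. The sketch below contrasts the two readings on a hypothetical business-day index built with `pd.bdate_range` (weekends removed, holidays ignored); the index and the chosen year are assumptions for illustration, not the notebook's data.

```python
import pandas as pd

# Hypothetical business-day calendar (weekends removed, holidays ignored)
idx = pd.bdate_range("2021-11-01", "2022-03-31")

# Last calendar day of November 2021
month_end = pd.Timestamp(year=2021, month=11, day=1) + pd.offsets.MonthEnd(0)

# Calendar-day horizon: every row dated within 30 days after month-end
calendar_window = idx[(idx > month_end) & (idx <= month_end + pd.Timedelta(days=30))]

# Trading-day horizon: the first 30 rows after month-end
trading_window = idx[idx > month_end][:30]

print(len(calendar_window), calendar_window[-1].date())
print(len(trading_window), trading_window[-1].date())
```

On a business-day calendar, a 30-calendar-day window holds roughly 20 to 22 rows, so a 30-trading-day horizon reaches noticeably further into the future.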

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

# Build the data path in a way that works on every operating system
import os

# Method 1: os.path.join picks the correct separator for the OS (recommended)
path = os.path.abspath(os.path.join("..", "04_S&P500_quant_analysis", "01_data", "^spx_d.csv"))

# Alternative Method 2: Use raw string
# path = os.path.abspath(os.path.join("..", r"04_S&P500_quant_analysis\01_data", "^spx_d.csv"))

# Alternative Method 3: Use pathlib (modern approach)
# from pathlib import Path
# path = Path("..") / "04_S&P500_quant_analysis" / "01_data" / "^spx_d.csv"

# print(f"Path exists: {os.path.exists(path)}")

# Read the CSV file
df = pd.read_csv(path)

# lower case column names for easier access
df.columns = [col.lower() for col in df.columns]

# Set the 'date' column as the index and convert it to datetime
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Show the first few rows of the dataframe
df.head()

date,open,high,low,close,volume
1789-05-01,0.51,0.51,0.51,0.51,0.0
1789-06-01,0.51,0.51,0.51,0.51,0.0
1789-07-01,0.5,0.5,0.5,0.5,0.0
1789-08-01,0.5,0.51,0.5,0.51,0.0
1789-09-01,0.51,0.51,0.5,0.51,0.0


In [2]:
# Calculate simple returns with pct_change()
df['simple_returns'] = df['close'].pct_change()

# Drop NA values that result from pct_change()
df.dropna(inplace=True)

# Show dataframe
df.head()

date,open,high,low,close,volume,simple_returns
1789-06-01,0.51,0.51,0.51,0.51,0.0,0.0
1789-07-01,0.5,0.5,0.5,0.5,0.0,-0.019608
1789-08-01,0.5,0.51,0.5,0.51,0.0,0.02
1789-09-01,0.51,0.51,0.5,0.51,0.0,0.0
1789-10-01,0.51,0.51,0.51,0.51,0.0,0.0


In [3]:
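# Resample to monthly frequency with standard OHLCV aggregation
monthly_ohlcv = df.resample('M').agg({
    'open': 'first',   # first open of the month
    'high': 'max',     # highest high of the month
    'low': 'min',      # lowest low of the month
    'close': 'last',   # last close of the month (used for monthly returns)
    'volume': 'sum'    # total volume over the month
})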

# simple_returns for monthly data
monthly_ohlcv['monthly_returns'] = monthly_ohlcv['close'].pct_change()

# Drop NA values that result from pct_change()
monthly_ohlcv.dropna(inplace=True)

# Show dataframe
monthly_ohlcv.head()

date,open,high,low,close,volume,monthly_returns
1789-07-31,0.5,0.5,0.5,0.5,0.0,-0.019608
1789-08-31,0.5,0.51,0.5,0.51,0.0,0.02
1789-09-30,0.51,0.51,0.5,0.51,0.0,0.0
1789-10-31,0.51,0.51,0.51,0.51,0.0,0.0
1789-11-30,0.51,0.51,0.5,0.5,0.0,-0.019608


In [4]:
# Calculate log returns (Monthly) for all data
log_returns = (1 + monthly_ohlcv['monthly_returns']).apply(np.log)

# Cumulative sum of log returns (correct for compounding)
cumsum_log = log_returns.cumsum()
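
Summing log returns is used here because the sum equals the log of the compounded simple return. A minimal numeric check of that identity, using three made-up monthly returns (the values are illustrative assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Three hypothetical simple returns
simple = pd.Series([0.02, -0.01, 0.03])

# Log returns: log(1 + r)
log_r = np.log1p(simple)

# Total return recovered from the summed log returns: exp(sum) - 1
total_from_logs = np.expm1(log_r.sum())

# Total return from compounding the simple returns directly
total_compounded = (1 + simple).prod() - 1

print(total_from_logs, total_compounded)
```

The two totals agree up to floating-point error, which is why a cumulative sum of log returns tracks compounded growth.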

In [5]:
# # Filter cumulative log returns from 2020 onwards
# cumsum_log_2020 = cumsum_log[cumsum_log.index >= '2020-01-01']

# # Plot cumulative log returns from 2020
# plt.figure(figsize=(14, 7))
# plt.plot(cumsum_log_2020, label='Cumulative Sum (Log Returns) from 2020', color='blue')
# plt.title('Cumulative Log Returns (Monthly) - From 2020')
# plt.xlabel('Date')
# plt.ylabel('Cumulative Log Returns')
# plt.legend()
# plt.grid()
# plt.show()

In [6]:
# add year column from DatetimeIndex
monthly_ohlcv['year'] = monthly_ohlcv.index.year

# add month column from DatetimeIndex
monthly_ohlcv['month'] = monthly_ohlcv.index.month

# add day column from DatetimeIndex
monthly_ohlcv['day'] = monthly_ohlcv.index.day

monthly_ohlcv.head()

date,open,high,low,close,volume,monthly_returns,year,month,day
1789-07-31,0.5,0.5,0.5,0.5,0.0,-0.019608,1789,7,31
1789-08-31,0.5,0.51,0.5,0.51,0.0,0.02,1789,8,31
1789-09-30,0.51,0.51,0.5,0.51,0.0,0.0,1789,9,30
1789-10-31,0.51,0.51,0.51,0.51,0.0,0.0,1789,10,31
1789-11-30,0.51,0.51,0.5,0.5,0.0,-0.019608,1789,11,30


In [7]:
# Select month for the analysis
# M = 1 # January
# M = 2 # February
# M = 3 # March
# M = 4 # April
# M = 5 # May
# M = 6 # June
# M = 7 # July
# M = 8 # August
# M = 9  # September
# M = 10 # October
M = 11 # November
# M = 12 # December

In [8]:
df_ = monthly_ohlcv.copy()

# select only rows where month == M
df_ = df_[df_.month == M]

# count negative months (where monthly_returns < 0)
negative_months = len(df_[df_['monthly_returns'] < 0])
print(f"Number of negative months in month {M}: {negative_months}")

# count months (length of df)
months_count = len(df_)
print(f"Number of months in the dataset for month {M}: {months_count}")

# percentage of negative months
negative_percentage = (negative_months / months_count) * 100
print(f"Percentage of negative months in month {M}: {negative_percentage:.2f}%")

Number of negative months in month 11: 98
Number of months in the dataset for month 11: 236
Percentage of negative months in month 11: 41.53%
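
The same share can be computed for every calendar month at once with a `groupby`, which makes it easier to compare the selected month against the others. This is a sketch on synthetic data: the `monthly` frame and its random returns are assumptions for illustration; in the notebook, `monthly_ohlcv` would take its place.

```python
import numpy as np
import pandas as pd

# Synthetic monthly returns (hypothetical data, ten years of month-start stamps)
rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-01", periods=120, freq="MS")
monthly = pd.DataFrame({"monthly_returns": rng.normal(0.005, 0.04, len(idx))}, index=idx)

# Share of negative months per calendar month, in percent
is_negative = monthly["monthly_returns"] < 0
neg_share = is_negative.groupby(monthly.index.month).mean() * 100

print(neg_share.round(2))
```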


In [9]:
# def get_negative_months(monthly_ohlcv, M):
#     df_ = monthly_ohlcv.copy()
#     df_ = df_[df_.month == M]
#     negative_months = df_[df_['monthly_returns'] < 0]['year'].tolist()
#     return negative_months

# Return the years in which month M had a negative monthly return
def get_negative_years(monthly_ohlcv, M):
    df_ = monthly_ohlcv.copy()
    df_ = df_[df_.month == M]
    negative_years = df_[df_['monthly_returns'] < 0]['year'].tolist()
    return negative_years

# Example usage
negative_years = get_negative_years(monthly_ohlcv, M)
print(f"Years with negative returns in month {M}: {negative_years}")

print(f"Number of years with negative returns in month {M}: {len(negative_years)}")

Years with negative returns in month 11: [1789, 1790, 1795, 1796, 1801, 1806, 1807, 1810, 1811, 1815, 1819, 1825, 1826, 1827, 1828, 1831, 1833, 1835, 1836, 1838, 1839, 1841, 1844, 1847, 1848, 1849, 1851, 1853, 1854, 1855, 1857, 1860, 1864, 1866, 1869, 1870, 1871, 1873, 1876, 1881, 1882, 1883, 1884, 1888, 1889, 1890, 1892, 1894, 1895, 1897, 1902, 1909, 1910, 1913, 1916, 1917, 1918, 1919, 1920, 1922, 1925, 1929, 1930, 1931, 1932, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1946, 1947, 1948, 1950, 1951, 1956, 1963, 1964, 1965, 1969, 1971, 1973, 1974, 1976, 1984, 1987, 1988, 1991, 1993, 1994, 2000, 2007, 2008, 2010, 2011, 2021]
Number of years with negative returns in month 11: 98


In [10]:
df.head()

date,open,high,low,close,volume,simple_returns
1789-06-01,0.51,0.51,0.51,0.51,0.0,0.0
1789-07-01,0.5,0.5,0.5,0.5,0.0,-0.019608
1789-08-01,0.5,0.51,0.5,0.51,0.0,0.02
1789-09-01,0.51,0.51,0.5,0.51,0.0,0.0
1789-10-01,0.51,0.51,0.51,0.51,0.0,0.0


In [11]:
# For every year in negative_years, sum the daily log returns over the 30, 60, or 90 calendar days after month-end
def get_cumsum_log_after_days(negative_years, df_daily, days=30):
    """
    Calculate the cumulative log return over the `days` calendar days
    that follow the end of month M in each year.
    
    Args:
        negative_years: List of years with negative returns in month M
        df_daily: Daily dataframe with a DatetimeIndex and a simple_returns column
        days: Number of calendar days to look ahead (30, 60, or 90)
    
    Returns:
        Dictionary with year as key and cumulative log return as value

    Note:
        The target month is read from the global variable M.
    """
    cumsum_logs = {}
    
    for year in negative_years:
        try:
            # Find the last day of month M in the given year
            month_end_date = pd.Timestamp(year=year, month=M, day=1) + pd.offsets.MonthEnd(0)
            
            # Find the date that is 'days' after the month end
            target_date = month_end_date + pd.Timedelta(days=days)
            
            # Get data from month end to target date
            mask = (df_daily.index > month_end_date) & (df_daily.index <= target_date)
            period_data = df_daily.loc[mask]
            
            if not period_data.empty:
                # Calculate log returns for the period
                log_returns_period = (1 + period_data['simple_returns']).apply(np.log)
                # Sum log returns (equivalent to cumulative multiplication of (1+r))
                cumsum_log_value = log_returns_period.sum()
                cumsum_logs[year] = cumsum_log_value
            else:
                # If no data available for that period, skip
                print(f"No data available for {year} after {days} days from month {M}")
                
        except Exception as e:
            print(f"Error processing year {year}: {e}")
            continue
    
    return cumsum_logs

# Use the original daily dataframe instead of monthly
cumsum_30_days = get_cumsum_log_after_days(negative_years, df, days=30)
cumsum_60_days = get_cumsum_log_after_days(negative_years, df, days=60)
cumsum_90_days = get_cumsum_log_after_days(negative_years, df, days=90)

print(f"Cumulative Sum (Log Returns) after 30 days for years with negative returns in month {M}:")
for year, ret in cumsum_30_days.items():
    print(f"{year}: {ret*100:.2f}%")

print(f"\nCumulative Sum (Log Returns) after 60 days for years with negative returns in month {M}:")
for year, ret in cumsum_60_days.items():
    print(f"{year}: {ret*100:.2f}%")

print(f"\nCumulative Sum (Log Returns) after 90 days for years with negative returns in month {M}:")
for year, ret in cumsum_90_days.items():
    print(f"{year}: {ret*100:.2f}%")

Cumulative Sum (Log Returns) after 30 days for years with negative returns in month 11:
1789: 0.00%
1790: 3.57%
1795: -1.36%
1796: -1.57%
1801: -3.13%
1806: -3.28%
1807: -1.53%
1810: -1.94%
1811: -6.25%
1815: 0.00%
1819: -1.14%
1825: -0.96%
1826: 0.00%
1827: -1.45%
1828: -0.52%
1831: -0.48%
1833: -5.06%
1835: -1.10%
1836: -2.52%
1838: -0.46%
1839: -1.20%
1841: -5.80%
1844: -5.84%
1847: -0.79%
1848: 0.45%
1849: -1.25%
1851: 0.00%
1853: -0.97%
1854: 3.89%
1855: -0.40%
1857: -4.68%
1860: -0.47%
1864: -0.24%
1866: -1.09%
1869: -3.10%
1870: 0.66%
1871: 0.61%
1873: -1.97%
1876: -4.75%
1881: -1.43%
1882: -0.81%
1883: -1.08%
1884: -1.51%
1888: 1.84%
1889: 0.35%
1890: -2.96%
1892: 0.18%
1894: 0.48%
1895: -11.11%
1897: 3.37%
1902: 1.26%
1909: 3.07%
1910: 0.22%
1913: 1.20%
1916: -6.55%
1917: 2.32%
1918: -2.68%
1919: -0.11%
1920: -5.31%
1922: 3.09%
1925: 3.75%
1929: -0.14%
1930: -9.56%
1931: -15.94%
1932: 5.50%
1937: -5.08%
1938: 3.17%
1939: 2.35%
1940: -0.38%
1941: -4.04%
1942: 4.94%
1943: 5.90%
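
The function above reads the target month from the global `M`. A variant that takes the month as an explicit argument is easier to reuse and test. The sketch below is one possible shape, not the notebook's implementation: `forward_log_return` and the synthetic `demo` frame are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def forward_log_return(df_daily, year, month, days):
    """Sum of daily log returns in the `days` calendar days after month-end.

    Returns None when no rows fall inside the window.
    """
    month_end = pd.Timestamp(year=year, month=month, day=1) + pd.offsets.MonthEnd(0)
    target = month_end + pd.Timedelta(days=days)
    in_window = (df_daily.index > month_end) & (df_daily.index <= target)
    window = df_daily.loc[in_window, "simple_returns"]
    if window.empty:
        return None
    return float(np.log1p(window).sum())

# Synthetic business-day returns (hypothetical data)
idx = pd.bdate_range("2020-01-01", "2021-12-31")
rng = np.random.default_rng(1)
demo = pd.DataFrame({"simple_returns": rng.normal(0.0003, 0.01, len(idx))}, index=idx)

# Forward returns after November 2020 for each horizon
results = {days: forward_log_return(demo, 2020, 11, days) for days in (30, 60, 90)}

# A window that starts after the data ends yields None
missing = forward_log_return(demo, 2021, 12, 30)

print(results, missing)
```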


In [12]:
# calculate statistics for cumsum_30_days, cumsum_60_days, cumsum_90_days
def calculate_statistics(cumsum_dict):
    """
    Calculate statistics for cumulative log returns.
    
    Args:
        cumsum_dict: Dictionary with year as key and cumulative log return as value

    Returns:
        Dictionary with statistics (mean, median, std) for cumulative log returns
    """
    values = np.array(list(cumsum_dict.values()))
    stats = {
        "mean": np.mean(values),
        "median": np.median(values),
        # np.std defaults to ddof=0 (population std); pandas describe() uses ddof=1
        "std": np.std(values)
    }
    return stats

# Calculate statistics for each cumulative sum dictionary
stats_30_days = calculate_statistics(cumsum_30_days)
stats_60_days = calculate_statistics(cumsum_60_days)
stats_90_days = calculate_statistics(cumsum_90_days)

print(f"Statistics for Cumulative Sum (Log Returns) after 30 days for years with negative returns in month {M}:")
print(stats_30_days)

print(f"\nStatistics for Cumulative Sum (Log Returns) after 60 days for years with negative returns in month {M}:")
print(stats_60_days)

print(f"\nStatistics for Cumulative Sum (Log Returns) after 90 days for years with negative returns in month {M}:")
print(stats_90_days)

Statistics for Cumulative Sum (Log Returns) after 30 days for years with negative returns in month 11:
{'mean': np.float64(-0.0018001500850852968), 'median': np.float64(-0.001428769487395233), 'std': np.float64(0.03833908974754275)}

Statistics for Cumulative Sum (Log Returns) after 60 days for years with negative returns in month 11:
{'mean': np.float64(0.0007781886325773193), 'median': np.float64(-0.008506742499261706), 'std': np.float64(0.05407651406035764)}

Statistics for Cumulative Sum (Log Returns) after 90 days for years with negative returns in month 11:
{'mean': np.float64(-0.005907742758259689), 'median': np.float64(-0.010703956587260775), 'std': np.float64(0.06756568755198528)}
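
The standard deviations printed here are slightly smaller than those reported by `describe()` because `np.std` defaults to `ddof=0` (population formula) while pandas uses `ddof=1` (sample formula). A small sketch of the relationship, with made-up values that are not from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative log returns
values = pd.Series([0.01, -0.02, 0.03, -0.015, 0.005])
n = len(values)

# Population standard deviation (NumPy default, ddof=0)
pop_std = np.std(values.to_numpy())

# Sample standard deviation (pandas default, ddof=1)
sample_std = values.std()

# The two differ by the factor sqrt(n / (n - 1))
print(pop_std, sample_std, pop_std * np.sqrt(n / (n - 1)))
```

Passing `ddof=1` to `np.std` makes the two agree.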


In [13]:
# Descriptive statistics via pandas describe(), shown in percent with the count left unscaled
def describe_statistics(cumsum_dict):
    """
    Describe statistics for cumulative log returns using pandas describe method.
    
    Args:
        cumsum_dict: Dictionary with year as key and cumulative log return as value

    Returns:
        DataFrame of descriptive statistics in percent (count is not scaled)
    """
    # Convert to DataFrame for easier description
    stats_df = pd.DataFrame.from_dict(cumsum_dict, orient='index', columns=['cumsum_log_return'])
    description = stats_df.describe()
    # Convert every statistic except the observation count to percent
    description.loc[description.index != 'count'] *= 100
    # Print results
    print(description)
    return description

print(f"Descriptive Statistics for Cumulative Sum (Log Returns) after 30 days for years with negative returns in month {M}:")
describe_statistics(cumsum_30_days)

print(f"\nDescriptive Statistics for Cumulative Sum (Log Returns) after 60 days for years with negative returns in month {M}:")
describe_statistics(cumsum_60_days)

print(f"\nDescriptive Statistics for Cumulative Sum (Log Returns) after 90 days for years with negative returns in month {M}:")
describe_statistics(cumsum_90_days)

Descriptive Statistics for Cumulative Sum (Log Returns) after 30 days for years with negative returns in month 11:
       cumsum_log_return
count          98.000000
mean           -0.180015
std             3.853621
min           -15.942774
25%            -1.523446
50%            -0.142877
75%             1.758203
max            10.110329

Descriptive Statistics for Cumulative Sum (Log Returns) after 60 days for years with negative returns in month 11:
       cumsum_log_return
count          98.000000
mean            0.077819
std             5.435454
min           -17.937852
25%            -2.955243
50%            -0.850674
75%             3.642695
max            11.670603

Descriptive Statistics for Cumulative Sum (Log Returns) after 90 days for years with negative returns in month 11:
       cumsum_log_return
count          98.000000
mean           -0.590774
std             6.791307
min           -19.821529
25%            -4.439714
50%            -1.070396
75%             3.597926
max            16.895672

,cumsum_log_return
count,98.0
mean,-0.590774
std,6.791307
min,-19.821529
25%,-4.439714
50%,-1.070396
75%,3.597926
max,16.895672
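
The analysis steps also mention success rates, which the cells above do not compute. One way to define it is the share of years whose forward cumulative log return is strictly positive. The sketch below assumes that definition; `success_rate` and the `demo` dictionary are hypothetical, and the numbers are made up rather than taken from the results above.

```python
import numpy as np
import pandas as pd

def success_rate(cumsum_dict):
    """Percentage of years with a strictly positive forward log return."""
    values = pd.Series(cumsum_dict, dtype=float)
    return float((values > 0).mean() * 100)

# Hypothetical forward cumulative log returns keyed by year
demo = {2001: 0.021, 2002: -0.013, 2003: 0.0, 2004: 0.047}

rate = success_rate(demo)

# Log returns converted back to simple returns, in percent
simple_pct = np.expm1(pd.Series(demo)) * 100

print(f"Success rate: {rate:.2f}%")
print(simple_pct.round(2))
```

In the notebook, the function could be applied to `cumsum_30_days`, `cumsum_60_days`, and `cumsum_90_days`. Whether a flat year (exactly 0.00%) counts as a success is a judgment call; this sketch counts it as a failure.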
