# S&P 500 Monthly Seasonality Analysis

## Overview
This notebook studies whether strong performance in a given month (default: September) leads to consistent gains over the following 30, 60, and 90 calendar days. The dataset covers the full 1789–2025 S&P 500 series so seasonal insights span many economic cycles.

## Key Inputs
- **Data**: Daily S&P 500 prices aggregated to month-end observations.
- **Target Month**: Filtered for positive monthly returns in the selected calendar month (currently September).
- **Forward Horizons**: Analyze cumulative log returns for the next 30, 60, and 90 calendar days.

## Analysis Steps
1. Load and clean the S&P 500 price history.
2. Resample to month-end prices and compute monthly returns.
3. Identify all positive months for the target calendar month.
4. For those years, compute forward cumulative log returns for each horizon.
5. Summarize descriptive statistics, success rates, and distributional behavior for the forward periods.
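
The steps above can be sketched end to end on a small synthetic series. This is a minimal illustration, not the notebook's own code: the function name `forward_log_returns`, the steadily rising price series, and the use of a monthly `Period` grouping in place of `resample` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def forward_log_returns(close, month, horizons=(30, 60, 90)):
    """Hypothetical helper: for each year whose `month` had a positive return,
    sum the daily log returns over the following calendar-day horizons."""
    close = close.sort_index()
    daily_log = np.log(close).diff().dropna()

    # Step 2: month-end closes (grouping by monthly Period) and monthly returns
    month_end = close.groupby(close.index.to_period("M")).last()
    monthly_ret = month_end.pct_change().dropna()

    # Step 3: keep the target calendar month only when its return is positive
    hits = monthly_ret[(monthly_ret.index.month == month) & (monthly_ret > 0)]

    # Step 4: forward cumulative log return for each horizon
    rows = {}
    for period in hits.index:
        start = period.end_time.normalize()  # last calendar day of the month
        row = {}
        for h in horizons:
            end = start + pd.Timedelta(days=h)
            window = daily_log[(daily_log.index > start) & (daily_log.index <= end)]
            row[h] = window.sum() if not window.empty else np.nan
        rows[period.year] = row
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic business-day series that rises 0.1% per day (assumed data)
idx = pd.bdate_range("2020-01-01", "2021-03-31")
close = pd.Series(100.0 * 1.001 ** np.arange(len(idx)), index=idx)

result = forward_log_returns(close, month=9)
print(result)
```

Step 5 would then apply `result.describe()` or similar summaries to each horizon column.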

## Considerations
- Results emphasize historical seasonality and do not imply persistence.
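- Forward windows are measured in calendar days from the month-end date (located with a pandas `MonthEnd` offset); weekends and holidays fall inside the window but contribute no rows.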
- Maintain transparency on sample sizes and regime shifts when interpreting the findings.
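
Whether a horizon is counted in calendar days or trading days changes the window materially. The sketch below compares the two for an arbitrary example date; `pd.offsets.BDay` only skips weekends, so treating it as a trading-day count is an approximation that ignores exchange holidays.

```python
import pandas as pd

month_end = pd.Timestamp("2023-09-30")  # example month-end (falls on a weekend)

# Calendar-day horizon: 30 days on the calendar, weekends included
calendar_target = month_end + pd.Timedelta(days=30)

# Business-day horizon: 30 weekdays (Mon-Fri), no holiday calendar applied
business_target = month_end + pd.offsets.BDay(30)

print("30 calendar days ends on:", calendar_target.date())
print("30 business days ends on:", business_target.date())
```

The business-day window ends later than the calendar-day one, so the two definitions should not be mixed when comparing results.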

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

# Build the CSV path from its components so it resolves on any operating system
import os

path = os.path.abspath(os.path.join("..", "04_S&P500_quant_analysis", "01_data", "^spx_d.csv"))

# Alternative: pathlib
# from pathlib import Path
# path = Path("..") / "04_S&P500_quant_analysis" / "01_data" / "^spx_d.csv"

# print(f"Path exists: {os.path.exists(path)}")

# Read the CSV file
df = pd.read_csv(path)

# lower case column names for easier access
df.columns = [col.lower() for col in df.columns]

# Set the 'date' column as the index and convert it to datetime
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Show the first few rows of the dataframe
df.head()

date,open,high,low,close,volume
1789-05-01,0.51,0.51,0.51,0.51,0.0
1789-06-01,0.51,0.51,0.51,0.51,0.0
1789-07-01,0.5,0.5,0.5,0.5,0.0
1789-08-01,0.5,0.51,0.5,0.51,0.0
1789-09-01,0.51,0.51,0.5,0.51,0.0


In [14]:
# Calculate simple returns with pct_change()
df['simple_returns'] = df['close'].pct_change()

# Drop NA values that result from pct_change()
df.dropna(inplace=True)

# Show dataframe
df.head()

date,open,high,low,close,volume,simple_returns
1789-06-01,0.51,0.51,0.51,0.51,0.0,0.0
1789-07-01,0.5,0.5,0.5,0.5,0.0,-0.019608
1789-08-01,0.5,0.51,0.5,0.51,0.0,0.02
1789-09-01,0.51,0.51,0.5,0.51,0.0,0.0
1789-10-01,0.51,0.51,0.51,0.51,0.0,0.0


In [15]:
# resample to monthly frequency, taking the last observation of each month for OHLCV
monthly_ohlcv = df.resample('M').agg({
    'open': 'last',
    'high': 'last',
    'low': 'last',
    'close': 'last',
    'volume': 'last'
})

# simple_returns for monthly data
monthly_ohlcv['monthly_returns'] = monthly_ohlcv['close'].pct_change()

# Drop NA values that result from pct_change()
monthly_ohlcv.dropna(inplace=True)

# Show dataframe
monthly_ohlcv.head()

date,open,high,low,close,volume,monthly_returns
1789-07-31,0.5,0.5,0.5,0.5,0.0,-0.019608
1789-08-31,0.5,0.51,0.5,0.51,0.0,0.02
1789-09-30,0.51,0.51,0.5,0.51,0.0,0.0
1789-10-31,0.51,0.51,0.51,0.51,0.0,0.0
1789-11-30,0.51,0.51,0.5,0.5,0.0,-0.019608


In [16]:
# Calculate log returns (Monthly) for all data
log_returns = (1 + monthly_ohlcv['monthly_returns']).apply(np.log)

# Cumulative sum of log returns (correct for compounding)
cumsum_log = log_returns.cumsum()
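
Log returns are summed rather than simple returns because log returns add across periods while simple returns compound. A short check on made-up prices (the numbers are purely illustrative):

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0, 108.9])  # hypothetical prices
simple = prices.pct_change().dropna()
log_ret = np.log1p(simple)  # same as np.log(1 + simple)

# True total return over the whole span
total_simple = prices.iloc[-1] / prices.iloc[0] - 1

# Summing log returns and converting back recovers the compounded return
total_from_logs = np.expm1(log_ret.sum())

# Naively summing simple returns does not
naive_sum = simple.sum()

print(total_simple, total_from_logs, naive_sum)
```

One consequence is that a summed log return is close to, but not the same as, a percentage gain; `np.expm1` converts it back to a simple return.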

In [17]:
# # Filter cumulative log returns from 2020 onwards
# cumsum_log_2020 = cumsum_log[cumsum_log.index >= '2020-01-01']

# # Plot cumulative log returns from 2020
# plt.figure(figsize=(14, 7))
# plt.plot(cumsum_log_2020, label='Cumulative Sum (Log Returns) from 2020', color='blue')
# plt.title('Cumulative Log Returns (Monthly) - From 2020')
# plt.xlabel('Date')
# plt.ylabel('Cumulative Log Returns')
# plt.legend()
# plt.grid()
# plt.show()

In [18]:
# add year column from DatetimeIndex
monthly_ohlcv['year'] = monthly_ohlcv.index.year

# add month column from DatetimeIndex
monthly_ohlcv['month'] = monthly_ohlcv.index.month

# add day column from DatetimeIndex
monthly_ohlcv['day'] = monthly_ohlcv.index.day

monthly_ohlcv.head()

date,open,high,low,close,volume,monthly_returns,year,month,day
1789-07-31,0.5,0.5,0.5,0.5,0.0,-0.019608,1789,7,31
1789-08-31,0.5,0.51,0.5,0.51,0.0,0.02,1789,8,31
1789-09-30,0.51,0.51,0.5,0.51,0.0,0.0,1789,9,30
1789-10-31,0.51,0.51,0.51,0.51,0.0,0.0,1789,10,31
1789-11-30,0.51,0.51,0.5,0.5,0.0,-0.019608,1789,11,30


In [None]:
# Select month for the analysis
# M = 1 # January
# M = 2 # February
# M = 3 # March
# M = 4 # April
# M = 5 # May
# M = 6 # June
# M = 7 # July
# M = 8 # August
M = 9  # September
# M = 10 # October
# M = 11 # November
# M = 12 # December

In [20]:
df_ = monthly_ohlcv.copy()

# select only rows where month == M
df_ = df_[df_.month == M]

# count positive months (where monthly_returns > 0)
positive_months = len(df_[df_['monthly_returns'] > 0])
print(f"Number of positive months in month {M}: {positive_months}")

# count months (length of df)
months_count = len(df_)
print(f"Number of months in the dataset for month {M}: {months_count}")

# percentage of positive months
positive_percentage = (positive_months / months_count) * 100
print(f"Percentage of positive months in month {M}: {positive_percentage:.2f}%")

Number of positive months in month 11: 120
Number of months in the dataset for month 11: 236
Percentage of positive months in month 11: 50.85%


In [21]:
# def get_positive_months(monthly_ohlcv, M):
#     df_ = monthly_ohlcv.copy()
#     df_ = df_[df_.month == M]
#     positive_months = df_[df_['monthly_returns'] > 0]['year'].tolist()
#     return positive_months

# Return the years in which month M had a positive monthly return
def get_positive_years(monthly_ohlcv, M):
    df_ = monthly_ohlcv.copy()
    df_ = df_[df_.month == M]
    positive_years = df_[df_['monthly_returns'] > 0]['year'].tolist()
    return positive_years

# Example usage
positive_years = get_positive_years(monthly_ohlcv, M)
print(f"Years with positive returns in month {M}: {positive_years}")

print(f"Number of years with positive returns in month {M}: {len(positive_years)}")

Years with positive returns in month 11: [1794, 1799, 1802, 1803, 1804, 1808, 1809, 1813, 1814, 1816, 1817, 1818, 1820, 1823, 1829, 1832, 1834, 1837, 1843, 1850, 1852, 1856, 1858, 1859, 1861, 1862, 1863, 1865, 1867, 1868, 1872, 1874, 1875, 1877, 1878, 1879, 1880, 1885, 1886, 1887, 1893, 1896, 1898, 1899, 1900, 1901, 1903, 1904, 1905, 1906, 1907, 1908, 1911, 1912, 1921, 1923, 1924, 1926, 1927, 1928, 1933, 1934, 1935, 1936, 1944, 1945, 1949, 1952, 1953, 1954, 1955, 1957, 1958, 1959, 1960, 1961, 1962, 1966, 1967, 1968, 1970, 1972, 1975, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1985, 1986, 1989, 1990, 1992, 1995, 1996, 1997, 1998, 1999, 2001, 2002, 2003, 2004, 2005, 2006, 2009, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2022, 2023, 2024, 2025]
Number of years with positive returns in month 11: 120


In [22]:
# For every year in positive_years, sum the daily log returns over the 30, 60, or 90 calendar days after month-end
def get_cumsum_log_after_days(positive_years, df_daily, days=30, month=None):
    """
    Calculate the cumulative log return over the `days` calendar days
    following the end of the target month, for each year.
    
    Args:
        positive_years: List of years with positive returns in the target month
        df_daily: Daily dataframe with simple_returns column and a DatetimeIndex
        days: Number of calendar days to look ahead (30, 60, or 90)
        month: Target calendar month (1-12); defaults to the notebook-level M
    
    Returns:
        Dictionary with year as key and cumulative log return as value
    """
    if month is None:
        month = M
    cumsum_logs = {}
    
    for year in positive_years:
        try:
            # Find the last day of the target month in the given year
            month_end_date = pd.Timestamp(year=year, month=month, day=1) + pd.offsets.MonthEnd(0)
            
            # Find the date that is 'days' calendar days after the month end
            target_date = month_end_date + pd.Timedelta(days=days)
            
            # Select rows strictly after month end, up to and including the target date
            mask = (df_daily.index > month_end_date) & (df_daily.index <= target_date)
            period_data = df_daily.loc[mask]
            
            if not period_data.empty:
                # Convert simple returns to log returns for the period
                log_returns_period = np.log1p(period_data['simple_returns'])
                # Sum log returns (equivalent to the log of the cumulative product of (1+r))
                cumsum_logs[year] = log_returns_period.sum()
            else:
                # If no data available for that period, skip
                print(f"No data available for {year} after {days} days from month {month}")
                
        except Exception as e:
            print(f"Error processing year {year}: {e}")
            continue
    
    return cumsum_logs

# Use the original daily dataframe instead of monthly
cumsum_30_days = get_cumsum_log_after_days(positive_years, df, days=30)
cumsum_60_days = get_cumsum_log_after_days(positive_years, df, days=60)
cumsum_90_days = get_cumsum_log_after_days(positive_years, df, days=90)

print(f"Cumulative Sum (Log Returns) after 30 days for years with positive returns in month {M}:")
for year, ret in cumsum_30_days.items():
    print(f"{year}: {ret*100:.2f}%")

print(f"\nCumulative Sum (Log Returns) after 60 days for years with positive returns in month {M}:")
for year, ret in cumsum_60_days.items():
    print(f"{year}: {ret*100:.2f}%")

print(f"\nCumulative Sum (Log Returns) after 90 days for years with positive returns in month {M}:")
for year, ret in cumsum_90_days.items():
    print(f"{year}: {ret*100:.2f}%")

No data available for 2025 after 30 days from month 11
No data available for 2025 after 60 days from month 11
No data available for 2025 after 90 days from month 11
Cumulative Sum (Log Returns) after 30 days for years with positive returns in month 11:
1794: 0.00%
1799: 2.56%
1802: 0.00%
1803: 1.34%
1804: 1.82%
1808: 1.12%
1809: 0.00%
1813: 3.43%
1814: 0.68%
1816: 0.00%
1817: 1.18%
1818: 0.00%
1820: 1.04%
1823: 0.47%
1829: 0.96%
1832: 3.89%
1834: 1.66%
1837: -2.22%
1843: 0.57%
1850: 0.35%
1852: 1.12%
1856: 0.75%
1858: -1.14%
1859: -0.64%
1861: 1.88%
1862: 5.06%
1863: 0.88%
1865: 1.14%
1867: 1.28%
1868: 0.77%
1872: 0.38%
1874: 0.43%
1875: -0.67%
1877: 2.43%
1878: 1.64%
1879: 2.30%
1880: 1.03%
1885: -0.00%
1886: -2.92%
1887: -0.00%
1893: -10.25%
1896: -1.87%
1898: 5.28%
1899: -9.17%
1900: 5.93%
1901: -0.59%
1903: 5.91%
1904: -1.33%
1905: 2.95%
1906: -4.16%
1907: 2.46%
1908: 0.95%
1911: 0.00%
1912: -4.16%
1921: 0.75%
1923: 1.33%
1924: 3.75%
1926: 0.91%
1927: 0.57%
1928: -0.95%
1933: 2.20%

In [23]:
# calculate statistics for cumsum_30_days, cumsum_60_days, cumsum_90_days
def calculate_statistics(cumsum_dict):
    """
    Calculate statistics for cumulative log returns.
    
    Args:
        cumsum_dict: Dictionary with year as key and cumulative log return as value

    Returns:
        Dictionary with statistics (mean, median, std) for cumulative log returns
    """
    values = list(cumsum_dict.values())
    stats = {
        "mean": np.mean(values),
        "median": np.median(values),
        # np.std defaults to the population formula (ddof=0); pandas describe() uses ddof=1
        "std": np.std(values)
    }
    return stats

# Calculate statistics for each cumulative sum dictionary
stats_30_days = calculate_statistics(cumsum_30_days)
stats_60_days = calculate_statistics(cumsum_60_days)
stats_90_days = calculate_statistics(cumsum_90_days)

print(f"Statistics for Cumulative Sum (Log Returns) after 30 days for years with positive returns in month {M}:")
print(stats_30_days)

print(f"\nStatistics for Cumulative Sum (Log Returns) after 60 days for years with positive returns in month {M}:")
print(stats_60_days)

print(f"\nStatistics for Cumulative Sum (Log Returns) after 90 days for years with positive returns in month {M}:")
print(stats_90_days)

Statistics for Cumulative Sum (Log Returns) after 30 days for years with positive returns in month 11:
{'mean': np.float64(0.007142041801063959), 'median': np.float64(0.00984088983744599), 'std': np.float64(0.029591184163629006)}

Statistics for Cumulative Sum (Log Returns) after 60 days for years with positive returns in month 11:
{'mean': np.float64(0.016794905654669463), 'median': np.float64(0.019048194970694564), 'std': np.float64(0.04892464167926132)}

Statistics for Cumulative Sum (Log Returns) after 90 days for years with positive returns in month 11:
{'mean': np.float64(0.025235919236835665), 'median': np.float64(0.025975486403260736), 'std': np.float64(0.06380815289358056)}
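
The success rate and sample size are not part of the statistics above. A minimal sketch of how they might be added, assuming "success" means a strictly positive forward log return; the helper name `success_summary` and the example dictionary are hypothetical, and the standard error assumes at least two observations.

```python
import numpy as np


def success_summary(cumsum_dict):
    """Hypothetical helper: sample size, share of positive outcomes,
    mean, and standard error of the mean for one horizon."""
    values = np.array(list(cumsum_dict.values()), dtype=float)
    n = len(values)
    return {
        "n": n,
        "success_rate": float((values > 0).mean()),
        "mean": float(values.mean()),
        # Sample standard deviation (ddof=1) divided by sqrt(n)
        "std_err": float(values.std(ddof=1) / np.sqrt(n)),
    }


# Made-up forward log returns keyed by year, for illustration only
example = {2001: 0.02, 2002: -0.01, 2003: 0.03, 2004: 0.00}
summary = success_summary(example)
print(summary)
```

In the notebook this could be called as `success_summary(cumsum_30_days)` and likewise for the 60- and 90-day dictionaries. Reporting `n` and the standard error next to the mean makes it easier to judge whether an average forward return is distinguishable from zero.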


In [24]:
# Descriptive statistics via describe(), with the return rows shown in percent
def describe_statistics(cumsum_dict):
    """
    Describe statistics for cumulative log returns using pandas describe method.
    
    Args:
        cumsum_dict: Dictionary with year as key and cumulative log return as value

    Returns:
        DataFrame of descriptive statistics; every row except count is in percent
    """
    # Convert to DataFrame for easier description (named stats_df to avoid shadowing the daily df)
    stats_df = pd.DataFrame.from_dict(cumsum_dict, orient='index', columns=['cumsum_log_return'])
    description = stats_df.describe()
    # Convert to percentage, leaving the observation count unscaled
    description.loc[description.index != 'count'] *= 100
    # Print results
    print(description)
    return description

print(f"Descriptive Statistics for Cumulative Sum (Log Returns) after 30 days for years with positive returns in month {M}:")
describe_statistics(cumsum_30_days)

print(f"\nDescriptive Statistics for Cumulative Sum (Log Returns) after 60 days for years with positive returns in month {M}:")
describe_statistics(cumsum_60_days)

print(f"\nDescriptive Statistics for Cumulative Sum (Log Returns) after 90 days for years with positive returns in month {M}:")
describe_statistics(cumsum_90_days)

Descriptive Statistics for Cumulative Sum (Log Returns) after 30 days for years with positive returns in month 11:
       cumsum_log_return
count         119.000000
mean            0.714204
std             2.971631
min           -10.472187
25%            -0.080613
50%             0.984089
75%             2.367749
max             5.931658

Descriptive Statistics for Cumulative Sum (Log Returns) after 60 days for years with positive returns in month 11:
       cumsum_log_return
count         119.000000
mean            1.679491
std             4.913151
min           -10.706185
25%            -0.909752
50%             1.904819
75%             5.075636
max            13.902475

Descriptive Statistics for Cumulative Sum (Log Returns) after 90 days for years with positive returns in month 11:
       cumsum_log_return
count         119.000000
mean            2.523592
std             6.407796
min           -11.069504
25%            -0.995563
50%             2.597549
75%             6.443128
max            29.626582

,cumsum_log_return
count,119.0
mean,2.523592
std,6.407796
min,-11.069504
25%,-0.995563
50%,2.597549
75%,6.443128
max,29.626582
