# S&P 500 Monthly Seasonality Analysis: Forward-Looking Returns After Positive Months

## Executive Summary
This notebook analyzes S&P 500 forward performance following positive monthly returns. Using a historical series spanning 1789 to 2025, we examine whether a positive return in a specific month (currently set to September) has predictive value for the subsequent 30-, 60-, and 90-calendar-day periods.

## Research Methodology

### 📊 **Data Foundation**
- **Dataset**: S&P 500 price series (1789-2025), about 236 years of history; the earliest rows in the data preview are spaced one month apart, so the early record is not truly daily
- **Frequency**: Daily data resampled to monthly for seasonality analysis
- **Returns Calculation**: Simple returns using percentage change methodology
- **Forward Analysis**: Log returns for precise compounding over multi-day periods
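
The reason for using log returns in the forward analysis can be checked on a toy series: summing log returns recovers the compounded simple return exactly, while summing simple returns does not. The sketch below uses made-up prices (not values from the dataset) and is only meant to illustrate the identity.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, for illustration only
close = pd.Series([100.0, 102.0, 99.0, 103.0, 105.0])

simple = close.pct_change().dropna()   # simple returns, as produced by pct_change()
log_ret = np.log1p(simple)             # log returns: ln(1 + r)

compounded = (1 + simple).prod() - 1   # true total simple return over the period
from_logs = np.expm1(log_ret.sum())    # exp(sum of log returns) - 1
naive_sum = simple.sum()               # sum of simple returns (only approximate)

print(f"compounded: {compounded:.6f}")
print(f"from logs:  {from_logs:.6f}")
print(f"naive sum:  {naive_sum:.6f}")
```

The first two figures agree to floating-point precision; the third drifts away as the returns get larger or the window gets longer.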

### 🔍 **Analysis Framework**

#### **Step 1: Monthly Filtering**
- Filter historical data for the target month (M = 9 for September)
- Identify all years where the target month showed positive returns
- Calculate success rate and historical frequency of positive months

#### **Step 2: Forward-Looking Performance**
- For each year with positive target month performance:
  - Calculate cumulative log returns for the **30 days following** month-end
  - Calculate cumulative log returns for the **60 days following** month-end  
  - Calculate cumulative log returns for the **90 days following** month-end
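- Windows are measured in calendar days from month-end; only trading days that fall inside a window contribute, so weekends, holidays, and market closures need no special handling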
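
The windowing in Step 2 can be sketched on synthetic data. The series below is randomly generated on business days, and `forward_log_return` is a hypothetical helper name chosen for this illustration; the sketch assumes the window is the half-open interval (month-end, month-end + N calendar days].

```python
import numpy as np
import pandas as pd

# Synthetic business-day price series (illustrative, not the real dataset)
idx = pd.bdate_range("2020-09-01", "2021-01-29")
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(idx)))), index=idx)
log_ret = np.log(close).diff().dropna()

# MonthEnd(0) rolls a date forward to the last calendar day of its month
month_end = pd.Timestamp(year=2020, month=9, day=1) + pd.offsets.MonthEnd(0)

def forward_log_return(log_ret, start, days):
    """Sum of log returns for observations in (start, start + days calendar days]."""
    end = start + pd.Timedelta(days=days)
    window = log_ret[(log_ret.index > start) & (log_ret.index <= end)]
    return window.sum()

for days in (30, 60, 90):
    r = forward_log_return(log_ret, month_end, days)
    print(f"{days} days after {month_end.date()}: {r * 100:.2f}%")
```

Because the mask selects on the index, non-trading days simply never appear in the window, and the sum equals the log price change between month-end and the last trading day inside the window.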

#### **Step 3: Statistical Analysis**
- **Descriptive Statistics**: Mean, median, standard deviation, quartiles
- **Distribution Analysis**: Min/max values, skewness assessment
- **Performance Metrics**: Success rates and magnitude of forward returns
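
As a rough illustration of Step 3, the statistics listed above can all be obtained from a pandas Series of forward returns. The values below are invented for the example, and the names `forward`, `skewness`, and `hit_rate` are assumptions of this sketch rather than names used in the notebook.

```python
import pandas as pd

# Hypothetical forward log returns keyed by year (illustrative values only)
forward = pd.Series({2001: 0.021, 2002: -0.034, 2003: 0.012,
                     2004: 0.048, 2005: -0.005, 2006: 0.017})

summary = forward.describe()      # count, mean, std (ddof=1), min, quartiles, max
skewness = forward.skew()         # sample skewness of the distribution
hit_rate = (forward > 0).mean()   # share of years with a positive forward return

print(summary)
print(f"skewness: {skewness:.3f}")
print(f"hit rate: {hit_rate:.1%}")
```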

## Key Research Questions

1. **Predictive Power**: Do positive September returns predict positive forward performance?
2. **Time Horizon Effect**: How does predictive strength vary across 30/60/90-day windows?
3. **Risk-Reward Profile**: What are the typical gains vs. potential losses in forward periods?
4. **Historical Consistency**: How reliable is this pattern across different market regimes?

## Expected Analytical Outputs

### 📈 **Performance Metrics**
- Percentage of years with positive September returns
- Average forward returns for each time horizon (30/60/90 days)
- Success rates for positive forward performance
- Risk metrics: standard deviation and the range (min/max) of forward returns

### 📊 **Statistical Summary**
- Complete descriptive statistics (count, mean, std, min, quartiles, max)
- Distribution characteristics of forward returns
- Comparison across different time horizons
- Identification of outlier years and extreme events

## Financial Market Context

### 🎯 **Practical Applications**
- **Seasonal Trading Strategies**: Evidence-based calendar effects
- **Portfolio Timing**: Tactical asset allocation decisions  
- **Risk Management**: Understanding forward-looking volatility patterns
- **Market Efficiency**: Testing weak-form efficiency in equity markets

### ⚠️ **Important Considerations**
- **Sample Size**: The full history is long, but the conditional sample is modest: 109 positive Septembers out of 236, of which 108 have forward data
- **Market Regime Changes**: Results may vary across different economic periods
- **Transaction Costs**: Real-world implementation requires cost consideration
- **Overfitting Risk**: Historical patterns may not persist in future markets

---

## Methodology Notes

### 🔧 **Technical Implementation**
- **Date Handling**: Precise month-end calculations using pandas offsets
- **Missing Data**: Robust error handling for data gaps and market closures
- **Log Returns**: Mathematically correct compounding for multi-period analysis
- **Percentage Display**: User-friendly formatting for financial interpretation
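
The percentages printed by this notebook are cumulative log returns multiplied by 100, which is close to, but not the same as, a simple percent change. A minimal sketch of the conversion follows; `log_to_simple` is a hypothetical helper name, and the two inputs are the 30-day maximum and 90-day minimum from the descriptive statistics output, expressed as fractions.

```python
import numpy as np

def log_to_simple(log_return):
    """Convert a cumulative log return to the equivalent simple return."""
    return np.expm1(log_return)

# 30-day max and 90-day min as reported by describe(), converted from percent
for label, lr in [("30-day max", 0.10494289), ("90-day min", -0.15894944)]:
    simple = log_to_simple(lr)
    print(f"{label}: log {lr * 100:.2f}% -> simple {simple * 100:.2f}%")
```

The gap between the two measures grows with the size of the move, so the distinction matters most for the extreme years.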

### 📋 **Analysis Structure**
1. **Data Loading & Preparation**: Historical S&P 500 data processing
2. **Monthly Resampling**: Convert daily to month-end observations
3. **Positive Month Identification**: Filter for target month positive returns
4. **Forward Return Calculation**: Precise 30/60/90-day forward analysis
5. **Statistical Summary**: Comprehensive descriptive statistics
6. **Results Interpretation**: Financial and statistical significance assessment
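
One simple way to approach the significance assessment in step 6 is a one-sample t-statistic against a mean of zero. The sketch below is an assumption of this write-up, not something the notebook computes: it plugs in the 30-day mean and sample standard deviation from the `describe()` output (in percent) and takes n = 108, that is, 109 positive Septembers minus 2025, which has no forward data. It treats years as independent and does not compare against forward returns after non-positive Septembers, so it is only a rough first check.

```python
import math

def t_statistic(mean, std, n):
    """One-sample t-statistic for H0: true mean = 0, using the sample std (ddof=1)."""
    return mean / (std / math.sqrt(n))

# 30-day figures as printed by describe(), in percent; n = 108 is an assumption
# (109 positive Septembers minus 2025, which has no forward data yet)
t_30 = t_statistic(mean=1.262596, std=3.201964, n=108)
print(f"30-day t-statistic: {t_30:.2f}")
```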

---

*This analysis provides quantitative evidence for seasonal patterns in equity markets, contributing to the literature on calendar effects and market anomalies. Results should be interpreted within the context of overall portfolio strategy and risk management framework.*

In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

# Build the file path with os.path.join so separators are correct on any OS
import os

path = os.path.abspath(os.path.join("..", "04_S&P500_quant_analysis", "01_data", "S&P500_D_1789-05-01_2025-09-17.csv"))

# Alternative: pathlib (modern approach)
# from pathlib import Path
# path = Path("..") / "04_S&P500_quant_analysis" / "01_data" / "S&P500_D_1789-05-01_2025-09-17.csv"

# print(f"Path exists: {os.path.exists(path)}")

# Read the CSV file
df = pd.read_csv(path)

# lower case column names for easier access
df.columns = [col.lower() for col in df.columns]

# Set the 'date' column as the index and convert it to datetime
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Show the first few rows of the dataframe
df.head()

date,open,high,low,close,volume
1789-05-01,0.51,0.51,0.51,0.51,0.0
1789-06-01,0.51,0.51,0.51,0.51,0.0
1789-07-01,0.5,0.5,0.5,0.5,0.0
1789-08-01,0.5,0.51,0.5,0.51,0.0
1789-09-01,0.51,0.51,0.5,0.51,0.0


In [53]:
# Calculate simple returns with pct_change()
df['simple_returns'] = df['close'].pct_change()

# Drop NA values that result from pct_change()
df.dropna(inplace=True)

# Show dataframe
df.head()

date,open,high,low,close,volume,simple_returns
1789-06-01,0.51,0.51,0.51,0.51,0.0,0.0
1789-07-01,0.5,0.5,0.5,0.5,0.0,-0.019608
1789-08-01,0.5,0.51,0.5,0.51,0.0,0.02
1789-09-01,0.51,0.51,0.5,0.51,0.0,0.0
1789-10-01,0.51,0.51,0.51,0.51,0.0,0.0


In [54]:
# Resample to monthly bars: first open, highest high, lowest low, last close, total volume
monthly_ohlcv = df.resample('M').agg({
    'open': 'first',
    'high': 'max',
    'low': 'min',
    'close': 'last',
    'volume': 'sum'
})

# simple_returns for monthly data
monthly_ohlcv['monthly_returns'] = monthly_ohlcv['close'].pct_change()

# Drop NA values that result from pct_change()
monthly_ohlcv.dropna(inplace=True)

# Show dataframe
monthly_ohlcv.head()

date,open,high,low,close,volume,monthly_returns
1789-07-31,0.5,0.5,0.5,0.5,0.0,-0.019608
1789-08-31,0.5,0.51,0.5,0.51,0.0,0.02
1789-09-30,0.51,0.51,0.5,0.51,0.0,0.0
1789-10-31,0.51,0.51,0.51,0.51,0.0,0.0
1789-11-30,0.51,0.51,0.5,0.5,0.0,-0.019608


In [55]:
# Calculate log returns (Monthly) for all data: ln(1 + r)
log_returns = np.log1p(monthly_ohlcv['monthly_returns'])

# Cumulative sum of log returns (correct for compounding)
cumsum_log = log_returns.cumsum()

In [56]:
# # Filter cumulative log returns from 2020 onwards
# cumsum_log_2020 = cumsum_log[cumsum_log.index >= '2020-01-01']

# # Plot cumulative log returns from 2020
# plt.figure(figsize=(14, 7))
# plt.plot(cumsum_log_2020, label='Cumulative Sum (Log Returns) from 2020', color='blue')
# plt.title('Cumulative Log Returns (Monthly) - From 2020')
# plt.xlabel('Date')
# plt.ylabel('Cumulative Log Returns')
# plt.legend()
# plt.grid()
# plt.show()

In [57]:
# add year column from DatetimeIndex
monthly_ohlcv['year'] = monthly_ohlcv.index.year

# add month column from DatetimeIndex
monthly_ohlcv['month'] = monthly_ohlcv.index.month

# add day column from DatetimeIndex
monthly_ohlcv['day'] = monthly_ohlcv.index.day

monthly_ohlcv.head()

date,open,high,low,close,volume,monthly_returns,year,month,day
1789-07-31,0.5,0.5,0.5,0.5,0.0,-0.019608,1789,7,31
1789-08-31,0.5,0.51,0.5,0.51,0.0,0.02,1789,8,31
1789-09-30,0.51,0.51,0.5,0.51,0.0,0.0,1789,9,30
1789-10-31,0.51,0.51,0.51,0.51,0.0,0.0,1789,10,31
1789-11-30,0.51,0.51,0.5,0.5,0.0,-0.019608,1789,11,30


In [58]:
# Select month for the analysis
# M = 1 # January
# M = 2 # February
# M = 3 # March
# M = 4 # April
# M = 5 # May
# M = 6 # June
# M = 7 # July
# M = 8 # August
M = 9  # September
# M = 10 # October
# M = 11 # November
# M = 12 # December

In [59]:
df_ = monthly_ohlcv.copy()

# select only rows where month == M
df_ = df_[df_.month == M]

# count positive months (where monthly_returns > 0)
positive_months = len(df_[df_['monthly_returns'] > 0])
print(f"Number of positive months in month {M}: {positive_months}")

# count months (length of df)
months_count = len(df_)
print(f"Number of months in the dataset for month {M}: {months_count}")

# percentage of positive months
positive_percentage = (positive_months / months_count) * 100
print(f"Percentage of positive months in month {M}: {positive_percentage:.2f}%")

Number of positive months in month 9: 109
Number of months in the dataset for month 9: 236
Percentage of positive months in month 9: 46.19%


In [60]:
def get_positive_years(monthly_ohlcv, M):
    """Return the years in which month M had a positive monthly return."""
    df_ = monthly_ohlcv[monthly_ohlcv.month == M]
    positive_years = df_[df_['monthly_returns'] > 0]['year'].tolist()
    return positive_years

# Example usage
positive_years = get_positive_years(monthly_ohlcv, M)
print(f"Years with positive returns in month {M}: {positive_years}")

print(f"Number of years with positive returns in month {M}: {len(positive_years)}")

Years with positive returns in month 9: [1790, 1791, 1794, 1798, 1800, 1802, 1804, 1807, 1808, 1809, 1814, 1815, 1817, 1820, 1822, 1823, 1827, 1829, 1830, 1832, 1834, 1838, 1840, 1842, 1843, 1845, 1850, 1852, 1856, 1860, 1861, 1862, 1863, 1865, 1866, 1867, 1868, 1874, 1877, 1878, 1879, 1880, 1882, 1885, 1886, 1887, 1888, 1889, 1891, 1893, 1896, 1904, 1905, 1906, 1909, 1910, 1912, 1915, 1916, 1918, 1919, 1920, 1921, 1925, 1927, 1928, 1935, 1936, 1938, 1939, 1940, 1942, 1943, 1945, 1949, 1950, 1953, 1954, 1955, 1958, 1964, 1965, 1967, 1968, 1970, 1973, 1976, 1980, 1982, 1983, 1988, 1992, 1995, 1996, 1997, 1998, 2004, 2005, 2006, 2007, 2009, 2010, 2012, 2013, 2017, 2018, 2019, 2024, 2025]
Number of years with positive returns in month 9: 109


In [61]:
# For every year in positive_years, sum the log returns over the 30, 60, or 90 calendar days after month-end
def get_cumsum_log_after_days(positive_years, df_daily, month, days=30):
    """
    Calculate cumulative log returns over the specified number of calendar days
    following the end of the target month, for each year.

    Args:
        positive_years: List of years with positive returns in the target month
        df_daily: Daily dataframe with a simple_returns column and a DatetimeIndex
        month: Target month number (1-12); passed explicitly instead of read from the global M
        days: Number of calendar days to look ahead (30, 60, or 90)

    Returns:
        Dictionary with year as key and cumulative log return as value
    """
    cumsum_logs = {}

    for year in positive_years:
        try:
            # Last calendar day of the target month in the given year
            month_end_date = pd.Timestamp(year=year, month=month, day=1) + pd.offsets.MonthEnd(0)

            # End of the look-ahead window, 'days' calendar days after month end
            target_date = month_end_date + pd.Timedelta(days=days)

            # Observations strictly after month end, up to and including the target date
            mask = (df_daily.index > month_end_date) & (df_daily.index <= target_date)
            period_data = df_daily.loc[mask]

            if not period_data.empty:
                # Log returns for the period: ln(1 + r)
                log_returns_period = np.log1p(period_data['simple_returns'])
                # Sum of log returns (equivalent to the log of the cumulative product of (1+r))
                cumsum_logs[year] = log_returns_period.sum()
            else:
                # No data in the window (e.g. the window lies beyond the end of the dataset): skip
                print(f"No data available for {year} after {days} days from month {month}")

        except Exception as e:
            print(f"Error processing year {year}: {e}")
            continue

    return cumsum_logs

# Use the original daily dataframe instead of monthly
cumsum_30_days = get_cumsum_log_after_days(positive_years, df, M, days=30)
cumsum_60_days = get_cumsum_log_after_days(positive_years, df, M, days=60)
cumsum_90_days = get_cumsum_log_after_days(positive_years, df, M, days=90)

print(f"Cumulative Sum (Log Returns) after 30 days for years with positive returns in month {M}:")
for year, ret in cumsum_30_days.items():
    print(f"{year}: {ret*100:.2f}%")

print(f"\nCumulative Sum (Log Returns) after 60 days for years with positive returns in month {M}:")
for year, ret in cumsum_60_days.items():
    print(f"{year}: {ret*100:.2f}%")

print(f"\nCumulative Sum (Log Returns) after 90 days for years with positive returns in month {M}:")
for year, ret in cumsum_90_days.items():
    print(f"{year}: {ret*100:.2f}%")

No data available for 2025 after 30 days from month 9
No data available for 2025 after 60 days from month 9
No data available for 2025 after 90 days from month 9
Cumulative Sum (Log Returns) after 30 days for years with positive returns in month 9:
1790: 1.80%
1791: 0.00%
1794: 0.00%
1798: 1.48%
1800: 1.00%
1802: 1.57%
1804: 1.87%
1807: -4.26%
1808: 3.55%
1809: 1.72%
1814: 1.39%
1815: 0.64%
1817: 1.83%
1820: 1.59%
1822: -0.50%
1823: 0.48%
1827: 0.48%
1829: 1.50%
1830: 0.00%
1832: 2.49%
1834: 2.10%
1838: -0.90%
1840: 0.58%
1842: 0.00%
1843: 8.54%
1845: 0.90%
1850: 1.43%
1852: 2.03%
1856: 1.15%
1860: -1.40%
1861: 1.96%
1862: 0.35%
1863: 0.68%
1865: 2.35%
1866: 1.52%
1867: 2.19%
1868: -0.78%
1874: 0.21%
1877: 1.27%
1878: 0.28%
1879: 1.08%
1880: 3.46%
1882: 0.80%
1885: 10.49%
1886: 2.24%
1887: -4.63%
1888: -2.53%
1889: -2.04%
1891: -0.35%
1893: 7.91%
1896: 4.63%
1904: 7.11%
1905: -0.20%
1906: -2.16%
1909: -1.50%
1910: 3.25%
1912: -3.53%
1915: 7.45%
1916: 0.63%
1918: 0.79%
1919: 2.78%
1920:

In [63]:
# calculate statistics for cumsum_30_days, cumsum_60_days, cumsum_90_days
def calculate_statistics(cumsum_dict):
    """
    Calculate statistics for cumulative log returns.

    Args:
        cumsum_dict: Dictionary with year as key and cumulative log return as value

    Returns:
        Dictionary with statistics (mean, median, std) for cumulative log returns.
        std is the population standard deviation (np.std default, ddof=0), so it is
        slightly smaller than the sample std reported by pandas describe().
    """
    values = list(cumsum_dict.values())
    stats = {
        "mean": np.mean(values),
        "median": np.median(values),
        "std": np.std(values)  # population std (ddof=0)
    }
    return stats

# Calculate statistics for each cumulative sum dictionary
stats_30_days = calculate_statistics(cumsum_30_days)
stats_60_days = calculate_statistics(cumsum_60_days)
stats_90_days = calculate_statistics(cumsum_90_days)

print(f"Statistics for Cumulative Sum (Log Returns) after 30 days for years with positive returns in month {M}:")
print(stats_30_days)

print(f"\nStatistics for Cumulative Sum (Log Returns) after 60 days for years with positive returns in month {M}:")
print(stats_60_days)

print(f"\nStatistics for Cumulative Sum (Log Returns) after 90 days for years with positive returns in month {M}:")
print(stats_90_days)

Statistics for Cumulative Sum (Log Returns) after 30 days for years with positive returns in month 9:
{'mean': np.float64(0.012625961237475789), 'median': np.float64(0.010722260638862804), 'std': np.float64(0.031871057470532226)}

Statistics for Cumulative Sum (Log Returns) after 60 days for years with positive returns in month 9:
{'mean': np.float64(0.024527730612824112), 'median': np.float64(0.024197652472942584), 'std': np.float64(0.04738880581432337)}

Statistics for Cumulative Sum (Log Returns) after 90 days for years with positive returns in month 9:
{'mean': np.float64(0.02999298204963317), 'median': np.float64(0.031417057777201524), 'std': np.float64(0.057561614447262044)}
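
The `std` values above come from `np.std`, which defaults to the population formula (ddof=0), whereas pandas `describe()` reports the sample standard deviation (ddof=1). The sketch below checks the relationship on invented values and then applies the same factor to the 30-day figure, assuming n = 108 years (109 positive Septembers minus 2025); the result should land close to the roughly 3.20% that `describe()` prints for the 30-day window.

```python
import numpy as np

values = np.array([0.021, -0.034, 0.012, 0.048, -0.005, 0.017])  # illustrative only
n = len(values)

pop_std = np.std(values)             # ddof=0, the np.std default
sample_std = np.std(values, ddof=1)  # ddof=1, the convention used by describe()

# The two differ by a factor of sqrt(n / (n - 1))
assert np.isclose(sample_std, pop_std * np.sqrt(n / (n - 1)))

# Same correction applied to the reported 30-day population std, with n = 108 (assumed)
reported_pop_std = 0.031871057470532226
rescaled_pct = reported_pop_std * np.sqrt(108 / 107) * 100
print(f"{rescaled_pct:.6f}%")
```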


In [64]:
# statistics with describe method, printed in percent (count is left unscaled)
def describe_statistics(cumsum_dict):
    """
    Describe statistics for cumulative log returns using pandas describe method.

    Args:
        cumsum_dict: Dictionary with year as key and cumulative log return as value

    Returns:
        DataFrame of descriptive statistics, with every row except count in percent
    """
    # Convert to DataFrame for easier description
    stats_df = pd.DataFrame.from_dict(cumsum_dict, orient='index', columns=['cumsum_log_return'])
    description = stats_df.describe()
    # Convert to percentage, leaving count (a number of years, not a return) unscaled
    pct_rows = description.index.drop('count')
    description.loc[pct_rows] *= 100
    # Print results
    print(description)
    return description

print(f"Descriptive Statistics for Cumulative Sum (Log Returns) after 30 days for years with positive returns in month {M}:")
describe_statistics(cumsum_30_days)

print(f"\nDescriptive Statistics for Cumulative Sum (Log Returns) after 60 days for years with positive returns in month {M}:")
describe_statistics(cumsum_60_days)

print(f"\nDescriptive Statistics for Cumulative Sum (Log Returns) after 90 days for years with positive returns in month {M}:")
describe_statistics(cumsum_90_days)

Descriptive Statistics for Cumulative Sum (Log Returns) after 30 days for years with positive returns in month 9:
       cumsum_log_return
count         108.000000
mean            1.262596
std             3.201964
min            -8.272219
25%            -0.198040
50%             1.072226
75%             2.357732
max            10.494289

Descriptive Statistics for Cumulative Sum (Log Returns) after 60 days for years with positive returns in month 9:
       cumsum_log_return
count       1.080000e+02
mean        2.452773e+00
std         4.760973e+00
min        -1.082030e+01
25%        -5.421011e-16
50%         2.419765e+00
75%         5.166791e+00
max         1.590424e+01

Descriptive Statistics for Cumulative Sum (Log Returns) after 90 days for years with positive returns in month 9:
       cumsum_log_return
count         108.000000
mean            2.999298
std             5.782997
min           -15.894944
25%            -0.352452
50%             3.141706
75%             6.591007
max            19.970304

Unnamed: 0,cumsum_log_return
count,108.0
mean,2.999298
std,5.782997
min,-15.894944
25%,-0.352452
50%,3.141706
75%,6.591007
max,19.970304
