## Introduction

Motivated by the observation of seasonal trends for the S&P500, we now see if we can make money off of trading individual stocks in the S&P500. This is a data-mining exercise, and (spoiler alert) we will find that historical seasonal trends predictors are not the best at picking out individual stocks.

The methodology is as follows: For a given trade day, using data from the past X years, identify the top N trades in the S&P500, based off of some criteria such as Sharpe ratio.

In [3]:
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import datetime, timedelta
import numpy as np
from itertools import product

# Pre-process data with forward filled fields, in this case using the SP500
# Read in the SP500 tickers from S&P500-Symbols.csv, which was pulled from
# Wikipedia and saved locally.

# Run the below code to pull symbols
# import ssl
# ssl._create_default_https_context = ssl._create_unverified_context
# table=pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
# df = table[0]
# df.to_csv("S&P500-Symbols.csv", columns=['Symbol'], index=False)

# Pull the adjusted close prices off Yahoo Finance
df = pd.read_csv("S&P500-Symbols.csv")
tickers = list(df['Symbol'])
start_date = '1989-01-01'
end_date = '2024-01-03'  # Get data a few days past end of year to backfill

# Either pull from Yahoo finance, or for read the pre-downloaded CSV
# data = pd.DataFrame(yf.download(tickers, start_date, end_date)['Adj Close'])
# data.reset_index().to_csv("S&P500-adjusted-close.csv", index=False)
data = pd.read_csv('S&P500-adjusted-close.csv')
data['Date']= pd.to_datetime(data['Date'])
data = data.set_index('Date')
all_dates = pd.date_range(start_date, end_date)
data['Price Date'] = data.index

# Backfill with trading prices for missing dates
data = data.reindex(all_dates, method='bfill')
sp500_dates_added = pd.read_csv("S&P500-Info.csv")[['Symbol','Date added']]

all_stocks = data.columns.drop(labels='Price Date')

# Only keep through end of 2023
data = data[data.index < '2024-01-01']

## Define seasonal return, performance metric functions

In [4]:
# Define a function that outputs historical performance between two dates
# Note that this specific function does not work for start and end dates that
# cross a year, e.g. a Dec-Jan seasonal trend

def seasonal_return(data, symbol, start_date, end_date, first_year, last_year):
    data_list = []
    # Deal with Feb 29: assign start/end dates to Mar 1
    if start_date == '02-29': start_date = '03-01'
    if end_date == '02-29': end_date = '03-01'
    for year in range(first_year, (last_year+1)):
        full_start_date = str(year)+'-'+start_date
        full_end_date = str(year)+'-'+end_date
        trade_start_date = data.loc[full_start_date,'Price Date']
        trade_end_date = data.loc[full_end_date,'Price Date']
        start_price = data.loc[full_start_date,symbol]

        # If price data is missing, skip that year
        if np.isnan(start_price):
            continue
        end_price = data.loc[full_end_date,symbol]
        if np.isnan(end_price):
            continue
        returns = (end_price/start_price)-1
        data_list.append([symbol, year, trade_start_date, start_price, 
            trade_end_date,
            end_price, returns])

    df = pd.DataFrame(data_list, columns=['Symbol','Year','Init Date',
        'Init Price','Final Date','Final Price','Return'])
    return df

# An example of finding the best trade given a certain start/end date.
# In this case, Feb 11-16, using data from 2014-2023 (last 10 years)

stock_returns_list = []
start_calendar = '02-11'
end_calendar = '02-16'
start_year = 2014
end_year = 2023
for stock in all_stocks:
    stock_returns_list.append(seasonal_return(data, stock, start_calendar, 
    end_calendar, start_year, end_year))

seasonal_returns = pd.concat(stock_returns_list)

def sortino(df, strat_name, risk_free, threshold):
    excess_return = df[strat_name]-df[risk_free]
    downside = excess_return[(excess_return<df[threshold])]
    denom = np.sqrt(sum(downside*downside)/len(downside))
    return excess_return.mean()/denom

# Generate return stats
def return_stats(x, risk_free_rate = 0):
    d = {}
    d['N'] = x['Symbol'].count()
    d['avg r'] = x['Return'].mean()
    d['vol'] = x['Return'].std()
    downsides = x[x['Return'] < risk_free_rate]['Return']
    d['downside dev'] = 0 if downsides.count()==0 else downsides.std()
    upsides = x[-x['Return'] < risk_free_rate]['Return']
    d['upside dev'] = 0 if upsides.count()==0 else upsides.std()
    d['up'] = sum(x['Return']>risk_free_rate)
    return pd.Series(d, index = ['N','avg r','vol','downside dev','upside dev',
        'up'])

# N is number of observations, avg r is average return for going LONG,
# vol is std dev of returns, downside/upside dev are corresponding deviations
# used to calculate Sortino Ratio, and up is number of observations that are
# above the risk free rate
symbol_stats = seasonal_returns.groupby('Symbol').apply(return_stats, risk_free_rate = 0)

symbol_stats['Sharpe Long'] = symbol_stats['avg r']/symbol_stats['vol']
symbol_stats['Sharpe Short'] = -symbol_stats['avg r']/symbol_stats['vol']
symbol_stats['Sortino Long'] = symbol_stats['avg r']/symbol_stats['downside dev']
symbol_stats['Sortino Short'] = -symbol_stats['avg r']/symbol_stats['upside dev']
symbol_stats['Winrate Long'] = symbol_stats['up']/symbol_stats['N']


# Identify the five most profitable seasonal trades for long/short within the
# window. Restrict to samples that we actually have 10 years of data for
sub_stats = symbol_stats[symbol_stats.N==10]
best_long_trades = sub_stats.sort_values(by='avg r',ascending= False).iloc[0:5]
best_short_trades = sub_stats.sort_values(by='avg r',ascending= True).iloc[0:5]

## Identify seasonal strategies throughout the year

Now that we have a way to identify the top 5 seasonal trades by Sharpe ratio for any given date and hold duration, we apply this function to a series of dates throughout the year. In this case, we consider trades on the dates of 1, 6, 11, 15, 20, and 25 of each month, for durations of 7, 14, and 28 days, for long/short. We use data from 2013-2022 to generate the seasonal pattern, to propose trades in 2023. We constrain to stocks for which we have 10 years of data (i.e. they were in the S&P500 throughout 2013-2023, and have a winrate of at least 6/10 (6/10 gains for long seasonal strategies, 6/10 losses for short seasonal strategies) and annualized return of at least 40%.

In [None]:
# NOTE: Running this cell will take ~20 minutes. Use the saved CSV file in the next cell instead

# First, restrict to stocks that actually have prices back in 2013 (not NA on
# Jan 1, 2013, even when backfilling). This drops to 461 stocks
sub_cols = data.columns[data.loc['2013-01-01'].notna()]
sub_stocks = data[sub_cols].columns.drop(labels='Price Date')

sub_data = data[sub_cols][data.index>='2013-01-01']

hold_range = [7, 14, 28] # hold for a fixed number of weeks, up to a month
delay_range = [0, 5, 10]
start_months = list(range(1,12+1))
start_days = ['-01','-15']
initial_dates = [str(i)+j for i, j in product(start_months,start_days)]
start_year = 2013
end_year = 2022

all_returns_list = []

# Look up to two weeks forward
for initial_date in initial_dates:

    initial_calendar_2023 = datetime.strptime("2023-"+initial_date, "%Y-%m-%d")

    # Delay refers to how many days after the 1st or 15th we start the trade
    # This technically means we will never start a position on the 29-31st
    for delay in delay_range:
        start_calendar_2023 = initial_calendar_2023+timedelta(days=delay)
        start_calendar = start_calendar_2023.strftime('%m-%d')

        for hold_length in hold_range:
            end_calendar = (start_calendar_2023+timedelta(days=hold_length)
                ).strftime('%m-%d')
            stock_returns_list = []

            for stock in sub_stocks:
                stock_returns_list.append(seasonal_return(sub_data, stock,
                    start_calendar, end_calendar, start_year, end_year))

            seasonal_returns = pd.concat(stock_returns_list)
            symbol_stats = seasonal_returns.groupby('Symbol').apply(
                return_stats, risk_free_rate = 0)
            symbol_stats['trade window']=initial_date
            symbol_stats['start date'] = start_calendar_2023
            symbol_stats['end date'] = end_calendar
            symbol_stats['Sharpe Long'] = symbol_stats['avg r']/symbol_stats['vol']
            symbol_stats['Sharpe Short'] = -symbol_stats['avg r']/symbol_stats['vol']
            symbol_stats['Sortino Long'] = symbol_stats['avg r']/symbol_stats['downside dev']
            symbol_stats['Sortino Short'] = -symbol_stats['avg r']/symbol_stats['upside dev']

            symbol_stats['hold length'] = hold_length
            all_returns_list.append(symbol_stats)

all_returns = pd.concat(all_returns_list)

# Approximately annualize returns
all_returns['annualized r'] = (all_returns['avg r'] * 365/
    all_returns['hold length'])
# all_returns.to_csv('seasonal_trades.csv')

In [25]:
# Read in the CSV: running the code takes about 20 minutes
all_returns = pd.read_csv('seasonal_trades.csv')
# Identify which long/short positions we would take. Take the top 5 long/short for each trade window
long_positions = all_returns[(all_returns['annualized r']>0.4) & (all_returns.up>=6)].sort_values(
    'Sharpe Long',ascending=False).groupby('trade window').head(5)
short_positions = all_returns[(all_returns['annualized r']< (-0.4)) & (all_returns.up<=4)].sort_values(
    'Sharpe Short',ascending=False).groupby('trade window').head(5)

## Seasonal Trade Results

Now let's evaluate the results of making these seasonal trades in 2023.

In [26]:
# Note that within the same trade window, we may end up trading the same stock
# multiple times due to the delay, e.g. we short COF from Jan 11-25 but also
# from Jan 6 to Feb 23. This can happen if a stock endures sustained movements
# in one direction, and we would probably avoid doubling down in reality. But for
# now, let's just look at the results of our trades in 2023

long_trades = long_positions[['Symbol','start date','end date']]
long_trades['buy date']=long_trades['start date']
long_trades['sell date']="2023-"+long_trades['end date']

# Results of long trades (invert order of buy/sell dates if shorting)
def long_trade_result(data, symbol, buy_date, sell_date):
    [buy_price, actual_buy_date] = data.loc[buy_date][[symbol,'Price Date']]
    [sell_price, actual_sell_date] = data.loc[sell_date][[symbol,'Price Date']]
    trade_return =  (sell_price/buy_price)-1
    return pd.Series({'Symbol':symbol,'Buy Signal Date':buy_date,
        'Actual Buy Date':actual_buy_date,'Buy Price':buy_price,
        'Sell Signal Date':sell_date,'Actual Sell Date':actual_sell_date,
        'Actual Sell Price':sell_price,'Return':trade_return},
        index = ['Symbol','Buy Signal Date','Actual Buy Date','Buy Price',
        'Sell Signal Date','Actual Sell Date','Actual Sell Price','Return'])

long_trade_results_list = []

for row in range(long_trades.shape[0]):
    long_trade_results_list.append(
        long_trade_result(data, long_trades.iloc[row].Symbol,
            long_trades.iloc[row]['buy date'],
            long_trades.iloc[row]['sell date']))

long_trade_results = pd.DataFrame(long_trade_results_list)

short_trades = short_positions[['Symbol','start date','end date']]
short_trades['buy date']="2023-"+short_trades['end date']
short_trades['sell date']=short_trades['start date']

short_trade_results_list = []

for row in range(short_trades.shape[0]):
    short_trade_results_list.append(
        long_trade_result(data, short_trades.iloc[row].Symbol,
            short_trades.iloc[row]['buy date'],
            short_trades.iloc[row]['sell date']))

short_trade_results = pd.DataFrame(short_trade_results_list)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  long_trades['buy date']=long_trades['start date']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  long_trades['sell date']="2023-"+long_trades['end date']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  short_trades['buy date']="2023-"+short_trades['end date']
A value is trying to be set on a copy of

In [30]:
short_trade_results

Unnamed: 0,Symbol,Buy Signal Date,Actual Buy Date,Buy Price,Sell Signal Date,Actual Sell Date,Actual Sell Price,Return
0,UNH,2023-01-03,2023-01-03,510.947601,2023-12-06,2023-12-06,549.530029,0.075512
1,UNH,2023-01-12,2023-01-12,488.318268,2023-12-15,2023-12-15,531.119995,0.087651
2,UNH,2023-01-08,2023-01-09,482.791473,2023-12-11,2023-12-11,543.679993,0.126118
3,CTAS,2023-01-03,2023-01-03,444.399078,2023-12-06,2023-12-06,550.929993,0.239719
4,CTAS,2023-01-08,2023-01-09,436.797607,2023-12-11,2023-12-11,559.599976,0.281142
...,...,...,...,...,...,...,...,...
115,HRL,2023-04-13,2023-04-13,38.916107,2023-04-06,2023-04-06,39.071190,0.003985
116,PEAK,2023-03-01,2023-03-01,22.582718,2023-02-01,2023-02-01,25.905079,0.147120
117,IT,2023-02-08,2023-02-08,346.480011,2023-02-01,2023-02-01,347.279999,0.002309
118,K,2023-02-08,2023-02-08,60.614395,2023-02-01,2023-02-01,61.570515,0.015774


In [32]:
# Part 5

# Long position results

## Average returns: 3.14% returns
print("Average long return: %5.4f" % long_trade_results['Return'].mean())

# Sharpe (assuming risk-free return is just zero): 0.52
print("Sharpe ratio (long): %5.4f" % 
      (long_trade_results['Return'].mean()/long_trade_results['Return'].std()))

# Sortino Ratio: 1.35
print("Sortino ratio (long): %5.4f" % 
      (long_trade_results['Return'].mean()/long_trade_results[long_trade_results['Return']<0]['Return'].std()))

# Short position results

# Average returns: 1.14% returns
print("Average short return: %5.4f" % short_trade_results['Return'].mean())

# Sharpe (assuming risk-free return is just zero): 0.15
print("Sharpe ratio (short): %5.4f" % 
      (short_trade_results['Return'].mean()/short_trade_results['Return'].std()))

# Sortino Ratio: 0.23
print("Sortino ratio (short): %5.4f" %
     (short_trade_results['Return'].mean()/short_trade_results[long_trade_results['Return']<0]['Return'].std()))

# Compare to what we may have expected if 2023 played out like the historicals used to select:
print('Historical long return: %5.4f' % long_positions['avg r'].mean())
print('Historical long Sharpe ratio: %5.4f' % long_positions['Sharpe Long'].quantile(q=0.01))
print('Historical long Sortino ratio: %5.4f'% long_positions['Sortino Long'].quantile(q=0.01))

print('Historical short return: %5.4f' % -short_positions['avg r'].mean()) # Avg return of 4.72%
print('Historical short Sharpe ratio: %5.4f' % short_positions['Sharpe Short'].quantile(q=0.01)) # 5%ile of Sharpe ratios is 0.49
print('Historical short Sortino ratio: %5.4f' % short_positions['Sortino Short'].quantile(q=0.01)) # 1%ile of Sortino ratios is 0.86


Average long return: 0.0314
Sharpe ratio (long): 0.5196
Sortino ratio (long): 1.3476
Average short return: 0.0133
Sharpe ratio (short): 0.1721
Sortino ratio (short): 0.2514
Historical long return: 0.0402
Historical long Sharpe ratio: 0.9358
Historical long Sortino ratio: 2.4721
Historical short return: 0.0476
Historical short Sharpe ratio: 0.4983
Historical short Sortino ratio: 0.8555


In [18]:
# Clearly our 2023 results dramatically underperform what we may have expected

# Now what would have happened if we had just bought and held the S&P500 during
# the same period? Since we know that in retrospect, 2023 was a great year for
# the S&P500, let's compute the returns/Sharpe ratio if we had taken long S&P500
# trades (regardless of long/short of the stock) at the same times,
# but over the last 10 years (2014-2023)
spx = pd.DataFrame(yf.download('^SPX', '2023-01-01', '2024-01-03')['Adj Close'])
all_dates = pd.date_range('2023-01-01', '2024-01-03')
spx['Price Date'] = spx.index
spx = spx.reindex(all_dates, method='bfill')
spx = spx[spx.index < '2024-01-01']

sp500_returns_long_list = []
# Long trade dates
for idx in range(long_trades.shape[0]):
    buy_date = long_trades.iloc[idx]['buy date']
    sell_date = long_trades.iloc[idx]['sell date']
    buy_price = spx.loc[buy_date]['Adj Close']
    sell_price = spx.loc[sell_date]['Adj Close']
    sp500_return = sell_price/buy_price-1
    sp500_returns_long_list.append([idx,buy_date,sell_date,sp500_return])

sp500_returns = pd.DataFrame(sp500_returns_long_list,columns=['Index','Buy Date','Sell Date','SP500 Return'])

# Avg SP500 returns during the same long positions
sp500_returns['SP500 Return'].mean() # 1.32% avg return

# Sharpe (assuming risk-free return is just zero): 0.48
sp500_returns['SP500 Return'].mean()/sp500_returns['Return'].std()

# Sortino Ratio: 1.10
sp500_returns['SP500 Return'].mean()/sp500_returns[sp500_returns['SP500 Return']<0]['SP500 Return'].std()

[*********************100%%**********************]  1 of 1 completed


In [23]:
# Short trades
sp500_returns_short_list = []

# Long trade dates
for idx in range(short_trades.shape[0]):
    buy_date = short_trades.iloc[idx]['buy date']
    sell_date = short_trades.iloc[idx]['sell date']
    buy_price = spx.loc[buy_date]['Adj Close']
    sell_price = spx.loc[sell_date]['Adj Close']
    sp500_return = sell_price/buy_price-1
    sp500_returns_short_list.append([idx,buy_date,sell_date,sp500_return])

sp500_returns_short_dates = pd.DataFrame(sp500_returns_short_list,
    columns=['Index','Buy Date','Sell Date','SP500 Return'])


# Avg SP500 returns during the same long positions
sp500_returns_short_dates['SP500 Return'].mean() # 1.16% avg return

# Sharpe (assuming risk-free return is just zero): 0.19
sp500_returns_short_dates['SP500 Return'].mean()/sp500_returns_short_dates['SP500 Return'].std()

# Sortino Ratio: 0.75
sp500_returns['SP500 Return'].mean()/(sp500_returns_short_dates[
    sp500_returns_short_dates['SP500 Return']<0]['SP500 Return'].std())

0.7473058887362829

## Conclusion

So taking short seasonal trades during a bull market (which 2023 was in retrospect, with 24% returns from Jan 3, 2023 to Jan 3, 2024) performs significantly worse than just going long the SP500 over those same periods. However, taking long seasonal positions does seem to outperform just going long the SP500 over those same periods (3.14% avg returns over the holding period compared to 1.32% for the SP 500),although due to the increased volatility the Sharpe/Sortino ratios are only slightly better. But, both long/short seasonal strategies perform perform significantly worse than what would be expected from the prior 10 years' performance, which is to be expected: we are a bit guilty of finding patterns in the noise and cherry-picking the best results.

But why did the long seasonal strategy do so much better compared to the short  strategy? Maybe we should explore a "veto" option for long/short strategie: use our simple long/short criteria for identifying the top X stocks for each trading window, but then take contemporaneous data (i.e. from the same year instead of from previous years) to identify whether we actually act on the signal and trade.ade.