# S&P500 Exploratory models

We are interested in predicting returns, so let's see what works best,
starting with a simple linear regression.

In [23]:
import pickle

import pandas as pd
import statsmodels.api as sm

In [43]:
df = pd.read_csv('../../data/features/sp500_basics.csv', parse_dates=['Date'])
df = df.set_index('Date')

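# Target (endogenous) column; every other column is used as a feature
endo_column = 'returns'
exo_columns = df.columns.to_list()
exo_columns.remove(endo_column)

# Split by position: the first 60% of rows for training, the rest for testing
n_obs = len(df.index)
train_obs = int(n_obs*0.6)
test_obs = n_obs - train_obs

# Lag the features by one row so each return is paired with the previous row's feature values
df[exo_columns] = df[exo_columns].shift(1)
df_train = df.iloc[:train_obs]
df_test = df.iloc[train_obs:]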

Start with a simple in-sample regression:
standardize the features, add a constant, and drop NA values.

In [44]:
# Work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
df_train = df_train.copy()
df_train[exo_columns] = (df_train[exo_columns] - df_train[exo_columns].mean())/(df_train[exo_columns].std())

# missing='drop' discards rows with NaN, such as the first row left empty by the one-row lag
model = sm.OLS(df_train[endo_column], sm.add_constant(df_train[exo_columns]), missing='drop')
fitted = model.fit()

# Use a context manager so the pickle file is closed after writing
with open('../../models/simple_ols.pkl', 'wb') as f:
    pickle.dump(fitted, f)
fitted.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | returns | R-squared: | 0.03 |
| Model: | OLS | Adj. R-squared: | 0.007 |
| Method: | Least Squares | F-statistic: | 1.289 |
| Date: | Sat, 05 Aug 2023 | Prob (F-statistic): | 0.281 |
| Time: | 13:59:28 | Log-Likelihood: | 231.83 |
| No. Observations: | 131 | AIC: | -455.7 |
| Df Residuals: | 127 | BIC: | -444.2 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.0001 | 0.004 | -0.040 | 0.968 | -0.007 | 0.007 |
| logs | -0.0041 | 0.006 | -0.665 | 0.507 | -0.016 | 0.008 |
| Consumer Price Index | -0.0016 | 0.007 | -0.228 | 0.820 | -0.015 | 0.012 |
| Long Interest Rate | -0.0046 | 0.009 | -0.544 | 0.587 | -0.022 | 0.012 |

| | | | |
|---|---|---|---|
| Omnibus: | 44.515 | Durbin-Watson: | 1.47 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 122.712 |
| Skew: | -1.298 | Prob(JB): | 2.26e-27 |
| Kurtosis: | 6.968 | Cond. No. | 4.42 |


In [48]:
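# The model was fitted on standardized features, so scale the test features
# with the training-period mean and standard deviation before predicting
train_features = df[exo_columns].iloc[:train_obs]
test_features = (df_test[exo_columns] - train_features.mean())/(train_features.std())

prediction = fitted.predict(sm.add_constant(test_features))
prediction.name = f'{endo_column}_pred'

# Align actual and predicted returns on the date index
df_pred = pd.merge(df_test[endo_column], prediction, left_index=True, right_index=True)
df_pred = df_pred.rename(columns={endo_column: f'{endo_column}_actual'})
df_pred.to_csv('../../models/simple_ols.csv')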
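
The saved predictions can be checked against a naive forecast. The sketch below is one possible way to do that, not something computed in this notebook: it compares the model's mean squared error with that of a constant forecast equal to the training-period mean return. The helper name `out_of_sample_metrics` and the toy numbers are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def out_of_sample_metrics(actual: pd.Series, predicted: pd.Series, train_mean: float) -> dict:
    """Compare model predictions with a constant historical-mean forecast."""
    # Keep only rows where both the realised and the predicted return exist
    both = pd.concat([actual, predicted], axis=1, keys=["actual", "pred"]).dropna()
    err_model = both["actual"] - both["pred"]
    err_bench = both["actual"] - train_mean
    mse_model = float((err_model ** 2).mean())
    mse_bench = float((err_bench ** 2).mean())
    return {
        "mse_model": mse_model,
        "mse_benchmark": mse_bench,
        # Positive only if the model beats the historical-mean forecast
        "r2_oos": 1.0 - mse_model / mse_bench,
        "n": len(both),
    }


# Toy numbers for illustration only (not results from this notebook)
actual = pd.Series([0.01, -0.02, 0.03, 0.00])
predicted = pd.Series([0.00, 0.00, 0.01, np.nan])
metrics = out_of_sample_metrics(actual, predicted, train_mean=0.005)
print(metrics)
```

With the notebook's own objects, a call along the lines of `out_of_sample_metrics(df_pred['returns_actual'], df_pred['returns_pred'], df_train[endo_column].mean())` would apply the same comparison to the saved predictions. An out-of-sample R² at or below zero would mean the model does no better than the historical mean.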

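Our features do not hold any predictive value in-sample: R-squared is 0.03 (adjusted 0.007), the F-test does not reject the hypothesis that all slopes are zero (p = 0.281), and no individual coefficient is significant.
We know that our returns are not normally distributed, and the residual diagnostics point the same way (skew -1.298, kurtosis 6.968, Jarque-Bera p = 2.26e-27), so we might try predicting the sign of the return with a binary classifier instead.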