# Predicting the Stock Market

In this project, we attempt to predict prices for the S&P500 index from its historical prices. The dataset contains prices for every trading day from 1950 to 2015. For the purposes of building a predictive model, only the data up to the end of 2012 is used for training. That is, the model is trained on data from 1950-2012 and then used to make predictions for 2013-2015.

# Introduction

The dataset to be used is in the `sphist.csv` file, which contains the following columns:
* Date -- The date of the record.
* Open -- The opening price of the day (when trading starts).
* High -- The highest trade price during the day.
* Low -- The lowest trade price during the day.
* Close -- The closing price for the day (when trading is finished).
* Volume -- The number of shares traded.
* Adj Close -- The daily closing price, adjusted retroactively to include any corporate actions.

In [1]:
# Importing the libraries to be used
import pandas as pd
import numpy as np
from datetime import datetime

# Setting display options
pd.set_option('display.max_rows', 25)
pd.set_option('display.max_columns', 25)

In [2]:
# Reading in the csv file as a pandas dataframe object
df = pd.read_csv('sphist.csv', parse_dates=['Date'])

# Setting the Date as the index and then sorting
df.set_index('Date', inplace=True)
df.sort_index(inplace=True)

In [3]:
# First 5 rows
df.head()

Date,Open,High,Low,Close,Volume,Adj Close
1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66
1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85
1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93
1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98
1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08


In [4]:
# Last 5 rows
df.tail()

Date,Open,High,Low,Close,Volume,Adj Close
2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3712120000.0,2102.629883
2015-12-02,2101.709961,2104.27002,2077.110107,2079.51001,3950640000.0,2079.51001
2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117
2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068


# Creating Indicators

A mistake often made by algorithmic traders is to incorporate future information into past rows when training and testing a model. Injecting future knowledge makes the model look good during training and testing, but causes it to fail in the real world (i.e. on data the model has never seen before). This is how many algorithmic traders lose money, so every indicator below is built only from rows that come before the day being predicted.
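To make the leakage risk concrete, here is a minimal sketch on made-up prices (the values and the names `leaky` and `safe` are illustrative assumptions, not part of the dataset). It shows that a plain rolling mean includes the current row, and that shifting by one row removes that dependence.

```python
import pandas as pd

# Toy closing prices (made-up values, for illustration only)
close = pd.Series([10.0, 20.0, 30.0, 40.0], name="Close")

# Rolling mean over 2 rows: each value INCLUDES the current row
leaky = close.rolling(window=2).mean()

# Shifting by one row makes each value depend only on earlier rows
safe = leaky.shift(1)

print(pd.DataFrame({"Close": close, "leaky": leaky, "safe": safe}))
```

For the third row (close of 30), `leaky` is the mean of 20 and 30, while `safe` is the mean of 10 and 20, which is all that would have been known before that day.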

The time series nature of the data means that we can generate indicators to make our model more accurate. For instance, we can create a new column that contains the average closing price over the previous 10 trading days for each row. This incorporates information from multiple prior rows into one, which typically makes predictions more accurate. For our predictive model, we create the following three indicators:

* `past_5` - The average closing price over the past 5 trading days, excluding the current trading day.
* `past_30` - The average closing price over the past 30 trading days, excluding the current trading day.
* `past_365` - The average closing price over the past 365 trading days (rows in the dataset, not calendar days), excluding the current trading day.

In [5]:
# Creating the `past_5` indicator
past_5 = df.rolling(window=5)['Close'].mean()
past_5 = past_5.shift()

In [6]:
# Sanity check to ensure accuracy
past_5.head(10)

Date
1950-01-03       NaN
1950-01-04       NaN
1950-01-05       NaN
1950-01-06       NaN
1950-01-09       NaN
1950-01-10    16.900
1950-01-11    16.974
1950-01-12    17.022
1950-01-13    16.988
1950-01-16    16.926
Name: Close, dtype: float64

In [7]:
# Assigning back to the original df, with a new column
df['past_5'] = past_5
df.head(10)

Date,Open,High,Low,Close,Volume,Adj Close,past_5
1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,
1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,
1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,
1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,
1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,
1950-01-10,17.030001,17.030001,17.030001,17.030001,2160000.0,17.030001,16.9
1950-01-11,17.09,17.09,17.09,17.09,2630000.0,17.09,16.974
1950-01-12,16.76,16.76,16.76,16.76,2970000.0,16.76,17.022
1950-01-13,16.67,16.67,16.67,16.67,3330000.0,16.67,16.988
1950-01-16,16.719999,16.719999,16.719999,16.719999,1460000.0,16.719999,16.926


In [8]:
# Repeating the same to create the `past_30` and `past_365` indicators
past_30 = df.rolling(window=30)['Close'].mean()
past_30 = past_30.shift()

past_365 = df.rolling(window=365)['Close'].mean()
past_365 = past_365.shift()

# Assigning back to the original df, with a new column
df['past_30'] = past_30
df['past_365'] = past_365

In [9]:
df.head(10)

Date,Open,High,Low,Close,Volume,Adj Close,past_5,past_30,past_365
1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,,,
1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,,,
1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,,,
1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,,,
1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,,,
1950-01-10,17.030001,17.030001,17.030001,17.030001,2160000.0,17.030001,16.9,,
1950-01-11,17.09,17.09,17.09,17.09,2630000.0,17.09,16.974,,
1950-01-12,16.76,16.76,16.76,16.76,2970000.0,16.76,17.022,,
1950-01-13,16.67,16.67,16.67,16.67,3330000.0,16.67,16.988,,
1950-01-16,16.719999,16.719999,16.719999,16.719999,1460000.0,16.719999,16.926,,


In [10]:
df.tail(10)

Date,Open,High,Low,Close,Volume,Adj Close,past_5,past_30,past_365
2015-11-23,2089.409912,2095.610107,2081.389893,2086.590088,3587980000.0,2086.590088,2071.523974,2061.892989,2033.60589
2015-11-24,2084.419922,2094.120117,2070.290039,2089.139893,3884930000.0,2089.139893,2078.204004,2064.197327,2034.018028
2015-11-25,2089.300049,2093.0,2086.300049,2088.870117,2852940000.0,2088.870117,2085.943994,2067.045658,2034.432712
2015-11-27,2088.820068,2093.290039,2084.129883,2090.110107,1466840000.0,2090.110107,2087.002002,2070.199996,2034.835123
2015-11-30,2090.949951,2093.810059,2080.409912,2080.409912,4245030000.0,2080.409912,2088.776025,2072.408333,2035.199864
2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3712120000.0,2102.629883,2087.024023,2073.984998,2035.531178
2015-12-02,2101.709961,2104.27002,2077.110107,2079.51001,3950640000.0,2079.51001,2090.231982,2076.283993,2035.914082
2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117,2088.306006,2077.908659,2036.234356
2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941,2080.456006,2078.931331,2036.507343
2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068,2080.771973,2080.237329,2036.869425


# Incorporation of Additional Features for Model Improvement

## <span style='color:red'>***(Skip this section until the Conclusion has been read)***</span>

As a follow-up from the conclusion, we can incorporate three additional indicators to see whether the error is reduced.
They are built the same way as the price indicators (rolling mean, then shifted by one row to exclude the current day):
* `vol_5` -- The average volume over the past 5 trading days.
* `vol_30` -- The average volume over the past 30 trading days.
* `vol_365` -- The average volume over the past 365 trading days.

In [11]:
# Volume indicators; comment these lines out, then restart the kernel and rerun all, to reproduce the run without them
vol_5 = df.rolling(window=5)['Volume'].mean()
vol_5 = vol_5.shift()
vol_30 = df.rolling(window=30)['Volume'].mean()
vol_30 = vol_30.shift()
vol_365 = df.rolling(window=365)['Volume'].mean()
vol_365 = vol_365.shift()

# Assigning back to the original df, with new columns
df['vol_5'] = vol_5
df['vol_30'] = vol_30
df['vol_365'] = vol_365

# Data Cleaning

Since all the indicators are computed from historical data, the earliest rows contain null values because there is not enough history to compute them. The first record is dated 1950-01-03, so rows up to 1951-01-03 are dropped first. However, the 365-row window counts trading days rather than calendar days, so the `past_365` and `vol_365` columns remain null for some months after that date. Dropping every row with a null value removes these as well, which is why the cleaned data starts on 1951-06-19.
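The difference between counting rows and counting calendar days can be checked with a small sketch. The data here is a hypothetical stand-in (500 consecutive weekdays with made-up prices, ignoring market holidays), and the offset-based window `"365D"` is shown only as an alternative for comparison, not as what this notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 500 weekdays starting 1950-01-03
idx = pd.bdate_range("1950-01-03", periods=500)
close = pd.Series(np.arange(500, dtype=float), index=idx)

# Integer window: needs 365 prior ROWS, so the first 365 rows are NaN
by_rows = close.rolling(window=365).mean().shift(1)

# Offset window: averages whatever rows fall in the previous 365 calendar days.
# closed="left" excludes the current row, so no shift is needed.
by_days = close.rolling("365D", closed="left").mean()

n_missing_rows = int(by_rows.isna().sum())
n_missing_days = int(by_days.isna().sum())
print(n_missing_rows, n_missing_days)
print(by_rows.first_valid_index())
```

With the integer window, the first usable row is the 366th trading day, well over a calendar year after the start. The offset window produces values almost immediately because its `min_periods` defaults to 1, so early values average over very little history; whether that trade-off is acceptable depends on the use case.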

In [12]:
# Removing rows that fall on or before 1951-01-03
df = df[df.index > datetime(1951, 1, 3)]

# Removing rows with any null values
# (reassigning instead of inplace=True avoids modifying a slice of the original frame)
df = df.dropna(axis=0)

In [13]:
df.head()

Date,Open,High,Low,Close,Volume,Adj Close,past_5,past_30,past_365,vol_5,vol_30,vol_365
1951-06-19,22.02,22.02,22.02,22.02,1100000.0,22.02,21.8,21.703333,19.447726,1196000.0,1707667.0,1989479.0
1951-06-20,21.91,21.91,21.91,21.91,1120000.0,21.91,21.9,21.683,19.462411,1176000.0,1691667.0,1989041.0
1951-06-21,21.780001,21.780001,21.780001,21.780001,1100000.0,21.780001,21.972,21.659667,19.476274,1188000.0,1675667.0,1986932.0
1951-06-22,21.549999,21.549999,21.549999,21.549999,1340000.0,21.549999,21.96,21.631,19.489562,1148000.0,1647000.0,1982959.0
1951-06-25,21.290001,21.290001,21.290001,21.290001,2440000.0,21.290001,21.862,21.599,19.502082,1142000.0,1636333.0,1981123.0


# Train-Test Split

With the null values out of the way, we can now split the data into a training set and a test set. As mentioned earlier, we train the model on prices from before 2013 and test it on prices from 2013 onward.
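A chronological split like this can be wrapped in a small helper and checked for overlap. The sketch below is illustrative only: `split_by_date` is a hypothetical helper name, and the toy frame reuses four closing values that appear in the outputs of this notebook.

```python
import pandas as pd

def split_by_date(frame, cutoff):
    """Rows strictly before `cutoff` go to train, the rest to test."""
    cutoff = pd.Timestamp(cutoff)
    train_part = frame[frame.index < cutoff]
    test_part = frame[frame.index >= cutoff]
    return train_part, test_part

# Toy frame with a DatetimeIndex straddling the cutoff
idx = pd.to_datetime(["2012-12-28", "2012-12-31", "2013-01-02", "2013-01-03"])
toy = pd.DataFrame({"Close": [1402.43, 1426.19, 1462.42, 1459.37]}, index=idx)

train_toy, test_toy = split_by_date(toy, "2013-01-01")

# Every training date must come before every test date
print(train_toy.index.max() < test_toy.index.min())
```

Unlike a random split, this keeps all test rows later in time than all training rows, which is what prevents future prices from informing the fit.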

In [14]:
# Training set: before 2013-01-01; test set: 2013-01-01 onward
train = df[df.index < datetime(2013,1,1)]
test = df[df.index >= datetime(2013,1,1)]

In [15]:
train.shape

(15486, 12)

In [16]:
test.shape

(739, 12)

In [17]:
train.head()

Date,Open,High,Low,Close,Volume,Adj Close,past_5,past_30,past_365,vol_5,vol_30,vol_365
1951-06-19,22.02,22.02,22.02,22.02,1100000.0,22.02,21.8,21.703333,19.447726,1196000.0,1707667.0,1989479.0
1951-06-20,21.91,21.91,21.91,21.91,1120000.0,21.91,21.9,21.683,19.462411,1176000.0,1691667.0,1989041.0
1951-06-21,21.780001,21.780001,21.780001,21.780001,1100000.0,21.780001,21.972,21.659667,19.476274,1188000.0,1675667.0,1986932.0
1951-06-22,21.549999,21.549999,21.549999,21.549999,1340000.0,21.549999,21.96,21.631,19.489562,1148000.0,1647000.0,1982959.0
1951-06-25,21.290001,21.290001,21.290001,21.290001,2440000.0,21.290001,21.862,21.599,19.502082,1142000.0,1636333.0,1981123.0


In [18]:
train.tail()

Date,Open,High,Low,Close,Volume,Adj Close,past_5,past_30,past_365,vol_5,vol_30,vol_365
2012-12-24,1430.150024,1430.150024,1424.660034,1426.660034,1248960000.0,1426.660034,1437.36001,1405.926001,1326.114028,4108678000.0,3461864000.0,3886189000.0
2012-12-26,1426.660034,1429.420044,1416.430054,1419.829956,2285030000.0,1419.829956,1436.620019,1407.486336,1326.412494,3667348000.0,3381918000.0,3878488000.0
2012-12-27,1419.829956,1422.800049,1401.800049,1418.099976,2830180000.0,1418.099976,1431.228003,1408.813,1326.716494,3263906000.0,3372501000.0,3872807000.0
2012-12-28,1418.099976,1418.099976,1401.579956,1402.430054,2426680000.0,1402.430054,1427.685986,1410.265332,1326.995836,3055982000.0,3351655000.0,3868936000.0
2012-12-31,1402.430054,1426.73999,1398.109985,1426.189941,3204330000.0,1426.189941,1419.434009,1411.830001,1327.261562,2804002000.0,3295561000.0,3864302000.0


In [19]:
test.head()

Date,Open,High,Low,Close,Volume,Adj Close,past_5,past_30,past_365,vol_5,vol_30,vol_365
2013-01-02,1426.189941,1462.430054,1426.189941,1462.420044,4202600000.0,1462.420044,1418.641992,1414.258667,1327.534055,2399036000.0,3271409000.0,3861288000.0
2013-01-03,1462.420044,1465.469971,1455.530029,1459.369995,3829730000.0,1459.369995,1425.793994,1417.676668,1327.908247,2989764000.0,3276632000.0,3862480000.0
2013-01-04,1459.369995,1467.939941,1458.98999,1466.469971,3424290000.0,1466.469971,1433.702002,1420.092668,1328.224877,3298704000.0,3291797000.0,3859719000.0
2013-01-07,1466.469971,1466.469971,1456.619995,1461.890015,3304970000.0,1461.890015,1443.376001,1422.714665,1328.557617,3417526000.0,3299034000.0,3859449000.0
2013-01-08,1461.890015,1461.890015,1451.640015,1457.150024,3601600000.0,1457.150024,1455.267993,1425.076664,1328.898603,3593184000.0,3320297000.0,3858814000.0


# Fitting and Predicting

With the train and test sets split, we can now train a Linear Regression model on the three price indicator columns created earlier (`past_5`, `past_30` and `past_365`). The target is the `Close` column, since we are attempting to predict the closing price of each trading day. To evaluate the model, we need an error metric. The root mean squared error (RMSE) is a suitable choice: squaring removes the sign of each error and penalises large errors more heavily, and taking the square root expresses the result in the same units as the original `Close` column (index points).
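As a quick illustration of what the RMSE measures, the sketch below computes it by hand with NumPy on three made-up values (the arrays `y_true` and `y_pred` are assumptions for illustration, not data from this project).

```python
import numpy as np

# Made-up actual and predicted closing values
y_true = np.array([100.0, 102.0, 101.0])
y_pred = np.array([101.0, 100.0, 101.0])

errors = y_true - y_pred           # [-1, 2, 0]: signs differ
squared = errors ** 2              # [1, 4, 0]: signs removed, large errors weigh more
rmse_toy = np.sqrt(squared.mean()) # back to the units of the target

print(rmse_toy)
```

Here the single error of 2 contributes four times as much to the mean as the error of 1, which is the penalising effect described above.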

In [20]:
# Importing the LinearRegression class and the mean_squared_error function from scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [21]:
# Instantiate a Linear Regression model, using the default parameters
model = LinearRegression()

In [22]:
# Fitting the model with the train set
features = ['past_5', 'past_30', 'past_365']

target = 'Close'
model.fit(train[features], train[target])

LinearRegression()

In [23]:
# Making predictions on the test set
predictions = model.predict(test[features])

In [24]:
# Computing the RMSE for the test set
mse = mean_squared_error(test[target], predictions)
rmse = np.sqrt(mse)
rmse

22.22006532421968

# Conclusion

We have built a predictive model for the S&P500 index. However, its error could likely be reduced by incorporating more features. Some other indicators that may help are as follows:

* The average volume over the past five days.
* The average volume over the past year.
* The ratio between the average volume for the past five days, and the average volume for the past year.
* The standard deviation of the average volume over the past five days.
* The standard deviation of the average volume over the past year.
* The ratio between the standard deviation of the average volume for the past five days, and the standard deviation of the average volume for the past year.
* The year component of the date.
* The ratio between the lowest price in the past year and the current price.
* The ratio between the highest price in the past year and the current price.
* The month component of the date.
* The day of week.
* The day component of the date.
* The number of holidays in the prior month.
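A few of these indicators can be sketched in the same style as the earlier ones. The helper `add_extra_indicators`, its column names and the toy volumes below are all hypothetical; the window lengths are parameters so the logic can be checked on a tiny frame. Only the volume ratio, the year and the day of week are shown.

```python
import pandas as pd

def add_extra_indicators(frame, short=5, long=365):
    """Return a copy of `frame` with a few extra indicator columns.

    Assumes a DatetimeIndex and a 'Volume' column. Rolling values are
    shifted by one row so they exclude the current trading day.
    """
    out = frame.copy()
    vol = out["Volume"]
    vol_short = vol.rolling(window=short).mean().shift(1)
    vol_long = vol.rolling(window=long).mean().shift(1)
    out["vol_ratio"] = vol_short / vol_long   # short-term vs long-term volume
    out["year"] = out.index.year              # year component of the date
    out["day_of_week"] = out.index.dayofweek  # Monday=0 ... Sunday=6
    return out

# Toy frame: six weekdays starting Monday 2015-01-05, made-up volumes
idx = pd.bdate_range("2015-01-05", periods=6)
toy = pd.DataFrame({"Volume": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

result = add_extra_indicators(toy, short=2, long=4)
print(result)
```

Date components are known before the trading day starts, so they need no shift; anything derived from prices or volumes does.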

Nevertheless, for a fairly simple predictive model, the RMSE is decent. We can also look at the coefficient of determination (R^2) on the test set.

In [25]:
model.score(test[features], test[target])

0.9866427208352108

The R-squared value is close to 1, indicating that the model explains most of the variance in the test-period closing prices.
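For reference, the score returned by `model.score` for a regressor is the coefficient of determination, which can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on made-up numbers (`y_true` and `y_pred` are illustrative assumptions).

```python
import numpy as np

# Made-up actual and predicted values
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_toy = 1.0 - ss_res / ss_tot

print(r2_toy)
```

Because the denominator is the spread of the target around its own mean, a target that ranges widely over the test period makes a given RMSE correspond to a higher R^2.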

# Follow-up from Incorporation of Additional Features

In [26]:
# Instantiate another model
model2 = LinearRegression()

In [27]:
# Fitting the model with the train set, including the 3 additional features
features2 = ['past_5', 'past_30', 'past_365', 'vol_5', 'vol_30', 'vol_365']

target = 'Close'
model2.fit(train[features2], train[target])

LinearRegression()

In [28]:
# Making predictions on the test set
predictions = model2.predict(test[features2])

In [29]:
# Computing the RMSE for the test set
mse = mean_squared_error(test[target], predictions)
rmse = np.sqrt(mse)
rmse

22.233749345427544

In [30]:
model2.score(test[features2], test[target])

0.9866262638562789

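We can see that the model did not improve with the introduction of the three additional indicators. Instead, it worsened slightly: the RMSE rose from about 22.220 to 22.234, and the R-squared value fell from about 0.98664 to 0.98663.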