# Predicting the Stock Market

We'll be working with a dataset containing index prices. An index aggregates the prices of multiple stocks, which lets you see how the market as a whole is performing. Each row in the file contains a daily record of the price of the S&P500 Index from 1950 to 2015. The S&P500 Index aggregates the stock prices of 500 large companies. The dataset is stored in `sphist.csv`. The columns of the dataset are:
- `Date` -- The date of the record.
- `Open` -- The opening price of the day (when trading starts).
- `High` -- The highest trade price during the day.
- `Low` -- The lowest trade price during the day.
- `Close` -- The closing price for the day (when trading is finished).
- `Volume` -- The number of shares traded.
- `Adj Close` -- The daily closing price, adjusted retroactively to include any corporate actions.

We'll use this dataset to develop a predictive model. We'll train the model on data from 1950 through 2012 and test its predictions on data from 2013 through 2015.

In [15]:
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

### Exploring and Cleaning the Data

In [2]:
sphist = pd.read_csv("sphist.csv")

In [3]:
sphist.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068
1,2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
2,2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117
3,2015-12-02,2101.709961,2104.27002,2077.110107,2079.51001,3950640000.0,2079.51001
4,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3712120000.0,2102.629883


In [4]:
sphist['Date'] = pd.to_datetime(sphist['Date'])
sphist_sorted = sphist.sort_values("Date", ascending=True)
sphist_sorted.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08


In [5]:
sphist_sorted.shape

(16590, 7)

### Generating Indicators

In [6]:
# Every indicator is built from Close shifted by one row, so the value for a given
# day uses only earlier days' prices and never the price being predicted.
# `window` counts rows (trading days), so 365 rows span well over one calendar year.
sphist_sorted["day_5"] = sphist_sorted['Close'].shift(1).rolling(center=False, window=5).mean()
sphist_sorted["year_1"] = sphist_sorted['Close'].shift(1).rolling(center=False, window=365).mean()
sphist_sorted["day_year_ratio"] = sphist_sorted["day_5"] / sphist_sorted["year_1"]
sphist_sorted["day_5_std"] = sphist_sorted['Close'].shift(1).rolling(center=False, window=5).std()
sphist_sorted["year_1_std"] = sphist_sorted['Close'].shift(1).rolling(center=False, window=365).std()
sphist_sorted["day_year_std_ratio"] = sphist_sorted["day_5_std"] / sphist_sorted["year_1_std"]
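
The `shift(1)` call in each indicator is what keeps the current day's close out of its own features. A minimal sketch of the difference on a toy series (the values and the names `leaky` and `safe` are made up for illustration and are not part of the notebook):

```python
import pandas as pd

# Hypothetical closing prices for six consecutive trading days.
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

# Without shift: the 3-day mean for a row includes that row's own close,
# so the feature would leak the value we are trying to predict.
leaky = close.rolling(window=3).mean()

# With shift(1): the 3-day mean for a row covers only the three previous rows.
safe = close.shift(1).rolling(window=3).mean()

print(pd.DataFrame({"close": close, "leaky": leaky, "safe": safe}))
```

In the `safe` column the first three rows are `NaN`, because a row needs three earlier closes before its window is full.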

In [7]:
sphist_sorted.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5,year_1,day_year_ratio,day_5_std,year_1_std,day_year_std_ratio
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,,,,,,
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,,,,,,
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,,,,,,
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,,,,,,
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,,,,,,


In [8]:
sphist_sorted.shape

(16590, 13)

In [9]:
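# Drop the earliest rows, whose rolling indicators are NaN for lack of history.
# Reassigning the result (rather than inplace=True on a filtered slice) avoids
# pandas' SettingWithCopyWarning.
sphist_sorted = sphist_sorted[sphist_sorted['Date'] > datetime(year=1951, month=1, day=2)]
sphist_sorted = sphist_sorted.dropna(axis=0)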
sphist_sorted.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5,year_1,day_year_ratio,day_5_std,year_1_std,day_year_std_ratio
16224,1951-06-19,22.02,22.02,22.02,22.02,1100000.0,22.02,21.8,19.447726,1.120954,0.256223,1.790253,0.143121
16223,1951-06-20,21.91,21.91,21.91,21.91,1120000.0,21.91,21.9,19.462411,1.125246,0.213659,1.789307,0.119409
16222,1951-06-21,21.780001,21.780001,21.780001,21.780001,1100000.0,21.780001,21.972,19.476274,1.128142,0.092574,1.788613,0.051758
16221,1951-06-22,21.549999,21.549999,21.549999,21.549999,1340000.0,21.549999,21.96,19.489562,1.126757,0.115108,1.787659,0.06439
16220,1951-06-25,21.290001,21.290001,21.290001,21.290001,2440000.0,21.290001,21.862,19.502082,1.121008,0.204132,1.786038,0.114293
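
The first surviving row is dated 1951-06-19 rather than early January 1951, most likely because `rolling(window=365)` counts rows (trading days) and not calendar days. The sketch below reproduces the effect on a synthetic series; the business-day calendar from `pd.bdate_range` ignores holidays and only stands in for the real trading calendar, so the dates will not match the notebook's.

```python
import pandas as pd

# Hypothetical toy series: 400 consecutive business days (no holidays).
dates = pd.bdate_range(start="1950-01-03", periods=400)
close = pd.Series(range(400), index=dates, dtype="float64")

# Same construction as the year_1 indicator: shift by one row, then a
# 365-row rolling mean.
year_1 = close.shift(1).rolling(window=365).mean()

# Number of rows that have no value because the window is not yet full.
n_missing = int(year_1.isna().sum())

# Row position (0-based) of the first row with a complete window.
first_valid_position = close.index.get_loc(year_1.first_valid_index())

print(n_missing)
print(first_valid_position)
```

The count of missing rows equals the window length, which is consistent with the notebook's shapes: 16590 rows before cleaning, and 15486 + 739 = 16225 rows in the train and test sets combined, a difference of 365.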


### Generating Train and Test Data

In [10]:
train = sphist_sorted[sphist_sorted['Date'] < datetime(year = 2013, month = 1, day = 1)]
test = sphist_sorted[sphist_sorted['Date'] >= datetime(year = 2013, month = 1, day = 1)]

In [11]:
train.shape

(15486, 13)

In [12]:
test.shape

(739, 13)

### Training a Linear Regression Model

In [19]:
features = ['day_5','year_1', 'day_year_ratio', 'day_5_std', 'year_1_std', 'day_year_std_ratio']

target = ["Close"]

In [20]:
lr = LinearRegression()
lr.fit(train[features], train[target])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Making Predictions

In [21]:
predictions = lr.predict(test[features])

In [22]:
# Root mean squared error, in the same units as the index (points)
rmse = np.sqrt(mean_squared_error(test["Close"], predictions))
rmse

22.15180399006688
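
RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as the index itself. A minimal sketch of the calculation in plain Python, with a naive "previous close" forecast as a point of comparison (the helper `rmse` and all the numbers are hypothetical and are not results from this dataset):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical closes for four days, plus two sets of forecasts.
actual = [10.0, 12.0, 11.0, 13.0]
model_predictions = [10.5, 11.5, 11.5, 12.0]
# Naive baseline: predict each day's close as the previous day's close
# (the close before the first day is assumed to be 9.0).
naive_predictions = [9.0, 10.0, 12.0, 11.0]

model_rmse = rmse(actual, model_predictions)
naive_rmse = rmse(actual, naive_predictions)
print(model_rmse, naive_rmse)
```

Comparing the model's RMSE against a simple baseline like this is one way to judge whether an error of about 22 points is good or not; the notebook itself does not report such a baseline.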