# Predicting the Stock Market

We'll use the S&P 500 dataset to build a predictive model: train on data from 1950 through 2012, then try to predict closing prices for 2013 through 2015.

In [0]:
import pandas as pd

In [2]:
stocks = pd.read_csv('https://raw.githubusercontent.com/sharontan/machine-learning/master/sphist.csv')
stocks.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068
1,2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
2,2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117
3,2015-12-02,2101.709961,2104.27002,2077.110107,2079.51001,3950640000.0,2079.51001
4,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3712120000.0,2102.629883


In [3]:
stocks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
Date         16590 non-null object
Open         16590 non-null float64
High         16590 non-null float64
Low          16590 non-null float64
Close        16590 non-null float64
Volume       16590 non-null float64
Adj Close    16590 non-null float64
dtypes: float64(6), object(1)
memory usage: 907.3+ KB


### Cleaning data

In [0]:
from datetime import datetime

#convert Date column from string to datetime format
stocks['Date'] = pd.to_datetime(stocks['Date'])


In [5]:
#remove data before Jan 03, 1950
after1950 = stocks['Date'] > datetime(year=1950, month=1, day=2)
stocks = stocks[after1950]
stocks.tail()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66


In [6]:
stocks = stocks.dropna(axis=0)
stocks.shape

(16590, 7)

In [0]:
#sort stocks based on earliest to latest dates
stocks = stocks.sort_values('Date', ascending=True)

### Generate time series columns

All indicators are computed from "Close" and use only the days before the current row:

- Ave_5: average price over the past 5 trading days
- Ave_30: average price over the past 30 trading days
- Std_5: standard deviation of prices over the past 5 trading days
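
Each indicator for a given day should be built only from earlier closes, otherwise the model would see the value it is trying to predict. A minimal sketch on a made-up series (the `close` values and the window of 3 are illustrative, not taken from the dataset) of how `rolling(...).mean()` followed by `shift(1)` keeps the current day out of the window:

```python
import pandas as pd

# Toy closing prices (hypothetical values, for illustration only)
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

# rolling(3).mean() at row i averages rows i-2..i, so it includes the current day
with_today = close.rolling(3).mean()

# shift(1) moves every value down one row, so row i now averages rows i-3..i-1
past_only = with_today.shift(1)

print(pd.DataFrame({"close": close, "with_today": with_today, "past_only": past_only}))
```

In the `past_only` column, row 3 holds the mean of rows 0 to 2, so no row's indicator depends on its own close.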

In [8]:
stocks['Ave_5'] = stocks['Close'].rolling(5).mean().shift(1)
stocks[['Close', 'Ave_5']].tail()

Unnamed: 0,Close,Ave_5
4,2102.629883,2087.024023
3,2079.51001,2090.231982
2,2049.620117,2088.306006
1,2091.689941,2080.456006
0,2077.070068,2080.771973


In [9]:
stocks['Ave_30'] = stocks['Close'].rolling(30).mean().shift(1)
stocks['Ave_30'].isnull().sum()

30

In [10]:
stocks['Std_5'] = stocks['Close'].rolling(5).std().shift(1)
stocks['Std_5'].isnull().sum()

5
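
The null counts of 30 and 5 match the window sizes: a window of `w` rows leaves the first `w - 1` rows empty, and `shift(1)` adds one more. A small sketch that checks this on a synthetic series (the helper name `leading_nulls` and the series length of 100 are assumptions made for illustration):

```python
import pandas as pd

def leading_nulls(n_rows: int, window: int) -> int:
    """Count nulls produced by rolling(window).mean().shift(1) on n_rows of data."""
    series = pd.Series(range(n_rows), dtype="float64")
    return int(series.rolling(window).mean().shift(1).isnull().sum())

print(leading_nulls(100, 5))   # 5-row window
print(leading_nulls(100, 30))  # 30-row window
```

The same count applies to `rolling(window).std()`, since it also needs a full window before producing a value.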

In [0]:
#split dataset into training set (before 2013) and test set (2013 onward)
cutoff = datetime(year=2013, month=1, day=1)
train = stocks[stocks['Date'] < cutoff]
test = stocks[stocks['Date'] >= cutoff]

In [12]:
train.isnull().sum()

Date          0
Open          0
High          0
Low           0
Close         0
Volume        0
Adj Close     0
Ave_5         5
Ave_30       30
Std_5         5
dtype: int64

In [13]:
test.isnull().sum()

Date         0
Open         0
High         0
Low          0
Close        0
Volume       0
Adj Close    0
Ave_5        0
Ave_30       0
Std_5        0
dtype: int64

In [14]:
#drop null values in train set
train = train.dropna(axis=0)
train.isnull().sum()

Date         0
Open         0
High         0
Low          0
Close        0
Volume       0
Adj Close    0
Ave_5        0
Ave_30       0
Std_5        0
dtype: int64

In [0]:
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

In [16]:
lr = LinearRegression()
lr.fit(train[['Ave_5', 'Ave_30', "Std_5"]], train['Close'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:
predictions = lr.predict(test[['Ave_5', 'Ave_30', "Std_5"]])

In [0]:
#mean_squared_error expects (y_true, y_pred)
mse = mean_squared_error(test['Close'], predictions)

In [19]:
rmse = mse ** 0.5
rmse

22.198219543695416

In [20]:
average_price = test['Close'].mean()
average_price

1874.8903383897166

## Conclusion

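Using three rolling indicators computed from closing prices (5-day average, 30-day average, and 5-day standard deviation), we were able to use linear regression to predict 2013-2015 closing prices with a root mean squared error of 22.20. Relative to the average test-set close of 1874.89, that is an error of about 1.2%.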
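
The relative error can be checked directly from the two notebook outputs above by dividing the RMSE by the mean test-set close. A minimal sketch, with both numbers copied from those outputs (the name `relative_error` is introduced here for illustration):

```python
# Values copied from the notebook outputs above
rmse = 22.198219543695416
average_price = 1874.8903383897166

# RMSE expressed as a fraction of the mean test-set closing price
relative_error = rmse / average_price
print(f"{relative_error:.2%}")
```

This is a rough scale comparison rather than a per-day percentage error, since it divides one aggregate by another.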