# Predicting the stock market #

We'll be working with a CSV file containing historical data on the price of the S&P 500 Index and using it to make predictions about future prices. Predicting whether an index will go up or down helps us forecast how the stock market as a whole will perform. Since stocks tend to correlate with how well the economy is performing, it can also help us make economic forecasts.

In [1]:
import pandas as pd

stock = pd.read_csv(filepath_or_buffer="https://raw.githubusercontent.com/NickyThreeNames/DataquestGuidedProjects/master/Guided%20Project-%20Predicting%20the%20stock%20market/sphist.csv")
stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068
1,2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
2,2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117
3,2015-12-02,2101.709961,2104.27002,2077.110107,2079.51001,3950640000.0,2079.51001
4,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3712120000.0,2102.629883


In [2]:
stock.tail()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66


Below is the description of the columns:
- `Date` ---------- the date of the record.
- `Open` ---------- the opening price of the day (when trading starts).
- `High` ---------- the highest trade price of the day.
- `Low`  ---------- the lowest trade price of the day.
- `Close` --------- the closing price for the day (when trading ends).
- `Volume` -------- the number of shares traded.
- `Adj Close` ----- the daily closing price, adjusted retroactively to include any corporate actions.

The data covers trading days from January 3rd, 1950 to December 7th, 2015.

We'll train the model on data from 1950-2012 and try to make predictions for 2013-2015.

In [3]:
# convert 'Date' column into datetime and sort the dates
stock['Date'] = pd.to_datetime(stock['Date'])
stock.sort_values('Date', inplace=True)
stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08


## Generating indicators ##

Let's generate some indicators for each row to help make predictions:
- The average price for the past 5 days.
- The average price for the past 30 days.
- The average price for the past 365 days.

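"Days" here means trading days (rows in the data), and the price is always the closing price of the day.

For any date that doesn't have enough historical data to compute an indicator, we start with `0` as a placeholder. The rolling computation below leaves `NaN` in those rows instead, and we drop them before fitting.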
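To see why each indicator is shifted by one row, here is a minimal sketch on a toy series. The prices are made up for illustration and the window is 3 rows rather than 5, 30 or 365; it only assumes the pandas `rolling` and `shift` calls used in this notebook.

```python
import pandas as pd

# Toy closing prices (hypothetical values, not taken from sphist.csv)
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

# Mean of the current row and the two rows before it
rolling_mean = close.rolling(3).mean()

# Move every value down one row, so each row only sees earlier rows
past_mean = rolling_mean.shift(1)

print(pd.DataFrame({"close": close,
                    "rolling_mean": rolling_mean,
                    "past_mean": past_mean}))
```

Without the shift, the indicator for a given day would include that day's own closing price, which is the value we are trying to predict.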

In [4]:
# create 3 columns that have 0 as a default value
stock['day_5'] = 0
stock['day_30'] = 0
stock['day_365'] = 0
stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5,day_30,day_365
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,0,0,0
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,0,0,0
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,0,0,0
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,0,0,0
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,0,0,0


In [5]:
# create indicators
stock['day_5'] = stock['Close'].rolling(5).mean()
# shift all the values "forward" one day so each row only uses prices from earlier days
stock['day_5'] = stock['day_5'].shift()

stock['day_30'] = stock['Close'].rolling(30).mean()
stock['day_30'] = stock['day_30'].shift()

stock['day_365'] = stock['Close'].rolling(365).mean()
stock['day_365'] = stock['day_365'].shift()

stock

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5,day_30,day_365
16589,1950-01-03,16.660000,16.660000,16.660000,16.660000,1.260000e+06,16.660000,,,
16588,1950-01-04,16.850000,16.850000,16.850000,16.850000,1.890000e+06,16.850000,,,
16587,1950-01-05,16.930000,16.930000,16.930000,16.930000,2.550000e+06,16.930000,,,
16586,1950-01-06,16.980000,16.980000,16.980000,16.980000,2.010000e+06,16.980000,,,
16585,1950-01-09,17.080000,17.080000,17.080000,17.080000,2.520000e+06,17.080000,,,
16584,1950-01-10,17.030001,17.030001,17.030001,17.030001,2.160000e+06,17.030001,16.900000,,
16583,1950-01-11,17.090000,17.090000,17.090000,17.090000,2.630000e+06,17.090000,16.974000,,
16582,1950-01-12,16.760000,16.760000,16.760000,16.760000,2.970000e+06,16.760000,17.022000,,
16581,1950-01-13,16.670000,16.670000,16.670000,16.670000,3.330000e+06,16.670000,16.988000,,
16580,1950-01-16,16.719999,16.719999,16.719999,16.719999,1.460000e+06,16.719999,16.926000,,


## Splitting up the data ##

Because fitting a model requires non-missing values, we need to remove any row containing `NaN` values.

In [6]:
clean_stock = stock.dropna(axis=0).copy()
clean_stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5,day_30,day_365
16224,1951-06-19,22.02,22.02,22.02,22.02,1100000.0,22.02,21.8,21.703333,19.447726
16223,1951-06-20,21.91,21.91,21.91,21.91,1120000.0,21.91,21.9,21.683,19.462411
16222,1951-06-21,21.780001,21.780001,21.780001,21.780001,1100000.0,21.780001,21.972,21.659667,19.476274
16221,1951-06-22,21.549999,21.549999,21.549999,21.549999,1340000.0,21.549999,21.96,21.631,19.489562
16220,1951-06-25,21.290001,21.290001,21.290001,21.290001,2440000.0,21.290001,21.862,21.599,19.502082


Next, let's create the training and test data. As mentioned above, the training data contains rows dated before 01/01/2013, and the test data contains the remaining rows.

In [7]:
from datetime import datetime 
train = clean_stock[clean_stock['Date'] < datetime(year=2013, month=1, day=1)]
test = clean_stock[clean_stock['Date'] >= datetime(year=2013, month=1, day=1)]
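
A quick way to check a date-based split like this one is to confirm that the two parts cover every row and do not overlap in time. The sketch below does that on a toy frame; the dates and prices are made up, and the names `toy`, `toy_train` and `toy_test` are hypothetical.

```python
import pandas as pd

# Toy frame with two rows on each side of the cutoff (hypothetical values)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2012-12-28", "2012-12-31", "2013-01-02", "2013-01-03"]),
    "Close": [1.0, 2.0, 3.0, 4.0],
})

cutoff = pd.Timestamp(year=2013, month=1, day=1)
toy_train = toy[toy["Date"] < cutoff]
toy_test = toy[toy["Date"] >= cutoff]

# Every row lands in exactly one part, and training dates come first
print(len(toy_train), len(toy_test))
print(toy_train["Date"].max() < toy_test["Date"].min())
```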

## Making predictions on closing prices ##

Let's use the Mean Absolute Error as an error metric because it will show us how "close" we were to the price in intuitive terms.
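
As a small illustration of the metric, the sketch below computes the Mean Absolute Error by hand on three made-up prices: it is the average of the absolute differences between actual and predicted values, expressed in the same units as the price.

```python
import numpy as np

# Hypothetical actual and predicted closing prices
actual = np.array([100.0, 102.0, 101.0])
predicted = np.array([98.0, 103.0, 105.0])

# Absolute errors are 2, 1 and 4, so the mean is 7 / 3
absolute_errors = np.abs(actual - predicted)
manual_mae = absolute_errors.mean()

print(manual_mae)
```

An MAE of about 2.33 here means the predictions were off by roughly 2.33 price points on average.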

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

lr = LinearRegression()

lr.fit(train[['day_5']], train['Close'])
predictions = lr.predict(test[['day_5']])
mae_5 = mean_absolute_error(test['Close'], predictions)
mae_5

16.267878754475266

In [9]:
lr.fit(train[['day_30']], train['Close'])
predictions = lr.predict(test[['day_30']])
mae_30 = mean_absolute_error(test['Close'], predictions)
mae_30

31.970513557097345

In [10]:
lr.fit(train[['day_365']], train['Close'])
predictions = lr.predict(test[['day_365']])
mae_365 = mean_absolute_error(test['Close'], predictions)
mae_365

146.631302068694

In [24]:
# try training on several features at once
lr.fit(train[['day_5', 'day_30']], train['Close'])
pred = lr.predict(test[['day_5', 'day_30']])
mae_5_30 = mean_absolute_error(test['Close'], pred)
print("mae_5_30 : ", mae_5_30)

lr.fit(train[['day_5', 'day_365']], train['Close'])
pred = lr.predict(test[['day_5', 'day_365']])
mae_5_365 = mean_absolute_error(test['Close'], pred)
print("mae_5_365 : ", mae_5_365)

lr.fit(train[['day_30', 'day_365']], train['Close'])
pred = lr.predict(test[['day_30', 'day_365']])
mae_30_365 = mean_absolute_error(test['Close'], pred)
print("mae_30_365 : ", mae_30_365)

lr.fit(train[['day_5', 'day_30', 'day_365']], train['Close'])
pred = lr.predict(test[['day_5', 'day_30', 'day_365']])
mae_all = mean_absolute_error(test['Close'], pred)
# print mae_all here (the original cell printed mae_5_30 by mistake)
print("mae_all : ", mae_all)

mae_5_30 :  16.14929996262859
mae_5_365 :  16.13038769289735
mae_30_365 :  30.175302281842477
mae_all :  <re-run to obtain; the original output repeated mae_5_30 here>
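
The repeated fit/predict/score steps above are easy to get wrong when copied by hand. One possible way to reduce that risk is a small loop, sketched below. The helper name `evaluate_features` is hypothetical, and the data is synthetic (random numbers, not the S&P 500 file), so the printed scores say nothing about the real results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error


def evaluate_features(train, test, feature_sets, target="Close"):
    """Fit one LinearRegression per feature set and return {name: MAE}."""
    results = {}
    for features in feature_sets:
        model = LinearRegression()
        model.fit(train[features], train[target])
        predictions = model.predict(test[features])
        results["+".join(features)] = mean_absolute_error(test[target], predictions)
    return results


# Synthetic stand-in data: Close follows day_5 closely and ignores day_30
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "day_5": rng.normal(100, 10, 200),
    "day_30": rng.normal(100, 10, 200),
})
frame["Close"] = frame["day_5"] + rng.normal(0, 1, 200)
toy_train, toy_test = frame.iloc[:150], frame.iloc[150:]

scores = evaluate_features(
    toy_train, toy_test, [["day_5"], ["day_30"], ["day_5", "day_30"]]
)
for name, mae in scores.items():
    print(name, ":", mae)
```

Because each score is stored under a key built from its own feature list, a score cannot be printed under the wrong label.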


## Creating more indicators ##

Let's make more indicators based on the volume of shares traded during the period.
- Indicator 1 : the average volume over the past 5 days.
- Indicator 2 : the average volume over the past year (or 365 days).
- Indicator 3 : the ratio between the average volume over the past 5 days and the average volume over the past year (created after cleaning the data so that the ratio is never computed on missing values).
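
The three volume indicators can be sketched on a toy series as follows. The volumes are made up, and the windows are shortened to 2 and 4 rows (instead of 5 and 365) so the result fits on screen; the column names `short_avg`, `long_avg` and `ratio` are hypothetical.

```python
import pandas as pd

# Toy daily volumes (hypothetical values)
volume = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0])

# Short and long averages, shifted so each row only uses earlier rows
short_avg = volume.rolling(2).mean().shift(1)
long_avg = volume.rolling(4).mean().shift(1)

# Drop rows without enough history, then compute the ratio
toy = pd.DataFrame({"short_avg": short_avg, "long_avg": long_avg}).dropna()
toy["ratio"] = toy["short_avg"] / toy["long_avg"]

print(toy)
```

A ratio above 1 means recent trading has been heavier than the longer-run average, and a ratio below 1 means it has been lighter.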

In [13]:
stock['vol_5'] = 0
stock['vol_365'] = 0
stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5,day_30,day_365,vol_5,vol_365
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,,,,0,0
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,,,,0,0
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,,,,0,0
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,,,,0,0
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,,,,0,0


In [19]:
# average volume over the past 5 and 365 days, shifted so each row only uses earlier days
stock['vol_5'] = stock['Volume'].rolling(5).mean()
stock['vol_5'] = stock['vol_5'].shift()

stock['vol_365'] = stock['Volume'].rolling(365).mean()
stock['vol_365'] = stock['vol_365'].shift()

stock

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5,day_30,day_365,vol_5,vol_365
16589,1950-01-03,16.660000,16.660000,16.660000,16.660000,1.260000e+06,16.660000,,,,,
16588,1950-01-04,16.850000,16.850000,16.850000,16.850000,1.890000e+06,16.850000,,,,,
16587,1950-01-05,16.930000,16.930000,16.930000,16.930000,2.550000e+06,16.930000,,,,,
16586,1950-01-06,16.980000,16.980000,16.980000,16.980000,2.010000e+06,16.980000,,,,,
16585,1950-01-09,17.080000,17.080000,17.080000,17.080000,2.520000e+06,17.080000,,,,,
16584,1950-01-10,17.030001,17.030001,17.030001,17.030001,2.160000e+06,17.030001,16.900000,,,2.046000e+06,
16583,1950-01-11,17.090000,17.090000,17.090000,17.090000,2.630000e+06,17.090000,16.974000,,,2.226000e+06,
16582,1950-01-12,16.760000,16.760000,16.760000,16.760000,2.970000e+06,16.760000,17.022000,,,2.374000e+06,
16581,1950-01-13,16.670000,16.670000,16.670000,16.670000,3.330000e+06,16.670000,16.988000,,,2.458000e+06,
16580,1950-01-16,16.719999,16.719999,16.719999,16.719999,1.460000e+06,16.719999,16.926000,,,2.722000e+06,


In [23]:
# drop missing values
stock2 = stock.dropna(axis=0).copy()

# with missing values removed, create the ratio column
stock2['vol_ratio_5_365'] = stock2['vol_5'] / stock2['vol_365']
stock2

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5,day_30,day_365,vol_5,vol_365,vol_ratio_5_365
16224,1951-06-19,22.020000,22.020000,22.020000,22.020000,1.100000e+06,22.020000,21.800000,21.703333,19.447726,1.196000e+06,1.989479e+06,0.601162
16223,1951-06-20,21.910000,21.910000,21.910000,21.910000,1.120000e+06,21.910000,21.900000,21.683000,19.462411,1.176000e+06,1.989041e+06,0.591240
16222,1951-06-21,21.780001,21.780001,21.780001,21.780001,1.100000e+06,21.780001,21.972000,21.659667,19.476274,1.188000e+06,1.986932e+06,0.597907
16221,1951-06-22,21.549999,21.549999,21.549999,21.549999,1.340000e+06,21.549999,21.960000,21.631000,19.489562,1.148000e+06,1.982959e+06,0.578933
16220,1951-06-25,21.290001,21.290001,21.290001,21.290001,2.440000e+06,21.290001,21.862000,21.599000,19.502082,1.142000e+06,1.981123e+06,0.576441
16219,1951-06-26,21.299999,21.299999,21.299999,21.299999,1.260000e+06,21.299999,21.710000,21.564333,19.513617,1.420000e+06,1.980904e+06,0.716844
16218,1951-06-27,21.370001,21.370001,21.370001,21.370001,1.360000e+06,21.370001,21.566000,21.535000,19.525315,1.452000e+06,1.978438e+06,0.733912
16217,1951-06-28,21.100000,21.100000,21.100000,21.100000,1.940000e+06,21.100000,21.458000,21.522000,19.537041,1.500000e+06,1.974959e+06,0.759509
16216,1951-06-29,20.959999,20.959999,20.959999,20.959999,1.730000e+06,20.959999,21.322000,21.502333,19.548932,1.668000e+06,1.972137e+06,0.845783
16215,1951-07-02,21.100000,21.100000,21.100000,21.100000,1.350000e+06,21.100000,21.204000,21.470667,19.560685,1.746000e+06,1.967753e+06,0.887306


In [27]:
# time to make predictions
train = stock2[stock2['Date'] < datetime(year=2013, month=1, day=1)]
test = stock2[stock2['Date'] >= datetime(year=2013, month=1, day=1)]

lr.fit(train[['vol_5']], train['Close'])
pred = lr.predict(test[['vol_5']])
mae_vol5 = mean_absolute_error(test['Close'], pred)
print('mae_vol5 :' , mae_vol5)

lr.fit(train[['vol_365']], train['Close'])
pred = lr.predict(test[['vol_365']])
mae_vol365 = mean_absolute_error(test['Close'], pred)
print('mae_vol365 :' , mae_vol365)

lr.fit(train[['vol_ratio_5_365']], train['Close'])
pred = lr.predict(test[['vol_ratio_5_365']])
mae_ratio = mean_absolute_error(test['Close'], pred)
print('mae_ratio :' , mae_ratio)

lr.fit(train[['vol_5', 'vol_365']], train['Close'])
pred = lr.predict(test[['vol_5', 'vol_365']])
mae_vol_5_365 = mean_absolute_error(test['Close'], pred)
print('mae_vol_5_365 :' , mae_vol_5_365)

lr.fit(train[['vol_5', 'vol_ratio_5_365']], train['Close'])
pred = lr.predict(test[['vol_5', 'vol_ratio_5_365']])
mae_vol_5_ratio = mean_absolute_error(test['Close'], pred)
print('mae_vol_5_ratio :' , mae_vol_5_ratio)

lr.fit(train[['vol_365', 'vol_ratio_5_365']], train['Close'])
pred = lr.predict(test[['vol_365', 'vol_ratio_5_365']])
mae_vol_365_ratio = mean_absolute_error(test['Close'], pred)
# print mae_vol_365_ratio here (the original cell printed mae_vol_5_ratio by mistake)
print('mae_vol_365_ratio :' , mae_vol_365_ratio)

lr.fit(train[['vol_5', 'vol_365', 'vol_ratio_5_365']], train['Close'])
pred = lr.predict(test[['vol_5', 'vol_365', 'vol_ratio_5_365']])
mae_vol_all = mean_absolute_error(test['Close'], pred)
print('mae_vol_all :' , mae_vol_all)

mae_vol5 : 734.5794880319837
mae_vol365 : 687.1969697261956
mae_ratio : 1455.002851905863
mae_vol_5_365 : 700.3937534996745
mae_vol_5_ratio : 734.418972473799
mae_vol_365_ratio : <re-run to obtain; the original output repeated mae_vol_5_ratio here>
mae_vol_all : 703.8973357656081
