# Predicting Stock Prices

In this project, we will be working with data from the S&P500 Index. The S&P500 is a stock market index. Before we get into what an index is, we'll need to get into the basics of the stock market.

Some companies are publicly traded, which means that anyone can buy and sell their shares on the open market. A share entitles the owner to some control over the direction of the company, and to some percentage (or share) of the earnings of the company. When you buy or sell shares, it's common to say that you're trading a stock.

The price of a share is based mainly on supply and demand for a given stock. For example, Apple stock has a price of 120 dollars per share as of December 2015 -- http://www.nasdaq.com/symbol/aapl. A stock that is in less demand, like Ford Motor Company, has a lower price -- http://finance.yahoo.com/q?s=F. Stock price is also influenced by other factors, including the number of shares a company has issued.

Stocks are traded daily, and the price can rise or fall from the beginning of a trading day to the end based on demand. Stocks that are in more demand, such as Apple, are traded more often than stocks of smaller companies.

Indexes aggregate the prices of multiple stocks together, and allow you to see how the market as a whole is performing. For example, the Dow Jones Industrial Average aggregates the stock prices of 30 large American companies together. The S&P500 Index aggregates the stock prices of 500 large companies. When an index goes up or down, you can say that the underlying market or sector it represents is also going up or down. For example, if the Dow Jones Industrial Average goes down one day, you can say that American stocks overall went down (i.e., most American stocks went down in price).

We will be using historical data on the price of the S&P500 Index to make predictions about future prices. Predicting whether an index will go up or down will help us forecast how the stock market as a whole will perform. Since stocks tend to correlate with how well the economy as a whole is performing, it can also help us make economic forecasts.

There are also thousands of traders who make money by buying and selling Exchange Traded Funds. ETFs allow you to buy and sell indexes like stocks. This means that you could "buy" the S&P500 Index ETF when the price is low, and sell when it's high to make a profit. Creating a predictive model could allow traders to make money on the stock market.

In this mission, we'll be working with a CSV file containing index prices. Each row in the file contains a daily record of the price of the S&P500 Index from 1950 to 2015.

The columns of the dataset are:

- Date -- The date of the record.
- Open -- The opening price of the day (when trading starts).
- High -- The highest trade price during the day.
- Low -- The lowest trade price during the day.
- Close -- The closing price for the day (when trading is finished).
- Volume -- The number of shares traded.
- Adj Close -- The daily closing price, adjusted retroactively to include any corporate actions.

We'll be using this dataset to develop a predictive model. We will train the model with data from 1950-2012, and try to make predictions from 2013-2015.

## Data Exploration

In [1]:
# importing pandas and numpy
import pandas as pd
import numpy as np
from datetime import timedelta
from datetime import datetime

In [2]:
sp_index = pd.read_csv('sphist.csv')

In [3]:
sp_index.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068
1,2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
2,2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117
3,2015-12-02,2101.709961,2104.27002,2077.110107,2079.51001,3950640000.0,2079.51001
4,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3712120000.0,2102.629883


In [4]:
# reviewing datatypes of all columns 
sp_index.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
Date         16590 non-null object
Open         16590 non-null float64
High         16590 non-null float64
Low          16590 non-null float64
Close        16590 non-null float64
Volume       16590 non-null float64
Adj Close    16590 non-null float64
dtypes: float64(6), object(1)
memory usage: 907.4+ KB


The Date column has the object datatype. Let's convert it to datetime.

In [5]:
sp_index['Date'] = pd.to_datetime(sp_index['Date'])
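
As an alternative, the conversion can happen while the file is read. The sketch below is only an illustration: it uses a small in-memory CSV (two rows rounded from the `head()` output above) instead of the real `sphist.csv`, and the name `sample` is made up for this example.

```python
import io
import pandas as pd

# Two rows in the same column layout as sphist.csv (values rounded)
csv_text = (
    "Date,Open,High,Low,Close,Volume,Adj Close\n"
    "2015-12-07,2090.42,2090.42,2066.78,2077.07,4043820000.0,2077.07\n"
    "2015-12-04,2051.24,2093.84,2051.24,2091.69,4214910000.0,2091.69\n"
)

# parse_dates converts the named column to datetime64 during loading
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(sample.dtypes["Date"])
```

With the real file, the same `parse_dates=["Date"]` argument could be passed to the `pd.read_csv('sphist.csv')` call.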

In [6]:
sp_index.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
Date         16590 non-null datetime64[ns]
Open         16590 non-null float64
High         16590 non-null float64
Low          16590 non-null float64
Close        16590 non-null float64
Volume       16590 non-null float64
Adj Close    16590 non-null float64
dtypes: datetime64[ns](1), float64(6)
memory usage: 907.4 KB


In [7]:
sp_index.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Open,16590.0,482.5709,554.8892,16.66,83.86,144.05,950.7225,2130.36
High,16590.0,485.6242,558.186,16.66,84.595,145.295,956.665,2134.72
Low,16590.0,479.3675,551.3676,16.66,83.14,143.105,941.97,2126.06
Close,16590.0,482.6925,555.0079,16.66,83.86,144.265,950.7975,2130.82
Volume,16590.0,794009900.0,1456582000.0,680000.0,7610000.0,71705000.0,786675000.0,11456230000.0
Adj Close,16590.0,482.6925,555.0079,16.66,83.86,144.265,950.7975,2130.82


Looks like there are no null values. Another way to confirm that is as follows:

In [8]:
sp_index.isnull().sum()

Date         0
Open         0
High         0
Low          0
Close        0
Volume       0
Adj Close    0
dtype: int64

In [9]:
# sorting dataframe by ascending order of date
sp_index = sp_index.sort_values(by='Date')
sp_index = sp_index.reset_index(drop=True)
# creating a copy of the dataframe to use for training
sp_index_new = sp_index.copy()


## Generating Indicators

To aid us in our attempt to predict stock prices, we will add some time-honored technical indicators that are derived from past price data using mathematical formulas. Simply put, these are additional features that we can add to our dataset to improve prediction accuracy. Some typical indicators include:

 - 5 day moving average
 - 365 day moving average
 - ratio of 5 day moving avg to 365 day moving avg
 
We will create functions to calculate these indicators and add them to our dataset

In [10]:
### generating 5 & 365 day moving avg
### we will consider the closing price as our stock price for the day
def moving_avg(df,n):
    moving_avg = []
    for index,row in df.iterrows():
        date_var = row['Date']
        #if row['Date'] < (df.loc[0,'Date']+timedelta(n)):
        if (len(df[df['Date'] <= date_var].index) <= n):
            moving_avg.append(np.nan)
        else:
            moving_avg.append(df.loc[index-n:index-1,'Close'].mean())
    return pd.Series(moving_avg)
    
sp_index['mv_avg_5'] = moving_avg(sp_index,5)
sp_index['mv_avg_365'] = moving_avg(sp_index,365)
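
The row-by-row loop above gets slow on 16,590 rows because it filters the whole dataframe on every iteration. A possible vectorized equivalent, sketched here with pandas rolling windows, is shown below. The helper name `rolling_prev_mean` is hypothetical, and the input is just the first seven closing prices of the dataset (rounded).

```python
import numpy as np
import pandas as pd

def rolling_prev_mean(series, n):
    # rolling(n) includes the current row in each window, so shift(1)
    # moves every result down one row: each row then holds the mean of
    # the n rows *before* it, which avoids using the same day's price.
    return series.rolling(n).mean().shift(1)

# First seven closing prices from January 1950 (rounded)
close = pd.Series([16.66, 16.85, 16.93, 16.98, 17.08, 17.03, 17.09])
print(rolling_prev_mean(close, 5))
```

The first five rows come out as NaN, and the next two match the `mv_avg_5` values the loop produces for 1950-01-10 and 1950-01-11.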
    

In [11]:
sp_index

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,mv_avg_5,mv_avg_365
0,1950-01-03,16.660000,16.660000,16.660000,16.660000,1.260000e+06,16.660000,,
1,1950-01-04,16.850000,16.850000,16.850000,16.850000,1.890000e+06,16.850000,,
2,1950-01-05,16.930000,16.930000,16.930000,16.930000,2.550000e+06,16.930000,,
3,1950-01-06,16.980000,16.980000,16.980000,16.980000,2.010000e+06,16.980000,,
4,1950-01-09,17.080000,17.080000,17.080000,17.080000,2.520000e+06,17.080000,,
...,...,...,...,...,...,...,...,...,...
16585,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3.712120e+09,2102.629883,2087.024023,2035.531178
16586,2015-12-02,2101.709961,2104.270020,2077.110107,2079.510010,3.950640e+09,2079.510010,2090.231982,2035.914082
16587,2015-12-03,2080.709961,2085.000000,2042.349976,2049.620117,4.306490e+09,2049.620117,2088.306006,2036.234356
16588,2015-12-04,2051.239990,2093.840088,2051.239990,2091.689941,4.214910e+09,2091.689941,2080.456006,2036.507343


In [12]:
# calculating ratio of 5 day moving avg to 365 day moving avg
sp_index['ratio_5_365'] = sp_index['mv_avg_5']/sp_index['mv_avg_365']

In [13]:
sp_index

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,mv_avg_5,mv_avg_365,ratio_5_365
0,1950-01-03,16.660000,16.660000,16.660000,16.660000,1.260000e+06,16.660000,,,
1,1950-01-04,16.850000,16.850000,16.850000,16.850000,1.890000e+06,16.850000,,,
2,1950-01-05,16.930000,16.930000,16.930000,16.930000,2.550000e+06,16.930000,,,
3,1950-01-06,16.980000,16.980000,16.980000,16.980000,2.010000e+06,16.980000,,,
4,1950-01-09,17.080000,17.080000,17.080000,17.080000,2.520000e+06,17.080000,,,
...,...,...,...,...,...,...,...,...,...,...
16585,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3.712120e+09,2102.629883,2087.024023,2035.531178,1.025297
16586,2015-12-02,2101.709961,2104.270020,2077.110107,2079.510010,3.950640e+09,2079.510010,2090.231982,2035.914082,1.026680
16587,2015-12-03,2080.709961,2085.000000,2042.349976,2049.620117,4.306490e+09,2049.620117,2088.306006,2036.234356,1.025573
16588,2015-12-04,2051.239990,2093.840088,2051.239990,2091.689941,4.214910e+09,2091.689941,2080.456006,2036.507343,1.021580


In [14]:
sp_index.iloc[4:]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,mv_avg_5,mv_avg_365,ratio_5_365
4,1950-01-09,17.080000,17.080000,17.080000,17.080000,2.520000e+06,17.080000,,,
5,1950-01-10,17.030001,17.030001,17.030001,17.030001,2.160000e+06,17.030001,16.900000,,
6,1950-01-11,17.090000,17.090000,17.090000,17.090000,2.630000e+06,17.090000,16.974000,,
7,1950-01-12,16.760000,16.760000,16.760000,16.760000,2.970000e+06,16.760000,17.022000,,
8,1950-01-13,16.670000,16.670000,16.670000,16.670000,3.330000e+06,16.670000,16.988000,,
...,...,...,...,...,...,...,...,...,...,...
16585,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3.712120e+09,2102.629883,2087.024023,2035.531178,1.025297
16586,2015-12-02,2101.709961,2104.270020,2077.110107,2079.510010,3.950640e+09,2079.510010,2090.231982,2035.914082,1.026680
16587,2015-12-03,2080.709961,2085.000000,2042.349976,2049.620117,4.306490e+09,2049.620117,2088.306006,2036.234356,1.025573
16588,2015-12-04,2051.239990,2093.840088,2051.239990,2091.689941,4.214910e+09,2091.689941,2080.456006,2036.507343,1.021580


## Splitting up the data

In [15]:
sp_index.isnull().sum()

Date             0
Open             0
High             0
Low              0
Close            0
Volume           0
Adj Close        0
mv_avg_5         5
mv_avg_365     365
ratio_5_365    365
dtype: int64

We can see null values in the `mv_avg_365` column and consequently in the `ratio_5_365` column. The indicators are computed over the previous **365** rows, and each row is a trading day rather than a calendar day, so the first 365 rows (everything before **1951-06-19**, given that the dataset starts on **1950-01-03**) don't have enough historical data to compute all the indicators. Let's remove these rows before we proceed.
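
To see why a 365-row window costs more than one calendar year of data, here is a small sketch on synthetic prices. It assumes a weekday-only calendar from `pd.bdate_range`, which ignores market holidays, so the exact date it prints will differ from the real dataset.

```python
import numpy as np
import pandas as pd

n = 365
# Approximate trading calendar: weekdays only, market holidays ignored
dates = pd.bdate_range("1950-01-03", periods=400)
# Synthetic prices, for illustration only
close = pd.Series(np.linspace(16.0, 20.0, len(dates)), index=dates)

# Mean of the previous n rows (the current row is excluded by the shift)
prev_mean = close.rolling(n).mean().shift(1)

first_complete = prev_mean.first_valid_index()
print(prev_mean.isnull().sum())   # rows lacking a full window
print(first_complete)             # first date with a complete window
```

The number of missing rows equals the window length, and the first complete date falls well into 1951 rather than on the one-year anniversary of the first record.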

In [16]:
sp_index = sp_index.dropna(axis=0)

In [17]:
sp_index.isnull().sum()

Date           0
Open           0
High           0
Low            0
Close          0
Volume         0
Adj Close      0
mv_avg_5       0
mv_avg_365     0
ratio_5_365    0
dtype: int64

In [18]:
# resetting the index
sp_index = sp_index.reset_index(drop=True)

In [19]:
sp_index

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,mv_avg_5,mv_avg_365,ratio_5_365
0,1951-06-19,22.020000,22.020000,22.020000,22.020000,1.100000e+06,22.020000,21.800000,19.447726,1.120954
1,1951-06-20,21.910000,21.910000,21.910000,21.910000,1.120000e+06,21.910000,21.900000,19.462411,1.125246
2,1951-06-21,21.780001,21.780001,21.780001,21.780001,1.100000e+06,21.780001,21.972000,19.476274,1.128142
3,1951-06-22,21.549999,21.549999,21.549999,21.549999,1.340000e+06,21.549999,21.960000,19.489562,1.126757
4,1951-06-25,21.290001,21.290001,21.290001,21.290001,2.440000e+06,21.290001,21.862000,19.502082,1.121008
...,...,...,...,...,...,...,...,...,...,...
16220,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3.712120e+09,2102.629883,2087.024023,2035.531178,1.025297
16221,2015-12-02,2101.709961,2104.270020,2077.110107,2079.510010,3.950640e+09,2079.510010,2090.231982,2035.914082,1.026680
16222,2015-12-03,2080.709961,2085.000000,2042.349976,2049.620117,4.306490e+09,2049.620117,2088.306006,2036.234356,1.025573
16223,2015-12-04,2051.239990,2093.840088,2051.239990,2091.689941,4.214910e+09,2091.689941,2080.456006,2036.507343,1.021580


Now that we have a clean dataset, let's proceed to create our training and test data. For the purpose of this project, we will use all rows dated before January 1, 2013 for training and the remaining rows for testing.

In [20]:
# creating training data
train = sp_index[sp_index['Date'] < datetime(year=2013,month=1,day=1)]

#creating test data
test = sp_index[sp_index['Date'] >= datetime(year=2013,month=1,day=1)]
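
A quick way to sanity-check a chronological split like the one above is to confirm that every row lands in exactly one set and that no training date is later than a test date. The sketch below does this on a tiny frame with illustrative values; `frame`, `train_part` and `test_part` are names invented for the example.

```python
import pandas as pd

# Small illustrative frame spanning the split date
frame = pd.DataFrame({
    "Date": pd.to_datetime(["2012-12-28", "2012-12-31", "2013-01-02", "2013-01-03"]),
    "Close": [1402.43, 1426.19, 1462.42, 1459.37],
})

split_date = pd.Timestamp("2013-01-01")
is_train = frame["Date"] < split_date

# A single boolean mask and its negation guarantee the sets do not overlap
train_part = frame[is_train]
test_part = frame[~is_train]

print(len(train_part), len(test_part))
print(train_part["Date"].max() < test_part["Date"].min())
```

Splitting by date instead of sampling rows at random keeps future prices out of the training set.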

In [21]:
train

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,mv_avg_5,mv_avg_365,ratio_5_365
0,1951-06-19,22.020000,22.020000,22.020000,22.020000,1.100000e+06,22.020000,21.800000,19.447726,1.120954
1,1951-06-20,21.910000,21.910000,21.910000,21.910000,1.120000e+06,21.910000,21.900000,19.462411,1.125246
2,1951-06-21,21.780001,21.780001,21.780001,21.780001,1.100000e+06,21.780001,21.972000,19.476274,1.128142
3,1951-06-22,21.549999,21.549999,21.549999,21.549999,1.340000e+06,21.549999,21.960000,19.489562,1.126757
4,1951-06-25,21.290001,21.290001,21.290001,21.290001,2.440000e+06,21.290001,21.862000,19.502082,1.121008
...,...,...,...,...,...,...,...,...,...,...
15481,2012-12-24,1430.150024,1430.150024,1424.660034,1426.660034,1.248960e+09,1426.660034,1437.360010,1326.114028,1.083889
15482,2012-12-26,1426.660034,1429.420044,1416.430054,1419.829956,2.285030e+09,1419.829956,1436.620019,1326.412494,1.083087
15483,2012-12-27,1419.829956,1422.800049,1401.800049,1418.099976,2.830180e+09,1418.099976,1431.228003,1326.716494,1.078775
15484,2012-12-28,1418.099976,1418.099976,1401.579956,1402.430054,2.426680e+09,1402.430054,1427.685986,1326.995836,1.075878


In [22]:
test

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,mv_avg_5,mv_avg_365,ratio_5_365
15486,2013-01-02,1426.189941,1462.430054,1426.189941,1462.420044,4.202600e+09,1462.420044,1418.641992,1327.534055,1.068629
15487,2013-01-03,1462.420044,1465.469971,1455.530029,1459.369995,3.829730e+09,1459.369995,1425.793994,1327.908247,1.073714
15488,2013-01-04,1459.369995,1467.939941,1458.989990,1466.469971,3.424290e+09,1466.469971,1433.702002,1328.224877,1.079412
15489,2013-01-07,1466.469971,1466.469971,1456.619995,1461.890015,3.304970e+09,1461.890015,1443.376001,1328.557617,1.086423
15490,2013-01-08,1461.890015,1461.890015,1451.640015,1457.150024,3.601600e+09,1457.150024,1455.267993,1328.898603,1.095093
...,...,...,...,...,...,...,...,...,...,...
16220,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3.712120e+09,2102.629883,2087.024023,2035.531178,1.025297
16221,2015-12-02,2101.709961,2104.270020,2077.110107,2079.510010,3.950640e+09,2079.510010,2090.231982,2035.914082,1.026680
16222,2015-12-03,2080.709961,2085.000000,2042.349976,2049.620117,4.306490e+09,2049.620117,2088.306006,2036.234356,1.025573
16223,2015-12-04,2051.239990,2093.840088,2051.239990,2091.689941,4.214910e+09,2091.689941,2080.456006,2036.507343,1.021580


## Making Predictions

We will be using the Linear Regression model to make predictions. We will choose `Root Mean Squared Error` as our error metric to evaluate how well our model is predicting. Standardizing the features so that they are on the same scale is a common step before fitting, and `MinMaxScaler` is imported below for that purpose, although the cells that follow fit the model on the unscaled indicators. We will also drop all original columns from the train set, as they contain knowledge of the future and can affect our predictions when the model is used on real-world data.
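
For reference, a scaling step could look roughly like the sketch below. It applies the min-max formula by hand, (x - min) / (max - min), which is what `MinMaxScaler` computes with its default settings. The feature values and the names `train_feats` and `test_feats` are made up for illustration.

```python
import pandas as pd

# Made-up feature values for illustration
train_feats = pd.DataFrame({"mv_avg_5": [20.0, 30.0, 40.0],
                            "ratio_5_365": [1.0, 1.1, 1.2]})
test_feats = pd.DataFrame({"mv_avg_5": [50.0], "ratio_5_365": [1.05]})

# Learn each column's range from the training data only
col_min = train_feats.min()
col_range = train_feats.max() - col_min

train_scaled = (train_feats - col_min) / col_range
# Test rows reuse the training range, so values may fall outside [0, 1]
test_scaled = (test_feats - col_min) / col_range
print(test_scaled)
```

For ordinary least squares with an intercept, rescaling features this way changes the coefficients but not the predictions, so the step matters mainly for regularized or distance-based models.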

In [23]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MinMaxScaler

In [24]:
train = train.drop(['Date'],axis=1)
test = test.drop(['Date'],axis=1)

In [25]:
train

Unnamed: 0,Open,High,Low,Close,Volume,Adj Close,mv_avg_5,mv_avg_365,ratio_5_365
0,22.020000,22.020000,22.020000,22.020000,1.100000e+06,22.020000,21.800000,19.447726,1.120954
1,21.910000,21.910000,21.910000,21.910000,1.120000e+06,21.910000,21.900000,19.462411,1.125246
2,21.780001,21.780001,21.780001,21.780001,1.100000e+06,21.780001,21.972000,19.476274,1.128142
3,21.549999,21.549999,21.549999,21.549999,1.340000e+06,21.549999,21.960000,19.489562,1.126757
4,21.290001,21.290001,21.290001,21.290001,2.440000e+06,21.290001,21.862000,19.502082,1.121008
...,...,...,...,...,...,...,...,...,...
15481,1430.150024,1430.150024,1424.660034,1426.660034,1.248960e+09,1426.660034,1437.360010,1326.114028,1.083889
15482,1426.660034,1429.420044,1416.430054,1419.829956,2.285030e+09,1419.829956,1436.620019,1326.412494,1.083087
15483,1419.829956,1422.800049,1401.800049,1418.099976,2.830180e+09,1418.099976,1431.228003,1326.716494,1.078775
15484,1418.099976,1418.099976,1401.579956,1402.430054,2.426680e+09,1402.430054,1427.685986,1326.995836,1.075878


In [26]:
# removing original columns
train_y = train['Close']
train_X = train.drop(['Close', 'High', 'Low', 'Open', 'Volume', 'Adj Close'],axis=1)
test_y  = test['Close']
test_X  = test.drop(['Close', 'High', 'Low', 'Open', 'Volume', 'Adj Close'],axis=1)

In [27]:
train_X

Unnamed: 0,mv_avg_5,mv_avg_365,ratio_5_365
0,21.800000,19.447726,1.120954
1,21.900000,19.462411,1.125246
2,21.972000,19.476274,1.128142
3,21.960000,19.489562,1.126757
4,21.862000,19.502082,1.121008
...,...,...,...
15481,1437.360010,1326.114028,1.083889
15482,1436.620019,1326.412494,1.083087
15483,1431.228003,1326.716494,1.078775
15484,1427.685986,1326.995836,1.075878


In [28]:
# initializing our model
lr = LinearRegression()
lr.fit(train_X,train_y)
predictions=lr.predict(test_X)
mae = mean_absolute_error(test_y,predictions)
mse = mean_squared_error(test_y,predictions)
rmse = np.sqrt(mean_squared_error(test_y,predictions))


In [29]:
print("mae:",mae)
print("mse:",mse)
print("rmse:",rmse)

mae: 16.125519160928306
mse: 491.87029967385087
rmse: 22.178149148967567
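
The three metrics printed above are all simple functions of the residuals. The following sketch computes them directly with NumPy on made-up numbers; `regression_errors` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))   # average size of the miss
    mse = np.mean(residuals ** 2)      # squaring penalizes large misses more
    rmse = np.sqrt(mse)                # back in the units of the price
    return mae, mse, rmse

# Made-up actual and predicted values
mae, mse, rmse = regression_errors([100.0, 102.0, 101.0], [98.0, 105.0, 101.0])
print(mae, mse, rmse)
```

RMSE is never smaller than MAE on the same predictions, which is consistent with the 22.18 versus 16.13 reported above; a wide gap between the two suggests that a few days have much larger errors than the rest.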


We are getting an RMSE value of 22.18. Let's try adding some more indicators and see if that helps performance.

Adding the following indicators:
- Avg purchase volume for 5 days
- Avg purchase volume for 365 days
- Avg purchase volume for 30 days
- Mean close price for 30 days
- Year
- Month
- Day

Before we do that, we will create a function to easily generate our indicators, and train, predict and validate them

In [30]:
# creating a function to generate indicators
def moving_avg(orig_df,n,col,func):
    moving_avg = []
    for index,row in orig_df.iterrows():
        date_var = row['Date']
        #if row['Date'] < (df.loc[0,'Date']+timedelta(n)):
        if (len(orig_df[orig_df['Date'] <= date_var]) <= n):
            moving_avg.append(np.nan)
        else:
            moving_avg.append(func(orig_df.loc[index-n:index-1,col]))
    return pd.Series(moving_avg)
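
A vectorized version of this generalized helper is possible too, with one detail to watch: `np.std` defaults to the population standard deviation (`ddof=0`), whereas pandas rolling windows default to `ddof=1`. The sketch below, with the hypothetical name `rolling_prev`, passes `ddof=0` so that it matches what the loop computes with `np.std`. The input is the first six volume figures from the dataset.

```python
import numpy as np
import pandas as pd

def rolling_prev(series, n, stat="mean"):
    window = series.rolling(n)
    if stat == "mean":
        out = window.mean()
    elif stat == "std":
        # ddof=0 matches np.std's default; pandas would otherwise use ddof=1
        out = window.std(ddof=0)
    else:
        raise ValueError("stat must be 'mean' or 'std'")
    # shift so each row only sees the n rows before it
    return out.shift(1)

# First six daily volumes from January 1950
volume = pd.Series([1.26e6, 1.89e6, 2.55e6, 2.01e6, 2.52e6, 2.16e6])
prev_std = rolling_prev(volume, 5, "std")
print(prev_std)
```

Leaving `ddof` at the pandas default would produce slightly larger values than the loop, most noticeably for the 5-day window.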




In [31]:
# creating a function to train, predict, and generate error metrics for validation
def train_test_validate(df,cols):
    df = df.dropna(axis=0)
    df = df.reset_index(drop=True)
    # creating training data
    train = df[df['Date'] < datetime(year=2013,month=1,day=1)]

    #creating test data
    test = df[df['Date'] >= datetime(year=2013,month=1,day=1)]
    
    #Importing sklearn classes
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.metrics import mean_absolute_error
    
    #dropping original columns and creating features and target dataset
    train_y = train['Close']
    train_X = train[cols]
    test_y  = test['Close']
    test_X  = test[cols]
    lr = LinearRegression()
    lr.fit(train_X,train_y)
    predictions=lr.predict(test_X)
    mae = mean_absolute_error(test_y,predictions)
    mse = mean_squared_error(test_y,predictions)
    rmse = np.sqrt(mean_squared_error(test_y,predictions))
    print("mae:",mae)
    print("mse:",mse)
    print("rmse:",rmse)
    

In [32]:
# generating date features from the datetime column
sp_index_new['Year'] = sp_index_new['Date'].dt.year
sp_index_new['Month'] = sp_index_new['Date'].dt.month
sp_index_new['Day'] = sp_index_new['Date'].dt.day




In [33]:
# generating indicators
sp_index_new['avg_close_5'] = moving_avg(sp_index_new,5,'Close',np.mean)
sp_index_new['avg_close_365'] = moving_avg(sp_index_new,365,'Close',np.mean)
sp_index_new['ratio_5_365_close'] = sp_index_new['avg_close_5']/sp_index_new['avg_close_365']
sp_index_new['avg_close_30'] = moving_avg(sp_index_new,30,'Close',np.mean)

sp_index_new['avg_purchase_5'] = moving_avg(sp_index_new,5,'Volume',np.mean)
sp_index_new['avg_purchase_365'] = moving_avg(sp_index_new,365,'Volume',np.mean)
sp_index_new['avg_purchase_30'] = moving_avg(sp_index_new,30,'Volume',np.mean)
sp_index_new['ratio_5_365_volume'] = sp_index_new['avg_purchase_5']/sp_index_new['avg_purchase_365']

sp_index_new['std_purchase_5'] = moving_avg(sp_index_new,5,'Volume',np.std)
sp_index_new['std_purchase_365'] = moving_avg(sp_index_new,365,'Volume',np.std)
sp_index_new['ratio_5_365_volume_std'] = sp_index_new['std_purchase_5']/sp_index_new['std_purchase_365']

sp_index_new['std_close_5'] = moving_avg(sp_index_new,5,'Close',np.std)
sp_index_new['std_close_365'] = moving_avg(sp_index_new,365,'Close',np.std)
sp_index_new['ratio_5_365_std_close'] = sp_index_new['std_close_5']/sp_index_new['std_close_365']



In [34]:
sp_index_new.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close', 'Year',
       'Month', 'Day', 'avg_close_5', 'avg_close_365', 'ratio_5_365_close',
       'avg_close_30', 'avg_purchase_5', 'avg_purchase_365', 'avg_purchase_30',
       'ratio_5_365_volume', 'std_purchase_5', 'std_purchase_365',
       'ratio_5_365_volume_std', 'std_close_5', 'std_close_365',
       'ratio_5_365_std_close'],
      dtype='object')

Let's see which features are the most highly correlated and train the model on different combinations of the features

In [49]:
sp_index_new.corr()['Close'][6:].sort_values(ascending=False)

avg_close_5               0.999797
avg_close_30              0.999201
avg_close_365             0.988870
Year                      0.868672
std_close_365             0.816103
avg_purchase_30           0.788295
avg_purchase_365          0.784878
avg_purchase_5            0.782413
std_close_5               0.725241
std_purchase_365          0.684141
std_purchase_5            0.619356
ratio_5_365_std_close     0.087018
ratio_5_365_volume_std    0.070329
ratio_5_365_close         0.047782
Month                     0.011170
Day                      -0.001082
ratio_5_365_volume       -0.012305
Name: Close, dtype: float64
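
The positional slice `[6:]` depends on the column order of the dataframe. A label-based variant, sketched below on a synthetic frame, selects numeric columns explicitly and drops the target by name. All names and values here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic frame: a target, one closely related feature, one unrelated feature
rng = np.random.default_rng(0)
close = np.linspace(100.0, 200.0, 50)
frame = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=50),
    "Close": close,
    "avg_close_5": close + rng.normal(0, 1, 50),
    "Day": rng.integers(1, 29, 50),
})

# Keep numeric columns only, then drop the target by name instead of by position
corr_with_close = (
    frame.select_dtypes("number")
    .corr()["Close"]
    .drop("Close")
    .sort_values(ascending=False)
)
print(corr_with_close)
```

A high correlation with `Close` does not mean each feature adds new information: the moving averages of the closing price are also highly correlated with one another, so stacking them may change the fit very little.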

In [63]:
# training with top 3 features
cols = ['avg_close_5', 'avg_close_365','avg_close_30']
train_test_validate(sp_index_new,cols)

mae: 16.142439643554926
mse: 493.7313030125977
rmse: 22.220065324219856


In [59]:
# training with features > 0.7 correlation score
cols = ['avg_close_5', 'avg_close_365','avg_close_30','std_close_5', 'std_close_365','Year','avg_purchase_30','avg_purchase_5','avg_purchase_365']
train_test_validate(sp_index_new,cols)

mae: 16.16662602470951
mse: 494.3905883344646
rmse: 22.234895734733378


In [60]:
# training with features > 0.85 correlation score
cols = ['avg_close_5', 'avg_close_365','avg_close_30','Year']
train_test_validate(sp_index_new,cols)

mae: 16.186530129335353
mse: 494.416564326206
rmse: 22.23547985374289


In [67]:
# trying another combination of features
cols = ['avg_close_5', 'avg_close_365', 'ratio_5_365_close','std_close_5', 'std_close_365', 'ratio_5_365_std_close','Year','Month','Day','avg_purchase_5',
       'avg_purchase_365', 'ratio_5_365_volume']
train_test_validate(sp_index_new,cols)

mae: 16.17867704166052
mse: 491.99356252208423
rmse: 22.180927900385147


In [68]:
# trying all features
cols = ['Year',
       'Month', 'Day', 'avg_close_5', 'avg_close_365', 'ratio_5_365_close',
       'avg_close_30', 'avg_purchase_5', 'avg_purchase_365', 'avg_purchase_30',
       'ratio_5_365_volume', 'std_purchase_5', 'std_purchase_365',
       'ratio_5_365_volume_std', 'std_close_5', 'std_close_365',
       'ratio_5_365_std_close']
train_test_validate(sp_index_new,cols)

mae: 16.00775550693879
mse: 492.8940739996291
rmse: 22.201217849470083


In [72]:
# training with top 2 features and 1 ratio feature
cols = ['avg_close_5', 'avg_close_365','ratio_5_365_close']
train_test_validate(sp_index_new,cols)

mae: 16.125519160928306
mse: 491.87029967385087
rmse: 22.178149148967567


## Predicting Next-Day Prices

We can see that adding more indicators did not improve the performance of the model. Our best model gave an RMSE value of 22.178.

Let's see if the performance of the model improves if we make predictions for the next day instead of multiple years. For example, train a model using data from 1951-06-19 (the first row with complete indicators) to 2013-01-02, make predictions for 2013-01-03, and then train another model using data from 1951-06-19 to 2013-01-03, make predictions for 2013-01-04, and so on. This more closely simulates what we'd do if we were trading using the algorithm.

In [39]:
def train_test(df, features):
    df = df.dropna(axis=0)
    df = df.reset_index(drop=True)
    rmses_dict = {}
    rmses = []
    for index,row in df.iterrows():
        train  = df[df["Date"] < row['Date']]
        test = df[df["Date"] == row['Date']]
        if len(train) > 0:
            #initialize model
            lr = LinearRegression()
            target = 'Close'

            #Train
            lr.fit(train[features], train[target])

            #Test
            predictions = lr.predict(test[features])

            #Calculate error
            mse = mean_squared_error(test[target], predictions)
            rmse = np.sqrt(mse)
            rmses_dict[row['Date']]=[index,predictions[0],test[target].values[0],rmse]
            rmses.append(rmse)
    return rmses_dict,rmses

In [40]:
features = ['avg_close_5', 'avg_close_365', 'ratio_5_365_close','std_close_5', 'std_close_365','ratio_5_365_std_close','Year','Month','Day','avg_purchase_5','avg_purchase_365', 'ratio_5_365_volume']
rmses_dict,rmses = train_test(sp_index_new, features)

In [41]:
np.mean(rmses)

5.505248144777133

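Voila! That looks like a massive improvement, but the number needs care. The loop starts predicting from the very first rows of the cleaned data, so the average covers every day from 1951 onward. For most of that period the index traded at well under 1,000 points (the median close in the dataset is about 144), so the absolute errors are naturally small, and the earliest models are also fit on only a handful of rows. As the for loop progresses, the training data grows bigger. Let's see what happens when we start predicting only for dates after 01-01-2013. That way our training set will at least contain data from 1951 to 2012.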

In [42]:
# creating a function to make next-day predictions for every date after 2013-01-01
def next_day_prediction(df,cols):
    df = df.dropna(axis=0)
    df = df.reset_index(drop=True)
    rmses = []
    fixed_date = datetime(year=2013,month=1,day=1)
    # iterate only over the trading days that actually exist after the fixed date
    for test_date in df.loc[df['Date'] > fixed_date,'Date']:
        # train on every row strictly before the day being predicted
        train = df[df['Date'] < test_date]
        test = df[df['Date'] == test_date]
        lr = LinearRegression()
        lr.fit(train[cols],train['Close'])
        predictions = lr.predict(test[cols])
        # with a single test row, this RMSE equals the absolute error
        rmse = np.sqrt(mean_squared_error(test['Close'],predictions))
        rmses.append(rmse)
    return rmses




In [43]:
features = ['avg_close_5', 'avg_close_365', 'ratio_5_365_close','std_close_5', 'std_close_365','ratio_5_365_std_close','Year','Month','Day','avg_purchase_5','avg_purchase_365', 'ratio_5_365_volume']
rmses_next_day = next_day_prediction(sp_index_new,features)
np.mean(rmses_next_day)

16.136415292991614

In [79]:
features = ['avg_close_5', 'avg_close_365', 'avg_close_30','avg_purchase_5', 'avg_purchase_365', 'avg_purchase_30', 'Year', 'Month', 'Day']
rmses_next_day = next_day_prediction(sp_index_new,features)
np.mean(rmses_next_day)

16.105216505457363

In [81]:
# trying the winning combination of features we discovered previously
features = ['avg_close_5', 'avg_close_365','ratio_5_365_close']
rmses_next_day = next_day_prediction(sp_index_new,features)
np.mean(rmses_next_day)

16.079620578697412

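The error is higher than the 5.5 we saw when averaging over the full history, but lower than the RMSE of 22.18 we got when predicting the entire test period with a single model. One caveat: each daily RMSE here is computed on a single row, which makes it equal to that day's absolute error, so the averaged figure of about 16.08 is closer in nature to the MAE of 16.13 from the single-model run than to its RMSE.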
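
When comparing these averages with the earlier RMSE, it helps to check what an average of one-row RMSE values actually measures. The sketch below uses made-up prices and predictions to show that it coincides with the mean absolute error, while the RMSE pooled over all days is at least as large.

```python
import numpy as np

# Made-up actual and predicted closes for four days
actual = np.array([1462.4, 1459.4, 1466.5, 1461.9])
predicted = np.array([1450.0, 1465.0, 1460.0, 1470.0])

# "RMSE" computed separately on each single-row test set
per_day_rmse = [np.sqrt(np.mean((a - p) ** 2)) for a, p in zip(actual, predicted)]

mean_of_daily_rmse = np.mean(per_day_rmse)
mae = np.mean(np.abs(actual - predicted))
pooled_rmse = np.sqrt(np.mean((actual - predicted) ** 2))

print(mean_of_daily_rmse, mae, pooled_rmse)
```

To compare the next-day approach with the single-model run on equal terms, one option would be to collect the daily predictions and compute one pooled RMSE over the whole 2013-2015 period.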

## Conclusion

In this project, I used S&P 500 index data ranging from 1950 to 2015 to predict stock prices. The goal was to build a predictive model trained on data from 1950-2012 and to make predictions on data from 2013-2015. I used Linear Regression to develop the model and tried different combinations of technical indicators as features to improve the accuracy of my predictions. I also developed a model to make next-day predictions and found that the average error was lower (about 16.08, against an RMSE of about 22.18 for the multi-year forecast).


## Next steps:

I would like to try other regression models like SVM, XGBoost, Random Forest, and Neural Networks and see if the performance improves.
Ultimately, I would like to make the system real-time by writing an automated script to download the latest data when the market closes, and make predictions for the next day.