# Predicting Stock Market Prices


In this project, I will work with data from the S&P500 Index. 

I will be using historical data on the price of the S&P500 Index to make predictions about future prices. Predicting whether an index goes up or down helps forecast how the stock market as a whole performs. Since stocks tend to correlate with how well the economy as a whole performs, it can also help with economic forecasts.

I will be working with a CSV file containing index prices. Each row in the file contains a daily record of the price of the S&P500 Index from 1950 to 2015.

The columns of the dataset are:

 - Date -- The date of the record.
 - Open -- The opening price of the day (when trading starts).
 - High -- The highest trade price during the day.
 - Low -- The lowest trade price during the day.
 - Close -- The closing price for the day (when trading is finished).
 - Volume -- The number of shares traded.
 - Adj Close -- The daily closing price, adjusted retroactively to include any corporate actions.

I will be using this dataset to develop a predictive model. I will train the model on data from 1950-2012 and try to make predictions for 2013-2015.

## Reading and Processing the Data

In [3]:
#import all needed libraries
import pandas as pd
from datetime import datetime as dt
import numpy as np

In [4]:
# read file
data = pd.read_csv('sphist.csv')

In [5]:
data.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068
1,2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
2,2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117
3,2015-12-02,2101.709961,2104.27002,2077.110107,2079.51001,3950640000.0,2079.51001
4,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3712120000.0,2102.629883


In [6]:
#converting Date column to datetime for easier comparison
data['Date'] = pd.to_datetime(data['Date'])

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       16590 non-null  datetime64[ns]
 1   Open       16590 non-null  float64       
 2   High       16590 non-null  float64       
 3   Low        16590 non-null  float64       
 4   Close      16590 non-null  float64       
 5   Volume     16590 non-null  float64       
 6   Adj Close  16590 non-null  float64       
dtypes: datetime64[ns](1), float64(6)
memory usage: 907.4 KB


In [8]:
# sort the dataframe chronologically (oldest record first)
data = data.sort_values('Date')

In [9]:
data.head(6)


Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08
16584,1950-01-10,17.030001,17.030001,17.030001,17.030001,2160000.0,17.030001


## Generate Indicators

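Here are some indicators that are interesting to generate for each row. Each one is computed from the closing prices of the preceding trading days only, so a row never uses its own price:

 - The average price over the past 5 days.
 - The average price over the past 30 days.
 - The average price over the past 365 days.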

In [10]:
# add empty indicator columns to the dataframe
# (np.nan is used because the np.NAN alias was removed in NumPy 2.0)
data['5 Days'] = np.nan
data['30 Days'] = np.nan
data['365 Days'] = np.nan
data.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,5 Days,30 Days,365 Days
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,,,
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,,,
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,,,
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,,,
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,,,


In [11]:
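# average closing price of the previous 5 trading days
close_col = data.columns.get_loc('Close')
target_col = data.columns.get_loc('5 Days')
for row in range(4, len(data)-1):
    # mean of the 5 rows ending at `row`, stored on the following row
    average = data.iloc[row-4:row+1, close_col].mean()
    data.iloc[row+1, target_col] = average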

In [12]:
data.iloc[5:10]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,5 Days,30 Days,365 Days
16584,1950-01-10,17.030001,17.030001,17.030001,17.030001,2160000.0,17.030001,16.9,,
16583,1950-01-11,17.09,17.09,17.09,17.09,2630000.0,17.09,16.974,,
16582,1950-01-12,16.76,16.76,16.76,16.76,2970000.0,16.76,17.022,,
16581,1950-01-13,16.67,16.67,16.67,16.67,3330000.0,16.67,16.988,,
16580,1950-01-16,16.719999,16.719999,16.719999,16.719999,1460000.0,16.719999,16.926,,


In [13]:
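# average closing price of the previous 30 trading days
close_col = data.columns.get_loc('Close')
target_col = data.columns.get_loc('30 Days')
for row in range(29, len(data)-1):
    # mean of the 30 rows ending at `row`, stored on the following row
    average = data.iloc[row-29:row+1, close_col].mean()
    data.iloc[row+1, target_col] = average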

In [14]:
data.iloc[28:35]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,5 Days,30 Days,365 Days
16561,1950-02-10,17.24,17.24,17.24,17.24,1790000.0,17.24,17.266,,
16560,1950-02-14,17.059999,17.059999,17.059999,17.059999,2210000.0,17.059999,17.256,,
16559,1950-02-15,17.059999,17.059999,17.059999,17.059999,1730000.0,17.059999,17.204,16.976667,
16558,1950-02-16,16.99,16.99,16.99,16.99,1920000.0,16.99,17.17,16.99,
16557,1950-02-17,17.15,17.15,17.15,17.15,1940000.0,17.15,17.126,16.994667,
16556,1950-02-20,17.200001,17.200001,17.200001,17.200001,1420000.0,17.200001,17.1,17.002,
16555,1950-02-21,17.17,17.17,17.17,17.17,1260000.0,17.17,17.092,17.009333,


In [15]:
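# average closing price of the previous 365 trading days
close_col = data.columns.get_loc('Close')
target_col = data.columns.get_loc('365 Days')
for row in range(364, len(data)-1):
    # mean of the 365 rows ending at `row`, stored on the following row
    average = data.iloc[row-364:row+1, close_col].mean()
    data.iloc[row+1, target_col] = average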

In [16]:
data.iloc[362:368]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,5 Days,30 Days,365 Days
16227,1951-06-14,21.84,21.84,21.84,21.84,1300000.0,21.84,21.546,21.779,
16226,1951-06-15,22.040001,22.040001,22.040001,22.040001,1370000.0,22.040001,21.602,21.753,
16225,1951-06-18,22.049999,22.049999,22.049999,22.049999,1050000.0,22.049999,21.712,21.727333,
16224,1951-06-19,22.02,22.02,22.02,22.02,1100000.0,22.02,21.8,21.703333,19.447726
16223,1951-06-20,21.91,21.91,21.91,21.91,1120000.0,21.91,21.9,21.683,19.462411
16222,1951-06-21,21.780001,21.780001,21.780001,21.780001,1100000.0,21.780001,21.972,21.659667,19.476274
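
The three loops above can also be written with pandas' rolling windows. This is a minimal sketch, not a replacement for the cells above: `add_rolling_means` is a hypothetical helper name, and the small `prices` frame is a stand-in that reuses the first six closing prices from the sorted data. It assumes the frame is already sorted by date, oldest first.

```python
import pandas as pd

# Stand-in for the sorted S&P500 frame: the first six closes shown earlier.
prices = pd.DataFrame({
    "Close": [16.66, 16.85, 16.93, 16.98, 17.08, 17.030001],
})

def add_rolling_means(df, windows=(5, 30, 365)):
    """Add '<n> Days' columns holding the mean Close of the previous n rows."""
    out = df.copy()
    for n in windows:
        # rolling(n).mean() includes the current row, so shift(1) moves every
        # value down one row: each row then only sees earlier trading days.
        out[f"{n} Days"] = out["Close"].rolling(n).mean().shift(1)
    return out

prices = add_rolling_means(prices, windows=(5,))
print(prices)
```

The first five rows stay empty because they have fewer than five earlier prices, and the sixth row receives the mean of the first five closes, which matches the value in the `5 Days` column for 1950-01-10 above.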


Other indicators that can be interesting:

 - The ratio between the average price for the past 5 days, and the average price for the past 365 days.
 - The standard deviation of the price over the past 5 days.
 - The standard deviation of the price over the past 365 days.
 - The ratio between the standard deviation for the past 5 days, and the standard deviation for the past 365 days.
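
These indicators are not computed in this notebook, but they could be sketched with the same rolling-window idea. The example below uses synthetic prices rather than the real data, and the column names are placeholders chosen for illustration. Note that pandas' rolling `std()` computes the sample standard deviation (`ddof=1`) by default.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices standing in for the real data (assumption).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Close": 100 + rng.normal(0, 1, 400).cumsum()})

close = demo["Close"]

# shift(1) keeps the current day's price out of its own indicators.
avg_5 = close.rolling(5).mean().shift(1)
avg_365 = close.rolling(365).mean().shift(1)
std_5 = close.rolling(5).std().shift(1)
std_365 = close.rolling(365).std().shift(1)

demo["5/365 Avg Ratio"] = avg_5 / avg_365
demo["5 Days Std"] = std_5
demo["365 Days Std"] = std_365
demo["5/365 Std Ratio"] = std_5 / std_365

print(demo.iloc[363:368])
```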

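Some of the indicators use 365 days of historical data, and the dataset starts on 1950-01-03. Because the windows above count rows (trading days) rather than calendar days, the first row with a full 365-row history is 1951-06-19. Any rows that fall before that date don't have enough historical data to compute all the indicators and need to be removed.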

In [25]:
# remove all rows that contain NaN values
data.dropna(inplace=True)

In [26]:
# divide dataframe into train and test dataframes
train = data[data['Date'] < dt(year=2013, month=1, day=1)]

test = data[data['Date'] >= dt(year=2013, month=1, day=1)]

In [27]:
train.tail()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,5 Days,30 Days,365 Days
743,2012-12-24,1430.150024,1430.150024,1424.660034,1426.660034,1248960000.0,1426.660034,1437.36001,1405.926001,1326.114028
742,2012-12-26,1426.660034,1429.420044,1416.430054,1419.829956,2285030000.0,1419.829956,1436.620019,1407.486336,1326.412494
741,2012-12-27,1419.829956,1422.800049,1401.800049,1418.099976,2830180000.0,1418.099976,1431.228003,1408.813,1326.716494
740,2012-12-28,1418.099976,1418.099976,1401.579956,1402.430054,2426680000.0,1402.430054,1427.685986,1410.265332,1326.995836
739,2012-12-31,1402.430054,1426.73999,1398.109985,1426.189941,3204330000.0,1426.189941,1419.434009,1411.830001,1327.261562


In [28]:
test.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,5 Days,30 Days,365 Days
738,2013-01-02,1426.189941,1462.430054,1426.189941,1462.420044,4202600000.0,1462.420044,1418.641992,1414.258667,1327.534055
737,2013-01-03,1462.420044,1465.469971,1455.530029,1459.369995,3829730000.0,1459.369995,1425.793994,1417.676668,1327.908247
736,2013-01-04,1459.369995,1467.939941,1458.98999,1466.469971,3424290000.0,1466.469971,1433.702002,1420.092668,1328.224877
735,2013-01-07,1466.469971,1466.469971,1456.619995,1461.890015,3304970000.0,1461.890015,1443.376001,1422.714665,1328.557617
734,2013-01-08,1461.890015,1461.890015,1451.640015,1457.150024,3601600000.0,1457.150024,1455.267993,1425.076664,1328.898603


## Make Predictions

In [22]:
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression

In [29]:
#Train a linear regression model, using the train dataframe
lr = LinearRegression()
train_cols = ['5 Days', '30 Days', '365 Days']
lr.fit(train[train_cols], train['Close'])

predictions = lr.predict(test[train_cols])

In [33]:
#Compute the error between the predictions and the Close column of test
mae = mean_absolute_error(test['Close'], predictions)
print(mae)

16.142439643554876
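
For reference, the mean absolute error is the average of the absolute differences between the actual and predicted values, so it is expressed in the same units as the index price. The sketch below computes it by hand with NumPy. The numbers are illustrative only and are not output from the model above.

```python
import numpy as np

# Illustrative values only (assumption): three actual closes and three
# made-up predictions, not results from the fitted model.
actual = np.array([1462.42, 1459.37, 1466.47])
predicted = np.array([1450.00, 1465.00, 1460.00])

# MAE = mean(|actual - predicted|)
manual_mae = np.mean(np.abs(actual - predicted))
print(manual_mae)
```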


Some additional indicators that might be helpful to compute:

 - The ratio between the average volume for the past five days, and the average volume for the past year.
 - The standard deviation of the average volume over the past five days.
 - The standard deviation of the average volume over the past year.
 - The ratio between the standard deviation of the average volume for the past five days, and the standard deviation of the average volume for the past year.
 - The year component of the date.
 - The ratio between the lowest price in the past year and the current price.
 - The ratio between the highest price in the past year and the current price.
 - The month component of the date.
 - The day of week.
 - The day component of the date.
 - The number of holidays in the prior month.
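
A few of these could be sketched as follows. This is a rough illustration on a five-row sample taken from the `Date` and `Volume` values shown at the top of the notebook. The column names and the 2-row window are placeholders: a real run would use 5-row and 365-row windows on the full frame.

```python
import pandas as pd

# Five rows from the dataset, sorted oldest first.
demo = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2015-12-01", "2015-12-02", "2015-12-03", "2015-12-04", "2015-12-07"]
    ),
    "Volume": [3712120000.0, 3950640000.0, 4306490000.0,
               4214910000.0, 4043820000.0],
})

# Date components are known in advance, so they need no shifting.
demo["Year"] = demo["Date"].dt.year
demo["Month"] = demo["Date"].dt.month
demo["Day"] = demo["Date"].dt.day
demo["Day of Week"] = demo["Date"].dt.dayofweek  # Monday=0 ... Sunday=6

# Average volume of the previous 2 rows (short window for this tiny sample);
# shift(1) excludes the current day's volume.
demo["Avg Volume 2"] = demo["Volume"].rolling(2).mean().shift(1)

print(demo)
```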