## Predicting the Stock Market

In this project, we worked with data from the S&P500 Index, a stock market index that aggregates the stock prices of 500 large companies. We used historical data on the price of the S&P500 Index to make predictions about future prices. Predicting whether an index will go up or down helps forecast how the stock market as a whole will perform. Since stocks tend to correlate with how well the economy is performing, it can also inform economic forecasts.

### Note: You shouldn't make trades with any models developed in this project. Trading stocks has risks, and nothing in this project constitutes stock trading advice.

The file we worked with is a CSV file containing index prices. Each row is a daily record of the price of the S&P500 Index, covering 1950 to 2015. The dataset is stored in [sphist.csv](https://github.com/syed0019/Predicting_The_Stock_Market/blob/master/sphist.csv).

The columns of the dataset are:

- `Date` -- The date of the record.
- `Open` -- The opening price of the day (when trading starts).
- `High` -- The highest trade price during the day.
- `Low` -- The lowest trade price during the day.
- `Close` -- The closing price for the day (when trading is finished).
- `Volume` -- The number of shares traded.
- `Adj Close` -- The daily closing price, adjusted retroactively to include any corporate actions. Read more [here](http://www.investopedia.com/terms/a/adjusted_closing_price.asp).

We used this dataset to develop a predictive model, training it on data from 1950 to 2012 and making predictions for 2013 to 2015.

In [1]:
# importing libraries
import pandas as pd
from datetime import datetime
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# reading file into dataframe
sphist = pd.read_csv('sphist.csv')
sphist.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068
1,2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
2,2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117
3,2015-12-02,2101.709961,2104.27002,2077.110107,2079.51001,3950640000.0,2079.51001
4,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3712120000.0,2102.629883


In [2]:
sphist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       16590 non-null  object 
 1   Open       16590 non-null  float64
 2   High       16590 non-null  float64
 3   Low        16590 non-null  float64
 4   Close      16590 non-null  float64
 5   Volume     16590 non-null  float64
 6   Adj Close  16590 non-null  float64
dtypes: float64(6), object(1)
memory usage: 907.4+ KB


In [3]:
# converting Date column to pandas datetime
sphist['Date'] = pd.to_datetime(sphist['Date'])

sphist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       16590 non-null  datetime64[ns]
 1   Open       16590 non-null  float64       
 2   High       16590 non-null  float64       
 3   Low        16590 non-null  float64       
 4   Close      16590 non-null  float64       
 5   Volume     16590 non-null  float64       
 6   Adj Close  16590 non-null  float64       
dtypes: datetime64[ns](1), float64(6)
memory usage: 907.4 KB


In [4]:
# sorting rows by the Date column in ascending order (oldest first)
sphist.sort_values('Date', inplace=True)
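
The type conversion and sort in cells 3 and 4 can also be done at load time. A minimal sketch, assuming the same column layout as `sphist.csv`; the inline `sample_csv` text is a stand-in built from rows in the `head()` output above, so the snippet runs without the file:

```python
import io

import pandas as pd

# Stand-in for sphist.csv: three rows copied from the head() output above.
sample_csv = """Date,Open,High,Low,Close,Volume,Adj Close
2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068
2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117
"""

# parse_dates converts the Date column to a datetime dtype while reading.
sample = pd.read_csv(io.StringIO(sample_csv), parse_dates=["Date"])

# Sort oldest-first and rebuild the index so row order and index labels agree.
sample = sample.sort_values("Date").reset_index(drop=True)
print(sample[["Date", "Close"]])
```

Resetting the index is optional; it only makes later positional slicing easier to read.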

In [5]:
# creating new column for the 5-day rolling mean of the closing price (triangular-weighted window)
sphist['5_days_mean'] = sphist.Close.rolling(5, win_type='triang').mean()

# a rolling mean includes the current day's price, so the indicator for a given day should come
# from the window ending the day before, i.e. the rolling mean calculated for 1950-01-03 belongs
# on 1950-01-04, and so on.
# caution: the call below shifts every column of the frame together, so each row's 5_days_mean
# still includes that row's own Close, as the output shows
sphist = sphist.shift(periods=1, freq=None)

sphist.head(10)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,5_days_mean
16589,NaT,,,,,,,
16588,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,
16587,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,
16586,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,
16585,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,
16584,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,16.91
16583,1950-01-10,17.030001,17.030001,17.030001,17.030001,2160000.0,17.030001,16.982222
16582,1950-01-11,17.09,17.09,17.09,17.09,2630000.0,17.09,17.031111
16581,1950-01-12,16.76,16.76,16.76,16.76,2970000.0,16.76,17.018889
16580,1950-01-13,16.67,16.67,16.67,16.67,3330000.0,16.67,16.955556
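
The first non-null `5_days_mean` in the output above can be reproduced by hand. A sketch, assuming `win_type='triang'` weights a 5-row window as 1/3, 2/3, 1, 2/3, 1/3 (these weights are an assumption that matches the value shown, not something stated in the notebook):

```python
import numpy as np

# Closing prices for 1950-01-03 through 1950-01-09, copied from the output above.
closes = np.array([16.66, 16.85, 16.93, 16.98, 17.08])

# Assumed triangular weights for a window of 5; the middle day counts most.
weights = np.array([1, 2, 3, 2, 1]) / 3

# Weighted mean = sum(w * x) / sum(w)
weighted_mean = np.average(closes, weights=weights)
print(round(weighted_mean, 6))  # 16.91
```

Because the window ends on 1950-01-09, the value stored in that row includes that day's own close.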


In [6]:
# creating new column for the rolling mean of the closing price over 365 rows (trading days, not calendar days)
sphist['365_days_mean'] = sphist.Close.rolling(365, win_type='triang').mean()

sphist = sphist.shift(periods=1, freq=None)

sphist.head(10)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,5_days_mean,365_days_mean
16589,NaT,,,,,,,,
16588,NaT,,,,,,,,
16587,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,,
16586,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,,
16585,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,,
16584,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,,
16583,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,16.91,
16582,1950-01-10,17.030001,17.030001,17.030001,17.030001,2160000.0,17.030001,16.982222,
16581,1950-01-11,17.09,17.09,17.09,17.09,2630000.0,17.09,17.031111,
16580,1950-01-12,16.76,16.76,16.76,16.76,2970000.0,16.76,17.018889,
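
One way to keep the current day's close out of an indicator is to shift only the rolling series rather than the whole frame. A minimal sketch on a handful of closes from the output above; it uses an unweighted window for simplicity, and the `demo` frame and `prev_5_mean` column are illustrative names, not part of the notebook:

```python
import pandas as pd

demo = pd.DataFrame({
    "Date": pd.to_datetime([
        "1950-01-03", "1950-01-04", "1950-01-05", "1950-01-06",
        "1950-01-09", "1950-01-10", "1950-01-11",
    ]),
    "Close": [16.66, 16.85, 16.93, 16.98, 17.08, 17.030001, 17.09],
})

# rolling(5).mean() on each row covers that row and the four before it;
# shift(1) moves each result down one row, so a row only sees earlier closes.
demo["prev_5_mean"] = demo["Close"].rolling(5).mean().shift(1)

# Date and Close stay where they are; only the indicator column moved.
print(demo)
```

With this layout the first usable indicator lands on 1950-01-10 and is built from the five closes up to 1950-01-09.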


In [7]:
# calculating ratio of 5 days mean and 365 days mean
sphist['mean_ratio'] = sphist['5_days_mean'] / sphist['365_days_mean']

In [8]:
# creating new column for the 5-day rolling standard deviation of the closing price
sphist['5_days_std'] = sphist.Close.rolling(5, win_type='triang').std()

sphist = sphist.shift(periods=1, freq=None)

sphist.head(10)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,5_days_mean,365_days_mean,mean_ratio,5_days_std
16589,NaT,,,,,,,,,,
16588,NaT,,,,,,,,,,
16587,NaT,,,,,,,,,,
16586,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,,,,
16585,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,,,,
16584,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,,,,
16583,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,,,,
16582,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,16.91,,,0.181157
16581,1950-01-10,17.030001,17.030001,17.030001,17.030001,2160000.0,17.030001,16.982222,,,0.094002
16580,1950-01-11,17.09,17.09,17.09,17.09,2630000.0,17.09,17.031111,,,0.057922


In [9]:
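# creating new column for the 365-day rolling standard deviation of the closing price
# caution: the window passed below is 5, not 365, so the values shown for this
# column in the outputs come from a 5-row window
sphist['365_days_std'] = sphist.Close.rolling(5, win_type='triang').std()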

sphist = sphist.shift(periods=1, freq=None)

sphist.head(10)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,5_days_mean,365_days_mean,mean_ratio,5_days_std,365_days_std
16589,NaT,,,,,,,,,,,
16588,NaT,,,,,,,,,,,
16587,NaT,,,,,,,,,,,
16586,NaT,,,,,,,,,,,
16585,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,,,,,
16584,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,,,,,
16583,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,,,,,
16582,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,,,,,
16581,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,16.91,,,0.181157,0.176445
16580,1950-01-10,17.030001,17.030001,17.030001,17.030001,2160000.0,17.030001,16.982222,,,0.094002,0.082515


In [10]:
# calculating ratio of 5 days std and 365 days std
sphist['std_ratio'] = sphist['5_days_std'] / sphist['365_days_std']
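
Since the indicator cells repeat the same pattern, the window size and the column name can be derived from one variable so they cannot drift apart. A sketch under assumptions: `add_rolling_features` is a hypothetical helper, it uses unweighted windows, it shifts each indicator by one row, and it is run here on synthetic prices rather than on `sphist`:

```python
import numpy as np
import pandas as pd


def add_rolling_features(df, windows=(5, 365), price_col="Close"):
    """Return a copy of df with lagged rolling mean/std columns per window."""
    out = df.copy()
    for w in windows:
        rolled = out[price_col].rolling(w)
        # shift(1) so each row's indicator uses only earlier rows
        out[f"{w}_days_mean"] = rolled.mean().shift(1)
        out[f"{w}_days_std"] = rolled.std().shift(1)
    short_w, long_w = min(windows), max(windows)
    out["mean_ratio"] = out[f"{short_w}_days_mean"] / out[f"{long_w}_days_mean"]
    out["std_ratio"] = out[f"{short_w}_days_std"] / out[f"{long_w}_days_std"]
    return out


# Synthetic random-walk prices, used only to exercise the helper.
rng = np.random.default_rng(0)
prices = pd.DataFrame({"Close": 100 + rng.normal(0, 1, 400).cumsum()})

# Rows without a full 365-row history have NaN indicators and are dropped.
features = add_rolling_features(prices).dropna()
print(features.columns.tolist())
```

Because the column name is built from `w`, a 365-row column cannot silently be computed with a 5-row window.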

In [11]:
# dropping rows dated before Jan 03, 1951, as they don't have enough
# historical data to compute all the indicators; dropna then removes any remaining incomplete rows.
sphist_new = sphist[sphist['Date'] > datetime(year=1951, month=1, day=2)]

sphist_new = sphist_new.dropna(axis=0).copy()

sphist_new.isnull().sum()

Date             0
Open             0
High             0
Low              0
Close            0
Volume           0
Adj Close        0
5_days_mean      0
365_days_mean    0
mean_ratio       0
5_days_std       0
365_days_std     0
std_ratio        0
dtype: int64

In [12]:
# splitting dataframe for training
train_sphist = sphist_new[sphist_new['Date'] < datetime(year=2013, month=1, day=1)]

train_sphist.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,5_days_mean,365_days_mean,mean_ratio,5_days_std,365_days_std,std_ratio
16221,1951-06-18,22.049999,22.049999,22.049999,22.049999,1050000.0,22.049999,21.807778,19.361962,1.126321,0.266029,0.263547,1.009421
16220,1951-06-19,22.02,22.02,22.02,22.02,1100000.0,22.02,21.941111,19.378927,1.132215,0.222431,0.177465,1.253377
16219,1951-06-20,21.91,21.91,21.91,21.91,1120000.0,21.91,22.002222,19.395884,1.134376,0.081157,0.079193,1.024791
16218,1951-06-21,21.780001,21.780001,21.780001,21.780001,1100000.0,21.780001,21.977778,19.412851,1.132125,0.09715,0.097055,1.000983
16217,1951-06-22,21.549999,21.549999,21.549999,21.549999,1340000.0,21.549999,21.881111,19.429809,1.126162,0.17496,0.221798,0.788826


In [13]:
# splitting dataframe for testing
test_sphist = sphist_new[sphist_new['Date'] >= datetime(year=2013, month=1, day=1)]

test_sphist.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,5_days_mean,365_days_mean,mean_ratio,5_days_std,365_days_std,std_ratio
734,2013-01-02,1426.189941,1462.430054,1426.189941,1462.420044,4202600000.0,1462.420044,1419.791111,1338.321983,1.060874,23.117977,19.752132,1.170404
733,2013-01-03,1462.420044,1465.469971,1455.530029,1459.369995,3829730000.0,1459.369995,1431.748888,1338.998255,1.069269,25.39289,24.157556,1.051137
732,2013-01-04,1459.369995,1467.939941,1458.98999,1466.469971,3424290000.0,1466.469971,1447.475559,1339.675576,1.080467,27.39082,30.423109,0.900329
731,2013-01-07,1466.469971,1466.469971,1456.619995,1461.890015,3304970000.0,1461.890015,1458.218886,1340.35488,1.087935,19.184967,17.236841,1.113021
730,2013-01-08,1461.890015,1461.890015,1451.640015,1457.150024,3601600000.0,1457.150024,1462.388889,1341.035839,1.090492,3.366521,3.592537,0.937087


In [14]:
# instantiating a linear model
lr = LinearRegression()

# generating the list of features from the engineered indicator columns (index 7 onward),
# excluding the same-day columns (Open, High, Low, Close, Volume, Adj Close) that
# contain knowledge of the future we don't want to feed the model.
features = list(train_sphist.columns[7:])

# fitting linear model
lr.fit(train_sphist[features], train_sphist['Close'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [15]:
# predicting using linear model
predicted_label = lr.predict(test_sphist[features])

# using mean absolute error as the error metric, because it is expressed in the
# same units as the index price and shows how "close" the predictions were.
mae = mean_absolute_error(test_sphist['Close'], predicted_label)

print('Mean Absolute Error:', mae)
# lr.score is evaluated on the training data here, so this is the in-sample r^2
print('Coefficient of determination (r^2) on the training data:', lr.score(train_sphist[features], train_sphist['Close']))

Mean Absolute Error: 12.725945147984763
Coefficient of determination (r^2) on the training data: 0.9996935495041345
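
For reference, both metrics follow directly from their definitions. A sketch with made-up numbers (the `y_true` and `y_pred` arrays are illustrative, not drawn from the dataset); applying the same R^2 formula to `test_sphist` instead of `train_sphist` would give the out-of-sample figure:

```python
import numpy as np

y_true = np.array([100.0, 102.0, 101.0, 105.0])  # illustrative actual prices
y_pred = np.array([101.0, 101.0, 103.0, 104.0])  # illustrative predictions

# Mean absolute error: average size of the miss, in price units.
mae = np.mean(np.abs(y_true - y_pred))

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
r2 = 1 - ss_res / ss_tot

print(mae, r2)  # 1.25 0.5
```

An R^2 near 1 on the training data says the fit tracks the training prices closely; it does not by itself say how well the model predicts unseen dates.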
