Dataset source (S&P 500 historical prices from Yahoo Finance): https://finance.yahoo.com/q/hp?s=%5EGSPC+Historical+Prices

In [16]:
import pandas
sp500 = pandas.read_csv("sp500.csv")
sp500.head()

Unnamed: 0,date,value
0,1/3/1950,16.66
1,1/4/1950,16.85
2,1/5/1950,16.93
3,1/6/1950,16.98
4,1/9/1950,17.08


In [17]:
sp500.columns

Index([u'date', u'value'], dtype='object')

In [18]:
#sp500 = sp500[sp500["value"] != "."]

In [19]:
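# This prints the last 10 rows -- note where the dataset ends.
print(sp500.tail(10))

# Each row's next-day value is the value from the row below it.
next_day = sp500["value"].iloc[1:]

# The last row needs to be dropped, since it won't have a next day.
# .copy() avoids a SettingWithCopyWarning when the new column is added.
sp500 = sp500.iloc[:-1,:].copy()
# .values assigns the shifted values by position instead of aligning on the index.
sp500["next_day"] = next_day.values
print(sp500.tail(10))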

            date        value
16724  6/21/2016  2088.899902
16725  6/22/2016  2085.449951
16726  6/23/2016  2113.320068
16727  6/24/2016  2037.300049
16728  6/27/2016  2000.540039
16729  6/28/2016  2036.089966
16730  6/29/2016  2070.770020
16731  6/30/2016  2098.860107
16732   7/1/2016  2102.949951
16733   7/5/2016  2088.550049
            date        value     next_day
16723  6/20/2016  2083.250000  2088.899902
16724  6/21/2016  2088.899902  2085.449951
16725  6/22/2016  2085.449951  2113.320068
16726  6/23/2016  2113.320068  2037.300049
16727  6/24/2016  2037.300049  2000.540039
16728  6/27/2016  2000.540039  2036.089966
16729  6/28/2016  2036.089966  2070.770020
16730  6/29/2016  2070.770020  2098.860107
16731  6/30/2016  2098.860107  2102.949951
16732   7/1/2016  2102.949951  2088.550049
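
The same one-row shift can also be written with pandas' `shift`. The following is a minimal sketch on a small toy frame (the name `toy` is hypothetical; the values are copied from the first rows of the dataset shown earlier), not the code that produced the output above.

```python
import pandas as pd

# Toy frame standing in for sp500 (hypothetical name, values from the head() output).
toy = pd.DataFrame({
    "date": ["1/3/1950", "1/4/1950", "1/5/1950", "1/6/1950"],
    "value": [16.66, 16.85, 16.93, 16.98],
})

# shift(-1) moves every value up one row, so each row sees the next row's value.
toy["next_day"] = toy["value"].shift(-1)

# The last row has no next day, so its next_day is NaN; drop it.
toy = toy.dropna(subset=["next_day"]).reset_index(drop=True)

print(toy)
```

Using `shift(-1)` followed by `dropna` keeps the shift and the dropped last row in one place, which makes it harder to misalign the two columns.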


"In order to use the next_day and value columns in a machine learning algorithm, they will need to be converted to be floats first. This is because machine learning algorithms can't deal with strings."

In [20]:
sp500["next_day"] = sp500["next_day"].astype(float)
sp500["value"] = sp500["value"].astype(float)

In [21]:
sp500.head(5)

Unnamed: 0,date,value,next_day
0,1/3/1950,16.66,16.85
1,1/4/1950,16.85,16.93
2,1/5/1950,16.93,16.98
3,1/6/1950,16.98,17.08
4,1/9/1950,17.08,17.030001


Now we use the value column to predict the next_day column in the sp500 dataframe.

In [22]:
# Import the linear regression class
from sklearn.linear_model import LinearRegression

# Initialize the linear regression class.
regressor = LinearRegression()

In [23]:
# We're using 'value' as a predictor, and making predictions for 'next_day'.
# The predictors need to be in a dataframe.
# We pass in a list when we select predictor columns from "sp500" to force pandas not to generate a series.
predictors = sp500[["value"]]
to_predict = sp500["next_day"] #actual values

# Train the linear regression model on our dataset.
regressor.fit(predictors, to_predict)

# Generate a list of predictions with our trained linear regression model
next_day_predictions = regressor.predict(predictors)
print(next_day_predictions) #predicted values

[   16.7319435     16.92196407    17.00197273 ...,  2071.06435428
  2099.15748249  2103.24776928]


So far, we have fit a model and made predictions. Now we need to figure out the error of our model, using mean squared error (MSE). We take each prediction and each actual observed value, and subtract one from the other. Then, we square the resulting differences and add them all together. Finally, we divide that sum by the number of predictions made.

In [25]:
# The actual values are in to_predict, and the predictions are in next_day_predictions.
mse = sum((to_predict - next_day_predictions) ** 2)
mse /= len(next_day_predictions)

In [28]:
mse

67.396946807063571
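
To make the arithmetic concrete, here is a small worked sketch of the same MSE calculation. The three actual and predicted values are made up for illustration, and the names `actual`, `predicted`, and `toy_mse` are hypothetical.

```python
import numpy as np

# Made-up numbers, chosen only to keep the arithmetic easy to follow.
actual = np.array([10.0, 12.0, 15.0])
predicted = np.array([11.0, 12.0, 13.0])

errors = actual - predicted              # differences: -1, 0, 2
squared = errors ** 2                    # squared differences: 1, 0, 4
toy_mse = squared.sum() / len(squared)   # sum of 5 divided by 3 predictions

print(toy_mse)
```

The single error of 2 contributes 4 of the 5 units in the sum, which is why a few large misses can dominate MSE.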

Our model was trained using the same data we predicted on, so the error above cannot tell us whether the model is overfitting.
To check for overfitting, we split the data into training and test sets.

In [29]:
import numpy as np
import random

# Set a random seed to make the shuffle deterministic.
np.random.seed(1)
random.seed(1)

In [30]:
# Randomly shuffle the rows in our dataframe
sp500 = sp500.loc[np.random.permutation(sp500.index)]

In [34]:
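# Select 70% of the dataset to be training data.
highest_train_row = int(sp500.shape[0] * .7)
# Caution: .loc slices by index label, not by position. After the shuffle, this keeps
# every row up to and including the row labeled highest_train_row, which is not
# necessarily 70% of the rows. A positional slice would be sp500.iloc[:highest_train_row].
train = sp500.loc[:highest_train_row,:]
train.head()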

Unnamed: 0,date,value,next_day
5243,12/16/1970,89.720001,90.040001
5176,9/11/1970,82.519997,82.07
6405,7/25/1975,89.290001,88.690002
22,2/2/1950,17.23,17.290001
110,6/12/1950,19.4,19.25


In [35]:
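# Select the remaining rows to be test data.
# As with train, .loc slices by label here; the row labeled highest_train_row
# lands in both train and test, because .loc includes both endpoints.
test = sp500.loc[highest_train_row:,:]
test.head()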

Unnamed: 0,date,value,next_day
11713,7/24/1996,626.650024,631.169983
1579,4/20/1956,47.759998,47.650002
8730,10/4/1984,162.919998,162.679993
8469,9/23/1983,169.509995,170.070007
1016,1/27/1954,26.01,26.02
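
Because the cells above slice with `.loc` on a shuffled index, the cut point depends on where one label happens to land. A positional split with `.iloc` avoids that. This is a minimal sketch on a toy frame (the names `toy`, `shuffled`, `cutoff`, `toy_train`, and `toy_test` are hypothetical), not the split that produced the outputs above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for sp500: 10 rows with a default 0..9 index.
toy = pd.DataFrame({"value": np.arange(10, dtype=float)})

# Shuffle the rows with a seeded generator so the result is repeatable.
rng = np.random.default_rng(1)
shuffled = toy.iloc[rng.permutation(len(toy))]

# Split by position: the first 70% of rows for training, the rest for testing.
cutoff = int(len(shuffled) * 0.7)
toy_train = shuffled.iloc[:cutoff]
toy_test = shuffled.iloc[cutoff:]

print(len(toy_train), len(toy_test))
```

With `.iloc`, the end of the slice is exclusive, so no row appears in both sets and the sizes follow directly from `cutoff`.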


In [33]:
regressor = LinearRegression()

In [43]:
predictors = train[["value"]]
to_predict = train["next_day"]

# Train the linear regression model on our dataset.
regressor.fit(predictors, to_predict)

#To make predictions on the test set.
predictors_test = test[["value"]]
to_predict_test = test["next_day"]

# Generate a list of predictions with our trained linear regression model
predictions = regressor.predict(predictors_test)
print(predictions)

[  626.70460546    47.84310522   162.99743052 ...,    86.55120072
  1109.55080724    19.80448794]


In [44]:
# The actual values are in to_predict_test, and the test set predictions are in predictions.
mse = sum((to_predict_test - predictions) ** 2)
mse /= len(predictions)
mse

63.616044615167588

In [45]:
import matplotlib.pyplot as plt

# Make a scatterplot with the actual values in the testing set
plt.scatter(test["value"], test["next_day"])

# Overlay the model's predictions for the same test values as a line.
plt.plot(test["value"], regressor.predict(test[["value"]]))

plt.show()

Two other common error metrics are root mean squared error (RMSE) and mean absolute error (MAE). Because MSE and RMSE square the errors, they penalize large errors far out of proportion to small ones. MAE, on the other hand, doesn't: every error counts in proportion to its size. MAE can be useful because it is simply the average magnitude of the errors, so it gives a more direct look at the typical error.

In [46]:
# The test set predictions are in the predictions variable.
import math
rmse = math.sqrt(sum((predictions - test["next_day"]) ** 2) / len(predictions))
mae = sum(abs(predictions - test["next_day"])) / len(predictions)
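
To see how the two metrics differ, the sketch below compares them on two made-up sets of predictions that have the same total absolute error. All values and the helper names `rmse_of` and `mae_of` are hypothetical, chosen only for illustration.

```python
import numpy as np

actual = np.array([100.0, 100.0, 100.0, 100.0])

# Two made-up sets of predictions with the same total absolute error (8).
even_misses = np.array([102.0, 102.0, 102.0, 102.0])   # four errors of 2
one_big_miss = np.array([100.0, 100.0, 100.0, 108.0])  # one error of 8

def rmse_of(y_true, y_pred):
    """Square root of the mean of the squared errors."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae_of(y_true, y_pred):
    """Mean of the absolute errors."""
    return np.mean(np.abs(y_true - y_pred))

print(mae_of(actual, even_misses), rmse_of(actual, even_misses))
print(mae_of(actual, one_big_miss), rmse_of(actual, one_big_miss))
```

MAE is identical for both sets, while RMSE is twice as large for the set with the single big miss, which is the disproportionate penalty described above.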