In [1]:
pwd

u'/usr/local/lib/python3.6/site-packages'

In [19]:
import quandl
import pandas as pd
import numpy as np
import datetime

from sklearn.linear_model import LinearRegression
from sklearn import preprocessing, model_selection, svm

In [9]:
df = quandl.get("WIKI/AMZN")

In [10]:
print(df.tail())

               Open     High      Low    Close     Volume  Ex-Dividend  \
Date                                                                     
2018-03-21  1586.45  1590.00  1563.17  1581.86  4667291.0          0.0   
2018-03-22  1565.47  1573.85  1542.40  1544.10  6177737.0          0.0   
2018-03-23  1539.01  1549.02  1495.36  1495.56  7843966.0          0.0   
2018-03-26  1530.00  1556.99  1499.25  1555.86  5547618.0          0.0   
2018-03-27  1572.40  1575.96  1482.32  1497.05  6793279.0          0.0   

            Split Ratio  Adj. Open  Adj. High  Adj. Low  Adj. Close  \
Date                                                                  
2018-03-21          1.0    1586.45    1590.00   1563.17     1581.86   
2018-03-22          1.0    1565.47    1573.85   1542.40     1544.10   
2018-03-23          1.0    1539.01    1549.02   1495.36     1495.56   
2018-03-26          1.0    1530.00    1556.99   1499.25     1555.86   
2018-03-27          1.0    1572.40    1575.96   1482.32     1497.05   

In [11]:
df = df[['Adj. Close']].copy() # copy so that adding a column later does not warn about modifying a slice

Now, let’s set up our forecasting. We want to predict 30 days into the future, so we’ll set a variable forecast_out to 30. Because each row of the dataframe is one trading day, this means 30 trading days rather than 30 calendar days. Next, we create a new column in our dataframe to serve as our label, which in machine learning is the output the model learns to predict. To fill it, we set the Prediction column equal to the Adj. Close column shifted up by 30 rows, so each row’s label is the adjusted close 30 trading days later.

In [12]:
forecast_out = 30 # predicting 30 trading days into the future
df['Prediction'] = df['Adj. Close'].shift(-forecast_out) # label column: Adj. Close shifted 30 rows up

In [13]:
print(df.tail())

            Adj. Close  Prediction
Date                              
2018-03-21     1581.86         NaN
2018-03-22     1544.10         NaN
2018-03-23     1495.56         NaN
2018-03-26     1555.86         NaN
2018-03-27     1497.05         NaN
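
To see what the negative shift does, here is a minimal plain-Python sketch. The six prices and the horizon of 2 are made up for illustration and are not from the AMZN data. It mimics `shift(-n)`: each row's label is the price `n` rows later, and the last `n` rows have no label, which is why the tail of the dataframe shows NaN.

```python
# Toy illustration of shift(-n); the prices and horizon are made up.
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
horizon = 2  # stands in for forecast_out

# The label for row i is the price `horizon` rows later.
# None plays the role of NaN for rows with no future price.
labels = [
    prices[i + horizon] if i + horizon < len(prices) else None
    for i in range(len(prices))
]

for price, label in zip(prices, labels):
    print(price, label)
```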


Our X will be an array of our Adj. Close values, so we drop the Prediction column. We also want to scale our input values. Scaling standardizes each feature to zero mean and unit variance.

In [14]:
X = np.array(df.drop(columns=['Prediction'])) # keyword form; the positional axis argument is not accepted by newer pandas
X = preprocessing.scale(X)
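
As a rough sketch of what the scaling step computes, assuming the default behaviour of `preprocessing.scale` (subtract the column mean, divide by the population standard deviation), the same arithmetic can be written with the standard library. The helper `standardize` and the toy values are hypothetical and only illustrate the formula.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit variance (population std)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (ddof=0)
    return [(v - mean) / std for v in values]

toy_prices = [10.0, 12.0, 14.0, 16.0, 18.0]  # made-up values
scaled = standardize(toy_prices)
print(scaled)
```

After this transformation the values are expressed in standard deviations from the mean rather than in dollars.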

Now, if you printed the dataframe after we created the Prediction column, you saw that the last 30 rows contain NaNs, that is, no label data. We’ll store the inputs for these rows in a new variable, X_forecast, and remove them from the X array.

In [15]:
X_forecast = X[-forecast_out:] # set X_forecast equal to last 30
X = X[:-forecast_out] # remove last 30 from X

To define our y, or output, we set it equal to the array of Prediction values and remove the last 30 rows, whose labels are NaN because the prices 30 days ahead are not yet known.

In [16]:
y = np.array(df['Prediction'])
y = y[:-forecast_out]
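
The slicing above keeps inputs and labels aligned. A small sketch on made-up data (a horizon of 2 instead of 30, hypothetical variable names) shows which rows end up where:

```python
horizon = 2  # stands in for forecast_out
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]   # made-up feature values
labels = [12.0, 13.0, 14.0, 15.0, None, None]   # prices shifted up by `horizon`

features_forecast = prices[-horizon:]  # rows with no label: inputs for the forecast
features = prices[:-horizon]           # rows that do have a label
targets = labels[:-horizon]            # drop the trailing missing labels

# Each remaining feature lines up with the price `horizon` rows later.
for x, y in zip(features, targets):
    print(x, "->", y)
```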

Finally, prediction time! First, we’ll split our data into training and testing sets, with test_size set to 0.2 so that 20% of the data is held out for testing. The training set contains known outputs, or prices, that our model learns from, and the test set is used to check the model’s predictions against prices it has not seen.

In [20]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.2)
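
Conceptually, the split shuffles the rows and sets a fraction aside. The following is a simplified plain-Python sketch of that idea, not scikit-learn's implementation; the helper `split_train_test` and its `seed` argument are hypothetical.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and hold out `test_fraction` of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))  # stand-in for (X, y) pairs
train, test = split_train_test(rows)
print(len(train), len(test))
```

Because the split is random, the score will vary from run to run; passing `random_state` to `train_test_split` makes it reproducible.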

Now we can instantiate our Linear Regression model and fit it to the training data. After training, we “score” the model on the testing data. For LinearRegression, score returns r^2 (the coefficient of determination), which measures how much of the variation in the actual test prices the predicted prices explain; a value of 1.0 is a perfect fit. When I ran the algorithm, I usually got a value above 0.9.

In [21]:
# Training
clf = LinearRegression()
clf.fit(X_train,y_train)
# Testing
confidence = clf.score(X_test, y_test)
print("confidence: ", confidence)

('confidence: ', 0.9879963502179075)
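
For reference, r^2 is one minus the ratio of the squared prediction errors to the squared deviations of the actual values from their mean. A minimal sketch of that formula, with a hypothetical helper `r_squared` and made-up numbers:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # squared errors
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # spread of actual values
    return 1 - ss_res / ss_tot

actual = [100.0, 110.0, 120.0, 130.0]      # made-up test prices
predicted = [102.0, 108.0, 123.0, 129.0]   # made-up model outputs
print(r_squared(actual, predicted))
```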




Lastly, we will predict the next 30 days. The following are the predicted prices for our X_forecast inputs:

In [22]:
forecast_prediction = clf.predict(X_forecast)
print(forecast_prediction)

[1500.03351635 1538.8778759  1550.26329163 1536.3690448  1557.26888357
 1572.75772645 1574.71376425 1590.91486003 1614.24911542 1603.65036707
 1604.15000716 1583.95179065 1591.18062603 1616.01380171 1630.92859001
 1638.75274125 1646.04536047 1674.77998112 1695.50972965 1684.65584594
 1687.65368649 1678.42629074 1667.11528949 1638.67832676 1682.88052901
 1677.93728128 1637.79598362 1586.19485575 1650.29761658 1587.77882115]
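
The predictions come back as a bare array. One way to attach dates to them is to generate the next 30 weekdays after the last row of the data. This is only a sketch: the helper `next_weekdays` is hypothetical, and it assumes every Monday to Friday is a trading day, which ignores market holidays.

```python
import datetime

def next_weekdays(start, count):
    """Return the next `count` Monday-to-Friday dates after `start`."""
    days = []
    current = start
    while len(days) < count:
        current += datetime.timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days.append(current)
    return days

last_date = datetime.date(2018, 3, 27)  # last row of the data shown above
future_dates = next_weekdays(last_date, 30)
print(future_dates[0], future_dates[-1])
```

Pairing `future_dates` with `forecast_prediction` using `zip` would then give one approximate date per predicted price.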
