## Problem Scenario:

In this project we work with a dataset containing Google stock prices from May 2009 to August 2018. The source provides both AMD and Google datasets; for now we use only the Google data.

Dataset Source: https://www.kaggle.com/gunhee/amdgoogle

We use the stock prices from 2009 to 2017 (9 years) to train the neural network and then predict the stock prices for 2018. This is a regression problem.

To achieve this goal, we will train a **Recurrent Neural Network (LSTM)**. We will use one of the deep learning libraries, **Keras**, to build the neural network.


## Importing the Libraries

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

## Loading the dataset

In [43]:
totaldata = pd.read_csv("dataset/GOOGL.csv")

In [44]:
totaldata.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2009-05-22,198.528534,199.524521,196.196198,196.946945,196.946945,3433700
1,2009-05-26,196.171173,202.702698,195.19519,202.382385,202.382385,6202700
2,2009-05-27,203.023026,206.136139,202.607605,202.982986,202.982986,6062500
3,2009-05-28,204.54454,206.016022,202.507507,205.405411,205.405411,5332200
4,2009-05-29,206.261261,208.823822,205.555557,208.823822,208.823822,5291100


In [45]:
totaldata.tail()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
2330,2018-08-23,1219.880005,1235.170044,1219.349976,1221.160034,1221.160034,1233300
2331,2018-08-24,1226.0,1237.400024,1221.420044,1236.75,1236.75,1222700
2332,2018-08-27,1244.140015,1257.869995,1240.680054,1256.27002,1256.27002,1429000
2333,2018-08-28,1255.900024,1256.560059,1242.969971,1245.859985,1245.859985,1366500
2334,2018-08-29,1255.0,1267.170044,1252.800049,1264.650024,1264.650024,1846300


As the output shows, the dataset spans 2009 to 2018. We plan to train the model on the data from 2009 to 2017 and test it on the 2018 data, so we need to split the data into two parts:

- data_from_2009_to_2017 (starting from May 2009 to December 2017 included)
- data_2018 (starting from January 2018 to August 2018)

In [68]:
# converting the Date column of the DataFrame to datetime format for easy handling

import datetime
totaldata['Date'] = pd.to_datetime(totaldata['Date'])

In [57]:
# the separation date

dec_2017 = '2017-12-31'

### data_from_2009_to_2017

In [48]:
mask = (totaldata['Date'] <= dec_2017)
data_09to17 = totaldata.loc[mask]

In [49]:
data_09to17.tail()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
2163,2017-12-22,1070.0,1071.719971,1067.640015,1068.859985,1068.859985,889400
2164,2017-12-26,1068.640015,1068.859985,1058.640015,1065.849976,1065.849976,918800
2165,2017-12-27,1066.599976,1068.27002,1058.380005,1060.199951,1060.199951,1116200
2166,2017-12-28,1062.25,1064.839966,1053.380005,1055.949951,1055.949951,994200
2167,2017-12-29,1055.48999,1058.050049,1052.699951,1053.400024,1053.400024,1180300


### data_2018

In [52]:
mask = (totaldata['Date'] > dec_2017)
data_18 = totaldata.loc[mask]

In [53]:
data_18.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
2168,2018-01-02,1053.02002,1075.97998,1053.02002,1073.209961,1073.209961,1588300
2169,2018-01-03,1073.930054,1096.099976,1073.430054,1091.52002,1091.52002,1565900
2170,2018-01-04,1097.089966,1104.079956,1094.26001,1095.76001,1095.76001,1302600
2171,2018-01-05,1103.449951,1113.579956,1101.800049,1110.290039,1110.290039,1512500
2172,2018-01-08,1111.0,1119.160034,1110.0,1114.209961,1114.209961,1232200


We will use `data_09to17` for the following steps and keep `data_18` for testing and validation at the end.
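The date-based split can be tried out on a small toy frame. This is a minimal sketch, not the notebook's data: the `toy` DataFrame, its values, and the names `split_date`, `train`, and `test` are made up for illustration; only the mask-based filtering mirrors the cells above.

```python
import pandas as pd

# Hypothetical toy data standing in for GOOGL.csv (values are made up).
toy = pd.DataFrame({
    "Date": ["2017-12-28", "2017-12-29", "2018-01-02", "2018-01-03"],
    "Open": [10.0, 11.0, 12.0, 13.0],
})
toy["Date"] = pd.to_datetime(toy["Date"])

split_date = pd.Timestamp("2017-12-31")  # last day of the training period

# Boolean masks: rows up to the split date go to training, the rest to testing.
train = toy.loc[toy["Date"] <= split_date]
test = toy.loc[toy["Date"] > split_date]

# Every row lands in exactly one of the two parts.
assert len(train) + len(test) == len(toy)
print(train["Date"].max(), test["Date"].min())
```

Because the two masks are complementary (`<=` and `>` against the same date), no row is dropped or duplicated by the split.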

## Data Preprocessing

The dataset contains several columns, as displayed above, but in this project we use only the `Open` stock prices to train our model. For convenience, we create another variable that stores only the required information (the `Open` stock price).

In [63]:
training_set = data_09to17.iloc[:,1:2].values

print(training_set)
print("********************")
print("********************")
print(training_set.shape)

[[ 198.528534]
 [ 196.171173]
 [ 203.023026]
 ...
 [1066.599976]
 [1062.25    ]
 [1055.48999 ]]
********************
********************
(2168, 1)


Now we can see that there is only one column with the `Open` stock prices. There are a total of 2168 stock prices.
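The two-dimensional shape `(2168, 1)` comes from selecting the column with a slice (`1:2`) rather than a single integer. The sketch below illustrates the difference on a made-up frame; `df` and its values are hypothetical, and selecting by label with `df[["Open"]]` is shown only as a possible alternative, not as what this notebook does.

```python
import pandas as pd

# Hypothetical two-column frame (values are made up).
df = pd.DataFrame({"Date": ["d1", "d2", "d3"], "Open": [1.0, 2.0, 3.0]})

as_frame = df.iloc[:, 1:2]   # slice -> DataFrame, .values has shape (3, 1)
as_series = df.iloc[:, 1]    # single integer -> Series, .values has shape (3,)
by_name = df[["Open"]]       # same column selected by label instead of position

print(type(as_frame).__name__, as_frame.values.shape)
print(type(as_series).__name__, as_series.values.shape)
print(list(by_name.columns))
```

Keeping the column two-dimensional matters later, because scikit-learn scalers such as `MinMaxScaler` expect input of shape `(n_samples, n_features)`.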

### Additional Information (Things to Remember!)

In [65]:
print(type(data_09to17))
print(type(data_09to17.iloc[:,1:2]))   
print(type(data_09to17.iloc[:,1:2].values))

# iloc[rangeofRows, rangeofColumns]
# Indexing starts from zero.
# ":" selects the entire range.
# "1:2" selects only the column at index 1, because the upper bound is excluded.
# Mathematical operations are performed on arrays, so it is crucial to convert the data to an array.

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>


## Feature scaling

A neural network generally performs better when all of its training inputs lie in the same, small range. As we can see above, the stock prices vary widely, from about 196 to more than 1,000. So we need to scale the training data so that the values fall within a common range. This process is called Feature Scaling. The two popular methods for feature scaling are:

* **Standardization**

$ x' = \frac{x - \bar{x}}{\sigma} $

where $ x $ is the original feature vector, $ \bar{x} $ is the mean of that feature vector, and $ \sigma $ is its standard deviation.

* **Normalization** (Min-Max normalization)

$ x' = \frac{x - \text{min}(x)}{\text{max}(x)-\text{min}(x)} $

where $ x $ is an original value, $ x' $ is the normalized value.

Normalization is often recommended for RNNs, so we use Min-Max normalization here. You may also experiment with other feature scaling methods.
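To make the two formulas concrete, here is a minimal NumPy sketch applied to a made-up feature vector. The array `x` and the variable names are illustrative only; note that `x.std()` uses the population standard deviation (NumPy's default, `ddof=0`).

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # made-up feature values

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (x - x.mean()) / x.std()

# Min-Max normalization: map the smallest value to 0 and the largest to 1.
normalized = (x - x.min()) / (x.max() - x.min())

print(standardized)
print(normalized)
```

After standardization the values have zero mean and unit standard deviation but no fixed bounds; after normalization they are bounded to the interval from 0 to 1.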

In [66]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range = (0,1)) 
scaled_training_set = scaler.fit_transform(training_set)

scaled_training_set

array([[0.00265813],
       [0.        ],
       [0.00772607],
       ...,
       [0.98148496],
       [0.97657998],
       [0.96895747]])

The `fit` method only computes the min and max values; it does not apply the formula to the training set. The `fit_transform` method computes these values and then applies the min-max formula to the training set in a single step. After the transformation, the values lie between 0 and 1, i.e., the training data (features) are in the range 0 to 1, as shown above.
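The difference between `fit`, `transform`, and `fit_transform` can be seen on a tiny example. This is a sketch with made-up arrays (`train` and `later` are hypothetical names and values). It also assumes that data seen after training would be scaled with the already-fitted scaler via `transform`, which is a common practice rather than something shown in the cells above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[1.0], [3.0], [5.0]])  # made-up training values
later = np.array([[2.0], [6.0]])         # made-up values seen after training

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train)                         # only learns min and max; data is unchanged
print(scaler.data_min_, scaler.data_max_)

scaled_train = scaler.transform(train)    # applies the formula with the learned min/max
scaled_later = scaler.transform(later)    # reuses the training min/max, does not refit

# inverse_transform maps scaled values back to the original scale.
restored = scaler.inverse_transform(scaled_later)
print(scaled_train.ravel(), scaled_later.ravel(), restored.ravel())
```

Values outside the training range (here `6.0`) map outside the interval from 0 to 1, because the scaler keeps the min and max it learned from the training data.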

## Implementation of model