
# **Stock price prediction using Recurrent Neural Networks**


*   A time series is a collection of data points indexed by the time at which they were collected. The data is most often recorded at regular time intervals.

*   In practice, predicting future values of a time series is a very common problem. Predicting next week's weather, stock prices, tomorrow's Bitcoin price, the amount of your Christmas sales and potential heart disease are common examples of this.

*   Recurrent neural networks (RNNs) can predict, or classify, the next value(s) in a series. A series is stored as a matrix in which each row is a feature vector describing one time step. The order of the rows in the matrix is essential.

*   A time series is just one type of sequence. We have to cut the time series into smaller sequences so that our RNN models can use them for training.

*   Classic RNNs struggle with long-term dependencies. The beginning of the sequences that we use for training tends to be "forgotten" because more recent states dominate.

*   In general, these problems can be overcome by using gated RNNs. They can store information for later use, much like having a memory. The network learns to read from, write to, and erase this memory.

*   **The two most commonly used gated RNNs are Long Short-Term Memory (LSTM) networks and Gated Recurrent Unit (GRU) networks.** We will try both of these RNNs for our application and select one of them.
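
The idea of cutting a time series into smaller sequences can be sketched as follows. This is only a minimal illustration on a made-up one-dimensional series; the helper name `make_windows` and the sample prices are assumptions for the example and are not part of the project code.

```python
import numpy as np


def make_windows(series, window):
    """Split a 1-D series into (inputs, targets) pairs.

    Each input holds `window` consecutive values and the target is the
    value that immediately follows them.
    """
    series = np.asarray(series, dtype=float)
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])
        targets.append(series[start + window])
    return np.array(inputs), np.array(targets)


# Hypothetical prices, oldest first.
prices = [10.0, 11.0, 12.0, 13.0, 14.0]
x, y = make_windows(prices, window=3)

print(x)  # two windows: 10, 11, 12 and 11, 12, 13
print(y)  # their targets: 13 and 14
```

Each window becomes one training sample, so a series of length `n` yields `n - window` samples.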





# **Dataset preparation and pre-processing:**

**Converting data to time-series and supervised learning problem:**
*   The dataset given to us has feature values recorded for each day.
*   RNNs consume input in the format [batch_size, time_steps, features], a 3-dimensional array.
*   Time steps define how many units back in time you want your network to see. In our experiment the network sees the past 3 days.
*   Features is the number of attributes used to represent each time step. In our case we consider 4 features for each time record: "open", "high", "low" and "volume". The unnecessary date and "close" columns have been removed from the dataset.
*   To retain the order of the dates, an index column is introduced in the dataframe. This lets the prediction graph be plotted from past to present after shuffling, so that the graph does not look cluttered.
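
To illustrate the [batch_size, time_steps, features] layout described above, the following sketch reshapes flat rows of 12 values into a 3-dimensional array. The numbers are placeholders, and the ordering of the values inside a row (oldest day first, four features per day) is an assumption that mirrors the column naming used later in this document.

```python
import numpy as np

# Two hypothetical samples, each a flat row of 3 days x 4 features,
# with the oldest day first.
flat = np.arange(24, dtype=float).reshape(2, 12)

time_steps, n_features = 3, 4
# -1 lets NumPy infer the batch size from the number of rows.
batch = flat.reshape(-1, time_steps, n_features)

print(batch.shape)  # (2, 3, 4)
print(batch[0, 0])  # oldest day of the first sample: values 0 to 3
print(batch[0, 2])  # most recent day of the first sample: values 8 to 11
```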


The function below prepares the train and test sets from the original dataset.



```
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle


# function to create train and test data sets from the given original dataset
def create_train_test_data():
    # reading the original csv file
    data = pd.read_csv("./data/q2_dataset.csv")

    # ignoring the two unnecessary columns of the dataset:
    # Date and Close/Last do not contribute anything to our problem
    data = data[data.columns[[2, 3, 4, 5]]]

    shift_window = -3

    # placing the 3 previous days next to the current day in every row
    dataframe = pd.concat([data.shift(shift_window),
                           data.shift(shift_window + 1),
                           data.shift(shift_window + 2),
                           data], axis=1)
    dataframe.columns = ['v3', 'o3', 'h3', 'l3', 'v2', 'o2', 'h2', 'l2',
                         'v1', 'o1', 'h1', 'l1', 'v0', 'label', 'h0', 'l0']

    # ignoring the last 3 rows of the processed data as they are not valid
    # and have NaN in them
    dataframe = dataframe.iloc[:dataframe.shape[0] + shift_window]

    # ignoring the volume, high and low columns of the current day,
    # as only its opening price is needed (as the label)
    dataframe = dataframe[['v3', 'o3', 'h3', 'l3', 'v2', 'o2', 'h2', 'l2',
                           'v1', 'o1', 'h1', 'l1', 'label']].copy()

    # index column to keep track of the date order after shuffling
    dataframe['index'] = [-i for i in range(dataframe.shape[0])]
    # shuffling the data using sklearn shuffle function
    # (seeded as well, otherwise the split below is not reproducible)
    dataframe = shuffle(dataframe, random_state=100)

    # splitting data to train and test parts
    # using random state so that the results can be replicated
    train_data, test_data = train_test_split(dataframe, test_size=0.3,
                                             random_state=100)

    # storing the split data into respective files
    train_data.to_csv('./data/train_data_RNN.csv', index=False)
    test_data.to_csv('./data/test_data_RNN.csv', index=False)

    print("train and test data set creation is finished")
```

**What the above function does:**

*   As the first step, we read the given dataset from the ./data/ folder into a pandas dataframe.
*   Then we remove the unnecessary columns from the dataset.
*   Then we use a shifting window to create a dataset that combines the past 3 days' data with the current day's data. This gives 16 columns in total, ordered from past to present.
*   The current day's "open" value is chosen as the label and the remaining 3 features of the current day are removed. Shape of the dataframe: (samples, 13).
*   Next, an index column is added to the dataframe so that the order of the dates can be tracked after shuffling the data. Shape of the dataframe: (samples, 14).
*   Then we use the scikit-learn train_test_split utility function to randomly split the data into 70% train and 30% test data.
*   These dataframes are then stored in the data folder under the names "train_data_RNN.csv" and "test_data_RNN.csv".
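
To make the shifting-window step concrete, here is a small sketch on a made-up table with two columns instead of four. It assumes the rows are ordered with the most recent day first, which is what the negative shift in the function above implies; the values and the reduced column set are purely illustrative.

```python
import pandas as pd

# Hypothetical data, most recent day in row 0.
data = pd.DataFrame({"volume": [50, 40, 30, 20, 10],
                     "open": [5.0, 4.0, 3.0, 2.0, 1.0]})

# shift(-k) moves the row that lies k positions further down (k days older)
# up into the current row.
framed = pd.concat([data.shift(-3), data.shift(-2), data.shift(-1), data],
                   axis=1)
framed.columns = ["v3", "o3", "v2", "o2", "v1", "o1", "v0", "label"]

# The last 3 rows have no full 3-day history and contain NaN.
framed = framed.iloc[:-3]

# Only the opening price of the current day is kept, as the label.
framed = framed.drop(columns=["v0"])

print(framed)
```

The first remaining row holds the opening prices 2.0, 3.0 and 4.0 of the three previous days as features and 5.0 as the label.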

### **Pre-processing:**

The data stored in the train and test files is not normalized, because the unnormalized values are needed later to invert the transformation of the predicted values.
<br>
<br>


**In train_RNN.py:**
*    The "train_data_RNN.csv" file is read into a dataframe and the index column is dropped.
*    Then the dataframe is normalized using a MinMax scaler to bring all the values into the range (0, 1). This step is taken because the columns of the data have different ranges, which affects the ability of the model to learn consistently.
*    Then the train features and labels are created by selecting the past 3 days' features as features and the current day's opening price as the label.
*    During training the order of the dates is not needed, so the index column is not used.
*    Then the data is reshaped to a 3D array so that it can be fed to the RNN. Shape of x_train: (samples, 3, 4).




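```
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def load_train_data():
    train_data = pd.read_csv("./data/train_data_RNN.csv")
    train_data = np.array(train_data)

    # fitting the scaler on the 12 feature columns and the label column;
    # the index column (column 13) is left out
    scaler = MinMaxScaler(feature_range=(0, 1)).fit(train_data[:, 0:13])
    train_data = scaler.transform(train_data[:, 0:13])

    x_train = train_data[:, 0:12]
    # reshaping to (samples, time_steps, features); -1 infers the sample count
    x_train = np.reshape(x_train, (-1, 3, 4))
    y_train = train_data[:, 12]
    y_train = np.asarray(y_train)

    print("finished loading training data")
    return x_train, y_train
```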
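
The min-max normalization applied in `load_train_data` can be sketched by hand as follows. This is a simplified stand-in for the scaler with a (0, 1) feature range, written with plain NumPy; the function names and the sample values are assumptions for illustration, and the sketch does not handle constant columns.

```python
import numpy as np


def min_max_fit(values):
    """Return the per-column minimum and range of a 2-D array."""
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    return col_min, col_range


def min_max_transform(values, col_min, col_range):
    """Map each column to [0, 1] using previously fitted statistics."""
    return (values - col_min) / col_range


# Hypothetical columns with very different ranges (e.g. volume and price).
train = np.array([[100.0, 1.0],
                  [300.0, 2.0],
                  [200.0, 4.0]])

col_min, col_range = min_max_fit(train)
scaled = min_max_transform(train, col_min, col_range)

print(scaled)  # every column now spans 0 to 1
```

Keeping the fitted minimum and range separate from the transform step is what allows the same statistics to be reused on other data later.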

# **Model Selection and Training:**

*   The next step is to select an RNN model, try out different architectures and observe the effect on the training data.

*   Both LSTM and GRU have been used for training, and the results on the training data were observed.

*   The function that creates the model for a given input shape is shown below:


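```
from tensorflow.keras.layers import GRU, LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential


def create_model(input_shape_length):
    model = Sequential()

    # the commented-out layers are the alternative architectures that were tried
    # model.add(GRU(units=50, return_sequences=True, input_shape=(input_shape_length, 4)))
    model.add(GRU(units=50, input_shape=(input_shape_length, 4)))
    # model.add(Dropout(0.2))

    # model.add(LSTM(units=50, return_sequences=True))
    # model.add(Dropout(0.2))

    # model.add(LSTM(units=50, return_sequences=True))
    # model.add(Dropout(0.2))

    # model.add(GRU(units=50))
    # model.add(Dropout(0.2))

    # single output unit for the predicted (normalized) opening price
    model.add(Dense(units=1))

    # accuracy is reported during training but is not meaningful for regression
    model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])

    print("model creation is done")
    return model
```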




**The results are shown below:**

**Number of layers: count of layers excluding the output layer**

**Train loss: computed on the normalized data, so the values are much lower than the losses on un-normalized predictions reported in a later section**

![alt text](https://drive.google.com/uc?id=1gY7wgA1BhbuogY_3pXzf9x6jwS3VESpO)


A simple GRU network with 1 layer and 50 units obtained the lowest train loss compared with the multi-layer LSTM networks and the models with more units. So a single GRU layer is selected as the input layer of the model, followed by a Dense layer with one unit that outputs the predicted value.

**Loss function:** As this is a regression task (predicting a continuous price), mean squared error is used as the loss function.

**Training loop Output:**


![alt text](https://drive.google.com/uc?id=1uFTikCvz-j86s-fIhEtx6V6StHprJnpH)


Only the loss is a meaningful metric for this task. Even though this is a supervised learning task, accuracy cannot be used as a metric, because the network cannot be expected to predict exactly the same value as the label.

So the loss is used as the metric and the accuracy is discarded.
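
The point about accuracy can be illustrated with a few made-up numbers. In the sketch below, exact-match accuracy is used as a simplified stand-in for an accuracy metric; the values are hypothetical and are not taken from the training run.

```python
import numpy as np

# Hypothetical normalized labels and predictions that are close but not equal.
y_true = np.array([0.50, 0.60, 0.70])
y_pred = np.array([0.51, 0.59, 0.71])

# Mean squared error rewards predictions for being close to the label.
mse = float(np.mean((y_true - y_pred) ** 2))

# Exact-match accuracy only counts predictions that equal the label exactly.
exact_match = float(np.mean(y_true == y_pred))

print(mse)          # small, about 1e-4
print(exact_match)  # 0.0, although every prediction is off by only 0.01
```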





# **Testing the trained model:**

**In test_RNN.py:**

**Reading the test data:** 
*   First we read the train and test files stored in the data folder.
*   The train file is read so that the test features can be scaled with the same scaler that was used for the train set.
*   We then sort the test set by the index column so that the data is aligned by date (past to present), and afterwards drop the index column.
*   The next step is normalizing the test data using the MinMax scaler that was fit on the train data.
*   Then the test data is split into features and labels.
*   The test features are then reshaped to a 3D array so that they can be fed to the saved model, which expects 3D input.
*   The function returns x_test, y_test and the MinMax scaler object, which is used to convert the predicted values from the 0-1 range back to the original price range.

```
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def load_test_data():
    train_data = pd.read_csv("./data/train_data_RNN.csv")
    train_data = np.array(train_data)
    test_data = pd.read_csv("./data/test_data_RNN.csv")
    test_data = np.array(test_data)
    # sorting the test rows by the index column (column 13), past to present
    test_data = test_data[np.argsort(test_data[:, 13])]

    # fitting the scaler on the train data only and reusing it for the test data
    scaler_min_max = MinMaxScaler(feature_range=(0, 1)).fit(train_data[:, 0:13])
    test_data = scaler_min_max.transform(test_data[:, 0:13])

    x_test = test_data[:, 0:12]
    # reshaping to (samples, time_steps, features); -1 infers the sample count
    x_test = np.reshape(x_test, (-1, 3, 4))
    y_test = test_data[:, 12]
    y_test = np.asarray(y_test)

    return x_test, y_test, scaler_min_max
```
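
The sorting step in `load_test_data` relies on the index column that was added before shuffling, where 0 marks the most recent row and more negative values mark older rows. A minimal sketch of this reordering, on a made-up array with only a feature, a label and the index column, could look like this:

```python
import numpy as np

# Hypothetical shuffled rows: [feature, label, index].
# Index 0 is the most recent day, -2 the oldest.
shuffled = np.array([[0.2, 12.0, -1.0],
                     [0.1, 11.0, -2.0],
                     [0.3, 13.0, 0.0]])

# argsort on the index column returns the row order from oldest to newest.
order = np.argsort(shuffled[:, 2])
ordered = shuffled[order]

# The index column is only needed for sorting and is dropped afterwards.
ordered = ordered[:, 0:2]

print(ordered)  # labels now run 11, 12, 13 from past to present
```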

**Loading the Saved model and predicting the prices:**

*   Then we load the trained model stored in the models directory.
*   Pass the pre-processed x_test to the model and predict the values.
*   Convert the predicted values back to the original price range.
*   Calculate the mean squared error between the predicted price and the actual opening price.
*   Plot a graph to show the difference between the predictions and the actual values.

```
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.metrics import mean_squared_error

if __name__ == "__main__":
    # 1. Load your saved model
    model = tf.keras.models.load_model("./models/20841154_RNN_model.h5")

    # 2. Load your testing data
    x_test, y_test, scaler_min_max = load_test_data()

    # 3. Run prediction on the test data and output required plot and loss
    predicted_stock_price = model.predict(x_test)

    # the scaler was fit on 13 columns, so inverse_transform cannot be applied
    # to the single label column; column 12 (the label) is inverted manually
    y_test = (y_test * scaler_min_max.data_range_[12] + scaler_min_max.data_min_[12])
    predicted_stock_price = (predicted_stock_price * scaler_min_max.data_range_[12] + scaler_min_max.data_min_[12])

    print("testing Mean Squared Loss :", mean_squared_error(y_test, predicted_stock_price))

    plt.figure(figsize=(16, 12))
    plt.plot(y_test, color='orange', label='Original Stock Price')
    plt.plot(predicted_stock_price, color='green', label='Predicted Stock Price')
    plt.title('Stock Price Prediction', fontweight="bold")
    plt.xlabel('Time (oldest to most recent, left to right)', fontweight="bold")
    plt.xticks([])
    plt.ylabel('Stock Price', fontweight="bold")
    plt.legend()
    plt.show()
```
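
The manual inversion of the label column used above can be checked against the scaler's own `inverse_transform` on a small made-up table. The sketch below assumes a scaler with feature range (0, 1), as in this document; the table values and the position of the label column are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical table: two feature columns followed by a label column.
table = np.array([[1.0, 10.0, 100.0],
                  [2.0, 30.0, 150.0],
                  [3.0, 20.0, 200.0]])

scaler = MinMaxScaler(feature_range=(0, 1)).fit(table)
scaled = scaler.transform(table)

label_col = 2
scaled_labels = scaled[:, label_col]

# Manual inverse for a single column: x = x_scaled * range + min.
manual = scaled_labels * scaler.data_range_[label_col] + scaler.data_min_[label_col]

# Full inverse over all columns, for comparison.
full = scaler.inverse_transform(scaled)[:, label_col]

print(manual)  # the original label values 100, 150 and 200
print(np.allclose(manual, full))
```

The manual form is convenient here because the model outputs only the label column, while `inverse_transform` expects as many columns as the scaler was fit on.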

**Output of testing:**

#### **testing Mean Squared Loss : 11.98907410878154**

The comparison graph:

![alt text](https://drive.google.com/uc?id=1GMbU7-X-J_dJ2KWk-zU9GlqiZkPRJ_VR)



The prediction seems to work quite well despite the small dataset that we have. The mean squared error might decrease if we had more data.


### **What would happen if you used more days for features:**

**Considered 5 days for features:**

**Results obtained: both the train loss and the test loss decreased for the same model selected above.** In this experiment, considering more previous days gave a better prediction.

**Final training loss: 0.000114** (normalized data)

**Final test loss: 8.2133** (un-normalized data)

Graph comparison:

![alt text](https://drive.google.com/uc?id=1DLiVcq44fdnF-eYd-uKgyXFsiiuueJCL)
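
Changing the number of days used as features mainly affects how the rows are framed and how the input is reshaped. The sketch below shows one way the framing could be parameterized. The helper `frame_past_days`, its column naming and the sample data are hypothetical and are not the code used for the results above; it also assumes the rows are ordered with the most recent day first.

```python
import pandas as pd


def frame_past_days(data, n_days, label_column="open"):
    """Build one row per day from the n_days before it plus a label.

    Assumes `data` is ordered with the most recent day in row 0.
    """
    pieces = []
    # Oldest lag first, so the columns run from past to present.
    for lag in range(n_days, 0, -1):
        lagged = data.shift(-lag)
        lagged.columns = [f"{name}_{lag}" for name in data.columns]
        pieces.append(lagged)
    framed = pd.concat(pieces, axis=1)
    framed["label"] = data[label_column]
    # The last n_days rows lack a full history and contain NaN.
    return framed.iloc[:len(data) - n_days]


# Hypothetical data with two features, most recent day first.
history = pd.DataFrame({"volume": [60, 50, 40, 30, 20, 10],
                        "open": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]})

three_days = frame_past_days(history, n_days=3)
five_days = frame_past_days(history, n_days=5)

print(three_days.shape)  # (3, 7): 3 days x 2 features + label
print(five_days.shape)   # (1, 11): 5 days x 2 features + label
```

With the 4 features used in this document, a 5-day window would give 20 feature columns per row, and the model input would be reshaped to (samples, 5, 4) instead of (samples, 3, 4). A longer window also leaves fewer usable rows.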
