# LSTM Stock Predictor Using Closing Prices

## Data Preparation

In this activity, we will prepare the training and testing data for the LSTM model.

We will need to:
1. Use the `window_data` function to generate the X and y values for the model.
2. Split the data into 70% training and 30% testing.
3. Apply the `MinMaxScaler` to the `X` and `y` values.
4. Reshape the `X_train` and `X_test` data for the model.

**Note:** The LSTM expects 3-D input of shape `(samples, time steps, features)`, which we obtain with:

```python
reshape((X_train.shape[0], X_train.shape[1], 1))
```

In [1]:
# Initial imports
import numpy as np
import pandas as pd
from pathlib import Path

%matplotlib inline

In [2]:
# Set the random seed for reproducibility
# Note: This is used for model prototyping, but it is good practice to comment this out and run multiple experiments to evaluate your model.
from numpy.random import seed

seed(1)
from tensorflow import random

random.set_seed(2)

### Data Loading

In this activity, we will use closing prices from different stocks to make predictions of future closing prices based on the temporal data of each stock.

In [3]:
# Load the stocks data
filepath = Path("../Resources/stock_data.csv")
# `infer_datetime_format` is deprecated in pandas 2.x; `parse_dates=True` is enough to parse the index
df = pd.read_csv(
    filepath,
    index_col="date",
    parse_dates=True,
)
df.head()

| date | VIX | Gold | T-Bonds | Junk Bonds | Oil |
|---|---|---|---|---|---|
| 2014-10-20 | 97500.0 | 119.8 | 29.18 | 121.08 | 33.78 |
| 2014-10-21 | 83000.0 | 120.02 | 29.14 | 121.59 | 34.17 |
| 2014-10-22 | 95250.0 | 119.34 | 29.01 | 120.75 | 33.53 |
| 2014-10-23 | 84750.0 | 118.52 | 28.96 | 120.84 | 34.37 |
| 2014-10-24 | 82750.0 | 118.35 | 28.96 | 121.32 | 34.17 |


### Creating the Features `X` and Target `y` Data

The first step towards preparing the data is to create the input features vectors `X` and the target vector `y`. We will use the `window_data()` function to create these vectors.

This function chunks the data with a rolling window: the `window` values that precede _X<sub>t</sub>_ are used as features to predict _X<sub>t</sub>_.

The function returns two `numpy` arrays:

* `X`: The input features vectors.

* `y`: The target vector.

The function has the following parameters:

* `df`: The original DataFrame with the time series data.

* `window`: The window size in days of previous closing prices that will be used for the prediction.

* `feature_col_number`: The column number from the original DataFrame where the features are located.

* `target_col_number`: The column number from the original DataFrame where the target is located.

In [4]:
def window_data(df, window, feature_col_number, target_col_number):
    """
    This function accepts the column number for the features (X) and the target (y).
    It chunks the data up with a rolling window of Xt - window to predict Xt.
    It returns two numpy arrays of X and y.
    """
    X = []
    y = []
    
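    # Slide a window of length `window` over the rows. Note that the extra `- 1`
    # stops one step early, so the final available window/target pair is not used.
    for i in range(len(df) - window - 1):
        # The `window` consecutive values starting at row i are the features
        features = df.iloc[i : (i + window), feature_col_number]
        # The value immediately after the window is the target
        target = df.iloc[(i + window), target_col_number]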
        
        X.append(features)
        y.append(target)
        
    return np.array(X), np.array(y).reshape(-1, 1)
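
To make the windowing concrete, here is a minimal sketch on a toy series. The `toy_df` DataFrame and its `price` column are hypothetical and not part of the dataset; the function body repeats the logic above only so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def window_data(df, window, feature_col_number, target_col_number):
    """Same windowing logic as above, repeated so this sketch is self-contained."""
    X, y = [], []
    for i in range(len(df) - window - 1):
        X.append(df.iloc[i : (i + window), feature_col_number])
        y.append(df.iloc[(i + window), target_col_number])
    return np.array(X), np.array(y).reshape(-1, 1)


# Hypothetical toy series: seven consecutive "prices" in a single column
toy_df = pd.DataFrame({"price": [10, 11, 12, 13, 14, 15, 16]})

toy_X, toy_y = window_data(toy_df, window=3, feature_col_number=0, target_col_number=0)

print(toy_X)  # each row holds three consecutive prices
print(toy_y)  # each target is the price that follows its window
print(toy_X.shape, toy_y.shape)
```

With seven rows and a window of `3`, this should yield three samples: `[10, 11, 12]` predicts `13`, `[11, 12, 13]` predicts `14`, and `[12, 13, 14]` predicts `15`.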

In the forthcoming activities, we will predict closing prices using a `5`-day window of previous _T-Bonds_ closing prices. We therefore create the `X` and `y` vectors by calling the `window_data` function with a window size of `5` and with both the feature and target column numbers set to `2` (the column with the _T-Bonds_ closing prices).
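
One way to double-check the column number is to look it up by name. The sketch below is only an illustration: it inlines the first three rows shown earlier instead of reading the CSV file, and `sample_df` and `tbonds_position` are hypothetical names.

```python
import io

import pandas as pd

# First rows of the dataset as displayed above, inlined so the sketch is self-contained
csv_text = """date,VIX,Gold,T-Bonds,Junk Bonds,Oil
2014-10-20,97500.0,119.8,29.18,121.08,33.78
2014-10-21,83000.0,120.02,29.14,121.59,34.17
2014-10-22,95250.0,119.34,29.01,120.75,33.53
"""

sample_df = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)

# `date` is the index, so positional column numbers start at the first price column
tbonds_position = sample_df.columns.get_loc("T-Bonds")
print(tbonds_position)

# Selecting by position returns the T-Bonds closing prices
print(sample_df.iloc[:, tbonds_position].tolist())
```

Looking the position up by name avoids a hard-coded index silently pointing at the wrong column if the column order ever changes.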

In [5]:
# Creating the features (X) and target (y) data using the window_data() function.
window_size = 5

feature_column = 2
target_column = 2
X, y = window_data(df, window_size, feature_column, target_column)
print(f"X sample values:\n{X[:5]} \n")
print(f"y sample values:\n{y[:5]}")

X sample values:
[[29.18 29.14 29.01 28.96 28.96]
 [29.14 29.01 28.96 28.96 29.06]
 [29.01 28.96 28.96 29.06 29.1 ]
 [28.96 28.96 29.06 29.1  28.88]
 [28.96 29.06 29.1  28.88 28.82]] 

y sample values:
[[29.06]
 [29.1 ]
 [28.88]
 [28.82]
 [28.59]]


In [7]:
df.head(8)

| date | VIX | Gold | T-Bonds | Junk Bonds | Oil |
|---|---|---|---|---|---|
| 2014-10-20 | 97500.0 | 119.8 | 29.18 | 121.08 | 33.78 |
| 2014-10-21 | 83000.0 | 120.02 | 29.14 | 121.59 | 34.17 |
| 2014-10-22 | 95250.0 | 119.34 | 29.01 | 120.75 | 33.53 |
| 2014-10-23 | 84750.0 | 118.52 | 28.96 | 120.84 | 34.37 |
| 2014-10-24 | 82750.0 | 118.35 | 28.96 | 121.32 | 34.17 |
| 2014-10-27 | 81000.0 | 118.06 | 29.06 | 120.93 | 33.81 |
| 2014-10-28 | 71000.0 | 118.1 | 29.1 | 120.93 | 34.03 |
| 2014-10-29 | 73000.0 | 116.41 | 28.88 | 121.14 | 34.55 |


### Splitting Data Between Training and Testing Sets

Because this is time series data, the observations must stay in chronological order. To avoid shuffling the dataset, we will split it manually using array slicing.

In [12]:
type(X)

numpy.ndarray

In [23]:
y[0]

array([29.06])

In [17]:
# Use 70% of the data for training and the remainder for testing
train_length = int(len(X) * 0.7)

X_train = X[:train_length]
X_test = X[train_length:]

y_train, y_test = y[:train_length], y[train_length:]

In [19]:
len(y)

1253

In [20]:
# Sanity check: the two splits together cover every sample
len(X_train) + len(X_test)

1253
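
The same slicing idea can be sketched on toy arrays to show that the order of the samples is preserved. All names prefixed with `toy_` are hypothetical and only serve this illustration.

```python
import numpy as np

# Hypothetical toy data: ten samples already in chronological order
toy_X = np.arange(10).reshape(-1, 1)
toy_y = np.arange(10, 20).reshape(-1, 1)

split_index = int(len(toy_X) * 0.7)  # the first 70% of the samples go to training

toy_X_train, toy_X_test = toy_X[:split_index], toy_X[split_index:]
toy_y_train, toy_y_test = toy_y[:split_index], toy_y[split_index:]

# Every training sample precedes every test sample, so the model is
# evaluated only on data that comes after what it was trained on
print(toy_X_train.ravel())
print(toy_X_test.ravel())
```

A shuffled split (for example, the default behavior of `train_test_split` in `sklearn`) would mix later observations into the training set, which is what slicing avoids here.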

### Scaling Data with `MinMaxScaler`

Once the training and test datasets are created, we need to scale the data before training the LSTM model. We will use the `MinMaxScaler` from `sklearn` to scale the values to the range `0` to `1`. The scaler is fit on the training data only, and the same transformation is then applied to the test data.

Note that we scale both the features and the target sets.

In [24]:
# Use the MinMaxScaler to scale data between 0 and 1.
from sklearn.preprocessing import MinMaxScaler

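scaler = MinMaxScaler()

# Fit on the training features only, then apply the same scaling to both sets
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Refit the same scaler object on the training target and scale both target sets.
# After this step, `scaler` holds the target's scaling parameters.
scaler.fit(y_train)
y_train = scaler.transform(y_train)
y_test = scaler.transform(y_test)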
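
As a rough illustration of what the scaler does, the sketch below fits a `MinMaxScaler` on toy training targets and reuses it for toy test targets. The values and the name `y_scaler` are hypothetical; a separately named scaler is used here only to make its role explicit.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy targets, already split chronologically
toy_y_train = np.array([[10.0], [20.0], [30.0]])
toy_y_test = np.array([[25.0], [40.0]])

# Fit on the training targets only, then reuse the same min/max for the test targets
y_scaler = MinMaxScaler()
toy_y_train_scaled = y_scaler.fit_transform(toy_y_train)
toy_y_test_scaled = y_scaler.transform(toy_y_test)

print(toy_y_train_scaled.ravel())  # training values span 0 to 1
print(toy_y_test_scaled.ravel())   # test values can fall outside [0, 1]

# inverse_transform maps scaled values (e.g. model predictions) back to prices
recovered = y_scaler.inverse_transform(toy_y_test_scaled)
print(recovered.ravel())
```

Because the minimum and maximum come from the training data, a test value above the training maximum (here `40.0`) scales to more than `1`. This is expected and does not indicate an error.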

### Reshape Features Data for the LSTM Model

The Keras LSTM layer expects its input as a 3-D array of shape `(samples, time steps, features)`. Our `X` data is 2-D, with one row per sample and one column per time step, so we add a trailing feature dimension of size `1` using `reshape((X_train.shape[0], X_train.shape[1], 1))`.

Both the training and testing sets are reshaped.

In [29]:
# Reshape the features for the model
X_train_reshape = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test_reshape = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

In [31]:
X_train_reshape.shape

(877, 5, 1)
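
The effect of the reshape can be checked on a small array. This is a minimal sketch with a hypothetical `toy_X` of 4 samples and 5 time steps; it only shows that the values stay in place while a feature axis is added.

```python
import numpy as np

# Hypothetical toy features: 4 samples, each a window of 5 values
toy_X = np.arange(20, dtype=float).reshape(4, 5)

# Add a trailing axis of size 1: (samples, time steps) -> (samples, time steps, features)
toy_X_3d = toy_X.reshape((toy_X.shape[0], toy_X.shape[1], 1))

print(toy_X.shape)     # (4, 5)
print(toy_X_3d.shape)  # (4, 5, 1)

# The values and their order are unchanged; only the array's shape differs
print(np.array_equal(toy_X_3d[:, :, 0], toy_X))  # True
```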

In [28]:
X_train_reshape

array([[[0.8973747 ],
        [0.88782816],
        [0.85680191],
        [0.84486874],
        [0.84486874]],

       [[0.88782816],
        [0.85680191],
        [0.84486874],
        [0.84486874],
        [0.86873508]],

       [[0.85680191],
        [0.84486874],
        [0.84486874],
        [0.86873508],
        [0.87828162]],

       ...,

       [[0.97136038],
        [0.94749403],
        [0.95465394],
        [0.96897375],
        [0.97136038]],

       [[0.94749403],
        [0.95465394],
        [0.96897375],
        [0.97136038],
        [0.96181384]],

       [[0.95465394],
        [0.96897375],
        [0.97136038],
        [0.96181384],
        [0.93078759]]])