# **Stock Price Prediction Using Historical Data and Recurrent Neural Networks (RNNs)** 

### **Author:** Reece Iriye
### **Course:** MATH 4377 (Math of Machine Learning)
### **Section:** Fall 2022, TTh 12:30-1:50PM, 001-LEC
### **Department:** Mathematics

## **Project Description**

This project studies the use of historical data and a recurrent neural network to predict a stock price or an index. I will use regression to predict a stock price or an index, rather than classification to predict whether there is an overall upward or downward trend.

### **Data Generation**

1. Go to http://finance.yahoo.com
2. Search for one (or several, depending on the needs of the model) of the stocks of a company (Apple, Amazon, Microsoft, etc.) or one or several stock indices (S&P 500, Dow 30, Nasdaq, Russell 2000, Crude Oil, etc.).
3. Once the stock or index’s quote page is shown, click the “Historical Data” tab and change the “Time Period” as needed. Note that more data is normally preferable for a neural network.
4. Click “Apply”, then click “Download” below the “Apply” button; a CSV file will be generated and downloaded to your computer.


*Note*: I will need to pick a time period and decide how many data points to produce. I will need to rearrange the data into a time series by lagging it to obtain percentage returns instead of raw closing prices. I will also need to split my data into a training set and a testing set so that validation can be performed.

### **Recurrent Neural Network (RNN)**

I need to set up an RNN model and use the obtained data to predict stock prices. I will consider the following questions when I set up the RNN:


1. What's the overall dimension of the RNN?
2. What is the number of time steps for returns? In predicting n-day returns, what am I specifying as n?
3. How many neurons are in the hidden layer?
4. Do I use an LSTM? If so, what are the related parameters?
5. What is my choice of activation function? Why?
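
One way to reason about questions 1 and 3 is to count the trainable parameters implied by a given choice of layer size. The sketch below uses the standard parameter count for a single LSTM layer with bias terms. The helper name `lstm_param_count` and the choice of 50 hidden units are hypothetical and only for illustration; the 9 input features correspond to the number of predictor columns used later in this notebook.

```python
def lstm_param_count(num_features: int, hidden_units: int) -> int:
    """Trainable parameters in one standard LSTM layer with bias terms."""
    # Each of the four gates (input, forget, cell candidate, output) has
    # an input kernel, a recurrent kernel, and a bias vector.
    per_gate = hidden_units * (num_features + hidden_units) + hidden_units
    return 4 * per_gate


num_features = 9    # predictor return columns used later in this notebook
hidden_units = 50   # hypothetical hidden-layer size, chosen only for illustration

print(lstm_param_count(num_features, hidden_units))  # prints 12000
```

The count grows roughly quadratically with the number of hidden neurons, which is worth keeping in mind given that the training set has only a few thousand daily observations.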

### **Research Objective**

Through this project, I will demonstrate:
1. Cleaning and rearranging data into a form that a neural network can be applied to.
2. Setting up a working RNN model.
3. Further tuning the model by modifying model parameters to finalize an optimized model.
4. Getting *creative* based on accumulated knowledge and skills.

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # for splitting dataset
from sklearn.metrics import mean_squared_error
from tensorflow.keras.models import Sequential   # the RNN model

## **Project Implementation**

### **Generate Necessary Data**

I will predict Target Corporation's (NYSE: TGT) returns in 2019 using historical data from 01/02/2000 to 12/31/2018. Using the Bloomberg Terminal installed on the computers in SMU's business library, I pull the data with the `BDH` Excel function and save it to a CSV file, so that I can load the closing prices into a Pandas DataFrame. I used the Bloomberg Terminal instead of Yahoo Finance because Yahoo did not let me download closing price data for every index and equity that I wanted to use in the project.



I train on 2000-2018 because that period captures a wide range of market phases in the United States, including periods of growth as well as the 2008 stock market crash. I test on 2019 because it may be reasonably predictable given the market trends of the preceding years, which the RNN is trained on. I did not want to test on 2020, because a model based solely on historical index and equity data likely could not have anticipated the COVID crash.



Predictors for TGT that I'll include:
- Benchmark Indices:
    - SPX Index
        - The SPX Index, also known as the S&P 500, is a benchmark index that tracks 500 large US companies and is generally used to describe the overall state of the market.
    - RUI Index (Russell 1000)
        - The Russell 1000 represents the top 1000 companies by market capitalization in the United States.
    - FED5YEAR Index
        - FED5YEAR is the yield on a US Treasury note that matures in 5 years.
    - VIX Index
        - VIX is the volatility index, which measures the expected volatility of the S&P 500 implied by its options and shows how volatile the stock market is expected to be at a given time.
    - DXY Curncy
        - DXY measures the value of the US dollar against a basket of other currencies.
- Competitor Equities:
    - Costco Wholesale Corporation (NASDAQ: COST)
    - Walmart Inc (NYSE: WMT)
    - Macy's Inc (NYSE: M)
- Industry-Specific Metric:
    - SPSIRE (S&P Retail Select Industry Index)
        - SPSIRE measures the performance of the retail sector in the stock market. 
- Historical TGT Data:
    - The previous performance of TGT will help gauge how Target stock is performing and what its general trends are.


I use all of these factors because I want predictors that represent the overall state of the market, the performance of the US dollar, the performance of similar competitors, the retail sector's performance, and TGT's own history. I believe that using only a couple of these factors would have been limiting, so I branched out to multiple predictors in the hope of building an accurate model.

In [25]:
df_pxlast = pd.read_csv("data/full_data.csv")
df_pxlast.head(10)

|   | Date | SPX | RUI | FED5YEAR | VIX | DXY | COST | WMT | M | SPSIRE | TGT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/2/00 | 1469.25 | 3055.86 | 2.7701 | 24.64 | 101.87 | 45.625 | 69.125 | 25.2813 | 1041.97 | 36.7188 |
| 1 | 1/3/00 | 1455.22 | 3029.45 | 2.7473 | 24.21 | 100.22 | 44.5 | 66.8125 | 25.1875 | 1016.4 | 36.0313 |
| 2 | 1/4/00 | 1399.42 | 2909.98 | 2.8377 | 27.01 | 100.41 | 42.063 | 64.3125 | 24.4688 | 989.24 | 34.4688 |
| 3 | 1/5/00 | 1402.11 | 2917.77 | 2.8245 | 26.41 | 100.38 | 42.781 | 63.0 | 24.9375 | 989.65 | 33.6875 |
| 4 | 1/6/00 | 1403.45 | 2898.75 | 2.7696 | 25.73 | 100.48 | 43.641 | 63.6875 | 24.7188 | 990.1 | 32.0938 |
| 5 | 1/9/00 | 1441.47 | 2998.01 | 2.676 | 21.72 | 100.72 | 46.531 | 68.5 | 25.4063 | 1010.94 | 33.75 |
| 6 | 1/10/00 | 1457.6 | 3041.51 | 2.5936 | 21.71 | 100.99 | 47.5 | 67.25 | 26.1875 | 1019.65 | 33.0938 |
| 7 | 1/11/00 | 1438.56 | 2996.16 | 2.7436 | 22.5 | 100.56 | 45.813 | 66.25 | 25.3438 | 1005.29 | 34.0938 |
| 8 | 1/12/00 | 1432.25 | 2985.8 | 2.8011 | 22.84 | 100.62 | 45.75 | 65.0625 | 25.2813 | 992.22 | 33.7188 |
| 9 | 1/13/00 | 1449.68 | 3031.35 | 2.7769 | 21.71 | 101.0 | 47.469 | 65.125 | 25.625 | 996.9 | 34.5 |


All of the closing price data for my predictors has been imported successfully. Now let's do the same for the TGT closing prices that serve as my test targets, to check that they have been imported successfully as well.

### **Generate Returns Data Using Lags of Closing Price Data**

All data has been imported successfully into Pandas DataFrames. Now I transform all the data into percentage returns. Percentage returns tend to fluctuate around a stable mean and are distributed roughly symmetrically, closer to a normal distribution, whereas raw closing prices trend over time and have a more skewed, $\chi^2$-like shape. A distribution like this should make it easier for the model to produce accurate predictions.



I also convert the dates to datetime values in the Pandas DataFrame while iterating over the columns.

In [35]:
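# Build a DataFrame of 1-day percentage returns, indexed by date
df_returns = pd.DataFrame()
for colname in df_pxlast.columns:
    if colname == "Date":
        # Skip the first row: it has no previous close, so its return is NaN
        df_returns[colname] = pd.to_datetime(df_pxlast[colname][1:])
    else:
        # Returns align on the row index set by "Date", which drops the first NaN row
        df_returns[colname + " Returns"] = df_pxlast[colname].pct_change(1)
# Use the dates as the index; set_index also removes the "Date" column
df_returns = df_returns.set_index("Date")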

In [36]:
df_returns.head()

| Date | SPX Returns | RUI Returns | FED5YEAR Returns | VIX Returns | DXY Returns | COST Returns | WMT Returns | M Returns | SPSIRE Returns | TGT Returns |
|---|---|---|---|---|---|---|---|---|---|---|
| 2000-01-03 | -0.009549 | -0.008642 | -0.008231 | -0.017451 | -0.016197 | -0.024658 | -0.033454 | -0.00371 | -0.02454 | -0.018723 |
| 2000-01-04 | -0.038345 | -0.039436 | 0.032905 | 0.115655 | 0.001896 | -0.054764 | -0.037418 | -0.028534 | -0.026722 | -0.043365 |
| 2000-01-05 | 0.001922 | 0.002677 | -0.004652 | -0.022214 | -0.000299 | 0.01707 | -0.020408 | 0.019155 | 0.000414 | -0.022667 |
| 2000-01-06 | 0.000956 | -0.006519 | -0.019437 | -0.025748 | 0.000996 | 0.020102 | 0.010913 | -0.00877 | 0.000455 | -0.047308 |
| 2000-01-09 | 0.02709 | 0.034242 | -0.033795 | -0.155849 | 0.002389 | 0.066222 | 0.075564 | 0.027813 | 0.021048 | 0.051605 |


I specifically chose to use 1-day returns, because they are a decent gauge of daily market activity and may work well when testing the model.
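
To make the 1-day return concrete, the short check below recomputes the first two SPX returns by hand as $r_t = P_t / P_{t-1} - 1$ and compares them with `pct_change(1)`. It uses only the first three SPX closing prices from the table of closing prices above; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# First three SPX closing prices, copied from the closing price table above
spx_close = pd.Series([1469.25, 1455.22, 1399.42], name="SPX")

# Manual 1-day return: today's close divided by yesterday's close, minus 1
manual_returns = spx_close / spx_close.shift(1) - 1

# The same quantity computed by pandas
builtin_returns = spx_close.pct_change(1)

# The first entry is NaN because there is no previous close to compare against
print(manual_returns.round(6).tolist())
print(builtin_returns.round(6).tolist())
```

The two non-missing values agree with the SPX Returns shown for 2000-01-03 and 2000-01-04 (-0.009549 and -0.038345).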

### **Split the Data into a Training Set and a Testing Set**

Now that the data is organized correctly, I will split it into training and testing sets so that I can apply the recurrent neural network. Because the data is a time series and I want to split on custom dates, I do not need the `train_test_split()` function from `scikit-learn`. Instead, I slice `df_returns` by date range to build the training and testing sets.

In [42]:
X_train = df_returns.loc["2000-01-03":"2018-12-31", "SPX Returns":"SPSIRE Returns"]
X_train.tail()

| Date | SPX Returns | RUI Returns | FED5YEAR Returns | VIX Returns | DXY Returns | COST Returns | WMT Returns | M Returns | SPSIRE Returns |
|---|---|---|---|---|---|---|---|---|---|
| 2018-12-25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00029 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2018-12-26 | 0.049594 | 0.049689 | 0.002082 | -0.156917 | 0.004856 | 0.045476 | 0.053484 | 0.070337 | 0.058489 |
| 2018-12-27 | 0.008563 | 0.008387 | -0.015526 | -0.014798 | -0.005863 | 0.012334 | 0.013052 | -0.002987 | -0.002474 |
| 2018-12-30 | -0.001242 | -0.000809 | 0.000444 | -0.054072 | -0.000819 | 0.004774 | 0.005896 | -0.000666 | 0.004232 |
| 2018-12-31 | 0.008492 | 0.00883 | -0.01754 | -0.103035 | -0.002375 | 0.008266 | 0.011071 | -0.007995 | 0.006084 |


In [46]:
X_test = df_returns.loc["2019-01-01":"2019-12-31", "SPX Returns":"SPSIRE Returns"]
print(f"Shape of X_test is {X_test.shape}")

Shape of X_test is (261, 9)


In [48]:
y_train = df_returns.loc["2000-01-03":"2018-12-31", "TGT Returns"]
y_train.tail()

Date
2018-12-25    0.000000
2018-12-26    0.057839
2018-12-27   -0.006143
2018-12-30    0.003863
2018-12-31    0.017395
Name: TGT Returns, dtype: float64

In [50]:
y_test = df_returns.loc["2019-01-01":"2019-12-31", "TGT Returns"]
print(f"Shape of y_test is {y_test.shape}")

Shape of y_test is (261,)


### **Convert Training Data to LSTM Format**

I will now convert `X_train` and `X_test` into NumPy arrays so that I can reshape them into the 3D input that a Long Short-Term Memory (LSTM) layer expects. I need to make this change because the model only accepts data in that format. I reshape both sets with NumPy's `reshape()` function, specifying the batch size, the number of time steps, and the number of features.



An LSTM is an architecture used in many RNNs because it can capture long-term dependencies in the data. It allows the model to retain important information for longer periods of time, which I believe is crucial in a task like predicting stocks, because the task requires remembering previous events to predict future ones.
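
To illustrate how an LSTM carries information forward, here is a minimal NumPy sketch of a single LSTM time step using the standard gate equations. It is not the model trained in this notebook: the weights are random, the inputs are synthetic stand-ins for daily returns, and the function name `lstm_step`, the gate ordering inside the stacked weight matrices, and the sizes (9 features, 4 hidden units, 5 time steps) are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W has shape (4H, F), U has shape (4H, H), b has shape (4H,),
    where H is the number of hidden units and F the number of features.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b        # pre-activations for all four gates
    i = sigmoid(z[0:H])                 # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])             # forget gate: how much old cell state to keep
    g = np.tanh(z[2 * H:3 * H])         # candidate values for the cell state
    o = sigmoid(z[3 * H:4 * H])         # output gate: how much of the cell state to expose
    c_t = f * c_prev + i * g            # cell state carries long-term memory
    h_t = o * np.tanh(c_t)              # hidden state is the step's output
    return h_t, c_t


rng = np.random.default_rng(0)
F, H, T = 9, 4, 5                       # features, hidden units, time steps (illustrative)
W = rng.normal(scale=0.1, size=(4 * H, F))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

h = np.zeros(H)
c = np.zeros(H)
sequence = rng.normal(scale=0.01, size=(T, F))  # synthetic stand-in for daily returns

# The same weights are reused at every step; only h and c change over time
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)  # prints (4,) (4,)
```

The forget gate `f` is what lets the cell state `c` persist across many steps, which is the mechanism behind the long-term memory described above.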

In [55]:
# Convert DataFrames to numpy arrays
X_train_2D = X_train.values
X_test_2D  = X_test.values

# Obtain number of timesteps and feature count based on shape of DataFrame
timesteps_train, num_features_train = X_train.shape
timesteps_test, num_features_test   = X_test.shape

print(f"The shape of X_train is {X_train.shape}\nThe shape of X_test is {X_test.shape}")

# Reshape the 2D array into a 3D array
X_train_3D = np.reshape(X_train_2D, (1, timesteps_train, num_features_train))
X_test_3D  = np.reshape(X_test_2D, (1, timesteps_test, num_features_test))

The shape of X_train is (4956, 9)
The shape of X_test is (261, 9)


For this first attempt, I use a batch size of 1 and set the number of time steps to the exact number of dates in each X dataset (4956 for training and 261 for testing). This shape may cause problems for my model: the entire history is treated as a single sequence, so the model sees only one training sample, and it is almost as if the LSTM were not being used in the first place.
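
A common alternative to a single long sequence is to cut the data into many shorter overlapping windows, so that the first dimension holds many samples instead of 1. The sketch below shows one possible way to do this with NumPy. The helper `make_windows`, the window length of 5, and the alignment (each window of features predicts the target on the following day) are assumptions for illustration, and the arrays are dummy values rather than the notebook's data.

```python
import numpy as np


def make_windows(features, targets, window):
    """Build overlapping windows for a sequence model.

    X[k] holds feature rows k .. k + window - 1, and y[k] is the target
    at row k + window (the day right after the window ends).
    """
    n_samples = len(features) - window
    X = np.stack([features[k:k + window] for k in range(n_samples)])
    y = targets[window:]
    return X, y


# Dummy data: 20 days, 3 features (values are placeholders, not market data)
features = np.arange(20 * 3, dtype=float).reshape(20, 3)
targets = np.arange(20, dtype=float)

X, y = make_windows(features, targets, window=5)
print(X.shape, y.shape)  # prints (15, 5, 3) (15,)
```

With this layout the shape is (samples, time steps, features), so the model would see many short sequences during training instead of one sequence that spans the whole period.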