## Linear Regression and LSTM Models

In [2]:
# Import Statements
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
print("Done importing")

Done importing


In [13]:
# Load symbols_valid_meta to get reference Stocks and ETFs with full name
# Change this to match your filepath name
nomenclature = pd.read_csv('/Users/taylor/ICS-438/big-data-stock-analysis/Notebooks/Data/symbols_valid_meta.csv')

In [14]:
# When reading the .csv into a pandas df, special characters are treated differently
# Manually rename these symbols to match the file names so no missing files are run into
nomenclature.at[162,'Symbol'] = 'AGM-A'
nomenclature.at[1068,'Symbol'] = 'CARR#'
nomenclature.at[7457,'Symbol'] = 'UTX#'

In [15]:
# Import statement
import pathlib

# Iterate through the number of stocks
stock_count, etf_count = 0, 0
for path in pathlib.Path('/Users/taylor/ICS-438/big-data-stock-analysis/Notebooks/Data/stocks').iterdir():
    if path.is_file():
        stock_count += 1
print("Stock file count: ", stock_count)

# Iterate through the number of etfs
for path in pathlib.Path('/Users/taylor/ICS-438/big-data-stock-analysis/Notebooks/Data/etfs').iterdir():
    if path.is_file():
        etf_count += 1
print("Etf file count: ", etf_count)

Stock file count:  5884
Etf file count:  2165


In [None]:
# Load the files into one big list of dataframes and normalize dates according to pandas
# Yahoo finance package dealt with pandas date normalization already

"""
Please note:
In order to run this, it is important to change the file paths to reflect where the actual files are downloaded
"""

stock_df_list = []
etf_df_list = []

for i in range(len(nomenclature)):
    symbol = nomenclature['Symbol'][i]
    if nomenclature['ETF'][i] == 'Y':
        # ETFs are stored in the etfs directory, everything else in stocks
        etf_df_list.append(pd.read_csv('Notebooks/Data/etfs/' + symbol + '.csv'))
    else:
        stock_df_list.append(pd.read_csv('Notebooks/Data/stocks/' + symbol + '.csv'))

In [11]:
"""
Separate stocks into 3 types and drop stocks with a history < 3 years, 2018 and younger
Type 1: Longer than 25 years
Type 2: 25 Years and Younger
Type 3: After 2009
We will only be using type 3 stocks, since we do not want to deal with market crashes
and any other event that will affect the volatility
"""

stocks_relevant = [s for s in stock_df_list if s['Date'][0] < '2018-01-01']
stocks_type_3 = [s for s in stocks_relevant if s['Date'][0] > '2009-12-31']
stocks_type_2 = [s for s in stocks_relevant if s['Date'][0] > '1994-12-31']
stocks_type_1 = [s for s in stocks_relevant if s['Date'][0] < '1994-12-31']

In [18]:
import numpy as np

#Cleaning, preprocessing drop NaN rows and change date to days after first
for i in range(len(stocks_type_3)): 
    stocks_type_3[i].dropna(axis=0, how='any', inplace=True)
    stocks_type_3[i]['Date'] = pd.to_datetime(stocks_type_3[i]['Date'])
    stocks_type_3[i]['Date'] = (stocks_type_3[i]['Date'] - stocks_type_3[i]['Date'].min())  / np.timedelta64(1,'D')

In [19]:
def lin_reg_accuracy(list_stock_df):
    # Fit one linear regression per stock and collect its test-set R^2 score
    lin_acc = []
    for i in range(len(list_stock_df)):
        temp = list_stock_df[i]
        temp = (temp-temp.min())/(temp.max()-temp.min()) #Min-Max Scaling
        X = temp.drop(['Open','High', 'Low','Close', 'Adj Close'], axis=1)
        y = temp['Open']
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
        regressor = LinearRegression()
        regressor.fit(X_train, y_train)
        lin_acc.append(regressor.score(X_test, y_test))
    return lin_acc

In [20]:
accuracies = lin_reg_accuracy(stocks_type_3)

In [21]:
c = filter(lambda x: x > 0.85, accuracies)

In [22]:
length = len(list(c))

In [None]:
print(f"Percentage of stocks with an R^2 value greater than 0.85 is {100 * length / len(accuracies):.1f}%")

The above shows that the fraction of stocks with an R^2 value greater than 0.85 is approximately 0.067, or about 6.7%. This means that linear regression is not a viable tool for predicting a stock's value. The results show that only a small percentage of stocks follow a clear linear trend over time, even during a clear economic expansion. The years we chose were also meant to exclude major events, such as market crashes, that would have driven volatility in these prices.
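As a minimal sketch of the calculation described above, the helper below computes the share of R^2 scores above a threshold and reports it both as a fraction and as a percentage. The function name `share_above` and the toy scores are assumptions for illustration only; they are not values from the dataset.

```python
import numpy as np

def share_above(scores, threshold=0.85):
    """Return the fraction of scores strictly above the threshold."""
    scores = np.asarray(scores, dtype=float)
    if scores.size == 0:
        return 0.0
    return float(np.mean(scores > threshold))

# Toy R^2 values for illustration only (not results from the notebook).
toy_r2 = [0.91, 0.40, -0.20, 0.88, 0.10]
fraction = share_above(toy_r2)
print(f"{fraction:.3f} as a fraction, {100 * fraction:.1f}% as a percentage")
```

Keeping the fraction and the percentage distinct avoids reading a value such as 0.067 as 0.067%.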

## Time Series Prediction of Apple Stock

The first step in determining whether deep learning models are more successful is to apply them to a single stock and see how they compare to each other. We will run this time series prediction on Apple (AAPL) stock.

In [37]:
# Import Statements
import seaborn as sns
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import SimpleRNN
from keras.layers import Dropout
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler

In [28]:
#Get AAPL stock into df and cut time to 2012 - 2020
aapl = pd.read_csv('Notebooks/Data/stocks/' + 'AAPL' +'.csv')

In [31]:
aapl = aapl.loc[(aapl['Date'] >= '2012-01-01') & (aapl['Date'] <= '2020-01-01')]

In [32]:
#Split train test
training_size = int(len(aapl)*0.80)
data_len = len(aapl)

train, test = aapl[0:training_size],aapl[training_size:data_len]

In [33]:
print("Training Size --> ", training_size)
print("total length of data --> ", data_len)
print("Train length --> ", len(train))
print("Test length --> ", len(test))

Training Size -->  1609
total length of data -->  2012
Train length -->  1609
Test length -->  403


In [None]:
#MinMax scale values
train = train.loc[:, ["Open"]].values

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)
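The scaler above is fit on the training prices only and later reused on the test inputs. As a rough sketch of what that implies, the plain NumPy version below applies the min-max formula with the training minimum and maximum; the helper names and toy prices are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def fit_min_max(train_values):
    """Return the (min, max) of the training values."""
    return float(np.min(train_values)), float(np.max(train_values))

def apply_min_max(values, lo, hi):
    """Scale values with a previously fitted training range."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)

# Toy prices for illustration only.
toy_train = np.array([10.0, 20.0, 30.0])
toy_test = np.array([25.0, 40.0])

lo, hi = fit_min_max(toy_train)
scaled_train = apply_min_max(toy_train, lo, hi)
scaled_test = apply_min_max(toy_test, lo, hi)
print(scaled_train, scaled_test)
```

Test prices above the training maximum scale to values greater than 1, which is worth keeping in mind for a stock that keeps rising after the training window.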

In [None]:
end_len = len(train_scaled)
X_train, y_train = [], []
timesteps = 40

for i in range(timesteps, end_len):
    X_train.append(train_scaled[i - timesteps:i, 0])
    y_train.append(train_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)

In [None]:
#We need to reshape our data as RNN needs 3 dimensions
#the size of data we have, the number of steps and the number of features
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
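To make the windowing and reshape steps above concrete, here is a small sketch on a toy series. The function name `make_windows` is hypothetical; the logic mirrors the loop above, where each sample holds the previous `timesteps` values and the target is the next value.

```python
import numpy as np

def make_windows(series, timesteps):
    """Build (samples, timesteps, 1) inputs and next-value targets."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for i in range(timesteps, len(series)):
        X.append(series[i - timesteps:i])  # the previous `timesteps` values
        y.append(series[i])                # the value to predict
    # Recurrent layers expect (samples, timesteps, features)
    X = np.array(X).reshape(-1, timesteps, 1)
    return X, np.array(y)

# Toy series 0, 1, ..., 5 for illustration only.
X_toy, y_toy = make_windows(np.arange(6.0), timesteps=3)
print(X_toy.shape, y_toy)
```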

## Recurrent Neural Network Model

This section creates the model and specifies its parameters. We use the hyperbolic tangent as the activation function and set the dimensionality of the output space to 50 units. The parameter values are not specially tuned; they are mostly based on default values and on values used in our examples.
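For intuition about what a recurrent layer with a tanh activation and 50 units computes, the sketch below runs the standard simple-RNN recurrence in NumPy. It illustrates the recurrence only and is not the Keras implementation; the function name, weight shapes, and random initialization are assumptions.

```python
import numpy as np

def simple_rnn_forward(x_seq, W, U, b):
    """Run h_t = tanh(x_t W + h_{t-1} U + b) and return the last state.

    x_seq: (timesteps, features), W: (features, units),
    U: (units, units), b: (units,)
    """
    h = np.zeros(U.shape[0])
    for x_t in x_seq:
        h = np.tanh(x_t @ W + h @ U + b)
    return h

rng = np.random.default_rng(0)
timesteps, features, units = 40, 1, 50

# Random inputs and weights for illustration only.
x_seq = rng.random((timesteps, features))
W = rng.normal(scale=0.1, size=(features, units))
U = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

h_last = simple_rnn_forward(x_seq, W, U, b)
print(h_last.shape)
```

The final state has one entry per unit, and tanh keeps every entry between -1 and 1.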

In [None]:
regressor = Sequential()

regressor.add(SimpleRNN(units = 50, activation = "tanh", return_sequences = True, input_shape = (X_train.shape[1],1)))
regressor.add(Dropout(0.2))

regressor.add(SimpleRNN(units = 50, activation = "tanh", return_sequences = True))
regressor.add(Dropout(0.2))

regressor.add(SimpleRNN(units = 50, activation = "tanh", return_sequences = True))
regressor.add(Dropout(0.2))

regressor.add(SimpleRNN(units = 50))
regressor.add(Dropout(0.2))

regressor.add(Dense(units = 1))

In [None]:
regressor.compile(optimizer= "adam", loss = "mean_squared_error")

In [None]:
epochs = 100 
batch_size = 20

In [None]:
"""
Note: Only run this once; training takes a long time on a local machine.
"""
regressor.fit(X_train, y_train, epochs = epochs, batch_size = batch_size)

In [None]:
real_price = test.loc[:, ["Open"]].values
print("Real Price Shape --> ", real_price.shape)

In [None]:
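# aapl["Open"] already contains the test rows, so slice it directly:
# the last len(test) + timesteps values give every test day its 40 preceding days
dataset_total = aapl["Open"]
inputs = dataset_total[len(dataset_total) - len(test) - timesteps:].values.reshape(-1,1)
inputs = scaler.transform(inputs)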

In [None]:
X_test = []

for i in range(timesteps, real_price.shape[0]+timesteps):
    X_test.append(inputs[i-timesteps:i, 0])
X_test = np.array(X_test)

print("X_test shape --> ", X_test.shape)

In [None]:
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
predict = regressor.predict(X_test)
predict = scaler.inverse_transform(predict)

In [None]:
plt.plot(real_price, color = "blue", label = "Real Stock Price")
plt.plot(predict, color = "red", label = "Predict Stock Price")
plt.title("Stock Price Prediction")
plt.xlabel("Time (trading days in test set)")
plt.ylabel("AAPL Stock Price")
plt.legend()
plt.show()

## Long Short-Term Memory

The next step in analyzing this data is an LSTM model, which is run on the same individual stock so that its prediction can be compared with that of the RNN model.
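As a rough illustration of how an LSTM differs from the simple RNN, the sketch below implements one step of the standard LSTM cell in NumPy: input, forget, and output gates control a separate cell state. This is a textbook-style sketch rather than the Keras implementation; the function names, the gate ordering within the weight matrices, and the random weights are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step.

    W: (features, 4*units), U: (units, 4*units), b: (4*units,).
    Gate order i, f, g, o is an assumption of this sketch.
    """
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[:units])               # input gate
    f = sigmoid(z[units:2 * units])      # forget gate
    g = np.tanh(z[2 * units:3 * units])  # candidate cell values
    o = sigmoid(z[3 * units:])           # output gate
    c = f * c_prev + i * g               # new cell state
    h = o * np.tanh(c)                   # new hidden state
    return h, c

rng = np.random.default_rng(0)
timesteps, features, units = 40, 1, 50

# Random inputs and weights for illustration only.
x_seq = rng.random((timesteps, features))
W = rng.normal(scale=0.1, size=(features, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in x_seq:
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

The additive update of the cell state is what lets an LSTM carry information across many timesteps more easily than the plain tanh recurrence.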

In [None]:
regressor = Sequential()

regressor.add(LSTM(units = 50, return_sequences = True, input_shape = (X_train.shape[1],1)))
regressor.add(Dropout(0.2))

regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))

regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))

regressor.add(LSTM(units = 50))
regressor.add(Dropout(0.2))

regressor.add(Dense(units = 1))

In [None]:
regressor.compile(optimizer= "adam", loss = "mean_squared_error")

In [None]:
"""
Note: Only run this once; training takes a long time on a local machine.
"""
regressor.fit(X_train, y_train, epochs = epochs, batch_size = batch_size)

In [None]:
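# Recompute predictions with the trained LSTM; otherwise the plot reuses the RNN output
predict = regressor.predict(X_test)
predict = scaler.inverse_transform(predict)

plt.plot(real_price, color = "blue", label = "Real Stock Price")
plt.plot(predict, color = "red", label = "Predict Stock Price")
plt.title("Stock Price Prediction")
plt.xlabel("Time")
plt.ylabel("AAPL Stock Price")
plt.legend()
plt.show()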

## Conclusion

In conclusion, we determined that the LSTM model is a better predictor than the simple RNN in this case. This is not an ideal solution for predicting the stock market; however, with some experimentation and hyperparameter tuning, it could be a viable option.
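One way to back the comparison with a number rather than a visual reading of the plots is an error metric such as root mean squared error (RMSE) on the test set. The sketch below is a minimal example of that idea; the function `rmse` and the toy arrays are assumptions for illustration and do not reproduce results from this notebook.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two arrays of the same length."""
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy prices for illustration only (not model output).
toy_real = np.array([[100.0], [102.0], [101.0]])
toy_pred = np.array([[101.0], [101.0], [103.0]])
print(f"RMSE: {rmse(toy_real, toy_pred):.3f}")
```

In the notebook this would mean keeping each model's predictions under separate names (for example hypothetical `predict_rnn` and `predict_lstm`) and comparing `rmse(real_price, ...)` for both; the lower value indicates the closer fit on the test period.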