# Stock Market Predictor

### Background

The United States stock market lets investors buy small, fractional shares of companies in a relatively liquid environment, making it one of the most well-established ways to earn investment income and build long-term wealth. Traditionally, the playing field between the average investor and high-powered trading firms has been uneven. Roughly five years ago, however, most of the major brokerage firms removed trade fees for individual investors, giving individuals an opportunity to generate returns in the same ballpark as the major firms.


These major investment firms, such as Morgan Stanley and BlackRock, are able to consistently post higher returns than the average investor through the use of powerful analytics and high-frequency trading. This project will focus on making some of those analytical tools available to the average investor. High-frequency trading is still not achievable for most individual traders due to hardware constraints. If this tool can help investors increase their returns by even 1%, it can go a long way towards helping people reach financial freedom earlier and take more control of their lives.

### Topic

This project will use publicly available stock market data, such as volume, open/close prices, and high/low prices, along with other features that I will engineer, to create a system that can accurately predict whether a given stock will increase or decrease over the next trading day. This will be difficult, as the stock market is notoriously noisy and volatile. Because of this, the system will be designed to focus on stocks that have been in existence for at least 10 years.


Most stock prediction systems utilize close prices and label a stock a “buy” if tomorrow's close price is higher than today’s. This creates a potential timeline issue, as most of these systems simply predict whether a stock will go up or down, not when it will do so. I will base my buy and sell signals on the difference between tomorrow’s estimated open price and tomorrow’s estimated close price. This will increase the error rate (since two predictions are being made instead of one) but will provide a clearly defined timetable.

### Strategy

The first step in this process will be to import public stock data from Yahoo Finance by ticker symbol for the last ten years. To begin cleaning the data, I will drop the adjusted close column, which is not needed since the goal is to predict only one day into the future. The stock will be labeled as follows:


Buy: if tomorrow’s close price is predicted higher than tomorrow’s open price
Sell: if tomorrow’s close price is predicted lower than tomorrow’s open price


This creates the need to build a system that effectively predicts both the open and close price of a stock for the following day. These might not be the exact same system, although I would expect that they will be quite similar.
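
The labeling rule above can be sketched as a small helper. This is only an illustration: the function name `day_signal` and the example prices are hypothetical, and ties are treated as a sell to mirror the `else` branch used for labels later in this notebook.

```python
def day_signal(pred_open: float, pred_close: float) -> str:
    """Return 'Buy' when the predicted close exceeds the predicted open, else 'Sell'."""
    # A tie (predicted close == predicted open) falls into the 'Sell' branch.
    return "Buy" if pred_close > pred_open else "Sell"


# Made-up predictions for illustration only
print(day_signal(19.60, 19.72))
print(day_signal(19.60, 19.41))
```

Both predictions feed a single comparison, so an error in either one can flip the signal.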


Next, I will begin feature engineering. I will add returns and volatility over one-, two-, and three-month periods in a style similar to what Devpark0506 did in his Kaggle notebook titled “JPX Stock Market Analysis & Prediction with LGBM.” Then I will create 10- and 40-day exponential moving averages (EMA) and an absolute price oscillator (APO), similar to what Sebastien Donadio recommends in his book “Learn Algorithmic Trading.” These features add some of the “standard” analytical calculations that a trader might run for a stock. Finally, I will add labels that indicate whether a stock was a 'Buy' or 'Sell' for that day. This will provide another feature for our algorithms to work with.

Then the data must be split. This will be done in two 'batches'. The first will be for the open price prediction, where all of the data for a day are the features and the label is the open price from the following day. The second will be for the close price prediction, where all of the data for the day are the features and the label is the next day's close price. These will then be separately split into train and test sets, and the train set will be further split into a train set and a validation set for the neural network.
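
As a rough sketch of how a next-day label can be paired with today's features, the snippet below uses `shift(-1)` on a toy frame. The helper name `next_day_target` and the prices are made up for illustration. Unlike the cells further down, this variant keeps every column as a feature and drops the final row (which has no known "tomorrow") instead of filling it.

```python
import pandas as pd

# Toy prices; the values are invented for illustration.
toy = pd.DataFrame(
    {"Open": [10.0, 10.5, 10.2, 10.8], "Close": [10.4, 10.1, 10.9, 11.0]},
    index=pd.date_range("2023-01-02", periods=4, freq="B"),
)


def next_day_target(df: pd.DataFrame, column: str):
    """Pair each day's features with the following day's value of `column`."""
    target = df[column].shift(-1)   # tomorrow's value aligned to today's row
    features = df.iloc[:-1]         # the last day has no known tomorrow
    return features, target.iloc[:-1]


open_x, open_y = next_day_target(toy, "Open")
close_x, close_y = next_day_target(toy, "Close")
print(open_y.tolist())
print(close_y.tolist())
```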


Once these features are developed, I will begin to train both neural networks and random forests on the training set. I will create neural networks and random forests to predict tomorrow’s open and tomorrow’s close prices. While the ultimate result of this project will be a signal, these machine learning systems are going to be used for regression. I will run these on the test set in order to get the RMSE. This is how I will be able to determine which program is more successful and should be deployed.
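
For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The minimal sketch below computes it in plain Python; the helper name `rmse` and the sample numbers are illustrative only, and the notebook itself relies on `mean_squared_error` from scikit-learn.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Only the last prediction is wrong (by 2), so the RMSE is sqrt(4 / 3)
print(rmse([10.0, 11.0, 12.0], [10.0, 11.0, 14.0]))
```

Because the error is expressed in the same units as the price, it can be read as a typical dollar miss per prediction.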


In order to create a system that is usable by an individual trader, I will use the system with the best combination of runtime and low RMSE. This system will send the trader a buy signal if the system predicts that tomorrow’s close will be higher than tomorrow’s open and a sell signal if the opposite is true. Further, the system will provide the combined RMSE of the prediction (at least for the Random Forest) so that the trader can understand the potential for the prediction to be wrong. The system is designed for an individual trader to run the prediction after hours and put an order in that will resolve immediately upon market open the next day. Finally, an interactive webpage will be created in which a trader can input a stock ticker symbol and get the signal in return. This webapp can be found at daytradingstocksignalsystem.streamlit.app




# Imports

In [None]:
#Standard + Necessary Imports
import numpy as np
import pandas as pd
import math
import statistics as stats
import matplotlib.pyplot as plt
from pandas_datareader import data as pdr
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from tensorflow import keras
from hyperopt import tpe, hp, Trials
from hyperopt.fmin import fmin

import yfinance as yf
yf.pdr_override()

import warnings
warnings.filterwarnings('ignore')

# Functions

In [None]:
#Function that will get data for the provided stock and date range
def get_data(ticker, start_date, end_date):
    data = pdr.get_data_yahoo(ticker, start_date, end_date)
    data = data.drop(columns = ['Adj Close'])
    return data

In [None]:
#Most of this cell comes from Devpark0506's notebook called "JPX Stock Market Analysis & Prediction with LGBM"
def add_returns_volatility(df):
    #adding monthly returns using 20 day periods as months and the percent change over that period
    df['1_month_returns'] = df['Close'].pct_change(20)
    df['2_month_returns'] = df['Close'].pct_change(40)
    df['3_month_returns'] = df['Close'].pct_change(60)

    #adding volatility using standard deviation
    df['1_month_vol'] = (np.log(df['Close']).diff().rolling(20).std())
    df['2_month_vol'] = (np.log(df['Close']).diff().rolling(40).std())
    df['3_month_vol'] = (np.log(df['Close']).diff().rolling(60).std())

    return df

In [None]:
#this will add emas and apo to our dataframe.
#Most of the ideas in this function come from Learn Algorithmic Trading by Sebastien Donadio
def add_emas(df):
    close = df['Close']
    apo_values = []
    ema_fast_values = []
    ema_slow_values = []
    price_history = []
    ema_fast = 0
    ema_slow = 0

    for close_price in close:
        price_history.append(close_price)
        if len(price_history) > 20:
            del(price_history[0])

        #This idea is from Learn Algorithmic Trading by Sebastien Donadio. It will be used for our volatility measure
        sma = stats.mean(price_history)
        variance = 0
        for hist_price in price_history:
            variance = variance + ((hist_price - sma) ** 2)

        #this idea for a volatility factor comes from Learn Algorithmic Trading by Sebastien Donadio
        stdev = math.sqrt(variance / len(price_history))
        stdev_factor = stdev/15
        if stdev_factor == 0:
            stdev_factor = 1


        if (ema_fast == 0): # first observation
            ema_fast = close_price
            ema_slow = close_price
        else:
            #calculating ema with a smoothing factor and the stdev_factor which is a way to account for volatility
            ema_fast = (close_price - ema_fast) * (2/(10+1)) *stdev_factor + ema_fast
            ema_slow = (close_price - ema_slow) * (2/(40+1)) *stdev_factor + ema_slow

        ema_fast_values.append(ema_fast)
        ema_slow_values.append(ema_slow)

        #calculating apo as the difference between fast and slow emas
        apo = ema_fast - ema_slow
        apo_values.append(apo)

    df = df.assign(fast_ema = pd.Series(ema_fast_values, index=df.index))
    df = df.assign(slow_ema = pd.Series(ema_slow_values, index=df.index))
    df = df.assign(APO = pd.Series(apo_values, index=df.index))

    return df
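
For comparison, a plain EMA and APO without the volatility factor can be sketched with pandas. This is an assumption-labeled illustration and not the method used in this notebook: it uses `Series.ewm` with `adjust=False`, which applies the same recursion with a smoothing factor of 2 / (span + 1), and the prices are made up.

```python
import pandas as pd

# Toy close prices, invented for illustration
close = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0])

# adjust=False gives ema_t = ema_{t-1} + alpha * (price_t - ema_{t-1}),
# with alpha = 2 / (span + 1) and the first EMA equal to the first price.
fast = close.ewm(span=10, adjust=False).mean()
slow = close.ewm(span=40, adjust=False).mean()

# APO is the gap between the fast and slow averages
apo = fast - slow
print(apo.round(4).tolist())
```

A positive APO means the fast average sits above the slow one, which is often read as upward momentum.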

In [None]:
#this function adds labels to the data
def add_labels(df):
    day_change = df['Close'] - df['Open']
    percent_day_change = (df['Close'] - df['Open']) / df['Open']
    labels = []

    for change in percent_day_change:
        #stock is a buy if it closes higher than it opens
        if change > 0:
            labels.append(1)
        #a sell if the stock closes lower than it opens
        else:
            labels.append(0)

    df['day_change'] = day_change
    df['percent_day_change'] = percent_day_change
    df['signal'] = labels

    return df
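
The same 1/0 label can also be written without a loop. The sketch below is a possible vectorized variant on a toy frame with made-up prices; it assumes open prices are positive, so comparing `Close > Open` gives the same result as checking whether the percent change is above zero.

```python
import pandas as pd

# Toy prices, invented for illustration
toy = pd.DataFrame({"Open": [10.0, 10.5, 10.2], "Close": [10.4, 10.1, 10.2]})

# True where the close is strictly above the open; cast to 1/0 to match the 'signal' column
toy["signal"] = (toy["Close"] > toy["Open"]).astype(int)
print(toy["signal"].tolist())
```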

In [None]:
#this function combines all of our previous functions to output a single dataframe with all of our engineered features
def clean_data(user_input_symbol, user_input_start_date, user_input_end_date):
    data = get_data(user_input_symbol, user_input_start_date, user_input_end_date)
    data = add_returns_volatility(data)
    data = add_emas(data)
    data = add_labels(data)
    data = data.fillna(0)

    return data

# Dataframe Creation

In [None]:
data = clean_data('RMAX', '2013-08-02', '2023-08-03')
data

[*********************100%***********************]  1 of 1 completed


Date,Open,High,Low,Close,Volume,1_month_returns,2_month_returns,3_month_returns,1_month_vol,2_month_vol,3_month_vol,fast_ema,slow_ema,APO,day_change,percent_day_change,signal
2013-10-02,26.250000,27.969999,25.400000,27.000000,13845300,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,27.000000,27.000000,0.000000,0.750000,0.028571,1
2013-10-03,27.129999,31.080000,27.000000,30.209999,2456400,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,27.062449,27.016755,0.045694,3.080000,0.113527,1
2013-10-04,30.920000,33.540001,30.400000,31.650000,1998100,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,27.170529,27.046041,0.124489,0.730000,0.023609,1
2013-10-07,31.100000,31.420000,30.020000,30.830000,460400,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,27.248727,27.067734,0.180993,-0.270000,-0.008682,0
2013-10-08,30.510000,30.639999,28.080000,28.500000,1101700,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,27.274152,27.075542,0.198610,-2.010000,-0.065880,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-07-27,20.299999,20.440001,19.660000,19.770000,58800,0.027547,0.062903,0.023822,0.019446,0.019904,0.020244,20.038104,25.670282,-5.632178,-0.529999,-0.026108,0
2023-07-28,19.879999,20.040001,19.629999,19.650000,38000,0.000509,0.051364,0.044102,0.018967,0.019932,0.019972,20.035645,25.660050,-5.624405,-0.230000,-0.011569,0
2023-07-31,19.600000,19.840000,19.590000,19.709999,56100,0.023364,0.071196,0.057971,0.018405,0.019744,0.019923,20.033605,25.650047,-5.616443,0.109999,0.005612,1
2023-08-01,19.629999,19.790001,19.260000,19.540001,83400,0.001538,0.028963,0.073037,0.018312,0.019202,0.019720,20.030513,25.639781,-5.609268,-0.089998,-0.004585,0


# Train Test Splitting/Scaling

In [None]:
#creating labels and features dataframes for models predicting tomorrow's open
open_labels = data['Open'].shift(-1).fillna(data['Open'].iloc[-1])
open_features = data.drop(['Open'], axis = 1)

#creating labels and features dataframes for models predicting tomorrow's close
close_labels = data['Close'].shift(-1).fillna(data['Close'].iloc[-1])
close_features = data.drop(['Close'], axis = 1)

In [None]:
#creating train/test dataframes for random forest training
x_train_open, x_test_open, y_train_open, y_test_open = train_test_split(open_features, open_labels, test_size = 0.2)
x_train_close, x_test_close, y_train_close, y_test_close = train_test_split(close_features,
                                                                            close_labels, test_size = 0.2)

#creating training and valid sets for neural net creation
x_train_open_nn, x_valid_open, y_train_open_nn, y_valid_open = train_test_split(x_train_open, y_train_open)
x_train_close_nn, x_valid_close, y_train_close_nn, y_valid_close = train_test_split(x_train_close, y_train_close)
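
Because the rows are ordered in time, one alternative worth considering is a chronological split that holds out the most recent rows, so the model is never tested on days that precede its training data. The sketch below is an assumption-labeled illustration with a hypothetical helper `chronological_split` and toy data; it is not what the cell above does, which shuffles rows by default.

```python
import pandas as pd


def chronological_split(features: pd.DataFrame, labels: pd.Series, test_size: float = 0.2):
    """Hold out the most recent `test_size` fraction of rows as the test set."""
    n_test = int(round(len(features) * test_size))
    cut = len(features) - n_test
    # Same return order as train_test_split: x_train, x_test, y_train, y_test
    return features.iloc[:cut], features.iloc[cut:], labels.iloc[:cut], labels.iloc[cut:]


# Toy data standing in for ten consecutive trading days
toy_x = pd.DataFrame({"Close": range(10)})
toy_y = pd.Series(range(1, 11))
x_tr, x_te, y_tr, y_te = chronological_split(toy_x, toy_y)
print(len(x_tr), len(x_te))
```

scikit-learn's `train_test_split` offers comparable behavior through its `shuffle=False` argument.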

In [None]:
#instantiating one scaler per feature set so the open and close statistics stay separate
scaler_open = StandardScaler()
scaler_close = StandardScaler()

#fitting each scaler on the training data only, then reusing those statistics everywhere else
x_train_open_nn = scaler_open.fit_transform(x_train_open_nn)
x_valid_open = scaler_open.transform(x_valid_open)

x_train_close_nn = scaler_close.fit_transform(x_train_close_nn)
x_valid_close = scaler_close.transform(x_valid_close)

scaled_open_features = scaler_open.transform(open_features)
scaled_close_features = scaler_close.transform(close_features)

scaled_x_test_open = scaler_open.transform(x_test_open)
scaled_x_test_close = scaler_close.transform(x_test_close)
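
A common convention when standardizing is to compute the mean and standard deviation from the training rows only and then apply those same statistics to validation, test, and prediction data, so that every set shares one scale. The sketch below shows that idea by hand in NumPy on made-up numbers; it is a manual stand-in for what `StandardScaler` does with `fit` followed by `transform`.

```python
import numpy as np

# Toy feature matrices, invented for illustration
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

# Statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population std (ddof=0), the same convention StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics
print(test_scaled)
```

Refitting on the test rows instead would map them to zero mean and unit variance on their own, which hides any shift between the two sets.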

# Random Forest Build

## Open Prediction

In [None]:
#Creating the first random forest model to operate as a baseline
rnd_for_open = RandomForestRegressor()
rnd_for_open.fit(x_train_open, y_train_open)

#Generating predictions
pred_rf_open = rnd_for_open.predict(x_test_open)

#Getting RMSE score
print('Root mean squared error is:', np.sqrt(mean_squared_error(y_test_open, pred_rf_open)))

Root mean squared error is: 0.38846089908809445


This provides us with a baseline to work with. A completely untuned model has a root mean squared error of about 0.3885.

### Hypertuning

For hypertuning I will try two different methods. First I will implement the more "modern" hyperopt style. This has the benefit of being fully automated, meaning that no input is needed between runs. This is more promising for use on the deployed system. It is worth noting that a lot of the code for implementing this comes from Viraj Bagal's wonderful Kaggle notebook titled "EDA, XGB,Random Forest Parameter tuning-hyperopt".

Then I will try the more traditional GridSearchCV method. It will be interesting to see if this produces better results with periodic input from a human (me).

Above all, it is important to remember that the purpose of this is to create a system usable by day traders. So, a method that improves results but takes 2 hours to run on a standard computer is not practical.

In [None]:
#creating a function that tells hyperopt what to work with
def objective(params):
  n_estimators = int(params['n_estimators'])
  max_depth = int(params['max_depth'])
  min_samples_split = int(params['min_samples_split'])
  min_samples_leaf = int(params['min_samples_leaf'])
  rf_open_tuned = RandomForestRegressor(n_estimators = n_estimators,
                                        max_depth = max_depth,
                                        min_samples_split = min_samples_split,
                                        min_samples_leaf = min_samples_leaf)
  rf_open_tuned.fit(x_train_open, y_train_open)
  pred_rf_open_tuned = rf_open_tuned.predict(x_test_open)
  score = np.sqrt(mean_squared_error(y_test_open, pred_rf_open_tuned))
  return score

#creating a function that tells hyperopt what space to optimize within
def optimize(trial):
  params = {
      'n_estimators':hp.uniform('n_estimators', 100, 750),
      'max_depth':hp.uniform('max_depth', 2, 75),
      'min_samples_split':hp.uniform('min_samples_split', 2, 6),
      'min_samples_leaf':hp.uniform('min_samples_leaf', 1, 5)}

#Timeout is set to 120 sec. This can be decreased for fast tuning, or increased for slow tuning
  best = fmin(fn=objective, space=params, algo = tpe.suggest, trials = trial, max_evals=200, timeout = 120)

  return best

In [None]:
trial=Trials()
best = optimize(trial)

  6%|▋         | 13/200 [02:11<31:30, 10.11s/trial, best loss: 0.38179608221654265]


Hyperopt lowered the RMSE by roughly 0.007 (from 0.3885 to 0.3818). While this is nice, especially for only about 2 minutes of tuning, the extra run time still makes it impractical for our design purpose, so the baseline random forest remains the better choice for deployment.

In [None]:
#Now I'll start the hypertuning process to see how much better we can make this
grid_search = GridSearchCV(rnd_for_open, {'n_estimators': [100, 250, 500], 'max_depth':[2, 15, 30],
    'min_samples_split':[2, 4, 6]}, cv=3, n_jobs = -1)
grid_search.fit(x_train_open, y_train_open)
grid_search.best_params_

{'max_depth': 30, 'min_samples_split': 4, 'n_estimators': 250}

In [None]:
grid_search = GridSearchCV(rnd_for_open, {'n_estimators': [150, 200, 250, 300, 400, 450], 'max_depth':[30, 50, 70],
    'min_samples_split':[3, 4, 5]}, cv=3, n_jobs = -1)
grid_search.fit(x_train_open, y_train_open)
grid_search.best_params_

{'max_depth': 70, 'min_samples_split': 3, 'n_estimators': 200}

In [None]:
grid_search = GridSearchCV(rnd_for_open, {'n_estimators': [160, 170, 180, 190, 200, 210, 220, 230, 240], 'max_depth':[70, 85, 100],
    'min_samples_split':[3]}, cv=3, n_jobs = -1)
grid_search.fit(x_train_open, y_train_open)
grid_search.best_params_

{'max_depth': 85, 'min_samples_split': 3, 'n_estimators': 160}

### Final Model

In [None]:
rnd_for_open_ht = RandomForestRegressor(**grid_search.best_params_)
rnd_for_open_ht.fit(x_train_open, y_train_open)

pred_rf_open_ht = rnd_for_open_ht.predict(x_test_open)

if np.sqrt(mean_squared_error(y_test_open, pred_rf_open_ht)) < np.sqrt(mean_squared_error(y_test_open, pred_rf_open)):
  print('The tuned model has a RMSE', np.sqrt(mean_squared_error(y_test_open, pred_rf_open)) - np.sqrt(mean_squared_error(y_test_open, pred_rf_open_ht)), 'lower than the untuned model')
else:
  print('The tuned model has a worse RMSE than the untuned model')

The tuned model has a RMSE 0.001717540362127623 lower than the untuned model


This creates a similar scenario to the hyperopt tuning. While the grid-searched model does produce a better result, by an RMSE of about 0.002, it takes more than 5 minutes in total to run. This is unsatisfactory for a day trader who is trying to evaluate many stocks in a short period of time. Therefore, we will build our prediction model with the base random forest.

In [None]:
#storing the RMSE for later
RMSE_open = np.sqrt(mean_squared_error(y_test_open, pred_rf_open))

In [None]:
#Generating open predictions using baseline random forest
tomorrows_open_pred_rf = rnd_for_open.predict(open_features)
open_pred_rf = tomorrows_open_pred_rf[-1]

## Close Prediction

In [None]:
rnd_for_close = RandomForestRegressor()
rnd_for_close.fit(x_train_close, y_train_close)

pred_rf_close = rnd_for_close.predict(x_test_close)

print('Root mean squared error is:', np.sqrt(mean_squared_error(y_test_close, pred_rf_close)))

Root mean squared error is: 0.8902604311765772


### Hypertuning

For the hypertuning of the close model, I will transfer the learnings from above. This means that we will stick with the standard random forest model for deployment, mainly because of the amount of run time hypertuning takes. However, I will include some grid searching here just to satisfy my curiosity about how much it might improve the model.

In [None]:
grid_search_close = GridSearchCV(rnd_for_close, {'max_depth':[20, 50, 75, 100],
                                                 'min_samples_split':[2, 5, 10],
                                                 'n_estimators':[100, 250, 500]}, cv=3, n_jobs=-1)
grid_search_close.fit(x_train_close, y_train_close)
grid_search_close.best_params_

{'max_depth': 20, 'min_samples_split': 5, 'n_estimators': 250}

In [None]:
grid_search_close = GridSearchCV(rnd_for_close, {'max_depth':[10, 15, 20],
                                                 'min_samples_split':[3, 4, 5, 6, 7, 8, 9],
                                                 'n_estimators':[125, 150, 175, 200, 225, 250, 275, 300,
                                                                 325, 350, 375, 400, 425, 450, 475]}, cv=3, n_jobs=-1)
grid_search_close.fit(x_train_close, y_train_close)
grid_search_close.best_params_

{'max_depth': 10, 'min_samples_split': 4, 'n_estimators': 275}

In [None]:
grid_search_close = GridSearchCV(rnd_for_close, {'max_depth':[2, 4, 6, 8, 10],
                                                 'min_samples_split':[4],
                                                 'n_estimators':[260, 270, 275, 280, 290]}, cv=3, n_jobs=-1)
grid_search_close.fit(x_train_close, y_train_close)
grid_search_close.best_params_

{'max_depth': 8, 'min_samples_split': 4, 'n_estimators': 275}

### Final Model

In [None]:
rnd_for_close_ht = RandomForestRegressor(**grid_search_close.best_params_)
rnd_for_close_ht.fit(x_train_close, y_train_close)

pred_rf_close_ht = rnd_for_close_ht.predict(x_test_close)

if np.sqrt(mean_squared_error(y_test_close, pred_rf_close_ht)) < np.sqrt(mean_squared_error(y_test_close, pred_rf_close)):
  print('The tuned model has a RMSE', np.sqrt(mean_squared_error(y_test_close, pred_rf_close)) - np.sqrt(mean_squared_error(y_test_close, pred_rf_close_ht)), 'lower than the untuned model')
else:
  print('The tuned model has a worse RMSE than the untuned model')

The tuned model has a worse RMSE than the untuned model


Here we actually see the tuned model performing worse than the original model on the test set. This is most likely a result of overfitting, and it reinforces the idea that the potential improvement from tuning does not justify the run time it takes. If we were aiming for the best possible model, we would want to take the time to tune it, but since we are aiming for the best combination of RMSE and run time, tuning is not worth the cost.

In [None]:
#storing the RMSE for later
RMSE_close = np.sqrt(mean_squared_error(y_test_close, pred_rf_close))

In [None]:
tomorrows_close_pred_rf = rnd_for_close.predict(close_features)
close_pred_rf = tomorrows_close_pred_rf[-1]

# Random Forest Signal

In [None]:
if close_pred_rf > open_pred_rf:
  print('BUY: the stock is predicted to increase by', round(close_pred_rf - open_pred_rf, 3), 'tomorrow')
else:
  #reporting the size of the predicted drop as a positive number
  print('SELL: the stock is predicted to decrease by', round(open_pred_rf - close_pred_rf, 3), 'tomorrow')
#the warning is the same for both signals, so it is printed once
print('WARNING: this prediction could be off by as much as +/-', round(RMSE_open + RMSE_close, 3))

BUY: the stock is predicted to increase by 0.033 tomorrow
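
The warning adds the two RMSE values together. As a hedged aside, neither that sum nor any other single number is a hard bound on the error, since individual misses can exceed the RMSE. If the open and close errors were assumed to be independent with zero mean (an assumption that may well not hold for prices from the same day), the typical error of the difference would instead be the square root of the sum of squares. The sketch below compares the two using the baseline RMSE values reported earlier in this notebook.

```python
import math

rmse_open = 0.3885    # baseline random forest, open model (reported above)
rmse_close = 0.8903   # baseline random forest, close model (reported above)

# What the warning message reports
sum_of_rmse = rmse_open + rmse_close

# Under the independence assumption, the error of (close - open)
# has a root mean square of sqrt(a^2 + b^2).
independent_rmse = math.hypot(rmse_open, rmse_close)

print(round(sum_of_rmse, 3), round(independent_rmse, 3))
```

Either figure is far larger than the predicted move of 0.033, which suggests the signal for this stock carries little confidence.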


# Neural Net Build

## Open Prediction

### Deep Neural Net


I will start by building a basic deep neural network. This is similar in concept to what I did above with the random forest, and will provide a baseline against which I can judge more tuned models.

In [None]:
x_train_open_nn.shape

(1485, 16)

In [None]:
open_model_deep = keras.models.Sequential([
    keras.layers.Dense(1509, activation = 'relu'),
    keras.layers.Dense(755, activation = 'relu'),
    keras.layers.Dense(378, activation = 'relu'),
    keras.layers.Dense(188, activation = 'relu'),
    keras.layers.Dense(95, activation = 'relu'),
    keras.layers.Dense(45, activation = 'relu'),
    keras.layers.Dense(24, activation = 'relu'),
    keras.layers.Dense(1)
])

early_stopping = keras.callbacks.EarlyStopping(patience = 10, restore_best_weights = True)
optimizer = keras.optimizers.Adam(learning_rate = 0.01)

open_model_deep.compile(loss = 'huber',
                        optimizer = optimizer,
                        metrics=['mean_squared_error'])

open_model_deep_1 = open_model_deep.fit(x_train_open_nn, y_train_open_nn, epochs=200,
                             validation_data = (x_valid_open, y_valid_open),
                              callbacks = [early_stopping])

Epoch 1/200
...
Epoch 21/200


### Deep and Wide NN

In [None]:
input = keras.layers.Input(shape = x_train_open_nn.shape[1:])
hidden1 = keras.layers.Dense(1509, activation = 'relu')(input)
hidden2 = keras.layers.Dense(755, activation = 'relu')(hidden1)
hidden3 = keras.layers.Dense(378, activation = 'relu')(hidden2)
hidden4 = keras.layers.Dense(188, activation = 'relu')(hidden3)
hidden5 = keras.layers.Dense(95, activation = 'relu')(hidden4)
hidden6 = keras.layers.Dense(45, activation = 'relu')(hidden5)
hidden7 = keras.layers.Dense(24, activation = 'relu')(hidden6)
concat = keras.layers.concatenate([input, hidden7])
output = keras.layers.Dense(1)(concat)
open_model_deep_wide = keras.models.Model(inputs=[input], outputs = [output])

open_model_deep_wide.compile(loss = 'huber',
                             optimizer = 'adam',
                             metrics=['mean_squared_error'])

open_model_deep_wide_1 = open_model_deep_wide.fit(x_train_open_nn, y_train_open_nn, epochs=200,
                             validation_data = (x_valid_open, y_valid_open),
                         callbacks = [early_stopping])

Epoch 1/200
...
Epoch 26/200


### Deep Model with Normalization


In [None]:
open_model_deep_normalized = keras.models.Sequential([
    keras.layers.Dense(1509, activation = 'relu'),
    keras.layers.Dense(755, activation = 'relu'),
    keras.layers.Dense(378, activation = 'relu'),
    keras.layers.Dense(188, activation = 'relu'),
    keras.layers.Dense(95, activation = 'relu'),
    keras.layers.Dense(45, activation = 'relu'),
    keras.layers.Dense(24, activation = 'relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(1)
])

optimizer = keras.optimizers.legacy.Adam(learning_rate = 0.01)

open_model_deep_normalized.compile(loss = 'huber',
                                   optimizer = optimizer,
                                   metrics=['mean_squared_error'])

open_model_deep_norm = open_model_deep_normalized.fit(x_train_open_nn, y_train_open_nn, epochs=200,
                                                            validation_data = (x_valid_open, y_valid_open),
                                                            callbacks = [early_stopping])

Epoch 1/200
...
Epoch 54/200


### Comparison

In [None]:
#Base Model
deep_error = open_model_deep.evaluate(scaled_x_test_open, y_test_open)
np.sqrt(deep_error)



array([0.85279222, 1.44092872])

7ms/step and a RMSE of 1.44

In [None]:
#Functional Model
deep_wide_error = open_model_deep_wide.evaluate(scaled_x_test_open, y_test_open)
np.sqrt(deep_wide_error)



array([0.69655402, 1.09588246])

10ms/step and a RMSE of 1.10. Slower than the deep model but 0.34 better RMSE

In [None]:
#Normalized Base
deep_norm_error = open_model_deep_normalized.evaluate(scaled_x_test_open, y_test_open)
rmse_open_nn = np.sqrt(deep_norm_error)
rmse_open_nn = rmse_open_nn[1]



6ms/step and 0.92 RMSE. Fastest and lowest RMSE

However, this is still higher than the RMSE from the random forest's open model. So, for the open prediction, the random forest is optimal.

### Prediction

In [None]:
tomorrows_open_pred_nn = open_model_deep_normalized.predict(scaled_open_features)
open_pred_nn = tomorrows_open_pred_nn[-1]



## Close Prediction

All of this will follow the same logic and steps as above.

### Deep Neural Net

In [None]:
x_train_close_nn.shape

(1485, 16)

In [None]:
close_model_deep = keras.models.Sequential([
    keras.layers.Dense(1509, activation = 'relu'),
    keras.layers.Dense(755, activation = 'relu'),
    keras.layers.Dense(378, activation = 'relu'),
    keras.layers.Dense(188, activation = 'relu'),
    keras.layers.Dense(95, activation = 'relu'),
    keras.layers.Dense(45, activation = 'relu'),
    keras.layers.Dense(24, activation = 'relu'),
    keras.layers.Dense(1)
])

early_stopping = keras.callbacks.EarlyStopping(patience = 10, restore_best_weights = True)
optimizer = keras.optimizers.Adam(learning_rate = 0.01)

close_model_deep.compile(loss = 'huber',
                         optimizer = optimizer,
                         metrics=['mean_squared_error'])

close_model_deep_1 = close_model_deep.fit(x_train_close_nn, y_train_close_nn, epochs=200,
                                          validation_data = (x_valid_close, y_valid_close),
                                          callbacks = [early_stopping])

Epoch 1/200
...
Epoch 21/200


### Deep and Wide NN

In [None]:
input = keras.layers.Input(shape = x_train_close_nn.shape[1:])
hidden1 = keras.layers.Dense(1509, activation = 'relu')(input)
hidden2 = keras.layers.Dense(755, activation = 'relu')(hidden1)
hidden3 = keras.layers.Dense(378, activation = 'relu')(hidden2)
hidden4 = keras.layers.Dense(188, activation = 'relu')(hidden3)
hidden5 = keras.layers.Dense(95, activation = 'relu')(hidden4)
hidden6 = keras.layers.Dense(45, activation = 'relu')(hidden5)
hidden7 = keras.layers.Dense(24, activation = 'relu')(hidden6)
concat = keras.layers.concatenate([input, hidden7])
output = keras.layers.Dense(1)(concat)
close_model_deep_wide = keras.models.Model(inputs=[input], outputs = [output])

optimizer = keras.optimizers.legacy.Adam(learning_rate = 0.01)

close_model_deep_wide.compile(loss = 'huber',
                              optimizer = optimizer,
                              metrics=['mean_squared_error'])

close_model_deep_wide_1 = close_model_deep_wide.fit(x_train_close_nn, y_train_close_nn, epochs=200,
                                                    validation_data = (x_valid_close, y_valid_close),
                                                    callbacks = [early_stopping])

Epoch 1/200
...
Epoch 42/200


### Deep with Normalization

In [None]:
close_model_deep_normalized = keras.models.Sequential([
    keras.layers.Dense(1509, activation = 'relu'),
    keras.layers.Dense(755, activation = 'relu'),
    keras.layers.Dense(378, activation = 'relu'),
    keras.layers.Dense(188, activation = 'relu'),
    keras.layers.Dense(95, activation = 'relu'),
    keras.layers.Dense(45, activation = 'relu'),
    keras.layers.Dense(24, activation = 'relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(1)
])

early_stopping = keras.callbacks.EarlyStopping(patience = 10, restore_best_weights = True)
optimizer = keras.optimizers.legacy.Adam(learning_rate = 0.01)

close_model_deep_normalized.compile(loss = 'huber',
                                    optimizer = optimizer,
                                    metrics=['mean_squared_error'])

close_model_deep_2 = close_model_deep_normalized.fit(x_train_close_nn, y_train_close_nn, epochs=200,
                                                     validation_data = (x_valid_close, y_valid_close),
                                                     callbacks = [early_stopping])

Epoch 1/200
...
Epoch 30/200


### Comparison

In [None]:
#Base Model
close_error_deep = close_model_deep.evaluate(scaled_x_test_close, y_test_close)
np.sqrt(close_error_deep)



array([0.8352383 , 1.58535967])

7ms/step and RMSE of 1.59 as baseline

In [None]:
#Functional Model
close_error_deep_wide = close_model_deep_wide.evaluate(scaled_x_test_close, y_test_close)
rmse_close_nn = np.sqrt(close_error_deep_wide)
rmse_close_nn = rmse_close_nn[1]



6ms/step and RMSE of 0.95. Slightly faster and much better RMSE

In [None]:
close_error_deep_norm = close_model_deep_normalized.evaluate(scaled_x_test_close, y_test_close)
np.sqrt(close_error_deep_norm)



array([0.5988882 , 0.96517463])

6ms/step and RMSE of 0.97. Slightly worse RMSE than deep wide

The deep and wide model is the best of the three networks, but it again has a higher RMSE and a longer run time than the random forest, so the random forest is optimal for the close prediction as well.
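
The comparison can be made explicit by ranking the close models on RMSE. The values below are copied from the outputs reported in this notebook (rounded); the dictionary keys are just illustrative labels.

```python
# RMSE values copied from the outputs reported in this notebook (close models)
close_rmse = {
    "random_forest": 0.890,
    "deep": 1.59,
    "deep_wide": 0.95,
    "deep_normalized": 0.97,
}

# Sort from lowest (best) to highest (worst) RMSE
ranked = sorted(close_rmse.items(), key=lambda item: item[1])
best_name, best_rmse = ranked[0]
print(best_name, best_rmse)
```

Run time would still need to be weighed alongside this ranking, since the design goal combines both.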

### Prediction

In [None]:
#predicting with the deep and wide model, which had the lowest close RMSE among the networks
tomorrows_close_pred_nn = close_model_deep_wide.predict(scaled_close_features)
close_pred_nn = tomorrows_close_pred_nn[-1]



# Neural Net Signal

In [None]:
if close_pred_nn > open_pred_nn:
  print('BUY, the stock is predicted to increase by', close_pred_nn - open_pred_nn, 'tomorrow')
  print('WARNING: this prediction could be off by as much as +/-', round(rmse_open_nn + rmse_close_nn, 3))
else:
  print('SELL, the stock is predicted to decrease by', close_pred_nn - open_pred_nn, 'tomorrow')
  print('WARNING: this prediction could be off by as much as +/-', round(rmse_open_nn + rmse_close_nn, 3))

SELL, the stock is predicted to decrease by [-0.53027344] tomorrow


# Conclusion

What I have found is that the random forest model generally performs better on stock data than the neural network does. This is a surprising finding for me, as I initially assumed that the neural network would perform better since it is a more robust approach.

The final deployed app can be found at daytradingstocksignalsystem.streamlit.app

Next steps on this project could include a more robust handling of volatility, potentially using GARCH instead of standard deviation, and further research to find a pre-trained neural net that could provide a better initialization.