Hi there, welcome to my project.

I wanted to get some basic practice creating models, something where I could train either regression or categorical models on the data depending on feature selection. I haven't played with this one in a while now, but if I come back to it, adding sentiment analysis would be the next step. See if I can narrow down some feature or threshold that I can pick something useful out of, possibly by narrowing it based on common stock behavior, going with all tech stocks for example. The idea was to see if the model could predict either via regression giving us a predicted Close, or categorically based on threshold gains/losses. Basically trying to predict tomorrow's stock prices and go find the option based on it. If I found anything useful, the next step would be to analyze which options would be the most profitable, followed by testing with a virtual account. But in any case, some old learning code...

<hr>
<h2>Step 1 - Define our Goal</h2>
What do we want to do? Get rich of course. But how?<br>
<br>
We're going to try to predict when either of two scenarios occur for any of a given selection of stocks:<br>
a) tomorrow's Close price will be significantly higher than today's Close value<br>
b) tomorrow's Close price will be significantly lower than today's Close value<br>
<br>
And then we're going to buy options, either put or call, based on those predictions first thing in the morning right after the market opens, and sell them at the end of the day just before the market closes. With options, being able to predict a price change of even a single percentage point is very valuable.<br>
<br>
At the end of all this, we want to be able to have a script that runs after every trading day. What we want from that script are our top picks for large changes in price in either direction for the following day. We want to then log in first thing in the morning and buy the predicted gainers, to be sold the following morning when the exchange opens before buying for the day. All gambles are on 1 day jumps or drops in price using put or call options accordingly.

<hr>
<h2>Step 2 - Get our data</h2>
We'll be using yfinance to get our trading history, so first we need to install that. Then the usual imports.<br>

In [1]:
!pip install -q yfinance

In [2]:
# importing our libraries
import numpy
import pandas
import matplotlib.pyplot as pyplot
import yfinance
import tensorflow.keras as keras
import seaborn
import os
import sklearn

# importing some methods and classes just so we don't have to type the package names every time
from pandas import DataFrame, Series
from tensorflow.data import Dataset
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Bidirectional, LSTM
from tensorflow.keras.optimizers import SGD, RMSprop, Adam, AdamW, Adadelta, Adagrad, Adamax, Adafactor, Nadam, Ftrl
from tensorflow.keras.losses import MeanSquaredError, MeanAbsoluteError
from tensorflow.keras.utils import Sequence
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split



I've selected companies with market valuations above 10 billion dollars traded on American stock exchanges. From that I've removed the first 4 years of data, tossing out the startup behavior, since I won't be predicting behavior on new companies. I'm looking for macro patterns that are less susceptible to general noise. I'll be keeping the last year of data as our validation data, as it's the most relevant to today's behavior. We also have to provide a window of stock history for each prediction, which I will remove before the validation set so that no portion of the validation set has been seen by the model before. Any stocks that didn't have enough data for that got tossed. Below is what we have left.<br>
<br>
<hr>
<br>
<h3>TODO: find which stocks correlate behavior, would a subset of similarly behaved stocks improve our predictions? Find out.</h3>
<h3>TODO: make this automatically determined.</h3>
<h3>TODO: narrow this to a single exchange perhaps? NASDAQ or TSE for example.</h3>
<br>
<hr>

Downloading the stock histories for the above companies.

In [3]:
def downloadHistories(tickerNames):
    histories = {}
    for name in tickerNames:
        ticker = yfinance.Ticker(name)
        histories[name] = ticker.history(period='max')
    return histories

def downloadHistoriesByTicker(tickerNames, start=None, end=None, period=None):
    historiesByTicker = {}
    for name in tickerNames:
        ticker = yfinance.Ticker(name)
        if period is None:
            historiesByTicker[name] = ticker.history(start=start, end=end)
        else:
            historiesByTicker[name] = ticker.history(period=period)
    return historiesByTicker

def getInitialData(tickerNames):
    trainingStockHistoriesByTicker = downloadHistoriesByTicker(tickerNames, start='2010-01-01', end='2020-01-01')
    testStockHistoriesByTicker = downloadHistoriesByTicker(tickerNames, period='2y')
    return trainingStockHistoriesByTicker, testStockHistoriesByTicker
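The download functions above fetch fixed date ranges and do not show the trimming of the first four years mentioned earlier. A minimal sketch of how that trimming could look, assuming a date-indexed history like the ones yfinance returns. `trimStartupYears` is a hypothetical helper, not part of the original notebook, and the example data is synthetic.

```python
import pandas

def trimStartupYears(history, years=4):
    """Drop the first `years` years of rows from a date-indexed history (hypothetical helper)."""
    if history.empty:
        return history
    # The cutoff is measured from the first row available for this ticker
    cutoff = history.index[0] + pandas.DateOffset(years=years)
    return history[history.index >= cutoff]

# Synthetic daily history, for illustration only
dates = pandas.date_range('2010-01-01', '2019-12-31', freq='D')
example = pandas.DataFrame({'Close': range(len(dates))}, index=dates)

trimmed = trimStartupYears(example)
print('first row kept:', trimmed.index[0])
print('rows dropped:', len(example) - len(trimmed))
```

A ticker whose history is shorter than the cutoff simply comes back empty, which makes it easy to toss afterwards.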

<hr>
<h2>Step 3 - Feature selection</h2>
Looks good. Now to add in additional features to our dataframes and drop anything we don't want.<br>
<br>
For new features, we'll be adding:<br>
Gain - the gain or loss of a stock from the Open to the Close price, as a fraction of the Close price (positive or negative float, so 0.01 is one percent). We're going to cap this at a 10 percent gain or loss in a day for scaling. This will be our feature set, and for our regression models, our labels.<br>
The following will be our labels for our binary classifier models.<br>
Gain05 - 1.0 if Gain is at least 0.005, 0.0 otherwise.<br>
Gain10 - 1.0 if Gain is at least 0.01, 0.0 otherwise.<br>
Gain15 - 1.0 if Gain is at least 0.015, 0.0 otherwise.<br>
Loss05 - 1.0 if Gain is at most -0.005, 0.0 otherwise.<br>
Loss10 - 1.0 if Gain is at most -0.01, 0.0 otherwise.<br>
Loss15 - 1.0 if Gain is at most -0.015, 0.0 otherwise.<br>
For dropped features, we're getting rid of the raw price and volume columns. We're going to try treating it as a univariate time series problem first and see if that works, since it's fairly simple and hopefully faster to train.
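As a quick illustration of the definitions above, here is a vectorised sketch of the same Gain and threshold labels on plain NumPy arrays. It is not the notebook's implementation: `makeGainAndLabels` is a hypothetical name, and the prices are made up. It divides by Close to mirror what `getFeaturesByTicker` does.

```python
import numpy

def makeGainAndLabels(openPrices, closePrices, cap=0.1):
    """Return the capped Gain and the six threshold labels (hypothetical helper)."""
    openPrices = numpy.asarray(openPrices, dtype=float)
    closePrices = numpy.asarray(closePrices, dtype=float)
    # Same formula as the notebook: Open-to-Close change divided by Close, capped at +/- 10 percent
    gain = numpy.clip((closePrices - openPrices) / closePrices, -cap, cap)
    labels = {
        'Gain05': (gain >= 0.005).astype(float),
        'Gain10': (gain >= 0.010).astype(float),
        'Gain15': (gain >= 0.015).astype(float),
        'Loss05': (gain <= -0.005).astype(float),
        'Loss10': (gain <= -0.010).astype(float),
        'Loss15': (gain <= -0.015).astype(float),
    }
    return gain, labels

# Made-up prices: a solid gain, a tiny gain, and a crash that hits the cap
gain, labels = makeGainAndLabels([100.0, 100.0, 100.0], [102.0, 100.2, 80.0])
print(gain)
print(labels['Gain15'], labels['Loss15'])
```

The comparisons are inclusive (`>=` and `<=`), matching the label functions in the next cell.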

In [18]:
def makeGain05Label(value):
    if value >= 0.005:
        return 1.0
    return 0.0

def makeGain10Label(value):
    if value >= 0.010:
        return 1.0
    return 0.0

def makeGain15Label(value):
    if value >= 0.015:
        return 1.0
    return 0.0

def makeLoss05Label(value):
    if value <= -0.005:
        return 1.0
    return 0.0

def makeLoss10Label(value):
    if value <= -0.010:
        return 1.0
    return 0.0

def makeLoss15Label(value):
    if value <= -0.015:
        return 1.0
    return 0.0

def capGain10(value):
    if value > 0.1:
        return 0.1
    if value < -0.1:
        return -0.1
    return value

def getFeaturesByTicker(tickerNames, stockHistories):
    featuresByTicker = {}
    
    for name in tickerNames:
        history = stockHistories[name].copy()
        history['Gain'] = (history['Close'] - history['Open']) / history['Close']
        history['Gain'] = history['Gain'].apply(capGain10)
        history['Gain05'] = history['Gain'].apply(makeGain05Label)
        history['Gain10'] = history['Gain'].apply(makeGain10Label)
        history['Gain15'] = history['Gain'].apply(makeGain15Label)
        history['Loss05'] = history['Gain'].apply(makeLoss05Label)
        history['Loss10'] = history['Gain'].apply(makeLoss10Label)
        history['Loss15'] = history['Gain'].apply(makeLoss15Label)
        history = history.drop(['Open', 'High', 'Low', 'Close', 'Volume'], axis=1)
        # history = history.drop(['Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits'], axis=1)
        featuresByTicker[name] = history
        
    return featuresByTicker

Looks good. Now to make something useful out of it.<br>
<br>
<hr>
<br>
<h2>Step 4 - Process the data</h2>
So now that we have our raw data and changed it into our feature set, time to process this into something we can use.<br>
<br>
For scaling, I'm going to use a threshold of a 10 percent gain or loss to capture our trading data. If a value change exceeds this threshold, I will cap it at 10 percent for our training. This shouldn't have a major impact on many values and allows us to use all the data. After that, we should have all our values lying within a range of -0.1 <= x <= 0.1, so dividing by our threshold gives us -1.0 <= x <= 1.0, so we add one to that to shift it, giving us 0.0 <= x <= 2.0, and then divide by 2 to give us our desired scaling of 0.0 <= x <= 1.0. Obviously we'll have to reverse that at the other end to get our predictions.<br>
<br>
After that, just break it into windows.
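The scaling arithmetic described above, and its inverse for reading predictions back out, can be sketched as follows. `scaleGain` and `unscaleGain` are hypothetical names used only for this illustration.

```python
import numpy

THRESHOLD = 0.1  # the 10 percent cap described above

def scaleGain(gain, threshold=THRESHOLD):
    """Map a gain in [-threshold, threshold] onto [0.0, 1.0], capping anything outside."""
    capped = numpy.clip(gain, -threshold, threshold)
    return ((capped / threshold) + 1.0) / 2.0

def unscaleGain(scaled, threshold=THRESHOLD):
    """Reverse scaleGain: map [0.0, 1.0] back onto [-threshold, threshold]."""
    return ((scaled * 2.0) - 1.0) * threshold

gains = numpy.array([-0.25, -0.1, 0.0, 0.015, 0.1])
scaled = scaleGain(gains)
restored = unscaleGain(scaled)
print(scaled)
print(restored)
```

Note that the round trip only recovers the capped value: the -0.25 in the example comes back as -0.1.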

In [5]:
def scaleDataByTicker(featuresByTicker):
    results = {}
    
    for name in featuresByTicker:
        features = featuresByTicker[name].copy()
        features['Gain'] = ((features['Gain'] / 0.1) + 1.0) / 2.0
        results[name] = features
        
    return results

def mergeDataFrameDict(featuresByTicker):
    # DataFrame.append was removed in pandas 2.0; concat stacks every frame in one call
    if not featuresByTicker:
        return None
    return pandas.concat(list(featuresByTicker.values()))

In [6]:
def getWindows(features, windowSize):
    X = []
    y = []
    yGain05 = []
    yGain10 = []
    yGain15 = []
    yLoss05 = []
    yLoss10 = []
    yLoss15 = []
    
    numWindows = len(features) - windowSize
    for index in range(0, numWindows):
        # .iloc keeps every lookup positional; a bare [-1] on a date-indexed Series is deprecated in recent pandas
        featureWindow = features.iloc[index:index + windowSize + 1]
        X.append(featureWindow['Gain'].iloc[:-1].to_numpy())
        y.append(featureWindow['Gain'].iloc[-1])
        yGain05.append(featureWindow['Gain05'].iloc[-1])
        yGain10.append(featureWindow['Gain10'].iloc[-1])
        yGain15.append(featureWindow['Gain15'].iloc[-1])
        yLoss05.append(featureWindow['Loss05'].iloc[-1])
        yLoss10.append(featureWindow['Loss10'].iloc[-1])
        yLoss15.append(featureWindow['Loss15'].iloc[-1])
    
    X = numpy.array(X)
    X = X.reshape(X.shape[0], X.shape[1], 1)
    y = numpy.array(y)
    yGain05 = numpy.array(yGain05)
    yGain10 = numpy.array(yGain10)
    yGain15 = numpy.array(yGain15)
    yLoss05 = numpy.array(yLoss05)
    yLoss10 = numpy.array(yLoss10)
    yLoss15 = numpy.array(yLoss15)
    
    return (X, y, yGain05, yGain10, yGain15, yLoss05, yLoss10, yLoss15)

def getWindowsByTicker(featuresByTicker, windowSize):
    windowsByTicker = {}
    for name in featuresByTicker:
        data = featuresByTicker[name]
        windowsByTicker[name] = getWindows(data, windowSize)
    return windowsByTicker

In [7]:
def getScaledDataByTicker(tickerNames):
    trainingStockHistoriesByTicker, testStockHistoriesByTicker = getInitialData(tickerNames)
    
    trainingFeaturesByTicker = getFeaturesByTicker(tickerNames, trainingStockHistoriesByTicker)
    testFeaturesByTicker = getFeaturesByTicker(tickerNames, testStockHistoriesByTicker)
    
    scaledTrainingDataByTicker = scaleDataByTicker(trainingFeaturesByTicker)
    scaledTestDataByTicker = scaleDataByTicker(testFeaturesByTicker)
    
    return scaledTrainingDataByTicker, scaledTestDataByTicker

In [8]:
def getRegressionData(tickerNames, featureWindowsByTicker):
    X = None
    y = None
    
    for name in tickerNames:
        featureWindow = featureWindowsByTicker[name]
        X = featureWindow[0] if X is None else numpy.vstack((X, featureWindow[0]))
        y = featureWindow[1] if y is None else numpy.concatenate((y, featureWindow[1]), axis=0)
    
    return X, y

In [9]:
# Define a custom data generator to provide a random subset of a larger dataset
class RandomSubsetDataGenerator(Sequence):
    def __init__(self, X, y, batch_size):
        self.X = X
        self.y = y
        self.batch_size = batch_size
        self.indexes = numpy.arange(len(self.X))

    def __len__(self):
        return int(numpy.ceil(len(self.X) / self.batch_size))

    def __getitem__(self, index):
#         start = index * self.batch_size
#         end = (index + 1) * self.batch_size
        batch_indexes = numpy.random.choice(self.indexes, size=self.batch_size, replace=False)
        batch_X = self.X[batch_indexes]
        batch_y = self.y[batch_indexes]
        return batch_X, batch_y

In [10]:
def getBinaryData(tickerNames, featureWindowsByTicker, column):
    X_true = []
    y_true = []
    X_false = []
    y_false = []
    
    for name in tickerNames:
        featureWindow = featureWindowsByTicker[name]
        
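        # zip over the first axis so each X is a whole window; numpy.nditer would walk
        # the individual scalars and pair them with the wrong labels
        for X, y in zip(featureWindow[0], featureWindow[column]):
            if y == 1.0:
                X_true.append(X)
                y_true.append(y)
            else:
                X_false.append(X)
                y_false.append(y)
    
    # return arrays so the data generators can index them with arrays of positions
    return numpy.array(X_true), numpy.array(y_true), numpy.array(X_false), numpy.array(y_false)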

In [11]:
# Define a custom data generator to provide a random subset of a larger dataset, but keep the number of true and the number of false equal
class RandomPairedSubsetDataGenerator(Sequence):
    def __init__(self, X_true, y_true, X_false, y_false, batch_size):
        # asarray lets the batches below be selected with arrays of indexes
        self.X_true = numpy.asarray(X_true)
        self.y_true = numpy.asarray(y_true)
        self.indexes_true = numpy.arange(len(self.X_true))
        self.X_false = numpy.asarray(X_false)
        self.y_false = numpy.asarray(y_false)
        self.indexes_false = numpy.arange(len(self.X_false))
        self.batch_size = batch_size
    
    def __len__(self):
        length = len(self.X_true) + len(self.X_false)
        return int(numpy.ceil(length / self.batch_size))
    
    def __getitem__(self, index):
        # numpy needs an integer sample size, so use floor division for each half
        halfBatch = self.batch_size // 2
        batch_indexes_true = numpy.random.choice(self.indexes_true, size=halfBatch, replace=False)
        batch_indexes_false = numpy.random.choice(self.indexes_false, size=halfBatch, replace=False)
        batch_X_true = self.X_true[batch_indexes_true]
        batch_y_true = self.y_true[batch_indexes_true]
        # the false half must come from the false arrays
        batch_X_false = self.X_false[batch_indexes_false]
        batch_y_false = self.y_false[batch_indexes_false]
        
        # join along the sample axis (axis 0); hstack would glue the windows together along the time axis
        batch_X = numpy.concatenate((batch_X_true, batch_X_false), axis=0)
        batch_y = numpy.concatenate((batch_y_true, batch_y_false), axis=0)
        batch_X, batch_y = sklearn.utils.shuffle(batch_X, batch_y)
        
        return batch_X, batch_y

Okay, we now have our data, we've gone through and selected the parts that we want, and everything is scaled and ready to be used. This should be plenty of data for now; we'll train on a small subset initially or this will take forever. Onwards to the model building.<br>
<br>
<hr>
<h2>Step 5 - Build a model</h2>
Time to start building our models. Some to try would be simple neural networks, RandomForestModel, XGBoost, vanilla or stacked LSTMs, Bidirectional LSTMs, GRUs, Convolutional LSTMs, and Simple RNNs. We'll try both MAE and MSE for loss functions. For optimizers we'll try all of them.<br>
<br>
So for our LSTM with five alternating layers of LSTM with 16 nodes and a 0.2 dropout, early results show that MAE as the loss function, along with SGD as the optimizer, gave the best results for the regression model. Early results for the binary classification models both looked extremely good, but I need to dig into those results a bit more next.

In [12]:
def basicLSTM(size, batchSize, daysOfHistory, columnCount):
    model = Sequential([
        # input_shape alone leaves the batch dimension free, so the generators can feed batches of any size
        LSTM(size, return_sequences=True, input_shape=(daysOfHistory, columnCount), activation='relu'),
        LSTM(size, return_sequences=True, activation='relu'),
        LSTM(size, return_sequences=True, activation='relu'),
        LSTM(size, return_sequences=True, activation='relu'),
        LSTM(size, return_sequences=False, activation='relu'),
        Dense(1)
    ])
    return model

def basicLSTM_binary(size, batchSize, daysOfHistory, columnCount):
    model = Sequential([
        # input_shape alone leaves the batch dimension free, so the generators can feed batches of any size
        LSTM(size, return_sequences=True, input_shape=(daysOfHistory, columnCount), activation='relu'),
        LSTM(size, return_sequences=True, activation='relu'),
        LSTM(size, return_sequences=True, activation='relu'),
        LSTM(size, return_sequences=True, activation='relu'),
        LSTM(size, return_sequences=False, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    return model

Okay, looks good, but this is far too much data to train on right now, so let's take a sampling. Once we settle on our model of choice we can train it longer. We'll use a data generator to do the trick.
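To make the sampling idea concrete, here is a minimal NumPy-only sketch of drawing one random batch without replacement. `sampleBatch` is a hypothetical helper and the data is synthetic; it also clamps the batch size, which is an added safeguard.

```python
import numpy

def sampleBatch(X, y, batchSize, rng):
    """Draw one random batch of matching rows from X and y (hypothetical helper)."""
    # Cannot draw more unique rows than exist
    batchSize = min(batchSize, len(X))
    indexes = rng.choice(len(X), size=batchSize, replace=False)
    return X[indexes], y[indexes]

# Synthetic data: 10 windows of 2 steps each, built so that X[i, 0, 0] == 2 * y[i]
rng = numpy.random.default_rng(0)
X = numpy.arange(20, dtype=float).reshape(10, 2, 1)
y = numpy.arange(10, dtype=float)

batchX, batchY = sampleBatch(X, y, 4, rng)
print(batchX.shape, batchY)
```

Because every batch is drawn independently, an "epoch" here is a fixed number of random draws and not a full pass over the data, so some rows may never be seen while others repeat across batches.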

In [13]:
def loadOrTrainModel(modelName, model, X, y, X_test, y_test):
    modelPath = '/kaggle/working/' + str(modelName) + '.csv'
    if os.path.isfile(modelPath):
        print(modelName, 'already trained, loading history')
        modelHistoryDataFrame = pandas.read_csv(modelPath)
    else:
        train_data_generator = RandomSubsetDataGenerator(X, y, batchSize * 50)
        test_data_generator = RandomSubsetDataGenerator(X_test, y_test, batchSize * 10)
        earlyStoppingCallback = keras.callbacks.EarlyStopping(patience=5, min_delta=0.0005, restore_best_weights=True,)
        
        print('training', modelName)
        # a Sequence already yields whole batches, so fit() must not be given batch_size as well
        modelHistory = model.fit(train_data_generator,
                                 validation_data=test_data_generator,
                                 epochs=epochs,
                                 steps_per_epoch=len(train_data_generator),
                                 validation_steps=len(test_data_generator),
                                 verbose=1,
                                 callbacks=[earlyStoppingCallback])
        modelHistoryDataFrame = DataFrame(modelHistory.history)
        modelHistoryDataFrame.to_csv(modelPath)
    
    return modelHistoryDataFrame

def trainRegressionModel(modelName, model, X, y, X_test, y_test):
    # the parameter is named `model` so the body trains the model passed in, not a global
    modelPath = '/kaggle/working/' + str(modelName) + '.csv'
    train_data_generator = RandomSubsetDataGenerator(X, y, batchSize * 50)
    test_data_generator = RandomSubsetDataGenerator(X_test, y_test, batchSize * 10)
    earlyStoppingCallback = keras.callbacks.EarlyStopping(patience=5, min_delta=0.0001, restore_best_weights=True,)
    print('\n******************************\n')
    print('training', modelName)
    print('\n******************************\n')
    # a Sequence already yields whole batches, so fit() must not be given batch_size as well
    modelHistory = model.fit(train_data_generator,
                             validation_data=test_data_generator,
                             epochs=epochs,
                             steps_per_epoch=len(train_data_generator),
                             validation_steps=len(test_data_generator),
                             verbose=1,
                             callbacks=[earlyStoppingCallback])
    modelHistoryDataFrame = DataFrame(modelHistory.history)
    modelHistoryDataFrame.to_csv(modelPath)
    
    return modelHistoryDataFrame

def loadModelHistory(modelName):
    modelPath = '/kaggle/working/' + str(modelName) + '.csv'
    if os.path.isfile(modelPath):
        print(modelName, 'already trained, loading history')
        modelHistoryDataFrame = pandas.read_csv(modelPath)
        return modelHistoryDataFrame
    
    return None

<hr>
<h2>Step 6 - Look at our results</h2>
Okay, from those first LSTM results, the binary classification models appear to be far superior to the regression models. But to make sure, let's go see what we have for actual results and predictions. If these results pan out, I'll try layering binary classifiers with different thresholds and taking the stocks that get predicted by multiple threshold models, hopefully for the highest gains and additional accuracy.<br>
<br>
First off, let's go see what we have for labels and predictions for each of the stocks. Let's see how much of a change we have vs. what we're predicting with each model.
What constitutes good results though? For our regression model we'll be using mean absolute error as our loss function. A successful result is one in which we predict a useful percentage of price changes larger than the mean absolute error. For the binary classifiers we're looking for predicting enough of them to be useful and being correct often enough to make it profitable.<br>
<br>
As for what to measure it against, we need to keep in mind our same criteria for selecting training windows, namely the entire 250 day trading history needs to fall within a ten percent change in either direction and be one of the companies in the tickerNames list. Our trading strategy to test with is to purchase either call or put options based on whether it is a predicted gain or loss. The options chain we'll be using is the second one, so the one that ends the following week, but we should test that to see which makes the most money. Finally, we want to look at each stock individually. Do certain stocks behave more predictably than others?
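For the binary classifiers, "predicting enough of them" and "being correct enough times" correspond roughly to recall and precision. A minimal sketch of counting both, assuming a 0.5 cutoff on the sigmoid output. `summarizeBinaryPredictions` is a hypothetical helper, and the numbers below are made up.

```python
import numpy

def summarizeBinaryPredictions(predictions, actuals, cutoff=0.5):
    """Count hits, false alarms and misses for a thresholded classifier (hypothetical helper)."""
    predicted = numpy.asarray(predictions, dtype=float) >= cutoff
    actual = numpy.asarray(actuals, dtype=float) == 1.0
    hits = int(numpy.sum(predicted & actual))
    falseAlarms = int(numpy.sum(predicted & ~actual))
    missed = int(numpy.sum(~predicted & actual))
    # precision: how often a predicted jump was real; recall: how many real jumps were caught
    precision = hits / (hits + falseAlarms) if (hits + falseAlarms) else 0.0
    recall = hits / (hits + missed) if (hits + missed) else 0.0
    return {'hits': hits, 'falseAlarms': falseAlarms, 'missed': missed,
            'precision': precision, 'recall': recall}

# Made-up model outputs and labels, for illustration only
summary = summarizeBinaryPredictions([0.9, 0.8, 0.2, 0.6, 0.1], [1.0, 0.0, 1.0, 1.0, 0.0])
print(summary)
```

For a trading strategy, precision matters most, since every false alarm is a losing trade, while low recall only means fewer trades.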

In [14]:
# def getBinaryComparison(predictions, actuals):
#     results = {}
#     successfulPredictions = 0
#     actualJumps = 0
#     missedJumps = 0
#     badPredictions = 0
    
#     for prediction, actual in zip(numpy.nditer(predictions), numpy.nditer(actuals)):
#         if actual == 1.0:
#             actualJumps = actualJumps + 1
#             if prediction >= 0.5:
#                 successfulPredictions = successfulPredictions + 1
#             else:
#                 missedJumps = missedJumps + 1
#         else:
#             if prediction >= 0.5:
#                 badPredictions = badPredictions + 1
    
#     results['total predictions'] = len(predictions)
#     results['successfulPredictions'] = successfulPredictions
#     results['actualJumps'] = actualJumps
#     results['missedJumps'] = missedJumps
#     results['badPredictions'] = badPredictions
    
#     return results

In [15]:
# actuals = ((y_test.squeeze() * 2) - 1) * 10
# print('actuals[0:10]:', actuals[0:10], '\n')
# binaryActuals = y_gain10_test.squeeze()
# print('binaryActuals[0:10]:', binaryActuals[0:10], '\n')

# binary8Predictions = model8binary.predict(X_test).squeeze()
# print('binary8Predictions[:10]:', binary8Predictions[0:10])
# binary8Comparison = getBinaryComparison(binary8Predictions, binaryActuals)
# print('binary8Comparison', binary8Comparison, '\n')

# binary16Predictions = model16binary.predict(X_test).squeeze()
# print('binary16Predictions[:10]:', binary16Predictions[0:10])
# binary16Comparison = getBinaryComparison(binary16Predictions, binaryActuals)
# print('binary16Comparison', binary16Comparison, '\n')

# averageChange = numpy.mean(numpy.abs(actuals))

# model8Predictions = ((model8.predict(X_test).squeeze() * 2) - 1) * 10
# print('model8Predictions[:10]:', model8Predictions[0:10], '\n')
# print('total predictions:', len(model8Predictions))
# model8mae = numpy.mean(numpy.abs(actuals - model8Predictions))
# print('model8 mae:', model8mae, '\n')
# print('average change:', averageChange)

# model16Predictions = ((model16.predict(X_test).squeeze() * 2) - 1) * 10
# print('model16Predictions[:10]:', model16Predictions[0:10], '\n')
# print('total predictions:', len(model16Predictions))
# model16mae = numpy.mean(numpy.abs(actuals - model16Predictions))
# print('model16 mae:', model16mae, '\n')
# print('average change:', averageChange)

And ... no useful results. But we at least now have a way to test it. Time to try some other models. When we have better results we'll take the time to implement the trading strategy and verify actual profit levels. So for our regression models, we need our MAE to be significantly lower than our average change. For our classifier models, we need a useful percentage of big changes to be predicted and we need enough big predictions to be accurate to turn a profit.
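The regression criterion above amounts to comparing the model's MAE with the error of always predicting "no change", which is the average absolute move. A small sketch with made-up numbers; `compareToBaseline` is a hypothetical name.

```python
import numpy

def compareToBaseline(predictions, actuals):
    """Return the model's MAE and the MAE of always predicting zero change (hypothetical helper)."""
    predictions = numpy.asarray(predictions, dtype=float)
    actuals = numpy.asarray(actuals, dtype=float)
    modelMae = float(numpy.mean(numpy.abs(actuals - predictions)))
    # Predicting 0 every day gives an error equal to the average absolute change
    averageChange = float(numpy.mean(numpy.abs(actuals)))
    return modelMae, averageChange

# Made-up percentage changes, for illustration only
modelMae, averageChange = compareToBaseline([0.5, -0.5, 1.0], [1.0, -1.0, 2.0])
print('model mae:', modelMae)
print('average change:', averageChange)
```

A model whose MAE is not clearly below the average change is doing no better than guessing that nothing will move.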

<hr>
And to put it all in one place ...

In [16]:
tickerNames = ['msft', 'aapl', 'goog', 'amzn', 'nvda', 'meta', 'tsla', 'tsm', 'avgo', 'orcl', 'asml', 'amd', 'adbe', 'crm', 'nflx', 'csco', 'intc',
               'sap', 'intu', 'qcom', 'txn', 'ibm', 'now', 'amat', 'bkng', 'sony', 'lrcx', 'panw', 'shop', 'adp', 'adi', 'mu', 'meli', 'fi', 'klac',
               'anet', 'cdns', 'snps', 'wday', 'eqix', 'pypl', 'team', 'mrvl', 'dell', 'rop', 'nxpi', 'adsk', 'mchp', 'ftnt', 'tel', 'stm', 'sq',
               'iqv', 'ea', 'fis', 'csgp', 'gpn', 'veev', 'ttd', 'on', 'fico', 'mpwr', 'anss', 'hubs', 'hpq', 'mdb', 'ttwo', 'snap', 'keys', 'splk',
               'grmn', 'smci', 'ebay', 'ptc', 'se', 'expe', 'algn', 'asx', 'hpe', 'umc', 'eric', 'nok', 'chkp', 'akam', 'ntap', 'tyl', 'stx', 'entg',
               'fds', 'epam', 'swks', 'gddy', 'ldos', 'ssnc', 'gen', 'logi', 'enph', 'manh', 'okta', 'ntnx', 'nice', 'twlo', 'pstg', 'azpn', 'zbra',
               'trmb', 'jkhy', 'roku', 'jnpr', 'payc', 'otex', 'ffiv', 'dox', 'qrvo']

Get our data with features and labels. This has not been windowed yet. We'll pass in the window size so we can run models against different history lengths.
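To show what windowing with a configurable size produces, here is a NumPy-only sketch on a tiny made-up series. It is an illustration, not the notebook's `getWindows`: `makeWindows` is a hypothetical helper, and it uses `sliding_window_view`, which needs NumPy 1.20 or newer.

```python
import numpy
from numpy.lib.stride_tricks import sliding_window_view

def makeWindows(series, windowSize):
    """Split a 1-D series into (X, y): windowSize inputs followed by the next value (hypothetical helper)."""
    series = numpy.asarray(series, dtype=float)
    if len(series) <= windowSize:
        # Not enough history for even one window
        return numpy.empty((0, windowSize, 1)), numpy.empty((0,))
    # Each row holds windowSize inputs plus the value to predict
    rows = sliding_window_view(series, windowSize + 1)
    X = rows[:, :-1].reshape(-1, windowSize, 1)
    y = rows[:, -1].copy()
    return X, y

X, y = makeWindows([0.1, 0.2, 0.3, 0.4, 0.5], windowSize=3)
print(X.shape)
print(y)
```

A series of length n yields n - windowSize windows, so longer history lengths leave fewer training samples per ticker.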

In [19]:
scaledTrainingDataByTicker, scaledTestDataByTicker = getScaledDataByTicker(tickerNames)

In [20]:
scaledTrainingDataByTicker['aapl'].describe()

stat,Dividends,Stock Splits,Gain,Gain05,Gain10,Gain15,Loss05,Loss10,Loss15
count,2516.0,2516.0,2516.0,2516.0,2516.0,2516.0,2516.0,2516.0,2516.0
mean,0.001676,0.002782,0.499558,0.323529,0.18283,0.08903,0.303259,0.180843,0.113275
std,0.015637,0.139554,0.063767,0.467916,0.386604,0.284844,0.459758,0.384964,0.316992
min,0.0,0.0,0.144785,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.464423,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.502185,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.537182,1.0,0.0,0.0,1.0,0.0,0.0
max,0.1925,7.0,0.900019,1.0,1.0,1.0,1.0,1.0,1.0


In [21]:
scaledTestDataByTicker['aapl'].describe()

stat,Dividends,Stock Splits,Gain,Gain05,Gain10,Gain15,Loss05,Loss10,Loss15
count,501.0,501.0,501.0,501.0,501.0,501.0,501.0,501.0,501.0
mean,0.003872,0.0,0.506051,0.333333,0.195609,0.071856,0.265469,0.129741,0.063872
std,0.03044,0.0,0.055851,0.471876,0.397065,0.258508,0.442024,0.336354,0.24477
min,0.0,0.0,0.300885,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.472319,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.504688,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.540742,1.0,0.0,0.0,1.0,0.0,0.0
max,0.25,0.0,0.825851,1.0,1.0,1.0,1.0,1.0,1.0


In [22]:
epochs = 300
daysOfHistory = 150
size = 250
batchSize = 32
columnCount = 1

**Uncomment this cell to see my last models**

In [23]:
models = {}

# modelAdam = basicLSTM(size, batchSize, daysOfHistory, columnCount)
# modelAdam.compile(optimizer=Adam(), loss=MeanAbsoluteError(), metrics=['mean_absolute_error'])
# models['Adam'] = modelAdam

# modelAdamW = basicLSTM(size, batchSize, daysOfHistory, columnCount)
# modelAdamW.compile(optimizer=AdamW(), loss=MeanAbsoluteError(), metrics=['mean_absolute_error'])
# models['AdamW'] = modelAdamW

# modelAdadelta = basicLSTM(size, batchSize, daysOfHistory, columnCount)
# modelAdadelta.compile(optimizer=Adadelta(), loss=MeanAbsoluteError(), metrics=['mean_absolute_error'])
# models['Adadelta'] = modelAdadelta

# modelAdagrad = basicLSTM(size, batchSize, daysOfHistory, columnCount)
# modelAdagrad.compile(optimizer=Adagrad(), loss=MeanAbsoluteError(), metrics=['mean_absolute_error'])
# models['Adagrad'] = modelAdagrad

# modelAdamax = basicLSTM(size, batchSize, daysOfHistory, columnCount)
# modelAdamax.compile(optimizer=Adamax(), loss=MeanAbsoluteError(), metrics=['mean_absolute_error'])
# models['Adamax'] = modelAdamax

# modelAdafactor = basicLSTM(size, batchSize, daysOfHistory, columnCount)
# modelAdafactor.compile(optimizer=Adafactor(), loss=MeanAbsoluteError(), metrics=['mean_absolute_error'])
# models['Adafactor'] = modelAdafactor

# modelFtrl = basicLSTM(size, batchSize, daysOfHistory, columnCount)
# modelFtrl.compile(optimizer=Ftrl(), loss=MeanAbsoluteError(), metrics=['mean_absolute_error'])
# models['Ftrl'] = modelFtrl

# modelNadam = basicLSTM(size, batchSize, daysOfHistory, columnCount)
# modelNadam.compile(optimizer=Nadam(), loss=MeanAbsoluteError(), metrics=['mean_absolute_error'])
# models['Nadam'] = modelNadam

# modelRMSprop = basicLSTM(size, batchSize, daysOfHistory, columnCount)
# modelRMSprop.compile(optimizer=RMSprop(), loss=MeanAbsoluteError(), metrics=['mean_absolute_error'])
# models['RMSprop'] = modelRMSprop

# modelSGD = basicLSTM(size, batchSize, daysOfHistory, columnCount)
# modelSGD.compile(optimizer=SGD(), loss=MeanAbsoluteError(), metrics=['mean_absolute_error'])
# models['SGD'] = modelSGD

In [24]:
modelHistories = {}
modelsToRun = []

for modelName in models:
    modelHistory = loadModelHistory(modelName)
    if modelHistory is None:
        modelsToRun.append(modelName)
    else:
        modelHistories[modelName] = modelHistory

if len(modelsToRun) > 0:
    print('Running', len(modelsToRun), 'models')
    
    print('Building training windows')
    featureWindowsByTicker_train = getWindowsByTicker(scaledTrainingDataByTicker, daysOfHistory)
    print('Building test windows')
    featureWindowsByTicker_test = getWindowsByTicker(scaledTestDataByTicker, daysOfHistory)
    print('Preparing training feature sets and labels')
    X_train_regression, y_train_regression = getRegressionData(tickerNames, featureWindowsByTicker_train)
    print('Preparing test feature sets and labels')
    X_test_regression, y_test_regression = getRegressionData(tickerNames, featureWindowsByTicker_test)
    
    for modelName in modelsToRun:
        model = models[modelName]
        modelHistory = trainRegressionModel(modelName, model, X_train_regression, y_train_regression, X_test_regression, y_test_regression)
        modelHistories[modelName] = modelHistory

In [25]:
def combineDict(dictionary, metric):
    # one column per model, one row per epoch; shorter runs are padded with NaN
    return DataFrame({modelName: historyDf[metric] for modelName, historyDf in dictionary.items()})

# plot whenever there is any history, whether freshly trained or loaded from disk
if len(modelHistories) > 0:
    lossDf = combineDict(modelHistories, 'loss')
    val_lossDf = combineDict(modelHistories, 'val_loss')
    # the regression models are compiled with mean_absolute_error, so there is no 'accuracy' history
    maeDf = combineDict(modelHistories, 'mean_absolute_error')
    val_maeDf = combineDict(modelHistories, 'val_mean_absolute_error')
    
    lossDf.plot(title='Loss by optimizer')
    val_lossDf.plot(title='Validation Loss by optimizer')
    maeDf.plot(title='Mean Absolute Error by optimizer')
    val_maeDf.plot(title='Validation Mean Absolute Error by optimizer')

In [None]:
# trainingFeatureWindowsByTicker = getWindowsByTicker(scaledTrainingDataByTicker, windowSize)
# testFeatureWindowsByTicker = getWindowsByTicker(scaledTestDataByTicker, windowSize)

Our regression training can use all of the data, so nothing special here. We'll use RandomSubsetDataGenerator to give us a random training set from the entire available dataset.

In [None]:
# X_train_regression, y_train_regression = getRegressionData(tickerNames, trainingFeatureWindowsByTicker)
# X_test_regression, y_test_regression = getRegressionData(tickerNames, testFeatureWindowsByTicker)

Our binary classifiers, however, need each class represented relatively evenly, so we'll draw a random subset from each side. RandomPairedSubsetDataGenerator does that for us on binary models.

In [None]:
# # binary classifier with true being a gain above 0.5%
# X_train_gain05_true, y_train_gain05_true, X_train_gain05_false, y_train_gain05_false = getBinaryData(tickerNames, trainingFeatureWindowsByTicker, 2)
# X_test_gain05_true, y_test_gain05_true, X_test_gain05_false, y_test_gain05_false = getBinaryData(tickerNames, testFeatureWindowsByTicker, 2)

# # binary classifier with true being a gain above 1.0%
# X_train_gain10_true, y_train_gain10_true, X_train_gain10_false, y_train_gain10_false = getBinaryData(tickerNames, trainingFeatureWindowsByTicker, 3)
# X_test_gain10_true, y_test_gain10_true, X_test_gain10_false, y_test_gain10_false = getBinaryData(tickerNames, testFeatureWindowsByTicker, 3)

# # binary classifier with true being a gain above 1.5%
# X_train_gain15_true, y_train_gain15_true, X_train_gain15_false, y_train_gain15_false = getBinaryData(tickerNames, trainingFeatureWindowsByTicker, 4)
# X_test_gain15_true, y_test_gain15_true, X_test_gain15_false, y_test_gain15_false = getBinaryData(tickerNames, testFeatureWindowsByTicker, 4)

# # binary classifier with true being a loss above 0.5%
# X_train_loss05_true, y_train_loss05_true, X_train_loss05_false, y_train_loss05_false = getBinaryData(tickerNames, trainingFeatureWindowsByTicker, 5)
# X_test_loss05_true, y_test_loss05_true, X_test_loss05_false, y_test_loss05_false = getBinaryData(tickerNames, testFeatureWindowsByTicker, 5)

# # binary classifier with true being a loss above 1.0%
# X_train_loss10_true, y_train_loss10_true, X_train_loss10_false, y_train_loss10_false = getBinaryData(tickerNames, trainingFeatureWindowsByTicker, 6)
# X_test_loss10_true, y_test_loss10_true, X_test_loss10_false, y_test_loss10_false = getBinaryData(tickerNames, testFeatureWindowsByTicker, 6)

# # binary classifier with true being a loss above 1.5%
# X_train_loss15_true, y_train_loss15_true, X_train_loss15_false, y_train_loss15_false = getBinaryData(tickerNames, trainingFeatureWindowsByTicker, 7)
# X_test_loss15_true, y_test_loss15_true, X_test_loss15_false, y_test_loss15_false = getBinaryData(tickerNames, testFeatureWindowsByTicker, 7)

Binary classifiers, however, need a balanced set of data, roughly half and half of each; otherwise the model is skewed toward whichever is more common. So we'll return a randomized subset from each side.
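A minimal NumPy-only sketch of drawing one balanced batch, half from the true windows and half from the false ones. `sampleBalancedBatch` is a hypothetical helper, the arrays are synthetic, and the clamping to the smaller class is an added assumption.

```python
import numpy

def sampleBalancedBatch(X_true, X_false, batchSize, rng):
    """Draw an equal number of true and false windows and shuffle them together (hypothetical helper)."""
    # Each class supplies half the batch, limited by the smaller class
    half = min(batchSize // 2, len(X_true), len(X_false))
    trueIndexes = rng.choice(len(X_true), size=half, replace=False)
    falseIndexes = rng.choice(len(X_false), size=half, replace=False)
    # Join along the sample axis so the window shape is preserved
    batchX = numpy.concatenate((X_true[trueIndexes], X_false[falseIndexes]), axis=0)
    batchY = numpy.concatenate((numpy.ones(half), numpy.zeros(half)))
    order = rng.permutation(len(batchY))
    return batchX[order], batchY[order]

# Synthetic, heavily imbalanced data: 5 true windows (all ones) and 50 false windows (all zeros)
rng = numpy.random.default_rng(0)
X_true = numpy.ones((5, 3, 1))
X_false = numpy.zeros((50, 3, 1))

batchX, batchY = sampleBalancedBatch(X_true, X_false, 8, rng)
print(batchX.shape, batchY)
```

Balancing the batches changes the base rate the model sees, so predicted probabilities will run higher than the true frequency of big moves and should be checked against the unbalanced test data.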