# Data Acquisition and Processing Systems (DaPS) (ELEC0136)

__Final Assignment__

---

## Package Imports:

In [1]:
# General
import os
from datetime import datetime
import numpy as np
import math

# APIs
import yfinance as yf
import alpha_vantage
from alpha_vantage.timeseries import TimeSeries

# Data storage and processing
import pandas as pd
import pyarrow.feather as feather

# Data exploration and visualisation
import statsmodels.api as sm
from statsmodels.tsa.seasonal import STL
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.ar_model import AutoReg  # `AR` is no longer available in recent statsmodels releases
from statsmodels.tsa.seasonal import seasonal_decompose
from scipy.stats import f_oneway
from scipy.stats import chi2_contingency
from statsmodels.formula.api import ols
from scipy.stats import friedmanchisquare
import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import kaleido
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

# Modelling
from operator import itemgetter
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from math import sqrt
import torch
import torch.nn as nn
from torch.autograd import Variable

## Data Acquisition:

We group data sources into classes with which we can:

1. `load` -- request data from an API or parse local plaintext storage files, bringing the data into scope
2. `write` -- store the ingested data in a format that benefits the performance of the system
3. `fetch` -- not exactly a query system, but rather a much more performant data I/O routine that reads from local 'hot' storage
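
The three-step life cycle above can be sketched in isolation. This is a minimal illustration rather than the notebook's implementation: `ToySource` is a hypothetical class, the stored values are placeholders, and JSON stands in for the feather store purely to keep the example free of third-party dependencies.

```python
import json
import os
import tempfile


class ToySource:
    """Hypothetical data source with the load / write / fetch life cycle."""

    def __init__(self, save_path):
        self.save_path = save_path
        self.data = None

    def load(self):
        # Stand-in for an API request or a CSV parse (placeholder values).
        self.data = {"day-1": 1.0, "day-2": 2.0}

    def write(self):
        # Persist whatever is currently in scope to local storage.
        with open(self.save_path, "w") as fh:
            json.dump(self.data, fh)

    def fetch(self):
        # Read back from local storage without touching the original source.
        if not os.path.exists(self.save_path):
            return None
        with open(self.save_path) as fh:
            self.data = json.load(fh)
        return self.data


store = os.path.join(tempfile.mkdtemp(), "toy_store.json")

writer = ToySource(store)
writer.load()
writer.write()

# A fresh object never calls `load`, yet still recovers the data.
reader = ToySource(store)
recovered = reader.fetch()
```

The second object only ever calls `fetch`, which is the same separation the main program relies on when it discards the loader objects and re-reads from the feather store.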

### `YahooFinance`

This class sources, stores and fetches data from the Yahoo! Finance platform, either through an API request (using `yfinance`) or by parsing a manually sourced local CSV file.

#### Data:

`YahooFinance.data` is where the ingested data is stored.
It takes the form of a `pandas.DataFrame` with a `datetime` index.


From this data structure, the data we have **and are interested in** is the following:

* `Close` -- the closing price of MSFT stock
* `Volume` -- the number of MSFT shares traded on any given day

There may be other financial metrics of interest, in which case we might shop around for other APIs (such as AlphaVantage) to obtain metrics such as moving averages, long/short interest and other technical indicators.

In [2]:
class YahooFinance:
    """Closing price, opening, trading volumes, daily averages and other pricing value for a given ticker.
    
    **Using API**
    
    (DEFAULT)
    
    Set `_local=False` in constructor to indicate we are sourcing the data from the `yfinance` API.
    You can modify the symbols (stock tickers) which can be fetched with the API call.
    
    **Using local file**
    
    Set `_local=True`.
    
    Only market data for $MSFT is downloaded onto a local file at `data/raw/yfinance_MSFT.csv`.
    This data will still be loaded and written to a feather file format.
    The feather file format has more efficient read/write in terms of memory usage, but also speed.
    """
    
    RAW_PATH = "data/raw/yfinance_MSFT.csv"
    SAVE_PATH = "data/store/market_data"
    
    def __init__(self, _local=False, _savelocal=True, _symbols="MSFT"):
        self.local = _local
        self.savelocal = _savelocal
        self.symbols = _symbols
        self.data = None  # populated by `load` or `fetch`
    
    def load(self):
        if not self.savelocal:
            print("Permission to load files from cold storage denied, instead try using `fetch`.")
            return 1
        
        if self.local:
            print("Files loaded from: " + self.RAW_PATH)
            # parse the first column as dates so the index matches the API result
            data = pd.read_csv(self.RAW_PATH, index_col=0, parse_dates=True)
        else:
            print("Data requested from Yahoo! Finance API (via yfinance)")
            data = yf.download(self.symbols, start="2017-04-01", end="2021-05-01")
        
        # shared post-processing: float volumes, and a copy of the index kept
        # as a `Date` column so that `fetch` can restore it after the feather read
        data['Volume'] = data['Volume'].values.astype(float)
        data['Date'] = data.index.values
        self.data = data
        
    def write(self):
        if self.savelocal == False:
            print("You do not have the permission to local storage, there is already a file written.")
        else:
            if os.path.exists(self.SAVE_PATH):
                os.remove(self.SAVE_PATH)
            feather.write_feather(self.data, self.SAVE_PATH)
            print("File written at: " + self.SAVE_PATH)
        
    def fetch(self):
        "The datetime index is stored as a `Date` column by `load` and restored here after the feather read."
        if not os.path.exists(self.SAVE_PATH):
            print("No saved data exists, ensure you have first loaded and saved external data locally.")
        else:
            print("Data fetched from: " + self.SAVE_PATH)
            data = feather.read_feather(self.SAVE_PATH)
            data.index = data['Date'].values
            data = data.drop(['Date'], axis=1)
            self.data = data
            return data

### `GoogleTrends`

The class provides methods to source and manipulate data from the Google trends analysis tool.

> This was initially attempted by using the `pytrends` package, a third-party wrapper for the Google trends API, however the data packets received did not match up to the raw data manually sourced from the [Google trends web dashboard](https://trends.google.co.uk/trends/?geo=GB).

Google scores web trends on a scale from 0-100, zero being the least searched/interacted with and 100 garnering the maximum web activity from Google users.

Three trends related to Microsoft Corporation were sourced based on the `Microsoft Corporation` tag:

1. `Search` -- general search trends (somewhat based on the volume of Microsoft-related searches)
2. `News` -- news headline and article metric (analogous to the frequency with which Microsoft is mentioned in news publications)
3. `Financial` -- financial searches related to Microsoft (this trend seems interesting, since one naively assumes that greater interest in Microsoft-related finances should correlate with stock price movement)

#### Data:

`GoogleTrends.data` is the field that stores the data while it is in scope; otherwise, `fetch` can be used to perform a fast I/O read from the local feather store.

Between the sourcing and loading-into-scope stages of the data ingress, we bundle the separate trend dataframes into a single structure which can be indexed.
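
The bundling step can be illustrated on its own. The sketch below assumes three single-column frames that share a date index; the scores are placeholders and `pd.concat` is one possible way to combine them, not necessarily how the class does it.

```python
import pandas as pd

dates = pd.to_datetime(["2021-01-03", "2021-01-10", "2021-01-17"])

# Placeholder scores; each frame mimics one single-column trend export.
search = pd.DataFrame({"score": [50.0, 75.0, 100.0]}, index=dates)
news = pd.DataFrame({"score": [10.0, 20.0, 30.0]}, index=dates)
financial = pd.DataFrame({"score": [5.0, 0.0, 15.0]}, index=dates)

# Align the three series on their shared index and name the columns.
bundle = pd.concat(
    [search["score"], news["score"], financial["score"]],
    axis=1,
    keys=["Search", "News", "Financial"],
)
bundle.index.name = "Date"
```

Because the frames are aligned on the index rather than by position, a missing week in one export would surface as a NaN instead of silently shifting the remaining rows.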

In [3]:
class GoogleTrends:
    """Trend data from Google search engine: general search, financial searches and news for given company name.
    
    If you change the location of the manually sourced data (mostly csv files in 'data/raw'), you must remember
    to appropriately modify the `RAW_PATH` field of the `GoogleTrends` class.
    """
    
    RAW_PATH = "data/raw/"
    SAVE_PATH = "data/store/google_trends" # feather file database path
    
    def __init__(self, _local=True, _savelocal=True):
        "Used to download, store and load Google trend data from API or local sources."
        self.local = _local
        self.savelocal = _savelocal
        self.data = 0
                
    def load(self):
        if self.savelocal == False:
            print("Permission to load files from cold storage denied, instead try using `fetch`.")
            return 1
        
        print("Files loaded from: " + self.RAW_PATH)
        def trend_import(file: str):
            return pd.read_csv(self.RAW_PATH + file, index_col=0)[1:]

        # bundle the three Google trend dataframes into a single structure
        bundle = trend_import("gtrends_search.csv")
        bundle['Search'] = trend_import("gtrends_search.csv").values.astype(float)
        bundle['News'] = trend_import("gtrends_news.csv").values.astype(float)
        bundle['Financial'] = trend_import("gtrends_financial.csv").values.astype(float)
        bundle['Date'] = bundle.index.values
        bundle.index.rename('Date', inplace=True)
        bundle.index = pd.to_datetime(bundle.index)
        del bundle[bundle.columns[0]]
        
        self.data = bundle
    
    def write(self):
        if self.savelocal == False:
            print("Local disk storage is disabled (`_savelocal=False`).")
        else:
            if os.path.exists(self.SAVE_PATH):
                os.remove(self.SAVE_PATH)
            feather.write_feather(self.data, self.SAVE_PATH)
            print("Data written at: " + self.SAVE_PATH)
            
    def fetch(self):
        "Google trends data, fetch from fast I/O file storage (using feather from Apache Arrow)"
        if not os.path.exists(self.SAVE_PATH):
            print("No saved data exists, ensure you have first loaded and saved external data locally.")
        else:
            print("Data fetched from: " + self.SAVE_PATH)
            data = feather.read_feather(self.SAVE_PATH)
            data.index = data['Date'].values
            data = data.drop(['Date'], axis=1)
            data.index = pd.to_datetime(data.index)
            self.data = data
            return self.data

## Data Processing:

General wrangling, cleaning, preprocessing, feature generation and overall preparation for the exploration/visualisation.

Things to do:

1. fill in the gaps (weekends and sampling periods) for both Google trends and stock pricing data
    - generate a daily date index from 01/04/2017 to 01/05/2021 and fill NaN values with the previous non-NaN data point
2. split datasets into train / validate groups

In [4]:
def merger(yahoo, google):
    "Consistent datetime index avoiding NaN values issues for GoogleTrends & YahooFinance data structures."
    
    def linear_df(dff):
        df = pd.DataFrame(
            index=pd.date_range(
                start=str(dff.index.values[0])[0:10],
                end=str(dff.index.values[-1])[0:10]))
        return df
    
    df = linear_df(yahoo)
    df = df.join(google)
    df = df.interpolate(method='time')
    
    df = df.join(yahoo[['Close', 'Volume']])
    df = df.dropna(subset=['Close'])
    
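    # the first row has no earlier sample to interpolate from, so copy the second row;
    # `.loc` avoids chained assignment, which may silently write to a copy
    for col in ['Search', 'News', 'Financial']:
        df.loc[df.index[0], col] = df[col].iloc[1]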
    
    return df

## Data Exploration:

### Static Analysis and Hypothesis Testing

The following hypothesis tests should allow us to gain insight into the statistical relationship between features and stock price.

#### 1. ANOVA Test

We would like to find out whether the samples in question share the same mean.

##### Assumptions:

* Sample observations are independent and identically distributed
* Sample observations are normally distributed
* Sample observations have the same variance

##### Hypothesis:

* __H0__ : All sample means are equal
* __H1__ : At least one sample mean differs from the others

##### Outcome:

The p-value is far below 0.05, therefore the means of the samples for:

* MSFT closing price
* MSFT Google trends search score
* MSFT Google trends financial search score

__are not equal__.
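
To make the test statistic concrete, here is a small hand computation of the one-way ANOVA F statistic: the between-group variance divided by the within-group variance. The three groups are made-up numbers, not the MSFT data, and the notebook itself relies on `scipy.stats.f_oneway` rather than this code.

```python
import numpy as np

# Three toy samples (placeholder values).
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Variation of the group means around the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)
```

For these numbers the statistic comes out at roughly 13: the group means (2, 3 and 6) are spread much further apart than the values within each group, which is the situation a small p-value signals.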


#### 2. Friedman Test

We will perform a statistical test to determine whether two or more paired samples (MSFT closing price, MSFT Google trends search score and MSFT Google trends financial score) follow the same distribution.

##### Assumptions:

* Sample observations are independent and identically distributed
* Sample observations can be ranked
* Observations across samples are paired

##### Hypothesis:

* __H0__ : The distributions of all samples are equal
* __H1__ : The distributions of one or more samples are not equal

##### Outcome:

The p-value is far below 0.05, therefore the distributions of the samples for:

* MSFT closing price
* MSFT Google trends search score
* MSFT Google trends financial search score

__are not equal__.
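
The "ranked" and "paired" assumptions are easier to see in a hand computation. The sketch below ranks the values within each paired row and applies the textbook Friedman formula for tie-free data; the table is made up, and the notebook uses `scipy.stats.friedmanchisquare` instead.

```python
import numpy as np

# Rows are paired observations (e.g. days); columns are the samples compared.
table = np.array([
    [1.0, 5.0, 9.0],
    [2.0, 6.0, 7.0],
    [0.5, 3.0, 8.0],
    [1.5, 4.0, 6.0],
])
n, k = table.shape

# Rank within each row (1 = smallest). Applying argsort twice yields ranks,
# which is valid here only because no row contains ties.
ranks = table.argsort(axis=1).argsort(axis=1) + 1
rank_sums = ranks.sum(axis=0)

# Friedman chi-square statistic for data without ties.
q_stat = 12.0 / (n * k * (k + 1)) * (rank_sums ** 2).sum() - 3.0 * n * (k + 1)
```

Every row orders the columns the same way, so the rank sums are 4, 8 and 12 and the statistic reaches 8, the largest value possible for four rows and three columns. Only the ordering within a row matters, so the differing scales of the columns play no role.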

In [30]:
class StaticAnalysis:
    "Statistical hypothesis tests."
    
    def __init__(self, _data):
        self.data = _data
        self.anova = None
        self.friedman = None
        self.dickey_fuller = None
        
    def friedman_test(self):
        s, p = friedmanchisquare(self.data['Close'],
                                 self.data['Search'],
                                 self.data['Financial'])
        self.friedman = p
        print('stat=%.3f, p=%.3f' % (s, p))
        if p > 0.05:
            print('H0: Probably the same distribution')
        else:
            print('H1: Probably different distributions')
            
    def anova_test(self):
        s, p = f_oneway(self.data['Close'],
                        self.data['Search'],
                        self.data['Financial'])
        self.anova = p
        print('stat=%.3f, p=%.3f' % (s, p))
        if p > 0.05:
            print('H0: Probably the same distribution')
        else:
            print('H1: Probably different distributions')
            
    def dickey_fuller_test(self):
        test = adfuller(self.data['Close'])
        s = test[0]  # ADF test statistic
        p = test[1]  # p-value
        self.dickey_fuller = p
        print('stat=%.3f, p=%.3f' % (s, p))
        if p > 0.05:
            print('H0: Time series is non-stationary, has time-dependent structure.')
        else:
            print('H1: Time series is stationary, no time-dependent structure.')

### Time Series Analysis

We will decompose our timeseries data using statistical methods.

These should help us get a better understanding of what type of signals drive the behaviour of the closing stock prices.
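
As a rough intuition for what an additive decomposition does, the sketch below separates a synthetic series into trend, seasonal and residual parts using a centred moving average. It is a simplified illustration under stated assumptions (an odd period and a noise-free signal), not the routine used by `seasonal_decompose`.

```python
import numpy as np

period = 5  # odd, so a simple centred window of `period` points works
t = np.arange(40)
pattern = np.array([2.0, -1.0, 0.0, -2.0, 1.0])  # zero-mean seasonal shape
observed = 0.5 * t + pattern[t % period]         # linear trend + seasonality

# Trend: centred moving average over exactly one period.
kernel = np.ones(period) / period
trend = np.convolve(observed, kernel, mode="valid")

# Drop the edges that the moving average cannot cover.
half = period // 2
detrended = observed[half:-half] - trend
phases = t[half:-half] % period

# Seasonal component: average detrended value at each phase of the cycle.
seasonal = np.array([detrended[phases == p].mean() for p in range(period)])
seasonal -= seasonal.mean()

# Whatever is left after removing trend and seasonality.
residual = detrended - seasonal[phases]
```

On this noise-free input the recovered seasonal shape matches `pattern` and the residual is numerically zero. With real price data the residual carries everything the other two components fail to explain.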

In [6]:
def automatic_tsd(df):
    "Automatic time series decomposition."
    close_result = seasonal_decompose(df['Close'], model='additive',period=50,extrapolate_trend='freq')
    search_result = seasonal_decompose(df['Search'], model='additive',period=20,extrapolate_trend='freq')
    news_result = seasonal_decompose(df['News'], model='additive',period=20,extrapolate_trend='freq')
    financial_result = seasonal_decompose(df['Financial'], model='additive',period=20,extrapolate_trend='freq')
    
    def tsd_plot(df, result, savefile):
        # one row per component; a fifth row would render as an empty subplot
        fig = make_subplots(rows=4, cols=1,
                            subplot_titles=("Observed","Trend","Seasonality","Residual"))
        fig.add_trace(go.Scatter(x=df.index, y=result.observed, name="Observed"),
                      row=1, col=1)
        fig.add_trace(go.Scatter(x=df.index, y=result.trend, name="Trend"),
                      row=2, col=1)
        fig.add_trace(go.Scatter(x=df.index, y=result.seasonal, name="Seasonality"),
                      row=3, col=1)
        fig.add_trace(go.Scatter(x=df.index, y=result.resid, name="Residual"),
                      row=4, col=1)
        fig.update_layout(autosize=False,
                          width=800,
                          height=800,)
        print("Figure saved to file at: " + savefile)
        fig.write_image(savefile)
    
    tsd_plot(df, close_result, "images/tsd_price.png")
    tsd_plot(df, search_result, "images/tsd_search.png")
    tsd_plot(df, news_result, "images/tsd_news.png")
    tsd_plot(df, financial_result, "images/tsd_financial.png")

### Visualizations

#### Google Trends Signal Smoothing

In order to smoothly merge the Google trends date-indexed score with the financial data from Yahoo! Finance, we needed to convert a discrete and sparse timeseries dataset into a daily-timeseries dataset, which we would then be able to merge with the Yahoo! dataframe.

In the end, we used a floodfill-like technique: each known trend score is extended chronologically over the following dates that hold NaN values, until the next sample value is reached and becomes the new filler value.
This results in a timeseries trend signal which displays *plateaus* where the known values were 'flooded'.

The function `plot_trend_filling` showcases an example of this floodfilling effect, compared with the original sparse signal.

A copy of the figure produced is saved at: `images/google_trends_filling_effect.png` for viewing.
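
The plateau effect is easy to reproduce on a toy series. The sketch below uses made-up scores and dates to contrast a forward fill, which produces the plateaus described above, with time-based linear interpolation. Note that `merger` as written calls `interpolate(method='time')`, which joins samples with straight lines rather than holding the previous value.

```python
import pandas as pd

# Two sparse samples three days apart (placeholder values).
sparse = pd.Series(
    [10.0, 40.0],
    index=pd.to_datetime(["2021-01-03", "2021-01-06"]),
)

# Expand onto a daily index; the new days start out as NaN.
daily_index = pd.date_range("2021-01-03", "2021-01-06", freq="D")
daily = sparse.reindex(daily_index)

filled = daily.ffill()                           # 'floodfill': hold the last known value
interpolated = daily.interpolate(method="time")  # straight line between samples
```

Here `filled` holds 10 until the next sample arrives, while `interpolated` ramps through 20 and 30 on the way to 40.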

In [7]:
def plot_trend_filling(df_merged, df_gtrends):
    "Plot of the effect of 'filling-in' trend data to fit linearly over datetime index range."
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df_merged.index, y=df_merged['Search'],
                             mode='lines',
                             name='Filled'))
    fig.add_trace(go.Scatter(x=df_gtrends.index, y=df_gtrends['Search'],
                             mode='lines',
                             name='Original'))
    fig.update_layout(title="Google Search Trend Data for 'Microsoft Corporation' - Discrete vs. 'Filled'",
                      xaxis_title='Date',
                      yaxis_title='Score')
    fig.write_image("images/google_trends_filling_effect.png")
    print("Figure saved to file at: images/google_trends_filling_effect.png")

#### Feature Correlation Heatmap

No machine learning or data science task would be complete without one.
This quintessential visualization technique aids feature selection: if an input feature correlates well with the target output, it is more likely that meaningful patterns can be learned from that feature.

`plot_correlation_heatmap` generates the heatmap figure, whilst `correlation_matrix` calculates the 2D-array of correlation values.

The heatmap figure can be found at: `images/closing_price_feature_correlation.png`
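
For reference, each cell of such a heatmap is a Pearson correlation coefficient between two columns. The toy arrays below are placeholders chosen so that the expected values are obvious, and `np.corrcoef` is used here only for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x   # moves exactly with x
z = x[::-1]   # moves exactly against x

# Each input row is treated as one variable; the result is a 3x3 matrix.
matrix = np.corrcoef([x, y, z])
```

The diagonal is 1 (every feature correlates perfectly with itself), `matrix[0, 1]` is 1 and `matrix[0, 2]` is -1. The matrix is symmetric, so half of a correlation heatmap is always redundant.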

In [8]:
def cross_correlation_analysis(df):
    close = np.array(df['Close'].values)
    search = np.array(df['Search'].values)
    news = np.array(df['News'].values)
    financial = np.array(df['Financial'].values)
    
    cc_search = sm.tsa.stattools.ccf(search, close, adjusted=False)
    cc_news = sm.tsa.stattools.ccf(news, close, adjusted=False)
    cc_financial = sm.tsa.stattools.ccf(financial, close, adjusted=False)
    
    # `ccf` returns one value per lag (0, 1, 2, ...), so plot against lag rather than date
    lags = np.arange(len(cc_search))
    
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=lags, y=cc_search,
                             mode='lines',
                             name='Search-Close'))
    fig.add_trace(go.Scatter(x=lags, y=cc_news,
                             mode='lines',
                             name='News-Close'))
    fig.add_trace(go.Scatter(x=lags, y=cc_financial,
                             mode='lines',
                             name='Financial-Close'))
    fig.update_layout(title="Closing Price & Features Cross-Correlation Pairs",
                      xaxis_title='Lag (samples)',
                      yaxis_title='Cross Correlation')
    fig.write_image("images/ts_cross_correlation.png")
    print("Figure saved to file at: images/ts_cross_correlation.png")

In [9]:
def correlation_matrix(df):
    "Pairwise Pearson correlations between the five features, as a list of rows."
    features = ['Close', 'Volume', 'Search', 'News', 'Financial']
    return df[features].corr().values.tolist()

In [10]:
def plot_correlation_heatmap(df):
    "Timeseries feature correlations to MSFT closing price."
    features = ['Close', 'Volume', 'Search', 'News', 'Financial']
    matrix = correlation_matrix(df)
    
    fig = go.Figure(data=go.Heatmap(z=matrix,
                                    x=features,
                                    y=features,
                                    colorscale='Viridis'))
    fig.update_layout(title='MSFT Stock Price Timeseries Feature Correlation Map')
    fig.write_image("images/closing_price_feature_correlation.png")
    print("Figure saved to file at: images/closing_price_feature_correlation.png")
    return fig

In [11]:
def moving_average(df):
    "A visual look at the moving average (MA) behaviour of MSFT"
    # rolling means keep the full index; the first `window - 1` values are NaN and are not drawn
    short_rolling = df['Close'].rolling(window=20).mean()
    long_rolling = df['Close'].rolling(window=80).mean()
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df.index, y=df['Close'],
                         mode='lines',
                         name='Close'))
    fig.add_trace(go.Scatter(x=df.index, y=short_rolling,
                         mode='lines',
                         name='Short MA'))
    fig.add_trace(go.Scatter(x=df.index, y=long_rolling,
                         mode='lines',
                         name='Long MA'))
    fig.update_layout(title='MSFT Closing Price with Long and Short Moving Average',
                 xaxis_title='Date',
                 yaxis_title='Price ($)')
    
    fig.write_image("images/moving_average.png")
    print("Figure saved to file at: images/moving_average.png")

### Data Preparation for Models
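
The cells below combine two ideas: scaling with statistics taken from the training split only, and cutting the series into fixed-length look-back windows. The sketch here shows both on a toy series. The helper `scale`, the hold-out size and the window length are illustrative choices, and `sliding_window_view` (NumPy 1.20 or later) yields every full window, one more than the loop in `prep_data`, which stops at `len(data) - window`.

```python
import numpy as np

series = np.arange(10, dtype=float)     # placeholder "prices" 0..9
train, test = series[:-3], series[-3:]  # hold out the last 3 points

# Fit the min-max range on the training split only.
lo, hi = train.min(), train.max()

def scale(values):
    """Map the training range [lo, hi] onto [-1, 1]."""
    return 2.0 * (values - lo) / (hi - lo) - 1.0

scaled = np.concatenate((scale(train), scale(test)))

# Each window holds `window - 1` inputs followed by one target.
window = 4
windows = np.lib.stride_tricks.sliding_window_view(scaled, window)
x = windows[:, :-1]
y = windows[:, -1]
```

The held-out values fall outside [-1, 1] because they exceed anything seen during fitting. That is the price of not leaking test information into the scaler.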

In [12]:
def split_scale_data(data, only_price=True):
    "Fit the scaler on the training split only, then apply it to the test split, to avoid leaking information forward."
    scaler = MinMaxScaler(feature_range=(-1, 1))
    
    train = data[:-21]
    test = data[-21:]
    
    if only_price == True:
        train = scaler.fit_transform(train.reshape(-1, 1))
        test = scaler.transform(test.reshape(-1, 1))
    else:
        train = scaler.fit_transform(train.reshape(-1, 4))
        test = scaler.transform(test.reshape(-1, 4))
    
    return (np.concatenate((train, test)), scaler)

In [13]:
def prep_data(df, window, only_price=True):
    "Pre-model data preparation, behaviour defined by whether using `only_price` as feature or multivariate."
    if only_price == True:
        data, scaler = split_scale_data(np.array(df['Close'], dtype='float'),
                                        only_price)
    else:
        # scale each feature independently, only keep scaler values for price data for la
        data, scaler = split_scale_data(np.array(df[['Close','Search','News','Financial']], dtype='float'),
                                        only_price)

    windowed_data = []
    
    for i in range((len(data) - window)):
        windowed_data.append(data[i:i+window])
    
    windowed_data = np.array(windowed_data, dtype='float')
    
    x_train = windowed_data[:-21, :-1, :]
    x_test = windowed_data[-21:, :-1, :]
    
    # the target is the last step of each window; the slicing is the same in both modes
    y_train = windowed_data[:-21, -1, :]
    y_test = windowed_data[-21:, -1, :]
    
    return (x_train, y_train, x_test, y_test, scaler)

## Model Definitions:

In [14]:
class LSTM(nn.Module):
    "LSTM object, generates lstm model with pytorch graph, variable geometry."
    def __init__(self, inputs, hiddens, layers, outputs):
        super(LSTM, self).__init__()
        self.hiddens = hiddens
        self.layers = layers
        self.lstm = nn.LSTM(inputs, hiddens, layers, batch_first=True)
        self.fc = nn.Linear(hiddens, outputs)

    def forward(self, x):
        # zero-initialised hidden and cell states; these are constants, not learned parameters
        h0 = torch.zeros(self.layers, x.size(0), self.hiddens)
        c0 = torch.zeros(self.layers, x.size(0), self.hiddens)
        out, (hn, cn) = self.lstm(x, (h0, c0))
        out = self.fc(out[:, -1, :])
        return out

### Training Loop

In [15]:
def train(model, x_train, y_train, loss_function, optimiser, epochs, window):
    history = np.zeros(epochs)    
    for epoch in range(epochs):
        # forward pass
        y_train_pred = model(x_train)

        loss = loss_function(y_train_pred, y_train)
        if epoch % 10 == 0:
            print("Epoch ", epoch, "MSE: ", loss.item())
        history[epoch] = loss.item()
        
        # clear the gradients accumulated in the previous iteration
        optimiser.zero_grad()
        # backward pass
        loss.backward()
        # parameter update
        optimiser.step()
        
    return history

In [16]:
def plot_training_loss(history, epochs, title):
    fig = go.Figure()
    # one x value per epoch; `np.linspace(0, epochs)` would return 50 points regardless of `epochs`
    fig.add_trace(go.Scatter(x=np.arange(epochs), y=history,
                         mode='lines',
                         name='Loss'))
    fig.update_layout(title='LSTM Training Performance',
             xaxis_title='Epoch',
             yaxis_title='Loss')
    
    fig.write_image("images/"+title)
    print("Figure saved to file at: images/"+title)

In [17]:
def train_univariate_model(mixed_data, window=10, epochs=100):
    xtr, ytr, xte, yte, scaler = prep_data(mixed_data, window, True) # univariate data prep
    
    x_train = torch.from_numpy(xtr).type(torch.Tensor)
    x_test = torch.from_numpy(xte).type(torch.Tensor)
    y_train = torch.from_numpy(ytr).type(torch.Tensor)
    y_test = np.array(mixed_data['Close'], dtype='float')[-21:]
    
    model = LSTM(inputs=1, hiddens=32, layers=2, outputs=1)
    loss_function = torch.nn.MSELoss()
    optimiser = torch.optim.Adam(model.parameters(), lr=0.01)
    
    hist = train(model, x_train, y_train, loss_function, optimiser, epochs, window)
    plot_training_loss(hist, epochs, "univariate_model_training.png") # training performance plot
    
    return (model, scaler, x_test, y_test, x_train, y_train)

In [18]:
def train_multivariate_model(mixed_data, window=10, epochs=100):
    xtr, ytr, xte, yte, scaler = prep_data(mixed_data, window, False) # multivariate data prep
    
    x_train = torch.from_numpy(xtr).type(torch.Tensor)
    x_test = torch.from_numpy(xte).type(torch.Tensor)
    y_train = torch.from_numpy(ytr).type(torch.Tensor)
    y_test = torch.from_numpy(yte).type(torch.Tensor)
    
    model = LSTM(inputs=4, hiddens=32, layers=2, outputs=4)
    loss_function = torch.nn.MSELoss()
    optimiser = torch.optim.Adam(model.parameters(), lr=0.01)
    
    hist = train(model, x_train, y_train, loss_function, optimiser, epochs, window)
    plot_training_loss(hist, epochs, "multivariate_model_training.png") # training performance plot
    
    return (model, scaler, x_test, y_test, x_train, y_train)

## Model Evaluation

### Performance Visualization

In [19]:
def plot_evaluation(df, df_result, title):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df_result.index, y=df_result['Reference'], name='Target',
                             line=dict(width=2)))
    fig.add_trace(go.Scatter(x=df_result.index, y=df_result['Predicted'], name='Prediction',
                             line=dict(width=2, dash='dot')))
    fig.update_layout(title='LSTM Inference Result',
             xaxis_title='Date',
             yaxis_title='Closing Price ($)')
    
    fig.write_image("images/"+title)
    print("Figure saved to file at: images/"+title)

In [20]:
def plot_focused_evaluation(df, df_result, title):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df_result.index[-40:], y=df_result['Reference'][-40:], name='Target',
                             line=dict(width=2)))
    fig.add_trace(go.Scatter(x=df_result.index[-40:], y=df_result['Predicted'][-40:], name='Prediction',
                             line=dict(width=2, dash='dot')))
    fig.update_layout(title='LSTM Inference Result',
             xaxis_title='Date',
             yaxis_title='Closing Price ($)')
    
    fig.write_image("images/"+title)
    print("Figure saved to file at: images/"+title)

### Evaluation
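
Both evaluation routines undo the scaling before computing the error, so that the RMSE is reported in price units. The sketch below uses placeholder numbers and hypothetical helpers (`rmse`, `inverse_scale`) to show the relationship between the error in scaled units and the error in original units for a (-1, 1) min-max scaling.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

lo, hi = 100.0, 300.0  # placeholder limits of the training range

def inverse_scale(scaled):
    """Map [-1, 1] back onto [lo, hi]."""
    return (np.asarray(scaled) + 1.0) / 2.0 * (hi - lo) + lo

target_scaled = np.array([-1.0, 0.0, 1.0])
pred_scaled = np.array([-0.9, 0.1, 0.9])

rmse_scaled = rmse(target_scaled, pred_scaled)
rmse_price = rmse(inverse_scale(target_scaled), inverse_scale(pred_scaled))
```

Because the scaling is linear, the error in price units is the scaled error multiplied by `(hi - lo) / 2`: here 0.1 becomes 10. A training loss that looks tiny in scaled units can therefore still correspond to a sizeable error in dollars.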

In [21]:
def evaluate_univariate(df, model, x_train, x_test, y_train, y_test, scaler):
    y_pred = model(x_test)
    y_pred = scaler.inverse_transform(y_pred.detach().numpy())
    y_test = y_test.reshape(-1, 1)
    
    train_pred = model(x_train)
    train_pred = scaler.inverse_transform(train_pred.detach().numpy())
    train = scaler.inverse_transform(y_train.detach().numpy())
    
    rmse_test = math.sqrt(mean_squared_error(y_test[:, 0], y_pred[:, 0]))
    rmse_train = math.sqrt(mean_squared_error(train[:, 0], train_pred[:, 0]))
    
    print('Train: %.2f RMSE' % (rmse_train))
    print('Test: %.2f RMSE' % (rmse_test))
    
    result = pd.DataFrame(index=df.index[10:])
    result['Reference'] = np.concatenate((train[:,0], y_test[:,0]))
    result['Predicted'] = np.concatenate((train_pred[:,0], y_pred[:,0]))
    
    plot_evaluation(df, result, "univariate_overall_evaluation.png")
    plot_focused_evaluation(df, result, "univariate_forecast_evaluation.png")
    
    return result

In [22]:
def evaluate_multivariate(df, model, x_train, x_test, y_train, y_test, scaler):
    # select only price data point for eval metrics
    y_pred = model(x_test)
    y_pred = scaler.inverse_transform(y_pred.detach().numpy())
    y_test = scaler.inverse_transform(y_test.detach().numpy())
    
    train_pred = model(x_train)
    train_pred = scaler.inverse_transform(train_pred.detach().numpy())
    train = scaler.inverse_transform(y_train.detach().numpy())
    
    rmse_test = math.sqrt(mean_squared_error(y_test[:, 0], y_pred[:, 0]))
    rmse_train = math.sqrt(mean_squared_error(train[:, 0], train_pred[:, 0]))
    
    print('Train: %.2f RMSE' % (rmse_train))
    print('Test: %.2f RMSE' % (rmse_test))
    
    result = pd.DataFrame(index=df.index[10:])
    result['Reference'] = np.concatenate((train[:,0], y_test[:,0]))
    result['Predicted'] = np.concatenate((train_pred[:,0], y_pred[:,0]))
    
    plot_evaluation(df, result, "multivariate_overall_evaluation.png")
    plot_focused_evaluation(df, result, "multivariate_forecast_evaluation.png")
    
    return result

## Main Program

In [37]:
def main():
    
    # Generate data ingress objects -- price and auxiliary data sources
    yahoo_finance = YahooFinance(False, True)
    google_trends = GoogleTrends(True, True)
    
    # Load data from: API or local cold-storage (csv) and bring data into scope
    yahoo_finance.load()
    google_trends.load()
    
    # Write price and auxiliary data from object scope to local hot-storage (fast and efficient read/write)
    yahoo_finance.write()
    google_trends.write()
    
    # Now, to demonstrate that the data is well and truly stored and not simply loaded in from scope
    # we create two new data type objects, both without `load`, `write` permissions
    yahoo_finance = 0 # forcefully 'kill' old data objects
    google_trends = 0
    
    # New data type objects
    yahoo = YahooFinance(False, False)
    google = GoogleTrends(True, False)
    
    # Fetch the data from hot-storage (from Apache Arrow, feather file format)
    yahoo.fetch()
    google.fetch()
    
    # Create two datasets; one for the price-only model1, the other for the auxiliary features model2
    # Merge dataset, 'floodfill' applied to Google trends data
    price_data = yahoo.data['Close']
    mixed_data = merger(yahoo.data, google.data)
    
    # Data exploration
    # You can find all locally saved figures in `images/`
    
    # Google trends 'floodfill'
    plot_trend_filling(mixed_data, google.data)
    
    # Feature correlation heatmap
    plot_correlation_heatmap(mixed_data)
    
    # Statistical analysis
    static_analysis = StaticAnalysis(mixed_data)
    static_analysis.anova_test() # mean hypothesis
    static_analysis.friedman_test() # distribution hypothesis
    
    # Time series decomposition analysis
    # breaks each signal down into trend, seasonal and residual components (plotted with the observed series)
    automatic_tsd(mixed_data)
    
    # Model training:
    # calls to data processing occurs within model training routine
    window = 10 # size of look-back LSTM window
    model_u, scaler_u, xte_u, yte_u, xtr_u, ytr_u = train_univariate_model(mixed_data, window)
    model_m, scaler_m, xte_m, yte_m, xtr_m, ytr_m = train_multivariate_model(mixed_data, window)
    
    # Model evaluation:
    evaluate_univariate(mixed_data, model_u, xtr_u, xte_u, ytr_u, yte_u, scaler_u)
    evaluate_multivariate(mixed_data, model_m, xtr_m, xte_m, ytr_m, yte_m, scaler_m)

In [35]:
main()

Data requested from Yahoo! Finance API (via yfinance)
[*********************100%***********************]  1 of 1 completed
Files loaded from: data/raw/
File written at: data/store/market_data
Data written at: data/store/google_trends
Data fetched from: data/store/market_data
Data fetched from: data/store/google_trends
Figure saved to file at: images/google_trends_filling_effect.png
Figure saved to file at: images/closing_price_feature_correlation.png
stat=2245.219, p=0.000
H1: Probably different distributions
stat=1831.266, p=0.000
H1: Probably different distributions
Figure saved to file at: images/tsd_price.png
Figure saved to file at: images/tsd_search.png
Figure saved to file at: images/tsd_news.png
Figure saved to file at: images/tsd_financial.png
Epoch  0 MSE:  0.3304557800292969
Epoch  10 MSE:  0.020251963287591934
Epoch  20 MSE:  0.01966392807662487
Epoch  30 MSE:  0.010228545404970646
Epoch  40 MSE:  0.004828971344977617
Epoch  50 MSE:  0.003193236654624343
Epoch  60 MSE:  0.0