# Predicting GOOGL Stock Closing Price on a Time Series (with PyQrack quantum associative memory)

"PyQrack" is a pure Python language standard wrapper for the (C++11) Qrack quantum computer simulator library. PyQrack exposes a "quantum neuron" called "`QrackNeuron`." (Its API reference is [here](https://pyqrack.readthedocs.io/en/latest/autoapi/pyqrack/qrack_neuron/index.html).) We'd like to model a simple data set to achieve a proof-of-concept of using `QrackNeuron`.

In [1]:
!pip install pyqrack



## Overview

To model time-series data (e.g., stock closing price, climate data, biological rhythms, infrastructure load), we can start with ordinary least squares (OLS) regression, with point in time as the only predictor, to isolate any dominant "straight-line" trend over time. That hopefully explains a significant part of the variance. (e.g., "The major industrial stock market indices tend to return ~7% APR, on average.") Then, we **subtract the OLS prediction, and then the mean of the remaining residuals, as a baseline**. Then, we bin the data by training set quantile and apply the _**quantum Fourier transform (QFT)**_ to the time index. If we have time series data on a perfectly regular cadence, the QFT just produces _**uniform superposition**._ However, this is _uniform superposition_ that represents _low-frequency oscillation periods_ in the data. Then, `QrackNeuron` can _infer_ connections between _dependent variables_ and _frequency._ (e.g., "Boom/bust cycles and deep market corrections tend to happen every 7-to-11 years, roughly," or, "There is an element of semi-reliable _annual and quarterly seasonality_ to market sentiment and behavior.") Finally, we combine the inference model, OLS, and mean-value behavior back together to get overall predictions.
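As a rough illustration of the decomposition described above, here is a minimal sketch on synthetic data. The series, its coefficients, and all variable names are made up for illustration, and the "periodic model" is only a placeholder that echoes the centered residuals, so the sketch shows how the pieces add back up rather than how `QrackNeuron` predicts anything.

```python
import numpy as np

# Synthetic stand-in for a price series: linear trend plus one slow cycle (illustrative only).
t = np.arange(1000, dtype=float)
price = 20.0 + 0.05 * t + 3.0 * np.sin(2 * np.pi * t / 250.0)

# Step 1: fit and subtract the straight-line (OLS) trend.
slope, intercept = np.polyfit(t, price, deg=1)
trend = slope * t + intercept
residual = price - trend

# Step 2: subtract the residual mean, so the periodic model sees zero-centered data.
residual_mean = residual.mean()
centered = residual - residual_mean

# Step 3 (placeholder): a periodic model would predict `centered`.
# Here we reuse it directly, just to show the recombination.
periodic_pred = centered
reconstructed = periodic_pred + residual_mean + trend
```

On the training set, the residual mean of an OLS fit with an intercept is already close to zero, so the second subtraction mostly matters once other steps (such as skipping a poorly fitting trend) change the baseline.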

## Pre-processing

First, load the data set into a `pandas` dataframe.

In [2]:
import math
import random
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split


all_data = pd.read_csv('stonks/all_stocks_2006-01-01_to_2018-01-01.csv')
all_data['Date'] = pd.to_datetime(all_data['Date'], format='%Y-%m-%d').apply(lambda x:x.toordinal())
all_data['CompanyID'] = all_data['Name'].astype('category').cat.codes

print(all_data.head())
print("Number of observations: ", all_data.shape[0])

train, test = train_test_split(all_data.loc[all_data['Name'] == 'GOOGL'].dropna(), shuffle=False)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

     Date   Open   High    Low  Close   Volume Name  CompanyID
0  732314  77.76  79.35  77.24  79.11  3117200  MMM         19
1  732315  79.49  79.49  78.25  78.71  2558000  MMM         19
2  732316  78.41  78.65  77.56  77.99  2529500  MMM         19
3  732317  78.64  78.90  77.64  78.63  2479500  MMM         19
4  732320  78.50  79.83  78.46  79.02  1845600  MMM         19
Number of observations:  93612


Separate the dependent and independent variables.

In [3]:
features = ['Date']
# dependents = ['Open', 'High', 'Low', 'Close', 'Volume' ]
dependents = ['Close' ]

X = train[features]
y = train[dependents]

X_test = test[features]
y_test = test[dependents]

y_mean = y.mean()
y -= y_mean
y_test -= y_mean

As a baseline, our first choice to model most data sets, at least in an _exploratory_ capacity, might be linear regression.

In [4]:
from sklearn import linear_model

regr = []
for i in range(len(dependents)):
    regr.append(linear_model.LinearRegression())
    regr[i].fit(X, y[dependents[i]])
    pd.DataFrame(zip(X.columns, regr[i].coef_))

y_pred = [r.predict(X) for r in regr]
y_proj = [r.predict(X_test) for r in regr]
sst = [0.0] * len(regr)
ssr = [0.0] * len(regr)
for i in range(len(dependents)):
    dependent = dependents[i]
    for j in range(len(y_proj[i])):
        sst[i] += y_test[dependent][j] * y_test[dependent][j]
        ssr[i] += (y_test[dependent][j] - y_proj[i][j]) * (y_test[dependent][j] - y_proj[i][j])
    print("Variable: ", dependents[i])
    print("Linear regression (OLS) validation R^2: ", 1 - ssr[i] / sst[i])

Variable:  Close
Linear regression (OLS) validation R^2:  0.6959668592970805
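
Note that the R^2 above is computed on values that were centered on the _training_ mean, so the baseline being compared against is "always predict the training mean," not the validation mean. A minimal vectorized sketch of the same calculation follows; the helper name `out_of_sample_r2` and the toy numbers are hypothetical.

```python
import numpy as np

def out_of_sample_r2(y_true_centered, y_pred_centered):
    """R^2 against the baseline 'always predict the training mean' (zero after centering)."""
    y_true = np.asarray(y_true_centered, dtype=float)
    y_pred = np.asarray(y_pred_centered, dtype=float)
    ss_tot = np.sum(y_true ** 2)             # squared error of the zero (training-mean) baseline
    ss_res = np.sum((y_true - y_pred) ** 2)  # squared error of the model
    return 1.0 - ss_res / ss_tot

# Toy checks with made-up numbers:
perfect = out_of_sample_r2([1.0, -2.0, 3.0], [1.0, -2.0, 3.0])  # exact predictions
baseline = out_of_sample_r2([1.0, -2.0, 3.0], [0.0, 0.0, 0.0])  # same as the baseline
```

A model that does worse than the training-mean baseline gets a negative score under this definition, which is why later cells check whether the OLS R^2 is positive before using the trend.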


Ideally, we'd like to improve on the goodness-of-fit of linear regression by combining it with PyQrack's `QrackNeuron`. To start, we want to eliminate the first-order "straight-line" component of time dependence, as estimated by linear regression.

In [5]:
yp = y.copy()
yp_test = y_test.copy()

for i in range(len(dependents)):
    if (1 - ssr[i] / sst[i]) <= 0:
        continue

    dependent = dependents[i]
    for j in range(len(y_pred[i])):
        yp.at[j, dependent] -= y_pred[i][j]
    for j in range(len(y_proj[i])):
        yp_test.at[j, dependent] -= y_proj[i][j]

yp_mean = yp.mean()

for i in range(len(dependents)):
    if (1 - ssr[i] / sst[i]) <= 0:
        yp_mean[dependents[i]] = 0

yp -= yp_mean
yp_test -= yp_mean

`QrackNeuron` can only work with discrete, binary data. To model this or any data set, we have to reduce it to a simple, discrete, binary form.

We'll try to model the data set via "(quantum) associative memory." There are several statistical considerations, to avoid overfitting.

Firstly, each possible discretized independent variable permutation input trains an independent parameter of a `QrackNeuron`. If a `QrackNeuron` has never seen a specific, exact permutation of input bits, it has no information about them at all, so its prediction defaults to "maximal superposition" (i.e., a totally random guess). Therefore, when we discretize our independent variables, we'd like to keep the number of possible distinct inputs significantly smaller than the number of observation rows.

Satisfying the first consideration, we secondly discretize our dependent variable to have exactly as many possible discrete values as possible distinct inputs. (We guess that this loses the least information about the dependent variable, while we still have enough observations to fully train our network.)

Thirdly, our learning rate should just barely "saturate" the learned parameters of our (quantum) associative memory. A learning volatility parameter ("`eta`") of `1/2` "fully trains" one parameter of a `QrackNeuron` between input qubits and output qubit, on average. This implies that we might set `eta` to `1/2` times the number of distinct input bins (`2` to the power of input qubits, summed across all predictors), divided by the number of observations.
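
To make these considerations concrete, here is the arithmetic for this notebook's settings as a small sketch. The `2264` training rows are taken from the training cell's output later in the notebook; the variable names are hypothetical and chosen not to clash with the notebook's own.

```python
in_qubits_per_predictor = [5]  # one predictor (Date), 5 input qubits
n_observations = 2264          # training rows, per the training output in this notebook

# Distinct discretized inputs: 2**5 = 32 date bins.
distinct_inputs = sum(1 << q for q in in_qubits_per_predictor)

# Average number of training rows that land in each input bin: 70.75.
rows_per_input = n_observations / distinct_inputs

# Learning rate such that each parameter is "just barely" saturated: about 0.0071.
eta_example = 0.5 * distinct_inputs / n_observations

print(distinct_inputs, rows_per_input, eta_example)
```

With roughly 70 observations per input bin, the first consideration (far fewer distinct inputs than rows) is comfortably met.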

In [6]:
in_qubit_counts = [5]
out_qubit_counts = [5]

in_tot_qubits = sum(in_qubit_counts)
in_bin_counts = [(1 << i) for i in in_qubit_counts]
in_tot_bins = sum(in_bin_counts)
in_qubits = list(range(in_tot_qubits))
out_tot_qubits = sum(out_qubit_counts)
out_bin_counts = [(1 << o) for o in out_qubit_counts]
out_tot_bins = sum(out_bin_counts)
out_qubits = list(range(in_tot_qubits, in_tot_qubits + out_tot_qubits))

To discretize the data, we split it into as many quantiles as `2` to the power of our number of input qubits. For date or time data, we'll introduce a separate parameter to control choice of quantiles, and we'll transform to the frequency domain. Fitting to frequency rather than point in time, we potentially capture periodic correlations in the data, as opposed to non-periodic changes with monotonically increasing time.

Once we have our quantiles, we bin our training and validation data (independent and dependent variables alike).
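
The general idea of quantile binning into qubit-sized bit strings can be sketched with plain NumPy. This is not the implementation behind `QrackNeuron.quantile_bounds` or `QrackNeuron.discretize`; the helper names, the little-endian bit order, and the handling of out-of-range values are all assumptions made for illustration.

```python
import numpy as np

def quantile_edges(values, n_bits):
    """Interior quantile cut points that split `values` into 2**n_bits roughly equal bins."""
    n_bins = 1 << n_bits
    probs = np.arange(1, n_bins) / n_bins
    return np.quantile(np.asarray(values, dtype=float), probs)

def to_bits(values, edges, n_bits):
    """Map each value to its bin index, then to a little-endian list of bits."""
    idx = np.searchsorted(edges, values, side="right")
    return [[(int(i) >> b) & 1 for b in range(n_bits)] for i in idx]

train_vals = np.arange(16.0)           # 16 evenly spaced toy values
edges = quantile_edges(train_vals, 2)  # 3 cut points, so 4 bins
bits = to_bits(train_vals, edges, 2)

# Cut points come from the training data only; new data reuses them.
later_bits = to_bits(np.array([100.0]), edges, 2)
```

In this sketch, a value beyond the training range simply falls into the last bin. That is worth keeping in mind for an unshuffled time-series split, where every validation date is later than every training date.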

In [7]:
from pyqrack import QrackSimulator, QrackNeuron

xd = []
yd = []
xd_test = []
yd_test = []
y_bins = []

for i, feature in enumerate(features):
    l = list(X[feature])
    xd_bounds_col = QrackNeuron.quantile_bounds(l, in_qubit_counts[i])
    xd.append(QrackNeuron.discretize(l, xd_bounds_col))
    xd_test.append(QrackNeuron.discretize(list(X_test[feature]), xd_bounds_col))

for i, dependent in enumerate(dependents):
    l = list(yp[dependent])
    yd_bounds_col_2 = QrackNeuron.quantile_bounds(l, out_qubit_counts[i] + 1)
    yd_bounds_col = yd_bounds_col_2[0::2]
    yd.append(QrackNeuron.discretize(l, yd_bounds_col))
    yd_test.append(QrackNeuron.discretize(list(yp_test[dependent]), yd_bounds_col))
    y_bins.append(yd_bounds_col_2[1::2])

xd = QrackNeuron.flatten_and_transpose(xd)
xd_test = QrackNeuron.flatten_and_transpose(xd_test)
yd = QrackNeuron.flatten_and_transpose(yd)
yd_test = QrackNeuron.flatten_and_transpose(yd_test)

## Inference model (with QFT)

Our model is based on a very simple assumption: the dependent variable values can be inferred from their relationship to both _**(linear) time**_ and _**periodic oscillation frequency**._ As such, the discretized `Date` column is transformed via a _quantum Fourier transform (QFT)_ before training or prediction, to capture dependence on _oscillation frequency._ (The _linear_ vs. _oscillatory_ dimensions are assumed to be approximately or exactly _orthogonal._)
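
To see what a Fourier transform does to a single discretized date bin, here is a small NumPy sketch that applies a unitary discrete Fourier transform matrix to one computational basis state. The sign and bit-ordering conventions here are one common textbook choice and are an assumption; Qrack's `qft` may differ in those details.

```python
import numpy as np

def qft_matrix(n_qubits):
    """Unitary DFT matrix acting on 2**n_qubits amplitudes (one common sign convention)."""
    dim = 1 << n_qubits
    k = np.arange(dim)
    return np.exp(2j * np.pi * np.outer(k, k) / dim) / np.sqrt(dim)

n_qubits = 3
F = qft_matrix(n_qubits)

# Prepare the basis state for (hypothetical) date bin 5 and transform it.
basis_index = 5
state = np.zeros(1 << n_qubits, dtype=complex)
state[basis_index] = 1.0
transformed = F @ state

probabilities = np.abs(transformed) ** 2  # uniform: every outcome has probability 1/8
phase_step = transformed[1] / transformed[0]  # constant phase ratio between neighbors
```

The measurement probabilities are uniform for every input bin, so the information about which bin was prepared lives entirely in the relative phases between amplitudes.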

(It's time to train our inference model!)

In [8]:
from IPython.display import clear_output

eta = (1 / 2) * (sum(in_bin_counts) / y.shape[0])
input_indices = list(range(in_tot_qubits))
qsim = QrackSimulator(in_tot_qubits + out_tot_qubits)

qft_qubits = list(range(in_qubit_counts[0]))

output_layer = []
for i in range(out_tot_qubits):
    output_layer.append(QrackNeuron(qsim, input_indices, in_tot_qubits + i))

# Train each output neuron to associate a discretized date with one bit of the discretized residual
print("Learning...")
for i in range(len(xd)):
    clear_output(wait=True)
    print("Epoch ", (i + 1), " out of ", len(xd))
    
    perm = xd[i]
    res = yd[i]

    for j in range(out_tot_qubits):
        qsim.reset_all()
        for k in range(in_tot_qubits):
            if perm[k]:
                qsim.x(k)
        # Transform time domain to Fourier basis
        qsim.qft(qft_qubits)
        output_layer[j].learn(eta, res[j] == 1)

Epoch  2264  out of  2264


With training complete, we predict the validation set and calculate the coefficient of determination (R^2).
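
The prediction step turns a histogram of measured bit strings into a single number by averaging a representative value for each output bin, weighted by how often that bin was measured. A minimal pure-Python sketch of that decoding follows; the function name, the bin values, and the sample list are hypothetical.

```python
from collections import Counter

def expected_value(samples, bin_values, n_bits, offset=0):
    """Average the representative value of each measured bin over all shots."""
    mask = (1 << n_bits) - 1
    counts = Counter(samples)
    shots = len(samples)
    # Extract this variable's bits from each measured integer, then weight by frequency.
    return sum(bin_values[(k >> offset) & mask] * v for k, v in counts.items()) / shots

# Hypothetical 2-qubit output register with made-up bin representatives.
bin_values = [-3.0, -1.0, 1.0, 3.0]
samples = [0, 3, 3, 1]  # pretend measurement results, one integer per shot
decoded = expected_value(samples, bin_values, n_bits=2)
```

The `offset` argument plays the role of the running `front` index in the notebook's loop: with several dependent variables packed into one output register, each variable's bits start at a different position.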

In [9]:
from collections import Counter

print("Should associate each input with its trained output...")
dependents_len = len(dependents)
sum_sqr_tot = [0.0] * dependents_len
sum_sqr_res = [0.0] * dependents_len
sum_sqr_tot_p = [0.0] * dependents_len
sum_sqr_res_p = [0.0] * dependents_len
shots = min(10 ** 6, 1 << (out_tot_qubits + 2))
for i in range(len(xd_test)):
    clear_output(wait=True)
    print("Predicting ", (i + 1), " out of ", len(xd_test))
    
    perm = xd_test[i]

    qsim.reset_all()
    for j in range(in_tot_qubits):
        if perm[j]:
            qsim.x(j)
    # Transform time domain to Fourier basis
    qsim.qft(qft_qubits)

    for j in range(out_tot_qubits):
        output_layer[j].predict()

    m_res = dict(Counter(qsim.measure_shots(out_qubits, shots)))

    front = 0
    for j in range(dependents_len):
        pred = 0
        mid_mask = out_bin_counts[j] - 1
        for k, v in m_res.items():
            pred += y_bins[j][(k >> front) & mid_mask] * v / shots
        front += out_qubit_counts[j]

        dependent = dependents[j]
        
        sum_sqr_tot_p[j] += yp_test[dependent][i] * yp_test[dependent][i]
        sum_sqr_res_p[j] += (yp_test[dependent][i] - pred) * (yp_test[dependent][i] - pred)

        if (1 - ssr[j] / sst[j]) > 0:
            pred += (yp_mean[dependent] + y_proj[j][i])

        sum_sqr_tot[j] += y_test[dependent][i] * y_test[dependent][i]
        sum_sqr_res[j] += (y_test[dependent][i] - pred) * (y_test[dependent][i] - pred)

Predicting  755  out of  755


How does this compare to the validation R^2 of linear regression?

In [10]:
for i in range(len(dependents)):
    dependent = dependents[i]
    print("Variable: ", dependent)
    print("Linear regression (OLS) validation R^2: ", 1 - ssr[i] / sst[i])
    print("QrackNeuron (periodic-only) validation R^2: ", 1 - sum_sqr_res_p[i] / sum_sqr_tot_p[i])
    print("QrackNeuron + OLS validation R^2: ", 1 - sum_sqr_res[i] / sum_sqr_tot[i])
    msr = sum_sqr_res[i] / y_test[dependent].shape[0]
    print("QrackNeuron validation MSR: ", msr)
    print("QrackNeuron validation RMSE: ", math.sqrt(msr))

Variable:  Close
Linear regression (OLS) validation R^2:  0.6959668592970805
QrackNeuron (periodic-only) validation R^2:  0.6103271807000468
QrackNeuron + OLS validation R^2:  0.881526548901667
QrackNeuron validation MSR:  27831.44888794739
QrackNeuron validation RMSE:  166.8276022963448
