# Weather time series data set (with PyQrack quantum associative memory)

"PyQrack" is a wrapper, written in pure standard Python, for the (C++11) Qrack quantum computer simulator library. PyQrack exposes a "quantum neuron" called `QrackNeuron`. (Its API reference is [here](https://pyqrack.readthedocs.io/en/latest/autoapi/pyqrack/qrack_neuron/index.html).) We'd like to model a simple data set as a proof of concept for `QrackNeuron`.

In [1]:
#!pip install pyqrack

First, load the data set into a `pandas` dataframe.

In [2]:
import math
import random
import numpy as np
import pandas as pd

train = pd.read_csv('weather/DailyDelhiClimateTrain.csv')
train['date'] = pd.to_datetime(train['date'], format='%Y-%m-%d').apply(lambda x:x.toordinal())
train['date2'] = train.loc[:,'date']

test = pd.read_csv('weather/DailyDelhiClimateTest.csv')
test['date'] = pd.to_datetime(test['date'], format='%Y-%m-%d').apply(lambda x:x.toordinal())
test['date2'] = test.loc[:,'date']

print(train.head())
print("Number of observations: ", train.shape[0])

     date   meantemp   humidity  wind_speed  meanpressure   date2
0  734869  10.000000  84.500000    0.000000   1015.666667  734869
1  734870   7.400000  92.000000    2.980000   1017.800000  734870
2  734871   7.166667  87.000000    4.633333   1018.666667  734871
3  734872   8.666667  71.333333    1.233333   1017.166667  734872
4  734873   6.000000  86.833333    3.700000   1016.500000  734873
Number of observations:  1462
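
The `date` column printed above holds proleptic Gregorian ordinals, which is what `toordinal()` returns (day 1 is 0001-01-01). As a quick sanity check that uses only the standard library, the first ordinal in the output, `734869`, can be mapped back to a calendar date:

```python
from datetime import date

first_ordinal = 734869  # first `date` value in the output above

# fromordinal() inverts toordinal()
first_day = date.fromordinal(first_ordinal)
print(first_day.isoformat())  # 2013-01-01

# Consecutive daily rows differ by exactly 1 in ordinal form
next_day = date.fromordinal(first_ordinal + 1)
print((next_day - first_day).days)  # 1
```

Because ordinals are plain integers, they can be fed directly to regression or to the quantile binning used later.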


Separate the dependent and independent variables.

In [3]:
keys = ['date', 'date2']
key_cycles = [True, False]
dep_key = ['meantemp', 'humidity', 'wind_speed', 'meanpressure' ]
dep_key_cycles = [False, False, False, False]

X = train[keys]
y = train[dep_key]

x_max = X.max()
x_min = X.min()

y_max = y.max()
y_min = y.min()

X_test = test[keys]
y_test = test[dep_key]

Ideally, we'd like PyQrack's `QrackNeuron` to improve on the goodness of fit of multiple linear regression. At the very least, to show that `QrackNeuron` is viable for modeling a data set, we'd like to show performance somewhat comparable to multiple linear regression.

PyQrack's `QrackNeuron` can only work with discrete, binary data. To model this or any data set, we have to reduce it to a simple, discrete, binary form.

We'll try to model the data set via "(quantum) associative memory." There are several statistical considerations for avoiding overfitting.

Firstly, each possible permutation of discretized independent-variable input bits trains its own independent parameter of a `QrackNeuron`. If a `QrackNeuron` has never seen a specific, exact permutation of input bits, it has no information about that permutation at all, so its prediction defaults to "maximal superposition" (i.e., a totally random guess). Therefore, when we discretize our independent variables, we'd like to keep the number of possible distinct inputs significantly smaller than the number of observation rows.

Satisfying the first consideration, we secondly discretize our dependent variable to have exactly as many possible discrete values as possible distinct inputs. (We guess that this loses the least information about the dependent variable, while we still have enough observations to fully train our network.)

Thirdly, our learning rate should just barely "saturate" the learned parameters of our (quantum) associative memory. A learning volatility parameter (`eta`) of `1/2` "fully trains" one parameter of a `QrackNeuron` between input qubits and output qubit, on average. This implies that we might set `eta` to `1/2` times `2` to the power of the number of input qubits (summed across all predictors), divided by the number of observations.
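
As a rough worked example of these considerations, the sketch below plugs in the input qubit counts chosen later in this notebook (`[5, 1]`) and the 1462 training rows reported above. The `eta` expression copies the one in the training cell, which sums the per-predictor bin counts. The name `n_permutations` is introduced here only for illustration.

```python
# Values taken from elsewhere in this notebook
in_qubit_counts = [5, 1]   # qubits per independent variable
n_observations = 1462      # training rows

# Bins per predictor: 2 ** qubits
in_bin_counts = [1 << q for q in in_qubit_counts]

# Distinct input bit permutations across all input qubits together
n_permutations = 1 << sum(in_qubit_counts)

# Learning rate as computed in the training cell:
# one half times the summed bin counts, divided by the number of observations
eta = (1 / 2) * (sum(in_bin_counts) / n_observations)

print("bins per predictor:", in_bin_counts)
print("distinct input permutations:", n_permutations)
print("observations per permutation:", n_observations / n_permutations)
print("eta:", eta)
```

With these numbers there are far fewer distinct inputs than observations, which is the first consideration above.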

As a baseline, our first choice to model this data set might be multiple linear regression.

In [4]:
from sklearn import linear_model

regr = []
for i in range(len(dep_key)):
    regr.append(linear_model.LinearRegression())
    regr[i].fit(X, y[dep_key[i]])
    pd.DataFrame(zip(X.columns, regr[i].coef_))

In [5]:
y_pred = [r.predict(X_test) for r in regr]
sst = [0 for _ in regr]
ssr = [0 for _ in regr]
for i in range(len(dep_key)):
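    # Note: len(y_pred) equals len(dep_key) (one prediction array per target),
    # so this inner loop scores only the first len(dep_key) test rows.
    for j in range(len(y_pred)):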
        sst[i] += y_test[dep_key[i]][j] * y_test[dep_key[i]][j]
        ssr[i] += (y_test[dep_key[i]][j] - y_pred[i][j]) * (y_test[dep_key[i]][j] - y_pred[i][j])
    print("Multiple linear regression validation R^2: ", 1 - ssr[i] / sst[i])

Multiple linear regression validation R^2:  0.6980203077381697
Multiple linear regression validation R^2:  0.933856389001777
Multiple linear regression validation R^2:  0.24730546944606335
Multiple linear regression validation R^2:  0.7057764776601969
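
The score printed above divides the residual sum of squares by the raw sum of squared observations, not by the sum of squared deviations from their mean, so it differs from the conventional coefficient of determination. The sketch below contrasts the two definitions on made-up numbers; the helper names and the data are illustrative and do not come from the notebook.

```python
def r2_uncentered(y_true, y_pred):
    """1 - SS_res / sum(y^2), the form used in this notebook."""
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum(t * t for t in y_true)
    return 1 - ss_res / ss_tot


def r2_centered(y_true, y_pred):
    """1 - SS_res / sum((y - mean)^2), the conventional R^2."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy targets far from zero, "predicted" by a constant equal to their mean
y_true = [1000.0, 1010.0, 1020.0]
y_pred = [1010.0, 1010.0, 1010.0]

print("uncentered:", r2_uncentered(y_true, y_pred))
print("centered:  ", r2_centered(y_true, y_pred))
```

For a variable whose values sit far from zero, the uncentered form can be close to 1 even for a constant prediction, so it is best read alongside an error measure such as RMSE.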


To discretize the data, we split it into as many quantiles as `2` to the power of our number of input qubits. For date or time data, we'll introduce a separate parameter to control choice of quantiles, and we'll transform to the frequency domain. Fitting to frequency rather than point in time, we potentially capture periodic correlations in weather, as opposed to non-periodic changes with monotonically increasing time.
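
The quantile split can be sketched on toy data as follows. It mirrors the `np.percentile` pattern used in the next cell: twice as many percentiles as bins are computed, the odd-indexed ones serve as representative values for each bin, and the even-indexed ones (after the first) serve as the interior bin edges. The function name `quantile_bins` and the toy array are assumptions made for this illustration.

```python
import numpy as np


def quantile_bins(values, n_bins):
    """Return (centers, bounds) for n_bins equal-population bins."""
    # 2 * n_bins evenly spaced percentile ranks: 0, 100/(2n), 2*100/(2n), ...
    ranks = np.arange(0, 100, 100 / (2 * n_bins))
    pct = np.percentile(values, ranks)
    centers = pct[1::2]  # one representative value per bin
    bounds = pct[2::2]   # n_bins - 1 interior edges between bins
    return centers, bounds


# Toy data 0..100, for which the p-th percentile is simply p
values = np.arange(101)
centers, bounds = quantile_bins(values, 4)
print("centers:", centers)
print("bounds: ", bounds)
```

With `n_bins` equal to `2` to the power of a qubit count, each bin index fits exactly into that many bits.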

In [6]:
in_qubit_counts = [5, 1]
out_qubit_counts = [5, 1, 1, 1]

in_key_count = len(keys)
in_bin_counts = [(1 << i) for i in in_qubit_counts]
in_tot_count = sum(in_qubit_counts)
out_key_count = len(dep_key)
out_tot_count = sum(out_qubit_counts)
out_bin_counts = [(1 << o) for o in out_qubit_counts]
out_tot_bins = sum(out_bin_counts)

x_bins = []
x_bounds = []
for i in range(len(keys)):
    key = keys[i]
    bins = np.percentile(X[key], np.arange(0, 100, 100 / (2 * in_bin_counts[i])))
    x_bins.append(bins[1::2])
    x_bounds.append(bins[2::2])
y_bins = []
y_bounds = []
for i in range(len(dep_key)):
    key = dep_key[i]
    bins = np.percentile(y[key], np.arange(0, 100, 100 / (2 * out_bin_counts[i])))
    y_bins.append(bins[1::2])
    y_bounds.append(bins[2::2])

In [7]:
print(x_bounds)

[array([734914.65625, 734960.3125 , 735005.96875, 735051.625  ,
       735097.28125, 735142.9375 , 735188.59375, 735234.25   ,
       735279.90625, 735325.5625 , 735371.21875, 735416.875  ,
       735462.53125, 735508.1875 , 735553.84375, 735599.5    ,
       735645.15625, 735690.8125 , 735736.46875, 735782.125  ,
       735827.78125, 735873.4375 , 735919.09375, 735964.75   ,
       736010.40625, 736056.0625 , 736101.71875, 736147.375  ,
       736193.03125, 736238.6875 , 736284.34375]), array([735599.5])]


In [8]:
print(y_bounds)

[array([11.875     , 13.        , 14.375     , 15.5       , 16.5       ,
       17.3161526 , 18.09453125, 18.85714286, 20.        , 21.5       ,
       22.7609375 , 23.70833333, 24.7109375 , 25.6046875 , 26.83007812,
       27.71428571, 28.40142045, 29.        , 29.5       , 29.875     ,
       30.27790179, 30.625     , 31.        , 31.30580357, 31.70123626,
       32.125     , 32.5       , 33.125     , 33.875     , 34.9609375 ,
       36.12786458]), array([62.625]), array([6.22166667]), array([1008.56349206])]


Once we have our quantiles, we bin our independent training and validation data.

In [9]:
def discretize(X, keys, in_bin_counts, x_bounds, x_min, x_max, key_cycles):
    xd = []
    for i in X.index:
        xd_row = []
        for ki in range(len(keys)):
            key = keys[ki]
            bn = in_bin_counts[ki]
            offset = 0

            x = X[key][i]
            if key_cycles[ki]:
                while x > x_max[key]:
                    x -= x_max[key] - x_min[key]
                while x < x_min[key]:
                    x += x_max[key] - x_min[key]

            while bn > 1:
                bn =  bn // 2
                b = bn + offset
                if x < x_bounds[ki][b - 1]:
                    xd_row.append(False)
                else:
                    xd_row.append(True)
                    offset += bn

        xd.append(xd_row)
    return xd

xd = discretize(X, keys, in_bin_counts, x_bounds, x_min, x_max, key_cycles)
xd_test = discretize(X_test, keys, in_bin_counts, x_bounds, x_min, x_max, key_cycles)
yd = discretize(y, dep_key, out_bin_counts, y_bounds, y_min, y_max, dep_key_cycles)
yd_test = discretize(y_test, dep_key, out_bin_counts, y_bounds, y_min, y_max, dep_key_cycles)
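
The inner `while bn > 1` loop of `discretize` is a binary search over the sorted bin edges that emits one bit per halving, most significant bit first. A stripped-down sketch of that step for a single value follows; the helper name `to_bits` and the toy edges are illustrative only.

```python
def to_bits(x, bounds, n_bins):
    """Binary-search x against sorted interior bin edges.

    Returns a list of booleans, most significant bit first.
    """
    bits = []
    offset = 0
    bn = n_bins
    while bn > 1:
        bn //= 2
        b = bn + offset
        if x < bounds[b - 1]:
            bits.append(False)   # x lies in the lower half of the current range
        else:
            bits.append(True)    # x lies in the upper half
            offset += bn
    return bits


bounds = [25, 50, 75]  # interior edges of 4 toy bins
for x in (10, 30, 60, 80):
    print(x, to_bits(x, bounds, 4))
```

Each value ends up as the binary index of its bin, so 4 bins need 2 bits and the 32 date bins used in this notebook need 5.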

The same `discretize` function bins our dependent data (`yd` and `yd_test` above).

We're ready to train our associative memory!

In [10]:
from IPython.display import clear_output
from pyqrack import QrackSimulator, QrackNeuron

eta = (1 / 2) * (sum(in_bin_counts) / y.shape[0])
input_indices = list(range(in_tot_count))
qsim = QrackSimulator(in_tot_count + out_tot_count)

qft_qubits = list(range(in_qubit_counts[0]))

output_layer = []
for i in range(out_tot_count):
    output_layer.append(QrackNeuron(qsim, input_indices, in_tot_count + i))

# Train the associative memory, presenting each training row once
print("Learning...")
for i in range(len(xd)):
    clear_output(wait=True)
    print("Epoch ", (i + 1), " out of ", len(xd))
    
    perm = xd[i]
    res = yd[i]

    for j in range(out_tot_count):
        qsim.reset_all()
        for k in range(in_tot_count):
            if perm[k]:
                qsim.x(k)
        # Transform time domain to Fourier basis
        qsim.qft(qft_qubits)
        output_layer[j].learn(eta, res[j] == 1)

Epoch  1462  out of  1462
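
The `qsim.qft(qft_qubits)` call in the loop above maps a basis state of the date qubits to a superposition whose amplitudes all have the same magnitude and whose phases encode the original value. A minimal classical sketch of those amplitudes follows. It assumes the textbook convention, in which the amplitude of basis state `k` is `exp(2*pi*i*x*k/N) / sqrt(N)`; Qrack's sign and bit-ordering conventions may differ, and `qft_amplitudes` is an illustrative helper, not a PyQrack function.

```python
import cmath
import math


def qft_amplitudes(x, n_qubits):
    """Amplitudes of the textbook QFT applied to basis state |x>."""
    n = 1 << n_qubits
    return [cmath.exp(2j * math.pi * x * k / n) / math.sqrt(n) for k in range(n)]


amps = qft_amplitudes(3, 5)  # 5 qubits, as for the date variable

# Every basis state is equally likely; only the phases depend on x
print("magnitudes:", {round(abs(a), 6) for a in amps})
print("total probability:", sum(abs(a) ** 2 for a in amps))
```

Under this assumption, the value of `x` is carried entirely in the relative phases, which is the sense in which the input is presented "in the frequency domain."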


Let's use our neural net, trained on a portion of the data, to try to predict the left-out portion of data!

In [11]:
from collections import Counter

print("Should associate each input with its trained output...")
sum_sqr_tot = [0 for _ in range(len(y_bins))]
sum_sqr_res = [0 for _ in range(len(y_bins))]
out_qubits = [j for j in range(in_tot_count,in_tot_count + out_tot_count)]
for i in range(len(xd_test)):
    clear_output(wait=True)
    print("Predicting ", (i + 1), " out of ", len(xd_test))
    
    perm = xd_test[i]

    qsim.reset_all()
    for j in range(in_tot_count):
        if perm[j]:
            qsim.x(j)
    # Transform time domain to Fourier basis
    qsim.qft(qft_qubits)

    for j in range(out_tot_count):
        output_layer[j].predict()

    m_res = dict(Counter(qsim.measure_shots(out_qubits, out_tot_bins)))

    front = 0
    for j in range(len(dep_key)):
        pred = 0
        mid_mask = out_bin_counts[j] - 1
        for k, v in m_res.items():
            pred += y_bins[j][(k >> front) & mid_mask] * v / out_tot_bins
        front += out_qubit_counts[j]

        sum_sqr_tot[j] += y_test[dep_key[j]][i] * y_test[dep_key[j]][i]
        sum_sqr_res[j] += (y_test[dep_key[j]][i] - pred) * (y_test[dep_key[j]][i] - pred)

Predicting  114  out of  114
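
The decoding step in the cell above unpacks each measured integer into one bin index per dependent variable using `front` and `mid_mask`, then averages the matching bin values over all shots. A self-contained sketch of that bookkeeping follows, with made-up shot results and toy bin values. It assumes, as the cell does, that the lowest bits of each measured integer belong to the first output variable; `unpack` is a hypothetical helper.

```python
from collections import Counter

out_qubit_counts = [5, 1, 1, 1]  # qubits per dependent variable, as in this notebook


def unpack(k, qubit_counts):
    """Split integer k into one bin index per variable, lowest bits first."""
    fields = []
    front = 0
    for q in qubit_counts:
        fields.append((k >> front) & ((1 << q) - 1))
        front += q
    return fields


# Made-up measurement results (not produced by a simulator)
shots = [179, 179, 147, 51]
counts = Counter(shots)

# Toy bin values for the second dependent variable (1 qubit, 2 bins)
centers = [10.0, 20.0]

# Shot-weighted average of the bin values selected by each result
pred = sum(
    centers[unpack(k, out_qubit_counts)[1]] * v / len(shots)
    for k, v in counts.items()
)

print(unpack(179, out_qubit_counts))
print(pred)
```

Averaging over shots turns the discrete measurement outcomes into a continuous prediction that can be compared with the observed value.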


How does this compare to the validation R^2 of multiple linear regression?

In [12]:
for i in range(len(dep_key)):
    print("Variable: ", dep_key[i])
    print("Multiple linear regression validation R^2: ", 1 - ssr[i] / sst[i])
    print("QrackNeuron validation R^2: ", 1 - sum_sqr_res[i] / sum_sqr_tot[i])
    msr = sum_sqr_res[i] / y_test[dep_key[i]].shape[0]
    print("QrackNeuron validation MSR: ", msr)
    print("QrackNeuron validation RMSE: ", math.sqrt(msr))

Variable:  meantemp
Multiple linear regression validation R^2:  0.6980203077381697
QrackNeuron validation R^2:  0.9180556720471951
QrackNeuron validation MSR:  41.91890621640057
QrackNeuron validation RMSE:  6.47448115422391
Variable:  humidity
Multiple linear regression validation R^2:  0.933856389001777
QrackNeuron validation R^2:  0.8910148320583193
QrackNeuron validation MSR:  384.2169277720253
QrackNeuron validation RMSE:  19.601452185285286
Variable:  wind_speed
Multiple linear regression validation R^2:  0.24730546944606335
QrackNeuron validation R^2:  0.8367675030556971
QrackNeuron validation MSR:  12.90918771023045
QrackNeuron validation RMSE:  3.592935806583587
Variable:  meanpressure
Multiple linear regression validation R^2:  0.7057764776601969
QrackNeuron validation R^2:  0.9920822335124879
QrackNeuron validation MSR:  8044.624595601137
QrackNeuron validation RMSE:  89.6918312646204
