# Games Release data set (with PyQrack quantum associative memory)

PyQrack is a wrapper, written in pure standard Python, for the (C++11) Qrack quantum computer simulator library. PyQrack exposes a "quantum neuron" called `QrackNeuron`. (Its API reference is [here](https://pyqrack.readthedocs.io/en/latest/autoapi/pyqrack/qrack_neuron/index.html).) We'd like to model a simple data set as a proof of concept for `QrackNeuron`.

First, load the data set into a `pandas` dataframe.

In [1]:
import pandas as pd

df = pd.read_csv('games-release/games-release-ALL.csv')
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

df = df.drop(['game', 'link', 'release', 'total_reviews'], axis=1)

df['peak_players'] = df['peak_players'].str.replace(',', '').astype(int)
df['positive_reviews'] = df['positive_reviews'].str.replace(',', '').astype(int)
df['negative_reviews'] = df['negative_reviews'].str.replace(',', '').astype(int)
df['rating'] = df['rating'].str.replace('%', '').astype(float)

print(df.head())
print("Number of observations: ", df.shape[0])

   peak_players  positive_reviews  negative_reviews  rating
0          4529             19807               227   96.39
1        168191             61752              1616   95.75
2         15543             12643               213   95.54
3          1415             11717               209   95.39
4          6132             14152               324   95.09
Number of observations:  66427


Separate the dependent and independent variables.

In [2]:
keys = ['peak_players', 'positive_reviews', 'negative_reviews']
dep_key = 'rating'

X = df[keys]
y = df[dep_key] 

Ideally, we'd like PyQrack's `QrackNeuron` to improve on the goodness-of-fit of multiple linear regression. At the very least, to show that `QrackNeuron` is viable for modeling a data set, we'd like to show performance somewhat comparable to multiple linear regression.

PyQrack's `QrackNeuron` can only work with discrete, binary data. To model this or any data set, we have to reduce it to a simple, discrete, binary form.

We'll try to model the data set via "(quantum) associative memory." There are several statistical considerations to keep in mind in order to avoid overfitting.

Firstly, each possible permutation of discretized independent variable inputs trains an independent parameter of a `QrackNeuron`. If a `QrackNeuron` has never seen a specific, exact permutation of input bits, it has no information about that permutation at all, so its prediction defaults to "maximal superposition" (i.e., a totally random guess). Therefore, when we discretize our independent variables, we'd like to keep the number of possible distinct inputs significantly smaller than the number of observation rows.

Secondly, having satisfied the first consideration, we discretize our dependent variable to have exactly as many possible discrete values as there are possible distinct inputs. (We guess that this loses the least information about the dependent variable, while we still have enough observations to fully train our network.)

Thirdly, our learning rate should just barely "saturate" the learned parameters of our (quantum) associative memory. A learning volatility parameter ("`eta`") of `1/2` "fully trains" one parameter of a `QrackNeuron` between input qubits and output qubit, on average. This implies that we might set `eta` to `1/2` times `2` to the power of the total number of input qubits (summed across all predictors), divided by the number of observations.
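
As a rough illustration of the first and third considerations, the sketch below works through the arithmetic for the values this notebook uses (3 predictors, 4 qubits per predictor, and a training half of 33214 rows). The function name `saturating_eta` is hypothetical and is not part of PyQrack; it only restates the heuristic described above.

```python
def saturating_eta(total_input_qubits, n_observations):
    """Heuristic from the text: eta = 1/2 * 2**(total input qubits) / observations."""
    n_distinct_inputs = 1 << total_input_qubits
    return 0.5 * n_distinct_inputs / n_observations

# Values taken from this notebook (assumed here for illustration)
n_predictors = 3
qubits_per_predictor = 4
n_train_rows = 33214

total_input_qubits = n_predictors * qubits_per_predictor
n_distinct_inputs = 1 << total_input_qubits          # 2**12 possible input patterns
rows_per_input = n_train_rows / n_distinct_inputs    # average rows seen per pattern
eta = saturating_eta(total_input_qubits, n_train_rows)

print("Distinct inputs:", n_distinct_inputs)
print("Average training rows per distinct input:", rows_per_input)
print("eta:", eta)
```

With these values there are 4096 possible inputs, or roughly 8 training rows per input on average, and `eta` comes out at roughly 0.06.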

Start by splitting the data set into equal halves for training and validation.

In [3]:
train=df.sample(frac = 1/2)
test=df.drop(train.index)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

X = train[keys]
y = train[dep_key]

X_test = test[keys]
y_test = test[dep_key]

As a baseline, our first choice to model this data set might be multiple linear regression.

In [4]:
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X, y)
pd.DataFrame(zip(X.columns, regr.coef_))

                  0          1
0      peak_players    6.1e-05
1  positive_reviews    4.5e-05
2  negative_reviews  -0.000323


In [5]:
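y_pred = regr.predict(X)
# Note: sst is the uncentered total sum of squares (the mean of y is not
# subtracted), so this is an uncentered R^2, not the conventional centered one.
sst = 0
ssr = 0
for i in range(len(y_pred)):
    sst += y[i] * y[i]
    ssr += (y[i] - y_pred[i]) * (y[i] - y_pred[i])

print("Multiple linear regression training R^2: ", 1 - ssr / sst)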

Multiple linear regression training R^2:  0.9557760123761988


In [6]:
y_pred = regr.predict(X_test)
# Uncentered R^2 again (sst does not subtract the mean), on the held-out half.
sst = 0
ssr = 0
for i in range(len(y_pred)):
    sst += y_test[i] * y_test[i]
    ssr += (y_test[i] - y_pred[i]) * (y_test[i] - y_pred[i])

print("Multiple linear regression validation R^2: ", 1 - ssr / sst)

Multiple linear regression validation R^2:  0.9565691441192254
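
The loops above divide by the sum of squared ratings rather than by the sum of squared deviations from the mean, so the reported figure is an uncentered R^2. The sketch below, on made-up toy numbers (not from the games data set), contrasts the two definitions; the helper names `r2_uncentered` and `r2_centered` are hypothetical.

```python
def r2_uncentered(y_true, y_pred):
    """1 - SS_res / sum(y^2): the form used in this notebook's loops."""
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum(t ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def r2_centered(y_true, y_pred):
    """1 - SS_res / sum((y - mean)^2): the conventional coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy data: a "model" that always predicts the mean of the observations
toy_y = [70.0, 80.0, 90.0]
toy_pred = [80.0, 80.0, 80.0]

toy_uncentered = r2_uncentered(toy_y, toy_pred)
toy_centered = r2_centered(toy_y, toy_pred)
print("Uncentered R^2:", toy_uncentered)
print("Centered R^2:", toy_centered)
```

On this toy data, always predicting the mean scores about 0.99 under the uncentered definition and 0 under the centered one. Because both models in this notebook are scored with the same uncentered formula, their scores remain comparable with each other, but they should not be read as conventional R^2 values.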


To discretize the data, we split it into as many quantiles as `2` to the power of our number of input qubits.

In [7]:
import numpy as np

in_qubit_count = 4

in_key_count = len(keys)
in_bin_count = 1 << in_qubit_count
in_tot_count = in_key_count * in_qubit_count
out_qubit_count = in_key_count * in_qubit_count
out_bin_count = 1 << out_qubit_count

x_bins = []
for key in keys:
    x_bins.append(np.percentile(X[key], np.arange(0, 100, 100 / in_bin_count)))
y_bins = np.percentile(y, np.arange(0, 100, 100 / out_bin_count))

In [8]:
print(x_bins)

[array([0.000e+00, 1.000e+00, 2.000e+00, 2.000e+00, 3.000e+00, 3.000e+00,
       4.000e+00, 5.000e+00, 7.000e+00, 9.000e+00, 1.400e+01, 2.300e+01,
       4.600e+01, 1.080e+02, 2.780e+02, 1.107e+03]), array([0.000e+00, 1.000e+00, 2.000e+00, 3.000e+00, 5.000e+00, 7.000e+00,
       1.000e+01, 1.400e+01, 1.900e+01, 2.600e+01, 3.900e+01, 5.900e+01,
       9.900e+01, 1.900e+02, 4.260e+02, 1.461e+03]), array([  0.   ,   0.   ,   0.   ,   1.   ,   1.   ,   2.   ,   3.   ,
         4.   ,   6.   ,   8.   ,  12.   ,  18.   ,  29.   ,  50.   ,
       103.375, 298.   ])]


In [9]:
print(y_bins)

[16.39       20.54108643 21.40125977 ... 95.34674072 95.74130859
 96.09935791]
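
To make the quantile edges more concrete, here is a minimal pure-Python sketch of the same idea on toy data. It assumes linear interpolation between sorted values, which should agree with `np.percentile(values, np.arange(0, 100, 100 / n_bins))` under NumPy's default settings; the helper name `percentile_edges` is hypothetical.

```python
def percentile_edges(values, n_bins):
    """Lower edges of n_bins quantile bins, linearly interpolated between sorted values."""
    ordered = sorted(values)
    edges = []
    for k in range(n_bins):
        # Fractional position of the (100 * k / n_bins)th percentile in the sorted list
        pos = (len(ordered) - 1) * k / n_bins
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        frac = pos - lo
        edges.append(ordered[lo] + (ordered[hi] - ordered[lo]) * frac)
    return edges

# Toy data (not from the games data set)
evenly_spread = list(range(17))
skewed = [0, 0, 0, 0, 0, 1, 2, 50, 100]

even_edges = percentile_edges(evenly_spread, 4)
skewed_edges = percentile_edges(skewed, 4)
print(even_edges)    # [0.0, 4.0, 8.0, 12.0]
print(skewed_edges)  # [0.0, 0.0, 0.0, 2.0]
```

The skewed example shows why some edges repeat in the `x_bins` output above: when many rows share the same small value, several adjacent quantile edges coincide, and the bins between them stay empty.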


Once we have our quantiles, we bin our independent training and validation data.

In [10]:
def discretize_x(X):
    xd = []
    for i in X.index:
        xd_row = []
        for ki in range(len(keys)):
            key = keys[ki]
            bn = in_bin_count
            offset = 0
            while bn > 1:
                bn =  bn // 2
                b = bn + offset
                if X[key][i] < x_bins[ki][b - 1]:
                    xd_row.append(False)
                else:
                    xd_row.append(True)
                    offset += bn
        xd.append(xd_row)
    return xd

xd = discretize_x(X)
xd_test = discretize_x(X_test)

We do the same for our dependent data.

In [11]:
def discretize_y(y):
    yd = []
    for i in y.index:
        yd_row = []
        bn = out_bin_count
        offset = 0
        while bn > 1:  # stop after out_qubit_count bits, matching discretize_x
            bn =  bn // 2
            b = bn + offset
            if y[i] < y_bins[b - 1]:
                yd_row.append(False)
            else:
                yd_row.append(True)
                offset += bn
        yd.append(yd_row)
    return yd

yd = discretize_y(y)
yd_test = discretize_y(y_test)
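
The two functions above perform a binary search over the bin edges and emit one bit per halving, most significant bit first. The sketch below isolates that step on a toy set of edges, mirroring the `edges[b - 1]` comparison used above; the names `encode_bits` and `bits_to_index` are hypothetical helpers introduced only for illustration.

```python
def encode_bits(value, edges, n_bits):
    """Binary-search value against sorted edges, one bit per halving (MSB first)."""
    bits = []
    bn = 1 << n_bits
    offset = 0
    while bn > 1:
        bn //= 2
        b = bn + offset
        if value < edges[b - 1]:
            bits.append(False)
        else:
            bits.append(True)
            offset += bn
    return bits

def bits_to_index(bits):
    """Reassemble MSB-first bits into the integer bin index they encode."""
    index = 0
    for bit in bits:
        index = (index << 1) | int(bit)
    return index

# Toy edges for 2 bits, i.e. 4 bins (not from the games data set)
toy_edges = [0, 10, 20, 30]

for value in (5, 15, 25):
    bits = encode_bits(value, toy_edges, 2)
    print(value, bits, bits_to_index(bits))
```

With this comparison, a resulting index `k` greater than 0 means the value is at least `edges[k - 1]`, and for every index below the largest it also means the value is below `edges[k]`. The prediction cell later reverses the process by summing the measured bits into `offset` and looking up `y_bins[offset]`.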

We're ready to train our associative memory! (Note that it offers us no particular advantage, in this case, that our "neurons" are based on simulated quantum computational gates, though it is possible to predict in "superposition" of many rows at once.)

In [12]:
from IPython.display import clear_output
from pyqrack import QrackSimulator, QrackNeuron

eta = (1 / 2) * (out_bin_count / y.shape[0])

input_power = 1 << in_tot_count
input_indices = list(range(in_tot_count))
qsim = QrackSimulator(in_tot_count + out_qubit_count)

output_layer = []
for i in range(out_qubit_count):
    output_layer.append(QrackNeuron(qsim, input_indices, in_tot_count + i))

# Train the network to associate each discretized input row with its discretized rating
print("Learning...")
for i in range(len(xd)):
    perm = xd[i]
    res = yd[i]

    if i > 0:
        clear_output(wait=True)
    print("Epoch ", (i + 1), " out of ", len(xd))
    for j in range(out_qubit_count):
        qsim.reset_all()
        for k in range(in_tot_count):
            if perm[k]:
                qsim.x(k)
        for k in range(out_qubit_count):
            if res[k]:
                qsim.x(in_tot_count + k)
        output_layer[j].learn_permutation(eta, res[j] == 1)

Epoch  33214  out of  33214
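
Since an input pattern that never appeared in training can only yield a random guess, it may be worth checking how many held-out rows have a pattern the memory has actually seen. The following is a minimal sketch on toy patterns; `pattern_coverage` is a hypothetical helper, not part of PyQrack.

```python
def pattern_coverage(train_rows, test_rows):
    """Fraction of test rows whose exact bit pattern also occurs in the training rows."""
    seen = {tuple(row) for row in train_rows}
    if not test_rows:
        return 0.0
    hits = sum(1 for row in test_rows if tuple(row) in seen)
    return hits / len(test_rows)

# Toy 2-bit patterns (hypothetical data, not from the games data set)
toy_train = [[False, False], [False, True], [True, True]]
toy_test = [[False, True], [True, False], [True, True], [True, True]]

toy_coverage = pattern_coverage(toy_train, toy_test)
print(toy_coverage)  # 0.75
```

In this notebook, the same check could be run as `pattern_coverage(xd, xd_test)` after the discretization cells.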


Let's use our neural net, trained on half the data, to try to predict the left-out half of data!

In [13]:
print("Should associate each input with its trained output...")
sum_sqr_tot = 0
sum_sqr_res = 0
for i in range(len(xd_test)):
    perm = xd_test[i]

    qsim.reset_all()
    for j in range(in_tot_count):
        if perm[j]:
            qsim.x(j)
    for j in range(out_qubit_count):
        output_layer[j].predict()

    bn = out_bin_count
    offset = 0
    for j in range(out_qubit_count):
        bn = bn // 2
        if qsim.m(in_tot_count + j):
            offset += bn 
    pred = y_bins[offset]
    
    if i > 0:
        clear_output(wait=True)
    print("Predicting ", (i + 1), " out of ", len(xd_test))

    # print("Row: ", str(i))
    # print("Input: ", str(perm))
    # print("Prediction: ", str(pred))
    # print("Observed: ", str(y_test[i]))
    # print("Residual: ", str(y_test[i] - pred))
    # print()

    sum_sqr_tot += y_test[i] * y_test[i]
    sum_sqr_res += (y_test[i] - pred) * (y_test[i] - pred)

Predicting  33213  out of  33213


How does this compare to the validation R^2 of multiple linear regression?

In [14]:
print("QrackNeuron Validation R^2: ", 1 - sum_sqr_res / sum_sqr_tot)
print("Multiple linear regression validation R^2: ", 1 - ssr / sst)
if (1 - sum_sqr_res / sum_sqr_tot) > (1 - ssr / sst):
    print("QrackNeuron quantum associative memory outperformed multiple linear regression.")
else:
    print("Multiple linear regression outperformed QrackNeuron quantum associative memory.")

QrackNeuron Validation R^2:  0.9855655982535823
Multiple linear regression validation R^2:  0.9565691441192254
QrackNeuron quantum associative memory outperformed multiple linear regression.
