# Customer purchase history

We are now interested in efficient computation. In our setting, the data matrix $C$ is large but very sparse. The number of nonzero elements divided by the total number of elements is called the density $d$ of the matrix $C$. Let $w \in \mathbb{R}^K$ be a given weighting vector. Assume that we center the rows by subtracting the average row from every row, obtaining a new row-centered matrix $C_m$. Note that $C_m$ is in general dense even though $C$ is sparse, so forming it explicitly throws away the sparsity. The quantity computed below is the product $C_m w$.
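The centering step can be written out to show why $C_m$ never needs to be formed. The notation here is an assumption introduced for illustration: $N$ is the number of rows of $C$, $\mathbf{1}_N$ is the all-ones vector of length $N$, and $\bar{c} \in \mathbb{R}^K$ is the average row.

$$
\bar{c} = \frac{1}{N} C^\top \mathbf{1}_N, \qquad C_m = C - \mathbf{1}_N \bar{c}^\top
$$

$$
C_m w = C w - \mathbf{1}_N \left( \bar{c}^\top w \right)
$$

The first term is a sparse matrix-vector product, and the second is a single scalar $\bar{c}^\top w$ subtracted from each of the $N$ entries.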

In [2]:
import numpy as np
import time
from scipy.sparse import csr_matrix

In [24]:
lmbda = 0.1
K = 10000
N = 10000
dims = [N, K]

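# Rough operation counts; lmbda stands in for the density d of C
naive_ops = 3*N*K                # center every entry (N*K), then a dense mat-vec (2*N*K)
opt_ops = 2*K + N + 2*K*N*lmbda  # r_avg @ w, one subtraction per row, sparse mat-vec
naive_ops / opt_ops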

14.977533699450824
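To see how this estimate depends on the density, the same rough count can be wrapped in a small helper. This is only a sketch of the operation model used in the cell above; the function name `estimated_speedup` and its parameter names are hypothetical, and real timings also depend on memory access and sparse-format overhead.

```python
def estimated_speedup(n_rows, n_cols, density):
    """Ratio of naive to optimized operation counts under the rough model above."""
    # Naive: subtract the average row from every entry, then a dense mat-vec.
    naive_ops = 3 * n_rows * n_cols
    # Optimized: dot product r_avg @ w, one subtraction per row,
    # and a sparse mat-vec that only touches the nonzero entries.
    opt_ops = 2 * n_cols + n_rows + 2 * n_rows * n_cols * density
    return naive_ops / opt_ops


for d in (0.01, 0.1, 0.5, 1.0):
    ratio = estimated_speedup(10_000, 10_000, d)
    print(f"density={d:<5} estimated speedup={ratio:.2f}")
```

Under this model the advantage shrinks as the matrix fills in, and it approaches 1.5 for a fully dense matrix, where only the centering step is saved.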

Generate a sparse matrix using a Poisson distribution.

In [4]:
# in dense matrix format, no performance improvement
C = np.random.poisson(lmbda, dims)

# in sparse matrix format, certain operations should be faster
C_sparse = csr_matrix(C)

# average of the rows of C
r_avg = np.mean(C, axis=0)

# weighting vector w; here we use the all-ones vector
w = np.ones(K)
print(w.shape)

(10000,)


In [5]:
print("The density is around", C_sparse.count_nonzero() / N / K)

The density is around 0.09515612999999999
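The printed value is close to `lmbda` but not equal to it. For Poisson-distributed entries with rate $\lambda$, the probability that an entry is nonzero is $1 - e^{-\lambda}$, which is slightly below $\lambda$ when $\lambda$ is small. The sketch below checks this on a smaller matrix; the seed and the sample size are arbitrary choices made for the illustration.

```python
import numpy as np

lmbda = 0.1

# Probability that a Poisson(lmbda) draw is nonzero.
expected_density = 1.0 - np.exp(-lmbda)

# Empirical check on a smaller sample (seed chosen arbitrarily).
rng = np.random.default_rng(0)
sample = rng.poisson(lmbda, size=(1000, 1000))
empirical_density = np.count_nonzero(sample) / sample.size

print("expected: ", expected_density)
print("empirical:", empirical_density)
```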


## Naive implementation:

In [11]:
# The Jupyter notebook magic command, %time expr,
# prints the amount of time needed to evaluate expr
%time (C - r_avg) @ w

CPU times: user 487 ms, sys: 211 ms, total: 697 ms
Wall time: 472 ms


array([ 41.0275,   5.0275,  -4.9725, ...,  -4.9725,  41.0275, -41.9725])

## Efficient implementation:

In [28]:
# Proposed procedure: by linearity, (C - r_avg) @ w equals C @ w minus the scalar r_avg @ w,
# so the dense centered matrix is never formed and the sparse mat-vec can be used.
# The result must always match the naive implementation.
%time np.subtract(C_sparse @ w, r_avg @ w)

CPU times: user 27.9 ms, sys: 26.8 ms, total: 54.7 ms
Wall time: 53 ms


array([ 41.0275,   5.0275,  -4.9725, ...,  -4.9725,  41.0275, -41.9725])

In [29]:
gold = (C - r_avg) @ w
opt = np.subtract(C_sparse @ w, r_avg @ w)
np.allclose(gold, opt) # account for floating-point errors

True
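The same idea can be packaged as a function that also takes the average row from the sparse matrix, so that the dense `C` is not needed at all. This is a minimal sketch rather than the notebook's own code: the name `centered_matvec` and the small test matrix are assumptions made for the illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix


def centered_matvec(C_sparse, w):
    """Compute (C - average row) @ w without forming the dense centered matrix."""
    # Column-wise mean, i.e. the average row. On a sparse matrix this
    # returns a 1 x K result, so flatten it to shape (K,).
    r_avg = np.asarray(C_sparse.mean(axis=0)).ravel()
    # The sparse mat-vec only touches nonzero entries; the scalar
    # r_avg @ w is broadcast over all rows by the subtraction.
    return C_sparse @ w - r_avg @ w


# Small check against the naive dense computation (seed chosen arbitrarily).
rng = np.random.default_rng(0)
C_small = rng.poisson(0.1, size=(200, 50))
w_small = rng.standard_normal(50)

result = centered_matvec(csr_matrix(C_small), w_small)
reference = (C_small - C_small.mean(axis=0)) @ w_small
print(np.allclose(result, reference))
```

A random `w_small` is used here instead of the all-ones vector so that the check does not depend on a special choice of weights.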

In [30]:
# measured speedup: naive wall time / efficient wall time (both in ms)
472 / 53

8.90566037735849