# Collaborative Filtering

How do companies such as Amazon and Netflix choose what to recommend to users out of thousands of available products? A key technique they employ is *collaborative filtering*, which is the use of information from similar users and items to predict preference for a given item.

Suppose we have some movie-streaming data:

In [1]:
import pandas as pd

#Data
raw_data = [
            [1,1,1,0,0,0],    
            [2,1,1,0,0,1],
            [3,1,0,0,0,0],
            [4,1,0,0,0,1],
            [5,0,0,1,0,1],
            [6,0,1,0,1,0],
            ]
labels = ['customer','movie1','movie2','movie3','movie4','movie5']
data = pd.DataFrame.from_records(raw_data,columns=labels)
data

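| | customer | movie1 | movie2 | movie3 | movie4 | movie5 |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 1 | 2 | 1 | 1 | 0 | 0 | 1 |
| 2 | 3 | 1 | 0 | 0 | 0 | 0 |
| 3 | 4 | 1 | 0 | 0 | 0 | 1 |
| 4 | 5 | 0 | 0 | 1 | 0 | 1 |
| 5 | 6 | 0 | 1 | 0 | 1 | 0 |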


Choice data such as the above is called *implicit data*, because it reflects users' preferences only implicitly, through their choices. Because users may dislike some of the choices they have made, a chosen item is not necessarily a preferred item. But since it is highly unlikely that most users dislike most of their choices, the data as a whole is usually quite informative.

Data that includes actual preference is called *explicit data*. For example, the data might contain user ratings from 1 to 5:

In [2]:
import pandas as pd

#Data
raw_data_explicit = [
            [1,4,2,0,0,0],    
            [2,3,2,0,0,3],
            [3,3,0,0,0,0],
            [4,5,0,0,0,4],
            [5,0,0,3,0,1],
            [6,0,3,0,4,0],
            ]
labels = ['customer','movie1','movie2','movie3','movie4','movie5']
data_explicit = pd.DataFrame.from_records(raw_data_explicit,columns=labels)
data_explicit

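| | customer | movie1 | movie2 | movie3 | movie4 | movie5 |
|---|---|---|---|---|---|---|
| 0 | 1 | 4 | 2 | 0 | 0 | 0 |
| 1 | 2 | 3 | 2 | 0 | 0 | 3 |
| 2 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 4 | 5 | 0 | 0 | 0 | 4 |
| 4 | 5 | 0 | 0 | 3 | 0 | 1 |
| 5 | 6 | 0 | 3 | 0 | 4 | 0 |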


It is also possible that both explicit and implicit data are available. This is often the case because only a minority of users who have chosen an item will voluntarily provide ratings.

We will mostly work with implicit data, but the techniques we cover do work with both types of data in general.
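When only explicit ratings are on hand, one simple way to obtain implicit data is to treat any rating as a choice. The following is a minimal sketch, assuming that a 0 in the ratings table means "not rated"; the names `ratings` and `implicit` are illustrative only.

```python
import numpy as np

# Ratings copied from the explicit example above (customer column dropped)
ratings = np.array([
    [4, 2, 0, 0, 0],
    [3, 2, 0, 0, 3],
    [3, 0, 0, 0, 0],
    [5, 0, 0, 0, 4],
    [0, 0, 3, 0, 1],
    [0, 3, 0, 4, 0],
])

# Assumption: 0 means "no rating", so any positive rating counts as a choice
implicit = (ratings > 0).astype(int)
print(implicit)
```

For this particular data, the result coincides with the implicit table shown earlier.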

## A. Nearest Neighbor

One intuitively appealing approach is to look for other users that have similar records and see what they have chosen before. 

In the following example, we will look for the three closest neighbors to a new user and average their choices:

In [3]:
from sklearn.neighbors import NearestNeighbors
import numpy as np

#Extract data from dataframe
X = np.asarray(data[["movie1","movie2","movie3","movie4","movie5"]])

#Only consider the three nearest neighbors
nbrs = NearestNeighbors(n_neighbors=3)
nbrs.fit(X)

#New user who has only watched movie 1
x = np.array([1,0,0,0,0])

#Get neighbor and calculate average choices
neigh = nbrs.kneighbors([x],return_distance=False)
np.mean(X[neigh],axis=1)

array([[ 1.        ,  0.33333333,  0.        ,  0.        ,  0.33333333]])

Our model suggests recommending movie 2 and movie 5 to this user.
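To see what the neighbor lookup is doing, the same calculation can be sketched with plain numpy. This assumes Euclidean distance (the default metric of `NearestNeighbors`) and breaks ties by row order, which is a choice made here for illustration and may differ from scikit-learn's tie handling on other data.

```python
import numpy as np

# Implicit choice data from above (customer column dropped)
X = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
])

# New user who has only watched movie 1
x = np.array([1, 0, 0, 0, 0])

# Euclidean distance from the new user to every existing user
dist = np.sqrt(((X - x) ** 2).sum(axis=1))

# Row indices of the three closest users (stable sort keeps ties in row order)
nearest = np.argsort(dist, kind="stable")[:3]

# Average the neighbors' choices, item by item
avg_choice = X[nearest].mean(axis=0)
print(nearest, avg_choice)
```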

We can also use ```np.argsort()``` to get a list of item indices, ordered from the most to the least recommended:

In [4]:
np.argsort(-(np.mean(X[neigh],axis=1) - x))

array([[1, 4, 0, 2, 3]])

The above recommendation relies on the new user having already chosen movie 1. What if the new user has not yet tried anything? In that case, we might simply base the recommendation on the average of all users. In other words, we recommend the most popular items.
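A minimal sketch of this popularity fallback is shown below. The variable names are illustrative, and ties are broken by column order.

```python
import numpy as np

# Implicit choice data from above (customer column dropped)
X = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
])

# Share of users who chose each item
popularity = X.mean(axis=0)

# Item indices from most to least popular (stable sort keeps ties in column order)
ranking = np.argsort(-popularity, kind="stable")
print(popularity)
print(ranking)
```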

## B. Matrix Factorization

Matrix factorization assumes that choice data can be represented by a matrix multiplication of item features (usually called *factors*) and user preference on those features:

![Collaborative-Filtering-Matrix-Factorization](../images/collaborative-filtering-matrix-factorization.png)

Let $P$ be a matrix of user preference of shape $\text{no. of users} \times \text{no. of factors}$ and $Q$ a matrix of item factors of shape $\text{no. of items} \times \text{no. of factors}$. Then the choice data $X$ can be represented by
$$
X = PQ^T
$$

There are several ways to get from $X$ to $P$ and $Q$. 

### B1. Singular Value Decomposition (SVD)

Singular value decomposition performs the following factorization:

$$
X = U \Sigma V^T
$$

where $\Sigma$ is a diagonal matrix of *singular values*. To get $X=PQ^T$, we can let $P=U$ and $Q^T=\Sigma V^T$.
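As a quick sanity check of this construction, the sketch below uses the full decomposition from `np.linalg.svd` (an assumption made here for simplicity; it is not the truncated routine discussed next) and confirms that $P=U$ and $Q^T=\Sigma V^T$ multiply back to $X$.

```python
import numpy as np

# Implicit choice data from above (customer column dropped)
X = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
], dtype=float)

# Full (untruncated) SVD; s holds the singular values as a 1-D array
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# User preferences and item factors as defined in the text
P = U
Qt = np.diag(s) @ Vt

print(P.shape, Qt.shape)
print(np.allclose(P @ Qt, X))
```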

SVD can be performed by calling ```scipy.sparse.linalg.svds(X,k)```, where $1\leq k<\min{\{\text{X.shape}\}}$ is the number of singular values. Several things to note:
- ```svds()``` expects floating-point numbers, so if the data contains integers we will need to convert them to float via ```.astype(float)```.
- ```svds()``` does not return the diagonal matrix $\Sigma$; instead, it returns its diagonal values in a 1-D array. We can apply ```np.diag()``` to this array to get $\Sigma$.

Let us try running ```svds``` on our data:

In [5]:
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

P,s,V = svds(X.astype(float),k=2)
print(P.shape,s.shape,V.shape)

(6, 2) (2,) (2, 5)


The $P$ matrix represents user preference. Each row is one user and each column is the user's preference for a particular factor:

In [6]:
print(P)

[[  3.71748034e-01  -4.54503234e-01]
 [ -6.86155564e-17  -6.46068676e-01]
 [  1.02923335e-16  -2.62937792e-01]
 [ -3.71748034e-01  -4.54503234e-01]
 [ -6.01500955e-01  -2.23956027e-01]
 [  6.01500955e-01  -2.23956027e-01]]


Since the factors are automatically generated, it is hard to know what they represent without further investigation. For movies, you can imagine that one factor might represent how much action a movie contains, while another might represent how much romance it contains. In any case, for the purpose of generating recommendations we do not necessarily care what these underlying factors are, since what we want is $PQ^T$.

Now let us take a look at $Q^T$:

In [7]:
Qt = np.dot(np.diag(s),V)
print(Qt)

[[  2.69456788e-16   9.73248989e-01  -6.01500955e-01   6.01500955e-01
   -9.73248989e-01]
 [ -1.81801294e+00  -1.32452794e+00  -2.23956027e-01  -2.23956027e-01
   -1.32452794e+00]]


Each column is one item and each row is the item's exposure to a particular factor.

What is the effect of having different values of $k$? Let's try that out:

In [8]:
#k=4
k=4
P,s,V = svds(X.astype(float),k)
Qt = np.dot(np.diag(s),V)
print("k =",k)
print(np.round(np.dot(P,Qt),5))

#k=2
k=2
P,s,V = svds(X.astype(float),k)
Qt = np.dot(np.diag(s),V)
print("k =",k)
print(np.round(np.dot(P,Qt),5))


k = 4
[[ 0.98719  1.01175 -0.01753 -0.01753  0.01175]
 [ 1.12993  0.88077  0.17779  0.17779  0.88077]
 [ 0.84446  0.14274 -0.21284 -0.21284  0.14274]
 [ 0.98719  0.01175 -0.01753 -0.01753  1.01175]
 [-0.07011  0.06433  0.90407 -0.09593  1.06433]
 [-0.07011  1.06433 -0.09593  0.90407  0.06433]]
k = 2
[[ 0.82629  0.96381 -0.12182  0.3254   0.2402 ]
 [ 1.17456  0.85574  0.14469  0.14469  0.85574]
 [ 0.47802  0.34827  0.05889  0.05889  0.34827]
 [ 0.82629  0.2402   0.3254  -0.12182  0.96381]
 [ 0.40715 -0.28877  0.41196 -0.31165  0.88205]
 [ 0.40715  0.88205 -0.31165  0.41196 -0.28877]]


Note how much more closely $PQ^T$ resembles $X$ when $k$ is large. Does that mean we want a large $k$? Far from it. With a large $k$, the model reproduces the users' existing choices almost perfectly. If a user has not chosen an item before, the model will predict that she does not like the item, resulting in no recommendation. In other words, our model is overfitting the data.

As you should now realize, $k$ is the main hyperparameter that you would want to tune in SVD models.
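To make the role of $k$ concrete, the sketch below computes the in-sample reconstruction error for every possible $k$. It uses `np.linalg.svd` and keeps the $k$ largest singular values, which is an assumption made here for simplicity. A falling in-sample error says nothing about recommendation quality; it only shows how closely $PQ^T$ copies $X$.

```python
import numpy as np

# Implicit choice data from above (customer column dropped)
X = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
], dtype=float)

# np.linalg.svd returns singular values in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

errors = []
for k in range(1, len(s) + 1):
    # Rank-k reconstruction from the k largest singular values
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    errors.append(np.mean((X - X_k) ** 2))
    print("k =", k, "in-sample MSE:", round(errors[-1], 4))
```

In practice one would pick $k$ by checking recommendations on held-out choices rather than by this in-sample error.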

How do we generate recommendations for an arbitrary user, particularly one that is not already in the data? Starting from $X = PQ^T$, we can solve for $P$:

$$
X = PQ^T \\
XQ = PQ^TQ \\
XQ(Q^TQ)^{-1} = P
$$

This gives us an equation to find the preference of a particular user $u$:

$$
p_u = x_uQ(Q^TQ)^{-1}
$$

This in turn allows us to predict what the user might choose:

$$
\hat{x}_u = p_uQ^T
$$

Let $S=Q(Q^TQ)^{-1}$. Then $p_u = x_uS$:

In [9]:
#New user who has only watched movie 1
x = np.array([1,0,0,0,0])

#Generate prediction
S = np.dot(Qt.T,np.linalg.inv(np.dot(Qt,Qt.T)))
p = np.dot(x,S)
x_hat = np.dot(p,Qt)
x_hat

array([ 0.47802431,  0.34826845,  0.0588865 ,  0.0588865 ,  0.34826845])

As in the case of nearest neighbor, our model suggests that we should recommend the user to try out movie 2 and movie 5.
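Turning the predicted scores into a recommendation list usually means leaving out what the user has already watched. A minimal sketch, with the scores copied from the output above and with illustrative variable names:

```python
import numpy as np

# New user who has only watched movie 1
x = np.array([1, 0, 0, 0, 0])

# Predicted scores copied from the output above
x_hat = np.array([0.47802431, 0.34826845, 0.0588865, 0.0588865, 0.34826845])

# Push already-chosen items to the bottom of the ranking
scores = np.where(x > 0, -np.inf, x_hat)

# Item indices from highest to lowest score (ties kept in column order)
ranking = np.argsort(-scores, kind="stable")

# Keep only the items the user has not chosen yet
unseen_ranking = ranking[: int((x == 0).sum())]
print(unseen_ranking)
```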

Below is a simple class that implements an interface similar to ```scikit-learn```. I call it ```SVDarnoldi``` because ```svds``` implements the <a href="https://en.wikipedia.org/wiki/Arnoldi_iteration#Implicitly_restarted_Arnoldi_method_.28IRAM.29">Arnoldi iteration</a>.

In [10]:
class SVDarnoldi():
    def __init__(self,k=2):
        self.k = k
        
    def fit(self,X):
        P,s,V = svds(X.astype(float),k=self.k)
        Qt = np.dot(np.diag(s),V)
        self.S = np.dot(Qt.T,np.linalg.inv(np.dot(Qt,Qt.T)))
        self.Qt = Qt
        
    def predict(self,x):
        p = np.dot(x,self.S)
        x_hat = np.dot(p,self.Qt)
        return x_hat
        

With this class, it is easy to try out different values of $k$:

In [11]:
#New user who has only watched movie 1
x = np.array([1,0,0,0,0])

#Loop through
for k in range(1,5):
    print("k =",k)
    model = SVDarnoldi(k=k)
    model.fit(X)
    print(model.predict(x))

k = 1
[ 0.47802431  0.34826845  0.0588865   0.0588865   0.34826845]
k = 2
[ 0.47802431  0.34826845  0.0588865   0.0588865   0.34826845]
k = 3
[ 0.84445557  0.14273618 -0.21284164 -0.21284164  0.14273618]
k = 4
[ 0.84445557  0.14273618 -0.21284164 -0.21284164  0.14273618]


### B2. Alternating Least Squares (ALS)

For efficiency reasons, practical implementations of matrix-factorization-based filtering mostly use approximations of SVD. Here we will cover a method called *alternating least squares*. The idea is as follows: we first randomly initialize $P$ and $Q^T$, and then iteratively update these two matrices until $PQ^T \approx X$. But how should we update the matrices?

For clarity, I will be using $'$ instead of $^T$ to denote the transpose. Starting with the objective $PQ' = X$:

$$
PQ' = X \\
PQ'Q = XQ \\
P = XQ(Q'Q)^{-1}
$$

Similarly, 

$$
PQ' = X  \\
P'PQ' = P'X \\
Q' = (P'P)^{-1}P'X 
$$

So what we are going to do is to run $P = XQ(Q'Q)^{-1}$ and $Q' = (P'P)^{-1}P'X$ alternately until $PQ' \approx X$.

Note that what we are doing here is essentially running OLS repeatedly until we converge on a stable combination of $P$ and $Q'$. It is true that when we run OLS we usually have a single column vector as the dependent variable, whereas here $X$ is a matrix, but the technique is the same.
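The OLS connection can be checked directly. The sketch below takes an arbitrary random $Q'$ (an illustrative starting point, not a fitted model) and compares the closed-form update for $P$ with the solution returned by `np.linalg.lstsq`, which solves the same least-squares problem for all users at once.

```python
import numpy as np

rng = np.random.default_rng(0)

# Implicit choice data from above (customer column dropped)
X = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
], dtype=float)

k = 2
Q = rng.random((X.shape[1], k))   # items x factors, random for illustration

# Closed-form update from the text: P = X Q (Q'Q)^{-1}
P_closed = X @ Q @ np.linalg.inv(Q.T @ Q)

# Same update as a regression of each user's row of X on the item factors
P_lstsq = np.linalg.lstsq(Q, X.T, rcond=None)[0].T

print(np.allclose(P_closed, P_lstsq))
```

Solving the least-squares problem directly avoids forming an explicit inverse, which tends to be the more numerically stable choice.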

Reference:
- <a href="http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=34AEEE06F0C2428083376C26C71D7CFF?doi=10.1.1.167.5120&rep=rep1&type=pdf">Collaborative Filtering for Implicit Feedback Datasets</a>
- <a href="https://pdfs.semanticscholar.org/dbe9/d04bffb5c1df8eb721dab4f744ea81d9a4c1.pdf">Alternating Least Squares for Personalized Ranking</a>

Below is a straightforward implementation of ALS:

In [12]:
k = 2                    #Number of latent factors
min_loss_delta = 0.00001 #Minimum change in mean squared error to continue training
max_epochs = 20          #Maximum number of training rounds
    
#Initializes P and Qt with random values
user_count = X.shape[0]
item_count = X.shape[1]
P = np.random.rand(user_count,k)
Qt = np.random.rand(k,item_count)
X_hat = np.dot(P,Qt)

#Initial loss
loss_delta = 1 + min_loss_delta
loss_prev = np.mean(np.square(X - X_hat))

#Main loop
epoch = 0
while loss_delta > min_loss_delta and epoch < max_epochs:
    #Iteratively update P and Qt
    P = np.dot(X,np.dot(Qt.T,np.linalg.inv(np.dot(Qt,Qt.T))))
    Qt = np.dot(np.dot(np.linalg.inv(np.dot(P.T,P)),P.T),X)
    X_hat = np.dot(P,Qt)

    #Update loss
    loss = np.mean(np.square(X - X_hat))
    epoch = epoch + 1
    print(str(epoch).ljust(3),"loss:",round(loss,4))
    loss_delta = abs(loss - loss_prev)
    loss_prev = loss

#Prediction
S = np.dot(Qt.T,np.linalg.inv(np.dot(Qt,Qt.T)))
p = np.dot(x,S)
x_hat = np.dot(p,Qt)
print(x_hat)

1   loss: 0.1114
2   loss: 0.1026
3   loss: 0.0966
4   loss: 0.0911
5   loss: 0.087
6   loss: 0.0846
7   loss: 0.0834
8   loss: 0.0828
9   loss: 0.0825
10  loss: 0.0824
11  loss: 0.0823
12  loss: 0.0823
13  loss: 0.0823
14  loss: 0.0823
[ 0.47806141  0.35191127  0.05659474  0.06112324  0.34458401]


The predicted preference is essentially the same as what we got from a real SVD.

If we are going to use the algorithm repeatedly, however, it is best to wrap it in a self-contained class:

In [13]:
class SVDals():
    """
    Alternating Least Squares SVD
    """    
        
    def __init__(self,k=2,min_loss_delta=0.00001,max_epochs=20):
        """
        k:               Number of latent factors
        min_loss_delta:  Minimum change in mean squared error to continue training
        max_epochs:      Maximum number of training rounds
        """
        self.k = k
        self.min_loss_delta = min_loss_delta
        self.max_epochs = max_epochs
       
    def fit(self,X):
        """
        Fit the model
        X: training data
        """
        
        #Initialize model parameters
        self.initialize(X)
        loss, loss_delta = self.update_loss(X,0) 
        
        print("Training...")
        epoch = 0
        while loss_delta > self.min_loss_delta and epoch < self.max_epochs:
            #Update parameters
            self.update_params()
            
            #Update error and loss
            loss, loss_delta = self.update_loss(X,loss)
            
            #Increment counter
            epoch = epoch + 1

            #Show each round's epoch and self.error
            self._printloss(epoch,loss)
       
    def initialize(self,X):
        """
        Initializes P and Qt and stores the training data
        """
        #Keep a reference to the training data for update_params()
        self.X = X
        self.user_count = X.shape[0]
        self.item_count = X.shape[1]
        #P and Qt uniformly distributed from -0.5 to 0.5
        self.P = np.random.rand(self.user_count,self.k) - 0.5 
        self.Qt = np.random.rand(self.k,self.item_count ) - 0.5
        
    def update_params(self):
        """
        Update parameters
        """        
        #Update P and Qt using the stored training data rather than a global X
        self.P = np.dot(self.X,np.dot(self.Qt.T,np.linalg.inv(np.dot(self.Qt,self.Qt.T))))
        self.Qt = np.dot(np.dot(np.linalg.inv(np.dot(self.P.T,self.P)),self.P.T),self.X)
        self.S = np.dot(self.Qt.T,np.linalg.inv(np.dot(self.Qt,self.Qt.T)))
                
    def update_loss(self,X,loss_prev):
        """
        Update self.error and mean squared self.error
        """        
        #Generate Prediction   
        X_hat = np.dot(self.P,self.Qt)
        
        #Error matrix
        self.error = X - X_hat
        
        #loss
        loss = self._loss(X,X_hat)
        loss_delta = abs(loss_prev - loss)  
        
        return loss, loss_delta  
            
    def _loss(self,X,X_hat):
        """
        Calculate mean squared error
        """
        return np.mean(np.square(X - X_hat))
    
    def _printloss(self,epoch,loss):
        """
        Print formated loss
        """
        print(str(epoch).ljust(3),"loss:",round(loss,4))
        
    def predict(self,x):
        """
        Inference
        x: input data
        """        
        p = np.dot(x,self.S)
        x_hat = np.dot(p,self.Qt)
        return x_hat

Let us try it out:

In [14]:
model = SVDals(k=2)
model.fit(X)
print(model.predict(x))

Training...
1   loss: 0.1172
2   loss: 0.1051
3   loss: 0.0998
4   loss: 0.0941
5   loss: 0.0891
6   loss: 0.0858
7   loss: 0.084
8   loss: 0.0831
9   loss: 0.0826
10  loss: 0.0824
11  loss: 0.0823
12  loss: 0.0823
13  loss: 0.0823
14  loss: 0.0823
[ 0.47808298  0.35284263  0.05599565  0.06169033  0.34362845]


The main advantage of ALS is that it is highly parallelizable and converges very quickly, resulting in very fast training. This is in contrast to the stochastic gradient descent approach that we will cover next.

### B3. Gradient Descent

*Gradient descent* nudges parameters by an amount proportional to their contribution to the loss function. Gradient descent and its approximation, *Stochastic Gradient Descent* (SGD), are very general optimization methods, usable in all sorts of models from logistic regression to neural networks.

A simple example is as follows: Suppose our model is

$$
\hat{y} = \alpha + x
$$

As is common in regression problems, we would like to minimize the squared error. So our loss function is:

$$
c = \left( y - \hat{y} \right)^2
$$

<img src="../Images/loss-error.png" width="300">

We have an initial guess of what $\alpha$ is---often just a random number---and an initial prediction $\hat{y}_0 = \alpha_0 + x$. This prediction is likely inaccurate, which means the loss will be positive:

$$
c_0 = \left( y - \hat{y}_0 \right)^2 > 0
$$

How do we use this information to update $\alpha$? Let $\epsilon_0 = y - \hat{y}_0$. 
The marginal effect, or *gradient*, of $\alpha$ on $c$ is:

$$
\frac{\partial c}{\partial \alpha} 
= \frac{\partial}{\partial \alpha}\left( y-\hat{y} \right)^2 = -2 \epsilon
$$

Suppose the error is positive, so $\epsilon_0 = y - \hat{y}_0  > 0$. The loss function will be decreasing in $\alpha$, which makes sense---$\hat{y}$ is increasing in $\alpha$, and right now $\hat{y} < y$. We can make our model more accurate by increasing $\alpha$. Conversely, we should decrease $\alpha$ if the error is negative.

The gradient thus tells us the direction in which we need to adjust our parameter. Furthermore, the amount we need to adjust is, to a first-order approximation, proportional to

$$
- \frac{\partial c}{\partial \alpha}
$$

We therefore have the following update rule:

$$
\alpha_{t} = \alpha_{t-1} - \gamma \frac{\partial c}{\partial \alpha} \bigg\rvert_{\alpha_{t-1}}
$$

Or, as is more typical in computer science:

$$
\alpha \gets \alpha - \gamma \frac{\partial c}{\partial \alpha} 
$$

$\gamma$ is called the *learning rate*. The learning rate is usually much smaller than 1 to prevent overshooting. It can be specified manually in simple settings such as ours, but is often adjusted automatically in more advanced algorithms.
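The update rule can be traced on the one-parameter model $\hat{y} = \alpha + x$ from above. The values of `x`, `y`, the starting guess and the learning rate below are arbitrary choices made for illustration.

```python
# A single observation; the best alpha is y - x = 2
x, y = 1.0, 3.0

alpha = 0.0   # initial guess
gamma = 0.1   # learning rate

for step in range(100):
    error = y - (alpha + x)      # epsilon = y - y_hat
    gradient = -2 * error        # dc/dalpha for c = (y - y_hat)^2
    alpha = alpha - gamma * gradient

print(alpha)
```

Each step shrinks the remaining error by a constant factor, so `alpha` approaches 2 without ever overshooting at this learning rate.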

There are a couple of options when it comes to computing the gradient:
- **Batch Gradient Descent:** Update parameters with the average gradient of all samples. The advantage of this method is that the "true" gradient is used, in the sense that it reflects the overall gradient of the training data. The disadvantage is that convergence is slow, since we only update the model parameters after computing the gradient of all observations.
- **Stochastic Gradient Descent (SGD):** Update parameters with the average gradient of a subset of samples. Some updates will push the parameters in one direction while others will push them in the other direction (this is the *stochastic* part of the algorithm), but on average the parameters will move in the right direction. Besides allowing for faster convergence, the fact that the gradient is noisy also helps the model avoid local minima.

Large-scale machine learning models are typically trained on variations of SGD. Data is broken into mutually-exclusive groups called *mini-batches* and model parameters are updated with the average gradient of each mini-batch.  
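Mini-batch updates can be sketched on the same one-parameter model $\hat{y} = \alpha + x$. The data below is synthetic, and the true $\alpha$, noise level, batch size and learning rate are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + x + noise, so the true alpha is 2
n = 200
x = rng.random(n)
y = 2.0 + x + rng.normal(0, 0.1, n)

alpha = 0.0
gamma = 0.1
batch_size = 20

for epoch in range(20):
    # Shuffle, then split the data into mutually exclusive mini-batches
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        batch = order[start:start + batch_size]
        error = y[batch] - (alpha + x[batch])
        gradient = -2 * error.mean()   # average gradient over the mini-batch
        alpha = alpha - gamma * gradient

print(alpha)
```

Because the parameter is updated after every mini-batch rather than once per pass, it gets ten updates per epoch here instead of one.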

Now specifically for SVD, we have for each user $u$ and each item $i$,

$$
\hat{x}_{ui} = p_u q^T_i
$$

The loss function is:

$$
\sum_{u,i}{\left( x_{ui} - \hat{x}_{ui} \right)^2}
$$

so the gradient consists of:

$$
\frac{\partial c_{ui}}{\partial p_u} = -2 \epsilon_{ui} \cdot q_i \\
\frac{\partial c_{ui}}{\partial q_i} = -2 \epsilon_{ui} \cdot p_u
$$

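The update rules for $P$ and $Q^T$ are thus as follows, where $\mathcal{E}$ is the matrix of errors $\epsilon_{ui} = x_{ui} - \hat{x}_{ui}$ and the factor of 2 is absorbed into $\gamma$: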

$$
P \gets P + \gamma \mathcal{E} Q \\
Q^T \gets Q^T + \gamma P^T \mathcal{E}
$$
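The gradient expressions behind these update rules can be checked numerically. The sketch below compares the analytic gradients $-2\mathcal{E}Q$ and $-2\mathcal{E}^TP$ with central finite differences on small random matrices; the shapes and the helper `numeric_grad` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 3))   # 4 users, 3 items
P = rng.random((4, 2))   # user preferences
Q = rng.random((3, 2))   # item factors

def loss(P, Q):
    """Sum of squared errors over all user-item pairs."""
    return np.sum((X - P @ Q.T) ** 2)

# Analytic gradients, stacked over all users and items
E = X - P @ Q.T
grad_P = -2 * E @ Q
grad_Q = -2 * E.T @ P

def numeric_grad(f, M, h=1e-6):
    """Central finite-difference gradient of f with respect to matrix M."""
    G = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        M_plus, M_minus = M.copy(), M.copy()
        M_plus[idx] += h
        M_minus[idx] -= h
        G[idx] = (f(M_plus) - f(M_minus)) / (2 * h)
    return G

num_P = numeric_grad(lambda M: loss(M, Q), P)
num_Q = numeric_grad(lambda M: loss(P, M), Q)

print(np.allclose(grad_P, num_P, atol=1e-5))
print(np.allclose(grad_Q, num_Q, atol=1e-5))
```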

As before, here is a straightforward implementation of gradient descent SVD. Most of the code is shared with the ALS implementation; the only difference is the few lines updating $P$ and $Q^T$:

In [15]:
k = 2                    #Number of latent factors
min_loss_delta = 0.00001 #Minimum change in mean squared error to continue training
max_epochs = 200         #Maximum number of training rounds
learning_rate=0.1        #Learning rate
    
#Initializes P and Qt with random values
user_count = X.shape[0]
item_count = X.shape[1]
P = np.random.rand(user_count,k)
Qt = np.random.rand(k,item_count)
X_hat = np.dot(P,Qt)

#Initial loss
loss_delta = 1 + min_loss_delta
loss_prev = np.mean(np.square(X - X_hat))

#Main loop
epoch = 0
while loss_delta > min_loss_delta and epoch < max_epochs:
    #Iteratively update P and Qt
    error = X - X_hat
    P = P + learning_rate * (np.dot(error, Qt.T))
    Qt = Qt + learning_rate * (np.dot(P.T,error))
    X_hat = np.dot(P,Qt)
    
    #Update loss
    loss = np.mean(np.square(X - X_hat))
    epoch = epoch + 1
    print(str(epoch).ljust(3),"loss:",round(loss,4))
    loss_delta = abs(loss - loss_prev)
    loss_prev = loss

#Prediction
S = np.dot(Qt.T,np.linalg.inv(np.dot(Qt,Qt.T)))
p = np.dot(x,S)
x_hat = np.dot(p,Qt)
print(x_hat)

1   loss: 0.2349
2   loss: 0.2103
3   loss: 0.1969
4   loss: 0.1874
5   loss: 0.1797
6   loss: 0.1728
7   loss: 0.1663
8   loss: 0.1598
9   loss: 0.1532
10  loss: 0.1465
11  loss: 0.1396
12  loss: 0.1326
13  loss: 0.1257
14  loss: 0.119
15  loss: 0.1129
16  loss: 0.1075
17  loss: 0.1028
18  loss: 0.0988
19  loss: 0.0957
20  loss: 0.0931
21  loss: 0.0911
22  loss: 0.0896
23  loss: 0.0883
24  loss: 0.0873
25  loss: 0.0865
26  loss: 0.0859
27  loss: 0.0854
28  loss: 0.085
29  loss: 0.0846
30  loss: 0.0843
31  loss: 0.0841
32  loss: 0.0839
33  loss: 0.0837
34  loss: 0.0836
35  loss: 0.0834
36  loss: 0.0833
37  loss: 0.0832
38  loss: 0.0832
39  loss: 0.0831
40  loss: 0.083
41  loss: 0.083
42  loss: 0.0829
43  loss: 0.0829
44  loss: 0.0828
45  loss: 0.0828
46  loss: 0.0827
47  loss: 0.0827
48  loss: 0.0827
49  loss: 0.0827
50  loss: 0.0826
51  loss: 0.0826
52  loss: 0.0826
53  loss: 0.0826
54  loss: 0.0825
55  loss: 0.0825
56  loss: 0.0825
57  loss: 0.0825
58  loss: 0.0825
59  loss: 0.0825
...

Note how many more epochs it takes for gradient descent to converge in contrast to ALS. Can we speed things up by setting a higher learning rate? If we try different learning rates, we will see that having too high a learning rate would result in constant overshooting, and as a result no convergence.
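The overshooting effect is easiest to see on the one-parameter model $\hat{y} = \alpha + x$ from earlier, rather than on the full factorization. In the sketch below, the function `final_error` and the chosen learning rates are illustrative assumptions.

```python
def final_error(gamma, steps=50):
    """Run gradient descent on y_hat = alpha + x and return the final |error|."""
    x, y = 1.0, 3.0
    alpha = 0.0
    for _ in range(steps):
        error = y - (alpha + x)
        alpha = alpha + gamma * 2 * error   # alpha <- alpha - gamma * dc/dalpha
    return abs(y - (alpha + x))

for gamma in (0.1, 0.5, 1.1):
    print("gamma =", gamma, "final |error|:", final_error(gamma))
```

In this toy model each step multiplies the error by $1 - 2\gamma$, so any learning rate above 1 makes the error grow in magnitude at every step instead of shrinking.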

Below is a class implementing gradient descent SVD:

In [16]:
class SVDgd():
    """
    Gradient Descent SVD
    """
        
    def __init__(self,k=2,
                 min_loss_delta=0.00001,
                 max_epochs=200,
                 learning_rate=0.1,
                 show_progress=True
                ):
        """
        k:               Number of latent factors
        min_loss_delta:   Minimum change in mean squared error to continue training
        max_epochs:      Maximum number of training rounds
        learning_rate:   Learning rate
        show_progress:   Print error of each epoch
        """
        self.k = k
        self.min_loss_delta = min_loss_delta
        self.max_epochs = max_epochs
        self.learning_rate = learning_rate
        self.show_progress = show_progress
        
    def fit(self,X):
        """
        Fit the  model
        X: training data
        """
        
        #Initialize model parameters
        self.initialize(X)
        loss, loss_delta = self.update_loss(X,0) 
        
        print("Training...")
        epoch = 0
        while loss_delta > self.min_loss_delta and epoch < self.max_epochs:
            #Update parameters
            self.update_params()
            
            #Update error and loss
            loss, loss_delta = self.update_loss(X,loss)
            
            #Increment counter
            epoch = epoch + 1
            
            if self.show_progress:
                #Show each round's epoch and self.error
                self._printloss(epoch,loss)
                
        if not self.show_progress:
            #Show the final epoch and self.error
            self._printloss(epoch,loss)
       
    def initialize(self,X):
        """
        Initializes P and Qt
        """
        self.user_count = X.shape[0]
        self.item_count = X.shape[1]
        #P and Qt uniformly distributed from -0.5 to 0.5
        self.P = np.random.rand(self.user_count,self.k) - 0.5 
        self.Qt = np.random.rand(self.k,self.item_count ) - 0.5
        
    def update_params(self):
        """
        Update parameters
        """        
        #Update P and Qt with previous epoch's loss
        self.P = self.P + self.learning_rate * (np.dot(self.error, self.Qt.T))
        self.Qt = self.Qt + self.learning_rate * (np.dot(self.P.T,self.error))
        self.S = np.dot(self.Qt.T,np.linalg.inv(np.dot(self.Qt,self.Qt.T)))
                
    def update_loss(self,X,loss_prev):
        """
        Update self.error and mean squared self.error
        """        
        #Generate Prediction   
        X_hat = np.dot(self.P,self.Qt)
        
        #Error matrix
        self.error = X - X_hat
        
        #loss
        loss = self._loss(X,X_hat)
        loss_delta = abs(loss_prev - loss)  
        
        return loss, loss_delta  
            
    def _loss(self,X,X_hat):
        """
        Calculate mean squared error
        """
        return np.mean(np.square(X - X_hat))
    
    def _printloss(self,epoch,loss):
        """
        Print formated loss
        """
        print(str(epoch).ljust(3),"loss:",round(loss,4))
        
    def predict(self,x):
        """
        Inference
        x: input data
        """        
        p = np.dot(x,self.S)
        x_hat = np.dot(p,self.Qt)
        return x_hat

As before, let us try out the algorithm:

In [17]:
model = SVDgd(k=2)
model.fit(X)
print(model.predict(x))

Training...
1   loss: 0.3877
2   loss: 0.3761
3   loss: 0.3641
4   loss: 0.3505
5   loss: 0.3343
6   loss: 0.3152
7   loss: 0.2937
8   loss: 0.2712
9   loss: 0.2494
10  loss: 0.2301
11  loss: 0.2141
12  loss: 0.2015
13  loss: 0.1917
14  loss: 0.1839
15  loss: 0.1775
16  loss: 0.1719
17  loss: 0.1667
18  loss: 0.1617
19  loss: 0.1566
20  loss: 0.1511
21  loss: 0.1452
22  loss: 0.1389
23  loss: 0.1322
24  loss: 0.1254
25  loss: 0.1186
26  loss: 0.1123
27  loss: 0.1066
28  loss: 0.1018
29  loss: 0.0978
30  loss: 0.0947
31  loss: 0.0922
32  loss: 0.0904
33  loss: 0.089
34  loss: 0.0879
35  loss: 0.0871
36  loss: 0.0864
37  loss: 0.0859
38  loss: 0.0855
39  loss: 0.0851
40  loss: 0.0848
41  loss: 0.0846
42  loss: 0.0843
43  loss: 0.0842
44  loss: 0.084
45  loss: 0.0839
46  loss: 0.0837
47  loss: 0.0836
48  loss: 0.0835
49  loss: 0.0834
50  loss: 0.0833
51  loss: 0.0833
52  loss: 0.0832
53  loss: 0.0831
54  loss: 0.0831
55  loss: 0.083
56  loss: 0.083
57  loss: 0.0829
58  loss: 0.0829
...

### B4. Simon Funk's SVD

This is a method popularized during the Netflix Prize, and it is usually what people refer to when they mention "SVD" in the context of collaborative filtering.

The prediction of the model is given by:
$$
\hat{x}_{ui} = \mu + b_u + b_i + p_u q^T_i
$$

$\mu$, $b_u$ and $b_i$ are called *biases* in machine learning, but a more familiar name for economists would be dummy variables. So Funk's SVD is essentially SVD with fixed effects.
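The prediction formula can be evaluated for all users and items at once. The sketch below relies on numpy broadcasting to add the user and item biases to every row and column; the parameter values are random placeholders, not fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
user_count, item_count, k = 6, 5, 2

# Placeholder parameters, for illustration only
mu = 0.4                              # global mean
bu = rng.random((user_count, 1))      # user biases, one per row
bi = rng.random((item_count, 1))      # item biases, one per column
P = rng.random((user_count, k))       # user preferences
Qt = rng.random((k, item_count))      # item factors

# Broadcasting adds bu down the rows and bi.T across the columns
X_hat = mu + bu + bi.T + P @ Qt

# The same matrix built with explicit vectors of ones
ones_i = np.ones((1, item_count))
ones_u = np.ones((user_count, 1))
X_hat_ones = mu + bu @ ones_i + ones_u @ bi.T + P @ Qt

print(X_hat.shape, np.allclose(X_hat, X_hat_ones))
```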

The model is also regularized, so the loss function is:
$$
\sum_{u,i}{\left( x_{ui} - \hat{x}_{ui} \right)^2 
+ \alpha \left( b_u^2 + b_i^2 + \lVert p_u \rVert^2 + \lVert q_i \rVert^2  \right) }
$$
where $\alpha$ is the strength of regularization. 
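Differentiating the term of this loss that belongs to a single pair $(u, i)$ gives one gradient descent update per parameter. As a sketch, with $\epsilon_{ui} = x_{ui} - \hat{x}_{ui}$, learning rate $\gamma$, and the factor of 2 absorbed into $\gamma$ (per-observation form; averaging over users or items is an implementation choice):

$$
\mu \gets \mu + \gamma \, \epsilon_{ui} \\
b_u \gets b_u + \gamma \left( \epsilon_{ui} - \alpha b_u \right) \\
b_i \gets b_i + \gamma \left( \epsilon_{ui} - \alpha b_i \right) \\
p_u \gets p_u + \gamma \left( \epsilon_{ui} q_i - \alpha p_u \right) \\
q_i \gets q_i + \gamma \left( \epsilon_{ui} p_u - \alpha q_i \right)
$$

The global mean $\mu$ has no penalty term because it does not appear in the regularization part of the loss.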

Reference:
- <a href="http://sifter.org/~simon/journal/20061211.html">Netflix Update: Try This at Home</a>
 
Here is the list of changes we have to make relative to the simple gradient descent implementation:
- We need to add four variables: three representing the biases (```mu```, ```bu```, ```bi```) and one for the strength of regularization (```alpha```). The biases need to be updated in our main loop.
- The model's prediction needs to be updated to include the biases.
- The loss function needs to be updated to include regularization.
- Since we have more parameters to estimate, we will increase the maximum number of epochs to give the model more time to train.

In [18]:
#Simon Funk's SVD
k = 2                    #Number of latent factors
min_loss_delta = 0.00001 #Minimum change in mean squared error to continue training
max_epochs = 800         #Maximum number of training rounds
learning_rate=0.1        #Learning rate
alpha=0.05               #Regularization strength
    
#Initializes P and Qt with random values
user_count = X.shape[0]
item_count = X.shape[1]

#Original SVD matrices
P = np.random.rand(user_count,k)
Qt = np.random.rand(k,item_count)

#Biases
mu = np.mean(X)
bu = np.random.rand(user_count,1) 
bi = np.random.rand(item_count,1)

#Vectors of 1
ones_i = np.ones((1,item_count))
ones_u = np.ones((user_count,1))   

X_hat = (mu + np.dot(bu,ones_i)
         + np.dot(ones_u,bi.T) 
         + np.dot(P,Qt))  

#Initial loss and error
error = X - X_hat

loss_delta = 1 + min_loss_delta
loss_prev = np.mean(np.square(X - X_hat)) + alpha * (
                np.mean(bu**2)
                + np.mean(bi**2)
                + np.mean(P**2)
                + np.mean(Qt**2)
                )

#Main loop
epoch = 0
while loss_delta > min_loss_delta and epoch < max_epochs:
    #Iteratively update P and Qt, with the regularization term from the loss
    P = P + learning_rate * (np.dot(error, Qt.T) - alpha * P)
    Qt = Qt + learning_rate * (np.dot(P.T,error) - alpha * Qt)

    #Update biases
    mu = mu + learning_rate * np.mean(error)
    bu = bu + learning_rate * (
         np.mean(error,axis=1).reshape(user_count,1)
         - alpha * bu
         )
    bi = bi + learning_rate * (
         np.mean(error,axis=0).reshape(item_count,1)
         - alpha * bi
         )
    
    X_hat = (mu + np.dot(bu,ones_i) 
             + np.dot(ones_u,bi.T) 
             + np.dot(P,Qt))  
    
    #Update loss and error
    error = X - X_hat
    loss = np.mean(np.square(X - X_hat)) + alpha * (
                np.mean(bu**2)
                + np.mean(bi**2)
                + np.mean(P**2)
                + np.mean(Qt**2)
                )
    epoch = epoch + 1
    print(str(epoch).ljust(3),"loss:",round(loss,4))
    loss_delta = abs(loss - loss_prev)
    loss_prev = loss

#Prediction
S = np.dot(Qt.T,np.linalg.inv(np.dot(Qt,Qt.T)))
p = np.dot(x,S)
x_hat = mu + bi.T + np.dot(p,Qt) #user bias is zero for new user
print(x_hat)

1   loss: 0.7098
2   loss: 0.419
3   loss: 0.311
4   loss: 0.258
5   loss: 0.2262
6   loss: 0.2048
7   loss: 0.1894
8   loss: 0.1782
9   loss: 0.1696
10  loss: 0.1627
11  loss: 0.1569
12  loss: 0.1517
13  loss: 0.1469
14  loss: 0.1425
15  loss: 0.1384
16  loss: 0.1346
17  loss: 0.1311
18  loss: 0.1279
19  loss: 0.125
20  loss: 0.1224
21  loss: 0.1199
22  loss: 0.1177
23  loss: 0.1155
24  loss: 0.1135
25  loss: 0.1115
26  loss: 0.1096
27  loss: 0.1077
28  loss: 0.1059
29  loss: 0.104
30  loss: 0.1022
31  loss: 0.1004
32  loss: 0.0986
33  loss: 0.0969
34  loss: 0.0952
35  loss: 0.0935
36  loss: 0.0919
37  loss: 0.0904
38  loss: 0.0889
39  loss: 0.0875
40  loss: 0.0862
41  loss: 0.085
42  loss: 0.0838
43  loss: 0.0827
44  loss: 0.0817
45  loss: 0.0808
46  loss: 0.0799
47  loss: 0.0792
48  loss: 0.0784
49  loss: 0.0778
50  loss: 0.0772
51  loss: 0.0766
52  loss: 0.0761
53  loss: 0.0756
54  loss: 0.0752
55  loss: 0.0748
56  loss: 0.0745
57  loss: 0.0741
58  loss: 0.0738
59  loss: 0.0736
...

Below we extend the ```SVDgd``` class to create a new ```SVDfunk``` class:

In [19]:
class SVDfunk(SVDgd):
    """
    Simon Funk's SVD. Regularized.
    """
    
    def __init__(self,alpha=0,max_epochs=500,*args,**kargs):
        """
        alpha: regularization strength
        """    
        self.alpha = alpha
        #pass other arguments to parent class
        super().__init__(max_epochs=max_epochs,*args,**kargs)
      
    def initialize(self,X):
        """
        Initializes biases, P and Qt
        """
        super().initialize(X) #Use parent class to initialize P and Qt
        self.mu = np.mean(X)
        self.bu = np.random.rand(self.user_count,1)
        self.bi = np.random.rand(self.item_count,1)
        
    def update_params(self):
        """
        Update parameters
        """        
        #Update biases, P and Qt using the previous epoch's error
        self.mu = self.mu + self.learning_rate * np.mean(self.error)
        self.bu = self.bu + self.learning_rate * (
                np.mean(self.error,axis=1).reshape(self.user_count,1)
                - self.alpha * self.bu
                )
        self.bi = self.bi + self.learning_rate * (
                np.mean(self.error,axis=0).reshape(self.item_count,1)
                - self.alpha * self.bi
                )
        self.P = self.P + self.learning_rate * (
                np.dot(self.error, self.Qt.T)
                - self.alpha * self.P
                )
        self.Qt = self.Qt + self.learning_rate * (
                np.dot(self.P.T,self.error)
                - self.alpha * self.Qt
                )
        self.S = np.dot(self.Qt.T,np.linalg.inv(np.dot(self.Qt,self.Qt.T)))
        
    def update_loss(self,X,loss_prev):
        """
        Update self.error and return the regularized loss and its change
        """        
        #Generate Prediction
        ones_i = np.ones((1,self.item_count))
        ones_u = np.ones((self.user_count,1))        
        X_hat = (self.mu + np.dot(self.bu,ones_i) 
                 + np.dot(ones_u,self.bi.T) 
                 + np.dot(self.P,self.Qt))  
        
        #Error matrix
        self.error = X - X_hat
        
        #Regularized loss
        loss = self._loss(X,X_hat) + self.alpha * (
                np.mean(self.bu**2)
                + np.mean(self.bi**2)
                + np.mean(self.P**2)
                + np.mean(self.Qt**2)
                )
        loss_delta = abs(loss_prev - loss)  
        
        return loss, loss_delta  

    def predict(self,x):
        """
        Inference
        x: input data
        """        
        p = np.dot(x,self.S)
        x_hat = self.mu + self.bi.T + np.dot(p,self.Qt) #user bias is zero for new user
        return x_hat    
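
In `update_loss`, the products with `ones_i` and `ones_u` tile the user and item biases across the whole matrix, so that entry (u, i) of the prediction is `mu + bu[u] + bi[i] + P[u] · Qt[:, i]`. As a minimal sketch with random stand-in parameters (none of these values come from the trained model), the same matrix can be obtained with NumPy broadcasting:

```python
import numpy as np

rng = np.random.default_rng(0)
user_count, item_count, k = 6, 5, 2

# Stand-in parameters with the same shapes as in SVDfunk
mu = 0.3
bu = rng.random((user_count, 1))
bi = rng.random((item_count, 1))
P = rng.random((user_count, k))
Qt = rng.random((k, item_count))

# Explicit tiling with vectors of ones, as in update_loss
ones_i = np.ones((1, item_count))
ones_u = np.ones((user_count, 1))
X_hat_explicit = (mu + np.dot(bu, ones_i)
                  + np.dot(ones_u, bi.T)
                  + np.dot(P, Qt))

# Broadcasting: (users,1) + (1,items) + (users,items)
X_hat_broadcast = mu + bu + bi.T + np.dot(P, Qt)
print(np.allclose(X_hat_explicit, X_hat_broadcast))

# A single entry, written out term by term
u, i = 2, 3
entry = mu + bu[u, 0] + bi[i, 0] + np.dot(P[u], Qt[:, i])
print(np.isclose(X_hat_explicit[u, i], entry))
```

The explicit form mirrors the matrix notation, while the broadcast form avoids allocating the two vectors of ones.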

Now let us try out the model:

In [20]:
model = SVDfunk(k=2,alpha=0.05,show_progress=False)
model.fit(X)
print(model.predict(x))

Training...
244 loss: 0.0636
[[ 1.02129603  0.50273299 -0.18212007 -0.07692987  0.33253277]]


At first sight, there seems to be little difference between ```SVDfunk``` and ```SVDgd```. A closer look at the mean-squared errors of the two models, however, makes it clear that ```SVDfunk``` performs better on this particular metric.

In [21]:
#SGD
model = SVDgd(k=2,show_progress=False)
model.fit(X)
print(model.predict(x))

#Funk's
model = SVDfunk(k=2,show_progress=False)
model.fit(X)
print(model.predict(x))

Training...
157 loss: 0.0824
[ 0.4794837   0.32214011  0.073204    0.04215209  0.37238314]
Training...
95  loss: 0.0329
[[ 0.72632976  0.80918189 -0.13537655  0.30472425  0.10386051]]
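
Note that `SVDfunk` is created with its default `alpha=0` in the comparison above, so the penalty term vanishes and the printed loss is a plain mean squared error, assuming `_loss` in the parent class computes the mean squared error. The sketch below uses random stand-in parameters and a hypothetical helper `regularized_loss` to illustrate how the regularized loss reduces to the MSE when `alpha` is zero:

```python
import numpy as np

def regularized_loss(X, X_hat, bu, bi, P, Qt, alpha):
    """MSE plus alpha times the mean squared size of each parameter block."""
    mse = np.mean(np.square(X - X_hat))
    penalty = (np.mean(bu**2) + np.mean(bi**2)
               + np.mean(P**2) + np.mean(Qt**2))
    return mse + alpha * penalty

rng = np.random.default_rng(0)

# Stand-in implicit data and parameters (not the trained model)
X = rng.integers(0, 2, size=(6, 5)).astype(float)
bu, bi = rng.random((6, 1)), rng.random((5, 1))
P, Qt = rng.random((6, 2)), rng.random((2, 5))
X_hat = np.mean(X) + bu + bi.T + np.dot(P, Qt)

mse = np.mean(np.square(X - X_hat))
loss_alpha0 = regularized_loss(X, X_hat, bu, bi, P, Qt, alpha=0)
loss_alpha005 = regularized_loss(X, X_hat, bu, bi, P, Qt, alpha=0.05)

print(np.isclose(loss_alpha0, mse))  # no penalty when alpha is 0
print(loss_alpha005 > loss_alpha0)   # the penalty only ever adds to the loss
```

With a non-zero `alpha`, the reported loss includes the penalty and is no longer directly comparable to an unregularized model's mean squared error.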


```SVDfunk``` holds a similar lead on explicit data:

In [23]:
#Extract data from dataframe
X2 = np.asarray(data_explicit[["movie1","movie2","movie3","movie4","movie5"]])

#New user who rated movie1 with a 3
x2 = np.array([3,0,0,0,0])

#SGD
model = SVDgd(k=2,show_progress=False)
model.fit(X2)
print(model.predict(x2))

#Funk's
model = SVDfunk(k=2,show_progress=False)
model.fit(X2)
print(model.predict(x2))

Training...
34  loss: 0.6965
[ 2.17086641  0.35714357  0.06248979 -0.17769563  1.27941628]
Training...
61  loss: 0.3174
[[ 3.37003186  1.39519828 -0.29970494  0.61011098  1.38816256]]
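
To turn predicted scores into recommendations, one simple option is to rank the items the user has not rated yet by their predicted score. The sketch below does this for the `SVDfunk` prediction printed above, copying the numbers from that output. It assumes that a zero in `x2` means "not rated", and the `items` array is introduced here purely for illustration.

```python
import numpy as np

items = np.array(["movie1", "movie2", "movie3", "movie4", "movie5"])

# New user who rated movie1 with a 3
x2 = np.array([3, 0, 0, 0, 0])

# SVDfunk prediction copied from the output above
x_hat = np.array([[3.37003186, 1.39519828, -0.29970494,
                   0.61011098, 1.38816256]])

scores = x_hat.ravel()
unseen = x2 == 0                     # assumption: 0 means "not rated"
order = np.argsort(-scores[unseen])  # highest predicted score first
ranked = items[unseen][order].tolist()
print(ranked)  # ['movie2', 'movie5', 'movie4', 'movie3']
```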


## Conclusion

We have gone through a few different methods of generating recommendations based on existing records. 

If you are interested in collaborative filtering, be sure to check out the Netflix Prize. The data is <a href="https://www.kaggle.com/netflix-inc/netflix-prize-data">available on Kaggle</a>, and you can find the winning team's research papers <a href="https://netflixprize.com/community/topic_1537.html">here</a>. One thing you will find is that the winning teams all employ an ensemble of models, which very often performs better than any single model.