<small><i>This notebook was create by Franck Iutzeler, Jerome Malick and Yann Vernaz (2016).</i></small>
<!-- Credit (images) Jeffrey Keating Thompson. -->

<center><img src="UGA.png" width="30%" height="30%"></center>
<center><h3>Master of Science in Industrial and Applied Mathematics (MSIAM)</h3></center>
<hr>
<center><h1>Convex and distributed optimization</h1></center>
<center><h2>Part III - Recommender Systems (3h + 3h home work)</h2></center>

# Outline

In this Lab, we will investigate some gradient-based algorithms on the very well known matrix factorization problem which is the most prominent approach for build a _Recommender Systems_.

Our goal is to implement Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent in Spark.

# Problem Formulation

The problem of matrix factorization for collaborative filtering captured much attention, especially after the [Netflix prize](https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf). The premise behind this approach is to approximate a large rating matrix $R$ with the multiplication of two low-dimensional factor matrices $P$ and $Q$, i.e. $R \approx \hat{R} = P^TQ$, that model respectively users and items in some latent space. For instance, matrix $R$ has dimension $m \times  n$ where $m$ and $n$ are restrictively the number of users and items, both large; while $P$ has size $m \times  k$ and contains user information in a latent space of size $k<<m,n$, $Q$ has size $n\times k$ and contains item information in the same latent space of size $k << m,n$. Typical values for $m, n$ are $10^6$ while $k$ is in the tens.

For a pair of user and item $(u_i,i_j)$ for which a rating $r_{ij}$ exists, a common approach approach is based on the minimization of the $\ell_2$-regularized quadratic error:
$$  \ell_{u_i,i_j}(P,Q)= \left(r_{ij} - p_{i}^{\top}q_{j}\right)^2 + \lambda(|| p_{i} ||^{2} + || q_{j} ||^2 )  $$
where $p_i$ is the column vector composed of the $i$-th line of $P$ and  $\lambda\geq 0$ is a regularization parameter. The whole matrix factorization problem thus writes
$$ \min_{P,Q} \sum_{i,j : r_{ij} \text{exists}}  \ell_{u_i,i_j}(P,Q). $$
Note that the error $ \ell_{u_i,i_j}(P,Q)$ depends only on $P$ and $Q$ through $p_{i}$ and $q_{j}$; however, item $i_j$ may also be rated by user $u_{i'}$ so that the optimal factor $q_{j}$ depends on both $p_{i}$ and $p_{i'}$.

In [2]:
# set up spark environment (Using Spark Local Mode)
from pyspark import SparkContext, SparkConf

conf = SparkConf()
conf.setMaster("local[*]")
conf.setAppName("MSIAM part III - Matrix Factorization")

sc = SparkContext(conf = conf)

We remind you that you can access this interface by simply opening http://localhost:4040 in a web browser.

We will capitalize on the first lab and take the MovieLens dataset, and thus the RDD routines we already have.

In [3]:
def parseRating(line):
    fields = line.split('::')
    return int(fields[0]), int(fields[1]), float(fields[2])

def parseMovie(line):
    fields = line.split("::")
    return int(fields[0]), fields[1], fields[2]

# path to MovieLens dataset
movieLensHomeDir="data/movielens/medium/"

# ratings is an RDD of (userID, movieID, rating)
ratingsRDD = sc.textFile(movieLensHomeDir + "ratings.dat").map(parseRating).setName("ratings").cache()

numRatings = ratingsRDD.count()
numUsers = ratingsRDD.map(lambda r: r[0]).distinct().count()
numMovies = ratingsRDD.map(lambda r: r[1]).distinct().count()
print("We have %d ratings from %d users on %d movies.\n" % (numRatings, numUsers, numMovies))

M = ratingsRDD.map(lambda r: r[0]).max()
N = ratingsRDD.map(lambda r: r[1]).max()
matrixSparsity = float(numRatings)/float(M*N)
print("We have %d users, %d movies and the rating matrix has %f percent of non-zero value.\n" % (M, N, 100*matrixSparsity))

We have 1000209 ratings from 6040 users on 3706 movies.

We have 6040 users, 3952 movies and the rating matrix has 4.190221 percent of non-zero value.



#  Gradient Descent Algorithms

The goal here is to 
1. Compute gradients of the loss functions.
2. Implement gradient algorithms.
3. Observe the prediction accuracy of the developed methods.

__Question 1__

> Split (ramdomly) the dataset into training versus testing sample. We learn over 70% (for example) of the users, we test over the rest.

> Define a routine that returns the predicted rating from factor matrices. Form a RDD with the following elements `(i,j,true rating,predicted rating)`. 

> Define a routine that returns the Mean Square Error (MSE).


In [4]:
import numpy as np

trainingSample, testingSample = ratingsRDD.randomSplit([70, 30])
print("The number of examples in the dataset : ",ratingsRDD.count())
print("The number of examples in the training dataset : ",ratingsRDD.count())
print("The number of examples in the testing dataset : ",ratingsRDD.count())

moviesRDD = sc.textFile(movieLensHomeDir + "movies.dat").map(parseMovie).setName("movies").cache()

P = np.asmatrix(ratingsRDD.collect())
print("P contains user information in a latent spaceof size 3.\n",P)
Q = np.asmatrix(moviesRDD.take(5))
print("Q contains item information in the same latent space of size 3.",Q)

moviesDistinct = moviesRDD.distinct().count()
print(moviesDistinct)

The number of examples in the dataset :  1000209
The number of examples in the training dataset :  1000209
The number of examples in the testing dataset :  1000209
P contains user information in a latent spaceof size 3.
 [[  1.00000000e+00   1.19300000e+03   5.00000000e+00]
 [  1.00000000e+00   6.61000000e+02   3.00000000e+00]
 [  1.00000000e+00   9.14000000e+02   3.00000000e+00]
 ..., 
 [  6.04000000e+03   5.62000000e+02   5.00000000e+00]
 [  6.04000000e+03   1.09600000e+03   4.00000000e+00]
 [  6.04000000e+03   1.09700000e+03   4.00000000e+00]]
Q contains item information in the same latent space of size 3. [['1' 'Toy Story (1995)' "Animation|Children's|Comedy"]
 ['2' 'Jumanji (1995)' "Adventure|Children's|Fantasy"]
 ['3' 'Grumpier Old Men (1995)' 'Comedy|Romance']
 ['4' 'Waiting to Exhale (1995)' 'Comedy|Drama']
 ['5' 'Father of the Bride Part II (1995)' 'Comedy']]
3883


__Question 2__

> Derive the update rules for gradient descent. 

> Implement a (full) gradient algorithm in `Python` on the training set.  Take a step size (learning rate) $\gamma=0.001$ and stop after a specified number of iterations. Investigate the latent space size (e.g. $K=2,5,10,50$).

> Provide plots and explanations for your experiments. 

> Try to parrallelize it so that the code can be run using `PySpark`. What do you conclude?

Stochastic Gradient Descent (SGD) simply does away with the expectation in the update and computes the gradient of the parameters using only a single or a few training examples. In SGD the learning rate $\gamma$ is typically much smaller than a corresponding learning rate in batch gradient descent because there is much more variance in the update.

__Question 3__
> Implement stochastic gradient descent algorithm for Matrix Factorization.

> Provide plots and explanations for your experiments.

> Compare and discuss the results with the (full) gradient algorithm in terms of MSE versus full data passes.

> Discuss the stepsize choice of SGD (e.g. constant v.s. 1/`nb_iter`).

Now we will implement Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent (DSGD) in Spark. 
The algorithm is described in the following article: <br \><br \>
_Gemulla, R., Nijkamp, E., Haas, P. J., & Sismanis, Y. (2011). Large-scale matrix factorization with distributed stochastic gradient descent. New York, USA._<br \><br \>
The paper sets forth a solution for matrix factorization using minimization of sum of local losses.  The solution involves dividing the matrix into strata for each iteration and performing sequential stochastic gradient descent within each stratum in parallel.  DSGD is a fully distributed algorithm, i.e. both the data matrix $R$ and factor matrices $P$ and $Q$ can be carefully split and distributed to multiple workers for parallel computation without communication costs between the workers. Hence, it is a good match for implementation in a distributed in-memory data processing system like Spark. 

__Question 4__

> Implement a `PySpark` version of DSGD.

> Test on different number of cores on a local machine (1 core, 2 cores, 4 cores). Ran the ALS method already implemented in MLlib as a reference for comparison.