<small><i>This notebook was created by Franck Iutzeler, Jerome Malick and Yann Vernaz (2016).</i></small>
<!-- Credit (images) Jeffrey Keating Thompson. -->

<center><img src="UGA.png" width="30%" height="30%"></center>
<center><h3>Master of Science in Industrial and Applied Mathematics (MSIAM)</h3></center>
<hr>
<center><h1>Convex and distributed optimization</h1></center>
<center><h2>Part III - Recommender Systems (3h + 3h homework)</h2></center>

# Outline

In this lab, we investigate gradient-based algorithms for the well-known matrix factorization problem, which is the most prominent approach for building _Recommender Systems_.

Our goal is to implement Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent in Spark.

# Problem Formulation

The problem of matrix factorization for collaborative filtering captured much attention, especially after the [Netflix prize](https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf). The premise behind this approach is to approximate a large rating matrix $R$ by the product of two low-dimensional factor matrices $P$ and $Q$, i.e. $R \approx \hat{R} = PQ^{\top}$, that model users and items, respectively, in some latent space. Specifically, $R$ has dimension $m \times n$, where $m$ and $n$ are respectively the numbers of users and items, both large; $P$ has size $m \times k$ and contains user information in a latent space of size $k \ll m,n$; and $Q$ has size $n \times k$ and contains item information in the same latent space. Typical values for $m, n$ are $10^6$ while $k$ is in the tens.

For a user-item pair $(u_i,i_j)$ for which a rating $r_{ij}$ exists, a common approach is based on the minimization of the $\ell_2$-regularized quadratic error:
$$  \ell_{u_i,i_j}(P,Q)= \left(r_{ij} - p_{i}^{\top}q_{j}\right)^2 + \lambda(\| p_{i} \|^{2} + \| q_{j} \|^2 )  $$
where $p_i$ is the column vector formed from the $i$-th row of $P$, $q_j$ is the column vector formed from the $j$-th row of $Q$, and $\lambda\geq 0$ is a regularization parameter. The whole matrix factorization problem thus reads
$$ \min_{P,Q} \sum_{i,j : r_{ij} \text{ exists}}  \ell_{u_i,i_j}(P,Q). $$
Note that the error $\ell_{u_i,i_j}(P,Q)$ depends on $P$ and $Q$ only through $p_{i}$ and $q_{j}$; however, item $i_j$ may also be rated by another user $u_{i'}$, so that the optimal factor $q_{j}$ depends on both $p_{i}$ and $p_{i'}$.
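
To make the objective concrete, here is a minimal NumPy sketch that evaluates the per-rating loss and the total objective on a toy example. The function names (`entry_loss`, `total_loss`), the 0-based indices and the toy values are assumptions made for illustration; they are not part of the lab's code.

```python
import numpy as np

def entry_loss(P, Q, i, j, r_ij, lam):
    """Regularized squared error for one observed rating r_ij."""
    p_i, q_j = P[i], Q[j]                  # i-th row of P, j-th row of Q
    err = r_ij - p_i @ q_j
    return err ** 2 + lam * (p_i @ p_i + q_j @ q_j)

def total_loss(P, Q, ratings, lam):
    """Sum of the per-rating losses over the observed ratings only."""
    return sum(entry_loss(P, Q, i, j, r, lam) for i, j, r in ratings)

# Toy data (illustrative values): 3 users, 2 items, latent dimension k = 2
P = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # shape (m, k)
Q = np.array([[2.0, 1.0], [0.0, 3.0]])               # shape (n, k)
ratings = [(0, 0, 3.0), (2, 1, 4.0)]                 # (user index, item index, rating)

loss = total_loss(P, Q, ratings, lam=0.1)
print("Total regularized loss:", loss)
```

Only the rows of $P$ and $Q$ that appear in an observed rating contribute to the sum, which is what makes stochastic and block-wise updates possible later on.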

In [2]:
# set up spark environment (Using Spark Local Mode)
from pyspark import SparkContext, SparkConf

conf = SparkConf()
conf.setMaster("local[*]")
conf.setAppName("MSIAM part III - Matrix Factorization")

# getOrCreate reuses a running context, so re-running this cell does not raise an error
sc = SparkContext.getOrCreate(conf=conf)

As a reminder, you can access the Spark web interface by opening http://localhost:4040 in a web browser.

We build on the first lab: we reuse the MovieLens dataset and the RDD routines we already have.

In [2]:
def parseRating(line):
    fields = line.split('::')
    return int(fields[0]), int(fields[1]), float(fields[2])

def parseMovie(line):
    fields = line.split("::")
    return int(fields[0]), fields[1], fields[2]

# path to MovieLens dataset
movieLensHomeDir="data/movielens/medium/"


# movies is an RDD of (movieID, title, genres)
moviesRDD = sc.textFile(movieLensHomeDir + "movies.dat").map(parseMovie).setName("movies").cache()

# ratings is an RDD of (userID, movieID, rating)
ratingsRDD = sc.textFile(movieLensHomeDir + "ratings.dat").map(parseRating).setName("ratings").cache()

numRatings = ratingsRDD.count()
numUsers = ratingsRDD.map(lambda r: r[0]).distinct().count()
numMovies = ratingsRDD.map(lambda r: r[1]).distinct().count()
print("We have %d ratings from %d users on %d movies.\n" % (numRatings, numUsers, numMovies))

M = ratingsRDD.map(lambda r: r[0]).max()
N = ratingsRDD.map(lambda r: r[1]).max()
matrixSparsity = float(numRatings)/float(M*N)
print("We have %d users, %d movies and the rating matrix has %f percent of non-zero value.\n" % (M, N, 100*matrixSparsity))

We have 1000209 ratings from 6040 users on 3706 movies.

We have 6040 users, 3952 movies and the rating matrix has 4.190221 percent of non-zero value.



#  Gradient Descent Algorithms

The goal here is to:
1. Compute gradients of the loss functions.
2. Implement gradient algorithms.
3. Observe the prediction accuracy of the developed methods.

__Question 1__

> Split the dataset (randomly) into training and testing samples. We learn over 70% (for example) of the users and test over the rest.

> Define a routine that returns the predicted rating from the factor matrices. Form an RDD with the following elements `(i,j,true rating,predicted rating)`. 

> Define a routine that returns the Mean Square Error (MSE).
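
As a starting point, the two routines can be sketched in plain NumPy on a list of `(i, j, rating)` tuples; on an RDD, the same functions could be applied with `map`. The names `predict_rating`, `with_predictions` and `mse` are hypothetical, and the sketch assumes 0-based indices (MovieLens IDs start at 1, so they would need to be shifted).

```python
import numpy as np

def predict_rating(P, Q, i, j):
    """Predicted rating: inner product of the user and item factors."""
    return float(P[i] @ Q[j])

def with_predictions(ratings, P, Q):
    """Turn (i, j, true rating) into (i, j, true rating, predicted rating)."""
    return [(i, j, r, predict_rating(P, Q, i, j)) for i, j, r in ratings]

def mse(rows):
    """Mean Square Error over (i, j, true rating, predicted rating) tuples."""
    return float(np.mean([(r - r_hat) ** 2 for _, _, r, r_hat in rows]))

# Toy factors and ratings (illustrative values)
P = np.array([[1.0, 0.0], [0.0, 1.0]])
Q = np.array([[2.0, 1.0], [0.0, 3.0]])
ratings = [(0, 0, 3.0), (1, 1, 2.0)]

rows = with_predictions(ratings, P, Q)
print(rows)
print("MSE:", mse(rows))
```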


In [3]:
import numpy as np

trainingSample, testingSample = ratingsRDD.randomSplit([70, 30])
print("The number of examples in the dataset : ",ratingsRDD.count())
print("The number of examples in the training dataset : ",trainingSample.count())
print("The number of examples in the testing dataset : ",testingSample.count())

The number of examples in the dataset :  1000209
The number of examples in the training dataset :  699709
The number of examples in the testing dataset :  300500


In [4]:
# Create moviesGenresRDD, a new RDD of (genre, list_of_movies)
genres = moviesRDD.map(lambda x : x[2]).flatMap(lambda x : x.split("|")).distinct()

def getGenresOfMovie(movie):
    return moviesRDD.filter(lambda x : x[1] == movie).map(lambda y : y[2]).flatMap(lambda x : x.split("|")).collect()
    
def getMoviesFromGenre(genre):
    return moviesRDD.filter(lambda x : genre in x[2]).map(lambda y : y[1]).collect()

genres_dict = {}
for g in genres.collect() :
    genres_dict_temp = {g:getMoviesFromGenre(g)}
    genres_dict.update(genres_dict_temp)

moviesGenresRDD = genres.map(lambda genre : (genre,genres_dict[genre]))

##TEST
g = getGenresOfMovie("Pocahontas (1995)")
print("Genres of movie \"Pocahontas (1995)\":\n",g,"\n")
m = getMoviesFromGenre("Comedy")
print("Five movies from genre \"Comedy\":\n",np.asarray(m)[:5],"\n")
print("There are",genres.count(),"different genres:\n",genres.collect(),"\n")
print("Check from dictionnary:\n",len(genres_dict.keys()),"Genres:\n",genres_dict.keys(),"\n")
print("Number of movies having \"Comedy\" in their genres :",moviesRDD.filter(lambda x : "Comedy" in x[2]).count())
print("Check from dictionnary:",len(genres_dict["Comedy"]))
# Use the same genre in the label and in the filter (it was hard-coded to "Musical")
firstGenre = moviesGenresRDD.first()[0]
print("Number of movies having",firstGenre,"in their genres :",moviesRDD.filter(lambda x : firstGenre in x[2]).count())
print("Check from RDD, the number of movies having",firstGenre,"in their genres:",len(moviesGenresRDD.first()[1]))

#genres_dict.clear()

Genres of movie "Pocahontas (1995)":
 ['Animation', "Children's", 'Musical', 'Romance'] 

Five movies from genre "Comedy":
 ['Toy Story (1995)' 'Grumpier Old Men (1995)' 'Waiting to Exhale (1995)'
 'Father of the Bride Part II (1995)' 'Sabrina (1995)'] 

There are 18 different genres:
 ['Musical', 'Fantasy', 'Documentary', "Children's", 'Horror', 'Drama', 'Thriller', 'Action', 'Sci-Fi', 'Western', 'Romance', 'Adventure', 'Film-Noir', 'Crime', 'War', 'Animation', 'Mystery', 'Comedy'] 

Check from dictionnary:
 18 Genres:
 dict_keys(["Children's", 'Romance', 'Fantasy', 'Comedy', 'Thriller', 'Western', 'Action', 'War', 'Sci-Fi', 'Drama', 'Musical', 'Film-Noir', 'Horror', 'Documentary', 'Animation', 'Mystery', 'Adventure', 'Crime']) 

Number of movies having "Comedy" in their genres : 1200
Check from dictionnary: 1200
Number of movies having Musical in their genres : 114
Check from RDD, the number of movies having Musical in their genres: 114


In [5]:
# Define factors = Genres (k = 18 factors)

# How values are assigned to the factors
# For vector q_i:
#    each movie (item) gets (average rating of the movie) / (number of its genres)
#    for each of its genres and 0 for the others
# For vector p_u:
#    each user gets, for each genre, the average rating given to the movies of that genre
#    (sum of ratings / number of rated movies of that genre)
  
movies_dict = {}
for i,title in moviesRDD.map(lambda x : (x[0],x[1])).collect():
    movies_dict_temp = {i:title}
    movies_dict.update(movies_dict_temp)

def update_factors(movieIdListOfGenres,rating,p_u):
    for f in movieIdListOfGenres:
        p_u[f] = (p_u[f][0]+1, p_u[f][1] + rating)
    return p_u

def compute_final_p_u(p_u):
    for f in genres.collect():
        if p_u[f][0] != 0:
            # keep the count unchanged and replace the sum of ratings by the average
            p_u[f] = (p_u[f][0], p_u[f][1] / p_u[f][0])
    return p_u

def get_preferences(user,p_u):
    pref = []
    for g in genres.collect():
        if p_u[g][0]>0:
            pref.append(g)
    return pref
            

def get_genres_list_of_movieId(movie):
    # getGenresOfMovie filters moviesRDD by title, so look the title up first
    movieTitle = movies_dict[movie]
    return getGenresOfMovie(movieTitle)
    
def computePu(user):
    p_u = {}
    for g in genres.collect() :
        p_u_temp = {g:(0,0)} # {genre: (number of rated movies with that genre, sum of ratings)}
        p_u.update(p_u_temp)
    userRatingsRDD = trainingSample.filter(lambda x : x[0] == user)
    userMoviesList = userRatingsRDD.map(lambda x : x[1]).collect()
    userRatingsList = userRatingsRDD.map(lambda x : x[2]).collect()
    print("Number of movies rated:",len(userMoviesList))
    print("Movies rated:",userMoviesList)
    for l in range(0,len(userMoviesList)):
        movieIdListOfGenres = get_genres_list_of_movieId(userMoviesList[l])
        rating = userRatingsList[l]
        #print(movieIdListOfGenres," ///// ", rating)
        p_u = update_factors(movieIdListOfGenres,rating,p_u)
    p_u = compute_final_p_u(p_u)
    print("User preferences :\n",get_preferences(user,p_u))
    return p_u
    
usersList = trainingSample.map(lambda r: r[0]).distinct().collect()

In [None]:
def computeMovieRatingsAverage(movieId):
    movieIdRatings = trainingSample.filter(lambda x : x[1] == movieId).map(lambda x : x[2])
    movieIdRatingsAverage = movieIdRatings.mean()
    print("The average of ratings (for the movieId =",movieId,")",movieIdRatingsAverage)
    return movieIdRatingsAverage

def computeQi(item):
    q_i = {}
    for g in genres.collect() :
        q_i_temp = {g:0} 
        q_i.update(q_i_temp)
        
    itemGenres = get_genres_list_of_movieId(item)
    for genre in itemGenres:
        q_i[genre] = computeMovieRatingsAverage(item)/len(itemGenres)
    return q_i       
    
itemsList = moviesRDD.map(lambda r: r[0]).distinct().collect()
print(itemsList)
#for i in itemsList:
#    print(getGenresOfMovie(i))

In [8]:
def dict2list(dic,col):
    index = 0
    _list = np.zeros(18)
    for f in genres.collect():
        if col != 0:
            _list[index] = dic[f][col]
        else:
            _list[index] = dic[f]
        index = index + 1
    return _list
    
print("All genres:\n",genres.collect(),"\n")

user = 1
item = 2804

print("User =",user) 
print("Movie =",movies_dict[item])

Pu = dict2list(computePu(user),1)
Qi = dict2list(computeQi(item),0)

print("\nPu = \n",Pu) 
print("\nQi =\n",Qi)

ps = np.vdot(Pu,Qi)
print("\nDot product =",ps)

realRating = ratingsRDD.filter(lambda x : x[0]==user).filter(lambda x : x[1]==item).collect()
print("True rating =",realRating)
# western movie eg : "Wild Bill (1995)"

All genres:
 ['Musical', 'Fantasy', 'Documentary', "Children's", 'Horror', 'Drama', 'Thriller', 'Action', 'Sci-Fi', 'Western', 'Romance', 'Adventure', 'Film-Noir', 'Crime', 'War', 'Animation', 'Mystery', 'Comedy'] 

User = 1
Movie = Christmas Story, A (1983)
Number of movies rated: 42
Movies rated: [661, 914, 3408, 2355, 1287, 2804, 594, 919, 2918, 1035, 2791, 2018, 3105, 2797, 2321, 720, 527, 2340, 48, 1097, 1721, 1545, 745, 2294, 3186, 588, 1907, 1836, 1022, 2762, 150, 1, 1961, 1962, 2692, 260, 1029, 1207, 531, 3114, 608, 1246]
User preferences :
 ['Musical', 'Fantasy', "Children's", 'Drama', 'Thriller', 'Action', 'Sci-Fi', 'Romance', 'Adventure', 'Crime', 'War', 'Animation', 'Comedy']
The average of ratings (for the movieId = 2804 ) 4.216931216931217
The average of ratings (for the movieId = 2804 ) 4.216931216931217

Pu = 
 [ 4.22222222  4.          0.          4.26666667  0.          4.38888889
  3.66666667  4.33333333  4.          0.          3.8         4.33333333
  0.          4

__Question 2__

> Derive the update rules for gradient descent. 

> Implement a (full) gradient algorithm in `Python` on the training set. Take a step size (learning rate) $\gamma=0.001$ and stop after a specified number of iterations. Investigate the latent space size (e.g. $k=2,5,10,50$).

> Provide plots and explanations for your experiments. 

> Try to parallelize it so that the code can be run using `PySpark`. What do you conclude?
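
For reference, differentiating the loss $\ell_{u_i,i_j}$ defined in the problem formulation gives
$$ \nabla_{p_i} \ell_{u_i,i_j} = -2\left(r_{ij} - p_{i}^{\top}q_{j}\right) q_j + 2\lambda p_i, \qquad \nabla_{q_j} \ell_{u_i,i_j} = -2\left(r_{ij} - p_{i}^{\top}q_{j}\right) p_i + 2\lambda q_j, $$
and the full gradient sums these terms over all observed ratings. The following is a minimal single-machine sketch of one possible implementation, not a reference solution: the synthetic data, the initialization, the value of $\lambda$ and the function names (`objective`, `full_gradient_step`) are assumptions chosen for illustration.

```python
import numpy as np

def objective(P, Q, ratings, lam):
    """Sum of the regularized squared errors over the observed ratings."""
    total = 0.0
    for i, j, r in ratings:
        err = r - P[i] @ Q[j]
        total += err ** 2 + lam * (P[i] @ P[i] + Q[j] @ Q[j])
    return total

def full_gradient_step(P, Q, ratings, lam, gamma):
    """One gradient step that uses all observed ratings."""
    grad_P = np.zeros_like(P)
    grad_Q = np.zeros_like(Q)
    for i, j, r in ratings:
        err = r - P[i] @ Q[j]
        grad_P[i] += -2.0 * err * Q[j] + 2.0 * lam * P[i]
        grad_Q[j] += -2.0 * err * P[i] + 2.0 * lam * Q[j]
    return P - gamma * grad_P, Q - gamma * grad_Q

# Synthetic low-rank data (illustrative, not MovieLens)
rng = np.random.default_rng(0)
m, n, k = 20, 15, 3
P_true = rng.uniform(0.5, 1.5, size=(m, k))
Q_true = rng.uniform(0.5, 1.5, size=(n, k))
observed = rng.random((m, n)) < 0.5            # keep about half of the entries
ratings = [(i, j, float(P_true[i] @ Q_true[j]))
           for i in range(m) for j in range(n) if observed[i, j]]

# Small random initialization, then a fixed number of iterations
P = rng.uniform(0.0, 0.5, size=(m, k))
Q = rng.uniform(0.0, 0.5, size=(n, k))
history = [objective(P, Q, ratings, lam=0.01)]
for _ in range(300):
    P, Q = full_gradient_step(P, Q, ratings, lam=0.01, gamma=0.001)
    history.append(objective(P, Q, ratings, lam=0.01))

print("Objective before and after:", history[0], history[-1])
```

Note that every step needs a pass over all ratings before any factor is updated, which is the cost to keep in mind when thinking about a distributed version.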

Stochastic Gradient Descent (SGD) replaces the full gradient in the update (the sum over all training examples) by the gradient computed from only a single or a few training examples. In SGD the learning rate $\gamma$ is typically much smaller than a corresponding learning rate in batch gradient descent because there is much more variance in the update.

__Question 3__
> Implement stochastic gradient descent algorithm for Matrix Factorization.

> Provide plots and explanations for your experiments.

> Compare and discuss the results with the (full) gradient algorithm in terms of MSE versus full data passes.

> Discuss the step-size choice of SGD (e.g. constant vs. 1/`nb_iter`).
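
The sketch below shows one way to write an SGD pass in which the step size is a function of the update counter, so that a constant and a decaying schedule can be compared. It is an illustration under stated assumptions: the data are synthetic, the decaying schedule `0.02 / (1 + t / 300)` is only one variant of a 1/`nb_iter` rule, and the names `sgd_epoch` and `run` are hypothetical.

```python
import numpy as np

def mse(P, Q, ratings):
    """Mean squared prediction error over the given ratings."""
    return float(np.mean([(r - P[i] @ Q[j]) ** 2 for i, j, r in ratings]))

def sgd_epoch(P, Q, ratings, lam, step_size, rng, t):
    """One pass over the ratings in random order; P and Q are updated in place."""
    for idx in rng.permutation(len(ratings)):
        i, j, r = ratings[idx]
        t += 1
        gamma = step_size(t)                   # step size for update number t
        err = r - P[i] @ Q[j]
        p_old = P[i].copy()                    # q_j must use p_i from before the update
        P[i] -= gamma * (-2.0 * err * Q[j] + 2.0 * lam * P[i])
        Q[j] -= gamma * (-2.0 * err * p_old + 2.0 * lam * Q[j])
    return t

def run(step_size, epochs=20, seed=0):
    """Train on synthetic low-rank data and return the MSE after each data pass."""
    rng = np.random.default_rng(seed)
    m, n, k = 20, 15, 3
    P_true = rng.uniform(0.5, 1.5, size=(m, k))
    Q_true = rng.uniform(0.5, 1.5, size=(n, k))
    observed = rng.random((m, n)) < 0.5
    ratings = [(i, j, float(P_true[i] @ Q_true[j]))
               for i in range(m) for j in range(n) if observed[i, j]]
    P = rng.uniform(0.0, 0.5, size=(m, k))
    Q = rng.uniform(0.0, 0.5, size=(n, k))
    curve, t = [mse(P, Q, ratings)], 0
    for _ in range(epochs):
        t = sgd_epoch(P, Q, ratings, 0.01, step_size, rng, t)
        curve.append(mse(P, Q, ratings))
    return curve

constant_curve = run(lambda t: 0.02)                        # constant step size
decaying_curve = run(lambda t: 0.02 / (1.0 + t / 300.0))    # decaying step size
print("Constant:", constant_curve[0], "->", constant_curve[-1])
print("Decaying:", decaying_curve[0], "->", decaying_curve[-1])
```

Recording the MSE once per pass over the data makes the curves directly comparable with those of the full gradient algorithm, where one iteration is one data pass.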

Now we will implement Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent (DSGD) in Spark. 
The algorithm is described in the following article: <br /><br />
_Gemulla, R., Nijkamp, E., Haas, P. J., & Sismanis, Y. (2011). Large-scale matrix factorization with distributed stochastic gradient descent. New York, USA._<br /><br />
The paper sets forth a solution for matrix factorization based on the minimization of a sum of local losses. The solution involves dividing the matrix into strata for each iteration and performing sequential stochastic gradient descent within each stratum in parallel. DSGD is a fully distributed algorithm, i.e. both the data matrix $R$ and the factor matrices $P$ and $Q$ can be carefully split and distributed to multiple workers, which then compute in parallel without communicating with each other while a stratum is processed. Hence, it is a good match for implementation in a distributed in-memory data processing system like Spark. 
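
The stratification idea can be illustrated without Spark. In the sketch below, rows and columns are cut into `d` contiguous blocks and stratum `s` pairs row block `b` with column block `(b + s) % d`, so the blocks of one stratum share no user and no item. This is a simplified illustration: the contiguous blocking, the cyclic order of the strata and the names `block_of` and `split_by_stratum` are assumptions, and the paper describes more general choices.

```python
def block_of(index, size, d):
    """Map a 0-based row or column index to one of d contiguous blocks."""
    return index * d // size

def split_by_stratum(ratings, m, n, d):
    """Return buckets[s][b]: ratings of row block b that belong to stratum s."""
    buckets = [[[] for _ in range(d)] for _ in range(d)]
    for i, j, r in ratings:
        bi, bj = block_of(i, m, d), block_of(j, n, d)
        s = (bj - bi) % d                # stratum s pairs row block b with column block (b + s) % d
        buckets[s][bi].append((i, j, r))
    return buckets

# Toy example: a fully observed 6 x 6 matrix split for d = 3 workers
m, n, d = 6, 6, 3
ratings = [(i, j, 1.0) for i in range(m) for j in range(n)]
buckets = split_by_stratum(ratings, m, n, d)

for s in range(d):
    for b in range(d):
        users = sorted({i for i, _, _ in buckets[s][b]})
        items = sorted({j for _, j, _ in buckets[s][b]})
        print("stratum", s, "worker", b, "users", users, "items", items)
```

Within one stratum each worker touches its own rows of $P$ and $Q$ only, so the `d` buckets can be processed by independent sequential SGD runs; the updated factors then have to be made available before the next stratum starts.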

__Question 4__

> Implement a `PySpark` version of DSGD.

> Test with different numbers of cores on a local machine (1 core, 2 cores, 4 cores). Run the ALS method already implemented in MLlib as a reference for comparison.