# wk11 Demo - Recommender Systems; ALS
__`MIDS w261: Machine Learning at Scale | UC Berkeley School of Information | Fall 2018`__

Last week we covered PageRank. This week we turn to Alternating Least Squares (ALS) regression.

In class today we'll start with a general discussion of recommender systems, then we'll look at some basic theory of ALS and how it can be parallelized in a map/reduce framework like Spark.

For your reference, we provide code for closed-form ridge (L2-regularized) regression in three forms: a single-node implementation, a distributed implementation, and an MLlib implementation.

By the end of today's demo you should be able to:  
* ... __identify__ pros and cons in various RS approaches
* ... __describe__ ALS regression 
* ... __implement__ ALS regression in a distributed fashion.

__`Additional Resources:`__ 
- The code in this notebook was based on several notebooks by Jimi Shanahan   
- Recommendation Systems: Techniques, Challenges, Application, and Evaluation https://www.researchgate.net/publication/328640457_Recommendation_Systems_Techniques_Challenges_Application_and_Evaluation_SocProS_2017_Volume_2
- Matrix Completion via Alternating Least Square(ALS) https://web.stanford.edu/~rezab/classes/cme323/S15/notes/lec14.pdf    
- Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares https://arxiv.org/pdf/1410.2596.pdf     
- Explicit Matrix Factorization: ALS, SGD, and All That Jazz https://blog.insightdatascience.com/explicit-matrix-factorization-als-sgd-and-all-that-jazz-b00e4d9b21ea    
- Collaborative Filtering for Implicit Feedback Datasets http://yifanhu.net/PUB/cf.pdf    
- Joeran Beel https://www.tcd.ie/research/researchmatters/joeran-beel.php

- Collaborative Filtering - RDD-based API https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

# Background discussion

>From a research perspective, recommender-systems are one of the most diverse areas imaginable. The areas of interest range from hard mathematical/algorithmic problems over user-centric problems (user interfaces, evaluations, privacy) to ethical and political questions (bias, information bubbles). Given this broad range, many disciplines contribute to recommender-systems research including computer science (e.g. information retrieval, natural language processing, graphic and user interface design, machine learning, distributed computing, high performance computing) the social sciences, and many more. Recommender-systems research can also be conducted in almost every domain including e-commerce, movies, music, art, health, food, legal, or finance. This opens the door for interdisciplinary cooperation with exciting challenges and high potential for impactful work. ~Joeran Beel    
*Dr Joeran Beel is an Ussher Assistant Professor in the Artificial Intelligence Discipline at the School of Computer Science & Statistics at Trinity College Dublin. https://www.tcd.ie/research/researchmatters/joeran-beel.php*

Most of us are inundated with examples of recommendation systems, from Facebook, to Amazon, to Netflix. So instead of  starting with ‘what they are’,   maybe it’s good to start with a quick discussion about what unspoken assumptions underlie their ubiquity. 

### Discussion questions:
* What are some of the political and ethical questions related to RS? What could go wrong? Examples?
* What are some assumptions to keep in mind and/or try to avoid when designing RS?
* What is the value proposition? who are the stakeholders?
* What are some of the areas of expertise involved in designing RS?

# Types of Recommender Systems

<img src="types-diagram.png" width=70%>

https://www.researchgate.net/publication/328640457_Recommendation_Systems_Techniques_Challenges_Application_and_Evaluation_SocProS_2017_Volume_2

<img src="RS-comparison-table.png">

# Representation

## Recommender System (RS) as bipartite graph

We can think of the recommender problem as a weighted bipartite graph, where one set of nodes represents users, and the other set represents items. 

* __NODES__ - Each user can be represented by a vector of features such as preferences, demographics, and traits. Likewise, our items can be represented by feature vectors. For example, if our items are movies, we might have features like genre, director, and lead actor. (We'll talk about how these features are derived later.)  

* __EDGES__ - The edges in our graph could be ratings, a positive or negative indicator, or another continuous measure of preference.

<img src="bipartite-graph.png">

# Content-based
Content-Based systems focus on properties of items. Similarity of items is determined by measuring the similarity in their properties. If a user bought a particular item, then we can recommend similar items. Early recommender systems employed this approach. 

This approach depends heavily on feature engineering and on the similarity metric used between items. For movies, the features might be genre, lead actors, etc. When comparing news articles we might perform topic modeling, or compute TF-IDF vectors and compare them with cosine similarity.

We represent each item as a vector of features, and each user as a vector of these item features, and we compute the cosine similarity between a user and an item to determine if the user will like this item.
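The user-to-item comparison described above can be sketched in a few lines of NumPy. The feature names, the item vectors, and the choice to build the user profile as the mean of liked items are all assumptions made for illustration, not part of the original demo.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical item features: [sci-fi, comedy, drama]
items = {
    "space_opera": np.array([1.0, 0.0, 0.2]),
    "romcom":      np.array([0.0, 1.0, 0.3]),
}

# Assumed user profile: the mean feature vector of items the user liked
liked = [np.array([1.0, 0.0, 0.0]), np.array([0.8, 0.0, 0.4])]
user_profile = np.mean(liked, axis=0)

# Score every candidate item against the user profile and pick the best match
scores = {name: cosine_similarity(user_profile, vec) for name, vec in items.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Cosine similarity ignores vector length, so a user who has liked many items is not automatically scored higher than one who has liked few.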

While this approach is intuitive and easily interpretable, more effective methods have since been developed.

Pros: interpretable, no cold start problem, makes use of implicit data collection.  
Cons: creates filter bubble

# Collaborative Filtering

## Neighborhood Based
As discussed in DDS, we could now take a Nearest Neighbor approach, which is intuitive and simple to reason about. The intuition being that similar people like similar things.
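As a rough illustration of "similar people like similar things", here is a minimal user-based nearest-neighbor sketch. The toy ratings, the use of cosine similarity over co-rated items, and the helper names `user_similarity` and `predict` are assumptions for this example; other similarity measures (for example Pearson correlation) are equally plausible.

```python
import numpy as np

# Toy ratings: rows = users, columns = items, np.nan = not rated (made-up data)
R = np.array([
    [5.0,    4.0, np.nan, 1.0],
    [4.0,    5.0, 1.0,    np.nan],
    [1.0,    1.0, 5.0,    4.0],
    [np.nan, 1.0, 4.0,    5.0],
])

def user_similarity(a, b):
    """Cosine similarity over the items both users rated (0.0 if none)."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return 0.0
    x, y = a[both], b[both]
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def predict(R, user, item, k=2):
    """Similarity-weighted mean rating of the k most similar users who rated `item`."""
    candidates = [(user_similarity(R[user], R[v]), R[v, item])
                  for v in range(R.shape[0])
                  if v != user and not np.isnan(R[v, item])]
    top = sorted(candidates, reverse=True)[:k]
    total = sum(s for s, _ in top)
    return sum(s * r for s, r in top) / total if total > 0 else np.nan

pred = predict(R, user=0, item=2)
print(pred)
```

Note that every prediction compares the target user against all other users, which hints at the computational cost of this approach on large data.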

### Discussion questions:
1. How can we measure similarity between people? (HINT: Think about the person node representation)
2. What are the challenges with this approach from a theoretical standpoint, as well as a computational one?

## Model Based


__2.2.2 Model-Based Filtering__

Model-based techniques make use of data mining and machine learning approaches to predict the preference of a user to an item. These techniques include 
* association rule mining, (Apriori)
* clustering, 
* decision tree, 
* artificial neural network, 
* Bayesian classifier, 
* regression, 
* link analysis, and 
* __latent factor models.__   

Among these, __latent factor models__ are the most studied and used model-based techniques.
These techniques perform dimensionality reduction over user–item preference matrix
and learn latent variables to predict preference of the user to an item in the recommendation
process. These methods include:
* __matrix factorization__, 
* singular value decomposition, 
* probabilistic matrix factorization, 
* Bayesian probabilistic matrix factorization, 
* low-rank factorization, 
* nonnegative matrix factorization, and 
* latent Dirichlet allocation.   

Source: https://www.researchgate.net/publication/328640457_Recommendation_Systems_Techniques_Challenges_Application_and_Evaluation_SocProS_2017_Volume_2

# Data collection - implicit vs explicit feedback
Before we dive into the methodology, let's talk about how the system is populated in the first place.


## Explicit feedback - users provide ratings.

### Discussion questions: 
* What are some of the limitations of star ratings?


### Digression: Yahoo experiment -   
<img src="yahoo-experiment.png">   
* On the left - When users are asked to rate movies from a random list, there are many very low ratings.
* On the right - When the user has the freedom to choose what items to rate, instead of giving a low rating, they don’t give a rating at all.    

These two distributions are very different. The challenge is that we have data for training and testing where the "true" distribution looks like the one on the left, but the model we build (and so the user experience) depends on the distribution on the right.
We can reframe this challenge in terms of missing data: we cannot assume that ratings are missing at random. Consequently, we cannot simply ignore missing values; we need a mechanism for imputing them.   
For more information on how this issue can be addressed, see https://www.youtube.com/watch?v=aKQfUbxU96c (starting at the 6:00 mark).

## Implicit feedback - We can make inferences from users’ behavior. 
If a user buys a product at Amazon, watches a movie on YouTube, or reads a news article, then the user can be said to “like” this item.
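One common way to turn raw behavior into training signal, following the Hu et al. paper listed in the resources, is to split each interaction count $r_{ui}$ into a binary preference $p_{ui}$ and a confidence $c_{ui} = 1 + \alpha r_{ui}$. The sketch below uses made-up counts, and the value of `alpha` is an assumption that would need tuning.

```python
import numpy as np

# Hypothetical interaction counts: rows = users, columns = items
counts = np.array([
    [3, 0, 1],
    [0, 0, 7],
])

alpha = 40.0  # confidence scaling; an assumed value, tune on held-out data

preference = (counts > 0).astype(float)   # p_ui = 1 if the user interacted at all
confidence = 1.0 + alpha * counts         # c_ui = 1 + alpha * r_ui

print(preference)
print(confidence)
```

Unobserved pairs keep a small but nonzero confidence of 1, so "no interaction" is treated as weak negative evidence rather than as missing data.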

### Discussion questions:
* What are some of the challenges of this approach?
* Ex: If I click on a movie but don't watch it, is that a positive or negative indicator?

## Matrix Factorization

We'll limit our implementation to explicit feedback given by users in the form of ratings. We might want to do some preprocessing, like normalization. For example, we might want to subtract the mean of the ratings to account for user bias: some users tend to rate higher than others, and vice versa. We may or may not also want to impute missing values, as discussed above.

We can represent our bipartite graph by a $n\times m$ "utility" matrix $R$ with entries $r_{u,i}$ representing the $i$th item rating by user $u$ with $n$ users and $m$ items.
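To make the representation concrete, the sketch below builds a small utility matrix from (user, item, rating) edges and applies the per-user mean-centering mentioned earlier. The edge list is made-up toy data, and marking missing ratings with `np.nan` is a choice made for this example (the notebook code later uses 0 instead).

```python
import numpy as np

# (user, item, rating) edges of the bipartite graph; made-up toy data
edges = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0),
         (2, 0, 1.0), (2, 1, 2.0), (2, 2, 3.0)]
n_users, n_items = 3, 3

R = np.full((n_users, n_items), np.nan)   # np.nan marks a missing rating
for u, i, r in edges:
    R[u, i] = r

# Subtract each user's mean observed rating to remove per-user bias
user_means = np.nanmean(R, axis=1, keepdims=True)
R_centered = R - user_means

print(R)
print(R_centered)
```

Missing entries stay missing after centering, so the decision about whether and how to impute them remains separate from normalization.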

Our goal is to fill in the missing (or previously imputed) values of our matrix with good estimates of future ratings. 

A common approach to this problem is matrix factorization, where we estimate the complete ratings matrix $R$ as the product of two matrix "factors": $U$, the user matrix, and $V$, the item matrix.

$$
R \approx UV
$$

We can estimate $R$ by creating factor matrices with reduced complexity, $U\in\mathbb{R}^{n\times k}$ and $V\in\mathbb{R}^{k\times m}$, with $n$ users, $m$ items, and $k$ factors.

### Prediction 
If we multiply each feature of the user by the corresponding feature of the movie and add everything together, this will be a good approximation for the rating the user would give that movie.

$$
r'_{u,i} = \boldsymbol{u}^{T}_{u}\boldsymbol{v}_{i} = \sum_{f=1}^{k} u_{u,f}v_{f,i}
$$
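The prediction rule is just a dot product, which the following sketch makes explicit. It assumes $U$ is stored with one row per user ($n \times k$) and $V$ with one column per item ($k \times m$), so that $UV$ has the same shape as $R$; the random factors are placeholders, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 4, 5, 2                 # users, items, latent factors (toy sizes)
U = rng.random((n, k))            # row u holds user u's latent vector
V = rng.random((k, m))            # column i holds item i's latent vector

u, i = 1, 3
r_hat = U[u] @ V[:, i]            # single prediction: dot product of two k-vectors
R_hat = U @ V                     # all n*m predictions at once

print(r_hat, R_hat.shape)
```

Storing only $U$ and $V$ takes $k(n+m)$ numbers instead of the $nm$ needed for a dense $R$, which is where the "reduced complexity" comes from.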

#### Assumptions
- Each user can be described by $k$ attributes or features. For example, feature 1 might be a number that says how much each user likes sci-fi movies. In practice the features are learned by the model rather than specified by hand (much like the hidden units of a neural network), so their meaning is ambiguous and we lose interpretability.
- Each item (movie) can be described by an analogous set of $k$ attributes or features. To correspond to the above example, feature 1 for the movie might be a number that says how close the movie is to pure sci-fi.

These user and item vectors are often called latent vectors or low-dimensional embeddings.

### Discussion questions:
* What is a latent factor? Intuitively? Mathematically?
* How many latent factors should we choose? What would it mean if we had 1 latent factor? What about if we had too many? HINT: how does it relate to underfitting and overfitting


<img src="MF-01.png" width=70%>

## Training - Alternating Least Squares (ALS)
### How can we find U and V to approximate R?
* What are we trying to optimize?       
* What do we start with?
* Explain the 3 components of this loss function. 


$$
\Vert R- U\cdot V\Vert ^2 + \lambda \Vert U \Vert ^2 + \lambda \Vert V \Vert ^2 
$$

* There is something left out of the notation shown, what is it? (hint: are we really computing the loss for every cell?)
* Why wouldn't we use GD on this loss function?


It turns out that minimizing over $U$ and $V$ jointly is hard. For one, this function is not convex, so there are local minima we could "get stuck in". 

This is where ALS comes in. If we hold either $U$ or $V$ constant, the problem becomes convex in the other: the fixed factor acts as a constant design matrix, and the objective has the same form as standard (ridge-regularized) least squares regression.
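As a sketch of why each half-step is easy, assume for the moment that every entry of $R$ is observed, that $U$ is $n \times k$, and that $V$ is $k \times m$ (these shapes are assumptions chosen so that $UV$ matches $R$). With $U$ fixed, the loss separates over the columns $\boldsymbol{r}_i$ of $R$, and each column is an ordinary ridge regression with design matrix $U$:

$$
L(\boldsymbol{v}_i) = \Vert \boldsymbol{r}_i - U\boldsymbol{v}_i \Vert^2 + \lambda \Vert \boldsymbol{v}_i \Vert^2
\quad\Rightarrow\quad
\nabla L = -2U^{T}(\boldsymbol{r}_i - U\boldsymbol{v}_i) + 2\lambda \boldsymbol{v}_i = 0
\quad\Rightarrow\quad
\boldsymbol{v}_i = (U^{T}U + \lambda I)^{-1}U^{T}\boldsymbol{r}_i
$$

Stacking the columns gives $V = (U^{T}U + \lambda I)^{-1}U^{T}R$, and by symmetry, with $V$ fixed, $U = RV^{T}(VV^{T} + \lambda I)^{-1}$. When only some ratings are observed, the same derivation applies with the sums restricted to the rated entries of each row or column.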

__STEPS__

- Initialize $U_0$ and $V_0$   
- Holding $U_0$ constant, solve for $V_1$ to minimize:

$$
L(V) = \Vert R- U_0\cdot V\Vert ^2 + \lambda \Vert V \Vert ^2
$$

- Holding $V_1$ constant, solve for $U_1$ to minimize:

$$
L(U) = \Vert R- U\cdot V_1\Vert ^2 + \lambda \Vert U \Vert ^2
$$

- Repeat until convergence
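The steps above can be sketched in NumPy for the simplest case, where every entry of $R$ is observed. The function name `als`, the synthetic rank-3 data, and the default values of `lam` and `n_iters` are assumptions for illustration; the notebook cells further down handle the missing-ratings case with a weight matrix.

```python
import numpy as np

def als(R, k, lam=0.1, n_iters=10, seed=0):
    """Alternate two ridge solves on a fully observed matrix R (n x m)."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = rng.random((n, k))
    V = rng.random((k, m))
    reg = lam * np.eye(k)
    losses = []
    for _ in range(n_iters):
        # U fixed: V = (U^T U + lam I)^-1 U^T R
        V = np.linalg.solve(U.T @ U + reg, U.T @ R)
        # V fixed: U^T = (V V^T + lam I)^-1 V R^T
        U = np.linalg.solve(V @ V.T + reg, V @ R.T).T
        # regularized objective after one full sweep
        loss = np.sum((R - U @ V) ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2))
        losses.append(loss)
    return U, V, losses

rng = np.random.default_rng(1)
R = rng.random((20, 3)) @ rng.random((3, 15))   # synthetic rank-3 "ratings"
U, V, losses = als(R, k=3)
print(losses[0], losses[-1])
```

Because each half-step exactly minimizes the objective over one factor while the other is held fixed, the recorded loss cannot increase from one sweep to the next.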

### Discussion questions:
* When we say “solve” for $U_i$ and $V_i$, how is that actually done?
* Where are there opportunities to parallelize this training process?
* How will we partition the data to avoid shuffling?


# Parallel ALS

<img src="parallel-ALS.png">

### Discussion questions
* What data needs to be cached/ broadcast at each phase?
* What happens in ‘mappers’ (narrow transformations) & what will happen in aggregation (wide transformations)?
* How many shuffles per iteration?
* Are there any limitations to this approach?


# Recommender-System Evaluation
>‘What constitutes a good recommender system and how to measure it’ might seem like a simple question to answer, but it is actually quite difficult.  For many years, the recommender-systems community focused on accuracy. 
>
>Accuracy, in the broader sense, is easy to quantify: numbers like error rates such as the difference between a user’s actual rating of a movie and the previously predicted rating by the recommender system (the lower the error rate, the better the recommender system); or precision, i.e. the fraction of items in a list of recommendations that was actually bought, viewed, clicked, etc. (the higher the precision, the better the recommender system). 
>
>Recently, the community’s attention has shifted to other measures that are more meaningful but also more difficult to measure including __serendipity__, __novelty__, and __diversity__. I contributed to this development by critically analyzing the state of the art [15] ; comparing evaluation metrics (click-through rate, user ratings, precision, recall, …) and methods (online evaluations, offline evaluations, user studies) [13] as well as introducing novel evaluation methods [3].
>
>Regardless of the metrics used to measure how “good” a recommender system is (accuracy, precision, user satisfaction…), studies report surprisingly inconsistent results on how effective different recommendation algorithms are. For instance, as shown in Figure 2, one of my experiments shows that five news recommendation-algorithms perform vastly different on six news websites [5]. Almost every algorithm performed best on at least one news website. Consequently, the operator of a new news website would hardly know which of the five algorithms is the best to use, because any one could potentially be it.  ~Joeran Beel   
*Dr Joeran Beel is an Ussher Assistant Professor in the Artificial Intelligence Discipline at the School of Computer Science & Statistics at Trinity College Dublin.* https://www.tcd.ie/research/researchmatters/joeran-beel.php
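
The two accuracy-style measures mentioned in the quote, rating error and precision, can be sketched as follows. The function names `rating_rmse` and `precision_at_k` and the toy inputs are assumptions for this example; they are not the evaluation code used elsewhere in this notebook.

```python
import numpy as np

def rating_rmse(actual, predicted):
    """Root-mean-squared error over paired actual and predicted ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that the user actually consumed."""
    top_k = recommended[:k]
    return len(set(top_k) & set(relevant)) / k

err = rating_rmse([4, 3, 5], [3.5, 3, 4])
prec = precision_at_k(["a", "b", "c", "d"], {"b", "d", "e"}, k=3)
print(err, prec)
```

Neither number says anything about serendipity, novelty, or diversity, which is part of why those are harder to evaluate.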

### Notebook Set-Up

In [None]:
# imports
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from os import path

In [None]:
# store path to notebook
PWD = !pwd
PWD = PWD[0]

In [None]:
from pyspark.sql import SparkSession
app_name = "wk11_demo"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .getOrCreate()
sc = spark.sparkContext

## About the Data
https://grouplens.org/datasets/movielens/

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.

This data set consists of: 
* 100,000 ratings (1-5) from 943 users on 1682 movies. 
* Each user has rated at least 20 movies. 
* Simple demographic info for the users (age, gender, occupation, zip)



In [None]:
# make data directory if it doesn't already exist
!mkdir -p data
# -L follows redirects; the Dropbox URL is quoted so the shell does not interpret the '?'
!curl -L "https://www.dropbox.com/s/yk72grsouyw018l/test.data.txt?dl=1" -o data/test.data
!curl -L http://files.grouplens.org/datasets/movielens/ml-latest-small.zip -o data/data-ml-latest-small.zip
!curl -L http://files.grouplens.org/datasets/movielens/ml-10m.zip -o data/data-ml-10m.zip

In [None]:
!rm -rf data/ml*
!unzip data/data-ml-latest-small.zip -d data/
!unzip data/data-ml-10m.zip -d data/

In [None]:
!tree data/

...

## Python - Single Node Implementation

In [None]:
baseDir = f"{PWD}/data/ml-10M100K"

# The MovieLens dataset contains 10000054 ratings and 95580 tags applied to 10681 movies by 71567 users.

rating_headers = ['user_id', 'movie_id', 'rating', 'timestamp']
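# sep='::' is a multi-character separator, which requires the Python parsing engine
ratings = pd.read_table(path.join(baseDir,'ratings.dat'), sep='::', header=None, names=rating_headers, engine='python')

movie_headers = ['movie_id', 'title', 'genres']
movies = pd.read_table(path.join(baseDir,'movies.dat'), sep='::', header=None, names=movie_headers, engine='python')
movie_titles = movies.title.tolist()

# NOTE: join(on=...) matches movies.movie_id against the row index of `ratings`,
# not against ratings.movie_id; pd.merge(ratings, movies, on='movie_id') would be a true key join
df = movies.join(ratings, on=['movie_id'], rsuffix='_r')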
del df['movie_id_r']
df.head()

In [None]:
# Getting Q Matrix
rp = df.pivot_table(columns=['movie_id'],index=['user_id'],values='rating').fillna(0)
rp.head()
Q = rp.values
Q.shape

In [None]:
# Build a binary weight matrix W so that each subproblem (one per user or per movie)
# only counts the ratings that were actually observed: 1 where a rating exists, 0 otherwise
W = (Q > 0.5).astype(np.float64)  # float64 to be consistent with our Q matrix
lambda_ = 0.1 # regularization strength (not a learning rate)
n_factors = 100
m, n = Q.shape

# initialize user (m x n_factors) and movie (n_factors x n) factor matrices with values in [0, 5)
X = 5 * np.random.rand(m, n_factors) 
Y = 5 * np.random.rand(n_factors, n)
X.shape

# compute the error (squared Frobenius norm over rated entries) where
# Q target ratings matrix
# X and Y are the factorized matrices
# W weight matrix
def get_error(Q, X, Y, W):
    return np.sum((W * (Q - np.dot(X, Y)))**2)

print(W)

In [None]:
# non-weighted version of ALS (does not work well!)
# uses all user item values (as opposed to the subset of actual ratings)
n_iterations = 20

errors = []
for i in range(n_iterations):
    X = np.linalg.solve(np.dot(Y, Y.T) + lambda_ * np.eye(n_factors), 
                        np.dot(Y, Q.T)).T
    Y = np.linalg.solve(np.dot(X.T, X) + lambda_ * np.eye(n_factors),
                        np.dot(X.T, Q))
    print(f'{i}th iteration is completed')
    errors.append(get_error(Q, X, Y, W))
Q_hat = np.dot(X, Y)

print('')
print(f'Error of rated movies: {get_error(Q, X, Y, W)}')

In [None]:
import matplotlib.pyplot as plt
# display plots inline (otherwise it will fire up a separate window)
%matplotlib inline
plt.plot(errors);
plt.ylim([0, 20000]);
plt.xlabel("ALS Iteration")
plt.ylabel("Total Squared Error")

In [None]:
weighted_errors = []
for ii in range(n_iterations):
    for u, Wu in enumerate(W):
        #AX=B =>  X=A^-1B ; in python use solve(A, B) 
        X[u] = np.linalg.solve(np.dot(Y, np.dot(np.diag(Wu), Y.T)) + lambda_ * np.eye(n_factors),
                               np.dot(Y, np.dot(np.diag(Wu), Q[u].T))).T
    for i, Wi in enumerate(W.T):
        Y[:,i] = np.linalg.solve(np.dot(X.T, np.dot(np.diag(Wi), X)) + lambda_ * np.eye(n_factors),
                                 np.dot(X.T, np.dot(np.diag(Wi), Q[:, i])))
    weighted_errors.append(get_error(Q, X, Y, W))
    print(f'{ii}th iteration is completed')
weighted_Q_hat = np.dot(X,Y)
print(f'Error of rated movies: {get_error(Q, X, Y, W)}')
plt.plot(weighted_errors);
plt.xlabel('Iteration Number');
plt.ylabel('Total Squared Error');

## Spark - Distributed Implementation

In [None]:
import numpy as np
from numpy.random import rand
from numpy import matrix

In [None]:
def rmse(R, U, V): # Metric
    return np.sqrt(np.sum(np.power(R-U*V, 2))/(U.shape[0]*V.shape[1]))

In [None]:
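def solver(mat, R, LAMBDA):
    """Ridge solve for one row of R: returns (mat^T mat + LAMBDA*d1*I)^-1 mat^T R^T.

    mat is the fixed factor matrix (d1 x d2) and R is a single 1 x d1 row of ratings.
    """
    d1 = mat.shape[0]
    d2 = mat.shape[1]

    X2 = mat.T * mat   # d2 x d2 Gram matrix (np.matrix, so * is a matrix product)
    XY = mat.T * R.T   # d2 x 1 right-hand side

    # add the ridge penalty, scaled by the number of rows d1, to the diagonal
    for j in range(d2):
        X2[j, j] += LAMBDA * d1

    return np.linalg.solve(X2, XY)  # d2 x 1 factor vector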

In [None]:
# Not only is the calculation parallelized, but the data is also partitioned and cached to improve locality.
def closedFormALS(R,InitialU,InitialVt,rank,iterations,numPartitions,LAMBDA=0.01):
    R_Userslice = sc.parallelize(R,numPartitions).cache() # rows of R (one per user) are split across partitions
    R_Itemslice = sc.parallelize(R.T,numPartitions).cache() # rows of R.T (one per item) are split across partitions
    U = InitialU
    Vt = InitialVt
    
    for i in range(iterations):
        
        print(f"Iteration: {i}")
        print(f"RMSE: {rmse(R, U, Vt.T)}")
        
        Vtb = sc.broadcast(Vt)
        U3d = R_Userslice.map(lambda x:solver(Vtb.value,x,LAMBDA)).collect() # a list of (rank x 1) 2-D matrices, one per user
        U = matrix(np.array(U3d)[:, :, 0]) # stacked into a 2-D (numUsers x rank) matrix
        
        Ub = sc.broadcast(U)
        Vt3d = R_Itemslice.map(lambda x:solver(Ub.value,x,LAMBDA)).collect() # a list of (rank x 1) 2-D matrices, one per item
        Vt = matrix(np.array(Vt3d)[:, :, 0])  # stacked into a 2-D (numItems x rank) matrix
    
    return U, Vt 

In [None]:
# Only the calculation is parallelized; R is broadcast whole, so the data transmission cost is not considered
def simpleParalleling(R,InitialU,InitialVt,rank,iterations,numPartitions,LAMBDA=0.01):
    Rb = sc.broadcast(R)
    U = InitialU
    Vt = InitialVt
    Ub = sc.broadcast(U)
    Vtb = sc.broadcast(Vt)
    numUsers = InitialU.shape[0]
    numItems = InitialVt.shape[0]
    
    for i in range(iterations):
        print(f"Iteration: {i}")
        print(f"RMSE: {rmse(R, U, Vt.T)}")
        U3d = sc.parallelize(range(numUsers), numPartitions) \
           .map(lambda x: solver( Vtb.value, Rb.value[x, :],LAMBDA)) \
           .collect() # a list of (rank x 1) 2-D matrices, one per user
        U = matrix(np.array(U3d)[:, :, 0]) # stacked into a 2-D (numUsers x rank) matrix
        Ub = sc.broadcast(U)

        Vt3d = sc.parallelize(range(numItems), numPartitions) \
           .map(lambda x: solver(Ub.value, Rb.value.T[x,:],LAMBDA)) \
           .collect() # a list of (rank x 1) 2-D matrices, one per item
        Vt = matrix(np.array(Vt3d)[:, :, 0]) # stacked into a 2-D (numItems x rank) matrix
        Vtb = sc.broadcast(Vt)
    return U, Vt

In [None]:
def main():
    LAMBDA = 0.01   # regularization parameter
    np.random.seed(100)
    numUsers = 5000
    numItems = 100
    rank = 10
    iterations = 5
    numPartitions = 4

    trueU = matrix(rand(numUsers, rank)) #True matrix U to generate R
    trueV = matrix(rand(rank, numItems)) #True matrix V to generate R
    R = matrix(trueU*trueV)   #generate Rating matrix
    
    InitialU = matrix(rand(numUsers, rank)) #Initialization of U
    InitialVt = matrix(rand(numItems,rank)) #Initialization of V transpose
    
    # use the local `iterations` defined above, not the global `n_iterations` from the single-node cells
    print(f"Running ALS with numUser={numUsers}, numItem={numItems}, rank={rank}, iterations={iterations}, numPartitions={numPartitions}")
    
    print("Distributed Version---Two copies of R, one is partitioned by rowIdx, the other is partitioned by colIndx")
    closedFormALS(R,InitialU,InitialVt,rank,iterations,numPartitions,LAMBDA)
    
    print("Simple paralleling ---Suppose rating matrix R is small enough to be broadcast")
    simpleParalleling(R,InitialU,InitialVt,rank,iterations,numPartitions,LAMBDA)

In [None]:
main()

# Spark ML implementation of ALS

### MLlib Tutorial on Personalized Movie Recommendation
Link to Docs: https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

* What are the parameters for the MLlib implementation of collaborative filtering?
* How do you determine the ‘rank’ of your latent space vectors (i.e. number of latent factors)?
* After training your CF model a new user joins your platform. What will you need to do to generate predictions for that user?

### Some implementation details

Solve for U and V for maxIterations
https://github.com/apache/spark/blob/e1ea806b3075d279b5f08a29fe4c1ad6d3c4191a/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L1001
```
for (iter <- 0 until maxIter) {
    itemFactors = computeFactors(userFactors, userOutBlocks, itemInBlocks, rank, regParam,
          userLocalIndexEncoder, solver = solver)
    userFactors = computeFactors(itemFactors, itemOutBlocks, userInBlocks, rank, regParam,
          itemLocalIndexEncoder, solver = solver)
}
```

where the solver used for the non-negative case is the `ML` NNLSSolver (non-negative least squares solver)     
https://github.com/apache/spark/blob/e1ea806b3075d279b5f08a29fe4c1ad6d3c4191a/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L767

which calls `MLLIB` NNLS Solver...
which implements the [conjugate gradient method](https://en.wikipedia.org/wiki/Conjugate_gradient_method)
https://github.com/apache/spark/blob/e1ea806b3075d279b5f08a29fe4c1ad6d3c4191a/mllib/src/main/scala/org/apache/spark/mllib/optimization/NNLS.scala

# Going further

TODO: Explain why the function is not convex. https://www.quora.com/Why-is-the-matrix-factorization-optimization-function-in-recommender-systems-not-convex

>A function $f(x)$ is said to be convex if it satisfies the following property:

>$f(\alpha x+\beta y)\leq \alpha f(x)+\beta f(y)$, where $\alpha+\beta=1$, $\alpha,\beta\geq 0$, and the domain of $f$ is a convex set.

>A simple matrix factorization based model would predict, say ratings, using the product of item and user latent factors

>$R_{ui}=\langle p_u,q_i\rangle$, where $p_u$ is the user latent factor representation and $q_i$ is the item latent factor representation. The objective function includes this term and therefore is equivalent to minimizing $f(x,y)=xy$

>$f(\alpha x_1+\beta x_2,\alpha y_1+\beta y_2)=(\alpha x_1+\beta x_2)(\alpha y_1+\beta y_2)$. You can easily find a counter example to prove that this function does not satisfy the above property of convexity.
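
One such counterexample can be checked numerically. The two points below are chosen for illustration; any pair on opposite sides of the saddle would work.

```python
def f(x, y):
    """Bilinear term from the factorization objective."""
    return x * y

p1, p2 = (1.0, -1.0), (-1.0, 1.0)
alpha = beta = 0.5

# Midpoint of the segment between p1 and p2 is (0.0, 0.0)
mid = (alpha * p1[0] + beta * p2[0], alpha * p1[1] + beta * p2[1])
lhs = f(*mid)                            # f at the midpoint
rhs = alpha * f(*p1) + beta * f(*p2)     # average of f at the endpoints

# Convexity would require lhs <= rhs
print(lhs, rhs, lhs <= rhs)
```

Here the midpoint value (0) is larger than the average of the endpoint values (-1), so the convexity inequality fails.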

## Acronym Disambiguation
Factoring matrices comes up a lot in the context of ML.

- __SVD__  - “Singular Value Decomposition”
- __PCA__ - “Principal Component Analysis” 
- __FM__ - “Factorization Machine” (one latent vector per user or item)
- __FFM__ - “Field Aware Factorization Machine” (multiple latent vectors depending on the latent space)


ALS is tolerant of missing values.  
SVD requires all values to be present.   
To calculate PCA, one would perform SVD on the mean-centered data matrix. 
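
The PCA and SVD relationship can be sketched as follows, using random toy data. The cross-check against the covariance eigenvalues is included only to show that the two routes agree.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # toy data: 50 samples, 3 features
Xc = X - X.mean(axis=0)               # PCA requires mean-centered columns

# Thin SVD: Xc = U @ diag(s) @ Vt; the rows of Vt are the principal directions
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance = s ** 2 / (X.shape[0] - 1)
scores = Xc @ Vt.T                    # data projected onto the principal axes

# Cross-check against the eigenvalues of the sample covariance matrix
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
print(np.allclose(explained_variance, eigvals))
```

Note that this requires a complete data matrix, which is exactly the limitation of SVD noted above for ratings data.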



### Netflix Recommender system
https://www.youtube.com/watch?v=aKQfUbxU96c

SVD++ (uses both explicit and implicit feedback, takes into account user and item bias)     
Restricted Boltzmann Machine   
Nuclear Norm -> $||A||_{nuclear} = \sigma_1 + \sigma_2 + ... + \sigma_r$
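
As a quick check of the nuclear norm definition, the sketch below sums the singular values of a small made-up matrix and compares the result with NumPy's built-in `ord="nuc"` norm.

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 4.0],
              [0.0, 0.0]])

# Nuclear norm = sum of the singular values
singular_values = np.linalg.svd(A, compute_uv=False)
nuclear = singular_values.sum()

print(nuclear, np.linalg.norm(A, ord="nuc"))
```

The nuclear norm is often used as a convex surrogate for matrix rank, which is why it shows up in matrix completion.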

Class for training explicit matrix factorization model using either ALS or SGD

https://gist.github.com/EthanRosenthal/a293bfe8bbe40d5d0995

## RS as a bipartite graph - how do we solve this in GraphX?

## Deep-Learning MF
In recent years a number of neural and deep-learning techniques have been proposed, some of which generalize traditional Matrix factorization algorithms via a non-linear neural architecture [15]. While deep learning has been applied to many different scenarios: context-aware, sequence-aware, social tagging etc. its real effectiveness when used in a simple Collaborative filtering scenario has been put into question. A systematic analysis of publications applying deep learning or neural methods to the top-k recommendation problem, published in top conferences (SIGIR, KDD, WWW, RecSys), has shown that on average less than 40% of articles are reproducible, with as little as 14% in some conferences. Overall the study identifies 18 articles, only 7 of them could be reproduced and 6 of them could be outperformed by much older and simpler properly tuned baselines. The article also highlights a number of potential problems in today's research scholarship and calls for improved scientific practices in that area.[16] Similar issues have been spotted also in sequence-aware recommender systems.[17] https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems)