**Matrix factorization** is a class of collaborative filtering algorithms used in recommender systems. It approximates a given rating matrix as a product of two lower-rank matrices.
It decomposes a rating matrix R (n x m) into a product of two matrices U (n x k) and V (m x k), where k is the number of latent features.

\begin{equation*}
\mathbf{R}_{n \times m} \approx \mathbf{\hat{R}} = 
\mathbf{U}_{n \times k} \times \mathbf{V}_{m \times k}^T
\end{equation*}
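
As a rough illustration of the shapes involved, the NumPy sketch below multiplies two small random factor matrices. The sizes `n`, `m`, `k` and the names `user_factors` and `item_factors` are made up for this example and are not taken from the Jester data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 4, 6, 2                          # hypothetical: 4 users, 6 items, 2 latent features

user_factors = rng.normal(size=(n, k))     # one row of k latent features per user
item_factors = rng.normal(size=(m, k))     # one row of k latent features per item

# the approximated rating matrix is the product of the two factor matrices
R_hat = user_factors @ item_factors.T

# a single predicted rating is the dot product of a user row and an item row
pred_2_3 = user_factors[2] @ item_factors[3]

print(R_hat.shape)                         # (4, 6)
print(np.isclose(R_hat[2, 3], pred_2_3))   # True
print(np.linalg.matrix_rank(R_hat))        # at most k
```

The rank of the product can never exceed `k`, which is why the factorization is called low-rank.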

**Additional note**

If you would like to learn more about the importance of feature selection in machine learning, see my blog post linked below.

https://www.analyticsvidhya.com/blog/2020/10/a-comprehensive-guide-to-feature-selection-using-wrapper-methods-in-python/

### Notebook - Table of Contents

1. [**Importing the necessary libraries**](#1.-Importing-the-necessary-libraries)   
2. [**Loading the data into PySpark dataframes**](#2.-Loading-the-data-into-PySpark-dataframes) 
3. [**Basic data exploration**](#3.-Basic-data-exploration)  
    3.1 [**Total number of users, jokes and ratings**](#3.1-Total-number-of-users,-jokes-and-ratings)  
    3.2 [**Distribution of ratings**](#3.2-Distribution-of-ratings)  
    3.3 [**Ratings per user**](#3.3-Ratings-per-user)      
    3.4 [**Ratings per joke**](#3.4-Ratings-per-joke)  
4. [**Train-test split**](#4.-Train-test-split)  
5. [**ALS based recommendation**](#5.-ALS-based-recommendation)  
    5.1 [**Analysing the model**](#5.1-Analysing-the-model)     
    5.2 [**Evaluating the results**](#5.2-Evaluating-the-results)  
    5.3 [**Hyperparameter tuning**](#5.3-Hyperparameter-tuning)
6. [**Additional performance measures for Recommendation**](#6.-Additional-performance-measures-for-Recommendation)      
    6.1 [**Precision and Recall**](#6.1-Precision-and-Recall)    
7. [**Handling Cold Start problem**](#7.-Handling-Cold-Start-problem)        

### 1. Importing the necessary libraries

In [None]:
!pip install pyspark

In [None]:
from pyspark import SparkContext, SQLContext   # required for dealing with dataframes
from pyspark.sql.functions import isnan, count, col
from pyspark.ml.evaluation import RegressionEvaluator
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.ml.recommendation import ALS      # for Matrix Factorization using ALS 

In [None]:
sc = SparkContext()      # instantiating spark context 
sqlContext = SQLContext(sc) # instantiating SQL context 

### 2. Loading the data into PySpark dataframes

In [None]:
jester_ratings_df = sqlContext.read.csv("/kaggle/input/jester-17m-jokes-ratings-dataset/jester_ratings.csv", header=True, inferSchema=True)
jester_items_df = sqlContext.read.csv("/kaggle/input/jester-17m-jokes-ratings-dataset/jester_items.csv", header=True, inferSchema=True)

In [None]:
print("Ratings dataset shape:", (jester_ratings_df.count(), len(jester_ratings_df.columns)))
jester_ratings_df.show(5)

In [None]:
df = pd.read_csv("/kaggle/input/jester-17m-jokes-ratings-dataset/jester_ratings.csv")

In [None]:
df2 = pd.read_csv("/kaggle/input/jester-17m-jokes-ratings-dataset/jester_items.csv")

In [None]:
df2["jokeId"].nunique()

In [None]:
df.dtypes

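In [None]:
# joke ids that appear in the items file but have no entry in the ratings file
set(df2["jokeId"].unique().tolist()) - set(df["jokeId"].unique().tolist())

In [None]:
# check whether any ratings exist for these joke ids
df[df["jokeId"].isin([1, 2, 3, 4, 6, 9, 10, 11, 12, 14])]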

In [None]:
df["rating"].max()

### 3. Basic data exploration

#### 3.1 Total number of users, jokes and ratings

In [None]:
print("Number of unique users: ", jester_ratings_df.select("userId").distinct().count())
print("Number of unique jokes: ", jester_ratings_df.select("jokeId").distinct().count())
print("Total number of ratings: ", jester_ratings_df.count())

#### 3.2 Distribution of ratings

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
ax.set_title('Ratings distribution', fontsize=15)
# sns.distplot is deprecated; histplot draws the same histogram without a KDE curve
sns.histplot(jester_ratings_df.select("rating").toPandas()["rating"], bins=8, edgecolor="k", linewidth=2, ax=ax)
ax.set_xlabel("Rating interval")
ax.set_ylabel("Total number of ratings")

#### 3.3 Ratings per user

In [None]:
ratings_per_user = jester_ratings_df.groupby('userId').agg({"rating":"count"})
ratings_per_user.describe().show()

* Minimum number of ratings given by a user = **1**
* Maximum number of ratings given by a user = **140**
* Average number of ratings per user = **30** (after rounding)

#### 3.4 Ratings per joke

In [None]:
ratings_per_joke = jester_ratings_df.groupby('jokeId').agg({"rating":"count"})
ratings_per_joke.describe().show()

* Minimum number of ratings received by a joke = **166**
* Maximum number of ratings received by a joke = **59122**
* Average number of ratings per joke = **12582** (after rounding)

### 4. Train-test split

In [None]:
X_train, X_test = jester_ratings_df.randomSplit([0.9, 0.1], seed=0)   # 90:10 ratio, seeded so the split can be repeated
print("Training data size: ", X_train.count())
print("Test data size: ", X_test.count())
print("Number of unique users in training set: ", X_train[["userId"]].distinct().count())
print("Number of unique users in test set: ", X_test[["userId"]].distinct().count())

### 5. ALS based recommendation

In [None]:
als = ALS(userCol="userId",itemCol="jokeId",ratingCol="rating",rank=5, maxIter=10, seed=0)
model = als.fit(X_train)

In [None]:
# displaying the latent features for five users
model.userFactors.show(5, truncate = False)  

In [None]:
model.transform(X_test).show(5)

#### 5.1 Analysing the model

It is common for the test dataset to contain users and/or items that were not part of the training dataset. For such records, the transform() method of the ALS model returns **NaN** predictions (the default `coldStartStrategy` is `"nan"`).

In [None]:
model.transform(X_test).where(isnan('prediction')).show(5)

In [None]:
X_train[X_train.userId.isin([24578,54401,63338,19639,479])].show()

In this run, the users with IDs [24578,54401,63338,19639,479] from the test set do not appear in the training dataset. The trained model therefore has no latent factors for these users, and the transform() method returns **NaN** predictions for them.
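
The behaviour described above can be mimicked in a few lines of plain Python. This is only a toy sketch: the factor values, the ids and the helper `predict` are hypothetical and are not part of the PySpark API.

```python
import math

# hypothetical latent factors learned for the users and jokes seen during training
toy_user_factors = {1: [0.5, 1.0], 2: [1.5, -0.5]}
toy_joke_factors = {10: [2.0, 1.0], 11: [0.0, 3.0]}

def predict(user_id, joke_id):
    """Dot product of the two factor vectors, or NaN if either id was never seen in training."""
    u = toy_user_factors.get(user_id)
    v = toy_joke_factors.get(joke_id)
    if u is None or v is None:
        return float("nan")
    return sum(a * b for a, b in zip(u, v))

test_pairs = [(1, 10), (2, 11), (3, 10)]     # user 3 is absent from the training factors
toy_predictions = [(u, j, predict(u, j)) for u, j in test_pairs]

# dropping the NaN rows before computing a metric
usable = [p for p in toy_predictions if not math.isnan(p[2])]
print(toy_predictions)
print(usable)
```

In Spark, a similar effect is obtained either by calling `.na.drop()` on the transformed dataframe or by passing `coldStartStrategy="drop"` to `ALS`, both of which are used later in this notebook.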

In [None]:
# total number of NaN predictions
model.transform(X_test).where(isnan('prediction')).count()

In [None]:
model.transform(X_test[["userId","jokeId"]]).na.drop()[["prediction"]].show()

#### 5.2 Evaluating the results

In [None]:
evaluator=RegressionEvaluator(metricName="rmse",labelCol="rating",predictionCol="prediction")

In [None]:
train_predictions = model.transform(X_train)
test_predictions = model.transform(X_test).na.drop()
print("RMSE on training data : ", evaluator.evaluate(train_predictions))
print("RMSE on test data: ", evaluator.evaluate(test_predictions))
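
For reference, the `rmse` metric used above is the root mean squared error over the N (user, joke) pairs that have a prediction. The symbols below are generic notation chosen for this note, not names from the notebook: r is the observed rating and r-hat is the predicted rating.

\begin{equation*}
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{(u,j)} \left( r_{uj} - \hat{r}_{uj} \right)^2}
\end{equation*}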

#### 5.3 Hyperparameter tuning

In [None]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [None]:
# estimator to tune; coldStartStrategy="drop" discards NaN predictions so that RMSE is defined in every fold
als_cv = ALS(userCol="userId", itemCol="jokeId", ratingCol="rating", coldStartStrategy="drop")

In [None]:
# the grid refers to the params of the estimator instance (als_cv.rank) rather than the class (ALS.rank),
# so that the values are matched to the estimator being tuned; shrink the lists for a quicker run
params = ParamGridBuilder().addGrid(als_cv.rank, [5, 6, 7, 8, 9, 10]) \
        .addGrid(als_cv.regParam, [0.001, 0.01, 0.1, 1, 10]) \
        .build()

In [None]:
cv = CrossValidator(estimator=als_cv, estimatorParamMaps=params, evaluator=evaluator)
#cv = cv.setNumFolds(10).setSeed(0)
cv = cv.fit(X_train)   # cv now refers to the fitted CrossValidatorModel

In [None]:
cv.avgMetrics

In [None]:
# RMSE of the best cross-validated model on the test set; coldStartStrategy="drop" already removes NaN predictions
evaluator.evaluate(cv.transform(X_test))

In [None]:
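# `predictions` is not defined earlier in the notebook; it is assumed here to be the ALS model's output for the test (userId, jokeId) pairs
predictions = model.transform(X_test[["userId", "jokeId"]])
predictions.show(5)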

In [None]:
X_train.where((X_train.userId == 5518) & (X_train.jokeId==148)).show()

In [None]:
# joining X_test with the predictions dataframe and dropping the records for which no prediction was made
ratesAndPreds = X_test.join(other=predictions, on=['userId', 'jokeId'], how='inner').na.drop()
ratesAndPreds.show(5)

#### Step 5. Evaluating the model

In [None]:
# converting the columns into numpy arrays for direct and easy calculations 
rating = np.array(ratesAndPreds.select("rating").collect()).ravel()
prediction = np.array(ratesAndPreds.select("prediction").collect()).ravel()
print("RMSE : ", np.sqrt(np.mean((rating - prediction)**2)))

#### Step 6. Recommending jokes

In [None]:
# recommending, for every user, the 3 jokes with the highest predicted rating
model.recommendForAllUsers(3).show(5,truncate = False)
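
Conceptually, a top-3 recommendation amounts to scoring every (user, joke) pair with the factor dot product and keeping the best three per user. The NumPy sketch below illustrates that idea on made-up factor matrices; it is not how Spark implements `recommendForAllUsers` internally, and all names and values are hypothetical.

```python
import numpy as np

# hypothetical factor matrices: 3 users and 5 jokes, 2 latent features each
toy_user_factors = np.array([[1.0, 0.0],
                             [0.0, 1.0],
                             [1.0, 1.0]])
toy_joke_factors = np.array([[0.5, 0.1],
                             [0.9, 0.3],
                             [0.2, 0.8],
                             [0.4, 0.4],
                             [0.1, 0.6]])

# predicted rating for every (user, joke) pair
scores = toy_user_factors @ toy_joke_factors.T

top_k = 3
# row indices of the k highest-scoring jokes per user, best first
top_jokes = np.argsort(-scores, axis=1)[:, :top_k]
print(top_jokes)
```

Each row of `top_jokes` lists joke row indices for one user, ordered from the highest predicted score downwards.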

In [None]:
model.recommendForAllUsers(3).count()