# Part 2: Predicting user ratings

In Part 1, we explored item-item recommendations, which recommend new items that are similar to a given item. An example of an item-item recommendation is Netflix's "Because you watched Movie X", which recommends movies based on a movie (Movie X) that you recently watched.

In this tutorial, we will explore user-item recommendations. This approach is a more personalized type of recommendation which specifically predicts a user's degree of preference (e.g., rating) towards a given item. We will use a technique called **matrix factorization**. Let's get started!

In [2]:
import pandas as pd
import numpy as np

### Import Dataset

In [3]:
ratings = pd.read_csv("data/ratings.csv")
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


### Exploratory Data Analysis

For a more comprehensive exploratory data analysis, please see `Exploratory Data Analysis` in Part 1's notebook. 

In [5]:
n_ratings = len(ratings)
n_movies = ratings['movieId'].nunique()
n_users = ratings['userId'].nunique()

print(f"Number of ratings: {n_ratings}")
print(f"Number of unique movieId's: {n_movies}")
print(f"Number of unique users: {n_users}")
print(f"Average number of ratings per user: {round(n_ratings/n_users, 2)}")
print(f"Average number of ratings per movie: {round(n_ratings/n_movies, 2)}")

Number of ratings: 100836
Number of unique movieId's: 9724
Number of unique users: 610
Average number of ratings per user: 165.3
Average number of ratings per movie: 10.37


### Transforming the data

We have created a helper function called `create_X` that transforms the `ratings` dataframe into an n_movies $\times$ n_users sparse matrix (rows are movies and columns are users, as the shape printed below shows). The function returns this user-item matrix as a `scipy.sparse.csr_matrix`, along with four dictionaries that map IDs to indices (and vice versa):

- **user_mapper:** user_id $\longrightarrow$ user_index
- **movie_mapper:** movie_id $\longrightarrow$ movie_index 
- **user_inv_mapper:** user_index $\longrightarrow$ user_id
- **movie_inv_mapper:** movie_index $\longrightarrow$ movie_id 
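
The source of `create_X` is not shown in this notebook, so the following is only a minimal sketch of what such a helper might look like. The name `create_X_sketch` is hypothetical, and the choice of movies as rows and users as columns is an assumption made to match the shape printed for `X` in this notebook; the real `utils.create_X` may differ.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def create_X_sketch(df):
    """Hypothetical stand-in for utils.create_X (the real helper may differ)."""
    user_ids = np.unique(df["userId"])
    movie_ids = np.unique(df["movieId"])

    # ID -> index lookups, plus their inverses (index -> ID)
    user_mapper = {uid: i for i, uid in enumerate(user_ids)}
    movie_mapper = {mid: i for i, mid in enumerate(movie_ids)}
    user_inv_mapper = {i: uid for uid, i in user_mapper.items()}
    movie_inv_mapper = {i: mid for mid, i in movie_mapper.items()}

    # Assumption: row index = movie, column index = user
    rows = df["movieId"].map(movie_mapper).to_numpy()
    cols = df["userId"].map(user_mapper).to_numpy()
    X = csr_matrix(
        (df["rating"].to_numpy(), (rows, cols)),
        shape=(len(movie_ids), len(user_ids)),
    )
    return X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper


# Tiny made-up ratings table, for illustration only
toy = pd.DataFrame({
    "userId": [1, 1, 7],
    "movieId": [10, 50, 50],
    "rating": [4.0, 5.0, 3.0],
})
X_toy, u_map, m_map, u_inv, m_inv = create_X_sketch(toy)
print(X_toy.shape)                 # 2 movies x 2 users
print(X_toy[m_map[50], u_map[7]])  # rating that user 7 gave movie 50
```

The mappers exist because raw IDs are not contiguous (movie IDs such as 47 and 50 appear in the data above), whereas matrix rows and columns must be numbered 0, 1, 2, and so on.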


In [11]:
from utils import create_X

X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper = create_X(ratings)

In [12]:
X.shape

(9724, 610)

In [13]:
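# Share of cells that hold a rating: non-zero entries / total number of cells.
# Strictly speaking this is the matrix density; the smaller it is, the sparser the matrix.
sparsity = X.count_nonzero()/(X.shape[0]*X.shape[1])

print("Matrix sparsity: {0:.2f}%".format(sparsity*100))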

Matrix sparsity: 1.70%


As you can see, the user-item matrix is quite sparse: only 1.70% of its cells contain a rating. In other words, there are few ratings relative to the total number of possible user-movie pairs, which makes sense given that a user typically rates only a small fraction of the movies in the dataset. If a user hasn't rated a particular movie, that interaction is an empty cell or a "missing" value. We are going to predict the ratings that would replace the missing values in this matrix using a technique called matrix factorization.
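
The percentage printed above is the share of filled cells: 100836 ratings divided by 9724 $\times$ 610 cells is roughly 0.017. As a small illustration of the same arithmetic, here is a sketch on a made-up 3 $\times$ 4 matrix (the variable names are chosen for this example only):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up matrix with 3 ratings in 12 cells
toy = csr_matrix(np.array([
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
]))

n_cells = toy.shape[0] * toy.shape[1]
density = toy.count_nonzero() / n_cells  # share of filled cells: 3 / 12
empty_share = 1 - density                # share of missing cells

print(f"Filled: {density:.2%}, empty: {empty_share:.2%}")
# Filled: 25.00%, empty: 75.00%
```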

### What is matrix factorization?

Matrix factorization is a concept from **linear algebra** that decomposes a large matrix into a product of smaller matrices. In this case, we want to decompose our user-item matrix into two matrices, with the goal of condensing it into $k$ more meaningful latent features:

<img src="images/marix-factorization.png"/>


We are going to predict a user's rating of a movie using matrix factorization. 
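
In symbols, the idea can be sketched as follows. The notation is introduced here for illustration and is not taken from the figure: $X$ is the ratings matrix with $m$ rows and $n$ columns, $Z$ holds $k$ latent features for each row of $X$, and $W$ holds $k$ latent features for each column of $X$.

$$
X_{m \times n} \approx Z_{m \times k} \, W_{k \times n},
\qquad
\hat{x}_{ij} = \sum_{f=1}^{k} z_{if} \, w_{fj}
$$

The predicted rating $\hat{x}_{ij}$ is the dot product of row $i$ of $Z$ and column $j$ of $W$. With scikit-learn's `TruncatedSVD`, $Z$ is what `fit_transform` returns and $W$ corresponds to the fitted `components_` attribute.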



In [14]:
from sklearn.decomposition import TruncatedSVD

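# Keep 200 latent features; Z has one row per row of X and 200 columns
svd = TruncatedSVD(n_components=200, n_iter=10, random_state=42)
Z = svd.fit_transform(X)
# Map Z back to the original shape to get a dense matrix of predicted ratings
new_X = svd.inverse_transform(Z)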
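
To make the reconstruction step less opaque, the sketch below checks on a small made-up matrix that `inverse_transform` is the matrix product of the latent features with `svd.components_`. The toy data and the `_toy` variable names are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Made-up sparse "ratings": 30 rows, 12 columns, about 30% filled
rng = np.random.default_rng(0)
dense = rng.integers(1, 6, size=(30, 12)).astype(float)
dense[rng.random((30, 12)) > 0.3] = 0.0
X_toy = csr_matrix(dense)

svd_toy = TruncatedSVD(n_components=4, n_iter=10, random_state=42)
Z_toy = svd_toy.fit_transform(X_toy)      # latent features per row: (30, 4)
W_toy = svd_toy.components_               # latent features per column: (4, 12)
recon = svd_toy.inverse_transform(Z_toy)  # dense reconstruction: (30, 12)

# The reconstruction equals the product of the two factors
same = np.allclose(recon, Z_toy @ W_toy)
print(Z_toy.shape, W_toy.shape, recon.shape, same)
```

Because the reconstruction is dense, every cell gets a value, including the cells that were empty in the original matrix. Those values are the predicted ratings.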

In [21]:
new_X.shape

(9724, 610)

In [28]:
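# Row indices of the 5 movies with the highest predicted rating for the user in column 0
# (argsort is ascending, so the last index is the top prediction)
new_X[:,0].argsort()[-5:]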

array([ 862,  989,  899, 1223,   46])

In [29]:
# Predicted rating for the movie at row index 862 and the user at column index 0
new_X[862,0]

5.470953683291143
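
Two things are worth noting about the cells above. First, the top predictions may include movies the user has already rated. Second, a reconstructed value can exceed the highest rating that appears in the data (5.0), as the 5.47 above does. The sketch below shows one possible way to handle both on made-up data; `top_n_unrated` is a hypothetical helper, not part of `utils`.

```python
import numpy as np
from scipy.sparse import csr_matrix


def top_n_unrated(pred, X, user_idx, n=5):
    """Hypothetical helper: row indices of the n best-predicted unrated movies."""
    scores = np.asarray(pred[:, user_idx], dtype=float).copy()
    # Movies the user has already rated are the non-zero cells of their column
    rated = X[:, user_idx].toarray().ravel() != 0
    scores[rated] = -np.inf                  # never recommend a rated movie
    n = min(n, int((~rated).sum()))
    return np.argsort(scores)[::-1][:n]      # highest prediction first


# Made-up data: 4 movies (rows) x 2 users (columns)
X_toy = csr_matrix(np.array([
    [5.0, 0.0],
    [0.0, 3.0],
    [0.0, 0.0],
    [4.0, 0.0],
]))
pred_toy = np.array([
    [4.9, 1.0],
    [3.2, 3.1],
    [5.4, 2.0],
    [4.5, 0.5],
])

top = top_n_unrated(pred_toy, X_toy, user_idx=0, n=2)
print(top)  # [2 1]

# Optionally clip predictions to the largest rating seen in the data
clipped = np.clip(pred_toy, None, 5.0)
```

The returned values are row indices, so `movie_inv_mapper` is still needed to translate them back into movie IDs.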

### Hyperparameter Tuning

How many latent features (`n_components`) do we select for our model? Though it isn't possible to calculate the optimal `n_components` with an equation, we can attempt to identify the best value empirically by testing different values of `n_components` and seeing which one scores best.
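
One possible shape for such an experiment is sketched below on synthetic data. Several choices here are assumptions rather than the tutorial's method: RMSE as the metric, a single random 80/20 hold-out of the observed ratings instead of full cross-validation, and the candidate values of `n_components`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Synthetic low-rank "ratings" (assumption: stands in for the real matrix X)
n_movies, n_users, true_k = 60, 40, 3
full = rng.random((n_movies, true_k)) @ rng.random((true_k, n_users))
observed = rng.random((n_movies, n_users)) < 0.5  # which cells are "rated"

# Hold out roughly 20% of the observed cells for validation
rows, cols = np.nonzero(observed)
is_val = rng.random(rows.size) < 0.2
tr_r, tr_c = rows[~is_val], cols[~is_val]
va_r, va_c = rows[is_val], cols[is_val]

# Training matrix contains only the non-held-out ratings
X_train = csr_matrix((full[tr_r, tr_c], (tr_r, tr_c)), shape=full.shape)

rmse_by_k = {}
for k in (2, 5, 10, 20):
    svd_k = TruncatedSVD(n_components=k, n_iter=10, random_state=42)
    pred = svd_k.inverse_transform(svd_k.fit_transform(X_train))
    # Score only on the held-out cells, which the model never saw
    err = pred[va_r, va_c] - full[va_r, va_c]
    rmse_by_k[k] = float(np.sqrt(np.mean(err ** 2)))

best_k = min(rmse_by_k, key=rmse_by_k.get)
print(rmse_by_k, best_k)
```

Scoring on held-out ratings matters because a larger `n_components` always reconstructs the training matrix more closely, so training error alone would favor the largest value tried.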

#### 1. Identifying a robust evaluation metric

#### 2. Splitting our data into training and validation sets

#### 3. Performing cross-validation