# Recommendation - Data Preparation 🎬

---

<img src="https://cdn-images-1.medium.com/max/1200/0*ePGWILY6GyplT-nn" />

---

In the next few challenges, you will build a powerful **movie recommender**.

We will use the open-source library [LightFM](https://github.com/lyst/lightfm), which provides an easy-to-use Python implementation of **hybrid** recommendation engines.

In this first part, we will prepare the data so that the model can be trained efficiently.

Start by loading the `movies` and `ratings` data, downloaded from the **small** [MovieLens dataset](https://grouplens.org/datasets/movielens/).



In [2]:
### TODO: Load the movies and ratings datasets

import pandas as pd
movies = pd.read_csv("ml-latest-small/movies.csv")
ratings = pd.read_csv("ml-latest-small/ratings.csv")

print(movies.head())
print(ratings.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


**Q1**. What are the different types of recommendation models? Explain briefly with your own words the differences between them.

Answer:

Content-based Filtering
In content-based filtering, the model recommends items that are similar, in terms of their attributes, to those a user liked in the past.

Collaborative Filtering
There are two approaches to collaborative filtering: one based on users, the other based on items.
The key idea behind collaborative filtering is that similar users share the same interests, and that items similar (in terms of ratings/interactions) to those a user liked will also be liked by that user.

Hybrid System
Many large tech companies use hybrid systems that combine both filtering methods (collaborative + content-based).
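
As a rough illustration of the difference, the sketch below compares the two notions of item similarity on toy data invented for this example. Content-based similarity is computed from genre flags, while item-based collaborative similarity is computed from which users interacted with which movies. The helper `cosine_similarity` and both arrays are assumptions and are not part of the challenge code.

```python
import numpy as np

def cosine_similarity(matrix):
    """Pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = matrix / norms
    return unit @ unit.T

# Toy item attributes (made up): columns are Comedy, Romance, Action
item_features = np.array([
    [1, 1, 0],   # movie 0: Comedy|Romance
    [1, 0, 0],   # movie 1: Comedy
    [0, 0, 1],   # movie 2: Action
], dtype=float)

# Toy interactions (made up): rows are 4 users, columns the same 3 movies
interactions = np.array([
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
], dtype=float)

# Content-based: movies are similar when their attributes are similar
content_sim = cosine_similarity(item_features)

# Item-based collaborative: movies are similar when the same users interacted with them
collab_sim = cosine_similarity(interactions.T)

print(np.round(content_sim, 2))
print(np.round(collab_sim, 2))
```

In this toy setup, movie 0 is closest to movie 1 by content (they share the Comedy genre) but closest to movie 2 by collaborative signal (the same users interacted with both). A hybrid model can draw on both kinds of information.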

In [4]:
print("See the lecture")

See the lecture


**Q1bis**. What data does the LightFM `fit` method expect? In particular, how should the training data be organized, and what type should the training dataset be?

Answer:
The LightFM `fit` method expects the training data as a (no_users, no_items) sparse matrix, with 1s denoting positive interactions and -1s denoting negative interactions.
When only this interactions matrix is provided, LightFM reduces to a traditional collaborative filtering matrix factorization method.
User and item features can be incorporated into training by passing them to the `fit` method. This is what makes the model hybrid: it balances user behavior and attribute preferences.

In [3]:
print("training data is a (no_users, no_items) sparse matrix (with 1s denoting positive, \
and -1s negative interactions)")

training data is a (no_users, no_items) sparse matrix (with 1s denoting positive, and -1s negative interactions)
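
To make the expected layout concrete, here is a minimal sketch that builds a small (no_users, no_items) sparse matrix with SciPy. The indices and values are invented for illustration, and the snippet deliberately stops before calling LightFM, so it only shows the shape and type of the data, not the training step.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy interactions (made up): one entry per (user, item) pair
user_idx = np.array([0, 0, 1, 2])
item_idx = np.array([0, 2, 1, 2])
# +1 for a positive interaction, -1 for a negative one
signal = np.array([1, -1, 1, 1], dtype=np.float32)

no_users, no_items = 3, 4
train = coo_matrix((signal, (user_idx, item_idx)), shape=(no_users, no_items))

print(train.shape)   # (3, 4)
print(train.nnz)     # 4 stored interactions
print(train.toarray())
```

Only the four listed pairs are stored; every other cell is an implicit zero, meaning "no interaction observed".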


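**Q2**. Explore `movies` and `ratings`. What do these datasets contain? How are they organized?

Answer:
`movies` has one row per movie, with the columns `movieId`, `title` and `genres` (genres are separated by `|`).
`ratings` has one row per rating, with the columns `userId`, `movieId`, `rating` and `timestamp`. It links to `movies` through `movieId`.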

In [3]:
print(movies.shape)
movies.head()

(9742, 3)


   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


In [7]:
print(ratings.shape)
print(ratings.head())

len(ratings['userId'].unique())

(100836, 4)
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


610

---

### Q3 & Q4 are optional

We created a few utility functions for you in the `utils.py` script. In particular:
- `threshold_interactions_df`:
> Limit interactions df to minimum row and column interactions

**Q3**. Open the `src/utils.py` file and have a look at the documentation of this function to understand its goal and how it works.

Have a look at the code to fully understand how it works. You should be familiar with everything.

What does the variable `sparsity` represent? What range of values can it take?

In [8]:
print("sparsity corresponds to the percentage of existing ratings over "
      "the total number of possible interactions.\n"
      "It ranges from 0, when no user has rated any movie, "
      "to 1 (100%), when every user has rated every movie.")

sparsity corresponds to the percentage of existing ratings over the total number of possible interactions.
It ranges from 0, when no user has rated any movie, to 1 (100%), when every user has rated every movie.
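
As a quick check of this definition, the sketch below recomputes the ratio from the counts reported in this notebook's outputs (610 users, 9724 movies, 100836 ratings). The helper name `interaction_ratio` is hypothetical and is not taken from `utils.py`.

```python
def interaction_ratio(n_interactions, n_rows, n_cols):
    """Share of (row, column) pairs that actually have an interaction."""
    return n_interactions / (n_rows * n_cols)

n_users = 610        # unique userId values in ratings
n_movies = 9724      # unique movieId values in ratings
n_ratings = 100836   # number of rows in ratings

sparsity_value = interaction_ratio(n_ratings, n_users, n_movies)
print("Sparsity: {:.3%}".format(sparsity_value))  # Sparsity: 1.700%
```

In other words, fewer than 2 out of every 100 possible user/movie pairs have a rating.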


**Q4**. Create a new DataFrame `ratings_thresh` that filters `ratings` to keep only:
- users that rated strictly more than 4 movies
- movies that have been rated at least 10 times

How many users/movies remain in this new dataset?

In [9]:
from utils import threshold_interactions_df
# Threshold data to only include:
# - users that rated >= 5 movies
# - movies that have been rated by >= 10 users
ratings_thresh = threshold_interactions_df(ratings, 'userId', 'movieId', 5, 10)
print("With these new conditions, {} users and {} movies remain.".format(
    len(ratings_thresh.userId.unique()),
    len(ratings_thresh.movieId.unique())))

Starting interactions info
Number of rows: 610
Number of cols: 9724
Sparsity: 1.700%
Ending interactions info
Number of rows: 610
Number of columns: 3650
Sparsity: 4.055%
With these new conditions, 610 users and 3650 movies remain.
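
For intuition about what such a thresholding step involves, here is a minimal pandas sketch. It is a hypothetical re-implementation, not the code from `utils.py`: the function name `threshold_sketch`, the toy data, and the choice to repeat the filtering until nothing changes are all assumptions, so its behavior may differ from `threshold_interactions_df`.

```python
import pandas as pd

def threshold_sketch(df, row_name, col_name, row_min, col_min):
    """Keep only rows/columns with at least row_min/col_min interactions."""
    while True:
        n_before = len(df)
        # Drop columns (movies) with fewer than col_min interactions
        col_counts = df.groupby(col_name)[row_name].transform("count")
        df = df[col_counts >= col_min]
        # Drop rows (users) with fewer than row_min interactions
        row_counts = df.groupby(row_name)[col_name].transform("count")
        df = df[row_counts >= row_min]
        # Stop once a full pass removes nothing
        if len(df) == n_before:
            return df

# Toy data (made up): user 4 rated movie 1 and the rarely rated movie 4
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 4, 4],
    "movieId": [1, 2, 3, 1, 2, 1, 3, 1, 4],
})

filtered = threshold_sketch(toy, "userId", "movieId", row_min=2, col_min=2)
print(sorted(filtered.userId.unique().tolist()))   # [1, 2, 3]
print(sorted(filtered.movieId.unique().tolist()))  # [1, 2, 3]
```

Removing movie 4 leaves user 4 with a single rating, so user 4 is removed as well. This cascading effect is why the sketch loops until the DataFrame is stable.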


In [10]:
ratings_thresh.head()

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


**Q5**. In order to fit a [LightFM](https://lyst.github.io/lightfm/docs/home.html) model, we need to transform our DataFrame into a sparse matrix (see below). This is not straightforward, so we included the function `df_to_matrix` in `utils.py`.

> 🔦 **Hint**:  Sparse matrices are just **big matrices with a lot of zeros or empty values**.
> 
> Tools such as pandas DataFrames or NumPy arrays are not well suited to this kind of data, so we will use [Scipy sparse matrices](https://docs.scipy.org/doc/scipy-0.14.0/reference/sparse.html).
>
> There are many different "types" of sparse matrices (CSC, CSR, COO, DIA, etc.). You don't need to know them all. Just know that they are different storage formats, each with its own methods for manipulation, slicing, indexing, etc.

> 🔦 **Hint 2**:  By going from a DataFrame to a sparse matrix, you lose the ids (userId and movieId) and only deal with indices (row number and column number). Therefore, the `df_to_matrix` function also returns dictionaries mapping ids to indices and back (e.g. `uid_to_idx` maps a userId to its row index in the matrix).


Have a look at the utility function's documentation, and use it to create 5 new variables:
- a final sparse matrix `ratings_matrix` (this will be the data used to train the model)
- the following mapping dictionaries:
    - `uid_to_idx`
    - `idx_to_uid`
    - `mid_to_idx`
    - `idx_to_mid`
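
To illustrate the idea described in Hint 2, here is a minimal sketch of how a DataFrame could be turned into a sparse matrix together with its id/index mappers. The function `df_to_matrix_sketch` is a hypothetical stand-in written for this example. It assumes binary interactions and one row per (user, movie) pair, and it is not the implementation found in `utils.py`.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def df_to_matrix_sketch(df, row_name, col_name):
    """Build a binary sparse matrix plus id <-> index dictionaries."""
    row_ids = sorted(df[row_name].unique().tolist())
    col_ids = sorted(df[col_name].unique().tolist())
    rid_to_idx = {rid: idx for idx, rid in enumerate(row_ids)}
    idx_to_rid = {idx: rid for rid, idx in rid_to_idx.items()}
    cid_to_idx = {cid: idx for idx, cid in enumerate(col_ids)}
    idx_to_cid = {idx: cid for cid, idx in cid_to_idx.items()}

    # Translate ids into matrix coordinates
    rows = df[row_name].map(rid_to_idx).to_numpy()
    cols = df[col_name].map(cid_to_idx).to_numpy()
    data = np.ones(len(df))  # 1.0 wherever an interaction exists
    matrix = csr_matrix((data, (rows, cols)),
                        shape=(len(row_ids), len(col_ids)))
    return matrix, rid_to_idx, idx_to_rid, cid_to_idx, idx_to_cid

# Toy data (made up, loosely modeled on the ratings columns)
toy = pd.DataFrame({
    "userId":  [1, 1, 4, 4],
    "movieId": [1, 3, 21, 32],
    "rating":  [4.0, 4.0, 3.0, 2.0],
})

toy_matrix, uid2idx, idx2uid, mid2idx, idx2mid = df_to_matrix_sketch(
    toy, "userId", "movieId")
print(toy_matrix.shape)                      # (2, 4)
print(toy_matrix[uid2idx[4], mid2idx[21]])   # 1.0
print(toy_matrix[uid2idx[4], mid2idx[1]])    # 0.0
```

Note that ids such as 21 or 32 never appear as matrix coordinates. The dictionaries are the only link between a cell of the matrix and the original userId and movieId.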

In [12]:
from utils import df_to_matrix

ratings_matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = df_to_matrix(ratings, 'userId', 'movieId')
ratings_matrix

<610x9724 sparse matrix of type '<class 'numpy.float64'>'
	with 100836 stored elements in Compressed Sparse Row format>

**Q6**.
- First, find which movies userId 4 rated.

- Then, find the value of `ratings_matrix` for:
    - userId = 4 and movieId=1
    - userId = 4 and movieId=2
    - userId = 4 and movieId=21
    - userId = 4 and movieId=32
    - userId = 4 and movieId=126

What do you conclude about the meaning of the values in `ratings_matrix`?

In [10]:
ratings[ratings.userId==4]

     userId  movieId  rating   timestamp
300       4       21     3.0   986935199
301       4       32     2.0   945173447
302       4       45     3.0   986935047
303       4       47     2.0   945173425
304       4       52     3.0   964622786
...     ...      ...     ...         ...
511       4     4765     5.0  1007569445
512       4     4881     3.0  1007569445
513       4     4896     4.0  1007574532
514       4     4902     4.0  1007569465


In [13]:
for mid in [1, 2, 21, 32, 126]:
    print("For mid={}".format(mid))
    print("Value of ratings_matrix={}".format(
        ratings_matrix[uid_to_idx[4], mid_to_idx[mid]]))
print("ratings_matrix is equal to 1 when the user rated the movie and 0 otherwise")

For mid=1
Value of ratings_matrix=0.0
For mid=2
Value of ratings_matrix=0.0
For mid=21
Value of ratings_matrix=1.0
For mid=32
Value of ratings_matrix=1.0
For mid=126
Value of ratings_matrix=1.0
ratings_matrix is equal to 1 when the user rated the movie and 0 otherwise
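
The lookup can also go the other way, from a matrix row back to movieIds. The sketch below uses small made-up stand-ins for `ratings_matrix`, `uid_to_idx` and `idx_to_mid`, and a hypothetical helper `rated_movie_ids`, to list the movies stored as non-zero entries for a user.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-ins (made up) for ratings_matrix and two of the mappers
toy_matrix = csr_matrix(np.array([
    [1.0, 1.0, 0.0, 0.0],   # row 0
    [0.0, 0.0, 1.0, 1.0],   # row 1
]))
toy_uid_to_idx = {1: 0, 4: 1}
toy_idx_to_mid = {0: 1, 1: 3, 2: 21, 3: 32}

def rated_movie_ids(matrix, uid, uid_to_idx, idx_to_mid):
    """Return the movieIds stored as non-zero entries in a user's row."""
    row = matrix[uid_to_idx[uid]]   # 1 x n_items sparse row
    return sorted(idx_to_mid[int(col)] for col in row.indices)

print(rated_movie_ids(toy_matrix, 4, toy_uid_to_idx, toy_idx_to_mid))  # [21, 32]
```

The matrix only records that an interaction exists. The original rating values (for example 3.0 or 2.0) are not recoverable from it.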


**Q7**. Now that you have a `ratings_matrix` in the correct format, let's save it in pickle format:
- Create a variable `dst_dir` corresponding to the path of a data folder for later use
- Save `ratings_matrix` as a pickle file (`ratings_matrix.pkl`) in this directory

In [17]:
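import os
import pickle
dst_dir = "./data"
os.makedirs(dst_dir, exist_ok=True)  # create the data folder if it does not exist yet
with open(os.path.join(dst_dir, "ratings_matrix.pkl"), "wb") as f:
    pickle.dump(ratings_matrix, f)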

**Q7.1**. Also save all the mapping objects as pickle files (`idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`), as they will be useful later.

In [18]:
# Save each mapper under its own name, closing the file after each write
mappers = {"idx_to_mid": idx_to_mid, "mid_to_idx": mid_to_idx,
           "uid_to_idx": uid_to_idx, "idx_to_uid": idx_to_uid}
for name, mapper in mappers.items():
    with open(dst_dir + "/" + name + ".pkl", "wb") as f:
        pickle.dump(mapper, f)

In [19]:
with open(dst_dir + "/movies.pkl", "wb") as f:
    pickle.dump(movies, f)
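
Before moving on, it can be worth checking that a saved object loads back unchanged. The sketch below does a round trip in a temporary directory with a small made-up dictionary standing in for one of the mappers. The file name and variable names are illustrative only.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for one of the mappers saved above
toy_uid_to_idx = {1: 0, 4: 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "uid_to_idx.pkl")
    # Write the object, then read it back from disk
    with open(path, "wb") as f:
        pickle.dump(toy_uid_to_idx, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

print(reloaded == toy_uid_to_idx)  # True
```

The same `pickle.load` pattern, pointed at the files in `dst_dir`, is one way the saved matrix and mappers could be read back in a later notebook.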

On to the next challenge! 🍿