# Recommendation - Data Preparation 🎬

---

<img src="https://cdn-images-1.medium.com/max/1200/0*ePGWILY6GyplT-nn" />

---

In the next few challenges, you will build a powerful **movie recommender**.

We will use the open-source library [LightFM](https://github.com/lyst/lightfm), which provides an easy-to-use Python implementation of **hybrid** recommendation engines.

In this first part, we will prepare the data so that the model can be trained efficiently.

It is up to you to load the `movies` and `ratings` data, downloaded from the **small** [MovieLens dataset](https://grouplens.org/datasets/movielens/).



In [2]:
### TODO: Load the movies and ratings datasets
import pandas as pd
movies = pd.read_csv("ml-latest-small/movies.csv")
ratings = pd.read_csv("ml-latest-small/ratings.csv")

print(movies.head())
print(ratings.head())

   movieId                               title   
0        1                    Toy Story (1995)  \
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


**Q1**. What are the different types of recommendation models? Explain briefly with your own words the differences between them.

Content-Based Filtering: This approach recommends items based on their inherent characteristics and features. It analyzes the properties of items, such as genre, actors, or keywords, and recommends similar items to users based on their preferences for those features. For example, if a user enjoys action movies, a content-based recommender would suggest other action movies.

Collaborative Filtering: Collaborative filtering recommends items based on the preferences and behavior of similar users. It finds patterns and similarities among users' ratings or interactions with items and suggests items that are liked by users with similar tastes. Collaborative filtering can be further divided into two types:

a. User-Based Collaborative Filtering: This method identifies users with similar preferences and recommends items liked by those similar users. For example, if User A and User B have similar movie preferences and User A liked a particular movie, the system would recommend that movie to User B.

b. Item-Based Collaborative Filtering: This method identifies items that are similar based on user ratings and recommends items that are similar to those the user has already rated positively. For example, if a user liked Movie X, the system would recommend movies similar to Movie X.
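
To make the difference between user-based and item-based collaborative filtering concrete, here is a minimal sketch on a made-up rating matrix. The numbers and the `cosine_similarity` helper are illustrative assumptions, not part of the challenge data or of LightFM. The same similarity measure is applied to rows (users) in one case and to columns (movies) in the other.

```python
import numpy as np

# Toy rating matrix (rows = users, columns = movies); 0 means "not rated".
# These values are invented for illustration only.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

def cosine_similarity(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T

user_sim = cosine_similarity(R)    # user-based: compare rows
item_sim = cosine_similarity(R.T)  # item-based: compare columns

# Ignore self-similarity, then pick the closest neighbour of user 0 / movie 0
np.fill_diagonal(user_sim, -1.0)
np.fill_diagonal(item_sim, -1.0)
nearest_user = int(user_sim[0].argmax())
nearest_item = int(item_sim[0].argmax())
print(nearest_user, nearest_item)
```

A user-based recommender would then suggest to user 0 what the nearest user liked, while an item-based one would suggest the nearest movie to anyone who liked movie 0.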

Hybrid Recommender Systems: These systems combine multiple recommendation approaches to provide more accurate and diverse recommendations. They leverage the strengths of different models to overcome limitations and improve the quality of recommendations. For instance, a hybrid recommender system might use collaborative filtering to capture user preferences and content-based filtering to enhance the recommendations based on item features.

Knowledge-Based Recommender Systems: These systems make recommendations by using explicit knowledge about the user's preferences and needs. They typically employ rule-based or knowledge-based techniques to infer user preferences from their stated requirements or explicit feedback. For example, a knowledge-based recommender might ask the user a series of questions to understand their specific needs and then provide tailored recommendations based on that information.

Reinforcement Learning-Based Recommender Systems: These systems utilize reinforcement learning techniques to learn and adapt the recommendations based on user feedback and interactions. They aim to maximize user satisfaction by continuously learning from user behavior and optimizing the recommendations over time.

**Q1bis**. What data is expected by the LightFM `fit` method? In particular, how should the training data be organized, and what should the type of the training dataset be?

In LightFM, the `fit` method expects the training interactions as a sparse matrix of shape (number of users, number of items): each row corresponds to a user and each column to an item. User or item features, if any, are passed separately through the `user_features` and `item_features` arguments.

The sparse matrix can be built with SciPy, for instance with `coo_matrix` or `csr_matrix`; the LightFM documentation describes the interactions argument as a COO matrix. The values in the matrix indicate the interaction strength or preference of a user for an item. For example, if a user has rated an item, the value in the corresponding matrix cell could represent the rating score.

The training data should be organized as follows:

- Each row in the matrix corresponds to a user, so the number of rows equals the number of users in the dataset.
- Each column corresponds to an item, so the number of columns equals the number of items in the dataset.
- The matrix should be sparse, that is, mostly zeros, since users typically interact with only a small fraction of the available items.

For example, take the first rows of `ratings`:

   userId  movieId  rating
0       1        1     4.0
1       1        3     4.0
2       1        6     4.0
3       1       47     5.0
4       1       50     5.0

To convert this dataset into the required sparse matrix format, create a matrix in which each row is a user, each column is a movie, and each value is the rating. Restricted to movieId 1 to 6, the result might look like this:


array([[4.0, 0.0, 4.0, 0.0, 0.0, 4.0],
       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
       ...
       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
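
The conversion described above can be sketched with SciPy. This is only an illustration on the five example rows, not the code of `df_to_matrix`. Unlike the dense array above, it assigns columns by position among the distinct movie ids (an assumption made here to keep the matrix compact), so the five movies map to columns 0 to 4.

```python
import numpy as np
from scipy.sparse import coo_matrix

# The five example rows shown above: userId, movieId, rating
user_ids = np.array([1, 1, 1, 1, 1])
movie_ids = np.array([1, 3, 6, 47, 50])
rating_values = np.array([4.0, 4.0, 4.0, 5.0, 5.0], dtype=np.float32)

# Map raw ids to contiguous 0-based row and column indices
unique_users, row_idx = np.unique(user_ids, return_inverse=True)
unique_movies, col_idx = np.unique(movie_ids, return_inverse=True)

# COO format stores one (row, column, value) triple per interaction
interactions = coo_matrix(
    (rating_values, (row_idx, col_idx)),
    shape=(len(unique_users), len(unique_movies)),
)

print(interactions.shape)
print(interactions.toarray())
```

Only the five stored triples take up memory, which is what makes this representation practical when most user-movie pairs have no rating.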

**Q2**. Explore `movies` and `ratings`, what do those datasets contain? How are they organized?

In [4]:
print(movies.shape)
movies.head()

(9742, 3)


   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


In [5]:
print(ratings.shape)
ratings.head()

(100836, 4)


   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


---

### Q3 & Q4 are optional
> You can come back to them if you have time after finishing the rest of the day's project

We created a few utility functions for you in the `utils.py` script, in particular:
- `threshold_interactions_df`:
> Limit interactions df to minimum row and column interactions

**Q3**. Open the `src/utils.py` file and read the documentation of this function to understand its goal and how it works.

Then read the code to understand fully how it works. You should be familiar with everything in it.

What does the variable `sparsity` represent? What range of values can it take?

Despite its name, the variable `sparsity` measures how full the interactions matrix is: it is the number of recorded interactions divided by the total number of possible user-item pairs (rows × columns), expressed as a percentage.

It ranges from 0% to 100%. A value of 0% means an empty matrix with no interactions at all, and 100% means a completely dense matrix in which every user has interacted with every item.

In the context of recommendation systems, a higher value therefore means that a larger share of the possible interactions has actually occurred, while a lower value means the data is sparser.

This figure helps gauge how hard it will be to make accurate recommendations: very sparsely filled data poses challenges for collaborative filtering methods, which rely on user-item interactions.
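
The percentage can be checked by hand against the figures printed in this notebook: `ratings` has 100836 rows, and the starting interactions matrix has 610 rows and 9724 columns. The `fill_rate` helper below is a hypothetical name used for this sketch, not a function from `utils.py`.

```python
def fill_rate(n_interactions, n_rows, n_cols):
    """Percentage of user-item cells that hold an interaction."""
    return 100.0 * n_interactions / (n_rows * n_cols)

# 100836 ratings spread over 610 users x 9724 movies
before = fill_rate(100836, 610, 9724)
print(f"{before:.3f}%")
```

The result matches the `Sparsity: 1.700%` line reported for the starting interactions.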

**Q4**. Create a new DataFrame `ratings_thresh`, that filters `ratings` with only:
- users that rated strictly more than 4 movies
- movies that have been rated at least 10 times

How many users/movies remain in this new dataset?

In [10]:
from utils import threshold_interactions_df

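# Keep users with at least 5 ratings and movies with at least 10 ratings
ratings_thresh_df = threshold_interactions_df(ratings, 'userId', 'movieId', 5, 10)

# Number of distinct users and movies left after filtering
print(ratings_thresh_df.userId.nunique(), ratings_thresh_df.movieId.nunique())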

Starting interactions info
Number of rows: 610
Number of cols: 9724
Sparsity: 1.700%
Ending interactions info
Number of rows: 610
Number of columns: 3650
Sparsity: 4.055%
610 3650
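
For intuition, this kind of filtering can be sketched with pandas `groupby`. The `threshold_sketch` function below is a hypothetical re-implementation written for illustration; the real helper lives in `utils.py` and may work differently. It repeats the two filters until nothing changes, on the assumption that removing rarely rated movies can push some users below their own threshold.

```python
import pandas as pd

def threshold_sketch(df, row_name, col_name, row_min, col_min):
    """Keep rows/columns with at least row_min/col_min interactions."""
    while True:
        n_before = len(df)
        # Drop columns (movies) that have too few interactions
        col_counts = df.groupby(col_name)[row_name].transform("count")
        df = df[col_counts >= col_min]
        # Drop rows (users) that have too few interactions
        row_counts = df.groupby(row_name)[col_name].transform("count")
        df = df[row_counts >= row_min]
        # Stop once a full pass removes nothing
        if len(df) == n_before:
            return df

# Invented toy interactions: movie 30 and user 3 are too rare to keep
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
})
filtered = threshold_sketch(toy, "userId", "movieId", 2, 2)
print(sorted(filtered.userId.unique().tolist()))
print(sorted(filtered.movieId.unique().tolist()))
```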


**Q5**. In order to fit a [LightFM](https://lyst.github.io/lightfm/docs/home.html) model, we need to transform our DataFrame into a sparse matrix (see below). This is not straightforward, so we included the function `df_to_matrix` in `utils.py`.

> 🔦 **Hint**:  Sparse matrices are just **big matrices with a lot of zeros or empty values**.
> 
> Existing tools (Pandas DataFrame, Numpy arrays for example) are not suitable for manipulating this kind of data. So we will use [Scipy sparse matrices](https://docs.scipy.org/doc/scipy-0.14.0/reference/sparse.html).
>
> There are many different "types" of sparse matrices (CSC, CSR, COO, DIA, etc.). You don't need to know them all. Just know that they are different storage formats, each with its own methods for manipulation, slicing, indexing, etc.

> 🔦 **Hint 2**:  By going from a DataFrame to a sparse matrix, you lose the ids (userId and movieId) and only deal with indices (row number and column number). Therefore, the `df_to_matrix` function also returns dictionaries mapping ids to indices and back (e.g. `uid_to_idx` maps a userId to its row index in the matrix).
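
The id-to-index idea from Hint 2 can be sketched in plain Python. The ids below are invented, and the `toy_` names are hypothetical so they do not clash with the real mappers; how `df_to_matrix` orders its indices is not shown here, so sorting the ids is an assumption of this sketch.

```python
# Raw ids as they might appear in a ratings table (illustrative values)
user_ids = [1, 4, 4, 7, 1]

# Give each distinct id a 0-based matrix index, in sorted order
toy_uid_to_idx = {uid: idx for idx, uid in enumerate(sorted(set(user_ids)))}

# Reverse mapping, to get back from a matrix row to the original id
toy_idx_to_uid = {idx: uid for uid, idx in toy_uid_to_idx.items()}

print(toy_uid_to_idx)                     # {1: 0, 4: 1, 7: 2}
print(toy_idx_to_uid[toy_uid_to_idx[4]])  # 4
```

Keeping both directions makes it possible to look up a user's row before querying the matrix, and to translate a recommended column back into a movieId afterwards.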


Have a look at the util function documentation, and use it to create 5 new variables:
- a final sparse matrix `ratings_matrix` (this will be the data used to train the model)
- the following utils mappers:
    - `uid_to_idx`
    - `idx_to_uid`
    - `mid_to_idx`
    - `idx_to_mid`

In [13]:
from utils import df_to_matrix

ratings_martrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = df_to_matrix(
    ratings_thresh_df, 'userId', 'movieId'
)
ratings_martrix

<610x3650 sparse matrix of type '<class 'numpy.float64'>'
	with 90274 stored elements in Compressed Sparse Row format>
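
A summary line like the one above can be explored further with a few attributes and plain indexing. The sketch below uses a tiny invented CSR matrix rather than the real one, so the values are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny invented matrix: 2 users x 3 movies, with 2 stored interactions
toy = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]))

print(toy.shape)  # (n_users, n_movies)
print(toy.nnz)    # number of stored (non-zero) elements
print(toy[1, 2])  # a stored interaction
print(toy[0, 1])  # an empty cell reads as zero
```

CSR supports this kind of `[row, column]` lookup directly, which is convenient for checking individual user-movie pairs.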

**Q6**.
- On the one hand, find which movies userId 4 rated.

- On the other hand, what is the value of `ratings_matrix` for:
    - userId = 4 and movieId=1
    - userId = 4 and movieId=2
    - userId = 4 and movieId=21
    - userId = 4 and movieId=32
    - userId = 4 and movieId=126

Conclude on the meaning of the values in `ratings_matrix`.

In [14]:
ratings[ratings.userId == 4]

     userId  movieId  rating   timestamp
300       4       21     3.0   986935199
301       4       32     2.0   945173447
302       4       45     3.0   986935047
303       4       47     2.0   945173425
304       4       52     3.0   964622786
..      ...      ...     ...         ...
511       4     4765     5.0  1007569445
512       4     4881     3.0  1007569445
513       4     4896     4.0  1007574532
514       4     4902     4.0  1007569465


In [20]:
# Row index of user 4; the movie id is only used for the column lookup
user_idx = uid_to_idx[4]
for mid in [1, 2, 21, 32, 126]:
    print('For MID', mid)
    print('Matrix value (for user 4) is:', ratings_martrix[user_idx, mid_to_idx[mid]])

The matrix stores 1.0 where a user rated a movie and 0.0 otherwise; the rating itself is not kept. For user 4, movies 21 and 32 (rated 3.0 and 2.0 in the listing above) therefore give 1.0, while movies 1 and 2, which do not appear among user 4's ratings, give 0.0.


**Q7**. Now that you have a `ratings_matrix` in the correct format, let's save it in pickle format:
- Create a variable `dst_dir` corresponding to the path of the folder `data/netflix` located at the root of the repository
- **Verify that this is the correct path**
- Save the ratings_matrix in pickle (as `ratings_matrix.pkl`) in this corresponding directory

In [22]:
import pickle

directory = "./data"


In [25]:
# Use a context manager so the file is closed once the dump completes
with open(directory + "/ratings_matrix.pkl", "wb") as f:
    pickle.dump(ratings_martrix, f)

**Q8**. Also save all the mapping objects as pickle files (`idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`), as they will be useful later.

In [26]:
# Save each mapping under its own name, closing every file after writing
mappings = {"idx_to_mid": idx_to_mid, "mid_to_idx": mid_to_idx,
            "uid_to_idx": uid_to_idx, "idx_to_uid": idx_to_uid}
for name, mapping in mappings.items():
    with open(directory + "/" + name + ".pkl", "wb") as f:
        pickle.dump(mapping, f)

In [27]:
with open(directory + "/movies.pkl", "wb") as f:
    pickle.dump(movies, f)
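
Before moving on, it can be worth checking that a saved object loads back unchanged. The sketch below does this round trip on an invented dictionary in a temporary folder, so it does not touch the files written above; `toy_mapping` and the file name are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for one of the mapping dictionaries (illustrative values)
toy_mapping = {1: 0, 4: 1, 7: 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_mapping.pkl"
    # Write the object, then read it back from the same file
    with open(path, "wb") as f:
        pickle.dump(toy_mapping, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

print(reloaded == toy_mapping)  # True
```

The same `pickle.load` pattern applies when reading the saved matrix and mappings in a later notebook.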

On to the next challenge now! 🍿