# Recommendation - Data Preparation 🎬

---

<img src="https://cdn-images-1.medium.com/max/1200/0*ePGWILY6GyplT-nn" />

---

In the next few challenges, you will build a powerful **movie recommender**.

We will use the open-source library [LightFM](https://github.com/lyst/lightfm), which provides an easy-to-use Python implementation of **hybrid** recommendation engines.

In this first part, we will prepare the data so that the model can be trained efficiently.

Start by loading the `movies` and `ratings` data, downloaded from the **small** [MovieLens dataset](https://grouplens.org/datasets/movielens/).



In [2]:
import pandas as pd
from scipy.sparse import coo_matrix

In [3]:
### TODO: Load the movies and ratings datasets
movies = pd.read_csv('ml-latest-small/movies.csv')
ratings = pd.read_csv('ml-latest-small/ratings.csv')

# Display the first few rows of each DataFrame to ensure they are loaded correctly
print(movies.head())
print(ratings.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


**Q1**. What are the different types of recommendation models? Briefly explain the differences between them in your own words.

There are several types of recommendation models, each with its own approach to suggesting items to users. The main types are:
>##### **1. Content-Based Filtering**
>>Content-based filtering recommends items similar to those the user has liked in the past. It relies on item features and user preferences. For example, if a user likes action movies, the system will recommend other action movies.
>>##### **Pros:**
>>Doesn't require data from other users.
>>Can start making recommendations immediately.
>>##### **Cons:**
>>Limited by the features used to describe items.
>>May not recommend items outside the user's established preferences.
>##### **2. Collaborative Filtering**
>>Collaborative filtering makes recommendations based on the preferences of similar users. There are two main types:
>>- User-based collaborative filtering: finds users similar to the target user and recommends items they have liked.
>>- Item-based collaborative filtering: finds items similar to those the target user has liked and recommends them.
>>##### **Pros:**
>>Can uncover hidden patterns and preferences.
>>Effective when there is a lot of user interaction data.
>>##### **Cons:**
>>Requires a large amount of data.
>>Suffers from the cold start problem for new users and items.
>##### **3. Matrix Factorization**
>>Matrix factorization, a model-based form of collaborative filtering, decomposes the user-item interaction matrix into lower-dimensional matrices. This helps in discovering latent features that explain observed interactions. Popular techniques include Singular Value Decomposition (SVD) and Alternating Least Squares (ALS).
>>##### **Pros:**
>>Can handle large-scale data.
>>Effective in uncovering latent features.
>>##### **Cons:**
>>Computationally intensive.
>>Requires a good amount of data.
>##### **4. Hybrid Models**
>>Hybrid models combine multiple recommendation strategies to leverage their individual strengths. For example, a hybrid model might use both content-based and collaborative filtering to improve recommendations.
>>##### **Pros:**
>>Can provide more accurate recommendations.
>>Mitigates the weaknesses of individual models.
>>##### **Cons:**
>>More complex to implement and maintain.
>>Can be computationally expensive.
>##### **5. Deep Learning-based Models**
>>These models use neural networks to capture complex patterns in the data. Examples include autoencoders, convolutional neural networks (CNNs), and recurrent neural networks (RNNs).
>>##### **Pros**:
>>Can handle complex and high-dimensional data.
>>Capable of capturing intricate relationships.
>>##### **Cons**:
>>Requires a large amount of data and computational resources.
>>Complex to train and fine-tune.

**Q1bis**. What data is expected by the LightFM `fit` method? In particular, how should the training data be organized, and what type should the training dataset have?

The LightFM fit method expects the training data to be in the form of an interaction matrix. Here's a detailed explanation of what data is expected and how it should be organized:

>#### **Interaction Matrix**
>The interaction matrix represents the interactions between users and items. Each entry in the matrix indicates whether a user has interacted with an item (e.g., rated it, viewed it, liked it, etc.) and possibly the strength of that interaction (e.g., rating score).

>>##### **Format:**

>>>- Type: The interaction matrix should be a sparse matrix. LightFM works with scipy's csr_matrix (Compressed Sparse Row format) or coo_matrix (Coordinate format).
>>>- Shape: The matrix should have a shape of (number of users, number of items).
>>##### **Creating the Interaction Matrix**
>>To create the interaction matrix, you'll typically start with a dataset of user-item interactions. Here’s how you can create and organize this data:

>>>- Load Data:
>>>Load your user-item interactions into a pandas DataFrame. This should include at least user IDs, item IDs, and interaction values (e.g., ratings).

>>>- Encode Users and Items:
>>>Convert the user IDs and item IDs into integer indices that will be used to index the interaction matrix.

>>>- Build Sparse Matrix:
>>>Use the csr_matrix or coo_matrix from scipy to create the interaction matrix. Each non-zero entry in this matrix represents an interaction.

In [5]:
# Encode user IDs and item IDs
user_ids = ratings['userId'].astype('category').cat.codes
item_ids = ratings['movieId'].astype('category').cat.codes

# Create interaction matrix
interaction_matrix = coo_matrix(
    (ratings['rating'], (user_ids, item_ids)),
    shape=(user_ids.max() + 1, item_ids.max() + 1)
)

# Now you can use the interaction_matrix with LightFM
from lightfm import LightFM

model = LightFM(loss='warp')
model.fit(interaction_matrix, epochs=30, num_threads=2)

<lightfm.lightfm.LightFM at 0x7a1fc028d950>

**Q2**. Explore `movies` and `ratings`, what do those datasets contain? How are they organized?

#### **Exploring the movies Dataset**
The movies dataset typically contains information about the movies, such as movie IDs, titles, and genres.

- movieId: A unique identifier for each movie.
- title: The title of the movie, often including the release year.
- genres: A pipe-separated list of genres associated with the movie.

In [13]:
print("\nMovies DataFrame Info:")
movies.info()  # info() prints its summary itself and returns None, so it needs no print()


Movies DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


#### **Exploring the ratings Dataset**
The ratings dataset contains user interactions with the movies, such as ratings.

- userId: A unique identifier for each user.
- movieId: A unique identifier for each movie (corresponds to the movieId in the movies dataset).
- rating: The rating given by the user to the movie, on a scale from 0.5 to 5.0 in half-star increments.
- timestamp: The time when the rating was given, represented as a Unix timestamp.

In [14]:
print("\nRatings DataFrame Info:")
ratings.info()  # info() prints its summary itself and returns None, so it needs no print()


Ratings DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


---

### Q3 & Q4 are optional
> You can come back to them if you have time after finishing the day's project.

We created a few utility functions for you in the `utils.py` script, in particular:
- `threshold_interactions_df`:
> Limit interactions df to minimum row and column interactions

**Q3**. Open the `src/utils.py` file, and have a look at the documentation of this function to understand its goal and how it works.

Have a look at the code to fully understand how it works. You should be familiar with everything in it.

What does the variable `sparsity` represent? What range of values can it take?
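
As a point of comparison before reading the function, here is a minimal sketch of one common way to measure how empty a user-item matrix is. It uses the figures reported in this notebook (100836 ratings, 610 users, 9724 rated movies). The definition below is an assumption: `utils.py` may define `sparsity` differently, for instance as the percentage of filled cells, so check the source.

```python
# Figures reported in this notebook: 100836 ratings, 610 users, 9724 rated movies.
n_ratings, n_users, n_movies = 100836, 610, 9724

n_cells = n_users * n_movies      # every possible (user, movie) pair
density = n_ratings / n_cells     # share of pairs that actually have a rating
sparsity = 1 - density            # share of pairs that are empty (assumed definition)

print(f"possible pairs: {n_cells}")
print(f"density:  {density:.2%}")
print(f"sparsity: {sparsity:.2%}")
```

With this definition both quantities are fractions between 0 and 1, and only about 1.7% of the possible pairs are filled.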

**Q4**. Create a new DataFrame `ratings_thresh`, that filters `ratings` with only:
- users that rated strictly more than 4 movies
- movies that have been rated at least 10 times

How many users/movies remain in this new dataset?
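
The kind of filtering asked for in Q4 can be sketched in plain pandas on a toy table. This is only an illustration under stated assumptions: `filter_min_interactions` is a hypothetical helper, not the `threshold_interactions_df` function from `utils.py`, and the toy data and thresholds are made up.

```python
import pandas as pd

# Toy interactions table (made-up data, same column names as `ratings`).
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3, 4],
    "movieId": [10, 20, 30, 10, 20, 10, 20, 30, 10],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 1.5, 5.0],
})


def filter_min_interactions(df, user_min, item_min):
    """Hypothetical helper: keep rows whose user has at least `user_min`
    ratings and whose movie has at least `item_min` ratings.

    Counts are computed once on the input, so after filtering a user or a
    movie may fall below its threshold again.
    """
    user_counts = df.groupby("userId")["movieId"].transform("count")
    item_counts = df.groupby("movieId")["userId"].transform("count")
    return df[(user_counts >= user_min) & (item_counts >= item_min)]


toy_thresh = filter_min_interactions(toy, user_min=2, item_min=3)
n_users_left = toy_thresh["userId"].nunique()
n_movies_left = toy_thresh["movieId"].nunique()
print(f"{n_users_left} users and {n_movies_left} movies remain")
```

With this helper, "strictly more than 4 movies" would translate to `user_min=5`, and "at least 10 times" to `item_min=10`.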

**Q5**. In order to fit a [LightFM](https://lyst.github.io/lightfm/docs/home.html) model, we need to transform our DataFrame into a sparse matrix (cf. below). This is not straightforward, so we included the function `df_to_matrix` in `utils.py`.

> 🔦 **Hint**:  Sparse matrices are just **big matrices with a lot of zeros or empty values**.
> 
> Dense containers such as pandas DataFrames or NumPy arrays store every zero explicitly, so they are not suited to this kind of data. We will therefore use [Scipy sparse matrices](https://docs.scipy.org/doc/scipy-0.14.0/reference/sparse.html).
>
> There are many different "types" of sparse matrices (CSC, CSR, COO, DIA, etc.). You don't need to know them all. Just know that they correspond to different storage formats, each with its own methods for manipulation, slicing, indexing, etc.
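
To make the hint concrete, here is a small sketch with made-up values that builds a matrix in COO format and converts it to CSR. It only uses `scipy.sparse.coo_matrix` and its `tocsr()` method; the matrix itself is an arbitrary example.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Three non-zero entries in a 4 x 5 matrix (made-up values).
rows = np.array([0, 1, 3])
cols = np.array([2, 0, 4])
vals = np.array([1.0, 1.0, 1.0], dtype=np.float32)

# COO is convenient to build from (value, (row, col)) triplets,
# but a coo_matrix cannot be indexed with m[i, j].
m_coo = coo_matrix((vals, (rows, cols)), shape=(4, 5))

# CSR supports element access and fast row slicing.
m_csr = m_coo.tocsr()

print(m_coo.nnz)        # number of stored entries
print(m_csr[1, 0])      # a stored entry
print(m_csr[1, 1])      # not stored, so it reads as zero
print(m_csr.toarray())  # dense view, only sensible for tiny matrices
```

Only the three stored entries take up memory, whatever the nominal shape of the matrix.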

> 🔦 **Hint 2**:  By going from a DataFrame to a sparse matrix, you lose the ids (userId and movieId) and only deal with indices (row number and column number). Therefore, the `df_to_matrix` function also returns dictionaries mapping between ids and indices (e.g. `uid_to_idx` maps a userId to its row index in the matrix).
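
The id-to-index bookkeeping described in this hint can be sketched on a toy table. `build_mappings` is a hypothetical helper written for illustration, not the `df_to_matrix` implementation from `utils.py`; the sorted index order and the use of 1.0 for every interaction are assumptions of this sketch.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Toy interactions with non-contiguous ids (made-up data).
toy = pd.DataFrame({
    "userId":  [7, 7, 42, 99],
    "movieId": [50, 300, 50, 1200],
})


def build_mappings(ids):
    """Hypothetical helper: map each distinct id to a 0-based index and back."""
    unique_ids = sorted(set(ids.tolist()))
    id_to_idx = {id_: idx for idx, id_ in enumerate(unique_ids)}
    idx_to_id = {idx: id_ for id_, idx in id_to_idx.items()}
    return id_to_idx, idx_to_id


toy_uid_to_idx, toy_idx_to_uid = build_mappings(toy["userId"])
toy_mid_to_idx, toy_idx_to_mid = build_mappings(toy["movieId"])

# Translate ids to matrix coordinates, then build the sparse matrix.
rows = toy["userId"].map(toy_uid_to_idx).to_numpy()
cols = toy["movieId"].map(toy_mid_to_idx).to_numpy()
data = [1.0] * len(toy)
toy_matrix = coo_matrix(
    (data, (rows, cols)),
    shape=(len(toy_uid_to_idx), len(toy_mid_to_idx)),
).tocsr()

print(toy_matrix.shape)
print(toy_uid_to_idx)
print(toy_matrix[toy_uid_to_idx[42], toy_mid_to_idx[50]])
```

The matrix only needs as many rows and columns as there are distinct ids, and the dictionaries are the only way back from a row or column number to the original id.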


Have a look at the documentation of this utility function, and use it to create 5 new variables:
- a final sparse matrix `ratings_matrix` (this will be the data used to train the model)
- the following utils mappers:
    - `uid_to_idx`
    - `idx_to_uid`
    - `mid_to_idx`
    - `idx_to_mid`

In [7]:
from utils import df_to_matrix

ratings_matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = df_to_matrix(ratings, 'userId', 'movieId')

print(f"Shape of the ratings matrix: {ratings_matrix.shape}")
print(f"User ID to Index mapping: {list(uid_to_idx.items())[:5]}")
print(f"Index to User ID mapping: {list(idx_to_uid.items())[:5]}")
print(f"Movie ID to Index mapping: {list(mid_to_idx.items())[:5]}")
print(f"Index to Movie ID mapping: {list(idx_to_mid.items())[:5]}")

Shape of the ratings matrix: (610, 9724)
User ID to Index mapping: [(1, 0), (2, 1), (3, 2), (4, 3), (5, 4)]
Index to User ID mapping: [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
Movie ID to Index mapping: [(1, 0), (3, 1), (6, 2), (47, 3), (50, 4)]
Index to Movie ID mapping: [(0, 1), (1, 3), (2, 6), (3, 47), (4, 50)]


**Q6**.
- First, find which movies userId 4 rated.

- Then, what is the value of `ratings_matrix` for:
    - userId = 4 and movieId=1
    - userId = 4 and movieId=2
    - userId = 4 and movieId=21
    - userId = 4 and movieId=32
    - userId = 4 and movieId=126

Conclude on the meaning of the values in `ratings_matrix`.

In [9]:
user_id = 4
movies_rated_by_user_4 = ratings[ratings['userId'] == user_id]['movieId'].tolist()
print(f"Movies rated by userId {user_id}: {movies_rated_by_user_4}")

Movies rated by userId 4: [21, 32, 45, 47, 52, 58, 106, 125, 126, 162, 171, 176, 190, 215, 222, 232, 235, 247, 260, 265, 296, 319, 342, 345, 348, 351, 357, 368, 417, 441, 450, 457, 475, 492, 509, 538, 539, 553, 588, 593, 595, 599, 608, 648, 708, 759, 800, 892, 898, 899, 902, 904, 908, 910, 912, 914, 919, 920, 930, 937, 1025, 1046, 1057, 1060, 1073, 1077, 1079, 1080, 1084, 1086, 1094, 1103, 1136, 1179, 1183, 1188, 1196, 1197, 1198, 1199, 1203, 1211, 1213, 1219, 1225, 1250, 1259, 1265, 1266, 1279, 1282, 1283, 1288, 1291, 1304, 1391, 1449, 1466, 1500, 1517, 1580, 1597, 1617, 1641, 1704, 1719, 1732, 1733, 1734, 1834, 1860, 1883, 1885, 1892, 1895, 1907, 1914, 1916, 1923, 1947, 1966, 1967, 1968, 2019, 2076, 2078, 2109, 2145, 2150, 2174, 2186, 2203, 2204, 2282, 2324, 2336, 2351, 2359, 2390, 2395, 2406, 2467, 2571, 2583, 2599, 2628, 2683, 2692, 2712, 2762, 2763, 2770, 2791, 2843, 2858, 2874, 2921, 2926, 2959, 2973, 2997, 3033, 3044, 3060, 3079, 3083, 3160, 3175, 3176, 3204, 3255, 3317, 3358, 3

In [10]:
user_idx = uid_to_idx[user_id]

movie_ids = [1, 2, 21, 32, 126]
ratings_values = {}
for movie_id in movie_ids:
    if movie_id in mid_to_idx:
        movie_idx = mid_to_idx[movie_id]
        value = ratings_matrix[user_idx, movie_idx]
        ratings_values[movie_id] = value
    else:
        ratings_values[movie_id] = 0

print(f"Ratings matrix values for userId {user_id}:")
for movie_id, value in ratings_values.items():
    print(f"MovieId {movie_id}: {value}")

Ratings matrix values for userId 4:
MovieId 1: 0.0
MovieId 2: 0.0
MovieId 21: 1.0
MovieId 32: 1.0
MovieId 126: 1.0

**Conclusion**: `ratings_matrix` holds 1.0 where the user has rated the movie (movies 21, 32 and 126 all appear in user 4's list above) and 0.0 where there is no rating (movies 1 and 2). The values therefore appear to be binary interaction flags (rated / not rated) rather than the ratings themselves.


**Q7**. Now that you have a `ratings_matrix` in the correct format, let's save it in pickle format:
- Create a variable `dst_dir` corresponding to the path of the folder `data/netflix` located at the root of the repository
- **Verify that this is the correct path**
- Save the ratings_matrix in pickle (as `ratings_matrix.pkl`) in this corresponding directory

In [11]:
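import os
import pickle

# Relative path: it points to <repo root>/data/netflix only if the notebook runs from the repository root
dst_dir = os.path.join('data', 'netflix')

# Create the folder if it does not exist yet
os.makedirs(dst_dir, exist_ok=True)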

ratings_matrix_path = os.path.join(dst_dir, 'ratings_matrix.pkl')

with open(ratings_matrix_path, 'wb') as f:
    pickle.dump(ratings_matrix, f)

print(f"ratings_matrix saved to {ratings_matrix_path}")

ratings_matrix saved to data/netflix/ratings_matrix.pkl
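
A quick way to gain confidence in the saved file is to load it back and compare it with the original. The sketch below does this round trip on a tiny made-up matrix; a temporary directory stands in for `dst_dir` so that nothing is written to the repository.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import csr_matrix

# Tiny made-up matrix standing in for ratings_matrix.
toy_matrix = csr_matrix(np.array([[0.0, 1.0, 0.0],
                                  [1.0, 0.0, 1.0]], dtype=np.float32))

# The temporary directory plays the role of dst_dir.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ratings_matrix.pkl")

    with open(path, "wb") as f:
        pickle.dump(toy_matrix, f)

    with open(path, "rb") as f:
        reloaded = pickle.load(f)

# Same shape and no differing entry means the round trip preserved the matrix.
same_shape = reloaded.shape == toy_matrix.shape
n_differences = (reloaded != toy_matrix).nnz
print(same_shape, n_differences)
```

To check the path itself, `os.path.abspath(dst_dir)` shows where a relative `dst_dir` actually resolves from the current working directory.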


**Q8**. Also save all the mapping objects as pickle files (`idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`), as they will be useful later.

In [12]:
# Save each mapping as <name>.pkl in dst_dir
mappings = {
    'idx_to_mid': idx_to_mid,
    'mid_to_idx': mid_to_idx,
    'uid_to_idx': uid_to_idx,
    'idx_to_uid': idx_to_uid,
}

for name, mapping in mappings.items():
    mapping_path = os.path.join(dst_dir, f'{name}.pkl')
    with open(mapping_path, 'wb') as f:
        pickle.dump(mapping, f)
    print(f"{name} saved to {mapping_path}")

idx_to_mid saved to data/netflix/idx_to_mid.pkl
mid_to_idx saved to data/netflix/mid_to_idx.pkl
uid_to_idx saved to data/netflix/uid_to_idx.pkl
idx_to_uid saved to data/netflix/idx_to_uid.pkl


On to the next challenge now! 🍿