# Recommendation - Model 🍿

---

<img src="https://visithrastnik.si/uploads/tic/public/generic_list_item/6-kulturna_prireditev_v_avli_kulturnega_centra_zagorje_ob_savi.jpg" />

---

Now, time for the exciting part! We will train a Machine Learning model based on our previous **ratings** sparse matrix, so that it creates a recommendation engine automatically! 

First, load again the dataframe `movies` and `ratings`

In [1]:
import pandas as pd

In [2]:
### TODO: Load the movies and ratings datasets
movies = pd.read_csv('ml-latest-small/movies.csv')
ratings = pd.read_csv('ml-latest-small/ratings.csv')

# Display the first few rows of each DataFrame to ensure they are loaded correctly
print(movies.head())
print(ratings.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


**Q1**. Start by loading all the pickle you saved during last challenge: `ratings_matrix`, `idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`

In [3]:
import os
import pickle

# Specify the directory where the pickled files are saved
src_dir = os.path.join('data', 'netflix')

# Load ratings_matrix
ratings_matrix_path = os.path.join(src_dir, 'ratings_matrix.pkl')
with open(ratings_matrix_path, 'rb') as f:
    ratings_matrix = pickle.load(f)

# Load idx_to_mid
idx_to_mid_path = os.path.join(src_dir, 'idx_to_mid.pkl')
with open(idx_to_mid_path, 'rb') as f:
    idx_to_mid = pickle.load(f)

# Load mid_to_idx
mid_to_idx_path = os.path.join(src_dir, 'mid_to_idx.pkl')
with open(mid_to_idx_path, 'rb') as f:
    mid_to_idx = pickle.load(f)

# Load uid_to_idx
uid_to_idx_path = os.path.join(src_dir, 'uid_to_idx.pkl')
with open(uid_to_idx_path, 'rb') as f:
    uid_to_idx = pickle.load(f)

# Load idx_to_uid
idx_to_uid_path = os.path.join(src_dir, 'idx_to_uid.pkl')
with open(idx_to_uid_path, 'rb') as f:
    idx_to_uid = pickle.load(f)

# Confirming loading
print("Pickles loaded successfully.")

Pickles loaded successfully.


**Q2**. Because the dataset is slightly different from what we have been used to (X as features, y as target), the usual `train_test_split` method from scikit-learn does not apply.

Hopefully, `lightfm` comes with a `random_train_test_split` located into `cross_validation` dedicated to this usecase 🙂

Split the data randomly into a `train` matrix and a `test` matrix with 20% of interactions into the test set.

In [4]:
from lightfm.cross_validation import random_train_test_split

# Split the data into train and test sets
train, test = random_train_test_split(ratings_matrix, test_percentage=0.2, random_state=42)

# Confirm the shapes of train and test matrices
print("Shape of train matrix:", train.shape)
print("Shape of test matrix:", test.shape)

Shape of train matrix: (610, 9724)
Shape of test matrix: (610, 9724)


**Q3**. Train a LightFM model for 10 epochs. You can use the parameter `loss="warp"`.

In [5]:
from lightfm import LightFM

# Create a LightFM model with WARP loss
model = LightFM(loss='warp')

# Train the model for 10 epochs
model.fit(train, epochs=10)

print("Model training completed.")

Model training completed.


**Q4**. Evaluate your model on your test set. You can use the `precision_at_k` metric implemented in the LightFM library.

In [6]:
from lightfm.evaluation import precision_at_k

# Evaluate precision at k
precision = precision_at_k(model, test, k=5).mean()

print(f"Precision at k=5: {precision:.4f}")

Precision at k=5: 0.1013


**Q5**. What does the attribute `item_embeddings` of `model` contains?  This will be the heart of your recommendation engine! 💟 So make sure you understand fully what it contains.

The item_embeddings attribute of the LightFM model contains the learned latent representations of the items in the dataset. These embeddings represent the underlying characteristics or features of each item, which the model has learned during the training process. The embeddings are represented as a NumPy array where each row corresponds to the embedding vector of an item.

These embeddings are crucial for the recommendation engine because they encode the relationships between items in the latent space. By measuring the similarity between item embeddings, the model can identify items that are similar or related to each other. This similarity information is then used to make recommendations to users based on their interactions or preferences.

**Q6**. We just trained a model that factorized our ratings matrix into a U matrix of shape (n_users, no_components) : `model.user_embeddings` ; and V matrix of shape (n_movies, no_components) : `model.item_embeddings`).

Now we want to compute **similarity between each pair of movies**.

> 🔦 **Hint**: For the similarity distance we can either use `cosine_similarity` function or `pearson_similarity`:
> - **Cosine similarity** between two vectors, or matrices X and Y is given by:
> ``` python
> from sklearn.metrics.pairwise import cosine_similarity
> cosine_similarity(X, Y)
> ```
> - **Pearson similarity** between two vectors, or matrices X and Y is given by:
> ``` python
> import numpy as np
> np.corrcoef(X, Y)
> ```

Compute the `similarity_scores` of size (n_movies, n_movies), containing for each element (i, j) the similarity between movie of index i and movie of index j.

In [7]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity between item embeddings
similarity_scores = cosine_similarity(model.item_embeddings)

# Print the shape of the similarity_scores matrix
print("Shape of similarity_scores matrix:", similarity_scores.shape)

Shape of similarity_scores matrix: (9724, 9724)


In [8]:
import numpy as np

# Compute Pearson similarity between item embeddings
similarity_scores = np.corrcoef(model.item_embeddings)

# Print the shape of the similarity_scores matrix
print("Shape of similarity_scores matrix:", similarity_scores.shape)

Shape of similarity_scores matrix: (9724, 9724)


**Q7**. For movie of idx 20, what are the idx of the 10 most similar movies?

In [9]:
movie_idx = 20

movie_similarity_scores = similarity_scores[movie_idx]

most_similar_indices = movie_similarity_scores.argsort()[-11:-1][::-1]

print("Indices of the 10 most similar movies:")
print(most_similar_indices)


Indices of the 10 most similar movies:
[1027  162   15  728  987  401 2462 2199   70   84]


**Q8**. Let's now test our engine! Suppose we have an user that likes **Toy Story** 🧸 (movie_id = 1). Which movies would you recommend to that user? In other words, which movies are the most similar to the movie Toy Story 

> ⚠️ **Warning**: Remember that your `similarity_scores` works with `idx` and you have the `movie_id` associated to your movie.

Retrieve the **top 5 recommendations**.

In [11]:
toy_story_idx = 1

toy_story_index = idx_to_mid[toy_story_idx]

toy_story_similarity_scores = similarity_scores[toy_story_index]

most_similar_indices = toy_story_similarity_scores.argsort()[-6:-1][::-1]

top_5_recommendations = [idx_to_mid[idx] for idx in most_similar_indices]

print("Top 5 recommendations for a user who likes Toy Story:")
print(top_5_recommendations)

Top 5 recommendations for a user who likes Toy Story:
[480, 48, 588, 589, 780]


As the next step is to **deploy your model**, you need now to: 

**Q9**. Save your `similarity_scores` into pickle format. Save also `movies` DataFrame into pickle format. Save them at the `data/netflix` directory at the root of the repository.

In [12]:
dst_dir = os.path.join('data', 'netflix')

similarity_scores_path = os.path.join(dst_dir, 'similarity_scores.pkl')
with open(similarity_scores_path, 'wb') as f:
    pickle.dump(similarity_scores, f)

movies_path = os.path.join(dst_dir, 'movies.pkl')
movies.to_pickle(movies_path)

print(f"Similarity scores saved to {similarity_scores_path}")
print(f"Movies DataFrame saved to {movies_path}")


Similarity scores saved to data/netflix/similarity_scores.pkl
Movies DataFrame saved to data/netflix/movies.pkl


**Q10**. Encapsulate the previous code into functions, especially you will need:
- `get_sim_scores(mid)` function that returns the vector of the similarity scores `sims` between a movie `mid` and all the other movies
- `get_ranked_recos(sims)` that returns for a vector of similarity scores `sims` the list of all ranked recommendations (n_movies) (from most recommended to least recommended) - in the format list of (mid, score, name) tuple.

In [14]:
import os
import pickle

def load_similarity_scores(dst_dir):
    """Load the similarity scores matrix."""
    similarity_scores_path = os.path.join(dst_dir, 'similarity_scores.pkl')
    with open(similarity_scores_path, 'rb') as f:
        similarity_scores = pickle.load(f)
    return similarity_scores

def load_movies(dst_dir):
    """Load the movies DataFrame."""
    movies_path = os.path.join(dst_dir, 'movies.pkl')
    movies = pd.read_pickle(movies_path)
    return movies

def get_sim_scores(mid):
    """Get the vector of similarity scores between a movie and all other movies."""
    return similarity_scores[mid]

def get_ranked_recos(sims, movies):
    """
    Get the list of all ranked recommendations (from most recommended to least recommended).

    Args:
    sims (numpy.ndarray): Vector of similarity scores.
    movies (pandas.DataFrame): DataFrame containing movie information.

    Returns:
    list: List of (mid, score, name) tuples representing ranked recommendations.
    """
    sorted_indices = sims.argsort()[::-1]
    ranked_recos = []
    for idx in sorted_indices:
        try:
            name = movies.loc[idx, 'title']
            ranked_recos.append((idx, sims[idx], name))
        except KeyError:
            print(f"Movie ID {idx} does not have a title.")
    return ranked_recos

dst_dir = os.path.join('data', 'netflix')

similarity_scores = load_similarity_scores(dst_dir)
movies = load_movies(dst_dir)

mid = 20
sims = get_sim_scores(mid)
ranked_recos = get_ranked_recos(sims, movies)
print("Ranked recommendations for movie_id", mid)
for idx, score, name in ranked_recos[:10]:
    print(f"Movie ID: {idx}, Similarity Score: {score:.4f}, Name: {name}")


Ranked recommendations for movie_id 20
Movie ID: 20, Similarity Score: 1.0000, Name: Get Shorty (1995)
Movie ID: 1027, Similarity Score: 0.9712, Name: Dracula (Bram Stoker's Dracula) (1992)
Movie ID: 162, Similarity Score: 0.9704, Name: Scarlet Letter, The (1995)
Movie ID: 15, Similarity Score: 0.9697, Name: Casino (1995)
Movie ID: 728, Similarity Score: 0.9685, Name: Giant (1956)
Movie ID: 987, Similarity Score: 0.9646, Name: This Is Spinal Tap (1984)
Movie ID: 401, Similarity Score: 0.9645, Name: Getting Even with Dad (1994)
Movie ID: 2462, Similarity Score: 0.9625, Name: Boondock Saints, The (2000)
Movie ID: 2199, Similarity Score: 0.9605, Name: Drunken Master (Jui kuen) (1978)
Movie ID: 70, Similarity Score: 0.9599, Name: Crossing Guard, The (1995)


If you have extra time, feel free now to improve your recommendation engine!