# Recommendation - Model 🍿

---

<img src="https://visithrastnik.si/uploads/tic/public/generic_list_item/6-kulturna_prireditev_v_avli_kulturnega_centra_zagorje_ob_savi.jpg" />

---

Now, time for the exciting part! We will train a Machine Learning model based on our previous **ratings** sparse matrix, so that it creates a recommendation engine automatically! 

First, reload the `movies` and `ratings` DataFrames.

In [2]:
### TODO: load the movies and ratings datasets
import pandas as pd
movies = pd.read_csv("ml-latest-small/movies.csv")
ratings = pd.read_csv("ml-latest-small/ratings.csv")

**Q1**. Start by loading all the pickle files you saved during the last challenge: `ratings_matrix`, `idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`

In [3]:
import pickle

def load_pickle(name):
    # Open inside a context manager so the file handle is always closed
    with open(f"data/{name}.pkl", "rb") as f:
        return pickle.load(f)

ratings_matrix = load_pickle("ratings_matrix")
idx_to_mid = load_pickle("idx_to_mid")
mid_to_idx = load_pickle("mid_to_idx")
uid_to_idx = load_pickle("uid_to_idx")
idx_to_uid = load_pickle("idx_to_uid")

**Q2**. Because the dataset differs from what we are used to (X as features, y as target), the usual `train_test_split` function from scikit-learn does not apply.

Fortunately, `lightfm` provides a `random_train_test_split` function in its `cross_validation` module, dedicated to this use case 🙂

Split the data randomly into a `train` matrix and a `test` matrix, with 20% of the interactions in the test set.

I had a lot of trouble installing `lightfm` and never managed to, so the cells that depend on it could not be run :(

In [4]:
import numpy as np
from lightfm.cross_validation import random_train_test_split

train, test = random_train_test_split(
    ratings_matrix, test_percentage=0.2, random_state=np.random.RandomState(0)
)

ModuleNotFoundError: No module named 'lightfm'
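
The idea behind the split can be illustrated without `lightfm`. The sketch below is a minimal NumPy stand-in, not the library's implementation: it shuffles the observed (non-zero) interactions of a small, made-up dense matrix and moves 20% of them into a test matrix. The function name `split_interactions` and the `toy` data are assumptions made for illustration only.

```python
import numpy as np

def split_interactions(dense, test_percentage=0.2, seed=0):
    """Randomly assign each observed (non-zero) interaction to train or test."""
    rng = np.random.RandomState(seed)
    rows, cols = np.nonzero(dense)            # coordinates of observed ratings
    order = rng.permutation(len(rows))        # shuffle the interactions
    n_test = int(round(len(rows) * test_percentage))
    test_sel, train_sel = order[:n_test], order[n_test:]

    train = np.zeros_like(dense)
    test = np.zeros_like(dense)
    train[rows[train_sel], cols[train_sel]] = dense[rows[train_sel], cols[train_sel]]
    test[rows[test_sel], cols[test_sel]] = dense[rows[test_sel], cols[test_sel]]
    return train, test

# Hypothetical 4 users x 5 movies ratings matrix with 10 observed ratings
toy = np.array([
    [5, 0, 3, 0, 1],
    [0, 4, 0, 2, 0],
    [1, 0, 0, 5, 4],
    [0, 3, 2, 0, 0],
], dtype=float)

toy_train, toy_test = split_interactions(toy, test_percentage=0.2)
print(np.count_nonzero(toy_train), np.count_nonzero(toy_test))
```

Both outputs keep the shape of the input, and every rating lands in exactly one of them. This is why the usual row-wise `train_test_split` does not fit: the split is over individual (user, movie) interactions, not over rows.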

**Q3**. Train a LightFM model for 10 epochs. You can use the parameter `loss="warp"`.

In [None]:
from lightfm import LightFM

model = LightFM(no_components=100, loss="warp", random_state=0)

model.fit(train, epochs=10, verbose=True)

**Q4**. Evaluate your model on your test set. You can use the `precision_at_k` metric implemented in the LightFM library.

In [None]:
from lightfm.evaluation import precision_at_k

precision_k = precision_at_k(model, test, train, k=5)

print(f"Precision at k = 5 is {precision_k.mean()}")
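
To make the metric concrete, here is a simplified NumPy sketch of precision at k: for each user, the fraction of the k highest-scored items that are actually relevant in the test set. It is an illustration on made-up data, not LightFM's implementation (which, when given the training interactions, also leaves them out of the ranking). The names `precision_at_k_sketch`, `toy_scores` and `toy_relevant` are hypothetical.

```python
import numpy as np

def precision_at_k_sketch(scores, relevant, k=5):
    """Per-user fraction of the top-k scored items that are relevant."""
    top_k = np.argsort(-scores, axis=1)[:, :k]           # k best items per user
    hits = np.take_along_axis(relevant, top_k, axis=1)   # True where a top item is relevant
    return hits.mean(axis=1)

# Hypothetical predicted scores for 2 users x 4 movies
toy_scores = np.array([
    [0.9, 0.1, 0.8, 0.3],
    [0.2, 0.7, 0.1, 0.6],
])
# Hypothetical test-set positives for the same users and movies
toy_relevant = np.array([
    [True, False, False, True],
    [False, True, False, True],
])

toy_precision = precision_at_k_sketch(toy_scores, toy_relevant, k=2)
print(toy_precision, toy_precision.mean())
```

For the first user, the two best-scored movies are 0 and 2, of which only movie 0 is relevant, giving 0.5. For the second user both top movies are relevant, giving 1.0.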

**Q5**. What does the attribute `item_embeddings` of `model` contain? This will be the heart of your recommendation engine! 💟 So make sure you fully understand what it contains.

In [None]:
print(model.item_embeddings.shape)
# first dimension represents each movie, second dimension represents each component (100 dimensions)

The `item_embeddings` attribute contains the embeddings the model learned for each movie. It is a matrix of shape (n_movies, no_components): each row is the latent feature vector of one movie, learned during training. The number of dimensions (100) is the `no_components` value we chose when creating the model.

**Q6**. We just trained a model that factorized our ratings matrix into a U matrix of shape (n_users, no_components), `model.user_embeddings`, and a V matrix of shape (n_movies, no_components), `model.item_embeddings`.
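
As a rough illustration of the factorization, the sketch below builds random stand-ins for U and V and shows that the product U Vᵀ gives one score per (user, movie) pair. The matrices are made up, and the sketch leaves out the per-user and per-item bias terms that LightFM adds to the dot product.

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_movies, no_components = 4, 6, 3   # toy sizes, for illustration only

toy_U = rng.normal(size=(n_users, no_components))    # stand-in for model.user_embeddings
toy_V = rng.normal(size=(n_movies, no_components))   # stand-in for model.item_embeddings

# One predicted score per (user, movie) pair
toy_pred = toy_U @ toy_V.T
# The score for user 2 and movie 5 is the dot product of their two embedding rows
single_score = toy_U[2] @ toy_V[5]

print(toy_pred.shape)
```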

Now we want to compute **similarity between each pair of movies**.

> 🔦 **Hint**: As a similarity measure, we can use either cosine similarity or Pearson correlation:
> - **Cosine similarity** between two vectors, or matrices X and Y is given by:
> ``` python
> from sklearn.metrics.pairwise import cosine_similarity
> cosine_similarity(X, Y)
> ```
> - **Pearson similarity** between two vectors, or matrices X and Y is given by:
> ``` python
> import numpy as np
> np.corrcoef(X, Y)
> ```

Compute the `similarity_scores` of size (n_movies, n_movies), containing for each element (i, j) the similarity between movie of index i and movie of index j.
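
For intuition, cosine similarity can be written in a few lines of NumPy: normalize each row to unit length, then take all pairwise dot products. The sketch below uses a made-up 3 x 3 embedding matrix (`toy_embeddings` is hypothetical) and also shows that Pearson correlation between rows is the cosine similarity of the mean-centered rows.

```python
import numpy as np

def cosine_similarity_sketch(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.clip(norms, 1e-12, None)   # guard against all-zero rows
    return unit @ unit.T

# Hypothetical embeddings: 3 movies, 3 components
toy_embeddings = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [1.0, 0.0, 2.0],
])

toy_cosine = cosine_similarity_sketch(toy_embeddings)

# Pearson correlation = cosine similarity after subtracting each row's mean
centered = toy_embeddings - toy_embeddings.mean(axis=1, keepdims=True)
toy_pearson = cosine_similarity_sketch(centered)

print(toy_cosine.shape)
```

The result is a symmetric (n_movies, n_movies) matrix with ones on the diagonal, since every movie is perfectly similar to itself.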

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

sim_scores = cosine_similarity(model.item_embeddings)

**Q7**. For movie of idx 20, what are the idx of the 10 most similar movies?

In [None]:
idx = 20
sim_idx = sim_scores[idx]
# Sort indices by decreasing similarity; the movie itself normally ranks first
ranked_sim = np.argsort(-sim_idx)
ranked_mid = [idx_to_mid[x] for x in ranked_sim]
for mid in ranked_mid[:10]:
    print(movies[movies.movieId == mid]["title"])

**Q8**. Let's now test our engine! Suppose we have a user who likes **Toy Story** 🧸 (movie_id = 1). Which movies would you recommend to that user? In other words, which movies are the most similar to Toy Story?

> ⚠️ **Warning**: Remember that `similarity_scores` is indexed by `idx`, whereas what you have is the `movie_id` of your movie, so convert it first.

Retrieve the **top 5 recommendations**.

In [None]:
# Convert the movie_id into the matrix index before looking up similarities
idx = mid_to_idx[1]
sim_idx = sim_scores[idx]
ranked_sim = np.argsort(-sim_idx)
ranked_mid = [idx_to_mid[x] for x in ranked_sim]
for mid in ranked_mid[:5]:
    print(movies[movies.movieId == mid]["title"])
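
The lookup can be tried on a tiny, made-up similarity matrix. The sketch below converts a `movie_id` to its index, ranks the row by decreasing similarity, drops the query movie itself, and converts the result back to movie ids. All `toy_*` names and values are hypothetical.

```python
import numpy as np

# Hypothetical mappings between movie ids and matrix indices
toy_mid_to_idx = {1: 0, 10: 1, 50: 2, 99: 3}
toy_idx_to_mid = {i: m for m, i in toy_mid_to_idx.items()}

# Hypothetical symmetric similarity matrix for the 4 movies above
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.4],
    [0.9, 0.1, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])

def top_similar(mid, n):
    """Return the movie ids of the n movies most similar to `mid`, excluding itself."""
    idx = toy_mid_to_idx[mid]
    ranked = np.argsort(-toy_sim[idx])     # indices by decreasing similarity
    ranked = ranked[ranked != idx][:n]     # drop the query movie, keep the top n
    return [toy_idx_to_mid[int(i)] for i in ranked]

toy_top = top_similar(1, 2)
print(toy_top)  # [50, 99]
```

Dropping the query movie is a design choice: without it, the first "recommendation" for Toy Story would be Toy Story itself.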

As the next step is to **deploy your model**, you now need to:

**Q9**. Save your `similarity_scores` in pickle format, and save the `movies` DataFrame in pickle format as well. Save both in the `data/netflix` directory at the root of the repository.

In [None]:
with open("data/netflix/sim_scores.pkl", "wb") as f:
    pickle.dump(sim_scores, f)
with open("data/netflix/movies.pkl", "wb") as f:
    pickle.dump(movies, f)
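
Saving fails if the target directory does not exist yet. The sketch below shows one way to guard against that with `os.makedirs(..., exist_ok=True)` and checks the round trip by reloading the file. It writes a made-up matrix into a temporary directory so that it can be run anywhere; in the notebook, the directory would be `data/netflix`.

```python
import os
import pickle
import tempfile

import numpy as np

toy_matrix = np.eye(3)  # stand-in for the real similarity matrix

with tempfile.TemporaryDirectory() as root:
    out_dir = os.path.join(root, "data", "netflix")
    os.makedirs(out_dir, exist_ok=True)          # create the folder if needed
    path = os.path.join(out_dir, "sim_scores.pkl")

    with open(path, "wb") as f:
        pickle.dump(toy_matrix, f)

    with open(path, "rb") as f:
        reloaded = pickle.load(f)

print(np.array_equal(reloaded, toy_matrix))
```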

**Q10**. Encapsulate the previous code into functions. In particular, you will need:
- a `get_sim_scores(mid)` function that returns the vector of similarity scores `sims` between a movie `mid` and all the other movies
- a `get_ranked_recos(sims)` function that, given a vector of similarity scores `sims`, returns the list of all ranked recommendations (n_movies), from most recommended to least recommended, as a list of (mid, score, name) tuples.

In [2]:
def get_movie_title(mid, movies):
    try:
        return movies.loc[movies.movieId == mid].title.values[0]
    except IndexError:
        # No row matches this movieId
        return "Unknown"


def get_sim_scores(mid):
    idx = mid_to_idx[mid]
    return sim_scores[idx]

def get_ranked_recos(sims, movies):
    recos = []
    # Walk through the indices from most similar to least similar
    for idx in np.argsort(-sims):
        mid = idx_to_mid[idx]
        title = get_movie_title(mid, movies)
        score = sims[idx]
        recos.append((mid, score, title))
    return recos

def get_recommendations(mid, movies, num_movies):
    # Use a local name that does not shadow the global sim_scores matrix
    sims = get_sim_scores(mid)
    return get_ranked_recos(sims, movies)[:num_movies]

In [None]:
get_recommendations(3, movies, 10)

If you have extra time, feel free to improve your recommendation engine!