# Recommendation - Model 🍿

---

<img src="https://visithrastnik.si/uploads/tic/public/generic_list_item/6-kulturna_prireditev_v_avli_kulturnega_centra_zagorje_ob_savi.jpg" />

---

Now for the exciting part! We will train a machine learning model on the sparse **ratings** matrix built in the previous challenge, and use it as the basis of a recommendation engine.

First, reload the `movies` and `ratings` DataFrames.

In [2]:
### TODO: load the movies and ratings datasets
raw_data = "./ml-latest-small/"

import pandas as pd
movies = pd.read_csv(raw_data + "movies.csv")
ratings = pd.read_csv(raw_data + "ratings.csv")

print(movies.head())
print(ratings.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


**Q1**. Start by loading all the pickles you saved during the last challenge: `ratings_matrix`, `idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`

In [3]:
pkl_data = "./data/"

import pickle
def load_pickle(name):
    # Use a context manager so the file handle is closed after loading
    with open(pkl_data + name, "rb") as f:
        return pickle.load(f)

ratings_matrix = load_pickle("ratings_matrix.pkl")
idx_to_mid = load_pickle("idx_to_mid.pkl")
mid_to_idx = load_pickle("mid_to_idx.pkl")
uid_to_idx = load_pickle("uid_to_idx.pkl")
idx_to_uid = load_pickle("idx_to_uid.pkl")

**Q2**. Because this dataset does not follow the layout we are used to (X as features, y as target), the usual `train_test_split` function from scikit-learn does not apply.

Fortunately, `lightfm` provides a `random_train_test_split` function in its `cross_validation` module, dedicated to this use case 🙂

Split the data randomly into a `train` matrix and a `test` matrix, with 20% of the interactions in the test set.

In [7]:
import numpy as np
from lightfm.cross_validation import random_train_test_split

train, test = random_train_test_split(ratings_matrix, test_percentage=0.2,
                                      random_state=np.random.RandomState(0))

train.shape, test.shape

((610, 9724), (610, 9724))
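
Note that both matrices keep the full (610, 9724) shape: the split is made over the individual interactions, not over the rows. The sketch below illustrates that idea on a small dense NumPy array. It is only an illustration, not LightFM's implementation (which works on SciPy sparse matrices), and `split_interactions` and the toy data are hypothetical names introduced for this example.

```python
import numpy as np

def split_interactions(matrix, test_percentage=0.2, seed=0):
    """Hypothetical helper: randomly assign each non-zero entry to train or test."""
    rng = np.random.RandomState(seed)
    rows, cols = np.nonzero(matrix)              # coordinates of the observed ratings
    order = rng.permutation(len(rows))           # shuffle the interactions
    cutoff = int((1.0 - test_percentage) * len(rows))
    train_idx, test_idx = order[:cutoff], order[cutoff:]

    # Both outputs keep the shape of the input; only the filled cells differ
    train = np.zeros_like(matrix)
    test = np.zeros_like(matrix)
    train[rows[train_idx], cols[train_idx]] = matrix[rows[train_idx], cols[train_idx]]
    test[rows[test_idx], cols[test_idx]] = matrix[rows[test_idx], cols[test_idx]]
    return train, test

# Toy ratings matrix: 4 users x 4 movies, 9 observed ratings
toy = np.array([[5.0, 0.0, 3.0, 0.0],
                [0.0, 4.0, 0.0, 1.0],
                [2.0, 0.0, 0.0, 4.0],
                [0.0, 5.0, 3.0, 2.0]])

toy_train, toy_test = split_interactions(toy, test_percentage=0.2)
print(toy_train.shape, toy_test.shape)
print(np.count_nonzero(toy_train), np.count_nonzero(toy_test))
```

Every observed rating ends up in exactly one of the two matrices, so `toy_train + toy_test` gives back the original array.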

**Q3**. Train a LightFM model for 10 epochs. You can use the parameter `loss="warp"`.

In [8]:
from lightfm import LightFM

# Initialize model
model = LightFM(no_components=100, loss='warp', random_state=0)

# Fit model on train set
model.fit(train, epochs=10, verbose=True)

Epoch: 100%|█████████████████████████████████████| 10/10 [00:04<00:00,  2.34it/s]


<lightfm.lightfm.LightFM at 0x11f44beb0>

**Q4**. Evaluate your model on your test set. You can use the `precision_at_k` metric implemented in the LightFM library.

In [9]:
from lightfm.evaluation import precision_at_k

k = 5
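# Passing the train interactions excludes movies already seen in training from the ranking
pre_k = precision_at_k(model, test, train_interactions=train, k=k).mean()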

print("Precision at k={} is {}".format(k, pre_k))

Precision at k=5 is 0.2680920958518982
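
To make the metric concrete: precision at k is the fraction of a user's top-k ranked movies that appear in that user's test interactions, averaged over users. Below is a minimal NumPy sketch on dense toy arrays, assuming that items seen in training are masked out and that users without test interactions are skipped. `precision_at_k_dense` and the toy data are hypothetical and only meant to illustrate the computation.

```python
import numpy as np

def precision_at_k_dense(scores, test, train=None, k=5):
    """Hypothetical helper: per-user precision at k on dense arrays."""
    scores = scores.astype(float).copy()
    if train is not None:
        scores[train > 0] = -np.inf                    # do not rank items seen in training
    top_k = np.argsort(-scores, axis=1)[:, :k]         # indices of the k best-scored items
    hits = np.take_along_axis(test > 0, top_k, axis=1) # True where a top-k item is in test
    precision = hits.mean(axis=1)
    has_test = (test > 0).sum(axis=1) > 0              # keep users with test interactions
    return precision[has_test]

# Toy predicted scores for 2 users x 4 movies
scores = np.array([[0.9, 0.8, 0.1, 0.3],
                   [0.2, 0.7, 0.6, 0.1]])
toy_train = np.array([[1, 0, 0, 0],
                      [0, 0, 0, 0]])
toy_test = np.array([[0, 1, 0, 1],
                     [0, 0, 1, 0]])

per_user = precision_at_k_dense(scores, toy_test, train=toy_train, k=2)
print(per_user.mean())  # 0.75
```

For the first user, movie 0 is masked because it was seen in training, so the top 2 are movies 1 and 3, both in the test set (precision 1.0). For the second user, only one of the top 2 is in the test set (precision 0.5).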


**Q5**. What does the `item_embeddings` attribute of `model` contain? This will be the heart of your recommendation engine! 💟 Make sure you fully understand what it contains.

In [11]:
print(model.item_embeddings.shape)
print("item_embeddings contains one 100-dimensional vector for each of the 9724 movies")
print(model.item_embeddings[0])

(9724, 100)
item_embeddings contains one 100-dimensional vector for each of the 9724 movies
[-0.3142591  -0.2041731   0.08945829 -0.08405492  0.34734696 -0.05399103
  0.11160227 -0.20095833 -0.11371228 -0.39413986  0.362561    0.05316165
 -0.4583187  -0.34357777 -0.18797755 -0.00553624  0.28137276  0.43153363
  0.15660733 -0.40245408  0.40595597  0.01247011 -0.27369824 -0.14441921
  0.20417038  0.02513273  0.2091002  -0.15891656  0.2149729   0.14988837
 -0.21464817  0.23834325 -0.2145693   0.24521717 -0.37559548  0.22957776
 -0.12345305  0.40108737 -0.36697736  0.22383201  0.223479    0.34414145
  0.2924009  -0.08213998 -0.16994385  0.14942707  0.3083662  -0.29354784
  0.2813535   0.19445191 -0.09893961  0.18049172  0.2639099  -0.2928732
  0.22133411  0.3848023  -0.3763623  -0.39090097  0.1398391  -0.19342977
  0.2217506  -0.25360197 -0.04459268 -0.24561936 -0.24732517 -0.2852928
 -0.3145209   0.23152035 -0.08398732  0.2357348   0.43923882 -0.22179724
 -0.26850736  0.22

**Q6**. We just trained a model that factorized our ratings matrix into a matrix U of shape (n_users, no_components), `model.user_embeddings`, and a matrix V of shape (n_movies, no_components), `model.item_embeddings`.
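
As a rough illustration of what this factorization means, the score of a (user, movie) pair is the dot product of the two embedding vectors, to which LightFM adds a user bias and an item bias. The sketch below uses random toy arrays in place of the trained model, so `U`, `V`, `user_bias` and `item_bias` are stand-ins introduced for this example only.

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_movies, no_components = 4, 6, 3

# Stand-ins for model.user_embeddings and model.item_embeddings
U = rng.normal(size=(n_users, no_components))
V = rng.normal(size=(n_movies, no_components))
# Stand-ins for the per-user and per-item bias terms
user_bias = rng.normal(size=n_users)
item_bias = rng.normal(size=n_movies)

# Scores for every (user, movie) pair at once
scores = U @ V.T + user_bias[:, None] + item_bias[None, :]
print(scores.shape)  # (4, 6)

# The same score computed for a single pair
u, i = 2, 5
single = U[u] @ V[i] + user_bias[u] + item_bias[i]
print(np.isclose(scores[u, i], single))  # True
```

Movies whose rows in V point in similar directions receive similar scores from every user, which is why comparing rows of V is a sensible way to measure movie similarity.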

Now we want to compute **similarity between each pair of movies**.

> 🔦 **Hint**: As the similarity measure, we can use either cosine similarity (`cosine_similarity`) or Pearson correlation (`np.corrcoef`):
> - **Cosine similarity** between the rows of two matrices X and Y is given by:
> ``` python
> from sklearn.metrics.pairwise import cosine_similarity
> cosine_similarity(X, Y)
> ```
> - **Pearson correlation** between the rows of X and Y (`np.corrcoef` stacks the two inputs and returns the correlation between every pair of rows) is given by:
> ``` python
> import numpy as np
> np.corrcoef(X, Y)
> ```

Compute the `similarity_scores` matrix of shape (n_movies, n_movies), where element (i, j) is the similarity between the movie of index i and the movie of index j.

In [12]:
# METHOD 1 : cosine
from sklearn.metrics.pairwise import cosine_similarity
similarity_scores = cosine_similarity(model.item_embeddings)
similarity_scores
# METHOD 2 : Pearson
#similarity_scores = np.corrcoef(model.item_embeddings)
#similarity_scores

array([[ 0.9999999 ,  0.35686016,  0.55905855, ..., -0.25982952,
        -0.3204423 , -0.3208219 ],
       [ 0.35686016,  1.        ,  0.4236119 , ..., -0.29630297,
        -0.30853668, -0.2569683 ],
       [ 0.55905855,  0.4236119 ,  1.0000001 , ..., -0.2764431 ,
        -0.18271978, -0.15962668],
       ...,
       [-0.25982952, -0.29630297, -0.2764431 , ...,  1.        ,
         0.7492958 ,  0.6513411 ],
       [-0.3204423 , -0.30853668, -0.18271978, ...,  0.7492958 ,
         0.9999997 ,  0.8015574 ],
       [-0.3208219 , -0.2569683 , -0.15962668, ...,  0.6513411 ,
         0.8015574 ,  1.0000001 ]], dtype=float32)
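
The two measures are closely related: the Pearson correlation between two rows is the cosine similarity of those rows after subtracting each row's mean. Here is a small NumPy sketch of both, checked against `np.corrcoef`. The helper names `cosine_sim` and `pearson_sim` and the toy matrix `E` are hypothetical.

```python
import numpy as np

def cosine_sim(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / norms                # scale every row to unit length
    return X_unit @ X_unit.T          # dot products of unit vectors

def pearson_sim(X):
    """Pairwise Pearson correlation: cosine similarity of mean-centered rows."""
    return cosine_sim(X - X.mean(axis=1, keepdims=True))

# Toy "embeddings": 3 items, 3 components
E = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.5],
              [3.0, 1.0, -2.0]])

print(np.allclose(pearson_sim(E), np.corrcoef(E)))  # True
print(np.allclose(np.diag(cosine_sim(E)), 1.0))     # True
```

The diagonal is 1 up to rounding, which is why the float32 output above shows values such as 0.9999999 and 1.0000001.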

**Q7**. For the movie at idx 20, which are the 10 most similar movies?

In [10]:
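idx = 20
sims_idx = similarity_scores[idx]                  # similarity of this movie to every movie
ranked_idx = np.argsort(-sims_idx)                 # indices sorted by decreasing similarity
ranked_mid = [idx_to_mid[x] for x in ranked_idx]   # map matrix indices back to movieId
[movies[movies.movieId == mid]["title"] for mid in ranked_mid[:10]]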

[314    Forrest Gump (1994)
 Name: title, dtype: object,
 277    Shawshank Redemption, The (1994)
 Name: title, dtype: object,
 257    Pulp Fiction (1994)
 Name: title, dtype: object,
 224    Star Wars: Episode IV - A New Hope (1977)
 Name: title, dtype: object,
 0    Toy Story (1995)
 Name: title, dtype: object,
 461    Schindler's List (1993)
 Name: title, dtype: object,
 418    Jurassic Park (1993)
 Name: title, dtype: object,
 43    Seven (a.k.a. Se7en) (1995)
 Name: title, dtype: object,
 325    Mask, The (1994)
 Name: title, dtype: object,
 510    Silence of the Lambs, The (1991)
 Name: title, dtype: object]

**Q8**. Let's now test our engine! Suppose we have a user who likes **Toy Story** 🧸 (movie_id = 1). Which movies would you recommend to that user? In other words, which movies are the most similar to Toy Story?

> ⚠️ **Warning**: Remember that `similarity_scores` is indexed by `idx`, whereas what you know about your movie is its `movie_id`.

Retrieve the **top 5 recommendations**.
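
The warning above matters because `movieId` values are not contiguous, so they cannot be used directly as row numbers. A minimal sketch of the two lookups, assuming the mappings are plain dictionaries (the `_toy` names and the five ids are illustrative only):

```python
# Toy movieId values; real ids have gaps, just like these
movie_ids = [1, 3, 6, 47, 50]

# movieId -> row index in the matrix, and the reverse mapping
mid_to_idx_toy = {mid: idx for idx, mid in enumerate(movie_ids)}
idx_to_mid_toy = {idx: mid for mid, idx in mid_to_idx_toy.items()}

row = mid_to_idx_toy[47]       # row to read in the similarity matrix
print(row)                     # 3
print(idx_to_mid_toy[row])     # 47
```

Always convert a `movieId` to an `idx` before indexing the matrix, and convert the ranked indices back to `movieId` before looking up titles.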

In [13]:
idx = mid_to_idx[1]                                # convert movieId 1 to its matrix index
sims_idx = similarity_scores[idx]
ranked_idx = np.argsort(-sims_idx)
ranked_mid = [idx_to_mid[x] for x in ranked_idx]
# Show the top 10; the first entry is Toy Story itself
[movies[movies.movieId == mid]["title"] for mid in ranked_mid[:10]]

[0    Toy Story (1995)
 Name: title, dtype: object,
 314    Forrest Gump (1994)
 Name: title, dtype: object,
 277    Shawshank Redemption, The (1994)
 Name: title, dtype: object,
 224    Star Wars: Episode IV - A New Hope (1977)
 Name: title, dtype: object,
 257    Pulp Fiction (1994)
 Name: title, dtype: object,
 964    Groundhog Day (1993)
 Name: title, dtype: object,
 1005    When Harry Met Sally... (1989)
 Name: title, dtype: object,
 325    Mask, The (1994)
 Name: title, dtype: object,
 615    Independence Day (a.k.a. ID4) (1996)
 Name: title, dtype: object,
 461    Schindler's List (1993)
 Name: title, dtype: object]
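
Since a movie has a similarity of about 1 with itself, the query movie sits at the top of its own ranking, as Toy Story does here. One possible way to return only other movies is to mask the query index before sorting. The sketch below uses a toy similarity matrix, and `top_k_similar` is a hypothetical helper, not part of the challenge code.

```python
import numpy as np

def top_k_similar(similarity, idx, k=10):
    """Hypothetical helper: indices of the k most similar items, excluding idx itself."""
    sims = similarity[idx].astype(float).copy()
    sims[idx] = -np.inf               # push the query item to the end of the ranking
    return np.argsort(-sims)[:k]

# Toy symmetric similarity matrix for 4 items
toy_sim = np.array([[1.0, 0.2, 0.9, 0.4],
                    [0.2, 1.0, 0.1, 0.7],
                    [0.9, 0.1, 1.0, 0.3],
                    [0.4, 0.7, 0.3, 1.0]])

print(top_k_similar(toy_sim, idx=0, k=2))  # [2 3]
```

With this masking, asking for the top 5 yields 5 genuine recommendations rather than the query movie plus 4 others.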

As the next step is to **deploy your model**, you now need to:

**Q9**. Save your `similarity_scores` in pickle format, along with the `movies` DataFrame. Save both in the `data/netflix` directory at the root of the repository.

In [14]:
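# pkl_data is "./data/" in this notebook; point it at data/netflix if your repository uses that layout
with open(pkl_data + 'similarity_scores.pkl', 'wb') as f:
    pickle.dump(similarity_scores, f)

with open(pkl_data + 'movies.pkl', 'wb') as f:
    pickle.dump(movies, f)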
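
Before deploying, it can be worth checking that a saved object loads back unchanged. Here is a minimal round-trip sketch that writes a toy array to a temporary directory; the file name mirrors the one above, but the data and the location are stand-ins.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy stand-in for similarity_scores
toy_scores = np.eye(3, dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "similarity_scores.pkl")
    # Write the object, then read it back from disk
    with open(path, "wb") as f:
        pickle.dump(toy_scores, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(np.array_equal(toy_scores, restored))  # True
print(restored.dtype)                        # float32
```

Keep in mind that the full (n_movies, n_movies) matrix is dense, so its pickle grows quadratically with the number of movies.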

**Q10**. Encapsulate the previous code into functions. In particular, you will need:
- `get_sim_scores(mid)`, which returns the vector of similarity scores `sims` between the movie `mid` and all the movies
- `get_ranked_recos(sims)`, which takes a vector of similarity scores `sims` and returns the full list of ranked recommendations (n_movies entries, from most to least recommended) as a list of (mid, score, name) tuples.

In [13]:
def get_movie_name(mid, movies):
    try:
        name = movies.loc[movies.movieId == mid].title.values[0]
    except IndexError:
        # No row matches this movieId
        name = "Unknown"
    return name

def get_sim_scores(mid):
    idx = mid_to_idx[mid]
    sims = similarity_scores[idx]
    return sims

def get_ranked_recos(sims):
    recos = []
    for idx in np.argsort(-sims):
        mid = idx_to_mid[idx]
        name = get_movie_name(mid, movies)
        score = sims[idx]
        recos.append((mid, score, name))
    return recos

In [14]:
sims = get_sim_scores(1)  # movieId 1 is Toy Story
get_ranked_recos(sims)[:10]

[(1, 0.9999999, 'Toy Story (1995)'),
 (356, 0.8801334, 'Forrest Gump (1994)'),
 (318, 0.85700893, 'Shawshank Redemption, The (1994)'),
 (260, 0.8283075, 'Star Wars: Episode IV - A New Hope (1977)'),
 (296, 0.8230713, 'Pulp Fiction (1994)'),
 (1265, 0.80243915, 'Groundhog Day (1993)'),
 (1307, 0.80106086, 'When Harry Met Sally... (1989)'),
 (367, 0.79584306, 'Mask, The (1994)'),
 (780, 0.79139316, 'Independence Day (a.k.a. ID4) (1996)'),
 (527, 0.78670794, "Schindler's List (1993)")]

If you have extra time, feel free to improve your recommendation engine!