# Recommender Systems

Recommender systems suggest items to users based on their past preferences. They are generally classified into the following categories:

- **Content-based** filtering, which relies on the similarity between item attributes (e.g. keywords, categories) and the user's preferences. For instance, if a user liked Item A, the recommender system recommends items with similar characteristics. For movies, those characteristics would include the producer, the genre, etc. Typically, this can be done with a supervised binary classification algorithm.


- **Collaborative filtering**, which calculates similarity from interactions (e.g. ratings, number of purchases, likes). It finds customers with similar preferences, then recommends to a given customer items that are new to them and that customers with a similar profile preferred. The underlying assumption is that users with similar movie-watching habits have similar preferences in general: the system finds users who watch videos similar to those the target user watches, then identifies videos those users have watched but the target user has not. There are three approaches to this: user-user collaborative filtering, item-item collaborative filtering and matrix factorization.


- **Hybrid systems**, a newer and usually more powerful approach, which combine the two methods above.

<img src="https://wagon-public-datasets.s3.amazonaws.com/data-science-images/05-ML/06-Unsupervised-Learning/recommendation_systems.png" width=600>

The above diagram categorises in detail the different methodologies for building a recommender system.

In the context of the `movieLens` dataset, which is used below as a sample:
- We will suggest movies based on content similarity (e.g. genre, cast), which is item-content filtering.
- We could also compare user metadata, such as age and gender, and suggest items that similar users have liked; that would be user-content filtering. The `movieLens` dataset doesn't contain any user metadata, so we will not use user-content filtering, and the ratings will only feed an item-item collaborative filter.

**Memory-based** content filtering

In memory-based methods, no model is trained on the data to make predictions. Instead, a matrix of pairwise similarities is computed once and then looked up to recommend movies.
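
To make the idea of a pre-computed similarity matrix concrete, here is a minimal sketch on made-up data. The four titles, their feature values and the helper `most_similar` are hypothetical and only serve the illustration; they are not part of the `movieLens` data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy items: 4 movies described by 3 made-up numeric attributes.
titles = ["A", "B", "C", "D"]
features = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.9],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.2],
])

# Pre-compute every pairwise similarity once; nothing is trained.
similarity = cosine_similarity(features)  # shape (4, 4)

def most_similar(title, k=2):
    """Look up the k most similar titles in the pre-computed matrix."""
    i = titles.index(title)
    order = np.argsort(-similarity[i])  # indices sorted by decreasing similarity
    return [titles[j] for j in order if j != i][:k]

print(most_similar("A"))
```

Recommending then amounts to reading one row of `similarity`, which is why these methods are called memory-based.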

## Data collection and cleaning

Download the necessary datasets by running the lines below. Then, load the datasets in three different pandas DataFrames `movies`, `tags` and `ratings`.

In [1]:
# !curl https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/06-Unsupervised-Learning/movie_titles.csv > data/movies.csv
# !curl https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/06-Unsupervised-Learning/movie_tags.csv > data/tags.csv
# !curl https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/06-Unsupervised-Learning/movie_ratings.csv > data/ratings.csv

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings; warnings.simplefilter('ignore')

# Load movies data
movies = pd.read_csv('data/movies.csv')
tags = pd.read_csv('data/tags.csv')
ratings = pd.read_csv('data/ratings.csv')

movies['title'] = movies['title'].apply(lambda x: x.split(' (')[0])

display(movies.head())
display(tags.head())
display(ratings.head())

Unnamed: 0,movieId,title,genres
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji,Adventure|Children|Fantasy
2,3,Grumpier Old Men,Comedy|Romance
3,4,Waiting to Exhale,Comedy|Drama|Romance
4,5,Father of the Bride Part II,Comedy


Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [3]:
ratings[ratings['userId'] == 1].sort_values('timestamp', ascending=False).head(10)

Unnamed: 0,userId,movieId,rating,timestamp
161,1,2492,4.0,965719662
119,1,2012,4.0,964984176
160,1,2478,4.0,964984169
31,1,553,5.0,964984153
95,1,1445,3.0,964984112
9,1,157,5.0,964984100
42,1,780,3.0,964984086
201,1,3053,5.0,964984086
90,1,1298,5.0,964984086
214,1,3448,5.0,964984054


__Remove the '|' that separates the different genres of a movie and replace it with a space.__

In [4]:
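# regex=False: '|' must be treated as a literal character, not as the regex "or" operator
movies['genres'] = movies['genres'].str.replace('|', ' ', regex=False)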
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story,Adventure Animation Children Comedy Fantasy
1,2,Jumanji,Adventure Children Fantasy
2,3,Grumpier Old Men,Comedy Romance
3,4,Waiting to Exhale,Comedy Drama Romance
4,5,Father of the Bride Part II,Comedy
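
The replacement of `|` deserves a small aside, because `|` is also the "or" operator in regular expressions. The sketch below, on a made-up `genres` Series, passes `regex=False` so that the character is taken literally whatever the pandas version.

```python
import pandas as pd

# Hypothetical two-row stand-in for the genres column.
genres = pd.Series(["Adventure|Animation|Children", "Comedy"])

# regex=False makes '|' a plain character instead of a regex alternation.
cleaned = genres.str.replace("|", " ", regex=False)
print(cleaned.tolist())
```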


__Filter the `movies` dataframe, to have only movies that have received a rating.__

In [5]:
# Filter the movies DF
movies = movies[movies.movieId.isin(ratings.movieId.unique().tolist())]
movies.shape

(9724, 3)

## Feature Engineering

We're gonna create a new feature called `metadata` that merges **all the text data** we have about a movie: the genres and the tags.

__Merge the `movies` and `tags` DataFrames.__

In [6]:
# Merge (left join: movies without any tag are kept, with NaN in the tag columns)
merged = pd.merge(movies, tags, on='movieId', how='left')
merged.head(10)

Unnamed: 0,movieId,title,genres,userId,tag,timestamp
0,1,Toy Story,Adventure Animation Children Comedy Fantasy,336.0,pixar,1139046000.0
1,1,Toy Story,Adventure Animation Children Comedy Fantasy,474.0,pixar,1137207000.0
2,1,Toy Story,Adventure Animation Children Comedy Fantasy,567.0,fun,1525286000.0
3,2,Jumanji,Adventure Children Fantasy,62.0,fantasy,1528844000.0
4,2,Jumanji,Adventure Children Fantasy,62.0,magic board game,1528844000.0
5,2,Jumanji,Adventure Children Fantasy,62.0,Robin Williams,1528844000.0
6,2,Jumanji,Adventure Children Fantasy,474.0,game,1137376000.0
7,3,Grumpier Old Men,Comedy Romance,289.0,moldy,1143425000.0
8,3,Grumpier Old Men,Comedy Romance,289.0,old,1143425000.0
9,4,Waiting to Exhale,Comedy Drama Romance,,,


__Create a new dataframe `merged_df` with the `metadata` column, containing a string of both tags AND genres merged together.__

So you have to merge **tags first** together per movie. And **then** merge the combined tags with the **genres** column.

For example the `metadata` column of the movie Toy Story will be `pixar pixar fun Adventure Animation Children Comedy Fantasy`.

👉 `merged_df` must contain at least the following columns: `movieId`, `title` and `metadata`.

In [7]:
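# Create a new DataFrame merged_df with a new metadata column from tags and genres.
# Movies without tags have NaN after the left join: use an empty string so they can be joined.
merged['tag'] = merged['tag'].fillna('')
# Step 1: concatenate all the tags of each movie into a single string.
tags_per_movie = merged.groupby('movieId')['tag'].apply(' '.join).reset_index()
# Step 2: attach the combined tags to the movies and append the genres.
merged_df = pd.merge(movies, tags_per_movie, on='movieId', how='left')
merged_df['metadata'] = merged_df['tag'] + ' ' + merged_df['genres']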

merged_df.head(10)

Unnamed: 0,movieId,title,genres,tag,metadata
0,1,Toy Story,Adventure Animation Children Comedy Fantasy,pixar pixar fun,pixar pixar fun Adventure Animation Children C...
1,2,Jumanji,Adventure Children Fantasy,fantasy magic board game Robin Williams game,fantasy magic board game Robin Williams game A...
2,3,Grumpier Old Men,Comedy Romance,moldy old,moldy old Comedy Romance
3,4,Waiting to Exhale,Comedy Drama Romance,,Comedy Drama Romance
4,5,Father of the Bride Part II,Comedy,pregnancy remake,pregnancy remake Comedy
5,6,Heat,Action Crime Thriller,,Action Crime Thriller
6,7,Sabrina,Comedy Romance,remake,remake Comedy Romance
7,8,Tom and Huck,Adventure Children,,Adventure Children
8,9,Sudden Death,Action,,Action
9,10,GoldenEye,Action Adventure Thriller,,Action Adventure Thriller
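
The two-step construction of `metadata` (tags first, then genres) can be replayed on a tiny example. The tables `toy_movies` and `toy_tags` below are hypothetical miniatures written for this sketch; only the Toy Story tags mirror the example given earlier.

```python
import pandas as pd

# Hypothetical miniature versions of the movies and tags tables.
toy_movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story", "Heat"],
    "genres": ["Adventure Animation", "Action Crime"],
})
toy_tags = pd.DataFrame({
    "movieId": [1, 1, 1],
    "tag": ["pixar", "pixar", "fun"],
})

# Step 1: one string of tags per movie.
toy_tags_per_movie = toy_tags.groupby("movieId")["tag"].apply(" ".join).reset_index()

# Step 2: attach to the movies (untagged movies get an empty string) and append the genres.
toy_merged = toy_movies.merge(toy_tags_per_movie, on="movieId", how="left")
toy_merged["tag"] = toy_merged["tag"].fillna("")
toy_merged["metadata"] = (toy_merged["tag"] + " " + toy_merged["genres"]).str.strip()

print(toy_merged[["title", "metadata"]])
```

The final `.str.strip()` is a small design choice of this sketch: it removes the leading space that untagged movies would otherwise carry.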


## Create a content latent matrix from metadata

### Count Vectorizer

Next we need to transform the metadata text to vectors, in order to feed it to our machine learning algorithms. Machine Learning models don't understand text data so we need to encode it.

We're gonna use the [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to encode the metadata column. 

Create a new dataframe `count_df` with the vectors you get from this count transformation. Every line is gonna be the frequency vector for this specific movie.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(merged_df['metadata'])
count_df = pd.DataFrame(count_matrix.toarray(), index=merged_df.index.tolist())

### Dimensionality reduction

Each movie metadata was transformed into a vector of length ~1675!

As we saw in previous lessons, we can apply dimensionality reduction methods to describe the data (movies) without much loss of information. Truncated singular value decomposition (SVD) is another advanced tool to reduce dimensions.

Contrary to PCA, this estimator **does not center the data before computing the singular value decomposition**. This means it can work with **sparse matrices efficiently**. In particular, truncated SVD works on term count/frequency matrices. In that context, it is known as latent semantic analysis (LSA).

See the [`TruncatedSVD`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) class in scikit-learn for details; the principle is the same as for PCA.
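
As a rough illustration of the point about sparse input, the sketch below feeds a SciPy CSR matrix straight into `TruncatedSVD`. The data is random and purely hypothetical, and `X_sparse`, `svd_demo` and `X_reduced` are names made up for this example.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical count-like matrix: 50 "movies" x 30 "terms", about 10% non-zero entries.
dense = rng.integers(1, 4, size=(50, 30)) * (rng.random((50, 30)) < 0.1)
X_sparse = sparse.csr_matrix(dense.astype(float))

# The sparse matrix is passed as is; no call to .toarray() is needed.
svd_demo = TruncatedSVD(n_components=5, random_state=0)
X_reduced = svd_demo.fit_transform(X_sparse)

print(X_reduced.shape)
```

The output of `fit_transform` is a dense array with one row per input row and `n_components` columns.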

__👉 Apply the truncated SVD with a reduced number of features (say 25) to reduce the dimension of your count matrix.__

In [9]:
from sklearn.decomposition import TruncatedSVD

# Apply SVD
svd = TruncatedSVD(n_components=25)
latent_df = svd.fit_transform(count_df)
latent_df.shape

(9724, 25)

With only the first 25 components (out of roughly 1,700 initially) we can explain **more than 80%** of the variance, which suffices for our study purposes. You can check this with `svd.explained_variance_ratio_.sum()`.
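
One way to inspect how much variance a given number of components explains is the `explained_variance_ratio_` attribute of a fitted `TruncatedSVD`. The sketch below runs on synthetic low-rank data, a hypothetical stand-in for the count matrix, so its numbers say nothing about the movie data itself.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)

# Hypothetical data: rank-3 signal plus a little noise, 200 rows x 40 columns.
X_demo = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 40))
X_demo += 0.01 * rng.normal(size=(200, 40))

svd_check = TruncatedSVD(n_components=10, random_state=0).fit(X_demo)

# Cumulative share of variance explained by the first 1, 2, ... components.
cumulative = np.cumsum(svd_check.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 80%.
n_needed = int(np.searchsorted(cumulative, 0.80) + 1)
print(cumulative.round(3), n_needed)
```

Plotting `cumulative` against the number of components is a common way to choose `n_components`.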

👉 **Save the 25 components of this matrix into a new `latent_df` dataframe, indexed by the titles from the movies.**

In [10]:
# Keep the first 25 latent components.
n = 25
latent_df = pd.DataFrame(latent_df[:,0:n], index=merged_df.title.tolist())
latent_df.shape

(9724, 25)

## Create a Latent Matrix from User Ratings

Apart from the metadata, we have another valuable source of information: **the user ratings**.

A recommender system can recommend a movie that is similar on the basis of user ratings (item-item collaborative filtering).

👉 **We prepare the dataset below to have movies as rows, `userId`s as columns.**

In [11]:
# Merge
ratings1 = pd.merge(movies[['movieId']], ratings, on="movieId", how="right")
# Pivot
ratings2 = ratings1.pivot(index = 'movieId', columns ='userId', values = 'rating').fillna(0)
display(ratings2.head())
ratings2.shape

movieId \ userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0


(9724, 610)

We just created a dataset of 9724 movies, each represented by a vector of 610 user ratings (one per user), where 0 means that the user did not rate the movie.
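
The reshaping from a long list of ratings to a movie-by-user matrix is easier to follow on a handful of rows. `toy_ratings` below is a hypothetical table with the same three columns as `ratings`.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 5.0, 3.0, 2.5],
})

# Rows = movies, columns = users; missing (movie, user) pairs become 0.
toy_matrix = toy_ratings.pivot(index="movieId", columns="userId", values="rating").fillna(0)
print(toy_matrix)
```

Note that filling with 0 makes "not rated" look like a very low rating. This is a simplification that is acceptable for a sketch but worth keeping in mind when reading the similarities.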

👉**Once again, we will apply SVD to the `ratings2` dataframe, so as to keep only the first 200 components. Call this dataframe `latent_df_2`**

In [12]:
svd = TruncatedSVD(n_components=200)
latent_df_2 = svd.fit_transform(ratings2)

**👉 Re-index by Film Title.**

In [13]:
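# Look titles up by movieId so that each row of the latent matrix gets the right title
titles_by_id = merged_df.set_index('movieId')['title']
latent_df_2 = pd.DataFrame(latent_df_2, index=titles_by_id.loc[ratings2.index].tolist())
latent_df_2.head()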

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,190,191,192,193,194,195,196,197,198,199
Toy Story,37.649828,-6.380094,14.994617,0.286947,-1.887147,4.79005,8.807629,-2.198636,1.153304,1.387562,...,-0.273726,2.541727,3.25549,-0.79746,1.225651,1.908654,1.346102,1.776804,-0.384718,1.09809
Jumanji,20.596193,-0.47788,10.865916,-8.635795,-3.072411,-0.517149,4.359215,-3.821408,-3.394116,0.236474,...,-0.545328,-1.321288,-2.469876,-0.186001,0.459565,0.304489,-1.903241,-0.97601,0.435597,0.932036
Grumpier Old Men,8.504182,-5.714924,3.441693,-6.38281,0.024162,-0.783432,-1.423026,-0.542292,3.270655,1.897237,...,-0.168528,0.527639,0.728421,-0.037184,0.185034,0.486992,0.493486,-0.273337,1.157014,-0.830371
Waiting to Exhale,0.46572,-0.828101,0.954583,-0.23908,-0.735695,-0.877414,0.21472,-0.527368,0.079467,0.015617,...,-0.006329,-0.032474,0.017021,0.018139,-0.062625,-0.312957,-0.285375,0.212839,-0.344627,0.307943
Father of the Bride Part II,6.071045,-2.673536,4.698195,-4.415807,-3.714221,-1.673627,1.297653,-1.308023,1.325734,1.229592,...,0.651449,0.295894,0.644498,0.334287,0.331904,0.206781,0.561937,0.647722,-0.582062,-0.973639


## Apply Cosine Similarity in Content and Collaborative Matrices

Next, we will use a similarity measure to find the top $N$ **most similar movies to "Toy Story"** based on the filtering methods we created. Cosine similarity is one such measure; scikit-learn provides it as [`cosine_similarity`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html).
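
For reference, the cosine similarity of two vectors $u$ and $v$ is $\frac{u \cdot v}{\|u\| \, \|v\|}$: it depends only on the angle between them, not on their lengths. The sketch below computes it by hand on three made-up vectors and compares with scikit-learn; the helper `cosine` is written for this illustration only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

def cosine(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)"""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b), cosine(a, c))

# scikit-learn expects 2-D inputs, hence the reshape and the stacking.
sk_result = cosine_similarity(a.reshape(1, -1), np.vstack([b, c]))
print(sk_result)
```

Vectors pointing in the same direction score 1 and orthogonal vectors score 0, whatever their norms.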

__👉 Compute the cosine similarity of a sample movie to both content (metadata) and collaborative (ratings) latent matrices.__

In [14]:
# Load the streaming platforms DataFrame (one 0/1 column per platform)
streamers_df = pd.read_csv('MoviesOnStreamingPlatforms.csv', index_col="Unnamed: 0")
streamers_list = ['Netflix', 'Hulu', 'Prime Video', 'Disney+']
streamers_df.head(3)

Unnamed: 0,ID,Title,Year,Age,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type
0,1,The Irishman,2019,18+,98/100,1,0,0,0,0
1,2,Dangal,2016,7+,97/100,1,0,0,0,0
2,3,David Attenborough: A Life on Our Planet,2020,7+,95/100,1,0,0,0,0


In [15]:
from sklearn.metrics.pairwise import cosine_similarity

def movie_recomendations(movie_name):  
    '''
    Return a table of recommended movies sorted by decreasing hybrid score.
    '''
    # Get the latent vectors of `movie_name` from the content and collaborative matrices
    v1 = np.array(latent_df.loc[movie_name]).reshape(1, -1)
    v2 = np.array(latent_df_2.loc[movie_name]).reshape(1, -1)

    # Compute the cosine similarity of this movie with the others in the list
    sim1 = cosine_similarity(latent_df, v1).reshape(-1)
    sim2 = cosine_similarity(latent_df_2, v2).reshape(-1)

    hybrid = ((sim1 + sim2)/2.0)

    dictDf = {'content': sim1 , 'collaborative': sim2, 'hybrid': hybrid} 
    recommendation_df = pd.DataFrame(dictDf, index = latent_df.index)
    recommendation_df.sort_values('hybrid', ascending=False, inplace=True)
    return recommendation_df
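
The hybrid score used above is the plain average of the two similarities. A weighted variant is a natural generalisation; the sketch below is one possible form, where the parameter `alpha`, the helper `hybrid_score` and the three scores per method are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical similarities of three candidate movies to a query movie.
content_sim = np.array([0.9, 0.2, 0.6])
collab_sim = np.array([0.1, 0.8, 0.6])

def hybrid_score(sim_content, sim_collab, alpha=0.5):
    """Weighted mean; alpha=0.5 reproduces the plain average (sim1 + sim2) / 2."""
    return alpha * sim_content + (1 - alpha) * sim_collab

scores = pd.DataFrame(
    {
        "content": content_sim,
        "collaborative": collab_sim,
        "hybrid": hybrid_score(content_sim, collab_sim),
    },
    index=["X", "Y", "Z"],
).sort_values("hybrid", ascending=False)
print(scores)
```

With equal weights, a movie that is moderately similar on both criteria (`Z`) ranks above movies that are very similar on only one of them.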

In [16]:
# Create a DataFrame of movie recommendations without duplicates (keeping the first, i.e. best-scored, occurrence), sorted by hybrid score.
# For each movie, only use the top 20 recommendations; row 0 is skipped because it is expected to be the movie itself.


movies_list = ['Toy Story', 'Shrek', 'Father of the Bride Part II', 'Jumanji'] 
# movies_list = ['GoldenEye', 'Heat', 'Usual Suspects, The', 'Seven']

movies_list_list = [movie_recomendations(i)[1:21].reset_index() for i in movies_list]

recommended_movies = pd.concat(movies_list_list).sort_values('hybrid', ascending=False).drop_duplicates(subset='index', keep='first')
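
The order of operations in the last line matters: sorting before `drop_duplicates(keep='first')` means the copy that survives is the one with the highest hybrid score. A small sketch with two made-up recommendation lists (`recs_a` and `recs_b`, hypothetical scores) shows the effect.

```python
import pandas as pd

# Hypothetical recommendation lists for two seed movies, sharing one title.
recs_a = pd.DataFrame({"index": ["Shrek 2", "Antz"], "hybrid": [0.80, 0.70]})
recs_b = pd.DataFrame({"index": ["Antz", "Bambi"], "hybrid": [0.75, 0.60]})

combined = (
    pd.concat([recs_a, recs_b])
    .sort_values("hybrid", ascending=False)         # best score first
    .drop_duplicates(subset="index", keep="first")  # so "first" is the best copy
)
print(combined)
```

Here `Antz` appears once in the result, with the higher of its two scores.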

In [17]:
# Join our recommended movies with the streamers DataFrame, matching on the movie title
joined_df = recommended_movies.merge(streamers_df, left_on='index', right_on='Title')

In [18]:
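# Summarize the joined DataFrame per streaming platform
dict_table = {}
for i in streamers_list:
    # Hybrid scores of the recommended movies available on platform i
    on_platform = joined_df.loc[joined_df[i] == 1, 'hybrid']
    dict_table[i] = [joined_df[i].sum(),                 # number of recommended movies on the platform
                     on_platform.mean(),                 # mean hybrid score of those movies
                     on_platform.sum() / len(joined_df)  # their summed score divided by all matched movies
                    ]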

In [19]:
pd.DataFrame.from_dict(dict_table, orient='index',
                       columns=['recommended movies','mean score of movies', 'total mean score']). \
                       sort_values(['recommended movies', 'mean score of movies'], ascending=False)

Unnamed: 0,recommended movies,mean score of movies,total mean score
Disney+,11,0.696616,0.364894
Hulu,6,0.717887,0.205111
Prime Video,3,0.656621,0.093803
Netflix,3,0.641735,0.091676


In [20]:
joined_df.tail(3)

Unnamed: 0,index,content,collaborative,hybrid,ID,Title,Year,Age,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type
18,Stardust,0.817202,0.402113,0.609657,169,Stardust,2007,13+,79/100,1,0,0,0,0
19,"Honey, I Shrunk the Kids",0.732112,0.482965,0.607539,8775,"Honey, I Shrunk the Kids",1989,7+,72/100,0,0,0,1,0
20,Atlantis: The Lost Empire,0.812174,0.366497,0.589335,8749,Atlantis: The Lost Empire,2001,7+,74/100,0,0,0,1,0
