# Introductory Recommender Systems

Summarized by QH

Last updated on 2023-01-18

## 1. What are recommender systems?

Recommender systems are widely used today, in both simple and sophisticated forms:
* Your best friend recommends a restaurant to you because one of his friends recommended it to him.
* YouTube recommends videos you may like based on your watch history.
* Amazon recommends products based on your purchase history, or on the choices of other people whose purchasing behavior or preferences are similar to yours.

This summary covers the recommender systems used in the digital world. They fall into three types: _Simple popularity-based Recommenders_, _Content-based recommenders_, and _Collaborative filtering engines_.

## 1.1 Simple popularity-based Recommenders
It offers the same general recommendation to every user, based on popularity. The ranking can be further grouped by genre, district, etc. For movie recommendations, the IMDB Top 250 is an example.
* _Algorithm_: 
    * Calculate the number of votes for each item.
    * Rank the items by their number of votes.
    * Select the top $k$ items.
    
* _Limitations_: No customization for different users. All users get the same recommendation.
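
The steps above, including the optional grouping by genre, can be sketched on a toy vote log. This is a minimal illustration, not the code used later in this notebook: the data frame `vote_log`, its columns, and the helper names `top_k_popular` and `top_k_popular_by_genre` are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one row per vote, with the voted item and its genre.
vote_log = pd.DataFrame({
    "item":  ["A", "A", "A", "B", "B", "C", "D", "D", "D", "D"],
    "genre": ["Comedy"] * 6 + ["Drama"] * 4,
})


def top_k_popular(vote_log: pd.DataFrame, k: int) -> pd.Series:
    """Count the votes per item and keep the k most-voted items."""
    return vote_log["item"].value_counts().head(k)


def top_k_popular_by_genre(vote_log: pd.DataFrame, k: int) -> pd.DataFrame:
    """Same idea, but ranked separately inside each genre."""
    counts = vote_log.groupby(["genre", "item"]).size().reset_index(name="votes")
    counts = counts.sort_values(["genre", "votes"], ascending=[True, False])
    return counts.groupby("genre").head(k).reset_index(drop=True)


overall = top_k_popular(vote_log, k=2)
per_genre = top_k_popular_by_genre(vote_log, k=1)
```

Every user receives the same `overall` list, which is exactly the limitation noted above.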

## 1.2 Content-based recommenders
It recommends similar items based on particular characteristics of an item. The system uses the metadata of an item to make recommendations, for example genre, director, description, and actors for movies. The underlying assumption is that if a user likes an item, he or she will also like similar items. YouTube is an example of this type.

* _Algorithm_:
    * Prepare the item dataset, often using one-hot encoding for categorical features, e.g., movies with genre, director, and actors.
    * Compute the similarity between each pair of items. Different similarity metrics can be used.
    * Rank the similarity scores from highest to lowest and select the top $k$.

* _Limitations_: The recommendations are based on the past experience of users.
    * It will not recommend items from new areas the user has not experienced before.
    * Even if users have different preferences, the recommender cannot differentiate them when they have the same past experience.
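
As a rough sketch of the content-based algorithm above, the following uses one-hot genre vectors and cosine similarity. The item titles, the feature matrix, and the helper names are hypothetical, and cosine similarity is only one of the possible metrics.

```python
import numpy as np

# Hypothetical items with one-hot genre features.
# Columns: Action, Comedy, Drama, Sci-Fi
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
features = np.array([
    [1, 0, 0, 1],  # Movie A: Action, Sci-Fi
    [1, 0, 0, 1],  # Movie B: Action, Sci-Fi
    [0, 1, 1, 0],  # Movie C: Comedy, Drama
    [1, 0, 1, 0],  # Movie D: Action, Drama
], dtype=float)


def cosine_similarity_matrix(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.where(norms == 0, 1.0, norms)  # guard against all-zero rows
    return unit @ unit.T


def most_similar(item_index: int, similarity: np.ndarray, k: int):
    """Rank the other items by similarity to the given item and keep the top k."""
    scores = similarity[item_index].copy()
    scores[item_index] = -np.inf  # never recommend the item itself
    order = np.argsort(-scores, kind="stable")[:k]
    return [(titles[i], float(scores[i])) for i in order]


similarity = cosine_similarity_matrix(features)
recommendations = most_similar(0, similarity, k=2)
```

Here `recommendations` ranks Movie B first, since it shares both genres with Movie A, followed by Movie D, which shares one.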


## 1.3 Collaborative filtering engines
It predicts a user's rating of, or preference for, an item based on past ratings and preferences from other users.
* _User-based Filtering_: If similar users like an item, the system recommends that item to the user.
* _Item-based Filtering_: The system finds similar items based on how people have rated them in the past; if a user likes one item, it recommends other items that are similar.

### 1.3.1 User-Item Interaction Matrix
One of the key inputs to _collaborative filtering engines_ is the User-Item Interaction Matrix, where each cell holds either an __explicit__ or an __implicit__ rating from the user for the item:
* __explicit__ rating: A score that the user explicitly gives to the item.
* __implicit__ rating: A score derived from user behavior, e.g., the number of times the user uses or watches the item.
* The matrix is normally very sparse, since users cannot interact with every item.
    
|User/Item|Item 1| Item 2 | Item ... | Item M|
|:--      |:--   |:--     |:--       |:--    |
|User 1   | 2    | 4      |          | 3     |
|User 2   |      | 3      |          | 1     |
|User ... | 1    |        | 4        |       |
|User N   | 5    |        |          | 2     |

* _Limitations_: We need prior information about the user or the item before we can derive the similarity between the user and other users or between the item and other items.  
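
In practice, ratings usually arrive in long format (one row per user-item pair) and are pivoted into the matrix. The sketch below shows one way to do this with pandas; the toy data frame `ratings_long` and its column names are hypothetical, and its values mirror the first three rows of the table above.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, item) interaction.
ratings_long = pd.DataFrame({
    "user":   ["u1", "u1", "u1", "u2", "u2", "u3", "u3"],
    "item":   ["i1", "i2", "i4", "i2", "i4", "i1", "i3"],
    "rating": [2, 4, 3, 3, 1, 1, 4],
})

# Rows become users, columns become items; missing interactions show up as NaN.
interaction = ratings_long.pivot(index="user", columns="item", values="rating")

# Sparsity: the share of user-item cells that hold no rating.
sparsity = interaction.isna().to_numpy().mean()
```

Keeping the missing cells as NaN rather than 0 preserves the difference between "not rated" and "rated lowest".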

### 1.3.2 Similarity Score Based Recommenders
Based on the User-Item Interaction Matrix, we can calculate a similarity score between items or between users (the metric can be Pearson correlation or cosine similarity). The user ratings serve as the measurements from which the similarities are calculated.

* _User-based Filtering_:
    * For a particular user, find a group of similar users (a threshold on the similarity metric can determine the group size).
    * Calculate the average rating of each item across the group of similar users.
    * Rank the items by average rating from highest to lowest and select the top $k$ that the user has not interacted with before.

* _Item-based Filtering_:
    * For a particular item that a user liked, rank the other items by the similarity metric from highest to lowest.
    * Select the top $k$ items that the user has not interacted with before.
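
Both procedures can be sketched on a small toy matrix. This is an illustration under stated assumptions rather than a reference implementation: the ratings in `R` are hypothetical, cosine similarity is the chosen metric, and missing ratings are treated as 0 when computing similarities (one simple choice among several).

```python
import numpy as np

# Hypothetical ratings (rows: users, columns: items); np.nan means "not rated".
R = np.array([
    [5.0, 4.0, np.nan, 1.0, np.nan],  # user 0, the target user
    [5.0, 5.0, 4.0, 1.0, 2.0],        # user 1
    [1.0, 2.0, 1.0, 5.0, 5.0],        # user 2
    [4.0, 4.0, 5.0, np.nan, 3.0],     # user 3
])


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def user_based_scores(R: np.ndarray, user: int, min_similarity: float):
    """Average the ratings of similar users for items the target user has not rated."""
    filled = np.nan_to_num(R)  # assumption: missing ratings count as 0 for similarity
    neighbours = [v for v in range(len(R))
                  if v != user and cosine(filled[user], filled[v]) >= min_similarity]
    scores = {}
    for item in np.flatnonzero(np.isnan(R[user])):
        values = [R[v, item] for v in neighbours if not np.isnan(R[v, item])]
        if values:
            scores[int(item)] = float(np.mean(values))
    # Highest average rating first.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


def item_based_top_k(R: np.ndarray, user: int, liked_item: int, k: int):
    """Rank the user's unrated items by similarity to an item the user liked."""
    filled = np.nan_to_num(R)
    sims = [cosine(filled[:, liked_item], filled[:, j]) for j in range(R.shape[1])]
    unseen = [int(j) for j in np.flatnonzero(np.isnan(R[user]))]
    return sorted(unseen, key=lambda j: sims[j], reverse=True)[:k]


user_ranking = user_based_scores(R, user=0, min_similarity=0.5)
item_ranking = item_based_top_k(R, user=0, liked_item=0, k=2)
```

With the threshold of 0.5, users 1 and 3 form the neighbour group of user 0, so item 2 is ranked ahead of item 4 in both variants.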

### 1.3.3 Model Based Recommenders
Often, latent factors motivate a user to give a particular rating. One technique is to find these latent factors by decomposing the user-item interaction matrix, and then to use the latent features to infer the ratings users might give to products they have never interacted with.

There are several methods for matrix decomposition:
* Factor Analysis
* PCA
* Non-negative matrix factorization (NMF)
* Truncated SVD

## 2. Examples using Movie Lens data
The source data is from [grouplens.org - movielens data](https://grouplens.org/datasets/movielens/). We use only the ml-latest-small sample dataset, which includes 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users, last updated 9/2018.

In [69]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [50]:
# Read in the movie information and ids for imdb and tmdb
movies = pd.read_csv('./ml-latest-small/movies.csv')
links = pd.read_csv('./ml-latest-small/links.csv')
display(movies.head())
display(links.head())

|   | movieId | title | genres |
|:-- |:-- |:-- |:-- |
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


|   | movieId | imdbId | tmdbId |
|:-- |:-- |:-- |:-- |
| 0 | 1 | 114709 | 862.0 |
| 1 | 2 | 113497 | 8844.0 |
| 2 | 3 | 113228 | 15602.0 |
| 3 | 4 | 114885 | 31357.0 |
| 4 | 5 | 113041 | 11862.0 |


First, we consolidate the movies and links datasets into a movie information data frame. We modify the dataset as follows:
* year: extract the year in parentheses from the title column
* genres: split the pipe-separated string into a list of genres

In [51]:
movies_info = movies.copy()
# Create year column
## Find the (yyyy) pattern, e.g. (1995); titles without it get a missing year
movies_info['year'] = movies_info['title'].str.extract(r'(\(\d{4}\))', expand=False)
## Extract the digits without parenthesis
movies_info['year'] = movies_info['year'].str.extract(r'(\d{4})', expand=False)

# Remove the year component from title
movies_info['title'] = movies_info['title'].str.replace(r'(\(\d{4}\))', '', regex=True)
# Remove the leading and trailing blanks of the title
movies_info['title'] = movies_info['title'].str.strip(' ')
# Split the genres column
movies_info['genres'] = movies_info['genres'].str.split('|')

# Merge with links dataset
movies_info = movies_info.merge(links, on='movieId', how='left')
movies_info.head()

|   | movieId | title | genres | year | imdbId | tmdbId |
|:-- |:-- |:-- |:-- |:-- |:-- |:-- |
| 0 | 1 | Toy Story | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | 114709 | 862.0 |
| 1 | 2 | Jumanji | [Adventure, Children, Fantasy] | 1995 | 113497 | 8844.0 |
| 2 | 3 | Grumpier Old Men | [Comedy, Romance] | 1995 | 113228 | 15602.0 |
| 3 | 4 | Waiting to Exhale | [Comedy, Drama, Romance] | 1995 | 114885 | 31357.0 |
| 4 | 5 | Father of the Bride Part II | [Comedy] | 1995 | 113041 | 11862.0 |


In [52]:
movies_info.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9742 entries, 0 to 9741
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   title    9742 non-null   object 
 2   genres   9742 non-null   object 
 3   year     9729 non-null   object 
 4   imdbId   9742 non-null   int64  
 5   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2), object(3)
memory usage: 532.8+ KB
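
The `year` column has 9729 non-null values out of 9742 rows, so 13 titles did not match the `(yyyy)` pattern and ended up with a missing year. The sketch below reproduces that behavior on a few hypothetical titles, and shows that a single capture group can extract the digits in one step instead of two. It is an alternative way to write the extraction, not the code used above.

```python
import pandas as pd

# Hypothetical titles; the second one has no "(yyyy)" suffix.
titles = pd.Series([
    "Toy Story (1995)",
    "Hypothetical Untitled Film",
    "Another Film (2001)",
])

# The inner group (\d{4}) captures only the digits, without the parentheses.
years = titles.str.extract(r"\((\d{4})\)", expand=False)

# Drop the "(yyyy)" part together with the whitespace in front of it.
cleaned = titles.str.replace(r"\s*\(\d{4}\)", "", regex=True).str.strip()
```

Titles without a year keep their text unchanged in `cleaned` and get a missing value in `years`, which is worth handling before using `year` as a numeric feature.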


In [53]:
# Read in the user ratings for the dataset
ratings = pd.read_csv('./ml-latest-small/ratings.csv')
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


### 2.1 Popularity-based recommender system

The first approach is based simply on popularity: count the number of votes from users and select the top movies.

In [56]:
# Top 10 most-voted movies
top_10_voted = ratings.groupby(by=['movieId'], as_index=False)['rating'].count().nlargest(10, 'rating').rename(columns={'rating':'votes'})
# Merge with movie info dataset to get information
top_10_voted = top_10_voted.merge(movies_info, on='movieId', how='left')
top_10_voted

|   | movieId | votes | title | genres | year | imdbId | tmdbId |
|:-- |:-- |:-- |:-- |:-- |:-- |:-- |:-- |
| 0 | 356 | 329 | Forrest Gump | [Comedy, Drama, Romance, War] | 1994 | 109830 | 13.0 |
| 1 | 318 | 317 | Shawshank Redemption, The | [Crime, Drama] | 1994 | 111161 | 278.0 |
| 2 | 296 | 307 | Pulp Fiction | [Comedy, Crime, Drama, Thriller] | 1994 | 110912 | 680.0 |
| 3 | 593 | 279 | Silence of the Lambs, The | [Crime, Horror, Thriller] | 1991 | 102926 | 274.0 |
| 4 | 2571 | 278 | Matrix, The | [Action, Sci-Fi, Thriller] | 1999 | 133093 | 603.0 |
| 5 | 260 | 251 | Star Wars: Episode IV - A New Hope | [Action, Adventure, Sci-Fi] | 1977 | 76759 | 11.0 |
| 6 | 480 | 238 | Jurassic Park | [Action, Adventure, Sci-Fi, Thriller] | 1993 | 107290 | 329.0 |
| 7 | 110 | 237 | Braveheart | [Action, Drama, War] | 1995 | 112573 | 197.0 |
| 8 | 589 | 224 | Terminator 2: Judgment Day | [Action, Sci-Fi] | 1991 | 103064 | 280.0 |
| 9 | 527 | 220 | Schindler's List | [Drama, War] | 1993 | 108052 | 424.0 |


The first approach does not take ratings into consideration. A more sophisticated approach combines a votes score and a rating score in a weighted average, and selects the top movies by that combined score.

In [112]:
# First, count the votes and average the ratings per movie, then scale the vote counts to 0-5.
votes_avg_ratings = ratings.groupby(by='movieId', as_index=False).agg({'userId': 'count', 'rating': 'mean'}).rename(columns={'userId': 'votes', 'rating': 'avg_rating'})
votes_avg_ratings['scaled_votes'] = (votes_avg_ratings['votes'] - votes_avg_ratings['votes'].min())/(votes_avg_ratings['votes'].max() - votes_avg_ratings['votes'].min()) * 5
# Then, Calculate weighted rating
votes_avg_ratings['weighted_rating'] = 0.8 * votes_avg_ratings['scaled_votes'] + 0.2 * votes_avg_ratings['avg_rating']
top_10_movies = votes_avg_ratings.nlargest(10, 'weighted_rating')

# Merge with movie info dataset to get information
top_10_movies = top_10_movies.merge(movies_info, on='movieId', how='left')
top_10_movies

|   | movieId | votes | avg_rating | scaled_votes | weighted_rating | title | genres | year | imdbId | tmdbId |
|:-- |:-- |:-- |:-- |:-- |:-- |:-- |:-- |:-- |:-- |:-- |
| 0 | 356 | 329 | 4.164134 | 5.0 | 4.832827 | Forrest Gump | [Comedy, Drama, Romance, War] | 1994 | 109830 | 13.0 |
| 1 | 318 | 317 | 4.429022 | 4.817073 | 4.739463 | Shawshank Redemption, The | [Crime, Drama] | 1994 | 111161 | 278.0 |
| 2 | 296 | 307 | 4.197068 | 4.664634 | 4.571121 | Pulp Fiction | [Comedy, Crime, Drama, Thriller] | 1994 | 110912 | 680.0 |
| 3 | 593 | 279 | 4.16129 | 4.237805 | 4.222502 | Silence of the Lambs, The | [Crime, Horror, Thriller] | 1991 | 102926 | 274.0 |
| 4 | 2571 | 278 | 4.192446 | 4.222561 | 4.216538 | Matrix, The | [Action, Sci-Fi, Thriller] | 1999 | 133093 | 603.0 |
| 5 | 260 | 251 | 4.231076 | 3.810976 | 3.894996 | Star Wars: Episode IV - A New Hope | [Action, Adventure, Sci-Fi] | 1977 | 76759 | 11.0 |
| 6 | 110 | 237 | 4.031646 | 3.597561 | 3.684378 | Braveheart | [Action, Drama, War] | 1995 | 112573 | 197.0 |
| 7 | 480 | 238 | 3.75 | 3.612805 | 3.640244 | Jurassic Park | [Action, Adventure, Sci-Fi, Thriller] | 1993 | 107290 | 329.0 |
| 8 | 527 | 220 | 4.225 | 3.338415 | 3.515732 | Schindler's List | [Drama, War] | 1993 | 108052 | 424.0 |
| 9 | 589 | 224 | 3.970982 | 3.39939 | 3.513709 | Terminator 2: Judgment Day | [Action, Sci-Fi] | 1991 | 103064 | 280.0 |
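
As a quick check of the formula, the row for "Shawshank Redemption, The" can be reproduced by hand. The maximum vote count is 329 (Forrest Gump, scaled to 5.0); the minimum of 1 is not printed above and is inferred here from the scaled values in the table.

$$\text{scaled votes} = \frac{317 - 1}{329 - 1} \times 5 \approx 4.817073$$

$$\text{weighted rating} = 0.8 \times 4.817073 + 0.2 \times 4.429022 \approx 4.739463$$

Because the vote score carries a weight of 0.8, the ranking stays close to the pure vote count. The average rating only changes the order between movies with nearly equal votes: Braveheart (237 votes, average 4.03) moves ahead of Jurassic Park (238 votes, average 3.75), and Schindler's List (220 votes, average 4.23) moves ahead of Terminator 2: Judgment Day (224 votes, average 3.97).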


In [101]:
votes_avg_ratings['votes'].quantile(q=[0, 0.5, 0.625, 0.75, 0.875, 1]).to_list()

## References
1. Recommender Systems in Python beginner tutorial, [DataCamp](https://www.datacamp.com/tutorial/recommender-systems-python)
2. A Complete Guide To Recommender Systems — Tutorial with Sklearn, Surprise, Keras, Recommenders, [Medium](https://towardsdatascience.com/a-complete-guide-to-recommender-system-tutorial-with-sklearn-surprise-keras-recommender-5e52e8ceace1)