Foundations

Metrics modified by popularity: It is hard to say whether a movie rated 9 by 100,000 users should rank below another movie rated 9.5 by 1,000 users. A weighted rating can be applied:
$$\text{WR} = \left(\frac{v}{v+m} \times R\right) + \left(\frac{m}{v+m} \times C\right),$$
where $v$ is the number of votes for the specific movie, $m$ is the minimum number of votes required for a movie to appear in the ranking, $R$ is the mean rating of the specific movie, and $C$ is the mean rating of all the movies in the dataset. For a given movie, fewer votes pull the weighted rating closer to the dataset-wide average $C$, while more votes move it toward the movie's own mean rating $R$, so the score reflects both the movie's quality and its popularity with the general populace.

$m$ is selected subjectively (e.g. the 80th percentile of vote counts, meaning a movie must have collected more votes than at least $80\%$ of the movies in the dataset). The higher the value of $m$, the greater the emphasis on the popularity of a movie.
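As a rough illustration of the weighted rating, the sketch below scores a toy table with pandas. The column names `votes` and `rating`, the helper `weighted_rating`, the toy data, and the inclusive cutoff (`>= m`) are all assumptions made for this example, not part of any dataset used in these notes.

```python
import pandas as pd

def weighted_rating(df, m=None, quantile=0.80):
    """Score movies with WR = v/(v+m) * R + m/(v+m) * C, highest first.

    Assumes hypothetical columns 'votes' (v) and 'rating' (R).
    """
    C = df["rating"].mean()                  # mean rating over all movies
    if m is None:
        m = df["votes"].quantile(quantile)   # vote-count cutoff, e.g. 80th percentile
    out = df[df["votes"] >= m].copy()        # keep movies with at least m votes (assumed inclusive)
    v, R = out["votes"], out["rating"]
    out["wr"] = v / (v + m) * R + m / (v + m) * C
    return out.sort_values("wr", ascending=False).reset_index(drop=True)

# Toy data: A is rated 9 by many users, B is rated 9.5 by few users
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "votes": [100000, 1000, 50, 20, 10],
    "rating": [9.0, 9.5, 6.0, 7.0, 5.0],
})
low_m = weighted_rating(toy, m=50)
high_m = weighted_rating(toy, m=1000)
print(low_m[["title", "wr"]])
print(high_m[["title", "wr"]])
```

With $m = 50$ movie B ranks above movie A. Raising $m$ to 1000 pulls B toward the dataset mean and A overtakes it, which is the sense in which a larger $m$ emphasizes popularity.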

Consider imposing other restrictions, such as duration.

Knowledge-based recommenders

Ask the user for the features of the movies they are looking for (e.g. genres, range of duration, release date, ...). Filter the dataset with the information collected. Recalculate $m$ and $C$ on the filtered data. Finally, recommend the movies with the highest weighted ratings to the user.
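The filter-then-rank steps could be sketched as follows. The function `knowledge_based`, the column names (`genres`, `runtime`, `vote_count`, `vote_average`), and the toy table are hypothetical and chosen only for this example; the user's answers are passed in as plain arguments rather than collected interactively.

```python
import pandas as pd

def knowledge_based(df, genre, min_runtime, max_runtime, quantile=0.80, top_n=10):
    """Filter by the user's preferences, then rank by weighted rating."""
    # 1. Keep movies that match the requested genre and duration range
    mask = (
        df["genres"].apply(lambda g: genre in g)
        & df["runtime"].between(min_runtime, max_runtime)
    )
    subset = df[mask]
    # 2. Recompute m and C on the filtered subset only
    m = subset["vote_count"].quantile(quantile)
    C = subset["vote_average"].mean()
    # 3. Score the qualifying movies and return the best ones
    subset = subset[subset["vote_count"] >= m]
    v, R = subset["vote_count"], subset["vote_average"]
    subset = subset.assign(wr=v / (v + m) * R + m / (v + m) * C)
    return subset.sort_values("wr", ascending=False).head(top_n)

# Toy data with made-up titles
movies_toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "genres": [["action"], ["action", "comedy"], ["comedy"], ["action"], ["action"]],
    "runtime": [120, 95, 100, 200, 110],
    "vote_count": [500, 50, 1000, 2000, 300],
    "vote_average": [8.0, 9.0, 7.0, 8.5, 6.0],
})
picks = knowledge_based(movies_toy, "action", 90, 150, quantile=0.5)
print(picks[["title", "wr"]])
```

Here D is excluded by the duration filter and C by the genre filter, so $m$ and $C$ are computed from A, B, and E only.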

Content-based recommenders: ask users for a few favorite movies and recommend results that are similar to those movies. 

Two vectorization methods:
1. CountVectorizer

Collect the unique words across all the documents we are interested in, ignoring extremely common words such as a, the, is, had, my, etc. Represent each document as a numerical vector in which each element is the number of times the corresponding unique word occurs in that document. (Very similar to creating dummy variables.)
2. TF-IDFVectorizer

For every word $i$ in document $j$, apply the following equation to get the weight of word $i$ in document $j$:
$$w_{i, j} = tf_{i, j} \times \log\left(\frac{N}{df_i}\right),$$
where $tf_{i, j}$ is the number of occurrences of word $i$ in document $j$, $df_i$ is the number of documents that contain word $i$, and $N$ is the total number of documents.

So the weight of a word in a document is greater if it occurs more frequently in that document and is present in fewer other documents.
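To make the two vectorizations concrete, the sketch below builds the count vectors with scikit-learn's `CountVectorizer` and then applies the formula above by hand with NumPy. The three documents are made up for illustration. Note that scikit-learn's own `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this hand computation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration
docs = [
    "the space pirate fights the space army",
    "the pirate hides the treasure",
    "the army hunts the spy",
]

# Count vectors; stop_words="english" drops very common words such as "the"
cv = CountVectorizer(stop_words="english")
tf = cv.fit_transform(docs).toarray()   # tf[j, i]: count of word i in document j

N = tf.shape[0]                         # total number of documents
df_i = (tf > 0).sum(axis=0)             # number of documents containing each word
weights = tf * np.log(N / df_i)         # weights[j, i] = tf * log(N / df_i)

col = cv.vocabulary_                    # maps each word to its column index
print(weights[0, col["space"]])         # occurs twice, in one document only
print(weights[0, col["pirate"]])        # occurs once, in two documents
```

In the first document, "space" gets a larger weight than "pirate" because it occurs more often there and appears in no other document.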


Similarity measure
1. The cosine similarity score (particularly useful when applied in conjunction with TF-IDFVectorizer)

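Given two documents represented as vectors $x$ and $y$, $\cos(x, y) = \frac{x^Ty}{||x||_2 ||y||_2}$ measures the similarity between them. The score lies in $[-1, 1]$: the closer it is to $1$, the more similar the documents are, while a score near $-1$ indicates vectors pointing in opposite directions. Count and TF-IDF vectors have no negative entries, so for them the score lies in $[0, 1]$.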
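A minimal NumPy sketch of the cosine score follows. The function name `cosine_score` and the choice to return 0 for an all-zero vector are assumptions of this example; scikit-learn's `cosine_similarity` computes the same quantity for whole matrices.

```python
import numpy as np

def cosine_score(x, y):
    """Cosine of the angle between two vectors: x.y / (||x|| * ||y||)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0:
        return 0.0          # assumed convention for an all-zero vector
    return float(x @ y / denom)

same_direction = cosine_score([1, 2, 0], [2, 4, 0])   # parallel vectors
orthogonal = cosine_score([1, 0], [0, 1])             # no words in common
opposite = cosine_score([1, 0], [-1, 0])              # opposite directions
print(same_direction, orthogonal, opposite)
```

The score ignores vector length: doubling every count in a document leaves its similarity to other documents unchanged.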


One commonly used submodel: Metadata-based recommender

Combine several features, such as genre, director, the three main stars, keywords, etc. Since there are several features to work with, we create a "soup" that contains all of them (most are strings or lists of strings). The soup can then be fed into the selected vectorizer to calculate similarity scores.

In this model, it is better to use CountVectorizer, because TF-IDFVectorizer would give less weight to actors and directors who have acted in or directed a relatively large number of movies, which is not desirable. The similarity contributed by the actors' and directors' names is important in this case.
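The soup idea could look roughly like the sketch below. The titles, people, and column names are invented for illustration, and the helpers `clean`, `make_soup`, and `recommend` are hypothetical. Names are lowercased and stripped of spaces so that each person becomes a single token.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented metadata for three movies
meta = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "keywords": [["heist", "robot"], ["robot", "war"], ["wedding"]],
    "cast": [["Ann Lee", "Bo Chan", "Cy Diaz"],
             ["Ann Lee", "Eve Fox", "Gus Hall"],
             ["Ivy Jones", "Kim Long", "Max Nunn"]],
    "director": ["Dana Roe", "Dana Roe", "Pat Quinn"],
    "genres": [["Action", "Sci-Fi"], ["Action"], ["Romance"]],
})

def clean(name):
    # "Ann Lee" -> "annlee", so a person is one token
    return name.lower().replace(" ", "")

def make_soup(row):
    parts = (
        row["keywords"]
        + [clean(c) for c in row["cast"][:3]]   # three main stars
        + [clean(row["director"])]
        + [g.lower() for g in row["genres"]]
    )
    return " ".join(parts)

meta["soup"] = meta.apply(make_soup, axis=1)
counts = CountVectorizer().fit_transform(meta["soup"])
sim = cosine_similarity(counts)                 # sim[i, j]: similarity of movies i and j

def recommend(title, top_n=2):
    idx = meta.index[meta["title"] == title][0]
    order = [i for i in sim[idx].argsort()[::-1] if i != idx]   # most similar first
    return meta["title"].iloc[order[:top_n]].tolist()

print(recommend("Movie A", top_n=1))
```

Movie A and Movie B share a director, a star, a keyword, and a genre, so they score as similar, while Movie C shares nothing with either.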


Improvements? See reference.

Collaborative-filtering algorithms

Collaborative filtering requires data on user behavior. We can download the MovieLens 100k dataset (https://www.kaggle.com/prajitdatta/movielens-100k-dataset), which contains ratings and demographic information for about 1000 users. We can gather the user ids, movie ids, demographic information per user, and the rating per movie and user into one dataframe.


1. User-based collaborative filtering: find users similar to a particular user and then recommend products that those users have liked to the first user.

First represent the ratings by a matrix where each row represents a user and each column represents a movie. Then the following models are worth trying:

(a). Baseline model: outputs the mean rating for the movie by all the users who have rated it. If a movie appears only in the test dataset, default to $3.0$.

(b). Weighted mean: give more weight to the users whose ratings are similar to those of the user we are predicting for, using the following formula:
$$r_{u,m} = \frac{\sum_{u' \neq u} \text{sim}(u, u') r_{u', m}}{\sum_{u' \neq u}  \text{sim}(u, u')},$$
where $r_{u,m}$ represents the rating given by user $u$ to movie $m$, $\text{sim}(\cdot, \cdot)$ is the similarity measure, and the sums run over the other users $u'$ who have rated movie $m$.
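One possible implementation of the weighted mean is sketched below on a toy rating matrix. It assumes cosine similarity between zero-imputed user rows as the similarity measure, sums only over users who have rated the movie, and falls back to $3.0$ when no prediction can be made. The function name `cf_user_wmean` and the toy ratings are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy rating matrix: rows are users, columns are movies, NaN means "not rated"
r_matrix = pd.DataFrame(
    [[5, 4, np.nan],
     [5, 4, 5],
     [1, 2, 1],
     [np.nan, 1, 5]],
    index=pd.Index([1, 2, 3, 4], name="user_id"),
    columns=pd.Index([10, 20, 30], name="movie_id"),
)

# User-user cosine similarity on the zero-imputed matrix (an assumed choice)
user_sim = pd.DataFrame(
    cosine_similarity(r_matrix.fillna(0)),
    index=r_matrix.index, columns=r_matrix.index,
)

def cf_user_wmean(user_id, movie_id, default=3.0):
    """Similarity-weighted mean of the other users' ratings for movie_id."""
    if movie_id not in r_matrix.columns or user_id not in r_matrix.index:
        return default
    # Ratings of this movie by the other users who actually rated it
    others = r_matrix[movie_id].drop(user_id).dropna()
    weights = user_sim.loc[user_id, others.index]
    if weights.sum() <= 0:
        return default
    return float((weights * others).sum() / weights.sum())

pred = cf_user_wmean(1, 30)
print(pred)
```

A function with this `(user_id, movie_id)` signature can be passed to an evaluation helper in the same way as the baseline model.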

The above two models did not include demographic information: How about we include some demographic features when computing the similarity measure and then proceed to use the same formula?

2. Item-based collaborative filtering

Compute the pairwise similarity between items in the inventory (e.g. movies) instead of between users. This amounts to transposing the rating matrix and applying the user-based method.
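The transposition can be sketched as follows, again on a made-up rating matrix and assuming cosine similarity on zero-imputed ratings. The prediction for a user and movie is then a weighted mean of that user's own ratings, weighted by how similar each rated movie is to the target movie. The name `cf_item_wmean` and the $3.0$ fallback are assumptions of this example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy rating matrix: rows are users, columns are movies, NaN means "not rated"
r_matrix = pd.DataFrame(
    [[5, 4, np.nan],
     [5, 4, 5],
     [1, 2, 1],
     [np.nan, 1, 5]],
    index=pd.Index([1, 2, 3, 4], name="user_id"),
    columns=pd.Index([10, 20, 30], name="movie_id"),
)

# Transpose so that each row is a movie, then compare movies pairwise
item_sim = pd.DataFrame(
    cosine_similarity(r_matrix.fillna(0).T),
    index=r_matrix.columns, columns=r_matrix.columns,
)

def cf_item_wmean(user_id, movie_id, default=3.0):
    """Weighted mean of the user's own ratings, weighted by item similarity."""
    if movie_id not in r_matrix.columns or user_id not in r_matrix.index:
        return default
    # The other movies this user has rated
    rated = r_matrix.loc[user_id].drop(movie_id).dropna()
    weights = item_sim.loc[movie_id, rated.index]
    if weights.sum() <= 0:
        return default
    return float((weights * rated).sum() / weights.sum())

item_pred = cf_item_wmean(1, 30)
print(item_pred)
```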

3. Model-based approaches (machine learning)
Clustering, PCA, XGBoost, neural networks, etc.

In [109]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [120]:
## Data manipulation for user-based method
# Import the three dataframes
colnames_user = ['UserID', 'Age', 'Sex', 'Occupation', 'Zip_code']
users = pd.read_csv('~/Desktop/Python_project/STAT_535/Recosys/ml-100k/u.user', sep = '|', names = colnames_user, encoding = 'latin-1')

colnames_item = ['movie_id', 'title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 
                 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 
                 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

movies = pd.read_csv('~/Desktop/Python_project/STAT_535/Recosys/ml-100k/u.item',sep = '|', names = colnames_item, encoding = 'latin-1')
movies_title = movies[['movie_id', 'title']]

colnames_rate = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('~/Desktop/Python_project/STAT_535/Recosys/ml-100k/u.data',sep = '\t', names = colnames_rate, encoding = 'latin-1')
ratings = ratings.drop('timestamp', axis=1)

# Train-test split: keep all movie_ids
X = ratings.copy()
link_index_for_y = X['user_id'] 
# Note: the true response (y) is the rating. Here user_id is passed as y only so that the split
# can be stratified by user; the ratings stay in X and are read from the test split later.
X_train,X_test,y_train,y_test = train_test_split(X, link_index_for_y, 
                                                 test_size = 0.25, 
                                                 stratify = link_index_for_y,
                                                 random_state = 535)

# Train-test split: keep only the movies rated by more than 100 users
threshold = 100
num_rating = ratings.groupby('movie_id')['user_id'].count()
qualify = num_rating[num_rating > threshold].index.tolist()
X_filter = ratings.copy()
X_filter = X_filter[X_filter.movie_id.isin(qualify)]
link_index_for_y = X_filter['user_id'] 
X_train_filter,X_test_filter,y_train_filter,y_test_filter = train_test_split(X_filter, link_index_for_y, 
                                                                             test_size = 0.25, 
                                                                             stratify = link_index_for_y,
                                                                             random_state = 535)

# Create the rating matrix, ignoring the demographic information and the features of the movies.
r_matrix = X_train_filter.pivot_table(values = 'rating', index='user_id', columns='movie_id')
r_matrix_imputed = r_matrix.copy().fillna(0)

In [110]:
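# Evaluation function: RMSE of a model's predictions on the filtered test set
def score(model_name):
    # model_name is a function mapping (user_id, movie_id) to a predicted rating
    id_pairs = zip(X_test_filter['user_id'], X_test_filter['movie_id'])
    y_pred = np.array([model_name(user_id, movie_id) for (user_id, movie_id) in id_pairs])
    y_true = np.array(X_test_filter['rating'])
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Baseline model: the mean rating of the movie in the training set
def baseline(user_id, movie_id):
    # r_matrix has one column per movie that appears in the training set
    if movie_id in r_matrix.columns:
        rating = r_matrix[movie_id].mean()
    else:
        rating = 3 # default for movies that appear only in the test set
    return rating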
score(baseline)

0.9869108700565985

In [139]:
# Data for XGBoost model
merge_trn = pd.merge(left = X_train_filter, right = movies, on = "movie_id")
xg_data_trn = pd.merge(left = merge_trn, right = users, on = None, left_on = "user_id", right_on = "UserID")
xg_data_trn = xg_data_trn.drop(['video release date', 'IMDb URL'], axis = 1)

merge_tst = pd.merge(left = X_test_filter, right = movies, on = "movie_id")
xg_data_tst = pd.merge(left = merge_tst, right = users, on = None, left_on = "user_id", right_on = "UserID")
xg_data_tst = xg_data_tst.drop(['video release date', 'IMDb URL'], axis = 1)

In [142]:
display(xg_data_trn)

Unnamed: 0,user_id,movie_id,rating,title,release date,unknown,Action,Adventure,Animation,Children's,...,Romance,Sci-Fi,Thriller,War,Western,UserID,Age,Sex,Occupation,Zip_code
0,566,82,4,Jurassic Park (1993),01-Jan-1993,0,1,1,0,0,...,0,1,0,0,0,566,20,M,student,14627
1,566,88,3,Sleepless in Seattle (1993),01-Jan-1993,0,0,0,0,0,...,1,0,0,0,0,566,20,M,student,14627
2,566,288,3,Scream (1996),20-Dec-1996,0,0,0,0,0,...,0,0,1,0,0,566,20,M,student,14627
3,566,96,3,Terminator 2: Judgment Day (1991),01-Jan-1991,0,1,0,0,0,...,0,1,1,0,0,566,20,M,student,14627
4,566,127,5,"Godfather, The (1972)",01-Jan-1972,0,1,0,0,0,...,0,0,0,0,0,566,20,M,student,14627
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48309,242,268,5,Chasing Amy (1997),01-Jan-1997,0,0,0,0,0,...,1,0,0,0,0,242,33,M,educator,31404
48310,242,275,5,Sense and Sensibility (1995),01-Jan-1995,0,0,0,0,0,...,1,0,0,0,0,242,33,M,educator,31404
48311,242,283,4,Emma (1996),02-Aug-1996,0,0,0,0,0,...,1,0,0,0,0,242,33,M,educator,31404
48312,242,111,4,"Truth About Cats & Dogs, The (1996)",26-Apr-1996,0,0,0,0,0,...,1,0,0,0,0,242,33,M,educator,31404
