In [1]:
# EXECUTE FIRST

# computational imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Reader, Dataset, KNNBasic, SVD

from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# plotting imports
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
# for reading files from urls
import urllib.request
# display imports
from IPython.display import display, IFrame, HTML

# import notebook styling for tables and width etc.
response = urllib.request.urlopen('https://raw.githubusercontent.com/DataScienceUWL/DS775v2/master/ds755.css')
HTML(response.read().decode("utf-8"));

<font size=18>Lesson 11: Recommender Systems 2</font>

# User-Based Collaborative Filter

A user-based collaborative filter is one in which the preferences of users are used to identify suitable recommendations. For example, if Jaunita and Jeff largely like the same movies, but Jaunita liked Toy Story and Jeff hasn't seen it yet, then Toy Story might be a suitable recommendation for Jeff.
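As a rough illustration of this idea (not code from Banik's book), the sketch below uses hypothetical sets of liked movies for the two users. The names `liked`, `overlap`, and `candidates` are made up for this example, and the overlap measure is just one simple way to express "largely like the same movies."

```python
# Hypothetical "liked movies" sets for the two users in the example above
liked = {
    "Jaunita": {"Toy Story", "Titanic", "Gone with the Wind"},
    "Jeff": {"Titanic", "Gone with the Wind"},
}

# Share of all liked movies that both users have in common (Jaccard overlap)
both = liked["Jaunita"] & liked["Jeff"]
either = liked["Jaunita"] | liked["Jeff"]
overlap = len(both) / len(either)

# Movies Jaunita liked that Jeff hasn't rated yet are candidate recommendations
candidates = sorted(liked["Jaunita"] - liked["Jeff"])

print(f"overlap = {overlap:.2f}, candidates for Jeff = {candidates}")
```

The rest of this lesson replaces the crude set overlap with similarity scores computed from actual ratings.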

Be sure to carefully read Chapter 5 before starting this lesson. 

## Set Up

### Defining Data
In Chapter 6, Banik uses the MovieLens dataset to explore collaborative filtering. We're going to use what's called a "toy" dataset, which is just a very small dataset. This makes it easier to see what's happening at each step, though our predictions will be worse because we have much less data to go on.


In [2]:
# import pandas as pd
# import numpy as np

#load the information about users
users = pd.DataFrame({'user_id': [1,2,3,4,5],
                     'age': [24,53,23,20,55],
                     'sex': ['M','F','M','F','M'],
                     'occupation': ['technician', 'writer','teacher','technician','teacher'],
                     'zip_code': ['90210', '53704', '53706','53704','90210']})

display(users.head())

movies = pd.DataFrame({'movie_id': [1,2,3,4,5],
                      'title':['Toy Story','Titanic','Star Wars: The Clone Wars', 'Gone with the Wind', 'Sharknado']})


display(movies.head())

#generate a rating for each user/movie combination
ratings = pd.DataFrame(np.array(np.meshgrid([1, 2, 3,4,5], [1,2,3,4,5])).T.reshape(-1,2), columns=['user_id', 'movie_id'])
np.random.seed(1)
randratings = np.random.randint(1,6, ratings.shape[0])

ratings['rating'] = randratings

#we have 5 * 5 or 25 rows of data in the ratings, but we'll just look at the first 10
ratings.head(10)

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,90210
1,2,53,F,writer,53704
2,3,23,M,teacher,53706
3,4,20,F,technician,53704
4,5,55,M,teacher,90210


Unnamed: 0,movie_id,title
0,1,Toy Story
1,2,Titanic
2,3,Star Wars: The Clone Wars
3,4,Gone with the Wind
4,5,Sharknado


Unnamed: 0,user_id,movie_id,rating
0,1,1,4
1,1,2,5
2,1,3,1
3,1,4,2
4,1,5,4
5,2,1,1
6,2,2,1
7,2,3,2
8,2,4,5
9,2,5,5


With the data loaded, our job is to predict the rating, given a user and a movie. We will treat this as a regression problem, even though the ratings could be considered categorical data (discrete values from 1 to 5), because the ratings are ordered: a 4 is closer to a 5 than a 1 is. A classification model treats the labels as unordered and wouldn't capture that nuance.
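To make the regression-versus-classification point concrete, here is a small sketch with made-up numbers. It compares a squared-error penalty with a 0/1 "right or wrong" penalty for two incorrect predictions of a true rating of 5.

```python
# A user actually rated a movie 5; compare two wrong predictions
true_rating = 5
predictions = {"close guess": 4, "far guess": 1}

# Regression-style penalty: grows with the distance from the true rating
squared_errors = {name: (true_rating - pred) ** 2 for name, pred in predictions.items()}

# Classification-style penalty: 1 for any wrong label, no matter how far off
zero_one_losses = {name: int(pred != true_rating) for name, pred in predictions.items()}

for name in predictions:
    print(f"{name}: squared error = {squared_errors[name]}, 0/1 loss = {zero_one_losses[name]}")
```

The squared error separates the two guesses, while the 0/1 loss scores them identically.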

Let's split the data into train and test sets. Banik uses a hack here to stratify on the user. Stratifying on the user ensures that we have some of each user's ratings in both the train and the test set. 

In [3]:
#Import the train_test_split function
# from sklearn.model_selection import train_test_split

#Assign X as the original ratings dataframe and y as the user_id column of ratings.
X = ratings.copy()
y = ratings['user_id']

#Split into training and test datasets, stratified along user_id
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify=y, random_state=42)

Since we have such a small dataset, we can explore what's in our training and test data. You can see that every user is in both the training and test data, though not in equal measure.


In [4]:
#compare X_train to X_test
display(X_train)
display(X_test)

Unnamed: 0,user_id,movie_id,rating
1,1,2,5
20,5,1,5
22,5,3,2
8,2,4,5
5,2,1,1
17,4,3,3
24,5,5,2
10,3,1,2
6,2,2,1
9,2,5,5


Unnamed: 0,user_id,movie_id,rating
16,4,2,5
7,2,3,2
19,4,5,3
21,5,2,2
14,3,5,5
4,1,5,4
0,1,1,4


The variables y_train and y_test won't actually be used in our code. They're just used as a way to stratify the data. Typically you'd see y as the variable you're trying to predict. That's not how we're doing it here, since our X_train and X_test data are actually dataframes that contain both what we're using to make predictions (user_id and movie_id combination) and what we're predicting (rating). (It's a bit weird. We know.)
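The effect of stratifying on user_id can also be checked with counts rather than by eye. The sketch below rebuilds a 5 x 5 grid of user/movie pairs (the names `toy`, `toy_train`, `toy_test`, and `split_counts` are ours) and counts how many rows each user contributes to each side of the split.

```python
from itertools import product

import pandas as pd
from sklearn.model_selection import train_test_split

# 5 users x 5 movies, one row per pair (ratings omitted; only the split matters here)
toy = pd.DataFrame(list(product(range(1, 6), range(1, 6))), columns=["user_id", "movie_id"])

# Same settings as the split above, stratified on user_id
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, stratify=toy["user_id"], random_state=42
)

# Count how many rows each user contributes to each side of the split
split_counts = pd.DataFrame({
    "train": toy_train["user_id"].value_counts(),
    "test": toy_test["user_id"].value_counts(),
}).sort_index()
print(split_counts)
```

With 25 rows and a 25% test size, the split has 18 training rows and 7 test rows, so the per-user counts cannot all be equal.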

### RMSE Metric

Our metric for evaluation will be the Root Mean Squared Error (RMSE). Banik builds a wrapper function around scikit-learn's mean_squared_error function, but that's unnecessary as of scikit-learn version 0.22.1: passing squared=False tells the function to return the RMSE instead of the MSE.

In [5]:
#Import the mean_squared_error function
# from sklearn.metrics import mean_squared_error

#test data
test_y_true = [3, -0.5, 2, 7]
test_y_pred = [2.5, 0.0, 2, 8]

#this returns MSE (not what we want)
print(mean_squared_error(test_y_true, test_y_pred))

#this returns the root mean squared error (and is what we want to use)
mean_squared_error(test_y_true, test_y_pred, squared=False)

0.375


0.6123724356957945
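In newer scikit-learn releases the `squared` argument has, to our knowledge, been deprecated (starting with 1.4) in favor of a separate `root_mean_squared_error` function, so the call above may warn or fail depending on your installed version. A version-independent sketch is to compute the RMSE directly with NumPy. The helper name `rmse` is our own.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error computed directly, independent of scikit-learn's version."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Same test data as in the cell above
result = rmse([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
print(result)
```

This should agree with the second value printed above, since it is the square root of the same mean squared error.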

Let's define a baseline model. All our models will take in a user_id and a movie_id. The baseline model always returns the MEDIAN of our possible ratings. In other words, our baseline model is trying to be as noncommittal as possible.

We're going to alter Banik's function so that it also accepts optional arguments. We don't need any for this function, but later we will need additional arguments and this keeps our coding consistent.

In [6]:
#first determine the median of our ratings (we could have done this by hand, but numpy does it so well... )
print(f"The median of this rating range is {np.median(np.arange(np.min(ratings['rating']), (np.max(ratings['rating']) + 1)))}")

#define a baseline model to always return the median
def baseline(user_id, movie_id, *args):
    return 3.0

The median of this rating range is 3.0


Next we need a way to score our model.

Here's where we diverge from Banik's approach just a bit. Instead of relying on global variables, we will explicitly pass in our data for our scoring model. Note we're again using the special parameter \*args. This tells our scoring function to accept any optional arguments we might need, and we'll pass those right along to our model.
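If \*args is new to you, the following sketch shows the forwarding pattern in isolation. The function names (`constant_model`, `lookup_model`, `predict_all`) are hypothetical and exist only for this illustration.

```python
def constant_model(user_id, movie_id, *args):
    # Ignores any extra arguments, like the baseline model
    return 3.0

def lookup_model(user_id, movie_id, known_ratings):
    # Needs one extra argument: a dict of known (user, movie) -> rating values
    return known_ratings.get((user_id, movie_id), 3.0)

def predict_all(model, id_pairs, *args):
    # Whatever extra positional arguments arrive here are forwarded unchanged
    return [model(user, movie, *args) for user, movie in id_pairs]

pairs = [(1, 1), (2, 3)]
known = {(1, 1): 4.0}

baseline_preds = predict_all(constant_model, pairs)      # no extra arguments
lookup_preds = predict_all(lookup_model, pairs, known)   # one extra argument
print(baseline_preds, lookup_preds)
```

The same caller works for both models because it never needs to know how many extra arguments a model expects.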

We'll also use sklearn's built-in RMSE function.

In [7]:
#Function to compute the RMSE score obtained on the testing set by a model
def score(cf_model, X_test, *args):
    
    #Construct a list of user-movie tuples from the testing dataset
    id_pairs = zip(X_test['user_id'], X_test['movie_id'])
    
    #Predict the rating for every user-movie tuple
    y_pred = np.array([cf_model(user, movie, *args) for (user, movie) in id_pairs])
    
    #Extract the actual ratings given by the users in the test data
    y_true = np.array(X_test['rating'])
    
    #Return the final RMSE score
    return mean_squared_error(y_true, y_pred, squared=False)
                              
#let's test it with our baseline model
score(baseline, X_test)

1.3093073414159542

## Basic Models


Everything we've done so far is just setting us up to be able to use something more than our baseline model to do some real user-based collaborative filtering. Now let's try out some basic approaches and compare them to our baseline model.

Before we can start, though, we need to do yet more data wrangling. We need a matrix that has movies as columns and users as rows, with each user's rating for that movie at the intersection. Note that although we know that every user has rated every movie, we don't have all that data in our training set, so we still end up with some NaN values.

In [8]:
#Build the ratings matrix using pivot_table function
r_matrix = X_train.pivot_table(values='rating', index='user_id', columns='movie_id')

r_matrix.head()

movie_id,1,2,3,4,5
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,,5.0,1.0,2.0,
2,1.0,1.0,,5.0,5.0
3,2.0,3.0,5.0,3.0,
4,4.0,,3.0,5.0,
5,5.0,,2.0,1.0,2.0


### Mean

Note that our mean function requires the ratings_matrix argument. Here's where that \*args parameter comes in. We can pass r_matrix to our score function and it gets passed along to our cf_user_mean model.

In [9]:
#User Based Collaborative Filter using Mean Ratings
def cf_user_mean(user_id, movie_id, ratings_matrix):
    
    #Check if movie_id exists in r_matrix (rm)
    if movie_id in ratings_matrix:
        #Compute the mean of all the ratings given to the movie
        mean_rating = ratings_matrix[movie_id].mean()
    
    else:
        #Default to a rating of 3.0 in the absence of any information
        mean_rating = 3.0
    
    return mean_rating

score(cf_user_mean, X_test, r_matrix)

1.153411090139653

### Weighted Mean
The weighted mean gives more weight to ratings from users who are more similar to the user we're predicting for. We'll measure similarity using cosine similarity.
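Before building the full similarity matrix, it may help to see the arithmetic on a tiny hypothetical case. The sketch below computes cosine similarity by hand for one target user and two neighbors, then compares the plain mean of the neighbors' ratings with the similarity-weighted mean. All ratings here are made up.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two rating vectors (missing ratings already set to 0)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical ratings of three movies by three users (0 = not rated)
target = [5, 4, 0]
neighbor_a = [5, 5, 0]   # similar taste to the target
neighbor_b = [1, 0, 5]   # different taste

sims = np.array([cosine(target, neighbor_a), cosine(target, neighbor_b)])

# Ratings the two neighbors gave to a fourth movie we want to predict for the target
movie_ratings = np.array([5.0, 1.0])

plain_mean = movie_ratings.mean()
weighted_mean = np.dot(sims, movie_ratings) / sims.sum()
print(f"plain mean = {plain_mean:.2f}, weighted mean = {weighted_mean:.2f}")
```

Because the first neighbor is far more similar to the target, the weighted mean is pulled toward that neighbor's rating.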

In [10]:
#Create a dummy ratings matrix with all null values imputed to 0
r_matrix_dummy = r_matrix.copy().fillna(0)
# Import cosine_similarity
# from sklearn.metrics.pairwise import cosine_similarity

#Compute the cosine similarity matrix using the dummy ratings matrix
cosine_sim = cosine_similarity(r_matrix_dummy, r_matrix_dummy)

#Convert into pandas dataframe 
cosine_sim = pd.DataFrame(cosine_sim, index=r_matrix.index, columns=r_matrix.index)

cosine_sim.head(10)

user_id,1,2,3,4,5
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1.0,0.379777,0.692411,0.335659,0.125245
2,0.379777,1.0,0.404557,0.568737,0.475651
3,0.692411,0.404557,1.0,0.78388,0.57536
4,0.335659,0.568737,0.78388,1.0,0.75186
5,0.125245,0.475651,0.57536,0.75186,1.0


With the cosine similarity matrix in hand, we can set up the weighted mean function. This function needs 2 additional arguments: the ratings matrix (ratings_matrix) and the cosine similarity matrix (c_sim_matrix).

In [11]:
#User Based Collaborative Filter using Weighted Mean Ratings
def cf_user_wmean(user_id, movie_id, ratings_matrix, c_sim_matrix):
    
    #Check if movie_id exists in r_matrix
    if movie_id in ratings_matrix:
        
        #Get the similarity scores for the user in question with every other user
        sim_scores = c_sim_matrix[user_id]
        
        #Get the user ratings for the movie in question
        m_ratings = ratings_matrix[movie_id]
        
        #Extract the indices containing NaN in the m_ratings series
        idx = m_ratings[m_ratings.isnull()].index
        
        #Drop the NaN values from the m_ratings Series
        m_ratings = m_ratings.dropna()
        
        #Drop the corresponding cosine scores from the sim_scores series
        sim_scores = sim_scores.drop(idx)
        
        #Compute the final weighted mean
        wmean_rating = np.dot(sim_scores, m_ratings)/ sim_scores.sum()

    else:
        #Default to a rating of 3.0 in the absence of any information
        wmean_rating = 3.0
    
    return wmean_rating



score(cf_user_wmean, X_test, r_matrix, cosine_sim)

1.2892045169426134

### User Demographics

The general idea here is that users with the same demographics might have similar tastes. (Note that with our toy dataset, since we used random number generation for our ratings, it's unlikely there's any actual correlation between demographics and taste. But we'll step through this anyway.)

In [12]:
#merge the training set with the user data
merged_df = pd.merge(X_train, users.copy())

#Compute the mean rating of every movie by gender
gender_mean = merged_df[['movie_id', 'sex', 'rating']].copy().groupby(['movie_id', 'sex'])['rating'].mean()

display(gender_mean.head())

#Set the index of the users dataframe to the user_id
#we need to do this so that we can fetch the right data in our model function
users = users.set_index('user_id')

display(users)

movie_id  sex
1         F      2.5
          M      3.5
2         F      1.0
          M      4.0
3         F      3.0
Name: rating, dtype: float64

Unnamed: 0_level_0,age,sex,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,24,M,technician,90210
2,53,F,writer,53704
3,23,M,teacher,53706
4,20,F,technician,53704
5,55,M,teacher,90210


Once again we're going to modify Banik's function to compute the gender-influenced score. This one needs to take in our ratings matrix (ratings_matrix), our user dataframe (user_df) and our gender mean data (gen_mean).

In [13]:
#Gender Based Collaborative Filter using Mean Ratings
def cf_gender(user_id, movie_id, ratings_matrix, user_df, gen_mean):
    
    #Check if movie_id exists in r_matrix (or training set)
    if movie_id in ratings_matrix:
        #Identify the gender of the user
        gender = user_df.loc[user_id]['sex']
        
        #Check if the gender has rated the movie
        if gender in gen_mean[movie_id]:
            
            #Compute the mean rating given by that gender to the movie
            gender_rating = gen_mean[movie_id][gender]
        
        else:
            gender_rating = 3.0
    
    else:
        #Default to a rating of 3.0 in the absence of any information
        gender_rating = 3.0
    
    return gender_rating

score(cf_gender,  X_test, r_matrix, users, gender_mean)

2.3375811674219387

We can also combine multiple demographic variables. Let's combine gender and occupation. Note that this function doesn't take the ratings matrix at all; the mean ratings come from the gender/occupation dataframe we pass in, which is a slightly different approach from the gender-only model.
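The lookups in this model rely on a pivot table whose columns are indexed by two levels (occupation, then sex). The sketch below uses a small made-up dataframe (`toy` and `table` are our own names) to show how membership tests and chained lookups behave on such a table.

```python
import numpy as np
import pandas as pd

# Hypothetical merged ratings: two male teachers and one female technician
toy = pd.DataFrame({
    "movie_id":   [1, 1, 1, 2],
    "occupation": ["teacher", "teacher", "technician", "technician"],
    "sex":        ["M", "M", "F", "F"],
    "rating":     [4, 2, 5, 1],
})

table = toy.pivot_table(values="rating", index="movie_id",
                        columns=["occupation", "sex"], aggfunc="mean")

row = table.loc[1]                         # Series indexed by (occupation, sex)
has_teacher = "teacher" in row             # occupation present in the column index
has_writer = "writer" in row               # occupation absent from the column index
teacher_mean = row["teacher"]["M"]         # mean of the two teacher ratings for movie 1
missing = table.loc[2]["teacher"]["M"]     # no teacher rated movie 2, so this is NaN

print(has_teacher, has_writer, teacher_mean, np.isnan(missing))
```

The NaN case is why the model function checks `np.isnan(rating)` before returning: a gender/occupation combination can exist in the table without having rated a particular movie.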

In [14]:
#Compute the mean rating by gender and occupation
gen_occ_mean = merged_df[['sex', 'rating', 'movie_id', 'occupation']].pivot_table(
    values='rating', index='movie_id', columns=['occupation', 'sex'], aggfunc='mean')

gen_occ_mean.head()

#Gender and Occupation Based Collaborative Filter using Mean Ratings
def cf_gen_occ(user_id, movie_id, user_df, gen_occ_mean_df):
    
    #Check if movie_id exists in gen_occ_mean
    if movie_id in gen_occ_mean_df.index:
        
        #Identify the user
        user = user_df.loc[user_id]
        
        #Identify the gender and occupation
        gender = user['sex']
        occ = user['occupation']
        
        #Check if the occupation has rated the movie
        if occ in gen_occ_mean_df.loc[movie_id]:
            
            #Check if the gender has rated the movie
            if gender in gen_occ_mean_df.loc[movie_id][occ]:
                
                #Extract the required rating
                rating = gen_occ_mean_df.loc[movie_id][occ][gender]
                
                #Default to 3.0 if the rating is null
                if np.isnan(rating):
                    rating = 3.0
                
                return rating
            
    #Return the default rating    
    return 3.0

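#compute the RMSE
#(pass gen_occ_mean here; passing gender_mean instead makes every lookup miss,
# so the model returns 3.0 each time and just reproduces the baseline RMSE of 1.3093073414159542)
score(cf_gen_occ,  X_test, users, gen_occ_mean)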

## Model Based Approaches

All of the above models were relatively simple and straightforward calculations. Machine learning algorithms can give us a more powerful approach. We'll look at a couple of options.

### K Nearest Neighbors

When we looked at demographics, we were using hard-coded data to determine what makes people "similar" and assuming that if they were similar in that respect, their taste in movies would also be similar. But that might be a faulty assumption. 

K Nearest Neighbors, on the other hand, can learn from how users have actually rated movies. Perhaps there are groups of individuals who rate movies in similar ways. K Nearest Neighbors will try to find, for each user, the other users whose ratings are most similar.

Specifically, what this algorithm does is:
- Finds the k nearest neighbors of the user who have rated movie m
- Outputs the similarity-weighted average of those k users' ratings for movie m
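The two steps above can be sketched in a few lines of NumPy. This is a simplified illustration under our own assumptions (a precomputed vector of similarities to the target user, NaN for missing ratings, and a default of 3.0), not the surprise library's implementation; `knn_predict` is a hypothetical helper name.

```python
import numpy as np

def knn_predict(similarities, movie_ratings, k, default=3.0):
    """Similarity-weighted average over the k most similar users who rated the movie."""
    similarities = np.asarray(similarities, dtype=float)
    movie_ratings = np.asarray(movie_ratings, dtype=float)

    # Only users who actually rated the movie can be neighbors
    rated = ~np.isnan(movie_ratings)
    sims, ratings = similarities[rated], movie_ratings[rated]
    if sims.size == 0:
        return default

    # Keep the k users with the highest similarity to the target user
    top = np.argsort(sims)[::-1][:k]
    weights = sims[top]
    if weights.sum() == 0:
        return default
    return float(np.dot(weights, ratings[top]) / weights.sum())

# Hypothetical similarities of four users to the target, and their ratings for movie m
sims_to_target = [0.9, 0.8, 0.1, 0.7]
ratings_for_m = [5, np.nan, 1, 3]
prediction = knn_predict(sims_to_target, ratings_for_m, k=2)
print(prediction)
```

The second user is very similar to the target but never rated the movie, so the two neighbors actually used are the first and fourth users.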

The <a href="https://surprise.readthedocs.io/en/stable/knn_inspired.html">documentation for KNNBasic</a> goes over all the parameters you can set when you're setting up the algorithm.

Note that in this toy set, since we only have a handful of neighbors, we will need to decrease the number of neighbors (k) that the algorithm takes into consideration. Otherwise, we'll just be getting the mean of all the considered ratings in each fold.

In [15]:
# this has been edited - replace "evaluate" with "cross_validate"

#Import the required classes and methods from the surprise library
# from surprise import Reader, Dataset, KNNBasic

#Define a Reader object
#The Reader object helps in parsing the file or dataframe containing ratings
reader = Reader(rating_scale=(1,5)) # (1,5) is also the default, but we set it explicitly

#Create the dataset to be used for building the filter
data = Dataset.load_from_df(ratings, reader)

#define a random seed for consistent results
np.random.seed(1)
#Define the algorithm object; in this case kNN
knn = KNNBasic(k=3, verbose=False) #the default for k is 40, we're also setting verbose to False to suppress messages

#This code cross validates (evaluates) the model
from surprise.model_selection import cross_validate
knn_cv = cross_validate(knn, data, measures=['RMSE'], cv=5, verbose=True)
print(knn_cv)

#to extract the mean RMSE, we need to get the mean of the test_rmse values
knn_RMSE = np.mean(knn_cv['test_rmse'])
print(f'\nThe RMSE across five folds was {knn_RMSE}')

Evaluating RMSE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.6963  1.7883  1.7354  2.2908  3.0011  2.1024  0.4983  
Fit time          0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    
{'test_rmse': array([1.6963266 , 1.7882953 , 1.73543669, 2.29079882, 3.00109583]), 'fit_time': (7.295608520507812e-05, 2.9087066650390625e-05, 2.5987625122070312e-05, 0.00010418891906738281, 3.695487976074219e-05), 'test_time': (0.00011134147644042969, 6.723403930664062e-05, 5.7220458984375e-05, 9.131431579589844e-05, 0.00010251998901367188)}

The RMSE across five folds was 2.10239064827985


### Singular-value Decomposition (SVD)

The theory behind SVD is covered in Banik's book. The very high-level concept is that it's a method that allows you to reduce the dimensions of a sparse matrix and "fill in the blanks" with predictions. We don't expect you to understand the intricacies. The code itself is extremely simple, once you've already got a surprise data object set up. Read the <a href="https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD">full documentation</a> if you're curious.
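To give a feel for the "fill in the blanks" idea, here is a minimal sketch using NumPy's plain truncated SVD on a made-up ratings matrix. This is not the algorithm the surprise library's SVD class uses (that one learns its factors iteratively from the observed ratings only). The mean imputation step and the choice of rank 2 are our own assumptions for illustration.

```python
import numpy as np

# Hypothetical 4-user x 3-movie ratings matrix; NaN marks a missing rating
R = np.array([
    [5.0, 4.0, np.nan],
    [4.0, np.nan, 1.0],
    [1.0, 2.0, 5.0],
    [np.nan, 1.0, 4.0],
])

# Step 1: fill the blanks with each movie's mean rating so the SVD is defined
col_means = np.nanmean(R, axis=0)
filled = np.where(np.isnan(R), col_means, R)

# Step 2: factor the filled matrix and keep only the r largest singular values
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
r = 2
approx = (U[:, :r] * s[:r]) @ Vt[:r, :]

# Step 3: read predictions for the originally missing cells off the low-rank matrix
missing = np.isnan(R)
print(np.round(approx[missing], 2))
```

Keeping fewer singular values forces the reconstruction to explain the ratings with a small number of underlying patterns, which is what lets it produce values for cells that were never rated.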

In [16]:
# this has been edited - replace "evaluate" with "cross_validate"
#Import SVD
# from surprise import SVD

#define a random seed for consistent results
np.random.seed(1)
#Define the SVD algorithm object
svd = SVD()

#Evaluate the performance in terms of RMSE
svd_cv = cross_validate(svd, data, measures=['RMSE'], cv=5, verbose=True)
#to extract the mean RMSE, we need to get the mean of the test_rmse values
svd_RMSE = np.mean(svd_cv['test_rmse'])
print(f'\nThe RMSE across five folds was {svd_RMSE}')

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.1140  1.5535  1.8721  1.5735  1.8817  1.5990  0.2802  
Fit time          0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    

The RMSE across five folds was 1.5989842066442819


# Self Assessment
Follow the examples and use the code files provided from chapters 5-7 in **Hands-On Recommendation Systems with Python** by Rounak Banik to do the following self-assessment exercises.

The self-assessments in this lesson will be using a subset of data from the Book-Crossing dataset.  Click <a href="http://www2.informatik.uni-freiburg.de/~cziegler/BX/">here</a> for more details on the Book-Crossing dataset.

## User-Based Collaborative Filter

### *Self-Assessment: Setting up the File*

The file **BX-Book-Ratings-3000.csv** (found in the presentation download for this lesson) is loaded here for you, though you may need to change the file path unless you create the same folder structure. Note that book ratings have been adjusted so the scale goes from 1 to 11.   

Run the cell below to load the file and then do the following:

* display the first 5 lines of the data (get familiar with the data frame)
* calculate the mean book rating for all books (just to get an idea)
* split the data set so that 70\% of a user's ratings are in the training set and 30\% are in the testing set

In [17]:
# load the data
import pandas as pd
bx = pd.read_csv('./data/BX-Book-Ratings-3000.csv')

In [18]:
# enter your code here

### *Self-Assessment: Baseline RMSE to Assess Model Performance*

Build a baseline model that assigns a neutral rating and compute the RMSE of these simple "predictions" using the testing set. Make sure to make this model accept \*args so that it aligns with more complicated models.

A neutral rating would occur at the midpoint of the rating scale.  Calculate the median of the rating scale to determine what the baseline model should return.

In [19]:
# enter your code here

### *Self-Assessment: Weighted Mean User-Based Filter*

Build a ratings matrix from the data frame of users, books, and ratings and build a user-based collaborative filtering model that computes mean ratings weighted by cosine similarity among users.  Fit the model on the training set and compute the RMSE for this model using the test set and compare it to the RMSE of the baseline model.  Is it better than baseline?  (*i.e.* is the RMSE smaller?)

In [20]:
# enter your code here

### *Self-Assessment: Weighted Mean Item-Based Filter*

Create a new ratings matrix from the data frame of users, books, and ratings with the rows defined by books (*i.e.* items) and columns defined by users to build an item-based collaborative filtering model that computes mean ratings weighted by cosine similarity among items.  Fit the model on the training set, compute the RMSE for this model on the test set, and compare it to the RMSEs of the baseline and weighted mean user-based models.  Is this one better than baseline?

In [21]:
# enter your code here

### *Self-Assessment: kNN-Based Collaborative Filter*

Use the *surprise* library in Python to build a kNN-based collaborative filtering model for the BX-Books ratings.  Fit the model on the full data set (this is what we did in the examples) and compute the average RMSE for this model from 5 cross-validations. Compare it to the RMSEs of the baseline, weighted mean user-based, and weighted mean item-based models previously obtained.




In [22]:
# enter your code here

### *Self-Assessment: SVD Filter*

Use the *surprise* library in Python to build a filtering model based on singular-value decomposition (SVD).  Fit the model on the full data set and compute the average RMSE for this model from 5 cross-validations and compare it to the RMSEs of the previous models.

In [23]:
# enter your code here

### *Self-Assessment: Hybrid Recommender*

Create a recommender system that is a hybrid of an item-based collaborative filter and the SVD collaborative filter.  
Your recommender should do the following:

* Take in a user ID and book title, the cosine similarity matrix, the item data and the trained predictor algorithm as user input
* Use cosine similarity among books to find the 25 most similar books
* Compute the predicted ratings that the user might give to these 25 books using the SVD collaborative filter

Return the top 10 book recommendations along with their predicted ratings when user **31315** enters the book with ISBN **440214041**.  

Use the entire data set to build this model (*i.e.* don't split into training and testing sets). This is what we did in the examples.

In [24]:
# enter your code here

# One More Self-Assessment

### *Self-Assessment: Type of Recommenders*

Match the type of recommender system with its brief description by matching the letter of the recommender system with the number of the description.

**Recommenders**

a. simple 

b. knowledge-based

c. content-based

d. user-based collaborative filters 

e. item-based collaborative filters  

f. hybrid 


**Descriptions**

1.  This model provides recommendations based on items with similar descriptions and features that match the profile of the user.


2. Uses similarity among items to create an ordered list of recommended items based on a metric of interest (*i.e.* ratings).


3. Recommendations made to users are based on an ordered list of items that are ranked according to some metric of interest (*i.e.* ratings).


4. A combination of recommender systems that makes use of the advantages of each system used.


5. Items are recommended that meet the specifics and preferences elicited from users and are ranked according to a metric of interest (*i.e.* ratings).


6.  Uses similarity among users to create an ordered list of recommended items based on a metric of interest (*i.e.* ratings).


**Put the letter of the recommender system with the number of its description.**

1.

2.

3.
 
4.

5.

6.
