<h1> Content-Based Recommender Systems</h1>

In content-based recommender systems, we use the content information of both users and items while building recommendation engines.

A typical content-based recommender system will perform the following steps:
1. Retrieve user, item and activity data
2. Generate item profiles
3. Generate user profiles
4. Generate the recommendation engine model
5. Suggest the top N recommendations

<h2>Step 1 - Retrieve Data</h2>
The first step is always to gather the data and load it into the programming environment.

For our use case, we download the MovieLens dataset, which contains three sets of data:
 - Movie data containing a certain movie's information, such as movieID, release date, URL, genre details, and so on
 - User data containing the user information, such as userID, age, gender, occupation, ZIP code, and so on
 - Ratings data containing userID, itemID, rating, timestamp

In [1]:
# Import the libraries that are going to be used here
import pandas as pd
import numpy as np
import scipy
import sklearn

In [2]:
# Column headers for the dataset
data_cols = ['user id','movie id','rating','timestamp']
item_cols = ['movie id','movie title','release date', 'video release date','IMDb URL','unknown','Action', 'Adventure','Animation','Childrens','Comedy','Crime', 'Documentary','Drama','Fantasy','Film-Noir','Horror', 'Musical','Mystery','Romance ','Sci-Fi','Thriller', 'War' ,'Western']
user_cols = ['user id','age','gender','occupation', 'zip code']

In [38]:
# List of users
df_u_user = pd.read_csv('/home/nbuser/library/dataset/u.user', header=None, sep='|', names=user_cols, encoding='latin-1')
df_u_user = df_u_user.sort_values('user id', ascending=True)
df_u_user.columns
df_u_user.head(10)

Unnamed: 0,user id,age,gender,occupation,zip code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,5201
8,9,29,M,student,1002
9,10,53,M,lawyer,90703


In [39]:
# List of movie items
df_u_item = pd.read_csv('/home/nbuser/library/dataset/u.item', header=None, sep='|', names=item_cols, encoding='latin-1')
df_u_item = df_u_item.sort_values('movie id', ascending=True)
df_u_item.columns
df_u_item.head(10)

Unnamed: 0,movie id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Childrens,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,01-Jan-1995,,http://us.imdb.com/Title?Yao+a+yao+yao+dao+wai...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,7,Twelve Monkeys (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Twelve%20Monk...,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
7,8,Babe (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Babe%20(1995),0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8,9,Dead Man Walking (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Dead%20Man%20...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,10,Richard III (1995),22-Jan-1996,,http://us.imdb.com/M/title-exact?Richard%20III...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [40]:
# Drop the columns that are not required for this analysis (release dates, IMDb URL and the 'unknown' genre flag)
df_u_item_final = df_u_item.drop(['release date', 'video release date','IMDb URL','unknown'], axis=1)
df_u_item_final.head(10)

Unnamed: 0,movie id,movie title,Action,Adventure,Animation,Childrens,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0
5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
6,7,Twelve Monkeys (1995),0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
7,8,Babe (1995),0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0
8,9,Dead Man Walking (1995),0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
9,10,Richard III (1995),0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0


In [43]:
# User activity data
df_u_data = pd.read_csv('/home/nbuser/library/dataset/u.data', header=None, sep='\t', names=data_cols, encoding='latin-1')
df_u_data = df_u_data.sort_values('user id', ascending=True)
df_u_data.columns
df_u_data.head(10)

Unnamed: 0,user id,movie id,rating,timestamp
66567,1,55,5,875072688
62820,1,203,4,878542231
10207,1,183,5,875072262
9971,1,150,5,876892196
22496,1,68,4,875072688
9811,1,201,3,878542960
9722,1,157,4,876892918
9692,1,184,4,875072956
9566,1,210,4,878542909
9382,1,163,4,875072442


In [61]:
# Drop the last two columns (rating and timestamp) from the dataframe as they are not required for this analysis.
# The specific rating value doesn't matter for this simplistic case, as long as the user has rated a movie,
# so a constant 'rating' column is inserted instead (note that the value '1' is stored as a string).
df_u_data_final = df_u_data.drop(['timestamp', 'rating'], axis=1)
df_u_data_final.insert(2, 'rating', '1')
df_u_data_final.head(10)

Unnamed: 0,user id,movie id
66567,1,55
62820,1,203
10207,1,183
9971,1,150
22496,1,68
9811,1,201
9722,1,157
9692,1,184
9566,1,210
9382,1,163


<h2>Step 2 - Generate Item Profiles</h2>

In this step, we create a profile for each item using the content information we have about the items. The item profile is usually created using a widely-used information retrieval technique called TF-IDF. In information retrieval, TF-IDF (term frequency–inverse document frequency) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus (Wikipedia).
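
To make the statistic concrete, here is a minimal sketch that computes TF-IDF by hand for three made-up movie titles. It assumes the plain textbook weighting, tf × log(N / df); the TfidfVectorizer in sklearn applies a smoothed and normalized variant, so its numbers will differ. The toy corpus and the helper name `tf_idf` are illustrative only and are not part of the MovieLens workflow.

```python
import math
from collections import Counter

# Toy corpus (hypothetical titles, already lower-cased; tokens are split on whitespace)
docs = ["toy story", "toy soldiers", "love story"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(tokens):
    """Return {term: tf * log(N / df)} for one tokenized document."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }


weights = [tf_idf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, {t: round(v, 3) for t, v in w.items()})
```

Terms that occur in only one title ("soldiers", "love") receive a higher weight than terms shared by two titles ("toy", "story").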

We create the item profile with the TF-IDF tools available in the sklearn package: TfidfVectorizer() builds the vectorizer, and its fit_transform() method learns the vocabulary and returns the TF-IDF matrix. The following code shows how to create the TF-IDF item profile from the movie titles.


In [45]:
# The number of features to keep (max_features) depends on the dataset,
# and a suitable value can be selected by cross-validation:

from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer(stop_words="english", max_features=100, ngram_range=(0, 3), sublinear_tf=True)
x = v.fit_transform(df_u_item_final['movie title'])
profile_item = x.todense()

In [46]:
profile_item

matrix([[ 0.33092217,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.60881239,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.60881239,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        ..., 
        [ 0.47346531,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.57888669,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.49223887,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]])

<h2>Step 3 - Generate User Profiles</h2>

In this step, we take the user activity dataset and preprocess the data into a proper format to create a user profile. We should remember that, in a content-based recommender system, the user profile is created with respect to the item content, that is, we have to extract or compute the preferences of the user for the item content or item features. Usually, a dot product between user activity and item profile gives us the user profile.
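
As a rough illustration of that dot product, the sketch below multiplies a tiny user activity matrix by a tiny item profile in plain Python. All numbers are made up, and the helper `matmul` is a hypothetical stand-in for a library matrix product; the point is only the shapes: (users × items) times (items × features) gives (users × features).

```python
# Hypothetical data: 2 users x 3 items (1 = the user rated the item)
activity = [
    [1, 0, 1],
    [0, 1, 0],
]
# Hypothetical item profile: 3 items x 2 content features
item_profile = [
    [0.9, 0.0],
    [0.0, 0.8],
    [0.5, 0.5],
]


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a_ik * b_kj for a_ik, b_kj in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


# Each row sums the feature vectors of the items that user has rated
user_profile = matmul(activity, item_profile)
print(user_profile)
```

The first user rated items 0 and 2, so that user's profile is the sum of those two item feature vectors; the second user's profile equals the feature vector of item 1.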

<h3> User Activity </h3>

A binary user-item rating matrix needs to be created, where:
 - rows are the user IDs
 - columns are the item IDs
 - cells are the rating value (1 if the user has rated the movie, else 0)
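
The layout described above can be sketched without any libraries. The `events` list below is hypothetical sample data, one (user id, movie id) pair per rating; the real dataset is pivoted with pandas in the next cell.

```python
# Hypothetical (user id, movie id) pairs, one per rating event
events = [(1, 10), (1, 12), (2, 10), (3, 11)]

user_ids = sorted({u for u, _ in events})
movie_ids = sorted({m for _, m in events})
row_of = {u: i for i, u in enumerate(user_ids)}
col_of = {m: j for j, m in enumerate(movie_ids)}

# Start from all zeros, then flag every (user, movie) pair that was rated
matrix = [[0] * len(movie_ids) for _ in user_ids]
for u, m in events:
    matrix[row_of[u]][col_of[m]] = 1

for u, row in zip(user_ids, matrix):
    print(u, row)
```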

In [68]:
# We use pivot to create binary rating matrix
rating_matrix = df_u_data_final.pivot(index='user id', columns='movie id', values='rating').fillna(0)

In [69]:
rating_matrix.head(10)

user id \ movie id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,1,0,0,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,1,0,0,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,1,0,0,1,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [70]:
# Finally, we convert the DataFrame to a NumPy array:
rating_matrix = rating_matrix.values

In [73]:
rating_matrix

array([['1', '1', '1', ..., 0, 0, 0],
       ['1', 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       ['1', 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, '1', 0, ..., 0, 0, 0]], dtype=object)

<h3>User Profile</h3>

We now have the item profile and the user activity in hand; the dot product between these two matrices creates a new matrix with dimensions (number of users) by (number of item features).
To compute the dot product between user activity and item profile, we use NumPy's dot function and scale the result by the matrix norms from numpy.linalg.

In [76]:
# Compute the scaled dot product for the user profile creation.
# The rating flags were stored as the string '1', so the object array is cast to float first.
rating_matrix = rating_matrix.astype(float)
profile_item = np.asarray(profile_item)
profile_user = np.dot(rating_matrix, profile_item) / np.linalg.norm(rating_matrix) / np.linalg.norm(profile_item)

In [None]:
profile_user

<h2>Step 4 - Building Recommender Engine Model</h2>

Now that we have the user profile and item profile in hand, we will proceed to build a recommendation model.

Computing a cosine similarity between the user profile and item profile gives us the affinity of the user to each of the items.
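
For intuition, cosine similarity can be written out in a few lines of plain Python. This is a minimal sketch, not the sklearn implementation: the function name `cosine`, the choice to return 0.0 for an all-zero vector, and the sample vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical user profile and item profiles over the same two features
user = [1.4, 0.5]
items = {"item_a": [0.9, 0.0], "item_b": [0.0, 0.8], "item_c": [0.5, 0.5]}

affinity = {name: cosine(user, vec) for name, vec in items.items()}
print(affinity)
```

Because the measure depends only on the angle between the vectors, an item whose features point in the same direction as the user profile scores close to 1 regardless of the magnitude of either vector.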

In [None]:
# To compute the cosine similarity, we use the sklearn package.
# The following code calculates the cosine similarity between the user profile and the item profile
# (np.asarray makes sure plain arrays are passed, since profile_item was created with todense())

from sklearn.metrics.pairwise import cosine_similarity
similarityCalc = cosine_similarity(np.asarray(profile_user), np.asarray(profile_item), dense_output=True)

In [None]:
# We can see the results of the preceding calculation as follows:
similarityCalc

In [None]:
# Now, let's convert the similarity scores to binary predictions (0, 1):
# a score above the 0.6 threshold becomes 1 (recommend), anything else becomes 0
final_pred = np.where(similarityCalc > 0.6, 1, 0)

In [None]:
# Then we examine the final predictions of the first three users (rows 0 to 2):
final_pred[:3]

Keeping only the non-zero entries of the preceding results gives us the list of probable items that can be recommended to each user:

In [None]:
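# For the user in row 213 (user id 214, since rows are zero-indexed) the recommended items are found as follows.
# np.where returns the column positions; adding 1 to a position gives the movie id.
indexes_of_user = np.where(final_pred[213] == 1)
indexes_of_user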
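
The overview lists suggesting the top N recommendations as the final step, whereas the cells above select items with a fixed similarity threshold. One possible ranking-based alternative is sketched below. The function name `top_n`, the sample scores, and the decision to exclude items the user has already rated are assumptions for illustration, not part of the original workflow.

```python
def top_n(scores, already_rated, n=2):
    """Return the indices of the n highest-scoring items the user has not rated."""
    candidates = [i for i in range(len(scores)) if i not in already_rated]
    # Sort by score, highest first; ties keep their original index order
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return candidates[:n]


# Hypothetical similarity scores of one user against five items
scores = [0.91, 0.34, 0.72, 0.88, 0.10]
already_rated = {0}  # item indices the user has already rated

recommended = top_n(scores, already_rated, n=2)
print(recommended)
```

Unlike a fixed threshold, ranking always returns up to N items per user, even when none of the scores is high in absolute terms.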