In content-based recommender systems, we use the content information of both users and
items while building recommendation engines. A typical content-based recommender
system will perform the following steps:
1. Generate user profiles.
2. Generate item profiles.
3. Generate the recommendation engine model.
4. Suggest the top N recommendations.

A profile typically contains preferences for the features of items and users (refer to Chapter 3, Recommendation
Engines Explained for details). Once the profiles are created, we choose a method to build the
recommendation engine model. Many data-mining techniques, such as classification, text
similarity approaches such as tf-idf similarity, and matrix factorization models, can be
applied to build content-based recommendation engines.

We will use a classification approach to build our basic content-based recommendation engine.

The first step would always be to gather the data and pull it into the programming environment so that we may apply further steps. 

The second step would be preparing the data required to build the classification models. In this step, we extract the required features of the users and the class labels to build the classification model.

You might be wondering why we are choosing binary classification instead of multiclass classification. The choice of model is left to the person building the recommender system; in our case, with the dataset we have chosen, binary classification fits better than multiclass classification.

The third step will be to build the binary classification model. We will choose the RandomForest algorithm to build the classifier.

The fourth and final step will be to generate the top-N recommendations for the users.
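
The classification steps above can be sketched on synthetic data. This is only an illustration under stated assumptions: the feature matrix, the labels, and the candidate items are made up for the example and are not the dataset used in this chapter.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: each row is a (user, item) pair described by 4 features.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
# Hypothetical binary label: 1 = the user liked the item, 0 = did not.
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Step 3: fit a binary RandomForest classifier.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# Step 4: score candidate items for one user and keep the top N.
candidates = rng.random((10, 4))                  # 10 hypothetical candidate items
like_prob = clf.predict_proba(candidates)[:, 1]   # probability of class 1
top_n = np.argsort(like_prob)[::-1][:3]           # indices of the 3 best candidates
print(top_n, like_prob[top_n])
```

Ranking by the predicted probability of the positive class is one common way to turn a binary classifier into a top-N recommender; other scoring rules are possible.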

Item profile creation: In this step, we create a profile for each item using the content information we have about the items. The item profile is usually created using a widely-used information retrieval technique called tf-idf. In Chapter 4,
Data Mining Techniques for Recommendation Engines, we explained tf-idf in detail. To recap, the tf-idf value gives the relative importance of features with respect to all the items or documents.
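
As a small illustration of that recap, the following sketch computes tf-idf weights for three made-up item descriptions (they are not taken from the dataset). The word "support" appears in two descriptions, so within the first description it should receive a lower weight than "windows", which appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical item descriptions, invented for this example.
descriptions = [
    "windows support",
    "office support",
    "games download",
]

vectorizer = TfidfVectorizer()
item_profile = vectorizer.fit_transform(descriptions).toarray()

# vocabulary_ maps each term to its column index in the tf-idf matrix.
vocab = vectorizer.vocabulary_
terms = sorted(vocab, key=vocab.get)

for desc, row in zip(descriptions, item_profile):
    weights = {t: round(float(w), 2) for t, w in zip(terms, row) if w > 0}
    print(desc, weights)
```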

User profile creation: In this step, we take the user activity dataset and preprocess the data into a proper format to create a user profile. We should remember that, in a content-based recommender system, the user profile is created with respect to the item content, that is, we have to extract or compute the preferences of the user for the item content or item features. Usually, a dot product between user activity and item profile gives us the user profile.

Recommendation engine model generation: Now that we have the user profile and item profile in hand, we will proceed to build a recommendation model.
Computing a cosine similarity between the user profile and item profile gives us the affinity of the user to each of the items.

Generation of the top-N recommendations: In the final step, we shall sort the user-item preferences based on the values calculated in the previous step and
then suggest the top-N recommendations.
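
The four steps above can be sketched end to end on a toy example. The item profile and the activity matrix below are hypothetical values chosen for readability; only the sequence of operations (dot product, cosine similarity, sort) follows the description above.

```python
import numpy as np

# Hypothetical item profile: 4 items described by 3 content features.
item_profile = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.6, 0.8],
])

# Hypothetical user activity: 2 users x 4 items, 1 = accessed, 0 = not.
activity = np.array([
    [1, 1, 1, 0],
    [0, 0, 0, 1],
])

# User profile: dot product of activity and item profile (users x features).
user_profile = activity @ item_profile

# Cosine similarity between every user profile and every item profile.
user_norm = np.linalg.norm(user_profile, axis=1, keepdims=True)
item_norm = np.linalg.norm(item_profile, axis=1, keepdims=True)
affinity = (user_profile / user_norm) @ (item_profile / item_norm).T

# Top-N: sort each user's affinities in descending order and keep N items.
N = 2
top_n = np.argsort(-affinity, axis=1)[:, :N]
print(top_n)
```

Note that this sketch ranks all items, including those the user has already accessed; a real system would usually filter those out before recommending.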

<h3>Dataset Description</h3>

In [None]:
# The following is the list of packages we will be using for this exercise:

import pandas as pd
import numpy as np
import scipy
import sklearn

#Loading the data:
path = "~/anonymous-msweb.test.txt"

raw_data = pd.read_csv(path,header=None,skiprows=7)
raw_data.head()



We can observe the following from the preceding output:
- The first column contains three types of values, A, V, and C: A marks a row that describes a website area (an item), C marks a row that identifies a user, and V marks a website area that the preceding user has accessed
- The second column contains the IDs that represent users and items
- The fourth column contains the description of the website area
- The fifth column contains the URL of the website area


<h3> User Activity </h3>

Before we proceed toward profile generation, we will have to format the user activity data;
the following section explains how to create a user activity dataset.

In this section, we will create a user-item rating matrix with users as rows, items as columns, and binary values in the cells: 1 if the user has accessed the web page, and 0 otherwise:

In [None]:
# First, we keep only the records that don't contain "A" in the first column
# (copy() gives us an independent data frame that we can safely modify later):
user_activity = raw_data.loc[raw_data[0] != "A"].copy()

# Next, we remove unwanted columns from the dataset, keeping only the first two:
user_activity = user_activity.loc[:, :1]

# Assigning names to the columns of user_activity DataFrame:
user_activity.columns = ['category','value']

# The following code shows the sample user_activity data:
user_activity.head(15)

# To get the total number of unique webid values in the dataset, use the following code:
len(user_activity.loc[user_activity['category'] =="V"].value.unique())

# To get the count of unique users, use the following code:
len(user_activity.loc[user_activity['category'] =="C"].value.unique())

# Now let's run the following code to create a user-item-rating matrix, as follows:

# First, we assign variables:
tmp = 0
nextrow = False

# Then we get the last index of the dataset:
lastindex = user_activity.index[-1]
lastindex


In [None]:
# The for loop walks through each record and adds two new columns ('userid', 'webid') to the user_activity data frame, pairing each user with the pages visited:
for index, row in user_activity.iterrows():
    if user_activity.loc[index, 'category'] == "C":
        # A "C" row starts a new user: remember the user id for the rows that follow.
        userid = user_activity.loc[index, 'value']
        user_activity.loc[index, 'userid'] = userid
        user_activity.loc[index, 'webid'] = userid
        tmp = userid
        nextrow = True

    elif nextrow:
        # Any other row records a page visited by the most recent user.
        webid = user_activity.loc[index, 'value']
        user_activity.loc[index, 'webid'] = webid
        user_activity.loc[index, 'userid'] = tmp

        # If the next row starts a new user, close the current user's block.
        if index != lastindex and user_activity.loc[index+1, 'category'] == "C":
            nextrow = False


In [None]:
#Next, we remove the unwanted rows from the preceding data frame, that is, we will be removing the rows containing "C" in the category column:
user_activity = user_activity[user_activity['category'] == "V" ]

# We subset the columns, dropping the first two, which we no longer need:
user_activity = user_activity[['userid','webid']]

# Next, we sort the data by webid so that the rating matrix columns come out in a consistent order
# (sort_values() replaces the sort() method that was removed from pandas):
user_activity_sort = user_activity.sort_values('webid', ascending=True)

# Now, let's create a dense binary rating matrix containing user_item_rating using the following code:

# First, we get the size of webid column:
sLength = len(user_activity_sort['webid'])

# Then we add a new column, 'rating', to the user_activity_sort data frame; it contains only 1:
user_activity_sort['rating'] = pd.Series(np.ones((sLength,)), index=user_activity_sort.index)

# Next, we use pivot to create binary rating matrix:
ratmat = user_activity_sort.pivot(index='userid', columns='webid', values='rating').fillna(0)

# Finally, we convert the data frame to a dense NumPy array
# (to_numpy() replaces to_dense().as_matrix(), which were removed from pandas):
ratmat = ratmat.to_numpy()
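
For comparison, the same transformation can be sketched without an explicit loop. This is an alternative, not the approach used above, and it runs on a small hypothetical sample in the same two-column layout rather than on the real file; it assumes, as in the data above, that every run of "V" rows is preceded by the "C" row of the user it belongs to.

```python
import pandas as pd

# Hypothetical sample: a "C" row introduces a user, and the "V" rows
# that follow it are the pages that user accessed.
sample = pd.DataFrame({
    "category": ["C", "V", "V", "C", "V"],
    "value":    [10001, 1000, 1001, 10002, 1001],
})

# Carry each user id forward from its "C" row to the "V" rows that follow.
is_user = sample["category"] == "C"
sample["userid"] = sample["value"].where(is_user).ffill().astype(int)

# Keep only the visit rows; their value is the page id.
visits = sample.loc[sample["category"] == "V", ["userid", "value"]]
visits = visits.rename(columns={"value": "webid"})

# Count visits per (user, page) pair and clip to 0/1 to get a binary matrix.
rating_matrix = pd.crosstab(visits["userid"], visits["webid"]).clip(upper=1)
print(rating_matrix)
```

Using crosstab followed by clip also tolerates repeated visits to the same page, which pivot would reject as duplicate entries.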

<h3>Item profile generation</h3>

In [None]:
# To create item data, we will consider the data that contains A in the first column:
    
#First, we filter all the records containing first column as "A"
items = raw_data.loc[raw_data[0] == "A"]

# Then we name the columns as follows:
items.columns = ['record','webid','vote','desc','url']

# To generate item profile we only needed two columns so we slice the dataframe as follows:
items = items[['webid','desc']]

# To see the dimensions of the items data frame, we use the following code; we observe that there are 294 unique webid in the dataset:
items.shape

# To check a sample of the data, we use the following code:
items.head()

# To check the count of unique webid, we use the following code:
items['webid'].unique().shape[0]

# We can also keep only those webid that are present in the user_activity data:
items2 = items[items['webid'].isin(user_activity['webid'].tolist())]

# We can use the following code to check the type of the object:
type(items2)

# We can also sort the data by webid (sort_values() replaces the removed sort() method):
items_sort = items2.sort_values('webid', ascending=True)

# Let's see what we have done so far, using the head(5) function:
items_sort.head(5)



Now, we shall create the item profile using the tf-idf functions available in the sklearn package. To generate tf-idf, we use TfidfVectorizer and its fit_transform() method. The following code shows how we can create the tf-idf matrix.

In [None]:
# In the following code, the choice of the number of features to be included depends on the dataset, and the optimal number of features can be selected by the cross-validation approach:

from sklearn.feature_extraction.text import TfidfVectorizer
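# ngram_range starts at 1 because an n-gram needs at least one word:
v = TfidfVectorizer(stop_words="english", max_features=100, ngram_range=(1,3), sublinear_tf=True)
x = v.fit_transform(items_sort['desc'])
# toarray() returns a plain NumPy array, which the later sklearn calls accept:
itemprof = x.toarray()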

itemprof

<h3> User Profile Creation </h3>

We now have the item profile and the user activity in hand; the dot product between these two matrices creates a new matrix with dimensions (number of users) by (number of item features).
To compute the dot product between user activity and item profile, we use NumPy's dot function, and we scale the result by the matrix norms computed with the linalg module of the scipy package.

In [None]:
# Run the following code to compute the dot product:

# User profile creation (np.dot replaces scipy.dot, which is no longer available in recent SciPy releases)
from scipy import linalg
userprof = np.dot(ratmat, itemprof) / linalg.norm(ratmat) / linalg.norm(itemprof)
userprof

<h3> Recommendation </h3>

The final step in a recommendation engine model is to compute the active user's preferences for the items. For this, we compute the cosine similarity between the user profile and the item profile.
To compute the cosine similarity, we will use the sklearn package. The following code calculates it with cosine_similarity:

In [None]:
# We calculate the cosine similarity between the user profile and the item profile:

import sklearn.metrics
similarityCalc = sklearn.metrics.pairwise.cosine_similarity(userprof, itemprof, dense_output=True)

In [None]:
# We can see the results of the preceding calculation as follows:
similarityCalc

In [None]:
# Now, let's format the preceding results calculated as binary data (0,1), as follows:

# First, we convert the similarity scores to binary format, using 0.6 as the cut-off:
final_pred= np.where(similarityCalc>0.6, 1, 0)

# Then we examine the final predictions of the first three users (rows 0, 1, and 2):
final_pred[0:3]

In [None]:
# Removing the zero values from the preceding results gives us the list of the probable items that can be recommended to the users:

# For the user at row index 213, the column indexes of the recommended items are obtained as follows:
indexes_of_user = np.where(final_pred[213] == 1)
indexes_of_user
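
The threshold above returns every item whose score exceeds 0.6 rather than a fixed number of items. A top-N list can be sketched by ranking the scores instead. The helper top_n_unseen below is a hypothetical function written for this illustration, and the two small matrices are made-up stand-ins for similarityCalc and ratmat; it assumes both matrices share the same row (user) and column (item) order.

```python
import numpy as np

def top_n_unseen(similarity, ratings, user_row, n=5):
    """Return column indices of the n highest-scoring items the user has not accessed."""
    scores = similarity[user_row].astype(float).copy()
    scores[ratings[user_row] > 0] = -np.inf   # mask items already accessed
    ranked = np.argsort(-scores, kind="stable")
    # Drop masked items in case fewer than n unseen items exist.
    return [int(i) for i in ranked[:n] if np.isfinite(scores[i])]

# Hypothetical 2-user x 4-item similarity scores and binary activity.
similarity = np.array([[0.9, 0.7, 0.2, 0.5],
                       [0.1, 0.8, 0.6, 0.3]])
ratings = np.array([[1, 0, 0, 0],
                    [0, 1, 0, 0]])

print(top_n_unseen(similarity, ratings, user_row=0, n=2))
```

The returned values are column positions in the matrix, so they still have to be mapped back to webid values through the sorted column order used when the rating matrix was built.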
