# User-Item Filtering System

The purpose of this workbook is to develop the user-item filtering system, our first attempt at a collaborative-filtering recommender system.

To build this model, I use the ratings matrix developed in the previous workbook. The ratings matrix holds the star rating given to each of the 6K+ businesses by each of the 96K+ users. It is a sparse matrix: most of its values are null, because each user has rated only a small fraction of the businesses.
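
To make "sparse" concrete, here is a minimal sketch on toy data (the 4 x 5 frame `toy_ratings` is invented for illustration and is not the real ratings matrix). It measures the share of cells that hold no rating.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ratings matrix: 4 users (rows) x 5 businesses (columns).
# NaN marks a business that the user has not rated.
toy_ratings = pd.DataFrame(
    [
        [4.0, np.nan, np.nan, 3.0, np.nan],
        [np.nan, 5.0, np.nan, np.nan, np.nan],
        [1.0, np.nan, np.nan, np.nan, 2.0],
        [np.nan, np.nan, 4.0, np.nan, np.nan],
    ]
)

# Share of all cells that are missing (14 of the 20 cells here).
sparsity = toy_ratings.isna().to_numpy().mean()
print(f"Share of missing ratings: {sparsity:.0%}")
```

The same expression applied to the real `ratings_matrix` would give its actual sparsity, which I have not computed here.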

For the user-item filtering system, I develop a method for measuring the 'similarity' of two users with the `cosine_similarity` function from the `sklearn` package.
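
For reference, the cosine similarity between two users, restricted to the businesses that both have rated (which is how the cells below apply it), can be written as follows. The notation is introduced here for illustration and does not come from the workbook: $r_{u,b}$ is the rating user $u$ gave business $b$, and $B_{uv}$ is the set of businesses rated by both $u$ and $v$.

```latex
\mathrm{sim}(u, v) =
\frac{\sum_{b \in B_{uv}} r_{u,b}\, r_{v,b}}
     {\sqrt{\sum_{b \in B_{uv}} r_{u,b}^{2}}\;\sqrt{\sum_{b \in B_{uv}} r_{v,b}^{2}}}
```

Because star ratings are positive, this value is always greater than 0 and at most 1.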

I first walk through the individual steps for computing the cosine similarity between two example users. I then wrap these steps into a custom `find_user_similarity` function, which I will import into future workbooks.

I then walk through, with an example, the steps for predicting the rating that any user might give any business, weighting the ratings of other users by their cosine similarity to the target user as computed by `find_user_similarity`. Once again, I wrap these steps into the `user_item_rating_prediction` function, which I will also import into future workbooks.

***

In [1]:
# Import Python libraries as needed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Import ratings_matrix data-frame as created in a previous workbook
ratings_matrix = pd.read_pickle('data/user/ratings_matrix.pkl')

In [3]:
# Review the ratings matrix
ratings_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6548,6549,6550,6551,6552,6553,6554,6555,6556,6557
0,4.0,,,,,,,,,,...,,,,,,,,,,
1,5.0,,,,,,,,,,...,,,,,,,,,,
2,1.0,,,,,,,,,,...,,,,,,,,,,
3,4.0,,,,,,,,,,...,,,,,,,,,,
4,1.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96527,,,,,,,,,,,...,,,,,,,,,,
96528,,,,,,,,,,,...,,,,,,,,,,
96529,,,,,,,,,,,...,,,,,,,,,,5.0
96530,,,,,,,,,,,...,,,,,,,,,,


The ratings matrix above represents a table of 96,532 users and 6,558 businesses.

In [4]:
# Take a look at all the ratings of the first user in our ratings matrix table defined as user 0
ratings_matrix.loc[0, :]

0       4.0
1       NaN
2       NaN
3       NaN
4       NaN
       ... 
6553    NaN
6554    NaN
6555    NaN
6556    NaN
6557    NaN
Name: 0, Length: 6558, dtype: float64

In [5]:
# Take a look at the ratings of another user in our ratings matrix table defined as user 323
ratings_matrix.loc[323, :]

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
        ..
6553   NaN
6554   NaN
6555   NaN
6556   NaN
6557   NaN
Name: 323, Length: 6558, dtype: float64

In [6]:
# Convert the series of ratings of user 0 into a True/False boolean-type series where True = null value i.e. a rating value does not exist for this business
ratings_matrix.loc[0, :].isna()

0       False
1        True
2        True
3        True
4        True
        ...  
6553     True
6554     True
6555     True
6556     True
6557     True
Name: 0, Length: 6558, dtype: bool

In [7]:
# Reverse the series above such that True = a rating exists for this business for user 0 and False = rating does not exist
~ratings_matrix.loc[0, :].isna()

0        True
1       False
2       False
3       False
4       False
        ...  
6553    False
6554    False
6555    False
6556    False
6557    False
Name: 0, Length: 6558, dtype: bool
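
As a side note, pandas also provides `notna()`, which gives the same mask as inverting `isna()` with `~`. The sketch below checks this on a toy series (`user_ratings` is invented for illustration).

```python
import numpy as np
import pandas as pd

# Toy ratings for one user; NaN marks an unrated business.
user_ratings = pd.Series([4.0, np.nan, np.nan, 3.0])

# Two equivalent ways to flag the businesses that have a rating.
rated_via_invert = ~user_ratings.isna()
rated_via_notna = user_ratings.notna()

print(rated_via_notna.tolist())
```

Either form works in the cells that follow; the workbook keeps the `~...isna()` spelling.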

In [8]:
# Save the series of True/False values where a rating exists for user 0
businesses_rated_by_user0 = ~ratings_matrix.loc[0, :].isna()

# Save the series of True/False values where a rating exists for user 323
businesses_rated_by_user323 = ~ratings_matrix.loc[323, :].isna()

In [9]:
# Determine a list of businesses that were rated by both user 0 and user 323
# Combine the above two series together into a new series such that True values indicate businesses that were rated by both users
businesses_rated_by_both_user0_and_user323 = businesses_rated_by_user0 & businesses_rated_by_user323
businesses_rated_by_both_user0_and_user323

0       False
1       False
2       False
3       False
4       False
        ...  
6553    False
6554    False
6555    False
6556    False
6557    False
Length: 6558, dtype: bool

In [10]:
# Calculate the total number of businesses rated by both user 0 and user 323
businesses_rated_by_both_user0_and_user323.sum()

5

In [11]:
# Print the ratings given by user 0 and user 323 for the 5 businesses that they have rated together
ratings_matrix.loc[[0, 323], businesses_rated_by_both_user0_and_user323]

      79  1427  2721  3914  4827
0    4.0   3.0   4.0   4.0   4.0
323  4.0   3.0   4.0   4.0   3.0


In [12]:
# Reshape the list of ratings for user 0 for the same 5 businesses that were also rated by user 323 into a format suitable for the cosine similarity formula
# The cosine similarity formula requires an array of shape (1, n) where n represents the number of businesses that are rated by the two users being compared
selected_ratings_for_user0 = ratings_matrix.loc[0, businesses_rated_by_both_user0_and_user323].values.reshape(1, -1)
print(selected_ratings_for_user0.shape)
selected_ratings_for_user0

(1, 5)


array([[4., 3., 4., 4., 4.]])
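
The effect of `reshape(1, -1)` can be seen in isolation with plain NumPy. This is a small sketch using the five values printed above; the `-1` asks NumPy to infer the number of columns.

```python
import numpy as np

# The five co-rated values for user 0, as a flat 1-D array.
ratings = np.array([4.0, 3.0, 4.0, 4.0, 4.0])

# One row, with the column count inferred from the data.
row_vector = ratings.reshape(1, -1)

print(ratings.shape, row_vector.shape)
```

The flat array has shape `(5,)` and the reshaped one has shape `(1, 5)`, i.e. one sample with five features.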

In [13]:
# Complete the same step as above but for user 323
selected_ratings_for_user323 = ratings_matrix.loc[323, businesses_rated_by_both_user0_and_user323].values.reshape(1, -1)

In [14]:
# Calculate the cosine similarity between the ratings for user 0 and user 323 specifically for the 5 businesses that they have both rated together
cs_user0_and_user323 = cosine_similarity(selected_ratings_for_user0, selected_ratings_for_user323)
cs_user0_and_user323

array([[0.99406708]])

In [15]:
# The similarity value is returned inside a 2-D array of shape (1, 1)
# Index into that array to get the cosine similarity as a plain float
cs_user0_and_user323[0][0]

0.9940670826869249
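
As a sanity check, the same number can be reproduced by hand with NumPy alone, using the five co-rated values shown earlier. This is only a cross-check of the arithmetic, not a replacement for the `sklearn` call.

```python
import numpy as np

# Ratings of the five businesses that both users rated (from the table above).
user0 = np.array([4.0, 3.0, 4.0, 4.0, 4.0])
user323 = np.array([4.0, 3.0, 4.0, 4.0, 3.0])

# Numerator: 16 + 9 + 16 + 16 + 12 = 69
dot_product = user0 @ user323
# Denominator: sqrt(73) * sqrt(66)
norm_product = np.linalg.norm(user0) * np.linalg.norm(user323)

manual_similarity = dot_product / norm_product
print(round(manual_similarity, 8))
```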

***

Now I will consolidate the code above into a single function that computes the cosine similarity between any two users.

In [16]:
# Create a function that takes the user_id values of any two users, plus the ratings matrix, as its parameters
def find_user_similarity(userA, userB, ratings_matrix):
    
    # Create a True/False list of businesses that were given a rating for each of the two users
    businesses_rated_by_userA = ~ratings_matrix.loc[userA, :].isna()
    businesses_rated_by_userB = ~ratings_matrix.loc[userB, :].isna()
    
    # Consolidate the two boolean lists into a single one which represents only those businesses rated by both users
    businesses_rated_by_both_users = businesses_rated_by_userA & businesses_rated_by_userB
    
    # Capture the rating values of both users for those businesses that were rated by both users
    # Also transform these values into a format suitable for the cosine_similarity function
    ratings_of_userA = ratings_matrix.loc[userA, businesses_rated_by_both_users].values.reshape(1, -1)
    ratings_of_userB = ratings_matrix.loc[userB, businesses_rated_by_both_users].values.reshape(1, -1)
    
    # Capture the similarity between the two users by comparing their ratings for the set of businesses that they have both provided a rating for
    similarity = cosine_similarity(ratings_of_userA, ratings_of_userB)[0][0]
    
    # Return the cosine similarity value as the output of this function
    return similarity

In [17]:
# Calculate the cosine_similarity between user 0 and user 323 and confirm that the function is working as intended
find_user_similarity(0, 323, ratings_matrix)

0.9940670826869249
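
One property of this measure is worth keeping in mind when reading the similarities: if two users share exactly one co-rated business, the cosine similarity of their (positive) ratings is 1 no matter how far apart the two ratings are. The sketch below illustrates this with a small NumPy helper; `cosine` is a hypothetical stand-in written for this example, not part of the workbook.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two equal-length rating vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One shared business, rated 1 star by one user and 5 stars by the other.
one_overlap = cosine([1.0], [5.0])

# Two shared businesses with opposite opinions: 10 / 26.
opposite = cosine([1.0, 5.0], [5.0, 1.0])

print(one_overlap, round(opposite, 4))
```

With a single shared business the measure cannot tell agreement from disagreement, so pairs with very small overlap may deserve less trust than their similarity suggests.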

In [18]:
# Note that some user combinations raise an error when calculating the cosine similarity
# Comparing user 0 and user 12 raises a ValueError, since there are no businesses that both users have rated
try:
    find_user_similarity(0, 12, ratings_matrix)
except ValueError:
    print('Value Error')

Value Error
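
An alternative to catching the error is to check for an empty overlap before computing anything. The sketch below shows that idea on plain NumPy arrays; `safe_cosine_on_overlap` is a hypothetical helper, and returning NaN for "no overlap" is an assumption made for this example rather than the workbook's behaviour.

```python
import numpy as np

def safe_cosine_on_overlap(ratings_a, ratings_b):
    """Cosine similarity over co-rated entries, or NaN when there are none.

    Hypothetical helper; assumes ratings are positive stars with NaN for 'not rated'.
    """
    ratings_a = np.asarray(ratings_a, dtype=float)
    ratings_b = np.asarray(ratings_b, dtype=float)

    # True where both users have a rating.
    both_rated = ~np.isnan(ratings_a) & ~np.isnan(ratings_b)
    if not both_rated.any():
        return np.nan

    a, b = ratings_a[both_rated], ratings_b[both_rated]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

no_overlap = safe_cosine_on_overlap([4.0, np.nan], [np.nan, 5.0])
with_overlap = safe_cosine_on_overlap([4.0, 3.0, np.nan], [4.0, 3.0, 2.0])
print(no_overlap, with_overlap)
```

The workbook keeps the `try`/`except` approach; this variant only shows that the failing case can be detected up front.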


***

I will now use the function created above to compute the similarities between users so that I can make a prediction for an example scenario.

Specifically, what is the rating that we can predict for business 1 made by user 0?
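
The prediction built up in the following cells is a similarity-weighted average of the ratings that other users gave the target business. In notation introduced here for illustration ($\hat{r}_{u,b}$ is the predicted rating of business $b$ by user $u$, $r_{v,b}$ is the rating user $v$ gave $b$, $\mathrm{sim}(u, v)$ is the cosine similarity between users $u$ and $v$, and $U_b$ is the set of users who rated $b$ and share at least one co-rated business with $u$):

```latex
\hat{r}_{u,b} =
\frac{\sum_{v \in U_b} \mathrm{sim}(u, v)\, r_{v,b}}
     {\sum_{v \in U_b} \mathrm{sim}(u, v)}
```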

In [19]:
# Capture the values for our target user and target business into a variable
target_user = 0
target_business = 1

In [20]:
# Create empty lists to store the:
# 1. Similarities with other users to our target user
similarities_to_target_user = []
# 2. Existing ratings provided to our target business
ratings_given_to_target_business = []

In [21]:
# Create a new data-frame that includes all ratings for only those users that have rated the target business
target_ratings_matrix = ratings_matrix[~ratings_matrix.loc[:, target_business].isna()].copy()
target_ratings_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6548,6549,6550,6551,6552,6553,6554,6555,6556,6557
10,,4.0,,,,,,,,,...,,,,,,,,,,
11,,4.0,,,,,,,,,...,,,,,,,,,,
12,,4.0,,,,,,,,,...,,,,,,,,,,
13,,5.0,,,,,,,,,...,,,,,,,,,,
14,,5.0,,5.0,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16516,,4.0,,,,,,,,,...,,,,,,,,,,
16517,,1.0,,,,,,,,,...,,,,,,,,,,
16936,,5.0,,,,,,,,,...,,,,,,,,,,
17878,,5.0,,,,,,,,,...,,,,,,,,,,


In [23]:
# Confirm that our new data frame has a ratings value populated for our target business by every user in this new smaller data-frame
(~target_ratings_matrix.loc[:, 1].isna()).value_counts()

True    145
Name: 1, dtype: int64
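
The row filter used above can be tried on a tiny frame to see exactly which users survive it. This is a sketch on invented data (`toy_ratings`, with user ids 0, 10 and 11 chosen arbitrarily).

```python
import numpy as np
import pandas as pd

# Toy ratings: rows are user ids, columns are business ids.
toy_ratings = pd.DataFrame(
    {0: [4.0, np.nan, 1.0], 1: [np.nan, 5.0, 2.0]},
    index=[0, 10, 11],
)
target_business = 1

# Keep only the users (rows) that have a rating for the target business.
raters = toy_ratings[toy_ratings.loc[:, target_business].notna()]
print(raters.index.tolist())
```

User 0 is dropped because it has no rating for business 1, while the other columns of the remaining users are kept untouched.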

In [24]:
# Loop over every user in our target ratings matrix
# We can refer to each user as the 'other_user' since we know that our target user did not provide a rating for our target business and hence is not in this smaller data frame
for other_user in target_ratings_matrix.index:
    # Guard against the ValueError raised when the two users being compared have no businesses that they have both rated
    try:
        # Capture the cosine similarity between our target user and the current user from the list of users we are looping over
        similarity = find_user_similarity(target_user, other_user, ratings_matrix)
        # Append this similarity value to our list of similarity values
        similarities_to_target_user.append(similarity)
        # Append the rating given by the current 'other_user' to our list of ratings given to our target business
        ratings_given_to_target_business.append(ratings_matrix.loc[other_user, target_business])
    # If a ValueError is raised, we simply move on to the next user
    # Since nothing is appended to either the list of similarities or the list of ratings, the final calculation is not affected
    except ValueError:
        pass

In [25]:
# Use our list of cosine similarities and list of ratings given to our target business to predict the score for our target business as given by our target user
predicted_rating = np.dot(ratings_given_to_target_business, similarities_to_target_user)/np.sum(similarities_to_target_user)
print(f'Predicted rating for business {target_business} by user {target_user} is {round(predicted_rating, 2)}')

Predicted rating for business 1 by user 0 is 3.72
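
To see what the weighted-average line does, here is the same arithmetic on three made-up neighbours (the ratings and similarities below are invented for illustration). `np.average` with `weights` gives the same result and is included only as a cross-check.

```python
import numpy as np

# Ratings that three other users gave the target business (toy values).
ratings = [5.0, 3.0, 4.0]
# Their cosine similarities to the target user (toy values).
similarities = [0.9, 0.6, 0.5]

# (5*0.9 + 3*0.6 + 4*0.5) / (0.9 + 0.6 + 0.5) = 8.3 / 2.0
predicted = np.dot(ratings, similarities) / np.sum(similarities)

# Equivalent built-in weighted mean.
cross_check = np.average(ratings, weights=similarities)

print(round(predicted, 2))
```

The most similar user pulls the prediction toward their 5-star rating, so the result lands above the plain mean of 4.0.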


***

Now I will try to wrap the above code I have experimented with into a single function that can be called to predict the value of any missing rating value for our target business as given by our target user.

In [26]:
# Create a function to calculate the user-item rating prediction based on cosine similarity, with the following three parameters:
# target_user = user_id value of the user for whom the rating is being predicted
# target_business = business_id value of the business for which the rating is being predicted
# ratings_matrix = data-frame of ratings with users as rows and businesses as columns
def user_item_rating_prediction(target_user, target_business, ratings_matrix):
   
    # Create empty lists to store the:
    # 1. Similarities with other users to our target user
    similarities_to_target_user = []
    # 2. Existing ratings provided to our target business
    ratings_given_to_target_business = []
    
    # Create a list of all users that have provided a rating for the target business
    list_of_users_rating_target_business = list(ratings_matrix[~ratings_matrix.loc[:, target_business].isna()].index)
    
    # Loop over every user in our target ratings matrix
    # We can refer to each user as the 'other_user' since we know that our target user did not provide a rating for our target business and hence is not in this smaller data frame
    for other_user in list_of_users_rating_target_business:
        # Guard against the ValueError raised when the two users being compared have no businesses that they have both rated
        try:
            # Capture the cosine similarity between our target user and the current user from the list of users we are looping over
            similarity = find_user_similarity(target_user, other_user, ratings_matrix)
            # Append this similarity value to our list of similarity values
            similarities_to_target_user.append(similarity)
            # Append the rating given by the current 'other_user' to our list of ratings given to our target business
            ratings_given_to_target_business.append(ratings_matrix.loc[other_user, target_business])
        # If a ValueError is raised, we simply move on to the next user
        # Since nothing is appended to either the list of similarities or the list of ratings, the final calculation is not affected
        except ValueError:
            pass
    
    # Use the cosine similarity value to calculate the weighted average of all ratings (for those users that have at least 1 business that they have rated together)
    return np.dot(ratings_given_to_target_business, similarities_to_target_user)/np.sum(similarities_to_target_user)

In [27]:
# Confirm that our function returns the same value that we previously calculated by taking each step individually
user_item_rating_prediction(target_user, target_business, ratings_matrix)

3.7220670351652867
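
One edge case is not exercised by this example: if none of the users who rated the target business share a co-rated business with the target user, both lists stay empty and the final division becomes 0 / 0, which NumPy evaluates to `nan` with a runtime warning. The sketch below shows one possible explicit guard; `weighted_rating_or_nan` is a hypothetical helper, and returning NaN in that case is an assumption made for this example.

```python
import numpy as np

def weighted_rating_or_nan(ratings, similarities):
    """Similarity-weighted average rating, or NaN when there is nothing to average.

    Hypothetical helper mirroring the final line of user_item_rating_prediction.
    """
    total_similarity = np.sum(similarities)
    if len(ratings) == 0 or total_similarity == 0:
        return np.nan
    return float(np.dot(ratings, similarities) / total_similarity)

no_neighbours = weighted_rating_or_nan([], [])
two_neighbours = weighted_rating_or_nan([5.0, 3.0], [1.0, 1.0])
print(no_neighbours, two_neighbours)
```

A caller could then decide how to handle NaN, for example by falling back to some default rating; that choice is outside the scope of this workbook.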

***

Now that I have confirmed that the functions I created for user-item filtering work as intended, I will save both `find_user_similarity` and `user_item_rating_prediction` to a .py file so that they can be imported into future workbooks.