# Item-Item Filtering System

The purpose of this workbook is to develop an item-item filtering system, which will be our second attempt at building a collaborative filtering recommender system.

This workbook closely mirrors the previous workbook, where I developed a user-item filtering system. Most of the code is the same, with the only major difference being that rather than comparing the cosine similarities between two users, we are now comparing the similarities between two items.

Just like in the previous workbook, we will take steps to develop two functions for use in future workbooks: the `find_business_similarity` and `item_item_rating_prediction` functions.

***

In [1]:
# Import Python libraries as needed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Import ratings_matrix data-frame as created in a previous workbook
ratings_matrix = pd.read_pickle('data/user/ratings_matrix.pkl')

In [3]:
# Review the ratings matrix
ratings_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6548,6549,6550,6551,6552,6553,6554,6555,6556,6557
0,4.0,,,,,,,,,,...,,,,,,,,,,
1,5.0,,,,,,,,,,...,,,,,,,,,,
2,1.0,,,,,,,,,,...,,,,,,,,,,
3,4.0,,,,,,,,,,...,,,,,,,,,,
4,1.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96527,,,,,,,,,,,...,,,,,,,,,,
96528,,,,,,,,,,,...,,,,,,,,,,
96529,,,,,,,,,,,...,,,,,,,,,,5.0
96530,,,,,,,,,,,...,,,,,,,,,,


The ratings matrix above represents a table of 96,532 users and 6,558 businesses.

In [4]:
# Take a look at all the ratings of the first business in our ratings matrix table defined as business 0
ratings_matrix.loc[:, 0]

0        4.0
1        5.0
2        1.0
3        4.0
4        1.0
        ... 
96527    NaN
96528    NaN
96529    NaN
96530    NaN
96531    NaN
Name: 0, Length: 96532, dtype: float64

In [5]:
# Take a look at the ratings of another business in our ratings matrix table defined as business 669
ratings_matrix.loc[:, 669]

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
         ..
96527   NaN
96528   NaN
96529   NaN
96530   NaN
96531   NaN
Name: 669, Length: 96532, dtype: float64

In [6]:
# Convert the series of ratings of business 0 into a True/False boolean-type series where True = null value i.e. a rating value does not exist for this business
ratings_matrix.loc[:, 0].isna()

0        False
1        False
2        False
3        False
4        False
         ...  
96527     True
96528     True
96529     True
96530     True
96531     True
Name: 0, Length: 96532, dtype: bool

In [7]:
# Reverse the series above such that True = a rating exists for this user for business 0 and False = rating does not exist
~ratings_matrix.loc[:, 0].isna()

0         True
1         True
2         True
3         True
4         True
         ...  
96527    False
96528    False
96529    False
96530    False
96531    False
Name: 0, Length: 96532, dtype: bool

In [8]:
# Save the series of True/False values where a rating exists for business 0
users_rated_by_business0 = ~ratings_matrix.loc[:, 0].isna()

# Save the series of True/False values where a rating exists for business 669
users_rated_by_business669 = ~ratings_matrix.loc[:, 669].isna()

In [9]:
# Determine a list of users that provided a rating for both business 0 and business 669
# Combine the above two series together into a new series such that True values indicate users that provided a rating for both businesses
users_rated_by_both_business0_and_business669 = users_rated_by_business0 & users_rated_by_business669
users_rated_by_both_business0_and_business669

0        False
1        False
2        False
3        False
4        False
         ...  
96527    False
96528    False
96529    False
96530    False
96531    False
Length: 96532, dtype: bool

In [10]:
# Calculate the total number of users that rated both business 0 and business 669
users_rated_by_both_business0_and_business669.sum()

4

In [11]:
# Print the ratings given to business 0 and business 669 by the 4 users that rated both businesses
ratings_matrix.loc[users_rated_by_both_business0_and_business669, [0, 669]]

Unnamed: 0,0,669
5,4.0,4.0
587,4.0,4.0
2821,3.0,4.0
9902,3.0,2.0
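
The masking steps above can be checked by hand on a much smaller table. The following is a minimal sketch on a hypothetical 4-user, 2-business matrix (`toy_ratings` and its values are made up for illustration and are not part of the real data). It uses `notna()`, which is equivalent to `~isna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings matrix: rows are users, columns are businesses
toy_ratings = pd.DataFrame(
    {0: [4.0, np.nan, 3.0, 5.0], 1: [4.0, 2.0, np.nan, 1.0]}
)

# True where a user has rated the given business
rated_business0 = toy_ratings.loc[:, 0].notna()
rated_business1 = toy_ratings.loc[:, 1].notna()

# True only where a user has rated both businesses
rated_both = rated_business0 & rated_business1

# Summing a boolean series counts the True values
print(rated_both.sum())  # 2 (users 0 and 3)

# Select only the co-rated rows for the two businesses
print(toy_ratings.loc[rated_both, [0, 1]])
```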


In [12]:
# Reshape the ratings for business 0 from the same 4 users that also rated business 669 into a format suitable for the cosine similarity function
# The cosine similarity function requires an array of shape (1, n) where n represents the number of users that rated both of the businesses being compared
selected_ratings_for_business0 = ratings_matrix.loc[users_rated_by_both_business0_and_business669, 0].values.reshape(1, -1)
print(selected_ratings_for_business0.shape)
selected_ratings_for_business0

(1, 4)


array([[4., 4., 3., 3.]])

In [13]:
# Complete the same step as above but for business 669
selected_ratings_for_business669 = ratings_matrix.loc[users_rated_by_both_business0_and_business669, 669].values.reshape(1, -1)

In [14]:
# Calculate the cosine similarity between the ratings for business 0 and business 669, using only the 4 users that rated both businesses
cs_business0_and_buisness669 = cosine_similarity(selected_ratings_for_business0, selected_ratings_for_business669)
cs_business0_and_buisness669

array([[0.98058068]])

In [15]:
# We can see that the numerical value is nested inside a two-dimensional array
# Here we extract the same cosine similarity value as a plain float
cs_business0_and_buisness669[0][0]

0.9805806756909202
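
As a sanity check, the same value can be reproduced directly from the definition of cosine similarity, the dot product of the two rating vectors divided by the product of their lengths. The sketch below uses plain NumPy instead of `cosine_similarity` and takes the four co-rated values from the table above. The variable names are chosen here for illustration only.

```python
import numpy as np

# Ratings from the 4 users who rated both businesses (copied from the table above)
ratings_business0 = np.array([4.0, 4.0, 3.0, 3.0])
ratings_business669 = np.array([4.0, 4.0, 4.0, 2.0])

# Cosine similarity = (a . b) / (|a| * |b|)
manual_similarity = np.dot(ratings_business0, ratings_business669) / (
    np.linalg.norm(ratings_business0) * np.linalg.norm(ratings_business669)
)

print(round(manual_similarity, 6))  # 0.980581
```

Because raw ratings are all positive, a cosine similarity computed this way always falls between 0 and 1.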

***

Now I will attempt to consolidate the code above into a single function that calculates the cosine similarity between any two given businesses.

In [16]:
# Create a function that takes the business_id values of any two businesses, plus the ratings matrix, as its parameters
def find_business_similarity(businessA, businessB, ratings_matrix):
    
    # Create a True/False list of users that gave a rating for each of the two businesses
    users_who_rated_businessA = ~ratings_matrix.loc[:, businessA].isna()
    users_who_rated_businessB = ~ratings_matrix.loc[:, businessB].isna()
    
    # Consolidate the two boolean lists into a single one which represents only those users that rated both businesses
    users_who_rated_both_businesses = users_who_rated_businessA & users_who_rated_businessB
    
    # Capture the rating values of both businesses for those users that rated both businesses
    # Also transform these values into a format suitable for the cosine_similarity function
    ratings_of_businessA = ratings_matrix.loc[users_who_rated_both_businesses, businessA].values.reshape(1, -1)
    ratings_of_businessB = ratings_matrix.loc[users_who_rated_both_businesses, businessB].values.reshape(1, -1)
    
    # Calculate the similarity between the two businesses by comparing their ratings from the set of users that rated both of them
    similarity = cosine_similarity(ratings_of_businessA, ratings_of_businessB)[0][0]
    
    # Return the cosine similarity value as the output of this function
    return similarity

In [17]:
# Calculate the cosine_similarity between business 0 and business 669 and confirm that the function is working as intended
find_business_similarity(0, 669, ratings_matrix)

0.9805806756909202

In [18]:
# Note that some business combinations raise an error when calculating the cosine similarity between two different businesses
# We get a value error when we try to compare business 0 and business 3 since no user has rated both of them
try:
    find_business_similarity(0, 3, ratings_matrix)
except ValueError:
    print('Value Error')

Value Error
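
An alternative to catching the error is to check for an empty overlap before computing anything. The sketch below is one possible variant, not the function saved from this workbook: `find_business_similarity_or_nan` is a hypothetical name, it returns `NaN` when no user rated both businesses, and it computes the cosine with plain NumPy rather than `cosine_similarity` so that it stays self-contained. The toy matrix is made up for illustration.

```python
import numpy as np
import pandas as pd

def find_business_similarity_or_nan(businessA, businessB, ratings_matrix):
    # Hypothetical variant: returns NaN instead of raising when there is no overlap
    rated_both = ratings_matrix[businessA].notna() & ratings_matrix[businessB].notna()

    # No user rated both businesses, so the similarity is undefined
    if not rated_both.any():
        return np.nan

    ratings_a = ratings_matrix.loc[rated_both, businessA].to_numpy()
    ratings_b = ratings_matrix.loc[rated_both, businessB].to_numpy()

    # Cosine similarity from its definition
    return float(
        np.dot(ratings_a, ratings_b)
        / (np.linalg.norm(ratings_a) * np.linalg.norm(ratings_b))
    )

# Hypothetical toy matrix: businesses 0 and 1 share no raters, businesses 0 and 2 share two
toy_ratings = pd.DataFrame(
    {0: [4.0, np.nan, 3.0], 1: [np.nan, 2.0, np.nan], 2: [4.0, 1.0, 4.0]}
)

no_overlap = find_business_similarity_or_nan(0, 1, toy_ratings)
with_overlap = find_business_similarity_or_nan(0, 2, toy_ratings)
print(no_overlap, with_overlap)
```

A caller would then need to skip `NaN` similarities explicitly, which plays the same role as the `try`/`except` used in this workbook.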


***

I will now use the function created above to generate the similarities between businesses so that I can make a prediction for an example scenario.

Specifically, what rating can we predict that user 0 would give to business 1?

In [19]:
# Capture the values for our target user and target business into a variable
target_user = 0
target_business = 1

In [20]:
# Create empty lists to store the:
# 1. Similarities between our target business and each other business rated by our target user
similarities_to_target_business = []
# 2. Existing ratings given by our target user to those other businesses
ratings_given_by_target_user = []

In [21]:
# Create a new data-frame that includes all ratings for only those businesses that the target user has rated
target_ratings_matrix = ratings_matrix.loc[:, ~ratings_matrix.iloc[target_user, :].isna()].copy()
target_ratings_matrix

Unnamed: 0,0,79,846,847,1427,2632,2721,2778,3490,3758,3914,4827
0,4.0,4.0,3.0,3.0,3.0,4.0,4.0,1.0,3.0,3.0,4.0,4.0
1,5.0,,,,,,,,,,,
2,1.0,,,,,,,,,,,
3,4.0,,,,,,,,,,,
4,1.0,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
96527,,,,,,,,,,,,
96528,,,,,,,,,,,,
96529,,,,,,,,,,,,
96530,,,,,,,,,,,,


In [23]:
# Confirm that our target user has a rating populated for every business in this new smaller data-frame
(~target_ratings_matrix.loc[0, :].isna()).value_counts()

True    12
Name: 0, dtype: int64

In [24]:
# Loop over every business that our target user has rated (the columns of our target ratings matrix)
# We can refer to each of these as the 'other_business' since we know that our target user did not provide a rating for our target business and hence it is not in this smaller data frame
for other_business in target_ratings_matrix.columns:
    # Handle the value error that occurs when the two businesses we are comparing have 0 users that rated both of them
    try:
        # Calculate the cosine similarity between our target business and the current business from the list of businesses we are looping over
        similarity = find_business_similarity(target_business, other_business, ratings_matrix)
        # Append this similarity value to our list of similarity values
        similarities_to_target_business.append(similarity)
        # Append the rating that our target user gave to the current 'other_business' to our list of ratings
        ratings_given_by_target_user.append(ratings_matrix.loc[target_user, other_business])
    # If a value error is raised, we simply skip to the next iteration
    # Since nothing is appended to either the list of similarities or the list of ratings, the final calculation is not affected
    except ValueError:
        pass

In [25]:
# Use our list of cosine similarities and list of ratings given by our target user to predict the rating our target user would give to our target business
predicted_rating = np.dot(ratings_given_by_target_user, similarities_to_target_business)/np.sum(similarities_to_target_business)
print(f'Predicted rating for business 1 by user 0 is {round(predicted_rating, 2)}')

Predicted rating for business 1 by user 0 is 3.79
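
The prediction above is a similarity-weighted average: each rating the target user has given is multiplied by how similar that business is to the target business, and the total is divided by the sum of the similarities. The sketch below shows the arithmetic on made-up numbers. `weighted_average_rating` is a hypothetical helper, and the guard that returns `NaN` for an empty or zero-sum list of similarities is an added assumption, not something the cell above does.

```python
import numpy as np

def weighted_average_rating(ratings, similarities):
    # Hypothetical helper: similarity-weighted average of the given ratings
    ratings = np.asarray(ratings, dtype=float)
    similarities = np.asarray(similarities, dtype=float)

    total_similarity = similarities.sum()

    # Assumed guard: with no usable similarities the prediction is undefined
    if ratings.size == 0 or total_similarity == 0:
        return np.nan

    return float(np.dot(ratings, similarities) / total_similarity)

# Made-up example: the target user rated two other businesses 4.0 and 2.0,
# and those businesses have similarity 0.9 and 0.3 to the target business
toy_ratings_by_user = [4.0, 2.0]
toy_similarities = [0.9, 0.3]

# (4.0 * 0.9 + 2.0 * 0.3) / (0.9 + 0.3) = 4.2 / 1.2, roughly 3.5
toy_prediction = weighted_average_rating(toy_ratings_by_user, toy_similarities)
print(round(toy_prediction, 2))

# With nothing to average, the helper returns NaN instead of dividing by zero
empty_prediction = weighted_average_rating([], [])
print(empty_prediction)
```

The more similar business pulls the prediction toward its rating of 4.0, which is why the result sits above the plain average of 3.0.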


***

Now I will wrap the code I have experimented with above into a single function that can be called to predict any missing rating for a given target user and target business.

In [32]:
# Create a function to calculate the item-item rating prediction based on cosine similarity, with the following parameters:
# target_user = user_id value for the user for whom the rating is being predicted
# target_business = business_id value for the business for which the rating is being predicted
# ratings_matrix = data-frame of ratings with users as rows and businesses as columns
def item_item_rating_prediction(target_user, target_business, ratings_matrix):
   
    # Create empty lists to store the:
    # 1. Similarities between our target business and each other business rated by our target user
    similarities_to_target_business = []
    # 2. Existing ratings given by our target user to those other businesses
    ratings_given_by_target_user = []
    
    # Create a list of all businesses that the target user has rated
    list_of_businesses_rated_by_target_user = list(ratings_matrix.loc[:, ~ratings_matrix.iloc[target_user, :].isna()].columns)
    
    # Loop over every business that our target user has rated
    # We can refer to each of these as the 'other_business' since we know that our target user did not provide a rating for our target business and hence it is not in this list
    for other_business in list_of_businesses_rated_by_target_user:
        # Handle the value error that occurs when the two businesses we are comparing have 0 users that rated both of them
        try:
            # Calculate the cosine similarity between our target business and the current business from the list of businesses we are looping over
            similarity = find_business_similarity(target_business, other_business, ratings_matrix)
            # Append this similarity value to our list of similarity values
            similarities_to_target_business.append(similarity)
            # Append the rating that our target user gave to the current 'other_business' to our list of ratings
            ratings_given_by_target_user.append(ratings_matrix.loc[target_user, other_business])
        # If a value error is raised, we simply skip to the next iteration
        # Since nothing is appended to either the list of similarities or the list of ratings, the final calculation is not affected
        except ValueError:
            pass
    
    # Use the cosine similarity values to calculate the weighted average of the target user's ratings (for those businesses that share at least 1 rating user with the target business)
    return np.dot(ratings_given_by_target_user, similarities_to_target_business)/np.sum(similarities_to_target_business)

In [33]:
# Confirm that our function returns the same value we previously calculated by taking each step individually
item_item_rating_prediction(0, 1, ratings_matrix)

3.7917695651231935

***

Now that I have experimented and confirmed that the functions I have created for item-item filtering are working as intended, I will save both the `find_business_similarity` and `item_item_rating_prediction` functions to a .py file so that they can be imported into future workbooks.