# Collaborative filtering - memory based using cosine distance and kNN

Recommender systems are an integral part of many online services, from e-commerce sites to streaming platforms.
They use the past purchase patterns of their users to predict which other products those users may be interested in and likely to purchase. Recommending the right products gives a significant advantage to the business, and for many businesses a major portion of revenue is generated through recommendations.


Collaborative filtering is very popular on streaming platforms and e-commerce sites, where customers interact with each product (a movie, a song or a consumer product) by liking or disliking it or by giving it a rating.
One requirement for applying collaborative filtering is that a sufficient number of products have ratings associated with them; in other words, user interaction is required.




This notebook walks through an implementation of collaborative filtering using a memory-based technique: proximity between users is measured with cosine similarity, and recommendations come from each user's nearest neighbours.

## Importing libraries and initial data checks

In [None]:
# import required libraries
import pandas as pd
import numpy as np

### About the data

This is a dataset related to over 2 Million customer reviews and ratings of Beauty related products sold on Amazon's website.

It contains:
- the unique UserId (Customer Identification),
- the product ASIN (Amazon's unique product identification code for each product),
- Ratings (ranging from 1-5 based on customer satisfaction) and
- the Timestamp of the rating (in UNIX time)

In [None]:

df = pd.read_csv('ratings_Beauty.csv')
df.shape


In [None]:
# check the first 5 rows
df.head()
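
The Timestamp column is described above as UNIX time. As a minimal sketch, the conversion to readable dates could look like the following. The toy values are made up, the `RatedAt` column name is hypothetical, and the unit is assumed to be seconds since the epoch, which should be checked against the real data.

```python
import pandas as pd

# Toy frame mirroring the dataset's columns; the values are made up.
toy = pd.DataFrame({
    "UserId": ["U1", "U2"],
    "ProductId": ["P1", "P2"],
    "Rating": [5.0, 3.0],
    "Timestamp": [1369699200, 1355443200],
})

# Assumption: timestamps are seconds since the UNIX epoch.
toy["RatedAt"] = pd.to_datetime(toy["Timestamp"], unit="s")
rated_years = toy["RatedAt"].dt.year.tolist()
print(toy[["Timestamp", "RatedAt"]])
```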


Check if there are any duplicate values present.

Here, the code calculates the number of duplicate records in the DataFrame by checking for duplicates based on the columns "UserId", "ProductId", "Rating", and "Timestamp". The .duplicated() function returns a boolean Series indicating whether each row is a duplicate or not, and .sum() is then used to count the True values in this Series. The count of duplicates is stored in the variable duplicates.

In [None]:
duplicates = df.duplicated(["UserId","ProductId", "Rating", "Timestamp"]).sum()
print(' Duplicate records: ',duplicates)
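
The check above only flags rows that match on all four columns. A user who rated the same product twice at different times would not be counted. The sketch below, on made-up toy data, contrasts the two checks and shows one possible way to keep only the latest rating per user-product pair. Whether such repeats exist in the real file is not known from the text above.

```python
import pandas as pd

# Toy data: rows 0 and 1 are exact duplicates, row 2 is a later re-rating.
toy = pd.DataFrame({
    "UserId":    ["U1", "U1", "U1", "U2"],
    "ProductId": ["P1", "P1", "P1", "P1"],
    "Rating":    [5.0, 5.0, 4.0, 3.0],
    "Timestamp": [100, 100, 200, 300],
})

# Rows identical on every column (the check used in this notebook).
exact_duplicates = toy.duplicated(["UserId", "ProductId", "Rating", "Timestamp"]).sum()

# Rows where the same user rated the same product more than once.
repeat_ratings = toy.duplicated(["UserId", "ProductId"]).sum()

# Keep only the most recent rating for each user-product pair.
deduplicated = (toy.sort_values("Timestamp")
                   .drop_duplicates(["UserId", "ProductId"], keep="last"))

print(exact_duplicates, repeat_ratings, len(deduplicated))
```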


See the number of unique values present

In [None]:
print('unique users:',len(df.UserId.unique()))
print('unique products:',len(df.ProductId.unique()))
print("total ratings: ",df.shape[0])


Check for null values

This line checks for missing values in the DataFrame by using the .isnull() function, which returns a DataFrame of boolean values indicating whether each element is null or not. .any() then checks if there are any True values along each column, indicating the presence of missing values.

In [None]:
df.isnull().any()
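
`.any()` only says whether a column contains nulls. When the answer is True, `.sum()` on the same boolean frame gives the count per column. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing UserId and two missing ratings.
toy = pd.DataFrame({
    "UserId": ["U1", "U2", None],
    "Rating": [5.0, np.nan, np.nan],
})

has_nulls = toy.isnull().any()    # True/False per column
null_counts = toy.isnull().sum()  # number of nulls per column

print(has_nulls)
print(null_counts)
```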


Number of rated products per user

This line calculates the number of ratings given by each user. It uses the .groupby() function to group the data by "UserId", then applies .count() to count the number of ratings for each user. The results are sorted in descending order using .sort_values() and stored in the products_user Series.

In [None]:
products_user= df.groupby(by = "UserId")["Rating"].count().sort_values(ascending =False)
products_user.head()


Number of ratings per product

In [None]:
product_rated = df.groupby(by = "ProductId")["Rating"].count().sort_values(ascending = False)
product_rated.head()


Number of products rated by each user

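This line calculates the number of products rated by each user. It groups the data by "UserId" and counts the "ProductId" entries in each group; because .count() counts rows rather than distinct values, this equals the number of unique products only when no user has rated the same product more than once. The results are sorted in descending order and stored in the rated_users Series.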

In [None]:
rated_users=df.groupby("UserId")["ProductId"].count().sort_values(ascending=False)
print(rated_users)
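
The difference between counting rows and counting distinct products can be seen on a toy example (made-up data). `.count()` is the call used in this notebook; `.nunique()` is shown as the alternative for cases where a user may rate the same product more than once.

```python
import pandas as pd

# U1 rated P1 twice and P2 once; U2 rated P1 once.
toy = pd.DataFrame({
    "UserId":    ["U1", "U1", "U1", "U2"],
    "ProductId": ["P1", "P1", "P2", "P1"],
})

rows_per_user = toy.groupby("UserId")["ProductId"].count()       # counts rows
products_per_user = toy.groupby("UserId")["ProductId"].nunique()  # counts distinct products

print(rows_per_user.to_dict())
print(products_per_user.to_dict())
```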


In [None]:
rated_products=df.groupby("ProductId")["UserId"].count().sort_values(ascending=False)
print(rated_products)


Number of products with some minimum ratings

In [None]:
# ">= n" matches the "minimum of n" wording; "> n" would exclude products with exactly n ratings
print('Number of products with minimum of 5 reviews/ratings:',rated_products[rated_products>=5].count())
print('Number of products with minimum of 4 reviews/ratings:',rated_products[rated_products>=4].count())
print('Number of products with minimum of 3 reviews/ratings:',rated_products[rated_products>=3].count())
print('Number of products with minimum of 2 reviews/ratings:',rated_products[rated_products>=2].count())
print('Number of products with minimum of 1 reviews/ratings:',rated_products[rated_products>=1].count())
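
The same threshold counts can be produced in a loop, which keeps the comparison in one place. This is a sketch on a made-up Series of per-product rating counts; the name `at_least` is hypothetical.

```python
import pandas as pd

# Toy rating counts per product.
counts = pd.Series({"P1": 5, "P2": 3, "P3": 1, "P4": 2})

# Number of products with at least n ratings, for n = 5 down to 1.
at_least = {n: int((counts >= n).sum()) for n in range(5, 0, -1)}

for n, n_products in at_least.items():
    print(f"Number of products with minimum of {n} reviews/ratings: {n_products}")
```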


## Visualizing the data

Define lists index and values containing the names of metrics and their corresponding values. These metrics include the total size of records, the number of unique users, and the number of unique products in the dataset.
Create a bar chart using the go.Bar() function from Plotly. It uses the index list as the x-axis labels and the values list as the y-axis values for the bars. The textposition='auto' argument allows text labels to be placed automatically on the bars.
Display the interactive bar chart using the .show() method.

In [None]:
# plot the data
import plotly.graph_objects as go
index = ['Total size of records', "Number of unique users","Number of unique products"]
values =[len(df),len(df['UserId'].unique()),len(df['ProductId'].unique())]

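# textposition only has an effect when text is supplied, so pass the values as labels
plot = go.Figure([go.Bar(x=index, y=values, text=values, textposition='auto')])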
plot.update_layout(title_text='Number of Users and Products w.r.to Total size of Data',
                    xaxis_title="Records",
                    yaxis_title="Total number of Records")

plot.show()


### The ratings given by users

The df['Rating'].value_counts() function calculates the count of each unique rating value in the 'Rating' column. A list called values is created from these counts. The bar chart uses the unique rating values as x-axis labels and the values list as the bar heights. The textposition='auto' argument lets Plotly place text labels automatically on the bars.

In [None]:
print("Range of Ratings: ", df['Rating'].value_counts())
print(list(df['Rating'].value_counts()))

values = list(df['Rating'].value_counts())

plot = go.Figure([go.Bar(x = df['Rating'].value_counts().index, y = values, text = values, textposition='auto')])

plot.update_layout(title_text='Ratings given by user',
                    xaxis_title="Rating",
                    yaxis_title="Total number of Ratings")

plot.show()


### Products which are most popular

The five most frequently rated product IDs are selected with value_counts().nlargest(5). The bar chart uses these product IDs as x-axis labels and their rating counts, stored in the values list, as the bar heights.

In [None]:
top_products = df['ProductId'].value_counts().nlargest(5)
print("Products that occurred the most: \n",top_products)

# use the counts of the same five products so x and y line up
values = list(top_products)

plot = go.Figure([go.Bar(x = top_products.index, y = values, text = values, textposition='auto')])

plot.update_layout(title_text='Most rated products',
                    xaxis_title="ProductID",
                    yaxis_title="Number of times occurred in the data")

plot.show()


### Number of ratings given by each user


Calculate the number of ratings given by each user and store the results in the ratings_per_user Series. The sort_values(ascending=False) call sorts the Series in descending order of rating count, and the users with the highest counts are printed. A Plotly histogram then visualizes the distribution of the number of ratings per user.

In [None]:
ratings_per_user = df.groupby('UserId')['Rating'].count().sort_values(ascending=False)
print("Number of ratings given by each user: ",ratings_per_user.head())

plot = go.Figure(data=[go.Histogram(x=ratings_per_user)])
plot.show()


Create a histogram using Plotly to visualize the distribution of the number of ratings received by each product.

In [None]:
ratings_per_product = df.groupby('ProductId')['Rating'].count().sort_values(ascending=False)

plot = go.Figure(data=[go.Histogram(x=ratings_per_product)])
# layout options belong in update_layout(); show() only renders the figure
plot.update_layout(title_text='Number of ratings per product',
                    xaxis_title="Number of ratings",
                    yaxis_title="Number of products")
plot.show()


Create a histogram showing the distribution of ratings per product, focusing on the top 2000 most rated products. The Plotly plot is displayed using the .show() method.

In [None]:
ratings_per_product = df.groupby('ProductId')['Rating'].count().sort_values(ascending=False)

plot = go.Figure(data=[go.Histogram(x=ratings_per_product.nlargest(2000))])
# layout options belong in update_layout(); show() only renders the figure
plot.update_layout(title_text='Number of ratings per product (top 2000 products)',
                    xaxis_title="Number of ratings",
                    yaxis_title="Number of products")
plot.show()


### Products with very few ratings


Group the data in df by the "ProductId" column and count the ratings for each product with .count(). The result is a pandas Series, rating_of_products, whose index holds the product IDs and whose values hold the number of ratings each product has received.
The DataFrame number_of_ratings_given is created from this Series so that the counts can be analysed and categorized.
The loop then assigns each product to one of four buckets according to its rating count (up to 10, 11 to 50, 51 to 100, more than 100), and the cell prints the size of each bucket along with the mean number of ratings per product.

In [None]:

rating_of_products = df.groupby('ProductId')['Rating'].count()
# convert to make dataframe to analyse data
number_of_ratings_given = pd.DataFrame(rating_of_products)
print("Products with ratings given by users: \n",number_of_ratings_given.head())

less_than_ten = []
less_than_fifty_greater_than_ten = []
greater_than_fifty_less_than_hundred = []
greater_than_hundred = []
average_rating = []

for rating in number_of_ratings_given['Rating']:
    # the buckets are mutually exclusive, so elif avoids redundant checks
    if rating <= 10:
        less_than_ten.append(rating)
    elif rating <= 50:
        less_than_fifty_greater_than_ten.append(rating)
    elif rating <= 100:
        greater_than_fifty_less_than_hundred.append(rating)
    else:
        greater_than_hundred.append(rating)

    average_rating.append(rating)

print("Ratings_count_less_than_ten: ", len(less_than_ten))
print("Ratings_count_greater_than_ten_less_than_fifty: ", len(less_than_fifty_greater_than_ten))
print("Ratings_count_greater_than_fifty_less_than_hundred: ", len(greater_than_fifty_less_than_hundred))
print("Ratings_count_greater_than_hundred: ", len(greater_than_hundred))
# each value is the rating count of one product, so this is the mean number of ratings per product
print("Average number of ratings per product: ", np.mean(average_rating))
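
The bucketing loop can also be expressed with `pd.cut`, which assigns each count to a right-inclusive interval. The sketch below uses made-up counts and hypothetical label names; the bin edges mirror the thresholds used in the loop (10, 50, 100).

```python
import pandas as pd

# Toy rating counts per product, including values on the bucket boundaries.
counts = pd.Series([1, 10, 11, 50, 51, 101, 500])

# Intervals are (0, 10], (10, 50], (50, 100], (100, inf).
bins = [0, 10, 50, 100, float("inf")]
labels = ["<=10", "11-50", "51-100", ">100"]

buckets = pd.cut(counts, bins=bins, labels=labels)
bucket_sizes = buckets.value_counts().to_dict()

print(bucket_sizes)
```

The resulting dictionary can be passed straight to a bar chart, with its keys on the x-axis and its values on the y-axis.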



The code segment creates a bar chart that visualizes the distribution of products among different ranges of rating counts. The chart showcases how many products fall into each rating count range. Additionally, it adds an annotation to the plot and sets the plot's title and axis labels for clarity. The chart is displayed using Plotly's interactive visualization capabilities.

In [None]:
x_values = ["Ratings_count_less_than_ten","Ratings_count_greater_than_ten_less_than_fifty",
           "Ratings_count_greater_than_fifty_less_than_hundred","Ratings_count_greater_than_hundred"]
y_values = [len(less_than_ten),len(less_than_fifty_greater_than_ten),len(greater_than_fifty_less_than_hundred),
            len(greater_than_hundred)]


plot = go.Figure([go.Bar(x = x_values, y = y_values, textposition='auto')])

plot.add_annotation(
        x=1,
        y=100000,
        xref="x",
        yref="y")

plot.update_layout(title_text='Ratings Count on Products',
                    xaxis_title="Ratings Range",
                    yaxis_title="Count of Rating")
plot.show()


In [None]:
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()


### To convert alphanumeric data to numeric

A new DataFrame named dataset is created as a copy of the original DataFrame df. Then, the label encoder is applied to the 'UserId' and 'ProductId' columns to convert them into numerical values. These transformed values are stored in new columns 'user' and 'product' in the dataset.

In [None]:
# .copy() makes dataset independent of df; plain assignment would only alias it
dataset = df.copy()
dataset['user'] = label_encoder.fit_transform(df['UserId'])
dataset['product'] = label_encoder.fit_transform(df['ProductId'])
dataset.head()


The code calculates the average rating given by each user by using the groupby() function to group the data by 'user'. The mean of the 'Rating' column within each user group is computed using .mean(). The result is stored in the average_rating DataFrame, which shows the average rating given by each user. The .head() function is used to display the first few rows of this DataFrame.

The code merges the average_rating DataFrame with the dataset DataFrame using the 'user' column as the key for merging. This adds a new column 'Rating_y' to the dataset DataFrame, containing the average rating given by each user. The result is displayed using .head().

The columns are renamed to clarify their meaning. The 'Rating_x' column is renamed to 'real_rating', and the 'Rating_y' column is renamed to 'average_rating'. The result is displayed using .head().

In [None]:

# average rating given by each user
average_rating = dataset.groupby(by="user", as_index=False)['Rating'].mean()
print("Average rating given by users: \n",average_rating.head())
print("----------------------------------------------------------\n")


# let's merge it with the dataset as we will be using that later
dataset = pd.merge(dataset, average_rating, on="user")
print("Modified dataset: \n", dataset.head())
print("----------------------------------------------------------\n")

# renaming columns
dataset = dataset.rename(columns={"Rating_x": "real_rating", "Rating_y": "average_rating"})
print("Dataset: \n", dataset.head())
print("----------------------------------------------------------\n")


Certain users tend to give higher ratings while others tend to give lower ratings. To negate this bias, we normalise the ratings given by the users.

Calculate the normalized rating for each entry by subtracting the 'average_rating' from the 'real_rating'. The result is stored in a new column called 'normalized_rating', which represents how much a user's rating differs from their average rating.

In [None]:
dataset['normalized_rating'] = dataset['real_rating'] - dataset['average_rating']
print("Data with adjusted rating: \n", dataset.head())
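
The merge, rename and subtract steps above can be cross-checked with a single `groupby(...).transform("mean")`, which returns each user's mean aligned to the original rows. This is a sketch on made-up data, not a replacement for the cells above.

```python
import pandas as pd

# Toy ratings: user 0 averages 4.0, user 1 averages 3.0.
toy = pd.DataFrame({
    "user":   [0, 0, 1, 1, 1],
    "Rating": [5.0, 3.0, 2.0, 2.0, 5.0],
})

# Per-row user mean, aligned with the original index.
user_mean = toy.groupby("user")["Rating"].transform("mean")
toy["normalized_rating"] = toy["Rating"] - user_mean

print(toy)
```

After mean-centering, each user's normalized ratings sum to zero, which is a quick sanity check on the real dataset as well.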


# Cosine Similarity

We use cosine similarity, a proximity measure based on the angle between two rating vectors, to identify similar users. It is important first to remove products that have a very low number of ratings.
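
For two rating vectors u and v, cosine similarity is the dot product divided by the product of the norms, u·v / (‖u‖ ‖v‖). It ranges from -1 (opposite tastes) to 1 (same direction). The hand-written helper below is only an illustration on made-up vectors; the name `cosine_sim` is hypothetical, and returning 0.0 for an all-zero vector is a choice made for this sketch. The notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors; 0.0 if either vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(a @ b / denom)

u = [1.0, -1.0, 0.0]        # likes product 0, dislikes product 1
same = [2.0, -2.0, 0.0]     # same direction, stronger opinions
opposite = [-1.0, 1.0, 0.0]
unrelated = [0.0, 0.0, 1.0]  # only rated a product u has not rated

print(cosine_sim(u, same), cosine_sim(u, opposite), cosine_sim(u, unrelated))
```

Because only the direction matters, a user with stronger opinions but the same preferences is still treated as a close neighbour.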

## Filter based on number of ratings available

The code groups the data by 'product' and calculates the count of real ratings (not the normalized ratings) for each product using .count(). The result is stored in the rating_of_product Series, and then it's converted into a DataFrame called ratings_of_products_df.

In [None]:
rating_of_product = dataset.groupby('product')['real_rating'].count() # apply groupby 
ratings_of_products_df = pd.DataFrame(rating_of_product)
print("Real ratings:\n",ratings_of_products_df.head()) # check for real rating for products


Filter out products that have fewer than 200 real ratings by using boolean indexing. This creates a DataFrame called filtered_ratings_per_product containing only the products with 200 or more real ratings.

In [None]:
filtered_ratings_per_product = ratings_of_products_df[ratings_of_products_df.real_rating >= 200]
print(filtered_ratings_per_product.head())
print(filtered_ratings_per_product.shape)


The code extracts the index of the filtered DataFrame (filtered_ratings_per_product) and converts it into a list called popular_products. These are the encoded product IDs that have received 200 or more real ratings.
The dataset DataFrame is then filtered to keep only the rows whose product ID appears in popular_products. The result is a new DataFrame called filtered_ratings_data that contains data only for popular products.

In [None]:
# build a list of products to keep
popular_products = filtered_ratings_per_product.index.tolist()
print("Number of popular products (200 or more ratings): ",len(popular_products))
print("--------------------------------------------------------------------------------")

filtered_ratings_data = dataset[dataset["product"].isin(popular_products)]
print("Filtered rated product in the dataset: \n",filtered_ratings_data.head())
print("---------------------------------------------------------------------------------")

print("The size of dataset has changed from ", len(dataset), " to ", len(filtered_ratings_data))
print("---------------------------------------------------------------------------------")
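
The count, threshold and `isin` steps can also be collapsed into one filter by broadcasting each product's rating count back onto its rows. This is a sketch on made-up data; `min_ratings` is a hypothetical name, set to 2 here so the toy example has something to keep, where the notebook uses 200.

```python
import pandas as pd

# Toy data: product 0 has three ratings, product 1 has one, product 2 has two.
toy = pd.DataFrame({
    "product":     [0, 0, 0, 1, 2, 2],
    "real_rating": [5, 4, 3, 2, 5, 1],
})

min_ratings = 2  # the notebook uses 200 on the full dataset

# Number of ratings for the product in each row, aligned with toy's index.
rating_counts = toy.groupby("product")["real_rating"].transform("count")
filtered = toy[rating_counts >= min_ratings]

print(filtered)
```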


## Creating the User-item matrix

This pivot table will be used to calculate the similarity between users based on their normalized ratings for different products. Pivot table will have rows representing users, columns representing products, and the cell values representing the normalized ratings given by users to the respective products. This matrix will be used to calculate user-user similarity.

Fill any missing values (NaNs) in the similarity pivot table with zeros. Missing values might occur if a user hasn't rated a particular product, resulting in a NaN value in the pivot table.

In [None]:
similarity = pd.pivot_table(filtered_ratings_data,values='normalized_rating',index='UserId',columns='product')
similarity = similarity.fillna(0)
print("Updated Dataset: \n",similarity.head())


As you can see, this is a very sparse matrix: most entries are zero because each user has rated only a few of the products.
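
Sparsity can be quantified as the share of zero entries in the matrix. The sketch below uses a made-up 3x4 matrix; on the real data the same two lines could be applied to the `similarity` pivot table. Note that a zero here can also come from a rating exactly equal to the user's average, so this slightly overstates how many cells are unrated.

```python
import pandas as pd

# Toy user-item matrix of normalized ratings after fillna(0).
matrix = pd.DataFrame(
    [[0.5, 0.0,  0.0, 0.0],
     [0.0, 0.0, -1.0, 0.0],
     [0.0, 0.0,  0.0, 0.0]],
    index=["U1", "U2", "U3"],
    columns=[10, 11, 12, 13],
)

nonzero = int((matrix != 0).to_numpy().sum())
sparsity = 1 - nonzero / matrix.size  # fraction of cells that are zero

print(f"{nonzero} non-zero cells out of {matrix.size}, sparsity = {sparsity:.2%}")
```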

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import operator


A list called selecting_users is created from the index (user IDs) of the similarity pivot table and truncated to the first 100 user IDs. These are the users offered as candidates for generating recommendations.

In [None]:
selecting_users = list(similarity.index)
selecting_users = selecting_users[:100]
print("You can select users from the below list:\n",selecting_users)


The next part of the code defines two functions for user-based collaborative filtering:

getting_top_5_similar_users(user_id, similarity_table, k=5): This function calculates the top k similar users to the given user ID based on the cosine similarity of their rating vectors.

getting_top_5_recommendations_based_on_users(user_id, similar_users, similarity_table, top_recommendations=5): This function generates top recommendations for a user based on the ratings of similar users.

In [None]:
def getting_top_5_similar_users(user_id, similarity_table, k=5):
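    '''
    Find the users most similar to user_id.

    :param user_id: the user we want to recommend products to
    :param similarity_table: the user-item matrix
    :param k: number of similar users to return
    :return: list of the k user IDs most similar to user_id
    '''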

    # create a dataframe of just the current user
    user = similarity_table[similarity_table.index == user_id]
    # and a dataframe of all other users
    other_users = similarity_table[similarity_table.index != user_id]
    # calculate cosine similarity between user and each other user
    similarities = cosine_similarity(user, other_users)[0].tolist()

    indices = other_users.index.tolist()
    index_similarity = dict(zip(indices, similarities))

    # sort by similarity
    index_similarity_sorted = sorted(index_similarity.items(), key=operator.itemgetter(1))
    index_similarity_sorted.reverse()

    # take users
    top_users_similarities = index_similarity_sorted[:k]
    users = []
    for user in top_users_similarities:
        users.append(user[0])

    return users
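
To see the neighbour search on something small enough to check by hand, the sketch below reimplements the same idea with NumPy on a made-up 4-user matrix. `top_k_similar` is a hypothetical helper, not the function defined above, and it normalizes rows itself instead of calling scikit-learn.

```python
import numpy as np
import pandas as pd

def top_k_similar(user_id, matrix, k=2):
    """Return the k row labels whose vectors have the highest cosine similarity to user_id."""
    values = matrix.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1)
    norms[norms == 0] = 1.0              # all-zero rows get similarity 0 instead of NaN
    unit = values / norms[:, None]       # each row scaled to unit length
    target = unit[matrix.index.get_loc(user_id)]
    sims = pd.Series(unit @ target, index=matrix.index)
    return sims.drop(user_id).nlargest(k).index.tolist()

toy = pd.DataFrame(
    [[ 1.0, -1.0, 0.0],   # A
     [ 2.0, -2.0, 0.0],   # B: same direction as A
     [-1.0,  1.0, 0.0],   # C: opposite of A
     [ 1.0,  0.0, 0.0]],  # D: partly aligned with A
    index=["A", "B", "C", "D"],
    columns=[10, 11, 12],
)

neighbours = top_k_similar("A", toy, k=2)
print(neighbours)
```

B points in the same direction as A and D is partly aligned, so they rank ahead of C, whose ratings are the opposite of A's.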


Demonstrate the first function by calculating the top 5 similar users to a specific user ID ("A0010876CNE3ILIM9HV0") using the getting_top_5_similar_users function. It then prints out the IDs of these similar users.

In [None]:
user_id = "A0010876CNE3ILIM9HV0"
similar_users = getting_top_5_similar_users(user_id, similarity)


In [None]:
print("Top 5 similar users for user_id:",user_id," are: ",similar_users)


## Recommend products based on these top similar users

In [None]:
def getting_top_5_recommendations_based_on_users(user_id, similar_users, similarity_table, top_recommendations=5):
    '''

    :param user_id: user for whom we want to recommend
    :param similar_users: top 5 similar users
    :param similarity_table: the user-item matrix
    :param top_recommendations: no. of recommendations
    :return: top_5_recommendations
    '''

    # rows of the user-item matrix that belong to the similar users
    # (the unused lookup into the global dataset has been dropped)
    similar_users = similarity_table[similarity_table.index.isin(similar_users)]

    # mean normalized rating per product across the similar users
    similar_users = similar_users.mean(axis=0)


    similar_users_df = similar_users.to_frame(name='mean')

    # for the current user data
    user_df = similarity_table[similarity_table.index == user_id]


    # transpose it so its easier to filter
    user_df_transposed = user_df.transpose()


    # rename the column as 'rating'
    user_df_transposed.columns = ['rating']

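    # keep rows with a 0 value; 0 marks unrated products (from fillna), but it also
    # matches any product the user rated exactly at their own average
    user_df_transposed = user_df_transposed[user_df_transposed['rating'] == 0]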


    # generate a list of products the user has not used
    products_not_rated = user_df_transposed.index.tolist()
#     print("Products not used by target user: ", products_not_rated)
#     print("-------------------------------------------------------------------")

    # filter avg ratings of similar users for only products the current user has not rated
    similar_users_df_filtered = similar_users_df[similar_users_df.index.isin(products_not_rated)]

    # order the dataframe
    similar_users_df_ordered = similar_users_df_filtered.sort_values(by=['mean'], ascending=False)



    # take the top products
    top_products = similar_users_df_ordered.head(top_recommendations)
    top_products_indices = top_products.index.tolist()


    return top_products_indices



Demonstrate the second function by generating and printing the top 5 recommended product IDs for the given user ID using the getting_top_5_recommendations_based_on_users function and the list of similar users calculated earlier.

In [None]:
print("Top 5 productID recommended are: ",
      getting_top_5_recommendations_based_on_users(user_id, similar_users, similarity))


Display the shape and the first few rows of the filtered_ratings_data DataFrame, and then filter the data to show only rows where the 'UserId' is "A0010876CNE3ILIM9HV0".

In [None]:
filtered_ratings_data.shape

In [None]:
filtered_ratings_data.head()

In [None]:

filtered_ratings_data[filtered_ratings_data['UserId']=="A0010876CNE3ILIM9HV0"]


The data is split into training and testing sets using the train_test_split function from scikit-learn. The test_size parameter is set to 0.2, so 20% of the rows are held out for testing. Because no random_state is passed, the split differs from run to run. train_test_split already returns pandas DataFrames when given one, so the pd.DataFrame(...) calls below leave the data unchanged.

In [None]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(filtered_ratings_data,test_size=0.2)

train_data = pd.DataFrame(train_data)
test_data = pd.DataFrame(test_data)
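
A row-wise random split can place all of a user's ratings in the test set, in which case that user has no row in a user-item matrix built from the training data and no neighbours can be computed for them. The sketch below shows one way to list such users. It uses made-up data and, to stay self-contained, swaps in `DataFrame.sample` for the split instead of `train_test_split`.

```python
import pandas as pd

# Toy ratings: U1 and U2 have two ratings each, U3 and U4 only one.
toy = pd.DataFrame({
    "UserId":            ["U1", "U1", "U2", "U2", "U3", "U4"],
    "product":           [0, 1, 0, 2, 1, 2],
    "normalized_rating": [0.5, -0.5, 1.0, -1.0, 0.0, 0.0],
})

# Hold out 20% of the rows (one row here); random_state makes the split repeatable.
test_toy = toy.sample(frac=0.2, random_state=0)
train_toy = toy.drop(test_toy.index)

# Users that appear in the test set but have no ratings left in the training set.
unseen_users = set(test_toy["UserId"]) - set(train_toy["UserId"])

print("Test rows:", len(test_toy), "| users missing from training:", unseen_users)
```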


Creates a pivot table similarity using the training data. The pivot table represents the user-product matrix of normalized ratings. Missing values (NaNs) are filled with zeros. 

In [None]:
similarity = pd.pivot_table(train_data,values='normalized_rating',index='UserId',columns='product')
similarity = similarity.fillna(0)
print("Updated Dataset: \n",similarity.head())


In [None]:
similarity.shape


Code selects the first 100 user IDs from the similarity matrix as a list of potential users for generating recommendations.

In [None]:
selecting_users = list(similarity.index)
selecting_users = selecting_users[:100]
print("You can select users from the below list:\n",selecting_users)


Calculate the top 5 similar users for the given user ID ("A02720223TDVZSWVZYFN7") using the getting_top_5_similar_users function and print them. Then calculate the top 5 recommended product IDs for the same user using the getting_top_5_recommendations_based_on_users function.

In [None]:
user_id = "A02720223TDVZSWVZYFN7"
similar_users = getting_top_5_similar_users(user_id, similarity)


In [None]:
print("Top 5 similar users for user_id:",user_id," are: ",similar_users)


In [None]:
print("Top 5 productID recommended are: ",
      getting_top_5_recommendations_based_on_users(user_id, similar_users, similarity))


In [None]:
test_data.shape


In [None]:
len(test_data.user.unique())


In [None]:
test_data.UserId


In [None]:
test_data.head()


Define a function recommend_products_for_user that takes a user ID and a similarity matrix as inputs. It calculates the similar users using the getting_top_5_similar_users function and generates a list of recommended product IDs using the getting_top_5_recommendations_based_on_users function. The function then returns this list of recommended product IDs. Finally, the function is called with the user ID "A2XVNI270N97GL" and the similarity matrix, and the recommended products are printed.

In [None]:
def recommend_products_for_user(userId, similarity_matrix):
    # use the function's own arguments rather than the global user_id / similarity
    similar_users = getting_top_5_similar_users(userId, similarity_matrix)
#     print("Top 5 similar users for user_id:",userId," are: ",similar_users)
    product_list = getting_top_5_recommendations_based_on_users(userId, similar_users, similarity_matrix)
#     print("Top 5 productID recommended are: ", product_list)
    return product_list


In [None]:
recommend_products_for_user("A2XVNI270N97GL", similarity)


### Conclusion

Recommender systems are a powerful technology that adds to a business's value, and some businesses thrive on them. They help the business by generating more sales, and they help the end user by enabling them to find items they like.