# Chapter 5 - Collaborative Filtering (Part 2)

<div style="text-align:center;">
    <img src='images/intro.jpg' width='800'>
</div>

Collaborative filtering is the predictive process behind recommendation engines. Recommendation engines analyze information about users with similar tastes to assess the probability that a target individual will enjoy something.

Collaborative filtering uses algorithms to filter data from user reviews to make personalized recommendations for users with similar preferences. Collaborative filtering is also used to select content and advertising for individuals on social media.

Collaborative filtering filters information by using the interactions and data collected by the system from other users. For example when we want to find a new movie to watch we'll often ask our friends for recommendations.

Naturally, we have greater trust in the recommendations from friends who share tastes similar to our own. Collaborative filtering does the same job. Collaborative filtering m**ostly focuses on finding similarity between** users and recommend each other their likes. There are various ways to find the similarity meas*ure : Cosine simi*la*rity, Pearson simi*la*rity, Jaccard simi*larity etc.

In [1]:
# Importing basic libraries
import pandas as pd
import numpy as np
import random

# Importing scipy.sparse.csr_matrix for kNN data preparation
from scipy.sparse import csr_matrix

# Importing kNN algorithm
from sklearn.neighbors import NearestNeighbors

# Importing cosine_similarity to calculate cosine similarity in memory based collaborative filtering
from sklearn.metrics.pairwise import cosine_similarity

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Importing surprise.Reader,Dataset for surprise data preparation
from surprise import Reader, Dataset

# Importing for surprise model customizations
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV

In [3]:
# Importing algorithms from Surprise package
from surprise.prediction_algorithms import NMF,CoClustering,SVD

# Importing accuracy to get metrics such as RMSE and MAE
from surprise import accuracy

### About the Dataset
The following is the data dictionary for the dataset; it has nine features (columns).

    • InvoiceNo: The invoice number of a particular transaction
    • StockCode: The unique identifier for a particular item
    • Quantity: The quantity of that item bought by the customer
    • InvoiceDate: The date and time when the transaction was made
    • DeliveryDate: The date and time when the delivery happened
    • Discount%: Percentage of discount on the purchased item
    • ShipMode: Mode of shipping
    • ShippingCost: Cost of shipping that item
    • CustomerID: The unique identifier of a particular customer

In [4]:
#read csv data
df = pd.read_excel('data/Rec_sys_data.xlsx')

#view first 5 rows
df.head()

Unnamed: 0,InvoiceNo,StockCode,Quantity,InvoiceDate,DeliveryDate,Discount%,ShipMode,ShippingCost,CustomerID
0,536365,84029E,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.2,ExpressAir,30.12,17850
1,536365,71053,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.21,ExpressAir,30.12,17850
2,536365,21730,6,2010-12-01 08:26:00,2010-12-03 08:26:00,0.56,Regular Air,15.22,17850
3,536365,84406B,8,2010-12-01 08:26:00,2010-12-03 08:26:00,0.3,Regular Air,15.22,17850
4,536365,22752,2,2010-12-01 08:26:00,2010-12-04 08:26:00,0.57,Delivery Truck,5.81,17850


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 272404 entries, 0 to 272403
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   InvoiceNo     272404 non-null  int64         
 1   StockCode     272404 non-null  object        
 2   Quantity      272404 non-null  int64         
 3   InvoiceDate   272404 non-null  datetime64[ns]
 4   DeliveryDate  272404 non-null  datetime64[ns]
 5   Discount%     272404 non-null  float64       
 6   ShipMode      272404 non-null  object        
 7   ShippingCost  272404 non-null  float64       
 8   CustomerID    272404 non-null  int64         
dtypes: datetime64[ns](2), float64(2), int64(3), object(2)
memory usage: 18.7+ MB


### Data Preparation

In [6]:
# null check
df.isnull().sum().sort_values(ascending=False)

InvoiceNo       0
StockCode       0
Quantity        0
InvoiceDate     0
DeliveryDate    0
Discount%       0
ShipMode        0
ShippingCost    0
CustomerID      0
dtype: int64

In [7]:
# Drop NaN
data1 = df.dropna()

data1.describe()

Unnamed: 0,InvoiceNo,Quantity,InvoiceDate,DeliveryDate,Discount%,ShippingCost,CustomerID
count,272404.0,272404.0,272404,272404,272404.0,272404.0,272404.0
mean,553740.733319,13.579536,2011-05-16 04:33:17.259658240,2011-05-18 04:33:04.572620288,0.300092,17.053491,15284.323523
min,536365.0,1.0,2010-12-01 08:26:00,2010-12-02 08:26:00,0.0,5.81,12346.0
25%,545312.0,2.0,2011-03-01 13:51:00,2011-03-03 14:53:00,0.15,5.81,13893.0
50%,553902.0,6.0,2011-05-19 18:02:00,2011-05-22 08:52:30,0.3,15.22,15157.0
75%,562457.0,12.0,2011-08-05 11:00:00,2011-08-07 12:05:00,0.45,30.12,16788.0
max,569629.0,74215.0,2011-10-05 11:37:00,2011-10-08 11:37:00,0.6,30.12,18287.0
std,9778.082879,149.136756,,,0.176023,10.01321,1714.478624


<img src='images/tacf.png' width='900'>

# Memory-Based Approach
In Memory-Based approach, the closest user or items are calculated only by using Cosine similarity or Pearson correlation coefficients, which are only based on arithmetic operations.

A common distance metric is cosine similarity. The metric can be thought of geometrically if one treats a given user’s (item’s) row (column) of the ratings matrix as a vector. For user-based collaborative filtering, two users’ similarity is measured as the cosine of the angle between the two users’ vectors. For users u and u′, the cosine similarity is:

As no training or optimization is involved, it is an easy to use approach. But its performance decreases when we have sparse data which hinders scalability of this approach for most of the real-world problems.

Memory-Based approach is further divided into :

1. User-to-User Collaborative Filtering
2. Item-to-Item Collaborative Filtering

## User-to-User Collaborative Filtering
User-Based Collaborative Filtering is a technique used to predict the items that a user might like on the basis of ratings given to that item by the other users who have similar taste with that of the target user.

<div style="text-align:center;">
    <img src='images/User_based1.jpg' width='400'>
</div>

In User-Based Collaborative Filtering, we create a matrix that describes behaviour of all users corresponding to all the items. Further, we build relation between mutiple users to identify the similar users.

### Implementation
We are creating a data(matrix) which contains CustomerID and whether they have ever purchased a product using groupby.

In [8]:
data1.head()

Unnamed: 0,InvoiceNo,StockCode,Quantity,InvoiceDate,DeliveryDate,Discount%,ShipMode,ShippingCost,CustomerID
0,536365,84029E,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.2,ExpressAir,30.12,17850
1,536365,71053,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.21,ExpressAir,30.12,17850
2,536365,21730,6,2010-12-01 08:26:00,2010-12-03 08:26:00,0.56,Regular Air,15.22,17850
3,536365,84406B,8,2010-12-01 08:26:00,2010-12-03 08:26:00,0.3,Regular Air,15.22,17850
4,536365,22752,2,2010-12-01 08:26:00,2010-12-04 08:26:00,0.57,Delivery Truck,5.81,17850


In [9]:
purchase_df = (data1.groupby(['CustomerID', 'StockCode'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('CustomerID'))

purchase_df.head()

StockCode,10002,10080,10120,10125,10133,10135,11001,15030,15034,15036,...,90214R,90214S,90214V,90214Y,BANK CHARGES,C2,DOT,M,PADS,POST
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12346,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12347,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12348,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0
12350,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
12352,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,5.0


We need to do encoding as 1 (if purchased) or 0 (not purchased):

In [10]:
'''
Create a map
'''

def encode_units(x):
    if x < 1:    # If the quantity is less than 1
        return 0 # Not purchased
    if x >= 1:   # If the quantity is greater than 1
        return 1 # Purchased


purchase_df = purchase_df.applymap(encode_units)

purchase_df.head()

StockCode,10002,10080,10120,10125,10133,10135,11001,15030,15034,15036,...,90214R,90214S,90214V,90214Y,BANK CHARGES,C2,DOT,M,PADS,POST
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12346,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12347,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12350,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12352,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1


The purchase matrix describes the behaviour of Customers corresponding to all the items. Now, we can apply Collaborative filtering on it.

In [11]:
# Applying cosine_similarity on the purchase matrix
user_similarities = cosine_similarity(purchase_df)

# Storing the similarity scores in a dataframe, i.e., the similarity scores matrix
user_similarity_data = pd.DataFrame(user_similarities,index=purchase_df.index,columns=purchase_df.index)

user_similarity_data.head()

CustomerID,12346,12347,12348,12350,12352,12353,12354,12355,12356,12358,...,18269,18270,18272,18273,18278,18280,18281,18282,18283,18287
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12346,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.114708,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12347,0.0,1.0,0.070632,0.053567,0.048324,0.0,0.029001,0.091885,0.075845,0.0,...,0.041739,0.0,0.050669,0.0,0.036811,0.069843,0.0,0.0,0.087667,0.021253
12348,0.0,0.070632,1.0,0.051709,0.031099,0.0,0.027995,0.118262,0.146427,0.061546,...,0.0,0.0,0.024456,0.0,0.0,0.0,0.0,0.0,0.123091,0.082061
12350,0.0,0.053567,0.051709,1.0,0.035377,0.0,0.0,0.0,0.033315,0.070014,...,0.0,0.0,0.027821,0.0,0.0,0.0,0.0,0.0,0.052511,0.0
12352,0.0,0.048324,0.031099,0.035377,1.0,0.0,0.095765,0.040456,0.10018,0.084215,...,0.110264,0.065233,0.133855,0.0,0.0,0.0,0.0,0.0,0.094742,0.056143


This is how the user_similarity_data looks like. It contains the similarity score of users with 0 being the least similar while 1 being the most similar.

### Making Recommendations

In [12]:
def fetch_similar_users(user_id,k=5):
    # separating data rows for the entered user id
    user_similarity = user_similarity_data[user_similarity_data.index == user_id]
    
    # a data of all other users
    other_users_similarities = user_similarity_data[user_similarity_data.index != user_id]
    
    # calc cosine similarity between user and each other user
    similarities = cosine_similarity(user_similarity,other_users_similarities)[0].tolist()
    
    # create list of indices of these users
    user_indices = other_users_similarities.index.tolist()
    
    # create key/values pairs of user index and their similarity
    index_similarity_pair = dict(zip(user_indices, similarities))
    
    # sort by similarity
    sorted_index_similarity_pair = sorted(index_similarity_pair.items(),reverse=True)
    
    # grab k users off the top
    top_k_users_similarities = sorted_index_similarity_pair[:k]
    similar_users = [u[0] for u in top_k_users_similarities]
    
    print('The users with behaviour similar to that of user {0} are:'.format(user_id))
    return similar_users

In [13]:
similar_users = fetch_similar_users(12347)
similar_users

The users with behaviour similar to that of user 12347 are:


[18287, 18283, 18282, 18281, 18280]

Further the similar users can be stored in a list and later we can display the items purchased by the similar users as done below.

In [14]:
def fetch_similar_users4(user_id,k=5): # version without print
    
    # separating data rows for the entered user id
    user_similarity = user_similarity_data[user_similarity_data.index == user_id]
    
    # a data of all other users
    other_users_similarities = user_similarity_data[user_similarity_data.index != user_id]
    
    # calc cosine similarity between user and each other user
    similarities = cosine_similarity(user_similarity,other_users_similarities)[0].tolist()
    
    # create list of indices of these users
    user_indices = other_users_similarities.index.tolist()
    
    # create key/values pairs of user index and their similarity
    index_similarity_pair = dict(zip(user_indices, similarities))
    
    # sort by similarity
    sorted_index_similarity_pair = sorted(index_similarity_pair.items(),reverse=True)
    
    # grab k users off the top
    top_k_users_similarities = sorted_index_similarity_pair[:k]
    similar_users = [u[0] for u in top_k_users_similarities]
    
    return similar_users


def simular_users_recommendation(userid):
    
    similar_users = fetch_similar_users4(userid)

    #obtaining all the items bought by similar users
    simular_users_recommendation_list = []
    for j in similar_users:
        item_list = data1[data1["CustomerID"]==j]['StockCode'].to_list()
        simular_users_recommendation_list.append(item_list)
    
    #this gives us multi-dimensional list
    # we need to flatten it
    flat_list = []
    for sublist in simular_users_recommendation_list:
        for item in sublist:
            flat_list.append(item)
    final_recommendations_list = list(dict.fromkeys(flat_list))
    
    # storing 10 random recommendations in a list
    ten_random_recommendations = random.sample(final_recommendations_list, 10)
    
    print('Items bought by Similar users based on Cosine Similarity')
    
    #returning 10 random recommendations
    return ten_random_recommendations

In [15]:
simular_users_recommendation(12347)

Items bought by Similar users based on Cosine Similarity


[23296, 22367, 20963, 22725, 20712, 21975, 84755, 22583, 23206, 22975]

# Item-to-Item Collaborative Filtering
An item-to-item filtering process uses a matrix to determine the likeness of pairs of items. Item-to-item processes then compare the current user’s preference to the items in the matrix for similarities upon which to base recommendations.

#### Implementation
We are creating a data(matrix) which contains item names and whether they have been ever purchased by a customer using groupby.

In [16]:
data1.head()

Unnamed: 0,InvoiceNo,StockCode,Quantity,InvoiceDate,DeliveryDate,Discount%,ShipMode,ShippingCost,CustomerID
0,536365,84029E,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.2,ExpressAir,30.12,17850
1,536365,71053,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.21,ExpressAir,30.12,17850
2,536365,21730,6,2010-12-01 08:26:00,2010-12-03 08:26:00,0.56,Regular Air,15.22,17850
3,536365,84406B,8,2010-12-01 08:26:00,2010-12-03 08:26:00,0.3,Regular Air,15.22,17850
4,536365,22752,2,2010-12-01 08:26:00,2010-12-04 08:26:00,0.57,Delivery Truck,5.81,17850


In [17]:
items_purchase_df = (data1.groupby(['StockCode','CustomerID'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('StockCode'))

items_purchase_df.head()

CustomerID,12346,12347,12348,12350,12352,12353,12354,12355,12356,12358,...,18269,18270,18272,18273,18278,18280,18281,18282,18283,18287
StockCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10080,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10120,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10133,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Thus we need to do encoding as 1 (if purchased) or 0 (not purchased):

In [18]:
items_purchase_df = items_purchase_df.applymap(encode_units)

The item_purchase matrix describes if the item was purchased by particular customer or not. We can now apply Collaborative filtering on it.

In [19]:
# Applying Cosine similarity on the items
item_similarities = cosine_similarity(items_purchase_df)

# Storing the similarity scores in a dataframe
item_similarity_data = pd.DataFrame(item_similarities,index=items_purchase_df.index,columns=items_purchase_df.index)

item_similarity_data.head()

StockCode,10002,10080,10120,10125,10133,10135,11001,15030,15034,15036,...,90214R,90214S,90214V,90214Y,BANK CHARGES,C2,DOT,M,PADS,POST
StockCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10002,1.0,0.0,0.108821,0.094281,0.062932,0.091902,0.110096,0.059761,0.083771,0.096449,...,0.0,0.0,0.0,0.0,0.0,0.032275,0.0,0.079333,0.0,0.066986
10080,0.0,1.0,0.0,0.043033,0.028724,0.067116,0.0,0.0,0.076472,0.044023,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10120,0.108821,0.0,1.0,0.068399,0.068483,0.026669,0.079872,0.086711,0.121547,0.034986,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076739,0.0,0.013885
10125,0.094281,0.043033,0.068399,1.0,0.044499,0.051988,0.0519,0.0,0.03949,0.0341,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.074796,0.0,0.063155
10133,0.062932,0.028724,0.068483,0.044499,1.0,0.266043,0.051964,0.075218,0.079078,0.05311,...,0.0,0.0,0.0,0.0,0.0,0.040622,0.0,0.066567,0.049752,0.024089


### Making Recommendations

In [20]:
def fetch_similar_items(item_id, k=10):
    
    if item_id not in item_similarity_data.index:
        raise ValueError(f"Item ID {item_id} not found in the item similarity data.")
    
    # separating data rows of the selected item
    item_similarity = item_similarity_data.loc[[item_id]]
    
    # a data of all other items
    other_items_similarities = item_similarity_data.drop(item_id)
    
    if other_items_similarities.empty:
        raise ValueError(f"No other items found to compare with item ID {item_id}.")
    
    # calculate cosine similarity between selected item with other items
    similarities = cosine_similarity(item_similarity, other_items_similarities)[0].tolist()
    
    # create list of indices of these items
    item_indices = other_items_similarities.index.tolist()
    
    # create key/values pairs of item index and their similarity
    index_similarity_pair = dict(zip(item_indices, similarities))
    
    # sort by similarity
    sorted_index_similarity_pair = sorted(index_similarity_pair.items(), key=lambda x: x[1], reverse=True)
    
    # grab k items from the top
    top_k_item_similarities = sorted_index_similarity_pair[:k]
    similar_items = [u[0] for u in top_k_item_similarities]
    
    print('Similar items based on purchase behaviour (item-to-item collaborative filtering)')
    return similar_items

In [21]:
similar_items = fetch_similar_items(10002)
similar_items

Similar items based on purchase behaviour (item-to-item collaborative filtering)


[23310, 22821, 23076, 22549, 23078, 22435, 21889, 22614, 22560, 22181]

Below there is the same function with some improvements.

In [22]:
# Alternative advanced version

def fetch_similar_items2(item_id, k=10):
    if item_id not in item_similarity_data.index:
        raise ValueError(f"Item ID {item_id} not found in the item similarity data.")
    
    # separating data rows of the selected item
    item_similarity = item_similarity_data.loc[[item_id]]
    
    # a data of all other items
    other_items_similarities = item_similarity_data.drop(item_id)
    
    if other_items_similarities.empty:
        raise ValueError(f"No other items found to compare with item ID {item_id}.")
    
    # calculate cosine similarity between selected item with other items
    similarities = cosine_similarity(item_similarity, other_items_similarities)[0].tolist()
    
    # create list of indices of these items
    item_indices = other_items_similarities.index.tolist()
    
    # create key/values pairs of item index and their similarity
    index_similarity_pair = dict(zip(item_indices, similarities))
    
    # sort by similarity
    sorted_index_similarity_pair = sorted(index_similarity_pair.items(), key=lambda x: x[1], reverse=True)
    
    # grab k items from the top
    top_k_item_similarities = sorted_index_similarity_pair[:k]
    similar_items = [u[0] for u in top_k_item_similarities]
    
    print('Similar items based on purchase behaviour (item-to-item collaborative filtering)')
    return similar_items

Further the similar items can be stored in a list and later we can display the similar items purchased by the our selected user as below

In [23]:
def fetch_similar_items3(item_id, k=10):  # version without print
    
    if item_id not in item_similarity_data.index:
        raise ValueError(f"Item ID {item_id} not found in the item similarity data.")
    
    # separating data rows of the selected item
    item_similarity = item_similarity_data.loc[[item_id]]
    
    # a data of all other items
    other_items_similarities = item_similarity_data.drop(item_id)
    
    if other_items_similarities.empty:
        raise ValueError(f"No other items found to compare with item ID {item_id}.")
    
    # calculate cosine similarity between selected item with other items
    similarities = cosine_similarity(item_similarity, other_items_similarities)[0].tolist()
    
    # create list of indices of these items
    item_indices = other_items_similarities.index.tolist()
    
    # create key/values pairs of item index and their similarity
    index_similarity_pair = dict(zip(item_indices, similarities))
    
    # sort by similarity
    sorted_index_similarity_pair = sorted(index_similarity_pair.items(), key=lambda x: x[1], reverse=True)
    
    # grab k items from the top
    top_k_item_similarities = sorted_index_similarity_pair[:k]
    similar_items = [u[0] for u in top_k_item_similarities]
    
    return similar_items


def simular_item_recommendation(userid):
    
    simular_items_recommendation_list = []
    
    #obtaining all the similar items to items bought by user
    item_list = data1[data1["CustomerID"]==userid]['StockCode'].to_list()
    for item in item_list:
        similar_items = fetch_similar_items3(item)
        simular_items_recommendation_list.append(item_list)
    
    #this gives us multi-dimensional list
    # we need to flatten it
    flat_list = []
    for sublist in simular_items_recommendation_list:
        for item in sublist:
            flat_list.append(item)
    final_recommendations_list = list(dict.fromkeys(flat_list))
    
    # storing 10 random recommendations in a list
    ten_random_recommendations = random.sample(final_recommendations_list, 10)
    
    print('Similar Items bought by our users based on Cosine Similarity')
    
    #returning 10 random recommendations
    return ten_random_recommendations

In [24]:
simular_item_recommendation(12347)

Similar Items bought by our users based on Cosine Similarity


[22821, 23316, 22374, 21064, 22727, 71477, 23076, 21975, 21791, 22212]

# Model-Based Approach

In this approach, Collaborative Filtering models are created using machine learning algorithms to predict if the user is likely to purchase an item or not based on their past behaviour.

The possible approaches can be:

1. KNN : Collaborative Filtering Using k-Nearest Neighbors (kNN) kNN is a machine learning algorithm to find clusters of similar users based on past behaviour, and make predictions using the average of top-k nearest neighbors.

2. Matrix Factorization (MF): The idea behind such models is that attitudes or preferences of a user can be determined by a small number of hidden factors. We can call these factors as Embeddings.

## Collaborative Filtering using k-Nearest Neighbors

<div style="text-align:center;">
    <img src='images/knn.jpg' width='600'>
</div>

For passing our sparse matrix into KNN we need to convert it into CSR

CSR divides a sparse matrix into 3 arrays : values, extent of rows, index of columns

### Model building

In [25]:
purchase_matrix = csr_matrix(purchase_df.values)

# Creating KNN Model with metric parameter as euclidean distance
knn_model = NearestNeighbors(metric = 'euclidean', algorithm = 'brute')

# Fitting the model on purchase_matrix
knn_model.fit(purchase_matrix)

#### Finding similar users

In [26]:
def fetch_similar_users_knn(purchase_df,query_index):
    
    # Creating empty list where we will store user id of similar users
    simular_users_knn = []
    
    # Storing the distance and index of nearest neighors
    distances, indices = knn_model.kneighbors(purchase_df.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 5)
    for i in range(0, len(distances.flatten())):
        if i == 0:
            print('Recommendations for {0}:\n'.format(purchase_df.index[query_index]))
        else:
            print('{0}: {1}, with distance of {2}:'.format(i, purchase_df.index[indices.flatten()[i]], distances.flatten()[i]))
            
            simular_users_knn.append(purchase_df.index[indices.flatten()[i]])  
    return simular_users_knn

In [27]:
simular_users_knn = fetch_similar_users_knn(purchase_df,1497)
simular_users_knn

Recommendations for 14729:

1: 16917, with distance of 8.12403840463596:
2: 16989, with distance of 8.12403840463596:
3: 15124, with distance of 8.12403840463596:
4: 12897, with distance of 8.246211251235321:


[16917, 16989, 15124, 12897]

### Making Recommendations

In [28]:
def knn_recommendation(simular_users_knn):
    
    #obtaining all the items bought by similar users
    knn_recommnedations = []
    for j in simular_users_knn:
        item_list = data1[data1["CustomerID"]==j]['StockCode'].to_list()
        knn_recommnedations.append(item_list)
    
    #this gives us multi-dimensional list
    # we need to flatten it
    flat_list = []
    for sublist in knn_recommnedations:
        for item in sublist:
            flat_list.append(item)
    final_recommendations_list = list(dict.fromkeys(flat_list))
    
    # storing 10 random recommendations in a list
    ten_random_recommendations = random.sample(final_recommendations_list, 10)
    
    print('Items bought by Similar users based on KNN')
    
    #returning 10 random recommendations
    return ten_random_recommendations

In [29]:
knn_recommendation(simular_users_knn)

Items bought by Similar users based on KNN


[22921, 22917, 22487, 23188, 84978, 22961, 22920, 22501, 22926, 22957]

## Collaborative Filtering using Matrix Factorization

<div style="text-align:center;">
    <img src='images/mf.png' width='600'>
</div>

For Matrix Factorization, we are using the Surprise Package.

Surprise package has been specially developed to make recommendation based on collaborative filtering easy. It has default implementation for a variety of Collaborative Filtering algorithms such as NMF, kNN, Co-Clustering, SVD.

In [30]:
items_purchase_df.head()

CustomerID,12346,12347,12348,12350,12352,12353,12354,12355,12356,12358,...,18269,18270,18272,18273,18278,18280,18281,18282,18283,18287
StockCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10002,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10080,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10120,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10125,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10133,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
data3 = items_purchase_df.stack().to_frame()

#Renaming the column as Quantity
data3 = data3.reset_index().rename(columns={0:"Quantity"})

data3.head()

Unnamed: 0,StockCode,CustomerID,Quantity
0,10002,12346,0
1,10002,12347,0
2,10002,12348,0
3,10002,12350,0
4,10002,12352,0


In [32]:
print(items_purchase_df.shape)
print(data3.shape)

(3538, 3647)
(12903086, 3)


This size is too big to pass into an algorithm so we need to reduce the size of dataset by shortlisting.

#### Shortlisting customers & items based on no. of orders

In [33]:
# Storing all customer ids in customers
customer_ids = data1['CustomerID']

# Storing all item descriptions in items
item_ids = data1['StockCode']

In [34]:
from collections import Counter

# counting no. of orders made by each customer
count_orders = Counter(customer_ids)

# storing the count and customer id in a dataframe
customer_count_df = pd.DataFrame.from_dict(count_orders, orient='index').reset_index().rename(columns={0:"Quantity"})

# dropping all customer ids with less than 120 orders
customer_count_df = customer_count_df[customer_count_df["Quantity"]>120]

# renaming the index column as CustomerID for inner join
customer_count_df.rename(columns={'index':'CustomerID'},inplace=True)

customer_count_df.head()

Unnamed: 0,CustomerID,Quantity
0,17850,297
1,13047,140
2,12583,182
6,14688,265
8,15311,1892


In [35]:
# counting no. of times an item was ordered
count_items = Counter(item_ids)

# storing the count and item description in a dataframe
item_count_df = pd.DataFrame.from_dict(count_items, orient='index').reset_index().rename(columns={0:"Quantity"})

# dropping all items which were ordered less than 120 times
item_count_df = item_count_df[item_count_df["Quantity"]>120]

# renaming the index column as Description for inner join
item_count_df.rename(columns={'index':'StockCode'},inplace=True)

item_count_df.head()

Unnamed: 0,StockCode,Quantity
0,84029E,161
1,71053,220
3,84406B,213
4,22752,229
5,85123A,1606


In [36]:
data4 = pd.merge(data3, item_count_df, on='StockCode', how='inner')
data4 = pd.merge(data4, customer_count_df, on='CustomerID', how='inner')

data4.head()

Unnamed: 0,StockCode,CustomerID,Quantity_x,Quantity_y,Quantity
0,10133,12347,0,124,124
1,15036,12347,0,278,124
2,17003,12347,0,138,124
3,20675,12347,0,188,124
4,20676,12347,0,242,124


In [37]:
# dropping columns which are not necessary
data4.drop(columns=['Quantity_x', 'Quantity_y'], inplace=True)

data4.head()

Unnamed: 0,StockCode,CustomerID,Quantity
0,10133,12347,124
1,15036,12347,124
2,17003,12347,124
3,20675,12347,124
4,20676,12347,124


In [38]:
data4.describe()

Unnamed: 0,CustomerID,Quantity
count,385672.0,385672.0
mean,15360.985915,279.089789
std,1719.468125,337.879413
min,12347.0,121.0
25%,13996.25,151.0
50%,15413.0,198.0
75%,16840.0,290.0
max,18283.0,5095.0


This format is exactly what is suitable to be passed into surprise library.

In [39]:
# reading the data in a format supported by surprise library.
reader = Reader(rating_scale=(0,5095))  # the range has been set as 0,5095 as the maximum value of quantity is 5095.

# loading Dataset in a format supported by surprise library.
formated_data = Dataset.load_from_df(data4, reader)

# performing train test split on the dataset
train_set, test_set = train_test_split(formated_data, test_size= 0.2)

### Implementing NMF

In [40]:
# defining the model
algo1 = NMF()

# model fitting
algo1.fit(train_set)

# model prediction
pred1 = algo1.test(test_set)

accuracy.rmse(pred1)
accuracy.mae(pred1)

RMSE: 427.5187
MAE:  272.4638


272.46380372857476

In [41]:
cross_validate(algo1, formated_data, verbose=True)

Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    425.6117430.6909429.4152428.6779426.4138428.16191.8864  
MAE (testset)     272.0039273.2451272.5994272.8597272.5739272.65640.4059  
Fit time          3.73    3.29    3.25    3.76    3.08    3.42    0.27    
Test time         0.30    0.26    0.25    0.38    0.15    0.27    0.07    


{'test_rmse': array([425.61165833, 430.6909144 , 429.41524354, 428.67790941,
        426.41376043]),
 'test_mae': array([272.00390795, 273.24514797, 272.59938835, 272.85969226,
        272.57394354]),
 'fit_time': (3.728477954864502,
  3.2921035289764404,
  3.253654718399048,
  3.761504650115967,
  3.0797278881073),
 'test_time': (0.2990443706512451,
  0.2636291980743408,
  0.24800992012023926,
  0.3765370845794678,
  0.15335893630981445)}

### Implementing Co-Clustering

Co-clustering (also known as bi-clustering) is commonly used in collaborative filtering. It is a data-mining technique that simultaneously clusters the columns and rows of a DataFrame/matrix. It differs from normal clustering, where each object is checked for similarity with other objects based on a single entity/type of comparison. As in coclustering, you check for co-grouping of two different entities/types of comparison for each object simultaneously as a pairwise interaction.

In [42]:
# defining the model
algo2 = CoClustering()

# model fitting
algo2.fit(train_set)

# model prediction
pred2 = algo2.test(test_set)

accuracy.rmse(pred2)
accuracy.mae(pred2)

RMSE: 7.0523
MAE:  5.5102


5.510182149492576

The RMSE and MAE are very low for this model. Until now, this has performed the best (better than NMF).

In [43]:
cross_validate(algo2, formated_data, verbose=True)

Evaluating RMSE, MAE of algorithm CoClustering on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    7.0557  7.2152  7.0835  6.9400  7.0498  7.0688  0.0881  
MAE (testset)     5.6282  5.8697  5.7445  5.5559  5.6999  5.6996  0.1065  
Fit time          3.31    3.51    3.48    3.38    3.76    3.49    0.15    
Test time         0.29    0.16    0.28    0.29    0.28    0.26    0.05    


{'test_rmse': array([7.05569533, 7.21522948, 7.08349467, 6.94004721, 7.04977556]),
 'test_mae': array([5.62821377, 5.86968575, 5.74448441, 5.55588044, 5.69994357]),
 'fit_time': (3.3091537952423096,
  3.5131096839904785,
  3.482100009918213,
  3.376316785812378,
  3.7582485675811768),
 'test_time': (0.2895350456237793,
  0.157606840133667,
  0.2782618999481201,
  0.2930905818939209,
  0.2833852767944336)}

### Implementing SVD

In [44]:
# defining the model
algo3 = SVD()

# model fitting
algo3.fit(train_set)

# model prediction
pred3 = algo3.test(test_set)

accuracy.rmse(pred3)
accuracy.mae(pred3)

RMSE: 4827.9312
MAE:  4816.1467


4816.146729759513

The RMSE and MAE are significantly high for this model. Until now, this has performed the worst (worse than NMF and co-clustering).

In [45]:
cross_validate(algo3, formated_data, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    4829.16194826.99934825.95104828.46534828.16314827.74811.1381  
MAE (testset)     4818.28014814.81484813.44914817.08824815.91884815.91021.6889  
Fit time          2.26    2.32    2.08    2.21    2.19    2.21    0.08    
Test time         0.35    0.34    0.18    0.34    0.34    0.31    0.07    


{'test_rmse': array([4829.16194493, 4826.99926917, 4825.95100539, 4828.46532304,
        4828.16313124]),
 'test_mae': array([4818.2801452 , 4814.81476632, 4813.4490886 , 4817.08821013,
        4815.91882957]),
 'fit_time': (2.2604963779449463,
  2.315991163253784,
  2.0812432765960693,
  2.2061409950256348,
  2.1882290840148926),
 'test_time': (0.3495798110961914,
  0.33536481857299805,
  0.17518353462219238,
  0.3387141227722168,
  0.33811283111572266)}

## Testing the models

In [46]:
#taking item 47590B and customer 15738 for testing
data1[(data1['StockCode']=='47590B')&(data1['CustomerID']==15738)].Quantity.sum()

78

In [47]:
# Predicted value given out by model is 3.10 while actual was 78
algo1.test([['47590B',15738,78]])

[Prediction(uid='47590B', iid=15738, r_ui=78, est=3.078870063800001, details={'was_impossible': False})]

In [48]:
# Predicted value given out by model is 142.82 while actual was 78
algo2.test([['47590B',15738,78]])

[Prediction(uid='47590B', iid=15738, r_ui=78, est=127.35908463431406, details={'was_impossible': False})]

In [49]:
# Predicted value given out by model is 5095 while actual was 78
algo3.test([['47590B',15738,78]])

[Prediction(uid='47590B', iid=15738, r_ui=78, est=5095, details={'was_impossible': False})]

#### Giving out predictions

In [50]:
# Predictions given out by Co-Clustering
#pred2

#### Best and Worst Predictions made by Co-Clustering

In [51]:
def get_item_orders(user_id):
    try:
        # for an item, return the no. of orders made
        return len(train_set.ur[train_set.to_inner_uid(user_id)])
    except ValueError:
        # user not present in training
        return 0
    
def get_customer_orders(item_id):
    try:
        # for an customer, return the no. of orders made
        return len(train_set.ir[train_set.to_inner_iid(item_id)])
    except ValueError:
        # item not present in training
        return 0

In [52]:
predictions_data = pd.DataFrame(pred2, columns=['item_id', 'customer_id', 'quantity', 'prediction', 'details'])
predictions_data.head()

Unnamed: 0,item_id,customer_id,quantity,prediction,details
0,23173,14667,455.0,445.693068,{'was_impossible': False}
1,22189,14221,171.0,178.355448,{'was_impossible': False}
2,22649,16186,188.0,170.171427,{'was_impossible': False}
3,22867,15572,132.0,133.569485,{'was_impossible': False}
4,22456,15021,365.0,362.767029,{'was_impossible': False}


In [53]:
predictions_data['item_orders'] = predictions_data.item_id.apply(get_item_orders)
predictions_data['customer_orders'] = predictions_data.customer_id.apply(get_customer_orders)
predictions_data['error'] = abs(predictions_data.prediction - predictions_data.quantity)

best_predictions = predictions_data.sort_values(by='error')[:10]
worst_predictions = predictions_data.sort_values(by='error')[-10:]

predictions_data.head()

Unnamed: 0,item_id,customer_id,quantity,prediction,details,item_orders,customer_orders,error
0,23173,14667,455.0,445.693068,{'was_impossible': False},443,530,9.306932
1,22189,14221,171.0,178.355448,{'was_impossible': False},458,541,7.355448
2,22649,16186,188.0,170.171427,{'was_impossible': False},462,541,17.828573
3,22867,15572,132.0,133.569485,{'was_impossible': False},451,540,1.569485
4,22456,15021,365.0,362.767029,{'was_impossible': False},464,539,2.232971


In [54]:
best_predictions.head()

Unnamed: 0,item_id,customer_id,quantity,prediction,details,item_orders,customer_orders,error
65392,22178,17841,5095.0,5095.0,{'was_impossible': False},434,552,0.0
49565,22385,17841,5095.0,5095.0,{'was_impossible': False},455,552,0.0
77022,22892,17841,5095.0,5095.0,{'was_impossible': False},450,552,0.0
33219,21500,17841,5095.0,5095.0,{'was_impossible': False},473,552,0.0
1259,22661,17841,5095.0,5095.0,{'was_impossible': False},454,552,0.0


In [55]:
worst_predictions.head()

Unnamed: 0,item_id,customer_id,quantity,prediction,details,item_orders,customer_orders,error
1533,21718,15218,145.0,118.780128,{'was_impossible': False},451,541,26.219872
54054,21718,16409,179.0,152.780128,{'was_impossible': False},451,536,26.219872
7640,21718,16283,149.0,122.780128,{'was_impossible': False},451,546,26.219872
691,21718,14446,183.0,156.780128,{'was_impossible': False},451,543,26.219872
61189,21718,16348,197.0,170.780128,{'was_impossible': False},451,547,26.219872


In [56]:
# Getting item list for user 12347
item_list = predictions_data[predictions_data['customer_id']==12347]['item_id'].values.tolist()

#item_list

In [57]:
# Getting list of unique customers who also bught same items (item_list)
customer_list = predictions_data[predictions_data['item_id'].isin(item_list)]['customer_id'].values
customer_list = np.unique(customer_list).tolist()

#customer_list

In [58]:
# filtering those customers from predictions data
filtered_data = predictions_data[predictions_data['customer_id'].isin(customer_list)]

# removing the items already bought
filtered_data = filtered_data[~filtered_data['item_id'].isin(item_list)]

# getting the top items (prediction)
recommended_items = filtered_data.sort_values('prediction',ascending=False).reset_index(drop=True).head(10)['item_id'].values.tolist()

recommended_items

[22661, 21500, 22178, 22385, 21035, 22697, '85099B', 21930, 23202, 23203]