# Chapter 4 - Collaborative Filtering (Part 1)

Collaborative filtering is a very popular method in recommendation engines. It is the predictive process behind the suggestions provided by these systems. It processes and analyzes customers’ information and suggests items they will likely appreciate.

Collaborative filtering algorithms use a customer’s purchase history and ratings to find similar customers and then suggest items that they liked.

<div style="text-align:center;">
    <img src='images/colf.png' width='800'>
</div>

For example, to find a new movie or show to watch, you can ask your friends for suggestions since you all share similar tastes in content. The same concept is used in collaborative filtering, where user-user similarity finds similar users to get recommendations based on each other’s likes.
    
There are two types of collaborative filtering methods—user-to-user and item-to-item. They are explored in the upcoming sections. This chapter looks at the implementation of these two methods using **Cosine Similarity** before diving into implementing the more popularly used **KNN-based algorithm** for collaborative filtering.

<div style="text-align:center;">
    <img src='images/uibc.jpg' width='800'>
</div>

In [1]:
#Importing the libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import random

import warnings
warnings.filterwarnings("ignore")

### About the Dataset
The 8 columns or features are as

    • InvoiceNo: The invoice number of a particular transaction
    • StockCode: The unique identifier for a particular item
    • Descripion : The description of particular item
    • Quantity: The quantity of that item bought by the customer
    • InvoiceDate: The date and time when the transaction was made
    • UnitPrice : The price of 1 unit of particular item
    • CustomerID : The unique id of customer who bought the item
    • Country : The country or region of the customere customere customer

In [2]:
#read csv data
df = pd.read_csv('data/data.csv',encoding= 'unicode_escape')

#view first 5 rows
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


### Data Preparation

In [4]:
# null check
df.isnull().sum().sort_values(ascending=False)

CustomerID     135080
Description      1454
InvoiceNo           0
StockCode           0
Quantity            0
InvoiceDate         0
UnitPrice           0
Country             0
dtype: int64

In [5]:
# Drop NaN
df_new = df.dropna()

df_new.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,406829.0,406829.0,406829.0
mean,12.061303,3.460471,15287.69057
std,248.69337,69.315162,1713.600303
min,-80995.0,0.0,12346.0
25%,2.0,1.25,13953.0
50%,5.0,1.95,15152.0
75%,12.0,3.75,16791.0
max,80995.0,38970.0,18287.0


In [6]:
df_new = df_new[df_new.Quantity > 0]

df_new.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,397924.0,397924.0,397924.0
mean,13.021823,3.116174,15294.315171
std,180.42021,22.096788,1713.169877
min,1.0,0.0,12346.0
25%,2.0,1.25,13969.0
50%,6.0,1.95,15159.0
75%,12.0,3.75,16795.0
max,80995.0,8142.75,18287.0


# User-to-User Collaborative Filtering using Cosine Similiarity

In [7]:
# We are creating a df which contains CustomerID and whether they have ever purchased a product using groupby 

purchase = (df_new.groupby(['CustomerID', 'Description'])['Quantity'].sum().unstack().reset_index().
            fillna(0).set_index('CustomerID'))

purchase.head(10)

Description,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,I LOVE LONDON MINI RUCKSACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,...,ZINC STAR T-LIGHT HOLDER,ZINC SWEETHEART SOAP DISH,ZINC SWEETHEART WIRE LETTER RACK,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC TOP 2 DOOR WOODEN SHELF,ZINC WILLIE WINKIE CANDLE STICK,ZINC WIRE KITCHEN ORGANISER,ZINC WIRE SWEETHEART LETTER TRAY
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12346.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12347.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12348.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12349.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12350.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12352.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12353.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12354.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12355.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12356.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
# We are getting the quantity ordered (example: 48,24,126) while we just want to know if that particular item 
# is purchased or not thus we are encoding units as 1 (if purchased) or 0 (not purchased)

def encode_units(x):
    if x < 1:
        return 0
    if x >= 1:
        return 1

purchase = purchase.applymap(encode_units)

In [9]:
purchase.head(10)

Description,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,I LOVE LONDON MINI RUCKSACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,...,ZINC STAR T-LIGHT HOLDER,ZINC SWEETHEART SOAP DISH,ZINC SWEETHEART WIRE LETTER RACK,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC TOP 2 DOOR WOODEN SHELF,ZINC WILLIE WINKIE CANDLE STICK,ZINC WIRE KITCHEN ORGANISER,ZINC WIRE SWEETHEART LETTER TRAY
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12346.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12347.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12349.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12350.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12352.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12353.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12354.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12355.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12356.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Cosine Similarity

In [10]:
from sklearn.metrics.pairwise import cosine_similarity

user_similarity = cosine_similarity(purchase)

user_similarity_df = pd.DataFrame(user_similarity,index=purchase.index,columns=purchase.index)

user_similarity_df

CustomerID,12346.0,12347.0,12348.0,12349.0,12350.0,12352.0,12353.0,12354.0,12355.0,12356.0,...,18273.0,18274.0,18276.0,18277.0,18278.0,18280.0,18281.0,18282.0,18283.0,18287.0
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12346.0,1.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
12347.0,0.0,1.000000,0.063022,0.046130,0.047795,0.038484,0.0,0.012938,0.136641,0.094742,...,0.0,0.029709,0.052668,0.000000,0.032844,0.062318,0.000000,0.113776,0.101565,0.012828
12348.0,0.0,0.063022,1.000000,0.024953,0.051709,0.027756,0.0,0.027995,0.118262,0.146427,...,0.0,0.064282,0.113961,0.000000,0.000000,0.000000,0.000000,0.000000,0.168053,0.083269
12349.0,0.0,0.046130,0.024953,1.000000,0.056773,0.121900,0.0,0.030737,0.032461,0.144692,...,0.0,0.105868,0.000000,0.000000,0.039014,0.000000,0.000000,0.067574,0.113547,0.015237
12350.0,0.0,0.047795,0.051709,0.056773,1.000000,0.031575,0.0,0.000000,0.000000,0.033315,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.044118,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18280.0,0.0,0.062318,0.000000,0.000000,0.000000,0.000000,0.0,0.041523,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.105409,1.000000,0.119523,0.000000,0.000000,0.000000
18281.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.049629,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.119523,1.000000,0.000000,0.045835,0.000000
18282.0,0.0,0.113776,0.000000,0.067574,0.000000,0.037582,0.0,0.000000,0.160128,0.079305,...,0.0,0.174078,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.017504,0.000000
18283.0,0.0,0.101565,0.168053,0.113547,0.044118,0.078939,0.0,0.111463,0.033634,0.091616,...,0.0,0.036564,0.016205,0.042875,0.000000,0.000000,0.045835,0.017504,1.000000,0.094726


In [11]:
def similar_users(user_id,k=5):
    
    # separating df rows for the entered user id
    user = user_similarity_df[user_similarity_df.index == user_id]
    
    # a df of all other users
    other_users = user_similarity_df[user_similarity_df.index != user_id]
    
    # calc cosine similarity between user and each other user
    similarities = cosine_similarity(user,other_users)[0].tolist()
    
    # create list of indices of these users
    indices = other_users.index.tolist()
    
    # create key/values pairs of user index and their similarity
    index_similarity = dict(zip(indices, similarities))
    
    # sort by similarity
    index_similarity_sorted = sorted(index_similarity.items(),reverse=True)
    
    # grab k users off the top
    top_users_similarities = index_similarity_sorted[:k]
    users = [u[0] for u in top_users_similarities]
    
    return users

In [12]:
# Further the similar users can be stored in a list and later we can display the items purchased by the similar users 

n = 12347

simu = similar_users(n)  # simu will store the top 5 similar users to 12347

print('The users with behavior similar to that of user',n,'are:')
simu

The users with behavior similar to that of user 12347 are:


[18287.0, 18283.0, 18282.0, 18281.0, 18280.0]

In [13]:
'''
This function gets the similar users for the given customer (ID) and obtains a list of all the items 
bought by these similar users. This list is then flattened to get a final list of unique items, 
from which shows randomly chosen ten recommended items for a given user.
'''

def simu_recommendation(userid):
    
    simu = similar_users(userid)

    #obtaining all the items bought by similar users
    simu_rec = []
    for j in simu:
        desc = df_new[df_new["CustomerID"]==j]['Description'].to_list()
        simu_rec.append(desc)
    
    #this gives us multi-dimensional list
    # we need to flatten it
    flat_list = []
    for sublist in simu_rec:
        for item in sublist:
            flat_list.append(item)
    final_list = list(dict.fromkeys(flat_list))
    
    # storing 10 random recommendations in a list
    ten_recs = random.sample(final_list, 10)
    
    print('Items bought by Similar users based on Cosine Similarity')
    
    #returning 10 random recommendations
    return ten_recs

In [14]:
simu_recommendation(12347)

Items bought by Similar users based on Cosine Similarity


['PAINTED METAL STAR WITH HOLLY BELLS',
 'LARGE PURPLE BABUSHKA NOTEBOOK  ',
 'OVEN MITT APPLES DESIGN',
 'BLUE 3 PIECE POLKADOT CUTLERY SET',
 'RIBBON REEL CHRISTMAS SOCK BAUBLE',
 'ICE CREAM SUNDAE LIP GLOSS',
 'BISCUIT TIN VINTAGE CHRISTMAS',
 'FOLK ART METAL STAR T-LIGHT HOLDER',
 'CERAMIC HEART FAIRY CAKE MONEY BANK',
 'CHRISTMAS RETROSPOT STAR WOOD']

# User-to-User Collaborative Filtering using KNN

In [15]:
# For passing our sparse matrix into KNN we need to convert it into CSR
# CSR divides a sparse matrix into 3 arrays : values, extent of rows, index of columns

from scipy.sparse import csr_matrix

purchase_matrix = csr_matrix(purchase.values)

from sklearn.neighbors import NearestNeighbors

model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
model_knn.fit(purchase_matrix)

In [16]:
simu_knn = []

def similar_users_knn(purchase,query_index):

    distances, indices = model_knn.kneighbors(purchase.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 6)
    for i in range(0, len(distances.flatten())):
        if i == 0:
            print('Recommendations for {0}:\n'.format(purchase.index[query_index]))
        else:
            print('{0}: {1}, with distance of {2}:'.format(i, purchase.index[indices.flatten()[i]], distances.flatten()[i]))
            simu_knn.append(purchase.index[indices.flatten()[i]])    

In [17]:
similar_users_knn(purchase,1497)

Recommendations for 14389.0:

1: 16748.0, with distance of 0.47513611891852214:
2: 15417.0, with distance of 0.5065362287801733:
3: 14489.0, with distance of 0.5232687053772038:
4: 17031.0, with distance of 0.5607023148930206:
5: 15747.0, with distance of 0.5695947101270704:


In [18]:
# Just curiosity to see
purchase.index[1497]

14389.0

In [19]:
simu_knn   # This show what customers have the same taste to the target.

[16748.0, 15417.0, 14489.0, 17031.0, 15747.0]

In [20]:
def simu_recommendation_knn(simu_knn):
    
    #obtaining all the items bought by similar users
    simu_rec = []
    for j in simu_knn:
        desc = df_new[df_new["CustomerID"]==j]['Description'].to_list()
        simu_rec.append(desc)
    
    #this gives us multi-dimensional list
    # we need to flatten it
    flat_list = []
    for sublist in simu_rec:
        for item in sublist:
            flat_list.append(item)
    final_list = list(dict.fromkeys(flat_list))
    
    # storing 10 random recommendations in a list
    ten_recs = random.sample(final_list, 10)
    
    print('Items bought by Similar users based on KNN')
    
    #returning 10 random recommendations
    return ten_recs

In [21]:
simu_recommendation_knn(simu_knn)

Items bought by Similar users based on KNN


['JUMBO BAG VINTAGE DOILY ',
 'LUNCH BAG  BLACK SKULL.',
 'STRAWBERRY CHARLOTTE BAG',
 'JUMBO BAG APPLES',
 'RECYCLING BAG RETROSPOT ',
 'PINK FAIRY CAKE CHILDRENS APRON',
 'SET OF 6 T-LIGHTS SANTA',
 'JUMBO BAG VINTAGE CHRISTMAS ',
 'JUMBO BAG VINTAGE LEAF',
 "JUMBO BAG 50'S CHRISTMAS "]

# Item-to-Item Collaborative Filtering

In [22]:
df_new.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [23]:
# We are creating a df which contains item names and whether they have been ever purchased by a customer using groupby 

items_purchase = (df_new.groupby(['Description','CustomerID'])['Quantity'].sum().unstack().
                  reset_index().fillna(0).set_index('Description'))

items_purchase.head()

CustomerID,12346.0,12347.0,12348.0,12349.0,12350.0,12352.0,12353.0,12354.0,12355.0,12356.0,...,18273.0,18274.0,18276.0,18277.0,18278.0,18280.0,18281.0,18282.0,18283.0,18287.0
Description,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4 PURPLE FLOCK DINNER CANDLES,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50'S CHRISTMAS GIFT BAG LARGE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DOLLY GIRL BEAKER,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I LOVE LONDON MINI BACKPACK,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I LOVE LONDON MINI RUCKSACK,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
items_purchase = items_purchase.applymap(encode_units)

item_similarity = cosine_similarity(items_purchase)

item_similarity_df = pd.DataFrame(item_similarity,index=items_purchase.index,columns=items_purchase.index)

item_similarity_df.head()

Description,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,I LOVE LONDON MINI RUCKSACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,...,ZINC STAR T-LIGHT HOLDER,ZINC SWEETHEART SOAP DISH,ZINC SWEETHEART WIRE LETTER RACK,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC TOP 2 DOOR WOODEN SHELF,ZINC WILLIE WINKIE CANDLE STICK,ZINC WIRE KITCHEN ORGANISER,ZINC WIRE SWEETHEART LETTER TRAY
Description,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4 PURPLE FLOCK DINNER CANDLES,1.0,0.0,0.017961,0.023583,0.0,0.0,0.02805,0.0,0.031384,0.017125,...,0.0,0.042333,0.043885,0.032001,0.0,0.026774,0.0,0.061379,0.0,0.042333
50'S CHRISTMAS GIFT BAG LARGE,0.0,1.0,0.058277,0.038261,0.0,0.036073,0.060676,0.332508,0.033945,0.083348,...,0.0,0.045787,0.047465,0.034612,0.0,0.094114,0.0,0.033193,0.0,0.022893
DOLLY GIRL BEAKER,0.017961,0.058277,1.0,0.144437,0.1,0.037139,0.046852,0.066259,0.061159,0.200227,...,0.0,0.02357,0.048868,0.089087,0.0,0.096896,0.0,0.034174,0.028868,0.070711
I LOVE LONDON MINI BACKPACK,0.023583,0.038261,0.144437,1.0,0.131306,0.048766,0.041013,0.043501,0.126195,0.112676,...,0.0,0.061898,0.048125,0.035093,0.0,0.039148,0.0,0.056091,0.0,0.061898
I LOVE LONDON MINI RUCKSACK,0.0,0.0,0.1,0.131306,1.0,0.0,0.0,0.0,0.0,0.095346,...,0.0,0.0,0.0,0.089087,0.0,0.074536,0.0,0.085436,0.0,0.0


In [25]:
def similar_items(item,k=10):
    # separating df rows of the selected item
    item = item_similarity_df[item_similarity_df.index == item]
    
    # a df of all other items
    other_items = item_similarity_df
    
    # calc cosine similarity between selected item with other items
    similarities = cosine_similarity(item,other_items)[0].tolist()
    
    # create list of indices of these items
    indices = other_items.index.tolist()
    
    # create key/values pairs of item index and their similarity
    index_similarity = dict(zip(indices, similarities))
    
    # sort by similarity
    index_similarity_sorted = sorted(index_similarity.items())
    
    # grab k items from the top
    top_item_similarities = index_similarity_sorted[:k]
    items = [u[0] for u in top_item_similarities]
    
    print('Similar items based on purchase behaviour (item-to-item collaborative filtering)')
    return items

In [26]:
similar_items(' 4 PURPLE FLOCK DINNER CANDLES')

Similar items based on purchase behaviour (item-to-item collaborative filtering)


[' 4 PURPLE FLOCK DINNER CANDLES',
 " 50'S CHRISTMAS GIFT BAG LARGE",
 ' DOLLY GIRL BEAKER',
 ' I LOVE LONDON MINI BACKPACK',
 ' I LOVE LONDON MINI RUCKSACK',
 ' NINE DRAWER OFFICE TIDY',
 ' OVAL WALL MIRROR DIAMANTE ',
 ' RED SPOT GIFT BAG LARGE',
 ' SET 2 TEA TOWELS I LOVE LONDON ',
 ' SPACEBOY BABY GIFT SET']