# Chapter 10 - Graph-Based Recommender Systems

Graph-based recommender systems model customers, items, and the interactions between them as a graph, so that recommendations can be derived from how the entities are connected. In this chapter we build a knowledge graph of customers and stocks in Neo4j, link them through purchase transactions, and use the Jaccard similarity between customers' purchases to find similar users.

<div style="text-align:center;">
    <img src='images/graph.png' width='600'>
</div>

In [1]:
# Importing libraries

import pandas as pd
import numpy as np

from neo4j import GraphDatabase, basic_auth
import neo4jupyter
from py2neo import Graph
import re

#import warnings
#warnings.filterwarnings("ignore")

### Establishing a Connection Between Our Neo4j Database and the Python Notebook


In [2]:
# https://neo4j.com/sandbox/

g = Graph("neo4j+s://4fbc2c3a.databases.neo4j.io", password = "OGkyXIaUGyoMB297VLaSR03atj7yF1LpTwnzDWoqjOo")

In [3]:
driver = GraphDatabase.driver(
  "neo4j+s://4fbc2c3a.databases.neo4j.io",
  auth=basic_auth("neo4j", "OGkyXIaUGyoMB297VLaSR03atj7yF1LpTwnzDWoqjOo"))

In [4]:
def execute_transactions(transaction_execution_commands):
    # Reuse the driver created above instead of opening a new connection on every call
    # Open a session and make sure it is closed once all commands have run
    with driver.session() as session:
        for command in transaction_execution_commands:
            session.run(command)
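
The connection cells above embed the database URI and password directly in the notebook. One alternative is to read them from environment variables, and a minimal sketch of that follows. The variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `load_neo4j_settings` are hypothetical and not part of the original notebook.

```python
import os

def load_neo4j_settings(env=None):
    """Read Neo4j connection settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # The variable names below are assumptions made for this sketch
    missing = [key for key in ("NEO4J_URI", "NEO4J_PASSWORD") if not env.get(key)]
    if missing:
        raise KeyError("Missing settings: " + ", ".join(missing))
    return {
        "uri": env["NEO4J_URI"],
        "user": env.get("NEO4J_USER", "neo4j"),
        "password": env["NEO4J_PASSWORD"],
    }

# Example with an explicit mapping (placeholder values only)
settings = load_neo4j_settings(
    {"NEO4J_URI": "neo4j+s://example.invalid", "NEO4J_PASSWORD": "change-me"}
)
print(settings["user"])
```

The returned values could then be passed to `GraphDatabase.driver(settings["uri"], auth=basic_auth(settings["user"], settings["password"]))`.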

### Loading the Datasets

In [5]:
# Read the transaction data from the Excel workbook
df = pd.read_excel(r'data/Rec_sys_data.xlsx')

df.head()

Unnamed: 0,InvoiceNo,StockCode,Quantity,InvoiceDate,DeliveryDate,Discount%,ShipMode,ShippingCost,CustomerID
0,536365,84029E,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.2,ExpressAir,30.12,17850
1,536365,71053,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.21,ExpressAir,30.12,17850
2,536365,21730,6,2010-12-01 08:26:00,2010-12-03 08:26:00,0.56,Regular Air,15.22,17850
3,536365,84406B,8,2010-12-01 08:26:00,2010-12-03 08:26:00,0.3,Regular Air,15.22,17850
4,536365,22752,2,2010-12-01 08:26:00,2010-12-04 08:26:00,0.57,Delivery Truck,5.81,17850


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 272404 entries, 0 to 272403
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   InvoiceNo     272404 non-null  int64         
 1   StockCode     272404 non-null  object        
 2   Quantity      272404 non-null  int64         
 3   InvoiceDate   272404 non-null  datetime64[ns]
 4   DeliveryDate  272404 non-null  datetime64[ns]
 5   Discount%     272404 non-null  float64       
 6   ShipMode      272404 non-null  object        
 7   ShippingCost  272404 non-null  float64       
 8   CustomerID    272404 non-null  int64         
dtypes: datetime64[ns](2), float64(2), int64(3), object(2)
memory usage: 18.7+ MB


In [7]:
# A little preprocessing: store CustomerID as a string so it can be quoted in Cypher queries.
df['CustomerID'] = df['CustomerID'].apply(str)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 272404 entries, 0 to 272403
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   InvoiceNo     272404 non-null  int64         
 1   StockCode     272404 non-null  object        
 2   Quantity      272404 non-null  int64         
 3   InvoiceDate   272404 non-null  datetime64[ns]
 4   DeliveryDate  272404 non-null  datetime64[ns]
 5   Discount%     272404 non-null  float64       
 6   ShipMode      272404 non-null  object        
 7   ShippingCost  272404 non-null  float64       
 8   CustomerID    272404 non-null  object        
dtypes: datetime64[ns](2), float64(2), int64(2), object(3)
memory usage: 18.7+ MB


In [8]:
# This dataset contains detailed information about each stock which will be
# used to link stockcodes and their description/title.
df1 = pd.read_excel('data/Rec_sys_data.xlsx', sheet_name='product')

df1.head()

Unnamed: 0,StockCode,Product Name,Description,Category,Brand,Unit Price
0,22629,Ganma Superheroes Ordinary Life Case For Samsu...,"New unique design, great gift.High quality pla...",Cell Phones|Cellphone Accessories|Cases & Prot...,Ganma,13.99
1,21238,Eye Buy Express Prescription Glasses Mens Wome...,Rounded rectangular cat-eye reading glasses. T...,Health|Home Health Care|Daily Living Aids,Eye Buy Express,19.22
2,22181,MightySkins Skin Decal Wrap Compatible with Ni...,Each Nintendo 2DS kit is printed with super-hi...,Video Games|Video Game Accessories|Accessories...,Mightyskins,14.99
3,84879,Mediven Sheer and Soft 15-20 mmHg Thigh w/ Lac...,The sheerest compression stocking in its class...,Health|Medicine Cabinet|Braces & Supports,Medi,62.38
4,84836,Stupell Industries Chevron Initial Wall D cor,Features: -Made in the USA. -Sawtooth hanger o...,Home Improvement|Paint|Wall Decals|All Wall De...,Stupell Industries,35.99


To implement a knowledge graph in Neo4j, the DataFrame must be converted into a graph. First, customers and stocks are turned into entities (nodes of the graph) so that relationships can later be built between them.

In [9]:
#creating a list of all unique customer IDs
customerids = df['CustomerID'].unique().tolist()

# storing all the create commands to be executed into create_customers list
create_customers = []

for i in customerids:
  # example of create statement "create (n:entity {property_key : '12345'})" 
  statement = "create (c:customer{cid:"+ '"' + str(i) + '"' +"})"
  create_customers.append(statement)

# running all the queries into neo4j to create customer entities
execute_transactions(create_customers)

Once we are done with customer nodes, we need to create nodes for stocks as well.

In [10]:
# creating a list of all unique stockcodes
stockcodes = df['StockCode'].unique().tolist()

# storing all the create commands to be executed into the create_stockcodes list
create_stockcodes = []

for i in stockcodes:
  # example of create statement "create (m:entity {property_key : 'XYZ'})"
  statement = "create (s:stock{stockcode:"+ '"' + str(i) + '"' +"})"
  create_stockcodes.append(statement)

# running all the queries into neo4j to create stock entities
execute_transactions(create_stockcodes)
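
The cells above send one `create` statement per node, with the value spliced into the query text. A hedged alternative is to pass the values as query parameters and create many nodes per call with `UNWIND`. In the sketch below, the helpers `chunked` and `create_nodes_in_batches` and the `RecordingSession` stand-in are hypothetical. With a real database you would pass a driver session instead, whose `run` method accepts parameters as keyword arguments.

```python
def chunked(values, size):
    """Yield consecutive slices of `values` with at most `size` items each."""
    for start in range(0, len(values), size):
        yield values[start:start + size]

# The $rows parameter is sent separately from the query text
CREATE_CUSTOMERS = "UNWIND $rows AS row CREATE (c:customer {cid: row.cid})"

def create_nodes_in_batches(session, query, rows, batch_size=1000):
    """Run `query` once per batch and return the number of batches sent."""
    batches = 0
    for batch in chunked(rows, batch_size):
        session.run(query, rows=batch)
        batches += 1
    return batches

class RecordingSession:
    """Stand-in for a database session so the sketch runs without Neo4j."""
    def __init__(self):
        self.calls = []

    def run(self, query, **params):
        self.calls.append((query, params))

# Customer IDs taken from examples elsewhere in this chapter
rows = [{"cid": cid} for cid in ["17850", "12347", "17975"]]
fake = RecordingSession()
print(create_nodes_in_batches(fake, CREATE_CUSTOMERS, rows, batch_size=2))
```

Because the values never become part of the query string, characters such as quotes in a property value do not need to be escaped by hand.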

Once the nodes for customers and stocks have been created, we need to link each stock code to its title, which will be needed to recommend items.

To do this, we add another property key called 'title' to the stock entities already present in our Neo4j database.

In [11]:
# creating a blank dataframe
df2 = pd.DataFrame(columns = ['StockCode', 'Title'])
df2.head()

Unnamed: 0,StockCode,Title


In [12]:
# Converting stock codes to strings in both DataFrames
df['StockCode'] = df['StockCode'].astype(str)
df1['StockCode'] = df1['StockCode'].astype(str)

In [13]:
# Get the unique stock codes
stockcodes = df['StockCode'].unique().tolist()

# Initialize an empty list to hold the dictionaries
records = []

# Loop through each stock code and create a dictionary for each
for stockcode in stockcodes:
    dict_temp = {}
    dict_temp['StockCode'] = stockcode
    product_name = df1[df1['StockCode'] == stockcode]['Product Name'].values
    dict_temp['Title'] = product_name[0] if len(product_name) > 0 else None
    records.append(dict_temp)

# Create the DataFrame from the list of dictionaries
df2 = pd.DataFrame(records)

# Reset the index
df2 = df2.reset_index(drop=True)

print(df2)

     StockCode                                              Title
0       84029E  3 1/2"W x 20"D x 20"H Funston Craftsman Smooth...
1        71053  Awkward Styles Shamrock Flag St. Patrick's Day...
2        21730  Ebe Men Black Rectangle Half Rim Spring Hinge ...
3       84406B  MightySkins Skin Decal Wrap Compatible with Ap...
4        22752  awesome since 1948 - 69th birthday gift t-shir...
...        ...                                                ...
3533     23532  Handcrafted Ercolano Music Box Featuring "Lunc...
3534     23537  Ebe Reading Glasses Mens Womens Amber Red Oval...
3535     23500  Port Company PC61 Traditional MenÃ¢â‚¬s T-Shir...
3536     23465  Z620 Workstation, 2x Xeon E5-2630 v2 2.6GHz Si...
3537     23501  Ebe Reading Glasses Mens Womens Black Blue Ret...

[3538 rows x 2 columns]


In [14]:
df2.head()

Unnamed: 0,StockCode,Title
0,84029E,"3 1/2""W x 20""D x 20""H Funston Craftsman Smooth..."
1,71053,Awkward Styles Shamrock Flag St. Patrick's Day...
2,21730,Ebe Men Black Rectangle Half Rim Spring Hinge ...
3,84406B,MightySkins Skin Decal Wrap Compatible with Ap...
4,22752,awesome since 1948 - 69th birthday gift t-shir...
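
The loop above filters the product sheet once per stock code. The same lookup can be sketched with a single `map` call, shown here on small illustrative stand-ins for `df` and `df1`. The product names are invented for the example.

```python
import pandas as pd

# Small stand-ins for df (transactions) and df1 (product sheet); values are illustrative
transactions = pd.DataFrame({"StockCode": ["84029E", "71053", "84029E", "99999"]})
products = pd.DataFrame({
    "StockCode": ["84029E", "71053", "71053"],
    "Product Name": ["Product A", "Product B", "Product B (duplicate row)"],
})

# Keep the first product name per stock code, mirroring product_name[0] in the loop
title_by_code = (
    products.drop_duplicates("StockCode").set_index("StockCode")["Product Name"]
)

titles = pd.DataFrame({"StockCode": transactions["StockCode"].unique()})
# Codes without a product row get a missing value, like the `else None` branch
titles["Title"] = titles["StockCode"].map(title_by_code)
print(titles)
```

One difference to keep in mind: unmatched codes come back as `NaN` rather than `None`, so a later `apply(str)` would produce `'nan'` instead of `'None'`.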


In [16]:
# Preprocessing the titles so that they can be embedded in Cypher queries:
# every run of non-word characters (quotes, punctuation) is replaced with a single space

df2['Title'] = df2['Title'].apply(str)
df2['Title'] = df2['Title'].map(lambda x: re.sub(r'\W+', ' ', x))

df2.head()

Unnamed: 0,StockCode,Title
0,84029E,3 1 2 W x 20 D x 20 H Funston Craftsman Smooth...
1,71053,Awkward Styles Shamrock Flag St Patrick s Day ...
2,21730,Ebe Men Black Rectangle Half Rim Spring Hinge ...
3,84406B,MightySkins Skin Decal Wrap Compatible with Ap...
4,22752,awesome since 1948 69th birthday gift t shirt ...
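
To make the effect of the `\W+` substitution concrete, here is a small sketch. The helper `clean_title` is hypothetical. Unlike the cell above, it strips leading and trailing spaces and maps a missing title to an empty string, whereas `apply(str)` turns `None` into the literal text `'None'`.

```python
import re

def clean_title(title):
    """Hypothetical helper: collapse non-word runs to one space; map None to ''."""
    if title is None:
        return ""
    return re.sub(r"\W+", " ", str(title)).strip()

print(clean_title('3 1/2"W x 20"D x 20"H'))  # 3 1 2 W x 20 D x 20 H
print(clean_title("St. Patrick's Day"))      # St Patrick s Day
print(repr(clean_title(None)))               # ''
```

Removing the double quotes is what keeps the title from ending the quoted string early when it is concatenated into a Cypher query.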


In [17]:
# This query will add the 'title' property key to each stock entity in our neo4j database
for i in range(len(df2)):
  query = """
  MATCH (s:stock {stockcode:""" + '"' + str(df2['StockCode'][i]) + '"' + """})
  SET s.title ="""+ '"' + str(df2['Title'][i]) + '"' + """
  RETURN s.stockcode, s.title
  """

  g.run(query)

<div style="text-align:center;">
    <img src='images/neo4j.png' width='1200'>
</div>

### Creating Relation Between Customers and Stocks

Since the dataset contains all the transactions, the relationship between customers and stocks is already known. To represent it in the graph database, we run Cypher queries in Neo4j that create a `bought` relationship for each transaction.

In [20]:
# Storing transaction values in a list
transaction_list = df.values.tolist()

# storing all commands to build relationship in an empty list relation
relation = []

for i in transaction_list:
  # the 9th column in df is customerID and 2nd column is stockcode which we are appending in the statement
  statement = """MATCH (a:customer),(b:stock) WHERE a.cid = """+'"' + str(i[8])+ '"' + """ AND b.stockcode = """ + '"' + str(i[1]) + '"' + """ CREATE (a)-[:bought]->(b) """
  relation.append(statement)

In [None]:
execute_transactions(relation)
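
Because one statement is generated per transaction row, a customer who bought the same stock on several invoice lines may end up with several parallel `bought` relationships. A hedged sketch of one way to avoid this is to reduce the transactions to unique (customer, stock) pairs first. The sample data below is illustrative, and the Cypher in the comment is a suggestion rather than the notebook's own query.

```python
import pandas as pd

# Illustrative stand-in for df; only the two columns used for relationships
sample = pd.DataFrame({
    "CustomerID": ["17850", "17850", "17850", "12347"],
    "StockCode": ["84029E", "84029E", "71053", "71053"],
})

# One row per (customer, stock) pair, however many invoice lines mention it
pairs = sample[["CustomerID", "StockCode"]].drop_duplicates()

# Parameter rows for a query such as:
#   UNWIND $rows AS row
#   MATCH (a:customer {cid: row.cid}), (b:stock {stockcode: row.stockcode})
#   MERGE (a)-[:bought]->(b)
rows = [
    {"cid": cid, "stockcode": code}
    for cid, code in pairs.itertuples(index=False, name=None)
]
print(len(sample), "transactions ->", len(rows), "relationships")
```

Using `MERGE` instead of `CREATE` also makes the step safe to re-run, since an existing relationship is matched rather than duplicated.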

<div style="text-align:center;">
    <img src='images/neo4j2.png' width='1200'>
</div>

Next, let’s find similarities between users using the relationship created.

The Jaccard similarity of two sets is the ratio of the size of their intersection to the size of their union. It ranges from 0 (no elements in common) to 1 (identical sets), or equivalently from 0% to 100%, and more similar sets have a higher value. Here, each customer's set consists of the stocks they have bought.
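
As a small illustration of this ratio, the sketch below computes the Jaccard index for two hypothetical customers' purchases. The function name `jaccard_index` and the two baskets are made up for the example.

```python
def jaccard_index(a, b):
    """Jaccard index of two collections, treated as sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Stock codes bought by two hypothetical customers
customer_a = {"84029E", "71053", "21730"}
customer_b = {"71053", "21730", "22752", "84406B"}

# 2 shared stocks out of 5 distinct stocks
print(jaccard_index(customer_a, customer_b))  # 0.4
```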

In [None]:
def similar_users(customer_id):

  # This query finds customers who have bought stocks in common with the given customer,
  # computes the Jaccard index for each of them, and
  # returns the neighbors sorted by Jaccard index in descending order

  query = """
  MATCH (c1:customer)-[:bought]->(s:stock)<-[:bought]-(c2:customer)
  WHERE c1 <> c2 AND c1.cid =""" + '"' + str(customer_id) + '"' + """
  WITH c1, c2, COUNT(DISTINCT s) as intersection
  MATCH (c:customer)-[:bought]->(s:stock)
  WHERE c in [c1, c2]
  WITH c1, c2, intersection, COUNT(DISTINCT s) as union
  WITH c1, c2, intersection, union, (intersection * 1.0 / union) as jaccard_index
  ORDER BY jaccard_index DESC, c2.cid
  WITH c1, COLLECT([c2.cid, jaccard_index, intersection, union])[0..15] as neighbors
  WHERE SIZE(neighbors) = 15   // return users with enough neighbors
  RETURN c1.cid as customer, neighbors

  """
  # Each returned record holds a list of [cid, jaccard_index, intersection, union] rows
  rows = []
  for record in g.run(query).data():
    rows.extend(record["neighbors"])
  # DataFrame.append was removed in pandas 2.0, so the frame is built in one step
  neighbors = pd.DataFrame(rows, columns=['CustomerID', 'JaccardIndex', 'Intersection', 'Union'])

  print("\n----------- customer's 15 nearest neighbors ---------\n")
  print(neighbors)

In [None]:
similar_users('12347')

In [None]:
similar_users(17975)

In [None]:
similar_users(16359)
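
The ranking produced by `similar_users` can be cross-checked without the database. The following is a minimal sketch in pandas, assuming a transactions frame with string `CustomerID` and `StockCode` columns as in `df`. The function `nearest_neighbors` and the sample data are hypothetical. Unlike the Cypher query, it does not require exactly 15 neighbors.

```python
import pandas as pd

def nearest_neighbors(transactions, customer_id, k=15):
    """Rank other customers by the Jaccard index of their purchased stock sets."""
    baskets = transactions.groupby("CustomerID")["StockCode"].apply(set)
    target = baskets[str(customer_id)]
    rows = []
    for other, items in baskets.items():
        shared = len(target & items)
        if other == str(customer_id) or shared == 0:
            continue  # the Cypher query only matches customers sharing a stock
        union = len(target | items)
        rows.append((other, shared / union, shared, union))
    result = pd.DataFrame(
        rows, columns=["CustomerID", "JaccardIndex", "Intersection", "Union"]
    )
    # Same ordering as the query: highest index first, ties broken by customer id
    result = result.sort_values(
        ["JaccardIndex", "CustomerID"], ascending=[False, True]
    )
    return result.head(k).reset_index(drop=True)

# Tiny illustrative dataset: customer "1" shares stocks with "2" and "3" only
sample = pd.DataFrame({
    "CustomerID": ["1", "1", "1", "2", "2", "3", "4"],
    "StockCode": ["A", "B", "C", "A", "B", "C", "D"],
})
print(nearest_neighbors(sample, "1"))
```

On the sample data, customer "2" ranks first (2 shared stocks out of 3 distinct) and customer "3" second (1 out of 3), while customer "4" is excluded because no stock is shared.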