# Content-based Filtering Recommendations - Node Similarity

## Introduction


Calculate the pairwise similarity between `Poi` nodes based purely on the content (attributes) of each POI, without involving any user interactions.

Similarity measures:

- Numerical attributes: Euclidean Distance Similarity

- Categorical attributes: Jaccard Similarity

- Textual attributes: Cosine Similarity

Overall similarity is computed by combining all attributes on a weighted basis.
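
Written as a formula, the weighted combination used in the similarity cell further down looks as follows. The symbols $J$, $E$, $C$ and $d$ are notation introduced here for readability; the weights are the attribute counts from that cell (4 categorical, 8 numerical, 1 textual).

$$
\mathrm{sim}(a, b) = \frac{n_{\mathrm{cat}}\, J(a, b) + n_{\mathrm{num}}\, E(a, b) + n_{\mathrm{text}}\, C(a, b)}{n_{\mathrm{cat}} + n_{\mathrm{num}} + n_{\mathrm{text}}}
= \frac{4\, J(a, b) + 8\, E(a, b) + 1\, C(a, b)}{13},
\qquad E(a, b) = \frac{1}{1 + d(a, b)}
$$

Here $J$ is the Jaccard similarity of the one-hot encoded categorical attributes, $d$ is the Euclidean distance between the min-max scaled numerical attributes, and $C$ is the cosine similarity of the description token counts. Each term lies in $[0, 1]$, so the overall score does too.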

Pairs with a similarity score below 0.5 are discarded, so that only sufficiently similar pairs are kept.

Recommendations are made by finding the POIs most similar to a target POI.

## Prerequisites

- A Neo4j server with a recent version (2.0+) of GDS installed.

- The `graphdatascience` Python library to operate Neo4j GDS.

- Cypher queries to generate recommendations.

- The `py2neo` package to write pandas DataFrames back to the Neo4j database.

## Setup

Install and import the dependencies, and set up the GDS client connection to the database.

In [1]:
# Install necessary dependencies
%pip install graphdatascience
%pip install matplotlib
%pip install scikit-learn
%pip install py2neo

Note: you may need to restart the kernel to use updated packages.





In [35]:
import os
import configparser

import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score

from py2neo import Graph
from graphdatascience import GraphDataScience

In [36]:
# Use an ini file for credentials if present; otherwise fall back to defaults
HOST = 'neo4j://localhost'
USER = 'neo4j'
DATABASE = 'neo4j'
PASSWORD = 'password'

NEO4J_CONF_FILE = 'neo4j.ini'

if NEO4J_CONF_FILE is not None and os.path.exists(NEO4J_CONF_FILE):
    config = configparser.RawConfigParser()
    config.read(NEO4J_CONF_FILE)
    HOST = config['NEO4J']['HOST']
    DATABASE = config['NEO4J']['DATABASE']
    PASSWORD = config['NEO4J']['PASSWORD']
    # USER is an assumed, optional key; the default above is kept when it is absent
    USER = config['NEO4J'].get('USER', USER)
    # The password is deliberately not printed, since notebook output is often shared
    print(f'Using custom database properties \nHOST: {HOST}; DATABASE: {DATABASE}')
else:
    print('Could not find database properties file, using defaults')

# Connect to the Neo4j database using the GDS library
# auth expects (username, password); the database name is passed separately
gds = GraphDataScience(HOST, auth=(USER, PASSWORD), database=DATABASE)

# Connect to the Neo4j database using py2neo
graph = Graph(HOST, auth=(USER, PASSWORD), name=DATABASE)

Using custom database properties 
HOST: bolt://44.199.250.187:7687; DATABASE: neo4j


## Compute pairwise node similarity between POIs

In [3]:
# Extract the raw Poi nodes and their attributes via the GDS client.
# A POI that belongs to several categories yields one row per category,
# which is why some poi_id values are repeated in the result.
result = gds.run_cypher("""
MATCH (poi:Poi)
OPTIONAL MATCH (poi)-[:BELONGS_TO]->(category:Category)
OPTIONAL MATCH (poi)-[:LOCATED_AT]->(region:Region)
RETURN poi.id AS poi_id,
       poi.name AS name,
       poi.description AS description,
       poi.openingHours AS opening_hours,
       poi.duration AS duration,
       category.name AS category,
       region.name AS region,
       poi.price AS price,
       poi.avgRating AS avg_rating,
       poi.numReviews AS num_reviews,
       poi.numReviews_5 AS num_reviews_5,
       poi.numReviews_4 AS num_reviews_4,
       poi.numReviews_3 AS num_reviews_3,
       poi.numReviews_2 AS num_reviews_2,
       poi.numReviews_1 AS num_reviews_1
""")

# Convert result to DataFrame
df_pois = pd.DataFrame(result)

df_pois

Unnamed: 0,poi_id,name,description,opening_hours,duration,category,region,price,avg_rating,num_reviews,num_reviews_5,num_reviews_4,num_reviews_3,num_reviews_2,num_reviews_1
0,2149128,Gardens by the Bay,"An integral part of Singapore's ""City in a Gar...",5:00 AM - 2:00 AM,More than 3 hours,Points of Interest & Landmarks,Central Area/City Area,8.01,4.5,60393,43439,13817,2541,406,199
1,2149128,Gardens by the Bay,"An integral part of Singapore's ""City in a Gar...",5:00 AM - 2:00 AM,More than 3 hours,Gardens,Central Area/City Area,8.01,4.5,60393,43439,13817,2541,406,199
2,310900,Singapore Botanic Gardens,This national park is open daily and features ...,5:00 AM - 12:00 AM,1-2 hours,Parks,Tanglin,0.00,4.5,20016,14192,4899,822,66,37
3,310900,Singapore Botanic Gardens,This national park is open daily and features ...,5:00 AM - 12:00 AM,1-2 hours,Gardens,Tanglin,0.00,4.5,20016,14192,4899,822,66,37
4,4400781,Cloud Forest,,9:00 AM - 9:00 PM,2-3 hours,Points of Interest & Landmarks,Central Area/City Area,11.42,4.5,15161,11177,3078,715,138,53
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91,17821111,Low Salt Low Sugar,Discover the joys of Peranakan cooking in rich...,,,Cooking Classes,Punggol Town Centre,0.00,3.5,3,2,0,0,0,1
92,17738872,Zenith Wholesale Enterprise,Our company has a fleet of vehicles to perform...,7:00 AM - 11:30 PM,,,Outram,0.00,2.5,6,1,1,0,1,3
93,26356283,Singapore airport,,,,Gardens,,0.00,0.0,0,0,0,0,0,0
94,21353012,J2 Terrarium,Terrariums are no stranger to the urban indivi...,,,Paint & Pottery Studios,Yishun,25.59,0.0,0,0,0,0,0,0


In [4]:
# Extracting distinct poi_id and poi_name
df_distinct_pois = df_pois.copy()
df_distinct_pois = df_distinct_pois[['poi_id', 'name']].drop_duplicates()

df_distinct_pois

Unnamed: 0,poi_id,name
0,2149128,Gardens by the Bay
2,310900,Singapore Botanic Gardens
4,4400781,Cloud Forest
7,324542,Singapore Zoo
8,678639,Singapore Flyer
...,...,...
91,17821111,Low Salt Low Sugar
92,17738872,Zenith Wholesale Enterprise
93,26356283,Singapore airport
94,21353012,J2 Terrarium


In [5]:
# Numerical Features - Min-Max Normalization
# Attributes: 'price', 'avg_rating', 'num_reviews', 'num_reviews_5', 'num_reviews_4', 'num_reviews_3', 'num_reviews_2', 'num_reviews_1'

scaler = MinMaxScaler()
numerical_cols = ['price', 'avg_rating', 'num_reviews', 'num_reviews_5', 'num_reviews_4', 'num_reviews_3', 'num_reviews_2', 'num_reviews_1']
# Create a new DataFrame with scaled columns and poi_id
df_numerical_cols = df_pois.copy()
df_numerical_cols = df_numerical_cols[['poi_id'] + numerical_cols]
# retaining only distinct entries
df_numerical_cols = df_numerical_cols.drop_duplicates()
# Fill missing values in numerical columns with 0
df_numerical_cols.fillna(0, inplace=True)

# scale properties
df_numerical_cols[numerical_cols] = scaler.fit_transform(df_numerical_cols[numerical_cols])

df_numerical_cols

Unnamed: 0,poi_id,price,avg_rating,num_reviews,num_reviews_5,num_reviews_4,num_reviews_3,num_reviews_2,num_reviews_1
0,2149128,0.239176,0.9,1.000000,1.000000,1.000000,1.000000,0.876890,0.786561
2,310900,0.000000,0.9,0.331429,0.326711,0.354563,0.323495,0.142549,0.146245
4,4400781,0.340997,0.9,0.251039,0.257303,0.222769,0.281385,0.298056,0.209486
7,324542,1.000000,0.9,0.373288,0.331614,0.443439,0.591499,0.688985,0.750988
8,678639,0.000000,0.9,0.288278,0.213725,0.416371,0.771350,0.598272,0.541502
...,...,...,...,...,...,...,...,...,...
91,17821111,0.000000,0.7,0.000050,0.000046,0.000000,0.000000,0.000000,0.003953
92,17738872,0.000000,0.5,0.000099,0.000023,0.000072,0.000000,0.002160,0.011858
93,26356283,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
94,21353012,0.764109,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
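
Min-max normalization maps each column to the range [0, 1] using x' = (x - min) / (max - min). As a small sanity check, the sketch below uses made-up prices (not values from the dataset) to show that the manual formula agrees with `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data, for illustration only
prices = np.array([[0.0], [5.0], [20.0]])

# Manual min-max scaling, computed per column
col_min, col_max = prices.min(axis=0), prices.max(axis=0)
manual = (prices - col_min) / (col_max - col_min)

# The same transformation via scikit-learn
scaled = MinMaxScaler().fit_transform(prices)

print(manual.ravel())
print(scaled.ravel())
```

Because the scale is set by the column maximum, one very large value compresses all others towards 0, which is visible in the `num_reviews` columns of the table above.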


In [6]:
# Categorical Features - One-hot Encoding
# Attributes: category, region, opening_hours, duration

categorical_cols = ['category', 'region', 'opening_hours', 'duration']

# Copy df_pois with only the specified categorical columns
df_categorical_cols = df_pois.copy()
df_categorical_cols = df_categorical_cols[['poi_id'] + categorical_cols]

# Do one-hot encoding for categorical columns
df_categorical_cols = pd.get_dummies(df_categorical_cols, columns=categorical_cols)


# Merge rows with the same poi_id while applying OR logical operation
df_categorical_cols = df_categorical_cols.groupby('poi_id').max().reset_index()

df_categorical_cols

Unnamed: 0,poi_id,category_Architectural Buildings,category_Art Galleries,category_Bars & Clubs,category_Beaches,category_Biking Trails,category_Breweries,category_Churches & Cathedrals,category_Cooking Classes,category_Cultural Events,...,opening_hours_9:00 AM - 6:00 PM,opening_hours_9:00 AM - 9:00 PM,opening_hours_9:30 AM - 11:00 PM,opening_hours_9:30 AM - 5:30 PM,opening_hours_Closed until further notice,duration_,duration_1-2 hours,duration_2-3 hours,duration_< 1 hour,duration_More than 3 hours
0,310895,False,False,False,False,False,False,True,False,False,...,True,False,False,False,False,False,True,False,False,False
1,310896,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2,310900,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
3,315470,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
4,317402,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64,23138270,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
65,23235560,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
66,23808268,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
67,25547486,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
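
To see why `groupby('poi_id').max()` acts as a logical OR, here is a minimal sketch on a toy frame. The POI ids and category names are invented for illustration; POI 1 appears twice because it belongs to two categories.

```python
import pandas as pd

# Toy rows: POI 1 appears twice because it belongs to two categories
df_toy = pd.DataFrame({
    "poi_id": [1, 1, 2],
    "category": ["Gardens", "Landmarks", "Parks"],
})

# One-hot encode, then collapse the duplicate poi_id rows.
# max() over boolean indicators is True if any of the rows is True (logical OR).
encoded = pd.get_dummies(df_toy, columns=["category"])
merged = encoded.groupby("poi_id").max().reset_index()

print(merged)
```

After merging, POI 1 has both `category_Gardens` and `category_Landmarks` set, so multi-category POIs keep all their categories in a single row.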


In [7]:
# Textual Features - Token count
# Attributes: description

textual_cols = ['description']

# Copy df_pois with only the specified textual columns
df_cols = df_pois.copy()
df_cols = df_cols[['poi_id'] + textual_cols]

# retaining only distinct entries
df_cols = df_cols.drop_duplicates()

# create a mask to check if description column contains empty strings
empty_description = df_cols['description'] == ''
# Fill empty strings with "NULL"
df_cols.loc[empty_description, 'description'] = 'NULL'

# initialize token counter, ignore stop words
count_vectorizer = CountVectorizer(stop_words="english")

# Create an empty DataFrame to store token counts
df_textual_cols = pd.DataFrame()

# Iterate over each POI and its description
for index, row in df_cols.iterrows():

    # Tokenize the description
    description = [row['description']]
    # print(f'description: {description}')
    
    # Count token and store in sparse matrix, then convert to dense matrix
    sparse_matrix = count_vectorizer.fit_transform(description)
    doc_term_matrix = sparse_matrix.todense()
    
    # Create DataFrame from the dense matrix
    df_token_counts = pd.DataFrame(
        doc_term_matrix,
        columns=count_vectorizer.get_feature_names_out(),
        index=[row['poi_id']]
    )
    
    # Append this POI's token counts to df_textual_cols
    df_textual_cols = pd.concat([df_textual_cols, df_token_counts])

# Reset index, rename to poi_id, and fill NaN values with 0
df_textual_cols.reset_index(inplace=True)
df_textual_cols = df_textual_cols.rename(columns={'index': 'poi_id'})
df_textual_cols.fillna(0, inplace=True)

df_textual_cols



Unnamed: 0,poi_id,101,artistry,bay,bring,central,city,comprising,downtown,east,...,enthusiasts,explore,leisure,professional,seasonal,selection,specially,suit,teamwork,whip
0,2149128,1.0,1.0,6.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,310900,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4400781,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,324542,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,678639,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64,17821111,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
65,17738872,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
66,26356283,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
67,21353012,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
# Compute pair-wise similarity between pois

# Initialize similarity matrix
similarity_matrix = {}

# Calculate similarity for each pair of distinct POIs
for i in range(len(df_distinct_pois)):
    for j in range(i+1, len(df_distinct_pois)):

        # get poi id in pairs
        poi1_id, poi2_id = df_distinct_pois.iloc[i]['poi_id'], df_distinct_pois.iloc[j]['poi_id']
        #print(f"poi1_id: {poi1_id}")
        #print(f"poi2_id: {poi2_id}")
        

        # Calculate Jaccard similarity for categorical attributes

        # Get selected rows from DataFrame and convert to 1D array-like format
        poi1_categorical_row = df_categorical_cols[df_categorical_cols['poi_id'] == poi1_id].iloc[:, 1:].values.flatten()
        poi2_categorical_row = df_categorical_cols[df_categorical_cols['poi_id'] == poi2_id].iloc[:, 1:].values.flatten()
        #print(f"poi1_categorical_row: {type(poi1_categorical_row)}, value: {poi1_categorical_row}")
        #print(f"poi2_categorical_row: {type(poi2_categorical_row)}, value: {poi2_categorical_row}")

        cat_cols_similarity = jaccard_score(poi1_categorical_row, poi2_categorical_row)
        #print(f"cat_cols_similarity: {cat_cols_similarity}")
        

        # Calculate euclidean distance similarity for numerical attributes

        poi1_numerical_row = df_numerical_cols[df_numerical_cols['poi_id'] == poi1_id].iloc[:, 1:].values.flatten()
        poi2_numerical_row = df_numerical_cols[df_numerical_cols['poi_id'] == poi2_id].iloc[:, 1:].values.flatten()
        #print(f"poi1_numerical_row: {type(poi1_numerical_row)}, value: {poi1_numerical_row}")
        #print(f"poi2_numerical_row: {type(poi2_numerical_row)}, value: {poi2_numerical_row}")

        euclidean_distance = math.dist(poi1_numerical_row, poi2_numerical_row)
        num_cols_similarity = 1 / ( 1 + euclidean_distance )
        #print(f"num_cols_similarity: {num_cols_similarity}")


        # Calculate cosine similarity for textual attributes
        poi1_textual_row = df_textual_cols[df_textual_cols['poi_id'] == poi1_id].iloc[:, 1:].values.flatten()
        poi2_textual_row = df_textual_cols[df_textual_cols['poi_id'] == poi2_id].iloc[:, 1:].values.flatten()
        #print(f"poi1_textual_row: {type(poi1_textual_row)}, value: {poi1_textual_row}")
        #print(f"poi2_textual_row: {type(poi2_textual_row)}, value: {poi2_textual_row}")

        text_cols_similarity = cosine_similarity([poi1_textual_row, poi2_textual_row])[0][1]
        #print(f"text_cols_similarity: {text_cols_similarity}")


        # Compute weighted overall similarity

        num_cat_cols = len(categorical_cols)
        num_num_cols = len(numerical_cols)
        num_text_cols = len(textual_cols)
        similarity = ( num_cat_cols * cat_cols_similarity + num_num_cols * num_cols_similarity + num_text_cols * text_cols_similarity ) / ( num_cat_cols + num_num_cols + num_text_cols )
        
        # Store similarity in the matrix
        similarity_matrix[(poi1_id, poi2_id)] = similarity

# Convert similarity matrix to DataFrame
df_similarity = pd.DataFrame(similarity_matrix.items(), columns=['POI Pair', 'Similarity'])
# Drop rows where Similarity is less than 0.5
df_similarity = df_similarity[df_similarity['Similarity'] >= 0.5]
# Split the 'POI Pair' column into two separate columns
df_similarity[['poi1_id', 'poi2_id']] = pd.DataFrame(df_similarity['POI Pair'].tolist(), index=df_similarity.index)
# Drop the original 'POI Pair' column
df_similarity.drop(columns=['POI Pair'], inplace=True)
# Reorder the columns
df_similarity = df_similarity[['poi1_id', 'poi2_id', 'Similarity']]
# Reorder the DataFrame by the column "Similarity"
df_similarity = df_similarity.sort_values(by='Similarity', ascending=False)
# Reindex the DataFrame
df_similarity = df_similarity.reset_index(drop=True)

df_similarity

Unnamed: 0,poi1_id,poi2_id,Similarity
0,317473,317421,0.920007
1,3459679,12105952,0.837722
2,1583427,14149881,0.836683
3,3915753,23138270,0.820727
4,19786382,23235560,0.807692
...,...,...,...
1164,324543,17821111,0.502522
1165,17434131,1888871,0.502424
1166,310900,4400781,0.502298
1167,17434131,1888873,0.501096


The similarity cut-off is set to 0.5.

This yields 1169 pairs (undirected relationships) of similar POIs.
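
The nested loop above filters three DataFrames for every pair, which gets slow as the number of POIs grows. A possible vectorized sketch is shown below. It assumes three feature matrices whose rows are aligned to the same POI order; the matrices `X_cat`, `X_num` and `X_text` are made-up toy data, and `pairwise_distances(..., metric="jaccard")` is swapped in for `jaccard_score`.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Toy feature matrices, one row per POI, rows in the same order
X_cat = np.array([[1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1]], dtype=bool)
X_num = np.array([[0.2, 0.9], [0.2, 0.9], [1.0, 0.0]])
X_text = np.array([[1, 0, 2], [1, 0, 2], [0, 3, 0]], dtype=float)

# Jaccard similarity = 1 - Jaccard distance on boolean rows
jaccard_sim = 1.0 - pairwise_distances(X_cat, metric="jaccard")
# Euclidean distance mapped to (0, 1], as in the loop above
euclid_sim = 1.0 / (1.0 + euclidean_distances(X_num))
cosine_sim = cosine_similarity(X_text)

# Weights follow the notebook: 4 categorical, 8 numerical, 1 textual attribute
w_cat, w_num, w_text = 4, 8, 1
similarity = (w_cat * jaccard_sim + w_num * euclid_sim + w_text * cosine_sim) / (w_cat + w_num + w_text)

# Keep each unordered pair once (upper triangle) and apply the 0.5 cut-off
rows, cols = np.triu_indices(len(similarity), k=1)
pairs = [(int(a), int(b), float(similarity[a, b])) for a, b in zip(rows, cols) if similarity[a, b] >= 0.5]

print(pairs)
```

Edge cases such as two all-zero categorical rows may be handled differently by the two Jaccard implementations, so results should be compared against the loop before replacing it.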

## Write similarity back to database

Create a new relationship between similar POIs, so that later recommendation queries can retrieve them with a simple Cypher query.

In [17]:
# Write CBF_SIMILAR relationships between POIs, with the similarity stored as the `score` property
# duration: 5m

# Parameterised query: values are passed separately instead of being formatted into the string
query = """
MATCH (poi1:Poi {id: $poi1_id})
MATCH (poi2:Poi {id: $poi2_id})
MERGE (poi1)-[s1:CBF_SIMILAR]->(poi2)
ON CREATE SET s1.score = $similarity
MERGE (poi1)<-[s2:CBF_SIMILAR]-(poi2)
ON CREATE SET s2.score = $similarity
"""

# Iterate over the DataFrame rows and write the relationships to Neo4j
for index, row in df_similarity.iterrows():
    # iterrows() upcasts mixed rows to float, so cast the ids back to int
    params = {
        'poi1_id': int(row['poi1_id']),
        'poi2_id': int(row['poi2_id']),
        'similarity': float(row['Similarity']),
    }
    # One relationship per direction, because Neo4j relationships are always directed
    graph.run(query, params)
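
Writing one pair per query accounts for most of the five minutes. A common alternative is to send all pairs in a single query with `UNWIND`. The sketch below is an assumption-laden outline, not the notebook's method: `df_sim` is a stand-in for `df_similarity`, and `write_similarities` is a hypothetical helper that expects the notebook's `gds` client.

```python
import pandas as pd

# Stand-in for df_similarity (first two rows of the table above)
df_sim = pd.DataFrame({
    "poi1_id": [317473, 3459679],
    "poi2_id": [317421, 12105952],
    "Similarity": [0.920007, 0.837722],
})

# Plain Python types so the driver can serialise them as query parameters
rows = [
    {"poi1_id": int(r.poi1_id), "poi2_id": int(r.poi2_id), "score": float(r.Similarity)}
    for r in df_sim.itertuples(index=False)
]

BATCH_QUERY = """
UNWIND $rows AS row
MATCH (poi1:Poi {id: row.poi1_id})
MATCH (poi2:Poi {id: row.poi2_id})
MERGE (poi1)-[s1:CBF_SIMILAR]->(poi2)
ON CREATE SET s1.score = row.score
MERGE (poi2)-[s2:CBF_SIMILAR]->(poi1)
ON CREATE SET s2.score = row.score
"""

def write_similarities(gds_client, rows):
    """Hypothetical helper: send all pairs in one query via the GDS client."""
    return gds_client.run_cypher(BATCH_QUERY, params={"rows": rows})
```

Calling `write_similarities(gds, rows)` would then replace the loop. For much larger pair lists, splitting `rows` into chunks keeps each transaction small.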

## Exploring the results

Inspect the results using Cypher.

Filter on the `CBF_SIMILAR` relationship type to select the similarity relationships.

In [18]:
# list the most similar pairs (highest similarity first)
gds.run_cypher(
    """
        MATCH (p1:Poi)-[r:CBF_SIMILAR]->(p2:Poi)
        RETURN p1.name AS poi1, p2.name AS poi2, r.score AS similarity
        ORDER BY similarity DESCENDING, poi1, poi2
        LIMIT 30
    """
)

Unnamed: 0,poi1,poi2,similarity
0,City Hall Building,Old Supreme Court Building,0.920007
1,Old Supreme Court Building,City Hall Building,0.920007
2,Changi Beach,Lazarus Island,0.837722
3,Lazarus Island,Changi Beach,0.837722
4,Former Ford Factory,Surviving the Japanese Occupation: War and Its...,0.836683
5,Surviving the Japanese Occupation: War and Its...,Former Ford Factory,0.836683
6,Smith Street,Tanjong Pagar,0.820727
7,Tanjong Pagar,Smith Street,0.820727
8,Gallop Extension,Gardens By The Bay East,0.807692
9,Gardens By The Bay East,Gallop Extension,0.807692


In [19]:
# list the least similar pairs (lowest similarity first)
gds.run_cypher(
    """
        MATCH (p1:Poi)-[r:CBF_SIMILAR]->(p2:Poi)
        RETURN p1.name AS poi1, p2.name AS poi2, r.score AS similarity
        ORDER BY similarity ASC, poi1, poi2
        LIMIT 30
    """
)

Unnamed: 0,poi1,poi2,similarity
0,Ngee Ann City,Supertree Grove,0.500311
1,Supertree Grove,Ngee Ann City,0.500311
2,Bugis Junction,Floral Fantasy,0.501096
3,Floral Fantasy,Bugis Junction,0.501096
4,Cloud Forest,Singapore Botanic Gardens,0.502298
5,Singapore Botanic Gardens,Cloud Forest,0.502298
6,Floral Fantasy,Sri Krishnan Temple,0.502424
7,Sri Krishnan Temple,Floral Fantasy,0.502424
8,Changi Chapel and Museum,Low Salt Low Sugar,0.502522
9,Low Salt Low Sugar,Changi Chapel and Museum,0.502522


The stored similarity scores range from 0.500311 to 0.920007.

## Making recommendations

Recommend POIs for a target POI based on similarity, using a simple Cypher query.

In [44]:
# FUNCTION: make recommendation based on Content Based Filtering Recommendations - Node Similarity
# INPUT: poi_id
# OUTPUT: dataframe[poi_id, rec_poi_id]

def similar_poi_recommendation(poi_id):
    result = gds.run_cypher(
        """
            MATCH (p1:Poi {id: $target_poi})-[s:CBF_SIMILAR]->(p2:Poi)
            RETURN p1.id as poi_id, p2.id as rec_poi_id
            ORDER BY s.score DESC
        """, params = {'target_poi': poi_id}
    )
    result = result.drop_duplicates()
    return result

In [45]:
# target poi id
poi_id = 1815807

similar_poi_recommendation(poi_id)

Unnamed: 0,poi_id,rec_poi_id
0,1815807,7221059
1,1815807,310896


## Evaluation

In [39]:
# dataframes of pois
df_pois = gds.run_cypher("""\
    MATCH (poi:Poi)    
    RETURN poi.id
    """)

df_pois

Unnamed: 0,poi.id
0,2149128
1,310900
2,4400781
3,324542
4,678639
...,...
64,17821111
65,17738872
66,26356283
67,21353012


In [40]:
# dataframes of reviews
# duration: 30s

df_reviews = gds.run_cypher("""\
    MATCH (user:User)-[review:REVIEWED]->(poi:Poi)
    RETURN user.id AS user_id, poi.id AS poi_id
    """)

df_reviews

Unnamed: 0,user_id,poi_id
0,847,2149128
1,21070,2149128
2,21061,2149128
3,21003,2149128
4,21227,2149128
...,...,...
85029,58650,7275891
85030,58652,7275891
85031,58654,17821111
85032,58656,17821111


In [41]:
# Group by 'user_id' and count occurrences
user_counts = df_reviews.groupby('user_id').size()

# Keep only users with at least 5 reviews
valid_users = user_counts[user_counts >= 5].index

# Filter the original DataFrame based on valid users
filtered_df_reviews = df_reviews[df_reviews['user_id'].isin(valid_users)].copy()
filtered_df_reviews

Unnamed: 0,user_id,poi_id
277,20980,2149128
329,20419,2149128
644,20445,2149128
712,20803,2149128
764,20108,2149128
...,...,...
84680,1753,1888873
84685,39201,1888873
84688,6967,1888873
84702,21691,1888873


In [42]:
# Splitting the dataset into 90% training and 10% test sets
df_train, df_test = train_test_split(filtered_df_reviews, test_size=0.1, random_state=100)

df_train

Unnamed: 0,user_id,poi_id
43559,14882,678639
52599,17393,8634325
78604,25812,315470
81971,8079,13078277
66625,13223,8016698
...,...,...
80529,29227,310896
47967,29380,1837767
14022,7053,2149128
4555,16889,2149128


In [43]:
df_test

Unnamed: 0,user_id,poi_id
14227,7435,2149128
48823,23137,1837767
59373,41985,1888876
12031,9481,2149128
81597,11220,2138910
...,...,...
44908,4635,678639
82916,38946,2139492
57328,27424,644919
53084,41524,317415


In [46]:
# retrieve recommendation for row in test set
# duration: 500 instances takes 2 minutes

df_all_retrieved = pd.DataFrame()
for index, row in df_test.iterrows():
    poi_id = row['poi_id']

    recommended_interactions = similar_poi_recommendation(poi_id)

    # Append this POI's recommendations to df_all_retrieved
    df_all_retrieved = pd.concat([df_all_retrieved, recommended_interactions], ignore_index=True)

# Drop duplicate rows (the same POI can occur many times in the test set)
df_all_retrieved = df_all_retrieved.drop_duplicates()

df_all_retrieved

Unnamed: 0,poi_id,rec_poi_id
0,1888876,23138270
1,1888876,22834159
2,1888876,3915753
3,1888876,1888871
4,1888876,12105952
...,...,...
3948,17434131,7221059
3949,17434131,4400781
3950,17434131,310895
3951,17434131,1888871


In [47]:
# Extracting the true interactions of all poi-poi pair reviewed by distinct user
# duration: 1m

# Group by user_id and aggregate poi_id as a list
grouped = df_reviews.groupby('user_id')['poi_id'].apply(list)

# Initialize an empty DataFrame for the result
df_true_interactions = pd.DataFrame(columns=['poi_id', 'rec_poi_id'])

# Iterate through each group
for user_id, poi_ids in grouped.items():
    # Create pairs of poi_id and rec_poi_id for each user
    pairs = [(poi_id, other_poi_id) for poi_id in poi_ids for other_poi_id in poi_ids if poi_id != other_poi_id]
    df_pairs = pd.DataFrame(pairs, columns=['poi_id', 'rec_poi_id'])
    # Add pairs to the result DataFrame
    df_true_interactions = pd.concat([df_true_interactions, df_pairs], ignore_index=True)

df_true_interactions = df_true_interactions.drop_duplicates()
# Display the result DataFrame
df_true_interactions

Unnamed: 0,poi_id,rec_poi_id
0,2149128,315470
1,315470,2149128
2,2149128,1837767
3,1837767,2149128
4,2149128,644919
...,...,...
83163,14904083,8178306
83164,8178306,3915753
83165,3915753,8178306
83166,317421,1888873
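
The co-review pairs can also be built with `itertools.permutations` and a set, which avoids growing a DataFrame inside the loop. This is a sketch on toy data; `df_toy_reviews` is a hypothetical stand-in for `df_reviews`.

```python
from itertools import permutations

import pandas as pd

# Hypothetical stand-in for df_reviews
df_toy_reviews = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "poi_id": [10, 20, 30, 10, 20, 40],
})

# Collect ordered pairs in a set, which removes duplicates as we go
pair_set = set()
for _, poi_ids in df_toy_reviews.groupby("user_id")["poi_id"]:
    # permutations over the distinct POIs yields every (a, b) with a != b
    pair_set.update(permutations(poi_ids.unique().tolist(), 2))

df_pairs = pd.DataFrame(sorted(pair_set), columns=["poi_id", "rec_poi_id"])

print(df_pairs)
```

User 3 reviewed a single POI and therefore contributes no pairs, and the pairs shared by users 1 and 2 are stored only once.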


In [56]:
# Get all relevant instances by merging the true interactions and test instance
df_all_relevant = pd.merge(df_true_interactions, df_test, on=['poi_id'], how='inner')
# Remove the user_id column
df_all_relevant = df_all_relevant.drop(columns=['user_id'])
# check poi pair not same poi
df_all_relevant = df_all_relevant[df_all_relevant['poi_id'] != df_all_relevant['rec_poi_id']]
# drop duplicate
df_all_relevant = df_all_relevant.drop_duplicates()

df_all_relevant

Unnamed: 0,poi_id,rec_poi_id
0,2149128,315470
48,2149128,1837767
96,2149128,644919
144,2149128,8016698
192,2149128,317415
...,...,...
20637,1888871,1888873
20638,1888871,317473
20639,1888871,317438
20640,1888871,3915753


In [57]:
# Get all relevant retrieved instance by merging the true interactions and recommended interactions
df_retrived_relevant = pd.merge(df_all_retrieved, df_all_relevant, on=['poi_id', 'rec_poi_id'], how='inner')

df_retrived_relevant

Unnamed: 0,poi_id,rec_poi_id
0,1888876,3915753
1,1888876,1888871
2,1888876,12105952
3,1888876,8634325
4,1888876,3459679
...,...,...
297,8574463,1888876
298,8574463,13078277
299,17434131,310896
300,17434131,8634325


In [58]:
# calculate the precision score
relevant_retrieved = df_retrived_relevant.shape[0]
all_retrived = df_all_retrieved.shape[0]

precision = relevant_retrieved / all_retrived

print(f'Precision Score: {precision}')

Precision Score: 0.3471264367816092


In [59]:
# calculate the recall score
relevant_retrieved = df_retrived_relevant.shape[0]
all_relevant = df_all_relevant.shape[0]

recall = relevant_retrieved / all_relevant
print(f'Recall Score: {recall}')

Recall Score: 0.2873453853472883


In [60]:
# calculate the coverage score
num_recommended_pois = df_all_retrieved['rec_poi_id'].nunique()
num_all_pois = df_pois.shape[0]

coverage = num_recommended_pois / num_all_pois
print(f'Coverage Score: {coverage}')

Coverage Score: 0.8840579710144928
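
The three scores above can be summarised with plain sets. The sketch below uses small made-up pair sets, not the notebook's data, where each pair is `(poi_id, rec_poi_id)`:

```python
# Made-up (poi_id, rec_poi_id) pairs standing in for the DataFrames above
retrieved = {(1, 2), (1, 3), (2, 1), (4, 1)}          # recommended pairs
relevant = {(1, 2), (2, 1), (2, 3), (3, 1), (3, 2)}   # co-reviewed pairs
all_pois = {1, 2, 3, 4, 5}

# Recommended pairs that are also relevant
hits = retrieved & relevant

# Precision: share of recommended pairs that are relevant
precision = len(hits) / len(retrieved)
# Recall: share of relevant pairs that were recommended
recall = len(hits) / len(relevant)
# Coverage: share of all POIs that appear at least once as a recommendation
coverage = len({rec for _, rec in retrieved}) / len(all_pois)

print(f"precision={precision:.2f} recall={recall:.2f} coverage={coverage:.2f}")
```

With these toy sets, two of the four recommended pairs are relevant, two of the five relevant pairs are found, and three of the five POIs are ever recommended.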
