# Ensemble Learning - Majority Voting

## Introduction


This notebook builds an ensemble recommender by majority voting over the following recommendation algorithms:

- Content Based Filtering - Heuristic

- Content Based Filtering - Node Similarity

- Collaborative Filtering - UserKnn with FastRP

- Collaborative Filtering - ItemKnn with FastRP

## Prerequisites

Neo4j server with a recent version (2.0+) of GDS installed.

The `graphdatascience` Python library to operate Neo4j GDS.

`Cypher` queries to generate recommendations.

`py2neo` package to write pandas DataFrames back to the Neo4j database.

## Setup

Install and import the dependencies, then set up the Neo4j Python driver, py2neo, and the GDS client connection to the database.

In [1]:
# Install necessary dependencies
%pip install graphdatascience
%pip install matplotlib
%pip install scikit-learn
%pip install py2neo

Note: you may need to restart the kernel to use updated packages.




In [2]:
import os
import textwrap
import configparser

import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score

from py2neo import Graph
from neo4j import GraphDatabase
from graphdatascience import GraphDataScience



In [3]:
# Using an ini file for credentials, otherwise providing defaults
HOST = 'neo4j://localhost'
USER = 'neo4j'
DATABASE = 'neo4j'
PASSWORD = 'password'

NEO4J_CONF_FILE = 'neo4j.ini'

if NEO4J_CONF_FILE is not None and os.path.exists(NEO4J_CONF_FILE):
    config = configparser.RawConfigParser()
    config.read(NEO4J_CONF_FILE)
    HOST = config['NEO4J']['HOST']
    # USER is optional in the ini file; fall back to the default user name
    USER = config['NEO4J'].get('USER', USER)
    DATABASE = config['NEO4J']['DATABASE']
    PASSWORD = config['NEO4J']['PASSWORD']
    print(f'Using custom database properties \nHOST: {HOST}; DATABASE: {DATABASE}; PASSWORD: {PASSWORD}')
else:
    print('Could not find database properties file, using defaults')

# auth expects (user name, password); the database name is not a credential

# Connecting with neo4j python driver
driver = GraphDatabase.driver(HOST, auth=(USER, PASSWORD))

# Connecting with the Neo4j database using GDS library
gds = GraphDataScience(HOST, auth=(USER, PASSWORD))

# Connect to Neo4j database using py2neo
graph = Graph(HOST, auth=(USER, PASSWORD))

Using custom database properties 
HOST: bolt://3.94.20.148:7687; DATABASE: neo4j; PASSWORD: anthems-schedulers-blade
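
The connection cell reads `neo4j.ini` with `configparser` and looks up `HOST`, `DATABASE` and `PASSWORD` in a `NEO4J` section. A minimal sketch of such a file, parsed from a string so it runs without touching disk; every value is a placeholder, not a real credential:

```python
import configparser

# Hypothetical contents of neo4j.ini; every value here is a placeholder.
SAMPLE_INI = """
[NEO4J]
HOST = neo4j://localhost
DATABASE = neo4j
PASSWORD = change-me
"""

config = configparser.RawConfigParser()
config.read_string(SAMPLE_INI)

host = config['NEO4J']['HOST']
database = config['NEO4J']['DATABASE']
password = config['NEO4J']['PASSWORD']

# Log the connection target without echoing the password.
print(f'HOST: {host}; DATABASE: {database}')
```

Leaving the password out of any printed message keeps it from ending up in saved notebook output.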


In [4]:
# driver helper function
def run(driver, query, params=None):
    with driver.session() as session:
        if params is not None:
            return [r for r in session.run(query, params)]
        else:
            return [r for r in session.run(query)]

## 1) Content Based Filtering Recommendations - Heuristic Method

### Query Function

In [5]:
# FUNCTION: make recommendation based on Content Based Filtering Recommendations - Heuristic Method
# INPUT: user_id, poi_id
# OUTPUT: dataframe[user_id, poi_id, rec_poi_id]

def heuristic_recommendation(user_id, poi_id):
    # get pois in the same region as reviewed_poi by the user
    records_region = run(driver, textwrap.dedent("""\
        MATCH (user {id: $user_id})-[:REVIEWED]->(poi:Poi {id: $poi_id})-[:LOCATED_AT]->(region:Region)<-[:LOCATED_AT]-(other_poi:Poi)<-[rated:RATED]-(review:Review)
        WHERE poi <> other_poi
        WITH user, poi, other_poi, region, count(DISTINCT rated) AS num_reviews
        RETURN user.id AS user_id, poi.id AS poi_id, other_poi.id AS rec_poi_id, region.name AS region, num_reviews AS occurrences
        """),
        params = {'user_id': user_id, 'poi_id': poi_id}
    )

    # get pois in the same category as reviewed_poi by the user 
    records_category = run(driver, textwrap.dedent("""\
        MATCH (user {id: $user_id})-[:REVIEWED]->(poi:Poi {id: $poi_id})-[:BELONGS_TO]->(category:Category)<-[:BELONGS_TO]-(other_poi:Poi)<-[rated:RATED]-(review:Review)
        WHERE poi <> other_poi
        WITH user, poi, other_poi, category, count(DISTINCT rated) AS num_reviews
        RETURN user.id AS user_id, poi.id AS poi_id, other_poi.id AS rec_poi_id, category.name AS category_name, num_reviews AS occurrences
        """),
        params = {'user_id': user_id, 'poi_id': poi_id}
    )

    
    # Convert the results to DataFrames
    if records_region:
        df_records_region = pd.DataFrame([dict(record) for record in records_region])
        # Group by 'user_id', 'poi_id', 'rec_poi_id', and 'occurrences'; the group size becomes the weight
        df_records_region_agg = df_records_region.groupby(['user_id', 'poi_id', 'rec_poi_id', 'occurrences']).size().reset_index(name='weight')
    else:
        df_records_region_agg = pd.DataFrame(columns=['user_id', 'poi_id', 'rec_poi_id', 'occurrences', 'weight'])

    if records_category:
        df_records_category = pd.DataFrame([dict(record) for record in records_category])
        # Group by 'user_id', 'poi_id', 'rec_poi_id', and 'occurrences'; the group size becomes the weight
        df_records_category_agg = df_records_category.groupby(['user_id', 'poi_id', 'rec_poi_id', 'occurrences']).size().reset_index(name='weight')
    else:
        df_records_category_agg = pd.DataFrame(columns=['user_id', 'poi_id', 'rec_poi_id', 'occurrences', 'weight'])

    # Compute the appearance frequency of pois in both lists

    # Merge the two DataFrames on 'rec_poi_id'
    recommended_interactions = pd.merge(df_records_region_agg, df_records_category_agg, on='rec_poi_id', suffixes=('_region', '_category'), how='outer')

    # Fill NaN values in '_region' columns with values from '_category' columns.
    # Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about.
    for col in ['user_id', 'poi_id', 'occurrences']:
        recommended_interactions[f'{col}_region'] = recommended_interactions[f'{col}_region'].fillna(recommended_interactions[f'{col}_category'])

    # Rename the '_region' columns
    recommended_interactions = recommended_interactions.rename(columns={
        'user_id_region': 'user_id',
        'poi_id_region': 'poi_id',
        'occurrences_region': 'occurrences',
    })

    # Fill NaN values with 0 for the 'weight' columns
    recommended_interactions['weight_region'] = recommended_interactions['weight_region'].fillna(0)
    recommended_interactions['weight_category'] = recommended_interactions['weight_category'].fillna(0)
    # Sum the 'weight' columns to get the total weight
    recommended_interactions['total_weight'] = recommended_interactions['weight_region'] + recommended_interactions['weight_category']

    # Drop the helper columns that are no longer needed
    recommended_interactions = recommended_interactions.drop(columns=['user_id_category', 'poi_id_category', 'occurrences_category', 'weight_region', 'weight_category'])
    # Order the DataFrame by 'total_weight' in descending order, then by 'occurrences'
    recommended_interactions = recommended_interactions.sort_values(by=['total_weight', 'occurrences'], ascending=[False, False])
    # Reindex the DataFrame
    recommended_interactions = recommended_interactions.reset_index(drop=True)
    # Keep only the output columns
    recommended_interactions = recommended_interactions[['user_id', 'poi_id', 'rec_poi_id']]
    # Drop duplicates
    recommended_interactions = recommended_interactions.drop_duplicates()

    # Return the ranked recommendations
    return recommended_interactions
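
The ranking rule inside `heuristic_recommendation` can be tried without a database. The sketch below re-creates it in plain Python on made-up data: a POI found through both the region and the category query gets weight 2, one found through a single query gets weight 1, and ties are broken by the number of reviews. The POI ids and review counts are hypothetical.

```python
# Toy stand-ins for the two query results: rec_poi_id -> number of reviews.
region_hits = {101: 30, 102: 5}
category_hits = {102: 5, 103: 12}

# For each candidate POI, count in how many lists it appears (the weight)
# and remember its review count (the tie-breaker).
scores = {}
for hits in (region_hits, category_hits):
    for poi, occurrences in hits.items():
        weight, _ = scores.get(poi, (0, occurrences))
        scores[poi] = (weight + 1, occurrences)

# Sort by weight first, then by review count, both descending.
ranking = sorted(scores, key=lambda poi: scores[poi], reverse=True)
print(ranking)
```

POI 102 ranks first because it shares both the region and the category with the target, even though it has the fewest reviews.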

## 2) Content-based Filtering Recommendations - Node Similarity

### Preparation

In [6]:
# duration: 6m

# Extract raw data of node poi and its attributes from GDS
result = gds.run_cypher("""
MATCH (poi:Poi)
OPTIONAL MATCH (poi)-[:BELONGS_TO]->(category:Category)
OPTIONAL MATCH (poi)-[:LOCATED_AT]->(region:Region)
RETURN poi.id AS poi_id, 
       poi.name AS name, 
                        
       poi.description AS description, 

       poi.openingHours AS opening_hours, 
       poi.duration AS duration, 
       category.name AS category, 
       region.name AS region,
                        
       poi.price AS price, 
       poi.avgRating AS avg_rating, 
       poi.numReviews AS num_reviews, 
       poi.numReviews_5 AS num_reviews_5, 
       poi.numReviews_4 AS num_reviews_4, 
       poi.numReviews_3 AS num_reviews_3, 
       poi.numReviews_2 AS num_reviews_2, 
       poi.numReviews_1 AS num_reviews_1
""")

# Convert result to DataFrame
df_pois = pd.DataFrame(result)

# Extracting distinct poi_id and poi_name
df_distinct_pois = df_pois.copy()
df_distinct_pois = df_distinct_pois[['poi_id', 'name']].drop_duplicates()

# Numerical Features - Min-Max Normalization
# Attributes: 'price', 'avg_rating', 'num_reviews', 'num_reviews_5', 'num_reviews_4', 'num_reviews_3', 'num_reviews_2', 'num_reviews_1'

scaler = MinMaxScaler()
numerical_cols = ['price', 'avg_rating', 'num_reviews', 'num_reviews_5', 'num_reviews_4', 'num_reviews_3', 'num_reviews_2', 'num_reviews_1']
# Create a new DataFrame with scaled columns and poi_id
df_numerical_cols = df_pois.copy()
df_numerical_cols = df_numerical_cols[['poi_id'] + numerical_cols]
# retaining only distinct entries
df_numerical_cols = df_numerical_cols.drop_duplicates()
# Fill missing values in numerical columns with 0
df_numerical_cols.fillna(0, inplace=True)

# scale properties
df_numerical_cols[numerical_cols] = scaler.fit_transform(df_numerical_cols[numerical_cols])


# Categorical Features - One-hot Encoding
# Attributes: category, region, opening_hours, duration

categorical_cols = ['category', 'region', 'opening_hours', 'duration']

# Copy df_pois with only the specified categorical columns
df_categorical_cols = df_pois.copy()
df_categorical_cols = df_categorical_cols[['poi_id'] + categorical_cols]

# Do one-hot encoding for categorical columns
df_categorical_cols = pd.get_dummies(df_categorical_cols, columns=categorical_cols)


# Merge rows with the same poi_id while applying OR logical operation
df_categorical_cols = df_categorical_cols.groupby('poi_id').max().reset_index()


# Textual Features - Token count
# Attributes: description

textual_cols = ['description']

# Copy df_pois with only the specified textual columns
df_cols = df_pois.copy()
df_cols = df_cols[['poi_id'] + textual_cols]

# retaining only distinct entries
df_cols = df_cols.drop_duplicates()

# create a mask to check if description column contains empty strings
empty_description = df_cols['description'] == ''
# Fill empty strings with "NULL"
df_cols.loc[empty_description, 'description'] = 'NULL'

# initialize token counter, ignore stop words
count_vectorizer = CountVectorizer(stop_words="english")

# Count tokens for all descriptions in one pass. Fitting once on the whole
# corpus gives the same token counts as fitting per row and concatenating,
# without growing a DataFrame inside a loop.
sparse_matrix = count_vectorizer.fit_transform(df_cols['description'])

# Create a DataFrame of token counts, one row per POI
df_textual_cols = pd.DataFrame(
    sparse_matrix.toarray(),
    columns=count_vectorizer.get_feature_names_out(),
    index=df_cols['poi_id'].values
)

# Reset index, rename to poi_id, and fill NaN values with 0
df_textual_cols = df_textual_cols.reset_index().rename(columns={'index': 'poi_id'})
df_textual_cols.fillna(0, inplace=True)


# Compute pair-wise similarity between pois

# Initialize similarity matrix
similarity_matrix = {}

# Calculate similarity for each pair of distinct POIs
for i in range(len(df_distinct_pois)):
    for j in range(i+1, len(df_distinct_pois)):

        # get poi id in pairs

        poi1_id, poi2_id = df_distinct_pois.iloc[i]['poi_id'], df_distinct_pois.iloc[j]['poi_id']

        # Calculate Jaccard similarity for categorical attributes

        poi1_categorical_row = df_categorical_cols[df_categorical_cols['poi_id'] == poi1_id].iloc[:, 1:].values.flatten()
        poi2_categorical_row = df_categorical_cols[df_categorical_cols['poi_id'] == poi2_id].iloc[:, 1:].values.flatten()
        cat_cols_similarity = jaccard_score(poi1_categorical_row, poi2_categorical_row)

        # Calculate euclidean distance similarity for numerical attributes

        poi1_numerical_row = df_numerical_cols[df_numerical_cols['poi_id'] == poi1_id].iloc[:, 1:].values.flatten()
        poi2_numerical_row = df_numerical_cols[df_numerical_cols['poi_id'] == poi2_id].iloc[:, 1:].values.flatten()
        euclidean_distance = math.dist(poi1_numerical_row, poi2_numerical_row)
        num_cols_similarity = 1 / ( 1 + euclidean_distance )

        # Calculate cosine similarity for textual attributes

        poi1_textual_row = df_textual_cols[df_textual_cols['poi_id'] == poi1_id].iloc[:, 1:].values.flatten()
        poi2_textual_row = df_textual_cols[df_textual_cols['poi_id'] == poi2_id].iloc[:, 1:].values.flatten()
        text_cols_similarity = cosine_similarity([poi1_textual_row, poi2_textual_row])[0][1]

        # Compute weighted overall similarity

        num_cat_cols = len(categorical_cols)
        num_num_cols = len(numerical_cols)
        num_text_cols = len(textual_cols)
        similarity = ( num_cat_cols * cat_cols_similarity + num_num_cols * num_cols_similarity + num_text_cols * text_cols_similarity ) / ( num_cat_cols + num_num_cols + num_text_cols )
        
        # Store similarity in the matrix
        similarity_matrix[(poi1_id, poi2_id)] = similarity

# Convert similarity matrix to DataFrame
df_similarity = pd.DataFrame(similarity_matrix.items(), columns=['POI Pair', 'Similarity'])
# Drop rows where Similarity is less than 0.5
df_similarity = df_similarity[df_similarity['Similarity'] >= 0.5]
# Split the 'POI Pair' column into two separate columns
df_similarity[['poi1_id', 'poi2_id']] = pd.DataFrame(df_similarity['POI Pair'].tolist(), index=df_similarity.index)
# Drop the original 'POI Pair' column
df_similarity.drop(columns=['POI Pair'], inplace=True)
# Reorder the columns
df_similarity = df_similarity[['poi1_id', 'poi2_id', 'Similarity']]
# Reorder the DataFrame by the column "Similarity"
df_similarity = df_similarity.sort_values(by='Similarity', ascending=False)
# Reindex the DataFrame
df_similarity = df_similarity.reset_index(drop=True)


# write CBF_SIMILAR relationships between pois, with the similarity stored as the score property
# duration: 5m

# Query parameters are used instead of string interpolation,
# so values are never spliced into the query text
query = """
MATCH (poi1:Poi {id: $poi1_id})
MATCH (poi2:Poi {id: $poi2_id})
MERGE (poi1)-[s1:CBF_SIMILAR]->(poi2)
ON CREATE SET s1.score = $similarity
MERGE (poi1)<-[s2:CBF_SIMILAR]-(poi2)
ON CREATE SET s2.score = $similarity
"""

# Iterate over the DataFrame rows and write the relationships to Neo4j
for row in df_similarity.itertuples(index=False):
    # Write one relationship in each direction between the two pois.
    # Assumes integer poi ids; the casts turn numpy scalars into plain Python values.
    graph.run(query, parameters={
        'poi1_id': int(row.poi1_id),
        'poi2_id': int(row.poi2_id),
        'similarity': float(row.Similarity),
    })
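
The overall similarity blends three measures, weighted by the number of columns each one covers (4 categorical, 8 numerical, 1 textual). A small worked example in plain Python, with made-up feature vectors and hand-written `jaccard` and `cosine` helpers standing in for the scikit-learn functions, shows how a single pair is scored:

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two equal-length binary vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical feature vectors for two POIs.
cat_a, cat_b = [1, 0, 1, 0], [1, 0, 0, 1]   # one-hot categorical flags
num_a, num_b = [0.2, 0.8], [0.2, 0.5]       # min-max scaled numbers
txt_a, txt_b = [1, 2, 0], [1, 2, 0]         # token counts

cat_sim = jaccard(cat_a, cat_b)                # shared flags / flags set in either
num_sim = 1 / (1 + math.dist(num_a, num_b))    # distance mapped into (0, 1]
txt_sim = cosine(txt_a, txt_b)

# Weights follow the column counts used in the notebook: 4, 8 and 1.
w_cat, w_num, w_txt = 4, 8, 1
similarity = (w_cat * cat_sim + w_num * num_sim + w_txt * txt_sim) / (w_cat + w_num + w_txt)
print(round(similarity, 4))
```

Because the numerical block carries 8 of the 13 weight units, two POIs with similar prices and review counts can pass the 0.5 threshold even when their categories barely overlap.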



### Query Function

In [7]:
# FUNCTION: make recommendation based on Content Based Filtering Recommendations - Node Similarity
# INPUT: poi_id
# OUTPUT: dataframe[poi_id, rec_poi_id]

def similar_poi_recommendation(poi_id):
    result = gds.run_cypher(
        """
            MATCH (p1:Poi {id: $target_poi})-[s:CBF_SIMILAR]->(p2:Poi)
            RETURN p1.id as poi_id, p2.id as rec_poi_id
            ORDER BY s.score DESC
        """, params = {'target_poi': poi_id}
    )
    result = result.drop_duplicates()
    return result

## 3) Collaborative Filtering Recommendations - User-Based kNN based on FastRP embeddings

### Preparation

The projected graph and FastRP embeddings created here are used by both algorithms (3) and (4).

In [8]:
# Projection Graph

# duration: 20s

# define how to project database into GDS
node_projection = ["User", "Poi"]
relationship_projection = {"REVIEWED": {"orientation": "UNDIRECTED", "properties": "rating"}}

# proceed with projection
G, result = gds.graph.project("myGraph", node_projection, relationship_projection)


# Create Fast RP embeddings

# run FastRP and mutate our projected graph with the results
result = gds.fastRP.mutate(
    G,
    randomSeed=42,
    embeddingDimension=256,
    relationshipWeightProperty="rating",
    iterationWeights=[0, 1, 1, 1],
    mutateProperty="embedding"
)

print(f"Number of embedding vectors produced: {result['nodePropertiesWritten']}")

Number of embedding vectors produced: 58725


In [9]:
# Similarity with User-based KNN

# Run the kNN with optimal topK hyperparameter and write back to db
# duration: 3m

topK_best = 12

result = gds.knn.write(
    G,
    topK=topK_best,
    nodeLabels=['User'],
    nodeProperties=["embedding"],
    randomSeed=42,
    concurrency=1,
    sampleRate=1.0,
    deltaThreshold=0.0,
    writeRelationshipType="CF_SIMILAR_USER",
    writeProperty="score",

)

print(f"Relationships produced: {result['relationshipsWritten']}")
print(f"Nodes compared: {result['nodesCompared']}")
print(f"Mean similarity: {result['similarityDistribution']['mean']}")

Relationships produced: 703872
Nodes compared: 58656
Mean similarity: 0.9976186487680783
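
kNN links each node to its `topK` most similar nodes by comparing their embedding vectors. The sketch below illustrates the idea in plain Python with brute-force comparison, assuming cosine similarity as the metric and made-up 3-dimensional embeddings; it is not how GDS implements the search, only a picture of what the resulting relationships mean.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_neighbours(embeddings, k):
    """Return, for every node, its k most similar other nodes."""
    result = {}
    for node, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != node
        ]
        scored.sort(reverse=True)
        result[node] = [other for _, other in scored[:k]]
    return result

# Hypothetical user embeddings.
embeddings = {
    'u1': [1.0, 0.0, 0.0],
    'u2': [0.9, 0.1, 0.0],
    'u3': [0.0, 1.0, 0.0],
    'u4': [0.0, 0.9, 0.1],
}

neighbours = top_k_neighbours(embeddings, k=1)
print(neighbours)
```

Each entry in `neighbours` corresponds to one directed similarity relationship, which is why the relationship count grows with `topK` times the number of nodes compared.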


### Query Function

In [10]:
# FUNCTION: make recommendation based on Collaborative Filtering Recommendations - User-Based kNN based on FastRP embeddings
# INPUT: user_id
# OUTPUT: dataframe[user_id, rec_poi_id]

def userKNN_recommendation(user_id):

    result = gds.run_cypher(
        """
            MATCH (u1:User {id: $target_user})-[s:CF_SIMILAR_USER]->(u2:User)-[:REVIEWED]->(p:Poi)
            WITH u1, p, s.score AS user_similarity
            RETURN u1.id as user_id, p.id as rec_poi_id
            ORDER BY user_similarity DESC, p.avgRating DESC
        """, params = {'target_user': user_id}
    )
    result = result.drop_duplicates()
    return result

## 4) Collaborative Filtering Recommendations - Item-Based kNN based on FastRP embeddings

### Preparation

In [11]:
# Similarity with Item-based KNN

# Run the kNN with optimal topK hyperparameter and write back to db

topK_best = 2

result = gds.knn.write(
    G,
    topK=topK_best,
    nodeLabels = ['Poi'],
    nodeProperties=["embedding"],
    randomSeed=42,
    concurrency=1,
    sampleRate=1.0,
    deltaThreshold=0.0,
    similarityCutoff = 0.5,
    writeRelationshipType="CF_SIMILAR_POI",
    writeProperty="score"
)

print(f"Relationships produced: {result['relationshipsWritten']}")
print(f"Nodes compared: {result['nodesCompared']}")
print(f"Mean similarity: {result['similarityDistribution']['mean']}")

Relationships produced: 112
Nodes compared: 69
Mean similarity: 0.6929127488817487


### Query Function

In [12]:
# FUNCTION: make recommendation based on Collaborative Filtering Recommendations - Item-Based kNN based on FastRP embeddings
# INPUT: poi_id
# OUTPUT: dataframe[poi_id, rec_poi_id]

def itemKNN_recommendation(poi_id):
    result = gds.run_cypher(
        """
            MATCH (p1:Poi {id: $target_poi})-[s:CF_SIMILAR_POI]->(p2:Poi)
            RETURN p1.id as poi_id, p2.id as rec_poi_id
            ORDER BY s.score DESC, p2.avgRating DESC
        """, params = {'target_poi': poi_id}
    )
    result = result.drop_duplicates()
    return result

## Making recommendations with Ensemble Learning

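Combine the ranked lists of the selected algorithms by majority voting: a POI is kept only if at least two algorithms recommend it, and the kept POIs are ordered by number of votes and then by mean rank.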

In [13]:
# FUNCTION: helper that cleans up each df returned by a recommendation algorithm and prepares it for ensemble learning
def df_cleaning (df):
    if not df.empty:    # Reset index, get rank, and re-arrange columns
        df.reset_index(drop=True, inplace=True)          
        df = df.reset_index().rename(columns={'index': 'rank'})
        df['rank'] += 1
        df = df.reindex(columns=['user_id', 'poi_id', 'rec_poi_id', 'rank', 'df_name'])

    return df

In [17]:
# FUNCTION: make recommendation based on Ensemble Learning - Majority Voting
# INPUT: poi_id, user_id, algo_combination
# OUTPUT: dataframe[poi_id, user_id, rec_poi_id]

def ensemble_recommendation(poi_id, user_id, algo_combination):

    # Based on the chosen algorithm combination, decide whether to call each recommendation function

    if 1 in algo_combination:
        rec_CBF_heuristic = heuristic_recommendation(user_id, poi_id)   # OUTPUT: dataframe[user_id, poi_id, rec_poi_id]
        rec_CBF_heuristic['df_name'] = 'rec_CBF_heuristic'              # Add DataFrame name as a column
        rec_CBF_heuristic = df_cleaning (rec_CBF_heuristic)             # Clean up df for ensemble
    else:
        rec_CBF_heuristic = pd.DataFrame()

    if 2 in algo_combination:
        rec_CBF_similarity = similar_poi_recommendation(poi_id)         # OUTPUT: dataframe[poi_id, rec_poi_id]
        rec_CBF_similarity['df_name'] = 'rec_CBF_similarity'            # Add DataFrame name as a column
        rec_CBF_similarity['user_id'] = user_id                         # Add missing columns
        rec_CBF_similarity = df_cleaning (rec_CBF_similarity)           # Clean up df for ensemble
    else:
        rec_CBF_similarity = pd.DataFrame()

    if 3 in algo_combination:
        rec_CF_userKnn =  userKNN_recommendation(user_id)               # OUTPUT: dataframe[user_id, rec_poi_id]
        rec_CF_userKnn['df_name'] = 'rec_CF_userKnn'                    # Add DataFrame name as a column
        rec_CF_userKnn['poi_id'] = poi_id                               # Add missing columns
        rec_CF_userKnn = df_cleaning (rec_CF_userKnn)                   # Clean up df for ensemble
    else:
        rec_CF_userKnn = pd.DataFrame()

    if 4 in algo_combination:
        rec_CF_itemKnn = itemKNN_recommendation(poi_id)                 # OUTPUT: dataframe[poi_id, rec_poi_id]
        rec_CF_itemKnn['df_name'] = 'rec_CF_itemKnn'                    # Add DataFrame name as a column
        rec_CF_itemKnn['user_id'] = user_id                             # Add missing columns
        rec_CF_itemKnn = df_cleaning (rec_CF_itemKnn)                   # Clean up df for ensemble
    else:
        rec_CF_itemKnn = pd.DataFrame()

    # Print the reordered DataFrames
    #print(rec_CBF_heuristic)
    #print(rec_CBF_similarity)
    #print(rec_CF_userKnn)
    #print(rec_CF_itemKnn)
    
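    # Concatenate the non-empty DataFrames along the rows
    # (empty frames are skipped so they cannot influence the resulting dtypes)
    frames = [df for df in (rec_CF_itemKnn, rec_CF_userKnn, rec_CBF_similarity, rec_CBF_heuristic) if not df.empty]
    if frames:
        merged_df = pd.concat(frames)
    else:
        merged_df = pd.DataFrame(columns=['user_id', 'poi_id', 'rec_poi_id', 'df_name'])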
    #print(f'merged_df: \n{merged_df}')

    # check if merged df is not empty
    if not merged_df.empty:
        # Group by user_id, poi_id, rec_poi_id and compute average rank and count
        grouped_df = merged_df.groupby(['user_id', 'poi_id', 'rec_poi_id']).agg({'rank': 'mean', 'df_name': 'count'}).reset_index()

        # Rename the count column to count
        grouped_df.rename(columns={'df_name': 'count'}, inplace=True)

        # drop any item with count = 1
        grouped_df = grouped_df[grouped_df['count'] > 1]

        # Sort by count in descending order and average rank in ascending order
        sorted_df = grouped_df.sort_values(by=['count', 'rank'], ascending=[False, True])
        #print(f'sorted_df: \n{sorted_df}')

        # Drop the 'count' and 'rank' columns
        result = sorted_df.drop(columns=['count', 'rank'])
        #result = sorted_df.copy()
        result.reset_index(drop=True, inplace=True)
    else:
        # Nothing was recommended; 'df_name' may be absent when no algorithm was selected
        result = merged_df.drop(columns=['df_name'], errors='ignore')

    return result


# target user's id
user_id = 17518
# target poi's id
poi_id = 310900

# combination of algorithm for ensemble learning
algo_combination = [1,2,3,4]

#Choose from the below Algorithms:
#(1) Content Based Filtering - Heuristic
#(2) Content Based Filtering - Node Similarity
#(3) Collaborative Filtering - UserKnn with FastRP
#(4) Collaborative Filtering - ItemKnn with FastRP

ensemble_recommendation(poi_id, user_id, algo_combination)

Unnamed: 0,user_id,poi_id,rec_poi_id
0,17518,310900,4400781
1,17518,310900,591382
2,17518,310900,2149128
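
The voting step in `ensemble_recommendation` can be followed on a small example. The sketch below repeats the same logic in plain Python instead of pandas: collect the rank each algorithm gives a POI, drop POIs with a single vote, then sort by vote count (descending) and mean rank (ascending). The function name `majority_vote`, the list names and the POI ids are made up for illustration.

```python
from collections import defaultdict

def majority_vote(ranked_lists, min_votes=2):
    """Merge ranked recommendation lists by vote count, then mean rank."""
    ranks = defaultdict(list)
    for recs in ranked_lists.values():
        # Rank 1 is the best position within each algorithm's list.
        for rank, poi in enumerate(recs, start=1):
            ranks[poi].append(rank)

    # Keep only POIs recommended by at least `min_votes` algorithms.
    kept = {poi: r for poi, r in ranks.items() if len(r) >= min_votes}

    # More votes first; among equal votes, the lower mean rank first.
    return sorted(kept, key=lambda poi: (-len(kept[poi]), sum(kept[poi]) / len(kept[poi])))

# Hypothetical ranked outputs of the four algorithms.
ranked_lists = {
    'heuristic': [11, 12, 13],
    'node_similarity': [12, 14],
    'user_knn': [13, 12, 15],
    'item_knn': [11],
}

result = majority_vote(ranked_lists)
print(result)
```

POI 12 leads with three votes; POIs 11 and 13 both have two votes, and 11 wins the tie because its mean rank is 1 versus 2. POIs 14 and 15 are dropped because only one algorithm suggested them.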


## Evaluation

In [18]:
# dataframes of pois
df_pois = gds.run_cypher("""\
    MATCH (poi:Poi)    
    RETURN poi.id
    """)

df_pois

Unnamed: 0,poi.id
0,2149128
1,310900
2,4400781
3,324542
4,678639
...,...
64,17821111
65,17738872
66,26356283
67,21353012


In [19]:
# dataframes of reviews
# duration: 32s

df_reviews = gds.run_cypher("""\
    MATCH (user:User)-[review:REVIEWED]->(poi:Poi)
    RETURN user.id AS user_id, poi.id AS poi_id
    """)

df_reviews

Unnamed: 0,user_id,poi_id
0,847,2149128
1,21070,2149128
2,21061,2149128
3,21003,2149128
4,21227,2149128
...,...,...
85029,58650,7275891
85030,58652,7275891
85031,58654,17821111
85032,58656,17821111


In [20]:
# Group by 'user_id' and count occurrences
user_counts = df_reviews.groupby('user_id').size()

# Keep only users with at least 5 reviews
valid_users = user_counts[user_counts >= 5].index

# Filter the original DataFrame based on valid users
filtered_df_reviews = df_reviews[df_reviews['user_id'].isin(valid_users)].copy()
filtered_df_reviews

Unnamed: 0,user_id,poi_id
277,20980,2149128
329,20419,2149128
644,20445,2149128
712,20803,2149128
764,20108,2149128
...,...,...
84680,1753,1888873
84685,39201,1888873
84688,6967,1888873
84702,21691,1888873


In [21]:
# Splitting the dataset into 90% training and 10% test sets
df_train, df_test = train_test_split(filtered_df_reviews, test_size=0.1, random_state=100)

df_train

Unnamed: 0,user_id,poi_id
43559,14882,678639
52599,17393,8634325
78604,25812,315470
81971,8079,13078277
66625,13223,8016698
...,...,...
80529,29227,310896
47967,29380,1837767
14022,7053,2149128
4555,16889,2149128


In [22]:
df_test

Unnamed: 0,user_id,poi_id
14227,7435,2149128
48823,23137,1837767
59373,41985,1888876
12031,9481,2149128
81597,11220,2138910
...,...,...
44908,4635,678639
82916,38946,2139492
57328,27424,644919
53084,41524,317415


In [23]:
# Extracting the true interactions: all ordered poi-poi pairs reviewed by the same user
# duration: 1m 30s (as originally measured with per-user concatenation)

# Group by user_id and aggregate poi_id as a list
grouped = df_reviews.groupby('user_id')['poi_id'].apply(list)

# Collect (user_id, poi_id, rec_poi_id) entries in a plain list and build the
# DataFrame once, instead of concatenating inside the loop
entries = []
for uid, poi_ids in grouped.items():
    # Create pairs of target poi and other poi for each user
    entries.extend((uid, pid, other_pid) for pid in poi_ids for other_pid in poi_ids if pid != other_pid)

df_true_interactions = pd.DataFrame(entries, columns=['user_id', 'poi_id', 'rec_poi_id'])
df_true_interactions = df_true_interactions.drop_duplicates()
# Display the result DataFrame
df_true_interactions

Unnamed: 0,user_id,poi_id,rec_poi_id
0,5,2149128,315470
1,5,315470,2149128
2,8,2149128,1837767
3,8,1837767,2149128
4,11,2149128,644919
...,...,...,...
83163,57330,14904083,8178306
83164,57915,8178306,3915753
83165,57915,3915753,8178306
83166,58217,317421,1888873
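
The ground truth treats every ordered pair of distinct POIs reviewed by the same user as a true interaction. A compact way to see this, assuming a made-up review history and using `itertools.permutations` in place of the nested loop:

```python
from itertools import permutations

# Toy review history: user_id -> POIs reviewed (made-up ids).
reviews = {1: [10, 20, 30], 2: [40, 50], 3: [60]}

true_interactions = [
    (user, poi, other)
    for user, pois in reviews.items()
    # dict.fromkeys removes repeated POIs while keeping their order
    for poi, other in permutations(dict.fromkeys(pois), 2)
]
print(len(true_interactions))
```

A user with n distinct reviewed POIs contributes n * (n - 1) rows, so user 3, with a single review, contributes none.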


In [24]:
# Get all relevant instances by merging the true interactions and test instance on user id and poi id
df_all_relevant = pd.merge(df_test, df_true_interactions, on=['user_id', 'poi_id'], how='inner')
df_all_relevant

Unnamed: 0,user_id,poi_id,rec_poi_id
0,7435,2149128,310900
1,7435,2149128,4400781
2,7435,2149128,324542
3,7435,2149128,1837767
4,7435,2149128,317415
...,...,...,...
2280,41524,317415,310896
2281,32830,1888876,324542
2282,32830,1888876,8634325
2283,32830,1888876,8016698


Next, run the algorithms to retrieve recommendations. Duration for 500 instances, per algorithm combination:

- algo [1,2,3,4]: 15m
- algo [2,3,4]: 9m
- algo [1,3,4]: 11m
- algo [1,2,4]: 11m
- algo [1,2,3]: 11m
- algo [1,2]: 9m
- algo [1,3]: 8m
- algo [1,4]: 8m
- algo [2,3]: 6m
- algo [2,4]: 7m
- algo [3,4]: 7m

In [75]:
# retrieve recommendation for row in test set

# Tuning selection of algorithm combination to get better result
algo_combination = [1,2,3,4]

df_all_retrieved = pd.DataFrame()
for index, row in df_test.iterrows():
    user_id = row['user_id']
    poi_id = row['poi_id']
    #print(f'\nuser_id: {user_id}')
    #print(f'poi_id: {poi_id}')

    recommended_interactions = ensemble_recommendation(poi_id, user_id, algo_combination)

    # Concatenate recommended_interactions with test_recommendations
    df_all_retrieved = pd.concat([df_all_retrieved, recommended_interactions], ignore_index=True)

# Drop the duplicates
df_all_retrieved = df_all_retrieved.drop_duplicates()

df_all_retrieved


Unnamed: 0,user_id,poi_id,rec_poi_id
0,7435.0,2149128.0,4400781
1,7435.0,2149128.0,1837767
2,7435.0,2149128.0,310900
3,23137.0,1837767.0,8016698
4,23137.0,1837767.0,8634325
...,...,...,...
1713,41524.0,317415.0,3915753
1714,32830.0,1888876.0,8634325
1715,32830.0,1888876.0,315470
1716,32830.0,1888876.0,379351


Number of retrieved entries per algorithm combination:

- algo [1,2,3,4]: 1718
- algo [2,3,4]: 549
- algo [1,3,4]: 1003
- algo [1,2,4]: 822
- algo [1,2,3]: 1514
- algo [1,2]: 645
- algo [1,3]: 829
- algo [1,4]: 170
- algo [2,3]: 314
- algo [2,4]: 201
- algo [3,4]: 200

In [76]:
# Get all relevant retrieved instances by merging the relevant and recommended interactions
df_retrived_relevant = pd.merge(df_all_relevant, df_all_retrieved, on=['user_id', 'poi_id', 'rec_poi_id'], how='inner')
df_retrived_relevant

Unnamed: 0,user_id,poi_id,rec_poi_id
0,7435,2149128,310900
1,7435,2149128,4400781
2,7435,2149128,1837767
3,23137,1837767,8634325
4,23137,1837767,8016698
...,...,...,...
699,38946,2139492,2138910
700,27424,644919,8016698
701,41524,317415,8016698
702,32830,1888876,8634325


Number of relevant retrieved records per algorithm combination:

- algo [1,2,3,4]: 704
- algo [2,3,4]: 274
- algo [1,3,4]: 612
- algo [1,2,4]: 133
- algo [1,2,3]: 639
- algo [1,2]: 90
- algo [1,3]: 540
- algo [1,4]: 72
- algo [2,3]: 179
- algo [2,4]: 51
- algo [3,4]: 142

In [77]:
# calculate the precision score
relevant_retrieved = df_retrived_relevant.shape[0]
all_retrived = df_all_retrieved.shape[0]

precision = relevant_retrieved / all_retrived

print(f'Precision Score: {precision}')

Precision Score: 0.409778812572759


Precision score per algorithm combination:

- algo [1,2,3,4]: 0.409778812572759
- algo [2,3,4]: 0.4990892531876138
- algo [1,3,4]: 0.6101694915254238
- algo [1,2,4]: 0.16180048661800486
- algo [1,2,3]: 0.42206076618229854
- algo [1,2]: 0.13953488372093023
- algo [1,3]: 0.6513872135102533
- algo [1,4]: 0.4235294117647059
- algo [2,3]: 0.5700636942675159
- algo [2,4]: 0.2537313432835821
- algo [3,4]: 0.71

In [78]:
# calculate the recall score
relevant_retrieved = df_retrived_relevant.shape[0]
all_relevant = df_all_relevant.shape[0]

recall = relevant_retrieved / all_relevant
print(f'Recall Score: {recall}')

Recall Score: 0.3080962800875274


Recall score per algorithm combination:

- algo [1,2,3,4]: 0.3080962800875274
- algo [2,3,4]: 0.11991247264770241
- algo [1,3,4]: 0.26783369803063456
- algo [1,2,4]: 0.05820568927789934
- algo [1,2,3]: 0.27964989059080964
- algo [1,2]: 0.03938730853391685
- algo [1,3]: 0.2363238512035011
- algo [1,4]: 0.03150984682713348
- algo [2,3]: 0.07833698030634573
- algo [2,4]: 0.022319474835886213
- algo [3,4]: 0.062144420131291025

In [79]:
# calculate the coverage score
num_recommended_pois = df_all_retrieved['rec_poi_id'].nunique()
num_all_pois = df_pois.shape[0]

coverage = num_recommended_pois / num_all_pois
print(f'Coverage Score: {coverage}')

Coverage Score: 0.6811594202898551
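
The three metrics reduce to set arithmetic over `(user_id, poi_id, rec_poi_id)` triples. A small sketch on made-up triples, followed by a cross-check of the precision reported for algo [1,2,3,4] from the counts listed in this notebook (704 relevant retrieved records out of 1718 retrieved entries):

```python
# Toy (user_id, poi_id, rec_poi_id) triples; all ids are made up.
relevant = {(1, 10, 20), (1, 10, 30), (2, 40, 50)}
retrieved = {(1, 10, 20), (1, 10, 99), (2, 40, 50), (2, 40, 60)}
all_pois = {10, 20, 30, 40, 50, 60, 70, 99}

# Recommendations that are also true interactions.
hits = relevant & retrieved

precision = len(hits) / len(retrieved)   # share of recommendations that are relevant
recall = len(hits) / len(relevant)       # share of relevant items that were recommended
coverage = len({rec for _, _, rec in retrieved}) / len(all_pois)  # share of catalogue ever recommended

print(precision, recall, coverage)

# Cross-check against the counts reported for algo [1,2,3,4].
reported_precision = 704 / 1718
print(reported_precision)
```

Precision and recall pull in opposite directions here: combinations that retrieve fewer entries, such as [3,4], score higher precision but much lower recall and coverage.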


Coverage score per algorithm combination:

- algo [1,2,3,4]: 0.6811594202898551
- algo [2,3,4]: 0.5217391304347826
- algo [1,3,4]: 0.42028985507246375
- algo [1,2,4]: 0.5362318840579711
- algo [1,2,3]: 0.6666666666666666
- algo [1,2]: 0.463768115942029
- algo [1,3]: 0.37681159420289856
- algo [1,4]: 0.10144927536231885
- algo [2,3]: 0.4492753623188406
- algo [2,4]: 0.15942028985507245
- algo [3,4]: 0.17391304347826086

## Cleaning up

Drop the GDS in-memory projection. The statement that would delete all data from the database is left commented out.

In [80]:
# Remove our projection from the GDS graph catalog
G.drop()

# Remove all the example data from the database
# _ = gds.run_cypher("MATCH (n) DETACH DELETE n")

graphName                                                          myGraph
database                                                             neo4j
databaseLocation                                                     local
memoryUsage                                                               
sizeInBytes                                                             -1
nodeCount                                                            58725
relationshipCount                                                   170068
configuration            {'relationshipProjection': {'REVIEWED': {'aggr...
density                                                           0.000049
creationTime                           2024-03-15T07:05:16.639119930+00:00
modificationTime                       2024-03-15T07:05:36.532602204+00:00
schema                   {'graphProperties': {}, 'nodes': {'User': {'em...
schemaWithOrientation    {'graphProperties': {}, 'nodes': {'User': {'em...
Name: 0, dtype: object