# Content Based Filtering Recommendations - Heuristic Method

## Introduction

Given a POI that the target user has reviewed, the heuristic method finds other POIs in the same region or in the same category, and ranks them so that the POIs with the most reviews come first among equally relevant candidates. Note that the queries below do not themselves exclude POIs the user has already reviewed.

For each candidate POI, count how often it appears in the region list and in the category list, and add the two counts into a single measure called `total_weight`. This total weight represents how frequently the POI appears across both lists. Rank the POIs by `total_weight` in descending order; break ties by `occurrences` (the POI's number of reviews), also in descending order.

This metric emphasizes POIs that are commonly found in both the region and category lists, indicating a higher level of relevance or similarity to the user's reviewed POIs. This approach focuses on the frequency of appearances in different contexts (region and category lists) to prioritize recommendations.
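
The ranking rule described above can be sketched in plain Python on toy data. The POI ids and review counts below are made up for illustration, and `rank_candidates` is a hypothetical helper rather than a function from this notebook, which performs the equivalent steps with pandas.

```python
from collections import Counter

# Hypothetical candidate lists for one reviewed POI: rec_poi_id -> number of reviews
region_candidates = {101: 40, 102: 12, 103: 7}
category_candidates = {102: 12, 103: 7, 104: 55}


def rank_candidates(region, category):
    """Rank POIs by how many lists they appear in, then by review count."""
    # total_weight is 1 if the POI is in one list and 2 if it is in both
    total_weight = Counter(region.keys()) + Counter(category.keys())
    # Review counts (occurrences) are used only to break ties
    occurrences = {**region, **category}
    return sorted(
        total_weight,
        key=lambda poi: (total_weight[poi], occurrences[poi]),
        reverse=True,
    )


ranked = rank_candidates(region_candidates, category_candidates)
print(ranked)
```

POIs 102 and 103 appear in both lists, so they rank ahead of 104 and 101 even though 104 has the most reviews.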

## Prerequisites

A `neo4j` database instance must already be initialized and populated with data.

The connection details `HOST`, `DATABASE` and `PASSWORD` are read from the file named by `NEO4J_CONF_FILE` (`neo4j.ini`) in order to establish a connection to the neo4j database. If the file is missing, the defaults defined in the notebook are used.

The `neo4j` Python driver is used to query the database.

`Cypher` queries are used to generate the recommendations.
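
As a minimal sketch of the expected configuration file, the snippet below writes a sample `neo4j.ini` to a temporary directory and reads it back with `configparser`. The `[NEO4J]` section and its three keys follow the lookups used in this notebook; the values are placeholders (the notebook's own defaults), not real credentials.

```python
import configparser
import os
import tempfile

# Placeholder contents for neo4j.ini; replace the values with your own
SAMPLE_INI = """\
[NEO4J]
HOST = neo4j://localhost
DATABASE = neo4j
PASSWORD = password
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "neo4j.ini")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(SAMPLE_INI)

    config = configparser.RawConfigParser()
    config.read(path)
    # Option names are matched case-insensitively by configparser
    settings = {key: config["NEO4J"][key] for key in ("HOST", "DATABASE", "PASSWORD")}

print(settings["HOST"], settings["DATABASE"])
```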

## Setup

Install and import the dependencies, then establish a neo4j driver connection to the database.

In [40]:
# Install necessary dependencies
# (configparser and textwrap ship with the Python standard library and need no installation)
%pip install scikit-learn
%pip install neo4j
%pip install pandas

Note: you may need to restart the kernel to use updated packages.

In [41]:
import os
import configparser
import textwrap

import pandas as pd
from sklearn.model_selection import train_test_split

from neo4j import GraphDatabase

In [43]:
# Using an ini file for credentials, otherwise providing defaults
HOST = 'neo4j://localhost'
DATABASE = 'neo4j'
PASSWORD = 'password'

NEO4J_CONF_FILE = 'neo4j.ini'

if NEO4J_CONF_FILE is not None and os.path.exists(NEO4J_CONF_FILE):
    config = configparser.RawConfigParser()
    config.read(NEO4J_CONF_FILE)
    HOST = config['NEO4J']['HOST']
    DATABASE = config['NEO4J']['DATABASE']
    PASSWORD = config['NEO4J']['PASSWORD']
    print('Using custom database properties')
else:
    print('Could not find database properties file, using defaults')


# Connecting with neo4j python driver
# Note: `auth` expects (username, password); the DATABASE value is passed as the username here
driver = GraphDatabase.driver(HOST, auth=(DATABASE, PASSWORD))

Using custom database properties


In [44]:
# helper function
def run(driver, query, params=None):
    with driver.session() as session:
        if params is not None:
            return [r for r in session.run(query, params)]
        else:
            return [r for r in session.run(query)]

## Heuristic Recommendations

Based on a POI that the user has reviewed before, recommend other POIs in the same category or region.

In [56]:
# FUNCTION: make recommendation based on Content Based Filtering Recommendations - Heuristic Method
# INPUT: user_id, poi_id
# OUTPUT: dataframe[user_id, poi_id, rec_poi_id]

def heuristic_recommendation(user_id, poi_id):
    # get pois in the same region as reviewed_poi by the user
    records_region = run(driver, textwrap.dedent("""\
        MATCH (user {id: $user_id})-[:REVIEWED]->(poi:Poi {id: $poi_id})-[:LOCATED_AT]->(region:Region)<-[:LOCATED_AT]-(other_poi:Poi)<-[rated:RATED]-(review:Review)
        WHERE poi <> other_poi
        WITH user, poi, other_poi, region, count(DISTINCT rated) AS num_reviews
        RETURN user.id AS user_id, poi.id AS poi_id, other_poi.id AS rec_poi_id, region.name AS region, num_reviews AS occurrences
        """),
        params = {'user_id': user_id, 'poi_id': poi_id}
    )

    # get pois in the same category as reviewed_poi by the user 
    records_category = run(driver, textwrap.dedent("""\
        MATCH (user {id: $user_id})-[:REVIEWED]->(poi:Poi {id: $poi_id})-[:BELONGS_TO]->(category:Category)<-[:BELONGS_TO]-(other_poi:Poi)<-[rated:RATED]-(review:Review)
        WHERE poi <> other_poi
        WITH user, poi, other_poi, category, count(DISTINCT rated) AS num_reviews
        RETURN user.id AS user_id, poi.id AS poi_id, other_poi.id AS rec_poi_id, category.name AS category_name, num_reviews AS occurrences
        """),
        params = {'user_id': user_id, 'poi_id': poi_id}
    )

    
    # Convert the results to DataFrames
    if records_region:
        df_records_region = pd.DataFrame([dict(record) for record in records_region])
        # Group by 'user_id', 'poi_id', 'rec_poi_id' and 'occurrences', then count the rows in each group as 'weight'
        df_records_region_agg = df_records_region.groupby(['user_id', 'poi_id', 'rec_poi_id', 'occurrences']).size().reset_index(name='weight')
    else:
        df_records_region_agg = pd.DataFrame(columns=['user_id', 'poi_id', 'rec_poi_id', 'occurrences', 'weight'])

    if records_category:
        df_records_category = pd.DataFrame([dict(record) for record in records_category])
        # Group by 'user_id', 'poi_id', 'rec_poi_id' and 'occurrences', then count the rows in each group as 'weight'
        df_records_category_agg = df_records_category.groupby(['user_id', 'poi_id', 'rec_poi_id', 'occurrences']).size().reset_index(name='weight')
    else:
        df_records_category_agg = pd.DataFrame(columns=['user_id', 'poi_id', 'rec_poi_id', 'occurrences', 'weight'])


    # Compute the appearance frequency of POIs in both lists

    # Merge the two DataFrames on 'rec_poi_id'
    recommended_interactions = pd.merge(df_records_region_agg, df_records_category_agg, on='rec_poi_id', suffixes=('_region', '_category'), how='outer')

    # Fill NaN values in '_region' columns with values from '_category' columns
    # (assign the result back instead of using inplace=True on a selected column)
    recommended_interactions['user_id_region'] = recommended_interactions['user_id_region'].fillna(recommended_interactions['user_id_category'])
    recommended_interactions['poi_id_region'] = recommended_interactions['poi_id_region'].fillna(recommended_interactions['poi_id_category'])
    recommended_interactions['occurrences_region'] = recommended_interactions['occurrences_region'].fillna(recommended_interactions['occurrences_category'])

    # Rename the '_region' columns
    recommended_interactions.rename(columns={'user_id_region': 'user_id'}, inplace=True)
    recommended_interactions.rename(columns={'poi_id_region': 'poi_id'}, inplace=True)
    recommended_interactions.rename(columns={'occurrences_region': 'occurrences'}, inplace=True)

    # Fill NaN values with 0 for the 'weight' columns
    recommended_interactions['weight_region'] = recommended_interactions['weight_region'].fillna(0)
    recommended_interactions['weight_category'] = recommended_interactions['weight_category'].fillna(0)
    # Sum the 'weight' columns to get the total weight
    recommended_interactions['total_weight'] = recommended_interactions['weight_region'] + recommended_interactions['weight_category']

    # Drop the individual 'weight' columns if needed
    recommended_interactions.drop(['user_id_category', 'poi_id_category', 'occurrences_category', 'weight_region', 'weight_category'], axis=1, inplace=True)
    # Order the DataFrame by 'total_weight' in descending order, then by 'occurrences'
    recommended_interactions = recommended_interactions.sort_values(by=['total_weight', 'occurrences'], ascending=[False, False])
    # Reindex the DataFrame
    recommended_interactions.reset_index(drop=True, inplace=True)
    # Rearrange the columns
    recommended_interactions = recommended_interactions[['user_id', 'poi_id', 'rec_poi_id']]
    # drop duplicate
    recommended_interactions = recommended_interactions.drop_duplicates()

    # Return the ranked recommendations
    return recommended_interactions

In [57]:
# target user's id
user_id = 17518
# target poi's id
poi_id = 310900

df_recommend = heuristic_recommendation(user_id, poi_id)

df_recommend

| | user_id | poi_id | rec_poi_id |
|---|---|---|---|
| 0 | 17518 | 310900 | 4400781 |
| 1 | 17518 | 310900 | 2149128 |
| 2 | 17518 | 310900 | 591382 |
| 3 | 17518 | 310900 | 17434131 |


## Evaluation

In [49]:
# dataframes of pois
pois = run(driver, textwrap.dedent("""\
    MATCH (poi:Poi)
    RETURN poi.id
    """),
    params = {}
)

df_pois = pd.DataFrame([r.data() for r in pois])
df_pois

| | poi.id |
|---|---|
| 0 | 2149128 |
| 1 | 310900 |
| 2 | 4400781 |
| 3 | 324542 |
| 4 | 678639 |
| ... | ... |
| 64 | 17821111 |
| 65 | 17738872 |
| 66 | 26356283 |
| 67 | 21353012 |


In [50]:
# dataframe of reviews
# duration: about 2 minutes

reviews = run(driver, textwrap.dedent("""\
    MATCH (user:User)-[review:REVIEWED]->(poi:Poi)
    RETURN user.id AS user_id, poi.id AS poi_id
    """),
    params = {}
)

df_reviews = pd.DataFrame([r.data() for r in reviews])
df_reviews

| | user_id | poi_id |
|---|---|---|
| 0 | 847 | 2149128 |
| 1 | 21070 | 2149128 |
| 2 | 21061 | 2149128 |
| 3 | 21003 | 2149128 |
| 4 | 21227 | 2149128 |
| ... | ... | ... |
| 85029 | 58650 | 7275891 |
| 85030 | 58652 | 7275891 |
| 85031 | 58654 | 17821111 |
| 85032 | 58656 | 17821111 |


In [51]:
# Group by 'user_id' and count the reviews per user
user_counts = df_reviews.groupby('user_id').size()

# Keep only users with at least 5 reviews
valid_users = user_counts[user_counts >= 5].index

# Filter the original DataFrame based on valid users
filtered_df_reviews = df_reviews[df_reviews['user_id'].isin(valid_users)].copy()
filtered_df_reviews

| | user_id | poi_id |
|---|---|---|
| 277 | 20980 | 2149128 |
| 329 | 20419 | 2149128 |
| 644 | 20445 | 2149128 |
| 712 | 20803 | 2149128 |
| 764 | 20108 | 2149128 |
| ... | ... | ... |
| 84680 | 1753 | 1888873 |
| 84685 | 39201 | 1888873 |
| 84688 | 6967 | 1888873 |
| 84702 | 21691 | 1888873 |


In [54]:
# Splitting the dataset into 90% training and 10% test sets
df_train, df_test = train_test_split(filtered_df_reviews, test_size=0.1, random_state=100)

df_train

| | user_id | poi_id |
|---|---|---|
| 43559 | 14882 | 678639 |
| 52599 | 17393 | 8634325 |
| 78604 | 25812 | 315470 |
| 81971 | 8079 | 13078277 |
| 66625 | 13223 | 8016698 |
| ... | ... | ... |
| 80529 | 29227 | 310896 |
| 47967 | 29380 | 1837767 |
| 14022 | 7053 | 2149128 |
| 4555 | 16889 | 2149128 |


In [55]:
df_test

| | user_id | poi_id |
|---|---|---|
| 14227 | 7435 | 2149128 |
| 48823 | 23137 | 1837767 |
| 59373 | 41985 | 1888876 |
| 12031 | 9481 | 2149128 |
| 81597 | 11220 | 2138910 |
| ... | ... | ... |
| 44908 | 4635 | 678639 |
| 82916 | 38946 | 2139492 |
| 57328 | 27424 | 644919 |
| 53084 | 41524 | 317415 |


In [58]:
# retrieve recommendations for each row in the test set
# 500 instances take about 5 minutes

retrieved_frames = []
for index, row in df_test.iterrows():
    recommended_interactions = heuristic_recommendation(row['user_id'], row['poi_id'])
    # Skip empty results so that pd.concat only sees non-empty frames
    if not recommended_interactions.empty:
        retrieved_frames.append(recommended_interactions)

# Concatenate all recommendations once, instead of growing a DataFrame inside the loop
df_all_retrieved = pd.concat(retrieved_frames, ignore_index=True)

# drop duplicate rows
df_all_retrieved = df_all_retrieved.drop_duplicates()

df_all_retrieved


| | user_id | poi_id | rec_poi_id |
|---|---|---|---|
| 0 | 7435.0 | 2149128.0 | 4400781 |
| 1 | 7435.0 | 2149128.0 | 8634325 |
| 2 | 7435.0 | 2149128.0 | 17434131 |
| 3 | 7435.0 | 2149128.0 | 310900 |
| 4 | 7435.0 | 2149128.0 | 678639 |
| ... | ... | ... | ... |
| 3332 | 41524.0 | 317415.0 | 324751 |
| 3333 | 32830.0 | 1888876.0 | 678639 |
| 3334 | 32830.0 | 1888876.0 | 315470 |
| 3335 | 32830.0 | 1888876.0 | 7221059 |


In [59]:
# Extracting the true interactions from all the reviews
df_true_interactions = df_reviews[['user_id', 'poi_id']]
df_true_interactions

| | user_id | poi_id |
|---|---|---|
| 0 | 847 | 2149128 |
| 1 | 21070 | 2149128 |
| 2 | 21061 | 2149128 |
| 3 | 21003 | 2149128 |
| 4 | 21227 | 2149128 |
| ... | ... | ... |
| 85029 | 58650 | 7275891 |
| 85030 | 58652 | 7275891 |
| 85031 | 58654 | 17821111 |
| 85032 | 58656 | 17821111 |


In [65]:
# Get all relevant instances by merging the test instances with the true interactions on user_id
df_all_relevant = pd.merge(df_test, df_true_interactions, on=['user_id'], how='inner')
# Rename the columns poi_id_x to poi_id and poi_id_y to rec_poi_id
df_all_relevant = df_all_relevant.rename(columns={'poi_id_x': 'poi_id', 'poi_id_y': 'rec_poi_id'})
# Drop rows where poi_id is equal to rec_poi_id
df_all_relevant = df_all_relevant[df_all_relevant['poi_id'] != df_all_relevant['rec_poi_id']]
df_all_relevant = df_all_relevant.drop_duplicates()

df_all_relevant

| | user_id | poi_id | rec_poi_id |
|---|---|---|---|
| 1 | 7435 | 2149128 | 310900 |
| 2 | 7435 | 2149128 | 4400781 |
| 3 | 7435 | 2149128 | 324542 |
| 4 | 7435 | 2149128 | 1837767 |
| 5 | 7435 | 2149128 | 317415 |
| ... | ... | ... | ... |
| 2739 | 41524 | 317415 | 310896 |
| 2740 | 32830 | 1888876 | 324542 |
| 2741 | 32830 | 1888876 | 8634325 |
| 2743 | 32830 | 1888876 | 8016698 |


In [66]:
# Get all relevant retrieved instances by merging the relevant interactions with the recommended interactions
df_retrived_relevant = pd.merge(df_all_relevant, df_all_retrieved, on=['user_id', 'poi_id', 'rec_poi_id'], how='inner')
df_retrived_relevant

| | user_id | poi_id | rec_poi_id |
|---|---|---|---|
| 0 | 7435 | 2149128 | 310900 |
| 1 | 7435 | 2149128 | 4400781 |
| 2 | 7435 | 2149128 | 1837767 |
| 3 | 7435 | 1837767 | 2149128 |
| 4 | 7435 | 1837767 | 4400781 |
| ... | ... | ... | ... |
| 553 | 4635 | 678639 | 2149128 |
| 554 | 4635 | 678639 | 4400781 |
| 555 | 4635 | 678639 | 1837767 |
| 556 | 38946 | 2139492 | 2138910 |


In [67]:
# calculate the precision score
relevant_retrieved = df_retrived_relevant.shape[0]
all_retrived = df_all_retrieved.shape[0]

precision = relevant_retrieved / all_retrived

print(f'Precision Score: {precision}')

Precision Score: 0.16721606233143543


In [68]:
# calculate the recall score
relevant_retrieved = df_retrived_relevant.shape[0]
all_relevant = df_all_relevant.shape[0]

recall = relevant_retrieved / all_relevant
print(f'Recall Score: {recall}')

Recall Score: 0.24420131291028446


In [69]:
# calculate the coverage score
num_recommended_pois = df_all_retrieved['rec_poi_id'].nunique()
num_all_pois = df_pois.shape[0]

coverage = num_recommended_pois / num_all_pois
print(f'Coverage Score: {coverage}')

Coverage Score: 0.6521739130434783
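
The three scores above follow the usual set-based definitions: precision is the share of retrieved triples that are relevant, recall is the share of relevant triples that were retrieved, and coverage is the share of all POIs that appear as a recommendation at least once. The sketch below restates them on toy data; the triples and POI ids are invented for illustration, `precision_recall_coverage` is a hypothetical helper, and the zero-division guards are an addition not present in the cells above.

```python
def precision_recall_coverage(retrieved, relevant, all_pois):
    """Compute precision, recall and coverage from (user_id, poi_id, rec_poi_id) triples."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant triples that were actually retrieved

    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0

    # Coverage counts distinct recommended POIs, regardless of user
    recommended_pois = {rec_poi_id for _, _, rec_poi_id in retrieved}
    coverage = len(recommended_pois) / len(all_pois) if all_pois else 0.0
    return precision, recall, coverage


# Made-up example data
retrieved = [(1, 10, 11), (1, 10, 12), (2, 20, 11), (2, 20, 13)]
relevant = [(1, 10, 11), (2, 20, 13), (2, 20, 14)]
all_pois = {10, 11, 12, 13, 14, 20}

precision, recall, coverage = precision_recall_coverage(retrieved, relevant, all_pois)
print(precision, recall, coverage)
```

Here two of the four retrieved triples are relevant, two of the three relevant triples are retrieved, and three of the six POIs are recommended at least once.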


## Close the driver

In [21]:
driver.close()