### Item-based Collaborative Filtering - Testing
This notebook evaluates the item-based collaborative filtering algorithm in two settings: a retrieval setting (top-k recommendations per user) and a prediction setting (a predicted label for each user-business pair in the test set).

#### Pre-requisites
- The model is trained and the index is created in the notebook `ItemCF Model & Index.ipynb`.
- The index is saved in the file `yelp_ItemCF.db` in the same directory as this notebook.

In [1]:
# Add the parent directory to the import path, then import the shared helpers from ../utilities.py
import sys
sys.path.append('../')
from utilities import *

In [2]:
# Define the database folder path and file names
db_folder = '../../data/processed_data/yelp_data/'
data_files = ['business', 'categories', 'review']

# Load data into a dictionary
yelp_data = load_data_from_db(db_folder, data_files)

# Check loaded data
for table, df in yelp_data.items():
    print(f"Loaded {len(df)} rows from {table} table.")

Loaded 78059 rows from business table.
Loaded 360656 rows from categories table.
Loaded 980418 rows from review table.


In [3]:
df_business = yelp_data["business"]
df_review = yelp_data["review"]

user_mapping, business_mapping, user_business = get_user_business(df_business, df_review)

In [4]:
# split the data into training and test sets
train_data, test_data = train_test_split(user_business, test_size=0.2, random_state=42)

In [5]:
# Balance the test data so positives and negatives are equal in number;
# comment out the next line to use the original (unbalanced) test data
test_data = balance_test_data(test_data)

# group the test data by user_id and get the business_id
test_data_grouped = test_data.groupby('user_id')['business_id'].apply(list).reset_index()

Number of positive reviews: 136473
Number of negative reviews: 59624
Total number of reviews: 197147
Ratio of positive to negative reviews: 2.29
Number of positive reviews: 59624
Number of negative reviews: 59624
Total number of reviews: 119248
Ratio of positive to negative reviews: 1.00
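
The counts above (136473 positives reduced to 59624, matching the number of negatives) are consistent with downsampling the majority class. The internals of `balance_test_data` are not shown in this notebook, so the following is only a minimal sketch of that idea in plain Python. The function name `balance_by_downsampling`, the `label` key, and the toy rows are hypothetical.

```python
import random


def balance_by_downsampling(rows, label_key="label", seed=42):
    """Downsample the larger class so both classes have the same count.

    Assumption: each row is a dict with a binary label (1 = positive, 0 = negative).
    This is an illustrative stand-in, not the notebook's balance_test_data.
    """
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    n = min(len(positives), len(negatives))

    rng = random.Random(seed)  # fixed seed for a reproducible sample
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): 7 positive and 3 negative reviews
rows = [
    {"user_id": i, "business_id": 100 + i, "label": 1 if i < 7 else 0}
    for i in range(10)
]
balanced = balance_by_downsampling(rows)
n_pos = sum(r["label"] for r in balanced)
n_neg = len(balanced) - n_pos
print(f"positives={n_pos}, negatives={n_neg}")
```

Downsampling discards data from the majority class; it keeps the evaluation set free of synthetic rows, at the cost of a smaller test set.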


In [6]:
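# Connect to the SQLite database that holds the ItemCF index
import sqlite3  # explicit import in case `utilities` does not re-export it

db_path = './yelp_ItemCF.db'
conn = sqlite3.connect(db_path)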

### Retrieval Evaluation

In [7]:
retrieval_recommendations = simulate_recommendations(test_data_grouped, user_mapping, business_mapping, conn, k=300, num_users=10)

true_positive, true_negative, false_positive, false_negative, total, total_positive, ranks = check_retrieval_recommendations(retrieval_recommendations, test_data, test_data_grouped)

evaluation_metric, confusion_matrix, background_stats = compute_evaluation_metric(true_positive, true_negative, false_positive, false_negative, total, total_positive, ranks)

In [8]:
print("Testing Data Statistics")
display(background_stats)

print("Evaluation Metrics")
display(evaluation_metric)

print("Confusion Matrix")
display(confusion_matrix)

Testing Data Statistics


| | Total Positive | Total Negative | Total | Ratio |
|---|---|---|---|---|
| 0 | 14 | 8 | 22 | 0.636364 |


Evaluation Metrics


| | Accuracy | Precision | Recall | F1 Score | F-beta Score | Mean Reciprocal Rank |
|---|---|---|---|---|---|---|
| 0 | 0.4091 | 0.6 | 0.2143 | 0.3158 | 0.2459 | 0.0318 |


Confusion Matrix


| | True Positive | True Negative | False Positive | False Negative |
|---|---|---|---|---|
| 0 | 3 | 6 | 2 | 11 |
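
How `compute_evaluation_metric` derives the Mean Reciprocal Rank from `ranks` is not shown in this notebook. As a rough sketch, the standard definition averages the reciprocal of the 1-based rank of each user's first relevant recommendation, counting users with no hit as 0. The function name `mean_reciprocal_rank` and the sample ranks below are hypothetical.

```python
def mean_reciprocal_rank(first_hit_ranks):
    """Average of 1/rank over users.

    first_hit_ranks: one entry per user, holding the 1-based rank of the
    first relevant item in that user's recommendation list, or None if no
    relevant item was retrieved (contributes 0 to the mean).
    """
    if not first_hit_ranks:
        return 0.0
    reciprocals = [1.0 / r if r is not None else 0.0 for r in first_hit_ranks]
    return sum(reciprocals) / len(reciprocals)


# Hypothetical ranks for four users; the third user had no relevant hit
sample_ranks = [1, 4, None, 10]
mrr = mean_reciprocal_rank(sample_ranks)
print(round(mrr, 4))
```

With a long candidate list such as `k=300`, relevant items that appear deep in the list contribute very little to the mean, so a low MRR can coexist with a non-zero recall.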


### Prediction Evaluation

In [9]:
predicted_labels, actual_labels, positive_count, negative_count, null_count, unrated_count = predict_recommendations(test_data, test_data_grouped, business_mapping, conn)

evaluation_metric, confusion_matrix, background_stats = compute_prediction_evaluation(actual_labels, predicted_labels)





In [15]:
print("Testing Data Statistics")
display(background_stats)

print("Evaluation Metrics")
display(evaluation_metric)

# Drop the Mean Reciprocal Rank column if it exists
if 'Mean Reciprocal Rank' in evaluation_metric.columns:
    evaluation_metric.drop(columns=['Mean Reciprocal Rank'], inplace=True)

print("Confusion Matrix")
display(confusion_matrix)

Testing Data Statistics


| | Total Positive | Total Negative | Total | Ratio |
|---|---|---|---|---|
| 0 | 584 | 623 | 1207 | 0.483844 |


Evaluation Metrics


| | Accuracy | Precision | Recall | F1 Score | F-beta Score | Unrated Count |
|---|---|---|---|---|---|---|
| 0 | 0.6048 | 0.5822 | 0.649 | 0.6138 | 0.6344 | 0.0 |


Confusion Matrix


| | True Positive | True Negative | False Positive | False Negative |
|---|---|---|---|---|
| 0 | 379 | 351 | 272 | 205 |
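
The reported metrics can be cross-checked directly from the confusion-matrix counts. The sketch below uses the standard definitions of accuracy, precision, recall, and F-beta; it is not the notebook's `compute_evaluation_metric` or `compute_prediction_evaluation`, whose code is not shown here. The function name `classification_metrics` is hypothetical, and `beta=2` is an assumption that happens to reproduce the F-beta values in both tables.

```python
def classification_metrics(tp, tn, fp, fn, beta=2.0):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    def f_score(b):
        # F_b = (1 + b^2) * P * R / (b^2 * P + R)
        denom = b * b * precision + recall
        return (1 + b * b) * precision * recall / denom if denom else 0.0

    return {
        "Accuracy": round(accuracy, 4),
        "Precision": round(precision, 4),
        "Recall": round(recall, 4),
        "F1 Score": round(f_score(1.0), 4),
        "F-beta Score": round(f_score(beta), 4),
    }


# Counts copied from the two confusion matrices in this notebook
retrieval = classification_metrics(tp=3, tn=6, fp=2, fn=11)
prediction = classification_metrics(tp=379, tn=351, fp=272, fn=205)
print(retrieval)
print(prediction)
```

Because beta is greater than 1, the F-beta score weights recall more heavily than precision, which is why it sits below F1 in the retrieval setting (recall lower than precision) and above F1 in the prediction setting (recall higher than precision).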


In [11]:
conn.close()