### Item-based Collaborative Filtering - Testing
This notebook evaluates the item-based collaborative filtering algorithm in two settings: retrieval and prediction.

#### Prerequisites
- The model is trained and the index is created in the notebook `ItemCF Model & Index.ipynb`.
- The index is saved in the file `yelp_ItemCF.db` in the same directory as this notebook.

In [1]:
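# Make the parent directory importable so that ../utilities.py can be loaded
import sqlite3  # used explicitly below to open the index database
import sys

sys.path.append('../')
# The star import provides the project helpers used below;
# train_test_split is also expected to come through it.
from utilities import *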

In [2]:
# Define the database folder path and file names
db_folder = '../../data/processed_data/yelp_data/'
data_files = ['business', 'categories', 'review']

# Load data into a dictionary
yelp_data = load_data_from_db(db_folder, data_files)

# Check loaded data
for table, df in yelp_data.items():
    print(f"Loaded {len(df)} rows from {table} table.")

Loaded 78059 rows from business table.
Loaded 360656 rows from categories table.
Loaded 980418 rows from review table.


In [3]:
df_business = yelp_data["business"]
df_review = yelp_data["review"]

user_mapping, business_mapping, user_business = get_user_business(df_business, df_review)

In [4]:
# split the data into training and test sets
train_data, test_data = train_test_split(user_business, test_size=0.2, random_state=42)

In [5]:
# Balance the test data; comment out this line to use the original test data
test_data = balance_test_data(test_data)

# group the test data by user_id and get the business_id
test_data_grouped = test_data.groupby('user_id')['business_id'].apply(list).reset_index()

Number of positive reviews: 136473
Number of negative reviews: 59624
Total number of reviews: 197147
Ratio of positive to negative reviews: 2.29
Number of positive reviews: 59624
Number of negative reviews: 59624
Total number of reviews: 119248
Ratio of positive to negative reviews: 1.00
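
The counts above are consistent with downsampling: the positive class shrinks to the size of the negative class (59624 each). The implementation of `balance_test_data` is not shown in this notebook, so the following is only a minimal plain-Python sketch of that idea. The function name `balance_by_downsampling`, the tuple layout, and the 4-star threshold (borrowed from the `stars >= 4` convention in the last cell of this notebook) are assumptions for illustration.

```python
import random


def balance_by_downsampling(reviews, threshold=4.0, seed=42):
    """Downsample the larger class so both classes end up the same size.

    `reviews` is a list of (user_id, business_id, stars) tuples (hypothetical
    layout). A review counts as positive when stars >= threshold.
    """
    positives = [r for r in reviews if r[2] >= threshold]
    negatives = [r for r in reviews if r[2] < threshold]
    n = min(len(positives), len(negatives))

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced


# Toy data: 4 positive reviews and 2 negative reviews
reviews = [
    ("u1", "b1", 5.0), ("u1", "b2", 4.0), ("u2", "b1", 2.0),
    ("u2", "b3", 5.0), ("u3", "b2", 1.0), ("u3", "b4", 4.0),
]
balanced = balance_by_downsampling(reviews)
n_pos = sum(1 for r in balanced if r[2] >= 4.0)
n_neg = len(balanced) - n_pos
print(f"positive: {n_pos}, negative: {n_neg}")
```

Balancing changes the class prior of the test set, so precision and accuracy measured on the balanced data are not directly comparable with figures measured on the original distribution.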


In [6]:
# Connect to the SQLite database
db_path = './yelp_ItemCF.db'
conn = sqlite3.connect(db_path)

### Retrieval Evaluation

In [7]:
retrieval_recommendations = simulate_recommendations(test_data_grouped, user_mapping, business_mapping, conn, k=300, num_users=1000)

true_positive, true_negative, false_positive, false_negative, total, total_positive, ranks = check_retrieval_recommendations(retrieval_recommendations, test_data, test_data_grouped)

evaluation_metric, confusion_matrix, background_stats = compute_evaluation_metric(true_positive, true_negative, false_positive, false_negative, total, total_positive, ranks)
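
The `ranks` value returned above feeds the Mean Reciprocal Rank. How `check_retrieval_recommendations` builds it is not shown here, so the sketch below assumes one common convention: one 1-based rank per user for the first relevant item in the recommendation list, with `None` for users whose list contains no relevant item (counted as a reciprocal rank of 0).

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all users; a miss (None) contributes 0."""
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Four hypothetical users: hits at ranks 1, 4 and 2, plus one miss
ranks_example = [1, 4, None, 2]
mrr = mean_reciprocal_rank(ranks_example)  # (1 + 0.25 + 0 + 0.5) / 4
print(f"MRR: {mrr:.4f}")
```

Whether misses are included in the denominator changes the value considerably, so the convention should be checked against `utilities.py` before comparing MRR figures across models.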

In [13]:
print("Testing Data Statistics")
display(background_stats)

print("Evaluation Metrics")
display(evaluation_metric)

print("Confusion Matrix")
display(confusion_matrix)

Testing Data Statistics

| Total Positive | Total Negative | Total | Ratio |
|---|---|---|---|
| 1045 | 1083 | 2128 | 0.491071 |

Evaluation Metrics

| Accuracy | Precision | Recall | F1 Score | F-beta Score | Mean Reciprocal Rank |
|---|---|---|---|---|---|
| 0.5329 | 0.553 | 0.2545 | 0.3486 | 0.2853 | 0.0686 |

Confusion Matrix

| True Positive | True Negative | False Positive | False Negative |
|---|---|---|---|
| 266 | 868 | 215 | 779 |
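
The reported metrics can be cross-checked directly from the confusion matrix above. The sketch below is not the code of `compute_evaluation_metric`, which lives in `utilities.py`; it simply applies the standard definitions. The choice `beta = 2` is an inference: it is the value that reproduces the reported F-beta score of 0.2853, and the notebook does not state it.

```python
def classification_metrics(tp, tn, fp, fn, beta=2.0):
    """Standard accuracy, precision, recall, F1 and F-beta from raw counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    def f_score(b):
        # F_b = (1 + b^2) * P * R / (b^2 * P + R)
        denom = b * b * precision + recall
        return (1 + b * b) * precision * recall / denom if denom else 0.0

    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f_score(1.0),
        "f_beta": f_score(beta),
    }


# Counts taken from the confusion matrix above
m = classification_metrics(tp=266, tn=868, fp=215, fn=779)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With `beta = 2` recall is weighted more heavily than precision, which is why the F-beta score (0.2853) sits closer to the recall (0.2545) than the F1 score does.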


### Prediction Evaluation

In [9]:
# predicted_labels, actual_labels, positive_count, negative_count, null_count, unrated_count = predict_recommendations(test_data, test_data_grouped, business_mapping, conn)

# evaluation_metric, confusion_matrix, background_stats = compute_prediction_evaluation(actual_labels, predicted_labels)

In [10]:
# print("Testing Data Statistics")
# display(background_stats)

# print("Evaluation Metrics")
# display(evaluation_metric)

# # drop the MRR column if it exists
# if 'Mean Reciprocal Rank' in evaluation_metric.columns:
#     evaluation_metric.drop(columns=['Mean Reciprocal Rank'], inplace=True)

# print("Confusion Matrix")
# display(confusion_matrix)

In [11]:
conn.close()

### Adding Retrieval Results to the Database

In [None]:
# db_path = '../Retrieval Result/Retrieval.db'
# conn = sqlite3.connect(db_path)
# cursor = conn.cursor()

# # Create a lookup for fast access to star ratings from the test data:
# # This dictionary maps (user_id, business_id) to the star rating.
# test_data_lookup = {
#     (row['user_id'], row['business_id']): row['stars_review']
#     for _, row in test_data.iterrows()
# }

# # Prepare bulk records for insertion into SQLite.
# # Format: (model, user_id, business_id, real_label)
# # Here we assume a positive review (real_label = 1) if stars >= 4, else negative (real_label = 0).
# bulk_records = []
# model_name = "ItemCF"  # You can change this if needed

# for user_id, recommended_businesses in retrieval_recommendations.items():
#     for business_id in recommended_businesses[0]:
#         # Check the star rating from the test data lookup
#         star_rating = test_data_lookup.get((user_id, business_id))
#         # Define the real label: 1 if rating is available and >= 4, otherwise 0.
#         real_label = 1 if star_rating is not None and star_rating >= 4 else 0
#         if real_label == 1:
#             print(f"User {user_id} was recommended business {business_id} with a positive rating of {star_rating}.")
#         bulk_records.append((model_name, user_id, business_id, real_label))

# # Example: Now perform a bulk insert using SQLite's executemany.
# # Make sure your 'recommendations' table has a UNIQUE constraint on (model, user_id, business_id) if needed.
# cursor.executemany("""
#     INSERT OR IGNORE INTO recommendations (model, user_id, business_id, real_label)
#     VALUES (?, ?, ?, ?)
# """, bulk_records)
# conn.commit()

# # Close the connection
# conn.close()
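
The cell above relies on `INSERT OR IGNORE` together with a `UNIQUE` constraint to make repeated runs idempotent. The following self-contained sketch shows that pattern on an in-memory database. The table schema here is an assumption based on the column names used in the cell, not the actual schema of `Retrieval.db`, and the user and business ids are placeholders.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("""
    CREATE TABLE recommendations (
        model       TEXT    NOT NULL,
        user_id     TEXT    NOT NULL,
        business_id TEXT    NOT NULL,
        real_label  INTEGER NOT NULL,
        UNIQUE (model, user_id, business_id)
    )
""")

records = [
    ("ItemCF", "user_a", "biz_1", 1),
    ("ItemCF", "user_a", "biz_2", 0),
    ("ItemCF", "user_a", "biz_1", 1),  # duplicate key, skipped by OR IGNORE
]

# Using the connection as a context manager commits on success
# and rolls back if an exception is raised.
with demo:
    demo.executemany(
        """
        INSERT OR IGNORE INTO recommendations (model, user_id, business_id, real_label)
        VALUES (?, ?, ?, ?)
        """,
        records,
    )

row_count = demo.execute("SELECT COUNT(*) FROM recommendations").fetchone()[0]
print(f"rows stored: {row_count}")
demo.close()
```

Without the `UNIQUE` constraint, `OR IGNORE` has nothing to conflict with and every rerun of the cell would append duplicate rows.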

User --Vu3Gux9nPnLcG9yO_HxA was recommended business WwPdWaLYBRz7By915qlXRg with a positive rating of 5.0.
User --XwFm4qERD6J5SX0JAsbg was recommended business MbNcVhRqpNPcvgFzWgaxSQ with a positive rating of 5.0.
User --u09WAjW741FdfkJXxNmg was recommended business 71U7MxQEhwitJOm4CQpRwQ with a positive rating of 4.0.
User -0KrCHEsOcjJ6N4k_k1A9A was recommended business KdAWjL9MKjpJzEeI902qBA with a positive rating of 4.0.
User -0MIp6WKJ8QvGnYZQ5ETyg was recommended business j-qtdD55OLfSqfsWuQTDJg with a positive rating of 5.0.
User -13RX4Gy_F-zoLIenWAo-w was recommended business dGeXdSMah56gEHwZNaRQKA with a positive rating of 5.0.
User -2cKJFFNJ9XVyWBt62mWvA was recommended business NbOWECn3ilz4gWL6dm5P6g with a positive rating of 5.0.
User -3Dzhux7DmA0Rj6P8PtQNA was recommended business EaqASiPkxV9OUkvsAp4ODg with a positive rating of 5.0.
User -3s52C4zL_DHRK0ULG6qtg was recommended business 0QYWhij_YZ7Lyk9F6213Sg with a positive rating of 5.0.
User -3s52C4zL_DHRK0ULG6qtg was recom