### User-based Collaborative Filtering - Model & Index
This notebook demonstrates how to build a user-based collaborative filtering (UserCF) model on the Yelp dataset. You can add features or tune the hyperparameters to improve model performance. The index is built and stored in the `yelp_UserCF.db` file.

Objective: Build a basic UserCF model for retrieval and prediction.  
Strategy: Compute user-user Jaccard similarity on the training data; store the results in `yelp_UserCF.db`.
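
As a minimal illustration of the similarity measure named in the strategy, the sketch below computes Jaccard similarity for two users represented as sets of reviewed businesses. The function name and the business IDs are hypothetical and are not part of the notebook's code.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical business IDs reviewed by two users
alice = {"b1", "b2", "b3"}
bob = {"b2", "b3", "b4", "b5"}

sim = jaccard(alice, bob)
print(sim)  # 2 shared businesses out of 5 distinct ones -> 0.4
```

Only the overlap of reviewed businesses matters here; the star ratings themselves are ignored, which is why the notebook binarizes the rating matrix before computing similarity.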

#### Pre-requisites
1. Have the processed Yelp dataset in the `../../data/processed_data/yelp_data` folder.
2. Have the virtual environment set up and active for this notebook.

#### Move to Production
1. Copy the `yelp_UserCF.db` file to the `../../data/processed_data` folder.
2. Update the `UserCF.py` file in the `../backend/models` folder if there are changes to the retrieval process.

In [12]:
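# Make ../utilities.py importable, then pull in its helpers (e.g. load_data_from_db)
import sys
sys.path.append('../')
from utilities import *

# Imported explicitly so the notebook does not rely on the wildcard import for them
import pickle
import sqlite3

import numpy as np
import pandas as pd
import scipy.sparse as sp
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sparse_dot_topn import sp_matmul_topn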

In [2]:
# Load Yelp data
db_folder = '../../data/processed_data/yelp_data/'
data_files = ['business', 'categories', 'review']
yelp_data = load_data_from_db(db_folder, data_files)
for table, df in yelp_data.items():
    print(f"Loaded {len(df)} rows from {table} table.")

Loaded 78059 rows from business table.
Loaded 360656 rows from categories table.
Loaded 980418 rows from review table.


In [3]:
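# Prepare data: join businesses with their reviews.
# Note: how='outer' keeps unmatched rows from either table, so some rows can have
# a missing user_id or stars_review; NaN ratings are zeroed out later with nan_to_num.
df_business = yelp_data["business"]
df_review = yelp_data["review"]
df_concat = df_business.merge(df_review, on='business_id', how='outer', suffixes=('_business', '_review'))
# Review date as Unix seconds (not used by the model below)
df_concat["timestamp"] = pd.to_datetime(df_concat["date"]).astype(int) // 10**9
# Keep only the columns needed for the user-item matrix
user_business = df_concat[["user_id", "business_id", "stars_review"]]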

In [4]:
# Split into train (80%) and test (20%); use train for model
train_data, test_data = train_test_split(user_business, test_size=0.2, random_state=42)
user_business = train_data.copy()

In [5]:
# Create user and business index mappings
user_mapping = {user: idx for idx, user in enumerate(user_business['user_id'].unique())}
business_mapping = {biz: idx for idx, biz in enumerate(user_business['business_id'].unique())}

# Map user_id and business_id to numerical indices
user_business['user_idx'] = user_business['user_id'].map(user_mapping)
user_business['business_idx'] = user_business['business_id'].map(business_mapping)

# Creating the sparse user-item interaction matrix using stars_review
user_item_sparse = csr_matrix(
    (user_business['stars_review'], (user_business['user_idx'], user_business['business_idx'])),
    shape=(len(user_mapping), len(business_mapping))
)

# Replace NaN values in the sparse matrix
user_item_sparse.data = np.nan_to_num(user_item_sparse.data)

# Convert ratings to binary (1 if interacted, 0 otherwise) for Jaccard
binary_user_item_sparse = (user_item_sparse > 0).astype(int)
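
To make the index mapping and matrix construction above easier to follow, here is a small self-contained sketch on made-up data. The `toy_*` names and the rating triples are assumptions for illustration only; the real notebook builds the same structures from the training DataFrame.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical (user_id, business_id, stars) triples standing in for the training data
ratings = [("u1", "b1", 5.0), ("u1", "b2", 3.0), ("u2", "b2", 4.0), ("u3", "b3", 1.0)]

# Assign indices in order of first appearance, as the mapping cell does
toy_user_map, toy_biz_map = {}, {}
for user, biz, _ in ratings:
    toy_user_map.setdefault(user, len(toy_user_map))
    toy_biz_map.setdefault(biz, len(toy_biz_map))

rows = [toy_user_map[u] for u, _, _ in ratings]
cols = [toy_biz_map[b] for _, b, _ in ratings]
vals = [s for _, _, s in ratings]

# Rows are users, columns are businesses, values are star ratings
toy_matrix = csr_matrix((vals, (rows, cols)), shape=(len(toy_user_map), len(toy_biz_map)))

# Binary version: 1 where the user reviewed the business, 0 otherwise
toy_binary = (toy_matrix > 0).astype(int)

print(toy_matrix.toarray())
print(toy_binary.toarray())
```

Note that `csr_matrix` sums duplicate `(row, col)` entries, so a user who reviewed the same business more than once contributes a single summed value, which the binarization step then reduces to 1.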

In [13]:
# Compute user-user Jaccard similarity
def jaccard_similarity_topn(A, top_n=50, threshold=0.01):
    """
    Compute Jaccard similarity for users in a sparse user-item matrix efficiently.
    Returns a sparse matrix containing only the top N similar users per user.
    """
    # Convert to binary interactions
    A_bin = (A > 0).astype(int)

    # Compute intersection (co-occurrence): A @ A.T
    intersection = A_bin @ A_bin.T

    # Compute user-wise interaction counts (sparse)
    user_sums = np.array(A_bin.sum(axis=1)).flatten()

    # Compute union using non-zero indices
    row_indices, col_indices = intersection.nonzero()
    intersection_values = intersection.data

    # Compute union: |A| + |B| - |A ∩ B|
    union_values = user_sums[row_indices] + user_sums[col_indices] - intersection_values

    # Compute Jaccard similarity
    jaccard_values = intersection_values / union_values

    # Apply thresholding
    mask = jaccard_values >= threshold
    row_indices, col_indices, jaccard_values = row_indices[mask], col_indices[mask], jaccard_values[mask]

    # Create sparse matrix for efficient storage
    jaccard_sim_sparse = csr_matrix(
        (jaccard_values, (row_indices, col_indices)), shape=(A.shape[0], A.shape[0])
    )

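    # Keep only the top-N most similar users per row. sp_matmul_topn computes A @ B and
    # keeps the top-N entries of each row, so multiply by the identity to rank the Jaccard
    # scores themselves rather than the product of the similarity matrix with itself.
    # Each user's own entry (similarity 1.0) counts towards the top-N.
    identity = sp.identity(jaccard_sim_sparse.shape[0], dtype=jaccard_sim_sparse.dtype, format='csr')
    jaccard_sim_sparse = sp_matmul_topn(jaccard_sim_sparse, identity, top_n=top_n, threshold=threshold, n_threads=4, sort=True)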

    return jaccard_sim_sparse
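
The intersection and union arithmetic in the function above can be cross-checked on a tiny dense matrix. The sketch below is an independent NumPy version written for illustration (the name `jaccard_dense` and the toy matrix are assumptions); it skips thresholding and top-N selection and is only practical for small inputs.

```python
import numpy as np

def jaccard_dense(binary: np.ndarray) -> np.ndarray:
    """Pairwise Jaccard similarity between the rows of a dense 0/1 matrix."""
    binary = binary.astype(np.int64)
    intersection = binary @ binary.T              # co-occurrence counts |A ∩ B|
    row_sums = binary.sum(axis=1)                 # |A| for every row
    union = row_sums[:, None] + row_sums[None, :] - intersection
    # Rows with no interactions give union == 0; define their similarity as 0
    return np.divide(intersection, union, out=np.zeros(union.shape, dtype=float), where=union > 0)

# Four hypothetical users (rows) and four businesses (columns)
toy = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
])
sim = jaccard_dense(toy)
print(np.round(sim, 3))
```

Users 0 and 1 share one of three distinct businesses, so their similarity is 1/3, while user 3 shares nothing with anyone and scores 0 against all other users.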

In [14]:
# Compute user similarity using Jaccard
user_similarity_sparse = jaccard_similarity_topn(binary_user_item_sparse, top_n=50, threshold=0.01)
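
This notebook only builds the similarity index; the prediction step itself is not shown here. One common choice for user-based CF, assumed in the sketch below and not necessarily what `UserCF.py` does, is a similarity-weighted mean of the neighbours' ratings. The function name and the toy arrays are hypothetical.

```python
import numpy as np

def predict_rating(user_idx, item_idx, ratings, similarity):
    """Similarity-weighted mean of neighbours' ratings for one item.

    ratings: dense (n_users, n_items) array, 0 meaning "not rated".
    similarity: dense (n_users, n_users) array of user-user similarities.
    Returns None when no neighbour with positive similarity rated the item.
    """
    weights = similarity[user_idx].astype(float).copy()
    weights[user_idx] = 0.0                    # exclude the target user
    weights[ratings[:, item_idx] == 0] = 0.0   # keep only neighbours who rated the item
    total = weights.sum()
    if total <= 0:
        return None
    return float(weights @ ratings[:, item_idx] / total)

toy_ratings = np.array([
    [5.0, 0.0, 3.0],
    [4.0, 2.0, 0.0],
    [0.0, 4.0, 5.0],
])
toy_similarity = np.array([
    [1.0, 0.5, 0.25],
    [0.5, 1.0, 0.0],
    [0.25, 0.0, 1.0],
])

# User 0 has not rated item 1; neighbours 1 and 2 rated it 2.0 and 4.0
print(predict_rating(0, 1, toy_ratings, toy_similarity))
```

Here the prediction is (0.5 * 2.0 + 0.25 * 4.0) / 0.75, so the more similar neighbour pulls the estimate towards their rating.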

In [15]:
def optimize_db(conn):
    """Apply SQLite performance optimizations."""
    cursor = conn.cursor()
    cursor.executescript('''
        PRAGMA synchronous = OFF;
        PRAGMA journal_mode = MEMORY;
        PRAGMA temp_store = MEMORY;
        PRAGMA cache_size = 1000000;
    ''')
    conn.commit()

# Optimized batch insert for user-item interactions
def insert_user_item(user_business, conn, batch_size=50000):
    """Optimized batch insert for user-item interactions."""
    cursor = conn.cursor()
    cursor.execute('BEGIN TRANSACTION')
    total_records = len(user_business)
    data = user_business[['user_id', 'business_id', 'stars_review']].values.tolist()
    for i in range(0, total_records, batch_size):
        batch = data[i:i + batch_size]
        cursor.executemany('''INSERT OR IGNORE INTO user_item_index (user_id, business_id, stars_review)
                              VALUES (?, ?, ?)''', batch)
        if i % (batch_size * 5) == 0:
            conn.commit()
            print(f"Inserted {i + len(batch)} / {total_records} user-item records.")
    conn.commit()
    print(f"Total {total_records} user-item records inserted.")

# Optimized batch insert for user similarity vectors
def insert_user_vectors(user_similarity_sparse, user_mapping, conn, batch_size=5000, progress_interval=50000):
    """Optimized batch insert for user similarity vectors."""
    cursor = conn.cursor()
    cursor.execute('BEGIN TRANSACTION')
    total_inserted = 0
    batch = []
    user_keys = list(user_mapping.keys())
    for row_idx in range(user_similarity_sparse.shape[0]):
        row_vector = user_similarity_sparse.getrow(row_idx)
        serialized_row = pickle.dumps((row_vector.indices, row_vector.data))
        user_id = user_keys[row_idx]
        batch.append((user_id, serialized_row))
        if len(batch) >= batch_size:
            cursor.executemany('''INSERT OR REPLACE INTO user_user_similarity (user_id, similarity_vector)
                                  VALUES (?, ?)''', batch)
            total_inserted += len(batch)
            if total_inserted % progress_interval == 0:
                print(f"Inserted {total_inserted} user vectors...")
            batch = []
    if batch:
        cursor.executemany('''INSERT OR REPLACE INTO user_user_similarity (user_id, similarity_vector)
                              VALUES (?, ?)''', batch)
        total_inserted += len(batch)
    conn.commit()
    print(f"Total {total_inserted} user vectors inserted.")

# Optimized batch insert for mappings
def insert_mappings(mapping, conn, table_name, key_col, val_col, batch_size=50000):
    """Optimized batch insert for mappings."""
    cursor = conn.cursor()
    cursor.execute('BEGIN TRANSACTION')
    data = list(mapping.items())
    total_records = len(data)
    for i in range(0, total_records, batch_size):
        batch = data[i:i + batch_size]
        cursor.executemany(f'''INSERT OR REPLACE INTO {table_name} ({key_col}, {val_col})
                              VALUES (?, ?)''', batch)
        if i % (batch_size * 5) == 0:
            conn.commit()
            print(f"Inserted {i + len(batch)} / {total_records} {table_name} records.")
    conn.commit()
    print(f"Total {total_records} {table_name} records inserted.")
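
The storage format used by `insert_user_vectors` (a pickled `(indices, data)` pair per user) can be illustrated with a round trip through an in-memory SQLite database. This is a minimal sketch on made-up values; `toy_sim`, `toy_user_ids`, and `conn_demo` are hypothetical names, and the read-back logic is an assumption about how a consumer might decode the blob.

```python
import pickle
import sqlite3

import numpy as np
from scipy.sparse import csr_matrix

# Toy 3x3 similarity matrix; the values are made up for illustration
toy_sim = csr_matrix(np.array([
    [1.0, 0.4, 0.0],
    [0.4, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]))
toy_user_ids = ["u1", "u2", "u3"]  # hypothetical IDs, position == matrix row index

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE user_user_similarity (user_id TEXT PRIMARY KEY, similarity_vector BLOB)"
)

# Serialize each sparse row as (column indices, similarity values)
rows = []
for row_idx, user_id in enumerate(toy_user_ids):
    row = toy_sim.getrow(row_idx)
    rows.append((user_id, pickle.dumps((row.indices, row.data))))
conn_demo.executemany("INSERT OR REPLACE INTO user_user_similarity VALUES (?, ?)", rows)
conn_demo.commit()

# Read one user's vector back and translate neighbour indices to user IDs
blob = conn_demo.execute(
    "SELECT similarity_vector FROM user_user_similarity WHERE user_id = ?", ("u2",)
).fetchone()[0]
indices, data = pickle.loads(blob)
neighbours = {toy_user_ids[i]: float(s) for i, s in zip(indices, data)}
print(neighbours)
conn_demo.close()
```

Because the blob stores row indices rather than user IDs, a reader needs the `user_mapping` table to translate them back. Pickled data should only be loaded from a database you trust, since unpickling can execute arbitrary code.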

In [18]:
# Connect to SQLite and set up database
db_path = './yelp_UserCF.db'
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
optimize_db(conn)

# Drop existing indexes if they exist
cursor.execute('''DROP INDEX IF EXISTS idx_user_item''')
cursor.execute('''DROP INDEX IF EXISTS idx_user_similarity''')

# Create tables for user-item and user-user indexes
cursor.execute('''CREATE TABLE IF NOT EXISTS user_item_index (
    user_id TEXT,
    business_id TEXT,
    stars_review REAL,
    PRIMARY KEY (user_id, business_id)
)''')
cursor.execute('''CREATE INDEX idx_user_item ON user_item_index(user_id, business_id)''')
cursor.execute('''CREATE TABLE IF NOT EXISTS user_user_similarity (
    user_id TEXT PRIMARY KEY,
    similarity_vector BLOB
)''')
cursor.execute('''CREATE INDEX idx_user_similarity ON user_user_similarity(user_id)''')
cursor.execute('''CREATE TABLE IF NOT EXISTS user_mapping (
    user_id TEXT PRIMARY KEY,
    user_idx INTEGER
)''')
cursor.execute('''CREATE TABLE IF NOT EXISTS business_mapping (
    business_id TEXT PRIMARY KEY,
    business_idx INTEGER
)''')
conn.commit()
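
In SQLite, a `PRIMARY KEY` on a non-integer column of an ordinary table is backed by an automatically created unique index, so the explicit `idx_user_item` and `idx_user_similarity` indexes above appear to cover the same columns a second time. The sketch below, using a throwaway in-memory database (`conn_demo` is a hypothetical name), lists the indexes SQLite creates for such a table without any `CREATE INDEX` statement.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("""CREATE TABLE user_user_similarity (
    user_id TEXT PRIMARY KEY,
    similarity_vector BLOB
)""")

# PRAGMA index_list returns one row per index; column 1 is the name, column 2 the unique flag
indexes = conn_demo.execute("PRAGMA index_list('user_user_similarity')").fetchall()
for row in indexes:
    print(row[1], "unique" if row[2] else "non-unique")
conn_demo.close()
```

If the listing shows an automatic index on the key column, dropping the explicit indexes would likely save space and insert time without slowing lookups by `user_id`, though that is worth measuring on the real database.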

In [19]:
# Insert data into database
insert_user_item(user_business, conn)
insert_user_vectors(user_similarity_sparse, user_mapping, conn)
insert_mappings(user_mapping, conn, 'user_mapping', 'user_id', 'user_idx')
insert_mappings(business_mapping, conn, 'business_mapping', 'business_id', 'business_idx')

Inserted 50000 / 788585 user-item records.
Inserted 300000 / 788585 user-item records.
Inserted 550000 / 788585 user-item records.
Inserted 788585 / 788585 user-item records.
Total 788585 user-item records inserted.
Inserted 50000 user vectors...
Inserted 100000 user vectors...
Total 148521 user vectors inserted.
Inserted 50000 / 148521 user_mapping records.
Total 148521 user_mapping records inserted.
Inserted 50000 / 74698 business_mapping records.
Total 74698 business_mapping records inserted.


In [20]:
# Close the connection
conn.close()