### Item-based Collaborative Filtering - Model & Index
This notebook demonstrates how to build an item-based collaborative filtering model using the Yelp dataset. You can extend the model with more features or tune its hyperparameters to improve performance. The index is built and stored in the `yelp_ItemCF.db` file.

#### Pre-requisites
1. Have the processed Yelp dataset in the `../../data/processed_data/yelp_data` folder.
2. Have the virtual environment set up and activated for this notebook.

#### Move to Production
1. Copy the `yelp_ItemCF.db` file to the `../../data/processed_data` folder.
2. Update the `ItemCF.py` file in the `../backend/models` folder if there are changes in the retrieval process.


#### Adding clustering to the pipeline
Count the reviews written by each user and take the median review count as the threshold:
- for users whose review count is above the median, index using only their individual interactions
- for users whose review count is below the median, also consider the other users in the same cluster
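
The split described above can be sketched as follows. This is a minimal illustration on toy data, assuming a review table with `user_id` and `business_id` columns; the helper name `split_users_by_median` is hypothetical and not part of the notebook. Users exactly at the median are grouped with the low-activity users here, which is an assumption the description leaves open.

```python
import pandas as pd


def split_users_by_median(df_review: pd.DataFrame):
    """Return (heavy_users, light_users, median) from per-user review counts."""
    review_counts = df_review.groupby("user_id")["business_id"].count()
    median_count = review_counts.median()
    # Assumption: users exactly at the median are treated as low-activity users.
    heavy_users = set(review_counts[review_counts > median_count].index)
    light_users = set(review_counts[review_counts <= median_count].index)
    return heavy_users, light_users, median_count


# Toy data: u1 has 3 reviews, u2 has 1, u3 has 2.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u3", "u3"],
    "business_id": ["a", "b", "c", "a", "b", "c"],
})
heavy, light, median = split_users_by_median(toy)
print(median, sorted(heavy), sorted(light))
```

Heavy users would then be indexed from their own interactions, while light users would fall back to the aggregated interactions of their cluster.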

In [1]:
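# Make the parent directory importable so that ../utilities.py can be loaded
import sys
sys.path.append('../')
from utilities import *  # expected to provide load_data_from_db

import pickle
import sqlite3
import time

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sparse_dot_topn import sp_matmul_topn
from sklearn.model_selection import train_test_split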

In [2]:
# Define the database folder path and file names
db_folder = '../../data/processed_data/yelp_data/'
data_files = ['business', 'categories', 'review']

# Load data into a dictionary
yelp_data = load_data_from_db(db_folder, data_files)

# Check loaded data
for table, df in yelp_data.items():
    print(f"Loaded {len(df)} rows from {table} table.")
    
df_review = yelp_data["review"]
df_business = yelp_data["business"]


Loaded 78059 rows from business table.
Loaded 360656 rows from categories table.
Loaded 980418 rows from review table.


In [3]:
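# Exponential time-decay rate (per second); earlier trial values are kept for reference
# LAMBDA = 0.0000000005
# LAMBDA = 0.00000000005
LAMBDA = 0.0000000001
current_timestamp = int(time.time())

# Convert review dates to Unix seconds, then overwrite the column with the decay weight exp(-LAMBDA * age)
df_review['timestamp'] = pd.to_datetime(df_review['date']).astype('int64') // 10**9
df_review['timestamp'] = np.exp(-LAMBDA * (current_timestamp - df_review["timestamp"]))
# Scale each rating by its recency weight
df_review['stars'] = df_review['timestamp'] * df_review['stars']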
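
To get a feel for how strongly the decay rate discounts old reviews, the weight `exp(-LAMBDA * age)` can be evaluated for a few ages. The sketch below is illustrative only: the helper `decay_weight` and the constant `SECONDS_PER_YEAR` are names introduced here, not part of the notebook. With `LAMBDA = 1e-10` per second the half-life works out to roughly 220 years, so a ten-year-old review still keeps about 97% of its rating.

```python
import math

LAMBDA = 1e-10                      # per second, same value as in the cell above
SECONDS_PER_YEAR = 365 * 24 * 3600  # ignoring leap years for this estimate


def decay_weight(age_seconds: float, lam: float = LAMBDA) -> float:
    """Exponential recency weight for a review that is age_seconds old."""
    return math.exp(-lam * age_seconds)


# Age at which the weight drops to 0.5, expressed in years.
half_life_years = math.log(2) / LAMBDA / SECONDS_PER_YEAR
w1 = decay_weight(1 * SECONDS_PER_YEAR)
w10 = decay_weight(10 * SECONDS_PER_YEAR)

print(f"half-life: {half_life_years:.1f} years")
print(f"weight after 1 year: {w1:.4f}, after 10 years: {w10:.4f}")
```

A larger `LAMBDA`, such as the commented-out trial values, shortens the half-life proportionally.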

In [5]:
df_cluster = pd.read_excel("../data_processing/clustered_users.xlsx")

In [7]:
user_cluster = df_cluster.set_index('user_id')['cluster'].to_dict()
df_review['cluster'] = df_review['user_id'].map(user_cluster)
# cluster_items = df_review.groupby('cluster')['business_id'].apply(set).to_dict()

In [29]:
df_cluster_review = df_review.groupby(['cluster', 'business_id'])['stars'].sum().reset_index()
df_cluster_review

Unnamed: 0,cluster,business_id,stars
0,0.0,0aktKA1TIWpjXSz0_0qOkg,5.884306
1,0.0,2xxkaRy7rP5EUyjFt2J5kA,2.945568
2,0.0,42cHjHD6Kkv7Ms3-lLsimw,2.852694
3,0.0,5AenUmkr8mkgaNEUTVGbwA,4.898115
4,0.0,5drbv3fz5FTvp_Z3d3aPSQ,4.849228
...,...,...,...
721322,2502.0,yMeC2ltA33lJ-nBIQP4sJQ,3.858453
721323,2502.0,z6I4QVP1M1HXETgwD92XHg,4.856031
721324,2502.0,zDEzXNbn84HD0lR0KD0seg,2.870508
721325,2502.0,zjqh_qoBS1BWVSbC51BNjw,3.949119


In [30]:
cluster_business = df_cluster_review.copy()

In [31]:
def get_cluster_business(cluster_business):
    cluster_mapping = {clus: idx for idx, clus in enumerate(cluster_business['cluster'].unique())}
    business_mapping = {biz: idx for idx, biz in enumerate(cluster_business['business_id'].unique())}    
    return cluster_mapping, business_mapping

In [33]:
cluster_mapping, business_mapping = get_cluster_business(cluster_business)

In [32]:
cluster_business

Unnamed: 0,cluster,business_id,stars
0,0.0,0aktKA1TIWpjXSz0_0qOkg,5.884306
1,0.0,2xxkaRy7rP5EUyjFt2J5kA,2.945568
2,0.0,42cHjHD6Kkv7Ms3-lLsimw,2.852694
3,0.0,5AenUmkr8mkgaNEUTVGbwA,4.898115
4,0.0,5drbv3fz5FTvp_Z3d3aPSQ,4.849228
...,...,...,...
721322,2502.0,yMeC2ltA33lJ-nBIQP4sJQ,3.858453
721323,2502.0,z6I4QVP1M1HXETgwD92XHg,4.856031
721324,2502.0,zDEzXNbn84HD0lR0KD0seg,2.870508
721325,2502.0,zjqh_qoBS1BWVSbC51BNjw,3.949119


In [34]:
train_data, test_data = train_test_split(cluster_business, test_size=0.2, random_state=42)

cluster_business = train_data.copy()

In [39]:
cluster_business

Unnamed: 0,cluster,business_id,stars,cluster_idx,business_idx
66451,256.0,A-5IN85MwL9F8wJRsDna6g,1.939342,256,30314
445245,1264.0,Du6NbXKI0Bu7bSIF_FPYFQ,0.957208,1264,36628
220771,711.0,w3bYKltczgGMGU0eAD0low,4.936224,711,42150
310277,958.0,DVBJRvnCpkqaYl6nHroaMg,4.914664,958,21472
675767,2242.0,BWK7MAUayTZlQ_3PJeaudg,1.959892,2242,718
...,...,...,...,...,...
259178,779.0,adu5voMt1rln1nilzZh9uA,4.916211,779,55275
365838,1090.0,Av_XS0ESX4Jpwvg6D2CfVQ,4.940924,1090,29615
131932,446.0,0ndzIekJs0PRPA5Aoz9sow,2.905188,446,1624
671155,2212.0,rdS9Uy5sDv97k6phZTceWA,1.925248,2212,37553


In [35]:
# Map cluster and business_id to numerical indices
cluster_business['cluster_idx'] = cluster_business['cluster'].map(cluster_mapping)
cluster_business['business_idx'] = cluster_business['business_id'].map(business_mapping)

# Create the sparse cluster-item interaction matrix from the time-weighted stars
user_item_sparse = csr_matrix(
    (cluster_business['stars'], (cluster_business['cluster_idx'], cluster_business['business_idx'])),
    shape=(len(cluster_mapping), len(business_mapping))
)

# Replace NaN values in the sparse matrix
user_item_sparse.data = np.nan_to_num(user_item_sparse.data)
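
As a small illustration of the construction above, the sketch below builds a CSR matrix from (value, (row, column)) triplets on made-up numbers and applies the same NaN replacement. Note that `np.nan_to_num` turns a NaN into an explicitly stored zero; calling `eliminate_zeros()` afterwards is an optional extra step, not something the notebook does.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy triplets: (cluster_idx, business_idx, weighted stars), one value missing.
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 2])
vals = np.array([4.9, 2.8, np.nan, 3.9])

M = csr_matrix((vals, (rows, cols)), shape=(3, 3))
M.data = np.nan_to_num(M.data)  # NaN becomes 0.0 but stays a stored entry
stored_before = M.nnz

M.eliminate_zeros()             # optional: drop explicit zeros from storage
stored_after = M.nnz
dense = M.toarray()

print(stored_before, stored_after)
print(dense)
```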

In [36]:
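def sparse_cosine_similarity_topn(A, top_n, threshold=0):
    # A is (clusters x items), so A.T @ A yields the (items x items) similarity matrix.
    # A is not L2-normalized here, so the scores are dot products rather than true cosines.
    C = sp_matmul_topn(A.T.tocsr(), A, top_n=top_n, threshold=threshold, n_threads=4, sort=True)
    return C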


In [37]:
# Compute item similarity
item_similarity_sparse = sparse_cosine_similarity_topn(user_item_sparse, top_n=50, threshold=0.01)
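
The call above passes the weighted matrix without L2 normalization, so the stored scores are dot products rather than cosines bounded by 1. If true cosine similarity is wanted, one option is to scale each item column to unit length first. The sketch below shows the idea on a toy matrix using only NumPy and SciPy; the function `item_cosine_topn` is a hypothetical helper, and it converts to a dense array purely because the example is tiny. For the full dataset the normalized matrix would instead be passed to `sp_matmul_topn`.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags


def item_cosine_topn(A, top_n):
    """Cosine similarity between the columns (items) of A, plus top-n neighbours."""
    A = csr_matrix(A, dtype=np.float64)
    # L2 norm of every item column; guard against empty columns.
    col_norms = np.sqrt(np.asarray(A.multiply(A).sum(axis=0)).ravel())
    col_norms[col_norms == 0] = 1.0
    A_unit = A @ diags(1.0 / col_norms)   # each column now has unit length
    sim = (A_unit.T @ A_unit).toarray()   # dense only because the toy case is small
    np.fill_diagonal(sim, 0.0)            # ignore self-similarity
    top_idx = np.argsort(-sim, axis=1)[:, :top_n]
    return sim, top_idx


# Rows are clusters, columns are items; items 0 and 1 have identical ratings.
toy = np.array([
    [5.0, 5.0, 0.0],
    [0.0, 0.0, 4.0],
    [3.0, 3.0, 0.0],
])
sim, top_idx = item_cosine_topn(toy, top_n=1)
print(np.round(sim, 3))
print(top_idx[:2].ravel())
```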

In [38]:
item_similarity_sparse

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 3403550 stored elements and shape (70727, 70727)>

In [23]:
def optimize_db(conn):
    """Apply SQLite performance optimizations."""
    cursor = conn.cursor()
    cursor.executescript('''
        PRAGMA synchronous = OFF;
        PRAGMA journal_mode = MEMORY;
        PRAGMA temp_store = MEMORY;
        PRAGMA cache_size = 1000000;
    ''')
    conn.commit()


def insert_user_item(user_business, conn, batch_size=50000):
    """Optimized batch insert for user-item interactions."""
    cursor = conn.cursor()
    cursor.execute('BEGIN TRANSACTION')

    total_records = len(user_business)
    data = user_business[['user_id', 'business_id', 'stars_review']].values.tolist()

    for i in range(0, total_records, batch_size):
        batch = data[i:i + batch_size]
        cursor.executemany('''INSERT OR IGNORE INTO user_item_index (user_id, business_id, stars_review)
                              VALUES (?, ?, ?)''', batch)

        if i % (batch_size * 5) == 0:  # Commit every 5 batches
            conn.commit()
            print(f"Inserted {i + len(batch)} / {total_records} user-item records.")

    conn.commit()  # Final commit
    print(f"Total {total_records} user-item records inserted.")


def insert_item_vectors(item_similarity_sparse, business_mapping, conn, batch_size=5000, progress_interval=50000):
    """Optimized batch insert for item similarity vectors."""
    cursor = conn.cursor()
    cursor.execute('BEGIN TRANSACTION')

    total_inserted = 0
    batch = []
    business_keys = list(business_mapping.keys())  # Position i holds the business_id whose business_idx is i (dicts keep insertion order)

    for row_idx in range(item_similarity_sparse.shape[0]):
        row_vector = item_similarity_sparse.getrow(row_idx)
        row_indices = row_vector.indices
        row_data = row_vector.data

        serialized_row = pickle.dumps((row_indices, row_data))
        item_id = business_keys[row_idx]  # Faster lookup

        batch.append((item_id, serialized_row))

        if len(batch) >= batch_size:
            cursor.executemany('''INSERT OR REPLACE INTO item_item_similarity (item_id, similarity_vector)
                                  VALUES (?, ?)''', batch)
            total_inserted += len(batch)

            if total_inserted % progress_interval == 0:
                print(f"Inserted {total_inserted} item vectors...")

            batch = []

    if batch:  # Insert remaining records
        cursor.executemany('''INSERT OR REPLACE INTO item_item_similarity (item_id, similarity_vector)
                              VALUES (?, ?)''', batch)
        total_inserted += len(batch)

    conn.commit()
    print(f"Total {total_inserted} item vectors inserted.")


def insert_mappings(business_mapping, conn, batch_size=50000):
    """Optimized batch insert for business mappings."""
    cursor = conn.cursor()
    cursor.execute('BEGIN TRANSACTION')

    data = list(business_mapping.items())
    total_records = len(data)

    for i in range(0, total_records, batch_size):
        batch = data[i:i + batch_size]
        cursor.executemany('''INSERT OR REPLACE INTO business_mapping (business_id, business_idx)
                              VALUES (?, ?)''', batch)

        if i % (batch_size * 5) == 0:  # Commit every 5 batches
            conn.commit()
            print(f"Inserted {i + len(batch)} / {total_records} business mappings.")

    conn.commit()
    print(f"Total {total_records} business mappings inserted.")


In [24]:
# Connect to SQLite (this will create a file-based database)
db_path = './yelp_ItemCF.db'
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
optimize_db(conn)

# Create tables for user-item and item-item indexes
cursor.execute('''CREATE TABLE IF NOT EXISTS user_item_index (
    user_id TEXT,
    business_id TEXT,
    stars_review REAL,
    PRIMARY KEY (user_id, business_id)
)''')

cursor.execute('''CREATE INDEX IF NOT EXISTS idx_user_item ON user_item_index(user_id, business_id)''')

cursor.execute('''CREATE TABLE IF NOT EXISTS item_item_similarity (
    item_id TEXT PRIMARY KEY,
    similarity_vector BLOB
)''')

cursor.execute('''CREATE INDEX IF NOT EXISTS idx_item_similarity ON item_item_similarity(item_id)''')

# cursor.execute('''CREATE TABLE IF NOT EXISTS user_mapping (
#     user_id TEXT PRIMARY KEY,
#     user_idx INTEGER
# )''')

cursor.execute('''CREATE TABLE IF NOT EXISTS business_mapping (
    business_id TEXT PRIMARY KEY,
    business_idx INTEGER
)''')


# Commit the changes
conn.commit()

In [25]:
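# `user_business` must be a DataFrame with columns user_id, business_id, stars_review; it is not created in the cells shown above
insert_user_item(user_business, conn)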
insert_item_vectors(item_similarity_sparse, business_mapping, conn)
insert_mappings(business_mapping, conn)

Inserted 50000 / 734418 user-item records.
Inserted 300000 / 734418 user-item records.
Inserted 550000 / 734418 user-item records.
Total 734418 user-item records inserted.
Inserted 50000 item vectors...
Total 78059 item vectors inserted.
Inserted 50000 / 78059 business mappings.
Total 78059 business mappings inserted.


In [26]:
# Close the connection when done
conn.close()
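
For the retrieval side, a stored similarity vector can be read back by unpickling the BLOB and translating the column indices through `business_mapping`. The sketch below is an assumption about how such a lookup might look, not the contents of `ItemCF.py`: the function `get_similar_items` is hypothetical, and it runs against a tiny in-memory database that mirrors the two table schemas used in this notebook. Because `pickle.loads` can execute arbitrary code, it should only be applied to database files you produced yourself.

```python
import pickle
import sqlite3

import numpy as np


def get_similar_items(conn, item_id, k=5):
    """Return up to k (business_id, score) pairs for item_id, best first."""
    row = conn.execute(
        "SELECT similarity_vector FROM item_item_similarity WHERE item_id = ?",
        (item_id,),
    ).fetchone()
    if row is None:
        return []
    indices, scores = pickle.loads(row[0])
    order = np.argsort(-scores)[:k]  # positions of the k highest scores
    results = []
    for pos in order:
        match = conn.execute(
            "SELECT business_id FROM business_mapping WHERE business_idx = ?",
            (int(indices[pos]),),
        ).fetchone()
        if match is not None:
            results.append((match[0], float(scores[pos])))
    return results


# Toy in-memory database with the same table layout as yelp_ItemCF.db.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE item_item_similarity (item_id TEXT PRIMARY KEY, similarity_vector BLOB)"
)
demo_conn.execute(
    "CREATE TABLE business_mapping (business_id TEXT PRIMARY KEY, business_idx INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO business_mapping VALUES (?, ?)",
    [("biz_a", 0), ("biz_b", 1), ("biz_c", 2)],
)
blob = pickle.dumps((np.array([1, 2]), np.array([0.4, 0.9])))
demo_conn.execute("INSERT INTO item_item_similarity VALUES (?, ?)", ("biz_a", blob))

similar = get_similar_items(demo_conn, "biz_a", k=2)
missing = get_similar_items(demo_conn, "unknown", k=2)
print(similar)
demo_conn.close()
```

On the real database, the lookup by `business_idx` would benefit from an index on that column, since `business_mapping` is keyed by `business_id`.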

In [None]:
print("Database operations completed successfully.")