### Item-based Collaborative Filtering - Model & Index
This notebook demonstrates how to build an item-based collaborative filtering model using the Yelp dataset. You can add features or tune the hyperparameters to improve model performance. The index is built and stored in the `yelp_ItemCF.db` file.

#### Pre-requisites
1. Have the processed Yelp dataset in the `../../data/processed_data/yelp_data` folder.
2. Have the virtual environment set up and selected as the kernel for this notebook.

#### Move to Production
1. Copy the `yelp_ItemCF.db` file to the `../../data/processed_data` folder.
2. Update the `ItemCF.py` file in the `../backend/models` folder if there are changes in the retrieval process.


#### Adding Clustering to the Pipeline
Compute the number of reviews per user and take the median across users, then:
- for users whose review count is at or above the median, index only their individual interactions
- for users whose review count is below the median, also include interactions from the other users in the same cluster
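
The rule above can be sketched on toy data using only the standard library. The user IDs, business IDs, ratings, and the helper name `augment_sparse_users` are hypothetical and chosen for illustration; the notebook applies the same idea with pandas further down, and details such as the handling of users without a cluster may differ there.

```python
from statistics import median

# Hypothetical toy data: user -> {business: rating}
interactions = {
    "u1": {"b1": 5.0, "b2": 4.0, "b3": 3.0},
    "u2": {"b1": 4.0},
    "u3": {"b4": 2.0, "b5": 5.0},
}
# Hypothetical cluster assignment and per-cluster average ratings
user_cluster = {"u1": 0, "u2": 0, "u3": 1}
cluster_avg = {0: {"b1": 4.5, "b2": 4.0, "b3": 3.0}, 1: {"b4": 2.0, "b5": 5.0}}


def augment_sparse_users(interactions, user_cluster, cluster_avg):
    """Return a copy in which below-median users also get their cluster's items."""
    median_count = median(len(items) for items in interactions.values())
    augmented = {}
    for user, items in interactions.items():
        rows = dict(items)
        if len(items) < median_count:
            # Fill in cluster businesses the user has not reviewed,
            # using the cluster's average rating as the value.
            for business, avg in cluster_avg.get(user_cluster.get(user), {}).items():
                rows.setdefault(business, avg)
        augmented[user] = rows
    return augmented


augmented = augment_sparse_users(interactions, user_cluster, cluster_avg)
print(augmented["u2"])
```

Here the review counts are 3, 1 and 2, so the median is 2 and only `u2` is augmented; the user's own rating for `b1` is kept rather than replaced by the cluster average.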

In [1]:
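# Import the shared helpers from ../utilities.py; load_data_from_db and
# get_user_business are not defined in this notebook, so they presumably live there
import sys
sys.path.append('../')
from utilities import *

# Explicit imports for names used below, so the notebook does not rely on the star import
import pickle
import sqlite3
import time

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sparse_dot_topn import sp_matmul_topn
from sklearn.model_selection import train_test_split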

In [2]:
# Define the database folder path and file names
db_folder = '../../data/processed_data/yelp_data/'
data_files = ['business', 'categories', 'review']

# Load data into a dictionary
yelp_data = load_data_from_db(db_folder, data_files)

# Check loaded data
for table, df in yelp_data.items():
    print(f"Loaded {len(df)} rows from {table} table.")
    
df_review = yelp_data["review"]
df_business = yelp_data["business"]


Loaded 78059 rows from business table.
Loaded 360656 rows from categories table.
Loaded 980418 rows from review table.


In [3]:
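# Decay rate per second; earlier experiments are kept for reference
# LAMBDA = 0.0000000005
# LAMBDA = 0.00000000005
LAMBDA = 0.0000000001
current_timestamp = int(time.time())

# Convert the review date to Unix seconds (int64 avoids overflow where int is 32-bit)
df_review['timestamp'] = pd.to_datetime(df_review['date']).astype('int64') // 10**9
# Overwrite 'timestamp' with the decay weight exp(-LAMBDA * age); values near 1 are the most recent
df_review['timestamp'] = np.exp(-LAMBDA * (current_timestamp - df_review["timestamp"]))
# Scale the star rating by the decay weight
df_review['stars'] = df_review['timestamp'] * df_review['stars']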
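
To get a feel for how gentle this decay is, the following sketch evaluates the same formula, `exp(-LAMBDA * age)`, for a couple of ages. The helper name `decay_weight` and the 365.25-day year are assumptions made for illustration; only `LAMBDA` is taken from the cell above.

```python
import math

LAMBDA = 1e-10  # same decay rate as the cell above, per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: average year length


def decay_weight(age_seconds, lam=LAMBDA):
    """Exponential time-decay weight in (0, 1]; 1.0 means the review was just written."""
    return math.exp(-lam * age_seconds)


# Age at which the weight drops to 0.5
half_life_years = math.log(2) / LAMBDA / SECONDS_PER_YEAR
ten_year_weight = decay_weight(10 * SECONDS_PER_YEAR)

print(f"half-life: {half_life_years:.0f} years")
print(f"weight of a 10-year-old review: {ten_year_weight:.3f}")
```

With this `LAMBDA` the half-life is roughly 220 years and a ten-year-old review keeps about 97% of its rating, so the decay acts as a mild tie-breaker in favor of recent reviews rather than a strong filter.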

In [4]:
# per user
review_counts = df_review.groupby('user_id').size()
median_review_count = np.median(review_counts)
# Upper percentiles of the per-user review count (the 100th percentile is the maximum)
q75 = np.percentile(review_counts, 75)
q90 = np.percentile(review_counts, 90)
q95 = np.percentile(review_counts, 95)
q100 = np.percentile(review_counts, 100)
print(f"Median number of reviews per user: {median_review_count}")
print(f"75th percentile of reviews per user: {q75}")
print(f"90th percentile of reviews per user: {q90}")
print(f"95th percentile of reviews per user: {q95}")
print(f"100th percentile of reviews per user: {q100}")

Median number of reviews per user: 2.0
75th percentile of reviews per user: 5.0
90th percentile of reviews per user: 13.0
95th percentile of reviews per user: 23.0
100th percentile of reviews per user: 953.0


In [5]:
df_cluster = pd.read_excel("../data_processing/clustered_users.xlsx")

In [6]:
df_cluster['cluster'].describe()

count    99812.000000
mean      1207.753897
std        670.185910
min          0.000000
25%        666.000000
50%       1156.000000
75%       1733.000000
max       2502.000000
Name: cluster, dtype: float64

In [7]:
user_cluster = df_cluster.set_index('user_id')['cluster'].to_dict()
df_review['cluster'] = df_review['user_id'].map(user_cluster)
# cluster_items = df_review.groupby('cluster')['business_id'].apply(set).to_dict()

In [8]:
# Calculate the number of reviews per cluster
cluster_review_counts = df_review.groupby('cluster').size()

# Dictionary to store the timestamp threshold for recent reviews in each cluster
cluster_recent_thresholds = {}

# Determine the threshold timestamp for each cluster
for cluster, count in cluster_review_counts.items():
    # Sort reviews by timestamp descending (higher values = more recent)
    cluster_reviews = df_review[df_review['cluster'] == cluster].sort_values(by='timestamp', ascending=False)
    # Position that marks roughly the most recent one-fifth of the cluster's reviews
    threshold_index = int(count / 5)
    # Weight at that position; for clusters with fewer than 5 reviews the index is 0,
    # so only the most recent review passes the filter below
    threshold_timestamp = cluster_reviews.iloc[min(threshold_index, len(cluster_reviews) - 1)]['timestamp']
    cluster_recent_thresholds[cluster] = threshold_timestamp

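# Filter df_review to keep only recent reviews; reviews whose user has no
# cluster fall back to a threshold of 0 and are therefore always kept
recent_thresholds = df_review['cluster'].map(cluster_recent_thresholds).fillna(0)
df_recent_reviews = df_review[df_review['timestamp'] >= recent_thresholds]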
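
The per-cluster threshold logic can be illustrated without pandas. The sketch below uses hypothetical decay weights and a hypothetical helper `recent_threshold`; it mirrors the `count / 5` indexing used in the cell above but is not the notebook's implementation.

```python
def recent_threshold(weights, fraction=5):
    """Return the weight found at position len // fraction in descending order."""
    ordered = sorted(weights, reverse=True)
    return ordered[len(ordered) // fraction]


# Hypothetical decay weights for one cluster (higher = more recent)
cluster_weights = [0.91, 0.99, 0.95, 0.97, 0.93, 0.98, 0.92, 0.96, 0.94, 0.90]
threshold = recent_threshold(cluster_weights)
kept = [w for w in cluster_weights if w >= threshold]

# A cluster with fewer than five reviews gets index 0,
# so only its most recent review survives the filter
small_cluster = [0.95, 0.97, 0.93]
small_kept = [w for w in small_cluster if w >= recent_threshold(small_cluster)]

print(threshold, kept, small_kept)
```

Because the comparison is `>=`, a cluster of ten reviews keeps three of them (positions 0 to 2), slightly more than one-fifth, and ties at the threshold are kept as well.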

In [9]:
# Work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
df_recent_reviews = df_recent_reviews.copy()
df_recent_reviews['cluster'] = df_recent_reviews['user_id'].map(user_cluster)
# cluster_items = df_recent_reviews.groupby('cluster')[['business_id', 'stars']].apply(
#     lambda x: x.reset_index(drop=True)
# ).to_dict()


In [10]:
cluster_items = df_recent_reviews.groupby('cluster')['business_id'].apply(set).to_dict()

In [11]:
cluster_items

{0.0: {'0aktKA1TIWpjXSz0_0qOkg',
  '2xxkaRy7rP5EUyjFt2J5kA',
  'GehOuYICXSNZlxkC44R9jQ',
  'Ht40JGyv37VVrEGJp2s3mg',
  'IqyHN1g7PO179qtqDf01tw',
  'RdsPXJj5ta_ed12At_sR3w',
  'Z1ntn-P9HzXQ-6QzU8PO9A',
  'qswhfcTS2jZozSzswsNSrA'},
 1.0: {'2qVZiMB22YPC2HCM62Gnbw',
  'BcxCBbj7MeLBsyPzuHgklw',
  'EtbukcoR6QW6_5fEeFcVoQ',
  'KSYONgGtrK0nKXfroB-bwg',
  'MCQJc2X_NE6vbfwJdRZl4w',
  'NvMYRJ_wPJmt4FFsEF7O0A',
  'mE16tq2q9kIAeI1wnPkKXQ'},
 2.0: {'R3dw5YDLYSNpUn_u994rhQ', 'heUFH7q9_syl4a5QSqy8rA'},
 3.0: {'0aJXMRNfbKqiU99t-iOk8g',
  '50mfwRz6WA20SS3E080uTg',
  '6eXzDaL0NA0RvASXpvYt8Q',
  'A3VrcPqMLUfcmTbkDXTSaQ',
  'CS3dJ8m5rVd6u6bkqyt0lQ',
  'DO_op-iq4qoaOBEhJH6EdQ',
  'IJ5N0TNgDDFP0UauRmepDg',
  'JfThFucRnJdJxQtS6gHfWw',
  'KaTElVme7lVLlULON3EJTw',
  'Q7aDbce5rFbH1Cgcyry9zQ',
  'TSnzP2pDd403xVXSNXuFFA',
  'X8P-rVqwMB-QI-Atjpe76w',
  'bISwIzk6cMQtTf3QSlXi5Q',
  'bxyT4iNcdnfHvPyc6NTBIw',
  'e4InIycH2PAJWccBBj0tAA',
  'gf0JIIduGDyy6V0vhPjCtA',
  'hlq5APjsJb7mwmDql0tIGw',
  'jLTNVE_86BXEjYO97JTCWA',

In [12]:
# Compute average stars per business within each cluster
cluster_business_avg = df_recent_reviews.groupby(['cluster', 'business_id'])['stars'].mean().reset_index()

# Convert to a dictionary for quick lookup: (cluster, business_id) -> avg_stars
cluster_business_avg_dict = cluster_business_avg.set_index(['cluster', 'business_id'])['stars'].to_dict()

In [13]:
user_mapping, business_mapping, user_business = get_user_business(df_business, df_review)

In [14]:
user_business

Unnamed: 0,user_id,business_id,stars_review
0,razUB7ciYZluvxWM6shmtw,--30_8IhuyMHbSOcNWd6DQ,4.803491
1,3YhG4h4Ok654iVfqdmkuRg,--7PUidqRWpRSpXebiyxTg,1.940845
2,VyC2fG4dcMG07nrxh4jLnw,--7PUidqRWpRSpXebiyxTg,0.976479
3,Q5jOFJYhIsN8ouJ1rnsLQQ,--7PUidqRWpRSpXebiyxTg,0.966943
4,gdcRlubKDmslUYFPHUp1Cg,--8IbOsAAxjKRoYsBFL-PA,1.939110
...,...,...,...
985727,TkwnhxZfy7AFW1cEIn5u1A,zznJox6-nmXlGYNWgTDwQQ,3.851865
985728,weuxfeOxeGs8InkBS1ivbQ,zznJox6-nmXlGYNWgTDwQQ,2.949528
985729,Gix3hMYtxiiQd4Pg626GfQ,zznJox6-nmXlGYNWgTDwQQ,0.978447
985730,rB1vREB0x_uynI0ADMs2iA,zztOG2cKm87I6Iw_tleZsQ,4.939903


In [15]:
train_data, test_data = train_test_split(user_business, test_size=0.2, random_state=42)

user_business = train_data.copy()

In [16]:
for user_id, group in user_business.groupby('user_id'):
    print(group)
    break

                       user_id             business_id  stars_review
701644  ---UgP94gokyCDuB5zUssA  hKr-RKMVpj3gRkSWcjg3Zw      2.903193
785599  ---UgP94gokyCDuB5zUssA  mtrXz0nBaMO-ijQewaLG6A      3.897722
376009  ---UgP94gokyCDuB5zUssA  NAMen7YzwlYDs_5ECMnuYQ      4.909942
779732  ---UgP94gokyCDuB5zUssA  mV1UTSvEm-mhaPGFiIGhhQ      0.970888


In [None]:
cluster_id = user_cluster.get('---UgP94gokyCDuB5zUssA')
for business_id in cluster_items.get(cluster_id):
    print(business_id)
    print(cluster_business_avg_dict.get((cluster_id, business_id)))
    break

FXsu9JqnLl2D0wqeQP4QKw
4.908063605566664


In [17]:
# Collect augmented interactions as plain tuples and build the DataFrame once;
# calling pd.concat inside the loop copies the whole frame for every row
augmented_rows = []
# Process each user
for user_id, group in user_business.groupby('user_id'):
    user_review_count = review_counts.get(user_id, 0)  # Default to 0 if user not in df_review
    original_rows = list(zip(group['user_id'], group['business_id'], group['stars_review']))
    if user_review_count >= median_review_count:
        # Users with sufficient reviews: keep original interactions only
        augmented_rows.extend(original_rows)
        continue
    # Users with fewer reviews: check cluster and augment
    user_cluster_id = user_cluster.get(user_id)
    if user_cluster_id is None:
        # As in the original loop, below-median users without a cluster contribute no rows
        continue
    # Businesses the user has already interacted with
    user_interacted_businesses = set(group['business_id'])
    # Add cluster businesses not yet reviewed, with the cluster-specific average rating
    for business_id in cluster_items.get(user_cluster_id, ()):
        if business_id not in user_interacted_businesses:
            avg_rating = cluster_business_avg_dict.get((user_cluster_id, business_id))
            if avg_rating is not None:
                augmented_rows.append((user_id, business_id, avg_rating))
    # Keep the user's original interactions
    augmented_rows.extend(original_rows)

df_augmented_interactions = pd.DataFrame(
    augmented_rows, columns=['user_id', 'business_id', 'stars_review']
)


In [18]:
user_business = df_augmented_interactions.copy()

In [19]:
user_business

Unnamed: 0,user_id,business_id,stars_review
0,---UgP94gokyCDuB5zUssA,hKr-RKMVpj3gRkSWcjg3Zw,2.903193
1,---UgP94gokyCDuB5zUssA,mtrXz0nBaMO-ijQewaLG6A,3.897722
2,---UgP94gokyCDuB5zUssA,NAMen7YzwlYDs_5ECMnuYQ,4.909942
3,---UgP94gokyCDuB5zUssA,mV1UTSvEm-mhaPGFiIGhhQ,0.970888
4,--17Db1K-KujRuN7hY9Z0Q,3fpAmsSuEFNF29UUPpgwlw,4.883035
...,...,...,...
734413,zztkCqqgR6VntYbqio4UTQ,dG1pPWaVoIHWMIChKjfolg,4.895180
734414,zztkCqqgR6VntYbqio4UTQ,BmKCJsV_payJ5ANqC7i85g,4.888912
734415,zztkCqqgR6VntYbqio4UTQ,yGKeP2m3RefZY-kFqjTNGQ,3.909339
734416,zztkCqqgR6VntYbqio4UTQ,LGYIhGqbYakMMdsn_GCzJg,4.899462


In [20]:
# Map user_id and business_id to numerical indices
user_business['user_idx'] = user_business['user_id'].map(user_mapping)
user_business['business_idx'] = user_business['business_id'].map(business_mapping)

# Create the sparse user-item interaction matrix from the time-decayed stars_review values
user_item_sparse = csr_matrix(
    (user_business['stars_review'], (user_business['user_idx'], user_business['business_idx'])),
    shape=(len(user_mapping), len(business_mapping))
)

# Replace NaN values in the sparse matrix
user_item_sparse.data = np.nan_to_num(user_item_sparse.data)

In [21]:
def sparse_cosine_similarity_topn(A, top_n, threshold=0):
    # A is the sparse user-item matrix (users x items)
    # top_n is the number of most similar items to keep per item
    # threshold is the minimum similarity score to keep
    #
    # A.T @ A yields item x item scores. These are raw dot products; they equal
    # cosine similarities only if the item columns of A are L2-normalized first.
    C = sp_matmul_topn(A.T.tocsr(), A, top_n=top_n, threshold=threshold, n_threads=4, sort=True)

    return C


In [22]:
# Compute item similarity
item_similarity_sparse = sparse_cosine_similarity_topn(user_item_sparse, top_n=50, threshold=0.01)
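
For reference, cosine similarity between two items divides the dot product of their rating vectors by the product of the vector norms. The sketch below spells this out in plain Python on hypothetical toy ratings; the helper name `item_cosine_topn` is an assumption, it skips self-pairs, and it is a dense illustration rather than a replacement for the sparse `sp_matmul_topn` call.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "u1": {"a": 5.0, "b": 4.0},
    "u2": {"a": 4.0, "b": 5.0, "c": 1.0},
    "u3": {"c": 5.0},
}


def item_cosine_topn(ratings, top_n=2, threshold=0.0):
    """Return item -> [(other_item, cosine)], best first, at most top_n entries."""
    # Re-index by item so each item has a sparse vector over users
    item_vectors = {}
    for user, items in ratings.items():
        for item, value in items.items():
            item_vectors.setdefault(item, {})[user] = value
    norms = {
        item: math.sqrt(sum(v * v for v in vec.values()))
        for item, vec in item_vectors.items()
    }

    result = {}
    for i, vec_i in item_vectors.items():
        scores = []
        for j, vec_j in item_vectors.items():
            if i == j:
                continue  # skip self-similarity
            # Only users who rated both items contribute to the dot product
            dot = sum(v * vec_j[u] for u, v in vec_i.items() if u in vec_j)
            cosine = dot / (norms[i] * norms[j])
            if cosine > threshold:
                scores.append((j, cosine))
        result[i] = sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]
    return result


similar = item_cosine_topn(ratings)
print(similar["a"])
```

Items `a` and `b` are rated by the same two users with similar scores, so they come out as each other's nearest neighbor, while `c` shares only one user with them and scores much lower.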

In [23]:
def optimize_db(conn):
    """Apply SQLite performance optimizations."""
    cursor = conn.cursor()
    cursor.executescript('''
        PRAGMA synchronous = OFF;
        PRAGMA journal_mode = MEMORY;
        PRAGMA temp_store = MEMORY;
        PRAGMA cache_size = 1000000;
    ''')
    conn.commit()


def insert_user_item(user_business, conn, batch_size=50000):
    """Optimized batch insert for user-item interactions."""
    cursor = conn.cursor()
    cursor.execute('BEGIN TRANSACTION')

    total_records = len(user_business)
    data = user_business[['user_id', 'business_id', 'stars_review']].values.tolist()

    for i in range(0, total_records, batch_size):
        batch = data[i:i + batch_size]
        cursor.executemany('''INSERT OR IGNORE INTO user_item_index (user_id, business_id, stars_review)
                              VALUES (?, ?, ?)''', batch)

        if i % (batch_size * 5) == 0:  # Commit every 5 batches
            conn.commit()
            print(f"Inserted {i + len(batch)} / {total_records} user-item records.")

    conn.commit()  # Final commit
    print(f"Total {total_records} user-item records inserted.")


def insert_item_vectors(item_similarity_sparse, business_mapping, conn, batch_size=5000, progress_interval=50000):
    """Optimized batch insert for item similarity vectors."""
    cursor = conn.cursor()
    cursor.execute('BEGIN TRANSACTION')

    total_inserted = 0
    batch = []
    # Position i in this list is assumed to be the business with business_idx i
    # (relies on business_mapping preserving index order)
    business_keys = list(business_mapping.keys())

    for row_idx in range(item_similarity_sparse.shape[0]):
        row_vector = item_similarity_sparse.getrow(row_idx)
        row_indices = row_vector.indices
        row_data = row_vector.data

        serialized_row = pickle.dumps((row_indices, row_data))
        item_id = business_keys[row_idx]  # Faster lookup

        batch.append((item_id, serialized_row))

        if len(batch) >= batch_size:
            cursor.executemany('''INSERT OR REPLACE INTO item_item_similarity (item_id, similarity_vector)
                                  VALUES (?, ?)''', batch)
            total_inserted += len(batch)

            if total_inserted % progress_interval == 0:
                print(f"Inserted {total_inserted} item vectors...")

            batch = []

    if batch:  # Insert remaining records
        cursor.executemany('''INSERT OR REPLACE INTO item_item_similarity (item_id, similarity_vector)
                              VALUES (?, ?)''', batch)
        total_inserted += len(batch)

    conn.commit()
    print(f"Total {total_inserted} item vectors inserted.")


def insert_mappings(business_mapping, conn, batch_size=50000):
    """Optimized batch insert for business mappings."""
    cursor = conn.cursor()
    cursor.execute('BEGIN TRANSACTION')

    data = list(business_mapping.items())
    total_records = len(data)

    for i in range(0, total_records, batch_size):
        batch = data[i:i + batch_size]
        cursor.executemany('''INSERT OR REPLACE INTO business_mapping (business_id, business_idx)
                              VALUES (?, ?)''', batch)

        if i % (batch_size * 5) == 0:  # Commit every 5 batches
            conn.commit()
            print(f"Inserted {i + len(batch)} / {total_records} business mappings.")

    conn.commit()
    print(f"Total {total_records} business mappings inserted.")
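
The similarity vectors are stored as pickled `(indices, data)` pairs, so the retrieval side has to unpickle them again. The following round-trip sketch uses an in-memory database; the item ID, the neighbor values, and the helper name `load_similarity_row` are hypothetical, and plain lists stand in for the NumPy arrays the notebook stores.

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item_item_similarity (item_id TEXT PRIMARY KEY, similarity_vector BLOB)"
)

# Hypothetical row: neighbor indices and their similarity scores
neighbor_indices = [3, 7, 42]
neighbor_scores = [0.91, 0.55, 0.12]
blob = pickle.dumps((neighbor_indices, neighbor_scores))
conn.execute(
    "INSERT OR REPLACE INTO item_item_similarity (item_id, similarity_vector) VALUES (?, ?)",
    ("business_a", blob),
)
conn.commit()


def load_similarity_row(conn, item_id):
    """Return (indices, scores) for item_id, or None if the item is not indexed."""
    row = conn.execute(
        "SELECT similarity_vector FROM item_item_similarity WHERE item_id = ?",
        (item_id,),
    ).fetchone()
    # Only unpickle blobs written by this pipeline; pickle is unsafe on untrusted data
    return pickle.loads(row[0]) if row else None


indices, scores = load_similarity_row(conn, "business_a")
missing = load_similarity_row(conn, "unknown")
conn.close()

print(indices, scores, missing)
```

The returned indices refer to `business_idx` values, which is why the `business_mapping` table is written alongside the vectors.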


In [24]:
# Connect to SQLite (this will create a file-based database)
db_path = './yelp_ItemCF.db'
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
optimize_db(conn)

# Create tables for user-item and item-item indexes
cursor.execute('''CREATE TABLE IF NOT EXISTS user_item_index (
    user_id TEXT,
    business_id TEXT,
    stars_review REAL,
    PRIMARY KEY (user_id, business_id)
)''')

cursor.execute('''CREATE INDEX IF NOT EXISTS idx_user_item ON user_item_index(user_id, business_id)''')

cursor.execute('''CREATE TABLE IF NOT EXISTS item_item_similarity (
    item_id TEXT PRIMARY KEY,
    similarity_vector BLOB
)''')

cursor.execute('''CREATE INDEX IF NOT EXISTS idx_item_similarity ON item_item_similarity(item_id)''')

# cursor.execute('''CREATE TABLE IF NOT EXISTS user_mapping (
#     user_id TEXT PRIMARY KEY,
#     user_idx INTEGER
# )''')

cursor.execute('''CREATE TABLE IF NOT EXISTS business_mapping (
    business_id TEXT PRIMARY KEY,
    business_idx INTEGER
)''')


# Commit the changes
conn.commit()
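
SQLite backs a `PRIMARY KEY` on text columns with an automatic unique index, so the explicit `idx_user_item` and `idx_item_similarity` indexes created above may be redundant. A quick way to check is to list the indexes on a table that has only the primary key, as in this sketch on an in-memory database (the user ID in the query is a placeholder):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE user_item_index (
        user_id TEXT,
        business_id TEXT,
        stars_review REAL,
        PRIMARY KEY (user_id, business_id)
    )"""
)

# Indexes SQLite created on its own for this table; the name is the second column
index_names = [row[1] for row in conn.execute("PRAGMA index_list('user_item_index')")]

# Query plan for a per-user lookup; the human-readable detail is the last column
plan = [
    row[-1]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT business_id, stars_review FROM user_item_index WHERE user_id = ?",
        ("some_user",),
    )
]
conn.close()

print(index_names)
print(plan)
```

If the automatic index already appears in the plan for the lookups the backend performs, dropping the extra `CREATE INDEX` statements would save space and insert time without changing query behavior.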

In [25]:
insert_user_item(user_business, conn)
insert_item_vectors(item_similarity_sparse, business_mapping, conn)
insert_mappings(business_mapping, conn)

Inserted 50000 / 734418 user-item records.
Inserted 300000 / 734418 user-item records.
Inserted 550000 / 734418 user-item records.
Total 734418 user-item records inserted.
Inserted 50000 item vectors...
Total 78059 item vectors inserted.
Inserted 50000 / 78059 business mappings.
Total 78059 business mappings inserted.


In [26]:
# Close the connection when done
conn.close()

In [None]:
print("Database operations completed successfully.")