# Item-Based Collaborative Filtering for Product Recommendations

This notebook implements an item-based collaborative filtering recommendation system using product category data from user interactions. We'll use the K-Nearest Neighbors algorithm to find similar users and recommend products.

## Import Required Libraries

In [80]:
# Data manipulation and analysis
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Machine learning
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Database and file operations
import duckdb
import os
from pathlib import Path
from typing import List
from pprint import pprint

## Data Loading and Connection Setup

Connect to the DuckDB database and load the event data.

In [81]:
# Define database path
DB_NAME = Path("/home/c-enjalbert/Documents/EPSI/MSPR/bloc_2/amazing/amazing.duckdb")

# Define data folder path
data_folder = Path("../..").resolve() / "data"

# Connect to the database
con = duckdb.connect(str(DB_NAME))

# Check if the connection was successful
print(f"Connected to database at: {DB_NAME}")

Connected to database at: /home/c-enjalbert/Documents/EPSI/MSPR/bloc_2/amazing/amazing.duckdb


In [82]:
# List all tables in the database
tables_info = con.sql("""
    SELECT table_name
    FROM information_schema.tables
    WHERE table_schema = 'main'
    ORDER BY table_name
""").df()

print(f"Found {len(tables_info)} tables in the database:\n")

if len(tables_info) > 0:
    for i, table_name in enumerate(tables_info['table_name']):
        print(f"{i+1}. {table_name}")
else:
    print("No tables found in the database.")

Found 2 tables in the database:

1. all_events
2. loaded_files


## Load User Events Data

We'll load the events data from the database into a DataFrame for analysis.

In [83]:
# Load events data (limit to 20,000 rows for better performance)
print("Loading event data...")
all_events = con.sql("""
    SELECT *
    FROM all_events
    LIMIT 20000
""")

# Display sample of the data
print(all_events.limit(10).df())

Loading event data...
           event_time event_type product_id          category_id  \
0 2019-10-01 00:00:00       view   44600062  2103807459595387724   
1 2019-10-01 00:00:00       view    3900821  2053013552326770905   
2 2019-10-01 00:00:01       view   17200506  2053013559792632471   
3 2019-10-01 00:00:01       view    1307067  2053013558920217191   
4 2019-10-01 00:00:04       view    1004237  2053013555631882655   
5 2019-10-01 00:00:05       view    1480613  2053013561092866779   
6 2019-10-01 00:00:08       view   17300353  2053013553853497655   
7 2019-10-01 00:00:08       view   31500053  2053013558031024687   
8 2019-10-01 00:00:10       view   28719074  2053013565480109009   
9 2019-10-01 00:00:11       view    1004545  2053013555631882655   

                         category_code     brand    price    user_id  \
0                                 None  shiseido    35.79  541312140   
1  appliances.environment.water_heater      aqua    33.20  554748717   
2           f

In [84]:
# Convert query result to DataFrame for easier manipulation
all_events_df = pd.DataFrame(all_events.df())

# Show summary statistics
all_events_df.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max,std
event_time,20000.0,,,,2019-10-01 02:29:27.043549,2019-10-01 00:00:00,2019-10-01 02:28:14.750000,2019-10-01 02:37:18,2019-10-01 02:46:00,2019-10-01 02:54:01,
event_type,20000.0,3.0,view,19548.0,,,,,,,
product_id,20000.0,7178.0,1004856,234.0,,,,,,,
category_id,20000.0,443.0,2053013555631882655,5387.0,,,,,,,
category_code,13412.0,115.0,electronics.smartphone,5387.0,,,,,,,
brand,17185.0,922.0,samsung,2330.0,,,,,,,
price,20000.0,,,,289.381401,0.0,60.23,154.42,359.08,2574.07,368.18694
user_id,20000.0,5012.0,531063605,57.0,,,,,,,
user_session,20000.0,5669.0,f85114d0-6167-4c8b-88f8-0cd1cfc2c10c,55.0,,,,,,,


## Data Preprocessing and Cleaning

We'll prepare the data for collaborative filtering by:
1. Selecting relevant columns
2. Cleaning missing values
3. Converting event types to binary values (purchase = 1, view/cart = 0)

In [85]:
# Select only the columns we need for collaborative filtering
allevents_df_CF = all_events_df[["user_id", "product_id", "category_code", "category_id", "event_type"]]

# Check shape of our filtered dataset
print(f"Shape of filtered dataset: {allevents_df_CF.shape}")

Shape of filtered dataset: (20000, 5)


In [86]:
# Check for missing values in important columns
print(f"Missing values in product_id: {allevents_df_CF['product_id'].isna().sum()}")
print(f"Missing values in category_id: {allevents_df_CF['category_id'].isna().sum()}")
print(f"Missing values in category_code: {allevents_df_CF['category_code'].isna().sum()}")
print(f"Missing values in user_id: {allevents_df_CF['user_id'].isna().sum()}")
print(f"Missing values in event_type: {allevents_df_CF['event_type'].isna().sum()}")

# Remove rows with missing category_code (since we'll use this for recommendations)
allevents_df_CF.dropna(subset=["category_code"], inplace=True)
print(f"\nAfter dropping rows with missing category_code: {allevents_df_CF.shape}")

Missing values in product_id: 0
Missing values in category_id: 0
Missing values in category_code: 6588
Missing values in user_id: 0
Missing values in event_type: 0

After dropping rows with missing category_code: (13412, 5)




A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [87]:
# Map event types to binary values (0 for view/cart, 1 for purchase)
# This helps us focus on items that users actually purchased
all_events_df["event_type"] = all_events_df["event_type"].map({"view": 0, "cart": 0, "purchase": 1})

# Check the distribution of event types after mapping
print("Distribution of event types after mapping:")
print(all_events_df["event_type"].value_counts())

Distribution of event types after mapping:
event_type
0    19742
1      258
Name: count, dtype: int64


## Category Analysis

Let's analyze which categories are most popular in terms of purchases.

In [88]:
# Calculate sum of event_type (purchases) per category_code
category_purchase_counts = all_events_df.groupby('category_code')['event_type'].sum().reset_index()
category_purchase_counts = category_purchase_counts.sort_values('event_type', ascending=False)

# Display the top categories by purchase count
print("Top categories by number of purchases:")
print(category_purchase_counts.head(10))

# Create an interactive bar chart with Plotly showing top 20 categories
fig = px.bar(
    category_purchase_counts.head(20), 
    x='event_type', 
    y='category_code',
    title='Number of Purchases by Category',
    labels={'event_type': 'Number of Purchases', 'category_code': 'Category'},
    width=1000,
    height=600
)
fig.update_layout(yaxis={'categoryorder': 'total ascending'})
fig.show()

Top categories by number of purchases:
                       category_code  event_type
86            electronics.smartphone         123
79       electronics.audio.headphone          12
85                electronics.clocks           7
41  appliances.kitchen.refrigerators           6
24     appliances.environment.vacuum           5
44         appliances.kitchen.washer           4
64                computers.notebook           4
90              electronics.video.tv           4
88             electronics.telephone           4
0                    accessories.bag           3


## Preparing User-Item Matrix for Collaborative Filtering

We'll create a user-item matrix where rows represent users and columns represent product categories. The values in the matrix represent the user's preference for that category (1 for purchased, 0 for not purchased).

In [89]:
# Create a dataframe with user-category interactions
# For each user-category combination, calculate the mean of event_type
# This gives us a value between 0 and 1 representing the user's preference for that category
cf_dataset = all_events_df.copy().groupby(by=['user_id','category_code'], as_index=False).agg({"event_type":"mean"})

# Display the first few rows of the dataset
cf_dataset.head()

Unnamed: 0,user_id,category_code,event_type
0,306441847,appliances.kitchen.oven,0.0
1,434894256,computers.notebook,0.0
2,442188017,computers.peripherals.monitor,0.0
3,443779465,electronics.clocks,0.0
4,444990139,electronics.smartphone,0.0


In [90]:
# Get statistics about our dataset
num_users = len(cf_dataset['user_id'].unique())
num_items = len(cf_dataset['category_code'].unique())
print('Number of unique users in the dataset: {}'.format(num_users))
print('Number of unique categories in the dataset: {}'.format(num_items))
print('Total user-category interactions: {}'.format(len(cf_dataset)))

Number of unique users in the dataset: 3591
Number of unique categories in the dataset: 115
Total user-category interactions: 4150


In [91]:
# Create a pivot table: users as rows, categories as columns
# Values are the event_type (0 for view/cart, 1 for purchase)
user_to_product_df = cf_dataset.pivot(
    index='user_id',
    columns='category_code',
    values='event_type'
).fillna(0)  # Fill missing values with 0 (no interaction)

# Display the first few rows of the pivot table
print(f"Shape of user-item matrix: {user_to_product_df.shape}")
user_to_product_df.head()

Shape of user-item matrix: (3591, 115)


category_code,accessories.bag,accessories.umbrella,accessories.wallet,apparel.costume,apparel.dress,apparel.jacket,apparel.jeans,apparel.scarf,apparel.shirt,apparel.shoes,...,kids.skates,kids.swing,kids.toys,medicine.tools.tonometer,sport.bicycle,sport.ski,sport.snowboard,sport.tennis,sport.trainer,stationery.cartrige
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
306441847,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
434894256,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
442188017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
443779465,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
444990139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Creating User ID Mappings

We'll create mappings between the original user IDs and their position in the matrix (index) to make lookup easier.

In [92]:
# Create mapping dictionaries between position index and original user_id
user_index_to_id_map = {}
user_id_to_index_map = {}

# Populate the mapping dictionaries
for i, user_id in enumerate(user_to_product_df.index):
    user_index_to_id_map[i] = user_id  # Maps position to actual user_id
    user_id_to_index_map[user_id] = i  # Maps user_id to position
    
# Create a DataFrame to store these mappings for easy reference
user_mapping_df = pd.DataFrame({
    'position_index': range(len(user_to_product_df)),
    'user_id': user_to_product_df.index
})

# Display the first few rows of the mapping DataFrame
user_mapping_df.head()

Unnamed: 0,position_index,user_id
0,0,306441847
1,1,434894256
2,2,442188017
3,3,443779465
4,4,444990139


In [93]:
user_mapping_df.values

array([[0, '306441847'],
       [1, '434894256'],
       [2, '442188017'],
       ...,
       [3588, '555467172'],
       [3589, '555467188'],
       [3590, '555467192']], dtype=object)

In [94]:
user_mapping_df.loc[user_mapping_df['user_id'] == '512384348']

Unnamed: 0,position_index,user_id
34,34,512384348


## Creating Sparse Matrix for KNN Model

We'll convert our user-item matrix to a sparse matrix for efficient processing by the KNN algorithm.

In [95]:
# Transform matrix to scipy sparse matrix for efficiency
user_to_product_sparse_df = csr_matrix(user_to_product_df.values)

# Get information about the sparse matrix
print(f"Shape of sparse matrix: {user_to_product_sparse_df.shape}")
print(f"Number of non-zero entries: {user_to_product_sparse_df.nnz}")
print(f"Sparsity: {user_to_product_sparse_df.nnz / (user_to_product_sparse_df.shape[0] * user_to_product_sparse_df.shape[1]):.4f}")

Shape of sparse matrix: (3591, 115)
Number of non-zero entries: 180
Sparsity: 0.0004


## Building the K-Nearest Neighbors Model

We'll use the KNN algorithm with cosine similarity to find users with similar preferences.

In [96]:
# Initialize KNN model with cosine similarity metric
knn_model = NearestNeighbors(metric='cosine', algorithm='brute')

# Fit the model to our sparse matrix
knn_model.fit(user_to_product_sparse_df)
print("KNN model successfully trained!")

KNN model successfully trained!


## Finding Similar Users

Now let's create a function to find users with similar preferences to a given user.

In [97]:
def get_similar_users(user_index,real_user_id, n=5):
    """
    Find the top n users similar to the given user.
    
    Parameters:
    user_index: Position index of the user in the user_to_product_df (not the actual user ID)
    n: Number of similar users to return
    
    Returns:
    indices: Indices of similar users
    distances: Cosine distances to the similar users
    """
    # Get the user's preferences as a vector
    knn_input = np.asarray([user_to_product_df.values[user_index]])
    
    # Find k nearest neighbors (add 1 because the user themselves will be included)
    distances, indices = knn_model.kneighbors(knn_input, n_neighbors=n+1)

    # Print results
    print(f"Top {n} users who are similar to User {user_index} (Real ID: {real_user_id}):")
    print("")
    for i in range(1, len(distances[0])):
        similar_user_index = indices[0][i]
        print("similar_user_index", similar_user_index)
        similar_user_id = user_mapping_df.iloc[similar_user_index]['user_id']
        print(f"{i}. User Index: {similar_user_index}, Real ID: {similar_user_id}, Distance: {distances[0][i]:.4f}")
    
    # Return indices and distances (excluding the user themselves)
    return indices.flatten()[1:], distances.flatten()[1:]

## Category Recommendation Function

Now let's create a function to recommend categories based on similar users' preferences.

In [98]:
def recommend_categories(user_index, n=10, only_new=True):
    """
    Recommend categories to a user based on similar users' preferences.
    
    Parameters:
    user_index: Position index of the user in the user_to_product_df (not the actual user ID)
    n: Maximum number of recommendations to return
    only_new: If True, only recommend categories the user hasn't interacted with
    
    Returns:
    recommended_categories: List of recommended category names
    """
    # Get the real user ID
    real_user_id = user_mapping_df.iloc[user_index]['user_id']
    
    # Get similar users
    similar_user_indices, distances = get_similar_users(user_index, real_user_id, 5)
    
    # Calculate weights for each similar user (inversely proportional to distance)
    weights = 1 / (distances + 0.000001)  # Add small value to avoid division by zero
    weights = weights / np.sum(weights)  # Normalize weights
    
    # Get category preferences of similar users
    similar_users_prefs = user_to_product_df.iloc[similar_user_indices].values
    
    # Reshape weights for matrix multiplication
    weights = weights[:, np.newaxis]
    
    # Calculate weighted preferences
    weighted_prefs = weights * similar_users_prefs
    
    # Sum weighted preferences across all similar users
    category_scores = weighted_prefs.sum(axis=0)
    
    # Get list of all category names
    all_categories = user_to_product_df.columns
    
    # If only_new is True, exclude categories the user has already interacted with
    if only_new:
        # Get user's current preferences - use the user_index to access their row in the matrix
        user_prefs = user_to_product_df.iloc[user_index].values
        
        # Create a mask for categories the user hasn't interacted with
        new_category_mask = user_prefs == 0
        
        # Apply the mask to category scores
        category_scores = category_scores * new_category_mask
    
    # Sort categories by score in descending order
    top_indices = np.argsort(category_scores)[::-1][:n]
    
    # Get the names of the top categories
    recommended_categories = np.array(all_categories)[top_indices]
    
    # Print recommendations
    print(f"\nTop {n} recommended categories for User {user_index} (Real ID: {real_user_id}):")
    for i, category in enumerate(recommended_categories):
        score = category_scores[top_indices[i]]
        if score > 0:  # Only show recommendations with positive scores
            print(f"{i+1}. {category} (Score: {score:.4f})")
    
    return recommended_categories

In [99]:
def analyze_user_purchases_and_compare(user_index, recommended_categories, df=all_events_df):
    """
    Analyze a user's purchase behavior and compare with recommendations
    
    Parameters:
    user_index: The position index of the user
    recommended_categories: List of recommended categories to compare with
    df: DataFrame containing the events data
    
    Returns:
    purchased_categories: Array containing the categories the user has purchased from
    """
    # Get the real user ID
    real_user_id = user_mapping_df.iloc[user_index]['user_id']
    
    # Filter the original dataframe for this user
    user_data = df[df['user_id'] == real_user_id]
    
    # Get only purchase events (event_type = 1)
    user_purchases = user_data[user_data['event_type'] == 1]
    
    # Get unique categories purchased from
    purchased_categories = user_purchases['category_code'].unique()
    
    # Print summary
    print(f"\n----- Analysis for User Index {user_index} (Real ID: {real_user_id}) -----")
    print(f"Total events: {len(user_data)}")
    print(f"Total purchases: {len(user_purchases)}")
    print(f"Number of unique categories purchased from: {len(purchased_categories)}")
    
    print("\nCategories purchased from:")
    for i, category in enumerate(purchased_categories):
        # Count purchases in this category
        category_purchase_count = user_purchases[user_purchases['category_code'] == category].shape[0]
        print(f"{i+1}. {category} ({category_purchase_count} purchases)")
    
    # Compare with what was recommended
    print("\n----- Comparing Purchases vs Recommendations -----")
    print("Recommended categories that the user has already purchased from:")
    common_categories = set(purchased_categories).intersection(set(recommended_categories))
    if common_categories:
        for i, category in enumerate(common_categories):
            print(f"{i+1}. {category}")
    else:
        print("None of the recommended categories have been purchased from before (good!)")
    
    return purchased_categories

In [100]:
count = 0
for index, row in user_to_product_df.iterrows():
    for col in user_to_product_df.columns:

        if user_to_product_df.loc[index, col] != 0:
            print(count)
            print(index)
            print(f"Column: {col}, Value: {user_to_product_df.loc[index, col]}")
    count += 1 

34
512384348
Column: electronics.smartphone, Value: 0.25
43
512416959
Column: appliances.kitchen.refrigerators, Value: 0.2
70
512478069
Column: accessories.bag, Value: 0.10526315789473684
109
512550575
Column: auto.accessories.videoregister, Value: 0.16666666666666666
112
512556357
Column: electronics.smartphone, Value: 0.3333333333333333
125
512584306
Column: appliances.environment.vacuum, Value: 0.2
130
512592316
Column: electronics.smartphone, Value: 0.2222222222222222
141
512607090
Column: electronics.smartphone, Value: 0.2
153
512682014
Column: electronics.smartphone, Value: 0.0625
175
512730342
Column: electronics.smartphone, Value: 0.2857142857142857
176
512736046
Column: electronics.clocks, Value: 0.1
189
512758113
Column: electronics.smartphone, Value: 0.2
201
512778902
Column: stationery.cartrige, Value: 0.125
216
512812487
Column: electronics.smartphone, Value: 0.25
255
512907792
Column: appliances.kitchen.blender, Value: 0.1111111111111111
273
512968794
Column: electronics.

## Testing the Recommendation System

Let's test our recommendation system on a few users.

In [101]:
int(user_mapping_df.iloc[3459]['user_id'])

555465285

In [102]:
# Test the recommendation system on multiple users
test_user_indices = [34]  # Test with different user indices

for test_user_index in test_user_indices:
    print("\n" + "="*80)
    print(f"TESTING USER INDEX: {test_user_index}")
    print("="*80)
    
    # Get the user's current category interactions
    real_user_id = user_mapping_df.iloc[test_user_index]['user_id']
    user_categories = cf_dataset[cf_dataset['user_id'] == real_user_id]['category_code']
    
    print(f"User {test_user_index} (Real ID: {real_user_id}) has interacted with these categories:")
    for i, category in enumerate(user_categories):
        print(f"{i+1}. {category}")
    
    # Get recommendations for the user
    recommended_categories = recommend_categories(test_user_index, n=5, only_new=True)
    
    # Analyze purchases and compare with recommendations
    analyze_user_purchases_and_compare(test_user_index, recommended_categories)


TESTING USER INDEX: 34
User 34 (Real ID: 512384348) has interacted with these categories:
1. electronics.smartphone
Top 5 users who are similar to User 34 (Real ID: 512384348):

similar_user_index 2278
1. User Index: 2278, Real ID: 543624132, Distance: 0.0000
similar_user_index 3450
2. User Index: 3450, Real ID: 555465207, Distance: 0.0000
similar_user_index 153
3. User Index: 153, Real ID: 512682014, Distance: 0.0000
similar_user_index 1955
4. User Index: 1955, Real ID: 536328930, Distance: 0.0000
similar_user_index 3459
5. User Index: 3459, Real ID: 555465285, Distance: 0.0000

Top 5 recommended categories for User 34 (Real ID: 512384348):

----- Analysis for User Index 34 (Real ID: 512384348) -----
Total events: 4
Total purchases: 1
Number of unique categories purchased from: 1

Categories purchased from:
1. electronics.smartphone (1 purchases)

----- Comparing Purchases vs Recommendations -----
Recommended categories that the user has already purchased from:
None of the recommende

## Conclusion

In this notebook, we've implemented an item-based collaborative filtering recommendation system that suggests product categories to users based on the preferences of similar users. The system works as follows:

1. We created a user-item matrix where rows represent users, columns represent product categories, and values represent user preferences.
2. We used the K-Nearest Neighbors algorithm with cosine similarity to find users with similar preferences.
3. We weighted the preferences of similar users to generate personalized recommendations.
4. We filtered out categories the user has already interacted with to provide fresh recommendations.

**Key Findings:**

- The system successfully identifies similar users based on their category preferences.
- It can generate personalized recommendations that account for the user's existing preferences.
- The recommendations are weighted by the similarity between users, giving more importance to users with very similar preferences.

**Limitations and Potential Improvements:**

1. We could incorporate the specific product IDs in addition to categories for more fine-grained recommendations.
2. We could incorporate temporal information to account for changing user preferences over time.
3. We could combine this approach with content-based filtering for a hybrid recommendation system.
4. More data would help improve the quality of recommendations.

The collaborative filtering approach demonstrates strong potential for recommending new categories to users based on similar user preferences.

In [103]:
def filtered_category_recommendations(user_index, n=5, threshold=0.0, num_similar_users=5, df=all_events_df):
    """
    Recommend categories to a user based on similar users' preferences,
    filtered to only show categories the user hasn't purchased from before.
    
    Parameters:
    user_index: Position index of the user in the user_to_product_df
    n: Maximum number of recommendations to return
    threshold: Minimum score for a category to be recommended (0.0 to include all positive scores)
    num_similar_users: Number of similar users to consider for recommendations
    df: DataFrame containing the events data
    
    Returns:
    final_category_list: List of recommended categories the user hasn't purchased from
    """
    # Get the real user ID
    real_user_id = user_mapping_df.iloc[user_index]['user_id']
    
    # Get similar users
    similar_user_indices, distances = get_similar_users(user_index, real_user_id, num_similar_users)
    
    # Calculate weights for each similar user (inversely proportional to distance)
    weights = 1 / (distances + 0.000001)  # Add small value to avoid division by zero
    weights = weights / np.sum(weights)  # Normalize weights
    
    # Get category preferences of similar users
    similar_users_prefs = user_to_product_df.iloc[similar_user_indices].values
    
    # Calculate weighted preferences
    weighted_prefs = (weights[:, np.newaxis] * similar_users_prefs).sum(axis=0)
    
    # Get list of all category names
    all_categories = user_to_product_df.columns
    
    # Apply threshold to scores - only consider categories with scores above threshold
    threshold_mask = weighted_prefs > threshold
    
    # Apply threshold filter
    if sum(threshold_mask) > 0:
        filtered_scores = weighted_prefs * threshold_mask
    else:
        # If no categories meet the threshold, keep all scores
        print(f"Warning: No categories met the threshold of {threshold}. Using all scores.")
        filtered_scores = weighted_prefs
    
    # Sort categories by score in descending order
    sortd_index = np.argsort(filtered_scores)[::-1]
    
    # Remove categories with zero scores
    try:
        first_zero_index = np.where(filtered_scores[sortd_index] == 0)[0][0]
        sortd_index = sortd_index[:first_zero_index]
    except IndexError:
        # No zero values found, use all indices
        pass
    
    # Get categories user has already purchased from
    user_data = df[df['user_id'] == real_user_id]
    user_purchases = user_data[user_data['event_type'] == 1]
    purchased_categories = set(user_purchases['category_code'].unique())
    
    # Filter recommendations to only show categories not purchased before
    filtered_category_list = list(all_categories[sortd_index])
    count = 0
    final_category_list = []
    
    for category in filtered_category_list:
        if category not in purchased_categories:
            count += 1
            score = filtered_scores[list(all_categories).index(category)]
            final_category_list.append((category, score))
        if count == n:
            break
    
    # Print results
    print(f"\n----- Filtered Recommendations for User {user_index} (Real ID: {real_user_id}) -----")
    print(f"Total events: {len(user_data)}")
    print(f"Total purchases: {len(user_purchases)}")
    print(f"Number of unique categories purchased from: {len(purchased_categories)}")
    print(f"Using threshold: {threshold} and considering {num_similar_users} similar users")
    
    if count == 0:
        print("\nThere are no categories left which haven't been purchased by this user but have been purchased by similar users.")
        print("Try decreasing the threshold, increasing the number of similar users, or including already purchased categories.")
    else:
        print(f"\nTop {count} recommended categories (not previously purchased):")
        for i, (category, score) in enumerate(final_category_list):
            print(f"{i+1}. {category} (Score: {score:.4f})")
    
    return [category for category, _ in final_category_list]

In [104]:
# Test with different thresholds
test_user_index = 34

print("\nTesting with high threshold (0.5):")
high_recommendations = filtered_category_recommendations(test_user_index, n=5, threshold=0.5)


Testing with high threshold (0.5):
Top 5 users who are similar to User 34 (Real ID: 512384348):

similar_user_index 2278
1. User Index: 2278, Real ID: 543624132, Distance: 0.0000
similar_user_index 3450
2. User Index: 3450, Real ID: 555465207, Distance: 0.0000
similar_user_index 153
3. User Index: 153, Real ID: 512682014, Distance: 0.0000
similar_user_index 1955
4. User Index: 1955, Real ID: 536328930, Distance: 0.0000
similar_user_index 3459
5. User Index: 3459, Real ID: 555465285, Distance: 0.0000

----- Filtered Recommendations for User 34 (Real ID: 512384348) -----
Total events: 4
Total purchases: 1
Number of unique categories purchased from: 1
Using threshold: 0.5 and considering 5 similar users

There are no categories left which haven't been purchased by this user but have been purchased by similar users.
Try decreasing the threshold, increasing the number of similar users, or including already purchased categories.


In [105]:
def category_recommender_system(user_index, n_similar_users=5, n_categories=10):
    """
    Complete recommender system for categories based on collaborative filtering.
    
    Parameters:
    user_index: Position index of the user in the user_to_product_df
    n_similar_users: Number of similar users to consider
    n_categories: Number of categories to recommend
    
    Returns:
    final_category_list: List of recommended categories
    """
    # Get the real user ID
    real_user_id = user_mapping_df.iloc[user_index]['user_id']
    
    # Print categories the user has already interacted with
    print("Categories seen by the User:")
    user_categories = cf_dataset[cf_dataset['user_id'] == real_user_id]['category_code']
    pprint(list(user_categories))
    print("")
    
    # Define local function to get similar users
    def get_similar_users(user_idx, n=5):
        # Get the user's preferences as a vector
        knn_input = np.asarray([user_to_product_df.values[user_idx]])
        
        # Find k nearest neighbors (add 1 because the user themselves will be included)
        distances, indices = knn_model.kneighbors(knn_input, n_neighbors=n+1)
        
        # Print results
        print(f"Top {n} users who are very much similar to the User-{user_idx} (Real ID: {real_user_id}) are:")
        print(" ")
        
        for i in range(1, len(distances[0])):
            similar_user_index = indices[0][i]
            similar_user_id = user_mapping_df.iloc[similar_user_index]['user_id']
            print(f"{i}. User: {similar_user_index} (Real ID: {similar_user_id}), separated by distance of {distances[0][i]:.4f}")
        print("")
        
        # Return indices and distances (excluding the user themselves)
        return indices.flatten()[1:], distances.flatten()[1:]
    
    # Define local function for filtered recommendations
    def filtered_category_recommendations(n=10):
        # Find the last zero index in the mean rating list
        try:
            first_zero_index = np.nonzero(mean_rating_list == 0)[0][-1]
            sortd_index = np.argsort(mean_rating_list)[::-1]
            sortd_index = sortd_index[:list(sortd_index).index(first_zero_index)]
        except (IndexError, ValueError):
            # Handle case where there are no zero values
            sortd_index = np.argsort(mean_rating_list)[::-1]
        
        # Limit to requested number of recommendations
        n = min(len(sortd_index), n)
        
        # Get categories the user has already interacted with
        categories_seen = list(cf_dataset[cf_dataset['user_id'] == real_user_id]['category_code'])
        
        # Filter out categories the user has already interacted with
        filtered_category_list = list(categories_list[sortd_index])
        count = 0
        final_category_list = []
        
        for category in filtered_category_list:
            if category not in categories_seen:
                count += 1
                score = mean_rating_list[list(categories_list).index(category)]
                final_category_list.append((category, score))
            if count == n:
                break
        
        # Print results
        if count == 0:
            print("There are no categories left which haven't been seen by this user but have been seen by similar users.")
            print("Try increasing the number of similar users to find more diverse recommendations.")
        else:
            print(f"Top {count} recommended categories:")
            for i, (category, score) in enumerate(final_category_list):
                print(f"{i+1}. {category} (Score: {score:.4f})")
        
        return [category for category, _ in final_category_list]
    
    # Get similar users
    similar_user_list, distance_list = get_similar_users(user_index, n_similar_users)
    
    # Calculate weights for each similar user (inversely proportional to distance)
    weightage_list = distance_list / np.sum(distance_list)
    
    # Get category preferences of similar users
    cat_prefs_sim_users = user_to_product_df.values[similar_user_list]
    
    # Get list of all category names
    categories_list = user_to_product_df.columns
    
    # Expand weightage list to match the shape of categories
    weightage_list = weightage_list[:, np.newaxis] + np.zeros(len(categories_list))
    
    # Calculate weighted preferences
    new_rating_matrix = weightage_list * cat_prefs_sim_users
    
    # Sum weighted preferences across all similar users
    mean_rating_list = new_rating_matrix.sum(axis=0)
    
    # Generate and print recommendations
    print("")
    print("Categories recommended based on similar users are:")
    print("")
    final_recommendations = filtered_category_recommendations(n_categories)
    
    return final_recommendations

In [106]:
def visualize_category_recommendations_3d(user_index, similar_user_list, weightage_list, categories_list, mean_rating_list):
    """
    Visualize user similarities and recommendations in 3D using PCA.
    
    Parameters:
    user_index: Position index of the user
    similar_user_list: List of similar user indices
    weightage_list: Weights for each similar user
    categories_list: List of all category names
    mean_rating_list: Weighted scores for each category
    """
    import plotly.express as px
    from sklearn.decomposition import PCA
    
    # Get real user ID
    real_user_id = user_mapping_df.iloc[user_index]['user_id']
    
    # Combine user data for PCA
    data_matrix = user_to_product_df.values
    
    # Dimensionality reduction
    pca = PCA(n_components=3)
    reduced_data = pca.fit_transform(data_matrix)
    
    # Create a DataFrame for visualization
    df_vis = pd.DataFrame(reduced_data, columns=["PCA1", "PCA2", "PCA3"])
    df_vis["user_index"] = range(len(user_to_product_df))
    df_vis["user_id"] = user_to_product_df.index
    
    # Add user type information
    df_vis["type"] = "Other Users"
    df_vis.loc[user_index, "type"] = "Input User"  # Highlight the input user
    for sim_user in similar_user_list:
        df_vis.loc[sim_user, "type"] = "Similar Users"
    
    # Create the plot
    fig = px.scatter_3d(
        df_vis,
        x="PCA1",
        y="PCA2",
        z="PCA3",
        color="type",
        symbol="type",
        size=df_vis["type"].apply(lambda x: 10 if x == "Input User" else 5),
        title=f"3D Representation of Users and Recommendations for User {user_index} (ID: {real_user_id})",
        labels={"PCA1": "PCA Component 1", "PCA2": "PCA Component 2", "PCA3": "PCA Component 3"},
        hover_data=["user_id"]
    )
    
    # Annotate the top categories
    top_categories = np.argsort(mean_rating_list)[-5:]  # Top 5 rated categories
    for cat_idx in top_categories:
        if mean_rating_list[cat_idx] > 0:  # Only show positive recommendations
            fig.add_annotation(
                x=reduced_data[user_index, 0],
                y=reduced_data[user_index, 1],
                z=reduced_data[user_index, 2],
                text=f"{categories_list[cat_idx]}",
                showarrow=True,
                arrowhead=2,
                arrowsize=1.5,
                ax=0,
                ay=-40
            )
    
    # Show the figure
    fig.show()

In [107]:
# Test the recommendation system with visualization
test_user_index = 34  # You can change this
n_similar_users = 5
n_categories = 10

# Run the recommender system
final_recommendations = category_recommender_system(test_user_index, n_similar_users, n_categories)

# Get similar users for visualization
real_user_id = user_mapping_df.iloc[test_user_index]['user_id']
similar_user_list, distance_list = get_similar_users(test_user_index, real_user_id, n_similar_users)

# Calculate weights
weightage_list = distance_list / np.sum(distance_list)

# Get category preferences of similar users
cat_prefs_sim_users = user_to_product_df.values[similar_user_list]

# Get list of all category names
categories_list = user_to_product_df.columns

# Expand weightage list to match the shape of categories
expanded_weightage_list = weightage_list[:, np.newaxis] + np.zeros(len(categories_list))

# Calculate weighted preferences
new_rating_matrix = expanded_weightage_list * cat_prefs_sim_users

# Sum weighted preferences across all similar users
mean_rating_list = new_rating_matrix.sum(axis=0)

# Visualize the recommendations
visualize_category_recommendations_3d(test_user_index, similar_user_list, weightage_list, categories_list, mean_rating_list)

Categories seen by the User:
['electronics.smartphone']

Top 5 users who are very much similar to the User-34 (Real ID: 512384348) are:
 
1. User: 2278 (Real ID: 543624132), separated by distance of 0.0000
2. User: 3450 (Real ID: 555465207), separated by distance of 0.0000
3. User: 153 (Real ID: 512682014), separated by distance of 0.0000
4. User: 1955 (Real ID: 536328930), separated by distance of 0.0000
5. User: 3459 (Real ID: 555465285), separated by distance of 0.0000


Categories recommended based on similar users are:

Top 10 recommended categories:
1. stationery.cartrige (Score: nan)
2. appliances.kitchen.blender (Score: nan)
3. appliances.kitchen.coffee_machine (Score: nan)
4. appliances.kitchen.dishwasher (Score: nan)
5. appliances.kitchen.grill (Score: nan)
6. appliances.kitchen.hob (Score: nan)
7. appliances.kitchen.hood (Score: nan)
8. appliances.kitchen.juicer (Score: nan)
9. appliances.kitchen.kettle (Score: nan)
10. appliances.kitchen.meat_grinder (Score: nan)
Top 5 user


invalid value encountered in divide


invalid value encountered in divide



In [108]:
# Close the database connection
con.close()
print("Database connection closed.")

Database connection closed.
