# Item-Based Collaborative Filtering for Product Recommendations

This notebook implements a collaborative filtering recommendation system for product categories. We use the K-Nearest Neighbors algorithm to find similar users and recommend products they might be interested in based on purchases from similar users.

## Import Required Libraries

In [1]:
# Data manipulation and analysis
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.decomposition import PCA

# Machine learning
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Database and other utilities
import duckdb
from pathlib import Path
from pprint import pprint

## Data Loading and Preprocessing

In [2]:
# Define database path
DB_NAME = Path("/home/c-enjalbert/Documents/EPSI/MSPR/bloc_2/amazing/amazing.duckdb")

# Connect to the database
con = duckdb.connect(str(DB_NAME))
print(f"Connected to database at: {DB_NAME}")

# Load events data (limit for better performance)
all_events = con.sql("""
    SELECT *
    FROM all_events
    LIMIT 20000
""")

# Convert query result to DataFrame
all_events_df = pd.DataFrame(all_events.df())
print(f"Loaded {len(all_events_df)} events")

Connected to database at: /home/c-enjalbert/Documents/EPSI/MSPR/bloc_2/amazing/amazing.duckdb
Loaded 20000 events


In [3]:
# Select relevant columns and clean data
allevents_df_CF = all_events_df[["user_id", "product_id", "category_code", "category_id", "event_type"]]

# Remove rows with missing category_code
allevents_df_CF.dropna(subset=["category_code"], inplace=True)
print(f"Clean dataset shape: {allevents_df_CF.shape}")

# Map event types to binary values (0 for view/cart, 1 for purchase)
all_events_df["event_type"] = all_events_df["event_type"].map({"view": 0, "cart": 0, "purchase": 1})
print("Distribution of event types after mapping:")
print(all_events_df["event_type"].value_counts())

Clean dataset shape: (13412, 5)
Distribution of event types after mapping:
event_type
0    19742
1      258
Name: count, dtype: int64


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  allevents_df_CF.dropna(subset=["category_code"], inplace=True)


## Creating User-Item Matrix

In [4]:
# Create a dataframe with user-category interactions
cf_dataset = all_events_df.copy().groupby(by=['user_id','category_code'], as_index=False).agg({"event_type":"mean"})

# Display dataset stats
num_users = len(cf_dataset['user_id'].unique())
num_items = len(cf_dataset['category_code'].unique())
print(f'Number of unique users: {num_users}')
print(f'Number of unique categories: {num_items}')
print(f'Total user-category interactions: {len(cf_dataset)}')

# Create user-item matrix (pivot table)
user_to_product_df = cf_dataset.pivot(
    index='user_id',
    columns='category_code',
    values='event_type'
).fillna(0)  # Fill missing values with 0 (no interaction)

print(f"Shape of user-item matrix: {user_to_product_df.shape}")

Number of unique users: 3591
Number of unique categories: 115
Total user-category interactions: 4150
Shape of user-item matrix: (3591, 115)


In [5]:
# Create mapping between position index and user ID
user_mapping_df = pd.DataFrame({
    'position_index': range(len(user_to_product_df)),
    'user_id': user_to_product_df.index
})

# Create sparse matrix for KNN model
user_to_product_sparse_df = csr_matrix(user_to_product_df.values)
print(f"Sparsity: {user_to_product_sparse_df.nnz / (user_to_product_sparse_df.shape[0] * user_to_product_sparse_df.shape[1]):.4f}")

# Initialize and fit KNN model
knn_model = NearestNeighbors(metric='cosine', algorithm='brute')
knn_model.fit(user_to_product_sparse_df)
print("KNN model successfully trained!")

Sparsity: 0.0004
KNN model successfully trained!


## Helper Functions for Recommendations

In [6]:
def get_similar_users(user_index, real_user_id, n=5):
    """
    Find the top n users similar to the given user.
    
    Parameters:
    user_index: Position index of the user in the user_to_product_df
    real_user_id: The actual user ID for display purposes
    n: Number of similar users to return
    
    Returns:
    indices: Indices of similar users
    distances: Cosine distances to the similar users
    """
    # Get the user's preferences vector and find nearest neighbors
    knn_input = np.asarray([user_to_product_df.values[user_index]])
    distances, indices = knn_model.kneighbors(knn_input, n_neighbors=n+1)
    
    # Print results
    print(f"\nTop {n} users similar to User {user_index} (Real ID: {real_user_id}):")
    for i in range(1, len(distances[0])):
        similar_user_index = indices[0][i]
        similar_user_id = user_mapping_df.iloc[similar_user_index]['user_id']
        print(f"{i}. User Index: {similar_user_index}, Real ID: {similar_user_id}, Distance: {distances[0][i]:.4f}")
    
    # Return indices and distances (excluding the user themselves)
    return indices.flatten()[1:], distances.flatten()[1:]

In [7]:
def filtered_category_recommendations(user_index, n=5, threshold=0.0, num_similar_users=5, df=all_events_df):
    """
    Recommend categories to a user based on similar users' preferences,
    filtered to only show categories the user hasn't purchased from before.
    
    Parameters:
    user_index: Position index of the user
    n: Maximum number of recommendations to return
    threshold: Minimum score for a category to be recommended
    num_similar_users: Number of similar users to consider
    df: DataFrame containing the events data
    
    Returns:
    List of recommended categories
    """
    # Get the real user ID
    real_user_id = user_mapping_df.iloc[user_index]['user_id']
    
    # Get similar users
    similar_user_indices, distances = get_similar_users(user_index, real_user_id, num_similar_users)
    
    # Calculate weights for each similar user (inversely proportional to distance)
    weights = 1 / (distances + 0.000001)  # Add small value to avoid division by zero
    weights = weights / np.sum(weights)  # Normalize weights
    
    # Get category preferences of similar users
    similar_users_prefs = user_to_product_df.iloc[similar_user_indices].values
    
    # Calculate weighted preferences
    weighted_prefs = (weights[:, np.newaxis] * similar_users_prefs).sum(axis=0)
    
    # Get list of all category names
    all_categories = user_to_product_df.columns
    
    # Apply threshold to scores
    threshold_mask = weighted_prefs > threshold
    filtered_scores = weighted_prefs * threshold_mask if sum(threshold_mask) > 0 else weighted_prefs
    
    # Sort categories by score in descending order
    try:
        sortd_index = np.argsort(filtered_scores)[::-1]
        first_zero_index = np.where(filtered_scores[sortd_index] == 0)[0][0]
        sortd_index = sortd_index[:first_zero_index]
    except IndexError:
        # No zero values found, use all indices
        sortd_index = np.argsort(filtered_scores)[::-1]
    
    # Get categories user has already purchased from
    user_data = df[df['user_id'] == real_user_id]
    user_purchases = user_data[user_data['event_type'] == 1]
    purchased_categories = set(user_purchases['category_code'].unique())
    
    # Filter recommendations to exclude already purchased categories
    filtered_category_list = list(all_categories[sortd_index])
    count = 0
    final_category_list = []
    
    for category in filtered_category_list:
        if category not in purchased_categories:
            count += 1
            score = filtered_scores[list(all_categories).index(category)]
            final_category_list.append((category, score))
        if count == n:
            break
    
    # Print results
    print(f"\n----- Recommendations for User {user_index} (Real ID: {real_user_id}) -----")
    print(f"Using threshold: {threshold} and considering {num_similar_users} similar users")
    
    if count == 0:
        print("\nNo new categories to recommend. Try decreasing the threshold.")
    else:
        print(f"\nTop {count} recommended categories (not previously purchased):")
        for i, (category, score) in enumerate(final_category_list):
            print(f"{i+1}. {category} (Score: {score:.4f})")
    
    return [category for category, _ in final_category_list]

In [8]:
def category_recommender_system(user_index, n_similar_users=5, n_categories=10):
    """
    Complete recommender system for categories based on collaborative filtering.
    
    Parameters:
    user_index: Position index of the user
    n_similar_users: Number of similar users to consider
    n_categories: Number of categories to recommend
    
    Returns:
    List of recommended categories
    """
    # Get the real user ID
    real_user_id = user_mapping_df.iloc[user_index]['user_id']
    
    # Print categories the user has already interacted with
    print("Categories seen by the User:")
    user_categories = cf_dataset[cf_dataset['user_id'] == real_user_id]['category_code']
    pprint(list(user_categories))
    
    # Get similar users
    similar_user_indices, distances = get_similar_users(user_index, real_user_id, n_similar_users)
    
    # Calculate weights for similar users (inversely proportional to distance)
    weights = distances / np.sum(distances)
    
    # Get category preferences of similar users
    cat_prefs_sim_users = user_to_product_df.values[similar_user_indices]
    
    # Get all category names
    categories_list = user_to_product_df.columns
    
    # Calculate weighted preferences
    weightage_list = weights[:, np.newaxis] + np.zeros(len(categories_list))
    new_rating_matrix = weightage_list * cat_prefs_sim_users
    mean_rating_list = new_rating_matrix.sum(axis=0)
    
    # Define filtered_category_recommendations nested function
    def filtered_category_recommendations(n=10):
        # Find non-zero scores
        try:
            first_zero_index = np.nonzero(mean_rating_list == 0)[0][-1]
            sortd_index = np.argsort(mean_rating_list)[::-1]
            sortd_index = sortd_index[:list(sortd_index).index(first_zero_index)]
        except (IndexError, ValueError):
            sortd_index = np.argsort(mean_rating_list)[::-1]
        
        # Limit to requested number of recommendations
        n = min(len(sortd_index), n)
        
        # Get categories the user has already interacted with
        categories_seen = list(cf_dataset[cf_dataset['user_id'] == real_user_id]['category_code'])
        
        # Filter out already seen categories
        filtered_category_list = list(categories_list[sortd_index])
        count = 0
        final_category_list = []
        
        for category in filtered_category_list:
            if category not in categories_seen:
                count += 1
                score = mean_rating_list[list(categories_list).index(category)]
                final_category_list.append((category, score))
            if count == n:
                break
        
        # Print results
        if count == 0:
            print("No categories left which haven't been seen by this user.")
        else:
            print(f"Top {count} recommended categories:")
            for i, (category, score) in enumerate(final_category_list):
                print(f"{i+1}. {category} (Score: {score:.4f})")
        
        return [category for category, _ in final_category_list]
    
    # Generate and print recommendations
    print("\nCategories recommended based on similar users:")
    final_recommendations = filtered_category_recommendations(n_categories)
    
    return final_recommendations

In [9]:
def visualize_category_recommendations_3d(user_index, similar_user_list, weightage_list, categories_list, mean_rating_list):
    """
    Visualize user similarities and recommendations in 3D using PCA.
    
    Parameters:
    user_index: Position index of the user
    similar_user_list: List of similar user indices
    weightage_list: Weights for each similar user
    categories_list: List of all category names
    mean_rating_list: Weighted scores for each category
    """
    # Get real user ID
    real_user_id = user_mapping_df.iloc[user_index]['user_id']
    
    # Dimensionality reduction with PCA
    pca = PCA(n_components=3)
    reduced_data = pca.fit_transform(user_to_product_df.values)
    
    # Create a DataFrame for visualization
    df_vis = pd.DataFrame(reduced_data, columns=["PCA1", "PCA2", "PCA3"])
    df_vis["user_index"] = range(len(user_to_product_df))
    df_vis["user_id"] = user_to_product_df.index
    
    # Add user type information
    df_vis["type"] = "Other Users"
    df_vis.loc[user_index, "type"] = "Input User"  # Highlight the input user
    for sim_user in similar_user_list:
        df_vis.loc[sim_user, "type"] = "Similar Users"
    
    # Create the 3D plot
    fig = px.scatter_3d(
        df_vis,
        x="PCA1",
        y="PCA2",
        z="PCA3",
        color="type",
        symbol="type",
        size=df_vis["type"].apply(lambda x: 10 if x == "Input User" else 5),
        title=f"3D Representation of Users and Recommendations for User {user_index} (ID: {real_user_id})",
        labels={"PCA1": "PCA Component 1", "PCA2": "PCA Component 2", "PCA3": "PCA Component 3"},
        hover_data=["user_id"]
    )
    
    # Annotate the top categories
    top_categories = np.argsort(mean_rating_list)[-5:]  # Top 5 rated categories
    for cat_idx in top_categories:
        if mean_rating_list[cat_idx] > 0:  # Only show positive recommendations
            fig.add_annotation(
                x=reduced_data[user_index, 0],
                y=reduced_data[user_index, 1],
                z=reduced_data[user_index, 2],
                text=f"{categories_list[cat_idx]}",
                showarrow=True,
                arrowhead=2,
                arrowsize=1.5,
                ax=0,
                ay=-40
            )
    
    # Show the figure
    fig.show()

## Test the Recommendation System

In [10]:
# Test the recommendation system
test_user_index = 34  # You can change this
n_similar_users = 5
n_categories = 10

# Get recommendations
final_recommendations = category_recommender_system(test_user_index, n_similar_users, n_categories)

# Test different threshold levels
print("\nTesting with higher threshold (0.2):")
high_recommendations = filtered_category_recommendations(test_user_index, n=5, threshold=0.2)

Categories seen by the User:
['electronics.smartphone']

Top 5 users similar to User 34 (Real ID: 512384348):
1. User Index: 2278, Real ID: 543624132, Distance: 0.0000
2. User Index: 3450, Real ID: 555465207, Distance: 0.0000
3. User Index: 153, Real ID: 512682014, Distance: 0.0000
4. User Index: 1955, Real ID: 536328930, Distance: 0.0000
5. User Index: 3459, Real ID: 555465285, Distance: 0.0000

Categories recommended based on similar users:
Top 10 recommended categories:
1. stationery.cartrige (Score: nan)
2. appliances.kitchen.blender (Score: nan)
3. appliances.kitchen.coffee_machine (Score: nan)
4. appliances.kitchen.dishwasher (Score: nan)
5. appliances.kitchen.grill (Score: nan)
6. appliances.kitchen.hob (Score: nan)
7. appliances.kitchen.hood (Score: nan)
8. appliances.kitchen.juicer (Score: nan)
9. appliances.kitchen.kettle (Score: nan)
10. appliances.kitchen.meat_grinder (Score: nan)

Testing with higher threshold (0.2):

Top 5 users similar to User 34 (Real ID: 512384348):
1.

  weights = distances / np.sum(distances)


In [11]:
# Test with 3D visualization
# Get similar users for visualization
real_user_id = user_mapping_df.iloc[test_user_index]['user_id']
similar_user_list, distance_list = get_similar_users(test_user_index, real_user_id, n_similar_users)

# Calculate weights
weightage_list = distance_list / np.sum(distance_list)

# Get category preferences and calculate weighted preferences
cat_prefs_sim_users = user_to_product_df.values[similar_user_list]
categories_list = user_to_product_df.columns
expanded_weightage_list = weightage_list[:, np.newaxis] + np.zeros(len(categories_list))
new_rating_matrix = expanded_weightage_list * cat_prefs_sim_users
mean_rating_list = new_rating_matrix.sum(axis=0)

# Visualize the recommendations
visualize_category_recommendations_3d(test_user_index, similar_user_list, weightage_list, categories_list, mean_rating_list)


Top 5 users similar to User 34 (Real ID: 512384348):
1. User Index: 2278, Real ID: 543624132, Distance: 0.0000
2. User Index: 3450, Real ID: 555465207, Distance: 0.0000
3. User Index: 153, Real ID: 512682014, Distance: 0.0000
4. User Index: 1955, Real ID: 536328930, Distance: 0.0000
5. User Index: 3459, Real ID: 555465285, Distance: 0.0000


  weightage_list = distance_list / np.sum(distance_list)


## Conclusion

In this notebook, we've implemented a collaborative filtering recommendation system that suggests product categories to users based on the preferences of similar users. The system:

1. Creates a user-item matrix where values represent user preferences for categories
2. Uses K-Nearest Neighbors with cosine similarity to find similar users
3. Weights preferences of similar users to generate personalized recommendations
4. Filters out categories the user has already interacted with
5. Provides a threshold parameter to control recommendation quality
6. Visualizes user similarities and recommendations in 3D

Potential improvements could include incorporating product IDs for more fine-grained recommendations, considering temporal information, and combining with content-based filtering for a hybrid approach.

In [12]:
# Close the database connection
con.close()
print("Database connection closed.")

Database connection closed.
