# E-Commerce Recommendation System


## Notebook Overview

This notebook demonstrates the development of a **product recommendation system** for an **e-commerce dataset**.  
It utilizes the [`implicit`](https://github.com/benfred/implicit) library for building an **Alternating Least Squares (ALS)** model, a popular collaborative filtering algorithm.  

The notebook covers the following steps:

- **Data Loading and Preprocessing**  
- **Exploratory Data Analysis (EDA)**  
- **Data Splitting** for training and testing  
- **Model Training** using the ALS algorithm  
- **Model Evaluation** using various ranking metrics:
  - Precision@k  
  - Recall@k  
  - Mean Average Precision (MAP@k)  
  - Normalized Discounted Cumulative Gain (NDCG@k)  

Finally, it showcases how to **generate product recommendations** for individual customers.


### 1. Setup and Data Loading

##### This section imports necessary libraries and loads the e-commerce dataset.

In [None]:
import os
import sys
import warnings
import implicit
import numpy as np
import pandas as pd
import datetime as dt
import scipy.sparse as sparse

from sklearn.model_selection import train_test_split
sys.path.append(os.path.abspath(".."))
from utils.functions import ModelSolutions

# Suppress all warnings for cleaner output
warnings.filterwarnings('ignore')

# Set pandas option to display all columns in a DataFrame
pd.set_option('display.max_columns', None)

# Model Solution Initialization
model_solutions = ModelSolutions()

# Load the E-Commerce data from a CSV file
df = model_solutions.load_data(folders=['data','raw'],file_name='E-Commerce_data.csv')
df.head()

### 2. Data Preprocessing


##### This section cleans and prepares the data for model training. It involves converting date columns, handling missing values, and filtering out returned/exchanged items.

In [2]:
# Convert date columns to datetime objects
df["purchase_date"] = pd.to_datetime(df["purchase_date"])
df["return_date"] = pd.to_datetime(df["return_date"], errors="coerce")

# Including Related Columns for the model
model_data = df[
    ['id','returning','product','category','purchase_amount','purchase_date','returned','return_date']
    ].copy()

# Filling missing values in 'return_date'
model_data['return_date'] = model_data['return_date'].fillna("Not-Returned")

# Filter out returned or exchanged items to focus on successful purchases
fd = model_data[(model_data['returned']!='refund') & (model_data['returned']!='exchange')]

# Prepare data for collaborative filtering, focusing on 'id', 'product', and 'purchase_amount'
cf_data = fd[['id','product', 'purchase_amount']].copy()

# Identify users with high purchase counts (more than 4 purchases)
purchase_counts = cf_data.groupby('id').size()
high_purchase_ids = purchase_counts[purchase_counts > 4].index
filtered_cf = cf_data[cf_data['id'].isin(high_purchase_ids)]

# Aggregate low purchase data (users with 4 or fewer purchases) by summing purchase amounts
low_purchase_data = cf_data[~cf_data['id'].isin(high_purchase_ids)]
low_purchase_data = low_purchase_data.groupby(['id','product'])['purchase_amount'].sum().reset_index()

# Set random seed for reproducibility
np.random.seed(42)

# Split high purchase data into training and testing sets based on a random mask
# Approximately 80% for training
split_mask = filtered_cf.groupby('id')['purchase_amount'].transform(lambda x: np.random.rand(len(x))<0.8)
train_data = filtered_cf[split_mask]
test_data = filtered_cf[~split_mask]

# Create a pivot table for the complete dataset, counting purchase amounts
data = cf_data.groupby(['id','product'])['purchase_amount'].count().reset_index()

# Create pivot tables for training and testing data, representing user-item interactions
train_pivot = train_data.pivot_table(index='id',columns='product',values='purchase_amount',aggfunc='count',fill_value=0)
test_pivot = test_data.pivot_table(index='id',columns='product',values='purchase_amount',aggfunc='count',fill_value=0)
data = data.pivot_table(index='id',columns='product',values='purchase_amount',aggfunc='count',fill_value=0)

# Convert pivot tables to Compressed Sparse Row (CSR) matrices for efficient computation with implicit library
train_csr = sparse.csr_matrix(train_pivot.values)
test_csr = sparse.csr_matrix(test_pivot.values)

### 3. Evaluation Metrics


##### This section defines functions for evaluating the recommendation model using various metrics.



In [6]:
def simple_evaluation(test_csr, predictions, k):
    """
    Evaluate recommendations using average Precision@k and Recall@k.
    
    Parameters:
      test_csr: 2D array or CSR matrix of actual user-item interactions (n_users x n_items).
      predictions: 2D array where each row contains recommended item indices for the corresponding user.
      k: Number of top recommendations to consider.
    
    Returns:
      avg_precision: Average Precision@k over all users in the predictions array.
      avg_recall: Average Recall@k over all users in the predictions array.
    """
    n_users = predictions.shape[0]
    precisions = []
    recalls = []
    
    for user in range(n_users):
        if hasattr(test_csr[user], "toarray"):
            user_actual = test_csr[user].toarray()[0]
        else:
            user_actual = test_csr[user]
        
        actual_items = set(np.where(user_actual > 0)[0])
        recommended_items = set(predictions[user][:k])
        
        hits = len(actual_items & recommended_items)
        
        precision = hits / k
        recall = hits / len(actual_items) if actual_items else 0
        
        precisions.append(precision)
        recalls.append(recall)
    
    avg_precision = np.mean(precisions)
    avg_recall = np.mean(recalls)
    
    return avg_precision, avg_recall


def evaluate_ranking_metrics(test_csr, predictions, k):
    """
    Evaluate ranking metrics: Mean Average Precision (MAP@k) and NDCG@k.
    Parameters:
      test_csr: 2D array or CSR matrix of actual user-item interactions 
                (shape: n_users x n_items).
      predictions: 2D array where each row contains recommended item indices 
                   for the corresponding user.
      k: Number of top recommendations to consider.
    
    Returns:
      avg_map: Mean Average Precision at k over all users.
      avg_ndcg: Mean Normalized Discounted Cumulative Gain at k over all users.
    """
    map_scores = []
    ndcg_scores = []
    n_users = predictions.shape[0]
    
    for user in range(n_users):
        if hasattr(test_csr[user], "toarray"):
            user_actual = test_csr[user].toarray()[0]
        else:
            user_actual = test_csr[user]
         
        actual_items = set(np.where(user_actual > 0)[0])
        if not actual_items:
            continue
        
        # ----- MAP@k Calculation -----
        num_hits = 0.0
        ap = 0.0
        for i, pred in enumerate(predictions[user][:k]):
            if pred in actual_items:
                num_hits += 1
                ap += num_hits / (i + 1)
        average_precision = ap / len(actual_items)
        map_scores.append(average_precision)
        
        # ----- NDCG@k Calculation -----
        dcg = 0.0
        for i, pred in enumerate(predictions[user][:k]):
            if pred in actual_items:
                dcg += 1.0 / np.log2(i + 2)  # i+2 because positions are 1-indexed in the log term.
        
        ideal_hits = min(len(actual_items), k)
        idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))
        ndcg = dcg / idcg if idcg > 0 else 0.0
        ndcg_scores.append(ndcg)
    
    avg_map = np.mean(map_scores) if map_scores else 0.0
    avg_ndcg = np.mean(ndcg_scores) if ndcg_scores else 0.0
    return avg_map, avg_ndcg

### 4. Model Training and Evaluation


##### This section trains the ALS model on the `train_csr` data and evaluates its performance on the `test_csr` using the defined metrics.

In [None]:
# Define the number of recommendations to generate for evaluation.
N_recommendations = 50 

# Initialize the Alternating Least Squares (ALS) model.
# 'factors' determines the dimensionality of the latent factors (embedding size).
# 'calculate_training_loss=True' enables calculation of loss during training, which can be useful for monitoring.
model = implicit.als.AlternatingLeastSquares(factors=5, calculate_training_loss=True)

# Train the ALS model on the training sparse matrix.
# The 'implicit' library expects the user-item matrix to be in CSR format.
model.fit(train_csr)

# --- Evaluation Preparation: Aligning Predictions and Actuals ---
# Identify common user IDs present in both training and test sets.
# We can only evaluate for users who have interactions in both.
common_user_ids = list(set(train_pivot.index) & set(test_pivot.index))

# Initialize lists to store recommendations and actual interactions for common users.
# These lists will be aligned, meaning the i-th element in aligned_predictions will correspond
# to the i-th element in aligned_actuals_csr, both for the same user.
aligned_predictions = []
aligned_actuals_csr = []

# Iterate through each common user to generate their recommendations and fetch their actual test interactions.
for user_id in common_user_ids:
    # Get the integer index of the user in the training pivot table.
    # This index is used by the 'model.recommend' method.
    train_user_idx = train_pivot.index.get_loc(user_id)
    
    # Get the integer index of the user in the test pivot table.
    # This index is used to retrieve the actual interactions from 'test_csr'.
    test_user_idx = test_pivot.index.get_loc(user_id)

    # Generate recommendations for this specific user.
    # 'user_items' provides the user's interactions from the training set to the model.
    # 'filter_already_liked_items=True' ensures that the model doesn't recommend items the user already interacted with in training.
    recommended_items_indices, _ = model.recommend(
        userid=train_user_idx,
        user_items=train_csr[train_user_idx],
        N=N_recommendations,
        filter_already_liked_items=True
    )
    aligned_predictions.append(recommended_items_indices)
    
    # Collect the actual interactions for this user from the test set.
    aligned_actuals_csr.append(test_csr[test_user_idx])

# Convert the list of recommendation arrays into a single NumPy array,
# which is the expected format for the evaluation functions.
aligned_predictions_np = np.array(aligned_predictions)

# --- Evaluation ---
# Define K, the number of top recommendations to consider for evaluation.
K_eval = 50 

# Call the evaluation functions with the now aligned predictions and actuals.
avg_precision, avg_recall = simple_evaluation(aligned_actuals_csr, aligned_predictions_np, k=K_eval)
avg_map, avg_ndcg = evaluate_ranking_metrics(aligned_actuals_csr, aligned_predictions_np, k=K_eval)

# Print the evaluation results, formatted to four decimal places for readability.
print(f"MAP@{K_eval}: {avg_map:.4f}")
print(f"NDCG@{K_eval}: {avg_ndcg:.4f}")
print(f"Average Precision@{K_eval}: {avg_precision:.4f}")
print(f"Average Recall@{K_eval}: {avg_recall:.4f}")


100%|██████████| 15/15 [00:10<00:00,  1.45it/s, loss=0.0155]


MAP@50: 0.0365
NDCG@50: 0.1053
Average Precision@50: 0.0130
Average Recall@50: 0.3105


### 5. Generating Recommendations for a Specific Customer (Test Data)

##### This section demonstrates how to get recommendations for a randomly chosen customer from the test set.

In [9]:
# Choose a random customer from the test set
customer_id = np.random.choice(list(test_pivot.index))

# Get recommendations for the selected customer
# user_items is set to the specific row in test_csr for that customer
product_indices, scores = model.recommend(test_pivot.index.get_loc(customer_id),
                                          user_items= test_csr[test_pivot.index.get_loc(customer_id)],
                                          N=20, # Number of recommendations to return
                                          filter_already_liked_items=False # Do not filter already liked items in this case for demonstration
                                          )

# Display the recommended product names
print(test_pivot.columns[product_indices])

Index(['Car Wash Mitt', 'Frozen Meals', 'Dehumidifier',
       'Headlight Restoration Kit', 'Dog Collar', 'Convection Oven',
       'Cat Shampoo', 'Kombucha', 'Car Drying Towel', 'Bangle', 'Candy',
       'Acne Treatment', 'Hand Cream', 'Hamster Cage', 'Wine', 'Spark Plugs',
       'Tire Shine', 'Bread Maker', 'RC Helicopters', 'Utility Knife'],
      dtype='object', name='product')


### 6. Training on Complete Data and Generating Recommendations


This section shows how to train the model on the entire dataset (`data_csr`) and then generate recommendations for a random customer.

In [11]:
# Convert the complete data pivot table to a CSR matrix
data_csr = sparse.csc_matrix(data)

# Train the ALS model on the complete dataset
model = implicit.als.AlternatingLeastSquares(factors=50, calculate_training_loss=True)
model.fit(data_csr)

# Choose a random customer from the complete dataset
customer_id = np.random.choice(list(data.index))

100%|██████████| 15/15 [00:45<00:00,  3.04s/it, loss=0.0109]


Index(['Stand Mixer', 'Smart Lock', 'Snare Drum', 'Beer', 'Toe Ring', 'Chips',
       'Pet Scale', 'Travel Guides', 'Miter Saw', 'Thriller Novels'],
      dtype='object', name='product')


In [12]:
# Get recommendations for the selected customer from the complete dataset
product_indices, scores = model.recommend(data.index.get_loc(customer_id),
                                          user_items= data_csr[data.index.get_loc(customer_id)].tocsr(),
                                          N=10, # Number of recommendations
                                          filter_already_liked_items=True # Filter out items the user has already interacted with
                                          )

# Display the recommended product names
print(data.columns[product_indices])

Index(['Stand Mixer', 'Smart Lock', 'Snare Drum', 'Beer', 'Toe Ring', 'Chips',
       'Pet Scale', 'Travel Guides', 'Miter Saw', 'Thriller Novels'],
      dtype='object', name='product')


In [None]:
# Exporting the model
model_solutions.save_model(folders=['models','product_recommendation'],file_name='product_recommendation_model.pkl',obj=model)