# retail-recommendation-system

This project implements a product recommendation system using PyTorch. The goal is to train a model that can recommend relevant products to users based on their past behavior, including clicks, cart additions, and purchases.

I am using Neural Collaborative Filtering in order to learn nonlinear relationships while also understanding the ins and outs of PyTorch.

## Features
Data preprocessing, model building, training, and prediction, plus ranking metrics (NDCG@K and Precision@K) for evaluation.

## Dataset
https://www.kaggle.com/datasets/retailrocket/ecommerce-dataset/data

## Tech Stack
- Python 3
- VSCode
- PyTorch, pandas, numpy, scikit-learn
- Jupyter Notebook

## The Process
- The main goal of this project was to learn PyTorch in a more hands-on way, as I had not used it to its fullest up to this point (mostly boilerplate code).
- I didn't want the project to be too easy, so I decided to implement something from recommendation systems, an area where I had not yet learned the theory.
- In the end, I chose a Neural Collaborative Filtering model to make use of neural networks in a recommender system.
- I ran into many problems, from plain overfitting to simply not knowing what I was doing, but with a few resources and some trial and error I was able to figure it out.

*Keep in mind that there is always room for improvement, but I consider this a success given my newfound understanding of PyTorch.*

## How to use
1. Clone the repository or download the files.
2. Download the dataset and place it under `data/` in the same directory.
3. Install the dependencies from `requirements.txt` into your environment with pip.
4. Run the notebook.
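
Before running the notebook, it can help to confirm that the dataset files are where the loading cell expects them. A minimal sketch, assuming the three CSV names read by the notebook; the `missing_files` helper and the `data` path are hypothetical, so point it at wherever you placed the download:

```python
from pathlib import Path

# CSV files the notebook reads with pd.read_csv
EXPECTED_FILES = (
    "events.csv",
    "item_properties_part1.csv",
    "item_properties_part2.csv",
)

def missing_files(data_dir):
    """Return the expected CSV names that are not present in data_dir (hypothetical helper)."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]

# "data" is an assumed location; note that the notebook itself reads from '../data/'
missing = missing_files("data")
print("Missing files:", missing if missing else "none")
```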

## What's next?
- Fix up the evaluations.
- Log more stuff.
- Spend some time fine-tuning hyperparameters for better convergence.

## References
- https://arxiv.org/abs/1708.05031
- https://github.com/yihong-chen/neural-collaborative-filtering/tree/master
- https://youtu.be/O4lk9Lw7lS0?si=T4tDJ1xz1I9IuCvv



## Imports

In [64]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from collections import defaultdict

## Exploring the data

In [33]:
# Load the datasets
events = pd.read_csv('../data/events.csv')
prop1 = pd.read_csv('../data/item_properties_part1.csv')
prop2 = pd.read_csv('../data/item_properties_part2.csv')

properties = pd.concat([prop1, prop2], ignore_index=True)

In [34]:
properties.head()

| | timestamp | itemid | property | value |
|---|---|---|---|---|
| 0 | 1435460400000 | 460429 | categoryid | 1338 |
| 1 | 1441508400000 | 206783 | 888 | 1116713 960601 n277.200 |
| 2 | 1439089200000 | 395014 | 400 | n552.000 639502 n720.000 424566 |
| 3 | 1431226800000 | 59481 | 790 | n15360.000 |
| 4 | 1431831600000 | 156781 | 917 | 828513 |


In [35]:
#events.sort_values(by=['visitorid', 'itemid'], inplace=True)
events.head()

| | timestamp | visitorid | event | itemid | transactionid |
|---|---|---|---|---|---|
| 0 | 1433221332117 | 257597 | view | 355908 | |
| 1 | 1433224214164 | 992329 | view | 248676 | |
| 2 | 1433221999827 | 111016 | view | 318965 | |
| 3 | 1433221955914 | 483717 | view | 253185 | |
| 4 | 1433221337106 | 951259 | view | 367447 | |


The item properties are not very useful for collaborative filtering, but they could be incorporated later in a content-based filtering system.

## Preprocessing

Next, we build a table of (visitor, item) interactions with the view count, a binary added-to-cart flag, and a binary transaction flag.

In [36]:
views = events[events['event'] == 'view'][['visitorid', 'itemid']].copy()
views = views.groupby(['visitorid', 'itemid']).size().reset_index(name='view')

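# Keep one row per (visitor, item) pair so repeated events do not duplicate rows in the merge
carts = events[events['event'] == 'addtocart'][['visitorid', 'itemid']].drop_duplicates().copy()
carts['cart'] = 1

transactions = events[events['event'] == 'transaction'][['visitorid', 'itemid']].drop_duplicates().copy()
transactions['transaction'] = 1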

merged = pd.merge(views, carts, on=['visitorid', 'itemid'], how='outer')
merged = pd.merge(merged, transactions, on=['visitorid', 'itemid'], how='outer')

merged.fillna(0, inplace=True)
merged.head()

| | visitorid | itemid | view | cart | transaction |
|---|---|---|---|---|---|
| 0 | 0 | 67045 | 1.0 | 0.0 | 0.0 |
| 1 | 0 | 285930 | 1.0 | 0.0 | 0.0 |
| 2 | 0 | 357564 | 1.0 | 0.0 | 0.0 |
| 3 | 1 | 72028 | 1.0 | 0.0 | 0.0 |
| 4 | 2 | 216305 | 2.0 | 0.0 | 0.0 |
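
One thing to watch in a merge like this: an outer merge repeats a (visitor, item) row once for every matching row on the right-hand side, so a pair with several add-to-cart events can appear more than once. A small sketch on made-up toy events (the IDs are invented for illustration) showing the effect, and how `drop_duplicates` keeps one row per pair:

```python
import pandas as pd

# Toy events (invented): visitor 1 views item 10 twice and adds it to the cart twice
events = pd.DataFrame({
    "visitorid": [1, 1, 1, 1, 2],
    "itemid":    [10, 10, 10, 10, 20],
    "event":     ["view", "view", "addtocart", "addtocart", "view"],
})

views = (events[events["event"] == "view"]
         .groupby(["visitorid", "itemid"]).size().reset_index(name="view"))

carts = events[events["event"] == "addtocart"][["visitorid", "itemid"]].copy()
carts["cart"] = 1

# Without deduplication, the repeated cart event duplicates the (1, 10) row
naive = pd.merge(views, carts, on=["visitorid", "itemid"], how="outer")

# With deduplication, there is exactly one row per (visitor, item) pair
deduped = pd.merge(views, carts.drop_duplicates(), on=["visitorid", "itemid"], how="outer")

print(len(naive), len(deduped))  # 3 2
```

Duplicated rows would give those pairs extra weight in the training loss, which is why one row per pair is the safer shape for the interaction table.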


For the target score, I decided to combine all three signals: the log of the view count, log(1 + views), capped at 5, plus 5 for an add-to-cart and 10 for a purchase.
The score is then min-max normalized in a later cell, after the ID mapping.
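
As a quick check of the scoring rule, here is a minimal sketch in plain Python. The function name `raw_score` is hypothetical; the formula mirrors the description above, and the first two values can be compared with the `score` column printed by the next cell:

```python
import math

def raw_score(views, cart, transaction):
    """Raw (un-normalized) score: capped log of views plus cart and purchase bonuses."""
    return min(math.log(1 + views), 5) + 5 * cart + 10 * transaction

print(round(raw_score(1, 0, 0), 6))   # one view only  -> 0.693147
print(round(raw_score(2, 0, 0), 6))   # two views only -> 1.098612
print(raw_score(1000, 1, 1))          # log term capped at 5 -> 20
```

Since log(1 + views) only exceeds 5 from 148 views upward, the cap affects only pairs with very high view counts.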

In [37]:
data = merged.copy()
data['score'] = np.minimum(np.log(1 + data['view']), 5) + 5*data['cart'] + 10*data['transaction']

data.head()

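| | visitorid | itemid | view | cart | transaction | score |
|---|---|---|---|---|---|---|
| 0 | 0 | 67045 | 1.0 | 0.0 | 0.0 | 0.693147 |
| 1 | 0 | 285930 | 1.0 | 0.0 | 0.0 | 0.693147 |
| 2 | 0 | 357564 | 1.0 | 0.0 | 0.0 | 0.693147 |
| 3 | 1 | 72028 | 1.0 | 0.0 | 0.0 | 0.693147 |
| 4 | 2 | 216305 | 2.0 | 0.0 | 0.0 | 1.098612 |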


Boom. Next, the visitor and item IDs need to be mapped to contiguous integer indices, since the embedding layers of the network are indexed from 0.
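
The reason for the remapping: `nn.Embedding` looks rows up by position, so it expects indices from 0 to `num_embeddings - 1`, while the raw `visitorid` and `itemid` values are sparse. A minimal sketch of the idea in plain Python, using a few item IDs from the events sample; the reverse dictionary `index_to_item` is not part of the notebook and is shown only as one way to translate predictions back to real IDs:

```python
raw_item_ids = [355908, 248676, 355908, 318965]  # raw IDs, with a repeat

# Assign each distinct ID the next free index, in order of first appearance
item_to_index = {}
for raw_id in raw_item_ids:
    if raw_id not in item_to_index:
        item_to_index[raw_id] = len(item_to_index)

# Reverse lookup (illustrative, not in the notebook) to recover raw IDs
index_to_item = {index: raw_id for raw_id, index in item_to_index.items()}

indices = [item_to_index[raw_id] for raw_id in raw_item_ids]
print(indices)            # [0, 1, 0, 2]
print(index_to_item[2])   # 318965
```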

In [38]:
num_users = data['visitorid'].nunique()
num_items = data['itemid'].nunique()

print(f"Unique users: {num_users}")
print(f"Unique items: {num_items}")

Unique users: 1407580
Unique items: 235061


In [54]:
unique_users = data.visitorid.unique()
user_to_index = {old: new for new,old in enumerate(unique_users)}

unique_products = data.itemid.unique()
product_to_index = {old: new for new,old in enumerate(unique_products)}

data['usermap'] = data['visitorid'].map(user_to_index)
data['itemmap'] = data['itemid'].map(product_to_index)

# Min-max normalization
min_score = data['score'].min()
max_score = data['score'].max()

data['score'] = (data['score'] - min_score) / (max_score - min_score)

data.head()

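| | visitorid | itemid | view | cart | transaction | score | usermap | itemmap |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 67045 | 1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
| 1 | 0 | 285930 | 1.0 | 0.0 | 0.0 | 0.0 | 0 | 1 |
| 2 | 0 | 357564 | 1.0 | 0.0 | 0.0 | 0.0 | 0 | 2 |
| 3 | 1 | 72028 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | 3 |
| 4 | 2 | 216305 | 2.0 | 0.0 | 0.0 | 0.021001 | 2 | 4 |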
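
The normalized values in the output above can be reproduced by hand. A minimal sketch, assuming the smallest raw score is log(2) (a single view) and the largest is 20 (log term capped at 5, plus cart and purchase); both bounds are inferred from the scoring rule rather than printed by the notebook:

```python
import math

def min_max(value, lo, hi):
    """Rescale value from [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)

lo = math.log(2)   # assumed minimum: one view, no cart, no purchase
hi = 20.0          # assumed maximum: 5 (capped views) + 5 (cart) + 10 (purchase)

print(round(min_max(math.log(2), lo, hi), 6))  # single view -> 0.0
print(round(min_max(math.log(3), lo, hi), 6))  # two views   -> 0.021001
```

Under the same assumptions, a view-only pair can never score above (5 - log 2) / (20 - log 2), roughly 0.223, so most of the target range is reserved for cart and purchase events.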


## Machine Learning

In [55]:
class NCFDataset(Dataset):
    def __init__(self, users, items, score):
        self.users = users
        self.items = items
        self.score = score

    def __len__(self):
        return len(self.users)

    def __getitem__(self, idx):
        users = self.users[idx]
        items = self.items[idx]
        score = self.score[idx]
        return {
            "users": torch.tensor(users, dtype = torch.long),
            "items": torch.tensor(items, dtype = torch.long),
            "score": torch.tensor(score, dtype = torch.float)
        }

In [56]:
class NCFModel(nn.Module):
    def __init__(self, config):
        super(NCFModel, self).__init__()

        self.config = config
        self.num_users = config['num_users']
        self.num_items = config['num_items']
        self.latent_dim_mf = config['latent_dim_mf']
        self.latent_dim_mlp = config['latent_dim_mlp']

        # Matrix Factorization
        self.user_embedding_mf = torch.nn.Embedding(num_embeddings=self.num_users, embedding_dim=self.latent_dim_mf)
        self.item_embedding_mf = torch.nn.Embedding(num_embeddings=self.num_items, embedding_dim=self.latent_dim_mf)

        # Multilayer Perceptron
        self.user_embedding_mlp = torch.nn.Embedding(num_embeddings=self.num_users, embedding_dim=self.latent_dim_mlp)
        self.item_embedding_mlp = torch.nn.Embedding(num_embeddings=self.num_items, embedding_dim=self.latent_dim_mlp)

        # layers[0] must equal 2 * latent_dim_mlp (the concatenated user and item embeddings)
        self.fc_layers = torch.nn.ModuleList()
        for in_size, out_size in zip(config['layers'][:-1], config['layers'][1:]):
            self.fc_layers.append(torch.nn.Linear(in_size, out_size))
        
        self.logits = torch.nn.Linear(in_features=config['layers'][-1] + config['latent_dim_mf'], out_features=1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, user_i, item_i):
        user_embedding_mlp = self.user_embedding_mlp(user_i)
        item_embedding_mlp = self.item_embedding_mlp(item_i)
        user_embedding_mf = self.user_embedding_mf(user_i)
        item_embedding_mf = self.item_embedding_mf(item_i)

        # MF part: element-wise product of the user and item embeddings
        mf_vector = torch.mul(user_embedding_mf, item_embedding_mf)
        # Functional dropout with training=self.training is switched off by model.eval();
        # a Dropout module created inside forward() would always stay in training mode.
        mf_vector = torch.nn.functional.dropout(mf_vector, p=self.config['dropout_rate_mf'], training=self.training)

        # MLP part: concatenated embeddings passed through the fully connected layers
        mlp_vector = torch.cat([user_embedding_mlp, item_embedding_mlp], dim=-1)

        for layer in self.fc_layers:
            mlp_vector = layer(mlp_vector)
            mlp_vector = torch.relu(mlp_vector) # Activation
        mlp_vector = torch.nn.functional.dropout(mlp_vector, p=self.config['dropout_rate_mlp'], training=self.training)

        # Combining results
        vector = torch.cat([mlp_vector, mf_vector], dim=-1)
        output = self.logits(vector)
        output = self.sigmoid(output) # Squash to (0, 1) to match the normalized score
        return output
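
A PyTorch detail worth keeping in mind for dropout in a model like this one: `model.eval()` only switches off dropout layers that are registered as submodules (or functional dropout that is passed `training=self.training`). A `Dropout` module constructed inside `forward()` is brand new on every call and starts in training mode. A minimal sketch with two hypothetical toy modules that shows the difference:

```python
import torch
import torch.nn as nn

class InlineDropout(nn.Module):
    """Creates the Dropout layer inside forward(), so eval() cannot reach it."""
    def forward(self, x):
        return nn.Dropout(0.5)(x)

class RegisteredDropout(nn.Module):
    """Registers the Dropout layer in __init__, so eval() disables it."""
    def __init__(self):
        super().__init__()
        self.drop = nn.Dropout(0.5)

    def forward(self, x):
        return self.drop(x)

x = torch.ones(1000)
inline = InlineDropout().eval()
registered = RegisteredDropout().eval()

# With p=0.5, active dropout turns every element into either 0 or 2, never 1
inline_changed = not torch.equal(inline(x), x)
registered_changed = not torch.equal(registered(x), x)

print(inline_changed, registered_changed)  # True False
```

If dropout stays active during validation or evaluation, the reported losses and rankings are noisier than the model's actual predictions.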

In [57]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [61]:
df = data[['usermap', 'itemmap', 'score']]

# First split: train+val and test
df_temp, df_test = train_test_split(
    df, test_size=0.1, random_state=18
)

# Second split: train and val from the remaining 90%
df_train, df_val = train_test_split(
    df_temp, test_size=0.1, random_state=18
)

train_dataset = NCFDataset(list(df_train['usermap']), list(df_train['itemmap']), list(df_train['score']))
val_dataset = NCFDataset(list(df_val['usermap']), list(df_val['itemmap']), list(df_val['score']))
test_dataset = NCFDataset(list(df_test['usermap']), list(df_test['itemmap']), list(df_test['score']))
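
Note that the two `test_size=0.1` calls do not give an 80/10/10 split, because the second 10% is taken from the remaining 90%. The arithmetic, as a short sketch:

```python
test_frac = 0.1
val_frac_of_rest = 0.1  # the second split is applied to the remaining 90%

test_share = test_frac
val_share = (1 - test_frac) * val_frac_of_rest
train_share = (1 - test_frac) * (1 - val_frac_of_rest)

print(f"train={train_share:.2f} val={val_share:.2f} test={test_share:.2f}")
# train=0.81 val=0.09 test=0.10
```

With a random row-level split like this, some users or items in the validation and test sets may have no rows in the training set, in which case their embeddings never receive a gradient from the loss.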

In [62]:
# Hyperparameters
config = {
    'num_epoch': 50,
    'batch_size': 2048,
    'num_users': num_users,
    'num_items': num_items,
    'latent_dim_mf': 32,
    'latent_dim_mlp': 32,
    'layers': [64, 32, 16],
    'dropout_rate_mf': 0.1,
    'dropout_rate_mlp': 0.1,
    'decay': 1e-5,
    'learning_rate': 0.0005
}

# Model
recc_model = NCFModel(config).to(device)
optimizer = optim.Adam(recc_model.parameters(), lr=config['learning_rate'], weight_decay=config['decay'])
loss_fn = nn.MSELoss()

# Data loaders
train_loader = DataLoader(train_dataset, batch_size=config['batch_size'], shuffle=True)  # reshuffle training batches every epoch
val_loader = DataLoader(val_dataset, batch_size=config['batch_size'], shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=config['batch_size'], shuffle=False)

# Training
losses = []
recc_model.train()
for e in range(config['num_epoch']):
    epoch_loss = 0
    for idx, train_data in enumerate(train_loader):
        output = recc_model(train_data['users'].to(device), train_data['items'].to(device)).squeeze(-1)  # (batch, 1) -> (batch,)
        score = (train_data['score'].to(torch.float32).to(device))
        loss = loss_fn(output, score)
        #print(f"Output min/max: {output.min().item():.4f}/{output.max().item():.4f}")
        #print(f"Score min/max: {score.min().item():.4f}/{score.max().item():.4f}")
        #print(output, score)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
    
    # Validation
    recc_model.eval()
    val_loss = 0
    with torch.no_grad():
        for val_data in val_loader:
            output = recc_model(val_data['users'].to(device), val_data['items'].to(device)).squeeze(-1)  # (batch, 1) -> (batch,)
            score = val_data['score'].to(torch.float32).to(device)
            loss = loss_fn(output, score)
            val_loss += loss.item()

    losses.append(epoch_loss)
    print(f"Epoch {e+1}: Loss = {epoch_loss:.4f} | Validation Loss = {val_loss:.4f}")
    recc_model.train()


Epoch 1: Loss = 37.0387 | Validation Loss = 0.8877
Epoch 2: Loss = 7.8203 | Validation Loss = 0.8747
Epoch 3: Loss = 7.7610 | Validation Loss = 0.8695
Epoch 4: Loss = 7.7061 | Validation Loss = 0.8624
Epoch 5: Loss = 7.6226 | Validation Loss = 0.8488
Epoch 6: Loss = 7.4600 | Validation Loss = 0.8348
Epoch 7: Loss = 7.2928 | Validation Loss = 0.8152
Epoch 8: Loss = 7.1066 | Validation Loss = 0.7985
Epoch 9: Loss = 6.9122 | Validation Loss = 0.7759
Epoch 10: Loss = 6.6322 | Validation Loss = 0.7529
Epoch 11: Loss = 6.2921 | Validation Loss = 0.7384
Epoch 12: Loss = 5.9786 | Validation Loss = 0.7299
Epoch 13: Loss = 5.7139 | Validation Loss = 0.7267
Epoch 14: Loss = 5.4594 | Validation Loss = 0.7290
Epoch 15: Loss = 5.1795 | Validation Loss = 0.7266
Epoch 16: Loss = 4.9038 | Validation Loss = 0.7271
Epoch 17: Loss = 4.6567 | Validation Loss = 0.7320
Epoch 18: Loss = 4.4283 | Validation Loss = 0.7327
Epoch 19: Loss = 4.1659 | Validation Loss = 0.7413
Epoch 20: Loss = 3.8840 | Validation Lo
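
When reading this log, keep in mind that both columns are sums of per-batch losses, and the training set has roughly nine times as many batches as the validation set (81% vs 9% of the rows at the same batch size), so the two numbers are not on the same scale. Dividing by the number of batches makes them comparable. A minimal sketch with made-up batch losses; `mean_loss` is a hypothetical helper:

```python
def mean_loss(batch_losses):
    """Average per-batch loss, so totals from loaders of different length are comparable."""
    return sum(batch_losses) / len(batch_losses)

# Made-up numbers: nine training batches and one validation batch, same loss per batch
train_batch_losses = [0.10] * 9
val_batch_losses = [0.10]

# Summed: the training total looks nine times worse
print(round(sum(train_batch_losses), 2), round(sum(val_batch_losses), 2))              # 0.9 0.1
# Averaged: the two are on the same scale
print(round(mean_loss(train_batch_losses), 2), round(mean_loss(val_batch_losses), 2))  # 0.1 0.1
```

In the training loop, that would mean dividing `epoch_loss` by `len(train_loader)` and `val_loss` by `len(val_loader)` before printing.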

## Evaluation
Note: I used AI to write this evaluation code and plan to replace it in the future.

In [69]:
def dcg_at_k(r, k):
    r = np.asarray(r, dtype=np.float32)[:k]
    return np.sum(r / np.log2(np.arange(2, r.size + 2)))

def ndcg_at_k(r, k):
    dcg_max = dcg_at_k(sorted(r, reverse=True), k)
    if not dcg_max:
        return 0.0
    return dcg_at_k(r, k) / dcg_max

def precision_at_k(r, k):
    r = np.asarray(r)[:k]
    return np.mean(r)
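
To make the metrics concrete, here is a hand-checkable example on a made-up relevance list (1 means the recommended item is in the user's test set). The helpers below are plain-Python restatements of the same formulas, written only for this illustration:

```python
import math

def dcg(rel):
    """Discounted cumulative gain: a hit at 1-based rank i is divided by log2(i + 1)."""
    return sum(r / math.log2(pos + 2) for pos, r in enumerate(rel))

def ndcg(rel):
    """DCG divided by the DCG of the same list sorted best-first."""
    ideal = dcg(sorted(rel, reverse=True))
    return dcg(rel) / ideal if ideal else 0.0

rel = [0, 1, 0, 0, 1]   # made-up top-5 list with hits at ranks 2 and 5

print(round(dcg(rel), 3))     # 1/log2(3) + 1/log2(6)    -> 1.018
print(round(ndcg(rel), 3))    # divided by 1 + 1/log2(3) -> 0.624
print(sum(rel) / len(rel))    # precision@5              -> 0.4
print(ndcg([1, 0, 0, 0, 0]))  # single hit at rank 1     -> 1.0
```

The last line shows a property of this normalization: because the ideal ranking is built from the same top-K list, one hit at rank 1 scores 1.0 regardless of how many relevant items were missed.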

In [65]:
def evaluate_model(model, test_df, all_item_ids, K=10, device='cpu'):
    model.eval()
    user_item_scores = defaultdict(list)
    
    # Group test data by user
    grouped = test_df.groupby('usermap')

    # Build the candidate item tensor once instead of once per user
    item_tensor = torch.as_tensor(all_item_ids, dtype=torch.long, device=device)

    with torch.no_grad():
        for user, group in grouped:
            true_items = set(group['itemmap'].values)
            
            # Predict scores for all items
            user_tensor = torch.full_like(item_tensor, int(user))
            
            scores = model(user_tensor, item_tensor).squeeze(-1).cpu().numpy()
            ranked_items = np.argsort(scores)[::-1]  # Descending order
            
            recommended = [all_item_ids[i] for i in ranked_items[:K]]
            rel = [1 if item in true_items else 0 for item in recommended]
            
            user_item_scores['ndcg'].append(ndcg_at_k(rel, K))
            user_item_scores['precision'].append(precision_at_k(rel, K))

    avg_ndcg = np.mean(user_item_scores['ndcg'])
    avg_precision = np.mean(user_item_scores['precision'])
    
    print(f"NDCG@{K}: {avg_ndcg:.4f}")
    print(f"Precision@{K}: {avg_precision:.4f}")

In [70]:
all_item_ids = list(range(config['num_items']))

# test_df = original test split
evaluate_model(recc_model, df_test, all_item_ids, K=10, device=device)

KeyboardInterrupt: 
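
The `KeyboardInterrupt` means the run was stopped by hand. A likely reason, though the notebook does not say so, is cost: every test user is scored against all 235,061 items. One common way to cut this down, used in the NCF paper listed in the references, is to rank each held-out item against a random sample of items the user has not interacted with. A minimal sketch of building such a candidate list; `sample_candidates`, its parameters, the default of 99 negatives, and the toy IDs are all hypothetical:

```python
import random

def sample_candidates(true_item, seen_items, num_items, num_negatives=99, seed=0):
    """Return the held-out item followed by num_negatives random unseen items.

    Assumes num_items is much larger than num_negatives plus the seen items,
    otherwise the sampling loop cannot finish.
    """
    rng = random.Random(seed)
    excluded = set(seen_items) | {true_item}
    negatives = set()
    while len(negatives) < num_negatives:
        candidate = rng.randrange(num_items)
        if candidate not in excluded:
            negatives.add(candidate)
    return [true_item] + sorted(negatives)

# Toy usage: item 7 is the held-out item, items 1 to 3 were seen in training
candidates = sample_candidates(true_item=7, seen_items=[1, 2, 3], num_items=1000)
print(len(candidates))  # 100
```

The model then only needs to score the candidates for each user, and NDCG@K and Precision@K are computed on that short ranking. Numbers obtained this way are not directly comparable to full-ranking metrics.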