# Modelling <a id=''></a>

## Table of Contents
  * 1 [Import libraries and Load data](#load_data)
  * 2 [Prepare data](#prepare_data)
  * 3 [Overview of recommender systems](#overview)
    * 3.1 [Two primary approaches](#approaches)
    * 3.2 [Deep learning-based recommender systems](#DL_rec_systems)
  * 4 [Modelling](#modelling)
    * 4.1 [Item2Vec and hierarchical Item2Vec models](#item2vec)
    * 4.2 [Hyperparameters](#hyperparameters)
  * 5 [Evaluation](#evaluation)
  * 6 [Conclusion](#conclusion)

## 1 Import libraries and Load data<a id='load_data'></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split

from itertools import permutations, chain

from HuffmanTree import HuffmanNode,build_huffman_tree,generate_codebook,visualize_huffman_tree
from CategoryTree import TreeNode, build_tree, add_to_node, build_category_tree, get_path
from parameters import Params
from ItemMap import ItemMap
from HierarchicalItem2Vec import HierarchicalItem2Vec, Trainer
from batch_tool import BatchToolItem # this is obsolete as of Sep 25, 2025

pd.options.mode.copy_on_write = True
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

print('torch version: ', torch.__version__)
print('numpy version: ', np.__version__)


torch version:  2.2.0
numpy version:  1.25.2


In [2]:
df_evt= pd.read_csv('events_df.csv')
df_cat= pd.read_csv('category_df.csv')

df_trs = df_evt[(df_evt['event'] == 'transaction') & (df_evt['categoryid']> -1)]
df_freq = df_trs.groupby('itemid').agg(frequency = pd.NamedAgg(column='itemid', aggfunc='size'),
                                      categoryid = pd.NamedAgg(column='categoryid',aggfunc= 'first')).reset_index()

## 2 Prepare data<a id='prepare_data'></a>

In [3]:
df_items = df_trs.groupby(['visitorid','session_by_day']).agg(items = pd.NamedAgg(column='itemid', aggfunc=list)).reset_index()
print(df_items.count())

df_filtered = df_items[df_items['items'].apply(len) > 1]
print(df_filtered.count())

visitorid         13027
session_by_day    13027
items             13027
dtype: int64
visitorid         3055
session_by_day    3055
items             3055
dtype: int64


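**To improve model performance, we prune the data: items that never appear in a session with more than one purchased item are dropped from the frequency table. Low-frequency or unused items would otherwise still occupy leaves in the Huffman tree and could indirectly degrade performance through the internal nodes they share with other items.**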

In [4]:
items = df_filtered['items'].tolist()

flat = list(chain.from_iterable(items))
set_items = set(flat)

print(df_freq.count())
df_freq = df_freq.loc[df_freq['itemid'].isin(set_items)]
df_freq.count()

itemid        11645
frequency     11645
categoryid    11645
dtype: int64


itemid        7547
frequency     7547
categoryid    7547
dtype: int64

In [5]:
test_items = [461686,119736,213834,7943,312728] # highest frequency items
do_test = True # when True, keep only pairs that involve at least one test item

inputs = []
outputs = []
for items_in_session in items:
    # every ordered pair of items in a session is one (target, context) example
    for target_id, context_id in permutations(items_in_session, 2):
        if do_test and target_id not in test_items and context_id not in test_items:
            continue
        inputs.append([target_id])
        outputs.append(context_id)
        
# Convert to torch.tensor
X = torch.tensor(inputs, dtype=torch.long)
y = torch.tensor(outputs, dtype=torch.long)

# Split using train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Wrap in TensorDataset
train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)
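
To make the pair construction above concrete, the following minimal sketch applies `itertools.permutations` to a single session. The session and its item IDs are hypothetical and chosen only for illustration.

```python
from itertools import permutations

# Hypothetical session of three purchased items (IDs are illustrative only).
session = [101, 202, 303]

# Every ordered pair of distinct positions becomes one (target, context) example.
pairs = list(permutations(session, 2))

print(pairs)
print(len(pairs))  # a session of n items yields n * (n - 1) ordered pairs
```

Note that `permutations` treats elements by position, not by value, so a session in which the same item appears twice also produces pairs where the target and the context are the same item.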

## 3 Overview of Recommender Systems<a id='overview'></a>

### 3.1 Two primary approaches<a id='approaches'></a>

*Recommender systems* suggest items to users based on their preferences, behavior, or similarity to other users. They are widely used on platforms such as Netflix, Amazon, Spotify, and YouTube. The primary goal is to predict a user’s preferences and recommend items they are likely to engage with, such as movies, products, news articles, or even people. There are two core types of recommender systems: **collaborative filtering** and **content-based filtering**.

**Collaborative filtering** can be further divided into:
* *User-based filtering*, which identifies users with similar preferences and recommends items those users have liked.
* *Item-based filtering*, which finds items similar to those the user has previously liked and recommends them.

Similarity between users or items is typically computed using distance or similarity measures such as *Euclidean distance, Pearson correlation*, or *cosine similarity*. It's important to note that in collaborative filtering, item similarity is not based on the inherent features of the items, but rather on user interaction patterns. Item-based collaborative filtering is particularly effective in systems where the number of users significantly exceeds the number of items. 
However, collaborative filtering has several limitations:

- Data sparsity: User-item interaction matrices are often sparse, making it difficult to identify meaningful patterns.
- Cold start: The system struggles to make recommendations for new users or new items due to a lack of interaction data.
- Scalability: Performance can degrade as the number of users or items grows.
- Popularity bias: Tends to over-recommend popular items, reducing personalization and diversity.
- Lack of Interpretability: Recommendations are based on patterns in user behavior, not explicit item attributes.
- Gray sheep problem: Users with unique or atypical preferences may receive poor recommendations.
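
As an illustration of item-based collaborative filtering, the sketch below computes cosine similarity and Euclidean distance between item columns of a small interaction matrix. The matrix is hypothetical; the point is that similarity comes purely from which users interacted with which items, not from item features.

```python
import math

# Hypothetical binary user-item interactions: rows are users, columns are items A, B, C.
interactions = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
]

def item_column(matrix, j):
    """Return the interaction vector of item j across all users."""
    return [row[j] for row in matrix]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention used here for items with no interactions
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

item_a, item_b, item_c = (item_column(interactions, j) for j in range(3))
sim_ab = cosine_similarity(item_a, item_b)  # A and B share two users
sim_ac = cosine_similarity(item_a, item_c)  # A and C share no users
dist_ab = euclidean_distance(item_a, item_b)

print(round(sim_ab, 3), round(sim_ac, 3), round(dist_ab, 3))
```

Items A and B are rated as similar only because the same users bought both, which is also why a new item with no interactions cannot be placed (the cold-start limitation listed above).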

**Content-based filtering**, on the other hand, recommends items that are similar to those a user has liked in the past. These similarities are determined from item attributes or features, such as categories, tags, genres, or other metadata. This approach can help address some of the limitations of collaborative filtering:

- No Need for Other Users' Data: Recommendations are based solely on the user’s own preferences and item features.
- Handles Cold Start (User Side): Can recommend items to new users after only a few interactions.
- Interpretable Recommendations.
- Less prone to recommending only popular items.
- Privacy-Friendly: Doesn’t require analyzing other users’ data.
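
A minimal content-based sketch, assuming each item is described by a set of tags: items are ranked by Jaccard similarity to an item the user liked. The catalog, the tag names, and the `recommend` helper are hypothetical, and Jaccard similarity is only one of several possible feature-based measures.

```python
# Hypothetical item metadata: each item is described by a set of tags.
item_tags = {
    "item_1": {"electronics", "audio", "wireless"},
    "item_2": {"electronics", "audio", "wired"},
    "item_3": {"kitchen", "appliance"},
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two tag sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(liked_item, catalog, top_k=2):
    """Rank all other items by tag overlap with the liked item."""
    liked = catalog[liked_item]
    scored = [
        (other, jaccard(liked, tags))
        for other, tags in catalog.items()
        if other != liked_item
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

ranking = recommend("item_1", item_tags)
print(ranking)
```

No other user's data is needed here, which is what makes the approach usable for a new user after a single interaction.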




### 3.2 Deep learning-based recommender systems<a id='DL_rec_systems'></a>   


In recent years, deep learning-based recommender systems have gained prominence because of their strong performance and their ability to model complex user-item interactions. Commonly cited advantages include:

- Better at Capturing Non-Linear and Complex Patterns
- Better at Cold Start and Sparse Data
- Effective Feature Representation (Embeddings): Can automatically learn *dense, low-dimensional representations (embeddings)* of users and items from sparse interaction data
- Integration of Multiple Data Types (Multimodal Inputs)
- Personalized and Context-Aware Recommendations: Deep models can learn user-specific behaviors, preferences, and temporal patterns.



## 4 Modelling <a id='modelling'></a>

### 4.1 Item2Vec and hierarchical Item2Vec models<a id='item2vec'></a>   

This project focuses on the Item2Vec model, which is based on the Word2Vec architecture originally developed for learning optimized word embeddings in natural language processing (NLP). By drawing an analogy between sequences of words and sequences of user-item interactions, Item2Vec learns dense vector representations of items that capture similarity and co-occurrence patterns. This results in more effective recommendations through meaningful item embeddings.

Item2Vec is an item-based collaborative filtering technique and, as such, inherits some of the limitations discussed in the previous section, such as cold-start problems and a lack of content awareness. To address these issues, we explore a hybrid approach called Hierarchical Item2Vec, which integrates hierarchical content-based information into the embedding process. This method helps mitigate the shortcomings of purely collaborative approaches by incorporating category-level knowledge.
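
For intuition about what the embeddings are trained to do, here is a sketch of the skip-gram negative-sampling loss for a single (target, context) pair, the objective commonly associated with Word2Vec and Item2Vec. It is written in plain Python for readability and is not the implementation in `HierarchicalItem2Vec`; the vectors and the function name `sgns_loss` are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(target_vec, context_vec, negative_vecs):
    """Negative-sampling loss for one (target, context) pair.

    The positive term rewards a large dot product with the observed context item;
    the negative term rewards small dot products with randomly sampled items.
    """
    positive_term = -math.log(sigmoid(dot(target_vec, context_vec)))
    negative_term = -sum(
        math.log(sigmoid(-dot(target_vec, neg))) for neg in negative_vecs
    )
    return positive_term + negative_term

# Illustrative 2-dimensional embeddings (not taken from the trained model).
target = [0.5, -0.2]
aligned_context = [1.0, -0.4]    # points in the same direction as the target
opposed_context = [-1.0, 0.4]    # points in the opposite direction
negatives = [[-0.3, 0.1], [0.0, 0.2]]

loss_aligned = sgns_loss(target, aligned_context, negatives)
loss_opposed = sgns_loss(target, opposed_context, negatives)
print(loss_aligned < loss_opposed)
```

Minimizing this loss pulls items that co-occur in sessions toward each other in the embedding space, which is what makes cosine similarity between item vectors meaningful afterwards.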

Next, we will explore and fine-tune the model architecture and its components.

### 4.2 Hyperparameters<a id='hyperparameters'></a>  

To optimize model performance, we consider the following **hyperparameters**:


* $\lambda_{cat}$ = {0, 0.1, 1.0, 10}
* Embedding dimension = {32, 64, 96, 128}
* Threshold for item frequency
* Learning rate = {0.005, 0.01, 0.05}
* Loss function = {Negative sampling, Hierarchical softmax}
* Similarity measure = {cosine similarity, distance}

| Cosine Similarity Score | Interpretation                                          |
| ----------------------- | ------------------------------------------------------- |
| **> 0.8**               | Very high similarity (strong recommendation candidates) |
| **0.6 – 0.8**           | High similarity (related items, likely of interest)     |
| **0.4 – 0.6**           | Moderate similarity (some shared context)               |
| **< 0.4**               | Low or weak similarity                                  |
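
The bands in the table can be applied programmatically, as in the sketch below. The table does not say how scores that fall exactly on a boundary are handled, so the rule used here (0.8 counts as "high", 0.6 as "high", 0.4 as "moderate") is an assumption. The example scores are the four neighbours reported for item 7943 in the final training output of this notebook.

```python
def similarity_band(score):
    """Map a cosine similarity score to an interpretation band.

    Boundary handling is an assumption: only scores strictly above 0.8 are
    "very high", and only scores strictly below 0.4 are "low".
    """
    if score > 0.8:
        return "very high"
    if score >= 0.6:
        return "high"
    if score >= 0.4:
        return "moderate"
    return "low"

# Neighbours of item 7943 after training, copied from the notebook output.
neighbours = [(318333, 0.733), (163689, 0.557), (318697, 0.546), (147879, 0.535)]

bands = [(item_id, similarity_band(score)) for item_id, score in neighbours]
for item_id, band in bands:
    print(item_id, band)
```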




#### 4.2.1 Regularization parameter $\lambda_{cat}$<a id='lambda_cat'></a> 

The value of $\lambda_{cat}$ regulates the strength of alignment between item embeddings and their respective category embeddings.

We consider the following values during hyperparameter tuning: $\lambda_{cat}$ = {0, 0.1, 1.0, 10}

| $\lambda_{\text{cat}}$ Value                 | Meaning / Effect                                                                 | When to Use                                                           |
| -------------------------------------------- | -------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| **0**                                        | No category alignment at all. Equivalent to Item2Vec.            | Baseline comparison; when category data is noisy or not useful.       |
| **1 $\times 10^{-4}$ to 1 $\times 10^{-3}$** | Very weak alignment. Minor influence from category embeddings.                   | Categories are somewhat useful, but item-level patterns dominate.     |
| **1 $\times 10^{-2}$ to 0.1**              | Moderate alignment. Balanced influence between item behavior and category info.  | Often a good starting point for tuning; works well in many scenarios. |
| **0.1 to 1.0**                           | Strong alignment. Item embeddings are pulled significantly toward category ones. | When categories are well-defined and strongly predictive.             |
| **>1.0**                                   | Very strong alignment. Item embeddings may lose individuality.                   | Only use if category structure is known to be highly reliable.        |
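
The effect of $\lambda_{cat}$ can be sketched with a toy loss. The exact form of the alignment term in `HierarchicalItem2Vec` is not shown in this notebook, so the version below, a prediction loss plus $\lambda_{cat}$ times the squared distance between an item embedding and its category embedding, is an assumption used only to illustrate how the weight scales the penalty.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def total_loss(prediction_loss, item_vec, category_vec, lambda_cat):
    """Assumed form: prediction loss + lambda_cat * ||item - category||^2."""
    return prediction_loss + lambda_cat * squared_distance(item_vec, category_vec)

# Illustrative values: an item embedding at distance 1 from its category embedding.
item_vec = [1.0, 0.0]
category_vec = [0.0, 0.0]
prediction_loss = 0.5

losses = {
    lam: total_loss(prediction_loss, item_vec, category_vec, lam)
    for lam in (0, 0.1, 1.0, 10)
}
for lam, value in losses.items():
    print(lam, value)
```

With $\lambda_{cat} = 0$ the penalty vanishes and only the prediction loss remains, which matches the "equivalent to Item2Vec" row of the table; with $\lambda_{cat} = 10$ the penalty dominates the total.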


#### 4.2.2 Embedding dimension <a id='dim_embedding'></a> 

In [6]:
# Huffman Tree
begin_index = 500000
imap = ItemMap(df_freq, df_cat)
itemmap = imap.dict_items
flat_itemmap = imap.flat_items
total_inner_nodes, huff_tree = build_huffman_tree(begin_index, None, flat_itemmap, None)
print(f'Total number of inner nodes : {total_inner_nodes}')

total number of items  7547  total number of categories  1670
Total number of inner nodes : 7546
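
The output above reports 7547 items and 7546 inner nodes, which is consistent with a binary Huffman tree: a tree with n leaves has n - 1 inner nodes. The sketch below builds Huffman codes with the standard library to show this on a toy example. It is not the project's `build_huffman_tree`, whose internals are not shown here; the function name `build_huffman_codes` and the frequencies are hypothetical.

```python
import heapq
from itertools import count

def build_huffman_codes(frequencies):
    """Return ({item: binary code}, number of inner nodes) for a frequency dict."""
    if not frequencies:
        raise ValueError("at least one item is required")
    tie = count()  # unique tie-breaker so the heap never compares node payloads
    heap = [(freq, next(tie), ("leaf", item)) for item, freq in frequencies.items()]
    heapq.heapify(heap)
    inner_nodes = 0
    # Repeatedly merge the two least frequent nodes into a new inner node.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        inner_nodes += 1
        heapq.heappush(heap, (f1 + f2, next(tie), ("inner", left, right)))
    # Walk the tree: going left appends "0", going right appends "1".
    codes = {}
    stack = [(heap[0][2], "")]
    while stack:
        node, prefix = stack.pop()
        if node[0] == "leaf":
            codes[node[1]] = prefix or "0"  # single-item tree gets code "0"
        else:
            stack.append((node[1], prefix + "0"))
            stack.append((node[2], prefix + "1"))
    return codes, inner_nodes

# Hypothetical item frequencies.
freqs = {"a": 50, "b": 20, "c": 20, "d": 10}
codes, inner_nodes = build_huffman_codes(freqs)
print(codes)
print(inner_nodes)
```

Frequent items receive short codes, so in hierarchical softmax they require fewer binary decisions per training example. Rare items sit deep in the tree and share many inner nodes with other items, which is the motivation for pruning unused items before building it.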


In [7]:
# Category Tree
root_code = 10000
cat_tree =  build_category_tree(root_code, df_cat, itemmap)

In [8]:
params = Params()
params.model_name = 'HierarchicalItem2Vec'
params.model_dir = "weights/{}".format(params.model_name)
params.dim_embedding = 60
params.lambda_cat = 0.2
params.batch_size = 2
params.n_epochs = 10 # 1 for test
os.makedirs(params.model_dir, exist_ok=True)

In [9]:
batchtool = BatchToolItem(imap, params)
hi2v = HierarchicalItem2Vec(imap, params, huff_tree, cat_tree)
optimizer = torch.optim.Adam(params = hi2v.parameters(), lr=0.02)

In [10]:
trainer = Trainer(
        model=hi2v,
        params=params,
        optimizer=optimizer,
        train_iter=train_dataset,
        valid_iter=val_dataset,
        map=imap,
        method =batchtool,
        debug = 0
    )
trainer.test_tokens = [461686,119736,213834,7943,312728]
trainer.train()


-----------
461686 || 398756 (0.438) 427472 (0.415) 206995 (0.415) 27412 (0.408) 

119736 || 314952 (0.478) 86844 (0.464) 98498 (0.453) 204494 (0.440) 

213834 || 209125 (0.490) 496 (0.433) 354345 (0.417) 118551 (0.405) 

7943 || 77991 (0.452) 289298 (0.445) 98708 (0.422) 375450 (0.414) 

312728 || 79714 (0.450) 6047 (0.411) 337162 (0.409) 412865 (0.408) 

-----------


Epoch: 1/10
     Train Loss: 0.74
     Valid Loss: 0.68
     Training Time (mins): 0.2

Epoch: 2/10
     Train Loss: 0.44
     Valid Loss: 0.68
     Training Time (mins): 0.2

Epoch: 3/10
     Train Loss: 0.39
     Valid Loss: 0.69
     Training Time (mins): 0.2

Epoch: 4/10
     Train Loss: 0.37
     Valid Loss: 0.68
     Training Time (mins): 0.2

Epoch: 5/10
     Train Loss: 0.35
     Valid Loss: 0.69
     Training Time (mins): 0.2

Epoch: 6/10
     Train Loss: 0.34
     Valid Loss: 0.69
     Training Time (mins): 0.2

Epoch: 7/10
     Train Loss: 0.33
     Valid Loss: 0.71
     Training Time (mins): 0.2

Epoch: 8/10
     Train Loss: 0.33
     Valid Loss: 0.71
     Training Time (mins): 0.2

Epoch: 9/10
     Train Loss: 0.33
     Valid Loss: 0.72
     Training Time (mins): 0.2

Epoch: 10/10
     Train Loss: 0.32
     Valid Loss: 0.72
     Training Time (mins): 0.2


-----------
461686 || 65215 (0.900) 422376 (0.895) 67423 (0.893) 124081 (0.885) 

119736 || 186702 (0.912) 210137 (0.892) 400077 (0.891) 151178 (0.881) 

213834 || 277833 (0.866) 130865 (0.862) 346892 (0.859) 345373 (0.856) 

7943 || 318333 (0.733) 163689 (0.557) 318697 (0.546) 147879 (0.535) 

312728 || 63899 (0.847) 46232 (0.817) 455763 (0.765) 217548 (0.763) 

-----------


## 5 Evaluation <a id='evaluation'></a>

Test item groups: high-frequency items, moderate-frequency items, and cold items.

For the problem at hand, ranking-based evaluation metrics such as Precision@k, Recall@k, NDCG, and MAP are the most appropriate.

* Precision at K
* Recall at K
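
The two metrics listed above can be sketched as follows. The function names and the example lists are hypothetical. Precision@k is computed here with k in the denominator, which is the usual convention even when fewer than k items are recommended.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k recommendations."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # convention used here when there is nothing to retrieve
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

# Hypothetical ranked recommendations and held-out relevant items for one user.
recommended = [10, 20, 30, 40, 50]
relevant = {20, 50, 60}

p3 = precision_at_k(recommended, relevant, 3)
r3 = recall_at_k(recommended, relevant, 3)
p5 = precision_at_k(recommended, relevant, 5)
r5 = recall_at_k(recommended, relevant, 5)
print(p3, r3, p5, r5)
```

In practice these values would be averaged over all evaluated users or sessions, and reported separately for the high-frequency, moderate-frequency, and cold item groups.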