# Feature engineering & Training data development

## Table of Contents
  * 1 [Import libraries and Load data](#load_data)
  * 2 [Training data development](#training_data)
  * 3 [Train-test split](#train-test_split)
  * 4 [Models](#models)

## 1 Import libraries and Load data<a id='load_data'></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split

from itertools import permutations


from HuffmanTree import HuffmanNode,build_huffman_tree,generate_codebook,visualize_huffman_tree
from CategoryTree import TreeNode, build_tree, add_to_node, get_path

pd.options.mode.copy_on_write = True
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

print('torch version: ', torch.__version__)
print('numpy version: ', np.__version__)

_debug_ = True

torch version:  2.2.0
numpy version:  1.25.2


In [2]:
df_evt= pd.read_csv('events_df.csv')
df_cat= pd.read_csv('category_df.csv')

In [3]:

df_trs = df_evt[(df_evt['event'] == 'transaction') & (df_evt['categoryid']> -1)]
df_freq = df_trs.groupby('itemid').agg(frequency = pd.NamedAgg(column='itemid', aggfunc='size'),
                                      categoryid = pd.NamedAgg(column='categoryid',aggfunc= 'first')).reset_index()

## 2 Training data development<a id='training_data'></a>

In [4]:
df_trs.head()

| Unnamed: 0 | visitorid | event | itemid | transactionid | date | session_by_day | categoryid |
|---|---|---|---|---|---|---|---|
| 333 | 172 | transaction | 465522 | 9725 | 2015-08-15 | 3 | 196 |
| 357 | 172 | transaction | 10034 | 9725 | 2015-08-15 | 3 | 1219 |
| 385 | 186 | transaction | 49029 | 8726 | 2015-08-12 | 1 | 579 |
| 495 | 264 | transaction | 161949 | 8445 | 2015-09-07 | 1 | 1421 |
| 499 | 264 | transaction | 459835 | 8445 | 2015-09-07 | 1 | 1421 |


**For efficient training and simplicity, only sessions in which more than one item was purchased are kept. This is a simplification: in reality, context-target item pairs are not always symmetric, and purchase order matters for related items; for example, a laptop is often bought before a charger. Single-item sessions, which are dropped here, would therefore be worth modeling in future work.**

In [5]:
df_items = df_trs.groupby(['visitorid','session_by_day']).agg(items = pd.NamedAgg(column='itemid', aggfunc=list)).reset_index()
print(df_items.count())

df_filtered = df_items[df_items['items'].apply(len) > 1]
print(df_filtered.count())

items = df_filtered['items'].tolist()

visitorid         13027
session_by_day    13027
items             13027
dtype: int64
visitorid         3055
session_by_day    3055
items             3055
dtype: int64
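The grouping and filtering step can be illustrated without pandas. The following is a minimal sketch in plain Python, using the five rows shown by `df_trs.head()` above; the variable names `transactions`, `sessions` and `multi_item` are illustrative and do not appear in the notebook.

```python
from collections import defaultdict

# (visitorid, session_by_day, itemid) taken from the five sample rows above
transactions = [
    (172, 3, 465522),
    (172, 3, 10034),
    (186, 1, 49029),
    (264, 1, 161949),
    (264, 1, 459835),
]

# Group item ids by (visitor, session), preserving row order
sessions = defaultdict(list)
for visitor, session, item in transactions:
    sessions[(visitor, session)].append(item)

# Keep only sessions with more than one purchased item
multi_item = {key: items for key, items in sessions.items() if len(items) > 1}
print(multi_item)
# {(172, 3): [465522, 10034], (264, 1): [161949, 459835]}
```

Visitor 186 bought a single item in the session, so that session is dropped, just as the `apply(len) > 1` filter does in the notebook.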


**Next, every ordered (target, context) pair is built from the items in each session. Because `permutations` emits both (a, b) and (b, a), each pair of items contributes one training example in each direction.**

In [6]:
inputs = []
outputs = []
for items_in_session in items:
    pairs = list(permutations(items_in_session, 2))
    for pair in pairs:
        target_id, context_id = pair[0],pair[1]
        inputs.append([target_id])
        outputs.append(context_id)
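To make the pair construction concrete, here is a small sketch of what `itertools.permutations` produces for one session. The item ids in `session` are hypothetical and chosen only for readability.

```python
from itertools import permutations

session = [101, 202, 303]  # hypothetical item ids from one session

pairs = list(permutations(session, 2))
print(pairs)
# [(101, 202), (101, 303), (202, 101), (202, 303), (303, 101), (303, 202)]

# Every ordered pair appears together with its reverse
assert all((b, a) in pairs for a, b in pairs)

# A session with n items yields n * (n - 1) ordered pairs
print(len(pairs))  # 6
```

A session with n items therefore contributes n * (n - 1) training examples, so large sessions dominate the data set quadratically.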

## 3 Train-test split<a id='train-test_split'></a>

In [7]:
# Convert to torch.tensor
X = torch.tensor(inputs, dtype=torch.long)
y = torch.tensor(outputs, dtype=torch.long)

In [8]:
print(X[:10])
print(y[:10])

tensor([[465522],
        [ 10034],
        [161949],
        [459835],
        [393144],
        [445559],
        [342086],
        [346661],
        [ 19278],
        [353548]])
tensor([ 10034, 465522, 459835, 161949, 445559, 393144, 346661, 342086, 353548,
         19278])


In [9]:
# Split using train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Wrap in TensorDataset
train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)

## 4 Models<a id='models'></a>

This project explores various recommender system models, with the Hierarchical Item2Vec model as the main focus. The model and supporting utility functions are imported from the corresponding source files.

- **ItemMap.py**: stores the category, item and frequency data and provides methods to look up indices and items/categories (TokenMap ~ ItemMap -> HuffmanTree).
- **HierarchicalItem2Vec.py**: the main model and its trainer. The model reduces to Item2Vec when `Params.lambda_cat` is set to zero.
- **parameters.py**: provides `Params`, the configuration object used below.
- **batch_tool.py**: provides `BatchToolItem`, which is still called but is obsolete.
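One way a weight such as `lambda_cat` can switch a category term on and off is sketched below. This is an assumption made for illustration only: the actual loss in `HierarchicalItem2Vec.py` is not shown in this notebook, and `combined_loss`, `item_loss` and `category_loss` are hypothetical names.

```python
def combined_loss(item_loss, category_loss, lambda_cat):
    """Hypothetical weighting: item term plus lambda_cat times the category term."""
    return item_loss + lambda_cat * category_loss


# With lambda_cat = 0 only the item term remains (plain Item2Vec behavior)
print(combined_loss(0.8, 0.5, 0.0))  # 0.8

# With lambda_cat > 0 the category term also contributes
print(combined_loss(0.8, 0.5, 0.3))
```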

In [10]:
from parameters import Params
from ItemMap import ItemMap
from HierarchicalItem2Vec import HierarchicalItem2Vec, Trainer
from batch_tool import BatchToolItem # this is obsolete as of Sep 25, 2025

In [11]:

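def build_category_tree(rootcode_, df_, itemmap_: dict):
    # itemmap_ maps each category id to a dict keyed by item index (imap.dict_items)
    tree_cat = TreeNode(rootcode_)
    # Use the rootcode_ argument rather than the global root_code
    build_tree(0, {rootcode_: tree_cat}, df_)
    # Attach every item index to the node of its category
    for catid, itemid in itemmap_.items():
        for itemidx in itemid.keys():
            add_to_node(tree_cat, catid, itemidx)
    return tree_cat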


In [12]:
# Huffman Tree
begin_index = 500000
imap = ItemMap(df_freq, df_cat)
itemmap = imap.dict_items
flat_itemmap = imap.flat_items
total_inner_nodes, huff_tree = build_huffman_tree(begin_index, None, flat_itemmap, None)
print(f'Total number of inner nodes : {total_inner_nodes}')

total number of items  11645  total number of categories  1670
Total number of inner nodes : 11644
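The counts printed above match a basic property of binary Huffman trees: a tree built over n leaves has n - 1 inner nodes (11,645 items, 11,644 inner nodes). The sketch below shows the construction on a toy frequency table. It is not the project's `build_huffman_tree`, whose implementation is not shown here; `huffman_sketch`, `codebook` and the frequencies are hypothetical.

```python
import heapq
from itertools import count


def huffman_sketch(frequencies):
    """Build a Huffman tree over {item: frequency}; return (root, inner node count)."""
    tie = count()  # unique tie-breaker so heapq never compares tree nodes
    heap = [(freq, next(tie), item) for item, freq in frequencies.items()]
    heapq.heapify(heap)
    inner = 0
    while len(heap) > 1:
        # Merge the two least frequent subtrees into a new inner node
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
        inner += 1
    return heap[0][2], inner


def codebook(node, prefix=""):
    """Assign '0' to left branches and '1' to right branches."""
    if not isinstance(node, tuple):  # leaf: an item label
        return {node: prefix or "0"}
    left, right = node
    codes = codebook(left, prefix + "0")
    codes.update(codebook(right, prefix + "1"))
    return codes


freqs = {"a": 5, "b": 2, "c": 1, "d": 1}  # hypothetical purchase counts
root, n_inner = huffman_sketch(freqs)
codes = codebook(root)
print(n_inner)  # 3, i.e. len(freqs) - 1
print(codes)    # frequent items get shorter codes
```

Frequently purchased items end up close to the root, so their paths through the tree are short.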


In [13]:
# Category Tree
root_code = 10000
cat_tree =  build_category_tree(root_code, df_cat, itemmap)

In [14]:
params = Params()
params.model_name = 'HierarchicalItem2Vec'
params.model_dir = "weights/{}".format(params.model_name)
params.n_epochs = 1 # 1 for test
os.makedirs(params.model_dir, exist_ok=True)

In [15]:
batchtool = BatchToolItem(imap, params)
hi2v = HierarchicalItem2Vec(imap, params, huff_tree, cat_tree)
optimizer = torch.optim.Adam(params = hi2v.parameters())

In [None]:
trainer = Trainer(
        model=hi2v,
        params=params,
        optimizer=optimizer,
        train_iter=train_dataset,
        valid_iter=val_dataset,
        map=imap,
        method =batchtool
    )
trainer.train()


-----------
464286 || 278363 (0.361) 285412 (0.350) 25 (0.342) 460187 (0.339) 

277689 || 92682 (0.330) 109589 (0.326) 423217 (0.314) 12905 (0.314) 

223985 || 272649 (0.346) 177036 (0.344) 345156 (0.341) 27897 (0.333) 

306380 || 369158 (0.360) 294210 (0.358) 389321 (0.356) 266900 (0.335) 

329766 || 415660 (0.373) 380196 (0.350) 170287 (0.348) 76617 (0.330) 

234958 || 349809 (0.359) 207494 (0.356) 214847 (0.342) 455708 (0.328) 

181687 || 83175 (0.348) 155589 (0.347) 272770 (0.332) 183758 (0.326) 

217218 || 432742 (0.351) 145403 (0.348) 149822 (0.333) 363578 (0.333) 

188455 || 427978 (0.353) 63406 (0.340) 38272 (0.339) 200242 (0.324) 

399926 || 86675 (0.333) 241716 (0.320) 260901 (0.315) 286645 (0.314) 

-----------


Epoch 1/1:  82%|███████████████████████████████████████████████████████████████████████████████████████▉                   | 5177/6300 [02:26<00:31, 36.14it/s, loss=0.773]

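During training, the trainer looks up the closest items for 10 randomly selected items so that their neighbors can be compared before and after training. Each line of the log above shows a query item, followed after `||` by four neighbors, each with its score in parentheses. At this stage, only one epoch is run to verify that the model executes correctly.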
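A nearest-neighbor lookup of this kind can be sketched as follows. The sketch assumes the scores are cosine similarities between item embeddings, which the notebook does not state; `cosine`, `nearest` and the 2-d vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(query, embeddings, k):
    """Return the k items most similar to `query`, excluding the query itself."""
    scores = [
        (other, cosine(embeddings[query], vec))
        for other, vec in embeddings.items()
        if other != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


embeddings = {  # hypothetical 2-d embeddings keyed by item id
    1: [1.0, 0.0],
    2: [0.9, 0.1],
    3: [0.0, 1.0],
    4: [-1.0, 0.0],
}

result = nearest(1, embeddings, k=2)
for item, score in result:
    print(f"{item} ({score:.3f})")
```

Running the same lookup on the same query items before and after training gives a quick qualitative check that the embeddings have moved.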

In [None]:
if _debug_:
    # Rebuild a plain DataLoader to inspect raw batches (no custom collate_fn)
    dataloader = DataLoader(
        train_dataset,
        batch_size=params.batch_size,
        shuffle=True,
        # collate_fn=batchtool.collate_fn
    )

    print("ItemMap check :", ' index(322295)=', imap.get_item_index(322295), ', dim(items)= ', imap.dim_items)
    print("DataLoader check : ")
    for i, batch in enumerate(dataloader, 1):
        # Batch-local names keep the `inputs`/`outputs` lists from In [6] intact
        batch_inputs, batch_outputs = batch
        flat_inputs = torch.flatten(batch_inputs).tolist()
        # Map raw item ids to embedding indices
        input_indices = imap.get_item_index(flat_inputs)
        torch_input = torch.tensor(input_indices)
        # batch_outputs is already a tensor, so convert it to a list directly
        output_indices = imap.get_item_index(batch_outputs.tolist())
        torch_output = torch.tensor(output_indices)
        print('\t', i, flat_inputs[:3], batch_outputs[:3], torch_input[:3], torch_output[:3])
        if i > 5:
            break

    item_embeddings = nn.Embedding(
            imap.dim_items, 
            params.dim_embedding
        )
    target_embs = item_embeddings(torch_input) 
    print(target_embs[0,:10])
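The id-to-index step in the check above is needed because `nn.Embedding` only accepts indices in the range `[0, num_embeddings)`. The raw item ids reach values such as 465522, while the table has only `imap.dim_items` rows. A minimal sketch of such a mapping follows. It only illustrates the idea and is not `ItemMap`'s actual implementation; `item_to_index` and `index_to_item` are hypothetical names.

```python
# Raw ids taken from the sample rows above; 10034 is repeated on purpose
raw_item_ids = [465522, 10034, 161949, 459835, 10034]

# Assign each distinct id the next free index, in order of first appearance
item_to_index = {}
for item_id in raw_item_ids:
    item_to_index.setdefault(item_id, len(item_to_index))

# Reverse mapping, to translate model output back to raw item ids
index_to_item = {idx: item for item, idx in item_to_index.items()}

indices = [item_to_index[i] for i in raw_item_ids]
print(indices)                  # [0, 1, 2, 3, 1]
print(index_to_item[1])         # 10034
print(len(item_to_index))       # 4 rows would be enough for the embedding table
```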