# Amazon KDD Cup 2023 - Task 1 - Next Product Recommendation 

![](https://images.aicrowd.com/raw_images/challenges/banner_file/1116/6c8fecd6d7c225b4ed11.jpg)

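This notebook contains instructions and an example submission. The random-prediction baseline is kept as commented-out code; the predictions here come from a Word2Vec model trained on the session data.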



## Installations 🤖

1. `aicrowd-cli` for downloading challenge data and making submissions
2. `pyarrow` for saving submissions to parquet
3. `gensim` for training the Word2Vec model

In [7]:
# !pip install aicrowd-cli pyarrow gensim

## Login to AIcrowd and download the data 📚

In [3]:
!aicrowd login

Please login here: https://api.aicrowd.com/auth/aXPlzA1oNs5LN8h8XfRt00wvjzao2w5e9Aki8TKFQOc
API Key valid
Gitlab access token valid
Saved details successfully!


In [4]:
!aicrowd dataset download --challenge task-1-next-product-recommendation

Files in the download (sizes as reported by the progress bars):
sessions_test_task1.csv: 19.4M
sessions_test_task2.csv: 1.92M
sessions_test_task3.csv: 2.67M
sessions_train.csv: 259M
products_train.csv: 589M

## Setup data and task information

In [8]:
import os
import numpy as np
import pandas as pd
from functools import lru_cache
from gensim.models import Word2Vec

## Config

In [9]:
debug = False

debug_session_num = 1000

In [27]:
train_data_dir = '.'
test_data_dir = '.'
task = 'task1'
PREDS_PER_SESSION = 100

model_dir = '../model_training'

w2v_model_file = os.path.join(model_dir, 'w2v.model')

In [11]:
# Cache loading of data for multiple calls

@lru_cache(maxsize=1)
def read_product_data():
    return pd.read_csv(os.path.join(train_data_dir, 'products_train.csv'))

@lru_cache(maxsize=1)
def read_train_data():
    return pd.read_csv(os.path.join(train_data_dir, 'sessions_train.csv'))

@lru_cache(maxsize=3)
def read_test_data(task):
    return pd.read_csv(os.path.join(test_data_dir, f'sessions_test_{task}.csv'))
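
One caveat with this caching pattern: `lru_cache` hands back the same object on every call, so an in-place edit to a returned DataFrame (such as overwriting a column) is visible to every later caller. The sketch below illustrates this with a plain dictionary standing in for the DataFrame; `load_table` is a hypothetical name, and copying the result before editing it is one way to avoid the issue.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_table():
    # Stand-in for an expensive pd.read_csv call (hypothetical example data)
    return {"prev_items": ["raw string"]}

first = load_table()
second = load_table()
same_object = first is second  # the cache returns the very same object

# Mutating one reference changes what every later call sees
first["prev_items"] = ["parsed", "list"]
leaked = load_table()["prev_items"]

# Working on a copy leaves the cached object untouched
safe = dict(load_table())
safe["prev_items"] = ["something else"]
still_cached = load_table()["prev_items"]
```

With pandas, the equivalent of `dict(...)` here would be calling `.copy()` on the returned DataFrame before modifying it.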

## Data Description

The Multilingual Shopping Session Dataset is a collection of **anonymized customer sessions** containing products from six different locales, namely English, German, Japanese, French, Italian, and Spanish. It consists of two main components: **user sessions** and **product attributes**. User sessions are a list of products that a user has engaged with in chronological order, while product attributes include various details like product title, price in local currency, brand, color, and description.

---

### Each product has the following associated information:


**locale**: the locale code of the product (e.g., DE)

**id**: a unique identifier for the product, also known as the Amazon Standard Identification Number (ASIN) (e.g., B07WSY3MG8)

**title**: title of the item (e.g., “Japanese Aesthetic Sakura Flowers Vaporwave Soft Grunge Gift T-Shirt”)

**price**: price of the item in local currency (e.g., 24.99)

**brand**: item brand name (e.g., “Japanese Aesthetic Flowers & Vaporwave Clothing”)

**color**: color of the item (e.g., “Black”)

**size**: size of the item (e.g., “xxl”)

**model**: model of the item (e.g., “iphone 13”)

**material**: material of the item (e.g., “cotton”)

**author**: author of the item (e.g., “J. K. Rowling”)

**desc**: description of an item’s key features and benefits, called out via bullet points (e.g., “Solid colors: 100% Cotton; Heather Grey: 90% Cotton, 10% Polyester; All Other Heathers …”)
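
The fields above can be pictured as one record per product. The dataclass below is only an illustrative sketch (the notebook itself keeps products in a pandas DataFrame): the `Product` class name is an assumption, as is the choice to let the descriptive attributes default to `None`. The example values are the ones quoted in the field list.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Product:
    locale: str
    id: str
    title: str
    price: float
    # Assumption: the remaining attributes may be absent for some products
    brand: Optional[str] = None
    color: Optional[str] = None
    size: Optional[str] = None
    model: Optional[str] = None
    material: Optional[str] = None
    author: Optional[str] = None
    desc: Optional[str] = None

example = Product(
    locale="DE",
    id="B07WSY3MG8",
    title="Japanese Aesthetic Sakura Flowers Vaporwave Soft Grunge Gift T-Shirt",
    price=24.99,
    brand="Japanese Aesthetic Flowers & Vaporwave Clothing",
    color="Black",
)
```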


## EDA 💽

In [12]:
def read_locale_data(locale, task):
    products = read_product_data().query(f'locale == "{locale}"')
    sess_train = read_train_data().query(f'locale == "{locale}"')
    sess_test = read_test_data(task).query(f'locale == "{locale}"')
    return products, sess_train, sess_test

def show_locale_info(locale, task):
    products, sess_train, sess_test = read_locale_data(locale, task)

    # prev_items is still the raw string read from the CSV at this point,
    # so these "lengths" are character counts, not numbers of items.
    train_l = sess_train['prev_items'].apply(len)
    test_l = sess_test['prev_items'].apply(len)

    print(f"Locale: {locale} \n"
          f"Number of products: {products['id'].nunique()} \n"
          f"Number of train sessions: {len(sess_train)} \n"
          f"Train session lengths - "
          f"Mean: {train_l.mean():.2f} | Median {train_l.median():.2f} | "
          f"Min: {train_l.min():.2f} | Max {train_l.max():.2f} \n"
          f"Number of test sessions: {len(sess_test)}"
        )
    if len(sess_test) > 0:
        print(
            f"Test session lengths - "
            f"Mean: {test_l.mean():.2f} | Median {test_l.median():.2f} | "
            f"Min: {test_l.min():.2f} | Max {test_l.max():.2f} \n"
        )
    print("======================================================================== \n")

In [13]:
products = read_product_data()
locale_names = products['locale'].unique()
for locale in locale_names:
    show_locale_info(locale, task)

Locale: DE 
Number of products: 518327 
Number of train sessions: 1111416 
Train session lengths - Mean: 57.89 | Median 40.00 | Min: 27.00 | Max 2060.00 
Number of test sessions: 104568
Test session lengths - Mean: 57.23 | Median 40.00 | Min: 27.00 | Max 700.00 


Locale: JP 
Number of products: 395009 
Number of train sessions: 979119 
Train session lengths - Mean: 59.61 | Median 40.00 | Min: 27.00 | Max 6257.00 
Number of test sessions: 96467
Test session lengths - Mean: 59.90 | Median 40.00 | Min: 27.00 | Max 1479.00 


Locale: UK 
Number of products: 500180 
Number of train sessions: 1182181 
Train session lengths - Mean: 54.85 | Median 40.00 | Min: 27.00 | Max 2654.00 
Number of test sessions: 115936
Test session lengths - Mean: 53.51 | Median 40.00 | Min: 27.00 | Max 872.00 


Locale: ES 
Number of products: 42503 
Number of train sessions: 89047 
Train session lengths - Mean: 48.82 | Median 40.00 | Min: 27.00 | Max 792.00 
Number of test sessions: 0

Locale: FR 
Number of produc

In [14]:
# products.sample(5)

In [15]:
train_sessions = read_train_data()
train_sessions.sample(5)

Unnamed: 0,prev_items,next_item,locale
359563,['B01EQZB0VM' 'B01E4IV354'],B01EQZB0T4,DE
2392170,['B07RSRWH6X' 'B01A84QLG4' 'B01A84QLG4'],B01A84QLS2,UK
1401460,['B09RJX9G5K' 'B09RJYNC43'],B002HEDFFE,JP
2806238,['B07PPGGPND' 'B07Z618HH2'],B07TLBRLNQ,UK
2941006,['B004SGXY00' 'B004SGO4DG' 'B004SGXY00' 'B004S...,B017HLFMMU,UK


In [38]:
test_sessions = read_test_data(task)
test_sessions.sample(5)

Unnamed: 0,prev_items,locale
230087,['1407191136' '1407186736' 'B09Z39N3ZF' 'B09BN...,UK
48344,['B09G9D34N2' 'B07FY2LPH4'],DE
129642,['B08F63JSZ7' 'B08F5Z379N' 'B095LC2LHR' 'B095H...,JP
278540,['B0BHVZXHW6' 'B0BG4TJBZ8' 'B0BF8QNWFL' 'B0B9J...,UK
315216,['B09JHGH2Q4' 'B00L67CZ9U' 'B08WCF99MF'],UK


In [17]:
if debug:
  train_sessions = train_sessions.sample(debug_session_num)

In [18]:
train_sessions.shape

(3606249, 3)

In [19]:
def process_item_lst(row):
    """Parse the raw `prev_items` string, e.g. "['A' 'B']", into a list of product ids."""
    prev_items = row['prev_items']
    res = [ele.replace('[', '').replace(']', '').replace('\n', '').replace("'", '').replace(' ', '') for ele in prev_items.split(' ')]
    return res

In [20]:
train_sessions['prev_items'] = train_sessions.apply(lambda row: process_item_lst(row), axis=1)
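
As an alternative to the chained `replace` calls, the ids can be pulled out of the raw string with a regular expression. This is a sketch, not the notebook's method: `parse_prev_items` is a hypothetical helper, and it assumes every id is wrapped in single quotes, as in the sampled rows shown earlier.

```python
import re

# Matches the text between a pair of single quotes
ID_PATTERN = re.compile(r"'([^']+)'")

def parse_prev_items(raw):
    """Return the quoted ids found in a string such as "['A' 'B']"."""
    return ID_PATTERN.findall(raw)

parsed = parse_prev_items("['B01EQZB0VM' 'B01E4IV354']")

# Line breaks inside the raw string do not affect the result
wrapped = parse_prev_items("['B07RSRWH6X' 'B01A84QLG4'\n 'B01A84QLG4']")
```

Because the pattern only looks at quoted text, brackets, spaces and newlines need no special handling.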

## Word2Vec

In [21]:
# train_sessions['prev_items'].to_list()

In [22]:
vector_size = 32
epochs = 10
sg = 1 # 1 for skip-gram
pop_thresh = 0.82415
window = 4

sentences = train_sessions['prev_items'].to_list()
len(sentences)

3606249

In [23]:
from gensim.models.callbacks import CallbackAny2Vec

class callback(CallbackAny2Vec):
    '''Callback to print loss after each epoch.'''

    def __init__(self):
        self.epoch = 0
        self.loss_to_be_subed = 0

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        loss_now = loss - self.loss_to_be_subed
        self.loss_to_be_subed = loss
        print('Loss after epoch {}: {}'.format(self.epoch, loss_now))
        self.epoch += 1

In [24]:
w2vec = Word2Vec(sentences=sentences, vector_size=vector_size, epochs=epochs, sg=sg,
                 min_count=1, workers=14, window=window,
                 compute_loss=True, callbacks=[callback()])

Loss after epoch 0: 11735244.0
Loss after epoch 1: 6392140.0
Loss after epoch 2: 3035380.0
Loss after epoch 3: 2696710.0
Loss after epoch 4: 2523156.0
Loss after epoch 5: 2180340.0
Loss after epoch 6: 2067096.0
Loss after epoch 7: 1779330.0
Loss after epoch 8: 1260252.0
Loss after epoch 9: 331948.0


In [28]:
w2vec.save(w2v_model_file)

In [30]:
# ! ls

In [31]:
# ! ls sample_data

## Generate Submission 🏋️‍♀️



Submission format:
1. The submission should be a **parquet** file with the sessions from all the locales.
2. Every predicted product id must be a valid product id of the session's locale.
3. Predictions should be added in a new column named **"next_item_prediction"**.
4. Each prediction should be a list of product ids as strings.
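
The rules above can be sketched as a small check on plain Python data. This is an illustration only, not the evaluator's code: `check_row` and `valid_ids_by_locale` are hypothetical names, and the toy catalogue reuses a few ids from the sampled sessions purely as example values.

```python
def check_row(row, valid_ids_by_locale):
    """Return a list of problems found in one submission row (empty if none)."""
    problems = []
    preds = row.get("next_item_prediction")
    if not isinstance(preds, list):
        return ["next_item_prediction must be a list"]
    if not all(isinstance(p, str) for p in preds):
        problems.append("every prediction must be a string id")
    valid_ids = valid_ids_by_locale.get(row["locale"], set())
    if any(p not in valid_ids for p in preds):
        problems.append("prediction outside the session's locale")
    return problems

# Toy catalogue (assumed for the example): locale -> set of product ids
valid_ids_by_locale = {
    "DE": {"B01EQZB0T4", "B01EQZB0VM"},
    "UK": {"B01A84QLS2"},
}

good_row = {"locale": "DE", "next_item_prediction": ["B01EQZB0T4", "B01EQZB0VM"]}
bad_row = {"locale": "DE", "next_item_prediction": ["B01A84QLS2"]}

good_problems = check_row(good_row, valid_ids_by_locale)
bad_problems = check_row(bad_row, valid_ids_by_locale)
```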

In [32]:
# def random_predicitons(locale, sess_test_locale):
#     random_state = np.random.RandomState(42)
#     products = read_product_data().query(f'locale == "{locale}"')
#     predictions = []
#     for _ in range(len(sess_test_locale)):
#         predictions.append(
#             list(products['id'].sample(PREDS_PER_SESSION, replace=True, random_state=random_state))
#         ) 
#     sess_test_locale['next_item_prediction'] = predictions
#     sess_test_locale.drop('prev_items', inplace=True, axis=1)
#     return sess_test_locale

In [33]:
test_sessions.head()

Unnamed: 0,prev_items,locale
0,['B08V12CT4C' 'B08V1KXBQD' 'B01BVG1XJS' 'B09VC...,DE
1,['B00R9R5ND6' 'B00R9RZ9ZS' 'B00R9RZ9ZS'],DE
2,['B07YSRXJD3' 'B07G7Q5N6G' 'B08C9Q7QVK' 'B07G7...,DE
3,['B08KQBYV43' '3955350843' '3955350843' '39553...,DE
4,['B09FPTCWMC' 'B09FPTQP68' 'B08HMRY8NG' 'B08TB...,DE


In [41]:
# test_sessions

In [39]:
test_sessions = read_test_data(task)

if debug:
  test_sessions = test_sessions.sample(debug_session_num)

test_sessions['prev_items'] = test_sessions.apply(lambda row: process_item_lst(row), axis=1)
test_sessions.shape

(316971, 2)

In [42]:
test_sessions.head()

Unnamed: 0,prev_items,locale
0,"[B08V12CT4C, B08V1KXBQD, B01BVG1XJS, B09VC5PKN...",DE
1,"[B00R9R5ND6, B00R9RZ9ZS, B00R9RZ9ZS]",DE
2,"[B07YSRXJD3, B07G7Q5N6G, B08C9Q7QVK, B07G7Q5N6G]",DE
3,"[B08KQBYV43, 3955350843, 3955350843, 395535086...",DE
4,"[B09FPTCWMC, B09FPTQP68, B08HMRY8NG, B08TBBQ4B...",DE


In [43]:
# predictions = []
# test_locale_names = test_sessions['locale'].unique()
# for locale in test_locale_names:
#     sess_test_locale = test_sessions.query(f'locale == "{locale}"').copy()
#     predictions.append(
#         random_predicitons(locale, sess_test_locale)
#     )
# predictions = pd.concat(predictions).reset_index(drop=True)
# predictions.sample(5)

In [44]:
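def get_predictions(row):
    prev_items = row['prev_items']
    try:
        # Rank items by similarity to the combined vector of the session's items
        similar = w2vec.wv.most_similar(positive=prev_items, topn=PREDS_PER_SESSION)
        res = [item for item, _ in similar]
    except KeyError:
        # An item is missing from the Word2Vec vocabulary:
        # fall back to the session's own items
        res = prev_items
    return res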
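
For intuition, `most_similar(positive=...)` can be thought of as ranking the vocabulary by cosine similarity to the mean of the unit-normalized query vectors, with the query items themselves left out of the result. The pure-Python sketch below follows that idea under stated assumptions: `most_similar_sketch` and the 2-d toy vectors are made up for the example and are not gensim's implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar_sketch(vectors, positive, topn=2):
    # Looking up an unseen key raises KeyError, like a missing vocabulary item
    queries = [unit(vectors[key]) for key in positive]
    dim = len(queries[0])
    mean = unit([sum(q[i] for q in queries) / len(queries) for i in range(dim)])
    # Cosine similarity of every other item to the mean query vector
    scores = {
        key: sum(a * b for a, b in zip(mean, unit(vec)))
        for key, vec in vectors.items()
        if key not in positive
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# Toy item vectors (hypothetical values)
toy_vectors = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.0, 1.0],
    "D": [0.7, 0.7],
}
ranked = most_similar_sketch(toy_vectors, positive=["A", "B"])
ranked_ids = [key for key, _ in ranked]
```

Here `D` ranks ahead of `C` because it points in a direction closer to the average of `A` and `B`.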

In [None]:
test_sessions['next_item_prediction'] = test_sessions.apply(lambda row: get_predictions(row), axis=1)

In [None]:
predictions = test_sessions[['locale', 'next_item_prediction']]

In [None]:
predictions

## Validate predictions ✅ 😄

In [None]:
def check_predictions(predictions, check_products=False):
    """
    These tests need to pass as they will also be applied on the evaluator
    """
    test_locale_names = test_sessions['locale'].unique()
    for locale in test_locale_names:
        sess_test = test_sessions.query(f'locale == "{locale}"')
        preds_locale = predictions[predictions['locale'] == sess_test['locale'].iloc[0]]
        assert sorted(preds_locale.index.values) == sorted(sess_test.index.values), f"Session ids of {locale} don't match"

        if check_products:
            # This check is not done on the evaluator
            # but you can run it to verify there is no mixing of products between locales
            # Since the ground truth next item will always belong to the same locale
            # Warning - This can be slow to run
            products = read_product_data().query(f'locale == "{locale}"')
            # explode() also handles prediction lists of differing lengths
            predicted_products = preds_locale["next_item_prediction"].explode().unique()
            assert np.all( np.isin(predicted_products, products['id']) ), f"Invalid products in {locale} predictions"

In [None]:
check_predictions(predictions)

In [None]:
# It's important that the parquet file you submit is saved with the pyarrow backend
predictions.to_parquet(f'submission_{task}.parquet', engine='pyarrow')

## Submit to AIcrowd 🚀

In [None]:
# You can submit with aicrowd-cli, or upload manually on the challenge page.
!aicrowd submission create -c task-1-next-product-recommendation -f "submission_task1.parquet"