# Candidate ReRank Model using Handcrafted Rules
In this notebook, we present a "candidate rerank" model using handcrafted rules. We can improve this model by engineering features, merging them onto items and users, and training a reranker model (such as XGB) to choose our final 20. Furthermore, to tune and improve this notebook, we should build a local CV scheme to experiment with new logic and/or models.

UPDATE: I published a notebook to compute validation score [here][10] using Radek's scheme described [here][11].

Note that in this competition, a "session" actually means a unique "user". So our task is to predict what each of the `1,671,803` test "users" (i.e. "sessions") will do in the future. For each test "user" (i.e. "session") we must predict what they will `click`, `cart`, and `order` during the remainder of the week-long test period.

### Step 1 - Generate Candidates
For each test user, we generate possible choices, i.e. candidates. In this notebook, we generate candidates from 5 sources:
* User history of clicks, carts, orders
* Most popular 20 clicks, carts, orders during test week
* Co-visitation matrix of click/cart/order to cart/order with type weighting
* Co-visitation matrix of cart/order to cart/order called buy2buy
* Co-visitation matrix of click/cart/order to clicks with time weighting
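
The sources listed above can be combined roughly as follows. This is a minimal sketch on toy data, not the notebook's actual code: `generate_candidates`, `covisit`, and `popular` are illustrative names, and a single dictionary stands in for the three co-visitation matrices.

```python
from collections import Counter
from itertools import chain

def generate_candidates(history, covisit, popular, k=20):
    """Return up to k candidate item ids for one session (illustrative only)."""
    # 1. User history, most recent first, duplicates removed.
    seen = list(dict.fromkeys(reversed(history)))
    # 2. Items that co-occur with anything in the history, most frequent first.
    neighbours = chain.from_iterable(covisit.get(aid, []) for aid in seen)
    ranked = [aid for aid, _ in Counter(neighbours).most_common() if aid not in seen]
    # 3. Popular items as a fallback, skipping anything already chosen.
    candidates = seen + ranked
    candidates += [aid for aid in popular if aid not in candidates]
    return candidates[:k]

covisit = {1: [2, 3], 2: [3, 4]}   # hypothetical toy co-visitation lookup
popular = [9, 8, 3]                # hypothetical popular items
print(generate_candidates([1, 2], covisit, popular, k=5))
```

History always comes first, co-visited items fill the remaining slots, and popular items are used only when the first two sources run short.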

### Step 2 - ReRank and Choose 20
Given the list of candidates, we must select 20 to be our predictions. In this notebook, we do this with a set of handcrafted rules. We can improve our predictions by training an XGBoost model to select for us. Our handcrafted rules give priority to:
* Most recent previously visited items
* Items previously visited multiple times
* Items previously in cart or order
* Co-visitation matrix of cart/order to cart/order
* Current popular items
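
As a rough illustration of how the first three priorities interact, the sketch below scores every item in a session by recency, repetition, and event type. The type multipliers are the ones used later in this notebook; the linear recency weight (`position / n`) and the name `rerank` are assumptions made for this example, whereas the notebook's own functions use log-spaced weights.

```python
from collections import Counter

# Type multipliers as used later in this notebook.
TYPE_WEIGHT = {"clicks": 1, "carts": 6, "orders": 3}

def rerank(events, k=20):
    """Score each item in a session's history and return the top k item ids.

    `events` is a list of (aid, type) tuples in chronological order.
    The recency weight is simply position / length (an assumption for this sketch).
    """
    n = len(events)
    scores = Counter()
    for position, (aid, event_type) in enumerate(events, start=1):
        # Later events get larger weights; repeated items accumulate score.
        scores[aid] += (position / n) * TYPE_WEIGHT[event_type]
    return [aid for aid, _ in scores.most_common(k)]

session = [(10, "clicks"), (11, "carts"), (10, "clicks"), (12, "clicks"), (10, "clicks")]
print(rerank(session, k=3))
```

In this toy session the carted item outranks an item clicked three times, because the cart multiplier outweighs the accumulated click weights.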

![](https://raw.githubusercontent.com/cdeotte/Kaggle_Images/main/Nov-2022/c_r_model.png)
  
# Credits
We thank the many Kagglers who have shared ideas. We use the co-visitation matrix idea from Vladimir [here][1], the groupby sort logic from Sinan in the comment section [here][4], the duplicate prediction removal logic from Radek [here][5], the multiple visit logic from Pietro [here][2], and the type weighting logic from Ingvaras [here][3]. We use leaky test data from my previous notebook [here][4]. Some ideas may have originated from Tawara [here][6] and KJ [here][7]. We use Colum2131's parquets [here][8]. The image above is from Ravi's discussion about candidate rerank models [here][9].

[1]: https://www.kaggle.com/code/vslaykovsky/co-visitation-matrix
[2]: https://www.kaggle.com/code/pietromaldini1/multiple-clicks-vs-latest-items
[3]: https://www.kaggle.com/code/ingvarasgalinskas/item-type-vs-multiple-clicks-vs-latest-items
[4]: https://www.kaggle.com/code/cdeotte/test-data-leak-lb-boost
[5]: https://www.kaggle.com/code/radek1/co-visitation-matrix-simplified-imprvd-logic
[6]: https://www.kaggle.com/code/ttahara/otto-mors-aid-frequency-baseline
[7]: https://www.kaggle.com/code/whitelily/co-occurrence-baseline
[8]: https://www.kaggle.com/datasets/columbia2131/otto-chunk-data-inparquet-format
[9]: https://www.kaggle.com/competitions/otto-recommender-system/discussion/364721
[10]: https://www.kaggle.com/cdeotte/compute-validation-score-cv-564
[11]: https://www.kaggle.com/competitions/otto-recommender-system/discussion/364991

# Notes
Below are notes about versions:
* **Version 1 LB 0.573** Uses popular ideas from public notebooks and adds additional co-visitation matrices and additional logic. Has CV `0.563`. See validation notebook version 2 [here][1].
* **Version 2 LB 0.573** Refactor logic for `suggest_buys(df)` to make it clear how the new co-visitation matrices rerank the candidates by adding to candidate weights. The new logic also boosts CV by `+0.0003`, and LB is slightly better. See validation notebook version 3 [here][1].
* **Version 3** is the same as version 2 but 1.5x faster co-visitation matrix computation!
* **Version 4 LB 0.575** Use top20 for clicks and top15 for carts and buys (instead of top40 and top40). This boosts CV by `+0.0015`, hooray! New CV is `0.5647`. See validation version 5 [here][1].
* **Version 5** is the same as version 4 but 2x faster co-visitation matrix computation! (and 3x faster than version 1)
* **Version 6** Stay tuned for more versions...

[1]: https://www.kaggle.com/code/cdeotte/compute-validation-score-cv-564

# Packages 

In [221]:

import pandas as pd, numpy as np
from tqdm.notebook import tqdm
import os, sys, pickle, glob, gc
from collections import Counter
import itertools
VER = 5
DISK_PIECES = 4


# Config 

In [267]:
DEBUG = False
final_submission = False
# whether for rerank retraining or validation
for_rerank_training = False
rec_num = 40


test_stage1_limit = True

if test_stage1_limit:
    assert for_rerank_training == False


file_path = '../data/cart_order_features.csv'
final_submission_unique_session_num = 1671803

model_version = 'candidate_v2_train1_data'

# debug_train_file_num = 40
# debug_test_session_num = 100
# type_labels = {'clicks':0, 'carts':1, 'orders':2}
# DISK_PIECES = 4
# model_dir = '../model_training/candiate_v1/'
model_dir = f'../model_training/{model_version}/'

# data_dir = '../data/parquet/val/test.parquet'
# submission_file = '../data/val_candidates.csv'

if final_submission:
    data_dir = '../data/parquet/test/*'
    submission_dir = '../submission/candidate_final_submission/'
#     submission_file = f'../data/{model_version}_test_submission.csv'
else:
    if for_rerank_training:
        data_dir = '../data/parquet/train2/*.parquet'
        submission_dir = '../submission/candidate_for_rerank_training/'

#         submission_file = f'../data/{model_version}_test_submission.csv'
    else:
        data_dir = '../data/parquet/val/*.parquet'
#         submission_file = f'../data/{model_version}_test_submission.csv'
        submission_dir = '../submission/candidate_for_validation/'
        if rec_num <= 20:
            submission_dir = '../submission/candidate_final_submission/'
        
    
    
# submission_file = os.path.join(submission_dir, f'{model_version}_test_submission.csv')
file_path = os.path.join(submission_dir, f"feature_recnum_{rec_num}_"+model_version+'_test_submission.csv')
submission_file = os.path.join(submission_dir, f"recnum_{rec_num}_"+model_version+'_test_submission.csv')

if test_stage1_limit:
    file_path += 'test_stage1_limit'
    submission_file += 'test_stage1_limit'




In [240]:
file_path

'../submission/candidate_for_validation/feature_recnum_40_candidate_v2_train1_data_test_submission.csvtest_stage1_limt'

In [241]:
# if final_submission:
# else:
#     if for_rerank_training:
#         file_path = '../data/cart_order_features.csv'
#     else:
#         file_path = '../data/test_cart_order_features.csv'



In [244]:
# ! tail -n 5 ../data/test_cart_order_features.csv


In [246]:
! ls {submission_dir}

candidate_v2_test_submission.csv
candidate_v2_train1_data_test_submission.csv
feature_recnum_100_candidate_v2_train1_data_test_submission.csv
recnum_100_candidate_v2_train1_data_test_submission.csv
recnum_40_candidate_v2_train1_data_test_submission.csvtest_stage1_limt


# Step 2 - ReRank (choose 20) using handcrafted rules
For a description of the handcrafted rules, read this notebook's intro.

## Process code 

In [247]:
def load_test(data_dir, debug=DEBUG):    
    dfs = []
    files = glob.glob(data_dir)
    if debug:
        files = files[:2]
    print(f'file number: {len(files)}')

    for e, chunk_file in enumerate(files):
        chunk = pd.read_parquet(chunk_file)
        chunk.ts = (chunk.ts/1000).astype('int32')
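        # NOTE: `type_labels` is commented out in the Config cell, so this branch would
        # raise a NameError if it were taken. In the recorded run it is skipped: the
        # output below still shows string types such as 'clicks'.
        if chunk['type'].dtype == str:
            chunk['type'] = chunk['type'].map(type_labels).astype('int8')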
        dfs.append(chunk)
    return pd.concat(dfs).reset_index(drop=True) #.astype({"ts": "datetime64[ms]"})

print(data_dir)
test_df = load_test(data_dir=data_dir)
print('Test data has shape',test_df.shape)
test_df.head()

../data/parquet/val/*.parquet
file number: 18
Test data has shape (7580968, 4)


Unnamed: 0,session,aid,ts,type
0,12613814,1221210,1661635407,clicks
1,12613815,1509935,1661635407,clicks
2,12613816,1647157,1661635407,clicks
3,12613816,121148,1661635479,clicks
4,12613816,991590,1661635579,clicks


In [28]:
if DEBUG or not final_submission:
    unique_session_num = len(test_df['session'].unique())
else:
    unique_session_num = final_submission_unique_session_num
unique_session_num

1783737

In [29]:
test_df.head()

Unnamed: 0,session,aid,ts,type
0,12613814,1221210,1661635407,clicks
1,12613815,1509935,1661635407,clicks
2,12613816,1647157,1661635407,clicks
3,12613816,121148,1661635479,clicks
4,12613816,991590,1661635579,clicks


## Load Model 

In [30]:
# TOP CLICKS AND ORDERS IN TEST
top_clicks = test_df.loc[test_df['type']=='clicks','aid'].value_counts().index.values[:rec_num]
top_orders = test_df.loc[test_df['type']=='orders','aid'].value_counts().index.values[:rec_num]

In [31]:
def pqt_to_dict(df):
    return df.groupby('aid_x').aid_y.apply(list).to_dict()
# LOAD THREE CO-VISITATION MATRICES
top_20_clicks = pqt_to_dict( pd.read_parquet(os.path.join(model_dir, f'top_20_clicks_v{VER}_0.pqt')) )
for k in range(1,DISK_PIECES): 
    top_20_clicks.update( pqt_to_dict( pd.read_parquet(os.path.join(model_dir, f'top_20_clicks_v{VER}_{k}.pqt')) ) )
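
`pqt_to_dict` expects a long table with one row per (`aid_x`, `aid_y`) pair and groups it into a lookup from each item to its list of neighbours. The plain-Python sketch below performs the same grouping on made-up rows; `pairs_to_dict` is a hypothetical helper, and the assumption that rows are already stored in rank order is mine, not something this notebook states.

```python
from collections import defaultdict

def pairs_to_dict(rows):
    """Group (aid_x, aid_y) pairs into {aid_x: [aid_y, ...]}, keeping row order."""
    neighbours = defaultdict(list)
    for aid_x, aid_y in rows:
        neighbours[aid_x].append(aid_y)
    return dict(neighbours)

rows = [(1, 7), (1, 5), (2, 9)]  # hypothetical co-visitation pairs
lookup = pairs_to_dict(rows)
print(lookup)
```

Because the pieces on disk are merged with `dict.update`, an `aid_x` that appeared in two pieces would keep only the list from the later piece, so this loading pattern relies on each `aid_x` living in exactly one piece.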

In [32]:
# top_20_clicks

In [33]:
%%time
top_20_buys = pqt_to_dict( pd.read_parquet(os.path.join(model_dir, f'top_15_carts_orders_v{VER}_0.pqt')) )
for k in range(1,DISK_PIECES): 
    top_20_buys.update( pqt_to_dict( pd.read_parquet(os.path.join(model_dir, f'top_15_carts_orders_v{VER}_{k}.pqt') )) )
top_20_buy2buy = pqt_to_dict( pd.read_parquet(os.path.join(model_dir, f'top_15_buy2buy_v{VER}_0.pqt')) )

print('Here are size of our 3 co-visitation matrices:')
print( len( top_20_clicks ), len( top_20_buy2buy ), len( top_20_buys ) )

Here are size of our 3 co-visitation matrices:
1686352 859067 1686352
CPU times: user 29.3 s, sys: 2.39 s, total: 31.7 s
Wall time: 33 s


In [34]:
type_weight_multipliers = {'clicks': 1, 'carts': 6, 'orders': 3}
# type_weight_multipliers = {0: 1, 1: 6, 2: 3}

def suggest_clicks(df):
    """
    Three parts
        1. unique_aids from this session
        2. aids from the click co-visitation matrix
        3. top_clicks from test data
    """
    # USER HISTORY AIDS AND TYPES
    aids=df.aid.tolist()
    types = df.type.tolist()
    unique_aids = list(dict.fromkeys(aids[::-1] ))
    # RERANK CANDIDATES USING WEIGHTS
    if len(unique_aids)>=rec_num:
        # most recent action has the highest score
        weights=np.logspace(0.1,1,len(aids),base=2, endpoint=True)-1
        aids_temp = Counter() 
        # RERANK BASED ON REPEAT ITEMS AND TYPE OF ITEMS
        for aid,w,t in zip(aids,weights,types): 
            aids_temp[aid] += w * type_weight_multipliers[t]
        sorted_aids = [k for k,v in aids_temp.most_common(rec_num)]
        return sorted_aids
    # USE "CLICKS" CO-VISITATION MATRIX get potential aids -> include all of the aid in one list
    aids2 = list(itertools.chain(*[top_20_clicks[aid] for aid in unique_aids if aid in top_20_clicks]))
    # RERANK CANDIDATES based on presence times
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2).most_common(rec_num) if aid2 not in unique_aids]
    # combined unique_aids and top_aids2
    result = unique_aids + top_aids2[:rec_num - len(unique_aids)]
    # If not enough, USE TOP20 TEST CLICKS 
    return result + list(top_clicks)[:rec_num-len(result)]

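def suggest_buys(df):
    """
    Three parts
        1. unique_aids from this session
        2. aids from the cart/order and buy2buy co-visitation matrices
        3. top_orders from test data
    """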
    # USER HISTORY AIDS AND TYPES
    aids=df.aid.tolist()
    types = df.type.tolist()
    # UNIQUE AIDS AND UNIQUE BUYS
    unique_aids = list(dict.fromkeys(aids[::-1] ))
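    # `type` holds strings in this notebook (see `type_weight_multipliers`), so comparing
    # against the integer codes 1 and 2 would always yield an empty frame.
    df = df.loc[df['type'].isin(['carts', 'orders'])]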
    unique_buys = list(dict.fromkeys( df.aid.tolist()[::-1] ))
    # RERANK CANDIDATES USING WEIGHTS
    if len(unique_aids)>=rec_num:
        weights=np.logspace(0.5,1,len(aids),base=2, endpoint=True)-1
        aids_temp = Counter() 
        # RERANK BASED ON REPEAT ITEMS AND TYPE OF ITEMS
        for aid,w,t in zip(aids,weights,types): 
            aids_temp[aid] += w * type_weight_multipliers[t]
        # RERANK CANDIDATES USING "BUY2BUY" CO-VISITATION MATRIX
        aids3 = list(itertools.chain(*[top_20_buy2buy[aid] for aid in unique_buys if aid in top_20_buy2buy]))
        for aid in aids3: 
            aids_temp[aid] += 0.1
        sorted_aids = [k for k,v in aids_temp.most_common(rec_num)]
        return sorted_aids
    # USE "CART ORDER" CO-VISITATION MATRIX
    aids2 = list(itertools.chain(*[top_20_buys[aid] for aid in unique_aids if aid in top_20_buys]))
    # USE "BUY2BUY" CO-VISITATION MATRIX
    aids3 = list(itertools.chain(*[top_20_buy2buy[aid] for aid in unique_buys if aid in top_20_buy2buy]))
    # RERANK CANDIDATES
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2+aids3).most_common(rec_num) if aid2 not in unique_aids] 
    result = unique_aids + top_aids2[:rec_num - len(unique_aids)]
    # USE TOP20 TEST ORDERS
    return result + list(top_orders)[:rec_num -len(result)]
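
The two functions above differ in how steeply they discount old events: `suggest_clicks` uses `np.logspace(0.1, 1, ...) - 1` and `suggest_buys` uses `np.logspace(0.5, 1, ...) - 1`. The sketch below reproduces those weights without NumPy so the effect is easy to inspect; `recency_weights` is a hypothetical helper written for this illustration.

```python
def recency_weights(n, start, stop=1.0, base=2.0):
    """Pure-Python equivalent of np.logspace(start, stop, n, base=base) - 1."""
    if n == 1:
        return [base ** start - 1]
    step = (stop - start) / (n - 1)
    return [base ** (start + i * step) - 1 for i in range(n)]

click_w = recency_weights(5, start=0.1)  # exponent range used in suggest_clicks
buy_w = recency_weights(5, start=0.5)    # exponent range used in suggest_buys
print([round(w, 3) for w in click_w])
print([round(w, 3) for w in buy_w])
```

In both cases the most recent event has weight 1. The oldest event weighs about 0.07 for clicks but about 0.41 for carts and orders, so the buy predictions lean more heavily on older history.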

In [237]:
def suggest_clicks_weight(df):
    """
    Three parts
        1. unique_aids from this session
        2. aids from the click co-visitation matrix
        3. top_clicks from test data
    """
    # USER HISTORY AIDS AND TYPES
    aids = df.aid.tolist()
    types = df.type.tolist()
    unique_aids = list(dict.fromkeys(aids[::-1]))

    # features
    type_weight_current_session = []
    click_covisitation_num = []

    # RERANK CANDIDATES USING WEIGHTS
    if len(unique_aids) >= rec_num:
        # most recent action has the highest score
        weights = np.logspace(0.1, 1, len(aids), base=2, endpoint=True) - 1
        aids_temp = Counter()
        # RERANK BASED ON REPEAT ITEMS AND TYPE OF ITEMS
        for aid, w, t in zip(aids, weights, types):
            aids_temp[aid] += w * type_weight_multipliers[t]
        sorted_aids = [k for k, v in aids_temp.most_common(rec_num)]
        type_weight_current_session += [v for k, v in aids_temp.most_common(rec_num)]
        # Return the same (aids, weights) pair as the fallback path below.
        return sorted_aids, type_weight_current_session
    # USE "CLICKS" CO-VISITATION MATRIX get potential aids -> include all of the aid in one list
    aids2 = list(itertools.chain(*[top_20_clicks[aid] for aid in unique_aids if aid in top_20_clicks]))
    # RERANK CANDIDATES based on presence times
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2).most_common(rec_num) if aid2 not in unique_aids]
    # combined unique_aids and top_aids2
    result = unique_aids + top_aids2[:rec_num - len(unique_aids)]
    # If not enough, USE TOP20 TEST CLICKS
    final_aid_lst = result + list(top_clicks)[:rec_num - len(result)]
    type_weight_current_session += [0] * (rec_num - len(type_weight_current_session))
    assert len(final_aid_lst) == len(
        type_weight_current_session), f"{len(final_aid_lst)} VS {len(type_weight_current_session)}"
    return final_aid_lst, type_weight_current_session


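def buys_features(df):
    """
    Build cart/order candidates for one session plus per-candidate features.

    Returns (candidate aids, type weights, cart/order co-visitation counts,
    buy2buy co-visitation counts), all aligned by position.
    """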
    # USER HISTORY AIDS AND TYPES
    aids = df.aid.tolist()
    types = df.type.tolist()
    if test_stage1_limit:
#         print(df['ground_truth'].values)
        ground_truth = df['ground_truth'].values[0]
    # UNIQUE AIDS AND UNIQUE BUYS
    unique_aids = list(dict.fromkeys(aids[::-1]))
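    # `type` holds strings in this notebook (see `type_weight_multipliers`), so comparing
    # against the integer codes 1 and 2 would always yield an empty frame.
    df = df.loc[df['type'].isin(['carts', 'orders'])]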
    unique_buys = list(dict.fromkeys(df.aid.tolist()[::-1]))
    # RERANK CANDIDATES USING WEIGHTS

    # features
    type_weight_current_session = []
    click_covisitation_num = []
    # if len(unique_aids) >= rec_num:
    weights = np.logspace(0.5, 1, len(aids), base=2, endpoint=True) - 1
    type_weight_dict = Counter()
    # RERANK BASED ON REPEAT ITEMS AND TYPE OF ITEMS
    for aid, w, t in zip(aids, weights, types):
        type_weight_dict[aid] += w * type_weight_multipliers[t]
        # sorted_aids = [k for k, v in type_weight_dict.most_common(rec_num)]
        # type_weight_current_session += [v for k, v in type_weight_dict.most_common(rec_num)]
        # return sorted_aids

    # USE "CART ORDER" CO-VISITATION MATRIX
    aids2 = list(itertools.chain(*[top_20_buys[aid] for aid in unique_aids if aid in top_20_buys]))
    # USE "BUY2BUY" CO-VISITATION MATRIX
    aids3 = list(itertools.chain(*[top_20_buy2buy[aid] for aid in unique_buys if aid in top_20_buy2buy]))
    cart_order_num_counter  = Counter(aids2)
    buy_buy_num_counter = Counter(aids3)
    # RERANK CANDIDATES
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2 + aids3).most_common(rec_num) if aid2 not in unique_aids]

    # get the candidate
    result = unique_aids + top_aids2[:rec_num - len(unique_aids)]
    # USE TOP20 TEST ORDERS
    final_aid_lst = result + list(top_orders)[:rec_num - len(result)]
    if test_stage1_limit:
#         print(ground_truth)
#         print(set(ground_truth))
        more_lst = list(set(ground_truth) - set(final_aid_lst))
        final_aid_lst = more_lst + final_aid_lst
        if len(more_lst)>0:
            assert len(final_aid_lst) > rec_num, f"{len(final_aid_lst)}; {len(more_lst)}"

    type_weight_current_session = [round(type_weight_dict.get(aid, 0), 2) for aid in final_aid_lst]
    card_order_num = [cart_order_num_counter.get(aid, 0) for aid in final_aid_lst]
    buy_buy_num = [buy_buy_num_counter.get(aid, 0) for aid in final_aid_lst]
    
    #     # += [0] * (rec_num - len(type_weight_current_session))
    # if sum(type_weight_current_session) > 0:
    #     print(type_weight_current_session)
    # assert len(final_aid_lst) == len(
    #     type_weight_current_session), f"{len(final_aid_lst)} VS {len(type_weight_current_session)}"

    return final_aid_lst, type_weight_current_session, card_order_num, buy_buy_num
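
The key property of `buys_features` is that every feature list is aligned by position with the candidate list, with 0 filled in for candidates a given source never proposed. The sketch below shows that alignment on made-up values; all ids and scores here are hypothetical.

```python
from collections import Counter

candidates = [101, 102, 103]              # hypothetical candidate ids
history_score = Counter({101: 2.5})       # hypothetical history-based scores
covisit_hits = Counter([102, 102, 103])   # how often each id was proposed by a matrix

features = {
    # A candidate that never appeared in the session history gets 0.
    "type_weight": [round(history_score.get(aid, 0), 2) for aid in candidates],
    # A candidate that no co-visitation list proposed gets 0.
    "cart_order_num": [covisit_hits.get(aid, 0) for aid in candidates],
}
print(features)
```

Keeping the lists aligned makes it possible to explode them later into one row per (session, candidate) pair for a reranker.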

In [212]:
test_df.head()

Unnamed: 0,session,aid,ts,type,ground_truth
0,12613814,1221210,1661635407,clicks,[185543]
1,12613815,1509935,1661635407,clicks,"[1359971, 1462420, 486086, 1509372]"
2,12613816,1647157,1661635407,clicks,[1309497]
3,12613816,121148,1661635479,clicks,[1309497]
4,12613816,991590,1661635579,clicks,[1309497]


In [213]:
print(test_df.shape)
# if DEBUG:
#     sample_session = np.random.choice(test_df['session'].unique(), debug_test_session_num, replace=False)
#     test_df = test_df[test_df['session'].isin(sample_session)]
print(test_df.shape)

(7580968, 5)
(7580968, 5)


In [214]:
assert len(test_df['session'].unique()) == unique_session_num

# Save Features -> only carts labels

In [248]:
if test_stage1_limit:
    def get_all_ground_truth(group):
        import itertools
        return list(set(itertools.chain(*group['ground_truth'].values)))
    from pathlib import Path
    import pandas as pd

    data_dir = Path('../data/parquet/val_label/')
    test_labels = pd.concat(
        pd.read_parquet(parquet_file)
        for parquet_file in data_dir.glob('*.parquet')
    )
    combined_test_labels = (
        test_labels.groupby('session')
        .apply(get_all_ground_truth)
        .reset_index()
        .rename(columns={0: 'ground_truth'})
    )
    test_df = test_df.merge(combined_test_labels[['session', 'ground_truth']], how='left', on='session')
    assert test_df.isna().sum().sum() == 0
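
The cell above collapses the label rows of each session into a single de-duplicated `ground_truth` list, regardless of event type. A plain-Python sketch of that union follows; `union_ground_truth` and the toy rows are illustrative, and the result is sorted here only to make the output deterministic (the notebook keeps whatever order `set` yields).

```python
def union_ground_truth(label_rows):
    """Merge several label lists per session into one de-duplicated list."""
    merged = {}
    for session, aids in label_rows:
        merged.setdefault(session, set()).update(aids)
    return {session: sorted(aids) for session, aids in merged.items()}

# Hypothetical rows: one (session, ground_truth) pair per session and event type.
label_rows = [(1, [5, 6]), (1, [6, 7]), (2, [9])]
print(union_ground_truth(label_rows))
```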

In [249]:
test_df.isna().sum()

session         0
aid             0
ts              0
type            0
ground_truth    0
dtype: int64

In [250]:
feature_df = test_df.sort_values(["session", "ts"]).groupby(["session"]).apply(
    lambda x: buys_features(x)
)

In [251]:
feature_df

session
11098528    ([990658, 950341, 1679529, 1462506, 1561739, 9...
11098529    ([1105029, 1049489, 459126, 333991, 1307369, 1...
11098530    ([409236, 264500, 1603001, 963957, 254154, 752...
11098531    ([396199, 1271998, 452188, 1728212, 1365569, 6...
11098532    ([1596491, 876469, 7651, 108125, 1159379, 1202...
                                  ...                        
12899774    ([33035, 1539309, 819288, 95488, 270852, 74397...
12899775    ([1743151, 1760714, 1163166, 1255910, 1498443,...
12899776    ([1737908, 548599, 695829, 773354, 447841, 144...
12899777    ([384045, 1308634, 395762, 1688215, 1486067, 7...
12899778    ([561560, 1167224, 32070, 971566, 1068655, 139...
Length: 1783737, dtype: object

In [252]:
feature_df = pd.DataFrame(feature_df.add_suffix("_carts"), columns=["labels"]).reset_index()

In [253]:
feature_df.head()

Unnamed: 0,session,labels
0,11098528_carts,"([990658, 950341, 1679529, 1462506, 1561739, 9..."
1,11098529_carts,"([1105029, 1049489, 459126, 333991, 1307369, 1..."
2,11098530_carts,"([409236, 264500, 1603001, 963957, 254154, 752..."
3,11098531_carts,"([396199, 1271998, 452188, 1728212, 1365569, 6..."
4,11098532_carts,"([1596491, 876469, 7651, 108125, 1159379, 1202..."


In [254]:
add_df = pd.DataFrame(feature_df['labels'].tolist(), index=feature_df.index)[[0, 1, 2, 3]].rename(columns={
    0: 'labels',
    1: 'type_weight',
    2: 'cart_order_num',
    3: 'buy_buy_num'
})

In [255]:
feature_df = feature_df[['session']].merge(add_df, left_index=True, right_index=True)

In [256]:
feature_df.head()

Unnamed: 0,session,labels,type_weight,cart_order_num,buy_buy_num
0,11098528_carts,"[990658, 950341, 1679529, 1462506, 1561739, 90...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.41, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,11098529_carts,"[1105029, 1049489, 459126, 333991, 1307369, 16...","[0.41, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,11098530_carts,"[409236, 264500, 1603001, 963957, 254154, 7523...","[8.23, 0.93, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,11098531_carts,"[396199, 1271998, 452188, 1728212, 1365569, 62...","[4.2, 5.1, 3.72, 4.8, 2.36, 0.77, 0.75, 1.03, ...","[6, 6, 2, 6, 6, 1, 0, 0, 6, 0, 0, 3, 2, 2, 2, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,11098532_carts,"[1596491, 876469, 7651, 108125, 1159379, 12026...","[0, 1.0, 0.41, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...","[0, 0, 0, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [257]:
for col in ['labels', 'type_weight', 'cart_order_num', 'buy_buy_num']:
    feature_df[col] = feature_df[col].apply(lambda x: " ".join(map(str,x)))
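
Each list is stored as a single space-separated string, keyed by a `session_type` value such as `11098528_carts`. The sketch below shows that round trip for the `labels` column; both helper names are hypothetical, and the float-valued `type_weight` column would need `float` rather than `int` when parsing.

```python
def to_submission_row(session, event_type, aids):
    """Build one (session_type, labels) row with space-separated item ids."""
    return f"{session}_{event_type}", " ".join(map(str, aids))

def parse_labels(labels):
    """Inverse of the join: turn the stored string back into a list of ints."""
    return [int(token) for token in labels.split()]

row = to_submission_row(11098528, "carts", [990658, 950341])
print(row)
print(parse_labels(row[1]))
```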

In [258]:
feature_df.shape

(1783737, 5)

In [259]:
feature_df = feature_df.rename(columns={'session': 'session_type'})

In [260]:
feature_df.head()

Unnamed: 0,session_type,labels,type_weight,cart_order_num,buy_buy_num
0,11098528_carts,990658 950341 1679529 1462506 1561739 907564 3...,0 0 0 0 0 0 0 0 0 0 0 0.41 0 0 0 0 0 0 0 0 0 0...,0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 ...,0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
1,11098529_carts,1105029 1049489 459126 333991 1307369 1632356 ...,0.41 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 ...,0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
2,11098530_carts,409236 264500 1603001 963957 254154 752334 364...,8.23 0.93 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...,1 1 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 0 0 0 0 ...,0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
3,11098531_carts,396199 1271998 452188 1728212 1365569 624163 1...,4.2 5.1 3.72 4.8 2.36 0.77 0.75 1.03 0.55 0.52...,6 6 2 6 6 1 0 0 6 0 0 3 2 2 2 2 2 2 2 2 2 2 1 ...,0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
4,11098532_carts,1596491 876469 7651 108125 1159379 1202618 779...,0 1.0 0.41 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,0 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...,0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...


In [265]:
file_path

'../submission/candidate_for_validation/feature_recnum_40_candidate_v2_train1_data_test_submission.csvtest_stage1_limt'

In [266]:
! ls {file_path}

../submission/candidate_for_validation/feature_recnum_40_candidate_v2_train1_data_test_submission.csvtest_stage1_limt



In [262]:
feature_df.to_csv(file_path, index=False)

In [263]:
! ls ../submission/candidate_final_submission

candidate_v2_test_submission.csv
candidate_v2_train1_data_test_submission.csv


In [264]:
! ls -al {file_path}

-rw-r--r--@ 1 hua  staff  1014475380 Jan 28 23:29 ../submission/candidate_for_validation/feature_recnum_40_candidate_v2_train1_data_test_submission.csvtest_stage1_limt



# Submission file

In [43]:
submission_file

'../submission/candidate_for_validation/recnum_40_candidate_v2_train1_data_test_submission.csvtest_stage1_limt'

In [44]:
# ! rm {submission_file}

In [45]:
%%time
pred_df_clicks = test_df.sort_values(["session", "ts"]).groupby(["session"]).apply(
    lambda x: suggest_clicks(x)
)

CPU times: user 1min 24s, sys: 5.67 s, total: 1min 29s
Wall time: 1min 32s


In [46]:
pred_df_buys = test_df.sort_values(["session", "ts"]).groupby(["session"]).apply(
    lambda x: suggest_buys(x)
)

In [47]:
assert len(pred_df_clicks.index.unique()) == unique_session_num

In [48]:
assert len(pred_df_buys.index.unique()) == unique_session_num

In [49]:
clicks_pred_df = pd.DataFrame(pred_df_clicks.add_suffix("_clicks"), columns=["labels"]).reset_index()
orders_pred_df = pd.DataFrame(pred_df_buys.add_suffix("_orders"), columns=["labels"]).reset_index()
carts_pred_df = pd.DataFrame(pred_df_buys.add_suffix("_carts"), columns=["labels"]).reset_index()

## get pred_df 

In [110]:
pred_df = pd.concat([clicks_pred_df, orders_pred_df, carts_pred_df])

In [111]:
pred_df.shape

(5351211, 2)

In [112]:
pred_df.head()

Unnamed: 0,session,labels
0,11098528_clicks,"[11830, 588923, 1732105, 1157882, 571762, 8845..."
1,11098529_clicks,"[1105029, 459126, 1049489, 1339838, 1383767, 1..."
2,11098530_clicks,"[409236, 264500, 1603001, 963957, 254154, 5830..."
3,11098531_clicks,"[396199, 1271998, 452188, 1728212, 1365569, 62..."
4,11098532_clicks,"[876469, 7651, 108125, 1159379, 1202618, 77906..."


In [113]:
pred_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5351211 entries, 0 to 1783736
Data columns (total 2 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   session  object
 1   labels   object
dtypes: object(2)
memory usage: 122.5+ MB


In [120]:
if test_stage1_limit:

    
    def get_session_type(row):
        session = str(row['session'])
        type_value = row['type']
        return session + '_' + type_value
    def get_final_label(row):
        labels = row['labels']
        ground_truth = row['ground_truth']
        # After the left merge, rows without a matching label hold NaN (a float).
        if isinstance(ground_truth, float):
            return labels
        # Prepend any ground-truth aids that are missing from the candidates.
        more_lst = list(set(ground_truth) - set(labels))
        return more_lst + labels


    # import polars as pl
    # test_label_file = '../data/parquet/val_label/*.parquet'
    # test_labels = pd.read_parquet(test_label_file)
    from pathlib import Path
    import pandas as pd

    data_dir = Path('../data/parquet/val_label/')
    test_labels = pd.concat(
        pd.read_parquet(parquet_file)
        for parquet_file in data_dir.glob('*.parquet')
    )
    test_labels['session_type'] = test_labels.apply(lambda row: get_session_type(row), axis=1)
    pred_df = pred_df.merge(test_labels[['session_type', 'ground_truth']], how='left', left_on='session', right_on='session_type')
    pred_df['labels'] = pred_df.apply(lambda row: get_final_label(row), axis=1)
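
When `test_stage1_limit` is set, any ground-truth items the candidate stage missed are prepended to the candidate list, which is why some lists end up longer than `rec_num`. The sketch below isolates that step; `add_missing_labels` is a hypothetical name, and unlike the set difference used above it keeps the ground-truth order so the result is deterministic.

```python
def add_missing_labels(candidates, ground_truth):
    """Prepend ground-truth ids that the candidate list missed."""
    if not ground_truth:                    # no labels for this session/type
        return list(candidates)
    present = set(candidates)
    # dict.fromkeys de-duplicates while preserving the original order.
    missing = [aid for aid in dict.fromkeys(ground_truth) if aid not in present]
    return missing + list(candidates)

print(add_missing_labels([1, 2, 3], [3, 4]))
print(add_missing_labels([1, 2, 3], None))
```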


In [122]:
pred_df.sample(10)

Unnamed: 0,session,labels,session_type,ground_truth
4820097,12364546_carts,"[1854177, 638324, 197342, 26689, 969261, 12527...",,
4931870,12477131_carts,"[876469, 1159379, 1202618, 77906, 669555, 9702...",,
451887,11557823_clicks,"[1794384, 463233, 1298975, 1063114, 1176161, 8...",11557823_clicks,[1794384]
2062675,11377941_orders,"[259480, 11810, 13493, 158597, 160821, 181302,...",,
2280090,11603475_orders,"[1119142, 1489275, 349016, 624677, 1689819, 89...",,
4569839,12112609_carts,"[861506, 1290509, 1312341, 523700, 849257, 171...",,
1983408,11298498_orders,"[44793, 574728, 1357500, 1016908, 1022050, 406...",,
2491770,11816730_orders,"[305933, 1175442, 1406660, 1155739, 1147608, 7...",,
3737927,11269221_carts,"[799822, 1349048, 1765954, 100247, 701016, 142...",,
2182585,11502666_orders,"[1590709, 1327520, 1229555, 1148482, 956562, 1...",,


In [123]:
pred_df = pred_df[['session', 'labels']]

In [124]:
num_lst = pred_df['labels'].apply(len)

In [125]:
max(num_lst)

125

In [126]:
num_lst.describe()

count    5.351211e+06
mean     4.023317e+01
std      6.872194e-01
min      4.000000e+01
25%      4.000000e+01
50%      4.000000e+01
75%      4.000000e+01
max      1.250000e+02
Name: labels, dtype: float64

In [61]:
# explode_df = pred_df.explode('labels')


In [62]:
# explode_df.head()

In [63]:
# explode_df.shape

In [64]:
# def get_session_type(row):
#     original_session = row['session']
#     session, type_str = original_session.split('_')
#     session = int(session)
#     type_int = type_labels[type_str]
#     return session, type_int

In [65]:
# pred_df.to_parquet('../data/candidate_submission.parquet')

## Save Submission CSV
Inferring test data with Pandas groupby is slow. We need to accelerate the following code.
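
One possible direction, sketched below under the assumption that the events fit in memory as plain tuples, is to sort once and walk the sessions with `itertools.groupby` instead of calling `DataFrame.groupby(...).apply(...)`. The function `predict_per_session` and the toy `last_aid` rule are hypothetical, and whether this is actually faster on the real data has not been measured here.

```python
from itertools import groupby
from operator import itemgetter

def predict_per_session(rows, suggest):
    """Apply `suggest` to each session's events without a pandas groupby.

    `rows` is an iterable of (session, ts, aid, type) tuples.
    """
    ordered = sorted(rows, key=itemgetter(0, 1))   # sort by session, then ts
    predictions = {}
    for session, events in groupby(ordered, key=itemgetter(0)):
        predictions[session] = suggest(list(events))
    return predictions

def last_aid(events):
    """Toy stand-in for suggest_clicks: return the most recent item only."""
    return [events[-1][2]]

rows = [(2, 5, 30, "clicks"), (1, 9, 11, "carts"), (1, 3, 10, "clicks")]
print(predict_per_session(rows, last_aid))
```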

In [128]:
submission_dir

'../submission/candidate_for_validation/'

In [129]:
submission_file

'../submission/candidate_for_validation/recnum_40_candidate_v2_train1_data_test_submission.csvtest_stage1_limt'

In [130]:
pred_df.head()

Unnamed: 0,session,labels
0,11098528_clicks,"[1679529, 11830, 588923, 1732105, 1157882, 571..."
1,11098529_clicks,"[1105029, 459126, 1049489, 1339838, 1383767, 1..."
2,11098530_clicks,"[409236, 264500, 1603001, 963957, 254154, 5830..."
3,11098531_clicks,"[396199, 1271998, 452188, 1728212, 1365569, 62..."
4,11098532_clicks,"[1596491, 876469, 7651, 108125, 1159379, 12026..."


In [131]:
# Copy first: pred_df is a column slice of the merged frame, and assigning into that
# slice directly triggers pandas' SettingWithCopyWarning.
pred_df = pred_df.copy()
pred_df.columns = ["session_type", "labels"]
pred_df["labels"] = pred_df.labels.apply(lambda x: " ".join(map(str,x)))
pred_df.to_csv(submission_file, index=False)
# pred_df.head()


In [132]:
assert len(pred_df) == unique_session_num*3

In [133]:
submission_file

'../submission/candidate_for_validation/recnum_40_candidate_v2_train1_data_test_submission.csvtest_stage1_limt'

In [72]:
# ! head ../submission/candidate_for_rerank_training/candidate_v2_train1_data_test_submission.csv

In [73]:
# ! head {submission_file}