In this notebook we will train an LGBM Ranker.

In his very informative post, [Recommendation Systems for Large Datasets](https://www.kaggle.com/competitions/otto-recommender-system/discussion/364721), [@ravishah1](https://www.kaggle.com/ravishah1) explains that re-ranking models are the industry standard for datasets like the one in this competition, that is, ones with high-cardinality categories!

Earlier in this competition I shared a notebook, [co-visitation matrix - simplified, imprvd logic 🔥](https://www.kaggle.com/code/radek1/co-visitation-matrix-simplified-imprvd-logic), which introduces the co-visitation matrix that can be used for candidate generation and scoring. (To read more about co-visitation matrices and how they work, please see [💡 What is the co-visiation matrix, really?](https://www.kaggle.com/competitions/otto-recommender-system/discussion/365358).)

Here, we will only look at ranking. I don't expect this notebook to achieve a particularly good score, but it will provide all the low-level plumbing needed for training ranking models. You will be able to build on it and improve the result (for instance, by adding new candidates generated using co-visitation matrices!).

For data processing we will use [polars](https://www.pola.rs/). Polars is a very interesting library that I have wanted to try for a very long time. It is written in Rust and embraces running on multiple cores. And I must say it delivers! I liked the API quite a bit, as well as its speed (though in that department `cudf` would still be my first choice!). However, I am not touching my GPU quota on Kaggle just yet, as I have a couple of things lined up to share with you that will definitely require the GPU! 🙂

To simplify the code, I am using a version of the dataset that I shared [here](https://www.kaggle.com/datasets/radek1/otto-train-and-test-data-for-local-validation). No need to deal with `jsonl` files any longer, as it's all `parquet` files now! (Specifically, I am using a version of this dataset that I prepared for local validation [in this notebook](https://www.kaggle.com/code/radek1/a-robust-local-validation-framework).)

## Other resources you might find useful:


* [💡 [2 methods] How-to ensemble predictions 🏅🏅🏅](https://www.kaggle.com/code/radek1/2-methods-how-to-ensemble-predictions)
* [📖 What are some good resources to learn about how gradient-boosted tree ranking models work?](https://www.kaggle.com/competitions/otto-recommender-system/discussion/366477)
* [💡What is a good initial goal in the competition? How to improve beyond it? 📈](https://www.kaggle.com/competitions/otto-recommender-system/discussion/368685)
* [💡How to improve the results of your Approximate Nearest Neighbor search! (annoy)](https://www.kaggle.com/competitions/otto-recommender-system/discussion/368385)
* [from zero to 60 in 2 seconds or less 🏎️🚓🚓🚓](https://www.kaggle.com/competitions/otto-recommender-system/discussion/367058)


# Packages 

In [1]:
# ! pip install pandas

In [2]:
import polars as pl
import glob
import pandas as pd
import gc
from sklearn.pipeline import Pipeline
import joblib
import os
import lightgbm
from gensim.models import Word2Vec


# Config 

In [3]:
debug = False
target_type = 'carts'
# target_type = 'clicks'
estimator = 10
rerank_model_version = f'rerank_v3_{estimator}'



# w2v_model_path = '../model_training/w2v_v1/w2v.model'
final_submission = False
candidate_model_version = 'candidate_v2_train1_data'
# if True, val data are left for val; otherwise, there's no validation data
for_local_val = True

train_data_dir = '../submission/candidate_for_rerank_training/'
val_data_dir = '../submission/candidate_for_validation/'

type2id = {"clicks": 0, "carts": 1, "orders": 2}
# id2type = dict(zip(type2id.values(), type2id.keys()))
model_path = f'../model_training/{rerank_model_version}'

if not os.path.isdir(model_path):
    os.makedirs(model_path)
model_file = os.path.join(model_path, f'{target_type}_ranker.pkl')


if final_submission:
    test_candidate_file = os.path.join('../submission/final_submission_candiate/', f'{candidate_model_version}_test_submission.csv')
    final_submission_file = os.path.join('../submission/final_submission', f'{target_type}_{rerank_model_version}_test_submission.csv')
else:
    test_candidate_file = os.path.join('../submission/candidate_for_validation/', f'{candidate_model_version}_test_submission.csv')
    final_submission_file = os.path.join('../submission/submission_for_validation', f'{target_type}_{rerank_model_version}_test_submission.csv')
    



feature_cols = [
#     'aid', 
#                 'type',
#                 'action_num_reverse_chrono',
#     'session_length',
#     'log_recency_score',
#                 'type_weighted_log_recency_score'
    'type_weight', 'cart_order_num', 'buy_buy_num'
               ] 
# + ['aid_vector' + str(num) for num in range(6)]
target = 'gt'


debug_candidate_file = '../submission/debug/debug_submission.csv'

In [4]:
feature_cols

['type_weight', 'cart_order_num', 'buy_buy_num']

In [5]:
rerank_model_version

'rerank_v3_10'

In [6]:
model_file

'../model_training/rerank_v3_10/carts_ranker.pkl'

In [7]:
test_candidate_file

'../submission/candidate_for_validation/candidate_v2_train1_data_test_submission.csv'

In [8]:
train_data_path = os.path.join(train_data_dir, f'{candidate_model_version}_test_submission.csv')
val_data_path = os.path.join(val_data_dir, f'{candidate_model_version}_test_submission.csv')

if for_local_val:
    train_label_path = '../data/parquet/train2_label/*.parquet'
else:
    train_label_path = '../data/parquet/val_label/*.parquet'
    
if debug:
    test = pl.read_csv(test_candidate_file, n_rows=1000)
else:
    test = pl.read_csv(test_candidate_file)

In [9]:
if debug: 
    train = pl.read_csv(train_data_path, n_rows=10000)
else:
#     train = pl.read_csv('../data/val_candidates.csv')
    train = pl.read_csv(train_data_path)
train_labels = pl.read_parquet(train_label_path)
test_label_file = '../data/parquet/val_label/*.parquet'

test_labels = pl.read_parquet(test_label_file)

In [10]:
train_data_path

'../submission/candidate_for_rerank_training/candidate_v2_train1_data_test_submission.csv'

In [11]:
train = pl.read_csv('../data/cart_order_features.csv')
test = pl.read_csv('../data/test_cart_order_features.csv')

In [13]:
train.shape

(2224034, 5)

In [14]:
if debug:
    train = train.head(100000)
    test = test.head(10000)

In [15]:
train.head()

session_type,labels,type_weight,cart_order_num,buy_buy_num
str,str,str,str,str
"""8643220_carts""","""573273 399315 ...","""0.41 0 0 0 0 0...","""0 1 1 1 1 1 1 ...","""0 0 0 0 0 0 0 ..."
"""8643221_carts""","""921137 1133584...","""0.41 0 0 0 0 0...","""0 1 1 1 1 1 1 ...","""0 0 0 0 0 0 0 ..."
"""8643222_carts""","""1037630 930597...","""1.0 0.41 0 0 0...","""0 0 2 2 2 2 1 ...","""0 0 0 0 0 0 0 ..."
"""8643223_carts""","""1811963 206418...","""1.41 0.68 0 0 ...","""1 1 2 2 2 2 2 ...","""0 0 0 0 0 0 0 ..."
"""8643224_carts""","""778561 1106262...","""6.97 0.95 0.92...","""0 0 1 3 0 0 4 ...","""0 0 0 0 0 0 0 ..."


In [16]:
print(f"Previous shape: {train_labels.shape}; {test_labels.shape}")
train_labels = train_labels.filter(train_labels['type']==target_type)
test_labels = test_labels.filter(test_labels['type']==target_type)
print(f"Current shape: {train_labels.shape}; {test_labels.shape}")

Previous shape: (2738344, 3); (2189204, 3)
Current shape: (383122, 3); (301057, 3)


# W2V model 

In [17]:
# w2v = Word2Vec.load(w2v_model_path)

In [18]:
# w2v.wv.key_to_index

In [20]:
model_file

'../model_training/rerank_v3_10/carts_ranker.pkl'

# Data Processing
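
The feature files loaded above store one row per session, with the candidates and their features packed into space-separated strings, and `data_preprocess` below turns them into one row per candidate. Here is a minimal pure-Python sketch of that reshaping on made-up rows. The helper `explode_rows` and the toy values are only for illustration; the notebook itself does this with polars `str.split` and `explode`.

```python
# Toy rows in the same wide layout as the preview: one row per session/type,
# with candidates and features packed into space-separated strings.
wide_rows = [
    {"session_type": "101_carts", "labels": "7 8 9",
     "type_weight": "0.41 0 0", "cart_order_num": "0 1 1"},
    {"session_type": "102_carts", "labels": "5 6",
     "type_weight": "1.0 0.41", "cart_order_num": "0 2"},
]


def explode_rows(rows):
    """Hypothetical helper: turn each wide row into one dict per candidate aid."""
    long_rows = []
    for row in rows:
        session, event_type = row["session_type"].split("_")
        aids = row["labels"].split(" ")
        weights = row["type_weight"].split(" ")
        counts = row["cart_order_num"].split(" ")
        # The columns must stay aligned position by position.
        assert len(aids) == len(weights) == len(counts)
        for aid, weight, count in zip(aids, weights, counts):
            long_rows.append({
                "session": int(session),
                "type": event_type,
                "aid": int(aid),
                "type_weight": float(weight),
                "cart_order_num": int(count),
            })
    return long_rows


long_rows = explode_rows(wide_rows)
print(len(long_rows))  # 5 candidate rows in total
print(long_rows[0])
```

The alignment check matters: if one of the string columns had a different number of entries than `labels`, the features would silently end up attached to the wrong candidates.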

# Function

In [21]:
def get_session(row):
    session = row
#     print(session)
    return session.split('_')[0]

def get_type(row):
    session = row
#     print(session)
    return session.split('_')[1]
# def get_vector(row, index, w2v=w2v):
#     try:
#         vector = w2v.wv[row]
#         return vector[index]
#     except:
#         return 0
def get_w2v_features(df):
    # NOTE: depends on `get_vector` (commented out above) and a loaded `w2v` model,
    # so it only works once both are restored; it is not used in the pipelines below.
    return df.with_columns(
                [
                    pl.col('aid').apply(lambda s, i=i: get_vector(row=s, index=i)).alias(f'aid_vector{i}')
                    for i in range(6)
                ]
            )


def data_preprocess(train, target_type_id=type2id[target_type]):
#     type_weight_df = train.select([
#         pl.col('session_type'),
#         pl.col('type_weight').str.split(' ')
#     ]).explode('type_weight').with_columns(
#         [
#             pl.col('type_weight').cast(pl.Float64)
#         ]
#     )
#     cart_order_num_df = train.select([
#         pl.col('session_type'),
#         pl.col('cart_order_num').str.split(' ')
#     ]).explode('cart_order_num').with_columns(
#         [
#             pl.col('cart_order_num').cast(pl.Int32)
#         ]
#     )
#     buy_buy_num_df = train.select([
#         pl.col('session_type'),
#         pl.col('buy_buy_num').str.split(' ')
#     ]).explode('buy_buy_num').with_columns(
#         [
#             pl.col('buy_buy_num').cast(pl.Int32)
#         ]
#     )
    
    train = train.with_columns(
                [
                    pl.col('labels').str.split(' '),
                    pl.col('type_weight').str.split(' '),
                    pl.col('cart_order_num').str.split(' '),
                    pl.col('buy_buy_num').str.split(' '),
        #             pl.col('session_type').str.split('_').map(lambda s: s[0]),
                    pl.col('session_type').apply(lambda s: get_session(s)).alias('session'),
                    pl.col('session_type').apply(lambda s: get_type(s)).alias('type')
                ]
            ).explode(['labels', 'type_weight', 'cart_order_num', 'buy_buy_num']).with_columns(
                [
                    pl.col('labels').cast(pl.datatypes.Int32).alias('aid'),
                     pl.col('session').cast(pl.datatypes.Int32),
                    pl.col('type').apply(lambda x: type2id[x])
                ]
            ).drop(['labels', 'session_type']).with_columns(
                [
                    pl.col('session').cast(pl.datatypes.Int32),
                    pl.col('type').cast(pl.datatypes.UInt8),
                    pl.col('aid').cast(pl.datatypes.Int32),
                    pl.col('type_weight').cast(pl.Float32),
                    pl.col('cart_order_num').cast(pl.Int32),
                    pl.col('buy_buy_num').cast(pl.Int32)
                ]
            )
    train = train.filter(train['type']==target_type_id)
    return train
    

def add_action_num_reverse_chrono(df):
    return df.select([
        pl.col('*'),
        pl.col('session').cumcount().reverse().over('session').alias('action_num_reverse_chrono')
    ])

def add_session_length(df):
    return df.select([
        pl.col('*'),
        pl.col('session').count().over('session').alias('session_length')
    ])

def add_log_recency_score(df):
    linear_interpolation = 0.1 + ((1-0.1) / (df['session_length']-1)) * (df['session_length']-df['action_num_reverse_chrono']-1)
    return df.with_columns(pl.Series(2**linear_interpolation - 1).alias('log_recency_score')).fill_nan(1)

# def add_type_weighted_log_recency_score(df):
#     type_weights = {0:1, 1:6, 2:3}
#     type_weighted_log_recency_score = pl.Series(df['log_recency_score'] / df['type'].apply(lambda x: type_weights[x]))
#     return df.with_column(type_weighted_log_recency_score.alias('type_weighted_log_recency_score'))

def add_train_label(df, train_labels=train_labels):
    train = df
    train_labels = train_labels.explode('ground_truth').with_columns([
        pl.col('ground_truth').alias('aid'),
        pl.col('type').apply(lambda x: type2id[x])
    ])[['session', 'type', 'aid']]

    train_labels = train_labels.with_columns([
        pl.col('session').cast(pl.datatypes.Int32),
        pl.col('type').cast(pl.datatypes.UInt8),
        pl.col('aid').cast(pl.datatypes.Int32)
    ])
    train_labels = train_labels.with_column(pl.lit(1).alias('gt'))
    train = train.join(train_labels, how='left', on=['session', 'type', 'aid']).with_column(pl.col('gt').fill_null(0))
    return train

def add_test_label(df, train_labels=test_labels):
    train = df
    train_labels = train_labels.explode('ground_truth').with_columns([
        pl.col('ground_truth').alias('aid'),
        pl.col('type').apply(lambda x: type2id[x])
    ])[['session', 'type', 'aid']]

    train_labels = train_labels.with_columns([
        pl.col('session').cast(pl.datatypes.Int32),
        pl.col('type').cast(pl.datatypes.UInt8),
        pl.col('aid').cast(pl.datatypes.Int32)
    ])
    train_labels = train_labels.with_column(pl.lit(1).alias('gt'))
    train = train.join(train_labels, how='left', on=['session', 'type', 'aid']).with_column(pl.col('gt').fill_null(0))
    return train

def apply(df, pipeline):
    for f in pipeline:
        df = f(df)
    return df

train_pipeline = [ data_preprocess, 
#                   get_w2v_features, 
#                   add_action_num_reverse_chrono, add_session_length, add_log_recency_score, 
            add_train_label
           ]
test_pipeline = [ data_preprocess, 
#                  get_w2v_features,
#                  add_action_num_reverse_chrono, add_session_length, add_log_recency_score, 
            add_test_label
           ]
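
The `add_log_recency_score` helper above is currently switched off in both pipelines, but its formula is easy to misread, so here is a small pure-Python sketch of what it computes for a single session. The function name `log_recency_scores` is made up for this illustration. The interpolation runs from 0.1 for the oldest event to 1.0 for the newest one, and the score is `2**interp - 1`.

```python
def log_recency_scores(session_length):
    """Illustrative re-implementation: scores for positions oldest -> newest."""
    if session_length == 1:
        # The polars version divides by zero here and patches the NaN with 1.
        return [1.0]
    scores = []
    for position in range(session_length):
        # 0 for the newest event, session_length - 1 for the oldest one.
        reverse_chrono = session_length - position - 1
        interp = 0.1 + (0.9 / (session_length - 1)) * (session_length - reverse_chrono - 1)
        scores.append(2 ** interp - 1)
    return scores


scores = log_recency_scores(4)
print([round(s, 4) for s in scores])
```

The newest event always gets a score of 1, and older events decay towards `2**0.1 - 1`, which is roughly 0.07.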

# Codes 

In [22]:
print(f"{train.shape}; {train_labels.shape}")

(2224034, 5); (383122, 3)


In [23]:
train = apply(train, train_pipeline)


In [24]:
test = apply(test, test_pipeline)

In [25]:
test.shape

(71591633, 7)

In [26]:
test.head()

type_weight,cart_order_num,buy_buy_num,session,type,aid,gt
f32,i32,i32,i32,u8,i32,i32
0.41,0,0,11098528,1,11830,0
0.0,1,0,11098528,1,1732105,0
0.0,1,0,11098528,1,588923,0
0.0,1,0,11098528,1,1157882,0
0.0,1,0,11098528,1,884502,0


In [27]:
train.head()

type_weight,cart_order_num,buy_buy_num,session,type,aid,gt
f32,i32,i32,i32,u8,i32,i32
0.41,0,0,8643220,1,573273,0
0.0,1,0,8643220,1,399315,0
0.0,1,0,8643220,1,1308823,0
0.0,1,0,8643220,1,1337750,0
0.0,1,0,8643220,1,1768884,0


In [28]:
# train.apply(lambda x: type2id[x])

In [29]:
# w2v.wv['573273']

In [30]:
# test_label_file = '../data/parquet/val_label/*.parquet'

In [31]:
# test_labels = pl.read_parquet(test_label_file)

In [32]:
# test_labels = test_labels.explode('ground_truth').with_columns([
#     pl.col('ground_truth').alias('aid'),
#     pl.col('type').apply(lambda x: type2id[x])
# ])[['session', 'type', 'aid']]

In [33]:
# train_labels = train_labels.explode('ground_truth').with_columns([
#     pl.col('ground_truth').alias('aid'),
#     pl.col('type').apply(lambda x: type2id[x])
# ])[['session', 'type', 'aid']]

# train_labels = train_labels.with_columns([
#     pl.col('session').cast(pl.datatypes.Int32),
#     pl.col('type').cast(pl.datatypes.UInt8),
#     pl.col('aid').cast(pl.datatypes.Int32)
# ])
# train_labels = train_labels.with_column(pl.lit(1).alias('gt'))
# train = train.join(train_labels, how='left', on=['session', 'type', 'aid']).with_column(pl.col('gt').fill_null(0))

In [34]:
train_labels.shape

(383122, 3)

In [35]:
train_labels.head()

session,type,ground_truth
i64,str,list[i64]
8643226,"""carts""",[1845885]
8643228,"""carts""","[1210130, 1698483]"
8643232,"""carts""","[1539992, 1520883, ... 1469007]"
8643234,"""carts""","[712557, 635590]"
8643236,"""carts""",[280876]


In [37]:
train.head()

type_weight,cart_order_num,buy_buy_num,session,type,aid,gt
f32,i32,i32,i32,u8,i32,i32
0.41,0,0,8643220,1,573273,0
0.0,1,0,8643220,1,399315,0
0.0,1,0,8643220,1,1308823,0
0.0,1,0,8643220,1,1337750,0
0.0,1,0,8643220,1,1768884,0


In [38]:
type2id

{'clicks': 0, 'carts': 1, 'orders': 2}

In [39]:
# train['type']

In [40]:
# train[train['type']==2]#['gt'].value_counts()

In [41]:
train['gt'].value_counts()

gt,counts
i32,u32
0,88943081
1,311782


OK, so we now have our preprocessed dataset and a column with the ground truth, which means that the only thing we are missing for our Ranker is... information on how to group individual rows into sessions!
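
To make the grouping requirement concrete: the ranker expects a list of group sizes, one per session, in the same order as the rows, with every session forming one contiguous block. Below is a minimal sketch on made-up session ids using only the standard library. The variable names are illustrative and not part of the notebook.

```python
from itertools import groupby

# Session id of each candidate row, in the exact order the rows are fed to the ranker.
session_per_row = [11, 11, 11, 12, 12, 13, 13, 13, 13]

# Run lengths of consecutive identical session ids, in row order.
group_sizes = [sum(1 for _ in rows) for _, rows in groupby(session_per_row)]
print(group_sizes)  # [3, 2, 4]

# Sanity checks worth running before calling fit().
assert sum(group_sizes) == len(session_per_row)
# Each session must appear as exactly one contiguous block.
assert len(group_sizes) == len(set(session_per_row))
```

If the second check fails, the rows of some session are scattered across the table and need to be sorted by session before training.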

In [42]:
train.shape

(89254863, 7)

In [43]:
train.head()

type_weight,cart_order_num,buy_buy_num,session,type,aid,gt
f32,i32,i32,i32,u8,i32,i32
0.41,0,0,8643220,1,573273,0
0.0,1,0,8643220,1,399315,0
0.0,1,0,8643220,1,1308823,0
0.0,1,0,8643220,1,1337750,0
0.0,1,0,8643220,1,1768884,0


In [44]:
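def get_session_lengths(df):
    # maintain_order=True keeps the groups in row order, which is what the
    # ranker's `group` argument requires; a plain groupby may shuffle them.
    return df.groupby('session', maintain_order=True).agg([
        pl.col('session').count().alias('session_length')
    ])['session_length'].to_numpy()

In [45]:
session_lengths_train = get_session_lengths(train)
session_lengths_test = get_session_lengths(test)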

In [46]:
session_lengths_train.shape

(2224034,)

In [47]:
session_lengths_train

array([40, 40, 40, ..., 40, 40, 40], dtype=uint32)

# Model training

In [48]:
import lightgbm

In [49]:
from lightgbm.sklearn import LGBMRanker

In [50]:
ranker = LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    boosting_type="dart",
    n_estimators=estimator, 
    importance_type='gain',
    eval_at=[5]
)

In [51]:
estimator

10

In [52]:
# train[feature_cols]

In [53]:
train[feature_cols].shape

(89254863, 3)

In [54]:
test[feature_cols].shape

(71591633, 3)

In [56]:
test[feature_cols].head()

type_weight,cart_order_num,buy_buy_num
f32,i32,i32
0.41,0,0
0.0,1,0
0.0,1,0
0.0,1,0
0.0,1,0


In [57]:
# train_labels['gt']

In [58]:
final_submission_file

'../submission/submission_for_validation/carts_rerank_v3_10_test_submission.csv'

In [59]:
min(session_lengths_train)

40

In [60]:
ranker = ranker.fit(
    X=train[feature_cols].to_pandas(),
    y=train[target].to_pandas(),
    group=session_lengths_train,
    eval_set=[(train[feature_cols].to_pandas(), train[target].to_pandas()),
             (test[feature_cols].to_pandas(), test[target].to_pandas())
             ],
    eval_group=[session_lengths_train, session_lengths_test]
)



[1]	valid_0's ndcg@5: 0.959302	valid_1's ndcg@5: 0.961886
[2]	valid_0's ndcg@5: 0.959989	valid_1's ndcg@5: 0.962595
[3]	valid_0's ndcg@5: 0.959676	valid_1's ndcg@5: 0.962248
[4]	valid_0's ndcg@5: 0.959754	valid_1's ndcg@5: 0.962329
[5]	valid_0's ndcg@5: 0.959702	valid_1's ndcg@5: 0.962289
[6]	valid_0's ndcg@5: 0.959806	valid_1's ndcg@5: 0.962396
[7]	valid_0's ndcg@5: 0.960146	valid_1's ndcg@5: 0.962758
[8]	valid_0's ndcg@5: 0.960055	valid_1's ndcg@5: 0.962667
[9]	valid_0's ndcg@5: 0.960147	valid_1's ndcg@5: 0.962757
[10]	valid_0's ndcg@5: 0.960162	valid_1's ndcg@5: 0.962769
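
To get a feel for the `ndcg@5` numbers in the log, here is a small pure-Python sketch of NDCG@k for a single session, using the common `2**label - 1` gain and `log2` position discount. This is my own illustration and not LightGBM's code. In particular, returning 1.0 for a session without any positive is an assumption about how LightGBM handles that case; if it holds, it would explain why the values above are so high, since most sessions have no positive among their candidates.

```python
import math


def dcg_at_k(labels_in_ranked_order, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(labels_in_ranked_order[:k])
    )


def ndcg_at_k(labels, scores, k=5):
    """NDCG@k for one session, given per-candidate labels and model scores."""
    ranked = [rel for _, rel in sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)]
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    if ideal_dcg == 0:
        # Assumption: a session with no positives counts as a perfect 1.0.
        return 1.0
    return dcg_at_k(ranked, k) / ideal_dcg


# Two positives, ranked 2nd and 3rd by the model instead of 1st and 2nd.
toy_labels = [0, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.0]
toy_ndcg = ndcg_at_k(toy_labels, toy_scores, k=5)
print(round(toy_ndcg, 4))
```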


In [61]:
importance_df = pd.DataFrame(
    {
        'features': ranker.feature_name_,
        'importance': ranker.feature_importances_
    }
).sort_values('importance', ascending=False)
importance_df

Unnamed: 0,features,importance
0,type_weight,5387609.0
1,cart_order_num,464210.7
2,buy_buy_num,0.0


In [62]:
pipe = Pipeline([
    ('model', ranker)
])

In [63]:
debug

False

In [64]:
if not debug:
    joblib.dump(
        value=pipe,
        filename=model_file)

In [65]:
# del train, train_labels
# gc.collect()

# Load models 

In [91]:
model_file

'../model_training/rerank_v3_10/carts_ranker.pkl'

In [93]:
! ls -al {model_file}

-rw-r--r--  1 hua  staff  40201 Jan 24 22:05 ../model_training/rerank_v3_10/carts_ranker.pkl


In [66]:
new_pipeline = joblib.load(
    filename=model_file
)

In [67]:
new_pipeline

# Predict on test data

The test set was already loaded and processed above, so we can predict on it directly with the reloaded pipeline.

In [68]:
final_submission_file

'../submission/submission_for_validation/carts_rerank_v3_10_test_submission.csv'

In [69]:
test_candidate_file

'../submission/candidate_for_validation/candidate_v2_train1_data_test_submission.csv'

In [70]:
# ! ls ../submission/candiate_for_validation/

In [71]:
# ../submission/candidate_for_validation/

In [72]:
# assert len(test['session_type'].unique()) == 5015409

In [73]:
test.head()

type_weight,cart_order_num,buy_buy_num,session,type,aid,gt
f32,i32,i32,i32,u8,i32,i32
0.41,0,0,11098528,1,11830,0
0.0,1,0,11098528,1,1732105,0
0.0,1,0,11098528,1,588923,0
0.0,1,0,11098528,1,1157882,0
0.0,1,0,11098528,1,884502,0


In [74]:
scores = new_pipeline.predict(test[feature_cols].to_pandas())

# Create submission

In [75]:
test.head()

type_weight,cart_order_num,buy_buy_num,session,type,aid,gt
f32,i32,i32,i32,u8,i32,i32
0.41,0,0,11098528,1,11830,0
0.0,1,0,11098528,1,1732105,0
0.0,1,0,11098528,1,588923,0
0.0,1,0,11098528,1,1157882,0
0.0,1,0,11098528,1,884502,0


In [76]:
feature_cols

['type_weight', 'cart_order_num', 'buy_buy_num']

In [77]:
test = test.with_columns(pl.Series(name='score', values=scores))
test_predictions = test.sort(['session', 'score'], reverse=True).groupby('session').agg([
    pl.col('aid').limit(20).list()
])
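
The cell above sorts the candidates by score and keeps the best 20 `aid`s per session. The same idea on a handful of made-up rows, as a pure-Python sketch (the helper `top_k_per_session` is hypothetical and only here to show the logic):

```python
from collections import defaultdict


def top_k_per_session(rows, k=20):
    """rows: iterable of (session, aid, score); returns {session: [aid, ...]}, best first."""
    by_session = defaultdict(list)
    for session, aid, score in rows:
        by_session[session].append((score, aid))
    return {
        session: [aid for _, aid in sorted(pairs, key=lambda p: p[0], reverse=True)[:k]]
        for session, pairs in by_session.items()
    }


toy_rows = [(1, 10, 0.2), (1, 11, 1.5), (1, 12, -0.3), (2, 20, 0.0), (2, 21, 0.7)]
top2 = top_k_per_session(toy_rows, k=2)
print(top2)  # {1: [11, 10], 2: [21, 20]}
```

Only the order of scores within a session matters here; the raw score values are not comparable across sessions.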

In [78]:
test_predictions.head()

session,aid
i32,list[i32]
12899778,"[561560, 1167224, ... 1236775]"
12899777,"[384045, 1308634, ... 1236775]"
12899776,"[548599, 695829, ... 1236775]"
12899775,"[1743151, 1760714, ... 1406660]"
12899774,"[33035, 1539309, ... 1236775]"


In [79]:
session_types = []
labels = []

for session, preds in zip(test_predictions['session'].to_numpy(), test_predictions['aid'].to_numpy()):
    l = ' '.join(str(p) for p in preds)
    for session_type in ['clicks', 'carts', 'orders']:
        labels.append(l)
        session_types.append(f'{session}_{session_type}')

In [80]:
! ls -al {final_submission_file}

ls: ../submission/submission_for_validation/carts_rerank_v3_10_test_submission.csv: No such file or directory


In [81]:
! ls ../submission/submission_for_validation/

carts_rerank_v1_10_test_submission.csv rerank_v1_10_test_submission.csv
carts_rerank_v2_10_test_submission.csv rerank_v1_40_test_submission.csv


In [82]:
submission = pl.DataFrame({'session_type': session_types, 'labels': labels})
if not debug:
    submission.write_csv(final_submission_file)

In [83]:
submission.head()

session_type,labels
str,str
"""12899778_click...","""561560 1167224..."
"""12899778_carts...","""561560 1167224..."
"""12899778_order...","""561560 1167224..."
"""12899777_click...","""384045 1308634..."
"""12899777_carts...","""384045 1308634..."


In [84]:
final_submission_file

'../submission/submission_for_validation/carts_rerank_v3_10_test_submission.csv'

In [85]:
! ls -al ../submission/submission_for_validation/carts_rerank_v1_10_test_submission.csv

-rw-r--r--  1 hua  staff  870329360 Jan 15 19:15 ../submission/submission_for_validation/carts_rerank_v1_10_test_submission.csv


# Dig into the feature & target 

In [86]:
test['gt'].value_counts()

gt,counts
i32,u32
0,71348275
1,243358


In [94]:
test.filter(pl.col('gt')==1)[['session'] +feature_cols + [target]]
# .filter(pl.col('session')==11098528)['gt']

session,type_weight,cart_order_num,buy_buy_num,gt
i32,f32,i32,i32,i32
11098534,2.17,2,0,1
11098538,1.0,4,0,1
11098538,1.64,0,0,1
11098542,1.0,0,0,1
11098545,4.74,2,0,1
11098545,3.84,0,0,1
11098545,2.88,1,0,1
11098545,0.67,1,0,1
11098546,1.0,0,0,1
11098554,0.0,1,0,1


In [95]:
session_id = 12899676

In [96]:
test.filter(pl.col('session')==session_id).sort('score', reverse=True)

type_weight,cart_order_num,buy_buy_num,session,type,aid,gt,score
f32,i32,i32,i32,u8,i32,i32,f64
1.0,1,0,12899676,1,35328,1,1.160342
0.41,1,0,12899676,1,1780088,0,0.494228
0.0,2,0,12899676,1,182264,0,-0.354045
0.0,2,0,12899676,1,1784638,0,-0.354045
0.0,2,0,12899676,1,888801,0,-0.354045
0.0,2,0,12899676,1,980008,0,-0.354045
0.0,2,0,12899676,1,95165,0,-0.354045
0.0,2,0,12899676,1,399586,0,-0.354045
0.0,1,0,12899676,1,270062,0,-0.540887
0.0,1,0,12899676,1,667921,0,-0.540887
