In this notebook we will train an LGBM Ranker.

In his very informative post, [Recommendation Systems for Large Datasets](https://www.kaggle.com/competitions/otto-recommender-system/discussion/364721) [@ravishah1](https://www.kaggle.com/ravishah1) explains how re-ranking models are the industry standard for dealing with datasets like we are presented with in this competition, that is ones with high cardinality categories!

Earlier in this competition I shared a notebook [co-visitation matrix - simplified, imprvd logic 🔥](https://www.kaggle.com/code/radek1/co-visitation-matrix-simplified-imprvd-logic) which introduces the co-visitation matrix that can be used for candidate generation and scoring. (to read more about co-visitation matrices and how they work, please see [💡 What is the co-visiation matrix, really?](https://www.kaggle.com/competitions/otto-recommender-system/discussion/365358))

Here, we will only look at ranking. I don't expect this notebook to achieve a particularly good score, but it will provide all the low level plumbing needed for training ranking models. One will be able to build on it and improve the result (via for instance adding new candidates generated using co-visitation matrices!).

For data processing we will use [polars](https://www.pola.rs/). Polars is a very interesting library that I wanted to try for a very long time now. It is written in Rust and embraces running on multiple cores. And I must say it delivers! I liked the API quite a bit and its speed (though in that department `cudf` would still be my first choice!). I am however not touching my GPU quata on Kaggle just yet as I have a couple of things lined up that I would like to share with you that definitely will require the GPU! 🙂

To simplify the code, I am using a version of the dataset that I shared [here](https://www.kaggle.com/datasets/radek1/otto-train-and-test-data-for-local-validation). No need for dealing with `jsonl` files any longer as it's all `parquet` files now! (Specifically, I am using a version of this dataset that I preprared for local validation [in this notebook](https://www.kaggle.com/code/radek1/a-robust-local-validation-framework).)

## Other resources you might find useful:


* [💡 [2 methods] How-to ensemble predictions 🏅🏅🏅](https://www.kaggle.com/code/radek1/2-methods-how-to-ensemble-predictions)
* [📖 What are some good resources to learn about how gradient-boosted tree ranking models work?](https://www.kaggle.com/competitions/otto-recommender-system/discussion/366477)
* [💡What is a good initial goal in the competition? How to improve beyond it? 📈](https://www.kaggle.com/competitions/otto-recommender-system/discussion/368685)
* [💡How to improve the results of your Approximate Nearest Neighbor search! (annoy)](https://www.kaggle.com/competitions/otto-recommender-system/discussion/368385)
* [from zero to 60 in 2 seconds or less 🏎️🚓🚓🚓](https://www.kaggle.com/competitions/otto-recommender-system/discussion/367058)


# Config 

In [20]:
debug = False

type2id = {"clicks": 0, "carts": 1, "orders": 2}
# id2type = dict(zip(type2id.values(), type2id.keys()))
model_file = '../model_training/ranker.pkl'

test_candidate_file = '../data/test_candidates.csv'
final_submission_file = '../data/final_submission.csv'

feature_cols = ['aid', 'type', 'action_num_reverse_chrono', 'session_length', 'log_recency_score',
#                 'type_weighted_log_recency_score'
               ]
target = 'gt'

# Data Processing

In [6]:
import polars as pl
import glob
import pandas as pd
import gc
from sklearn.pipeline import Pipeline
import joblib

In [3]:
! ls ../out/

debug_submission.csv      test_labels.jsonl         train_sessions.jsonl
debug_test_labels.jsonl   test_sessions.jsonl
debug_test_sessions.jsonl test_sessions_full.jsonl


# Function

In [4]:
def get_session(row):
    session = row
#     print(session)
    return session.split('_')[0]
def get_type(row):
    session = row
#     print(session)
    return session.split('_')[1]
def data_preprocess(train):
    return train.with_columns(
                [
                    pl.col('labels').str.split(' '),
        #             pl.col('session_type').str.split('_').map(lambda s: s[0]),
                    pl.col('session_type').apply(lambda s: get_session(s)).alias('session'),
                    pl.col('session_type').apply(lambda s: get_type(s)).alias('type')
                ]
            ).explode('labels').with_columns(
                [
                    pl.col('labels').cast(pl.datatypes.Int32).alias('aid'),
                     pl.col('session').cast(pl.datatypes.Int32),
                    pl.col('type').apply(lambda x: type2id[x])
                ]
            ).drop(['session_type', 'labels']).with_columns(
                [
                    pl.col('session').cast(pl.datatypes.Int32),
                    pl.col('type').cast(pl.datatypes.UInt8),
                    pl.col('aid').cast(pl.datatypes.Int32)
                ]
            )
    

def add_action_num_reverse_chrono(df):
    return df.select([
        pl.col('*'),
        pl.col('session').cumcount().reverse().over('session').alias('action_num_reverse_chrono')
    ])

def add_session_length(df):
    return df.select([
        pl.col('*'),
        pl.col('session').count().over('session').alias('session_length')
    ])

def add_log_recency_score(df):
    linear_interpolation = 0.1 + ((1-0.1) / (df['session_length']-1)) * (df['session_length']-df['action_num_reverse_chrono']-1)
    return df.with_columns(pl.Series(2**linear_interpolation - 1).alias('log_recency_score')).fill_nan(1)

# def add_type_weighted_log_recency_score(df):
#     type_weights = {0:1, 1:6, 2:3}
#     type_weighted_log_recency_score = pl.Series(df['log_recency_score'] / df['type'].apply(lambda x: type_weights[x]))
#     return df.with_column(type_weighted_log_recency_score.alias('type_weighted_log_recency_score'))

def apply(df, pipeline):
    for f in pipeline:
        df = f(df)
    return df

pipeline = [ data_preprocess, add_action_num_reverse_chrono, add_session_length, add_log_recency_score, 
#             add_type_weighted_log_recency_score
           ]

# Codes 

In [21]:
if debug: 
    train = pl.read_csv('../out/debug_submission.csv')
else:
    train = pl.read_csv('../data/val_candidates.csv')
# train = pl.read_parquet('../data/explode_debug.parquet')
train_labels = pl.read_parquet('../data/parquet/val/test_labels.parquet')


In [5]:
train.head()

session_type,labels
str,str
"""11098528_click...","""11830 588923 1..."
"""11098529_click...","""1105029 634024..."
"""11098530_click...","""409236 264500 ..."
"""11098531_click...","""396199 1271998..."
"""11098532_click...","""876469 7651 10..."


In [6]:
train.shape

(5403753, 2)

In [7]:
# # train['labels'] = 
# # train = 
# train = (
#     train.with_columns(
#         [
#             pl.col('session').alias('session_type')
#         ])
#     .drop(['session', '__index_level_0__'])
#     .with_columns(
#         [
# #             pl.col('labels').str.split(' '),
# #             pl.col('session_type').str.split('_').map(lambda s: s[0])
#             pl.col('session_type').apply(lambda s: get_session(s)).alias('session'),
#             pl.col('session_type').apply(lambda s: get_type(s)).alias('type')
# #             pl.col("session_type").arr().get(0).alias("a"),
#         ]
#     )
#     .explode('labels')
#     .with_columns(
#         [
#             pl.col('labels').cast(pl.datatypes.Int32).alias('aid'),
#              pl.col('session').cast(pl.datatypes.Int32),
#             pl.col('type').apply(lambda x: type2id[x])
#         ]
#     )
#     .drop(['session_type', 'labels'])
#     .with_columns(
#         [
#             pl.col('session').cast(pl.datatypes.Int32),
#             pl.col('type').cast(pl.datatypes.UInt8),
#             pl.col('aid').cast(pl.datatypes.Int32)
#         ]
#     )
    
# )
# # .with_columns(
# #     [
# #         pl.col('session_type').alias('new')
# #     ]
# # )
# # .with_columns(
# #     [
# #         pl.col('session_type').get(0)
# #     ]
# # )

In [8]:
train.head()

session_type,labels
str,str
"""11098528_click...","""11830 588923 1..."
"""11098529_click...","""1105029 634024..."
"""11098530_click...","""409236 264500 ..."
"""11098531_click...","""396199 1271998..."
"""11098532_click...","""876469 7651 10..."


In [9]:

# train =

In [10]:
train.head()

session_type,labels
str,str
"""11098528_click...","""11830 588923 1..."
"""11098529_click...","""1105029 634024..."
"""11098530_click...","""409236 264500 ..."
"""11098531_click...","""396199 1271998..."
"""11098532_click...","""876469 7651 10..."


In [11]:
# transformed_train.apply(lambda row: get_session(row))

In [12]:
print(f"{train.shape}; {train_labels.shape}")

(5403753, 2); (2212692, 3)


In [13]:
train.head()

session_type,labels
str,str
"""11098528_click...","""11830 588923 1..."
"""11098529_click...","""1105029 634024..."
"""11098530_click...","""409236 264500 ..."
"""11098531_click...","""396199 1271998..."
"""11098532_click...","""876469 7651 10..."


In [14]:
# train = train.drop('ts')

In [15]:
train_labels.head()

session,type,ground_truth
i64,str,list[i64]
11098528,"""clicks""",[1679529]
11098528,"""carts""",[1199737]
11098528,"""orders""","[990658, 950341, ... 1033148]"
11098529,"""clicks""",[1105029]
11098530,"""orders""",[409236]


In [16]:
train.head()

session_type,labels
str,str
"""11098528_click...","""11830 588923 1..."
"""11098529_click...","""1105029 634024..."
"""11098530_click...","""409236 264500 ..."
"""11098531_click...","""396199 1271998..."
"""11098532_click...","""876469 7651 10..."


In [17]:
train_labels.shape

(2212692, 3)

We are calculating the scores that we used for creating co-vistation matrices! We know they carry signal, so let's provde this information to our `LGBM Ranker`!

In [20]:
train.head()

session_type,labels
str,str
"""11098528_click...","""11830 588923 1..."
"""11098529_click...","""1105029 634024..."
"""11098530_click...","""409236 264500 ..."
"""11098531_click...","""396199 1271998..."
"""11098532_click...","""876469 7651 10..."


In [21]:
train.head(100)

session_type,labels
str,str
"""11098528_click...","""11830 588923 1..."
"""11098529_click...","""1105029 634024..."
"""11098530_click...","""409236 264500 ..."
"""11098531_click...","""396199 1271998..."
"""11098532_click...","""876469 7651 10..."
"""11098533_click...","""1165015 385390..."
"""11098534_click...","""908024 223062 ..."
"""11098535_click...","""745365 767201 ..."
"""11098536_click...","""1320019 180837..."
"""11098537_click...","""336024 358965 ..."


In [22]:
train = apply(train, pipeline)

All done!

In [23]:
train.head()

session,type,aid,action_num_reverse_chrono,session_length,log_recency_score
i32,u8,i32,u32,u32,f64
11098528,0,11830,52,53,0.071773
11098528,0,588923,51,53,0.084709
11098528,0,1732105,50,53,0.0978
11098528,0,571762,49,53,0.111049
11098528,0,876129,48,53,0.124459


In [24]:
train.shape

(142104561, 6)

In [25]:
train['type'].value_counts()

type,counts
u8,u32
0,51639967
2,45232297
1,45232297


Now we need to process our labels a little bit and merge them onto our train set.

In [26]:
train_labels = train_labels.explode('ground_truth').with_columns([
    pl.col('ground_truth').alias('aid'),
    pl.col('type').apply(lambda x: type2id[x])
])[['session', 'type', 'aid']]

In [27]:
train_labels.shape

(2650372, 3)

In [28]:
train_labels.head()

session,type,aid
i64,i64,i64
11098528,0,1679529
11098528,1,1199737
11098528,2,990658
11098528,2,950341
11098528,2,1462506


In [29]:
train_labels = train_labels.with_columns([
    pl.col('session').cast(pl.datatypes.Int32),
    pl.col('type').cast(pl.datatypes.UInt8),
    pl.col('aid').cast(pl.datatypes.Int32)
])

In [30]:
train_labels = train_labels.with_column(pl.lit(1).alias('gt'))

In [31]:
train_labels.head()

session,type,aid,gt
i32,u8,i32,i32
11098528,0,1679529,1
11098528,1,1199737,1
11098528,2,990658,1
11098528,2,950341,1
11098528,2,1462506,1


In [32]:
train_labels['gt'].value_counts()

gt,counts
i32,u32
1,2650372


In [33]:
train = train.join(train_labels, how='left', on=['session', 'type', 'aid']).with_column(pl.col('gt').fill_null(0))

In [34]:
train.head()

session,type,aid,action_num_reverse_chrono,session_length,log_recency_score,gt
i32,u8,i32,u32,u32,f64,i32
11098528,0,11830,52,53,0.071773,0
11098528,0,588923,51,53,0.084709,0
11098528,0,1732105,50,53,0.0978,0
11098528,0,571762,49,53,0.111049,0
11098528,0,876129,48,53,0.124459,0


In [35]:
type2id

{'clicks': 0, 'carts': 1, 'orders': 2}

In [36]:
# train['type']

In [37]:
# train[train['type']==2]#['gt'].value_counts()

In [38]:
train['gt'].value_counts()

gt,counts
i32,u32
0,140605321
1,1499240


Ok, so we now have our preprocessed dataset, a column with ground truth, which means that the only thing we are missing for our Ranker is... information how to group individual rows into sessions!

In [39]:
train.shape

(142104561, 7)

In [40]:
train.head()

session,type,aid,action_num_reverse_chrono,session_length,log_recency_score,gt
i32,u8,i32,u32,u32,f64,i32
11098528,0,11830,52,53,0.071773,0
11098528,0,588923,51,53,0.084709,0
11098528,0,1732105,50,53,0.0978,0
11098528,0,571762,49,53,0.111049,0
11098528,0,876129,48,53,0.124459,0


In [41]:
def get_session_lenghts(df):
    return df.groupby('session').agg([
        pl.col('session').count().alias('session_length')
    ])['session_length'].to_numpy()

In [42]:
session_lengths_train = get_session_lenghts(train)

In [43]:
session_lengths_train.shape

(1801251,)

In [44]:
session_lengths_train

array([  3,  53, 120, ..., 105, 120, 120], dtype=uint32)

# Model training

In [45]:
import lightgbm

In [46]:
from lightgbm.sklearn import LGBMRanker

In [47]:
ranker = LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    boosting_type="dart",
    n_estimators=100,
    importance_type='gain',
    eval_at=[5]
)

In [49]:
# train[feature_cols]

In [50]:
ranker = ranker.fit(
    X=train[feature_cols].to_pandas(),
    y=train[target].to_pandas(),
    group=session_lengths_train,
    eval_set=[(train[feature_cols].to_pandas(), train[target].to_pandas())],
    eval_group=[session_lengths_train]
)

[1]	valid_0's ndcg@1: 0.663897	valid_0's ndcg@2: 0.680538	valid_0's ndcg@3: 0.699604	valid_0's ndcg@4: 0.714222	valid_0's ndcg@5: 0.725929
[2]	valid_0's ndcg@1: 0.669665	valid_0's ndcg@2: 0.687172	valid_0's ndcg@3: 0.705388	valid_0's ndcg@4: 0.719791	valid_0's ndcg@5: 0.730201
[3]	valid_0's ndcg@1: 0.670806	valid_0's ndcg@2: 0.688729	valid_0's ndcg@3: 0.706955	valid_0's ndcg@4: 0.721182	valid_0's ndcg@5: 0.731471
[4]	valid_0's ndcg@1: 0.67072	valid_0's ndcg@2: 0.688837	valid_0's ndcg@3: 0.707043	valid_0's ndcg@4: 0.721324	valid_0's ndcg@5: 0.731529
[5]	valid_0's ndcg@1: 0.670911	valid_0's ndcg@2: 0.689083	valid_0's ndcg@3: 0.707252	valid_0's ndcg@4: 0.721551	valid_0's ndcg@5: 0.731762
[6]	valid_0's ndcg@1: 0.671056	valid_0's ndcg@2: 0.689381	valid_0's ndcg@3: 0.707483	valid_0's ndcg@4: 0.721677	valid_0's ndcg@5: 0.731844
[7]	valid_0's ndcg@1: 0.671329	valid_0's ndcg@2: 0.689659	valid_0's ndcg@3: 0.707722	valid_0's ndcg@4: 0.721838	valid_0's ndcg@5: 0.731971
[8]	valid_0's ndcg@1: 0.6716

[60]	valid_0's ndcg@1: 0.674129	valid_0's ndcg@2: 0.691419	valid_0's ndcg@3: 0.709158	valid_0's ndcg@4: 0.723145	valid_0's ndcg@5: 0.733426
[61]	valid_0's ndcg@1: 0.675216	valid_0's ndcg@2: 0.691936	valid_0's ndcg@3: 0.709622	valid_0's ndcg@4: 0.723576	valid_0's ndcg@5: 0.733821
[62]	valid_0's ndcg@1: 0.675255	valid_0's ndcg@2: 0.69194	valid_0's ndcg@3: 0.709618	valid_0's ndcg@4: 0.723586	valid_0's ndcg@5: 0.733829
[63]	valid_0's ndcg@1: 0.675262	valid_0's ndcg@2: 0.691987	valid_0's ndcg@3: 0.709648	valid_0's ndcg@4: 0.723598	valid_0's ndcg@5: 0.733844
[64]	valid_0's ndcg@1: 0.675066	valid_0's ndcg@2: 0.691817	valid_0's ndcg@3: 0.709438	valid_0's ndcg@4: 0.723394	valid_0's ndcg@5: 0.733674
[65]	valid_0's ndcg@1: 0.675071	valid_0's ndcg@2: 0.691821	valid_0's ndcg@3: 0.70944	valid_0's ndcg@4: 0.723391	valid_0's ndcg@5: 0.733667
[66]	valid_0's ndcg@1: 0.675201	valid_0's ndcg@2: 0.691963	valid_0's ndcg@3: 0.709621	valid_0's ndcg@4: 0.723581	valid_0's ndcg@5: 0.733829
[67]	valid_0's ndcg@1:

In [52]:
pipe = Pipeline([
    ('model', ranker)
])

In [53]:
joblib.dump(
    value=pipe,
    filename=model_file)

['../model_training/ranker.pkl']

In [7]:
del train
gc.collect()

NameError: name 'train' is not defined

# Load models 

In [8]:
new_pipeline = joblib.load(
    filename=model_file
)

In [9]:
# ranker.predict(train[feature_cols].to_pandas(), raw_score=True)

In [10]:
# new_pipeline.predict(train[feature_cols].to_pandas(), raw_score=True)

In [11]:
new_pipeline

Pipeline(steps=[('model',
                 LGBMRanker(boosting_type='dart', eval_at=[5],
                            importance_type='gain', metric='ndcg',
                            objective='lambdarank'))])

# Predict on test data

Let's load our test set, process it and predict on it.

In [12]:
test = pl.read_csv(test_candidate_file)

In [13]:
assert len(test['session_type'].unique()) == 5015409

In [14]:
test = apply(test, pipeline)

In [16]:
test.head()

session,type,aid,action_num_reverse_chrono,session_length,log_recency_score
i32,u8,i32,u32,u32,f64
12899779,0,59625,119,120,0.071773
12899779,0,94230,118,120,0.077407
12899779,0,1253524,117,120,0.08307
12899779,0,1660529,116,120,0.088762
12899779,0,3295,115,120,0.094485


In [17]:
assert len(test['session'].unique())*3 == 5015409

In [21]:
scores = new_pipeline.predict(test[feature_cols].to_pandas())

In [22]:
scores

array([ 0.36783401,  0.35013847, -0.21159791, ..., -1.98587036,
       -1.97487119, -1.97487119])

In [23]:
scores.shape

(200616360,)

In [24]:
test.shape

(200616360, 6)

In [25]:
test.head()

session,type,aid,action_num_reverse_chrono,session_length,log_recency_score
i32,u8,i32,u32,u32,f64
12899779,0,59625,119,120,0.071773
12899779,0,94230,118,120,0.077407
12899779,0,1253524,117,120,0.08307
12899779,0,1660529,116,120,0.088762
12899779,0,3295,115,120,0.094485


# Create submission

In [26]:
test = test.with_columns(pl.Series(name='score', values=scores))
test_predictions = test.sort(['session', 'score'], reverse=True).groupby('session').agg([
    pl.col('aid').limit(20).list()
])

In [27]:
test_predictions.head()

session,aid
i32,list[i32]
14571581,"[1100210, 1100210, ... 1158237]"
14571580,"[202353, 202353, ... 433425]"
14571579,"[739876, 739876, ... 304799]"
14571578,"[519105, 519105, ... 822641]"
14571577,"[1141710, 1141710, ... 1666114]"


In [28]:
session_types = []
labels = []

for session, preds in zip(test_predictions['session'].to_numpy(), test_predictions['aid'].to_numpy()):
    l = ' '.join(str(p) for p in preds)
    for session_type in ['clicks', 'carts', 'orders']:
        labels.append(l)
        session_types.append(f'{session}_{session_type}')

In [29]:
submission = pl.DataFrame({'session_type': session_types, 'labels': labels})
submission.write_csv(final_submission_file)

In [31]:
assert len(submission['session_type'].unique()) == 5015409

In [32]:
submission.head()

session_type,labels
str,str
"""14571581_click...","""1100210 110021..."
"""14571581_carts...","""1100210 110021..."
"""14571581_order...","""1100210 110021..."
"""14571580_click...","""202353 202353 ..."
"""14571580_carts...","""202353 202353 ..."


In [33]:
submission.shape

(5015409, 2)

In [34]:
final_submission_file

'../data/final_submission.csv'

In [35]:
! ls -al ../data

total 8309912
drwxr-xr-x  10 hua  staff         320 Jan  3 22:25 [34m.[m[m
drwxr-xr-x  23 hua  staff         736 Jan  2 10:30 [34m..[m[m
-rw-r--r--   1 hua  staff   730097104 Jan  2 09:34 candidate_submission.parquet
drwxr-xr-x   3 hua  staff          96 Jan  1 21:57 [34mdebug[m[m
-rw-r--r--@  1 hua  staff   814184553 Jan  4 08:18 final_submission.csv
drwxr-xr-x   8 hua  staff         256 Jan  2 07:00 [34mparquet[m[m
drwxr-xr-x   4 hua  staff         128 Dec 31 13:31 [34msubmission_data[m[m
-rw-r--r--   1 hua  staff  1546537442 Jan  4 07:56 test_candidates.csv
drwxr-xr-x   4 hua  staff         128 Dec 31 13:32 [34mtrain_data[m[m
-rw-r--r--   1 hua  staff  1125712760 Jan  3 21:47 val_candidates.csv
