In this notebook we will train an LGBM Ranker.

In his very informative post, [Recommendation Systems for Large Datasets](https://www.kaggle.com/competitions/otto-recommender-system/discussion/364721), [@ravishah1](https://www.kaggle.com/ravishah1) explains that re-ranking models are the industry standard for datasets like the one in this competition, that is, datasets with high-cardinality categories!

Earlier in this competition I shared a notebook, [co-visitation matrix - simplified, imprvd logic 🔥](https://www.kaggle.com/code/radek1/co-visitation-matrix-simplified-imprvd-logic), which introduces the co-visitation matrix, a tool for candidate generation and scoring. (To read more about co-visitation matrices and how they work, please see [💡 What is the co-visiation matrix, really?](https://www.kaggle.com/competitions/otto-recommender-system/discussion/365358).)
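For readers new to the idea, here is a minimal sketch of a co-visitation count, assuming the simplest possible definition: two aids co-visit when they appear in the same session. The names `build_covisitation` and `top_candidates` and the toy sessions are hypothetical; the linked notebook applies its own filtering and weighting, which is not reproduced here.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_covisitation(sessions):
    """Count how often each ordered pair of distinct aids shares a session."""
    covisits = defaultdict(Counter)
    for aids in sessions:
        # Deduplicate so a repeated aid does not inflate the counts.
        for a, b in permutations(set(aids), 2):
            covisits[a][b] += 1
    return covisits


def top_candidates(covisits, aid, k=2):
    """Return up to k aids most often seen together with `aid`."""
    return [other for other, _ in covisits[aid].most_common(k)]


# Toy sessions (hypothetical data): each list holds the aids of one session.
toy_sessions = [[1, 2, 3], [1, 2], [2, 3], [1, 4]]
covisits = build_covisitation(toy_sessions)
```

Calling `top_candidates(covisits, 1, k=1)` then proposes the aid that most often shares a session with aid 1, which is the kind of candidate a ranker would later score.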

Here, we will only look at ranking. I don't expect this notebook to achieve a particularly good score, but it provides all the low-level plumbing needed for training ranking models. You can build on it and improve the result (for instance, by adding new candidates generated with co-visitation matrices!).

For data processing we will use [polars](https://www.pola.rs/), a library I have wanted to try for a very long time. It is written in Rust and embraces running on multiple cores. And I must say it delivers! I liked the API quite a bit, and its speed too (though in that department `cudf` would still be my first choice!). I am, however, not touching my GPU quota on Kaggle just yet, as I have a couple of things lined up to share with you that will definitely require the GPU! 🙂

To simplify the code, I am using a version of the dataset that I shared [here](https://www.kaggle.com/datasets/radek1/otto-train-and-test-data-for-local-validation). No need to deal with `jsonl` files any longer, as it's all `parquet` files now! (Specifically, I am using a version of this dataset that I prepared for local validation [in this notebook](https://www.kaggle.com/code/radek1/a-robust-local-validation-framework).)

## Other resources you might find useful:


* [💡 [2 methods] How-to ensemble predictions 🏅🏅🏅](https://www.kaggle.com/code/radek1/2-methods-how-to-ensemble-predictions)
* [📖 What are some good resources to learn about how gradient-boosted tree ranking models work?](https://www.kaggle.com/competitions/otto-recommender-system/discussion/366477)
* [💡What is a good initial goal in the competition? How to improve beyond it? 📈](https://www.kaggle.com/competitions/otto-recommender-system/discussion/368685)
* [💡How to improve the results of your Approximate Nearest Neighbor search! (annoy)](https://www.kaggle.com/competitions/otto-recommender-system/discussion/368385)
* [from zero to 60 in 2 seconds or less 🏎️🚓🚓🚓](https://www.kaggle.com/competitions/otto-recommender-system/discussion/367058)


# Packages 

In [71]:
# ! pip install pandas

In [72]:
import polars as pl
import glob
import pandas as pd
import gc
from sklearn.pipeline import Pipeline
import joblib
import os
import lightgbm
from gensim.models import Word2Vec


# Config 

In [73]:
debug = False
target_type = 'all'
# target_type = 'clicks'
estimator = 100
rerank_model_version = f'{target_type}_rerank_v1_{estimator}'
final_submission = False
cg_num = 40

test_stage1_limit = True


# w2v_model_path = '../model_training/w2v_v1/w2v.model'

candidate_model_version = 'candidate_v2_train1_data'
# if True, val data are left for val; otherwise, there's no validation data
for_local_val = True

train_data_dir = '../submission/candidate_for_rerank_training/'
val_data_dir = '../submission/candidate_for_validation/'

type2id = {"clicks": 0, "carts": 1, "orders": 2}
# id2type = dict(zip(type2id.values(), type2id.keys()))
model_path = f'../model_training/{rerank_model_version}'

if not os.path.isdir(model_path):
    os.makedirs(model_path)
model_file = os.path.join(model_path, f'{target_type}_ranker.pkl')


if final_submission:
    test_candidate_file = os.path.join('../submission/candidate_final_submission/', f'{candidate_model_version}_test_submission.csv')
    final_submission_file = os.path.join('../submission/final_submission', f'{target_type}_{rerank_model_version}_test_submission.csv')
else:
    test_candidate_file = os.path.join('../submission/candidate_for_validation/', f'feature_recnum_{cg_num}_{candidate_model_version}_test_submission.csv')
    final_submission_file = os.path.join('../submission/submission_for_validation', f'{target_type}_{rerank_model_version}_test_submission.csv')
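    if test_stage1_limit:
        # NOTE: the suffix is appended after '.csv', so both file names
        # end in '.csvtest_stage1_limit' (see the printed paths below).
        test_candidate_file += 'test_stage1_limit'
        final_submission_file += 'test_stage1_limit'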
    



feature_cols = [
#     'aid', 
#                 'type',
#                 'action_num_reverse_chrono',
#     'session_length',
#     'log_recency_score',
#                 'type_weighted_log_recency_score'
    'type_weight', 'cart_order_num', 'buy_buy_num'
               ] 
# + ['aid_vector' + str(num) for num in range(6)]
target = 'gt'


debug_candidate_file = '../submission/debug/debug_submission.csv'

In [74]:
model_path

'../model_training/all_rerank_v1_100'

In [75]:
! ls {model_path}

all_ranker.pkl


In [76]:
feature_cols

['type_weight', 'cart_order_num', 'buy_buy_num']

In [77]:
rerank_model_version

'all_rerank_v1_100'

In [78]:
model_file

'../model_training/all_rerank_v1_100/all_ranker.pkl'

In [79]:
test_candidate_file

'../submission/candidate_for_validation/feature_recnum_40_candidate_v2_train1_data_test_submission.csvtest_stage1_limit'

In [80]:
# ! ls ../submission/candidate_for_validation/

In [81]:
train_data_path = os.path.join(train_data_dir, f'{candidate_model_version}_test_submission.csv')
val_data_path = os.path.join(val_data_dir, f'{candidate_model_version}_test_submission.csv')

if for_local_val:
    train_label_path = '../data/parquet/train2_label/*.parquet'
else:
    train_label_path = '../data/parquet/val_label/*.parquet'
    
if debug:
    test = pl.read_csv(test_candidate_file, n_rows=1000)
else:
    test = pl.read_csv(test_candidate_file)

In [82]:
if debug: 
    train = pl.read_csv(train_data_path, n_rows=10000)
else:
#     train = pl.read_csv('../data/val_candidates.csv')
    train = pl.read_csv(train_data_path)
train_labels = pl.read_parquet(train_label_path)
test_label_file = '../data/parquet/val_label/*.parquet'

test_labels = pl.read_parquet(test_label_file)

In [83]:
train_data_path

'../submission/candidate_for_rerank_training/candidate_v2_train1_data_test_submission.csv'

In [84]:
# train = pl.read_csv('../data/cart_order_features.csv')
# test = pl.read_csv('../data/test_cart_order_features.csv')

In [85]:
train_data_path

'../submission/candidate_for_rerank_training/candidate_v2_train1_data_test_submission.csv'

In [86]:
train.shape

(6672102, 2)

In [87]:
if debug:
    train = train.head(100000)
    test = test.head(10000)

In [88]:
train.head()

session_type,labels
str,str
"""8643220_clicks...","""573273 399315 ..."
"""8643221_clicks...","""921137 1797158..."
"""8643222_clicks...","""1037630 930597..."
"""8643223_clicks...","""1811963 206418..."
"""8643224_clicks...","""778561 1106262..."


In [89]:
type2id.keys()

dict_keys(['clicks', 'carts', 'orders'])
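As the `train.head()` preview above shows, each candidate row stores a `session_type` key such as `8643220_clicks` and a space-separated string of candidate aids. The sketch below unpacks one such row into `(session, type, aid)` tuples in plain Python. `explode_row` is a hypothetical helper, and the aids in the example are made up because the preview truncates the real ones.

```python
def explode_row(session_type, labels):
    """Turn one candidate row into a list of (session, type, aid) tuples."""
    session, action_type = session_type.split('_')
    return [(int(session), action_type, int(aid)) for aid in labels.split()]


# Made-up aids for illustration; the real label strings are truncated above.
rows = explode_row('8643220_clicks', '11 22 33')
```

This is the same reshaping that `data_preprocess` performs with `str.split` and `explode`, just spelled out for a single row.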

In [90]:
print(f"Previous shape: {train_labels.shape}; {test_labels.shape}")
if target_type in type2id.keys():

    train_labels = train_labels.filter(train_labels['type']==target_type)
    test_labels = test_labels.filter(test_labels['type']==target_type)
print(f"Current shape: {train_labels.shape}; {test_labels.shape}")

Previous shape: (2738344, 3); (2189204, 3)
Current shape: (2738344, 3); (2189204, 3)


# Data Processing

# Function

In [91]:
def process_label(train_labels):
    new_train_labels = train_labels.explode('ground_truth').with_columns([
        pl.col('ground_truth').alias('aid'),
        pl.col('type').apply(lambda x: type2id[x])
    ])[['session', 'type', 'aid']]

    new_train_labels = new_train_labels.with_columns([
        pl.col('session').cast(pl.datatypes.Int32),
        pl.col('type').cast(pl.datatypes.UInt8),
        pl.col('aid').cast(pl.datatypes.Int32)
    ])
    typeid2label = {0: 1, 1: 3, 2: 6}
    new_train_labels = new_train_labels.with_columns(
        [
            pl.col('type').apply(lambda x: typeid2label[x]).alias('gt')
        ]
    ).drop(['type'])
    new_train_labels = new_train_labels.groupby(['session', 'aid']).agg(pl.col('gt').max())
    return new_train_labels

train_labels = process_label(train_labels=train_labels)
test_labels = process_label(train_labels=test_labels)
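The effect of `process_label` can be sketched in plain Python: every ground-truth aid gets a graded relevance of 1, 3, or 6 depending on whether it was clicked, carted, or ordered, and when the same (session, aid) pair appears under several types the highest grade wins. The helper name `graded_labels` and the toy rows are hypothetical.

```python
# Same grades as typeid2label in process_label, keyed by type name.
TYPE_TO_GAIN = {'clicks': 1, 'carts': 3, 'orders': 6}


def graded_labels(label_rows):
    """Map (session, type, ground-truth aids) rows to {(session, aid): grade}."""
    grades = {}
    for session, action_type, aids in label_rows:
        gain = TYPE_TO_GAIN[action_type]
        for aid in aids:
            key = (session, aid)
            # Keep the highest grade seen for this (session, aid) pair.
            grades[key] = max(grades.get(key, 0), gain)
    return grades


# Toy rows (hypothetical data).
toy_labels = [
    (1, 'clicks', [10]),
    (1, 'carts', [10, 20]),
    (1, 'orders', [20]),
]
grades = graded_labels(toy_labels)
```

Candidates that never appear in the labels are later filled with 0 by the left join, which gives the ranker four relevance levels: 0, 1, 3, and 6.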

In [92]:
def get_session(row):
    session = row
#     print(session)
    return session.split('_')[0]

def get_type(row):
    session = row
#     print(session)
    return session.split('_')[1]
# def get_vector(row, index, w2v=w2v):
#     try:
#         vector = w2v.wv[row]
#         return vector[index]
#     except:
#         return 0
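# NOTE: relies on `get_vector` (commented out above) and a loaded Word2Vec model;
# it is not part of the pipelines below and would raise NameError if called as is.
def get_w2v_features(df):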
    return df.with_columns(
                [
                    pl.col('aid').apply(lambda s: get_vector(row=s, index=0), ).alias('aid_vector0'),
                    pl.col('aid').apply(lambda s: get_vector(row=s, index=1), ).alias('aid_vector1'),
                    pl.col('aid').apply(lambda s: get_vector(row=s, index=2), ).alias('aid_vector2'),
                    pl.col('aid').apply(lambda s: get_vector(row=s, index=3), ).alias('aid_vector3'),
                    pl.col('aid').apply(lambda s: get_vector(row=s, index=4), ).alias('aid_vector4'),
                    pl.col('aid').apply(lambda s: get_vector(row=s, index=5), ).alias('aid_vector5'),
                ]
            )


def data_preprocess(train, type2id=type2id, target_type=target_type):
    train = train.with_columns(
                [
                    pl.col('labels').str.split(' '),
                    pl.col('type_weight').str.split(' '),
                    pl.col('cart_order_num').str.split(' '),
                    pl.col('buy_buy_num').str.split(' '),
        #             pl.col('session_type').str.split('_').map(lambda s: s[0]),
                    pl.col('session_type').apply(lambda s: get_session(s)).alias('session'),
                    pl.col('session_type').apply(lambda s: get_type(s)).alias('type')
                ]
            ).explode(['labels', 'type_weight', 'cart_order_num', 'buy_buy_num']).with_columns(
                [
                    pl.col('labels').cast(pl.datatypes.Int32).alias('aid'),
                     pl.col('session').cast(pl.datatypes.Int32),
                    pl.col('type').apply(lambda x: type2id[x])
                ]
            ).drop(['labels', 'session_type']).with_columns(
                [
                    pl.col('session').cast(pl.datatypes.Int32),
                    pl.col('type').cast(pl.datatypes.UInt8),
                    pl.col('aid').cast(pl.datatypes.Int32),
                    pl.col('type_weight').cast(pl.Float32),
                    pl.col('cart_order_num').cast(pl.Int32),
                    pl.col('buy_buy_num').cast(pl.Int32)
                ]
            )
    if target_type in type2id.keys():
        target_type_id=type2id[target_type]
        train = train.filter(train['type']==target_type_id)
    else:
        train = train.drop(['type'])
#         print(train.head())
        print(f"shape before unique: {train.shape}")
        train = train.unique()
        print(f"shape after unique: {train.shape}")
    return train
    

def add_action_num_reverse_chrono(df):
    return df.select([
        pl.col('*'),
        pl.col('session').cumcount().reverse().over('session').alias('action_num_reverse_chrono')
    ])

def add_session_length(df):
    return df.select([
        pl.col('*'),
        pl.col('session').count().over('session').alias('session_length')
    ])

def add_log_recency_score(df):
    linear_interpolation = 0.1 + ((1-0.1) / (df['session_length']-1)) * (df['session_length']-df['action_num_reverse_chrono']-1)
    return df.with_columns(pl.Series(2**linear_interpolation - 1).alias('log_recency_score')).fill_nan(1)

# def add_type_weighted_log_recency_score(df):
#     type_weights = {0:1, 1:6, 2:3}
#     type_weighted_log_recency_score = pl.Series(df['log_recency_score'] / df['type'].apply(lambda x: type_weights[x]))
#     return df.with_column(type_weighted_log_recency_score.alias('type_weighted_log_recency_score'))

# def add_train_label(df, train_labels=train_labels):
#     train = df
#     train_labels = train_labels.explode('ground_truth').with_columns([
#         pl.col('ground_truth').alias('aid'),
#         pl.col('type').apply(lambda x: type2id[x])
#     ])[['session', 'type', 'aid']]

#     train_labels = train_labels.with_columns([
#         pl.col('session').cast(pl.datatypes.Int32),
#         pl.col('type').cast(pl.datatypes.UInt8),
#         pl.col('aid').cast(pl.datatypes.Int32)
#     ])
#     train_labels = train_labels.with_column(pl.lit(1).alias('gt'))
#     train = train.join(train_labels, how='left', on=['session', 'type', 'aid']).with_column(pl.col('gt').fill_null(0))
#     return train

# def add_test_label(df, train_labels=test_labels):
#     train = df
#     train_labels = train_labels.explode('ground_truth').with_columns([
#         pl.col('ground_truth').alias('aid'),
#         pl.col('type').apply(lambda x: type2id[x])
#     ])[['session', 'type', 'aid']]

#     train_labels = train_labels.with_columns([
#         pl.col('session').cast(pl.datatypes.Int32),
#         pl.col('type').cast(pl.datatypes.UInt8),
#         pl.col('aid').cast(pl.datatypes.Int32)
#     ])
#     train_labels = train_labels.with_column(pl.lit(1).alias('gt'))
#     train = train.join(train_labels, how='left', on=['session', 'type', 'aid']).with_column(pl.col('gt').fill_null(0))
#     return train

def add_train_label_all(processed_train, train_labels=train_labels):
    train = processed_train
    train = train.join(train_labels, how='left', on=['session', 'aid']).with_column(pl.col('gt').fill_null(0))
    return train

def add_test_label_all(processed_train, train_labels=test_labels):
    train = processed_train
    train = train.join(train_labels, how='left', on=['session', 'aid']).with_column(pl.col('gt').fill_null(0))
    return train


def apply(df, pipeline):
    for f in pipeline:
        df = f(df)
    return df

train_pipeline = [ data_preprocess, 
#                   get_w2v_features, 
#                   add_action_num_reverse_chrono, add_session_length, add_log_recency_score, 
            add_train_label_all
           ]
test_pipeline = [ data_preprocess, 
#                  get_w2v_features,
#                  add_action_num_reverse_chrono, add_session_length, add_log_recency_score, 
            add_test_label_all
           ]
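The `add_log_recency_score` helper defined above (currently disabled in both pipelines) is easier to follow for a single session. It interpolates a weight linearly from 0.1 for the oldest action to 1.0 for the most recent one and then maps it through `2**w - 1`. The pure-Python sketch below mirrors that arithmetic. The function name `log_recency_scores` is hypothetical, and the single-action case returns 1 to match the `fill_nan(1)` fallback.

```python
def log_recency_scores(session_length):
    """Scores for one session, ordered from the oldest to the newest action."""
    if session_length == 1:
        # The original formula divides by zero here and fills the NaN with 1.
        return [1.0]
    scores = []
    for position in range(session_length):
        weight = 0.1 + 0.9 * position / (session_length - 1)
        scores.append(2 ** weight - 1)
    return scores


scores = log_recency_scores(3)
```

The newest action always scores `2**1 - 1 = 1`, and older actions decay towards `2**0.1 - 1`.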

# Code

In [23]:
print(f"{train.shape}; {train_labels.shape}")

(6672102, 2); (2899717, 3)


In [21]:
# data_preprocess(test)

In [22]:
# process_train = data_preprocess(train)
# process_train.sample(10)
# train_labels.filter(pl.col('session')==9005729)

In [23]:
train = apply(train, train_pipeline)


shape before unique: (89254863, 5)
shape after unique: (89026433, 5)


In [24]:
train['gt'].value_counts()

gt,counts
i64,u32
0,87559095
6,264261
3,207461
1,995616


In [93]:
test = apply(test, test_pipeline)

shape before unique: (72754033, 5)
shape after unique: (72595164, 5)


In [26]:
test['gt'].value_counts()

gt,counts
i64,u32
0,70280510
6,206134
3,162941
1,783179


Ok, so we now have our preprocessed dataset and a column with the ground truth. The only thing still missing for our ranker is information on how to group individual rows into sessions!
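LightGBM's ranker receives this grouping as a list of group sizes rather than per-row session ids, so the rows of each session must sit next to each other and the sizes must follow the same order as the rows. A minimal sketch with a hypothetical helper `group_sizes`, assuming the session ids are already sorted:

```python
from itertools import groupby


def group_sizes(sorted_sessions):
    """Return the run length of each session in an already sorted id list."""
    return [sum(1 for _ in run) for _, run in groupby(sorted_sessions)]


# Three candidates for session 7, one for session 8, two for session 9.
sizes = group_sizes([7, 7, 7, 8, 9, 9])
```

The sizes must add up to the number of rows passed to the ranker, which makes a cheap sanity check before training.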

In [27]:
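def get_session_lengths(df):
    # maintain_order=True keeps the groups in order of first appearance, so the
    # sizes line up with the rows as long as each session's rows are contiguous.
    return df.groupby('session', maintain_order=True).agg([
        pl.col('session').count().alias('session_length')
    ])['session_length'].to_numpy()

In [29]:
session_lengths_train = get_session_lengths(train)
session_lengths_test = get_session_lengths(test)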

# Model training

In [89]:
import lightgbm

In [90]:
from lightgbm.sklearn import LGBMRanker

In [91]:
ranker = LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    boosting_type="dart",
    n_estimators=estimator, 
    importance_type='gain',
    eval_at=[5]
)
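To make the `ndcg` metric with `eval_at=[5]` concrete, here is a small pure-Python sketch of NDCG@k for one session. It assumes the gain of a label is `2**label - 1`, which matches LightGBM's default `label_gain`, and a `log2` position discount. `dcg_at_k` and `ndcg_at_k` are hypothetical helper names, not LightGBM functions, and returning 0 for a session without positives is a choice made here; libraries differ on that edge case.

```python
import math


def dcg_at_k(labels, k):
    """Discounted cumulative gain of the first k labels, in ranked order."""
    return sum((2 ** label - 1) / math.log2(rank + 2)
               for rank, label in enumerate(labels[:k]))


def ndcg_at_k(labels_in_predicted_order, k=5):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    ideal = dcg_at_k(sorted(labels_in_predicted_order, reverse=True), k)
    if ideal == 0:
        return 0.0  # assumption: sessions without positives score 0 here
    return dcg_at_k(labels_in_predicted_order, k) / ideal


perfect = ndcg_at_k([6, 0, 0])      # the ordered item is ranked first
swapped = ndcg_at_k([0, 6])         # the ordered item is ranked second
```

Because the gain grows exponentially with the label, placing an ordered item (label 6) one position too low costs far more than misplacing a clicked item (label 1).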

In [92]:
estimator

100

In [93]:
# train[feature_cols]

In [94]:
train[feature_cols].shape

(89026433, 3)

In [95]:
test[feature_cols].shape

(71432764, 3)

In [96]:
test[feature_cols].head()

type_weight,cart_order_num,buy_buy_num
f32,i32,i32
0.41,0,0
0.0,1,0
0.0,1,0
0.0,1,0
0.0,1,0


In [97]:
# train_labels['gt']

In [98]:
final_submission_file

'../submission/submission_for_validation/all_all_rerank_v1_100_test_submission.csv'

In [99]:
min(session_lengths_train)

34

In [100]:
ranker = ranker.fit(
    X=train[feature_cols].to_pandas(),
    y=train[target].to_pandas(),
    group=session_lengths_train,
    eval_set=[(train[feature_cols].to_pandas(), train[target].to_pandas()),
             (test[feature_cols].to_pandas(), test[target].to_pandas())
             ],
    eval_group=[session_lengths_train, session_lengths_test]
)



[1]	valid_0's ndcg@5: 0.772866	valid_1's ndcg@5: 0.787976
[2]	valid_0's ndcg@5: 0.773536	valid_1's ndcg@5: 0.788599
[3]	valid_0's ndcg@5: 0.77488	valid_1's ndcg@5: 0.78975
[4]	valid_0's ndcg@5: 0.77463	valid_1's ndcg@5: 0.789493
[5]	valid_0's ndcg@5: 0.774739	valid_1's ndcg@5: 0.789586
[6]	valid_0's ndcg@5: 0.774931	valid_1's ndcg@5: 0.78978
[7]	valid_0's ndcg@5: 0.775168	valid_1's ndcg@5: 0.790061
[8]	valid_0's ndcg@5: 0.775306	valid_1's ndcg@5: 0.790208
[9]	valid_0's ndcg@5: 0.775438	valid_1's ndcg@5: 0.790353
[10]	valid_0's ndcg@5: 0.775502	valid_1's ndcg@5: 0.790435
[11]	valid_0's ndcg@5: 0.775525	valid_1's ndcg@5: 0.79045
[12]	valid_0's ndcg@5: 0.775477	valid_1's ndcg@5: 0.790405
[13]	valid_0's ndcg@5: 0.775531	valid_1's ndcg@5: 0.790475
[14]	valid_0's ndcg@5: 0.775581	valid_1's ndcg@5: 0.79053
[15]	valid_0's ndcg@5: 0.775683	valid_1's ndcg@5: 0.790615
[16]	valid_0's ndcg@5: 0.775763	valid_1's ndcg@5: 0.790695
[17]	valid_0's ndcg@5: 0.775756	valid_1's ndcg@5: 0.790687
[18]	valid_0

In [101]:
importance_df = pd.DataFrame(
    {
        'features': ranker.feature_name_,
        'importance': ranker.feature_importances_
    }
).sort_values('importance', ascending=False)
importance_df

Unnamed: 0,features,importance
0,type_weight,36473350.0
1,cart_order_num,4439984.0
2,buy_buy_num,0.0
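Gain importances are easier to compare as shares of the total. A small sketch using the three values from the table above (the variable names are hypothetical):

```python
# Gain importances copied from the table above.
gains = {
    'type_weight': 36473350.0,
    'cart_order_num': 4439984.0,
    'buy_buy_num': 0.0,
}
total = sum(gains.values())
shares = {name: gain / total for name, gain in gains.items()}

for name, share in shares.items():
    print(f'{name}: {share:.1%}')
```

A share of zero means the trees gained nothing from splitting on that feature, which is worth checking against the feature's value distribution.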


In [102]:
pipe = Pipeline([
    ('model', ranker)
])

In [103]:
debug

False

In [104]:
if not debug:
    joblib.dump(
        value=pipe,
        filename=model_file)

In [105]:
# del train, train_labels
# gc.collect()

# Load models 

In [94]:
model_file

'../model_training/all_rerank_v1_100/all_ranker.pkl'

In [95]:
! ls -al {model_file}

-rw-r--r--  1 hua  staff  366350 Jan 25 06:44 ../model_training/all_rerank_v1_100/all_ranker.pkl


In [96]:
new_pipeline = joblib.load(
    filename=model_file
)

In [97]:
new_pipeline

# Predict on test data

Let's load our test set, process it and predict on it.

In [98]:
final_submission_file

'../submission/submission_for_validation/all_all_rerank_v1_100_test_submission.csvtest_stage1_limit'

In [99]:
test_candidate_file

'../submission/candidate_for_validation/feature_recnum_40_candidate_v2_train1_data_test_submission.csvtest_stage1_limit'

In [100]:
# ! ls ../submission/candidate_for_validation/

In [101]:
# ../submission/candidate_for_validation/

In [102]:
test.head()

type_weight,cart_order_num,buy_buy_num,session,aid,gt
f32,i32,i32,i32,i32,i64
0.0,0,0,11098528,990658,6
0.0,0,0,11098528,950341,6
0.0,0,0,11098528,1679529,1
0.0,0,0,11098528,1462506,6
0.0,0,0,11098528,1561739,6


In [103]:
scores = new_pipeline.predict(test[feature_cols].to_pandas())

# Create submission

In [104]:
test.head()

type_weight,cart_order_num,buy_buy_num,session,aid,gt
f32,i32,i32,i32,i32,i64
0.0,0,0,11098528,990658,6
0.0,0,0,11098528,950341,6
0.0,0,0,11098528,1679529,1
0.0,0,0,11098528,1462506,6
0.0,0,0,11098528,1561739,6


In [105]:
feature_cols

['type_weight', 'cart_order_num', 'buy_buy_num']

In [106]:
test = test.with_columns(pl.Series(name='score', values=scores))
test_predictions = test.sort(['session', 'score'], reverse=True).groupby('session').agg([
    pl.col('aid').limit(20).list()
])
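The sort-then-group step above keeps the 20 highest-scoring aids per session. The same logic in plain Python, as a sketch with a hypothetical helper `top_k_per_session` and toy rows:

```python
from collections import defaultdict


def top_k_per_session(rows, k=20):
    """rows: iterable of (session, aid, score). Returns {session: [aid, ...]}."""
    by_session = defaultdict(list)
    for session, aid, score in rows:
        by_session[session].append((score, aid))
    return {
        session: [aid for _, aid in
                  sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]]
        for session, scored in by_session.items()
    }


# Toy rows (hypothetical data): (session, aid, score).
toy_rows = [(1, 100, 0.2), (1, 101, 1.5), (1, 102, -0.8), (2, 200, 0.1)]
top2 = top_k_per_session(toy_rows, k=2)
```

Only the order of the scores within a session matters here; their absolute values are not comparable across sessions.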

In [107]:
test_predictions.head()

session,aid
i32,list[i32]
12899778,"[561560, 1167224, ... 1236775]"
12899777,"[384045, 1308634, ... 1236775]"
12899776,"[548599, 695829, ... 1460571]"
12899775,"[1743151, 1760714, ... 1236775]"
12899774,"[33035, 1539309, ... 1236775]"


In [108]:
session_types = []
labels = []

for session, preds in zip(test_predictions['session'].to_numpy(), test_predictions['aid'].to_numpy()):
    l = ' '.join(str(p) for p in preds)
    for session_type in ['clicks', 'carts', 'orders']:
        labels.append(l)
        session_types.append(f'{session}_{session_type}')

In [109]:
! ls -al {final_submission_file}

ls: ../submission/submission_for_validation/all_all_rerank_v1_100_test_submission.csvtest_stage1_limit: No such file or directory


In [110]:
! ls ../submission/submission_for_validation/

all_all_rerank_v1_100_test_submission.csv
all_all_rerank_v1_10_test_submission.csv
carts_rerank_v1_10_test_submission.csv
carts_rerank_v2_10_test_submission.csv
carts_rerank_v3_10_test_submission.csv
rerank_v1_10_test_submission.csv
rerank_v1_40_test_submission.csv


In [111]:
submission = pl.DataFrame({'session_type': session_types, 'labels': labels})
if final_submission:
    # Sanity check: expected number of unique session_type rows in the final submission.
    assert len(submission['session_type'].unique()) == 5015409

In [112]:
if not debug:
    submission.write_csv(final_submission_file)

In [122]:
submission.head()

session_type,labels
str,str
"""12899778_click...","""561560 1167224..."
"""12899778_carts...","""561560 1167224..."
"""12899778_order...","""561560 1167224..."
"""12899777_click...","""384045 1308634..."
"""12899777_carts...","""384045 1308634..."


In [123]:
final_submission_file

'../submission/submission_for_validation/all_all_rerank_v1_100_test_submission.csvtest_stage1_limit'

In [125]:
! ls -al {final_submission_file}

-rw-r--r--  1 hua  staff  872188802 Jan 28 23:37 ../submission/submission_for_validation/all_all_rerank_v1_100_test_submission.csvtest_stage1_limit


In [115]:
# ! ls -al ../submission/submission_for_validation/carts_rerank_v1_10_test_submission.csv

# Dig into the features & target

In [116]:
test['gt'].value_counts()

gt,counts
i64,u32
0,70280510
6,311649
3,404458
1,1598547


In [117]:
test['cart_order_num'].value_counts().sort('counts', reverse=True)

cart_order_num,counts
i32,u32
1,33132177
0,31898031
2,4875399
3,1464320
4,587457
5,274142
6,142672
7,80000
8,47016
9,29105


In [118]:
test['buy_buy_num'].value_counts().sort('counts', reverse=True)

buy_buy_num,counts
i32,u32
0,72595164


In [119]:
train['buy_buy_num'].value_counts().sort('counts', reverse=True)

NotFoundError: buy_buy_num

In [134]:
test.filter(pl.col('gt')==6)[['session'] +feature_cols + [target]]
# .filter(pl.col('session')==11098528)['gt']

session,type_weight,cart_order_num,buy_buy_num,gt
i32,f32,i32,i32,i64
11098528,0.41,0,0,6
11098530,8.23,1,0,6
11098531,2.36,6,0,6
11098533,0.65,0,0,6
11098534,2.17,2,0,6
11098538,1.0,4,0,6
11098538,1.64,0,0,6
11098542,5.65,3,0,6
11098542,4.65,0,0,6
11098542,4.09,0,0,6


In [197]:
session_id = 12899005

In [198]:
single_session = test.filter(pl.col('session')==session_id).sort('score', reverse=True)
single_session

type_weight,cart_order_num,buy_buy_num,session,aid,gt,score
f32,i32,i32,i32,i32,i64,f64
0.41,0,0,12899005,88011,6,1.916284
0.0,1,0,12899005,1781619,0,-0.799011
0.0,1,0,12899005,1573684,0,-0.799011
0.0,1,0,12899005,1380413,0,-0.799011
0.0,1,0,12899005,1468482,0,-0.799011
0.0,1,0,12899005,187164,0,-0.799011
0.0,1,0,12899005,714208,0,-0.799011
0.0,1,0,12899005,179554,0,-0.799011
0.0,1,0,12899005,613617,0,-0.799011
0.0,1,0,12899005,1759013,0,-0.799011


In [199]:
single_session['gt'].value_counts()

gt,counts
i64,u32
0,39
6,1


In [200]:
label_aids = test_labels.filter(pl.col('session')==session_id)['aid'].to_pandas()
label_aids

0    88011
Name: aid, dtype: int32

In [201]:
candiate_aids = single_session['aid'].to_pandas()
candiate_aids

0       88011
1     1781619
2     1573684
3     1380413
4     1468482
5      187164
6      714208
7      179554
8      613617
9     1759013
10      88724
11     137815
12     928064
13     159789
14     642804
15    1716469
16     876493
17    1406660
18    1460571
19    1236775
20     166037
21    1531805
22     836852
23     332654
24     923948
25     122983
26     258353
27     231487
28     634452
29    1043508
30    1006198
31     544144
32    1596897
33     986164
34     832192
35    1733943
36    1462420
37     801774
38    1581568
39     373490
Name: aid, dtype: int32

In [202]:
set(candiate_aids).intersection(label_aids)

{88011}
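The same check can be run over many sessions at once to see how much of the ground truth the candidate stage can reach at all, which caps what any ranker can achieve. A minimal sketch with hypothetical names (`candidate_recall` and the toy dictionaries); it computes a plain hit ratio, not the competition's type-weighted metric.

```python
def candidate_recall(candidates, ground_truth):
    """Share of ground-truth aids that appear among each session's candidates."""
    hits = 0
    total = 0
    for session, truth in ground_truth.items():
        hits += len(set(truth) & set(candidates.get(session, [])))
        total += len(truth)
    return hits / total if total else 0.0


# Session 12899005 reuses aids shown above; session 1 is made up.
toy_candidates = {12899005: [88011, 1781619], 1: [5, 6]}
toy_truth = {12899005: [88011], 1: [7, 5, 9]}
recall = candidate_recall(toy_candidates, toy_truth)
```

If this ratio is low, adding more or better candidates is likely to help more than tuning the ranker.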