Import the required libraries.

In [1]:
import numpy as np
import pandas as pd
import pickle
import xgboost as xgb
import lightgbm

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt

Unpack the data archive into the folder that contains this Jupyter notebook (baseline.ipynb). This creates a data folder containing the required files.

The task in this competition is to predict the age category of a bank client from the client's transactions.
The training set contains transaction records for 30000 clients, stored in **transactions_train.csv**. The correct age category for each client in the training set is stored in **train_target.csv**.

Load the transaction data and the correct answers.

In [3]:
PROJECT_PATH = '/data/kireev-ia/data_open_ds/age-prediction-nti-sbebank-2019/'

In [4]:
transactions_train = pd.read_csv(PROJECT_PATH + 'data/transactions_train.csv')
transactions_test = pd.read_csv(PROJECT_PATH + 'data/transactions_test.csv')

In [5]:
train_target=pd.read_csv(PROJECT_PATH + 'data/train_target.csv')

Take a look at the data.

In [6]:
transactions_train.head()

Unnamed: 0,client_id,trans_date,small_group,amount_rur
0,33172,6,4,71.463
1,33172,6,35,45.017
2,33172,8,11,13.887
3,33172,9,11,15.983
4,33172,10,11,21.341


* client_id - unique client identifier
* trans_date - transaction date
* small_group - purchase category
* amount_rur - transaction amount

In [7]:
train_target.head(5)

Unnamed: 0,client_id,bins
0,24662,2
1,1046,0
2,34089,2
3,34848,1
4,47076,3


* client_id - unique client identifier, matches the client_id field in the transactions
* bins - the target variable to predict, the client's age category
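Because `bins` is attached to the transactions through `client_id`, it can be worth confirming that every labelled client actually has transactions before building features. The following is a minimal sketch on toy data; the helper name `missing_clients` and the toy frames are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd


def missing_clients(target: pd.DataFrame, transactions: pd.DataFrame) -> set:
    """Return client_ids that appear in the target but have no transactions."""
    return set(target['client_id']) - set(transactions['client_id'])


# Toy frames that mimic the column layout of the two CSV files (values are made up).
toy_trx = pd.DataFrame({
    'client_id': [1, 1, 2],
    'trans_date': [6, 8, 9],
    'small_group': [4, 11, 11],
    'amount_rur': [71.463, 13.887, 15.983],
})
toy_target = pd.DataFrame({'client_id': [1, 2, 3], 'bins': [2, 0, 1]})

# Client 3 has a label but no transactions, so it is reported as missing.
print(missing_clients(toy_target, toy_trx))
```

On the real data the same call would be `missing_clients(train_target, transactions_train)`; an empty set means the later merges should not silently drop labelled clients.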

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
train_target, test_target = train_test_split(train_target, test_size=0.4,
                                             stratify=train_target['bins'], random_state=42)

In [10]:
train_target.shape, test_target.shape

((18000, 2), (12000, 2))
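The split above keeps 60% of the labelled clients for training and 40% for evaluation, stratified by `bins`. One way to check that stratification held is to compare the class shares of the two parts. This is a sketch only; `max_share_gap` is a hypothetical helper and the toy frames stand in for `train_target` and `test_target`.

```python
import pandas as pd


def max_share_gap(part_a: pd.DataFrame, part_b: pd.DataFrame, col: str = 'bins') -> float:
    """Largest absolute difference in class share between two data frames."""
    shares_a = part_a[col].value_counts(normalize=True)
    shares_b = part_b[col].value_counts(normalize=True)
    # fill_value=0.0 handles a class that is absent from one of the parts.
    return float(shares_a.sub(shares_b, fill_value=0.0).abs().max())


balanced_a = pd.DataFrame({'bins': [0, 0, 1, 1]})
balanced_b = pd.DataFrame({'bins': [0, 1]})
skewed = pd.DataFrame({'bins': [0, 0, 0, 1]})

print(max_share_gap(balanced_a, balanced_b))  # identical class shares
print(max_share_gap(skewed, balanced_b))      # class shares differ
```

For a stratified split, `max_share_gap(train_target, test_target)` is expected to be close to zero.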

# Metric Learning Dataset preparation

In [None]:
transactions_all = pd.concat([transactions_train, transactions_test], axis=0)

In [None]:
transactions_all.groupby('client_id')['trans_date'].count().value_counts().sort_index()

In [None]:
transactions_all['trans_date'].value_counts().sort_index()

In [None]:
transactions_all['small_group'].value_counts().sort_index()

In [None]:
transactions_all['amount_rur'].clip(None, 200).hist(figsize=(12, 6), bins=20)

In [None]:
np.log1p(transactions_all['amount_rur']).hist(figsize=(12, 6), bins=20)

In [None]:
transactions_all['amount_rur'] = np.log1p(transactions_all['amount_rur']).clip(0, 8) / 8
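The cell above compresses `amount_rur` with `log1p`, clips the result to [0, 8] and divides by 8, so the feature ends up in [0, 1]. The sketch below isolates that transform in a function to show the saturation point; the name `normalize_amount` and the `cap` parameter are assumptions introduced for illustration.

```python
import numpy as np


def normalize_amount(amount, cap=8.0):
    """Map a non-negative amount to [0, 1] via log1p, clipping at `cap`."""
    return np.clip(np.log1p(amount), 0.0, cap) / cap


amounts = np.array([0.0, 13.887, 71.463, 4505.971])
print(normalize_amount(amounts))

# Amounts above expm1(8), roughly 2980, all map to 1.0 and become indistinguishable.
print(np.expm1(8.0))
```

The cap of 8 is therefore a trade-off: it bounds the input range but discards differences between very large transactions.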

In [None]:
%%time
train_dataset = transactions_all \
    .assign(event_time=lambda x: x['trans_date']) \
    .set_index(['client_id', 'event_time']).sort_index() \
    .groupby('client_id').apply(lambda x: {k: np.array(v) for k, v in x.to_dict(orient='list').items()}) \
    .rename('feature_arrays').reset_index().to_dict(orient='records')

In [None]:
def copy_time(rec):
    rec['event_time'] = rec['feature_arrays']['trans_date'].copy()
    return rec

In [None]:
train_dataset = [copy_time(r) for r in train_dataset]

In [None]:
len(train_dataset)

In [None]:
path_for_save = PROJECT_PATH + 'sber_all_trx.p'
with open(path_for_save, 'wb') as f:
    pickle.dump(train_dataset, f)
print(f'Saved to: "{path_for_save}"')

In [None]:
list(train_dataset[0]['feature_arrays'].keys())

# Target DL Dataset Preparation

In [None]:
path_for_load = PROJECT_PATH + 'sber_all_trx.p'
with open(path_for_load, 'rb') as f:
    all_trx_dataset = pickle.load(f)
print(f'Loaded from: "{path_for_load}"')

In [None]:
d_train_target = train_target.set_index('client_id').to_dict(orient='index')
d_test_target = test_target.set_index('client_id').to_dict(orient='index')

In [None]:
train_trx_dataset = [
    dict([('target', d_train_target[rec['client_id']]['bins'])] + list(rec.items()))
    for rec in all_trx_dataset if rec['client_id'] in d_train_target
]

In [None]:
test_trx_dataset = [
    dict([('target', d_test_target[rec['client_id']]['bins'])] + list(rec.items()))
    for rec in all_trx_dataset if rec['client_id'] in d_test_target
]

In [None]:
path_for_save = PROJECT_PATH + 'sber_train_trx_dataset.p'
with open(path_for_save, 'wb') as f:
    pickle.dump(train_trx_dataset, f)
print(f'Saved to: "{path_for_save}"')

path_for_save = PROJECT_PATH + 'sber_test_trx_dataset.p'
with open(path_for_save, 'wb') as f:
    pickle.dump(test_trx_dataset, f)
print(f'Saved to: "{path_for_save}"')

# Prepare features

## Agg features

In [11]:
# transactions_train.set_index('client_id') \
#     .groupby(level='client_id')['trans_date'].diff() \
#     .groupby(level='client_id').agg(['mean', 'std']) \
#     .rename(columns={'mean': 'ext_tdd_mean', 'std': 'ext_tdd_std'})

In [12]:
agg_features = pd.concat([
    transactions_train.groupby('client_id')['amount_rur'].agg(['sum','mean','std','min','max']),
    transactions_train.groupby('client_id')['small_group'].nunique().rename('ext_small_group_unique'),
], axis=1)

agg_features.head()

client_id,sum,mean,std,min,max,ext_small_group_unique
4,28404.121,39.450168,73.511624,0.043,1341.802,22
6,15720.739,21.535259,26.200397,0.045,315.781,17
7,53630.036,69.379089,253.261383,0.043,4505.971,42
10,34419.365,48.752642,63.191701,0.045,654.893,28
11,26789.404,32.991877,107.395139,0.388,2105.058,34


In [13]:
cat_counts_train = pd.concat([
    transactions_train.pivot_table(
        index='client_id', columns='small_group', values='amount_rur', aggfunc='count').fillna(0.0),
    transactions_train.pivot_table(
        index='client_id', columns='small_group', values='amount_rur', aggfunc='mean').fillna(0.0),
    transactions_train.pivot_table(
        index='client_id', columns='small_group', values='amount_rur', aggfunc='std').fillna(0.0),
], axis=1, keys=['small_group_count', 'small_group_mean', 'small_group_std'])


In [14]:
cat_counts_train

Unnamed: 0_level_0,small_group_count,small_group_count,small_group_count,small_group_count,small_group_count,small_group_count,small_group_count,small_group_count,small_group_count,small_group_count,...,small_group_std,small_group_std,small_group_std,small_group_std,small_group_std,small_group_std,small_group_std,small_group_std,small_group_std,small_group_std
small_group,0,1,2,3,4,5,6,7,8,9,...,186,187,190,191,192,193,195,196,197,198
client_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
4,0.0,447.0,1.0,44.0,93.0,0.0,0.0,0.0,1.0,13.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,2.0,397.0,0.0,172.0,10.0,0.0,0.0,0.0,0.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2.0,79.0,5.0,27.0,19.0,1.0,0.0,2.0,1.0,39.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,12.0,309.0,1.0,71.0,65.0,0.0,0.0,0.0,3.0,19.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11,2.0,423.0,0.0,59.0,23.0,3.0,0.0,0.0,0.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49993,0.0,206.0,1.0,83.0,40.0,4.0,0.0,1.0,1.0,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49995,14.0,158.0,5.0,66.0,30.0,2.0,0.0,1.0,2.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49996,1.0,296.0,0.0,11.0,42.0,2.0,0.0,0.0,2.0,18.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49997,0.0,372.0,0.0,12.0,10.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
cat_counts_train.columns = ['_'.join(map(str, c)) for c in cat_counts_train.columns.values]

In [16]:
cat_counts_train.head()

client_id,small_group_count_0,small_group_count_1,small_group_count_2,small_group_count_3,small_group_count_4,small_group_count_5,small_group_count_6,small_group_count_7,small_group_count_8,small_group_count_9,...,small_group_std_186,small_group_std_187,small_group_std_190,small_group_std_191,small_group_std_192,small_group_std_193,small_group_std_195,small_group_std_196,small_group_std_197,small_group_std_198
4,0.0,447.0,1.0,44.0,93.0,0.0,0.0,0.0,1.0,13.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,2.0,397.0,0.0,172.0,10.0,0.0,0.0,0.0,0.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2.0,79.0,5.0,27.0,19.0,1.0,0.0,2.0,1.0,39.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,12.0,309.0,1.0,71.0,65.0,0.0,0.0,0.0,3.0,19.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11,2.0,423.0,0.0,59.0,23.0,3.0,0.0,0.0,0.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
col_agg_features = \
    [col for col in agg_features.columns.tolist() if not col.startswith('ext_')] + \
    [col for col in cat_counts_train.columns.tolist() if col.startswith('small_group_count_')]
len(col_agg_features)

207

In [18]:
col_ext_agg_features = agg_features.columns.tolist() + cat_counts_train.columns.tolist()
len(col_ext_agg_features)

604

## Random Tree Features

In [None]:
from tqdm.autonotebook import tqdm

In [None]:
transactions_train.head()

In [None]:
def build_rt_features(vector_size):
    def to_pivot_norm(df, col):
        df = df.assign(rt=col).pivot_table(
            index='client_id', columns='rt', values='trans_date', aggfunc='count').fillna(0.0)
        df = df.div(df.sum(axis=1), axis=0)
        return df

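    # Assign every distinct value a random bucket in [0, vector_size).
    # Drawing exactly one bucket per distinct value avoids the hard-coded 1000,
    # which would leave values unmapped if there were more than 1000 of them.
    _dates = transactions_train['trans_date'].unique()
    _rt_map_trans_date = dict(zip(_dates, np.random.randint(0, vector_size, len(_dates))))

    _groups = transactions_train['small_group'].unique()
    _rt_map_group = dict(zip(_groups, np.random.randint(0, vector_size, len(_groups))))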

    _rt_map_amount_rur = np.unique(transactions_train['amount_rur'].quantile(np.linspace(0, 1, vector_size)).values[:-1])

    _rt_e_size = vector_size
    _rt_m1 = {v: i for v, i in zip(range(vector_size * vector_size), np.random.randint(0, _rt_e_size * 8, vector_size * vector_size))}
    _rt_m2 = {v: i for v, i in zip(range(_rt_e_size * 8 * vector_size),
                                   np.random.randint(0, _rt_e_size, _rt_e_size * 8 * vector_size))}
    
    _v1 = transactions_train['trans_date'].map(_rt_map_trans_date)
    _v2 = transactions_train['small_group'].map(_rt_map_group)
    _v3 = pd.cut(transactions_train['amount_rur'],
                 _rt_map_amount_rur.tolist() + [max(transactions_train['amount_rur']) + 1],
                 labels=range(len(_rt_map_amount_rur))).astype(int).values

    s_rt_bin = ((_v1 * vector_size + _v2).map(_rt_m1) * vector_size + _v3).map(_rt_m2)

    rt_agg_features = pd.concat([
        to_pivot_norm(transactions_train, s_rt_bin),
        to_pivot_norm(transactions_train, _v1),
        to_pivot_norm(transactions_train, _v2),
        to_pivot_norm(transactions_train, _v3),
    ], axis=1)
    
    return rt_agg_features

In [None]:
rt_agg_features = pd.concat([build_rt_features(32) for _ in tqdm(range(4))], axis=1)

In [None]:
rt_agg_features.columns = [f"rt_{i}" for i in range(len(rt_agg_features.columns))]

In [None]:
col_rt_agg_features = rt_agg_features.columns.tolist()
len(col_rt_agg_features)
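
The core idea of `build_rt_features` can be illustrated in a simplified form: every distinct value of a column is assigned a random bucket, and each client is described by the share of its transactions that fall into each bucket. The sketch below covers only this single-column bucketing, not the two-level combination of date, group and amount used in the function above; `random_bucket_shares`, `n_buckets` and `seed` are names assumed for illustration.

```python
import numpy as np
import pandas as pd


def random_bucket_shares(df, col, n_buckets, seed=0):
    """Per-client share of transactions in randomly assigned buckets of `col`."""
    rng = np.random.default_rng(seed)
    categories = df[col].unique()
    # One random bucket per distinct category value.
    bucket_of = dict(zip(categories, rng.integers(0, n_buckets, size=len(categories))))
    buckets = df[col].map(bucket_of)
    counts = pd.crosstab(df['client_id'], buckets)
    # Normalize each row so that the features are shares rather than raw counts.
    return counts.div(counts.sum(axis=1), axis=0)


toy = pd.DataFrame({
    'client_id': [1, 1, 1, 2, 2],
    'small_group': [4, 35, 11, 11, 4],
})
shares = random_bucket_shares(toy, 'small_group', n_buckets=2)
print(shares)
```

Repeating the procedure with different seeds and concatenating the results gives several independent random views of the same data, which is what the loop over `build_rt_features(32)` does with its own, richer bucketing.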

## Featuretools

In [None]:
import featuretools as ft

In [None]:
es = ft.EntitySet()

In [None]:
transactions_train.head()

In [None]:
es.entity_from_dataframe(
    entity_id='transactions_train',
    dataframe=transactions_train.reset_index(),
    index='index',
    variable_types={
        'small_group': ft.variable_types.Categorical,
        'amount_rur': ft.variable_types.Numeric,
    },
    make_index=False,
    time_index='trans_date',
)

In [None]:
es['transactions_train'].variables

In [None]:
es.normalize_entity(
    base_entity_id='transactions_train',
    new_entity_id='client',
    index='client_id',
    additional_variables=None,
)

In [None]:
es.normalize_entity(
    base_entity_id='transactions_train',
    new_entity_id='small_group',
    index='small_group',
    additional_variables=None,
)

In [None]:
es['transactions_train'].df.head()

In [None]:
es['client'].df.head()

In [None]:
es['small_group'].df.head()

In [None]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity='client',
    max_depth=2,
)

In [None]:
feature_matrix

In [None]:
feature_defs

## Metric Learning Embeddings

In [19]:
df_embeddings = pd.read_pickle(PROJECT_PATH + 'sber_all_vectors.pickle').set_index('client_id')
df_embeddings.columns = ["embedding_" + col for col in df_embeddings.columns]
df_embeddings.head()

client_id,embedding_v000,embedding_v001,embedding_v002,embedding_v003,embedding_v004,embedding_v005,embedding_v006,embedding_v007,embedding_v008,embedding_v009,...,embedding_v186,embedding_v187,embedding_v188,embedding_v189,embedding_v190,embedding_v191,embedding_v192,embedding_v193,embedding_v194,embedding_v195
0,-0.018819,-0.1543,0.001924,0.021944,-0.154031,-0.116138,-0.141707,-0.116143,-0.008368,-0.033296,...,0.10924,0.069249,0.002146,-0.042166,-0.0117,0.101981,0.127735,0.062277,-0.124495,0.073498
1,-0.07207,-0.151834,0.042985,0.017488,-0.160942,-0.039488,-0.027478,-0.064663,-0.025479,-0.034576,...,0.112494,0.001801,0.028328,-0.042896,-0.014569,0.092731,0.125989,-0.010774,-0.12074,0.116426
2,-0.021234,-0.143193,0.012123,0.048002,-0.172032,-0.146024,-0.151516,-0.101916,-0.045478,-0.035959,...,-0.067618,0.004167,0.05104,-0.039184,-0.023984,0.047566,0.087273,-0.033537,-0.04457,0.038441
3,0.0479,-0.139892,-0.022462,0.033817,-0.172863,-0.108164,-0.065419,-0.05035,0.00765,-0.040774,...,0.066909,0.066017,-0.054806,-0.040898,-0.043716,0.036167,0.12587,-0.026692,-0.124356,0.093088
4,-0.099266,-0.137044,0.088491,0.025266,-0.144386,-0.111142,-0.137188,-0.099266,-0.017697,-0.047963,...,0.121105,0.017826,0.027143,-0.05156,-0.035835,0.080949,0.152129,0.007338,-0.140292,0.133065


In [20]:
col_embedding_features = df_embeddings.columns.tolist()

## Target Model Scores

In [21]:
df_target_scores = pd.concat([
    pd.read_pickle(PROJECT_PATH + f"sber_target_vectors/{i:03d}.pickle").set_index('client_id')
    for i in range(5)
], axis=0)
df_target_scores.columns = ["score_" + col for col in df_target_scores.columns]
df_target_scores.head()

client_id,score_v000,score_v001,score_v002,score_v003
4,-7.723969,-0.00902,-10.558405,-4.766298
6,-1.983091,-0.728669,-4.681918,-0.992781
10,-1.154974,-1.274818,-2.416892,-1.15121
12,-4.873233,-9.158072,-0.008199,-7.796382
13,-2.030518,-7.715308,-0.152592,-4.62466


In [22]:
len(df_target_scores)

30000

In [23]:
col_score_features = df_target_scores.columns.tolist()

## Fine Tuning Model Scores

In [24]:
df_ft_scores = pd.concat([
    pd.read_pickle(PROJECT_PATH + f"sber_ft_vectors/{i:03d}.pickle").set_index('client_id')
    for i in range(5)
], axis=0)
df_ft_scores.columns = ["ft_" + col for col in df_ft_scores.columns]
df_ft_scores.head()

client_id,ft_v000,ft_v001,ft_v002,ft_v003
4,-7.868518,-0.005001,-10.209911,-5.388455
6,-1.785164,-0.783238,-4.043094,-1.027884
10,-1.285108,-1.686247,-2.012617,-0.905035
12,-5.671308,-10.613232,-0.003599,-8.994674
13,-2.984654,-7.455932,-0.056067,-5.686997


In [25]:
len(df_ft_scores)

30000

In [26]:
col_ft_features = df_ft_scores.columns.tolist()

## Combine all features

In [27]:
train = pd.merge(train_target, agg_features, left_on='client_id', right_index=True)
train = pd.merge(train, cat_counts_train, left_on='client_id', right_index=True)
# train = pd.merge(train, rt_agg_features, left_on='client_id', right_index=True)
train = pd.merge(train, df_embeddings, left_on='client_id', right_index=True)
train = pd.merge(train, df_target_scores, left_on='client_id', right_index=True)
train = pd.merge(train, df_ft_scores, left_on='client_id', right_index=True)

In [28]:
train.head()

Unnamed: 0,client_id,bins,sum,mean,std,min,max,ext_small_group_unique,small_group_count_0,small_group_count_1,...,embedding_v194,embedding_v195,score_v000,score_v001,score_v002,score_v003,ft_v000,ft_v001,ft_v002,ft_v003
4812,45979,3,87476.355,110.729563,302.37412,0.06,4803.101,58,0.0,189.0,...,-0.097828,0.061561,-1.014562,-1.607859,-3.560163,-0.894801,-1.387066,-1.424644,-3.517234,-0.734142
18933,11534,0,32977.599,43.678939,79.426087,0.284,979.773,49,2.0,162.0,...,-0.060607,0.09076,-0.808736,-2.889763,-2.560991,-0.863315,-0.666702,-3.153454,-3.693807,-0.869838
21318,33962,3,67879.611,66.224011,463.932282,0.302,14747.796,53,4.0,270.0,...,-0.12161,0.032585,-1.118014,-2.077857,-3.408392,-0.664018,-1.105899,-2.706249,-3.399079,-0.564058
25624,22211,0,40288.786,40.168281,134.134048,0.043,3579.59,53,4.0,328.0,...,-0.142472,0.107691,-0.457521,-2.963936,-2.553839,-1.436527,-0.35281,-4.43012,-1.989951,-1.905997
13070,28838,0,51654.224,52.440837,191.147127,0.043,4872.555,55,0.0,218.0,...,-0.146451,0.079197,-1.208248,-1.18542,-3.472433,-1.008898,-0.970942,-2.163274,-3.064574,-0.777286


In [29]:
test = pd.merge(test_target, agg_features, left_on='client_id', right_index=True)
test = pd.merge(test, cat_counts_train, left_on='client_id', right_index=True)
# test = pd.merge(test, rt_agg_features, left_on='client_id', right_index=True)
test = pd.merge(test, df_embeddings, left_on='client_id', right_index=True)
test = pd.merge(test, df_target_scores, left_on='client_id', right_index=True)
test = pd.merge(test, df_ft_scores, left_on='client_id', right_index=True)

In [30]:
test.head()

Unnamed: 0,client_id,bins,sum,mean,std,min,max,ext_small_group_unique,small_group_count_0,small_group_count_1,...,embedding_v194,embedding_v195,score_v000,score_v001,score_v002,score_v003,ft_v000,ft_v001,ft_v002,ft_v003
14986,13208,1,62400.013,61.843422,116.999367,0.518,1509.466,44,0.0,320.0,...,-0.160209,-0.028746,-5.160534,-0.079212,-8.289759,-2.656888,-4.909724,-0.051263,-7.165516,-3.174301
28251,2078,2,14677.71,20.557017,61.008919,0.432,1358.519,36,0.0,259.0,...,-0.055514,0.086506,-4.333508,-10.129864,-0.013571,-8.053814,-5.586629,-12.31354,-0.003805,-9.992947
17150,4331,2,12856.757,17.733458,32.678898,0.092,503.061,27,0.0,338.0,...,-0.045182,0.102017,-2.326214,-4.243108,-0.149793,-3.608734,-2.047637,-3.754925,-0.19849,-3.590286
19437,19860,3,44064.945,54.671148,182.899582,0.043,4121.575,50,1.0,181.0,...,-0.106573,0.059305,-0.6877,-3.069371,-2.948502,-0.920298,-0.572014,-3.048374,-3.979825,-0.995639
11437,15753,2,78882.494,96.315621,546.004161,0.045,4121.575,42,29.0,219.0,...,-0.145721,0.101698,-2.090112,-3.56941,-0.23004,-2.925225,-1.45885,-3.758252,-0.387053,-2.731569


In [33]:
param={'objective':'multi:softprob','num_class':4,'n_jobs':4,'seed':42}

# Estimate features

## Baseline Agg features

In [None]:
%%time
y_train=train['bins']
y_test=test['bins']
X_train=train[col_agg_features]
X_test=test[col_agg_features]

In [34]:
model=xgb.XGBClassifier(**param,n_estimators=300)
model.fit(X_train,y_train)

pred = model.predict(X_test)

accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6108
CPU times: user 1min 58s, sys: 196 ms, total: 1min 58s
Wall time: 30.2 s


On the public leaderboard, this prediction should score 0.6118.

On the held-out part of train, this prediction should score 0.6108.
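The accuracy reported throughout this notebook is the share of held-out clients whose predicted `bins` equals the true one. To put a number such as 0.6108 in context, it can help to compare it with a trivial predictor that always outputs the most frequent training class. The sketch below shows both computations on toy labels; the function names are assumptions and no claim is made here about the actual class balance of the data.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the true label."""
    return float((np.asarray(y_true) == np.asarray(y_pred)).mean())


def majority_class_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent label seen in y_train."""
    values, counts = np.unique(np.asarray(y_train), return_counts=True)
    majority = values[np.argmax(counts)]
    return float((np.asarray(y_test) == majority).mean())


print(accuracy([0, 1, 2, 3], [0, 1, 2, 0]))
print(majority_class_accuracy([0, 0, 1], [0, 1, 0, 0]))
```

With the notebook's variables, the reference point would be `majority_class_accuracy(y_train, y_test)`.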

In [37]:
%%time
model=lightgbm.LGBMClassifier(n_estimators=300, n_jobs=4)
model.fit(X_train,y_train)

pred = model.predict(X_test)

accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6091
CPU times: user 36.4 s, sys: 188 ms, total: 36.6 s
Wall time: 9.38 s


## Extended Agg features

In [56]:
%%time
y_train=train['bins']
y_test=test['bins']
X_train=train[col_ext_agg_features]
X_test=test[col_ext_agg_features]

CPU times: user 40 ms, sys: 20 ms, total: 60 ms
Wall time: 59.2 ms


In [44]:
model=xgb.XGBClassifier(**param,n_estimators=300)
model.fit(X_train,y_train)

CPU times: user 6min 32s, sys: 9.52 s, total: 6min 41s
Wall time: 1min 49s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=4,
              nthread=None, num_class=4, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=42, silent=None, subsample=1, verbosity=1)

In [45]:
pred = model.predict(X_test)
pred

array([1, 2, 2, ..., 3, 2, 3])

On the held-out part of train, this prediction should score 0.6182.

In [46]:
accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6182


In [57]:
%%time
model=lightgbm.LGBMClassifier(n_estimators=300, n_jobs=4)
model.fit(X_train,y_train)

pred = model.predict(X_test)

accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6208
CPU times: user 1min 58s, sys: 3.8 s, total: 2min 2s
Wall time: 35.3 s


## Random Tree agg features

In [None]:
%%time
y_train=train['bins']
y_test=test['bins']
X_train=train[col_rt_agg_features]
X_test=test[col_rt_agg_features]

In [69]:
model=xgb.XGBClassifier(**param, n_estimators=300)
model.fit(X_train,y_train)

CPU times: user 12min 47s, sys: 10.8 s, total: 12min 58s
Wall time: 3min 23s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=4,
              nthread=None, num_class=4, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=42, silent=None, subsample=1, verbosity=1)

In [70]:
pred = model.predict(X_test)
pred

array([1, 2, 2, ..., 0, 2, 3])

accuracy: 0.5992 when the trees have size 128 and both the combined tree and a tree per feature are used

In [71]:
accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6013


In [None]:
%%time
model=lightgbm.LGBMClassifier(n_estimators=300, n_jobs=4)
model.fit(X_train,y_train)

pred = model.predict(X_test)

accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

## Embedding features

In [58]:
%%time
y_train=train['bins']
y_test=test['bins']
X_train=train[col_embedding_features]
X_test=test[col_embedding_features]

CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 10.7 ms


In [31]:
model=xgb.XGBClassifier(**param,n_estimators=300)
model.fit(X_train,y_train)

CPU times: user 5min 30s, sys: 84 ms, total: 5min 31s
Wall time: 1min 22s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=4,
              nthread=None, num_class=4, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=42, silent=None, subsample=1, verbosity=1)

In [32]:
pred = model.predict(X_test)
pred

array([1, 2, 2, ..., 0, 2, 1])

In [33]:
accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6121


| accuracy   | n_epoch | train content | hid_s | embed_s | loss_margin | cnt_min | cnt_max |
| ---------- | ------- | ------------- | ----- | ------- | ----------- | ------- | ------- |
|   0.5033   | 25      |  train only   |   16  |   4     | 1.0         | 25      |  400    |
|   0.5215   | 25      |  train + test |   16  |   4     | 1.0         | 25      |  400    |
|   0.5389   | 50      |  train + test |   16  |   4     | 1.0         | 25      |  400    |
|   0.5537   | 50      |  train + test |   32  |   4     | 1.0         | 25      |  400    |
|   0.5643   | 25      |  train + test |  128  |   4     | 1.0         | 25      |  400    |
|   0.5916   | 50      |  train + test |  128  |   8     | 1.0         | 25      |  400    |
|   0.5982   | 50      |  train + test |  128  |   8     | 0.5         | 25      |  400    |
| **0.6057** | 50      |  train + test |  128  |  16     | 0.5         | 25      |  400    |
|   0.5977   | 50      |  train + test |  128  |  16     | 0.5         | 300     |  400    |
|   0.5894   | 50      |  train + test |  128  |  16     | 0.5         | 300     |  600    |
|   0.5968   | 50      |  train + test |  128  |  16     | 0.5         | 25      |  600    |
| ---------- | ------- | ------------- | ----- | ------- | ----------- | ------- | ------- |
|   0.5945   | 25      |  train + test |  128  |  16     | 0.5         | 25      |  100    |
|   0.6038   | 25      |  train + test |  128  |  16     | 0.5         | 80      |  300    |
|   0.5831   | 25      |  train + test |  128  |  16     | 0.5         | 250     |  700    |
|   0.6053   | 25      |  train + test |  128  |  16     | 0.5         | SML     |  SML    |
| **0.6097** | 75      |  train + test |  196  |  16     | 0.5         | 25      |  600    |
| **0.6122** | 150     |  train + test |  196  |  16     | 0.5         | 25      |  600    |
|   0.6117   | 300     |  train + test |  196  |  16     | 0.5         | 25      |  600    |
|   0.6121   | 300     |  all data     |  196  |  16     | 0.5         | 25      |  600    |
| ---------- | ------- | ------------- | ----- | ------- | ----------- | ------- | ------- |
|   0.5643   | 25      |  train + test |  128  |   4     | 1.0         | 25      |  400    |
|   0.5683   | 25 amntfix      |  train + test |  128  |   4     | 1.0         | 25      |  400    |

In [59]:
%%time
model=lightgbm.LGBMClassifier(n_estimators=300, n_jobs=4)
model.fit(X_train,y_train)

pred = model.predict(X_test)

accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6106
CPU times: user 1min 35s, sys: 2.7 s, total: 1min 38s
Wall time: 26.1 s


## Scores features

In [60]:
%%time
y_train=train['bins']
y_test=test['bins']
X_train=train[col_score_features]
X_test=test[col_score_features]

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 2.54 ms


In [34]:
model=xgb.XGBClassifier(**param,n_estimators=300)
model.fit(X_train,y_train)

CPU times: user 16.3 s, sys: 20 ms, total: 16.4 s
Wall time: 4.09 s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=4,
              nthread=None, num_class=4, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=42, silent=None, subsample=1, verbosity=1)

In [35]:
pred = model.predict(X_test)
pred

array([1, 2, 2, ..., 1, 2, 3])

In [36]:
accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6189


In [61]:
%%time
model=lightgbm.LGBMClassifier(n_estimators=300, n_jobs=4)
model.fit(X_train,y_train)

pred = model.predict(X_test)

accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6090
CPU times: user 8.24 s, sys: 2.05 s, total: 10.3 s
Wall time: 3.43 s


## FT scores

In [62]:
%%time
y_train=train['bins']
y_test=test['bins']
X_train=train[col_ft_features]
X_test=test[col_ft_features]

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 2.45 ms


In [56]:
model=xgb.XGBClassifier(**param,n_estimators=300)
model.fit(X_train,y_train)

CPU times: user 16.6 s, sys: 28 ms, total: 16.6 s
Wall time: 4.18 s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=4,
              nthread=None, num_class=4, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=42, silent=None, subsample=1, verbosity=1)

In [57]:
pred = model.predict(X_test)
pred

array([1, 2, 2, ..., 1, 2, 3])

In [58]:
accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6318


In [63]:
%%time
model=lightgbm.LGBMClassifier(n_estimators=300, n_jobs=4)
model.fit(X_train,y_train)

pred = model.predict(X_test)

accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6166
CPU times: user 8.23 s, sys: 2.01 s, total: 10.2 s
Wall time: 3.43 s


## Feature combinations

In [37]:
%%time
y_train=train['bins']
y_test=test['bins']
X_train=train[col_agg_features + col_embedding_features]
X_test=test[col_agg_features + col_embedding_features]

model=xgb.XGBClassifier(**param,n_estimators=300)
model.fit(X_train,y_train)

CPU times: user 7min 21s, sys: 232 ms, total: 7min 21s
Wall time: 1min 50s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=4,
              nthread=None, num_class=4, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=42, silent=None, subsample=1, verbosity=1)

In [38]:
pred = model.predict(X_test)
pred

array([1, 2, 2, ..., 3, 2, 1])

In [39]:
accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6207


| accuracy   | n_epoch | train content | hid_s | embed_s | loss_margin | cnt_min | cnt_max |
| ---------- | ------- | ------------- | ----- | ------- | ----------- | ------- | ------- |
|   0.60     | 25      |  train only   |   16  |   4     | 1.0         | 25      |  400    |
|   0.6120   | 25      |  train + test |   16  |   4     | 1.0         | 25      |  400    |
|   0.6112   | 50      |  train + test |   16  |   4     | 1.0         | 25      |  400    |
|   0.6156   | 50      |  train + test |   32  |   4     | 1.0         | 25      |  400    |
|   0.6149   | 25      |  train + test |  128  |   4     | 1.0         | 25      |  400    |
|   0.6117   | 50      |  train + test |  128  |   8     | 1.0         | 25      |  400    |
|   0.6148   | 50      |  train + test |  128  |   8     | 0.5         | 25      |  400    |
|   0.6181   | 50      |  train + test |  128  |  16     | 0.5         | 25      |  400    |
|   0.6160   | 50      |  train + test |  128  |  16     | 0.5         | 300     |  400    |
|   0.6107   | 50      |  train + test |  128  |  16     | 0.5         | 300     |  600    |
| **0.6183** | 50      |  train + test |  128  |  16     | 0.5         | 25      |  600    |
| ---------- | ------- | ------------- | ----- | ------- | ----------- | ------- | ------- |
|   0.6123   | 25      |  train + test |  128  |  16     | 0.5         | 25      |  100    |
|   0.6158   | 25      |  train + test |  128  |  16     | 0.5         | 80      |  300    |
|   0.6092   | 25      |  train + test |  128  |  16     | 0.5         | 250     |  700    |
|   0.6137   | 25      |  train + test |  128  |  16     | 0.5         | SML     |  SML    |
|   0.6218   | 75      |  train + test |  196  |  16     | 0.5         | 25      |  600    |
|   0.6218   | 150     |  train + test |  196  |  16     | 0.5         | 25      |  600    |
| **0.6241** | 300     |  train + test |  196  |  16     | 0.5         | 25      |  600    |
|   0.6207   | 300     |  all data     |  196  |  16     | 0.5         | 25      |  600    |
| ---------- | ------- | ------------- | ----- | ------- | ----------- | ------- | ------- |
|   0.6149   | 25      |  train + test |  128  |   4     | 1.0         | 25      |  400    |
|   0.6074   | 25 amntfix      |  train + test |  128  |   4     | 1.0         | 25      |  400    |

In [40]:
%%time
y_train = train['bins']
y_test = test['bins']
# Baseline aggregates + scores from target-supervised training
X_train = train[col_agg_features + col_score_features]
X_test = test[col_agg_features + col_score_features]

model = xgb.XGBClassifier(**param, n_estimators=300)
model.fit(X_train, y_train)

CPU times: user 2min 4s, sys: 96 ms, total: 2min 4s
Wall time: 31.5 s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=4,
              nthread=None, num_class=4, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=42, silent=None, subsample=1, verbosity=1)

In [41]:
pred = model.predict(X_test)
pred

array([1, 2, 2, ..., 1, 2, 1])

In [42]:
accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6269


In [59]:
%%time
y_train = train['bins']
y_test = test['bins']
# Baseline aggregates + fine-tuning scores
X_train = train[col_agg_features + col_ft_features]
X_test = test[col_agg_features + col_ft_features]

model = xgb.XGBClassifier(**param, n_estimators=300)
model.fit(X_train, y_train)

CPU times: user 2min 2s, sys: 80 ms, total: 2min 2s
Wall time: 30.7 s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=4,
              nthread=None, num_class=4, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=42, silent=None, subsample=1, verbosity=1)

In [60]:
pred = model.predict(X_test)
pred

array([1, 2, 2, ..., 3, 2, 1])

In [61]:
accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6316


In [64]:
%%time
y_train = train['bins']
y_test = test['bins']
# Improved aggregates + fine-tuning scores
X_train = train[col_ext_agg_features + col_ft_features]
X_test = test[col_ext_agg_features + col_ft_features]

CPU times: user 44 ms, sys: 52 ms, total: 96 ms
Wall time: 93.8 ms


In [144]:
model = xgb.XGBClassifier(**param, n_estimators=300)
model.fit(X_train, y_train)

CPU times: user 6min 39s, sys: 8.58 s, total: 6min 48s
Wall time: 1min 48s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=4,
              nthread=None, num_class=4, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=42, silent=None, subsample=1, verbosity=1)

In [145]:
pred = model.predict(X_test)
pred

array([1, 2, 2, ..., 1, 2, 3])

In [146]:
accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6382


In [65]:
%%time
# LightGBM with default hyperparameters on the X_train / X_test currently in memory
model = lightgbm.LGBMClassifier(n_estimators=300, n_jobs=4)
model.fit(X_train, y_train)

pred = model.predict(X_test)

accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6307
CPU times: user 1min 47s, sys: 3.67 s, total: 1min 51s
Wall time: 33.2 s


In [62]:
%%time
y_train = train['bins']
y_test = test['bins']
# Target-supervised scores + fine-tuning scores
X_train = train[col_score_features + col_ft_features]
X_test = test[col_score_features + col_ft_features]

model = xgb.XGBClassifier(**param, n_estimators=300)
model.fit(X_train, y_train)

CPU times: user 22.9 s, sys: 16 ms, total: 22.9 s
Wall time: 5.73 s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=4,
              nthread=None, num_class=4, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=42, silent=None, subsample=1, verbosity=1)

In [63]:
pred = model.predict(X_test)
pred

array([1, 2, 2, ..., 1, 2, 3])

In [64]:
accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6311


In [65]:
%%time
y_train = train['bins']
y_test = test['bins']
# Metric learning embeddings + fine-tuning scores
X_train = train[col_embedding_features + col_ft_features]
X_test = test[col_embedding_features + col_ft_features]

model = xgb.XGBClassifier(**param, n_estimators=300)
model.fit(X_train, y_train)

CPU times: user 5min 38s, sys: 160 ms, total: 5min 38s
Wall time: 1min 24s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=4,
              nthread=None, num_class=4, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=42, silent=None, subsample=1, verbosity=1)

In [66]:
pred = model.predict(X_test)
pred

array([1, 2, 2, ..., 3, 2, 3])

In [67]:
accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6296


In [43]:
%%time
y_train = train['bins']
y_test = test['bins']
# Baseline aggregates + embeddings + target-supervised scores
X_train = train[col_agg_features + col_embedding_features + col_score_features]
X_test = test[col_agg_features + col_embedding_features + col_score_features]

model = xgb.XGBClassifier(**param, n_estimators=300)
model.fit(X_train, y_train)

CPU times: user 7min 19s, sys: 132 ms, total: 7min 20s
Wall time: 1min 50s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=4,
              nthread=None, num_class=4, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=42, silent=None, subsample=1, verbosity=1)

In [44]:
pred = model.predict(X_test)
pred

array([1, 2, 2, ..., 1, 2, 3])

In [45]:
accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6231


In [68]:
%%time
y_train = train['bins']
y_test = test['bins']
# All four groups: aggregates, embeddings, target-supervised scores, fine-tuning scores
X_train = train[col_agg_features + col_embedding_features + col_score_features + col_ft_features]
X_test = test[col_agg_features + col_embedding_features + col_score_features + col_ft_features]

model = xgb.XGBClassifier(**param, n_estimators=300)
model.fit(X_train, y_train)

CPU times: user 7min 27s, sys: 236 ms, total: 7min 27s
Wall time: 1min 52s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=4,
              nthread=None, num_class=4, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=42, silent=None, subsample=1, verbosity=1)

In [69]:
pred = model.predict(X_test)
pred

array([1, 2, 2, ..., 3, 2, 3])

In [70]:
accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6303


In [46]:
%%time
y_train = train['bins']
y_test = test['bins']
# Metric learning embeddings + target-supervised scores
X_train = train[col_embedding_features + col_score_features]
X_test = test[col_embedding_features + col_score_features]

model = xgb.XGBClassifier(**param, n_estimators=300)
model.fit(X_train, y_train)

CPU times: user 5min 35s, sys: 72 ms, total: 5min 35s
Wall time: 1min 24s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=4,
              nthread=None, num_class=4, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=42, silent=None, subsample=1, verbosity=1)

In [47]:
pred = model.predict(X_test)
pred

array([1, 2, 2, ..., 1, 2, 3])

In [48]:
accuracy = (y_test == pred).mean()
print(f'accuracy: {accuracy:.4f}')

accuracy: 0.6206


# Summary table

| Method                                   | accuracy          | leaderboard position |
| ---------------------------------------- | ----------------- | -------------------- |
| Baseline on aggregate features           |   0.6108          | 144                  |
| Improved aggregate features              |   0.6182          | 100                  |
| Metric learning embeddings               |   0.6121 (0.6122) | 121 (120)            |
| Scores from target-supervised training   |   0.6189          | 96                   |
| Scores from fine-tuning on the target    | **0.6318**        | 57                   |
| ---------------------------------------- | ----------------- | -------------------- |
| aggregates + embeddings                  |   0.6207 (0.6241) | 87 (73)              |
| aggregates + target-supervised scores    |   0.6269          | 67                   |
| aggregates + fine-tuning scores          |   0.6316          | 57                   |
| improved aggregates + fine-tuning scores | **0.6382**        | 44                   |
| embeddings + scores                      |   0.6206          | 88                   |
| ---------------------------------------- | ----------------- | -------------------- |
| aggregates + scores + embeddings         |   0.6231          | 77                   |

---
The XGBoost model was used for these results.
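
Every row of the summary comes from the same fit / predict / accuracy cell applied to a different column list, so the repetition can be folded into one loop. The sketch below is an illustration, not the notebook's code: `evaluate_feature_sets` and `MajorityClassModel` are hypothetical names, and the stand-in classifier exists only so the example runs without xgboost installed.

```python
import numpy as np
import pandas as pd

def evaluate_feature_sets(train, test, feature_sets, make_model, target="bins"):
    """Fit a fresh model per named column list and return test accuracy."""
    results = {}
    for name, cols in feature_sets.items():
        model = make_model()
        model.fit(train[cols], train[target])
        pred = model.predict(test[cols])
        results[name] = float((test[target].to_numpy() == np.asarray(pred)).mean())
    # Highest accuracy first, similar to reading the summary table
    return pd.Series(results, name="accuracy").sort_values(ascending=False)

class MajorityClassModel:
    """Stand-in classifier: always predicts the most frequent training label."""
    def fit(self, X, y):
        self.label_ = pd.Series(y).mode().iloc[0]
        return self

    def predict(self, X):
        return np.full(len(X), self.label_)

# Tiny hypothetical frames; in the notebook these would be `train` and `test`
train_demo = pd.DataFrame({"f1": [1, 2, 3, 4], "f2": [0, 1, 0, 1], "bins": [1, 1, 2, 1]})
test_demo = pd.DataFrame({"f1": [5, 6], "f2": [1, 0], "bins": [1, 3]})

scores = evaluate_feature_sets(
    train_demo,
    test_demo,
    feature_sets={"f1 only": ["f1"], "f1 + f2": ["f1", "f2"]},
    make_model=MajorityClassModel,
)
print(scores)
```

With the notebook's objects, `make_model` could be `lambda: xgb.XGBClassifier(**param, n_estimators=300)` and `feature_sets` could map names such as `"agg + ft"` to `col_agg_features + col_ft_features`.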

LightGBM gives lower accuracy on the same features: 0.6307 versus 0.6382 for XGBoost in the comparison above.
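
Differences of a few tenths of a percentage point are close to the sampling noise of a hold-out estimate, so it can help to put a rough error bar on them. The sketch below uses the binomial standard error and treats the two accuracies as independent, which typically overstates the uncertainty when both models are scored on the same clients. The hold-out size of 12000 is an assumption (40% of the 30000 clients in `train_target`), and the function names are illustrative.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimate."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

def unpaired_z_score(acc_a, acc_b, n_samples):
    """Accuracy difference in units of its standard error (independence assumed)."""
    se_diff = math.sqrt(
        accuracy_standard_error(acc_a, n_samples) ** 2
        + accuracy_standard_error(acc_b, n_samples) ** 2
    )
    return (acc_a - acc_b) / se_diff

# Assumption: 12000 hold-out clients (40% of 30000)
N_TEST = 12000

se_xgb = accuracy_standard_error(0.6382, N_TEST)
z = unpaired_z_score(0.6382, 0.6307, N_TEST)
print(f"standard error: {se_xgb:.4f}, z-score: {z:.2f}")
```

A paired comparison on the per-client predictions (for example a bootstrap over clients) would be a tighter check than this unpaired approximation.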

AutoML does not work in multiclass mode.
