## Problem statement
We load the data, convert it to numeric form, fill in missing values, normalize it, and reduce memory usage.

We split the sample into training and validation sets in an 80/20 ratio.

We build an ensemble of decision trees with CatBoost, Yandex's gradient boosting library, and use cross-validation to find the best ensemble parameters.

We then make predictions and assess their quality with the quadratic weighted kappa metric.

Data:
* https://video.ittensive.com/machine-learning/prudential/train.csv.gz

Competition: https://www.kaggle.com/c/prudential-life-insurance-assessment/

© ITtensive, 2020

In [1]:
GRAIN = 11
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score, confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV, cross_val_score
from catboost import Pool, CatBoostClassifier
from sklearn import preprocessing
from etl_utils import reduce_mem_usage


data = pd.read_csv("https://video.ittensive.com/machine-learning/prudential/train.csv.gz")

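# Split Product_Info_2 (e.g. "D3") into its letter part and its digit part
data['Product_Info_2_1'] = data['Product_Info_2'].str.slice(0, 1)
data['Product_Info_2_2'] = pd.to_numeric(data['Product_Info_2'].str.slice(1, 2))
data = data.drop('Product_Info_2', axis='columns')

# One-hot encode the letter part; columns are named Product_Info_2_1A, Product_Info_2_1B, ...
onehot_df = pd.get_dummies(data['Product_Info_2_1'])
onehot_df.columns = ['Product_Info_2_1' + column for column in onehot_df.columns]
# Attach the one-hot columns, drop the original letter column, fill missing values with -1
data = pd.merge(left=data, right=onehot_df, left_index=True, right_index=True).drop('Product_Info_2_1', axis=1).fillna(-1)
del onehot_df

# Shift the target so that class labels start at 0
data['Response'] = data['Response'] - 1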
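
The `Product_Info_2` transformation above can be tried on a toy frame. This is a minimal sketch, assuming three illustrative codes (`A1`, `D3`, `E1`); it uses `pd.get_dummies` with `prefix` and `prefix_sep=""` as a shorter way to get the same column names, and `pd.concat` in place of the index merge.

```python
import pandas as pd

# Toy stand-in for the real column (values are illustrative)
toy = pd.DataFrame({"Product_Info_2": ["A1", "D3", "E1"]})

# Letter part and numeric part of each code
letter = toy["Product_Info_2"].str.slice(0, 1)
toy["Product_Info_2_2"] = pd.to_numeric(toy["Product_Info_2"].str.slice(1, 2))

# One-hot encode the letter; prefix_sep="" yields names like Product_Info_2_1A
dummies = pd.get_dummies(letter, prefix="Product_Info_2_1", prefix_sep="")

# Drop the original string column and attach the indicator columns
encoded = pd.concat([toy.drop(columns="Product_Info_2"), dummies], axis=1)
print(encoded.columns.tolist())
```

Each row ends up with one numeric column for the digit and one indicator column per letter seen in the data.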

### Feature columns

In [2]:
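# Prefixes must use Latin letters only: a look-alike Cyrillic "е" in 'InsuredInfo' matches no columns
# (the recorded outputs below were produced without the InsuredInfo_* columns)
columns_groups = ['Insurance_History', 'InsuredInfo', 'Medical_Keyword', 'Family_Hist', 'Medical_History', 'Product_Info']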
columns = ['Wt', 'Ht', 'Ins_Age', 'BMI']
for cg in columns_groups:
    columns.extend(data.columns[data.columns.str.startswith(cg)])
print(columns)

['Wt', 'Ht', 'Ins_Age', 'BMI', 'Insurance_History_1', 'Insurance_History_2', 'Insurance_History_3', 'Insurance_History_4', 'Insurance_History_5', 'Insurance_History_7', 'Insurance_History_8', 'Insurance_History_9', 'Medical_Keyword_1', 'Medical_Keyword_2', 'Medical_Keyword_3', 'Medical_Keyword_4', 'Medical_Keyword_5', 'Medical_Keyword_6', 'Medical_Keyword_7', 'Medical_Keyword_8', 'Medical_Keyword_9', 'Medical_Keyword_10', 'Medical_Keyword_11', 'Medical_Keyword_12', 'Medical_Keyword_13', 'Medical_Keyword_14', 'Medical_Keyword_15', 'Medical_Keyword_16', 'Medical_Keyword_17', 'Medical_Keyword_18', 'Medical_Keyword_19', 'Medical_Keyword_20', 'Medical_Keyword_21', 'Medical_Keyword_22', 'Medical_Keyword_23', 'Medical_Keyword_24', 'Medical_Keyword_25', 'Medical_Keyword_26', 'Medical_Keyword_27', 'Medical_Keyword_28', 'Medical_Keyword_29', 'Medical_Keyword_30', 'Medical_Keyword_31', 'Medical_Keyword_32', 'Medical_Keyword_33', 'Medical_Keyword_34', 'Medical_Keyword_35', 'Medical_Keyword_36', 'M
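
The printed list goes straight from `Insurance_History_9` to `Medical_Keyword_1`, with no `InsuredInfo_*` columns, so a prefix can silently match nothing. A small guard makes that visible. The sketch below is a hypothetical helper (`select_by_prefix` is not part of the notebook) that raises an error for any prefix without matches and prints the prefix in ASCII-escaped form, which exposes look-alike non-Latin characters.

```python
def select_by_prefix(all_columns, prefixes):
    """Collect columns by prefix, failing loudly if a prefix matches nothing."""
    selected = []
    for prefix in prefixes:
        matched = [c for c in all_columns if c.startswith(prefix)]
        if not matched:
            # !a escapes non-ASCII characters, so a Cyrillic letter shows up as \u0435
            raise ValueError(f"no columns start with {prefix!a}")
        selected.extend(matched)
    return selected


# Illustrative column names only
available = ["Wt", "InsuredInfo_1", "InsuredInfo_2", "Medical_Keyword_1"]
print(select_by_prefix(available, ["InsuredInfo", "Medical_Keyword"]))
```

In the notebook this could be called as `select_by_prefix(list(data.columns), columns_groups)`.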

### Data normalization and train/test split

In [3]:
scaler = preprocessing.StandardScaler()
data_transformed = pd.DataFrame(scaler.fit_transform(data[columns]))
columns_transformed = data_transformed.columns
data_transformed['Response'] = data['Response']
data_transformed = reduce_mem_usage(data_transformed)

data_train, data_test = train_test_split(data_transformed, test_size=.2, random_state=GRAIN)
print(data_train.shape, data_test.shape)

Memory usage reduced by 40.49 MB (-75.1%)
(47504, 119) (11877, 119)
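
`reduce_mem_usage` is imported from a local `etl_utils` module whose source is not shown here. As a rough illustration of what such a helper might do, the sketch below downcasts numeric columns to smaller dtypes with `pd.to_numeric`. The function name `reduce_mem_usage_sketch` and the demo data are assumptions, and the real helper may well use a different or more aggressive strategy.

```python
import numpy as np
import pandas as pd


def reduce_mem_usage_sketch(df):
    """Hypothetical stand-in: downcast numeric columns to smaller dtypes."""
    start = df.memory_usage(deep=True).sum()
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # pandas never goes below float32 here
            out[col] = pd.to_numeric(out[col], downcast="float")
    end = out.memory_usage(deep=True).sum()
    print(f"Memory usage reduced by {(start - end) / 1024 ** 2:.2f} MB "
          f"({100 * (end - start) / start:.1f}%)")
    return out


# Demo frame: one float feature and an integer target with 8 classes
demo = pd.DataFrame({"x": np.arange(1000) * 0.25, "Response": np.arange(1000) % 8})
small = reduce_mem_usage_sketch(demo)
print(small.dtypes)
```

Downcasting floats loses precision, which rarely matters for standardized features fed to tree ensembles but is worth checking for other models.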


### CatBoost
Main advantages: native handling of categorical (nominal) features and, often, higher accuracy than LightGBM.

By default the algorithm runs on all CPU cores at once, which speeds up training considerably.

As the bootstrap (sampling) type we choose MVS, Minimal Variance Sampling; it improves CatBoost's accuracy on the data considered here.

In [4]:
train_dataset = Pool(data=data_train[columns_transformed], label=data_train['Response'])
model = CatBoostClassifier(
    random_seed=GRAIN,
    iterations=10, learning_rate=.57, depth=6, loss_function='MultiClass', bootstrap_type='MVS', custom_metric='WKappa'
)

The range of model parameters to test is limited only by computing power. To tune the model, it makes sense to first run a separate cross-validation for each parameter individually, and then, in a final search, re-check the best values found within a range of about +/-10%.

Hyperparameters to tune:
* depth - maximum tree depth,
* learning_rate - learning rate,
* l2_leaf_reg - L2 regularization coefficient of the cost function.
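
To get a feel for the cost of a search, it helps to enumerate the grid before running it. The sketch below writes out the same value ranges as this notebook's `cb_params` and counts the combinations; `refine` is a hypothetical helper illustrating the +/-10% re-check around a best value, not something CatBoost provides.

```python
from itertools import product

# Same value ranges as cb_params in this notebook, written out explicitly
grid = {
    "depth": [5, 6, 7],
    "learning_rate": [0.56, 0.57, 0.58],
    "l2_leaf_reg": [1, 2, 3, 4],
}

# Every combination of the three hyperparameters
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos), "combinations")


def refine(best, rel=0.10):
    """Hypothetical helper: candidate values within +/-rel of the best one."""
    return [round(best * (1 - rel), 4), best, round(best * (1 + rel), 4)]


print(refine(0.56))
```

The grid size is the product of the list lengths (3 x 3 x 4 = 36 here), so each extra value for one parameter multiplies the total training time.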

In [5]:
cb_params = {
    'depth': range(5, 8),
    'learning_rate': np.arange(.56, .59, .01),
    'l2_leaf_reg': range(1, 5)
}
cb_grid = model.grid_search(cb_params, cv=5, X=data_train[columns_transformed], y=data_train['Response'], verbose=True)
print(cb_grid['params'])

0:	learn: 1.5846370	test: 1.5781232	best: 1.5781232 (0)	total: 188ms	remaining: 1.69s
1:	learn: 1.4811964	test: 1.4792620	best: 1.4792620 (1)	total: 217ms	remaining: 867ms
2:	learn: 1.3749320	test: 1.3751744	best: 1.3751744 (2)	total: 267ms	remaining: 624ms
3:	learn: 1.3403232	test: 1.3406583	best: 1.3406583 (3)	total: 313ms	remaining: 470ms
4:	learn: 1.3163674	test: 1.3191554	best: 1.3191554 (4)	total: 411ms	remaining: 411ms
5:	learn: 1.2963539	test: 1.2993463	best: 1.2993463 (5)	total: 475ms	remaining: 316ms
6:	learn: 1.2868305	test: 1.2906937	best: 1.2906937 (6)	total: 521ms	remaining: 223ms
7:	learn: 1.2755469	test: 1.2820922	best: 1.2820922 (7)	total: 565ms	remaining: 141ms
8:	learn: 1.2672861	test: 1.2757816	best: 1.2757816 (8)	total: 612ms	remaining: 68ms
9:	learn: 1.2529354	test: 1.2614291	best: 1.2614291 (9)	total: 675ms	remaining: 0us

bestTest = 1.261429089
bestIteration = 9

0:	loss: 1.2614291	best: 1.2614291 (0)	total: 728ms	remaining: 25.5s
0:	learn: 1.5832489	test: 1.576

5:	learn: 1.2987493	test: 1.3013438	best: 1.3013438 (5)	total: 269ms	remaining: 180ms
6:	learn: 1.2840328	test: 1.2867515	best: 1.2867515 (6)	total: 331ms	remaining: 142ms
7:	learn: 1.2737848	test: 1.2774022	best: 1.2774022 (7)	total: 394ms	remaining: 98.4ms
8:	learn: 1.2661594	test: 1.2705629	best: 1.2705629 (8)	total: 440ms	remaining: 48.9ms
9:	learn: 1.2549304	test: 1.2605986	best: 1.2605986 (9)	total: 489ms	remaining: 0us

bestTest = 1.260598632
bestIteration = 9

8:	loss: 1.2605986	best: 1.2605986 (8)	total: 4.76s	remaining: 14.3s
0:	learn: 1.5875174	test: 1.5810285	best: 1.5810285 (0)	total: 41.7ms	remaining: 375ms
1:	learn: 1.4842527	test: 1.4816493	best: 1.4816493 (1)	total: 86.7ms	remaining: 347ms
2:	learn: 1.3772337	test: 1.3767772	best: 1.3767772 (2)	total: 136ms	remaining: 318ms
3:	learn: 1.3454131	test: 1.3449137	best: 1.3449137 (3)	total: 201ms	remaining: 301ms
4:	learn: 1.3218211	test: 1.3236632	best: 1.3236632 (4)	total: 248ms	remaining: 248ms
5:	learn: 1.3023707	test: 

2:	learn: 1.3469211	test: 1.3499338	best: 1.3499338 (2)	total: 168ms	remaining: 392ms
3:	learn: 1.3193245	test: 1.3252927	best: 1.3252927 (3)	total: 217ms	remaining: 325ms
4:	learn: 1.3057889	test: 1.3135286	best: 1.3135286 (4)	total: 246ms	remaining: 246ms
5:	learn: 1.2929277	test: 1.3025898	best: 1.3025898 (5)	total: 293ms	remaining: 195ms
6:	learn: 1.2830997	test: 1.2944741	best: 1.2944741 (6)	total: 340ms	remaining: 146ms
7:	learn: 1.2729973	test: 1.2876389	best: 1.2876389 (7)	total: 395ms	remaining: 98.8ms
8:	learn: 1.2490610	test: 1.2632178	best: 1.2632178 (8)	total: 440ms	remaining: 48.8ms
9:	learn: 1.2315320	test: 1.2454383	best: 1.2454383 (9)	total: 487ms	remaining: 0us

bestTest = 1.245438292
bestIteration = 9

17:	loss: 1.2454383	best: 1.2436398 (16)	total: 9.39s	remaining: 9.39s
0:	learn: 1.5443263	test: 1.5410115	best: 1.5410115 (0)	total: 71.9ms	remaining: 648ms
1:	learn: 1.4475634	test: 1.4407088	best: 1.4407088 (1)	total: 115ms	remaining: 461ms
2:	learn: 1.3576428	test:

0:	learn: 1.5071640	test: 1.5034176	best: 1.5034176 (0)	total: 68.6ms	remaining: 618ms
1:	learn: 1.4080823	test: 1.4076699	best: 1.4076699 (1)	total: 137ms	remaining: 548ms
2:	learn: 1.3700019	test: 1.3784322	best: 1.3784322 (2)	total: 207ms	remaining: 483ms
3:	learn: 1.2964425	test: 1.3084163	best: 1.3084163 (3)	total: 278ms	remaining: 418ms
4:	learn: 1.2779049	test: 1.2920770	best: 1.2920770 (4)	total: 351ms	remaining: 351ms
5:	learn: 1.2602918	test: 1.2800272	best: 1.2800272 (5)	total: 422ms	remaining: 282ms
6:	learn: 1.2506514	test: 1.2755745	best: 1.2755745 (6)	total: 497ms	remaining: 213ms
7:	learn: 1.2433800	test: 1.2707527	best: 1.2707527 (7)	total: 569ms	remaining: 142ms
8:	learn: 1.2268776	test: 1.2580619	best: 1.2580619 (8)	total: 644ms	remaining: 71.5ms
9:	learn: 1.2178535	test: 1.2548041	best: 1.2548041 (9)	total: 712ms	remaining: 0us

bestTest = 1.254804067
bestIteration = 9

26:	loss: 1.2548041	best: 1.2417011 (24)	total: 15.2s	remaining: 5.07s
0:	learn: 1.5130229	test: 

7:	learn: 1.2608356	test: 1.2783573	best: 1.2783573 (7)	total: 616ms	remaining: 154ms
8:	learn: 1.2476170	test: 1.2668522	best: 1.2668522 (8)	total: 689ms	remaining: 76.5ms
9:	learn: 1.2318500	test: 1.2537575	best: 1.2537575 (9)	total: 761ms	remaining: 0us

bestTest = 1.253757526
bestIteration = 9

34:	loss: 1.2537575	best: 1.2404764 (33)	total: 20.9s	remaining: 597ms
0:	learn: 1.5152420	test: 1.5108402	best: 1.5108402 (0)	total: 68.9ms	remaining: 620ms
1:	learn: 1.4194269	test: 1.4170087	best: 1.4170087 (1)	total: 137ms	remaining: 549ms
2:	learn: 1.3834917	test: 1.3838871	best: 1.3838871 (2)	total: 209ms	remaining: 488ms
3:	learn: 1.3097088	test: 1.3138002	best: 1.3138002 (3)	total: 280ms	remaining: 421ms
4:	learn: 1.2882863	test: 1.2965805	best: 1.2965805 (4)	total: 349ms	remaining: 349ms
5:	learn: 1.2728382	test: 1.2840630	best: 1.2840630 (5)	total: 418ms	remaining: 278ms
6:	learn: 1.2586434	test: 1.2710252	best: 1.2710252 (6)	total: 485ms	remaining: 208ms
7:	learn: 1.2507476	test: 

Print the best parameters found and build the final model.

In [10]:
print(cb_grid['params'])
model = CatBoostClassifier(
    iterations=100, random_seed=GRAIN, loss_function='MultiClass', bootstrap_type='MVS', custom_metric='WKappa',
    learning_rate=cb_grid['params']['learning_rate'],
    depth=cb_grid['params']['depth'],
    l2_leaf_reg=cb_grid['params']['l2_leaf_reg'],
)

{'depth': 7, 'l2_leaf_reg': 4, 'learning_rate': 0.56}


In [11]:
model.fit(train_dataset);

0:	learn: 1.5308178	total: 57.9ms	remaining: 5.74s
1:	learn: 1.3929447	total: 118ms	remaining: 5.76s
2:	learn: 1.3382318	total: 175ms	remaining: 5.67s
3:	learn: 1.3114882	total: 235ms	remaining: 5.63s
4:	learn: 1.2898592	total: 298ms	remaining: 5.66s
5:	learn: 1.2793855	total: 358ms	remaining: 5.61s
6:	learn: 1.2686422	total: 418ms	remaining: 5.56s
7:	learn: 1.2501712	total: 480ms	remaining: 5.53s
8:	learn: 1.2412142	total: 545ms	remaining: 5.51s
9:	learn: 1.2343899	total: 605ms	remaining: 5.44s
10:	learn: 1.2250090	total: 666ms	remaining: 5.38s
11:	learn: 1.2124148	total: 724ms	remaining: 5.31s
12:	learn: 1.2087139	total: 796ms	remaining: 5.32s
13:	learn: 1.2037673	total: 855ms	remaining: 5.25s
14:	learn: 1.1999816	total: 912ms	remaining: 5.17s
15:	learn: 1.1963732	total: 972ms	remaining: 5.1s
16:	learn: 1.1884244	total: 1.03s	remaining: 5.04s
17:	learn: 1.1840390	total: 1.09s	remaining: 4.99s
18:	learn: 1.1781063	total: 1.16s	remaining: 4.93s
19:	learn: 1.1736369	total: 1.22s	remaini

### Prediction and model evaluation

In [12]:
data_test['target'] = model.predict(Pool(data=data_test[columns_transformed]))
print("CatBoost:", round(cohen_kappa_score(data_test["target"], data_test["Response"], weights="quadratic"), 3))

CatBoost: 0.543
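
For reference, the quadratic weighted kappa reported above compares the observed disagreement with the disagreement expected by chance, penalizing each error by the squared distance between classes. The following is a minimal NumPy sketch of that definition, not the scikit-learn implementation; the function name and the toy labels are illustrative.

```python
import numpy as np


def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Kappa with squared-distance penalties; labels must be ints in 0..n_classes-1."""
    # Observed agreement table: rows are true classes, columns are predictions
    observed = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    # Penalty grows with the squared distance between the two classes
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2
    # Table expected if true labels and predictions were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4))  # perfect agreement
print(quadratic_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], 2))  # chance-level agreement
```

Perfect agreement gives 1 and chance-level agreement gives 0, so a score of 0.543 sits roughly halfway between the two. The sketch does not handle the degenerate case where all labels fall into a single class.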


For comparison, other approaches on the same data give: clustering - 0.192, kNN(100) - 0.3, logistic regression - 0.512/0.496, SVM - 0.95, decision tree - 0.3, random forest - 0.487, XGBoost - 0.534, gradient boosting - 0.545, LightGBM - 0.551.

### Confusion matrix

In [9]:
# Predictions are passed first, so rows are predicted classes and columns are actual classes
print("CatBoost\n", confusion_matrix(data_test["target"], data_test["Response"]))

CatBoost
 [[ 296  191   34   41   64  153   76   57]
 [ 217  353   22    9  127  147   63   50]
 [  26   21   62   31    2    1    0    0]
 [  33   20   45  179    0    9    1    4]
 [  97  138    8    0  561  163   18   15]
 [ 214  237   22   21  174 1085  252  210]
 [ 132  130    4    6   53  257  684  244]
 [ 223  217    5   29   83  421  468 3372]]
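
The matrix above can be summarized numerically. The sketch below copies the printed values and derives overall accuracy plus per-class precision and recall. It assumes the orientation implied by the call in the cell above, where predictions are passed as the first argument, so rows are predicted classes and columns are actual classes.

```python
import numpy as np

# Values copied from the printed matrix (rows: predicted class, columns: actual class)
cm = np.array([
    [296, 191, 34, 41, 64, 153, 76, 57],
    [217, 353, 22, 9, 127, 147, 63, 50],
    [26, 21, 62, 31, 2, 1, 0, 0],
    [33, 20, 45, 179, 0, 9, 1, 4],
    [97, 138, 8, 0, 561, 163, 18, 15],
    [214, 237, 22, 21, 174, 1085, 252, 210],
    [132, 130, 4, 6, 53, 257, 684, 244],
    [223, 217, 5, 29, 83, 421, 468, 3372],
])

correct = np.diag(cm)
accuracy = correct.sum() / cm.sum()
precision = correct / cm.sum(axis=1)  # share of each predicted class that is right
recall = correct / cm.sum(axis=0)     # share of each actual class that is found

print("accuracy:", round(float(accuracy), 3))
for k in range(len(cm)):
    print(f"class {k}: precision={precision[k]:.2f} recall={recall[k]:.2f}")
```

The diagonal holds 6592 of the 11877 test rows, an accuracy of about 0.555. The per-class figures show which classes are confused most, something the single kappa value hides.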
