## Problem statement
We load the data and prepare it for analysis: normalize it and transform the categorical features. We also optimize memory consumption.

We split the sample into training and validation sets in an 80/20 ratio.

We apply a naive Bayes classifier to predict the scoring class. We use all available columns.

We check prediction quality with the kappa metric and the confusion matrix.

Data:
* https://video.ittensive.com/machine-learning/prudential/train.csv.gz

Competition: https://www.kaggle.com/c/prudential-life-insurance-assessment/

© ITtensive, 2020

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn import preprocessing
import re
from etl_utils import reduce_mem_usage, show_inf_and_na, inf_and_na_columns
pd.set_option('display.max_columns', 200)

data = pd.read_csv("https://video.ittensive.com/machine-learning/prudential/train.csv.gz")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59381 entries, 0 to 59380
Columns: 128 entries, Id to Response
dtypes: float64(18), int64(109), object(1)
memory usage: 58.0+ MB


### Data categorization and memory optimization

In [2]:
data['Product_Info_2_1'] = data['Product_Info_2'].str.slice(0, 1)
data['Product_Info_2_2'] = pd.to_numeric(data['Product_Info_2'].str.slice(1, 2))
data = reduce_mem_usage(data.drop('Product_Info_2', axis='columns'))
data.info()

Memory usage reduced by 49.89 MB (-85.4%)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59381 entries, 0 to 59380
Columns: 129 entries, Id to Product_Info_2_2
dtypes: category(1), float16(18), int16(1), int32(1), int8(108)
memory usage: 8.6 MB
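
The `reduce_mem_usage` helper comes from the course module `etl_utils`, whose source is not shown here. The sketch below is only an assumption about what such a helper might do (downcasting numeric columns and converting strings to categories); the name `reduce_mem_usage_sketch` and the `demo` frame are hypothetical, and the real helper goes further, down to `float16`.

```python
import numpy as np
import pandas as pd


def reduce_mem_usage_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for etl_utils.reduce_mem_usage (not the course helper)."""
    start_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_integer_dtype(col):
            # Pick the smallest signed integer type that holds every value.
            out[name] = pd.to_numeric(col, downcast="integer")
        elif pd.api.types.is_float_dtype(col):
            # float32 halves the footprint of float64.
            out[name] = col.astype("float32")
        elif pd.api.types.is_object_dtype(col) or pd.api.types.is_string_dtype(col):
            # Repeated strings are stored once as category labels.
            out[name] = col.astype("category")
    end_mb = out.memory_usage(deep=True).sum() / 1024 ** 2
    saved = start_mb - end_mb
    print(f"Memory usage reduced by {saved:.2f} MB (-{100 * saved / start_mb:.1f}%)")
    return out


# Made-up frame with one column of each kind.
demo = pd.DataFrame({
    "Id": np.arange(1000, dtype="int64"),
    "flag": np.zeros(1000, dtype="int64"),
    "score": np.linspace(0.0, 1.0, 1000),
    "code": ["A", "B"] * 500,
})
small = reduce_mem_usage_sketch(demo)
print(small.dtypes)
```

Downcasting to `float16`, as in the output above, saves more memory but keeps only about three significant decimal digits, which is why values such as 0.076904 appear in the tables further down.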


### Preprocessing: categorization, one-hot vectors
| Product |
| - |
| A |
| B |
| C |
| A |

becomes

| ProductA | ProductB | ProductC |
| -- | -- | -- |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |

sklearn.preprocessing.OneHotEncoder could be used instead, but it would require an additional conversion of the data frame (into a set of one-hot vectors, one per row).

We also avoid label encoding (A->1, B->2, C->3, D->4, E->5), because it turns a nominal variable into an ordinal/numeric one, which is a strong assumption about the source data.
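
For comparison, here is a minimal sketch of the `OneHotEncoder` route on the toy `Product` column from the table above. The frame `products` and the column naming are illustrative assumptions; the extra wrapping steps are the "additional conversion" mentioned in the text.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

products = pd.DataFrame({"Product": ["A", "B", "C", "A"]})

# The encoder expects a 2-D input and returns a sparse matrix by default,
# so the result is densified and wrapped back into a data frame.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(products[["Product"]]).toarray()
onehot = pd.DataFrame(
    encoded.astype("int8"),
    columns=["Product" + str(value) for value in encoder.categories_[0]],
    index=products.index,
)
print(onehot)

# pandas builds the same indicator columns in a single call.
dummies = pd.get_dummies(products["Product"], prefix="Product", prefix_sep="").astype("int8")
print(dummies)
```

Both frames hold the same indicator columns; the pandas call is simply shorter when the data already lives in a data frame.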

In [3]:
%%time
onehot_df = pd.get_dummies(data['Product_Info_2_1'])
onehot_df.columns = ['Product_Info_2_1' + column for column in onehot_df.columns]
data = pd.merge(left=data, right=onehot_df, left_index=True, right_index=True).drop('Product_Info_2_1', axis=1)

CPU times: total: 15.6 ms
Wall time: 19 ms


### Filling in missing values
Missing values are replaced with -1, which increases the "distance" when nearest neighbors are computed.

In [4]:
data.fillna(-1, inplace=True)

### Columns for the model

In [5]:
feature_regsearcher = r'Insurance_History.*|InsuredInfo.*|Medical_Keyword|Family_Hist.*|Medical_History.*|Product_Info.*|Wt|Ht|Ins_Age|BMI'
columns = [column for column in data.columns if re.match(feature_regsearcher, column) is not None]
columns

['Product_Info_1',
 'Product_Info_3',
 'Product_Info_4',
 'Product_Info_5',
 'Product_Info_6',
 'Product_Info_7',
 'Ins_Age',
 'Ht',
 'Wt',
 'BMI',
 'InsuredInfo_1',
 'InsuredInfo_2',
 'InsuredInfo_3',
 'InsuredInfo_4',
 'InsuredInfo_5',
 'InsuredInfo_6',
 'InsuredInfo_7',
 'Insurance_History_1',
 'Insurance_History_2',
 'Insurance_History_3',
 'Insurance_History_4',
 'Insurance_History_5',
 'Insurance_History_7',
 'Insurance_History_8',
 'Insurance_History_9',
 'Family_Hist_1',
 'Family_Hist_2',
 'Family_Hist_3',
 'Family_Hist_4',
 'Family_Hist_5',
 'Medical_History_1',
 'Medical_History_2',
 'Medical_History_3',
 'Medical_History_4',
 'Medical_History_5',
 'Medical_History_6',
 'Medical_History_7',
 'Medical_History_8',
 'Medical_History_9',
 'Medical_History_10',
 'Medical_History_11',
 'Medical_History_12',
 'Medical_History_13',
 'Medical_History_14',
 'Medical_History_15',
 'Medical_History_16',
 'Medical_History_17',
 'Medical_History_18',
 'Medical_History_19',
 'Medical_Histor
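
How the pattern selects columns can be checked on a few names. This is a small sketch; the list `names` is a hand-picked illustration, not the full column set.

```python
import re

pattern = r'Insurance_History.*|InsuredInfo.*|Medical_Keyword|Family_Hist.*|Medical_History.*|Product_Info.*|Wt|Ht|Ins_Age|BMI'
names = ['Id', 'Wt', 'Medical_Keyword_12', 'Employment_Info_1', 'Product_Info_2_1A', 'Response']

# re.match anchors only at the start of the string, so a bare prefix such as
# 'Medical_Keyword' is enough to select 'Medical_Keyword_12'.
selected = [name for name in names if re.match(pattern, name) is not None]
print(selected)  # ['Wt', 'Medical_Keyword_12', 'Product_Info_2_1A']
```

As written, the pattern leaves out `Id` and `Response`, and it also has no alternative for the `Employment_Info_*` columns.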

### Data preprocessing
In addition, we apply z-normalization (standardization) using sklearn's preprocessing module. The scaler is fitted on the entire source data set.

In [6]:
scaler = preprocessing.StandardScaler().fit(data[columns])
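
What `StandardScaler` computes can be illustrated on a tiny made-up array. In this sketch, `values` and `scaler_demo` are illustrative names; the manual formula uses the population standard deviation (`ddof=0`), which is what the scaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

scaler_demo = StandardScaler().fit(values)
scaled = scaler_demo.transform(values)

# The same result by hand: subtract the column mean, divide by the column std.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Because the scaler here is fitted before the split, the statistics of the validation rows contribute to the scaling. Fitting it on the training part only is a common alternative when the validation set has to stay completely unseen.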

### Splitting the data

In [7]:
data_train, data_test = train_test_split(data, test_size=0.2)
data_train.head()

Unnamed: 0,Id,Product_Info_1,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6,Product_Info_7,Ins_Age,Ht,Wt,BMI,Employment_Info_1,Employment_Info_2,Employment_Info_3,Employment_Info_4,Employment_Info_5,Employment_Info_6,InsuredInfo_1,InsuredInfo_2,InsuredInfo_3,InsuredInfo_4,InsuredInfo_5,InsuredInfo_6,InsuredInfo_7,Insurance_History_1,Insurance_History_2,Insurance_History_3,Insurance_History_4,Insurance_History_5,Insurance_History_7,Insurance_History_8,Insurance_History_9,Family_Hist_1,Family_Hist_2,Family_Hist_3,Family_Hist_4,Family_Hist_5,Medical_History_1,Medical_History_2,Medical_History_3,Medical_History_4,Medical_History_5,Medical_History_6,Medical_History_7,Medical_History_8,Medical_History_9,Medical_History_10,Medical_History_11,Medical_History_12,Medical_History_13,Medical_History_14,Medical_History_15,Medical_History_16,Medical_History_17,Medical_History_18,Medical_History_19,Medical_History_20,Medical_History_21,Medical_History_22,Medical_History_23,Medical_History_24,Medical_History_25,Medical_History_26,Medical_History_27,Medical_History_28,Medical_History_29,Medical_History_30,Medical_History_31,Medical_History_32,Medical_History_33,Medical_History_34,Medical_History_35,Medical_History_36,Medical_History_37,Medical_History_38,Medical_History_39,Medical_History_40,Medical_History_41,Medical_Keyword_1,Medical_Keyword_2,Medical_Keyword_3,Medical_Keyword_4,Medical_Keyword_5,Medical_Keyword_6,Medical_Keyword_7,Medical_Keyword_8,Medical_Keyword_9,Medical_Keyword_10,Medical_Keyword_11,Medical_Keyword_12,Medical_Keyword_13,Medical_Keyword_14,Medical_Keyword_15,Medical_Keyword_16,Medical_Keyword_17,Medical_Keyword_18,Medical_Keyword_19,Medical_Keyword_20,Medical_Keyword_21,Medical_Keyword_22,Medical_Keyword_23,Medical_Keyword_24,Medical_Keyword_25,Medical_Keyword_26,Medical_Keyword_27,Medical_Keyword_28,Medical_Keyword_29,Medical_Keyword_30,Medical_Keyword_31,Medical_Keyword_32,Medical_Keyword_33,Medical_Keyword_34,Medical_Keyword_35,Medical_Keyword_36,Medical_Keyword_37,Medical_Keyword_38,Medical_Keyword_39,Medical_Keyword_40,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48,Response,Product_Info_2_2,Product_Info_2_1A,Product_Info_2_1B,Product_Info_2_1C,Product_Info_2_1D,Product_Info_2_1E
39782,52834,1,10,0.076904,2,3,1,0.268555,0.63623,0.320068,0.609375,0.070007,1,3,0.0,3,0.0,1,2,3,3,1,2,1,2,1,1,3,-1.0,3,2,3,1,-1.0,-1.0,-1.0,-1.0,1.0,112,2,1,1,1,2,2,1,-1.0,3,2,1,3,0.0,3,3,1,1,2,1,2,3,-1.0,1,3,3,1,1,2,3,-1.0,3,1,1,2,2,1,3,3,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,3,0,0,0,1,0
33811,44885,1,26,0.076904,2,3,1,0.656738,0.654785,0.322266,0.589355,0.031204,1,3,0.0,2,0.0,1,2,8,3,1,2,1,2,1,1,3,-1.0,3,2,3,3,-1.0,0.519531,-1.0,0.606934,1.0,16,2,2,1,3,2,2,1,-1.0,3,2,3,3,-1.0,3,3,1,1,2,1,2,3,-1.0,2,2,3,1,1,2,3,-1.0,3,3,1,3,2,1,3,3,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,0,1,0
32320,42872,1,26,0.487061,2,1,1,0.268555,0.799805,0.330566,0.441895,0.11499,9,1,0.0,2,0.25,2,2,3,3,1,1,1,2,1,1,3,-1.0,3,2,3,2,0.507324,-1.0,0.408447,-1.0,0.0,491,2,2,1,3,2,2,2,-1.0,3,2,3,3,-1.0,1,3,1,1,2,1,2,1,-1.0,1,3,3,1,3,2,3,-1.0,3,3,1,2,2,1,1,3,3,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,5,1,0,0,0,0
32742,43439,1,26,0.230713,2,3,1,0.19397,0.654785,0.188232,0.337646,0.045013,9,1,0.0,2,0.150024,1,2,5,3,1,2,1,2,1,1,3,-1.0,3,2,3,3,0.376709,-1.0,0.309814,-1.0,22.0,161,2,2,1,3,2,2,1,-1.0,3,2,3,3,-1.0,1,3,1,1,2,2,2,3,-1.0,2,2,3,1,3,2,3,-1.0,3,1,1,3,2,1,3,3,3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,5,1,0,0,0,0
55958,74542,1,26,0.487061,2,3,1,0.432861,0.672852,0.393311,0.695801,0.094971,3,1,-1.0,2,0.350098,1,2,3,3,1,1,1,2,1,1,3,-1.0,3,2,3,2,-1.0,-1.0,-1.0,0.294678,5.0,112,2,1,1,1,2,2,1,-1.0,3,2,1,3,-1.0,1,3,1,1,2,1,2,1,-1.0,1,3,3,1,3,2,3,-1.0,3,3,1,2,2,1,1,3,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,2,0,0,0,1,0
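
The split above is random on every run. The sketch below shows the same 80/20 split on a made-up frame with a fixed seed. The `random_state=42` argument is an assumption added for reproducibility and is not part of the notebook, so the notebook's own split and scores change from run to run.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame: 100 rows with a Response-like label.
frame = pd.DataFrame({"x": np.arange(100), "Response": np.tile(np.arange(1, 5), 25)})

# A fixed seed makes the shuffle, and therefore the two parts, repeatable.
part_train, part_test = train_test_split(frame, test_size=0.2, random_state=42)
print(len(part_train), len(part_test))  # 80 20
```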


### Fitting the naive Bayes model
\begin{equation}
P(A\mid B) = \frac{P(B\mid A)\ P(A)}{P(B)}
\end{equation}
For each feature, the probability of taking a particular value is computed: P(B). For each class, its probability (in practice, its share in the sample) is computed: P(A). Then, for each feature, the probability of taking a particular value given the class is computed: P(B|A).

From all these values, the probability of each class given the known feature values is obtained, and the most probable class is predicted.
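
The steps above can be sketched by hand for the Gaussian case, where P(B|A) is modeled as a normal density with a per-class mean and variance for every feature. The toy data and all names below are made up for illustration. P(B) is the same for every class, so it can be dropped when only the most probable class is needed.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Toy data: two features, two classes drawn from different normal distributions.
x_demo = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.5, (50, 2))])
y_demo = np.array([1] * 50 + [2] * 50)

classes = np.unique(y_demo)
# P(A): share of each class in the sample.
priors = np.array([(y_demo == c).mean() for c in classes])
# Parameters of P(B|A): per-class mean and variance of every feature.
means = np.array([x_demo[y_demo == c].mean(axis=0) for c in classes])
variances = np.array([x_demo[y_demo == c].var(axis=0) for c in classes])


def log_posterior(x):
    """Unnormalized log P(A|B) for each row of x (rows) and each class (columns)."""
    diff = x[:, None, :] - means[None, :, :]
    # "Naive" assumption: features are independent given the class,
    # so their log-densities are summed.
    log_likelihood = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances).sum(axis=2)
    return np.log(priors) + log_likelihood


manual_pred = classes[log_posterior(x_demo).argmax(axis=1)]
sklearn_pred = GaussianNB().fit(x_demo, y_demo).predict(x_demo)
print((manual_pred == sklearn_pred).mean())
```

`GaussianNB` additionally adds a small smoothing term to the variances (`var_smoothing`), so the two computations are not bit-identical, although on data like this the predicted classes should agree.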

In [9]:
y = data_train['Response']
x = scaler.transform(data_train[columns])
bayes = GaussianNB().fit(x, y)

### Prediction

In [10]:
x_test = scaler.transform(data_test[columns])
data_test['target'] = bayes.predict(x_test)
data_test.head()

Unnamed: 0,Id,Product_Info_1,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6,Product_Info_7,Ins_Age,Ht,Wt,BMI,Employment_Info_1,Employment_Info_2,Employment_Info_3,Employment_Info_4,Employment_Info_5,Employment_Info_6,InsuredInfo_1,InsuredInfo_2,InsuredInfo_3,InsuredInfo_4,InsuredInfo_5,InsuredInfo_6,InsuredInfo_7,Insurance_History_1,Insurance_History_2,Insurance_History_3,Insurance_History_4,Insurance_History_5,Insurance_History_7,Insurance_History_8,Insurance_History_9,Family_Hist_1,Family_Hist_2,Family_Hist_3,Family_Hist_4,Family_Hist_5,Medical_History_1,Medical_History_2,Medical_History_3,Medical_History_4,Medical_History_5,Medical_History_6,Medical_History_7,Medical_History_8,Medical_History_9,Medical_History_10,Medical_History_11,Medical_History_12,Medical_History_13,Medical_History_14,Medical_History_15,Medical_History_16,Medical_History_17,Medical_History_18,Medical_History_19,Medical_History_20,Medical_History_21,Medical_History_22,Medical_History_23,Medical_History_24,Medical_History_25,Medical_History_26,Medical_History_27,Medical_History_28,Medical_History_29,Medical_History_30,Medical_History_31,Medical_History_32,Medical_History_33,Medical_History_34,Medical_History_35,Medical_History_36,Medical_History_37,Medical_History_38,Medical_History_39,Medical_History_40,Medical_History_41,Medical_Keyword_1,Medical_Keyword_2,Medical_Keyword_3,Medical_Keyword_4,Medical_Keyword_5,Medical_Keyword_6,Medical_Keyword_7,Medical_Keyword_8,Medical_Keyword_9,Medical_Keyword_10,Medical_Keyword_11,Medical_Keyword_12,Medical_Keyword_13,Medical_Keyword_14,Medical_Keyword_15,Medical_Keyword_16,Medical_Keyword_17,Medical_Keyword_18,Medical_Keyword_19,Medical_Keyword_20,Medical_Keyword_21,Medical_Keyword_22,Medical_Keyword_23,Medical_Keyword_24,Medical_Keyword_25,Medical_Keyword_26,Medical_Keyword_27,Medical_Keyword_28,Medical_Keyword_29,Medical_Keyword_30,Medical_Keyword_31,Medical_Keyword_32,Medical_Keyword_33,Medical_Keyword_34,Medical_Keyword_35,Medical_Keyword_36,Medical_Keyword_37,Medical_Keyword_38,Medical_Keyword_39,Medical_Keyword_40,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48,Response,Product_Info_2_2,Product_Info_2_1A,Product_Info_2_1B,Product_Info_2_1C,Product_Info_2_1D,Product_Info_2_1E,target
17642,23510,1,26,1.0,2,3,1,0.462646,0.872559,0.341064,0.392822,0.059998,9,1,0.0,2,0.600098,1,2,8,3,1,1,1,2,1,3,1,8.34465e-07,1,3,2,3,-1.0,-1.0,-1.0,0.25,15.0,335,2,2,1,3,2,2,2,-1.0,3,2,3,3,-1.0,1,3,1,1,2,1,2,3,-1.0,1,3,3,1,3,2,3,-1.0,3,3,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,1,0,0,0,1,0,4
5613,7470,1,26,0.128174,2,3,1,0.358154,0.63623,0.246826,0.467529,0.049988,9,1,0.0,2,-1.0,1,2,6,3,1,1,1,2,1,1,3,-1.0,3,2,3,3,0.362305,-1.0,0.295654,-1.0,2.0,140,2,2,1,3,2,2,2,-1.0,3,2,3,3,-1.0,1,3,1,1,2,1,2,3,-1.0,1,3,3,1,3,2,3,-1.0,3,3,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,4,0,0,0,1,0,4
6310,8414,1,26,0.128174,2,3,1,0.358154,0.708984,0.205078,0.321533,0.024002,9,1,0.006001,2,-1.0,2,2,8,3,1,1,1,1,1,3,1,0.0006666183,2,1,2,3,0.536133,-1.0,0.436523,-1.0,3.0,81,2,2,1,3,2,2,2,-1.0,3,2,3,3,-1.0,1,3,1,1,2,1,2,3,-1.0,1,3,3,1,3,2,3,-1.0,3,3,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,2,1,0,0,0,0,4
56870,75765,1,26,0.230713,2,3,1,0.522461,0.745605,0.403809,0.615234,1.0,12,1,0.078003,2,0.990234,1,2,8,3,1,1,1,2,1,3,1,0.0001666546,1,3,2,3,-1.0,-1.0,-1.0,0.598145,1.0,156,2,1,1,3,2,2,1,-1.0,3,2,3,3,240.0,1,3,2,1,2,1,2,1,-1.0,1,3,3,1,1,2,3,-1.0,3,3,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,2,3,0,0,0,1,0,7
5158,6858,2,26,0.487061,3,3,1,0.208984,0.727051,0.261475,0.401123,0.059998,12,1,0.300049,2,1.0,1,2,3,3,3,1,1,2,1,1,3,-1.0,3,2,3,3,0.333252,-1.0,0.309814,-1.0,3.0,16,2,1,1,3,2,2,2,-1.0,3,2,1,3,-1.0,1,3,1,1,2,1,2,3,-1.0,1,3,3,1,3,2,3,-1.0,1,3,1,2,2,1,3,3,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,6,3,0,0,0,1,0,4


### Model evaluation

In [11]:
print('Bayes:', cohen_kappa_score(data_test['Response'], data_test['target'], weights='quadratic'))

Bayes: 0.19506590874215646
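
To make the metric less of a black box, the sketch below recomputes quadratic weighted kappa by hand on made-up labels and compares it with `cohen_kappa_score`. The arrays `actual` and `predicted` are illustrative and are not taken from the data set.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Made-up labels on the same 1..8 scale as Response.
actual = np.array([1, 2, 3, 4, 5, 6, 7, 8, 8, 8, 2, 6])
predicted = np.array([1, 2, 4, 4, 5, 7, 7, 8, 4, 8, 1, 6])
labels = np.arange(1, 9)

observed = confusion_matrix(actual, predicted, labels=labels).astype(float)
# Agreement expected by chance: outer product of the marginal counts.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
# Quadratic penalty: grows with the squared distance between the two classes.
penalty = (labels[:, None] - labels[None, :]) ** 2

kappa_manual = 1.0 - (penalty * observed).sum() / (penalty * expected).sum()
kappa_sklearn = cohen_kappa_score(actual, predicted, weights='quadratic')
print(kappa_manual, kappa_sklearn)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance. With quadratic weights, predicting class 4 for a true class 8 is penalized far more heavily than predicting class 7.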


### Confusion matrix

In [12]:
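# The arguments are passed as (predicted, actual), so rows are predicted classes and columns are actual classes
print(confusion_matrix(data_test["target"], data_test["Response"]))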

[[ 272  209   16    7   92  174   86  109]
 [ 112  155    7    4   49   77   22   38]
 [ 105  126   22    7   42   80   15   39]
 [ 327  356  109  215  489 1031  750 2275]
 [  41   59    2    0   29   40   12   10]
 [  15   17    1    0   15   20    4    1]
 [ 323  367   21   40  353  702  633  793]
 [  37   18    2    9   41   87   92  676]]
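
The orientation of the matrix depends on the argument order. The sketch below uses made-up labels (`true_labels` and `pred_labels` are illustrative names) to show that swapping the arguments transposes the result, and how per-class recall is read from it.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

true_labels = np.array([1, 1, 2, 2, 2, 3])
pred_labels = np.array([1, 2, 2, 2, 3, 3])

# Documented order is (y_true, y_pred): rows are true classes, columns are predictions.
by_true = confusion_matrix(true_labels, pred_labels)
# Swapping the arguments, as in the cell above, transposes the matrix.
by_pred = confusion_matrix(pred_labels, true_labels)
print(np.array_equal(by_true.T, by_pred))  # True

# Share of each true class that was predicted correctly (recall).
recall = by_true.diagonal() / by_true.sum(axis=1)
print(recall)
```

Read this way, the fourth row of the notebook's output collects every validation row the model assigned to class 4 (5552 of 11877), which shows how strongly the classifier favors that class.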
