## **Lesson 7. Case 2. Model types for the pricing problem**

### **Timur S.**

### Homework

**Build a generalized linear model (GLM) to predict the occurrence of insurance claims on the data examined in the notebook. Choose an appropriate distribution and link function, consulting the H2O documentation if necessary. Devise and use additional factors when building the model (for example, feature interactions or functions of features). Evaluate the resulting model with several metrics (metrics other than those shown in the notebook may also be used) and analyze likely problems. Propose ways to address them and/or try to address them, improving the result.**

In [501]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import numpy as np
import pandas as pd

In [503]:
# Load the dataset

df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/freMPL-R.csv', low_memory=False)
df = df.loc[df.Dataset.isin([5, 6, 7, 8, 9])]
df.drop('Dataset', axis=1, inplace=True)
df.dropna(axis=1, how='all', inplace=True)
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115155 entries, 0 to 115154
Data columns (total 20 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Exposure           115155 non-null  float64
 1   LicAge             115155 non-null  int64  
 2   RecordBeg          115155 non-null  object 
 3   RecordEnd          59455 non-null   object 
 4   Gender             115155 non-null  object 
 5   MariStat           115155 non-null  object 
 6   SocioCateg         115155 non-null  object 
 7   VehUsage           115155 non-null  object 
 8   DrivAge            115155 non-null  int64  
 9   HasKmLimit         115155 non-null  int64  
 10  BonusMalus         115155 non-null  int64  
 11  ClaimAmount        115155 non-null  float64
 12  ClaimInd           115155 non-null  int64  
 13  ClaimNbResp        115155 non-null  float64
 14  ClaimNbNonResp     115155 non-null  float64
 15  ClaimNbParking     115155 non-null  float64
 16  Cl

In [504]:
df.head()

Unnamed: 0,Exposure,LicAge,RecordBeg,RecordEnd,Gender,MariStat,SocioCateg,VehUsage,DrivAge,HasKmLimit,BonusMalus,ClaimAmount,ClaimInd,ClaimNbResp,ClaimNbNonResp,ClaimNbParking,ClaimNbFireTheft,ClaimNbWindscreen,OutUseNb,RiskArea
0,0.083,332,2004-01-01,2004-02-01,Male,Other,CSP50,Professional,46,0,50,0.0,0,0.0,1.0,0.0,0.0,0.0,0.0,9.0
1,0.916,333,2004-02-01,,Male,Other,CSP50,Professional,46,0,50,0.0,0,0.0,1.0,0.0,0.0,0.0,0.0,9.0
2,0.55,173,2004-05-15,2004-12-03,Male,Other,CSP50,Private+trip to office,32,0,68,0.0,0,0.0,2.0,0.0,0.0,0.0,0.0,7.0
3,0.089,364,2004-11-29,,Female,Other,CSP55,Private+trip to office,52,0,50,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,8.0
4,0.233,426,2004-02-07,2004-05-01,Male,Other,CSP60,Private,57,0,50,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,7.0


In the previous lesson we noticed negative loss amounts for some observations. Note that for all such policies the variable "ClaimInd" takes only the value 0. We therefore replace all the corresponding "ClaimAmount" values with zeros.

In [505]:
NegClaimAmount = df.loc[df.ClaimAmount < 0, ['ClaimAmount','ClaimInd']]
print('Unique values of ClaimInd:', NegClaimAmount.ClaimInd.unique())
NegClaimAmount.head()

Unique values of ClaimInd: [0]


Unnamed: 0,ClaimAmount,ClaimInd
82,-74.206042,0
175,-1222.585196,0
177,-316.288822,0
363,-666.75861,0
375,-1201.600604,0


In [0]:
df.loc[df.ClaimAmount < 0, 'ClaimAmount'] = 0

Recode the object-type variables as numeric values.

In [0]:
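def SeriesFactorizer(series):
    # pd.factorize returns integer codes and the array of unique labels
    codes, unique = pd.factorize(series)
    # Map each integer code back to its original label
    reference = {code: label for code, label in enumerate(unique)}
    print(reference)
    return codes, reference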

In [508]:
df.Gender, GenderRef = SeriesFactorizer(df.Gender)

{0: 'Male', 1: 'Female'}


In [509]:
df.MariStat, MariStatRef = SeriesFactorizer(df.MariStat)

{0: 'Other', 1: 'Alone'}


In [510]:
list(df.VehUsage.unique())

['Professional', 'Private+trip to office', 'Private', 'Professional run']

In [511]:
VU_dummies = pd.get_dummies(df.VehUsage, prefix='VehUsg', drop_first=False)
VU_dummies.head()

Unnamed: 0,VehUsg_Private,VehUsg_Private+trip to office,VehUsg_Professional,VehUsg_Professional run
0,0,0,1,0
1,0,0,1,0
2,0,1,0,0
3,0,1,0,0
4,1,0,0,0


The "SocioCateg" factor holds the social category as CSP classification codes. We aggregate the codes to their first digit and then one-hot encode them.

In [512]:
df['SocioCateg'].unique()

array(['CSP50', 'CSP55', 'CSP60', 'CSP48', 'CSP6', 'CSP66', 'CSP1',
       'CSP46', 'CSP21', 'CSP47', 'CSP42', 'CSP37', 'CSP22', 'CSP3',
       'CSP49', 'CSP20', 'CSP2', 'CSP40', 'CSP7', 'CSP26', 'CSP65',
       'CSP41', 'CSP17', 'CSP57', 'CSP56', 'CSP38', 'CSP51', 'CSP59',
       'CSP30', 'CSP44', 'CSP61', 'CSP63', 'CSP45', 'CSP16', 'CSP43',
       'CSP39', 'CSP5', 'CSP32', 'CSP35', 'CSP73', 'CSP62', 'CSP52',
       'CSP27', 'CSP24', 'CSP19', 'CSP70'], dtype=object)

In [0]:
df['SocioCateg'] = df.SocioCateg.str.slice(0,4)

In [514]:
pd.DataFrame(df.SocioCateg.value_counts().sort_values()).rename({'SocioCateg': 'Frequency'}, axis=1)

Unnamed: 0,Frequency
CSP7,14
CSP3,1210
CSP1,2740
CSP2,3254
CSP4,7648
CSP6,24833
CSP5,75456
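
The aggregation and encoding described above can be sketched on a handful of codes. This is a minimal illustration, assuming every code has the form "CSP" followed by one or two digits (as in the list of unique values shown earlier); the sample series is made up for the example.

```python
import pandas as pd

# A few sample codes (illustrative only, not the full column)
codes = pd.Series(["CSP50", "CSP55", "CSP6", "CSP66", "CSP1"], name="SocioCateg")

# Keep "CSP" plus the first digit, as df.SocioCateg.str.slice(0, 4) does
aggregated = codes.str.slice(0, 4)

# One-hot encode the aggregated categories
dummies = pd.get_dummies(aggregated, prefix="SocioCateg")

print(aggregated.tolist())
print(dummies.columns.tolist())
```

The slice relies on the fixed three-letter prefix; codes with a different prefix length would need a different rule.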


In [0]:
df = pd.get_dummies(df, columns=['VehUsage','SocioCateg'])

Now that most of the object-type variables have been processed, we drop the remaining object columns (RecordBeg and RecordEnd) from the dataset, since they are no longer needed.

In [0]:
df = df.select_dtypes(exclude=['object'])

We also create the squared driver age, as well as the squares of some other variables.

In [517]:
df['DrivAgeSq'] = df.DrivAge ** 2
df['LicAgeSq'] = df.LicAge ** 2
df['BonusMalusSq'] = df.BonusMalus ** 2
df.head()

Unnamed: 0,Exposure,LicAge,Gender,MariStat,DrivAge,HasKmLimit,BonusMalus,ClaimAmount,ClaimInd,ClaimNbResp,ClaimNbNonResp,ClaimNbParking,ClaimNbFireTheft,ClaimNbWindscreen,OutUseNb,RiskArea,VehUsage_Private,VehUsage_Private+trip to office,VehUsage_Professional,VehUsage_Professional run,SocioCateg_CSP1,SocioCateg_CSP2,SocioCateg_CSP3,SocioCateg_CSP4,SocioCateg_CSP5,SocioCateg_CSP6,SocioCateg_CSP7,DrivAgeSq,LicAgeSq,BonusMalusSq
0,0.083,332,0,0,46,0,50,0.0,0,0.0,1.0,0.0,0.0,0.0,0.0,9.0,0,0,1,0,0,0,0,0,1,0,0,2116,110224,2500
1,0.916,333,0,0,46,0,50,0.0,0,0.0,1.0,0.0,0.0,0.0,0.0,9.0,0,0,1,0,0,0,0,0,1,0,0,2116,110889,2500
2,0.55,173,0,0,32,0,68,0.0,0,0.0,2.0,0.0,0.0,0.0,0.0,7.0,0,1,0,0,0,0,0,0,1,0,0,1024,29929,4624
3,0.089,364,1,0,52,0,50,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0,1,0,0,0,0,0,0,1,0,0,2704,132496,2500
4,0.233,426,0,0,57,0,50,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,1,0,0,0,0,0,0,0,0,1,0,3249,181476,2500
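
The assignment also suggests feature interactions and other functions of the features. The sketch below shows two such candidates on three rows copied from the table above. The column names `DrivAge_x_BonusMalus` and `AgeAtLicence` are hypothetical, and the second feature assumes that LicAge is measured in months, which the values suggest but the notebook does not state.

```python
import pandas as pd

# Three rows taken from the head of the dataset shown above
toy = pd.DataFrame({
    "DrivAge": [46, 32, 52],
    "LicAge": [332, 173, 364],
    "BonusMalus": [50, 68, 50],
})

# Interaction (product) of driver age and bonus-malus coefficient
toy["DrivAge_x_BonusMalus"] = toy["DrivAge"] * toy["BonusMalus"]

# Approximate age when the licence was obtained (assumes LicAge is in months)
toy["AgeAtLicence"] = toy["DrivAge"] - toy["LicAge"] / 12

print(toy)
```

Whether such features help is an empirical question; they would need to be added before the train/validation/test split and checked on the validation metrics.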


In [0]:
# !apt-get install default-jre

In [0]:
# !java -version

In [0]:
# !pip install h2o

In [521]:
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


0,1
H2O_cluster_uptime:,1 hour 54 mins
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.30.0.2
H2O_cluster_version_age:,5 days
H2O_cluster_name:,H2O_from_python_unknownUser_2bnh5m
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.070 Gb
H2O_cluster_total_cores:,2
H2O_cluster_allowed_cores:,2


In [0]:
from sklearn.model_selection import train_test_split

In [0]:
# Split the dataset into train/valid/test (70% / 15% / 15%)

x_train_ind, x_test_ind, y_train_ind, y_test_ind = train_test_split(df.drop(['ClaimInd', 'ClaimAmount'], axis=1), df.ClaimInd, test_size=0.3, random_state=1)
x_valid_ind, x_test_ind, y_valid_ind, y_test_ind = train_test_split(x_test_ind, y_test_ind, test_size=0.5, random_state=1)
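
The two-step split yields roughly 70% / 15% / 15% of the data. The arithmetic can be sketched as follows, assuming the held-out part is rounded up at each step and the remainder goes to the other part; `split_sizes` is a helper written for this illustration, not a library function.

```python
import math

def split_sizes(n_samples, test_size):
    # Assumption: the held-out part is rounded up, the rest is kept
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 115155  # number of rows reported by df.info()

# First split: 70% train, 30% held out
n_train, n_holdout = split_sizes(n_total, 0.3)

# Second split: the held-out part is halved into valid and test
n_valid, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_valid, n_test)
```

These sizes agree with the row totals of the confusion matrices reported later in the notebook.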

In [524]:
# Convert to H2OFrame

h2o_train_ind = h2o.H2OFrame(pd.concat([x_train_ind, y_train_ind], axis=1))
h2o_valid_ind = h2o.H2OFrame(pd.concat([x_valid_ind, y_valid_ind], axis=1))
h2o_test_ind = h2o.H2OFrame(pd.concat([x_test_ind, y_test_ind], axis=1))

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [0]:
# Convert the target variable ClaimInd to categorical with asfactor in all three datasets

h2o_train_ind['ClaimInd'] = h2o_train_ind['ClaimInd'].asfactor()
h2o_valid_ind['ClaimInd'] = h2o_valid_ind['ClaimInd'].asfactor()
h2o_test_ind['ClaimInd'] = h2o_test_ind['ClaimInd'].asfactor()

In [526]:
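# Initialize and train the GLM with cross-validation
# Note: the variable is named glm_poisson, but the family is binomial (logit link, see the summary)
# names[1:-1] skips the first column (Exposure, used as weights) and the last one (ClaimInd, the target)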

glm_poisson = H2OGeneralizedLinearEstimator(family = "binomial", lambda_=0.0001, nfolds=5)
glm_poisson.train(y="ClaimInd", x = h2o_train_ind.names[1:-1], training_frame = h2o_train_ind, validation_frame = h2o_valid_ind, weights_column = "Exposure")

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [527]:
# Model parameters: distribution, link function, regularization hyperparameters, number of predictors used

glm_poisson.summary()


GLM Model: summary


Unnamed: 0,Unnamed: 1,family,link,regularization,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
0,,binomial,logit,"Elastic Net (alpha = 0.5, lambda = 1.0E-4 )",27,25,3,py_25_sid_9d65




In [528]:
# Model quality metrics on cross-validation: mean, standard deviation and per-fold values

glm_poisson.cross_validation_metrics_summary().as_data_frame()

Unnamed: 0,Unnamed: 1,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
0,accuracy,0.48841798,0.08422018,0.57595134,0.3614605,0.45790386,0.49873888,0.5480353
1,auc,0.5788966,0.011701856,0.5861915,0.56863815,0.5646088,0.5917847,0.5832597
2,aucpr,0.16080387,0.0076874658,0.16110548,0.17065667,0.14938235,0.1634332,0.1594417
3,err,0.511582,0.08422018,0.42404866,0.6385395,0.54209614,0.5012611,0.45196468
4,err_count,3648.308,603.45685,3032.296,4556.351,3862.79,3592.975,3197.128
5,f0point5,0.1812863,0.004204574,0.1868524,0.17949069,0.17664346,0.18442172,0.17902327
6,f1,0.24827935,0.0048188376,0.2485463,0.25281283,0.24471034,0.25308418,0.24224308
7,f2,0.39485633,0.023027593,0.37106133,0.42741057,0.39811972,0.4032008,0.37448922
8,lift_top_group,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,logloss,0.3791118,0.010225499,0.37536094,0.3953337,0.3812357,0.3756878,0.36794087


In [529]:
# Model coefficients table (depending on the model, the standard error, z-score and p-value may also be shown)

glm_poisson._model_json['output']['coefficients_table'].as_data_frame()

Unnamed: 0,names,coefficients,standardized_coefficients
0,Intercept,-2.24979,-1.946601
1,LicAge,-0.001359,-0.216671
2,Gender,0.019998,0.009684
3,MariStat,-0.053385,-0.019089
4,DrivAge,7.1e-05,0.001062
5,HasKmLimit,-0.39472,-0.122648
6,BonusMalus,0.007141,0.107674
7,ClaimNbResp,0.055883,0.029302
8,ClaimNbNonResp,0.133666,0.07952
9,ClaimNbParking,0.17351,0.051149
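
With a logit link, a raw coefficient beta is the change in the log-odds of a claim per unit increase of the feature, so exp(beta) is the corresponding odds ratio. A small sketch using three coefficients copied from the table above. Since the model is fitted with elastic net regularization, the coefficients are shrunk and the ratios should be read as rough indications only.

```python
import math

# Raw coefficients copied from the coefficients table above
coefficients = {
    "HasKmLimit": -0.39472,
    "BonusMalus": 0.007141,
    "ClaimNbParking": 0.17351,
}

# Odds ratio for a one-unit increase of each feature
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio per unit increase = {ratio:.3f}")
```

For example, a mileage limit (HasKmLimit = 1) multiplies the estimated odds of a claim by roughly 0.67, all else being equal.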


In [530]:
# Table of standardized coefficients for the full-data model and for each cross-validation model

pmodels = {}
pmodels['overall'] = glm_poisson.coef_norm()
for x in range(len(glm_poisson.cross_validation_models())):
    pmodels[x] = glm_poisson.cross_validation_models()[x].coef_norm()
pd.DataFrame.from_dict(pmodels).round(5)

Unnamed: 0,overall,0,1,2,3,4
Intercept,-1.9466,-1.94193,-1.9671,-1.95014,-1.94188,-1.93383
LicAge,-0.21667,-0.24462,-0.18815,-0.22012,-0.22128,-0.18139
Gender,0.00968,0.01326,0.01122,0.01719,-0.00447,0.01172
MariStat,-0.01909,-0.01763,-0.02421,-0.01526,-0.01142,-0.0246
DrivAge,0.00106,0.00991,0.00582,-0.04317,0.03522,0.0
HasKmLimit,-0.12265,-0.11331,-0.14033,-0.12498,-0.10639,-0.12871
BonusMalus,0.10767,0.13187,0.10853,0.0664,0.09589,0.13956
ClaimNbResp,0.0293,0.02823,0.02604,0.02778,0.03127,0.03186
ClaimNbNonResp,0.07952,0.08473,0.07057,0.08751,0.07933,0.07553
ClaimNbParking,0.05115,0.04445,0.06103,0.05354,0.05222,0.04404


In [531]:
# Compute predictions for the training, validation and test sets

ind_train_pred = glm_poisson.predict(h2o_train_ind).as_data_frame()
ind_valid_pred = glm_poisson.predict(h2o_valid_ind).as_data_frame()
ind_test_pred = glm_poisson.predict(h2o_test_ind).as_data_frame()

glm prediction progress: |████████████████████████████████████████████████| 100%
glm prediction progress: |████████████████████████████████████████████████| 100%
glm prediction progress: |████████████████████████████████████████████████| 100%


In [0]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, precision_score, recall_score, roc_auc_score

In [533]:
confusion_matrix(y_train_ind, round(ind_train_pred['predict']))

array([[32324, 40649],
       [ 2514,  5121]])

In [534]:
confusion_matrix(y_valid_ind, round(ind_valid_pred['predict']))

array([[6926, 8738],
       [ 539, 1070]])

In [535]:
confusion_matrix(y_test_ind, round(ind_test_pred['predict']))

array([[6778, 8871],
       [ 547, 1078]])

In [536]:
# Print the classification metrics imported above for the training, validation and test sets

print('Accuracy_score:', 
      '\ntrain', round(accuracy_score(y_train_ind, round(ind_train_pred['predict'])), 4),
      '\nvalid', round(accuracy_score(y_valid_ind, round(ind_valid_pred['predict'])), 4),
      '\ntest', round(accuracy_score(y_test_ind, round(ind_test_pred['predict'])), 4))

Accuracy_score: 
train 0.4645 
valid 0.4629 
test 0.4548


In [537]:
print('f1_score:', 
      '\ntrain', round(f1_score(y_train_ind, round(ind_train_pred['predict'])), 4),
      '\nvalid', round(f1_score(y_valid_ind, round(ind_valid_pred['predict'])), 4),
      '\ntest', round(f1_score(y_test_ind, round(ind_test_pred['predict'])), 4))

f1_score: 
train 0.1918 
valid 0.1874 
test 0.1863


In [538]:
print('Precision:', 
      '\ntrain', round(precision_score(y_train_ind, round(ind_train_pred['predict'])), 4),
      '\nvalid', round(precision_score(y_valid_ind, round(ind_valid_pred['predict'])), 4),
      '\ntest', round(precision_score(y_test_ind, round(ind_test_pred['predict'])), 4))

Precision: 
train 0.1119 
valid 0.1091 
test 0.1084


In [539]:
print('Recall:', 
      '\ntrain', round(recall_score(y_train_ind, round(ind_train_pred['predict'])), 4),
      '\nvalid', round(recall_score(y_valid_ind, round(ind_valid_pred['predict'])), 4),
      '\ntest', round(recall_score(y_test_ind, round(ind_test_pred['predict'])), 4))

Recall: 
train 0.6707 
valid 0.665 
test 0.6634


In [540]:
print('ROC AUC:', 
      '\ntrain', round(roc_auc_score(y_train_ind, round(ind_train_pred['predict'])), 4),
      '\nvalid', round(roc_auc_score(y_valid_ind, round(ind_valid_pred['predict'])), 4),
      '\ntest', round(roc_auc_score(y_test_ind, round(ind_test_pred['predict'])), 4))

ROC AUC: 
train 0.5568 
valid 0.5536 
test 0.5483
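
All of the metrics above follow from the three confusion matrices. The sketch below recomputes them by hand and adds two reference points: balanced accuracy, which is what ROC AUC reduces to when it is given hard 0/1 labels instead of probabilities, and the accuracy of always predicting "no claim". The function name `metrics_from_confusion` is introduced only for this illustration.

```python
# Confusion matrices copied from the notebook output
# (rows: actual 0/1, columns: predicted 0/1)
matrices = {
    "train": [[32324, 40649], [2514, 5121]],
    "valid": [[6926, 8738], [539, 1070]],
    "test": [[6778, 8871], [547, 1078]],
}

def metrics_from_confusion(cm):
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Equals roc_auc_score when hard labels are passed as scores
        "balanced_accuracy": (recall + specificity) / 2,
        # Accuracy of a classifier that always predicts class 0
        "majority_baseline": (tn + fp) / total,
    }

results = {name: metrics_from_confusion(cm) for name, cm in matrices.items()}
for name, m in results.items():
    print(name, {k: round(v, 4) for k, v in m.items()})
```

The balanced accuracy reproduces the ROC AUC values printed above, and the majority baseline (about 0.91 on the training set) shows why accuracy alone is a poor guide on data this imbalanced.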


**What problems do you see here? How can this result be improved?**

The quality of the results is low, as several indicators show:
- ROC AUC is about 0.55 (computed here from hard class labels; the cross-validated AUC on predicted probabilities is about 0.58);
- F1 is about 0.19;
- precision is about 0.11;
- recall is about 0.66-0.67.

The result could be improved by:
- balancing the data across the two classes;
- increasing the amount of data;
- refining the model, for example with additional features such as interactions.
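
One further check concerns the evaluation itself: ROC AUC is designed for continuous scores, and thresholded labels discard the ranking information. The toy sketch below compares the two using a plain rank-based AUC. The data are invented, `roc_auc` is a helper written for this illustration, and the 0.15 threshold is arbitrary. In the notebook, the analogous change would be to pass the predicted probability of class 1 (the `p1` column that H2O returns for binomial models, as far as I know) instead of `predict`.

```python
def roc_auc(y_true, scores):
    # Probability that a random positive is scored above a random negative
    # (ties count as one half)
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented toy data: four negatives, two positives
y_true = [0, 0, 0, 0, 1, 1]
p1 = [0.05, 0.10, 0.20, 0.30, 0.25, 0.40]

# Hard labels from an arbitrary threshold
labels = [1 if p >= 0.15 else 0 for p in p1]

auc_from_scores = roc_auc(y_true, p1)
auc_from_labels = roc_auc(y_true, labels)
print(auc_from_scores, auc_from_labels)
```

On this toy example the AUC from scores is higher than the AUC from labels, which is the same direction as the gap between the cross-validated AUC and the values printed in the notebook.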