This kernel is the worked example for an article; for the underlying theory, see the Zhihu column post https://zhuanlan.zhihu.com/p/148102950

## Table of Contents

1. [Import packages](#1)
2. [Load data](#2)
3. [EDA](#3)
4. [Data binning](#4)
5. [Feature selection](#5)
6. [Modeling](#6)
7. [Scorecard conversion](#7)
8. [Summary](#8)

## Import packages<a class="anchor" id="1"></a>

In [None]:
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import math
import pandas as pd 
import matplotlib.pyplot as plt

import seaborn as sns
import sklearn.metrics as metrics
%matplotlib inline
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Load data<a class="anchor" id="2"></a>

In [None]:
df_train = pd.read_csv("/kaggle/input/GiveMeSomeCredit/cs-training.csv",index_col=0)
df_test = pd.read_csv("/kaggle/input/GiveMeSomeCredit/cs-test.csv",index_col=0)
df_train.head()

In [None]:
df_train.info()

In [None]:
df_train.describe()

## EDA <a class="anchor" id="3"></a>

For the EDA, see https://www.kaggle.com/orange90/credit-scorecard-example

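## Data binning? <a class="anchor" id="4"></a>

For tree-based models, binning is not required: every split in a tree chooses a threshold on a feature, which amounts to automatic binning. Manual binning is only worth doing if you are confident that your understanding of the business is deep enough to pick better cut points.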
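The idea that a split threshold behaves like a learned bin edge can be sketched on a single feature. This is a minimal illustration that searches for the threshold with the lowest weighted Gini impurity; it is not how XGBoost picks splits internally (XGBoost uses a gradient-based gain), and the data and function names are hypothetical.

```python
# Illustrative sketch only: a one-feature split search, not XGBoost's actual algorithm.

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_threshold(values, labels):
    """Return the threshold that minimises the weighted Gini impurity of the two sides."""
    pairs = sorted(zip(values, labels))
    best_t, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t


# Hypothetical toy data: a utilisation ratio and a default flag.
utilisation = [0.05, 0.10, 0.20, 0.30, 0.80, 0.90, 0.95, 1.00]
defaulted = [0, 0, 0, 0, 1, 1, 1, 1]
threshold = best_threshold(utilisation, defaulted)
print("learned cut point:", threshold)
```

The returned cut point plays the role of a bin edge: everything below it falls into one "bin", everything above into another, and deeper trees keep subdividing those bins.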

## Feature selection <a class="anchor" id="5"></a>

Skipped for now.

In [None]:
# Tried oversampling with SMOTE here, but it did not help, so it is commented out
# from imblearn.over_sampling import SMOTE 
# sm = SMOTE(random_state=42)
# X_train, y_train = sm.fit_resample(X_train, y_train)
# print('bad rate is: ',y_train.mean())

In [None]:
import xgboost as xgb

Hold out 20% of the data as the validation set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_train.drop(['SeriousDlqin2yrs'],axis=1), df_train['SeriousDlqin2yrs'], test_size=0.2, random_state=42)

The competition is scored on AUC, so I also check the AUC of my models.

In [None]:
def plot_AUC(model,X_test,y_test):
    probs = model.predict_proba(X_test)
    preds = probs[:,1]
    fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
    roc_auc = metrics.auc(fpr, tpr)

    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
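As a complement to the ROC plot, AUC can also be read as the probability that a randomly chosen bad customer gets a higher score than a randomly chosen good one. The sketch below computes it by brute force over all positive/negative pairs; the function name and toy scores are hypothetical, and for real data `metrics.roc_auc_score` is the practical choice.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy labels and predicted probabilities.
y_toy = [0, 0, 1, 1]
score_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = pairwise_auc(y_toy, score_toy)
print("pairwise AUC:", auc_toy)
```

Because only the ranking of the scores matters, AUC does not change when the classes are reweighted or the probabilities are rescaled monotonically, which is why it is a convenient metric for an imbalanced problem like this one.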

# Modeling <a class="anchor" id="6"></a>

## Model 1: default parameters

In [None]:

model1 = xgb.XGBClassifier(objective="binary:logistic", random_state=42)
model1.fit(X_train,y_train)

In [None]:
# Check performance (accuracy) on the validation set
model1.score(X_test,y_test)

In [None]:
# calculate the fpr and tpr for all thresholds of the classification
plot_AUC(model1,X_test,y_test)

In [None]:
# Confusion matrix
y_pred = model1.predict(X_test)
metrics.confusion_matrix(y_test,y_pred)
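With a heavily imbalanced target, accuracy alone can look good while few bad customers are caught, so it helps to read precision and recall off the confusion matrix. The sketch below assumes scikit-learn's layout, `[[TN, FP], [FN, TP]]`; the counts are hypothetical and are not results from this notebook.

```python
def rates_from_confusion(cm):
    """Derive accuracy, precision and recall from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts for illustration only.
cm_toy = [[930, 10], [50, 10]]
rates = rates_from_confusion(cm_toy)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

In this toy case most predictions are correct overall, yet only a small share of the actual positives is detected, which is the pattern to watch for when comparing the models that follow.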

## Model 2: set scale_pos_weight

In [None]:
weight = int(y_train.count()/y_train.sum())
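The XGBoost documentation suggests `sum(negative instances) / sum(positive instances)` as a typical starting value for `scale_pos_weight`. The cell above uses total count over positives instead, which is that ratio plus one. The sketch below compares the two on hypothetical labels; the helper name is an assumption made for illustration.

```python
def class_weights(labels):
    """Return both candidate weights for a list of 0/1 labels."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0:
        raise ValueError("no positive samples, the ratio is undefined")
    return {
        "neg_over_pos": neg / pos,            # value suggested in the XGBoost docs
        "total_over_pos": len(labels) / pos,  # value used in the cell above, before int()
    }


# Hypothetical labels with a 7% bad rate.
toy_labels = [1] * 7 + [0] * 93
w = class_weights(toy_labels)
print(w)
```

For a rare positive class the two values are close, so either is a reasonable starting point; the weight is a tuning knob rather than an exact quantity.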

In [None]:
model2 = xgb.XGBClassifier(objective="binary:logistic", random_state=42,scale_pos_weight = weight)
model2.fit(X_train,y_train)

In [None]:
model2.score(X_test,y_test)

In [None]:
plot_AUC(model2,X_test,y_test)

In [None]:
# Confusion matrix
y_pred = model2.predict(X_test)
metrics.confusion_matrix(y_test,y_pred)

## Model 3: SMOTE

In [None]:

from imblearn.over_sampling import SMOTE 
sm = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = sm.fit_resample(X_train.fillna(0), y_train)
print('bad rate is: ',y_train_balanced.mean())
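For intuition about what SMOTE adds to the training set, the core step is interpolation: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. The sketch below is a simplified, pure-Python illustration of that step, not imbalanced-learn's implementation; the function name and data are hypothetical.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and one of its k nearest neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    u = rng.random()  # interpolation factor in [0, 1)
    return [a + u * (b - a) for a, b in zip(base, neighbour)]


# Hypothetical minority-class points with two features.
minority_toy = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
point = smote_like_sample(minority_toy, k=2, rng=random.Random(0))
print("synthetic point:", point)
```

Because the synthetic points are interpolated, they need complete numeric features, which is why the missing values are filled before resampling.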

In [None]:
model3 = xgb.XGBClassifier(objective="binary:logistic", random_state=42)
model3.fit(X_train_balanced,y_train_balanced)

In [None]:
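model3.score(X_test.fillna(0),y_test)

In [None]:
plot_AUC(model3,X_test.fillna(0),y_test)

In [None]:
# Confusion matrix (missing values filled with 0, matching how model3 was trained)
y_pred = model3.predict(X_test.fillna(0))
metrics.confusion_matrix(y_test,y_pred)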

## Model 4: EasyEnsemble

In [None]:
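param = {
    'min_child_weight': 10.0,
    'objective': 'binary:logistic',
    'max_depth': 5,
    'eval_metric': 'auc',
    'max_delta_step': 1.8,
    'colsample_bytree': 0.4,
    'subsample': 0.8,
    'eta': 0.025,
    'gamma': 0.65,
    # The scikit-learn wrapper takes the number of boosting rounds as n_estimators;
    # num_boost_round is only an argument of xgb.train and would be ignored here
    'n_estimators': 391,
}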

In [None]:
from imblearn.ensemble import EasyEnsembleClassifier

In [None]:
X_train.head()

In [None]:
# Note: newer imbalanced-learn releases rename base_estimator to estimator
model4 = EasyEnsembleClassifier(
    n_estimators=20,
    random_state=42,
    base_estimator=xgb.XGBClassifier(objective="binary:logistic", random_state=42),
)
model4.fit(X_train.fillna(0),y_train)

In [None]:
model4.score(X_test.fillna(0),y_test)

In [None]:
plot_AUC(model4,X_test.fillna(0),y_test)
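The idea behind EasyEnsemble can be sketched without any library: each ensemble member is trained on all the minority samples plus a fresh random draw of equally many majority samples, and the members' predictions are then averaged. The snippet below only builds the balanced index subsets; the function name and labels are hypothetical, and imbalanced-learn's actual sampler has more options than this.

```python
import random


def balanced_subsets(labels, n_subsets, seed=42):
    """Return index lists that each hold every positive plus an equal-size random draw of negatives."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    subsets = []
    for _ in range(n_subsets):
        sampled_neg = rng.sample(neg_idx, len(pos_idx))  # under-sample the majority class
        subsets.append(sorted(pos_idx + sampled_neg))
    return subsets


# Hypothetical labels: 3 positives among 12 samples.
labels_toy = [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
subsets = balanced_subsets(labels_toy, n_subsets=3)
for s in subsets:
    print(s)
```

Each subset discards most negatives, but across many subsets most of the majority class still gets used, which is the main appeal over a single round of under-sampling.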

# Feature importance

# Explaining the model with SHAP

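We use the first model because its AUC is the best. `shap.TreeExplainer` also needs a tree model it supports, such as a single XGBoost classifier, rather than the EasyEnsemble wrapper.

In [None]:
import shap

In [None]:
final_model = model1

In [None]:
# For comparison, print the feature importances of each XGBoost member of the EasyEnsemble model
for x in model4.estimators_:
    print(x['classifier'].feature_importances_)

In [None]:
explainer = shap.TreeExplainer(final_model)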


In [None]:
shap_values = explainer.shap_values(X_train)
print(shap_values.shape)


In [None]:
shap.summary_plot(shap_values,X_train)

In [None]:
shap.summary_plot(shap_values,X_train,plot_type='bar')

In [None]:
shap.dependence_plot('RevolvingUtilizationOfUnsecuredLines', shap_values,X_train, interaction_index=None, show=False)

## Inspecting a single prediction

In [None]:
sample = X_test.sample(1,random_state=42)

In [None]:
sample

In [None]:
final_model.predict_proba(sample)

In [None]:
shap.initjs()
shap_value_sample = explainer.shap_values(sample)
shap.force_plot(explainer.expected_value, shap_value_sample, sample)

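Blue means the feature's contribution is negative (it pushes the prediction below the base value); red means the contribution is positive.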
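The force plot relies on SHAP's additivity property: the base value plus the sum of the per-feature contributions equals the model's raw output for that sample. For an XGBoost binary classifier explained with `TreeExplainer` in its default setting, that raw output is in log-odds, so a sigmoid converts it to a probability. The numbers below are hypothetical and only illustrate the arithmetic.

```python
import math


def probability_from_shap(base_value, contributions):
    """Add the SHAP contributions to the base value (log-odds) and map the result to a probability."""
    margin = base_value + sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-margin))


# Hypothetical base value and contributions, in log-odds units.
base = -2.6
contribs = {
    "RevolvingUtilizationOfUnsecuredLines": 0.9,  # red: pushes the prediction up
    "age": -0.3,                                  # blue: pushes the prediction down
    "DebtRatio": 0.1,
}
prob = probability_from_shap(base, contribs)
print("predicted probability:", round(prob, 4))
```

Comparing this reconstructed probability with `predict_proba` for the same row is a quick sanity check that the explainer and the model agree.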

## Summary <a class="anchor" id="8"></a>

Submission

In [None]:
df_test.head()

In [None]:
model_final = EasyEnsembleClassifier(n_estimators=50, random_state=42, base_estimator=xgb.XGBClassifier().set_params(**param))
model_final.fit(df_train.drop(['SeriousDlqin2yrs'],axis=1), df_train['SeriousDlqin2yrs'])


In [None]:
result = model_final.predict_proba(df_test.drop('SeriousDlqin2yrs',axis=1))

In [None]:
result = [x[1] for x in result]

In [None]:
# df_test was read with index_col=0, so the Id lives in the index rather than in an 'Unnamed: 0' column
df_result = pd.DataFrame({'Id':df_test.index.tolist(), 'Probability':result})

In [None]:
df_result.head()

In [None]:
df_result.to_csv('submission_credit_3.csv', index=False)

Reference:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.309.1465&rep=rep1&type=pdf