<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#判定贷款用户是否逾期" data-toc-modified-id="判定贷款用户是否逾期-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Predicting Whether Loan Users Will Become Overdue</a></span><ul class="toc-item"><li><span><a href="#载入数据" data-toc-modified-id="载入数据-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Loading the Data</a></span></li><li><span><a href="#模型选择与模型评估" data-toc-modified-id="模型选择与模型评估-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Model Selection and Evaluation</a></span><ul class="toc-item"><li><span><a href="#LR模型" data-toc-modified-id="LR模型-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>LR Model</a></span></li><li><span><a href="#SVM模型" data-toc-modified-id="SVM模型-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>SVM Model</a></span></li><li><span><a href="#决策树模型" data-toc-modified-id="决策树模型-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>Decision Tree Model</a></span></li><li><span><a href="#XGBoost模型" data-toc-modified-id="XGBoost模型-1.2.4"><span class="toc-item-num">1.2.4&nbsp;&nbsp;</span>XGBoost Model</a></span></li><li><span><a href="#LightGBM模型" data-toc-modified-id="LightGBM模型-1.2.5"><span class="toc-item-num">1.2.5&nbsp;&nbsp;</span>LightGBM Model</a></span></li><li><span><a href="#最终模型" data-toc-modified-id="最终模型-1.2.6"><span class="toc-item-num">1.2.6&nbsp;&nbsp;</span>Final Model</a></span></li></ul></li></ul></li><li><span><a href="#结果对比" data-toc-modified-id="结果对比-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Results Comparison</a></span></li></ul></div>

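# Predicting Whether Loan Users Will Become Overdue

Given financial data, predict whether a loan user will become overdue.
(`status` is the label: 0 means not overdue, 1 means overdue.)

**Final** - Use the same data for every model. Split the data 70/30 into training and test sets with random seed 2018, use AUC as the evaluation metric, and compare the scores of the single models with the stacked (ensemble) model.

## Loading the Data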

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

# Load the data; `status` is the label column
data = pd.read_csv('data_all.csv')
y = data['status']
data.drop('status', axis = 1, inplace = True)
X = data

# Split into training and test sets (70/30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=2018)

# Standardize the features; the scaler is fitted on the training set only
std = StandardScaler()
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)
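
To make the scaling step concrete, here is a minimal pure-Python sketch of what standardization does when the statistics are estimated on the training split only. The column values and the helper names `fit_scaler` and `transform` are made up for illustration; the notebook itself relies on `StandardScaler`.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Estimate mean and population standard deviation from training values."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0  # avoid dividing by zero for constant columns
    return mu, sigma

def transform(column, mu, sigma):
    """Apply previously fitted statistics to any split."""
    return [(x - mu) / sigma for x in column]

# Hypothetical values of one feature in each split
train_col = [3000.0, 5000.0, 7000.0]
test_col = [4000.0, 9000.0]

mu, sigma = fit_scaler(train_col)               # statistics come from the training data only
train_scaled = transform(train_col, mu, sigma)  # mean 0 and unit variance by construction
test_scaled = transform(test_col, mu, sigma)    # reuses the training statistics
```

Because the test values are scaled with the training mean and deviation, they generally do not end up with mean 0, which is expected and keeps information about the test set out of the fitted scaler.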

In [2]:
# Performance evaluation
from sklearn.metrics import accuracy_score, roc_auc_score

def model_metrics(clf, X_train, X_test, y_train, y_test):
    # Predicted labels
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    
    # Predicted probability of the positive class (overdue)
    y_train_proba = clf.predict_proba(X_train)[:,1]
    y_test_proba = clf.predict_proba(X_test)[:,1]
    
    # Accuracy
    print('[Accuracy]', end = ' ')
    print('train:', '%.4f'%accuracy_score(y_train, y_train_pred), end = ' ')
    print('test:', '%.4f'%accuracy_score(y_test, y_test_pred))
    
    # AUC: computed with roc_auc_score (sklearn.metrics.auc on an ROC curve would also work)
    print('[AUC]', end = ' ')
    print('train:', '%.4f'%roc_auc_score(y_train, y_train_proba), end = ' ')
    print('test:', '%.4f'%roc_auc_score(y_test, y_test_proba))
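
Since AUC is the metric used to rank every model here, it may help to see what the number means. The sketch below is an illustrative pure-Python version, not how `roc_auc_score` is implemented: it treats AUC as the fraction of (overdue, not overdue) pairs in which the overdue sample gets the higher score, with ties counted as half. The function name `pairwise_auc` and the sample scores are assumptions made for the example.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of positive/negative pairs that are ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of class 1
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = pairwise_auc(labels, scores)  # 3 of the 4 pairs are ordered correctly
```

This view also explains why accuracy and AUC can disagree in the results: AUC depends only on how the samples are ordered, not on the 0.5 decision threshold.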

## Model Selection and Evaluation

In [3]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier
from mlxtend.classifier import StackingClassifier

### LR Model

In [5]:
lr = LogisticRegression(solver = 'liblinear', random_state = 2018)  # liblinear supports both l1 and l2 penalties
# param = {'C': [1e-3,0.01,0.1,1,10,100,1e3], 'penalty':['l1', 'l2']}
param = {'C': [i/100 for i in range(1,21)], 'penalty':['l1', 'l2']}

gsearch = GridSearchCV(lr, param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)

print('Best parameters:',gsearch.best_params_)
print('Best CV score on the training set:', gsearch.best_score_)
print('Score on the test set:', gsearch.score(X_test, y_test))

Best parameters: {'C': 0.04, 'penalty': 'l1'}
Best CV score on the training set: 0.7964618766902285
Score on the test set: 0.7830845148300002
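
To get a feel for the cost of the search above, the grid can be enumerated by hand. This is a rough sketch using only `itertools`; it mirrors the `param` dictionary and `cv=5` from the cell above, and the variable names are chosen for the example.

```python
from itertools import product

# Same grid as in the LR search above
param = {'C': [i / 100 for i in range(1, 21)], 'penalty': ['l1', 'l2']}
cv = 5

# Every combination of the listed values is one candidate model
names = sorted(param)
candidates = [dict(zip(names, values))
              for values in product(*(param[name] for name in names))]

n_candidates = len(candidates)   # 20 values of C x 2 penalties
n_cv_fits = n_candidates * cv    # each candidate is fitted once per fold
n_total_fits = n_cv_fits + 1     # plus one refit of the best candidate (refit=True is the default)
```

The count grows multiplicatively with every parameter added to the grid, which is one reason the boosted models later in the notebook are tuned a few parameters at a time.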


In [4]:
lr = LogisticRegression(C = 0.04, penalty = 'l1', solver = 'liblinear', random_state = 2018)
lr.fit(X_train, y_train)
model_metrics(lr, X_train, X_test, y_train, y_test)

[Accuracy] train: 0.8016 test: 0.7884
[AUC] train: 0.8080 test: 0.7831


### SVM Model

In [7]:
# Linear SVM
svm_linear = svm.SVC(kernel = 'linear', probability=True, random_state = 2018)
param = {'C':[0.01, 0.05, 0.1, 0.5, 1]}

gsearch = GridSearchCV(svm_linear, param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)

print('Best parameters:',gsearch.best_params_)
print('Best CV score on the training set:', gsearch.best_score_)
print('Score on the test set:', gsearch.score(X_test, y_test))

Best parameters: {'C': 0.01}
Best CV score on the training set: 0.7950417081289969
Score on the test set: 0.7790418661909382


In [5]:
svm_linear = svm.SVC(C = 0.01, kernel = 'linear', probability=True,random_state = 2018)
svm_linear.fit(X_train, y_train)
model_metrics(svm_linear, X_train, X_test, y_train, y_test)

[Accuracy] train: 0.7992 test: 0.7765
[AUC] train: 0.8152 test: 0.7790


In [9]:
# Polynomial-kernel SVM
svm_poly = svm.SVC(kernel = 'poly', probability=True,random_state = 2018)
param = {'C':[0.01, 0.05, 0.1, 0.5, 1]}
gsearch = GridSearchCV(svm_poly, param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)

print('Best parameters:',gsearch.best_params_)
print('Best CV score on the training set:', gsearch.best_score_)
print('Score on the test set:', gsearch.score(X_test, y_test))

Best parameters: {'C': 0.01}
Best CV score on the training set: 0.745558070524133
Score on the test set: 0.7346979228610476

In [6]:
svm_poly =  svm.SVC(C = 0.01, kernel = 'poly', probability=True,random_state = 2018)
svm_poly.fit(X_train, y_train)
model_metrics(svm_poly, X_train, X_test, y_train, y_test)

[Accuracy] train: 0.7568 test: 0.7505
[AUC] train: 0.8626 test: 0.7347


In [11]:
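# Gaussian (RBF) kernel SVM
svm_rbf = svm.SVC(probability=True,random_state = 2018)
param = {'gamma':[0.01, 0.05, 0.1, 0.5, 1, 5, 10], 
         'C':[0.01, 0.05, 0.1, 0.5, 1]}
# Search over svm_rbf. The output recorded below came from a run that passed svm_poly
# here by mistake, so its scores reflect the polynomial kernel rather than the RBF kernel.
gsearch = GridSearchCV(svm_rbf, param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)

print('Best parameters:',gsearch.best_params_)
print('Best CV score on the training set:', gsearch.best_score_)
print('Score on the test set:', gsearch.score(X_test, y_test))

Best parameters: {'C': 0.01, 'gamma': 0.01}
Best CV score on the training set: 0.7462600497861079
Score on the test set: 0.7370583080341774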


In [7]:
svm_rbf =  svm.SVC(gamma = 0.01, C =0.01 , probability=True,random_state = 2018)
svm_rbf.fit(X_train, y_train)
model_metrics(svm_rbf, X_train, X_test, y_train, y_test)

[Accuracy] train: 0.7493 test: 0.7484
[AUC] train: 0.8522 test: 0.7708


In [13]:
# Sigmoid-kernel SVM
svm_sigmoid = svm.SVC(kernel = 'sigmoid',probability=True,random_state = 2018)
param = {'C':[0.01, 0.05, 0.1, 0.5, 1]}
gsearch = GridSearchCV(svm_sigmoid, param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)

print('Best parameters:',gsearch.best_params_)
print('Best CV score on the training set:', gsearch.best_score_)
print('Score on the test set:', gsearch.score(X_test, y_test))

Best parameters: {'C': 0.05}
Best CV score on the training set: 0.7778813338030846
Score on the test set: 0.7590059779036651

In [8]:
svm_sigmoid = svm.SVC(C = 0.05, kernel = 'sigmoid',probability=True,random_state = 2018)
svm_sigmoid.fit(X_train, y_train)
model_metrics(svm_sigmoid, X_train, X_test, y_train, y_test)

[Accuracy] train: 0.7647 test: 0.7617
[AUC] train: 0.7660 test: 0.7590


### Decision Tree Model

In [19]:
dt = DecisionTreeClassifier(random_state = 2018)
dt.fit(X_train, y_train)
model_metrics(dt, X_train, X_test, y_train, y_test)

[Accuracy] train: 1.0000 test: 0.6854
[AUC] train: 1.0000 test: 0.5956


In [35]:
param = {'max_depth':range(3,14,2), 'min_samples_split':range(100,801,200)}
#param = {'min_samples_split':range(50,1000,100), 'min_samples_leaf':range(60,101,10)}
#param = {'min_samples_split':range(100,401,10), 'min_samples_leaf':range(40,101,10)}
#param = {'max_features':range(7,20,2)}
#param = {'max_features':[18,19,20]}
gsearch = GridSearchCV(DecisionTreeClassifier(max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9, random_state = 2018),
                       param_grid = param,scoring ='roc_auc', cv=5)

gsearch.fit(X_train, y_train)
# gsearch.cv_results_,   (replaces grid_scores_ from older scikit-learn versions)
gsearch.best_params_, gsearch.best_score_

({'max_depth': 9, 'min_samples_split': 100}, 0.7330286268284397)

In [9]:
dt = DecisionTreeClassifier(max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9, random_state = 2018)
dt.fit(X_train, y_train)
model_metrics(dt, X_train, X_test, y_train, y_test)

[Accuracy] train: 0.7812 test: 0.7561
[AUC] train: 0.7721 test: 0.6946


### XGBoost Model

In [6]:
import warnings
warnings.filterwarnings("ignore")

xgb0 = XGBClassifier(random_state =2018)
xgb0.fit(X_train, y_train)

model_metrics(xgb0, X_train, X_test, y_train, y_test)

[Accuracy] train: 0.8539 test: 0.7842
[AUC] train: 0.9175 test: 0.7709


In [31]:
param_test = {'n_estimators':range(20,200,20)}
# param_test = {'n_estimators':range(40,81,10)}
# param_test = {'max_depth':range(3,10,2), 'min_child_weight':range(1,12,2)}
# param_test = {'max_depth':[2,3,4], 'min_child_weight':[10,11,12]}
# param_test = {'gamma':[i/10 for i in range(6)]}
# param_test = {'subsample':[i/10 for i in range(5,10)], 'colsample_bytree':[i/10 for i in range(5,10)]}
# param_test = { 'subsample':[i/100 for i in range(60,81,5)], 'colsample_bytree':[i/100 for i in range(70,91,5)]}
#param_test = {'reg_alpha':[1e-5, 1e-2, 0.1, 0, 1, 100]}
# param_test = {'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1]}
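# Tune the parameter groups above one after another (repeating as needed), then lower the learning rate

# The `iid` argument has been removed from recent scikit-learn releases, so it is not passed here
gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3, 
                                                  min_child_weight=11, gamma=0, subsample=0.7, 
                                                  colsample_bytree=0.8, objective= 'binary:logistic', 
                                                  nthread=4,scale_pos_weight=1, random_state =2018), 
                        param_grid = param_test, scoring='roc_auc',n_jobs=4, cv=5)

gsearch.fit(X_train, y_train)
# gsearch.cv_results_,   (replaces grid_scores_ from older scikit-learn versions)
gsearch.best_params_, gsearch.best_score_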

({'n_estimators': 60}, 0.8038327506772067)
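
The commented-out grids in the cell above describe a stage-wise procedure: tune one small group of parameters, freeze the best values, then move on to the next group. The sketch below shows that loop in plain Python. It is an illustration under assumptions, not the notebook's code: `staged_search` and `toy_score` are hypothetical names, and `toy_score` stands in for the mean cross-validated AUC that `GridSearchCV` would return.

```python
from itertools import product

def staged_search(base_params, stages, score_fn):
    """Tune one group of parameters at a time, keeping earlier winners fixed."""
    params = dict(base_params)
    for grid in stages:
        names = list(grid)
        best_score, best_combo = None, None
        for values in product(*(grid[name] for name in names)):
            combo = dict(zip(names, values))
            score = score_fn({**params, **combo})
            if best_score is None or score > best_score:
                best_score, best_combo = score, combo
        params.update(best_combo)  # freeze this stage's best values
    return params

def toy_score(p):
    # Hypothetical stand-in for cross-validated AUC; peaks at 60 / 3 / 0.7
    return (-((p['n_estimators'] - 60) / 100) ** 2
            - (p['max_depth'] - 3) ** 2
            - (p['subsample'] - 0.7) ** 2)

base = {'n_estimators': 100, 'max_depth': 5, 'subsample': 1.0}
stages = [
    {'n_estimators': range(20, 200, 20)},
    {'max_depth': range(3, 10, 2)},
    {'subsample': [i / 10 for i in range(5, 10)]},
]
best = staged_search(base, stages, toy_score)
```

Stage-wise tuning evaluates far fewer candidates than one joint grid, at the price of possibly missing interactions between parameters, which is why the stages are usually repeated.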

In [10]:
xgb = XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3, min_child_weight=11, 
                    gamma=0, subsample=0.7,colsample_bytree=0.8, objective= 'binary:logistic',
                    nthread=4,scale_pos_weight=1, random_state =2018)
xgb.fit(X_train, y_train)
model_metrics(xgb, X_train, X_test, y_train, y_test)

[Accuracy] train: 0.8302 test: 0.7891
[AUC] train: 0.8710 test: 0.7780


### LightGBM Model

In [38]:
# First, look at the results with the default parameters
lgb0 = LGBMClassifier(random_state =2018)
lgb0.fit(X_train, y_train)

model_metrics(lgb0, X_train, X_test, y_train, y_test)

[Accuracy] train: 0.9958 test: 0.7701
[AUC] train: 1.0000 test: 0.7535


In [51]:
param_test = {'n_estimators':range(20,200,20)}
# param_test = {'n_estimators':range(30,51,10)}
# param_test = {'max_depth':range(3,10,2), 'min_child_weight':range(1,12,2)}
# param_test = {'max_depth':[2,3,4], 'min_child_weight':[6,7,8]}
# param_test = {'gamma':[i/10 for i in range(6)]}
# param_test = {'subsample':[i/10 for i in range(5,10)], 'colsample_bytree':[i/10 for i in range(5,10)]}
# param_test = { 'subsample':[i/100 for i in range(60,81,5)], 'colsample_bytree':[i/100 for i in range(70,91,5)]}
#param_test = {'reg_alpha':[1e-5, 1e-2, 0.1, 0, 1, 100]}
# param_test = {'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1]}
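# Tune the parameter groups above one after another (repeating as needed), then lower the learning rate
# The `iid` argument has been removed from recent scikit-learn releases, so it is not passed here
gsearch = GridSearchCV(estimator = LGBMClassifier(learning_rate =0.1, n_estimators=50, max_depth=3, 
                                                  min_child_weight=7, gamma=0, subsample=0.5, 
                                                  colsample_bytree=0.8, reg_alpha = 1e-5,
                                                  nthread=4,scale_pos_weight=1, random_state =2018), 
                        param_grid = param_test, scoring='roc_auc',n_jobs=4, cv=5)

gsearch.fit(X_train, y_train)
# gsearch.cv_results_,   (replaces grid_scores_ from older scikit-learn versions)
gsearch.best_params_, gsearch.best_score_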

({'n_estimators': 60}, 0.8054852081797643)

In [11]:
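# Note: `gamma` is an XGBoost parameter name that LightGBM does not recognize, and `subsample`
# only takes effect when `subsample_freq` > 0; both are kept here exactly as in the recorded run.
lgb = LGBMClassifier(learning_rate =0.1, n_estimators=50, max_depth=3, min_child_weight=7, 
                    gamma=0, subsample=0.5, colsample_bytree=0.8, reg_alpha=1e-5, 
                    nthread=4,scale_pos_weight=1,random_state =2018)
lgb.fit(X_train, y_train)
model_metrics(lgb, X_train, X_test, y_train, y_test)

[Accuracy] train: 0.8269 test: 0.7877
[AUC] train: 0.8741 test: 0.7746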


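### Final Model

1. The models need to be screened before they are combined.

2. Setting the StackingClassifier parameters

>Suppose two base classifiers output the class probabilities [0.2, 0.5, 0.3] and [0.3, 0.4, 0.4]. With average_probas=True, the classifiers' outputs are averaged, giving p=[0.25,0.45,0.35].

>With average_probas=False, all of the classifiers' outputs are kept as new features: p=[0.2,0.5,0.3,0.3,0.4,0.4].

Setting average_probas=True gave better results. In addition, the decision tree and svm_poly perform poorly as single models, so both were removed before Stacking.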
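
The two `average_probas` settings described above can be reproduced with a few lines of plain Python. This is only a sketch of the arithmetic, not mlxtend's implementation; the helper `combine_probas` is a made-up name, and the two input vectors are the ones implied by the concatenated example quoted above.

```python
def combine_probas(probas, average):
    """Build meta-features from the base classifiers' probability vectors."""
    if average:
        n = len(probas)
        # Element-wise mean across classifiers: one value per class
        return [sum(column) / n for column in zip(*probas)]
    # Concatenation: one value per (classifier, class) pair
    return [p for clf_probas in probas for p in clf_probas]

clf1 = [0.2, 0.5, 0.3]   # class probabilities from the first base classifier
clf2 = [0.3, 0.4, 0.4]   # class probabilities from the second base classifier

averaged = combine_probas([clf1, clf2], average=True)    # approximately [0.25, 0.45, 0.35]
stacked = combine_probas([clf1, clf2], average=False)    # [0.2, 0.5, 0.3, 0.3, 0.4, 0.4]
```

Averaging keeps the meta-classifier's input at one column per class no matter how many base models are stacked, whereas concatenation grows with every base model added.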

In [4]:
lr = LogisticRegression(C = 0.04, penalty = 'l1', solver = 'liblinear', random_state = 2018)
svm_linear =svm.SVC(C = 0.01, kernel = 'linear', probability=True,random_state = 2018)
svm_poly =  svm.SVC(C = 0.01, kernel = 'poly', probability=True,random_state = 2018)
svm_rbf =  svm.SVC(gamma = 0.01, C =0.01 , probability=True,random_state = 2018)
svm_sigmoid =  svm.SVC(C = 0.05, kernel = 'sigmoid',probability=True,random_state = 2018)
dt = DecisionTreeClassifier(max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9, 
                            random_state = 2018)
xgb = XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3, min_child_weight=11, 
                    gamma=0, subsample=0.7,colsample_bytree=0.8, objective= 'binary:logistic',
                    nthread=4,scale_pos_weight=1, random_state =2018)
lgb = LGBMClassifier(learning_rate =0.1, n_estimators=50, max_depth=3, min_child_weight=7, 
                    gamma=0, subsample=0.5, colsample_bytree=0.8, reg_alpha=1e-5, 
                    nthread=4,scale_pos_weight=1,random_state =2018)

In [5]:
sclf = StackingClassifier(classifiers=[lr, svm_linear, svm_rbf, xgb, lgb], 
                            meta_classifier=lr, use_probas=True,average_probas=True)

In [6]:
sclf.fit(X_train, y_train.values)
model_metrics(sclf, X_train, X_test, y_train, y_test)

[Accuracy] train: 0.8161 test: 0.7821
[AUC] train: 0.8556 test: 0.7861


Stacking parameter tuning:

https://www.jianshu.com/p/48d1962679f5

# Results Comparison

|Model|Parameters|AUC|
|:---|:---|:---|
|LR|C = 0.04, penalty = 'l1'|train: 0.8080, test: 0.7831|
|svm_linear|C = 0.01|train: 0.8152, test: 0.7790|
|svm_poly|C = 0.01|train: 0.8626, test: 0.7347|
|svm_rbf|gamma = 0.01, C =0.01|train: 0.8522, test: 0.7708|
|svm_sigmoid|C = 0.05|train: 0.7660, test: 0.7590|
|Decision tree|max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9|train: 0.7721, test: 0.6946|
|XGBoost|learning_rate =0.1, n_estimators=60, max_depth=3, min_child_weight=11, gamma=0, subsample=0.7,colsample_bytree=0.8|train: 0.8710, test: 0.7780|
|LightGBM|learning_rate =0.1, n_estimators=50, max_depth=3, min_child_weight=7, gamma=0, subsample=0.5, colsample_bytree=0.8|train: 0.8741, test: 0.7746|
|Stacking|-|train: 0.8556, test: 0.7861|

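> The best single model on the test set is LR with an AUC of 0.7831; the Stacking ensemble scores slightly higher at 0.7861.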
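
The ranking can be checked directly from the recorded numbers. The sketch below copies the train and test AUC values from the `model_metrics` outputs earlier in the notebook and derives the test-set ranking and the train/test gap, which serves as a rough indicator of overfitting. The variable names are chosen for this example.

```python
# (train AUC, test AUC) as printed by model_metrics in the cells above
results = {
    'LR':            (0.8080, 0.7831),
    'svm_linear':    (0.8152, 0.7790),
    'svm_poly':      (0.8626, 0.7347),
    'svm_rbf':       (0.8522, 0.7708),
    'svm_sigmoid':   (0.7660, 0.7590),
    'Decision tree': (0.7721, 0.6946),
    'XGBoost':       (0.8710, 0.7780),
    'LightGBM':      (0.8741, 0.7746),
    'Stacking':      (0.8556, 0.7861),
}

# Models ordered from best to worst test AUC
ranked = sorted(results, key=lambda name: results[name][1], reverse=True)

# Best model when the ensemble is left out
best_single = max((name for name in results if name != 'Stacking'),
                  key=lambda name: results[name][1])

# Train/test gap: a large gap suggests the model fits the training set too closely
gaps = {name: round(train - test, 4) for name, (train, test) in results.items()}
largest_gap = max(gaps, key=gaps.get)

for name in ranked:
    train, test = results[name]
    print(f'{name:14s} test={test:.4f} gap={gaps[name]:.4f}')
```

On these numbers the ensemble's margin over LR is about 0.003 AUC, measured on a single 70/30 split, so the difference should be read with some caution.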

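> LR reaches its best score with L1 regularization, which tends to drive the coefficients of uninformative features to zero. This suggests that further feature selection is worth doing.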
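
One simple way to follow up on this is to keep only the features whose L1-regularized coefficients are non-zero. The sketch below shows the idea with made-up feature names and coefficients; in the notebook the coefficients would come from `lr.coef_[0]` of the fitted LR model and the names from the columns of `data`. The helper `select_nonzero` is a hypothetical name.

```python
def select_nonzero(feature_names, coefficients, tol=1e-8):
    """Keep the features whose coefficient magnitude exceeds a small tolerance."""
    return [name for name, weight in zip(feature_names, coefficients)
            if abs(weight) > tol]

# Hypothetical feature names and L1-regularized coefficients
names = ['feature_a', 'feature_b', 'feature_c', 'feature_d']
coefs = [0.31, 0.0, -0.12, 0.0]

kept = select_nonzero(names, coefs)   # features that survive the L1 penalty
dropped = [name for name in names if name not in kept]
```

Whether retraining on the reduced feature set actually improves the test AUC would have to be verified with the same 70/30 split.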