## 金融风控项目
在此项目中，你需要完成金融风控模型的搭建。 基于一个用户的基本信息、历史信息来预测逾期与否。采样的具体数据是拍拍贷数据。https://www.kesci.com/home/competition/56cd5f02b89b5bd026cb39c9/content/1
在此数据中提供了三种不同类型的数据:
1. Master: 用户的主要信息
2. Loginfo: 登录信息
3. Userupdateinfo: 修改信息

在本次项目中，我们只使用```Master```的信息来预测一个用户是否会逾期。 数据里有一个字段叫作```Target```是样本的标签（label)。 在```Master```表格里，包含200多个特征，但不少特征具有缺失值。 做项目的时候需要仔细处理一下。 

对于特征处理方面的技术，请参考本章视频课程里的内容。

本项目区别于之前的项目，希望大家能够开放性地思考，不要太局限于给定的条条框框，把目前为止学到的知识都用起来。所以在项目的设计上区别于之前的，没有设置太多的框架性的，大家可以按照自己的思路灵活做项目。 由于项目本身来自于数据竞赛，所以可以试着跟竞赛里的TOP选手的结果做一下对比，看看跟他们的差距或者优势在哪儿。

```数据```
- ```Training/PPD_Training_Master_GBK_3_1_Training_Set.csv```:  训练数据
- ```Test/PPD_Master_GBK_2_Test_Set.csv```: 测试数据


强调：一定要把注释写清楚。 每个函数，每一个模块具体做什么写清楚。

> 注意： 除了下面导入的库，还有sklearn、XGBoost等经典的库之外，建议不要使用其他的函数库。 如果你不得不使用一些其他特殊的库，请把环境注明在requirements.txt里面，不然我们判作业的时候就没有办法去评判了。 

In [232]:
import numpy as np 
import math 
import pandas as pd 
pd.set_option('display.float_format',lambda x:'%.3f' % x)
import matplotlib.pyplot as plt 
plt.style.use('ggplot')
%matplotlib inline
import seaborn as sns 
sns.set_palette('muted')
sns.set_style('darkgrid')
import warnings
warnings.filterwarnings('ignore')
import os 


In [233]:
# 读取Master数据
data = pd.read_csv('data/Training/PPD_Training_Master_GBK_3_1_Training_Set.csv',encoding='gb18030')
print (data.shape)

(30000, 228)


In [234]:
# 展示记录
data.head()

Unnamed: 0,Idx,UserInfo_1,UserInfo_2,UserInfo_3,UserInfo_4,WeblogInfo_1,WeblogInfo_2,WeblogInfo_3,WeblogInfo_4,WeblogInfo_5,...,SocialNetwork_10,SocialNetwork_11,SocialNetwork_12,SocialNetwork_13,SocialNetwork_14,SocialNetwork_15,SocialNetwork_16,SocialNetwork_17,target,ListingInfo
0,10001,1.0,深圳,4.0,深圳,,1.0,,1.0,1.0,...,222,-1,0,0,0,0,0,1,0,2014/3/5
1,10002,1.0,温州,4.0,温州,,0.0,,1.0,1.0,...,1,-1,0,0,0,0,0,2,0,2014/2/26
2,10003,1.0,宜昌,3.0,宜昌,,0.0,,2.0,2.0,...,-1,-1,-1,1,0,0,0,0,0,2014/2/28
3,10006,4.0,南平,1.0,南平,,,,,,...,-1,-1,-1,0,0,0,0,0,0,2014/2/25
4,10007,5.0,辽阳,1.0,辽阳,,0.0,,1.0,1.0,...,-1,-1,-1,0,0,0,0,0,0,2014/2/27


In [235]:
# 正负样本的比例， 可以看出样本比例不平衡的
data.target.value_counts()

0    27802
1     2198
Name: target, dtype: int64

好了，剩下的部分需要由大家完成。 我大致给一下思路，然后大家可以按照这个思路去一步步完成。 

> #### 1. 数据的预处。 需要考虑以下几个方面：
- ```缺失值```。数据里有大量的缺失值，需要做一些处理。 
- ```字符串的清洗```。比如“北京市”和“北京”合并成“北京”， 统一转换成小写等
- ```二值化```。具体方法请参考课程里的介绍
- ```衍生特征```：比如户籍地和当前城市是否是同一个？ 
- ```特征的独热编码```：对于类别型特征使用独热编码形式
- ```连续性特征的处理```：根据情况来处理
- ```其他```: 根据情况，自行决定要不要做

In [236]:
# 1. 处理缺失值
temp = data.isnull().any()

# 1）统计每个维度缺失值大小删除缺失值超过30%的特征.
dropList1 = []   # 保存需要删除的column名字
for colname in data.columns:
    nCount = len(data) - data[colname].count()
    nLostRate = (nCount / len(data)) * 100
    sLostRate = '%.2f%%' % nLostRate
    print('字段名为：',str(colname).ljust(10),'缺失值数量:',str(nCount).ljust(4),'缺失数量占比：',sLostRate)
    if (nLostRate > 30):
        dropList1.append(colname)

# 2）删除缺失值大于30%的列
data.drop(dropList1, axis=1, inplace=True)
print("删除了%d列的数据"%len(dropList1))

# 3）删除存在缺失值的行
data.dropna(axis=0, how='any', inplace=True)
data_test.dropna(axis=0, how='any', inplace=True)
print(data.shape)
print(data.target.value_counts())


字段名为： Idx        缺失值数量: 0    缺失数量占比： 0.00%
字段名为： UserInfo_1 缺失值数量: 6    缺失数量占比： 0.02%
字段名为： UserInfo_2 缺失值数量: 302  缺失数量占比： 1.01%
字段名为： UserInfo_3 缺失值数量: 7    缺失数量占比： 0.02%
字段名为： UserInfo_4 缺失值数量: 268  缺失数量占比： 0.89%
字段名为： WeblogInfo_1 缺失值数量: 29030 缺失数量占比： 96.77%
字段名为： WeblogInfo_2 缺失值数量: 1658 缺失数量占比： 5.53%
字段名为： WeblogInfo_3 缺失值数量: 29030 缺失数量占比： 96.77%
字段名为： WeblogInfo_4 缺失值数量: 1651 缺失数量占比： 5.50%
字段名为： WeblogInfo_5 缺失值数量: 1651 缺失数量占比： 5.50%
字段名为： WeblogInfo_6 缺失值数量: 1651 缺失数量占比： 5.50%
字段名为： WeblogInfo_7 缺失值数量: 0    缺失数量占比： 0.00%
字段名为： WeblogInfo_8 缺失值数量: 0    缺失数量占比： 0.00%
字段名为： WeblogInfo_9 缺失值数量: 0    缺失数量占比： 0.00%
字段名为： WeblogInfo_10 缺失值数量: 0    缺失数量占比： 0.00%
字段名为： WeblogInfo_11 缺失值数量: 0    缺失数量占比： 0.00%
字段名为： WeblogInfo_12 缺失值数量: 0    缺失数量占比： 0.00%
字段名为： WeblogInfo_13 缺失值数量: 0    缺失数量占比： 0.00%
字段名为： WeblogInfo_14 缺失值数量: 0    缺失数量占比： 0.00%
字段名为： WeblogInfo_15 缺失值数量: 0    缺失数量占比： 0.00%
字段名为： WeblogInfo_16 缺失值数量: 0    缺失数量占比： 0.00%
字段名为： WeblogInfo_17 缺失值数量: 0    缺失数量占比： 0.00%
字段名为： We

(20988, 223)
0    19356
1     1632
Name: target, dtype: int64


In [237]:
# 先将data 和 test 数据进行合并再进行清洗
# 2 字符串的清洗
# 定义清洗用的函数
def deleteChars(s, c):
    return s.replace(c, '')

# 根据data.describe() 观察到UserInfo_2, UserInfo_4, UserInfo_8, UserInfo_20都是城市信息，在此做一个清洗工作
clean_cols = ['UserInfo_2', 'UserInfo_4', 'UserInfo_8', 'UserInfo_20']
for c in clean_cols:
    data[c] = data[c].apply(deleteChars, c = '市')

print(data['UserInfo_2'].unique())


['深圳' '温州' '宜昌' '吴忠' '绵阳' '东莞' '赤峰' '武汉' '长沙' '漳州' '牡丹江' '北京' '成都' '三明'
 '临沂' '福州' '泰州' '上海' '红河哈尼族彝族自治州' '南平' '郴州' '常州' '湖州' '茂名' '天津' '南宁' '聊城'
 '柳州' '太原' '重庆' '曲靖' '合肥' '鸡西' '资阳' '兰州' '济宁' '丽水' '滨州' '渭南' '汕头' '黔南'
 '廊坊' '西宁' '金华' '龙岩' '清远' '徐州' '潍坊' '阳泉' '包头' '陇南' '保定' '吉安' '厦门' '大庆'
 '荆门' '威海' '石家庄' '汕尾' '淄博' '巴彦淖尔盟' '黔西南' '昆明' '宝鸡' '酒泉' '延边朝鲜族自治州' '泉州'
 '无锡' '黄冈' '商丘' '抚州' '吕梁' '阿拉善盟' '黑河' '宿州' '克拉玛依' '淮南' '邵阳' '惠州' '益阳' '淮安'
 '咸宁' '洛阳' '襄阳' '平顶山' '泰安' '扬州' '新乡' '海口' '西安' '焦作' '唐山' '梅州' '肇庆' '阜阳'
 '岳阳' '鞍山' '永州' '杭州' '哈尔滨' '郑州' '南阳' '赣州' '绍兴' '济南' '绥化' '蚌埠' '河源' '银川'
 '南京' '连云港' '韶关' '九江' '广州' '白银' '镇江' '榆林' '广元' '菏泽' '阳江' '日照' '台州' '鄂尔多斯'
 '沈阳' '常德' '烟台' '中山' '白山' '苏州' '周口' '宁德' '大同' '贵阳' '固原' '德阳' '来宾' '宜宾'
 '随州' '运城' '衢州' '襄樊' '荆州' '邯郸' '营口' '邢台' '丹东' '玉林' '南充' '莆田' '嘉兴' '乌鲁木齐'
 '伊犁哈萨克自治州' '玉溪' '晋中' '乌海' '佛山' '定西' '德宏傣族景颇族自治州' '遵义' '北海' '东营' '百色' '巢湖'
 '怀化' '咸阳' '揭阳' '长治' '三门峡' '乌兰察布盟' '临汾' '昭通' '德州' '平凉' '青岛' '锦州' '朝阳' '枣庄'
 '安阳' '潮州' '阿克苏' '秦皇岛' '安康' '钦州' '盐城' '佳木斯' '巴中' '漯河'

In [238]:
# 3. 衍生特征
# 对于UserInfo_2 和 UserInfo_4 相等的列增加一个新的feature
data['special_feature'] = pd.Series(np.zeros(len(data)))
data.loc[data['UserInfo_2'] == data['UserInfo_4'],'special_feature'] = 1
data.loc[data['UserInfo_2'] != data['UserInfo_4'],'special_feature'] = 0
data.head()

Unnamed: 0,Idx,UserInfo_1,UserInfo_2,UserInfo_3,UserInfo_4,WeblogInfo_2,WeblogInfo_4,WeblogInfo_5,WeblogInfo_6,WeblogInfo_7,...,SocialNetwork_11,SocialNetwork_12,SocialNetwork_13,SocialNetwork_14,SocialNetwork_15,SocialNetwork_16,SocialNetwork_17,target,ListingInfo,special_feature
0,10001,1.0,深圳,4.0,深圳,1.0,1.0,1.0,1.0,14,...,-1,0,0,0,0,0,1,0,2014/3/5,1.0
1,10002,1.0,温州,4.0,温州,0.0,1.0,1.0,1.0,14,...,-1,0,0,0,0,0,2,0,2014/2/26,1.0
2,10003,1.0,宜昌,3.0,宜昌,0.0,2.0,2.0,2.0,9,...,-1,-1,1,0,0,0,0,0,2014/2/28,1.0
5,10008,1.0,吴忠,5.0,银川,0.0,2.0,2.0,2.0,4,...,-1,-1,0,0,0,0,0,0,2014/2/27,0.0
6,10011,1.0,绵阳,3.0,赤峰,0.0,13.0,1.0,13.0,15,...,-1,0,1,0,0,0,1,1,2014/2/24,0.0


In [239]:
columns = data.select_dtypes(include=['object']).columns.tolist()
data[columns].describe()
data.loc[data['target'] == 1]['UserInfo_2'].value_counts()

深圳           47
成都           37
广州           34
重庆           26
苏州           24
上海           24
东莞           22
临沂           21
潍坊           20
青岛           20
淄博           19
南京           18
石家庄          18
温州           18
武汉           18
北京           17
菏泽           17
佛山           17
烟台           17
长沙           17
泉州           17
宜昌           16
厦门           16
中山           15
盐城           15
济南           13
杭州           13
宁波           13
聊城           13
哈尔滨          12
             ..
葫芦岛           1
乌兰察布盟         1
张家界           1
乌海            1
阳江            1
平顶山           1
蚌埠            1
石嘴山           1
四平            1
巴音郭楞蒙古自治州     1
防城港           1
文山壮族苗族自治州     1
天水            1
白城            1
乌鲁木齐          1
临夏回族自治州       1
阿克苏           1
黄石            1
黑河            1
大同            1
保山            1
钦州            1
黔西南           1
广安            1
吴忠            1
抚州            1
酒泉            1
金昌            1
营口            1
内江            1
Name: UserInfo_2, Length

In [240]:
# 4 二值化处理
# 计算出逾期率最高的K个城市，这里K = 4
K = 50
# 通过data.describe()观察出需要做二值化的columns，因为存在大量的类别数据。
bin_cols = ['UserInfo_2', 'UserInfo_4', 'UserInfo_8', 'UserInfo_20']
for c in bin_cols:
    cities = data.loc[data['target'] == 1][c].value_counts().index.tolist()
    # 将前K个城市的值保留，其他城市都用其他代替
    data.loc[~data[c].isin(cities1[:K]), c] = '其他'

# cities1 = data.loc[data['target'] == 1]['UserInfo_2'].value_counts().index.tolist()
# cities2 = data.loc[data['target'] == 1]['UserInfo_4'].value_counts().index.tolist()

# data.loc[~data['UserInfo_2'].isin(cities1[:K]), 'UserInfo_2'] = '其他'
# data.loc[~data['UserInfo_4'].isin(cities2[:K]), 'UserInfo_4'] = '其他'
print(data['UserInfo_2'].value_counts())
print(data['UserInfo_4'].value_counts())

其他     11930
深圳       510
广州       471
上海       429
重庆       376
北京       371
温州       342
泉州       337
东莞       312
成都       304
苏州       260
金华       229
杭州       225
郑州       202
福州       201
长沙       195
武汉       195
厦门       179
台州       173
青岛       173
临沂       171
赣州       166
佛山       162
宁波       161
南京       156
潍坊       155
昆明       153
中山       152
海口       144
天津       133
盐城       133
济南       132
石家庄      132
哈尔滨      126
徐州       124
宜昌       120
聊城       119
邯郸       117
滨州       101
烟台        95
安阳        90
临汾        90
淄博        89
菏泽        83
大连        83
泰州        78
绵阳        77
珠海        73
河源        62
丹东        51
永州        46
Name: UserInfo_2, dtype: int64
其他     11529
深圳       579
广州       543
上海       474
北京       451
重庆       382
成都       351
泉州       332
温州       328
东莞       318
杭州       254
金华       251
苏州       249
郑州       248
武汉       237
长沙       210
福州       201
青岛       193
厦门       181
昆明       176
宁波       170
中山       165
台州       165
临沂     

In [241]:
# 5 特征独热编码
# UserInfo_24是地址信息且数据样本极大，这里做删除操作。
data.drop('UserInfo_24', axis = 1, inplace = True)
# ListingInfo 是时间信息与训练无关也做删除操作
data.drop('ListingInfo', axis = 1, inplace = True)

# 选出所有‘类别’行数据进行独热编码
columns = data.select_dtypes(include=['object']).columns.tolist()
print(columns)
data[columns].describe()



['UserInfo_2', 'UserInfo_4', 'UserInfo_7', 'UserInfo_8', 'UserInfo_9', 'UserInfo_19', 'UserInfo_20', 'UserInfo_22', 'UserInfo_23', 'Education_Info2', 'Education_Info3', 'Education_Info4', 'Education_Info6', 'Education_Info7', 'Education_Info8', 'WeblogInfo_19', 'WeblogInfo_20', 'WeblogInfo_21']


Unnamed: 0,UserInfo_2,UserInfo_4,UserInfo_7,UserInfo_8,UserInfo_9,UserInfo_19,UserInfo_20,UserInfo_22,UserInfo_23,Education_Info2,Education_Info3,Education_Info4,Education_Info6,Education_Info7,Education_Info8,WeblogInfo_19,WeblogInfo_20,WeblogInfo_21
count,20988,20988,20988,20988,20988,20988,20988,20988,20988,20988,20988,20988,20988,20988,20988,20988,20988,20988
unique,51,51,32,51,7,31,46,7,26,7,3,6,6,2,7,7,36,4
top,其他,其他,不详,其他,中国移动,山东省,其他,D,D,E,E,E,E,E,E,I,I5,D
freq,11930,11529,2920,12625,10868,1736,16305,19276,19276,19502,19502,19502,20235,20235,20235,17947,10773,17796


In [242]:
# 对数据就进行独热编码并删除之前加入‘其他’的列
data = pd.get_dummies(data, columns = columns,  prefix_sep="_", dummy_na = False, drop_first = False)
# 删除带 ‘其他’的列
# data.head()
dropList = [x + '_其他' for x in bin_cols]
print(dropList)
data.drop(columns = dropList, axis=1, inplace=True)
print(data.shape)

['UserInfo_2_其他', 'UserInfo_4_其他', 'UserInfo_8_其他', 'UserInfo_20_其他']
(20988, 580)


> #### 2. 特征选择
200多个特征里可能有效的特征不会很多。在这里做特征选择相关的工作。 在特征选择这一块请使用```树```模型。 比如sklearn自带的特征选择模块（https://scikit-learn.org/stable/modules/feature_selection.html）， 或者直接使用XGBoost等模型来直接选择。 这些模型训练好之后你可以直接通过```feature_importance_values```属性来获取。

In [253]:
# 进行下采样
positive_data = data[data['target'] == 1]  # 正样本
negative_data = data[data['target'] == 0]  # 负样本

lower_data = negative_data.sample(n = len(positive_data), replace = False, random_state=42, axis = 0)
data_resample = pd.concat([positive_data, lower_data])
print(data_resample.shape)

(3264, 580)


In [254]:
# 进行训练集和测试集的分割
feature_names = np.array(data_resample.columns[data_resample.columns != 'target'].tolist())

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data_resample[feature_names].values, 
    data_resample['target'].values,
    test_size=0.2,
    random_state=42
)
print(X_train.shape, y_train.shape, X_test.shape, y_train.shape)

(2611, 579) (2611,) (653, 579) (2611,)


In [205]:
# 1)采用逻辑回归L1作为特征选择的base
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import f1_score
#from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline

params_c = np.logspace(-4, 1, 11)
print(params_c)

# 使用逻辑回归+selectfromModel进行特征筛选
pipe = Pipeline([
    ('fs', SelectFromModel(estimator=LogisticRegression(penalty='l1'))),
    ('clf', LogisticRegression())
])

parameters = {
    'fs__estimator__C': params_c,
}

kf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
model = GridSearchCV(pipe, parameters, cv = kf, n_jobs= -1, verbose = 1, scoring = 'f1') # f1 socring for binary classifier
model.fit(X_train, y_train)

print('Best f1-score: ', model.best_score_)    
best_parameters = model.best_params_
print('Best parameters: ', best_parameters) 

# 求出c_best
c_best = best_parameters['fs__estimator__C']

[1.00000000e-04 3.16227766e-04 1.00000000e-03 3.16227766e-03
 1.00000000e-02 3.16227766e-02 1.00000000e-01 3.16227766e-01
 1.00000000e+00 3.16227766e+00 1.00000000e+01]
Fitting 5 folds for each of 11 candidates, totalling 55 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed: 25.5min
[Parallel(n_jobs=-1)]: Done  55 out of  55 | elapsed: 48.6min finished


Best f1-score:  0.01534462465680796
Best parameters:  {'fs__estimator__C': 0.001}


In [255]:
# 通过c_best值，重新在整个X_train里做训练，并选出特征。
c_best = 0.001
lr_clf = LogisticRegression(penalty='l1', C=c_best)
lr_clf.fit(X_train, y_train) # 在整个训练数据重新训练

select_model = SelectFromModel(lr_clf, prefit=True)
selected_features = select_model.get_support()  # 被选出来的特征

# 重新构造feature_names
feature_names = feature_names[selected_features]

# 重新构造训练数据和测试数据
X_train1 = X_train[:, selected_features]
X_test1 = X_test[:, selected_features]

In [256]:
# 使用选择后的逻辑回归训练比计算score
from sklearn.metrics import roc_auc_score 

params_c = np.logspace(-5,2,5) # 也可以自行定义一个范围

# TODO: 实现逻辑回归 + L2正则， 利用GrisSearchCV
pipe = Pipeline([
    ('clf', LogisticRegression(penalty='l2', solver='lbfgs'))
])

parameters = {
    'clf__C': params_c
}

kf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
model = GridSearchCV(pipe, parameters, cv = kf, n_jobs= -1, verbose = 1, scoring = 'roc_auc') # roc_auc
model.fit(X_train1, y_train)

# 输出最好的参数 
print('Best roc_auc: ', model.best_score_)
print('Best parameters:', model.best_params_)


Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Best roc_auc:  0.6436987843172806
Best parameters: {'clf__C': 0.03162277660168379}


[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    1.3s finished


In [258]:
predictions = model.predict(X_test1)
print(roc_auc_score(y_test, predictions))

0.5685275599457096


In [280]:
# 2）使用XGBoost来做特征选择
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
result_roc_auc = roc_auc_score(y_test, y_pred)
print("roc_auc: %.2f%%" % (result_roc_auc * 100.0))
# Fit model using each importance as a threshold
# print(model.feature_importances_)
thresholds = np.sort(model.feature_importances_)
#print(thresholds)
best_roc_auc = 0
best_thresh = 0
X_train2 = None
X_test2 = None
select_X_train = X_train
select_X_test = X_test
nCount = 0
for thresh in thresholds:
    # select features using threshold
    if thresh < 10e-5:
        continue
    select_model = SelectFromModel(model, threshold=thresh, prefit = True)
    select_X_train = select_model.transform(select_X_train)
    
    # train model
    model = XGBClassifier()
    model.fit(select_X_train, y_train)
    # eval model
    select_X_test = select_model.transform(select_X_test)
    y_pred = model.predict(select_X_test)
    current_roc_auc = roc_auc_score(y_test, y_pred)
    if nCount % 5 == 0:
        print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], current_roc_auc*100.0))
    nCount += 1
    # select the best features
    if current_roc_auc > best_roc_auc:
        best_roc_auc = current_roc_auc
        best_thresh = thresh
        X_train2 = select_X_train
        X_test2 = select_X_test

roc_auc: 66.30%
Thresh=0.000, n=135, Accuracy: 66.30%
Thresh=0.003, n=113, Accuracy: 66.96%
Thresh=0.003, n=111, Accuracy: 67.27%
Thresh=0.004, n=109, Accuracy: 68.29%
Thresh=0.004, n=109, Accuracy: 68.29%
Thresh=0.004, n=102, Accuracy: 67.18%
Thresh=0.005, n=101, Accuracy: 67.12%
Thresh=0.005, n=100, Accuracy: 67.04%
Thresh=0.005, n=98, Accuracy: 67.51%
Thresh=0.005, n=95, Accuracy: 66.82%
Thresh=0.006, n=90, Accuracy: 66.96%
Thresh=0.006, n=85, Accuracy: 67.18%
Thresh=0.006, n=85, Accuracy: 67.18%
Thresh=0.006, n=82, Accuracy: 66.55%
Thresh=0.007, n=79, Accuracy: 66.47%
Thresh=0.007, n=77, Accuracy: 65.28%
Thresh=0.007, n=77, Accuracy: 65.28%
Thresh=0.007, n=73, Accuracy: 65.67%
Thresh=0.007, n=73, Accuracy: 65.67%
Thresh=0.008, n=71, Accuracy: 65.64%
Thresh=0.009, n=70, Accuracy: 66.24%
Thresh=0.009, n=63, Accuracy: 66.20%
Thresh=0.010, n=56, Accuracy: 66.04%
Thresh=0.011, n=53, Accuracy: 66.61%
Thresh=0.012, n=49, Accuracy: 66.12%
Thresh=0.013, n=46, Accuracy: 65.96%
Thresh=0.015, 

In [282]:
# X_train2, X_test2就是选出来的最好特征
print(X_train2.shape, X_test2.shape)

(2611, 107) (653, 107)


> #### 3. XGBoost来训练风控模型，结果以AUC为准
https://github.com/dmlc/xgboost   这是XGBoost library具体的地址, 具有详细的文档。 https://pypi.org/project/xgboost/ 里有安装的步骤。 试着去调一下它的超参数，使得得到最好的效果。 一定要注意不需要使用测试数据来训练。 最终的结果以测试数据上的AUC为标准。 

In [284]:
# data_test = pd.read_csv('data/Test/PPD_Master_GBK_2_Test_Set.csv',encoding='gb18030')
clf = XGBClassifier()
# 1)greedy approach adjust learning rate
parameters = {
    # learning rate
    'clf__eta': np.logspace(-2, 1, 10)
}

kf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
model = GridSearchCV(clf, parameters, cv = kf, n_jobs= -1, verbose = 1, scoring = 'roc_auc') # roc_auc_score
model.fit(X_train2, y_train)

print('Best roc_auc_score: ', model.best_score_)    
best_parameters = model.best_params_
print('Best parameters: ', best_parameters)

# 求出eta best
eta_best = best_parameters['clf__eta']

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    6.3s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    7.9s finished


Best roc_auc_score:  0.7297296310319055
Best parameters:  {'clf__eta': 0.01}


In [286]:
# ）greedy approach adjust for
# max_depth and min_child_weight 
clf = XGBClassifier(eta = eta_best)
parameters = {
    'clf__max_depth': [3, 5, 7, 9, 10],
    'clf__min_child_weight': [1, 2, 4, 8, 16]
}

kf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
model = GridSearchCV(clf, parameters, cv = kf, n_jobs= -1, verbose = 1, scoring = 'roc_auc') # roc_auc_score
model.fit(X_train2, y_train)

print('Best roc_auc_score: ', model.best_score_)    
best_parameters = model.best_params_
print('Best parameters: ', best_parameters)

# 求出best max_depth 和 min_child_weight
max_depth_best = best_parameters['clf__max_depth']
min_child_weight = best_parameters['clf__min_child_weight']

Fitting 5 folds for each of 25 candidates, totalling 125 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    6.3s
[Parallel(n_jobs=-1)]: Done 125 out of 125 | elapsed:   16.9s finished


Best roc_auc_score:  0.7297296310319055
Best parameters:  {'clf__max_depth': 3, 'clf__min_child_weight': 1}


In [287]:
# greedy approach for n_estimator
clf = XGBClassifier(eta = eta_best, max_depth = max_depth_best, min_child_weight = min_child_weight)
parameters = {
    'clf__n_estimators': [400, 500, 600, 700, 800]
}
kf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
model = GridSearchCV(clf, parameters, cv = kf, n_jobs= -1, verbose = 1, scoring = 'roc_auc') # roc_auc_score
model.fit(X_train2, y_train)

print('Best roc_auc_score: ', model.best_score_)    
best_parameters = model.best_params_
print('Best parameters: ', best_parameters)

# 求出best n_estimators
n_estimator_best = best_parameters['clf__n_estimators']

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    5.1s finished


Best roc_auc_score:  0.7297296310319055
Best parameters:  {'clf__n_estimators': 400}


In [290]:
# greedy approach for gamma
# {'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}
clf = XGBClassifier(eta = eta_best, 
                    max_depth = max_depth_best, 
                    min_child_weight = min_child_weight, 
                    n_estimators = n_estimator_best)
parameters = {
    'clf__gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
}
kf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
model = GridSearchCV(clf, parameters, cv = kf, n_jobs= -1, verbose = 1, scoring = 'roc_auc') # roc_auc_score
model.fit(X_train2, y_train)

print('Best roc_auc_score: ', model.best_score_)    
best_parameters = model.best_params_
print('Best parameters: ', best_parameters)

# 求出best gamma
gamma_best = best_parameters['clf__gamma']

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:   13.8s finished


Best roc_auc_score:  0.72417734324015
Best parameters:  {'clf__gamma': 0.1}


In [292]:
# greedy approach
clf = XGBClassifier(eta = eta_best, 
                    max_depth = max_depth_best, 
                    min_child_weight = min_child_weight, 
                    n_estimators = n_estimator_best,
                    gamma = gamma_best)
parameters = {
    'clf__subsample': [0.4, 0.5, 0.6, 0.7],
    'clf__colsample_bytree': [0.4, 0.5, 0.6, 0.7]
}
kf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
model = GridSearchCV(clf, parameters, cv = kf, n_jobs= -1, verbose = 1, scoring = 'roc_auc') # roc_auc_score
model.fit(X_train2, y_train)

print('Best roc_auc_score: ', model.best_score_)    
best_parameters = model.best_params_
print('Best parameters: ', best_parameters)

# 求出best gamma
subsample_best = best_parameters['clf__subsample']
colsample_bytree_best = best_parameters['clf__colsample_bytree']

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   17.5s
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:   35.5s finished


Best roc_auc_score:  0.7280011885040995
Best parameters:  {'clf__colsample_bytree': 0.4, 'clf__subsample': 0.4}


In [293]:
# greedy approach
clf = XGBClassifier(eta = eta_best, 
                    max_depth = max_depth_best, 
                    min_child_weight = min_child_weight, 
                    n_estimators = n_estimator_best,
                    gamma = gamma_best,
                    subsample = subsample_best,
                    colsample_bytree = colsample_bytree_best)
parameters = {
    'clf__reg_alpha': [0.05, 0.1, 1, 2, 3], 
    'clf__reg_lambda': [0.05, 0.1, 1, 2, 3]
}

kf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
model = GridSearchCV(clf, parameters, cv = kf, n_jobs= -1, verbose = 1, scoring = 'roc_auc') # roc_auc_score
model.fit(X_train2, y_train)

print('Best roc_auc_score: ', model.best_score_)    
best_parameters = model.best_params_
print('Best parameters: ', best_parameters)

# 求出best gamma
reg_alpha_best = best_parameters['clf__reg_alpha']
reg_lambda_best = best_parameters['clf__reg_lambda']

Fitting 5 folds for each of 25 candidates, totalling 125 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    7.3s
[Parallel(n_jobs=-1)]: Done 125 out of 125 | elapsed:   23.3s finished


Best roc_auc_score:  0.7284200787357374
Best parameters:  {'clf__reg_alpha': 0.05, 'clf__reg_lambda': 0.05}


In [297]:
# calculate the auc rate for test data
predictions = model.predict(X_test2)
print("roc_auc_score in the test data set is: %.4f" % roc_auc_score(y_test, predictions))

roc_auc_score in the test data set is: 0.6536
