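# 0. Workflow
1. Define the problem
2. Get the training and test data
3. Explore and preprocess the data
4. Select a model
5. Tune parameters, build the model, predict, and solve the problem
6. Submit the results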

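# 1. Define the problem
* Give Me Some Credit is a Kaggle credit-scoring project. The training set contains borrower samples labeled by whether they experienced serious delinquency. We need to train a credit-scoring model that predicts whether borrowers in the test set will become seriously delinquent, which can inform financial decisions.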

### Imports

In [None]:
# Data wrangling and analysis
import numpy as np
import pandas as pd

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Machine learning
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, auc
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# 2. Get the data

In [None]:
# Read the training and test sets
train_df = pd.read_csv('/kaggle/input/GiveMeSomeCredit/cs-training.csv', index_col = 0)
test_df = pd.read_csv('/kaggle/input/GiveMeSomeCredit/cs-test.csv', index_col = 0)

# 3. Data exploration and preprocessing

### Features in the dataset and their meaning

In [None]:
train_df.columns.values

|Variable Name|Description|Type|
|:-----------|:-----------|----|
|SeriousDlqin2yrs|Person experienced 90 days past due delinquency or worse|Y/N|
|RevolvingUtilizationOfUnsecuredLines|Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits|percentage|
|age|Age of borrower in years|integer|
|NumberOfTime30-59DaysPastDueNotWorse|Number of times borrower has been 30-59 days past due but no worse in the last 2 years|integer|
|DebtRatio|Monthly debt payments, alimony, living costs divided by monthly gross income|percentage|
|MonthlyIncome|Monthly income|real|
|NumberOfOpenCreditLinesAndLoans|Number of open loans (installment like car loan or mortgage) and lines of credit (e.g. credit cards)|integer|
|NumberOfTimes90DaysLate|Number of times borrower has been 90 days or more past due|integer|
|NumberRealEstateLoansOrLines|Number of mortgage and real estate loans including home equity lines of credit|integer|
|NumberOfTime60-89DaysPastDueNotWorse|Number of times borrower has been 60-89 days past due but no worse in the last 2 years|integer|
|NumberOfDependents|Number of dependents in family excluding themselves (spouse, children etc.)|integer|


In [None]:
train_df.head(5)

In [None]:
test_df.head(5)

### Features and their types
SeriousDlqin2yrs is the target field; the other 10 features are inputs.
* Categorical feature: SeriousDlqin2yrs
* Numeric features
    * Continuous: RevolvingUtilizationOfUnsecuredLines, age, DebtRatio, MonthlyIncome
    * Discrete: NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse, NumberOfTimes90DaysLate, NumberOfOpenCreditLinesAndLoans, NumberRealEstateLoansOrLines, NumberOfDependents

### Quick look at the datasets

In [None]:
# Training set info
train_df.info()

In [None]:
# Test set info
test_df.info()

In [None]:
# Training set summary statistics
train_df.describe()

In [None]:
# Test set summary statistics
test_df.describe()

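### Initial observations
* The training set has 150,000 records and the test set has 101,503 records.
* **Missing values**: MonthlyIncome and NumberOfDependents contain missing values in both the training and test sets.
* **Outliers**
    * The training set contains an age of 0, which is not plausible.
    * For RevolvingUtilizationOfUnsecuredLines, NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse, NumberOfTimes90DaysLate and similar features, the max is far above the 75th percentile. This may reflect a skewed distribution or outliers and needs to be checked later.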

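### Handling missing values

In [None]:
# Missing values in the training set
pd.DataFrame({'count':train_df.isnull().sum().values, 'ratio': train_df.isnull().mean() * 100})

In [None]:
# Missing values in the test set
pd.DataFrame({'count':test_df.isnull().sum().values, 'ratio': test_df.isnull().mean() * 100})

* In both the training and test sets, MonthlyIncome and NumberOfDependents have about 20% and 2.6% missing values respectively.
* MonthlyIncome has too many missing values to simply drop the affected records, so we need a reasonable way to fill them.
* MonthlyIncome and NumberOfDependents may be related to DebtRatio, because DebtRatio equals monthly debt payments, alimony and living costs divided by monthly gross income. It is worth looking at how the three features are distributed together.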
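When measuring how often `MonthlyIncome` is missing inside a subgroup (for example, borrowers with a high `DebtRatio`), the denominator should be the size of that subgroup rather than the whole dataset. A minimal sketch on a toy frame; the helper name `missing_share_within` and all values are illustrative assumptions, and only the column names mirror the dataset:

```python
import numpy as np
import pandas as pd

def missing_share_within(df, mask, column):
    """Percentage of the rows selected by `mask` whose `column` is missing."""
    subset = df.loc[mask, column]
    if len(subset) == 0:
        return float('nan')
    return subset.isnull().mean() * 100

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'DebtRatio': [0.3, 250.0, 1200.0, 0.8, 500.0],
    'MonthlyIncome': [5000.0, np.nan, np.nan, 3200.0, 0.0],
})

# Share of missing income among high-DebtRatio rows vs. among all rows
share_high = missing_share_within(toy, toy['DebtRatio'] > 100, 'MonthlyIncome')
share_all = toy['MonthlyIncome'].isnull().mean() * 100
print(round(share_high, 1), round(share_all, 1))
```

Comparing the two numbers shows whether missingness is concentrated in the subgroup, which is the question the cells that follow try to answer.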

In [None]:
# Among records with missing MonthlyIncome: NumberOfDependents and DebtRatio statistics
train_df[train_df['MonthlyIncome'].isnull()][['NumberOfDependents', 'DebtRatio']].describe()

In [None]:
# Among records with missing NumberOfDependents: MonthlyIncome and DebtRatio statistics
train_df[train_df['NumberOfDependents'].isnull()][['MonthlyIncome', 'DebtRatio']].describe()

In [None]:
# Training set: share (%) of high-DebtRatio (>100) records whose MonthlyIncome is missing
train_df.loc[train_df['DebtRatio'] > 100, 'MonthlyIncome'].isnull().mean() * 100

In [None]:
# Test set: share (%) of high-DebtRatio (>100) records whose MonthlyIncome is missing
test_df.loc[test_df['DebtRatio'] > 100, 'MonthlyIncome'].isnull().mean() * 100

In [None]:
# Training set: share (%) of all records where both MonthlyIncome and NumberOfDependents are missing
train_df[train_df['MonthlyIncome'].isnull()]['NumberOfDependents'].isnull().sum()/len(train_df)*100

In [None]:
# Test set: share (%) of all records where both MonthlyIncome and NumberOfDependents are missing
test_df[test_df['MonthlyIncome'].isnull()]['NumberOfDependents'].isnull().sum()/len(test_df)*100

In [None]:
# MonthlyIncome statistics for records with high DebtRatio and a non-missing MonthlyIncome
train_df[(train_df['DebtRatio']>100) & (train_df['MonthlyIncome'].notnull())]['MonthlyIncome'].describe()

In [None]:
# MonthlyIncome statistics for records with low DebtRatio and a non-missing MonthlyIncome
train_df[(train_df['DebtRatio']<=100) & (train_df['MonthlyIncome'].notnull())]['MonthlyIncome'].describe()

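#### Filling missing MonthlyIncome
* Records with a missing MonthlyIncome have a high DebtRatio (median 1159).
* Among high-DebtRatio borrowers with a recorded MonthlyIncome, the income is mostly 0 or very small (the median and the lower quartile are both at most 1).
* This suggests that borrowers with a missing MonthlyIncome may have left the field blank on purpose because they have no income.
* We therefore fill these missing values with 0.

#### Filling missing NumberOfDependents
* Records missing NumberOfDependents are also missing MonthlyIncome, which suggests that the same group of borrowers left both fields blank.
* For borrowers with a missing MonthlyIncome, NumberOfDependents is usually 0 as well (the 25th, 50th and 75th percentiles are all 0). It is plausible that borrowers with little or no income have no dependents.
* We therefore fill these missing values with 0 as well.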

In [None]:
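# Fill missing MonthlyIncome and NumberOfDependents with 0
# (assign back instead of using inplace=True on a column, which newer pandas versions warn about)
train_df['MonthlyIncome'] = train_df['MonthlyIncome'].fillna(0)
train_df['NumberOfDependents'] = train_df['NumberOfDependents'].fillna(0)
test_df['MonthlyIncome'] = test_df['MonthlyIncome'].fillna(0)
test_df['NumberOfDependents'] = test_df['NumberOfDependents'].fillna(0)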

Next, we examine the features one by one.

### SeriousDlqin2yrs

In [None]:
# Share of 0s and 1s in the SeriousDlqin2yrs target
train_df['SeriousDlqin2yrs'].value_counts()/len(train_df)

In [None]:
# Distribution of SeriousDlqin2yrs
plt.figure()
sns.countplot(x='SeriousDlqin2yrs', data=train_df)

SeriousDlqin2yrs observations
* The classes are imbalanced: negatives outnumber positives by about 14:1, which may affect the predictions of some models.
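With a ratio of about 14:1, a model that always predicts 0 would already be right roughly 93% of the time, which is one reason the rest of the notebook scores models by AUC rather than accuracy. A small sketch of how such a ratio can be computed; the helper `class_ratio` and the toy labels are assumptions for illustration:

```python
import pandas as pd

def class_ratio(labels):
    """Return (negatives per positive, positive share in %) for 0/1 labels."""
    counts = pd.Series(labels).value_counts()
    negatives, positives = counts.get(0, 0), counts.get(1, 0)
    if positives == 0:
        raise ValueError('no positive samples')
    return negatives / positives, positives / len(labels) * 100

# Toy labels (hypothetical): 14 negatives for every positive
toy_labels = [0] * 28 + [1] * 2
ratio, positive_share = class_ratio(toy_labels)
print(ratio, round(positive_share, 2))
```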

### RevolvingUtilizationOfUnsecuredLines

In [None]:
train_df['RevolvingUtilizationOfUnsecuredLines'].describe().to_frame().T

In [None]:
train_df[train_df['RevolvingUtilizationOfUnsecuredLines'] > train_df['RevolvingUtilizationOfUnsecuredLines'].quantile(0.99)]['RevolvingUtilizationOfUnsecuredLines'].describe().to_frame().T

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.distplot(x = np.array(train_df['RevolvingUtilizationOfUnsecuredLines']), ax = axes[0])
axes[0].set_title('Histogram Plot of RevolvingUtilizationOfUnsecuredLines')
sns.boxplot(x = train_df['RevolvingUtilizationOfUnsecuredLines'], ax = axes[1])
axes[1].set_title('Box Plot of RevolvingUtilizationOfUnsecuredLines')

The distribution is extremely skewed, so we look at the data range by range.

In [None]:
# Share (%) of training records with RevolvingUtilizationOfUnsecuredLines <1, 1-10 and >10
# (the boundaries 1 and 10 are included in the middle range so that no record is skipped)
rev = train_df['RevolvingUtilizationOfUnsecuredLines']
below_1 = (rev < 1).mean() * 100
bet_1_10 = ((rev >= 1) & (rev <= 10)).mean() * 100
beyond_10 = (rev > 10).mean() * 100
pd.DataFrame({"below_1": below_1, "bet_1_10": bet_1_10, "beyond_10": beyond_10}, index=[1])

In [None]:
# Box plots (top row) and histograms (bottom row) for the three ranges: <1, 1-10, >10
rev = train_df['RevolvingUtilizationOfUnsecuredLines']
segments = [(rev[rev < 1], below_1),
            (rev[(rev >= 1) & (rev <= 10)], bet_1_10),
            (rev[rev > 10], beyond_10)]
fig, axes = plt.subplots(2, 3, figsize=(18,10))
for i, (values, share) in enumerate(segments):
    sns.boxplot(x = values, ax = axes[0][i])
    axes[0][i].set_title('{}% of Train_Dataset'.format(round(share, 2)))
    sns.distplot(x = values, ax = axes[1][i])
    axes[1][i].set_title('{}% of Train_Dataset'.format(round(share, 2)))


In [None]:
train_df[train_df['RevolvingUtilizationOfUnsecuredLines'] > 10]['RevolvingUtilizationOfUnsecuredLines'].describe().to_frame().T

In [None]:
train_df[train_df['RevolvingUtilizationOfUnsecuredLines'] > 10]['RevolvingUtilizationOfUnsecuredLines'].count()/len(train_df)*100, test_df[test_df['RevolvingUtilizationOfUnsecuredLines'] > 10]['RevolvingUtilizationOfUnsecuredLines'].count()/len(test_df)*100

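RevolvingUtilizationOfUnsecuredLines observations
* The summary statistics, histogram and box plot show that the mean is about 40 times the median, and values above the 99th percentile vary widely.
* Values between 0 and 1 make up about 98% of the records, which is what we would expect.
* Values between 1 and 10 make up about 2%. Borrowers can sometimes spend beyond their credit limit, so this is acceptable.
* Values above 10 make up about 0.2% and are very large. To keep them from distorting the model, we consider removing these records.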
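The range shares quoted above can also be computed in one step with `pd.cut`. This is a sketch on toy values, not the notebook's own code; the helper `range_shares` is an assumption, and with `right=False` a value of exactly 10 falls into the last bucket, which differs slightly from the `> 10` filter used elsewhere:

```python
import numpy as np
import pandas as pd

def range_shares(values, edges, labels):
    """Percentage of values in each half-open interval [edge_i, edge_i+1)."""
    buckets = pd.cut(values, bins=edges, labels=labels, right=False)
    # sort=False keeps the buckets in their defined order
    return buckets.value_counts(normalize=True, sort=False) * 100

# Toy utilization values (hypothetical)
toy_util = pd.Series([0.0, 0.2, 0.5, 0.9, 0.95, 1.3, 4.0, 50.0, 0.1, 0.7])
shares = range_shares(toy_util, [0, 1, 10, np.inf],
                      ['below_1', 'bet_1_10', 'beyond_10'])
print(shares)
```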

### age

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15,10))
sns.boxplot(x= train_df['age'], ax = axes[0][0])
axes[0][0].set_title('Train_Dataset')
sns.boxplot(x= test_df['age'], ax = axes[0][1])
axes[0][1].set_title('Test_Dataset')
sns.distplot(x= train_df['age'], ax = axes[1][0])
axes[1][0].set_title('Train_Dataset')
sns.distplot(x= test_df['age'], ax = axes[1][1])
axes[1][1].set_title('Test_Dataset')

In [None]:
len(train_df[train_df['age'] == 0]), len(train_df[train_df['age'] < 18])

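age observations
* Overall, age is roughly normally distributed, which is reasonable.
* The training set contains an age of 0, which we treat as an outlier. Only one record has age 0, so we consider simply deleting it.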

### DebtRatio

In [None]:
train_df['DebtRatio'].describe().to_frame().T

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.distplot(x = np.array(train_df['DebtRatio']),
             ax = axes[0])
axes[0].set_title('Histogram Plot of Debt Ratio')
sns.boxplot(x = train_df['DebtRatio'], ax = axes[1])
axes[1].set_title('Box Plot of Debt Ratio')

In [None]:
pd.DataFrame({'below 1': train_df[train_df['DebtRatio'] <= 1]['DebtRatio'].count()*100/len(train_df),
             'between 1 - 10': train_df[(train_df['DebtRatio'] > 1) &
                                        (train_df['DebtRatio'] <=10)]['DebtRatio'].count()*100/len(train_df),
             'beyond 10': train_df[train_df['DebtRatio'] > 10]['DebtRatio'].count()*100/len(train_df)}, index = [1])

In [None]:
train_df[(train_df['DebtRatio'] > 1) & (train_df['DebtRatio'] <=10)]['DebtRatio'].describe().to_frame().T

In [None]:
train_df[train_df['DebtRatio'] > 10]['DebtRatio'].describe().to_frame().T

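DebtRatio observations
* Values between 0 and 1 make up about 76% of the records.
* Values between 1 and 10 make up about 4%.
* Values above 10 make up about 19% and are large (median 2167). They could be seen as outliers that skew the feature, but debt ratios can genuinely be very high, so we treat them as special cases and leave them unchanged.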

### MonthlyIncome

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(18,6))
sns.histplot(x = train_df[train_df['MonthlyIncome'] <= 1000]['MonthlyIncome'], ax = axes[0,0])
sns.histplot(x = train_df[(train_df['MonthlyIncome'] > 1000) & 
                         (train_df['MonthlyIncome'] <= 10000)]['MonthlyIncome'], ax = axes[0,1])
sns.histplot(x = train_df[(train_df['MonthlyIncome'] > 10000) & 
                         (train_df['MonthlyIncome'] <= 20000)]['MonthlyIncome'], ax = axes[1,0])
sns.histplot(x = train_df[(train_df['MonthlyIncome'] > 20000) & 
                         (train_df['MonthlyIncome'] <= 50000)]['MonthlyIncome'], ax = axes[1,1])

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.histplot(x = train_df[(train_df['MonthlyIncome'] > 50000) &
                         (train_df['MonthlyIncome'] <= 200000)]['MonthlyIncome'], ax=axes[0])
sns.histplot(x = train_df[(train_df['MonthlyIncome'] > 200000) & 
                         (train_df['MonthlyIncome'] <= 3000000)]['MonthlyIncome'], ax=axes[1])

MonthlyIncome observations
* No anomalies found.

### NumberOfOpenCreditLinesAndLoans

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.histplot(x = train_df['NumberOfOpenCreditLinesAndLoans'], binwidth=1, ax = axes[0])
sns.histplot(x = test_df['NumberOfOpenCreditLinesAndLoans'], binwidth=1, ax = axes[1])

NumberOfOpenCreditLinesAndLoans observations
* No anomalies found.

### NumberRealEstateLoansOrLines

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.histplot(x = train_df['NumberRealEstateLoansOrLines'], binwidth=1, ax = axes[0])
sns.histplot(x = test_df['NumberRealEstateLoansOrLines'], binwidth=1, ax = axes[1])

In [None]:
train_df['NumberRealEstateLoansOrLines'].value_counts()

NumberRealEstateLoansOrLines observations
* No anomalies found.

### NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse, NumberOfTimes90DaysLate

In [None]:
# Value counts of the three past-due counters, side by side
# (renaming the Series directly works regardless of how pandas names the value_counts result)
due_30_59 = train_df['NumberOfTime30-59DaysPastDueNotWorse'].value_counts().rename('30-59days')
due_60_89 = train_df['NumberOfTime60-89DaysPastDueNotWorse'].value_counts().rename('60-89days')
due_90 = train_df['NumberOfTimes90DaysLate'].value_counts().rename('90days')
pd.concat([due_30_59, due_60_89, due_90], axis = 1)

In [None]:
train_df[train_df['NumberOfTime30-59DaysPastDueNotWorse'] > 17][['NumberOfTime30-59DaysPastDueNotWorse',
                                                                'NumberOfTime60-89DaysPastDueNotWorse',
                                                                'NumberOfTimes90DaysLate']]

In [None]:
train_df[train_df['NumberOfTime30-59DaysPastDueNotWorse'] > 17]['SeriousDlqin2yrs'].mean()*100

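NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse, NumberOfTimes90DaysLate observations
* The three features have similar distributions. Two large values appear (98 and 96); it is unlikely that a borrower was past due 98 or 96 times within 2 years.
* In addition, the records that take the value 96 or 98 share the same indices across all three features, which is suspicious and may indicate erroneous data.
* We therefore delete the records where these three features take the value 96 or 98.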
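The claim that 96 and 98 occur on the same rows in all three counters can be checked directly. A sketch under stated assumptions: the helper `sentinel_rows` and the toy frame are hypothetical, and only the column names come from the dataset:

```python
import pandas as pd

PAST_DUE_COLS = [
    'NumberOfTime30-59DaysPastDueNotWorse',
    'NumberOfTime60-89DaysPastDueNotWorse',
    'NumberOfTimes90DaysLate',
]

def sentinel_rows(df, codes=(96, 98)):
    """Return (rows where all three counters hold a code, rows where only some do)."""
    flags = df[PAST_DUE_COLS].isin(list(codes))
    all_flagged = flags.all(axis=1)
    some_flagged = flags.any(axis=1) & ~all_flagged
    return df.index[all_flagged], df.index[some_flagged]

# Toy frame (hypothetical values)
toy = pd.DataFrame(
    [[0, 0, 0], [98, 98, 98], [2, 1, 0], [96, 96, 96]],
    columns=PAST_DUE_COLS, index=[10, 11, 12, 13],
)
all_three, partial = sentinel_rows(toy)
print(list(all_three), list(partial))
```

If the second index is empty on the real data, the values always appear together, which supports reading them as codes rather than genuine counts.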

### NumberOfDependents

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.histplot(x = train_df['NumberOfDependents'], binwidth=1, ax = axes[0])
sns.histplot(x = test_df['NumberOfDependents'], binwidth=1, ax = axes[1])

NumberOfDependents observations
* No anomalies found.

Based on the observations above, we now handle the outliers in the dataset.

In [None]:
# Drop records with RevolvingUtilizationOfUnsecuredLines above 10
train_df = train_df[train_df['RevolvingUtilizationOfUnsecuredLines'] <= 10]

# Drop records with age 0
train_df = train_df[train_df['age'] > 0]

# Drop records where any of the three past-due counters is 96 or 98 (the only values >= 90)
train_df = train_df[train_df['NumberOfTime30-59DaysPastDueNotWorse'] < 90]
train_df = train_df[train_df['NumberOfTimes90DaysLate'] < 90]
train_df = train_df[train_df['NumberOfTime60-89DaysPastDueNotWorse'] < 90]
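The same filters can be combined into one boolean mask, which also makes it easy to report how many rows are removed. This is an illustrative sketch on toy data; the function name `drop_outliers` is an assumption, and the thresholds are the ones used in the cell above:

```python
import pandas as pd

def drop_outliers(df):
    """Apply all outlier filters in one pass; return (cleaned frame, rows dropped)."""
    keep = (
        (df['RevolvingUtilizationOfUnsecuredLines'] <= 10)
        & (df['age'] > 0)
        & (df['NumberOfTime30-59DaysPastDueNotWorse'] < 90)
        & (df['NumberOfTime60-89DaysPastDueNotWorse'] < 90)
        & (df['NumberOfTimes90DaysLate'] < 90)
    )
    return df[keep].copy(), int((~keep).sum())

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    'RevolvingUtilizationOfUnsecuredLines': [0.5, 25.0, 0.2, 0.9, 1.5],
    'age': [40, 50, 0, 33, 61],
    'NumberOfTime30-59DaysPastDueNotWorse': [0, 0, 0, 98, 1],
    'NumberOfTime60-89DaysPastDueNotWorse': [0, 0, 0, 98, 0],
    'NumberOfTimes90DaysLate': [0, 0, 0, 98, 0],
})
cleaned, n_dropped = drop_outliers(toy)
print(list(cleaned.index), n_dropped)
```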

In [None]:
# Check correlations between variables
corr = train_df.corr()
plt.subplots(figsize=(13, 10))
sns.heatmap(corr, annot=True, fmt='.2g')

In [None]:
# Split the cleaned training data into a training part and a hold-out part
x = train_df.drop(['SeriousDlqin2yrs'], axis=1)
y = train_df['SeriousDlqin2yrs']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

In [None]:
# Helper that plots a ROC curve
def plot_roc_curve(fpr, tpr, label=None):
    plt.figure(figsize=(8,6))
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0,1],[0,1], "k--") # diagonal reference line (random classifier)
    plt.axis([0,1,0,1])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    if label is not None:
        plt.legend(loc="lower right")
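The model cells that follow compute AUC as `auc(fpr, tpr)` from `roc_curve`. As a quick sanity check on toy scores (hypothetical values), this gives the same number as `roc_auc_score`, and, when there are no tied scores, equals the fraction of positive/negative pairs in which the positive sample gets the higher score:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and predicted probabilities (hypothetical)
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.5, 0.05])

# Area under the ROC curve via the trapezoidal rule
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area = auc(fpr, tpr)

# Pairwise interpretation: share of (positive, negative) pairs ranked correctly
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

print(area, roc_auc_score(y_true, y_score), pairwise)
```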

# 4. Model selection

In [None]:
# KNN
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(x_train, y_train)
knn_pred = knn.predict_proba(x_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, knn_pred)
knn_roc_auc = auc(fpr,tpr)
knn_cv_roc_auc = cross_val_score(knn, x_train, y_train, scoring='roc_auc', cv=10).mean()
print ('KNeighborsClassifier AUC Score :', knn_roc_auc)
print('KNeighborsClassifier CV AUC Score :', knn_cv_roc_auc)
plot_roc_curve(fpr,tpr)

In [None]:
# Random forest
rfc = RandomForestClassifier(random_state=42)
rfc.fit(x_train,y_train)
rfc_pred = rfc.predict_proba(x_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, rfc_pred)
rfc_roc_auc = auc(fpr,tpr)
rfc_cv_roc_auc = cross_val_score(rfc, x_train, y_train, scoring='roc_auc', cv=10).mean()
plot_roc_curve(fpr,tpr)
print ('RandomForestClassifier AUC Score :', rfc_roc_auc)
print('RandomForestClassifier CV AUC Score :', rfc_cv_roc_auc)

In [None]:
xgbc = XGBClassifier(max_depth=5,eval_metric='auc',objective='binary:logistic')
xgbc.fit(x_train, y_train)
# make predictions for test data
xgbc_pred = xgbc.predict_proba(x_test)[:,1]
# evaluate predictions
fpr, tpr, _ = roc_curve(y_test, xgbc_pred)
xgbc_roc_auc = auc(fpr,tpr)
xgbc_cv_roc_auc = cross_val_score(xgbc, x_train, y_train, scoring='roc_auc', cv=10).mean()
plot_roc_curve(fpr,tpr)
print ('XGBClassifier AUC Score :', xgbc_roc_auc)
print('XGBClassifier CV AUC Score :', xgbc_cv_roc_auc)

In [None]:
gbc = GradientBoostingClassifier()
gbc.fit(x_train,y_train)
gbc_proba = gbc.predict_proba(x_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, gbc_proba)
gbc_roc_auc = auc(fpr,tpr)
gbc_cv_roc_auc = cross_val_score(gbc, x_train, y_train, scoring='roc_auc', cv=10).mean()
print ('GradientBoostingClassifier AUC Score :', gbc_roc_auc)
print('GradientBoostingClassifier CV AUC Score :', gbc_cv_roc_auc)
plot_roc_curve(fpr,tpr)

In [None]:
lgbmc = LGBMClassifier()
lgbmc.fit(x_train,y_train)
lgbmc_proba = lgbmc.predict_proba(x_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, lgbmc_proba)
lgbmc_roc_auc = auc(fpr,tpr)
lgbmc_cv_roc_auc = cross_val_score(lgbmc, x_train, y_train, scoring='roc_auc', cv=10).mean()
print ('LGBMClassifier AUC Score :', lgbmc_roc_auc)
print('LGBMClassifier CV AUC Score :', lgbmc_cv_roc_auc)
plot_roc_curve(fpr,tpr)

In [None]:
# Compare the AUC of the models
models = pd.DataFrame({'Model': ['KNN', 'RandomForest', 'XGBoost', 'GradientBoosting', 'LightGBM'], 
                        'AUC' : [knn_roc_auc, rfc_roc_auc, xgbc_roc_auc, gbc_roc_auc, lgbmc_roc_auc], 
                        'CV AUC' : [knn_cv_roc_auc, rfc_cv_roc_auc, xgbc_cv_roc_auc, gbc_cv_roc_auc, lgbmc_cv_roc_auc]})
models.sort_values(by='AUC', ascending=True)

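* Comparing the AUC on the hold-out split with the 10-fold cross-validated AUC shows that LightGBM and GradientBoosting fit the data better than the other classifiers, and LightGBM trains much faster than GradientBoosting. KNN performs particularly poorly, likely because the dataset is imbalanced and the minority class is hard to classify accurately.
* Weighing performance and efficiency, we choose LightGBM for parameter tuning and use it as the final prediction model.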
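The five model cells above repeat the same fit, score and cross-validate steps. One way to keep such a comparison consistent is a single helper applied to every candidate. This is only a sketch: the function name `evaluate`, the synthetic data from `make_classification`, and the two stand-in models are assumptions, not the notebook's setup.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

def evaluate(name, model, x_tr, x_te, y_tr, y_te, cv=3):
    """Return hold-out AUC and mean cross-validated AUC for one model."""
    cv_auc = cross_val_score(model, x_tr, y_tr, scoring='roc_auc', cv=cv).mean()
    model.fit(x_tr, y_tr)
    holdout_auc = roc_auc_score(y_te, model.predict_proba(x_te)[:, 1])
    return {'Model': name, 'AUC': holdout_auc, 'CV AUC': cv_auc}

# Synthetic, imbalanced stand-in data (not the competition data)
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

candidates = {
    'KNN': KNeighborsClassifier(n_neighbors=9),
    'LogisticRegression': LogisticRegression(max_iter=1000),
}
results = pd.DataFrame(
    [evaluate(name, model, x_tr, x_te, y_tr, y_te)
     for name, model in candidates.items()]
)
print(results.sort_values(by='AUC', ascending=False))
```

Because the model name is passed in once, the printed label cannot drift away from the model being scored.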

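# 5. Tuning, modeling, prediction, and solving the problem

#### Find the best parameters by combining 10-fold cross-validation with grid search
The main parameters tuned are: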
* max_depth
* num_leaves
* min_child_samples
* min_child_weight
* bagging_fraction
* bagging_freq
* lambda_l1
* lambda_l2
* cat_smooth

In [None]:
'''
# First tune max_depth and num_leaves; together they largely determine tree size and complexity
parameters1 = {
    'max_depth': [4,6,8],
    'num_leaves': [10,20,30,40],
}
gsearch1 = GridSearchCV(LGBMClassifier(), param_grid=parameters1, scoring='roc_auc', cv=10)
gsearch1.fit(x_train, y_train)
print('Best parameter values: {0}'.format(gsearch1.best_params_))
print('Best model score: {0}'.format(gsearch1.best_score_))
print(gsearch1.cv_results_['mean_test_score'].mean())
'''

In [None]:
'''
# Tune min_child_samples and min_child_weight (aliases of min_data_in_leaf and min_sum_hessian_in_leaf);
# this step mainly guards against overfitting
parameters2 = {
    'min_child_samples': [18,19,20,21,22], 
    'min_child_weight': [0.001,0.002]
}
gsearch = GridSearchCV(LGBMClassifier(max_depth=6, num_leaves=20), param_grid=parameters2, scoring='roc_auc', cv=10)
gsearch.fit(x_train, y_train)
print('Best parameter values: {0}'.format(gsearch.best_params_))
print('Best model score: {0}'.format(gsearch.best_score_))
print(gsearch.cv_results_['mean_test_score'].mean())
'''

In [None]:
'''
# Tune bagging_fraction and bagging_freq.
# bagging_fraction corresponds to subsample (row sampling); it speeds up training and can also reduce overfitting.
# bagging_freq defaults to 0, which disables bagging; k means bagging is performed every k iterations.
parameters3 = {
     'bagging_fraction': [0.8,0.9,1],
     'bagging_freq': [2,3,4],
}
gsearch = GridSearchCV(LGBMClassifier(max_depth=6, num_leaves=20, min_child_samples=20, min_child_weight=0.001), param_grid=parameters3, scoring='roc_auc', cv=10)
gsearch.fit(x_train, y_train)
print('Best parameter values: {0}'.format(gsearch.best_params_))
print('Best model score: {0}'.format(gsearch.best_score_))
print(gsearch.cv_results_['mean_test_score'].mean())
'''

In [None]:
'''
# Reduce overfitting with L1 and L2 regularization.
parameters4 = {
    'lambda_l1': [0, 0.1, 0.4, 0.5, 0.6],
    'lambda_l2': [0, 10, 15, 35, 40],
}
gsearch = GridSearchCV(LGBMClassifier(max_depth=6, num_leaves=20, min_child_samples=20, min_child_weight=0.001, bagging_fraction=0.9, bagging_freq=3), param_grid=parameters4, scoring='roc_auc', cv=10)
gsearch.fit(x_train, y_train)
print('Best parameter values: {0}'.format(gsearch.best_params_))
print('Best model score: {0}'.format(gsearch.best_score_))
print(gsearch.cv_results_['mean_test_score'].mean())
'''

In [None]:
'''
# cat_smooth reduces the effect of noise in categorical features, especially for categories with few samples.
parameters5 = {
     'cat_smooth': [0,10,20],
}
gsearch = GridSearchCV(LGBMClassifier(max_depth=6, num_leaves=20, min_child_samples=20, min_child_weight=0.001, bagging_fraction=0.9, bagging_freq=3, lambda_l1=0.6, lambda_l2=10), param_grid=parameters5, scoring='roc_auc', cv=10)
gsearch.fit(x_train, y_train)
print('Best parameter values: {0}'.format(gsearch.best_params_))
print('Best model score: {0}'.format(gsearch.best_score_))
print(gsearch.cv_results_['mean_test_score'].mean())
'''
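The cells above tune one parameter group at a time and carry the winners into the next search. With the grids shown, that evaluates 12 + 10 + 9 + 25 + 3 = 59 candidate settings instead of the 81,000 a full joint grid would need, at the cost of possibly missing interactions between groups. A generic sketch of the pattern follows; the helper `staged_search` is an assumption, and scikit-learn's `GradientBoostingClassifier` on synthetic data stands in for LightGBM so the example stays small:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

def staged_search(estimator, stages, X, y, cv=3):
    """Tune one parameter group at a time, carrying earlier winners forward."""
    fixed = {}
    score = None
    for grid in stages:
        # Parameters chosen in earlier stages are fixed on the estimator
        search = GridSearchCV(estimator.set_params(**fixed), param_grid=grid,
                              scoring='roc_auc', cv=cv)
        search.fit(X, y)
        fixed.update(search.best_params_)
        score = search.best_score_
    return fixed, score

# Synthetic stand-in data (not the competition data)
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.85, 0.15], random_state=0)

stages = [
    {'max_depth': [2, 3]},             # stage 1: tree size
    {'min_samples_leaf': [1, 5, 10]},  # stage 2: leaf-size constraint
]
base = GradientBoostingClassifier(n_estimators=20, random_state=0)
best_params, best_score = staged_search(base, stages, X, y)
print(best_params, round(best_score, 4))
```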

We plug the best parameter values found above into the model and train it.

In [None]:
final_model = LGBMClassifier(max_depth=6, num_leaves=20, min_child_samples=20, min_child_weight=0.001, bagging_fraction=0.9, bagging_freq=3, lambda_l1=0.6, lambda_l2=10, cat_smooth=10)
final_model.fit(x_train, y_train)
final_predict_y = final_model.predict_proba(x_test)[:,1]

fpr, tpr, _ = roc_curve(y_test, final_predict_y)
final_model_roc_auc = auc(fpr,tpr)
final_model_cv_roc_auc = cross_val_score(final_model, x_train, y_train, scoring='roc_auc', cv=10).mean()
print ('After GridSearchCV LGBMClassifier AUC Score :', final_model_roc_auc)
print ('After GridSearchCV LGBMClassifier CV AUC Score :', final_model_cv_roc_auc)
plot_roc_curve(fpr,tpr)

Compared with the untuned model, the tuned LightGBM improves AUC by about 0.24% and CV AUC by about 0.26%.
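An improvement quoted in percent can mean either an absolute change in percentage points or a change relative to the baseline, and for AUC values near 0.86 the two differ noticeably. A small sketch that computes both; the scores are hypothetical placeholders, not the notebook's results:

```python
def auc_gain(baseline, tuned):
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute_pp = (tuned - baseline) * 100
    relative_pct = (tuned - baseline) / baseline * 100
    return absolute_pp, relative_pct

# Hypothetical scores, for illustration only
abs_pp, rel_pct = auc_gain(0.8600, 0.8625)
print(round(abs_pp, 2), round(rel_pct, 2))
```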

# 6. Submit the results

In [None]:
X_test = test_df.drop(['SeriousDlqin2yrs'],axis=1)
Y_test = final_model.predict_proba(X_test)[:,1]
submission = pd.DataFrame({'ID': np.arange(1, len(X_test)+1), 'Probability': Y_test})
submission.to_csv("submission.csv", index=False)
submission