In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## 1. Problem definition

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 
Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by **predicting the probability that somebody will experience financial distress in the next two years**.
The goal of this competition is to **build a model that borrowers can use to help make the best financial decisions**.
Historical data are provided on 250,000 borrowers and the prize pool is 5,000 (3,000 for first, 1,500 for second and 500 for third).

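The task is to build a model that uses a bank customer's personal information and past behavior to predict the probability of financial distress within two years, as a reference for the bank's decision on whether to grant a loan. This is a binary classification problem: the features are the customer's personal information and recent credit behavior, and the model must output the probability that the customer experiences financial distress in the next two years.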

## 2. Loading the data

In [None]:
import warnings
warnings.filterwarnings("ignore")
# Load the data
x_train = pd.read_csv('/kaggle/input/GiveMeSomeCredit/cs-training.csv')
x_test = pd.read_csv('/kaggle/input/GiveMeSomeCredit/cs-test.csv')
combine = [x_train, x_test]

In [None]:
x_train.dtypes

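The data has 11 columns (not counting the index column `Unnamed: 0`, which is dropped in preprocessing):
1. SeriousDlqin2yrs: the label, whether the user experiences financial distress within two years (0/1), int64
2. RevolvingUtilizationOfUnsecuredLines: balance on credit cards and personal lines of credit divided by the total credit limit, float64
3. age: age of the user, int64
4. NumberOfTime30-59DaysPastDueNotWorse: number of times the user was 30-59 days past due, int64
5. DebtRatio: monthly expenses divided by monthly income, float64
6. MonthlyIncome: monthly income, float64
7. NumberOfOpenCreditLinesAndLoans: number of open loans and credit lines, int64
8. NumberOfTimes90DaysLate: number of times the user was 90 days or more past due, int64
9. NumberRealEstateLoansOrLines: number of real estate loans or lines, int64
10. NumberOfTime60-89DaysPastDueNotWorse: number of times the user was 60-89 days past due, int64
11. NumberOfDependents: number of dependents, float64 (the column contains missing values)

* Apart from SeriousDlqin2yrs, which is categorical, all features are numeric.
* Judging by their names, NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate belong to the same family: each counts how often the user was past due by a given number of days, so they can be preprocessed in a similar way.
* DebtRatio and MonthlyIncome are directly related, because DebtRatio is computed from MonthlyIncome.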

## 3. Data exploration

In [None]:
[len(x_train), len(x_test)]

The training set has 150,000 rows and the test set has 101,503 rows.

### 3.1 Filling missing values

In [None]:
x_train.isnull().mean()

In [None]:
x_test.isnull().mean()

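In both the training and test sets, only two features, MonthlyIncome and NumberOfDependents, have missing values (apart from the label SeriousDlqin2yrs, which is empty in the test set).  
Because the share of missing values is large, these rows cannot simply be dropped.  
First, look at the distribution of the non-missing NumberOfDependents values.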

In [None]:
x_train[x_train['NumberOfDependents'].notnull()][['NumberOfDependents']].describe(percentiles=[.25,.60, .70, .80, .9, .99])

In [None]:
x_test[x_test['NumberOfDependents'].notnull()][['NumberOfDependents']].describe(percentiles=[.25,.60, .70, .80, .9, .99])

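From the tables above, 3,924 training rows and 2,626 test rows are missing NumberOfDependents. Of the non-missing values, 90% fall in the range 0-2. The maxima in the training and test sets are 20 and 43, respectively.  
Next, look at MonthlyIncome and analyze what the rows with missing MonthlyIncome have in common. Because MonthlyIncome and DebtRatio are directly related, DebtRatio and NumberOfDependents are examined together.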

In [None]:
x_train[x_train['MonthlyIncome'].isnull()][['DebtRatio','NumberOfDependents']].describe(percentiles=[.25,.60, .70, .80, .9, .99])

In [None]:
x_test[x_test['MonthlyIncome'].isnull()][['DebtRatio','NumberOfDependents']].describe(percentiles=[.25,.60, .70, .80, .9, .99])

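* The tables above show that MonthlyIncome is missing in 29,731 training rows and 20,103 test rows.  
* DebtRatio has obvious outliers, which need to be handled in later processing.  
* Looking closely, among the rows with missing MonthlyIncome, the number of rows that also lack NumberOfDependents is exactly 3,924 in the training set and 2,626 in the test set. This suggests that every user with a missing NumberOfDependents also has a missing MonthlyIncome, which is verified next.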

In [None]:
# Count rows where NumberOfDependents is missing but MonthlyIncome is present
[int((x_train['NumberOfDependents'].isnull() & x_train['MonthlyIncome'].notnull()).sum()),
 int((x_test['NumberOfDependents'].isnull() & x_test['MonthlyIncome'].notnull()).sum())]

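The check above confirms this: *every user with a missing NumberOfDependents also has a missing MonthlyIncome*. The earlier tables show that the reverse does not hold: a missing MonthlyIncome does not imply a missing NumberOfDependents. The missing NumberOfDependents values can therefore be filled with the median NumberOfDependents of the rows where MonthlyIncome is missing but NumberOfDependents is present. This median is computed next.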

In [None]:
# Median NumberOfDependents among rows with missing MonthlyIncome and known NumberOfDependents
x_train.loc[x_train['NumberOfDependents'].notnull() & x_train['MonthlyIncome'].isnull(), 'NumberOfDependents'].median()

Given this result, it is enough to fill missing NumberOfDependents values with 0 in preprocessing.
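
The check and the fill can be sketched together on a toy frame. The six rows below are made up for illustration and are not taken from the competition data; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'MonthlyIncome':      [5000.0, np.nan, np.nan, np.nan, 3000.0, np.nan],
    'NumberOfDependents': [2.0,    np.nan, 0.0,    1.0,    0.0,    0.0],
})

dep_missing = toy['NumberOfDependents'].isnull()
inc_missing = toy['MonthlyIncome'].isnull()

# Rows that would contradict "dependents missing implies income missing"
violations = int((dep_missing & ~inc_missing).sum())

# Median of the known dependents among rows whose income is missing
fill_value = toy.loc[inc_missing & ~dep_missing, 'NumberOfDependents'].median()

# Fill the gaps with that median
toy['NumberOfDependents'] = toy['NumberOfDependents'].fillna(fill_value)
```

Combining the two conditions with `&` into a single mask avoids chained boolean indexing, which pandas has to realign and warns about.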
  

Next, explore the relationship between MonthlyIncome and DebtRatio.

In [None]:
x_train[['DebtRatio']].describe(percentiles=[.25,.60, .70, .80, .9, .99])

In [None]:
x_train[x_train['MonthlyIncome'].isnull()][['DebtRatio']].describe(percentiles=[.10,.25,.60, .70, .80, .9, .99])

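The first table shows that most DebtRatio values lie between 0 and 1 and fewer than 30% exceed 1, which is plausible. The second table points the other way: most rows with missing MonthlyIncome have a very high debt ratio. Since the debt ratio is computed from MonthlyIncome, one hypothesis is that when MonthlyIncome is missing, the ratio was computed with a very small denominator, which produces a large value. To test this hypothesis, look at the rows whose MonthlyIncome is itself very small.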

In [None]:
x_train[x_train['MonthlyIncome'] < 5][['DebtRatio']].describe(percentiles=[.10,.25,.60, .70, .80, .9, .99])

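The table above shows the DebtRatio distribution for users with MonthlyIncome < 5. It is very close to the distribution for users with missing MonthlyIncome, which supports the hypothesis.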
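
To make the hypothesis concrete, here is a small worked example. The formula `debt / max(income, 1)` and the helper name `debt_ratio` are assumptions chosen only to illustrate the "tiny denominator" idea; the dataset does not document how DebtRatio was actually computed.

```python
import math

def debt_ratio(monthly_debt, monthly_income):
    """Hypothetical formula: a missing or near-zero income is replaced by a denominator of 1."""
    if monthly_income is None or math.isnan(monthly_income) or monthly_income < 1:
        monthly_income = 1.0
    return monthly_debt / monthly_income

typical = debt_ratio(2000.0, 5000.0)        # income known: a ratio below 1
missing = debt_ratio(2000.0, float('nan'))  # income missing: the ratio equals the raw debt amount
tiny = debt_ratio(2000.0, 1.0)              # income recorded as 1: same large value
```

Under this assumed formula, the same monthly debt gives a ratio thousands of times larger when the income is missing or recorded as 1, which matches the pattern seen in the tables.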
  
Next, look at the small values of MonthlyIncome.

In [None]:
x_train[x_train['MonthlyIncome'] < 5].groupby('MonthlyIncome').describe()

In [None]:
x_test[x_test['MonthlyIncome'] < 5].groupby('MonthlyIncome').describe()

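MonthlyIncome values of 1 and 0 are relatively common, and these rows share one trait: a very large DebtRatio. This marks them as a single group of users.
  
When preprocessing MonthlyIncome, missing values and 1 can both be converted to 0 to mark these rows as one group.  
For this group (MonthlyIncome missing, 0 or 1), DebtRatio is a large value produced by a very small denominator, so it can be divided by a common factor to bring it in line with the rest of the data. The factor can be the median DebtRatio of this group divided by the median DebtRatio of the other users. The cell below approximates it using the rows with missing MonthlyIncome versus the rows with a known MonthlyIncome.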

In [None]:
# Ratio of median DebtRatio: rows with missing income vs. rows with known income (floored)
income_missing = x_train['MonthlyIncome'].isnull()
x_train.loc[income_missing, 'DebtRatio'].median() // x_train.loc[~income_missing, 'DebtRatio'].median()

In preprocessing, DebtRatio is divided by 3915 for every row in the training and test sets whose MonthlyIncome is missing, 0 or 1.
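
A minimal sketch of this rescaling on a toy frame. The numbers are invented for illustration, so the resulting factor is 2000 rather than the 3915 obtained from the real training set.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'MonthlyIncome': [4000.0, 6000.0, 5000.0, np.nan, 0.0,    1.0],
    'DebtRatio':     [0.25,   0.75,   0.5,    800.0,  1000.2, 1500.0],
})

# Users whose income is missing, 0 or 1
low_income = toy['MonthlyIncome'].isnull() | toy['MonthlyIncome'].isin([0, 1])

# Ratio of the two medians, floored with //
factor = toy.loc[low_income, 'DebtRatio'].median() // toy.loc[~low_income, 'DebtRatio'].median()

# Divide only the low-income group's DebtRatio by the factor
toy.loc[low_income, 'DebtRatio'] = toy.loc[low_income, 'DebtRatio'] / factor
```

After the division, both groups live on a comparable scale, so a single outlier threshold can be applied to the whole column afterwards.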

### 3.2 Per-feature visual analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

#### SeriousDlqin2yrs

In [None]:
sns.countplot(x='SeriousDlqin2yrs', data=x_train)

The label distribution is roughly 14:1: most users will not experience financial distress in the next two years.
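
The ratio can also be computed instead of read off the plot. The sketch below uses made-up labels with exactly 14 negatives per positive; it also shows why plain accuracy is a poor metric at this imbalance.

```python
import pandas as pd

# Toy labels, invented for illustration: 14 negatives and 1 positive
labels = pd.Series([0] * 14 + [1], name='SeriousDlqin2yrs')

counts = labels.value_counts()
ratio = counts[0] / counts[1]           # negatives per positive
positive_rate = labels.mean()           # share of positives
baseline_accuracy = 1 - positive_rate   # accuracy of always predicting 0
```

A model that always predicts "no distress" would already be right about 93% of the time on such data, which is one reason to evaluate with ROC AUC rather than accuracy.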

#### RevolvingUtilizationOfUnsecuredLines

In [None]:
x_train[['RevolvingUtilizationOfUnsecuredLines']].describe(percentiles=[.10,.25,.60, .70, .80, .9, .99])

Most RevolvingUtilizationOfUnsecuredLines values are fairly evenly distributed, but there are outliers.

In [None]:
# Share of rows with utilization above 2, each relative to its own dataset size
[len(x_train[x_train['RevolvingUtilizationOfUnsecuredLines'] > 2]) / len(x_train),
 len(x_test[x_test['RevolvingUtilizationOfUnsecuredLines'] > 2]) / len(x_test)]

The test set also contains rows with large RevolvingUtilizationOfUnsecuredLines values, and only 0.1% - 0.2% of the rows have a value above 2, so every value above 2 can be capped at 2.
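
Capping of this kind can be sketched with `Series.clip`. The values below are invented for illustration.

```python
import pandas as pd

# Toy values, invented for illustration
util = pd.Series([0.05, 0.8, 1.3, 2.0, 57.0, 1500.0])

share_above = (util > 2).mean()   # fraction of values strictly above the cap
capped = util.clip(upper=2)       # values above 2 become 2; the rest are unchanged
```

Capping keeps the rows, which matters here because the test set has the same outliers and its rows cannot be dropped.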
  
Next, explore the relationship between RevolvingUtilizationOfUnsecuredLines and the label.

In [None]:
g = sns.FacetGrid(x_train, col='SeriousDlqin2yrs')
g.map(plt.hist, 'RevolvingUtilizationOfUnsecuredLines', bins=20)

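The distribution of RevolvingUtilizationOfUnsecuredLines under each label roughly follows the overall 14:1 ratio, so this feature does not appear to be decisive for the label on its own. For later features, this observation is not repeated when the feature is likewise not decisive.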
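
Per-label histograms mostly reflect the class ratio, so one alternative way to check such a claim is to compare the default rate across bins of the feature. The sketch below uses made-up rows and two arbitrary bins; the numbers say nothing about the real data.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'RevolvingUtilizationOfUnsecuredLines': [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.9, 1.0],
    'SeriousDlqin2yrs':                     [0,   0,   0,   1,   0,   1,   1,   1],
})

# Two equal-width bins over [0, 1]
bins = pd.cut(toy['RevolvingUtilizationOfUnsecuredLines'], bins=[0.0, 0.5, 1.0], include_lowest=True)

# The mean of a 0/1 label within a bin is that bin's default rate
rate_by_bin = toy.groupby(bins, observed=True)['SeriousDlqin2yrs'].mean()
```

If the rate is similar in every bin, the feature carries little signal by itself; a clear trend across bins suggests the opposite.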

#### age

In [None]:
x_train[['age']].describe(percentiles=[.10,.25,.60, .70, .80, .9, .99])

In [None]:
plt.figure(figsize=[10, 8])
plt.subplot(221)
sns.boxplot(data=x_train['age'])
plt.ylabel('age') 
plt.subplot(222)
sns.histplot(x_train['age'])
plt.xlabel('age')

In [None]:
g = sns.FacetGrid(x_train, col='SeriousDlqin2yrs')
g.map(plt.hist, 'age', bins=20)

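Age is smoothly distributed and roughly bell-shaped. Because users who are too young cannot hold a bank account, rows with implausibly low ages are removed from the training set in preprocessing (the filter in Section 4 keeps only `age > 20`).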

#### NumberOfTime30-59DaysPastDueNotWorse / NumberOfTime60-89DaysPastDueNotWorse / NumberOfTimes90DaysLate

In [None]:
x_train[['NumberOfTime30-59DaysPastDueNotWorse','NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfTimes90DaysLate']].describe(percentiles=[.10,.25,.60, .70, .80, .9, .99])

In [None]:
x_test[['NumberOfTime30-59DaysPastDueNotWorse','NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfTimes90DaysLate']].describe(percentiles=[.10,.25,.60, .70, .80, .9, .99])

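In both the training and test sets, 99% of the values of these three features are below 5, yet the maximum of each is 98. This raises the question of whether some rows have 98 in all three features.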

In [None]:
[len(x_train[x_train['NumberOfTime30-59DaysPastDueNotWorse'] == 98]), 
 len(x_train[x_train['NumberOfTime60-89DaysPastDueNotWorse'] == 98]),
 len(x_train[x_train['NumberOfTimes90DaysLate'] == 98]),
 len(x_train[(x_train['NumberOfTime30-59DaysPastDueNotWorse'] == 98) & (x_train['NumberOfTimes90DaysLate'] == 98) & (x_train['NumberOfTime60-89DaysPastDueNotWorse'] == 98)])]

In [None]:
[len(x_test[x_test['NumberOfTime30-59DaysPastDueNotWorse'] == 98]), 
 len(x_test[x_test['NumberOfTime60-89DaysPastDueNotWorse'] == 98]),
 len(x_test[x_test['NumberOfTimes90DaysLate'] == 98]),
 len(x_test[(x_test['NumberOfTime30-59DaysPastDueNotWorse'] == 98) & (x_test['NumberOfTimes90DaysLate'] == 98) & (x_test['NumberOfTime60-89DaysPastDueNotWorse'] == 98)])]

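The number of rows equal to 98 is the same for all three features, and the number of rows where all three equal 98 at once is that same number, so these are the same rows carrying 98 in all three features. The test set behaves the same way.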

In [None]:
x_train.groupby('NumberOfTime30-59DaysPastDueNotWorse')[['NumberOfTime30-59DaysPastDueNotWorse']].describe()

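NumberOfTime30-59DaysPastDueNotWorse jumps straight from 13 to 96. Together with the finding above, 96 and 98 probably carry a special meaning, so these rows should not simply be dropped in preprocessing. Instead, 96 and 98 can be mapped to 20, a value above the largest ordinary count of 13, to make the data smoother.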
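
A minimal sketch of this mapping with `Series.mask`, on invented counts where 96 and 98 play the role of the special codes:

```python
import pandas as pd

# Toy counts, invented for illustration
late = pd.Series([0, 1, 0, 13, 96, 98, 2])

# Replace anything above 95 with 20 and leave the ordinary counts untouched
cleaned = late.mask(late > 95, 20)
```

Assigning the result back to the column, rather than calling `mask(..., inplace=True)` on a column selection, keeps the update reliable across pandas versions.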

#### DebtRatio

Section 3.1 already worked out a rescaling for DebtRatio. The code for it is applied here first.

In [None]:
for dataset in combine:
    # Rows whose MonthlyIncome is missing, 0 or 1 (run this cell only once, or the division is applied twice)
    low_income = dataset['MonthlyIncome'].isnull() | dataset['MonthlyIncome'].isin([0, 1])
    dataset.loc[low_income, 'DebtRatio'] = dataset.loc[low_income, 'DebtRatio'] / 3915

In [None]:
x_train[['DebtRatio']].describe(percentiles=[.10,.25,.60, .70, .80, .9, .99])

In [None]:
plt.figure(figsize=[10, 8])
plt.subplot(221)
sns.boxplot(data=x_train['DebtRatio'])
plt.ylabel('DebtRatio') 

In [None]:
g = sns.FacetGrid(x_train, col='SeriousDlqin2yrs')
g.map(plt.hist, 'DebtRatio', bins=20)

DebtRatio still has outliers; values above 400 can be capped at 400.

#### MonthlyIncome

In [None]:
plt.figure(figsize=[10, 8])
plt.subplot(221)
sns.boxplot(data=x_train['MonthlyIncome'])
plt.ylabel('MonthlyIncome') 
plt.subplot(222)
sns.histplot(x_train['MonthlyIncome'])
plt.xlabel('MonthlyIncome')

MonthlyIncome also has outliers; rows with a value above 1e6 can be removed.

In [None]:
g = sns.FacetGrid(x_train, col='SeriousDlqin2yrs')
g.map(plt.hist, 'MonthlyIncome', bins=20)

#### NumberOfOpenCreditLinesAndLoans

In [None]:
plt.figure(figsize=[10, 8])
plt.subplot(221)
sns.boxplot(data=x_train['NumberOfOpenCreditLinesAndLoans'])
plt.ylabel('NumberOfOpenCreditLinesAndLoans') 
plt.subplot(222)
sns.histplot(x_train['NumberOfOpenCreditLinesAndLoans'])
plt.xlabel('NumberOfOpenCreditLinesAndLoans')

In [None]:
g = sns.FacetGrid(x_train, col='SeriousDlqin2yrs')
g.map(plt.hist, 'NumberOfOpenCreditLinesAndLoans', bins=20)

NumberOfOpenCreditLinesAndLoans is smoothly distributed and needs no further processing.

#### NumberRealEstateLoansOrLines

In [None]:
plt.figure(figsize=[10, 8])
plt.subplot(221)
sns.boxplot(data=x_train['NumberRealEstateLoansOrLines'])
plt.ylabel('NumberRealEstateLoansOrLines') 
plt.subplot(222)
sns.histplot(x_train['NumberRealEstateLoansOrLines'])
plt.xlabel('NumberRealEstateLoansOrLines')

NumberRealEstateLoansOrLines has a few outliers; rows with a value above 50 can be removed.

In [None]:
g = sns.FacetGrid(x_train, col='SeriousDlqin2yrs')
g.map(plt.hist, 'NumberRealEstateLoansOrLines', bins=20)

#### NumberOfDependents

This feature was already analyzed in Section 3.1.

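### 3.3 Adding a new feature

NumberOfTime60-89DaysPastDueNotWorse, NumberOfTime30-59DaysPastDueNotWorse and NumberOfTimes90DaysLate together describe how often a user was past due, so they can be summed into a new feature: the total number of times the user was 30 or more days past due within two years. Preprocessing adds the feature '90_sum' as the sum of these three features.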
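
The sum can be sketched on a toy frame. The three rows are invented; the last one mimics a row whose special codes were already mapped to 20.

```python
import pandas as pd

late_cols = [
    'NumberOfTime30-59DaysPastDueNotWorse',
    'NumberOfTime60-89DaysPastDueNotWorse',
    'NumberOfTimes90DaysLate',
]

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'NumberOfTime30-59DaysPastDueNotWorse': [0, 2, 20],
    'NumberOfTime60-89DaysPastDueNotWorse': [0, 1, 20],
    'NumberOfTimes90DaysLate':              [0, 0, 20],
})

# Row-wise total across the three late-payment counters
toy['90_sum'] = toy[late_cols].sum(axis=1)
```

Note that a row carrying the special code in all three columns ends up with a total of 60, so the new feature still separates those rows clearly from ordinary ones.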

## 4. Data preprocessing

In [None]:
# Data preprocessing
late_cols = ['NumberOfTime30-59DaysPastDueNotWorse',
             'NumberOfTime60-89DaysPastDueNotWorse',
             'NumberOfTimes90DaysLate']
for dataset in combine:
    dataset.drop(columns=['Unnamed: 0'], inplace=True)
    # Cap utilization at 2
    dataset['RevolvingUtilizationOfUnsecuredLines'] = dataset['RevolvingUtilizationOfUnsecuredLines'].clip(upper=2)
    # Map the special codes 96 and 98 to 20
    for col in late_cols:
        dataset[col] = dataset[col].mask(dataset[col] > 95, 20)
    # New feature: total number of late payments
    dataset['90_sum'] = dataset[late_cols].sum(axis=1)
    # The DebtRatio rescaling (division by 3915) was already applied in Section 3.2 and is not repeated here
    dataset['DebtRatio'] = dataset['DebtRatio'].clip(upper=400)
    # Treat missing and 1 as 0 income; fill missing dependents with 0
    dataset['MonthlyIncome'] = dataset['MonthlyIncome'].fillna(0).replace(1, 0)
    dataset['NumberOfDependents'] = dataset['NumberOfDependents'].fillna(0)
# Row filters apply to the training set only
x_train = x_train[x_train['age'] > 20]
x_train = x_train[x_train['MonthlyIncome'] <= 1e6]
x_train = x_train[x_train['NumberRealEstateLoansOrLines'] <= 50]

The data is preprocessed according to the conclusions of the previous section.

In [None]:
x_train.describe(percentiles=[.25,.60, .70, .80, .9, .99])

In [None]:
x_test.describe(percentiles=[.25,.60, .70, .80, .9, .99])

After preprocessing, every feature is reasonably even and smooth. The next step is model selection.

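## 5. Model selection

Because this is a binary classification problem with an imbalanced label, models are evaluated with the ROC curve and AUC, using k-fold cross-validation with k = 10. The candidates are several boosting methods that come up often in the competition discussions, compared against a bagging algorithm and a classical binary classifier, the support vector machine (`LogisticRegression` is imported but not evaluated in the cell below). The comparison follows.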
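
The evaluation scheme can be sketched in isolation. The example below is an assumption-laden illustration: it uses synthetic data from `make_classification` (not the competition data) and a plain `LogisticRegression` as a stand-in model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem (roughly 7% positives); not the competition data
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.93], random_state=0)

# Stratified folds keep the class ratio approximately equal in every fold
cv = StratifiedKFold(n_splits=10)

# One ROC AUC value per fold
auc_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, scoring='roc_auc', cv=cv)
mean_auc = auc_scores.mean()
```

Stratification matters with a rare positive class: without it, some folds could contain very few positives, which makes the per-fold AUC noisy.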

In [None]:
# Split the data: 80% is used for cross-validation, 20% is held out for evaluation
from sklearn.model_selection import train_test_split
result = x_train['SeriousDlqin2yrs' ]
data = x_train.drop(['SeriousDlqin2yrs'], axis = 1)
train_data, test_data, train_result, test_result = train_test_split(data, result, test_size = 0.2, stratify = result, random_state=33)

In [None]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Candidate models, all with default hyperparameters
candidates = {
    "AdaBoost": AdaBoostClassifier(),
    "LGBM": LGBMClassifier(),
    "XGBoost": XGBClassifier(eval_metric='auc'),
    "GBDT": GradientBoostingClassifier(),
    "RandomForest": RandomForestClassifier(),
    "SVM": svm.SVC(),
}
models = []
scores = []
for name, model in candidates.items():
    print(name + ":")
    # Mean ROC AUC over 10 stratified folds
    cv_scores = cross_val_score(model, train_data, train_result, scoring='roc_auc', cv=StratifiedKFold(n_splits=10))
    print('mean score:', cv_scores.mean())
    models.append(name)
    scores.append(cv_scores.mean())


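Preliminary results from cross-validation:  
* Because the label is imbalanced, the SVM performs poorly.  
* Boosting algorithms outperform the bagging algorithm overall; LGBM and GBDT give the best results.  
* LGBM is lightweight and copes with large datasets better than GBDT, so LGBM is the better fit for this scenario.  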

## 6. Tuning, modeling and solving

### 6.1 Default parameters

In [None]:
model = LGBMClassifier()
model.fit(train_data, train_result)
predict_proba = model.predict_proba(test_data)[:,1]
from sklearn.metrics import roc_auc_score
roc_auc_score(test_result, predict_proba)

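### 6.2 Hyperparameter tuning

To keep the later tuning steps stable, the number of boosting weak learners `n_estimators` and the learning rate `learning_rate` are tuned first, both at once, using grid search with scikit-learn's GridSearchCV.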
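
A runnable sketch of this kind of joint grid search. To stay self-contained it swaps in scikit-learn's `GradientBoostingClassifier` for LGBM and uses small synthetic data with a tiny grid; the grid values are illustrative assumptions, not the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small synthetic problem; not the competition data
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Both parameters are searched jointly because they interact:
# a lower learning rate usually needs more estimators
param_grid = {
    'n_estimators': [20, 40],
    'learning_rate': [0.05, 0.1],
}
search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid=param_grid,
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=3),
)
search.fit(X, y)

best_params = search.best_params_   # best combination found on the grid
best_auc = search.best_score_       # its mean cross-validated ROC AUC
```

The cost grows with the product of the grid sizes times the number of folds, which is why the tuning here proceeds in stages instead of searching all parameters at once.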

In [None]:
# from sklearn.model_selection import GridSearchCV
# param_grid = {
#     'n_estimators':range(100,400,50),
#     'learning_rate': [0.025,0.05,0.075]
# }
# estimator = LGBMClassifier()
# gridsearch = GridSearchCV(param_grid = param_grid, estimator = estimator, scoring='roc_auc',cv=StratifiedKFold(n_splits=10), verbose=0)
# gridsearch.fit(train_data,train_result)
# gridsearch.best_params_

Next, tune the maximum tree depth `max_depth` and the number of leaves `num_leaves` of the LGBM model.

In [None]:
# param_grid = {
#     'max_depth':range(6,10,1),
#     'num_leaves': range(15,25,1)
# }
# estimator = LGBMClassifier(n_estimators = 290, learning_rate = 0.025)
# gridsearch = GridSearchCV(param_grid = param_grid,estimator = estimator, scoring='roc_auc',
#                           cv=StratifiedKFold(n_splits=10), verbose=0)
# gridsearch.fit(train_data,train_result)
# gridsearch.best_params_

In [None]:
# param_grid = {
#     'max_depth':range(10,12,1),
#     'num_leaves': range(25,35,1)
# }
# estimator = LGBMClassifier(n_estimators = 290, learning_rate = 0.025)
# gridsearch = GridSearchCV(param_grid = param_grid,estimator = estimator, scoring='roc_auc',
#                           cv=StratifiedKFold(n_splits=10), verbose=0)
# gridsearch.fit(train_data,train_result)
# gridsearch.best_params_

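Finally, to reduce overfitting, tune feature_fraction and bagging_fraction. Note that in LightGBM, bagging_fraction only takes effect when bagging_freq is set to a non-zero value.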

In [None]:
# param_grid = {
#     'feature_fraction': [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9],
#     'bagging_fraction' : [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
# }
# estimator = LGBMClassifier(n_estimators = 290, learning_rate = 0.025,max_depth = 6, num_leaves = 22)
# gridsearch = GridSearchCV(param_grid = param_grid,estimator = estimator, scoring='roc_auc',
#                           cv=StratifiedKFold(n_splits=10), verbose=0)
# gridsearch.fit(train_data,train_result)
# gridsearch.best_params_

### 6.3 Modeling and solving

In [None]:
model = LGBMClassifier(n_estimators = 290, learning_rate = 0.025,max_depth = 6, num_leaves = 22)
model.fit(train_data, train_result)
predict_proba = model.predict_proba(test_data)[:,1]
roc_auc_score(test_result, predict_proba)

*The tuned model's AUC is 0.23% higher than with the default parameters. The ROC curve is plotted below.*

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(test_result, predict_proba)
plt.plot(fpr, tpr)
plt.plot(fpr, fpr, linestyle = '--')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC')
plt.show()
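
The AUC reported earlier is the area under this curve. As a small check on invented labels and scores (not model output), the area computed from the `fpr`/`tpr` arrays agrees with `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and scores, invented for illustration
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
area_from_curve = auc(fpr, tpr)                # trapezoidal area under the (fpr, tpr) points
area_direct = roc_auc_score(y_true, y_score)   # the same quantity computed directly
```

Equivalently, AUC is the share of (positive, negative) pairs in which the positive example gets the higher score; here 15 of the 16 pairs are ordered correctly.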

## 7. Submission

In [None]:
x_test = x_test.drop(['SeriousDlqin2yrs'],axis=1)
y_test = model.predict_proba(x_test)[:,1]
ids = np.arange(1, len(x_test) + 1)  # 1-based row ids, one per test row
res = pd.DataFrame({'Id': ids, 'Probability': y_test})
res.to_csv("submission.csv", index=False)

In [None]:
res