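# 1. Objective
Improve on the state of the art in credit scoring by predicting the probability that a person will experience financial distress within the next two years. Using customer information, build a default model and a scorecard, evaluate candidates with AUC and the KS statistic, and select the model that performs best.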

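# 2. Background
Banks play a crucial role in market economies. They decide who can get finance and on what terms, and they can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. Credit scoring algorithms, which estimate the probability of default, are the method banks use to decide whether a loan should be granted. This competition asks participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years. The goal is to build a model that borrowers can use to help make the best financial decisions.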

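An application scorecard consists of a set of characteristics, each corresponding to one question on the application form (for example age, bank transaction history, or income). Every characteristic has a set of possible attributes, which correspond to the possible answers to that question (for age, the answers might be "under 30", "30 to 45", and so on). When developing a scorecard model, the relationship between each attribute and the applicant's future credit performance is established first, and each attribute is then assigned a score weight that reflects this relationship. The larger the weight, the better the credit performance that attribute indicates. An application's score is the simple sum of its attribute scores. If the score is greater than or equal to the cutoff set by the lender, the application is considered an acceptable risk and is approved; applicants below the cutoff are declined or flagged for further review.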
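The additive scoring rule described above can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the two characteristics, their bins, the points, and the cutoff are invented and do not come from the notebook or from any real scorecard.

```python
# Hypothetical scorecard: each characteristic maps value ranges to points.
# Bins are half-open ranges [low, high); all numbers are invented.
SCORECARD = {
    "age": [(0, 30, 10), (30, 45, 25), (45, 200, 40)],
    "monthly_income": [(0, 3000, 5), (3000, 8000, 20), (8000, float("inf"), 35)],
}
CUTOFF = 50  # hypothetical cutoff score chosen by the lender


def attribute_points(bins, value):
    """Return the points of the bin whose range [low, high) contains value."""
    for low, high, points in bins:
        if low <= value < high:
            return points
    raise ValueError(f"value {value!r} falls outside every bin")


def score_application(application):
    """The total score is the plain sum of the points for each characteristic."""
    return sum(
        attribute_points(bins, application[name])
        for name, bins in SCORECARD.items()
    )


def decide(application):
    """Approve at or above the cutoff; otherwise decline or refer for review."""
    if score_application(application) >= CUTOFF:
        return "approve"
    return "decline_or_review"


applicant = {"age": 38, "monthly_income": 9000}
print(score_application(applicant), decide(applicant))  # 60 approve
```

In a real scorecard the points would be derived from a fitted model rather than written by hand; the sketch only shows the sum-and-compare mechanics.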

# 3. Evaluation
AUC and the KS (Kolmogorov-Smirnov) statistic.
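As a rough illustration of the two metrics, the sketch below computes both from the same ROC points in plain Python: AUC as the trapezoidal area under the curve, and KS as the largest gap between the true positive rate and the false positive rate. The helper names and the four-sample toy data are assumptions for the example; the notebook itself relies on scikit-learn's `roc_curve` and `roc_auc_score`.

```python
def roc_points(y_true, scores):
    """Sweep thresholds from the highest score down and collect (FPR, TPR) points."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )


def ks_statistic(fpr, tpr):
    """KS is the maximum vertical distance between the TPR and FPR curves."""
    return max(t - f for f, t in zip(fpr, tpr))


y_true = [0, 0, 1, 1]            # toy labels, 1 = bad customer
scores = [0.1, 0.4, 0.35, 0.8]   # toy predicted probabilities of class 1
fpr, tpr = roc_points(y_true, scores)
print(auc_trapezoid(fpr, tpr), ks_statistic(fpr, tpr))  # 0.75 0.5
```

With scikit-learn the same KS value could be obtained as `max(tpr - fpr)` on the arrays returned by `roc_curve`.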

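# 4. Project Workflow
1. Exploratory analysis: data structure, variable meanings, variable quantiles, and so on;
2. Data cleaning: handle duplicates, missing values, and outliers;
3. Data analysis: build logistic regression, random forest, and gradient boosting models with scikit-learn;
4. Data output: write the output file.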

## 4.1 Exploratory Data Analysis

**4.1.1 Load the required Python libraries and import the data**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re as re
from pandas import Series, DataFrame
import scipy
from scipy.stats import chi2
from sklearn.ensemble import RandomForestRegressor
import seaborn as sns
from scipy import stats
import copy
import matplotlib.pyplot as plt
%matplotlib inline
# Use a font that can render Chinese characters in plots
plt.rcParams['font.sans-serif']='SimHei'
plt.rcParams['axes.unicode_minus']=False

# List the input file paths
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Load the data
train_data = pd.read_csv('/kaggle/input/give-me-some-credit-dataset/cs-training.csv')
test_data=pd.read_csv('/kaggle/input/give-me-some-credit-dataset/cs-test.csv')

**4.1.2 Inspect dataset information and descriptive statistics**

In [None]:
# Inspect column dtypes and non-null counts (info() prints its own report and returns None)
train_data.info()
train_data.head(5)

In [None]:
# Same inspection for the test set
test_data.info()
test_data.head(5)

In [None]:
# Inspect quantiles and other summary statistics for each variable
train_data.describe([0.01, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99]).T

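**Dataset description:**

| No. | Variable | Description | Type |
| --------- | ------------- | ------------- |------------- |
| 1 | SeriousDlqin2yrs | 90 days past due or worse (separates good customers from bad customers) | Y/N |
| 2 | RevolvingUtilizationOfUnsecuredLines | Revolving utilization of unsecured lines (credit card balances, excluding mortgages and car loans, divided by total credit limits) | percentage |
| 3 | age | Age of the borrower | integer |
| 4 | NumberOfTime30-59DaysPastDueNotWorse | Number of times 30-59 days past due but no worse | integer |
| 5 | DebtRatio | Debt ratio | percentage |
| 6 | MonthlyIncome | Monthly income | real |
| 7 | NumberOfOpenCreditLinesAndLoans | Number of open lines of credit (such as credit cards) and loans (installment loans such as car loans or mortgages) | integer |
| 8 | NumberOfTimes90DaysLate | Number of times the borrower was 90 days or more past due | integer |
| 9 | NumberRealEstateLoansOrLines | Number of real estate loans or lines of credit | integer |
| 10 | NumberOfTime60-89DaysPastDueNotWorse | Number of times 60-89 days past due but no worse | integer |
| 11 | NumberOfDependents | Number of dependents (excluding the borrower) | integer |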

## 4.2 Data Cleaning

**4.2.1 Rename columns**

In [None]:
train_data.rename(columns={'Unnamed: 0':'ID'}, inplace=True)
test_data.rename(columns={'Unnamed: 0':'ID'}, inplace=True)

In [None]:
train_data.head(5)

In [None]:
test_data.head(5)

**4.2.2 Remove duplicates**

In [None]:
train_data.drop_duplicates(inplace=True)
test_data.drop_duplicates(inplace=True)

**4.2.3 Handle missing values**

In [None]:
# Fraction of missing values per column
train_data.isnull().mean()

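The output above shows that two fields, MonthlyIncome and NumberOfDependents, have missing values that need handling. MonthlyIncome has many missing values, so dropping those rows could affect the results; the gaps can instead be filled with the median, mean, or mode. NumberOfDependents has few missing values and can be filled with the median directly.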
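To make the choice between fill values concrete, here is a small plain-Python sketch of median imputation next to the mean of the same observed values. The income figures are invented, and `fill_missing_with_median` is a hypothetical helper, not a function from the notebook (which fills with pandas).

```python
import math
from statistics import mean, median


def fill_missing_with_median(values):
    """Replace NaN entries with the median of the observed (non-NaN) entries."""
    observed = [v for v in values if not math.isnan(v)]
    fill = median(observed)
    return [fill if math.isnan(v) else v for v in values]


# Invented incomes with two gaps and one extreme value.
incomes = [3000.0, float("nan"), 5200.0, 250000.0, float("nan"), 4100.0]
observed = [v for v in incomes if not math.isnan(v)]

print(median(observed))  # 4650.0, barely affected by the extreme value
print(mean(observed))    # 65575.0, pulled upward by the extreme value
print(fill_missing_with_median(incomes))
```

On skewed variables such as income, the median tends to be the more conservative fill value, which is one reason it is often preferred over the mean.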

In [None]:
# Split the data by retirement age (60) and compare mean income
working = train_data.loc[(train_data['age'] >= 18) & (train_data['age'] <= 60)]
senior = train_data.loc[(train_data['age'] > 60)]
working_income_mean = working['MonthlyIncome'].mean()
senior_income_mean = senior['MonthlyIncome'].mean()
print(working_income_mean)
print(senior_income_mean)

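The two means are close, so retirement status makes little difference; missing income values are filled with the overall mean.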

In [None]:
train_data['MonthlyIncome'] = train_data['MonthlyIncome'].replace(np.nan,train_data['MonthlyIncome'].mean())

In [None]:
# train_data=train_data.dropna()
# Count the non-null values of NumberOfDependents
train_data['NumberOfDependents'].value_counts()

In [None]:
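# Fill missing values with the median (assign back instead of calling inplace on a column selection)
train_data['NumberOfDependents'] = train_data['NumberOfDependents'].fillna(train_data['NumberOfDependents'].median())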

In [None]:
# Verify that no missing values remain
train_data.info()

In [None]:
test_data.loc[test_data['age'] == 0, 'age'] = test_data['age'].median()
test_data['MonthlyIncome'] = test_data['MonthlyIncome'].replace(np.nan,test_data['MonthlyIncome'].mean())
test_data['NumberOfDependents'] = test_data['NumberOfDependents'].fillna(test_data['NumberOfDependents'].median())

In [None]:
# Alternatives (unused): fill MonthlyIncome with the median and drop the few rows missing NumberOfDependents
# train_data.MonthlyIncome.fillna(value=train_data.MonthlyIncome.median(), inplace=True)
# train_data=train_data.dropna()

# mData = train_data.iloc[:,[5,0,1,2,3,4,6,7,8,9]]
# train_known = mData[mData.MonthlyIncome.notnull()].values
# train_unknown = mData[mData.MonthlyIncome.isnull()].values
# train_X = train_known[:,1:]
# train_y = train_known[:,0]
# rfr = RandomForestRegressor(random_state=0,n_estimators=200,max_depth=3,n_jobs=-1)
# rfr.fit(train_X,train_y)
# predicted_y = rfr.predict(train_unknown[:,1:]).round(0)
# train_data.loc[train_data.MonthlyIncome.isnull(),'MonthlyIncome'] = predicted_y

# train_data = train_data.dropna()

**4.2.4 Handle outliers**

In [None]:
# Check for outliers
train_data.describe()

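The describe() output shows a minimum age of 0, which is implausible (the test set replaces it with the median, and the training set drops such rows later). NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate, and NumberOfTime60-89DaysPastDueNotWorse all have a maximum of 98, which also makes their means very close, so these columns need a closer look. The same applies to NumberOfOpenCreditLinesAndLoans and NumberRealEstateLoansOrLines. The correlations among these variables should be checked.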

In [None]:
test_data.describe()

The test set shows the same pattern.

In [None]:
# Check the correlations between variables
corr = train_data.corr()
plt.figure(figsize=(19, 15))
sns.heatmap(corr, annot=True, fmt='.2g')

The heatmap shows that NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate, and NumberOfTime60-89DaysPastDueNotWorse are strongly correlated. Next, look at box plots of the three.

In [None]:
plt.figure(figsize=(19, 12)) 
train_data[['NumberOfTime30-59DaysPastDueNotWorse', 
          'NumberOfTime60-89DaysPastDueNotWorse',
          'NumberOfTimes90DaysLate']].boxplot()
plt.show()

In [None]:
# Replace the values 96 and 98 with the column median, then recheck the correlations
def replace98and96(column):
    # The median is taken before replacement, so it still includes the 96/98 entries
    newval = column.median()
    return column.replace([96, 98], newval)

train_data['NumberOfTime30-59DaysPastDueNotWorse'] = replace98and96(train_data['NumberOfTime30-59DaysPastDueNotWorse'])
train_data['NumberOfTimes90DaysLate'] = replace98and96(train_data['NumberOfTimes90DaysLate'])
train_data['NumberOfTime60-89DaysPastDueNotWorse'] = replace98and96(train_data['NumberOfTime60-89DaysPastDueNotWorse'])

test_data['NumberOfTime30-59DaysPastDueNotWorse'] = replace98and96(test_data['NumberOfTime30-59DaysPastDueNotWorse'])
test_data['NumberOfTimes90DaysLate'] = replace98and96(test_data['NumberOfTimes90DaysLate'])
test_data['NumberOfTime60-89DaysPastDueNotWorse'] = replace98and96(test_data['NumberOfTime60-89DaysPastDueNotWorse'])

In [None]:
# Recheck the correlations after the replacement
corr = train_data.corr()
plt.figure(figsize=(19, 15))
sns.heatmap(corr, annot=True, fmt='.2g')

In [None]:
# Look at the distribution of the target SeriousDlqin2yrs
sns.countplot(x="SeriousDlqin2yrs",data=train_data)

In [None]:
# The classes are extremely imbalanced; the event rate is computed below
P = train_data.groupby('SeriousDlqin2yrs')['ID'].count().reset_index()
P['Percentage'] = 100 * P['ID'] / P['ID'].sum()
print(P)

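Class imbalance makes a supervised learner focus too heavily on the majority class, which degrades classification performance. Because there is plenty of data, undersampling is an option; regularized regression models and ensemble models are used as well.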
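A minimal sketch of what random undersampling does, assuming the simplest variant in which every class is reduced to the size of the smallest class. It is written in plain Python on toy labels; `random_undersample` is a hypothetical helper and makes no claim about how any library implements the technique.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=111):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # Sampling without replacement; the minority class is kept in full.
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data: 93 good customers (0) and 7 bad customers (1).
labels = [0] * 93 + [1] * 7
rows = list(range(100))
X_small, y_small = random_undersample(rows, labels)
print(Counter(labels))   # Counter({0: 93, 1: 7})
print(Counter(y_small))  # Counter({0: 7, 1: 7})
```

The balance comes at the cost of discarding most majority-class rows, which is why this approach is usually reserved for datasets that are large to begin with.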

In [None]:
# Remove outliers: rows at or above cutoffs based on the 99th percentile are dropped (filtered out, not capped)
train_data = train_data[train_data['NumberOfTime30-59DaysPastDueNotWorse'] <4.00]
train_data = train_data[train_data['NumberOfTime60-89DaysPastDueNotWorse'] <2.00]
train_data = train_data[train_data['NumberOfTimes90DaysLate'] <3.00]
train_data = train_data[train_data['RevolvingUtilizationOfUnsecuredLines'] < 1.09]
train_data = train_data[train_data['DebtRatio'] < 4979.04]
train_data = train_data[train_data['MonthlyIncome'] <25000.00]
train_data = train_data[train_data['NumberRealEstateLoansOrLines'] <4.00]
train_data = train_data[train_data['NumberOfDependents'] <4.00]
train_data = train_data[train_data['NumberOfOpenCreditLinesAndLoans'] <24.00]
train_data = train_data[train_data['age'] <87.00]
# Drop the rows where age equals 0
train_data = train_data[train_data['age'] > 0]
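The cell above removes rows that exceed each cutoff. Capping (winsorizing) is the alternative that keeps every row and clips the value instead. The sketch below contrasts the two on a plain list, assuming a percentile computed with linear interpolation between neighbouring ranks; the helper names are illustrative and do not appear in the notebook.

```python
def percentile(values, q):
    """q-th percentile (0 to 100) using linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def cap_at_percentile(values, q=99):
    """Capping: values above the q-th percentile are set to that percentile."""
    limit = percentile(values, q)
    return [min(v, limit) for v in values]


def drop_above_percentile(values, q=99):
    """Filtering: values above the q-th percentile are removed entirely."""
    limit = percentile(values, q)
    return [v for v in values if v <= limit]


values = list(range(1, 101))           # toy column: 1, 2, ..., 100
capped = cap_at_percentile(values)     # same length, top value clipped
dropped = drop_above_percentile(values)  # one row fewer
print(len(capped), len(dropped))       # 100 99
```

Capping preserves the sample size, which matters when the extreme rows also carry the rare positive class; filtering is simpler but discards those rows.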

## 4.3 Data Analysis

In [None]:
# Use names other than train/test here to avoid confusion with the validation split below
X = train_data.drop(['SeriousDlqin2yrs', 'ID'],axis=1)
y = train_data['SeriousDlqin2yrs']
W = test_data.drop(['SeriousDlqin2yrs', 'ID'],axis=1)
z = test_data['SeriousDlqin2yrs']

**4.3.1 Logistic Regression**

In [None]:
# Logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.preprocessing import StandardScaler

# Split into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=111)

# Logistic regression; C is the inverse of the regularization strength, penalty='l1' selects L1 regularization
logit = LogisticRegression(random_state=111, solver='saga', penalty='l1', class_weight='balanced', C=1.0, max_iter=500)

# Fit the scaler on the training split only
scaler = StandardScaler().fit(X_train)

# Standardize X_train and X_test
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the logistic regression
logit.fit(X_train_scaled, y_train)

# Predict on the training set: one probability per class for every sample
logit_scores_proba = logit.predict_proba(X_train_scaled)

# Keep the probability of class 1
logit_scores = logit_scores_proba[:,1]
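The scaling step in the cell above learns its statistics from the training split and reuses them on the held-out split. A plain-Python sketch of that idea follows, assuming z-score standardization with the population standard deviation (the convention `StandardScaler` follows). The function names and the toy ages are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(train_column):
    """Learn the mean and population standard deviation from training values only."""
    return mean(train_column), pstdev(train_column)


def transform(column, mu, sigma):
    """Apply z = (x - mu) / sigma using the statistics learned on the training split."""
    return [(v - mu) / sigma for v in column]


train_age = [30, 40, 50, 60]  # toy training values
test_age = [45, 70]           # toy held-out values

mu, sigma = fit_scaler(train_age)
train_scaled = transform(train_age, mu, sigma)
test_scaled = transform(test_age, mu, sigma)  # reuses the training mu and sigma

print(mu, round(sigma, 4))
print([round(v, 4) for v in test_scaled])
```

Fitting the scaler on the full data before splitting would leak information from the held-out rows into training, so the statistics are taken from the training split alone.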

In [None]:
# Plotting helper
def plot_roc_curve(fpr, tpr, label=None):
    plt.figure(figsize=(12,10))
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0,1],[0,1], "k--") # diagonal reference line
    plt.axis([0,1,0,1])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")

In [None]:
# roc_curve takes the true labels and predicted probabilities and returns the false positive rate and true positive rate
fpr_logit, tpr_logit, thresh_logit = roc_curve(y_train, logit_scores)

# Plot the ROC curve
plot_roc_curve(fpr_logit,tpr_logit)
print('AUC Score : ', (roc_auc_score(y_train,logit_scores)))

In [None]:
# Score the validation split: class probabilities for each sample
logit_scores_proba_val = logit.predict_proba(X_test_scaled)

# Probability of class 1
logit_scores_val = logit_scores_proba_val[:,1]

# roc_curve takes the true labels and predicted probabilities and returns the false positive rate and true positive rate
fpr_logit_val, tpr_logit_val, thresh_logit_val = roc_curve(y_test, logit_scores_val)

# Plot the ROC curve
plot_roc_curve(fpr_logit_val,tpr_logit_val)
print('AUC Score :', (roc_auc_score(y_test,logit_scores_val)))

In [None]:
# Use LogisticRegressionCV to select the regularization parameter C by cross-validation
from sklearn.linear_model import LogisticRegressionCV
logit = LogisticRegressionCV(Cs=[0.001, 0.01, 0.1, 1, 10, 100], penalty='l1', solver='saga', max_iter=500, class_weight='balanced', random_state=111)

# Fit the logistic regression
logit.fit(X_train_scaled, y_train)

print(logit.C_)

In [None]:
# Predict on the training set: one probability per class for every sample
logit_scores_proba = logit.predict_proba(X_train_scaled)

# Keep the probability of class 1
logit_scores = logit_scores_proba[:,1]

# roc_curve takes the true labels and predicted probabilities and returns the false positive rate and true positive rate
fpr_logit, tpr_logit, thresh_logit = roc_curve(y_train, logit_scores)

# Plot the ROC curve
plot_roc_curve(fpr_logit,tpr_logit)
print('AUC Score : ', (roc_auc_score(y_train,logit_scores)))

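The results show that tuning the logistic regression hyperparameter does not noticeably improve AUC, and the result remains unsatisfactory even with balanced class weights. Next, undersample the data first and then try a random forest.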

In [None]:
# Import the undersampling module
from imblearn.under_sampling import RandomUnderSampler

# Counter tallies how many times each value occurs
from collections import Counter
print('Original dataset shape :', Counter(y))

In [None]:
# Create the sampler
rus = RandomUnderSampler(random_state=111)

# Undersample and return the resampled features and labels
X_resampled, y_resampled = rus.fit_resample(X, y)
print('Resampled dataset shape:', Counter(y_resampled))

In [None]:
# Split the resampled data into training and validation sets
from sklearn.model_selection import train_test_split
X_train_rus, X_test_rus, y_train_rus, y_test_rus = train_test_split(X_resampled, y_resampled, random_state=111)
X_train_rus.shape, y_train_rus.shape

In [None]:
# Fit a classifier on the resampled data
logit_resampled = LogisticRegression(random_state=111, solver='saga', penalty='l1', class_weight='balanced', C=1.0, max_iter=500)

logit_resampled.fit(X_resampled, y_resampled)
logit_resampled_proba_res = logit_resampled.predict_proba(X_resampled)
logit_resampled_scores = logit_resampled_proba_res[:, 1]
fpr_logit_resampled, tpr_logit_resampled, thresh_logit_resampled = roc_curve(y_resampled, logit_resampled_scores)
plot_roc_curve(fpr_logit_resampled, tpr_logit_resampled)
print('AUC score: ', roc_auc_score(y_resampled, logit_resampled_scores))

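The AUC is actually lower. Note that this model is fit on unscaled features and scored on the same rows it was trained on, so it is not directly comparable with the earlier results.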

**4.3.2 Random Forest**

In [None]:
# Random forest classifier (GradientBoostingClassifier is imported here for use in 4.3.3)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
forest = RandomForestClassifier(n_estimators=300, random_state=111, max_depth=5, class_weight='balanced')
forest.fit(X_train_rus, y_train_rus)
y_scores_prob = forest.predict_proba(X_train_rus)
y_scores = y_scores_prob[:, 1]
fpr, tpr, thresh = roc_curve(y_train_rus, y_scores)
plot_roc_curve(fpr, tpr)
print('AUC score:', roc_auc_score(y_train_rus, y_scores))

In [None]:
# Evaluate on the held-out split (a single hold-out set, not k-fold cross-validation)
y_test_proba = forest.predict_proba(X_test_rus)
y_scores_test = y_test_proba[:, 1]
fpr_test, tpr_test, thresh_test = roc_curve(y_test_rus, y_scores_test)
plot_roc_curve(fpr_test, tpr_test)
print('AUC Score:', roc_auc_score(y_test_rus, y_scores_test))
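The split scored above is a single hold-out set. For comparison, k-fold cross-validation rotates the validation role across k partitions so that every row is validated exactly once. The sketch below only generates the index partitions, without shuffling or stratification; `k_fold_indices` is a hypothetical helper written for this illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    # The first n_samples % k folds get one extra sample so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


for train_idx, val_idx in k_fold_indices(10, 3):
    print(len(train_idx), val_idx)
```

Averaging a metric such as AUC over the folds gives a less noisy estimate than one split. On imbalanced data like this, a stratified and shuffled variant would usually be preferable.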

In [None]:
# Plot how much importance the model assigns to each feature
def plot_feature_importances(model):
    plt.figure(figsize=(10,8))
    n_features = X.shape[1]
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), X.columns)
    plt.xlabel('Feature importance')
    plt.ylabel('Feature')
    plt.ylim(-1, n_features)

plot_feature_importances(forest)

**4.3.3 Gradient Boosting Classifier**

In [None]:
gbc_clf = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, max_depth=8, random_state=112)
gbc_clf.fit(X_train, y_train)
gbc_clf_proba = gbc_clf.predict_proba(X_train)
gbc_clf_scores = gbc_clf_proba[:, 1]
fpr_gbc, tpr_gbc, thresh_gbc = roc_curve(y_train, gbc_clf_scores)
plot_roc_curve(fpr_gbc, tpr_gbc)
print('AUC Score:', roc_auc_score(y_train, gbc_clf_scores))

In [None]:
# Check the result on the held-out validation split
gbc_val_proba = gbc_clf.predict_proba(X_test)
gbc_val_scores = gbc_val_proba[:, 1]
print('AUC score:', roc_auc_score(y_test, gbc_val_scores))

The model appears to be overfitting, so the hyperparameters are adjusted.

In [None]:
gbc_clf_submission = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05 ,max_depth=4,  random_state=42)
gbc_clf_submission.fit(X_train,y_train)
gbc_clf_proba = gbc_clf_submission.predict_proba(X_train)
gbc_clf_scores = gbc_clf_proba[:,1]
gbc_val_proba = gbc_clf_submission.predict_proba(X_test)
gbc_val_scores = gbc_val_proba[:,1]
fpr_gbc, tpr_gbc, thresh_gbc = roc_curve(y_train, gbc_clf_scores)
print('AUC Score :', roc_auc_score(y_train, gbc_clf_scores))
print('AUC Score :', roc_auc_score(y_test, gbc_val_scores))

In [None]:
plot_feature_importances(gbc_clf)

Compared with the random forest, the gradient boosting classifier puts more weight on DebtRatio.

## 4.4 Data Output

In [None]:
submission_proba = gbc_clf_submission.predict_proba(W)
submission_scores = submission_proba[:, 1]
submission_scores.shape

In [None]:
W.shape

In [None]:
ids = np.arange(1, len(W) + 1)  # one Id per test row (101503 rows), instead of a hard-coded bound
submission = pd.DataFrame( {'Id': ids, 'Probability': submission_scores})
submission.to_csv('submission.csv', index=False)