# Credit Scoring

## 0 Preparation

Import the required packages.

In [None]:
!pip install xlrd
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier

## 1 Background

### 1.1 Background of the problem 

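Banks play a crucial role in market economies. They decide who can get financing and on what terms, and they can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

Credit scoring algorithms, which estimate the probability of default, are the method banks use to decide whether a loan should be granted. This competition asks participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.

The goal of the competition is to build a model that borrowers can use to help make the best financial decisions.

Historical data are provided on 250,000 borrowers.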

### 1.2 Data Exploration

Let's look at the data dictionary provided with the competition.

In [None]:
columns = pd.read_excel('/kaggle/input/GiveMeSomeCredit/Data Dictionary.xls')
columns

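Notes on the columns:  
+ SeriousDlqin2yrs: whether the person experienced a delinquency of 90 days past due or worse
+ RevolvingUtilizationOfUnsecuredLines: total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits
+ age: age of the borrower in years
+ NumberOfTime30-59DaysPastDueNotWorse: number of times in the past two years the borrower has been 30-59 days past due but no worse
+ DebtRatio: monthly debt payments, alimony, and living costs divided by monthly gross income
+ MonthlyIncome: monthly income
+ NumberOfOpenCreditLinesAndLoans: number of open loans (installment loans such as car loans or mortgages) and lines of credit (such as credit cards)
+ NumberOfTimes90DaysLate: number of times the borrower has been 90 days or more past due
+ NumberRealEstateLoansOrLines: number of mortgage and real estate loans, including home equity lines of credit
+ NumberOfTime60-89DaysPastDueNotWorse: number of times in the past two years the borrower has been 60-89 days past due but no worse
+ NumberOfDependents: number of dependents in the family excluding the borrower (spouse, children, etc.)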

### 1.3 Dataset Exploration

Load the training data.

In [None]:
train = pd.read_csv('/kaggle/input/GiveMeSomeCredit/cs-training.csv')
train.head()

Inspect the columns of the training set.

In [None]:
train.info()

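**11 columns in total; the training set contains 150,000 rows**  
+ Columns with missing values: MonthlyIncome (monthly income) and NumberOfDependents (number of dependents in the family excluding the borrower, e.g. spouse, children)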
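
Rather than reading the missing-value columns off the `info()` output, they can be listed by counting nulls per column. The snippet below is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the competition data).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    'age': [45, 40, 38, 30],
    'MonthlyIncome': [9120.0, np.nan, 3042.0, np.nan],
    'NumberOfDependents': [2.0, 1.0, np.nan, 0.0],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0]
print(missing)
```

Applied to the real data, the same two lines would be run on `train` and `test`.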

Load the test data.

In [None]:
test = pd.read_csv('/kaggle/input/GiveMeSomeCredit/cs-test.csv')
test.head()

Inspect the columns of the test set.

In [None]:
test.info()

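**11 columns in total; the test set contains 101502 rows**
+ Columns with missing values: MonthlyIncome (monthly income) and NumberOfDependents (number of dependents in the family excluding the borrower, e.g. spouse, children)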

## 2 Feature Engineering

### 2.1 SeriousDlqin2yrs

SeriousDlqin2yrs has values in the training set but not in the test set, so it must be the label we need to predict. Let's look at its distribution in the training set.

In [None]:
sns.countplot(x='SeriousDlqin2yrs', data=train)

In [None]:
train.SeriousDlqin2yrs.value_counts()

The classes are highly imbalanced: 1 (serious delinquency) is rare, while 0 is far more common.
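
To put a number on the imbalance instead of reading it off the plot, the class proportions can be computed with `value_counts(normalize=True)`. This is a small sketch on made-up labels; the 9:1 ratio is only an illustration, not the ratio in the competition data.

```python
import pandas as pd

# Made-up labels: nine negatives and one positive.
labels = pd.Series([0] * 9 + [1], name='SeriousDlqin2yrs')

# normalize=True returns fractions instead of raw counts.
ratio = labels.value_counts(normalize=True)
print(ratio)
```

On the notebook's data the equivalent call would be `train.SeriousDlqin2yrs.value_counts(normalize=True)`.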

### 2.2 RevolvingUtilizationOfUnsecuredLines

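RevolvingUtilizationOfUnsecuredLines is the total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits.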

In [None]:
sns.boxplot(data=train, x='RevolvingUtilizationOfUnsecuredLines')

Overall, the values of this attribute vary widely.

### 2.3 age

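age is the borrower's age.

In [None]:
# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest equivalent
sns.histplot(data=train, x='age', bins=50, stat='density', kde=True)

In [None]:
sns.boxplot(x=train.age)

In [None]:
g = sns.FacetGrid(train, col='SeriousDlqin2yrs')
g.map_dataframe(sns.boxplot, x='age')

The age distribution differs between the two classes.

Some ages are implausible, so let's look at the corresponding rows.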

In [None]:
train[train.age>100]

In [None]:
# Replace implausible ages with the rounded mean age, keeping the integer dtype
train.loc[train.age > 100, 'age'] = int(round(train.age.mean()))
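
The cell above fills implausible ages with the mean of the whole column, which still includes the outliers themselves. One possible variant, sketched here on made-up ages, computes the replacement only from the plausible rows and uses the median. The threshold of 100 comes from the notebook; the choice of the median is an assumption, not the author's method.

```python
import pandas as pd

# Made-up ages; the last two are implausible.
toy = pd.DataFrame({'age': [25, 40, 55, 103, 109]})

# Compute the fill value from plausible rows only.
valid = toy['age'] <= 100
fill_value = int(toy.loc[valid, 'age'].median())

# Overwrite only the implausible rows.
toy.loc[~valid, 'age'] = fill_value
print(toy['age'].tolist())
```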

### 2.4 NumberOfTime30-59DaysPastDueNotWorse

NumberOfTime30-59DaysPastDueNotWorse is the number of times in the past two years the borrower has been 30-59 days past due but no worse.

In [None]:
sns.countplot(x='NumberOfTime30-59DaysPastDueNotWorse', data=train)

In [None]:
train.loc[train['NumberOfTime30-59DaysPastDueNotWorse']>=96]

Counts of 96 or more are most likely a data problem. We cap the values: anything above 15 is set to 15, making 15 the maximum number of past-due events.

In [None]:
train.loc[train['NumberOfTime30-59DaysPastDueNotWorse']>15, 'NumberOfTime30-59DaysPastDueNotWorse'] = 15
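
The same capping can be written with `Series.clip`, which avoids repeating the column name in a boolean mask. A minimal sketch on made-up counts:

```python
import pandas as pd

# Made-up past-due counts, including the suspicious values 96 and 98.
counts = pd.Series([0, 2, 15, 96, 98])

# Everything above 15 becomes 15; smaller values are untouched.
capped = counts.clip(upper=15)
print(capped.tolist())
```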

### 2.5 DebtRatio

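DebtRatio is monthly debt payments, alimony, and living costs divided by monthly gross income.

In [None]:
sns.histplot(data=train, x='DebtRatio', bins=50, stat='density', kde=True)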

In [None]:
train[train.DebtRatio>10000]

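The values of this attribute vary enormously across samples, and there is no obvious relationship with the label, so we may consider dropping it later.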

### 2.6 MonthlyIncome

MonthlyIncome is the monthly income; this column contains a large number of missing values.

In [None]:
train.loc[train.MonthlyIncome.isnull()]

In [None]:
train.loc[train.MonthlyIncome.isnull(), 'SeriousDlqin2yrs'].sum()

The rows with missing values also contain many positive samples, so we cannot simply drop them.

Fill missing values with the mean.

In [None]:
train['MonthlyIncome'] = train['MonthlyIncome'].fillna(train['MonthlyIncome'].mean())
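
Because the rows with missing income still contain positive samples, the fact that income is missing may itself carry signal. One option, not used in this notebook, is to record a flag column before filling. The sketch below uses made-up values, and the column name `MonthlyIncome_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up incomes with two missing entries.
toy = pd.DataFrame({'MonthlyIncome': [3000.0, np.nan, 5000.0, np.nan]})

# Record which rows were missing before the information is lost.
toy['MonthlyIncome_missing'] = toy['MonthlyIncome'].isnull().astype(int)

# Then fill with the mean of the observed values, as in the notebook.
toy['MonthlyIncome'] = toy['MonthlyIncome'].fillna(toy['MonthlyIncome'].mean())
print(toy)
```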

### 2.7 NumberOfOpenCreditLinesAndLoans

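NumberOfOpenCreditLinesAndLoans is the number of open loans (installment loans such as car loans or mortgages) and lines of credit (such as credit cards).

In [None]:
sns.histplot(data=train, x='NumberOfOpenCreditLinesAndLoans', bins=50, stat='density', kde=True)

In [None]:
g = sns.FacetGrid(train, col='SeriousDlqin2yrs')
g.map_dataframe(sns.histplot, x='NumberOfOpenCreditLinesAndLoans', bins=50, stat='density', kde=True)

The distributions of the positive and negative samples differ.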

### 2.8 NumberOfTimes90DaysLate

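NumberOfTimes90DaysLate is the number of times the borrower has been 90 days or more past due.

In [None]:
sns.countplot(x='NumberOfTimes90DaysLate', data=train)

In [None]:
g = sns.FacetGrid(train, col='SeriousDlqin2yrs')
# Pass an explicit order so both facets share the same categories
g.map_dataframe(sns.countplot, x='NumberOfTimes90DaysLate', order=sorted(train['NumberOfTimes90DaysLate'].unique()))

Borrowers with larger values of this attribute are more likely to be positive samples.

There is also a consistency issue in the data: we would expect the number of times 90 or more days late to be no greater than the number of times 60-89 days late, yet some rows violate this. We handle it later.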

In [None]:
train.loc[train.NumberOfTimes90DaysLate > train['NumberOfTime60-89DaysPastDueNotWorse']]

In [None]:
train.loc[train['NumberOfTimes90DaysLate']>15, 'NumberOfTimes90DaysLate'] = 15

### 2.9 NumberRealEstateLoansOrLines

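NumberRealEstateLoansOrLines is the number of mortgage and real estate loans, including home equity lines of credit.

In [None]:
sns.countplot(x='NumberRealEstateLoansOrLines', data=train)

In [None]:
g = sns.FacetGrid(train, col='SeriousDlqin2yrs')
# Pass an explicit order so both facets share the same categories
g.map_dataframe(sns.countplot, x='NumberRealEstateLoansOrLines', order=sorted(train['NumberRealEstateLoansOrLines'].unique()))

The distribution of this attribute differs somewhat between the two classes.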

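### 2.10 NumberOfTime60-89DaysPastDueNotWorse

NumberOfTime60-89DaysPastDueNotWorse is the number of times in the past two years the borrower has been 60-89 days past due but no worse.

In [None]:
sns.countplot(x='NumberOfTime60-89DaysPastDueNotWorse', data=train)

In [None]:
g = sns.FacetGrid(train, col='SeriousDlqin2yrs')
# Pass an explicit order so both facets share the same categories
g.map_dataframe(sns.countplot, x='NumberOfTime60-89DaysPastDueNotWorse', order=sorted(train['NumberOfTime60-89DaysPastDueNotWorse'].unique()))

Borrowers with larger values of this attribute are more likely to be positive samples.

There is a similar consistency issue here: we would expect the number of times 60-89 days past due to be no greater than the number of times 30-59 days past due, yet some rows violate this. We handle it later.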

In [None]:
train.loc[train['NumberOfTime60-89DaysPastDueNotWorse'] > train['NumberOfTime30-59DaysPastDueNotWorse']]

In [None]:
train.loc[train['NumberOfTime60-89DaysPastDueNotWorse']>15, 'NumberOfTime60-89DaysPastDueNotWorse'] = 15

### 2.11 NumberOfDependents

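NumberOfDependents is the number of dependents in the family excluding the borrower (spouse, children, etc.).

In [None]:
g = sns.FacetGrid(train, col='SeriousDlqin2yrs')
g.map_dataframe(sns.boxplot, x='NumberOfDependents')

The two classes differ considerably. A plausible explanation is that the burden of supporting dependents leads to more debt and therefore higher risk.

Fill missing values with 0.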

In [None]:
train['NumberOfDependents'] = train['NumberOfDependents'].fillna(0)

### 2.12 Correlation Analysis

In [None]:
sns.heatmap(train.corr(),cmap="coolwarm",annot=False)

Observations:
+ SeriousDlqin2yrs is strongly related to the past-due count columns
+ SeriousDlqin2yrs has little relationship with age
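
With `annot=False` the heatmap only gives a visual impression. To back the observations with numbers, the label's column of the correlation matrix can be sorted by absolute value. The sketch below runs on a made-up six-row frame, so the figures it prints say nothing about the real data.

```python
import pandas as pd

# Made-up rows: the late-payment count tracks the label, age does not.
toy = pd.DataFrame({
    'SeriousDlqin2yrs': [0, 0, 0, 1, 1, 0],
    'NumberOfTimes90DaysLate': [0, 0, 1, 3, 2, 0],
    'age': [30, 62, 45, 50, 41, 38],
})

# Correlation of every feature with the label, strongest first.
corr_with_target = (
    toy.corr()['SeriousDlqin2yrs']
    .drop('SeriousDlqin2yrs')
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```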

## 3 Model

We mainly use two models: LGBM and random forest.

### 3.0 Preparation

In [None]:
train_X = train[train.columns[2:]]
train_y = train[train.columns[1]]
train_X, val_X, train_y, val_y = train_test_split(train_X, train_y, test_size=0.1,  random_state=42, stratify=train_y)
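
The `stratify=train_y` argument matters because the label is imbalanced: it keeps the positive rate the same in the training and validation parts. A small sketch on synthetic labels (10% positives, chosen only for illustration) shows the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 10% positives.
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 100 + [0] * 900)

# Same call shape as in the notebook: 10% validation, stratified on the label.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

# Both parts keep the 10% positive rate.
print(y_tr.mean(), y_val.mean())
```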

In [None]:
val_X.info()

### 3.1 LGBM

In [None]:
grid_lgb = GridSearchCV(
    estimator= LGBMClassifier(),
    param_grid={
        'n_estimators':range(20,200,10),
        'learning_rate': np.linspace(0.05, 0.5, 10)
    },
    scoring='roc_auc',
    verbose=1
)

grid_lgb.fit(train_X,train_y)

for result in grid_lgb.cv_results_:
    print(result, grid_lgb.cv_results_[result])
grid_lgb.best_params_['n_estimators']
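
Printing every entry of `cv_results_` produces a lot of output. A more readable alternative is to load it into a DataFrame and sort by rank. The sketch below is self-contained, so it uses synthetic data and `LogisticRegression` as stand-ins; neither is part of this notebook's pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a small stand-in grid.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1.0, 10.0]},
    scoring='roc_auc',
    cv=3,
)
grid.fit(X, y)

# One row per parameter combination, best combination first.
results = (
    pd.DataFrame(grid.cv_results_)
    [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)
print(results)
```

The same three lines for building `results` should work unchanged on `grid_lgb`.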

In [None]:
# Use all tuned hyperparameters (learning_rate as well as n_estimators)
clf_lgb = LGBMClassifier(**grid_lgb.best_params_)
clf_lgb.fit(train_X, train_y)
y_pred_lgb = clf_lgb.predict_proba(val_X)[:,1]
score = roc_auc_score(val_y, y_pred_lgb)
print(score)
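
The score printed above is the ROC AUC, which can be read as the probability that a randomly chosen positive sample receives a higher predicted score than a randomly chosen negative one. The sketch below checks that interpretation on four made-up predictions by counting correctly ordered pairs and comparing the result with `roc_auc_score`.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Four made-up samples: two negatives, two positives.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half.
pairwise = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(pos, neg)
) / (len(pos) * len(neg))

auc = roc_auc_score(y_true, y_score)
print(pairwise, auc)
```

Three of the four pairs are ordered correctly, so both values come out as 0.75.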

### 3.2 Random Forest

In [None]:
# Did not perform well
grid_rf = GridSearchCV(
    estimator= RandomForestClassifier(n_estimators=45,random_state=42,max_depth=11),
    param_grid={
        'min_samples_split': range(5,30,5),
        'n_estimators':range(20,100,20),
        'max_depth':range(5,20,5)
    },
    scoring='roc_auc',
    verbose=1
)

grid_rf.fit(train_X,train_y)

for result in grid_rf.cv_results_:
    print(result, grid_rf.cv_results_[result])
grid_rf.best_params_['n_estimators']

In [None]:
# Use all tuned hyperparameters and the same seed as in the grid search
clf_rf = RandomForestClassifier(random_state=42, **grid_rf.best_params_)
clf_rf.fit(train_X, train_y)
y_pred_rf = clf_rf.predict_proba(val_X)[:,1]
score = roc_auc_score(val_y, y_pred_rf)
print(score)

### 3.3 Feature cut

In [None]:

# Rebuild the feature matrix without age
train_X = train[train.columns[2:]]
train_X = train_X.drop('age', axis=1)
train_y = train[train.columns[1]]

# Enforce 30-59 day count >= 60-89 day count >= 90+ day count
train_X['NumberOfTime60-89DaysPastDueNotWorse'] = train_X[['NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfTimes90DaysLate']].max(axis=1)
train_X['NumberOfTime30-59DaysPastDueNotWorse'] = train_X[['NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfTime30-59DaysPastDueNotWorse']].max(axis=1)
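
The two `max(axis=1)` assignments resolve the consistency issue noted in sections 2.8 and 2.10 by raising the shorter-delay counts until the 30-59 day count is at least the 60-89 day count, which in turn is at least the 90+ day count. The order of the two assignments matters, as this sketch on three made-up rows illustrates.

```python
import pandas as pd

c30 = 'NumberOfTime30-59DaysPastDueNotWorse'
c60 = 'NumberOfTime60-89DaysPastDueNotWorse'
c90 = 'NumberOfTimes90DaysLate'

# Made-up rows; the first one violates both expected orderings.
toy = pd.DataFrame({c30: [0, 2, 1], c60: [1, 0, 0], c90: [3, 0, 0]})

# First lift the 60-89 count to at least the 90+ count ...
toy[c60] = toy[[c60, c90]].max(axis=1)
# ... then lift the 30-59 count to at least the (already adjusted) 60-89 count.
toy[c30] = toy[[c60, c30]].max(axis=1)

print(toy)
```

The first row becomes (3, 3, 3), while the two rows that were already consistent are unchanged.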


In [None]:
train_X, val_X, train_y, val_y = train_test_split(train_X, train_y, test_size=0.1,  random_state=42, stratify=train_y)

In [None]:
grid_lgb_fc = GridSearchCV(
    estimator= LGBMClassifier(),
    param_grid={
        'n_estimators':range(20,200,10),
        'learning_rate': np.linspace(0.05, 0.5, 10)
    },
    scoring='roc_auc',
    verbose=1
)

grid_lgb_fc.fit(train_X,train_y)

for result in grid_lgb_fc.cv_results_:
    print(result, grid_lgb_fc.cv_results_[result])
grid_lgb_fc.best_params_['n_estimators']

In [None]:

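# Refit on the full training set with the same feature adjustments used for the grid search and the test set
train_X = train[train.columns[2:]].drop('age', axis=1)
train_y = train[train.columns[1]]
train_X['NumberOfTime60-89DaysPastDueNotWorse'] = train_X[['NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfTimes90DaysLate']].max(axis=1)
train_X['NumberOfTime30-59DaysPastDueNotWorse'] = train_X[['NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfTime30-59DaysPastDueNotWorse']].max(axis=1)

# Use all tuned hyperparameters (learning_rate as well as n_estimators)
clf_lgb_fc = LGBMClassifier(**grid_lgb_fc.best_params_)
clf_lgb_fc.fit(train_X, train_y)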

## 4 Predict

In [None]:
test_lgb = test.copy()

In [None]:
test_lgb.info()

In [None]:
test_lgb.loc[test_lgb['NumberOfTime60-89DaysPastDueNotWorse']>15, 'NumberOfTime60-89DaysPastDueNotWorse'] = 15

In [None]:
test_lgb['NumberOfDependents'] = test_lgb['NumberOfDependents'].fillna(0)

In [None]:
test_lgb['MonthlyIncome'] = test_lgb['MonthlyIncome'].fillna(train_X['MonthlyIncome'].mean())

In [None]:
test_lgb.loc[test_lgb['NumberOfTime30-59DaysPastDueNotWorse']>15, 'NumberOfTime30-59DaysPastDueNotWorse'] = 15

In [None]:
test_lgb.loc[test_lgb['NumberOfTimes90DaysLate']>15, 'NumberOfTimes90DaysLate'] = 15

In [None]:
test_lgb['NumberOfTime60-89DaysPastDueNotWorse'] = test_lgb[['NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfTimes90DaysLate']].max(axis=1)
test_lgb['NumberOfTime30-59DaysPastDueNotWorse'] = test_lgb[['NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfTime30-59DaysPastDueNotWorse']].max(axis=1)
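
The test-set cells in this section repeat, one by one, the cleaning steps applied to the training set earlier. One way to keep the two in sync is to collect the steps in a single function and call it on both frames. The sketch below is an assumption about how that could look: `preprocess` and `income_fill` are hypothetical names, and the frames are made up.

```python
import numpy as np
import pandas as pd

PAST_DUE_COLS = [
    'NumberOfTime30-59DaysPastDueNotWorse',
    'NumberOfTime60-89DaysPastDueNotWorse',
    'NumberOfTimes90DaysLate',
]

def preprocess(df, income_fill):
    """Hypothetical helper: apply the notebook's cleaning steps to a copy of df."""
    out = df.copy()
    # Cap the past-due counts at 15.
    out[PAST_DUE_COLS] = out[PAST_DUE_COLS].clip(upper=15)
    # Fill income with a value learned from the training set, dependents with 0.
    out['MonthlyIncome'] = out['MonthlyIncome'].fillna(income_fill)
    out['NumberOfDependents'] = out['NumberOfDependents'].fillna(0)
    # Enforce 30-59 count >= 60-89 count >= 90+ count.
    c30, c60, c90 = PAST_DUE_COLS
    out[c60] = out[[c60, c90]].max(axis=1)
    out[c30] = out[[c60, c30]].max(axis=1)
    return out

c30, c60, c90 = PAST_DUE_COLS
toy_train = pd.DataFrame({c30: [0, 1], c60: [0, 0], c90: [0, 0],
                          'MonthlyIncome': [2000.0, 4000.0],
                          'NumberOfDependents': [0.0, 2.0]})
toy_test = pd.DataFrame({c30: [0], c60: [98], c90: [2],
                         'MonthlyIncome': [np.nan],
                         'NumberOfDependents': [np.nan]})

# The fill value always comes from the training data, never from the test data.
income_fill = toy_train['MonthlyIncome'].mean()
clean_test = preprocess(toy_test, income_fill)
print(clean_test)
```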

In [None]:
test_X = test_lgb[test_lgb.columns[2:]]

In [None]:
test_X = test_X.drop('age', axis =1)
test_X.info()

In [None]:
y_pred = clf_lgb_fc.predict_proba(test_X)[:,1]

In [None]:
sample = pd.read_csv('../input/GiveMeSomeCredit/sampleEntry.csv')
sample

In [None]:
sample['Probability'] = y_pred
sample.to_csv('./submit_lgb.csv',index=False)
reload = pd.read_csv('./submit_lgb.csv')
reload
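
Before uploading, it can be worth checking that the file has one row per test sample and that every probability is present and lies in [0, 1]. The sketch below shows such a check on a made-up three-row submission; `check_submission` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def check_submission(sub, expected_rows):
    """Hypothetical helper: basic sanity checks on a submission frame."""
    assert len(sub) == expected_rows, 'row count does not match the test set'
    assert sub['Probability'].notnull().all(), 'missing probabilities'
    assert sub['Probability'].between(0, 1).all(), 'probabilities outside [0, 1]'
    return True

# Made-up submission with the same two columns as sampleEntry.csv.
toy_submission = pd.DataFrame({'Id': [1, 2, 3], 'Probability': [0.08, 0.51, 0.02]})
ok = check_submission(toy_submission, expected_rows=3)
print(ok)
```

In the notebook the call would be `check_submission(reload, expected_rows=len(test))`.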