# Collection Scorecard Project

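## Environment setup
- [External dependencies](#env): helper code for model training and model evaluation, plus the dependency imports
    - model_trains: model training
    - model_estimate: model evaluation
    - start_load: dependency imports and utility function definitions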
    
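## [Load data](#load_data)

- [Column renaming](#将列名进行映射): columns are renamed to Chinese names for readability
- [Split the dataset and validation set](#切分数据集和验证集)
- [Inspect the sample distribution](#查看样本分布)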

## [Missing-value handling](#缺失值处理)


## Model training

### [1. Auxiliary model](#辅助模型)
- [Feature-importance ranking](#重要性排序)
- [IV ranking](#iv值排序)

## [Binning](#split_box)

- [Feature type classification](#特征分类)

<div id="env"></div>

## Environment setup
```shell
!mkdir -p libs/utils
!touch libs/__init__.py libs/utils/__init__.py
!wget -O libs/utils/model_trains.py https://gitee.com/mill_teacher/machine_learn/raw/master/card/libs/utils/model_trains.py
!wget -O libs/utils/model_estimate.py https://gitee.com/mill_teacher/machine_learn/raw/master/card/libs/utils/model_estimate.py
!wget -O start_load.py https://gitee.com/mill_teacher/machine_learn/raw/master/card/script/%E8%AF%84%E5%88%86%E5%8D%A1%E4%BB%A3%E7%A0%81/start_load.py
!pip install toad==0.0.61
```

In [None]:
!mkdir -p libs/utils
!touch libs/__init__.py libs/utils/__init__.py
!wget -O libs/utils/model_trains.py https://gitee.com/mill_teacher/machine_learn/raw/master/card/libs/utils/model_trains.py
!wget -O libs/utils/model_estimate.py https://gitee.com/mill_teacher/machine_learn/raw/master/card/libs/utils/model_estimate.py
!wget -O start_load.py https://gitee.com/mill_teacher/machine_learn/raw/master/card/script/%E8%AF%84%E5%88%86%E5%8D%A1%E4%BB%A3%E7%A0%81/start_load.py
!pip install toad==0.0.61
%run start_load.py

<div id="load_data"></div>

## Load data

In [None]:
data_path = os.path.join('../input/give-me-some-credit-dataset','cs-training.csv')
data = pd.read_csv(data_path,index_col=0)
data.shape

In [None]:
import toad

<div id="将列名进行映射"></div>

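### Map column names

field|description|type
---|---|---
SeriousDlqin2yrs|Person was 90 days or more past due|Y/N
RevolvingUtilizationOfUnsecuredLines|Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits|percentage
age|Age|integer
NumberOfTime30-59DaysPastDueNotWorse|Number of times the borrower was 30-59 days past due, but no worse, in the last 2 years|integer
DebtRatio|Debt ratio: monthly debt payments, alimony, and living costs divided by monthly gross income|percentage
MonthlyIncome|Monthly income|
NumberOfOpenCreditLinesAndLoans|Number of open loans (installment loans such as car loans or mortgages) and lines of credit (such as credit cards)|integer
NumberOfTimes90DaysLate|Number of times the borrower was 90 days or more past due|integer
NumberRealEstateLoansOrLines|Number of mortgage and real estate loans, including home equity lines of credit|integer
NumberOfTime60-89DaysPastDueNotWorse|Number of times the borrower was 60-89 days past due, but no worse, in the last 2 years|integer
NumberOfDependents|Number of dependents in the family, excluding the borrower (spouse, children, etc.)|integer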

In [None]:
column_map = {
    'SeriousDlqin2yrs':'target',
    'RevolvingUtilizationOfUnsecuredLines':'信用额度使用率',
    'age':'年龄',
    'NumberOfTime30-59DaysPastDueNotWorse':'逾期30-59天的次数',
    'DebtRatio':'负债率',
    'MonthlyIncome':'实际月收入',
    'NumberOfOpenCreditLinesAndLoans':'未结贷款的数量',
    'NumberOfTimes90DaysLate':'连续逾期90天以上的次数',
    'NumberRealEstateLoansOrLines':'抵押贷款笔数',
    'NumberOfTime60-89DaysPastDueNotWorse': '连续逾期60~90天的次数',
    'NumberOfDependents':'家庭人口数'
}
data = data.rename(columns=column_map)
data.describe().T

<div id="切分数据集和验证集"></div>

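### Split the dataset and validation set

Split the data into:
- Training set
- Test set
- Validation set. Normally the validation set is taken from an out-of-time (OOT) sample; here it is a stratified random split.

#### Split off the validation set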

In [None]:
train_data, oot_data = train_test_split(data,stratify=data['target'],random_state=47)
train_data.shape, oot_data.shape

#### Split the training and test sets

In [None]:
train_data, test_data = train_test_split(train_data,stratify=train_data['target'],random_state=47)
train_data.shape, test_data.shape

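### Tag each row with its split
> All of the data is processed together (for example missing-value imputation and outlier handling), but every row is tagged with its split so the sets can be separated again later.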

In [None]:
train_data['type']='train'
oot_data['type']='oot'
test_data['type']='test'
data = pd.concat([train_data, oot_data, test_data])
data.shape

<div id="查看样本分布"></div>

## Inspect the sample distribution

> The bad-sample rate is about 6.6%.

In [None]:
# Good/bad sample counts per split
samples_rate = data.groupby(['type','target']).size().reset_index(name='count')
# Total number of customers per split (avoids relying on the column names value_counts() produces)
samples_total = data.groupby('type').size().reset_index(name='total')
samples_cal_pd = pd.merge(samples_rate,samples_total,on='type')
samples_cal_pd['rate']=samples_cal_pd['count']/samples_cal_pd['total'] # share of good/bad customers within each split
samples_cal_pd

<div id="缺失值处理"></div>

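## Missing-value handling

Common approaches:
1. Drop the feature when more than 90% of its values are missing
2. When more than 50% are missing, treat "missing" as a category of its own
3. Let binning absorb the missing values
4. Fill with a fixed value; this is common because missingness often carries meaning
5. Fill with the median
6. Impute with KNN or a random forest

In this case the missing values in `实际月收入` and `家庭人口数` carry no real meaning and the missing rates are low, so imputation is the appropriate choice.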
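
The rules of thumb above can be sketched as a small decision function. This is only an illustration: the function names `missing_value_strategy` and `median_fill` are hypothetical, the thresholds are the ones listed above, and real projects usually also consider whether missingness is informative before choosing.

```python
def missing_value_strategy(missing_rate):
    """Map a column's missing rate to one of the rule-of-thumb actions."""
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be between 0 and 1")
    if missing_rate > 0.9:
        return "drop"                # too little information left
    if missing_rate > 0.5:
        return "separate_category"   # treat "missing" as its own class
    if missing_rate > 0.0:
        return "impute"              # fixed value, median, KNN, ...
    return "keep"


def median_fill(values):
    """Replace None entries with the median of the observed values."""
    observed = sorted(v for v in values if v is not None)
    if not observed:
        raise ValueError("no observed values to compute a median from")
    mid = len(observed) // 2
    if len(observed) % 2:
        median = observed[mid]
    else:
        median = (observed[mid - 1] + observed[mid]) / 2
    return [median if v is None else v for v in values]
```

For example, a column with a 20% missing rate maps to `"impute"`, and `median_fill([1, None, 3, 5])` fills the gap with the median of the observed values.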

In [None]:
missing_data = pd.DataFrame(data.isnull().sum()).reset_index().rename(columns={'index':'column',0:'count'})
missing_data_sortd = missing_data.sort_values('count',ascending=False)
missing_data_sortd['missing_rate'] = missing_data_sortd['count']/data.shape[0]
missing_data_sortd

### Attempt: KNN imputation
This could not be computed because it consumed too much memory.
```python
from sklearn.impute import KNNImputer

train_columns = list(set(data.columns)-{'target','type'})  # .loc expects a list-like indexer, not a set

imputer =  KNNImputer(n_neighbors=5)

data.loc[:,train_columns] = imputer.fit_transform(data.loc[:,train_columns])
```

### Attempt: XGBoost imputation
This also used too much memory and could not be computed.
```python
train_columns = list(set(data.columns)-{'target','type','实际月收入'})

valide_data = data.loc[data['实际月收入'].isnull(),:]
train_data = data.loc[data['实际月收入'].notnull(),:]

x_train,x_test,y_train,y_test = train_test_split(train_data[train_columns].values,train_data['实际月收入'].values,random_state=37)

xgb_model(x_train,y_train,x_test,y_test,estimators=100)
```

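### Use mean imputation for now

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer()  # the default strategy is 'mean'
train_columns = [c for c in data.columns if c not in ('target', 'type')]  # a list keeps column order and is a valid .loc indexer
data.loc[:,train_columns] = imputer.fit_transform(data.loc[:,train_columns])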

In [None]:
missing = data.isnull().sum()
missing[missing>0]

<div id="异常值处理"></div>

## Outlier handling

Inspect the distribution of the data.

In [None]:
plt.rcParams['font.sans-serif']=['Droid Sans Fallback' ] 
plt.rcParams['axes.unicode_minus'] = False 
plt.rcParams['font.family'] = ['Times New Roman']
plt.rcParams.update({'font.size': 8}) 
number_features = data[train_columns].select_dtypes(['float','int']).columns
plot_distplot(data[number_features])

### Box plots to inspect long tails

In [None]:
plot_box(data[number_features])

<div id="辅助模型"></div>

## Auxiliary model

In [None]:
X_train,X_test,y_train,y_test = train_test_split(data[train_columns],data['target'])

xgb_model_obj,xgb_test_pred = xgb_model(X_train,y_train,X_test,y_test)

<div id="重要性排序"></div>

### XGBoost feature-importance ranking

In [None]:
from xgboost import plot_importance
_, ax = plt.subplots(figsize=(12,8))
plot_importance(xgb_model_obj, ax=ax)

### Feature importance as a table

In [None]:
importants = xgb_model_obj.get_booster().get_score()
import_features = pd.DataFrame(importants, index=['import']).T.reset_index().rename(columns={"index":"name"})
import_features.sort_values('import',ascending=False)

<div id="iv值排序"></div>

### IV ranking

In [None]:
toad.quality(data.drop('type',axis=1), cpu_cores=1, iv_only=True)
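
For readers unfamiliar with IV, the following is a minimal sketch of how weight of evidence (WOE) and information value are commonly computed from per-bin good/bad counts. It is not toad's implementation: the function name `woe_iv` is hypothetical, the sign convention (bad share over good share) is an assumption, and bins with zero counts are not handled.

```python
import math


def woe_iv(bins):
    """bins: list of (good_count, bad_count) per bin. Returns (woe_list, iv)."""
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        good_share = good / total_good   # share of all good customers in this bin
        bad_share = bad / total_bad      # share of all bad customers in this bin
        woe = math.log(bad_share / good_share)   # assumed convention: ln(bad% / good%)
        woes.append(woe)
        iv += (bad_share - good_share) * woe     # each term is non-negative
    return woes, iv
```

A feature whose bins all have the same good/bad mix gets an IV of 0; the more the bins separate good from bad customers, the larger the IV.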

<div id="split_box"></div>

## Binning

### Preliminary screening
Before binning, run one round of feature screening to reduce the amount of binning work.

In [None]:
selected, drop_list = toad.select(data, return_drop=True, iv=0.03, corr=1,exclude=['type'])
drop_list

<div id="特征分类"></div>

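## Classify features by type
- Binary
- Categorical
- Numeric

Only numeric features need to be binned. This case has no categorical features; all of them are numeric.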

In [None]:
pd.DataFrame(data.drop(['target','type'], axis=1).nunique().sort_values())

#### Sample split

Only the training set is used to fit the bins; the test set and validation set are used to check them.

In [None]:
train_data = data.loc[data['type']=='train',:]
oot_data = data.loc[data['type']=='oot',:]
test_data = data.loc[data['type']=='test',:]

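### Automatic chi-square binning

Chi-square binning is used; the `min_samples` parameter sets the minimum sample size of a bin. The fitted combiner is cached on disk so that the notebook can be re-run without refitting.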
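
As background, chi-square binning (ChiMerge) repeatedly merges the pair of adjacent bins whose good/bad distributions are most similar. The sketch below shows one merge step on plain `(good, bad)` counts. It is an illustration under simplifying assumptions, not toad's code: the function names are hypothetical and stopping rules such as `min_samples` are left out.

```python
def chi2_adjacent(bin_a, bin_b):
    """Chi-square statistic for two adjacent bins given as (good, bad) counts."""
    col_totals = [bin_a[0] + bin_b[0], bin_a[1] + bin_b[1]]
    n = sum(col_totals)
    chi2 = 0.0
    for row in (bin_a, bin_b):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2


def merge_most_similar(bins):
    """One ChiMerge step: merge the adjacent pair with the smallest chi-square."""
    if len(bins) < 2:
        return list(bins)
    scores = [chi2_adjacent(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
    i = scores.index(min(scores))
    merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
    return list(bins[:i]) + [merged] + list(bins[i + 2:])
```

Two bins with identical bad rates give a statistic of 0 and are merged first; bins with very different bad rates give a large statistic and stay separate.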

In [None]:
import pickle
MODEL_PATH = './combiner_model_v1.pkl'
if os.path.exists(MODEL_PATH):
    with open(MODEL_PATH,'rb') as f:
        combiner = pickle.load(f)
else:
    combiner = toad.transform.Combiner()
    print('start fit...')
    combiner.fit(train_data[train_columns], train_data['target'], method='chi', min_samples=0.05)
    print('end fit...')
    with open(MODEL_PATH, 'wb') as f:
        pickle.dump(combiner, f)

### Bin split points

In [None]:
bin = combiner.export()
bin

In [None]:
def transform(data):
    '''
    Apply the fitted bins to the data
    '''
    data_number = data.copy() # work on a copy, because this function is called several times
    data_number.loc[:,train_columns] = combiner.transform(data[train_columns])
    return data_number

In [None]:
data_number = transform(data)

In [None]:
def bin_badrate_plot(data, col, t='type', target='target'):
    '''
    Plot the binning result for one feature
    '''
    badrate_plot(data,x=t, target=target, by=col)
    data_train = data.loc[data[t]=='train',:]  # use the `t` argument instead of a hard-coded column name
    bin_plot(data_train, x=col, annotate_format='.2f')
    bin_bg_plot(data_train, col)

#### Features that are not stable

In [None]:
list(filter(lambda x: not is_stable(data_number, x),number_features))

#### Features that are not monotonic

In [None]:
list(filter(lambda x: not is_monotonic(data_number, x), number_features))
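
The `is_monotonic` helper comes from the downloaded utility code and its implementation is not shown here. As an assumption about what such a check typically does, the sketch below tests whether the bad rate moves in a single direction across ordered bins; the function names are hypothetical.

```python
def bad_rates(bin_counts):
    """bin_counts: list of (good, bad) counts per bin, ordered by bin index."""
    return [bad / (good + bad) for good, bad in bin_counts]


def is_monotonic_sketch(rates):
    """True if the rates never decrease or never increase across consecutive bins."""
    pairs = list(zip(rates, rates[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)
```

A feature whose bad rate rises and then falls across its bins fails this check, which is the usual reason for adjusting the split points by hand.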

In [None]:
bin['实际月收入'] # inspect the split points of this feature

In [None]:
ajd_bin = {
    '抵押贷款笔数':[1.0, ],
    '负债率': [0.020232659,  0.406828251, 0.50950794, ]
    ,'未结贷款的数量':[3.0, 5.0, ]
    ,'实际月收入': [ 4839.0, 7542.0]
}
combiner.set_rules(ajd_bin)
data_number = transform(data)

In [None]:
bin_badrate_plot(data_number, "实际月收入")

In [None]:
features_count_lst = []
for i in set(data_number.columns) - {'target','type'}:
    # one row per bin: 'index' holds the bin number, 'count' the number of samples in it
    tmp = data_number[i].value_counts().rename_axis('index').reset_index(name='count')
    tmp['feature']=i
    features_count_lst.append(tmp)
box_score_pd = pd.concat(features_count_lst)

In [None]:
t = toad.transform.WOETransformer()
data_number.loc[:,train_columns] = t.fit_transform(data_number[train_columns], data_number['target'])

In [None]:
woe_map = t.export()

In [None]:
data_selected, drop_lst = toad.selection.select(data_number[train_columns]
                                                , data_number['target']
                                                , empty=0.6
                                                , iv=0.002
                                                , corr=0.7
                                                , return_drop=True)
drop_lst

In [None]:
test_data.head()

In [None]:
train_data = data_number.loc[data['type']=='train',:]
oot_data = data_number.loc[data['type']=='oot',:]
test_data = data_number.loc[data['type']=='test',:]

In [None]:
# model, val_pred = xgb_model(train_data[train_columns],train_data['target'],test_data[train_columns],test_data['target'])

In [None]:
y_test_pred, y_train_pred, lr_model_obj = lr_model(train_data[train_columns],train_data['target'],test_data[train_columns],test_data['target'],C=0.1)

## Parameter search

In [None]:
param_gric = [
    {"penalty":["l2"],"solver":['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],"C":np.arange(0.5,1.5,0.1)}
]
grid_search = GridSearchCV(lr_model_obj, param_gric, n_jobs=1, verbose=1)
grid_search.fit(train_data[train_columns], train_data['target'])

In [None]:
grid_search.best_params_

In [None]:
y_test_pred, y_train_pred, lr_model_obj = lr_model(train_data[train_columns],train_data['target'],test_data[train_columns],test_data['target'],**grid_search.best_params_)

In [None]:
import pickle
LR_MODEL_PATH = './lr_model.plk'
with open(LR_MODEL_PATH,'wb') as f:
    pickle.dump(lr_model_obj, f)

In [None]:
%who DataFrame

In [None]:
pred_train = lr_model_validation(lr_model_obj, oot_data[train_columns], oot_data['target'])

In [None]:
toad.metrics.PSI(y_train_pred,pred_train)

In [None]:
toad.metrics.PSI(y_train_pred,y_test_pred)

In [None]:
toad.metrics.PSI(y_test_pred,pred_train)
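
The three cells above compare the predicted-probability distributions of the training, test, and OOT sets using the population stability index (PSI). The sketch below shows the usual PSI formula on equal-width bins. It is an illustration only: `psi` is a hypothetical function, and the binning that `toad.metrics.PSI` applies internally may differ.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """PSI between two lists of probabilities in [0, 1], using equal-width bins."""
    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = min(int(v * n_bins), n_bins - 1)  # v == 1.0 falls into the last bin
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or a division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))
```

Identical distributions give a PSI of 0. A commonly quoted rule of thumb treats values below 0.1 as stable and values above 0.25 as a significant shift.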

In [None]:
PDO = (1000-0)/np.log2(9999/(1/9999))
B = PDO/np.log(2)
A = 0-B*np.log(1/9999)
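
The constants above pin the score range to 0-1000 for odds between 9999:1 and 1:9999. The sketch below reproduces them with the standard library and maps a predicted bad probability to a score. `probability_to_score` is a hypothetical stand-in; the notebook's own `get_score_with_model` helper is defined in the downloaded code and may differ in detail.

```python
import math

ODDS_LIMIT = 9999  # the score endpoints correspond to odds of 9999:1 and 1:9999
PDO = 1000 / math.log2(ODDS_LIMIT / (1 / ODDS_LIMIT))
B = PDO / math.log(2)
A = 0 - B * math.log(1 / ODDS_LIMIT)


def probability_to_score(p_bad):
    """Map a predicted bad probability (0 < p_bad < 1) to a score; higher is better."""
    log_odds_bad = math.log(p_bad / (1 - p_bad))  # bad-to-good log odds
    return A - B * log_odds_bad
```

Under these assumptions a bad probability of 0.5 maps to the midpoint `A`, and the score rises as the predicted bad probability falls.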

## Compute the total score

In [None]:
score_pd = pd.DataFrame({'score':pd.Series(y_test_pred).apply(get_score_with_model, args=(A,B)).values
                        , 'possi':y_test_pred
                        , 'real': test_data['target'].values})
score_pd.head()

In [None]:
box_score_pd['woe'] = box_score_pd.apply(lambda row:woe_map.get(row['feature']).get(row['index']),axis=1)

In [None]:
def get_combine_max(row):
    '''
    Upper bound of the value range of each bin of each feature, taken from the bin split points
    '''
    tmp = bin.get(row['feature'])
    if tmp is None:
        return row['index']
    idx = int(row['index'])
    if len(tmp)<=idx:
        return np.inf
    return tmp[idx]

In [None]:
def get_combine_min(row):
    '''
    Lower bound of the value range of each bin of each feature, taken from the bin split points
    '''
    tmp = bin.get(row['feature'])
    if tmp is None:
        return row['index']
    idx = row['index']-1
    if idx<0:
        return -np.inf
    return tmp[int(idx)]

In [None]:
bin = combiner.export()
box_score_pd['min']=box_score_pd.apply(get_combine_min, axis=1)
box_score_pd['max']=box_score_pd.apply(get_combine_max, axis=1)

In [None]:
box_score_pd = box_score_pd.sort_values(['feature','index'])  # sort by feature, then by bin number

In [None]:
feature_cnt = len(train_columns)

In [None]:
a = lr_model_obj.intercept_

In [None]:
b_dict = dict(zip(list(X_train.columns), lr_model_obj.coef_.tolist()[0])) # coefficient of each feature

In [None]:
box_score_pd['a']=a[0]
box_score_pd['b']=box_score_pd['feature'].apply(lambda x: b_dict.get(x))

In [None]:
def cal_score(row,A,B):
    '''
    Compute the score of one bin. Always double-check feature_cnt; getting this value wrong cost a whole afternoon.
    '''
    woe,b,a = row[['woe','b','a']]
    return A/feature_cnt - B*(woe*b+a/feature_cnt)

In [None]:
box_score_pd['score'] = box_score_pd.apply(cal_score, axis=1, args=(A,B))
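
The reason `feature_cnt` matters is that the per-bin scores must add up to the total score: the offset `A` and the intercept are each split evenly across the features. The sketch below checks that identity with the same arithmetic as `cal_score`. The function names are hypothetical and the numbers in the usage note are made up for illustration.

```python
def feature_points(woe, coef, intercept, n_features, A, B):
    """Points for one bin of one feature (same arithmetic as cal_score)."""
    return A / n_features - B * (woe * coef + intercept / n_features)


def total_score(woes, coefs, intercept, A, B):
    """Score from the full logit: A - B * (intercept + sum(coef_i * woe_i))."""
    logit = intercept + sum(c * w for c, w in zip(coefs, woes))
    return A - B * logit
```

With, say, three features, summing `feature_points` over the three bins a customer falls into gives the same value as `total_score`, provided `n_features` is 3. Any other value of `n_features` breaks the identity.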

In [None]:
from IPython.display import HTML
HTML(box_score_pd.to_html())

## Model evaluation

### KS statistic

In [None]:
from libs.utils.model_estimate import model_monotony,calculate_ks
ks_value, probability, crossdens = calculate_ks(y_test_pred, test_data['target'])
ks_value
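
`calculate_ks` is defined in the downloaded `model_estimate.py` and is not shown here. As a rough sketch of the underlying idea, the KS statistic is the largest gap between the cumulative share of bad customers and the cumulative share of good customers when samples are ordered by predicted probability. The function below is a hypothetical illustration that assumes 0/1 labels with 1 meaning bad, and it returns only the statistic and the threshold at which it occurs.

```python
def ks_statistic(y_true, y_prob):
    """Return (ks, threshold): the max cumulative bad/good gap and where it occurs."""
    n_bad = sum(y_true)
    n_good = len(y_true) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("need at least one good and one bad sample")
    # highest predicted bad probability first
    pairs = sorted(zip(y_prob, y_true), key=lambda t: t[0], reverse=True)
    cum_bad = cum_good = 0
    best_ks, best_threshold = 0.0, None
    for i, (prob, label) in enumerate(pairs):
        if label == 1:
            cum_bad += 1
        else:
            cum_good += 1
        # evaluate only at the last row of a run of tied probabilities
        if i + 1 < len(pairs) and pairs[i + 1][0] == prob:
            continue
        gap = abs(cum_bad / n_bad - cum_good / n_good)
        if gap > best_ks:
            best_ks, best_threshold = gap, prob
    return best_ks, best_threshold
```

The returned threshold plays the same role as `probability` in the cells that follow: predictions at or above it are classified as bad.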

## Classification report

In [None]:
from sklearn.metrics import classification_report
rs = classification_report(y_pred=np.where(y_test_pred>=probability,1,0),y_true=test_data['target'])
print(rs)

In [None]:
from sklearn.metrics import confusion_matrix,recall_score

In [None]:
pd.DataFrame(confusion_matrix(test_data['target'],y_pred=np.where(y_test_pred>=probability,1,0)))

In [None]:
recall_score(test_data['target'],y_pred=np.where(y_test_pred>=probability,1,0))

In [None]:
ks_score = get_score_with_model(probability, A, B)
ks_score, ks_value, probability

In [None]:
ks_bucket_pd = cal_lift(y_test_pred, test_data['target'], A, B, ks_score)

In [None]:
ks_bucket_pd.applymap(lambda x: round(x,2))

In [None]:
oot_ks_bucket_pd = cal_lift(pred_train,oot_data['target'],A,B,ks_score)

In [None]:
oot_ks_bucket_pd.applymap(lambda x: round(x, 2))

## Features in the model

In [None]:
train_cols_ivs = toad.quality(data_number[list(train_columns)+['target']],iv_only=True)
train_cols_ivs

## Pick a few sample customers

In [None]:
test_data['score'] = pd.Series(y_test_pred).apply(get_score_with_model, args=(A,B)).values

In [None]:
good_sample_idx = test_data.sort_values('score',ascending=False).head(2).index # indices of the two highest-scoring (good) customers

good_sample = data.loc[good_sample_idx, list(train_columns)+['target']].T # inspect the good samples
good_sample

In [None]:
bad_sample_idx = test_data.sort_values('score',ascending=True).head(2).index # indices of the two lowest-scoring (bad) customers

bad_sample = data.loc[bad_sample_idx, list(train_columns)+['target']].T # inspect the bad samples
bad_sample

In [None]:
def cal_score_split(row):
    score_lst = list(map(lambda idx: box_score_pd.loc[(box_score_pd['feature']==idx[0])&(box_score_pd['index']==idx[1]),'score'].values[0],zip(row.index,row)))
    return pd.Series(score_lst,index=row.index) # align on the index so the result can be concatenated

In [None]:
def get_sample_detail(sample):
    data_train = combiner.transform(sample.T) # apply the bins
    # result_type='expand' expands list-like results into DataFrame columns; it is not strictly needed here because a Series is returned
    score_pd = data_train[train_columns].apply(cal_score_split, axis=1,result_type='expand').T
    return pd.concat([score_pd, sample],axis=1)

In [None]:
good_sample_pd = get_sample_detail(good_sample).applymap(lambda x: round(x,2)) # good samples
bad_sample_pd = get_sample_detail(bad_sample).applymap(lambda x: round(x,2)) # bad samples

tpl_samples_pd = pd.concat([good_sample_pd, bad_sample_pd, train_cols_ivs], axis=1) # combine

tpl_samples_pd.drop(['gini','entropy'], axis=1, inplace=True) # drop the empty columns

In [None]:
tpl_samples_pd.sort_values('iv',ascending=False) # one of the selected "bad" samples turns out to be a false positive

## Print dependency versions

In [None]:
import xgboost as xgb
import sklearn as sk
for m in {inspect, math, np,sk, os, pd, pickle, plt, re, relativedelta, sns, sys, toad, warnings, xgb}:
    try:
        print("{}--{}".format(m.__name__, m.__version__))
    except:
        pass
%conda -V
!python -V


## Score derivation

In [None]:
from sympy import *

import sympy as sy

init_session(use_latex=True)

In [None]:
A,B,Odds,PDO,Score,p = symbols("A,B,Odds,PDO,Score,p")

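- Odds: the good-to-bad ratio or the bad-to-good ratio
- PDO: when the odds double, the score increases by PDO

In [None]:
Eq(Score, A+B*log(Odds)) # base formula

In [None]:
Eq(PDO+Score, A+B*log(2*Odds)) # when the odds double, the score increases by PDO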

#### Define Odds and solve for B
Let Odds be the bad-to-good ratio: the customer is worst when Odds = 9999/1 and best when Odds = 1/9999.

In [None]:
Eq(0,A+B*log(Odds)).subs({Odds:(9999/1)}) # the worst customer scores 0

In [None]:
Eq(1000,A+B*log(Odds)).subs({Odds:(1/9999)}) # the best customer scores 1000

In [None]:
exp1 = Eq((A+B*log(1/9999))-(A+B*log(9999/1)),1000) # subtracting the two equations above eliminates A and leaves B
exp1

In [None]:
BVal = solve(exp1)[0] # solve for B
BVal

#### Solve for A given B

Either of the two cases below gives the same value of A.

In [None]:
exp2 = Eq(Score,A+B*log(Odds)).subs({Score:0, Odds:9999/1, B:BVal}) # odds that correspond to a score of 0
exp2

In [None]:
Eq(Score,A+B*log(Odds)).subs({Score:1000, Odds:1/9999, B:BVal}) # odds that correspond to a score of 1000

In [None]:
AVal = solve(exp2)[0]
AVal

#### Compute PDO

PDO and B determine each other: once B is known, PDO follows.

In [None]:
exp3 = Eq(PDO,(A+B*log(2*Odds))-(A+B*log(Odds)))
exp3

In [None]:
PDOVal = solve(exp3.subs({B:BVal}))[0][PDO]
PDOVal
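
The relation between PDO and B can also be worked out by hand, which is a useful check on the symbolic result. Subtracting the base formula from the doubled-odds formula gives

$$PDO = \left(A + B\ln(2\,Odds)\right) - \left(A + B\ln(Odds)\right) = B\ln 2$$

With the endpoints used in this notebook (score 0 at Odds = 9999 and score 1000 at Odds = 1/9999, with Odds taken as the bad-to-good ratio), the values are approximately:

$$1000 = B\left(\ln\tfrac{1}{9999} - \ln 9999\right) = -2B\ln 9999 \;\Rightarrow\; B = -\frac{500}{\ln 9999} \approx -54.29$$

$$A = -B\ln 9999 = 500, \qquad PDO = B\ln 2 \approx -37.63$$

The negative sign only reflects that Odds is the bad-to-good ratio here, so doubling it lowers the score. The numeric cell earlier in the notebook uses the same magnitudes with B and PDO positive and applies the minus sign explicitly.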

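### Deriving per-feature scores from logistic regression

At this point the total score can already be computed. Scoring each feature separately takes a little more derivation.

In [None]:
Eq(p,1/(1+E**-g(x))) # the sigmoid function used by logistic regression

Let p be the probability that a customer is bad (the model predicts `target`, where 1 means bad), so 1-p is the probability that the customer is good. Consistent with the bad-to-good definition above, the Odds are:
$Odds = \frac{p}{1 - p}$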

In [None]:
exp3 = Eq(Score, A+B*log(Odds)).subs({Odds:(p/(1-p))})
exp3

In [None]:
exp4 = exp3.subs({p: 1/(1+E**-g(x))}) # substitute the sigmoid definition of p
exp4

In [None]:
simplify(exp4) # after simplification

In [None]:
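Eq(Score,A+B*g(x)) # log and exp cancel

$$Score = A + B g{\left(x \right)}$$
$$g{\left(x \right)} = b + \sum_{i=1}^{n} w_{i} x_{i}$$
$$Score = \sum_{i=1}^{n} \left[\frac{A}{n} + B \left(\frac{b}{n} + w_{i} x_{i}\right)\right]$$

With Odds defined as the bad-to-good ratio, B is negative in this derivation. `cal_score` takes B as a positive magnitude instead, so the score of feature $i$ appears there as $\frac{A}{n} - B\left(\frac{b}{n} + w_{i} x_{i}\right)$, where $x_{i}$ is the WOE of the bin.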

