## Machine Learning for Credit Scoring


Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which estimate the probability of default, are the method banks use to decide whether a loan should be granted. This competition asks participants to improve on the state of the art in credit scoring by predicting the probability that a borrower will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

Attribute Information:

|Variable Name	|	Description	|	Type|
|----|----|----|
|SeriousDlqin2yrs	|	Person experienced 90 days past due delinquency or worse 	|	Y/N|
|RevolvingUtilizationOfUnsecuredLines	|	Total balance on credit divided by the sum of credit limits	|	percentage|
|age	|	Age of borrower in years	|	integer|
|NumberOfTime30-59DaysPastDueNotWorse	|	Number of times borrower has been 30-59 days past due |	integer|
|DebtRatio	|	Monthly debt payments, alimony, and living costs divided by monthly gross income	|	percentage|
|MonthlyIncome	|	Monthly income	|	real|
|NumberOfOpenCreditLinesAndLoans	|	Number of open loans and lines of credit |	integer|
|NumberOfTimes90DaysLate	|	Number of times borrower has been 90 days or more past due.	|	integer|
|NumberRealEstateLoansOrLines	|	Number of mortgage and real estate loans	|	integer|
|NumberOfTime60-89DaysPastDueNotWorse	|	Number of times borrower has been 60-89 days past due |integer|
|NumberOfDependents	|	Number of dependents in family	|	integer|


Read the data into Pandas 

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 500)
import zipfile
with zipfile.ZipFile('KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f, index_col=0)
data.head()

Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,0,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,0,0.65818,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,0,0.23381,30.0,0.0,0.03605,3300.0,5.0,0.0,0.0,0.0,0.0
4,0,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0


In [2]:
data.shape

(112915, 11)

Drop rows with missing values

In [3]:
# Collapse data into one row of statistics: the number of NaN values in each column. age and NumberOfDependents both contain many invalid entries.
data.isnull().sum(axis=0)

SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [4]:
data.dropna(inplace=True)  # drop the rows that contain invalid (NaN) entries
data.shape

(108648, 11)
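
The shape change is worth a sanity check: 112915 - 108648 = 4267 rows were dropped, which equals the NaN count of each affected column, so the missing `age` and `NumberOfDependents` values must sit on the same rows. A minimal sketch of that check on a made-up toy frame (the values in `toy` are illustrative assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only): NaNs in both columns fall on the same rows.
toy = pd.DataFrame({
    "age": [45.0, np.nan, 38.0, np.nan],
    "NumberOfDependents": [2.0, np.nan, 0.0, np.nan],
    "DebtRatio": [0.80, 0.12, 0.09, 0.04],
})

nan_per_column = toy.isnull().sum(axis=0)                # NaN count for each column
rows_with_any_nan = int(toy.isnull().any(axis=1).sum())  # rows that dropna() would remove
cleaned = toy.dropna()                                   # returns a new frame; toy is unchanged

print(nan_per_column)
print("rows removed:", len(toy) - len(cleaned), "rows with any NaN:", rows_with_any_nan)
```

If the number of rows removed were larger than the per-column NaN counts, the missing values would be spread over different rows.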

Create X and y

In [5]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

In [6]:
y.mean()

0.06742876076872101
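
Only about 6.7% of borrowers are in the positive class, so accuracy alone is a weak yardstick: a model that always predicts 0 is already right about 93.3% of the time. A small sketch of that baseline (the helper name `majority_baseline_accuracy` is hypothetical, introduced here only for illustration):

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy of always predicting the most frequent class of a 0/1 label vector."""
    positive_rate = float(np.mean(y))
    return max(positive_rate, 1.0 - positive_rate)

# Using the positive rate reported by y.mean() in the cell above.
positive_rate = 0.06742876076872101
print("always-predict-0 accuracy:", 1.0 - positive_rate)

# The same computation from raw labels, on a made-up vector with 1 positive in 10.
print(majority_baseline_accuracy([0] * 9 + [1]))
```

Any classifier evaluated by accuracy on this data is best compared against this figure rather than against 50%.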

# Exercise 1

Split the data into a training set and a test set.

In [7]:
from sklearn.model_selection import train_test_split

num_folds = 5
num_folds_train = 4
# Expected size of a 4/5 training split; this is a float and is only printed for reference.
num_X_train = X.shape[0]/num_folds*num_folds_train
print("data items, num_data_train", X.shape[0], num_X_train)
# Python 2.7 version, passing an absolute row count:
#X_train, X_test, y_train, y_test  = train_test_split(X, y, train_size=num_X_train, random_state=22)
# Python 3.5 version, passing a fraction so that 80% of the rows go to the training set:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=22)
print("X_train shape", X_train.shape)
print("y_train shape", y_train.shape)
print("X_test shape", X_test.shape)
print("y_test shape", y_test.shape)

#X_train.head()
# train_test_split also accepts another form of input:
# train_test_split(X, y, train_size=0.8, random_state=22)  # this also splits data into train and test at a 4:1 ratio



data items, num_data_train 108648 86918.4
X_train shape (86918, 10)
y_train shape (86918,)
X_test shape (21730, 10)
y_test shape (21730,)
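
With so few positives, a purely random split can leave the two sets with slightly different positive rates. One option, shown here as a sketch on synthetic data rather than as this notebook's method, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. `X_demo` and `y_demo` are made-up stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(22)
X_demo = rng.normal(size=(1000, 3))   # synthetic features
y_demo = np.zeros(1000, dtype=int)
y_demo[:70] = 1                       # 7% positives, roughly like the real data

# stratify=y_demo preserves the 7% positive rate in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=22
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```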




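# Exercise 2
Classify the data with sklearn classification algorithms such as logistic regression, decision trees, SVM, KNN, and so on. Look up the sklearn API to understand what the model parameters mean, and try adjusting them.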

In [8]:
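# Can a plain linear model not handle this?
from sklearn.linear_model import LinearRegression
linr = LinearRegression()
linr.fit(X_train, y_train)

from sklearn.linear_model import LogisticRegression
logr = LogisticRegression(penalty='l1',        # use the L1 norm as the regularization penalty
                          solver='liblinear',  # a solver that supports the L1 penalty
                          C=1                  # inverse of regularization strength
                         )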

In [9]:
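# decision tree
"""
The scikit-learn decision tree library is implemented with an optimized version of the CART algorithm,
which can be used for both classification and regression.
The classification tree class is DecisionTreeClassifier and the regression tree class is DecisionTreeRegressor.
Their parameter definitions are almost identical, but the meanings are not entirely the same.

"""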
from sklearn import tree
dtc = tree.DecisionTreeClassifier()
dtc.fit(X_train,y_train)


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

## pydotplus error

# Raises "ImportError: No module named 'pydotplus'" when pydotplus is not installed.
import pydotplus
from IPython.display import Image, display
dot_data = tree.export_graphviz(dtc, out_file=None,
                         feature_names=list(X.columns),
                         class_names=['normal', 'SeriousDlqin2yrs'],  # ordered as class 0, class 1
                         filled=True, rounded=True,
                         special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
display(Image(graph.create_png()))


## SVM cost
Training with the SVM loss and SGD takes too much time to run here, so the code below is left commented out.

In [10]:
# Takes too much time to run, so the SVM cell is commented out.
#from sklearn import svm
#svm_clf = svm.SVC()
#svm_clf.fit(X_train, y_train)


In [11]:
#svm_clf.

In [12]:
#knn
from sklearn.neighbors import KNeighborsClassifier 
knn = KNeighborsClassifier()
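
Exercise 2 asks for trying different parameter values. One hedged way to do that systematically is a small grid search with cross-validation. The sketch below runs on synthetic data from `make_classification`; the grid values, `cv=3`, and the `roc_auc` scoring choice are assumptions for illustration, not tuned settings for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the credit data (about 10% positives).
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.9, 0.1], random_state=22)

# Candidate values for C, the inverse of the regularization strength.
param_grid = {"C": [0.01, 0.1, 1, 10]}

# Each candidate is scored by 3-fold cross-validated ROC AUC.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=3, scoring="roc_auc")
search.fit(X_demo, y_demo)

print("best C:", search.best_params_["C"])
print("best mean CV AUC:", search.best_score_)
```

The same pattern applies to the other estimators, for example a grid over `n_neighbors` for `KNeighborsClassifier` or `max_depth` for `DecisionTreeClassifier`.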



# Exercise 3
Make predictions on the test set and compute the accuracy.

In [13]:
# linear regression: score() returns the R^2 coefficient of determination, not classification accuracy
linr_train_score = linr.score(X_train,y_train)
print("linr_train_score",linr_train_score)

linr_test_score = linr.score(X_test,y_test)
print("linr_test_score",linr_test_score)

linr_train_score 0.055950252318685
linr_test_score 0.06440636702278191


In [14]:
logr.fit(X_train,y_train)
#logr_results = logr.predict_proba(X_train)
#logr_results.shape
logr_score_train = logr.score(X_train,y_train)
print("logr_score_train", logr_score_train)

logr_score_test = logr.score(X_test,y_test)
print("logr_score_test", logr_score_test)

logr_score_train 0.9337651579649785
logr_score_test 0.9310170271514036


In [15]:
train_pred_m = dtc.predict_proba(X_train)
train_pred = dtc.score(X_train,y_train)  # mean accuracy on the training set
print("train pred matrix", train_pred_m)
print("train score", train_pred)

test_pred_m = dtc.predict_proba(X_test)
test_pred_p = dtc.score(X_test,y_test)  # mean accuracy on the test set
print("test pred matrix", test_pred_m)
print("test score", test_pred_p)

train pred matrix [[1. 0.]
 [1. 0.]
 [1. 0.]
 ...
 [1. 0.]
 [1. 0.]
 [1. 0.]]
train score 1.0
test pred matrix [[1. 0.]
 [1. 0.]
 [1. 0.]
 ...
 [1. 0.]
 [1. 0.]
 [1. 0.]]
test score 0.8931891394385642
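
A training accuracy of 1.0 against a test accuracy of about 0.893 suggests that the unconstrained tree has memorized the training set. A sketch of how limiting `max_depth` changes this, run on synthetic data; the depth of 3 is an arbitrary illustrative value, not a recommendation for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data with some label noise (flip_y) so that memorizing hurts.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.9, 0.1], flip_y=0.05,
                                     random_state=22)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          train_size=0.8, random_state=22)

scores = {}
for depth in (None, 3):  # None grows the tree until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=22).fit(X_tr, y_tr)
    scores[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print("max_depth =", depth, "train/test accuracy:", scores[depth])
```

Comparing the train and test columns for each depth shows how large the gap between them is; other parameters such as `min_samples_leaf` can be explored the same way.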


In [16]:
knn.fit(X_train,y_train)
knn_train_pred_m = knn.predict_proba(X_train)
knn_train_pred_p = knn.score(X_train,y_train)
knn_test_pred_m = knn.predict_proba(X_test)
knn_test_pred_pred = knn.predict(X_test)
knn_test_pred_p = knn.score(X_test,y_test)
print("knn train pred matrix", knn_train_pred_m)
print("knn train score", knn_train_pred_p)
print("knn test pred matrix", knn_test_pred_m)
print("knn test score", knn_test_pred_p)
print("knn test pred", knn_test_pred_pred)

knn train pred matrix [[1. 0.]
 [1. 0.]
 [1. 0.]
 ...
 [1. 0.]
 [1. 0.]
 [1. 0.]]
knn train score 0.9358130651878782
knn test pred matrix [[1.  0. ]
 [1.  0. ]
 [0.8 0.2]
 ...
 [0.8 0.2]
 [0.8 0.2]
 [1.  0. ]]
knn test score 0.9289461573861022
knn test pred [0 0 0 ... 0 0 0]


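# Exercise 4
Read the official sklearn documentation to learn about the evaluation metrics for classification problems, and evaluate this example with them.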

In [21]:
logr_test_pred_m = logr.predict(X_test)
from sklearn.metrics import accuracy_score
logr_acc = accuracy_score(y_test,logr_test_pred_m)
print("logr acc ", logr_acc)
from sklearn.metrics import classification_report
target_names = ['class 0(normal)', 'class 1(SeriousDlqin2yrs)']
#class0 normal
#class1 SeriousDlqin2yrs
print(classification_report(y_test, logr_test_pred_m, target_names = target_names))

logr acc  0.9310170271514036
                           precision    recall  f1-score   support

          class 0(normal)       0.93      1.00      0.96     20204
class 1(SeriousDlqin2yrs)       0.64      0.04      0.08      1526

              avg / total       0.91      0.93      0.90     21730
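
The report shows a recall of only 0.04 for class 1, meaning almost every delinquent borrower is missed even though accuracy is about 0.93. Two further metrics from `sklearn.metrics` that make this visible are the confusion matrix and the ROC AUC, which is computed from predicted probabilities rather than hard labels. A small sketch on hand-made arrays (the values of `y_true_demo` and `y_prob_demo` are invented for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Invented labels and predicted probabilities of class 1.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_prob_demo = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.35, 0.80])
y_pred_demo = (y_prob_demo > 0.5).astype(int)  # hard labels at the default 0.5 cut-off

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true_demo, y_pred_demo)

# AUC is the fraction of (positive, negative) pairs ranked in the right order.
auc = roc_auc_score(y_true_demo, y_prob_demo)

print(cm)
print("ROC AUC:", auc)
```

For the notebook's model the analogous calls would take `y_test`, `logr.predict(X_test)` and `logr.predict_proba(X_test)[:, 1]`.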



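# Exercise 5

Banks usually have stricter requirements, because the consequences of fraud are usually severe, so in practice we adjust the model's decision criterion.<br>
For example, in logistic regression the decision boundary is normally a probability of 0.5, but we can set the threshold lower to increase the model's "sensitivity". Try setting the threshold to 0.3 and look at the evaluation metrics again (mainly precision and recall).

Tip: many sklearn classifiers provide `predict_proba`, which returns the estimated probabilities; comparing them with the chosen threshold determines the final result (the predicted class).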

In [23]:
test_preb = logr.predict_proba(X_test)[:,1]
test_preb_strict = [ 1 if item > 0.3 else 0 for item in test_preb ]
print(classification_report(y_test, test_preb_strict, target_names = target_names))

array([[0.96088813, 0.03911187],
       [0.96547137, 0.03452863],
       [0.90265925, 0.09734075],
       ...,
       [0.91855361, 0.08144639],
       [0.91716093, 0.08283907],
       [0.9577212 , 0.0422788 ]])
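
To see the effect of the lower threshold, precision and recall can be compared side by side at 0.5 and 0.3. The sketch below does this on invented probabilities (`y_true_demo`, `y_prob_demo` and the helper `predict_with_threshold` are hypothetical names), using a vectorized comparison instead of a list comprehension:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Invented labels and predicted probabilities of class 1.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob_demo = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.25, 0.32, 0.45, 0.80])

def predict_with_threshold(prob, threshold):
    """Label a sample 1 when its class-1 probability is strictly above the threshold."""
    return (np.asarray(prob) > threshold).astype(int)

results = {}
for threshold in (0.5, 0.3):
    y_pred = predict_with_threshold(y_prob_demo, threshold)
    results[threshold] = (precision_score(y_true_demo, y_pred),
                          recall_score(y_true_demo, y_pred))
    print("threshold", threshold, "precision/recall:", results[threshold])
```

Lowering the threshold can only keep or increase recall, since every sample flagged at 0.5 is still flagged at 0.3; precision may move in either direction and often drops on real data.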