# Credit risk case study
*DISCLAIMER: this is my own reference for classification problems. The text documenting the notebook comes from different sources (kaggle data set description, sklearn documentation, matplotlib documentation, wikipedia, etc.).*

## Table of contents

* [Dataset overview](#ds)
* [Exploratory analysis](#explo)
    * [Descriptive statistics for PAID loans](#descp)
    * [Descriptive statistics for DEFAULT loans](#descd)
    * [DEFAULT as a function of the reason for acquiring the loans](#reason)
    * [DEFAULT as a function of occupation](#occupation)
    * [Graphical overview](#graph)
    * [Violin plot](#violin)
    * [Correlation matrix](#corr)
* [Test of default classifiers](#classification)
* [Model evaluation](#eval)
    * [Precision & recall](#per)
    * [F1](#f1)
    * [Receiver operating characteristic](#roc)
    * [Confusion matrix](#confusion)
    * [Classification probability](#prob)
* [Logistic regression](#logit)
* [SGD classifier](#sgd)
* [Support vector classifier](#svc)
* [Gradient boosting classifier](#gbrt)
* [Forests of randomized trees](#frt)
    * [Random forest classifier](#rfc)
    * [Extremely randomized trees](#ert)
* [Model comparison and conclusion](#conclusion)

## Dataset overview
<a id='ds'></a>

The dataset contains baseline and loan performance information for 5,960 recent home equity loans. A home equity loan is a loan where the obligor uses the equity of his or her home as the underlying collateral. The target (BAD) is a binary variable indicating whether an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). 

For each applicant, 11 input variables were recorded. They are listed below together with the target (BAD):

* BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan
* LOAN: Amount of the loan request
* MORTDUE: Amount due on existing mortgage
* VALUE: Value of current property
* REASON: DebtCon = debt consolidation; HomeImp = home improvement
* JOB: Occupational categories
* YOJ: Years at present job
* DEROG: Number of major derogatory reports
* DELINQ: Number of delinquent credit lines
* CLAGE: Age of oldest credit line in months
* NINQ: Number of recent credit inquiries
* CLNO: Number of credit lines
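
The class balance quoted in the overview sets a baseline that any classifier has to beat. The short check below uses only the two counts stated above (5,960 loans, 1,189 adverse outcomes); note that these refer to the full file, before any rows with missing values are dropped, so the balance in the cleaned sample may differ.

```python
# Counts quoted in the dataset overview (full file, before dropping missing values).
n_loans = 5960
n_bad = 1189

# Fraction of loans with BAD == 1.
default_rate = n_bad / n_loans

# Accuracy reached by a trivial classifier that always predicts PAID (BAD == 0).
majority_accuracy = 1 - default_rate

print(f"Default rate: {default_rate:.1%}")
print(f"Accuracy of an 'always PAID' classifier: {majority_accuracy:.1%}")
```

A model that never flags a DEFAULT is already right about four times out of five, which is why accuracy alone is a poor yardstick for this problem.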

In [None]:
import pandas as pd
import numpy as np
from pprint import pprint
import matplotlib.pyplot as plt
import seaborn as sns

import holoviews as hv
hv.extension('bokeh', 'matplotlib', logo=False)

# Suppress warnings (trick for the final notebook on Kaggle)
import warnings
warnings.filterwarnings('ignore')

In [None]:
df=pd.read_csv('../input/hmeq.csv', low_memory=False) # No duplicated columns, no highly correlated columns
df.drop('DEBTINC', axis=1, inplace=True) # The meaning of this variable is not clear. Better drop it for the moment
df.dropna(axis=0, how='any', inplace=True)

## Exploratory analysis
<a id='explo'></a>

I summarize the main characteristics of the dataset with visual methods and summary statistics. I use the target variable (BAD) to divide the dataset into sub-samples, and I specifically look for variables, features and correlations that carry classification power.

### Descriptive statistics for PAID loans
<a id='descp'></a>

In [None]:
df[df['BAD']==0].drop('BAD', axis=1).describe().style.format("{:.2f}")

### Descriptive statistics for DEFAULT loans
<a id='descd'></a>

In [None]:
df[df['BAD']==1].drop('BAD', axis=1).describe().style.format("{:.2f}")

From the descriptive statistics above I can draw the following considerations:

* The amount of the requested loan, the amount due on the existing mortgage and the value of the underlying collateral are statistically consistent between loans that have been PAID and loans that resulted in a DEFAULT. This suggests that those variables may not provide significant discrimination power to separate the two classes.

* The number of years at the present job (YOJ) seems to discriminate between the two classes: DEFAULTs appear more frequent among contractors with shorter seniority. This tendency is supported by the corresponding mean and quantiles, which indicate a distribution skewed toward shorter seniority.

* Similar considerations apply to the variables related to the contractor's credit history, such as the number of major derogatory reports (DEROG), the number of delinquent credit lines (DELINQ), the age of the oldest credit line in months (CLAGE), and the number of recent credit inquiries (NINQ). In the case of DEFAULT, the distributions of these variables are skewed toward values that suggest a worse credit history than that of PAID loan contractors.

* Finally, the number of credit lines (CLNO) seems statistically consistent in both cases, suggesting that this variable has no significant discrimination power.

In [None]:
df.loc[df.BAD == 1, 'STATUS'] = 'DEFAULT'
df.loc[df.BAD == 0, 'STATUS'] = 'PAID'

### DEFAULT as a function of the reason for acquiring the loans
<a id='reason'></a>
The fractions of PAID and DEFAULT loans do not seem to depend strongly on the reason for acquiring the loan. On average, about 80% of the loans have been paid while about 20% resulted in a DEFAULT. The 2% discrepancy observed between the two reasons is not statistically significant given the number of loans in the dataset.

In [None]:
g = df.groupby('REASON')
g['STATUS'].value_counts(normalize=True).to_frame().style.format("{:.1%}")

### DEFAULT as a function of the occupation
<a id='occupation'></a>
The fractions of PAID and DEFAULT loans show some dependence on the occupation of the contractor. Office workers and professional executives have the highest probability of paying their loans, while sales and self-employed contractors have the highest probability of defaulting. The occupation shows good discriminating power and will most likely be an important feature of the classification model.

In [None]:
g = df.groupby('JOB')
g['STATUS'].value_counts(normalize=True).to_frame().style.format("{:.1%}")

In [None]:
%%opts Bars[width=700 height=400 tools=['hover'] xrotation=45]{+axiswise +framewise}

# Categorical

cols = ['REASON', 'JOB']

dd={}

for col in cols:

    counts=df.groupby(col)['STATUS'].value_counts(normalize=True).to_frame('val').reset_index()
    dd[col] = hv.Bars(counts, [col, 'STATUS'], 'val') 
    
var = [*dd]
kdims=hv.Dimension(('var', 'Variable'), values=var)    
hv.HoloMap(dd, kdims=kdims)

### Graphical overview
<a id='graph'></a>
A coherent graphical overview of the dataset is shown below. For each variable I show a histogram for the whole dataset and, separately, for the PAID and DEFAULT loans. The correlations among variables are also summarized in 2-dimensional scatter plots.

In [None]:
%%opts Histogram[width=700 height=400 tools=['hover'] xrotation=0]{+axiswise +framewise}

g = df.groupby('STATUS')

cols = ['LOAN',
        'MORTDUE', 
        'VALUE',
        'YOJ',
        'DEROG',
        'DELINQ',
        'CLAGE',
        'NINQ',
        'CLNO']
dd={}

# Histograms
for col in cols:
    
    freq, edges = np.histogram(df[col].values)
    dd[col] = hv.Histogram((edges, freq), label='ALL Loans').redim.label(x=' ')
    
    freq, edges = np.histogram(g.get_group('PAID')[col].values, bins=edges)
    dd[col] *= hv.Histogram((edges, freq), label='PAID Loans').redim.label(x=' ')
    
    freq, edges = np.histogram(g.get_group('DEFAULT')[col].values, bins=edges)
    dd[col] *= hv.Histogram((edges, freq), label='DEFAULT Loans' ).redim.label(x=' ')   
    
var = [*dd]
kdims=hv.Dimension(('var', 'Variable'), values=var)    
hv.HoloMap(dd, kdims=kdims)

In [None]:
%%opts Scatter[width=500 height=500 tools=['hover'] xrotation=0]{+axiswise +framewise}

g = df.groupby('STATUS')

cols = ['LOAN',
        'MORTDUE',
        'VALUE',
        'YOJ',
        'DEROG',
        'DELINQ',
        'CLAGE',
        'NINQ',
        'CLNO']

import itertools
prod = list(itertools.combinations(cols,2))

dd = {}

for p in prod:
    dd['_'.join(p)] = hv.Scatter(g.get_group('PAID')[list(p)], label='PAID Loans').options(size=5)
    dd['_'.join(p)] *= hv.Scatter(g.get_group('DEFAULT')[list(p)], label='DEFAULT Loans').options(size=5, marker='x')
    
var = [*dd]
kdims=hv.Dimension(('var', 'Variable'), values=var)    
hv.HoloMap(dd, kdims=kdims).collate()

In [None]:
g=sns.PairGrid(df.drop('BAD',axis=1), hue='STATUS', diag_sharey=False, palette={'PAID': 'C0', 'DEFAULT':'C1'})
g.map_lower(sns.kdeplot)
g.map_upper(sns.scatterplot)
g.map_diag(sns.kdeplot, lw=3)
g.add_legend()
plt.show()

### Violin plot
<a id='violin'></a>
The violin plots show the shapes of the probability density functions for some of the variables discussed previously that seem the most promising for the classification task. The plots show, in different colors, the PAID and the DEFAULT loans. The horizontal dashed lines indicate the position of the quartiles (median included) of the different distributions. Since the DEFAULT probability depends on the occupation category, the "violins" are shown for each of them.

In [None]:
cols=['YOJ', 'CLAGE', 'NINQ']

for col in cols:
    
    plt.figure(figsize=(15,5))

    sns.violinplot(x='JOB', y=col, hue='STATUS',
                   split=True, inner="quart",  palette={'PAID': 'C0', 'DEFAULT':'C1'},
                   data=df)
    
    sns.despine(left=True)

### Correlation matrix
<a id='corr'></a>
Finally I show the correlation matrix among the variables discussed so far. Correlations are useful because they can indicate a predictive relationship that can be exploited in the classification task. 

The plot is color coded: colder colors correspond to low correlation while warmer colors correspond to high correlation. The variables are also grouped according to their correlation, i.e. variables with higher correlation are close to each other.

Variables related to the credit history (DELINQ, DEROG, NINQ) are the most correlated with the loan status (BAD), suggesting that these will be the most discriminating variables. These variables are also slightly correlated with each other, suggesting that some of the information might be redundant.

As already discussed, the amount of the loan and the value of the underlying collateral do not seem related to the loan status. They nonetheless form another correlation cluster with other variables such as the age of the oldest credit line (CLAGE) and the number of credit lines (CLNO). This is expected since those variables are clearly related.
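
The matrix shown below is built with `DataFrame.corr()`, which computes the Pearson coefficient by default. As a reminder of what each entry measures, here is a minimal pure-Python sketch of that coefficient. The function name `pearson` and the toy lists are illustrative assumptions, not values from the dataset.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Covariance term and the two spreads (the common 1/n factors cancel out).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Made-up toy values: one perfectly increasing and one perfectly decreasing relation.
x = [1, 2, 3, 4]
y_up = [2, 4, 6, 8]
y_down = [8, 6, 4, 2]

print(pearson(x, y_up), pearson(x, y_down))
```

Values close to +1 or -1 indicate a strong linear relation, while values close to 0 indicate none; the coefficient does not capture non-linear dependence.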

In [None]:
def compute_corr(df, size=10):
    '''Compute the correlation matrix of the numeric columns, ordered by hierarchical clustering.

    Input:
        df: pandas DataFrame
        size: unused, kept for backward compatibility (it sized the former matshow plot)
    Output:
        corr: correlation matrix with correlated variables placed next to each other'''
    import scipy.cluster.hierarchy as sch
    from scipy.spatial.distance import pdist
    
    # Keep numeric columns only: string columns (REASON, JOB, STATUS) have no correlation
    df = df.select_dtypes(include=[np.number])
    corr = df.corr()
    
    # Clustering
    d = pdist(corr)   # condensed vector of pairwise distances between the rows of corr
    L = sch.linkage(d, method='complete')
    ind = sch.fcluster(L, 0.5*d.max(), 'distance')
    columns = [df.columns[i] for i in np.argsort(ind)]
    
    # Reorder the columns according to the clustering results
    df = df.reindex(columns, axis=1)
    
    # Recompute the correlation matrix with the clustered column order
    corr = df.corr()
    
    return corr

In [None]:
%%opts HeatMap [tools=['hover'] colorbar=True width=500  height=500 toolbar='above', xrotation=45, yrotation=45]

corr=compute_corr(df)
corr=corr.stack(level=0).to_frame('value').reset_index()
hv.HeatMap(corr).options(cmap='Viridis')

<a id='classification'></a>
## Test of default classifiers
The exploratory analysis described above provides good insight into the dataset and highlights the most promising variables, i.e. those with good discrimination power to identify loans resulting in DEFAULT. In this section I develop and investigate supervised machine learning classifiers to predict the outcome of loans. Given the large number of algorithms available in the literature, I begin with simple methods, such as logistic regression, and gradually increase the model complexity up to randomized trees techniques. Finally I compare the performance of each model and discuss the most appropriate one for this loan classification task. In this section, the following models are developed:
* [Logistic regression](#logit)
* [SGD classifier](#sgd)
* [Support vector classifier](#svc)
* [Gradient boosting classifier](#gbrt)
* [Forests of randomized trees](#frt)
    * [Random forest classifier](#rfc)
    * [Extremely randomized trees](#ert)
* [Model comparison and conclusion](#conclusion)

<a id='eval'></a>
## Model Evaluation
The evaluation of classifier performance is relatively complex and depends on many factors, some of which are model dependent. In order to identify the best model for our classification task, I adopt different evaluation metrics that are briefly summarized in the following.

To avoid overtraining, the performance of each classification model is evaluated using cross-validation. The training set is randomly split into $N$ distinct subsets called folds, then the model is trained and evaluated $N$ times, each time using a different fold for the evaluation of a model that is trained on the other $N-1$ folds. The procedure yields $N$ evaluation scores for each metric, which are then averaged. These averages are finally used to compare the different techniques considered in this study.
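
The fold mechanics described above can be sketched in a few lines of plain Python. This is only an illustration of the idea: the helper name `k_fold_indices` and the sample size of 24 are assumptions made for the example. In the notebook the splitting is delegated to scikit-learn's `cross_validate`, which for an integer `cv` and a classifier uses stratified folds, something this sketch does not attempt to reproduce.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds disjoint folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        # Train on every fold except the k-th one.
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

# Toy example: 24 samples split into N = 12 folds.
splits = list(k_fold_indices(n_samples=24, n_folds=12))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), "training samples, evaluated on", sorted(test_idx))
```

Each sample is used for evaluation exactly once and for training in the other $N-1$ rounds, so the averaged score uses all the data without ever scoring a model on samples it was trained on.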

<a id='per'></a>
### Precision & recall
Precision-recall is a useful performance measure for evaluating a model when the classes are very imbalanced. In information retrieval, precision is a measure of result relevancy, while recall is a measure of how many truly relevant results are returned. Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples.

A system with high recall but low precision returns many labels that tend to be predicted incorrectly when compared to the training labels. A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels. An ideal system with high precision and high recall will return many results, with many results labeled correctly.

Precision ($P$) is defined as the number of true positives ($T_{p}$) over the number of true positives plus the number of false positives ($T_{p}+F_{p}$):

$P = \frac{T_{p}}{T_{p}+F_{p}}$  

Recall ($R$) is defined as the number of true positives ($T_{p}$) over the number of true positives plus the number of false negatives ($T_{p}+F_{n}$):

$R = \frac{T_{p}}{T_{p}+F_{n}}$
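
The two definitions above translate directly into counts of true positives, false positives and false negatives. The sketch below is a from-scratch illustration on made-up labels (the function name `precision_recall` and the toy vectors are assumptions); in the notebook the same quantities come from scikit-learn's `precision` and `recall` scorers.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up example: 4 DEFAULT loans (1) and 6 PAID loans (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here three loans are flagged and two of them are real DEFAULTs ($P = 2/3$), while only two of the four real DEFAULTs are found ($R = 1/2$).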

<a id='f1'></a>
### F1 measure
It is often convenient to combine precision and recall into a single metric called the $F_{1}$ score, defined as the harmonic mean of precision and recall:

$F_{1} = 2\times \frac{P \times R}{P+R}$

Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high.
The $F_{1}$ score favors classifiers that have similar precision and recall. This is not always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall.
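
To see how the harmonic mean penalizes an imbalance between precision and recall, the small sketch below compares two hypothetical classifiers with the same arithmetic mean of $P$ and $R$. The numbers are made up for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two made-up classifiers with the same arithmetic mean (0.5) of P and R.
balanced = f1_score(0.5, 0.5)
lopsided = f1_score(0.9, 0.1)

print(f"balanced: F1 = {balanced:.2f}")
print(f"lopsided: F1 = {lopsided:.2f}")
```

The balanced classifier keeps $F_{1} = 0.5$, while the lopsided one drops to $0.18$ because its low recall dominates the harmonic mean.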

<a id='roc'></a>
###  Receiver operating characteristic
A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known as sensitivity, and FPR is one minus the specificity or true negative rate.
There is a tradeoff: the higher the recall (TPR), the more false positives (FPR) the classifier produces. The dashed diagonal line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).

The area under the ROC curve, also denoted by AUC, summarizes the curve information in one number. The AUC should be interpreted as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5.
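
The ranking interpretation of the AUC can be checked directly by comparing every (positive, negative) pair of scores. The sketch below does exactly that on a tiny made-up example; the function name `auc_by_ranking` is an assumption, and the quadratic loop is meant for illustration, not for large samples.

```python
def auc_by_ranking(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up scores: one positive (0.35) is ranked below one negative (0.4).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(auc_by_ranking(y_true, y_score))
```

Three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75; the same value is obtained by integrating the ROC curve.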

<a id='confusion'></a>
### Confusion matrix
The confusion matrix summarizes the classification results class by class, with each row corresponding to the true class. By definition, entry $i,j$ in a confusion matrix is the number of observations actually in group $i$, but predicted to be in group $j$. In this study the confusion matrix is not used to rank the models, but it provides a good grasp of the overall model performance.
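
The definition above can be made concrete with a small from-scratch sketch for the binary case. The helper names and the toy labels are assumptions for illustration; the notebook itself relies on `sklearn.metrics.confusion_matrix`.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Entry [i][j] counts samples of true class i predicted as class j."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def normalize_rows(cm):
    """Divide each row by its total, so rows become per-class fractions."""
    return [[value / sum(row) for value in row] for row in cm]

# Made-up example: 4 DEFAULT loans (1) and 6 PAID loans (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

cm = confusion_matrix_2x2(y_true, y_pred)
print(cm)
print(normalize_rows(cm))
```

With row normalization, the diagonal of the second row is the recall of the DEFAULT class, which is how the normalized matrices plotted in this notebook should be read.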

<a id='prob'></a>
### Classification probability
The classification probability provides an estimate of the probability that a given instance of the data belongs to a given class. In a binary classification problem like the one being considered, the histograms of the classification probability for the two classes provide a good visual grasp of the model performance. The farther apart the peaks of the two distributions are, the higher the separation power of the model.

In [None]:
import pandas as pd
import numpy as np
from pprint import pprint
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.metrics import classification_report

In [None]:
df=pd.read_csv('../input/hmeq.csv', low_memory=False) # No duplicated columns, no highly correlated columns
df=pd.get_dummies(df, columns=['REASON','JOB'])
df.drop('DEBTINC', axis=1, inplace=True)
df.dropna(axis=0, how='any', inplace=True)
y = df['BAD']
X = df.drop(['BAD'], axis=1)

In [None]:
def cross_validate_model(model, X, y, 
                         scoring=('f1', 'precision', 'recall', 'roc_auc'), 
                         cv=12, n_jobs=-1, verbose=True):
    '''Cross-validate `model` and return, for each metric, the mean of the
    fold scores ('value') and their standard deviation ('error').'''
    
    # Evaluate the model passed as argument (not the global `pipe`)
    scores = cross_validate(model, 
                        X, y, 
                        scoring=list(scoring),
                        cv=cv, n_jobs=n_jobs, 
                        verbose=verbose,
                        return_train_score=False)

    dd={}
    
    for key, val in scores.items():
        # Skip the timing entries, keep only the test scores
        if key in ['fit_time', 'score_time']:
            continue
        # 'test_roc_auc' -> 'Roc auc', 'test_f1' -> 'F1', ...
        name = " ".join(key.split('_')[1:]).capitalize()
        
        dd[name] = {'value' : np.mean(val), 'error' : np.std(val)}
        
    return  pd.DataFrame(dd)

In [None]:
def plot_roc(model, X_test ,y_test, n_classes=0):
    """
    Plot the ROC curve of a fitted binary classifier on the test set.

    Target scores can either be probability estimates 
    of the positive class, confidence values, or 
    non-thresholded measures of decisions (as returned 
    by "decision_function" on some classifiers).
    `n_classes` is unused and kept only for backward compatibility.
    """
    from sklearn.metrics import roc_curve, auc
    
    # Prefer the decision function; fall back to the positive-class probability
    try:
        y_score = model.decision_function(X_test)
    except Exception:
        y_score = model.predict_proba(X_test)[:,1]
    
    # np.ravel accepts both pandas Series and numpy arrays
    fpr, tpr, _ = roc_curve(np.ravel(y_test), np.ravel(y_score))
    roc_auc = auc(fpr, tpr)

    lw = 2
    plt.plot(fpr, tpr, color='darkorange',
             lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)

    # Dashed diagonal: ROC curve of a purely random classifier
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")

In [None]:
def plot_confusion_matrix(model, X_test ,y_test,
                          classes=[0,1],
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Plot the confusion matrix of a fitted model on the test set.
    Normalization (per true class, i.e. per row) can be applied by setting `normalize=True`.
    """
    import itertools
    from sklearn.metrics import confusion_matrix
    
    y_pred = model.predict(X_test)
    
    # Compute confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    np.set_printoptions(precision=2)
    
    if normalize:
        # Divide each row by the number of samples in the corresponding true class
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
def feature_importance(coef, names, verbose=False, plot=True):
    
    #importances = model.feature_importances_

    
    
    #std = np.std([tree.feature_importances_ for tree in model.estimators_],
    #             axis=0)
    indices = np.argsort(coef)[::-1]
    
    if verbose:
    
        # Print the feature ranking
        print("Feature ranking:")
    
        for f in range(len(names)):
            print("{:>2d}. {:>15}: {:.5f}".format(f + 1, names[indices[f]], coef[indices[f]]))
        
    if plot:
        
        # Plot the feature importances of the forest
        #plt.figure(figsize=(5,10))
        plt.title("Feature importances")
        plt.barh(range(len(names)), coef[indices][::-1], align="center")
        #plt.barh(range(X.shape[1]), importances[indices][::-1],
        #         xerr=std[indices][::-1], align="center")
        plt.yticks(range(len(names)), names[indices][::-1])
        #plt.xlim([-0.001, 1.1])
        #plt.show()

In [None]:
def plot_proba(model, X, y, bins=40, show_class = 1):
    
    from sklearn.calibration import CalibratedClassifierCV
    
    model = CalibratedClassifierCV(model)#, cv='prefit')
    
    model.fit(X, y)
    
    proba=model.predict_proba(X)
    
    if show_class == 0:
        sns.kdeplot(proba[y==0,0], shade=True, color="r", label='True class')
        sns.kdeplot(proba[y==0,1], shade=True, color="b", label='Wrong class')
        plt.title('Classification probability: Class 0')
    elif show_class == 1:
        sns.kdeplot(proba[y==1,1], shade=True, color="r", label='True class')
        sns.kdeplot(proba[y==1,0], shade=True, color="b", label='Wrong class')
        plt.title('Classification probability: Class 1')
    plt.legend()

## Logistic regression
<a id='logit'></a>

Logistic regression is the simplest linear model for classification. It is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function. The model is fitted by minimizing a cost function: scikit-learn's liblinear solver does this with a highly optimized coordinate descent algorithm, while recent scikit-learn versions use the L-BFGS solver by default.
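
As a reminder of what "modeled using a logistic function" means in practice, here is a minimal sketch of the prediction step and of the per-sample cost being minimized. The weights, bias and input values are hypothetical numbers chosen for the example, not coefficients fitted on this dataset.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of the positive class for one (already scaled) sample x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

def log_loss(y, p):
    """Per-sample logistic cost for a true label y (0 or 1) and probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Hypothetical coefficients and one hypothetical sample.
weights, bias = [2.0, -1.0], 0.0
sample = [1.0, 2.0]

p = predict_proba(weights, bias, sample)
print(f"P(DEFAULT) = {p:.2f}, cost if the loan really defaulted = {log_loss(1, p):.3f}")
```

The sign and size of each weight indicate how a (standardized) feature pushes the predicted probability, which is why the fitted `coef_` values are used as a feature ranking in the plots of this section.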

In [None]:
from sklearn.linear_model import LogisticRegression

steps = [('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
         ('model', LogisticRegression(random_state=0))]

pipe = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
pipe.fit(X_train, y_train)

In [None]:
plt.figure(figsize=(15,10))

plt.subplot(221)
plot_roc(pipe, X_test ,y_test)

plt.subplot(222)
plot_confusion_matrix(pipe, X_test ,y_test, normalize=True)

plt.subplot(223)
plot_proba(pipe, X_test, y_test)

plt.subplot(224)
feature_importance(pipe.named_steps['model'].coef_[0], X.columns)

plt.tight_layout()

In [None]:
logit_xval_res = cross_validate_model(pipe, X, y, verbose=False)
logit_xval_res.T[['value','error']].style.format("{:.2f}")

<a id='sgd'></a>
## SGD Classifier
This algorithm implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate). SGD also allows mini-batch (online/out-of-core) learning. With `loss="hinge"`, as used below, the fitted model is a linear support vector machine.
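
The update rule described above can be sketched from scratch for the hinge loss with an L2 penalty. This is a simplified illustration under stated assumptions: the function names, the toy data and the step-size schedule `eta0 / (1 + alpha * eta0 * t)` are choices made for this example and are not the exact schedule used by `SGDClassifier`.

```python
def sgd_hinge(X, y, alpha=0.01, eta0=1.0, n_epochs=20):
    """Train a linear classifier with per-sample SGD on the L2-regularized hinge loss.

    Labels must be -1 or +1. Returns the weight vector and the intercept.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    t = 0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            t += 1
            eta = eta0 / (1.0 + alpha * eta0 * t)   # decreasing step size
            margin = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + b)
            # The L2 penalty shrinks the weights at every step.
            w = [wj * (1.0 - eta * alpha) for wj in w]
            # The hinge loss only contributes when the margin is violated.
            if margin < 1.0:
                w = [wj + eta * y_i * xj for wj, xj in zip(w, x_i)]
                b += eta * y_i
    return w, b

def predict(w, b, x):
    """Predicted label (-1 or +1) for one sample."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Made-up, linearly separable one-dimensional toy data.
X_toy = [[-2.0], [-1.0], [1.0], [2.0]]
y_toy = [-1, -1, 1, 1]

w, b = sgd_hinge(X_toy, y_toy)
print([predict(w, b, x_i) for x_i in X_toy])
```

Because each update touches a single sample, the same loop can be fed data in chunks, which is the basis of the online/out-of-core use mentioned above.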

In [None]:
from sklearn.linear_model import SGDClassifier

steps = [('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
         ('model', SGDClassifier(loss="hinge", penalty="l2", max_iter=1000, tol=1e-3, random_state=0))]

pipe = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
pipe.fit(X_train, y_train)

In [None]:
plt.figure(figsize=(15,10))

plt.subplot(221)
plot_roc(pipe, X_test ,y_test)

plt.subplot(222)
plot_confusion_matrix(pipe, X_test ,y_test, normalize=True)

plt.subplot(223)
plot_proba(pipe, X_test, y_test)

plt.subplot(224)
feature_importance(pipe.named_steps['model'].coef_[0], X.columns)

plt.tight_layout()

In [None]:
sgd_xval_res = cross_validate_model(pipe, X, y, verbose=False)
sgd_xval_res.T[['value','error']].style.format("{:.2f}")

## Support Vector Classifier
<a id='svc'></a>
A support vector machine constructs a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.

The advantages of support vector machines are:
* They are effective in high dimensional spaces.
* They remain effective when the number of dimensions is greater than the number of samples.
* They use only a subset of the training points in the decision function (the support vectors), so they are memory efficient.
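
The "largest distance to the nearest training data points" criterion described above can be illustrated by comparing two candidate hyper-planes on a toy problem. The helper names, the points and the two hyper-planes below are made up for the example; no SVM is actually trained here.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyper-plane w.x + b = 0."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm

def margin(w, b, X, y):
    """Distance from the hyper-plane to the nearest point (labels are -1 or +1).

    A negative value means that at least one point is on the wrong side.
    """
    return min(y_i * signed_distance(w, b, x_i) for x_i, y_i in zip(X, y))

# Made-up two-dimensional toy data, separable along the first coordinate.
X_toy = [[1.0, 1.0], [2.0, 0.0], [4.0, 1.0], [5.0, 0.0]]
y_toy = [-1, -1, 1, 1]

centered = margin([1.0, 0.0], -3.0, X_toy, y_toy)   # hyper-plane x1 = 3
shifted = margin([1.0, 0.0], -2.5, X_toy, y_toy)    # hyper-plane x1 = 2.5

print(f"centered: {centered}, shifted: {shifted}")
```

Both hyper-planes separate the two classes, but the centered one leaves twice as much room to the nearest points, which is the one a maximum-margin classifier prefers.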

In [None]:
from sklearn.svm import SVC

steps = [('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
         ('model', SVC(random_state=0, kernel='linear', probability=True))]

pipe = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
pipe.fit(X_train, y_train)

In [None]:
plt.figure(figsize=(15,10))

plt.subplot(221)
plot_roc(pipe, X_test ,y_test)

plt.subplot(222)
plot_confusion_matrix(pipe, X_test ,y_test, normalize=True)

plt.subplot(223)
plot_proba(pipe, X_test, y_test)

plt.subplot(224)
feature_importance(pipe.named_steps['model'].coef_[0], X.columns)

plt.tight_layout()

In [None]:
svc_xval_res = cross_validate_model(pipe, X, y, verbose=False)
svc_xval_res.T[['value','error']].style.format("{:.2f}")

<a id='gbrt'></a>
## Gradient Boosting Classifier
Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT) is a generalization of boosting to arbitrary differentiable loss functions. GBRT produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. GBRT is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems.
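
The stage-wise construction described above can be sketched with one-split trees (stumps) fitted to the residuals. To keep the example short it uses the squared-error loss, for which the negative gradient is simply the residual; the classifier used in this notebook optimizes a classification loss instead. All names and the toy data are assumptions for illustration.

```python
def fit_stump(x, residuals):
    """Best single-threshold split of a 1-D feature under squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:   # keeps both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_estimators=50, learning_rate=0.1):
    """Stage-wise additive model: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, pred)]   # negative gradient of squared loss
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

# Made-up toy data: the target jumps from 0 to 1 between x = 3 and x = 4.
x_toy = [1, 2, 3, 4, 5, 6]
y_toy = [0, 0, 0, 1, 1, 1]

model = gradient_boost(x_toy, y_toy)
print(round(model(1), 3), round(model(6), 3))
```

Each stage only corrects a fraction (`learning_rate`) of what the ensemble still gets wrong, which is why a small learning rate is usually paired with a large number of estimators, as in the cell below.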

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

steps = [('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
         ('model', GradientBoostingClassifier(n_estimators=250, learning_rate=0.05, random_state=0))]

pipe = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
pipe.fit(X_train, y_train)

In [None]:
plt.figure(figsize=(15,10))

plt.subplot(221)
plot_roc(pipe, X_test ,y_test)

plt.subplot(222)
plot_confusion_matrix(pipe, X_test ,y_test, normalize=True)

plt.subplot(223)
plot_proba(pipe, X_test, y_test)

plt.subplot(224)
feature_importance(pipe.named_steps['model'].feature_importances_, X.columns)

plt.tight_layout()

In [None]:
gbc_xval_res = cross_validate_model(pipe, X, y, verbose=False)
gbc_xval_res.T[['value','error']].style.format("{:.2f}")

<a id='frt'></a>
## Forests of randomized trees
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

The forests of randomized trees technique includes two averaging algorithms based on randomized decision trees: the RandomForest algorithm and the Extra-Trees method. Both algorithms are perturb-and-combine techniques specifically designed for trees. This means a diverse set of classifiers is created by introducing randomness in the classifier construction. The prediction of the ensemble is given as the averaged prediction of the individual classifiers.

<a id='rfc'></a>
### Random Forest Classifier
In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
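
The two sources of randomness described above, bootstrap sampling of the rows and random selection of the candidate features at each split, can be sketched as follows. The helper names are assumptions; the subset size of 3 mirrors the square-root rule that scikit-learn applies by default to classifiers, here taken over the 9 numeric variables of this dataset.

```python
import random

def bootstrap_sample(n_samples, rng):
    """Row indices drawn with replacement: the training set of one tree."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(feature_names, k, rng):
    """Candidate features considered at one split (drawn without replacement)."""
    return rng.sample(feature_names, k)

rng = random.Random(0)
features = ['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG',
            'DELINQ', 'CLAGE', 'NINQ', 'CLNO']

rows = bootstrap_sample(10, rng)
candidates = random_feature_subset(features, k=3, rng=rng)

print("rows used by this tree:", sorted(rows))
print("rows left out         :", sorted(set(range(10)) - set(rows)))
print("features tried at this split:", candidates)
```

Because every tree sees a different sample and every split a different set of candidate features, the individual trees make partly independent errors, which is what the averaging step exploits.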

In [None]:
from sklearn.ensemble import RandomForestClassifier

steps = [('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
         ('model', RandomForestClassifier(n_estimators=250, n_jobs=-1, random_state=0))]

pipe = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
pipe.fit(X_train, y_train)

In [None]:
plt.figure(figsize=(15,10))

plt.subplot(221)
plot_roc(pipe, X_test ,y_test)

plt.subplot(222)
plot_confusion_matrix(pipe, X_test ,y_test, normalize=True)

plt.subplot(223)
plot_proba(pipe, X_test, y_test)

plt.subplot(224)
feature_importance(pipe.named_steps['model'].feature_importances_, X.columns)

plt.tight_layout()

In [None]:
rfc_xval_res = cross_validate_model(pipe, X, y, verbose=False)
rfc_xval_res.T[['value','error']].style.format("{:.2f}")

<a id='ert'></a>
### Extremely Randomized Trees
In extremely randomized trees, randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule. This usually reduces the variance of the model a bit more, at the expense of a slightly greater increase in bias.
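
The splitting rule described above can be sketched for a single node: one threshold is drawn uniformly at random for each candidate feature, and the candidate with the lowest weighted Gini impurity wins. The function names, the two toy features and the labels are assumptions made for illustration.

```python
import random

def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the two children produced by a threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def extra_trees_split(columns, labels, rng):
    """Draw one random threshold per feature and keep the best of them."""
    best = None
    for name, values in columns.items():
        threshold = rng.uniform(min(values), max(values))
        score = split_impurity(values, labels, threshold)
        if best is None or score < best[0]:
            best = (score, name, threshold)
    return best

# Made-up node with one informative and one uninformative feature.
columns = {'informative': [1, 2, 3, 7, 8, 9],
           'noise': [5, 1, 4, 2, 6, 3]}
labels = [0, 0, 0, 1, 1, 1]

score, name, threshold = extra_trees_split(columns, labels, random.Random(0))
print(f"split on {name} at {threshold:.2f} (impurity {score:.2f})")
```

No search over thresholds is performed, which makes each split cheaper and more random than in a random forest; the quality of the ensemble then relies on averaging many such trees.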

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

steps = [('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
         ('model', ExtraTreesClassifier(n_estimators=250, n_jobs=-1, random_state=0, class_weight='balanced'))]

pipe = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
pipe.fit(X_train, y_train)

In [None]:
plt.figure(figsize=(15,10))

plt.subplot(221)
plot_roc(pipe, X_test ,y_test)

plt.subplot(222)
plot_confusion_matrix(pipe, X_test ,y_test, normalize=True)

plt.subplot(223)
plot_proba(pipe, X_test, y_test)

plt.subplot(224)
feature_importance(pipe.named_steps['model'].feature_importances_, X.columns)

plt.tight_layout()

In [None]:
ert_xval_res = cross_validate_model(pipe, X, y, verbose=False)
ert_xval_res.T[['value','error']].style.format("{:.2f}")

<a id='conclusion'></a>
## Model comparison and conclusions
The table below summarizes the performance of the classification models that I considered in this study. Performances are ordered by decreasing value of $F_{1}$. The best performances are obtained by the **extremely randomized trees**, followed by the **random forest** and the **logistic regression**.

The extremely randomized trees classifier identifies up to 66% of the loans that would result in a DEFAULT while retaining 91% of the loans that would be PAID in time. The ROC AUC value is as high as 96%, meaning that the classifier ranks a randomly chosen DEFAULT loan above a randomly chosen PAID loan with a probability of 96%.

In [None]:
from collections import OrderedDict

# Keep only the mean score of each model: selecting the 'value' row by label
# does not depend on the order in which pandas stores the rows
res_comp = OrderedDict([
    ('Logistic regression'                 , logit_xval_res.loc[['value']]),
    ('SGD classifier'                      , sgd_xval_res.loc[['value']]  ),
    ('Support vector classifier'           , svc_xval_res.loc[['value']]  ),
    ('Random forest classifier'            , rfc_xval_res.loc[['value']]  ),
    ('Extremely randomized trees classifier', ert_xval_res.loc[['value']] ),
    ('Gradient boosting classifier'        , gbc_xval_res.loc[['value']]  ),
])

new_columns = {'level_0' : 'Model'}

pd.concat(res_comp).reset_index().drop('level_1', axis=1).rename(columns=new_columns).set_index('Model').sort_values('F1', ascending=False).style.format("{:.2f}")