# Calculating the Probability of ```Default``` Using ```XGBoost```
XGBoost builds an ensemble of decision trees: it trains multiple trees one after another, each new tree correcting the errors of the ones before it, and the combination of their outputs is taken as the model's final prediction of the probability of default. Imagine we have two decision trees that, for the same dataset, predict that some customers will stop paying their loans when they are actually good customers. XGBoost combines the outputs of both trees and produces a more refined prediction:

<img src="images/combined-weak-models.png" class="center" alt="XGBoost Example" />
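To make the combination step concrete, here is a minimal sketch of how a boosted ensemble could merge the outputs of two trees. It assumes the binary logistic objective, where each tree contributes a score in log-odds space and the summed score is passed through a sigmoid. The leaf scores are hypothetical, and the learning rate and base score are left out for brevity.

```python
import numpy as np


def sigmoid(z):
    """Map a log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical log-odds scores that two trees assign to three customers
tree_1_scores = np.array([0.8, -0.4, 1.2])
tree_2_scores = np.array([-1.5, -0.9, 0.6])

# Boosting adds the tree scores before converting to a probability
margin = tree_1_scores + tree_2_scores

prob_tree_1_only = sigmoid(tree_1_scores)
prob_combined = sigmoid(margin)

print("Tree 1 alone:", prob_tree_1_only.round(3))
print("Combined:    ", prob_combined.round(3))
```

In this toy case the first customer looks risky to the first tree alone, but the second tree's negative score pulls the combined probability of default below 0.5.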


In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    roc_curve,
    classification_report,
    precision_recall_fscore_support,
    roc_auc_score,
    confusion_matrix,
)
import matplotlib.pyplot as plt
import matplotlib
import xgboost as xgb
cr_loan_clean = pd.read_csv("data/cr_loan_nout_nmiss.csv")

In [2]:
cr_loan_clean.shape

(29459, 12)

In [3]:
cred_num = cr_loan_clean.select_dtypes(exclude=['object'])
cred_str = cr_loan_clean.select_dtypes(include=['object'])

cred_str_onehot = pd.get_dummies(cred_str)
cr_loan_prep = pd.concat([cred_num, cred_str_onehot], axis=1)
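The cell above splits the numeric and text columns, one-hot encodes the text columns, and joins the two parts back together. A minimal sketch of the same steps on a toy frame may help show what the result looks like; the column names and values here are hypothetical and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns and one text column
toy = pd.DataFrame({
    "loan_amnt": [5000, 12000, 3000],
    "home_ownership": pd.Series(["RENT", "OWN", "RENT"], dtype="object"),
    "loan_status": [0, 1, 0],
})

# Same steps as the notebook cell: split by dtype, encode, concatenate
toy_num = toy.select_dtypes(exclude=["object"])
toy_str = toy.select_dtypes(include=["object"])
toy_prep = pd.concat([toy_num, pd.get_dummies(toy_str)], axis=1)

print(list(toy_prep.columns))
```

Each distinct text value becomes its own indicator column, named after the original column and the value, so the model receives only numeric inputs.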

In [4]:
# Separate the features from the target variable
columns = list(cr_loan_prep.columns)
X_columns = [column for column in columns if column != 'loan_status']
X = cr_loan_prep[X_columns]
y = cr_loan_prep[['loan_status']]

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=123)
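As a small illustration of what `test_size=.4` and `random_state=123` do, the sketch below runs the same call on a hypothetical ten-row array. It only shows the split proportions and that a fixed seed reproduces the same split; the data is made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# 40% of the rows are held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=.4, random_state=123
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed yields the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=.4, random_state=123
)
print(np.array_equal(X_te, X_te2))
```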

Training the XGBoost model

In [5]:
# Train the model
clf_gbt = xgb.XGBClassifier().fit(X_train, np.ravel(y_train))

In [6]:
# Get the predicted probabilities
gbt_preds = clf_gbt.predict_proba(X_test)
gbt_preds

array([[0.05956489, 0.9404351 ],
       [0.07798618, 0.9220138 ],
       [0.9782927 , 0.02170731],
       ...,
       [0.73190296, 0.26809704],
       [0.7961383 , 0.20386168],
       [0.90489477, 0.09510522]], dtype=float32)
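Each row of this array holds two probabilities for one loan: the first column is the probability of class 0 (non-default) and the second is the probability of class 1 (default). A quick sketch using only the rows printed above checks that each row sums to 1 and pulls out the default column; the variable names are illustrative.

```python
import numpy as np

# Rows copied from the predict_proba output shown above
shown_preds = np.array([
    [0.05956489, 0.9404351],
    [0.07798618, 0.9220138],
    [0.9782927, 0.02170731],
    [0.73190296, 0.26809704],
    [0.7961383, 0.20386168],
    [0.90489477, 0.09510522],
], dtype=np.float32)

# The two class probabilities in each row should add up to 1
row_sums = shown_preds.sum(axis=1)
print(np.allclose(row_sums, 1.0, atol=1e-5))

# Column 1 is the probability of default (loan_status == 1)
prob_default = shown_preds[:, 1]
print(prob_default)
```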

In [7]:
# Probability of default (second column) for the first five test rows
preds_df = pd.DataFrame(gbt_preds[:,1][0:5], columns = ['prob_default'])
preds_df

|   | prob_default |
|---|--------------|
| 0 | 0.940435 |
| 1 | 0.922014 |
| 2 | 0.021707 |
| 3 | 0.026483 |
| 4 | 0.064803 |


In [8]:
true_df = y_test.head()
true_df

|       | loan_status |
|-------|-------------|
| 28606 | 1 |
| 22585 | 1 |
| 13888 | 0 |
| 3145  | 0 |
| 14882 | 1 |


In [9]:
# Concatenate the true values with the predicted probabilities
pd.concat([true_df.reset_index(drop = True), preds_df], axis = 1)

|   | loan_status | prob_default |
|---|-------------|--------------|
| 0 | 1 | 0.940435 |
| 1 | 1 | 0.922014 |
| 2 | 0 | 0.021707 |
| 3 | 0 | 0.026483 |
| 4 | 1 | 0.064803 |
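To compare these probabilities with the true labels, one option is to turn them into predicted classes with a cutoff. The sketch below uses the five rows shown above and assumes a threshold of 0.5, which is an illustrative choice and not something the notebook specifies; the new column names are hypothetical as well.

```python
import pandas as pd

# Values copied from the concatenated table above
results = pd.DataFrame({
    "loan_status": [1, 1, 0, 0, 1],
    "prob_default": [0.940435, 0.922014, 0.021707, 0.026483, 0.064803],
})

# Assumed cutoff: predict default when the probability exceeds 0.5
threshold = 0.5
results["pred_loan_status"] = (results["prob_default"] > threshold).astype(int)

# Flag the rows where the predicted class matches the true class
results["correct"] = results["pred_loan_status"] == results["loan_status"]

print(results)
print("Correct on these five rows:", int(results["correct"].sum()), "of", len(results))
```

With this cutoff the first four rows are classified correctly, while the last row is a true default that receives a low probability and would be missed.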
