# Classifying loan defaults at a bank

We will build a model to detect defaults on a bank's loans.

## Contents

1. [Database connection](#mysql)
2. [Feature extraction](#feature_extraction)
3. [Train-test split](#train-test)
4. [Feature selection](#correlation)
5. [Transformations](#transformation)
6. [Modeling](#modeling)
7. [Feature importance](#feature_importance)
8. [Next steps](#future)

<a name='mysql'></a>
## 1. Database connection

In [1]:
import pymysql
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
database_host = 'relational.fit.cvut.cz'
username = 'guest'
password = 'relational'
database_name = 'financial'

db = pymysql.connect(host = database_host,
                     user = username,
                     password = password,
                     database = database_name)

In [5]:
query = "SELECT * FROM loan"
df = pd.read_sql(query,db)
df

Unnamed: 0,loan_id,account_id,date,amount,duration,payments,status
0,4959,2,1994-01-05,80952,24,3373.0,A
1,4961,19,1996-04-29,30276,12,2523.0,B
2,4962,25,1997-12-08,30276,12,2523.0,A
3,4967,37,1998-10-14,318480,60,5308.0,D
4,4968,38,1998-04-19,110736,48,2307.0,C
...,...,...,...,...,...,...,...
677,7294,11327,1998-09-27,39168,24,1632.0,C
678,7295,11328,1998-07-18,280440,60,4674.0,C
679,7304,11349,1995-10-29,419880,60,6998.0,C
680,7305,11359,1996-08-06,54024,12,4502.0,A


<a name='feature_extraction'></a>
## 2. Feature extraction

From the **loan**, **account** and **district** tables, extract the following variables into a single table:
- account identifier
- loan date
- loan amount
- loan duration
- monthly payments
- loan status
- frequency of bank statements
- account creation date
- number of inhabitants of the district
- district variables (A4, A11, A12, A13, A14, A15, A16)

In [43]:
query = '''
SELECT  loan.account_id,
        loan.date as date_loan,
        amount,
        duration,
        payments,
        status,
        frequency,
        account.date as date_acc,
        A4, A11, A12, A13, A14, A15, A16
FROM loan
JOIN account
ON loan.account_id = account.account_id
JOIN district
ON account.district_id = district.district_id
'''

df = pd.read_sql(query,db)
df

Unnamed: 0,account_id,date_loan,amount,duration,payments,status,frequency,date_acc,A4,A11,A12,A13,A14,A15,A16
0,2,1994-01-05,80952,24,3373.0,A,POPLATEK MESICNE,1993-02-26,1204953,12541,0.2,0.43,167,85677.0,99107
1,19,1996-04-29,30276,12,2523.0,B,POPLATEK MESICNE,1995-04-07,103347,9104,1.5,2.07,123,2299.0,2354
2,25,1997-12-08,30276,12,2523.0,A,POPLATEK MESICNE,1996-07-28,228848,9893,4.0,4.72,96,5623.0,5887
3,37,1998-10-14,318480,60,5308.0,D,POPLATEK MESICNE,1997-08-18,70646,8547,2.6,3.64,120,1563.0,1542
4,38,1998-04-19,110736,48,2307.0,C,POPLATEK TYDNE,1997-08-08,51428,8402,3.1,3.98,120,999.0,1099
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
677,11327,1998-09-27,39168,24,1632.0,C,POPLATEK MESICNE,1997-10-15,94725,9920,2.2,2.87,130,4289.0,4846
678,11328,1998-07-18,280440,60,4674.0,C,POPLATEK MESICNE,1996-11-05,387570,9897,1.6,1.96,140,18721.0,18696
679,11349,1995-10-29,419880,60,6998.0,C,POPLATEK TYDNE,1995-05-26,1204953,12541,0.2,0.43,167,85677.0,99107
680,11359,1996-08-06,54024,12,4502.0,A,POPLATEK MESICNE,1994-10-01,117897,8814,4.7,5.74,107,2112.0,2059


Convert the dates with `pandas.to_datetime`

In [44]:
df['date_loan'] = pd.to_datetime(df.date_loan, format='%Y-%m-%d')
df['date_acc'] = pd.to_datetime(df.date_acc, format='%Y-%m-%d')

Create the variable `days_between` as the number of days between the account creation date and the loan date

In [45]:
df['days_between'] = (df['date_loan'] - df['date_acc']).dt.days

Rename the district variables and create the crimes-per-inhabitant ratio

In [46]:
df['n_inhabitants'] = df.A4
df['average_salary'] = df.A11
# Mean of the two unemployment-rate columns
df['average_unemployment_rate'] = df[['A12', 'A13']].mean(axis=1)
df['entrepreneur_rate'] = df['A14']
# Mean of the two crime-count columns, divided by the district population
df['average_crime_rate'] = df[['A15', 'A16']].mean(axis=1) / df['n_inhabitants']

Create the binary target: a loan counts as a default when its status is `B` or `D`

In [47]:
df['target'] = (df['status'] == 'B') | (df['status'] == 'D')

In [48]:
df

Unnamed: 0,account_id,date_loan,amount,duration,payments,status,frequency,date_acc,A4,A11,...,A14,A15,A16,days_between,n_inhabitants,average_salary,average_unemployment_rate,entrepreneur_rate,average_crime_rate,target
0,2,1994-01-05,80952,24,3373.0,A,POPLATEK MESICNE,1993-02-26,1204953,12541,...,167,85677.0,99107,313,1204953,12541,0.315,167,0.076677,False
1,19,1996-04-29,30276,12,2523.0,B,POPLATEK MESICNE,1995-04-07,103347,9104,...,123,2299.0,2354,388,103347,9104,1.785,123,0.022512,True
2,25,1997-12-08,30276,12,2523.0,A,POPLATEK MESICNE,1996-07-28,228848,9893,...,96,5623.0,5887,498,228848,9893,4.360,96,0.025148,False
3,37,1998-10-14,318480,60,5308.0,D,POPLATEK MESICNE,1997-08-18,70646,8547,...,120,1563.0,1542,422,70646,8547,3.120,120,0.021976,True
4,38,1998-04-19,110736,48,2307.0,C,POPLATEK TYDNE,1997-08-08,51428,8402,...,120,999.0,1099,254,51428,8402,3.540,120,0.020397,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
677,11327,1998-09-27,39168,24,1632.0,C,POPLATEK MESICNE,1997-10-15,94725,9920,...,130,4289.0,4846,347,94725,9920,2.535,130,0.048219,False
678,11328,1998-07-18,280440,60,4674.0,C,POPLATEK MESICNE,1996-11-05,387570,9897,...,140,18721.0,18696,620,387570,9897,1.780,140,0.048271,False
679,11349,1995-10-29,419880,60,6998.0,C,POPLATEK TYDNE,1995-05-26,1204953,12541,...,167,85677.0,99107,156,1204953,12541,0.315,167,0.076677,False
680,11359,1996-08-06,54024,12,4502.0,A,POPLATEK MESICNE,1994-10-01,117897,8814,...,107,2112.0,2059,675,117897,8814,5.220,107,0.017689,False


In [49]:
df.account_id.nunique()

682

From the **trans** table, get the transferred amount and the balance for each account, keeping only transactions dated before the loan

In [50]:
query = '''
select trans.account_id, trans.amount as trans_amount, balance as trans_balance 
from trans
join loan
on trans.account_id = loan.account_id
and trans.date < loan.date
'''
df_trans = pd.read_sql(query,db)
df_trans

Unnamed: 0,account_id,trans_amount,trans_balance
0,2,1100,1100
1,2,20236,21336
2,2,20236,45286
3,2,20236,54631
4,2,30354,67530
...,...,...,...
54689,11362,93,17922
54690,11362,75,14889
54691,11362,73,15993
54692,11362,87,19331


Count the number of transactions for each account

In [51]:
# account_id | n_trans
#    2          54
n_trans = df_trans[['account_id', 'trans_amount']].groupby('account_id',as_index=False).count()
n_trans.columns = ['account_id', 'n_trans']
n_trans

Unnamed: 0,account_id,n_trans
0,2,54
1,19,80
2,25,164
3,37,116
4,38,55
...,...,...
677,11327,54
678,11328,106
679,11349,18
680,11359,147


Compute the mean transaction amount and mean balance for each account

In [52]:
df_trans2 = df_trans.groupby('account_id',as_index=False).mean()
df_trans2

Unnamed: 0,account_id,trans_amount,trans_balance
0,2,7954.333333,32590.759259
1,19,5856.350000,25197.137500
2,25,12113.981707,62991.408537
3,37,7572.034483,39954.034483
4,38,4716.200000,31383.581818
...,...,...,...
677,11327,7977.981481,55438.814815
678,11328,8138.754717,38619.084906
679,11349,24426.500000,59352.666667
680,11359,8708.775510,36480.238095


Join all of this data to the main dataframe

In [53]:
df = (df.merge(df_trans2, how='left', on='account_id')
        .merge(n_trans, how='left', on='account_id'))
df

Unnamed: 0,account_id,date_loan,amount,duration,payments,status,frequency,date_acc,A4,A11,...,days_between,n_inhabitants,average_salary,average_unemployment_rate,entrepreneur_rate,average_crime_rate,target,trans_amount,trans_balance,n_trans
0,2,1994-01-05,80952,24,3373.0,A,POPLATEK MESICNE,1993-02-26,1204953,12541,...,313,1204953,12541,0.315,167,0.076677,False,7954.333333,32590.759259,54
1,19,1996-04-29,30276,12,2523.0,B,POPLATEK MESICNE,1995-04-07,103347,9104,...,388,103347,9104,1.785,123,0.022512,True,5856.350000,25197.137500,80
2,25,1997-12-08,30276,12,2523.0,A,POPLATEK MESICNE,1996-07-28,228848,9893,...,498,228848,9893,4.360,96,0.025148,False,12113.981707,62991.408537,164
3,37,1998-10-14,318480,60,5308.0,D,POPLATEK MESICNE,1997-08-18,70646,8547,...,422,70646,8547,3.120,120,0.021976,True,7572.034483,39954.034483,116
4,38,1998-04-19,110736,48,2307.0,C,POPLATEK TYDNE,1997-08-08,51428,8402,...,254,51428,8402,3.540,120,0.020397,False,4716.200000,31383.581818,55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
677,11327,1998-09-27,39168,24,1632.0,C,POPLATEK MESICNE,1997-10-15,94725,9920,...,347,94725,9920,2.535,130,0.048219,False,7977.981481,55438.814815,54
678,11328,1998-07-18,280440,60,4674.0,C,POPLATEK MESICNE,1996-11-05,387570,9897,...,620,387570,9897,1.780,140,0.048271,False,8138.754717,38619.084906,106
679,11349,1995-10-29,419880,60,6998.0,C,POPLATEK TYDNE,1995-05-26,1204953,12541,...,156,1204953,12541,0.315,167,0.076677,False,24426.500000,59352.666667,18
680,11359,1996-08-06,54024,12,4502.0,A,POPLATEK MESICNE,1994-10-01,117897,8814,...,675,117897,8814,5.220,107,0.017689,False,8708.775510,36480.238095,147
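
As an alternative to separate count and mean steps, the per-account features can be computed in one pass with named aggregation. The sketch below uses a small made-up `toy_trans` frame in place of `df_trans` (the values are illustrative, not taken from the database) and produces the same three columns used above.

```python
import pandas as pd

# Toy stand-in for df_trans; values are illustrative only.
toy_trans = pd.DataFrame({
    "account_id":    [2, 2, 2, 19, 19],
    "trans_amount":  [1100, 20236, 30354, 500, 1500],
    "trans_balance": [1100, 21336, 51690, 500, 2000],
})

# One groupby, three named outputs per account.
trans_features = toy_trans.groupby("account_id", as_index=False).agg(
    trans_amount=("trans_amount", "mean"),    # mean transaction amount
    trans_balance=("trans_balance", "mean"),  # mean balance
    n_trans=("trans_amount", "count"),        # number of transactions
)
print(trans_features)
```

With a single feature frame, only one left merge on `account_id` is needed afterwards.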


From the **card** table, add the card type for each account (account owners only, and only cards issued before the loan)

In [54]:
query='''
select disp.account_id, card.type as card_type

from card
join disp 
on card.disp_id = disp.disp_id
join loan
on disp.account_id = loan.account_id
and card.issued < loan.date
WHERE disp.type = 'OWNER'
'''
df_card = pd.read_sql(query,db)
df_card

Unnamed: 0,account_id,card_type
0,105,classic
1,226,classic
2,276,classic
3,544,classic
4,666,classic
5,1480,classic
6,1766,classic
7,1869,classic
8,2116,classic
9,2262,classic


Join the previous table to the main one. For accounts without a card, set the value to "No"

In [55]:
df = df.merge(df_card, how='left', on='account_id')
# Assign the result back: fillna(inplace=True) on a selected column may not update df
df['card_type'] = df['card_type'].fillna('No')
df

Unnamed: 0,account_id,date_loan,amount,duration,payments,status,frequency,date_acc,A4,A11,...,n_inhabitants,average_salary,average_unemployment_rate,entrepreneur_rate,average_crime_rate,target,trans_amount,trans_balance,n_trans,card_type
0,2,1994-01-05,80952,24,3373.0,A,POPLATEK MESICNE,1993-02-26,1204953,12541,...,1204953,12541,0.315,167,0.076677,False,7954.333333,32590.759259,54,No
1,19,1996-04-29,30276,12,2523.0,B,POPLATEK MESICNE,1995-04-07,103347,9104,...,103347,9104,1.785,123,0.022512,True,5856.350000,25197.137500,80,No
2,25,1997-12-08,30276,12,2523.0,A,POPLATEK MESICNE,1996-07-28,228848,9893,...,228848,9893,4.360,96,0.025148,False,12113.981707,62991.408537,164,No
3,37,1998-10-14,318480,60,5308.0,D,POPLATEK MESICNE,1997-08-18,70646,8547,...,70646,8547,3.120,120,0.021976,True,7572.034483,39954.034483,116,No
4,38,1998-04-19,110736,48,2307.0,C,POPLATEK TYDNE,1997-08-08,51428,8402,...,51428,8402,3.540,120,0.020397,False,4716.200000,31383.581818,55,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
677,11327,1998-09-27,39168,24,1632.0,C,POPLATEK MESICNE,1997-10-15,94725,9920,...,94725,9920,2.535,130,0.048219,False,7977.981481,55438.814815,54,No
678,11328,1998-07-18,280440,60,4674.0,C,POPLATEK MESICNE,1996-11-05,387570,9897,...,387570,9897,1.780,140,0.048271,False,8138.754717,38619.084906,106,No
679,11349,1995-10-29,419880,60,6998.0,C,POPLATEK TYDNE,1995-05-26,1204953,12541,...,1204953,12541,0.315,167,0.076677,False,24426.500000,59352.666667,18,No
680,11359,1996-08-06,54024,12,4502.0,A,POPLATEK MESICNE,1994-10-01,117897,8814,...,117897,8814,5.220,107,0.017689,False,8708.775510,36480.238095,147,classic


From the **client** table, get the clients' age (at the time of the loan) and sex, and add a binary variable indicating whether the client's district matches the account's district

Join the previous table to the main one
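
One possible way to derive these three variables is sketched below on a small made-up frame. The column names `birth_date`, `gender`, `client_district_id` and `account_district_id` are assumptions about what a query joining **client** to the accounts (presumably through `disp`, as in the card query) would return; adjust them to the actual schema.

```python
import pandas as pd

# Hypothetical extract: one row per loan with the owner's client data.
# All column names and values here are illustrative assumptions.
toy_clients = pd.DataFrame({
    "account_id": [2, 19],
    "date_loan": pd.to_datetime(["1994-01-05", "1996-04-29"]),
    "birth_date": pd.to_datetime(["1945-02-04", "1939-04-23"]),
    "gender": ["M", "F"],
    "client_district_id": [1, 21],
    "account_district_id": [1, 20],
})

def add_client_features(clients):
    out = clients.copy()
    loan, birth = out["date_loan"], out["birth_date"]
    # True when the birthday has already happened in the loan year.
    had_birthday = (
        (loan.dt.month > birth.dt.month)
        | ((loan.dt.month == birth.dt.month) & (loan.dt.day >= birth.dt.day))
    )
    # Age in completed years at the loan date.
    out["age_at_loan"] = loan.dt.year - birth.dt.year - (~had_birthday).astype(int)
    # Binary flag: client and account belong to the same district.
    out["same_district"] = out["client_district_id"] == out["account_district_id"]
    return out[["account_id", "age_at_loan", "gender", "same_district"]]

client_features = add_client_features(toy_clients)
print(client_features)
```

The resulting frame can then be left-merged onto the main dataframe on `account_id`, in the same way as the card data.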

### Final table

<a name='train-test'></a>
## 3. Train-test split
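
Since defaults are the minority class, a stratified split keeps the default rate similar in both sets. The following is a minimal sketch with scikit-learn's `train_test_split` on made-up data; the 25% test size and the random seed are arbitrary choices, not values prescribed by this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 loans, 4 of them defaults (illustrative values only).
toy = pd.DataFrame({
    "amount": list(range(1000, 21000, 1000)),
    "target": [True] * 4 + [False] * 16,
})

X = toy.drop(columns="target")
y = toy["target"]

# stratify=y preserves the share of defaults in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(len(X_train), len(X_test), y_train.mean(), y_test.mean())
```

Splitting before feature selection and scaling keeps information from the test set out of those steps.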

<a name='correlation'></a>
## 4. Feature selection

Apply techniques to select the input variables for the model
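
One simple technique, among several that could be used here, is a correlation filter: when two numeric features are highly correlated, keep only one of them. The helper `drop_correlated` and the 0.9 threshold below are illustrative assumptions, shown on made-up data.

```python
import numpy as np
import pandas as pd

def drop_correlated(X, threshold=0.9):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop), to_drop

# Toy numeric features: payments closely follows amount, n_trans does not.
toy = pd.DataFrame({
    "amount":   [10, 20, 30, 40, 50, 60],
    "payments": [1.0, 2.1, 2.9, 4.2, 5.0, 6.1],
    "n_trans":  [5, 1, 4, 2, 6, 3],
})

X_selected, dropped = drop_correlated(toy)
print(dropped)
```

The list of columns to drop would normally be decided on the training set and then applied unchanged to the test set.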

<a name='transformation'></a>
## 5. Transformations

Standardize the numeric variables and convert the categorical ones with one-hot encoding
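
Both transformations can be combined in a single scikit-learn `ColumnTransformer`. The sketch below is one way to do it; the column lists are a small illustrative subset of the notebook's variables and the rows are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows with two numeric and two categorical columns (illustrative values).
toy = pd.DataFrame({
    "amount": [80952, 30276, 318480, 110736],
    "duration": [24, 12, 60, 48],
    "frequency": ["POPLATEK MESICNE", "POPLATEK MESICNE",
                  "POPLATEK MESICNE", "POPLATEK TYDNE"],
    "card_type": ["No", "No", "classic", "No"],
})

numeric_cols = ["amount", "duration"]
categorical_cols = ["frequency", "card_type"]

preprocess = ColumnTransformer([
    # Zero mean, unit variance for numeric columns.
    ("num", StandardScaler(), numeric_cols),
    # One indicator column per category; unseen categories become all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X_t = preprocess.fit_transform(toy)
print(X_t.shape)
```

In practice the transformer is fitted on the training set only (`fit_transform`) and then applied to the test set with `transform`.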

<a name='modeling'></a>
## 6. Modeling

Train one or more models, splitting the dataset into train and test sets
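
As a baseline, the preprocessing and a classifier can be chained in a `Pipeline` so that scaling and encoding are learned from the training data only. The sketch below uses logistic regression on synthetic data; the model choice, the `class_weight="balanced"` setting and the synthetic target rule are assumptions made for illustration, not results from this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data standing in for the final table.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "amount": rng.integers(5_000, 500_000, size=n),
    "trans_balance": rng.normal(40_000, 10_000, size=n),
    "frequency": rng.choice(["POPLATEK MESICNE", "POPLATEK TYDNE"], size=n),
})
# Made-up rule: accounts with a low mean balance are labelled as defaults.
toy["target"] = toy["trans_balance"] < 32_000

X = toy.drop(columns="target")
y = toy["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["amount", "trans_balance"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["frequency"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    # class_weight="balanced" compensates for the minority default class.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
model.fit(X_train, y_train)

# Predicted probability of default for each test loan.
proba_test = model.predict_proba(X_test)[:, 1]
print(proba_test[:5])
```

Other classifiers can be swapped into the `clf` step without changing the rest of the pipeline.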

### Performance measurement

Use the performance metrics for binary classification models on both the train and test sets
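
A small helper can compute the usual binary classification metrics from true labels and predicted scores. The function name `binary_report`, the 0.5 threshold and the example labels below are illustrative assumptions.

```python
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

def binary_report(y_true, y_score, threshold=0.5):
    """Return common binary metrics for scores thresholded at `threshold`."""
    y_pred = [int(s >= threshold) for s in y_score]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        # ROC AUC uses the raw scores, not the thresholded predictions.
        "roc_auc": roc_auc_score(y_true, y_score),
    }

# Made-up labels (1 = default) and predicted default probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.9, 0.7, 0.4, 0.1]

report = binary_report(y_true, y_score)
print(report)
```

Calling the helper once with the train scores and once with the test scores makes it easy to compare the two sets; a large gap between them suggests overfitting.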