# Loan Default Prediction


For this project we will use a sample of the Lending Club data. The goal is to predict whether a given user will default, based on information the platform collects. This will help us improve the lending methodology/pipeline.


# Description



Contains the loans from this platform:

    period 2007-2017Q3.
    887 thousand observations, sample of 100 thousand
    150 variables
    Target: loan status



# Objective

Perform an ETL and an EDA

## ETL

0. Clean the data so that it ends up in `tidy` format by the end of the ETL.
1. Make sure you load and read the data.
2. Create a table that stores each column's name and data type: (`column_name`, `type`).
3. Think carefully about which data type is correct. Why did you choose string/object, float, or int? There are no wrong answers as such, but you must justify your decision.
4. Handle missing values or NaNs appropriately. Justify each decision.







## EDA

0. Prepare the data for a data pipeline
1. Remove useless columns
2. Impute values
3. Maintain replicability and reproducibility
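
One possible way to approach the "remove useless columns" step is to drop columns whose share of missing values is above a threshold. The sketch below is only an illustration: the helper name `drop_sparse_columns`, the 0.5 threshold, and the toy data are assumptions, not part of the assignment.

```python
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Return a copy of df without the columns whose NaN share exceeds max_missing."""
    missing_share = df.isna().mean()  # fraction of NaNs per column
    keep = missing_share[missing_share <= max_missing].index
    return df.loc[:, keep].copy()


# Toy data standing in for the loans table (hypothetical values).
toy = pd.DataFrame({
    "loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0],
    "member_id": [None, None, None, None],  # entirely missing
    "emp_length": [1.0, None, 3.0, 4.0],    # 25% missing
})
cleaned = drop_sparse_columns(toy, max_missing=0.5)
```

Keeping the threshold as an explicit argument makes the decision easy to document and to reproduce later.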

**Don't forget to write down your justifications in cells so you remember them when you have to explain them.** You can add as many cells as you need for your explanations and code; just keep the structure.

# ETL

In [1]:
import pandas as pd
import numpy as np
pd.set_option("display.max_columns", None)
pd.set_option('display.max_colwidth', None)

You will get 2 errors; fix them using what was covered in class.  
Tip: They are fixed with additional arguments to the `read_csv` function  
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html 
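
As a rough illustration of the kind of arguments the tip refers to, the sketch below reads a small gzip-compressed CSV built in memory. Which two errors the exercise intends is an assumption here: `compression="gzip"` is needed when pandas cannot infer the compression from the path, and `low_memory=False` is a common way to avoid mixed-dtype warnings on large files. The toy CSV content is made up.

```python
import gzip
import io

import pandas as pd

# Build a tiny gzip-compressed CSV in memory to stand in for the real file.
raw_csv = "id,term,int_rate\n1, 36 months,12.39\n2, 60 months,6.99\n"
buffer = io.BytesIO(gzip.compress(raw_csv.encode("utf-8")))

demo = pd.read_csv(
    buffer,
    compression="gzip",  # state the compression explicitly instead of relying on inference
    low_memory=False,    # parse the whole file at once so each column gets a single dtype
)
```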

In [4]:
loans = pd.read_csv('https://github.com/sonder-art/fdd_prim_2023/blob/main/codigo/pandas/LoansData_sample.csv.gz?raw=true', compression="gzip") 
loans




Unnamed: 0.1,Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,0,38098114,,15000.0,15000.0,15000.0,60 months,12.39,336.64,C,...,,,Cash,N,,,,,,
1,1,36805548,,10400.0,10400.0,10400.0,36 months,6.99,321.08,A,...,,,Cash,N,,,,,,
2,2,37842129,,21425.0,21425.0,21425.0,60 months,15.59,516.36,D,...,,,Cash,N,,,,,,
3,3,37612354,,12800.0,12800.0,12800.0,60 months,17.14,319.08,D,...,,,Cash,N,,,,,,
4,4,37662224,,7650.0,7650.0,7650.0,36 months,13.66,260.20,C,...,,,Cash,N,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,99995,22454240,,8400.0,8400.0,8400.0,36 months,9.17,267.79,B,...,,,Cash,N,,,,,,
99996,99996,11396920,,10000.0,10000.0,10000.0,36 months,12.99,336.90,C,...,,,Cash,N,,,,,,
99997,99997,8556176,,30000.0,30000.0,30000.0,60 months,20.99,811.44,E,...,,,Cash,N,,,,,,
99998,99998,24023408,,8475.0,8475.0,8475.0,36 months,24.99,336.92,F,...,,,Cash,N,,,,,,


## Table (column_name, type)

See the `pd.DataFrame.dtypes` attribute: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html 

In [5]:
column_types = loans.dtypes
column_types

Unnamed: 0                 int64
id                         int64
member_id                float64
loan_amnt                float64
funded_amnt              float64
                          ...   
settlement_status         object
settlement_date           object
settlement_amount        float64
settlement_percentage    float64
settlement_term          float64
Length: 151, dtype: object
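
`dtypes` returns a Series indexed by column name. If an explicit two-column table (`column_name`, `type`) is wanted, one possible way to build it is sketched below on a toy DataFrame; the toy data and the name `column_table` are assumptions for illustration.

```python
import pandas as pd

# Toy DataFrame standing in for `loans` (hypothetical values).
toy = pd.DataFrame({"loan_amnt": [1000.0], "term": [" 36 months"], "open_acc": [3]})

column_table = (
    toy.dtypes
    .astype(str)                 # store dtype names as plain strings
    .rename_axis("column_name")  # name the index so it becomes a column
    .reset_index(name="type")
)
```

Applied to the real data, the same chain would start from `loans.dtypes`.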

## Load column descriptions

The following table describes the meaning of each column.

In [6]:


datos_dict = pd.read_excel(
    'https://resources.lendingclub.com/LCDataDictionary.xlsx')
datos_dict.columns = ['feature', 'description']


In [7]:
datos_dict

Unnamed: 0,feature,description
0,acc_now_delinq,The number of accounts on which the borrower i...
1,acc_open_past_24mths,Number of trades opened in past 24 months.
2,addr_state,The state provided by the borrower in the loan...
3,all_util,Balance to credit limit on all trades
4,annual_inc,The self-reported annual income provided by th...
...,...,...
148,settlement_amount,The loan amount that the borrower has agreed t...
149,settlement_percentage,The settlement amount as a percentage of the p...
150,settlement_term,The number of months that the borrower will be...
151,,


### Pickle

Write code to **save** and **load** the `datos_dict` DataFrame created in the previous cells in **pickle** format.

In [8]:
import pickle

In [9]:
# Save
datos_dict.to_pickle('datos_dict.pkl')

# Load
datos_dict_pickle = pd.read_pickle('datos_dict.pkl')
datos_dict_pickle

Unnamed: 0,feature,description
0,acc_now_delinq,The number of accounts on which the borrower i...
1,acc_open_past_24mths,Number of trades opened in past 24 months.
2,addr_state,The state provided by the borrower in the loan...
3,all_util,Balance to credit limit on all trades
4,annual_inc,The self-reported annual income provided by th...
...,...,...
148,settlement_amount,The loan amount that the borrower has agreed t...
149,settlement_percentage,The settlement amount as a percentage of the p...
150,settlement_term,The number of months that the borrower will be...
151,,


## Data Types

Apply whatever transformations or casts you think your data needs so that each column has a suitable data type. When you finish, recreate the `column_types` table with the new types.

Don't forget to write down your justifications so you remember them when you have to explain them.

In [12]:
loans.set_index("id", inplace=True)

In [14]:
# Take a subset of the columns that may be of interest; we discard the columns that contain NaNs
# (except emp_length and last_pymnt_d, handled below) and some others that would not need a dtype change.
# .copy() makes loans_sub an independent DataFrame, so later assignments do not raise SettingWithCopyWarning.

loans_sub = loans[['loan_amnt' , 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate', 'installment', 'grade', 
                      'sub_grade', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'loan_status',
                      'pymnt_plan', 'purpose', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'open_acc', 'revol_bal', 'total_acc', 'total_pymnt', 'total_pymnt_inv', 
                      'last_pymnt_d', 'last_pymnt_amnt', 'tot_cur_bal', 'acc_open_past_24mths', 'avg_cur_bal', 'pct_tl_nvr_dlq', 'tot_hi_cred_lim']].copy()

loans_sub

id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_length,home_ownership,...,total_acc,total_pymnt,total_pymnt_inv,last_pymnt_d,last_pymnt_amnt,tot_cur_bal,acc_open_past_24mths,avg_cur_bal,pct_tl_nvr_dlq,tot_hi_cred_lim
38098114,15000.0,15000.0,15000.0,60 months,12.39,336.64,C,C1,10+ years,RENT,...,17.0,17392.370000,17392.37,Jun-2016,12017.81,149140.0,5.0,29828.0,100.0,196500.0
36805548,10400.0,10400.0,10400.0,36 months,6.99,321.08,A,A3,8 years,MORTGAGE,...,36.0,6611.690000,6611.69,Aug-2016,321.08,162110.0,7.0,9536.0,83.3,179407.0
37842129,21425.0,21425.0,21425.0,60 months,15.59,516.36,D,D1,6 years,RENT,...,35.0,25512.200000,25512.20,May-2016,17813.19,42315.0,4.0,4232.0,91.4,57073.0
37612354,12800.0,12800.0,12800.0,60 months,17.14,319.08,D,D4,10+ years,MORTGAGE,...,13.0,11207.670000,11207.67,Dec-2017,319.08,261815.0,2.0,32727.0,76.9,368700.0
37662224,7650.0,7650.0,7650.0,36 months,13.66,260.20,C,C3,< 1 year,RENT,...,20.0,2281.980000,2281.98,Aug-2015,17.70,64426.0,6.0,5857.0,100.0,82331.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22454240,8400.0,8400.0,8400.0,36 months,9.17,267.79,B,B1,2 years,MORTGAGE,...,16.0,9640.145407,9640.15,Aug-2017,267.50,152181.0,2.0,25364.0,93.7,209557.0
11396920,10000.0,10000.0,10000.0,36 months,12.99,336.90,C,C1,3 years,RENT,...,17.0,11685.080000,11685.08,Mar-2016,5594.78,46413.0,3.0,4219.0,100.0,64149.0
8556176,30000.0,30000.0,30000.0,60 months,20.99,811.44,E,E4,10+ years,RENT,...,30.0,32530.430000,32530.43,Dec-2017,811.44,345934.0,5.0,20349.0,93.3,371088.0
24023408,8475.0,8475.0,8475.0,36 months,24.99,336.92,F,F4,10+ years,RENT,...,29.0,2695.360000,2695.36,Apr-2015,336.92,31247.0,8.0,3125.0,86.4,43686.0


In [None]:
# We observe that some columns could be cast: term, the dates, emp_length, pymnt_plan, annual_inc, loan_amnt

In [None]:
# Cast to int
loans_sub['annual_inc'] = loans_sub['annual_inc'].astype(np.int64)

In [None]:
# Cast to int
loans_sub['loan_amnt'] = loans_sub['loan_amnt'].astype(np.int64)

In [None]:
# Replace the "n" and "y" flags with booleans (assign the result instead of a chained inplace call)
loans_sub["pymnt_plan"] = loans_sub["pymnt_plan"].map({"n": False, "y": True})

In [None]:
# Convert to int instead of keeping a number plus text
loans_sub["term"] = loans_sub["term"].replace({" 36 months": "36", " 60 months": "60"}).astype("int")

In [None]:
# Convert to float instead of keeping a number plus text (float so the NaNs are preserved)
emp_length_labels = {"< 1 year": "0", "1 year": "1", "2 years": "2", "3 years": "3", "4 years": "4", "5 years": "5",
                     "6 years": "6", "7 years": "7", "8 years": "8", "9 years": "9", "10+ years": "10"}
loans_sub["emp_length"] = loans_sub["emp_length"].replace(emp_length_labels).astype("float")

In [None]:
# Cast the dates to datetime; the values look like "Jun-2016", so we state the format explicitly

loans_sub['last_pymnt_d'] = pd.to_datetime(loans_sub['last_pymnt_d'], format="%b-%Y")
loans_sub['last_pymnt_d'] = loans_sub['last_pymnt_d'].dt.to_period('M')

loans_sub['earliest_cr_line'] = pd.to_datetime(loans_sub['earliest_cr_line'], format="%b-%Y")
loans_sub['earliest_cr_line'] = loans_sub['earliest_cr_line'].dt.to_period('M')

In [32]:
column_types = loans_sub.dtypes
column_types



loan_amnt                   int64
funded_amnt               float64
funded_amnt_inv           float64
term                        int64
int_rate                  float64
installment               float64
grade                      object
sub_grade                  object
emp_length                float64
home_ownership             object
annual_inc                  int64
verification_status        object
loan_status                object
pymnt_plan                   bool
purpose                    object
dti                       float64
delinq_2yrs               float64
earliest_cr_line        period[M]
open_acc                  float64
revol_bal                 float64
total_acc                 float64
total_pymnt               float64
total_pymnt_inv           float64
last_pymnt_d            period[M]
last_pymnt_amnt           float64
tot_cur_bal               float64
acc_open_past_24mths      float64
avg_cur_bal               float64
pct_tl_nvr_dlq            float64
tot_hi_cred_lim           float64
dtype: object

## Handling NaNs or missing values

Handle the missing values. Choose a suitable strategy depending on the data type you assigned to each column.


Write code to **save** and **load** a JSON file that stores the `estrategia` (strategy) and `valor` (value) you used to **impute**. For example: if there is a column called `columna 3` and you used mean imputation, and another called `columna 4` for which you chose the word 'missing', the JSON must contain:  
  
 `{'columna 3':{'estrategia':'mean', 'valor':3.4}, 'columna 4':{'estrategia':'identificador', 'valor':'missing'}}`  

 That is, every column with an imputation method points to another dictionary in which the **key** `estrategia` briefly describes the method and the **key** `valor` holds the value used. In general:   
 `{'nombre de la columna':{'estrategia':'descripcion de estrategia', 'valor':'valor utilizado'}}`. 
 

If you use more than one method, you can nest them in a list  
  `[{...},{...}]`.  

Even if a column you used was not imputed, you must still add it to the JSON.

The idea is that anyone else can load the JSON file with your function, understand what you did, and replicate it easily. There is no single correct answer, but you will have to justify and explain your decisions.
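
The sketch below shows one way the imputation and its JSON record could be produced together, so the record cannot drift away from what was actually done. It is only an illustration under assumptions: the helper names (`impute_and_record`, `save_record`, `load_record`), the choice of the median for numeric columns, and the toy data are not part of the assignment. Columns without NaNs get an empty dictionary, following the convention described above.

```python
import json
import os
import tempfile

import pandas as pd


def impute_and_record(df: pd.DataFrame) -> tuple[pd.DataFrame, dict]:
    """Impute a copy of df and return it together with a JSON-serializable record."""
    out = df.copy()
    record = {}
    for col in out.columns:
        if not out[col].isna().any():
            record[col] = {}  # column used, but no imputation was needed
        elif pd.api.types.is_numeric_dtype(out[col]):
            value = float(out[col].median())  # plain float so json can serialize it
            out[col] = out[col].fillna(value)
            record[col] = {"estrategia": "median", "valor": value}
        else:
            out[col] = out[col].fillna("missing")
            record[col] = {"estrategia": "identificador", "valor": "missing"}
    return out, record


def save_record(record: dict, path: str) -> None:
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2)


def load_record(path: str) -> dict:
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Toy data (hypothetical values) with one numeric and one text column containing NaNs.
toy = pd.DataFrame({
    "dti": [10.0, None, 30.0],
    "purpose": ["car", None, "house"],
    "term": [36, 60, 36],
})
imputed, record = impute_and_record(toy)

# Round trip through a temporary file to check that the record survives saving and loading.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "columnas.json")
    save_record(record, path)
    reloaded = load_record(path)
```

Converting the imputed value to a plain `float` matters because `json.dump` cannot serialize NumPy scalar types directly.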

### Imputation

In [34]:
loans_sub.isna().sum()


loan_amnt                  0
funded_amnt                0
funded_amnt_inv            0
term                       0
int_rate                   0
installment                0
grade                      0
sub_grade                  0
emp_length              5259
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
pymnt_plan                 0
purpose                    0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
open_acc                   0
revol_bal                  0
total_acc                  0
total_pymnt                0
total_pymnt_inv            0
last_pymnt_d              67
last_pymnt_amnt            0
tot_cur_bal                0
acc_open_past_24mths       0
avg_cur_bal                0
pct_tl_nvr_dlq             0
tot_hi_cred_lim            0
dtype: int64

In [35]:
# We have NaNs in emp_length and last_pymnt_d.
# Casting to object first makes it explicit that both columns stop being float / period[M]
# once the 'missing' identifier is written into them.
loans_sub["last_pymnt_d"] = loans_sub["last_pymnt_d"].astype(object).fillna("missing")
loans_sub["emp_length"] = loans_sub["emp_length"].astype(object).fillna("missing")


In [36]:
loans_sub.isna().sum()
# No NaNs remain

loan_amnt               0
funded_amnt             0
funded_amnt_inv         0
term                    0
int_rate                0
installment             0
grade                   0
sub_grade               0
emp_length              0
home_ownership          0
annual_inc              0
verification_status     0
loan_status             0
pymnt_plan              0
purpose                 0
dti                     0
delinq_2yrs             0
earliest_cr_line        0
open_acc                0
revol_bal               0
total_acc               0
total_pymnt             0
total_pymnt_inv         0
last_pymnt_d            0
last_pymnt_amnt         0
tot_cur_bal             0
acc_open_past_24mths    0
avg_cur_bal             0
pct_tl_nvr_dlq          0
tot_hi_cred_lim         0
dtype: int64

### Code to save and load JSON files

In [37]:
import json

In [38]:
# Every column of loans_sub is listed; only emp_length and last_pymnt_d were imputed.
d = {'loan_amnt':{} , 'funded_amnt':{}, 'funded_amnt_inv':{}, 'term':{}, 'int_rate':{}, 'installment':{}, 'grade':{}, 'sub_grade':{}, "emp_length": {"estrategia":"identificador", "valor":"missing"}, 'home_ownership':{}, 'annual_inc':{}, 'verification_status':{}, 'loan_status':{}, 'pymnt_plan':{}, 'purpose':{}, 'dti':{}, 'delinq_2yrs':{}, 'earliest_cr_line':{}, 'open_acc':{}, 'revol_bal':{}, 'total_acc':{}, 'total_pymnt':{}, 'total_pymnt_inv':{}, "last_pymnt_d": {"estrategia":"identificador", "valor":"missing"}, 'last_pymnt_amnt':{}, 'tot_cur_bal':{}, 'acc_open_past_24mths':{}, 'avg_cur_bal':{}, 'pct_tl_nvr_dlq':{}, 'tot_hi_cred_lim':{}}

with open("columnas.json", "w") as archivo:
    json.dump(d, archivo, indent=2)

In [39]:
with open("columnas.json") as archivo:
    d = json.load(archivo)
    print(d)

{'loan_amnt': {}, 'funded_amnt': {}, 'funded_amnt_inv': {}, 'term': {}, 'int_rate': {}, 'installment': {}, 'grade': {}, 'sub_grade': {}, 'emp_length': {'estrategia': 'identificador', 'valor': 'missing'}, 'home_ownership': {}, 'annual_inc': {}, 'verification_status': {}, 'loan_status': {}, 'pymnt_plan': {}, 'purpose': {}, 'dti': {}, 'delinq_2yrs': {}, 'earliest_cr_line': {}, 'open_acc': {}, 'revol_bal': {}, 'total_acc': {}, 'total_pymnt': {}, 'total_pymnt_inv': {}, 'last_pymnt_d': {'estrategia': 'identificador', 'valor': 'missing'}, 'last_pymnt_amnt': {}, 'tot_cur_bal': {}, 'acc_open_past_24mths': {}, 'avg_cur_bal': {}, 'pct_tl_nvr_dlq': {}, 'tot_hi_cred_lim': {}}
