# Loan Default Prediction


For this project we will use a sample of the Lending Club data. The goal is to predict whether a given user will default, based on information the platform collects. This will help us improve the lending methodology/pipeline.


# Description



Contains the loans from this platform:

    period 2007-2017Q3.
    887 thousand observations, sample of 100 thousand
    150 variables
    Target: loan status



# Objective

Perform an ETL and an EDA

## ETL

0. Clean the data so that at the end of the ETL it is in `tidy` format.
1. Make sure you load and read the data.
2. Create a table that stores the column name and the data type: (`column_name`, `type`).
3. Think carefully about which data type is correct. Why did you choose string/object, float, or int? There are no wrong answers as such, but you must justify your decision.
4. Handle missing values or NaNs appropriately. Justify each decision.







## EDA

0. Prepare the data for a data pipeline
1. Remove useless columns
2. Impute values
3. Maintain replicability and reproducibility

**Don't forget to write down your justifications in cells so you remember them when it is your turn to explain.** You can add as many cells as you need for your explanation and code; just keep the structure.
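One possible reading of "useless columns" is columns that are entirely missing or that hold a single constant value. The sketch below illustrates that reading on a toy frame. The helper name `drop_uninformative` and the toy values are assumptions made for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the loans data
toy = pd.DataFrame({
    "member_id": [np.nan, np.nan, np.nan],      # entirely missing
    "policy_code": [1, 1, 1],                   # constant, carries no information
    "loan_amnt": [15000.0, 10400.0, 21425.0],   # informative
})

def drop_uninformative(df):
    """Drop all-missing and constant columns; also return what was dropped."""
    all_missing = [c for c in df.columns if df[c].isna().all()]
    constant = [
        c for c in df.columns
        if c not in all_missing and df[c].nunique(dropna=True) <= 1
    ]
    return df.drop(columns=all_missing + constant), all_missing, constant

reduced, all_missing, constant = drop_uninformative(toy)
print(list(reduced.columns), all_missing, constant)
```

Returning the lists of dropped columns makes the decision easy to record and to replay on new data, which helps with the reproducibility requirement.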

# ETL

In [1]:
import pandas as pd
import numpy as np

You will get 2 errors; fix them using what we covered in class.  
Tip: They are fixed with additional arguments to the `read_csv` function  
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html 

In [3]:
loans = pd.read_csv('https://github.com/sonder-art/fdd_prim_2023/blob/main/codigo/pandas/LoansData_sample.csv.gz?raw=true', compression='gzip', low_memory=False)

loans


Unnamed: 0.1,Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,0,38098114,,15000.0,15000.0,15000.0,60 months,12.39,336.64,C,...,,,Cash,N,,,,,,
1,1,36805548,,10400.0,10400.0,10400.0,36 months,6.99,321.08,A,...,,,Cash,N,,,,,,
2,2,37842129,,21425.0,21425.0,21425.0,60 months,15.59,516.36,D,...,,,Cash,N,,,,,,
3,3,37612354,,12800.0,12800.0,12800.0,60 months,17.14,319.08,D,...,,,Cash,N,,,,,,
4,4,37662224,,7650.0,7650.0,7650.0,36 months,13.66,260.20,C,...,,,Cash,N,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,99995,22454240,,8400.0,8400.0,8400.0,36 months,9.17,267.79,B,...,,,Cash,N,,,,,,
99996,99996,11396920,,10000.0,10000.0,10000.0,36 months,12.99,336.90,C,...,,,Cash,N,,,,,,
99997,99997,8556176,,30000.0,30000.0,30000.0,60 months,20.99,811.44,E,...,,,Cash,N,,,,,,
99998,99998,24023408,,8475.0,8475.0,8475.0,36 months,24.99,336.92,F,...,,,Cash,N,,,,,,


## Table (column_name, type)

See the `pd.DataFrame.dtypes` attribute. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html 

In [13]:
column_types = loans.dtypes
column_types

Unnamed: 0                 int64
id                         int64
member_id                float64
loan_amnt                float64
funded_amnt              float64
                          ...   
settlement_status         object
settlement_date           object
settlement_amount        float64
settlement_percentage    float64
settlement_term          float64
Length: 151, dtype: object
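`dtypes` returns a Series indexed by column name, whereas the task asks for a table with the columns (`column_name`, `type`). A minimal sketch of one way to build such a table is shown here on a toy frame; `toy_loans` and `column_types_table` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy frame (hypothetical values) standing in for `loans`
toy_loans = pd.DataFrame({
    "id": [1, 2],
    "loan_amnt": [15000.0, 10400.0],
    "grade": ["C", "A"],
})

# Turn the dtypes Series into a two-column table: column_name, type
column_types_table = (
    toy_loans.dtypes.astype(str)
    .rename_axis("column_name")
    .reset_index(name="type")
)
print(column_types_table)
```

The same expression should work on the real DataFrame by replacing `toy_loans` with `loans`.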

## Load column descriptions

The following table describes the meaning of each column

In [5]:


datos_dict = pd.read_excel(
    'https://resources.lendingclub.com/LCDataDictionary.xlsx')
datos_dict.columns = ['feature', 'description']


In [6]:
datos_dict

Unnamed: 0,feature,description
0,acc_now_delinq,The number of accounts on which the borrower i...
1,acc_open_past_24mths,Number of trades opened in past 24 months.
2,addr_state,The state provided by the borrower in the loan...
3,all_util,Balance to credit limit on all trades
4,annual_inc,The self-reported annual income provided by th...
...,...,...
148,settlement_amount,The loan amount that the borrower has agreed t...
149,settlement_percentage,The settlement amount as a percentage of the p...
150,settlement_term,The number of months that the borrower will be...
151,,


### Pickle

Write code to **save** and **load** the `datos_dict` DataFrame created in the previous cells in **pickle** format

In [11]:
# Code to save
import pickle

with open('datos_dict.pkl','wb') as file:
    pickle.dump(datos_dict, file)

In [20]:
# Code to load
datos_dict_from_pickle = None
with open('datos_dict.pkl','rb') as file:
    datos_dict_from_pickle = pickle.load(file)

list(datos_dict_from_pickle.description)

['The number of accounts on which the borrower is now delinquent.',
 'Number of trades opened in past 24 months.',
 'The state provided by the borrower in the loan application',
 'Balance to credit limit on all trades',
 'The self-reported annual income provided by the borrower during registration.',
 'The combined self-reported annual income provided by the co-borrowers during registration',
 'Indicates whether the loan is an individual application or a joint application with two co-borrowers',
 'Average current balance of all accounts',
 'Total open to buy on revolving bankcards.',
 'Ratio of total current balance to high credit/credit limit for all bankcard accounts.',
 'Number of charge-offs within 12 months',
 'post charge off collection fee',
 'Number of collections in 12 months excluding medical collections',
 "The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years",
 'The past-due amount owed for the accounts on which the bo
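As an alternative to calling the `pickle` module directly, pandas provides `DataFrame.to_pickle` and `pd.read_pickle`. The sketch below shows a round trip on a small stand-in frame; `toy_dict` and the temporary path are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for `datos_dict` (two rows copied from the output above)
toy_dict = pd.DataFrame({
    "feature": ["acc_now_delinq", "addr_state"],
    "description": [
        "The number of accounts on which the borrower is now delinquent.",
        "The state provided by the borrower in the loan application",
    ],
})

# Write to a temporary directory so the example leaves no files behind in the project
path = os.path.join(tempfile.mkdtemp(), "datos_dict.pkl")
toy_dict.to_pickle(path)           # save
restored = pd.read_pickle(path)    # load

print(restored.equals(toy_dict))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.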

## Data Types

Apply whatever transformations or casts you think are necessary so that each column's data type is appropriate. When you finish, recreate the `column_types` table with the new types.

Don't forget to write down your justifications so you remember them when it is your turn to explain.

In [31]:
to_int_cols = ['loan_amnt','funded_amnt','funded_amnt_inv','tot_hi_cred_lim','total_bal_ex_mort','total_bc_limit','total_il_high_credit_limit']
# Reduce the DataFrame's memory footprint by casting these float columns to int32.
# Note: the cast truncates any fractional part and raises an error if a column contains NaN.
for column in to_int_cols:
    loans[column] = loans[column].astype(np.int32)


In [34]:
# Extract the number of months from strings such as '36 months' so the term can be queried as an integer
loans['term'] = loans['term'].str.extract(r'(\d+)', expand=False).astype(int)

In [47]:
column_types = loans.dtypes
column_types


Unnamed: 0                 int64
id                         int64
member_id                float64
loan_amnt                  int32
funded_amnt                int32
                          ...   
settlement_status         object
settlement_date           object
settlement_amount        float64
settlement_percentage    float64
settlement_term          float64
Length: 151, dtype: object
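A plain cast to `int32` fails when a column contains NaN and silently truncates fractional values. The following sketch, on hypothetical toy data, casts only whole-valued columns and falls back to pandas' nullable `Int32` dtype when values are missing. The helper name `cast_whole_floats` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one column with a NaN, one clean, one fractional
toy = pd.DataFrame({
    "loan_amnt": [15000.0, 10400.0, np.nan],
    "funded_amnt": [15000.0, 10400.0, 7650.0],
    "int_rate": [12.39, 6.99, 13.66],
})

def cast_whole_floats(series):
    """Cast to an integer dtype only if every non-missing value is whole."""
    values = series.dropna()
    if not (values == values.round()).all():
        return series                      # fractional values: keep as float
    if series.isna().any():
        return series.astype("Int32")      # nullable integer keeps the missing entries
    return series.astype(np.int32)

converted = toy.copy()
for column in toy.columns:
    converted[column] = cast_whole_floats(toy[column])

print(converted.dtypes)
```

Checking for whole values first keeps the cast lossless, which makes the type decision easier to justify.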

## Handling NaNs or missing values

Handle the missing values. Choose an appropriate strategy depending on the data type you assigned to each column.


Write code to **save** and **load** a JSON file that stores the `estrategia` (strategy) and `valor` (value) you used to **impute**. For example: if there is a column called `columna 3` for which you used mean imputation, and another called `columna 4` for which you chose the word 'missing', the JSON should contain:  
  
 `{'columna 3':{'estrategia':'mean', 'valor':3.4}, 'columna 4':{'estrategia':'identificador', 'valor':'missing'}}`  

 That is, each column with an imputation method points to another dictionary in which the **key** `estrategia` briefly describes the method and the **key** `valor` holds the value used. In general:   
 `{'column name':{'estrategia':'description of the strategy', 'valor':'value used'}}`. 
 

If you use more than one method, you can nest them in a list  
  `[{...},{...}]`.  

Even if a column you used was not imputed, you must add it to the JSON.

The idea is that anyone else can load the JSON file with your function, understand what you did, and replicate it easily. There is no single correct answer, but you will have to justify and explain your decisions.
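A minimal sketch of how imputation and the strategy record could be produced together, on hypothetical toy data. The choice of median for numeric columns and the label `'none'` for columns that were not imputed are assumptions, not requirements of the assignment.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): numeric and text columns with gaps, one complete column
toy = pd.DataFrame({
    "annual_inc": [60000.0, np.nan, 40000.0],
    "emp_title": ["teacher", None, "nurse"],
    "loan_amnt": [11000, 18000, 12000],
})

strategies = {}
imputed = toy.copy()
for col in toy.columns:
    if not toy[col].isna().any():
        # Columns without missing values are still recorded
        strategies[col] = {"estrategia": "none", "valor": None}
    elif pd.api.types.is_numeric_dtype(toy[col]):
        value = float(toy[col].median())
        imputed[col] = toy[col].fillna(value)
        strategies[col] = {"estrategia": "median", "valor": value}
    else:
        imputed[col] = toy[col].fillna("missing")
        strategies[col] = {"estrategia": "identificador", "valor": "missing"}

print(strategies)
```

Converting the median to a built-in `float` keeps the dictionary serializable with the standard `json` module.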

### Imputation

In [43]:
# Your code here
# Drop every column that contains at least one missing value
cleaned_loans = loans.dropna(axis='columns', how='any')

specific_cols_loans = cleaned_loans[['id','loan_amnt','annual_inc','term','acc_now_delinq']]
specific_cols_loans = specific_cols_loans[specific_cols_loans['acc_now_delinq']>0]
specific_cols_loans['percentage_of_inc_to_fullfill'] = (specific_cols_loans['loan_amnt']/(specific_cols_loans['term']/12))/specific_cols_loans['annual_inc']
mean = specific_cols_loans['percentage_of_inc_to_fullfill'].mean()
risky_ones = specific_cols_loans[specific_cols_loans['percentage_of_inc_to_fullfill']>mean]

strategies = [{'id': 'int', 'loan_amnt': 'int', 'annual_inc': 'int', 'term': 'int', 'percentage_of_inc_to_fullfill':'((loan_amnt/(term/12))/annual_inc)>percentage_of_inc_to_fullfill@mean'},{'acc_now_delinq': '>0'}]
# These users are at higher risk of not repaying because they are already delinquent on at least one account.
# They are also less likely to repay because the share of annual income they must devote
# to the loan each year is above the mean of that ratio among delinquent borrowers.
print(risky_ones)
print(mean)

             id  loan_amnt  annual_inc  term  acc_now_delinq  \
40     37631762      11000     60000.0    36             1.0   
104    37800170      18000    109000.0    36             1.0   
405    36280834      21000    108600.0    36             1.0   
421    33411541      12000     40000.0    60             1.0   
566    37620902       8475     25000.0    36             1.0   
...         ...        ...         ...   ...             ...   
98229  19606339      24000     71000.0    60             1.0   
98367  19205915       8650     55000.0    36             1.0   
98544  20449082      10000     30000.0    60             1.0   
99440  23943791      22000     90000.0    36             2.0   
99536  23973518      10425     50000.0    36             3.0   

       percentage_of_inc_to_fullfill  
40                          0.061111  
104                         0.055046  
405                         0.064457  
421                         0.060000  
566                         0.113000

In [44]:
# Your code here
import json

with open('strategies.json', "w") as file:
    json.dump(strategies, file)

### Code to save and load JSONs
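A minimal sketch of a save/load pair for the strategies file, using the example dictionary from the task description. The function names `save_strategies` and `load_strategies` and the temporary path are assumptions for illustration.

```python
import json
import os
import tempfile

def save_strategies(strategies, path):
    """Write the imputation strategies to a JSON file."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump(strategies, file, indent=2, ensure_ascii=False)

def load_strategies(path):
    """Read the imputation strategies back from a JSON file."""
    with open(path, "r", encoding="utf-8") as file:
        return json.load(file)

# Example dictionary taken from the task description
example = {
    "columna 3": {"estrategia": "mean", "valor": 3.4},
    "columna 4": {"estrategia": "identificador", "valor": "missing"},
}

path = os.path.join(tempfile.mkdtemp(), "strategies.json")
save_strategies(example, path)
loaded = load_strategies(path)
print(loaded == example)
```

JSON requires double-quoted strings, so the file on disk will look slightly different from the single-quoted Python literal shown in the task description.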