# Loan Default Prediction


For this project we will use a sample of the Lending Club data. The goal is to predict whether a given user will default, based on information the platform collects. This will help us improve the lending methodology/pipeline.


# Description



It contains the loans from this platform:

    period 2007-2017Q3.
    887 thousand observations, sample of 100 thousand
    150 variables
    Target: loan status



# Objective

Perform an ETL and an EDA

## ETL

0. Clean the data so that at the end of the ETL it is in `tidy` format.
1. Make sure you load and read the data
2. Create a table that stores the column name and the data type: (`column_name`, `type`).
3. Think carefully about which data type is correct. Why did you choose string/object, float or int? There are no wrong answers as such, but you have to justify your decision.
4. Handle missings or NaNs appropriately. Justify each decision







## EDA

0. Prepare the data for a data pipeline
1. Remove useless columns
2. Impute values
3. Maintain replicability and reproducibility

**Don't forget to write your justifications in cells so you remember them when it's your turn to explain.** You can add as many cells as you need for your explanation and code; just keep the structure.

# ETL

In [10]:
import pandas as pd
import numpy as np

You will get 2 errors; fix them using what we saw in class.  
Tip: They are fixed with additional arguments to the `read_csv` function  
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html 
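A minimal sketch of how extra `read_csv` arguments can address this kind of problem. It assumes the two issues are a mixed-type `DtypeWarning` and a saved index that comes back as an `Unnamed: 0` column; that is a guess, not something the notebook confirms. The tiny file built here (`demo_sample.csv.gz`) is made up for the demonstration.

```python
import os
import tempfile

import pandas as pd

# Build a tiny gzip-compressed CSV (hypothetical data, not the Lending Club file).
demo = pd.DataFrame({"id": [1, 2, 3],
                     "term": [" 36 months", " 60 months", " 36 months"]})
demo_path = os.path.join(tempfile.mkdtemp(), "demo_sample.csv.gz")
demo.to_csv(demo_path)  # the index is written as an unnamed first column

demo_loaded = pd.read_csv(
    demo_path,
    compression="gzip",  # explicit here, although pandas can infer it from the .gz suffix
    index_col=0,         # reuse the saved index instead of getting an 'Unnamed: 0' column
    low_memory=False,    # parse the file in one pass so column dtypes are inferred consistently
)
print(demo_loaded.columns.tolist())
```

Passing an explicit `dtype=` mapping for the problematic columns is another option when the mixed types are known in advance.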

In [11]:
loans = pd.read_csv('/home/braulioloz/Fdd_tareas_Braulio/tareas/207898/pandas/LoansData_sample.csv.gz')

loans.head()




Unnamed: 0.1,Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,0,38098114,,15000.0,15000.0,15000.0,60 months,12.39,336.64,C,...,,,Cash,N,,,,,,
1,1,36805548,,10400.0,10400.0,10400.0,36 months,6.99,321.08,A,...,,,Cash,N,,,,,,
2,2,37842129,,21425.0,21425.0,21425.0,60 months,15.59,516.36,D,...,,,Cash,N,,,,,,
3,3,37612354,,12800.0,12800.0,12800.0,60 months,17.14,319.08,D,...,,,Cash,N,,,,,,
4,4,37662224,,7650.0,7650.0,7650.0,36 months,13.66,260.2,C,...,,,Cash,N,,,,,,


## Table (column_name, type)

In [12]:
column_types = pd.DataFrame({
    'column_name': loans.columns,
    'type': loans.dtypes.astype(str)
})

column_types

Unnamed: 0,column_name,type
Unnamed: 0,Unnamed: 0,int64
id,id,int64
member_id,member_id,float64
loan_amnt,loan_amnt,float64
funded_amnt,funded_amnt,float64
...,...,...
settlement_status,settlement_status,object
settlement_date,settlement_date,object
settlement_amount,settlement_amount,float64
settlement_percentage,settlement_percentage,float64


## Load column descriptions

In [13]:
datos_dict = pd.read_excel(
    'https://resources.lendingclub.com/LCDataDictionary.xlsx')
datos_dict.columns = ['feature', 'description']


In [14]:
datos_dict

Unnamed: 0,feature,description
0,acc_now_delinq,The number of accounts on which the borrower i...
1,acc_open_past_24mths,Number of trades opened in past 24 months.
2,addr_state,The state provided by the borrower in the loan...
3,all_util,Balance to credit limit on all trades
4,annual_inc,The self-reported annual income provided by th...
...,...,...
148,settlement_amount,The loan amount that the borrower has agreed t...
149,settlement_percentage,The settlement amount as a percentage of the p...
150,settlement_term,The number of months that the borrower will be...
151,,


### Pickle

Write code to **save** and **load** the `datos_dict` DataFrame created in the previous cells in **pickle** format

In [15]:
# Save the DataFrame in pickle format
datos_dict.to_pickle('datos_dict.pkl')
print("The DataFrame 'datos_dict' was saved successfully to 'datos_dict.pkl'.")

The DataFrame 'datos_dict' was saved successfully to 'datos_dict.pkl'.


In [16]:
# Load the DataFrame from the pickle file
datos_dict_cargado = pd.read_pickle('datos_dict.pkl')
print("The DataFrame 'datos_dict' was loaded successfully from 'datos_dict.pkl'.")

The DataFrame 'datos_dict' was loaded successfully from 'datos_dict.pkl'.


In [17]:
datos_dict_cargado

Unnamed: 0,feature,description
0,acc_now_delinq,The number of accounts on which the borrower i...
1,acc_open_past_24mths,Number of trades opened in past 24 months.
2,addr_state,The state provided by the borrower in the loan...
3,all_util,Balance to credit limit on all trades
4,annual_inc,The self-reported annual income provided by th...
...,...,...
148,settlement_amount,The loan amount that the borrower has agreed t...
149,settlement_percentage,The settlement amount as a percentage of the p...
150,settlement_term,The number of months that the borrower will be...
151,,


## Data handling

In [18]:
# 'Unnamed: 0' is a duplicated row id; we can drop it because we already have the index.
loans = loans.drop(columns =['Unnamed: 0'])


In [19]:
# Convert id to str so that it is not used in arithmetic operations
loans['id'] = loans['id'].astype(str)

In [20]:
# member_id looks entirely NaN, so it is a candidate for removal; first check what share is NaN
missing_member_id = loans['member_id'].isnull().mean() * 100
print(f"Percentage of missing values in 'member_id' : {missing_member_id}% ")
# Percentage of missing values in 'member_id' : 100.0% 
loans = loans.drop(columns=['member_id'])


Percentage of missing values in 'member_id' : 100.0% 


In [21]:
# Many other columns are mostly NaN as well, so we handle their missing values
# the same way as member_id: measure the share of missing values per column
missing_percent = loans.isnull().mean() * 100
missing_percent = missing_percent[missing_percent > 0 ].sort_values (ascending= False)
missing_percent


#open_il_24m                  100.000
#mths_since_rcnt_il           100.000
#annual_inc_joint             100.000
#verification_status_joint    100.000
#dti_joint                    100.000
#                              ...   
#bc_open_to_buy                 1.135
#mths_since_recent_bc           1.049
#last_pymnt_d                   0.067
#revol_util                     0.056
#last_credit_pull_d             0.017


cols_to_drop = missing_percent[missing_percent > 90].index.tolist() # 90% missing threshold to catch the most critical columns
print(f"Columns to drop: {cols_to_drop}")

loans = loans.drop(columns = cols_to_drop)

loans.head()

Columns to drop: ['open_il_24m', 'mths_since_rcnt_il', 'annual_inc_joint', 'verification_status_joint', 'dti_joint', 'revol_bal_joint', 'inq_last_12m', 'sec_app_open_acc', 'sec_app_mort_acc', 'sec_app_mths_since_last_major_derog', 'open_il_12m', 'open_act_il', 'open_acc_6m', 'il_util', 'total_bal_il', 'open_rv_12m', 'total_cu_tl', 'inq_fi', 'all_util', 'max_bal_bc', 'open_rv_24m', 'sec_app_open_act_il', 'sec_app_revol_util', 'sec_app_fico_range_high', 'sec_app_fico_range_low', 'sec_app_inq_last_6mths', 'sec_app_collections_12_mths_ex_med', 'sec_app_chargeoff_within_12_mths', 'sec_app_num_rev_accts', 'sec_app_earliest_cr_line', 'desc', 'orig_projected_additional_accrued_interest', 'hardship_start_date', 'deferral_term', 'hardship_amount', 'hardship_end_date', 'hardship_type', 'hardship_payoff_balance_amount', 'hardship_last_payment_amount', 'hardship_status', 'hardship_reason', 'payment_plan_start_date', 'hardship_length', 'hardship_dpd', 'hardship_loan_status', 'settlement_percenta

Unnamed: 0,id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,hardship_flag,disbursement_method,debt_settlement_flag
0,38098114,15000.0,15000.0,15000.0,60 months,12.39,336.64,C,C1,MANAGEMENT,...,0.0,0.0,0.0,196500.0,149140.0,10000.0,12000.0,N,Cash,N
1,36805548,10400.0,10400.0,10400.0,36 months,6.99,321.08,A,A3,Truck Driver Delivery Personel,...,14.3,0.0,0.0,179407.0,15030.0,13000.0,11325.0,N,Cash,N
2,37842129,21425.0,21425.0,21425.0,60 months,15.59,516.36,D,D1,Programming Analysis Supervisor,...,100.0,0.0,0.0,57073.0,42315.0,15000.0,35573.0,N,Cash,N
3,37612354,12800.0,12800.0,12800.0,60 months,17.14,319.08,D,D4,Senior Sales Professional,...,100.0,0.0,0.0,368700.0,18007.0,4400.0,18000.0,N,Cash,N
4,37662224,7650.0,7650.0,7650.0,36 months,13.66,260.2,C,C3,Technical Specialist,...,100.0,0.0,0.0,82331.0,64426.0,4900.0,64031.0,N,Cash,N


In [22]:
# Now check which of the remaining columns still have roughly 50% or more missing values
missing_percent = loans.isnull().mean() * 100
missing_percent = missing_percent[missing_percent > 0 ].sort_values (ascending= False)
missing_percent

#next_pymnt_d                      86.138
#mths_since_last_record            83.268
#mths_since_recent_bc_dlq          73.545
#mths_since_last_major_derog       72.059
#mths_since_recent_revol_delinq    63.814
#mths_since_last_delinq            48.703
#mths_since_recent_inq              9.818
#emp_title                          5.264
#emp_length                         5.259
#mo_sin_old_il_acct                 3.007
#num_tl_120dpd_2m                   1.956
#bc_util                            1.198
#percent_bc_gt_75                   1.161
#bc_open_to_buy                     1.135
#mths_since_recent_bc               1.049
#last_pymnt_d                       0.067
#revol_util                         0.056
#last_credit_pull_d                 0.017

cols_to_drop = missing_percent[missing_percent > 48].index.tolist() # 48% cutoff so that mths_since_last_delinq (48.703%) is dropped too
print(f"Columns to drop: {cols_to_drop}")

loans = loans.drop(columns = cols_to_drop)

loans.head()

Columns to drop: ['next_pymnt_d', 'mths_since_last_record', 'mths_since_recent_bc_dlq', 'mths_since_last_major_derog', 'mths_since_recent_revol_delinq', 'mths_since_last_delinq']


Unnamed: 0,id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,hardship_flag,disbursement_method,debt_settlement_flag
0,38098114,15000.0,15000.0,15000.0,60 months,12.39,336.64,C,C1,MANAGEMENT,...,0.0,0.0,0.0,196500.0,149140.0,10000.0,12000.0,N,Cash,N
1,36805548,10400.0,10400.0,10400.0,36 months,6.99,321.08,A,A3,Truck Driver Delivery Personel,...,14.3,0.0,0.0,179407.0,15030.0,13000.0,11325.0,N,Cash,N
2,37842129,21425.0,21425.0,21425.0,60 months,15.59,516.36,D,D1,Programming Analysis Supervisor,...,100.0,0.0,0.0,57073.0,42315.0,15000.0,35573.0,N,Cash,N
3,37612354,12800.0,12800.0,12800.0,60 months,17.14,319.08,D,D4,Senior Sales Professional,...,100.0,0.0,0.0,368700.0,18007.0,4400.0,18000.0,N,Cash,N
4,37662224,7650.0,7650.0,7650.0,36 months,13.66,260.2,C,C3,Technical Specialist,...,100.0,0.0,0.0,82331.0,64426.0,4900.0,64031.0,N,Cash,N
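The two threshold-based drops above follow the same pattern, so they could be wrapped in a small helper. The sketch below is only an illustration: `drop_sparse_columns` is a hypothetical name, and the toy DataFrame stands in for `loans`.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing_pct):
    """Return (new_df, dropped), removing columns with more than max_missing_pct % NaN."""
    missing_pct = df.isnull().mean() * 100
    dropped = missing_pct[missing_pct > max_missing_pct].index.tolist()
    return df.drop(columns=dropped), dropped


# Toy data standing in for the loans table.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "c": [np.nan, 1.0, 2.0, 3.0],        # 25% missing
})
toy_clean, toy_dropped = drop_sparse_columns(toy, max_missing_pct=48)
print(toy_dropped)
```

Returning the list of dropped columns makes it easy to record the decision later, which helps with the reproducibility goal of the assignment.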


In [23]:
column_types = pd.DataFrame({
    'column_name': loans.columns,
    'type': loans.dtypes.astype(str)
})

column_types.head(15)

Unnamed: 0,column_name,type
id,id,object
loan_amnt,loan_amnt,float64
funded_amnt,funded_amnt,float64
funded_amnt_inv,funded_amnt_inv,float64
term,term,object
int_rate,int_rate,float64
installment,installment,float64
grade,grade,object
sub_grade,sub_grade,object
emp_title,emp_title,object


In [24]:
# loan_amnt, funded_amnt and funded_amnt_inv are fine as they are (float64)
loans.head()

Unnamed: 0,id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,hardship_flag,disbursement_method,debt_settlement_flag
0,38098114,15000.0,15000.0,15000.0,60 months,12.39,336.64,C,C1,MANAGEMENT,...,0.0,0.0,0.0,196500.0,149140.0,10000.0,12000.0,N,Cash,N
1,36805548,10400.0,10400.0,10400.0,36 months,6.99,321.08,A,A3,Truck Driver Delivery Personel,...,14.3,0.0,0.0,179407.0,15030.0,13000.0,11325.0,N,Cash,N
2,37842129,21425.0,21425.0,21425.0,60 months,15.59,516.36,D,D1,Programming Analysis Supervisor,...,100.0,0.0,0.0,57073.0,42315.0,15000.0,35573.0,N,Cash,N
3,37612354,12800.0,12800.0,12800.0,60 months,17.14,319.08,D,D4,Senior Sales Professional,...,100.0,0.0,0.0,368700.0,18007.0,4400.0,18000.0,N,Cash,N
4,37662224,7650.0,7650.0,7650.0,36 months,13.66,260.2,C,C3,Technical Specialist,...,100.0,0.0,0.0,82331.0,64426.0,4900.0,64031.0,N,Cash,N


In [25]:
# Convert term to int, since it is the loan term in months
loans['term'] = loans['term'].str.strip().str.replace('months', '').astype(int)
loans.head()

Unnamed: 0,id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,hardship_flag,disbursement_method,debt_settlement_flag
0,38098114,15000.0,15000.0,15000.0,60,12.39,336.64,C,C1,MANAGEMENT,...,0.0,0.0,0.0,196500.0,149140.0,10000.0,12000.0,N,Cash,N
1,36805548,10400.0,10400.0,10400.0,36,6.99,321.08,A,A3,Truck Driver Delivery Personel,...,14.3,0.0,0.0,179407.0,15030.0,13000.0,11325.0,N,Cash,N
2,37842129,21425.0,21425.0,21425.0,60,15.59,516.36,D,D1,Programming Analysis Supervisor,...,100.0,0.0,0.0,57073.0,42315.0,15000.0,35573.0,N,Cash,N
3,37612354,12800.0,12800.0,12800.0,60,17.14,319.08,D,D4,Senior Sales Professional,...,100.0,0.0,0.0,368700.0,18007.0,4400.0,18000.0,N,Cash,N
4,37662224,7650.0,7650.0,7650.0,36,13.66,260.2,C,C3,Technical Specialist,...,100.0,0.0,0.0,82331.0,64426.0,4900.0,64031.0,N,Cash,N


In [26]:
# To check whether int_rate is expressed as a percentage:

loans['int_rate'].describe()

#count    100000.000000
#mean         13.278073
#std           4.390210
#min           6.000000
#25%          10.150000
#50%          12.990000
#75%          15.610000
#max          26.060000
# The values are not between 0 and 1, so they are percentages rather than decimal fractions

loans['int_rate'] = loans['int_rate'] / 100
# Since we will be working with interest rates, it is more convenient to keep them
# as decimal fractions than as percentages

In [27]:
loans.head()

Unnamed: 0,id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,hardship_flag,disbursement_method,debt_settlement_flag
0,38098114,15000.0,15000.0,15000.0,60,0.1239,336.64,C,C1,MANAGEMENT,...,0.0,0.0,0.0,196500.0,149140.0,10000.0,12000.0,N,Cash,N
1,36805548,10400.0,10400.0,10400.0,36,0.0699,321.08,A,A3,Truck Driver Delivery Personel,...,14.3,0.0,0.0,179407.0,15030.0,13000.0,11325.0,N,Cash,N
2,37842129,21425.0,21425.0,21425.0,60,0.1559,516.36,D,D1,Programming Analysis Supervisor,...,100.0,0.0,0.0,57073.0,42315.0,15000.0,35573.0,N,Cash,N
3,37612354,12800.0,12800.0,12800.0,60,0.1714,319.08,D,D4,Senior Sales Professional,...,100.0,0.0,0.0,368700.0,18007.0,4400.0,18000.0,N,Cash,N
4,37662224,7650.0,7650.0,7650.0,36,0.1366,260.2,C,C3,Technical Specialist,...,100.0,0.0,0.0,82331.0,64426.0,4900.0,64031.0,N,Cash,N


In [28]:
# Convert grade and sub_grade to category dtype to reduce memory use and simplify handling
loans['grade'] = loans['grade'].astype('category')
loans['sub_grade'] = loans['sub_grade'].astype('category')

Cleaning emp_length

In [29]:
loans['emp_length'].unique()

array(['10+ years', '8 years', '6 years', '< 1 year', '2 years',
       '9 years', '7 years', '5 years', '3 years', '1 year', nan,
       '4 years'], dtype=object)

In [30]:
def clean_emp_length(x):
    if pd.isnull(x):
        return x  # Leave missing values as they are
    x = x.strip()  # Remove leading and trailing whitespace
    if x == '< 1 year':
        return 0  # Treat less than one year as 0 years
    elif x == '10+ years':
        return 10  # Treat 10 or more years as 10 years
    elif x == 'nan':
        return np.nan  # Turn the string 'nan' into NaN
    else:
        return int(x.split()[0])  # Extract the number of years

loans['emp_length'] = loans['emp_length'].apply(clean_emp_length)
# Cast 'emp_length' to float (so it can hold NaN)
loans['emp_length'] = loans['emp_length'].astype(float)
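The same cleaning can also be sketched with vectorized string methods instead of `apply`. This is an alternative under the assumption that the only values are the ones listed by `unique()` above; the `raw` Series is toy data, not the real column.

```python
import numpy as np
import pandas as pd

# Toy values in the same format as emp_length.
raw = pd.Series(["10+ years", "< 1 year", "1 year", "4 years", np.nan])

years = (
    raw.str.strip()
       .replace({"< 1 year": "0 years"})  # map "less than one year" to 0
       .str.extract(r"(\d+)")[0]          # first run of digits; NaN stays NaN
       .astype(float)                     # float so the column can hold NaN
)
print(years.tolist())
```

As in `clean_emp_length`, "10+ years" is capped at 10, so the column cannot distinguish 10 years from longer tenures.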

In [31]:
# There are missing values, so I fill them with the median
median_emp_length = loans['emp_length'].median()
loans['emp_length'] = loans['emp_length'].fillna(median_emp_length)
print(f'The median is: {median_emp_length}')

The median is: 7.0


Now let's see what is going on with emp_title

In [32]:
# Number of unique values
unique_titles = loans['emp_title'].nunique()
print(f"Number of unique job titles: {unique_titles}")


Number of unique job titles: 37432


In [33]:
# Lowercase and strip surrounding whitespace
loans['emp_title'] = loans['emp_title'].str.lower().str.strip()

# Number of unique values
unique_titles = loans['emp_title'].nunique()
print(f"Number of unique job titles: {unique_titles}")


Number of unique job titles: 31078


In [34]:
# Convert all categorical columns to category dtype

# List of categorical columns
categorical_cols = ['grade', 'sub_grade', 'home_ownership', 'verification_status', 'purpose', 'addr_state', 'initial_list_status', 'application_type']

# Keep only the columns that exist in the DataFrame
categorical_cols = [col for col in categorical_cols if col in loans.columns]

# Convert the columns to 'category' dtype
for col in categorical_cols:
    loans[col] = loans[col].astype('category')


I noticed that hardship_flag and debt_settlement_flag contain N values.
Let's look at their unique values

In [35]:
loans['hardship_flag'].unique()

array(['N', 'Y'], dtype=object)

In [36]:
loans['debt_settlement_flag'].unique()
# No missing values here; if there were, I would fill them with 0

array(['N', 'Y'], dtype=object)

Now I convert them to 0 or 1

In [37]:
flag_mapping = {'N': 0, 'Y': 1}

loans['hardship_flag'] = loans['hardship_flag'].map(flag_mapping).astype(int)
loans['debt_settlement_flag'] = loans['debt_settlement_flag'].map(flag_mapping).astype(int)

Next is issue_d, the date the loan was issued

In [38]:
loans['issue_d']

0        Dec-2014
1        Dec-2014
2        Dec-2014
3        Dec-2014
4        Dec-2014
           ...   
99995    Aug-2014
99996    Aug-2014
99997    Aug-2014
99998    Aug-2014
99999    Aug-2014
Name: issue_d, Length: 100000, dtype: object

In [39]:
# Convert to datetime
loans['issue_d'] = pd.to_datetime(loans['issue_d'], format='%b-%Y')
# %b-%Y: abbreviated month name and year

Check whether there are still missing values
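A small sketch of what this conversion does to individual values, using toy strings in the same `Mon-YYYY` layout. Note that `%b` matches month abbreviations according to the active locale, so the example assumes an English (or C) locale.

```python
import pandas as pd

# Toy values in the same layout as issue_d; None stands in for a missing date.
dates = pd.Series(["Dec-2014", "Aug-2014", None])
parsed = pd.to_datetime(dates, format="%b-%Y")

print(parsed.dtype)          # a datetime64 dtype
print(parsed.isna().sum())   # missing inputs become NaT
```

Because the source only carries month and year, every parsed timestamp lands on the first day of its month.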

In [40]:
# Compute the percentage of missing values per column
missing_percent = loans.isnull().mean() * 100
missing_percent = missing_percent[missing_percent > 0].sort_values(ascending=False)
missing_percent


mths_since_recent_inq    9.818
emp_title                5.264
mo_sin_old_il_acct       3.007
num_tl_120dpd_2m         1.956
bc_util                  1.198
percent_bc_gt_75         1.161
bc_open_to_buy           1.135
mths_since_recent_bc     1.049
last_pymnt_d             0.067
revol_util               0.056
last_credit_pull_d       0.017
dtype: float64

In [41]:
# Identify numeric and categorical columns
num_cols = loans.select_dtypes(include=['float64', 'int64']).columns.tolist()
cat_cols = loans.select_dtypes(include=['object', 'category']).columns.tolist()

# Columns with missing values
missing_cols = missing_percent.index.tolist()

# Numeric and categorical columns that have missing values
num_missing_cols = [col for col in missing_cols if col in num_cols]
cat_missing_cols = [col for col in missing_cols if col in cat_cols]

print(f"Numeric columns with missing values: {num_missing_cols}")
print(f"Categorical columns with missing values: {cat_missing_cols}")


Numeric columns with missing values: ['mths_since_recent_inq', 'mo_sin_old_il_acct', 'num_tl_120dpd_2m', 'bc_util', 'percent_bc_gt_75', 'bc_open_to_buy', 'mths_since_recent_bc', 'revol_util']
Categorical columns with missing values: ['emp_title', 'last_pymnt_d', 'last_credit_pull_d']


Numeric columns

In [42]:
# mths_since_recent_inq (months since the most recent credit inquiry) can be filled with the median
median_value = loans['mths_since_recent_inq'].median()
loans['mths_since_recent_inq'] = loans['mths_since_recent_inq'].fillna(median_value)
print(f"the median is: {median_value}")

the median is: 5.0


In [43]:
# mo_sin_old_il_acct is the number of months since the oldest installment account was opened
# I use the median here too (what I still don't know is why the median rather than the mean)

median_value = loans['mo_sin_old_il_acct'].median()
loans['mo_sin_old_il_acct'] = loans['mo_sin_old_il_acct'].fillna(median_value)
print(f"the median is: {median_value}")



the median is: 130.0
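On the median-versus-mean question raised in the cell above: the usual argument is that the median is much less sensitive to extreme values, which are common in credit data. The numbers below are invented purely to illustrate that point.

```python
import pandas as pd

# Four typical balances and one extreme outlier (made-up values).
balances = pd.Series([1000, 1200, 1100, 1300, 250000])

print(balances.mean())    # 50920.0, pulled far up by the single outlier
print(balances.median())  # 1200.0, still a typical value
```

If a column is roughly symmetric with no heavy tail, mean and median are close and either choice is defensible.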


In [44]:
# num_tl_120dpd_2m is the number of accounts currently 120 days past due (updated in the last 2 months)
# Since it counts credit lines, a 0 can be used to indicate that no lines are 120 days past due
loans['num_tl_120dpd_2m'] = loans['num_tl_120dpd_2m'].fillna(0)
print("filling missing 'num_tl_120dpd_2m' values with 0")


filling missing 'num_tl_120dpd_2m' values with 0


In [45]:
# bc_util: the median can be used here as well (suggested by a chat assistant)
median_value = loans['bc_util'].median()
loans['bc_util'] = loans['bc_util'].fillna(median_value)
print(f"Imputing 'bc_util' with the median: {median_value}")

Imputing 'bc_util' with the median: 68.7


In [46]:
# percent_bc_gt_75 is the percentage of bankcard accounts with utilization above 75%
# Median again
median_value = loans['percent_bc_gt_75'].median()
loans['percent_bc_gt_75'] = loans['percent_bc_gt_75'].fillna(median_value)
print(f"median: {median_value}")


median: 50.0


In [47]:
# bc_open_to_buy is the amount available on revolving bankcard lines
# Median again
median_value = loans['bc_open_to_buy'].median()
loans['bc_open_to_buy'] = loans['bc_open_to_buy'].fillna(median_value)
print(f"median: {median_value}")


median: 3844.0


In [48]:
# mths_since_recent_bc is the number of months since the most recent bankcard account was opened
# Median again
median_value = loans['mths_since_recent_bc'].median()
loans['mths_since_recent_bc'] = loans['mths_since_recent_bc'].fillna(median_value)
print(f"median: {median_value}")


median: 13.0


In [49]:
# revol_util is the utilization rate of revolving credit lines
# Median
median_value = loans['revol_util'].median()
loans['revol_util'] = loans['revol_util'].fillna(median_value)
print(f"median: {median_value}")


median: 56.0
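The median imputations above repeat the same three lines per column, so they could be collapsed into one loop that also records what was done. The sketch below runs on toy data; `imputation_log` is a hypothetical name, and the `estrategia` / `valor` keys follow the JSON layout requested later in the assignment.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loans table.
toy = pd.DataFrame({
    "bc_util": [10.0, np.nan, 30.0],
    "revol_util": [50.0, 60.0, np.nan],
})

imputation_log = {}
for col in ["bc_util", "revol_util"]:
    value = float(toy[col].median())
    # Assigning the result back avoids the chained-assignment problem of
    # calling fillna(..., inplace=True) on a selected column.
    toy[col] = toy[col].fillna(value)
    imputation_log[col] = {"estrategia": "median", "valor": value}

print(imputation_log)
```

Keeping the log next to the imputation code means the recorded values cannot drift from the values actually used.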


Categorical columns

In [50]:
# emp_title is the borrower's job title
# I drop it because it has too many unique values (31078 after normalization) to impute or encode sensibly
# Drop 'emp_title' from the DataFrame
loans = loans.drop(columns=['emp_title'])


In [51]:
# last_pymnt_d is the date of the last payment received
# Convert it to datetime; there are missing values, but that may mean no payment has been received yet
loans['last_pymnt_d'] = pd.to_datetime(loans['last_pymnt_d'], format='%b-%Y')

In [52]:
# last_credit_pull_d is the date the borrower's credit report was last pulled
# Convert it to datetime as well; the missing values are left alone for now and handled further down
loans['last_credit_pull_d'] = pd.to_datetime(loans['last_credit_pull_d'], format='%b-%Y')

In [53]:
# Check whether any missing values remain in the DataFrame
# Recompute the percentage of missing values
missing_percent_after = loans.isnull().mean() * 100
missing_percent_after = missing_percent_after[missing_percent_after > 0].sort_values(ascending=False)
print("Percentage of missing values after imputation:")
missing_percent_after


Percentage of missing values after imputation:


last_pymnt_d          0.067
last_credit_pull_d    0.017
dtype: float64

In [54]:
# last_pymnt_d: issue_d is the loan issue date, so I use it to fill the gaps
# Impute 'last_pymnt_d' with 'issue_d' where it is NaN
loans['last_pymnt_d'] = loans['last_pymnt_d'].fillna(loans['issue_d'])


In [55]:
# last_credit_pull_d: fill the gaps with the most recent credit-pull date in the data
max_date = loans['last_credit_pull_d'].max()
loans['last_credit_pull_d'] = loans['last_credit_pull_d'].fillna(max_date)


In [56]:
# Check whether any missing values remain in the DataFrame
# Recompute the percentage of missing values
missing_percent_after = loans.isnull().mean() * 100
missing_percent_after = missing_percent_after[missing_percent_after > 0].sort_values(ascending=False)
print("Percentage of missing values after imputation:")
missing_percent_after


Percentage of missing values after imputation:


Series([], dtype: float64)

Columns with a single unique value

In [57]:
# Identify columns with a single unique value, since they carry no information
cols_single_value = [col for col in loans.columns if loans[col].nunique() == 1]

cols_single_value


['policy_code', 'application_type', 'disbursement_method']

In [58]:
# Drop those columns, which carry no information
loans = loans.drop(columns=cols_single_value)

In [59]:
# Review the data types
loans.dtypes


id                             object
loan_amnt                     float64
funded_amnt                   float64
funded_amnt_inv               float64
term                            int64
                               ...   
total_bal_ex_mort             float64
total_bc_limit                float64
total_il_high_credit_limit    float64
hardship_flag                   int64
debt_settlement_flag            int64
Length: 88, dtype: object

In [60]:
# Save the clean DataFrame
loans.to_csv('Loans_clean.csv', index=False)
print("Clean dataset saved as 'Loans_clean.csv'")


Clean dataset saved as 'Loans_clean.csv'
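One caveat when saving to CSV: the format stores only text, so dtypes chosen during cleaning (category, datetime) are not preserved and have to be restored on load. The sketch below shows this on a toy frame with an in-memory buffer; the column names mirror the notebook, but the data is made up.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "grade": pd.Categorical(["A", "C"]),
    "issue_d": pd.to_datetime(["Dec-2014", "Aug-2014"], format="%b-%Y"),
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain reload: both columns come back as generic text columns.
buffer.seek(0)
plain = pd.read_csv(buffer)
print(plain.dtypes)

# Reload while restoring the intended dtypes explicitly.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"grade": "category"}, parse_dates=["issue_d"])
print(restored.dtypes)
```

Saving with `to_pickle`, as done for `datos_dict` earlier, is an alternative that keeps the dtypes without extra arguments.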


See the pd.DataFrame.dtypes attribute. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html 

In [61]:
# column_types =
'''
column_types = pd.DataFrame({
    'column_name': loans.columns,
    'type': loans.dtypes.astype(str)
})

column_types
'''

"\ncolumn_types = pd.DataFrame({\n    'column_name': loans.columns,\n    'type': loans.dtypes.astype(str)\n})\n\ncolumn_types\n"

The following table describes the meaning of each column

In [62]:

'''
datos_dict = pd.read_excel(
    'https://resources.lendingclub.com/LCDataDictionary.xlsx')
datos_dict.columns = ['feature', 'description']
'''

"\ndatos_dict = pd.read_excel(\n    'https://resources.lendingclub.com/LCDataDictionary.xlsx')\ndatos_dict.columns = ['feature', 'description']\n"

In [63]:
#datos_dict

### Pickle

Crea codigo para **guardar** y **cargar** el DataFrame de `datos_dict` creada en las celdas anteriores en formato **pickle**

In [64]:
# Code to save


In [65]:
# Code to load
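
A minimal sketch of the pickle round trip, assuming `datos_dict` is a two-column DataFrame (`feature`, `description`). The rows and the output path below are hypothetical stand-ins, since the real table comes from the Excel data dictionary.

```python
import os
import tempfile

import pandas as pd

# Stand-in for datos_dict (hypothetical rows, for illustration only)
datos_dict = pd.DataFrame({
    'feature': ['loan_amnt', 'term'],
    'description': ['Amount requested by the borrower', 'Number of payments on the loan'],
})

# Hypothetical output location
pickle_path = os.path.join(tempfile.gettempdir(), 'datos_dict.pkl')

# Save the DataFrame, preserving dtypes and index
datos_dict.to_pickle(pickle_path)

# Load it back
datos_dict_loaded = pd.read_pickle(pickle_path)
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.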

## Data Types

Apply whatever transformations or casts you consider necessary so that each column has an appropriate data type. When you finish, recreate the `column_types` table with the new types.

Don't forget to write down your justifications so you remember them when it is time to explain them.

In [66]:
# Type handling 1
# Your code here

In [67]:
# Type handling 2
# Your code here


In [68]:
# column_types =
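
As an illustration of the kind of casts this section asks for, here is a sketch on a small made-up DataFrame. The raw formats (`' 36 months'`, `'Dec-2015'`) and the grade levels A to G are assumptions about how such columns might look, not values confirmed from the loans file.

```python
import pandas as pd

# Hypothetical raw values; the real columns may be formatted differently
raw = pd.DataFrame({
    'id': [1001, 1002, 1003],
    'term': [' 36 months', ' 60 months', ' 36 months'],
    'grade': ['A', 'C', 'B'],
    'issue_d': ['Dec-2015', 'Jan-2016', 'Mar-2016'],
})

typed = raw.copy()
# Identifiers are labels, not quantities, so keep them as strings
typed['id'] = typed['id'].astype(str)
# Extract the number of months and store it as an integer
typed['term'] = typed['term'].str.extract(r'(\d+)', expand=False).astype('int64')
# Grades have a small, fixed, ordered set of levels
typed['grade'] = pd.Categorical(typed['grade'], categories=list('ABCDEFG'), ordered=True)
# Parse month-year strings into timestamps
typed['issue_d'] = pd.to_datetime(typed['issue_d'], format='%b-%Y')

# Rebuild the (column_name, type) table from the new dtypes
column_types = pd.DataFrame({
    'column_name': typed.columns,
    'type': typed.dtypes.astype(str).values,
})
```

Each cast carries its own justification in the comment, which is the habit the assignment asks for.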


## Handling NaNs or missing values

Handle the missing values. Choose an appropriate strategy depending on the data type you assigned to each column.


Write code to **save** and **load** a JSON file that records the `estrategia` (strategy) and `valor` (value) you used to **impute**. For example: if there is a column called `columna 3` that you imputed with the mean, and another called `columna 4` for which you chose the word 'missing', the JSON must contain:  
  
 `{'columna 3':{'estrategia':'mean', 'valor':3.4}, 'columna 4':{'estrategia':'identificador', 'valor':'missing'}}`  

 That is, each column that has an imputation method points to another dictionary in which the **key** `estrategia` briefly describes the method and the **key** `valor` holds the value used. In general:   
 `{'nombre de la columna':{'estrategia':'descripcion de estrategia', 'valor':'valor utilizado'}}`. 
 

If you use more than one method, you can nest them in a list  
  `[{...},{...}]`.  

Even if a column was not imputed, you must still add it to the JSON.

The idea is that anyone else can load the JSON file with your function, understand what you did, and replicate it easily. There is no single correct answer, but you will have to justify and explain your decisions.
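
The structure described above can be sketched as a plain Python dictionary that is serialized with the standard `json` module. The entries `columna 5` and `columna 6`, and the label `'ninguna'` for a column that was not imputed, are hypothetical choices for illustration.

```python
import json

# Hypothetical entries following the structure described above
imputation_spec = {
    'columna 3': {'estrategia': 'mean', 'valor': 3.4},
    'columna 4': {'estrategia': 'identificador', 'valor': 'missing'},
    # More than one method for the same column: nest them in a list
    'columna 5': [
        {'estrategia': 'mean', 'valor': 1.2},
        {'estrategia': 'identificador', 'valor': 'missing'},
    ],
    # Column that was not imputed
    'columna 6': {'estrategia': 'ninguna', 'valor': None},
}

# Serialize to JSON text (double quotes, None becomes null) and read it back
as_text = json.dumps(imputation_spec, indent=4, ensure_ascii=False)
restored = json.loads(as_text)
```

Note that the file on disk uses double quotes and `null`, unlike the Python-style notation in the examples above.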

### Imputation

In [69]:
# Your code here

In [70]:
# Your code here
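
A possible shape for the imputation step, shown on made-up data: fill each column and record the strategy and value at the same moment, so the log cannot drift from what was actually done. The toy values and the `emp_title` column are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a text column
toy = pd.DataFrame({
    'revol_util': [10.0, np.nan, 30.0, 50.0],
    'emp_title': ['teacher', None, 'nurse', None],
})

imputation_log = {}

# Numeric column: the median is robust to skewed values
median_value = float(toy['revol_util'].median())
toy['revol_util'] = toy['revol_util'].fillna(median_value)
imputation_log['revol_util'] = {'estrategia': 'mediana', 'valor': median_value}

# Text column: an explicit label keeps the fact that the value was missing
toy['emp_title'] = toy['emp_title'].fillna('missing')
imputation_log['emp_title'] = {'estrategia': 'identificador', 'valor': 'missing'}
```

Converting the median to a built-in `float` keeps the log trivially serializable to JSON.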

### Code to save and load JSONs

In [85]:
# Dictionary that stores the imputation strategies and values
imputation_info = {}

# Columns imputed with the median
columns_imputed_median = [
    'mths_since_recent_inq', 'mo_sin_old_il_acct', 'bc_util',
    'percent_bc_gt_75', 'bc_open_to_buy', 'mths_since_recent_bc', 'revol_util'
]

# Dictionary that stores the medians
median_values = {}

for col in columns_imputed_median:
    median_value = loans[col].median()
    median_values[col] = median_value

median_values

{'mths_since_recent_inq': np.float64(5.0),
 'mo_sin_old_il_acct': np.float64(130.0),
 'bc_util': np.float64(68.7),
 'percent_bc_gt_75': np.float64(50.0),
 'bc_open_to_buy': np.float64(3844.0),
 'mths_since_recent_bc': np.float64(13.0),
 'revol_util': np.float64(56.0)}

In [86]:
for col in columns_imputed_median:
    imputation_info[col] = {
        'estrategia': 'mediana',
        'valor': median_values[col]
    }


In [87]:
imputation_info['num_tl_120dpd_2m'] = {
    'estrategia': 'cero',
    'valor': 0
}


In [88]:
imputation_info['last_pymnt_d'] = {
    'estrategia': 'fecha_emision_prestamo',
    'valor': 'issue_d'
}


In [89]:
max_date = loans['last_credit_pull_d'].max()
imputation_info['last_credit_pull_d'] = {
    'estrategia': 'fecha_maxima',
    'valor': max_date.strftime('%Y-%m-%d')
}

In [90]:
# Get every column of the DataFrame
all_columns = loans.columns.tolist()

# Columns already registered
imputed_columns = list(imputation_info.keys())

# Columns that were not imputed
columns_no_imputation = [col for col in all_columns if col not in imputed_columns]

# Add these columns to the dictionary
for col in columns_no_imputation:
    imputation_info[col] = {
        'estrategia': 'ninguna',
        'valor': None
    }


In [91]:
import json

# Save the dictionary to a JSON file (explicit UTF-8, since ensure_ascii=False may write non-ASCII characters)
with open('imputation_info.json', 'w', encoding='utf-8') as json_file:
    json.dump(imputation_info, json_file, indent=4, ensure_ascii=False)


In [92]:
# Load the dictionary from the JSON file
with open('imputation_info.json', 'r', encoding='utf-8') as json_file:
    imputation_info_loaded = json.load(json_file)


In [93]:
def apply_imputation(df, imputation_dict):
    # Re-apply the recorded imputation strategies; modifies and returns df
    for col, strategy_info in imputation_dict.items():
        estrategia = strategy_info['estrategia']
        valor = strategy_info['valor']

        if estrategia in ('mediana', 'constante', 'cero'):
            # Fill with the stored scalar ('cero' is the label recorded for num_tl_120dpd_2m)
            # Assign back instead of fillna(inplace=True) on df[col], which is deprecated
            df[col] = df[col].fillna(valor)
        elif estrategia == 'fecha_emision_prestamo':
            # 'valor' holds the name of the source column ('issue_d')
            df[col] = df[col].fillna(df[valor])
        elif estrategia == 'fecha_maxima':
            # The date was stored in the JSON as a 'YYYY-MM-DD' string
            df[col] = df[col].fillna(pd.to_datetime(valor))
        elif estrategia == 'ninguna':
            continue
        else:
            print(f"Unknown strategy '{estrategia}' for column '{col}'.")
    return df