# Loan Default Prediction


For this project we will use a sample of the Lending Club data. The goal is to predict whether a given user will default, based on information the platform collects. This will help us improve the lending methodology/pipeline.


# Description



Contains the loans issued on this platform:

    period 2007-2017Q3.
    887 thousand observations, sample of 100 thousand
    150 variables
    Target: loan status



# Objective

Perform an ETL and an EDA

## ETL

0. Clean the data so that at the end of the ETL it is in `tidy` format.
1. Make sure you load and read the data.
2. Create a table that stores each column's name and data type: (`column_name`, `type`).
3. Think carefully about the correct data type. Why did you choose string/object, float, or int? There are no wrong answers as such, but you must justify your decision.
4. Handle missing values or NaNs appropriately. Justify every decision.







## EDA

0. Prepare the data for a data pipeline
1. Remove useless columns
2. Impute values
3. Maintain replicability and reproducibility

**Do not forget to write down your justifications in cells so you remember them when you have to explain your work.** You can add as many cells as you need for your explanation and code; just keep the structure.

# ETL

In [26]:
import pandas as pd
import numpy as np
import pickle
import json

You will get 2 errors; fix them using what was covered in class.  
Tip: They are fixed with additional arguments to the `read_csv` function  
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html 
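
As a hedged illustration of the kind of `read_csv` arguments the tip refers to, the sketch below reads a small gzip-compressed CSV built in memory. The toy data and the choice of `low_memory=False` are assumptions for illustration; the errors you actually hit may call for different arguments.

```python
import gzip
import io

import pandas as pd

# Toy CSV standing in for the real file (hypothetical data).
raw_csv = ",id,loan_amnt,desc\n0,10,5000,\n1,11,7500,debt consolidation\n"
compressed = io.BytesIO(gzip.compress(raw_csv.encode("utf-8")))

# compression="gzip" tells pandas how to decode the byte stream.
# low_memory=False makes pandas parse the file in one pass, which avoids
# mixed-type inference across internal chunks.
toy = pd.read_csv(compressed, compression="gzip", low_memory=False)
print(toy.dtypes)
```

An alternative to `low_memory=False` is to pass an explicit `dtype` mapping for the columns that trigger mixed-type inference.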

In [7]:
loans = pd.read_csv('https://github.com/sonder-art/fdd_prim_2023/blob/main/codigo/pandas/LoansData_sample.csv.gz?raw=true', compression='gzip')

df_loans = pd.DataFrame(loans)




## Table (column_name, type)

See the `pd.DataFrame.dtypes` attribute. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html 

In [3]:
column_types = df_loans.dtypes.unique()
print(column_types)

[dtype('int64') dtype('float64') dtype('O')]
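
The cell above lists only the distinct dtypes. A minimal sketch of the (`column_name`, `type`) table itself could look like the following; `build_column_types` is a hypothetical helper name, and the toy frame stands in for `df_loans`.

```python
import pandas as pd

def build_column_types(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its name and its dtype as a string."""
    return pd.DataFrame({
        "column_name": df.columns,
        "type": [str(dtype) for dtype in df.dtypes],
    })

# Toy data (hypothetical values) with one column of each basic kind.
toy = pd.DataFrame({"id": [1, 2], "int_rate": [0.1, 0.2], "grade": ["A", "B"]})
column_types_table = build_column_types(toy)
print(column_types_table)
```

Storing the dtype as a string keeps the table easy to save to CSV or JSON later.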


## Load the column descriptions

The following table describes the meaning of each column.

In [16]:


datos_dict = pd.read_excel(
    'https://resources.lendingclub.com/LCDataDictionary.xlsx')
datos_dict.columns = ['feature', 'description']


In [17]:
datos_dict

|     | feature | description |
| --- | --- | --- |
| 0 | acc_now_delinq | The number of accounts on which the borrower i... |
| 1 | acc_open_past_24mths | Number of trades opened in past 24 months. |
| 2 | addr_state | The state provided by the borrower in the loan... |
| 3 | all_util | Balance to credit limit on all trades |
| 4 | annual_inc | The self-reported annual income provided by th... |
| ... | ... | ... |
| 148 | settlement_amount | The loan amount that the borrower has agreed t... |
| 149 | settlement_percentage | The settlement amount as a percentage of the p... |
| 150 | settlement_term | The number of months that the borrower will be... |
| 151 |  |  |


### Pickle

Write code to **save** and **load** the `datos_dict` DataFrame created in the previous cells in **pickle** format.

In [25]:
# Code to save
with open('datos_dict.pkl', 'wb') as f:
    pickle.dump(datos_dict, f)

In [26]:
# Code to load
with open('datos_dict.pkl', 'rb') as f:
    datos_dictP = pickle.load(f)
    
datos_dictP

|     | feature | description |
| --- | --- | --- |
| 0 | acc_now_delinq | The number of accounts on which the borrower i... |
| 1 | acc_open_past_24mths | Number of trades opened in past 24 months. |
| 2 | addr_state | The state provided by the borrower in the loan... |
| 3 | all_util | Balance to credit limit on all trades |
| 4 | annual_inc | The self-reported annual income provided by th... |
| ... | ... | ... |
| 148 | settlement_amount | The loan amount that the borrower has agreed t... |
| 149 | settlement_percentage | The settlement amount as a percentage of the p... |
| 150 | settlement_term | The number of months that the borrower will be... |
| 151 |  |  |
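
pandas also offers `DataFrame.to_pickle` and `pd.read_pickle` as shortcuts for the same round trip. The sketch below uses a toy frame and a temporary directory; the file name and the descriptions are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for datos_dict (hypothetical descriptions).
toy_dict = pd.DataFrame({
    "feature": ["acc_now_delinq", "addr_state"],
    "description": ["Number of delinquent accounts", "Borrower state"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_dict.pkl"  # hypothetical file name
    toy_dict.to_pickle(path)               # save
    restored = pd.read_pickle(path)        # load

# The round trip should preserve values, column names, and dtypes.
print(restored.equals(toy_dict))
```

As with the `pickle` module itself, only load pickle files that come from a source you trust.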


## Data Types

Perform whatever transformations or casts you consider necessary so that each column has an appropriate data type. When you finish, recreate the `column_types` table with the new types.

Do not forget to write down your justifications so you remember them when you have to explain your work.

In [16]:
# Type handling 1
# Your code here

int64_columns = [col for col in loans.columns if loans[col].dtype == 'int64']

def change_cast_to_int32(df, columns):
    int32_min, int32_max = -2**31, 2**31 - 1  # Range of int32
    
    for col in columns:
        if col not in df.columns:
            print(f"Column '{col}' does not exist in the DataFrame.")
            continue

        # Check if column is numeric
        if not pd.api.types.is_numeric_dtype(df[col]):
            print(f"Column '{col}' is not numeric, skipping.")
            continue

        # Check for NaNs and analyze range
        has_nan = df[col].isna().any()
        min_val = df[col].min()
        max_val = df[col].max()

        print(f"\nAnalyzing column '{col}':")
        print(f"  - Min value: {min_val}")
        print(f"  - Max value: {max_val}")
        print(f"  - Contains NaNs: {'Yes' if has_nan else 'No'}")

        # Evaluate if casting to int32 is safe
        if has_nan:
            print(f"  - Column '{col}' has NaN values, conversion to int32 will not be possible without handling them.")
        elif min_val >= int32_min and max_val <= int32_max:
            print(f"  - Column '{col}' is within int32 range. It can be safely cast to int32.")
            # Optional: Cast to int32 if suitable
            df[col] = df[col].astype('int32')
            print(f"  - Column '{col}' has been cast to int32.")
        else:
            print(f"  - Column '{col}' has values outside int32 range. Conversion to int32 is not safe.")
            
change_cast_to_int32(loans, int64_columns)


Analyzing column 'Unnamed: 0':
  - Min value: 0
  - Max value: 99999
  - Contains NaNs: No
  - Column 'Unnamed: 0' is within int32 range. It can be safely cast to int32.
  - Column 'Unnamed: 0' has been cast to int32.
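
As an alternative to checking ranges by hand, pandas can choose the narrowest numeric dtype itself through the `downcast` argument of `pd.to_numeric`. This is a sketch on toy series, not the method used in the cell above.

```python
import pandas as pd

# Toy values (hypothetical): an index-like integer column and an amount column.
ids = pd.Series([0, 50_000, 99_999], dtype="int64")
amounts = pd.Series([1000.0, 2500.5, 40000.0], dtype="float64")

# downcast="integer" picks the smallest signed integer dtype that holds every value;
# downcast="float" picks the smallest float dtype, which is never below float32.
ids_small = pd.to_numeric(ids, downcast="integer")
amounts_small = pd.to_numeric(amounts, downcast="float")

print(ids_small.dtype, amounts_small.dtype)
```

Downcasting saves memory, but float32 keeps only about seven significant decimal digits, so it is worth confirming that the loss of precision is acceptable for monetary columns.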


In [19]:

# Type handling 2
# Your code here

float64_columns = [col for col in loans.columns if loans[col].dtype == 'float64']

def cast_float32(df, columns):
    float32_min, float32_max = np.finfo(np.float32).min, np.finfo(np.float32).max  # Range of float32
    
    for col in columns:
        if col not in df.columns:
            print(f"Column '{col}' does not exist in the DataFrame.")
            continue

        # Check if column is of type float64
        if df[col].dtype != 'float64':
            print(f"Column '{col}' is not of type float64, skipping.")
            continue

        # Check for NaNs and analyze range
        has_nan = df[col].isna().any()
        min_val = df[col].min()
        max_val = df[col].max()

        print(f"\nAnalyzing column '{col}':")
        print(f"  - Min value: {min_val}")
        print(f"  - Max value: {max_val}")
        print(f"  - Contains NaNs: {'Yes' if has_nan else 'No'}")

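        # Evaluate if casting to float32 is safe (range check only; precision can still be lost).
        # For an all-NaN column min/max are NaN, so the comparison below is False
        # and the column is reported by the else branch.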
        if min_val >= float32_min and max_val <= float32_max:
            print(f"  - Column '{col}' is within float32 range and can be safely cast to float32.")
            # Optional: Cast to float32 if suitable
            df[col] = df[col].astype('float32')
            print(f"  - Column '{col}' has been cast to float32.")
        else:
            print(f"  - Column '{col}' has values outside the float32 range. Conversion to float32 is not safe.")
            
cast_float32(loans, float64_columns)
# Convert columns to datetime type
date_columns = [
    'issue_d', 'earliest_cr_line', 'last_pymnt_d', 'next_pymnt_d',
    'last_credit_pull_d', 'hardship_start_date', 'hardship_end_date',
    'payment_plan_start_date', 'debt_settlement_flag_date', 'settlement_date'
]
for col in date_columns:
    loans[col] = pd.to_datetime(loans[col], errors='coerce')


Analyzing column 'annual_inc_joint':
  - Min value: nan
  - Max value: nan
  - Contains NaNs: Yes
  - Column 'annual_inc_joint' has values outside the float32 range. Conversion to float32 is not safe.

Analyzing column 'dti_joint':
  - Min value: nan
  - Max value: nan
  - Contains NaNs: Yes
  - Column 'dti_joint' has values outside the float32 range. Conversion to float32 is not safe.

Analyzing column 'total_bal_il':
  - Min value: nan
  - Max value: nan
  - Contains NaNs: Yes
  - Column 'total_bal_il' has values outside the float32 range. Conversion to float32 is not safe.

Analyzing column 'il_util':
  - Min value: nan
  - Max value: nan
  - Contains NaNs: Yes
  - Column 'il_util' has values outside the float32 range. Conversion to float32 is not safe.

Analyzing column 'max_bal_bc':
  - Min value: nan
  - Max value: nan
  - Contains NaNs: Yes
  - Column 'max_bal_bc' has values outside the float32 range. Conversion to float32 is not safe.

Analyzing column 'all_util':
  - Min valu

In [11]:
column_types = df_loans.dtypes.unique()
print(column_types)


[dtype('int64') dtype('float64') dtype('O')]
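
The casts above were applied to `loans`, whereas this cell inspects `df_loans`, so it can be useful to compare the two frames directly. A minimal sketch, assuming a hypothetical helper `dtype_changes` and toy frames in place of the real data:

```python
import pandas as pd

def dtype_changes(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """List each shared column with its dtype before and after casting."""
    shared = [col for col in before.columns if col in after.columns]
    return pd.DataFrame({
        "column_name": shared,
        "type_before": [str(before[col].dtype) for col in shared],
        "type_after": [str(after[col].dtype) for col in shared],
    })

# Toy frames (hypothetical values): 'original' plays df_loans, 'casted' plays loans.
original = pd.DataFrame({"id": [1, 2], "issue_d": ["2015-12-01", "2016-01-01"]})
casted = original.copy()
casted["id"] = casted["id"].astype("int32")
casted["issue_d"] = pd.to_datetime(casted["issue_d"])

changes = dtype_changes(original, casted)
print(changes)
```

Calling the helper as `dtype_changes(df_loans, loans)` would show which columns actually changed type.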


## Handling NaNs or missing values

Handle the missing values. Choose an appropriate strategy depending on the data type you assigned to the column.


Write code to **save** and **load** a JSON file that stores the `estrategia` (strategy) and `valor` (value) you used to **impute**. For example: if there is a column called `columna 3` that you imputed with the mean, and another called `columna 4` for which you chose the word 'missing', the JSON must contain:  
  
 `{'columna 3':{'estrategia':'mean', 'valor':3.4}, 'columna 4':{'estrategia':'identificador', 'valor':'missing'}}`  

 That is, every column with an imputation method points to another dictionary in which the **key** `estrategia` briefly describes the method and the **key** `valor` holds the value used. In general:   
 `{'column name':{'estrategia':'strategy description', 'valor':'value used'}}`. 
 

If you use more than one method, you can nest them in a list  
  `[{...},{...}]`.  

Even if a column was not imputed, you must add it to the JSON.

The idea is that anyone else can load the JSON file with your function, understand what you did, and replicate it easily. There is no single correct answer, but you will have to justify and explain your decisions.
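
One way to check that a dictionary follows the layout described above before writing it to disk is sketched below. `validate_strategies` and the entry for `columna 5` are hypothetical, added only to illustrate the nested-list case.

```python
import json

REQUIRED_KEYS = {"estrategia", "valor"}

def validate_strategies(strategies: dict) -> bool:
    """Check that every column maps to a dict (or a list of dicts) with the required keys."""
    for column, entry in strategies.items():
        entries = entry if isinstance(entry, list) else [entry]
        for item in entries:
            if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item.keys()):
                raise ValueError(f"Column '{column}' has a malformed entry: {item!r}")
    return True

example = {
    "columna 3": {"estrategia": "mean", "valor": 3.4},
    "columna 4": {"estrategia": "identificador", "valor": "missing"},
    "columna 5": [  # hypothetical column that uses two methods
        {"estrategia": "median", "valor": 2.0},
        {"estrategia": "identificador", "valor": "missing"},
    ],
}

validate_strategies(example)

# json.dumps writes double-quoted strings, which the JSON format requires;
# the single-quoted form shown in the text is Python dict notation.
round_trip = json.loads(json.dumps(example))
print(round_trip == example)
```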

### Imputation

In [20]:
# Your code here
# First drop every column that contains only NaN: such columns carry no information and take up space
def drop_nan_only_columns(df):
    """
    Drops columns that contain only NaN values.

    Parameters:
    df (pd.DataFrame): The DataFrame from which to drop NaN-only columns.

    Returns:
    pd.DataFrame: A DataFrame with NaN-only columns removed.
    """
    # Drop columns where all values are NaN
    df_cleaned = df.dropna(axis=1, how='all')
    return df_cleaned

# Example usage
# Assuming `loans` is your DataFrame
loans_cleaned = drop_nan_only_columns(loans)
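
Before dropping or imputing anything, it can help to quantify how much is missing in each column. The following is a sketch on toy data; `missing_summary` is a hypothetical helper, and the values are made up.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column, worst first."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "share_missing": df.isna().mean(),
    })
    return summary.sort_values("share_missing", ascending=False)

toy = pd.DataFrame({
    "loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0],
    "dti_joint": [np.nan, np.nan, np.nan, np.nan],   # entirely missing
    "emp_title": ["nurse", None, "teacher", None],   # half missing
})

summary = missing_summary(toy)
print(summary)

# Columns whose share is 1.0 are the ones dropna(axis=1, how='all') removes.
all_missing = summary.index[summary["share_missing"] == 1.0].tolist()
```

Columns that are almost, but not entirely, empty survive `how='all'`; whether to drop them is a separate decision that should be justified.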

In [21]:
# Function to handle numeric column imputation
def imputar_numerico(df, column):
    # Use mean or median based on skewness
    if abs(df[column].skew()) < 1:
        value = df[column].mean()
        strategy = 'mean'
    else:
        value = df[column].median()
        strategy = 'median'
    df[column] = df[column].fillna(value)  # fillna returns a new Series, so assign it back
    return {"columna":column,"estrategia": strategy, "valor": value}

In [23]:
def imputar_categorico(df, column):
    # Use a placeholder string for missing values
    value = "missing"
    df[column] = df[column].fillna(value)  # assign back; fillna is not in-place
    return {"estrategia": "identificador", "valor": value}

# Function to handle date column imputation
def imputar_fecha(df, column):
    # Use a placeholder date for missing values
    value = pd.Timestamp("1900-01-01")
    df[column] = df[column].fillna(value)  # assign back; fillna is not in-place
    return {"estrategia": "placeholder_date", "valor": str(value)}

# Function to handle boolean column imputation
def imputar_booleano(df, column):
    # Use False as the default for missing values
    value = False
    df[column] = df[column].fillna(value)  # assign back; fillna is not in-place
    return {"estrategia": "default_bool", "valor": value}
# Main function to apply imputations based on data type
def imputar_valores(df):
    imputation_strategies = {}
    for column in df.columns:
        if df[column].isna().sum() > 0:
            print("woo")
            if pd.api.types.is_numeric_dtype(df[column]):
                imputation_strategies[column] = imputar_numerico(df, column)
            elif pd.api.types.is_object_dtype(df[column]):
                imputation_strategies[column] = imputar_categorico(df, column)
            elif pd.api.types.is_datetime64_any_dtype(df[column]):
                imputation_strategies[column] = imputar_fecha(df, column)
            elif pd.api.types.is_bool_dtype(df[column]):
                imputation_strategies[column] = imputar_booleano(df, column)
        else:
            imputation_strategies[column] = {"estrategia": "sin_imputacion", "valor": None}
    return imputation_strategies


In [24]:
explanation_strategies = {
    "mean": "Replaces NaN values with the mean of the column for balanced data.",
    "median": "Replaces NaN values with the median of the column for skewed data.",
    "identificador": "Replaces NaN values in categorical columns with a placeholder like 'missing'.",
    "placeholder_date": "Replaces NaN values in date columns with a default date (e.g., '1900-01-01').",
    "default_bool": "Replaces NaN values in boolean columns with a default False value.",
    "sin_imputacion": "Indicates no imputation was needed because no NaN values were present."
}

In [28]:
def imputar_numerico(df, column):
    # Use mean or median based on skewness
    if abs(df[column].skew()) < 1:
        value = df[column].mean()
        strategy = 'mean'
    else:
        value = df[column].median()
        strategy = 'median'

    # Convert the NumPy scalar to a built-in Python number so that it matches
    # the column's data type and stays JSON-serializable
    if pd.api.types.is_integer_dtype(df[column]):
        value = int(value)
    elif pd.api.types.is_float_dtype(df[column]):
        value = float(value)

    # Assign the result back instead of calling fillna(..., inplace=True) on
    # df[column]: pandas warns that the in-place call acts on an intermediate
    # object and will never work in pandas 3.0
    df[column] = df[column].fillna(value)
    return {"columna": column, "estrategia": strategy, "valor": value}

imputation_strategies = imputar_valores(loans_cleaned)

woo
... ("woo" is printed 39 times in total, once per column with missing values)

### Code to save and load JSONs

In [29]:


# Save imputation strategies to JSON
with open("imputation_strategies.json", "w") as file:
    json.dump(imputation_strategies, file, indent=4)

# Save strategy explanations to JSON
with open("strategy_explanations.json", "w") as file:
    json.dump(explanation_strategies, file, indent=4)

In [30]:
def load_json(file_path):
    with open(file_path, 'r') as file:
        data = json.load(file)
    return data

# Load the JSON containing imputation strategies for each column
imputation_strategies = load_json("imputation_strategies.json")
print("Imputation Strategies:\n", imputation_strategies)

# Load the JSON containing explanations of each strategy
strategy_explanations = load_json("strategy_explanations.json")
print("\nStrategy Explanations:\n", strategy_explanations)

Imputation Strategies:
 {'Unnamed: 0': {'estrategia': 'sin_imputacion', 'valor': None}, 'id': {'estrategia': 'sin_imputacion', 'valor': None}, 'loan_amnt': {'estrategia': 'sin_imputacion', 'valor': None}, 'funded_amnt': {'estrategia': 'sin_imputacion', 'valor': None}, 'funded_amnt_inv': {'estrategia': 'sin_imputacion', 'valor': None}, 'term': {'estrategia': 'sin_imputacion', 'valor': None}, 'int_rate': {'estrategia': 'sin_imputacion', 'valor': None}, 'installment': {'estrategia': 'sin_imputacion', 'valor': None}, 'grade': {'estrategia': 'sin_imputacion', 'valor': None}, 'sub_grade': {'estrategia': 'sin_imputacion', 'valor': None}, 'home_ownership': {'estrategia': 'sin_imputacion', 'valor': None}, 'annual_inc': {'estrategia': 'sin_imputacion', 'valor': None}, 'verification_status': {'estrategia': 'sin_imputacion', 'valor': None}, 'issue_d': {'estrategia': 'sin_imputacion', 'valor': None}, 'loan_status': {'estrategia': 'sin_imputacion', 'valor': None}, 'pymnt_plan': {'estrategia': 'sin_i
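
To illustrate the replication goal, the sketch below re-applies saved strategies to a new DataFrame. `apply_imputation` is a hypothetical helper; it assumes each entry is a single dictionary with `estrategia` and `valor` keys, as in the JSON above, and that entries marked `sin_imputacion` are left untouched. The toy values are made up.

```python
import pandas as pd

def apply_imputation(df: pd.DataFrame, strategies: dict) -> pd.DataFrame:
    """Fill missing values column by column using previously saved values."""
    result = df.copy()  # leave the caller's frame unchanged
    for column, entry in strategies.items():
        if column not in result.columns:
            continue  # the strategy refers to a column this frame does not have
        if entry["estrategia"] == "sin_imputacion" or entry["valor"] is None:
            continue  # nothing was imputed for this column
        result[column] = result[column].fillna(entry["valor"])
    return result

# Toy strategies in the same shape as the saved JSON (hypothetical values).
strategies = {
    "id": {"estrategia": "sin_imputacion", "valor": None},
    "dti": {"estrategia": "median", "valor": 17.5},
    "emp_title": {"estrategia": "identificador", "valor": "missing"},
}
new_loans = pd.DataFrame({
    "id": [1, 2, 3],
    "dti": [10.0, None, 25.0],
    "emp_title": ["nurse", None, None],
})

imputed = apply_imputation(new_loans, strategies)
print(imputed)
```

Date placeholders are stored as strings in the JSON, so date columns would likely need the value converted with `pd.Timestamp` before filling; the sketch does not handle that case.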