APPLICATION DATA

Nos encontramos ante un problema de aprendizaje supervisado, ya que contamos con una variable objetivo, TARGET, que clasifica a los clientes según si pagan su préstamo (valor 0) o no (valor 1). En este análisis exploratorio de datos (EDA) investigaremos el dataset application_data.csv para comprender las características y comportamientos de las variables, identificando patrones, distribuciones y relaciones clave. Además, abordaremos posibles valores atípicos, sesgos y valores faltantes que puedan afectar el rendimiento de futuros modelos predictivos.

Este análisis tiene como objetivo principal ayudar al banco a tomar decisiones más informadas, reduciendo el riesgo de impagos al identificar perfiles de clientes con mayor probabilidad de incumplimiento, sin perjudicar a quienes cumplen; a la vez que también reduce la probabilidad de dejar pasar a clientes que no incumplirán. El modelo se basará en datos proporcionados por el cliente al solicitar el préstamo, como información demográfica, financiera y laboral, y será evaluado mediante métricas clave como la curva ROC, AUC, precisión y recall.

Los objetivos específicos de este EDA son:

- Analizar las distribuciones de las variables categóricas y numéricas, explorando su relación con la variable objetivo.
- Identificar y tratar valores nulos o atípicos para mejorar la calidad de los datos.
- Explorar relaciones clave entre las variables y su impacto en el objetivo.
- Proponer transformaciones y selecciones de variables para optimizar el dataset para el modelado.



In [24]:
import pandas as pd
import seaborn as sb
import numpy as np
from matplotlib import pyplot as pyplot
import plotly.express as px

pd.set_option('display.max_columns', 10000)
pd.set_option('display.max_rows', 10000)
pd.set_option('display.width', 10000)

FUNCIONES

In [None]:
import nbimporter
import funciones

In [42]:
def dame_variables_categoricas(dataset):
    '''
    ----------------------------------------------------------------------------------------------------------
    Función clasificar_variables:
    ----------------------------------------------------------------------------------------------------------
        - Descripción : Funcion que recibe un dataset y devuelve una lista respectiva para cada tipo de variable
        (Categórica, Continua, Booleana y No clasificada)
        - Inputs:
            -- dataset : Pandas dataframe que contiene los datos
        - Return : 
            -- 1: la ejecución es incorrecta
            -- lista_var_bool: lista con los nombres de las variables booleanas del dataset de entrada, con valores
            unicos con una longitud de dos, que sean del tipo booleano y que presenten valores 'yes','no','n' & 'y' .
            -- lista_var_cat: lista con los nombres de las variables categóricas del dataset de entrada, con valores
            de tipo object o tipo categorical.
            -- lista_var_con: lista con los nombres de las variables continuas del dataset de entrada, con valores 
            de tipo float o con una longitud de valores unicos mayor a dos. 
            -- lista_var_no_clasificadas: lista con los nombres de las variables no clasificadas del dataset de 
            entrada, que no cumplen con los aspectos anteriormente mencionadas de las demás listas. 
    '''
    
    if dataset is None:
        # Resultante al no brindar ningun DataFrame
        print(u'\nFaltan argumentos por pasar a la función')
        return 1
    
    # Listas para cada tipo de variable
    lista_var_bool = []
    lista_var_cat = []
    lista_var_con = []
    lista_var_no_clasificadas = []
    
    for columna in dataset.columns:
        # Valores unicos por columna sin los NAs
        valores_unicos = dataset[columna].dropna().unique()
        # Trato de mayusculas
        valores_lower = set(val.lower() for val in valores_unicos if isinstance(val, str))
        
        # Variables booleanas
        if (len(valores_unicos) == 2 and
            (valores_lower <= {"yes", "no", "n", "y"} or
             set(valores_unicos) <= {0, 1} or 
             pd.api.types.is_bool_dtype(dataset[columna]))):
            lista_var_bool.append(columna)
        
        # Variables continuas
        elif pd.api.types.is_float_dtype(dataset[columna]) and len(valores_unicos) > 2:
            lista_var_con.append(columna)
        
        # Variables categóricas
        elif pd.api.types.is_object_dtype(dataset[columna]) or pd.api.types.is_categorical_dtype(dataset[columna]):
            lista_var_cat.append(columna)
        
        elif set(valores_unicos).issubset({1, 2, 3}):
            lista_var_cat.append(columna)
        
        # Variables no clasificadas
        else:
            lista_var_no_clasificadas.append(columna) 

    # Calcula la cantidad de cada tipo de variable
    c_v_b = len(lista_var_bool)
    c_v_ca = len(lista_var_cat)
    c_v_co = len(lista_var_con)
    c_v_f = len(lista_var_no_clasificadas)

    print("Variables Booleanas:", c_v_b, lista_var_bool)
    print('============================================================================================================================================================================')
    print("Variables Categóricas:", c_v_ca, lista_var_cat)
    print('============================================================================================================================================================================')
    print("Variables Continuas:", c_v_co, lista_var_con)
    print('============================================================================================================================================================================')
    print("Variables no clasificadas:", c_v_f, lista_var_no_clasificadas)

    return lista_var_bool, lista_var_cat, lista_var_con, lista_var_no_clasificadas

In [26]:
path_folder = "../data/"
pd_loan = pd.read_csv(path_folder + "application_data.csv", low_memory = False)

pd_loan.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,Laborers,1.0,2,2,WEDNESDAY,10,0,0,0,0,0,0,Business Entity Type 3,0.083037,0.262949,0.139376,0.0247,0.0369,0.9722,0.6192,0.0143,0.0,0.069,0.0833,0.125,0.0369,0.0202,0.019,0.0,0.0,0.0252,0.0383,0.9722,0.6341,0.0144,0.0,0.069,0.0833,0.125,0.0377,0.022,0.0198,0.0,0.0,0.025,0.0369,0.9722,0.6243,0.0144,0.0,0.069,0.0833,0.125,0.0375,0.0205,0.0193,0.0,0.0,reg oper account,block of flats,0.0149,"Stone, brick",No,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,State servant,Higher education,Married,House / apartment,0.003541,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,Core staff,2.0,1,1,MONDAY,11,0,0,0,0,0,0,School,0.311267,0.622246,,0.0959,0.0529,0.9851,0.796,0.0605,0.08,0.0345,0.2917,0.3333,0.013,0.0773,0.0549,0.0039,0.0098,0.0924,0.0538,0.9851,0.804,0.0497,0.0806,0.0345,0.2917,0.3333,0.0128,0.079,0.0554,0.0,0.0,0.0968,0.0529,0.9851,0.7987,0.0608,0.08,0.0345,0.2917,0.3333,0.0132,0.0787,0.0558,0.0039,0.01,reg oper account,block of flats,0.0714,Block,No,1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.010032,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,Laborers,1.0,2,2,MONDAY,9,0,0,0,0,0,0,Government,,0.555912,0.729567,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.008019,-19005,-3039,-9833.0,-2437,,1,1,0,1,0,0,Laborers,2.0,2,2,WEDNESDAY,17,0,0,0,0,0,0,Business Entity Type 3,,0.650442,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.028663,-19932,-3038,-4311.0,-3458,,1,1,0,1,0,0,Core staff,1.0,2,2,THURSDAY,11,0,0,0,0,1,1,Religion,,0.322738,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-1106.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


ANÁLISIS GENERAL DE LA TABLA Y PROCESAMIENTO INICIAL

In [27]:
print(pd_loan.shape, pd_loan.drop_duplicates().shape)

(307511, 122) (307511, 122)


Nos encontramos ante un dataframe de 307,511 filas y 122 columnas

In [28]:

pd_loan['FLAG_DOCUMENT_10'].dtypes

dtype('int64')

EXPLORACIÓN DE VARIABLE OBJETIVO Y TRATAMIENTO

In [29]:
pd_plot_target = pd_loan['TARGET']\
        .value_counts(normalize=True)\
        .mul(100).rename('percent').reset_index()

pd_plot_target_conteo = pd_loan['TARGET'].value_counts().reset_index()
pd_plot_target_pc = pd.merge(pd_plot_target, pd_plot_target_conteo, on = ['TARGET'], how='inner')

print(pd_plot_target_pc)

   TARGET    percent   count
0       0  91.927118  282686
1       1   8.072882   24825


Nuestra variable objetivo "TARGET" puede tomar valores 0 y 1. Contamos con 282,686 filas que cuentan con el valor 0 de esta variable (un 92% del dataset) y 24,825 cuentan con el valor 1 (8% del dataset)

In [30]:
fig = px.histogram(pd_plot_target_pc, x="TARGET", y=['percent'])
fig.update_xaxes(tickvals=[0, 1])
fig.show()

Aquí podemos visualizar la distribución de los valores de la variable TARGET en suma de valor, donde se ve que más del 90% del dataset toma el valor 0; y en suma de valor.

In [31]:
pd_series_null_columns = pd_loan.isnull().sum().sort_values(ascending=False)
pd_series_null_rows = pd_loan.isnull().sum(axis=1).sort_values(ascending=False)
print(pd_series_null_columns.shape, pd_series_null_rows.shape)

pd_null_columnas = pd.DataFrame(pd_series_null_columns, columns=['nulos_columnas'])     
pd_null_filas = pd.DataFrame(pd_series_null_rows, columns=['nulos_filas'])  
pd_null_filas['TARGET'] = pd_loan['TARGET'].copy()
pd_null_columnas['porcentaje_columnas'] = pd_null_columnas['nulos_columnas']/pd_loan.shape[0]
pd_null_filas['porcentaje_filas']= pd_null_filas['nulos_filas']/pd_loan.shape[1]

(122,) (307511,)


In [32]:
threshold=0.8
list_vars_not_null = list(pd_null_columnas[pd_null_columnas['porcentaje_columnas']<threshold].index)
pd_loan_filter_null = pd_loan.loc[:, list_vars_not_null]
pd_loan_filter_null.shape

(307511, 122)

Hemos determinado que trabajaremos con aquellas columnas que tengan menos de un 80% de valores nulos, por lo que nos quedaremos con las 122 columnas, aunque si vemos en un futuro que no son necesarias ciertas columnas con tantos valores nulos, podríamos imputarlas o eliminarlas.

TIPOS: VARIABLES CATEGÓRICAS Y NUMÉRICAS

In [None]:
list_cat_vars, other = funciones.dame_variables_categoricas(dataset=pd_loan_filter_null)
pd_loan_filter_null[list_cat_vars] = pd_loan_filter_null[list_cat_vars].astype("category")
pd_loan_filter_null[list_cat_vars].head()


is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, pd.CategoricalDtype) instead



Variables Booleanas: 36 ['EMERGENCYSTATE_MODE', 'LIVE_REGION_NOT_WORK_REGION', 'REG_REGION_NOT_LIVE_REGION', 'FLAG_OWN_CAR', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'TARGET', 'FLAG_OWN_REALTY', 'FLAG_EMAIL', 'FLAG_PHONE', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'REG_REGION_NOT_WORK_REGION', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']
Variables Categóricas: 16 ['FONDKAPREMONT_MODE', 'WALLSMATERIAL_MODE', 'HOUSETYPE_MODE', 'OCCUPATION_TYPE', 'NAME_TYPE_SUITE', 'ORGANIZATION_TYPE', 'NAME_CONTRACT_TYPE', 'NAME_INCOME_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'NAME_EDUCATI

ValueError: too many values to unpack (expected 2)

In [34]:
list_cat_vars

['FONDKAPREMONT_MODE',
 'WALLSMATERIAL_MODE',
 'HOUSETYPE_MODE',
 'EMERGENCYSTATE_MODE',
 'OCCUPATION_TYPE',
 'NAME_TYPE_SUITE',
 'ORGANIZATION_TYPE',
 'NAME_CONTRACT_TYPE',
 'FLAG_OWN_CAR',
 'NAME_INCOME_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_HOUSING_TYPE',
 'NAME_EDUCATION_TYPE',
 'CODE_GENDER',
 'FLAG_OWN_REALTY',
 'WEEKDAY_APPR_PROCESS_START']

Tenemos 11 columnas de variables categóricas

In [35]:
pd_loan_filter_null[list_cat_vars].dtypes
print(pd_loan_filter_null[list_cat_vars])

       FONDKAPREMONT_MODE WALLSMATERIAL_MODE  HOUSETYPE_MODE EMERGENCYSTATE_MODE OCCUPATION_TYPE NAME_TYPE_SUITE       ORGANIZATION_TYPE NAME_CONTRACT_TYPE FLAG_OWN_CAR      NAME_INCOME_TYPE    NAME_FAMILY_STATUS  NAME_HOUSING_TYPE            NAME_EDUCATION_TYPE CODE_GENDER FLAG_OWN_REALTY WEEKDAY_APPR_PROCESS_START
0        reg oper account       Stone, brick  block of flats                  No        Laborers   Unaccompanied  Business Entity Type 3         Cash loans            N               Working  Single / not married  House / apartment  Secondary / secondary special           M               Y                  WEDNESDAY
1        reg oper account              Block  block of flats                  No      Core staff          Family                  School         Cash loans            N         State servant               Married  House / apartment               Higher education           F               N                     MONDAY
2                     NaN                NaN  

PROCESAMIENTO INICIAL DE ALGUNAS VARIABLES

In [36]:
dia = { "MONDAY": 1, "TUESDAY": 2, "WEDNESDAY": 3, "THURSDAY": 4, "FRIDAY": 5, "SATURDAY": 6, "SUNDAY": 7}

pd_loan_filter_null['NWEEKDAY_PROCESS_START'] = pd_loan_filter_null['WEEKDAY_APPR_PROCESS_START'].replace(dia).astype("category")


Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`


The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.



In [37]:
pd_loan_filter_null.head()

Unnamed: 0,COMMONAREA_AVG,COMMONAREA_MODE,COMMONAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAPARTMENTS_MODE,NONLIVINGAPARTMENTS_AVG,FONDKAPREMONT_MODE,LIVINGAPARTMENTS_AVG,LIVINGAPARTMENTS_MEDI,LIVINGAPARTMENTS_MODE,FLOORSMIN_MODE,FLOORSMIN_AVG,FLOORSMIN_MEDI,YEARS_BUILD_AVG,YEARS_BUILD_MODE,YEARS_BUILD_MEDI,OWN_CAR_AGE,LANDAREA_MEDI,LANDAREA_AVG,LANDAREA_MODE,BASEMENTAREA_MODE,BASEMENTAREA_MEDI,BASEMENTAREA_AVG,EXT_SOURCE_1,NONLIVINGAREA_MODE,NONLIVINGAREA_AVG,NONLIVINGAREA_MEDI,ELEVATORS_AVG,ELEVATORS_MEDI,ELEVATORS_MODE,WALLSMATERIAL_MODE,APARTMENTS_AVG,APARTMENTS_MEDI,APARTMENTS_MODE,ENTRANCES_MODE,ENTRANCES_MEDI,ENTRANCES_AVG,LIVINGAREA_AVG,LIVINGAREA_MEDI,LIVINGAREA_MODE,HOUSETYPE_MODE,FLOORSMAX_MODE,FLOORSMAX_AVG,FLOORSMAX_MEDI,YEARS_BEGINEXPLUATATION_MODE,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BEGINEXPLUATATION_AVG,TOTALAREA_MODE,EMERGENCYSTATE_MODE,OCCUPATION_TYPE,EXT_SOURCE_3,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_YEAR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_QRT,NAME_TYPE_SUITE,DEF_60_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_30_CNT_SOCIAL_CIRCLE,EXT_SOURCE_2,AMT_GOODS_PRICE,AMT_ANNUITY,CNT_FAM_MEMBERS,DAYS_LAST_PHONE_CHANGE,HOUR_APPR_PROCESS_START,LIVE_REGION_NOT_WORK_REGION,REG_REGION_NOT_LIVE_REGION,ORGANIZATION_TYPE,NAME_CONTRACT_TYPE,FLAG_OWN_CAR,SK_ID_CURR,AMT_CREDIT,AMT_INCOME_TOTAL,CNT_CHILDREN,NAME_INCOME_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,NAME_EDUCATION_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,CODE_GENDER,TARGET,FLAG_OWN_REALTY,FLAG_EMAIL,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,FLAG_PHONE,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,REG_REGION_NOT_WORK_REGION,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_11,FLAG_DOCUMENT_10,FLAG_DOCUMENT_9,FLAG_DOCUMENT_8,FLAG_DOCUMENT_7,FLAG_DOCUMENT_6,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_19,FLAG_DOCUMENT_18,FLAG_DOCUMENT_17,FLAG_DOCUMENT_16,FLAG_DOCUMENT_15,FLAG_DOCUMENT_14,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,NWEEKDAY_PROCESS_START
0,0.0143,0.0144,0.0144,0.0,0.0,0.0,reg oper account,0.0202,0.0205,0.022,0.125,0.125,0.125,0.6192,0.6341,0.6243,,0.0375,0.0369,0.0377,0.0383,0.0369,0.0369,0.083037,0.0,0.0,0.0,0.0,0.0,0.0,"Stone, brick",0.0247,0.025,0.0252,0.069,0.069,0.069,0.019,0.0193,0.0198,block of flats,0.0833,0.0833,0.0833,0.9722,0.9722,0.9722,0.0149,No,Laborers,0.139376,0.0,0.0,0.0,1.0,0.0,0.0,Unaccompanied,2.0,2.0,2.0,2.0,0.262949,351000.0,24700.5,1.0,-1134.0,10,0,0,Business Entity Type 3,Cash loans,N,100002,406597.5,202500.0,0,Working,Single / not married,House / apartment,0.018801,Secondary / secondary special,-9461,-637,-3648.0,-2120,1,1,0,1,M,1,Y,0,2,2,WEDNESDAY,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3
1,0.0605,0.0497,0.0608,0.0039,0.0,0.0039,reg oper account,0.0773,0.0787,0.079,0.3333,0.3333,0.3333,0.796,0.804,0.7987,,0.0132,0.013,0.0128,0.0538,0.0529,0.0529,0.311267,0.0,0.0098,0.01,0.08,0.08,0.0806,Block,0.0959,0.0968,0.0924,0.0345,0.0345,0.0345,0.0549,0.0558,0.0554,block of flats,0.2917,0.2917,0.2917,0.9851,0.9851,0.9851,0.0714,No,Core staff,,0.0,0.0,0.0,0.0,0.0,0.0,Family,0.0,1.0,0.0,1.0,0.622246,1129500.0,35698.5,2.0,-828.0,11,0,0,School,Cash loans,N,100003,1293502.5,270000.0,0,State servant,Married,House / apartment,0.003541,Higher education,-16765,-1188,-1186.0,-291,1,1,0,1,F,0,N,0,1,1,MONDAY,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,,,,,,,,,,,,,,,,,26.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Laborers,0.729567,0.0,0.0,0.0,0.0,0.0,0.0,Unaccompanied,0.0,0.0,0.0,0.0,0.555912,135000.0,6750.0,1.0,-815.0,9,0,0,Government,Revolving loans,Y,100004,135000.0,67500.0,0,Working,Single / not married,House / apartment,0.010032,Secondary / secondary special,-19046,-225,-4260.0,-2531,1,1,1,1,M,0,Y,0,2,2,MONDAY,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Laborers,,,,,,,,Unaccompanied,0.0,2.0,0.0,2.0,0.650442,297000.0,29686.5,2.0,-617.0,17,0,0,Business Entity Type 3,Cash loans,N,100006,312682.5,135000.0,0,Working,Civil marriage,House / apartment,0.008019,Secondary / secondary special,-19005,-3039,-9833.0,-2437,1,1,0,1,F,0,Y,0,2,2,WEDNESDAY,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3
4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Core staff,,0.0,0.0,0.0,0.0,0.0,0.0,Unaccompanied,0.0,0.0,0.0,0.0,0.322738,513000.0,21865.5,1.0,-1106.0,11,0,0,Religion,Cash loans,N,100007,513000.0,121500.0,0,Working,Single / not married,House / apartment,0.028663,Secondary / secondary special,-19932,-3038,-4311.0,-3458,1,1,0,1,M,0,Y,0,2,2,THURSDAY,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,4


In [38]:
pd_loan_filter_null.shape

(307511, 123)

In [41]:
pd_loan_filter_null.to_csv("../data/pd_preprocessing.csv")