# Master's in Data Science - Machine Learning

# Variable processing: categorical variable encoding and scaling
Author: Ramón Morillo Barrera

## Dataset: Application data

In this notebook we process the categorical variables of our dataset so that they are ready to be consumed by the models we will test.

I will apply several encoding techniques that convert the categorical variables into numeric ones so that the models can handle them, and then scale the variables.

Variable scaling transforms the data so that all features are on the same scale, typically by normalizing the values to a range (for example, 0 to 1) or by standardizing them to have mean 0 and standard deviation 1. This prevents variables with larger ranges from dominating the model or distorting its calculations, especially in algorithms based on distances or variances (such as SVM, KNN, PCA) or on gradients (such as neural networks). It also improves the performance of algorithms that are sensitive to feature magnitude (for example, KNN) and avoids bias, because variables with large ranges no longer carry more weight than less dispersed ones.
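
As a minimal sketch of the two transformations described above, the snippet below applies min-max normalization and standardization to a small made-up array with NumPy. The values and variable names are illustrative assumptions and are not taken from the loan dataset.

```python
import numpy as np

# Hypothetical toy feature (for example, incomes); not taken from the dataset
x = np.array([20_000.0, 35_000.0, 50_000.0, 95_000.0])

# Min-max normalization: maps the values onto the range [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean and divide by the standard deviation
x_std = (x - x.mean()) / x.std()

print(x_minmax)
print(x_std.mean(), x_std.std())
```

Both transformations are linear, so the ordering of the values and the shape of the distribution are preserved; only the location and spread change.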

#### Libraries

In [24]:
import sys
import pandas as pd 
import numpy as np
import sklearn
from sklearn.pipeline import Pipeline
from sklearn import metrics
# pip install category-encoders
import category_encoders as ce
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, \
                            silhouette_score, recall_score, precision_score, make_scorer, \
                            roc_auc_score, f1_score, precision_recall_curve, accuracy_score

from sklearn.metrics import ConfusionMatrixDisplay

#### Functions

In [25]:
sys.path.append('../src')
import funciones_auxiliares as f_aux
sys.path.remove('../src')

# Constant
seed = 12354

#### Data import

In [5]:
df_loan_train = pd.read_csv("../../data_loan_status/data_split/df_loan_train.csv")
df_loan_test = pd.read_csv("../../data_loan_status/data_split/df_loan_test.csv")

In [6]:
df_loan_train.head()

Unnamed: 0,SK_ID_CURR,COMMONAREA_AVG,COMMONAREA_MEDI,COMMONAREA_MODE,NONLIVINGAPARTMENTS_AVG,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAPARTMENTS_MODE,FONDKAPREMONT_MODE,LIVINGAPARTMENTS_MEDI,LIVINGAPARTMENTS_AVG,...,FLAG_DOCUMENT_13,FLAG_DOCUMENT_19,FLAG_DOCUMENT_18,FLAG_DOCUMENT_17,FLAG_DOCUMENT_16,FLAG_DOCUMENT_15,FLAG_DOCUMENT_14,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,TARGET
0,189359,0.021,0.0208,0.019,0.0,0.0,0.0,Desconocido,0.0761,0.0756,...,0,0,0,0,0,0,0,0,0,0
1,418867,0.0142,0.0142,0.0143,0.0,0.0,0.0,reg oper account,0.077,0.0756,...,0,0,0,0,0,0,0,0,0,0
2,263377,0.021,0.0208,0.019,0.0,0.0,0.0,Desconocido,0.0761,0.0756,...,0,0,0,0,0,0,0,0,0,0
3,366006,0.021,0.0208,0.019,0.0,0.0,0.0,Desconocido,0.0761,0.0756,...,0,0,0,0,0,0,0,0,0,0
4,197882,0.021,0.0208,0.019,0.0,0.0,0.0,Desconocido,0.0761,0.0756,...,0,0,0,0,0,0,0,0,0,0


In [7]:
df_loan_train.columns

Index(['SK_ID_CURR', 'COMMONAREA_AVG', 'COMMONAREA_MEDI', 'COMMONAREA_MODE',
       'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_MEDI',
       'NONLIVINGAPARTMENTS_MODE', 'FONDKAPREMONT_MODE',
       'LIVINGAPARTMENTS_MEDI', 'LIVINGAPARTMENTS_AVG',
       ...
       'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_18',
       'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_15',
       'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'TARGET'],
      dtype='object', length=122)

In [9]:
df_loan_train.describe()

Unnamed: 0,SK_ID_CURR,COMMONAREA_AVG,COMMONAREA_MEDI,COMMONAREA_MODE,NONLIVINGAPARTMENTS_AVG,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAPARTMENTS_MODE,LIVINGAPARTMENTS_MEDI,LIVINGAPARTMENTS_AVG,LIVINGAPARTMENTS_MODE,...,FLAG_DOCUMENT_13,FLAG_DOCUMENT_19,FLAG_DOCUMENT_18,FLAG_DOCUMENT_17,FLAG_DOCUMENT_16,FLAG_DOCUMENT_15,FLAG_DOCUMENT_14,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,TARGET
count,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,...,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0
mean,278169.356269,0.028003,0.027854,0.025996,0.002694,0.002641,0.002465,0.084238,0.083525,0.086088,...,0.003577,0.000565,0.00824,0.000276,0.009918,0.001195,0.002845,0.000504,0.000354,0.080729
std,102777.885435,0.042568,0.042607,0.041739,0.026736,0.0265,0.025872,0.053924,0.053303,0.056526,...,0.059702,0.023764,0.090398,0.016623,0.099096,0.034549,0.053267,0.022445,0.018802,0.272419
min,100003.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,189151.75,0.021,0.0208,0.019,0.0,0.0,0.0,0.0761,0.0756,0.0771,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,278276.5,0.021,0.0208,0.019,0.0,0.0,0.0,0.0761,0.0756,0.0771,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,367019.25,0.021,0.0208,0.019,0.0,0.0,0.0,0.0761,0.0756,0.0771,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,456255.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


A quick check confirms that the data was imported in the correct format; everything appears to be in order.

## Categorical variable encoding

The target variable does not need to be encoded, since it already takes the values 0 or 1. Below I give a brief definition of the different types of variable encoding I have found. I will then examine my categorical variables to see how many categories each one has, and finally decide which type of encoding to apply to each variable.

- **One-Hot encoding**: Converts each unique value of a categorical variable into a new binary column (0 or 1), where the row's own category is marked with 1 and the rest with 0. Used when there is no natural order among the categories and we do not want the model to assume a hierarchical relationship between them.

- **Mean encoding**: Replaces each value of the categorical variable with the mean of the target (dependent variable) for that category. It is useful when there is a statistical relationship between the categorical variable and the target, but it can cause overfitting if it is not handled correctly.

- **Ordinal encoding**: Assigns a numeric value to each category of a categorical variable, respecting the natural order of the categories. It is ideal when there is an implicit order among the categories and the model can exploit that relationship.

- **Target Encoding**: Closely related to mean encoding (the two terms are often used interchangeably): each category is replaced by the mean of the target for that category. Implementations such as `ce.TargetEncoder` additionally blend the per-category mean with the global target mean (smoothing), which stabilizes the estimate for rare categories. Because it incorporates information from the target, it must be used carefully to avoid overfitting.

- **CatBoost Encoding**: A variant of target encoding derived from the ordered target statistics used by the CatBoost library. Each training row is encoded using the target values of the rows that precede it in the same category, combined with a prior, which reduces target leakage. It is often useful for high-cardinality categorical variables and with tree-based models such as CatBoost, LightGBM, and XGBoost.
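
To make the first two definitions concrete, here is a minimal pandas sketch on made-up data. The column names `color` and `target` and all values are hypothetical; the mean encoding shown is the plain per-category mean, without any smoothing.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "target": [1, 0, 1, 0, 1, 0],
})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Mean (target) encoding: replace each category with the mean of the target
category_means = df.groupby("color")["target"].mean()
df["color_mean_enc"] = df["color"].map(category_means)

print(one_hot)
print(df)
```

In a real workflow the per-category means would be computed on the training set only and then applied to the test set, so that no test information leaks into the encoding.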

In [10]:
cat_vars = df_loan_train.select_dtypes(include=['object']).columns

# Count the unique values in each categorical variable
unique_counts = df_loan_train[cat_vars].nunique()

print(unique_counts)

FONDKAPREMONT_MODE             5
WALLSMATERIAL_MODE             8
HOUSETYPE_MODE                 4
EMERGENCYSTATE_MODE            3
OCCUPATION_TYPE               19
NAME_TYPE_SUITE                8
ORGANIZATION_TYPE             58
NAME_CONTRACT_TYPE             2
FLAG_OWN_CAR                   2
CODE_GENDER                    3
NAME_INCOME_TYPE               8
NAME_FAMILY_STATUS             5
NAME_HOUSING_TYPE              6
NAME_EDUCATION_TYPE            5
FLAG_OWN_REALTY                2
WEEKDAY_APPR_PROCESS_START     7
dtype: int64


Having looked at the number of unique values in each categorical variable, I will use the following encoding criteria:
- Low cardinality (< 10 categories):
Use One-Hot Encoding.
- Medium/high cardinality (10-50 categories):
Consider Target Encoding or Ordinal Encoding (if there is an implicit order). Since there is no implicit order here, I will use Target Encoding.
- High cardinality (> 50 categories):
Use specific techniques such as Mean Encoding or CatBoost Encoding, which keep a single column per variable instead of one per category and help limit overfitting. In this case I will use CatBoost Encoding.
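
The criteria above can be written as a small helper. This is only a sketch: the function name `choose_encoding` and the dictionary `example_counts` are hypothetical, and the counts are a subset copied from the output printed earlier.

```python
def choose_encoding(n_unique: int) -> str:
    """Map a category count to the encoding rule described above."""
    if n_unique < 10:
        return "one-hot"
    if n_unique <= 50:
        return "target"
    return "catboost"

# Subset of the unique-value counts printed earlier (hypothetical variable name)
example_counts = {
    "NAME_CONTRACT_TYPE": 2,
    "WALLSMATERIAL_MODE": 8,
    "OCCUPATION_TYPE": 19,
    "ORGANIZATION_TYPE": 58,
}

encoding_plan = {col: choose_encoding(n) for col, n in example_counts.items()}
print(encoding_plan)
```

With these thresholds only `OCCUPATION_TYPE` falls in the target-encoding band and only `ORGANIZATION_TYPE` in the CatBoost band, which matches the two columns excluded from one-hot encoding below.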

#### Splitting X and y

In [11]:
y_train = df_loan_train['TARGET']
X_train = df_loan_train.drop('TARGET', axis=1)
y_test = df_loan_test['TARGET']
X_test = df_loan_test.drop('TARGET', axis=1)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((246008, 121), (246008,), (61503, 121), (61503,))

### 1. One-Hot Encoding

In [13]:
list_columns_cat = list(df_loan_train.select_dtypes(include=["object", "category"]).columns)
exclude_vars = ['OCCUPATION_TYPE', 'ORGANIZATION_TYPE']  # Exclude these columns (encoded separately below)
list_columns_ohe = [col for col in list_columns_cat if col not in exclude_vars]

# Create and fit the One-Hot Encoder
ohe = ce.OneHotEncoder(cols=list_columns_ohe, use_cat_names=True)
ohe.fit(X_train, y_train)

# Transform X_train and X_test
X_train_t = ohe.transform(X_train)
X_test_t = ohe.transform(X_test)

# Check the final shapes
print(X_train_t.shape, X_test_t.shape)

(246008, 175) (61503, 175)


### 2. Target Encoding

In [14]:
# Target Encoding

target_column = 'OCCUPATION_TYPE'  # column to encode (not the model target)

# Create and fit the Target Encoding encoder on the training data only
target_enc = ce.TargetEncoder(cols=[target_column])
target_enc.fit(X_train_t[target_column], y_train)

# Transform X_train and X_test
X_train_te = X_train_t.copy()
X_test_te = X_test_t.copy()

X_train_te[target_column] = target_enc.transform(X_train_t[target_column])
X_test_te[target_column] = target_enc.transform(X_test_t[target_column])

# Check the final shapes
print(X_train_te.shape, X_test_te.shape)


(246008, 175) (61503, 175)


### 3. CatBoost Encoding

In [15]:
# CatBoost Encoding

target_column = 'ORGANIZATION_TYPE'  # column to encode (not the model target)

# Create and fit the CatBoost Encoding encoder on the training data only
catboost_enc = ce.CatBoostEncoder(cols=[target_column])
catboost_enc.fit(X_train_te[target_column], y_train)

# Transform X_train and X_test
X_train_tec = X_train_te.copy()
X_test_tec = X_test_te.copy()

X_train_tec[target_column] = catboost_enc.transform(X_train_te[target_column])
X_test_tec[target_column] = catboost_enc.transform(X_test_te[target_column])

# Check the final shapes
print(X_train_tec.shape, X_test_tec.shape)

(246008, 175) (61503, 175)
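
The ordered target statistic behind CatBoost-style encoding can be sketched with plain pandas. This is a simplified illustration under stated assumptions (toy data, a prior weight `a = 1.0`, rows taken in their existing order) and not the `category_encoders` implementation: each row is encoded using only the targets of earlier rows in the same category, blended with the global target mean.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
cat = pd.Series(["a", "b", "a", "a", "b"], name="category")
y = pd.Series([1, 0, 0, 1, 1], name="target")

prior = y.mean()   # global target mean, used as the prior
a = 1.0            # assumed weight given to the prior

# For each row, use only the targets of earlier rows in the same category
grouped = y.groupby(cat)
prev_sum = grouped.cumsum() - y      # sum of targets seen before this row
prev_count = grouped.cumcount()      # number of earlier rows in the category

ordered_enc = (prev_sum + prior * a) / (prev_count + a)
print(ordered_enc.round(3).tolist())
```

Because each row only sees earlier rows, its encoded value never includes its own target. Whether a fitted encoder applies this ordered scheme to the training set can depend on how it is called (for example, whether `y` is passed when transforming), so it is worth checking the library documentation before relying on it.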


In [16]:
X_train_tec.dtypes.to_dict()

{'SK_ID_CURR': dtype('int64'),
 'COMMONAREA_AVG': dtype('float64'),
 'COMMONAREA_MEDI': dtype('float64'),
 'COMMONAREA_MODE': dtype('float64'),
 'NONLIVINGAPARTMENTS_AVG': dtype('float64'),
 'NONLIVINGAPARTMENTS_MEDI': dtype('float64'),
 'NONLIVINGAPARTMENTS_MODE': dtype('float64'),
 'FONDKAPREMONT_MODE_Desconocido': dtype('int64'),
 'FONDKAPREMONT_MODE_reg oper account': dtype('int64'),
 'FONDKAPREMONT_MODE_reg oper spec account': dtype('int64'),
 'FONDKAPREMONT_MODE_not specified': dtype('int64'),
 'FONDKAPREMONT_MODE_org spec account': dtype('int64'),
 'LIVINGAPARTMENTS_MEDI': dtype('float64'),
 'LIVINGAPARTMENTS_AVG': dtype('float64'),
 'LIVINGAPARTMENTS_MODE': dtype('float64'),
 'FLOORSMIN_MODE': dtype('float64'),
 'FLOORSMIN_AVG': dtype('float64'),
 'FLOORSMIN_MEDI': dtype('float64'),
 'YEARS_BUILD_MODE': dtype('float64'),
 'YEARS_BUILD_MEDI': dtype('float64'),
 'YEARS_BUILD_AVG': dtype('float64'),
 'OWN_CAR_AGE': dtype('float64'),
 'LANDAREA_MEDI': dtype('float64'),
 'LANDAREA_A

In [17]:
X_train_tec.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246008 entries, 0 to 246007
Columns: 175 entries, SK_ID_CURR to FLAG_DOCUMENT_21
dtypes: float64(67), int64(108)
memory usage: 328.5 MB


In [18]:
X_train_tec.shape

(246008, 175)

After applying the encodings defined above, our dataset looks good. We end up with 175 columns, all numeric, and 246008 rows.

The data is now ready for variable scaling.

## Variable scaling

Variable scaling is the process of transforming the numeric features of a dataset so that they fall within a given range (such as 0-1) or have mean 0 and standard deviation 1. This does not change the shape of each variable's distribution; it only shifts and rescales the values to a common scale.

Scaling the data is essential to prevent some features from dominating others simply because of differences in magnitude. This helps models train in a balanced way, which often improves performance and generalization.

Distance-based models need scaled data because the distance between points in feature space (such as the Euclidean distance) is sensitive to the magnitudes of the variables. If the features have different ranges, the variables with larger values will dominate the distance calculation, which can lead to biased or incorrect results.

Algorithms that are sensitive to the scale of the data include:
- **K-Nearest Neighbors (KNN)**: The nearest neighbors are identified using distances, and poor scaling can lead to misclassifications.
- **Clustering (K-Means, DBSCAN)**: Cluster assignments depend on the distances between points.
- **Support Vector Machines (SVM)**: The separating margin is defined in terms of distances in the transformed space.
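
A small NumPy sketch shows how the nearest neighbor can change once the features are put on comparable scales. The applicants, their values, and the per-feature standard deviations in `feature_std` are all made-up assumptions for illustration.

```python
import numpy as np

# Hypothetical applicants described by (age in years, annual income)
query = np.array([30.0, 50_000.0])
candidates = np.array([
    [32.0, 52_000.0],   # similar age, slightly different income
    [65.0, 50_500.0],   # very different age, almost identical income
])

# Assumed per-feature standard deviations (illustrative, not estimated from data)
feature_std = np.array([12.0, 20_000.0])

def nearest(point, others):
    """Return the index of the row in `others` closest to `point`."""
    return int(np.argmin(np.linalg.norm(others - point, axis=1)))

# Raw features: income differences dominate the Euclidean distance
idx_raw = nearest(query, candidates)

# Scaled features: dividing by the standard deviation is enough here,
# because subtracting a common mean cancels out in the differences
idx_scaled = nearest(query / feature_std, candidates / feature_std)

print(idx_raw, idx_scaled)
```

On the raw values the 65-year-old is the nearest neighbor because income differences swamp the age difference; after scaling, the 32-year-old becomes the closest point.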

In [19]:
scaler = StandardScaler()
model_scaled = scaler.fit(X_train_tec)
X_train_scaled = pd.DataFrame(scaler.transform(X_train_tec), columns=X_train_tec.columns, index=X_train_tec.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test_tec), columns=X_test_tec.columns, index=X_test_tec.index)

In [20]:
X_train_scaled.head()

Unnamed: 0,SK_ID_CURR,COMMONAREA_AVG,COMMONAREA_MEDI,COMMONAREA_MODE,NONLIVINGAPARTMENTS_AVG,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAPARTMENTS_MODE,FONDKAPREMONT_MODE_Desconocido,FONDKAPREMONT_MODE_reg oper account,FONDKAPREMONT_MODE_reg oper spec account,...,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_19,FLAG_DOCUMENT_18,FLAG_DOCUMENT_17,FLAG_DOCUMENT_16,FLAG_DOCUMENT_15,FLAG_DOCUMENT_14,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21
0,-0.864102,-0.164523,-0.165556,-0.167622,-0.100751,-0.099644,-0.095279,0.679355,-0.561764,-0.201797,...,-0.002016,-0.059916,-0.023777,-0.091148,-0.016628,-0.100089,-0.034591,-0.053419,-0.022457,-0.018809
1,1.368951,-0.324267,-0.320459,-0.280228,-0.100751,-0.099644,-0.095279,-1.471984,1.780107,-0.201797,...,-0.002016,-0.059916,-0.023777,-0.091148,-0.016628,-0.100089,-0.034591,-0.053419,-0.022457,-0.018809
2,-0.143926,-0.164523,-0.165556,-0.167622,-0.100751,-0.099644,-0.095279,0.679355,-0.561764,-0.201797,...,-0.002016,-0.059916,-0.023777,-0.091148,-0.016628,-0.100089,-0.034591,-0.053419,-0.022457,-0.018809
3,0.854628,-0.164523,-0.165556,-0.167622,-0.100751,-0.099644,-0.095279,0.679355,-0.561764,-0.201797,...,-0.002016,-0.059916,-0.023777,-0.091148,-0.016628,-0.100089,-0.034591,-0.053419,-0.022457,-0.018809
4,-0.781175,-0.164523,-0.165556,-0.167622,-0.100751,-0.099644,-0.095279,0.679355,-0.561764,-0.201797,...,-0.002016,-0.059916,-0.023777,-0.091148,-0.016628,-0.100089,-0.034591,-0.053419,-0.022457,-0.018809


In [21]:
X_train_scaled.describe()

Unnamed: 0,SK_ID_CURR,COMMONAREA_AVG,COMMONAREA_MEDI,COMMONAREA_MODE,NONLIVINGAPARTMENTS_AVG,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAPARTMENTS_MODE,FONDKAPREMONT_MODE_Desconocido,FONDKAPREMONT_MODE_reg oper account,FONDKAPREMONT_MODE_reg oper spec account,...,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_19,FLAG_DOCUMENT_18,FLAG_DOCUMENT_17,FLAG_DOCUMENT_16,FLAG_DOCUMENT_15,FLAG_DOCUMENT_14,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21
count,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,...,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0
mean,-1.870602e-16,-1.748138e-16,-1.90916e-17,-4.707915e-18,-2.6427860000000002e-17,-6.25315e-18,7.610647e-18,1.587477e-17,-3.7944930000000006e-17,6.645958e-17,...,4.3143849999999996e-19,1.7705220000000003e-17,5.480533e-18,4.988079e-17,-1.974869e-18,4.2407340000000006e-17,3.884752e-18,-4.072491e-18,1.226441e-17,2.816084e-18
std,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,...,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002
min,-1.733512,-0.6578498,-0.6537366,-0.6228378,-0.1007506,-0.09964403,-0.09527905,-1.471984,-0.5617641,-0.2017975,...,-0.002016166,-0.05991628,-0.02377692,-0.09114836,-0.01662799,-0.1000886,-0.03459065,-0.05341868,-0.02245669,-0.01880883
25%,-0.8661181,-0.1645229,-0.1655556,-0.1676219,-0.1007506,-0.09964403,-0.09527905,-1.471984,-0.5617641,-0.2017975,...,-0.002016166,-0.05991628,-0.02377692,-0.09114836,-0.01662799,-0.1000886,-0.03459065,-0.05341868,-0.02245669,-0.01880883
50%,0.001042481,-0.1645229,-0.1655556,-0.1676219,-0.1007506,-0.09964403,-0.09527905,0.6793551,-0.5617641,-0.2017975,...,-0.002016166,-0.05991628,-0.02377692,-0.09114836,-0.01662799,-0.1000886,-0.03459065,-0.05341868,-0.02245669,-0.01880883
75%,0.8644863,-0.1645229,-0.1655556,-0.1676219,-0.1007506,-0.09964403,-0.09527905,0.6793551,-0.5617641,-0.2017975,...,-0.002016166,-0.05991628,-0.02377692,-0.09114836,-0.01662799,-0.1000886,-0.03459065,-0.05341868,-0.02245669,-0.01880883
max,1.732727,22.83391,22.8165,23.3359,37.30189,37.63622,38.55718,0.6793551,1.780107,4.955463,...,495.9909,16.68995,42.0576,10.97112,60.13954,9.991144,28.90955,18.72004,44.53016,53.16651
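
The standard deviation shows as 1.000002 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while `DataFrame.describe()` reports the sample standard deviation (`ddof=1`). A quick check of the expected ratio, using the row count reported above:

```python
import math

n = 246_008  # number of training rows reported above

# Ratio between the sample (ddof=1) and population (ddof=0) standard deviation
ratio = math.sqrt(n / (n - 1))
print(round(ratio, 6))
```

The difference is negligible at this sample size and does not indicate a problem with the scaling.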


In [22]:
X_test_scaled.head()

Unnamed: 0,SK_ID_CURR,COMMONAREA_AVG,COMMONAREA_MEDI,COMMONAREA_MODE,NONLIVINGAPARTMENTS_AVG,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAPARTMENTS_MODE,FONDKAPREMONT_MODE_Desconocido,FONDKAPREMONT_MODE_reg oper account,FONDKAPREMONT_MODE_reg oper spec account,...,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_19,FLAG_DOCUMENT_18,FLAG_DOCUMENT_17,FLAG_DOCUMENT_16,FLAG_DOCUMENT_15,FLAG_DOCUMENT_14,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21
0,-0.102604,-0.267887,-0.259437,-0.220331,-0.100751,-0.099644,-0.095279,-1.471984,1.780107,-0.201797,...,-0.002016,-0.059916,-0.023777,-0.091148,-0.016628,-0.100089,-0.034591,-0.053419,-0.022457,-0.018809
1,0.731546,-0.469916,-0.463628,-0.428772,-0.100751,-0.099644,-0.095279,-1.471984,1.780107,-0.201797,...,-0.002016,-0.059916,-0.023777,-0.091148,-0.016628,-0.100089,-0.034591,-0.053419,-0.022457,-0.018809
2,-0.844263,-0.157475,-0.158515,-0.160434,-0.100751,-0.099644,-0.095279,0.679355,-0.561764,-0.201797,...,-0.002016,-0.059916,-0.023777,-0.091148,-0.016628,-0.100089,-0.034591,-0.053419,-0.022457,-0.018809
3,-1.696792,-0.180967,-0.172597,-0.131684,-0.100751,-0.099644,-0.095279,-1.471984,1.780107,-0.201797,...,-0.002016,-0.059916,-0.023777,-0.091148,-0.016628,-0.100089,-0.034591,-0.053419,-0.022457,-0.018809
4,-0.550502,-0.157475,-0.158515,-0.160434,-0.100751,-0.099644,-0.095279,0.679355,-0.561764,-0.201797,...,-0.002016,-0.059916,-0.023777,-0.091148,-0.016628,-0.100089,-0.034591,-0.053419,-0.022457,-0.018809


In [23]:
X_test_scaled.describe()

Unnamed: 0,SK_ID_CURR,COMMONAREA_AVG,COMMONAREA_MEDI,COMMONAREA_MODE,NONLIVINGAPARTMENTS_AVG,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAPARTMENTS_MODE,FONDKAPREMONT_MODE_Desconocido,FONDKAPREMONT_MODE_reg oper account,FONDKAPREMONT_MODE_reg oper spec account,...,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_19,FLAG_DOCUMENT_18,FLAG_DOCUMENT_17,FLAG_DOCUMENT_16,FLAG_DOCUMENT_15,FLAG_DOCUMENT_14,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21
count,61503.0,61503.0,61503.0,61503.0,61503.0,61503.0,61503.0,61503.0,61503.0,61503.0,...,61503.0,61503.0,61503.0,61503.0,61503.0,61503.0,61503.0,61503.0,61503.0,61503.0
mean,0.000543,0.01818,0.018412,0.01696,-0.00021,0.000716,0.000714,-0.003828,0.002466,0.00398,...,0.006048,-0.004358,0.006329,-0.006072,-0.002935,0.000491,0.002117,0.008546,0.000724,-0.004973
std,1.000608,1.062253,1.066521,1.060834,0.993182,1.002929,0.996601,1.001517,1.001506,1.009415,...,1.999988,0.963076,1.125165,0.966393,0.907507,1.002431,1.03012,1.076787,1.015993,0.85773
min,-1.733522,-0.65785,-0.653737,-0.622838,-0.100751,-0.099644,-0.095279,-1.471984,-0.561764,-0.201797,...,-0.002016,-0.059916,-0.023777,-0.091148,-0.016628,-0.100089,-0.034591,-0.053419,-0.022457,-0.018809
25%,-0.866485,-0.157475,-0.158515,-0.160434,-0.100751,-0.099644,-0.095279,-1.471984,-0.561764,-0.201797,...,-0.002016,-0.059916,-0.023777,-0.091148,-0.016628,-0.100089,-0.034591,-0.053419,-0.022457,-0.018809
50%,-0.00193,-0.157475,-0.158515,-0.160434,-0.100751,-0.099644,-0.095279,0.679355,-0.561764,-0.201797,...,-0.002016,-0.059916,-0.023777,-0.091148,-0.016628,-0.100089,-0.034591,-0.053419,-0.022457,-0.018809
75%,0.871188,-0.157475,-0.158515,-0.160434,-0.100751,-0.099644,-0.095279,0.679355,-0.561764,-0.201797,...,-0.002016,-0.059916,-0.023777,-0.091148,-0.016628,-0.100089,-0.034591,-0.053419,-0.022457,-0.018809
max,1.732669,22.833907,22.816503,23.335895,37.301893,37.636221,38.557181,0.679355,1.780107,4.955463,...,495.990927,16.689953,42.0576,10.971124,60.139544,9.991144,28.909547,18.720043,44.530164,53.166514


With the variables scaled, this first deliverable of the assignment is complete. The next step is feature processing, in which we will apply different techniques and algorithms to decide how many of the dataset's variables to use when building our ML models.

A brief summary of the stages completed so far:
1. Initial EDA, looking at the size of the data and the characteristics of the TARGET variable and the other variables
2. Analysis of variable types and how they affect the TARGET variable
3. Correlation analysis of the variables
4. Stratified split into Train and Test
5. Outlier treatment
6. Imputation of missing values
7. WoE and IV analysis
8. Numeric encoding of categorical variables
9. Variable scaling

Our data is therefore, a priori, ready for the next steps of the project: variable selection (feature processing), modeling, model implementation, explainability, and conclusions.