# Predicting Credit Card Sign-Ups
### Technical Test - Significant Feature Selection and Threshold Optimization

This notebook develops a Machine Learning model to predict whether a customer will take out a credit card (`alta_tdc`).
The process has two main stages:

1. **Significant feature selection:** A baseline model is trained to extract the importance of each variable; the importance threshold is then tuned to pick the most relevant variables.
2. **Building and evaluating the final model:** A final pipeline is built with preprocessing, class balancing (SMOTE), and an **XGBClassifier** trained with default hyperparameters.

The goal is a robust model whose reported metrics are the `average_precision_score` and the accuracy (the F1 score for class 1 and a confusion matrix are printed as supporting diagnostics).

## Objective and Deliverables

**Objective:** Develop a predictive model that uses only the significant variables, tuning the selection threshold and training a model with default parameters, and report the `average_precision_score` and the accuracy.

**Deliverables:**
- **Notebook (.ipynb):** The model development, including the rationale for the decisions made and a reflection on possible improvements or next steps.
- **Predictions file (.csv):** With the columns `ID_cliente` and `alta_tdc` (1 if the customer takes out the card, 0 otherwise).

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, average_precision_score, f1_score

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Load the training dataset
train_df = pd.read_csv('train_df.csv', delimiter=';', thousands='.', decimal=',')
print('Training dataset loaded, shape:', train_df.shape)
train_df.head()

Training dataset loaded, shape: (12012, 165)

| | identificador_cliente | cust_age_number | customer_seniority_number | main_city_type | town_id | personal_type | marital_status_type | address_relationship_type | adult_dependants_number | target_type | ... | cr_dbt_card_tech_expns_amount | cr_debit_card_tech_op_number | cr_dbt_cd_tourism_expns_amount | cr_debit_cd_tourism_op_number | cr_dbt_cd_clothes_expns_amount | cr_debit_cd_clothes_op_number | cr_dbt_card_prvs_expns_amount | cr_debit_card_prvs_op_number | audit_date | alta_tdc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 68 | 37 | 1 | 11001 | 1 | 1.0 | 3 | 0.0 | 1 | ... | 12545000.0 | 3.0 | 0.0 | 0.0 | 8500000.0 | 1.0 | 18852200.0 | 5.0 | 2024-11-07T05:18:36.416Z | 0 |
| 1 | 2 | 92 | 37 | 1 | 11001 | 1 | 2.0 | 3 | 0.0 | 1 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2024-11-07T05:18:36.416Z | 0 |
| 2 | 3 | 70 | 27 | 0 | 25572 | 1 | 2.0 | 2 | NaN | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2024-11-07T05:18:36.416Z | 0 |
| 3 | 4 | 76 | 27 | 0 | 25290 | 1 | 2.0 | 3 | NaN | 1 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2024-11-07T05:18:36.416Z | 0 |
| 4 | 5 | 90 | 37 | 1 | 5001 | 1 | 2.0 | 5 | 2.0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2024-11-07T05:18:36.416Z | 0 |


In [3]:
# Define the target and identifier columns
target = 'alta_tdc'
id_col = 'identificador_cliente'

# Split features and target
X = train_df.drop([target, id_col], axis=1)
y = train_df[target]

# Identify numeric and categorical columns
num_cols = X.select_dtypes(include=['number']).columns.tolist()
cat_cols = X.select_dtypes(exclude=['number']).columns.tolist()

print('Numeric columns:', num_cols)
print('Categorical columns:', cat_cols)

# Cast categorical columns to str (note: missing values become the literal string 'nan')
X[cat_cols] = X[cat_cols].astype(str)

Numeric columns: ['cust_age_number', 'customer_seniority_number', 'main_city_type', 'marital_status_type', 'address_relationship_type', 'adult_dependants_number', 'target_type', 'customer_asset_amount', 'customer_liability_amount', 'customer_status_type', 'month_withdrawals_number', 'month_payment_number', 'month_purchase_number', 'm_exit_amount', 'm_entry_amount', 'mth_mbl_app_qry_number', 'mth_mbl_app_mnty_tr_number', 'mth_mbl_app_non_mnty_tr_number', 'month_atm_queries_number', 'atm_made_fin_trans_number', 'atm_made_nfin_tran_number', 'inquiries_digi_channel_number', 'dig_financial_trans_tot_number', 'dig_non_fin_trans_tot_number', 'mth_co_netcash_queries_number', 'mth_co_netc_non_mnty_tr_number', 'mth_co_netcash_mnty_tr_number', 'mth_ppl_netcash_queries_number', 'mth_ppl_netcash_mnty_tr_number', 'mth_pplnetc_non_mnty_tr_number', 'month_h2h_queries_number', 'month_h2h_mnty_tr_number', 'month_h2h_non_mnty_tr_number', 'month_webpage_queries_number', 'mth_webpage_mnty_tr_number', 'm

## Baseline Model and Feature Importance

A pipeline is built to process the numeric and categorical variables, and an **XGBClassifier** is trained to obtain the importance of each variable.

Decisions made:
- **SimpleImputer** for missing values (median for numeric variables, mode for categorical ones).
- **StandardScaler** to standardize numeric variables and **OneHotEncoder** to encode categorical ones.
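To make these preprocessing choices concrete, here is a minimal standard-library sketch of what each step does to a toy column. It is an illustration only, not scikit-learn's implementation; the helper names (`impute_median`, `standardize`, `one_hot`) and the values are made up for this example.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def standardize(values):
    """Rescale to zero mean and unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def one_hot(values):
    """Map each category to an indicator vector, categories sorted alphabetically."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

ages = [68, None, 70, 76, 90]      # toy numeric column with one missing value
imputed = impute_median(ages)      # the gap is filled with the median of 68, 70, 76, 90
scaled = standardize(imputed)
cats, encoded = one_hot(["a", "b", "a"])
print(imputed, cats, encoded)
```

Tree-based models such as XGBoost are insensitive to monotonic rescaling, so the scaling step mainly keeps the pipeline reusable with other estimators.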

In [4]:
# Pipeline for numeric variables
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Pipeline for categorical variables
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine both pipelines
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, num_cols),
    ('cat', categorical_transformer, cat_cols)
])

# Baseline pipeline: preprocessing + XGBClassifier
# (use_label_encoder is omitted: XGBoost reported it as unused and only emitted a warning)
pipeline_base = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'))
])

# Train/test split
X_train_base, X_test_base, y_train_base, y_test_base = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline_base.fit(X_train_base, y_train_base)

# Extract the fitted classifier
model_base = pipeline_base.named_steps['classifier']

# Recover the column names after preprocessing
onehot_feature_names = pipeline_base.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(cat_cols)
feature_names = num_cols + list(onehot_feature_names)

# Compute importances
importances = model_base.feature_importances_
feat_imp = pd.Series(importances, index=feature_names).sort_values(ascending=False)

print('Feature importances:')
print(feat_imp.head(10))

Feature importances:
town_id_11001                0.037613
town_id_11001.0              0.030228
town_id_76001                0.026303
libranza_tenure_mark_type    0.019680
town_id_8001                 0.018775
town_id_5001                 0.017677
mth_mbl_app_qry_number       0.014303
save_account_type            0.013810
town_id_66001                0.012919
main_city_type               0.012578
dtype: float32


## Optimizing the Importance Threshold

A range of thresholds over the feature importances is explored to select the most relevant variables, reducing model complexity by leaving out variables that contribute little.
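The selection rule can be sketched in isolation as follows. Importances are computed per one-hot dummy, but columns are selected at the level of the original variable, so a categorical column is kept whenever at least one of its dummies clears the threshold. The function name `source_columns_above` and all names and values below are hypothetical, chosen only for illustration.

```python
def source_columns_above(importances, num_cols, cat_cols, threshold):
    """Return the original columns kept at a given importance threshold."""
    kept = {name for name, value in importances.items() if value > threshold}
    num_kept = [c for c in num_cols if c in kept]
    # Dummy names are every encoded feature that is not a numeric column.
    dummy_names = [name for name in importances if name not in num_cols]
    # A categorical column survives if any of its one-hot dummies survives.
    cat_kept = [c for c in cat_cols
                if any(d.startswith(c + "_") and d in kept for d in dummy_names)]
    return num_kept, cat_kept

# Made-up importances for two numeric columns and two categorical ones.
toy_importances = {
    "city_A": 0.04,
    "city_B": 0.02,
    "age": 0.014,
    "balance": 0.0,
    "segment_X": 0.0,
}
toy_num_cols = ["age", "balance"]
toy_cat_cols = ["city", "segment"]

for threshold in (0.0, 0.015, 0.03):
    print(threshold, source_columns_above(toy_importances, toy_num_cols, toy_cat_cols, threshold))
```

Because the comparison is strict (`>`), a threshold of 0.0 drops only the variables whose importance is exactly zero.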

In [5]:
# Candidate thresholds, from 0 up to the largest importance
umbral_values = np.linspace(0, feat_imp.max(), num=20)

best_umbral = None
best_score = -1.0  # below any valid F1, so the first evaluated threshold is always recorded
best_num_cols = None
best_cat_cols = None

# One-hot feature names from the fitted baseline preprocessor (computed once)
onehot_all_names = pipeline_base.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(cat_cols)

for umbral in umbral_values:
    # Features whose importance is strictly above the threshold
    features_temp = set(feat_imp[feat_imp > umbral].index)

    num_cols_temp = [col for col in num_cols if col in features_temp]

    # Keep a categorical column if any of its one-hot dummies passes the threshold
    cat_cols_temp = []
    for col in cat_cols:
        dummies = [dummy for dummy in onehot_all_names if dummy.startswith(col + '_')]
        if any(dummy in features_temp for dummy in dummies):
            cat_cols_temp.append(col)

    selected_cols_temp = num_cols_temp + cat_cols_temp
    if len(selected_cols_temp) == 0:
        continue

    X_selected_temp = X[selected_cols_temp]
    X_train_temp, X_test_temp, y_train_temp, y_test_temp = train_test_split(X_selected_temp, y, test_size=0.2, random_state=42)

    numeric_transformer_sel = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    categorical_transformer_sel = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
    preprocessor_sel = ColumnTransformer(transformers=[
        ('num', numeric_transformer_sel, num_cols_temp),
        ('cat', categorical_transformer_sel, cat_cols_temp)
    ])
    clf_temp = Pipeline(steps=[
        ('preprocessor', preprocessor_sel),
        ('classifier', XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'))
    ])

    clf_temp.fit(X_train_temp, y_train_temp)
    y_pred_temp = clf_temp.predict(X_test_temp)

    # The F1 score of the positive class drives the search (it is not printed).
    # Note: it is measured on the same hold-out split used later for the final metrics.
    score = f1_score(y_test_temp, y_pred_temp, pos_label=1, zero_division=0)

    if score > best_score:
        best_score = score
        best_umbral = umbral
        best_num_cols = num_cols_temp
        best_cat_cols = cat_cols_temp

print('Best threshold:', best_umbral)
print('Selected numeric variables:', best_num_cols)
print('Selected categorical variables:', best_cat_cols)

Best threshold: 0.0
Selected numeric variables: ['cust_age_number', 'customer_seniority_number', 'main_city_type', 'marital_status_type', 'address_relationship_type', 'adult_dependants_number', 'target_type', 'customer_asset_amount', 'customer_liability_amount', 'month_withdrawals_number', 'month_payment_number', 'month_purchase_number', 'm_exit_amount', 'm_entry_amount', 'mth_mbl_app_qry_number', 'mth_mbl_app_mnty_tr_number', 'mth_mbl_app_non_mnty_tr_number', 'month_atm_queries_number', 'atm_made_fin_trans_number', 'dig_financial_trans_tot_number', 'mth_ppl_netcash_queries_number', 'mth_ppl_netcash_mnty_tr_number', 'month_h2h_mnty_tr_number', 'month_h2h_non_mnty_tr_number', 'mth_ext_pos_queries_number', 'mth_ext_pos_mnty_tr_number', 'mth_banknet_mnty_tr_number', 'recv_payrl_transactions_number', 'current_payroll_incm_amount', 'payroll_type', 'instlmnt_1_bill_average_amount', 'cr_card_bill_average_amount', 'instlmnt_1_tr_average_per', 'saving_prdt_avg_bal_amount', 'fix_saving_prdt

## Building the Final Pipeline with the Selected Variables

Using the optimal threshold, the relevant variables are selected and the final pipeline is built. It includes:
- Preprocessing for the selected numeric and categorical variables.
- Class balancing with **SMOTE**.
- Training an **XGBClassifier** with default parameters.

At the end, the `average_precision_score` and the accuracy are reported.
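As background on the balancing step, the core idea of SMOTE is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified standard-library illustration of that idea, not the imbalanced-learn implementation; the function name `smote_like_samples`, its parameters, and the toy points are assumptions made for this example.

```python
import random

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority point
    and one of its k nearest minority neighbours (simplified SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, sorted by squared Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Toy minority class: the four corners of the unit square.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(minority_points, n_new=5)
print(new_points)
```

When SMOTE is a step of an imbalanced-learn pipeline, resampling happens only during `fit`; predictions on test or validation data are made on the original, unresampled rows.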

In [6]:
selected_cols = best_num_cols + best_cat_cols
print('Columns selected for the final model:', selected_cols)

numeric_transformer_sel = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer_sel = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor_sel = ColumnTransformer(transformers=[
    ('num', numeric_transformer_sel, best_num_cols),
    ('cat', categorical_transformer_sel, best_cat_cols)
])

# Reference pipeline without SMOTE (defined here but not fitted in this notebook)
clf_final = Pipeline(steps=[
    ('preprocessor', preprocessor_sel),
    ('classifier', XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'))
])

Columns selected for the final model: ['cust_age_number', 'customer_seniority_number', 'main_city_type', 'marital_status_type', 'address_relationship_type', 'adult_dependants_number', 'target_type', 'customer_asset_amount', 'customer_liability_amount', 'month_withdrawals_number', 'month_payment_number', 'month_purchase_number', 'm_exit_amount', 'm_entry_amount', 'mth_mbl_app_qry_number', 'mth_mbl_app_mnty_tr_number', 'mth_mbl_app_non_mnty_tr_number', 'month_atm_queries_number', 'atm_made_fin_trans_number', 'dig_financial_trans_tot_number', 'mth_ppl_netcash_queries_number', 'mth_ppl_netcash_mnty_tr_number', 'month_h2h_mnty_tr_number', 'month_h2h_non_mnty_tr_number', 'mth_ext_pos_queries_number', 'mth_ext_pos_mnty_tr_number', 'mth_banknet_mnty_tr_number', 'recv_payrl_transactions_number', 'current_payroll_incm_amount', 'payroll_type', 'instlmnt_1_bill_average_amount', 'cr_card_bill_average_amount', 'instlmnt_1_tr_average_per', 'saving_prdt_avg_bal_amount', 'fix_saving_prdt_avg_bal

In [7]:
X_selected = X[selected_cols]
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

In [8]:
clf_final_smote = ImbPipeline(steps=[
    ('preprocessor', preprocessor_sel),
    ('smote', SMOTE(random_state=42)),
    ('classifier', XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'))
])

clf_final_smote.fit(X_train, y_train)

y_pred_smote = clf_final_smote.predict(X_test)
y_proba_smote = clf_final_smote.predict_proba(X_test)[:, 1]

# Compute average precision, accuracy, and the F1 score of the positive class
ap_smote = average_precision_score(y_test, y_proba_smote)
accuracy = accuracy_score(y_test, y_pred_smote)
f1_class1 = f1_score(y_test, y_pred_smote, pos_label=1)

print("Average Precision (SMOTE):", ap_smote)
print("Accuracy:", accuracy)
print("F1 Score (Class 1):", f1_class1)

Average Precision (SMOTE): 0.6493932761297511
Accuracy: 0.9359134415314191
F1 Score (Class 1): 0.5698324022346368
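For readers less familiar with average precision: it summarizes the precision-recall curve and, when no two scores are tied, equals the mean of the precision measured at the rank of each true positive. The sketch below computes it by hand on made-up labels and scores; `average_precision` is a hypothetical helper, not scikit-learn's function. As a rough reference point, a classifier that ranks at random scores close to the positive rate, which on this hold-out set is 225 / 2403 ≈ 0.094 (counts taken from the confusion matrix in the next cell).

```python
def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k of the positives (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / k)  # precision among the top-k ranked samples
    return sum(precisions) / len(precisions)

toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(toy_labels, toy_scores)  # (1/1 + 2/3) / 2
print(round(ap, 4))
```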


In [9]:
from sklearn.metrics import confusion_matrix

# F1 score for class 1
f1_class1 = f1_score(y_test, y_pred_smote, pos_label=1)
print("F1 Score (Class 1):", f1_class1)

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_smote)
# Wrap it in a labelled DataFrame (rows: actual class, columns: predicted class)
conf_df = pd.DataFrame(conf_matrix, index=["Actual: 0", "Actual: 1"], columns=["Predicted: 0", "Predicted: 1"])
print("Confusion matrix:")
print(conf_df)

F1 Score (Class 1): 0.5698324022346368
Confusion matrix:
           Predicted: 0  Predicted: 1
Actual: 0          2147            31
Actual: 1           123           102
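The reported metrics can be reproduced directly from the four counts in the confusion matrix above, which also shows how they relate. The arithmetic below uses only those counts; the variable names are chosen for this example.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
tn, fp = 2147, 31
fn, tp = 123, 102

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
majority_baseline = (tn + fp) / total  # accuracy of always predicting class 0

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"majority-class baseline accuracy={majority_baseline:.4f}")
```

The accuracy (about 0.936) sits only a few points above the majority-class baseline (about 0.906), while recall for class 1 is about 0.453: the model misses a little over half of the customers who actually take out the card. This is why average precision and the class-1 F1 score are more informative than accuracy on this imbalanced target.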


## Generating Predictions and the Output File

Predictions are generated on the validation set and saved to a CSV file with the columns:
- **ID_cliente**
- **alta_tdc** (1 if the customer takes out the card, 0 otherwise)
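A quick sanity check of the deliverable's layout can be sketched with the standard library alone. The helpers `write_predictions` and `check_predictions` are hypothetical, the IDs and labels are made up, and the `;` separator is assumed to match the one used when this notebook writes the file.

```python
import csv
import io

def write_predictions(rows, handle, sep=";"):
    """Write (ID_cliente, alta_tdc) pairs with the expected header."""
    writer = csv.writer(handle, delimiter=sep, lineterminator="\n")
    writer.writerow(["ID_cliente", "alta_tdc"])
    writer.writerows(rows)

def check_predictions(handle, sep=";"):
    """Verify the header and that every label is 0 or 1; return the row count."""
    reader = csv.DictReader(handle, delimiter=sep)
    if reader.fieldnames != ["ID_cliente", "alta_tdc"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    if not all(row["alta_tdc"] in {"0", "1"} for row in rows):
        raise ValueError("alta_tdc must be 0 or 1")
    return len(rows)

buffer = io.StringIO()
write_predictions([(101, 0), (102, 1), (103, 0)], buffer)  # made-up IDs and labels
buffer.seek(0)
n_rows = check_predictions(buffer)
print(n_rows)
```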

In [10]:
validation_df = pd.read_csv('Validation_df.csv', delimiter=';', thousands='.', decimal=',')

X_validation = validation_df[selected_cols].copy()
# Apply the same string cast used on the training features, so that categories
# match the ones the one-hot encoder saw during fitting
X_validation[best_cat_cols] = X_validation[best_cat_cols].astype(str)

validation_preds = clf_final_smote.predict(X_validation)

output_df = pd.DataFrame({
    'ID_cliente': validation_df[id_col],
    'alta_tdc': validation_preds
})

output_df.to_csv('predicciones.csv', index=False, sep=';')
print('Predictions file written: predicciones.csv')

Predictions file written: predicciones.csv


## Rationale and Next Steps

Decisions made:

- A baseline model was trained to obtain feature importances, and the importance threshold was tuned to select the most relevant variables. The search settled on a threshold of 0.0, which keeps every variable with non-zero importance.
- An **XGBClassifier** with default parameters was used to keep the model simple.
- **SMOTE** was applied to balance the classes in the training data, with the aim of improving generalization on the minority class.

**Reflection and Next Steps:**

- Explore other algorithms (e.g., LightGBM or CatBoost).
- Investigate additional balancing techniques and cross-validation; in particular, tune the importance threshold with cross-validation rather than on the hold-out split used for the final metrics.
- Document every decision to ensure clarity and reproducibility.
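As a pointer for the cross-validation step, the key requirement with an imbalanced target is that every fold keeps roughly the same share of positives. The sketch below shows that idea with a simple round-robin assignment within each class; `stratified_fold_ids` is a hypothetical helper written for this illustration, and in practice a library splitter such as scikit-learn's `StratifiedKFold` (with shuffling) would be the natural choice.

```python
from collections import defaultdict

def stratified_fold_ids(labels, n_folds=5):
    """Assign each sample a fold id so every fold keeps roughly the same class mix."""
    fold_ids = [None] * len(labels)
    seen = defaultdict(int)
    for i, label in enumerate(labels):
        # Round-robin within each class spreads its members evenly over the folds.
        fold_ids[i] = seen[label] % n_folds
        seen[label] += 1
    return fold_ids

toy_labels = [0] * 45 + [1] * 5   # made-up data with a 10% positive rate
folds = stratified_fold_ids(toy_labels, n_folds=5)
positives_per_fold = [
    sum(1 for f, y in zip(folds, toy_labels) if f == k and y == 1)
    for k in range(5)
]
print(positives_per_fold)
```

Any resampling such as SMOTE should be fitted inside each training fold only, which is what placing it in the pipeline achieves.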