# Predicting Credit Card Sign-ups
### Technical Test - Significant Feature Selection and Threshold Optimization

This notebook develops a Machine Learning model to predict whether a customer will acquire a credit card (`alta_tdc`).
The process has two main stages:

1. **Significant feature selection:** A base model is trained to extract the importance of each feature; the importance threshold is then tuned to keep only the most relevant features.
2. **Construction and evaluation of the final model:** A final pipeline is built that includes preprocessing, class balancing (SMOTE), and training of an `LGBMClassifier` with default hyperparameters.

The goal is to obtain a robust model and, at the end, report only the average_precision_score and the accuracy.
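Because average precision is one of the two reported metrics, it may help to recall how it is computed. The sketch below is a plain-Python illustration on made-up labels and scores (not data from this notebook) and assumes there are no tied scores. `average_precision` is a hypothetical helper written for this example; the notebook itself relies on `sklearn.metrics.average_precision_score`.

```python
def average_precision(y_true, y_score):
    """Average precision for binary labels, assuming no tied scores.

    AP = sum over ranks n of (R_n - R_{n-1}) * P_n, which reduces to the
    mean of the precision values measured at each positive example.
    """
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")

    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


# Toy example (illustrative values only).
y_true = [1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.3, 0.1]
ap = average_precision(y_true, y_score)
print(round(ap, 4))
```

Unlike accuracy, this metric depends only on how the positives are ranked, which is why it is informative when the positive class is rare.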

## Objective and Deliverables

**Objective:** Develop a predictive model that uses only the significant features, tuning the selection threshold and training a model with default parameters, so that only the average_precision_score and the accuracy are printed.

**Deliverables:**
- **Notebook (.ipynb):** The model development, including the justification for the decisions taken and a reflection on possible improvements or next steps.
- **Predictions file (.csv):** With the columns `ID_cliente` and `alta_tdc` (1 if the customer acquires the card, 0 otherwise).

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score, average_precision_score, f1_score

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

import lightgbm as lgb  # LightGBM is used instead of RandomForestClassifier

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Load the training dataset
train_df = pd.read_csv('train_df.csv', delimiter=';', thousands='.', decimal=',')
print('Training dataset loaded, shape:', train_df.shape)
train_df.head()  # not displayed: only the last expression of a cell is rendered

# Define the target and identifier columns
target = 'alta_tdc'
id_col = 'identificador_cliente'

# Separate features and target
X = train_df.drop([target, id_col], axis=1)
y = train_df[target]

# Identify numeric and categorical columns
num_cols = X.select_dtypes(include=['number']).columns.tolist()
cat_cols = X.select_dtypes(exclude=['number']).columns.tolist()

print('Numeric columns:', num_cols)
print('Categorical columns:', cat_cols)

# Cast categorical variables to string.
# Note: astype(str) turns missing values into literal strings such as 'nan',
# so they become their own category and the categorical imputer never fires.
X[cat_cols] = X[cat_cols].astype(str)

Training dataset loaded, shape: (12012, 165)
Numeric columns: ['cust_age_number', 'customer_seniority_number', 'main_city_type', 'marital_status_type', 'address_relationship_type', 'adult_dependants_number', 'target_type', 'customer_asset_amount', 'customer_liability_amount', 'customer_status_type', 'month_withdrawals_number', 'month_payment_number', 'month_purchase_number', 'm_exit_amount', 'm_entry_amount', 'mth_mbl_app_qry_number', 'mth_mbl_app_mnty_tr_number', 'mth_mbl_app_non_mnty_tr_number', 'month_atm_queries_number', 'atm_made_fin_trans_number', 'atm_made_nfin_tran_number', 'inquiries_digi_channel_number', 'dig_financial_trans_tot_number', 'dig_non_fin_trans_tot_number', 'mth_co_netcash_queries_number', 'mth_co_netc_non_mnty_tr_number', 'mth_co_netcash_mnty_tr_number', 'mth_ppl_netcash_queries_number', 'mth_ppl_netcash_mnty_tr_number', 'mth_pplnetc_non_mnty_tr_number', 'month_h2h_queries_number', 'month_h2h_mnty_tr_number', 'month_h2h_non_mnty_tr_number', 'mon



## Base Model and Feature Importance with LightGBM

A base pipeline (preprocessing followed by LightGBM) is trained, and the importance of each feature is read from the fitted classifier. LightGBM's default importance counts how many times a feature is used in a split.

In [3]:
# Pipeline for numeric variables
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Pipeline for categorical variables
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine both pipelines
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, num_cols),
    ('cat', categorical_transformer, cat_cols)
])

# Base pipeline: preprocessing + LightGBM
pipeline_base = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', lgb.LGBMClassifier(n_estimators=100, random_state=42))
])

# Train/test split
X_train_base, X_test_base, y_train_base, y_test_base = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline_base.fit(X_train_base, y_train_base)

# Extract the fitted classifier
model_base = pipeline_base.named_steps['classifier']

# Column names after preprocessing (numeric first, then one-hot dummies)
onehot_feature_names = (
    pipeline_base.named_steps['preprocessor']
    .named_transformers_['cat']
    .named_steps['onehot']
    .get_feature_names_out(cat_cols)
)
feature_names = num_cols + list(onehot_feature_names)

# Compute importances, sorted from most to least important
importances = model_base.feature_importances_
feat_imp = pd.Series(importances, index=feature_names).sort_values(ascending=False)

print('Feature importances:')
print(feat_imp.head(10))

[LightGBM] [Info] Number of positive: 982, number of negative: 8627
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.005134 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 10089
[LightGBM] [Info] Number of data points in the train set: 9609, number of used features: 216
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.102196 -> initscore=-2.173061
[LightGBM] [Info] Start training from score -2.173061
Feature importances:
customer_seniority_number      147
cust_age_number                134
saving_prdt_avg_bal_amount     132
customer_liability_amount      127
current_payroll_incm_amount    123
customer_asset_amount          105
mth_mbl_app_qry_number          89
month_withdrawals_number        83
m_entry_amount                  79
month_purchase_number           72
dtype: int32


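## Optimizing the Importance Threshold

Twenty evenly spaced thresholds between 0 and the maximum importance are tried. For each threshold, only the features whose importance exceeds it are kept, a LightGBM model is retrained on them, and the threshold with the best F1 score on the held-out split is selected.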
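To make the grid concrete, here is a minimal plain-Python sketch that applies the same rule (keep a feature when its importance is strictly greater than the threshold) to the ten importances printed above. The full importance series contains many more features, so the counts are illustrative only.

```python
# Top-10 split-count importances copied from the output above; the full
# series has many more features, so this is only an illustration.
top_importances = {
    "customer_seniority_number": 147,
    "cust_age_number": 134,
    "saving_prdt_avg_bal_amount": 132,
    "customer_liability_amount": 127,
    "current_payroll_incm_amount": 123,
    "customer_asset_amount": 105,
    "mth_mbl_app_qry_number": 89,
    "month_withdrawals_number": 83,
    "m_entry_amount": 79,
    "month_purchase_number": 72,
}

n_thresholds = 20
max_importance = max(top_importances.values())

# Same grid as np.linspace(0, max_importance, num=20): evenly spaced,
# both endpoints included.
thresholds = [max_importance * i / (n_thresholds - 1) for i in range(n_thresholds)]

# A feature survives a threshold only if its importance is strictly greater.
survivors = {
    t: [name for name, imp in top_importances.items() if imp > t]
    for t in thresholds
}

for t in thresholds:
    print(f"threshold={t:7.2f} -> {len(survivors[t])} of 10 features kept")
```

Because the comparison is strict, the last grid value (the maximum importance) keeps no feature at all, so that iteration contributes nothing to the search.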

In [4]:
# Threshold grid: 20 evenly spaced values from 0 to the maximum importance
umbral_values = np.linspace(0, feat_imp.max(), num=20)

best_umbral = None
best_score = -np.inf  # so that a threshold is selected even if every F1 is 0
best_num_cols = None
best_cat_cols = None

# One-hot column names do not change between iterations, so compute them once
onehot_all_names = (
    pipeline_base.named_steps['preprocessor']
    .named_transformers_['cat']
    .named_steps['onehot']
    .get_feature_names_out(cat_cols)
)

for umbral in umbral_values:
    # Features whose importance is strictly above the threshold
    features_temp = set(feat_imp[feat_imp > umbral].index)

    num_cols_temp = [col for col in num_cols if col in features_temp]

    # Keep a categorical column if any of its dummies passes the threshold.
    # Caveat: the prefix test can mis-assign dummies when one column name
    # is a prefix of another (e.g. 'x' and 'x_id').
    cat_cols_temp = []
    for col in cat_cols:
        dummies = [dummy for dummy in onehot_all_names if dummy.startswith(col + '_')]
        if any(dummy in features_temp for dummy in dummies):
            cat_cols_temp.append(col)

    selected_cols_temp = num_cols_temp + cat_cols_temp
    if len(selected_cols_temp) == 0:
        continue

    X_selected_temp = X[selected_cols_temp]
    X_train_temp, X_test_temp, y_train_temp, y_test_temp = train_test_split(X_selected_temp, y, test_size=0.2, random_state=42)

    numeric_transformer_sel = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    categorical_transformer_sel = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
    preprocessor_sel = ColumnTransformer(transformers=[
        ('num', numeric_transformer_sel, num_cols_temp),
        ('cat', categorical_transformer_sel, cat_cols_temp)
    ])
    clf_temp = Pipeline(steps=[
        ('preprocessor', preprocessor_sel),
        ('classifier', lgb.LGBMClassifier(n_estimators=100, random_state=42))
    ])

    clf_temp.fit(X_train_temp, y_train_temp)
    y_pred_temp = clf_temp.predict(X_test_temp)

    # F1 for the positive class drives the selection (it is not printed).
    # zero_division=0 yields 0 without a warning when nothing is predicted positive.
    # Note: this is the same split used later for the final evaluation,
    # so the final metrics are likely to be optimistic.
    score = f1_score(y_test_temp, y_pred_temp, pos_label=1, zero_division=0)

    if score > best_score:
        best_score = score
        best_umbral = umbral
        best_num_cols = num_cols_temp
        best_cat_cols = cat_cols_temp

print('Best threshold:', best_umbral)
print('Selected numeric variables:', best_num_cols)
print('Selected categorical variables:', best_cat_cols)

[LightGBM] [Info] Number of positive: 982, number of negative: 8627
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004825 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9812
[LightGBM] [Info] Number of data points in the train set: 9609, number of used features: 192
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.102196 -> initscore=-2.173061
[LightGBM] [Info] Start training from score -2.173061
[LightGBM] [Info] Number of positive: 982, number of negative: 8627
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003029 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8412
[LightGBM] [Info] Number of data points in the train set: 9609, number of used features: 151
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.102196 -> initscore=-2.173061
[LightGBM] [Info] Start training from score -2.173061
[LightGBM] [Info] 
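
The threshold loop maps one-hot columns back to their source column with a `startswith(col + '_')` test. The sketch below, which uses hypothetical column names and categories, shows how that test can over-select when one column name is a prefix of another, and a possible alternative that builds an exact lookup from each column's own categories (with a fitted encoder these would come from `OneHotEncoder.categories_`). Whether the real dataset contains such a pair of columns is not known here.

```python
def columns_by_prefix(cat_cols, dummy_names, selected):
    """Prefix-based mapping, mirroring the test used in the loop."""
    kept = []
    for col in cat_cols:
        dummies = [d for d in dummy_names if d.startswith(col + "_")]
        if any(d in selected for d in dummies):
            kept.append(col)
    return kept


def columns_by_exact_map(cat_cols, categories, selected):
    """Exact mapping built from each column's own category list."""
    dummy_to_col = {
        f"{col}_{cat}": col
        for col, cats in zip(cat_cols, categories)
        for cat in cats
    }
    kept = {dummy_to_col[d] for d in selected if d in dummy_to_col}
    return [col for col in cat_cols if col in kept]


# Hypothetical columns where one name is a prefix of the other.
cat_cols = ["town", "town_id"]
categories = [["north", "south"], ["7", "9"]]
dummy_names = [f"{c}_{v}" for c, cats in zip(cat_cols, categories) for v in cats]

# Only a dummy of "town_id" passed the importance threshold.
selected = ["town_id_7"]

prefix_result = columns_by_prefix(cat_cols, dummy_names, selected)
exact_result = columns_by_exact_map(cat_cols, categories, selected)
print(prefix_result)
print(exact_result)
```

In this toy case the prefix test also keeps `town`, although none of its own dummies was selected.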

## Building the Final Pipeline with the Selected Variables and LightGBM

With the selected threshold, the final pipeline is defined: it uses SMOTE to balance the classes and LightGBM as the classifier.
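
For intuition about what SMOTE adds to the training data, the following is a deliberately simplified plain-Python sketch of its core idea: a synthetic minority point is placed somewhere on the segment between a minority sample and one of its nearest minority neighbours. This is not imblearn's implementation; `smote_like_sample` and the 2-D points are hypothetical, and only the class counts are taken from the LightGBM logs in this notebook.

```python
import math
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating toward a near neighbour.

    Simplified illustration of the SMOTE idea, not imblearn's implementation.
    """
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # k nearest minority neighbours of the base point (excluding itself).
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(p, base),
    )[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    synthetic = tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
    return base, neighbour, synthetic


# Hypothetical 2-D minority-class points.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
base, neighbour, synthetic = smote_like_sample(minority, rng=random.Random(42))
print(base, neighbour, synthetic)

# Class counts reported by LightGBM for the training split.
n_pos, n_neg = 982, 8627
synthetic_needed = n_neg - n_pos
print("synthetic positives needed to balance:", synthetic_needed)
```

Placing SMOTE inside an imblearn `Pipeline` matters because samplers there run only during `fit`, so the held-out data is never resampled.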

In [5]:
selected_cols = best_num_cols + best_cat_cols
print('Columns selected for the final model:', selected_cols)

numeric_transformer_sel = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer_sel = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor_sel = ColumnTransformer(transformers=[
    ('num', numeric_transformer_sel, [col for col in selected_cols if col in num_cols]),
    ('cat', categorical_transformer_sel, [col for col in selected_cols if col in cat_cols])
])

# Final pipeline without SMOTE (optional baseline; it is not fitted in this notebook)
clf_final = Pipeline(steps=[
    ('preprocessor', preprocessor_sel),
    ('classifier', lgb.LGBMClassifier(n_estimators=100, random_state=42))
])

Columns selected for the final model: ['cust_age_number', 'customer_seniority_number', 'main_city_type', 'customer_asset_amount', 'customer_liability_amount', 'month_withdrawals_number', 'month_purchase_number', 'm_exit_amount', 'm_entry_amount', 'mth_mbl_app_qry_number', 'mth_mbl_app_mnty_tr_number', 'month_atm_queries_number', 'atm_made_fin_trans_number', 'mth_ppl_netcash_queries_number', 'month_h2h_mnty_tr_number', 'mth_ext_pos_mnty_tr_number', 'current_payroll_incm_amount', 'saving_prdt_avg_bal_amount', 'consumer_cash_balance_amount', 'bbva_payrl_oblg_tot_bal_amount', 'avg_cr_card_balance_amount', 'new_segment_id', 'cr_dbt_cd_digital_expns_amount', 'cr_dbt_cd_courier_expns_amount', 'cr_dbt_cd_rstrnt_expns_amount', 'cr_dbt_card_spmk_expns_amount', 'cr_dbt_card_tech_expns_amount', 'town_id']


In [6]:
X_selected = X[selected_cols]
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

# imblearn's Pipeline applies SMOTE only while fitting, never at prediction time
clf_final_smote = ImbPipeline(steps=[
    ('preprocessor', preprocessor_sel),
    ('smote', SMOTE(random_state=42)),
    ('classifier', lgb.LGBMClassifier(n_estimators=100, random_state=42))
])

clf_final_smote.fit(X_train, y_train)

y_pred_smote = clf_final_smote.predict(X_test)

# Compute average precision (from predicted probabilities), accuracy, and F1
ap_smote = average_precision_score(y_test, clf_final_smote.predict_proba(X_test)[:, 1])
accuracy = accuracy_score(y_test, y_pred_smote)
f1_class1 = f1_score(y_test, y_pred_smote, pos_label=1)

print("Average Precision (SMOTE):", ap_smote)
print("Accuracy:", accuracy)
print("F1 Score (Class 1):", f1_class1)

[LightGBM] [Info] Number of positive: 8627, number of negative: 8627
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.009005 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 10266
[LightGBM] [Info] Number of data points in the train set: 17254, number of used features: 192
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
Average Precision (SMOTE): 0.6440572153131335
Accuracy: 0.9313358302122348
F1 Score (Class 1): 0.5714285714285714


In [7]:
from sklearn.metrics import confusion_matrix

# F1 score for class 1 (same value as in the previous cell)
f1_class1 = f1_score(y_test, y_pred_smote, pos_label=1)
print("F1 Score (Class 1):", f1_class1)

# Confusion matrix with an explicit label order:
# rows are actual classes, columns are predicted classes
conf_matrix = confusion_matrix(y_test, y_pred_smote, labels=[0, 1])
conf_df = pd.DataFrame(conf_matrix, index=["Actual: 0", "Actual: 1"], columns=["Predicted: 0", "Predicted: 1"])
print("Confusion matrix:")
print(conf_df)

F1 Score (Class 1): 0.5714285714285714
Confusion matrix:
           Predicted: 0  Predicted: 1
Actual: 0          2128            50
Actual: 1           115           110
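
The reported accuracy and F1 can be re-derived by hand from the four counts in this confusion matrix. The short sketch below does so in plain Python and also computes, for comparison, the accuracy of a trivial model that always predicts class 0.

```python
# Counts copied from the confusion matrix above
# (rows: actual class, columns: predicted class).
tn, fp = 2128, 50
fn, tp = 115, 110

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts class 0.
majority_baseline = (tn + fp) / total

print(f"accuracy          = {accuracy:.4f}")
print(f"precision         = {precision:.4f}")
print(f"recall            = {recall:.4f}")
print(f"F1 (class 1)      = {f1:.4f}")
print(f"majority baseline = {majority_baseline:.4f}")
```

With roughly 9% positives in the test split, the always-negative baseline already reaches about 0.91 accuracy, so precision, recall, and average precision say more about the model than accuracy does.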


## Generating Predictions and the Output File

Predictions are generated for the validation set (`Validation_df.csv`) and saved to a CSV file with the columns:
- **ID_cliente**
- **alta_tdc** (1 if the customer acquires the card, 0 otherwise)
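
Before handing in the file, a quick format check can catch header or value mistakes. The sketch below uses only the standard library; `validate_predictions` is a hypothetical helper, the sample contents are made up, the `;` delimiter matches the one used when the notebook writes the file, and the uniqueness check on `ID_cliente` is an assumption rather than a stated requirement.

```python
import csv
import io


def validate_predictions(csv_text, delimiter=";"):
    """Return a list of problems found in a predictions file (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    problems = []
    if reader.fieldnames != ["ID_cliente", "alta_tdc"]:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems

    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):
        if row["alta_tdc"] not in {"0", "1"}:
            problems.append(f"line {line_no}: alta_tdc must be 0 or 1")
        # Assumption: each customer should appear only once.
        if row["ID_cliente"] in seen_ids:
            problems.append(f"line {line_no}: duplicate ID_cliente")
        seen_ids.add(row["ID_cliente"])
    return problems


# Hypothetical file contents for illustration.
good = "ID_cliente;alta_tdc\n1001;0\n1002;1\n"
bad = "ID_cliente;alta_tdc\n1001;0\n1001;2\n"

print(validate_predictions(good))
print(validate_predictions(bad))
```

For a real file, the text could be read with `open('predicciones.csv', encoding='utf-8').read()` and passed to the same function.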

In [8]:
validation_df = pd.read_csv('Validation_df.csv', delimiter=';', thousands='.', decimal=',')

X_validation = validation_df[selected_cols].copy()
# Apply the same string cast used on the training features, so that the
# one-hot encoder sees categories in the format it was fitted on
if best_cat_cols:
    X_validation[best_cat_cols] = X_validation[best_cat_cols].astype(str)

validation_preds = clf_final_smote.predict(X_validation)

output_df = pd.DataFrame({
    'ID_cliente': validation_df[id_col],
    'alta_tdc': validation_preds
})

output_df.to_csv('predicciones.csv', index=False, sep=';')
print('Predictions file generated: predicciones.csv')

Predictions file generated: predicciones.csv


## Conclusion and Next Steps

**Decisions taken:**

- A base LightGBM model was trained to obtain the feature importances, and the importance threshold was tuned to select the most relevant features.
- The final pipeline balances the classes with SMOTE and uses LightGBM as the classifier.
- The final model is reported with average_precision_score (0.644) and accuracy (0.931); the F1 score for the positive class (0.571) and the confusion matrix are shown as supporting diagnostics.

**Next steps:**

- Explore hyperparameter optimization for LightGBM.
- Evaluate other algorithms (such as XGBoost or CatBoost) and compare them.
- Go deeper into cross-validation and model sensitivity analysis, in particular choosing the importance threshold on data that is not reused for the final evaluation.