## Project: Direct Marketing Campaigns for Bank Term Deposits

## Project Description and Dataset Source

The dataset collects records from marketing campaigns carried out through phone calls to clients.

The goal is to evaluate the acceptance of a term deposit offered by a Portuguese banking institution.

Number of records: 45211

Number of variables: 16, plus 1 output variable.

Dataset path: https://www.kaggle.com/datasets/thedevastator/bank-term-deposit-predictions/

#### Objective:

Carry out the Feature Engineering and Classification stages.

Techniques applied: KNeighbors Classifier, Support Vector Classifier, Gaussian Naive Bayes, Decision Tree Classifier, MLP Classifier.

## I. Feature Engineering

The Feature Engineering process is carried out with two methods:

- Method 1: Functions that apply each transformation separately, so the process can be inspected step by step.
- Method 2: Pipelines, which run all transformations in a single call and are expected to offer shorter execution times and better reliability than Method 1.

In [163]:
# Import the libraries used in this section.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder,
    MinMaxScaler,
    StandardScaler,
)

# Read the dataset into a DataFrame.
df = pd.read_csv('train.csv', sep=',')
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no


In [164]:
df.shape

(45211, 17)

In [165]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


### 1. Findings:

45211 records, 17 variables.

There are no null values.

- 01 - age: age (numeric).
- 02 - job: type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services").
- 03 - marital: marital status (categorical: "married", "divorced", "single").
- 04 - education: education level (categorical: "unknown", "secondary", "primary", "tertiary").
- 05 - default: whether the client has credit in default (binary: "yes", "no").
- 06 - balance: balance of the client's account in euros (numeric).
- 07 - housing: whether the client has a housing loan (binary: "yes", "no").
- 08 - loan: whether the client has a personal loan (binary: "yes", "no").

Variables related to the last contact of the current campaign:
- 09 - contact: type of communication (categorical: "unknown", "telephone", "cellular").
- 10 - day: day of the month of the last contact (numeric).
- 11 - month: month of the last contact (categorical: "jan", "feb", "mar", ..., "nov", "dec").
- 12 - duration: duration of the last contact in seconds (numeric).

Other attributes:
- 13 - campaign: number of contacts made with this client during the current campaign (numeric).
- 14 - pdays: number of days since the client was last contacted in a previous campaign. A value of -1 means the client was not previously contacted (numeric).
- 15 - previous: number of contacts made with this client before the current campaign (numeric).
- 16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown", "other", "failure", "success").

Output variable:
- 17 - y: indicates whether the client subscribed to a term deposit (binary: "yes", "no").
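
Although `df.info()` reports no nulls, several categorical columns carry the string "unknown" where information is missing. A minimal sketch of how that could be counted per column, using a small hypothetical frame (`toy`) with illustrative values rather than the real `train.csv`:

```python
import pandas as pd

# Hypothetical rows shaped like the dataset; the values are illustrative only.
toy = pd.DataFrame({
    'job': ['management', 'unknown', 'technician', 'retired'],
    'education': ['tertiary', 'unknown', 'unknown', 'primary'],
    'contact': ['unknown', 'unknown', 'cellular', 'telephone'],
})

# No real NaN values are present in any column.
null_counts = toy.isna().sum()

# The string "unknown" plays the role of a missing-value placeholder.
unknown_counts = (toy == 'unknown').sum()
print(unknown_counts)
```

The same two lines could be applied to `df` to decide, column by column, whether "unknown" should be kept as its own category or removed.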

### 2. Converting the variables "default", "housing", "loan" and the target "y" to numeric values

In [166]:
# yes: 1; no: 0
df['default'] = df['default'].map(
    {'yes': 1, 'no': 0,}
)
df['housing'] = df['housing'].map(
    {'yes': 1, 'no': 0,}
)
df['loan'] = df['loan'].map(
    {'yes': 1, 'no': 0,}
)
df['y'] = df['y'].map(
    {'yes': 1, 'no': 0,}
)
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,0,2143,1,0,unknown,5,may,261,1,-1,0,unknown,0
1,44,technician,single,secondary,0,29,1,0,unknown,5,may,151,1,-1,0,unknown,0
2,33,entrepreneur,married,secondary,0,2,1,1,unknown,5,may,76,1,-1,0,unknown,0
3,47,blue-collar,married,unknown,0,1506,1,0,unknown,5,may,92,1,-1,0,unknown,0
4,33,unknown,single,unknown,0,1,0,0,unknown,5,may,198,1,-1,0,unknown,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,0,825,0,0,cellular,17,nov,977,3,-1,0,unknown,1
45207,71,retired,divorced,primary,0,1729,0,0,cellular,17,nov,456,2,-1,0,unknown,1
45208,72,retired,married,secondary,0,5715,0,0,cellular,17,nov,1127,5,184,3,success,1
45209,57,blue-collar,married,secondary,0,668,0,0,telephone,17,nov,508,4,-1,0,unknown,0
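
The four mappings above share the same dictionary, so they can also be written as a loop. The sketch below assumes a small hypothetical frame (`toy`) and adds a check that relies on `Series.map` returning NaN for any value missing from the dictionary:

```python
import pandas as pd

# Hypothetical rows; only the four binary columns are included.
toy = pd.DataFrame({
    'default': ['no', 'no', 'yes'],
    'housing': ['yes', 'no', 'yes'],
    'loan': ['no', 'yes', 'no'],
    'y': ['no', 'no', 'yes'],
})

binary_columns = ['default', 'housing', 'loan', 'y']
yes_no = {'yes': 1, 'no': 0}

for col in binary_columns:
    toy[col] = toy[col].map(yes_no)

# Any value other than "yes" or "no" would have become NaN, so a null check
# confirms that every value was mapped.
assert not toy[binary_columns].isna().any().any()
print(toy)
```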


### 3. Removing the "unknown" value from the "education" variable

Justification:
- Rows with the value "unknown" are dropped because it is not considered a valid category.
- The "education" variable will later be transformed with OrdinalEncoder, which needs an ordered list of categories; "unknown" has no natural position in that order.

In [167]:
df = df[df['education'] != 'unknown']
df = df.reset_index(drop=True)
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,0,2143,1,0,unknown,5,may,261,1,-1,0,unknown,0
1,44,technician,single,secondary,0,29,1,0,unknown,5,may,151,1,-1,0,unknown,0
2,33,entrepreneur,married,secondary,0,2,1,1,unknown,5,may,76,1,-1,0,unknown,0
3,35,management,married,tertiary,0,231,1,0,unknown,5,may,139,1,-1,0,unknown,0
4,28,management,single,tertiary,0,447,1,1,unknown,5,may,217,1,-1,0,unknown,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43349,51,technician,married,tertiary,0,825,0,0,cellular,17,nov,977,3,-1,0,unknown,1
43350,71,retired,divorced,primary,0,1729,0,0,cellular,17,nov,456,2,-1,0,unknown,1
43351,72,retired,married,secondary,0,5715,0,0,cellular,17,nov,1127,5,184,3,success,1
43352,57,blue-collar,married,secondary,0,668,0,0,telephone,17,nov,508,4,-1,0,unknown,0


### 4. Transforming the remaining columns with OneHotEncoder, OrdinalEncoder, MinMaxScaler and StandardScaler

Both proposed methods are applied:
- Method 1: Functions.
- Method 2: Pipelines.

#### 4.1. Method 1: Transformations Using Functions

In [168]:
# Helper functions that apply OneHotEncoder, OrdinalEncoder, MinMaxScaler and
# StandardScaler to sections of the DataFrame.

def applyOneHotEncoder(elements, columns):
    """
    Transform a section of the DataFrame using OneHotEncoder.

    :param elements: DataFrame, section of the DataFrame to transform.
    :param columns: list, prefixes for the generated column names (one per
        input column).

    :return: DataFrame with one column per category.
    """
    encoder = OneHotEncoder(sparse_output=False)
    array_onehotencoder = encoder.fit_transform(elements)
    new_col_names = []
    # encoder.categories_[i] holds the categories of the i-th input column.
    for i, col in enumerate(columns):
        new_col_names += [f'{col}_{cat}' for cat in encoder.categories_[i]]

    return pd.DataFrame(array_onehotencoder, columns=new_col_names)


def applyOrdinalEncoder(categories, elements, columns):
    """
    Transform a section of the DataFrame using OrdinalEncoder.

    :param categories: list of lists with the ordered categories per column.
    :param elements: DataFrame, section of the DataFrame to transform.
    :param columns: list, names of the resulting columns.

    :return: DataFrame.
    """
    encoder = OrdinalEncoder(categories=categories)
    array_ordinal = encoder.fit_transform(elements)
    return pd.DataFrame(array_ordinal, columns=columns)


def applyMinMaxScaler(elements, columns):
    """
    Transform a section of the DataFrame using MinMaxScaler.

    :param elements: DataFrame, section of the DataFrame to transform.
    :param columns: list, names of the resulting columns.

    :return: DataFrame.
    """
    scaler = MinMaxScaler()
    array_minmax = scaler.fit_transform(elements)
    return pd.DataFrame(array_minmax, columns=columns)


def applyStandardScaler(elements, columns):
    """
    Transform a section of the DataFrame using StandardScaler.

    :param elements: DataFrame, section of the DataFrame to transform.
    :param columns: list, names of the resulting columns.

    :return: DataFrame.
    """
    scaler = StandardScaler()
    array_standard = scaler.fit_transform(elements)
    return pd.DataFrame(array_standard, columns=columns)

#### 4.1.1. Applying OneHotEncoder to the columns "job", "marital", "contact", "month" and "poutcome"

In [169]:
# Apply OneHotEncoder.
df_ohe = applyOneHotEncoder(
    df[['job', 'marital', 'contact', 'month', 'poutcome']],
    ['job_ohe', 'marital_ohe', 'contact_ohe', 'month_ohe', 'poutcome_ohe']
)
df_ohe

Unnamed: 0,job_ohe_admin.,job_ohe_blue-collar,job_ohe_entrepreneur,job_ohe_housemaid,job_ohe_management,job_ohe_retired,job_ohe_self-employed,job_ohe_services,job_ohe_student,job_ohe_technician,...,month_ohe_jun,month_ohe_mar,month_ohe_may,month_ohe_nov,month_ohe_oct,month_ohe_sep,poutcome_ohe_failure,poutcome_ohe_other,poutcome_ohe_success,poutcome_ohe_unknown
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43349,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
43350,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
43351,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
43352,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0


#### 4.1.2. Applying OrdinalEncoder to the column "education"

In [170]:
# Inspect the categories of the "education" column.
# Categories: primary, secondary, tertiary
df['education'].value_counts()

education
secondary    23202
tertiary     13301
primary       6851
Name: count, dtype: int64

In [171]:
# Apply OrdinalEncoder to the "education" column.
df_oe = applyOrdinalEncoder(
    [['primary', 'secondary', 'tertiary']],
    df[['education']],
    ['education_oe']
)
df_oe

Unnamed: 0,education_oe
0,2.0
1,1.0
2,1.0
3,2.0
4,2.0
...,...
43349,2.0
43350,0.0
43351,1.0
43352,1.0
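
The codes produced by OrdinalEncoder depend on the position of each value in the `categories` list. The sketch below, on a hypothetical three-row frame, compares the list used here with the list that section 4.2.1 passes to its pipeline, which places "unknown" first:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows; "unknown" is absent, as it is after step 3.
toy = pd.DataFrame({'education': ['tertiary', 'primary', 'secondary']})

# Category list used in Method 1: codes start at 0 for "primary".
ordered = OrdinalEncoder(categories=[['primary', 'secondary', 'tertiary']])
codes_ordered = ordered.fit_transform(toy).ravel().tolist()

# Category list used in Method 2: "unknown" takes code 0 even though it never
# appears in the data, so every other code shifts by one.
with_unknown = OrdinalEncoder(
    categories=[['unknown', 'primary', 'secondary', 'tertiary']]
)
codes_with_unknown = with_unknown.fit_transform(toy).ravel().tolist()

print(codes_ordered, codes_with_unknown)
```

The order is preserved in both cases; only the offset differs.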


#### 4.1.3. Applying MinMaxScaler to the columns "age", "day", "pdays", "previous" and "campaign"

Justification: These columns cover a relatively narrow range of values.


In [172]:
# Apply MinMaxScaler.
df_mm = applyMinMaxScaler(
    df[['age', 'day', 'pdays', 'previous', 'campaign']],
    ['age_mm', 'day_mm', 'pdays_mm', 'previous_mm', 'campaign_mm']
)
df_mm

Unnamed: 0,age_mm,day_mm,pdays_mm,previous_mm,campaign_mm
0,0.519481,0.133333,0.000000,0.000000,0.000000
1,0.337662,0.133333,0.000000,0.000000,0.000000
2,0.194805,0.133333,0.000000,0.000000,0.000000
3,0.220779,0.133333,0.000000,0.000000,0.000000
4,0.129870,0.133333,0.000000,0.000000,0.000000
...,...,...,...,...,...
43349,0.428571,0.533333,0.000000,0.000000,0.035088
43350,0.688312,0.533333,0.000000,0.000000,0.017544
43351,0.701299,0.533333,0.212156,0.010909,0.070175
43352,0.506494,0.533333,0.000000,0.000000,0.052632


In [173]:
# Initial minimum and maximum values of the columns "age", "day", "pdays",
# "previous" and "campaign".
for col in ['age', 'day', 'pdays', 'previous', 'campaign']:
    print(f"{col} - Min: {df[col].min()}, Max: {df[col].max()}")

age - Min: 18, Max: 95
day - Min: 1, Max: 31
pdays - Min: -1, Max: 871
previous - Min: 0, Max: 275
campaign - Min: 1, Max: 58


In [174]:
# Minimum and maximum values after the MinMaxScaler transformation.
for col in df_mm.columns:
    print(f"{col} - Min: {df_mm[col].min()}, Max: {df_mm[col].max()}")

age_mm - Min: 0.0, Max: 1.0
day_mm - Min: 0.0, Max: 0.9999999999999999
pdays_mm - Min: 0.0, Max: 1.0
previous_mm - Min: 0.0, Max: 1.0
campaign_mm - Min: 0.0, Max: 1.0
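
The scaled values can be checked by hand with the min-max formula `(x - min) / (max - min)`. The sketch below uses the minimum and maximum values printed above for "age" and "day", together with the first row of the dataset (age 58, day 5); the helper `min_max` is a hypothetical name introduced only for this check.

```python
def min_max(value, minimum, maximum):
    """Rescale a value to [0, 1] given the column minimum and maximum."""
    return (value - minimum) / (maximum - minimum)


# First row of the dataset: age = 58, day = 5.
age_scaled = min_max(58, 18, 95)
day_scaled = min_max(5, 1, 31)

print(round(age_scaled, 6), round(day_scaled, 6))
```

Both results agree with the first row of `df_mm`. The maximum of `day_mm` printed as 0.9999999999999999 is most likely a floating-point rounding effect rather than a scaling error.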


#### 4.1.4. Applying StandardScaler to the columns "balance" and "duration"

Justification: These columns cover a wide range of values.

In [175]:
df_ss = applyStandardScaler(
    df[['balance', 'duration']],
    ['balance_ss', 'duration_ss']
)
df_ss

Unnamed: 0,balance_ss,duration_ss
0,0.259146,0.010854
1,-0.436276,-0.415461
2,-0.445158,-0.706131
3,-0.369826,-0.461968
4,-0.298770,-0.159672
...,...,...
43349,-0.174423,2.785777
43350,0.122957,0.766594
43351,1.434192,3.367116
43352,-0.226070,0.968125


In [176]:
# Initial minimum, maximum and mean of the columns "balance" and "duration".
for col in ['balance', 'duration']:
    print(
        f"{col} - Min: {df[col].min()}, Max: {df[col].max()}, "
        f"Mean: {df[col].mean()}"
    )

balance - Min: -8019, Max: 102127, Mean: 1355.226714951331
duration - Min: 0, Max: 4918, Mean: 258.19945103104675


In [177]:
# Minimum, maximum and mean after the StandardScaler transformation.
for col in df_ss.columns:
    print(
        f"{col} - Min: {df_ss[col].min()}, Max: {df_ss[col].max()}, "
        f"Mean: {df_ss[col].mean()}"
    )

balance_ss - Min: -3.0837471394908507, Max: 33.14989887257092, Mean: -3.933437666245884e-18
duration_ss - Min: -1.000675527043081, Max: 18.059482123741837, Mean: -5.769041910493962e-17
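
StandardScaler subtracts the column mean and divides by the column standard deviation, which is why the means printed above are zero up to floating-point error. A minimal sketch on hypothetical balances (not taken from the dataset) that reproduces the transformation by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical balances in a single column; the values are illustrative only.
balance = np.array([[2143.0], [29.0], [2.0], [1506.0]])

scaled = StandardScaler().fit_transform(balance)

# StandardScaler uses the population standard deviation (ddof=0).
manual = (balance - balance.mean(axis=0)) / balance.std(axis=0, ddof=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

Unlike MinMaxScaler, the result is not bounded, so extreme balances or call durations remain visible as large standardized values.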


#### 4.1.5. Concatenating the DataFrames produced by the function-based transformations

In [178]:
# Concatenate the DataFrames produced by the transformations with the columns
# "default", "housing", "loan" and "y".
# df_ohe: OneHotEncoder DataFrame.
# df_oe: OrdinalEncoder DataFrame.
# df_mm: MinMaxScaler DataFrame.
# df_ss: StandardScaler DataFrame.
df_result_1 = pd.concat(
    [df_ohe, df_oe, df_mm, df_ss, df[['default', 'housing', 'loan', 'y']]],
    axis=1
)
df_result_1

Unnamed: 0,job_ohe_admin.,job_ohe_blue-collar,job_ohe_entrepreneur,job_ohe_housemaid,job_ohe_management,job_ohe_retired,job_ohe_self-employed,job_ohe_services,job_ohe_student,job_ohe_technician,...,day_mm,pdays_mm,previous_mm,campaign_mm,balance_ss,duration_ss,default,housing,loan,y
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.133333,0.000000,0.000000,0.000000,0.259146,0.010854,0,1,0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.133333,0.000000,0.000000,0.000000,-0.436276,-0.415461,0,1,0,0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.133333,0.000000,0.000000,0.000000,-0.445158,-0.706131,0,1,1,0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.133333,0.000000,0.000000,0.000000,-0.369826,-0.461968,0,1,0,0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.133333,0.000000,0.000000,0.000000,-0.298770,-0.159672,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43349,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.533333,0.000000,0.000000,0.035088,-0.174423,2.785777,0,0,0,1
43350,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.533333,0.000000,0.000000,0.017544,0.122957,0.766594,0,0,0,1
43351,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.533333,0.212156,0.010909,0.070175,1.434192,3.367116,0,0,0,1
43352,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.533333,0.000000,0.000000,0.052632,-0.226070,0.968125,0,0,0,0


#### 4.2. Method 2: Transformations Using Pipelines

#### 4.2.1. Definition of the Base Pipelines

In [179]:
# Columns that are passed through without transformation.
columns = ['default', 'housing', 'loan']

# Pipeline for the OneHotEncoder transformation.
columns_ohe = ['job', 'marital', 'contact', 'month', 'poutcome']
one_hot_encoder = OneHotEncoder(sparse_output=False)
pipeline_ohe = Pipeline([('oneHotEncoder', one_hot_encoder)])

# Pipeline for the OrdinalEncoder transformation.
# Note: "unknown" is listed first although those rows were removed in step 3,
# so "education" is encoded as 1, 2, 3 here instead of 0, 1, 2 as in Method 1.
columns_oe = ['education']
ordinal_encoder = OrdinalEncoder(
    categories=[['unknown', 'primary', 'secondary', 'tertiary']]
)
pipeline_oe = Pipeline([('ordinalEncoder', ordinal_encoder)])

# Pipeline for the MinMaxScaler transformation.
columns_mm = ['age', 'day', 'pdays', 'previous', 'campaign']
min_max_scaler = MinMaxScaler()
pipeline_mm = Pipeline([('minMaxScaler', min_max_scaler)])

# Pipeline for the StandardScaler transformation.
columns_ss = ['balance', 'duration']
standard_scaler = StandardScaler()
pipeline_ss = Pipeline([('standardScaler', standard_scaler)])

#### 4.2.2. Definition of the ColumnTransformer and the Main Pipeline

In [180]:
# ColumnTransformer holding every Pipeline to run and its input columns.
pre_process = ColumnTransformer(
    transformers=[
        ('onehot', pipeline_ohe, columns_ohe),
        ('ordinal', pipeline_oe, columns_oe),
        ('minmax', pipeline_mm, columns_mm),
        ('standard', pipeline_ss, columns_ss),
        ('default', 'passthrough', columns),
    ],
    remainder='passthrough'
)

# Main Pipeline, which wraps the ColumnTransformer.
sk_pipeline = Pipeline(steps=[('pre_processing', pre_process)])

# Input DataFrames for the main Pipeline.
# df_X: all columns to be transformed (everything except the target "y").
# df_Y: the classification target.
# "y": indicates whether the client subscribed to a term deposit
#  (binary: 1 = "yes", 0 = "no").
df_X = df.copy().drop(columns=['y'])
df_Y = df['y']

# array_X: array returned by running the main Pipeline.
array_X = sk_pipeline.fit_transform(df_X, df_Y)
array_X

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 1., 1.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

In [181]:
# Number of rows (43354) and columns (45) of the resulting array.
array_X.shape

(43354, 45)

In [182]:
# Display the main pipeline to review its structure.
sk_pipeline

#### 4.2.3. Defining the Column Names for the Resulting DataFrame

In [183]:
# Retrieve the column names generated by OneHotEncoder, since they are the only
# columns whose names differ from the input. The remaining columns keep their
# original names.
onehotencoder_columns = (
    sk_pipeline.named_steps['pre_processing'].named_transformers_['onehot'].
    named_steps['oneHotEncoder'].get_feature_names_out(columns_ohe)
)
onehotencoder_columns

array(['job_admin.', 'job_blue-collar', 'job_entrepreneur',
       'job_housemaid', 'job_management', 'job_retired',
       'job_self-employed', 'job_services', 'job_student',
       'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'contact_cellular', 'contact_telephone', 'contact_unknown',
       'month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan',
       'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov',
       'month_oct', 'month_sep', 'poutcome_failure', 'poutcome_other',
       'poutcome_success', 'poutcome_unknown'], dtype=object)

In [184]:
# Concatenate the column names in the same order in which the inputs were sent
# to the Pipeline.
final_columns = np.concatenate(
    [onehotencoder_columns, columns_oe, columns_mm, columns_ss, columns]
)
final_columns

array(['job_admin.', 'job_blue-collar', 'job_entrepreneur',
       'job_housemaid', 'job_management', 'job_retired',
       'job_self-employed', 'job_services', 'job_student',
       'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'contact_cellular', 'contact_telephone', 'contact_unknown',
       'month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan',
       'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov',
       'month_oct', 'month_sep', 'poutcome_failure', 'poutcome_other',
       'poutcome_success', 'poutcome_unknown', 'education', 'age', 'day',
       'pdays', 'previous', 'campaign', 'balance', 'duration', 'default',
       'housing', 'loan'], dtype=object)

#### 4.2.4. Building the Resulting DataFrame After Applying the Pipeline

The DataFrame is built from the array returned by the main pipeline and the list of all resulting column names.

In [185]:
df_result_2 = pd.DataFrame(array_X, columns=final_columns)
df_result_2

Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,age,day,pdays,previous,campaign,balance,duration,default,housing,loan
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.519481,0.133333,0.000000,0.000000,0.000000,0.259146,0.010854,0.0,1.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.337662,0.133333,0.000000,0.000000,0.000000,-0.436276,-0.415461,0.0,1.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.194805,0.133333,0.000000,0.000000,0.000000,-0.445158,-0.706131,0.0,1.0,1.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.220779,0.133333,0.000000,0.000000,0.000000,-0.369826,-0.461968,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.129870,0.133333,0.000000,0.000000,0.000000,-0.298770,-0.159672,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43349,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.428571,0.533333,0.000000,0.000000,0.035088,-0.174423,2.785777,0.0,0.0,0.0
43350,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.688312,0.533333,0.000000,0.000000,0.017544,0.122957,0.766594,0.0,0.0,0.0
43351,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.701299,0.533333,0.212156,0.010909,0.070175,1.434192,3.367116,0.0,0.0,0.0
43352,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.506494,0.533333,0.000000,0.000000,0.052632,-0.226070,0.968125,0.0,0.0,0.0


In [186]:
# Append the target "y" as the last column of the DataFrame.
# "y": indicates whether the client subscribed to a term deposit
#  (binary: 1 = "yes", 0 = "no").
df_result_2 = pd.concat([df_result_2, df[['y']]], axis=1)
df_result_2

Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,day,pdays,previous,campaign,balance,duration,default,housing,loan,y
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.133333,0.000000,0.000000,0.000000,0.259146,0.010854,0.0,1.0,0.0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.133333,0.000000,0.000000,0.000000,-0.436276,-0.415461,0.0,1.0,0.0,0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.133333,0.000000,0.000000,0.000000,-0.445158,-0.706131,0.0,1.0,1.0,0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.133333,0.000000,0.000000,0.000000,-0.369826,-0.461968,0.0,1.0,0.0,0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.133333,0.000000,0.000000,0.000000,-0.298770,-0.159672,0.0,1.0,1.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43349,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.533333,0.000000,0.000000,0.035088,-0.174423,2.785777,0.0,0.0,0.0,1
43350,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.533333,0.000000,0.000000,0.017544,0.122957,0.766594,0.0,0.0,0.0,1
43351,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.533333,0.212156,0.010909,0.070175,1.434192,3.367116,0.0,0.0,0.0,1
43352,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.533333,0.000000,0.000000,0.052632,-0.226070,0.968125,0.0,0.0,0.0,0


## II. Classification

### 1. Splitting the dataset into 80% of the rows for training and 20% for testing

In [187]:
# Libraries used by the classification methods.
import seaborn as sn
from sklearn.model_selection import KFold, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


# Constants shared by the different methods.
global_n_splits = 5
global_random_state = 42
global_percentage = 100

df_train_eval, df_test = train_test_split(
    df_result_2, train_size=0.8, random_state=global_random_state,
    shuffle=True)
print('df_result_2:', df_result_2.shape)
print('df_train_eval:', df_train_eval.shape)
print('df_test:', df_test.shape)

df_result_2: (43354, 46)
df_train_eval: (34683, 46)
df_test: (8671, 46)
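
The printed shapes are consistent with the 80/20 split: 34683 is 80% of 43354 rounded down, and the remaining 8671 rows form the test set. The sketch below illustrates the same call on a hypothetical ten-row frame (`toy`) and checks two properties of the split: a fixed `random_state` reproduces the same partition, and no row lands in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with ten rows; the values are illustrative only.
toy = pd.DataFrame({'feature': np.arange(10), 'y': [0, 1] * 5})

train_a, test_a = train_test_split(
    toy, train_size=0.8, random_state=42, shuffle=True)
train_b, test_b = train_test_split(
    toy, train_size=0.8, random_state=42, shuffle=True)

# The same seed yields the same partition on every run.
same_split = train_a.index.equals(train_b.index)

# Train and test rows never overlap.
overlap = set(train_a.index) & set(test_a.index)

print(train_a.shape, test_a.shape, same_split)
```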


### 2. Defining a function for each type of classifier

#### 2.1. Nearest Neighbors Classifier

In [188]:
# KNeighborsClassifier: k-nearest neighbors classification algorithm.
# k-nearest neighbors is a supervised method used for classification and
# regression. Its principle is that a data point is classified according to
# the majority vote of its nearest neighbors.
def applyKNeighborsClassifierIteration(
        i, x_train, y_train, x_eval, y_eval, best_model, best_model_index,
        best_model_position):
    """
    Apply KNeighborsClassifier with different values of n_neighbors.

    Train and evaluate several KNeighborsClassifier models with different
    values of n_neighbors and update the best model based on evaluation
    accuracy.

    :param i: Index of the current iteration.
    :param x_train: Training set (features).
    :param y_train: Training set (target).
    :param x_eval: Evaluation set (features).
    :param y_eval: Evaluation set (target).
    :param best_model: Best model trained so far.
    :param best_model_index: Index label of the best model trained so far.
    :param best_model_position: Accuracy of the best model trained so far.

    :return: DataFrame with the evaluation results, best model, index label
        of the best model, accuracy of the best model.
    """
    # Define the models to train and evaluate.
    neighbors = [3, 4, 5, 6]
    models = [KNeighborsClassifier(n_neighbors=n) for n in neighbors]
    indexes = [f'KN{n}{i+1}' for n in neighbors]

    # Train the models.
    for model in models:
        model.fit(x_train, y_train)

    # Evaluate the models (mean accuracy on the evaluation set).
    scores = [model.score(x_eval, y_eval) for model in models]

    # Update the best model.
    for index, model, score in zip(indexes, models, scores):
        if score > best_model_position:
            best_model = model
            best_model_index = index
            best_model_position = score

    # Build one result row per model for this iteration.
    iteration_array_kn = [
        ['KNeighborsClassifier', f'n_neighbors={n}', i + 1, x_train.shape,
         x_eval.shape, f'{score * global_percentage:.4f}%', 0
        ] for n, score in zip(neighbors, scores)
    ]

    # Convert the result rows into an object array.
    results_array_kn = np.array(iteration_array_kn, dtype=object)

    # Build a DataFrame from the array, using the model labels as its index.
    df_result_kn = pd.DataFrame(
        results_array_kn,
        columns=[
            'Método', 'Parámetros', 'Iteración', 'Train Shape', 'Eval Shape',
            'Accuracy', 'Mejor x Método'
        ],
        index=indexes
    )

    # Return the results DataFrame, the best model, the index label of the
    # best model and the accuracy of the best model.
    return df_result_kn, best_model, best_model_index, best_model_position
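
The function takes an iteration index and carries the best model from one call to the next, which suggests it is meant to be called once per cross-validation fold. The sketch below shows that selection logic in compact form. It is an assumption about the intended usage, not the notebook's own driver code: it runs on synthetic data from `make_classification`, and the `KFold` settings are chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training features and target.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
best_model, best_index, best_score = None, None, 0.0
all_scores = []

for i, (train_idx, eval_idx) in enumerate(kfold.split(X)):
    for n in [3, 4, 5, 6]:
        model = KNeighborsClassifier(n_neighbors=n)
        model.fit(X[train_idx], y[train_idx])
        score = model.score(X[eval_idx], y[eval_idx])
        all_scores.append(score)
        # Keep a model only when it strictly improves on the best so far.
        if score > best_score:
            best_model, best_index, best_score = model, f'KN{n}{i+1}', score

print(best_index, f'{best_score * 100:.4f}%')
```

Because the comparison is strict, ties are resolved in favor of the model found first.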

#### 2.2. Support Vector Classifier (SVC)

In [189]:
# Support Vector Classifier (SVC): machine learning classification algorithm
# based on support vector machines (SVM), a supervised learning technique used
# for classification and regression.
def applySVCIteration(
        i, x_train, y_train, x_eval, y_eval, best_model, best_model_index,
        best_model_position):
    """
    Apply SVC (Support Vector Classifier) models with several kernels and
    update the best model based on evaluation accuracy.

    :param i: Index of the current iteration.
    :param x_train: Training set (features).
    :param y_train: Training set (target).
    :param x_eval: Evaluation set (features).
    :param y_eval: Evaluation set (target).
    :param best_model: Best model trained so far.
    :param best_model_index: Index label of the best model trained so far.
    :param best_model_position: Accuracy of the best model trained so far.

    :return: DataFrame with the evaluation results, best model, index label
        of the best model, accuracy of the best model.
    """
    # Define the SVC models with different kernels.
    kernels = ['linear', 'sigmoid', 'poly', 'rbf']
    models = [SVC(kernel=kernel, probability=True) for kernel in kernels]
    indexes = [f'SVC{kernel.upper()[0]}{i+1}' for kernel in kernels]

    # Train the models.
    for model in models:
        model.fit(x_train, y_train)

    # Evaluate the models on the evaluation fold.
    scores = [model.score(x_eval, y_eval) for model in models]

    # Update the best model.
    for index, model, score in zip(indexes, models, scores):
        if score > best_model_position:
            best_model = model
            best_model_index = index
            best_model_position = score

    # Build the result rows of the classifier for this iteration.
    iteration_array_svc = [
        ['SVC', kernel, i + 1, x_train.shape, x_eval.shape,
         f'{score * global_percentage:.4f}%', 0
        ] for kernel, score in zip(kernels, scores)
    ]

    # Convert the result rows into an object array.
    results_array_svc = np.array(iteration_array_svc, dtype=object)

    # Convert the results into a DataFrame using the indexes defined above.
    df_result_svc = pd.DataFrame(
        results_array_svc,
        columns=[
            'Método', 'Parámetros', 'Iteración', 'Train Shape', 'Eval Shape',
            'Accuracy', 'Mejor x Método'
        ],
        index=indexes
    )

    # Return the results DataFrame, the best model, the index of the best
    # model and the accuracy of the best model.
    return df_result_svc, best_model, best_model_index, best_model_position

#### 2.3. Naive Bayes Classifier

In [190]:
# Naive Bayes: classification algorithm based on Bayes' theorem.
# - Gaussian Naive Bayes: assumes that the features follow a normal (Gaussian)
#   distribution. Used with continuous data.
def applyNaiveBayesIteration(
        i, x_train, y_train, x_eval, y_eval, best_model, best_model_index,
        best_model_position):
    """
    Fit the Gaussian Naive Bayes classifier with different smoothing
    (var_smoothing) values in one iteration and return the evaluation
    results.

    :param i: Index of the current iteration.
    :param x_train: Training dataset (features).
    :param y_train: Training dataset (target).
    :param x_eval: Evaluation dataset (features).
    :param y_eval: Evaluation dataset (target).
    :param best_model: Best model trained so far.
    :param best_model_index: Index of the best model trained so far.
    :param best_model_position: Accuracy of the best model trained so far.

    :return: DataFrame with the evaluation results, best model, index of the
        best model, accuracy of the best model.
    """
    # Define the Gaussian Naive Bayes models with different smoothing
    # (var_smoothing) values.
    smoothing_values = [1e-01, 1e-04, 1e-07, 1e-09]
    indexes = [f'NB{i+1}{j}' for j in range(1, 5)]
    models = [
        GaussianNB(var_smoothing=smoothing) for smoothing in smoothing_values]

    # Train and evaluate each model.
    scores = []
    for model in models:
        model.fit(x_train, y_train)
        score = model.score(x_eval, y_eval)
        scores.append(score)

    # Find the best model in this iteration.
    for index, model, score in zip(indexes, models, scores):
        if score > best_model_position:
            best_model = model
            best_model_index = index
            best_model_position = score

    # Build the result rows for this iteration.
    iteration_array_nb = [
        [
            'GaussianNB', f'var_smoothing={smoothing:.0e}', i+1, x_train.shape,
            x_eval.shape, f'{score*global_percentage:.4f}%', 0
        ] for smoothing, score in zip(smoothing_values, scores)
    ]

    # Convert the results into a DataFrame.
    df_result_nb = pd.DataFrame(
        iteration_array_nb,
        columns=[
            'Método', 'Parámetros', 'Iteración', 'Train Shape', 'Eval Shape',
            'Accuracy', 'Mejor x Método'
        ],
        index=indexes
    )

    # Return the resulting DataFrame together with the best model found.
    return df_result_nb, best_model, best_model_index, best_model_position
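
To make the role of `var_smoothing` more concrete: in scikit-learn, `GaussianNB` adds a fraction of the largest feature variance to every per-class variance, and the fitted absolute value is exposed as `epsilon_`. The following is a minimal sketch on synthetic data (the data and the seed are assumptions for illustration, not the bank dataset).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data (an assumption for illustration, not the bank dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 5.0, 0.1])
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Largest variance among the features, computed over the whole training set.
max_variance = X.var(axis=0).max()

epsilons = {}
for smoothing in [1e-01, 1e-04, 1e-07, 1e-09]:
    nb = GaussianNB(var_smoothing=smoothing).fit(X, y)
    # epsilon_ is the absolute value added to every per-class variance.
    epsilons[smoothing] = nb.epsilon_
    print(f'var_smoothing={smoothing:.0e} -> epsilon_={nb.epsilon_:.3e}')
```

Larger values flatten the per-feature Gaussians, which is one plausible reason why `var_smoothing=1e-01` behaves differently from the smaller values in the results tables.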

#### 2.4. Decision Tree Classifier

In [191]:
# Decision Tree Classifier: machine learning model that uses a tree structure
# to make decisions based on the input features.
def applyDecisionTreeClassifierIteration(
        i, x_train, y_train, x_eval, y_eval, best_model, best_model_index,
        best_model_position):
    """
    Fit the Decision Tree classifier with different min_samples_leaf values
    in one iteration and return the evaluation results.

    :param i: Index of the current iteration.
    :param x_train: Training dataset (features).
    :param y_train: Training dataset (target).
    :param x_eval: Evaluation dataset (features).
    :param y_eval: Evaluation dataset (target).
    :param best_model: Best model trained so far.
    :param best_model_index: Index of the best model trained so far.
    :param best_model_position: Accuracy of the best model trained so far.

    :return: DataFrame with the evaluation results, best model, index of the
        best model, accuracy of the best model.
    """
    # Define the Decision Tree models with different min_samples_leaf
    # settings.
    min_samples_leaf_values = [1, 20, 50, 70, 100]
    indexes = [f'DTC{value}-{i+1}' for value in min_samples_leaf_values]
    models = [
        DecisionTreeClassifier(
            min_samples_leaf=value, random_state=global_random_state
        ) for value in min_samples_leaf_values
    ]

    # Train and evaluate each model.
    scores = []
    for model in models:
        model.fit(x_train, y_train)
        score = model.score(x_eval, y_eval)
        scores.append(score)

    # Find the best model in this iteration.
    for index, model, score in zip(indexes, models, scores):
        if score > best_model_position:
            best_model = model
            best_model_index = index
            best_model_position = score

    # Build the result rows for this iteration.
    iteration_array_dt = [
        [
            'DecisionTreeClassifier', f'min_samples_leaf={value}', i+1,
            x_train.shape, x_eval.shape, f"{score*global_percentage:.4f}%", 0
        ] for value, score in zip(min_samples_leaf_values, scores)
    ]

    # Convert the results into a DataFrame.
    df_result_dt = pd.DataFrame(
        iteration_array_dt,
        columns=[
            'Método', 'Parámetros', 'Iteración', 'Train Shape', 'Eval Shape',
            'Accuracy', 'Mejor x Método'
        ],
        index=indexes
    )

    # Return the resulting DataFrame together with the best model found.
    return df_result_dt, best_model, best_model_index, best_model_position
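
As a rough illustration of what `min_samples_leaf` controls, the sketch below fits the same five settings on synthetic data (an assumption for illustration, not the bank dataset) and reports the number of leaves and the size of the smallest leaf of each tree. Larger values force coarser trees, which usually reduces overfitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration).
X, y = make_classification(
    n_samples=2000, n_features=10, flip_y=0.1, random_state=0)

leaf_stats = {}
for value in [1, 20, 50, 70, 100]:
    tree = DecisionTreeClassifier(min_samples_leaf=value, random_state=0)
    tree.fit(X, y)
    # Leaves are the nodes without children (children_left == -1).
    is_leaf = tree.tree_.children_left == -1
    smallest_leaf = int(tree.tree_.n_node_samples[is_leaf].min())
    leaf_stats[value] = (tree.get_n_leaves(), smallest_leaf)
    print(f'min_samples_leaf={value}: leaves={tree.get_n_leaves()}, '
          f'smallest leaf={smallest_leaf}')
```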

#### 2.5. Neural Network Classifier - MLP (Multi-layer Perceptron)

In [192]:
# MLPClassifier: multi-layer perceptron (MLP) neural network classifier used
# for supervised classification.
def applyMLPClassifierIteration(
        i, x_train, y_train, x_eval, y_eval, best_model, best_model_index,
        best_model_position):
    """
    Fit different configurations of MLPClassifier and evaluate their
    performance.

    This function trains four variants of MLPClassifier with different
    activation functions and evaluates them on an evaluation dataset. The
    model with the best score is selected and a DataFrame with the results of
    each configuration is returned.

    :param i: Number of the current iteration.
    :param x_train: Training features.
    :param y_train: Training labels.
    :param x_eval: Evaluation features.
    :param y_eval: Evaluation labels.
    :param best_model: Best model trained so far.
    :param best_model_index: Index of the best model trained so far.
    :param best_model_position: Score of the best model trained so far.

    :return: DataFrame with the evaluation results, best model, index of the
        best model, accuracy of the best model.
    """
    # Define the models with different activation functions.
    # Note: hidden_layer_sizes=(45, 1) defines two hidden layers, with 45 and
    # 1 units respectively.
    models = {
        'identity': MLPClassifier(
            hidden_layer_sizes=(45, 1), activation='identity', alpha=1,
            random_state=0
        ),
        'logistic': MLPClassifier(
            hidden_layer_sizes=(45, 1), activation='logistic', alpha=1,
            random_state=0
        ),
        'tanh': MLPClassifier(
            hidden_layer_sizes=(45, 1), activation='tanh', alpha=1,
            random_state=0
        ),
        'relu': MLPClassifier(
            hidden_layer_sizes=(45, 1), activation='relu', alpha=1,
            random_state=0
        )
    }
    
    # Initialize the lists that store the results.
    scores = []
    indexes = []
    
    # Train and evaluate each model.
    for activation, model in models.items():
        index = f'MLP{activation[0].upper()}{i+1}'
        model.fit(x_train, y_train)
        score = model.score(x_eval, y_eval)
        
        # Keep the best model.
        if score > best_model_position:
            best_model = model
            best_model_index = index
            best_model_position = score
        
        # Store the results of each model.
        scores.append(score)
        indexes.append(index)
    
    # Build the result rows for this iteration.
    iteration_array_mlp = [
        [
            'MLPClassifier', f'activation={act}', i+1, x_train.shape,
            x_eval.shape, f'{score*global_percentage:.4f}%', 0
        ] for act, score in zip(models.keys(), scores)
    ]
    
    # Convert the rows into a DataFrame.
    df_result_mlp = pd.DataFrame(
        iteration_array_mlp,
        columns=[
            'Método', 'Parámetros', 'Iteración', 'Train Shape', 'Eval Shape',
            'Accuracy', 'Mejor x Método'
        ],
        index=indexes
    )

    # Return the results DataFrame, the best model, its index and its score.
    return df_result_mlp, best_model, best_model_index, best_model_position
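
The five `apply...Iteration` functions above share the same structure: fit each candidate, score it on the evaluation fold, and keep track of the best one. One possible way to factor that out is sketched below. The helper name `evaluate_candidates`, its signature and the synthetic data are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def evaluate_candidates(candidates, x_train, y_train, x_eval, y_eval,
                        best=(None, '', 0.0)):
    """Fit each (index, params, model) candidate and track the best one.

    Hypothetical helper: `best` is a (model, index, score) tuple carried over
    from previous folds.
    """
    rows, indexes = [], []
    for index, params, model in candidates:
        model.fit(x_train, y_train)
        score = model.score(x_eval, y_eval)
        # Replace the running best only when the new score is strictly higher.
        if score > best[2]:
            best = (model, index, score)
        rows.append([type(model).__name__, params, f'{score * 100:.4f}%'])
        indexes.append(index)
    df_result = pd.DataFrame(
        rows, columns=['Método', 'Parámetros', 'Accuracy'], index=indexes)
    return df_result, best


# Synthetic data (an assumption for illustration, not the bank dataset).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
x_train, x_eval, y_train, y_eval = train_test_split(X, y, random_state=0)

candidates = [
    (f'KN{k}1', f'n_neighbors={k}', KNeighborsClassifier(n_neighbors=k))
    for k in [3, 4, 5, 6]
]
df_result, (best_model, best_index, best_score) = evaluate_candidates(
    candidates, x_train, y_train, x_eval, y_eval)
print(df_result)
print(best_index, best_score)
```

With a helper of this kind, each classifier family would only need to supply its list of candidates, and the bookkeeping would live in one place.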

### 3. Split the training dataset using cross-validation

In [193]:
# KFold: used to perform cross-validation by splitting the dataset into
# several parts ("folds").

# Define the KFold object.
kf = KFold(
    n_splits=global_n_splits, random_state=global_random_state, shuffle=True)

# Initialize the variables.
best_model_position_kn = 0
best_model_position_svc = 0
best_model_position_nb = 0
best_model_position_dt = 0
best_model_position_mlp = 0
best_model_index_kn = ""
best_model_index_svc = ""
best_model_index_nb = ""
best_model_index_dt = ""
best_model_index_mlp = ""
best_model_kn = None
best_model_svc = None
best_model_nb = None
best_model_dt = None
best_model_mlp = None
df_result_kn = None
df_result_svc = None
df_result_nb = None
df_result_dt = None
df_result_mlp = None

for i, (train_index, eval_index) in enumerate(kf.split(df_train_eval)):
    # Split the dataset into training and evaluation sets.
    df_train = df_train_eval.iloc[train_index]
    df_eval = df_train_eval.iloc[eval_index]
    
    # Get the "X" and "Y" sets for training and evaluation.
    x_train = df_train.drop(columns=['y'])
    y_train = df_train['y']
    x_eval = df_eval.drop(columns=['y'])
    y_eval = df_eval['y']
    
    # Apply KNeighborsClassifier.
    (df_result_kn_f, best_model_kn, best_model_index_kn,
     best_model_position_kn) = applyKNeighborsClassifierIteration(
        i, x_train, y_train, x_eval, y_eval, best_model_kn,
        best_model_index_kn, best_model_position_kn)

    # Concatenate the results.
    if i == 0:
        df_result_kn = df_result_kn_f
    else:
        df_result_kn = pd.concat([df_result_kn, df_result_kn_f])
    
    # Apply SVC.
    (df_result_svc_f, best_model_svc, best_model_index_svc,
     best_model_position_svc) = applySVCIteration(
        i, x_train, y_train, x_eval, y_eval, best_model_svc,
        best_model_index_svc, best_model_position_svc)

    # Concatenate the results.
    if i == 0:
        df_result_svc = df_result_svc_f
    else:
        df_result_svc = pd.concat([df_result_svc, df_result_svc_f])
    
    # Apply GaussianNB.
    (df_result_nb_f, best_model_nb, best_model_index_nb,
     best_model_position_nb) = applyNaiveBayesIteration(
        i, x_train, y_train, x_eval, y_eval, best_model_nb,
        best_model_index_nb, best_model_position_nb)

    # Concatenate the results.
    if i == 0:
        df_result_nb = df_result_nb_f
    else:
        df_result_nb = pd.concat([df_result_nb, df_result_nb_f])
    
    # Apply DecisionTreeClassifier.
    (df_result_dt_f, best_model_dt, best_model_index_dt,
     best_model_position_dt) = applyDecisionTreeClassifierIteration(
        i, x_train, y_train, x_eval, y_eval, best_model_dt,
        best_model_index_dt, best_model_position_dt)

    # Concatenate the results.
    if i == 0:
        df_result_dt = df_result_dt_f
    else:
        df_result_dt = pd.concat([df_result_dt, df_result_dt_f])
    
    # Apply MLPClassifier.
    (df_result_mlp_f, best_model_mlp, best_model_index_mlp,
     best_model_position_mlp) = applyMLPClassifierIteration(
        i, x_train, y_train, x_eval, y_eval, best_model_mlp,
        best_model_index_mlp, best_model_position_mlp)

    # Concatenate the results.
    if i == 0:
        df_result_mlp = df_result_mlp_f
    else:
        df_result_mlp = pd.concat([df_result_mlp, df_result_mlp_f])

# Mark the best models in the results.
df_result_kn.loc[best_model_index_kn, 'Mejor x Método'] = 1
df_result_svc.loc[best_model_index_svc, 'Mejor x Método'] = 1
df_result_nb.loc[best_model_index_nb, 'Mejor x Método'] = 1
df_result_dt.loc[best_model_index_dt, 'Mejor x Método'] = 1
df_result_mlp.loc[best_model_index_mlp, 'Mejor x Método'] = 1
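
The loop above keeps, for each method, the single model with the highest score on any one fold. A complementary view that is often used is the mean and standard deviation of the accuracy across folds for each configuration. The sketch below shows this with `cross_val_score`; the synthetic data, `n_splits=5` and `random_state=0` are assumptions for illustration and do not come from the notebook's global settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration, not the bank dataset).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

summary = {}
for value in [1, 20, 50]:
    model = DecisionTreeClassifier(min_samples_leaf=value, random_state=0)
    # One accuracy value per fold; the estimator is cloned and refit each time.
    scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
    summary[value] = (scores.mean(), scores.std(), scores.max())
    print(f'min_samples_leaf={value}: mean={scores.mean():.4f}, '
          f'std={scores.std():.4f}, best fold={scores.max():.4f}')
```

The best fold is never below the mean, so comparing configurations by their mean gives a more conservative ranking than comparing them by their best fold.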

In [194]:
# Show the classifier results for "KNeighborsClassifier".
df_result_kn

Unnamed: 0,Método,Parámetros,Iteración,Train Shape,Eval Shape,Accuracy,Mejor x Método
KN31,KNeighborsClassifier,n_neighbors=3,1,"(27746, 45)","(6937, 45)",89.1452%,0
KN41,KNeighborsClassifier,n_neighbors=4,1,"(27746, 45)","(6937, 45)",89.7650%,0
KN51,KNeighborsClassifier,n_neighbors=5,1,"(27746, 45)","(6937, 45)",89.4623%,0
KN61,KNeighborsClassifier,n_neighbors=6,1,"(27746, 45)","(6937, 45)",89.7650%,0
KN32,KNeighborsClassifier,n_neighbors=3,2,"(27746, 45)","(6937, 45)",88.9722%,0
KN42,KNeighborsClassifier,n_neighbors=4,2,"(27746, 45)","(6937, 45)",89.3326%,0
KN52,KNeighborsClassifier,n_neighbors=5,2,"(27746, 45)","(6937, 45)",89.3470%,0
KN62,KNeighborsClassifier,n_neighbors=6,2,"(27746, 45)","(6937, 45)",89.5632%,0
KN33,KNeighborsClassifier,n_neighbors=3,3,"(27746, 45)","(6937, 45)",89.4335%,0
KN43,KNeighborsClassifier,n_neighbors=4,3,"(27746, 45)","(6937, 45)",89.7939%,0


In [195]:
# Show the classifier results for "SVC".
df_result_svc

Unnamed: 0,Método,Parámetros,Iteración,Train Shape,Eval Shape,Accuracy,Mejor x Método
SVCL1,SVC,linear,1,"(27746, 45)","(6937, 45)",89.3902%,0
SVCS1,SVC,sigmoid,1,"(27746, 45)","(6937, 45)",84.8638%,0
SVCP1,SVC,poly,1,"(27746, 45)","(6937, 45)",90.4570%,0
SVCR1,SVC,rbf,1,"(27746, 45)","(6937, 45)",90.5723%,0
SVCL2,SVC,linear,2,"(27746, 45)","(6937, 45)",88.9722%,0
SVCS2,SVC,sigmoid,2,"(27746, 45)","(6937, 45)",84.4601%,0
SVCP2,SVC,poly,2,"(27746, 45)","(6937, 45)",89.7074%,0
SVCR2,SVC,rbf,2,"(27746, 45)","(6937, 45)",89.9813%,0
SVCL3,SVC,linear,3,"(27746, 45)","(6937, 45)",89.1163%,0
SVCS3,SVC,sigmoid,3,"(27746, 45)","(6937, 45)",84.4313%,0


In [196]:
# Show the classifier results for "GaussianNB".
df_result_nb

Unnamed: 0,Método,Parámetros,Iteración,Train Shape,Eval Shape,Accuracy,Mejor x Método
NB11,GaussianNB,var_smoothing=1e-01,1,"(27746, 45)","(6937, 45)",89.6353%,0
NB12,GaussianNB,var_smoothing=1e-04,1,"(27746, 45)","(6937, 45)",85.7431%,0
NB13,GaussianNB,var_smoothing=1e-07,1,"(27746, 45)","(6937, 45)",85.0800%,0
NB14,GaussianNB,var_smoothing=1e-09,1,"(27746, 45)","(6937, 45)",85.0800%,0
NB21,GaussianNB,var_smoothing=1e-01,2,"(27746, 45)","(6937, 45)",88.7848%,0
NB22,GaussianNB,var_smoothing=1e-04,2,"(27746, 45)","(6937, 45)",85.6134%,0
NB23,GaussianNB,var_smoothing=1e-07,2,"(27746, 45)","(6937, 45)",85.5413%,0
NB24,GaussianNB,var_smoothing=1e-09,2,"(27746, 45)","(6937, 45)",85.5413%,0
NB31,GaussianNB,var_smoothing=1e-01,3,"(27746, 45)","(6937, 45)",89.7650%,0
NB32,GaussianNB,var_smoothing=1e-04,3,"(27746, 45)","(6937, 45)",86.3197%,0


In [197]:
# Show the classifier results for "DecisionTreeClassifier".
df_result_dt

Unnamed: 0,Método,Parámetros,Iteración,Train Shape,Eval Shape,Accuracy,Mejor x Método
DTC1-1,DecisionTreeClassifier,min_samples_leaf=1,1,"(27746, 45)","(6937, 45)",87.0405%,0
DTC20-1,DecisionTreeClassifier,min_samples_leaf=20,1,"(27746, 45)","(6937, 45)",89.5632%,0
DTC50-1,DecisionTreeClassifier,min_samples_leaf=50,1,"(27746, 45)","(6937, 45)",90.2263%,0
DTC70-1,DecisionTreeClassifier,min_samples_leaf=70,1,"(27746, 45)","(6937, 45)",90.4426%,0
DTC100-1,DecisionTreeClassifier,min_samples_leaf=100,1,"(27746, 45)","(6937, 45)",89.8515%,0
DTC1-2,DecisionTreeClassifier,min_samples_leaf=1,2,"(27746, 45)","(6937, 45)",87.4874%,0
DTC20-2,DecisionTreeClassifier,min_samples_leaf=20,2,"(27746, 45)","(6937, 45)",89.4335%,0
DTC50-2,DecisionTreeClassifier,min_samples_leaf=50,2,"(27746, 45)","(6937, 45)",89.8948%,0
DTC70-2,DecisionTreeClassifier,min_samples_leaf=70,2,"(27746, 45)","(6937, 45)",89.9236%,0
DTC100-2,DecisionTreeClassifier,min_samples_leaf=100,2,"(27746, 45)","(6937, 45)",89.6497%,0


In [198]:
# Show the classifier results for "MLPClassifier".
df_result_mlp

Unnamed: 0,Método,Parámetros,Iteración,Train Shape,Eval Shape,Accuracy,Mejor x Método
MLPI1,MLPClassifier,activation=identity,1,"(27746, 45)","(6937, 45)",90.1831%,0
MLPL1,MLPClassifier,activation=logistic,1,"(27746, 45)","(6937, 45)",88.6262%,0
MLPT1,MLPClassifier,activation=tanh,1,"(27746, 45)","(6937, 45)",90.1110%,0
MLPR1,MLPClassifier,activation=relu,1,"(27746, 45)","(6937, 45)",88.6262%,0
MLPI2,MLPClassifier,activation=identity,2,"(27746, 45)","(6937, 45)",89.6641%,0
MLPL2,MLPClassifier,activation=logistic,2,"(27746, 45)","(6937, 45)",88.0784%,0
MLPT2,MLPClassifier,activation=tanh,2,"(27746, 45)","(6937, 45)",89.7794%,0
MLPR2,MLPClassifier,activation=relu,2,"(27746, 45)","(6937, 45)",88.0784%,0
MLPI3,MLPClassifier,activation=identity,3,"(27746, 45)","(6937, 45)",90.3561%,0
MLPL3,MLPClassifier,activation=logistic,3,"(27746, 45)","(6937, 45)",88.2658%,0


In [199]:
# Get the best classifier of each method.
best_classifier_kn = df_result_kn.loc[best_model_index_kn]
best_classifier_svc = df_result_svc.loc[best_model_index_svc]
best_classifier_nb = df_result_nb.loc[best_model_index_nb]
best_classifier_dt = df_result_dt.loc[best_model_index_dt]
best_classifier_mlp = df_result_mlp.loc[best_model_index_mlp]

print(
    f"Mejor clasificador 'KNeighborsClassifier': Identificador = "
    f"{best_model_index_kn}, Accuracy = {best_classifier_kn['Accuracy']}"
)
print(
    f"Mejor clasificador 'SVC': Identificador = {best_model_index_svc}, "
    f"Accuracy = {best_classifier_svc['Accuracy']}"
)
print(
    f"Mejor clasificador 'GaussianNB': Identificador = {best_model_index_nb}, "
    f"Accuracy = {best_classifier_nb['Accuracy']}"
)
print(
    f"Mejor clasificador 'DecisionTreeClassifier': Identificador = "
    f"{best_model_index_dt}, Accuracy = {best_classifier_dt['Accuracy']}"
)
print(
    f"Mejor clasificador 'MLPClassifier': Identificador = {best_model_index_mlp}, "
    f"Accuracy = {best_classifier_mlp['Accuracy']}"
)

Mejor clasificador 'KNeighborsClassifier': Identificador = KN64, Accuracy = 90.2105%
Mejor clasificador 'SVC': Identificador = SVCP4, Accuracy = 91.0611%
Mejor clasificador 'GaussianNB': Identificador = NB41, Accuracy = 90.1240%
Mejor clasificador 'DecisionTreeClassifier': Identificador = DTC70-3, Accuracy = 90.7164%
Mejor clasificador 'MLPClassifier': Identificador = MLPI4, Accuracy = 90.6719%


### 4. Use the best classifier of each method to predict the test dataset

In [200]:
# Get the "X" and "Y" datasets for testing.
x_test = df_test.drop(columns=['y'])
y_test = df_test['y']

#### 4.1. Predictions on the test dataset for the K-Neighbors, SVC, Naive Bayes, DecisionTreeClassifier and MLPClassifier classifiers

In [201]:
# Predict the target "y" using the best classifier of each method.
y_test_predict_kn = best_model_kn.predict(x_test)
y_test_predict_svc = best_model_svc.predict(x_test)
y_test_predict_nb = best_model_nb.predict(x_test)
y_test_predict_dt = best_model_dt.predict(x_test)
y_test_predict_mlp = best_model_mlp.predict(x_test)

# Predict the probability of the positive class of the target "y" using the
# best classifier of each method.
y_test_predict_proba_kn = best_model_kn.predict_proba(x_test)[:, 1]
y_test_predict_proba_svc = best_model_svc.predict_proba(x_test)[:, 1]
y_test_predict_proba_nb = best_model_nb.predict_proba(x_test)[:, 1]
y_test_predict_proba_dt = best_model_dt.predict_proba(x_test)[:, 1]
y_test_predict_proba_mlp = best_model_mlp.predict_proba(x_test)[:, 1]

# Get the score (accuracy) on the test set using the best classifier.
y_test_score_kn = best_model_kn.score(x_test, y_test)
y_test_score_svc = best_model_svc.score(x_test, y_test)
y_test_score_nb = best_model_nb.score(x_test, y_test)
y_test_score_dt = best_model_dt.score(x_test, y_test)
y_test_score_mlp = best_model_mlp.score(x_test, y_test)

#### 4.2. Show the best classifier of each method and its results on the test dataset

In [202]:
classifiers_array = [
    [
        best_classifier_kn.Método, best_classifier_kn.Parámetros,
        best_classifier_kn.Iteración, df_test.shape, best_classifier_kn.Accuracy,
        f'{y_test_score_kn*global_percentage:.4f}%', 0
    ], [
        best_classifier_svc.Método, best_classifier_svc.Parámetros,
        best_classifier_svc.Iteración, df_test.shape,
        best_classifier_svc.Accuracy,
        f'{y_test_score_svc*global_percentage:.4f}%', 0
    ], [
        best_classifier_nb.Método, best_classifier_nb.Parámetros,
        best_classifier_nb.Iteración, df_test.shape,
        best_classifier_nb.Accuracy,
        f'{y_test_score_nb*global_percentage:.4f}%', 0
    ], [
        best_classifier_dt.Método, best_classifier_dt.Parámetros,
        best_classifier_dt.Iteración, df_test.shape,
        best_classifier_dt.Accuracy,
        f'{y_test_score_dt*global_percentage:.4f}%', 0
    ], [
        best_classifier_mlp.Método, best_classifier_mlp.Parámetros,
        best_classifier_mlp.Iteración, df_test.shape,
        best_classifier_mlp.Accuracy,
        f'{y_test_score_mlp*global_percentage:.4f}%', 0
    ]
]
classifiers_indexes = [
    best_model_index_kn, best_model_index_svc, best_model_index_nb,
    best_model_index_dt, best_model_index_mlp
]
classifiers_array_final = np.array(classifiers_array, dtype=object)
df_result_testing = pd.DataFrame(
    classifiers_array_final,
    columns=[
        'Método', 'Parámetros', 'Iteración', 'Test Shape', 'Train Accuracy',
        'Test Accuracy', 'Mejor'
    ],
    index=classifiers_indexes
)

# Dictionary that stores the test scores and the indexes of the models.
test_scores = {
    'kn': (y_test_score_kn, best_model_index_kn),
    'svc': (y_test_score_svc, best_model_index_svc),
    'nb': (y_test_score_nb, best_model_index_nb),
    'dt': (y_test_score_dt, best_model_index_dt),
    'mlp': (y_test_score_mlp, best_model_index_mlp)
}

# Find the best score and the corresponding index.
best_test_score, best_test_model_index = max(
    test_scores.values(), key=lambda x: x[0])

# Mark the best model in the DataFrame.
df_result_testing.loc[best_test_model_index, 'Mejor'] = 1

# Show the resulting DataFrame.
df_result_testing

Unnamed: 0,Método,Parámetros,Iteración,Test Shape,Train Accuracy,Test Accuracy,Mejor
KN64,KNeighborsClassifier,n_neighbors=6,4,"(8671, 46)",90.2105%,89.7128%,0
SVCP4,SVC,poly,4,"(8671, 46)",91.0611%,90.5432%,1
NB41,GaussianNB,var_smoothing=1e-01,4,"(8671, 46)",90.1240%,89.6782%,0
DTC70-3,DecisionTreeClassifier,min_samples_leaf=70,3,"(8671, 46)",90.7164%,90.1857%,0
MLPI4,MLPClassifier,activation=identity,4,"(8671, 46)",90.6719%,90.0012%,0


### 5. Conclusions about the classification models used

- The classifier with the best accuracy on the test set is "SVC" (kernel="poly"), with 90.5432%.

- The purpose of having a classifier is to predict or estimate the effort needed to reach the objectives of a future marketing campaign.

- It is important to choose the classifier with the best accuracy, since it provides greater confidence when predicting the success of marketing campaigns.

Future improvements to the project:
- Generate heat maps of the confusion matrices of the analyzed classifiers, in order to evaluate the correctness of the classification models and visualize their performance.
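
As a starting point for that improvement, a confusion-matrix heat map could be drawn along the following lines. This is only a sketch: the toy labels are an assumption for illustration, and in the notebook they would be replaced by `y_test` and one of the `y_test_predict_*` arrays.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (an assumption for illustration, not results from the notebook).
y_true = np.array(
    ['no', 'no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'yes'])
y_pred = np.array(
    ['no', 'no', 'no', 'no', 'no', 'yes', 'no', 'no', 'yes', 'yes'])

# Rows are the true labels and columns the predicted labels.
labels = ['no', 'yes']
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Draw the matrix as a heat map and annotate each cell with its count.
fig, ax = plt.subplots(figsize=(4, 4))
image = ax.imshow(cm, cmap='Blues')
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
ax.set_xlabel('Predicted label')
ax.set_ylabel('True label')
for row in range(cm.shape[0]):
    for col in range(cm.shape[1]):
        ax.text(col, row, str(cm[row, col]), ha='center', va='center')
fig.colorbar(image, ax=ax)
fig.tight_layout()
```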

## III. Development with a pretrained Hugging Face model: bert-base-uncased

1. Encoder model selection

   To be explained in the attached documentation:

- Research and comparison of different encoder-type model options

  BERT (Bidirectional Encoder Representations from Transformers):
  Transformer-based language model that is trained bidirectionally.
  Its main advantage is natural language understanding; it is used in tasks such as text classification and question answering.
  Its disadvantage is that it requires substantial computational resources for training.

  Word2Vec:
  Model that learns vector representations of words.
  Its advantage is that it uses fewer computational resources than BERT.
  Its disadvantage is that it does not capture bidirectional context.

  GloVe (Global Vectors for Word Representation):
  Embedding model based on co-occurrence matrices that represents semantic relationships.
  Its advantage lies in its semantic representations.
  Its disadvantage is that it does not capture bidirectional context.

  TF-IDF (Term Frequency-Inverse Document Frequency):
  Statistically evaluates the importance of a word in a document relative to the corpus.
  Its advantage is that it is computationally efficient and easy to interpret.
  Its disadvantage is that it captures neither semantic relationships nor bidirectional context.


- Evaluation of the project requirements to select the most suitable model

  Project requirements:
  The goal is to maximize the accuracy of the predictive model while remaining computationally efficient.
  The model must work with categorical and numerical data.
  The model must be interpretable, i.e. it must be possible to understand how the different features contribute to the prediction.
  The model must be scalable, i.e. able to handle large volumes of data.

  Requirements analysis:
  BERT is suited to language understanding tasks, effective for textual data, and can be adapted to handle categorical variables.
  BERT offers some interpretability, i.e. which parts of the text influence the predictions most.

- Clear justification of the choice of encoder model

  The selected model is "bert-base-uncased", because:
  It provides bidirectional contextual text representations, which offer good accuracy in classification and prediction tasks.
  BERT can be adapted to encode categorical variables through embedding techniques.
  BERT captures bidirectional context, which helps to understand complex relationships in the data.
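
To make the TF-IDF baseline from the comparison above more concrete, the sketch below computes the inverse document frequency of a few terms with scikit-learn's `TfidfVectorizer`. The three-sentence corpus is an assumption made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (an assumption for illustration).
corpus = [
    'term deposit offer by phone',
    'term deposit accepted',
    'term deposit rejected after phone call',
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Map each term to its inverse document frequency: terms that appear in
# fewer documents get a higher weight.
idf = {
    term: vectorizer.idf_[column]
    for term, column in vectorizer.vocabulary_.items()
}
for term in ['term', 'phone', 'accepted']:
    print(f'{term}: idf={idf[term]:.4f}')
```

Each document becomes a sparse vector with one weight per vocabulary term, with no notion of word order or context, which is the limitation noted in the comparison.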



2. Implementation:

   Description:

   - Data loading: the dataset is loaded from a CSV file.
   - Preprocessing of categorical columns: the categorical columns are transformed into embeddings using bert-base-uncased.
   - Preprocessing of numerical columns: missing values are imputed and the numerical columns are standardized.
   - Classification pipeline: a pipeline is defined that includes the preprocessing and a RandomForest classifier.
   - Dataset split: the dataset is split into training and test sets.
   - Training and evaluation: the model is trained and evaluated using classification metrics.

In [2]:
# Import the required libraries.
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModel
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import torch


# Load the Hugging Face tokenizer and pretrained model used to convert texts
# into embeddings.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Define a function that converts texts into embeddings.
def text_to_embeddings(texts):
    # Accept a Series, ndarray or list and normalize it to a list of strings,
    # which is the input type the tokenizer expects.
    if isinstance(texts, (pd.Series, np.ndarray)):
        texts = texts.tolist()
    texts = [str(text) for text in texts]
    inputs = tokenizer(
        texts, return_tensors='pt', padding=True, truncation=True,
        max_length=128)
    # Disable gradient tracking: the model is only used for inference.
    with torch.no_grad():
        outputs = model(**inputs)
    # Alternative: mean-pool the token embeddings instead of using the pooled
    # [CLS] output.
    # return outputs.last_hidden_state.mean(dim=1).numpy()
    return outputs.pooler_output.numpy()

# Load the dataset.
df = pd.read_csv('train.csv')

# Define the categorical and numerical columns.
categorical_cols = [
    'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
    'month', 'poutcome']
numerical_cols = [
    'age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']

# Function that applies text_to_embeddings to each column of a DataFrame.
def transform_categorical_to_embeddings(df, cols):
    return np.hstack([text_to_embeddings(df[col]) for col in cols])

# Preprocessing of the categorical columns using text embeddings.
categorical_transformer = FunctionTransformer(
    lambda x: transform_categorical_to_embeddings(
        pd.DataFrame(x, columns=categorical_cols), categorical_cols
    ), validate=False)

# Preprocessing of the numerical columns.
# strategy='median': missing values in the numerical columns are replaced
# with the column median.
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Combine the preprocessors in a ColumnTransformer.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Create a classification pipeline.
# - n_estimators: 100, hyperparameter of the RandomForest classifier that sets
#   the number of decision trees to 100. A larger number of trees can improve
#   the accuracy of the model.
# - random_state: 42, hyperparameter that fixes the seed of the random number
#   generator.
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Split the data into training and test sets.
# 20% of the data is used to evaluate the model and 80% for training.
X = df.drop('y', axis=1)
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train the pipeline.
pipeline.fit(X_train, y_train)

# Make predictions.
y_pred = pipeline.predict(X_test)

# Evaluate the model.
report = classification_report(y_test, y_pred)
print(report)



              precision    recall  f1-score   support

          no       0.91      0.97      0.94      7952
         yes       0.56      0.30      0.39      1091

    accuracy                           0.89      9043
   macro avg       0.73      0.63      0.66      9043
weighted avg       0.87      0.89      0.87      9043



#### Interpretation of the results:

Precision (proportion of true positives among all predicted positives):
- Class no: 0.91 (91%). Of the instances the model predicts as no, 91% are actually no.
- Class yes: 0.56 (56%). Of the instances the model predicts as yes, only 56% are actually yes.

Recall (proportion of true positives among all actual positives):
- Class no: 0.97 (97%). The model detects 97% of all actual no cases.
- Class yes: 0.30 (30%). The model detects only 30% of all actual yes cases.

F1-score (harmonic mean of precision and recall):
- Class no: 0.94 (94%). The F1 score is high, indicating a good balance between precision and recall for class no.
- Class yes: 0.39 (39%). The F1 score is low, pulled down mainly by the 0.30 recall for class yes.

Support (number of actual instances of each class in the test set):
- Class no: 7952 instances.
- Class yes: 1091 instances.

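Accuracy (proportion of correct predictions):
- 0.89 (89%): The model is correct on 89% of the test set. Because 7952 of the 9043 test instances (about 88%) belong to class no, this figure is dominated by the majority class and says little about class yes.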

Macro average (unweighted mean of the per-class metrics):
- Precision: 0.73 (73%)
- Recall: 0.63 (63%)
- F1 score: 0.66 (66%)

Weighted average (mean of the per-class metrics, weighted by the support of each class):
- Precision: 0.87 (87%)
- Recall: 0.89 (89%)
- F1 score: 0.87 (87%)
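
The F1 scores and both averages can be checked by hand from the per-class values. The sketch below recomputes them, plus the accuracy of a trivial model that always predicts no, using only the rounded figures printed in the report, so differences in the last digit are expected. The names `per_class`, `macro_avg`, and `weighted_avg` are illustrative and do not come from the notebook.

```python
# Per-class values copied from the classification report (rounded to 2 digits).
per_class = {
    "no": {"precision": 0.91, "recall": 0.97, "support": 7952},
    "yes": {"precision": 0.56, "recall": 0.30, "support": 1091},
}
total = sum(c["support"] for c in per_class.values())


def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


for label, c in per_class.items():
    c["f1"] = f1(c["precision"], c["recall"])
    print(f"F1 ({label}): {c['f1']:.2f}")


def macro_avg(metric):
    # Every class counts the same, regardless of its size.
    return sum(c[metric] for c in per_class.values()) / len(per_class)


def weighted_avg(metric):
    # Every class counts in proportion to its support.
    return sum(c[metric] * c["support"] for c in per_class.values()) / total


for metric in ("precision", "recall", "f1"):
    print(f"{metric}: macro={macro_avg(metric):.3f}, "
          f"weighted={weighted_avg(metric):.3f}")

# Accuracy of a trivial model that always predicts the majority class "no".
baseline_accuracy = per_class["no"]["support"] / total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

The weighted averages sit close to the class no values because that class makes up most of the test set, while the macro averages expose the weaker class yes results.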

#### General interpretation:
- The model identifies class no well, with high precision (0.91) and recall (0.97).
- The model struggles to identify class yes, with moderate precision (0.56) and low recall (0.30): it misses about 70% of the clients who accepted the deposit.
- Possible improvements: tune the model and try other preprocessing techniques.
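
One possible adjustment, which this notebook does not evaluate, is to lower the decision threshold applied to the predicted probabilities so that more instances are labelled as the minority class. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` with roughly 12% positives as a stand-in for the bank dataset, and the 0.3 threshold is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (assumption, not the
# bank dataset): about 12% of the labels are positive (1).
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, weights=[0.88, 0.12], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)

# Predicted probability of the positive class (label 1) for each test row.
proba_pos = clf.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.5, 0.3):
    # Label a row as positive when its probability reaches the threshold.
    y_hat = (proba_pos >= threshold).astype(int)
    results[threshold] = {
        "precision": precision_score(y_te, y_hat, zero_division=0),
        "recall": recall_score(y_te, y_hat),
    }
    print(threshold, results[threshold])
```

Lowering the threshold can only keep or increase recall for the positive class, usually at the cost of precision, so the right value depends on how costly a missed client is compared with an unnecessary call.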



#### Hyperparameter optimization:

A grid search ("Grid Search") over the hyperparameters is used to find the best-performing combination among the candidate values.
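
Before launching a grid search it helps to count how many model fits it will trigger. The sketch below does this with scikit-learn's `ParameterGrid` for the grid values used in this section and 5-fold cross-validation; `demo_grid`, `cv_folds`, and `total_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the grid used in this section.
demo_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [10, 20],
    'classifier__min_samples_split': [2, 5],
}
cv_folds = 5

# Every combination of values is one candidate configuration.
candidates = list(ParameterGrid(demo_grid))
n_candidates = len(candidates)

# Each candidate is fitted once per fold; with the default refit=True the
# best candidate is then fitted once more on the whole training set.
total_fits = n_candidates * cv_folds + 1

print(f'Candidates: {n_candidates}, total fits: {total_fits}')
```

Because the search runs over the whole pipeline, every one of those fits reruns the preprocessing step, including the text embeddings, which makes the search expensive.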


In [4]:
# Import the required libraries.
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
)


# Configure GridSearchCV to find the best hyperparameters.
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [10, 20],
    'classifier__min_samples_split': [2, 5]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best model found.
best_model = grid_search.best_estimator_

# Make predictions.
y_pred = best_model.predict(X_test)

# Evaluate the model with additional metrics.
# The labels are the strings 'no' and 'yes', so the positive class must be
# named explicitly (the default pos_label=1 would raise an error).
precision = precision_score(y_test, y_pred, pos_label='yes')
recall = recall_score(y_test, y_pred, pos_label='yes')
f1 = f1_score(y_test, y_pred, pos_label='yes')
# Column 1 of predict_proba corresponds to classes_[1], which is 'yes'.
roc_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])

print('Best hyperparameters: ', grid_search.best_params_)
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')
print(f'ROC AUC: {roc_auc}')


KeyboardInterrupt: 