# ¿Los clientes de Beta Bank se quedarán?
Los clientes de Beta Bank se están yendo, cada mes, poco a poco. Los banqueros descubrieron que es más barato salvar a los clientes existentes que atraer nuevos. Por lo cuál, se necesita predecir si un cliente dejará el banco pronto.

# Contents<a id='back'></a>

* [Introducción](#intro)
* [Inicialización](#inicializacion)
* [Cargando los datos](#cargando_datos)
* [Explorando los datos](#explorando_datos_iniciales)
* [Segmentación de datos](#segmentación_de_datos)
* [Preprocesamiento de datos](#preprocesamiento_de_datos)
* [Restaurando valores ausentes](#restaurando_valores_ausentes)
* [Equilibrio de clases](#equilibrio_de_clases)
* [Modelos de entrenamiento](#modelos_de_entrenamiento)
* [Conclusión general](#end)

## Introducción <a id='intro'></a>
**Propósito del proyecto**
    
   Crear un modelo con un valor F1 mayor o igual a 0.59.

### Objetivos:
1. Entrenar los modelos sin tener en cuenta el desequilibrio,
2. Medir la métrica F1 y AUC-ROC,
3. Escoger el mejor modelo.

[Volver a Contenidos](#back)

## Inicialización <a id='inicializacion'></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.utils import shuffle
import time
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

[Volver a Contenidos](#back)

## Cargando los datos <a id='cargando_datos'></a>

In [2]:
df = pd.read_csv('/datasets/Churn.csv')

[Volver a Contenidos](#back)

## Explorando los datos <a id='explorando_datos_iniciales'></a>

**Descripción de datos**

**Tabla Megaline**

- `RowNumber` —  índice de cadena de datos
- `CustomerId` —  identificador de cliente único
- `Surname` —  apellido
- `CreditScore` —  valor de crédito
- `Geography` —  país de residencia
- `Gender` —  sexo
- `Age` —  edad
- `Tenure` —  período durante el cual ha madurado el depósito a plazo fijo de un cliente (años)
- `Balance` —  saldo de la cuenta
- `NumOfProducts` —  número de productos bancarios utilizados por el cliente
- `HasCrCard` —  el cliente tiene una tarjeta de crédito (1 - sí; 0 - no)
- `IsActiveMember` —  actividad del cliente (1 - sí; 0 - no)
- `EstimatedSalary` —  salario estimado

In [3]:
df.shape

(10000, 14)

In [4]:
# Imprimiendo la información general/resumen sobre el DataFrame de los clientes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [5]:
df.describe()

Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,9091.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,15690940.0,650.5288,38.9218,4.99769,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,2886.89568,71936.19,96.653299,10.487806,2.894723,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,1.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2500.75,15628530.0,584.0,32.0,2.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,5000.5,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,7500.25,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,10000.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


In [6]:
# Imprime una muestra de los datos de los clientes
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


**Valores ausentes**

In [7]:
df.isna().sum().sort_values(ascending=False)

Tenure             909
RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

In [8]:
df.isna().sum().sort_values(ascending=False) / 10000

Tenure             0.0909
RowNumber          0.0000
CustomerId         0.0000
Surname            0.0000
CreditScore        0.0000
Geography          0.0000
Gender             0.0000
Age                0.0000
Balance            0.0000
NumOfProducts      0.0000
HasCrCard          0.0000
IsActiveMember     0.0000
EstimatedSalary    0.0000
Exited             0.0000
dtype: float64

**Valores duplicados**

In [9]:
len(df.drop_duplicates()) == len (df)

True

### Conclusión intermedia

 - **`DataFrame Clientes Megaline`** — la tabla tiene 14 columnas y 10.000 filas. Con valores ausentes con un 10% en la columna Ternure. Y sin valores duplicados en el DataFrame.

[Volver a Contenidos](#back)

## Preprocesamiento de datos <a id='preprocesamiento_de_datos'></a>

### Estilo del encabezado
Imprimiendo el encabezado de la columna

In [10]:
# Lista de los nombres de las columnas en la tabla df
df.columns

Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

In [11]:
# Renombrando las columnas
df.columns = df.columns.str.lower()

In [12]:
# Renombrando las columnas
df.rename(
    columns={
        'rownumber': 'row_number',
        'customerid': 'customer_id',
        'creditscore': 'credit_score',
        'hascrcard': 'has_cr_card',
        'isactivemember': 'is_active_member',
        'estimatedsalary': 'estimated_salary',
    }, inplace=True
)

df.columns

Index(['row_number', 'customer_id', 'surname', 'credit_score', 'geography',
       'gender', 'age', 'tenure', 'balance', 'numofproducts', 'has_cr_card',
       'is_active_member', 'estimated_salary', 'exited'],
      dtype='object')

### **Inspeccionando los valores únicos**

In [13]:
df["credit_score"].unique()

array([619, 608, 502, 699, 850, 645, 822, 376, 501, 684, 528, 497, 476,
       549, 635, 616, 653, 587, 726, 732, 636, 510, 669, 846, 577, 756,
       571, 574, 411, 591, 533, 553, 520, 722, 475, 490, 804, 582, 472,
       465, 556, 834, 660, 776, 829, 637, 550, 698, 585, 788, 655, 601,
       656, 725, 511, 614, 742, 687, 555, 603, 751, 581, 735, 661, 675,
       738, 813, 657, 604, 519, 664, 678, 757, 416, 665, 777, 543, 506,
       493, 652, 750, 729, 646, 647, 808, 524, 769, 730, 515, 773, 814,
       710, 413, 623, 670, 622, 785, 605, 479, 685, 538, 562, 721, 628,
       668, 828, 674, 625, 432, 770, 758, 795, 686, 789, 589, 461, 584,
       579, 663, 682, 793, 691, 485, 650, 754, 535, 716, 539, 706, 586,
       631, 717, 800, 683, 704, 615, 667, 484, 480, 578, 512, 606, 597,
       778, 514, 525, 715, 580, 807, 521, 759, 516, 711, 618, 643, 671,
       689, 620, 676, 572, 695, 592, 567, 694, 547, 594, 673, 610, 767,
       763, 712, 703, 662, 659, 523, 772, 545, 634, 739, 771, 68

In [14]:
df["geography"].unique()

array(['France', 'Spain', 'Germany'], dtype=object)

In [15]:
df["tenure"].unique()

array([ 2.,  1.,  8.,  7.,  4.,  6.,  3., 10.,  5.,  9.,  0., nan])

In [16]:
df = df.drop(['row_number', 'customer_id', 'surname'], axis=1)
df

Unnamed: 0,credit_score,geography,gender,age,tenure,balance,numofproducts,has_cr_card,is_active_member,estimated_salary,exited
0,619,France,Female,42,2.0,0.00,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.80,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.00,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,France,Male,39,5.0,0.00,2,1,0,96270.64,0
9996,516,France,Male,35,10.0,57369.61,1,1,1,101699.77,0
9997,709,France,Female,36,7.0,0.00,1,0,1,42085.58,1
9998,772,Germany,Male,42,3.0,75075.31,2,1,0,92888.52,1


### Conclusión intermedia

 - **`DataFrame Clientes Megaline`** — se eliminó la columna *row_number*, *customer_id*, *surname* ya que no lo consideramos como característica.

## Restaurando valores ausentes <a id='restaurando_valores_ausentes'></a>

In [17]:
df_notnan = df.dropna()
df_notnan.head()

Unnamed: 0,credit_score,geography,gender,age,tenure,balance,numofproducts,has_cr_card,is_active_member,estimated_salary,exited
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [18]:
# Vamos a escribir una función que calcule la categoría de período
def tenure_category(tenure):
    if tenure == 0:
        return "cero"
    elif tenure == 1:
        return "uno"
    elif tenure == 2:
        return "dos"
    elif tenure == 3:
        return "tres"
    elif tenure == 4:
        return "cuatro"
    elif tenure == 5:
        return "cinco"
    elif tenure == 6:
        return "seis"
    elif tenure == 7:
        return "siete"
    elif tenure == 8:
        return "ocho"
    elif tenure == 9:
        return "nueve"
    elif tenure == 10:
        return "diez"
    else:
        return "Sin información"

In [19]:
print(tenure_category(5))

cinco


In [20]:
# Creando una nueva columna basada en la función
df_notnan["tenure_category"] = df_notnan["tenure"].apply(tenure_category)

In [21]:
# Comprobando cómo los valores en la nueva columna
df_notnan.head()

Unnamed: 0,credit_score,geography,gender,age,tenure,balance,numofproducts,has_cr_card,is_active_member,estimated_salary,exited,tenure_category
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1,dos
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0,uno
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1,ocho
3,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0,uno
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0,dos


In [22]:
w = (
    df_notnan
    .groupby(["geography", "gender", "age", "tenure"])["tenure_category"]
    .count()
    .reset_index()
    .sort_values(by=["geography", "tenure_category"], ascending=False)
)

w.head()

Unnamed: 0,geography,gender,age,tenure,tenure_category
2350,Spain,Male,38,7.0,16
2291,Spain,Male,33,3.0,11
2322,Spain,Male,36,1.0,11
2344,Spain,Male,38,1.0,11
1975,Spain,Female,39,6.0,10


In [23]:
w["tenure"].unique()

array([ 7.,  3.,  1.,  6.,  5.,  9.,  8.,  2.,  4.,  0., 10.])

In [24]:
_tenure= w.groupby("geography").head(1).drop(columns="tenure_category").set_index("geography")
_tenure

Unnamed: 0_level_0,gender,age,tenure
geography,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Spain,Male,38,7.0
Germany,Male,37,9.0
France,Male,36,1.0


In [25]:
#Función que usaremos para completar los valores ausentes
def tenure_(row):
    tenure = row["tenure"]
    geography = row["geography"]
    gender = row["gender"]
    age = row["age"]
    if pd.isna(tenure):
        try:
            return _tenure["tenure"][geography]
        except:
            return None
    return tenure

In [26]:
# Comprobando si funciona
df.apply(tenure_, axis=1).isna().sum()

0

In [27]:
# Aplicando la función al income_type
df["tenure"] = df.apply(tenure_, axis=1)
df.head(10)

Unnamed: 0,credit_score,geography,gender,age,tenure,balance,numofproducts,has_cr_card,is_active_member,estimated_salary,exited
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


In [28]:
df["tenure"].unique()

array([ 2.,  1.,  8.,  7.,  4.,  6.,  3., 10.,  5.,  9.,  0.])

### Conclusión intermedia

Se crearon categorias para cada año el cual ha madurado el depósito a plazo fijo de un cliente, se incluyó al DataFrame la columna "tenure_category", esta nos permitió ver mediante una agrupación, los ratings por período que más se repiten (moda), tomar el valor más alto y rellenar los valores ausentes de la columna tenure.

## **Equilibrio de clases**  <a id='equilibrio_de_clases'></a>

In [29]:
#Features excluding 'exited'
features = df.drop('exited', axis=1)

In [30]:
#Target
target = df["exited"]

In [31]:
#Examinando el equilibrio de clases
target.value_counts(normalize = True) 

0    0.7963
1    0.2037
Name: exited, dtype: float64

Se tiene un imbalance del 20%.

[Volver a Contenidos](#back)

# **Modelos de entrenamiento**  <a id='modelos_de_entrenamiento'></a>

## Modelo de árbol de decisión

In [32]:
#Codificando las columnas
encoder = OrdinalEncoder()

In [33]:
#Ajustando y transformando los datos
features_ordinal = pd.DataFrame(encoder.fit_transform(features), columns=features.columns)
features_ordinal

Unnamed: 0,credit_score,geography,gender,age,tenure,balance,numofproducts,has_cr_card,is_active_member,estimated_salary
0,228.0,0.0,0.0,24.0,2.0,0.0,0.0,1.0,1.0,5068.0
1,217.0,2.0,0.0,23.0,1.0,743.0,0.0,0.0,1.0,5639.0
2,111.0,0.0,0.0,24.0,8.0,5793.0,2.0,1.0,0.0,5707.0
3,308.0,0.0,0.0,21.0,1.0,0.0,1.0,0.0,0.0,4704.0
4,459.0,2.0,0.0,25.0,2.0,3696.0,0.0,1.0,1.0,3925.0
...,...,...,...,...,...,...,...,...,...,...
9995,380.0,0.0,1.0,21.0,5.0,0.0,1.0,1.0,0.0,4827.0
9996,125.0,0.0,1.0,17.0,10.0,124.0,0.0,1.0,1.0,5087.0
9997,318.0,0.0,0.0,18.0,7.0,0.0,0.0,0.0,1.0,2062.0
9998,381.0,1.0,1.0,24.0,3.0,427.0,1.0,1.0,0.0,4639.0


In [34]:
# División de datos en conjuntos de entrenamiento y validación
random_seed = 200

X_full_train, X_test, y_full_train, y_test = train_test_split(features_ordinal,
                                                    target,
                                                    test_size = 0.2,
                                                    random_state=random_seed)

X_train, X_valid, y_train, y_valid = train_test_split(X_full_train,
                                                      y_full_train,
                                                      test_size = 0.25,
                                                      random_state=random_seed+1)

In [35]:
for df in [X_full_train, X_test, X_train, X_valid]:
  print(df.shape)

(8000, 10)
(2000, 10)
(6000, 10)
(2000, 10)


Vamos a entrenar al modelo con 8.000 observaciones, escoger los hiperparametros con 2.000 observaciones, entrenar al modelo con 6.000 observaciones y la muestra de prueba final con 2.000 observaciones.

**Árbol de decisión sin balance**

In [36]:
random_seed = 200
i = -1
best_f1_score_DT = 0
best_max_depth_DT = 0

for max_depth in [2, 4, 6, 8, 10]:
    model_DT = DecisionTreeClassifier(random_state=random_seed + i,
                                      max_depth=max_depth) # creando el modelo con pseudoaleatoriedad definida y profundidad proporcionada
    model_DT.fit(X_full_train, y_full_train) # entrenando el modelo 
    predictions_DT = model_DT.predict(X_test) # obteniendo la predicción del modelo
    f1_score_DT = f1_score(y_test, predictions_DT >= 0.5)
    if f1_score_DT > best_f1_score_DT:
      best_max_depth_DT = max_depth
      best_f1_score_DT = f1_score_DT
      best_model_DT = model_DT
    

print(f"Best max depth DT: {best_max_depth_DT}")
print(f"Best max F1 DT: {best_f1_score_DT}")
print(f"ROC AUC: {roc_auc_score(y_test, predictions_DT)}")
print()

Best max depth DT: 8
Best max F1 DT: 0.5179856115107915
ROC AUC: 0.682415200298956



**Conclusión intermedia**

- Las métricas de evaluación difieren mucho entre sí. Se obtienen menores valores de F1 score cuando hay desquilibrio. El ROC AUC tiene el efecto contrario, mientras más grande sea el desequilibrio mejor se verá el ROC AUC.

### Corrigiendo el desequilibrio de clases

In [37]:
#Ajustando
random_seed = 200
i = -1
best_f1_score_DT = 0
best_max_depth_DT = 0

for adjusted_max_depth in [2, 4, 6, 8, 10]:
    adjusted_model_DT = DecisionTreeClassifier(random_state=random_seed + i,
                                               max_depth=adjusted_max_depth,
                                               class_weight="balanced") # creando el modelo con pseudoaleatoriedad definida y profundidad proporcionada
    adjusted_model_DT.fit(X_train, y_train) # entrenando el modelo 
    adjusted_preds_DT = adjusted_model_DT.predict(X_valid) # obteniendo la predicción del modelo
    adjusted_f1_score_DT = f1_score(y_valid, adjusted_preds_DT >= 0.5)
    
    if adjusted_f1_score_DT > best_f1_score_DT:
      best_max_depth_DT = adjusted_max_depth
      best_f1_score_DT = adjusted_f1_score_DT
      adjusted_best_model_DT = adjusted_model_DT

print(f"Best max depth DT: {best_max_depth_DT}")
print(f"Best max F1 DT: {best_f1_score_DT}")
print(f"ROC AUC: {roc_auc_score(y_valid, adjusted_preds_DT)}")
print()

Best max depth DT: 8
Best max F1 DT: 0.5529197080291971
ROC AUC: 0.7018796717656304



### Ajuste de umbral

In [38]:
#Cambiando el umbral
random_seed = 200
i = -1
best_f1_score_DT = 0
best_max_depth_DT = 0

for threshold_max_depth in [2, 4, 6, 8, 10]:
    threshold_model_DT = DecisionTreeClassifier(random_state=random_seed + i,
                                                max_depth=threshold_max_depth) # creando el modelo con pseudoaleatoriedad definida y profundidad proporcionada
    threshold_model_DT.fit(X_train, y_train) # entrenando el modelo 
    threshold_preds_DT = threshold_model_DT.predict(X_valid) # obteniendo la predicción del modelo
    threshold_f1_score_DT = f1_score(y_valid, threshold_preds_DT >= 0.5)
    
    if threshold_f1_score_DT > best_f1_score_DT:
      best_max_depth_DT = threshold_max_depth
      best_f1_score_DT = threshold_f1_score_DT
      threshold_best_model_DT = threshold_model_DT

print(f"Best max depth DT: {best_max_depth_DT}")
print(f"Best max F1 DT: {best_f1_score_DT}")
print(f"ROC AUC: {roc_auc_score(y_test, threshold_preds_DT)}")
print()

Best max depth DT: 8
Best max F1 DT: 0.5208333333333333
ROC AUC: 0.5037563033092094



In [39]:
#Encontrando el mejor umbral
# dame un número entre cero y uno, y optimizame  21 números
best_threshold_DT = 0
best_f1_score_DT = 0

for t_DT in np.linspace(0, 1, 21): #Devolviendo un número de muestras espaciadas uniformemente, calculadas sobre el intervalo [0 , 1]
    f1_score_tmp_DT = f1_score(y_valid, threshold_preds_DT >= t_DT) #calculando f1_score variando el umbral en cada iteración
    
    if f1_score_tmp_DT > best_f1_score_DT: #si f1.. supera a best...
        best_f1_score_DT = f1_score_tmp_DT
        best_threshold_DT = t_DT # best-- igual al umbral usado en la iteración

print(f"Best threshold: {best_threshold_DT}")

Best threshold: 0.05


In [40]:
#Imprimiendo el desempeño
#F1 score y_test, valor verdadero. preds, predicción binaria
#ROC AUC y_test, valor verdadero. preds, predicción

print(f"""
F1 score: {f1_score(y_valid, threshold_preds_DT >= best_threshold_DT)} 
ROC AUC: {roc_auc_score(y_valid, threshold_preds_DT)} 
""")

# tomando las respuestas y predicciones correctas y devolviendo la media armónica de recall y precisión
#tomando los valores objetivo y las probabilidades de clase positivas


F1 score: 0.5111731843575419 
ROC AUC: 0.6834987298280242 



### **Sobremuestreo**

In [41]:
#Dividiendo el conjunto de datos de entrenamiento en observaciones negativas y positivas
#Duplicando las observaciones positivas varias veces
#Creando una nueva muestra de entrenamiento basada en los datos obtenidos
def upsample(features, target, repeat):
    features_zeros = X_train[y_train == 0]
    features_ones = X_train[y_train == 1]
    target_zeros = y_train[y_train == 0]
    target_ones = y_train[y_train == 1]
    
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=random_seed)
    
    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(
    X_train, y_train, 10
)

In [42]:
random_seed = 200
i = -1
best_f1_score_DT = 0
best_max_depth_DT = 0

for SM_max_depth in [2, 4, 6, 8, 10]:
    SM_model_DT = DecisionTreeClassifier(random_state=random_seed + i,
                                         max_depth=SM_max_depth) # creando el modelo con pseudoaleatoriedad definida y profundidad proporcionada
    SM_model_DT.fit(features_upsampled, target_upsampled) # entrenando el modelo 
    SM_preds_DT = SM_model_DT.predict(X_valid) # obteniendo la predicción del modelo
    SM_f1_score_DT = f1_score(y_valid, SM_preds_DT)
    
    if SM_f1_score_DT > best_f1_score_DT:
      best_max_depth_DT = SM_max_depth
      best_f1_score_DT = SM_f1_score_DT
      SM_best_model_DT = SM_model_DT

print(f"Best max depth DT: {best_max_depth_DT}")
print(f"Best max F1 DT: {best_f1_score_DT}")
print(f"ROC AUC: {roc_auc_score(y_test, SM_preds_DT)}")
print()

Best max depth DT: 2
Best max F1 DT: 0.5022999080036799
ROC AUC: 0.4781805377930564



### **Cambiando el umbral**

In [43]:
#Cambiando el umbral
turned_model_DT = DecisionTreeClassifier(random_state=random_seed,
                                         max_depth=best_max_depth_DT,
                                         class_weight="balanced")
turned_model_DT.fit(X_train, y_train)

# Predicción
turned_valid_preds_DT = turned_model_DT.predict_proba(X_valid)[:,1]

In [44]:
best_threshold_DT = 0
best_f1_score_DT = 0

for t_DT in np.linspace(0, 1, 21):
    f1_score_tmp_DT = f1_score(y_valid, turned_valid_preds_DT >= t_DT) #calculando f1_score variando el umbral en cada iteración
    
    if f1_score_tmp_DT > best_f1_score_DT: #si fi.. supera a best...
        best_f1_score_DT = f1_score_tmp_DT
        best_threshold_DT = t_DT # best-- igual al umbral usado en la iteración

print(f"Best threshold: {round(best_threshold_DT, 4)}")

Best threshold: 0.3


In [45]:
print(f"""
Validation results:
F1 score: {f1_score(y_valid, turned_valid_preds_DT >= best_threshold_DT)}  
ROC AUC: {roc_auc_score(y_valid, turned_valid_preds_DT)} 
""")


Validation results:
F1 score: 0.5022999080036799  
ROC AUC: 0.7265505540630274 



### **Modelo final**

In [46]:
#Cambiando el umbral
final_model_DT = DecisionTreeClassifier(random_state=random_seed,
                                        max_depth=best_max_depth_DT,
                                        class_weight="balanced")
final_model_DT.fit(X_full_train, y_full_train)

# Predicción
final_test_preds_DT = final_model_DT.predict_proba(X_test)[:,1]

In [47]:
print(f"""
Test results:
F1 score: {f1_score(y_test, final_test_preds_DT >= best_threshold_DT)}
ROC AUC: {roc_auc_score(y_test, final_test_preds_DT)}
""")


Test results:
F1 score: 0.5000000000000001
ROC AUC: 0.7234300721633061



## Modelo de bosque aleatorio

**Bosque aleatorio sin balance**

In [48]:
random_seed = 200
i = -1
best_f1_score_RF = 0
best_max_depth_RF = 0
best_n_estimators_RF = 0

for max_depth_RF in [5, 6, 7, 8]:
  for n_estimators in [10, 25, 50, 100]:
    model_first = RandomForestClassifier(random_state=random_seed + i,
                                         n_estimators=n_estimators,
                                         max_depth=max_depth_RF) # creando el modelo con pseudoaleatoriedad definida, profundidad proporcionada y cantidad de árboles
    model_first.fit(X_full_train, y_full_train) # entrenando el modelo 
    predictions_first = model_first.predict(X_test) # obteniendo las predicciones del modelo
    f1_score_RF = f1_score(y_test, predictions_first >= 0.5)

    if f1_score_RF > best_f1_score_RF:
      best_max_depth_RF = max_depth_RF
      best_n_estimators_RF = n_estimators
      best_f1_score_RF = f1_score_RF
      best_model_RF = model_first

print(f"Best max depth RF: {best_max_depth_RF}")
print(f"Best n estimators RF: {best_n_estimators_RF}")
print(f"Best max F1 DT: {best_f1_score_RF}")
print(f"ROC AUC: {roc_auc_score(y_test, predictions_first)}")
print()

Best max depth RF: 7
Best n estimators RF: 50
Best max F1 DT: 0.5078864353312302
ROC AUC: 0.670707831661632



### Corrigiendo el desequilibrio de clases

**class_weight = "balanced"**

In [49]:
random_seed = 200
i = -1
best_f1_score_RF = 0
best_max_depth_RF = 0
best_n_estimators_RF = 0

for adjusted_max_depth_RF in [5, 6, 7, 8]:
  for n_estimators in [10, 25, 50, 100]:
    adjusted_model_RF = RandomForestClassifier(random_state=random_seed + i,
                                               n_estimators=n_estimators,
                                               max_depth=adjusted_max_depth_RF,
                                               class_weight="balanced") # creando el modelo con pseudoaleatoriedad definida, profundidad proporcionada y cantidad de árboles
    adjusted_model_RF.fit(X_train, y_train) # entrenando el modelo 
    adjusted_preds_RF = adjusted_model_RF.predict(X_valid) # obteniendo las predicciones del modelo
    adjusted_f1_score_RF = f1_score(y_valid, adjusted_preds_RF >= 0.5)

    if adjusted_f1_score_RF > best_f1_score_RF:
      best_max_depth_RF = adjusted_max_depth_RF
      best_n_estimators_RF = n_estimators
      best_f1_score_RF = adjusted_f1_score_RF
      best_model_RF = adjusted_model_RF

print(f"Best max depth RF: {best_max_depth_RF}")
print(f"Best n estimators RF: {best_n_estimators_RF}")
print(f"Best max F1 DT: {best_f1_score_RF}")
print(f"ROC AUC: {roc_auc_score(y_test, adjusted_preds_RF)}")
print()

Best max depth RF: 7
Best n estimators RF: 50
Best max F1 DT: 0.592274678111588
ROC AUC: 0.4914228841203506



**class_weight = "balanced_subsample"**

In [50]:
random_seed = 200
i = -1
best_f1_score_RF = 0
best_max_depth_RF = 0
best_n_estimators_RF = 0

for adjusted_sub_max_depth_RF in [5, 6, 7, 8]:
  for n_estimators in [10, 25, 50, 100]:
    adjusted_sub_model_RF = RandomForestClassifier(random_state=random_seed + i,
                                                   n_estimators=n_estimators,
                                                   max_depth=adjusted_sub_max_depth_RF,
                                                   class_weight="balanced_subsample") # creando el modelo con pseudoaleatoriedad definida, profundidad proporcionada y cantidad de árboles
    adjusted_sub_model_RF.fit(X_train, y_train) # entrenando el modelo 
    adjusted_sub_preds_RF = adjusted_sub_model_RF.predict(X_valid) # obteniendo las predicciones del modelo
    adjusted_sub_f1_score_RF = f1_score(y_valid, adjusted_sub_preds_RF >= 0.5)

    if adjusted_sub_f1_score_RF > best_f1_score_RF:
      best_max_depth_RF = adjusted_sub_max_depth_RF
      best_n_estimators_RF = n_estimators
      best_f1_score_RF = adjusted_sub_f1_score_RF
      best_model_RF = adjusted_sub_model_RF

print(f"Best max depth RF: {best_max_depth_RF}")
print(f"Best n estimators RF: {best_n_estimators_RF}")
print(f"Best max F1 DT: {best_f1_score_RF}")
print(f"ROC AUC: {roc_auc_score(y_test, adjusted_sub_preds_RF)}")
print()

Best max depth RF: 8
Best n estimators RF: 25
Best max F1 DT: 0.5984598459845984
ROC AUC: 0.48662217067582186



### Ajuste de umbral

In [51]:
random_seed = 200
i = -1
best_f1_score_RF = 0
best_max_depth_RF = 0
best_n_estimators_RF = 0

for threshold_max_depth_RF in [5, 6, 7, 8]:
  for n_estimators in [10, 25, 50, 100]:
    threshold_model_RF = RandomForestClassifier(random_state=random_seed + i,
                                               n_estimators=n_estimators,
                                               max_depth=threshold_max_depth_RF) # creando el modelo con pseudoaleatoriedad definida, profundidad proporcionada y cantidad de árboles
    threshold_model_RF.fit(X_train, y_train) # entrenando el modelo 
    threshold_preds_RF = threshold_model_RF.predict(X_valid) # obteniendo las predicciones del modelo
    threshold_f1_score_RF = f1_score(y_valid, threshold_preds_RF >= 0.5)

    if threshold_f1_score_RF > best_f1_score_RF:
      best_max_depth_RF = threshold_max_depth_RF
      best_n_estimators_RF = n_estimators
      best_f1_score_RF = threshold_f1_score_RF
      best_model_RF = threshold_model_RF

print(f"Best max depth RF: {best_max_depth_RF}")
print(f"Best n estimators RF: {best_n_estimators_RF}")
print(f"Best max F1 DT: {best_f1_score_RF}")
print(f"ROC AUC: {roc_auc_score(y_test, threshold_preds_RF)}")
print()

Best max depth RF: 8
Best n estimators RF: 50
Best max F1 DT: 0.5214626391096979
ROC AUC: 0.49875683184177977



In [52]:
#Encontrando el mejor umbral

best_threshold_RF = 0
best_f1_score_RF = 0

for t_RF in np.linspace(0, 1, 21): #Devolviendo un número de muestras espaciadas uniformemente, calculadas sobre el intervalo [0 , 1]
    f1_score_tmp_RF = f1_score(y_valid, threshold_preds_RF >= t_RF) #calculando f1_score variando el umbral en cada iteración
    
    if f1_score_tmp_RF > best_f1_score_RF: #si fi.. supera a best...
        best_f1_score_RF = f1_score_tmp_RF
        best_threshold_RF = t_RF # best-- igual al umbral usado en la iteración

print(f"Best threshold: {best_threshold_RF}")

Best threshold: 0.05


In [53]:
#Imprimiendo el desempeño
#F1 score y_test, valor verdadero. preds, predicción binaria
#ROC AUC y_test, valor verdadero. preds, predicción

print(f"""
F1 score: {f1_score(y_valid, threshold_preds_RF >= best_threshold_RF)} 
ROC AUC: {roc_auc_score(y_valid, threshold_preds_RF)} 
""")
# tomando las respuestas y predicciones correctas y devolviendo la media armónica de recall y precisión
#tomando los valores objetivo y las probabilidades de clase positivas


F1 score: 0.516025641025641 
ROC AUC: 0.6789967773574331 



### Sobremuestreo

In [54]:
random_seed = 200
i = -1
best_f1_score_RF = 0
best_max_depth_RF = 0
best_n_estimators_RF = 0

for SM_max_depth_RF in [5, 6, 7, 8]:
  for n_estimators in [10, 25, 50, 100]:
    SM_model_RF = RandomForestClassifier(random_state=random_seed + i,
                                         n_estimators=n_estimators,
                                         max_depth=SM_max_depth_RF) # creando el modelo con pseudoaleatoriedad definida, profundidad proporcionada y cantidad de árboles
    SM_model_RF.fit(features_upsampled, target_upsampled) # entrenando el modelo 
    SM_preds_RF = SM_model_RF.predict(X_valid) # obteniendo las predicciones del modelo
    SM_f1_score_RF = f1_score(y_valid, SM_preds_RF)

    if SM_f1_score_RF > best_f1_score_RF:
      best_max_depth_RF = SM_max_depth_RF
      best_n_estimators_RF = n_estimators
      best_f1_score_RF = SM_f1_score_RF
      best_model_RF = SM_model_RF

print(f"Best max depth RF: {best_max_depth_RF}")
print(f"Best n estimators RF: {best_n_estimators_RF}")
print(f"Best max F1 DT: {best_f1_score_RF}")
print(f"ROC AUC: {roc_auc_score(y_test, SM_preds_RF)}")
print()

Best max depth RF: 8
Best n estimators RF: 100
Best max F1 DT: 0.5185714285714286
ROC AUC: 0.4739292004567712



### **Cambiando el umbral**

In [55]:
#Cambiando el umbral
turned_model_RF = RandomForestClassifier(random_state=random_seed,
                                         n_estimators=best_n_estimators_RF,
                                         max_depth=best_max_depth_RF,
                                         class_weight="balanced_subsample")
turned_model_RF.fit(X_train, y_train)

# Predicción
turned_valid_preds_RF = turned_model_RF.predict_proba(X_valid)[:,1]

In [56]:
best_threshold_RF = 0
best_f1_score_RF = 0

for t_RF in np.linspace(0, 1, 51):
    f1_score_tmp_RF = f1_score(y_valid, turned_valid_preds_RF >= t_RF) #calculando f1_score variando el umbral en cada iteración
    
    if f1_score_tmp_RF > best_f1_score_RF: #si fi.. supera a best...
        best_f1_score_RF = f1_score_tmp_RF
        best_threshold_RF = t_RF # best-- igual al umbral usado en la iteración

print(f"Best threshold: {round(best_threshold_RF, 4)}")

Best threshold: 0.5


In [57]:
print(f"""
Validation results:
F1 score: {f1_score(y_valid, turned_valid_preds_RF >= best_threshold_RF)}
ROC AUC: {roc_auc_score(y_valid, turned_valid_preds_RF)}
""")


Validation results:
F1 score: 0.6036036036036035
ROC AUC: 0.8517721488142016



### **Modelo final**

In [58]:
#Cambiando el umbral
final_model_RF = RandomForestClassifier(random_state=random_seed,
                                        n_estimators=best_n_estimators_RF,
                                        max_depth=best_max_depth_RF,
                                        class_weight="balanced_subsample")
final_model_RF.fit(X_full_train, y_full_train)

# Predicción
final_test_preds_RF = final_model_RF.predict_proba(X_test)[:,1]

In [59]:
print(f"""
Test results:
F1 score: {f1_score(y_test, final_test_preds_RF >= best_threshold_RF)}
ROC AUC: {roc_auc_score(y_test, final_test_preds_RF)}
""")


Test results:
F1 score: 0.5949367088607594
ROC AUC: 0.8382303240723509



## Modelo de regresión logística

In [60]:
#Transformando características categóricas en numéricas
features_ohe = pd.get_dummies(features, drop_first=True)

In [61]:
features_ohe

Unnamed: 0,credit_score,age,tenure,balance,numofproducts,has_cr_card,is_active_member,estimated_salary,geography_Germany,geography_Spain,gender_Male
0,619,42,2.0,0.00,1,1,1,101348.88,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,1,0
2,502,42,8.0,159660.80,3,1,0,113931.57,0,0,0
3,699,39,1.0,0.00,2,0,0,93826.63,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.10,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,39,5.0,0.00,2,1,0,96270.64,0,0,1
9996,516,35,10.0,57369.61,1,1,1,101699.77,0,0,1
9997,709,36,7.0,0.00,1,0,1,42085.58,0,0,0
9998,772,42,3.0,75075.31,2,1,0,92888.52,1,0,1


In [62]:
# División de datos en conjuntos de entrenamiento y validación
random_seed = 200

X_fullo_train, X_o_test, y_fullo_train, y_o_test = train_test_split(features_ohe,
                                                    target,
                                                    test_size = 0.2,
                                                    random_state=random_seed)

X_o_train, X_o_valid, y_o_train, y_o_valid = train_test_split(X_fullo_train,
                                                      y_fullo_train,
                                                      test_size = 0.25,
                                                      random_state=random_seed+1)

In [63]:
for df in [X_fullo_train, X_o_test, X_o_train, X_o_valid]:
  print(df.shape)

(8000, 11)
(2000, 11)
(6000, 11)
(2000, 11)


Vamos a entrenar al modelo con 8.000 observaciones, escoger los hiperparametros con 2.000 observaciones, entrenar al modelo con 6.000 observaciones y la muestra de prueba final con 2.000 observaciones.

**Regresión logística sin balance**

In [64]:
random_seed = 200
i = -1
best_f1_score_LR = 0
best_penalty = None
solver = "liblinear"

for penalty in ["l1", "l2", "none"]:
    if penalty == "none":
        solver = "saga"
    model_LR = LogisticRegression(random_state=random_seed + i,
                                  solver=solver,
                                  penalty = penalty,
                                  max_iter = 10000) # inicializando el constructor de regresión logística con pseudoaleatoriedad definida, solver='liblinear', penalizadores e iteraciones (max-iter)
    model_LR.fit(X_fullo_train, y_fullo_train) # entrenando el modelo 
    predictions_LR = model_LR.predict(X_o_test) # obteniendo las predicciones del modelo
    f1_score_LR = f1_score(y_o_test, predictions_LR >= 0.5)

    if f1_score_LR > best_f1_score_LR:
      best_f1_score_LR = f1_score_LR
      best_penalty = penalty
      best_model_LR = model_LR

print(f"Best penalty: {best_penalty}")
print(f"Best max F1 DT: {round(best_f1_score_LR, 4)}")
print(f"ROC AUC: {roc_auc_score(y_test, predictions_LR)}")
print()

Best penalty: l1
Best max F1 DT: 0.2702
ROC AUC: 0.5



### Corrigiendo el desequilibrio de clases

**class_weight = "balanced"**

In [65]:
random_seed = 200
i = -1
best_f1_score_LR = 0
best_penalty = None
solver = "liblinear"

for penalty in ["l1", "l2", "none"]:
    if penalty == "none":
        solver = "saga"
    adjusted_model_LR = LogisticRegression(random_state=random_seed + i,
                                           solver=solver,
                                           penalty = penalty,
                                           max_iter = 10000,
                                           class_weight="balanced") # inicializando el constructor de regresión logística con pseudoaleatoriedad definida, solver='liblinear', penalizadores e iteraciones (max-iter)
    adjusted_model_LR.fit(X_o_train, y_o_train) # entrenando el modelo 
    adjusted_predictions_LR = adjusted_model_LR.predict(X_o_valid) # obteniendo las predicciones del modelo
    adjusted_f1_score_LR = f1_score(y_o_valid, adjusted_predictions_LR >= 0.5)

    if adjusted_f1_score_LR > best_f1_score_LR:
      best_f1_score_LR = adjusted_f1_score_LR
      best_penalty = penalty
      best_model_LR = adjusted_model_LR

print(f"Best penalty: {best_penalty}")
print(f"Best max F1 DT: {round(best_f1_score_LR, 4)}")
print(f"ROC AUC: {roc_auc_score(y_o_test, adjusted_predictions_LR)}")
print()

Best penalty: l1
Best max F1 DT: 0.4627
ROC AUC: 0.48183857870892144



### Ajuste de umbral

In [66]:
random_seed = 200
i = -1
best_f1_score_LR = 0
best_penalty = None
solver = "liblinear"

for penalty in ["l1", "l2", "none"]:
    if penalty == "none":
        solver = "saga"
    threshold_model_LR = LogisticRegression(random_state=random_seed + i,
                                            solver=solver, penalty = penalty,
                                            max_iter = 10000) # inicializando el constructor de regresión logística con pseudoaleatoriedad definida, solver='liblinear', penalizadores e iteraciones (max-iter)
    threshold_model_LR.fit(X_o_train, y_o_train) # entrenando el modelo 
    threshold_predictions_LR = threshold_model_LR.predict(X_o_valid) # obteniendo las predicciones del modelo
    threshold_f1_score_LR = f1_score(y_o_valid, threshold_predictions_LR >= 0.5)

    if threshold_f1_score_LR > best_f1_score_LR:
      best_f1_score_LR = threshold_f1_score_LR
      best_penalty = penalty
      best_model_LRR = threshold_model_LR

print(f"Best penalty: {best_penalty}")
print(f"Best max F1 DT: {round(best_f1_score_LR, 4)}")
print(f"ROC AUC: {roc_auc_score(y_o_test, threshold_predictions_LR)}")
print()

Best penalty: l1
Best max F1 DT: 0.328
ROC AUC: 0.5



In [67]:
#Encontrando el mejor umbral
best_threshold_LR = 0
best_f1_score_LR = 0

for t_LR in np.linspace(0, 1, 21): #Devolviendo un número de muestras espaciadas uniformemente, calculadas sobre el intervalo [0 , 1]
    f1_score_tmp_LR = f1_score(y_o_valid, threshold_predictions_LR >= t_LR) #calculando f1_score variando el umbral en cada iteración
    
    if f1_score_tmp_LR > best_f1_score_LR: #si f1.. supera a best...
        best_f1_score_LR = f1_score_tmp_LR
        best_threshold_LR = t_LR # best-- igual al umbral usado en la iteración

print(f"Best threshold: {best_threshold_LR}")

Best threshold: 0.0


In [68]:
#Imprimiendo el desempeño
#F1 score y_test, valor verdadero. preds, predicción binaria
#ROC AUC y_test, valor verdadero. preds, predicción

print(f"""
F1 score: {f1_score(y_o_valid, threshold_predictions_LR >= best_threshold_LR)} 
ROC AUC: {roc_auc_score(y_o_valid, threshold_predictions_LR)} 
""")

# tomando las respuestas y predicciones correctas y devolviendo la media armónica de recall y precisión
#tomando los valores objetivo y las probabilidades de clase positivas


F1 score: 0.3429991714995857 
ROC AUC: 0.5 



### **Sobremuestreo**

In [69]:
#Dividiendo el conjunto de datos de entrenamiento en observaciones negativas y positivas
#Duplicando las observaciones positivas varias veces
#Creando una nueva muestra de entrenamiento basada en los datos obtenidos
def upsample(features, target, repeat):
    features_zeros = X_o_train[y_o_train == 0]
    features_ones = X_o_train[y_o_train == 1]
    target_zeros = y_o_train[y_o_train == 0]
    target_ones = y_o_train[y_o_train == 1]
    
    features_upsampled_LR = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled_LR = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled_LR, target_upsampled_LR = shuffle(features_upsampled_LR, target_upsampled_LR, random_state=random_seed)
    
    return features_upsampled_LR, target_upsampled_LR

features_upsampled_LR, target_upsampled_LR = upsample(
    X_o_train, y_o_train, 10
)

In [70]:
random_seed = 200
i = -1
best_f1_score_LR = 0
best_penalty = None
solver = "liblinear"

for penalty in ["l1", "l2", "none"]:
    if penalty == "none":
        solver = "saga"
    threshold_model_LR = LogisticRegression(random_state=random_seed + i,
                                            solver=solver,
                                            penalty = penalty,
                                            max_iter = 10000) # inicializando el constructor de regresión logística con pseudoaleatoriedad definida, solver='liblinear', penalizadores e iteraciones (max-iter)
    threshold_model_LR.fit(features_upsampled_LR, target_upsampled_LR) # entrenando el modelo 
    threshold_predictions_LR = threshold_model_LR.predict(X_o_valid) # obteniendo las predicciones del modelo
    threshold_f1_score_LR = f1_score(y_o_valid, threshold_predictions_LR)

    if threshold_f1_score_LR > best_f1_score_LR:
      best_f1_score_LR = threshold_f1_score_LR
      best_penalty = penalty
      best_model_LRR = threshold_model_LR

print(f"Best penalty: {best_penalty}")
print(f"Best max F1 DT: {round(best_f1_score_LR, 4)}")
print(f"ROC AUC: {roc_auc_score(y_o_test, threshold_predictions_LR)}")
print()

Best penalty: l1
Best max F1 DT: 0.4065
ROC AUC: 0.5



### **Cambiando el umbral**

In [71]:
#Cambiando el umbral
turned_model_LR = LogisticRegression(random_state=random_seed,
                                     solver="liblinear",
                                     penalty=best_penalty,
                                     max_iter = 10000,
                                     class_weight="balanced")
turned_model_LR.fit(X_o_train, y_o_train)

# Predicción
turned_valid_preds_LR = turned_model_LR.predict_proba(X_o_valid)[:,1]

In [72]:
best_threshold_LR = 0
best_f1_score_LR = 0

for t_LR in np.linspace(0, 1, 21):
    f1_score_tmp_LR = f1_score(y_o_valid, turned_valid_preds_LR >= t_LR) #calculando f1_score variando el umbral en cada iteración
    
    if f1_score_tmp_LR > best_f1_score_LR: #si fi.. supera a best...
        best_f1_score_LR = f1_score_tmp_LR
        best_threshold_LR = t_LR # best-- igual al umbral usado en la iteración

print(f"Best threshold: {round(best_threshold_LR, 4)}")

Best threshold: 0.6


In [73]:
print(f"""
Validation results:
F1 score: {f1_score(y_o_valid, turned_valid_preds_LR >= best_threshold_LR)}  
ROC AUC: {roc_auc_score(y_o_valid, turned_valid_preds_LR)} 
""")


Validation results:
F1 score: 0.490280777537797  
ROC AUC: 0.7492628738173999 



### **Modelo final**

In [74]:
#Cambiando el umbral
final_model_LR = LogisticRegression(random_state=random_seed,
                                    solver="liblinear",
                                    penalty=best_penalty,
                                    max_iter = 10000,
                                    class_weight="balanced")
final_model_LR.fit(X_fullo_train, y_fullo_train)

# Predicción
final_test_preds_LR = final_model_LR.predict_proba(X_o_test)[:,1]

In [75]:
print(f"""
Test results:
F1 score: {f1_score(y_o_test, final_test_preds_LR >= best_threshold_LR)}
ROC AUC: {roc_auc_score(y_o_test, final_test_preds_LR)}
""")


Test results:
F1 score: 0.4994882292732856
ROC AUC: 0.763617902217008



## Conclusión general <a id='end'></a>

1. El DataFrame Clientes Megaline, inició con 14 columnas, pero al final se trabajó los modelos con 11 columnas, ya que row_number, customer_id, surname no agregaban valor por no considerarse como características. Además, la columna Ternure tenía un 10% de valores ausentes, el cúal fueron reemplazados con la moda de la categorización de cada año el cual ha madurado el depósito a plazo fijo de un cliente. Y por último, el DataFrame no tenía valores duplicados.
2. Se examinó el equilibrio de clases t se evidencio que se tenía un imbalance del 20%. Se entrenaron los modelos árbol de decisión, bosque aleatorio y regresión logística y se dividió la data en conjuntos de entrenamiento y validación.
3. Se mejoró la calidad del modelo con los enfoque class_weight, ajuste de umbral y sobremuestreo. Comparando los valores de F1 y AUC-ROC.
4. Al obtener los resultados tras haber mejorado la calidad del modelo, se tomaron los mejores enfoques y se entrenó un último modelo para cada uno. Escogiendo el modelo de bosque aleatorio como el más adecuado ya que se obtuvo un F1 de 0.59.

[Volver a Contenidos](#back)