## Exploratory analysis

### Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.preprocessing import OneHotEncoder, MaxAbsScaler
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metrics
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
from lightgbm import LGBMClassifier
from catboost import Pool, CatBoostClassifier
import optuna



### Data loading

In [9]:
#from google.colab import drive
#drive.mount('/content/drive')

In [5]:
path = '/content/drive/MyDrive/TripleTen Projects/Proyecto Final/final_provider/'

def load_data(path, file_name):
    """
    Load a CSV file, first from the local path and, if it is not found there,
    from the current working directory (as on the TripleTen platform).
    path: directory of the local file
    file_name: name of the file to load, including its extension (str)
    """
    try:
        return pd.read_csv(path + file_name)
    except FileNotFoundError:
        # Fall back to the working directory when the local path is unavailable
        return pd.read_csv(file_name)

In [6]:
# Load the data
contract = load_data(path,'contract.csv')
internet = load_data(path,'internet.csv')
personal = load_data(path,'personal.csv')
phone = load_data(path,'phone.csv')

### Exploratory data analysis

In [10]:
def explore(data):
    """
    Quick overview of a dataset. It shows:
    - The first rows of the dataset
    - The data types
    - Missing values
    - Duplicated rows
    - Descriptive statistics of the numeric variables, if there are any
    """
    display(data.head(10))
    print('Tipo de datos')
    print(data.info())
    print()
    print('Valores ausentes')
    print(data.isna().sum())
    print()
    print('Valores duplicados')
    print(data.duplicated().sum())
    print()
    print('Valores de variables númericas')
    print(data.describe())

#### "contract" dataset

In [11]:
explore(contract)

Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65
5,9305-CDSKC,2019-03-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,99.65,820.5
6,1452-KIOVK,2018-04-01,No,Month-to-month,Yes,Credit card (automatic),89.1,1949.4
7,6713-OKOMC,2019-04-01,No,Month-to-month,No,Mailed check,29.75,301.9
8,7892-POOKP,2017-07-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,104.8,3046.05
9,6388-TABGU,2014-12-01,No,One year,No,Bank transfer (automatic),56.15,3487.95


Tipo de datos
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
dtypes: float64(1), object(7)
memory usage: 440.3+ KB
None

Valores ausentes
customerID          0
BeginDate           0
EndDate             0
Type                0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
dtype: int64

Valores duplicados
0

Valores de variables númericas
       MonthlyCharges
count     7043.000000
mean        64.761692
std         30.090047
[output truncated]

This dataset contains:
* **"customerID"** - the customer's ID
* **"BeginDate"** - the date the customer started service with the company
* **"EndDate"** - the contract end date, or "No" if the contract is still active
* **"Type"** - the contract type
* **"PaperlessBilling"** - whether the customer uses paperless billing
* **"PaymentMethod"** - the payment method
* **"MonthlyCharges"** - the monthly charge
* **"TotalCharges"** - the total amount charged to the customer

The dataset has 7043 rows. Several columns are stored with the wrong type: the date columns ("BeginDate", "EndDate") and the numeric column "TotalCharges" are all of type object, so they need to be converted.

Among the numeric variables, "MonthlyCharges" shows no obviously abnormal values, so we treat it as correct.

We will assess later whether the categorical variables are useful.

There are no explicit missing values (NaN) and no duplicated rows.

#### "internet" dataset

In [12]:
explore(internet)

Unnamed: 0,customerID,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
0,7590-VHVEG,DSL,No,Yes,No,No,No,No
1,5575-GNVDE,DSL,Yes,No,Yes,No,No,No
2,3668-QPYBK,DSL,Yes,Yes,No,No,No,No
3,7795-CFOCW,DSL,Yes,No,Yes,Yes,No,No
4,9237-HQITU,Fiber optic,No,No,No,No,No,No
5,9305-CDSKC,Fiber optic,No,No,Yes,No,Yes,Yes
6,1452-KIOVK,Fiber optic,No,Yes,No,No,Yes,No
7,6713-OKOMC,DSL,Yes,No,No,No,No,No
8,7892-POOKP,Fiber optic,No,No,Yes,Yes,Yes,Yes
9,6388-TABGU,DSL,Yes,Yes,No,No,No,No


Tipo de datos
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5517 entries, 0 to 5516
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   customerID        5517 non-null   object
 1   InternetService   5517 non-null   object
 2   OnlineSecurity    5517 non-null   object
 3   OnlineBackup      5517 non-null   object
 4   DeviceProtection  5517 non-null   object
 5   TechSupport       5517 non-null   object
 6   StreamingTV       5517 non-null   object
 7   StreamingMovies   5517 non-null   object
dtypes: object(8)
memory usage: 344.9+ KB
None

Valores ausentes
customerID          0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
dtype: int64

Valores duplicados
0

Valores de variables númericas
        customerID InternetService OnlineSecurity OnlineBackup  \
[output truncated]

This dataset contains:
* **"customerID"** - the customer's ID
* **"InternetService"** - the type of internet service
* **"OnlineSecurity"** - whether the customer has online security
* **"OnlineBackup"** - whether the customer has online backup
* **"DeviceProtection"** - whether the customer has device protection
* **"TechSupport"** - whether the customer has technical support
* **"StreamingTV"** - whether the customer has streaming TV
* **"StreamingMovies"** - whether the customer has streaming movies

It has only 5517 rows, compared with 7043 in the "contract" dataset, so there are customers for whom we have no internet-service information.

The data types are correct: every column holds text, so the object type is appropriate.

All variables are categorical, most of them with only Yes/No values.

There are no missing values and no duplicated rows.
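
The gap between the row counts can be checked directly by looking for customer IDs that appear in one table but not in the other. A minimal sketch, using two small made-up frames in place of "contract" and "internet" (the names `contract_demo` and `internet_demo`, the IDs, and the values are illustrative, not taken from the data):

```python
import pandas as pd

# Toy stand-ins for the real tables (values are illustrative only)
contract_demo = pd.DataFrame({"customerID": ["A1", "B2", "C3", "D4"]})
internet_demo = pd.DataFrame({"customerID": ["A1", "C3"],
                              "InternetService": ["DSL", "Fiber optic"]})

# Customers present in contract but absent from internet
missing_mask = ~contract_demo["customerID"].isin(internet_demo["customerID"])
missing_ids = contract_demo.loc[missing_mask, "customerID"].tolist()

print(len(contract_demo) - len(internet_demo))  # difference in row counts
print(missing_ids)                              # IDs with no internet record
```

The same check applies to any table that has fewer rows than "contract".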

#### "personal" dataset

In [13]:
explore(personal)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents
0,7590-VHVEG,Female,0,Yes,No
1,5575-GNVDE,Male,0,No,No
2,3668-QPYBK,Male,0,No,No
3,7795-CFOCW,Male,0,No,No
4,9237-HQITU,Female,0,No,No
5,9305-CDSKC,Female,0,No,No
6,1452-KIOVK,Male,0,No,Yes
7,6713-OKOMC,Female,0,No,No
8,7892-POOKP,Female,0,Yes,No
9,6388-TABGU,Male,0,No,Yes


Tipo de datos
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     7043 non-null   object
 1   gender         7043 non-null   object
 2   SeniorCitizen  7043 non-null   int64 
 3   Partner        7043 non-null   object
 4   Dependents     7043 non-null   object
dtypes: int64(1), object(4)
memory usage: 275.2+ KB
None

Valores ausentes
customerID       0
gender           0
SeniorCitizen    0
Partner          0
Dependents       0
dtype: int64

Valores duplicados
0

Valores de variables númericas
       SeniorCitizen
count    7043.000000
mean        0.162147
std         0.368612
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000


This dataset contains:

* **"customerID"** - the customer's ID
* **"gender"** - the customer's gender
* **"SeniorCitizen"** - whether the customer is a senior citizen (0/1)
* **"Partner"** - whether the customer has a partner
* **"Dependents"** - whether the customer has dependents

It has 7043 rows, the same number as the "contract" dataset, so this information appears to be complete for every customer.

All variables are categorical; "SeniorCitizen" is already encoded as 0/1.

The data types are correct.

There are no missing values and no duplicated rows.

#### "phone" dataset

In [14]:
explore(phone)

Unnamed: 0,customerID,MultipleLines
0,5575-GNVDE,No
1,3668-QPYBK,No
2,9237-HQITU,No
3,9305-CDSKC,Yes
4,1452-KIOVK,Yes
5,7892-POOKP,Yes
6,6388-TABGU,No
7,9763-GRSKD,No
8,7469-LKBCI,No
9,8091-TTVAX,Yes


Tipo de datos
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6361 entries, 0 to 6360
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     6361 non-null   object
 1   MultipleLines  6361 non-null   object
dtypes: object(2)
memory usage: 99.5+ KB
None

Valores ausentes
customerID       0
MultipleLines    0
dtype: int64

Valores duplicados
0

Valores de variables númericas
        customerID MultipleLines
count         6361          6361
unique        6361             2
top     5575-GNVDE            No
freq             1          3390


This dataset contains:

* **"customerID"** - the customer's ID
* **"MultipleLines"** - whether the customer has multiple lines

It has 6361 rows, fewer than the 7043 in the "contract" dataset, so phone information is not available for every customer.

All variables are categorical.

The data types are correct.

There are no missing values and no duplicated rows.

## Data preparation and correction

### Data type correction

We convert the column names to lowercase for easier handling.

In [15]:
def lower_col(data):
    """
    Convert the column names to lowercase (in place). Arguments:
    data: dataset to modify
    """
    new_names = []
    for x in data.columns:
        new_names.append(x.lower())
    data.columns = new_names

In [16]:
# Put our 4 datasets in a list in case we need to run the same operation on all of them
datasets = [contract,internet,personal,phone]

# Apply "lower_col" to the 4 datasets
for x in datasets:
    lower_col(x)
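
The loop over column names can also be written with the pandas string accessor. A short sketch on a made-up frame (`demo` is a hypothetical name used only for illustration):

```python
import pandas as pd

demo = pd.DataFrame({"customerID": ["A1"], "BeginDate": ["2020-01-01"]})

# Vectorized equivalent of looping over the column names
demo.columns = demo.columns.str.lower()

print(list(demo.columns))
```

Both approaches modify the frame in place, so the list of datasets keeps pointing at the renamed tables.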

In [17]:
for x in datasets:
    display(x.head())

Unnamed: 0,customerid,begindate,enddate,type,paperlessbilling,paymentmethod,monthlycharges,totalcharges
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65


Unnamed: 0,customerid,internetservice,onlinesecurity,onlinebackup,deviceprotection,techsupport,streamingtv,streamingmovies
0,7590-VHVEG,DSL,No,Yes,No,No,No,No
1,5575-GNVDE,DSL,Yes,No,Yes,No,No,No
2,3668-QPYBK,DSL,Yes,Yes,No,No,No,No
3,7795-CFOCW,DSL,Yes,No,Yes,Yes,No,No
4,9237-HQITU,Fiber optic,No,No,No,No,No,No


Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents
0,7590-VHVEG,Female,0,Yes,No
1,5575-GNVDE,Male,0,No,No
2,3668-QPYBK,Male,0,No,No
3,7795-CFOCW,Male,0,No,No
4,9237-HQITU,Female,0,No,No


Unnamed: 0,customerid,multiplelines
0,5575-GNVDE,No
1,3668-QPYBK,No
2,9237-HQITU,No
3,9305-CDSKC,Yes
4,1452-KIOVK,Yes


The column names of our datasets are now lowercase.

Next we convert the data types where needed across our 4 datasets.

In [18]:
# Convert the start date to datetime
contract['begindate'] = pd.to_datetime(contract['begindate'], format='%Y-%m-%d')

# We leave 'enddate' unchanged for now; it is only used as the reference for building the target

# Convert to a numeric type; values that cannot be parsed become NaN
contract['totalcharges'] = pd.to_numeric(contract['totalcharges'],errors='coerce')

In [19]:
# Check for missing values
print('Datos ausentes')
print(contract['totalcharges'].isna().sum())
print()
print('Porcentraje de datos ausentes')
print(round((contract['totalcharges'].isna().sum()/len(contract))*100,2),'%')

Datos ausentes
11

Porcentraje de datos ausentes
0.16 %


The missing values are a negligible share of the data (0.16 %), so we replace them with the column mean.

In [20]:
# Assign the result back instead of using inplace=True on a column selection
contract['totalcharges'] = contract['totalcharges'].fillna(contract['totalcharges'].mean())
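
The conversion and imputation steps can be seen on a tiny example. This is only a sketch: the values are illustrative, and the blank string is an assumption about what an unparseable entry looks like, not something confirmed from the data.

```python
import pandas as pd

# Illustrative values; the blank string stands in for an unparseable entry
charges = pd.Series(["29.85", "1889.5", " ", "108.15"])

numeric = pd.to_numeric(charges, errors="coerce")  # " " becomes NaN
n_missing = int(numeric.isna().sum())
filled = numeric.fillna(numeric.mean())            # mean() ignores NaN

print(n_missing)
print(filled.tolist())
```

Because `mean()` skips NaN, the imputed value is the average of the entries that could be parsed.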

### Creating the target column

We create the column that holds our target value.

In [21]:
# Function that derives the target: 0 if the contract is still active, 1 if it has ended
def objetivo(row):
    if row['enddate'] == 'No':
        return 0
    else:
        return 1

# Apply the function to create the new target column
contract['churn'] = contract.apply(objetivo,axis=1)
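
The same target can be built without a row-wise function. A minimal sketch on made-up rows (`demo`, `by_apply`, and `by_vector` are illustrative names), checking that the vectorized comparison matches the row-wise logic:

```python
import pandas as pd

demo = pd.DataFrame({"enddate": ["No", "2019-12-01 00:00:00", "No"]})

# Row-wise version, mirroring the function above
by_apply = demo.apply(lambda row: 0 if row["enddate"] == "No" else 1, axis=1)

# Vectorized version: True where the contract has an end date
by_vector = (demo["enddate"] != "No").astype("int64")

print(by_vector.tolist())
```

On a table of this size the difference is small, but the vectorized form avoids calling a Python function once per row.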

In [22]:
contract.head(10)

Unnamed: 0,customerid,begindate,enddate,type,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85,0
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5,0
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15,1
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75,0
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65,1
5,9305-CDSKC,2019-03-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,99.65,820.5,1
6,1452-KIOVK,2018-04-01,No,Month-to-month,Yes,Credit card (automatic),89.1,1949.4,0
7,6713-OKOMC,2019-04-01,No,Month-to-month,No,Mailed check,29.75,301.9,0
8,7892-POOKP,2017-07-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,104.8,3046.05,1
9,6388-TABGU,2014-12-01,No,One year,No,Bank transfer (automatic),56.15,3487.95,0


### Creating the days-of-service column/variable

We create a column with the number of days between the start and end dates. To do this we first rewrite "enddate": where its value is "No", we use the last recorded date, February 1, 2020 (2020-02-01).

In [23]:
# Function that replaces 'No' in 'enddate' with the last recorded date
def timetoint(row):
    if row['enddate'] == 'No':
        return '2020-02-01 00:00:00'
    else:
        return row['enddate']

# Apply the function to overwrite the 'enddate' column
contract['enddate'] = contract.apply(timetoint,axis=1)

In [24]:
# Convert the end date to datetime; the values include a time component
contract['enddate'] = pd.to_datetime(contract['enddate'], format='%Y-%m-%d %H:%M:%S')

In [25]:
# Create the days-of-service variable
contract['diff_days'] = contract['enddate'] - contract['begindate']

In [26]:
contract['diff_days'] = contract['diff_days'].dt.days.astype('int16')
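
The whole days-of-service calculation can be sketched on two of the rows shown earlier. `SNAPSHOT` and `demo` are illustrative names; the dates come from the first rows of "contract".

```python
import pandas as pd

SNAPSHOT = "2020-02-01 00:00:00"  # last recorded date, as stated in the text

demo = pd.DataFrame({
    "begindate": pd.to_datetime(["2020-01-01", "2019-10-01"]),
    "enddate": ["No", "2019-12-01 00:00:00"],
})

# Replace "No" with the snapshot date, then parse everything as datetime
end = pd.to_datetime(demo["enddate"].replace("No", SNAPSHOT),
                     format="%Y-%m-%d %H:%M:%S")

# Timedelta -> whole days
demo["diff_days"] = (end - demo["begindate"]).dt.days

print(demo["diff_days"].tolist())  # [31, 61]
```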

In [27]:
contract.head()

Unnamed: 0,customerid,begindate,enddate,type,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn,diff_days
0,7590-VHVEG,2020-01-01,2020-02-01,Month-to-month,Yes,Electronic check,29.85,29.85,0,31
1,5575-GNVDE,2017-04-01,2020-02-01,One year,No,Mailed check,56.95,1889.5,0,1036
2,3668-QPYBK,2019-10-01,2019-12-01,Month-to-month,Yes,Mailed check,53.85,108.15,1,61
3,7795-CFOCW,2016-05-01,2020-02-01,One year,No,Bank transfer (automatic),42.3,1840.75,0,1371
4,9237-HQITU,2019-09-01,2019-11-01,Month-to-month,Yes,Electronic check,70.7,151.65,1,61


### Merging the datasets

To combine our data into a single dataset we join on the "customerid" column. We know that not every dataset has a row for every customer, but that is not a problem for now, so we proceed with the merge.

In [28]:
# Build the combined dataset; how='outer' lets us verify that we end up with only the 7043
# unique customers
df = pd.merge(contract,internet,how='outer',on='customerid')
df = pd.merge(df,personal,how='outer',on='customerid')
df = pd.merge(df,phone,how='outer',on='customerid')
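
`pd.merge` has two optional parameters that make this kind of check explicit: `validate` and `indicator`. A hedged sketch on made-up tables (`left` and `right` are illustrative, not the real datasets):

```python
import pandas as pd

left = pd.DataFrame({"customerid": ["A1", "B2", "C3"], "type": ["One year"] * 3})
right = pd.DataFrame({"customerid": ["A1", "C3"], "multiplelines": ["No", "Yes"]})

merged = pd.merge(
    left, right,
    how="outer",
    on="customerid",
    validate="one_to_one",  # raises MergeError if an ID is duplicated on either side
    indicator=True,         # adds a "_merge" column: left_only / right_only / both
)

print(len(merged))
print(merged["_merge"].value_counts().to_dict())
```

Rows marked `left_only` are the customers that have no record in the right-hand table, which is where the missing values in the merged frame come from.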

In [29]:
display(df.head())
print(df.info())

Unnamed: 0,customerid,begindate,enddate,type,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn,diff_days,...,onlinebackup,deviceprotection,techsupport,streamingtv,streamingmovies,gender,seniorcitizen,partner,dependents,multiplelines
0,7590-VHVEG,2020-01-01,2020-02-01,Month-to-month,Yes,Electronic check,29.85,29.85,0,31,...,Yes,No,No,No,No,Female,0,Yes,No,
1,5575-GNVDE,2017-04-01,2020-02-01,One year,No,Mailed check,56.95,1889.5,0,1036,...,No,Yes,No,No,No,Male,0,No,No,No
2,3668-QPYBK,2019-10-01,2019-12-01,Month-to-month,Yes,Mailed check,53.85,108.15,1,61,...,Yes,No,No,No,No,Male,0,No,No,No
3,7795-CFOCW,2016-05-01,2020-02-01,One year,No,Bank transfer (automatic),42.3,1840.75,0,1371,...,No,Yes,Yes,No,No,Male,0,No,No,
4,9237-HQITU,2019-09-01,2019-11-01,Month-to-month,Yes,Electronic check,70.7,151.65,1,61,...,No,No,No,No,No,Female,0,No,No,No


<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   customerid        7043 non-null   object        
 1   begindate         7043 non-null   datetime64[ns]
 2   enddate           7043 non-null   datetime64[ns]
 3   type              7043 non-null   object        
 4   paperlessbilling  7043 non-null   object        
 5   paymentmethod     7043 non-null   object        
 6   monthlycharges    7043 non-null   float64       
 7   totalcharges      7043 non-null   float64       
 8   churn             7043 non-null   int64         
 9   diff_days         7043 non-null   int16         
 10  internetservice   5517 non-null   object        
 11  onlinesecurity    5517 non-null   object        
 12  onlinebackup      5517 non-null   object        
 13  deviceprotection  5517 non-null   object        
 14  techsupport       5517 non-null   object        
[output truncated]

This confirms that we have only 7043 unique customer records according to "customerid".

Our 4 datasets are now in a single table. Some columns contain unknown (missing) values; we set them aside for the moment.

The combined dataset has 22 columns. For our final goal and for model training, some of them are not needed or would add no value to the model: customerid, begindate, enddate (which would also leak the target). We drop them right away.

In [30]:
# Drop the columns mentioned above
df.drop(['customerid','begindate','enddate'],axis=1,inplace=True)

In [31]:
display(df.head())
print(df.info())

Unnamed: 0,type,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn,diff_days,internetservice,onlinesecurity,onlinebackup,deviceprotection,techsupport,streamingtv,streamingmovies,gender,seniorcitizen,partner,dependents,multiplelines
0,Month-to-month,Yes,Electronic check,29.85,29.85,0,31,DSL,No,Yes,No,No,No,No,Female,0,Yes,No,
1,One year,No,Mailed check,56.95,1889.5,0,1036,DSL,Yes,No,Yes,No,No,No,Male,0,No,No,No
2,Month-to-month,Yes,Mailed check,53.85,108.15,1,61,DSL,Yes,Yes,No,No,No,No,Male,0,No,No,No
3,One year,No,Bank transfer (automatic),42.3,1840.75,0,1371,DSL,Yes,No,Yes,Yes,No,No,Male,0,No,No,
4,Month-to-month,Yes,Electronic check,70.7,151.65,1,61,Fiber optic,No,No,No,No,No,No,Female,0,No,No,No


<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   type              7043 non-null   object 
 1   paperlessbilling  7043 non-null   object 
 2   paymentmethod     7043 non-null   object 
 3   monthlycharges    7043 non-null   float64
 4   totalcharges      7043 non-null   float64
 5   churn             7043 non-null   int64  
 6   diff_days         7043 non-null   int16  
 7   internetservice   5517 non-null   object 
 8   onlinesecurity    5517 non-null   object 
 9   onlinebackup      5517 non-null   object 
 10  deviceprotection  5517 non-null   object 
 11  techsupport       5517 non-null   object 
 12  streamingtv       5517 non-null   object 
 13  streamingmovies   5517 non-null   object 
 14  gender            7043 non-null   object 
 15  seniorcitizen     7043 non-null   int64  
 16  partner           7043 non-null   object 


Our dataset is now about halfway ready to be used in a machine learning model.

In machine learning terms we have 3 numeric variables, 15 categorical variables, and 1 target variable:
- We will one-hot encode the 15 categorical variables; "seniorcitizen" is already coded as 0/1, so encoding it leaves its values effectively unchanged
- We will scale the numeric variables: monthlycharges, totalcharges, diff_days
- We leave the "churn" column as is, since it is our target and needs no changes
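
The split into categorical, numeric, and target columns can also be derived from the dtypes instead of being typed by hand. A minimal sketch on a made-up frame, assuming (as in the list above) that "seniorcitizen" is treated as categorical even though it is stored as an integer; `demo`, `categorical_cols`, and `numerical_cols` are illustrative names:

```python
import pandas as pd

demo = pd.DataFrame({
    "type": ["One year", "Month-to-month"],
    "monthlycharges": [56.95, 29.85],
    "seniorcitizen": [0, 1],
    "churn": [0, 1],
})

target_col = "churn"
# Non-numeric columns are categorical; "seniorcitizen" is numeric but binary,
# so it is added by hand
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist() + ["seniorcitizen"]
numerical_cols = [c for c in demo.columns
                  if c not in categorical_cols and c != target_col]

print(categorical_cols)
print(numerical_cols)
```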

In [32]:
# Define our variables in lists
categorical = ['type', 'paymentmethod', 'internetservice', 'onlinesecurity',
       'onlinebackup', 'deviceprotection', 'techsupport', 'streamingtv',
       'streamingmovies', 'gender', 'seniorcitizen', 'partner', 'dependents',
       'multiplelines','paperlessbilling', ]
numerical = ['monthlycharges',
       'totalcharges','diff_days']

In [33]:
# Quickly review the unique values of the categorical variables to check for erroneous data
for x in categorical:
    print('Valores unicos de:',x)
    print(df[x].sort_values().unique())
    print()

Valores unicos de: type
['Month-to-month' 'One year' 'Two year']

Valores unicos de: paymentmethod
['Bank transfer (automatic)' 'Credit card (automatic)' 'Electronic check'
 'Mailed check']

Valores unicos de: internetservice
['DSL' 'Fiber optic' nan]

Valores unicos de: onlinesecurity
['No' 'Yes' nan]

Valores unicos de: onlinebackup
['No' 'Yes' nan]

Valores unicos de: deviceprotection
['No' 'Yes' nan]

Valores unicos de: techsupport
['No' 'Yes' nan]

Valores unicos de: streamingtv
['No' 'Yes' nan]

Valores unicos de: streamingmovies
['No' 'Yes' nan]

Valores unicos de: gender
['Female' 'Male']

Valores unicos de: seniorcitizen
[0 1]

Valores unicos de: partner
['No' 'Yes']

Valores unicos de: dependents
['No' 'Yes']

Valores unicos de: multiplelines
['No' 'Yes' nan]

Valores unicos de: paperlessbilling
['No' 'Yes']



The "paymentmethod" column has 4 categories. We can reduce them to two, since two of them are automatic payments and the other two are check payments, so we group them accordingly.

In [34]:
# Function that groups the payment methods into two categories
def col_paymentmethod(row):
    if row['paymentmethod'] == 'Bank transfer (automatic)':
        return 'automatic'
    if row['paymentmethod'] == 'Credit card (automatic)':
        return 'automatic'
    else:
        return 'check'

# Apply the function to overwrite the 'paymentmethod' column
df['paymentmethod'] = df.apply(col_paymentmethod,axis=1)

# Check the unique values again
print(df['paymentmethod'].sort_values().unique())

['automatic' 'check']
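
The same grouping can be expressed as an explicit mapping. A short sketch using the four labels listed above (`payment`, `groups`, and `grouped` are illustrative names):

```python
import pandas as pd

payment = pd.Series([
    "Bank transfer (automatic)", "Credit card (automatic)",
    "Electronic check", "Mailed check",
])

# Explicit mapping: any label missing from the dict would become NaN
groups = {
    "Bank transfer (automatic)": "automatic",
    "Credit card (automatic)": "automatic",
    "Electronic check": "check",
    "Mailed check": "check",
}
grouped = payment.map(groups)

print(grouped.tolist())
```

One difference in behavior is worth noting: with a dictionary, an unexpected label shows up as NaN, whereas the `else` branch of the function would silently assign it to "check".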


The "type" column has 3 categories. We can reduce them to two: two of them are yearly contracts (one and two years) and one is monthly, so we group the yearly ones.

In [35]:
# Function that groups the contract types into two categories
def col_type(row):
    if row['type'] == 'Month-to-month':
        return 'monthly'
    else:
        return 'anual'

# Apply the function to overwrite the 'type' column
df['type'] = df.apply(col_type,axis=1)

# Check the unique values again
print(df['type'].sort_values().unique())

['anual' 'monthly']


In [36]:
df.fillna('unkn',inplace=True)

The listing above shows the unique values of each categorical variable, including missing values (NaN) in the columns that come from the "internet" and "phone" datasets. We fill them with the placeholder "unkn" so that, when the variables are encoded, "unknown" becomes an extra category of its own.

### One-hot encoding

In [37]:
# Create the encoder and transform the data
# (sparse_output requires scikit-learn >= 1.2; older versions use sparse=False)
ohe = OneHotEncoder(sparse_output = False, drop='if_binary')
ohe_data = pd.DataFrame(ohe.fit_transform(df[categorical]), columns = ohe.get_feature_names_out())

In [38]:
ohe_data

Unnamed: 0,type_monthly,paymentmethod_check,internetservice_DSL,internetservice_Fiber optic,internetservice_unkn,onlinesecurity_No,onlinesecurity_Yes,onlinesecurity_unkn,onlinebackup_No,onlinebackup_Yes,...,streamingmovies_Yes,streamingmovies_unkn,gender_Male,seniorcitizen_1,partner_Yes,dependents_Yes,multiplelines_No,multiplelines_Yes,multiplelines_unkn,paperlessbilling_Yes
0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
1,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0
7039,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0
7040,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0
7041,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0


### Data scaling

In [39]:
# Create the scaler and transform the data
scaler = MaxAbsScaler()
scal_data = pd.DataFrame(scaler.fit_transform(df[numerical]),columns=df[numerical].columns)
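
`MaxAbsScaler` divides each column by its largest absolute value, so non-negative columns end up in the range [0, 1]. A minimal sketch with illustrative numbers (`demo`, `scaled`, and `by_hand` are hypothetical names) that checks the scaler against the manual calculation:

```python
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

demo = pd.DataFrame({"monthlycharges": [29.85, 56.95, 118.75],
                     "diff_days": [31, 1036, 2191]})

scaled = MaxAbsScaler().fit_transform(demo)

# Same result by hand: each column divided by its largest absolute value
by_hand = (demo / demo.abs().max()).to_numpy()

print(scaled.round(6))
```

In this notebook the scaler is fitted on the full dataset before the split; fitting it on the training rows only is a common alternative when leakage from the validation set is a concern.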

In [40]:
scal_data

Unnamed: 0,monthlycharges,totalcharges,diff_days
0,0.251368,0.003437,0.014149
1,0.479579,0.217564,0.472843
2,0.453474,0.012453,0.027841
3,0.356211,0.211951,0.625742
4,0.595368,0.017462,0.027841
...,...,...,...
7038,0.714105,0.229194,0.333181
7039,0.869053,0.847792,1.000000
7040,0.249263,0.039892,0.153811
7041,0.626526,0.035303,0.056139


The numeric variables are now scaled. Next we combine the one-hot encoded and scaled datasets.

### Features and target

Our data is ready to be split into features and target.

####  Features

In [41]:
features = pd.concat([ohe_data,scal_data],axis=1)

In [42]:
features

Unnamed: 0,type_monthly,paymentmethod_check,internetservice_DSL,internetservice_Fiber optic,internetservice_unkn,onlinesecurity_No,onlinesecurity_Yes,onlinesecurity_unkn,onlinebackup_No,onlinebackup_Yes,...,seniorcitizen_1,partner_Yes,dependents_Yes,multiplelines_No,multiplelines_Yes,multiplelines_unkn,paperlessbilling_Yes,monthlycharges,totalcharges,diff_days
0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.251368,0.003437,0.014149
1,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.479579,0.217564,0.472843
2,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.453474,0.012453,0.027841
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.356211,0.211951,0.625742
4,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.595368,0.017462,0.027841
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.714105,0.229194,0.333181
7039,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.869053,0.847792,1.000000
7040,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.249263,0.039892,0.153811
7041,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.626526,0.035303,0.056139


####  Target

In [43]:
target = df['churn']

In [44]:
target

0       0
1       0
2       1
3       0
4       1
       ..
7038    0
7039    0
7040    0
7041    1
7042    0
Name: churn, Length: 7043, dtype: int64

Our features and target are ready for training the classification models.

## Modeling

We create a dataframe to compare the different models and their results:

In [45]:
global_results = {'tipo':[],
                  'modelo':[],
                  'AUC-ROC':[],
                  'tiempo_entrenamiento':[]}
global_results = pd.DataFrame(global_results)

Before splitting the data into training and validation sets (4:1), we define a helper that appends each model's results to this table:

In [46]:
def add_results(train_result,valid_result,time,model,table):
    """Append a model's results to the results table: the type of set (train/valid), the
    model name, its AUC-ROC score, and the training time. Arguments:
    train_result: AUC-ROC on the training set
    valid_result: AUC-ROC on the validation set
    time: training time
    model: model name
    table: table to which the results are added
    """
    row1 = pd.DataFrame({'tipo' : ['train'], 'modelo':[model],'AUC-ROC': [train_result],'tiempo_entrenamiento': [time]})
    row2 = pd.DataFrame({'tipo' : ['valid'], 'modelo':[model],'AUC-ROC': [valid_result],'tiempo_entrenamiento': [time]})
    table = pd.concat([table,row1,row2])
    return table.reset_index(drop=True)

### Train/test data split

In [47]:
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.20, random_state=92)

In [48]:
# Check the size of our sets
features_train.shape, features_valid.shape, target_train.shape, target_valid.shape

((5634, 34), (1409, 34), (5634,), (1409,))
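
`train_test_split` also accepts a `stratify` argument, which keeps the class proportions the same in both splits. The notebook does not use it; the following is only a sketch of the option on toy data with a 3:1 class ratio (`X`, `y`, and the split names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a 3:1 class ratio (illustrative)
X = pd.DataFrame({"x": range(100)})
y = pd.Series([0] * 75 + [1] * 25)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.20, random_state=92, stratify=y
)

print(y_train.mean(), y_valid.mean())  # share of class 1 in each split
```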

### Class balancing

Class imbalance can hurt a classification task. Let's look at the current situation and decide whether to oversample or undersample.

In [49]:
features_zeros =  features_train[target_train == 0]
features_ones = features_train[target_train == 1]
target_zeros = target_train[target_train == 0]
target_ones = target_train[target_train == 1]

print(features_zeros.shape)
print(features_ones.shape)
print(target_zeros.shape)
print(target_ones.shape)

(4133, 34)
(1501, 34)
(4133,)
(1501,)


In [50]:
print('Balance entre clases')
print()
print('Porcentaje de "1"')
print(round(len(target_ones)/(len(target_ones)+len(target_zeros))*100,2),'%')
print()
print('Porcentaje de "0"')
print(round(len(target_zeros)/(len(target_ones)+len(target_zeros))*100,2),'%')

Balance entre clases

Porcentaje de "1"
26.64 %

Porcentaje de "0"
73.36 %


The training split has a class imbalance of roughly 3:1 (73.36 % vs 26.64 %). Class imbalance in a classification model can affect our final results, in this case the AUC-ROC score, so we balance the classes by oversampling the rows with target = 1.

In [51]:
def upsample(features, target, repeat):
    """
    Oversample the positive class. Arguments:
    features: feature set
    target: target series
    repeat: number of times the target = 1 rows are repeated
    """
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345
    )

    return features_upsampled, target_upsampled

In [52]:
features_train, target_train = upsample(features_train, target_train, 4)

In [53]:
# Check the split sizes and the class balance once more
features_zeros =  features_train[target_train == 0]
features_ones = features_train[target_train == 1]
target_zeros = target_train[target_train == 0]
target_ones = target_train[target_train == 1]

print(features_zeros.shape)
print(features_ones.shape)
print(target_zeros.shape)
print(target_ones.shape)

(4133, 34)
(6004, 34)
(4133,)
(6004,)
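
The counts above show that `repeat=4` takes the positive class from 1501 to 6004 rows, so class 1 now outnumbers class 0. As a rough check, the sketch below tabulates the class share for a few repeat factors, using only the counts printed earlier. `class_share` is a hypothetical helper introduced for illustration and is not part of the notebook.

```python
# Class counts taken from the training split printed above.
N_ZEROS = 4133
N_ONES = 1501

def class_share(n_zeros, n_ones, repeat):
    """Return (positives after upsampling, share of positives in percent)."""
    ones_up = n_ones * repeat
    share = ones_up / (ones_up + n_zeros) * 100
    return ones_up, round(share, 2)

shares = {repeat: class_share(N_ZEROS, N_ONES, repeat) for repeat in range(1, 5)}
for repeat, (ones_up, share) in shares.items():
    print(f"repeat={repeat}: {ones_up} positives, {share}% of the training rows")

# The repeat factor whose positive share lands closest to 50%.
closest = min(shares, key=lambda r: abs(shares[r][1] - 50))
print("Closest to a 50/50 split:", closest)
```

With these counts, `repeat=3` would land nearer an even split (about 52% positives) than `repeat=4` (about 59%). Whether that difference matters for AUC-ROC is an empirical question that this sketch does not answer.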


### Logistic Regression

In [54]:
start_time = time.time()
model = LogisticRegression(random_state=92)
param = {'solver':['liblinear'],
         'penalty':['l1','l2']}
gs = GridSearchCV(estimator = model,
                  param_grid = param,
                  scoring = 'roc_auc',
                  cv = 5)
gs.fit(features_train,target_train)
end_time = time.time()
training_time = round(end_time - start_time,2)
print('Tiempo de entrenamiento:', training_time,'segundos')
print('Mejor modelo:',gs.best_estimator_)
train_pred = gs.best_estimator_.predict(features_train)
valid_pred = gs.best_estimator_.predict(features_valid)
# AUC-Score
auc_train = roc_auc_score(target_train,train_pred)
print('La puntuación auc-roc del conjunto de entrenamiento es:',auc_train)
auc_valid = roc_auc_score(target_valid,valid_pred)
print('La puntuación auc-roc del conjunto de validación es:',auc_valid)

Tiempo de entrenamiento: 0.5 segundos
Mejor modelo: LogisticRegression(penalty='l1', random_state=92, solver='liblinear')
La puntuación auc-roc del conjunto de entrenamiento es: 0.7606218807592261
La puntuación auc-roc del conjunto de validación es: 0.7562857620181263
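
The cell above passes the output of `predict` (hard 0/1 labels) to `roc_auc_score`. AUC-ROC is a ranking metric, so it is usually computed from continuous scores; in scikit-learn those would typically come from `predict_proba(X)[:, 1]`. The following pure-Python sketch uses made-up toy values, not the notebook's data, to show how the score changes once the probabilities are thresholded. `auc_roc` is a hypothetical helper written for this illustration.

```python
def auc_roc(y_true, scores):
    """AUC-ROC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and made-up probabilities (illustrative only).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.10, 0.30, 0.45, 0.60, 0.40, 0.55, 0.70, 0.90]

# Thresholding at 0.5 mimics what predict() returns.
hard = [1 if p >= 0.5 else 0 for p in proba]

auc_from_proba = auc_roc(y_true, proba)
auc_from_hard = auc_roc(y_true, hard)

# With hard labels the score collapses to the mean of recall and specificity.
tpr = sum(h for y, h in zip(y_true, hard) if y == 1) / y_true.count(1)
tnr = sum(1 - h for y, h in zip(y_true, hard) if y == 0) / y_true.count(0)

print("AUC from probabilities:", auc_from_proba)  # 0.8125
print("AUC from hard labels:  ", auc_from_hard)   # 0.75
print("(TPR + TNR) / 2:       ", (tpr + tnr) / 2)  # 0.75
```

Read this way, the AUC-ROC values reported in this notebook are equivalent to balanced accuracy at the default 0.5 threshold. Scoring the predicted probabilities instead would likely give different numbers.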


In [55]:
global_results = add_results(auc_train,auc_valid,training_time,'LogisticRegression',global_results)

### Logistic Regression - Optuna

In [56]:
def objective(trial):
    # Hyperparameters
    # Note: 'none' is only accepted by older scikit-learn releases; newer ones expect penalty=None
    penalty = trial.suggest_categorical('penalty', ['none', 'l2',])
    solver = trial.suggest_categorical('solver', ['lbfgs', 'saga','newton-cg'])
    C = trial.suggest_float('C',1,100)
    max_iter = trial.suggest_int('max_iter',100,200)

    # Model
    model = LogisticRegression(random_state=92,penalty=penalty,solver=solver,C=C,max_iter=max_iter)
    model.fit(features_train,target_train)
    valid_pred = model.predict(features_valid)
    auc_roc = roc_auc_score(target_valid,valid_pred)
    return auc_roc

# Create the study
start_time = time.time()
study = optuna.create_study(direction='maximize')
study.optimize(objective,n_trials=200)
best_params = study.best_params
# Refit with the best hyperparameters
model = LogisticRegression(random_state=92,penalty=best_params['penalty'],solver=best_params['solver'],
                           C=best_params['C'],max_iter=best_params['max_iter'])
model.fit(features_train,target_train)
end_time = time.time()
training_time = round(end_time - start_time,2)
print('Tiempo de entrenamiento:', training_time,'segundos')
print('Mejores hiperparametros:',study.best_params)
train_pred = model.predict(features_train)
valid_pred = model.predict(features_valid)

# AUC-Score
auc_train = roc_auc_score(target_train,train_pred)
print('La puntuación auc-roc del conjunto de entrenamiento es:',auc_train)
auc_valid = roc_auc_score(target_valid,valid_pred)
print('La puntuación auc-roc del conjunto de validación es:',auc_valid)

[I 2024-04-25 14:18:19,870] A new study created in memory with name: no-name-6a37d712-45bb-49a6-be9a-ffc414258210
[I 2024-04-25 14:18:19,954] Trial 0 finished with value: 0.7572463768115942 and parameters: {'penalty': 'none', 'solver': 'newton-cg', 'C': 67.04110824389258, 'max_iter': 123}. Best is trial 0 with value: 0.7572463768115942.
[I 2024-04-25 14:18:20,022] Trial 1 finished with value: 0.7572463768115942 and parameters: {'penalty': 'none', 'solver': 'newton-cg', 'C': 48.74826409712483, 'max_iter': 168}. Best is trial 0 with value: 0.7572463768115942.
[I 2024-04-25 14:18:20,104] Trial 2 finished with value: 0.7563679885561542 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 42.54124702476374, 'max_iter': 111}. Best is trial 0 with value: 0.7572463768115942.
[I 2024-04-25 14:18:20,189] Trial 3 finished with value: 0.7572463768115942 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 78.67201602568815, 'max_iter': 165}. Best is trial 0 with value: 0.75724637

[I 2024-04-25 14:18:23,559] Trial 21 finished with value: 0.7572463768115942 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 81.2200732592563, 'max_iter': 167}. Best is trial 0 with value: 0.7572463768115942.
[I 2024-04-25 14:18:23,656] Trial 22 finished with value: 0.7572463768115942 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 77.24528253947264, 'max_iter': 176}. Best is trial 0 with value: 0.7572463768115942.
[I 2024-04-25 14:18:23,743] Trial 23 finished with value: 0.7572463768115942 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 86.15703177221268, 'max_iter': 154}. Best is trial 0 with value: 0.7572463768115942.
[I 2024-04-25 14:18:23,837] Trial 24 finished with value: 0.7572463768115942 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 70.45427398952863, 'max_iter': 165}. Best is trial 0 with value: 0.7572463768115942.
[I 2024-04-25 14:18:23,920] Trial 25 finished with value: 0.7572463768115942 and parameters: {'penalty': 

[I 2024-04-25 14:18:26,631] Trial 50 finished with value: 0.7586050724637683 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 3.587632379521218, 'max_iter': 128}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:26,704] Trial 51 finished with value: 0.7586050724637683 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 3.6643637920342385, 'max_iter': 127}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:26,780] Trial 52 finished with value: 0.7578911372843838 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 1.051856915406181, 'max_iter': 129}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:26,857] Trial 53 finished with value: 0.7554073737626864 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 10.694502469276888, 'max_iter': 116}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:26,916] Trial 54 finished with value: 0.7581247650670342 and parameters: {'pen

[I 2024-04-25 14:18:30,496] Trial 85 finished with value: 0.7554073737626864 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 11.40053769282796, 'max_iter': 104}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:30,567] Trial 86 finished with value: 0.7581247650670342 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 5.836441879351441, 'max_iter': 107}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:30,648] Trial 87 finished with value: 0.7554073737626864 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 9.094651919634995, 'max_iter': 115}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:30,702] Trial 88 finished with value: 0.7581247650670342 and parameters: {'penalty': 'l2', 'solver': 'lbfgs', 'C': 4.365530779173653, 'max_iter': 107}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:30,785] Trial 89 finished with value: 0.7563679885561542 and parameters: {'penalty':

[I 2024-04-25 14:18:34,877] Trial 118 finished with value: 0.7581247650670342 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 7.576762113550236, 'max_iter': 105}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:34,960] Trial 119 finished with value: 0.7554073737626864 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 12.050005436687092, 'max_iter': 144}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:35,052] Trial 120 finished with value: 0.7563679885561542 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 17.064811196510128, 'max_iter': 100}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:35,150] Trial 121 finished with value: 0.7554073737626864 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 8.961499711424901, 'max_iter': 109}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:35,228] Trial 122 finished with value: 0.7581247650670342 and parameters: 

[I 2024-04-25 14:18:40,523] Trial 145 finished with value: 0.7586050724637683 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 4.05604312501319, 'max_iter': 113}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:40,595] Trial 146 finished with value: 0.7586050724637683 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 4.122125114919759, 'max_iter': 114}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:40,660] Trial 147 finished with value: 0.7576444576703003 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 2.7499467541079343, 'max_iter': 116}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:40,733] Trial 148 finished with value: 0.7586050724637683 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 3.803799629040068, 'max_iter': 113}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:41,118] Trial 149 finished with value: 0.7581247650670342 and parameters: {'

[I 2024-04-25 14:18:45,365] Trial 176 finished with value: 0.7558876811594203 and parameters: {'penalty': 'l2', 'solver': 'saga', 'C': 19.668810453391615, 'max_iter': 119}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:45,449] Trial 177 finished with value: 0.7576444576703003 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 2.652894064450581, 'max_iter': 123}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:45,528] Trial 178 finished with value: 0.7581247650670342 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 7.144889327047472, 'max_iter': 115}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:45,618] Trial 179 finished with value: 0.7554073737626864 and parameters: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 9.820223038182931, 'max_iter': 122}. Best is trial 50 with value: 0.7586050724637683.
[I 2024-04-25 14:18:45,999] Trial 180 finished with value: 0.7558876811594203 and parameters: {'pena

Tiempo de entrenamiento: 28.0 segundos
Mejores hiperparametros: {'penalty': 'l2', 'solver': 'newton-cg', 'C': 3.587632379521218, 'max_iter': 128}
La puntuación auc-roc del conjunto de entrenamiento es: 0.7603799257628554
La puntuación auc-roc del conjunto de validación es: 0.7586050724637683


In [57]:
global_results = add_results(auc_train,auc_valid,training_time,'LogisticRegression-optuna',global_results)

### Decision Tree Classifier

In [58]:
start_time = time.time()
model = DecisionTreeClassifier(random_state=92)
param = {'criterion':['gini', 'entropy'],
         'max_depth':[None,5,10,20,50],
         'min_samples_split':[2,3,4,5]
        }
gs = GridSearchCV(estimator = model,
                  param_grid = param,
                  scoring = 'roc_auc',
                  cv = 5)
gs.fit(features_train,target_train)
end_time = time.time()
training_time = round(end_time - start_time,2)
print('Tiempo de entrenamiento:', training_time,'segundos')
print('Mejor modelo:',gs.best_estimator_)
train_pred = gs.best_estimator_.predict(features_train)
valid_pred = gs.best_estimator_.predict(features_valid)
# AUC-Score
auc_train = roc_auc_score(target_train,train_pred)
print('La puntuación auc-roc del conjunto de entrenamiento es:',auc_train)
auc_valid = roc_auc_score(target_valid,valid_pred)
print('La puntuación auc-roc del conjunto de validación es:',auc_valid)

Tiempo de entrenamiento: 6.05 segundos
Mejor modelo: DecisionTreeClassifier(max_depth=20, min_samples_split=3, random_state=92)
La puntuación auc-roc del conjunto de entrenamiento es: 0.978706267762777
La puntuación auc-roc del conjunto de validación es: 0.7387532368541955
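
The gap between the training and validation scores above is large. One factor that may contribute (an assumption, not something the notebook measures) is that `upsample` ran before `GridSearchCV`. The cross-validation folds are therefore cut from data that already holds four copies of each positive row, and copies of one row can sit on both sides of a fold boundary. The sketch below estimates how often that happens, using only the class counts from the earlier cells and plain shuffled folds as a simplification.

```python
import random

# Class counts and repeat factor taken from the cells above.
N_ZEROS, N_ONES, REPEAT, N_FOLDS = 4133, 1501, 4, 5

# Each entry is an original row id; positives appear REPEAT times after upsampling.
rows = [("neg", i) for i in range(N_ZEROS)]
rows += [("pos", i) for i in range(N_ONES)] * REPEAT

random.seed(12345)
random.shuffle(rows)

# Contiguous, near-equal folds over the shuffled rows. This is a simplification:
# GridSearchCV with cv=5 uses stratified folds for classifiers.
fold_size = len(rows) // N_FOLDS
folds = [rows[k * fold_size:(k + 1) * fold_size] for k in range(N_FOLDS - 1)]
folds.append(rows[(N_FOLDS - 1) * fold_size:])

leaked = total = 0
for k, valid_fold in enumerate(folds):
    # All row ids present in the other folds, i.e. the training part of this split.
    train_ids = {row for j, fold in enumerate(folds) if j != k for row in fold}
    positives = [row for row in valid_fold if row[0] == "pos"]
    total += len(positives)
    leaked += sum(row in train_ids for row in positives)

leak_share = leaked / total
print(f"{leak_share:.1%} of positive validation rows also have a copy in the training folds")
```

If most positive validation rows have a twin in the training folds, the cross-validated score tends to reward models that memorize rows, such as deep trees. Handling the imbalance inside each training fold, or through `class_weight`, would avoid this overlap.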


In [59]:
global_results = add_results(auc_train,auc_valid,training_time,'DecisionTreeClassifier',global_results)

### Decision Tree Classifier - Optuna

In [60]:
def objective(trial):
    # Hyperparameters
    criterion = trial.suggest_categorical('criterion', ['gini', 'entropy'])
    splitter = trial.suggest_categorical('splitter', ['random', 'best'])
    # Note: 'auto' was removed from newer scikit-learn releases; for classifiers it behaved like 'sqrt'
    max_features = trial.suggest_categorical('max_features', ['sqrt', 'log2','auto'])
    max_depth = trial.suggest_int('max_depth',1,20)
    min_samples_split = trial.suggest_int('min_samples_split',2,20)
    min_samples_leaf = trial.suggest_int('min_samples_leaf',1,20)
    min_weight_fraction_leaf = trial.suggest_float('min_weight_fraction_leaf',0.0,0.5)

    # Model
    model = DecisionTreeClassifier(random_state=92,criterion=criterion,splitter=splitter,max_features=max_features,
                                   max_depth=max_depth,min_samples_split=min_samples_split,
                                   min_samples_leaf=min_samples_leaf,min_weight_fraction_leaf=min_weight_fraction_leaf)
    model.fit(features_train,target_train)
    valid_pred = model.predict(features_valid)
    auc_roc = roc_auc_score(target_valid,valid_pred)
    return auc_roc

# Create the study
start_time = time.time()
study = optuna.create_study(direction='maximize')
study.optimize(objective,n_trials=200)
best_params = study.best_params
# Refit with the best hyperparameters
model = DecisionTreeClassifier(random_state=92,criterion=best_params['criterion'],splitter=best_params['splitter'],
                               max_features=best_params['max_features'],max_depth=best_params['max_depth'],
                               min_samples_split=best_params['min_samples_split'],min_samples_leaf=best_params['min_samples_leaf'],
                               min_weight_fraction_leaf=best_params['min_weight_fraction_leaf'])
model.fit(features_train,target_train)
end_time = time.time()
training_time = round(end_time - start_time,2)
print('Tiempo de entrenamiento:', training_time,'segundos')
print('Mejores hiperparametros:',study.best_params)
train_pred = model.predict(features_train)
valid_pred = model.predict(features_valid)

# AUC-Score
auc_train = roc_auc_score(target_train,train_pred)
print('La puntuación auc-roc del conjunto de entrenamiento es:',auc_train)
auc_valid = roc_auc_score(target_valid,valid_pred)
print('La puntuación auc-roc del conjunto de validación es:',auc_valid)

[I 2024-04-25 14:18:53,986] A new study created in memory with name: no-name-fdde15da-b668-408f-b877-0236d578d5d0
[I 2024-04-25 14:18:53,993] Trial 0 finished with value: 0.6183644488994696 and parameters: {'criterion': 'entropy', 'splitter': 'random', 'max_features': 'sqrt', 'max_depth': 2, 'min_samples_split': 10, 'min_samples_leaf': 3, 'min_weight_fraction_leaf': 0.2409393921876627}. Best is trial 0 with value: 0.6183644488994696.
[I 2024-04-25 14:18:54,001] Trial 1 finished with value: 0.6439812784529926 and parameters: {'criterion': 'entropy', 'splitter': 'best', 'max_features': 'auto', 'max_depth': 13, 'min_samples_split': 15, 'min_samples_leaf': 3, 'min_weight_fraction_leaf': 0.2219904006814748}. Best is trial 1 with value: 0.6439812784529926.
[I 2024-04-25 14:18:54,008] Trial 2 finished with value: 0.6655076640354175 and parameters: {'criterion': 'entropy', 'splitter': 'best', 'max_features': 'log2', 'max_depth': 2, 'min_samples_split': 13, 'min_samples_leaf': 13, 'min_weight_f

[I 2024-04-25 14:18:54,472] Trial 26 finished with value: 0.740626174664829 and parameters: {'criterion': 'entropy', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 3, 'min_samples_split': 8, 'min_samples_leaf': 8, 'min_weight_fraction_leaf': 0.10324607833365154}. Best is trial 11 with value: 0.740626174664829.
[I 2024-04-25 14:18:54,495] Trial 27 finished with value: 0.6183644488994696 and parameters: {'criterion': 'entropy', 'splitter': 'random', 'max_features': 'sqrt', 'max_depth': 7, 'min_samples_split': 11, 'min_samples_leaf': 10, 'min_weight_fraction_leaf': 0.20318060873790944}. Best is trial 11 with value: 0.740626174664829.
[I 2024-04-25 14:18:54,518] Trial 28 finished with value: 0.6183644488994696 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'auto', 'max_depth': 1, 'min_samples_split': 6, 'min_samples_leaf': 5, 'min_weight_fraction_leaf': 0.1581690778284105}. Best is trial 11 with value: 0.740626174664829.
[I 2024-04-25 14:18:54,543] 

[I 2024-04-25 14:18:55,159] Trial 52 finished with value: 0.7271514638934136 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 17, 'min_samples_split': 8, 'min_samples_leaf': 14, 'min_weight_fraction_leaf': 0.002352143887745386}. Best is trial 49 with value: 0.7470346238984255.
[I 2024-04-25 14:18:55,187] Trial 53 finished with value: 0.740626174664829 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 16, 'min_samples_split': 7, 'min_samples_leaf': 12, 'min_weight_fraction_leaf': 0.028117311468074863}. Best is trial 49 with value: 0.7470346238984255.
[I 2024-04-25 14:18:55,216] Trial 54 finished with value: 0.740626174664829 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 13, 'min_samples_split': 9, 'min_samples_leaf': 13, 'min_weight_fraction_leaf': 0.027262229126708264}. Best is trial 49 with value: 0.7470346238984255.
[I 2024-04-25 14:18:55,2

[I 2024-04-25 14:18:55,872] Trial 78 finished with value: 0.740626174664829 and parameters: {'criterion': 'entropy', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 13, 'min_samples_split': 7, 'min_samples_leaf': 12, 'min_weight_fraction_leaf': 0.01441467989622237}. Best is trial 49 with value: 0.7470346238984255.
[I 2024-04-25 14:18:55,899] Trial 79 finished with value: 0.663080023388882 and parameters: {'criterion': 'gini', 'splitter': 'best', 'max_features': 'sqrt', 'max_depth': 18, 'min_samples_split': 9, 'min_samples_leaf': 8, 'min_weight_fraction_leaf': 0.056573895957671}. Best is trial 49 with value: 0.7470346238984255.
[I 2024-04-25 14:18:55,925] Trial 80 finished with value: 0.740626174664829 and parameters: {'criterion': 'entropy', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 7, 'min_samples_split': 9, 'min_samples_leaf': 13, 'min_weight_fraction_leaf': 0.10252636892079284}. Best is trial 49 with value: 0.7470346238984255.
[I 2024-04-25 14:18:55,952] 

[I 2024-04-25 14:18:56,575] Trial 104 finished with value: 0.7421493233930585 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 6, 'min_samples_split': 18, 'min_samples_leaf': 1, 'min_weight_fraction_leaf': 0.00016475723888959102}. Best is trial 88 with value: 0.7658618907405087.
[I 2024-04-25 14:18:56,604] Trial 105 finished with value: 0.740626174664829 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 7, 'min_samples_split': 19, 'min_samples_leaf': 1, 'min_weight_fraction_leaf': 0.03483009012277162}. Best is trial 88 with value: 0.7658618907405087.
[I 2024-04-25 14:18:56,631] Trial 106 finished with value: 0.7397908574531178 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 6, 'min_samples_split': 19, 'min_samples_leaf': 2, 'min_weight_fraction_leaf': 0.008962995315542711}. Best is trial 88 with value: 0.7658618907405087.
[I 2024-04-25 14:18:56

[I 2024-04-25 14:18:57,307] Trial 130 finished with value: 0.740626174664829 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'sqrt', 'max_depth': 5, 'min_samples_split': 16, 'min_samples_leaf': 2, 'min_weight_fraction_leaf': 0.05203481464575192}. Best is trial 88 with value: 0.7658618907405087.
[I 2024-04-25 14:18:57,335] Trial 131 finished with value: 0.740626174664829 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 5, 'min_samples_split': 16, 'min_samples_leaf': 2, 'min_weight_fraction_leaf': 0.0181374999294656}. Best is trial 88 with value: 0.7658618907405087.
[I 2024-04-25 14:18:57,362] Trial 132 finished with value: 0.740626174664829 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 5, 'min_samples_split': 17, 'min_samples_leaf': 1, 'min_weight_fraction_leaf': 0.030843984405174676}. Best is trial 88 with value: 0.7658618907405087.
[I 2024-04-25 14:18:57,391] 

[I 2024-04-25 14:18:58,056] Trial 156 finished with value: 0.6183644488994696 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 7, 'min_samples_split': 16, 'min_samples_leaf': 3, 'min_weight_fraction_leaf': 0.2696052962242322}. Best is trial 88 with value: 0.7658618907405087.
[I 2024-04-25 14:18:58,087] Trial 157 finished with value: 0.7476924362026479 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 7, 'min_samples_split': 15, 'min_samples_leaf': 2, 'min_weight_fraction_leaf': 0.009783193091929383}. Best is trial 88 with value: 0.7658618907405087.
[I 2024-04-25 14:18:58,117] Trial 158 finished with value: 0.740626174664829 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'sqrt', 'max_depth': 9, 'min_samples_split': 17, 'min_samples_leaf': 2, 'min_weight_fraction_leaf': 0.013800136672611277}. Best is trial 88 with value: 0.7658618907405087.
[I 2024-04-25 14:18:58,14

[I 2024-04-25 14:18:58,846] Trial 182 finished with value: 0.7420540450235977 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 5, 'min_samples_split': 16, 'min_samples_leaf': 4, 'min_weight_fraction_leaf': 0.013344809680141371}. Best is trial 88 with value: 0.7658618907405087.
[I 2024-04-25 14:18:58,875] Trial 183 finished with value: 0.7299080106920603 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 6, 'min_samples_split': 15, 'min_samples_leaf': 3, 'min_weight_fraction_leaf': 0.17405543445265315}. Best is trial 88 with value: 0.7658618907405087.
[I 2024-04-25 14:18:58,906] Trial 184 finished with value: 0.7345766507956397 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 7, 'min_samples_split': 20, 'min_samples_leaf': 6, 'min_weight_fraction_leaf': 0.020259127371542823}. Best is trial 88 with value: 0.7658618907405087.
[I 2024-04-25 14:18:58,

Tiempo de entrenamiento: 5.39 segundos
Mejores hiperparametros: {'criterion': 'gini', 'splitter': 'random', 'max_features': 'log2', 'max_depth': 6, 'min_samples_split': 16, 'min_samples_leaf': 2, 'min_weight_fraction_leaf': 0.006273689954202907}
La puntuación auc-roc del conjunto de entrenamiento es: 0.7628100340558509
La puntuación auc-roc del conjunto de validación es: 0.7658618907405087


In [61]:
global_results = add_results(auc_train,auc_valid,training_time,'DecisionTreeClassifier - optuna',global_results)

### Random Forest Classifier

In [62]:
start_time = time.time()
model = RandomForestClassifier(random_state=92)
param = {'n_estimators':[100,200],
         'max_depth':[None,5,10],
         'min_samples_split':[2,3,4,5]
        }
gs = GridSearchCV(estimator = model,
                  param_grid = param,
                  scoring = 'roc_auc',
                  cv = 5)
gs.fit(features_train,target_train)
end_time = time.time()
training_time = round(end_time - start_time,2)
print('Tiempo de entrenamiento:', training_time,'segundos')
print('Mejor modelo:',gs.best_estimator_)
train_pred = gs.best_estimator_.predict(features_train)
valid_pred = gs.best_estimator_.predict(features_valid)
# AUC-Score
auc_train = roc_auc_score(target_train,train_pred)
print('La puntuación auc-roc del conjunto de entrenamiento es:',auc_train)
auc_valid = roc_auc_score(target_valid,valid_pred)
print('La puntuación auc-roc del conjunto de validación es:',auc_valid)

Tiempo de entrenamiento: 71.71 segundos
Mejor modelo: RandomForestClassifier(n_estimators=200, random_state=92)
La puntuación auc-roc del conjunto de entrenamiento es: 0.9987902250181466
La puntuación auc-roc del conjunto de validación es: 0.7546033809464144
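
To compare the models reported so far, it can help to look at the difference between training and validation scores rather than at each score alone. The sketch below copies the printed values from the cells above, rounded to four decimals, and sorts the models by that gap. The `results` dictionary is built by hand for illustration and is independent of `global_results`.

```python
# (train AUC-ROC, validation AUC-ROC) as printed in the cells above, rounded to 4 decimals.
results = {
    "LogisticRegression": (0.7606, 0.7563),
    "LogisticRegression-optuna": (0.7604, 0.7586),
    "DecisionTreeClassifier": (0.9787, 0.7388),
    "DecisionTreeClassifier - optuna": (0.7628, 0.7659),
    "RandomForestClassifier": (0.9988, 0.7546),
}

# Train minus validation: a large positive gap suggests overfitting.
gaps = {name: round(train - valid, 4) for name, (train, valid) in results.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<32} gap = {gap:+.4f}")
```

The two grid-searched tree models show gaps above 0.2, while the logistic regressions and the Optuna-tuned tree stay within about 0.005 of their training scores.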


In [63]:
gs.best_params_

{'max_depth': None, 'min_samples_split': 2, 'n_estimators': 200}

In [64]:
global_results = add_results(auc_train,auc_valid,training_time,'RandomForestClassifier',global_results)

### Random Forest Classifier - Optuna

In [65]:
def objective(trial):
    # Hyperparameters
    criterion = trial.suggest_categorical('criterion', ['gini', 'entropy'])
    n_estimators = trial.suggest_int('n_estimators',10,300)
    # Note: 'auto' was removed from newer scikit-learn releases; for classifiers it behaved like 'sqrt'
    max_features = trial.suggest_categorical('max_features', ['sqrt', 'log2','auto'])
    max_depth = trial.suggest_int('max_depth',1,20)
    min_samples_split = trial.suggest_int('min_samples_split',2,20)
    min_samples_leaf = trial.suggest_int('min_samples_leaf',1,20)
    min_weight_fraction_leaf = trial.suggest_float('min_weight_fraction_leaf',0.0,0.5)

    # Model
    model = RandomForestClassifier(random_state=92,n_estimators=n_estimators,criterion=criterion,max_features=max_features,
                                   max_depth=max_depth,min_samples_split=min_samples_split,
                                   min_samples_leaf=min_samples_leaf,min_weight_fraction_leaf=min_weight_fraction_leaf)
    model.fit(features_train,target_train)
    valid_pred = model.predict(features_valid)
    auc_roc = roc_auc_score(target_valid,valid_pred)
    return auc_roc

# Create the study
start_time = time.time()
study = optuna.create_study(direction='maximize')
study.optimize(objective,n_trials=200)
best_params = study.best_params
# Refit with the best hyperparameters
model = RandomForestClassifier(random_state=92,n_estimators=best_params['n_estimators'],criterion=best_params['criterion'],
                               max_features=best_params['max_features'],max_depth=best_params['max_depth'],
                               min_samples_split=best_params['min_samples_split'],min_samples_leaf=best_params['min_samples_leaf'],
                               min_weight_fraction_leaf=best_params['min_weight_fraction_leaf'])
model.fit(features_train,target_train)
end_time = time.time()
training_time = round(end_time - start_time,2)
print('Tiempo de entrenamiento:', training_time,'segundos')
print('Mejores hiperparametros:',study.best_params)
train_pred = model.predict(features_train)
valid_pred = model.predict(features_valid)

# AUC-Score
auc_train = roc_auc_score(target_train,train_pred)
print('La puntuación auc-roc del conjunto de entrenamiento es:',auc_train)
auc_valid = roc_auc_score(target_valid,valid_pred)
print('La puntuación auc-roc del conjunto de validación es:',auc_valid)

[I 2024-04-25 14:20:11,499] A new study created in memory with name: no-name-dc01bdce-649e-4828-b939-31528c388e19
[I 2024-04-25 14:20:11,847] Trial 0 finished with value: 0.702646650377981 and parameters: {'criterion': 'entropy', 'n_estimators': 170, 'max_features': 'sqrt', 'max_depth': 11, 'min_samples_split': 20, 'min_samples_leaf': 1, 'min_weight_fraction_leaf': 0.2941836316929157}. Best is trial 0 with value: 0.702646650377981.
[I 2024-04-25 14:20:12,109] Trial 1 finished with value: 0.7177828592908156 and parameters: {'criterion': 'entropy', 'n_estimators': 125, 'max_features': 'log2', 'max_depth': 19, 'min_samples_split': 7, 'min_samples_leaf': 12, 'min_weight_fraction_leaf': 0.24082770641317552}. Best is trial 1 with value: 0.7177828592908156.
[I 2024-04-25 14:20:12,162] Trial 2 finished with value: 0.6548756421501065 and parameters: {'criterion': 'entropy', 'n_estimators': 24, 'max_features': 'sqrt', 'max_depth': 1, 'min_samples_split': 3, 'min_samples_leaf': 3, 'min_weight_fra

[I 2024-04-25 14:20:20,934] Trial 25 finished with value: 0.7485316689637891 and parameters: {'criterion': 'entropy', 'n_estimators': 66, 'max_features': 'log2', 'max_depth': 9, 'min_samples_split': 19, 'min_samples_leaf': 12, 'min_weight_fraction_leaf': 0.03948317599356099}. Best is trial 22 with value: 0.7681159420289855.
[I 2024-04-25 14:20:21,194] Trial 26 finished with value: 0.6953167418452156 and parameters: {'criterion': 'gini', 'n_estimators': 130, 'max_features': 'log2', 'max_depth': 1, 'min_samples_split': 11, 'min_samples_leaf': 8, 'min_weight_fraction_leaf': 0.045645569368330634}. Best is trial 22 with value: 0.7681159420289855.
[I 2024-04-25 14:20:22,171] Trial 27 finished with value: 0.7553943219312533 and parameters: {'criterion': 'entropy', 'n_estimators': 264, 'max_features': 'log2', 'max_depth': 5, 'min_samples_split': 15, 'min_samples_leaf': 3, 'min_weight_fraction_leaf': 0.0008818173893567244}. Best is trial 22 with value: 0.7681159420289855.
[I 2024-04-25 14:20:22

[I 2024-04-25 14:20:38,751] Trial 51 finished with value: 0.7714350227623942 and parameters: {'criterion': 'entropy', 'n_estimators': 263, 'max_features': 'log2', 'max_depth': 13, 'min_samples_split': 16, 'min_samples_leaf': 8, 'min_weight_fraction_leaf': 0.00040420504151331485}. Best is trial 42 with value: 0.774961627615587.
[I 2024-04-25 14:20:40,177] Trial 52 finished with value: 0.772793718414568 and parameters: {'criterion': 'entropy', 'n_estimators': 265, 'max_features': 'log2', 'max_depth': 13, 'min_samples_split': 16, 'min_samples_leaf': 8, 'min_weight_fraction_leaf': 0.0005082910306464248}. Best is trial 42 with value: 0.774961627615587.
[I 2024-04-25 14:20:41,113] Trial 53 finished with value: 0.758193939773629 and parameters: {'criterion': 'entropy', 'n_estimators': 263, 'max_features': 'log2', 'max_depth': 13, 'min_samples_split': 16, 'min_samples_leaf': 8, 'min_weight_fraction_leaf': 0.02353555801480118}. Best is trial 42 with value: 0.774961627615587.
[I 2024-04-25 14:20

[I 2024-04-25 14:21:04,574] Trial 76 finished with value: 0.7495745102952847 and parameters: {'criterion': 'entropy', 'n_estimators': 241, 'max_features': 'auto', 'max_depth': 17, 'min_samples_split': 9, 'min_samples_leaf': 17, 'min_weight_fraction_leaf': 0.03368241268567231}. Best is trial 75 with value: 0.7753336048114271.
[I 2024-04-25 14:21:05,229] Trial 77 finished with value: 0.7490811510671177 and parameters: {'criterion': 'entropy', 'n_estimators': 226, 'max_features': 'auto', 'max_depth': 19, 'min_samples_split': 10, 'min_samples_leaf': 19, 'min_weight_fraction_leaf': 0.0797953182729067}. Best is trial 75 with value: 0.7753336048114271.
[I 2024-04-25 14:21:06,020] Trial 78 finished with value: 0.7564371632627489 and parameters: {'criterion': 'gini', 'n_estimators': 215, 'max_features': 'auto', 'max_depth': 18, 'min_samples_split': 2, 'min_samples_leaf': 16, 'min_weight_fraction_leaf': 0.016027735820204416}. Best is trial 75 with value: 0.7753336048114271.
[I 2024-04-25 14:21:0

[I 2024-04-25 14:21:29,185] Trial 101 finished with value: 0.7699679969093263 and parameters: {'criterion': 'entropy', 'n_estimators': 265, 'max_features': 'auto', 'max_depth': 11, 'min_samples_split': 7, 'min_samples_leaf': 19, 'min_weight_fraction_leaf': 0.003710909124335371}. Best is trial 92 with value: 0.7774884621810132.
[I 2024-04-25 14:21:30,788] Trial 102 finished with value: 0.7776920707513677 and parameters: {'criterion': 'entropy', 'n_estimators': 272, 'max_features': 'sqrt', 'max_depth': 13, 'min_samples_split': 10, 'min_samples_leaf': 4, 'min_weight_fraction_leaf': 4.1959609605407436e-05}. Best is trial 102 with value: 0.7776920707513677.
[I 2024-04-25 14:21:32,533] Trial 103 finished with value: 0.7681120264795556 and parameters: {'criterion': 'entropy', 'n_estimators': 282, 'max_features': 'sqrt', 'max_depth': 19, 'min_samples_split': 10, 'min_samples_leaf': 2, 'min_weight_fraction_leaf': 0.0003851355385301734}. Best is trial 102 with value: 0.7776920707513677.
[I 2024-

[I 2024-04-25 14:21:57,556] Trial 126 finished with value: 0.757315551518189 and parameters: {'criterion': 'entropy', 'n_estimators': 250, 'max_features': 'log2', 'max_depth': 15, 'min_samples_split': 10, 'min_samples_leaf': 20, 'min_weight_fraction_leaf': 0.01740540625814723}. Best is trial 102 with value: 0.7776920707513677.
[I 2024-04-25 14:21:58,810] Trial 127 finished with value: 0.7745635467568809 and parameters: {'criterion': 'entropy', 'n_estimators': 222, 'max_features': 'auto', 'max_depth': 16, 'min_samples_split': 19, 'min_samples_leaf': 5, 'min_weight_fraction_leaf': 0.00034528420510795006}. Best is trial 102 with value: 0.7776920707513677.
[I 2024-04-25 14:21:59,507] Trial 128 finished with value: 0.7474066010942656 and parameters: {'criterion': 'entropy', 'n_estimators': 221, 'max_features': 'auto', 'max_depth': 17, 'min_samples_split': 20, 'min_samples_leaf': 5, 'min_weight_fraction_leaf': 0.052242518301768195}. Best is trial 102 with value: 0.7776920707513677.
[I 2024-0

[I 2024-04-25 14:22:23,075] Trial 151 finished with value: 0.7646846155452532 and parameters: {'criterion': 'entropy', 'n_estimators': 290, 'max_features': 'auto', 'max_depth': 19, 'min_samples_split': 9, 'min_samples_leaf': 19, 'min_weight_fraction_leaf': 0.00827744697580855}. Best is trial 102 with value: 0.7776920707513677.
[I 2024-04-25 14:22:24,259] Trial 152 finished with value: 0.7642043081485194 and parameters: {'criterion': 'entropy', 'n_estimators': 270, 'max_features': 'auto', 'max_depth': 19, 'min_samples_split': 10, 'min_samples_leaf': 19, 'min_weight_fraction_leaf': 0.008252434032175173}. Best is trial 102 with value: 0.7776920707513677.
[I 2024-04-25 14:22:25,318] Trial 153 finished with value: 0.7562727101866934 and parameters: {'criterion': 'entropy', 'n_estimators': 285, 'max_features': 'auto', 'max_depth': 18, 'min_samples_split': 20, 'min_samples_leaf': 13, 'min_weight_fraction_leaf': 0.020502354471205676}. Best is trial 102 with value: 0.7776920707513677.
[I 2024-0

[I 2024-04-25 14:22:50,135] Trial 176 finished with value: 0.7568352441214551 and parameters: {'criterion': 'entropy', 'n_estimators': 102, 'max_features': 'sqrt', 'max_depth': 17, 'min_samples_split': 10, 'min_samples_leaf': 18, 'min_weight_fraction_leaf': 0.019164021672356847}. Best is trial 169 with value: 0.7781332226538029.
[I 2024-04-25 14:22:51,105] Trial 177 finished with value: 0.7572333249801612 and parameters: {'criterion': 'entropy', 'n_estimators': 263, 'max_features': 'sqrt', 'max_depth': 17, 'min_samples_split': 9, 'min_samples_leaf': 18, 'min_weight_fraction_leaf': 0.025645889299432033}. Best is trial 169 with value: 0.7781332226538029.
[I 2024-04-25 14:22:52,122] Trial 178 finished with value: 0.7678000877083073 and parameters: {'criterion': 'entropy', 'n_estimators': 246, 'max_features': 'sqrt', 'max_depth': 18, 'min_samples_split': 9, 'min_samples_leaf': 16, 'min_weight_fraction_leaf': 0.009484452951850984}. Best is trial 169 with value: 0.7781332226538029.
[I 2024-0

Training time: 184.4 seconds
Best hyperparameters: {'criterion': 'entropy', 'n_estimators': 267, 'max_features': 'sqrt', 'max_depth': 18, 'min_samples_split': 10, 'min_samples_leaf': 18, 'min_weight_fraction_leaf': 0.0007471889291537688}
AUC-ROC on the training set: 0.8071903995610313
AUC-ROC on the validation set: 0.7781332226538029


In [66]:
global_results = add_results(auc_train,auc_valid,training_time,'RandomForestClassifier - optuna',global_results)

### LGBM Classifier

In [67]:
# Train LightGBM with default hyperparameters and time the fit
start_time = time.time()
model = LGBMClassifier()
model.fit(features_train, target_train)
end_time = time.time()
training_time = round(end_time - start_time, 2)
print('Training time:', training_time, 'seconds')
# Note: predict() returns hard class labels, so the AUC below is computed
# from labels rather than from predict_proba(...)[:, 1] scores.
train_pred = model.predict(features_train)
valid_pred = model.predict(features_valid)
# AUC-ROC score
auc_train = roc_auc_score(target_train, train_pred)
print('AUC-ROC on the training set:', auc_train)
auc_valid = roc_auc_score(target_valid, valid_pred)
print('AUC-ROC on the validation set:', auc_valid)

[LightGBM] [Info] Number of positive: 6004, number of negative: 4133
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 728
[LightGBM] [Info] Number of data points in the train set: 10137, number of used features: 34
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.592286 -> initscore=0.373422
[LightGBM] [Info] Start training from score 0.373422
Training time: 0.19 seconds
AUC-ROC on the training set: 0.9197187357795021
AUC-ROC on the validation set: 0.8364148707346615




In [68]:
global_results = add_results(auc_train,auc_valid,training_time,'LGBMClassifier',global_results)
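
The cells in this notebook pass the output of `model.predict` (hard 0/1 labels) to `roc_auc_score`. AUC-ROC is normally computed from continuous scores, for example `model.predict_proba(features_valid)[:, 1]`. The sketch below is illustrative only: it uses a small hand-written AUC function and made-up scores (both are assumptions, not part of the notebook) to show how thresholding the scores before computing AUC can change the result.

```python
def auc_roc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical validation labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1, 0, 1]
proba = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]

# Hard labels, as predict() would return them with a 0.5 threshold
labels = [1 if p >= 0.5 else 0 for p in proba]

auc_from_proba = auc_roc(y_true, proba)
auc_from_labels = auc_roc(y_true, labels)
print(auc_from_proba, auc_from_labels)
```

On this toy data the probability-based AUC is 7/9 and the label-based AUC is 6/9, because thresholding discards the ranking information inside each predicted class. The scores reported in this notebook were computed from labels, so they are not directly comparable with probability-based AUC values.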

### CatBoost Classifier

In [69]:
# Train CatBoost with default hyperparameters (AUC as eval metric) and time the fit
start_time = time.time()
model = CatBoostClassifier(eval_metric='AUC')
model.fit(features_train, target_train)
end_time = time.time()
training_time = round(end_time - start_time, 2)
print('Training time:', training_time, 'seconds')
# Note: predict() returns hard class labels, so the AUC below is computed
# from labels rather than from predict_proba(...)[:, 1] scores.
train_pred = model.predict(features_train)
valid_pred = model.predict(features_valid)
# AUC-ROC score
auc_train = roc_auc_score(target_train, train_pred)
print('AUC-ROC on the training set:', auc_train)
auc_valid = roc_auc_score(target_valid, valid_pred)
print('AUC-ROC on the validation set:', auc_valid)

Learning rate set to 0.027699
0:	total: 167ms	remaining: 2m 47s
1:	total: 197ms	remaining: 1m 38s
2:	total: 220ms	remaining: 1m 13s
3:	total: 228ms	remaining: 56.8s
4:	total: 234ms	remaining: 46.6s
5:	total: 240ms	remaining: 39.7s
6:	total: 245ms	remaining: 34.7s
7:	total: 250ms	remaining: 31s
8:	total: 255ms	remaining: 28.1s
9:	total: 260ms	remaining: 25.8s
10:	total: 265ms	remaining: 23.8s
11:	total: 270ms	remaining: 22.2s
12:	total: 275ms	remaining: 20.9s
13:	total: 280ms	remaining: 19.7s
14:	total: 285ms	remaining: 18.7s
15:	total: 290ms	remaining: 17.8s
16:	total: 295ms	remaining: 17.1s
17:	total: 300ms	remaining: 16.4s
18:	total: 305ms	remaining: 15.8s
19:	total: 311ms	remaining: 15.2s
20:	total: 316ms	remaining: 14.7s
21:	total: 321ms	remaining: 14.3s
22:	total: 326ms	remaining: 13.9s
23:	total: 331ms	remaining: 13.5s
24:	total: 336ms	remaining: 13.1s
25:	total: 341ms	remaining: 12.8s
26:	total: 346ms	remaining: 12.5s
27:	total: 351ms	remaining: 12.2s
28:	total: 356ms	remaining:

269:	total: 1.63s	remaining: 4.4s
270:	total: 1.64s	remaining: 4.4s
271:	total: 1.64s	remaining: 4.39s
272:	total: 1.65s	remaining: 4.39s
273:	total: 1.65s	remaining: 4.38s
274:	total: 1.66s	remaining: 4.38s
275:	total: 1.67s	remaining: 4.37s
276:	total: 1.67s	remaining: 4.36s
277:	total: 1.68s	remaining: 4.36s
278:	total: 1.68s	remaining: 4.35s
279:	total: 1.69s	remaining: 4.35s
280:	total: 1.7s	remaining: 4.34s
281:	total: 1.7s	remaining: 4.33s
282:	total: 1.71s	remaining: 4.33s
283:	total: 1.71s	remaining: 4.32s
284:	total: 1.72s	remaining: 4.31s
285:	total: 1.73s	remaining: 4.31s
286:	total: 1.73s	remaining: 4.3s
287:	total: 1.74s	remaining: 4.3s
288:	total: 1.74s	remaining: 4.29s
289:	total: 1.75s	remaining: 4.29s
290:	total: 1.76s	remaining: 4.28s
291:	total: 1.76s	remaining: 4.28s
292:	total: 1.77s	remaining: 4.27s
293:	total: 1.78s	remaining: 4.27s
294:	total: 1.78s	remaining: 4.26s
295:	total: 1.79s	remaining: 4.25s
296:	total: 1.79s	remaining: 4.25s
297:	total: 1.8s	remaining

542:	total: 3.08s	remaining: 2.59s
543:	total: 3.08s	remaining: 2.58s
544:	total: 3.09s	remaining: 2.58s
545:	total: 3.1s	remaining: 2.57s
546:	total: 3.1s	remaining: 2.57s
547:	total: 3.1s	remaining: 2.56s
548:	total: 3.11s	remaining: 2.55s
549:	total: 3.11s	remaining: 2.55s
550:	total: 3.12s	remaining: 2.54s
551:	total: 3.12s	remaining: 2.54s
552:	total: 3.13s	remaining: 2.53s
553:	total: 3.13s	remaining: 2.52s
554:	total: 3.14s	remaining: 2.52s
555:	total: 3.14s	remaining: 2.51s
556:	total: 3.15s	remaining: 2.5s
557:	total: 3.15s	remaining: 2.5s
558:	total: 3.16s	remaining: 2.49s
559:	total: 3.16s	remaining: 2.48s
560:	total: 3.17s	remaining: 2.48s
561:	total: 3.17s	remaining: 2.47s
562:	total: 3.18s	remaining: 2.47s
563:	total: 3.18s	remaining: 2.46s
564:	total: 3.19s	remaining: 2.45s
565:	total: 3.19s	remaining: 2.45s
566:	total: 3.19s	remaining: 2.44s
567:	total: 3.2s	remaining: 2.43s
568:	total: 3.2s	remaining: 2.43s
569:	total: 3.21s	remaining: 2.42s
570:	total: 3.21s	remaining

789:	total: 4.28s	remaining: 1.14s
790:	total: 4.28s	remaining: 1.13s
791:	total: 4.29s	remaining: 1.13s
792:	total: 4.29s	remaining: 1.12s
793:	total: 4.3s	remaining: 1.11s
794:	total: 4.3s	remaining: 1.11s
795:	total: 4.31s	remaining: 1.1s
796:	total: 4.32s	remaining: 1.1s
797:	total: 4.32s	remaining: 1.09s
798:	total: 4.32s	remaining: 1.09s
799:	total: 4.33s	remaining: 1.08s
800:	total: 4.33s	remaining: 1.08s
801:	total: 4.34s	remaining: 1.07s
802:	total: 4.34s	remaining: 1.06s
803:	total: 4.35s	remaining: 1.06s
804:	total: 4.35s	remaining: 1.05s
805:	total: 4.36s	remaining: 1.05s
806:	total: 4.36s	remaining: 1.04s
807:	total: 4.37s	remaining: 1.04s
808:	total: 4.37s	remaining: 1.03s
809:	total: 4.38s	remaining: 1.03s
810:	total: 4.38s	remaining: 1.02s
811:	total: 4.39s	remaining: 1.01s
812:	total: 4.39s	remaining: 1.01s
813:	total: 4.4s	remaining: 1s
814:	total: 4.4s	remaining: 999ms
815:	total: 4.41s	remaining: 994ms
816:	total: 4.41s	remaining: 988ms
817:	total: 4.42s	remaining: 

In [70]:
global_results = add_results(auc_train,auc_valid,training_time,'CatBoostClassifier',global_results)

### Results

In [71]:
global_results.query('tipo == "valid"').sort_values(by='AUC-ROC',ascending=False)

,tipo,modelo,AUC-ROC,tiempo_entrenamiento
13,valid,LGBMClassifier,0.836415,0.19
15,valid,CatBoostClassifier,0.825216,5.47
11,valid,RandomForestClassifier - optuna,0.778133,184.4
7,valid,DecisionTreeClassifier - optuna,0.765862,5.39
3,valid,LogisticRegression-optuna,0.758605,28.0
1,valid,LogisticRegression,0.756286,0.5
9,valid,RandomForestClassifier,0.754603,71.71
5,valid,DecisionTreeClassifier,0.738753,6.05


## Conclusion

To solve this project, we first corrected the data in certain columns, such as dates and numeric values, by converting them to the appropriate data types.

We created two variables:
1. The target variable, "churn", which indicates whether a customer has already ended their service.
2. The "diff_days" variable, which is the number of days a customer has used the service.
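
As a rough illustration of how these two variables can be derived, here is a minimal sketch. The column names (`customerID`, `BeginDate`, `EndDate`), the `"No"` marker for active contracts, and the snapshot date are assumptions made for the example and may differ from the actual data.

```python
import pandas as pd

# Hypothetical sample standing in for contract.csv (assumed column names)
contract = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "BeginDate": ["2019-01-01", "2018-06-01", "2019-12-01"],
    "EndDate": ["No", "2019-06-01", "No"],
})

# Assumed reference date used as the "end" for customers who are still active
snapshot_date = pd.Timestamp("2020-02-01")

contract["BeginDate"] = pd.to_datetime(contract["BeginDate"])

# churn = 1 when the contract has an end date, 0 when it is still active
contract["churn"] = (contract["EndDate"] != "No").astype(int)

# Replace the "No" marker with a missing value, parse the dates,
# then fill active contracts with the snapshot date
end_date = pd.to_datetime(contract["EndDate"].where(contract["EndDate"] != "No"))
end_date = end_date.fillna(snapshot_date)

# diff_days = number of days between the start and the (real or assumed) end
contract["diff_days"] = (end_date - contract["BeginDate"]).dt.days
print(contract[["customerID", "churn", "diff_days"]])
```

The snapshot date matters: it defines the tenure of every active customer, so it should match the date on which the data was extracted.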

We merged the 4 datasets into a single table.
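
One possible way to do the merge is sketched below. It assumes that every table shares a `customerID` key and that customers missing from `internet` or `phone` simply do not have that service. The toy columns and the `"No service"` fill value are hypothetical choices for the example.

```python
import pandas as pd

# Hypothetical miniature versions of the four tables
contract = pd.DataFrame({"customerID": ["A1", "B2", "C3"], "churn": [0, 1, 0]})
personal = pd.DataFrame({"customerID": ["A1", "B2", "C3"], "SeniorCitizen": [0, 1, 0]})
internet = pd.DataFrame({"customerID": ["A1", "C3"], "InternetService": ["DSL", "Fiber optic"]})
phone = pd.DataFrame({"customerID": ["B2", "C3"], "MultipleLines": ["No", "Yes"]})

# Left joins keep every customer from the contract table,
# even when they do not appear in internet or phone
data = contract.merge(personal, on="customerID", how="left")
data = data.merge(internet, on="customerID", how="left")
data = data.merge(phone, on="customerID", how="left")

# Customers without a service get NaN after the join; an explicit
# category (an assumption here) keeps them usable for encoding
data = data.fillna({"InternetService": "No service", "MultipleLines": "No service"})
print(data)
```

Using `how="left"` from the contract table avoids silently dropping customers who only have one of the two services, which an inner join would do.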

We reviewed the categories in each categorical column to check for values that could be grouped; where this was the case, we grouped them to standardize the data.

We applied one-hot encoding (OneHotEncoder) to the categorical variables and scaled the numeric variables.
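
The sketch below shows one way to combine both steps with scikit-learn. It assumes `MaxAbsScaler` as the scaler (it is imported at the top of the notebook) and uses hypothetical column names and values. Both transformers are fitted on the training data only and then applied to the validation data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, MaxAbsScaler

# Hypothetical training and validation samples
train = pd.DataFrame({
    "Type": ["Month-to-month", "Two year", "One year", "Month-to-month"],
    "MonthlyCharges": [20.0, 80.0, 50.0, 100.0],
})
valid = pd.DataFrame({
    "Type": ["One year", "Two year"],
    "MonthlyCharges": [40.0, 120.0],
})

# Categories unseen during fit are encoded as all zeros instead of raising
encoder = OneHotEncoder(handle_unknown="ignore")
scaler = MaxAbsScaler()

# Fit on the training set only to avoid leaking validation statistics
cat_train = encoder.fit_transform(train[["Type"]]).toarray()
num_train = scaler.fit_transform(train[["MonthlyCharges"]])
features_train = np.hstack([cat_train, num_train])

# Reuse the fitted transformers on the validation set
cat_valid = encoder.transform(valid[["Type"]]).toarray()
num_valid = scaler.transform(valid[["MonthlyCharges"]])
features_valid = np.hstack([cat_valid, num_valid])

print(features_train)
print(features_valid)
```

Because the scaler divides by the maximum absolute value seen in training, validation values can fall outside [-1, 1]; that is expected and does not indicate a bug.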

We also upsampled the samples with churn = 1, because the target classes were imbalanced, which can prevent our models from reaching a satisfactory result.
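
A minimal sketch of upsampling by repetition follows. The helper name `upsample`, the `repeat` factor, and the toy data are assumptions for illustration; the notebook's own implementation may differ (for example, it may use `sklearn.utils.shuffle`).

```python
import pandas as pd


def upsample(features, target, repeat, random_state=12345):
    """Repeat the positive-class rows `repeat` times and shuffle the result."""
    data = features.assign(_target=target)
    positives = data[data["_target"] == 1]
    negatives = data[data["_target"] == 0]
    # Stack the negatives once and the positives `repeat` times
    upsampled = pd.concat([negatives] + [positives] * repeat)
    # Shuffle so that the repeated rows are not grouped together
    upsampled = upsampled.sample(frac=1, random_state=random_state)
    upsampled = upsampled.reset_index(drop=True)
    return upsampled.drop(columns="_target"), upsampled["_target"]


# Hypothetical imbalanced training data: 4 negatives, 2 positives
features = pd.DataFrame({"diff_days": [30, 400, 700, 120, 60, 90]})
target = pd.Series([0, 0, 0, 0, 1, 1])

features_up, target_up = upsample(features, target, repeat=2)
print(target_up.value_counts())
```

Upsampling should be applied to the training set only, after the split; duplicating rows before splitting would place copies of the same customer in both training and validation data and inflate the validation score.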

We split the data and trained several models: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, LGBMClassifier and CatBoostClassifier; for the first three we tuned hyperparameters with GridSearchCV and Optuna.

The LGBMClassifier model achieved the best validation AUC-ROC, 0.836, and it was also the fastest model to train, at 0.19 seconds.