# 3. Model A

Initial model:
- Baselines
- Grid search
- Voting ensemble

## Libraries

In [1]:
# Data handling
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Models
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, BaggingClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score, precision_score, recall_score, \
roc_curve, roc_auc_score, ConfusionMatrixDisplay, multilabel_confusion_matrix

# Other
import warnings
warnings.filterwarnings('ignore')

## Data loading

In [2]:
df = pd.read_csv(r'..\data\processed\train_3.csv')

## Data split

In [3]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['nobeyesdad']),
                                                    df['nobeyesdad'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df['nobeyesdad'])
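
As a quick sanity check on the `stratify` argument, the sketch below uses a small synthetic, imbalanced label vector (not this project's data) and compares class proportions across the splits. All names in it (`X`, `y`, `proportions`, `max_gap`) are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the target column (illustrative only)
rng = np.random.default_rng(42)
y = pd.Series(rng.choice(['a', 'b', 'c'], size=1000, p=[0.6, 0.3, 0.1]))
X = pd.DataFrame({'feature': rng.normal(size=1000)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, class proportions in both splits should closely match the full data
proportions = pd.DataFrame({
    'full': y.value_counts(normalize=True),
    'train': y_tr.value_counts(normalize=True),
    'test': y_te.value_counts(normalize=True),
})
max_gap = (proportions['train'] - proportions['test']).abs().max()
print(proportions.round(3))
print(f'largest train/test proportion gap: {max_gap:.4f}')
```

Running the same comparison on `y_train` and `y_test` would confirm that every class of `nobeyesdad` is represented in the held-out 20%.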

## Baselines

In [11]:
modelos = {LogisticRegression(random_state=42):'LogR',
           SVC(kernel='linear', random_state=42): 'SVC_linear',
           SVC(kernel='poly', degree=4): 'SVC_poly',
           SVC(random_state=42):'SVC_rbf',
           DecisionTreeClassifier(random_state=42):'DT',
           RandomForestClassifier(random_state=42, class_weight='balanced'):'RF',
           KNeighborsClassifier(n_neighbors=5):'KNEIGH',
           lgb.LGBMClassifier():'LGBM',
           XGBClassifier():'XGB'}

scores = ['accuracy']
data = []

for modelo in modelos:
    print(f'processing {modelo}')
    data.append([(cross_val_score(modelo, X_train, y_train, cv=5, scoring=score)).mean() for score in scores])

baselines = pd.DataFrame(data, columns=scores, index=modelos.values())
baselines.sort_values(by='accuracy', ascending=False)

processing LogisticRegression(random_state=42)
processing SVC(kernel='linear', random_state=42)
processing SVC(degree=4, kernel='poly')
processing SVC(random_state=42)
processing DecisionTreeClassifier(random_state=42)
processing RandomForestClassifier(class_weight='balanced', random_state=42)
processing KNeighborsClassifier()
processing LGBMClassifier()

| model | accuracy |
|---|---|
| LGBM | 0.888977 |
| XGB | 0.88668 |
| RF | 0.880999 |
| SVC_poly | 0.854286 |
| SVC_rbf | 0.845039 |
| SVC_linear | 0.839901 |
| DT | 0.829869 |
| KNEIGH | 0.784299 |
| LogR | 0.782909 |


The three best models by mean 5-fold cross-validated accuracy are **LightGBM**, **XGBoost** and **RandomForest**.
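
The top three baselines are separated by less than one accuracy point, so the fold-to-fold spread is worth a look alongside the mean. A minimal sketch, assuming synthetic data in place of `X_train` / `y_train` and an arbitrary second metric (`f1_macro`), of how `cross_validate` could collect several metrics in one pass:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic multi-class data standing in for X_train / y_train (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=9, n_informative=6, n_classes=3, random_state=42
)

# One cross-validation pass that collects several metrics at once
cv_results = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=5,
    scoring=['accuracy', 'f1_macro'],
)

# Mean and fold-to-fold standard deviation per metric
summary = {
    metric: (cv_results[f'test_{metric}'].mean(), cv_results[f'test_{metric}'].std())
    for metric in ['accuracy', 'f1_macro']
}
for metric, (mean, std) in summary.items():
    print(f'{metric}: {mean:.3f} +/- {std:.3f}')
```

If the standard deviation is of the same order as the gap between two models, their ranking should be read as tentative.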

## Grid Search

### LightGBM

In [4]:
# LightGBM

LGBM_grid_1 = {
    'num_leaves': [50],
    'learning_rate': [0.1],
    'n_estimators': [100],
    'max_depth': [5],
    'min_child_samples': [20, 30, 40],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0]
}

LGBM_grid_search_1 = GridSearchCV(lgb.LGBMClassifier(),
                           LGBM_grid_1,
                           cv=5,
                           scoring='accuracy',
                           n_jobs=-1
                          )

LGBM_grid_search_1.fit(X_train, y_train)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.009626 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 783
[LightGBM] [Info] Number of data points in the train set: 16546, number of used features: 9
[LightGBM] [Info] Start training from score -2.104533
[LightGBM] [Info] Start training from score -1.903953
[LightGBM] [Info] Start training from score -2.148106
[LightGBM] [Info] Start training from score -2.125576
[LightGBM] [Info] Start training from score -1.962424
[LightGBM] [Info] Start training from score -1.852173
[LightGBM] [Info] Start training from score -1.631497


In [5]:
print(LGBM_grid_search_1.best_params_)

{'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 5, 'min_child_samples': 30, 'n_estimators': 100, 'num_leaves': 50, 'subsample': 0.8}
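
Retyping the winning hyperparameters by hand is easy to get wrong. The sketch below, on synthetic data with a small decision tree as a stand-in estimator (both are assumptions, not the notebook's setup), shows two alternatives that `GridSearchCV` offers: reusing `best_estimator_`, or unpacking `best_params_` into a fresh model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=9, n_informative=6, n_classes=3, random_state=42
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {'max_depth': [3, 5], 'min_samples_leaf': [1, 5]},
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated accuracy of the winning combination
print(search.best_params_, round(search.best_score_, 4))

# best_estimator_ is already refit on all of X_demo (refit=True is the default)
best_tree = search.best_estimator_

# Unpacking best_params_ rebuilds an unfitted model with the same hyperparameters
rebuilt = DecisionTreeClassifier(random_state=42, **search.best_params_)
```

Applied to this notebook, `LGBM_grid_search_1.best_score_` would give the cross-validated accuracy of the selected combination without a separate `cross_val_score` call.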


In [6]:
LGBM_1 = lgb.LGBMClassifier(num_leaves=50,
              learning_rate=0.1,
              n_estimators=100,
              max_depth=5,
              min_child_samples=30,
              subsample=0.8,
              colsample_bytree=1.0)

lgbm1_score = cross_val_score(LGBM_1, X_train, y_train, cv=5, scoring='accuracy').mean()
lgbm1_score

0.8915149108126789

In [7]:
LGBM_1.fit(X_train, y_train)
lgbm1_score_t = accuracy_score(y_test, LGBM_1.predict(X_test))
lgbm1_score_t

0.891467246797196

### XGBoost

In [5]:
# XGBoost

XGB_grid_1 = {
       'n_estimators': [50, 100, 200],
       'learning_rate': [0.01, 0.05, 0.1],
       'max_depth': [3, 5, 7],
       'subsample': [0.6, 0.8, 1.0],
       'colsample_bytree': [0.6, 0.8, 1.0]
}

XGB_grid_search_1 = GridSearchCV(XGBClassifier(),
                           XGB_grid_1,
                           cv=5,
                           scoring='accuracy',
                           n_jobs=-1
                          )

XGB_grid_search_1.fit(X_train, y_train)

In [6]:
print(XGB_grid_search_1.best_params_)

{'colsample_bytree': 0.6, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 100, 'subsample': 0.8}


In [8]:
XGB_1 = XGBClassifier(n_estimators=100,
              learning_rate=0.1,
              max_depth=7,
              subsample=0.8,
              colsample_bytree=0.6)

xgb1_score = cross_val_score(XGB_1, X_train, y_train, cv=5, scoring='accuracy').mean()
xgb1_score

0.8934488290198205

In [9]:
XGB_1.fit(X_train, y_train)
xgb1_score_t = accuracy_score(y_test, XGB_1.predict(X_test))
xgb1_score_t

0.8941261783901377

### RandomForest

In [8]:
# RandomForest

RF_grid_2 = {
    'n_estimators': [100, 150],
    'max_depth': [5, 10, 15],
    'min_samples_split': [8, 10, 12],
    'min_samples_leaf': [3, 4, 5],
    'bootstrap': [True]
}

RF_grid_search_2 = GridSearchCV(RandomForestClassifier(class_weight='balanced'),
                           RF_grid_2,
                           cv=5,
                           scoring='accuracy',
                           n_jobs=-1
                          )

RF_grid_search_2.fit(X_train, y_train)

In [9]:
print(RF_grid_search_2.best_params_)

{'bootstrap': True, 'max_depth': 15, 'min_samples_leaf': 4, 'min_samples_split': 8, 'n_estimators': 150}


In [10]:
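# Note: unlike the estimator passed to the grid search, this model does not set class_weight='balanced'
RF_1 = RandomForestClassifier(n_estimators=150,
              max_depth=15,
              min_samples_split=8,
              min_samples_leaf=4,
              bootstrap=True)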

rf1_score = cross_val_score(RF_1, X_train, y_train, cv=5, scoring='accuracy').mean()
rf1_score

0.8840814258284876

In [12]:
RF_1.fit(X_train, y_train)
rf1_score_t = accuracy_score(y_test, RF_1.predict(X_test))
rf1_score_t

0.8902586415276771

### Score comparison

In [19]:
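# Score comparison on train (5-fold cross-validated accuracy)

# Label the tuned scores by model name so they align with the baselines by index, not by position
tuned = pd.Series({'LGBM': lgbm1_score, 'XGB': xgb1_score, 'RF': rf1_score}, name='accuracy (post)')
t1 = baselines.sort_values(by='accuracy', ascending=False).iloc[0:3]
t2 = pd.concat([t1, tuned], axis=1)
t2['improvement'] = t2['accuracy (post)'] - t2['accuracy']
t2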

| model | accuracy | accuracy (post) | improvement |
|---|---|---|---|
| LGBM | 0.888977 | 0.891515 | 0.002538 |
| XGB | 0.88668 | 0.893449 | 0.006769 |
| RF | 0.880999 | 0.884081 | 0.003082 |


In [14]:
# Score comparison on test

pd.DataFrame([lgbm1_score_t, xgb1_score_t, rf1_score_t], columns=['accuracy'], index=['LGBM','XGB','RF'])

| model | accuracy |
|---|---|
| LGBM | 0.891467 |
| XGB | 0.894126 |
| RF | 0.890259 |


## Ensemble

In [21]:
voting_clf = VotingClassifier(estimators=[('lgbm', LGBM_1),
                                          ('xgb', XGB_1),
                                          ('rf', RF_1)],
                            voting='soft',
                            verbose=False)

voting_clf.fit(X_train, y_train)

In [22]:
cross_val_score(voting_clf, X_train, y_train, cv=5, scoring='accuracy').mean()

0.8935698940635216
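
With `voting='soft'`, the ensemble averages the class probabilities predicted by each model and picks the class with the highest mean. The toy arrays below are hypothetical probabilities, not outputs of the models above; they illustrate how soft voting can disagree with majority (hard) voting when one model is much more confident than the others.

```python
import numpy as np

# Hypothetical predicted probabilities from three models: 2 samples, 3 classes
proba_a = np.array([[0.6, 0.3, 0.1], [0.5, 0.45, 0.05]])
proba_b = np.array([[0.5, 0.4, 0.1], [0.5, 0.45, 0.05]])
proba_c = np.array([[0.3, 0.6, 0.1], [0.1, 0.85, 0.05]])

# Soft voting: average the probabilities (equal weights), then take the argmax
avg_proba = np.mean([proba_a, proba_b, proba_c], axis=0)
soft_pred = avg_proba.argmax(axis=1)

# Hard voting for comparison: each model votes for its own most likely class
votes = np.stack([p.argmax(axis=1) for p in (proba_a, proba_b, proba_c)])
hard_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

print(avg_proba.round(3))
print('soft:', soft_pred, 'hard:', hard_pred)
```

For the second sample, two models narrowly prefer class 0 while the third strongly prefers class 1, so the averaged probabilities favour class 1 even though the majority vote is class 0.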

## Test validation

In [23]:
accuracy_score(y_test, voting_clf.predict(X_test))

0.8967851099830795

## Saving the model

In [25]:
import joblib

joblib.dump(voting_clf, r'..\models\my_model_1.sav')

['..\\models\\my_model_1.sav']
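
A saved model is only useful if it loads back and predicts identically. The round-trip sketch below uses a throwaway logistic regression, synthetic data and a temporary directory (all assumptions made for illustration), and builds the path with `os.path.join` so it does not depend on Windows-style separators.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small stand-in model trained on synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# os.path.join builds the path portably, avoiding hard-coded backslashes
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.sav')
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

The same check could be run on `voting_clf` with `X_test` after loading the saved file, ideally in an environment with the same library versions used for training.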