# 3. Model C
Based on domain ("business") knowledge.\
According to the sources consulted, overweight is determined from the BMI (body mass index), which is computed by dividing the weight in kg by the square of the height in metres.

<center>

$BMI = \dfrac{\text{weight (kg)}}{\text{height (m)}^2}$

| BMI        | Diagnosis     |
|------------|---------------|
| <18.5      | Underweight   |
| (18.5, 25) | Normal weight |
| (25, 30)   | Overweight    |
| >30        | Obesity       |

</center>
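
As a minimal sketch of the rule above, the BMI and its category can be computed as follows. The function names `bmi` and `bmi_category` are hypothetical helpers and are not part of the notebook. The table does not say where the exact boundary values (18.5, 25, 30) belong, so assigning them to the upper category is an assumption.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kg divided by the square of height in metres."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to the diagnosis table.

    Assumption: boundary values (18.5, 25, 30) fall into the upper category.
    """
    if value < 18.5:
        return "Underweight"
    if value < 25:
        return "Normal weight"
    if value < 30:
        return "Overweight"
    return "Obesity"


# Illustrative subject (not taken from the dataset): 80 kg, 1.75 m
example = bmi(80.0, 1.75)
print(round(example, 1), bmi_category(example))
```

Because this rule uses only weight and height, it motivates the first feature set used below.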

\
The main additional factor in this categorization is whether or not the subject is an athlete, and more specifically whether they practise weightlifting. Such a subject can be classified as overweight while carrying little to no fat mass, because of their muscle mass. Unfortunately, the dataset captures this information only coarsely: FAF (physical activity frequency) takes only integer values from 0 to 2. \
\
Procedure:
- Build a model with the weight and height variables
- Build a second model that adds the faf variable
- Compare the two models

## Libraries

In [1]:
# Data handling
import pandas as pd
import numpy as np

# Models
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, BaggingClassifier
from sklearn.metrics import (confusion_matrix, classification_report, accuracy_score, f1_score,
                             precision_score, recall_score, roc_curve, roc_auc_score,
                             ConfusionMatrixDisplay, multilabel_confusion_matrix)

# Other
import warnings
warnings.filterwarnings('ignore')

## Data loading

In [2]:
df = pd.read_csv(r'..\data\processed\train_3.csv')
df_1 = df[['height', 'weight', 'nobeyesdad']]
df_2 = df[['height', 'weight', 'faf', 'nobeyesdad']]

## Data split

In [3]:
# Feature set 1: height and weight only
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(df_1.drop(columns=['nobeyesdad']),
                                                    df_1['nobeyesdad'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df_1['nobeyesdad'])

# Feature set 2: height, weight and faf (same test_size, random_state and stratification target)
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(df_2.drop(columns=['nobeyesdad']),
                                                    df_2['nobeyesdad'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df_2['nobeyesdad'])

## Baselines

In [None]:
# Baseline estimators, each mapped to a short label used as the row index of the results table
modelos = {LogisticRegression(random_state=42):'LogR',
           SVC(kernel='linear', random_state=42): 'SVC_linear',
           SVC(kernel='poly', degree=4): 'SVC_poly',
           SVC(random_state=42):'SVC_rbf',
           DecisionTreeClassifier(random_state=42):'DT',
           RandomForestClassifier(random_state=42, class_weight='balanced'):'RF',
           KNeighborsClassifier(n_neighbors=5):'KNEIGH',
           lgb.LGBMClassifier():'LGBM',
           XGBClassifier():'XGB'}

data_1 = []  # mean CV accuracy with height and weight only
data_2 = []  # mean CV accuracy with height, weight and faf

for modelo in modelos:
    print(f'processing ----------> {modelo}')
    # 5-fold cross-validation on the training split, once per feature set
    data_1.append((cross_val_score(modelo, X_train_1, y_train_1, cv=5, scoring='accuracy')).mean())
    data_2.append((cross_val_score(modelo, X_train_2, y_train_2, cv=5, scoring='accuracy')).mean())

In [21]:
baselines = pd.DataFrame({'accuracy':data_1,
                          'accuracy_faf':data_2},
                         index=modelos.values())
baselines['diff'] = baselines['accuracy'] - baselines['accuracy_faf']
baselines.sort_values(by='accuracy', ascending=False)

| model      | accuracy | accuracy_faf | diff      |
|------------|----------|--------------|-----------|
| XGB        | 0.86323  | 0.86456      | -0.001329 |
| LGBM       | 0.862626 | 0.863593     | -0.000967 |
| RF         | 0.851565 | 0.850357     | 0.001209  |
| KNEIGH     | 0.845521 | 0.847213     | -0.001692 |
| SVC_rbf    | 0.838088 | 0.837423     | 0.000665  |
| DT         | 0.828599 | 0.824671     | 0.003928  |
| SVC_poly   | 0.81488  | 0.822858     | -0.007978 |
| SVC_linear | 0.805875 | 0.808775     | -0.002901 |
| LogR       | 0.771244 | 0.764294     | 0.00695   |


Except for RandomForest, the best models predict slightly better with the 'faf' feature than without it (by roughly 0.001 in accuracy). We keep LGBM, XGB and RF. \
We also observe that these 3 models, which use far fewer features, predict only two hundredths worse than Modelo_1.
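
To make the comparison concrete, the sketch below recomputes the effect of `faf` from the accuracies reported in the baseline table. The values are copied from that table as displayed (so they are rounded). The name `faf_gain` is a hypothetical helper, and the gain is defined as accuracy with `faf` minus accuracy without it, which is the opposite sign of the table's `diff` column.

```python
# (accuracy without faf, accuracy with faf), copied from the baseline table
baseline_scores = {
    "XGB": (0.863230, 0.864560),
    "LGBM": (0.862626, 0.863593),
    "RF": (0.851565, 0.850357),
    "KNEIGH": (0.845521, 0.847213),
    "SVC_rbf": (0.838088, 0.837423),
    "DT": (0.828599, 0.824671),
    "SVC_poly": (0.814880, 0.822858),
    "SVC_linear": (0.805875, 0.808775),
    "LogR": (0.771244, 0.764294),
}


def faf_gain(scores):
    """Accuracy gained by adding faf (positive means faf helps)."""
    return {name: round(with_faf - without, 6)
            for name, (without, with_faf) in scores.items()}


gains = faf_gain(baseline_scores)
helped = sorted(name for name, gain in gains.items() if gain > 0)
largest = max(gains, key=lambda name: abs(gains[name]))

print("Models that improve with faf:", helped)
print("Largest absolute change:", largest, gains[largest])
```

Every change is below one percentage point, so `faf` adds little on top of height and weight for any of these baselines.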

## Grid Search

### LightGBM

In [22]:
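# LightGBM
# Note: with max_depth=5 a tree can have at most 2**5 = 32 leaves, so num_leaves=50 is never reached.
# Note: in LightGBM, subsample only takes effect when subsample_freq > 0 (the default is 0),
# so the subsample values below are not expected to change the score.

LGBM_grid_1 = {
    'num_leaves': [50],
    'learning_rate': [0.1],
    'n_estimators': [100],
    'max_depth': [5],
    'min_child_samples': [20, 30, 40],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0]}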

LGBM_grid_search_1 = GridSearchCV(lgb.LGBMClassifier(),
                           LGBM_grid_1,
                           cv=5,
                           scoring='accuracy',
                           n_jobs=-1,
                           verbose=0)

LGBM_grid_search_1.fit(X_train_2, y_train_2)

print('LGBM best params:\n', LGBM_grid_search_1.best_params_)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000770 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 514
[LightGBM] [Info] Number of data points in the train set: 16546, number of used features: 3
[LightGBM] [Info] Start training from score -2.104533
[LightGBM] [Info] Start training from score -1.903953
[LightGBM] [Info] Start training from score -2.148106
[LightGBM] [Info] Start training from score -2.125576
[LightGBM] [Info] Start training from score -1.962424
[LightGBM] [Info] Start training from score -1.852173
[LightGBM] [Info] Start training from score -1.631497
LGBM best params:
 {'colsample_bytree': 0.9, 'learning_rate': 0.1, 'max_depth': 5, 'min_child_samples': 40, 'n_estimators': 100, 'num_leaves': 50, 'subsample': 0.8}


In [23]:
LGBM_1 = lgb.LGBMClassifier(num_leaves=50,
              learning_rate=0.1,
              n_estimators=100,
              max_depth=5,
              min_child_samples=40,
              subsample=0.8,
              colsample_bytree=0.9)

LGBM_1.fit(X_train_2, y_train_2)



### XGBoost

In [24]:
# XGBoost

XGB_grid_1 = {
       'n_estimators': [50, 100, 200],
       'learning_rate': [0.01, 0.05, 0.1],
       'max_depth': [3, 5, 7],
       'subsample': [0.6, 0.8, 1.0],
       'colsample_bytree': [0.6, 0.8, 1.0]}

XGB_grid_search_1 = GridSearchCV(XGBClassifier(),
                           XGB_grid_1,
                           cv=5,
                           scoring='accuracy',
                           n_jobs=-1,
                           verbose=0)

XGB_grid_search_1.fit(X_train_2, y_train_2)

print('XGB best params:\n', XGB_grid_search_1.best_params_)

XGB best params:
 {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200, 'subsample': 1.0}


In [25]:
XGB_1 = XGBClassifier(n_estimators=200,
              learning_rate=0.1,
              max_depth=5,
              subsample=1.0,
              colsample_bytree=0.8)

XGB_1.fit(X_train_2, y_train_2)

### RandomForest

In [26]:
# RandomForest

RF_grid_2 = {
    'n_estimators': [100, 150],
    'max_depth': [5, 10, 15],
    'min_samples_split': [8, 10, 12],
    'min_samples_leaf': [3, 4, 5],
    'bootstrap': [True]
}

RF_grid_search_2 = GridSearchCV(RandomForestClassifier(class_weight='balanced'),
                           RF_grid_2,
                           cv=5,
                           scoring='accuracy',
                           n_jobs=-1
                          )

RF_grid_search_2.fit(X_train_2, y_train_2)

print('RF best params:\n', RF_grid_search_2.best_params_)

RF best params:
 {'bootstrap': True, 'max_depth': 15, 'min_samples_leaf': 3, 'min_samples_split': 12, 'n_estimators': 150}
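
The cost of the three searches above can be estimated by counting candidate combinations. The sketch below copies the three grids from the cells above and multiplies the number of candidates by the 5 cross-validation folds. `n_candidates` is a hypothetical helper, and the count leaves out the final refit on the full training set.

```python
from math import prod

CV_FOLDS = 5  # cv=5 in every GridSearchCV call above

grids = {
    "LGBM": {
        "num_leaves": [50],
        "learning_rate": [0.1],
        "n_estimators": [100],
        "max_depth": [5],
        "min_child_samples": [20, 30, 40],
        "subsample": [0.8, 0.9, 1.0],
        "colsample_bytree": [0.8, 0.9, 1.0],
    },
    "XGB": {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.01, 0.05, 0.1],
        "max_depth": [3, 5, 7],
        "subsample": [0.6, 0.8, 1.0],
        "colsample_bytree": [0.6, 0.8, 1.0],
    },
    "RF": {
        "n_estimators": [100, 150],
        "max_depth": [5, 10, 15],
        "min_samples_split": [8, 10, 12],
        "min_samples_leaf": [3, 4, 5],
        "bootstrap": [True],
    },
}


def n_candidates(grid):
    """Number of parameter combinations an exhaustive grid search evaluates."""
    return prod(len(values) for values in grid.values())


fits = {name: n_candidates(grid) * CV_FOLDS for name, grid in grids.items()}
for name, total in fits.items():
    print(f"{name}: {n_candidates(grids[name])} candidates, {total} cross-validation fits")
```

The XGBoost grid is by far the largest, which is worth keeping in mind before widening any of its value lists.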


In [27]:
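# Note: the grid search above used class_weight='balanced', which is not passed here,
# so this estimator is not identical to the best one found by the search.
RF_1 = RandomForestClassifier(n_estimators=150,
              max_depth=15,
              min_samples_split=12,
              min_samples_leaf=3,
              bootstrap=True)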

RF_1.fit(X_train_2, y_train_2)

## Ensemble

In [28]:
voting_clf = VotingClassifier(estimators=[('lgbm', LGBM_1),
                                          ('xgb', XGB_1),
                                          ('rf', RF_1)],
                            voting='soft',
                            verbose=False)

voting_clf.fit(X_train_2, y_train_2)



In [29]:
cross_val_score(voting_clf, X_train_2, y_train_2, cv=5, scoring='accuracy').mean()


0.8690322922287381
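
The ensemble above uses `voting='soft'`, which averages the class probabilities predicted by each estimator and picks the class with the highest average. The sketch below illustrates that rule in plain Python on made-up probabilities for three models and three classes. `soft_vote` and `hard_vote` are hypothetical helpers written for this illustration, not scikit-learn functions.

```python
from collections import Counter


def soft_vote(probabilities):
    """Average per-class probabilities across models; return (winning class, averages)."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [sum(model[c] for model in probabilities) / n_models
                for c in range(n_classes)]
    winner = max(range(n_classes), key=averaged.__getitem__)
    return winner, averaged


def hard_vote(probabilities):
    """Majority vote over each model's most probable class."""
    votes = [max(range(len(model)), key=model.__getitem__) for model in probabilities]
    return Counter(votes).most_common(1)[0][0]


# Made-up probabilities: one confident model and two mildly disagreeing ones
toy = [
    [0.9, 0.1, 0.0],
    [0.4, 0.6, 0.0],
    [0.4, 0.6, 0.0],
]

winner, averaged = soft_vote(toy)
print("soft vote:", winner, [round(p, 3) for p in averaged])
print("hard vote:", hard_vote(toy))
```

In this toy case the two rules disagree: the single confident model outweighs the two mild votes under soft voting, whereas a majority vote would follow the two weaker models.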

## Training and validation

In [31]:
accuracy_score(y_test_2, voting_clf.predict(X_test_2))

0.8636693255982596

The model overfits slightly (cross-validated accuracy of 0.869 versus 0.864 on the test set) and is three hundredths worse than Modelo_1.

## Saving

In [32]:
import joblib

# Raw string so the backslashes in the Windows path are not read as escape sequences
joblib.dump(voting_clf, r'..\models\my_model_3.sav')

['..\\models\\my_model_3.sav']