# MARATÓN BEHIND THE CODE 2020

## CHALLENGE 2: TORTUGA CODE

### Introduction

In data science projects that aim to build *machine learning* (or statistical learning) models, it is very unusual for the initial data to already be in the ideal format for model building. Several intermediate preprocessing steps are required, such as encoding categorical variables, normalizing numerical variables, and handling missing data. **scikit-learn**, one of the most popular open-source *machine learning* libraries in the world, already ships built-in functions for the most common data transformations. However, in a typical machine learning workflow, these transformations must be applied at least twice: once to "train" the model, and again whenever new data is sent as input to be classified by that model.
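
As a minimal sketch of why the same transformations have to be reused at training time and at prediction time, the snippet below chains an imputer and a classifier in a scikit-learn `Pipeline`. The toy arrays and the step names (`imputer`, `classifier`) are assumptions made up for illustration; they are not part of the challenge dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy training data (hypothetical values) with one missing entry.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [8.0, 80.0], [9.0, 90.0]])
y_train = np.array(["beginner", "beginner", "advanced", "advanced"])

# The pipeline learns the imputation values and the tree in a single fit() call.
pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)

# New data goes through the exact same imputation (medians learned on the
# training set) before reaching the classifier.
X_new = np.array([[np.nan, 85.0], [1.5, 12.0]])
predictions = pipeline.predict(X_new)
```

Because the imputer is fitted only inside `fit()`, the medians used for `X_new` come from the training data, which is the behavior one usually wants when scoring unseen rows.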


### Working with scikit-learn

In [None]:
!pip install xgboost --upgrade

In [None]:
# Next, we import the libraries that will be used:

# Library for working with JSON
import json

# Library for making HTTP requests
import requests

# Library for data exploration and analysis
import pandas as pd

# Library with numerical methods and matrix representations
import numpy as np

# Library for building a model based on the Gradient Boosting technique
import xgboost as xgb

# scikit-learn packages for data preprocessing
# "SimpleImputer" is a transformation that fills in missing values in datasets
from sklearn.impute import SimpleImputer

# scikit-learn packages for model training and pipeline construction
# Function to split the dataset into test and training samples
from sklearn.model_selection import train_test_split
# Class for creating models based on decision trees
from sklearn.tree import DecisionTreeClassifier
# Class for creating a machine-learning pipeline
from sklearn.pipeline import Pipeline

# scikit-learn packages for model evaluation
# Utilities for cross-validating the resulting model
from sklearn.model_selection import KFold, cross_validate

In [None]:
   
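import itertools
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
%matplotlib inline
def plot_confusion_matrix(y_true, y_pred, class_names,title="Confusion matrix",normalize=False,onehot = False, size=4):
    """
    Computes the confusion matrix from the true and predicted labels and plots it.

    Args:
    y_true (array, shape = [n_samples]): true labels (one-hot rows if onehot=True)
    y_pred (array, shape = [n_samples]): predicted labels (one-hot rows if onehot=True)
    class_names (array, shape = [n]): String names of the classes
    title (str): title of the plot
    normalize (bool): if True, show row-normalized values in the cells
    onehot (bool): if True, y_true and y_pred are decoded with argmax first
    size (int): side of the square figure, in inches
    """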
    if onehot :
        cm = confusion_matrix([y_i.argmax() for y_i in y_true], [y_ip.argmax() for y_ip in y_pred])
    else:
        cm = confusion_matrix(y_true, y_pred)
    figure = plt.figure(figsize=(size, size))
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(class_names))
    plt.xticks(tick_marks, class_names, rotation=45)
    plt.yticks(tick_marks, class_names)

    # Normalize the confusion matrix.
    cm = np.around(cm.astype('float') / cm.sum(axis=1)[:, np.newaxis], decimals=2) if normalize else cm

    # Use red text if squares are dark; otherwise black.
    threshold = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        color = "red" if cm[i, j] > threshold else "black"
        plt.text(j, i, cm[i, j], horizontalalignment="center", color=color)

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
    #return figure

### Importing a .csv from your IBM Cloud Pak for Data project into this notebook's kernel

In [None]:
df_to_be_scored = pd.read_csv(r'../input/tech-students-profile-prediction/to_be_scored_tortuga.csv')
df_to_be_scored.head()

In [None]:
# First, we import the dataset provided for the challenge, which is already included in this project.

#!wget --no-check-certificate --content-disposition https://raw.githubusercontent.com/maratonadev-la/desafio-2-2020/master/Assets/Data/dataset-tortuga-desafio-2.csv
df_training_dataset = pd.read_csv('../input/tech-students-profile-prediction/dataset-tortuga.csv')
df_training_dataset.tail(10)

The provided dataset has 16 columns: 15 are feature variables (input data) and one is the target variable (the one we want our model to predict).

The feature variables are:

    Unnamed: 0                          - This column has no name and must be removed from the dataset
    NAME                                - Student's name
    USER_ID                             - Student's identification number
    HOURS_DATASCIENCE                   - Number of hours studied in Data Science
    HOURS_BACKEND                       - Number of hours studied in Web (Back-End)
    HOURS_FRONTEND                      - Number of hours studied in Web (Front-End)
    NUM_COURSES_BEGINNER_DATASCIENCE    - Number of beginner-level Data Science courses completed by the student
    NUM_COURSES_BEGINNER_BACKEND        - Number of beginner-level Web (Back-End) courses completed by the student
    NUM_COURSES_BEGINNER_FRONTEND       - Number of beginner-level Web (Front-End) courses completed by the student
    NUM_COURSES_ADVANCED_DATASCIENCE    - Number of advanced-level Data Science courses completed by the student
    NUM_COURSES_ADVANCED_BACKEND        - Number of advanced-level Web (Back-End) courses completed by the student
    NUM_COURSES_ADVANCED_FRONTEND       - Number of advanced-level Web (Front-End) courses completed by the student
    AVG_SCORE_DATASCIENCE               - Cumulative average score in the Data Science courses completed by the student
    AVG_SCORE_BACKEND                   - Cumulative average score in the Web (Back-End) courses completed by the student
    AVG_SCORE_FRONTEND                  - Cumulative average score in the Web (Front-End) courses completed by the student
    
The target variable is:

    PROFILE                             - Student's career profile (one of 6 possible values)
    
        - beginner_front_end
        - advanced_front_end
        - beginner_back_end
        - advanced_back_end
        - beginner_data_science
        - advanced_data_science
        
With a model capable of classifying a student into one of these categories, we can recommend content in a personalized way, according to each student's needs.

### Exploring the provided data

We can continue exploring the provided data by counting the missing values in each column with ``isnull().sum()``:

In [None]:
df_training_dataset.isnull().sum()

### Visualization

To visualize the provided dataset, we can use the ``matplotlib`` and ``seaborn`` libraries:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(28, 4))

sns.distplot(df_training_dataset['HOURS_DATASCIENCE'].dropna(), ax=axes[0])
sns.distplot(df_training_dataset['HOURS_BACKEND'].dropna(), ax=axes[1])
sns.distplot(df_training_dataset['HOURS_FRONTEND'].dropna(), ax=axes[2])

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(28, 8))

sns.distplot(df_training_dataset['NUM_COURSES_BEGINNER_DATASCIENCE'].dropna(), ax=axes[0][0] )
sns.distplot(df_training_dataset['NUM_COURSES_BEGINNER_BACKEND'].dropna(), ax=axes[0][1] )
sns.distplot(df_training_dataset['NUM_COURSES_BEGINNER_FRONTEND'].dropna(), ax=axes[0][2])
sns.distplot(df_training_dataset['NUM_COURSES_ADVANCED_DATASCIENCE'].dropna(), ax=axes[1][0])
sns.distplot(df_training_dataset['NUM_COURSES_ADVANCED_BACKEND'].dropna(), ax=axes[1][1])
sns.distplot(df_training_dataset['NUM_COURSES_ADVANCED_FRONTEND'].dropna(), ax=axes[1][2])

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(28, 4))

sns.distplot(df_training_dataset['AVG_SCORE_DATASCIENCE'].dropna(), ax=axes[0])
sns.distplot(df_training_dataset['AVG_SCORE_BACKEND'].dropna(), ax=axes[1])
sns.distplot(df_training_dataset['AVG_SCORE_FRONTEND'].dropna(), ax=axes[2])

In [None]:
fig, axes = plt.subplots(figsize=(28, 4))

sns.countplot(ax=axes, x='PROFILE', data=df_training_dataset)
df_training_dataset['PROFILE'].value_counts()

# FEATURE ENGINEERING

## Drop columns

We can drop the columns that are not related to the student's characteristics:
* Unnamed: 0
* USER_ID
* NAME

In [None]:
df_train = df_training_dataset.drop(columns = [ 'Unnamed: 0', 'USER_ID','NAME' ])

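# Filling NaN Values

We will fill in the missing values. The original idea was to group the data by profile and impute with scikit-learn's KNN within each class; the code below instead fits one random forest per column (tuned with ``GridSearchCV``) on the rows without nulls and predicts each missing value from the other numeric columns.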
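
The idea of model-based imputation can be sketched on a single column. This is a simplified illustration under assumptions, not the notebook's full routine: the toy table, the column names `HOURS` and `SCORE`, and the fixed `n_estimators=50` are all hypothetical, and no grid search is performed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy table: SCORE depends on HOURS, and two SCORE values are missing.
rng = np.random.default_rng(0)
hours = rng.uniform(0, 100, size=50)
toy = pd.DataFrame({"HOURS": hours, "SCORE": 0.5 * hours + 20})
toy.loc[[3, 7], "SCORE"] = np.nan

# Fit the model only on rows where the column to fill is known.
complete = toy.dropna()
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(complete[["HOURS"]], complete["SCORE"])

# Predict the missing entries from the other column and write them back.
missing = toy["SCORE"].isna()
toy.loc[missing, "SCORE"] = model.predict(toy.loc[missing, ["HOURS"]])
```

The function defined in the next cell follows the same pattern, repeated for every column in `columns2fill` and with the forest's hyperparameters chosen by cross-validation.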

In [None]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import resample  # used by the upsampling_classes helper below
from collections import defaultdict
from tqdm import tqdm

def fill_categorical_valuesUsingNumerical_rfc(data, aux_table, columns2fill, columnsBase_num,columnsBase_cat=None,max_depth=8):
    """
    columnsBase_num: list of numeric columns used as predictors
    columnsBase_cat: list of categorical columns used as predictors (may be empty)
    columns2fill: list of columns whose missing values will be filled
    aux_table: table with no null values, used to fit the models
    data: list of DataFrames to complete
    """
    aux_table = aux_table.reset_index(drop=True).copy()
    data = data.copy()
#     for ind , table in enumerate(data):
#         table['Genero'] = table['Genero'].replace('0', np.nan)
#         data[ind] = table
    models= defaultdict(list)
    
    encoder = OneHotEncoder(handle_unknown='ignore')
    def fill_nan_categorical(x, model,column,columnsBase_num,mean_values,columnsBase_cat,mode_values):
        if x[column] == x[column]: ## only null values should get past this check (NaN != NaN)
            return x[column] 
        valores2look = [x[col_name] if x[col_name] == x[col_name] else mean_values[col_name]  for col_name in columnsBase_num]
        #valores2look = valores2look
        valores2look = np.stack(valores2look).reshape(1, -1)
        #print(valores2look)
        return model.predict(valores2look )[0] 
    
    def fill_nan_categorical_catBase(x, model,column,columnsBase_num,mean_values,columnsBase_categ,mode_values,encoder):
        if x[column] == x[column]: ## only null values should get past this check (NaN != NaN)
            return x[column] 
        ## Take the numeric values for the prediction
        #print(columnsBase_num)
        valores2look = []
        if len(columnsBase_num)>0:
            valores2look = [x[col_name] if x[col_name] == x[col_name] else mean_values[col_name]  for col_name in columnsBase_num]
        if len(columnsBase_categ)> 0:
            ## Take the categorical values for the prediction
            valores2look_cat = [x[col_name] if x[col_name] == x[col_name] else mode_values[col_name]  for col_name in columnsBase_categ]
            ##     Transform them with the encoder
            valores2look_cat = onehot_transfor(encoder, valores2look_cat)
            ############ CONCATENATING THE TWO LISTS OF VALUES
            valores2look = valores2look + valores2look_cat
        #ADAPTING TO THE CLASSIFIER'S INPUT FORMAT
        valores2look = np.stack(valores2look).reshape(1, -1)
        #print(valores2look)
        value = model.predict(valores2look)[0]
        return value if type(value)== str else round(value) 
    
    def upsampling_classes(X, Y , column):
        df_train_umb = X.join(Y).copy()
        count4bal = df_train_umb[column].value_counts().sort_values(ascending=True)
        class_sorter = count4bal.index.tolist()[:-1]
        class_mayor = count4bal.index[-1]  ##  majority class
        mayority_sample = count4bal.iloc[-1]  ## number of samples in the majority class
        df_train_balanced = df_train_umb.loc[df_train_umb[column]==class_mayor]
        for class_i in class_sorter:
            df_minor_upsampled = resample(df_train_umb.loc[df_train_umb[column]==class_i], 
                                     replace=True,     # sample with replacement
                                     n_samples=mayority_sample,    # to match majority class
                                     random_state=17) # reproducible results
            df_train_balanced = pd.concat([df_train_balanced,df_minor_upsampled ],ignore_index=True)

        return df_train_balanced.drop(columns=[column]), df_train_balanced[column]
    
    def onehot_transfor(encoder, lista):
        output = encoder.transform(np.stack([lista[0]]).reshape(-1, 1)).toarray()
        if len(lista)>1:
            for cat in lista[1:]:
                output = output +  encoder.transform(np.stack([cat]).reshape(-1, 1)).toarray()
            return output[0].tolist()
        else:
            return output[0].tolist()
        
    def best_model(X, Y,column):
        #display(X.columns)
        if type(Y[0]) == object or type(Y[0]) == str:
            #X, Y = upsampling_classes(X, Y , column)
            clf_rfr = GridSearchCV(estimator=RandomForestClassifier(random_state=25, n_jobs=-1), 
                        param_grid=[{'n_estimators':[10,20,100,120],'criterion':['entropy','gini']}], # 'max_depth':[None,20], 
                        scoring='f1_macro', n_jobs=-1, cv=5)
            clf_rfr.fit(X,Y)
            print(clf_rfr.best_params_)########
            return clf_rfr
        else: 
            clf_rfr = GridSearchCV(estimator=RandomForestRegressor(random_state=25, n_jobs=-1), 
                        param_grid=[{ 'n_estimators':[100,120],'criterion':['mse','mae']}], #'max_depth':[None,],
                        scoring='r2', n_jobs=-1, cv=KFold(n_splits=5) )
            clf_rfr.fit(X,Y)
            print(clf_rfr.best_params_)########
            return clf_rfr
            
    ### NUMERIC-ONLY MODE AS BASE ####################################################
    columnsBase_cat = columnsBase_cat or []  # avoids calling .copy() on the None default
    columnsBase = columnsBase_num.copy() + columnsBase_cat.copy()
    ### FILLING VALUES USING THE CLASSIFIER (TODO: add docs)
    mean_values = aux_table[columnsBase_num].mean()
    mode_values = aux_table[columnsBase_cat].mode() if columnsBase_cat else None
    for column in tqdm(columns2fill):
        if column in columnsBase:
            columnsBase_cat_ad = columnsBase_cat.copy()
            columnsBase_num_ad = columnsBase_num.copy()
            if column in columnsBase_cat:
                columnsBase_cat_ad.remove(column)
            if column in columnsBase_num:
                columnsBase_num_ad.remove(column)

            if len(columnsBase_cat_ad)>0:
                base_cat_list = np.stack([str(item)+str(ind)   for ind , column in enumerate(columnsBase_cat_ad) for item in np.unique(aux_table[column].to_numpy())]).reshape(-1, 1)
                ##print(base_cat_list)
                encoder.fit(base_cat_list)

            aux_column_set = columnsBase.copy()
            aux_column_set.remove(column)
            model = best_model(pd.get_dummies(aux_table[aux_column_set]),aux_table[column],column)
            ##print(pd.get_dummies(aux_table[aux_column_set]).columns.tolist())
            models[column] = model.best_score_
            ### FILLING IN DATA
            for ind, table in enumerate(data):
                table[column] = table.apply(fill_nan_categorical_catBase, args=(model,column,columnsBase_num_ad,mean_values,columnsBase_cat_ad,mode_values,encoder ), axis=1)
                data[ind] = table
        else:
            base_cat_list = np.stack([str(item)+str(ind)   for ind , column in enumerate(columnsBase_cat)    for item in np.unique(aux_table[column].to_numpy())]).reshape(-1, 1)
            #print(base_cat_list)
            encoder.fit(base_cat_list)
            model = best_model(pd.get_dummies(aux_table[columnsBase]),aux_table[column],column)
            #print(pd.get_dummies(aux_table[columnsBase]).columns.tolist())
            models[column] = model.best_score_
            #models[column] = model
            ### FILLING IN DATA
            for ind, table in enumerate(data):
                table[column] = table.apply(fill_nan_categorical_catBase, args=(model,column,columnsBase_num,mean_values,columnsBase_cat,mode_values,encoder ), axis=1)
                data[ind] = table
    display(models)
    return data[0], data[1]

In [None]:
df_temp_aux = df_train.dropna().reset_index(drop=True).copy()

In [None]:
df_train.isnull().sum()

In [None]:
columnsBase_num = ['HOURS_DATASCIENCE', 'HOURS_BACKEND', 'HOURS_FRONTEND',
       'NUM_COURSES_BEGINNER_DATASCIENCE', 'NUM_COURSES_BEGINNER_BACKEND',
       'NUM_COURSES_BEGINNER_FRONTEND', 'NUM_COURSES_ADVANCED_DATASCIENCE',
       'NUM_COURSES_ADVANCED_BACKEND', 'NUM_COURSES_ADVANCED_FRONTEND',
       'AVG_SCORE_DATASCIENCE', 'AVG_SCORE_BACKEND', 'AVG_SCORE_FRONTEND' ]
columnsBase_cat = []
#columnsBase_num = []
columns2bfilled = df_train.columns.tolist()
# columns2bfilled.remove('Banca_movil_userfriendly')
# columns2bfilled.remove('Frecuencia_tarjeta_virtual_mes')
columns2bfilled.remove('PROFILE')
df_train_fill,df_test_fill = fill_categorical_valuesUsingNumerical_rfc(data = [df_train,df_to_be_scored], aux_table = df_temp_aux.copy(), 
                                                              columns2fill=columns2bfilled, columnsBase_num=columnsBase_num,
                                                              columnsBase_cat=columnsBase_cat )
df_train_fill.isnull().sum()

## Fixing the data in the integer-valued columns

In [None]:
#df_train_fill,df_test_fill 
columns_int = ['NUM_COURSES_BEGINNER_DATASCIENCE',
 'NUM_COURSES_BEGINNER_BACKEND',
 'NUM_COURSES_BEGINNER_FRONTEND',
 'NUM_COURSES_ADVANCED_DATASCIENCE',
 'NUM_COURSES_ADVANCED_BACKEND',
 'NUM_COURSES_ADVANCED_FRONTEND',]
df_train_fill[columns_int] = df_train_fill[columns_int].apply(lambda x: round(x,0)).astype(int)

# Upsampling Data
The difference between classes is not very large, but as good practice we will balance the classes so that the models are not biased by the imbalance.
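
A minimal sketch of upsampling with replacement, using `sklearn.utils.resample` on a toy table. The data and the two class labels are made up for illustration only.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced toy data: 4 "beginner" rows vs 2 "advanced" rows.
toy = pd.DataFrame({
    "HOURS": [5, 7, 9, 11, 40, 42],
    "PROFILE": ["beginner"] * 4 + ["advanced"] * 2,
})
majority_size = toy["PROFILE"].value_counts().max()

parts = []
for label, group in toy.groupby("PROFILE"):
    if len(group) < majority_size:
        # Draw rows with replacement until the class matches the majority size.
        group = resample(group, replace=True, n_samples=majority_size, random_state=17)
    parts.append(group)

balanced = pd.concat(parts, ignore_index=True)
```

One design point worth keeping in mind: when upsampling happens before the train/test split, copies of the same row can end up on both sides of the split, which tends to make test scores look better than they would on truly unseen data.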

In [None]:
from sklearn.utils import resample
count4bal = df_train_fill['PROFILE'].value_counts().sort_values(ascending=True)
class_sorter = count4bal.index.tolist()[:-1]
class_mayor = count4bal.index[-1]
mayority_sample = count4bal.iloc[-1]
df_balanced = df_train_fill.loc[df_train_fill['PROFILE']==class_mayor]
for class_i in class_sorter:
    df_minor_upsampled = resample(df_train_fill.loc[df_train_fill['PROFILE']==class_i], 
                             replace=True,     # sample with replacement
                             n_samples=mayority_sample,    # to match majority class
                             random_state=17) # reproducible results

    df_balanced = pd.concat([df_balanced,df_minor_upsampled ],ignore_index=True)
df_balanced['PROFILE'].value_counts()

## Creating new variables

We can extract characteristics of each individual:
* Where their highest score is
* Where they have the most courses
* Where they have the most hours <br>
===> Hours/course = hours in the top area / number of courses in the top area <br>
===> Area with the highest score = MAX avg score

* Features created: | NUMBER OF COURSES x3 | Hours/Specialty x3 | Efficiency_in_specialty x3
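
The derived features can be illustrated on two toy rows for a single area (back-end). This is a vectorized sketch with hypothetical values, meant only to show the arithmetic and the guard against division by zero; the notebook's own function, defined in the next cell, computes the same kind of quantities row by row.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one student with 3 back-end courses, one with none.
toy = pd.DataFrame({
    "HOURS_BACKEND": [30.0, 12.0],
    "NUM_COURSES_BEGINNER_BACKEND": [2, 0],
    "NUM_COURSES_ADVANCED_BACKEND": [1, 0],
    "AVG_SCORE_BACKEND": [80.0, 0.0],
})

# Total courses in the area.
total = toy["NUM_COURSES_BEGINNER_BACKEND"] + toy["NUM_COURSES_ADVANCED_BACKEND"]

# Hours per course; rows with zero courses become NaN and are then set to 0.
hours_per_course = (toy["HOURS_BACKEND"] / total.where(total != 0)).fillna(0).round()

# Score per (hours per course), with the same zero guard.
score_per_hour = (
    toy["AVG_SCORE_BACKEND"] / hours_per_course.where(hours_per_course != 0)
).fillna(0).round()
```

For the first row this gives 30 / 3 = 10 hours per course and 80 / 10 = 8; the second row has no courses, so both derived values fall back to 0.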

In [None]:
def add_new_features(data):
    def horasxarea(x):
        total_backend = x['NUM_COURSES_BEGINNER_BACKEND'] + x['NUM_COURSES_ADVANCED_BACKEND'] 
        total_ds      = x['NUM_COURSES_BEGINNER_DATASCIENCE'] + x['NUM_COURSES_ADVANCED_DATASCIENCE']
        total_frontend= x['NUM_COURSES_ADVANCED_FRONTEND'] + x['NUM_COURSES_BEGINNER_FRONTEND']

        HR_A_DS = round(x['HOURS_DATASCIENCE']/total_ds) if total_ds != 0 else 0
        HR_A_BE = round(x['HOURS_BACKEND']/total_backend) if total_backend != 0 else 0
        HR_A_FE = round(x['HOURS_FRONTEND']/total_frontend)if total_frontend != 0 else 0

        SCORE_HR_A_DS =round( x['AVG_SCORE_DATASCIENCE']/HR_A_DS )if HR_A_DS !=0 else 0
        SCORE_HR_A_BE = round(x['AVG_SCORE_BACKEND']/HR_A_BE )    if HR_A_BE !=0 else 0
        SCORE_HR_A_FE = round(x['AVG_SCORE_FRONTEND']/HR_A_FE )   if HR_A_FE !=0 else 0

        return pd.Series([total_backend,total_ds,total_frontend,HR_A_DS, HR_A_BE, HR_A_FE,SCORE_HR_A_DS,SCORE_HR_A_BE,SCORE_HR_A_FE], index = ['NUM_CURS_BE','NUM_CURS_DS', 'NUM_CURS_FE', 'HR_A_DS', 'HR_A_BE', 'HR_A_FE','SCORE_HR_A_DS','SCORE_HR_A_BE','SCORE_HR_A_FE'])

    return data.join(data.apply(horasxarea, axis=1))
    
df_balanced_improve =  add_new_features(df_balanced.copy())
df_balanced_improve.describe()

In [None]:
df_balanced_improve.groupby(['PROFILE']).mean()

# Plotting the cleaned variables

# CORR MATRIX - for a single [random] class


# Dimensionality reduction [n_components selection]

We will apply a neighbors-based method (KNN) for dimensionality reduction; given the multiclass nature of the problem this alternative is more convenient, keeping the default parameters. The commented-out code in this notebook suggests this refers to scikit-learn's NeighborhoodComponentsAnalysis (NCA).

*Results*
Dimensionality reduction does not prove useful for this case study; the linear model does not manage to improve the score.
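
The experiment itself is not included in this notebook, so the following is only a sketch of how such a comparison might look, assuming the method was scikit-learn's `NeighborhoodComponentsAnalysis`. The synthetic data from `make_classification`, `n_components=4`, and the KNN classifier are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic multiclass data standing in for the student features (assumption).
X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)

# Baseline: scale, then classify in the original 12-dimensional space.
baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate: scale, project to 4 supervised components with NCA, then classify.
reduced = Pipeline([
    ("scaler", StandardScaler()),
    ("nca", NeighborhoodComponentsAnalysis(n_components=4, random_state=0)),
    ("knn", KNeighborsClassifier()),
])

scores_baseline = cross_val_score(baseline, X, y, scoring="f1_macro", cv=3)
scores_reduced = cross_val_score(reduced, X, y, scoring="f1_macro", cv=3)
```

Comparing `scores_baseline.mean()` with `scores_reduced.mean()` is one way to decide whether the reduction is worth keeping.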

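## Scaling data
We will use a scaler to put the features on a comparable scale. The code below applies ``StandardScaler`` (zero mean, unit variance); ``MinMaxScaler``, which maps features into the 0-1 range, is left commented out as an alternative.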
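
To make the difference between the two scalers concrete, here is a small sketch on a single made-up feature. The values are hypothetical and chosen so the results are easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One hypothetical feature with three training values and one new value.
X_train = np.array([[0.0], [5.0], [10.0]])
X_new = np.array([[20.0]])

# Both scalers learn their parameters from the training data only.
standard = StandardScaler().fit(X_train)
minmax = MinMaxScaler().fit(X_train)

X_train_std = standard.transform(X_train)   # zero mean, unit variance
X_train_mm = minmax.transform(X_train)      # mapped into [0, 1]
X_new_mm = minmax.transform(X_new)          # can fall outside [0, 1]
```

Note that `MinMaxScaler` only guarantees the 0-1 range for the data it was fitted on: the new value 20 is mapped to 2.0 because it lies beyond the training maximum.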

In [None]:
from sklearn.preprocessing import StandardScaler,MinMaxScaler
n_components = 12 #n_components
scaler = StandardScaler()#MinMaxScaler()
# scaler.fit(df_nca[[i for i in range(n_components)]])
# df_nca[[i for i in range(n_components)]] = scaler.transform(df_nca[[i for i in range(n_components)]])
# df_nca.head()
columns = df_balanced_improve.columns.tolist()
columns.remove('PROFILE')

scaler.fit(df_balanced_improve.drop(columns=['PROFILE']))
df_balanced_improve[columns] = scaler.transform(df_balanced_improve[columns])
df_balanced_improve.head()

# Class to Transform DATA

In [None]:
class raw2test():
    def __init__(self, columns_int, scaler, auggfunc):
        self.columns_int = columns_int
        self.scaler = scaler
        self.auggfunc = auggfunc
        
    def transform(self, X):
        X = X.drop(columns = [ 'Unnamed: 0', 'USER_ID','NAME' ])
        columns = X.columns.tolist()
        #X[columns] = self.knn_imputer.transform(X) 
        X[self.columns_int] = X[self.columns_int].apply(lambda x: round(x,0)).astype(int)
        X = self.auggfunc(X)
        X = self.scaler.transform(X)
        return X       #test_knn_imputer
pretest = raw2test(columns_int =columns_int, scaler =  scaler,auggfunc=add_new_features)

# MODEL TIME

In [None]:
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV , RandomizedSearchCV,StratifiedKFold
#from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier,AdaBoostClassifier,GradientBoostingClassifier,BaggingClassifier,ExtraTreesClassifier, StackingClassifier
from sklearn.model_selection import StratifiedKFold, KFold,cross_val_score
from sklearn.gaussian_process.kernels import RBF
from sklearn.metrics import accuracy_score,confusion_matrix
#from sklearn import svm
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression as LR
from sklearn.decomposition import PCA
#from sklearn.neighbors import NeighborhoodComponentsAnalysis as NCA
from sklearn.neighbors import KNeighborsClassifier as KNC
from collections import defaultdict
from sklearn.metrics import classification_report

from sklearn.model_selection import KFold
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
best_parameters = defaultdict(list) # 'xgb': XGBClassifier(),
classifiers = defaultdict(list) # 'xgb': XGBClassifier(), X_train_whole,y_train_whole
X_train_whole       = df_balanced_improve.drop(columns=['PROFILE']).to_numpy()
y_train_whole       = df_balanced_improve['PROFILE']

X_train, X_test, y_train, y_test = train_test_split(X_train_whole,y_train_whole, stratify = y_train_whole, random_state= 17,test_size = 0.2 )


In [None]:
##NOTE:
# The df_catdumm method (cat+PCA+scaler) does not work, it gives very poor performance     F1 = 0.64
# The df_cat method (using -10 bands to group values) gave average performance           F1 = 0.81
# The df_balanced method gave a performance of F1 = 0.93
# X_train       = df_balanced.drop(columns=['PROFILE'])
# y_train       = df_balanced['PROFILE']


skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
# Parameters to tune
# SVC
tuned_parameters_svc = [{'penalty': ['l2','l1'], 'loss': ['squared_hinge','hinge'],'C': [0.1,1.0,10,100], 'max_iter': [4000], 'random_state':[15], 'dual':[False]}]
# XGB
tuned_parameters_xgb = [{'learning_rate':[0.2,0.3,0.4,0.5],'n_estimators':[140,160,180,220,250,300],'min_child_weight':[.01],'subsample':[.4,.5,.8,1.0],'colsample_bytree':[.5,.8,1.0],
                    'objective':['multi:softmax','binary:logistic'],'n_jobs':[-1],'random_state':[15] }]

#KNC
tuned_parameters_knc = [{'n_neighbors':[2,4,6,8,10,12,14,16],'n_jobs':[-1],'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute'],'p':[1,2]}]

#RFC
tuned_parameters_rfc = [{'n_estimators':[100,120,140,180],'criterion':['gini','entropy'],'min_samples_split':[2,3,4,5],'n_jobs':[-1],'random_state':[15]}]

#GBC
tuned_parameters_gbc = [{'loss':['deviance','exponential'],'learning_rate':[0.1,0.2,0.3,0.4],'n_estimators':[160,190,220,250,270],'subsample':[1.0],'criterion':[ 'mse', 'friedman_mse'],'max_depth':[3,8,10],'random_state':[15]}]

#ETC
tuned_parameters_etc = [{ 'min_samples_split':[.2,.4,.8], 'n_estimators':[80,120,150,180,200,250],'warm_start':[True],'bootstrap':[True],
                         'n_jobs':[-1], 'random_state':[15], 'min_samples_leaf':[3,4,5,6] ,'criterion':['gini', 'entropy'],'max_features':[ 'sqrt']   }]

#LR
tuned_parameters_lr = [{'C':[0.1,1.0,10.0], 'dual':[False], 'solver':['newton-cg', 'saga','sag'],'multi_class':['ovr', 'multinomial']}]

# Parameter tuning
scores = ['f1']

# Best parameters
#  'lr': LR(random_state=17), 'svc':LinearSVC(), 'knc': KNC(),
models = { 'xgb': XGBClassifier(),  'rfc':RandomForestClassifier(), 'etc':ExtraTreesClassifier() ,'gbc':GradientBoostingClassifier() }
parameters = {'lr':tuned_parameters_lr, 'xgb': tuned_parameters_xgb, 'knc': tuned_parameters_knc,'rfc':tuned_parameters_rfc,'svc':tuned_parameters_svc, 'gbc':tuned_parameters_gbc, 'etc':tuned_parameters_etc }

#lista = ['etc']
for model_name in models.keys():
    print("######### MODEL tunning hyper-parameters for %s" % model_name)
    for score in scores:
        print("# %s - Tuning hyper-parameters for %s ###############################################################" % (model_name, score))
        clf_i = GridSearchCV(models[model_name], parameters[model_name], scoring='%s_macro' % score, n_jobs=-1, cv=skf)
        clf_i.fit(X_train,y_train)
        print("Best parameters set found on development set:")
        print()
        print(clf_i.best_params_)
        best_parameters[model_name] = clf_i.best_params_
        classifiers[model_name] = clf_i
        print("Grid scores on development set:")
        print()
        means = clf_i.cv_results_['mean_test_score']
        stds = clf_i.cv_results_['std_test_score']
#         for mean, std, params in zip(means, stds, clf_i.cv_results_['params']):
#             print("%s_macro - %0.3f (+/-%0.03f) for %r"% (score, mean, std * 2, params))
        print("Detailed classification report:")
        print("CV - Results max score: {}".format(np.nan_to_num( means).max()))
        print()
        y_true, y_pred = y_test, clf_i.predict(X_test)
        print(classification_report(y_true, y_pred, digits=4 ))

In [None]:
best_parameters = defaultdict(list,
            {'xgb': {'colsample_bytree': 1.0,
              'learning_rate': 0.4,
              'min_child_weight': 0.01,
              'n_estimators': 300,
              'n_jobs': -1,
              'objective': 'multi:softmax',
              'random_state': 15,
              'subsample': 0.8},
             'rfc': {'criterion': 'gini',
              'min_samples_split': 2,
              'n_estimators': 140,
              'n_jobs': -1,
              'random_state': 15},
             'etc': {'bootstrap': True,
              'criterion': 'gini',
              'max_features': 'sqrt',
              'min_samples_leaf': 3,
              'min_samples_split': 0.2,
              'n_estimators': 250,
              'n_jobs': -1,
              'random_state': 15,
              'warm_start': True},
             'gbc': {'criterion': 'friedman_mse',
              'learning_rate': 0.4,
              'loss': 'deviance',
              'max_depth': 8,
              'n_estimators': 160,
              'random_state': 15,
              'subsample': 1.0},
             'knc': {'algorithm': 'auto',
              'n_jobs': -1,
              'n_neighbors': 12,
              'p': 2},
             'svc': {'C': 0.1,
              'dual': False,
              'loss': 'squared_hinge',
              'max_iter': 4000,
              'penalty': 'l2',
              'random_state': 15},
             'lr': {'C': 1.0,
              'dual': False,
              'multi_class': 'multinomial',
              'solver': 'newton-cg'}})

## Bagging Classifier MODEL - GRIDSEARCH + CV

In [None]:
clf_gbc = GradientBoostingClassifier(**best_parameters['gbc'] )#n_estimators=80, n_jobs=-1)# 20 - 80
clf_RFC = RandomForestClassifier(**best_parameters['rfc'] )#
clf_KNC = KNC(**best_parameters['knc'] )
xgb_model = XGBClassifier(**best_parameters['xgb'])
svc = LinearSVC(**best_parameters['svc'])
#etc = ExtraTreesClassifier(**best_parameters['etc'])
#lr  = LR(**best_parameters['lr'])

models = {'knc': clf_KNC,'svc': svc}

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
# Parameter tunning
scores = ['f1']
# Best parameters
best_parameters_bagging = defaultdict(list)
#lista = ['etc']
for model_name in models.keys():
    params_bagging = [{'n_estimators': [50,100], 'max_samples':[1.0],'base_estimator': [models[model_name]],'n_jobs':[-1] }]
    print("######### Bagging MODEL tunning hyper-parameters for %s" % model_name)
    for score in scores:
        print("# %s - Tuning hyper-parameters for %s #########################################################" % (model_name, score))
        clf_i = GridSearchCV(BaggingClassifier(), params_bagging, scoring='%s_macro' % score, n_jobs=-1, cv=skf)
        clf_i.fit(X_train,y_train)
        print("Best parameters set found on development set:")
        print()
        print(clf_i.best_params_)
        best_parameters_bagging[model_name] = clf_i.best_params_
        print("Grid scores on development set:")
        print()
        means = clf_i.cv_results_['mean_test_score']
        stds = clf_i.cv_results_['std_test_score']
#         for mean, std, params in zip(means, stds, clf_i.cv_results_['params']):
#             print("%s_macro - %0.3f (+/-%0.03f) for %r"% (score, mean, std * 2, params))
        print("Detailed Bagging classification report:")
        print("CV - Results max score: {}".format(np.nan_to_num( means).max()))
        print()
        y_true, y_pred = y_test, clf_i.predict(X_test)
        print(classification_report(y_true, y_pred, digits=4))

In [None]:
best_parameters_bagging = defaultdict(list,
            {'knc': {'base_estimator': KNC(**best_parameters['knc']),
              'max_samples': 1.0,
              'n_estimators': 50,
              'n_jobs': -1},})

# Improved Models

Depending on the bagging results, we must decide whether or not to keep each model in its bagged version.
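
One way to make that decision is to compare the cross-validated score of the plain estimator with its bagged version. The sketch below does this on synthetic data; the dataset, `n_neighbors=12`, and `n_estimators=20` are assumptions for illustration, not the values tuned in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic multiclass data standing in for the training set (assumption).
X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)

base = KNeighborsClassifier(n_neighbors=12)
# The wrapped estimator is passed positionally, which works regardless of
# whether the keyword is named base_estimator or estimator in your version.
bagged = BaggingClassifier(KNeighborsClassifier(n_neighbors=12),
                           n_estimators=20, random_state=15)

score_base = cross_val_score(base, X, y, scoring="f1_macro", cv=skf).mean()
score_bagged = cross_val_score(bagged, X, y, scoring="f1_macro", cv=skf).mean()

# Keep the bagged version only if it actually improves the macro F1.
chosen = bagged if score_bagged > score_base else base
```

Using the same `StratifiedKFold` object for both runs keeps the folds identical, so the two scores are directly comparable.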

In [None]:
xgb_f = XGBClassifier(**best_parameters['xgb'])
KNC_f = BaggingClassifier( **best_parameters_bagging['knc'] )
RFC_f = RandomForestClassifier(**best_parameters['rfc'] )
svc_f = LinearSVC( **best_parameters['svc'] )
gbc_f = GradientBoostingClassifier(**best_parameters['gbc'] )
etc_f = ExtraTreesClassifier(**best_parameters['etc'])
lr_f  = LR( **best_parameters['lr'] )

# Voting Classifier Model + GridSearch
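
The grid in this section tries both `voting='hard'` (majority vote over predicted labels) and `voting='soft'` (average of predicted probabilities). Soft voting needs every member to expose `predict_proba`, which `LinearSVC` does not. Depending on the scikit-learn version, a soft-voting candidate that contains `LinearSVC` is therefore expected to either raise or be scored as NaN, which is presumably why the CV means are passed through `np.nan_to_num`. A minimal check, using freshly built estimators rather than the tuned ones from this notebook (`candidates` and `soft_ready` are hypothetical names):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Unfitted stand-ins for two of the ensemble members
candidates = {'svc': LinearSVC(), 'lr': LogisticRegression()}

# An estimator is usable with voting='soft' only if it offers predict_proba
soft_ready = {name: hasattr(est, 'predict_proba') for name, est in candidates.items()}
print(soft_ready)
```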

In [None]:
from sklearn.model_selection import StratifiedKFold, KFold,cross_val_score #( X_train_PCA , label)
from sklearn.metrics import f1_score
# Progressively remove the weakest models
estimators0= [ ('xgb', xgb_f), ('knc', KNC_f),('rfc',RFC_f),('svc',svc_f),('gbc',gbc_f),('etc',etc_f),('lr', lr_f)]
estimators1= [ ('xgb', xgb_f), ('knc', KNC_f),('rfc',RFC_f),('gbc',gbc_f),('etc',etc_f),('lr', lr_f)]
estimators2= [ ('xgb', xgb_f), ('knc', KNC_f),('rfc',RFC_f),('gbc',gbc_f),('etc',etc_f)]
estimators3= [ ('xgb', xgb_f), ('knc', KNC_f),('rfc',RFC_f),('gbc',gbc_f)] # **** winner
estimators4= [ ('xgb', xgb_f),('rfc',RFC_f),('gbc',gbc_f)] 
estimators5= [ ('rfc',RFC_f),('gbc',gbc_f)]

estimators = [ estimators0,estimators1,estimators2,estimators3,estimators4,estimators5]
voting = ['hard','soft']

params_voting = [{'voting':voting,'n_jobs':[-1] }] # 'estimators': estimators, 

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
# Parameter tuning
scores = ['f1']
# Best parameters
best_parameters_voting = defaultdict(list)
# Scores
score_f1 = []

# `model_name` was a leftover from the bagging loop, so it is not printed here
print("######### Voting MODEL tuning hyper-parameters")
for ind , set_estimator in enumerate(estimators):
    model_vc_i = VotingClassifier(estimators = set_estimator)
    for score in scores:
        print("# %s - Tuning hyper-parameters for %s #############################################################" % ('Voting', score))
        clf_i = GridSearchCV(model_vc_i, params_voting, scoring='%s_macro' % score, n_jobs=-1, cv=skf)
        clf_i.fit(X_train,y_train)
        print("Best parameters set found on development set:")
        print()
        print(clf_i.best_params_)
        best_parameters_voting['set'+str(ind)] = clf_i.best_params_
        print("Grid scores on development set:")
        print()
        means = clf_i.cv_results_['mean_test_score']
        stds = clf_i.cv_results_['std_test_score']
    #         for mean, std, params in zip(means, stds, clf_i.cv_results_['params']):
    #             print("%s_macro - %0.3f (+/-%0.03f) for %r"% (score, mean, std * 2, params))
        print("Detailed Voting classification report:")
        print("CV - Results max score: {}".format(np.nan_to_num( means).max()))
        score_f1.append(np.nan_to_num( means).max())
        best_parameters_voting['scores'] = score_f1
        print()
        y_true, y_pred = y_test, clf_i.predict(X_test)
        print(classification_report(y_true, y_pred, digits=4))

* Testing - Training on 80% of the data

In [None]:
estimators3= [ ('xgb', xgb_f), ('knc', KNC_f),('rfc',RFC_f),('gbc',gbc_f)] # **** winner
## X_train, X_test, y_train, y_test
model_VC = VotingClassifier (estimators = estimators3, voting='hard',n_jobs=-1) ## edit as needed
model_VC.fit(X_train,y_train)
y_true , y_pred =y_test,  model_VC.predict(X_test)

print("Detailed Voting classification report:")
print(classification_report(y_true, y_pred, digits=4))
#plot_confusion_matrix(y_true=y_train_whole, y_pred=y_pred, class_names=np.unique(y_pred),title="VotingClassifier",normalize=True,size=5)

## STACKING CLASSIFIER + GRIDSEARCH ESTIMATORS

In [None]:
from sklearn.tree import DecisionTreeClassifier as DTC
dtc = DTC()
# Stratify Kfold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
skf1 = StratifiedKFold(n_splits=4, shuffle=True, random_state=15)
skf2 = StratifiedKFold(n_splits=8, shuffle=True, random_state=15)
skf3 = StratifiedKFold(n_splits=3, shuffle=True, random_state=15)

# Progressively remove the weakest models
estimators0= [ ('xgb', xgb_f), ('knc', KNC_f),('rfc',RFC_f),('svc',svc_f),('gbc',gbc_f),('etc',etc_f),('lr', lr_f)]
estimators1= [ ('xgb', xgb_f), ('knc', KNC_f),('rfc',RFC_f),('gbc',gbc_f),('etc',etc_f),('lr', lr_f)]  # **** Winner
estimators2= [ ('xgb', xgb_f), ('knc', KNC_f),('rfc',RFC_f),('gbc',gbc_f),('etc',etc_f)]
estimators3= [ ('xgb', xgb_f), ('knc', KNC_f),('rfc',RFC_f),('gbc',gbc_f)]
estimators4= [ ('xgb', xgb_f),('rfc',RFC_f),('gbc',gbc_f)]  
estimators5= [ ('rfc',RFC_f),('gbc',gbc_f)]

## Winning parameters
##params_winner = {'cv': StratifiedKFold(n_splits=8, random_state=15, shuffle=True), 'final_estimator': SVC(), 'n_jobs': -1}

estimators = [ estimators0,estimators1,estimators2,estimators3,estimators4,estimators5]
skf_method = [skf,skf1,skf2,skf3]

params_sc = [{'cv':skf_method,'n_jobs':[-1], 'final_estimator':[ LR(random_state=17),LinearSVC(random_state=17)] }] # 'estimators': estimators, 

# Parameter tuning
scores = ['f1']
# Best parameters
best_parameters_sc = defaultdict(list)
# Scores
score_sc_f1 = []

print("######### Stacking MODEL tuning hyper-parameters" )
for ind , set_estimator in enumerate(estimators):
    model_sc_i = StackingClassifier(estimators = set_estimator)
    for score in scores:
        print("# %s - %s Tuning hyper-parameters for %s #############################################################" % ('Stacking',ind, score))
        clf_i = GridSearchCV(model_sc_i, params_sc, scoring='%s_macro' % score, n_jobs=-1, cv=skf)
        clf_i.fit(X_train,y_train)
        print("Best parameters set found on development set:")
        print()
        print(clf_i.best_params_)
        best_parameters_sc['set'+str(ind)] = clf_i.best_params_
        print("Grid scores on development set:")
        print()
        means = clf_i.cv_results_['mean_test_score']
        stds = clf_i.cv_results_['std_test_score']
    #         for mean, std, params in zip(means, stds, clf_i.cv_results_['params']):
    #             print("%s_macro - %0.3f (+/-%0.03f) for %r"% (score, mean, std * 2, params))
        print("Detailed Stacking classification report:")
        print("CV - Results max score: {}".format(np.nan_to_num( means).max()))
        score_sc_f1.append(np.nan_to_num( means).max())
        best_parameters_sc['scores'] = score_sc_f1
        print()
        y_true, y_pred = y_test, clf_i.predict(X_test)
        print(classification_report(y_true, y_pred, digits=4))

# Stacking Classifier - GridSearch final_estimator
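
The grid in this section tunes the meta-learner through nested parameter names such as `final_estimator__C`. The double underscore tells scikit-learn to route the value to the `final_estimator` object inside the stack. A quick way to confirm that a key is valid before launching a long search is to look it up in `get_params()`. The sketch below builds a small throwaway stack for that purpose (`demo_stack` and `valid_keys` are hypothetical names, and the single decision tree is only a placeholder for the tuned base models):

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Throwaway stack with the same meta-learner type as the notebook
demo_stack = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier(random_state=17))],
    final_estimator=LogisticRegression(random_state=17),
)

# Every key that GridSearchCV may set on this estimator
valid_keys = demo_stack.get_params(deep=True)

for key in ['final_estimator__C', 'final_estimator__solver',
            'final_estimator__max_iter', 'cv', 'n_jobs']:
    print(key, key in valid_keys)
```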

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import StratifiedKFold, KFold,cross_val_score #( X_train_PCA , label)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
# Stratify Kfold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
skf1 = StratifiedKFold(n_splits=4, shuffle=True, random_state=15)
skf2 = StratifiedKFold(n_splits=8, shuffle=True, random_state=15)
skf3 = StratifiedKFold(n_splits=3, shuffle=True, random_state=15)
skf_method = [skf,skf2]
estimators1= [ ('xgb', xgb_f), ('knc', KNC_f),('rfc',RFC_f),('gbc',gbc_f),('etc',etc_f),('lr', lr_f)]  # **** Winner
model_SC = StackingClassifier (estimators=estimators1 , final_estimator = LogisticRegression(random_state=17) )

params_SC = {'final_estimator__C': [0.1,1.0, 10.0], 'final_estimator__solver': [ 'sag', 'saga','newton-cg' ], 'final_estimator__max_iter':[200], 'cv': skf_method,'n_jobs':[-1], }

# Parameter tuning
scores = ['f1']
# Best parameters
best_parameters_SC_final_estim = defaultdict(list)
#lista = ['etc']
for score in scores:
    print("# %s - Tuning hyper-parameters for %s" % ('Stacking Classifier', score))
    clf_i = GridSearchCV(estimator=model_SC, 
                    param_grid=params_SC, 
                    scoring='%s_macro' % score, n_jobs=-1, cv=skf)

    clf_i.fit(X_train,y_train)
    print("Best parameters set found on development set:")
    print()
    print(clf_i.best_params_)
    best_parameters_SC_final_estim['SC'] = clf_i.best_params_
    print("Grid scores on development set:")
    print()
    means = clf_i.cv_results_['mean_test_score']
    stds = clf_i.cv_results_['std_test_score']
#         for mean, std, params in zip(means, stds, clf_i.cv_results_['params']):
#             print("%s_macro - %0.3f (+/-%0.03f) for %r"% (score, mean, std * 2, params))
    print("Detailed Stacking classification report:")
    print("CV - Results max score: {}".format(np.nan_to_num( means).max()))
    score_sc_f1.append(np.nan_to_num( means).max())
    best_parameters_sc['scores'] = score_sc_f1
    print()
    y_true, y_pred = y_test, clf_i.predict(X_test)
    print(classification_report(y_true, y_pred, digits=4))

* Stacking Classifier Testing - Training on 80% of the data

In [None]:
estimators1= [ ('xgb', xgb_f), ('knc', KNC_f),('rfc',RFC_f),('gbc',gbc_f),('etc',etc_f),('lr', lr_f)]  # **** optimized XGB + optimized GBC

model_sc = StackingClassifier(estimators = estimators1, final_estimator = LR(C=10.0, max_iter=200,solver = 'sag', random_state=17), cv = StratifiedKFold(n_splits=5, random_state=15, shuffle=True), n_jobs=-1)
model_sc.fit(X_train,y_train)
y_true, y_pred = y_test, model_sc.predict(X_test) ## improved to 0.9723 (from 0.9698, from 0.9693)
print("Detailed Stacking classification report:")
print(classification_report(y_true, y_pred, digits=4))

* Stacking Classifier - Final training on all the data
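
Note that a report computed on the same rows used for fitting tends to be optimistic, so it should not be compared directly with the hold-out figures. The sketch below illustrates the gap on synthetic data with a single unpruned decision tree, which is an assumption chosen to make the effect obvious and not one of the notebook's models; `X_all`, `y_all`, `train_f1` and `oof_f1` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train_whole / y_train_whole (assumption)
X_all, y_all = make_classification(n_samples=300, n_features=10, random_state=15)
tree = DecisionTreeClassifier(random_state=17)

# Score on the same rows the model was fitted on
train_f1 = f1_score(y_all, tree.fit(X_all, y_all).predict(X_all), average='macro')

# Out-of-fold score: each row is predicted by a model that never saw it
cv_all = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
oof_pred = cross_val_predict(tree, X_all, y_all, cv=cv_all)
oof_f1 = f1_score(y_all, oof_pred, average='macro')

print("train F1=%0.4f  out-of-fold F1=%0.4f" % (train_f1, oof_f1))
```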

In [None]:
model_scf = StackingClassifier(estimators = estimators1, final_estimator = LR(C=10.0, max_iter=200,solver = 'sag', random_state=17), cv = StratifiedKFold(n_splits=5, random_state=15, shuffle=True), n_jobs=-1)
model_scf.fit(X_train_whole,y_train_whole)
y_true, y_pred = y_train_whole, model_scf.predict(X_train_whole)
print("Detailed Stacking classification report whole data:") 
print(classification_report(y_true, y_pred, digits=4))

# STACKING CLASSIFIER | Two Layers
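
In a two-layer stack, the meta-learner of the outer `StackingClassifier` is itself a `StackingClassifier`, so the outer base models feed their out-of-fold predictions to a second ensemble and not to a single model. The sketch below shows that structure on synthetic data with small placeholder models; `inner`, `outer`, `meta` and the data names are hypothetical and unrelated to the tuned estimators used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (assumption, not the challenge data)
X_two, y_two = make_classification(n_samples=200, n_features=8, random_state=25)

# Second layer: a small stack that will act as the meta-learner
inner = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier(random_state=17)),
                ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression(random_state=17),
    cv=3,
)

# First layer: base models whose out-of-fold outputs become the inner stack's features
outer = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier(max_depth=3, random_state=17)),
                ('lr', LogisticRegression(max_iter=200, random_state=17))],
    final_estimator=inner,
    cv=3,
)
outer.fit(X_two, y_two)

# Meta-features produced by the first layer, one row per sample
meta = outer.transform(X_two)
print(type(outer.final_estimator_).__name__, meta.shape[0])
```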

In [None]:
from sklearn.tree import DecisionTreeClassifier as DTC
dtc = DTC()

skf = StratifiedKFold(n_splits=10 )


model_sc_l0 = StackingClassifier (estimators=[ ('tree', dtc),('svc',svc_f)], final_estimator = LR(random_state=17, C=1.0, solver='sag'), cv = skf,n_jobs=-1)

model_SC = StackingClassifier (estimators=[ ('xgb', xgb_f), ('knc', KNC_f),('rfc',RFC_f),('svc',svc_f),('gbc',gbc_f),('etc',etc_f),('lr', lr_f)], final_estimator = model_sc_l0, cv = skf,n_jobs=-1)

X_train1, X_test1, y_train1, y_test1 = train_test_split(X_train_whole,y_train_whole, stratify = y_train_whole, random_state= 25,test_size = 0.2 )

model_SC.fit(X_train1,y_train1)
y_pred = model_SC.predict(X_test1)

print('F1 Score for Stacking model = {}'.format(f1_score(y_test1, y_pred, average='macro')))
print("Detailed Stacking classification report (hold-out split):") 
print(classification_report(y_test1, y_pred, digits=4))
#plot_confusion_matrix(y_true=y_true, y_pred=y_pred, class_names=np.unique(y_pred),title="Stacking Classifier",normalize=True,size=5)