# Data Mining: Classification

**Group**:

- Gabriel Oliveira Moreira Faria
- Vinícius Oliveira

**Objective**:

The goal of this assignment is to build classifiers that predict whether students who took an online course passed or failed.
The datasets describe people who took the online course "Prevenção do uso de drogas" (Drug Use Prevention).


The data is available in the shared drive folder and consists of 3 files:

- "trabalho5_dados_sociais_ID.csv": socioeconomic data and the participants' answers to the initial questions;
- "trabalho5_dados_modulo1_ID.csv": access data from the learning system during the activities of the first course module; and
- "trabalho5_dados_ateh_modulo2_ID.csv": access data from the learning system for the activities up to the second course module.


In this assignment we build a model for each of the following 3 situations:
1) a model that uses only the socioeconomic features and the initial questions;
2) a model that uses the socioeconomic features, the initial questions, and the access data from the first course module; and
3) a model that uses all available data.

For each of the 3 situations, at least 2 models of distinct types must be built.
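The evaluation plan above (three feature sets, at least two model types each) can be sketched as follows. This is only an illustration on synthetic data: the `make_classification` matrix, the split into three column blocks, and the chosen hyperparameters are assumptions, not the course data or the settings used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the course data: 12 features in three nested blocks.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6, random_state=0
)
scenarios = {
    "social": X_demo[:, :4],
    "social + module 1": X_demo[:, :8],
    "all data": X_demo,
}
models = {
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}

# Mean 5-fold accuracy for every (scenario, model) pair.
summary = {
    (scenario, name): cross_val_score(model, features, y_demo, cv=5).mean()
    for scenario, features in scenarios.items()
    for name, model in models.items()
}
for key, score in summary.items():
    print(key, round(score, 3))
```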

**Analyzing the parameters supplied to the models is an important part of the work.**

In [451]:
import itertools as it
import pickle

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import chisquare
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    LabelEncoder,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)
from sklearn.tree import DecisionTreeClassifier

In [452]:
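def preprocess(df: pd.DataFrame, cols_ord=(), cats_ord=(), cols_nom=(), cols_num=()):
    """Encode the ordinal, nominal and numeric columns of `df` and return an array.

    Only the listed columns that exist in `df` are used. When at least one group
    matches, columns outside every group are dropped (the column transformer's
    default `remainder="drop"`); when no group matches, `df` is returned as is.
    """
    cols_ord = list(cols_ord)
    # Keep only the columns present in df (set intersection does not keep order).
    cols_ord_ = list(set(cols_ord) & set(df.columns))
    # Match each surviving ordinal column with its category ordering.
    cats_ord_ = [cats_ord[cols_ord.index(col)] for col in cols_ord_]
    cols_nom_ = list(set(cols_nom) & set(df.columns))
    cols_num_ = list(set(cols_num) & set(df.columns))

    transformers = []
    if cols_ord_:
        transformers.append((OrdinalEncoder(categories=cats_ord_), cols_ord_))
    if cols_nom_:
        transformers.append((OneHotEncoder(), cols_nom_))
    if cols_num_:
        transformers.append((StandardScaler(), cols_num_))

    if transformers:
        steps = [("transformer", make_column_transformer(*transformers))]
        return Pipeline(steps).fit_transform(df)
    return df.to_numpy()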


def search(X, y, path=None, read_cache=True, n_jobs=-1):
    """Grid-search three model families, optionally caching the results at `path`."""
    if path is not None and read_cache:
        try:
            with open(path, "rb") as file:
                return pickle.load(file)
        except FileNotFoundError:
            pass  # No cache yet: run the search below.

    pipe = Pipeline([("estimator", LogisticRegression())])
    grid = GridSearchCV(
        pipe,
        param_grid=[
            {
                # The default lbfgs solver does not support the L1 penalty.
                "estimator": [LogisticRegression(solver="liblinear")],
                "estimator__penalty": ["l1"],
            },
            {
                "estimator": [DecisionTreeClassifier()],
                "estimator__max_depth": [10, 20, None],
            },
            {
                "estimator": [RandomForestClassifier()],
                "estimator__n_estimators": [100, 250, 1000],
            },
        ],
        n_jobs=n_jobs,
        scoring="f1_micro",
    )
    grid.fit(X, y)
    # Best configurations first.
    results = pd.DataFrame(grid.cv_results_).sort_values(
        by="mean_test_score", ascending=False
    )
    if path is not None:
        with open(path, "wb") as file:
            pickle.dump(results, file)

    return results


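def cross_analysis(df, cols, margins=True, normalize=True):
    """Cross-tabulate `cols` against the "aprovado" label.

    With `normalize=True` (which requires `margins=True`), the "Não" and "Sim"
    columns become row-wise proportions while "All" keeps the row counts.
    """
    ctab = pd.crosstab([df[col] for col in cols], df["aprovado"], margins=margins)
    if normalize:
        ctab["Não"] = ctab["Não"] / ctab["All"]
        ctab["Sim"] = ctab["Sim"] / ctab["All"]
    return ctab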
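`preprocess` builds its transformer with `make_column_transformer`, whose default `remainder="drop"` discards every column that is not listed in one of the groups. The sketch below illustrates that behavior on a toy frame and shows the `remainder="passthrough"` alternative. The toy values are hypothetical and the passthrough variant is not what this notebook uses.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: one numeric column to scale, one 0/1 indicator.
toy = pd.DataFrame({"idade": [25.0, 35.0, 45.0], "forum1": [0, 1, 1]})

# Default behavior: columns not listed in any transformer are dropped.
dropped = make_column_transformer(
    (StandardScaler(), ["idade"]),
).fit_transform(toy)

# Alternative: unlisted columns are appended unchanged after the transformed ones.
kept = make_column_transformer(
    (StandardScaler(), ["idade"]),
    remainder="passthrough",
).fit_transform(toy)

print(dropped.shape, kept.shape)
```

Whether unlisted columns should be dropped or passed through depends on the experiment. With passthrough, indicator columns such as the 0/1 answers would reach the classifier alongside the encoded ones.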

### Loading the data

First, we load the data from the provided files.

In [453]:
# Features
df_sociais = pd.read_csv("data/trabalho5_dados_sociais_4.csv")
df_modulo1 = pd.read_csv("data/trabalho5_dados_modulo1_4.csv")
df_modulo2 = pd.read_csv("data/trabalho5_dados_ateh_modulo2_4.csv")

### Exploratory analysis

Next, we analyze how much each feature in the dataset can contribute to predicting whether a participant passes the course. To do so, we use the chi-square test to compare the pass/fail distribution of the whole dataset with that of each subgroup defined by a categorical variable. Each variable's score is the smallest p-value among its subgroups; subgroups with 13 or fewer participants are not tested.

#### Relationships between features and passing

##### Socioeconomic data

In [454]:
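results = {"column": [], "score": []}
for col in df_sociais.columns:
    # Skip the label, the identifier and the numeric variables.
    if col in ["aprovado", "idade", "tempodeservico", "id"]:
        continue
    ctab = cross_analysis(df_sociais, [col], margins=True, normalize=False)
    for cat in ctab.index[:-1]:
        # Observed pass/fail counts of this category versus the counts expected
        # if the category followed the overall pass/fail proportions.
        obs = ctab.loc[cat, ctab.columns[:-1]]
        exp = ctab.loc["All", ctab.columns[:-1]]
        exp = exp * obs.sum() / exp.sum()
        if obs.sum() > 13:
            score = chisquare(obs, exp).pvalue
        else:
            # Too few participants in this category to test.
            score = 1
        results["column"].append(col)
        results["score"].append(score)
# A column's score is the smallest p-value among its categories.
cols_scores_se = (
    pd.DataFrame(results).groupby("column").min().reset_index().sort_values("score")
)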
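The same scoring loop is reused for the module files later on. As a sketch, it could be wrapped in a helper like the one below. The function name `column_scores`, its parameters, and the toy frame are hypothetical, introduced only for illustration.

```python
import pandas as pd
from scipy.stats import chisquare


def column_scores(df, target="aprovado", skip=(), min_count=13):
    """Smallest per-category chi-square p-value for each column of `df`."""
    overall = df[target].value_counts().sort_index()
    rows = []
    for col in df.columns:
        if col == target or col in skip:
            continue
        ctab = pd.crosstab(df[col], df[target])
        ctab = ctab.reindex(columns=overall.index, fill_value=0)
        pvalues = [1.0]  # categories that are too small count as p = 1
        for _, obs in ctab.iterrows():
            n = obs.sum()
            if n > min_count:
                # Counts expected under the overall pass/fail proportions.
                exp = overall * n / overall.sum()
                pvalues.append(chisquare(obs, exp).pvalue)
        rows.append({"column": col, "score": min(pvalues)})
    return pd.DataFrame(rows).sort_values("score", ignore_index=True)


# Toy data: "a" separates the classes perfectly, "b" is independent of them.
toy = pd.DataFrame(
    {
        "a": [0] * 20 + [1] * 20,
        "b": ([0] * 10 + [1] * 10) * 2,
        "aprovado": ["Não"] * 20 + ["Sim"] * 20,
    }
)
scores = column_scores(toy)
print(scores)
```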

In [455]:
cols_scores_se.head(15)

Unnamed: 0,column,score
19,Presença de uma equipe para trabalhar a temáti...,4.689789e-24
5,Desenvolvimento de projetos na escola (facilit...,8.644811000000001e-23
9,Para aquisição de conhecimento na área (motivo...,1.5272249999999999e-21
1,Ausência da família (barreiras),4.545752e-21
20,Promoção de compromisso e confiança (facilitad...,1.4233259999999998e-19
10,Participação da comunidade e dos pais no traba...,2.0126779999999998e-19
23,Valorização do ambiente escolar (facilitadores),7.398094e-19
14,Por ser uma oportunidade de formação continuad...,8.6848e-19
6,Estímulo aos alunos (facilitadores),2.104024e-16
0,Apoio aos projetos em desenvolvimento (facilit...,3.18189e-15


As shown below, the subgroups of the variables listed above have different pass/fail distributions.

In [456]:
cross_analysis(
    df_sociais,
    cols=["Presença de uma equipe para trabalhar a temática (facilitadores)"],
)

aprovado,Não,Sim,All
Presença de uma equipe para trabalhar a temática (facilitadores),Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.682893,0.317107,719
1,0.266904,0.733096,281
All,0.566,0.434,1000
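To make the score concrete, the test can be reproduced by hand for subgroup 1 of the table above. The counts below are recovered from the proportions shown (281 participants, of whom about 26.69% failed, gives 75 and 206), so they are a reconstruction rather than values read from the file.

```python
from scipy.stats import chisquare

# Counts recovered from the table above, in the order ["Não", "Sim"].
observed = [75, 206]  # subgroup 1: 281 participants
overall = [566, 434]  # whole dataset: 1000 participants

# Counts the subgroup would show if it followed the overall proportions.
expected = [count * sum(observed) / sum(overall) for count in overall]

result = chisquare(observed, expected)
print(result.statistic, result.pvalue)
```

The p-value should land close to the score listed for this column in the ranking, since this subgroup is the one that deviates most from the overall distribution.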


In [457]:
cross_analysis(
    df_sociais, cols=["Desenvolvimento de projetos na escola (facilitadores)"]
)

aprovado,Não,Sim,All
Desenvolvimento de projetos na escola (facilitadores),Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.688761,0.311239,694
1,0.287582,0.712418,306
All,0.566,0.434,1000


In [458]:
cross_analysis(df_sociais, cols=["Ausência da família (barreiras)"])

aprovado,Não,Sim,All
Ausência da família (barreiras),Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.686589,0.313411,686
1,0.302548,0.697452,314
All,0.566,0.434,1000


In [459]:
cross_analysis(df_sociais, cols=["Uso de substâncias por familiares (barreiras)"])

aprovado,Não,Sim,All
Uso de substâncias por familiares (barreiras),Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.627907,0.372093,817
1,0.289617,0.710383,183
All,0.566,0.434,1000


In [460]:
cross_analysis(df_sociais, cols=["pp012"])

aprovado,Não,Sim,All
pp012,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Concordo,0.571429,0.428571,889
Concordo totalmente,0.470588,0.529412,17
Discordo,0.756098,0.243902,41
"Nem discordo, nem concordo",0.358491,0.641509,53
All,0.566,0.434,1000


Let us now analyze the relationship between a participant's passing and their motivation for enrolling in the course.

In [461]:
cols = [
    "Identificação pessoal com o tema (motivopart)",
    "Identificação profissional com o tema (motivopart)",
    "Para aquisição de conhecimento na área (motivopart)",
    "Pelo fato de o curso ser gratuito (motivopart)",
    "Pelo fato de o curso estar vinculado à Universidade (motivopart)",
    "Por ser um curso à distância (motivopart)",
    "Por ser uma oportunidade de formação continuada (motivopart)",
]
for col in cols:
    display(cross_analysis(df_sociais, [col]))

aprovado,Não,Sim,All
Identificação pessoal com o tema (motivopart),Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.582185,0.417815,943
1,0.298246,0.701754,57
All,0.566,0.434,1000


aprovado,Não,Sim,All
Identificação profissional com o tema (motivopart),Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.611995,0.388005,817
1,0.360656,0.639344,183
All,0.566,0.434,1000


aprovado,Não,Sim,All
Para aquisição de conhecimento na área (motivopart),Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.688047,0.311953,686
1,0.299363,0.700637,314
All,0.566,0.434,1000


aprovado,Não,Sim,All
Pelo fato de o curso ser gratuito (motivopart),Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.582185,0.417815,943
1,0.298246,0.701754,57
All,0.566,0.434,1000


aprovado,Não,Sim,All
Pelo fato de o curso estar vinculado à Universidade (motivopart),Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.582609,0.417391,920
1,0.375,0.625,80
All,0.566,0.434,1000


aprovado,Não,Sim,All
Por ser um curso à distância (motivopart),Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.590759,0.409241,909
1,0.318681,0.681319,91
All,0.566,0.434,1000


aprovado,Não,Sim,All
Por ser uma oportunidade de formação continuada (motivopart),Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.670886,0.329114,711
1,0.307958,0.692042,289
All,0.566,0.434,1000


Overall, every subgroup showed a noticeable difference in distribution. Let us now analyze the combinations of the subgroups.

In [462]:
display(cross_analysis(df_sociais, cols).reset_index())

aprovado,Identificação pessoal com o tema (motivopart),Identificação profissional com o tema (motivopart),Para aquisição de conhecimento na área (motivopart),Pelo fato de o curso ser gratuito (motivopart),Pelo fato de o curso estar vinculado à Universidade (motivopart),Por ser um curso à distância (motivopart),Por ser uma oportunidade de formação continuada (motivopart),Não,Sim,All
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.772242,0.227758,562
1,0,0.0,0.0,0.0,0.0,0.0,1.0,0.278689,0.721311,61
2,0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1
3,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1
4,0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1
5,0,0.0,1.0,0.0,0.0,0.0,0.0,0.293478,0.706522,92
6,0,0.0,1.0,0.0,0.0,0.0,1.0,0.282051,0.717949,39
7,0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,2
8,0,0.0,1.0,0.0,0.0,1.0,1.0,0.133333,0.866667,15
9,0,0.0,1.0,0.0,1.0,0.0,0.0,0.5,0.5,2


It is clear that merely filling in the motivation questions is correlated with passing, which may indicate greater participant engagement.

In [463]:
cross_analysis(df_sociais, ["escolaridade"])

aprovado,Não,Sim,All
escolaridade,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ensino Médio Completo,0.166667,0.833333,6
Ensino Superior Completo,0.618421,0.381579,228
Ensino Superior Incompleto,0.461538,0.538462,13
Pós-graduação,0.555113,0.444887,753
All,0.566,0.434,1000


In [464]:
cross_analysis(df_sociais, ["lidadiretamente"])

aprovado,Não,Sim,All
lidadiretamente,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Não,0.575758,0.424242,297
Sim,0.561878,0.438122,703
All,0.566,0.434,1000


In [465]:
cross_analysis(df_sociais, ["contatoanterior"])

aprovado,Não,Sim,All
contatoanterior,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Não,0.546185,0.453815,249
Sim,0.57257,0.42743,751
All,0.566,0.434,1000


In [466]:
cross_analysis(df_sociais, ["lida.onde"])

aprovado,Não,Sim,All
lida.onde,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Amigos,0.551724,0.448276,58
Comunidade,0.529644,0.470356,253
Escola,0.586916,0.413084,535
Família,0.578431,0.421569,102
Outros,0.519231,0.480769,52
All,0.566,0.434,1000


##### Module 1 data

In [467]:
results = {"column": [], "score": []}
for col in df_modulo1.columns:
    if col in ["aprovado", "id"]:
        continue
    ctab = cross_analysis(df_modulo1, [col], margins=True, normalize=False)
    for cat in ctab.index[:-1]:
        obs = ctab.loc[cat, ctab.columns[:-1]]
        exp = ctab.loc["All", ctab.columns[:-1]]
        exp = exp * obs.sum() / exp.sum()
        if obs.sum() > 13:
            score = chisquare(obs, exp).pvalue
        else:
            score = 1
        results["column"].append(col)
        results["score"].append(score)
cols_scores_m1 = (
    pd.DataFrame(results).groupby("column").min().reset_index().sort_values("score")
)

In [468]:
cols_scores_m1

Unnamed: 0,column,score
5,forum3,3.27078e-17
0,ativcolm1,9.281761e-16
4,forum2,1.205183e-13
6,forum4,3.286952e-13
2,forum1,6.097683e-10
7,quesm1,1.22342e-09
8,quesm1r,0.000307775
3,forum1r,0.001798624
1,ativcolm1r,0.009776434


In [469]:
cross_analysis(df_modulo1, ["forum3"])

aprovado,Não,Sim,All
forum3,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.840517,0.159483,232
1,0.483073,0.516927,768
All,0.566,0.434,1000


In [470]:
cross_analysis(df_modulo1, ["ativcolm1"])

aprovado,Não,Sim,All
ativcolm1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.775623,0.224377,361
1,0.447574,0.552426,639
All,0.566,0.434,1000


##### Module 2 data

In [471]:
results = {"column": [], "score": []}
for col in df_modulo2.columns:
    if col in ["aprovado", "id"]:
        continue
    ctab = cross_analysis(df_modulo2, [col], margins=True, normalize=False)
    for cat in ctab.index[:-1]:
        obs = ctab.loc[cat, ctab.columns[:-1]]
        exp = ctab.loc["All", ctab.columns[:-1]]
        exp = exp * obs.sum() / exp.sum()
        if obs.sum() > 13:
            score = chisquare(obs, exp).pvalue
        else:
            score = 1
        results["column"].append(col)
        results["score"].append(score)
cols_scores_m2 = (
    pd.DataFrame(results).groupby("column").min().reset_index().sort_values("score")
)

In [472]:
cols_scores_m2

Unnamed: 0,column,score
16,quesm2,9.824336e-29
2,ativcolm2,3.47236e-21
8,forum3,3.27078e-17
0,ativcolm1,9.281761e-16
13,forum8,9.671052e-15
6,forum2,1.205183e-13
12,forum7,2.27096e-13
9,forum4,3.286952e-13
10,forum5,3.326171e-13
4,forum1,6.097683e-10


In [473]:
cross_analysis(df_modulo2, ["quesm2"])

aprovado,Não,Sim,All
quesm2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.883721,0.116279,301
1,0.429185,0.570815,699
All,0.566,0.434,1000


In [474]:
cross_analysis(df_modulo2, ["ativcolm2"])

aprovado,Não,Sim,All
ativcolm2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.794749,0.205251,419
1,0.401033,0.598967,581
All,0.566,0.434,1000


### Preprocessing

In [475]:
# Labels
y = LabelEncoder().fit(["Não", "Sim"]).transform(df_modulo2["aprovado"])
df_sociais = df_sociais.drop(["id", "aprovado"], axis=1)
df_modulo1 = df_modulo1.drop(["id", "aprovado"], axis=1)
df_modulo2 = df_modulo2.drop(["id", "aprovado"], axis=1)

In [476]:
# Category orderings
sn_cat = ["Não", "Sim"]
sexo_cat = ["Feminino", "Masculino"]
escolaridade_cat = [
    "Ensino Médio Completo",
    "Ensino Superior Incompleto",
    "Ensino Superior Completo",
    "Pós-graduação",
]
materialdidatico_cat = ["Adequado", "Muito adequado"]
prazoatividades_cat = [
    "Pouquíssimo flexível",
    "Pouco flexível",
    "Flexível",
    "Muito flexível",
]
interacaopares_cat = ["Importante", "Muito importante"]
organizacaocurso_cat = ["Organizado", "Muito organizado"]
import_ajud_tutor_cat = ["Às vezes", "Sempre"]
autoavaliacao_cat = [
    "Não, não considero",
    "Sim, considero, porém, poderia estar me esforçando mais",
    "Sim, considero",
]
pp_cat = [
    "Discordo totalmente",
    "Discordo",
    "Nem discordo, nem concordo",
    "Concordo",
    "Concordo totalmente",
]

In [477]:
# Ordinal variables
cols_ord = [
    "escolaridade",
    "materialdidatico",
    "prazoatividades",
    "interacaopares",
    "import.ajud.tutor",
    "autoavaliacao.x",
] + [f"pp{n + 1:03}" for n in range(37)]
cats_ord = [
    escolaridade_cat,
    materialdidatico_cat,
    prazoatividades_cat,
    interacaopares_cat,
    import_ajud_tutor_cat,
    autoavaliacao_cat,
] + [pp_cat] * 37

# Nominal variables
cols_nom = list(set(df_sociais.select_dtypes(object).columns) - set(cols_ord))

# Numeric variables
cols_num = ["idade", "tempodeservico"]
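The category lists above matter because `OrdinalEncoder` assigns codes in the order given, so the first entry becomes 0 and the last becomes the highest code. A minimal sketch for `escolaridade`, using three hypothetical sample rows instead of the real data:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

escolaridade_cat = [
    "Ensino Médio Completo",
    "Ensino Superior Incompleto",
    "Ensino Superior Completo",
    "Pós-graduação",
]

# Hypothetical sample rows; the real values come from df_sociais.
sample = pd.DataFrame(
    {
        "escolaridade": [
            "Pós-graduação",
            "Ensino Médio Completo",
            "Ensino Superior Completo",
        ]
    }
)

# Codes follow the position of each value in escolaridade_cat.
encoder = OrdinalEncoder(categories=[escolaridade_cat])
codes = encoder.fit_transform(sample[["escolaridade"]]).ravel()
print(codes)  # [3. 0. 2.]
```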

In [478]:
X_sociais = preprocess(df_sociais, cols_ord, cats_ord, cols_nom, cols_num)
X_modulo1 = preprocess(df_modulo1, cols_ord, cats_ord, cols_nom, cols_num)
X_modulo2 = preprocess(df_modulo2, cols_ord, cats_ord, cols_nom, cols_num)

X_sociais

array([[ 3.        ,  1.        ,  3.        , ...,  0.        ,
        -0.11217859,  0.65818824],
       [ 3.        ,  1.        ,  3.        , ...,  1.        ,
         0.41079851,  1.40819926],
       [ 3.        ,  1.        ,  3.        , ...,  0.        ,
         0.41079851,  0.33675495],
       ...,
       [ 3.        ,  1.        ,  3.        , ...,  0.        ,
         0.01856569, -1.27041153],
       [ 3.        ,  1.        ,  3.        , ...,  0.        ,
         0.28005423,  0.01532165],
       [ 3.        ,  1.        ,  3.        , ...,  0.        ,
         0.14930996,  0.87247711]])

### Classification: Socioeconomic data

### Classification: Socioeconomic data + first module

### Classification: All data

### Classification: Selected fields

#### Socioeconomic data

In [488]:
results = []
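for i in range(len(cols_scores_se)):
    # Note: .loc slices by index label, not by position, so this keeps every row
    # of the score-sorted table up to and including the row labelled i.
    X = df_sociais[cols_scores_se.loc[:i, "column"]]
    X = preprocess(X, cols_ord, cats_ord, cols_nom)

    for clf in [DecisionTreeClassifier(), RandomForestClassifier(), LogisticRegression()]:
        # 5-fold cross-validation with each estimator's default score (accuracy).
        result = cross_validate(clf, X, y, verbose=False)
        results.append([str(clf)] + [i + 1] + list(result["test_score"]))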
results = pd.DataFrame(
    results, columns=["model", "cols", "score1", "score2", "score3", "score4", "score5"]
)

mean = results[results.columns[2:]].mean(axis=1)
std = results[results.columns[2:]].std(axis=1)
max_ = results[results.columns[2:]].max(axis=1)
results["mean"] = mean
results["std"] = std
results["max"] = max_

results.sort_values("mean", ascending=False)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Unnamed: 0,model,cols,score1,score2,score3,score4,score5,mean,std,max
61,RandomForestClassifier(),21,0.690,0.765,0.760,0.740,0.725,0.736,0.030290,0.765
4,RandomForestClassifier(),2,0.695,0.775,0.735,0.740,0.720,0.733,0.029283,0.775
60,DecisionTreeClassifier(),21,0.690,0.745,0.760,0.725,0.725,0.729,0.026315,0.760
10,RandomForestClassifier(),4,0.675,0.745,0.760,0.750,0.710,0.728,0.035107,0.760
28,RandomForestClassifier(),10,0.685,0.760,0.740,0.725,0.720,0.726,0.027704,0.760
...,...,...,...,...,...,...,...,...,...,...
88,RandomForestClassifier(),30,0.515,0.440,0.450,0.490,0.525,0.484,0.037980,0.525
124,RandomForestClassifier(),42,0.495,0.460,0.455,0.490,0.520,0.484,0.026786,0.520
85,RandomForestClassifier(),29,0.530,0.470,0.470,0.465,0.475,0.482,0.027065,0.530
154,RandomForestClassifier(),52,0.515,0.470,0.420,0.470,0.490,0.473,0.034928,0.515
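The selection loop above slices the score table with `.loc[:i]`. On a frame that has been sorted after `reset_index`, label-based and position-based slicing return different rows. The toy frame below is hypothetical and only illustrates that difference; `.iloc` is shown as one way to take the k best-scored columns.

```python
import pandas as pd

# Hypothetical score table, sorted by score as in the notebook.
scores = pd.DataFrame(
    {"column": ["a", "b", "c", "d"], "score": [0.3, 0.01, 0.2, 0.001]}
).sort_values("score")
# After sorting, the index labels appear in the order 3, 1, 2, 0.

i = 0
# Label-based: everything from the first row up to the row labelled 0.
by_label = scores.loc[:i, "column"].tolist()
# Position-based: the i + 1 rows with the smallest scores.
by_position = scores.iloc[: i + 1]["column"].tolist()

print(by_label)     # ['d', 'b', 'c', 'a']
print(by_position)  # ['d']
```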


#### Socioeconomic data + Module 1

In [489]:
results = []
mixed_cols_scores = (
    pd.concat([cols_scores_se, cols_scores_m1]).reset_index().sort_values("score")
)
mixed_df = pd.concat([df_sociais, df_modulo1], axis=1)
for i in range(len(mixed_cols_scores)):
    s = mixed_cols_scores.loc[:i, "column"]
    X = mixed_df[s]
    X = preprocess(X, cols_ord, cats_ord, cols_nom)

    for clf in [DecisionTreeClassifier(), RandomForestClassifier(), LogisticRegression()]:
        result = cross_validate(clf, X, y, verbose=False)
        results.append([str(clf)] + [i + 1] + list(result["test_score"]))
results = pd.DataFrame(
    results, columns=["model", "cols", "score1", "score2", "score3", "score4", "score5"]
)

mean = results[results.columns[2:]].mean(axis=1)
std = results[results.columns[2:]].std(axis=1)
max_ = results[results.columns[2:]].max(axis=1)
results["mean"] = mean
results["std"] = std
results["max"] = max_

results.sort_values("mean", ascending=False)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Unnamed: 0,model,cols,score1,score2,score3,score4,score5,mean,std,max
47,LogisticRegression(),16,0.715,0.740,0.765,0.745,0.710,0.735,0.022638,0.765
242,LogisticRegression(),81,0.710,0.735,0.770,0.745,0.710,0.734,0.025348,0.770
50,LogisticRegression(),17,0.715,0.725,0.775,0.740,0.710,0.733,0.026125,0.775
35,LogisticRegression(),12,0.710,0.740,0.755,0.745,0.715,0.733,0.019558,0.755
233,LogisticRegression(),78,0.710,0.725,0.755,0.755,0.715,0.732,0.021679,0.755
...,...,...,...,...,...,...,...,...,...,...
187,RandomForestClassifier(),63,0.510,0.470,0.440,0.475,0.515,0.482,0.030943,0.515
199,RandomForestClassifier(),67,0.520,0.465,0.425,0.480,0.520,0.482,0.040094,0.520
193,RandomForestClassifier(),65,0.515,0.440,0.425,0.495,0.505,0.476,0.040682,0.515
175,RandomForestClassifier(),59,0.475,0.475,0.465,0.485,0.465,0.473,0.008367,0.485
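The warning printed above comes from `LogisticRegression` stopping before it converges, and its message suggests raising `max_iter` or scaling the data. A minimal sketch of both remedies on synthetic data follows. The `make_classification` matrix and the value `max_iter=1000` are assumptions chosen for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels.
X_demo, y_demo = make_classification(n_samples=500, n_features=30, random_state=0)

# Scaling runs inside the pipeline, so it is refit on each training fold.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
result = cross_validate(clf, X_demo, y_demo, cv=5)
print(result["test_score"].mean())
```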


#### All data

In [490]:
results = []
mixed_cols_scores = (
    pd.concat([cols_scores_se, cols_scores_m1, cols_scores_m2]).reset_index().sort_values("score")
)
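# df_modulo2 repeats the module 1 columns (e.g. forum3), so this frame
# contains duplicate column names.
mixed_df = pd.concat([df_sociais, df_modulo1, df_modulo2], axis=1)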
for i in range(len(mixed_cols_scores)):
    s = mixed_cols_scores.loc[:i, "column"]
    X = mixed_df[s]
    X = preprocess(X, cols_ord, cats_ord, cols_nom)

    for clf in [DecisionTreeClassifier(), RandomForestClassifier(), LogisticRegression()]:
        result = cross_validate(clf, X, y, verbose=False)
        results.append([str(clf)] + [i + 1] + list(result["test_score"]))
results = pd.DataFrame(
    results, columns=["model", "cols", "score1", "score2", "score3", "score4", "score5"]
)

mean = results[results.columns[2:]].mean(axis=1)
std = results[results.columns[2:]].std(axis=1)
max_ = results[results.columns[2:]].max(axis=1)
results["mean"] = mean
results["std"] = std
results["max"] = max_

results.sort_values("mean", ascending=False)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Unnamed: 0,model,cols,score1,score2,score3,score4,score5,mean,std,max
269,LogisticRegression(),90,0.740,0.745,0.750,0.750,0.730,0.743,0.008367,0.750
32,LogisticRegression(),11,0.740,0.740,0.750,0.755,0.730,0.743,0.009747,0.755
236,LogisticRegression(),79,0.735,0.740,0.750,0.755,0.735,0.743,0.009083,0.755
272,LogisticRegression(),91,0.730,0.740,0.750,0.755,0.735,0.742,0.010368,0.755
38,LogisticRegression(),13,0.735,0.740,0.750,0.750,0.730,0.741,0.008944,0.750
...,...,...,...,...,...,...,...,...,...,...
151,RandomForestClassifier(),51,0.495,0.450,0.450,0.475,0.525,0.479,0.031898,0.525
118,RandomForestClassifier(),40,0.565,0.485,0.480,0.435,0.425,0.478,0.055408,0.565
193,RandomForestClassifier(),65,0.530,0.435,0.445,0.470,0.505,0.477,0.040094,0.530
157,RandomForestClassifier(),53,0.535,0.445,0.470,0.435,0.500,0.477,0.041018,0.535
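As a reference point for the scores above: `cross_validate` reports accuracy by default for classifiers, and the exploratory tables show a class balance of 56.6% "Não" to 43.4% "Sim". A classifier that always predicts the majority class therefore scores about 0.566. The sketch below reproduces that baseline with `DummyClassifier` on placeholder features; the all-zeros feature matrix is an assumption, since the baseline ignores the features anyway.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Labels with the class balance from the tables: 566 "Não" (0), 434 "Sim" (1).
y_demo = np.array([0] * 566 + [1] * 434)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features

baseline = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_demo, y_demo, cv=5
)
print(baseline.mean())
```

Rows in the results tables with a mean below this value perform worse than always predicting "Não".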
