# Activity: Comparing Classification Models

## Objective

- Compare the following classification models:
  * Decision Tree with pruning
  * K-Nearest Neighbors (KNN)
  * Logistic Regression
  * Random Forest
- Choose the best model based on validation performance
- Evaluate the best model on the test set

## Activity steps

1. **Load the dataset** `dataset_classificacao.csv`  
   - Use: `pandas.read_csv("dataset_classificacao.csv")`

2. **Split the data into training, validation, and test sets** (60%, 20%, 20%)  
   - Use: `sklearn.model_selection.train_test_split` twice:  
     first to separate train+validation from test,  
     then to separate train from validation.

3. **Train the following models:**

   a) **Decision Tree with cost-complexity pruning**  
      - Use: `sklearn.tree.DecisionTreeClassifier`  
      - For pruning: use the `cost_complexity_pruning_path` method  
      - Sweep different values of `ccp_alpha` and pick the best one on the validation set  

   b) **KNN (K-Nearest Neighbors)**  
      - Use: `sklearn.neighbors.KNeighborsClassifier`  
      - Test `k = 3, 5, 10, 20`

   c) **Logistic Regression (no hyperparameter tuning)**  
      - Use: `sklearn.linear_model.LogisticRegression` with the default parameters  

   d) **Random Forest**  
      - Use: `sklearn.ensemble.RandomForestClassifier`  
      - Fix `n_estimators=100`  
      - Test combinations of `criterion = ['gini', 'entropy']` and `max_features = ['sqrt', 'log2', None]`

4. **For each model (using only the validation set), compute:**
   - Accuracy: `sklearn.metrics.accuracy_score`
   - Precision: `sklearn.metrics.precision_score`
   - Recall: `sklearn.metrics.recall_score`
   - F1-score: `sklearn.metrics.f1_score`

5. **Build a table** (`pandas.DataFrame`) with one row per model and the columns:  
   `'acuracia'`, `'precisao'`, `'recall'`, `'f1'`

6. **Choose the best model based on the validation metrics.**  
   - Evaluate only that model on the test set.  

7. **Display the feature importances using the Random Forest:**  
   - Show a chart with the 20 most important variables  
   - Use: `modelo.feature_importances_`  
   - For plotting: `matplotlib.pyplot.bar`

8. **Using the information found in the previous item, how many covariates do you think are actually important for the problem?** 
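
Steps 4 and 5 above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`), and the helper `validation_metrics` and the `models` dictionary are hypothetical names that do not appear in the notebook. The column names are the ones requested in step 5.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def validation_metrics(model, X_val, y_val):
    """Return the four metrics from step 4 for an already fitted binary classifier."""
    pred = model.predict(X_val)
    return {
        "acuracia": accuracy_score(y_val, pred),
        "precisao": precision_score(y_val, pred, zero_division=0),
        "recall": recall_score(y_val, pred, zero_division=0),
        "f1": f1_score(y_val, pred, zero_division=0),
    }


# Synthetic stand-in data (assumption): swap in the real training/validation sets.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Hypothetical registry: one fitted model per row of the table.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    "knn_k5": KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr),
}

# One record per model; the index carries the model names.
results = pd.DataFrame(
    [validation_metrics(model, X_va, y_va) for model in models.values()],
    index=list(models),
    columns=["acuracia", "precisao", "recall", "f1"],
)
print(results)
```

Sorting `results` by one chosen metric (for example `results.sort_values("f1", ascending=False)`) gives a simple way to pick the model for step 6. Which metric to rank by is a judgment call that the assignment leaves open.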


# Imports

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

import optuna

# Data Processing

In [2]:
df = pd.read_csv("dataset_classificacao.csv")
df.head()

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,...,x491,x492,x493,x494,x495,x496,x497,x498,x499,target
0,0.991245,0.050748,-0.343749,0.274253,-1.518734,-0.159455,1.427896,1.926818,1.555172,-1.737581,...,0.226665,-0.147782,-0.024233,1.513658,0.928057,0.449303,0.07111,1.468139,0.208218,1
1,0.292069,0.962034,-2.349618,-0.440499,-0.49418,0.711639,1.408426,-0.23869,1.660926,0.67476,...,0.157092,0.509665,-1.6614,-0.502276,0.502979,-1.098843,-1.753755,2.17344,1.09041,0
2,-1.379857,-1.239205,1.61692,0.649324,-1.395141,0.643645,-0.374341,0.37837,1.18238,0.977513,...,-0.817294,-0.561936,0.638945,-0.571702,0.136797,0.895805,-0.958593,1.041268,0.464521,0
3,-0.162766,-0.510166,-0.348483,-0.166792,0.152929,-0.86781,-1.157489,-0.47187,0.9327,1.854919,...,-0.395616,0.160152,-1.308039,-0.635852,1.139097,0.661483,-0.041045,0.464849,-2.702643,0
4,1.381736,-0.332115,-1.061583,-0.585043,-0.299487,1.504501,0.698723,-0.959224,-0.79692,-0.386506,...,-1.890095,0.644228,-0.705502,0.426899,-0.153879,0.517804,1.003377,-0.58295,0.372384,1


In [3]:
TARGET = 'target'
FEATURES = [i for i in df.columns if i != TARGET]

In [4]:
X = df[FEATURES]
y = df[TARGET]

In [5]:
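# 60% for training; the remaining 40% is held out
X_train, X_pre, y_train, y_pre = train_test_split(X, y, test_size=.40, random_state=42)
# Halve the held-out part: 20% test and 20% validation
X_test, X_val, y_test, y_val = train_test_split(X_pre, y_pre, test_size=.5, random_state=42)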

In [6]:
print(X_train.shape, X_test.shape, X_val.shape)

(1200, 500) (400, 500) (400, 500)
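
The assignment describes the split in the opposite order: hold out the test set first, then carve the validation set out of what remains. A minimal sketch of that variant is shown below. It assumes synthetic data with 2000 rows (matching the row counts printed above) and uses the short names `X_tr`, `X_va`, `X_te` so it does not overwrite the notebook's variables. The `stratify` argument is an optional addition, not something the notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption): 2000 rows, binary target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2000, 5))
y_demo = rng.integers(0, 2, size=2000)

# First split: hold out 20% of all rows as the test set.
X_trainval, X_te, y_trainval, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=42
)

# Second split: 25% of the remaining 80% is 20% of the full dataset.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

print(X_tr.shape, X_va.shape, X_te.shape)
```

Both orders produce a 60/20/20 partition. The only thing to watch is the second `test_size`, which is a fraction of the intermediate set rather than of the full data.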


# Decision Tree

In [None]:
def objective(trial):
    # Only ccp_alpha is tuned; every other hyperparameter is pinned to its default value.
    tree = DecisionTreeClassifier(
        # Default
        criterion='gini',
        splitter='best',
        max_depth=None,
        min_samples_split=2,
        min_samples_leaf=1,
        min_weight_fraction_leaf=0,
        max_features=None,
        max_leaf_nodes=None,
        min_impurity_decrease=0,
        class_weight=None,

        # Different from default
        random_state=42,
        # Pruning strength, sampled on a log scale
        ccp_alpha=trial.suggest_float('ccp', 1e-8, 100, log=True),
    )

    # Fit on the training set and score on the validation set
    tree.fit(X_train, y_train)
    pred = tree.predict(X_val)

    return accuracy_score(y_val, pred)

# Maximize validation accuracy over 50 trials
study_tree = optuna.create_study(direction='maximize')
study_tree.optimize(objective, n_trials=50)

[I 2025-05-12 14:41:45,806] A new study created in memory with name: no-name-5ffcf719-be99-4df4-b96d-5618d404d564
[I 2025-05-12 14:41:46,059] Trial 0 finished with value: 0.815 and parameters: {'ccp': 0.0004529035721817811}. Best is trial 0 with value: 0.815.
[I 2025-05-12 14:41:46,296] Trial 1 finished with value: 0.815 and parameters: {'ccp': 6.563913791220479e-06}. Best is trial 0 with value: 0.815.
[I 2025-05-12 14:41:46,531] Trial 2 finished with value: 0.815 and parameters: {'ccp': 1.0159011599885543e-06}. Best is trial 0 with value: 0.815.
[I 2025-05-12 14:41:46,780] Trial 3 finished with value: 0.815 and parameters: {'ccp': 9.830299310361395e-05}. Best is trial 0 with value: 0.815.
[I 2025-05-12 14:41:47,018] Trial 4 finished with value: 0.815 and parameters: {'ccp': 1.4732352510715832e-05}. Best is trial 0 with value: 0.815.
[I 2025-05-12 14:41:47,256] Trial 5 finished with value: 0.815 and parameters: {'ccp': 1.443034015280036e-05}. Best is trial 0 with value: 0.815.
...

In [21]:
optuna.visualization.plot_slice(study_tree, params=["ccp"])

In [32]:
study_tree.best_params

{'ccp': 0.00793046026606327}
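
The notebook searches `ccp_alpha` with Optuna, whereas the assignment suggests `cost_complexity_pruning_path`. A rough sketch of that alternative follows. It assumes synthetic data, and the tie-breaking rule (prefer the larger alpha, meaning the smaller tree) is a choice made for this example rather than something the notebook specifies.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): swap in the real training/validation sets.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Effective alphas at which the fully grown tree loses a subtree, in increasing order.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_acc = None, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_tr, y_tr)
    acc = accuracy_score(y_va, tree.predict(X_va))
    # ">=" lets ties go to the larger alpha, i.e. the more heavily pruned tree.
    if acc >= best_acc:
        best_alpha, best_acc = alpha, acc

print(best_alpha, best_acc)
```

Because the path only contains the alphas where the tree structure actually changes, this sweep evaluates each distinct pruned tree once, while a random search over a continuous range may sample many values that map to the same tree.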

# KNN

In [25]:
def objective(trial):
    # Tune only the number of neighbors, an integer between 1 and 50
    model = KNeighborsClassifier(n_neighbors=trial.suggest_int('neighbors', 1, 50))
    model.fit(X_train, y_train)
    pred = model.predict(X_val)

    return accuracy_score(y_val, pred)


# Maximize validation accuracy over 50 trials
study_knn = optuna.create_study(direction='maximize')
study_knn.optimize(objective, n_trials=50)

# Validation accuracy as a function of the number of neighbors
optuna.visualization.plot_slice(study_knn, params=["neighbors"])

[I 2025-05-12 14:54:26,334] A new study created in memory with name: no-name-a2e41dd9-820d-4bd0-824e-cef0b1d66b3f
[I 2025-05-12 14:54:26,362] Trial 0 finished with value: 0.8025 and parameters: {'neighbors': 31}. Best is trial 0 with value: 0.8025.
[I 2025-05-12 14:54:26,381] Trial 1 finished with value: 0.7025 and parameters: {'neighbors': 4}. Best is trial 0 with value: 0.8025.
[I 2025-05-12 14:54:26,395] Trial 2 finished with value: 0.7825 and parameters: {'neighbors': 11}. Best is trial 0 with value: 0.8025.
[I 2025-05-12 14:54:26,412] Trial 3 finished with value: 0.8075 and parameters: {'neighbors': 30}. Best is trial 3 with value: 0.8075.
[I 2025-05-12 14:54:26,427] Trial 4 finished with value: 0.7825 and parameters: {'neighbors': 11}. Best is trial 3 with value: 0.8075.
[I 2025-05-12 14:54:26,443] Trial 5 finished with value: 0.8125 and parameters: {'neighbors': 39}. Best is trial 5 with value: 0.8125.
[I 2025-05-12 14:54:26,458] Trial 6 finished with value: 0.82 and parameters:

In [33]:
study_knn.best_params

{'neighbors': 45}
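
The search above covers every `k` from 1 to 50, while the assignment lists only `k = 3, 5, 10, 20`. A minimal sketch of that fixed grid is given below, assuming synthetic data. The names `k_values`, `val_acc`, and `best_k` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): swap in the real training/validation sets.
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

k_values = [3, 5, 10, 20]  # grid from the assignment
val_acc = {}
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    val_acc[k] = accuracy_score(y_va, knn.predict(X_va))

# The k with the highest validation accuracy (the first one wins on ties).
best_k = max(val_acc, key=val_acc.get)
print(val_acc, best_k)
```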

# Logistic Regression

In [30]:
reg = LogisticRegression()
reg.fit(X_train, y_train)
pred = reg.predict(X_val)

print(accuracy_score(y_val, pred))

0.6325


# Random Forest
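
The Random Forest grid from step 3d and the importance ranking from step 7 could look roughly like the sketch below. It is not the notebook's own code: the data is synthetic, selection uses validation accuracy only, and names such as `best_model` and `top` are hypothetical.

```python
from itertools import product

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 30 features, 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=30, n_informative=5, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

best_model, best_params, best_acc = None, None, -1.0
# All combinations of criterion and max_features listed in the assignment.
for criterion, max_features in product(["gini", "entropy"], ["sqrt", "log2", None]):
    rf = RandomForestClassifier(
        n_estimators=100,
        criterion=criterion,
        max_features=max_features,
        random_state=42,
    )
    rf.fit(X_tr, y_tr)
    acc = accuracy_score(y_va, rf.predict(X_va))
    if acc > best_acc:
        best_model, best_params, best_acc = rf, (criterion, max_features), acc

# Indices of the 20 largest importances, in decreasing order of importance.
top = np.argsort(best_model.feature_importances_)[::-1][:20]
top_importances = best_model.feature_importances_[top]
print(best_params, best_acc)
print(top)
```

Passing the feature labels for `top` and the values in `top_importances` to `matplotlib.pyplot.bar` yields the chart requested in step 7. A sharp drop in bar height after the first few features is one informal cue for answering step 8.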