# **<h3 align="center">Machine Learning - Project</h3>**
## **<h3 align="center">3. Level 2 Binary Classification</h3>**
### **<h3 align="center">Group 30 - Project</h3>**


### Group Members
| Name              | Email                        | Student ID |
|-------------------|------------------------------|------------|
| Alexandra Pinto   | 20211599@novaims.unl.pt      | 20211599   |
| Gonçalo Peres     | 20211625@novaims.unl.pt      | 20211625   |
| Leonor Mira       | 20240658@novaims.unl.pt      | 20240658   |
| Miguel Natário    | 20240498@novaims.unl.pt      | 20240498   |
| Nuno Bernardino   | 20211546@novaims.unl.pt      | 20211546   |

---

### **3. Level 2 Binary Classification Notebook**
**Description:**
This notebook focuses on the **Level 2 Binary Classification model**, which distinguishes between the two most common classes identified in Level 1:
- **2 - NON-COMP**
- **4 - TEMPORARY**

Key steps include:
- Loading the subset of **“Common”** cases from Level 1 predictions.
- **Feature selection:** Tailor feature preprocessing and selection for this binary classification task.
- **Model training:** Train and evaluate a binary classification model to distinguish between the two classes.
- **Evaluation:** Use metrics like accuracy, precision, recall, and F1-score to measure performance.
- **Output:** Save predictions for integration in the final notebook.

This notebook refines the classification of cases within the most common classes, contributing to the pipeline's accuracy.
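
As a rough illustration of the evaluation step, the sketch below computes accuracy, precision, recall, and F1-score by hand for the two labels used here (2 and 4). The label lists are made up for the example, and treating class 4 as the positive class is an assumption; the notebook does not state which class is considered positive.

```python
# Toy labels, invented for illustration only (not project data).
y_true = [4, 4, 4, 2, 2, 2, 2, 4]
y_pred = [4, 4, 2, 2, 2, 4, 2, 4]

# Assumption: class 4 (TEMPORARY) is treated as the positive class.
POSITIVE = 4

pairs = list(zip(y_true, y_pred))
tp = sum(t == POSITIVE and p == POSITIVE for t, p in pairs)  # true positives
fp = sum(t != POSITIVE and p == POSITIVE for t, p in pairs)  # false positives
fn = sum(t == POSITIVE and p != POSITIVE for t, p in pairs)  # false negatives
tn = sum(t != POSITIVE and p != POSITIVE for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In practice the same numbers would come from the `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` functions imported later in this notebook, passing `pos_label=4` under the same assumption.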

---


## Table of Contents
* [1. Import the Libraries](#chapter1)
* [2. Load and Prepare Datasets](#chapter2)
* [3. Setting the Target](#chapter3)
* [4. Feature Selection](#chapter4)
    * [Scaling the Data](#section_4_1)  
    * [Numerical Features](#section_4_2) 
    * [Categorical Features](#section_4_3) 
    * [Final Features](#section_4_4)
* [5. Modelling](#chapter5)
* [6. Loading the results](#chapter6)


# 1. Import the Libraries 📚<a class="anchor" id="chapter1"></a>

In [None]:
# --- Standard Libraries ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import zipfile


# --- Scikit-Learn Modules for Data Partitioning and Preprocessing ---
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, MinMaxScaler, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


# --- Feature Selection Methods ---
# Filter Methods
import scipy.stats as stats
from scipy.stats import chi2_contingency
from sklearn.feature_selection import mutual_info_classif, chi2, SelectKBest

# Wrapper Methods
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Embedded Methods
from sklearn.linear_model import LassoCV

# --- Evaluation Metrics ---
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

from xgboost import XGBClassifier


# --- Warnings ---
import warnings
warnings.filterwarnings('ignore')


# Import only the helper functions used in this notebook
from utils import (
    plot_importance, cor_heatmap, find_optimal_features_with_rfe,
    compare_rf_feature_importances, compare_feature_importances,
    select_high_score_features_chi2_no_model, select_high_score_features_MIC, metrics,
)

# 2. Load and Prepare Datasets 📁<a class="anchor" id="chapter2"></a>

# 3. Feature Selection <a class="anchor" id="chapter3"></a>

In [None]:
# Step 1: Keep only the instances with classes 2 and 4 in the training and validation sets
train_majority = X_train_processed[X_train_processed['claim injury type'].isin([2, 4])]
val_majority = X_val_processed[X_val_processed['claim injury type'].isin([2, 4])]

In [None]:
# Displaying descriptive statistics for categorical features in the training dataset
train_majority.describe(include='O').T

In [None]:
# Exploring the class distribution of the target ('claim injury type')
train_majority['claim injury type'].value_counts()
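
The filtering and class-count steps above can be tried on a small toy frame. This is only a sketch: the `toy` DataFrame and its `feature_a` column are invented for illustration, and only the `claim injury type` column name and the class codes 2 and 4 come from the notebook.

```python
import pandas as pd

# Hypothetical toy data; only the target column name mirrors the notebook.
toy = pd.DataFrame({
    "claim injury type": [1, 2, 2, 4, 3, 4, 2, 5],
    "feature_a": [0.1, 0.4, 0.3, 0.9, 0.5, 0.8, 0.2, 0.7],
})

# Keep only rows whose target is one of the two majority classes.
toy_majority = toy[toy["claim injury type"].isin([2, 4])]

# Absolute counts and relative shares of each remaining class.
class_counts = toy_majority["claim injury type"].value_counts()
class_share = toy_majority["claim injury type"].value_counts(normalize=True)

print(class_counts.to_dict())
print(class_share.round(2).to_dict())
```

Looking at the shares as well as the raw counts makes it easier to judge how imbalanced the binary problem is, which matters when accuracy alone is used to assess the model.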

In [None]:
train_majority.describe().T

# 4. Setting the Target <a class="anchor" id="chapter4"></a>

In [None]:
# Import the required libraries
from sklearn.ensemble import RandomForestClassifier

# Step 1: Keep only the instances with classes 2 and 4 in the training and validation sets
train_majority = X_train_processed[X_train_processed['claim injury type'].isin([2, 4])]
val_majority = X_val_processed[X_val_processed['claim injury type'].isin([2, 4])]

# Separate the features (X) from the target (y) in the training and validation sets
X_train_bin = train_majority.drop(columns=['claim injury type'])  # Training features
y_train_bin = train_majority['claim injury type']                # Training target

X_val_bin = val_majority.drop(columns=['claim injury type'])      # Validation features
y_val_bin = val_majority['claim injury type']                    # Validation target

# Step 2: Create and train the binary model
binary_model = RandomForestClassifier(random_state=42)
binary_model.fit(X_train_bin, y_train_bin)

# Evaluate the model on the validation set
accuracy = binary_model.score(X_val_bin, y_val_bin)
print(f"Accuracy on validation set: {accuracy:.2f}")

# Step 3: Apply the model to the test-set rows flagged as the majority class
# Select the rows of the "majority" class in the test set
test_majority = df_test_processed[df_test_processed['class_category'] == 'majority']

# Drop the columns that are not features
X_test_majority = test_majority.drop(columns=['class_category'])

# Make predictions
predictions = binary_model.predict(X_test_majority)

# Write the predictions back to the original dataset
df_test_processed.loc[test_majority.index, 'predicted_claim_injury_type'] = predictions

# (Optional) Inspect the predictions made
print(df_test_processed['predicted_claim_injury_type'].value_counts())
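
The cell above reports accuracy only, while the notebook description also lists precision, recall, and F1-score. A minimal sketch of how those could be computed for a random forest on the two class codes is shown below. It runs on synthetic data from `make_classification`, not on the project's dataset, and `pos_label=4` is an assumption about which class is treated as positive.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the filtered data (not the project's dataset).
X, y01 = make_classification(n_samples=400, n_features=6, random_state=42)
y = np.where(y01 == 1, 4, 2)  # relabel 0/1 to the notebook's class codes 2 and 4

# Stratify so both classes appear in the training and validation splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_va)

# Assumption: class 4 is the positive class for the binary metrics.
scores = {
    "accuracy": model.score(X_va, y_va),
    "precision": precision_score(y_va, y_pred, pos_label=4),
    "recall": recall_score(y_va, y_pred, pos_label=4),
    "f1": f1_score(y_va, y_pred, pos_label=4),
}

for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

On the real data, `y_val_bin` and `binary_model.predict(X_val_bin)` would take the place of `y_va` and `y_pred`.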
