# **<h3 align="center">Machine Learning - Project</h3>**
## **<h3 align="center">5. Integration and Final Predictions</h3>**
### **<h3 align="center">Group 30 - Project</h3>**


### Group Members
| Name              | Email                        | Student ID |
|-------------------|------------------------------|------------|
| Alexandra Pinto   | 20211599@novaims.unl.pt      | 20211599   |
| Gonçalo Peres     | 20211625@novaims.unl.pt      | 20211625   |
| Leonor Mira       | 20240658@novaims.unl.pt      | 20240658   |
| Miguel Natário    | 20240498@novaims.unl.pt      | 20240498   |
| Nuno Bernardino   | 20211546@novaims.unl.pt      | 20211546   |


---

### **5. Integration and Final Predictions Notebook**
**Description:**
In this notebook, we integrate the results from all levels of the hierarchy to produce the **final classification outputs** and evaluate the overall pipeline.

Key steps include:
- Loading predictions from **Level 1**, **Level 2 Binary**, and **Level 2 Multi-Class** notebooks.
- **Merging predictions:** Combine outputs from all levels to assign a final class to each case.
- **Post-processing:** Apply any necessary adjustments or probability thresholds to improve consistency.
- **Evaluation:** Assess the pipeline's overall performance using metrics like accuracy, F1-score, and confusion matrices.
- **Output:** Save the final predictions in a structured format for deployment or reporting.

This notebook is the final step of the hierarchical classification framework: it combines the predictions produced at each level into a single consolidated output.
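
The evaluation step listed above can be sketched as follows. This is a minimal illustration, assuming a labelled hold-out set is available; the arrays `y_true` and `y_pred` are made-up toy values, not results from this project, and macro averaging is an assumption about how the F1-score would be aggregated.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels for three hypothetical classes (illustrative only)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Share of cases whose final class matches the true class
accuracy = accuracy_score(y_true, y_pred)

# Macro F1 gives every class the same weight, regardless of its frequency
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

print(f"Accuracy: {accuracy:.3f}")
print(f"Macro F1: {macro_f1:.3f}")
print(cm)
```

Macro averaging is one reasonable choice when classes are imbalanced; `average="weighted"` would instead weight each class by its support.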

---

## Table of Contents
* [1. Import the Libraries](#chapter1)
* [2. Load and Prepare Datasets](#chapter2)
* [3. Merging Results](#chapter3)

# 1. Import the Libraries 📚<a class="anchor" id="chapter1"></a>

In [1]:
# --- Standard Libraries ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import zipfile


# --- Scikit-Learn Modules for Data Partitioning and Preprocessing ---
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, MinMaxScaler, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


# --- Feature Selection Methods ---
# Filter Methods
import scipy.stats as stats
from scipy.stats import chi2_contingency
from sklearn.feature_selection import mutual_info_classif, chi2, SelectKBest

# Wrapper Methods
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Embedded Methods
from sklearn.linear_model import LassoCV

# --- Evaluation Metrics ---
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

#from xgboost import XGBClassifier

# --- Warnings ---
import warnings
warnings.filterwarnings('ignore')


# Note: select only the functions we will use in this notebook
from utils import plot_importance, cor_heatmap, find_optimal_features_with_rfe, compare_rf_feature_importances, compare_feature_importances, select_high_score_features_chi2_no_model, select_high_score_features_MIC, metrics

# 2. Load and Prepare Datasets 📁<a class="anchor" id="chapter2"></a>

In [3]:
# Load the first dataset
X_test_final_1 = pd.read_csv('X_test_final_1.csv')
# Load the second dataset
X_test_final_2 = pd.read_csv('X_test_final_2.csv')
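
Because the merge in the next section matches rows by index label, it may be worth confirming that both datasets are aligned before combining them. The helper below is a hypothetical sketch (`check_alignment` is not part of this project's `utils` module), shown on toy frames rather than the real files.

```python
import pandas as pd


def check_alignment(first: pd.DataFrame, second: pd.DataFrame,
                    column: str = "Final_Predictions") -> None:
    """Raise an error if the two frames cannot be merged row by row."""
    for name, frame in (("first", first), ("second", second)):
        if column not in frame.columns:
            raise KeyError(f"{name} dataset has no '{column}' column")
    if not first.index.equals(second.index):
        raise ValueError("datasets do not share the same row index")


# Toy frames standing in for the two prediction files (illustrative only)
toy_a = pd.DataFrame({"Final_Predictions": [3.0, None]})
toy_b = pd.DataFrame({"Final_Predictions": [None, 2.0]})

check_alignment(toy_a, toy_b)  # passes silently when the frames are aligned
```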

# 3. Merging Results <a class="anchor" id="chapter3"></a>

In [4]:
# Combine the Final_Predictions columns, filling missing values (NaNs) in the first dataset with values from the second
X_test_final_combined = X_test_final_1.copy()
X_test_final_combined['Final_Predictions'] = X_test_final_1['Final_Predictions'].combine_first(X_test_final_2['Final_Predictions'])
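
To make the behaviour of `combine_first` concrete, here is a small sketch on toy data (the values are invented for illustration). It keeps every non-null value from the first series and only falls back to the second series where the first is `NaN`; rows are matched by index label, not by position.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two prediction columns (illustrative values only)
first = pd.Series([3.0, np.nan, np.nan, 1.0], name="Final_Predictions")
second = pd.Series([np.nan, 2.0, 4.0, 5.0], name="Final_Predictions")

# Values from `first` win; gaps are filled from `second`
combined = first.combine_first(second)
print(combined.tolist())

# Count the rows that neither source could fill
remaining_gaps = int(combined.isna().sum())
print(f"Rows still missing a prediction: {remaining_gaps}")
```

Note that the last row keeps `1.0` from the first series even though the second series offers `5.0`, so the order of the two datasets matters whenever both contain a prediction for the same row.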

In [5]:
X_test_final_combined

Unnamed: 0,Carrier_District_Interaction_freq,Zip_Code_Simplified_freq,District Name_freq,COVID-19 Indicator,WCIO Cause of Injury Code_freq,WCIO Part Of Body Code_freq,Age Group,Attorney/Representative,Industry Code_freq,Body_Part_Category_Trunk,...,Carrier_Name_Simplified_freq,Income_Category,WCIO Nature of Injury Code_freq,Injury_Cause_Category_Strain or Injury By,IME-4 Count,Salary_Per_Dependent,Days_To_First_Hearing,Age at Injury,Final_Predictions,Predictions
0,92172.0,1614.0,188402,0,22242,8131,3,0,26025,0.0,...,6349.0,4,77440,0.0,0.0,0.000876,0.025765,0.046875,3.0,0
1,92172.0,1009.0,188402,0,12748,5519,3,0,12655,0.0,...,7733.0,1,77440,0.0,0.0,0.000568,0.025765,0.046875,3.0,0
2,92172.0,282759.0,188402,0,7560,798,0,0,14772,1.0,...,85478.0,2,38955,0.0,0.0,0.001322,0.025765,0.671875,3.0,0
3,92172.0,282759.0,188402,0,8464,33692,2,0,26025,0.0,...,10057.0,4,77440,0.0,0.0,0.000247,0.025765,0.421875,3.0,0
4,92172.0,1816.0,188402,0,8648,8682,3,0,235,0.0,...,8924.0,4,32898,0.0,0.0,0.000282,0.025765,0.140625,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
387970,7776.0,282759.0,31149,0,8122,9113,0,0,5248,0.0,...,77168.0,4,8133,1.0,0.0,0.000284,0.025765,0.562500,2.0,1
387971,10871.0,282759.0,42054,0,8122,9113,0,1,5248,0.0,...,5790.0,4,8133,1.0,0.0,0.000862,0.025765,0.671875,2.0,1
387972,33910.0,1853.0,188402,0,8122,9113,2,1,5248,0.0,...,77168.0,4,8133,1.0,0.0,0.000284,0.025765,0.453125,2.0,1
387973,14924.0,282759.0,188402,0,8122,9113,2,1,5248,0.0,...,5067.0,4,8133,1.0,0.0,0.000284,0.025765,0.421875,2.0,1


In [6]:
# Save the consolidated dataset
X_test_final_combined.to_csv('X_test_final_combined.csv', index=False)