# **<h3 align="center">Machine Learning - Project</h3>**
## **<h3 align="center">7. Integration and Final Predictions</h3>**
### **<h3 align="center">Group 30 - Project</h3>**


### Group Members
| Name              | Email                        | Student ID |
|-------------------|------------------------------|------------|
| Alexandra Pinto   | 20211599@novaims.unl.pt      | 20211599   |
| Gonçalo Peres     | 20211625@novaims.unl.pt      | 20211625   |
| Leonor Mira       | 20240658@novaims.unl.pt      | 20240658   |
| Miguel Natário    | 20240498@novaims.unl.pt      | 20240498   |
| Nuno Bernardino   | 20211546@novaims.unl.pt      | 20211546   |

---

### **8. Integration and Final Predictions Notebook**  
**Description:**  
This notebook integrates and compares the results of both **hierarchical classification** and **flat classification** approaches to produce the final outputs and evaluate the entire pipeline.  

Key steps include:  
- **Loading Predictions:** Import predictions from **Flat Modeling** (Notebook 7) and **Hierarchical Modeling** (Levels 1, 2 Binary, and 2 Multi-Class).  
- **Combining Predictions:** Merge outputs from both approaches to enable a side-by-side comparison.  
- **Performance Evaluation:** Compare flat and hierarchical models using metrics such as accuracy, F1-score, precision, recall, and confusion matrices.  
- **Analysis:** Discuss the advantages, disadvantages, and trade-offs of each approach. Highlight scenarios where one approach outperforms the other.  
- **Output:** Save the final predictions from both approaches for deployment, further reporting, or stakeholder presentation.  

This notebook serves as the final step in the pipeline, providing a detailed evaluation of the two modeling strategies and ensuring clarity on their relative performance and practical implications.  

---


## Table of Contents
* [1. Import the Libraries](#chapter1)
* [2. Load and Prepare Datasets](#chapter2)
* [3. Merging Results](#chapter3)

# 1. Import the Libraries 📚<a class="anchor" id="chapter1"></a>

In [2]:
# --- Standard Libraries ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import zipfile


# --- Scikit-Learn Modules for Data Partitioning and Preprocessing ---
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, MinMaxScaler, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, 


# --- Feature Selection Methods ---
# Filter Methods
import scipy.stats as stats
from scipy.stats import chi2_contingency
from sklearn.feature_selection import mutual_info_classif, chi2, SelectKBest

# Wrapper Methods
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Embedded Methods
from sklearn.linear_model import LassoCV

# --- Evaluation Metrics ---
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

#from xgboost import XGBClassifier

# --- Warnings ---


import warnings
warnings.filterwarnings('ignore')


#selecionar apenas as função que vamos usar neste :)
from utils import plot_importance, cor_heatmap, find_optimal_features_with_rfe, compare_rf_feature_importances,compare_feature_importances, select_high_score_features_chi2_no_model,select_high_score_features_MIC, metrics

# 2. Load and Prepare Datasets 📁<a class="anchor" id="chapter2"></a>

In [21]:
# Load the first dataset
X_test_final_1 = pd.read_csv('X_test_final_1.csv')
# Load the second dataset
X_test_final_2 = pd.read_csv('X_test_final_2.csv')

# 3. Merging the results <a class="anchor" id="chapter3"></a>

In [22]:
# Combine the Final_Predictions columns, filling missing values (NaNs) from the first dataset with values from the second
X_test_final_combined = X_test_final_1.copy()
X_test_final_combined['Final_Predictions'] = X_test_final_1['Final_Predictions'].combine_first(X_test_final_2['Final_Predictions'])

In [23]:
# Check our final results
X_test_final_combined['Final_Predictions'].value_counts()

Final_Predictions
2.0    169697
3.0    161111
4.0     43308
1.0      8197
5.0      5111
6.0       479
8.0        72
Name: count, dtype: int64

# 4. Creation and Handling of Submission <a class="anchor" id="chapter3"></a>

In [18]:
# Map the 'Final_Predictions' values to injury types
injury_type_mapping = {
    1: "CANCELLED",
    2: "NON-COMP",
    3: "MED ONLY",
    4: "TEMPORARY"
    5: "PPD SCH LOSS",
    6: "PPD NSL",
    7: "PTD",
    8: "DEATH"
}

# Create the 'Claim Injury Type' column with the correct format
X_test_final_combined['Claim Injury Type'] = (
    X_test_final_combined['Final_Predictions'].astype(int).astype(str) + ". " + 
    X_test_final_combined['Final_Predictions'].map(injury_type_mapping)
)

# Select only the necessary columns for submission
submission = X_test_final_combined[['Claim Identifier', 'Claim Injury Type']]

# Display the first rows of the generated file
print(submission.head())


   Claim Identifier Claim Injury Type
0           6165911      3. TEMPORARY
1           6166141      3. TEMPORARY
2           6165907      3. TEMPORARY
3           6166047      3. TEMPORARY
4           6166102      3. TEMPORARY


In [19]:
# Save the submission file and capture its path
submission.to_csv('submission.csv', index=False)
import os
print(os.getcwd())