# Explainable Artificial Intelligence (XAI)

<br>

We use the [Airline Passenger Satisfaction](https://www.kaggle.com/datasets/nilanjansamanta1210/airline-passenger-satisfaction) dataset from Kaggle, which records passengers' satisfaction with the airline service. The goal is to predict whether a passenger is satisfied based on the features provided.

## Data Loading and Preprocessing

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import kagglehub
import utils
import os

In [None]:
dataset_path = "data/airline_passenger_satisfaction.csv"

if not os.path.exists(dataset_path):
    print("Downloading dataset...")
    path = kagglehub.dataset_download("nilanjansamanta1210/airline-passenger-satisfaction")
    downloaded_file = os.path.join(path, "airline_passenger_satisfaction.csv")
    if os.path.exists(downloaded_file):
        # Make sure the target folder exists; os.rename fails if it is missing
        os.makedirs(os.path.dirname(dataset_path), exist_ok=True)
        os.rename(downloaded_file, dataset_path)
    else:
        raise FileNotFoundError("The dataset was not downloaded properly. Please check the Kaggle dataset.")

df = pd.read_csv(dataset_path)
df

In [None]:
df = utils.pre_process_df(df, drop_correlated=False)
df

In [None]:
# Check column types and missing values
non_integer_columns = []
cols_w_missing_values = []

for col in df.columns:
    # Pre-processing is expected to leave every column integer-encoded
    if not pd.api.types.is_integer_dtype(df[col]):
        non_integer_columns.append((col, df[col].dtype))
    if df[col].isnull().any():
        cols_w_missing_values.append(col)

# Print result
if len(non_integer_columns) == 0:
    print("All columns are int.\n")
else:
    print("There are columns that are not int:")
    for col, dtype in non_integer_columns:
        print(f"Column: {col}, Type: {dtype}")

if len(cols_w_missing_values) == 0:
    print("No columns have missing values.")
else:
    print("Columns with missing values:")
    for col in cols_w_missing_values:
        print(f"Column: {col}")

## Data Analysis

We will analyze the data to understand the relationships between the features and the target variable. We will also check for class imbalance and feature distributions.

<br>

### Correlation Analysis

In [None]:
utils.visualize_correlation(df)
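
To complement the heatmap, the strongly correlated pairs can also be listed explicitly. The following is a minimal sketch on toy data; the helper `highly_correlated_pairs` and the 0.9 threshold are illustrative assumptions and are not part of `utils`.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame, threshold=0.9):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold.

    Hypothetical helper for illustration only.
    """
    corr = frame.select_dtypes("number").corr()  # Pearson correlation matrix
    cols = corr.columns
    pairs = []
    # Walk the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Toy data: "b" is nearly a copy of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=500),
    "c": rng.normal(size=500),
})

pairs = highly_correlated_pairs(toy, threshold=0.9)
print(pairs)
```

On the real data, the same call would be `highly_correlated_pairs(df)`, with the threshold tuned to taste.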

Accuracy of the model **without removing** highly correlated features:
- Holdout Accuracy: 94.664%
- Cross-Validation Accuracy: 94.642%

With Departure Delay and Arrival Delay joined into a single Total Delay feature:
- Holdout Accuracy: 94.649%
- Cross-Validation Accuracy: 94.675%

<br>

Accuracy of the model **after removing** highly correlated features:
- Holdout Accuracy: 94.256%
- Cross-Validation Accuracy: 94.25%

With Departure Delay and Arrival Delay joined into a single Total Delay feature:
- Holdout Accuracy: 94.387%
- Cross-Validation Accuracy: 94.171%


Removing correlated information may have hurt the model by discarding data that was important even though it was redundant. This suggests that, in this case, the redundancy in the features was beneficial.

Joining the two delay features into Total Delay had no significant impact on overall performance, but it helped simplify the model without sacrificing accuracy.
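
As an illustration of the Total Delay feature, the sketch below sums the two delay columns and drops the originals. The column names `Departure Delay` and `Arrival Delay`, the use of a plain sum, and the helper `merge_delay_columns` are assumptions; the actual logic lives in `utils.pre_process_df` and is not shown here.

```python
import pandas as pd


def merge_delay_columns(frame, dep_col="Departure Delay",
                        arr_col="Arrival Delay", new_col="Total Delay"):
    """Replace two delay columns with their sum (hypothetical helper)."""
    merged = frame.copy()  # leave the caller's DataFrame untouched
    merged[new_col] = merged[dep_col] + merged[arr_col]
    return merged.drop(columns=[dep_col, arr_col])


# Toy rows with assumed column names
toy = pd.DataFrame({
    "Departure Delay": [0, 10, 25],
    "Arrival Delay": [5, 0, 30],
    "Satisfaction": [1, 0, 0],
})

merged = merge_delay_columns(toy)
print(merged)
```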

### Class Imbalance

In [None]:
# Verify class imbalance of the Target

utils.visualize_class_imbalance(df)

As the plot shows, the classes are imbalanced. We will use SMOTE (Synthetic Minority Over-sampling Technique) to balance them.
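
To make the idea behind SMOTE concrete, here is a simplified NumPy sketch: each synthetic point is an interpolation between a minority sample and one of its k nearest minority neighbours. The function `smote_oversample` is illustrative only and uses brute-force distances, so it does not scale to the full dataset; in practice the `SMOTE` class from imbalanced-learn (`imblearn.over_sampling`) is the usual choice.

```python
import numpy as np


def smote_oversample(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE: build n_new synthetic minority samples.

    Illustrative sketch only, not the imbalanced-learn implementation.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)  # cannot have more neighbours than other points

    # Brute-force pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)              # random minority samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                                # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy imbalance: 80 majority vs. 20 minority samples
rng = np.random.default_rng(1)
X_majority = rng.normal(loc=0.0, size=(80, 2))
X_minority = rng.normal(loc=3.0, size=(20, 2))

synthetic = smote_oversample(X_minority, n_new=60, k=5)
X_minority_balanced = np.vstack([X_minority, synthetic])
print(len(X_majority), len(X_minority_balanced))
```

Oversampling is normally applied to the training split only, so that synthetic points do not leak into the evaluation data.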

### Feature Distributions

In [None]:
utils.visualize_feature_distributions(df)

## Simple Classifications

We will start by using a glass box and a black box model to classify the data. We chose a Decision Tree and a Random Forest as the models, respectively. We will analyze the performance of the models and then apply XAI techniques to explain the predictions.

In [None]:
X = df.drop(columns=["Satisfaction"])
y = df["Satisfaction"]


### Decision Tree

The decision tree is a glass box model that is easy to interpret. We will use it to classify the data and then apply XAI techniques to explain the predictions.

In [None]:
tree = DecisionTreeClassifier(random_state=42)

accuracy = utils.holdout_accuracy(X,y, tree, test_size=0.2)
cv_score = utils.cross_validation_acc(X,y, tree, cv_fold=10)

print(f"Holdout Accuracy: {accuracy}%\nCross-Validation Accuracy: {cv_score}%")
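
The two evaluation helpers are defined in `utils` and are not shown in this notebook. A plausible sketch of what they compute, built on scikit-learn, might look like the following; the function names ending in `_sketch`, the stratified split, and the rounding are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier


def holdout_accuracy_sketch(X, y, model, test_size=0.2, seed=42):
    """Fit on a train split and return test accuracy in percent."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    model.fit(X_train, y_train)  # fits the passed estimator in place
    return round(100 * model.score(X_test, y_test), 3)


def cross_validation_acc_sketch(X, y, model, cv_fold=10):
    """Return mean k-fold accuracy in percent."""
    # cross_val_score fits clones, so `model` itself is not modified here
    scores = cross_val_score(model, X, y, cv=cv_fold, scoring="accuracy")
    return round(100 * scores.mean(), 3)


# Synthetic stand-in for the airline data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
demo_tree = DecisionTreeClassifier(random_state=42)

holdout = holdout_accuracy_sketch(X_demo, y_demo, demo_tree)
cv = cross_validation_acc_sketch(X_demo, y_demo, demo_tree)
print(f"Holdout Accuracy: {holdout}%\nCross-Validation Accuracy: {cv}%")
```

Reporting both numbers is useful because a single holdout split can be lucky or unlucky, while the cross-validation mean averages over several splits.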

In [None]:
utils.analyze_tree_complexity(feature_names=X.columns.tolist(),tree=tree)
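
How interpretable a decision tree is depends heavily on its size. As a rough sketch of the kind of figures a complexity analysis can report (what `utils.analyze_tree_complexity` actually prints is not shown here), scikit-learn exposes depth, leaf count, and node count directly on a fitted tree:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the airline data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
small_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

for name, model in [("unconstrained", full_tree), ("max_depth=3", small_tree)]:
    print(
        f"{name}: depth={model.get_depth()}, "
        f"leaves={model.get_n_leaves()}, "
        f"nodes={model.tree_.node_count}"
    )
```

An unconstrained tree usually grows until its leaves are pure, which can produce far more rules than a person can read; limiting `max_depth` trades some accuracy for a tree that can be inspected by hand.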

### Random Forest

The random forest is a black box model: it aggregates many decision trees, so its predictions are harder to trace than those of a single tree. As with the decision tree, we will classify the data and then apply XAI techniques to explain the predictions.

In [None]:
forest = RandomForestClassifier(random_state=42)

# Evaluate the random forest (not the decision tree from the previous section)
accuracy = utils.holdout_accuracy(X,y, forest, test_size=0.2)
cv_score = utils.cross_validation_acc(X,y, forest, cv_fold=10)

print(f"Holdout Accuracy: {accuracy}%\nCross-Validation Accuracy: {cv_score}%")

In [None]:
utils.apply_simplification_based_xai(X,y,forest)
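
One common simplification-based technique is a global surrogate: a small, readable model trained to mimic the black box. The sketch below assumes this approach and uses synthetic data; whether `utils.apply_simplification_based_xai` works this way is not shown in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the airline data
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

black_box = RandomForestClassifier(n_estimators=50, random_state=42)
black_box.fit(X_train, y_train)

# The surrogate learns the forest's predictions, not the true labels
surrogate = DecisionTreeClassifier(max_depth=3, random_state=42)
surrogate.fit(X_train, black_box.predict(X_train))

# Fidelity: how often the surrogate agrees with the black box on unseen data
fidelity = accuracy_score(black_box.predict(X_test), surrogate.predict(X_test))
print(f"Fidelity: {fidelity:.3f}")

feature_names = [f"f{i}" for i in range(X_demo.shape[1])]
print(export_text(surrogate, feature_names=feature_names))
```

The surrogate's rules are only trustworthy as an explanation to the extent that fidelity is high, so it is worth reporting fidelity alongside the extracted tree.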

In [None]:
utils.apply_feature_based_xai(X,y,forest)
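
Feature-based explanations rank inputs by how much the model relies on them. Permutation importance is one such technique: shuffle one feature at a time on held-out data and measure how much the score drops. The sketch below assumes this technique and uses synthetic data; the method used inside `utils.apply_feature_based_xai` is not shown in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the airline data
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature several times and record the mean drop in accuracy
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=42
)

# Indices of features, most important first
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(
        f"feature {idx}: "
        f"{result.importances_mean[idx]:.4f} +/- {result.importances_std[idx]:.4f}"
    )
```

Strongly correlated features can share importance under this method, which is relevant here given the correlated delay features discussed earlier.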
