# Group information

Names: William Rodrigues Lopes, 


RAs:248499, 

# **Machine Learning MC886/MO444 - Task \#2**: Model Selection for classification


### Objective:

To explore **Model Selection** techniques to select the best model and hyperparameters for a classification task.

## Airline Passenger Satisfaction

**Context**

This dataset contains an airline passenger satisfaction survey. What factors are highly correlated to a satisfied (or dissatisfied) passenger? Can you predict passenger satisfaction?

**Content**

| Column | Description |
|---|---|
| Gender | Gender of the passengers (Female, Male) |
| Customer Type | The customer type (Loyal customer, disloyal customer) |
| Age | The actual age of the passengers |
| Type of Travel | Purpose of the flight of the passengers (Personal Travel, Business Travel) |
| Class | Travel class in the plane of the passengers (Business, Eco, Eco Plus) |
| Flight Distance | The flight distance of this journey |
| Inflight wifi service | Satisfaction level of the inflight wifi service (0:Not Applicable;1-5) |
| Departure/Arrival time convenient | Satisfaction level of Departure/Arrival time convenient |
| Ease of Online booking | Satisfaction level of online booking |
| Gate location | Satisfaction level of Gate location |
| Food and drink | Satisfaction level of Food and drink |
| Online boarding | Satisfaction level of online boarding |
| Seat comfort | Satisfaction level of Seat comfort |
| Inflight entertainment | Satisfaction level of inflight entertainment |
| On-board service | Satisfaction level of On-board service |
| Leg room service | Satisfaction level of Leg room service |
| Baggage handling | Satisfaction level of baggage handling |
| Check-in service | Satisfaction level of Check-in service |
| Inflight service | Satisfaction level of inflight service |
| Cleanliness | Satisfaction level of Cleanliness |
| Departure Delay in Minutes | Minutes delayed when departure |
| Arrival Delay in Minutes | Minutes delayed when Arrival |
| Satisfaction | Airline satisfaction level(Satisfaction, neutral or dissatisfaction) |

**Note:** This data set was modified from this dataset by John D here. It has been cleaned up for the purposes of classification.

**How to load the dataset**

Dataset link: [here](https://drive.google.com/drive/folders/1Wagh0CUKWzjssOif6n4b9dpkzq50bamx?usp=sharing)

You should open the google drive folder, click on the name of the folder on the top and click on "organize" => "add shortcut".<br/>
Then you should choose where to add the shortcut. The recommendation is to add on "MyDrive", so you don't need to change the dataset path used below.

Then you should run the cell below and authorize google drive access.

*If you want to run the notebook locally, just download the folder and change the path below to the location of the folder in your local environment.*

In [None]:
#imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# google.colab is only available when running on Google Colaboratory
from google.colab import drive

drive.mount('/content/gdrive', force_remount=True)

path = '/content/gdrive/MyDrive/treino.csv'

data = pd.read_csv(path)

data.head()

### **Data analysis and preprocessing** (1.5 points)

In this section, you should explore the dataset. Remember to avoid using data that you should not have in training.

You can plot graphs with features that you think are important to visualize the relation with the target(`Satisfaction`). You can also use boxplot graphs to understand feature distributions. There are no minimal/maximum requirements in what graphs you should use, explore just what you think can help in understanding the dataset.

As in the previous task, preprocess the data, transform the categorical features with OneHotEncoding, and remember to scale continuous features to be in a similar scale between each other.
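
The one-hot encoding and scaling step described above could be sketched as follows. This is only a minimal illustration on a made-up frame: the names `toy`, `preprocess` and `toy_ready` and all the values are hypothetical, and scikit-learn's `ColumnTransformer` is just one possible way to apply both transformations.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that only mimic three of the dataset's columns.
toy = pd.DataFrame({
    "Class": ["Business", "Eco", "Eco Plus", "Eco"],
    "Age": [25, 40, 31, 58],
    "Flight Distance": [460, 2300, 1100, 560],
})

# One-hot encode the categorical column and standardize the continuous ones.
# sparse_threshold=0 asks for a dense NumPy array as output.
preprocess = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Class"]),
        ("scale", StandardScaler(), ["Age", "Flight Distance"]),
    ],
    sparse_threshold=0,
)

# 3 one-hot columns followed by 2 standardized columns.
toy_ready = preprocess.fit_transform(toy)
```

On real data the transformer would be fitted on the training split only and then applied to the test split with `preprocess.transform`.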


In [None]:

# same reasoning as in task 1: the plots make the relationship between Satisfaction and the other variables clearer.

data['satisfaction_numeric'] = data['satisfaction'].apply(lambda x: 1 if x == 'satisfied' else 0)


plt.figure(figsize=(16, 10))

# first plot: Satisfaction vs Flight Distance
plt.subplot(2, 2, 1)
sns.boxplot(x='satisfaction_numeric', y='Flight Distance', data=data)
plt.title('Satisfaction vs Flight Distance')
plt.xlabel('Satisfaction')
plt.ylabel('Flight Distance')

# second plot: Satisfaction vs Inflight wifi service
plt.subplot(2, 2, 2)
sns.boxplot(x='satisfaction_numeric', y='Inflight wifi service', data=data)
plt.title('Satisfaction vs Inflight Wifi Service')
plt.xlabel('Satisfaction')
plt.ylabel('Inflight Wifi Service Rating')

# third plot: Satisfaction vs Class
plt.subplot(2, 2, 3)
sns.countplot(x='Class', hue='satisfaction_numeric', data=data)
plt.title('Satisfaction vs Class')
plt.xlabel('Class')
plt.ylabel('Count')

# fourth plot: Satisfaction vs Departure Delay in Minutes
plt.subplot(2, 2, 4)
sns.boxplot(x='satisfaction_numeric', y='Departure Delay in Minutes', data=data)
plt.title('Satisfaction vs Departure Delay')
plt.xlabel('Satisfaction')
plt.ylabel('Departure Delay (minutes)')

plt.tight_layout()
plt.show()

### **Metric selection** (0.5 point)

As we're working with unbalanced data, the accuracy metric is not a good indicator of performance. Choose a metric and explain why that metric is a good fit for the airline passenger satisfaction problem. You don't need to implement the metric, only discuss it.

*Tip: Some common metrics are [balanced accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html), [recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html), [precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html), [f1-score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) and [AUC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)*.

In [None]:
Accuracy is not a good metric for imbalanced data because it can lead to the wrong conclusion. When the majority class dominates, a model can reach high accuracy simply by predicting that class every time, ignoring the instances of the minority class.

A more suitable metric for the passenger satisfaction problem is the F1-score, which is the harmonic mean of precision and recall. It fits imbalanced problems well because it is only high when both precision and recall are high, so it also handles scenarios where false negatives (cases where the model fails to identify the minority class) matter.

- Precision in this context: of the passengers the model classified as "satisfied", how many were actually satisfied.
- Recall in this context: of the passengers who were actually satisfied, how many the model identified correctly.

The F1-score accounts for both false positives and false negatives, balancing the two sides of the problem. This keeps the model from focusing only on the majority class and gives a more balanced view of its performance when the data are imbalanced.
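
To make the argument concrete, here is a small sketch with invented labels (8 negatives and 2 positives; the variable names and values are hypothetical and are not taken from the airline data). It compares a model that always predicts the majority class with one that finds one of the two positives.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Invented ground truth: 8 negatives followed by 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Degenerate model: always predicts the majority class.
y_majority = [0] * 10

# Model with one true positive, one false positive and one false negative.
y_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

acc_majority = accuracy_score(y_true, y_majority)            # 0.8
f1_majority = f1_score(y_true, y_majority, zero_division=0)  # 0.0

acc_model = accuracy_score(y_true, y_model)                  # 0.8
precision_model = precision_score(y_true, y_model)           # 0.5
recall_model = recall_score(y_true, y_model)                 # 0.5
f1_model = f1_score(y_true, y_model)                         # 0.5
```

Both models have the same accuracy, yet only the F1-score shows that the first one never identifies a positive sample.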

### **Feature selection** (2 points)

As seen in class, there are different ways to select which features to use in a machine learning model.

You should implement the "Forward stepwise selection" technique to find the best `p` features to be used in this task according to that method.

Use the [Logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model and the **K-fold cross-validation** as optimality criterion. You can use the Scikit-learn library, which has helper functions to create the [K-fold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) logic and the model. The metric used in K-fold should be the one chosen in the previous section!

Remember to save a new dataframe only with the selected features for the next steps! Also, use only training data on K-fold validation, keeping a test set separated to estimate the performance of the model on unseen data on the final part of the whole task.

In [None]:
#imports
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.metrics import f1_score, make_scorer
import numpy as np


# brief explanation of the reasoning behind the code:
# first we handle the missing values
# once the gaps are filled, the data must be standardized (scaled)
# after that, we check for constant features
# finally, every candidate feature set is scored with the F1 metric


# separate the target (satisfaction_numeric) from the features
X = data.drop(columns=['satisfaction', 'satisfaction_numeric', 'Unnamed: 0', 'id'])  # drop irrelevant columns
y = data['satisfaction_numeric']

# convert the categorical columns to numeric ones with one-hot encoding
X_encoded = pd.get_dummies(X, drop_first=True)

# split the dataset into training (80%) and test (20%) sets before fitting the
# imputer and the scaler, so that test-set statistics do not leak into training
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# fill missing values with the training-set mean of each column
imputer = SimpleImputer(strategy='mean')
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns, index=X_test.index)

# standardize the features with the training-set mean and standard deviation
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

# define the logistic regression model
log_reg = LogisticRegression(max_iter=1000)

# define K-fold cross-validation with 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# forward stepwise selection function
def forward_stepwise_selection(X_train, y_train, model, kfold, max_features=5):
    selected_features = []
    remaining_features = list(X_train.columns)
    best_f1 = 0
    scorer = make_scorer(f1_score)

    while len(selected_features) < max_features:
        f1_scores = []
        for feature in remaining_features:
            # build a temporary feature set that adds the current candidate
            features_to_evaluate = selected_features + [feature]
            X_subset = X_train[features_to_evaluate]

            try:
                # mean F1-score over the folds; error_score='raise' surfaces fitting errors
                scores = cross_val_score(model, X_subset, y_train, cv=kfold, scoring=scorer, error_score='raise')
                mean_f1 = np.mean(scores)
                f1_scores.append((mean_f1, feature))
            except Exception as e:
                print(f"Error with feature {feature}: {e}")

        # pick the candidate with the best mean F1-score
        if f1_scores:
            f1_scores.sort(reverse=True, key=lambda x: x[0])
            best_new_f1, best_feature = f1_scores[0]

            # keep the candidate only if it improves the best score so far
            if best_new_f1 > best_f1:
                selected_features.append(best_feature)
                remaining_features.remove(best_feature)
                best_f1 = best_new_f1
            else:
                break
        else:
            break

    return selected_features

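# select the best features
selected_features = forward_stepwise_selection(X_train, y_train, log_reg, kf, max_features=5)
print("selected features:", selected_features)

# keep dataframes with only the selected features for the next steps
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]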
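
For comparison, scikit-learn ships a built-in forward selector, `SequentialFeatureSelector`. The sketch below is not the hand-written loop above: it runs on synthetic data (`X_demo`, `y_demo` and the `f0`..`f7` column names are invented for illustration) and always selects a fixed number of features instead of stopping when the score no longer improves.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for a scaled training set (not the airline data).
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Forward selection scored by the mean F1 over 5 folds.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    scoring="f1",
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
selector.fit(X_demo, y_demo)

# Names of the chosen columns and a dataframe restricted to them.
chosen = list(X_demo.columns[selector.get_support()])
X_demo_selected = X_demo[chosen]
```

Running both approaches on the same training data is one way to sanity-check the manual implementation.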

### **Model selection** (4 points)

This is the main section of the task. Using the features selected in the previous section, you must use [**Grid search** with K-fold cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to select the best classification model and its hyperparameters for this task.

Remember to use only training data on K-fold validation, keeping a test set separated to estimate the performance of the model on unseen data.

You should train and validate the following models:
- [Logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- [Decision Trees](https://scikit-learn.org/stable/modules/tree.html): explore a Decision Tree Classifier, a Random Forest (RF) and a Gradient Boosting Machine (GBM).
- [SVM](https://scikit-learn.org/stable/modules/svm.html)

Explore the documentation above and select which hyperparameters to vary.


Also, you should test the polynomial transformation to find possible nonlinear relations between the features of the dataset. **Do not** use values above "3" for the `degree` of the polynomial transformation, as the number of features increases exponentially.

In short, you should use GridSearchCV (that uses K-fold internally) to get the best hyperparameters for each model, while also testing the polynomial transformation.

*Note: you will need to use the `fit` method more than once to test the different dataset transformations. Choose wisely which hyperparameters to test, as the GridSearch will test all combinations and can take a very long time to finish.*
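
A minimal sketch of such a search is shown below, under stated assumptions: it uses synthetic data, a single model (logistic regression) and a small, arbitrarily chosen grid. The step names `poly`, `scale` and `clf`, the variable names, and the grid values are all hypothetical. Putting the polynomial transformation inside a `Pipeline` lets `GridSearchCV` treat `degree` as one more hyperparameter and refit the transformation on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the selected features (not the airline data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 3 degrees x 3 regularization strengths = 9 combinations, each run on 5 folds.
param_grid = {
    "poly__degree": [1, 2, 3],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="f1",
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_tr, y_tr)  # only training data is used; X_te stays untouched

best_params = search.best_params_
best_cv_f1 = search.best_score_
```

The same pattern can be repeated with a tree-based model or an SVM as the final step, each with its own `param_grid`.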

#### Discussion of key points

- What was the best model according to cross validation?
- Were the models that use regularization able to outperform the Logistic Regression? Explain why.


### Threshold testing (1 point)

The three models trained in the previous section can return the probabilities of a sample being of the positive class.
The default threshold used to convert the results to the desired 0 or 1 output is `0.5`.

Use the K-fold cross validation to test different thresholds with the best models trained in the previous section (remember to train the best models with all train data and the best hyperparameters).

*If the model does not output probabilities, look for the `predict_proba` method*.
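
One possible way to test thresholds is sketched below, assuming synthetic data and a plain logistic regression (all names and the threshold grid are hypothetical). It collects out-of-fold probabilities with `cross_val_predict` and scores each threshold on the pooled predictions, which is a simplification of averaging the F1-score fold by fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic, mildly imbalanced stand-in for the training set.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=0,
)

model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold probability of the positive class for every training sample.
proba = cross_val_predict(model, X_demo, y_demo, cv=kf, method="predict_proba")[:, 1]

# F1-score obtained when each threshold converts probabilities to 0/1 labels.
thresholds = np.arange(0.1, 0.91, 0.1)
f1_by_threshold = {
    round(float(t), 1): f1_score(y_demo, (proba >= t).astype(int), zero_division=0)
    for t in thresholds
}
best_threshold = max(f1_by_threshold, key=f1_by_threshold.get)
```

The chosen threshold would then be applied to the probabilities of the final model, trained on all the training data.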

### Visualizing/interpreting weights (0.5 point)

As we're dealing with models that apply regularization terms, it's relatively easy to verify those results on the coefficient weights of the trained models.

Use the function below to visualize the weights of the final models.<br/>
Also, train the three models again using *all* original features, and use the function below to compare how the weight distribution behaves on each of the models*.

\* *If no features were removed in previous sections, just compare the three models*

In [None]:
## You don't need to change this cell!
import plotly.express as px

def plot_weights(model, columns):
  '''
  Plot the weights of the model for each column in an interactive graph.
  "model" should be an sklearn model or follow the same interface, having the "coef_" attribute with the weights.

  -----
  Examples:
  plot_weights(classifier, X.columns)
  # for polynomial transformations
  plot_weights(classifier, poly.get_feature_names_out(X.columns))

  '''
  # use the "model" argument (the original referenced an undefined global "clf")
  if not hasattr(model, 'coef_'):
    print("Invalid model!")
    return

  df_plot = pd.DataFrame(columns=['weight','columns'])
  df_plot['columns']= columns

  if len(columns) == len(model.coef_):
    df_plot['weight']=model.coef_
  else:
    df_plot['weight']=model.coef_[0]

  fig = px.bar(df_plot, x='columns', y='weight', color='weight')
  fig.show()

In [None]:
# Plot the weights of your models!

### Explainability Tools (+0.5 points)

Use explainability tools, like [SHAP](https://shap.readthedocs.io/en/latest/), explain how it works and apply to DT or Kernel SVM.

#### Discussion of key points

- What conclusions can you draw when looking at the different graphs?


In [None]:
hasattr(obj, 'attribute')

### Testing (0.5 point)

Finally, choose your **Best** model in validation, test it and plot the normalized confusion matrix.
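
As a small illustration of the normalized confusion matrix, the sketch below uses invented labels and predictions (not results from any model in this notebook). With `normalize="true"`, each row is divided by the number of true samples of that class, so each row sums to 1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented test labels and predictions: 4 negatives and 6 positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, normalize="true")

# Row 0: 3 of 4 negatives predicted correctly -> [0.75, 0.25]
# Row 1: 4 of 6 positives predicted correctly -> [1/3, 2/3]
row_sums = cm.sum(axis=1)
```

To plot it, `sklearn.metrics.ConfusionMatrixDisplay(cm).plot()` can be used.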

## Deadline

Tuesday, October 22, 11:59 pm.

## Submission

On Google Classroom, submit your Jupyter Notebook (in Portuguese or English) or Google Colaboratory link (remember to share it!).

**This activity is NOT individual, it must be done in pairs (two-person group).**

Only one individual should deliver the notebook.