#### Setup

In [4]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression

warnings.filterwarnings("ignore")

# Load the data
df = pd.read_csv('parkinsons.csv')
X = df.drop('target', axis=1)
y = df['target']

#### 5. Train a Linear Regression model, an MLP Regressor with 2 hidden layers of 10 neurons each and no activation functions, and another MLP Regressor with 2 hidden layers of 10 neurons each using ReLU activation functions. (Use random_state=0 on the MLPs, regardless of the run). Plot a boxplot of the test MAE of each model.

In [None]:
mae_linear = []
mae_mlp_no_activation = []
mae_mlp = []

for i in range(1, 11):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=i)
    
    # Linear Regression Model
    linear = LinearRegression()
    linear.fit(X_train, y_train)
    y_pred_linear = linear.predict(X_test)
    mae_linear.append(mean_absolute_error(y_test, y_pred_linear))
    
    # MLP Model no activation
    mlp_no_activation = MLPRegressor(hidden_layer_sizes=(10,10), activation='identity', random_state=0)
    mlp_no_activation.fit(X_train, y_train)
    y_pred_mlp_no_activation = mlp_no_activation.predict(X_test)
    mae_mlp_no_activation.append(mean_absolute_error(y_test, y_pred_mlp_no_activation))
    
    # MLP Model ReLU
    mlp = MLPRegressor(hidden_layer_sizes=(10,10), activation='relu', random_state=0)
    mlp.fit(X_train, y_train)
    y_pred_mlp = mlp.predict(X_test)
    mae_mlp.append(mean_absolute_error(y_test, y_pred_mlp))
    
boxplot_data = [mae_linear, mae_mlp_no_activation, mae_mlp]

plt.figure(figsize=(10, 6))
sns.boxplot(data=boxplot_data)
plt.xticks([0, 1, 2], ['Linear Regression', 'MLP No Activation', 'MLP ReLU'])
plt.ylabel('Mean Absolute Error (MAE)')
plt.title('MAE comparison of 3 different models')
plt.show()


#### 6. Compare a Linear Regression with an MLP with no activations, and explain the impact and the importance of using activation functions in an MLP. Support your reasoning with the results from the boxplots.

Comparing the boxplots for the Linear Regression and for the MLP with no activation, we see that their test MAE distributions are very similar.

This similarity follows from the mathematics behind both models. Linear Regression captures only linear relationships between the features and the target variable. The MLP with no activation, despite having two hidden layers, is also a linear model: each layer applies an affine transformation, and a composition of affine transformations is itself a single affine transformation. Without non-linear activation functions, the network cannot represent anything beyond what a linear regression can, so it reaches a very similar result with a lot of extra steps. The two boxplots need not be identical, because Linear Regression is solved in closed form by least squares while the MLP is trained iteratively with a stochastic optimizer, but both models are limited to the same family of linear functions.

This shows why activation functions matter: they introduce non-linearity into the model, so that it can capture non-linear and more complex relationships between the features and the target and, when the data contain such relationships, achieve lower test error. Without them, MLPs are just linear transformations of the input.
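
The collapse of stacked identity layers into a single linear map can be checked numerically. The following is a minimal NumPy sketch with randomly generated weights; the layer shapes (5 inputs, two hidden layers of 10 units, 1 output) and all variable names are illustrative assumptions, not values taken from the models fitted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 5 input features, two hidden layers of 10 units, 1 output.
W1, b1 = rng.normal(size=(5, 10)), rng.normal(size=10)
W2, b2 = rng.normal(size=(10, 10)), rng.normal(size=10)
W3, b3 = rng.normal(size=(10, 1)), rng.normal(size=1)


def mlp_identity(X):
    """Forward pass with identity activations (no non-linearity)."""
    return ((X @ W1 + b1) @ W2 + b2) @ W3 + b3


def mlp_relu(X):
    """Same weights, but with ReLU applied after each hidden layer."""
    h1 = np.maximum(X @ W1 + b1, 0.0)
    h2 = np.maximum(h1 @ W2 + b2, 0.0)
    return h2 @ W3 + b3


# Collapse the three affine layers into one weight matrix and one bias.
W_eff = W1 @ W2 @ W3
b_eff = (b1 @ W2 + b2) @ W3 + b3

X_demo = rng.normal(size=(100, 5))
out_identity = mlp_identity(X_demo)
out_linear = X_demo @ W_eff + b_eff
out_relu = mlp_relu(X_demo)

print("identity MLP equals single linear map:", np.allclose(out_identity, out_linear))
print("ReLU MLP equals single linear map:    ", np.allclose(out_relu, out_linear))
```

With identity activations the network output matches `X_demo @ W_eff + b_eff` up to floating-point error, whereas inserting ReLU between the layers breaks this equivalence.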


#### 7. Using an 80-20 train-test split with random_state=0, use a Grid Search to tune the hyperparameters of an MLP regressor with two hidden layers (size 10 each). The parameters to search over are: (i) L2 penalty, with the values $\{0.0001, 0.001, 0.01\}$; (ii) learning rate, with the values $\{0.001, 0.01, 0.1\}$; and (iii) batch size, with the values $\{32, 64, 128\}$. Plot the test MAE for each combination of hyperparameters, report the best combination, and discuss the trade-offs between the combinations.

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Grid Search parameters
param_grid = {
    'alpha': [0.0001, 0.001, 0.01], #L2 penalty
    'learning_rate_init': [0.001, 0.01, 0.1], #Learning rate  
    'batch_size': [32, 64, 128]  #Batch size
}

# Create the MLP Regressor
mlp_regressor = MLPRegressor(hidden_layer_sizes=(10,10), random_state=0)

# Grid Search
grid_search = GridSearchCV(mlp_regressor, param_grid=param_grid, 
                           scoring='neg_mean_absolute_error', cv = 5)
grid_search.fit(X_train, y_train)


# Evaluate the best model
results = pd.DataFrame(grid_search.cv_results_)
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)

print("Best hyperparameters found:", "\nL2_penalty:",grid_search.best_params_['alpha'], "\nLearning rate:", grid_search.best_params_['learning_rate_init']
      , "\nBatch size: ", grid_search.best_params_['batch_size'])
print("Test MAE of the best model: ", test_mae)

# Extract relevant columns
params = results['params']
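# 'mean_test_score' is the mean score over the 5 cross-validation folds of the
# training set, not the held-out test set; negate it because the scorer is
# neg_mean_absolute_error.
mean_mae = -results['mean_test_score']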

# Labels
param_labels = [
    f"[{param['alpha']}, {param['learning_rate_init']}, {param['batch_size']}]"
    for param in params
]

# Plot the results
plt.figure(figsize=(12, 6))
plt.plot(range(len(mean_mae)), mean_mae, marker='o')
plt.xticks(range(len(mean_mae)), param_labels, rotation=45, ha = 'right')
plt.ylabel('Mean cross-validated MAE')
plt.xlabel('Hyperparameter Combinations')
plt.title('MAE for each Hyperparameter Combination')
plt.grid()
plt.show()
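
The plot above shows the cross-validated MAE from `cv_results_`. If the MAE on the held-out test set is wanted for every combination instead, one option is to loop over the grid with `ParameterGrid` and score each fitted model on the test split. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_regression` as a stand-in for `parkinsons.csv`, and the names `X_demo`, `y_demo`, and `test_mae_by_combo` are hypothetical.

```python
import warnings

from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.neural_network import MLPRegressor

warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Synthetic stand-in data (assumption): swap in the real X and y in the notebook.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

param_grid = {
    "alpha": [0.0001, 0.001, 0.01],            # L2 penalty
    "learning_rate_init": [0.001, 0.01, 0.1],  # learning rate
    "batch_size": [32, 64, 128],               # batch size
}

# Fit one model per combination and record its MAE on the held-out test split.
test_mae_by_combo = []
for params in ParameterGrid(param_grid):
    model = MLPRegressor(hidden_layer_sizes=(10, 10), random_state=0, **params)
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))
    test_mae_by_combo.append((params, mae))

best_params, best_mae = min(test_mae_by_combo, key=lambda item: item[1])
print("Lowest test MAE:", best_mae, "with", best_params)
```

Test MAE computed this way is suitable for reporting and plotting, but choosing hyperparameters by it would leak test information into model selection, which is why the selection itself is better left to cross-validation on the training set.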

The best combination of hyperparameters found in this experiment (selected by 5-fold cross-validation on the training set) is:
$$\text{L2 penalty} = 0.0001;\quad \text{Learning rate} = 0.01;\quad \text{Batch size} = 32$$

As for the trade-offs of each hyperparameter:
* **L2 penalty (alpha):**
  
    This parameter controls how strongly large weights are penalized through an L2 (Ridge-style) regularization term added to the loss. Higher values of alpha correspond to more regularization.
    Values that are too low apply little regularization and are therefore associated with overfitting, that is, a low training MAE but a noticeably higher test MAE. Values that are too high constrain the weights too much and lead to underfitting. In this grid search, the best result was obtained at alpha = 0.0001, the smallest value tested.

* **Learning Rate:**

    This parameter determines how large the updates to the model's weights and biases are on each iteration.
    Values that are too low may cause the model to take too long to converge to a good solution, while values that are too high may cause it to overshoot the minimum and diverge.
    This was the hyperparameter with the most impact on the MAE scores, with a learning rate of 0.01 being the best value for this model.

* **Batch size:**

    This parameter is the number of training instances the model takes into account before making the next update to the weights and biases.
    Smaller batches mean that the weights are updated more often per epoch, using noisier gradient estimates, which often (but not always) helps the model reach a better solution. They also make each epoch slower for the same reason, so the trade-off here is between training time and the effectiveness of the MLP.
    In this experiment, the model obtained its best MAE scores with the smallest batch size tested, 32.
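
Two of the trade-offs above can be illustrated with simple arithmetic. The sketch below is a toy example, not the notebook's model: the training-set size `n_train = 4000` is a hypothetical value, and the learning-rate demonstration uses plain gradient descent on $f(w) = w^2$ with illustrative step sizes rather than the Adam optimizer and the grid values used by `MLPRegressor`.

```python
import math

# (a) Batch size vs. number of weight updates per epoch.
n_train = 4000  # hypothetical training-set size (assumption)
updates_per_epoch = {bs: math.ceil(n_train / bs) for bs in (32, 64, 128)}
print(updates_per_epoch)  # {32: 125, 64: 63, 128: 32}


# (b) Learning rate vs. convergence on the toy objective f(w) = w**2.
def gradient_descent(lr, steps=50, w0=1.0):
    """Run plain gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w


# Illustrative step sizes (assumption): too small, reasonable, too large.
final_w = {lr: gradient_descent(lr) for lr in (0.001, 0.1, 1.1)}
for lr, w in final_w.items():
    print(f"lr={lr}: w after 50 steps = {w:.6g}")
```

With the smallest step size `w` has barely moved from its starting value after 50 steps, with the middle one it is close to the minimum at 0, and with the largest one each update overshoots and the iterate grows in magnitude.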
