# Assignment 2 - Supervised learning: Classification and regression

In [2]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

from ucimlrepo import fetch_ucirepo 

import plotly.io as pio

pio.renderers.default = "notebook"  # or "none" to disable automatic output

In [3]:
# Load the Gas Turbine CO and NOx Emission Data Set (UCI ML Repository, id=551)
gas_turbine_co_and_nox_emission_data_set = fetch_ucirepo(id=551)

# Features and targets as pandas DataFrames
X = gas_turbine_co_and_nox_emission_data_set.data.features
y = gas_turbine_co_and_nox_emission_data_set.data.targets

# Regression Analysis - Part A

## 1. Linear Regression Model and Feature Transformation

In this section, we aim to predict the target variable **[insert your target variable]** based on the input features **[list your features]**. The objective is to establish the relationship between the target and the input variables using **linear regression**. By performing regression, we hope to:

- Make accurate predictions for the target variable.
- Analyze the influence of each input feature on the output.
  
### Feature Transformation

To prepare the data for regression, we apply a feature transformation. One commonly used transformation is **one-of-K coding**, which replaces a categorical variable with K categories by K binary (0 or 1) indicator columns, one per category. This prevents the regression model from treating category labels as ordered numeric values.

Additionally, since we will apply regularization later, the input feature matrix \(X\) is standardized so that each column has a **mean of 0** and a **standard deviation of 1**. Because the penalty acts on the size of the coefficients, and a coefficient's size depends on the scale of its feature, standardization ensures that the regularization treats all features equally instead of depending on the units in which each feature is measured.
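
As an illustration of the two transformations, the following is a minimal NumPy sketch. The helper names `one_of_k` and `standardize` and the toy arrays are assumptions made for this example and are not taken from the dataset. The sketch computes the mean and standard deviation on the training split only, which is one common way to avoid leaking test information.

```python
import numpy as np

def one_of_k(labels):
    """Encode a 1-D array of category labels as a binary indicator matrix."""
    categories, idx = np.unique(labels, return_inverse=True)
    encoded = np.zeros((len(labels), len(categories)))
    encoded[np.arange(len(labels)), idx] = 1.0
    return encoded, categories

def standardize(X_train, X_test):
    """Scale columns to mean 0 and std 1 using training statistics only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0, ddof=1)
    sigma[sigma == 0] = 1.0  # guard against constant columns
    return (X_train - mu) / sigma, (X_test - mu) / sigma

# Hypothetical toy data, used only to show the mechanics
colors = np.array(["red", "green", "red", "blue"])
encoded, categories = one_of_k(colors)

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 300.0]])
X_train_std, X_test_std = standardize(X_train, X_test)
```

Each row of `encoded` contains exactly one 1, in the column of its category, and the columns of `X_train_std` have mean 0 and standard deviation 1 regardless of their original units.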

---

## 2. Regularization and Cross-Validation

### Introduction of Regularization Parameter (\(\lambda\))

To improve generalization and reduce overfitting, we introduce the **regularization parameter \(\lambda\)**. It penalizes large coefficients: the larger \(\lambda\) is, the more the weights are shrunk toward zero, which yields a simpler model that is less prone to fitting noise in the training data.

We will explore a range of \(\lambda\) values to find the one that minimizes the generalization error. Ideally, we choose a range where the generalization error first decreases as \(\lambda\) increases, and then starts to rise again, indicating the sweet spot for regularization.

### Estimation of Generalization Error using Cross-Validation

We will use **10-fold cross-validation** to estimate the generalization error for different values of \(\lambda\). Cross-validation helps assess how well the model generalizes to unseen data by splitting the dataset into training and validation sets multiple times.

Include a plot showing the generalization error as a function of \(\lambda\) and briefly discuss the results.
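
One possible way to produce the error curve is sketched below with NumPy only. It assumes an L2 (ridge) penalty with an unpenalized intercept, solved through the regularized normal equations. The function names, the synthetic data, and the \(\lambda\) grid are assumptions for illustration, not the assignment's prescribed implementation.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve the regularized normal equations; the intercept is not penalized."""
    Xb = np.column_stack([np.ones(len(X)), X])
    penalty = lam * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def cv_generalization_error(X, y, lambdas, k=10, seed=0):
    """Estimate the mean squared error for each lambda with k-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    sq_err = np.zeros(len(lambdas))
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Standardize with statistics from the training part of the fold
        mu = X[train_idx].mean(axis=0)
        sigma = X[train_idx].std(axis=0, ddof=1)
        X_tr = (X[train_idx] - mu) / sigma
        X_val = (X[val_idx] - mu) / sigma
        for j, lam in enumerate(lambdas):
            w = ridge_fit(X_tr, y[train_idx], lam)
            residual = y[val_idx] - (w[0] + X_val @ w[1:])
            sq_err[j] += np.sum(residual ** 2)
    return sq_err / len(y)

# Synthetic stand-in data (assumption); replace with the real X and y
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
w_true = np.array([1.5, -2.0, 0.0, 0.5, 1.0])
y_demo = 3.0 + X_demo @ w_true + rng.normal(scale=0.5, size=200)

lambdas = np.power(10.0, np.arange(-3, 4))
gen_error = cv_generalization_error(X_demo, y_demo, lambdas)
best_lambda = lambdas[np.argmin(gen_error)]
```

Plotting `gen_error` against `lambdas` on a logarithmic x-axis gives the curve discussed above, and `best_lambda` marks its minimum on the chosen grid.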

---

## 3. Analysis of the Linear Model and Coefficients

### Output of the Linear Model

Once we have identified the optimal \(\lambda\) (based on the lowest generalization error), we can compute the output \(y\) for any given input \(x\) using the linear model. The output is computed using the formula:

\[
y = w_0 + w_1x_1 + w_2x_2 + \cdots + w_nx_n
\]

where \(w_0\) is the intercept and \(w_1, w_2, \dots, w_n\) are the weights (coefficients) for each feature.
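
The formula can be checked with a small numeric example. The weights and input below are made-up values chosen for easy arithmetic, and `predict` is a hypothetical helper name.

```python
import numpy as np

def predict(x, w0, w):
    """Evaluate y = w0 + w1*x1 + ... + wn*xn for one input or a batch of inputs."""
    return w0 + np.asarray(x) @ np.asarray(w)

w0 = 2.0
w = np.array([0.5, -1.0, 3.0])
x = np.array([4.0, 1.0, 0.5])

# 2.0 + 0.5*4.0 + (-1.0)*1.0 + 3.0*0.5 = 4.5
y_hat = predict(x, w0, w)
```

The same function accepts a matrix with one observation per row, in which case it returns one prediction per row.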

### Effect of Individual Attributes on the Output

Each input feature \(x_i\) influences the output \(y\) through its coefficient \(w_i\). Because the features are standardized, the coefficients are on a common scale, so a larger absolute value of \(w_i\) indicates a greater impact on the prediction. A positive coefficient means that the predicted \(y\) increases as the feature increases, while a negative coefficient means that it decreases.
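
A simple way to read off these effects is to sort the features by the absolute value of their weights. The sketch below uses placeholder feature names and weights (assumptions, not fitted values), and the comparison is only meaningful when the weights come from standardized features.

```python
import numpy as np

def rank_coefficients(names, w):
    """Sort features by absolute weight, largest first, keeping the sign."""
    order = np.argsort(-np.abs(w))
    return [(names[i], float(w[i])) for i in order]

# Placeholder names and weights for illustration only
names = ["x1", "x2", "x3"]
w = np.array([0.5, -1.0, 3.0])

ranking = rank_coefficients(names, w)
for name, weight in ranking:
    direction = "increases" if weight > 0 else "decreases"
    print(f"{name}: w = {weight:+.2f}, prediction {direction} as {name} grows")
```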

Lastly, we will evaluate whether the effect of individual attributes aligns with the problem context and domain knowledge.

---
# Regression Analysis - Part B

In this section, we compare three models: the regularized linear regression model from the previous section, an artificial neural network (ANN), and a baseline. The focus is on determining whether the linear regression model or the ANN performs better, and whether either of them is superior to the trivial baseline.

1. **Two-Level Cross-Validation Implementation**: 
   Implement two-level cross-validation to compare the models. For this, we will use \( K_1 = K_2 = 10 \) folds. The baseline model will use the mean of \( y \) from the training data to predict \( y \) for the test data. Additionally, for the ANN, we will select the number of hidden units \( h \) as the complexity-controlling parameter. Based on initial tests, we will choose a reasonable range for \( h \) and \( \lambda \).
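
The structure of two-level cross-validation can be sketched as follows. To stay self-contained, this example covers only the ridge model and the mean baseline. An ANN would be handled in the same way, with the inner loop searching over \( h \) instead of \( \lambda \). All function names and the synthetic data are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression via the normal equations, intercept unpenalized."""
    Xb = np.column_stack([np.ones(len(X)), X])
    penalty = lam * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def predict(X, w):
    return w[0] + X @ w[1:]

def standardize(X_tr, X_te):
    mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0, ddof=1)
    return (X_tr - mu) / sigma, (X_te - mu) / sigma

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def two_level_cv(X, y, lambdas, k1=10, k2=10, seed=0):
    """Return one record per outer fold: chosen lambda and test errors."""
    rng = np.random.default_rng(seed)
    outer = np.array_split(rng.permutation(len(y)), k1)
    rows = []
    for i, test_idx in enumerate(outer):
        par_idx = np.concatenate([f for j, f in enumerate(outer) if j != i])
        X_par, y_par = X[par_idx], y[par_idx]

        # Inner loop: select lambda using only the outer training data
        inner = np.array_split(rng.permutation(len(y_par)), k2)
        inner_err = np.zeros(len(lambdas))
        for m, val_idx in enumerate(inner):
            tr_idx = np.concatenate([f for j, f in enumerate(inner) if j != m])
            X_tr, X_val = standardize(X_par[tr_idx], X_par[val_idx])
            for j, lam in enumerate(lambdas):
                w = ridge_fit(X_tr, y_par[tr_idx], lam)
                err = mse(y_par[val_idx], predict(X_val, w))
                inner_err[j] += len(val_idx) * err
        best_lam = lambdas[np.argmin(inner_err)]

        # Retrain on all outer training data, evaluate once on the test fold
        X_tr, X_te = standardize(X_par, X[test_idx])
        w = ridge_fit(X_tr, y_par, best_lam)
        rows.append({
            "fold": i + 1,
            "lambda": float(best_lam),
            "ridge_mse": mse(y[test_idx], predict(X_te, w)),
            "baseline_mse": mse(y[test_idx], y_par.mean()),
        })
    return rows

# Synthetic stand-in data (assumption); replace with the real X and y
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(120, 4))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=120)
lambdas = np.power(10.0, np.arange(-2, 3))
rows = two_level_cv(X_demo, y_demo, lambdas)
```

Each record in `rows` corresponds to one outer fold, so the list can be turned directly into a per-fold comparison table. The key point of the design is that the test fold is never used for selecting \( \lambda \).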


---
2. **Comparison Table**: 
   Produce a table similar to Table 1 using two-level cross-validation. For each fold \( i \), include the optimal number of hidden units \( h^*_i \) and the regularization strength \( \lambda^*_i \), along with the estimated generalization error for each method. Also, include the baseline error for each fold to allow a direct comparison.

---
3. **Statistical Evaluation**:
   Perform a statistical evaluation to determine whether there is a significant performance difference between the models. Use one of the following setups to compare the models:

   - **Setup I**: Use the paired t-test (Box 11.3.4)
   - **Setup II**: Use the method described in Box 11.4.1

   Report p-values and confidence intervals for the pairwise tests and conclude on the performance of the models. Discuss which model is better and whether either model outperforms the baseline.
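
As a rough sketch of a Setup I style comparison, the snippet below runs a paired t-test on per-observation loss differences between two models and reports a confidence interval and a p-value. It does not reproduce Box 11.3.4. The function name `paired_t_test`, the significance level, and the synthetic losses are assumptions made for this example, and it relies on `scipy.stats` for the Student t distribution.

```python
import numpy as np
from scipy import stats

def paired_t_test(loss_a, loss_b, alpha=0.05):
    """Paired t-test on per-observation losses of two models.

    Returns the mean difference (A minus B), a (1 - alpha) confidence
    interval for it, and the two-sided p-value for a zero mean difference.
    """
    z = np.asarray(loss_a) - np.asarray(loss_b)
    n = len(z)
    z_mean = z.mean()
    sem = z.std(ddof=1) / np.sqrt(n)
    t_stat = z_mean / sem
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
    half_width = stats.t.ppf(1 - alpha / 2, df=n - 1) * sem
    return z_mean, (z_mean - half_width, z_mean + half_width), p_value

# Synthetic per-observation squared errors (assumption, not real results)
rng = np.random.default_rng(0)
y_true = rng.normal(size=200)
pred_model = y_true + rng.normal(scale=0.5, size=200)  # hypothetical model
pred_baseline = np.full(200, y_true.mean())            # mean predictor

loss_model = (y_true - pred_model) ** 2
loss_baseline = (y_true - pred_baseline) ** 2

z_mean, ci, p_value = paired_t_test(loss_model, loss_baseline)
```

A negative mean difference with a confidence interval that excludes zero would suggest that the first model has the lower loss. The same function can be applied to each pair of models in turn.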