# Regression - G3.2

## 1. Dataset Analysis
Load datasets via the scikit-learn library. The datasets can be downloaded using the fetch_openml function by indicating the name of the dataset as a parameter. In addition, organise the downloaded data in a pandas DataFrame, and display the first rows to gain an overview of the available variables.

In [None]:
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
from scipy.stats import normaltest
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset using fetch_openml
meta = fetch_openml(name='meta', version=1, parser='auto')
california_housing = fetch_openml(name='california_housing', version=7, parser='auto')
kin8nm = fetch_openml(name='kin8nm', version=1, parser='auto')
chscase_census2 = fetch_openml(name='chscase_census2', version=1, parser='auto')

### 1.1 Meta
Source Link: https://openml.org/search?type=data&status=active&id=566


We start by storing the dataset's feature data in a helper variable, which the analysis steps below reuse.

In [None]:
data_meta = meta.data

#### 1.1.A Find the dimensions of the dataset
After loading the dataset, we begin by examining its size; the dataset consists of 528 samples and contains 22 features or variables.

In [None]:
print("\n\nDataset dimensions: ") 
data_meta.shape #(samples, n_features)

#### 1.1.B Look at the first one hundred records of the dataset
Printing the first one hundred rows gives us a first impression of the data, data types, etc.

In [None]:
print("\n\nFirst one hundred records: ")
data_meta.head(100)

We display the first 100 rows in compact form to spot possible relationships within the data at a glance. We can also see that the dataset contains some missing values, which are dealt with in the second section of this notebook.

#### 1.1.C Look at the main information that makes up the Dataset
We can now print out the characteristics within the dataset that were requested by the analysis text.

In [None]:
target_meta = meta.target
feature_names_meta = meta.feature_names
description_meta = meta.DESCR

# Now we can print data, target, labels and dataset information as needed
print("\n\nData:")
print(data_meta)
print("\n\nTarget:")
print(target_meta)
print("\n\nFeatures name:")
print(feature_names_meta)
print("\n\nDataset description:")
print(description_meta)

#### 1.1.D DataFrame creation with Pandas
After performing all our initial checks within the Dataset, we can proceed to create the DataFrame using the Pandas library.

In [None]:
df_meta = pd.DataFrame(data_meta, columns=feature_names_meta)
df_meta['target'] = target_meta

### 1.2 California_Housing 
Source Link: https://openml.org/search?type=data&status=active&id=44977&sort=runs


We start by storing the dataset's feature data in a helper variable, which the analysis steps below reuse.

In [None]:
data_ch = california_housing.data

#### 1.2.A Find the dimensions of the dataset
After loading the dataset, we begin by examining its size; the dataset consists of 20640 samples and contains 8 features or variables.

In [None]:
print("\n\nDataset dimensions: ") 
data_ch.shape #(samples, n_features)

#### 1.2.B Look at first few records of the dataset
Printing the first rows gives us a first impression of the data, data types, etc.

In [None]:
print("\n\nFirst few records: ") 
data_ch.head()

In this particular case, with an initial analysis, we did not notice any unusual relationships between the data.

#### 1.2.C Look at the main information that makes up the Dataset
We can now print out the characteristics within the dataset that were requested by the analysis text.

In [None]:
target_ch = california_housing.target
feature_names_ch = california_housing.feature_names
description_ch = california_housing.DESCR

# Now we can print data, target, labels and dataset information as needed
print("\n\nData:")
print(data_ch)
print("\n\nTarget:")
print(target_ch)
print("\n\nFeatures name:")
print(feature_names_ch)
print("\n\nDataset description:")
print(description_ch)

#### 1.2.D DataFrame creation with Pandas
After performing all our initial checks within the Dataset, we can proceed to create the DataFrame using the Pandas library.

In [None]:
df_ch = pd.DataFrame(data_ch, columns=feature_names_ch)
df_ch['target'] = target_ch

### 1.3 Kin8nm
Source Link: https://openml.org/search?type=data&id=189&sort=runs&status=active



We start by storing the dataset's feature data in a helper variable, which the analysis steps below reuse.

In [None]:
data_kin8nm = kin8nm.data

#### 1.3.A Find the dimensions of the dataset
After loading the dataset, we begin by examining its size; the dataset consists of 8192 samples and contains 8 features or variables.

In [None]:
print("\n\nDataset dimensions: ") 
data_kin8nm.shape #(samples, n_features)

#### 1.3.B Look at first few records of the dataset
Printing the first rows gives us a first impression of the data, data types, etc.

In [None]:
print("\n\nFirst few records: ") 
data_kin8nm.head()

In this particular case, with an initial analysis, we did not notice any unusual relationships between the data.

#### 1.3.C Look at the main information that makes up the Dataset
We can now print out the characteristics within the dataset that were requested by the analysis text.

In [None]:
target_kin8nm = kin8nm.target
feature_names_kin8nm = kin8nm.feature_names
description_kin8nm = kin8nm.DESCR

# Now we can print data, target, labels and dataset information as needed
print("\n\nData:")
print(data_kin8nm)
print("\n\nTarget:")
print(target_kin8nm)
print("\n\nFeatures name:")
print(feature_names_kin8nm)
print("\n\nDataset description:")
print(description_kin8nm)

#### 1.3.D DataFrame creation with Pandas
After performing all our initial checks within the Dataset, we can proceed to create the DataFrame using the Pandas library.

In [None]:
df_kin8nm = pd.DataFrame(data_kin8nm, columns=feature_names_kin8nm)
df_kin8nm['target'] = target_kin8nm

### 1.4 Chscase_Census2
Source Link: https://openml.org/search?type=data&id=673&sort=runs&status=active


We start by storing the dataset's feature data in a helper variable, which the analysis steps below reuse.

In [None]:
data_chcc2 = chscase_census2.data

#### 1.4.A Find the dimensions of the dataset
After loading the dataset, we begin by examining its size; the dataset consists of 400 samples and contains 7 features or variables.

In [None]:
print("\n\nDataset dimensions: ") 
data_chcc2.shape #(samples, n_features)

#### 1.4.B Look at first few records of the dataset
Printing the first rows gives us a first impression of the data, data types, etc.

In [None]:
print("\n\nFirst few records: ") 
data_chcc2.head()

In this particular case, with an initial analysis, we did not notice any unusual relationships between the data.

#### 1.4.C Look at the main information that makes up the Dataset
We can now print out the characteristics within the dataset that were requested by the analysis text.

In [None]:
target_chcc2 = chscase_census2.target
feature_names_chcc2 = chscase_census2.feature_names
description_chcc2 = chscase_census2.DESCR

# Now we can print data, target, labels and dataset information as needed
print("\n\nData:")
print(data_chcc2)
print("\n\nTarget:")
print(target_chcc2)
print("\n\nFeatures name:")
print(feature_names_chcc2)
print("\n\nDataset description:")
print(description_chcc2)

#### 1.4.D DataFrame creation with Pandas
After performing all our initial checks within the Dataset, we can proceed to create the DataFrame using the Pandas library.

In [None]:
df_chcc2 = pd.DataFrame(data_chcc2, columns=feature_names_chcc2)
df_chcc2['target'] = target_chcc2

## 2. Preprocessing
Now, we will proceed with the necessary preprocessing activities, focusing on the elimination of nominal features and the handling of missing values. Finally, we will perform normalisation or standardisation of variables to ensure that the data are ready for the next steps of predictive analysis.



#### Feature Variances
This first helper function collects the variance of every feature in the dataset into a structured DataFrame.

In [None]:
def calculate_feature_variances(data):
    variances = data.var()

    feature_variances = pd.DataFrame({
        'Feature': variances.index,
        'Variance': variances.values
    })

    return feature_variances

#### Density plots for features distribution
We can use this function to check the distribution of the features in the dataset.

In [None]:
def plot_numeric_distributions(dataframe, numeric_features):
    num_features_count = len(numeric_features.columns)
    rows = (num_features_count + 1) // 2  # two plots per row, rounded up
    # squeeze=False keeps `ax` two-dimensional even when there is a single row
    fig, ax = plt.subplots(rows, 2, figsize=(15, 2 * rows), squeeze=False)
    fig.suptitle('Features Distribution Plot\n\n', fontsize=16)

    for n, feature in enumerate(numeric_features.columns):
        row, col = divmod(n, 2)  # position of the n-th plot in the grid
        dataframe[feature].plot(kind="kde", ax=ax[row, col])
        ax[row, col].set_title(feature)

    # Hide the unused panel when the number of features is odd
    if num_features_count % 2:
        ax[-1, -1].set_visible(False)

    plt.tight_layout()
    plt.show()


### 2.1 Meta


#### 2.1.A Check for nominal features
In this step, we inspect the data types present in the dataset. If any unexpected types appear, we handle them as discussed below.

In [None]:
print("\n\nTypes data:")
df_meta.dtypes

We can immediately see that there are two nominal (or categorical) features within the dataset. The text advises us to delete any such features, so we remove them.

In [None]:
# Delete nominal features (if any)
df_meta = df_meta.select_dtypes(exclude=['category'])

print("\n\nTypes data after removal:")
df_meta.dtypes
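The removal step above can be sketched on a toy frame. This is a minimal illustration with made-up data; the column names `x1`, `x2` and `colour` are hypothetical and do not come from the Meta dataset.

```python
import pandas as pd

# Toy frame (hypothetical data): two numeric columns and one nominal column.
toy = pd.DataFrame({
    "x1": [0.5, 1.2, 3.3],
    "x2": [10, 20, 30],
    "colour": pd.Categorical(["red", "blue", "red"]),
})

# Columns stored with the pandas 'category' dtype are treated as nominal and dropped.
numeric_only = toy.select_dtypes(exclude=["category"])
print(list(numeric_only.columns))
```

Note that this filter only catches columns stored with the `category` dtype. String columns stored as `object` would survive it, so adding `'object'` to the exclude list is one option if such columns are present.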

#### 2.1.B Check for any missing data
We can now proceed to search for any missing data within the dataset.

In [None]:
df_meta.isnull().sum()

We can see that three features contain a considerable number of null values: about half of their instances are missing. Deleting that much data is not recommended, so we interpolate the missing instances instead.

In [None]:
df_meta = df_meta.interpolate(method='cubicspline', limit_direction='both', axis=0)

df_meta.isnull().sum()

The dataset no longer contains any missing values.
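To make the effect of `limit_direction='both'` concrete, here is a small sketch on a made-up series. It swaps in `method='linear'` as a simpler stand-in for the `cubicspline` method used above, so the filled values are easy to verify by hand; the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading, an interior and a trailing gap.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Interior gaps are interpolated along the row index; with
# limit_direction='both' the leading and trailing gaps are filled as well.
filled = s.interpolate(method="linear", limit_direction="both")
print(filled.tolist())
print(filled.isnull().sum())
```

Because the interpolation runs along the row index (`axis=0`), the filled values depend on the order of the rows.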


#### 2.1.C Standardisation of values
In this step, we standardise the values in the dataset so that all features share a common scale.

In [None]:
# Standardise the numeric values within the Dataset
numeric_cols_meta = df_meta.select_dtypes(include=['float64', 'int64']).columns
df_meta[numeric_cols_meta] = StandardScaler().fit_transform(df_meta[numeric_cols_meta])

#### 2.1.D Variance control between features
Having previously carried out the standardisation, the variance of each feature is examined. Also, in this step, we try to understand whether it is necessary to apply further Feature Extraction techniques in order to make the dataset "lighter".

In [None]:
result = calculate_feature_variances(df_meta)
result

In this case the variance of every feature is approximately equal to 1, as expected after standardisation, which confirms that the features have been homogenised in terms of scale. In addition, the fact that the variance is uniform indicates that all features contribute similarly to the overall variance of the dataset and therefore no features need to be eliminated on the basis of variance.
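The reason the reported variances are close to 1 rather than exactly 1 can be checked with a small sketch. `StandardScaler` divides by the population standard deviation (`ddof=0`), while `DataFrame.var()` defaults to the sample variance (`ddof=1`), so the reported value is n / (n - 1). The example reproduces the z-score by hand on synthetic data; the columns `a` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "a": rng.normal(5.0, 2.0, size=50),
    "b": rng.exponential(3.0, size=50),
})

# Z-score with the population standard deviation (ddof=0),
# which is the convention StandardScaler follows.
standardised = (toy - toy.mean()) / toy.std(ddof=0)

n = len(toy)
sample_var = standardised.var()            # pandas default: ddof=1
population_var = standardised.var(ddof=0)  # exactly 1 up to rounding

print(sample_var.round(4).to_dict())
print(population_var.round(4).to_dict())
```

With 50 rows the sample variance is 50 / 49, and the gap shrinks as the number of rows grows.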

#### 2.1.E Verification of operations
Next, we check whether the standardisation yielded the desired results.

In [None]:
df_meta.head()

In [None]:
plot_numeric_distributions(df_meta, df_meta[numeric_cols_meta])

In this case, we can say that the missing values were interpolated and the standardisation was successful. We can see by eye that the features shown in the graphs above do not follow a Gaussian distribution. This does not matter much to us, because the models we will use later are robust to departures from normality.
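The visual impression could be complemented with a formal test. The following is a minimal sketch, assuming SciPy's `normaltest` (imported at the top of the notebook) and a synthetic sample rather than the notebook's data; the helper name `reject_normality` is hypothetical.

```python
import numpy as np
from scipy.stats import normaltest

def reject_normality(sample, alpha=0.05):
    """Return True when the D'Agostino-Pearson test rejects normality at level alpha."""
    _, p_value = normaltest(sample)
    return p_value < alpha

rng = np.random.default_rng(42)
skewed_sample = rng.exponential(size=1000)  # synthetic, strongly right-skewed

print(reject_normality(skewed_sample))
```

Applied column by column, such a check would give a numeric counterpart to the density plots.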

### 2.2 California_Housing


#### 2.2.A Check for nominal features
In this step, we inspect the data types present in the dataset. If any unexpected types appear, we handle them as discussed below.

In [None]:
print("\n\nTypes data:")
df_ch.dtypes

In this case, no nominal features appear, so we can proceed to check for missing data without removing any features.


#### 2.2.B Check for any missing data
We can now proceed to search for any missing data within the dataset.

In [None]:
df_ch.isnull().sum()

Also in this step, we had no problems as there is no missing data within the dataset. Therefore, we can proceed with the next step, which concerns standardisation.

#### 2.2.C Standardisation of values
In this step, we standardise the values in the dataset so that all features share a common scale.

In [None]:
numeric_cols_ch = df_ch.select_dtypes(include=['float64', 'int64']).columns
df_ch[numeric_cols_ch] = StandardScaler().fit_transform(df_ch[numeric_cols_ch])

#### 2.2.D Variance control between features
Having previously carried out the standardisation, the variance of each feature is examined. Also, in this step, we try to understand whether it is necessary to apply further Feature Extraction techniques in order to make the dataset "lighter".

In [None]:
result = calculate_feature_variances(df_ch)
result

In this case the variance of every feature is approximately equal to 1, as expected after standardisation, which confirms that the features have been homogenised in terms of scale. In addition, the fact that the variance is uniform indicates that all features contribute similarly to the overall variance of the dataset and therefore no features need to be eliminated on the basis of variance.

#### 2.2.E Verification of operations
Next, we check whether the standardisation yielded the desired results.

In [None]:
df_ch.head()

In [None]:
plot_numeric_distributions(df_ch, df_ch[numeric_cols_ch])

In this case, there were no missing values to interpolate, and we can say that the standardisation was successful. We can see by eye that the features shown in the graphs above do not follow a Gaussian distribution. This does not matter much to us, because the models we will use later are robust to departures from normality.

### 2.3 Kin8nm


#### 2.3.A Check for nominal features
In this step, we inspect the data types present in the dataset. If any unexpected types appear, we handle them as discussed below.

In [None]:
print("\n\nTypes data:")
df_kin8nm.dtypes

In this case, no nominal features appear, so we can proceed to check for missing data without removing any features.


#### 2.3.B Check for any missing data
We can now proceed to search for any missing data within the dataset.

In [None]:
df_kin8nm.isnull().sum()

Also in this step, we had no problems as there is no missing data within the dataset. Therefore, we can proceed with the next step, which concerns standardisation.

#### 2.3.C Standardisation of values
In this step, we standardise the values in the dataset so that all features share a common scale.

In [None]:
# Standardise the numeric values within the Dataset
numeric_cols_kin8nm = df_kin8nm.select_dtypes(include=['float64', 'int64']).columns
df_kin8nm[numeric_cols_kin8nm] = StandardScaler().fit_transform(df_kin8nm[numeric_cols_kin8nm])

#### 2.3.D Variance control between features
Having previously carried out the standardisation, the variance of each feature is examined. Also, in this step, we try to understand whether it is necessary to apply further Feature Extraction techniques in order to make the dataset "lighter".

In [None]:
result = calculate_feature_variances(df_kin8nm)
result

In this case the variance of every feature is approximately equal to 1, as expected after standardisation, which confirms that the features have been homogenised in terms of scale. In addition, the fact that the variance is uniform indicates that all features contribute similarly to the overall variance of the dataset and therefore no features need to be eliminated on the basis of variance.

#### 2.3.E Verification of operations
Next, we check whether the standardisation yielded the desired results.

In [None]:
df_kin8nm.head()

In [None]:
plot_numeric_distributions(df_kin8nm, df_kin8nm[numeric_cols_kin8nm])

In this case, there were no missing values to interpolate, and we can say that the standardisation was successful. We can see by eye that the features shown in the graphs above do not follow a Gaussian distribution. This does not matter much to us, because the models we will use later are robust to departures from normality.

### 2.4 Chscase_Census2


#### 2.4.A Check for nominal features
In this step, we inspect the data types present in the dataset. If any unexpected types appear, we handle them as discussed below.

In [None]:
print("\n\nTypes data:")
df_chcc2.dtypes

In this case, no nominal features appear, so we can proceed to check for missing data without removing any features.


#### 2.4.B Check for any missing data
We can now proceed to search for any missing data within the dataset.

In [None]:
df_chcc2.isnull().sum()

Also in this step, we had no problems as there is no missing data within the dataset. Therefore, we can proceed with the next step, which concerns standardisation.

#### 2.4.C Standardisation of values
In this step, we standardise the values in the dataset so that all features share a common scale.

In [None]:
# Standardise the numeric values within the Dataset
numeric_cols_chcc2 = df_chcc2.select_dtypes(include=['float64', 'int64']).columns
df_chcc2[numeric_cols_chcc2] = StandardScaler().fit_transform(df_chcc2[numeric_cols_chcc2])

#### 2.4.D Variance control between features
Having previously carried out the standardisation, the variance of each feature is examined. Also, in this step, we try to understand whether it is necessary to apply further Feature Extraction techniques in order to make the dataset "lighter".

In [None]:
result = calculate_feature_variances(df_chcc2)
result

In this case the variance of every feature is approximately equal to 1, as expected after standardisation, which confirms that the features have been homogenised in terms of scale. In addition, the fact that the variance is uniform indicates that all features contribute similarly to the overall variance of the dataset and therefore no features need to be eliminated on the basis of variance.

#### 2.4.E Verification of operations
Next, we check whether the standardisation yielded the desired results.

In [None]:
df_chcc2.head()

In [None]:
plot_numeric_distributions(df_chcc2, df_chcc2[numeric_cols_chcc2])

In this case, there were no missing values to interpolate, and we can say that the standardisation was successful. We can see by eye that the features shown in the graphs above do not follow a Gaussian distribution. The only variable that seems to follow a Gaussian distribution is the target. This does not matter much to us, because the models we will use later are robust to departures from normality.

## 3. Regression
First of all, let us start by defining some main functions that are fundamental to the third step:

#### SMAPE
We implemented a dedicated function that takes two arrays, one of real values and one of predicted values, and returns the SMAPE according to the formula given in the course slides:
<center>$SMAPE = \frac{1}{N} \sum_{i=1}^N \frac{|y_i - \hat{y}_i|}{|y_i| + |\hat{y}_i|}$</center>
                 

In [None]:
def smape(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true) / (np.abs(y_pred) + np.abs(y_true)))
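A small worked example may help to read the formula. The sketch below repeats the `smape` function so that it runs on its own and applies it to two made-up observations; the numbers are illustrative only.

```python
import numpy as np

def smape(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true) / (np.abs(y_pred) + np.abs(y_true)))

y_true = np.array([2.0, 4.0])
y_pred = np.array([1.0, 4.0])

# Term 1: |1 - 2| / (|1| + |2|) = 1/3
# Term 2: |4 - 4| / (|4| + |4|) = 0
# Mean of the two terms: 1/6
score = smape(y_true, y_pred)
print(score)
```

In this form, without a factor of 2 or 100, the metric lies between 0 and 1. A pair in which both the true and the predicted value are exactly zero produces 0/0, so the result would be `nan` in that case.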

#### EVALUATION OF MODELS
To keep the code clean, we created two functions that can be called several times to calculate the results obtained by applying the different regression models required.

The first function creates a DataFrame containing the metric values for each model applied during the regression.

In [None]:
def metrics_dataframe(model_names, mae_scores, rmse_scores, mape_scores, smape_scores):
    
    metrics_df = pd.DataFrame({
        'Model': model_names,
        'MAE': mae_scores,
        'RMSE': rmse_scores,
        'MAPE': mape_scores,
        'SMAPE': smape_scores
    })

    return metrics_df
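As a usage sketch, the resulting table can be sorted to rank the models by a chosen metric. The function is repeated here so that the example runs on its own; the model names and scores are placeholder values, not results from this notebook.

```python
import pandas as pd

def metrics_dataframe(model_names, mae_scores, rmse_scores, mape_scores, smape_scores):
    return pd.DataFrame({
        'Model': model_names,
        'MAE': mae_scores,
        'RMSE': rmse_scores,
        'MAPE': mape_scores,
        'SMAPE': smape_scores
    })

# Placeholder scores for three hypothetical models (illustrative only).
demo_df = metrics_dataframe(
    ['Model A', 'Model B', 'Model C'],
    [0.30, 0.20, 0.25],
    [0.40, 0.28, 0.35],
    [0.90, 0.70, 0.80],
    [0.35, 0.25, 0.30],
)

# Lower is better for all four metrics, so sort in ascending order.
ranked = demo_df.sort_values('RMSE').reset_index(drop=True)
print(ranked)
```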

The second function is the core of the metrics calculation: it fits the model, predicts on the test set and computes the four metrics.

In [None]:
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name, common_exponent = -2):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    mae = mean_absolute_error(y_test, y_pred)
    # RMSE as the square root of the MSE; the 'squared' argument is deprecated in scikit-learn 1.4+
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mape = mean_absolute_percentage_error(y_test, y_pred)
    smape_score = smape(y_test, y_pred)

    return y_pred, mae, rmse, mape, smape_score
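To make the three library metrics concrete, the sketch below recomputes MAE, RMSE and MAPE by hand with NumPy on three made-up observations. It follows the standard definitions of these metrics and is not a replacement for the scikit-learn functions used above.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([1.5, 2.0, 3.0])

errors = y_pred - y_true                      # [0.5, 0.0, -1.0]

mae = np.mean(np.abs(errors))                 # (0.5 + 0 + 1) / 3 = 0.5
rmse = np.sqrt(np.mean(errors ** 2))          # sqrt((0.25 + 0 + 1) / 3)
mape = np.mean(np.abs(errors) / np.abs(y_true))  # (0.5/1 + 0 + 1/4) / 3 = 0.25

print(mae, rmse, mape)
```

MAPE divides each error by the true value, so it grows very large when true values are close to zero.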

#### CHART PLOT
Again, to keep the code clean, we found it necessary to create two functions that plot all the graphs for a single dataset.

In [None]:
def plot_scatter_plots(model_names, y_true, y_pred_scores):
    num_models = len(model_names)
    num_rows = 2
    num_cols = 3

    fig = plt.figure(figsize=(15, 8))
    fig.suptitle('\n\nScatter Plots of Model Predictions vs. True Values', fontsize=16)

    for i, (model_name, y_pred) in enumerate(zip(model_names, y_pred_scores)):
        if i == num_models - 1:
            col = 2
        else:
            col = i % num_cols

        row = i // num_cols

        ax = plt.subplot2grid((num_rows, num_cols), (row, col))

        ax.scatter(y_true, y_pred, alpha=0.7, label=model_name)
        ax.plot([min(y_true), max(y_true)], [min(y_true), max(y_true)], color='red', linestyle='--', linewidth=2)
        ax.set_title(model_name)
        ax.set_xlabel('True Values')
        ax.set_ylabel('Predicted Values')
        ax.legend()

    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()

In [None]:
def plot_results(model_names, mae_scores, rmse_scores, mape_scores, smape_scores):
    fig, axs = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('\n\nChart Plots of Evaluation Metrics Results', fontsize=16)

    # MAE
    axs[0, 0].bar(model_names, mae_scores, color='blue')
    axs[0, 0].set_title('MAE')
    axs[0, 0].set_ylabel('Mean Absolute Error')

    # RMSE
    axs[0, 1].bar(model_names, rmse_scores, color='orange')
    axs[0, 1].set_title('RMSE')
    axs[0, 1].set_ylabel('Root Mean Square Error')

    # MAPE
    axs[1, 0].bar(model_names, mape_scores, color='green')
    axs[1, 0].set_title('MAPE')
    axs[1, 0].set_ylabel('Mean Absolute Percentage Error')

    # SMAPE
    axs[1, 1].bar(model_names, smape_scores, color='red')
    axs[1, 1].set_title('SMAPE')
    axs[1, 1].set_ylabel('Symmetric Mean Absolute Percentage Error')

    plt.subplots_adjust(hspace=0.5)
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()

### 3.1 Meta
In this regression phase, we divide the dataset into training and test sets, then train and evaluate the proposed regression models. Finally, we report the results of the regression through the specified metrics and their corresponding graphs.

#### 3.1.A Divide dataset into Training and Test sets
In this first step, we divide the dataset into the training and test sets. To do this, we create lists in which to store the results of the various metrics, separate the features of the DataFrame from the target, and only then perform the split.

In [None]:
features = df_meta.drop(columns=['target'])
target = df_meta['target']

mae_scores = []
rmse_scores = []
mape_scores = []
smape_scores = []
y_pred_scores = []

# Division of the DataFrame into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
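The split can be sketched on a toy frame to show what `test_size=0.2` and `random_state=42` do. The frame below is synthetic and its column names `f1` and `f2` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with ten rows, two features and a target.
toy = pd.DataFrame({
    "f1": np.arange(10.0),
    "f2": np.arange(10.0) * 2,
    "target": np.arange(10.0) * 3,
})

toy_features = toy.drop(columns=["target"])
toy_target = toy["target"]

# 20% of the rows go to the test set; random_state makes the shuffle repeatable.
Xtr, Xte, ytr, yte = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=42
)

print(Xtr.shape, Xte.shape)
```

Features and target are shuffled together, so each row of `Xtr` keeps the same index as its entry in `ytr`.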

#### 3.1.B Initialisation of models
In this phase, the regression models, such as Linear Regression, Support Vector Machine, Decision Trees, Random Forest, and Gradient Boosting, are instantiated and prepared in order to be ready for the next training and evaluation phase on the dataset data.

In [None]:
linear_reg_model = LinearRegression()
svr_model = SVR()
decision_tree_model = DecisionTreeRegressor()
random_forest_model = RandomForestRegressor()
gradient_boosting_model = GradientBoostingRegressor()

The following approach makes it easier to manage and recall models during the training, testing and evaluation process, as model names are consistently associated with model objects.

In [None]:
models = [linear_reg_model, svr_model, decision_tree_model, random_forest_model, gradient_boosting_model]
model_names = ['Linear Regression', 'SVR', 'Decision Tree', 'Random Forest', 'Gradient Boosting']

#### 3.1.C Evaluation of models
In this step, several regression models are evaluated using the META dataset. For each model, evaluation metrics such as MAE, RMSE, MAPE and SMAPE are calculated and recorded, and model predictions are saved for further analysis.

In [None]:
# Evaluation of models for the META Dataset
for model, model_name in zip(models, model_names):
    y_pred, mae, rmse, mape, smape_score = evaluate_model(model, X_train, X_test, y_train, y_test, model_name)
    mae_scores.append(mae)
    rmse_scores.append(rmse)
    mape_scores.append(mape)
    smape_scores.append(smape_score)
    y_pred_scores.append(y_pred)
    
models_df = metrics_dataframe(model_names, mae_scores, rmse_scores, mape_scores, smape_scores)

models_df.head()

#### 3.1.D Graphical representation of results
In this step, we are generating graphs to visualise the performance of the regression models. With the first function, we are comparing predicted values with actual values, while with the second function, we are using barplots to obtain a quick visualisation and insight into the metrics for each model.

In [None]:
plot_scatter_plots(model_names, y_test, y_pred_scores)
plot_results(model_names, mae_scores, rmse_scores, mape_scores, smape_scores)

### 3.2 California_Housing
In this regression phase, we divide the dataset into training and test sets, then train and evaluate the proposed regression models. Finally, we report the results of the regression through the specified metrics and their corresponding graphs.

#### 3.2.A Divide dataset into Training and Test sets
In this first step, we divide the dataset into the training and test sets. To do this, we create lists in which to store the results of the various metrics, separate the features of the DataFrame from the target, and only then perform the split.

In [None]:
features = df_ch.drop(columns=['target'])
target = np.log1p(df_ch['target'] - df_ch['target'].min() + 1)

mae_scores = []
rmse_scores = []
mape_scores = []
smape_scores = []
y_pred_scores = []

# Division of the DataFrame into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

As suggested by the description of the dataset on OpenML, we decided to apply a logarithmic transformation to the target column. To guard against negative or zero values, we subtract the minimum value of the target variable and add one before applying the transformation. Since `np.log1p(x)` computes log(1 + x), the transformed target is effectively log(target - min + 2).
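The transformation and its inverse can be sketched as follows. This is a minimal example on made-up values, including a negative one; the inverse step is not part of the notebook and is shown only as one way to map predictions back to the original scale.

```python
import numpy as np

# Hypothetical target values, including a negative one.
toy_target = np.array([-1.5, 0.0, 2.0])
shift = toy_target.min()

# Same expression as in the cell above: log1p adds a further 1 to its
# argument, so this equals log(target - min + 2).
transformed = np.log1p(toy_target - shift + 1)

# Inverse: undo log1p with expm1, then undo the "+ 1" and the shift.
recovered = np.expm1(transformed) - 1 + shift

print(transformed)
print(recovered)
```

The smallest target value maps to log(2), so every transformed value is strictly positive.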

#### 3.2.B Initialisation of models
In this phase, the regression models, such as Linear Regression, Support Vector Machine, Decision Trees, Random Forest, and Gradient Boosting, are instantiated and prepared in order to be ready for the next training and evaluation phase on the dataset data.

In [None]:
linear_reg_model = LinearRegression()
svr_model = SVR()
decision_tree_model = DecisionTreeRegressor()
random_forest_model = RandomForestRegressor()
gradient_boosting_model = GradientBoostingRegressor()

The following approach makes it easier to manage and recall models during the training, testing and evaluation process, as model names are consistently associated with model objects.

In [None]:
models = [linear_reg_model, svr_model, decision_tree_model, random_forest_model, gradient_boosting_model]
model_names = ['Linear Regression', 'SVR', 'Decision Tree', 'Random Forest', 'Gradient Boosting']

#### 3.2.C Evaluation of models
In this step, several regression models are evaluated using the CALIFORNIA_HOUSING dataset. For each model, evaluation metrics such as MAE, RMSE, MAPE and SMAPE are calculated and recorded, and model predictions are saved for further analysis.

In [None]:
# Evaluation of models for the CALIFORNIA_HOUSING Dataset
for model, model_name in zip(models, model_names):
    y_pred, mae, rmse, mape, smape_score = evaluate_model(model, X_train, X_test, y_train, y_test, model_name)
    mae_scores.append(mae)
    rmse_scores.append(rmse)
    mape_scores.append(mape)
    smape_scores.append(smape_score)
    y_pred_scores.append(y_pred)
    
models_df = metrics_dataframe(model_names, mae_scores, rmse_scores, mape_scores, smape_scores)

models_df.head()

#### 3.2.D Graphical representation of results
In this step, we are generating graphs to visualise the performance of the regression models. With the first function, we are comparing predicted values with actual values, while with the second function, we are using barplots to obtain a quick visualisation and insight into the metrics for each model.

In [None]:
plot_scatter_plots(model_names, y_test, y_pred_scores)
plot_results(model_names, mae_scores, rmse_scores, mape_scores, smape_scores) 

### 3.3 Kin8nm
In this regression phase, we divide the dataset into training and test sets, then train and evaluate the proposed regression models. Finally, we report the results of the regression through the specified metrics and their corresponding graphs.

#### 3.3.A Divide dataset into Training and Test sets
In this first step, we divide the dataset into the training and test sets. To do this, we create lists in which to store the results of the various metrics, separate the features of the DataFrame from the target, and only then perform the split.

In [None]:
features = df_kin8nm.drop(columns=['target'])
target = df_kin8nm['target']

mae_scores = []
rmse_scores = []
mape_scores = []
smape_scores = []
y_pred_scores = []

# Division of the DataFrame into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

#### 3.3.B Initialisation of models
In this phase, the regression models, such as Linear Regression, Support Vector Machine, Decision Trees, Random Forest, and Gradient Boosting, are instantiated and prepared in order to be ready for the next training and evaluation phase on the dataset data.

In [None]:
linear_reg_model = LinearRegression()
svr_model = SVR()
decision_tree_model = DecisionTreeRegressor()
random_forest_model = RandomForestRegressor()
gradient_boosting_model = GradientBoostingRegressor()

The following approach makes it easier to manage and recall models during the training, testing and evaluation process, as model names are consistently associated with model objects.

In [None]:
models = [linear_reg_model, svr_model, decision_tree_model, random_forest_model, gradient_boosting_model]
model_names = ['Linear Regression', 'SVR', 'Decision Tree', 'Random Forest', 'Gradient Boosting']
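
The two parallel lists above rely on both being kept in the same order. As an alternative sketch, and not the approach used in this notebook, a single dictionary can tie each name directly to its estimator; `demo_models` is a hypothetical name and only two models are included for brevity.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# One mapping from display name to estimator (insertion order is preserved)
demo_models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}

# Iterating yields each name together with its model
for demo_name, demo_estimator in demo_models.items():
    print(f"{demo_name}: {type(demo_estimator).__name__}")
```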

#### 3.3.C Evaluation of models
In this step, the regression models are evaluated on the KIN8NM dataset. For each model, the MAE, RMSE, MAPE and SMAPE metrics are calculated and recorded, and the model's predictions are saved for further analysis.

In [None]:
# Evaluation of models for the KIN8NM Dataset
for model, model_name in zip(models, model_names):
    y_pred, mae, rmse, mape, smape_score = evaluate_model(model, X_train, X_test, y_train, y_test, model_name)
    mae_scores.append(mae)
    rmse_scores.append(rmse)
    mape_scores.append(mape)
    smape_scores.append(smape_score)
    y_pred_scores.append(y_pred)
    
models_df = metrics_dataframe(model_names, mae_scores, rmse_scores, mape_scores, smape_scores)

models_df.head()
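
For reference, the four metrics collected in the loop can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the notebook's `evaluate_model` helper, which is defined earlier and may differ: the function name `demo_regression_metrics` is hypothetical, MAPE and SMAPE are expressed as percentages, and SMAPE uses one common definition (other variants exist). Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage.

```python
import numpy as np

def demo_regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, MAPE %, SMAPE %) for two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))                # mean absolute error
    rmse = np.sqrt(np.mean(errors ** 2))         # root mean squared error
    mape = np.mean(np.abs(errors / y_true)) * 100  # undefined if y_true contains 0
    smape = np.mean(2 * np.abs(errors) / (np.abs(y_true) + np.abs(y_pred))) * 100
    return mae, rmse, mape, smape

# Toy values chosen so the results are easy to verify by hand
demo_mae, demo_rmse, demo_mape, demo_smape = demo_regression_metrics(
    [100, 200, 400], [110, 180, 400]
)
print(demo_mae, demo_rmse, demo_mape, demo_smape)
```

MAE and RMSE are in the units of the target, while MAPE and SMAPE are relative errors, which is why the four values are not directly comparable with each other.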

#### 3.3.D Graphical representation of results
In this step, we generate graphs to visualise the performance of the regression models. The first function plots predicted values against actual values, while the second uses bar plots to give a quick overview of the metrics for each model.

In [None]:
plot_scatter_plots(model_names, y_test, y_pred_scores)
plot_results(model_names, mae_scores, rmse_scores, mape_scores, smape_scores) 

### 3.4 Chscase_Census2
In this regression phase, we divide the dataset into training and test sets, then train and evaluate the proposed regression models. Finally, we report the results through the specified metrics and their corresponding graphs.

#### 3.4.A Divide dataset into Training and Test sets
In this first step, we split the dataset into training and test sets. To do this, we first separate the features of the DataFrame from the target column, create empty lists that will hold the results of the various metrics, and only then perform the split.

In [None]:
features = df_chcc2.drop(columns=['target'])
target = df_chcc2['target']

mae_scores = []
rmse_scores = []
mape_scores = []
smape_scores = []
y_pred_scores = []

# Division of the DataFrame into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

#### 3.4.B Initialisation of models
In this phase, the regression models (Linear Regression, Support Vector Regression, Decision Tree, Random Forest, and Gradient Boosting) are instantiated so that they are ready for the subsequent training and evaluation phase on the dataset.

In [None]:
linear_reg_model = LinearRegression()
svr_model = SVR()
decision_tree_model = DecisionTreeRegressor()
random_forest_model = RandomForestRegressor()
gradient_boosting_model = GradientBoostingRegressor()

The following approach makes it easier to manage and recall models during the training, testing and evaluation process, as model names are consistently associated with model objects.

In [None]:
models = [linear_reg_model, svr_model, decision_tree_model, random_forest_model, gradient_boosting_model]
model_names = ['Linear Regression', 'SVR', 'Decision Tree', 'Random Forest', 'Gradient Boosting']

#### 3.4.C Evaluation of models
In this step, the regression models are evaluated on the CHSCASE_CENSUS2 dataset. For each model, the MAE, RMSE, MAPE and SMAPE metrics are calculated and recorded, and the model's predictions are saved for further analysis.

In [None]:
# Evaluation of models for the CHSCASE_CENSUS2 Dataset
for model, model_name in zip(models, model_names):
    y_pred, mae, rmse, mape, smape_score = evaluate_model(model, X_train, X_test, y_train, y_test, model_name)
    mae_scores.append(mae)
    rmse_scores.append(rmse)
    mape_scores.append(mape)
    smape_scores.append(smape_score)
    y_pred_scores.append(y_pred)
    
models_df = metrics_dataframe(model_names, mae_scores, rmse_scores, mape_scores, smape_scores)

models_df.head()
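
Once the metrics are in a DataFrame, the best model per metric can be read off programmatically. The sketch below assumes a layout with one row per model and one column per metric, which may not match the frame returned by `metrics_dataframe`; the numbers are placeholders chosen for illustration and are not results from this notebook.

```python
import pandas as pd

# Placeholder values only, indexed by hypothetical model names
demo_metrics_df = pd.DataFrame(
    {"MAE": [3.0, 1.0, 2.0], "RMSE": [4.0, 2.5, 2.0]},
    index=["Model A", "Model B", "Model C"],
)

# Lower is better for every error metric, so take the row label of each column minimum
demo_best = demo_metrics_df.idxmin()
print(demo_best)

# Ranking of the models by a single metric
demo_ranked = demo_metrics_df.sort_values("MAE")
print(demo_ranked)
```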

#### 3.4.D Graphical representation of results
In this step, we generate graphs to visualise the performance of the regression models. The first function plots predicted values against actual values, while the second uses bar plots to give a quick overview of the metrics for each model.

In [None]:
plot_scatter_plots(model_names, y_test, y_pred_scores)
plot_results(model_names, mae_scores, rmse_scores, mape_scores, smape_scores) 