#### B. Normalize the data (5 marks) 

Repeat Part A but use a normalized version of the data. Recall that one way to normalize the data is by subtracting the mean from the individual predictors and dividing by the standard deviation.

How does the mean of the mean squared errors compare to that from Step A?
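
As a minimal sketch of the normalization described above, the snippet below standardizes each column of a small array by hand with NumPy. The array `X_toy` and its values are illustrative assumptions, not rows from the concrete dataset. The population standard deviation (`ddof=0`, NumPy's default) is used here, which matches what `StandardScaler` computes.

```python
import numpy as np

# Toy predictor matrix (made-up values): 4 samples, 2 features
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])

# Per-column mean and population standard deviation (ddof=0)
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0)

# Subtract the mean and divide by the standard deviation, column by column
X_scaled = (X_toy - col_mean) / col_std

print("Column means after scaling:", X_scaled.mean(axis=0))
print("Column stds after scaling: ", X_scaled.std(axis=0))
```

After scaling, each column should have a mean close to 0 and a standard deviation close to 1, regardless of its original units.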

In [1]:
# Import necessary libraries
import numpy as np  # For numerical operations
import pandas as pd  # For data manipulation

from sklearn.metrics import mean_squared_error  # To calculate the Mean Squared Error
from sklearn.model_selection import train_test_split  # To split the dataset into training and test sets
from sklearn.preprocessing import StandardScaler  # To normalize the data
from tensorflow.keras.layers import Dense  # To define the layers of the neural network
from tensorflow.keras.models import Sequential  # To build the neural network
from tensorflow.keras.optimizers import Adam  # Optimizer for training the model

In [2]:
# Load the dataset
url = "concrete_data.csv"  # Path to the dataset
data = pd.read_csv(url)  # Load the dataset into a pandas DataFrame

In [3]:
# Split data into predictors (X) and target variable (y)
X = data.drop("Strength", axis=1)  # Features/predictors (all columns except "Strength")
y = data["Strength"]  # Target variable ("Strength")

In [4]:
# Part A: Build Baseline Model
def baseline_model(X, y):
    """
    Builds and evaluates a baseline regression model using Keras.
    The model:
    - Has one hidden layer with 10 nodes and ReLU activation.
    - Uses the Adam optimizer and mean squared error loss function.
    The process is repeated 50 times, and the mean and standard deviation of the MSEs are computed.

    Parameters:
    X: Features (predictors)
    y: Target variable (concrete strength)

    Returns:
    Mean and standard deviation of the MSEs from 50 iterations.
    """
    mse_list = []  # List to store MSEs from each iteration

    # Repeat the process 50 times
    for _ in range(50):
        # Split the dataset into training (70%) and testing (30%) sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=np.random.randint(0, 100))
        
        # Build the neural network model
        model = Sequential([
            Dense(10, activation='relu', input_shape=(X_train.shape[1],)),  # Hidden layer with 10 nodes
            Dense(1)  # Output layer with a single node (for regression)
        ])
        
        # Compile the model using Adam optimizer and mean squared error loss
        model.compile(optimizer=Adam(), loss='mean_squared_error')
        
        # Train the model on the training data for 50 epochs
        model.fit(X_train, y_train, epochs=50, verbose=0)
        
        # Evaluate the model on the test data
        y_pred = model.predict(X_test, verbose=0)  # Predict on the test set
        mse = mean_squared_error(y_test, y_pred)  # Calculate Mean Squared Error
        mse_list.append(mse)  # Append the MSE to the list
    
    # Return the mean and standard deviation of the MSEs
    return np.mean(mse_list), np.std(mse_list)

# Part B: Normalize the data
def normalized_model(X, y):
    """
    Builds and evaluates a regression model using normalized data.
    The normalization process:
    - Subtracts the mean and divides by the standard deviation for each feature.
    The rest of the process is the same as the baseline model, with 50 iterations.

    Parameters:
    X: Features (predictors)
    y: Target variable (concrete strength)

    Returns:
    Mean and standard deviation of the MSEs from 50 iterations with normalized data.
    """
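    # Normalize the data using StandardScaler.
    # Note: the scaler is fit on the full dataset before the train/test split,
    # so test-set statistics influence the scaling of the training data.
    scaler = StandardScaler()  # Initialize the scaler
    X_normalized = scaler.fit_transform(X)  # Fit the scaler to X and transform it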
    
    # Call the baseline model function with the normalized data
    return baseline_model(X_normalized, y)
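
`normalized_model` scales the full dataset once, before any split is made. A common alternative, which is not what this notebook does, is to compute the scaling statistics from the training split only and reuse them on the test split. The sketch below illustrates that idea with NumPy alone; the helper name `standardize_train_test` and the random demo data are hypothetical and exist only for this example.

```python
import numpy as np

def standardize_train_test(X_train, X_test):
    """Scale both splits using the mean and std of the training split only.

    Hypothetical helper for illustration; not part of the notebook above.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)

    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # Guard against division by zero for constant columns

    return (X_train - mean) / std, (X_test - mean) / std

# Demo on made-up data: 10 samples, 3 features, first 7 rows used for training
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=5.0, size=(10, 3))
X_tr, X_te = standardize_train_test(X_demo[:7], X_demo[7:])

print("Train column means:", X_tr.mean(axis=0))  # close to 0 by construction
print("Test column means: ", X_te.mean(axis=0))  # generally not exactly 0
```

With this approach the test split is scaled with statistics it did not contribute to, so its column means will generally differ slightly from 0.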

In [5]:
# Run the normalized model and calculate the mean and standard deviation of MSE
mean_b, std_b = normalized_model(X, y)

In [6]:
print("Part B - Normalized Model: Mean MSE =", mean_b, "Std MSE =", std_b)

Part B - Normalized Model: Mean MSE = 369.76160306796623 Std MSE = 93.85089910118067


#### How does the mean of the mean squared errors compare to that from Step A?

The comparison between the Mean MSE values from Part A (Baseline Model) and Part B (Normalized Model) reveals the following:

1. Mean MSE Comparison:
    - Part A (Baseline Model): Mean MSE = 283.68
    - Part B (Normalized Model): Mean MSE = 369.76
      
The mean MSE for the normalized model (Part B) is about 30% higher than for the baseline model (Part A). Normalization therefore did not improve average accuracy, as measured by mean squared error, for this dataset and model configuration.

2. Standard Deviation Comparison:
    - Part A (Baseline Model): Std MSE = 236.16
    - Part B (Normalized Model): Std MSE = 93.85
      
The standard deviation of the MSE is about 60% lower for the normalized model than for the baseline model. Normalization therefore reduced the variability in test error across the 50 runs, making the model's performance more consistent from run to run.
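
The size of these differences can be checked with a short calculation on the four reported values. The sketch below also estimates a standard error for each mean by treating the 50 runs as independent samples, which is a simplifying assumption made here rather than something the notebook establishes.

```python
import math

n_runs = 50  # Number of repetitions used in Parts A and B

# Values reported above
mean_a, std_a = 283.68, 236.16  # Part A (baseline)
mean_b, std_b = 369.76, 93.85   # Part B (normalized)

# Relative change from Part A to Part B
mean_change = (mean_b - mean_a) / mean_a
std_change = (std_b - std_a) / std_a

# Approximate standard error of each mean MSE (assumes independent runs)
sem_a = std_a / math.sqrt(n_runs)
sem_b = std_b / math.sqrt(n_runs)

print(f"Relative change in mean MSE: {mean_change:+.1%}")
print(f"Relative change in std MSE:  {std_change:+.1%}")
print(f"Approx. standard error, Part A: {sem_a:.2f}")
print(f"Approx. standard error, Part B: {sem_b:.2f}")
```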

#### Conclusion:
While normalization increased the mean MSE by about 30%, it substantially improved the model's stability, reducing the standard deviation of the MSE by about 60%. This suggests that normalization may help in achieving more reliable predictions, even if accuracy does not always improve.