In [1]:
import os
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# Load data
df = pd.read_csv('abalone.data.csv')

In [3]:
# One-hot encode the categorical 'Sex' column
df = pd.get_dummies(df, columns=['Sex'])

In [4]:
# Standardize continuous features
continuous_columns = ['Length', 'Diameter', 'Height', 'Whole_weight', 'Shucked_weight', 'Viscera_weight', 'Shell_weight']
scaler = StandardScaler()
df[continuous_columns] = scaler.fit_transform(df[continuous_columns])

In [5]:
# Separate features and target
X = df.drop('Rings', axis=1).values
y = df['Rings'].values

In [6]:
# Split data into training and testing sets (using 80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=49)
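
Note on scaling order: In [4] fits StandardScaler on every row before this split, so the test rows contribute to the mean and standard deviation used to scale the training features. A minimal, library-free sketch of the alternative (fit the statistics on training rows only, then reuse them for test rows) is shown below. The helper names `fit_standardizer` and `apply_standardizer` and the toy values are illustrative assumptions, not part of the notebook.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Return (mean, std) computed from training values only."""
    mu = fmean(column)
    sigma = pstdev(column)  # population std (ddof=0), the convention StandardScaler uses
    return mu, (sigma if sigma > 0 else 1.0)  # guard against constant columns

def apply_standardizer(column, mu, sigma):
    """Scale any column with statistics that were fitted elsewhere."""
    return [(v - mu) / sigma for v in column]

# Toy single-feature example (hypothetical values)
train_col = [1.0, 2.0, 3.0]
test_col = [4.0]

mu, sigma = fit_standardizer(train_col)                  # statistics from train only
train_scaled = apply_standardizer(train_col, mu, sigma)
test_scaled = apply_standardizer(test_col, mu, sigma)    # reuse the train statistics
```

With scikit-learn the same pattern would be `scaler.fit_transform(...)` on the training features followed by `scaler.transform(...)` on the test features, applied after the split rather than before it.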

In [7]:
# Convert datasets into PyTorch tensors
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
y_train = y_train.astype(np.float32).reshape(-1, 1)
y_test = y_test.astype(np.float32).reshape(-1, 1)

X_train_tensor = torch.tensor(X_train)
X_test_tensor = torch.tensor(X_test)
y_train_tensor = torch.tensor(y_train)
y_test_tensor = torch.tensor(y_test)

In [8]:
# Define AbaloneModel: a feed-forward regressor with two hidden layers (64 and 32 units)

class AbaloneModel(nn.Module):
    def __init__(self):
        super(AbaloneModel, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

In [9]:
model = AbaloneModel()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

In [10]:
def train_model(model, criterion, optimizer, X_train, y_train, epochs=100):
    # Full-batch training: one gradient step per epoch on the whole training set
    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        outputs = model(X_train)  # use the function arguments, not the global tensors
        loss = criterion(outputs, y_train)
        loss.backward()
        optimizer.step()

train_model(model, criterion, optimizer, X_train_tensor, y_train_tensor)

In [11]:
def evaluate(model, X_test, y_test):
    model.eval()
    with torch.no_grad():
        outputs = model(X_test)
        mse = nn.MSELoss()
        loss = mse(outputs, y_test)
    return loss.item()

mse = evaluate(model, X_test_tensor, y_test_tensor)
print("Mean Squared Error on Test Set:", mse)

Mean Squared Error on Test Set: 5.463046073913574


In [12]:
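# Define the Adagrad optimizer. Note: `model` still holds the weights trained with SGD above,
# so the Adagrad run continues from them instead of starting from a fresh initialization.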
optimizer_adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)

# Train the model with Adagrad optimizer
train_model(model, criterion, optimizer_adagrad, X_train_tensor, y_train_tensor)

# Evaluate the model
mse_adagrad = evaluate(model, X_test_tensor, y_test_tensor)
print("Mean Squared Error with Adagrad optimizer:", mse_adagrad)

Mean Squared Error with Adagrad optimizer: 5.044127464294434


Observations:

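MSE Comparison: The test MSE after training with Adagrad (5.044) is lower than the test MSE after training with SGD (5.463). Note that the Adagrad run reuses the model object already trained with SGD for 100 epochs, so it continues from those weights. The lower MSE therefore reflects 100 additional epochs of training as well as the change of optimizer, and the two numbers are not a like-for-like comparison.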

Adaptive Learning Rates: Adagrad scales the learning rate of each parameter by the inverse square root of the sum of that parameter's past squared gradients. Parameters that have received large or frequent gradients take smaller steps, while rarely updated parameters take relatively larger ones. In contrast, plain SGD applies the same learning rate to every parameter throughout training, which may not suit all of them.
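
The per-parameter scaling can be sketched in a few lines of plain Python. This is an illustrative re-implementation of the standard Adagrad rule, not the code inside `torch.optim.Adagrad`; the function name `adagrad_step`, the `eps` value, and the toy gradients are assumptions made for the example.

```python
import math

def adagrad_step(params, grads, accum, lr=0.01, eps=1e-10):
    """Apply one Adagrad update in place and return (params, accum)."""
    for i, g in enumerate(grads):
        accum[i] += g * g                                   # running sum of squared gradients
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)   # per-parameter scaled step
    return params, accum

# Two parameters whose gradients differ by a factor of 100 (hypothetical values)
params = [1.0, 1.0]
accum = [0.0, 0.0]
adagrad_step(params, [10.0, 0.1], accum, lr=0.01)

# Distance each parameter moved on this first step
first_moves = [1.0 - p for p in params]
```

On the first step both parameters move by roughly `lr`, even though their gradients differ by two orders of magnitude. Plain SGD with the same learning rate would move them by 0.1 and 0.001 respectively.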

Convergence Speed: Adagrad may converge faster than SGD on some problems, especially when gradient scales differ widely between parameters. However, because the accumulated squared gradients only grow, Adagrad's effective learning rate shrinks monotonically, and training can slow down or stall before reaching a good solution. It is still worth monitoring the loss curve for both stalling and overfitting.
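
How Adagrad's step size evolves over time can be illustrated with a constant gradient. The sketch below assumes the standard Adagrad rule with no learning-rate decay or weight decay; `adagrad_effective_steps` is a hypothetical helper written for this illustration.

```python
import math

def adagrad_effective_steps(grad, lr=0.01, steps=100, eps=1e-10):
    """Return the size of each Adagrad step when the gradient stays constant."""
    accum = 0.0
    sizes = []
    for _ in range(steps):
        accum += grad * grad                                  # accumulator only grows
        sizes.append(lr * abs(grad) / (math.sqrt(accum) + eps))
    return sizes

sizes = adagrad_effective_steps(grad=2.0, lr=0.01, steps=100)
# sizes[t] is approximately lr / sqrt(t + 1): the step decays even though
# the gradient itself never changes.
```

With a constant gradient the step after 100 updates is one tenth of the first step, which is the monotonic decay that can make Adagrad stall on long runs.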

Performance Trade-offs: Adagrad stores one accumulator of squared gradients per parameter and performs an extra element-wise square root and division at each update. This adds memory and a small amount of computation relative to SGD. For a model of this size the overhead is minor, but it grows with the number of parameters.

Hyperparameter Sensitivity: Both Adagrad and SGD have hyperparameters that need to be carefully tuned for optimal performance. The choice of learning rate, batch size, and other hyperparameters can significantly impact the training dynamics and final performance of the model. Experimentation and cross-validation are crucial for identifying the best hyperparameters for a given task.
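
Cross-validation needs train/validation index splits that keep the test set out of hyperparameter selection (the grid search later in this notebook scores each configuration on the test set directly). A minimal sketch of k-fold index generation follows; `k_fold_indices` and its defaults are hypothetical and written only for illustration.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs that partition range(n_samples) into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

# Toy usage with 10 samples (hypothetical size)
splits = list(k_fold_indices(n_samples=10, k=5))
```

Each configuration would be trained on `train_idx` and scored on `val_idx`, averaging the validation MSE over the folds; the held-out test set is then used once, for the selected configuration only.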

In summary, Adagrad reached a lower test MSE than SGD in this experiment, and its adaptive per-parameter learning rate is a plausible reason, although the setup here does not isolate the optimizer's effect. When choosing between optimization algorithms for training neural networks, the trade-offs in convergence speed, memory and compute cost, and hyperparameter sensitivity should also be considered.

Next comes hyper-parameter analysis and tuning: finding the combination of hyper-parameters that gives the lowest MSE.

In [13]:
# Define a class AbaloneModel_hyper for hyperparameter analysis and tuning

class AbaloneModel_hyper(nn.Module):
    def __init__(self, num_hidden_layers=1, num_hidden_nodes=64):
        super(AbaloneModel_hyper, self).__init__()
        self.input_layer = nn.Linear(10, num_hidden_nodes)
        self.hidden_layers = nn.ModuleList([nn.Linear(num_hidden_nodes, num_hidden_nodes) for _ in range(num_hidden_layers)])
        self.output_layer = nn.Linear(num_hidden_nodes, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.input_layer(x))
        for layer in self.hidden_layers:
            x = self.relu(layer(x))
        x = self.output_layer(x)
        return x

In [14]:
# Define train_model_hyper for hyperparameter analysis and tuning

def train_model_hyper(model, criterion, optimizer, X_train, y_train, batch_size, epochs=100):
    data_size = X_train.size(0)
    num_batches = (data_size + batch_size - 1) // batch_size

    for epoch in range(epochs):
        model.train()  # Set the model to training mode
        epoch_loss = 0.0
        for i in range(num_batches):
            start_idx = i * batch_size
            end_idx = min((i + 1) * batch_size, data_size)
            batch_X = X_train[start_idx:end_idx]
            batch_y = y_train[start_idx:end_idx]

            optimizer.zero_grad()  # Clear gradients
            outputs = model(batch_X)  # Forward pass
            loss = criterion(outputs, batch_y)  # Calculate loss
            loss.backward()  # Backward pass
            optimizer.step()  # Update weights
            epoch_loss += loss.item()

        # Print average loss for the epoch
        avg_loss = epoch_loss / num_batches
        print(f'Epoch [{epoch + 1}/{epochs}], Avg. Loss: {avg_loss:.4f}')

# Example usage:
model = AbaloneModel_hyper()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
train_model_hyper(model, criterion, optimizer, X_train_tensor, y_train_tensor, batch_size=32, epochs=100)

Epoch [1/100], Avg. Loss: 12.5229
Epoch [2/100], Avg. Loss: 5.6597
Epoch [3/100], Avg. Loss: 5.2014
Epoch [4/100], Avg. Loss: 5.0581
Epoch [5/100], Avg. Loss: 4.9504
Epoch [6/100], Avg. Loss: 4.8682
Epoch [7/100], Avg. Loss: 4.7981
Epoch [8/100], Avg. Loss: 4.7404
Epoch [9/100], Avg. Loss: 4.6922
Epoch [10/100], Avg. Loss: 4.6485
Epoch [11/100], Avg. Loss: 4.6107
Epoch [12/100], Avg. Loss: 4.5701
Epoch [13/100], Avg. Loss: 4.5415
Epoch [14/100], Avg. Loss: 4.5118
Epoch [15/100], Avg. Loss: 4.4844
Epoch [16/100], Avg. Loss: 4.4604
Epoch [17/100], Avg. Loss: 4.4382
Epoch [18/100], Avg. Loss: 4.4140
Epoch [19/100], Avg. Loss: 4.3968
Epoch [20/100], Avg. Loss: 4.3709
Epoch [21/100], Avg. Loss: 4.3549
Epoch [22/100], Avg. Loss: 4.3387
Epoch [23/100], Avg. Loss: 4.3184
Epoch [24/100], Avg. Loss: 4.2996
Epoch [25/100], Avg. Loss: 4.2810
Epoch [26/100], Avg. Loss: 4.2721
Epoch [27/100], Avg. Loss: 4.2555
Epoch [28/100], Avg. Loss: 4.2414
Epoch [29/100], Avg. Loss: 4.2285
Epoch [30/100], Avg. L

In [15]:
# Ranges of the values of hyperparameters we want to analyse
learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [16, 32, 64]
hidden_nodes = [32, 64, 128]
hidden_layers = [1, 2, 3, 4]

results = []

for lr in learning_rates:
    for batch_size in batch_sizes:
        for nodes in hidden_nodes:
            for layers in hidden_layers:
                # Define model with current hyperparameters
                model = AbaloneModel_hyper(num_hidden_layers=layers, num_hidden_nodes=nodes)
                criterion = nn.MSELoss()
                optimizer = torch.optim.SGD(model.parameters(), lr=lr)

                # Train the model
                train_model_hyper(model, criterion, optimizer, X_train_tensor, y_train_tensor, batch_size=batch_size)

                # Evaluate the model
                mse = evaluate(model, X_test_tensor, y_test_tensor)

                # Store results
                results.append({'Learning Rate': lr, 'Batch Size': batch_size,
                                'Hidden Nodes': nodes, 'Hidden Layers': layers, 'MSE': mse})

# Convert results to DataFrame for tabulation
results_df = pd.DataFrame(results)

Epoch [1/100], Avg. Loss: 29.0676
Epoch [2/100], Avg. Loss: 5.7776
Epoch [3/100], Avg. Loss: 5.0787
Epoch [4/100], Avg. Loss: 4.8445
Epoch [5/100], Avg. Loss: 4.7333
Epoch [6/100], Avg. Loss: 4.6665
Epoch [7/100], Avg. Loss: 4.6077
Epoch [8/100], Avg. Loss: 4.5710
Epoch [9/100], Avg. Loss: 4.5345
Epoch [10/100], Avg. Loss: 4.5065
Epoch [11/100], Avg. Loss: 4.4810
Epoch [12/100], Avg. Loss: 4.4579
Epoch [13/100], Avg. Loss: 4.4363
Epoch [14/100], Avg. Loss: 4.4171
Epoch [15/100], Avg. Loss: 4.3976
Epoch [16/100], Avg. Loss: 4.3810
Epoch [17/100], Avg. Loss: 4.3642
Epoch [18/100], Avg. Loss: 4.3494
Epoch [19/100], Avg. Loss: 4.3355
Epoch [20/100], Avg. Loss: 4.3217
Epoch [21/100], Avg. Loss: 4.3089
Epoch [22/100], Avg. Loss: 4.2961
Epoch [23/100], Avg. Loss: 4.2845
Epoch [24/100], Avg. Loss: 4.2729
Epoch [25/100], Avg. Loss: 4.2624
Epoch [26/100], Avg. Loss: 4.2519
Epoch [27/100], Avg. Loss: 4.2420
Epoch [28/100], Avg. Loss: 4.2309
Epoch [29/100], Avg. Loss: 4.2226
Epoch [30/100], Avg. L

In [16]:
results_df

Unnamed: 0,Learning Rate,Batch Size,Hidden Nodes,Hidden Layers,MSE
0,0.001,16,32,1,5.035463
1,0.001,16,32,2,5.072645
2,0.001,16,32,3,5.244679
3,0.001,16,32,4,5.373796
4,0.001,16,64,1,4.997192
...,...,...,...,...,...
103,0.100,64,64,4,
104,0.100,64,128,1,
105,0.100,64,128,2,
106,0.100,64,128,3,


In [17]:
# Find the row with the minimum MSE (best combination of hyperparameters)
min_mse_row = results_df.loc[results_df['MSE'].idxmin()]

# Print the row
print("Row with Minimum MSE:")
print(min_mse_row)

Row with Minimum MSE:
Learning Rate      0.001000
Batch Size        64.000000
Hidden Nodes     128.000000
Hidden Layers      1.000000
MSE                4.976275
Name: 32, dtype: float64
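
The row label 32 can be cross-checked against the order of the nested loops in In [15]. `itertools.product` iterates in the same order (first list outermost, last list fastest), so the sketch below reproduces the grid without training anything; the variable name `grid` is an assumption introduced for this check.

```python
from itertools import product

learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [16, 32, 64]
hidden_nodes = [32, 64, 128]
hidden_layers = [1, 2, 3, 4]

# Same iteration order as the four nested for-loops
grid = list(product(learning_rates, batch_sizes, hidden_nodes, hidden_layers))

print(len(grid))   # 108
print(grid[32])    # (0.001, 64, 128, 1)
```

Index 32 corresponds to learning rate 0.001, batch size 64, 128 hidden nodes, and 1 hidden layer, which agrees with the row printed above.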


Increase the number of hidden layers to 10 and use Sigmoid as the activation function in each hidden layer.

In [25]:
class AbaloneModel_sig(nn.Module):
    def __init__(self, num_hidden_layers=10, num_hidden_nodes=64):
        super(AbaloneModel_sig, self).__init__()
        self.input_layer = nn.Linear(10, num_hidden_nodes)
        self.hidden_layers = nn.ModuleList([nn.Linear(num_hidden_nodes, num_hidden_nodes) for _ in range(num_hidden_layers)])
        self.output_layer = nn.Linear(num_hidden_nodes, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.sigmoid(self.input_layer(x))
        for layer in self.hidden_layers:
            x = self.sigmoid(layer(x))
        x = self.output_layer(x)
        return x

# Instantiate the model with 10 hidden layers
model_10_layers = AbaloneModel_sig(num_hidden_layers=10)

# Define the loss function and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model_10_layers.parameters(), lr=0.1)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    model_10_layers.train()  # Set the model to training mode
    optimizer.zero_grad()  # Clear gradients
    outputs = model_10_layers(X_train_tensor)  # Forward pass
    loss = criterion(outputs, y_train_tensor)  # Calculate loss
    loss.backward()  # Backward pass
    optimizer.step()  # Update weights
    print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')

# Evaluate the model on test data
model_10_layers.eval()  # Set the model to evaluation mode
with torch.no_grad():
    outputs_test = model_10_layers(X_test_tensor)
    mse_test = criterion(outputs_test, y_test_tensor)
    print(f'MSE on test data: {mse_test.item():.4f}')

Epoch [1/100], Loss: 101.7770
Epoch [2/100], Loss: 681.5310
Epoch [3/100], Loss: 184.7860
Epoch [4/100], Loss: 121.9349
Epoch [5/100], Loss: 81.7104
Epoch [6/100], Loss: 55.9670
Epoch [7/100], Loss: 39.4913
Epoch [8/100], Loss: 28.9469
Epoch [9/100], Loss: 22.1985
Epoch [10/100], Loss: 17.8796
Epoch [11/100], Loss: 15.1155
Epoch [12/100], Loss: 13.3465
Epoch [13/100], Loss: 12.2143
Epoch [14/100], Loss: 11.4897
Epoch [15/100], Loss: 11.0260
Epoch [16/100], Loss: 10.7292
Epoch [17/100], Loss: 10.5393
Epoch [18/100], Loss: 10.4177
Epoch [19/100], Loss: 10.3399
Epoch [20/100], Loss: 10.2901
Epoch [21/100], Loss: 10.2583
Epoch [22/100], Loss: 10.2379
Epoch [23/100], Loss: 10.2248
Epoch [24/100], Loss: 10.2165
Epoch [25/100], Loss: 10.2111
Epoch [26/100], Loss: 10.2077
Epoch [27/100], Loss: 10.2055
Epoch [28/100], Loss: 10.2041
Epoch [29/100], Loss: 10.2032
Epoch [30/100], Loss: 10.2026
Epoch [31/100], Loss: 10.2023
Epoch [32/100], Loss: 10.2020
Epoch [33/100], Loss: 10.2019
Epoch [34/100],

The output shows the loss jumping from about 101.8 to 681.5 at the second epoch and then decreasing steadily, which indicates that the model is learning at first. After roughly 30 epochs, however, the loss flattens out at about 10.20 with no further significant improvement.

Observations:

Stable Loss: The loss decreases initially but stabilizes around 10.2016 after several epochs. This stabilization suggests that the model might have reached a plateau in learning and may not further improve beyond this point.
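
One way to interpret a plateau like this is to compare it with the MSE of a constant predictor that always outputs the mean of the training targets. If the plateau sits close to that baseline, the network is probably producing a nearly constant output. The sketch below uses a hypothetical helper, `constant_predictor_mse`, and toy target values; it is not part of the original notebook.

```python
from statistics import fmean

def constant_predictor_mse(train_targets, eval_targets):
    """MSE obtained by always predicting the mean of the training targets."""
    mean_pred = fmean(train_targets)
    return fmean([(t - mean_pred) ** 2 for t in eval_targets])

# Toy targets (hypothetical values, not taken from the Abalone data)
toy_targets = [8.0, 10.0, 12.0]
baseline = constant_predictor_mse(toy_targets, toy_targets)
```

In the notebook the same quantity could be computed for the training set as `((y_train - y_train.mean()) ** 2).mean()` and compared with the plateau value.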

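Vanishing Gradient: The early epochs show no sign of stalled learning, since the loss drops quickly from its initial spike. The plateau that follows is harder to interpret. Vanishing gradients typically show up as very slow or stagnant learning, and a loss that stops improving at roughly twice the MSE reached by the ReLU models is consistent with that. With 11 stacked sigmoid activations, gradients reaching the early layers can become very small, so vanishing gradients cannot be ruled out from the loss curve alone. Inspecting per-layer gradient norms would settle the question.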
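
How much gradient signal can survive this depth can be estimated roughly from the sigmoid derivative, which never exceeds 0.25. The sketch below multiplies that maximum across the 11 sigmoid activations in AbaloneModel_sig (1 input layer plus 10 hidden layers). It deliberately ignores the weight matrices, which also scale the gradient, so it is an illustration of the activation factor only and not a bound on the full gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum, 0.25, occurs at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

n_sigmoid_layers = 11  # 1 input layer + 10 hidden layers in AbaloneModel_sig

# Largest possible product of activation derivatives along one path
max_factor = sigmoid_grad(0.0) ** n_sigmoid_layers
print(f"{max_factor:.3e}")
```

Even in the best case the activation derivatives alone shrink the signal reaching the first layer by a factor of about 4 million, and saturated units shrink it further.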

Model Performance Comparison: This architecture (10 hidden layers with sigmoid activation, trained with SGD) can be compared with the earlier architecture (2 hidden layers with ReLU activation, trained with SGD) through the final MSE on the test data. Here the test MSE is 11.1682, which is higher than the earlier architecture's test MSE of approximately 5.46. In terms of minimizing the MSE, the earlier architecture is therefore better.

Model Complexity: Increasing the number of hidden layers and using sigmoid activation makes the model more complex, but complexity does not always translate into better performance, as observed here. Because the training loss (about 10.2) and the test MSE (11.1682) are both high, the model appears to be underfitting rather than overfitting: in this configuration it has difficulty learning from the data at all.

In summary, the model with more hidden layers and sigmoid activation trains without diverging but plateaus early, and it does not outperform the previous architecture on the test data. It is essential to balance model complexity against performance, considering factors such as trainability and computational efficiency. Further experimentation and hyperparameter tuning may be necessary to improve the model's performance.