<style>
    .info-card {
        max-width: 650px;
        margin: 25px auto;
        padding: 25px 30px;
        border: 1px solid #e0e0e0;
        border-radius: 12px;
        box-shadow: 0 4px 12px rgba(0, 0, 0, 0.05);
        font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", Arial, sans-serif;
        background-color: #fdfdfd;
        color: #333;
    }
    .info-card .title {
        color: #1a237e; /* Dark Indigo */
        font-size: 24px;
        font-weight: 600;
        margin-top: 0;
        margin-bottom: 15px;
        text-align: center;
        border-bottom: 2px solid #e8eaf6; /* Light Indigo */
        padding-bottom: 10px;
    }
    .info-card .details-grid {
        display: grid;
        grid-template-columns: max-content 1fr;
        gap: 12px 20px;
        margin-top: 20px;
        font-size: 16px;
    }
    .info-card .label {
        font-weight: 600;
        color: #555;
        text-align: right;
    }
    .info-card .value {
        font-weight: 400;
        color: #222;
    }
</style>

<div class="info-card">
    <h2 class="title">Unit 4 Exercise</h2>
    <div class="details-grid">
        <div class="label">Name:</div>
        <div class="value">Ethan Jed V. Carbonell</div>
        <div class="label">Date:</div>
        <div class="value">October 17, 2025</div>
        <div class="label">Year & Section:</div>
        <div class="value">BSCS 3A AI</div>
        <div></div>
    </div>
</div>

## Library imports
### Set np.random.seed to 0 for fair comparison

In [318]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data


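# nnfs.init() seeds NumPy's RNG with 0 and sets float32 as the default dtype
nnfs.init()
# Re-seed explicitly so every optimizer run starts from the same weights
np.random.seed(0)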

## Classes
### Hidden Layers

In [319]:
# Hidden Layers
# Dense
class Layer_Dense:
    # Layer initialization
    # randomly initialize weights and set biases to zero
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))


    # Forward pass
    def forward(self, inputs):
        # Remember the input values
        self.inputs = inputs
        # Calculate the output values from inputs, weights and biases
        self.output = np.dot(inputs, self.weights) + self.biases

    # Backward pass/Backpropagation
    def backward(self, dvalues):
        # Gradients on parameters:
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        # Gradient on values
        self.dinputs = np.dot(dvalues, self.weights.T)
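
The three gradient expressions in `backward` can be sanity-checked against finite differences. The sketch below is a standalone illustration, not part of the exercise: the array sizes, the `surrogate_loss` helper, and the random values are assumptions made only for the check. It uses a scalar loss whose gradient with respect to the layer output is exactly `dvalues`, so the numeric weight gradient should agree with `inputs.T @ dvalues`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: 4 samples, 3 inputs, 2 neurons (sizes chosen only for illustration)
inputs = rng.standard_normal((4, 3))
weights = 0.01 * rng.standard_normal((3, 2))
biases = np.zeros((1, 2))
dvalues = rng.standard_normal((4, 2))  # stand-in for the upstream gradient

# Analytic gradients, same expressions as Layer_Dense.backward
dweights = inputs.T @ dvalues
dbiases = np.sum(dvalues, axis=0, keepdims=True)
dinputs = dvalues @ weights.T


def surrogate_loss(w):
    # Hypothetical scalar loss whose gradient w.r.t. the layer output is dvalues
    return np.sum((inputs @ w + biases) * dvalues)


# Central finite differences on every weight
eps = 1e-6
numeric = np.zeros_like(weights)
for i in range(weights.shape[0]):
    for j in range(weights.shape[1]):
        step = np.zeros_like(weights)
        step[i, j] = eps
        numeric[i, j] = (surrogate_loss(weights + step)
                         - surrogate_loss(weights - step)) / (2 * eps)

max_weight_error = np.max(np.abs(numeric - dweights))
print(dweights.shape, dbiases.shape, dinputs.shape)
print(max_weight_error)
```

Each gradient has the same shape as the quantity it belongs to: `dweights` matches `weights`, `dbiases` matches `biases`, and `dinputs` matches `inputs`.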

### ReLU

In [320]:
# ReLU
class Activation_ReLU:
    # Forward pass
    def forward(self, inputs):
        # Remember the input values
        self.inputs = inputs
        # Calculate the output values from inputs
        self.output = np.maximum(0, inputs)

    # Backward pass
    def backward(self, dvalues):
        # Copy dvalues first so the upstream gradient is not modified in place
        self.dinputs = dvalues.copy()

        # Zero the gradient where input values were zero or negative
        self.dinputs[self.inputs <= 0] = 0
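
A small worked example may make the masking in `backward` concrete. The input and gradient values below are made up for illustration; the operations mirror the two methods above.

```python
import numpy as np

# One sample with a negative, a zero, and a positive input (illustrative values)
inputs = np.array([[-1.5, 0.0, 2.0]])
dvalues = np.array([[0.3, 0.7, -0.4]])  # stand-in for the upstream gradient

# Forward pass: negative inputs are clamped to zero
output = np.maximum(0, inputs)

# Backward pass: the gradient passes through only where the input was positive
dinputs = dvalues.copy()
dinputs[inputs <= 0] = 0

print(output)
print(dinputs)
```

Only the third position keeps its gradient, because it is the only one whose input was strictly positive.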

### Softmax with Categorical Cross Entropy

In [321]:
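# Combined Softmax activation and Categorical Cross-Entropy loss,
# which allows a simpler and faster backward step
class Activation_Softmax_Loss_CategoricalCrossEntropy:
    # No separate activation or loss objects are created;
    # forward() computes both directly
    def __init__(self):
        pass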

    # Forward pass
    def forward(self, inputs, y_true):
        # Remember inputs for backward pass
        self.inputs = inputs
        
        # Get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        # Normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.output = probabilities
        
        # Calculate loss
        # Clip predictions to avoid log(0); clip both sides so the mean
        # is not dragged towards either extreme
        y_pred_clipped = np.clip(self.output, 1e-7, 1 - 1e-7)
        
        # Probabilities for target values - only if categorical labels
        if len(y_true.shape) == 1:
            correct_confidences = y_pred_clipped[range(len(self.output)), y_true]
        # Mask values - only for one-hot encoded labels
        elif len(y_true.shape) == 2:
            correct_confidences = np.sum(y_pred_clipped * y_true, axis=1)
            
        # Calculate and return the mean loss
        negative_log_likelihoods = -np.log(correct_confidences)
        return np.mean(negative_log_likelihoods)

    # Backward pass
    # dvalues is expected to be the softmax output (predicted probabilities)
    def backward(self, dvalues, y_true):
        # Number of samples
        samples = len(dvalues)
        
        # If labels are one-hot encoded, turn them into discrete values
        if len(y_true.shape) == 2:
            y_true = np.argmax(y_true, axis=1)
            
        # Copy so we can safely modify
        self.dinputs = dvalues.copy()
        # Calculate gradient using the simplified and stable formula
        self.dinputs[range(samples), y_true] -= 1
        # Normalize gradient
        self.dinputs = self.dinputs / samples
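
The simplified gradient used in `backward`, predicted probabilities minus the one-hot target, divided by the number of samples, can be verified numerically. The sketch below is a standalone check under assumed toy data: `mean_cross_entropy`, the logits, and the labels are illustrative and not taken from the exercise.

```python
import numpy as np


def mean_cross_entropy(logits, y_true):
    # Softmax followed by mean categorical cross-entropy (sparse labels)
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp_values = np.exp(shifted)
    probs = exp_values / np.sum(exp_values, axis=1, keepdims=True)
    clipped = np.clip(probs, 1e-7, 1 - 1e-7)
    loss = np.mean(-np.log(clipped[np.arange(len(logits)), y_true]))
    return loss, probs


rng = np.random.default_rng(0)
logits = rng.standard_normal((5, 3))   # 5 samples, 3 classes (toy sizes)
y_true = np.array([0, 2, 1, 1, 0])     # made-up class labels

# Analytic gradient: (probabilities - one-hot targets) / number of samples
_, probs = mean_cross_entropy(logits, y_true)
analytic = probs.copy()
analytic[np.arange(len(logits)), y_true] -= 1
analytic /= len(logits)

# Numeric gradient by central finite differences on every logit
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(logits.shape[0]):
    for j in range(logits.shape[1]):
        step = np.zeros_like(logits)
        step[i, j] = eps
        loss_plus, _ = mean_cross_entropy(logits + step, y_true)
        loss_minus, _ = mean_cross_entropy(logits - step, y_true)
        numeric[i, j] = (loss_plus - loss_minus) / (2 * eps)

max_error = np.max(np.abs(numeric - analytic))
print(max_error)
```

Each row of the analytic gradient sums to zero, since the probabilities sum to one and exactly one is subtracted per sample.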

### Optimizers

In [322]:
# SGD Optimizer (with learning rate decay and momentum)
class Optimizer_SGD:
    # Initialize optimizer - set settings
    def __init__(self, learning_rate=1., decay=0., momentum=0.):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.momentum = momentum

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * \
                (1. / (1. + self.decay * self.iterations))

    # Update parameters
    def update_params(self, layer):
        # If we use momentum
        if self.momentum:
            # If layer does not contain momentum arrays, create them
            if not hasattr(layer, 'weight_momentums'):
                layer.weight_momentums = np.zeros_like(layer.weights)
                # If there is no momentum array for biases
                # create it
                layer.bias_momentums = np.zeros_like(layer.biases)
            
            # Build weight updates with momentum - take previous
            # updates multiplied by retain factor and update with
            # current gradients
            weight_updates = \
                self.momentum * layer.weight_momentums - \
                self.current_learning_rate * layer.dweights
            layer.weight_momentums = weight_updates
            
            # Build bias updates
            bias_updates = \
                self.momentum * layer.bias_momentums - \
                self.current_learning_rate * layer.dbiases
            layer.bias_momentums = bias_updates
            
        # Vanilla SGD updates (as before momentum update)
        else:
            weight_updates = -self.current_learning_rate * layer.dweights
            bias_updates = -self.current_learning_rate * layer.dbiases
        
        # Update weights and biases using either
        # vanilla or momentum updates
        layer.weights += weight_updates
        layer.biases += bias_updates

    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1


# AdaGrad optimizer
class Optimizer_Adagrad:
    # Initialize optimizer - set settings
    def __init__(self, learning_rate=1., decay=0., epsilon=1e-7):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        
    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * \
                (1. / (1. + self.decay * self.iterations))
                
    # Update parameters
    def update_params(self, layer):
        # If layer does not contain cache arrays,
        # create them
        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)
            
        # Update cache with squared current gradients
        layer.weight_cache += layer.dweights**2
        layer.bias_cache += layer.dbiases**2
        
        # Vanilla SGD parameter update + normalization
        # with square rooted cache
        layer.weights += -self.current_learning_rate * \
            layer.dweights / \
            (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.current_learning_rate * \
            layer.dbiases / \
            (np.sqrt(layer.bias_cache) + self.epsilon)
            
    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1
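
To see how the two update rules differ, it can help to feed each one a constant gradient on a single scalar parameter. This is a simplified sketch rather than the classes above: `momentum_updates` and `adagrad_updates` are hypothetical helpers, the gradient of 1.0 is an assumption, and learning rate decay is left out.

```python
import math


def momentum_updates(learning_rate, momentum, gradient, steps):
    # Same rule as Optimizer_SGD: update = momentum * previous - lr * gradient
    velocity = 0.0
    history = []
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * gradient
        history.append(velocity)
    return history


def adagrad_updates(learning_rate, gradient, steps, epsilon=1e-7):
    # Same rule as Optimizer_Adagrad: the cache of squared gradients only grows
    cache = 0.0
    history = []
    for _ in range(steps):
        cache += gradient ** 2
        history.append(-learning_rate * gradient / (math.sqrt(cache) + epsilon))
    return history


sgd_steps = momentum_updates(learning_rate=0.2, momentum=0.9, gradient=1.0, steps=200)
ada_steps = adagrad_updates(learning_rate=1.5, gradient=1.0, steps=200)

print(sgd_steps[0], sgd_steps[-1])
print(ada_steps[0], ada_steps[3], ada_steps[-1])
```

With a constant gradient, the momentum update grows in magnitude towards `-learning_rate * gradient / (1 - momentum)`, while the AdaGrad update shrinks roughly as `learning_rate / sqrt(t)` after `t` steps. That shrinkage is the reason an extra external decay can stall AdaGrad.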

#### Data Loading

In [323]:
# Create the dataset: 100 points per class, 3 classes
# (300 samples in total, 2 features each)
X, y = spiral_data(samples=100, classes=3)

#### NN Init

In [324]:
# Dense Layer with 2 input features and 64 output values
dense1 = Layer_Dense(2, 64)

# ReLU activation for the Dense layer above
activation1 = Activation_ReLU()

# 2nd dense layer with 64 input and 3 output values (for 3 classes)
dense2 = Layer_Dense(64, 3)

loss_activation = Activation_Softmax_Loss_CategoricalCrossEntropy()
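
Because the weights start at `0.01 * randn` and the biases at zero, the untrained network outputs nearly uniform class probabilities, so the starting loss should sit close to ln(3). The sketch below illustrates this with a standalone forward pass. `X_demo` and `y_demo` are random stand-ins, not the spiral data, and the variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_classes = 300, 3

# Random stand-in data (NOT spiral_data): 2 features in [-1, 1], random labels
X_demo = rng.uniform(-1, 1, size=(n_samples, 2))
y_demo = rng.integers(0, n_classes, size=n_samples)

# Same initialization scheme and layer sizes as the network above
w1 = 0.01 * rng.standard_normal((2, 64))
b1 = np.zeros((1, 64))
w2 = 0.01 * rng.standard_normal((64, n_classes))
b2 = np.zeros((1, n_classes))

# Forward pass: Dense -> ReLU -> Dense -> Softmax
hidden = np.maximum(0, X_demo @ w1 + b1)
logits = hidden @ w2 + b2
exp_values = np.exp(logits - np.max(logits, axis=1, keepdims=True))
probs = exp_values / np.sum(exp_values, axis=1, keepdims=True)

# Mean categorical cross-entropy versus the loss of a uniform prediction
initial_loss = np.mean(-np.log(probs[np.arange(n_samples), y_demo]))
uniform_loss = np.log(n_classes)

print(initial_loss, uniform_loss)
```

A first-epoch loss far from ln(3) would usually point to a problem in the initialization or in the loss calculation.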

### Optimizer Selection & Training Loop

In [325]:
# Stochastic Gradient Descent (SGD)
# print("Running with: Vanilla SGD")
# optimizer = Optimizer_SGD(learning_rate=1.0)

# SGD with Learning Rate Decay
# print("Running with: SGD w LR Decay")
# optimizer = Optimizer_SGD(learning_rate=1.0, decay=1e-3)

# SGD with Momentum
print("Running with: SGD with Momentum")
optimizer = Optimizer_SGD(learning_rate=0.2, decay=1e-4, momentum=0.9)

# Adaptive Gradient (AdaGrad)
# print("Running with: AdaGrad")
# optimizer = Optimizer_Adagrad(learning_rate=1.5, decay=0)

epochs = 1001  # Epochs 0 through 1000, so the last progress line lands on epoch 1000

for epoch in range(epochs):

    # Perform a forward pass of our training data through this layer
    dense1.forward(X)
    # Pass the output of the dense layer through the activation function
    activation1.forward(dense1.output)
    # Pass on to the 2nd layer
    dense2.forward(activation1.output)
    # Softmax activation for the 2nd layer, combined with the loss
    loss = loss_activation.forward(dense2.output, y)

    # Print progress every 100 epochs
    if not epoch % 100:
        # Get predictions from the activation output
        predictions = np.argmax(loss_activation.output, axis=1)
        accuracy = np.mean(predictions == y)
        print(f'epoch: {epoch}, ' +
              f'acc: {accuracy:.3f}, ' +
              f'loss: {loss:.3f}, ' +
              f'lr: {optimizer.current_learning_rate:.4f}')

    # Backward pass from loss
    loss_activation.backward(loss_activation.output, y)
    dense2.backward(loss_activation.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)

    # Update learning rate (if decay is used)
    optimizer.pre_update_params()
    # Update the weights and biases of each layer
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)
    # Increment iteration count
    optimizer.post_update_params()

Running with: SGD with Momentum
epoch: 0, acc: 0.360, loss: 1.099, lr: 0.2000
epoch: 100, acc: 0.407, loss: 1.079, lr: 0.1980
epoch: 200, acc: 0.410, loss: 1.076, lr: 0.1961
epoch: 300, acc: 0.403, loss: 1.072, lr: 0.1942
epoch: 400, acc: 0.423, loss: 1.063, lr: 0.1923
epoch: 500, acc: 0.447, loss: 1.039, lr: 0.1905
epoch: 600, acc: 0.537, loss: 0.997, lr: 0.1887
epoch: 700, acc: 0.610, loss: 0.933, lr: 0.1869
epoch: 800, acc: 0.623, loss: 0.870, lr: 0.1852
epoch: 900, acc: 0.657, loss: 0.818, lr: 0.1835
epoch: 1000, acc: 0.713, loss: 0.785, lr: 0.1818
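
The `lr` column above follows directly from the decay formula in `pre_update_params`. The sketch below reproduces it, assuming the same ordering as the training loop: the progress line is printed before `pre_update_params` runs, so the value shown at epoch `e` was computed with `iterations = e - 1`. The helper names are hypothetical.

```python
def decayed_learning_rate(initial_rate, decay, iterations):
    # Same formula as Optimizer_SGD.pre_update_params
    return initial_rate * (1.0 / (1.0 + decay * iterations))


def reported_rates(initial_rate, decay, epochs, every=100):
    # Learning rate as it would appear in the progress log at each reported epoch
    rows = {}
    for epoch in range(0, epochs, every):
        # The log line precedes the update, so it reflects the previous iteration
        iterations = max(epoch - 1, 0)
        rate = decayed_learning_rate(initial_rate, decay, iterations)
        rows[epoch] = f"{rate:.4f}"
    return rows


rates = reported_rates(initial_rate=0.2, decay=1e-4, epochs=1001)
for epoch, rate in rates.items():
    print(f"epoch: {epoch}, lr: {rate}")
```

With `decay=1e-4` the rate falls by only about 9% over the whole run, so most of the speed-up after epoch 500 comes from momentum rather than from the schedule.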


# Results
See the rendered HTML report below.

In [328]:
from IPython.display import display, HTML

# Store the entire HTML and CSS content in a multi-line string
html_content = """
<style>
    .pdf-mimic-container {
        max-width: 800px;
        margin: 40px auto;
        padding: 50px 60px;
        border: 1px solid #dcdcdc;
        box-shadow: 0 4px 12px rgba(0, 0, 0, 0.08);
        background-color: #ffffff;
        font-family: Georgia, 'Times New Roman', Times, serif;
        font-size: 16px;
        line-height: 1.6;
        color: #333;
    }
    .pdf-mimic-container .header {
        text-align: right;
        font-size: 14px;
        color: #777;
        margin-bottom: 20px;
        padding-bottom: 10px;
        border-bottom: 1px solid #eee;
    }
    .pdf-mimic-container .footer {
        text-align: center;
        font-size: 14px;
        color: #777;
        margin-top: 40px;
        padding-top: 10px;
        border-top: 1px solid #eee;
    }
    .pdf-mimic-container h1 {
        text-align: center;
        font-size: 24px;
        margin-top: 0;
        margin-bottom: 10px;
        font-weight: 600;
        color: #222;
    }
    .pdf-mimic-container .author-info {
        text-align: center;
        font-size: 16px;
        color: #555;
        margin-bottom: 40px;
    }
    .pdf-mimic-container h2 {
        font-size: 22px;
        margin-top: 40px;
        margin-bottom: 15px;
        padding-bottom: 5px;
        border-bottom: 2px solid #e0e0e0;
        font-weight: 600;
    }
    .pdf-mimic-container h3 {
        font-size: 18px;
        margin-top: 25px;
        margin-bottom: 10px;
        font-weight: 600;
    }
    .pdf-mimic-container table {
        width: 100%;
        border-collapse: collapse;
        margin-top: 20px;
        margin-bottom: 20px;
    }
    .pdf-mimic-container caption {
        caption-side: top;
        text-align: left;
        font-weight: bold;
        padding-bottom: 10px;
        font-size: 16px;
    }
    .pdf-mimic-container th, .pdf-mimic-container td {
        padding: 10px 12px;
        text-align: left;
        border-bottom: 1px solid #e0e0e0;
    }
    .pdf-mimic-container thead th {
        border-bottom: 2px solid #555;
        font-weight: bold;
    }
    .pdf-mimic-container tbody tr:last-child td {
        border-bottom: none;
    }
    .pdf-mimic-container ul {
        padding-left: 25px;
    }
    .pdf-mimic-container li {
        margin-bottom: 8px;
    }
    .pdf-mimic-container strong {
        font-weight: bold;
    }
    .pdf-mimic-container code {
        font-family: 'Courier New', Courier, monospace;
        background-color: #f4f4f4;
        padding: 2px 5px;
        border-radius: 4px;
        font-size: 0.9em;
    }
    .graph-placeholder {
        border: 2px dashed #ccc;
        background-color: #f9f9f9;
        padding: 60px 20px;
        margin: 25px 0;
        text-align: center;
        color: #888;
        font-style: italic;
        font-size: 1em;
    }
</style>

<div class="pdf-mimic-container">

    <div class="header">Optimizer Performance Analysis</div>

    <h1>Unit 4 Exercise: A Comparative Analysis of Neural Network Optimizers</h1>
    <div class="author-info">
        Ethan Jed V. Carbonell<br>
        BSCS 3A AI<br>
        October 17, 2025
    </div>

    <h2>1 Experimental Setup</h2>
    <p>The experiment was conducted using a consistent model architecture and dataset to ensure a fair comparison between the optimizers.</p>

    <h3>1.1 Dataset</h3>
    <p>The model was trained on the <code>spiral_data</code> dataset, generated using the <code>nnfs</code> library. Its classes form interleaved spirals, so the task is not linearly separable. The configuration used was:</p>
    <ul>
        <li><strong>Samples:</strong> 100 per class (300 in total)</li>
        <li><strong>Classes:</strong> 3</li>
        <li><strong>Features:</strong> 2 (x, y coordinates)</li>
    </ul>

    <h3>1.2 Model Architecture</h3>
    <p>A simple sequential feed-forward neural network was constructed with the following layers:</p>
    <table>
        <caption>Table 1: Neural Network Architecture</caption>
        <thead>
            <tr>
                <th>Layer Type</th>
                <th>Input Shape</th>
                <th>Output Shape</th>
                <th>Activation Function</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Input Layer</td>
                <td>2</td>
                <td>2</td>
                <td>-</td>
            </tr>
            <tr>
                <td>Dense Layer 1</td>
                <td>2</td>
                <td>64</td>
                <td>ReLU</td>
            </tr>
            <tr>
                <td>Dense Layer 2</td>
                <td>64</td>
                <td>3</td>
                <td>Softmax</td>
            </tr>
        </tbody>
    </table>
    <p>The <strong>Categorical Cross-Entropy</strong> loss function was used in combination with the Softmax activation on the final layer, as this is standard practice for multi-class classification problems.</p>
    
    <h3>1.3 Optimizer Configuration</h3>
    <p>Table 2 lists the final configuration selected for each optimizer after hyperparameter tuning.</p>
    <table>
        <caption>Table 2: Final Hyperparameter Settings</caption>
        <thead>
            <tr>
                <th>Optimizer</th>
                <th>Parameters</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>SGD with Momentum</td>
                <td><code>learning_rate=0.2</code>, <code>decay=1e-4</code>, <code>momentum=0.9</code></td>
            </tr>
            <tr>
                <td>AdaGrad</td>
                <td><code>learning_rate=1.5</code>, <code>decay=0</code>, <code>epsilon=1e-7</code></td>
            </tr>
        </tbody>
    </table>

    <h2>2 Hyperparameter Tuning Rationale</h2>
    
    <h3>2.1 SGD with Momentum</h3>
    <p>The final parameters for the SGD optimizer were chosen by balancing training speed against stability. Initial experiments with a higher learning rate (e.g., 0.5) learned quickly but were unstable, with significant dips in accuracy as updates overshot the optimum.</p>
    <p>To correct this, a more conservative learning rate of <strong>0.2</strong> was selected. This removed the instability and produced a steadier increase in accuracy. The decay rate of <strong>1e-4</strong> gradually reduces the learning rate (from 0.2000 to 0.1818 over the run, as shown in Table 3) for fine-tuning in later epochs, while the commonly used momentum value of <strong>0.9</strong> was retained to help the optimizer move through plateaus and shallow local minima.</p>

    <h3>2.2 AdaGrad</h3>
    <p>The AdaGrad configuration was chosen to counteract the algorithm's inherent weakness: its effective step size shrinks as squared gradients accumulate in the cache, which can stall learning prematurely. Initial attempts that included an external decay rate (e.g., <code>decay=1e-3</code>) produced a "double decay" effect, where external decay combined with AdaGrad's internal scaling shrank the updates too quickly and training stagnated.</p>
    <p>The most important step was setting <strong><code>decay=0</code></strong>, leaving AdaGrad's internal scaling as the only source of step-size reduction. To further offset that reduction, the initial learning rate was increased to <strong>1.5</strong>. This gave the optimizer a larger initial "budget," so it kept making meaningful updates for longer and reached a higher final accuracy.</p>

    <h2>3 Results</h2>
    <p>The model was trained for 1001 epochs (epochs 0 through 1000) with each optimizer. Tables 3 and 4 show the accuracy, loss, and learning rate at 100-epoch intervals.</p>
    
    <table>
        <caption>Table 3: Training Progress with SGD with Momentum</caption>
        <thead>
            <tr>
                <th>Epoch</th>
                <th>Accuracy</th>
                <th>Loss</th>
                <th>Learning Rate</th>
            </tr>
        </thead>
        <tbody>
            <tr><td>0</td><td>0.360</td><td>1.099</td><td>0.2000</td></tr>
            <tr><td>100</td><td>0.407</td><td>1.079</td><td>0.1980</td></tr>
            <tr><td>200</td><td>0.410</td><td>1.076</td><td>0.1961</td></tr>
            <tr><td>300</td><td>0.403</td><td>1.072</td><td>0.1942</td></tr>
            <tr><td>400</td><td>0.423</td><td>1.063</td><td>0.1923</td></tr>
            <tr><td>500</td><td>0.447</td><td>1.039</td><td>0.1905</td></tr>
            <tr><td>600</td><td>0.537</td><td>0.997</td><td>0.1887</td></tr>
            <tr><td>700</td><td>0.610</td><td>0.933</td><td>0.1869</td></tr>
            <tr><td>800</td><td>0.623</td><td>0.870</td><td>0.1852</td></tr>
            <tr><td>900</td><td>0.657</td><td>0.818</td><td>0.1835</td></tr>
            <tr><td>1000</td><td>0.713</td><td>0.785</td><td>0.1818</td></tr>
        </tbody>
    </table>

    <table>
        <caption>Table 4: Training Progress with AdaGrad</caption>
        <thead>
            <tr>
                <th>Epoch</th>
                <th>Accuracy</th>
                <th>Loss</th>
                <th>Learning Rate</th>
            </tr>
        </thead>
        <tbody>
            <tr><td>0</td><td>0.360</td><td>1.099</td><td>1.5000</td></tr>
            <tr><td>100</td><td>0.520</td><td>1.006</td><td>1.5000</td></tr>
            <tr><td>200</td><td>0.550</td><td>0.924</td><td>1.5000</td></tr>
            <tr><td>300</td><td>0.607</td><td>0.860</td><td>1.5000</td></tr>
            <tr><td>400</td><td>0.647</td><td>0.801</td><td>1.5000</td></tr>
            <tr><td>500</td><td>0.633</td><td>0.757</td><td>1.5000</td></tr>
            <tr><td>600</td><td>0.650</td><td>0.725</td><td>1.5000</td></tr>
            <tr><td>700</td><td>0.697</td><td>0.682</td><td>1.5000</td></tr>
            <tr><td>800</td><td>0.677</td><td>0.682</td><td>1.5000</td></tr>
            <tr><td>900</td><td>0.740</td><td>0.612</td><td>1.5000</td></tr>
            <tr><td>1000</td><td>0.730</td><td>0.603</td><td>1.5000</td></tr>
        </tbody>
    </table>

    <h2>4 Analysis and Comparison</h2>
    <h3>4.1 Loss Stabilization</h3>
    <p>Comparing the loss values in Tables 3 and 4 reveals a clear difference in convergence speed.</p>

    <ul>
        <li><strong>AdaGrad:</strong> The loss decreased rapidly in the initial epochs, reaching 0.860 by epoch 300. It continued to decrease, plateaued at 0.682 across epochs 700 and 800, and then dropped again to 0.603 by epoch 1000. The rapid initial progress is characteristic of adaptive optimizers on this type of problem.</li>
        <li><strong>SGD with Momentum:</strong> The loss decreased much more slowly and steadily. It took until epoch 800 for the loss to reach 0.870, a level AdaGrad surpassed within the first 300 epochs. The loss did not show clear signs of stabilization by epoch 1000, suggesting that it was still converging, albeit slowly.</li>
    </ul>
    <p><strong>Conclusion:</strong> AdaGrad reduced the loss considerably faster than SGD with Momentum and finished lower (0.603 versus 0.785 at epoch 1000).</p>

    <h3>4.2 Model Accuracy</h3>
    <p>The primary metric for model performance is final accuracy, measured here on the training data because no separate test set was used.</p>

    <ul>
        <li><strong>AdaGrad:</strong> Achieved a higher final accuracy of <strong>73.0%</strong> at epoch 1000, with a peak accuracy of <strong>74.0%</strong> at epoch 900. It also reached key accuracy milestones much faster, exceeding 60% accuracy by epoch 300.</li>
        <li><strong>SGD with Momentum:</strong> Reached a final accuracy of <strong>71.3%</strong> at epoch 1000. Its progress was slower, requiring around 700 epochs to surpass the 60% accuracy mark.</li>
    </ul>
    <p><strong>Conclusion:</strong> For this specific task and hyperparameter configuration, AdaGrad produced a model with slightly higher accuracy and converged to that accuracy much more quickly.</p>

    <h2>5 Conclusion</h2>
    <p>This experiment showed clear differences in training behavior between the SGD with Momentum and AdaGrad optimizers. AdaGrad exhibited the typical strengths of an adaptive optimizer: rapid initial convergence and strong performance without the need for manual learning rate scheduling. It achieved a final accuracy of 73.0% and a final loss of 0.603, reaching a given loss level in far fewer epochs than SGD with Momentum.</p>
    <p>Conversely, SGD with Momentum provided a slower but steady convergence, reaching a final accuracy of 71.3%. While effective, it required more epochs to achieve a comparable level of performance.</p>
    <p>The results suggest that for problems like the spiral dataset, an adaptive optimizer like AdaGrad can offer a significant advantage in training efficiency. However, the tuning process for AdaGrad proved more nuanced, as its performance was highly sensitive to the interaction between its internal adaptive mechanism and any external learning rate decay.</p>

    <div class="footer">
        Ethan Jed V. Carbonell
    </div>

</div>
"""

display(HTML(html_content))
