# Introduction to Machine Learning

Machine learning has become the dominant approach in the field of Artificial Intelligence (AI), a broad term encompassing efforts to make computational models that can perform complex tasks without the need for human guidance.

In this module, we will develop an intuitive understanding of how computers learn from data. This will give us a good foundation for the rest of the course, and help us to understand some of the more cutting-edge model architectures used in text understanding and generation.

## A Brief History

The concept of machines learning how to do things has been discussed by scientists for decades. 

In 1943, Warren McCulloch and Walter Pitts [explored](https://home.csulb.edu/~cwallis/382/readings/482/mccolloch.logical.calculus.ideas.1943.pdf) a mathematical model of our brains and nervous systems as a net of 'neurons' that respond to stimuli when 'excitation' exceeds a threshold. 

Later, Frank Rosenblatt worked on the 'perceptron' at Cornell University, writing that, *"We are about to witness the birth of...a machine capable of perceiving, recognizing and identifying its surroundings without any human training or control."*

<img src="https://news.cornell.edu/sites/default/files/styles/full_size/public/2019-09/0925_rosenblatt5.jpg?itok=7UpHtbRj" width=450>
Rosenblatt with the Mark I Perceptron, which performed image classification of some shapes successfully.

In 1959, Arthur Samuel, a researcher at IBM, [wrote](https://people.csail.mit.edu/brooks/idocs/Samuel.pdf) that:

"A computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program".

The idea was emerging that computers could learn functions whereby they applied weights to input which then determined an output, such as whether the image is a square, or a circle.

Machine learning has of course soared in complexity and achievement since the 1940s and 50s, with a particularly intensive period of evolution from the 2010s onwards. There are several reasons for this, including:
* Advances in algorithms
* Availability of computational resources, such as GPUs and other accelerators
* Varied and substantial datasets made available by the internet and by smartphones (representing the world and our behaviour)

Arthur Samuel succinctly described a process for machine learning in 1962:

*"Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximise the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would "learn" from its experience."*

What I hope emerges here is the idea that ML is a different approach from traditional computer programming, where we typically give a computer a task and the exact instructions for accomplishing it, as in this diagram:

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/input_program_output.png?raw=true" width=650>

Instead, we want to create functions which learn from data such that they can receive inputs, apply weights to the features of those inputs (more on this soon), then generate an output. The saved weights from this process are what we call a 'model'.
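
As a rough illustration of "applying weights to features", here is a minimal sketch. The feature values, weights and `bias` term are made-up numbers chosen only for this example; a real model would learn them from data.

```python
import numpy as np

# Hypothetical example values, chosen only for illustration.
features = np.array([2.0, 3.0])   # inputs to the model
weights = np.array([0.5, -1.0])   # one weight per feature (what training adjusts)
bias = 1.0                        # constant offset, also adjusted by training

# The "model" is these saved numbers plus the rule for combining them.
output = np.dot(features, weights) + bias
print(output)
```

Changing the weights changes the output for the same input, which is what training exploits.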

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/input_model_backprop.png?raw=true" width=650>


In this lesson, we will explore the basics of a neural network, which still closely resembles the vision of McCulloch, Pitts, Rosenblatt and others, building up to our own Multi-Layer Perceptron to classify images.

### The basics

Let's explore how we can craft functions to do the following:
* Make a prediction
* Calculate a loss

These are the essential first steps in ML. We make a prediction, and we need a metric to show how wrong we are.

### Random Noise

Firstly, we need some data to which we will try to fit a curve. This could be predicting plant height from leaf size, or house price from square footage, for example. To keep things simple, we will generate points from a known quadratic function, add random 'noise', and then try to recover the underlying curve.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Data generation: a known quadratic function plus Gaussian noise
np.random.seed(42)
X = np.linspace(-3, 3, 100)
y_true = 2 * X**2 + 3 * X + 1
noise = np.random.normal(0, 3, size=X.shape)
y = y_true + noise

# Prepare data as list of (x, y) tuples
data = list(zip(X, y))

plt.scatter(X, y, label='Noisy Data', alpha=0.6)
plt.plot(X, y_true, color='black', linestyle='--', label='True Function')
plt.legend()
plt.title("Generated Noisy Quadratic Data")
plt.xlabel("x")
plt.ylabel("y")
plt.show()


### Quadratic Function

Quadratic functions are useful to model the trajectory of projectiles, arcs and parabolic shapes. They are often used in optimization problems, or to fit lines to data. 

The following function tries to predict $y$ from $x$, following this formula:

$a ⋅ x^2 + b ⋅ x + c$


In [None]:
def predict_quadratic(x, a, b, c):
    return a * x**2 + b * x + c

### Loss Function

There are a variety of loss functions used in ML; one of the most common is Mean Squared Error (MSE). MSE is a formal way of saying the 'average of our mistakes'.

This function calculates the difference between each prediction and the actual value, squares that difference, then averages the squared differences across all examples.

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

In the mathematical notation, $n$ is the number of examples in our data, $y_i$ is the actual value of example $i$, and $\hat{y}_i$ (pronounced "y hat") is the prediction for that example.
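
To make the formula concrete, here is a small worked example. The three actual and predicted values are hypothetical numbers picked so the arithmetic is easy to follow by hand.

```python
# Hypothetical actual values and predictions for three examples.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# Square each difference, then average the squares.
squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)

print(squared_errors)  # [1.0, 0.0, 4.0]
print(mse)
```

The squared errors sum to 5, so the MSE is 5 divided by 3. Note how the single larger miss (off by 2) contributes four times as much as the smaller one (off by 1).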

As with many calculations in ML, it may be easier to understand what's happening in the Python function. This function has more lines of code than we would write in practice, but spelling out the calculation steps in the loop shows exactly what is happening.

In [None]:
def mean_squared_error(data, a, b, c):
    total_error = 0
    for x, actual_y in data:
        predicted_y = predict_quadratic(x, a, b, c)
        error = predicted_y - actual_y
        squared_error = error ** 2
        total_error += squared_error 

    mse = total_error / len(data)
    return mse
    

Now that we have an idea of how wrong our predictions are, we can try updating parameters to improve accuracy. 

In the following function, we take one parameter (a, b, c) at a time, and try increasing, decreasing and keeping it the same. We use MSE to calculate what works best.

In [None]:
# Manual parameter update using simple finite differences (crude update)
def try_update_param(param, delta, fixed_params, data, param_index):
    a, b, c = fixed_params
    
    # Try positive delta
    params_pos = [a, b, c]
    params_pos[param_index] = param + delta
    loss_pos = mean_squared_error(data, *params_pos)
    
    # Try negative delta
    params_neg = [a, b, c]
    params_neg[param_index] = param - delta
    loss_neg = mean_squared_error(data, *params_neg)
    
    # Current loss without change
    params_curr = [a, b, c]
    params_curr[param_index] = param
    loss_curr = mean_squared_error(data, *params_curr)
    
    # Choose best option
    if loss_pos < loss_curr and loss_pos < loss_neg:
        return param + delta, loss_pos
    elif loss_neg < loss_curr and loss_neg < loss_pos:
        return param - delta, loss_neg
    else:
        return param, loss_curr

This process is cumbersome because we have to evaluate the model multiple times, and there are no gradients (slopes) calculated, so just comparing up, down and same is limited. This scenario is comparable to trying to find your way down a mountain blindfolded, just feeling around. This analogy will soon be extended...
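
One limitation of the up/down/same approach can be seen in a one-parameter sketch. The toy `loss` function and its minimum at 0.25 are assumptions made purely for illustration; with a fixed step of 0.1 starting from 0, the search can only visit points on a grid and never lands on the minimum.

```python
def loss(p):
    # Toy loss with its minimum at p = 0.25 (hypothetical target).
    return (p - 0.25) ** 2

p = 0.0
delta = 0.1
for _ in range(20):
    # Compare stepping down, staying put, and stepping up; keep the best.
    candidates = [p - delta, p, p + delta]
    p = min(candidates, key=loss)

print(p, loss(p))
```

The search stalls roughly 0.05 away from the minimum because neither neighbour on the grid is better. A smaller `delta` would get closer, but would need more steps to get there.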

Let's visualize our current training process:

In [None]:
from datetime import datetime

# Initialize parameters randomly
a, b, c = np.random.randn(3)
delta = 0.1
epochs = 50

loss_history = []
predictions_history = []
start = datetime.now()

for epoch in range(epochs):
    a, loss_a = try_update_param(a, delta, (a, b, c), data, 0)
    b, loss_b = try_update_param(b, delta, (a, b, c), data, 1)
    c, loss_c = try_update_param(c, delta, (a, b, c), data, 2)
    
    loss = max(loss_a, loss_b, loss_c)
    loss_history.append(loss)
    
    if epoch % 5 == 0:
        print(f"Epoch {epoch}: a={a:.3f}, b={b:.3f}, c={c:.3f}, Loss={loss:.3f}")
        y_pred = predict_quadratic(X, a, b, c)
        predictions_history.append((epoch, y_pred.copy()))

end = datetime.now()
manual_time = end-start
print(f"Time taken:{manual_time}")

# Plot loss over epochs
plt.plot(loss_history)
plt.title("Loss over epochs (crude manual update)")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.show()

In [None]:
plt.scatter(X, y, alpha=0.3, label="Noisy Data")
plt.plot(X, y_true, label="True Function", color='black', linestyle='--')

colors = plt.cm.viridis(np.linspace(0, 1, len(predictions_history)))
for (epoch, pred), color in zip(predictions_history, colors):
    plt.plot(X, pred, color=color, label=f"Epoch {epoch}")

plt.title("Prediction Curve Over Training")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()

### Vectorized MSE

If you're familiar with Python, you have probably come across vectorized operations. Vectorization refers to the process of performing calculations across entire arrays, matrices or tensors (more on these soon) at once, rather than using loops to iterate through values one by one.

#### Numpy

Numpy is a popular scientific programming library for Python and we will use several of its methods today. Numpy makes vector operations easy - and will vastly simplify our `mean_squared_error` function. We'll give the updated function a new name to avoid confusion.
Here again is MSE in vanilla Python:

```
def mean_squared_error(data, a, b, c):
    total_error = 0
    for x, actual_y in data:
        predicted_y = predict_quadratic(x, a, b, c)
        error = predicted_y - actual_y
        squared_error = error ** 2
        total_error += squared_error 

    mse = total_error / len(data)
    return mse
```

In Numpy, we simply use the `np.mean` method across the squared difference between our predictions and actuals.

In [None]:
def mse_loss(y_pred, y_true):
    return np.mean((y_pred - y_true)**2)
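
As a quick sanity check, the sketch below compares the loop-based calculation with the vectorized one on a tiny, made-up dataset. The four `x` and `y` values and the parameter values are hypothetical; the two functions are repeated so the snippet runs on its own.

```python
import numpy as np

def predict_quadratic(x, a, b, c):
    return a * x**2 + b * x + c

def mse_loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

# Small hypothetical dataset and parameters.
x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([0.5, 1.0, 6.5, 14.0])
a, b, c = 2.0, 3.0, 1.0

# Loop version: one example at a time.
total = 0.0
for xi, yi in zip(x, y):
    total += (predict_quadratic(xi, a, b, c) - yi) ** 2
loop_mse = total / len(x)

# Vectorized version: predict_quadratic receives the whole array at once.
vectorized_mse = mse_loss(predict_quadratic(x, a, b, c), y)

print(loop_mse, vectorized_mse)
```

Both approaches give the same value. Note that `predict_quadratic` needs no changes to accept an array, because NumPy applies the arithmetic element by element.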

Let's rewrite the function that updates the parameters so that it uses our new, vectorized MSE:

In [None]:
def try_update_param(param, delta, fixed_params, x, y, param_index):
    """
    param: current value of the parameter to update
    delta: step size to try (+delta and -delta)
    fixed_params: tuple/list of the current (a, b, c) values
    x, y: arrays of inputs and target values
    param_index: index of param in (a,b,c) to know position
    
    Returns the updated param and the corresponding loss.
    """
    a, b, c = fixed_params
    
    # Try positive delta
    params_pos = [a, b, c]
    params_pos[param_index] = param + delta
    # predict_quadratic expects x first, then a, b, c
    y_pred_pos = predict_quadratic(x, *params_pos)
    loss_pos = mse_loss(y_pred_pos, y)
    
    # Try negative delta
    params_neg = [a, b, c]
    params_neg[param_index] = param - delta
    y_pred_neg = predict_quadratic(x, *params_neg)
    loss_neg = mse_loss(y_pred_neg, y)
    
    # Current loss without change
    params_curr = [a, b, c]
    params_curr[param_index] = param
    y_pred_curr = predict_quadratic(x, *params_curr)
    loss_curr = mse_loss(y_pred_curr, y)
    
    # Choose best option
    if loss_pos < loss_curr and loss_pos < loss_neg:
        return param + delta, loss_pos
    elif loss_neg < loss_curr and loss_neg < loss_pos:
        return param - delta, loss_neg
    else:
        return param, loss_curr

In [None]:
a, b, c = np.random.randn(3)
delta = 0.1
epochs = 50

loss_history = []
predictions_history = []  # reset so the plot below shows this run, not the previous one

start = datetime.now()
for epoch in range(epochs):
    # Update a
    a, loss_a = try_update_param(a, delta, (a, b, c), X, y, 0)
    # Update b
    b, loss_b = try_update_param(b, delta, (a, b, c), X, y, 1)
    # Update c
    c, loss_c = try_update_param(c, delta, (a, b, c), X, y, 2)
    
    # Record the worst loss seen among this epoch's three updates
    loss = max(loss_a, loss_b, loss_c)
    loss_history.append(loss)
    
    if epoch % 5 == 0:
        print(f"Epoch {epoch}: a={a:.3f}, b={b:.3f}, c={c:.3f}, Loss={loss:.3f}")
        y_pred = predict_quadratic(X, a, b, c)
        predictions_history.append((epoch, y_pred.copy()))

end = datetime.now()
vectorized_time = end - start
print(f"Time taken:{vectorized_time}")

# Plot loss over epochs
plt.plot(loss_history)
plt.title("Loss over epochs (vectorized manual update)")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.show()


In [None]:
plt.scatter(X, y, alpha=0.3, label="Noisy Data")
plt.plot(X, y_true, label="True Function", color='black', linestyle='--')

colors = plt.cm.viridis(np.linspace(0, 1, len(predictions_history)))
for (epoch, pred), color in zip(predictions_history, colors):
    plt.plot(X, pred, color=color, label=f"Epoch {epoch}")

plt.title("Prediction Curve Over Training")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()

In [None]:
print(f"Vectorized run was {manual_time - vectorized_time} faster than the manual run")

Though the vectorized operation is only milliseconds faster, such gains over millions and billions of calculations will be significant!
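
To get a feel for how the gap grows with data size, here is a rough timing sketch on 200,000 randomly generated values. The array size and the use of `time.perf_counter` are choices made for this illustration; exact timings will vary from machine to machine.

```python
import time
import numpy as np

# Hypothetical larger dataset of random predictions and targets.
rng = np.random.default_rng(0)
y_pred = rng.normal(size=200_000)
y_true = rng.normal(size=200_000)

# Loop version
start = time.perf_counter()
total = 0.0
for p, t in zip(y_pred, y_true):
    total += (p - t) ** 2
loop_mse = total / len(y_pred)
loop_seconds = time.perf_counter() - start

# Vectorized version
start = time.perf_counter()
vectorized_mse = np.mean((y_pred - y_true) ** 2)
vectorized_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vectorized_seconds:.4f}s")
```

Both versions compute the same loss; only the time taken differs.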

### Gradient Descent

Now that we've explored making predictions and checking how wrong we are, the next feature of neural networks worth exploring is gradient descent, an efficient way to take steps in the right direction to minimize the loss. 

Here's the promised extension of the mountain analogy: imagine you're at the summit and it's getting misty, cold and dark. You want to find the fastest way down the mountain, so look for the steepest descent available and head down it. Many more steps and decisions to find the right direction will be necessary, since the steepest descent may lead to a plateau, from which you will have to find a new way down. 

Gradient descent is a key building block of how neural networks learn. Now we know how to calculate how wrong we are, we need to be able to suggest incremental improvements to get better, faster. 

#### Intuitive algorithm:

We calculate which direction provides the best option for minimizing the loss, and take a small 'step' in that direction. Repeat.

The parameters, or 'weights' of the model, are updated iteratively using the gradients of the loss function. ML training is in part a process of making millions or more of these tiny adjustments until the model finds the best weights for a task.
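
The loop described above can be sketched with a single weight. The toy `loss` function, its minimum at 3, the starting point and the learning rate are all assumptions chosen for illustration.

```python
def loss(w):
    # Toy loss with its minimum at w = 3 (hypothetical).
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1
for step in range(50):
    # Step against the gradient: downhill on the loss.
    w -= learning_rate * gradient(w)

print(w, loss(w))
```

Each step moves `w` a fraction of the way towards the minimum, with larger steps where the slope is steep and smaller ones as the slope flattens out.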

### Mathematical Formulation

Here's the more academic mathematical formulation:

For parameters θ = [θ₀, θ₁, ..., θₙ]:

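**Loss Function:** $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

This is the MSE from earlier, $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, scaled by $\frac{1}{2}$ (with $m = n$ and $\hat{y}_i = h_\theta(x^{(i)})$). The factor of $\frac{1}{2}$ is a convention that cancels the 2 produced when differentiating the square.

**Gradient:** $\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}_j$ (for a model that is linear in its parameters, $h_\theta(x) = \sum_j \theta_j x_j$)

**Update Rule:** $\theta_j = \theta_j - \alpha \cdot \frac{\partial J}{\partial \theta_j}$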

Where:
- m = number of training examples
- α = learning rate
- hθ(x) = hypothesis function (the model's prediction)
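
For the quadratic model used in this notebook, the general formulas can be specialised as follows. This is a sketch that assumes the plain MSE loss (without the factor of $\frac{1}{2}$) over $N$ examples, which is why a factor of $\frac{2}{N}$ appears in the gradient code:

$$
L(a, b, c) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2, \qquad \hat{y}_i = a x_i^2 + b x_i + c
$$

$$
\frac{\partial L}{\partial a} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)\, x_i^2, \qquad
\frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)\, x_i, \qquad
\frac{\partial L}{\partial c} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)
$$

Each gradient is the average error multiplied by whatever the parameter is multiplied by in the prediction: $x_i^2$ for $a$, $x_i$ for $b$, and 1 for $c$.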

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Create dataset
np.random.seed(42)
X = np.linspace(-3, 3, 100)
y_true = 2 * X**2 + 3 * X + 1
noise = np.random.normal(0, 3, size=X.shape)
y = y_true + noise

# Initialize parameters
a, b, c = np.random.randn(3)
learning_rate = 0.01
epochs = 100

loss_history = []
predictions_history = []

for epoch in range(epochs):
    # Forward pass: predict
    y_pred = predict_quadratic(X, a, b, c)
    loss = mse_loss(y_pred, y)
    loss_history.append(loss)

    if epoch % 10 == 0 or epoch == epochs - 1:
        predictions_history.append((epoch, y_pred.copy()))

    # Backward pass: compute gradients
    # These lines calculate the gradients / derivatives that inform:
    # which direction each parameter should move (pos or neg)
    # how steep the slope is (how much to change)
    N = len(X)
    dL_da = (2/N) * np.sum((y_pred - y) * X**2)
    dL_db = (2/N) * np.sum((y_pred - y) * X)
    dL_dc = (2/N) * np.sum((y_pred - y) * 1)

    # Update parameters
    # Here, we move each parameter in the 
    # opposite direction of its gradient
    a -= learning_rate * dL_da
    b -= learning_rate * dL_db
    c -= learning_rate * dL_dc

    if epoch % 10 == 0:
        print(f"Epoch {epoch:3}: a={a:.3f}, b={b:.3f}, c={c:.3f}, Loss={loss:.3f}")


In [None]:
plt.plot(loss_history)
plt.title("MSE Loss Over Epochs (Gradient Descent)")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()


In [None]:
plt.scatter(X, y, alpha=0.3, label="Noisy Data")
plt.plot(X, y_true, color='black', linestyle='--', label="True Function")

colors = plt.cm.plasma(np.linspace(0, 1, len(predictions_history)))
for (epoch, pred), color in zip(predictions_history, colors):
    plt.plot(X, pred, color=color, label=f"Epoch {epoch}")

plt.legend()
plt.title("Curve Fitting Progress (Gradient Descent)")
plt.xlabel("x")
plt.ylabel("y")
plt.show()


As we can see from this toy example, gradient descent offers several benefits over manual updates:
* Smart direction - gradients tell us which way to go efficiently
* Smooth progress - we take steps proportional to the slope steepness
* Scale - just imagine doing manual updates for large models!
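
One way to gain confidence in hand-derived gradients like those in the training loop above is to compare them with a numerical estimate. The sketch below does this for the parameter `a` on a small, made-up dataset; the data, the parameter values and the step `eps` are assumptions for illustration, and the two helper functions are repeated so the snippet runs on its own.

```python
import numpy as np

def predict_quadratic(x, a, b, c):
    return a * x**2 + b * x + c

def mse_loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

# Small hypothetical dataset and parameter values.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([3.0, 0.0, 1.0, 6.0, 15.0])
a, b, c = 0.5, -1.0, 0.0

# Analytic gradient with respect to a, as in the training loop.
y_pred = predict_quadratic(x, a, b, c)
analytic = (2 / len(x)) * np.sum((y_pred - y) * x**2)

# Numerical estimate: nudge a slightly in both directions.
eps = 1e-5
loss_plus = mse_loss(predict_quadratic(x, a + eps, b, c), y)
loss_minus = mse_loss(predict_quadratic(x, a - eps, b, c), y)
numerical = (loss_plus - loss_minus) / (2 * eps)

print(analytic, numerical)
```

The two values should agree closely. The numerical version needs two extra loss evaluations per parameter, which is part of why analytic gradients scale better.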

### Model Class

Now that the basics of a 'model' are coming together, let's add our functions to a Python class to keep the methods together, make it easy to do training and prediction, and have somewhere to add further capabilities. 

In [None]:
class QuadraticModel:
    def __init__(self, seed=42):
        np.random.seed(seed)
        self.a = np.random.randn()
        self.b = np.random.randn()
        self.c = np.random.randn()
        self.predictions_history = []

    def predict(self, x):
        return self.a * x**2 + self.b * x + self.c

    def compute_gradients(self, x, y_true):
        y_pred = self.predict(x)
        error = y_pred - y_true
        N = len(x)
        dL_da = (2/N) * np.sum(error * x**2)
        dL_db = (2/N) * np.sum(error * x)
        dL_dc = (2/N) * np.sum(error)
        loss = mse_loss(y_pred, y_true)
        return loss, dL_da, dL_db, dL_dc

    def update_parameters(self, grads, lr):
        _, dL_da, dL_db, dL_dc = grads
        self.a -= lr * dL_da
        self.b -= lr * dL_db
        self.c -= lr * dL_dc

    def train(self, x, y, epochs=100, lr=0.01, verbose=True):
        loss_history = []

        for epoch in range(epochs):
            loss, dL_da, dL_db, dL_dc = self.compute_gradients(x, y)
            self.update_parameters((loss, dL_da, dL_db, dL_dc), lr)
            loss_history.append(loss)

            if epoch % 10 == 0 or epoch == epochs - 1:
                y_pred = self.predict(x)
                self.predictions_history.append((epoch, y_pred.copy()))
                if verbose:
                    print(f"Epoch {epoch:3}: a={self.a:.3f}, b={self.b:.3f}, c={self.c:.3f}, Loss={loss:.3f}")

        return loss_history

In [None]:
# Create dataset
np.random.seed(42)
X = np.linspace(-3, 3, 100)
y_true = 2 * X**2 + 3 * X + 1
noise = np.random.normal(0, 3, size=X.shape)
y = y_true + noise

# Train model
model = QuadraticModel()
loss_history = model.train(X, y, epochs=100, lr=0.01)

In [None]:
def plot_predictions_over_time(X, y, model):
    fig, axes = plt.subplots(2, 5, figsize=(18, 6), sharex=True, sharey=True)
    axes = axes.flatten()

    for i, (epoch, y_pred) in enumerate(model.predictions_history[:10]):
        ax = axes[i]
        ax.scatter(X, y, label='Data', alpha=0.3)
        ax.plot(X, y_pred, color='red', label='Prediction')
        ax.set_title(f"Epoch {epoch}")
        ax.legend()

    fig.suptitle("Prediction Progression Over Epochs", fontsize=16)
    plt.tight_layout()
    plt.show()

plot_predictions_over_time(X, y, model)


In [None]:
!pip install imageio --quiet

In [None]:
import imageio.v2 as imageio
import os

def create_training_gif(X, y, model, filename):
    temp_dir = "frames"
    os.makedirs(temp_dir, exist_ok=True)
    images = []

    for i, (epoch, y_pred) in enumerate(model.predictions_history):
        plt.figure(figsize=(6, 4))
        plt.scatter(X, y, alpha=0.3, label='Data')
        plt.plot(X, y_pred, color='red', label='Prediction')
        plt.title(f"Epoch {epoch}")
        plt.legend()
        plt.tight_layout()

        # Save frame
        frame_path = os.path.join(temp_dir, f"frame_{i:03}.png")
        plt.savefig(frame_path)
        plt.close()

        images.append(imageio.imread(frame_path))

    # Save as GIF
    imageio.mimsave(filename, images, fps=3)
    
    # Clean up
    for f in os.listdir(temp_dir):
        os.remove(os.path.join(temp_dir, f))
    os.rmdir(temp_dir)

    print(f"GIF saved to {filename}")


create_training_gif(X, y, model, filename="quadratic_training.gif")

![Animation](quadratic_training.gif)

### Backpropagation 

Now that we know how to calculate gradients and move in the right direction (descent), we can turn to backpropagation, which is how those gradients are calculated in a complex neural network. For those who remember high or secondary school calculus, backpropagation relies on the chain rule. Namely, the chain rule means we can take a complex function made of nested calculations (such as a neural network!) and derive how much each calculation contributes to the final output. 
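
As a warm-up, here is the chain rule applied to a small nested function, $f(x) = (3x + 1)^2$. The function and the input value are hypothetical and chosen only so the numbers are easy to check by hand.

```python
# Nested function: f(x) = (3x + 1)^2, split into two steps.
x = 2.0

# Forward pass: compute and keep the intermediate value.
u = 3.0 * x + 1.0   # inner function
f = u ** 2          # outer function

# Backward pass: local derivatives, multiplied together (chain rule).
df_du = 2.0 * u     # derivative of u^2 with respect to u
du_dx = 3.0         # derivative of 3x + 1 with respect to x
df_dx = df_du * du_dx

print(f, df_dx)  # 49.0 42.0
```

The intermediate value `u` saved during the forward pass is reused in the backward pass, and the same pattern appears in the quadratic example that follows.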


In [None]:
x = 2.0
a, b, c = 1.0, 2.0, 1.0
y_true = 15.0

# Forward pass
z1 = x**2         # z1 = x²
z2 = a * z1       # z2 = a * x²
z3 = b * x        # z3 = b * x
z4 = z2 + z3 + c  # z4 = ax² + bx + c
y_pred = z4

# Loss
loss = (y_pred - y_true)**2


In [None]:
# Backpropagation
dL_dypred = 2 * (y_pred - y_true)

# Gradients w.r.t each parameter
dypred_da = z1          # ∂y_pred/∂a = x²
dypred_db = x           # ∂y_pred/∂b = x
dypred_dc = 1           # ∂y_pred/∂c = 1

# Chain rule
dL_da = dL_dypred * dypred_da
dL_db = dL_dypred * dypred_db
dL_dc = dL_dypred * dypred_dc


In [None]:
# Single data point since we're just trying to learn the pattern
# This means that when x = 2, the function's output should be 15
# As a reminder, we're fitting a quadratic function: y = ax² + bx + c
x = 2.0
y_true = 15.0

# Parameters, or "weights" in our simple network
a, b, c = 1.0, 2.0, 1.0
learning_rate = 0.01

print("FORWARD PASS")
print("Following the computation forward, step by step:")

# --- Forward pass ---
# This is like signals flowing forward through a neural network
z1 = x**2         # First operation: square the input
print(f"z1 = x² = {x}² = {z1}")

z2 = a * z1       # Multiply by parameter 'a' 
print(f"z2 = a × z1 = {a} × {z1} = {z2}")

z3 = b * x        # Multiply input by parameter 'b'
print(f"z3 = b × x = {b} × {x} = {z3}")

z4 = z2 + z3 + c  # Add everything together (our prediction)
print(f"z4 = z2 + z3 + c = {z2} + {z3} + {c} = {z4}")

y_pred = z4
print(f"Final prediction: {y_pred}")

# Calculate how wrong we were
loss = (y_pred - y_true)**2
print(f"Target was {y_true}, we predicted {y_pred}")
print(f"Loss (squared error): ({y_pred} - {y_true})² = {loss:.4f}")

print("\nBACKWARD PASS")
print("Now we trace the error backwards to find which parameters need adjustments:")

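# --- Backward pass (chain rule) ---
# We work backwards from the loss, working out how much 
# each parameter contributed to inaccuracies

# Start with how the loss changes with the prediction:
# L = (y_pred - y_true)², so dL/dy_pred = 2 × (y_pred - y_true)
dL_dypred = 2 * (y_pred - y_true)
print(f"dL/dy_pred = 2 × ({y_pred} - {y_true}) = {dL_dypred:.4f}")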

# For parameter 'a': it affects prediction through z2 = a * x²
# Chain rule: dL/da = (dL/dy_pred) × (dy_pred/da)
# Since y_pred = ax² + bx + c, then dy_pred/da = x²
dL_da = dL_dypred * x**2
print(f"dL/da = dL/dy_pred × x² = {dL_dypred:.4f} × {x**2} = {dL_da:.4f}")
print(f"Parameter 'a' is multiplied by x²={x**2}, so it has {x**2}x the impact on our mistake")

# For parameter 'b': it affects prediction through z3 = b * x  
# dy_pred/db = x
dL_db = dL_dypred * x
print(f"dL/db = dL/dy_pred × x = {dL_dypred:.4f} × {x} = {dL_db:.4f}")
print(f"Parameter 'b' is multiplied by x={x}, so it has {x}x the impact")

# For parameter 'c': it's added directly to prediction
# dy_pred/dc = 1 (increasing c by 1 increases prediction by 1)
dL_dc = dL_dypred * 1
print(f"dL/dc = dL/dy_pred × 1 = {dL_dypred:.4f} × 1 = {dL_dc:.4f}")
print(f"Parameter 'c' is added directly, so it has 1x the impact")

print(f"\nRESULTS")
print(f"Parameter 'a' gets blame score: {dL_da:.4f}")
print(f"Parameter 'b' gets blame score: {dL_db:.4f}") 
print(f"Parameter 'c' gets blame score: {dL_dc:.4f}")
print("Bigger blame = bigger adjustment needed!")

print(f"\nGRADIENT DESCENT UPDATE")
print("Now we adjust each parameter in proportion to its gradient:")
print(f"Old parameters: a={a:.4f}, b={b:.4f}, c={c:.4f}")


# --- Gradient descent update ---
# Move each parameter in the opposite direction of its gradient
# (negative because we want to reduce loss, not increase it)
a -= learning_rate * dL_da
b -= learning_rate * dL_db
c -= learning_rate * dL_dc

print(f"a = a - learning_rate × dL/da = {a + learning_rate * dL_da:.4f} - {learning_rate} × {dL_da:.4f} = {a:.4f}")
print(f"b = b - learning_rate × dL/db = {b + learning_rate * dL_db:.4f} - {learning_rate} × {dL_db:.4f} = {b:.4f}")
print(f"c = c - learning_rate × dL/dc = {c + learning_rate * dL_dc:.4f} - {learning_rate} × {dL_dc:.4f} = {c:.4f}")

print(f"\nUpdated parameters: a={a:.4f}, b={b:.4f}, c={c:.4f}")
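
To confirm that the update helped, we can recompute the loss with the new parameters. The sketch below repeats the same single step in a self-contained form; the helper name `predict` is introduced here only for brevity.

```python
def predict(x, a, b, c):
    return a * x**2 + b * x + c

x, y_true = 2.0, 15.0
learning_rate = 0.01

# Parameters before the update.
a, b, c = 1.0, 2.0, 1.0
y_pred = predict(x, a, b, c)
old_loss = (y_pred - y_true) ** 2

# One gradient descent step using the chain-rule gradients.
dL_dypred = 2 * (y_pred - y_true)
a -= learning_rate * dL_dypred * x**2
b -= learning_rate * dL_dypred * x
c -= learning_rate * dL_dypred

new_loss = (predict(x, a, b, c) - y_true) ** 2
print(f"Loss before: {old_loss:.4f}, after: {new_loss:.4f}")
```

A single small step moves the prediction from 9 to 11.52, closer to the target of 15, so the loss falls from 36 to about 12.11. Repeating the step would continue to shrink it.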

### Summary

In this notebook, we covered the basic building blocks of a neural network including:
* Loss function
* Gradient descent
* Backpropagation

In the next notebook, we will look at how we introduce non-linearity into our networks with *activation* functions. 

