# Very little, almost nothing, on the nature of deep learning

<img src="deeper_learning.png" alt="Nested" width="700"/>

In [5]:
#!pip install tensorflow
#!python -m spacy download en_core_web_md
#import tensorflow as tf
#from tensorflow.keras import layers, models
import spacy
import numpy as np
import pandas as pd
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt
from IPython.display import Video
from IPython.display import display, HTML
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

In [6]:
video_url = "fading.mp4"

html_code = f"""
<h1 style="text-align: center;">I have seen things...</h1>
<video width="1400" height="900" controls loop>
  <source src="{video_url}" type="video/mp4">
  Your browser does not support the video tag.
</video>
"""

display(HTML(html_code))

## The TL;DR of deep learning

>\[Deep learning] is a type of machine learning, a technique that enables computer systems to improve with experience and data. \[...] \[M]achine learning is the only viable approach to building AI systems that can operate in complicated real-world environments. Deep learning is a particular kind of machine learning that achieves great power and flexibility by representing the world as a nested hierarchy of concepts, with each concept defined in terms of simpler concepts, and more abstract representations computed in terms of less abstract ones. (8)
>
> –– <cite>Goodfellow et al. (2016)</cite>

<img src="DL.png" alt="Nested" width="700"/>

Goodfellow, I., Bengio, Y., and Courville, A. (2016). <i>Deep Learning</i>. Cambridge MA: The MIT Press.

<div style="background-color: #ffffff; padding: 10px; border-radius: 5px;">

## 1. The basic unit of learning: The artificial neuron

Let's create the most basic possible neural network: a switch that turns on a light when the environment gets sufficiently dark. The elements of this system are as follows:

* <b>An input signal, $x$</b>: The reading from the light gauge
* <b>A weight, $w$</b>: The factor that scales the input signal (here it is negative, so the brighter the light, the smaller the weighted signal)
* <b>A bias, $b$</b>: The propensity of the neuron to fire
* <b>A summation function, $z$</b>: The combination of the input signal, the weight, and the bias ($z = w \cdot x + b$)
* <b>An activation function, $f(z)$</b>: Determines whether or not the neuron fires for a given value of $z$

![neuron](neuron.png)

### Sunshine
<img src="sun.png" alt="Nested" width="200"/>


* $x = 100$ lumens
* $w = (-1)$
* $b = 50$ units
* $z = -1 \cdot 100+50 = -50$ units
* $f(z) = \text{(Keep light OFF (0))}$

### Moonshine

<img src="moon.png" alt="Nested" width="200"/>

* $x = 5$ lumens
* $w = (-1)$
* $b = 50$ units
* $z = -1 \cdot 5+50 = 45$ units
* $f(z) = \text{(Turn light ON (1))}$

</div>
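The two cases above can be reproduced in a few lines of Python. This is only a sketch: the function name `light_switch_neuron` is made up for illustration, and it assumes a step activation that fires when $z$ is greater than $0$.

```python
def light_switch_neuron(x, w=-1.0, b=50.0):
    """Return (z, output) for the single-neuron light switch."""
    z = w * x + b               # summation: weighted input plus bias
    output = 1 if z > 0 else 0  # assumed step activation with a threshold of 0
    return z, output

for label, lumens in [("Sunshine", 100), ("Moonshine", 5)]:
    z, output = light_switch_neuron(lumens)
    print(f"{label}: x = {lumens}, z = {z}, light = {'ON' if output else 'OFF'}")
```

Changing `w` or `b` moves the brightness level at which the light switches on, which is all that "learning" amounts to for this neuron.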

## 2. Networks of neurons

In the previous example, it is easy to establish a rule for when the light should be switched on. Once we know the threshold (here, 0), we can adjust the weight and/or the bias to ensure that the light goes on or off when the threshold is crossed. But in many situations we cannot do this: at best, we have a fuzzy heuristic that guides us rather than a precise rule that we can easily communicate. How do we solve this problem?

> <b>Deep learning</b> resolves this issue by <i>learning from examples</i>. That is, it takes a neural network with random weights and random biases, and adjusts these weights and biases until the network generates the same outputs as the training examples. For this, a single neuron is not enough, on grounds of both efficiency and its [inability to learn complex, non-linear functions like $XOR$ (exclusive $OR$)](https://automaticaddison.com/linear-separability-and-the-xor-problem/).

Instead, neurons are aggregated into networks that consist of multiple inputs and at least two (though usually more) neurons in the hidden layer. The outputs of the hidden layer are passed to the final layer, which combines them and applies an activation function to produce the network's output. Consider the network below, which could be used for a binary classification task. There are three inputs and five neurons in the hidden layer. The activation function of the output neuron (here, the sigmoid function) takes the weighted states of the neurons in the hidden layer and maps them into a probability:

![five_neuron](nn.png)

This is what's known as a <i>feedforward</i> network, as information passes through it in one direction only, from input to output: there is no feedback mechanism. Though it seems complicated, it is strictly analogous to the single neuron example. The only difference is that this network combines the weights and biases of several neurons to produce the output from three inputs. This happens in the following steps:

1. Three input signals are given by $x_1$, $x_2$, and $x_3$.
2. Each of the three input signals is sent to every neuron in the hidden layer. Each neuron multiplies each input signal by the relevant connection weight ($w_{in}x_i$); for a neuron $n$ in the hidden layer, the results are summed across all three inputs and the neuron's bias is added once to produce $z_n$: $z_n = \sum_{i=1}^{3}w_{in}x_i + b_n$.
3. An activation function (here, the sigmoid function) is applied to each $z_n$, producing an activation of $a_n$: $a_n = \frac{1}{1 + e^{-z_n}}$.
4. The activation value for each neuron in the hidden layer is sent to the single neuron in the output layer. This neuron adds up the product of the connecting weights and the activation values for each preceding neuron and adds a bias, $b_{output}$. This gives: $z_{output}= w_{21} a_1 + w_{22} a_2 + w_{23} a_3 + w_{24} a_4 + w_{25} a_5 + b_{output}$.
5. The last step is to apply the activation function to $z_{output}$ to give $y$: $y = \frac{1}{1 + e^{-z_{output}}}$

What does this look like in practice? Let's work through an example.

In [None]:

# Set random values for all parameters
inputs = np.random.uniform(-100., 100., [3])
hidden_bias = np.random.rand(5)
output_bias = np.random.rand(1)
hidden_weights = np.random.uniform(-1., 1., [15])
output_weights = np.random.uniform(-1., 1., [5])

# Define the activation function
def activation(value):
    a = 1/(1+np.exp(-value))
    return a

# Multiply weights by input signal
hidden_layer_signal = np.array([inputs[0] * hidden_weights[:5], inputs[1] * hidden_weights[5:10], inputs[2] * hidden_weights[10:]])
print("The inputs are:\n\n", inputs, "\n")
print("The weights of the hidden layer are:\n\n", hidden_weights, "\n")
print("The weights x inputs to the hidden layer are:\n\n", hidden_layer_signal, "\n")
print("The biases of hidden layer are:\n\n", hidden_bias, "\n")

# Sum across the inputs for each neuron and add the bias
hidden_result = hidden_layer_signal.sum(axis = 0) + hidden_bias
print("Adding the three inputs for each neuron in the hidden layer to the bias for that neuron gives z_n:\n\n", hidden_result, "\n")

# Apply the activation function to the output from the hidden layer
active_hidden = activation(hidden_result)
print("Applying the activation function to each z_n gives a_n:\n\n", active_hidden, "\n")

# Multiply the activations from the hidden layer by the output layer weights
output_signal = output_weights * active_hidden
print("The bias of the output layer neuron, b_output, is:\n\n", output_bias, "\n")

# Sum the signals received by the output layer and add the bias
z_output = np.sum(output_signal) + output_bias
print("Summing the signals and adding the bias for the output layer gives:\n\n", z_output, "\n")

# Apply the activation function to z_output
output = activation(z_output)
print("Applying the activation function to z_output gives the network output, which is:\n\n", output, "\n")

if output >0.5:
    print("The predicted class is 1")
else:
    print("The predicted class is 0")


## 3. Loss functions

> #### With four parameters I can fit an elephant, and with five I can make him wiggle his trunk
>––John von Neumann


Our network has 26 parameters: 20 weights and six biases. The task of deep learning is to estimate values for these parameters that allow the network to accurately predict the correct classification for a specific input. This is usually done by training the model on a sample of data where we already have many examples of a correct classification $(x_1, x_2, x_3) \rightarrow y$. Training consists of the following steps:

1. Initialise the network with random values for its parameters
2. Present the network with inputs for which there is a known output value
3. Measure the gap between the predicted value and the known value––this is the <i>loss</i> of the model.
4. Establish the difference a small change in each parameter makes to the model's loss––does it make it bigger or smaller? (This is the <i>gradient</i> of the loss function.)
5. Average the loss over lots of training examples.
6. Adjust the parameter values in the direction of smaller loss.
7. When the loss stops falling (or gets worse) stop training.
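The steps above can be sketched for the simplest possible case: a single sigmoid neuron with one weight and one bias. Everything here is an assumption made for illustration: the toy data, the learning rate, the stopping tolerance, and the use of a mean squared error as the loss. The gradient is estimated literally as in step 4, by nudging each parameter and watching the loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(params, X, y):
    """Mean squared error of a single sigmoid neuron with params = [w, b]."""
    w, b = params
    return np.mean((y - sigmoid(w * X + b)) ** 2)

# Step 2: toy inputs with known outputs (the label is 1 when x is positive)
X = rng.uniform(-1.0, 1.0, size=50)
y = (X > 0).astype(float)

# Step 1: random initial values for the parameters
params = rng.uniform(-1.0, 1.0, size=2)
learning_rate, nudge, tolerance = 1.0, 1e-5, 1e-6

# Steps 3 and 5: the loss, averaged over all training examples
initial_loss = mse_loss(params, X, y)
loss = initial_loss

for _ in range(500):
    # Step 4: nudge each parameter a little and see how the loss responds
    gradient = np.zeros_like(params)
    for k in range(len(params)):
        shifted = params.copy()
        shifted[k] += nudge
        gradient[k] = (mse_loss(shifted, X, y) - loss) / nudge

    # Step 6: move the parameters in the direction of smaller loss
    candidate = params - learning_rate * gradient
    new_loss = mse_loss(candidate, X, y)

    # Step 7: stop once the loss has stopped falling
    if loss - new_loss < tolerance:
        break
    params, loss = candidate, new_loss

final_loss = loss
print(f"Loss before training: {initial_loss:.4f}, after training: {final_loss:.4f}")
```

Nudging every parameter in turn is far too slow for a real network, which is why the gradient is normally obtained with calculus instead.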

The first issue we need to tackle here is the notion of a <b>loss function</b>. This is a function that measures the error of the model. There are many examples of loss functions, but the most common is the <i>Mean Squared Error</i> (MSE):

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2$$

Here, $y_i$ represents the true value for the input $x_i$, and $\hat y_i$ the model's predicted value. The squared differences are then averaged across all $n$ inputs (note that $x_i$ may denote a set of inputs rather than a single value). Intuition for the MSE can be found readily enough from a linear regression. For every value of $x$, the blue dot represents the true value and the black line the predicted value. The red lines (known as residuals) are the difference between the two. These are squared and averaged to get the MSE.

![MSE](MSE.png)
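As a quick illustration, the MSE formula can be computed directly with NumPy. The function name and the four data points below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals between true and predicted values."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(residuals ** 2))

# Hypothetical true values and predictions
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.0, 5.0]

# Squared residuals are 0.25, 0.25, 0.0 and 1.0, so the mean is 1.5 / 4 = 0.375
print(mean_squared_error(y_true, y_pred))
```

Because the residuals are squared, the one prediction that is off by a whole unit contributes more to the loss than the two predictions that are each off by half a unit combined.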

Let's imagine that a given input, $x_i$, has a true output of $1$. What happens as the estimated value varies? This gives the following result for the squared error (not the MSE!):

1. $SE = (1-\hat y)^2$
2. $SE = (1-\hat y)(1-\hat y)$
3. $SE = 1-\hat y -\hat y + \hat y^2$
4. $SE = \hat y^2 - 2\hat y +1$

Plotting this gives the following curve for the value of the $SE$ as a function of $\hat y$:

![Squared Error](SE_min_plot.png)

## The rate of change of the loss function

How do we minimise the error in our loss function? Mathematically, we do this by finding where the rate of change of the function is zero. The rate of change at a point is defined as the slope of the tangent at that point. We can, for instance, measure the rate of change of the $SE$ by taking the slope of the tangent at $\hat{y_i} = 3$.

![tangent](SE_tangent_plot.png)

In practice, we get the rate of change of the loss function by using calculus. This involves differentiating the function, which gives us an expression for the slope of the tangent at any point on a curve. Differentiation is a huge topic that we don't have time to go into, but it is straightforward to differentiate our $SE$ function and get the rate of change of the $SE$ with respect to $\hat{y_i}$:

$$\frac{dSE}{d\hat{y_i}} = 2\hat{y_i} -2$$

Here, the $d$ symbol should be read as 'delta', with $dSE$ meaning 'a small quantity of $SE$'. The ratio $\frac{dSE}{d\hat{y_i}}$ is called the derivative, and by plugging in values for $\hat{y_i}$ we can get the slope of the tangent at that point. What makes this especially useful is that it allows us to figure out the minimum points in our loss function. Specifically, by setting the derivative equal to zero and solving for $\hat{y_i}$, we get the value of $\hat{y_i}$ for which the error is at its minimum (here, zero):

1. $2\hat{y_i} -2 = 0$
2. $2\hat{y_i} = 2$
3. $\hat{y_i} = 1$
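The derivative can be checked numerically. The sketch below compares the analytic expression $2\hat{y} - 2$ with a slope measured from two nearby points on the curve; the function names and the step size `h` are choices made for this illustration.

```python
def squared_error(y_hat, y_true=1.0):
    """Squared error for a single prediction when the true value is 1."""
    return (y_true - y_hat) ** 2

def derivative(y_hat):
    """The analytic derivative from the text: dSE/dy_hat = 2 * y_hat - 2."""
    return 2.0 * y_hat - 2.0

def numerical_slope(y_hat, h=1e-6):
    """Slope of the tangent, estimated from two points either side of y_hat."""
    return (squared_error(y_hat + h) - squared_error(y_hat - h)) / (2.0 * h)

for y_hat in [3.0, 1.0, 0.0]:
    print(f"y_hat = {y_hat}: analytic = {derivative(y_hat)}, "
          f"numerical = {numerical_slope(y_hat):.4f}")
```

At $\hat{y} = 3$ both approaches give a slope of about $4$, and at $\hat{y} = 1$ the slope is zero, which is where the squared error bottoms out.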

### Backpropagation

Backpropagation is how the network updates its parameters. It does this by taking a mathematical expression for how the loss changes with respect to every parameter in the network and changing the value of each parameter in the direction of smaller loss. In our neural network, like most neural networks, we have a series of functions, where the output of one function is the input into another. For example, the hidden layer activations feed into the output layer. This means that any attempt to minimise the loss function must ultimately operate on all the parameters of the model.

In our example, the model outputs a value $y$ between $0$ and $1$. This value is computed by plugging the value of $z_{output}$ into the sigmoid function. This means that the loss function, $L(y)$, also depends on $z_{output}$. However, $z_{output}$ in its turn depends on the weights of the output layer and the activations of the hidden layer: $z_{output} = w_{21} a_1 + w_{22} a_2 + w_{23} a_3 + w_{24} a_4 + w_{25} a_5 + b_{output}$. In their turn, the activations of the hidden layer depend on the hidden layer's weights and biases and on the network inputs. How do we calculate the rate of change of the loss function in this situation? The <b>chain rule of calculus</b> enables us to do this. This says that for any differentiable functions $z = f(y)$ and $y = g(x)$:

$$\frac{dz}{dx} = \frac{dz}{dy}\cdot\frac{dy}{dx}$$

What this allows us to do is to express the loss function in terms of the network inputs, and calculate the individual contributions each weight (via the activation function) makes to this loss. We're not going to derive all of these (though it's relatively straightforward); instead we'll trace the impact of a single input $x_i$ through one hidden neuron $j$. (Remember that the $z_j$ variable here is a function of the network weights, so it 'contains' them.)

$$ \text{Weights: }\frac{dy}{dx_i} = \frac{dy}{dz_{\text{output}}} \cdot \frac{dz_{\text{output}}}{da_j} \cdot \frac{da_j}{dz_j} \cdot \frac{dz_j}{dx_i}$$

As the biases also impact on the loss, we have a similar expression for the bias in the hidden layer:

$$
\text{Biases: }\frac{d y}{d b_j} = \frac{d y}{d z_{\text{output}}} \cdot \frac{d z_{\text{output}}}{d a_j} \cdot \frac{d a_j}{d z_j} \cdot \frac{d z_j}{d b_j}
$$
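To make the chain of factors concrete, the sketch below multiplies them together for the hidden-layer biases of a 3-5-1 network and compares the result with a slope measured by nudging each bias. All parameter values are random and hypothetical, and the sigmoid derivative $\sigma(z)(1 - \sigma(z))$ is used for the two activation factors.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameter values for a network with 3 inputs and 5 hidden neurons
x = rng.uniform(-1.0, 1.0, size=3)
W_hidden = rng.uniform(-1.0, 1.0, size=(3, 5))  # W_hidden[i, j]: input i to hidden neuron j
b_hidden = rng.uniform(0.0, 1.0, size=5)
w_output = rng.uniform(-1.0, 1.0, size=5)
b_output = 0.5

def forward(hidden_biases):
    """Forward pass returning the hidden activations and the network output."""
    z = x @ W_hidden + hidden_biases     # z_j for each hidden neuron
    a = sigmoid(z)                       # a_j
    z_output = a @ w_output + b_output
    return a, sigmoid(z_output)

a, y = forward(b_hidden)

# The four factors of the chain rule, for every hidden neuron j at once
dy_dz_output = y * (1.0 - y)   # slope of the sigmoid at z_output
dz_output_da = w_output        # z_output is linear in each a_j
da_dz = a * (1.0 - a)          # slope of the sigmoid at each z_j
dz_db = 1.0                    # each z_j contains b_j exactly once
dy_db = dy_dz_output * dz_output_da * da_dz * dz_db

# Check: nudge each bias in turn and measure how much the output moves
h = 1e-6
numerical = np.zeros(5)
for j in range(5):
    shifted = b_hidden.copy()
    shifted[j] += h
    numerical[j] = (forward(shifted)[1] - y) / h

print("Chain rule:", dy_db)
print("Numerical: ", numerical)
```

The two rows should agree to several decimal places, which is a useful sanity check whenever a gradient is derived by hand.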

Backpropagation works by randomly assigning a value to each parameter, calculating the loss, then adjusting each parameter in the direction of lower loss.


### Single neuron example

Let's return to our neural network and take a single example: the weight $w_{21}$, which connects the first hidden neuron to the output neuron. Let's assume our loss function is the $MSE$, here the squared error associated with a single fixed input. Our job is to figure out how the loss changes with this specific weight and adjust the weight in the direction of the lower loss. This would be done for every weight and bias in the network.

![five_neuron](nn.png)

#### 1. Mean Squared Error (MSE) Loss Function:
The loss for a single input is defined as:

$L = \frac{1}{2} (y_{\text{pred}} - y_{\text{true}})^2$

where:

* $y_{\text{pred}}$ is the predicted output of the network.
* $y_{\text{true}}$ is the true label.

---

#### 2. Network Output and Activation:
The output of the network is given by:

$y = \sigma(z_{\text{output}}) = \frac{1}{1 + e^{-z_{\text{output}}}}$

where:

$z_{\text{output}} = w_{21} a_1 + w_{22} a_2 + w_{23} a_3 + w_{24} a_4 + w_{25} a_5 + b_{\text{output}}$

and $a_1, a_2, a_3, a_4, a_5$ are the activations of the hidden layer neurons:

$a_j = \sigma(z_j) = \frac{1}{1 + e^{-z_j}}$

---

#### 3. Gradient of Loss with Respect to Network Output:
First, we differentiate the loss function with respect to $y$. (Note that the $\partial$ symbol here means we're taking a partial derivative: we're only interested in the effects of varying $w_{21}$ and the functions it feeds into, and are keeping every other weight and bias fixed.)

$\frac{\partial L}{\partial y} = (y_{\text{pred}} - y_{\text{true}})$

Since the network output uses the sigmoid activation function, we compute:

$\frac{\partial y}{\partial z_{\text{output}}} = y_{\text{pred}} (1 - y_{\text{pred}})$

Thus, by the chain rule:

$\frac{\partial L}{\partial z_{\text{output}}} = (y_{\text{pred}} - y_{\text{true}}) \cdot y_{\text{pred}}(1 - y_{\text{pred}})$

---

#### 4. Gradient of Loss with Respect to $w_{21}$:
Since $z_{\text{output}}$ depends on $w_{21}$, we compute:

$\frac{\partial z_{\text{output}}}{\partial w_{21}} = a_1$

Applying the chain rule:

$\frac{\partial L}{\partial w_{21}} = \frac{\partial L}{\partial z_{\text{output}}} \cdot \frac{\partial z_{\text{output}}}{\partial w_{21}}$

Substituting the values:

$\frac{\partial L}{\partial w_{21}} = (y_{\text{pred}} - y_{\text{true}}) \cdot y_{\text{pred}}(1 - y_{\text{pred}}) \cdot a_1$

---

### Summary: Final Gradient Expression:
$\frac{\partial L}{\partial w_{21}} = (y_{\text{pred}} - y_{\text{true}}) \cdot y_{\text{pred}}(1 - y_{\text{pred}}) \cdot a_1$

This expression gives the gradient of the loss function with respect to the weight $w_{21}$, which is used to update the weight during gradient descent. 

* If $\frac{\partial L}{\partial w_{21}} > 0$, then increasing $w_{21}$ increases the loss, so we decrease $w_{21}$
* If $\frac{\partial L}{\partial w_{21}} < 0$, then increasing $w_{21}$ decreases the loss, so we increase $w_{21}$

In both cases we move against the gradient, that is, in the negative gradient direction.
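The final expression can be tried out on concrete numbers. In the sketch below the hidden activations, output weights, bias, true label and learning rate are all hypothetical values picked for illustration; the gradient for $w_{21}$ is computed from the formula above, checked against a finite-difference estimate, and then used for one gradient descent update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values: hidden activations a_1..a_5 and output weights w_21..w_25
a = np.array([0.9, 0.2, 0.5, 0.7, 0.1])
w_output = np.array([0.4, -0.3, 0.8, -0.6, 0.1])
b_output = 0.05
y_true = 1.0
learning_rate = 0.5  # assumed step size

def loss(weights):
    """L = 1/2 * (y_pred - y_true)^2 for a given set of output weights."""
    y_pred = sigmoid(a @ weights + b_output)
    return 0.5 * (y_pred - y_true) ** 2

# Gradient from the final expression: (y_pred - y_true) * y_pred * (1 - y_pred) * a_1
y_pred = sigmoid(a @ w_output + b_output)
grad_w21 = (y_pred - y_true) * y_pred * (1.0 - y_pred) * a[0]

# Finite-difference check: nudge w_21 only and measure the change in the loss
h = 1e-6
nudged = w_output.copy()
nudged[0] += h
numerical = (loss(nudged) - loss(w_output)) / h

# One gradient descent step on w_21: move against the gradient
updated = w_output.copy()
updated[0] -= learning_rate * grad_w21

print(f"Analytic gradient:  {grad_w21:.6f}")
print(f"Numerical gradient: {numerical:.6f}")
print(f"w_21: {w_output[0]:.4f} -> {updated[0]:.4f}")
print(f"Loss: {loss(w_output):.6f} -> {loss(updated):.6f}")
```

Here the prediction is below the true label, so the gradient is negative and the update increases $w_{21}$, which lowers the loss slightly.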




In [7]:
video_url = "blade_runner.mp4"

html_code = f"""
<h1 style="text-align: center;">Because you've never seen a miracle...</h1>
<video width="1400" height="900" controls>
  <source src="{video_url}" type="video/mp4">
  Your browser does not support the video tag.
</video>
"""

display(HTML(html_code))

<div style="background-color: #f6f6f6; padding: 10px; border-radius: 5px;">

## The miracle of the loaves and fishes

![loaves](loaves.jpeg)



### Let's first train a network on some noise

In [None]:

import tensorflow as tf
from tensorflow.keras import layers, models


# Define the model architecture
model = models.Sequential([
    layers.Dense(5, activation='sigmoid', input_shape=(3,)),  # Hidden layer with 5 neurons
    layers.Dense(1, activation='sigmoid')  # Output layer (binary classification)
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Print model summary
model.summary()

# Generate dummy data (100 samples, 3 input features)
X = np.random.rand(100, 3)  # 100 samples, 3 features
y = np.random.randint(0, 2, size=(100, 1))  # 100 binary labels (0 or 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=10, verbose=1)

# Make predictions on the test set
y_pred_prob = model.predict(X_test)  # Get probabilities
y_pred = (y_pred_prob > 0.5).astype(int)  # Convert probabilities to binary labels

# Generate and print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Retrieve weights and biases
for layer in model.layers:
    weights, biases = layer.get_weights()
    print(f"\nLayer: {layer.name}")
    print(f"Weights:\n{weights}")
    print(f"Biases:\n{biases}")


## Now let's train a network on some language data––in this case, food words and non-food words

We can do this by taking the word embeddings for food and non-food words from the `spaCy` medium language model. In this model, each word embedding is a 300-dimensional vector of weights that can be used to predict which words co-occur.

In [None]:

# Load the medium language model
nlp = spacy.load("en_core_web_md")


words = [
        # 🍕 Food words (1)
        "apple", "banana", "carrot", "bread", "cheese", "chicken", "chocolate", "coffee", "cookie", "donut",
        "egg", "fish", "grape", "honey", "icecream", "jam", "ketchup", "lemon", "mango", "milk",
        "noodles", "orange", "pancake", "pepper", "pizza", "popcorn", "pumpkin", "rice", "salad", "salt",
        "sandwich", "sausage", "soup", "spaghetti", "spinach", "strawberry", "sugar", "sushi", "tea", "tomato",
        "turkey", "vanilla", "waffle", "watermelon", "yogurt", "zucchini", "beef", "pasta", "coconut", "burger",
        
        # 🚫 Non-food words (0)
        "car", "bottle", "chair", "laptop", "phone", "television", "pencil", "candle", "mirror", "window",
        "book", "notebook", "desk", "computer", "camera", "keyboard", "mouse", "lamp", "sofa", "door",
        "shoes", "sock", "jacket", "shirt", "trousers", "hat", "watch", "wallet", "backpack", "glasses",
        "earphones", "radio", "guitar", "piano", "violin", "painting", "clock", "newspaper", "magazine", "bicycle",
        "bus", "train", "plane", "ship", "road", "bridge", "building", "island", "mountain", "river"
    ]

classes = [
        # 🍕 1 for Food Words
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

        # 🚫 0 for Non-Food Words
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    ]

vectors = [nlp(i).vector for i in words]

vecs_df = pd.DataFrame(vectors)
vecs_df['word'] = words
vecs_df['label'] = classes
vecs_df = vecs_df.sample(frac = 1)

X = vecs_df.drop(['word', 'label'], axis = 1)
y = vecs_df['label']


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the model architecture
model = models.Sequential([
    layers.Dense(5, activation='relu', input_shape=(300,)),  # Hidden layer with 5 neurons
    layers.Dense(1, activation='sigmoid')  # Output layer (binary classification)
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Print model summary
model.summary()

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=5, verbose=1)

# Make predictions on the test set
y_pred_prob = model.predict(X_test)  # Get probabilities
y_pred = (y_pred_prob > 0.5).astype(int)  # Convert probabilities to binary labels

# Generate and print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Retrieve weights and biases
for layer in model.layers:
    weights, biases = layer.get_weights()
    print(f"\nLayer: {layer.name}")
    print(f"Weights:\n{weights}")
    print(f"Biases:\n{biases}")



In [None]:
def pred(term):
    word = nlp(term).vector.reshape(1,-1)
    p = model.predict(word)
    if p[0][0] >=0.5:
        return {'food': float(p[0][0])}
    else:
        return {'not food': float(p[0][0])}


In [None]:
pred('salad')

In [None]:
lemon = np.array(nlp('lemon').vector)

In [None]:
for layer in model.layers:
    weights, biases = layer.get_weights()
    print(f"\nLayer: {layer.name}")
    print(f"Weights:\n{weights}")
    print(f"Biases:\n{biases}")

In [4]:
video_url = "memories_.mp4"

html_code = f"""
<h1 style="text-align: center;">All the best memories are hers...</h1>
<video width="1400" height="900" controls>
  <source src="{video_url}" type="video/mp4">
  Your browser does not support the video tag.
</video>
"""

display(HTML(html_code))

# What is thinking?

<img src="mic_.png" alt="Nested" width="700"/>
