# Lesson 4: Math Behind Neural Networks

## Math of Neural Networks and the Universal Approximation Theorem

Neural networks are computational systems inspired by the biological neural networks found in human and animal brains. At their core, these networks consist of layers of nodes, or "neurons," each of which applies a simple computation to its inputs. The Universal Approximation Theorem provides the theoretical foundation for these systems, offering assurance that neural networks have the capacity to model a wide variety of functions given sufficient complexity and proper configuration.

### Mathematical Representation of a Neural Network

At the simplest level, a neural network can be thought of as a function \( f: \mathbb{R}^n \rightarrow \mathbb{R}^m \) where \( n \) is the dimensionality of the input vector and \( m \) is the dimensionality of the output vector. A basic feed-forward neural network with one hidden layer can be mathematically represented as:

\[
f(x) = \sigma(W_2 \cdot \sigma(W_1 \cdot x + b_1) + b_2)
\]

Where:
- \( x \) is the input vector.
- \( W_1 \) and \( W_2 \) are matrices representing the weights of the first and second layers, respectively.
- \( b_1 \) and \( b_2 \) are vectors representing the biases of the first and second layers, respectively.
- \( \sigma \) represents the activation function applied element-wise. Common choices for \( \sigma \) include the sigmoid function, ReLU (Rectified Linear Unit), and tanh (hyperbolic tangent).
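To make the formula concrete, here is a minimal NumPy sketch of the forward pass. The layer sizes, the random parameter values, and the helper name `forward` are illustrative assumptions, not part of the lesson's definition.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2, activation=sigmoid):
    """Compute f(x) = activation(W2 @ activation(W1 @ x + b1) + b2)."""
    hidden = activation(W1 @ x + b1)      # hidden-layer activations, shape (h,)
    return activation(W2 @ hidden + b2)   # network output, shape (m,)

# Illustrative sizes (assumptions): n = 3 inputs, h = 4 hidden units, m = 2 outputs
rng = np.random.default_rng(0)
n, h, m = 3, 4, 2
W1, b1 = rng.normal(size=(h, n)), rng.normal(size=h)
W2, b2 = rng.normal(size=(m, h)), rng.normal(size=m)

x = np.array([0.5, -1.0, 2.0])
output = forward(x, W1, b1, W2, b2)
print(output.shape)  # (2,)
```

Note that \( W_1 \) has shape \( h \times n \) and \( W_2 \) has shape \( m \times h \), where \( h \) is the number of hidden neurons.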

### The Role of the Activation Function

The activation function is a vital component of a neural network. As its name implies, it governs the output, or 'activation,' of a neuron. Its importance lies in its unique ability to introduce non-linearity into the model, which broadens the range and complexity of functions the network can represent.

Without activation functions, a neural network with many layers would merely apply a sequence of linear (more precisely, affine) transformations to the input data. No matter how many of them you compose, the result is just another linear transformation. Thus, a neural network without activation functions, however many layers it has, can represent only what a single linear layer can, which is the same family of models as linear regression.
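The collapse of stacked linear layers can be checked numerically. The sketch below uses arbitrary illustrative shapes and random values: it passes an input through two activation-free layers and compares the result with a single layer whose weights are \( W_2 W_1 \) and whose bias is \( W_2 b_1 + b_2 \).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function (sizes are illustrative)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Pass x through both layers in sequence
two_layer_output = W2 @ (W1 @ x + b1) + b2

# The same map written as a single layer: W = W2 W1, b = W2 b1 + b2
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer_output = W_combined @ x + b_combined

print(np.allclose(two_layer_output, one_layer_output))  # True
```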

By introducing non-linearity via activation functions, the network is empowered to learn from and represent much more complex patterns in the data.

#### Common Activation Functions

- **Sigmoid function**: Outputs a value between 0 and 1, which makes it useful for representing probabilities in binary classification problems. However, its gradient shrinks toward zero for inputs of large magnitude, so it suffers from the vanishing gradient problem, which limits its use in deep networks.
  
  \[
  \sigma(x) = \frac{1}{1 + e^{-x}}
  \]

- **Hyperbolic Tangent (tanh)**: Outputs a value between -1 and 1. It is a scaled and shifted version of the sigmoid function, \( \tanh(x) = 2\sigma(2x) - 1 \), and, like sigmoid, it is prone to vanishing gradients.

  \[
  \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
  \]

- **Rectified Linear Unit (ReLU)**: Keeps positive inputs unchanged and outputs 0 for negative inputs. It's simple, computationally efficient, and widely used in many neural networks. However, a neuron whose input stays negative always outputs 0 and receives zero gradient, so it can become a "dead" neuron that stops learning.

  \[
  \text{ReLU}(x) = \max(0, x)
  \]

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)
```

These functions enable the neural network to model a diversity of complex, non-linear phenomena, making them indispensable in the world of deep learning.
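The vanishing gradient and dead neuron remarks above can be seen directly in the derivatives of these functions. The following sketch (helper names are illustrative) uses the standard identities \( \sigma'(x) = \sigma(x)(1 - \sigma(x)) \) and \( \tanh'(x) = 1 - \tanh^2(x) \), and evaluates each derivative at a few sample points.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)            # peaks at 0.25 when x = 0

def tanh_grad(x):
    return 1 - np.tanh(x) ** 2    # peaks at 1 when x = 0

def relu_grad(x):
    return (x > 0).astype(float)  # 1 for positive inputs, 0 otherwise

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print("sigmoid'", sigmoid_grad(xs))
print("tanh'   ", tanh_grad(xs))
print("relu'   ", relu_grad(xs))
```

For inputs far from zero, the sigmoid and tanh derivatives are close to zero, which is what slows learning in deep stacks of such layers. The ReLU derivative is exactly zero for every negative input, which is how a neuron can stop updating.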

### The Universal Approximation Theorem - Simplified Explanation and Code

The Universal Approximation Theorem (UAT) is a key mathematical result about what neural networks can represent. It states that a feed-forward network with just one hidden layer (a layer between the input and output) containing a finite number of neurons (nodes where computation takes place) can approximate any continuous function on a closed, bounded region of the input space to any desired accuracy, provided the hidden layer is large enough and the activation function is suitable.

Imagine the hidden layer as a talented ensemble of artists. If you have a picture (a function) that you'd like them to recreate, they can do it with their collective skill set. Each artist (neuron) specializes in a different type of stroke or style, and together they combine their talents to reproduce the image. To replicate more complex pictures (functions), you might need more artists (neurons) or artists capable of a broader range of styles (a non-linear activation function). The Universal Approximation Theorem guarantees that, with enough artists, the picture can be recreated to any desired level of accuracy. It does not say how many artists are needed or how to coordinate them, which is the job of training.

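Here, the artist's style is analogous to the activation function in a neural network, which is typically a non-linear function that transforms the input the neuron receives. The Universal Approximation Theorem does come with a caveat about this function: classical statements of the theorem assume a sigmoid-like activation that is non-constant, bounded, and continuous (often also monotonically increasing). Later extensions relax this requirement and cover any non-polynomial activation, including the unbounded ReLU.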
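Stated a little more formally, one common form of the theorem for a single real-valued output reads as follows. The symbols \( K \), \( \varepsilon \), \( N \), \( c_i \), \( w_i \), and \( b_i \) are notation introduced here for illustration. For every continuous function \( f \) on a closed, bounded set \( K \subset \mathbb{R}^n \) and every tolerance \( \varepsilon > 0 \), there exist a number of hidden neurons \( N \) and parameters \( c_i, b_i \in \mathbb{R} \), \( w_i \in \mathbb{R}^n \) such that:

\[
\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} c_i \, \sigma(w_i^\top x + b_i) \right| < \varepsilon
\]

In this form the output is a plain weighted sum of the hidden activations, with no activation applied to the final layer.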

To implement the concept in code and understand it better, let's explore a simple example:

```python
import numpy as np
import matplotlib.pyplot as plt

# Define a target function 
def target_function(x):
    return x * np.sin(x)

# Define the points where the function will be evaluated
x = np.linspace(0, 10, 100)

# Apply the target function 
y = target_function(x) 

# Plot the target function
plt.plot(x, y, label=r"Target Function: $f(x) = x*\sin(x)$")  # raw string keeps \s literal

# Let's simulate an approximation using a neural network 
n_neurons = 10
np.random.seed(42) 

# Simulate random weights and biases for each neuron
weights = np.random.rand(n_neurons)
biases = np.random.rand(n_neurons)

# Simulate neurons
neurons = np.tanh(weights * x.reshape(-1, 1) + biases)

# Fit only the output weights by least squares; the hidden weights stay random
coefficients = np.linalg.lstsq(neurons, y, rcond=None)[0]

# approximate function
y_approx = neurons @ coefficients

plt.plot(x, y_approx, label="Neural Network Approximation")
plt.legend()
plt.show()
```

With just 10 neurons and the tanh activation function, the plot shows how closely the network tracks the target function \( f(x) = x \cdot \sin(x) \). Because the hidden weights here are random and only the output weights are fitted, the quality of the fit depends on the random draw. More complex functions may require more hidden neurons or additional layers, but according to the Universal Approximation Theorem, they can still be approximated by a neural network!
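To see the "more neurons, better approximation" idea in numbers, the sketch below repeats the random-feature setup from the example above for several hidden-layer sizes and reports the root-mean-square error of each fit. The sampling ranges for the weights and biases are illustrative assumptions, not tuned values. The hidden parameters are drawn once so that each smaller network is a subset of the larger ones, which means the least-squares error should not increase as neurons are added (up to numerical precision).

```python
import numpy as np

def target_function(x):
    return x * np.sin(x)

x = np.linspace(0, 10, 200)
y = target_function(x)

# Draw hidden-layer parameters once so smaller networks are subsets of larger ones.
# The sampling ranges below are illustrative assumptions.
rng = np.random.default_rng(0)
max_neurons = 50
weights = rng.uniform(-2, 2, size=max_neurons)
biases = rng.uniform(-10, 10, size=max_neurons)
hidden = np.tanh(weights * x.reshape(-1, 1) + biases)  # shape (200, 50)

errors = {}
for n_neurons in (2, 5, 10, 20, 50):
    features = hidden[:, :n_neurons]
    # Fit only the output weights by least squares; hidden weights stay fixed
    coefficients = np.linalg.lstsq(features, y, rcond=None)[0]
    y_approx = features @ coefficients
    errors[n_neurons] = np.sqrt(np.mean((y - y_approx) ** 2))

for n_neurons, rmse in errors.items():
    print(f"{n_neurons:>2} neurons: RMSE = {rmse:.4f}")
```

This measures error only on the sampled points. It illustrates the trend the theorem describes, but it says nothing about how well a network trained by gradient descent would do.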

### Deep Neural Networks & The Universal Approximation Theorem

The Universal Approximation Theorem (UAT) in its original form pertains to neural networks with just a single hidden layer. In practice, however, networks often have many more layers, and these are what we call Deep Neural Networks.

In practice, depth often makes a significant difference. When you add more hidden layers, you are essentially introducing a hierarchy of concepts learned by the neural network. For example, in a deep neural network designed for image recognition, the initial layers might learn to recognize simple patterns like edges, the middle layers may combine these patterns to recognize slightly more complex shapes, and the last layers might identify high-level features such as an entire object.

Interestingly, while the original UAT does not directly apply to deep networks, subsequent research and extensions of the theorem do indicate that deep networks can be more efficient at approximating complex functions compared to shallow networks. Specifically, certain functions that could be compactly represented in a deep network might require exponentially more neurons to be represented in a shallow network.
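One well-known construction that illustrates this depth advantage uses a "tent" function built from two ReLU units. Composing it with itself \( k \) times (a network of depth \( k \) with \( 2k \) hidden units) produces a sawtooth with \( 2^k \) linear pieces. A single-hidden-layer ReLU network on a one-dimensional input gains at most one new linear piece per neuron, so it would need on the order of \( 2^k \) neurons to match. The sketch below counts the pieces numerically; the helper names are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def tent(x):
    """One small ReLU layer (2 units): maps [0, 1] onto [0, 1] as a triangle."""
    return 2 * relu(x) - 4 * relu(x - 0.5)

def count_linear_pieces(y):
    """Count linear pieces of a sampled piecewise-linear curve via slope sign changes."""
    slopes = np.sign(np.diff(y))
    return int(np.sum(slopes[1:] != slopes[:-1])) + 1

# Dyadic grid so every breakpoint lands exactly on a sample point
x = np.linspace(0, 1, 2**12 + 1)

pieces = []
y = x
for depth in range(1, 6):
    y = tent(y)  # stack one more layer
    pieces.append(count_linear_pieces(y))
    print(f"depth {depth}: {2 * depth} hidden units, {pieces[-1]} linear pieces")
```

The number of hidden units grows linearly with depth while the number of linear pieces doubles at every layer.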

Let's revisit our previous example, using a deeper network this time:

```python
import numpy as np
import matplotlib.pyplot as plt

# Define a target function 
def target_function(x):
    return x * np.sin(x)

# Define the points where the function will be evaluated
x = np.linspace(0, 10, 100)

# Apply the target function 
y = target_function(x) 

# Plot the target function
plt.plot(x, y, label=r"Target Function: $f(x) = x*\sin(x)$")  # raw string keeps \s literal

# Let's simulate an approximation using a deeper neural network 
np.random.seed(42) 

# Simulate random weights and biases for two hidden layers of 10 neurons each
weights_1 = np.random.rand(10)
biases_1 = np.random.rand(10)
# The second layer is fully connected, so it needs a 10x10 weight matrix;
# centering the weights around zero helps keep tanh from saturating
weights_2 = np.random.uniform(-1, 1, size=(10, 10))
biases_2 = np.random.rand(10)

# Simulate the first layer of neurons
neurons_1 = np.tanh(weights_1 * x.reshape(-1, 1) + biases_1)

# Each second-layer neuron combines the outputs of all first-layer neurons
neurons_2 = np.tanh(neurons_1 @ weights_2 + biases_2)

# Fit only the output weights by least squares; the hidden weights stay random
coefficients = np.linalg.lstsq(neurons_2, y, rcond=None)[0]

# approximate function
y_approx = neurons_2 @ coefficients

plt.plot(x, y_approx, label="Deep Neural Network Approximation")
plt.legend()
plt.show()
```

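With more hidden layers (each like a group of artists working at a different level of detail), a deep network can build its approximation from increasingly abstract features. In this toy example the hidden weights are random and only the output weights are fitted, so the deeper version is not guaranteed to fit better than the shallow one; the benefits of depth show up when all layers are trained. This ability to build up layers of abstraction is why deep networks are successful in tasks like image recognition, speech recognition, and natural language processing.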

## Summary and How TensorFlow Hides Away the Math Complexity

Neural networks, though simple in principle, become intricate as the layers deepen and the connections multiply. The Universal Approximation Theorem guarantees that, given a sufficient number of neurons, these networks can approximate a wide array of functions. Building and training such networks in practice, however, usually calls for dedicated software tools.

TensorFlow is a powerful tool for designing, training, and deploying neural networks. TensorFlow abstracts away much of the mathematical complexity involved in neural network implementation, allowing researchers and developers to focus more on architecture and problem-solving rather than the underlying calculus and linear algebra.

### Here's a simplified breakdown of how TensorFlow achieves this:

1. **Automatic Differentiation**: TensorFlow automatically computes gradients, a key step in training neural networks, especially for backpropagation.
  
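2. **Graph Computations**: TensorFlow can represent computations as data flow graphs, which lets it optimize performance, particularly for large-scale models. In TensorFlow 2, code runs eagerly by default and graphs are built by tracing Python functions with `tf.function`, which Keras does for you during training.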

3. **Tensor Abstraction**: Operations in TensorFlow revolve around 'tensors,' n-dimensional arrays that generalize matrices, hiding the intricate mathematical manipulations.

4. **High-Level APIs**: TensorFlow provides high-level APIs like Keras that make defining, training, and deploying models user-friendly without deep dives into the underlying mathematics.

In essence, TensorFlow and similar tools simplify the complexities of neural networks, enabling widespread adoption and application even by those without extensive mathematical backgrounds.
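To make concrete what automatic differentiation spares you, the sketch below writes out by hand the gradient of a mean squared error loss for a tiny network with one tanh hidden layer, then checks one entry against a finite-difference estimate. This is plain NumPy rather than TensorFlow, and the network size, parameter values, and function names are illustrative assumptions.

```python
import numpy as np

def loss(params, x, y):
    """Mean squared error of a 1-input, 1-output network with one tanh hidden layer."""
    w1, b1, w2, b2 = params
    hidden = np.tanh(np.outer(x, w1) + b1)   # shape (samples, hidden units)
    pred = hidden @ w2 + b2                  # shape (samples,)
    return np.mean((pred - y) ** 2)

def manual_gradients(params, x, y):
    """Backpropagation written out by hand using the chain rule."""
    w1, b1, w2, b2 = params
    hidden = np.tanh(np.outer(x, w1) + b1)
    pred = hidden @ w2 + b2
    d_pred = 2 * (pred - y) / len(x)                  # dL/dpred
    d_w2 = hidden.T @ d_pred                          # dL/dw2
    d_b2 = d_pred.sum()                               # dL/db2
    d_pre = np.outer(d_pred, w2) * (1 - hidden ** 2)  # back through tanh
    d_w1 = d_pre.T @ x                                # dL/dw1
    d_b1 = d_pre.sum(axis=0)                          # dL/db1
    return d_w1, d_b1, d_w2, d_b2

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = x * np.sin(x)
params = [rng.normal(size=3), rng.normal(size=3), rng.normal(size=3), 0.1]

d_w1, d_b1, d_w2, d_b2 = manual_gradients(params, x, y)

# Central finite-difference estimate of dL/dw1[0] as an independent check
eps = 1e-6
shifted_up = [params[0].copy(), params[1], params[2], params[3]]
shifted_down = [params[0].copy(), params[1], params[2], params[3]]
shifted_up[0][0] += eps
shifted_down[0][0] -= eps
numeric = (loss(shifted_up, x, y) - loss(shifted_down, x, y)) / (2 * eps)

matches = np.isclose(d_w1[0], numeric, rtol=1e-4, atol=1e-6)
print(matches)
```

Every extra layer or different activation would require rederiving these formulas. A framework with automatic differentiation derives them from the forward computation instead.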

```python
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers, models

# Create a simple feed-forward network in TensorFlow/Keras
model = models.Sequential([
    layers.Dense(10, activation='tanh', input_shape=(1,)),
    layers.Dense(10, activation='tanh'),
    layers.Dense(1)  # output layer
])

# Compile the model
model.compile(optimizer='adam', loss='mse')

# Train the model on samples of the target function, shaped (samples, features)
x_train = np.linspace(0, 10, 100).reshape(-1, 1)
y_train = x_train * np.sin(x_train)
model.fit(x_train, y_train, epochs=500, verbose=0)

# Predict the function values using the trained model
y_pred = model.predict(x_train)

# Plot the prediction
plt.plot(x_train, y_train, label='True function')
plt.plot(x_train, y_pred, label='Neural network approximation')
plt.legend()
plt.show()
```

In this code snippet, a simple feed-forward neural network is implemented using TensorFlow/Keras. The model is trained to approximate the function \( f(x) = x \cdot \sin(x) \). With just a few lines of code, TensorFlow handles the gradient calculations, optimizes the network parameters, and trains the model, making neural network implementation accessible and manageable.

**Conclusion**: Neural networks are powerful tools for modeling complex functions, supported by the Universal Approximation Theorem. While the underlying math can be intricate, tools like TensorFlow significantly reduce the complexity, making it possible to build and train neural networks with ease. As a result, neural networks have become the backbone of many modern AI applications, capable of solving a diverse array of problems.


## 📊 Summary: Math Behind Neural Networks

### 🧠 **Neural Networks Basics**
1. **Core Concept**: Neural networks are inspired by biological systems, consisting of layers of nodes (neurons).
2. **Mathematical Foundation**: A simple neural network can be represented as \( f(x) = \sigma(W_2 \cdot \sigma(W_1 \cdot x + b_1) + b_2) \).

### 📈 **Activation Functions**
1. **Purpose**: Introduce non-linearity to broaden the range of functions a network can model.
2. **Common Functions**:
   - **Sigmoid**: \( \sigma(x) = \frac{1}{1 + e^{-x}} \)
   - **tanh**: \( \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \)
   - **ReLU**: \( \text{ReLU}(x) = \max(0, x) \)

### 📚 **Universal Approximation Theorem (UAT)**
1. **Theorem Summary**: A neural network with one hidden layer can approximate any continuous function on a closed, bounded domain to any desired accuracy, given enough neurons and a suitable activation function.
2. **Implication**: Provides the theoretical backing for the power of neural networks in function approximation.

### 🛠 **TensorFlow’s Role**
1. **Abstraction**: Simplifies neural network implementation by hiding mathematical complexities.
2. **Features**:
   - **Automatic Differentiation**: For gradient computation.
   - **Graph Computations**: Efficient performance for large models.
   - **Tensor Abstraction**: Manages n-dimensional arrays.
   - **High-Level APIs**: Eases model definition and training.

### 🎯 **Conclusion**
Neural networks, supported by the Universal Approximation Theorem, are powerful tools for approximating complex functions. TensorFlow further simplifies their implementation, making neural network modeling accessible and efficient for a wide range of applications.



