# NEURAL NETWORK
A hands-on approach.

## Introduction

### Introduce the concept of neural networks and the goal of the notebook.


In [1]:
%%markdown
# Building a Simple Neural Network from Scratch

## Introduction to Neural Networks

Neural networks are a family of algorithms, loosely modeled after the human brain, that are designed to recognize patterns. They are particularly good at tasks like image recognition, natural language processing, and making predictions based on complex data. At their core, neural networks are composed of interconnected nodes, or "neurons," organized in layers. These neurons process and transmit information, and the network learns from data through a process called training.

## Notebook Goal

The goal of this notebook is to provide a hands-on introduction to the fundamental building blocks of neural networks by guiding you through the process of coding a simple neural network from scratch using Python. We will start by understanding the basic structure of a neuron, then build up to creating input, hidden, and output layers.

## Dataset

For this notebook, we will be using a simple dataset to illustrate the concepts. The details of the dataset will be covered in the next section.


## Data preparation

### Load and prepare a simple dataset suitable for a binary classification problem with a few features and a binary target variable.


In [2]:
import numpy as np
import pandas as pd

# Generate synthetic data
np.random.seed(42) # for reproducibility
n_samples = 100

# Feature 1: Random values
feature1 = np.random.rand(n_samples) * 10

# Feature 2: Values related to feature1 with some noise
feature2 = feature1 * 0.5 + np.random.randn(n_samples) * 2

# Target variable: Binary based on a simple threshold rule with some noise
target = (feature1 + feature2 > 10).astype(int)
# Introduce some noise in the target
noise_indices = np.random.choice(n_samples, size=int(n_samples * 0.1), replace=False)
target[noise_indices] = 1 - target[noise_indices]


# Create a pandas DataFrame
df = pd.DataFrame({'feature1': feature1, 'feature2': feature2, 'target': target})

# Separate features and target
X = df[['feature1', 'feature2']]
y = df['target']

# Display the first few rows of the DataFrame
display(df.head())

|   | feature1 | feature2 | target |
|---|----------|----------|--------|
| 0 | 3.745401 | 2.046795 | 0 |
| 1 | 9.507143 | 4.155557 | 1 |
| 2 | 7.319939 | 3.843491 | 0 |
| 3 | 5.986585 | -0.981845 | 0 |
| 4 | 1.560186 | 0.340749 | 0 |


**Reasoning**:
Describe the generated dataset using a markdown cell.



In [3]:
%%markdown
## Dataset Description

We have created a simple synthetic dataset for binary classification.

*   **Features (`X`):**
    *   `feature1`: A continuous numerical feature generated randomly.
    *   `feature2`: A continuous numerical feature related to `feature1` with some added noise.
*   **Target Variable (`y`):**
    *   `target`: A binary variable (0 or 1) indicating the class. The target is 1 when `feature1 + feature2 > 10` and 0 otherwise; the labels of a randomly chosen 10% of the samples are then flipped, so the classes are not perfectly linearly separable.

This dataset is small and simple, making it suitable for demonstrating the basic concepts of building a neural network from scratch. The features (`X`) are stored in a pandas DataFrame, and the target variable (`y`) is stored in a pandas Series. Both are in a numerical format.


## Coding a Neuron

### Explain the basic structure and function of a single neuron, including inputs, weights, bias, weighted sum, and activation function using a markdown cell as the first step of the subtask.

In [4]:
%%markdown
## The Neuron: The Basic Building Block

At the heart of a neural network is the neuron, a simplified model of a biological neuron. A single neuron receives inputs, processes them, and produces an output. Let's break down its components:

1.  **Inputs:** These are the values the neuron receives from the outside world or from other neurons in the network. In our case, these will be the features from our dataset (`feature1` and `feature2`).
2.  **Weights:** Each input is associated with a weight. Weights represent the strength or importance of each input. During the training process, the network learns the optimal values for these weights.
3.  **Bias:** The bias is an additional learnable parameter of the neuron. It shifts the weighted sum before the activation function is applied, so the neuron's output does not depend on the inputs alone. Equivalently, the negated bias acts as a threshold that the weighted sum of the inputs must reach for the neuron to activate.
4.  **Weighted Sum:** The neuron calculates a weighted sum of its inputs. This is done by multiplying each input by its corresponding weight and summing up all these products. The bias term is then added to this sum.
    *   Mathematically: $Z = (input_1 \times weight_1) + (input_2 \times weight_2) + ... + (input_n \times weight_n) + bias$
5.  **Activation Function:** The weighted sum is then passed through an activation function. The activation function introduces non-linearity into the neuron's output, which is crucial for the network to learn complex patterns. Without activation functions, a neural network would simply be a linear model, regardless of the number of layers.

The output of the activation function is the output of the neuron, which can then be passed as input to other neurons in the next layer.

Let's implement the weighted sum and a simple activation function.


**Reasoning**:
Implement Python functions for calculating the weighted sum and a simple activation function (step function), and demonstrate their usage with example inputs, weights, and bias.



In [5]:
# 2. Implement a Python function for calculating the weighted sum
def weighted_sum(inputs, weights, bias):
  """
  Calculates the weighted sum of inputs.

  Args:
    inputs: A list or numpy array of input values.
    weights: A list or numpy array of weights corresponding to the inputs.
    bias: The bias term.

  Returns:
    The weighted sum of inputs plus the bias.
  """
  return np.dot(inputs, weights) + bias

# 3. Implement a simple activation function (Step Function)
def step_function(weighted_sum_output):
  """
  Implements the step activation function.

  Args:
    weighted_sum_output: The output of the weighted sum.

  Returns:
    1 if the weighted sum output is non-negative, 0 otherwise.
  """
  return 1 if weighted_sum_output >= 0 else 0

# 4. Demonstrate the usage with example inputs, weights, and bias
# Example inputs (e.g., features for a single data point)
example_inputs = np.array([X['feature1'][0], X['feature2'][0]])

# Example weights (hand-picked for demonstration; in practice they are
# usually initialized randomly and then learned during training)
# The number of weights should match the number of inputs
example_weights = np.array([0.5, -0.2])

# Example bias (hand-picked for demonstration)
example_bias = 0.1

# Calculate the weighted sum
z = weighted_sum(example_inputs, example_weights, example_bias)
print(f"Example Inputs: {example_inputs}")
print(f"Example Weights: {example_weights}")
print(f"Example Bias: {example_bias}")
print(f"Weighted Sum (Z): {z}")

# Apply the activation function
output = step_function(z)
print(f"Output after Step Activation: {output}")

Example Inputs: [3.74540119 2.04679473]
Example Weights: [ 0.5 -0.2]
Example Bias: 0.1
Weighted Sum (Z): 1.5633416480941815
Output after Step Activation: 1
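
The demonstration above handles one sample at a time, and the scalar `step_function` relies on a Python conditional that does not work element-wise on an array. As a minimal sketch, the same neuron can be applied to several samples at once with a matrix-vector product. The name `neuron_forward` and the array `X_demo` are assumptions made for illustration: only the first row of `X_demo` comes from the dataset, the other rows are made up.

```python
import numpy as np

# Hypothetical stand-in for a few rows of X (feature1, feature2).
# The first row matches the sample used above; the others are invented.
X_demo = np.array([
    [3.74540119, 2.04679473],
    [9.0, 4.0],
    [0.5, 6.0],
])

weights = np.array([0.5, -0.2])  # same hand-picked weights as above
bias = 0.1


def neuron_forward(X, weights, bias):
    """Apply one step-activated neuron to every row of X at once."""
    z = X @ weights + bias          # weighted sums, shape (n_samples,)
    output = (z >= 0).astype(int)   # element-wise step function
    return z, output


z_all, out_all = neuron_forward(X_demo, weights, bias)
print("Weighted sums:", z_all)
print("Step outputs:", out_all)
```

With the real data, `X.to_numpy()` could be passed in place of `X_demo`.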


## Input layer

### Explain the role of the input layer and the importance of data dimensions, then demonstrate checking the dimensions of the input data.

In [6]:
%%markdown
## The Input Layer

The input layer is the first layer of a neural network. Its primary role is to receive the initial data that the network will process. Each neuron in the input layer typically corresponds to a single feature in the input data.

Think of the input layer as the gateway through which your dataset enters the neural network. If your dataset has `n` features, your input layer will typically have `n` neurons, each responsible for receiving the values of one feature for each data sample.

### Ensuring Correct Data Dimensions

Before feeding your data into the input layer, it's crucial to ensure that it has the correct dimensions. The dimensions of your input data should match the expected input shape of your network.

For a typical dataset used in machine learning, the data is organized in a 2D structure (like a spreadsheet or a pandas DataFrame), where:

*   The number of rows represents the number of samples or data points.
*   The number of columns represents the number of features.

The input layer expects data in this format, where each row is a sample and each column is a feature. The number of columns must match the number of neurons in the input layer.

Let's check the dimensions of our input data `X`.


**Reasoning**:
Demonstrate how to check the dimensions of the input data `X` using `.shape`.



In [7]:
# Check the dimensions of the input data X
print("Shape of the input data X:", X.shape)

# The shape of X tells us the number of samples (rows) and the number of features (columns).
# The number of features (columns) is the dimension that matters for the input layer;
# it determines how many neurons the input layer should have.


Shape of the input data X: (100, 2)
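
Checking the shape by eye works for a small notebook, but the same check can be made explicit in code. The helper below is a sketch under stated assumptions: `to_input_matrix` and `X_demo` are hypothetical names, not part of the notebook, and the function simply converts its argument to a 2D float array and verifies that the number of columns matches the number of input neurons.

```python
import numpy as np


def to_input_matrix(features, n_expected_features):
    """Convert features to a 2D float array and verify the column count."""
    arr = np.asarray(features, dtype=float)
    if arr.ndim == 1:
        # A single sample: reshape to one row so the layout is (samples, features)
        arr = arr.reshape(1, -1)
    if arr.ndim != 2 or arr.shape[1] != n_expected_features:
        raise ValueError(
            f"expected shape (n_samples, {n_expected_features}), got {arr.shape}"
        )
    return arr


# Hypothetical two-feature data standing in for X
X_demo = [[1.0, 0.5], [2.0, 1.5], [3.0, 2.5]]
X_mat = to_input_matrix(X_demo, n_expected_features=2)
print(X_mat.shape)
```

`np.asarray` also accepts a pandas DataFrame, so the DataFrame `X` from this notebook could be passed directly.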


## Activation Functions

### Discuss various mathematical functions that can be used as activation functions and implement some of them.


In [8]:
%%markdown
## Activation Functions: Introducing Non-Linearity

In the previous section, we saw that a neuron calculates a weighted sum of its inputs and adds a bias. If we were to simply output this weighted sum, our neural network would only be able to learn linear relationships between the inputs and the output, no matter how many neurons or layers we stack. This is because a composition of linear functions is still a linear function.

However, most real-world data involves complex, non-linear relationships. To enable a neural network to learn these complex patterns, we introduce **activation functions**. An activation function takes the output of the weighted sum (often denoted as $Z$) and transforms it into the neuron's final output. This transformation introduces non-linearity, allowing the network to model and learn from non-linear data.

Activation functions are applied to the output of each neuron in the hidden layers and often in the output layer.

Here are some common types of activation functions:

### 1. Sigmoid Function

The Sigmoid function, also known as the logistic function, is a classic choice, especially for the output layer in binary classification problems. It squashes the input values between 0 and 1.

*   **Mathematical Formula:** $ \sigma(Z) = \frac{1}{1 + e^{-Z}} $
*   **Description:** Outputs values between 0 and 1, making it useful for representing probabilities. However, it saturates for large positive or large negative inputs: its gradient there is close to zero, which causes the "vanishing gradient" problem and can slow down training.

### 2. Rectified Linear Unit (ReLU)

The ReLU function is one of the most popular activation functions in deep learning today. It's computationally efficient, and because its gradient is exactly 1 for positive inputs, it helps mitigate the vanishing gradient problem seen in Sigmoid and Tanh.

*   **Mathematical Formula:** $ ReLU(Z) = \max(0, Z) $
*   **Description:** Outputs the input directly if it's positive; otherwise, it outputs zero. It's simple and effective, but can suffer from the "dying ReLU" problem: a neuron whose weighted sum is negative for every input outputs zero, receives zero gradient, and stops learning.

### 3. Hyperbolic Tangent (Tanh)

The Tanh function is another common activation function. It is similar to the Sigmoid function but squashes the input values between -1 and 1.

*   **Mathematical Formula:** $ tanh(Z) = \frac{e^Z - e^{-Z}}{e^Z + e^{-Z}} $
*   **Description:** Outputs values between -1 and 1, which can be beneficial as it centers the output around zero. Like Sigmoid, it can suffer from the vanishing gradient problem.
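
The vanishing gradient behaviour mentioned for Sigmoid and Tanh can be made concrete by evaluating their derivatives. The sketch below uses the standard closed forms $\sigma'(Z) = \sigma(Z)(1 - \sigma(Z))$ and $\tanh'(Z) = 1 - \tanh^2(Z)$; the helper names `sigmoid_grad` and `tanh_grad` are assumptions introduced for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)**2."""
    return 1.0 - np.tanh(z) ** 2


z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print("Sigmoid gradient:", sigmoid_grad(z))
print("Tanh gradient:   ", tanh_grad(z))
```

Both gradients peak at $Z = 0$ (0.25 for Sigmoid, 1 for Tanh) and shrink toward zero as $|Z|$ grows, which is why saturated neurons learn slowly.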

### Other Activation Functions

*   **Leaky ReLU:** An improvement over ReLU that addresses the "dying ReLU" problem by allowing a small, non-zero gradient for negative inputs.
    *   **Mathematical Formula:** $ LeakyReLU(Z) = \max(\alpha Z, Z) $ where $\alpha$ is a small positive constant (e.g., 0.01).
*   **Softmax:** Typically used in the output layer of multi-class classification problems. It converts a vector of raw scores into a vector of probabilities that sum up to 1.
    *   **Mathematical Formula:** For a vector of inputs $Z = [z_1, z_2, ..., z_k]$, the Softmax of the i-th element is $ Softmax(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}} $


**Reasoning**:
Implement Python functions for Sigmoid, ReLU, and Tanh activation functions and demonstrate their usage with example weighted sum values.



In [9]:
# Implement the Sigmoid activation function
def sigmoid(z):
  """
  Implements the Sigmoid activation function.

  Args:
    z: The input (weighted sum).

  Returns:
    The output of the Sigmoid function.
  """
  return 1 / (1 + np.exp(-z))

# Implement the ReLU activation function
def relu(z):
  """
  Implements the ReLU activation function.

  Args:
    z: The input (weighted sum).

  Returns:
    The output of the ReLU function.
  """
  return np.maximum(0, z)

# Implement the Tanh activation function
def tanh(z):
  """
  Implements the Tanh activation function.

  Args:
    z: The input (weighted sum).

  Returns:
    The output of the Tanh function.
  """
  return np.tanh(z)

# Demonstrate the usage of each activation function
# Example weighted sum values
example_weighted_sums = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

print(f"Example Weighted Sums: {example_weighted_sums}")

# Apply and print results for each activation function
print("\nSigmoid Activation:")
print(sigmoid(example_weighted_sums))

print("\nReLU Activation:")
print(relu(example_weighted_sums))

print("\nTanh Activation:")
print(tanh(example_weighted_sums))

Example Weighted Sums: [-5. -1.  0.  1.  5.]

Sigmoid Activation:
[0.00669285 0.26894142 0.5        0.73105858 0.99330715]

ReLU Activation:
[0. 0. 0. 1. 5.]

Tanh Activation:
[-0.9999092  -0.76159416  0.          0.76159416  0.9999092 ]
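
Leaky ReLU and Softmax were described above but not implemented. The following is a minimal sketch of both, following the formulas given earlier. The default `alpha=0.01` is the example value from the text, and subtracting the maximum inside `softmax` is a common numerical-stability step added here as an assumption; it does not change the result.

```python
import numpy as np


def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z for positive inputs, alpha * z otherwise."""
    return np.where(z > 0, z, alpha * z)


def softmax(z):
    """Softmax over a 1D vector of raw scores."""
    shifted = z - np.max(z)      # avoids overflow in exp for large scores
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)


example_weighted_sums = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

print("Leaky ReLU Activation:")
print(leaky_relu(example_weighted_sums))

probs = softmax(example_weighted_sums)
print("\nSoftmax Activation:")
print(probs)
print("Sum of probabilities:", probs.sum())
```

Unlike the other functions, Softmax is not applied element by element: each output depends on all the scores in the vector.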


## Output layer

### Explain the output layer, how to select the appropriate activation function for the task, and how to tune thresholds if necessary.


In [10]:
%%markdown
## The Output Layer: Making Predictions

The output layer is the final layer of a neural network. It receives input from the last hidden layer (or the input layer if there are no hidden layers) and produces the network's final output or prediction. The number of neurons in the output layer depends directly on the type of problem the neural network is designed to solve.

### Choosing the Output Layer Activation Function

The choice of activation function for the output layer is crucial and depends on the nature of the task:

*   **Binary Classification (Two Classes):** For problems where the output is one of two classes (e.g., spam or not spam, yes or no), a single neuron with a **Sigmoid** activation function is typically used. The Sigmoid function outputs a value between 0 and 1, which can be interpreted as the probability of the input belonging to the positive class.
    *   *Example:* Predicting if an email is spam (1) or not spam (0). The Sigmoid output could be the probability of the email being spam.

*   **Multi-Class Classification (More than Two Classes):** When there are more than two mutually exclusive classes (e.g., classifying images of digits 0-9), the output layer will have one neuron for each class, and a **Softmax** activation function is used across all neurons in this layer. Softmax converts a vector of raw scores into a probability distribution, where the sum of probabilities for all classes is 1. The class with the highest probability is the network's prediction.
    *   *Example:* Classifying images of handwritten digits into one of 10 classes (0-9). The Softmax output will give the probability for each digit.

*   **Regression (Predicting Continuous Values):** For tasks where the goal is to predict a continuous numerical value (e.g., predicting house prices, stock prices), the output layer typically has a single neuron with a **linear** or no activation function. A linear activation simply outputs the weighted sum directly.
    *   *Example:* Predicting the price of a house based on its features. The output neuron directly outputs the predicted price.

### Tuning the Threshold for Binary Classification

In binary classification problems using a Sigmoid output, the output value represents a probability. To convert this probability into a binary class prediction (0 or 1), a **threshold** is applied.

*   **Standard Threshold:** A common threshold is 0.5. If the Sigmoid output is greater than or equal to 0.5, the prediction is class 1; otherwise, it's class 0.
*   **Tuning the Threshold:** The choice of threshold can impact the trade-off between precision and recall. Depending on the specific problem and the costs associated with false positives and false negatives, you might want to adjust this threshold.
    *   *Lowering the threshold* (e.g., to 0.4) makes it easier for an instance to be classified as class 1, increasing recall but potentially decreasing precision.
    *   *Raising the threshold* (e.g., to 0.6) makes it harder for an instance to be classified as class 1, increasing precision but potentially decreasing recall.

Tuning the threshold is often done after the model has been trained and involves evaluating the model's performance at different threshold values on a validation set.

Let's demonstrate applying a Sigmoid activation and a threshold to a hypothetical output.


**Reasoning**:
Demonstrate applying a Sigmoid activation function to a hypothetical output and then applying a threshold to make a binary prediction.



In [11]:
# Assume a hypothetical weighted sum output from the last layer neuron
hypothetical_weighted_sum = 1.5

# Apply the Sigmoid activation function (using the function defined previously)
sigmoid_output = sigmoid(hypothetical_weighted_sum)

print(f"Hypothetical Weighted Sum: {hypothetical_weighted_sum}")
print(f"Output after Sigmoid Activation: {sigmoid_output:.4f}")

# Define a threshold for binary classification
threshold = 0.5

# Apply the threshold to make a binary prediction
binary_prediction = 1 if sigmoid_output >= threshold else 0

print(f"Threshold: {threshold}")
print(f"Binary Prediction: {binary_prediction}")

# Demonstrate with another hypothetical weighted sum
hypothetical_weighted_sum_2 = -0.5
sigmoid_output_2 = sigmoid(hypothetical_weighted_sum_2)

print(f"\nAnother Hypothetical Weighted Sum: {hypothetical_weighted_sum_2}")
print(f"Output after Sigmoid Activation: {sigmoid_output_2:.4f}")

binary_prediction_2 = 1 if sigmoid_output_2 >= threshold else 0
print(f"Threshold: {threshold}")
print(f"Binary Prediction: {binary_prediction_2}")

# Demonstrate with a different threshold
alternative_threshold = 0.6
binary_prediction_3 = 1 if sigmoid_output >= alternative_threshold else 0

print(f"\nOriginal Sigmoid Output: {sigmoid_output:.4f}")
print(f"Alternative Threshold: {alternative_threshold}")
print(f"Binary Prediction with Alternative Threshold: {binary_prediction_3}")

Hypothetical Weighted Sum: 1.5
Output after Sigmoid Activation: 0.8176
Threshold: 0.5
Binary Prediction: 1

Another Hypothetical Weighted Sum: -0.5
Output after Sigmoid Activation: 0.3775
Threshold: 0.5
Binary Prediction: 0

Original Sigmoid Output: 0.8176
Alternative Threshold: 0.6
Binary Prediction with Alternative Threshold: 1
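
A single output does not show the precision/recall trade-off, because 0.8176 clears both thresholds. The sketch below evaluates several thresholds on a small batch instead. The labels `y_true` and the probabilities `probs` are invented for illustration and do not come from a trained model; `precision_recall` is a hypothetical helper name.

```python
import numpy as np


def precision_recall(y_true, probs, threshold):
    """Precision and recall of class 1 at a given decision threshold."""
    y_pred = (probs >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)


# Hypothetical labels and sigmoid outputs, invented for illustration
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])
probs = np.array([0.9, 0.7, 0.45, 0.55, 0.3, 0.65, 0.42, 0.1])

for t in (0.4, 0.5, 0.6):
    p, r = precision_recall(y_true, probs, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

On this toy batch, raising the threshold from 0.4 to 0.6 increases precision while recall drops, in line with the trade-off described above. On real data the curve need not be this clean.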


## Hidden layer

### Explain the purpose of hidden layers and how to code neurons within them, selecting appropriate activation functions.


In [12]:
%%markdown
## Hidden Layers: Learning Complex Patterns

Between the input layer and the output layer are one or more **hidden layers**. These layers are where the majority of the computation in a neural network takes place. They are called "hidden" because their inputs and outputs are not directly exposed to the outside world; they are internal to the network.

### The Role of Hidden Layers

The primary purpose of hidden layers is to learn increasingly complex and abstract representations of the input data. Each neuron in a hidden layer receives inputs from the previous layer (either the input layer or a preceding hidden layer) and transforms these inputs using weights, biases, and an activation function.

As data passes through successive hidden layers, the network can identify more intricate patterns and features. For example, in an image recognition task, the first hidden layer might learn to detect edges, the next layer might combine edges to detect shapes, and subsequent layers might combine shapes to recognize objects.

### Why Non-Linear Activation Functions are Crucial

We discussed the importance of activation functions in introducing non-linearity within a single neuron. This non-linearity is absolutely essential in hidden layers.

*   **Without non-linear activation functions in hidden layers**, a deep neural network (with multiple layers) would behave exactly like a single-layer linear model, no matter how many layers it has. This is because each layer computes a linear (affine) function of its inputs, and a composition of such functions is itself linear (affine).
*   **With non-linear activation functions**, hidden layers can learn non-linear relationships and complex interactions between features. This allows neural networks to model highly intricate patterns in the data that linear models cannot capture.
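
The collapse described in the first bullet can be checked numerically. The following is a minimal sketch with hand-picked, hypothetical values (`X_demo`, `W1`, `b1`, `W2`, `b2` are illustration names, not part of this notebook's network): two stacked layers without an activation give the same outputs as one combined linear layer, while inserting ReLU between them breaks the equivalence.

```python
import numpy as np

# Hypothetical inputs and parameters chosen by hand for illustration
X_demo = np.array([[1.0, -2.0],
                   [0.5, 3.0]])
W1 = np.array([[1.0, -1.0],
               [2.0, 0.5]])
b1 = np.array([0.0, 1.0])
W2 = np.array([[1.0],
               [-1.0]])
b2 = np.array([0.5])

# Two stacked layers with NO activation function in between
two_layer_linear = (X_demo @ W1 + b1) @ W2 + b2

# The same mapping written as ONE linear layer
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = X_demo @ W_combined + b_combined

print(np.allclose(two_layer_linear, one_layer))  # True

# With ReLU between the layers, the network no longer matches the single layer
with_relu = np.maximum(0, X_demo @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_layer))  # False
```

In this example the first sample produces negative weighted sums in the first layer, so ReLU zeroes them out and the output changes from -1.5 to 0.5.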

Common non-linear activation functions used in hidden layers include ReLU, Tanh, and their variants.

Let's see how to calculate the output of a neuron in a hidden layer, incorporating the concepts of weighted sum and a non-linear activation function.


**Reasoning**:
Provide a simple Python code example to demonstrate calculating the output of a neuron in a hidden layer, including taking inputs, using random weights and bias, calculating the weighted sum, and applying a non-linear activation function (ReLU).



In [13]:
# Demonstrate calculating the output of a hidden layer neuron

# For this example, let's use the first data sample from X as the input
# In a real network, these inputs would be the outputs of the previous layer
hidden_neuron_inputs = np.array([X['feature1'][0], X['feature2'][0]])

# Randomly initialize weights and bias for a single hidden neuron
# The number of weights must match the number of inputs
hidden_neuron_weights = np.random.rand(len(hidden_neuron_inputs)) * 0.5 - 0.25 # Small random values around 0
hidden_neuron_bias = np.random.rand() * 0.5 - 0.25 # Small random value around 0

print(f"Inputs to the Hidden Neuron: {hidden_neuron_inputs}")
print(f"Weights for the Hidden Neuron: {hidden_neuron_weights}")
print(f"Bias for the Hidden Neuron: {hidden_neuron_bias}")

# Calculate the weighted sum for the hidden neuron (using the function defined previously)
hidden_neuron_weighted_sum = weighted_sum(hidden_neuron_inputs, hidden_neuron_weights, hidden_neuron_bias)
print(f"Weighted Sum (Z) for Hidden Neuron: {hidden_neuron_weighted_sum}")

# Apply a non-linear activation function (e.g., ReLU) to the weighted sum
# We will use the relu function defined previously
hidden_neuron_output = relu(hidden_neuron_weighted_sum)
print(f"Output after ReLU Activation: {hidden_neuron_output}")

# Demonstrate with another activation function (e.g., Tanh)
hidden_neuron_output_tanh = tanh(hidden_neuron_weighted_sum)
print(f"Output after Tanh Activation: {hidden_neuron_output_tanh}")

Inputs to the Hidden Neuron: [3.74540119 2.04679473]
Weights for the Hidden Neuron: [ 0.00815017 -0.08852176]
Bias for the Hidden Neuron: 0.14759309738435183
Weighted Sum (Z) for Hidden Neuron: -0.003067109811097507
Output after ReLU Activation: 0.0
Output after Tanh Activation: -0.0030671001935334685


## Putting it all together

### Create Python functions for the forward pass of the neural network, initialize weights and biases, and demonstrate the forward pass with the input data X.

In [14]:
%%markdown
## Building the Neural Network: Combining the Layers

Now that we understand the fundamental components – the neuron, input layer, activation functions, hidden layers, and output layer – we can combine them to build a simple neural network.

For this demonstration, we will create a network with:
*   An input layer (with a number of neurons equal to the number of features in `X`)
*   One hidden layer with 2 neurons
*   An output layer with 1 neuron (for binary classification)
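
To get a feel for the size of this architecture, the number of trainable parameters can be counted directly. This is a small sketch; `count_parameters` is a hypothetical helper, and the value of 2 input features is an assumption for illustration (for the real dataset it would be the number of columns in `X`).

```python
def count_parameters(n_inputs, n_hidden, n_outputs):
    """Count the weights and biases of a network with one hidden layer."""
    # Each hidden neuron has one weight per input plus one bias
    hidden_params = n_inputs * n_hidden + n_hidden
    # Each output neuron has one weight per hidden neuron plus one bias
    output_params = n_hidden * n_outputs + n_outputs
    return hidden_params + output_params

# Assumed sizes: 2 input features, 2 hidden neurons, 1 output neuron
print(count_parameters(n_inputs=2, n_hidden=2, n_outputs=1))  # 9
```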

We will implement the **forward pass**, which is the process of feeding the input data through the network, layer by layer, to produce an output prediction.

The forward pass involves:
1.  Taking the input data.
2.  Calculating the weighted sum and applying an activation function for each neuron in the hidden layer.
3.  Using the outputs of the hidden layer as inputs to the output layer.
4.  Calculating the weighted sum and applying the output layer's activation function (Sigmoid in our case) for the output neuron.
5.  The final output is the network's prediction.
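
The five steps above can be traced by hand for a single sample. The sketch below uses hypothetical, hand-picked parameters (`W_h`, `b_h`, `W_o`, `b_o`) and local helper functions (`relu_demo`, `sigmoid_demo`) so that it runs on its own; it is not the notebook's actual network.

```python
import numpy as np

def relu_demo(z):
    return np.maximum(0.0, z)

def sigmoid_demo(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: one input sample with two features (hand-picked values)
x = np.array([1.0, 2.0])

# Hypothetical parameters: column j of W_h holds the weights of hidden neuron j
W_h = np.array([[0.5, -1.0],
                [0.25, 0.5]])
b_h = np.array([0.0, -0.5])
W_o = np.array([1.0, -2.0])
b_o = 0.5

# Step 2: weighted sum and ReLU for each hidden neuron
z_h = x @ W_h + b_h       # [1.0, -0.5]
a_h = relu_demo(z_h)      # [1.0, 0.0]

# Steps 3 and 4: hidden outputs feed the output neuron, then Sigmoid
z_o = a_h @ W_o + b_o     # 1.5
y_hat = sigmoid_demo(z_o)

# Step 5: the prediction
print(f"{y_hat:.4f}")  # 0.8176
```

The second hidden neuron has a negative weighted sum, so ReLU sets its output to 0 and it contributes nothing to the output neuron for this sample.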

Let's define Python functions for the forward pass and then initialize random weights and biases for our network.


**Reasoning**:
Implement the forward pass function, initialize weights and biases for the network layers, and demonstrate the forward pass using the input data X.



In [15]:
# Create a Python function to represent the forward pass
def forward_pass(X, weights_hidden, bias_hidden, weights_output, bias_output):
  """
  Represents the forward pass through a simple neural network.

  Args:
    X: Input data (numpy array or pandas DataFrame).
    weights_hidden: Weights for the hidden layer.
    bias_hidden: Bias for the hidden layer.
    weights_output: Weights for the output layer.
    bias_output: Bias for the output layer.

  Returns:
    The output of the neural network's output layer activation.
  """
  # Ensure X is a numpy array for matrix multiplication
  X = np.array(X)

  # Calculate the weighted sum for the hidden layer
  hidden_layer_input = np.dot(X, weights_hidden) + bias_hidden

  # Apply the activation function for the hidden layer (using ReLU)
  hidden_layer_output = relu(hidden_layer_input)

  # Calculate the weighted sum for the output layer
  output_layer_input = np.dot(hidden_layer_output, weights_output) + bias_output

  # Apply the activation function for the output layer (using Sigmoid)
  output_layer_output = sigmoid(output_layer_input)

  return output_layer_output

# Initialize random weights and biases
# Number of input features is the number of columns in X
n_input_features = X.shape[1]
# Desired number of neurons in the hidden layer
n_hidden_neurons = 2
# Desired number of neurons in the output layer
n_output_neurons = 1

# Initialize weights and biases for the hidden layer
# Weights shape: (number of input features, number of hidden neurons)
# np.random.rand samples uniformly from [0, 1), so every value below lies in [0, 0.1)
weights_hidden = np.random.rand(n_input_features, n_hidden_neurons) * 0.1
bias_hidden = np.random.rand(n_hidden_neurons) * 0.1

# Initialize weights and biases for the output layer
# Weights shape: (number of hidden neurons, number of output neurons)
weights_output = np.random.rand(n_hidden_neurons, n_output_neurons) * 0.1
bias_output = np.random.rand(n_output_neurons) * 0.1

print("Shape of X:", X.shape)
print("Shape of weights_hidden:", weights_hidden.shape)
print("Shape of bias_hidden:", bias_hidden.shape)
print("Shape of weights_output:", weights_output.shape)
print("Shape of bias_output:", bias_output.shape)


# Demonstrate the forward pass
network_output = forward_pass(X, weights_hidden, bias_hidden, weights_output, bias_output)

print("\nOutput of the neural network for the first 5 samples:")
print(network_output[:5])

Shape of X: (100, 2)
Shape of weights_hidden: (2, 2)
Shape of bias_hidden: (2,)
Shape of weights_output: (2, 1)
Shape of bias_output: (1,)

Output of the neural network for the first 5 samples:
[[0.51063872]
 [0.51627805]
 [0.51421699]
 [0.51220791]
 [0.50835198]]


## Training the Neural Network

### Explain and implement a basic training process for a neural network, including loss functions and gradient descent.

In [16]:
%%markdown
## Training the Neural Network: Learning from Data

So far, we've built the structure of a simple neural network and implemented the **forward pass**, which allows us to make predictions based on given inputs and the network's current weights and biases. However, with randomly initialized weights and biases, our network's predictions are likely to be inaccurate.

The goal of training a neural network is to adjust these weights and biases so that the network's predictions are as close as possible to the actual target values in our dataset. This learning process is typically achieved using an **optimization algorithm**.

### Loss Function: Measuring Prediction Error

Before we can improve our predictions, we need a way to measure how "wrong" they are. This is the role of the **loss function** (also known as the cost function or error function). The loss function quantifies the difference between the network's predicted output and the true target values. A higher loss indicates poorer performance, while a lower loss indicates better performance.

The choice of loss function depends on the type of problem:

*   **Binary Classification:** **Binary Cross-Entropy (BCE)** is a common choice. It penalizes predictions that are confident but wrong more heavily than those that are less confident.
*   **Multi-Class Classification:** **Categorical Cross-Entropy** is typically used.
*   **Regression:** **Mean Squared Error (MSE)** or **Mean Absolute Error (MAE)** are often used.
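
The claim that Binary Cross-Entropy punishes confident mistakes more heavily can be seen by evaluating the loss of a single prediction. This is a minimal sketch; `bce_single` is a hypothetical helper that omits the clipping a full implementation would need for probabilities of exactly 0 or 1.

```python
import numpy as np

def bce_single(y_true, y_pred):
    """Binary cross-entropy for one prediction; y_pred must lie strictly in (0, 1)."""
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# The true label is 1 in every case; only the predicted probability changes
for p in (0.9, 0.6, 0.4, 0.1, 0.01):
    print(f"p = {p:<4}  loss = {bce_single(1, p):.4f}")
```

The loss grows from about 0.11 at `p = 0.9` to about 0.92 at `p = 0.4` (mildly wrong) and about 4.61 at `p = 0.01` (confidently wrong).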

Our current problem is binary classification, so we will use the Binary Cross-Entropy loss function.

### Optimization Algorithm: Minimizing the Loss

Once we can measure the error using a loss function, we need a method to systematically adjust the weights and biases to minimize this error. This is where **optimization algorithms** come in.

One of the most fundamental and widely used optimization algorithms is **Gradient Descent**.

#### Gradient Descent

Imagine the loss function as a landscape with hills and valleys, where the "height" represents the loss and the "position" represents the values of the network's weights and biases. The goal is to find the lowest point in this landscape, which corresponds to the set of weights and biases that minimizes the loss.

Gradient Descent works by iteratively taking steps in the direction of the **steepest descent** in the loss landscape. The direction of the steepest descent is given by the negative of the **gradient** of the loss function with respect to each weight and bias.

*   **Gradient:** The gradient is a vector of partial derivatives. For each weight and each bias in the network, the gradient tells us how much the loss function changes when that specific weight or bias is slightly changed.
*   **Learning Rate:** The size of the steps taken in the direction of the negative gradient is controlled by a parameter called the **learning rate**. A larger learning rate means bigger steps, which can lead to faster convergence but might overshoot the minimum. A smaller learning rate means smaller steps, which can lead to more stable convergence but might be slow.
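
The effect of the learning rate is easiest to see on a one-parameter problem. The sketch below is an illustration only, not the network's loss: it minimises the toy function $f(w) = (w - 3)^2$, whose gradient is $2(w - 3)$ and whose minimum is at $w = 3$. The function name `gradient_descent_1d` and the chosen learning rates are assumptions for the example.

```python
def gradient_descent_1d(learning_rate, n_steps=20, w_start=0.0):
    """Minimise f(w) = (w - 3)**2 with plain gradient descent."""
    w = w_start
    for _ in range(n_steps):
        grad = 2 * (w - 3)              # derivative of (w - 3)**2
        w = w - learning_rate * grad    # step against the gradient
    return w

for lr in (0.01, 0.1, 1.1):
    print(f"learning rate {lr}: w after 20 steps = {gradient_descent_1d(lr):.4f}")
```

With a learning rate of 0.01 the parameter is still far from 3 after 20 steps, with 0.1 it is close to 3, and with 1.1 every step overshoots the minimum by more than the previous one, so the value diverges.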

The process involves:
1.  **Forward Pass:** Calculate the network's output and the loss for the current inputs.
2.  **Backward Pass (Backpropagation):** Calculate the gradient of the loss with respect to each weight and bias in the network, working backward from the output layer to the input layer using the chain rule of calculus.
3.  **Parameter Update:** Adjust the weights and biases by subtracting a fraction of their gradients (determined by the learning rate).

This iterative process of forward pass, backward pass, and parameter update is repeated for many **epochs** (an epoch is one full pass through the entire training dataset) until the network's performance is satisfactory or the loss converges.

Let's implement the Binary Cross-Entropy loss function and then delve into the backward pass and gradient descent.


**Reasoning**:
Implement the Binary Cross-Entropy loss function described in the markdown cell above.



In [17]:
# Implement a loss function suitable for binary classification (Binary Cross-Entropy)
def binary_cross_entropy_loss(y_true, y_pred):
  """
  Calculates the Binary Cross-Entropy loss.

  Args:
    y_true: The true binary labels (0 or 1).
    y_pred: The predicted probabilities (output of the Sigmoid function).

  Returns:
    The average Binary Cross-Entropy loss.
  """
  # Avoid log(0) by clipping predicted values
  epsilon = 1e-15
  y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

  # Binary Cross-Entropy formula: - (y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))
  loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
  return loss

# Demonstrate the usage with example true labels and predicted probabilities
example_y_true = np.array([0, 1, 1, 0, 1])
example_y_pred = np.array([0.1, 0.9, 0.6, 0.4, 0.8]) # Hypothetical outputs from Sigmoid

loss = binary_cross_entropy_loss(example_y_true, example_y_pred)
print(f"Example True Labels: {example_y_true}")
print(f"Example Predicted Probabilities: {example_y_pred}")
print(f"Binary Cross-Entropy Loss: {loss:.4f}")

# Demonstrate with predictions that are closer to true values
example_y_pred_better = np.array([0.05, 0.95, 0.7, 0.3, 0.9])
loss_better = binary_cross_entropy_loss(example_y_true, example_y_pred_better)
print(f"\nExample Predicted Probabilities (Better): {example_y_pred_better}")
print(f"Binary Cross-Entropy Loss (Better): {loss_better:.4f}")

Example True Labels: [0 1 1 0 1]
Example Predicted Probabilities: [0.1 0.9 0.6 0.4 0.8]
Binary Cross-Entropy Loss: 0.2911

Example Predicted Probabilities (Better): [0.05 0.95 0.7  0.3  0.9 ]
Binary Cross-Entropy Loss (Better): 0.1843


**Reasoning**:
Explain and implement the backward pass (backpropagation) and the gradient descent update rule. This requires calculating gradients for both the output and hidden layer parameters and then updating those parameters.



In [18]:
%%markdown
### Backward Pass (Backpropagation) and Gradient Descent

Now that we have a way to measure the error (Loss Function), we need to figure out how to adjust the weights and biases to reduce this error. This is done using the **Backward Pass**, also known as **Backpropagation**, and the **Gradient Descent** update rule.

#### The Chain Rule: The Engine of Backpropagation

Backpropagation uses the **chain rule** of calculus to calculate the gradient of the loss function with respect to each weight and bias in the network. The chain rule allows us to calculate how a change in a weight or bias in an earlier layer affects the final loss, by multiplying the rates of change (derivatives) through each layer.

Essentially, we start at the output layer and calculate the gradient of the loss with respect to the output. Then, we propagate this gradient backward through the network, layer by layer, calculating the gradient with respect to the weights and biases in each layer.

#### Gradient Calculation Steps (for our simple network):

Let $L$ be the Binary Cross-Entropy Loss, $y_{true}$ be the true labels, $y_{pred}$ be the network's output (after Sigmoid), $Z_{output}$ be the weighted sum for the output layer, $A_{hidden}$ be the output of the hidden layer (after ReLU), $Z_{hidden}$ be the weighted sum for the hidden layer, $W_{output}$ and $B_{output}$ be the weights and bias for the output layer, and $W_{hidden}$ and $B_{hidden}$ be the weights and bias for the hidden layer. $X$ is the input data.

1.  **Gradient of Loss with respect to Output ($y_{pred}$):**
    The derivative of the BCE loss with respect to the Sigmoid output $y_{pred}$ is:
    $ \frac{\partial L}{\partial y_{pred}} = -\left(\frac{y_{true}}{y_{pred}} - \frac{1 - y_{true}}{1 - y_{pred}}\right) $

2.  **Gradient of Output ($y_{pred}$) with respect to Output Weighted Sum ($Z_{output}$):**
    The derivative of the Sigmoid function is $ \sigma'(Z) = \sigma(Z)(1 - \sigma(Z)) $. So,
    $ \frac{\partial y_{pred}}{\partial Z_{output}} = y_{pred} (1 - y_{pred}) $

3.  **Gradient of Loss with respect to Output Weighted Sum ($Z_{output}$):**
    Using the chain rule:
    $ \frac{\partial L}{\partial Z_{output}} = \frac{\partial L}{\partial y_{pred}} \times \frac{\partial y_{pred}}{\partial Z_{output}} = -\left(\frac{y_{true}}{y_{pred}} - \frac{1 - y_{true}}{1 - y_{pred}}\right) \times y_{pred} (1 - y_{pred}) $
    This simplifies nicely for BCE and Sigmoid:
    $ \frac{\partial L}{\partial Z_{output}} = y_{pred} - y_{true} $

4.  **Gradient of Loss with respect to Output Layer Weights ($W_{output}$) and Bias ($B_{output}$):**
    Now we can calculate the gradients for the output layer parameters using the chain rule:
    $ \frac{\partial L}{\partial W_{output}} = \frac{\partial L}{\partial Z_{output}} \times \frac{\partial Z_{output}}{\partial W_{output}} $
    Since $Z_{output} = A_{hidden} \cdot W_{output} + B_{output}$, the derivative of $Z_{output}$ with respect to $W_{output}$ is $A_{hidden}^T$.
    $ \frac{\partial L}{\partial W_{output}} = A_{hidden}^T \cdot (y_{pred} - y_{true}) $ (Note: This is a matrix multiplication)

    $ \frac{\partial L}{\partial B_{output}} = \frac{\partial L}{\partial Z_{output}} \times \frac{\partial Z_{output}}{\partial B_{output}} $
    The derivative of $Z_{output}$ with respect to $B_{output}$ is 1.
    $ \frac{\partial L}{\partial B_{output}} = y_{pred} - y_{true} $ (per sample; for a batch, sum this over the samples, and divide by the number of samples when $L$ is the mean loss)

5.  **Gradient of Loss with respect to Hidden Layer Output ($A_{hidden}$):**
    To propagate the gradient backward to the hidden layer, we need the gradient of the loss with respect to the hidden layer's output:
    $ \frac{\partial L}{\partial A_{hidden}} = \frac{\partial L}{\partial Z_{output}} \times \frac{\partial Z_{output}}{\partial A_{hidden}} $
    Since $Z_{output} = A_{hidden} \cdot W_{output} + B_{output}$, the derivative of $Z_{output}$ with respect to $A_{hidden}$ is $W_{output}^T$.
    $ \frac{\partial L}{\partial A_{hidden}} = (y_{pred} - y_{true}) \cdot W_{output}^T $ (Note: This is a matrix multiplication)

6.  **Gradient of Hidden Layer Output ($A_{hidden}$) with respect to Hidden Layer Weighted Sum ($Z_{hidden}$):**
    We used the ReLU activation function for the hidden layer. The derivative of the ReLU function is:
    $ ReLU'(Z) = 1 $ if $ Z > 0 $
    $ ReLU'(Z) = 0 $ if $ Z \le 0 $ (the derivative is undefined at exactly $Z = 0$; using 0 there is a common convention)
    So, $ \frac{\partial A_{hidden}}{\partial Z_{hidden}} = ReLU'(Z_{hidden}) $ (applied element-wise to $Z_{hidden}$)

7.  **Gradient of Loss with respect to Hidden Layer Weighted Sum ($Z_{hidden}$):**
    Using the chain rule:
    $ \frac{\partial L}{\partial Z_{hidden}} = \frac{\partial L}{\partial A_{hidden}} \times \frac{\partial A_{hidden}}{\partial Z_{hidden}} = \left((y_{pred} - y_{true}) \cdot W_{output}^T\right) \times ReLU'(Z_{hidden}) $ (Element-wise multiplication with the derivative of ReLU)

8.  **Gradient of Loss with respect to Hidden Layer Weights ($W_{hidden}$) and Bias ($B_{hidden}$):**
    Finally, we can calculate the gradients for the hidden layer parameters:
    $ \frac{\partial L}{\partial W_{hidden}} = \frac{\partial L}{\partial Z_{hidden}} \times \frac{\partial Z_{hidden}}{\partial W_{hidden}} $
    Since $Z_{hidden} = X \cdot W_{hidden} + B_{hidden}$, the derivative of $Z_{hidden}$ with respect to $W_{hidden}$ is $X^T$.
    $ \frac{\partial L}{\partial W_{hidden}} = X^T \cdot \left(\frac{\partial L}{\partial Z_{hidden}}\right) $ (Note: This is a matrix multiplication)

    $ \frac{\partial L}{\partial B_{hidden}} = \frac{\partial L}{\partial Z_{hidden}} \times \frac{\partial Z_{hidden}}{\partial B_{hidden}} $
    The derivative of $Z_{hidden}$ with respect to $B_{hidden}$ is 1.
    $ \frac{\partial L}{\partial B_{hidden}} = \frac{\partial L}{\partial Z_{hidden}} $ (per sample; for a batch, sum this over the samples, and divide by the number of samples when $L$ is the mean loss)
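
A common way to gain confidence in derivations like the eight steps above is a numerical gradient check: perturb each parameter slightly, measure how the loss changes, and compare with the analytic formula. The sketch below is self-contained and uses hypothetical random data and parameters, not the notebook's `X`. It assumes the loss is the mean BCE over the batch, so the gradient from step 3 is divided by the number of samples; all function names are illustration names.

```python
import numpy as np

def sigmoid_fn(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(params, X, y):
    """Mean BCE of a one-hidden-layer ReLU network with a Sigmoid output."""
    W_h, b_h, W_o, b_o = params
    a_h = np.maximum(0.0, X @ W_h + b_h)
    y_pred = sigmoid_fn(a_h @ W_o + b_o)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

def analytic_grads(params, X, y):
    """Gradients from the chain-rule steps, scaled by 1/n for the mean loss."""
    W_h, b_h, W_o, b_o = params
    n = X.shape[0]
    z_h = X @ W_h + b_h
    a_h = np.maximum(0.0, z_h)
    y_pred = sigmoid_fn(a_h @ W_o + b_o)
    dz_o = (y_pred - y) / n              # step 3
    dW_o = a_h.T @ dz_o                  # step 4 (weights)
    db_o = dz_o.sum(axis=0)              # step 4 (bias)
    dz_h = (dz_o @ W_o.T) * (z_h > 0)    # steps 5 to 7
    dW_h = X.T @ dz_h                    # step 8 (weights)
    db_h = dz_h.sum(axis=0)              # step 8 (bias)
    return [dW_h, db_h, dW_o, db_o]

def numeric_grads(params, X, y, eps=1e-6):
    """Central finite differences, one parameter entry at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            loss_plus = loss_fn(params, X, y)
            p[idx] = old - eps
            loss_minus = loss_fn(params, X, y)
            p[idx] = old                 # restore the original value
            g[idx] = (loss_plus - loss_minus) / (2 * eps)
        grads.append(g)
    return grads

# Hypothetical data: 8 samples, 2 features, binary labels of shape (8, 1)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(8, 2))
y_demo = rng.integers(0, 2, size=(8, 1)).astype(float)
params = [rng.normal(size=(2, 2)), rng.normal(size=2),
          rng.normal(size=(2, 1)), rng.normal(size=1)]

names = ["W_hidden", "b_hidden", "W_output", "b_output"]
max_diffs = [np.max(np.abs(a - num))
             for a, num in zip(analytic_grads(params, X_demo, y_demo),
                               numeric_grads(params, X_demo, y_demo))]
for name, diff in zip(names, max_diffs):
    print(f"{name}: max |analytic - numeric| = {diff:.2e}")
```

If the formulas are implemented correctly, the differences should be tiny compared with the gradients themselves. A large difference usually points to a missing transpose, a wrong sign, or inconsistent scaling by the number of samples.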

#### Gradient Descent Update Rule

Once we have the gradients for all weights and biases, we update them using the gradient descent rule:

$ Parameter = Parameter - LearningRate \times Gradient $

Where `Parameter` is a weight or bias, `LearningRate` is a small positive value that controls the step size, and `Gradient` is the calculated gradient of the loss with respect to that parameter.
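
As a small illustration of the rule, the sketch below applies one update to a weight matrix. The values of `weights`, `grad_weights`, and `learning_rate` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

learning_rate = 0.1  # hypothetical value

# Hypothetical parameters and gradients; the shapes must match
weights = np.array([[0.2, -0.4],
                    [0.1, 0.3]])
grad_weights = np.array([[0.5, -1.0],
                         [0.0, 2.0]])

# Parameter = Parameter - LearningRate * Gradient, applied element-wise
weights_updated = weights - learning_rate * grad_weights
print(weights_updated)
```

Entries with a positive gradient decrease (0.2 becomes 0.15), entries with a negative gradient increase (-0.4 becomes -0.3), and an entry with zero gradient is left unchanged.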

Let's implement these gradient calculations and the update rule. We'll need to modify our forward pass function slightly to store the intermediate values ($Z_{hidden}, A_{hidden}, Z_{output}$) which are needed for the backward pass.

### Backward Pass (Backpropagation) and Gradient Descent

Now that we have a way to measure the error (Loss Function), we need to figure out how to adjust the weights and biases to reduce this error. This is done using the **Backward Pass**, also known as **Backpropagation**, and the **Gradient Descent** update rule.

#### The Chain Rule: The Engine of Backpropagation

Backpropagation uses the **chain rule** of calculus to calculate the gradient of the loss function with respect to each weight and bias in the network. The chain rule allows us to calculate how a change in a weight or bias in an earlier layer affects the final loss, by multiplying the rates of change (derivatives) through each layer.

Essentially, we start at the output layer and calculate the gradient of the loss with respect to the output. Then, we propagate this gradient backward through the network, layer by layer, calculating the gradient with respect to the weights and biases in each layer.

#### Gradient Calculation Steps (for our simple network):

Let $L$ be the Binary Cross-Entropy Loss, $y_{true}$ be the true labels, $y_{pred}$ be the network's output (after Sigmoid), $Z_{output}$ be the weighted sum for the output layer, $A_{hidden}$ be the output of the hidden layer (after ReLU), $Z_{hidden}$ be the weighted sum for the hidden layer, $W_{output}$ and $B_{output}$ be the weights and bias for the output layer, and $W_{hidden}$ and $B_{hidden}$ be the weights and bias for the hidden layer. $X$ is the input data.

1.  **Gradient of Loss with respect to Output ($y_{pred}$):**
    The derivative of the BCE loss with respect to the Sigmoid output $y_{pred}$ is:
    $ \frac{\partial L}{\partial y_{pred}} = -\left(\frac{y_{true}}{y_{pred}} - \frac{1 - y_{true}}{1 - y_{pred}}\right) $

2.  **Gradient of Output ($y_{pred}$) with respect to Output Weighted Sum ($Z_{output}$):**
    The derivative of the Sigmoid function is $ \sigma'(Z) = \sigma(Z)(1 - \sigma(Z)) $. So,
    $ \frac{\partial y_{pred}}{\partial Z_{output}} = y_{pred} (1 - y_{pred}) $

3.  **Gradient of Loss with respect to Output Weighted Sum ($Z_{output}$):**
    Using the chain rule:
    $ \frac{\partial L}{\partial Z_{output}} = \frac{\partial L}{\partial y_{pred}} \times \frac{\partial y_{pred}}{\partial Z_{output}} = -\left(\frac{y_{true}}{y_{pred}} - \frac{1 - y_{true}}{1 - y_{pred}}\right) \times y_{pred} (1 - y_{pred}) $
    This simplifies nicely for BCE and Sigmoid:
    $ \frac{\partial L}{\partial Z_{output}} = y_{pred} - y_{true} $

4.  **Gradient of Loss with respect to Output Layer Weights ($W_{output}$) and Bias ($B_{output}$):**
    Now we can calculate the gradients for the output layer parameters using the chain rule:
    $ \frac{\partial L}{\partial W_{output}} = \frac{\partial L}{\partial Z_{output}} \times \frac{\partial Z_{output}}{\partial W_{output}} $
    Since $Z_{output} = A_{hidden} \cdot W_{output} + B_{output}$, the derivative of $Z_{output}$ with respect to $W_{output}$ is $A_{hidden}^T$.
    $ \frac{\partial L}{\partial W_{output}} = A_{hidden}^T \cdot (y_{pred} - y_{true}) $ (Note: This is a matrix multiplication)

    $ \frac{\partial L}{\partial B_{output}} = \frac{\partial L}{\partial Z_{output}} \times \frac{\partial Z_{output}}{\partial B_{output}} $
    The derivative of $Z_{output}$ with respect to $B_{output}$ is 1.
    $ \frac{\partial L}{\partial B_{output}} = y_{pred} - y_{true} $ (Summed over samples for the bias gradient)

5.  **Gradient of Loss with respect to Hidden Layer Output ($A_{hidden}$):**
    To propagate the gradient backward to the hidden layer, we need the gradient of the loss with respect to the hidden layer's output:
    $ \frac{\partial L}{\partial A_{hidden}} = \frac{\partial L}{\partial Z_{output}} \times \frac{\partial Z_{output}}{\partial A_{hidden}} $
    Since $Z_{output} = A_{hidden} \cdot W_{output} + B_{output}$, the derivative of $Z_{output}$ with respect to $A_{hidden}$ is $W_{output}^T$.
    $ \frac{\partial L}{\partial A_{hidden}} = (y_{pred} - y_{true}) \cdot W_{output}^T $ (Note: This is a matrix multiplication)

6.  **Gradient of Hidden Layer Output ($A_{hidden}$) with respect to Hidden Layer Weighted Sum ($Z_{hidden}$):**
    We used the ReLU activation function for the hidden layer. The derivative of the ReLU function is:
    $ ReLU'(Z) = 1 $ if $ Z > 0 $
    $ ReLU'(Z) = 0 $ if $ Z \le 0 $
    So, $ \frac{\partial A_{hidden}}{\partial Z_{hidden}} = ReLU'(Z_{hidden}) $ (Element-wise multiplication)

7.  **Gradient of Loss with respect to Hidden Layer Weighted Sum ($Z_{hidden}$):**
    Using the chain rule:
    $ \frac{\partial L}{\partial Z_{hidden}} = \frac{\partial L}{\partial A_{hidden}} \times \frac{\partial A_{hidden}}{\partial Z_{hidden}} = \left((y_{pred} - y_{true}) \cdot W_{output}^T\right) \times ReLU'(Z_{hidden}) $ (Element-wise multiplication with the derivative of ReLU)

8.  **Gradient of Loss with respect to Hidden Layer Weights ($W_{hidden}$) and Bias ($B_{hidden}$):**
    Finally, we can calculate the gradients for the hidden layer parameters:
    $ \frac{\partial L}{\partial W_{hidden}} = \frac{\partial L}{\partial Z_{hidden}} \times \frac{\partial Z_{hidden}}{\partial W_{hidden}} $
    Since $Z_{hidden} = X \cdot W_{hidden} + B_{hidden}$, the derivative of $Z_{hidden}$ with respect to $W_{hidden}$ is $X^T$.
    $ \frac{\partial L}{\partial W_{hidden}} = X^T \cdot \left(\frac{\partial L}{\partial Z_{hidden}}\right) $ (Note: This is a matrix multiplication)

    $ \frac{\partial L}{\partial B_{hidden}} = \frac{\partial L}{\partial Z_{hidden}} \times \frac{\partial Z_{hidden}}{\partial B_{hidden}} $
    The derivative of $Z_{hidden}$ with respect to $B_{hidden}$ is 1.
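    $ \frac{\partial L}{\partial B_{hidden}} = \frac{\partial L}{\partial Z_{hidden}} $ (Summed over samples for the bias gradient)

Note: in the implementation below, every weight and bias gradient is additionally divided by the number of samples $m$, so each update uses the gradient averaged over the dataset.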
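
One way to sanity-check steps 4 to 8 is a numerical gradient check: perturb each parameter slightly, measure how the loss changes, and compare with the analytic formula. The following is a minimal sketch, not part of the original notebook. It assumes the loss is the mean Binary Cross-Entropy (so the gradients are divided by the number of samples `m`), uses small random data, and introduces the hypothetical helpers `loss_fn`, `analytic_grads`, and `numeric_grad` purely for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(X, y, W_h, b_h, W_o, b_o):
    """Mean binary cross-entropy of the two-layer network (assumed loss)."""
    a_h = relu(X @ W_h + b_h)
    y_pred = sigmoid(a_h @ W_o + b_o)
    eps = 1e-12  # guards against log(0)
    return -np.mean(y * np.log(y_pred + eps) + (1 - y) * np.log(1 - y_pred + eps))

def analytic_grads(X, y, W_h, b_h, W_o, b_o):
    """Gradients from the derivation above, averaged over the m samples."""
    m = X.shape[0]
    z_h = X @ W_h + b_h
    a_h = relu(z_h)
    y_pred = sigmoid(a_h @ W_o + b_o)
    dZ_o = y_pred - y                      # sigmoid + BCE shortcut
    dW_o = a_h.T @ dZ_o / m
    db_o = dZ_o.sum(axis=0) / m
    dZ_h = (dZ_o @ W_o.T) * (z_h > 0)      # element-wise ReLU derivative
    dW_h = X.T @ dZ_h / m
    db_h = dZ_h.sum(axis=0) / m
    return dW_h, db_h, dW_o, db_o

def numeric_grad(f, param, h=1e-6):
    """Central finite differences, perturbing one entry of `param` at a time."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + h
        f_plus = f()
        param[idx] = old - h
        f_minus = f()
        param[idx] = old                   # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

# Small random problem: 8 samples, 2 features, 2 hidden neurons, 1 output
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
y = rng.integers(0, 2, size=(8, 1)).astype(float)
W_h = rng.normal(size=(2, 2))
b_h = rng.normal(size=2)
W_o = rng.normal(size=(2, 1))
b_o = rng.normal(size=1)

f = lambda: loss_fn(X, y, W_h, b_h, W_o, b_o)
analytic = analytic_grads(X, y, W_h, b_h, W_o, b_o)
params = (W_h, b_h, W_o, b_o)
max_diff = max(np.max(np.abs(a - numeric_grad(f, p))) for a, p in zip(analytic, params))
print(f"Largest analytic-vs-numeric difference: {max_diff:.2e}")
```

If the derivation is implemented correctly, the reported difference should be tiny compared with the gradients themselves. A large difference usually points to a missing transpose, a missing division by `m`, or a wrong activation derivative.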

#### Gradient Descent Update Rule

Once we have the gradients for all weights and biases, we update them using the gradient descent rule:

$ Parameter = Parameter - LearningRate \times Gradient $

Where `Parameter` is a weight or bias, `LearningRate` is a small positive value that controls the step size, and `Gradient` is the calculated gradient of the loss with respect to that parameter.
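
To see the rule in isolation, here is a minimal sketch on a one-parameter toy loss $L(w) = (w - 3)^2$, whose gradient is $2(w - 3)$. The toy loss, the starting value, the learning rate, and the function name `gradient_descent_1d` are illustrative assumptions and do not come from the notebook's network.

```python
def gradient_descent_1d(w, learning_rate, steps):
    """Repeatedly apply w = w - learning_rate * dL/dw for L(w) = (w - 3)**2."""
    history = [w]
    for _ in range(steps):
        grad = 2 * (w - 3)              # derivative of (w - 3)**2
        w = w - learning_rate * grad    # the gradient descent update rule
        history.append(w)
    return history

history = gradient_descent_1d(w=0.0, learning_rate=0.1, steps=50)
print(f"start: {history[0]:.4f}, after 1 step: {history[1]:.4f}, after 50 steps: {history[-1]:.4f}")
```

With this learning rate, each step shrinks the distance to the minimum at $w = 3$ by a factor of $1 - 2 \times 0.1 = 0.8$, which is why a small positive learning rate moves the parameter steadily toward lower loss.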

Let's implement these gradient calculations and the update rule. We'll need to modify our forward pass function slightly to store the intermediate values ($Z_{hidden}, A_{hidden}, Z_{output}$) which are needed for the backward pass.


**Reasoning**:
Implement the modified forward pass to store intermediate values and the backward pass to calculate gradients for all weights and biases, and finally implement the gradient descent update rule, covering instructions 3 and 4.



In [19]:
# Modify the forward pass to store intermediate values needed for backpropagation
def forward_pass_with_intermediates(X, weights_hidden, bias_hidden, weights_output, bias_output):
  """
  Represents the forward pass through a simple neural network,
  returning intermediate values needed for backpropagation.

  Args:
    X: Input data (numpy array).
    weights_hidden: Weights for the hidden layer.
    bias_hidden: Bias for the hidden layer.
    weights_output: Weights for the output layer.
    bias_output: Bias for the output layer.

  Returns:
    A tuple containing:
      - output_layer_output: The final output probability.
      - hidden_layer_output: The output of the hidden layer (after activation).
      - output_layer_input: The weighted sum for the output layer (before activation).
      - hidden_layer_input: The weighted sum for the hidden layer (before activation).
  """
  # Ensure X is a numpy array for matrix multiplication
  X = np.array(X)

  # Calculate the weighted sum for the hidden layer
  hidden_layer_input = np.dot(X, weights_hidden) + bias_hidden

  # Apply the activation function for the hidden layer (using ReLU)
  hidden_layer_output = relu(hidden_layer_input)

  # Calculate the weighted sum for the output layer
  output_layer_input = np.dot(hidden_layer_output, weights_output) + bias_output

  # Apply the activation function for the output layer (using Sigmoid)
  output_layer_output = sigmoid(output_layer_input)

  return output_layer_output, hidden_layer_output, output_layer_input, hidden_layer_input

# Implement the backward pass to calculate gradients
def backward_pass(X, y_true, y_pred, hidden_layer_output, output_layer_input, weights_output, hidden_layer_input):
  """
  Implements the backward pass to calculate gradients for weights and biases.

  Args:
    X: Input data (numpy array).
    y_true: True binary labels (numpy array).
    y_pred: Predicted probabilities from the forward pass (numpy array).
    hidden_layer_output: Output of the hidden layer (after activation) from forward pass.
    output_layer_input: Weighted sum for the output layer (before activation) from forward pass.
      Not used in the calculation, because the Sigmoid + BCE gradient simplifies to y_pred - y_true.
    weights_output: Weights for the output layer.
    hidden_layer_input: Weighted sum for the hidden layer (before activation) from forward pass.

  Returns:
    A tuple containing:
      - grads_weights_hidden: Gradients for hidden layer weights.
      - grads_bias_hidden: Gradients for hidden layer bias.
      - grads_weights_output: Gradients for output layer weights.
      - grads_bias_output: Gradients for output layer bias.
  """
  X = np.asarray(X) # Accept a DataFrame as well as an array, as the forward pass does
  m = X.shape[0] # Number of samples

  # Calculate the gradient of the loss with respect to the output weighted sum (dZ_output)
  # This is the simplified gradient for BCE with Sigmoid
  dZ_output = y_pred - y_true.reshape(-1, 1) # Ensure y_true has compatible shape

  # Calculate gradients for output layer weights and bias
  grads_weights_output = np.dot(hidden_layer_output.T, dZ_output) / m
  grads_bias_output = np.sum(dZ_output, axis=0) / m

  # Calculate the gradient of the loss with respect to the hidden layer output (dA_hidden)
  dA_hidden = np.dot(dZ_output, weights_output.T)

  # Calculate the gradient of the loss with respect to the hidden layer weighted sum (dZ_hidden)
  # Apply the derivative of the ReLU activation function
  dZ_hidden = dA_hidden * (hidden_layer_input > 0) # Derivative of ReLU is 1 for Z > 0, 0 otherwise

  # Calculate gradients for hidden layer weights and bias
  grads_weights_hidden = np.dot(X.T, dZ_hidden) / m
  grads_bias_hidden = np.sum(dZ_hidden, axis=0) / m

  return grads_weights_hidden, grads_bias_hidden, grads_weights_output, grads_bias_output

# Implement the gradient descent update rule
def update_parameters(weights_hidden, bias_hidden, weights_output, bias_output,
                      grads_weights_hidden, grads_bias_hidden, grads_weights_output, grads_bias_output,
                      learning_rate):
  """
  Updates the weights and biases using gradient descent.

  Args:
    weights_hidden: Current hidden layer weights.
    bias_hidden: Current hidden layer bias.
    weights_output: Current output layer weights.
    bias_output: Current output layer bias.
    grads_weights_hidden: Gradients for hidden layer weights.
    grads_bias_hidden: Gradients for hidden layer bias.
    grads_weights_output: Gradients for output layer weights.
    grads_bias_output: Gradients for output layer bias.
    learning_rate: The learning rate for gradient descent.

  Returns:
    A tuple containing the updated weights and biases:
      - weights_hidden_updated
      - bias_hidden_updated
      - weights_output_updated
      - bias_output_updated
  """
  weights_hidden_updated = weights_hidden - learning_rate * grads_weights_hidden
  bias_hidden_updated = bias_hidden - learning_rate * grads_bias_hidden
  weights_output_updated = weights_output - learning_rate * grads_weights_output
  bias_output_updated = bias_output - learning_rate * grads_bias_output

  return weights_hidden_updated, bias_hidden_updated, weights_output_updated, bias_output_updated

# Demonstrate the backward pass and parameter update with a single forward pass example
# Use the initialized weights and biases from the previous step
# We need the intermediate outputs from the forward pass

y_true_array = y.values # Convert pandas Series to numpy array for calculations

# Perform a forward pass to get intermediates
y_pred, hidden_layer_output, output_layer_input, hidden_layer_input = forward_pass_with_intermediates(
    X, weights_hidden, bias_hidden, weights_output, bias_output
)

# Calculate gradients
grads_weights_hidden, grads_bias_hidden, grads_weights_output, grads_bias_output = backward_pass(
    X, y_true_array, y_pred, hidden_layer_output, output_layer_input, weights_output, hidden_layer_input
)

print("Gradients for Output Layer Weights:\n", grads_weights_output[:5])
print("\nGradients for Output Layer Bias:\n", grads_bias_output)
print("\nGradients for Hidden Layer Weights:\n", grads_weights_hidden[:5])
print("\nGradients for Hidden Layer Bias:\n", grads_bias_hidden)


# Demonstrate parameter update
learning_rate = 0.01
weights_hidden_updated, bias_hidden_updated, weights_output_updated, bias_output_updated = update_parameters(
    weights_hidden, bias_hidden, weights_output, bias_output,
    grads_weights_hidden, grads_bias_hidden, grads_weights_output, grads_bias_output,
    learning_rate
)

print("\nOriginal Hidden Layer Weights:\n", weights_hidden[:5])
print("\nUpdated Hidden Layer Weights (first 5 rows):\n", weights_hidden_updated[:5])
print("\nOriginal Output Layer Weights:\n", weights_output[:5])
print("\nUpdated Output Layer Weights (first 5 rows):\n", weights_output_updated[:5])

Gradients for Output Layer Weights:
 [[0.02277063]
 [0.02336304]]

Gradients for Output Layer Bias:
 [0.19156682]

Gradients for Hidden Layer Weights:
 [[ 0.01178141  0.00692273]
 [-0.00226344 -0.00133   ]]

Gradients for Hidden Layer Bias:
 [0.01333256 0.00783418]

Original Hidden Layer Weights:
 [[0.02708323 0.04389714]
 [0.00784564 0.00253507]]

Updated Hidden Layer Weights (first 5 rows):
 [[0.02696541 0.04382791]
 [0.00786827 0.00254837]]

Original Output Layer Weights:
 [[0.06959742]
 [0.04089529]]

Updated Output Layer Weights (first 5 rows):
 [[0.06936971]
 [0.04066166]]
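
The printed values can be checked by hand against the update rule. As a small sketch, the snippet below copies the first row of the original hidden-layer weights, the first row of their gradients, and the first row of the updated weights from the output above, then re-applies the update with the learning rate of 0.01 used in the demonstration. The variable names are chosen here for illustration.

```python
import numpy as np

learning_rate = 0.01
original = np.array([0.02708323, 0.04389714])         # first row of the original hidden weights
gradient = np.array([0.01178141, 0.00692273])         # first row of the hidden weight gradients
printed_updated = np.array([0.02696541, 0.04382791])  # first row of the printed updated weights

# Parameter = Parameter - LearningRate * Gradient
recomputed = original - learning_rate * gradient
print(recomputed)
print(np.allclose(recomputed, printed_updated, atol=1e-7))
```

Because the gradients are positive here, the update makes each of these weights slightly smaller, and the change is only about one hundredth of the gradient.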


**Reasoning**:
Combine the forward pass, loss calculation, backward pass, and parameter updates into a training loop, train the network for a specified number of epochs, and print the loss periodically, covering instructions 5, 6, and 7.



In [20]:
# 5. Combine the steps into a training loop
# 6. Train the neural network on the dataset X and y for a specified number of epochs
# 7. Print the loss periodically during training to show the learning progress

# Initialize weights and biases again (or use the updated ones from the previous step)
# Let's re-initialize for a fresh training run demonstration
weights_hidden = np.random.rand(n_input_features, n_hidden_neurons) * 0.1
bias_hidden = np.random.rand(n_hidden_neurons) * 0.1
weights_output = np.random.rand(n_hidden_neurons, n_output_neurons) * 0.1
bias_output = np.random.rand(n_output_neurons) * 0.1

learning_rate = 0.1 # Set a learning rate
epochs = 1000 # Define the number of training epochs

# Convert y to numpy array for consistent calculations
y_true_array = y.values.reshape(-1, 1)

print("Starting training...")

# Training loop
for epoch in range(epochs):
  # Forward pass to get predictions and intermediate values
  y_pred, hidden_layer_output, output_layer_input, hidden_layer_input = forward_pass_with_intermediates(
      X, weights_hidden, bias_hidden, weights_output, bias_output
  )

  # Calculate the loss
  loss = binary_cross_entropy_loss(y_true_array, y_pred)

  # Backward pass to calculate gradients
  grads_weights_hidden, grads_bias_hidden, grads_weights_output, grads_bias_output = backward_pass(
      X, y_true_array, y_pred, hidden_layer_output, output_layer_input, weights_output, hidden_layer_input
  )

  # Update parameters using gradient descent
  weights_hidden, bias_hidden, weights_output, bias_output = update_parameters(
      weights_hidden, bias_hidden, weights_output, bias_output,
      grads_weights_hidden, grads_bias_hidden, grads_weights_output, grads_bias_output,
      learning_rate
  )

  # Print the loss periodically
  if epoch % 100 == 0:
    print(f"Epoch {epoch}, Loss: {loss:.4f}")

print("Training finished.")

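# Optionally, print the final loss
# (this is the loss from the last epoch's forward pass, computed before the final parameter update)
print(f"Final Loss after {epochs} epochs: {loss:.4f}")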

# Optionally, make predictions after training
final_predictions, _, _, _ = forward_pass_with_intermediates(X, weights_hidden, bias_hidden, weights_output, bias_output)
binary_final_predictions = (final_predictions >= 0.5).astype(int)

# You can compare binary_final_predictions with y_true_array to see accuracy
# For demonstration, let's just print the first few predictions
print("\nFirst 10 binary predictions after training:")
print(binary_final_predictions[:10].flatten())
print("First 10 true labels:")
print(y_true_array[:10].flatten())

Starting training...
Epoch 0, Loss: 0.7059
Epoch 100, Loss: 0.5369
Epoch 200, Loss: 0.4649
Epoch 300, Loss: 0.4243
Epoch 400, Loss: 0.3987
Epoch 500, Loss: 0.3864
Epoch 600, Loss: 0.3799
Epoch 700, Loss: 0.3773
Epoch 800, Loss: 0.3768
Epoch 900, Loss: 0.4023
Training finished.
Final Loss after 1000 epochs: 0.3737

First 10 binary predictions after training:
[0 1 1 0 0 0 0 1 0 0]
First 10 true labels:
[0 1 0 0 0 0 0 1 0 0]


## Neural Network Evaluation

### Evaluate the performance of the trained neural network by using the trained weights and biases to make predictions, convert them to binary, and calculate the accuracy.


In [21]:
# 1. Use the trained weights and biases to make predictions on the input data X
# We use the final weights and biases obtained after training
final_predictions, _, _, _ = forward_pass_with_intermediates(
    X, weights_hidden, bias_hidden, weights_output, bias_output
)

# 2. Convert the predicted probabilities into binary predictions (0 or 1) using a threshold of 0.5
threshold = 0.5
binary_predictions = (final_predictions >= threshold).astype(int)

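# 3. Calculate the accuracy of the model
# Note: X is the same data the network was trained on, so this is training accuracy,
# not an estimate of how well the network generalizes to unseen data
# Ensure y is a numpy array for comparison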
y_true_array = y.values.reshape(-1, 1)

# Compare binary predictions to true labels
correct_predictions = (binary_predictions == y_true_array).sum()

# Calculate accuracy
accuracy = correct_predictions / y_true_array.shape[0]

# 4. Print the calculated accuracy
print(f"Accuracy of the trained neural network: {accuracy:.4f}")

Accuracy of the trained neural network: 0.9000
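
Accuracy alone does not show which kind of mistakes a classifier makes. As a sketch, the snippet below derives confusion-matrix counts, precision, and recall from the ten binary predictions and ten true labels printed after training. The helper name `confusion_counts` is hypothetical, and ten samples are far too few to draw conclusions; the point is only to illustrate the computation.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels given as 0/1 arrays."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted 1, truly 1
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted 1, truly 0
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted 0, truly 0
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted 0, truly 1
    return tp, fp, tn, fn

# First 10 predictions and labels, copied from the training output
first_preds = [0, 1, 1, 0, 0, 0, 0, 1, 0, 0]
first_labels = [0, 1, 0, 0, 0, 0, 0, 1, 0, 0]

tp, fp, tn, fn = confusion_counts(first_labels, first_preds)
accuracy_10 = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
print(f"accuracy={accuracy_10:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

On these ten samples the single error is a false positive, which lowers precision while leaving recall untouched.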


## Summarization

In [22]:
%%markdown
## Summary and Next Steps

### What We Learned

In this notebook, we built a simple neural network from scratch, covering the fundamental concepts:

*   **The Neuron:** The basic computational unit, performing a weighted sum of inputs and applying an activation function.
*   **Weighted Sum and Bias:** How inputs are combined with weights and a bias term ($Z = \sum (input_i \times weight_i) + bias$).
*   **Activation Functions:** The role of non-linear functions (like Sigmoid, ReLU, Tanh) in enabling the network to learn complex patterns.
*   **Layers:** The organization of neurons into Input, Hidden, and Output layers.
*   **Input Layer:** Receiving and structuring the initial data.
*   **Output Layer:** Producing the final prediction, with the activation function chosen based on the task (Sigmoid for binary classification).
*   **Hidden Layers:** Learning abstract representations of the data through non-linear transformations.
*   **Forward Pass:** The process of feeding data through the network to get a prediction.
*   **Loss Function:** Measuring the error between predictions and true labels (Binary Cross-Entropy for our task).
*   **Backward Pass (Backpropagation):** Calculating the gradient of the loss with respect to each weight and bias using the chain rule.
*   **Gradient Descent:** An optimization algorithm for updating weights and biases in the direction that minimizes the loss.
*   **Training Loop:** The iterative process of forward pass, loss calculation, backward pass, and parameter updates over multiple epochs.
*   **Evaluation:** Assessing the trained network's performance using metrics like accuracy.

By coding these components ourselves, we gained a deeper understanding of how neural networks work internally.

### Where to Go From Here

This notebook provided a foundation. Neural networks are a vast field with many avenues for further exploration. Here are some suggested next steps:

*   **Experiment with Network Architecture:**
    *   Add more hidden layers to create a deeper network.
    *   Vary the number of neurons in the hidden layer(s).
*   **Try Different Activation Functions:**
    *   Replace ReLU with Leaky ReLU, ELU, or Swish in the hidden layers.
    *   Explore Softmax for multi-class classification problems.
*   **Implement Other Optimization Algorithms:**
    *   Research and implement more advanced optimizers like Adam, RMSprop, or Adagrad, which adapt the step size for each parameter and often converge faster than basic Gradient Descent.
*   **Explore Different Training Strategies:**
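    *   Implement **Mini-Batch Gradient Descent** (updating parameters on small batches of data; this notebook used full-batch gradient descent, where each update is computed from the entire dataset).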
    *   Implement **Stochastic Gradient Descent (SGD)** (updating parameters after each single data point).
*   **Regularization Techniques:**
    *   Learn about techniques like L1/L2 regularization or Dropout to prevent overfitting.
*   **Work with More Complex Datasets:**
    *   Apply your knowledge to larger and more complex datasets (e.g., image datasets like MNIST or CIFAR-10, text datasets). This will likely require using more sophisticated libraries or frameworks, but the core principles you've learned here will still apply.
*   **Implement Other Network Types:**
    *   Explore Convolutional Neural Networks (CNNs) for image data.
    *   Explore Recurrent Neural Networks (RNNs) for sequential data.
*   **Use Deep Learning Frameworks:**
    *   Transition to using popular deep learning libraries like TensorFlow or PyTorch. While coding from scratch is invaluable for understanding, these frameworks significantly simplify building, training, and deploying complex neural networks. You'll find that the concepts (layers, activations, loss functions, optimizers) are the same, but the implementation is much more streamlined.
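
As a starting point for the mini-batch suggestion above, a batching helper might look like the following sketch. The function name `iterate_minibatches`, the batch size, the seed, and the demo arrays are illustrative assumptions. Inside the loop over batches, one would call the notebook's forward pass, backward pass, and parameter update on each batch instead of on the full dataset.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs that cover a shuffled copy of the data once."""
    n_samples = X.shape[0]
    indices = rng.permutation(n_samples)          # new random order for this epoch
    for start in range(0, n_samples, batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

rng = np.random.default_rng(0)
X_demo = np.arange(20, dtype=float).reshape(10, 2)  # 10 samples, 2 features
y_demo = np.arange(10).reshape(-1, 1)               # sample ids stand in for labels

batches = list(iterate_minibatches(X_demo, y_demo, batch_size=4, rng=rng))
batch_sizes = [xb.shape[0] for xb, _ in batches]
print(batch_sizes)
```

The last batch is smaller when the number of samples is not a multiple of the batch size. Setting `batch_size=1` gives the per-sample updates of Stochastic Gradient Descent.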

Continuing to experiment and build will solidify your understanding and open up the possibilities of what you can achieve with neural networks!


## Summary:

### Data Analysis Key Findings

*   A synthetic dataset with 100 samples and 2 features (`feature1`, `feature2`) was generated for binary classification, with a target variable (`target`) determined by a simple rule with added noise.
*   A single neuron's core components (inputs, weights, bias, weighted sum, activation function) were explained and implemented, demonstrating the calculation of the weighted sum and applying a step activation function.
*   The role of the input layer was explained, emphasizing the importance of matching the number of input neurons to the number of features in the data (2 in this case). The shape of the input data `X` was confirmed to be (100, 2).
*   Several common activation functions (Sigmoid, ReLU, Tanh) were explained mathematically and implemented in Python, demonstrating their different output behaviors for various input values.
*   The output layer's function was explained, highlighting the selection of the appropriate activation function based on the task (Sigmoid for binary classification). The concept of applying a threshold (e.g., 0.5) to the Sigmoid output for binary prediction was demonstrated, showing how changing the threshold can alter the prediction.
*   The purpose of hidden layers in learning complex patterns was explained, stressing the necessity of non-linear activation functions. The calculation within a hidden neuron, including the weighted sum and applying ReLU and Tanh activations, was demonstrated.
*   A simple neural network architecture with one hidden layer (2 neurons) and an output layer (1 neuron) was defined. The forward pass was implemented, showing how data flows through the layers, applying weighted sums and activation functions. The dimensions of initialized weights and biases were confirmed to match the network structure.
*   The training process was explained, introducing the Binary Cross-Entropy loss function for binary classification and the Gradient Descent optimization algorithm. The backward pass (backpropagation) was explained as the method for calculating gradients using the chain rule.
*   The Binary Cross-Entropy loss function, a modified forward pass to store intermediate values, the backward pass for gradient calculation, and the parameter update rule (Gradient Descent) were implemented.
*   A training loop combining the forward pass, loss calculation, backward pass, and parameter updates was executed for 1000 epochs. The periodic printing of the loss showed an overall decrease, from 0.7059 at epoch 0 to a final value of 0.3737 (with a temporary rise to 0.4023 at epoch 900), indicating that the network was learning.
*   The trained network's performance was evaluated by calculating its accuracy on the training data, achieving an accuracy of 0.9000.

### Insights or Next Steps

*   The notebook successfully guided the user through building and training a simple neural network from scratch, illustrating the fundamental concepts and their implementation.
*   Next steps could involve exploring more advanced concepts like different optimizers (Adam, RMSprop), regularization techniques (Dropout, L1/L2), different network architectures (more layers, varying neuron counts), and applying the learned principles to more complex datasets.
