# Lesson 1: Stochastic Gradient Descent: Theory and Implementation in Python

## Introduction
Welcome! We're about to explore **Stochastic Gradient Descent (SGD)**, a pivotal optimization algorithm. SGD is a variant of Gradient Descent that stays efficient on large datasets because each update uses only a small, randomly chosen part of the data. "Stochastic" means "random" and is the opposite of deterministic: a deterministic algorithm behaves the same way on every run, while a stochastic one introduces randomness. In this lesson, we cover what SGD is, the theory behind it, and how to implement it in Python.

## Understanding Stochastic Gradient Descent
Unlike Gradient Descent, which computes the gradient over the entire dataset at every step, SGD estimates the gradient from a single randomly selected data point. Each update is therefore much cheaper, which makes SGD highly efficient for large datasets.

This efficiency comes at a cost: because each update is based on one random sample, the path to convergence is noisier, and with a fixed learning rate the parameters tend to fluctuate around the minimum rather than settle exactly on it.
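
The relationship between the two gradients can be sketched in a few lines. This is a minimal illustration, not part of the lesson's implementation: the helper `sample_gradient` and the starting point `m = 0.5`, `b = 0.0` are assumptions chosen for the example, and the data is the small dataset used later in this lesson. Each single-sample gradient is a noisy estimate, yet their average equals the full-batch gradient of the mean squared error.

```python
import numpy as np

# Small dataset used throughout this lesson
X = np.array([0, 1, 2, 3, 4, 5])
Y = np.array([0, 1.1, 1.9, 3, 4.2, 5.2])

def sample_gradient(m, b, x, y):
    """Gradient of the squared error (m*x + b - y)**2 for one sample (hypothetical helper)."""
    error = (m * x + b) - y
    return np.array([2 * error * x, 2 * error])

m, b = 0.5, 0.0  # arbitrary starting point, assumed only for illustration

# One gradient per data point: SGD picks one of these rows at random per update
per_sample = np.array([sample_gradient(m, b, x, y) for x, y in zip(X, Y)])

# Full-batch gradient of the mean squared error, as used by plain Gradient Descent
errors = (m * X + b) - Y
full_gradient = np.array([np.mean(2 * errors * X), np.mean(2 * errors)])

print("per-sample gradients:\n", per_sample)
print("mean of per-sample gradients:", per_sample.mean(axis=0))
print("full-batch gradient:         ", full_gradient)
```

The individual rows differ noticeably from one another, which is the source of the noise in SGD, while their mean matches the full-batch gradient.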

## Defining Data
We will use this small dataset as our example:

```python
# Importing Necessary Library
import numpy as np

# Linear regression problem
X = np.array([0, 1, 2, 3, 4, 5]) 
Y = np.array([0, 1.1, 1.9, 3, 4.2, 5.2])  
```

## Math Behind
Imagine we are looking for a best-fit line, that is, for the parameters of the familiar `y = mx + b` equation, where `m` is the slope and `b` is the y-intercept. For a single randomly selected sample \((x_i, y_i)\), SGD updates the parameters as follows:

\[
m' = m - 2\alpha \cdot \left( \left(mx_i + b\right) - y_i \right) \cdot x_i
\]

\[
b' = b - 2\alpha \cdot \left( \left(mx_i + b\right) - y_i \right)
\]

Where:

- `m` and `b` are the current values of the parameters.
- `m'` and `b'` are the updated parameters.
- `x_i` is the input value of the randomly selected training sample.
- `y_i` is the actual output for that input `x_i`.
- `\alpha` is the learning rate.

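These formulas are the update rules for the parameters `m` and `b`. The term \(\left(mx_i + b\right) - y_i\) is the error: the difference between the model's prediction \(mx_i + b\) and the actual value \(y_i\). Both rules come from differentiating the squared error \(\left(\left(mx_i + b\right) - y_i\right)^2\) of that single sample, which is where the factor of 2 comes from. Multiplying the error by `x_i` gives the gradient with respect to `m`. Unlike full Gradient Descent, SGD does not average over all samples: the gradient of the single sample serves as the estimate. The same principle applies to `b`, but without the multiplication by `x_i`, as `b` is the intercept (bias) term.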
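
To make the update rules concrete, here is a small worked example. The function name `sgd_step` and the chosen numbers (starting at `m = 0`, `b = 0`, with the sample `(2, 1.9)` and a learning rate of `0.01`) are assumptions for illustration only.

```python
def sgd_step(m, b, x_i, y_i, alpha):
    """Apply one SGD update for the line y = m*x + b (hypothetical helper)."""
    error = (m * x_i + b) - y_i          # prediction minus actual value
    m_new = m - 2 * alpha * error * x_i  # update rule for the slope
    b_new = b - 2 * alpha * error        # update rule for the intercept
    return m_new, b_new

# Worked example: start at m = 0, b = 0 and use the sample (x_i, y_i) = (2, 1.9)
m_new, b_new = sgd_step(m=0.0, b=0.0, x_i=2.0, y_i=1.9, alpha=0.01)
print(m_new, b_new)  # approximately 0.076 and 0.038
```

Here the error is `0 - 1.9 = -1.9`, so the slope moves by `2 * 0.01 * 1.9 * 2 = 0.076` and the intercept by `2 * 0.01 * 1.9 = 0.038`, both in the direction that reduces the error.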

## Implementing Stochastic Gradient Descent
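Now, let's dive into Python to implement SGD. The process consists of initializing the parameters randomly and then repeating the same update many times: select a random training sample, calculate the gradient, and update the parameters. Each repetition is one iteration. The code below stores the iteration count in a variable named `epochs`, although strictly speaking an epoch is one full pass over the dataset.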

Let's break it down with the following code:

```python
# Model initialization
m = np.random.randn()  # Initialize the slope with a random number
b = np.random.randn()  # Initialize the intercept with a random number

learning_rate = 0.01  # Step size (alpha in the update rules)
epochs = 10000  # Number of single-sample updates (strictly, an epoch is a full pass over the data)

# SGD implementation
for _ in range(epochs):
    random_index = np.random.randint(len(X))  # Select a random sample
    x = X[random_index]
    y = Y[random_index]
    pred = m * x + b  # Calculate the predicted y
    # Gradients of the squared error (pred - y)**2 with respect to m and b
    grad_m = 2 * (pred - y) * x
    grad_b = 2 * (pred - y)
    m -= learning_rate * grad_m  # Update m using the calculated gradient
    b -= learning_rate * grad_b  # Update b using the calculated gradient

print(f"m = {m:.3f}, b = {b:.3f}")  # Show the fitted slope and intercept
```

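After the loop finishes, `m` (slope) and `b` (intercept) hold the fitted values. For this dataset they should end up close to the least-squares solution of about `1.04` and `-0.03`, with small differences from run to run because of the random sampling.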
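
One way to sanity-check the result is to compare it with a closed-form least-squares fit. The sketch below wraps the loop above in a hypothetical function `fit_sgd` (the name, the `seed` parameter, and the use of a seeded NumPy generator are assumptions added for repeatability) and compares its output with `np.polyfit`.

```python
import numpy as np

X = np.array([0, 1, 2, 3, 4, 5])
Y = np.array([0, 1.1, 1.9, 3, 4.2, 5.2])

def fit_sgd(X, Y, learning_rate=0.01, n_updates=10000, seed=0):
    """Single-sample SGD for y = m*x + b (hypothetical wrapper around the loop above)."""
    rng = np.random.default_rng(seed)  # seeded generator, so runs are repeatable
    m, b = rng.standard_normal(2)      # random initial slope and intercept
    for _ in range(n_updates):
        i = rng.integers(len(X))       # pick one random sample
        error = (m * X[i] + b) - Y[i]  # prediction minus actual value
        m -= learning_rate * 2 * error * X[i]
        b -= learning_rate * 2 * error
    return m, b

m_sgd, b_sgd = fit_sgd(X, Y)
m_ls, b_ls = np.polyfit(X, Y, 1)  # degree-1 least-squares fit: [slope, intercept]

print(f"SGD:           m = {m_sgd:.3f}, b = {b_sgd:.3f}")
print(f"Least squares: m = {m_ls:.3f}, b = {b_ls:.3f}")
```

The two results should agree closely but not exactly, since SGD with a fixed learning rate keeps fluctuating slightly around the minimum.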

## Testing the Algorithm
With the parameters fitted, we plot the data points and the resulting line using Matplotlib.

```python
import matplotlib.pyplot as plt

# Plot the data points
plt.scatter(X, Y, color = "m", marker = "o", s = 30)

# Predicted line for the model
y_pred = m * X + b

# Plotting the predicted line
plt.plot(X, y_pred, color = "g")

# Adding labels to the plot
plt.xlabel('X')
plt.ylabel('Y')

plt.show()
```

Here is the result:

![SGD Plot](attachment:image.png)

The plot shows the data points together with the line fitted by SGD on this simple linear regression problem.
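
The final plot shows only the end result. To observe the noisy convergence discussed earlier, one option is to record the mean squared error at regular intervals during training. The following is a minimal sketch under a few assumptions: the helper `mse`, the starting point `m = 0`, `b = 0`, the 2000 updates, and the logging interval of 100 are all choices made for this illustration.

```python
import numpy as np

X = np.array([0, 1, 2, 3, 4, 5])
Y = np.array([0, 1.1, 1.9, 3, 4.2, 5.2])

def mse(m, b):
    """Mean squared error of the line y = m*x + b on the whole dataset (hypothetical helper)."""
    return np.mean(((m * X + b) - Y) ** 2)

rng = np.random.default_rng(0)  # seeded generator, so the run is repeatable
m, b = 0.0, 0.0                 # assumed starting point for this illustration
learning_rate = 0.01
history = []                    # MSE recorded every 100 updates

for step in range(2000):
    if step % 100 == 0:
        history.append(mse(m, b))
    i = rng.integers(len(X))       # pick one random sample
    error = (m * X[i] + b) - Y[i]  # prediction minus actual value
    m -= learning_rate * 2 * error * X[i]
    b -= learning_rate * 2 * error

history.append(mse(m, b))  # loss after the final update

for k, value in enumerate(history):
    print(f"update {k * 100:4d}: MSE = {value:.4f}")
```

The recorded values drop quickly at first and then hover around a small value instead of decreasing smoothly. Passing `history` to `plt.plot` would show the same behavior as a curve.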

## Lesson Summary and Practice
Today's lesson unveiled critical aspects of the **Stochastic Gradient Descent** algorithm. We explored its significance, advantages, disadvantages, mathematical formulation, and Python implementation. You'll soon practice these concepts in upcoming tasks, cementing your understanding of SGD and enhancing your Python coding skills in machine learning. Happy learning!
