
## We will learn  

- Gradient Descent


# Gradient Descent

#### What is a Cost Function?
A cost function measures how well a model performs on a given dataset. It quantifies the error between the predicted values and the expected values and summarizes it as a single real number.
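
As a concrete case, mean squared error (MSE) is one common cost function for regression. The sketch below assumes NumPy and two sequences of equal length; the name `mse_cost` is illustrative and not taken from any library.

```python
import numpy as np

def mse_cost(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Three targets and three predictions collapse into one real number.
cost = mse_cost([2, 4, 5], [2, 4, 6])
print(cost)
```

Only the last prediction is off (by 1), so the cost is the mean of the squared errors 0, 0 and 1.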

#### What is Gradient?
A gradient is a derivative: it measures how much the output of a function changes in response to a small change in its inputs. For a function of several inputs, the gradient is the vector of partial derivatives, one per input.
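
The "small change in inputs" idea can be checked numerically. This is a minimal sketch using a central difference on a one-dimensional function; `numerical_gradient` and the step size `h` are assumptions chosen for illustration.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate df/dx at x by nudging the input by +h and -h."""
    return (f(x + h) - f(x - h)) / (2 * h)

def f(x):
    return x ** 2  # the analytic derivative is 2 * x

slope = numerical_gradient(f, 3.0)
print(slope)  # close to the analytic value 2 * 3.0
```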

#### What is Gradient Descent?
Gradient Descent (GD) is a numerical optimization algorithm, widely used in machine learning and deep learning, that finds model parameters (such as the weights and biases of a neural network) by minimizing a defined cost function.

It works iteratively. At each step, the algorithm computes the gradient, that is, the partial derivatives of the cost function with respect to each parameter, and adjusts the parameters in the direction of the negative gradient, which is the direction of steepest decrease. The process repeats until the cost stops improving; for non-convex cost functions this may be a local rather than a global minimum.


#### Types of Gradient Descent Algorithm

The choice of gradient descent algorithm depends on the problem at hand and the size of the dataset. Batch gradient descent is suitable for small datasets, while stochastic gradient descent is more suitable for large datasets. Mini-batch gradient descent is a compromise between the two and is often used in practice.

1. Batch Gradient Descent

In [None]:
# https://www.analyticsvidhya.com/blog/2020/10/how-does-the-gradient-descent-algorithm-work-in-machine-learning/
# https://www.geeksforgeeks.org/gradient-descent-algorithm-and-its-variants/
# https://www.javatpoint.com/gradient-descent-in-machine-learning
# https://www.geeksforgeeks.org/gradient-descent-in-linear-regression/

In [None]:
import numpy as np

def gradient_descent(X, y, learning_rate, num_iters):
  """
  Performs gradient descent to find optimal weights and bias for linear regression.

  Args:
      X: A numpy array of shape (m, n) representing the training data features.
      y: A numpy array of shape (m,) representing the training data target values.
      learning_rate: The learning rate to control the step size during updates.
      num_iters: The number of iterations to perform gradient descent.

  Returns:
      A tuple containing the learned weights and bias.
  """

  # Initialize weights with random values in [0, 1) and the bias with zero
  m, n = X.shape
  weights = np.random.rand(n)
  bias = 0.0

  # Loop for the number of iterations
  for i in range(num_iters):
    # Predict y values using current weights and bias
    y_predicted = np.dot(X, weights) + bias

    # Calculate the error
    error = y - y_predicted

    # Calculate gradients for weights and bias
    weights_gradient = -2/m * np.dot(X.T, error)
    bias_gradient = -2/m * np.sum(error)

    # Update weights and bias using learning rate
    weights -= learning_rate * weights_gradient
    bias -= learning_rate * bias_gradient

  return weights, bias

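# Example usage
# Note: the two feature columns are identical, so only the sum of the two
# weights is determined by the data; the individual values depend on the
# random initialization.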
X = np.array([[1, 1], [2, 2], [3, 3]])
y = np.array([2, 4, 5])
learning_rate = 0.01
num_iters = 100

weights, bias = gradient_descent(X, y, learning_rate, num_iters)

print("Learned weights:", weights)
print("Learned bias:", bias)
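
The gradient expressions in the code above follow from differentiating the mean squared error. The derivation below is a sketch under assumed notation: \( w \) for the weights, \( b \) for the bias, \( x_i \) for the \( i \)-th row of \( X \), and \( \hat{y}_i \) for the prediction.

\[
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2, \qquad \hat{y}_i = x_i^\top w + b
\]

\[
\frac{\partial J}{\partial w} = -\frac{2}{m} \sum_{i=1}^{m} x_i \left( y_i - \hat{y}_i \right) = -\frac{2}{m} X^\top (y - \hat{y})
\]

\[
\frac{\partial J}{\partial b} = -\frac{2}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)
\]

These correspond to `weights_gradient` and `bias_gradient`, with `error` playing the role of \( y - \hat{y} \).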


Gradient Descent is used extensively in machine learning and deep learning to minimize a cost function and find good model parameters. It is central to training models such as linear regression, logistic regression, and neural networks.

---

### **Key Concepts**
1. **Cost Function**:
   - Represents the error or difference between the predicted and actual values.
   - Examples: Mean Squared Error (MSE) for regression, Cross-Entropy Loss for classification.

2. **Objective**:
   - Minimize the cost function by iteratively updating the model parameters (weights and biases).

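3. **Gradient**:
   - The gradient is the vector of partial derivatives of the cost function with respect to the model parameters.
   - It points in the direction of steepest increase of the cost function, and its magnitude gives the rate of increase. Gradient descent therefore steps in the opposite direction.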

4. **Learning Rate (\(\alpha\))**:
   - A hyperparameter that controls the step size in the parameter update.
   - If \(\alpha\) is too large, the algorithm might overshoot the minimum or even diverge. If it is too small, convergence will be slow.
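
The effect of the learning rate is easy to see on a toy problem. The sketch below assumes the cost \( J(\theta) = \theta^2 \) with gradient \( 2\theta \); the function `run_gd` and the three learning rates are chosen only for illustration.

```python
def run_gd(alpha, theta=1.0, steps=20):
    """Run gradient descent on J(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta -= alpha * 2 * theta
    return theta

# Too small, reasonable, and too large learning rates, all starting at theta = 1.
for alpha in (0.001, 0.1, 1.1):
    print(alpha, run_gd(alpha))
```

With the smallest rate, \( \theta \) has barely moved after 20 steps; with the middle rate it is close to the minimum at 0; with the largest rate each step overshoots further and \( \theta \) grows in magnitude.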

---

### **How Gradient Descent Works**
For a cost function \( J(\theta) \), where \( \theta \) represents the model parameters:
1. Initialize parameters (\( \theta \)) randomly or to zeros.
2. Calculate the gradient of \( J(\theta) \) with respect to \( \theta \).
3. Update the parameters using the formula:
   \[
   \theta := \theta - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta}
   \]
   Here:
   - \( \frac{\partial J(\theta)}{\partial \theta} \): Gradient of the cost function.
   - \( \alpha \): Learning rate.

4. Repeat steps 2 and 3 until convergence (when changes in \( J(\theta) \) are negligible).
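
The four steps above can be sketched for a single parameter as follows. The function name, the tolerance `tol`, and the example cost \( J(\theta) = (\theta - 3)^2 \) are assumptions made for this illustration.

```python
def gradient_descent_1d(grad, cost, theta0, alpha=0.1, tol=1e-10, max_iters=10_000):
    """Minimize a one-parameter cost function with plain gradient descent."""
    theta = theta0                            # step 1: initialize
    prev_cost = cost(theta)
    iters = 0
    for iters in range(1, max_iters + 1):
        theta = theta - alpha * grad(theta)   # steps 2 and 3: gradient, then update
        new_cost = cost(theta)
        if abs(prev_cost - new_cost) < tol:   # step 4: stop when the cost barely changes
            break
        prev_cost = new_cost
    return theta, iters

# J(theta) = (theta - 3)**2 has its minimum at theta = 3.
theta, iters = gradient_descent_1d(
    grad=lambda t: 2 * (t - 3),
    cost=lambda t: (t - 3) ** 2,
    theta0=0.0,
)
print(theta, iters)
```

Stopping on the change in cost is one possible convergence test; checking the gradient magnitude or capping the iteration count are common alternatives.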

---

### **Types of Gradient Descent**
1. **Batch Gradient Descent**:
   - Uses the entire dataset to compute the gradient.
   - Convergence is stable but computationally expensive for large datasets.

2. **Stochastic Gradient Descent (SGD)**:
   - Updates parameters using a single data point (or sample) at each step.
   - Each update is cheap and fast, but the path to convergence is noisier.

3. **Mini-Batch Gradient Descent**:
   - Combines benefits of Batch and SGD.
   - Uses small batches of data to compute the gradient.
   - Efficient and widely used in practice.
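
The three variants differ only in how many samples feed each gradient estimate, so one function with a `batch_size` argument can cover all of them. This is a minimal sketch for linear regression with an MSE cost; the function name, hyperparameter values, and the noise-free synthetic data are assumptions for illustration.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, learning_rate=0.3, epochs=500, seed=0):
    """Linear regression trained on shuffled mini-batches.

    batch_size = len(X) gives batch gradient descent; batch_size = 1 gives SGD.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    weights, bias = np.zeros(n), 0.0
    for _ in range(epochs):
        order = rng.permutation(m)                 # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            error = y[idx] - (X[idx] @ weights + bias)
            # MSE gradients estimated from the current batch only
            weights -= learning_rate * (-2 / len(idx)) * (X[idx].T @ error)
            bias -= learning_rate * (-2 / len(idx)) * error.sum()
    return weights, bias

# Noise-free synthetic data generated from y = 3x + 1.
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = 3 * X[:, 0] + 1

results = {}
for batch_size in (20, 4, 1):                      # batch, mini-batch, SGD
    results[batch_size] = minibatch_gd(X, y, batch_size)
    print(batch_size, results[batch_size])
```

On this small, noise-free dataset all three settings should recover a weight near 3 and a bias near 1. On noisy data, the smaller batch sizes would typically fluctuate around the minimum rather than settle exactly.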

---

### **Challenges and Solutions**
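1. **Local Minima**:
   - Non-convex functions can have local minima and saddle points where gradient descent gets stuck.
   - Mitigation: Momentum or adaptive optimizers such as Adam can help, although they do not guarantee reaching the global minimum.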

2. **Learning Rate Tuning**:
   - Choosing an appropriate learning rate is crucial.
   - Solution: Use learning rate schedules or adaptive methods (e.g., AdaGrad, RMSProp).

3. **Slow Convergence**:
   - Gradients are small near flat regions of the cost function, so each update makes little progress.
   - Solution: Use techniques like momentum or Nesterov acceleration.
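
To illustrate how momentum helps in flat regions, the sketch below compares plain gradient descent with one common momentum formulation (a running sum of past gradients scaled by `beta`). The nearly flat cost \( J(\theta) = 0.01\,\theta^2 \) and all hyperparameter values are assumptions chosen for the demonstration.

```python
def grad(theta):
    return 0.02 * theta  # gradient of J(theta) = 0.01 * theta**2, a nearly flat bowl

def plain_gd(theta=1.0, alpha=0.1, steps=200):
    for _ in range(steps):
        theta -= alpha * grad(theta)
    return theta

def momentum_gd(theta=1.0, alpha=0.1, beta=0.9, steps=200):
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(theta)  # accumulate past gradients
        theta -= alpha * velocity                 # step along the accumulated direction
    return theta

theta_plain = plain_gd()
theta_momentum = momentum_gd()
print(theta_plain, theta_momentum)
```

With the same learning rate and step count, the momentum variant ends much closer to the minimum at 0, because consecutive gradients point the same way and their accumulated sum produces larger effective steps.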

---

### **Applications in Machine Learning**
1. **Linear Regression**:
   - Minimize Mean Squared Error to find the best-fit line.
2. **Logistic Regression**:
   - Minimize Cross-Entropy Loss for binary classification.
3. **Neural Networks**:
   - Optimize weights and biases to minimize the loss function, using gradients computed by backpropagation.
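
As one example beyond linear regression, the same update loop can minimize cross-entropy loss for logistic regression. This is a minimal sketch on a tiny, linearly separable dataset; the function names, learning rate, and data are assumptions for illustration, not a tuned implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, learning_rate=0.5, num_iters=2000):
    """Binary logistic regression trained with batch gradient descent."""
    m, n = X.shape
    weights, bias = np.zeros(n), 0.0
    for _ in range(num_iters):
        p = sigmoid(X @ weights + bias)   # predicted probabilities
        error = p - y                     # per-sample gradient of cross-entropy w.r.t. the logit
        weights -= learning_rate * (X.T @ error) / m
        bias -= learning_rate * error.mean()
    return weights, bias

# One feature; the class switches from 0 to 1 between x = 1 and x = 2.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

w, b = logistic_gd(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
print(preds)
```

Only the prediction function and the cost changed relative to linear regression; the parameter update keeps the same form.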

---

Gradient Descent is the foundation of many machine learning algorithms and continues to evolve with advanced optimizers for faster and more robust learning.

## We Will Learn Next
- Logistic Regression
- Cross-Validation
- Hyperparameter Tuning
- Implementations