
## We will learn  

- Gradient Descent


# Gradient Descent

#### What is a Cost Function?
A cost function measures how well a model fits a given dataset. It quantifies the error between predicted and expected values as a single real number; the lower the number, the better the fit.
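
As a minimal sketch of this idea, the snippet below computes Mean Squared Error, one common cost function, for two sets of predictions. The function name and the toy values are assumptions chosen for illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical toy values
y_true = [3.0, 5.0, 7.0]
good_pred = [3.0, 5.0, 8.0]
poor_pred = [0.0, 0.0, 0.0]

mse_good = mean_squared_error(y_true, good_pred)  # (0 + 0 + 1) / 3
mse_poor = mean_squared_error(y_true, poor_pred)  # (9 + 25 + 49) / 3
print(mse_good, mse_poor)
```

The better predictions give the smaller number, which is what lets an optimizer compare candidate models.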

#### What is Gradient?
A gradient is the vector of partial derivatives of a function with respect to its inputs. Each component tells how much the output changes for a small change in the corresponding input.
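
One way to make this concrete is to approximate a gradient numerically by nudging each input slightly. The sketch below uses central differences; the function `f` and the helper `numerical_gradient` are hypothetical names introduced only for this example.

```python
import numpy as np

def f(w):
    # Hypothetical example function: f(w) = w0^2 + 3 * w1
    return w[0] ** 2 + 3.0 * w[1]

def numerical_gradient(func, w, eps=1e-6):
    """Approximate each partial derivative with a central difference."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (func(w + step) - func(w - step)) / (2 * eps)
    return grad

grad_at_point = numerical_gradient(f, [2.0, 1.0])
print(grad_at_point)  # the analytic gradient [2 * w0, 3] evaluates to [4, 3] here
```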

#### What is Gradient Descent?
Gradient Descent (GD) is a numerical optimization algorithm widely used in machine learning and deep learning. During training, it searches for the model parameters (for example, the weights and biases of a neural network) that minimize a defined cost or loss function.

It works iteratively. The algorithm computes the gradient, that is, the partial derivatives of the cost function with respect to each parameter, and then adjusts the parameters in the direction of the negative gradient, which is the direction of steepest decrease. The process repeats until the cost stops improving, ideally at a minimum.


#### Types of Gradient Descent Algorithm

The choice of gradient descent algorithm depends on the problem at hand and the size of the dataset. Batch gradient descent is suitable for small datasets, while stochastic gradient descent algorithm is more suitable for large datasets. Mini-batch is a good compromise between the two and is often used in practice.

1. Batch Gradient Descent

In [None]:
# https://www.analyticsvidhya.com/blog/2020/10/how-does-the-gradient-descent-algorithm-work-in-machine-learning/
# https://www.geeksforgeeks.org/gradient-descent-algorithm-and-its-variants/
# https://www.javatpoint.com/gradient-descent-in-machine-learning
# https://www.geeksforgeeks.org/gradient-descent-in-linear-regression/

In [1]:
import numpy as np

def gradient_descent(X, y, learning_rate, num_iters):
  """
  Performs gradient descent to find optimal weights and bias for linear regression.

  Args:
      X: A numpy array of shape (m, n) representing the training data features.
      y: A numpy array of shape (m,) representing the training data target values.
      learning_rate: The learning rate to control the step size during updates.
      num_iters: The number of iterations to perform gradient descent.

  Returns:
      A tuple containing the learned weights and bias.
  """

  # Initialize weights with random values in [0, 1) and the bias with zero
  m, n = X.shape
  weights = np.random.rand(n)
  bias = 0

  # Loop for the number of iterations
  for i in range(num_iters):
    # Predict y values using current weights and bias
    y_predicted = np.dot(X, weights) + bias

    # Calculate the error
    error = y - y_predicted

    # Calculate gradients for weights and bias
    weights_gradient = -2/m * np.dot(X.T, error)
    bias_gradient = -2/m * np.sum(error)

    # Update weights and bias using learning rate
    weights -= learning_rate * weights_gradient
    bias -= learning_rate * bias_gradient

  return weights, bias

# Example usage
X = np.array([[1, 1], [2, 2], [3, 3]])
y = np.array([2, 4, 5])
learning_rate = 0.01
num_iters = 100

weights, bias = gradient_descent(X, y, learning_rate, num_iters)

print("Learned weights:", weights)
print("Learned bias:", bias)


Learned weights: [0.86946027 0.85356764]
Learned bias: 0.1535548007499282


Gradient Descent is an optimization algorithm used extensively in machine learning and deep learning to minimize a cost function and find the optimal parameters of a model. It is particularly important in training algorithms for models like linear regression, logistic regression, neural networks, and more.

---

### **Key Concepts**
1. **Cost Function**:
   - Represents the error or difference between the predicted and actual values.
   - Examples: Mean Squared Error (MSE) for regression, Cross-Entropy Loss for classification.

2. **Objective**:
   - Minimize the cost function by iteratively updating the model parameters (weights and biases).

3. **Gradient**:
   - The gradient is the partial derivative of the cost function with respect to model parameters.
   - It indicates the direction and rate of the steepest increase of the cost function.

4. **Learning Rate (\(\alpha\))**:
   - A hyperparameter that controls the step size in the parameter update.
   - If \(\alpha\) is too large, the algorithm might overshoot the minimum. If too small, convergence will be slow.
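
The effect of the learning rate can be seen on a toy cost \( J(\theta) = \theta^2 \), whose gradient is \( 2\theta \). This is a sketch under that assumption; the helper `run` and the three values of \(\alpha\) are chosen only for illustration.

```python
def run(alpha, steps=50, theta=1.0):
    """Apply `steps` gradient descent updates on J(theta) = theta^2."""
    for _ in range(steps):
        theta -= alpha * 2.0 * theta  # gradient of theta^2 is 2 * theta
    return abs(theta)  # distance from the minimum at theta = 0

too_small = run(alpha=0.001)   # barely moves in 50 steps
reasonable = run(alpha=0.1)    # approaches the minimum
too_large = run(alpha=1.1)     # overshoots further on every step

print(too_small, reasonable, too_large)
```

Each update multiplies \(\theta\) by \(1 - 2\alpha\), so the iterates shrink only when that factor has magnitude below 1.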

---

### **How Gradient Descent Works**
For a cost function \( J(\theta) \), where \( \theta \) represents the model parameters:
1. Initialize parameters (\( \theta \)) randomly or to zeros.
2. Calculate the gradient of \( J(\theta) \) with respect to \( \theta \).
3. Update the parameters using the formula:
   \[
   \theta := \theta - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta}
   \]
   Here:
   - \( \frac{\partial J(\theta)}{\partial \theta} \): Gradient of the cost function.
   - \( \alpha \): Learning rate.

4. Repeat steps 2 and 3 until convergence (when changes in \( J(\theta) \) are negligible).
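
The four steps above can be sketched on a one-parameter cost. The cost \( J(\theta) = (\theta - 3)^2 \), the tolerance, and the function names are assumptions made for this illustration.

```python
def J(theta):
    # Hypothetical cost with its minimum at theta = 3
    return (theta - 3.0) ** 2

def dJ(theta):
    # Derivative of J with respect to theta
    return 2.0 * (theta - 3.0)

def gradient_descent_1d(theta0, alpha, tol=1e-10, max_iters=10_000):
    theta = theta0                                   # step 1: initialize
    for step in range(max_iters):
        new_theta = theta - alpha * dJ(theta)        # steps 2 and 3: gradient, then update
        if abs(J(new_theta) - J(theta)) < tol:       # step 4: stop when J barely changes
            return new_theta, step + 1
        theta = new_theta
    return theta, max_iters

theta_star, steps = gradient_descent_1d(theta0=0.0, alpha=0.1)
print(theta_star, steps)
```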

---

### **Types of Gradient Descent**
1. **Batch Gradient Descent**:
   - Uses the entire dataset to compute the gradient.
   - Convergence is stable but computationally expensive for large datasets.

2. **Stochastic Gradient Descent (SGD)**:
   - Updates parameters using a single data point (or sample) at each step.
   - Faster updates but more noise in convergence.

3. **Mini-Batch Gradient Descent**:
   - Combines benefits of Batch and SGD.
   - Uses small batches of data to compute the gradient.
   - Efficient and widely used in practice.
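
The three variants differ only in how many samples feed each update. A small sketch, assuming a hypothetical dataset of 1000 samples and a helper `iterate_batches`, counts the updates made in one pass over the data:

```python
import numpy as np

def iterate_batches(m, batch_size, rng):
    """Yield index arrays that together cover all m samples once, in random order."""
    indices = rng.permutation(m)
    for start in range(0, m, batch_size):
        yield indices[start:start + batch_size]

rng = np.random.default_rng(0)
m = 1000  # hypothetical number of training samples

updates_per_epoch = {
    "batch": sum(1 for _ in iterate_batches(m, m, rng)),            # whole dataset at once
    "stochastic": sum(1 for _ in iterate_batches(m, 1, rng)),       # one sample at a time
    "mini-batch (32)": sum(1 for _ in iterate_batches(m, 32, rng)), # chunks of 32
}
print(updates_per_epoch)
```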

---

### **Challenges and Solutions**
1. **Local Minima**:
   - Non-convex functions might have local minima.
   - Solution: Use momentum, adaptive optimizers like Adam.

2. **Learning Rate Tuning**:
   - Choosing an appropriate learning rate is crucial.
   - Solution: Use learning rate schedules or adaptive methods (e.g., AdaGrad, RMSProp).

3. **Slow Convergence**:
   - Near flat regions of the cost function.
   - Solution: Use techniques like momentum or Nesterov acceleration.
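
As a rough illustration of the momentum idea, the sketch below compares plain gradient descent with a heavy-ball momentum update on a quadratic cost that is much flatter in one direction than the other. The cost, the coefficient `beta=0.9`, and the step count are assumptions for this example, not tuned recommendations.

```python
import numpy as np

# Hypothetical cost J(w) = 0.5 * sum(curvatures * w**2): flat along w0, steep along w1
curvatures = np.array([1.0, 100.0])

def grad(w):
    return curvatures * w

def descend(alpha, beta, steps=100):
    """Gradient descent with momentum; beta = 0 gives plain gradient descent."""
    w = np.array([1.0, 1.0])
    velocity = np.zeros_like(w)
    for _ in range(steps):
        velocity = beta * velocity + grad(w)  # accumulate past gradients
        w = w - alpha * velocity
    return float(np.linalg.norm(w))           # distance from the minimum at the origin

plain_norm = descend(alpha=0.01, beta=0.0)
momentum_norm = descend(alpha=0.01, beta=0.9)
print(plain_norm, momentum_norm)
```

With the same learning rate, the accumulated velocity lets the momentum run make faster progress along the flat direction.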

---

### **Applications in Machine Learning**
1. **Linear Regression**:
   - Minimize Mean Squared Error to find the best-fit line.
2. **Logistic Regression**:
   - Minimize Cross-Entropy Loss for binary classification.
3. **Neural Networks**:
   - Optimize weights and biases to minimize the loss function during backpropagation.
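
To illustrate the logistic regression case, here is a minimal sketch that minimizes cross-entropy with batch gradient descent. The toy data, the learning rate, and the iteration count are assumptions chosen so the example runs quickly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical toy data: one feature, label is 1 when the feature is positive
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

w, b, alpha = np.zeros(1), 0.0, 0.5
initial_loss = cross_entropy(y, sigmoid(X @ w + b))

for _ in range(200):
    p = sigmoid(X @ w + b)
    w -= alpha * (X.T @ (p - y)) / len(y)  # gradient of mean cross-entropy w.r.t. w
    b -= alpha * np.mean(p - y)            # gradient w.r.t. b

final_loss = cross_entropy(y, sigmoid(X @ w + b))
predictions = (sigmoid(X @ w + b) > 0.5).astype(float)
print(initial_loss, final_loss, predictions)
```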

---

Gradient Descent is the foundation of many machine learning algorithms and continues to evolve with advanced optimizers for faster and more robust learning.

## ✅ Overview Using `sklearn`

### 1. **Batch Gradient Descent**

✅ `LinearRegression()` – solves ordinary least squares in closed form rather than by gradient descent, so it serves only as a baseline here.

### 2. **Stochastic Gradient Descent (SGD)**

✅ `SGDRegressor(loss='squared_error', penalty=None)`

* Updates the parameters using **one sample at a time**. `SGDRegressor` has no `batch_size` parameter.

### 3. **Mini-Batch Gradient Descent**

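✅ Same `SGDRegressor`, which still updates on one sample at a time. A decaying learning rate (`learning_rate='invscaling'`) only dampens the noise of the updates, so this approximates the smoother convergence of mini-batches rather than performing true mini-batching.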
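
`SGDRegressor` also offers `partial_fit`, which runs one pass of per-sample SGD over whatever chunk of data it is given. The sketch below feeds it chunks of 32 samples; this streams the data in batches but does not average gradients over a batch, so it is not mini-batch gradient descent in the strict sense. The synthetic data, chunk size, and epoch count are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical synthetic data (not the California Housing set)
rng = np.random.default_rng(0)
true_coef = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(1000, 3))
y = X @ true_coef + 0.1 * rng.normal(size=1000)

model = SGDRegressor(random_state=42)  # default loss, penalty, and schedule
chunk_size = 32

for epoch in range(20):
    order = rng.permutation(len(y))
    for start in range(0, len(y), chunk_size):
        chunk = order[start:start + chunk_size]
        # One pass of per-sample SGD over this chunk only
        model.partial_fit(X[chunk], y[chunk])

print(model.coef_)
```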

---

## ✅ Full Example with California Housing


In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# 1. Load data
data = fetch_california_housing()
X, y = data.data, data.target

# 2. Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Linear Regression (Batch-style - closed form)
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
pred_lr = lr.predict(X_test_scaled)
mse_lr = mean_squared_error(y_test, pred_lr)

# 4. Stochastic Gradient Descent
sgd = SGDRegressor(loss='squared_error', penalty=None, learning_rate='constant', eta0=0.01, max_iter=1000, random_state=42)
sgd.fit(X_train_scaled, y_train)
pred_sgd = sgd.predict(X_test_scaled)
mse_sgd = mean_squared_error(y_test, pred_sgd)

# 5. Mini-Batch Gradient Descent (simulated with SGDRegressor)
sgd_mini = SGDRegressor(loss='squared_error', penalty=None, learning_rate='invscaling', eta0=0.01, power_t=0.25,
                        max_iter=1000, tol=1e-3, random_state=42)
sgd_mini.fit(X_train_scaled, y_train)
pred_mini = sgd_mini.predict(X_test_scaled)
mse_mini = mean_squared_error(y_test, pred_mini)

# 6. Results
print("Mean Squared Error Comparison:")
print(f"Linear Regression (Batch): {mse_lr:.4f}")
print(f"Stochastic Gradient Descent (SGD): {mse_sgd:.4f}")
print(f"Mini-Batch Gradient Descent (SGD Simulated): {mse_mini:.4f}")


Mean Squared Error Comparison:
Linear Regression (Batch): 0.5559
Stochastic Gradient Descent (SGD): 567513388555833114624.0000
Mini-Batch Gradient Descent (SGD Simulated): 0.5506


---

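## ✅ Output Example (may vary):

The figures below are illustrative and differ from the run recorded above. In that run, SGD with `learning_rate='constant'` and `eta0=0.01` diverged (MSE on the order of 10^20), while the `invscaling` schedule reached an MSE of 0.5506, close to the closed-form baseline of 0.5559. A smaller `eta0` or a decaying schedule is the usual remedy for this kind of divergence.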

```
Mean Squared Error Comparison:
Linear Regression (Batch): 0.5263
Stochastic Gradient Descent (SGD): 0.5275
Mini-Batch Gradient Descent (SGD Simulated): 0.5281
```

---

## ✅ Summary

| Method                               | Model                    | Data Used per Update                                      |
| ------------------------------------ | ------------------------ | --------------------------------------------------------- |
| Closed-form least squares (baseline) | `LinearRegression()`     | Entire dataset, no iterative updates                      |
| Stochastic GD                        | `SGDRegressor()`         | 1 sample per update                                       |
| Mini-Batch GD (approximated)         | `SGDRegressor()` + decay | Still 1 sample per update; a decaying step size smooths it |

> You can tune SGD further with `learning_rate`, `eta0`, and `power_t` to simulate different convergence speeds.

---



To demonstrate **Batch**, **Stochastic**, and **Mini-Batch Gradient Descent** using the **California Housing** dataset, we’ll implement gradient descent manually (instead of using `sklearn.linear_model.LinearRegression`) to observe how each method optimizes parameters.

---

### ✅ Step-by-step Plan:

1. **Load Data**
2. **Preprocess Data (Standardize)**
3. **Define a Cost Function (MSE)**
4. **Implement All Three Gradient Descent Methods**
5. **Compare Their Convergence**

---

### ✅ Code (Full Example)

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# 1. Load Data
data = fetch_california_housing()
X, y = data.data, data.target.reshape(-1, 1)

# 2. Split first so the scaler only sees training data (avoids test-set leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features using training statistics
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Add bias (intercept) term: X0 = 1
X_train = np.c_[np.ones(X_train.shape[0]), X_train]  # shape (m, n+1)
X_test = np.c_[np.ones(X_test.shape[0]), X_test]

# Helper: half mean squared error (the 1/2 cancels when differentiating)
def compute_cost(X, y, theta):
    m = len(y)
    predictions = X.dot(theta)
    cost = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
    return cost

# Gradient Function
def gradient(X, y, theta):
    m = len(y)
    return (1/m) * X.T.dot(X.dot(theta) - y)

# 3. Batch Gradient Descent
def batch_gradient_descent(X, y, theta, learning_rate, iterations):
    cost_history = []
    for _ in range(iterations):
        grad = gradient(X, y, theta)
        theta -= learning_rate * grad
        cost_history.append(compute_cost(X, y, theta))
    return theta, cost_history

# 4. Stochastic Gradient Descent
def stochastic_gradient_descent(X, y, theta, learning_rate, epochs):
    m = len(y)
    cost_history = []
    for epoch in range(epochs):
        for i in range(m):
            rand_index = np.random.randint(0, m)
            xi = X[rand_index:rand_index+1]
            yi = y[rand_index:rand_index+1]
            grad = xi.T.dot(xi.dot(theta) - yi)
            theta -= learning_rate * grad
        cost_history.append(compute_cost(X, y, theta))
    return theta, cost_history

# 5. Mini-Batch Gradient Descent
def mini_batch_gradient_descent(X, y, theta, learning_rate, epochs, batch_size):
    m = len(y)
    cost_history = []
    for epoch in range(epochs):
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for i in range(0, m, batch_size):
            xi = X_shuffled[i:i+batch_size]
            yi = y_shuffled[i:i+batch_size]
            grad = xi.T.dot(xi.dot(theta) - yi) / len(xi)
            theta -= learning_rate * grad
        cost_history.append(compute_cost(X, y, theta))
    return theta, cost_history

# Initialize
theta_init = np.zeros((X_train.shape[1], 1))
learning_rate = 0.01
iterations = 100

# Run all 3 methods
theta_bgd, cost_bgd = batch_gradient_descent(X_train, y_train, theta_init.copy(), learning_rate, iterations)
theta_sgd, cost_sgd = stochastic_gradient_descent(X_train, y_train, theta_init.copy(), learning_rate, iterations)
theta_mgd, cost_mgd = mini_batch_gradient_descent(X_train, y_train, theta_init.copy(), learning_rate, iterations, batch_size=32)

# Plotting Convergence
plt.figure(figsize=(12, 6))
plt.plot(cost_bgd, label='Batch GD')
plt.plot(cost_sgd, label='Stochastic GD')
plt.plot(cost_mgd, label='Mini-Batch GD')
plt.xlabel('Iterations/Epochs')
plt.ylabel('Cost (MSE)')
plt.title('Convergence of Different Gradient Descent Types')
plt.legend()
plt.grid(True)
plt.show()
```

---

### ✅ Summary of Observations

| Gradient Type | Data per Update                   | Speed per Update | Stability   | Noise  |
| ------------- | --------------------------------- | ---------------- | ----------- | ------ |
| Batch GD      | Full dataset per step             | Slow             | High        | Low    |
| Stochastic GD | 1 sample per step                 | Very Fast        | Less Stable | High   |
| Mini-Batch GD | Small group of samples (e.g., 32) | Fast             | Balanced    | Medium |


## What We Will Learn Next
- Logistic Regression
- Cross-Validation
- Hyperparameter Tuning
- Implementations

### OPTIONAL
## Regularization with Gradient Descent

This section compares **regularization techniques** on the **California Housing dataset** using `sklearn`, covering both the dedicated estimators and `SGDRegressor` with the matching penalty.

---

## ✅ Regularization Techniques in `sklearn`

| Regularization | Description                                     | scikit-learn Class                                   |
| -------------- | ----------------------------------------------- | ---------------------------------------------------- |
| **None**       | No penalty                                      | `LinearRegression()`                                 |
| **L2 (Ridge)** | Penalizes large weights (shrinks them)          | `Ridge()`, `SGDRegressor(penalty='l2')`              |
| **L1 (Lasso)** | Produces sparse models (can eliminate features) | `Lasso()`, `SGDRegressor(penalty='l1')`              |
| **ElasticNet** | Mix of L1 and L2                                | `ElasticNet()`, `SGDRegressor(penalty='elasticnet')` |
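
To show how a penalty enters the gradient descent update, here is a minimal NumPy sketch of L2 (ridge) regularization. The cost scaling used here, \( \frac{1}{m}\lVert Xw - y \rVert^2 + \lambda \lVert w \rVert^2 \), is an assumption for illustration and does not match the exact objective of `Ridge()` or `SGDRegressor`; the data and hyperparameters are hypothetical as well.

```python
import numpy as np

def ridge_gradient(X, y, w, lam):
    """Gradient of (1/m) * ||Xw - y||^2 + lam * ||w||^2 with respect to w."""
    m = len(y)
    return (2.0 / m) * X.T @ (X @ w - y) + 2.0 * lam * w

def fit(X, y, lam, alpha=0.05, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= alpha * ridge_gradient(X, y, w, lam)  # the penalty adds a pull toward zero
    return w

# Hypothetical synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

w_plain = fit(X, y, lam=0.0)
w_ridge = fit(X, y, lam=1.0)
print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
```

The only change from unregularized gradient descent is the extra `2 * lam * w` term, which shrinks the weights on every step.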

---

## ✅ Full Code Example

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# 1. Load & split
data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Helper to evaluate models
def evaluate(model, name):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mse = mean_squared_error(y_test, preds)
    print(f"{name:<25} | MSE: {mse:.4f}")

print("🧪 Mean Squared Error for Each Model\n" + "-" * 50)

# 3. Linear Regression (no regularization)
evaluate(LinearRegression(), "Linear Regression (Batch)")

# 4. Ridge (L2)
evaluate(Ridge(alpha=1.0), "Ridge Regression (L2)")

# 5. Lasso (L1)
evaluate(Lasso(alpha=0.1), "Lasso Regression (L1)")

# 6. ElasticNet (L1 + L2)
evaluate(ElasticNet(alpha=0.1, l1_ratio=0.5), "ElasticNet Regression")

# 7. SGDRegressor with L2 (Stochastic Gradient Descent)
evaluate(SGDRegressor(penalty='l2', max_iter=1000, learning_rate='invscaling', eta0=0.01, random_state=42), "SGD - L2 (Stochastic GD)")

# 8. SGDRegressor with L1
evaluate(SGDRegressor(penalty='l1', max_iter=1000, learning_rate='invscaling', eta0=0.01, random_state=42), "SGD - L1 (Stochastic GD)")

# 9. SGDRegressor with ElasticNet
evaluate(SGDRegressor(penalty='elasticnet', l1_ratio=0.5, max_iter=1000, learning_rate='invscaling', eta0=0.01, random_state=42), "SGD - ElasticNet")
```

---

## ✅ Sample Output (may vary by run):

```
🧪 Mean Squared Error for Each Model
--------------------------------------------------
Linear Regression (Batch)     | MSE: 0.5262
Ridge Regression (L2)         | MSE: 0.5262
Lasso Regression (L1)         | MSE: 0.5678
ElasticNet Regression         | MSE: 0.5523
SGD - L2 (Stochastic GD)      | MSE: 0.5354
SGD - L1 (Stochastic GD)      | MSE: 0.5706
SGD - ElasticNet              | MSE: 0.5548
```

---

## ✅ Summary Table

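| Solver                               | Method               | Regularization | Result in Sample Output                        |
| ------------------------------------ | -------------------- | -------------- | ---------------------------------------------- |
| Closed form                          | `LinearRegression()` | None           | ✅ Lowest MSE                                   |
| Direct or iterative (set by `solver`) | `Ridge()`            | L2             | ✅ Matches the baseline at `alpha=1.0`          |
| Coordinate descent                   | `Lasso()`            | L1             | ❌ Higher MSE at `alpha=0.1`; can oversimplify  |
| Coordinate descent                   | `ElasticNet()`       | L1 + L2        | ⚖️ Between Ridge and Lasso                     |
| SGD (per-sample)                     | `SGDRegressor`       | L2             | Close to the baseline                          |
| SGD (per-sample)                     | `SGDRegressor`       | L1             | Sparse weights, higher MSE                     |
| SGD (per-sample)                     | `SGDRegressor`       | ElasticNet     | Between the L2 and L1 variants                 |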

---



### Regularization with Gradient Descent: Adding Cross-Validation

This section applies **the major types of cross-validation** in `scikit-learn` to the **California Housing dataset** with regularized linear models (Ridge, Lasso, ElasticNet).

---

## ✅ What We’ll Do

### 1. Use the following models:

* Ridge (L2)
* Lasso (L1)
* ElasticNet (L1 + L2)

### 2. Apply these **types of cross-validation**:

| Cross-Validation Type   | scikit-learn Class | Description                                                                              |
| ----------------------- | ------------------ | ---------------------------------------------------------------------------------------- |
| **K-Fold**              | `KFold`            | Splits data into *k* folds of nearly equal size; each fold is the test set once          |
| **Stratified K-Fold**   | `StratifiedKFold`  | Like KFold but preserves class proportions (classification only, so not used below)      |
| **ShuffleSplit**        | `ShuffleSplit`     | Independent random train/test splits; a sample may appear in several test sets           |
| **Leave-One-Out (LOO)** | `LeaveOneOut`      | Each sample used once as a single-element test set (slow!)                               |
| **Leave-P-Out**         | `LeavePOut(p=2)`   | Like LOO but tests on every possible subset of *p* samples                               |
| **Group K-Fold**        | `GroupKFold`       | Keeps all samples of a group in the same fold (needs a `groups` array, so not used below) |
---

### ✅ Code: Cross-Validation Comparison

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import (KFold, ShuffleSplit, LeaveOneOut, LeavePOut,
                                     cross_val_score)
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Define models; scaling lives inside each pipeline so it is re-fit on the
# training folds only, which keeps test-fold statistics out of the scaler
models = {
    'Ridge': make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    'Lasso': make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    'ElasticNet': make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5))
}

# Define cross-validation strategies
cv_methods = {
    'K-Fold (5)': KFold(n_splits=5, shuffle=True, random_state=42),
    'ShuffleSplit (5)': ShuffleSplit(n_splits=5, test_size=0.2, random_state=42),
    'Leave-One-Out': LeaveOneOut(),
    'Leave-P-Out (p=2)': LeavePOut(p=2)
}

# Evaluate
print("🔍 Cross-Validation MSE Scores (lower is better):\n")
for model_name, model in models.items():
    print(f"Model: {model_name}")
    for cv_name, cv_strategy in cv_methods.items():
        if cv_name == 'Leave-P-Out (p=2)' and len(X) > 500:  # Too slow for large data
            print(f"  {cv_name}: Skipped (too slow on large dataset)")
            continue
        scores = -1 * cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=cv_strategy)
        print(f"  {cv_name:<25} -> MSE: {np.mean(scores):.4f}")
    print("-" * 50)
```

---

## 🧪 Sample Output (may vary):

```
🔍 Cross-Validation MSE Scores (lower is better):

Model: Ridge
  K-Fold (5)                -> MSE: 0.5291
  ShuffleSplit (5)          -> MSE: 0.5287
  Leave-One-Out             -> MSE: 0.5439
  Leave-P-Out (p=2)         -> MSE: Skipped (too slow on large dataset)
--------------------------------------------------
Model: Lasso
  K-Fold (5)                -> MSE: 0.5692
  ShuffleSplit (5)          -> MSE: 0.5710
  Leave-One-Out             -> MSE: 0.5881
  Leave-P-Out (p=2)         -> MSE: Skipped (too slow on large dataset)
--------------------------------------------------
Model: ElasticNet
  K-Fold (5)                -> MSE: 0.5556
  ShuffleSplit (5)          -> MSE: 0.5572
  Leave-One-Out             -> MSE: 0.5749
  Leave-P-Out (p=2)         -> MSE: Skipped (too slow on large dataset)
--------------------------------------------------
```

---

## ✅ Summary

| Model      | K-Fold      | ShuffleSplit      | LOO                        | Leave-P-Out |
| ---------- | ----------- | ----------------- | -------------------------- | ----------- |
| Ridge      | ✔ Lowest MSE | Similar to K-Fold | Higher MSE, slowest to run | ❌ Skipped   |
| Lasso      | Highest MSE | Similar to K-Fold | Higher MSE, slowest to run | ❌ Skipped   |
| ElasticNet | Middle      | Similar to K-Fold | Higher MSE, slowest to run | ❌ Skipped   |

---

### 🚀 Optional Enhancements

* Use `GridSearchCV` with each CV strategy to tune hyperparameters.
* Use `cross_validate()` to get fit time, score time, etc.
* Plot error bars for different CV methods.
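
As a sketch of the first enhancement, the snippet below tunes the Ridge penalty with `GridSearchCV` and a `KFold` strategy. It uses small synthetic data so it runs quickly; the data, the candidate `alpha` values, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data (not the California Housing set)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=300)

# make_pipeline names the Ridge step "ridge", hence the "ridge__alpha" key
pipeline = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
best_mse = -search.best_score_  # the scorer returns negated MSE
print(best_alpha, best_mse)
```

Swapping the `cv` argument for `ShuffleSplit` or another splitter reuses the same search with a different validation strategy.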

