In [None]:
'''
 * Copyright (c) 2016 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

## Continuous Optimization in Machine Learning

### Introduction

Since machine learning algorithms are implemented on a computer, mathematical formulations are expressed as **numerical optimization methods**. Training a machine learning model often boils down to **finding a good set of parameters** using an **objective function**.

Optimization problems in machine learning are typically **continuous**, as opposed to **combinatorial optimization** problems for discrete variables.

---

## **Types of Continuous Optimization**
Continuous optimization is divided into two **main branches**:

1. **Unconstrained Optimization**  
   - Places no restrictions on the parameters; a common method is **gradient descent**.
   - Each step moves **opposite** to the gradient direction, aiming for a **minimum**.

2. **Constrained Optimization**  
   - Introduces **constraints** on the optimization process.
   - Uses methods like **Lagrange multipliers**.

---

## **Finding the Global Minimum**
Consider an example function:

$$
\ell(x) = x^4 + 7x^3 + 5x^2 - 17x + 3
$$

The **gradient** (first derivative) is:

$$
\frac{d\ell(x)}{dx} = 4x^3 + 21x^2 + 10x - 17
$$

Setting the derivative **to zero**, we solve for **stationary points**:

$$
\frac{d\ell(x)}{dx} = 0
$$

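Since this is a **cubic equation**, it has **at most three real solutions**. Here all three are real: two are **local minima** (near $ x \approx -4.5 $ and $ x \approx 0.7 $) and one is a **local maximum** (near $ x \approx -1.4 $). The minimum near $ x \approx -4.5 $ has the lower function value, so it is the **global minimum**.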

To **check whether a stationary point is a minimum or maximum**, compute the **second derivative**:

$$
\frac{d^2 \ell(x)}{dx^2} = 12x^2 + 42x + 10
$$

At a stationary point, a **positive** second derivative means a **local minimum**, while a **negative** second derivative indicates a **local maximum**.
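
As a rough numerical check of the second-derivative test, the stationary points can be found with `numpy.roots` and classified by the sign of the second derivative. This is only a sketch: the helper names `ell`, `d2_ell` and `classify_stationary_points` are illustrative and not part of the original notebook.

```python
import numpy as np

def ell(x):
    """Objective l(x) = x^4 + 7x^3 + 5x^2 - 17x + 3."""
    return x**4 + 7*x**3 + 5*x**2 - 17*x + 3

def d2_ell(x):
    """Second derivative 12x^2 + 42x + 10."""
    return 12*x**2 + 42*x + 10

def classify_stationary_points():
    # Coefficients of the first derivative 4x^3 + 21x^2 + 10x - 17, highest power first.
    roots = np.roots([4, 21, 10, -17])
    # Keep only the real roots and sort them from left to right.
    real_roots = np.sort(roots[np.isclose(roots.imag, 0)].real)
    results = []
    for x in real_roots:
        kind = "minimum" if d2_ell(x) > 0 else "maximum"
        results.append((float(x), kind, float(ell(x))))
    return results

for x, kind, value in classify_stationary_points():
    print(f"x = {x:.3f}: local {kind}, l(x) = {value:.3f}")
```

Comparing the printed function values at the two minima shows which one is the global minimum.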

---

## **Gradient Descent for Optimization**
If we **cannot solve for $ x $ analytically**, we start from an initial value $ x_0 $ and follow the **negative gradient**:

$$
x_{t+1} = x_t - \eta \left. \frac{d\ell(x)}{dx} \right|_{x = x_t}
$$

where $ \eta $ is the **learning rate**.

Gradient-based updates of this form underlie **stochastic gradient descent**, the standard way to train **deep learning** models, and are also widely used in **convex optimization**.

---

## **Hessian Matrix for Second-Order Optimization**
To analyze curvature, we use the **Hessian matrix**, shown here for a function $ f(x, y) $ of two variables:

$$
H =
\begin{bmatrix}
\frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\
\frac{\partial^2 f}{\partial x \partial y} & \frac{\partial^2 f}{\partial y^2}
\end{bmatrix}
$$

For multivariate functions $ f: \mathbb{R}^n \to \mathbb{R} $, the **Hessian** is an $ n \times n $ matrix of second partial derivatives measuring **local curvature**.
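
To make the Hessian concrete, the sketch below approximates it with central finite differences. The test function $ f(x, y) = x^2 + 3xy + 2y^2 $, the helper `numerical_hessian`, and the step `h` are assumptions chosen for illustration, not something taken from the text above.

```python
import numpy as np

def example_f(v):
    # Hypothetical test function f(x, y) = x^2 + 3xy + 2y^2.
    x, y = v
    return x**2 + 3*x*y + 2*y**2

def numerical_hessian(func, v, h=1e-4):
    """Approximate the Hessian of func at v with central differences."""
    v = np.asarray(v, dtype=float)
    n = v.size
    steps = np.eye(n) * h  # row i is a step of size h along coordinate i
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = steps[i], steps[j]
            H[i, j] = (func(v + e_i + e_j) - func(v + e_i - e_j)
                       - func(v - e_i + e_j) + func(v - e_i - e_j)) / (4 * h**2)
    return H

H = numerical_hessian(example_f, [1.0, -2.0])
print(H)                      # close to the exact Hessian [[2, 3], [3, 4]]
print(np.linalg.eigvalsh(H))  # the signs of the eigenvalues describe the curvature
```

For this particular function the eigenvalues have mixed signs, so the surface curves upward in one direction and downward in another (a saddle) rather than forming a bowl.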




![image.png](attachment:image.png)

Fig.2 Example objective function. Negative gradients are indicated by arrows, and the global minimum is indicated by the dashed blue line.

# Optimization Using Gradient Descent

## **Continuous Optimization in Machine Learning**
Machine learning algorithms rely on **continuous optimization**, where the objective is to minimize a function \( f(x) \) by iteratively refining parameters.

The general **optimization problem** is:

$$
\min f(x), \quad x \in \mathbb{R}^d
$$

where \( f \) is a differentiable function representing the problem at hand.

---

## **Understanding Gradient Descent**
Gradient descent is a **first-order optimization algorithm** in which we take **steps proportional to the negative gradient** to find a local minimum.

The gradient points in the **direction of steepest ascent**, so a sufficiently small step in the **negative gradient direction** decreases \( f \).

For an initial guess \( x_0 \), we update iteratively:

$$
x_{i+1} = x_i - \gamma_i (\nabla f(x_i))^T
$$

where:
- \( \gamma_i \) is the **step size** (learning rate),
- \( \nabla f(x) \) is the **gradient of \( f \)**.

For suitable step sizes, the sequence:

$$
f(x_0) \geq f(x_1) \geq f(x_2) \geq \dots
$$

converges to a **local minimum**.
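
The role of the step size can be seen in a small experiment. The sketch below assumes a hypothetical test objective \( f(x) = \|x\|^2 \) with gradient \( 2x \); the helper name `gradient_descent_path` and the two step sizes are chosen purely for illustration.

```python
import numpy as np

def gradient_descent_path(grad, x0, step_size, num_steps):
    """Return the iterates x_0, x_1, ..., x_num_steps of gradient descent."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(num_steps):
        x = x - step_size * grad(x)  # x_{i+1} = x_i - gamma * grad f(x_i)
        iterates.append(x.copy())
    return iterates

def sq_norm(x):
    # Hypothetical test objective f(x) = ||x||^2.
    return float(x @ x)

def sq_norm_grad(x):
    return 2 * x

for gamma in (0.1, 1.1):
    path = gradient_descent_path(sq_norm_grad, [1.0, -2.0], gamma, 20)
    values = [sq_norm(x) for x in path]
    non_increasing = all(a >= b for a, b in zip(values, values[1:]))
    print(f"gamma = {gamma}: f(x_20) = {values[-1]:.3e}, non-increasing: {non_increasing}")
```

With the smaller step size the function values decrease at every step; with the larger one each update overshoots the minimum and the values grow instead.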

---

## **Example: Quadratic Function Optimization**
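Consider the function:

$$
f\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) =
\frac{1}{2}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^T
\begin{bmatrix}
2 & 1 \\
1 & 20
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
-
\begin{bmatrix} 5 \\ 3 \end{bmatrix}^T
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
$$

Gradient:

$$
\nabla f =
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^T
\begin{bmatrix}
2 & 1 \\
1 & 20
\end{bmatrix}
-
\begin{bmatrix} 5 \\ 3 \end{bmatrix}^T
$$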

Starting at \( x_0 = [-3, -1]^T \), we apply gradient descent iteratively.

If \( \gamma = 0.085 \):

$$
x_1 = [-1.98, 1.21]^T
$$

$$
x_2 = [-1.32, -0.42]^T
$$

Continuing these updates, the iterates converge towards the **optimal solution**.
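
The iterates quoted above can be reproduced numerically. The sketch below assumes the objective is the quadratic \( f(x) = \tfrac{1}{2} x^T A x - b^T x \) with \( A = \begin{bmatrix} 2 & 1 \\ 1 & 20 \end{bmatrix} \) and \( b = [5, 3]^T \); the names `A`, `b`, `quadratic_grad` and `history` are illustrative.

```python
import numpy as np

# Assumed problem data for f(x) = 0.5 * x^T A x - b^T x.
A = np.array([[2.0, 1.0], [1.0, 20.0]])
b = np.array([5.0, 3.0])

def quadratic_grad(x):
    # Gradient of the quadratic; valid because A is symmetric.
    return A @ x - b

x = np.array([-3.0, -1.0])  # starting point x_0
gamma = 0.085               # step size
history = [x.copy()]
for _ in range(200):
    x = x - gamma * quadratic_grad(x)
    history.append(x.copy())

print(np.round(history[1], 2))  # approximately [-1.98, 1.21]
print(np.round(history[2], 2))  # approximately [-1.32, -0.42]
print(history[-1])              # final iterate
print(np.linalg.solve(A, b))    # exact minimizer, where the gradient is zero
```

For a quadratic with a positive definite matrix, the minimizer solves the linear system \( A x = b \), which gives a reference value to compare the final iterate against.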

---

## **Challenges with Gradient Descent**
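Gradient descent can be **slow** near the minimum when the problem is poorly conditioned, and its result can depend on where it starts:

- In long, narrow valleys the **iterates zigzag** across the valley, making convergence inefficient.
- For non-convex objectives, the minimum that is found depends on the starting point. For **convex** objectives every local minimum is a **global minimum**, which removes this dependence.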
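
As a rough illustration of how conditioning slows gradient descent, the sketch below counts the steps needed on the hypothetical quadratic \( f(x) = \tfrac{1}{2}(x_1^2 + \kappa x_2^2) \), whose condition number is \( \kappa \). The function name `steps_to_converge`, the tolerance, and the step size \( 1/\kappa \) are assumptions made for this example.

```python
import numpy as np

def steps_to_converge(kappa, tol=1e-6, max_steps=100_000):
    """Count gradient-descent steps on f(x) = 0.5 * (x_1^2 + kappa * x_2^2)."""
    curvatures = np.array([1.0, float(kappa)])  # diagonal of the Hessian
    x = np.array([1.0, 1.0])
    step_size = 1.0 / kappa                     # 1 / (largest curvature)
    for step in range(max_steps):
        grad = curvatures * x
        if np.linalg.norm(grad) < tol:
            return step
        x = x - step_size * grad
    return max_steps

for kappa in (1, 10, 100, 1000):
    print(f"condition number {kappa:>4}: {steps_to_converge(kappa)} steps")
```

The step size has to be small enough for the steep direction, so progress along the shallow direction shrinks as \( \kappa \) grows and the step count rises accordingly.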



![image-2.png](attachment:image-2.png)

Fig.3 Gradient descent on a two-dimensional quadratic surface (shown as a heatmap). See Example.1 for a description.

In [None]:
# Objective function
def f(x):
    """ Evaluate f(x) = x^4 + 7x^3 + 5x^2 - 17x + 3 """
    return x**4 + 7*x**3 + 5*x**2 - 17*x + 3

# Function to compute the derivative of f(x)
def derivative(x):
    """ Compute the gradient of f(x) = x^4 + 7x^3 + 5x^2 - 17x + 3 """
    return 4*x**3 + 21*x**2 + 10*x - 17

# Gradient descent function
def gradient_descent(x0, learning_rate, iterations, max_grad=100):
    """ Perform gradient descent with gradient clipping to find a local minimum """
    x = x0  # Initial value
    for _ in range(iterations):
        # Clip the gradient to [-max_grad, max_grad] so a steep slope cannot cause a huge step
        grad = max(min(derivative(x), max_grad), -max_grad)
        x -= learning_rate * grad  # Update x using the negative gradient
    return x

# Example usage
x_initial = -6  # Starting point
learning_rate = 0.01  # Step size (0.05 overshoots both minima of this function)
iterations = 1000  # Number of updates

optimal_x = gradient_descent(x_initial, learning_rate, iterations)
print(f"Optimal x: {optimal_x}")
# Evaluate the objective at the point found, not at the starting point
print(f"Function value at optimal x: {f(optimal_x)}")
