In [None]:
# gradient descent optimization with adam for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

In [None]:
# objective function: a simple test problem; we want to find the minimum of f(x, y) = x^2 + y^2
def objective(x, y):
	return x**2.0 + y**2.0
 
# derivative of objective function: for f(x, y) = x^2 + y^2 the partial derivatives are 2x (with respect to x) and 2y (with respect to y)
# so the gradient is returned as an array containing 2x and 2y
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

This target function (x^2 + y^2) is a paraboloid and has its global minimum at the point (0,0).

The derivative function in the script calculates the gradient of this objective function. The gradient of a function points in the direction of steepest ascent. When minimizing, we move in the opposite direction (steepest descent).

For the given quadratic function, the gradient can be calculated analytically by taking the partial derivatives with respect to each variable (x and y). Here's how it works:

    The derivative of x^2 with respect to x is 2x.
    The derivative of y^2 with respect to y is 2y.

Therefore, the gradient of the function f(x,y) is the vector of these partial derivatives, (2x, 2y).
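
One way to sanity-check an analytic gradient like this is to compare it with a finite-difference estimate. The sketch below is illustrative only: `numerical_gradient`, the step `h`, and the test point are assumptions introduced here, not part of the original script.

```python
import numpy as np

def objective(x, y):
    return x**2.0 + y**2.0

def derivative(x, y):
    # analytic gradient: (2x, 2y)
    return np.asarray([x * 2.0, y * 2.0])

def numerical_gradient(f, x, y, h=1e-6):
    # hypothetical helper: central differences in each coordinate
    dfdx = (f(x + h, y) - f(x - h, y)) / (2.0 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2.0 * h)
    return np.asarray([dfdx, dfdy])

# arbitrary test point chosen for illustration
point = (0.5, -0.25)
analytic = derivative(*point)
numeric = numerical_gradient(objective, *point)
print('analytic:', analytic)
print('numeric: ', numeric)
```

If the two vectors agree to several decimal places, the analytic derivative is very likely correct. For functions that are harder to differentiate by hand, the same check can catch sign or factor errors before they reach the optimizer.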

In [1]:
# gradient descent algorithm with adam
def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	score = objective(x[0], x[1])
	# initialize first and second moments
	m = [0.0 for _ in range(bounds.shape[0])]
	v = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent updates
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# build a solution one variable at a time
		for i in range(x.shape[0]):
			# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
			m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
			# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
			v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
			# mhat(t) = m(t) / (1 - beta1^(t+1)), with t counted from 0
			mhat = m[i] / (1.0 - beta1**(t+1))
			# vhat(t) = v(t) / (1 - beta2^(t+1)), with t counted from 0
			vhat = v[i] / (1.0 - beta2**(t+1))
			# x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)

			## This is the actual update step, updating the parameter being used to minimize the loss
			## In this case, we are updating the x and y values
			## But if we were using it in a neural network, we would be updating the weights and biases
			x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
		# evaluate candidate point
		score = objective(x[0], x[1])
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return [x, score]
 
# seed the pseudo random number generator
seed(1)
#source: https://machinelearningmastery.com/adam-optimization-from-scratch/

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.999
# perform the gradient descent search with adam
best, score = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
print('Done!')
print('f(%s) = %f' % (best, score))

>0 f([-0.14595599  0.42064899]) = 0.19825
>1 f([-0.12613855  0.40070573]) = 0.17648
>2 f([-0.10665938  0.3808601 ]) = 0.15643
>3 f([-0.08770234  0.3611548 ]) = 0.13812
>4 f([-0.06947941  0.34163405]) = 0.12154
>5 f([-0.05222756  0.32234308]) = 0.10663
>6 f([-0.03620086  0.30332769]) = 0.09332
>7 f([-0.02165679  0.28463383]) = 0.08149
>8 f([-0.00883663  0.26630707]) = 0.07100
>9 f([0.00205801 0.24839209]) = 0.06170
>10 f([0.01088844 0.23093228]) = 0.05345
>11 f([0.01759677 0.2139692 ]) = 0.04609
>12 f([0.02221425 0.19754214]) = 0.03952
>13 f([0.02485859 0.18168769]) = 0.03363
>14 f([0.02572196 0.16643933]) = 0.02836
>15 f([0.02505339 0.15182705]) = 0.02368
>16 f([0.02313917 0.13787701]) = 0.01955
>17 f([0.02028406 0.12461125]) = 0.01594
>18 f([0.01679451 0.11204744]) = 0.01284
>19 f([0.01296436 0.10019867]) = 0.01021
>20 f([0.00906264 0.08907337]) = 0.00802
>21 f([0.00532366 0.07867522]) = 0.00622
>22 f([0.00193919 0.06900318]) = 0.00477
>23 f([-0.00094677  0.06005154]) = 0.00361
>24 f(

# PROGRAM EXPLANATION

Script Summary:
The script uses the Adam optimization algorithm to find the minimum of a quadratic function, the sum of the squares of two variables, x and y. The algorithm iteratively adjusts the values of x and y to reduce this function.

Here's a step-by-step explanation of how the script operates:

    Objective Function (objective): This function computes the value of the quadratic function for given x and y.

    Derivative Function (derivative): This function calculates the gradient of the objective function, that is, the vector of partial derivatives with respect to x and y.

    Adam Optimization Function (adam):
        Starts by generating a random initial point within specified bounds.
        Initializes the first (m) and second (v) moment vectors to zero. These hold exponentially decaying averages of the gradient and of the squared gradient, respectively.
        Performs gradient descent updates for a specified number of iterations (n_iter):
            Computes the gradient of the objective function at the current point.
            Updates the moment vectors m and v.
            Calculates bias-corrected estimates of the moments (mhat and vhat).
            Updates the variables x and y by moving in the direction that reduces the objective function, considering the learning rate (alpha), the corrected moments, and a small number eps to prevent division by zero.
            Evaluates and prints the current objective function value.
        Returns the final point and the corresponding objective function value. The code does not track the best point seen during the run; it returns the last one.
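
The steps above can also be written with whole-array NumPy operations instead of a per-coordinate loop. The following is a minimal sketch under stated assumptions, not the original author's code: `adam_vectorized`, the explicit starting point `start`, and the best-point tracking are introduced here for illustration. The original listing draws its starting point at random and returns the last point.

```python
import numpy as np

def objective(x, y):
    return x**2.0 + y**2.0

def derivative(x, y):
    return np.asarray([x * 2.0, y * 2.0])

def adam_vectorized(objective, derivative, x0, n_iter, alpha, beta1, beta2, eps=1e-8):
    # hypothetical variant: takes an explicit starting point and tracks the best point seen
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)  # first moment: running average of gradients
    v = np.zeros_like(x)  # second moment: running average of squared gradients
    best_x, best_score = x.copy(), objective(*x)
    for t in range(1, n_iter + 1):
        g = derivative(*x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        # bias correction; t starts at 1 here, so the exponent is t
        mhat = m / (1.0 - beta1**t)
        vhat = v / (1.0 - beta2**t)
        x = x - alpha * mhat / (np.sqrt(vhat) + eps)
        score = objective(*x)
        if score < best_score:
            best_x, best_score = x.copy(), score
    return best_x, best_score

# illustrative starting point and the hyperparameters used in the script
start = np.asarray([-0.5, 0.8])
best_x, best_score = adam_vectorized(objective, derivative, start, 60, 0.02, 0.8, 0.999)
print('f(%s) = %f' % (best_x, best_score))
```

Tracking the best score separately matters because momentum can carry Adam past the minimum, so the last iterate is not guaranteed to be the lowest one visited.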

# VARIABLES

Variables:

    x: A NumPy array representing the current point (containing x and y values).
    score: The current value of the objective function at point x.
    m: A list representing the first moment vector (related to the moving average of the gradients).
    v: A list representing the second moment vector (related to the moving average of the squared gradients).
    g: The gradient of the objective function at point x.
    mhat: Bias-corrected estimate of the first moment for the coordinate currently being updated.
    vhat: Bias-corrected estimate of the second moment for the coordinate currently being updated.
    alpha (step size): The learning rate controlling the step size in the parameter update.
    beta1: The decay rate for the first moment estimates (similar to the momentum factor).
    beta2: The decay rate for the second moment estimates.
    eps: A small constant to prevent division by zero in the parameter update.
    bounds: The bounds for the initial point values.
    n_iter: The number of iterations for which the optimization will run.
    best: The point returned by adam, which is the final point after n_iter iterations (the code does not separately track the best point seen).
    seed(1): This seeds the random number generator to ensure reproducible results.

The main purpose of these variables is to define the optimization problem, control the behavior of the Adam optimizer, and store the state carried between iterations, namely the running averages of the gradients (m) and of their squares (v).
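
To see why mhat and vhat are needed, it may help to work through the first iteration for a single coordinate. Because m and v start at zero, the raw averages are pulled toward zero early on, and the bias correction undoes that. The gradient value `g` below is a hypothetical number chosen for illustration; the hyperparameters are the ones set in the script.

```python
from math import sqrt

alpha, beta1, beta2, eps = 0.02, 0.8, 0.999, 1e-8
g = 0.84  # hypothetical gradient value for one coordinate
t = 0     # first iteration, counted from 0 as in the script

# raw moments after one update from m = v = 0
m = (1.0 - beta1) * g
v = (1.0 - beta2) * g**2

# bias-corrected moments
mhat = m / (1.0 - beta1**(t + 1))
vhat = v / (1.0 - beta2**(t + 1))

# size of the first parameter update
step = alpha * mhat / (sqrt(vhat) + eps)

print('m = %.5f, mhat = %.5f' % (m, mhat))
print('v = %.7f, vhat = %.7f' % (v, vhat))
print('step = %.5f' % step)
```

On the first iteration mhat equals g and vhat equals g squared (up to rounding), so the step is almost exactly alpha regardless of how large the gradient is. This is consistent with the printed output, where successive early iterates differ by about 0.02 in each coordinate.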