## Chapter 8. Gradient Descent

Frequently when doing data science, we’ll be trying to find the best model for a certain situation. And usually “best” means **“minimizes the error of the model”** or **“maximizes the likelihood of the data.”** In other words, it will represent the solution to some sort of optimization problem.

This means we’ll need to solve a number of optimization problems. In particular,
we’ll need to solve them from scratch. Our approach will be a technique called **gradient descent**, which lends itself pretty well to a from-scratch treatment.

### The Idea Behind Gradient Descent

Suppose we have some function `f` that takes as input a vector of real numbers and outputs a single real number. One simple such function is:

In [1]:
def sum_of_squares(vector):
    """Compute sum of squared elements in given vector"""
    return sum(element**2 for element in vector)

We’ll frequently need to maximize (or minimize) such functions, i.e. find the input `vector` that produces the largest (or smallest) possible value. For functions like ours, the **gradient** (the vector of partial derivatives) gives the input direction in which the function most quickly increases.
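
For `sum_of_squares` the partial derivatives can be written down by hand, since the derivative of `x_i**2` with respect to `x_i` is `2 * x_i`. The sketch below is only an illustration: `sum_of_squares_gradient`, `point`, and `step_size` are names assumed here, not taken from the text above. It checks that a small move along the gradient makes the function's value larger.

```python
def sum_of_squares(vector):
    """Compute sum of squared elements in given vector"""
    return sum(element**2 for element in vector)


def sum_of_squares_gradient(vector):
    """Hypothetical helper: the i-th partial derivative of
    sum(x_i**2) is 2 * x_i."""
    return [2 * element for element in vector]


point = [1, -2, 3]                          # arbitrary example input
gradient = sum_of_squares_gradient(point)   # [2, -4, 6]

# Nudge the point a little way along the gradient (step_size is an
# arbitrary small value chosen for this illustration).
step_size = 0.01
nudged = [x + step_size * g for x, g in zip(point, gradient)]

# The second value printed should be larger than the first.
print(sum_of_squares(point), sum_of_squares(nudged))
```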

Accordingly, one approach to maximizing a function is to pick a random starting point,
compute the gradient, take a small step in the direction of the gradient (i.e. the direction that causes the function to increase the most), and repeat from the new point.

Similarly, you can try to minimize a function by taking small steps in the opposite
direction (that of the negative gradient).
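
As a rough sketch of that loop, the code below minimizes `sum_of_squares` by repeatedly stepping against its gradient. The helper names (`sum_of_squares_gradient`, `step`), the step size of `0.01`, the 1000 iterations, and the starting range are all assumptions made for this illustration rather than anything prescribed above.

```python
import random


def sum_of_squares(vector):
    """Compute sum of squared elements in given vector"""
    return sum(element**2 for element in vector)


def sum_of_squares_gradient(vector):
    """Hypothetical helper: the i-th partial derivative is 2 * x_i."""
    return [2 * element for element in vector]


def step(vector, gradient, step_size):
    """Move from `vector` by `step_size` times `gradient`."""
    return [x + step_size * g for x, g in zip(vector, gradient)]


random.seed(0)  # fixed seed only so that repeated runs match
current = [random.uniform(-10, 10) for _ in range(3)]  # random start

for _ in range(1000):
    gradient = sum_of_squares_gradient(current)
    # A negative step size moves against the gradient (downhill).
    current = step(current, gradient, -0.01)

print(current)  # every coordinate should be very close to 0
```

With these settings each step shrinks every coordinate by a factor of 0.98, so the point drifts steadily toward the minimum at the origin. Flipping the sign of the step size would climb instead, which for this particular function never stops.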

* ***NOTE***: If a function has a unique **global minimum**, this procedure is likely to find it. If a function has multiple **local minima**, this procedure might “find” the wrong one of them, in which case you might re-run the procedure from a variety of starting points. **If a function has no minimum, then it’s possible the procedure might go on forever.**
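
To make the note concrete, here is a small sketch of re-running the procedure from several starting points. The one-dimensional function `f(x) = x**4 - 3*x**2 + x` is a made-up example with two local minima, and `descend`, the starting points, the step size, and the iteration count are likewise assumptions chosen for illustration.

```python
def f(x):
    """Made-up example with two local minima (one on each side of 0)."""
    return x**4 - 3 * x**2 + x


def f_derivative(x):
    """Derivative of f, worked out by hand."""
    return 4 * x**3 - 6 * x + 1


def descend(start, step_size=0.01, num_steps=1000):
    """Repeatedly step against the derivative, starting from `start`."""
    x = start
    for _ in range(num_steps):
        x -= step_size * f_derivative(x)
    return x


starts = [-2.0, -1.0, 0.0, 1.0, 2.0]
results = [descend(start) for start in starts]

for start, result in zip(starts, results):
    print(f"start={start:+.1f}  ends at x={result:+.3f}  f(x)={f(result):+.3f}")

# Keep whichever run ended at the lowest function value.
best = min(results, key=f)
print("best x found:", best)
```

Runs that start at `1.0` or `2.0` settle in the shallower minimum at positive `x`, while the others reach the deeper one at negative `x`. Comparing the final values across runs is what lets the deeper one be picked out.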

### Estimating the Gradient