# Calculus


## What you'll learn in this class

It is often said that data scientists are excellent statisticians. That is not necessarily true. What is true is that they have the mathematical knowledge needed to understand the Machine Learning algorithms they develop and use. In this course, we will review some of the basic mathematical concepts you need before tackling Machine Learning itself:

* The different mathematical notations
* What is a function?
* What are a derivative and an integral?
* What is a convex function?
* Understanding gradient descent

## 1. Mathematics Refresher

It is true that mathematicians use a very special vocabulary and way of writing things. So let's take a look at some of the mathematical notations that we will use throughout the program.

### 1.1. Notations

#### Sum:

A sum can be written out in full as follows:

$$S_n = x_0 + x_1 + x_2 + ... + x_n$$

This works, but imagine that we have 200 numbers to add up. The equation would become far too long. Since mathematicians are a bit lazy but very creative, they decided to write long sums as follows:

$$S_n = \sum_{i=0}^{n}x_i$$

The Greek letter $\sum$ is a capital sigma. The index below it gives the lower bound (where the sum starts) and the value above it gives the upper bound (where the sum ends). For example:

$$S_3 = \sum_{i=0}^{3}x_i = x_0 + x_1 + x_2 + x_3$$

#### Product:

What we can do with sums, we can also do with products. All you have to do is replace $\sum$ with $\prod$ (a capital pi).


$$P_n = \prod_{i=0}^{n}x_i = x_0 \times x_1 \times x_2 \times ... \times x_n$$
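The sum and product notations map directly onto a loop over the indices. Here is a minimal sketch in Python; the values in `x` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical sample values x_0, x_1, x_2, x_3 (illustration only).
x = [2, 3, 5, 7]

# S_3 = x_0 + x_1 + x_2 + x_3
S = sum(x[i] for i in range(len(x)))

# P_3 = x_0 * x_1 * x_2 * x_3
P = math.prod(x[i] for i in range(len(x)))

print(S)  # 17
print(P)  # 210
```

The index `i` plays the same role as in the formulas: it starts at the lower bound and stops at the upper bound.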

#### Intervals:

It is useful to know how intervals are written. An interval designates the set of values between two bounds. We will study two types of intervals.

- Integer intervals:

Natural numbers are whole numbers (with no decimal part) that are greater than or equal to zero. A range of such numbers can be written as follows:

$$N = [\![ 0;n ]\!] = \left \{ 0; 1; 2; 3; 4; ... ;n \right \}$$



The letter $N$ is not chosen at random: ${\Bbb N}$ conventionally designates the set of natural numbers.

- Intervals of real numbers:

Real numbers are all the numbers on the number line: positive, negative or zero, with or without a decimal part. The set of real numbers is usually denoted by the letter ${\Bbb R}$.


An interval can be created as follows:

$$I = \left[ \text{start point} ; \text{end point} \right]$$


This is called a closed interval, i.e. both the start and end bounds are included in the interval. For example:

$$\left [ -10; 10 \right ]$$

Here we have all real numbers between -10 and 10, including -10 and 10. To exclude both bounds, we write:

$$]-10;10[$$


Here we exclude both the value -10 and the value 10: it is an open interval. We can also include only one of the two bounds:

$$[-10;10[$$


In this case, we include the value -10 in the range but exclude the value 10.

Finally, infinity can be used as one of the bounds. This is written as follows:

$$]-\infty;+\infty[$$


By convention, infinite bounds are always excluded.

#### Vectors:
$$\overrightarrow{x} = (x_0, x_1, x_2, ..., x_n)$$

#### Matrices:
$$\mathbf{X} = \left[\begin{array}
{rrrr}
x_{0,0} & x_{0,1} & ... & x_{0,p} \\
x_{1,0} & x_{1,1} & ... & x_{1,p} \\
... & ... & ... & ... \\
x_{n,0} & x_{n,1} & ... & x_{n,p} \\
\end{array}\right]
$$
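In code, the same indexing convention applies. The sketch below stores a vector as a Python list and a matrix as a list of rows; the values are hypothetical and only illustrate how $x_i$ and $x_{i,j}$ are accessed.

```python
# A vector with 3 components x_0, x_1, x_2 (hypothetical values).
x = [1.0, 2.0, 3.0]

# A matrix with 2 rows and 3 columns, stored as a list of rows.
X = [
    [1, 2, 3],
    [4, 5, 6],
]

n_rows = len(X)
n_cols = len(X[0])

print(x[0])            # component x_0
print(X[1][2])         # element x_{1,2}: row 1, column 2
print(n_rows, n_cols)  # shape of the matrix
```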

### 1.2. Functions

#### Definition

We're going to be talking about functions a lot in the rest of our exercises. Quite simply, a function is a transformation that is applied to a value. Let's take a simple example and assume that with 1 year of higher education, I can expect an annual salary of 10,000 dollars. We could write:

$$f(1) = 10000$$

We could generalize and say that each additional year of study raises my expected salary by the same amount:

$$f(x) = 10000x$$

This can be represented by the following graph:

![](https://drive.google.com/uc?export=view&id=17i6Ndv3pFmFfy1kFNinRI6acqWuONLux)

Of course, this graph does not really represent reality, but it is easy to read: the more years of education you have, the higher your salary will be.

#### Examples of functions

The function above was extremely simple. Most functions are more complex, such as:

$$f(x) = x^2$$

![](https://drive.google.com/uc?export=view&id=1-p9KjyIK55wd2xrvrV7hVSTAsG0rAzQ9)

$$f(x) = e^x$$

![](https://drive.google.com/uc?export=view&id=1H1ibFdG23Xv9H_13EysV7GtIzBGIWfzE)

Or:

$$f(x) = \ln(x)$$

![](https://drive.google.com/uc?export=view&id=1QZ3UJhKWqCefNuBhe2BS9lj-G1gKHygr)
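The functions of this section can be written as ordinary Python functions. This is only a sketch: the names `salary` and `square` are ours, and the exponential and logarithm come from the standard `math` module.

```python
import math

def salary(years):
    """f(x) = 10000x: expected annual salary after `years` of higher education."""
    return 10000 * years

def square(x):
    """f(x) = x^2"""
    return x ** 2

print(salary(1))    # 10000
print(salary(3))    # 30000
print(square(4))    # 16
print(math.exp(0))  # e^0 = 1.0
print(math.log(1))  # ln(1) = 0.0 (math.log is the natural logarithm)
```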

### 1.3. Derivatives

It is important to understand what the derivative of a function represents. A derivative describes the behaviour of a function at a given point: it measures how fast the value of the function is changing at that point (its instantaneous rate of change). In particular, its sign tells us whether the function is increasing or decreasing:

- if $f'(x) \geq 0$ on $I$, the $f$ function is increasing on $I$
- if $f'(x) \leq 0$ on $I$, the $f$ function is decreasing on $I$


#### Explanation

Imagine you're in a moving car. You can represent the distance travelled in relation to time in this way:

![](https://drive.google.com/uc?export=view&id=1LftgZvYdaXo6r2BlhfqgYZ6Z6UV0K59P)


This graph is represented by the following function

$$f(t) = 4t^2 - \frac{1}{3}t^3$$

What we can see graphically is that as time increases, the distance travelled increases until it reaches a plateau.

Now imagine that we want to know the speed of the car at a given moment. Speed is usually computed as:

$$speed = \frac{distance}{time}$$

For example, the function tells us that after 5 seconds we have travelled about 58 meters ($f(5) \approx 58.3$).

So we could estimate our speed at:

$$speed_{[0;5]} \approx \frac{58.3}{5} \approx 11.7m/s$$

However, this is only an average: the car was not driving at this speed at every moment since the start. For example, the speed between second 2 and second 3 is $\frac{f(3) - f(2)}{3 - 2} \approx 27 - 13.33 = 13.67$ m/s.

So over this interval we're travelling at about 13.7 meters per second.

The smaller the interval that we use for the speed computation, the more precise this speed is.   
The derivative $f'(t)$ of the distance function $f(t)$ represents the speed of the car as a function of time $t$:
$$f'(t)=\frac{df}{dt}(t)=8t - t^2=t(8-t)$$

![Tangent and derivative](https://upload.wikimedia.org/wikipedia/commons/c/cc/Tangent_animation.gif)

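More generally, the derivative is the limit of the rate of change when the interval $\Delta x$ becomes infinitesimally small:

$$f'(x) = \lim_{\Delta x \to 0} \frac{f(x+\Delta x) - f(x)}{x+\Delta x - x} = \lim_{\Delta x \to 0} \frac{f(x+\Delta x) - f(x)}{\Delta x}$$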
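To see this numerically, we can compute the average speed of the car over smaller and smaller intervals starting at $t = 2$ and compare it with $f'(2) = 2(8 - 2) = 12$. This is a minimal sketch; the helper names `average_speed` and `exact_speed` are ours.

```python
def f(t):
    """Distance travelled by the car: f(t) = 4t^2 - t^3 / 3."""
    return 4 * t**2 - t**3 / 3

def average_speed(t, dt):
    """Average speed between t and t + dt (the rate of change over dt)."""
    return (f(t + dt) - f(t)) / dt

def exact_speed(t):
    """Derivative given in the text: f'(t) = t(8 - t)."""
    return t * (8 - t)

t = 2.0
for dt in [1.0, 0.1, 0.01, 0.001]:
    # The average speed gets closer to f'(2) as dt shrinks.
    print(dt, average_speed(t, dt))

print("exact:", exact_speed(t))  # exact: 12.0
```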

The derivative of the distance function is represented graphically by the orange curve:

![](https://drive.google.com/uc?export=view&id=1iPKx13h4wIeRfKWbRU-ESMeeMmdiU5Fx)


We can see that the derivative $f'(t_0)$ is equal to zero for each local minimum or maximum:
$$f'(0) = 0$$
and
$$f'(8) = 0$$

But **CAUTION**:
- A local maximum is not necessarily a global maximum: several local maxima can exist.
- A local minimum is not necessarily a global minimum: several local minima can exist.
- The derivative is not necessarily equal to zero at the global minimum / maximum ($f'(t_0) \neq 0$ is possible): this is the case when the extremum is located at a boundary of the function's domain.

![Example of a function with 2 local minima](https://drive.google.com/uc?export=view&id=1r8bgD4V3mWgGhuiDlRomVhDHX59JPIjK)
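The last point can be checked with a small experiment. In this sketch we treat the car function as a plain mathematical function and restrict it to the hypothetical interval $[-6;10]$ (an assumption made only for illustration), then look for its largest value on a fine grid.

```python
def f(t):
    return 4 * t**2 - t**3 / 3

def f_prime(t):
    return t * (8 - t)

# Hypothetical domain [-6, 10], sampled every 0.01.
a, b = -6.0, 10.0
steps = 1600
grid = [a + (b - a) * k / steps for k in range(steps + 1)]

# Point of the grid where f takes its largest value.
t_max = max(grid, key=f)

print(t_max, f(t_max), f_prime(t_max))
print(f(8.0), f_prime(8.0))  # local maximum inside the interval
```

On this interval the largest value is reached at the boundary $t = -6$, where the derivative is not zero, while $t = 8$ (where $f'(8) = 0$) is only a local maximum.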


#### Table of main derivatives

For derivatives, you mainly need to know the general rules used to compute them. Here is a table of the most common rules:

| Function structure       | Derivative structure |
| ------------------------ | :------------------: |
| $$au(x) + bv(x)$$        | $$au'(x) + bv'(x)$$    |
| $$u(x) \cdot v(x)$$      | $$u'(x) \cdot v(x) + v'(x) \cdot u(x)$$ |
| $$\frac{1}{u(x)}$$       | $$-\frac{u'(x)}{u(x)^2}$$ |
| $$\frac{u(x)}{v(x)}$$    | $$\frac{u'(x) \cdot v(x) - v'(x) \cdot u(x)}{v(x)^2}$$ |
| $$u(x)^a$$               | $$a \cdot u'(x) \cdot u(x)^{a-1}$$ |
| $$\sqrt{u(x)}$$          | $$\frac{u'(x)}{2\sqrt{u(x)}}$$ |
| $$u(v(x))$$              | $$v'(x) \cdot u'(v(x))$$ |
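These rules can be sanity-checked numerically. The sketch below compares the product rule and the composition rule with a finite-difference approximation of the derivative. The choice $u(x) = x^2$, $v(x) = e^x$ and the helper `numerical_derivative` are assumptions made for the example.

```python
import math

def numerical_derivative(g, x, h=1e-6):
    """Central difference approximation of g'(x)."""
    return (g(x + h) - g(x - h)) / (2 * h)

# Hypothetical building blocks: u(x) = x^2 and v(x) = e^x.
def u(x):
    return x**2

def du(x):
    return 2 * x

def v(x):
    return math.exp(x)

def dv(x):
    return math.exp(x)

x = 1.5

# Product rule: (u v)' = u' v + v' u
product_rule = du(x) * v(x) + dv(x) * u(x)
product_numeric = numerical_derivative(lambda s: u(s) * v(s), x)

# Composition rule: (u(v(x)))' = v'(x) u'(v(x))
chain_rule = dv(x) * du(v(x))
chain_numeric = numerical_derivative(lambda s: u(v(s)), x)

print(product_rule, product_numeric)
print(chain_rule, chain_numeric)
```

In both cases the value given by the rule and the numerical approximation should agree to several decimal places.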


#### Gradient

When a function has several variables, as in a model with multiple features, it has as many partial derivatives as it has variables.   

**The gradient is a vector composed of all the partial derivatives of a multivariate function.**

$$\nabla f = \left( \frac{\partial f}{\partial x_0}, \frac{\partial f}{\partial x_1}, ..., \frac{\partial f}{\partial x_n} \right)$$
   
*Example*:
$$f(x_0, x_1, x_2) = 3 x_2^3 - 2 x_1^2 + x_0 - 10$$
$$\nabla f(x_0, x_1, x_2)=\left( \frac{\partial f}{\partial x_0}(x_0, x_1, x_2), \frac{\partial f}{\partial x_1}(x_0, x_1, x_2), \frac{\partial f}{\partial x_2}(x_0, x_1, x_2) \right)$$
$$\nabla f(x_0, x_1, x_2)=(1, -4x_1, 9x_2^2)$$
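The gradient of this example can be checked numerically by nudging one variable at a time. This is a sketch under our own assumptions: the helper `numerical_gradient` and the evaluation point $(1, 2, 3)$ are not part of the course material.

```python
def f(x0, x1, x2):
    return 3 * x2**3 - 2 * x1**2 + x0 - 10

def analytic_gradient(x0, x1, x2):
    """Gradient from the example: (1, -4 x1, 9 x2^2)."""
    return (1.0, -4 * x1, 9 * x2**2)

def numerical_gradient(func, point, h=1e-6):
    """Approximate each partial derivative with a central difference."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h    # move only variable i forward
        backward[i] -= h   # move only variable i backward
        grad.append((func(*forward) - func(*backward)) / (2 * h))
    return tuple(grad)

point = (1.0, 2.0, 3.0)
print(analytic_gradient(*point))     # (1.0, -8.0, 81.0)
print(numerical_gradient(f, point))  # approximately the same values
```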
   
**At each point, the gradient indicates the direction in which the value of the function increases the fastest.**
![A 2 variables function and its gradient represented in the bottom plane](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/3d-gradient-cos.svg/640px-3d-gradient-cos.svg.png?1589869317440)

As with the derivative, we can look for minima and maxima by searching for the points where the gradient is equal to zero:
$$\nabla f(\overrightarrow{x}) = \overrightarrow{0}$$
$$\nabla f(x_0, x_1, ..., x_n) = (0, 0, ..., 0)$$

**The above remarks for one-dimensional functions still apply.**
Moreover, in multiple dimensions, the gradient can be zero at a point that is neither a local minimum nor a local maximum: such a point is called a **saddle point**.

![saddle point](https://upload.wikimedia.org/wikipedia/commons/4/40/Saddle_point.png)
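A classic saddle is $f(x, y) = x^2 - y^2$ (a hypothetical example chosen for this sketch). Its gradient is zero at the origin, yet the function goes up in one direction and down in the other:

```python
def f(x, y):
    """Hypothetical saddle-shaped function: f(x, y) = x^2 - y^2."""
    return x**2 - y**2

def gradient(x, y):
    """Partial derivatives of f: (2x, -2y)."""
    return (2 * x, -2 * y)

gx, gy = gradient(0.0, 0.0)
print(gx == 0 and gy == 0)        # True: the gradient vanishes at the origin
print(f(0.1, 0.0) > f(0.0, 0.0))  # True: f increases when moving along x
print(f(0.0, 0.1) < f(0.0, 0.0))  # True: f decreases when moving along y
```

The origin is therefore neither a local minimum nor a local maximum.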



### 1.4. Integration

#### Explanation

Integration is simply the inverse operation of differentiation.   

Let's take the example from above: imagine that we know the speed of our vehicle as a function of time, but this time we would like to know the distance it has travelled $f(t)$.

So we start from :

$$f^{'}(t) = t(8 - t)$$


Shown graphically :

![](https://drive.google.com/uc?export=view&id=1XA4xjd09ta8q6sWtYRAaFw9lNDjhOn_Z)


The integral of the derivative of a function gives back the function itself, up to a constant $C$:

$$f(x) = \int_{0}^{x}f^{'}(t)dt + C$$


The symbol $\int_{0}^{x}$ indicates a sum from the start bound $0$ to the end bound $x$ with an infinitesimally small step. Here the constant is $C = f(0)$, the value of the function at the start bound.

Just as for derivatives, there are rules for calculating integrals by hand, but you can also use an integral calculator. Here is one of them:

[https://www.integral-calculator.com](https://www.integral-calculator.com/)

In the previous example, the total distance travelled between the 0th and 8th second is:

$$\int_{0}^{8}f^{'}(t)dt = f(8) - f(0) = \left(4(8)^{2} - \frac{1}{3}(8)^{3}\right) - \left(4(0)^{2} - \frac{1}{3}(0)^{3}\right) \approx 85.33$$

In total, we covered about 85.33 metres in 8 seconds.

#### Graphical representation

Graphically, calculating an integral is like calculating the area under the curve of the function being integrated, here the derivative $f'(t)$.


![](https://drive.google.com/uc?export=view&id=17RqrKUSMdEnB_SEQ_pbqU1AYncBqPN9e)


The whole coloured area corresponds to the 85.33 metres we drove with our car.
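This area can be approximated by cutting the interval $[0;8]$ into many thin rectangles and adding up their areas. The sketch below does this with a midpoint sum; the helper name `riemann_sum` and the number of rectangles are our own choices.

```python
def speed(t):
    """Speed of the car from the text: f'(t) = t(8 - t)."""
    return t * (8 - t)

def riemann_sum(g, a, b, n=100_000):
    """Sum the areas of n thin rectangles under g between a and b."""
    dt = (b - a) / n
    # Each rectangle has width dt and the height of g at its midpoint.
    return sum(g(a + (k + 0.5) * dt) for k in range(n)) * dt

distance = riemann_sum(speed, 0, 8)
print(round(distance, 2))  # 85.33
```

The result matches $f(8) - f(0) = \frac{256}{3}$, computed from the distance function.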

### 1.5. Convexity

#### Definition

One last thing you need to know about functions: the difference between a convex and a concave function.   
A function is **convex** on an interval $I$ if every line segment joining two points of its curve lies above the curve: the curve looks like a **valley**.

In other words, a convex function represents a "valley" while a concave function represents a "mountain".

_Example:_ $f(x) = x^2$ is convex on ${\Bbb R}$.

When a function is convex on $I$, any point $t_0$ of $I$ where $f'(t_0) = 0$ is a **global minimum** of $f$ on $I$. 

![](https://drive.google.com/uc?export=view&id=18PGFy9VGt9f-kL4Zi56RSo0K25k5FPXm)

Conversely, a function is **concave** on an interval $I$ if every line segment joining two points of its curve lies below the curve: the curve looks like a **mountain**.

_Example:_ $f(x) = -x^2$ is concave on ${\Bbb R}$.

When a function is concave on $I$, any point $t_0$ of $I$ where $f'(t_0) = 0$ is a **global maximum** of $f$ on $I$. 

![](https://drive.google.com/uc?export=view&id=12idjyAeHTaMCpgdgktvEfuTXUhjOOWpG)


#### How do you prove that a function is convex or concave?

To prove that a function is convex or concave on an interval $I$, we will have to calculate the **second derivative** of the function, i.e. **the derivative of the derivative**.

Mathematically:

$$f^{''}(x) =\frac{d^2 f}{dx^2}(x) = (f^{'}(x))^{'} = ((f(x))^{'})^{'}$$

- if $f''(x) \geq 0$ on $I$ then the function is **convex** on $I$.
- if $f''(x) \leq 0$ on $I$ then the function is **concave** on $I$.

The second derivative of a function describes how its derivative evolves over an interval $I$.   
In the example of the moving car, the second derivative corresponds to the acceleration of the vehicle.

At each point, the second derivative indicates how quickly the slope of the function changes when $x$ increases by an infinitesimal amount:
- if $f''(x)$ is positive, the slope $f'(x)$ is increasing: the curve bends upwards.
- if $f''(x)$ is negative, the slope $f'(x)$ is decreasing: the curve bends downwards.

**The second derivative of a function indicates the curvature of the curve of this function.**
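The sign of the second derivative can be estimated numerically. This sketch uses a finite-difference helper, `second_derivative` (our own name), on $x^2$, $-x^2$ and the distance function of the car, whose second derivative is $8 - 2t$.

```python
def second_derivative(g, x, h=1e-4):
    """Central difference approximation of g''(x)."""
    return (g(x + h) - 2 * g(x) + g(x - h)) / h**2

def convex(x):
    return x**2                 # second derivative: 2 (convex)

def concave(x):
    return -x**2                # second derivative: -2 (concave)

def distance(t):
    return 4 * t**2 - t**3 / 3  # second derivative: 8 - 2t

for x in [-2.0, 0.0, 3.0]:
    print(x, second_derivative(convex, x), second_derivative(concave, x))

print(second_derivative(distance, 1.0))  # close to 6: the car accelerates
print(second_derivative(distance, 6.0))  # close to -4: the car slows down
```

The distance function is therefore convex for $t \leq 4$ and concave for $t \geq 4$.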


## 2. Gradient descent

Why are we talking about calculus in this class? One of the reasons is that it is necessary to understand gradient descent. This is an optimization method widely used in Machine Learning algorithms.


### The Principle

How do you create a Machine Learning algorithm? One of the steps is to train our model on data we already have. Have you ever wondered how the algorithm trains?  

The answer is via a **cost function**. Initially, the algorithm makes predictions with random parameter values, and the cost function measures the resulting error: the more the algorithm is wrong, the higher the "cost" will be.

One of the most used **cost functions** (or **losses**) is the **mean squared error (MSE)**:

$$MSE(y, \hat{y}) = \frac{1}{N}\sum_{j=0}^{N-1}(y_j - \hat{y}_j)^2$$

where $\hat{y} = \hat{g}(\overrightarrow{w}, X)$ are the model predictions on data $X$, $y$ are the ground truth labels and $N$ is the number of samples.
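As an illustration, the MSE takes only a few lines of Python. The labels and predictions below are hypothetical values chosen for the example.

```python
def mse(y_true, y_pred):
    """Mean squared error between labels and predictions of equal length."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

# Hypothetical ground truth labels and model predictions.
y = [3.0, 5.0, 7.0]
y_hat = [2.0, 5.0, 9.0]

print(mse(y, y_hat))  # (1 + 0 + 4) / 3
print(mse(y, y))      # 0.0: perfect predictions have no cost
```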

The goal of the algorithm is to find the parameters $\overrightarrow{w_{min}}$ that minimize this function:

$$\overrightarrow{w_{min}} = \underset{\overrightarrow{w} \in \Bbb R^p}{argmin} \frac{1}{N} \sum_{j=0}^{N-1} (y_j - \hat{g}(\overrightarrow{w}, X_j))^2=\underset{\overrightarrow{w} \in \Bbb R^p}{argmin}\ f(\overrightarrow{w}, X, y)$$

**Analytical solution**:
- find the points that satisfy $\nabla f(\overrightarrow{w}, X, y) = \overrightarrow{0}$
- study the convexity of the loss function in the neighborhood of these points
- study the behavior of the loss function at the boundaries of its domain
- compare the value of the loss function for each of these points

=> Uncertain results   
=> Requires strong mathematical knowledge   
=> Hard to automate on a computer in the general case   


**Gradient descent**:

1. Initialize (randomly) the parameters $\overrightarrow{w_0}$
2. Repeat the parameter update: $$\overrightarrow{w_{t+1}} = \overrightarrow{w_t} - \eta \times \nabla f(\overrightarrow{w_t})$$
where $\eta$ is the learning rate: the size of the step the algorithm takes, in the direction opposite to the gradient, to update the parameters
3. Stop when one of the following conditions is met:   
   - convergence: the difference between $\overrightarrow{w_{t+1}}$ and $\overrightarrow{w_t}$ is very small ($||\nabla f(\overrightarrow{w_t})|| \approx 0$)
   - the maximum number of iterations has been reached
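The steps above can be sketched in a few lines for a function of a single parameter. This is a minimal illustration, not a production optimizer: the loss $f(w) = (w - 3)^2$, the starting point and the default values of `eta`, `tol` and `max_iter` are all assumptions made for the example.

```python
def gradient_descent(grad, w0, eta=0.1, tol=1e-8, max_iter=1000):
    """Minimize a one-parameter function, given its derivative `grad`."""
    w = w0
    n_iter = 0
    while n_iter < max_iter:       # stop at the maximum number of iterations
        update = -eta * grad(w)    # step against the slope
        w = w + update             # w_{t+1} = w_t - eta * f'(w_t)
        n_iter += 1
        if abs(update) < tol:      # w barely moves any more: converged
            break
    return w, n_iter

# Hypothetical convex loss f(w) = (w - 3)^2, whose derivative is 2(w - 3).
w_min, n_iter = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_min, n_iter)  # w_min is very close to 3
```

With a learning rate that is too large (for this loss, above 1), the updates would overshoot the minimum and diverge, which is why $\eta$ has to be chosen with care.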


![Gradient descent](https://thumbs.gfycat.com/AngryInconsequentialDiplodocus-size_restricted.gif)

The objective is still to find the minima of the loss function, i.e. points that satisfy the equation:

$$\nabla f(\overrightarrow{w}, X, y) = \overrightarrow{0}$$

But solving this equation directly can be very costly in terms of computing power. Gradient descent instead approaches a solution step by step, by trial and error.

In other words, for a parameter $w_i$, we try to get as close as possible to $f'(w_i) = 0$ by trial and error.   
If $f'(w_i) < 0$ ($f$ is decreasing), $w_i$ must be increased (the update $- \eta \times f'(w_i)$ is positive).   
If $f'(w_i) > 0$ ($f$ is increasing), $w_i$ must be decreased (the update $- \eta \times f'(w_i)$ is negative).   
This iterative update is much easier for a computer to carry out than solving the equation exactly.

Moreover, if the slope is steep, the current point is probably far from the nearest local minimum, so we want the algorithm to move quickly. On the other hand, if the slope is gentle, a local minimum may be close, so we must use a small step to avoid overshooting it. Because the update is proportional to the gradient, the step size adapts in this way automatically.
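To tie everything together, here is a sketch of gradient descent minimizing the MSE of a very simple model, $\hat{y} = w \times x$, with a single parameter $w$. The data, the learning rate and the number of iterations are hypothetical values chosen so that the example converges.

```python
# Hypothetical data generated by y = 2x, so the best parameter is w = 2.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

def mse_gradient(w):
    """Derivative of the MSE of the model y_hat = w * x with respect to w."""
    n = len(x)
    return (2 / n) * sum((w * xj - yj) * xj for xj, yj in zip(x, y))

eta = 0.05     # learning rate
w = 0.0        # starting value of the parameter
updates = []
for _ in range(20):
    update = -eta * mse_gradient(w)  # positive while the slope is negative
    w += update
    updates.append(update)

print(w)            # very close to 2
print(updates[:4])  # each update is smaller than the previous one
```

The first updates are large because the slope is steep far from the minimum, and they shrink as $w$ approaches 2.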

## Resources

Gradient descent - [https://bit.ly/2PNlYrU](https://bit.ly/2PNlYrU)

Gradient descent / Lyon University - [https://bit.ly/2PNlY3E](http://eric.univ-lyon2.fr/~ricco/cours/slides/gradient_descent.pdf)

What is a derivative - [https://bit.ly/2PN21EZAJ](https://www.youtube.com/watch?v=9vKqVkMQHKk)