## Part 1: Introduction to Deep Learning Mathematical Basics

### 1. Mathematical Concepts

Below we will cover the basic mathematical concepts of deep learning. I've tried to cover as little as possible, just enough to get you familiar with the notation and rules involved.


**NOTE: You do not have to learn all of this math in order to be successful at deep learning. I encourage you to skip to the next notebook and come back here later. The math is useful for understanding gradient descent and backpropagation. Also, if you take the time to understand it, you will feel comfortable enough to read deep learning posts and possibly even academic papers for new ideas.**

#### 1.1 Vectors and Matrices

A _vector_ is technically defined as something that has a magnitude and direction (in contrast to a _scalar_, which is just a number and only has magnitude). For the purposes of this notebook, a vector will represent a vertical _array_ of items, with dimensions m x 1 (or m rows and 1 column):

$\boldsymbol{x} = \begin{bmatrix} 
x_{1} \\ 
x_{2} \\ 
... \\
x_{m} 
\end{bmatrix}$

Notice we said "items." An item could be a number, a variable, a function, anything... For now, think of these as scalars (eg x = [1,2,3]). **Just remember that a vector is just a data structure and it can contain whatever we want inside of it to suit our particular needs. You can have a vector of scalars, vector of variables, vector of functions, and even vector of vectors!**

Note that the vector notation we use is a bolded $\boldsymbol{x}$ (you may also see $\vec{v}$ or $\hat{v}$ but we will avoid those forms).  In contrast, a non-bolded x will represent a scalar (eg x = 6). If we wanted a horizontal array, we would write it as:

$\boldsymbol{x}^{T} = \begin{bmatrix} 
x_{1} & x_{2} & ... & x_{m}
\end{bmatrix}$

A capitalized bolded $\boldsymbol{X}$ represents a _matrix_, which in our case will be used to represent a two dimensional m x n object, with m rows and n columns:

$\boldsymbol{X} = \begin{bmatrix} 
x_{1,1} & x_{1,2} & ... & x_{1,n} \\ 
x_{2,1} & x_{2,2} & ... & x_{2,n} \\ 
... \\
x_{m,1} & x_{m,2} & ... & x_{m,n} 
\end{bmatrix}$
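
To make the shapes concrete, here is a minimal sketch using plain Python lists (chosen only so it runs without extra libraries). The helper names `shape` and `transpose` are made up for this illustration. Note that Python indexes from 0, so the element written $x_{2,3}$ above is `X[1][2]` in code.

```python
# A column vector with m = 3 rows, stored as a list of one-element rows (3 x 1).
x = [[1], [2], [3]]


def shape(a):
    """Return (rows, columns) of a list-of-rows array."""
    return (len(a), len(a[0]))


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


x_T = transpose(x)  # the horizontal (1 x 3) version of x

# A matrix with m = 2 rows and n = 3 columns.
X = [[1, 2, 3],
     [4, 5, 6]]

print(shape(x), shape(x_T), shape(X))
print(X[1][2])  # row 2, column 3 in the 1-based math notation
```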


#### 1.2 Derivatives


Before we get to derivatives of vectors and matrices, let's talk about single variable derivative rules. If you feel like you need a review of the basic single variable derivative rules, check out the videos at:
https://www.khanacademy.org/math/old-ap-calculus-ab/ab-derivative-rules

You can also see https://www.khanacademy.org/math/differential-calculus/dc-diff-intro for more rules. 

Below are the basic  derivative rules if you just need a reference:
<img src="scalar_derivative_rules.jpg" width="600" height="480" />


The below explains how to think of the derivative d/dx. It seems scary but it's no different than + or * because it is just another operator:
<img src="d_dx.jpg" width="600" height="480" />
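
One way to see the "just another operator" idea in code is a function that takes a function and returns a new function. The sketch below is only a numerical approximation (a central difference with a hypothetical step size `h`), not the symbolic rules from the table, and the name `derivative` is made up for this example.

```python
def derivative(f, h=1e-5):
    """Return a new function that approximates df/dx with a central difference."""
    def df(x):
        return (f(x + h) - f(x - h)) / (2 * h)
    return df


def square(x):
    return x ** 2


d_square = derivative(square)  # apply the "operator" to the function
approx = d_square(3.0)         # numerical estimate of the slope at x = 3
exact = 2 * 3.0                # power rule: d/dx x^2 = 2x

print(approx, exact)
```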


#### 1.3 Gradient

In the above rules, the function was a function of a single variable x. How do we compute derivatives if there are multiple variables (say x, y)? Well, it turns out we can only differentiate with respect to one variable at a time, which gives a _partial derivative_. The notation used for partial derivatives will be the following:
<p>
$\frac{\partial f(x,y)}{\partial x}$ - partial derivative of f(x,y) with respect to x
<p>
$\frac{\partial f(x,y)}{\partial y}$ - partial derivative of f(x,y) with respect to y
<p>


Using the example in [9], let's say $f(x,y) = 3x^{2}y$

To get $\frac{\partial f(x,y)}{\partial x}$ , we have to treat y as a constant and use the single variable derivative rules:


$\frac{\partial f(x,y)}{\partial x} = 6yx$

Similarly, to get $\frac{\partial f(x,y)}{\partial y}$ , we have to treat x as a constant and use the single variable derivative rules:


$\frac{\partial f(x,y)}{\partial y} = 3x^{2}$

When you see the word _gradient_, remember that the gradient is a vector. Specifically, it's a vector of the partial derivatives of a function ("vector of partials" for short).

$\Large\nabla f(x,y) = \begin{bmatrix}
\frac{\partial f(x,y)}{\partial x}, \frac{\partial f(x,y)}{\partial y}
\end{bmatrix}$
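
As a quick sanity check of the two partials above, the sketch below evaluates the gradient of $f(x,y) = 3x^{2}y$ at the point (2, 5) and compares it with a finite-difference estimate. The point, the step size `h`, and the function names are arbitrary choices for illustration.

```python
def f(x, y):
    return 3 * x**2 * y


def grad_f(x, y):
    """Analytic gradient: [df/dx, df/dy] from the partials derived above."""
    return [6 * y * x, 3 * x**2]


def numerical_grad_f(x, y, h=1e-5):
    """Estimate each partial by nudging one variable and holding the other constant."""
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # y held constant
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)  # x held constant
    return [df_dx, df_dy]


analytic = grad_f(2.0, 5.0)            # [6*5*2, 3*2^2] = [60.0, 12.0]
numeric = numerical_grad_f(2.0, 5.0)

print(analytic, numeric)
```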

**What if we have multiple functions?**

Let's say you also had another function $g(x,y) = 2x + y^{8}$ in addition to f(x,y) defined above.
Then you stack the gradients in a matrix called the _Jacobian_ (we use the notation J for the Jacobian):

<img src="simple_jacobian.jpg" width="600" height="480" />

That's a sample of matrix calculus!
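
Written as code, stacking the two gradients could look like the sketch below. It assumes the layout described above (one row per function, one column per variable); the partials of g follow from the scalar rules, and the evaluation point (1, 2) is an arbitrary choice.

```python
def jacobian(x, y):
    """Analytic Jacobian of [f, g]: one row per function, one column per variable."""
    return [
        [6 * y * x, 3 * x**2],  # gradient of f(x, y) = 3x^2 y
        [2, 8 * y**7],          # gradient of g(x, y) = 2x + y^8
    ]


J = jacobian(1, 2)
print(J)
```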

**What if we have a lot of parameters in each of our functions?**

In the case of two parameters, we can manually write out the gradient for the function and consequently the Jacobian of multiple functions. However, we need a way to generalize this to a lot of parameters because neural networks often have a lot of parameters (eg weights). 

Let's say you had a function with many parameters $f(a,b,c...)$. We can rewrite it as $f(\boldsymbol{x})$ (remember the bold symbol means a vector) where

$\boldsymbol{x} = \begin{bmatrix} 
a \\ 
b \\ 
c \\ 
... \\ 
\end{bmatrix}$

which can be rewritten as:

$\boldsymbol{x} = \begin{bmatrix} 
x_{1} \\ 
x_{2} \\ 
x_{3} \\ 
... \\ 
\end{bmatrix}$

NOTE: An initially confusing part may be that in the previous section, each of the elements in the vector (x1, x2, x3) represented a scalar (or number), whereas now each of these elements is actually a variable. 

Now, we can write the gradient of $f(\boldsymbol{x})$, a scalar-valued function that takes a vector as input, as:

$\Large\nabla f(\boldsymbol{x}) = \begin{bmatrix}
\frac{\partial f(\boldsymbol{x})}{\partial x_{1}}, \frac{\partial f(\boldsymbol{x})}{\partial x_{2}}, \frac{\partial f(\boldsymbol{x})}{\partial x_{3}}, ...
\end{bmatrix}$

**What if we have multiple functions, each with lots of parameters?**

If we have multiple functions, $f_{1}$, $f_{2}$, ..., then we can use the Jacobian to get all the gradients! The Jacobian is just a stack of gradients.

We can define a vector of functions $\boldsymbol{y}$ as:

$\boldsymbol{y} = \begin{bmatrix} 
f_{1}(\boldsymbol{x}) \\ 
f_{2}(\boldsymbol{x}) \\ 
f_{3}(\boldsymbol{x}) \\ 
... \\ 
\end{bmatrix}$

$\Large
J = \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}
$

The Jacobian matrix is a stack of m × n possible partial derivatives where m = number of functions in the vector $\boldsymbol{y}$ and n = number of variables in the vector $\boldsymbol{x}$.

<img src="complex_jacobian.jpg" width="600" height="480" />
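
To check the m × n shape, here is a rough sketch that approximates the Jacobian numerically for any list of functions. The two example functions, the evaluation point, and the name `numerical_jacobian` are made up for illustration; this uses finite differences, not the analytic partials.

```python
def numerical_jacobian(funcs, x, h=1e-5):
    """Approximate the m x n Jacobian of m functions at the point x (a list of n floats)."""
    J = []
    for f in funcs:                 # one row (gradient) per function
        row = []
        for j in range(len(x)):     # one column per variable
            x_plus = list(x)
            x_plus[j] += h
            x_minus = list(x)
            x_minus[j] -= h
            row.append((f(x_plus) - f(x_minus)) / (2 * h))
        J.append(row)
    return J


# m = 2 functions of n = 3 variables.
funcs = [
    lambda x: x[0] + x[1] + x[2],   # f1: gradient is [1, 1, 1]
    lambda x: x[0] * x[1] * x[2],   # f2: gradient is [x2*x3, x1*x3, x1*x2]
]
point = [1.0, 2.0, 3.0]

J = numerical_jacobian(funcs, point)
print(len(J), len(J[0]))  # m rows, n columns
```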

**Can we summarize all this?**

<p>
For a function with a single variable $f(x)$, we use the derivative rules.
<p>
For a function with two variables $f(x, y)$, we use the partial derivative rules to make a gradient (vector of all the partial derivatives)
<p>
For two functions with two variables $f(x,y)$ and $g(x,y)$, we use the Jacobian to stack the two gradients.
<p>
For a function with n variables $f(\boldsymbol{x})$, we use the partial derivative rules to make a gradient (vector of all the partial derivatives).
<p>
For m functions with n variables $\boldsymbol{y}(\boldsymbol{x})$, we use the Jacobian to stack the m gradients (each of which has n partials).

#### 1.4 Chain Rule

Good news! We don't have to learn another rule. We just have to know that such a thing as the "chain rule" exists (you can look it up if you'd like using [9] and [10] or the Khan Academy videos above).  

The chain rule is what allows us to get the derivative of a function wrapping a single variable function:
<p>
$ y = f(g(x))$<p>
$ u = g(x)$<p>
$ y = f(u)$<p>

and the derivatives of a function wrapping a vector of single variable functions $\boldsymbol{g} = (u_{1}, u_{2}, ...)$
<p>
$ y = f(\boldsymbol{g}) $


and the derivatives of a vector of functions wrapping a vector of single variable functions:
<p>
$ \boldsymbol{y} = \boldsymbol{f}(\boldsymbol{g}(x))$

which, written out element by element, is:
<p>
$\boldsymbol{y} = \begin{bmatrix} 
f_{1}(\boldsymbol{g}(x)) \\ 
f_{2}(\boldsymbol{g}(x)) \\ 
f_{3}(\boldsymbol{g}(x)) \\ 
... \\ 
\end{bmatrix}$


and the partial derivatives of a vector of functions wrapping a vector of functions that each take a vector as input (notice the bolded $\boldsymbol{x}$):
<p>
$\boldsymbol{y} = \begin{bmatrix} 
f_{1}(\boldsymbol{g}(\boldsymbol{x})) \\ 
f_{2}(\boldsymbol{g}(\boldsymbol{x})) \\ 
f_{3}(\boldsymbol{g}(\boldsymbol{x})) \\ 
... \\ 
\end{bmatrix}$
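
For the simplest case in the list, $y = f(g(x))$, the standard single variable chain rule says $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$ with $u = g(x)$. The sketch below checks this on a made-up example, $f(u) = \sin(u)$ and $g(x) = x^{2}$, against a finite-difference estimate. The functions, the point `x0`, and the step size `h` are assumptions for illustration only.

```python
import math


def g(x):
    return x ** 2          # inner function, u = g(x)


def f(u):
    return math.sin(u)     # outer function, y = f(u)


def dy_dx(x):
    """Chain rule: multiply the outer derivative by the inner derivative."""
    u = g(x)
    dy_du = math.cos(u)    # derivative of sin(u) with respect to u
    du_dx = 2 * x          # derivative of x^2 with respect to x
    return dy_du * du_dx


x0 = 1.5
h = 1e-5
numeric = (f(g(x0 + h)) - f(g(x0 - h))) / (2 * h)  # finite-difference check
analytic = dy_dx(x0)

print(analytic, numeric)
```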

**What's the punchline?**

With neural networks, the expressions sometimes do get this complicated. BUT not to worry! Using a combination of the chain rule, the scalar derivative rules, and the vector addition rules (not discussed here but mentioned in [9]), **deep learning libraries do all the differentiation for us**. 
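
To give a feel for how a program can differentiate for us, here is a toy sketch that carries a derivative alongside every value and applies the sum, product, and chain rules as it goes. This is only an illustration: the `Dual` class and the example function are invented here, and real deep learning libraries use more sophisticated machinery (typically reverse-mode automatic differentiation) rather than this exact approach.

```python
import math


class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: (a + b)' = a' + b'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (a * b)' = a' * b + a * b'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def sin(d):
    # Chain rule: sin(u)' = cos(u) * u'
    return Dual(math.sin(d.value), math.cos(d.value) * d.deriv)


def f(x):
    return 3 * x * x + sin(x)   # analytic derivative: 6x + cos(x)


x = Dual(2.0, 1.0)   # deriv=1.0 marks x as the variable we differentiate with respect to
y = f(x)

print(y.value, y.deriv)
```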
<p>

If you follow the breakdown of the mathematics in [9], we can decompose these derivatives using something known as the _vector chain rule_, which turns the derivative of a nested function into a multiplication of Jacobian matrices. In many common cases (for example, element-wise operations), each matrix is mostly 0s with the partial derivatives on the diagonal, which you can see below:

<img src="vector_chain_rule.jpg" width="600" height="480" />
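
Here is a small sketch of the diagonal case, assuming both functions act element-wise (here $u_{i} = x_{i}^{2}$ and $y_{i} = \sin(u_{i})$, chosen only for illustration). Each Jacobian is then diagonal, and so is their product. The helpers `diag` and `matmul` are hypothetical names written out by hand so the example has no dependencies.

```python
import math


def diag(values):
    """Build a square matrix with `values` on the diagonal and 0s elsewhere."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(A, B):
    """Plain matrix multiplication for lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


x = [0.5, 1.0, 2.0]
u = [xi ** 2 for xi in x]                  # inner function applied element-wise

J_f = diag([math.cos(ui) for ui in u])     # dy/du: derivative of sin(u_i)
J_g = diag([2 * xi for xi in x])           # du/dx: derivative of x_i^2

J = matmul(J_f, J_g)                       # vector chain rule: dy/dx = dy/du * du/dx

for row in J:
    print(row)
```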


I think it's appropriate to insert the following gif of the mind being blown [11]:

<img src="mind_blown.gif"/>

### 3. Gradient Descent



### 4. Backpropagation



### 5. References

<pre>
  [1] Fast.ai (http://course.fast.ai/)
  [2] CS231N (http://cs231n.github.io/)
  [3] CS224D (http://cs224d.stanford.edu/syllabus.html)
  [4] Hands on Machine Learning (https://github.com/ageron/handson-ml)
  [5] Deep learning with Python Notebooks (https://github.com/fchollet/deep-learning-with-python-notebooks)
  [6] Deep learning by Goodfellow et al. (http://www.deeplearningbook.org/)
  [7] Neural networks online book (http://neuralnetworksanddeeplearning.com/)
  [8] Vector Norms (https://machinelearningmastery.com/vector-norms-machine-learning/)
  [9] The Matrix Calculus You Need For Deep Learning (https://arxiv.org/pdf/1802.01528.pdf)
  [10] Practical Guide to Matrix Calculus for Deep Learning (http://www.psi.toronto.edu/~andrew/papers/matrix_calculus_for_learning.pdf)
  [11] Mind-blown GIFs on Giphy (https://giphy.com/explore/mind-blown)
</pre>

