## Part 1: Introduction to Deep Learning Concepts (and Tensorflow)

### 1. Mathematical Basics

Below we cover the basic mathematical concepts behind deep learning. I've tried to keep this as minimal as possible while still getting you familiar with the notation and rules involved.


**NOTE: You do not have to learn all of this math in order to be successful at deep learning. I encourage you to skip to section 2 and come back to this once you reach section 3. The math is useful for understanding gradient descent and backpropagation. Also, if you take the time to understand it, you will feel comfortable enough to read deep learning posts and possibly even academic papers for new ideas.**

#### 1.1 Vectors and Matrices

A _vector_ is technically defined as something that has a magnitude and direction (in contrast to a _scalar_, which is just a number and only has magnitude). For the purposes of this notebook, a vector will represent a vertical _array_ of items, with dimensions n x 1 (or n rows and 1 column):

$\boldsymbol{x} = \begin{bmatrix} 
x_{1} \\ 
x_{2} \\ 
... \\
x_{n} 
\end{bmatrix}$

Notice we said "items." An item could be a number, a variable, a function, anything... For now, think of these as scalars (e.g. $\boldsymbol{x} = [1,2,3]$). **Just remember that a vector is just a data structure and it can contain whatever we want inside of it to suit our particular needs. You can have a vector of scalars, a vector of variables, a vector of functions, and even a vector of vectors!**

Note that the vector notation we use is a bolded $\boldsymbol{x}$ (you may also see $\vec{v}$ or $\hat{v}$ but we will avoid those forms). In contrast, a non-bolded x will represent a scalar (e.g. x = 6). If we wanted a horizontal array, we would write it as:

$\boldsymbol{x}^{T} = \begin{bmatrix} 
x_{1} & x_{2} & \dots & x_{n}
\end{bmatrix}$

A capitalized bolded $\boldsymbol{X}$ represents a _matrix_, which in our case will be used to represent a two dimensional object, with n rows and m columns:

$\boldsymbol{X} = \begin{bmatrix} 
x_{1,1} & x_{1,2} & \dots & x_{1,m} \\ 
x_{2,1} & x_{2,2} & \dots & x_{2,m} \\ 
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \dots & x_{n,m} 
\end{bmatrix}$
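As a minimal sketch (assuming NumPy, which we briefly used in the last lesson; the values and variable names are just for illustration), the shapes described above map onto arrays like this:

```python
import numpy as np

# A column vector x with n = 3 rows and 1 column
x = np.array([[1], [2], [3]])
print(x.shape)      # (3, 1)

# Its transpose x^T is a row vector with 1 row and n = 3 columns
print(x.T.shape)    # (1, 3)

# A matrix X with n = 2 rows and m = 3 columns
X = np.array([[1, 2, 3],
              [4, 5, 6]])
print(X.shape)      # (2, 3)

# Element x_{1,2}: NumPy indexes from 0, so row 1, column 2 is X[0, 1]
print(X[0, 1])      # 2
```

Note that the math notation counts rows and columns from 1, while NumPy counts from 0.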


#### 1.2 Derivatives


Before we get to derivatives of vectors and matrices, let's talk about single variable derivative rules. If you feel like you need a review of the basic single variable derivative rules, check out the videos at:
https://www.khanacademy.org/math/old-ap-calculus-ab/ab-derivative-rules

You can also see https://www.khanacademy.org/math/differential-calculus/dc-diff-intro for more rules. 

Below are the basic derivative rules if you just need a reference:
<img src="scalar_derivative_rules.jpg" width="600" height="480" />


The image below explains how to think of the derivative operator $\frac{d}{dx}$. It seems scary, but it's no different than + or * because it is just another operator:
<img src="d_dx.jpg" width="600" height="480" />


#### 1.3 Gradient

In the above rules, the function took a single parameter, x. How do we compute derivatives if there are multiple parameters (say x and y)? Well, it turns out we can only compute a _partial derivative_ with respect to one variable at a time. The notation used for partial derivatives will be the following:
<p>
$\frac{\partial f(x,y)}{\partial x}$ - partial derivative of f(x,y) with respect to x
<p>
$\frac{\partial f(x,y)}{\partial y}$ - partial derivative of f(x,y) with respect to y
<p>


Using the example in [9], let's say $f(x,y) = 3x^{2}y$

To get $\frac{\partial f(x,y)}{\partial x}$ , we have to treat y as a constant and use the single variable derivative rules:


$\frac{\partial f(x,y)}{\partial x} = 6yx$

Similarly, to get $\frac{\partial f(x,y)}{\partial y}$ , we have to treat x as a constant and use the single variable derivative rules:


$\frac{\partial f(x,y)}{\partial y} = 3x^{2}$

When you see the word _gradient_, remember that the gradient is a vector. Specifically, it's a vector of the partial derivatives of a function ("vector of partials" for short).

$\Large\nabla f(x,y) = \begin{bmatrix}
\frac{\partial f(x,y)}{\partial x}, \frac{\partial f(x,y)}{\partial y}
\end{bmatrix}$
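If you'd like to convince yourself that these partial derivatives are right, here is a minimal sketch that estimates them numerically with central differences. The helper names `partial_x` and `partial_y`, the step size `h`, and the point (2, 5) are all assumptions made for illustration:

```python
def f(x, y):
    """The example function f(x, y) = 3 * x^2 * y."""
    return 3 * x**2 * y

def partial_x(func, x, y, h=1e-6):
    """Estimate the partial derivative with respect to x, holding y constant."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)

def partial_y(func, x, y, h=1e-6):
    """Estimate the partial derivative with respect to y, holding x constant."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

x0, y0 = 2.0, 5.0

# The gradient is the vector of partials, evaluated at (x0, y0)
numerical_gradient = [partial_x(f, x0, y0), partial_y(f, x0, y0)]

# The formulas derived above: 6yx and 3x^2
analytic_gradient = [6 * y0 * x0, 3 * x0**2]   # [60.0, 12.0]

print(numerical_gradient)
print(analytic_gradient)
```

The two printed vectors should agree to several decimal places.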

**What if we have multiple functions?**

Let's say you also had another function $g(x,y) = 2x + y^{8}$ in addition to f(x,y) defined above.
Then you stack the gradients in a matrix called the _Jacobian_ (we use the notation J for the Jacobian):

<img src="simple_jacobian.jpg" width="600" height="480" />
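In case the image does not render, the same Jacobian can be written out by hand. This is a worked sketch that reuses the partial derivatives of $f$ computed above and applies the single variable rules to $g$:

$J = \begin{bmatrix}
\nabla f(x,y) \\
\nabla g(x,y)
\end{bmatrix} = \begin{bmatrix}
\frac{\partial f(x,y)}{\partial x} & \frac{\partial f(x,y)}{\partial y} \\
\frac{\partial g(x,y)}{\partial x} & \frac{\partial g(x,y)}{\partial y}
\end{bmatrix} = \begin{bmatrix}
6yx & 3x^{2} \\
2 & 8y^{7}
\end{bmatrix}$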

That's a sample of matrix calculus!

**What if we have a lot of parameters in each of our functions?**

In the case of two parameters, we can manually write out the gradient for the function and consequently the Jacobian of multiple functions. However, we need a way to generalize this to a lot of parameters because neural networks often have a lot of parameters (eg weights). 

Let's say you had a function with many parameters $f(a,b,c...)$. We can rewrite it as $f(\boldsymbol{x})$ (remember the bold symbol means a vector) where

$\boldsymbol{x} = \begin{bmatrix} 
a \\ 
b \\ 
c \\ 
... \\ 
\end{bmatrix}$

which can be rewritten as:

$\boldsymbol{x} = \begin{bmatrix} 
x_{1} \\ 
x_{2} \\ 
x_{3} \\ 
... \\ 
\end{bmatrix}$

NOTE: An initially confusing part may be that in section 1.1, each of the elements in the vector ($x_{1}$, $x_{2}$, $x_{3}$) represented a scalar (or number), whereas now each of these elements is actually a variable.

Now, we can write the gradient of the function $f(\boldsymbol{x})$, which takes a vector as input and returns a scalar, as:

$\Large\nabla f(\boldsymbol{x}) = \begin{bmatrix}
\frac{\partial f(\boldsymbol{x})}{\partial x_{1}}, \frac{\partial f(\boldsymbol{x})}{\partial x_{2}}, \frac{\partial f(\boldsymbol{x})}{\partial x_{3}}, ...
\end{bmatrix}$
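As a small worked example (the function here is made up for illustration and is not taken from [9]), let $f(\boldsymbol{x}) = x_{1}^{2} + 3x_{2}x_{3}$. Treating the other elements as constants for each partial derivative gives:

$\Large\nabla f(\boldsymbol{x}) = \begin{bmatrix}
2x_{1}, 3x_{3}, 3x_{2}
\end{bmatrix}$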

**What if we have multiple functions, each with lots of parameters?**

If we have multiple functions, $f_{1}$, $f_{2}$, ..., then we can use the Jacobian to get all the gradients! The Jacobian is just a stack of gradients.

We can define a vector of functions $\boldsymbol{y}$ as:

$\boldsymbol{y} = \begin{bmatrix} 
f_{1}(\boldsymbol{x}) \\ 
f_{2}(\boldsymbol{x}) \\ 
f_{3}(\boldsymbol{x}) \\ 
... \\ 
\end{bmatrix}$

$\Large
J = \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}
$

The Jacobian matrix is a stack of $m \times n$ possible partial derivatives where m = number of functions in the vector $\boldsymbol{y}$ and n = number of variables in the vector $\boldsymbol{x}$.

<img src="complex_jacobian.jpg" width="600" height="480" />
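To see the m x n shape in practice, here is a rough sketch that estimates a Jacobian numerically. The helper `numerical_jacobian` is hypothetical (it is not part of any library), and the evaluation point is chosen only for illustration:

```python
def numerical_jacobian(functions, x, h=1e-6):
    """Estimate the m x n Jacobian of m functions of an n-element vector x.

    Row i is the gradient of functions[i]; column j holds the partial
    derivatives with respect to x[j], estimated with central differences.
    """
    jacobian = []
    for func in functions:
        row = []
        for j in range(len(x)):
            x_plus = list(x)
            x_minus = list(x)
            x_plus[j] += h
            x_minus[j] -= h
            row.append((func(x_plus) - func(x_minus)) / (2 * h))
        jacobian.append(row)
    return jacobian

# The two functions from earlier, rewritten to take a vector v = [x, y]
def f(v):
    return 3 * v[0]**2 * v[1]

def g(v):
    return 2 * v[0] + v[1]**8

# m = 2 functions and n = 2 variables, evaluated at x = 2, y = 1
J = numerical_jacobian([f, g], [2.0, 1.0])
for row in J:
    print([round(value, 4) for value in row])
```

At this point the analytic values are 6yx = 12 and 3x^2 = 12 for the first row, and 2 and 8y^7 = 8 for the second row, so the printed rows should be close to those.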

**Can we summarize all this?**

<p>
For a function with a single variable $f(x)$, we use the derivative rules.
<p>
For a function with two variables $f(x, y)$, we use the partial derivative rules to make a gradient (vector of all the partial derivatives)
<p>
For two functions with two variables $f(x,y)$ and $g(x,y)$, we use the Jacobian to stack the two gradients.
<p>
For a function with n variables $f(\boldsymbol{x})$, we use the partial derivative rules to make a gradient (vector of all the partial derivatives).
<p>
For m functions with n variables $\boldsymbol{y}(\boldsymbol{x})$, we use the Jacobian to stack the m gradients (each of which has n partials).

#### 1.4 Chain Rule

Good news! We don't have to learn another rule. We just have to know that such a thing as the "chain rule" exists (you can look it up if you'd like using [9] and [10] or the Khan Academy videos above).

The chain rule is what allows us to get the derivative of a function wrapping a single variable function:
<p>
$ y = f(g(x))$<p>
$ u = g(x)$<p>
$ y = f(u)$<p>

and the derivatives of a function wrapping a vector of single variable functions $\boldsymbol{g} = (u_{1}, u_{2}, ...)$
<p>
$ y = f(\boldsymbol{g}) $


and the derivatives of a vector of functions wrapping a vector of single variable functions:
<p>
$ \boldsymbol{y} = \boldsymbol{f}(\boldsymbol{g}(x))$

and the derivatives of a vector of functions, written out element by element, wrapping a vector of single variable functions:
<p>
$\boldsymbol{y} = \begin{bmatrix} 
f_{1}(\boldsymbol{g}(x)) \\ 
f_{2}(\boldsymbol{g}(x)) \\ 
f_{3}(\boldsymbol{g}(x)) \\ 
... \\ 
\end{bmatrix}$


and the partial derivatives of a vector of functions wrapping a vector of functions that each take a vector as input (notice the bolded x):
<p>
$\boldsymbol{y} = \begin{bmatrix} 
f_{1}(\boldsymbol{g}(\boldsymbol{x})) \\ 
f_{2}(\boldsymbol{g}(\boldsymbol{x})) \\ 
f_{3}(\boldsymbol{g}(\boldsymbol{x})) \\ 
... \\ 
\end{bmatrix}$
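For the simplest case above, $y = f(u)$ with $u = g(x)$, the chain rule says $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$. Here is a minimal sketch that checks this numerically. The choice of $f(u) = \sin(u)$ and $g(x) = x^{2}$ is just an assumption for illustration:

```python
import math

def g(x):
    return x**2          # inner function, u = g(x)

def f(u):
    return math.sin(u)   # outer function, y = f(u)

def chain_rule_derivative(x):
    """dy/dx computed as (dy/du) * (du/dx)."""
    u = g(x)
    dy_du = math.cos(u)  # derivative of sin(u) with respect to u
    du_dx = 2 * x        # derivative of x^2 with respect to x
    return dy_du * du_dx

def numerical_derivative(x, h=1e-6):
    """Central-difference estimate of the derivative of f(g(x))."""
    return (f(g(x + h)) - f(g(x - h))) / (2 * h)

x0 = 1.5
print(chain_rule_derivative(x0))
print(numerical_derivative(x0))
```

The two printed values should match to several decimal places. The vector cases above follow the same pattern, with the scalar derivatives replaced by gradients and Jacobians.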

**What's the punchline?**

With neural networks, the expressions sometimes work out to be this complicated. BUT not to worry! Using a combination of the chain rule, the scalar derivative rules, and the vector addition rules (not discussed here but covered in [9]), **deep learning libraries do all the differentiation for us**. 
<p>

If you follow the breakdown of the mathematics in [9], we can decompose these derivatives using something known as the _vector chain rule_, which reduces to a multiplication of Jacobian matrices. When each function depends only on its matching element of the input (as with element-wise operations), each matrix is mostly 0s with the partial derivatives on the diagonal, which you can see below:

<img src="vector_chain_rule.jpg" width="600" height="480" />
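To make the "mostly 0s" structure concrete, here is a small sketch under assumed functions (not taken from [9]): $\boldsymbol{g}$ squares each element of $\boldsymbol{x}$, and $f$ sums the elements of $\boldsymbol{g}$. Because each $g_{i}$ depends only on $x_{i}$, the Jacobian of $\boldsymbol{g}$ is diagonal, and the gradient of $f(\boldsymbol{g}(\boldsymbol{x}))$ is a vector-matrix product:

```python
x = [1.0, 2.0, 3.0]
n = len(x)

# Jacobian of g(x) = [x_1^2, x_2^2, x_3^2]: 2 * x_i on the diagonal, 0 elsewhere
dg_dx = [[2 * x[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

# Gradient of f(g) = g_1 + g_2 + g_3 with respect to g: all ones
df_dg = [1.0] * n

# Vector chain rule: df/dx = df/dg (1 x n) times dg/dx (n x n)
df_dx = [sum(df_dg[i] * dg_dx[i][j] for i in range(n)) for j in range(n)]

print(dg_dx)   # [[2.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 6.0]]
print(df_dx)   # [2.0, 4.0, 6.0]
```

The result matches what you get by differentiating $x_{1}^{2} + x_{2}^{2} + x_{3}^{2}$ directly, which is $2x_{i}$ for each element.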


I think it's appropriate to insert the following gif of the mind being blown [11]:

<img src="mind_blown.gif"/>

### 2. Tensorflow

Tensorflow (https://www.tensorflow.org/install/) is an extremely popular deep learning library built by Google and will be the main library used for the rest of these notebooks (in the last lesson, we briefly used numpy, a numerical computation library that's useful but does not have deep learning functionality). NOTE: Other popular deep learning libraries include Pytorch and Caffe2. Keras is another popular one, but its API has since been absorbed into Tensorflow. Tensorflow is chosen here because:
* it has the most active community on Github
* it's well supported by Google in terms of core features
* it has Tensorflow serving, which allows you to serve your models online (something we'll see in a future notebook)
* it has Tensorboard for visualization (which we will use in this lesson)

**Let's train our first model to get a sense of how powerful Tensorflow can be!**

In [36]:
# Some initial setup. Borrowed from:
# https://github.com/ageron/handson-ml/blob/master/09_up_and_running_with_tensorflow.ipynb

# Common imports
import numpy as np
import os
import tensorflow as tf

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "tensorflow"

def save_fig(fig_id):
  path = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id + ".png")
  print("Saving figure", fig_id)
  plt.tight_layout()
  plt.savefig(path, format='png', dpi=300)

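def stabilize_output():
  # NOTE: these are Tensorflow 1.x calls; in Tensorflow 2.x they are available as
  # tf.compat.v1.reset_default_graph() and tf.compat.v1.set_random_seed()
  tf.reset_default_graph()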
  # needed to avoid the following error: https://github.com/RasaHQ/rasa_core/issues/80
  tf.keras.backend.clear_session()
  tf.set_random_seed(seed=42)
  np.random.seed(seed=42)

print("Done")

Done


In [38]:
# Training your first model
# From Tensorflow tutorial: https://www.tensorflow.org/tutorials/
# (Also in https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/get_started/_index.ipynb)

# to ensure relatively stable output across sessions
stabilize_output()

mnist = tf.keras.datasets.mnist
# load data (requires Internet connection)
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# build a model
model = tf.keras.models.Sequential([
  # flattens the input
  tf.keras.layers.Flatten(),
  # 1 "hidden" layer with 512 units - more on this in the next notebook
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  # does something called regularization, which we have not talked about yet
  tf.keras.layers.Dropout(0.2),
  # 10 units because there are 10 possible digits - 0 to 9
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# train a model (using 5 epochs -> notice the accuracy improving with each epoch)
model.fit(x_train, y_train, epochs=5)

print(model.metrics_names)  # see https://keras.io/models/model/ for the full API
# evaluate model accuracy
model.evaluate(x_test, y_test)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
['loss', 'acc']


[0.06135971167207463, 0.9818]

You should see something similar to [0.06788356024027743, 0.9806]. The first number is the loss on the test set and the second number is the accuracy on the test set.

<p>Congratulations, it means you've trained a classifier that classifies digit images in the MNIST Dataset with **98% accuracy**! We'll break down how the model is optimizing to achieve this accuracy below.</p>
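To make those two numbers less mysterious, here is a rough sketch of how a sparse categorical cross-entropy loss and an accuracy can be computed from predicted probabilities. The tiny 3-class `probabilities` and `labels` below are made up for illustration; they are not MNIST outputs, and the library's own implementation may differ in details such as numerical clipping:

```python
import math

# Hypothetical softmax outputs for 3 examples and 3 classes (each row sums to 1)
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
]
labels = [0, 1, 2]  # the true class index for each example

# Cross-entropy: average of -log(probability assigned to the true class)
loss = sum(-math.log(p[y]) for p, y in zip(probabilities, labels)) / len(labels)

# Accuracy: fraction of examples where the most likely class is the true class
predictions = [p.index(max(p)) for p in probabilities]
accuracy = sum(pred == y for pred, y in zip(predictions, labels)) / len(labels)

print(loss, accuracy)
```

Here the third example is misclassified (the model prefers class 1 but the label is 2), so the accuracy is 2 out of 3, and the low probability given to the true class is what pushes the loss up.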


### 3. Gradient Descent



### 4. Backpropagation



### 5. References

<pre>
  [1] Fast.ai (http://course.fast.ai/)
  [2] CS231N (http://cs231n.github.io/)
  [3] CS224D (http://cs224d.stanford.edu/syllabus.html)
  [4] Hands on Machine Learning (https://github.com/ageron/handson-ml)
  [5] Deep learning with Python Notebooks (https://github.com/fchollet/deep-learning-with-python-notebooks)
  [6] Deep learning by Goodfellow et al. (http://www.deeplearningbook.org/)
  [7] Neural networks online book (http://neuralnetworksanddeeplearning.com/)
  [8] Vector Norms (https://machinelearningmastery.com/vector-norms-machine-learning/)
  [9] The Matrix Calculus You Need For Deep Learning (https://arxiv.org/pdf/1802.01528.pdf)
  [10] Practical Guide to Matrix Calculus for Deep Learning (http://www.psi.toronto.edu/~andrew/papers/matrix_calculus_for_learning.pdf)
  [11] Giphy (https://giphy.com/explore/mind-blown)
</pre>

