## Neural Networks and Deep Learning
### Supervised learning
1. Train a function approximator based on inputs to predict an output
2. Image - uses CNN
3. Audio/translation - RNN
4. Autonomous driving - custom/hybrid NN
5. Structured data - input/output databases; Unstructured data - raw audio, images, text
6. One of the major breakthroughs in NNs was switching the activation function from sigmoid to ReLU. The sigmoid function has near-zero gradients at both tails, which slows down gradient descent.
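
To make the gradient argument concrete, here is a minimal NumPy sketch that evaluates the derivatives of both activations at a few sample points. The helper names (`sigmoid_grad`, `relu_grad`) and the sample values are chosen purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)); peaks at 0.25 when z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of max(0, z); the value at exactly z = 0 is a convention (0 here)
    return (z > 0).astype(float)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])  # illustrative sample points
sig_g = sigmoid_grad(z)
rel_g = relu_grad(z)

print("sigmoid gradient:", sig_g)
print("ReLU gradient:   ", rel_g)
```

The sigmoid gradient shrinks toward zero as $|z|$ grows, while the ReLU gradient stays at 1 for every positive input.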

## Binary classification
1. The output takes one of two classes, e.g. $y \in \{0, 1\}$.
2. Typical notation: the training inputs are $X \in \mathbb{R}^{n_x \times m}$, where each of the $m$ training examples is a column of $n_x$ features. The outputs $Y$ are likewise stacked as columns, i.e. $Y \in \mathbb{R}^{1 \times m}$.
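
As a small sketch of this layout, the following builds $X$ and $Y$ from a list of per-example feature vectors. The sizes, the random features, and the labels are hypothetical values used only to show the shapes.

```python
import numpy as np

n_x, m = 3, 4                      # hypothetical: 3 features, 4 training examples
rng = np.random.default_rng(0)

# One 1-D feature vector per training example (made-up data)
examples = [rng.random(n_x) for _ in range(m)]
labels = [0, 1, 1, 0]              # made-up binary labels

X = np.column_stack(examples)      # each example becomes one column
Y = np.array(labels).reshape(1, m) # labels laid out as a single row of m columns

print(X.shape)  # (n_x, m)
print(Y.shape)  # (1, m)
```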

### Logistic regression
1. Given $x$, we want $\hat y = P(y = 1 \mid x) \in [0,1]$. In linear regression we would use $\hat y = w^T x + b$, but that value is not confined to $[0,1]$, so logistic regression applies the sigmoid function, i.e. $\hat y = \sigma (w^T x + b)$, $\sigma (z) = \frac{1}{1+e^{-z}}$.
2. Loss function: for a single training example, the loss for logistic regression is $L(\hat y, y) = - \left( y \log \hat y + (1-y) \log (1-\hat y) \right)$ (https://medium.com/analytics-vidhya/derivative-of-log-loss-function-for-logistic-regression-9b832f025c2d) (https://web.stanford.edu/~jurafsky/slp3/5.pdf)
3. The overall cost function is defined as the average of the loss over all $m$ training examples, i.e. $J (w,b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat y_i, y_i)$
4. Logistic regression can be trained using gradient descent, repeatedly updating $w := w - \alpha \frac{\partial J (w,b)}{\partial w}$ and likewise $b := b - \alpha \frac{\partial J (w,b)}{\partial b}$, where $\alpha$ is the learning rate.
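
The pieces above (sigmoid prediction, log loss, averaged cost, gradient descent update) can be put together in a short NumPy sketch. This is one possible arrangement rather than a reference implementation: the toy dataset, the learning rate `alpha=0.5`, the step count, and the function names are all assumptions made for illustration. The gradient expression `dZ = A - Y` is the standard derivative of the log loss with respect to $z$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    # Average log loss J(w, b); eps guards against log(0)
    A = sigmoid(w.T @ X + b)
    eps = 1e-12
    return float(-np.mean(Y * np.log(A + eps) + (1 - Y) * np.log(1 - A + eps)))

def train(X, Y, alpha=0.5, steps=2000):
    # alpha and steps are arbitrary illustrative choices
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(steps):
        A = sigmoid(w.T @ X + b)     # predictions, shape (1, m)
        dZ = A - Y                   # dL/dz for every example
        dw = X @ dZ.T / m            # dJ/dw, shape (n_x, 1)
        db = float(np.sum(dZ)) / m   # dJ/db
        w -= alpha * dw
        b -= alpha * db
    return w, b

# Toy data (made up): label is 1 when x1 + x2 > 1
rng = np.random.default_rng(0)
X = rng.random((2, 200))
Y = (X.sum(axis=0, keepdims=True) > 1.0).astype(float)

initial_cost = cost(np.zeros((2, 1)), 0.0, X, Y)
w, b = train(X, Y)
final_cost = cost(w, b, X, Y)

print("cost before training:", initial_cost)
print("cost after training: ", final_cost)
```

With all parameters at zero every prediction is 0.5, so the starting cost is $\log 2$; training should lower it from there.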

### Computation graph for gradient descent on logistic regression
1. A computation graph is very useful for setting up the automatic computation of derivatives.
2. Suppose $x \in \mathbb{R}^2$, so we have the corresponding parameters $\{ w_1, w_2 \}$ and $b$. We set up the computation graph with $w_1, w_2, b, x_1, x_2$, $z = w_1 x_1 + w_2 x_2 + b$, $a = \sigma(z)$ and $L = L(a,y)$
3. To train on $m$ examples, notice that each partial derivative of the cost is the average of the per-example partial derivatives, since $J$ is the average of the per-example losses and differentiation is linear.
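
The graph above can be traced by hand in a few lines of plain Python: one forward pass to get $z$, $a$ and $L$, then one backward pass applying the chain rule node by node. The function name `forward_backward`, the three training examples, and the parameter values are hypothetical; `da`, `dz`, `dw1`, `dw2`, `db` are shorthand for the partial derivatives of $L$ with respect to each quantity.

```python
import math

def forward_backward(w1, w2, b, x1, x2, y):
    # Forward pass through the graph
    z = w1 * x1 + w2 * x2 + b
    a = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))

    # Backward pass (chain rule, right to left)
    da = -y / a + (1 - y) / (1 - a)  # dL/da
    dz = da * a * (1 - a)            # dL/dz, which simplifies to a - y
    dw1 = x1 * dz                    # dL/dw1
    dw2 = x2 * dz                    # dL/dw2
    db = dz                          # dL/db
    return loss, dw1, dw2, db

# Made-up training set: ((x1, x2), y)
examples = [((1.0, 2.0), 1), ((0.5, -1.0), 0), ((-1.5, 0.3), 1)]
w1, w2, b = 0.1, -0.2, 0.05          # arbitrary parameter values
m = len(examples)

# Gradients of the cost J are the averages of the per-example gradients
J = dw1 = dw2 = db = 0.0
for (x1, x2), y in examples:
    loss_i, dw1_i, dw2_i, db_i = forward_backward(w1, w2, b, x1, x2, y)
    J += loss_i / m
    dw1 += dw1_i / m
    dw2 += dw2_i / m
    db += db_i / m

print(J, dw1, dw2, db)
```

A finite-difference comparison against $J$ is a simple way to confirm that hand-derived gradients like these are correct.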

### Vectorization
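1. Vectorization replaces explicit loops over elements with whole-array operations. In NumPy, a dot product is computed with the `.dot` method or `np.dot` (example below). Whole-array operations like these can take advantage of SIMD instructions on the hardware.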

```python
import time

import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Time the array-method form of the dot product
tic = time.time()
c = a.dot(b)
toc = time.time()
print("Time taken in ms = {}".format(1000 * (toc - tic)))

# Time the equivalent function form
tic = time.time()
c = np.dot(a, b)
toc = time.time()
print("Time taken in ms = {}".format(1000 * (toc - tic)))
```

Output:

```
Time taken in ms = 2.761363983154297
Time taken in ms = 2.0134449005126953
```
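
Both timings above use NumPy's vectorized dot product. As a rough sketch of the contrast with a non-vectorized version, the following compares an explicit Python loop against `np.dot` on the same data. The array size is reduced to 100,000 (an arbitrary choice) so the loop finishes quickly, and actual timings depend on the machine, so none are shown.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
n = 100_000                      # arbitrary size, kept small so the loop is quick
a = rng.random(n)
b = rng.random(n)

# Non-vectorized: accumulate the dot product one element at a time
tic = time.perf_counter()
c_loop = 0.0
for i in range(n):
    c_loop += a[i] * b[i]
loop_ms = 1000 * (time.perf_counter() - tic)

# Vectorized: a single call over the whole arrays
tic = time.perf_counter()
c_vec = np.dot(a, b)
vec_ms = 1000 * (time.perf_counter() - tic)

print("Loop:       {:.3f} ms".format(loop_ms))
print("Vectorized: {:.3f} ms".format(vec_ms))
```

The two results agree up to floating-point rounding, and the vectorized call is usually much faster because the per-element work happens in compiled code rather than in the Python interpreter.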
