# Machine Learning Concepts

In [2]:
# install the required Python packages
import sys
!{sys.executable} -m pip install keras
!{sys.executable} -m pip install torch
!{sys.executable} -m pip install tensorflow

Successfully built absl-py termcolor gast
Installing collected packages: absl-py, mock, tensorflow-estimator, grpcio, markdown, tensorboard, termcolor, gast, astor, tensorflow
Successfully installed absl-py-0.7.1 astor-0.7.1 gast-0.2.2 grpcio-1.20.1 markdown-3.1 mock-3.0.4 tensorboard-1.13.1 tensorflow-1.13.1 tensorflow-estimator-1.13.0 termcolor-1.1.0
You are using pip version 19.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


## 1. Neural Networks

### Basics
A neural network (NN) is a collection of software "neurons" arranged in layers and connected so that the output of one layer feeds the next. Training adjusts the weights (W) and biases (b) of these connections to minimize the cost function of the NN.

#### Single Neuron
Each neuron receives a set of x-values (numbered from 1 to n) as input and computes a predicted y-hat value. The vector **x** contains the feature values of one of the _m_ examples in the training set. Each unit has its own set of parameters, usually referred to as **w** (a column vector of weights) and b (the bias), which change during the learning process. **In each iteration, the neuron calculates a weighted sum of the values of the vector x**, using the current weight vector **w**, and adds the bias. Finally, the result of this calculation is passed through a non-linear activation function _g_.

<img src="images/single_neuron.png" width="50%">
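The computation inside one neuron can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a sigmoid as the activation function _g_; the function name `neuron_forward` and the example values are hypothetical and not taken from any training set.

```python
import math

def sigmoid(z):
    """Logistic function, used here as the non-linear activation g (an assumption)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(x, w, b, g=sigmoid):
    """Compute y_hat = g(w . x + b) for a single example x."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # weighted sum plus bias
    return g(z)

# Illustrative values only
x = [1.0, 2.0, 3.0]    # features of one example
w = [0.5, -0.25, 0.0]  # one weight per feature
b = 0.0                # bias
y_hat = neuron_forward(x, w, b)
print(y_hat)
```

Here the weighted sum is 0.5 - 0.5 + 0 = 0, so the sigmoid returns 0.5.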

#### Single Layer
Using what we know about a single unit, we can vectorize across a full layer and combine the per-neuron calculations into matrix equations. To unify the notation, equations are written for a selected layer _l_, marked with the superscript $[l]$ (as in $W^{[l]}$).
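In the commonly used notation (an assumption here, since the notebook does not spell the equations out), layer _l_ computes $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ followed by $A^{[l]} = g(Z^{[l]})$, where $A^{[l-1]}$ holds the previous layer's activations with one column per example. A minimal NumPy sketch of that step, with the hypothetical helper `layer_forward` and tanh standing in for _g_:

```python
import numpy as np

def layer_forward(A_prev, W, b, g):
    """One vectorized layer: Z = W @ A_prev + b, then A = g(Z)."""
    Z = W @ A_prev + b  # b has shape (units, 1) and broadcasts over the example columns
    return g(Z)

# Illustrative shapes: 3 inputs, 2 units, m = 2 examples
A_prev = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])
W = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5, 0.5]])  # one row of weights per unit
b = np.array([[0.0],
              [1.0]])
A = layer_forward(A_prev, W, b, np.tanh)
print(A.shape)  # one row per unit, one column per example
```

Stacking each unit's weight vector as a row of `W` is what turns the per-neuron weighted sums into a single matrix product.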

#### Neural Network Workflow
<img src="images/nn_flow.png" width="50%">

##### Forward Propagation
In the forward propagation phase, the given x-values are passed through the network layer by layer: each node computes its weighted sum and applies its activation function, and the final layer outputs the predicted y_hat values. Because each layer can be represented with vectors and matrices, the computation progresses one layer at a time using linear algebra.
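Forward propagation through several layers can be sketched as a loop over the layers. The example below is a rough illustration, not a reference implementation: the dictionary keys `"W1"`, `"b1"`, ... and the choice of ReLU for the hidden layer and sigmoid for the output are assumptions made for the sketch.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward_propagation(X, params_values, activations):
    """Pass X (shape: n_features x m) through every layer in order."""
    A = X
    for l, g in enumerate(activations, start=1):
        W = params_values["W" + str(l)]
        b = params_values["b" + str(l)]
        A = g(W @ A + b)  # linear step followed by the layer's activation
    return A

# Small random network: 2 inputs -> 4 hidden units -> 1 output
rng = np.random.default_rng(0)
params_values = {
    "W1": rng.standard_normal((4, 2)) * 0.1, "b1": np.zeros((4, 1)),
    "W2": rng.standard_normal((1, 4)) * 0.1, "b2": np.zeros((1, 1)),
}
X = rng.standard_normal((2, 5))  # 2 features, 5 examples
y_hat = forward_propagation(X, params_values, [relu, sigmoid])
print(y_hat.shape)
```

Each column of `y_hat` is the prediction for one example, and the sigmoid output layer keeps every prediction between 0 and 1.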

##### Loss Function
The loss function measures how well the current predictions match the targets: it quantifies how far we are from the ideal solution, and training tries to minimize that difference. Progress is tracked through the **cost** function (the loss averaged over the training examples) and the **accuracy** function. One example of a loss function is binary crossentropy.
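As a concrete illustration, binary crossentropy and accuracy can be computed by hand. This is a minimal sketch: the function names, the clipping constant `eps`, and the 0.5 decision threshold are assumptions, and the labels and predictions are made up for the example.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-12):
    """Mean binary crossentropy over all examples (the cost)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    correct = sum(1 for y, p in zip(y_true, y_pred) if (p >= threshold) == (y == 1))
    return correct / len(y_true)

y_true = [1, 0, 1, 0]          # illustrative labels
y_pred = [0.9, 0.1, 0.8, 0.4]  # illustrative predicted probabilities
cost = binary_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(cost, acc)
```

Confident, correct predictions (0.9 for a label of 1) contribute little to the cost, while hesitant ones (0.4 for a label of 0) contribute more even though they still count as correct for accuracy.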

##### Backward Propagation
Backward propagation calculates the gradient efficiently, and gradient descent then uses that gradient to optimize the parameters. In a NN, we calculate the gradient of the cost function with respect to its parameters, but backpropagation can be used to calculate the derivatives of any differentiable function built from simpler ones. The essence of the algorithm is to apply the chain rule recursively: the derivative of a composite function is assembled from the derivatives of its component functions, which are already known.
<img src="images/nn_prop.png" width="50%">
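The chain rule at the heart of backpropagation can be checked on the smallest possible case. The sketch below assumes a single sigmoid neuron with a binary crossentropy loss on one example (all values are illustrative) and compares the chain-rule gradient against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Binary crossentropy of one sigmoid neuron on one example."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def backward(w, b, x, y):
    """Gradients via the chain rule: dL/dw = dL/dz * dz/dw, dL/db = dL/dz * dz/db."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    dz = y_hat - y              # dL/dz for sigmoid combined with crossentropy
    dw = [dz * xi for xi in x]  # dz/dw_i = x_i
    db = dz                     # dz/db = 1
    return dw, db

w, b, x, y = [0.2, -0.4], 0.1, [1.5, -2.0], 1
dw, db = backward(w, b, x, y)

# Finite-difference estimate of the gradient for the first weight
h = 1e-6
numeric_dw0 = (loss([w[0] + h, w[1]], b, x, y) - loss([w[0] - h, w[1]], b, x, y)) / (2 * h)
print(dw[0], numeric_dw0)
```

The two numbers should agree to several decimal places; this kind of gradient check is a common way to validate a hand-written backward pass.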

#### Updating Parameter Values
The main purpose of this workflow is to update the parameter values using gradient-based optimization, bringing the target function closer to a minimum. We do this with two variables: `params_values`, which stores the current parameter values, and `grad_values`, which stores the cost derivatives with respect to those parameter values. Applying the following equations to each layer updates the parameters, where $\alpha$ is the learning rate. There are more complex optimizers later on.
> $W^{[l]} = W^{[l]} - \alpha \, dW^{[l]}$

> $b^{[l]} = b^{[l]} - \alpha \, db^{[l]}$
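The two update equations translate almost directly into code. This is a sketch under assumptions: `params_values` and `grad_values` are taken to be dictionaries of NumPy arrays keyed as `"W1"`, `"b1"`, `"dW1"`, `"db1"`, and so on, and the function name `update` and the numbers are hypothetical.

```python
import numpy as np

def update(params_values, grad_values, num_layers, learning_rate):
    """Apply W = W - alpha * dW and b = b - alpha * db to every layer."""
    for l in range(1, num_layers + 1):
        params_values["W" + str(l)] -= learning_rate * grad_values["dW" + str(l)]
        params_values["b" + str(l)] -= learning_rate * grad_values["db" + str(l)]
    return params_values

# Illustrative one-layer example
params_values = {"W1": np.ones((2, 2)), "b1": np.zeros((2, 1))}
grad_values = {"dW1": np.full((2, 2), 0.5), "db1": np.ones((2, 1))}
params_values = update(params_values, grad_values, num_layers=1, learning_rate=0.1)
print(params_values["W1"])
print(params_values["b1"])
```

With a learning rate of 0.1, every weight moves from 1 to 0.95 and every bias from 0 to -0.1, a small step against the gradient.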

### Activation Functions
In neural networks, the activation function of a node defines the output of that node (or neuron) given an input or a set of inputs. That output is then used as input for the nodes in the next layer, and so on until the final layer produces the network's prediction.

#### ReLU
ReLU (rectified linear unit) is a type of activation function. Mathematically, it is _y = max(0, x)_. Visually it looks like the following:

<img src="images/reLU.png" width="50%">

ReLU is the most common activation function in neural networks (especially CNNs).
> If unsure about what activation function to use in a network, use ReLU.

##### Functionality
ReLU is linear for all **positive values** and zero for all **negative values**.

This means that it is:
* cheap to compute, as there is no complicated math; the model takes less time to run
* faster to converge; linearity means that the slope doesn't plateau or "saturate" when x gets large (as it does in sigmoid or tanh)
* sparsely activated; since ReLU is zero for all negative inputs, it's likely for any given unit to not activate at all (often desirable)
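These properties are easy to see in a small sketch of ReLU and its slope. The function names are illustrative, and treating the slope at exactly 0 as 0 is a convention assumed here, since the derivative is not defined at that point.

```python
def relu(x):
    """y = max(0, x)"""
    return max(0.0, x)

def relu_derivative(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise (0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0

inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
slopes = [relu_derivative(x) for x in inputs]
print(outputs)
print(slopes)
```

Negative inputs produce both a zero output and a zero slope (the sparsity mentioned above), while positive inputs keep a constant slope of 1 no matter how large they get.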

#### Sigmoid
Sigmoid has a characteristic "S" shape with equation _y = 1/(1+e^-x)_. Visually it looks like the following:

<img src="images/sigmoid.png" width="50%">

##### Functionality
A sigmoid function is a bounded, differentiable, real function that is defined for all real input values and has a non-negative derivative at each point.

In general, a sigmoid function is monotonic (only increases or only decreases) and has a first derivative which is bell shaped. A sigmoid function is constrained by a pair of horizontal asymptotes as x -> +/-inf.

The sigmoid function is convex for values less than 0 and concave for values greater than 0. Because it is neither convex nor concave overall, cost functions built from compositions of sigmoids can possess multiple optima (optimal points).
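The boundedness and the bell-shaped derivative can be checked numerically. This is a small sketch using the standard identity that the derivative of the logistic sigmoid is s(x)(1 - s(x)); the function names are illustrative.

```python
import math

def sigmoid(x):
    """y = 1 / (1 + e^-x)"""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, sigmoid(x), sigmoid_derivative(x))
```

The output stays strictly between the asymptotes 0 and 1, and the derivative peaks at 0.25 for x = 0 and shrinks toward zero in both tails, which is the saturation that ReLU avoids for positive inputs.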

In [None]:
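# CODING NEURAL NETWORK POINTS

# tensorflow (TensorFlow 1.x API)
import tensorflow as tf

# placeholder for a batch of 28x28 single-channel images
# (the shape is an illustrative assumption)
input_layer = tf.placeholder(tf.float32, shape=[None, 28, 28, 1])

# this creates a convolutional layer with 32 5x5 filters and ReLU activation
conv_layer = tf.layers.conv2d(
    inputs=input_layer,
    filters=32,
    kernel_size=[5, 5],
    padding='same',
    activation=tf.nn.relu
)

# keras
from keras.models import Sequential
from keras.layers import Dense

# this adds a fully connected layer with ReLU activation to a new model
# (input_dim=100 is an illustrative assumption)
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=100))

# pytorch
import torch.nn as nn

# this will add 2 CNN layers with ReLU
model = nn.Sequential(
    nn.Conv2d(1, 20, 5),
    nn.ReLU(),
    nn.Conv2d(20, 64, 5),
    nn.ReLU()
)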

### Example

The code below builds the following neural network using Keras.
<img src="images/neural_network_architecture.png" width="50%">

In [None]:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(4, input_dim=2, activation='relu'))
model.add(Dense(6, activation='relu'))
model.add(Dense(6, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# X_train (2 features per example) and y_train (binary labels) are the training data sets, defined elsewhere
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=50, verbose=0)

### Types of Neural Networks
#### Convolutional Neural Networks (CNN)
#### Bayesian Neural Networks (BNN)