# Work-item 2:
Study deep learning in computer vision.
Keras framework
## 1. Image and Classification Fundamentals: 

Four steps in the deep learning classification pipeline:
- Gathering our dataset.
- Splitting our data into training, validation, and testing sets.
- Training our network.
- Evaluating our model.

## 2. Parameterized Learning:

Components of parameterized learning:
1. Data: In the context of image classification, our input data is our dataset of images.
2. Scoring function: The scoring function produces predictions for a given input image.
3. Loss function: The loss function then quantifies how good or bad a set of predictions is over the dataset.
4. Weights and biases: The weight matrix (W) and bias vector (b) are what enable us to actually “learn” from the input data. These parameters are tweaked and tuned via optimization methods in an attempt to obtain higher classification accuracy.

Hinge loss and cross-entropy loss:
- Hinge loss: $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$
- Cross-entropy loss: $L_i = -\log\left(e^{s_{y_i}} / \sum_{j} e^{s_j}\right)$

where $s_j$ is the predicted score of the $j$-th class for the $i$-th data point, i.e. the $j$-th entry of
$s = f(x_i, W)$, and $y_i$ is the index of the true class of that data point.
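
Both losses can be checked by hand on a single data point. The sketch below is only an illustration: the function names and the score values are made up, and the scores are assumed to have already been produced by the scoring function.

```python
import math

def hinge_loss(scores, y):
    """Multi-class hinge loss for one data point: sum over j != y of max(0, s_j - s_y + 1)."""
    return sum(max(0.0, s - scores[y] + 1.0)
               for j, s in enumerate(scores) if j != y)

def cross_entropy_loss(scores, y):
    """Cross-entropy loss for one data point: -log(e^{s_y} / sum_j e^{s_j})."""
    m = max(scores)  # subtract the max score for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[y]

# made-up scores for a 3-class problem
scores = [3.0, 1.0, 2.5]
print("hinge, true class 0:", hinge_loss(scores, 0))
print("hinge, true class 1:", hinge_loss(scores, 1))
print("cross-entropy, true class 0:", cross_entropy_loss(scores, 0))
```

When the true class has the highest score by a margin of at least 1, the hinge loss is zero; the cross-entropy loss only approaches zero as the true-class probability approaches 1.
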
## 3. Optimization Methods:

Optimization is arguably the most important aspect of machine learning, neural networks, and deep learning.

### Gradient descent:
Gradient descent algorithms are controlled via a learning rate.
There are two types of gradient descent:
1. The standard “vanilla” version, which performs only one weight update per epoch,
 making it very slow (if not impossible) to converge on large datasets.

2. The stochastic version (SGD), which is more commonly used since it applies multiple weight updates per epoch by computing the gradient on small mini-batches.

    By using SGD we can dramatically reduce the time it takes to train a model, often while also obtaining lower loss and higher accuracy.

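Pseudocode for gradient descent (standard vanilla version):

```python
while True:
    # compute the gradient of the loss over the entire training set
    W_gradient = evaluate_gradient(loss, data, W)
    # take a step in the direction opposite to the gradient
    W += -alpha * W_gradient
```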


1. Loop until some condition is met. Typical conditions are:
    + A specified number of epochs has passed.
    + Our loss has become sufficiently low or our training accuracy satisfactorily high.
    + Our loss has not improved in M subsequent epochs.
2. Call a function named `evaluate_gradient`, which requires three parameters:
    1. `loss`: A function used to compute the loss over our current W and input data.
    2. `data`: Our training data, where each training sample is represented by an image.
    3. `W`: The weight matrix that we are optimizing.

    Our goal is to apply gradient descent to find a W that yields minimal loss.
3. Apply the gradient descent update: multiply `W_gradient` by alpha ($\alpha$), our learning rate, and subtract the result from W.
    The learning rate controls the size of our step.
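
A minimal runnable sketch of the loop above, assuming a linear model with mean squared error loss (the pseudocode does not fix either). Here `evaluate_gradient` takes the data directly rather than a `loss` function, and the synthetic data, `alpha=0.1`, and `epochs=200` are illustrative choices, not recommended settings.

```python
import numpy as np

def evaluate_gradient(X, y, W):
    """Gradient of the mean squared error of the linear model X @ W (assumed loss)."""
    error = X @ W - y
    return 2.0 * X.T @ error / len(X)

def gradient_descent(X, y, alpha=0.1, epochs=200):
    W = np.zeros(X.shape[1])
    for _ in range(epochs):  # stopping condition: a fixed number of epochs
        # one gradient over the *entire* data set, so one update per epoch
        W_gradient = evaluate_gradient(X, y, W)
        W += -alpha * W_gradient
    return W

# synthetic, noise-free data generated from known weights
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_W = np.array([2.0, -3.0])
y = X @ true_W

W = gradient_descent(X, y)
print("learned W:", W)
```
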

Pseudocode for gradient descent (SGD version):

```python
while True:
    # sample a mini-batch of 256 training examples
    batch = next_training_batch(data, 256)
    # compute the gradient on the mini-batch only
    W_gradient = evaluate_gradient(loss, batch, W)
    W += -alpha * W_gradient
```

The only difference between vanilla gradient descent and SGD is the addition of
the `next_training_batch` function.

Instead of computing our gradient over the entire data set, we sample our data,
yielding a batch. We evaluate the gradient on the batch and update our weight matrix W.
We also try to randomize our training samples before applying SGD, since the algorithm is
sensitive to batches.

Typical batch sizes include 32, 64, 128, and 256.
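
The same idea as a runnable sketch, again assuming a linear model with mean squared error loss. The helper `next_training_batch` is written here as a simple generator over consecutive slices, and the data, learning rate, and batch size of 32 are illustrative assumptions.

```python
import numpy as np

def next_training_batch(X, y, batch_size):
    """Yield consecutive mini-batches of the (already shuffled) data."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

def sgd(X, y, alpha=0.05, epochs=50, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        # randomize the sample order at the start of every epoch
        order = rng.permutation(len(X))
        X_shuffled, y_shuffled = X[order], y[order]
        for batch_X, batch_y in next_training_batch(X_shuffled, y_shuffled, batch_size):
            # gradient of the mean squared error on this mini-batch only
            error = batch_X @ W - batch_y
            W_gradient = 2.0 * batch_X.T @ error / len(batch_X)
            W += -alpha * W_gradient  # several updates per epoch
    return W

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 2))
true_W = np.array([2.0, -3.0])
y = X @ true_W

W = sgd(X, y)
print("learned W:", W)
```

With 256 samples and a batch size of 32, each epoch performs 8 weight updates instead of one.
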


## 4. Regularization
Regularization helps us control our model capacity, ensuring that our models are better at
making (correct) classifications on data points that they were not trained on, which we call the
ability to generalize.

Three common types of regularization are applied directly to the loss function:
- L2 regularization (“weight decay”), which sums the squared weights: $R(W) = \sum_i \sum_j W_{i,j}^2$

- L1 regularization, which takes the absolute value rather than the square: $R(W) = \sum_i \sum_j |W_{i,j}|$

- Elastic Net regularization, which seeks to combine both L1 and L2 regularization: $R(W) = \sum_i \sum_j \beta W_{i,j}^2 + |W_{i,j}|$

In deep learning and neural networks, L2 regularization is commonly used for image classification;
the trick is tuning the alpha parameter (the regularization strength that weights the penalty term) to include just the right amount of regularization.
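
As a small illustration, the three penalties can be computed directly on a weight matrix and added to a data loss. The function names, the `beta` default, and the `strength` value below are assumptions made for this sketch; they are not taken from a specific library.

```python
import numpy as np

def l2_penalty(W):
    """L2 ("weight decay"): sum of squared weights."""
    return np.sum(W ** 2)

def l1_penalty(W):
    """L1: sum of absolute weights."""
    return np.sum(np.abs(W))

def elastic_net_penalty(W, beta=0.5):
    """Elastic Net: combination of the squared and absolute terms."""
    return np.sum(beta * W ** 2 + np.abs(W))

def regularized_loss(data_loss, W, strength=0.01, penalty=l2_penalty):
    """Total loss = data loss + strength * penalty(W)."""
    return data_loss + strength * penalty(W)

# made-up weight matrix
W = np.array([[1.0, -2.0], [0.5, 0.0]])
print("L2:", l2_penalty(W))
print("L1:", l1_penalty(W))
print("Elastic Net:", elastic_net_penalty(W))
print("regularized loss:", regularized_loss(1.0, W, strength=0.1))
```

A larger `strength` pushes the optimizer toward smaller weights at the cost of a higher data loss, which is why it has to be tuned.
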

## 5. Artificial Neural Networks
Implementation with Keras: see [neutral_net](neutral_net.ipynb)
### Perceptron architecture:
![Perceptron](images/Selection_003.png)
### Perceptron Training Procedure 
1. Initialize our weight vector $w$ with small random values.
2. Until the Perceptron converges:
    - Loop over each feature vector $x_j$ and true class label $d_j$ in our training set $D$.
    - Pass $x_j$ through the network, calculating the output value: $y_j = f(w(t) \cdot x_j)$
    - Update the weights $w$: $w_i(t+1) = w_i(t) + \alpha (d_j - y_j) x_{j,i}$ for all features $0 \leq i \leq n$
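
The training procedure can be sketched in a few lines of NumPy. This is an illustration under assumptions: a step activation, a bias handled by appending a constant 1 to every input, a fixed number of epochs instead of a convergence test, and the logical OR data set as a toy problem.

```python
import numpy as np

def step(x):
    """Step activation: 1 if the net input is positive, else 0."""
    return 1 if x > 0 else 0

def add_bias_column(X):
    # append a constant 1 so the bias is learned as one more weight
    return np.c_[X, np.ones(len(X))]

def train_perceptron(X, d, alpha=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    X = add_bias_column(X)
    w = rng.normal(scale=0.01, size=X.shape[1])  # small random initial weights
    for _ in range(epochs):
        for x_j, d_j in zip(X, d):
            y_j = step(np.dot(w, x_j))        # forward pass
            w += alpha * (d_j - y_j) * x_j    # update only changes w on a mistake
    return w

def predict(w, X):
    return [step(np.dot(w, x)) for x in add_bias_column(X)]

# logical OR: linearly separable, so the Perceptron can learn it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 1, 1, 1])

w = train_perceptron(X, d)
print("predictions:", predict(w, X))
```
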

### Multi-layer Networks:
Backpropagation is the most important algorithm in neural networks: it can be considered
the cornerstone of modern neural networks and deep learning. It consists of two phases:
1. The forward pass, where our inputs are passed through the network and output predictions are obtained.
2. The backward pass, where we compute the gradient of the loss function at the final layer (i.e., the
predictions layer) of the network and use this gradient to recursively apply the chain rule to
update the weights in our network.

Backpropagation lets us efficiently train neural networks and “teach” them to learn from their mistakes.
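
To make the two passes concrete, here is a hedged sketch for a tiny network with one hidden layer. The sigmoid activations, the squared-error loss, the layer sizes, and all function names are assumptions chosen for brevity. The analytic gradient from the backward pass is compared against a numerical estimate, a common sanity check for backpropagation code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Forward pass: input -> hidden activations -> output predictions."""
    h = sigmoid(W1 @ x)
    p = sigmoid(W2 @ h)
    return h, p

def loss(p, y):
    return 0.5 * np.sum((p - y) ** 2)

def backward(x, y, W1, W2):
    """Backward pass: chain rule from the output layer back to the first layer."""
    h, p = forward(x, W1, W2)
    delta2 = (p - y) * p * (1 - p)            # dL/d(pre-activation of output layer)
    grad_W2 = np.outer(delta2, h)
    delta1 = (W2.T @ delta2) * h * (1 - h)    # error propagated to the hidden layer
    grad_W1 = np.outer(delta1, x)
    return grad_W1, grad_W2

def numerical_gradient_W1(x, y, W1, W2, eps=1e-6):
    """Central-difference estimate of dL/dW1, used only to check `backward`."""
    grad = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        loss_plus = loss(forward(x, W_plus, W2)[1], y)
        loss_minus = loss(forward(x, W_minus, W2)[1], y)
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([1.0, 0.0])
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

grad_W1, grad_W2 = backward(x, y, W1, W2)
max_diff = np.max(np.abs(grad_W1 - numerical_gradient_W1(x, y, W1, W2)))
print("max difference between analytic and numerical gradient:", max_diff)
```
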


### Neural Network Recipe:
- Dataset
- Loss function: categorical cross-entropy for more than two classes, binary cross-entropy for two classes
- Model/Architecture, chosen by considering:
    1. How many data points you have.
    2. The number of classes.
    3. How similar/dissimilar the classes are.
    4. The intra-class variance.
- Optimization method: SGD (Stochastic Gradient Descent)


## 6. Convolutional Neural Networks (CNNs)
### Convolutions: Link to [convolutions](convolutions.ipynb)
### CNN Building Blocks: Link to [cnn_building_block](cnn_building_block.ipynb)


# Work-item 4:
Deep learning frameworks: learn more about Keras and TensorFlow.
## 7. Learning Rate Schedulers
- Learning rate schedulers are used to increase classification accuracy.
- There are two primary types of learning rate schedulers:
    1. Time-based schedulers that gradually decrease the learning rate based on the epoch number.
    2. Drop-based schedulers that drop the learning rate at specific epochs, similar to the behavior of a piecewise function.

The standard time-based schedule provided by Keras (with the rule of thumb `decay = alpha_init / epochs`):
```python
from keras.optimizers import SGD
opt = SGD(lr=0.01, decay=0.01 / 40, momentum=0.9, nesterov=True)  # decay = alpha_init / epochs
```
Keras applies this learning rate schedule after every batch update, not after every epoch.
We usually need to experiment with several learning rates and schedules to obtain a high accuracy model.
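
The two scheduler types can be written as plain functions. This is a sketch under assumptions: the time-based rule is taken to be `alpha_init / (1 + decay * iterations)`, where `iterations` counts how many times the schedule has been applied, and the drop-based rule multiplies the rate by `factor` every `drop_every` epochs. The parameter names are hypothetical.

```python
import math

def time_based_decay(alpha_init, decay, iterations):
    """Learning rate shrinks smoothly as the schedule is applied more often."""
    return alpha_init * (1.0 / (1.0 + decay * iterations))

def step_decay(alpha_init, factor, drop_every, epoch):
    """Learning rate is multiplied by `factor` every `drop_every` epochs."""
    return alpha_init * (factor ** math.floor((1 + epoch) / drop_every))

alpha_init = 0.01
for epoch in (0, 4, 9, 14):
    print(epoch, step_decay(alpha_init, factor=0.5, drop_every=5, epoch=epoch))
for iterations in (0, 1000, 4000):
    print(iterations, time_based_decay(alpha_init, decay=alpha_init / 40, iterations=iterations))
```

A function of the epoch such as `step_decay` is the kind of schedule that Keras' `LearningRateScheduler` callback is meant to accept, although the exact callback signature should be checked against the installed Keras version.
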

## 8. Underfitting and Overfitting
Underfitting occurs when your model cannot obtain sufficiently low loss on the training set.
In this case, your model fails to learn the underlying patterns in your training data.
On the other end of the spectrum, we have overfitting, where your network models the training data
too well and fails to generalize to your validation data.
Therefore, our goal when training a machine learning model is to:

1. Reduce the training loss as much as possible.
2. Ensure that the gap between the training and testing loss stays reasonably small.

Controlling whether a model is likely to underfit or overfit can be accomplished by adjusting
the capacity of the neural network.
We can increase capacity by adding more layers and neurons to
our network. Similarly, we can decrease capacity by removing layers and neurons and applying
regularization techniques (weight decay, dropout, data augmentation, early stopping, etc.).

Underfitting is relatively easy to combat: simply add more layers/neurons to your network.
Overfitting is an entirely different beast though. When overfitting occurs you should consider:

1. Reducing the capacity of your network by removing layers/neurons (not recommended unless
   the dataset is small).
2. Applying stronger regularization techniques.
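
Early stopping, one of the regularization techniques named above, can be sketched without any framework: stop once the validation loss has not improved for a given number of epochs. The function name, the `patience` parameter, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best:
            best = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# made-up validation losses: improving at first, then rising (a sign of overfitting)
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]
print("stop at epoch:", early_stopping_epoch(val_losses, patience=3))
```
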

## 9. Checkpointing Models
We can monitor a given metric (e.g., validation loss, validation accuracy,
etc.) during training and then save high performing networks to disk.
There are two methods to accomplish this inside Keras:
1. Checkpoint incremental improvements.
2. Checkpoint only the best model found during the process.

```python
# import the checkpoint callback
from keras.callbacks import ModelCheckpoint

# construct the callback to save only the *best* model to disk
# based on the validation loss
checkpoint = ModelCheckpoint(args["weights"], monitor="val_loss",
    save_best_only=True, verbose=1)
callbacks = [checkpoint]
# train the network
print("[INFO] training network...")
H = model.fit(trainX, trainY, validation_data=(testX, testY),
    batch_size=64, epochs=40, callbacks=callbacks, verbose=2)
```
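
For the first method (checkpointing incremental improvements), one possible sketch is to give `ModelCheckpoint` a filename template so that every improvement is written to its own file. The helper names, the `checkpoints` directory, and the `.h5` extension are assumptions; the Keras import is kept inside the function so that the filename logic can be inspected without Keras installed.

```python
import os

def checkpoint_template(output_dir):
    """Filename template; Keras fills in the epoch number and validation loss."""
    return os.path.join(output_dir, "weights-{epoch:03d}-{val_loss:.4f}.h5")

def build_incremental_checkpoint(output_dir):
    # imported lazily: only needed when the callback is actually built
    from keras.callbacks import ModelCheckpoint
    return ModelCheckpoint(checkpoint_template(output_dir), monitor="val_loss",
        mode="min", save_best_only=True, verbose=1)

# show what a saved file would be called for epoch 7 with val_loss 0.51234
example = checkpoint_template("checkpoints").format(epoch=7, val_loss=0.51234)
print(os.path.basename(example))
```

Because the filename changes with every improvement, earlier checkpoints are kept on disk instead of being overwritten.
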
## 10. Architecture Visualization
Architecture visualization is the process of constructing a graph of nodes and associated connections in a network
and saving the graph to disk as an image.

These graphs typically include the following components for each layer:
1. The input volume size.
2. The output volume size.
3. And optionally the name of the layer.

We typically use network architecture visualization when (1) debugging our own custom
network architectures and (2) publication, where a visualization of the architecture is easier to
understand than including the actual source code or trying to construct a table to convey the same
information.

Sample LeNet network architecture visualization with Keras:
```python
# import the necessary packages
from pyimagesearch.nn.conv import LeNet
from keras.utils import plot_model

# initialize LeNet and then write the network architecture
# visualization graph to disk
model = LeNet.build(28, 28, 1, 10)
plot_model(model, to_file="lenet.png", show_shapes=True)
```
## 11. State-of-the-art CNNs in Keras
The Keras library ships with many CNNs that have been pre-trained on the ImageNet dataset:
* VGG16
* VGG19
* ResNet50
* Inception V3
* Xception

Depending on our motivation and end goals in studying deep learning, these networks alone may be enough to build
our desired application.

## 12. Data Augmentation

According to Goodfellow et al., regularization is “any modification we make to a learning algorithm that is intended
to reduce its generalization error, but not its training error”
![Perceptron](images/Selection_007.png)
* Data augmentation is a type of regularization technique that operates on the training data.
* Data augmentation randomly modifies our training data by applying a series of random
translations, rotations, shears, and flips.
* Applying these simple transformations does not change the class label of the input image; however, each augmented
image can be considered a “new” image that the training algorithm has not seen before.
* Therefore, our training algorithm is being constantly presented with new training samples, allowing it to learn
 more robust and discriminative patterns.
 
A simple example of data augmentation in Keras:
 ```python
from keras.preprocessing.image import ImageDataGenerator
aug = ImageDataGenerator(rotation_range=30, width_shift_range=0.1, 
    height_shift_range=0.1, shear_range=0.2, zoom_range=0.2,
    horizontal_flip=True, fill_mode="nearest")
# construct the actual Python generator
imageGen = aug.flow(image, batch_size=1, save_to_dir=args["output"],
    save_prefix=args["prefix"], save_format="jpg")
 ```
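
To see why the class label is unaffected, the idea can also be sketched with plain NumPy. This is not what `ImageDataGenerator` does internally; it is a simplified stand-in that applies only a random horizontal flip and a small translation, and the function name and toy image are made up.

```python
import numpy as np

def random_flip_and_shift(image, rng, max_shift=2):
    """Return a randomly flipped and translated copy of an (H, W, C) image."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    # np.roll wraps pixels around the border, unlike fill_mode="nearest"
    return np.roll(out, shift=(dy, dx), axis=(0, 1))

rng = np.random.default_rng(0)
image = np.arange(8 * 8 * 3, dtype=np.uint8).reshape(8, 8, 3)  # toy 8x8 RGB image
label = "cat"

# every augmented sample keeps the original label
augmented = [(random_flip_and_shift(image, rng), label) for _ in range(4)]
for sample, sample_label in augmented:
    print(sample.shape, sample_label)
```
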

## 13. Transfer learning
Transfer learning is the concept of using a pre-trained Convolutional Neural Network to classify class labels outside of what it was originally trained on. In
general, there are two methods to perform transfer learning when applied to deep learning and
computer vision:
### 13.1. Networks as Feature Extractors
Treat networks as feature extractors, forward propagating the image until a given layer, and
then taking these activations and treating them as feature vectors.

 * Deep CNNs such as VGG, Inception, and ResNet are capable of acting as powerful feature extraction machines,
even more powerful than hand-designed algorithms such as HOG, SIFT, and Local Binary Patterns.

### 13.2. Fine-tuning Networks
Fine-tune a network by replacing its head with a brand-new set of fully-connected (FC) layers
and training these FC layers to recognize new classes (while still using the same
underlying CONV filters). The layers in the body of the original network are frozen while we train the new FC layers.

Applying fine-tuning is an extremely powerful technique as we do not have to train an entire network from scratch.
Instead, we can leverage pre-existing network architectures, such as state-of-the-art models trained on the ImageNet
dataset which consist of a rich, discriminative set of filters. 

Using these filters, we can “jump start” our learning, allowing us to perform network
surgery, which ultimately leads to a higher accuracy transfer learning model with less effort (and headache)
than training from scratch.
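
A hedged sketch of this kind of network surgery in Keras is shown below. The choice of VGG16 as the base network, the head layout (256 units, dropout of 0.5), and the function name `build_finetune_model` are assumptions for illustration. The imports live inside the function so that defining it does not require Keras, and calling it with `weights="imagenet"` downloads the pre-trained weights.

```python
def build_finetune_model(num_classes, input_shape=(224, 224, 3), weights="imagenet"):
    """Build a VGG16 body with frozen CONV layers and a new trainable FC head."""
    from keras.applications import VGG16
    from keras.layers import Dense, Dropout, Flatten
    from keras.models import Model

    # load the body of the network without its original FC head
    base = VGG16(weights=weights, include_top=False, input_shape=input_shape)

    # freeze the body so only the new head is updated during training
    for layer in base.layers:
        layer.trainable = False

    # new FC head for the new set of classes
    head = Flatten()(base.output)
    head = Dense(256, activation="relu")(head)
    head = Dropout(0.5)(head)
    head = Dense(num_classes, activation="softmax")(head)

    return Model(inputs=base.input, outputs=head)
```

After the new head has been trained for a while, later CONV layers of the body can optionally be unfrozen and trained with a small learning rate.
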

## 14.  Single Shot Detectors (SSDs)
The SSD object detector is entirely end-to-end, contains no complex moving parts,
and is capable of super real-time performance.

The SSD consists of two parts: extracting feature maps, and applying convolution filters to detect objects.
The SSD starts with a base network (typically a pre-trained network).
A set of new CONV layers replaces the later CONV and POOL layers of the base network.
Each of these CONV layers connects to the final detection layer.
Combined with the (modified) MultiBox algorithm, this allows SSDs to detect objects at varying
scales in the image in a single forward pass.

### TODO: 
## 15. Faster R-CNNs
## 16. Advanced Optimization Methods
- Working with HDF5 and large datasets
- Working with a popular network such as GoogLeNet, MobileNet, or ResNet
- Working with the ImageNet dataset
