<h1><center>CMSC 478: Machine Learning</center></h1>

<center><img src="img/title.jpg" align="center"/></center>


<h3 style="color:blue;"><center>Instructor: Fereydoon Vafaei</center></h3>


<h5 style="color:purple;"><center>Convolutional Neural Networks (CNNs)</center></h5>

<center><img src="img/UMBC_logo.png" align="center"/></center>

<h1><center>Agenda</center></h1>

- <b>Convolutional Neural Networks</b>
    - Convolution Operation
    - Pooling Operation
    - Padding and Strides
    - CNN Architectures
        - LeNet-5
        - AlexNet
            - Data Augmentation
        - GoogLeNet
            - Inception Module
        - VGGNet
        - ResNet
            - Residual Learning
        - Xception
        - SENet
- <b>Further Applications</b>
    - Object Detection (IoU & Non-Max Suppression)
    - Face Recognition

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

<h1><center>Computer Vision</center></h1>

- Computer Vision is one of the areas that has been advancing rapidly thanks to Deep Learning and specifically **CNN**s.


- CV in self-driving cars includes different ML tasks such as: image classification, object detection, image segmentation, etc. 

<center><img src="img/computer-vision.gif" align="center"/></center>

<center><img src="img/cv-2.png" align="center"/></center>

<font size='1'>Image from Ref[4]</font>

<h1><center>Image Classification vs Object Detection</center></h1>

- **Image classification** takes an image as an input and outputs the classification label of that image with some metric (probability score, confidence, etc). 


- **Object detection** is the task of finding instances of objects in an image: it both classifies and localizes multiple objects.

<center><img src="img/object-localization.png" align="center"/></center>
<font size=1>Image from: https://www.kaggle.com/getting-started/169984</font>

<h1><center>Object Detection Demos</center></h1>

[TensorFlow](https://www.tensorflow.org/lite/models/object_detection/overview)

[Watson](https://watson-visual-recognition-duo-dev.ng.bluemix.net/)

[YOLO-v3](https://pjreddie.com/darknet/yolo/)

<h1><center>Neural Style Transfer</center></h1>

[Tensorflow Tutorial](https://www.tensorflow.org/tutorials/generative/style_transfer)

<center><img src="img/neural-style-transfer.jpeg" align="center"/></center>

<font size='1'>Image from Ref[5]</font>

<h1><center>Image and Video Colorization</center></h1>

[Image Colorization API from DeepAI](https://deepai.org/machine-learning-model/colorizer)


[DeOldify - A library in GitHub](https://github.com/jantic/DeOldify)


[Coloring Movie Psycho (1960)](https://www.youtube.com/watch?v=l3UXXid04Ys&feature=emb_logo)

<h1><center>Motivation - Large Images & Too Many Parameters</center></h1>

<center><img src="img/large-images.png" align="center"/></center>

<font size='1'>Image from Ref[3]</font>

<h1><center>Motivation - Convolution Operation (from DL Textbook [2])</center></h1>

- Convolutional networks have been tremendously successful in practical CV applications. The name **convolutional neural network** indicates that the network employs a mathematical operation called **convolution**.


- **Convolution** is a specialized kind of linear operation.


- Convolutional networks are simply neural networks that use **convolution** in place of general matrix multiplication in at least one of their layers.

<h1><center>Motivation - Convolution Operation (from DL Textbook [2])</center></h1>

- **Convolution** leverages three important ideas that can help improve a machine learning system:
    - Sparse interactions

    - Parameter sharing

    - Equivariant representations


- Moreover, **convolution** provides a means for working with inputs of variable size.

<h1><center>Edge Detection Example</center></h1>

<center><img src="img/edge-detection.png" align="center"/></center>

<font size='1'>Image from Ref[3]</font>

<h1><center>Convolution Operation - Vertical Edge Detection</center></h1>

<center><img src="img/conv-op-1.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Convolution Operation - Vertical Edge Detection</center></h1>

<center><img src="img/conv-op-2.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Convolution Operation - Vertical Edge Detection</center></h1>

<center><img src="img/conv-op-3.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Convolution Operation - Vertical Edge Detection</center></h1>

<center><img src="img/conv-op-4.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>
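
The vertical edge detection idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the slides: the 6x6 image (bright left half, dark right half), the 3x3 filter values, and the helper name `conv2d_valid` are assumptions chosen for the example. As in most deep learning libraries, the operation computed here is technically a cross-correlation (the filter is not flipped).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the elementwise products."""
    f_h, f_w = kernel.shape
    out_h = image.shape[0] - f_h + 1
    out_w = image.shape[1] - f_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

# Assumed 6x6 image: bright left half (10), dark right half (0)
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# Assumed 3x3 vertical edge filter
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)

edges = conv2d_valid(image, vertical_edge)
print(edges)  # every row is [0, 30, 30, 0]
```

The large values in the two middle columns mark the position of the vertical edge between the bright and dark halves.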

<h1><center>Convolution Operation Example</center></h1>

<center><img src="img/conv-op-text.png" align="center"/></center>

<font size='1'>Image from Ref[2]</font>

<h1><center>Convolution Operation Animation</center></h1>

<center><img src="img/convolution-operation-1.gif" align="center"/></center>

<font size=1> Image from Ref[13]</font>

<h1><center>Convolution Operation Example-1</center></h1>

<center><img src="img/convolution-operation-2.gif" align="center"/></center>

<font size=1> Image from Ref[13]</font>

<h1><center>Convolution Operation Example-2</center></h1>

<center><img src="img/convolution-operation-3.gif" align="center"/></center>

<font size=1> Image from Ref[15]</font>

<h1><center>Inspiration from Visual Cortex</center></h1>

<center><img src="img/visual-cortex.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Convolution Layers</center></h1>

<center><img src="img/cnn-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Filters and Feature Maps</center></h1>

<center><img src="img/feature-map-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Padding</center></h1>

<center><img src="img/padding.gif" align="center"/></center>

<font size=1> Image from Ref[13]</font>

<h1><center>Zero Padding - "same"</center></h1>

<center><img src="img/padding-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>"same" Padding</center></h1>

- Pad so that output size is the same as input size:

    - $n+2p-f+1 = n \implies p = \frac{f-1}{2}$
    
    
- Note: Here stride is assumed to be 1.

<h1><center>Padding "same" vs "valid"</center></h1>

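- If `padding` is set to "same", the convolutional layer uses zero padding if necessary: zeros are added as evenly as possible around the inputs. When `strides=1`, the layer’s outputs have the same spatial dimensions (width and height) as its inputs, hence the name "same".

- If `padding` is set to "valid", the convolutional layer does not use zero padding, so the output is generally smaller than the input, and some rows and columns at the bottom and right of the input may be ignored, depending on the stride.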

<center><img src="img/padding-same-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>
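
As a rough sketch of how the two padding modes affect the output size along one spatial dimension, the helper below (a hypothetical function, not part of Keras) follows the usual conventions: "same" yields `ceil(n / s)` outputs, while "valid" yields `floor((n - f) / s) + 1`.

```python
import math

def conv_output_size(n, f, s, padding):
    """Output size along one dimension for input size n, filter size f, stride s."""
    if padding == "same":
        return math.ceil(n / s)
    if padding == "valid":
        return (n - f) // s + 1
    raise ValueError("padding must be 'same' or 'valid'")

same_s1 = conv_output_size(32, 3, 1, "same")    # size preserved
valid_s1 = conv_output_size(32, 3, 1, "valid")  # shrinks by f - 1
same_s5 = conv_output_size(13, 6, 5, "same")
valid_s5 = conv_output_size(13, 6, 5, "valid")  # last input columns are ignored

print(same_s1, valid_s1, same_s5, valid_s5)  # 32 30 3 2
```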

<h1><center>Strided Convolution</center></h1>

<center><img src="img/strided.gif" align="center"/></center>

<font size=1> Image from Ref[13]</font>

<h1><center>Strided Convolution Example - 1</center></h1>

<center><img src="img/strided-1.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Strided Convolution Example - 2</center></h1>

<center><img src="img/strided-2.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Strided Convolution Example - 3</center></h1>

<center><img src="img/strided-3.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Strided Convolution Example - 4</center></h1>

<center><img src="img/strided-4.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Strided Convolution - Computing Output Dimension</center></h1>

$$\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1$$

- where:
    - n: input size
    - f: filter size
    - p: padding
    - s: stride


- The floor brackets mean that if the division does not yield an integer, the result is rounded down.
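
The output-dimension formula can be checked with a short sketch. The function names and the example sizes below are illustrative assumptions; `same_padding` applies the relation $p = \frac{f-1}{2}$, which holds for stride 1 and odd filter sizes.

```python
def output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1 along one spatial dimension."""
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Padding that preserves the input size when stride is 1 and f is odd."""
    return (f - 1) // 2

print(output_size(6, 3))                     # 4: no padding, stride 1
print(output_size(7, 3, p=0, s=2))           # 3: stride 2
print(output_size(6, 3, p=same_padding(3)))  # 6: "same" padding keeps the size
print(output_size(8, 3, p=0, s=2))           # 3: (8 - 3) / 2 = 2.5 is rounded down
```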

<h1><center>Convolution Summary</center></h1>

<center><img src="img/conv-summary.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Strided Convolution Effect - Dimensionality Reduction</center></h1>

<center><img src="img/striding-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Convolution over Volume</center></h1>

<center><img src="img/conv-3D.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Convolution Over RGB Channels</center></h1>

<center><img src="img/rgb.gif" align="center"/></center>

<font size=1> Image from Ref[13]</font>

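**Equation 14-1: Computing the output of a neuron in a convolutional layer**

$$
z_{i,j,k} = b_k + \sum\limits_{u = 0}^{f_h - 1} \, \, \sum\limits_{v = 0}^{f_w - 1} \, \, \sum\limits_{k' = 0}^{f_{n'} - 1} \, \, x_{i', j', k'} \times w_{u, v, k', k}
\quad \text{with }
\begin{cases}
i' = i \times s_h + u \\
j' = j \times s_w + v
\end{cases}
$$

- where:
    - $z_{i,j,k}$: output of the neuron in row $i$, column $j$ of feature map $k$ of the convolutional layer
    - $s_h$, $s_w$: vertical and horizontal strides
    - $f_h$, $f_w$: height and width of the receptive field
    - $f_{n'}$: number of feature maps in the previous layer
    - $x_{i',j',k'}$: output of the neuron in row $i'$, column $j'$ of feature map $k'$ of the previous layer (or channel $k'$ if the previous layer is the input layer)
    - $b_k$: bias term for feature map $k$
    - $w_{u,v,k',k}$: connection weight between any neuron in feature map $k$ and its input located at row $u$, column $v$ (relative to the neuron's receptive field) of feature map $k'$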
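
To make Equation 14-1 concrete, here is a deliberately unoptimized NumPy sketch that evaluates it term by term, without padding. The function name `conv_layer_output`, the random toy tensors, and the stride of 2 are assumptions made for illustration; real frameworks use far faster implementations.

```python
import numpy as np

def conv_layer_output(x, w, b, s_h=1, s_w=1):
    """Direct evaluation of the neuron-output equation (no padding).

    x: input of shape [height, width, f_n_prev]
    w: weights of shape [f_h, f_w, f_n_prev, f_n]
    b: biases of shape [f_n]
    """
    f_h, f_w, f_n_prev, f_n = w.shape
    out_h = (x.shape[0] - f_h) // s_h + 1
    out_w = (x.shape[1] - f_w) // s_w + 1
    z = np.zeros((out_h, out_w, f_n))
    for i in range(out_h):
        for j in range(out_w):
            for k in range(f_n):
                total = b[k]
                for u in range(f_h):
                    for v in range(f_w):
                        for k_prev in range(f_n_prev):
                            # i' = i * s_h + u, j' = j * s_w + v
                            total += x[i * s_h + u, j * s_w + v, k_prev] * w[u, v, k_prev, k]
                z[i, j, k] = total
    return z

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5, 3))      # toy 5x5 input with 3 channels
w = rng.normal(size=(3, 3, 3, 2))   # two 3x3 filters spanning all 3 channels
b = np.array([0.5, -0.5])
z = conv_layer_output(x, w, b, s_h=2, s_w=2)
print(z.shape)  # (2, 2, 2)
```

Each output value sums over the full depth of the input, which is why one filter produces a single 2D feature map regardless of the number of input channels.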

<h1><center>Convolution on Multiple Channels</center></h1>

<center><img src="img/conv-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Shared Parameters in CNN</center></h1>


- The fact that all neurons in a feature map **share the same parameters** dramatically reduces the number of parameters in the model.


- Once the CNN has learned to recognize a pattern in one location, it can recognize it in any other location. 


- In contrast, once a regular DNN has learned to recognize a pattern in one location, it can recognize it only in that particular location.
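
A back-of-the-envelope comparison shows how much parameter sharing saves. The layer sizes below are hypothetical (a 32x32 RGB input mapped to 32 feature maps of the same spatial size), and the two helper functions are illustrative names, not library calls.

```python
def conv_params(f_h, f_w, c_in, c_out):
    """One f_h x f_w x c_in filter plus one bias per output feature map."""
    return (f_h * f_w * c_in + 1) * c_out

def dense_params(n_in, n_out):
    """One weight per input-output pair plus one bias per output unit."""
    return (n_in + 1) * n_out

# Convolutional layer: 3x3 filters, 3 input channels, 32 feature maps
conv = conv_params(3, 3, 3, 32)

# Fully connected layer producing the same number of outputs (32 x 32 x 32)
dense = dense_params(32 * 32 * 3, 32 * 32 * 32)

print(conv)   # 896
print(dense)  # 100696064
```

The convolutional layer's parameter count does not depend on the image size at all, only on the filter size and the number of input and output feature maps.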

<h1><center>Pooling</center></h1>

<center><img src="img/pooling-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Max Pooling</center></h1>

<center><img src="img/max-pooling.png" align="center"/></center>

<font size=1> Image from Ref[14]</font>
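
Max pooling can be sketched in NumPy as follows. The 4x4 input values and the function name `max_pool2d` are assumptions for illustration; the 2x2 window with stride 2 is a common choice rather than a requirement.

```python
import numpy as np

def max_pool2d(x, pool=2, stride=2):
    """Keep the maximum of each pool x pool window of a 2D feature map."""
    out_h = (x.shape[0] - pool) // stride + 1
    out_w = (x.shape[1] - pool) // stride + 1
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + pool, j * stride:j * stride + pool]
            out[i, j] = window.max()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [7, 2, 9, 8],
              [1, 0, 3, 4]])
pooled = max_pool2d(x)
print(pooled)  # [[6 5]
               #  [7 9]]
```

A pooling layer has no weights: it only aggregates its inputs, so it adds no trainable parameters.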

<h1><center>Pooling Effect - Translation Invariance</center></h1>

<center><img src="img/pooling-invariance-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Depthwise Pooling</center></h1>

<center><img src="img/pooling-invariance-2-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>CNN Architecture</center></h1>

<center><img src="img/cnn-architecture-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>CNN Architecture</center></h1>

- The image gets smaller and smaller as it progresses through the network, but it also typically gets deeper and deeper (i.e., with more feature maps), thanks to the convolutional layers.

<center><img src="img/cnn-mathworks.png" align="center"/></center>

<font size='1'>Image from Ref[15]</font>

<h1><center>Sparse Connectivity</center></h1>

- Traditional neural network layers use matrix multiplication by a matrix of parameters with a separate parameter describing the interaction between each input unit and each output unit. This means that every output unit interacts with every input unit. 


- Convolutional networks, however, typically have sparse interactions, also referred to as **sparse connectivity** or **sparse weights**. This is accomplished by making the kernel smaller than the input.

<h1><center>Effect of Sparse Connectivity - Viewed from Below</center></h1>

<center><img src="img/sparse-connectivity-1.png" align="center"/></center>

<font size='1'>Image from Ref[2]</font>

<h1><center>Effect of Sparse Connectivity - Viewed from Above</center></h1>

<center><img src="img/sparse-connectivity-2.png" align="center"/></center>

<font size='1'>Image from Ref[2]</font>

<h1><center>CNN in TensorFlow</center></h1>

- In TensorFlow, each input image is typically represented as a 3D tensor of shape [height, width, channels].


- A mini-batch is represented as a 4D tensor of shape [mini-batch size, height, width, channels].


- The weights of a convolutional layer are represented as a 4D tensor of shape [$f_h, f_w, f_{n'}, f_n$].


- The bias terms of a convolutional layer are simply represented as a 1D tensor of shape [$f_n$].

In [67]:
# CNN Example: The following code works on a 10-class image classification with input size: 32x32x3
model = keras.models.Sequential()
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1), padding='same',
                        input_shape=(32, 32, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu')) # number of filters usually grows
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu')) 
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu')) # FC Layer
model.add(layers.Dense(10, activation='softmax')) # Output layer for 10-class classification

In [68]:
model.summary()

Model: "sequential_16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_42 (Conv2D)           (None, 32, 32, 32)        896       
_________________________________________________________________
max_pooling2d_28 (MaxPooling (None, 16, 16, 32)        0         
_________________________________________________________________
conv2d_43 (Conv2D)           (None, 14, 14, 64)        18496     
_________________________________________________________________
max_pooling2d_29 (MaxPooling (None, 7, 7, 64)          0         
_________________________________________________________________
conv2d_44 (Conv2D)           (None, 5, 5, 64)          36928     
_________________________________________________________________
flatten_13 (Flatten)         (None, 1600)              0         
_________________________________________________________________
dense_28 (Dense)             (None, 64)              
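
The shapes and parameter counts in the summary can be reproduced by hand. The sketch below is plain Python with hypothetical helper names (it does not call Keras): it applies the output-size rules for "same" and "valid" padding and counts `(f * f * c_in + 1) * filters` parameters per convolutional layer. The values for the three convolutional layers match the summary above; the two dense-layer counts are what the same arithmetic gives for the model as defined in the code cell.

```python
import math

def conv2d(shape, filters, f=3, s=1, padding="valid"):
    """Return (output shape, parameter count) of a conv layer with f x f filters."""
    h, w, c = shape
    if padding == "same":
        out_h, out_w = math.ceil(h / s), math.ceil(w / s)
    else:  # "valid": no padding
        out_h, out_w = (h - f) // s + 1, (w - f) // s + 1
    return (out_h, out_w, filters), (f * f * c + 1) * filters

def max_pool(shape, pool=2):
    """Pooling with stride equal to the pool size; no trainable parameters."""
    h, w, c = shape
    return (h // pool, w // pool, c), 0

def dense(n_in, units):
    """Return (output size, parameter count) of a fully connected layer."""
    return units, (n_in + 1) * units

trace = []

def record(name, shape, params):
    trace.append((name, shape, params))
    return shape

shape = (32, 32, 3)
shape = record("conv2d", *conv2d(shape, 32, padding="same"))
shape = record("max_pooling2d", *max_pool(shape))
shape = record("conv2d", *conv2d(shape, 64))
shape = record("max_pooling2d", *max_pool(shape))
shape = record("conv2d", *conv2d(shape, 64))
n = shape[0] * shape[1] * shape[2]
record("flatten", (n,), 0)
n, n_params = dense(n, 64)
record("dense", (n,), n_params)
n, n_params = dense(n, 10)
record("dense", (n,), n_params)

for name, out_shape, n_params in trace:
    print(f"{name:15s} {str(out_shape):15s} {n_params}")
total_params = sum(n_params for _, _, n_params in trace)
print("Total params:", total_params)
```

Most of the parameters sit in the first dense layer, even though the convolutional layers do most of the feature extraction.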

<h1><center>CNN Architectures and ImageNet</center></h1>

- Over the years, variants of this fundamental architecture have been developed, leading to amazing advances in the field.


- A good measure of this progress is the error rate in competitions such as the ILSVRC **ImageNet** challenge.


- In this competition the top-five error rate for image classification fell from over 26% to less than 2.3% in just six years.


- The top-five error rate is the fraction of test images for which the system’s top five predictions did not include the correct answer.


- The images are large (256 pixels high) and there are 1,000 classes, some of which are really subtle (try distinguishing 120 dog breeds).


- Looking at the evolution of the winning entries is a good way to understand how CNNs work.
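
The top-five error rate described above can be computed from a matrix of class scores. This is a minimal sketch on made-up data: the function name `top_k_error`, the 7-class toy scores, and the labels are all assumptions for illustration.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k highest scores."""
    top_k = np.argsort(scores, axis=1)[:, -k:]         # indices of the k best classes
    hits = np.any(top_k == labels[:, None], axis=1)    # is the true label among them?
    return 1.0 - hits.mean()

# Toy data: 4 examples, 7 classes
ascending = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
descending = ascending[::-1]
scores = np.array([ascending, ascending, descending, descending])
labels = np.array([6, 0, 4, 5])

top5 = top_k_error(scores, labels, k=5)
top1 = top_k_error(scores, labels, k=1)
print(top5, top1)  # 0.5 0.75
```

The top-five rate is always less than or equal to the top-one rate, since a prediction only needs to place the correct class somewhere in its five best guesses.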

<h1><center>LeNet-5 Architecture</center></h1>

<center><img src="img/lenet-5-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>AlexNet</center></h1>

- The AlexNet CNN architecture won the 2012 ImageNet ILSVRC challenge by a large margin: it achieved a top-five error rate of 17%, while the second best achieved only 26%!


- AlexNet was developed by Alex Krizhevsky (hence the name), Ilya Sutskever, and Geoffrey Hinton.


- AlexNet is similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of one another, instead of stacking a pooling layer on top of each convolutional layer.


- To reduce overfitting, the authors used two regularization techniques:
    - First, they applied dropout with a 50% dropout rate during training to the outputs of layers F8 and F9.
    - Second, they performed **data augmentation** by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions.

<h1><center>AlexNet Architecture</center></h1>

<center><img src="img/alexnet-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Data Augmentation</center></h1>

- Data augmentation artificially increases the size of the training set by generating many realistic variants of each training instance.


- This reduces overfitting, making this a regularization technique.


- The generated instances should be as realistic as possible: ideally, given an image from the augmented training set, a human should not be able to tell whether it was augmented or not. Simply adding white noise will not help; the modifications should be learnable (white noise is not).


- For example, you can slightly shift, rotate, and resize every picture in the training set by various amounts and add the resulting pictures to the training set.


- This forces the model to be more tolerant to variations in the position, orientation, and size of the objects in the pictures.


- For a model that’s more tolerant of different lighting conditions, you can similarly generate many images with various contrasts.


- In general, you can also flip the pictures horizontally (except for text, and other asymmetrical objects). By combining these transformations, you can greatly increase the size of your training set.
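
The shift and flip transformations described above can be sketched directly in NumPy. This is only an illustration under simple assumptions: the function names are hypothetical, shifts are limited to a couple of pixels, and pixels exposed by a shift are filled with zeros.

```python
import numpy as np

def random_shift(image, max_shift, rng):
    """Shift an [H, W, C] image by a random offset, filling exposed pixels with zeros."""
    h, w = image.shape[:2]
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    padded = np.pad(image, ((max_shift, max_shift), (max_shift, max_shift), (0, 0)))
    top, left = max_shift - dy, max_shift - dx
    return padded[top:top + h, left:left + w]

def augment(image, rng, max_shift=2):
    """Randomly shift the image, then flip it horizontally with probability 0.5."""
    shifted = random_shift(image, max_shift, rng)
    if rng.random() < 0.5:
        shifted = shifted[:, ::-1]   # horizontal flip
    return shifted

rng = np.random.default_rng(42)
image = rng.random((8, 8, 3))        # stand-in for a training image
variants = [augment(image, rng) for _ in range(4)]
print([v.shape for v in variants])
```

Generating variants on the fly during training, rather than storing them, keeps memory use constant while still exposing the model to a different version of each image in every epoch.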

<h1><center>Data Augmentation</center></h1>

<center><img src="img/data-aug-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Image Augmentation</center></h1>

https://github.com/aleju/imgaug

<center><img src="img/imgaug.png" align="center"/></center>

<font size='1'>Image from Ref[6]</font>

<h1><center>Image Augmentation Using TensorFlow ImageDataGenerator</center></h1>


In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Assumes X_train, y_train, X_test, y_test, EPOCHS and early_stop are defined earlier,
# and that model has already been compiled

# Create data generator: random shifts of up to 10% and random horizontal flips
data_generator = ImageDataGenerator(width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True)

# Prepare iterator
BATCH_SIZE = 64
iterate_train = data_generator.flow(X_train, y_train, batch_size=BATCH_SIZE)
steps = X_train.shape[0] // BATCH_SIZE

# Train the model using the iterator
# Note: fit() accepts generators directly; fit_generator() is deprecated in TF 2.x
history = model.fit(iterate_train, steps_per_epoch=steps, validation_data=(X_test, y_test),
                    epochs=EPOCHS, callbacks=[early_stop], verbose=1)

<h1><center>GoogLeNet - Inception Module</center></h1>

<center><img src="img/inception-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>GoogLeNet Architecture</center></h1>

<center><img src="img/googlenet-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>VGGNet</center></h1>

- The runner-up in the ILSVRC 2014 challenge was VGGNet, developed by Karen Simonyan and Andrew Zisserman from the Visual Geometry Group (VGG) research lab at Oxford University.


- VGGNet has a very simple and classical architecture, with 2 or 3 convolutional layers and a pooling layer, then again 2 or 3 convolutional layers and a pooling layer, and so on, plus a final dense network with 2 hidden layers and the output layer. The VGG-16 and VGG-19 variants have 13 and 16 convolutional layers, respectively; the 16 and 19 in their names also count the 3 dense layers.


- VGGNet uses only 3 × 3 filters, but many filters.

<h1><center>Residual Learning</center></h1>

- As CNN models get deeper and deeper, training them gets more and more challenging, e.g. issues such as vanishing and exploding gradients may arise.


- The key to being able to train such a deep network is to use skip connections (also called shortcut connections): the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack.

<center><img src="img/residual-learning-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Residual Network vs DNN</center></h1>

- If you add many skip connections, the network can start making progress even if several layers have not started learning yet.

- Thanks to skip connections, the signal can easily make its way across the whole network.

<center><img src="img/dnn-resnet-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>
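
The effect of skip connections on signal flow can be illustrated with a toy NumPy sketch. It is a strong simplification, stated as an assumption: the layers are dense rather than convolutional, there is no batch normalization, and "untrained" layers are idealized as having all-zero weights.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def plain_layer(x, W):
    """Regular layer: the output depends only on the transformed input."""
    return relu(x @ W)

def residual_unit(x, W):
    """Residual unit: the input is added to the layer's output (skip connection)."""
    return relu(x @ W + x)

x = np.array([[1.0, 2.0, 3.0]])
W_untrained = np.zeros((3, 3))   # idealized layer that has not learned anything yet

plain_out = x
res_out = x
for _ in range(10):              # stack 10 such layers
    plain_out = plain_layer(plain_out, W_untrained)
    res_out = residual_unit(res_out, W_untrained)

print(plain_out)  # [[0. 0. 0.]]  the signal is blocked
print(res_out)    # [[1. 2. 3.]]  the signal passes through unchanged
```

With skip connections, a unit whose weights are close to zero approximately copies its input, so the layers around it can still receive a useful signal.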

<h1><center>ResNet Architecture</center></h1>

<center><img src="img/resnet-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Skip Connection - Matching Dimensions</center></h1>

- Note that the number of feature maps is doubled every few residual units, at the same time as their height and width are halved (using a convolutional layer with stride 2).

- When this happens, the inputs cannot be added directly to the outputs of the residual unit because they don’t have the same shape.

- To solve this problem, the inputs are passed through a 1 × 1 convolutional layer with stride 2 and the right number of output feature maps.

<center><img src="img/skip-connection.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>
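
The shape-matching step can be sketched with NumPy: a 1 × 1 convolution with stride 2 amounts to keeping every other pixel and then mixing channels with a matrix multiplication. The tensor sizes (8x8x64 to 4x4x128) and the constant weights are assumptions chosen to make the result easy to check; biases are omitted.

```python
import numpy as np

def conv1x1_stride2(x, w):
    """1x1 convolution with stride 2: subsample every other pixel, then mix channels."""
    return x[::2, ::2, :] @ w

x = np.ones((8, 8, 64))            # input to the residual unit: 8x8, 64 feature maps
w = np.full((64, 128), 1 / 64)     # 1x1 filters mapping 64 channels to 128
shortcut = conv1x1_stride2(x, w)

print(shortcut.shape)  # (4, 4, 128)
```

After this projection the shortcut has the same shape as the residual unit's output, so the two can be added elementwise.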

<h1><center>Xception</center></h1>

- Another variant of the GoogLeNet architecture is worth noting: Xception (which stands for Extreme Inception) was proposed in 2016 by François Chollet (the author of Keras), and it significantly outperformed Inception-v3 on a huge vision task (350 million images and 17,000 classes).

- Just like Inception-v4, it merges the ideas of GoogLeNet and ResNet, but it replaces the inception modules with a special type of layer called a depthwise separable convolution layer (or separable convolution layer for short).

- While a regular convolutional layer uses filters that try to simultaneously capture spatial patterns (e.g., an oval) and cross-channel patterns (e.g., mouth + nose + eyes = face), a separable convolutional layer makes the strong assumption that spatial patterns and cross-channel patterns can be modeled separately.

- Thus, it is composed of two parts: the first part applies a single spatial filter to each input feature map, and the second part looks exclusively for cross-channel patterns; it is just a regular convolutional layer with 1 × 1 filters.

<center><img src="img/depthwise.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>
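
One practical consequence of splitting spatial and cross-channel filtering is a much smaller parameter count. The sketch below compares the two layer types for a hypothetical layer (3 × 3 filters, 64 input and 128 output feature maps), assuming one spatial filter per input feature map and a single bias per output feature map.

```python
def regular_conv_params(f, c_in, c_out):
    """Each output feature map has its own f x f x c_in filter plus a bias."""
    return (f * f * c_in + 1) * c_out

def separable_conv_params(f, c_in, c_out):
    depthwise = f * f * c_in         # one f x f spatial filter per input feature map
    pointwise = c_in * c_out         # 1x1 convolution across channels
    return depthwise + pointwise + c_out   # plus one bias per output feature map

regular = regular_conv_params(3, 64, 128)
separable = separable_conv_params(3, 64, 128)
print(regular, separable)  # 73856 8896
```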

<h1><center>SENet</center></h1>

- The winning architecture in the ILSVRC 2017 challenge was the Squeeze-and-Excitation Network (SENet).

- This architecture extends existing architectures such as inception networks and ResNets, and boosts their performance.

- This allowed SENet to win the competition with an astonishing 2.25% top-five error rate!

- The extended versions of inception networks and ResNets are called SE-Inception and SE-ResNet, respectively.

- The boost comes from the fact that a SENet adds a small neural network, called an **SE Block**, to every unit in the original architecture (i.e., every inception module or every residual unit).


<center><img src="img/senet.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>SE Block</center></h1>

- An SE block analyzes the output of the unit it is attached to, focusing exclusively on the depth dimension (it does not look for any spatial pattern), and it learns which features are usually most active together.

- It then uses this information to recalibrate the feature maps, as shown in the figure below (Figure 14-21 in Ref[1]).

- For example, an SE block may learn that mouths, noses, and eyes usually appear together in pictures: if you see a mouth and a nose, you should expect to see eyes as well. So if the block sees a strong activation in the mouth and nose feature maps, but only mild activation in the eye feature map, it will boost the eye feature map (more accurately, it will reduce irrelevant feature maps). If the eyes were somewhat confused with something else, this feature map recalibration will help resolve the ambiguity.

<center><img src="img/se-block-1.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>SE Block</center></h1>

- An SE block is composed of just three layers: a global average pooling layer, a hidden dense layer using the ReLU activation function, and a dense output layer using the sigmoid activation function.

<center><img src="img/se-block-2.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>
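
The three layers of an SE block can be sketched as a NumPy forward pass. This is an illustration with random, untrained weights, not a trained block: the function name `se_block`, the 7x7x32 input, and the reduction ratio of 16 for the hidden layer are assumptions.

```python
import numpy as np

def se_block(feature_maps, W1, b1, W2, b2):
    """feature_maps: [H, W, C]. Returns recalibrated feature maps of the same shape."""
    squeezed = feature_maps.mean(axis=(0, 1))             # global average pooling -> [C]
    hidden = np.maximum(squeezed @ W1 + b1, 0.0)          # dense + ReLU -> [C // ratio]
    scales = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))    # dense + sigmoid -> [C], in (0, 1)
    return feature_maps * scales                          # one scale factor per feature map

rng = np.random.default_rng(0)
C, ratio = 32, 16                                         # assumed sizes
x = rng.random((7, 7, C))
W1 = rng.normal(scale=0.1, size=(C, C // ratio))
b1 = np.zeros(C // ratio)
W2 = rng.normal(scale=0.1, size=(C // ratio, C))
b2 = np.zeros(C)

y = se_block(x, W1, b1, W2, b2)
print(y.shape)  # (7, 7, 32)
```

Because the sigmoid outputs lie between 0 and 1, the block can only scale feature maps down, which is consistent with the remark that it reduces irrelevant feature maps rather than literally boosting relevant ones.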

<h1><center>Using CNN Architectures</center></h1>


- In general, you won’t have to implement standard models like GoogLeNet or ResNet manually, since pretrained networks are readily available with a single line of code in the `keras.applications` package.


- For example, you can load the ResNet-50 model, pretrained on ImageNet, with the following line of code:

In [None]:
model = keras.applications.resnet50.ResNet50(weights="imagenet")

<h1><center>Transfer Learning with CNNs</center></h1>

- If you want to build an image classifier but do not have enough training data, it is often a good idea to reuse the lower layers of a pretrained model, as discussed for transfer learning in DNNs.


- For example, you can train a model to classify pictures of flowers, reusing a pretrained **Xception** model.

In [None]:
'''
- Load an Xception model, pretrained on ImageNet.
- We exclude the top of the network by setting `include_top=False` : 
    this excludes the global average pooling layer and the dense output layer.
- We then add our own global average pooling layer, 
    based on the output of the base model, followed by a dense output layer with one unit per class,
    using the softmax activation function. Finally, we create the Keras Model :
'''
# n_classes (the number of flower classes) is assumed to be defined earlier
base_model = keras.applications.xception.Xception(weights="imagenet",
                                                  include_top=False)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
output = keras.layers.Dense(n_classes, activation="softmax")(avg)
model = keras.Model(inputs=base_model.input, outputs=output)

<h1><center>Object Localization and Recognition</center></h1>


- Localizing an object in a picture can be expressed as a regression task.


- To predict a bounding box around the object, a common approach is to predict the horizontal and vertical coordinates of the object’s center, as well as its height and width. This means we have four numbers to predict.

<center><img src="img/object-detection.png" align="center"/></center>

<font size='1'>Image from Ref[7]</font>

<h1><center>Intersection over Union (IoU)</center></h1>

- The MSE often works fairly well as a cost function to train a model for regression, but it is not a great metric to evaluate how well the model can predict bounding boxes.


- The most common metric for this is the Intersection over Union (IoU): the area of overlap between the predicted bounding box and the target bounding box, divided by the area of their union.


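- `tf.keras` provides the `tf.keras.metrics.MeanIoU` class, but note that it computes the mean IoU over class labels (as used in semantic segmentation); for bounding boxes, the IoU is computed directly from the box coordinates.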

<center><img src="img/iou.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>
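
For two axis-aligned bounding boxes, the IoU can be computed directly from their coordinates. The sketch below assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples, which is one common convention; other code may use center coordinates with width and height instead.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143: small overlap
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0: identical boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0: no overlap
```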

<h1><center>Object Localization and Recognition</center></h1>

- The task of classifying and localizing multiple objects in an image is called **object detection**.

- Until a few years ago, a common approach was to take a CNN that was trained to classify and locate a single object, then slide it across the image, as shown below.

- This technique is fairly straightforward, but as you can see it will detect the same object multiple times, at slightly different positions. Some post-processing will then be needed to get rid of all the unnecessary bounding boxes. A common approach for this is called **non-max suppression**.

<center><img src="img/object-detection-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Non-Max Suppression</center></h1>

<center><img src="img/non-max.jpg" align="center"/></center>

<font size='1'>Image from Ref[10]</font>
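
A common greedy form of non-max suppression can be sketched as follows: process boxes from highest to lowest score and keep a box only if it does not overlap too much with a box that was already kept. The box format `(x_min, y_min, x_max, y_max)`, the IoU threshold of 0.5, and the example detections are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not strongly overlap an already kept box
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two overlapping detections of the same object, plus one separate object
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
keep = non_max_suppression(boxes, scores)
print(keep)  # [0, 2]
```

The second box is dropped because its IoU with the higher-scoring first box (about 0.82) exceeds the threshold, while the third box is kept because it does not overlap either of the others.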

<h1><center>YOLO - You Only Look Once</center></h1>

- YOLO (You Only Look Once), developed at the University of Washington, is a fast, widely used object detection algorithm that combines several components, including CNNs, IoU, and non-max suppression.

https://pjreddie.com/darknet/yolo/

<h1><center>Face Recognition</center></h1>

<center><img src="img/face-recog.jpg" align="center"/></center>

<font size='1'>Image from Ref[12]</font>

<h1><center>Face Recognition with One-Shot Learning</center></h1>

<center><img src="img/one-shot.png" align="center"/></center>

<font size='1'>Image from Ref[9]</font>

<h1><center>Facial Expression Recognition</center></h1>

**Demo:**

https://github.com/justadudewhohacks/face-api.js#age-estimation--gender-recognition

<h1><center>References</center></h1>

[1] Aurélien Géron, *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*, 2nd Edition, O'Reilly, 2019

[2] Deep Learning [Textbook](http://www.deeplearningbook.org/contents/convnets.html) by Ian Goodfellow et al.

[3] Andrew Ng's CNN Course in [Coursera](https://www.coursera.org/learn/convolutional-neural-networks?=)

[4] https://towardsdatascience.com/everything-you-ever-wanted-to-know-about-computer-vision-heres-a-look-why-it-s-so-awesome-e8a58dfb641e

[5] https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-neural-style-transfer-ef88e46697ee

[6] https://github.com/aleju/imgaug

[7] https://towardsdatascience.com/object-detection-with-10-lines-of-code-d6cb4d86f606

[8] https://github.com/justadudewhohacks/face-api.js#age-estimation--gender-recognition

[9] https://blog.netcetera.com/face-recognition-using-one-shot-learning-a7cf2b91e96c

[10] https://www.pyimagesearch.com/2014/11/17/non-maximum-suppression-object-detection-python/

[11] https://pjreddie.com/darknet/yolo/

[12] https://www.nec.com/en/global/solutions/biometrics/face/index.html

[13] https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1

[14] https://www.freecodecamp.org/news/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050/

[15] https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html