# Gesture Recognition


### Machine Learning

Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. The term "Machine Learning" emphasizes that the computer program (or machine) must do some work after it is given data: the learning step is made explicit. Even though Machine Learning was first used to recognize patterns, researchers went on to apply it to robotics (reinforcement learning, manipulation, motion planning, grasping), to genome data, and to predicting financial markets.

<img src="./images/ml-eng.png">

### Deep Learning

Fast forward to today and there is a large interest in Deep Learning, which is a subset of Machine Learning. Deep learning is a machine learning technique that teaches computers to do what comes naturally to humans: learn by example. Deep learning is a key technology behind driverless cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a lamppost. The most popular kinds of Deep Learning models, as used in large-scale image recognition tasks, are known as Convolutional Neural Nets, or simply ConvNets.

<img src="./images/traditional-ml-deep-learning-2.png">

#### Convolutional Neural Network

A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm which can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image and be able to differentiate one from the other. The pre-processing required in a ConvNet is much lower as compared to other classification algorithms. While in primitive methods filters are hand-engineered, with enough training, ConvNets have the ability to learn these filters/characteristics.

The architecture of a ConvNet is analogous to that of the connectivity pattern of Neurons in the Human Brain and was inspired by the organization of the Visual Cortex. Individual neurons respond to stimuli only in a restricted region of the visual field known as the Receptive Field. A collection of such fields overlap to cover the entire visual area.

<img src="./images/Typical_cnn.png">

### How it works

#### Input

In the figure, we have an RGB image that has been separated into its three color planes: Red, Green, and Blue. Images can exist in a number of such color spaces: Grayscale, RGB, HSV, CMYK, etc.

<img src="./images/input-img.png">

You can imagine how computationally intensive things would get once the images reach dimensions of, say, 8K (7680×4320). The role of the ConvNet is to reduce the images into a form that is easier to process, without losing features that are critical for getting a good prediction.

#### Convolution

Think of convolution as applying a filter to our image. We slide a small matrix, usually called a kernel, over the image and, at each position, output the filtered value for the patch of the image under the kernel.

<img src="./images/Convolution_schematic.gif">

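The objective of the convolution operation is to extract features from the input image. Early convolution layers typically capture low-level features such as edges, while deeper layers combine them into higher-level features.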

<img src="./images/convolution-layer.gif">

There are a few parameters that get adjusted here:

* Kernel Size – the size of the filter.
* Kernel Type – the values of the actual filter. Some examples include identity, edge detection, and sharpen.
* Stride – the rate at which the kernel passes over the input image. A stride of 2 moves the kernel in 2-pixel increments.
* Padding – we can add layers of 0s to the outside of the image in order to make sure that the kernel properly passes over the edges of the image.
* Output Layers – how many different kernels are applied to the image.

Output of the convolution process is called the “convolved feature” or “feature map.” 
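
To make the parameters above concrete, here is a minimal pure-Python sketch of the operation. The function name `convolve2d`, the toy image, and the edge kernel are illustrative assumptions, not part of this project's code. Like most deep learning libraries, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel, stride=1, padding=0):
    """Slide a square `kernel` over `image` (lists of lists) and return the feature map."""
    k = len(kernel)
    # Zero-pad the image on all four sides.
    width = len(image[0]) + 2 * padding
    padded = [[0] * width for _ in range(padding)]
    for row in image:
        padded.append([0] * padding + list(row) + [0] * padding)
    padded += [[0] * width for _ in range(padding)]

    out_h = (len(padded) - k) // stride + 1
    out_w = (width - k) // stride + 1
    feature_map = []
    for i in range(out_h):
        out_row = []
        for j in range(out_w):
            # Multiply the kernel element-wise with the patch under it and sum.
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += padded[i * stride + di][j * stride + dj] * kernel[di][dj]
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


# A 4x4 image: dark left half, bright right half (one vertical edge).
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# A simple vertical-edge kernel: right column minus left column.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, kernel)           # no padding: 4x4 -> 2x2
print(feature_map)                                # [[27, 27], [27, 27]]

same_size = convolve2d(image, kernel, padding=1)  # padding=1 keeps the 4x4 size
print(len(same_size), len(same_size[0]))          # 4 4
```

Every 3×3 window of the unpadded image straddles the edge, so every entry of the feature map responds strongly. With `padding=1` the output keeps the input size, which is the effect described for the Padding parameter.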

#### ReLU
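CNNs add a nonlinear function after each convolution so that the network can approximate nonlinear relationships in the underlying data. ReLU (Rectified Linear Unit) is one such simple function: it replaces negative values with zero and leaves positive values unchanged.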
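
As a small illustration, ReLU can be written in one line of plain Python. The helper name `relu` and the sample values are made up for this sketch.

```python
def relu(values):
    """Replace every negative entry with 0 and keep the rest unchanged."""
    return [max(0.0, v) for v in values]


activations = relu([-2.0, -0.5, 0.0, 1.5, 3.0])
print(activations)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```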

#### Max Pooling

We pass over sections of our image and pool them into the highest value in the section.

Similar to the convolution layer, the pooling layer decreases the computational power required to process the data through dimensionality reduction. Furthermore, it is useful for extracting dominant features that are largely invariant to small shifts and rotations, which helps the model train effectively.

<img src="./images/max-pooling.png">
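
The pooling step can be sketched in plain Python as follows. The function name `max_pool2d` and the 4×4 feature map are illustrative assumptions, and a 2×2 window with stride 2 is assumed.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the largest value in each `size` x `size` window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 5], [7, 9]]
```

The 4×4 input shrinks to 2×2, so the next layer has a quarter as many values to process.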

#### Fully Connected Layers
After the above preprocessing steps are applied, the resulting image (which may end up looking nothing like the original!) is passed into the traditional neural network architecture.

After going through the above process, we have successfully enabled the model to understand the features. Moving on, we are going to flatten the final output and feed it to a regular Neural Network for classification purposes.


### PyTorch

PyTorch can be used as a replacement for NumPy that harnesses the power of GPUs.

Let's construct a randomly initialized matrix. Run the snippet below.

```
import torch

x = torch.rand(5, 3)
print(x)
```

Output (the values are random and will differ on every run):

```
tensor([[9.8809e-02, 5.7240e-01, 9.6262e-05],
        [7.8903e-01, 7.3890e-01, 8.3572e-01],
        [1.6577e-01, 8.9676e-01, 4.5417e-01],
        [4.0741e-01, 6.9280e-01, 7.5464e-01],
        [6.6123e-01, 6.3295e-01, 3.9002e-01]])
```


PyTorch uses an imperative / eager paradigm. That is, each line of code that builds a graph defines a component of that graph, and we can perform computations on these components independently, even before the graph is completely built. This is called the “define-by-run” methodology.

<img src="./images/pytorch-variable.gif">

#### Tensors

Tensors are multidimensional arrays. Tensors in PyTorch are similar to NumPy’s ndarrays. PyTorch requires the data set to be transformed into tensors so it can be consumed in the training and testing of the network.

```
import torch

# define a tensor
a = torch.FloatTensor([2])
b = torch.FloatTensor([3])

print(a + b)
```

Output:

```
tensor([5.])
```


#### Autograd module

PyTorch uses a technique called automatic differentiation. That is, we have a recorder that records what operations we have performed, and then it replays it backward to compute our gradients. This technique is especially powerful when building neural networks.

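```
from torch.autograd import Variable

# Variable is deprecated since PyTorch 0.4: it now simply returns a tensor,
# so train_x and train_y can also be used directly.
x = Variable(train_x)
y = Variable(train_y, requires_grad=False)
```

`train_x` and `train_y` are the training data and their respective labels.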
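
To see the recorder at work, here is a minimal sketch on a toy tensor. The values are made up and are not the project's training data.

```python
import torch

# requires_grad=True asks autograd to record every operation applied to x.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # y = x1^2 + x2^2 + x3^2

y.backward()         # replay the recorded operations backward
print(x.grad)        # gradient dy/dx = 2 * x
```

After `backward()`, `x.grad` holds the derivative of `y` with respect to each element of `x`. This is the quantity an optimizer uses to update weights.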

#### Optim module

`torch.optim` is a module that implements various optimization algorithms used for building neural networks.

```
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
```
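
A single optimizer step can be sketched on a one-parameter problem. This example deliberately uses `torch.optim.SGD` instead of Adam, because its update rule (parameter minus learning rate times gradient) is easy to check by hand. The parameter `w` and the loss are made up for illustration.

```python
import torch

w = torch.tensor([0.0], requires_grad=True)   # a single trainable parameter
optimizer = torch.optim.SGD([w], lr=0.1)

loss = ((w - 3.0) ** 2).sum()   # minimized at w = 3
optimizer.zero_grad()           # clear any old gradients
loss.backward()                 # d(loss)/dw = 2 * (w - 3) = -6
optimizer.step()                # w <- 0 - 0.1 * (-6) = 0.6
print(w)
```

All optimizers in `torch.optim` share the same `zero_grad()` / `step()` interface, so switching to Adam only changes the line that constructs the optimizer.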

#### nn module

PyTorch autograd makes it easy to define computational graphs and take gradients, but raw autograd can be a bit too low-level for defining complex neural networks. This is where the nn module comes into play.

```
import torch

# define model
model = torch.nn.Sequential(
    torch.nn.Linear(input_num_units, hidden_num_units),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden_num_units, output_num_units),
)
loss_fn = torch.nn.CrossEntropyLoss()
```
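
The following sketch runs one forward pass through such a model. The layer sizes (784, 50, 10), the random batch, and the labels are hypothetical values chosen only so that the snippet can run on its own.

```python
import torch

# Hypothetical sizes, chosen only for illustration.
input_num_units, hidden_num_units, output_num_units = 784, 50, 10

model = torch.nn.Sequential(
    torch.nn.Linear(input_num_units, hidden_num_units),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden_num_units, output_num_units),
)
loss_fn = torch.nn.CrossEntropyLoss()

batch = torch.rand(4, input_num_units)   # 4 random samples
labels = torch.tensor([0, 3, 9, 1])      # made-up class indices
logits = model(batch)                    # forward pass, one score per class
loss = loss_fn(logits, labels)
print(logits.shape, loss.item())
```

The model returns one row of class scores per sample, and the loss function reduces them to a single scalar that can be backpropagated.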

Now that you know the basic components of PyTorch, you can easily build your own neural network from scratch.

### Model Parameters (Constants)

The batch size is the number of samples processed before the model is updated.

The number of epochs is the number of complete passes through the training dataset.

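The learning rate, or step size, is a hyperparameter that determines how far the weights move on each update, that is, to what extent newly acquired information overrides old information. Ideally, training converges to the global minimum of the loss, where the model fits the training data best. A learning rate that is too large can overshoot the minimum, while one that is too small makes training slow. A low training loss alone does not guarantee good predictions, so accuracy should still be checked on test or other unseen data.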

<img src="./images/global-minima.png">

We must specify the batch size, number of epochs and learning rate for any learning algorithm.

```
EPOCHS = 10
BATCH_SIZE = 1
LEARNING_RATE = 0.003
```

Suppose there is a nice music track with lyrics, made up of six (6) verses, that you want to learn. After playing the full track once (`Epoch`) and going through each of the 6 verses (`Iterations`) for the first time, there is no guarantee you will be able to sing the song on your own without referring to the lyrics or replaying the song. You may therefore need to replay the song a sufficient number of times (`Multiple Epochs`) until you are confident enough to sing it on your own or even recite any verse from any part of the song (`High learning accuracy`).
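
The relationship between batch size, iterations, and epochs can also be put into numbers. In this sketch the dataset size of 600 samples and the helper name `count_updates` are assumptions for illustration.

```python
import math


def count_updates(num_samples, batch_size, epochs):
    """Return (iterations per epoch, total weight updates)."""
    per_epoch = math.ceil(num_samples / batch_size)
    return per_epoch, per_epoch * epochs


print(count_updates(600, batch_size=1, epochs=10))   # (600, 6000)
print(count_updates(600, batch_size=32, epochs=10))  # (19, 190)
```

With `BATCH_SIZE = 1` the model is updated after every sample. A larger batch size means fewer, less noisy updates per epoch.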


### Loading the Data

PyTorch ships with the torchvision package, which makes it easy to download and use datasets for CNNs.

```
import torch
import torchvision
train_data = torchvision.datasets.ImageFolder(root=DATA_FOLDER_PATH, transform=TRANSFORM_IMG)
train_data_loader = torch.utils.data.DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)
```

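The transform parameter `TRANSFORM_IMG`, which must be defined before `ImageFolder` is called, can be used to preprocess the images. Here it converts each image to a single grayscale channel and then to a tensor.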

```
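from torchvision import transforms

TRANSFORM_IMG = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # convert to a single channel
    transforms.ToTensor(),                        # PIL image -> float tensor in [0, 1]
])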
```

### Designing a Neural Net

We’ll be making use of four major building blocks in our CNN class:

* `torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)` – applies convolution
* `torch.nn.ReLU()` – applies ReLU
* `torch.nn.MaxPool2d(kernel_size, stride, padding)` – applies max pooling
* `torch.nn.Linear(in_features, out_features)` – fully connected layer (multiply inputs by learned weights)

We will create a CNN class with two methods: `__init__`, which defines the layers, and `forward`. The `forward()` method computes a forward pass of the CNN, which includes the convolution, ReLU, and pooling steps we outlined above.

```
import torch.nn as nn


class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Sequential(         # input shape (1, 28, 28)
            nn.Conv2d(
                in_channels=1,              # number of input channels (1 for grayscale)
                out_channels=16,            # n_filters
                kernel_size=5,              # filter size
                stride=1,                   # filter movement/step
                padding=2,                  # if want same width and length of this image after Conv2d, 
                                            #     padding=(kernel_size-1)/2 if stride=1
            ),                              # output shape (16, 28, 28)
            nn.ReLU(),                      # activation
            nn.MaxPool2d(kernel_size=2),    # choose max value in 2x2 area, output shape (16, 14, 14)
        )
        self.conv2 = nn.Sequential(         # input shape (16, 14, 14)
            nn.Conv2d(16, 32, 5, 1, 2),     # output shape (32, 14, 14)
            nn.ReLU(),                      # activation
            nn.MaxPool2d(2),                # output shape (32, 7, 7)
        )
        self.out = nn.Linear(32 * 7 * 7, 2) # fully connected layer, output 2 classes

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = x.view(x.size(0), -1)           # flatten the output of conv2 to (batch_size, 32 * 7 * 7)
        output = self.out(x)
        return output, x                    # return x for visualization
```
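
The shapes in the comments above follow from the usual output-size formula for convolution and pooling layers, `(size + 2 * padding - kernel_size) // stride + 1`. The sketch below traces a 28×28 input through the network. The helper name `conv_output_size` is an assumption, and the pooling stride of 2 reflects `nn.MaxPool2d` defaulting its stride to the kernel size.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 28                                                           # input height/width
size = conv_output_size(size, kernel_size=5, stride=1, padding=2)   # conv1 -> 28
after_conv1 = size
size = conv_output_size(size, kernel_size=2, stride=2)              # max pool -> 14
size = conv_output_size(size, kernel_size=5, stride=1, padding=2)   # conv2 -> 14
size = conv_output_size(size, kernel_size=2, stride=2)              # max pool -> 7

flattened = 32 * size * size                                        # 32 channels of 7x7
print(after_conv1, size, flattened)                                 # 28 7 1568
```

The result, 1568, matches the `32 * 7 * 7` input size of the final `nn.Linear` layer. Input images of a different size would require a different value there.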




The neural net can then be initialized in a single line:

```
    model = CNN()
```


Next, the optimizer and loss function are defined as below.

```
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    loss_func = nn.CrossEntropyLoss()
    
```


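We used the `torch.nn.CrossEntropyLoss()` function. Cross-entropy loss, also referred to as log loss, measures how far the predicted class probabilities are from the actual label. It is a non-negative value, not a probability: it is close to 0 when the model assigns high probability to the correct class and grows without bound as the predicted probability of the actual label approaches 0. In PyTorch, `CrossEntropyLoss` expects raw, unnormalized scores (logits) and applies the softmax internally, which is why the CNN above has no softmax layer.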
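
For a single sample, the computation can be sketched in plain Python as a softmax followed by the negative log of the probability given to the true class. The function name `cross_entropy` and the two-class scores are illustrative assumptions.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability that softmax(logits) assigns to class `target`."""
    m = max(logits)                              # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target])


# The same confident prediction for class 0, scored against two different true labels.
confident_right = cross_entropy([4.0, 0.0], target=0)
confident_wrong = cross_entropy([4.0, 0.0], target=1)
print(round(confident_right, 3), round(confident_wrong, 3))
```

The loss is small (about 0.018) when the confident prediction is correct and much larger (about 4.018) when it is wrong.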


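To check whether a GPU is available and select the right device, we can use the line below. To actually run on that device, the model and every batch of data must be moved there with `.to(device)`.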

```
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```
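
A minimal sketch of using the selected device follows. A small `nn.Linear` layer stands in for the CNN and the random batch is made up. The snippet falls back to the CPU when no GPU is present.

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

layer = torch.nn.Linear(4, 2).to(device)   # stand-in for the CNN
batch = torch.rand(3, 4).to(device)        # inputs must live on the same device
output = layer(batch)
print(output.device)
```

If the model and its inputs are on different devices, PyTorch raises an error, so both must be moved.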


### Training the Neural Net

Once we’ve defined the class for our CNN, we need to train the net itself. This is where neural network code gets interesting. Our basic flow is a training loop: on each pass over the training data (called an “epoch”), we compute a forward pass on the network for every batch and run backpropagation to adjust the weights. We also print the loss so that we can follow it as the net trains.

Finally, we train our CNN using a simple for loop. During each epoch of training, we pass data to the model in batches whose size is set by `BATCH_SIZE` in the data loader. Each batch is passed through the `CNN` class we’ve defined, and basic metrics are printed at the end of the epoch. After each epoch, we also calculate the accuracy on our test set.

```
    for epoch in range(EPOCHS):
        for step, (x, y) in enumerate(train_data_loader):

            b_x = x.float()   # batch x (image)
            b_y = y           # batch y (target)

            output = model(b_x)[0]          # model returns (logits, flattened features)
            loss = loss_func(output, b_y)

            optimizer.zero_grad()           # clear gradients from the previous step
            loss.backward()                 # backpropagation: compute gradients
            optimizer.step()                # update the weights

        print('Current Epoch: ', epoch)
        print('Current Loss:', loss.item())  # loss of the last batch in this epoch
```


### Testing Accuracy

At the end of every training epoch we test the current accuracy of the model.

```
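        # Runs inside the epoch loop, after the training steps for that epoch.
        correct, total = 0, 0
        model.eval()                      # inference mode
        with torch.no_grad():             # no gradients needed for testing
            for step, (tx, ty) in enumerate(test_data_loader):
                test_output, last_layer = model(tx.float())
                pred_y = torch.max(test_output, 1)[1]   # index of the highest score
                correct += (pred_y == ty).sum().item()
                total += ty.size(0)
        model.train()                     # back to training mode
        accuracy = correct / total        # fraction of test samples classified correctly

        print('Epoch: ', epoch, '| train loss: %.4f' % loss.item(), '| test accuracy: %.2f' % accuracy)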
```

### Save and Load Model

```
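# Save and load the entire model.
# Note: on recent PyTorch versions, where torch.load defaults to weights_only=True,
# loading a full pickled model requires weights_only=False (only for files you trust).
torch.save(model, 'model.ckpt')
model = torch.load('model.ckpt')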

# Save and load only the model parameters (recommended).
torch.save(model.state_dict(), 'params.ckpt')
model.load_state_dict(torch.load('params.ckpt'))
```
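
The recommended parameter-only approach can be tried end to end with the sketch below. It uses a small `nn.Linear` layer as a stand-in for the CNN and an in-memory buffer instead of a file on disk. Both are assumptions made so that the snippet runs on its own.

```python
import io

import torch

model = torch.nn.Linear(3, 1)              # stand-in for the trained CNN
buffer = io.BytesIO()                      # in-memory file, avoids touching disk
torch.save(model.state_dict(), buffer)     # save only the parameters

buffer.seek(0)
restored = torch.nn.Linear(3, 1)           # must have the same architecture
restored.load_state_dict(torch.load(buffer))
restored.eval()                            # switch to inference mode

same = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(same)
```

Because only the parameters are stored, the class definition must be available and instantiated before `load_state_dict` is called.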