# Gesture Recognition


### Machine Learning

Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. The term "Machine Learning" emphasizes that the computer program (or machine) must do some work after it is given data; the learning step is made explicit. Although Machine Learning was first used to recognize patterns, researchers soon began applying it to robotics (reinforcement learning, manipulation, motion planning, grasping), to genome data, and to predicting financial markets.

<img src="./images/ml-eng.png">

### Deep Learning

Fast forward to today and what we’re seeing is a large interest in something called Deep Learning, which is a subset of Machine Learning. Deep learning is a machine learning technique that teaches computers to do what comes naturally to humans: learn by example. Deep learning is a key technology behind driverless cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a lamppost. The most popular kinds of Deep Learning models, as used in large-scale image recognition tasks, are known as Convolutional Neural Nets, or simply ConvNets.

<img src="./images/traditional-ml-deep-learning-2.png">

#### Convolutional Neural Network

A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm which can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image, and differentiate one from the other. A ConvNet requires much less pre-processing than other classification algorithms. While in primitive methods filters are hand-engineered, ConvNets can learn these filters/characteristics given enough training.

The architecture of a ConvNet is analogous to the connectivity pattern of neurons in the human brain and was inspired by the organization of the visual cortex. Individual neurons respond to stimuli only in a restricted region of the visual field known as the Receptive Field. A collection of such fields overlaps to cover the entire visual area.

<img src="./images/Typical_cnn.png">

### How it works

#### Input

In the figure below, we have an RGB image which has been separated into its three color planes: Red, Green, and Blue. Images exist in a number of such color spaces, including Grayscale, RGB, HSV, and CMYK.

<img src="./images/input-img.png">

You can imagine how computationally intensive things would get once the images reach dimensions of, say, 8K (7680×4320). The role of the ConvNet is to reduce the images into a form which is easier to process, without losing features which are critical for getting a good prediction.

#### Convolution

Think of convolution as applying a filter to our image. We slide a small matrix, usually called a kernel, over the image and output the filtered result at each position.

<img src="./images/Convolution_schematic.gif">

The objective of the convolution operation is to extract features from the input image. Early layers capture low-level features such as edges; deeper layers combine them into higher-level features.

<img src="./images/convolution-layer.gif">

There are a few parameters that get adjusted here:

    * Kernel Size – the size of the filter.
    * Kernel Type – the values of the actual filter. Some examples include identity, edge detection, and sharpen.
    * Stride – the rate at which the kernel passes over the input image. A stride of 2 moves the kernel in 2-pixel increments.
    * Padding – we can add layers of 0s to the outside of the image in order to make sure that the kernel properly passes over the edges of the image.
    * Output Layers – how many different kernels are applied to the image.

The output of the convolution process is called the “convolved feature” or “feature map.”
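
To make the roles of kernel size, stride, and padding concrete, here is a minimal pure-Python sketch of the operation. The function name `convolve2d` and the sample image and kernel are made up for illustration, and the kernel is assumed to be square. Like deep learning libraries, it slides the kernel without flipping it.

```python
def convolve2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` (lists of lists) and return the feature map."""
    # Surround the image with `padding` layers of zeros.
    width = len(image[0])
    padded = [[0] * (width + 2 * padding) for _ in range(padding)]
    padded += [[0] * padding + list(row) + [0] * padding for row in image]
    padded += [[0] * (width + 2 * padding) for _ in range(padding)]

    k = len(kernel)  # assumes a square kernel
    feature_map = []
    for top in range(0, len(padded) - k + 1, stride):
        out_row = []
        for left in range(0, len(padded[0]) - k + 1, stride):
            # Multiply the kernel element-wise with the patch under it, then sum.
            total = sum(
                kernel[i][j] * padded[top + i][left + j]
                for i in range(k)
                for j in range(k)
            )
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

Calling `convolve2d(image, kernel, padding=1)` keeps the output at 5×5, while adding `stride=2` shrinks it to 3×3.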

#### ReLU
Convolution on its own is a linear operation, so CNNs add a nonlinear function after it to help approximate the nonlinear relationships in the underlying data. ReLU (Rectified Linear Unit) is one such simple function.
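
As a small illustration, ReLU can be written in one line of plain Python. The sample values below are made up.

```python
def relu(x):
    """Rectified Linear Unit: keep positive values, replace negatives with 0."""
    return max(0.0, x)


feature_row = [-2.0, -0.5, 0.0, 1.5, 3.0]
activated = [relu(v) for v in feature_row]
print(activated)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```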

#### Max Pooling

We pass over sections of the feature map and replace each section with its highest value.

Like the convolution layer, the pooling layer reduces the computation needed to process the data by reducing its dimensionality. It also helps extract dominant features that are largely invariant to small rotations and shifts, which helps the model train effectively.
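
The pooling step can be sketched in plain Python as follows. The helper name `max_pool2d` and the sample feature map are illustrative assumptions, and the windows are taken to be non-overlapping.

```python
def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, size):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


sample = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
pooled = max_pool2d(sample)
print(pooled)  # [[6, 5], [7, 9]]
```

A 4×4 input becomes 2×2, so each pooling layer with a 2×2 window halves the width and height.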

<img src="./images/max-pooling.png">

#### Fully Connected Layers
After the convolution, ReLU, and pooling steps are applied, the resulting feature maps (which may end up looking nothing like the original image!) capture the features the model has learned to look for.

Moving on, we flatten this final output and feed it to a regular, fully connected neural network for classification.


### PyTorch

PyTorch is a Python library that can act as a replacement for NumPy while using the power of GPUs.

Let’s construct a randomly initialized matrix. Run the snippet below.

In [1]:
import torch

x = torch.rand(5, 3)
print(x)

tensor([[0.0260, 0.5003, 0.1294],
        [0.3276, 0.0727, 0.1508],
        [0.0020, 0.2896, 0.4529],
        [0.9286, 0.0560, 0.8160],
        [0.0689, 0.6063, 0.9105]])


PyTorch uses an imperative / eager paradigm. That is, each line of code required to build a graph defines a component of that graph. We can perform computations on these components independently, even before the graph is built completely. This is called the “define-by-run” methodology.
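
A minimal sketch of what this means in practice: every line below executes immediately, so intermediate values can be printed or used in ordinary Python control flow. The tensor values and the threshold are arbitrary.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])

# This line runs right away; there is no separate "compile the graph" step.
doubled = x * 2
print(doubled)            # tensor([2., 4., 6.])

# A plain Python branch decides which operation runs next.
if doubled.sum() > 10:
    result = doubled - 1
else:
    result = doubled + 1
print(result)             # tensor([1., 3., 5.])
```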

<img src="./images/pytorch-variable.gif">

#### Tensors

Tensors are nothing but multidimensional arrays. Tensors in PyTorch are similar to numpy’s ndarrays. PyTorch requires the data set to be transformed into a tensor so it can be consumed in the training and testing of the network.

In [2]:
# define a tensor
a = torch.FloatTensor([2])
b = torch.FloatTensor([3])

print(a + b)

tensor([5.])


### Model Parameters (Constants)

The batch size is the number of samples processed before the model is updated.

The number of epochs is the number of complete passes through the training dataset.

The learning rate, or step size, is a hyperparameter which determines how far the weights move on each update, that is, to what extent newly acquired information overrides old information. Training tries to move the weights toward a minimum of the loss function. Near the global minimum the model fits the training data well, although its accuracy on test or other unseen data still has to be measured separately.

<img src="./images/global-minima.png">

We must specify the batch size, number of epochs and learning rate for any learning algorithm.


In [3]:
EPOCHS = 10
BATCH_SIZE = 1
LEARNING_RATE = 0.003
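
To get a feel for these three values, the sketch below counts how many weight updates a training run performs and shows the effect of the learning rate on a toy loss f(w) = w². Both helper functions are illustrative assumptions and are not part of the notebook's training code. The sample count of 20 matches the training set loaded later.

```python
import math


def updates_per_run(num_samples, batch_size, epochs):
    """Number of weight updates: one per batch, for every epoch."""
    batches_per_epoch = int(math.ceil(num_samples / float(batch_size)))
    return batches_per_epoch * epochs


def gradient_descent(learning_rate, steps=10, start=1.0):
    """Minimize the toy loss f(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w


print(updates_per_run(num_samples=20, batch_size=1, epochs=10))  # 200

# A small step converges slowly, a moderate step faster, a huge step diverges.
for lr in (0.01, 0.1, 1.5):
    print(lr, gradient_descent(lr))
```

With a learning rate of 1.5 every step overshoots the minimum at w = 0 and the value grows, which is why the learning rate is usually kept small.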

### Loading the Data

`TRAIN_DATA_PATH` and `TEST_DATA_PATH` point to the folders of training and test images we saved earlier.


In [4]:
from os.path import dirname, abspath

# abspath('') returns the absolute path of the current working directory
# (the folder this notebook is run from).
# Call os.path.dirname twice to move two levels up from that directory.

parent_directory = dirname(dirname(abspath('')))

TRAIN_DATA_PATH = parent_directory + "/data/train/"
TEST_DATA_PATH = parent_directory + "/data/test/"

print(TRAIN_DATA_PATH)
print(TEST_DATA_PATH)

/home/rama/workspace/ai4all/data/train/
/home/rama/workspace/ai4all/data/test/


The transform parameter `TRANSFORM_IMG` can be used to preprocess the images.


In [5]:
from torchvision import transforms

TRANSFORM_IMG = transforms.Compose([
                        transforms.Grayscale(num_output_channels=1),
                        transforms.ToTensor()
                ])
print(TRANSFORM_IMG)

Compose(
    Grayscale(num_output_channels=1)
    ToTensor()
)


The companion torchvision package makes it easy to load image datasets for CNNs. Here, `ImageFolder` reads images from a directory tree in which each class has its own subfolder.


In [6]:
import torch.utils.data as data
import torchvision

train_data = torchvision.datasets.ImageFolder(root=TRAIN_DATA_PATH, transform=TRANSFORM_IMG)
train_data_loader = data.DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True,  num_workers=4)

test_data = torchvision.datasets.ImageFolder(root=TEST_DATA_PATH, transform=TRANSFORM_IMG)
test_data_loader  = data.DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)

print(train_data)
print(train_data_loader)
print('')
print(test_data)
print(test_data_loader)

Dataset ImageFolder
    Number of datapoints: 20
    Root location: /home/rama/workspace/ai4all/data/train/
<torch.utils.data.dataloader.DataLoader object at 0x7fbce47a9b10>

Dataset ImageFolder
    Number of datapoints: 2
    Root location: /home/rama/workspace/ai4all/data/test/
<torch.utils.data.dataloader.DataLoader object at 0x7fbcdf3bc290>


#### Autograd module

PyTorch uses a technique called automatic differentiation. That is, a recorder keeps track of the operations we have performed and then replays them backward to compute our gradients. This technique is especially powerful when building neural networks.
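
A minimal sketch of the recorder at work, using a made-up function y = x² + 2x:

```python
import torch

# requires_grad=True asks autograd to record operations on this tensor.
x = torch.tensor(3.0, requires_grad=True)

y = x ** 2 + 2 * x   # forward pass: the operations are recorded
y.backward()         # replay the record backward to compute dy/dx

print(x.grad)        # dy/dx = 2x + 2 = 8 at x = 3, so tensor(8.)
```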

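In the loop below, `x` is a batch of images and `y` holds the corresponding target labels. Since PyTorch 0.4, `Variable` simply returns a tensor, so the wrapping is optional and is kept here for compatibility.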

In [7]:
from torch.autograd import Variable

for step, (x, y) in enumerate(train_data_loader):
    
    b_x = Variable(x.float())   # batch x (image)
    b_y = Variable(y)   # batch y (target)
    
    print(b_x)
    print(b_y)

tensor([[[[0.4039, 0.4392, 0.4784, 0.4510, 0.5333, 0.5843, 0.6118, 0.6588,
           0.6745, 0.6902, 0.7059, 0.8235, 0.8000, 0.7843, 0.7686, 0.7569,
           0.7373, 0.7216, 0.6941, 0.6784, 0.6196, 0.5608, 0.4588, 0.4392,
           0.4549, 0.4745, 0.4588, 0.4392],
          [0.4471, 0.4196, 0.4196, 0.5451, 0.5686, 0.6078, 0.6314, 0.6471,
           0.6784, 0.6980, 0.7216, 0.8353, 0.8314, 0.7961, 0.7961, 0.7569,
           0.7529, 0.7569, 0.7294, 0.7059, 0.6549, 0.6157, 0.5333, 0.4471,
           0.4078, 0.4392, 0.4235, 0.4863],
          [0.4784, 0.4706, 0.5098, 0.5333, 0.5804, 0.6235, 0.6510, 0.6784,
           0.6824, 0.7137, 0.7176, 0.8549, 0.8157, 0.8157, 0.8078, 0.7922,
           0.7725, 0.7451, 0.7294, 0.7059, 0.6941, 0.6510, 0.5922, 0.5176,
           0.4510, 0.4078, 0.4980, 0.5216],
          [0.5294, 0.5059, 0.5608, 0.5843, 0.6078, 0.6431, 0.6549, 0.6824,
           0.6941, 0.7176, 0.6000, 0.5882, 0.6157, 0.8471, 0.8000, 0.7961,
           0.7882, 0.7686, 0.7529, 0.7255, 

tensor([[[[0.4627, 0.4275, 0.3804, 0.4824, 0.5412, 0.6000, 0.6039, 0.6510,
           0.6745, 0.6941, 0.7216, 0.7412, 0.8431, 0.8275, 0.7922, 0.7961,
           0.7804, 0.7608, 0.7451, 0.7216, 0.6980, 0.6431, 0.6235, 0.5059,
           0.4353, 0.3725, 0.4000, 0.4431],
          [0.4980, 0.4902, 0.4667, 0.5137, 0.5804, 0.6039, 0.6314, 0.6745,
           0.6902, 0.6784, 0.7294, 0.7137, 0.8627, 0.8353, 0.8314, 0.8039,
           0.7961, 0.7569, 0.7608, 0.7373, 0.7333, 0.6980, 0.6275, 0.5843,
           0.5098, 0.4392, 0.4588, 0.5098],
          [0.5294, 0.5137, 0.5294, 0.5608, 0.5961, 0.6118, 0.6549, 0.6667,
           0.6784, 0.7451, 0.6510, 0.5843, 0.6000, 0.6392, 0.8392, 0.8157,
           0.8157, 0.7922, 0.7725, 0.7373, 0.7373, 0.7020, 0.6863, 0.6196,
           0.5647, 0.5059, 0.5176, 0.5608],
          [0.5569, 0.5412, 0.5608, 0.6078, 0.6118, 0.6392, 0.6824, 0.6863,
           0.7098, 0.5804, 0.6039, 0.5922, 0.5922, 0.5961, 0.6078, 0.8196,
           0.8078, 0.8039, 0.7882, 0.7725, 

#### nn module

PyTorch autograd makes it easy to define computational graphs and take gradients, but raw autograd can be a bit too low-level for defining complex neural networks. This is where the nn (Neural Network) module of PyTorch comes into play.

A Simple Neural Network will have the following format.

```
# define model
model = torch.nn.Sequential(
                 torch.nn.Linear(input_num_units, hidden_num_units),
                 torch.nn.ReLU(),
                 torch.nn.Linear(hidden_num_units, output_num_units),
        )
loss_fn = torch.nn.CrossEntropyLoss()
```
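
The format above leaves the layer sizes undefined. A runnable version might look like the sketch below, where all three sizes, the random batch, and the labels are hypothetical values chosen only for illustration.

```python
import torch

# Hypothetical sizes, chosen only for illustration.
input_num_units = 28 * 28   # a flattened 28x28 grayscale image
hidden_num_units = 64
output_num_units = 2        # two classes

simple_model = torch.nn.Sequential(
    torch.nn.Linear(input_num_units, hidden_num_units),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden_num_units, output_num_units),
)
loss_fn = torch.nn.CrossEntropyLoss()

batch = torch.rand(4, input_num_units)    # 4 random "images"
targets = torch.tensor([0, 1, 1, 0])      # made-up class labels

logits = simple_model(batch)              # one row of class scores per image
loss = loss_fn(logits, targets)
print(logits.shape)                       # torch.Size([4, 2])
print(loss.item())
```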

Now that you know the basic components of PyTorch, you can easily build your own neural network from scratch.

### Designing your Neural Net

We’ll be making use of four major building blocks in our CNN class:

    * torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding) – applies convolution
    * torch.nn.ReLU() – applies ReLU
    * torch.nn.MaxPool2d(kernel_size, stride, padding) – applies max pooling
    * torch.nn.Linear(in_features, out_features) – fully connected layer (multiplies inputs by learned weights and adds a bias)

We will create a CNN class with two methods: a constructor that defines the layers, and forward(), which computes a forward pass of the CNN through the convolution, ReLU, and pooling steps we outlined above.


In [8]:
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self):
        """
        In the constructor we instantiate two nn.Sequential (Convolution) modules and 
        one nn.Linear (Fully connected) module and assign them as member variables.
        """
        super(CNN, self).__init__()
        self.conv1 = nn.Sequential(         # input shape (1, 28, 28)
            nn.Conv2d(
                in_channels=1,              # input channels (1 for grayscale)
                out_channels=16,            # n_filters
                kernel_size=5,              # filter size
                stride=1,                   # filter movement/step
                padding=2,                  # to keep the same width and height after Conv2d,
                                            #     use padding=(kernel_size-1)/2 when stride=1
            ),                              # output shape (16, 28, 28)
            nn.ReLU(),                      # activation
            nn.MaxPool2d(kernel_size=2),    # choose max value in 2x2 area, output shape (16, 14, 14)
        )
        self.conv2 = nn.Sequential(         # input shape (16, 14, 14)
            nn.Conv2d(16, 32, 5, 1, 2),     # output shape (32, 14, 14)
            nn.ReLU(),                      # activation
            nn.MaxPool2d(2),                # output shape (32, 7, 7)
        )
        self.out = nn.Linear(32 * 7 * 7, 2) # fully connected layer, output 2 classes

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary (differentiable) operations on Tensors.
        """
        x = self.conv1(x)
        x = self.conv2(x)
        x = x.view(x.size(0), -1)           # flatten the output of conv2 to (batch_size, 32 * 7 * 7)
        output = self.out(x)
        return output, x                    # return x for visualization


#The Neural Net can then be initialized in a single line as.

model = CNN()
print(model)

CNN(
  (conv1): Sequential(
    (0): Conv2d(1, 16, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (conv2): Sequential(
    (0): Conv2d(16, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (out): Linear(in_features=1568, out_features=2, bias=True)
)
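
The `in_features=1568` of the final layer follows from the shape comments in the class. A short sketch of that arithmetic, using a hypothetical helper `conv_output_size` that applies the usual output-size formula for convolution and pooling layers:

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Width (or height) after a convolution or pooling layer."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 28                                                           # input is 28x28
size = conv_output_size(size, kernel_size=5, stride=1, padding=2)   # conv1 -> 28
size = conv_output_size(size, kernel_size=2, stride=2)              # pool1 -> 14
size = conv_output_size(size, kernel_size=5, stride=1, padding=2)   # conv2 -> 14
size = conv_output_size(size, kernel_size=2, stride=2)              # pool2 -> 7

flattened = 32 * size * size   # 32 channels come out of conv2
print(size, flattened)         # 7 1568
```

If the input images had a different size, the `32 * 7 * 7` in the constructor would have to change accordingly.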


We’ll also define our loss and optimizer functions that the CNN will use to find the right weights. We’ll be using Cross Entropy Loss (Log Loss) as our loss function, which strongly penalizes high confidence in the wrong answer. The optimizer is the popular Adam algorithm (not a person!).

In [9]:
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
loss_func = nn.CrossEntropyLoss()

print(optimizer)
print(loss_func)

Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.003
    weight_decay: 0
)
CrossEntropyLoss()
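
To see why cross entropy punishes confident mistakes, consider the loss for a single sample, which is the negative log of the probability assigned to the correct class. The sketch below uses plain Python and a hypothetical helper name. `nn.CrossEntropyLoss` itself takes raw scores and applies the softmax internally.

```python
import math


def cross_entropy(prob_of_true_class):
    """Loss for one sample: -log of the probability given to the correct class."""
    return -math.log(prob_of_true_class)


for p in (0.9, 0.5, 0.01):
    print('p(correct) = %.2f -> loss = %.4f' % (p, cross_entropy(p)))
```

The three losses come out as roughly 0.1054, 0.6931, and 4.6052. A loss near 0.69 therefore corresponds to a 50/50 guess between two classes, and being 99% sure of the wrong answer costs more than forty times as much as being 90% sure of the right one.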


To check whether a GPU is available and select the right device for PyTorch, we can use:


In [10]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0
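
Selecting a device does not move anything by itself: the model and every batch have to be sent to it explicitly. A minimal sketch, using a stand-in linear layer and a random batch in place of the CNN and the data loader:

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stand-in model and batch, used only to keep this sketch self-contained.
stand_in_model = nn.Linear(4, 2).to(device)   # moves the parameters to `device`
batch = torch.rand(3, 4).to(device)           # inputs must be on the same device

output = stand_in_model(batch)
print(output.device)
```

The same two `.to(device)` calls would apply to `model` and to each `x`, `y` batch in this notebook.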


### Training the Neural Net

Once we’ve defined the class for our CNN, we need to train the net itself. This is where neural network code gets interesting. Our basic flow is a training loop: on each pass through the training data (called an “epoch”), we compute a forward pass on the network and run backpropagation to adjust the weights. We also print the loss so that we can watch it change as the net trains.

We train our CNN using a simple for loop. During each epoch, we pass data to the model in batches whose size was set by `BATCH_SIZE` when we created the data loader. Each batch is run through the CNN class we’ve defined, and the loss of the last batch is printed at the end of every epoch.


In [11]:
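for epoch in range(EPOCHS):

    for step, (x, y) in enumerate(train_data_loader):

        b_x = Variable(x.float())   # batch x (image)
        b_y = Variable(y)           # batch y (target)

        output = model(b_x)[0]          # forward pass; [0] keeps the class scores
        loss = loss_func(output, b_y)   # compare the scores with the targets

        optimizer.zero_grad()   # clear gradients left over from the previous step
        loss.backward()         # backpropagation: compute new gradients
        optimizer.step()        # update the weights

    # loss of the last batch in this epoch
    print('Current Epoch: ', epoch)
    print('Current Loss:', loss.data)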

('Current Epoch: ', 0)
('Current Loss:', tensor(0.6881))
('Current Epoch: ', 1)
('Current Loss:', tensor(0.6919))
('Current Epoch: ', 2)
('Current Loss:', tensor(0.6076))
('Current Epoch: ', 3)
('Current Loss:', tensor(0.0143))
('Current Epoch: ', 4)
('Current Loss:', tensor(0.2198))
('Current Epoch: ', 5)
('Current Loss:', tensor(0.0236))
('Current Epoch: ', 6)
('Current Loss:', tensor(0.0053))
('Current Epoch: ', 7)
('Current Loss:', tensor(0.0015))
('Current Epoch: ', 8)
('Current Loss:', tensor(0.0050))
('Current Epoch: ', 9)
('Current Loss:', tensor(0.0003))


### Testing Accuracy

After training, we measure the accuracy of the model on the test set. For each test batch, the loop below prints the last training loss together with the test accuracy. Because the model is no longer being updated at this point, the repeated passes over `EPOCHS` produce the same result each time.


In [12]:
for epoch in range(EPOCHS):
    
    for _, (tx, ty) in enumerate(test_data_loader):

        test_x = Variable(tx)
        test_y = Variable(ty)

        test_output, last_layer = model(test_x)

        pred_y = torch.max(test_output, 1)[1].data.squeeze()
        accuracy = sum(pred_y == test_y) / float(test_y.size(0))

        print('Epoch: ', epoch, '| train loss: %.4f' % loss.data, '| test accuracy: %.2f' % accuracy)


('Epoch: ', 0, '| train loss: 0.0003', '| test accuracy: 1.00')
('Epoch: ', 0, '| train loss: 0.0003', '| test accuracy: 1.00')
('Epoch: ', 1, '| train loss: 0.0003', '| test accuracy: 1.00')
('Epoch: ', 1, '| train loss: 0.0003', '| test accuracy: 1.00')
('Epoch: ', 2, '| train loss: 0.0003', '| test accuracy: 1.00')
('Epoch: ', 2, '| train loss: 0.0003', '| test accuracy: 1.00')
('Epoch: ', 3, '| train loss: 0.0003', '| test accuracy: 1.00')
('Epoch: ', 3, '| train loss: 0.0003', '| test accuracy: 1.00')
('Epoch: ', 4, '| train loss: 0.0003', '| test accuracy: 1.00')
('Epoch: ', 4, '| train loss: 0.0003', '| test accuracy: 1.00')
('Epoch: ', 5, '| train loss: 0.0003', '| test accuracy: 1.00')
('Epoch: ', 5, '| train loss: 0.0003', '| test accuracy: 1.00')
('Epoch: ', 6, '| train loss: 0.0003', '| test accuracy: 1.00')
('Epoch: ', 6, '| train loss: 0.0003', '| test accuracy: 1.00')
('Epoch: ', 7, '| train loss: 0.0003', '| test accuracy: 1.00')
('Epoch: ', 7, '| train loss: 0.0003', '
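
The loop above reports accuracy one batch at a time. A single figure for the whole test set can be computed with a helper along these lines. This is a sketch, not the notebook's own method: the function name `evaluate`, the stand-in model, and the hand-made batches are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


def evaluate(model, loader):
    """Return the fraction of samples in `loader` that `model` classifies correctly."""
    model.eval()                       # put layers such as dropout into inference mode
    correct, total = 0, 0
    with torch.no_grad():              # gradients are not needed for evaluation
        for images, labels in loader:
            outputs = model(images)
            # The CNN class in this notebook returns (output, features); keep the output.
            logits = outputs[0] if isinstance(outputs, tuple) else outputs
            predictions = logits.argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    return correct / float(total)


# Hypothetical stand-in model: an identity layer, so the predicted class is
# simply the index of the larger input value.
stand_in = nn.Linear(2, 2)
with torch.no_grad():
    stand_in.weight.copy_(torch.eye(2))
    stand_in.bias.zero_()

# Two hand-made batches in place of a real DataLoader; the last label is wrong on purpose.
fake_loader = [
    (torch.tensor([[1.0, 0.0], [0.0, 1.0]]), torch.tensor([0, 1])),
    (torch.tensor([[0.0, 1.0]]), torch.tensor([0])),
]

accuracy = evaluate(stand_in, fake_loader)
print('test accuracy: %.2f' % accuracy)   # 2 of 3 correct -> 0.67
```

With the notebook's objects, the call would be `evaluate(model, test_data_loader)`.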

### Save and Load Model

Saving

In [13]:
torch.save({'state_dict': model.state_dict()}, '../../data/checkpoint.pth.tar')
print(model.state_dict())

OrderedDict([('conv1.0.weight', tensor([[[[ 0.1714,  0.1695, -0.1754,  0.1610,  0.1165],
          [ 0.0948, -0.1709, -0.0858,  0.1308, -0.0206],
          [-0.0683, -0.0954, -0.0405,  0.1121,  0.1950],
          [-0.1576, -0.1964,  0.1961,  0.1877, -0.0082],
          [-0.1260, -0.0954,  0.1295, -0.1168,  0.0749]]],


        [[[ 0.0392,  0.0322, -0.1547,  0.2144,  0.0637],
          [ 0.1612,  0.0127,  0.0175, -0.0431,  0.1231],
          [-0.1211,  0.1276,  0.1531,  0.0741,  0.1507],
          [ 0.0872, -0.2289, -0.2112, -0.1895,  0.0331],
          [-0.0756,  0.1316, -0.1575,  0.0604, -0.0769]]],


        [[[-0.0179,  0.1728, -0.1148,  0.0221,  0.1647],
          [ 0.1089, -0.1034, -0.1346, -0.1794, -0.0760],
          [ 0.1683, -0.0468,  0.0269, -0.2315, -0.1304],
          [-0.1319,  0.1508,  0.0865,  0.1156,  0.0819],
          [ 0.0154,  0.2225,  0.0591,  0.0113,  0.0834]]],


        [[[ 0.1857,  0.2133, -0.1062,  0.1494,  0.0666],
          [ 0.0713, -0.1152, -0.1240,  0.119

Loading

In [15]:
new_model = CNN()

# Model will have different state_dict() at this point
# print(new_model.state_dict())

checkpoint = torch.load('../../data/checkpoint.pth.tar')
new_model.load_state_dict(checkpoint['state_dict'])
print(new_model.state_dict())

OrderedDict([('conv1.0.weight', tensor([[[[ 0.1714,  0.1695, -0.1754,  0.1610,  0.1165],
          [ 0.0948, -0.1709, -0.0858,  0.1308, -0.0206],
          [-0.0683, -0.0954, -0.0405,  0.1121,  0.1950],
          [-0.1576, -0.1964,  0.1961,  0.1877, -0.0082],
          [-0.1260, -0.0954,  0.1295, -0.1168,  0.0749]]],


        [[[ 0.0392,  0.0322, -0.1547,  0.2144,  0.0637],
          [ 0.1612,  0.0127,  0.0175, -0.0431,  0.1231],
          [-0.1211,  0.1276,  0.1531,  0.0741,  0.1507],
          [ 0.0872, -0.2289, -0.2112, -0.1895,  0.0331],
          [-0.0756,  0.1316, -0.1575,  0.0604, -0.0769]]],


        [[[-0.0179,  0.1728, -0.1148,  0.0221,  0.1647],
          [ 0.1089, -0.1034, -0.1346, -0.1794, -0.0760],
          [ 0.1683, -0.0468,  0.0269, -0.2315, -0.1304],
          [-0.1319,  0.1508,  0.0865,  0.1156,  0.0819],
          [ 0.0154,  0.2225,  0.0591,  0.0113,  0.0834]]],


        [[[ 0.1857,  0.2133, -0.1062,  0.1494,  0.0666],
          [ 0.0713, -0.1152, -0.1240,  0.119

You have now successfully trained and tested your CNN. The only thing left to do is to use the trained model to predict whether the desired gesture is present in a given image stream.