# Hand Gesture Recognition & Air Drawing

## Overview:
In this notebook we explore hand gesture (open/closed) recognition with computer vision, using one naive technique and two machine learning techniques. We develop a system that recognizes hand gestures in real time, complemented with an "air drawing" feature that draws on a board while the triggering gesture (closed hand) is recognized.

The notebook also includes a utility for generating labeled training data on the fly.


## Contents:
1. [Hand gesture detection & Air Drawing: naive technique](#naive)
2. ML models for gesture recognition: 
    1. [Hand gesture detection: clustering approach](#cluster) Unsupervised learning with clustering to detect whether the hand is closed or open.
    2. [Hand gesture detection: multilayer NN approach](#neuron) Supervised learning: training a neural network (ANN) to classify whether the hand is closed or open, plus a utility to generate labeled training data on the fly.



## [Hand gesture detection & Air Drawing: naive technique](#naive)

The technique calculates the distance between key landmarks and compares it with a manually set threshold to decide whether the hand is open or closed. If the distance falls below the threshold, the hand is classified as closed; if it exceeds the threshold, the hand is classified as open. This is a simple, naive model with clear limitations: for example, the threshold does not adapt when the hand moves farther from the camera, where the same gesture produces a smaller distance.
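One way to soften the camera-distance limitation is to compare a ratio of two landmark distances rather than a raw distance. The sketch below is not what this notebook implements: `Point`, `is_hand_closed`, and the `ratio_threshold` of 0.5 are hypothetical, and the coordinates are made up to mimic the same open hand near and far from the camera.

```python
import math
from collections import namedtuple

# Stand-in for a MediaPipe landmark; only the x and y fields are used here.
Point = namedtuple("Point", ["x", "y"])


def distance(p, q):
    """Euclidean distance between two points in normalized image coordinates."""
    return math.hypot(p.x - q.x, p.y - q.y)


def is_hand_closed(wrist, index_base, index_tip, ratio_threshold=0.5):
    """Hypothetical rule: closed when finger length / palm length is small."""
    palm = distance(wrist, index_base)
    if palm == 0:
        return False  # degenerate detection, make no claim
    return distance(index_base, index_tip) / palm < ratio_threshold


# Made-up poses: (wrist, index finger base, index fingertip).
near_open = (Point(0.5, 0.9), Point(0.5, 0.5), Point(0.5, 0.1))
# Same open pose shrunk around the image centre, as if the hand were far away.
far_open = tuple(
    Point(0.5 + (p.x - 0.5) * 0.2, 0.5 + (p.y - 0.5) * 0.2) for p in near_open
)
closed_pose = (Point(0.5, 0.9), Point(0.5, 0.5), Point(0.5, 0.55))

# A fixed threshold of 0.1 on the raw distance mislabels the far open hand.
print(distance(far_open[1], far_open[2]) < 0.1)  # True (would read as closed)
print(is_hand_closed(*far_open))                 # False (ratio is unchanged)
print(is_hand_closed(*closed_pose))              # True
```

The ratio is unchanged under uniform scaling of the hand in the image, which is the property the fixed threshold lacks. It would still need tuning on real landmarks.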

In any case, the "air drawing" feature is built and demonstrated here. While the hand is closed, its movement draws on the board.

In [1]:
import cv2
import mediapipe as mp
import numpy as np
import math

#Class for detecting hands

class HandDetector:
    def __init__(self, threshold=0.1,n_frames=5):
        """
        Initializes HandDetector
        
        Args:
        - threshold (float): Distance threshold for open/closed hand gesture detection.
        - n_frames (int): Number of frames to use for averaging index finger position, which is the one serving as the pointer
                          in the board. Introduced this idea to make it more robust to quick movements.
        
        """
        self.hands = mp.solutions.hands.Hands() #MediaPipe hands module
        self.mpDraw = mp.solutions.drawing_utils #initialize drawing module
        self.threshold = threshold
        self.index_start = None 
        self.mano_abierta = False 
        self.n_frames=n_frames
        self.history = []
    
    def update_index_history(self, index_start):
        """
        Updates history of index finger positions.

        Args:
        - index_start: MediaPipe landmark (with x, y attributes) used as the pointer position.
        
        """
        self.history.append(index_start) # adds new position
        self.history = self.history[-self.n_frames:] # takes the last n_frames positions

    def calculate_distance(self, point1, point2):
        """
        Calculates Euclidean distance between two mediapipe coordinates

        Args:
        - point1: First point.
        - point2: Second point.
        
        Returns:
        - float: Euclidean distance between the points.
        
        """
        return math.sqrt((point1.x - point2.x)**2 + (point1.y - point2.y)**2)

    def detect_hand(self, frame):
        """
        Detects hands in the frame.

        Args:
        - frame: Input frame.
        
        """
        self.index_start = None  # Reset index_start value for each iteration
        # try to detect hands; MediaPipe expects RGB input while OpenCV delivers BGR
        results = self.hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        # if hands are detected
        if results.multi_hand_landmarks:
            # draw landmarks on the frame
            for handLms in results.multi_hand_landmarks:
                self.mpDraw.draw_landmarks(frame, handLms, mp.solutions.hands.HAND_CONNECTIONS)

                thumb_tip = handLms.landmark[4] # thumb tip (kept for reference, not used by the rule below)
                index_tip = handLms.landmark[8] # index fingertip
                self.index_start = handLms.landmark[5]  # base of the index finger, used as the pointer

                distance = self.calculate_distance(self.index_start, index_tip)
                cv2.putText(frame, f"Distance: {distance}", (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

                if distance < self.threshold: #then hand is closed
                    self.mano_abierta = False
                    cv2.putText(frame, f"Closed hand", (100, 100), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
                    self.update_index_history(self.index_start)
                else: #distance at or above the threshold, so the hand is open
                    self.mano_abierta = True
                    self.history=[] #clean the index history to avoid corrupting the next piece of drawing
                    cv2.putText(frame, f"Open hand", (100, 100), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
                    
    
    def calculate_average_index_position(self, history):
        """
        Calculates average index finger position based on history.

        Args:
        - history (list): List of index finger positions.

        Returns:
        - list: Average x,y coordinates for index finger position.
        
        """

        index_x_list=np.array([index_coord.x for index_coord in history])
        index_y_list=np.array([index_coord.y for index_coord in history])
        return [index_x_list.mean(), index_y_list.mean()]     
        
    def get_average_index_position(self):
        """
        Retrieves average index finger position.

        Returns:
        - list: Average index finger position.
        
        """
        return self.calculate_average_index_position(self.history)

    
# Class for drawing
class HandDrawer:
    def __init__(self, drawing, reverse=True):
        """
        Initializes HandDrawer

        Args:
        - drawing (numpy.ndarray): Initial drawing frame.
        - reverse (bool): Flag to indicate if drawing is reversed.
        
        """
        self.drawing = drawing
        self.drawing_pointer = drawing
        self.reverse = reverse

    def draw_hand(self, average_index_position):
        """
        Draws averaged index finger position on the drawing frame.

        Args:
        - average_index_position: Average position of index finger.
        
        """
        if average_index_position[0]>=0:
            h, w, c = self.drawing.shape
            cx, cy = int(average_index_position[0] * w), int(average_index_position[1] * h)
            if self.reverse:
                cv2.circle(self.drawing, (w - cx, cy), 10, (0, 0, 255), -1) #draw mirrored
            else:
                cv2.circle(self.drawing, (cx, cy), 10, (0, 0, 255), -1) #draw
    
    def update_pointer(self, index_position):
        """
        Updates drawing pointer position based on index finger coords.

        Args:
        - index_position: Position of index finger.
        
        """
        self.drawing_pointer = self.drawing.copy()
        h, w, c = self.drawing_pointer.shape
        cx, cy = int(index_position.x * w), int(index_position.y * h)
        if self.reverse:
            cv2.circle(self.drawing_pointer, (w - cx, cy), 10, (0, 255, 255), -1)
        else:
            cv2.circle(self.drawing_pointer, (cx, cy), 10, (0, 255, 255), -1)

    def show_frames(self, original_frame):
        """
        Displays original and drawing frames.

        Args:
        - original_frame (numpy.ndarray): Original frame.
        
        """
        cv2.imshow("original", original_frame)
        cv2.imshow("drawing", self.drawing_pointer)

# flips the frame horizontally     
class Reversor:
    @staticmethod
    def revert_frame(frame):
        reverted_frame = frame[:, ::-1]
        return reverted_frame

def main():
    # Capture video. Initialize before main loop
    cap = cv2.VideoCapture(0)
    # Read the frame
    ret, frame = cap.read()

    # Initialize hand detector and drawer
    hand_detector = HandDetector(n_frames=5)
    hand_drawer = HandDrawer(np.zeros_like(frame))
    
    # Main loop
    while True:
        ret, frame = cap.read()
        if not ret: # stop if the camera did not deliver a frame
            break
        # Detect hand in the frame
        hand_detector.detect_hand(frame)
        
        # Uncomment the following line if you want to reverse the frame
        # frame = Reversor.revert_frame(frame)
        
        # Update pointer position if index finger detected
        if hand_detector.index_start is not None:
            index_position = hand_detector.index_start
            hand_drawer.update_pointer(index_position)
            
        # Draw with hand movement if hand is closed and history is complete (for averaging)
        if not hand_detector.mano_abierta and len(hand_detector.history) == hand_detector.n_frames:
            index_position = hand_detector.get_average_index_position()
            hand_drawer.draw_hand(index_position)
            
        # Display both frames    
        hand_drawer.show_frames(frame)

        k = cv2.waitKey(1)
        if k == ord("q"):
            break

    cap.release()
    cv2.destroyAllWindows()


main()
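The pointer smoothing used by `update_index_history` and `calculate_average_index_position` above is a moving average over the last `n_frames` positions. As a minimal sketch of the same idea outside the camera loop, the hypothetical `PointerSmoother` below keeps the window in a `collections.deque`; the class name and the sample coordinates are assumptions for illustration only.

```python
from collections import deque


class PointerSmoother:
    """Hypothetical helper: averages the last n_frames (x, y) positions."""

    def __init__(self, n_frames=5):
        # A bounded deque drops the oldest position automatically.
        self.history = deque(maxlen=n_frames)

    def update(self, x, y):
        self.history.append((x, y))

    def reset(self):
        # Called when the hand opens, so a new stroke starts from scratch.
        self.history.clear()

    def average(self):
        # Like the notebook, only report once the window is full.
        if len(self.history) < self.history.maxlen:
            return None
        n = len(self.history)
        return (
            sum(p[0] for p in self.history) / n,
            sum(p[1] for p in self.history) / n,
        )


smoother = PointerSmoother(n_frames=3)
for x, y in [(0.0, 0.0), (0.25, 0.5), (0.5, 0.25), (1.0, 1.0)]:
    smoother.update(x, y)
    print(smoother.average())  # None until three positions are stored
```

A larger window gives a steadier stroke at the cost of more lag between the hand and the drawn point.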

## [Hand gesture detection: clustering approach](#cluster)

The technique involves applying a clustering algorithm (K-means with K = 2) to the extracted features from hand landmarks. We determine if the hand is open/closed depending on whether specific landmarks belong to the same cluster or not.

In [1]:
import numpy as np
from sklearn.cluster import KMeans
import cv2
import mediapipe as mp

# Open cap
cap = cv2.VideoCapture(0)
# Initialize mediapipe
hands = mp.solutions.hands.Hands()
mpDraw = mp.solutions.drawing_utils
# Initialize KMeans clustering with 2 clusters (to distinguish open/close)
cluster = KMeans(n_clusters=2)


while True:
    ret, frame = cap.read()
    
    # MediaPipe expects RGB input while OpenCV delivers BGR
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    
    # If hands are detected
    if results.multi_hand_landmarks:
        # Draw landmarks on the frame
        for handLms in results.multi_hand_landmarks:
            mpDraw.draw_landmarks(frame, handLms, mp.solutions.hands.HAND_CONNECTIONS)
            # Extract coordinates of hand landmarks
            coord = np.array([[handLms.landmark[i].x, handLms.landmark[i].y] for i in range(len(handLms.landmark))])
            #Apply K means 
            result = cluster.fit(coord)
            
            # Compare the cluster labels of landmark 16 (ring fingertip) and landmark 0 (wrist)
            # If they belong to the same cluster, consider the hand as closed
            
            if result.labels_[16] == result.labels_[0]:
                cv2.putText(frame, f"Closed hand", (100, 100), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
            else:
                cv2.putText(frame, f"Open hand", (100, 100), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    # Display the frame
    cv2.imshow("Frame", frame)
    
    k = cv2.waitKey(1)
    
    if k == ord("q"):
        break
        
cv2.destroyAllWindows()
cap.release()
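To see the same-cluster rule in isolation, the sketch below runs a hand-rolled two-cluster K-means (plain Python, standing in for scikit-learn's `KMeans`) on a made-up 21-point hand. `toy_hand`, `two_means`, `looks_closed`, the deterministic initialization, and the geometry are all assumptions for illustration; real landmark layouts are messier.

```python
import math


def toy_hand(closed):
    """Return 21 (x, y) points: wrist first, then 4 joints for each of 5 fingers."""
    joint_heights = (0.4, 0.5, 0.4, 0.3) if closed else (0.4, 0.6, 0.8, 1.0)
    points = [(0.0, 0.0)]  # index 0: wrist
    for finger_x in (-0.2, -0.1, 0.0, 0.1, 0.2):
        points.extend((finger_x, y) for y in joint_heights)
    return points  # index 16 is the tip of the fourth finger (ring)


def two_means(points, iterations=20):
    """Minimal Lloyd's algorithm with K=2; returns one cluster label per point."""
    # Deterministic start (an assumption of this sketch): the first point
    # and the point farthest from it.
    first = points[0]
    centers = [first, max(points, key=lambda p: math.dist(p, first))]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for k in (0, 1):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centers[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def looks_closed(points):
    """Same rule as the cell above: wrist and ring fingertip share a cluster."""
    labels = two_means(points)
    return labels[16] == labels[0]


print(looks_closed(toy_hand(closed=True)))   # True
print(looks_closed(toy_hand(closed=False)))  # False
```

The outcome depends on both the geometry and the initialization, so the rule is best read as a heuristic, not a guaranteed separation of open and closed hands.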



## [Hand gesture detection: multilayer NN approach](#neuron)

This technique trains an artificial neural network (ANN) to classify hand gestures as open or closed based on hand landmark coordinates. Unlike the clustering approach, it is supervised: it uses labeled training data that we generate on the fly.
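Each training sample is the set of 21 hand landmarks flattened into 63 values (x, y, z per landmark), paired with a label. The sketch below shows that layout without a camera; `Landmark`, `to_feature_vector`, and the fake coordinates are hypothetical stand-ins, while the label values match `LABEL_CLOSED` and `LABEL_OPEN` defined in the cells that follow.

```python
from collections import namedtuple

# Stand-in for a MediaPipe landmark with normalized x, y, z fields.
Landmark = namedtuple("Landmark", ["x", "y", "z"])

LABEL_CLOSED = 0
LABEL_OPEN = 1


def to_feature_vector(landmarks):
    """Flatten landmarks to [x0, y0, z0, x1, y1, z1, ...]."""
    return [value for lm in landmarks for value in (lm.x, lm.y, lm.z)]


# Fake hand: 21 landmarks with arbitrary coordinates.
fake_hand = [Landmark(i / 100, i / 50, 0.0) for i in range(21)]

features = to_feature_vector(fake_hand)
sample = (features, LABEL_OPEN)  # one (X row, y value) pair

print(len(features))  # 63
```

This is the same row-major order that `np.array(...).reshape(1, 63)` produces in the capture function.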

In [3]:
import tensorflow as tf
import cv2
import mediapipe as mp
import numpy as np
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.callbacks import TensorBoard

from time import strftime
import os

### Capturing Training Data on the fly

In [4]:
LABEL_CLOSED = 0
LABEL_OPEN = 1

#Function to generate training data
def get_data(data_size=1000):
    """
    Capture hand landmark data for training for closed and open hand gestures.

    Args:
    - data_size (int): Size of the dataset.

    Returns:
    - X (numpy.ndarray): Feature matrix containing hand landmark data.
    - y (numpy.ndarray): Target vector containing labels for hand gestures.
    """

    cap = cv2.VideoCapture(0) #open cap
    #initialize mediapipe
    hands = mp.solutions.hands.Hands(max_num_hands=1) #track a single hand so each frame adds at most one sample
    mpDraw = mp.solutions.drawing_utils
    
    # where coords will be stored
    coords_closed_hands=[]
    coords_open_hands=[]


    input("Please open your hand and press any key")
    
    
    while True:
        
        if len(coords_open_hands)==data_size/2: #take open hand coords until we have as many as half of the size of the data set
            break
        
        ret, frame = cap.read()

        # MediaPipe expects RGB input while OpenCV delivers BGR
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        
        #if hands detected
        if results.multi_hand_landmarks:
            # draw landmarks
            for handLms in results.multi_hand_landmarks:
                mpDraw.draw_landmarks(frame, handLms, mp.solutions.hands.HAND_CONNECTIONS)
                
                # capture landmarks
                coord = np.array([[handLms.landmark[i].x, handLms.landmark[i].y, handLms.landmark[i].z] for i in range(len(handLms.landmark))])
                coord = coord.reshape(1,63)
                coords_open_hands.append(coord) #add to the list
                cv2.putText(frame, f"Count:{len(coords_open_hands)}", (100, 100), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2) #samples captured so far
                
                
        cv2.imshow("Frame", frame)

        k = cv2.waitKey(1)

        if k == ord("q"):
            break
            
    cv2.destroyAllWindows()
    cap.release()
    
    cap = cv2.VideoCapture(0)
    input("Please close your hand and press any key")
    
    # Now same idea but for closed hand landmarks
    while True:
        
        if len(coords_closed_hands)==data_size/2:
            break
        
        ret, frame = cap.read()

        # MediaPipe expects RGB input while OpenCV delivers BGR
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        if results.multi_hand_landmarks:
            for handLms in results.multi_hand_landmarks:
                mpDraw.draw_landmarks(frame, handLms, mp.solutions.hands.HAND_CONNECTIONS)

                coord = np.array([[handLms.landmark[i].x, handLms.landmark[i].y, handLms.landmark[i].z] for i in range(len(handLms.landmark))])
                coord = coord.reshape(1,63)
                coords_closed_hands.append(coord)
                cv2.putText(frame, f"Count:{len(coords_closed_hands)}", (100, 100), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2) #samples captured so far
                
                
        cv2.imshow("Frame", frame)

        k = cv2.waitKey(1)

        if k == ord("q"):
            break

    cv2.destroyAllWindows()
    cap.release()
    
    closed_coords_X, open_coords_X = np.array(coords_closed_hands), np.array(coords_open_hands)
    
    # generate labels for each group of coords
    closed_data_y = np.zeros((int(data_size/2),1), dtype=np.uint8)
    open_data_y = np.ones((int(data_size/2),1), dtype=np.uint8)
    
    # concatenate feature matrices and target vectors
    X = np.concatenate((closed_coords_X,open_coords_X), axis=0)
    X = X.reshape(data_size, 63)
    
    y = np.concatenate((closed_data_y,open_data_y), axis=0)
    
    
    print("Data succesfully captured!")

    return X,y

In [5]:
DATA_SIZE = 500
X,y = get_data(DATA_SIZE) # getting 250 closed hand coordinates and 250 open hand coordinates for training 

Please open your hand and press any key
Please close your hand and press any key
Data succesfully captured!


In [6]:
# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# further split the training set into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
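For the 500 samples captured above, the two splits leave roughly 54% of the data for training, 13% for validation, and 33% for testing. The arithmetic can be sketched as follows, assuming (to my understanding of scikit-learn's behavior) that the held-out portion is rounded up and the remainder goes to the other side; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size, rounding the held-out part up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


# First split: 33% of all samples go to the test set.
n_trainval, n_test = split_sizes(500, 0.33)
# Second split: 20% of the remainder goes to the validation set.
n_train, n_val = split_sizes(n_trainval, 0.2)

print(n_train, n_val, n_test)  # 268 67 165
```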

### Training the ANN model

In [7]:
#create a tensorboard callback

def get_tensorboard(model_name):
    folder_name = f'{model_name} at {strftime("%H %M")}'
    dir_paths = os.path.join(LOG_DIR, folder_name)

    try:
        os.makedirs(dir_paths)
    except OSError as err:
        print(err.strerror)
    else:
        print('Successfully created directory')
    return TensorBoard(log_dir=dir_paths, histogram_freq=1)  

LOG_DIR = 'tensorboard_logs/'

In [8]:
# Define model architecture

model_1 = Sequential([
    Dense(units=128, input_dim=63, activation='relu', name='m1_hidden1'),
    Dense(units=64, activation='relu', name='m1_hidden2'),
    Dense(16, activation='relu', name='m1_hidden3'),
    Dense(1, activation='sigmoid', name='m1_output') #sigmoid activation for binary classification
])

# Compile the model
model_1.compile(optimizer='adam',
                loss='binary_crossentropy', #for binary classification
                metrics=['accuracy'])

# Train the model
nr_epochs = 20
model_1.fit(X_train, y_train, epochs=nr_epochs,
            callbacks=[get_tensorboard('model_1')],#tensorboard callback for visualization
            verbose=1, validation_data=(X_val, y_val))

Successfully created directory
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x2b980125be0>

In [9]:
test_loss, test_accuracy = model_1.evaluate(X_test, y_test)
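To get a feel for the size of `model_1`, the number of trainable parameters can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below does that arithmetic for the 63-128-64-16-1 architecture; `dense_params` is a hypothetical helper, and the totals are computed here rather than read from Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Input size followed by the width of each Dense layer in model_1.
layer_sizes = [63, 128, 64, 16, 1]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [8192, 8256, 1040, 17]
print(sum(per_layer))  # 17505
```

With only a few hundred training samples and about 17,500 parameters, the network can easily memorize the training set, which is one reason to watch the validation metrics during training.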



In [10]:
# Define model architecture with dropout layers. That's the only change wrt model_1

model_drop = Sequential([
    Dropout(0.2, input_shape=(63,)), #the input shape is declared on the first layer of the model
    Dense(units=128, activation='relu', name='m1_hidden1'),
    Dropout(0.2),
    Dense(units=64, activation='relu', name='m1_hidden2'),
    Dense(16, activation='relu', name='m1_hidden3'),
    Dense(1, activation='sigmoid', name='m1_output')
])

# Compile the model
model_drop.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])

# Train the model
nr_epochs = 20
model_drop.fit(X_train, y_train, epochs=nr_epochs,
            callbacks=[get_tensorboard('model_drop')],
            verbose=1, validation_data=(X_val, y_val))

Successfully created directory
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x2b9801de828>
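As a reminder of what the `Dropout(0.2)` layers do during training: each input is zeroed with probability 0.2, and the surviving values are scaled by 1 / (1 - 0.2) so the expected sum is unchanged; at inference time the layer passes values through untouched. The plain-Python sketch below illustrates the training-time behavior only; `inverted_dropout` is a hypothetical function, not the Keras implementation.

```python
import random


def inverted_dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)  # seeded so the run is repeatable
activations = [1.0] * 10
dropped = inverted_dropout(activations, rate=0.2, rng=rng)

# Every entry is either dropped (0.0) or scaled up (about 1.25).
print(dropped)
```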

### Using the model to perform real-time detections

In [11]:
#Detecting whether hand is open/closed with our ANN. 

cap = cv2.VideoCapture(0)
hands = mp.solutions.hands.Hands()
mpDraw = mp.solutions.drawing_utils


while True:
    ret, frame = cap.read()
    
    # MediaPipe expects RGB input while OpenCV delivers BGR
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    
    if results.multi_hand_landmarks:
        for handLms in results.multi_hand_landmarks:
            mpDraw.draw_landmarks(frame, handLms, mp.solutions.hands.HAND_CONNECTIONS)
            
            #taking set of coordinates at the current moment
            coord = np.array([[handLms.landmark[i].x, handLms.landmark[i].y, handLms.landmark[i].z] for i in range(len(handLms.landmark))])
            coord = coord.reshape(1,63)
            
            # Predict whether hand is open or closed using the trained model
            # predict returns an array of shape (1, 1); the sigmoid output is the probability of "open" (label 1)
            if model_1.predict(coord, verbose=0)[0, 0] < 0.5:
                cv2.putText(frame, f"Closed hand", (100, 100), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
            else:
                cv2.putText(frame, f"Open hand", (100, 100), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    
    cv2.imshow("Frame", frame)
    
    k = cv2.waitKey(1)
    
    if k == ord("q"):
        break
        
cv2.destroyAllWindows()
cap.release()