# Course 4: Mediapipe

`Mediapipe` is a framework for building applied ML pipelines that are multimodal (video, audio, or any time-series data) and cross-platform (Android, iOS, web, and edge devices). It is used in both research and production applications.

In this notebook, we will use the `mediapipe` library to detect hand landmarks, and then train a model to recognise the rock, paper and scissors gestures.

## Setup

First, we need to install the `mediapipe` library.

```bash
pip install mediapipe
```

Note that `mediapipe` requires `opencv-python` to be installed.

Note that `OpenCV` loads images in `BGR` channel order by default, whereas `mediapipe` expects `RGB`. So we first need to convert the image to the `RGB` color space.

```python
import cv2
dest = cv2.cvtColor(src, cv2.COLOR_BGR2RGB)
```
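
To make the channel order concrete, here is a minimal sketch using only NumPy (the tiny `bgr` array is made up for illustration). For a 3-channel image, converting `BGR` to `RGB` amounts to reversing the last axis.

```python
import numpy as np

# A 1x2 "image" in OpenCV's BGR order: one pure-blue and one pure-red pixel
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the channel axis turns (B, G, R) into (R, G, B)
rgb = bgr[..., ::-1]

print(rgb[0, 0].tolist())  # blue pixel in RGB order: [0, 0, 255]
print(rgb[0, 1].tolist())  # red pixel in RGB order: [255, 0, 0]
```

Because the conversion is a plain channel swap, applying it twice returns the original image.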

## Basic Concepts

`Mediapipe` provides a variety of models for different tasks. For example, the `Hands` model detects hands and hand landmarks, the `Pose` model detects human poses, and the `Face` model detects faces and facial landmarks. Taking `Hands` as an example, it locates 21 landmarks on each detected hand.

In [15]:
import cv2
import torch

import mediapipe as mp

mp_hands = mp.solutions.hands
hands = mp_hands.Hands(static_image_mode=False, max_num_hands=2, min_detection_confidence=0.5)
mp_draw = mp.solutions.drawing_utils

image = cv2.imread('./mediapipe/victory.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
results = hands.process(image)

h, w, c = image.shape

if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
        mp_draw.draw_landmarks(image, hand_landmarks, mp_hands.HAND_CONNECTIONS)

image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

cv2.imshow('Hand Tracking', image)
cv2.waitKey(0)
cv2.destroyAllWindows()

hands.close()

I0000 00:00:1721134559.276552  404695 gl_context.cc:357] GL version: 2.1 (2.1 Metal - 89.3), renderer: Apple M2 Pro
W0000 00:00:1721134559.283286  445423 inference_feedback_manager.cc:114] Feedback manager requires a model with a single signature inference. Disabling support for feedback tensors.
W0000 00:00:1721134559.289613  445423 inference_feedback_manager.cc:114] Feedback manager requires a model with a single signature inference. Disabling support for feedback tensors.


KeyboardInterrupt: 

Exercise: connect the camera and detect hands in real time.

In [51]:
camera = cv2.VideoCapture(0)
import mediapipe as mp

mp_hands = mp.solutions.hands
hands = mp_hands.Hands(static_image_mode=False, max_num_hands=2, min_detection_confidence=0.5)
mp_draw = mp.solutions.drawing_utils

while True:
    ret, image = camera.read()
    if not ret:
        break

    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = hands.process(image)

    h, w, c = image.shape

    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            mp_draw.draw_landmarks(image, hand_landmarks, mp_hands.HAND_CONNECTIONS)

    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

    cv2.imshow('Hand Tracking', image)
    # Mask the key code with a bitwise `&`; the keyword `and` would never match 'q'
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

camera.release()
hands.close()
cv2.destroyAllWindows()

I0000 00:00:1721129416.564226  280736 gl_context.cc:357] GL version: 2.1 (2.1 Metal - 89.3), renderer: Apple M2 Pro
W0000 00:00:1721129416.569112  339687 inference_feedback_manager.cc:114] Feedback manager requires a model with a single signature inference. Disabling support for feedback tensors.
W0000 00:00:1721129416.575007  339687 inference_feedback_manager.cc:114] Feedback manager requires a model with a single signature inference. Disabling support for feedback tensors.


KeyboardInterrupt: 
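
A note on the key check in camera loops: `cv2.waitKey` returns an integer key code (or -1 when no key was pressed), so the low byte has to be extracted with the bitwise `&` before it is compared with `ord('q')`. The sketch below uses made-up return values in place of a real `cv2.waitKey` call to show why the keyword `and` does not work here; both function names are hypothetical.

```python
def pressed_q(key: int) -> bool:
    """Correct check: mask the low byte, then compare with 'q'."""
    return (key & 0xFF) == ord('q')


def pressed_q_buggy(key: int) -> bool:
    """Buggy check: `0xFF == ord('q')` is always False, so this never fires."""
    return bool(key and 0xFF == ord('q'))


# Hypothetical return values of cv2.waitKey(1): no key, 'q', and 'a'
for key in (-1, ord('q'), ord('a')):
    print(key, pressed_q(key), pressed_q_buggy(key))
```

With the buggy form the loop can only be stopped by interrupting the kernel.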

## Gesture Recognition based on rules

- `results.multi_hand_landmarks` is a list with one entry per detected hand (at most two here, for example a left and a right hand). The individual points of each entry can be accessed through its `landmark` field, for example `hand_landmarks.landmark[8]`.
- `mcp` metacarpophalangeal joint
- `ip` interphalangeal joint
- `pip` proximal interphalangeal joint
- `dip` distal interphalangeal joint
- `cmc` carpometacarpal joint
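
These abbreviations appear in the names of the 21 hand landmarks. As a quick reference, the sketch below rebuilds the index-to-name mapping in plain Python. The names follow MediaPipe's `HandLandmark` enumeration, while `FINGERS` and `landmark_names` are helper names made up for this example.

```python
# Joint names per finger, ordered from the palm outwards
FINGERS = {
    'THUMB': ['CMC', 'MCP', 'IP', 'TIP'],
    'INDEX_FINGER': ['MCP', 'PIP', 'DIP', 'TIP'],
    'MIDDLE_FINGER': ['MCP', 'PIP', 'DIP', 'TIP'],
    'RING_FINGER': ['MCP', 'PIP', 'DIP', 'TIP'],
    'PINKY': ['MCP', 'PIP', 'DIP', 'TIP'],
}


def landmark_names() -> list:
    """Return the 21 landmark names; the list position is the landmark index."""
    names = ['WRIST']
    for finger, joints in FINGERS.items():
        names += [f'{finger}_{joint}' for joint in joints]
    return names


names = landmark_names()
fingertips = [i for i, name in enumerate(names) if name.endswith('_TIP')]

print(len(names))   # 21
print(names[8])     # INDEX_FINGER_TIP
print(fingertips)   # [4, 8, 12, 16, 20]
```

The fingertip indices printed at the end are the ones tested against the palm outline in the rule-based code.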

> Because this implementation is rather crude, I adapted it from the source as is and did not modify it.

In [None]:
import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands
hands = mp_hands.Hands(static_image_mode=False, max_num_hands=2, min_detection_confidence=0.5)
mp_draw = mp.solutions.drawing_utils

image = cv2.imread('./mediapipe/victory.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
results = hands.process(image)
h, w, c = image.shape

lst_lms = []

# Add all gesture coordinate points to a list [(x1, y1), (x2, y2),....] Total 21
if results.multi_hand_landmarks:
    for single_hand_marks in results.multi_hand_landmarks:
        for id, lm in enumerate(single_hand_marks.landmark):
            # The original `lm.x` and `lm.y` are decimals, multiplied by `w` and `h` to make the true coordinates
            x, y = int(w * lm.x), int(h * lm.y)
            cv2.circle(image, (x, y), 2, (255, 1, 0), -1)
            lst_lms.append([x, y])

lst_lms = np.array(lst_lms, dtype=np.int32)  # cv2.convexHull expects int32 or float32 points
hull_index = [0, 1, 2, 3, 6, 10, 14, 18, 17]  # Nine landmarks that outline the palm
hull = cv2.convexHull(lst_lms[hull_index, :])  # Convex hull around those nine points

cv2.polylines(image, [hull], True, (222, 222, 0), 2)  # Draw the hull as a closed loop

up_finger = []  # Fingertips that lie outside the hull

# Check each of the five fingertips against the hull
for i in [4, 8, 12, 16, 20]:
    point = (int(lst_lms[i][0]), int(lst_lms[i][1]))
    # Signed distance from the point to the hull; a negative value means the point is outside
    dist = cv2.pointPolygonTest(hull, point, True)
    print(dist)
    if dist < 0:
        up_finger.append(i)
print(up_finger)

# A lone index fingertip (landmark 8) outside the hull is read as the gesture "one"
if up_finger == [8]:
    gesture = 'one'
else:
    gesture = None

if gesture:
    cv2.putText(image, gesture, (30, 30), cv2.FONT_HERSHEY_COMPLEX, 1, (222, 21, 122), 1)

# The image was converted to RGB above, so convert it back to BGR for display
image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

cv2.imshow('MediaPipe Hand Tracking', image)
cv2.waitKey(0)
cv2.destroyAllWindows()
hands.close()
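
The same idea extends to more gestures: once the fingertips outside the palm hull are known, the gesture becomes a lookup on that set. The sketch below is only an illustration. The rule table `RULES` and the function `classify_gesture` are hypothetical and not part of the adapted code, and real hands would need more robust rules (the thumb in particular).

```python
# Fingertip landmark indices: thumb, index, middle, ring, pinky
FINGERTIPS = (4, 8, 12, 16, 20)

# Hypothetical rules: the exact set of fingertips lying outside the palm hull
RULES = {
    frozenset(): 'rock',
    frozenset({8}): 'one',
    frozenset({8, 12}): 'scissors',
    frozenset(FINGERTIPS): 'paper',
}


def classify_gesture(up_finger):
    """Map the raised fingertip indices to a gesture name, or None if no rule matches."""
    return RULES.get(frozenset(up_finger))


print(classify_gesture([8]))                  # one
print(classify_gesture([8, 12]))              # scissors
print(classify_gesture([4, 8, 12, 16, 20]))   # paper
print(classify_gesture([8, 16]))              # None
```

Hand-written rules like these are brittle, which motivates the learned classifier in the next section.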

## Gesture Recognition based on Machine Learning

With `PyTorch` and a labelled dataset, we can train a classifier that recognises the rock, paper and scissors gestures.

The classifier has 4 output classes: rock, paper, scissors and none.

Unfortunately, the official `mediapipe` tooling (`mediapipe_model_maker`) is built on TensorFlow (made by Google) rather than PyTorch, so we cannot use it here and need to implement the model ourselves.

### Input Layer

We first run `mediapipe` on the image and then use the coordinates of the hand landmarks as the input of the model. Each landmark is returned as $(\mathrm{id}, x, y)$, where $x$ and $y$ are pixel offsets relative to the wrist (landmark 0). These coordinates are then used to train the model.
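
As a rough sketch of this preprocessing, assume the 21 landmarks are already available as normalised $(x, y)$ pairs in $[0, 1]$. The `fake_landmarks` array below is synthetic, and `to_model_input` is a hypothetical helper, not a `mediapipe` function.

```python
import numpy as np


def to_model_input(landmarks: np.ndarray, width: int, height: int) -> np.ndarray:
    """Convert 21 normalised landmarks to wrist-relative pixel offsets, shaped (5, 4, 2)."""
    # Scale normalised coordinates to integer pixel positions
    pixels = (landmarks * np.array([width, height])).astype(int)
    # Subtract the wrist (landmark 0) so the features do not depend on where the hand is
    offsets = pixels[1:] - pixels[0]
    # 20 remaining landmarks = 5 fingers x 4 joints, each with (x, y)
    return offsets.reshape(5, 4, 2).astype(np.float32)


# Synthetic landmarks: wrist at the image centre, the others spread along a diagonal
fake_landmarks = np.linspace(0.5, 0.9, 21).repeat(2).reshape(21, 2)
features = to_model_input(fake_landmarks, width=256, height=256)

print(features.shape)  # (5, 4, 2)
print(features[0, 0])  # offset of landmark 1 from the wrist
```

The first axis of the result indexes the finger and the second the joint along that finger.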

In [1]:
import cv2
import mediapipe as mp
from cv2.typing import MatLike
import numpy as np
from PIL import Image

mp_hands = mp.solutions.hands
hands = mp_hands.Hands(static_image_mode=False, max_num_hands=1, min_detection_confidence=0.5)
mp_draw = mp.solutions.drawing_utils

class ToNumpy:
    # `Image` is the PIL module; the image class itself is `Image.Image`
    def __call__(self, image: Image.Image) -> np.ndarray:
        return np.array(image)

class CropToHand:
    def __call__(self, image: MatLike):
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        results = hands.process(image)
        if not results.multi_hand_landmarks:
            # No hand found: fall back to the whole frame, resized to 256x256
            return cv2.resize(image, (256, 256))

        # Bounding box of the detected hand, in pixels
        image_height, image_width, _ = image.shape
        hand_landmarks = results.multi_hand_landmarks[0]
        xs = [int(lm.x * image_width) for lm in hand_landmarks.landmark]
        ys = [int(lm.y * image_height) for lm in hand_landmarks.landmark]
        # Landmarks can fall slightly outside the frame, so clamp the box to the image
        x_min, x_max = max(min(xs), 0), min(max(xs), image_width)
        y_min, y_max = max(min(ys), 0), min(max(ys), image_height)

        # Crop the hand region from the frame
        hand_image = image[y_min:y_max, x_min:x_max]
        if hand_image.size == 0:
            return cv2.resize(image, (256, 256))

        # Resize the cropped hand image to 256x256
        return cv2.resize(hand_image, (256, 256))

class ExtractKeypoints:
    def __init__(self, hands: mp_hands.Hands):
        self.hands = hands
    
    def __call__(self, image: MatLike) -> np.ndarray:
        image = cv2.resize(image, (256, 256))
        # Use the detector passed to the constructor rather than the global one
        results = self.hands.process(image)
        h, w, c = image.shape
        lst_lms = []
        x0, y0 = 0, 0
        if results.multi_hand_landmarks:
            for single_hand_marks in results.multi_hand_landmarks:
                for id, lm in enumerate(single_hand_marks.landmark):
                    if id == 0:
                        x0, y0 = int(w * lm.x), int(h * lm.y)
                    else:
                        x, y = int(w * lm.x) - x0, int(h * lm.y) - y0
                        lst_lms.append([id, x, y])
        
        return np.array(lst_lms, dtype=np.float32)

class HandleGestureDataset:
    def __call__(self, matrix: np.ndarray) -> np.ndarray:
        # No hand detected: fall back to a random placeholder of the expected shape
        if matrix.size == 0:
            return np.random.random((5, 4, 2))
        # Put each landmark in row (id - 1); landmarks that are missing stay at (0, 0)
        full = np.zeros((20, 2), dtype=np.float32)
        for lm_id, x, y in matrix:
            full[int(lm_id) - 1] = (x, y)
        # 20 landmarks = 5 fingers x 4 joints, each with (x, y)
        return full.reshape(5, 4, 2)

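image = cv2.imread('./mediapipe/victory.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # cv2.imread returns BGR; mediapipe expects RGB

points = ExtractKeypoints(hands)(image)
handler = HandleGestureDataset()
# Note: if no hand is detected, the handler returns a random (5, 4, 2) placeholder
handler(points)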

I0000 00:00:1721137076.931580  485802 gl_context.cc:357] GL version: 2.1 (2.1 Metal - 89.3), renderer: Apple M2 Pro
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
W0000 00:00:1721137076.935914  486091 inference_feedback_manager.cc:114] Feedback manager requires a model with a single signature inference. Disabling support for feedback tensors.
W0000 00:00:1721137076.940586  486091 inference_feedback_manager.cc:114] Feedback manager requires a model with a single signature inference. Disabling support for feedback tensors.


array([[[0.83560001, 0.6741797 ],
        [0.71665594, 0.60959557],
        [0.92752786, 0.87106068],
        [0.12921649, 0.04644386]],

       [[0.41396106, 0.35537373],
        [0.99849717, 0.3019162 ],
        [0.15953014, 0.96268324],
        [0.02818244, 0.20892876]],

       [[0.28579477, 0.11263638],
        [0.14204413, 0.4199289 ],
        [0.3806742 , 0.37125793],
        [0.98116345, 0.65386357]],

       [[0.25299978, 0.2490395 ],
        [0.60256314, 0.51773068],
        [0.37269369, 0.4607935 ],
        [0.88227983, 0.31962578]],

       [[0.68383068, 0.20295217],
        [0.917635  , 0.48825266],
        [0.34977718, 0.42750554],
        [0.25990254, 0.83325976]]])

### Datasets

We can use the `rock_paper_scissors` dataset (downloaded to the `./mediapipe/rps` folder) to train the model. With `ImageFolder`, the images are loaded with one class per subfolder.
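
`ImageFolder` assigns label indices by sorting the class folder names. The sketch below mimics that rule in plain Python, assuming the four folders are named `none`, `paper`, `rock` and `scissors` (an assumption about the local copy of the dataset, chosen to match the label order used later in this notebook).

```python
# Assumed subfolders of ./mediapipe/rps, listed in arbitrary order
folders = ['scissors', 'rock', 'none', 'paper']

# Sort the class folder names and number them from 0, as ImageFolder does
class_to_idx = {name: idx for idx, name in enumerate(sorted(folders))}
idx_to_class = {idx: name for name, idx in class_to_idx.items()}

print(class_to_idx)  # {'none': 0, 'paper': 1, 'rock': 2, 'scissors': 3}
```

On a real dataset the same mapping is available as `dataset.class_to_idx`, which is worth printing once to confirm the order.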

In [2]:
from torchvision.datasets import ImageFolder
from torchvision import transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch
from torch.utils.data import DataLoader, random_split

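# Use Apple's Metal backend when it is available, otherwise fall back to the CPU
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')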

data_transform = transforms.Compose([
    transforms.RandomRotation((-90, 90)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    ToNumpy(),
    ExtractKeypoints(hands),
    HandleGestureDataset(),
    transforms.ToTensor(),
])

dataset = ImageFolder('./mediapipe/rps', transform=data_transform)
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

batch_size=32

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

### Neural Network Design

Inspired by convolutional neural networks, we first use a convolutional layer to extract features from each finger's landmarks, and then use fully connected layers to classify the gesture. Finally, a softmax function outputs the probability of each gesture.
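
The size of the first fully connected layer follows from the usual output-size formula $\lfloor (n + 2p - k) / s \rfloor + 1$. The sketch below checks the arithmetic in plain Python, assuming the input tensor has shape `(batch, 2, 5, 4)`, which is what `ToTensor` produces from a `(5, 4, 2)` array, so that each finger slice is a 2 x 4 grid. `out_size` is a helper written for this example.

```python
def out_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a convolution or pooling layer along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


# Each finger slice is a 2 x 4 grid: (x, y) by 4 joints
h, w = 2, 4

# Conv2d(kernel_size=2, padding=1, stride=1)
h, w = out_size(h, 2, padding=1), out_size(w, 2, padding=1)
print(h, w)  # 3 5

# MaxPool2d(kernel_size=2, stride=1)
h, w = out_size(h, 2), out_size(w, 2)
print(h, w)  # 2 4

# Five fingers with one output channel each
flat_features = 5 * h * w
print(flat_features)  # 40
```

This is why the first linear layer takes 40 inputs.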

In [3]:
class GestureClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolution and pooling layers, shared across the five fingers
        self.conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=2, padding=1, stride=1)
        self.pooling = nn.MaxPool2d(kernel_size=2, stride=1)
        # Fully connected layers: 5 fingers * 2 * 4 pooled values = 40 inputs
        self.fc1 = nn.Linear(40, 64)
        self.fc2 = nn.Linear(64, 4)
        self.output = nn.Softmax(dim=1)
        self.dropout = nn.Dropout(0.7)

    def forward(self, x):
        # x has shape (batch, 2, 5, 4): (x/y coordinate, finger, joint)
        batch_size = x.size(0)
        slices = []

        # Apply the same convolution and pooling to each of the 5 fingers
        for i in range(5):
            finger = x[:, :, i, :].unsqueeze(1)  # (batch, 1, 2, 4)
            conv_out = self.conv(finger)         # (batch, 1, 3, 5)
            pool_out = self.pooling(conv_out)    # (batch, 1, 2, 4)
            slices.append(pool_out)

        # Concatenate along the channel dimension and flatten
        x = torch.cat(slices, dim=1)  # (batch, 5, 2, 4)
        x = x.view(batch_size, -1)    # (batch, 40)

        # Fully connected layers
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        # nn.CrossEntropyLoss applies log-softmax itself, so return raw logits
        # while training and probabilities only at inference time
        return x if self.training else self.output(x)

model = GestureClassifier().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
epochs = 10

### Training

We can train the model via the following code.
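
For reference, `nn.CrossEntropyLoss` combines a log-softmax with the negative log-likelihood, so it takes raw scores (logits) and integer class indices. The NumPy sketch below reproduces that computation for the default mean reduction. `cross_entropy` is a helper written for this example, not the PyTorch implementation, and the `logits` values are made up.

```python
import numpy as np


def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy of raw scores (N, C) against integer labels (N,)."""
    # Log-softmax, shifted by the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the correct class for each sample
    return float(-log_probs[np.arange(len(targets)), targets].mean())


logits = np.array([[2.0, 0.5, 0.1, 0.1],   # favours class 0
                   [0.0, 0.0, 0.0, 0.0]])  # no preference between the 4 classes
targets = np.array([0, 3])

print(cross_entropy(logits, targets))
```

A row with no preference contributes $\ln 4 \approx 1.386$, which is the loss to expect from an untrained 4-class model.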

In [4]:
train_model = False
if train_model:
    from tqdm import tqdm

    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        for (data, target) in tqdm(train_loader, desc=f'Epoch {epoch + 1}'):
            # CrossEntropyLoss expects integer class indices, so the target is not cast to float
            data, target = data.float().to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        # The criterion already averages within a batch, so average over the batches here
        train_loss = running_loss / len(train_loader)

        model.eval()  # Disable dropout during validation
        with torch.no_grad():
            correct = 0
            total = 0
            for (data, target) in tqdm(val_loader):
                data, target = data.float().to(device), target.to(device)
                output = model(data)
                _, predicted = torch.max(output, 1)
                total += target.size(0)
                correct += (predicted == target).sum().item()

            accuracy = correct / total
            print(f'Epoch {epoch + 1}, Loss: {train_loss}, Accuracy: {accuracy}')

    torch.save(model.state_dict(), './mediapipe/gesture_classifier.pth')
else:
    # map_location lets weights saved on one device be loaded on another
    model.load_state_dict(torch.load('./mediapipe/gesture_classifier.pth', map_location=device))

### Evaluate the Model

We can open the camera and run the model on live frames.

In [5]:
def have_hand(image: MatLike) -> bool:
    results = hands.process(image)
    return bool(results.multi_hand_landmarks)

mp_hands = mp.solutions.hands
hands = mp_hands.Hands(static_image_mode=False, max_num_hands=1, min_detection_confidence=0.5)
camera = cv2.VideoCapture(0)

# Build the preprocessing pipeline once, outside the loop
transformer = transforms.Compose([
    ExtractKeypoints(hands),
    HandleGestureDataset(),
    transforms.ToTensor(),
])

model.eval()

while True:
    ret, frame = camera.read()
    if not ret:
        break
    image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # mediapipe expects RGB
    hand_found = have_hand(image)
    if hand_found:
        src = transformer(image).float().to(device)
        with torch.no_grad():
            output = model(src.unsqueeze(0))
            _, predicted = torch.max(output, 1)
            print(predicted.item())
        cv2.putText(frame, str(predicted.item()), (90, 90), cv2.FONT_HERSHEY_COMPLEX, 1, (255, 255, 255), 1)
    cv2.putText(frame, str(hand_found), (30, 30), cv2.FONT_HERSHEY_COMPLEX, 1, (222, 21, 122), 1)

    # Show the original BGR frame so the colours are displayed correctly
    cv2.imshow('Hand Tracking', frame)
    # Mask the key code with a bitwise `&`; the keyword `and` would never match 'q'
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# The camera is left open because the game cells below reuse it
cv2.destroyAllWindows()

I0000 00:00:1721137088.266770  485802 gl_context.cc:357] GL version: 2.1 (2.1 Metal - 89.3), renderer: Apple M2 Pro
W0000 00:00:1721137088.271231  486500 inference_feedback_manager.cc:114] Feedback manager requires a model with a single signature inference. Disabling support for feedback tensors.
W0000 00:00:1721137088.275946  486504 inference_feedback_manager.cc:114] Feedback manager requires a model with a single signature inference. Disabling support for feedback tensors.


1
1
1
2
1
1
2
1
1
3
1
1
1


KeyboardInterrupt: 

### Final Game: Play Rock, Paper, Scissors with the Computer!

Using `random`, we can let the computer play against us.

1. The computer starts a countdown;
2. The computer picks a random action (rock, paper or scissors);
3. The player's action is recognised;
4. The two actions are compared.

Let's begin.

#### 1. Start the countdown

In [6]:
import time

for i in range(3, 0, -1):
    print(i)
    time.sleep(1)

print('Go!')

time.sleep(0.1)

3
2
1
Go!


#### 2. Pick the computer's action

We can use `random.choice` to pick the computer's action at random.

In [7]:
import random

actions = ['paper', 'rock', 'scissors']

computer_action = random.choice(actions)

#### 3. Recognize the player's action

We classify 30 consecutive frames and keep the most frequent prediction, which reduces the risk of misjudgment.
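
The voting step on its own can be sketched with `collections.Counter`. The per-frame predictions below are made up, and `majority_vote` is a hypothetical helper rather than part of the notebook's pipeline.

```python
from collections import Counter


def majority_vote(predictions, default='none'):
    """Return the most frequent prediction, or `default` when there are none."""
    if not predictions:
        return default
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical per-frame predictions: 28 correct frames and 2 misclassified ones
frames = ['paper'] * 26 + ['rock', 'scissors'] + ['paper'] * 2

print(majority_vote(frames))  # paper
print(majority_vote([]))      # none
```

A couple of misclassified frames do not change the result, which is the point of voting over several frames.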

In [10]:
labels = ['none', 'paper', 'rock', 'scissors']  # labels[i] is the name of class i
record = [0, 0, 0, 0]  # Number of frames voted for each class

# Build the preprocessing pipeline once, outside the loop
transformer = transforms.Compose([
    ExtractKeypoints(hands),
    HandleGestureDataset(),
    transforms.ToTensor(),
])

for _ in range(30):
    ret, image = camera.read()
    if not ret:
        break
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    if have_hand(image):
        src = transformer(image).float().to(device)
        with torch.no_grad():
            output = model(src.unsqueeze(0))
            _, predicted = torch.max(output, 1)
            record[predicted.item()] += 1

# Take the class with the most votes; ties and an empty record resolve to the lowest index
player_action = labels[record.index(max(record))]

print(f"Player's action: {player_action}, Computer's action: {computer_action}")

Player's action: paper, Computer's action: scissors


#### 4. Compare the results

In [11]:
# Each key beats its value
beats = {'rock': 'scissors', 'scissors': 'paper', 'paper': 'rock'}

if player_action == 'none':
    print('You did not make any action.')
elif player_action == computer_action:
    print('Draw!')
elif beats[player_action] == computer_action:
    print('You win!')
else:
    print('You lose!')

You lose!
