
# Pose Estimation Overview

## 1. What is Pose Estimation?
Pose estimation is a computer vision technique used to identify and track human body parts and their positions in an image or video. It predicts the coordinates of key points (e.g., joints, limbs) to model the pose or movement of the body.

## 2. Software for Pose Estimation:
Common software and frameworks include:
- **MediaPipe:** A lightweight framework by Google for real-time pose detection.
- **OpenPose:** An open-source library for multi-person pose estimation.
- **TensorFlow.js or PyTorch:** For implementing custom pose estimation models like PoseNet.
- **BlazePose:** Built specifically for fast and accurate single-person pose estimation.

## 3. What is MediaPipe?
MediaPipe is a cross-platform framework by Google that offers efficient pipelines for machine learning and computer vision tasks, such as pose estimation, hand tracking, and facial landmark detection. It supports mobile devices, web, and desktop environments with pre-trained models.


## 4. Landmarks in MediaPipe:
Landmarks are specific points on the body that MediaPipe detects to represent key parts like joints (e.g., elbows, knees) or regions (e.g., shoulders, hips).
- For pose estimation, MediaPipe's **Pose** solution identifies 33 3D landmarks across the human body, enabling applications like fitness tracking, gesture recognition, and AR/VR interactions.

![MediaPipe hand landmarks](https://ai.google.dev/static/edge/mediapipe/images/solutions/hand-landmarks.png)

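The image above shows the hand landmark layout rather than the 33-point pose layout: the code later in this notebook uses the Hands solution, which reports 21 landmarks per hand. As a rough reference, the sketch below rebuilds the landmark names by index. The list is written out by hand for illustration (the variable names are not part of MediaPipe), so treat it as a lookup aid rather than the library's API.

```python
# Hand landmark names by index (0..20), written out for reference.
# FINGERS, THUMB_JOINTS, FINGER_JOINTS and HAND_LANDMARK_NAMES are
# illustrative names, not MediaPipe identifiers.
FINGERS = ["THUMB", "INDEX_FINGER", "MIDDLE_FINGER", "RING_FINGER", "PINKY"]
THUMB_JOINTS = ["CMC", "MCP", "IP", "TIP"]
FINGER_JOINTS = ["MCP", "PIP", "DIP", "TIP"]

HAND_LANDMARK_NAMES = ["WRIST"] + [
    f"{finger}_{joint}"
    for finger in FINGERS
    for joint in (THUMB_JOINTS if finger == "THUMB" else FINGER_JOINTS)
]

# Landmarks 0 and 5 are the two reference points used for normalization
# later in this notebook.
for index in (0, 5, 8, 20):
    print(index, HAND_LANDMARK_NAMES[index])
```

Landmark 0 is the wrist and landmark 5 is the base of the index finger, which is why those two points are a reasonable choice of reference for normalizing the other coordinates.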
## 5. Brief Explanation:
MediaPipe's landmark models are lightweight enough for real-time use on mobile and desktop platforms. They are also pre-trained, so landmark detection needs no training of its own; this notebook only trains a small classifier on top of the landmarks that MediaPipe produces.


# Import required libraries

In [1]:
import cv2
import mediapipe as mp
import numpy as np
from datetime import datetime
import os
import time
from sklearn.model_selection import train_test_split 
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from model import *

2024-11-21 08:28:46.889564: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


The following code initializes modules from the MediaPipe library, which provides ready-made machine learning solutions such as hand tracking. Here is what each line does:
```python
mp_hands = mp.solutions.hands
```
`mp.solutions.hands` refers to the Hands solution provided by MediaPipe.
This module is designed for hand detection and tracking, including landmark estimation for each finger joint.
Assigning it to `mp_hands` creates a shorthand for accessing its functionality in your code.
```python
mp_drawing = mp.solutions.drawing_utils
```
`mp.solutions.drawing_utils` is a utility module for visualizing results.
It includes functions to draw detected landmarks and connections (like joints and bones in the hand) on images or video frames.
Assigning it to `mp_drawing` makes these drawing functions easy to call when displaying the detected hand landmarks.


In [3]:
mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

# Feature Extraction Function

In [12]:
def feature_extract(frame, hands):
    # Convert to RGB
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    # Process frame with Mediapipe
    results = hands.process(frame_rgb)

    # Frame dimensions
    height, width, _ = frame.shape

    # Initialize numpy arrays for left and right hands
    left_hand_landmarks = np.zeros((21, 2))
    right_hand_landmarks = np.zeros((21, 2))

    # Helper function to process landmarks
    def process_landmarks(hand_landmarks, width, height):
        landmarks = [(lm.x * width, lm.y * height) for lm in hand_landmarks.landmark]
        landmark_0 = np.array(landmarks[0])
        landmark_5 = np.array(landmarks[5])
        normalized_landmarks = [
            ((x - landmark_0[0]) / (landmark_5[0] - landmark_0[0] + 1e-6),
             (y - landmark_0[1]) / (landmark_5[1] - landmark_0[1] + 1e-6))
            for x, y in landmarks
        ]
        return np.array(normalized_landmarks)

    # If hands are detected
    if results.multi_hand_landmarks and results.multi_handedness:
        for hand_landmarks, hand_handedness in zip(results.multi_hand_landmarks, results.multi_handedness):
            # Identify hand as left or right
            handedness = hand_handedness.classification[0].label
            processed_landmarks = process_landmarks(hand_landmarks, width, height)
            if handedness == 'Left':
                left_hand_landmarks = processed_landmarks
            elif handedness == 'Right':
                right_hand_landmarks = processed_landmarks

            # Draw landmarks on the frame
            mp_drawing.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)

    # Concatenate left and right hand landmarks
    concatenated_landmarks = np.concatenate((left_hand_landmarks.flatten(), right_hand_landmarks.flatten()))
    return concatenated_landmarks

### Code explanation `feature_extract`

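The function `feature_extract` extracts features from a video frame by processing the hand landmarks detected by MediaPipe Hands. It outputs normalized 2D landmark coordinates for both the left and right hands: each point is shifted so that landmark 0 sits at the origin, and each axis is then divided by the offset between landmark 0 and landmark 5 along that axis. As a side effect, it also draws the detected landmarks onto `frame`.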
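To make the normalization step concrete, here is a minimal NumPy sketch of the same arithmetic applied to synthetic pixel coordinates. The helper name `normalize_landmarks` and the random input are assumptions made for illustration; no MediaPipe call is involved.

```python
import numpy as np

def normalize_landmarks(points, eps=1e-6):
    """Shift by landmark 0, then scale each axis by the landmark 0 -> 5 offset."""
    points = np.asarray(points, dtype=float)
    origin = points[0]
    scale = points[5] - origin + eps  # per-axis scale, mirroring feature_extract
    return (points - origin) / scale

# Synthetic "pixel" coordinates for 21 landmarks (illustrative only).
rng = np.random.default_rng(0)
pixels = rng.uniform(100, 500, size=(21, 2))
pixels[0] = (200.0, 400.0)  # landmark 0
pixels[5] = (260.0, 300.0)  # landmark 5

normalized = normalize_landmarks(pixels)
shifted = normalize_landmarks(pixels + 50.0)  # same hand, moved in the frame
scaled = normalize_landmarks(pixels * 2.0)    # same hand, twice as large

print(normalized[0], normalized[5])
```

Landmark 0 maps to the origin and landmark 5 to approximately (1, 1), so the features barely change when the hand moves or changes size in the frame. One caveat of scaling each axis separately: when landmarks 0 and 5 are nearly aligned horizontally or vertically, the corresponding divisor is close to zero and the values on that axis become very large.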

### Parameters
1. `frame`: A single video frame (image) from which hand landmarks are to be extracted.
1. `hands`: An instance of Mediapipe's `Hands` class for hand detection and landmark tracking.
### Function outputs

The function returns a 1D numpy array containing 84 values:

1. 42 values for the left hand (21 points × 2 coordinates: x, y).
2. 42 values for the right hand (21 points × 2 coordinates: x, y).

If a hand is not detected, its 42 values are all zeros.
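The layout of the 84-value vector can be illustrated by splitting it back into two `(21, 2)` arrays. This is a small sketch with a hypothetical helper, `split_features`, and a synthetic vector in which only the left hand is present.

```python
import numpy as np

def split_features(features):
    """Split an 84-value feature vector into left and right (21, 2) arrays."""
    features = np.asarray(features)
    left = features[:42].reshape(21, 2)
    right = features[42:].reshape(21, 2)
    return left, right

# Synthetic vector: left hand filled with ones, right hand not detected.
vector = np.concatenate((np.ones((21, 2)).flatten(),
                         np.zeros((21, 2)).flatten()))

left, right = split_features(vector)
right_missing = not right.any()  # an all-zero block means no detection
print(left.shape, right.shape, right_missing)
```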

# Landmarks Reader and Dataset Preparation for Training

In [14]:

def read_landmarks(dataset, labels):
    """
    Reads all `.jpg` images from the label directories inside the dataset
    and extracts hand landmark features from each image.

    Args:
        dataset (str): The base directory of the dataset.
        labels (list): List of label names (subdirectories).

    Returns:
        tuple: `(X, y)`, where `X` is an array of feature vectors and
        `y` is an array of the matching one-hot encoded labels.
    """
    # Initialize Mediapipe Hands
    hands = mp_hands.Hands(static_image_mode=False,
                           max_num_hands=2,
                           min_detection_confidence=0.5,
                           min_tracking_confidence=0.5)

    kelas = np.eye(len(labels))  # One one-hot row per label
    X = []
    y = []
    for i, label in enumerate(labels):
        # Construct the directory path
        directory_path = os.path.join(dataset, label)

        # Skip labels whose directory is missing
        if not os.path.exists(directory_path):
            print(f"Directory '{directory_path}' does not exist.")
            continue

        # List all `.jpg` files in the directory
        filenames = [f for f in os.listdir(directory_path) if f.endswith('.jpg')]

        # Read each image and extract its features
        for filename in filenames:
            file_path = os.path.join(directory_path, filename)
            frame = cv2.imread(file_path)  # Read the image with OpenCV
            if frame is None:  # cv2.imread returns None for unreadable files
                print(f"Could not read '{file_path}', skipping.")
                continue
            fitur = feature_extract(frame, hands)
            X.append(fitur)
            y.append(kelas[i])

    hands.close()
    return np.array(X), np.array(y)


### Code Explanation: `read_landmarks`

The function `read_landmarks` is designed to:

1. Traverse a dataset organized by labels (subdirectories).
1. Read all `.jpg` images within each subdirectory.
1. Extract features from each image using Mediapipe.
1. Prepare the data (`X`) and labels (`y`) for use in machine learning models.

### Function Breakdown
Parameters

1. `dataset` (str): The root directory containing subdirectories for each label.
1. `labels` (list): A list of label names, where each corresponds to a subdirectory in the `dataset`.

### Step 1: Initialize MediaPipe Hands

1. `static_image_mode=False`: Treats the inputs as a video stream, so detections are tracked from one frame to the next. For a folder of unrelated still images, `static_image_mode=True` may be the more appropriate setting.
1. `max_num_hands=2`: Tracks up to 2 hands.
1. `min_detection_confidence=0.5`: Minimum confidence to detect a hand.
1. `min_tracking_confidence=0.5`: Minimum confidence for hand landmark tracking.

### Step 2: One-Hot Encode Labels
```python
kelas = np.eye(len(labels))
y = []
```
A one-hot encoded matrix (`kelas`) is created for the labels. For example, if there are 3 labels:
```text
[[1, 0, 0],
 [0, 1, 0],
 [0, 0, 1]]
```
`y` is initialized to store the labels corresponding to each image.
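As a quick illustration of how the one-hot rows relate to class indices, the sketch below builds `kelas` for two labels (the `"Satu"` and `"Dua"` labels used later in this notebook) and recovers the indices with `argmax`. The sample order in `y` is made up for the example.

```python
import numpy as np

labels = ["Satu", "Dua"]
kelas = np.eye(len(labels))  # row i is the one-hot vector for labels[i]

# Pretend three images were read: two from "Satu", one from "Dua".
y = np.array([kelas[0], kelas[0], kelas[1]])

# argmax along each row turns a one-hot vector back into a class index.
class_indices = np.argmax(y, axis=1)
class_names = [labels[i] for i in class_indices]
print(class_indices, class_names)
```

The same `argmax` conversion appears in the training loop, where the one-hot targets are turned into class indices before the loss is computed.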

### Step 3: Iterate Over Labels
```python
for i, label in enumerate(labels):
    directory_path = os.path.join(dataset, label)
```
Loops through each label and constructs the full path to its corresponding directory.

### Step 4: List All `.jpg` Files
```python
filenames = [f for f in os.listdir(directory_path) if f.endswith('.jpg')]
```
Retrieves all filenames ending in `.jpg` from the directory.
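Note that `f.endswith('.jpg')` is case-sensitive and skips files such as `photo.JPG` or `photo.jpeg`. If the dataset mixes extensions, a more permissive filter might look like the sketch below; `list_images` and the file names are hypothetical and exist only for this example.

```python
import os
import tempfile

def list_images(directory, extensions=(".jpg", ".jpeg")):
    """Return image file names, matching extensions case-insensitively."""
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith(extensions)
    )

with tempfile.TemporaryDirectory() as tmp:
    # Create a few empty placeholder files to compare both filters.
    for name in ("a.jpg", "b.JPG", "c.jpeg", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()

    strict = sorted(f for f in os.listdir(tmp) if f.endswith(".jpg"))
    relaxed = list_images(tmp)

print(strict)
print(relaxed)
```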

### Step 5: Read and Process Images
```python
for filename in filenames:
    file_path = os.path.join(directory_path, filename)
    frame = cv2.imread(file_path)
    fitur = feature_extract(frame, hands)
    X.append(fitur)
    y.append(kelas[i])
```

For each image file, the loop:

1. Constructs the full file path.
2. Reads the image using OpenCV (`cv2.imread`).
3. Calls `feature_extract`, which uses MediaPipe to extract normalized landmark features for the left and right hands.
4. Appends the extracted features to `X` and the corresponding one-hot encoded label to `y`.


### Output

1. `X`: A numpy array of extracted features for all images.
1. `y`: A numpy array of one-hot encoded labels corresponding to each feature in `X`.
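The shapes of `X` and `y` follow directly from the loop structure. The sketch below reproduces that structure without OpenCV or MediaPipe: `fake_extract` is a hypothetical stand-in for `feature_extract`, and `None` plays the role of an image that `cv2.imread` could not read (it returns `None` in that case).

```python
import numpy as np

def fake_extract(image):
    """Hypothetical stand-in for feature_extract: returns 84 values."""
    return np.full(84, float(image))

def build_arrays(samples_by_label, labels):
    """Assemble features and one-hot labels the way read_landmarks does."""
    one_hot = np.eye(len(labels))
    X, y = [], []
    for i, label in enumerate(labels):
        for image in samples_by_label.get(label, []):
            if image is None:  # unreadable image: skip it
                continue
            X.append(fake_extract(image))
            y.append(one_hot[i])
    return np.array(X), np.array(y)

# Numbers stand in for images; one "Satu" sample is unreadable.
samples = {"Satu": [1, 2, None], "Dua": [3]}
X, y = build_arrays(samples, ["Satu", "Dua"])
print(X.shape, y.shape)
```

Each readable image contributes one row to `X` (84 features) and one row to `y` (one column per label), so the two arrays always have the same number of rows.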

## Data conversion from numpy array to Tensor
The following function converts the NumPy arrays into PyTorch tensors and wraps them in a `TensorDataset`, which the PyTorch `DataLoader` can consume.

In [15]:
def prepare_data(X, y):
    X_tensor = torch.tensor(X, dtype=torch.float32)
    y_tensor = torch.tensor(y, dtype=torch.float32)
    return TensorDataset(X_tensor, y_tensor)

# Hand Landmark Extraction using Mediapipe

In [17]:
# Example usage
dataset = "dataset"                  # Base dataset directory
labels = ["Satu", "Dua"]     # List of labels (subdirectories)

In [19]:
# Call the function
X,y = read_landmarks(dataset, labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

I0000 00:00:1732154245.260926   15604 gl_context_egl.cc:85] Successfully initialized EGL. Major : 1 Minor: 5
I0000 00:00:1732154245.263658   20591 gl_context.cc:357] GL version: 3.2 (OpenGL ES 3.2 Mesa 23.2.1-1ubuntu3.1~22.04.2), renderer: llvmpipe (LLVM 15.0.7, 256 bits)


(49, 84) (49, 2) (13, 84) (13, 2)
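The printed shapes imply 62 samples in total (49 + 13). The split sizes can be reproduced with the arithmetic below, assuming the test set size is rounded up, which is consistent with the output above. `split_sizes` is an illustrative helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test count is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_train, n_test = split_sizes(62, 0.2)
print(n_train, n_test)  # 49 13
```

With only 13 test samples, each test image accounts for roughly 7.7 percentage points of accuracy, so the evaluation figures should be read with that granularity in mind.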


In [20]:
# Parameters
input_size = X_train.shape[1]  # Number of features (flattened landmarks)
num_classes = len(labels)      # Number of labels
batch_size = 32                # Batch size for DataLoader
learning_rate = 0.001          # Learning rate
num_epochs = 20                # Number of epochs

In [21]:
# Create model, loss function, and optimizer
model = HandGestureCNN(input_size=input_size, num_classes=num_classes)
criterion = nn.CrossEntropyLoss()  # For classification tasks
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [23]:
# Prepare data loaders
train_dataset = prepare_data(X_train, y_train)
test_dataset = prepare_data(X_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
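With 49 training samples and a batch size of 32, each epoch consists of two batches: one of 32 samples and one of 17. The sketch below shows the arithmetic behind `len(train_loader)`, assuming the loader keeps the final partial batch (the default when `drop_last` is not set). `num_batches` is a hypothetical helper.

```python
import math

def num_batches(n_samples, batch_size, drop_last=False):
    """Number of batches a loader yields per pass over the data."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

train_batches = num_batches(49, 32)
test_batches = num_batches(13, 32)
last_train_batch = 49 - 32 * (train_batches - 1)
print(train_batches, test_batches, last_train_batch)  # 2 1 17
```

This matters for the training loop, which divides the accumulated loss by the number of batches: the average gives the 17-sample batch the same weight as the 32-sample batch.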

# Training

In [24]:
# Training loop
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    running_loss = 0.0
    for inputs, targets in train_loader:  # `targets` avoids shadowing the `labels` list
        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, torch.argmax(targets, dim=1))  # One-hot to index
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(train_loader):.4f}")


Epoch [1/20], Loss: 0.6673
Epoch [2/20], Loss: 0.6448
Epoch [3/20], Loss: 0.6185
Epoch [4/20], Loss: 0.5765
Epoch [5/20], Loss: 0.5708
Epoch [6/20], Loss: 0.5554
Epoch [7/20], Loss: 0.5180
Epoch [8/20], Loss: 0.5029
Epoch [9/20], Loss: 0.4920
Epoch [10/20], Loss: 0.4533
Epoch [11/20], Loss: 0.4492
Epoch [12/20], Loss: 0.4226
Epoch [13/20], Loss: 0.3885
Epoch [14/20], Loss: 0.3697
Epoch [15/20], Loss: 0.3475
Epoch [16/20], Loss: 0.3342
Epoch [17/20], Loss: 0.3179
Epoch [18/20], Loss: 0.3047
Epoch [19/20], Loss: 0.2644
Epoch [20/20], Loss: 0.2471
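The first-epoch loss of about 0.67 is close to ln 2 ≈ 0.693, which is the cross-entropy of a two-class model that assigns equal probability to both classes. The NumPy sketch below computes the same quantity that `nn.CrossEntropyLoss` reports with its default settings (log-softmax followed by the mean negative log-likelihood); the function name and the example logits are assumptions for illustration.

```python
import numpy as np

def cross_entropy(logits, class_indices):
    """Mean cross-entropy of raw scores against integer class indices."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    rows = np.arange(len(class_indices))
    return -log_probs[rows, class_indices].mean()

one_hot = np.array([[1.0, 0.0], [0.0, 1.0]])
targets = one_hot.argmax(axis=1)  # one-hot to index, as in the training loop

uninformed = cross_entropy(np.zeros((2, 2)), targets)             # equal scores
confident = cross_entropy([[4.0, 0.0], [0.0, 4.0]], targets)      # correct and sure
print(f"{uninformed:.4f} {confident:.4f}")
```

A loss that starts near ln 2 and falls steadily, as in the log above, is what one would expect from a two-class classifier that begins at chance level and gradually becomes more confident in the correct class.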


# Evaluation

In [26]:
# Testing loop
model.eval()  # Set the model to evaluation mode
correct = 0
total = 0
with torch.no_grad():
    for inputs, targets in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)  # Index of the highest-scoring class
        total += targets.size(0)
        correct += (predicted == torch.argmax(targets, dim=1)).sum().item()

print(f"Test Accuracy: {100 * correct / total:.2f}%")

Test Accuracy: 92.31%
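With 13 test samples, 92.31% corresponds to 12 correct predictions out of 13. The sketch below reproduces the accuracy computation in NumPy on synthetic scores; the data is made up to contain exactly one mistake and is not the notebook's actual test set.

```python
import numpy as np

def accuracy_percent(scores, one_hot_targets):
    """Percentage of rows whose top-scoring class matches the target."""
    predicted = np.argmax(scores, axis=1)
    expected = np.argmax(one_hot_targets, axis=1)
    return 100.0 * np.mean(predicted == expected)

# 13 synthetic one-hot targets, alternating between the two classes.
targets = np.eye(2)[[0, 1] * 6 + [0]]

# Start from perfect scores, then make the first prediction wrong.
scores = targets.copy()
scores[0] = [0.0, 1.0]

acc = accuracy_percent(scores, targets)
print(f"Test Accuracy: {acc:.2f}%")  # Test Accuracy: 92.31%
```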


# Saving the Model for Later Use

In [30]:
output_path = "outputs"
model_name = "model.pt"
if not os.path.exists(output_path):
    os.makedirs(output_path)  # Create the directory if it does not exist yet
    print(f"Directory '{output_path}' created.")
torch.save(model.state_dict(), os.path.join(output_path,model_name))
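The existence check before `os.makedirs` can also be folded into the call itself with `exist_ok=True`. The sketch below shows that variant in a temporary directory; `ensure_dir` is a hypothetical helper, and nothing is written to the notebook's real `outputs` folder.

```python
import os
import tempfile

def ensure_dir(path):
    """Create `path` if needed; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "outputs")
    ensure_dir(target)
    ensure_dir(target)  # second call succeeds without raising
    created = os.path.isdir(target)

print(created)
```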

# Inferencing

Let's load the saved model from disk.

In [32]:
model_path = os.path.join(output_path,model_name)
model = HandGestureCNN(input_size=input_size, num_classes=num_classes)
model.load_state_dict(torch.load(model_path, weights_only=True))


<All keys matched successfully>

In [34]:
model.eval()  # A freshly constructed module starts in training mode
correct = 0
total = 0
with torch.no_grad():
    for inputs, targets in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)  # Index of the highest-scoring class
        total += targets.size(0)
        correct += (predicted == torch.argmax(targets, dim=1)).sum().item()

print(f"Test Accuracy: {100 * correct / total:.2f}%")

Test Accuracy: 92.31%
