## Session 2 - MediaPipe

TOC : <br>

1. Overview of MediaPipe and its Capabilities
2. Comparison with other computer vision libraries and frameworks
3. Installation and Setup
4. MediaPipe Hand tracking and Gesture recognition
5. MediaPipe Face Detection and Tracking
6. MediaPipe Hand Gesture Recognition on Automatic Volume Control Project

### 1. Overview of MediaPipe and its Capabilities

MediaPipe is an open-source cross-platform framework developed by Google for building multimodal machine learning applications. It provides a wide range of pre-built modules for tasks like face detection, pose estimation, hand tracking, object detection, and more. MediaPipe is built using C++ and Python and can be used on multiple platforms like Android, iOS, Windows, and Linux. MediaPipe also provides APIs for integrating custom machine learning models into the pipeline.

MediaPipe provides ready-made solutions for a range of computer vision and machine learning tasks. Here are some of the things that can be done with it:

- Object detection and tracking: Detecting and tracking objects in video or images.
- Face detection and landmarks: Detecting faces and facial landmarks in video or images. Recognizing whose face it is requires an additional model.
- Hand tracking and gesture recognition: Detecting and tracking hands and recognizing gestures in real-time video or images.
- Pose estimation: Estimating the human body pose from images or video streams.
- Segmentation: Segmenting objects from images or video streams.
- Image and video processing: Various image and video processing operations such as resizing, cropping, rotation, filtering, and blending.
- Audio processing: Processing and analyzing audio signals, for example classifying sounds.
- Natural Language Processing (NLP): Text tasks such as text classification (for example sentiment analysis) and language detection.

These are only some examples. The library is actively developed and new solutions are added over time.

### 2. Comparison with other computer vision libraries and frameworks

MediaPipe combines machine learning models with traditional computer vision steps in ready-to-use pipelines, which sets it apart from general-purpose libraries. OpenCV mainly offers image processing building blocks, and TensorFlow is a framework for training and running models. MediaPipe builds on such components to provide a pipeline for multimodal applications that integrate several machine learning and computer vision techniques. Its pre-built modules can be used out of the box, reducing the amount of code that has to be written.

### 3. Installation and Setup

MediaPipe can be installed using pip, the Python package manager, as follows:

```pip install mediapipe```

In [2]:
!pip install mediapipe
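
After installing, it can help to confirm that the packages are visible to the current Python environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and `opencv-python` is assumed to be the distribution name that provides `cv2`.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the status of the packages this session relies on.
for name in ("mediapipe", "opencv-python"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If a package is reported as not installed, rerun the `pip install` command in the same environment that the notebook kernel uses.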



### 4. MediaPipe Hand tracking and Gesture recognition

One of the most popular features of MediaPipe is the hand landmark detection module, which detects 21 key points (landmarks) on a person's hand accurately and in real time.

The hand landmark detection module uses a deep neural network to analyze an input image or video frame and predict the 3D coordinates of the 21 hand landmarks. These landmarks correspond to points on the hand such as the wrist, the finger joints, and the fingertips.

The MediaPipe hand landmark detection pipeline is composed of several stages, including:

- Hand detection: The first step involves detecting the presence of a hand in the input image or video frame. This is done using a machine learning model that has been trained to recognize the shape and structure of a hand.

- Hand localization: Once a hand has been detected, the next step involves localizing the hand and aligning it to a canonical coordinate system. This is important for ensuring that the hand landmarks are consistently detected across different orientations and positions of the hand.

- Hand landmark estimation: The final stage involves estimating the 3D coordinates of the 21 hand landmarks. This is done using a deep neural network that has been trained on a large dataset of hand images and corresponding landmark annotations.

Once the hand landmarks have been detected, they can be used for a wide range of applications, such as gesture recognition, hand tracking, and virtual try-on. The MediaPipe hand landmark detection module is optimized for real-time performance and can be integrated into Python applications through the MediaPipe Python API.
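
The landmark coordinates are normalized to the image size, so most applications start by converting them to pixels. The sketch below shows that conversion together with the indices of a few commonly used landmarks. The helper `to_pixel` is hypothetical, and the sample values are made up; real values would come from `results.multi_hand_landmarks`. Clamping is included on the assumption that a landmark can fall slightly outside the frame when the hand is partly out of view.

```python
# Indices of a few of the 21 hand landmarks (0 = wrist, fingertips at 4, 8, 12, 16, 20).
WRIST = 0
THUMB_TIP = 4
INDEX_FINGER_TIP = 8
MIDDLE_FINGER_TIP = 12
RING_FINGER_TIP = 16
PINKY_TIP = 20


def to_pixel(norm_x, norm_y, image_width, image_height):
    """Convert normalized landmark coordinates to integer pixel coordinates.

    The result is clamped so that it always lies inside the image.
    """
    px = min(int(norm_x * image_width), image_width - 1)
    py = min(int(norm_y * image_height), image_height - 1)
    return max(px, 0), max(py, 0)


# Made-up landmark values for a 640x480 frame.
print(to_pixel(0.5, 0.25, 640, 480))  # (320, 120)
```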

In [1]:
import cv2
import mediapipe as mp

# Create the MediaPipe hand landmark detector
mp_hands = mp.solutions.hands.Hands(
    static_image_mode=False,
    max_num_hands=2,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5)

# Drawing helpers for landmarks and connections
mp_drawing = mp.solutions.drawing_utils

# Open the default webcam
cap = cv2.VideoCapture(0)

while True:
    # Read a new frame; stop if the camera returns nothing
    ret, frame = cap.read()
    if not ret:
        break

    # MediaPipe expects RGB input, while OpenCV captures and displays BGR,
    # so keep the original frame for display and convert a copy for detection
    rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    # Detect the hand landmarks in the current frame
    results = mp_hands.process(rgb_frame)

    # Draw the hand landmarks and connections on the BGR frame
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            mp_drawing.draw_landmarks(
                frame, hand_landmarks, mp.solutions.hands.HAND_CONNECTIONS)

    # Display the current frame in a window
    cv2.imshow('Hand Landmarks', frame)

    # Exit if 'q' is pressed
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release the detector and the video capture object, then close all windows
mp_hands.close()
cap.release()
cv2.destroyAllWindows()


<img align = "middle" src = 'Images/Output1.png' width = '700' height = '500'>

### 5. MediaPipe Face Detection and Tracking

- MediaPipe Face Detection is a pre-built computer vision pipeline developed by Google that uses machine learning to detect faces in real-time video streams or image sequences. It is based on a deep neural network trained on a large dataset of images and can detect multiple faces in the same frame.

- Following faces through a video involves two steps: detecting faces in the input frames or images, and associating the detections across frames so that each face keeps its identity. The example in this section performs the detection step on every frame; it does not assign persistent identities to faces.

- The face detection model (named BlazeFace in the MediaPipe documentation) follows the Single Shot Detector (SSD) approach, in which a single neural network predicts bounding boxes and confidence scores in one pass over the input image. It is trained on annotated images of faces and can detect faces in various orientations and lighting conditions.

- MediaPipe Face Detection can be used for a wide range of applications, including video conferencing, virtual makeup try-on, and emotion detection. It can also be tuned for specific use cases, for example through the `model_selection` and `min_detection_confidence` parameters, and integrated into other applications.
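
Each face found by the detection model comes with a bounding box expressed relative to the image size. Below is a minimal sketch of converting such a box to pixel coordinates. It assumes the `xmin`, `ymin`, `width` and `height` fields of `detection.location_data.relative_bounding_box`; the `RelativeBox` tuple and the `to_pixel_rect` helper are hypothetical stand-ins so that the example runs without a camera.

```python
from collections import namedtuple

# Stand-in for detection.location_data.relative_bounding_box
# (fields normalized to the image width and height).
RelativeBox = namedtuple("RelativeBox", ["xmin", "ymin", "width", "height"])


def to_pixel_rect(box, image_width, image_height):
    """Return (left, top, right, bottom) in pixels, clamped to the image."""
    left = int(box.xmin * image_width)
    top = int(box.ymin * image_height)
    right = int((box.xmin + box.width) * image_width)
    bottom = int((box.ymin + box.height) * image_height)
    left, right = max(left, 0), min(right, image_width - 1)
    top, bottom = max(top, 0), min(bottom, image_height - 1)
    return left, top, right, bottom


# Made-up detection covering the center of a 640x480 frame.
print(to_pixel_rect(RelativeBox(0.25, 0.25, 0.5, 0.5), 640, 480))  # (160, 120, 480, 360)
```

The resulting rectangle can be passed to `cv2.rectangle` when a custom drawing style is preferred over `draw_detection`.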

Here's an example code snippet to detect faces using MediaPipe:

In [2]:
import cv2
import mediapipe as mp

# Initialize the MediaPipe face detection module
mp_face_detection = mp.solutions.face_detection

# Initialize the MediaPipe drawing module
mp_draw = mp.solutions.drawing_utils

# Initialize the VideoCapture object
cap = cv2.VideoCapture(0)

# Create the detector once and reuse it for every frame
with mp_face_detection.FaceDetection(model_selection=0, min_detection_confidence=0.5) as face_detection:
    # Loop over the frames
    while True:
        # Read the frame from the camera
        success, img = cap.read()
        if not success:
            break

        # Convert the image to RGB, which MediaPipe expects
        img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        # Detect faces in the image
        results = face_detection.process(img_rgb)
        if results.detections:
            for detection in results.detections:
                # Draw the bounding box and key points on the BGR frame
                mp_draw.draw_detection(img, detection)

        # Display the image
        cv2.imshow("Face Detection", img)

        # Exit if 'q' is pressed
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

# Release the VideoCapture object and destroy the windows
cap.release()
cv2.destroyAllWindows()


<img align = "middle" src = 'Images/Output2.png' width = '700' height = '500'>
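
The example above detects faces independently in each frame. To follow a face across frames, each new detection has to be associated with a detection from the previous frame. The sketch below shows one simple way to do this, greedy matching by intersection over union (IoU). It is an illustration under stated assumptions, not MediaPipe's internal method: the functions `iou` and `match_to_previous`, the threshold of 0.3 and the sample boxes are all hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, width, height) boxes."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def match_to_previous(previous, current, threshold=0.3):
    """Greedily map each current box index to the best unused previous index.

    Returns a dict {current_index: previous_index}; boxes whose best overlap
    is below the threshold are left out and can be treated as new faces.
    """
    matches, used = {}, set()
    for cur_idx, cur_box in enumerate(current):
        best_idx, best_score = None, threshold
        for prev_idx, prev_box in enumerate(previous):
            score = iou(cur_box, prev_box)
            if prev_idx not in used and score >= best_score:
                best_idx, best_score = prev_idx, score
        if best_idx is not None:
            matches[cur_idx] = best_idx
            used.add(best_idx)
    return matches


# Made-up normalized boxes from two consecutive frames (order swapped).
frame_1 = [(0.10, 0.10, 0.20, 0.20), (0.60, 0.50, 0.20, 0.20)]
frame_2 = [(0.62, 0.50, 0.20, 0.20), (0.11, 0.10, 0.20, 0.20)]
print(match_to_previous(frame_1, frame_2))  # {0: 1, 1: 0}
```

Greedy matching is easy to follow but can fail when faces overlap or move quickly; it is meant only to make the idea of keeping identities concrete.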

### 6. MediaPipe Hand Gesture Recognition on Automatic Volume Control Project

In this Python project, we process the webcam video so that the volume of the device can be controlled by hand: the volume follows the distance between the base of the palm and the tip of the index finger.

Gesture recognition helps computers understand human body language. This builds a richer link between humans and machines than text interfaces or graphical user interfaces (GUIs) alone. In this project, the motions of the hand are read by the computer's camera, and the computer uses this data as input to control an application. The objective is to develop an interface that captures hand gestures dynamically and controls the volume level.
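
As a small illustration of turning hand landmarks into a gesture, the sketch below counts how many fingers are raised. It is a simplified heuristic, not the method used in this project: it assumes an upright hand, image coordinates in which y grows downward, and it ignores the thumb. The function name `count_extended_fingers` and the sample hand are hypothetical.

```python
# (tip, PIP joint) landmark index pairs for the index, middle, ring and pinky fingers.
FINGER_TIP_PIP_PAIRS = [(8, 6), (12, 10), (16, 14), (20, 18)]


def count_extended_fingers(landmarks):
    """Count non-thumb fingers whose tip lies above its PIP joint.

    `landmarks` is a sequence of 21 (x, y) pairs in image coordinates,
    where y grows downward, so "above" means a smaller y value.
    """
    if len(landmarks) != 21:
        raise ValueError("expected 21 hand landmarks")
    return sum(
        1 for tip, pip in FINGER_TIP_PIP_PAIRS
        if landmarks[tip][1] < landmarks[pip][1]
    )


# Made-up hand: every landmark at y = 0.8 except a raised index fingertip.
example = [(0.5, 0.8)] * 21
example[8] = (0.5, 0.3)
print(count_extended_fingers(example))  # 1
```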

The project uses the following libraries:

1) NumPy: a library for the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays.

2) Pycaw: Python Audio Control Library, used here to set the system volume. It works on Windows only.

3) MediaPipe: an open-source machine learning library from Google with solutions for face detection and gesture recognition, and with APIs for Python, JavaScript and other languages. MediaPipe Hands is a high-fidelity hand and finger tracking solution that uses machine learning (ML) to infer 21 3D hand landmarks from a single frame. We use it to extract the coordinates of the key points of the hand.
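
The core of the project is a linear mapping from the distance between two landmarks to a volume value. The sketch below reproduces that mapping in plain Python so the arithmetic is easy to follow. The helper `map_range` is hypothetical and is meant to behave like `np.interp` with two points, including clamping at both ends; the pixel positions are made up.

```python
from math import hypot


def map_range(value, in_min, in_max, out_min, out_max):
    """Linearly map value from [in_min, in_max] to [out_min, out_max].

    Values outside the input range are clamped to the nearest end.
    """
    value = max(in_min, min(in_max, value))
    fraction = (value - in_min) / (in_max - in_min)
    return out_min + fraction * (out_max - out_min)


# Made-up pixel positions of the palm base and the index fingertip.
palm_base, index_tip = (300, 400), (300, 210)
distance = hypot(index_tip[0] - palm_base[0], index_tip[1] - palm_base[1])

# Assumed hand range of 30 to 350 pixels, as used in this project.
percent = map_range(distance, 30, 350, 0, 100)        # volume percentage
bar_top = map_range(distance, 30, 350, 400, 150)      # y position of the bar
print(distance, percent, bar_top)  # 190.0 50.0 275.0
```

The output range for the bar runs from 400 down to 150 because image y coordinates grow downward, so a louder volume draws a taller bar.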



<img align = "middle" src = 'Images/GIF-Yash-Datar-Mediapipe-Project.gif' width = '700' height = '500'>

Install the packages required for this project:

```pip install mediapipe pycaw numpy comtypes```

OR, to upgrade packages that are already installed:

```pip install mediapipe --upgrade```

```pip install pycaw --upgrade```

```pip install comtypes --upgrade```


In [5]:
!pip install comtypes mediapipe pycaw numpy



In [None]:
# Import necessary libraries
import cv2
import mediapipe as mp
from math import hypot
from ctypes import cast, POINTER
from comtypes import CLSCTX_ALL
from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume
import numpy as np

# Capture video from the webcam
cap = cv2.VideoCapture(0)

# Initialize MediaPipe Hands
mpHands = mp.solutions.hands
hands = mpHands.Hands()
# Initialize MediaPipe drawing utilities
mpDraw = mp.solutions.drawing_utils

# Access the speaker through the pycaw library
devices = AudioUtilities.GetSpeakers()
interface = devices.Activate(IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
volume = cast(interface, POINTER(IAudioEndpointVolume))

# Initial volume bar position (in pixels) and percentage
volbar = 400
volper = 0

# Volume range of the device in decibels (minimum, maximum)
volMin, volMax = volume.GetVolumeRange()[:2]
    
    
while True:
  success, img = cap.read()
  # Stop if no frame could be read from the camera
  if not success:
    break

  # MediaPipe expects RGB input; OpenCV delivers BGR
  imgRGB = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

  # Collect the gesture information by running hand detection on the frame
  results = hands.process(imgRGB)

  lmList = []
  # If hands are detected
  if results.multi_hand_landmarks:
    # Iterate over all detected hands
    for handlandmark in results.multi_hand_landmarks:
      # Iterate over all landmarks in the hand
      for id, lm in enumerate(handlandmark.landmark):
        # Convert the normalized landmark coordinates to pixel coordinates
        h, w, _ = img.shape
        cx, cy = int(lm.x * w), int(lm.y * h)
        # Add landmark to list
        lmList.append([id, cx, cy])
      # Draw landmarks on image
      mpDraw.draw_landmarks(img, handlandmark, mpHands.HAND_CONNECTIONS)
 
    # If any landmarks were found
    if lmList != []:
        # Pixel coordinates of the palm base (landmark 0, the wrist)
        # and the index fingertip (landmark 8)
        x1, y1 = lmList[0][1], lmList[0][2]  # palm base
        x2, y2 = lmList[8][1], lmList[8][2]  # index fingertip

        # Mark both points with circles and join them with a line
        cv2.circle(img, (x1, y1), 13, (0, 0, 255), cv2.FILLED)  # image, center, radius, BGR color, fill
        cv2.circle(img, (x2, y2), 13, (255, 0, 0), cv2.FILLED)  # image, center, radius, BGR color, fill
        cv2.line(img, (x1, y1), (x2, y2), (0, 255, 0), 3)  # line between palm base and index fingertip

        # Add a light blue box around the index fingertip (MediaPipe hand landmark 8)
        # Get the index finger landmark.
        index_finger_landmark = lmList[8]

        # Calculate the top-left and bottom-right coordinates of the box.
        box_top_left = (index_finger_landmark[1] - 10, index_finger_landmark[2] - 10)
        box_bottom_right = (index_finger_landmark[1] + 10, index_finger_landmark[2] + 10)

        # Draw the box (OpenCV colors are given in BGR order).
        cv2.rectangle(img, box_top_left, box_bottom_right, (255, 255, 0), 2)
  
  
  
        # Distance between palm base and index fingertip, in pixels
        length = hypot(x2 - x1, y2 - y1)
        # Map the hand range (30 to 350 pixels) onto the device volume range
        # (in dB, for example -63.5 to 0.0), the bar position and a percentage
        vol = np.interp(length, [30, 350], [volMin, volMax])
        volbar = np.interp(length, [30, 350], [400, 150])
        volper = np.interp(length, [30, 350], [0, 100])

        # Display value: 100 minus the attenuation in dB, so 0 dB shows as 100.
        # This is not a true percentage; volper holds the 0 to 100 value.
        vols = 100 - np.abs(vol)
        print("Volume => ", vols, " Length => ", int(length))
        volume.SetMasterVolumeLevel(vol, None)

        # Draw the volume bar outline and its filled part
        cv2.rectangle(img, (50, 150), (85, 400), (0, 0, 255), 4)  # image, top-left, bottom-right, BGR color, thickness
        cv2.rectangle(img, (50, int(volbar)), (85, 400), (0, 0, 255), cv2.FILLED)
        # Alternative text showing the percentage instead:
        # cv2.putText(img, f"{int(volper)}%", (10, 100), cv2.FONT_ITALIC, 1, (0, 255, 98), 3)
        # Show the volume level: image, text, position, font, scale, BGR color, thickness
        cv2.putText(img, f"Volume Level : {int(vols)}", (10, 40), cv2.FONT_ITALIC, 1, (255, 0, 0), 3)
        
  # Show the frame on every iteration, even when no hand is detected,
  # so the window keeps updating and the exit key always works
  cv2.imshow('Image', img)
  # Exit when the spacebar is pressed
  if cv2.waitKey(1) & 0xff == ord(' '):
    break

cap.release()            # release the webcam
cv2.destroyAllWindows()  # close the window


<img align = "middle" src = 'Images/VolumneOutput1.png' width = '700' height = '500'>

<hr>