# Pose Estimation using YOLOv7 Algorithm!

Previously, I worked on several computer vision and YOLOv3 projects, such as an automatic out-of-stock inventory management system. Since YOLOv7 is considered one of the best object detection algorithms, I gave it a try! YOLO was created in 2015 by Joseph Redmon (who also proposed YOLOv2 and YOLOv3). Since then, open-source contributors have collaborated to create more versions in the YOLO family, such as:

YOLOv4, YOLOv5, PP-YOLO, Scaled YOLOv4, PP-YOLOv2, YOLOv6, and YOLOv7 (built on top of YOLOR, You Only Learn One Representation). YOLOv7 is more than just an object detection architecture: it provides new model heads that can output keypoints (skeletons) and perform instance segmentation in addition to bounding box regression.

Why did I use a YOLO model?
1. It lets me process video feeds at a high frames-per-second rate.

2. The open-source community keeps up R&D on it, with YOLOv7 being the latest addition.

3. There is no shortage of information or bug fixes to draw on while implementing a model.

### Accuracy comparisons 
1. YOLOv7-E6E: 56.8
2. Bi-Fusion Pyramid: 55.9
3. Dual Swin R-CNN: 53.9
4. Faster-RCNN: 42.0
5. YOLOv6: 35.0
 
Hence, my choice was YOLOv7.

# 

## Downloading Necessary Weights
Convolutional neural networks (CNNs) like YOLO need a lot of varied images to train on. To avoid retraining from scratch, we use transfer learning: the pre-trained weights of the first few layers are reused, and only the last layer (or last few layers) needs to be trained for your specific use case.

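```python
# Notebook shell commands: save the checkpoint in Weights/, where load_model() looks for it
! mkdir -p Weights
! curl -L https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7-w6-pose.pt -o Weights/yolov7-w6-pose.pt
```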

# 

## Importing Libraries

```python
# Core PyTorch, plus torchvision for common image transformations
import torch
from torchvision import transforms

# The utils.* modules come from the YOLOv7 repository, so run this from inside a clone of it.
# Keypoint-aware non-maximum suppression: keeps the highest-scoring of overlapping detections
from utils.general import non_max_suppression_kpt

# letterbox resizes and pads a frame to a shape that the model can work with
from utils.datasets import letterbox
# Helpers to convert model output into keypoint rows and to draw skeletons
from utils.plots import output_to_keypoint, plot_skeleton_kpts

# For graphs and plots
import matplotlib.pyplot as plt

# For image and video analysis
import cv2

# For working with arrays
import numpy as np
```

# 

## Check CPU or GPU Availability, Then Load the Pre-trained Weights
When a GPU is available, the model is switched from float32 to float16, which speeds up processing. Peak float16 matrix multiplication and convolution performance is 16x faster than peak float32 performance on A100 GPUs.

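```python
# Use the first CUDA GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def load_model():
    # The checkpoint stores the whole pickled model under the 'model' key.
    # Note: recent PyTorch releases default torch.load to weights_only=True,
    # in which case weights_only=False may need to be passed here.
    model = torch.load('Weights/yolov7-w6-pose.pt', map_location=device)['model']
    # Put in inference mode
    model.float().eval()

    if torch.cuda.is_available():
        # half() casts the model weights to float16,
        # which significantly lowers inference time on the GPU
        model.half().to(device)
    return model

model = load_model()
```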
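
As a rough illustration of what switching from float32 to float16 means for storage and precision, the sketch below uses only Python's standard `struct` module (not PyTorch) to round-trip a value through both formats. The helper name `round_trip` is hypothetical and exists only for this example.

```python
import struct

def round_trip(value, fmt):
    """Store a Python float in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

half_bytes = struct.calcsize("<e")    # IEEE 754 half precision (float16)
single_bytes = struct.calcsize("<f")  # IEEE 754 single precision (float32)
print(half_bytes, single_bytes)

# Rounding error introduced by each format when storing 0.1
err_half = abs(round_trip(0.1, "<e") - 0.1)
err_single = abs(round_trip(0.1, "<f") - 0.1)
print(err_half > err_single)
```

Half the bytes per value means less data to move around, at the cost of coarser rounding, which is usually an acceptable trade-off for inference.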

# 

## Transform Video & Run through the model!
Here, each frame of the video enters as an image array. We resize and pad the frame to the model's requirements using the letterbox() function, then run it through the model!

```python
def run_inference(image):
    # Resize and pad the frame; height and width come out as multiples of the stride
    image = letterbox(image, 960, stride=64, auto=True)[0] # e.g. (576, 960, 3) for a 1920x1080 frame
    # Convert the HWC uint8 array to a CHW float tensor scaled to [0, 1]
    image = transforms.ToTensor()(image) # e.g. torch.Size([3, 576, 960])
    if torch.cuda.is_available():
        # Match the float16 model on the GPU
        image = image.half().to(device)
    # Turn image into batch
    image = image.unsqueeze(0) # e.g. torch.Size([1, 3, 576, 960])
    with torch.no_grad():
        output, _ = model(image)
    return output, image
```
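
The resize-and-pad arithmetic can be sketched in plain Python. This is a simplified illustration of the idea, not the repository's `letterbox` implementation: `letterbox_shape` is a hypothetical helper that assumes the frame is scaled to fit inside a `new_size` square and then padded only up to the next multiple of `stride`.

```python
def letterbox_shape(height, width, new_size=960, stride=64):
    """Return the (height, width) of a frame after resizing and minimal padding."""
    # Scale so the longer side fits new_size while keeping the aspect ratio
    scale = min(new_size / height, new_size / width)
    resized_h, resized_w = round(height * scale), round(width * scale)
    # Pad each side length only up to the next multiple of the stride
    pad_h = (new_size - resized_h) % stride
    pad_w = (new_size - resized_w) % stride
    return resized_h + pad_h, resized_w + pad_w

print(letterbox_shape(1080, 1920))  # a Full HD frame
print(letterbox_shape(480, 640))    # a VGA frame
```

Under these assumptions, both output dimensions are always multiples of the stride, which is what lets the network downsample the frame cleanly.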

# 

## Returning Predictions & Image as a Tensor
The image and the predictions come back as tensors (N-dimensional arrays that support linear algebra operations). We then apply non-maximum suppression, a computer vision method that keeps a single detection out of many overlapping ones. Lastly, we plot the predicted skeletons.
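
To show the idea behind non-maximum suppression, here is a minimal greedy sketch in plain Python. It is not the `non_max_suppression_kpt` function from the YOLOv7 repository, which also handles classes and keypoints; `iou` and `greedy_nms` are hypothetical helpers, and the boxes are assumed to be `(x1, y1, x2, y2)` corner coordinates.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, conf_thres=0.25, iou_thres=0.65):
    """Return indices of the boxes to keep, highest score first."""
    # Drop low-confidence boxes, then visit the rest in descending score order
    order = sorted((i for i, s in enumerate(scores) if s > conf_thres),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[j]) <= iou_thres for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.1]
kept = greedy_nms(boxes, scores)
print(kept)
```

In this example the second box overlaps the first too heavily and is suppressed, and the last box falls below the confidence threshold, so only the first and third boxes survive.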

```python
def draw_keypoints(output, image):
    output = non_max_suppression_kpt(output,
                                     0.25, # Confidence Threshold
                                     0.65, # IoU Threshold
                                     nc=model.yaml['nc'], # Number of Classes
                                     nkpt=model.yaml['nkpt'], # Number of Keypoints
                                     kpt_label=True)
    with torch.no_grad():
        output = output_to_keypoint(output)
    # Convert the CHW tensor in [0, 1] back to an HWC uint8 image
    nimg = image[0].permute(1, 2, 0) * 255
    nimg = nimg.cpu().numpy().astype(np.uint8)
    # OpenCV draws and writes images in BGR order
    nimg = cv2.cvtColor(nimg, cv2.COLOR_RGB2BGR)
    for idx in range(output.shape[0]):
        # Columns 7 onward hold the keypoints in steps of 3 values each
        plot_skeleton_kpts(nimg, output[idx, 7:].T, 3)

    return nimg
```
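
To make the `output[idx, 7:]` slice above more concrete, here is a minimal sketch of how one detection row could be unpacked. It assumes the row layout produced by `output_to_keypoint` in the YOLOv7 repository (batch id, class id, box center x, center y, width, height, confidence, then `(x, y, confidence)` per keypoint) and 17 keypoints per person; `split_keypoints` is a hypothetical helper, not part of the repository.

```python
def split_keypoints(row, steps=3, offset=7):
    """Split one detection row into box fields and (x, y, conf) keypoints."""
    box = {
        "batch_id": row[0], "class_id": row[1],
        "x": row[2], "y": row[3], "w": row[4], "h": row[5],
        "conf": row[6],
    }
    # Everything after the box fields is a flat run of keypoint triplets
    flat = row[offset:]
    keypoints = [tuple(flat[i:i + steps]) for i in range(0, len(flat), steps)]
    return box, keypoints

# Made-up row: 7 box fields followed by 17 keypoints of (x, y, conf)
row = [0, 0, 480.0, 288.0, 100.0, 200.0, 0.9]
for k in range(17):
    row += [float(k), float(k) * 2, 0.5]

box, keypoints = split_keypoints(row)
print(len(keypoints))
print(keypoints[1])
```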

# 

## Using OpenCV to read a video:
We run this entire process on every frame of a video. Each processed frame is also written to a new file, encoded as a video. This will take significant time (more if done without a GPU).

```python
def pose_estimation_video(filename):
    cap = cv2.VideoCapture(filename)
    # Original frame size, reused for the output video
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    # VideoWriter for saving the video
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter('Video_output.mp4', fourcc, 30.0, (width, height))
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            # End of the video (or a read error)
            break
        # OpenCV reads frames as BGR; convert to RGB for the model
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        output, frame = run_inference(frame)
        frame = draw_keypoints(output, frame)
        # Stretch the letterboxed frame back to the original size
        frame = cv2.resize(frame, (width, height))
        out.write(frame)
        cv2.imshow('Pose estimation', frame)

        # Press 'q' to stop early
        if cv2.waitKey(10) & 0xFF == ord('q'):
            break

    cap.release()
    out.release()
    cv2.destroyAllWindows()
```
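
Note that `cv2.resize` stretches the padded frame, so the picture is slightly distorted whenever padding was added. If exact coordinates in the original frame are needed instead, keypoints can be mapped back by undoing the padding and the scaling. The following is a minimal sketch under stated assumptions: the frame was scaled to fit `new_size`, padded up to a multiple of `stride`, and the padding was split evenly between both sides. `unletterbox_point` is a hypothetical helper, not part of the YOLOv7 repository.

```python
def unletterbox_point(x, y, orig_h, orig_w, new_size=960, stride=64):
    """Map a point from the letterboxed frame back to the original frame."""
    scale = min(new_size / orig_h, new_size / orig_w)
    resized_h, resized_w = round(orig_h * scale), round(orig_w * scale)
    pad_h = (new_size - resized_h) % stride
    pad_w = (new_size - resized_w) % stride
    # Remove the padding on the top/left side, then undo the scaling
    return (x - pad_w / 2) / scale, (y - pad_h / 2) / scale

# Center of a letterboxed 1920x1080 frame
center = unletterbox_point(480, 288, orig_h=1080, orig_w=1920)
print(center)
```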


# 

## Final Deployment

```python
pose_estimation_video('Cap.mp4')
```



# 

## The newly generated video is then saved as 'Video_output.mp4' in the same directory! This last step will take some time, so be patient!

# 

# Thank you!