# Introduction 

In this tutorial, we will use MoveNet, the next-generation pose detection model from Google Research, which detects 17 keypoints on the human body. 

# MoveNet architecture in a nutshell 

## How does it work ? 

MoveNet uses heatmaps to accurately localize human keypoints. It's a **bottom-up** estimation model, which means that it first detects the joints of all the persons in the image, and then assembles these joints into a pose for each person. [[Source]](https://arxiv.org/pdf/1807.09972.pdf#:~:text=The%20top%2Ddown%20approaches%20first,full%20poses%20for%20all%20persons.)

## Architecture (2 main components)

- **Feature extractor** : A MobileNetV2 with an attached feature pyramid network. [Learn more about MobileNetV2](https://arxiv.org/pdf/1801.04381.pdf)
- **A set of predictor heads** : attached to the feature extractor. They are responsible for predicting : 
 - the geometric center of the instances (persons)
 - the full set of keypoints for a person
 - the location of all the keypoints
 - local offsets from each output feature map pixel to the precise sub-pixel location of each keypoint
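
To make the last bullet more concrete, here is a minimal NumPy sketch of how a heatmap peak and a local offset can be combined into a sub-pixel keypoint location. It is only an illustration: the function name `decode_keypoint`, the `(dy, dx)` offset layout and the `stride` value are assumptions made for this example, and MoveNet's actual decoding involves additional steps.

```python
import numpy as np

def decode_keypoint(heatmap, offsets, stride=4):
    """
    Hypothetical helper: combines a heatmap peak with its local offset.
    heatmap : (H, W) array of keypoint scores
    offsets : (H, W, 2) array of (dy, dx) offsets, in feature-map pixels (assumed layout)
    stride  : assumed ratio between the input image and the feature map
    """
    # Coarse location : the feature-map cell with the highest score
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    # Refinement : add the local offset predicted for that cell
    dy, dx = offsets[row, col]
    # Back to input-image coordinates
    y = (row + dy) * stride
    x = (col + dx) * stride
    return float(y), float(x), float(heatmap[row, col])
```

The offset is what lets the result fall between two feature-map cells, instead of being limited to the resolution of the heatmap.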
 
 
For a deeper explanation of the MoveNet processing steps, see [the TensorFlow blog post](https://blog.tensorflow.org/2021/05/next-generation-pose-detection-with-movenet-and-tensorflowjs.html). Now, let's start coding ! 





# Libraries

In [1]:
# Computer vision/graphics library
import cv2

# Gif writer
import imageio 

# Display libraries 
import matplotlib.pyplot as plt 
from IPython.display import HTML, display

# Calculations and Deep Learning library
import numpy as np 
import tensorflow as tf 
import tensorflow_hub as hub

In [2]:
! pip install -q git+https://github.com/tensorflow/docs

> We will use the "embed" module from TF Docs to embed the generated gif in our notebook

# Setup

## Map the bones (keypoint edges) to a color

> ![Colors_index](https://raw.githubusercontent.com/Justsecret123/Human-pose-estimation/main/Screenshots/mpl_colors.PNG)

In [3]:
# Colors as BGR tuples (OpenCV's channel order)
cyan = (255, 255, 0)
magenta = (255, 0, 255)

In [4]:
EDGE_COLORS = {
    (0, 1): magenta,
    (0, 2): cyan,
    (1, 3): magenta,
    (2, 4): cyan,
    (0, 5): magenta,
    (0, 6): cyan,
    (5, 7): magenta,
    (7, 9): cyan,
    (6, 8): magenta,
    (8, 10): cyan,
    (5, 6): magenta,
    (5, 11): cyan,
    (6, 12): magenta,
    (11, 12): cyan,
    (11, 13): magenta,
    (13, 15): cyan,
    (12, 14): magenta,
    (14, 16): cyan
}
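
The integer pairs above are indices into the 17 keypoints, which MoveNet returns in the COCO order (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles). As a small sketch, the list `KEYPOINT_NAMES` and the helper `describe_edge` below are names introduced here for illustration only; they make an edge such as `(5, 7)` readable.

```python
# Keypoint names in the order of the model output (COCO keypoints)
KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def describe_edge(edge):
    """Hypothetical helper: returns a readable name for an edge (pair of keypoint indices)"""
    start, end = edge
    return f"{KEYPOINT_NAMES[start]} -> {KEYPOINT_NAMES[end]}"

print(describe_edge((5, 7)))
```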

# Load the model from TF hub

> *Notes* : MoveNet is offered in two variants, known as : 
> - **Lightning :** intended for latency-critical applications 
> - **Thunder :** for applications that require high accuracy 

We will load the multipose *Lightning* model, which is able to detect multiple people (up to 6 instances) in the image frame at the same time. 

In [5]:
model = hub.load("https://tfhub.dev/google/movenet/multipose/lightning/1")
movenet = model.signatures["serving_default"]

# Adjust the input size 

According to [the official documentation](https://tfhub.dev/google/movenet/multipose/lightning/1) : 

1. The height/width are both multiple of 32.
2. The height to width ratio is close (and enough) to cover the original image's aspect ratio.
3. Make the larger side to be 256 (one should adjust this based on the speed/accuracy requirements). For example, a 720p image (i.e. 720x1280 (HxW)) should be resized and padded to 160x256 image.

Following these rules, our input gif (461x250, WxH) would be resized like this : 
- width : 461 ---> 256 (the larger side = 256). 
- height : 250 ---> 256 * (250/461) ~139. Since rule 1 requires the height to be a multiple of 32, the closest one is 128. Note that the 720p example above rounds up instead (144 ---> 160) and pads the remainder, which would give 160 here. 

> *Note* : 250/461 is the height-to-width ratio of the original gif.  
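
The three rules can be sketched as a small function. This is one possible reading of the documentation, not an official utility: the name `compute_input_size` and its parameters are assumptions, and the choice to round up to the next multiple of 32 (so that padding fills the gap) is inferred from the 720p example.

```python
import math

def compute_input_size(height, width, larger_side=256, multiple=32):
    """Hypothetical helper: returns (target_height, target_width) for a given image size"""
    longest = max(height, width)
    # Rule 3 : scale the image so that its larger side equals `larger_side`
    scaled_height = height * larger_side / longest
    scaled_width = width * larger_side / longest
    # Rule 1 : round each side up to a multiple of 32 (the extra space is padded)
    target_height = math.ceil(scaled_height / multiple) * multiple
    target_width = math.ceil(scaled_width / multiple) * multiple
    return target_height, target_width

print(compute_input_size(720, 1280))  # the 720p example from the documentation
print(compute_input_size(250, 461))   # our input gif
```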

For visualization purposes, we'll set the input_size to 256*256. 

In [6]:
#initial_width, initial_height = (461,250)
WIDTH = HEIGHT = 256

# Inference

## Define the loop

Steps : loop through the results ---> Draw the keypoints ----> Draw the edges 

In [7]:
def loop(frame, keypoints, threshold=0.11):
    """
    Main loop : Draws the keypoints and edges for each instance
    """
    
    # Loop through the results
    for instance in keypoints: 
        # Draw the keypoints and get the denormalized coordinates
        denormalized_coordinates = draw_keypoints(frame, instance, threshold)
        # Draw the edges
        draw_edges(denormalized_coordinates, frame, EDGE_COLORS, threshold)

## Draw keypoints

### Steps : 
- Use the input size (WIDTH, HEIGHT) as scaling parameters
- Denormalize the output coordinates by multiplying the keypoints by these parameters
- Iterate through the denormalized keypoints and draw a circle wherever the confidence score is higher than the preset threshold

In [8]:
def draw_keypoints(frame, keypoints, threshold=0.11):
    """Draws the keypoints on a image frame"""
    
    # Denormalize the coordinates : each keypoint is (y, x, score), so y is scaled by the height and x by the width
    denormalized_coordinates = np.squeeze(np.multiply(keypoints, [HEIGHT,WIDTH,1]))
    #Iterate through the points
    for keypoint in denormalized_coordinates:
        # Unpack the keypoint values : y, x, confidence score
        keypoint_y, keypoint_x, keypoint_confidence = keypoint
        if keypoint_confidence > threshold:
            # Draw the circle
            # Note : A thickness of -1 px fills the circle with the specified color.
            cv2.circle(
                img=frame, 
                center=(int(keypoint_x), int(keypoint_y)), 
                radius=4, 
                color=(255,0,0),
                thickness=-1
            )
    return denormalized_coordinates
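
Since the keypoints are stored as (y, x, score), the scaling order matters as soon as the frame is not square. Here is a minimal sketch of the denormalization step on its own, with a non-square frame; the function name `denormalize` and the sample values are made up for this example.

```python
import numpy as np

def denormalize(keypoints, height, width):
    """Hypothetical helper: scales normalized (y, x, score) keypoints to pixel coordinates"""
    keypoints = np.asarray(keypoints, dtype=np.float32)
    # y is scaled by the height, x by the width, and the score is left untouched
    return keypoints * np.array([height, width, 1], dtype=np.float32)

# One keypoint located at half the height and a quarter of the width
sample = [[0.5, 0.25, 0.9]]
print(denormalize(sample, height=100, width=200))
```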

## Draw the edges 
### Steps : 
- Iterate through the edges and their associated colors
- Get the two keypoints joined by each edge
- Draw the line if the confidence scores of both keypoints are higher than the preset threshold

In [9]:
def draw_edges(denormalized_coordinates, frame, edges_colors, threshold=0.11):
    """
    Draws the edges on a image frame
    """
    
    # Iterate through the edges 
    for edge, color in edges_colors.items():
        # Unpack the indices of the two keypoints joined by this edge
        p1, p2 = edge
        # Get the coordinates and confidence score of each endpoint
        y1, x1, confidence_1 = denormalized_coordinates[p1]
        y2, x2, confidence_2 = denormalized_coordinates[p2]
        # Draw the line from point 1 to point 2 if both confidence scores exceed the threshold
        if confidence_1 > threshold and confidence_2 > threshold:
            cv2.line(
                img=frame, 
                pt1=(int(x1), int(y1)),
                pt2=(int(x2), int(y2)), 
                color=color, 
                thickness=2, 
                lineType=cv2.LINE_AA # Gives anti-aliased (smoothed) line which looks great for curves
            )

## Progress bar

In [10]:
def progress(value, max=100):
    """
    Returns an HTML progress bar with a certain value. Used within each step
    """
    
    
    return HTML("""
      <progress
          value='{value}'
          max='{max}'
          style='width: 100%'
      >
          {value}
      </progress>
  """.format(value=value, max=max))

## Capture and process each frame

### Load the gif

In [11]:
def load_gif():
    """
    Loads the gif and returns its details
    """
    
    # Load the gif
    gif = cv2.VideoCapture("dancing.gif")
    # Get the frame count
    frame_count = int(gif.get(cv2.CAP_PROP_FRAME_COUNT))
    # Display parameter
    print(f"Frame count: {frame_count}")
    
    # Initialize the list of output frames.
    # We'll append each frame and its drawings to this list, then stack all the frames to obtain a sequence (video).
    output_frames = []
    
    # Get the initial shape (width, height)
    initial_shape = []
    initial_shape.append(int(gif.get(cv2.CAP_PROP_FRAME_WIDTH)))
    initial_shape.append(int(gif.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    
    return gif, frame_count, output_frames, initial_shape

## Run inference

In [21]:
def run_inference():
    """
    Runs inferences then starts the main loop for each frame
    """
    
    # Load the gif
    gif, frame_count, output_frames, initial_shape = load_gif()
    # Set the progress bar to 0. It ranges from the first to the last frame
    bar = display(progress(0, frame_count-1), display_id=True)
    
    # Loop while the gif is opened
    while gif.isOpened():
        
        # Capture the frame
        ret, frame = gif.read()
        
        # Exit if the frame is empty
        if frame is None: 
            break
        
        # Retrieve the frame index
        current_index = gif.get(cv2.CAP_PROP_POS_FRAMES)
        
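        # Copy the frame and resize it to the drawing size (cv2.resize expects (width, height))
        image = frame.copy()
        image = cv2.resize(image, (WIDTH,HEIGHT))
        # Resize with padding (target_height, target_width) and cast to an int32 tensor
        # Note : OpenCV frames are BGR, while the TF Hub model card describes an RGB input
        input_image = tf.cast(tf.image.resize_with_pad(image, HEIGHT, WIDTH), dtype=tf.int32)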
        # Create a batch (input tensor)
        input_image = tf.expand_dims(input_image, axis=0)

        # Perform inference
        results = movenet(input_image)
        # Output shape : [1, 6, 56] ---> (batch size), (instances), (values per instance)
        # - Values [0:51]  ---> 17 keypoints, each stored as (y, x, confidence score)
        # - Values [51:56] ---> bounding box and its score : [ymin, xmin, ymax, xmax, score]
        # We keep only the first 51 values and reshape them so that :
        # - First axis  ---> each instance
        # - Second axis ---> the 17 keypoints of that instance
        # - Third axis  ---> (y, x, confidence score) of each keypoint
        # The bounding boxes are dropped since they are not drawn here.
        
        keypoints = results["output_0"].numpy()[:,:,:51].reshape((6,17,3))

        # Loop through the results
        loop(image, keypoints, threshold=0.11)
        
        # Get the output frame : resize to the original size
        frame_rgb = cv2.cvtColor(
            cv2.resize(
                image,(initial_shape[0], initial_shape[1]), 
                interpolation=cv2.INTER_LANCZOS4
            ), 
            cv2.COLOR_BGR2RGB # OpenCV processes BGR images instead of RGB
        ) 
        
        # Add the drawings to the output frames
        output_frames.append(frame_rgb)
        
        # Update the progress bar
        bar.update(progress(current_index, frame_count-1))
    
    # Release the object
    gif.release()
    
    print("Completed !")
    
    return output_frames, keypoints
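
The slicing and reshaping done inside `run_inference` can be checked without the model. The sketch below uses a synthetic array with the same `[1, 6, 56]` shape as the model output; the function name `split_multipose_output` is an assumption made for this example, and unlike the code above it also keeps the bounding boxes and instance scores.

```python
import numpy as np

def split_multipose_output(output):
    """Hypothetical helper: splits a [1, 6, 56] output into keypoints, boxes and instance scores"""
    instances = np.asarray(output)[0]                   # (6, 56) : drop the batch axis
    keypoints = instances[:, :51].reshape((-1, 17, 3))  # (6, 17, 3) : (y, x, score) per keypoint
    boxes = instances[:, 51:55]                         # (6, 4) : [ymin, xmin, ymax, xmax]
    scores = instances[:, 55]                           # (6,) : one score per instance
    return keypoints, boxes, scores

# Synthetic output, only used to check the shapes
fake_output = np.arange(6 * 56, dtype=np.float32).reshape((1, 6, 56))
keypoints, boxes, scores = split_multipose_output(fake_output)
print(keypoints.shape, boxes.shape, scores.shape)
```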

In [31]:
output_frames, keypoints = run_inference()

Frame count: 33


Completed !


## Display the results

In [14]:
from tensorflow_docs.vis import embed

In [15]:
# Stack the output frames along a new first axis (time) to compose a sequence
output = np.stack(output_frames, axis=0) 
# Write the sequence to a gif
imageio.mimsave("./animation.gif", output, fps=15) 
# Embed the output in the notebook
embed.embed_file("./animation.gif") 

In [27]:
keypoints[0]

array([[0.25068104, 0.3601315 , 0.38916332],
       [0.23691972, 0.36608386, 0.3389375 ],
       [0.23892567, 0.34809518, 0.3740547 ],
       [0.24940054, 0.3704369 , 0.35202104],
       [0.25153872, 0.3348491 , 0.43017188],
       [0.30375075, 0.3895792 , 0.51235986],
       [0.31242025, 0.31626543, 0.7562996 ],
       [0.3011682 , 0.4167532 , 0.31978253],
       [0.2989637 , 0.3058221 , 0.2980497 ],
       [0.22511733, 0.40980348, 0.36508068],
       [0.20935246, 0.28852484, 0.6458673 ],
       [0.5131142 , 0.39824238, 0.7379388 ],
       [0.5158026 , 0.34628385, 0.7933722 ],
       [0.6910965 , 0.42619348, 0.61773205],
       [0.6953954 , 0.34197348, 0.42400232],
       [0.8711319 , 0.40923277, 0.17951949],
       [0.8746763 , 0.4178651 , 0.24260043]], dtype=float32)

In [30]:
keypoints[1]

array([[0.24404463, 0.5989478 , 0.35459504],
       [0.2320594 , 0.603568  , 0.30183798],
       [0.2323194 , 0.59706414, 0.37059602],
       [0.24187021, 0.61982214, 0.3234627 ],
       [0.24112438, 0.59949803, 0.31255576],
       [0.2917554 , 0.63753325, 0.44443354],
       [0.29789343, 0.5915087 , 0.41707975],
       [0.2637463 , 0.6480555 , 0.2878409 ],
       [0.2736293 , 0.6068786 , 0.20658395],
       [0.21979797, 0.6256859 , 0.2988396 ],
       [0.21100773, 0.6058525 , 0.22582237],
       [0.4772237 , 0.63216686, 0.59316695],
       [0.47819534, 0.60517496, 0.6549765 ],
       [0.62875414, 0.6386265 , 0.2938608 ],
       [0.59333664, 0.61933243, 0.28924587],
       [0.702123  , 0.6328608 , 0.24164288],
       [0.70140356, 0.6302175 , 0.26039487]], dtype=float32)

In [29]:
keypoints[3]

array([[9.2037553e-01, 4.3546325e-01, 4.9395151e-02],
       [9.2174315e-01, 4.3376902e-01, 5.5599563e-02],
       [9.1734844e-01, 4.3778184e-01, 3.8194831e-02],
       [9.3117046e-01, 4.0847531e-01, 4.4713177e-02],
       [9.2292511e-01, 4.3170574e-01, 3.7023149e-02],
       [9.5769811e-01, 3.2655218e-01, 9.6377824e-03],
       [8.4079975e-01, 4.2242569e-01, 1.1152841e-02],
       [9.4138068e-01, 3.3038598e-01, 8.3121248e-03],
       [9.1023040e-01, 4.2733881e-01, 3.8673871e-03],
       [9.0664637e-01, 3.3365458e-01, 4.4401184e-02],
       [8.8156235e-01, 4.2483351e-01, 2.2789411e-02],
       [7.4496341e-01, 3.3291078e-01, 2.8617997e-03],
       [7.0700854e-01, 4.1083524e-01, 6.3218211e-04],
       [7.4298465e-01, 3.3451051e-01, 3.8401224e-02],
       [7.4469841e-01, 4.2093572e-01, 3.8223501e-02],
       [8.7113190e-01, 4.0923277e-01, 1.7951949e-01],
       [8.7467629e-01, 4.1786510e-01, 2.4260043e-01]], dtype=float32)
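
The confidence scores (third column) explain what ends up on the frame: only keypoints above the threshold are drawn. As a quick sketch, the scores of `keypoints[3]` printed above are copied into a plain list, and the hypothetical helper `count_confident` counts how many of them pass the 0.11 threshold used in this notebook.

```python
# Confidence scores of keypoints[3], copied from the output above
scores_instance_3 = [
    4.9395151e-02, 5.5599563e-02, 3.8194831e-02, 4.4713177e-02,
    3.7023149e-02, 9.6377824e-03, 1.1152841e-02, 8.3121248e-03,
    3.8673871e-03, 4.4401184e-02, 2.2789411e-02, 2.8617997e-03,
    6.3218211e-04, 3.8401224e-02, 3.8223501e-02, 1.7951949e-01,
    2.4260043e-01,
]

def count_confident(scores, threshold=0.11):
    """Hypothetical helper: counts the keypoints whose score is higher than the threshold"""
    return sum(1 for score in scores if score > threshold)

print(count_confident(scores_instance_3))
```

Only the last two keypoints of this instance pass the threshold, so most of it is never drawn.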

# Thank you for your time ! :-)

Acknowledgements : 
- TF tutorials 

Ibrahim SEROUIS, 2022