<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# DSI 37 Capstone

<a id='part_i'></a>
[Part II](Part_2-Cleaning_and_EDA.ipynb#part_ii) <br>
[Part III](Part_3-Modelling.ipynb#part_iii) <br>
[Part IV](Part_4-Modelling.ipynb#part_iv)<br>
[Part V](Part_5-Implementation.ipynb#part_v)

# Part I: Import


## Contents

[1. Intro](#intro)<br>
[2. Glossary](#glossary)<br>
[3. Imports](#imports)<br>
[4. Keypoints Capture To CSV](#data_creation)<br>


## 1. Intro

<a id='intro'></a>

## Problem Statement

The surging global popularity of Muay Thai, with a projected market size of [research needed] by 2030, has led to an influx of new practitioners. Proper form is essential to prevent injuries and optimise performance. However, group classes lack personalised attention from instructors to correct form. Costly personal training is not affordable for everyone, especially casual hobbyists. Thus, there is a market opportunity for an AI Muay Thai trainer capable of observing users and offering real-time form recommendations. This solution could enhance gym training, allowing practitioners of all levels to refine their technique and achieve optimal results.

This project seeks to build a test version that will address only two moves in Muay Thai: a jab and a roundhouse kick.

## Goals

* Build a model that can distinguish between two different classes (jab and roundhouse kick) with an F1 score of at least 0.7
* Have it respond fast enough to produce real-time recommendations
* Show recommendations on screen, along with counters for the two classes
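
For reference, the F1 target in the first goal can be checked from raw prediction counts. The sketch below is a minimal, dependency-free illustration of the standard formula; the function name `f1_score_from_counts` and the example counts are hypothetical and are not results from this project.

```python
def f1_score_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1 = 2 * precision * recall / (precision + recall)."""
    if tp == 0:
        # No true positives means precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one class (not measured values).
example_f1 = f1_score_from_counts(tp=80, fp=20, fn=20)
meets_goal = example_f1 >= 0.7
print(round(example_f1, 3), meets_goal)
```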

## Description of this codebook

This is Part I of the code for this project. It covers how video footage previously filmed in the gym was used to generate keypoint data. The wider process involves identifying the keypoints and joints relevant to the two classes (which I will determine from domain knowledge) and using their relative positions and angles, first, to distinguish between the two classes and, second, to identify the different stages of each move so that recommendations and counting are possible. The cleaning process will consist of removing lines that contribute noise.
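
As an illustration of what "angles" between joints can mean here, the following sketch computes the angle at a middle joint from three 2D points. It is an assumption about one reasonable way to do this, not the method used later in the project; `joint_angle` and the sample coordinates are hypothetical.

```python
import numpy as np


def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    ba, bc = a - b, c - b
    norm = np.linalg.norm(ba) * np.linalg.norm(bc)
    if norm == 0:
        # Two of the points coincide, so the angle is undefined.
        return float('nan')
    cos_angle = np.clip(np.dot(ba, bc) / norm, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))


# Hypothetical (x, y) positions for shoulder, elbow and wrist.
bent_elbow = joint_angle((0.0, 0.0), (1.0, 0.0), (1.0, 1.0))
straight_arm = joint_angle((0.0, 0.0), (1.0, 0.0), (2.0, 0.0))
print(bent_elbow, straight_arm)
```

An elbow angle close to 180 degrees would correspond to a fully extended arm, which is the kind of signal that could separate the stages of a jab.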

## 2. Glossary

<a id='glossary'></a>

### Muay Thai:

Muay Thai, a Thai martial art, is also known as the "Art of Eight Limbs" because besides the arms and legs it also utilises the elbows and knees. It has experienced a global surge of popularity in recent years. It is also one of the most popular striking disciplines adopted by Mixed Martial Arts fighters. 

### Jab:

The Muay Thai jab is a quick, straight punch using the lead (or front) hand. It's a versatile and effective technique used to maintain distance, set up combinations, and gauge an opponent's reaction. The jab plays a crucial role in both offense and defense strategies in Muay Thai.

### Roundhouse Kick:

The Muay Thai roundhouse kick is a powerful kicking technique that involves pivoting on the supporting foot while swinging the other leg in a circular motion. It generates tremendous force from the hips, allowing the shin to strike the opponent, making it one of the sport's most devastating and signature moves.

### MoveNet:

*Source: https://tfhub.dev/google/movenet/singlepose/thunder/4*

A convolutional neural network model that runs on RGB images and predicts human joint locations of a single person. The model is designed to be run in the browser using TensorFlow.js or on devices using TF Lite in real-time, targeting movement/fitness activities. This variant, MoveNet.SinglePose.Thunder, is a higher-capacity model (compared to MoveNet.SinglePose.Lightning) that delivers better prediction quality while still achieving real-time (>30 FPS) speed. Naturally, thunder will lag behind the lightning, but it will pack a punch.

#### Training Data
* COCO Keypoint Dataset Training Set 2017: In-the-wild images with diverse scenes, instance sizes, and occlusions. The original training set contains 64k images (images, annotations). The images with three or more people were filtered out, resulting in a 28k final training set.
* Active Dataset Training Set: Images sampled from YouTube fitness videos that capture people exercising (e.g. HIIT, weight-lifting, etc.), stretching, or dancing. It contains diverse poses and motion with more motion blur and self-occlusions. The set of images with a single person contains 23.5k images.

#### Inputs

A frame of video or an image, represented as an int32 tensor of shape: 256x256x3. Channels order: RGB with values in [0, 255].

#### Outputs

A float32 tensor of shape [1, 1, 17, 3].

* The first two channels of the last dimension represent the yx coordinates (normalized to image frame, i.e. range in [0.0, 1.0]) of the 17 keypoints (in the order of: [nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, right ankle]).

* The third channel of the last dimension represents the prediction confidence scores of each keypoint, also in the range [0.0, 1.0].


![COCO_keypoints.png](attachment:COCO_keypoints.png)
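
To make the output layout concrete, here is a minimal sketch that unpacks an array of shape [1, 1, 17, 3] into named keypoints in pixel coordinates. It uses a synthetic array rather than a real model output, and the helper name `keypoints_to_pixels` and its `min_score` parameter are hypothetical.

```python
import numpy as np

# Keypoint order as listed in the Outputs description above.
KEYPOINT_NAMES = [
    'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
    'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
    'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
    'left_knee', 'right_knee', 'left_ankle', 'right_ankle',
]


def keypoints_to_pixels(keypoints_with_scores, height, width, min_score=0.0):
    """Map a [1, 1, 17, 3] array to {name: (x_px, y_px, score)}."""
    kpts = np.asarray(keypoints_with_scores)[0, 0]  # shape (17, 3): y, x, score
    result = {}
    for name, (y, x, score) in zip(KEYPOINT_NAMES, kpts):
        if score >= min_score:
            result[name] = (float(x * width), float(y * height), float(score))
    return result


# Synthetic stand-in for a model output: only the nose has a non-zero score.
fake_output = np.zeros((1, 1, 17, 3), dtype=np.float32)
fake_output[0, 0, 0] = [0.25, 0.5, 0.9]  # nose: y=0.25, x=0.5, score=0.9
pixels = keypoints_to_pixels(fake_output, height=256, width=256, min_score=0.5)
print(pixels)
```

Note that the model stores y before x, so the two values are swapped when building the usual (x, y) pixel pair.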

## 3. Imports (Libraries)

<a id='imports'></a>

In [2]:
# basic dependencies
import os
import datetime as dt
import pandas as pd
import numpy as np
import csv

In [3]:
# basic visualisation
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
import matplotlib.patches as patches

In [4]:
# video capture
import tensorflow as tf
import tensorflow_hub as hub
import cv2

## 4. Keypoints Capture To CSV

<a id='data_creation'></a>

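In this section I run MoveNet on the videos that I filmed, following the MoveNet documentation, and save the detected keypoints for every frame to CSV.

See the MoveNet entry in the Glossary for a description of the model; Part C of this section loads it.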

Here is a sub-chapter contents page for the methods:

### Contents - Keypoints Capture To CSV
[A - Dictionaries](#dictionaries)<br>
[B - Visualisation Functions](#viz_fns)<br>
[C - Importing MoveNet Model](#imports_movenet)<br>
[D - Importing Video Files](#imports_vids)<br>


### A - Dictionaries

<a id='dictionaries'></a>

In [5]:
# Dictionary that maps from joint names to keypoint indices.
# See the MoveNet entry in the Glossary for the keypoint diagram
KEYPOINT_DICT = {
    'nose': 0,
    'left_eye': 1,
    'right_eye': 2,
    'left_ear': 3,
    'right_ear': 4,
    'left_shoulder': 5,
    'right_shoulder': 6,
    'left_elbow': 7,
    'right_elbow': 8,
    'left_wrist': 9,
    'right_wrist': 10,
    'left_hip': 11,
    'right_hip': 12,
    'left_knee': 13,
    'right_knee': 14,
    'left_ankle': 15,
    'right_ankle': 16
}


In [6]:
# Maps bones to a matplotlib color name.
KEYPOINT_EDGE_INDS_TO_COLOR = {
    (0, 1): 'm',
    (0, 2): 'c',
    (1, 3): 'm',
    (2, 4): 'c',
    (0, 5): 'm',
    (0, 6): 'c',
    (5, 7): 'm',
    (7, 9): 'm',
    (6, 8): 'c',
    (8, 10): 'c',
    (5, 6): 'y',
    (5, 11): 'm',
    (6, 12): 'c',
    (11, 12): 'y',
    (11, 13): 'm',
    (13, 15): 'm',
    (12, 14): 'c',
    (14, 16): 'c'
}

### B - Visualisation Functions

<a id='viz_fns'></a>

These functions draw keypoints and edges on the video based on certain confidence thresholds.

In [7]:
def _keypoints_and_edges_for_display(keypoints_with_scores,
                                     height,
                                     width,
                                     keypoint_threshold=0.11):
    """
    Returns high confidence keypoints and edges for visualisation.
    
    Args:
    
    keypoints_with_scores: A numpy array with shape [1, 1, 17, 3] representing
                            the keypoint coordinates and scores returned from the MoveNet model.
    height: height of the image in pixels.
    width: width of the image in pixels.
    keypoint_threshold: minimum confidence score for a keypoint to be visualised.
    
    Returns:
    
    A tuple (keypoints_xy, edges_xy, edge_colors) containing:
      * the coordinates of all keypoints of all detected entities;
      * the coordinates of all skeleton edges of all detected entities;
      * the colors in which the edges should be plotted.
      
    """

    keypoints_all = []
    keypoint_edges_all = []
    edge_colors = []
    num_instances, _, _, _ = keypoints_with_scores.shape
    for idx in range(num_instances):
        kpts_x = keypoints_with_scores[0, idx, :, 1]
        kpts_y = keypoints_with_scores[0, idx, :, 0]
        kpts_scores = keypoints_with_scores[0, idx, :, 2]
        kpts_absolute_xy = np.stack(
            [width * np.array(kpts_x), height * np.array(kpts_y)], axis=-1)
        kpts_above_thresh_absolute = kpts_absolute_xy[
            kpts_scores > keypoint_threshold, :]
        keypoints_all.append(kpts_above_thresh_absolute)

        # Collect the skeleton edges for this instance. This runs inside the
        # instance loop so that each instance's own scores and coordinates are used.
        for edge_pair, color in KEYPOINT_EDGE_INDS_TO_COLOR.items():
            if (kpts_scores[edge_pair[0]] > keypoint_threshold and
                    kpts_scores[edge_pair[1]] > keypoint_threshold):
                x_start = kpts_absolute_xy[edge_pair[0], 0]
                y_start = kpts_absolute_xy[edge_pair[0], 1]
                x_end = kpts_absolute_xy[edge_pair[1], 0]
                y_end = kpts_absolute_xy[edge_pair[1], 1]
                line_seg = np.array([[x_start, y_start], [x_end, y_end]])
                keypoint_edges_all.append(line_seg)
                edge_colors.append(color)
    if keypoints_all:
        keypoints_xy = np.concatenate(keypoints_all, axis=0)
    else:
        keypoints_xy = np.zeros((0, 17, 2))

    if keypoint_edges_all:
        edges_xy = np.stack(keypoint_edges_all, axis=0)
    else:
        edges_xy = np.zeros((0, 2, 2))
    return keypoints_xy, edges_xy, edge_colors

In [8]:
def draw_prediction_on_image(
    image, keypoints_with_scores, crop_region=None, close_figure=False,
    output_image_height=None):
    """
    
    Draws the keypoint predictions on image.
    
    Args:
    
    image: A numpy array with shape [height, width, channel] representing the
            pixel values of the input image.
    keypoints_with_scores: A numpy array with shape [1, 1, 17, 3] representing
                            the keypoint coordinates and scores returned from the MoveNet model.
    crop_region: An optional dictionary with keys 'x_min', 'y_min', 'x_max' and
                  'y_max' that defines the bounding box of the crop region in
                  normalized coordinates. If provided, this function will also
                  draw the bounding box on the image.
    close_figure: Unused; the figure is always closed before returning.
    output_image_height: An integer indicating the height of the output image.
                          Note that the image aspect ratio will be the same as the input image.

    Returns:

    A numpy array with shape [out_height, out_width, channel] representing the
    image overlaid with keypoint predictions.
    """

    height, width, channel = image.shape
    aspect_ratio = float(width) / height
    
    fig, ax = plt.subplots(figsize=(12 * aspect_ratio, 12))

    # To remove the huge white borders

    fig.tight_layout(pad=0)
    ax.margins(0)
    ax.set_yticklabels([])
    ax.set_xticklabels([])
    plt.axis('off')

    im = ax.imshow(image)
    line_segments = LineCollection([], linewidths=(4), linestyle='solid')
    ax.add_collection(line_segments)

    # Scatter plot for the keypoints; its offsets are filled in below.

    scat = ax.scatter([], [], s=60, color='#FF1493', zorder=3)
    (keypoint_locs, keypoint_edges, edge_colors) = _keypoints_and_edges_for_display(
                                                    keypoints_with_scores, height, width)

    if keypoint_edges.shape[0]:
        line_segments.set_segments(keypoint_edges)
        line_segments.set_color(edge_colors)

    if keypoint_locs.shape[0]:
        scat.set_offsets(keypoint_locs)
    
    if crop_region is not None:

        xmin = max(crop_region['x_min'] * width, 0.0)
        ymin = max(crop_region['y_min'] * height, 0.0)
        rec_width = min(crop_region['x_max'], 0.99) * width - xmin
        rec_height = min(crop_region['y_max'], 0.99) * height - ymin
        rect = patches.Rectangle(
            (xmin,ymin),rec_width,rec_height,
            linewidth=1,edgecolor='b',facecolor='none')
        
        ax.add_patch(rect)


    fig.canvas.draw()

    # buffer_rgba() is used because tostring_rgb() was deprecated in
    # Matplotlib 3.8; the alpha channel is dropped to keep an RGB image.
    image_from_plot = np.ascontiguousarray(
        np.asarray(fig.canvas.buffer_rgba())[..., :3])

    plt.close(fig)

    if output_image_height is not None:
        output_image_width = int(output_image_height / height * width)
        image_from_plot = cv2.resize(
            image_from_plot, dsize=(output_image_width, output_image_height),
            interpolation=cv2.INTER_CUBIC)
    
    return image_from_plot

### C - Importing MoveNet Model

<a id='imports_movenet'></a>

In [9]:
module = hub.load("https://tfhub.dev/google/movenet/singlepose/thunder/4")
input_size = 256

In [10]:
def movenet(input_image):
    """
    Runs detection on an input image.
    
    Args:
    
    input_image: A [1, height, width, 3] tensor represents the input image
    pixels. Note that the height/width should already be resized and match the
    expected input resolution of the model before passing into this function.
    
    Returns:
    A [1, 1, 17, 3] float numpy array representing the predicted keypoint
    coordinates and scores. 
    """

    model = module.signatures['serving_default']

    # SavedModel format expects tensor type of int32.

    input_image = tf.cast(input_image, dtype=tf.int32)

    # Run model inference.

    outputs = model(input_image)

    # Output is a [1, 1, 17, 3] tensor.

    keypoints_with_scores = outputs['output_0'].numpy()

    return keypoints_with_scores
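
The function above expects a frame that has already been resized. Two details of that preparation are easy to overlook, and the sketch below illustrates them with NumPy only. First, OpenCV's `VideoCapture.read` returns frames in BGR channel order, whereas the MoveNet description in the Glossary specifies RGB, so a channel swap before inference may be worth testing. Second, a letterboxing resize such as `tf.image.resize_with_pad` (used later in this notebook) pads the frame to a square, so the normalised keypoints refer to the padded square rather than the original frame. The helper names `bgr_to_rgb` and `unpad_keypoint` are hypothetical, and the mapping ignores the integer rounding of the padding.

```python
import numpy as np


def bgr_to_rgb(frame):
    """Reverse the channel axis of an OpenCV-style BGR frame."""
    return np.ascontiguousarray(frame[..., ::-1])


def unpad_keypoint(y_norm, x_norm, frame_height, frame_width):
    """Map normalised coords on the padded square back to original frame pixels.

    Assumes the frame was scaled to fit a square and padded evenly on both
    sides of its shorter dimension.
    """
    side = max(frame_height, frame_width)
    pad_y = (side - frame_height) / 2.0
    pad_x = (side - frame_width) / 2.0
    return y_norm * side - pad_y, x_norm * side - pad_x


# A pure-blue BGR frame becomes blue in the last (B) channel of RGB.
bgr_frame = np.zeros((2, 2, 3), dtype=np.uint8)
bgr_frame[..., 0] = 255
rgb_frame = bgr_to_rgb(bgr_frame)

# Centre of the padded square maps to the centre of a 720x1280 frame.
centre_y, centre_x = unpad_keypoint(0.5, 0.5, frame_height=720, frame_width=1280)
print(rgb_frame[0, 0], centre_y, centre_x)
```

If the overlay is drawn on an image that was padded in the same way, as in the capture loops below, no unpadding is needed for display; it only matters when coordinates are compared against the original frame.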

### D - Importing Video Files

<a id='imports_vids'></a>

In [11]:
videos_folder = '../data/01-edited_videos/'
vidclasses = ['jabs', 'kicks']

jabs_folder = videos_folder + vidclasses[0] 
kicks_folder = videos_folder + vidclasses[1] 


In [12]:
def get_files(path):
    for file in os.listdir(path):
        if os.path.isfile(os.path.join(path, file)):
            yield file

In [13]:
files = get_files(jabs_folder)
jabs_files = sorted([f for f in files if f.endswith('.mov')])

print(jabs_files)

['1-jabs.mov', '2-jabs.mov', '3-jabs.mov', '4-jabs.mov', '5-jabs.mov', '6-jabs.mov', '7-jabs.mov', '8-jabs.mov', '9-jabs.mov', 'single_jab.mov']
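
The plain `sorted` call works for the filenames above because the jab videos are numbered 1 to 9. If a folder ever held unpadded numbers of 10 or more, a natural sort key would keep them in numeric order. This is a minimal sketch; `natural_key` is a hypothetical helper and the filenames in the example are made up.

```python
import re


def natural_key(name: str):
    """Split a filename into text and integer chunks so '10' sorts after '9'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r'(\d+)', name)]


# Made-up filenames to show the difference from plain lexicographic sorting.
names = ['10-jabs.mov', '2-jabs.mov', '1-jabs.mov', 'single_jab.mov']
lexicographic = sorted(names)
ordered = sorted(names, key=natural_key)
print(lexicographic)
print(ordered)
```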


In [15]:
files = get_files(kicks_folder)
kicks_files = sorted([f for f in files if f.endswith('.mov')])

print(kicks_files)

['01-kicks.mov', '02-kicks.mov', '03-kicks.mov', '04-kicks.mov', '05-kicks.mov', '06-kicks.mov', '07-kicks.mov', '08-kicks.mov', '09-kicks.mov', '10-kicks.mov', '11-kicks.mov', '12-kicks.mov', '13-kicks.mov', '14-kicks.mov', '15-kicks.mov', '16-kicks.mov', '17-kicks.mov', '18-kicks.mov', '19-kicks.mov', '20-kicks.mov', '21-kicks.mov', '22-kicks.mov', '23-kicks.mov', '24-kicks.mov', '25-kicks.mov', '26-kicks.mov', '27-kicks.mov', '28-kicks.mov', '29-kicks.mov', '30-kicks.mov', '31-kicks.mov']


In [16]:
columns = ['class']
for key, value in KEYPOINT_DICT.items():
    columns.extend([key + '_y', key + '_x', key + '_conf'])

with open('../data/02-raw_csv/jabs/coords.csv', mode='w', newline='') as f:
    csv_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerow(columns)
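
The header order (`_y`, `_x`, `_conf` per keypoint) has to match the order in which a flattened [17, 3] output is written. The sketch below checks this with a synthetic array and an in-memory buffer instead of a real file; `KEYPOINT_NAMES` and `fake_output` are stand-ins introduced for the example.

```python
import csv
import io

import numpy as np

KEYPOINT_NAMES = [
    'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
    'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
    'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
    'left_knee', 'right_knee', 'left_ankle', 'right_ankle',
]

# Same header layout as above: one class column plus three columns per keypoint.
columns = ['class']
for name in KEYPOINT_NAMES:
    columns.extend([name + '_y', name + '_x', name + '_conf'])

# Synthetic stand-in for one model output of shape [1, 1, 17, 3].
fake_output = np.arange(17 * 3, dtype=np.float32).reshape(1, 1, 17, 3)
row = ['jab'] + fake_output[0][0].flatten().tolist()

# Write header and row to an in-memory buffer, then read the row back by name.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns)
writer.writerow(row)
buffer.seek(0)
record = next(csv.DictReader(buffer))

print(len(columns), record['nose_y'], record['nose_x'], record['nose_conf'])
```

Each CSV row therefore has 52 fields: the class label followed by 17 keypoints times 3 values.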

In [17]:
for f in jabs_files:
    video_path = jabs_folder + '/' + f
    
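    cap = cv2.VideoCapture(video_path)

    # Frame count reported by the video container; read once per video so it
    # is defined even if no frame can be decoded.
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    output_path = '../data/03-png/jabs/'

    frame_count = 0

    while cap.isOpened():
        ret, frame = cap.read()

        if not ret:
            break
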

        # Reshape image
        image = frame.copy()
        image_height, image_width, channels = image.shape
        input_image = tf.expand_dims(image, axis=0)
        input_image = tf.image.resize_with_pad(input_image, input_size, input_size)


        # Run model inference.
        keypoints_with_scores = movenet(input_image)

        # Extract coordinates
        row = keypoints_with_scores[0][0].flatten().tolist()

        # Append class name
        row.insert(0,'jab')

        # Export to CSV
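        # The handle is named csv_file so it does not overwrite the loop
        # variable f (the video filename), which is used as the window title.
        with open('../data/02-raw_csv/jabs/coords.csv', mode='a', newline='') as csv_file:
            csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            csv_writer.writerow(row)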


        # Visualise the predictions with image.
        display_image = tf.expand_dims(image, axis=0)
        display_image = tf.cast(
            tf.image.resize_with_pad(display_image, 1280, 1280), dtype=tf.int32)
        output_overlay = draw_prediction_on_image(
            np.squeeze(display_image.numpy(), axis=0), keypoints_with_scores)

        cv2.imshow(f'{f}', output_overlay)
        
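        # Export as png. The video name is part of the filename so that frames
        # from different videos do not overwrite each other.
        frame_count += 1
        video_name = os.path.splitext(os.path.basename(video_path))[0]
        cv2.imwrite(f'{output_path}{video_name}_{frame_count}.png', output_overlay)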

        if cv2.waitKey(10) & 0xFF==ord('q'):
            break

    # Release the capture and close the preview window before the next video.
    cap.release()
    cv2.destroyAllWindows()
    cv2.waitKey(1)
    print('number of frames is ', num_frames)





number of frames is  3774
number of frames is  296
number of frames is  4946
number of frames is  3025
number of frames is  64
number of frames is  472
number of frames is  831
number of frames is  193
number of frames is  82
number of frames is  58


In [18]:
columns = ['class']
for key, value in KEYPOINT_DICT.items():
    columns.extend([key + '_y', key + '_x', key + '_conf'])

with open('../data/02-raw_csv/kicks/coords.csv', mode='w', newline='') as f:
    csv_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerow(columns)

In [None]:
for f in kicks_files:
    video_path = kicks_folder + '/' + f
    
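    cap = cv2.VideoCapture(video_path)

    # Frame count reported by the video container; read once per video so it
    # is defined even if no frame can be decoded.
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    output_path = '../data/03-png/kicks/'

    frame_count = 0

    while cap.isOpened():
        ret, frame = cap.read()

        if not ret:
            break
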

        # Reshape image
        image = frame.copy()
        image_height, image_width, channels = image.shape
        input_image = tf.expand_dims(image, axis=0)
        input_image = tf.image.resize_with_pad(input_image, input_size, input_size)


        # Run model inference.
        keypoints_with_scores = movenet(input_image)

        # Extract coordinates
        row = keypoints_with_scores[0][0].flatten().tolist()

        # Append class name
        row.insert(0,'kick')

        # Export to CSV
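        # The handle is named csv_file so it does not overwrite the loop
        # variable f (the video filename), which is used as the window title.
        with open('../data/02-raw_csv/kicks/coords.csv', mode='a', newline='') as csv_file:
            csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            csv_writer.writerow(row)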


        # Visualise the predictions with image.
        display_image = tf.expand_dims(image, axis=0)
        display_image = tf.cast(
            tf.image.resize_with_pad(display_image, 1280, 1280), dtype=tf.int32)
        output_overlay = draw_prediction_on_image(
            np.squeeze(display_image.numpy(), axis=0), keypoints_with_scores)

        cv2.imshow(f'{f}', output_overlay)
        
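        # Export as png. The video name is part of the filename so that frames
        # from different videos do not overwrite each other.
        frame_count += 1
        video_name = os.path.splitext(os.path.basename(video_path))[0]
        cv2.imwrite(f'{output_path}{video_name}_{frame_count}.png', output_overlay)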

        if cv2.waitKey(10) & 0xFF==ord('q'):
            break

    # Release the capture and close the preview window before the next video.
    cap.release()
    cv2.destroyAllWindows()
    cv2.waitKey(1)
    print('number of frames is ', num_frames)



number of frames is  162
number of frames is  727
number of frames is  429
number of frames is  101
number of frames is  145
number of frames is  144
number of frames is  95
number of frames is  118
number of frames is  1465
number of frames is  92
number of frames is  452
number of frames is  370
number of frames is  116
number of frames is  683
number of frames is  85
number of frames is  79
number of frames is  105
number of frames is  125
number of frames is  201
number of frames is  110
number of frames is  157
number of frames is  101
number of frames is  163
number of frames is  85
number of frames is  293
number of frames is  83
number of frames is  82
number of frames is  1997
number of frames is  3019
number of frames is  5771


## <b> End of Part I</b> <br>

[Part II](Part_2-Cleaning_and_EDA.ipynb#part_ii) <br>
[Part III](Part_3-Modelling.ipynb#part_iii) <br>
[Part IV](Part_4-Modelling.ipynb#part_iv)<br>
[Part V](Part_5-Implementation.ipynb#part_v)