# COGS 118B - Project Proposal (MotionMappers)

# Names
- Valeria Avila
- Hailey Nguyen
- Mohamed Abdilahi
- Sarah Kim
- Juan Villalobos


# Abstract 
In this project, we will use the UCF50 Action Recognition Dataset to identify the type of action or behavior a person is performing, based on pose landmark readings and unsupervised clustering algorithms. With the help of MediaPipe’s pose landmark detection, we will track each individual’s joints as they move throughout the video and extract the joint coordinates. From this joint coordinate data we will create line graphs, allowing us to observe trends, patterns, and fluctuations in movement. Finally, to cluster these actions and behaviors, we will use both a convolutional neural network and k-means clustering. The success of the clustering will be measured using PCK and MAE.

# Background
Human action recognition from video data is a field of research that draws interest from many sectors and industries. Previously, this space has leaned heavily on supervised learning models, which require heavily annotated datasets that are both expensive and time-consuming to collect. These models also come with their own problems, such as an inability to generalize across a wide variety of activities. One recent shift is a trend toward unsupervised learning, which aims to address these challenges by leveraging the vast quantities of unlabeled data available.

A notable advancement in unsupervised learning as it relates to action recognition is the work by Rhodin et al., which introduces an unsupervised approach for learning a geometry-aware representation of 3D human pose from multi-view images without the need for labeled data [1]. This methodology marks a significant stride toward scalable and adaptable pose estimation models by reducing reliance on extensive labeled datasets. A related study by Pavllo et al. demonstrates the efficacy of dilated temporal convolutions over 2D keypoints for video-based 3D human pose estimation. Their fully convolutional model captures temporal information and offers substantial improvements in computational efficiency and simplicity over traditional RNN-based methods [2].

We take inspiration from this work and aim to add to this area of research by applying unsupervised learning techniques to our own question about action recognition in short videos. In doing so, we hope to improve the understanding of human motion and contribute meaningfully to this academic conversation.


# Problem Statement
The problem at hand is action recognition in short videos from the UCF50 Action Recognition dataset. The objective is to precisely identify and localize body parts on the individuals depicted in each video. This has practical applications across domains such as sports analytics, healthcare, and human-computer interaction. A potential solution entails developing a machine learning model, such as a convolutional neural network (CNN), trained on the action recognition dataset. This model would learn the mapping between image pixels and the corresponding body parts, allowing it to predict pose maps for the individuals in a given image. The problem is quantifiable because it involves predicting dense pose maps, which are matrices that associate each pixel in an image with a specific body part. The accuracy of the model can be measured using metrics such as Mean Average Precision (mAP) or Intersection over Union (IoU), which quantify how well the predicted poses align with the ground truth. The problem is replicable because the action recognition dataset is publicly available: it can be divided into training, validation, and test sets so that others can reproduce the experiment and validate the model’s performance.


# Data

To complete this project, we will use the following dataset:

Link/reference to obtain data:
- https://www.crcv.ucf.edu/data/UCF50.php

Description of the size of the dataset (# of variables, # of observations):
- Number of variables: 50 (action categories)
- Number of observations: 6676 (video clips)

What an observation consists of:
- Videos of people performing some type of action in the wild, such as biking, drumming, or jumping jacks, to name a few

What some critical variables are, how they are represented:
- The critical variable is the action category of each video; within each category, the videos are grouped into 25 groups

Any special handling, transformations, cleaning, etc will be needed:
- Since the data is in video format, we will need to transform each video into analyzable information, such as per-frame joint coordinates
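
One way to picture this transformation, as a minimal sketch rather than our final preprocessing: because clips differ in length, each video's per-frame joint coordinates could be resampled to a fixed number of frames and flattened into a single vector that a clustering algorithm can consume. The function names, the nearest-frame sampling, and the default of 16 frames are illustrative assumptions, not decisions made in this proposal.

```python
from typing import List, Sequence

# One frame = flattened joint coordinates, e.g. [x0, y0, x1, y1, ...] (assumed layout).
Frame = Sequence[float]


def resample_frames(frames: List[Frame], target_len: int) -> List[Frame]:
    """Pick `target_len` frames at evenly spaced positions (nearest-frame sampling)."""
    if not frames:
        raise ValueError("video produced no usable frames")
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    if target_len == 1:
        return [frames[0]]
    last = len(frames) - 1
    step = last / (target_len - 1)
    # min() guards against floating-point overshoot past the final index.
    return [frames[min(last, round(i * step))] for i in range(target_len)]


def video_to_vector(frames: List[Frame], target_len: int = 16) -> List[float]:
    """Flatten a variable-length video into one fixed-length feature vector."""
    sampled = resample_frames(frames, target_len)
    return [value for frame in sampled for value in frame]
```

With this layout, every video yields a vector of length `target_len` times the number of coordinates per frame, regardless of how long the original clip was.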


# Proposed Solution

For the posed problem of human action estimation using the chosen dataset, a practical solution involves leveraging the MediaPipe library, which provides a pre-trained pose detection model. This solution is applicable to the project domain as it offers a quick and efficient way to estimate key points representing body parts without the need for extensive model training. MediaPipe's pose detection model is designed for real-time processing, making it suitable for scenarios where live or near-live pose estimation is crucial. The algorithm behind MediaPipe utilizes a convolutional neural network (CNN) architecture to detect key landmarks on the human body, enabling the extraction of essential pose information.

Implementation of the solution involves installing the MediaPipe library, loading the pre-trained pose detection model, and processing input images or frames. The extracted key points can then be used for various applications within the project domain. To ensure reproducibility, a clear set of instructions, including library versions and dependencies, will be provided.
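
The steps above could look roughly like the following sketch. It assumes the `mediapipe` and `opencv-python` packages are installed and uses MediaPipe's legacy `mp.solutions.pose` interface; newer MediaPipe releases also offer a separate Tasks API, which we have not evaluated here. The helper names `landmarks_to_row` and `extract_pose_rows` are hypothetical, and the choice to skip frames with no detected person is an assumption.

```python
from typing import Iterable, Iterator, List


def landmarks_to_row(landmarks: Iterable) -> List[float]:
    """Flatten one frame's landmarks into [x0, y0, z0, v0, x1, y1, ...]."""
    row: List[float] = []
    for lm in landmarks:
        row.extend([lm.x, lm.y, lm.z, lm.visibility])
    return row


def extract_pose_rows(video_path: str) -> Iterator[List[float]]:
    """Yield one flattened landmark row per frame in which a pose is detected."""
    # Imported lazily so landmarks_to_row can be used without these packages.
    import cv2  # provided by the opencv-python package
    import mediapipe as mp

    capture = cv2.VideoCapture(video_path)
    try:
        with mp.solutions.pose.Pose(static_image_mode=False) as pose:
            while True:
                ok, frame_bgr = capture.read()
                if not ok:
                    break  # end of video or read failure
                # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
                frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
                results = pose.process(frame_rgb)
                if results.pose_landmarks is None:
                    continue  # assumption: skip frames with no detected person
                yield landmarks_to_row(results.pose_landmarks.landmark)
    finally:
        capture.release()
```

Calling `list(extract_pose_rows(path))` on one clip would give one row per usable frame. The pose model reports 33 landmarks, so each row holds 132 values, with `x` and `y` normalized by the frame width and height.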

To test the solution's effectiveness, a set of evaluation metrics can be employed, such as Mean Average Precision (mAP) or Intersection over Union (IoU). These metrics will quantify the accuracy of the pose estimation results against ground truth annotations from the action recognition dataset. Additionally, a benchmark model, such as a basic pose detection model without fine-tuning on the dataset, can be used for comparison to assess the performance gain achieved by leveraging the pre-trained MediaPipe model. The reproducibility of the experiment will be ensured by documenting the dataset splits, preprocessing steps, and model parameters used during testing and benchmarking.


# Evaluation Metrics
We are considering three evaluation metrics (subject to change): Percentage of Correct Keypoints (PCK), Mean Absolute Error (MAE), and confusion matrices. PCK and MAE are commonly used to evaluate keypoint-based pose estimation in images and videos. However, because our chosen dataset contains categorized actions, we are also considering a confusion matrix if needed.
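
As a rough illustration of the confusion-matrix option, the sketch below counts how often each action label lands in each cluster, then summarizes the table by assigning every cluster its most frequent label. The function names and the purity summary are our own illustrative assumptions; cluster IDs from k-means carry no inherent meaning, so some mapping of this kind is needed before a confusion matrix can be read.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def contingency_counts(true_labels: Sequence[Hashable],
                       cluster_ids: Sequence[Hashable]) -> Counter:
    """Count (action label, cluster id) pairs: the cells of the confusion matrix."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("labels and cluster ids must have the same length")
    return Counter(zip(true_labels, cluster_ids))


def majority_label_per_cluster(counts: Counter) -> Dict[Hashable, Hashable]:
    """Map each cluster to its most frequent action label (first seen wins ties)."""
    best: Dict[Hashable, tuple] = {}
    for (label, cluster), n in counts.items():
        if cluster not in best or n > best[cluster][1]:
            best[cluster] = (label, n)
    return {cluster: label for cluster, (label, _) in best.items()}


def purity(true_labels: Sequence[Hashable],
           cluster_ids: Sequence[Hashable]) -> float:
    """Fraction of videos whose label matches their cluster's majority label."""
    if not true_labels:
        raise ValueError("at least one labeled video is required")
    counts = contingency_counts(true_labels, cluster_ids)
    mapping = majority_label_per_cluster(counts)
    matched = sum(n for (label, cluster), n in counts.items()
                  if mapping[cluster] == label)
    return matched / len(true_labels)
```

For example, with labels `["Biking", "Biking", "Drumming", "Drumming", "Drumming"]` and cluster IDs `[0, 0, 1, 1, 0]`, cluster 0 maps to Biking, cluster 1 maps to Drumming, and four of the five videos match their cluster's majority label.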

PCK measures the percentage of predicted key points that fall within a certain distance threshold of their ground-truth positions.

![EvaluationMetricsEquationsP1](group_template/imgs/evalmetrics1.png)
![EvaluationMetricsEquationsP2](group_template/imgs/evalmetrics2.png)
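
To make PCK and MAE concrete, here is a minimal sketch of how both could be computed for one frame of 2D key points. It assumes that ground-truth key points are available and that the PCK threshold is a fraction `alpha` of a reference length `scale` (for instance a torso or head size). The default `alpha=0.2` and the parameter names are illustrative assumptions. As far as we know, UCF50 ships action labels rather than key point annotations, so the ground truth here would have to come from another source.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pck(predicted: Sequence[Point], ground_truth: Sequence[Point],
        scale: float, alpha: float = 0.2) -> float:
    """Fraction of key points within alpha * scale of their ground-truth position."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty key point lists of equal length")
    limit = alpha * scale
    hits = sum(1 for p, g in zip(predicted, ground_truth)
               if math.dist(p, g) <= limit)
    return hits / len(predicted)


def mae(predicted: Sequence[Point], ground_truth: Sequence[Point]) -> float:
    """Mean absolute error over every individual coordinate."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty key point lists of equal length")
    errors = [abs(pc - gc)
              for p, g in zip(predicted, ground_truth)
              for pc, gc in zip(p, g)]
    return sum(errors) / len(errors)
```

A higher PCK and a lower MAE both indicate predictions closer to the ground truth. PCK is less sensitive to a single badly misplaced joint, while MAE reflects the size of every error.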

# Ethics & Privacy

The data comes from the UCF50 dataset, which consists of videos of people performing some form of action in the “wild.” One concern with the UCF50 dataset is that it contains YouTube videos of individuals whose ages vary; the videos appear to include both adults and children. To address this concern about people’s information, we will, if necessary, remove identifiers that could implicitly reveal a person’s name or address. Another concern is the informed consent of the people who were filmed. We cannot control how the data was collected, so we need to trust that the individuals in the dataset were informed that they were being recorded.

# Team Expectations 

- Make sure to come to each meeting. If anyone needs to be absent, tell the other team members beforehand.
- Every team member needs to complete their assigned task.
- Each team member needs to communicate at every meeting, giving the group an update on their progress/work.


# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/14  |  10 AM |  Brainstorm topics/questions (all)  | Find potential datasets, brainstorm ideas, and assign each member to a specific part.  | 
| 2/20  |  8 PM |  Finalized dataset  | Finalize project proposal and submit it | 
| 2/24  | 8 PM  | Import and clean the data  | Import the data and perform any necessary data wrangling   |
| 2/29  | 8 PM  | Analyze data | Run the data through machine learning models and track the performance   |
| 3/10 | 8 PM  | Make adjustments to analysis | Make adjustments based on the performance of the results |
| 3/15  | 8 PM  | Work on report| Discuss/edit the full project |
| 3/20  | Before 11:59 PM  | Final review of the project as a team | Turn in Final Project  |

# Footnotes
- https://www.crcv.ucf.edu/data/UCF50_files/MVAP_UCF50.pdf
- https://oecd.ai/en/catalogue/metrics/percentage-of-correct-keypoints-%28pck%29

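[1] Rhodin, H., Salzmann, M., and Fua, P. Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation. ECCV 2018.

[2] Pavllo, D., Feichtenhofer, C., Grangier, D., and Auli, M. 3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training. CVPR 2019.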
