# <strong><u>1: Video Processing Notebook</u></strong>

This notebook is an implementation of video processing using MediaPipe, Google's pose estimation model. The process extracts x, y movement of the body using markerless motion capture for ## articulators, or keypoints. 

In this workflow, we output the x, y movement for the upper-body keypoints that are particularly relevant for co-speech gesture research: `right_shoulder`, `left_shoulder`, `right_elbow`, `left_elbow`, `right_wrist`, `left_wrist`, `right_eye`, `left_eye`, and `nose`. Other keypoints can also be tracked, as documented in more detail in the [MediaPipe documentation](https://developers.google.com/mediapipe/solutions/vision/pose_landmarker). Later stages of this workflow subset these keypoints further for a specific analysis.

This script takes a video input file, uses the MediaPipe model to track body movement, and outputs a dataframe with a timestamp in milliseconds (`time_ms`) and the x, y coordinates for the upper-body articulators specified above. The dataframe is then written to a CSV file.

This script assumes that you have cloned the Co-Speech-Gesture-Automation GitHub repository and that your input files are stored in the VIDEO_FILES folder. Your outputs will be written to the MOTION_TRACKING_FILES folder.
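As a minimal sketch of the folder convention described above, the output file name can be derived from the input file name. The helper `keypoints_csv_path`, the `repo_root` value, and the example video name are hypothetical; only the folder names and the `_keypoints.csv` suffix come from this notebook.

```python
from pathlib import Path

def keypoints_csv_path(video_path, repo_root):
    """Return the CSV path in MOTION_TRACKING_FILES that matches a video file."""
    stem = Path(video_path).stem  # file name without directory or extension
    return Path(repo_root) / "MOTION_TRACKING_FILES" / f"{stem}_keypoints.csv"

# Hypothetical example values; replace them with your own locations.
repo_root = Path("Co-Speech-Gesture-Automation")
example_video = repo_root / "VIDEO_FILES" / "example_video.mp4"
print(keypoints_csv_path(example_video, repo_root))
```

Building paths this way avoids hard-coding an absolute path for every machine that runs the notebook.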

### <strong>Requirements</strong>

To run this notebook, you will need the following Python packages:

- mediapipe
- opencv-python
- pandas
- matplotlib

You can install these packages using pip:
```shell
    pip install mediapipe opencv-python pandas matplotlib
```
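To confirm that the installation worked before running the rest of the notebook, a check along the following lines may help. This is a sketch rather than part of the original workflow: `missing_packages` is a hypothetical helper, and it only tests whether each module can be located, not whether its version is compatible.

```python
import importlib.util

# pip names mapped to import names (opencv-python is imported as cv2).
REQUIRED_MODULES = {
    "mediapipe": "mediapipe",
    "opencv-python": "cv2",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
}

def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]

missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```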



### <strong>Importing Libraries</strong>
Here, we import the libraries used in this notebook:
- `cv2` (from `opencv-python`): For reading video frames
- `mediapipe`: For pose estimation
- `os`: For checking and building file paths
- `pandas`: For DataFrame manipulation and data analysis

In [1]:
import cv2
import mediapipe as mp
import os
import pandas as pd

### <strong>Setting Parameters</strong>
In this section, you can modify the following parameters:
- `MODEL`: The complexity of the pose landmark model, passed to MediaPipe as `model_complexity`. MediaPipe accepts `0` (Lite), `1` (Full), and `2` (Heavy), and its own default is `1`. This notebook sets `MODEL = 2`.

Lower-complexity models are faster and relatively accurate; higher-complexity models generally improve tracking accuracy at the cost of processing time.

- `video_path`: Path to the video file.


This cell also checks that the file specified in `video_path` can be found before video processing begins. If the file is found, you can proceed to the initialization and video processing steps. If it is not found, amend your `video_path` before proceeding.

In [2]:
MODEL = 2  # model_complexity: 0 = Lite, 1 = Full, 2 = Heavy
video_path = "C:/Users/cosmo/Desktop/Random Scripts/Co-Speech Gesture Automation/Co-Speech-Gesture-Automation/TEST_VIDEOS/5003_I_board_one.MOV"  # Add your video file path here

if os.path.exists(video_path):
    print(f"{video_path} is a valid file. Proceed with processing.")
else:
    raise ValueError(f"{video_path} does not exist. Try adding the entire file path.")

C:/Users/cosmo/Desktop/Random Scripts/Co-Speech Gesture Automation/Co-Speech-Gesture-Automation/TEST_VIDEOS/5003_I_board_one.MOV is a valid file. Proceed with processing.


### <strong>Initialization</strong>
This part initializes MediaPipe components used in the notebook.

There is no need to do anything with this cell before running it.

In [3]:
# Initialize MediaPipe components
mp_drawing = mp.solutions.drawing_utils
mp_holistic = mp.solutions.holistic

### <strong>Keypoint Tracking</strong>
The core logic of video processing is performed in this loop.
This code block reads in a video file and extracts pose landmarks from each frame of the video using the `mediapipe` and `opencv-python` libraries. The pose landmarks are stored in a list, one row per frame, along with the corresponding timestamp in milliseconds; the list is converted to a DataFrame in the next step. Frames in which no pose is detected are skipped.

This cell can be run without making any changes unless you require tracking for keypoints not specified in this workflow.
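If you do need a different set of keypoints, one option is to drive both the row contents and the column names from a single list. The sketch below illustrates that idea and is not the code used in this notebook: `KEYPOINTS`, `column_names`, and `landmark_row` are hypothetical names, and the stand-in landmark objects exist only so that the example runs without MediaPipe.

```python
from types import SimpleNamespace

# Keypoints tracked in this workflow, named as in mp_holistic.PoseLandmark.
KEYPOINTS = [
    "RIGHT_SHOULDER", "LEFT_SHOULDER", "RIGHT_ELBOW", "LEFT_ELBOW",
    "RIGHT_WRIST", "LEFT_WRIST", "RIGHT_EYE", "LEFT_EYE", "NOSE",
]

def column_names(keypoints=KEYPOINTS):
    """Return ["time_ms", "<keypoint>_x", "<keypoint>_y", ...] in order."""
    columns = ["time_ms"]
    for name in keypoints:
        columns += [f"{name.lower()}_x", f"{name.lower()}_y"]
    return columns

def landmark_row(time_ms, landmarks, index_of, keypoints=KEYPOINTS):
    """Build one output row from a sequence of landmarks with .x and .y."""
    row = [time_ms]
    for name in keypoints:
        point = landmarks[index_of[name]]
        row += [point.x, point.y]
    return row

# Stand-in data so the sketch runs without MediaPipe. Assumption: in the
# notebook, landmarks would be results.pose_landmarks.landmark and index_of
# would be {name: mp_holistic.PoseLandmark[name] for name in KEYPOINTS}.
fake_landmarks = [SimpleNamespace(x=i / 10, y=i / 20) for i in range(len(KEYPOINTS))]
fake_index_of = {name: i for i, name in enumerate(KEYPOINTS)}

print(column_names()[:3])
print(landmark_row(0.0, fake_landmarks, fake_index_of)[:3])
```

With this layout, adding or removing a keypoint means editing only the `KEYPOINTS` list, and the column order stays in step with the row order.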


In [4]:
print(f"Processing video at {video_path}")
with mp_holistic.Holistic(static_image_mode=False, model_complexity=MODEL) as holistic:
    # List of rows (one per frame); converted to a DataFrame in a later cell
    data = []
    cap = cv2.VideoCapture(video_path)
    while cap.isOpened():
        ret, image = cap.read()
        
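        if not ret:
            # No frame was returned: for a video file, this marks the end of the video
            print("Ignoring empty camera frame.")
            break

        # MediaPipe expects RGB input, while OpenCV reads frames as BGR
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        results = holistic.process(image)

        # Timestamp of the current frame in milliseconds
        time_ms = cap.get(cv2.CAP_PROP_POS_MSEC)

        # Extract x, y for each keypoint; frames without a detected pose are skipped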
        if results.pose_landmarks is not None:
            right_shoulder_x = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.RIGHT_SHOULDER].x
            right_shoulder_y = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.RIGHT_SHOULDER].y
            left_shoulder_x = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.LEFT_SHOULDER].x
            left_shoulder_y = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.LEFT_SHOULDER].y
            right_elbow_x = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.RIGHT_ELBOW].x
            right_elbow_y = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.RIGHT_ELBOW].y
            left_elbow_x = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.LEFT_ELBOW].x
            left_elbow_y = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.LEFT_ELBOW].y
            right_wrist_x = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.RIGHT_WRIST].x
            right_wrist_y = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.RIGHT_WRIST].y
            left_wrist_x = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.LEFT_WRIST].x
            left_wrist_y = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.LEFT_WRIST].y
            right_eye_x = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.RIGHT_EYE].x
            right_eye_y = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.RIGHT_EYE].y
            left_eye_x = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.LEFT_EYE].x
            left_eye_y = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.LEFT_EYE].y
            nose_x = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.NOSE].x
            nose_y = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.NOSE].y
            
            data.append([time_ms, right_shoulder_x, right_shoulder_y, left_shoulder_x, left_shoulder_y, right_elbow_x, right_elbow_y, left_elbow_x, left_elbow_y, right_wrist_x, right_wrist_y, left_wrist_x, left_wrist_y, right_eye_x, right_eye_y, left_eye_x, left_eye_y, nose_x, nose_y])

    cap.release()

Processing video at C:/Users/cosmo/Desktop/Random Scripts/Co-Speech Gesture Automation/Co-Speech-Gesture-Automation/TEST_VIDEOS/5003_I_board_one.MOV
Ignoring empty camera frame.
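Because the loop above appends a row only when `results.pose_landmarks` is not `None`, the timestamps can contain gaps wherever no pose was detected. A simple way to spot such gaps is to compare each time step with the typical step. The sketch below shows one possible check; `find_time_gaps`, the `tolerance` value, and the example timestamps are illustrative assumptions.

```python
from statistics import median

def find_time_gaps(times_ms, tolerance=1.5):
    """Return (start, end) pairs whose step exceeds tolerance * median step."""
    pairs = list(zip(times_ms, times_ms[1:]))
    if not pairs:
        return []
    typical_step = median(end - start for start, end in pairs)
    return [
        (start, end)
        for start, end in pairs
        if (end - start) > tolerance * typical_step
    ]

# Illustrative timestamps with two frames missing after 66.7334 ms.
example_times = [0.0, 33.3667, 66.7334, 166.8335, 200.2002]
print(find_time_gaps(example_times))  # [(66.7334, 166.8335)]
```

In this notebook, the timestamps would be the first element of each row, for example `[row[0] for row in data]`.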


### <strong>Converting the data to a dataframe</strong>
Here the keypoint data is converted to a dataframe so that it can be previewed before it is written to a CSV file. All keypoint data are saved at this stage of processing, and there is no need to change the articulators listed here; later stages of processing will subset the data to a chosen articulator.

In [5]:
# Convert to DataFrame
df = pd.DataFrame(data, columns=[
    "time_ms", 
    "right_shoulder_x", "right_shoulder_y", 
    "left_shoulder_x", "left_shoulder_y", 
    "right_elbow_x", "right_elbow_y", 
    "left_elbow_x", "left_elbow_y", 
    "right_wrist_x", "right_wrist_y", 
    "left_wrist_x", "left_wrist_y", 
    "right_eye_x", "right_eye_y", 
    "left_eye_x", "left_eye_y",
    "nose_x", "nose_y"
    ])
df

| | time_ms | right_shoulder_x | right_shoulder_y | left_shoulder_x | left_shoulder_y | right_elbow_x | right_elbow_y | left_elbow_x | left_elbow_y | right_wrist_x | right_wrist_y | left_wrist_x | left_wrist_y | right_eye_x | right_eye_y | left_eye_x | left_eye_y | nose_x | nose_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.199620 | 0.340158 | 0.306103 | 0.321906 | 0.210611 | 0.519510 | 0.318417 | 0.467959 | 0.310565 | 0.599996 | 0.332360 | 0.565261 | 0.267303 | 0.210757 | 0.287394 | 0.215454 | 0.284312 | 0.236780 |
| 1 | 33.366700 | 0.199077 | 0.341203 | 0.306702 | 0.322055 | 0.210875 | 0.520649 | 0.318398 | 0.468547 | 0.309275 | 0.600028 | 0.332472 | 0.569860 | 0.269785 | 0.211651 | 0.287389 | 0.215700 | 0.285467 | 0.236680 |
| 2 | 66.733400 | 0.197892 | 0.342645 | 0.306807 | 0.321924 | 0.210921 | 0.521773 | 0.318284 | 0.468442 | 0.308196 | 0.599823 | 0.332442 | 0.569820 | 0.271009 | 0.212417 | 0.287368 | 0.215893 | 0.286066 | 0.236662 |
| 3 | 100.100100 | 0.197796 | 0.343561 | 0.306889 | 0.321761 | 0.211130 | 0.522904 | 0.318264 | 0.468158 | 0.308243 | 0.599915 | 0.332369 | 0.569683 | 0.269460 | 0.212267 | 0.287030 | 0.215534 | 0.285035 | 0.235656 |
| 4 | 133.466800 | 0.197817 | 0.343666 | 0.306919 | 0.321722 | 0.211240 | 0.523742 | 0.318264 | 0.467784 | 0.308225 | 0.600016 | 0.332232 | 0.569621 | 0.269199 | 0.212232 | 0.286917 | 0.215327 | 0.284551 | 0.234914 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2527 | 84317.584251 | 0.191473 | 0.324141 | 0.300755 | 0.319947 | 0.209462 | 0.501711 | 0.312668 | 0.475842 | 0.304399 | 0.592583 | 0.323143 | 0.591776 | 0.263099 | 0.198799 | 0.279575 | 0.204344 | 0.277965 | 0.221259 |
| 2528 | 84350.950951 | 0.191727 | 0.324120 | 0.300916 | 0.320243 | 0.209446 | 0.501701 | 0.312638 | 0.475948 | 0.304468 | 0.592853 | 0.323126 | 0.591873 | 0.261940 | 0.195944 | 0.279442 | 0.202622 | 0.276505 | 0.218559 |
| 2529 | 84384.317651 | 0.192007 | 0.324079 | 0.301033 | 0.320874 | 0.209437 | 0.501639 | 0.312634 | 0.476322 | 0.304462 | 0.592961 | 0.323138 | 0.591878 | 0.260903 | 0.195190 | 0.279620 | 0.202309 | 0.275700 | 0.217921 |
| 2530 | 84417.684351 | 0.192335 | 0.324077 | 0.301245 | 0.321590 | 0.209433 | 0.501635 | 0.312636 | 0.476725 | 0.304369 | 0.593014 | 0.323242 | 0.591977 | 0.260167 | 0.195006 | 0.279948 | 0.202337 | 0.275417 | 0.217824 |
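The coordinates in the preview above are not pixel values. MediaPipe reports landmark x and y as fractions of the frame width and height, so values normally fall between 0 and 1. If pixel coordinates are needed, they can be recovered by multiplying by the frame size, as in this minimal sketch. The `to_pixels` helper and the 1920 x 1080 frame size are assumptions for illustration; the actual frame size of your video may differ.

```python
def to_pixels(x_norm, y_norm, frame_width, frame_height):
    """Convert normalized x, y coordinates to pixel coordinates."""
    return x_norm * frame_width, y_norm * frame_height

# Hypothetical frame size. In the notebook it could be read from the open
# capture with cap.get(cv2.CAP_PROP_FRAME_WIDTH) and
# cap.get(cv2.CAP_PROP_FRAME_HEIGHT).
frame_width, frame_height = 1920, 1080

# nose_x and nose_y from the first row of the preview above.
nose_px = to_pixels(0.284312, 0.236780, frame_width, frame_height)
print(nose_px)
```

Keeping the normalized values in the CSV file, as this notebook does, makes recordings with different resolutions directly comparable.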


### <strong>Data Output</strong>
Finally, the DataFrame is saved as a CSV file or, optionally, as a pickle file (the pickle lines are commented out in the cell below). For testing, we recommend using a CSV file and then moving to a pickle file for larger datasets, because pickle files are more efficient to store and load.

In [6]:
# Preview the first rows of the DataFrame
print(f"DataFrame Head: {df.head()}")

# # Save DataFrame as pickle file
# pickle_file_name = "C:/Users/cosmo/Desktop/Random Scripts/Co-Speech Gesture Automation/Co-Speech-Gesture-Automation/MOTION_TRACKING_FILES/" + os.path.splitext(os.path.basename(video_path))[0] + "_keypoints.pkl"
# df.to_pickle(pickle_file_name)
# print(f"DataFrame saved as {pickle_file_name}")

# Save DataFrame as CSV file
csv_file_name = "C:/Users/cosmo/Desktop/Random Scripts/Co-Speech Gesture Automation/Co-Speech-Gesture-Automation/MOTION_TRACKING_FILES/" + os.path.splitext(os.path.basename(video_path))[0] + "_keypoints.csv"
df.to_csv(csv_file_name, index=False)
print(f"DataFrame saved as {csv_file_name}")

DataFrame Head:     time_ms  right_shoulder_x  right_shoulder_y  left_shoulder_x  \
0    0.0000          0.199620          0.340158         0.306103   
1   33.3667          0.199077          0.341203         0.306702   
2   66.7334          0.197892          0.342645         0.306807   
3  100.1001          0.197796          0.343561         0.306889   
4  133.4668          0.197817          0.343666         0.306919   

   left_shoulder_y  right_elbow_x  right_elbow_y  left_elbow_x  left_elbow_y  \
0         0.321906       0.210611       0.519510      0.318417      0.467959   
1         0.322055       0.210875       0.520649      0.318398      0.468547   
2         0.321924       0.210921       0.521773      0.318284      0.468442   
3         0.321761       0.211130       0.522904      0.318264      0.468158   
4         0.321722       0.211240       0.523742      0.318264      0.467784   

   right_wrist_x  right_wrist_y  left_wrist_x  left_wrist_y  right_eye_x  \
0       0.310565  