# **Exploration of Features**

---

In [None]:
import pandas as pd
from pathlib import Path
import sys

sys.path.append('..')
from preprocessing.preprocessing_functions import make_sequence_df

## **Understanding the structure of the given data**

<img src="../landmark_visualization/structure.png" width="1200" height="600">

## **Sequence Imports**

- A single-sequence dataframe, picked at random from the training landmark files
- A multi-sequence dataframe of roughly 1k - 1.5k sequences, containing every sequence of a few chosen signs

In [4]:
# 1) Single sequence
df_single_sequence = pd.read_parquet('../asl-signs/train_landmark_files/2044/635217.parquet')
display(df_single_sequence)

# 2) Every sequence of the selected signs
df_data = pd.read_csv('../asl-signs/train.csv')
subset_signs = ['cry', 'blow', 'shhh']
path = Path('../asl-signs')

df_multi_sequence = make_sequence_df(path=path, df_data=df_data, sign_list=subset_signs)
display(df_multi_sequence)

Unnamed: 0,frame,row_id,type,landmark_index,x,y,z
0,22,22-face-0,face,0,0.438251,0.449453,-0.047826
1,22,22-face-1,face,1,0.414527,0.404880,-0.071994
2,22,22-face-2,face,2,0.423745,0.420681,-0.042145
3,22,22-face-3,face,3,0.402349,0.372041,-0.043906
4,22,22-face-4,face,4,0.411857,0.393013,-0.074747
...,...,...,...,...,...,...,...
3796,28,28-right_hand-16,right_hand,16,0.237886,0.460492,-0.121701
3797,28,28-right_hand-17,right_hand,17,0.163785,0.452878,-0.134627
3798,28,28-right_hand-18,right_hand,18,0.279028,0.476364,-0.156879
3799,28,28-right_hand-19,right_hand,19,0.252220,0.497005,-0.139577


Unnamed: 0,frame,row_id,type,landmark_index,x,y,z,participant_id,sequence_id,sign
0,20,20-face-0,face,0,0.494400,0.380470,-0.030626,26734,1000035562,blow
1,20,20-face-1,face,1,0.496017,0.350735,-0.057565,26734,1000035562,blow
2,20,20-face-2,face,2,0.500818,0.359343,-0.030283,26734,1000035562,blow
3,20,20-face-3,face,3,0.489788,0.321780,-0.040622,26734,1000035562,blow
4,20,20-face-4,face,4,0.495304,0.341821,-0.061152,26734,1000035562,blow
...,...,...,...,...,...,...,...,...,...,...
25033381,45,45-right_hand-16,right_hand,16,,,,27610,996627508,shhh
25033382,45,45-right_hand-17,right_hand,17,,,,27610,996627508,shhh
25033383,45,45-right_hand-18,right_hand,18,,,,27610,996627508,shhh
25033384,45,45-right_hand-19,right_hand,19,,,,27610,996627508,shhh
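
`make_sequence_df` is imported from the project's preprocessing module and its implementation is not shown here. The sketch below is only a guess at what such a helper could do: filter the index table by sign, load each sequence file, attach the metadata columns seen in the output above, and concatenate. The function name `build_sequence_df`, the `reader` parameter, and the assumption that the index table has a `path` column are all hypothetical.

```python
from pathlib import Path

import pandas as pd


def build_sequence_df(base_path, df_data, sign_list, reader=pd.read_parquet):
    """Hypothetical stand-in for make_sequence_df (not the project's code)."""
    # Keep only the index rows whose sign was requested.
    selected = df_data[df_data["sign"].isin(sign_list)]
    parts = []
    for row in selected.itertuples(index=False):
        # Assumption: the index table stores a relative file path per sequence.
        seq = reader(Path(base_path) / row.path).copy()
        seq["participant_id"] = row.participant_id
        seq["sequence_id"] = row.sequence_id
        seq["sign"] = row.sign
        parts.append(seq)
    # Note: pd.concat raises ValueError if no sign matched (empty list).
    return pd.concat(parts, ignore_index=True)


# Toy index table and a fake reader, so the sketch runs without the dataset.
index_df = pd.DataFrame({
    "path": ["a.parquet", "b.parquet", "c.parquet"],
    "participant_id": [1, 2, 3],
    "sequence_id": [10, 20, 30],
    "sign": ["cry", "dog", "blow"],
})


def fake_reader(path):
    # Every "file" holds two rows of made-up landmark data.
    return pd.DataFrame({"frame": [0, 1], "x": [0.1, 0.2]})


combined = build_sequence_df("unused", index_df, ["cry", "blow"], reader=fake_reader)
```

Passing the reader as a parameter keeps the sketch testable without touching the disk; with the real dataset the default `pd.read_parquet` would be used.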


## **Sequence Structure**

In [3]:
# Rows per frame
frames = df_single_sequence.groupby('frame').size()
display(frames)

# Body parts (landmark types) present in the sequence
types = df_single_sequence['type'].unique()
display(types)

# Rows per frame and type
df_types = df_single_sequence.groupby(['frame','type']).size().unstack()
display(df_types)

# Distinct landmark indices per frame and type
df_landmarks = df_single_sequence.groupby(['frame', 'type'])['landmark_index'].nunique().unstack()
display(df_landmarks)

# Missing x, y, z values per frame and type
missing_by_frame_type = df_single_sequence.groupby(['frame', 'type'])[['x', 'y', 'z']].apply(lambda group: group.isna().sum())
display(missing_by_frame_type)

frame
22    543
23    543
24    543
25    543
26    543
27    543
28    543
dtype: int64

array(['face', 'left_hand', 'pose', 'right_hand'], dtype=object)

type,face,left_hand,pose,right_hand
frame,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
22,468,21,33,21
23,468,21,33,21
24,468,21,33,21
25,468,21,33,21
26,468,21,33,21
27,468,21,33,21
28,468,21,33,21


type,face,left_hand,pose,right_hand
frame,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
22,468,21,33,21
23,468,21,33,21
24,468,21,33,21
25,468,21,33,21
26,468,21,33,21
27,468,21,33,21
28,468,21,33,21


Unnamed: 0_level_0,Unnamed: 1_level_0,x,y,z
frame,type,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
22,face,0,0,0
22,left_hand,21,21,21
22,pose,0,0,0
22,right_hand,0,0,0
23,face,0,0,0
23,left_hand,21,21,21
23,pose,0,0,0
23,right_hand,0,0,0
24,face,0,0,0
24,left_hand,21,21,21


**Takeaways, structure of a sequence**:

- This sequence is 7 frames long, and each frame consists of 543 rows
- 4 parts of the body are recorded:
    - left hand
    - right hand
    - pose
    - face
- Every frame contains:
    - 468 face landmark points
    - 21 left hand landmark points
    - 21 right hand landmark points
    - 33 pose landmark points

    => Within its body part, each landmark point has its own unique `landmark_index`

    => so 468 + 21 + 21 + 33 = 543 rows per frame
- Every landmark has x, y and z coordinates
- `row_id` is just the concatenation `frame-type-landmark_index`, so it carries no extra information and seems useless for now
- Only the x, y and z coordinates appear to have missing values; in this sequence, every left hand coordinate is missing
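
The claims about `row_id` and the missing left hand can be checked directly rather than by eye. This is a minimal sketch on a toy frame with made-up values; the helper name `row_id_is_redundant` is hypothetical, and the same calls should work on `df_single_sequence`.

```python
import pandas as pd


def row_id_is_redundant(df: pd.DataFrame) -> bool:
    """True if row_id equals 'frame-type-landmark_index' on every row."""
    rebuilt = (
        df["frame"].astype(str)
        + "-" + df["type"]
        + "-" + df["landmark_index"].astype(str)
    )
    return bool((rebuilt == df["row_id"]).all())


# Toy data mimicking the landmark file layout (values are made up).
toy = pd.DataFrame({
    "frame": [22, 22, 23, 23],
    "row_id": ["22-face-0", "22-left_hand-0", "23-face-0", "23-left_hand-0"],
    "type": ["face", "left_hand", "face", "left_hand"],
    "landmark_index": [0, 0, 0, 0],
    "x": [0.438, float("nan"), 0.440, float("nan")],
})

redundant = row_id_is_redundant(toy)

# Share of missing x values per body part (1.0 means entirely missing).
missing_share = toy.groupby("type")["x"].apply(lambda s: s.isna().mean())
```

If `redundant` is `True` for the real data, `row_id` can be dropped safely, and a `missing_share` of 1.0 for a type confirms that the body part was never detected in the sequence.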

**To do**:

- Frame numbers probably do not start at 0 in most sequences; this one runs from 22 to 28
    - Reindex them from 0
- Missing values:
    - Missing values will be explored further and handled in the keypoint normalization step
- Explore larger sequence dataframes to see how much sequence length varies
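
Reindexing frames from 0 could look like the sketch below. The helper `reindex_frames` is a hypothetical name, not part of the project's preprocessing module. It subtracts the smallest frame number, per sequence when a grouping column is given, so any gaps between frame numbers are preserved rather than closed.

```python
import pandas as pd


def reindex_frames(df: pd.DataFrame, group_col=None) -> pd.DataFrame:
    """Shift frame numbers so each sequence starts at frame 0."""
    out = df.copy()
    if group_col is None:
        # Single sequence: subtract the overall minimum.
        out["frame"] = out["frame"] - out["frame"].min()
    else:
        # Several sequences: subtract each sequence's own minimum.
        out["frame"] = out["frame"] - out.groupby(group_col)["frame"].transform("min")
    return out


# Toy data: sequence 1 starts at frame 22, sequence 2 at frame 5.
toy = pd.DataFrame({
    "sequence_id": [1, 1, 1, 2, 2],
    "frame": [22, 22, 23, 5, 6],
})

reindexed = reindex_frames(toy, group_col="sequence_id")
single = reindex_frames(toy[toy["sequence_id"] == 1])
```

On the notebook's data this would be called as `reindex_frames(df_single_sequence)` or `reindex_frames(df_multi_sequence, group_col='sequence_id')`.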

In [4]:
df_frames = df_multi_sequence.groupby('sequence_id')['frame'].nunique()
sequences_count = df_multi_sequence['sequence_id'].nunique()
print('Number of sequences: ', sequences_count)

print('\nSequence ID and their frame count')
display(df_frames)
print('Description: ')
display(df_frames.describe())

Number of sequences:  1192

Sequence ID and their frame count


sequence_id
8707044        45
9763903        11
11814611       11
13828063       15
19503372       19
             ... 
4278134502    146
4279964985     17
4281280533     43
4284322802     19
4293583116     11
Name: frame, Length: 1192, dtype: int64

Description: 


count    1192.000000
mean       38.676174
std        43.014291
min         2.000000
25%        14.000000
50%        23.000000
75%        46.000000
max       404.000000
Name: frame, dtype: float64

**Takeaway**:

- This dataframe, which contains only the `cry`, `blow` and `shhh` signs, has 1192 sequences
- The sequences vary greatly in frame count (mean ≈ 38.7, median 23, std ≈ 43.0)

- The shortest sequence has 2 frames
- The longest sequence has 404 frames
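
To follow up on the length variety, the shortest and longest sequences can be located by ID, and the frame counts can be broken down by sign. The sketch below uses a toy dataframe with made-up values; the column names match `df_multi_sequence`, so the same calls should apply there.

```python
import pandas as pd

# Toy data: three sequences with 2, 2 and 3 distinct frames.
toy = pd.DataFrame({
    "sequence_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "frame": [0, 0, 1, 5, 6, 10, 11, 12, 12],
    "sign": ["cry"] * 3 + ["blow"] * 2 + ["shhh"] * 4,
})

# Number of distinct frames per sequence.
frame_counts = toy.groupby("sequence_id")["frame"].nunique()

# IDs of the shortest and longest sequences (first match on ties).
shortest_id = frame_counts.idxmin()
longest_id = frame_counts.idxmax()

# Frame-count summary per sign, to see whether length depends on the sign.
per_sign = (
    toy.groupby(["sign", "sequence_id"])["frame"]
    .nunique()
    .groupby(level="sign")
    .agg(["min", "median", "max"])
)
```

Inspecting the sequences behind `shortest_id` and `longest_id` is a quick way to judge whether the extremes are genuine recordings or outliers worth filtering.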