# 01_feature selection_and_data_loading

This notebook demonstrates how to use the FacemeshDataLoader class with facial landmark clusters for targeted analysis.

## Overview
- **Data Loader**: Handles loading enhanced CSV files with rolling baseline features
- **Facial Clusters**: Predefined anatomical groupings of landmarks for focused analysis
- **Rolling Baseline Features**: Temporal features (rb5, rb10) that provide movement context

## Configuration Options
You can customize the data loading in several ways:
- **Window Size**: Choose between 5-frame (rb5) or 10-frame (rb10) rolling baselines
- **Feature Types**: Select 'base', 'rb', 'diff', or combinations
- **Facial Regions**: Focus on specific anatomical areas (eyes, mouth, etc.)
- **Subjects**: Choose which participants to include
- **Sessions**: Select baseline, specific sessions, or all sessions
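
As a rough illustration of how these options could narrow down a column set, the sketch below filters column names by landmark index, feature type, and window size. It assumes the naming pattern that appears later in this notebook (`feat_<index>_<axis>`, with `_rb<window>` and `_rb<window>_diff` suffixes). The helper `select_feature_columns` is hypothetical and is not part of `FacemeshDataLoader`.

```python
import re

def select_feature_columns(columns, landmark_indices, feature_types=("base",), window_size=5):
    """Return the columns that match the requested landmarks and feature types.

    Hypothetical helper: assumes columns are named feat_<index>_<axis>,
    optionally followed by _rb<window> or _rb<window>_diff.
    """
    wanted = set(landmark_indices)
    suffixes = {
        "base": "",
        "rb": f"_rb{window_size}",
        "diff": f"_rb{window_size}_diff",
    }
    allowed = {suffixes[t] for t in feature_types}
    pattern = re.compile(r"^feat_(\d+)_([xyz])((?:_rb\d+)?(?:_diff)?)$")

    selected = []
    for col in columns:
        match = pattern.match(col)
        if match and int(match.group(1)) in wanted and match.group(3) in allowed:
            selected.append(col)
    return selected

# Made-up column names for demonstration only
columns = [
    "Time (s)", "feat_1_y", "feat_1_z", "feat_1_y_rb5",
    "feat_1_y_rb5_diff", "feat_2_y", "feat_1_y_rb10",
]
print(select_feature_columns(columns, [1], ("base", "diff"), window_size=5))
# ['feat_1_y', 'feat_1_z', 'feat_1_y_rb5_diff']
```

Matching on the full suffix rather than a substring keeps `rb5` and `rb10` columns from being confused with each other.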

In [20]:
import sys
import pandas as pd
import numpy as np
from pathlib import Path

# Direct imports (no path modification needed)
from data_loader_template import FacemeshDataLoader
from facial_clusters import (
    FACIAL_CLUSTERS, CLUSTER_GROUPS, EXPRESSION_CLUSTERS,
    get_cluster_indices, get_group_indices, 
    get_all_cluster_names, get_all_group_names
)

print("✓ Imports successful")
print(f"Available facial clusters: {len(FACIAL_CLUSTERS)}")
print(f"Available cluster groups: {len(CLUSTER_GROUPS)}")

✓ Imports successful
Available facial clusters: 32
Available cluster groups: 7


In [21]:
loader = FacemeshDataLoader(window_size=5)

## Facial Landmark Clusters

MediaPipe Face Mesh provides 478 facial landmarks when iris tracking is enabled (468 face points plus 10 iris points). We've organized these into anatomically meaningful clusters:

### Individual Clusters
- **Eyes**: Detailed upper/lower regions, iris tracking
- **Mouth**: Inner/outer lip boundaries  
- **Eyebrows**: Upper and lower eyebrow regions
- **Nose**: Tip, bottom, corners
- **Face Shape**: Silhouette/outline points

### Grouped Regions
- **mouth**: All lip-related clusters
- **eyes**: Combined left/right eye regions
- **eyebrows**: Combined eyebrow regions
- **nose**: All nose-related points
- **face_shape**: Overall face outline

### Expression-Specific Clusters
Predefined combinations of cluster groups for common expressions such as smile, frown, and surprise.
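
The three levels (clusters, groups, expressions) can be pictured as nested lookups. The sketch below uses toy dictionaries with made-up contents; the real `FACIAL_CLUSTERS`, `CLUSTER_GROUPS`, and `EXPRESSION_CLUSTERS` in `facial_clusters.py` may be shaped differently. It also shows why a set union is a safer way to count landmarks than summing list lengths when clusters share points.

```python
# Toy stand-ins with illustrative values only (not the real cluster definitions)
TOY_CLUSTERS = {
    "lipsUpperOuter": [61, 185, 40],
    "lipsLowerOuter": [146, 91, 61],  # shares landmark 61 with the upper lip
    "noseTip": [1],
}
TOY_GROUPS = {
    "mouth": ["lipsUpperOuter", "lipsLowerOuter"],
    "nose": ["noseTip"],
}
TOY_EXPRESSIONS = {"smile": ["mouth", "nose"]}  # hypothetical mapping

def group_indices(group):
    """Union of all landmark indices in a group, without duplicates."""
    seen = set()
    for cluster in TOY_GROUPS[group]:
        seen.update(TOY_CLUSTERS[cluster])
    return sorted(seen)

def expression_indices(expression):
    """Union of all landmark indices across the groups of an expression."""
    seen = set()
    for group in TOY_EXPRESSIONS[expression]:
        seen.update(group_indices(group))
    return sorted(seen)

print(group_indices("mouth"))       # [40, 61, 91, 146, 185]
print(expression_indices("smile"))  # [1, 40, 61, 91, 146, 185]
```

Summing the raw cluster lengths for `mouth` would give 6, while the union contains 5 distinct landmarks.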

In [22]:
# Display available clusters and groups
print("=== INDIVIDUAL CLUSTERS ===")
for cluster_name in sorted(get_all_cluster_names()):
    indices = get_cluster_indices(cluster_name)
    print(f"{cluster_name:20}: {len(indices):2d} landmarks")

print("\n=== CLUSTER GROUPS ===")
for group_name in sorted(get_all_group_names()):
    indices = get_group_indices(group_name)
    print(f"{group_name:15}: {len(indices):3d} landmarks")

print("\n=== EXPRESSION CLUSTERS ===")
for expr, groups in EXPRESSION_CLUSTERS.items():
    total_landmarks = sum(len(get_group_indices(g)) for g in groups)
    print(f"{expr:15}: {groups} ({total_landmarks} landmarks)")

=== INDIVIDUAL CLUSTERS ===
leftCheek           :  1 landmarks
leftEyeIris         :  5 landmarks
leftEyeLower0       :  9 landmarks
leftEyeLower1       :  9 landmarks
leftEyeLower2       :  9 landmarks
leftEyeLower3       :  9 landmarks
leftEyeUpper0       :  7 landmarks
leftEyeUpper1       :  7 landmarks
leftEyeUpper2       :  7 landmarks
leftEyebrowLower    :  6 landmarks
leftEyebrowUpper    :  8 landmarks
lipsLowerInner      : 11 landmarks
lipsLowerOuter      : 10 landmarks
lipsUpperInner      : 11 landmarks
lipsUpperOuter      : 11 landmarks
midwayBetweenEyes   :  1 landmarks
noseBottom          :  1 landmarks
noseLeftCorner      :  1 landmarks
noseRightCorner     :  1 landmarks
noseTip             :  1 landmarks
rightCheek          :  1 landmarks
rightEyeIris        :  5 landmarks
rightEyeLower0      :  9 landmarks
rightEyeLower1      :  9 landmarks
rightEyeLower2      :  9 landmarks
rightEyeLower3      :  9 landmarks
rightEyeUpper0      :  7 landmarks
rightEyeUpper1      :  7 landmarks
...

## Data Loader Configuration

The `FacemeshDataLoader` can be configured for different analysis needs:

### Window Size Options:
- **rb5**: 5-frame rolling baseline (captures short-term variations)
- **rb10**: 10-frame rolling baseline (captures longer-term patterns)

### Feature Type Options:
- **'base'**: Original coordinate features (feat_X_y, feat_X_z)
- **'rb'**: Rolling baseline averages (smoothed positions)
- **'diff'**: Deviations from rolling baseline (movement intensity)

### Typical Configurations:
- **Movement Analysis**: Use 'diff' features to study motion patterns
- **Position Analysis**: Use 'rb' features for stable positioning
- **Full Analysis**: Use all feature types for comprehensive modeling
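
To make the relationship between the three feature types concrete, here is a minimal pandas sketch that derives an `rb` and a `diff` column from a base column. It assumes a trailing window that includes the current frame and uses `min_periods=1` for the first frames. The enhanced CSV files may have been produced with a different windowing rule, so treat this as an illustration rather than the pipeline's actual method. The helper `add_rolling_baseline` is hypothetical.

```python
import pandas as pd

def add_rolling_baseline(df, column, window_size=5):
    """Add <column>_rb<N> (rolling mean) and <column>_rb<N>_diff (value minus mean).

    Assumption: trailing window including the current frame.
    """
    out = df.copy()
    rb_col = f"{column}_rb{window_size}"
    out[rb_col] = out[column].rolling(window=window_size, min_periods=1).mean()
    out[f"{rb_col}_diff"] = out[column] - out[rb_col]
    return out

# A landmark that sits still for four frames, then jumps
frames = pd.DataFrame({"feat_1_y": [0.0, 0.0, 0.0, 0.0, 1.0]})
result = add_rolling_baseline(frames, "feat_1_y", window_size=5)
print(result)
```

In the last frame the rolling mean is 0.2 and the difference is 0.8: the `rb` column lags behind the jump while the `diff` column highlights it.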

In [23]:
# Configuration options
WINDOW_SIZE = 5  # or 10 for longer temporal context
DATA_ROOT = "read"

# Initialize data loader
loader = FacemeshDataLoader(data_root=DATA_ROOT, window_size=WINDOW_SIZE)

print("✓ Data loader initialized")
print(f"  - Window size: {WINDOW_SIZE} frames")
print(f"  - Data root: {DATA_ROOT}")
print(f"  - Looking for files with suffix: {loader.suffix}")

✓ Data loader initialized
  - Window size: 5 frames
  - Data root: read
  - Looking for files with suffix: -rb5


## Single Subject Data Loading

Let's load data for one subject to understand the structure and verify everything works.

### What we're testing:
1. **File accessibility**: Can we find and load the enhanced CSV files?
2. **Data structure**: What columns are available?
3. **Feature organization**: How are the rolling baseline features organized?
4. **Data quality**: Are there any obvious issues?
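
For the data-quality item, a few generic checks can be run on any loaded DataFrame. The sketch below is one possible set of checks, assuming a `Time (s)` column and feature columns prefixed with `feat_`. The helper `quick_quality_report` is hypothetical and is not provided by the loader.

```python
import pandas as pd

def quick_quality_report(df, time_col="Time (s)"):
    """Summarize missing values and timestamp problems in a loaded DataFrame."""
    feature_cols = [c for c in df.columns if c.startswith("feat_")]
    return {
        "rows": len(df),
        "feature_columns": len(feature_cols),
        "missing_values": int(df[feature_cols].isna().sum().sum()),
        "time_is_monotonic": bool(df[time_col].is_monotonic_increasing),
        "duplicate_timestamps": int(df[time_col].duplicated().sum()),
    }

# Made-up sample with one missing value and one repeated timestamp
sample = pd.DataFrame({
    "Time (s)": [0.0, 0.1, 0.1, 0.3],
    "feat_1_y": [0.5, None, 0.6, 0.7],
})
report = quick_quality_report(sample)
print(report)
```

Note that `is_monotonic_increasing` accepts repeated timestamps, so the duplicate count is reported separately.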

In [24]:
# Test loading single subject data
TEST_SUBJECT = "e1"
TEST_SESSION = "baseline"

print(f"Loading {TEST_SUBJECT}-{TEST_SESSION}...")
df_test = loader.load_subject_data(TEST_SUBJECT, TEST_SESSION)

if not df_test.empty:
    print("✓ Successfully loaded data")
    print(f"  - Rows: {len(df_test):,}")
    print(f"  - Columns: {len(df_test.columns):,}")
    # Compute the time bounds once and reuse them
    t_min, t_max = df_test['Time (s)'].min(), df_test['Time (s)'].max()
    print(f"  - Time range: {t_min:.2f}s to {t_max:.2f}s")
    print(f"  - Duration: {t_max - t_min:.2f}s")
    
    # Show column types
    feature_types = {
        'metadata': [col for col in df_test.columns if col in ['Subject Name', 'Test Name', 'Time (s)', 'Face Depth (cm)']],
        'base_features': [col for col in df_test.columns if col.startswith('feat_') and '_rb' not in col],
        'rb_averages': [col for col in df_test.columns if f'_rb{WINDOW_SIZE}' in col and not col.endswith('_diff')],
        'rb_differences': [col for col in df_test.columns if col.endswith(f'_rb{WINDOW_SIZE}_diff')]
    }
    
    print("\n=== COLUMN BREAKDOWN ===")
    for ftype, cols in feature_types.items():
        print(f"{ftype:15}: {len(cols):4d} columns")
        
else:
    print(f"❌ Failed to load data for {TEST_SUBJECT}-{TEST_SESSION}")
    print("Check if the file exists and has the correct naming format")

Loading e1-baseline...
❌ Failed to load data for e1-baseline
Check if the file exists and has the correct naming format
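
When loading fails like this, a first diagnostic step is to list the files that actually exist under the data root. The sketch below searches for CSV files whose names contain the subject ID and the expected suffix. The helper `find_candidate_files` and the file names in the demo are hypothetical; the naming scheme the loader really expects is not shown in this notebook.

```python
import tempfile
from pathlib import Path

def find_candidate_files(data_root, subject, suffix):
    """Return CSV files under data_root whose name mentions subject and suffix.

    Deliberately loose substring matching, meant for diagnosis only.
    """
    root = Path(data_root)
    if not root.is_dir():
        return []  # the data root itself is missing or is not a directory
    return sorted(
        p for p in root.rglob("*.csv")
        if subject in p.name and suffix in p.stem
    )

# Demo in a temporary directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["e1-baseline-rb5.csv", "e1-baseline.csv", "e2-baseline-rb5.csv"]:
        (Path(tmp) / name).touch()
    found = [p.name for p in find_candidate_files(tmp, "e1", "-rb5")]
    missing = find_candidate_files(Path(tmp) / "does_not_exist", "e1", "-rb5")

print(found)    # ['e1-baseline-rb5.csv']
print(missing)  # []
```

An empty result for the real data root suggests checking `Path(DATA_ROOT).resolve()` first, since a relative path such as `"read"` is resolved against the notebook's current working directory.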
