# Full Pipeline for Live Sign Language Translation with Citizen ASL Dataset

This notebook implements a complete pipeline for converting sign language video into live text predictions using pose estimation and a language model.

## Pipeline Overview
1. **Landmark Extraction**: Extract body and hand landmarks from Citizen ASL sign videos using [MediaPipe Holistic](https://developers.google.com/mediapipe).
2. **Preprocessing**: Normalize and split the landmark sequences into training, validation, and test sets.
3. **Model Training**: Train a deep learning model on the preprocessed data.
4. **Live Translation**: Use the trained model and an LLM in a real-time translation system.

## Requirements:
- Install dependencies via:
    ```bash
    pip install -r requirements.txt 
    ```
    **Or** install them manually, using the following versions:
    - Python: 3.10.17
    - TensorFlow: 2.19.0
    - MediaPipe: 0.10.9
    - OpenAI: 1.82.0

- Obtain an API key for a Large Language Model (LLM):
    - This pipeline uses [OpenAI's GPT API](https://platform.openai.com/) for language enhancement in the live translation phase.
    - Note: The OpenAI API is a paid service, so you need a valid, funded API key.

- **Landmark extraction** requires a downloaded copy of the [ASL Citizen Dataset](https://www.microsoft.com/en-us/research/project/asl-citizen/).

> **Preprocessed data** and **model training** are available in [this notebook](https://www.kaggle.com/code/tobypu/aslcitizen-top200-training).

> **Live translation** is available for testing using a trained model covering 200 unique glosses (271 total classes including duplicates). To use it, specify the path to the trained model and the accompanying `index_to_gloss_200.json` file, which maps model outputs to gloss labels.

In [1]:
import sys
import tensorflow as tf
import mediapipe as mp
import openai
import cv2

print("Python:", sys.version)
print("TensorFlow:", tf.__version__)
print("MediaPipe:", mp.__version__)
print("OpenAI:", openai.__version__)
print("cv2:", cv2.__version__)


2025-06-08 15:16:12.758524: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-06-08 15:16:12.761459: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-06-08 15:16:12.769440: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1749370572.782765   48545 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1749370572.786568   48545 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1749370572.797503   48545 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linkin

Python: 3.10.17 (main, Jun  8 2025, 14:44:46) [GCC 15.1.1 20250521 (Red Hat 15.1.1-2)]
TensorFlow: 2.19.0
MediaPipe: 0.10.9
OpenAI: 1.82.0
cv2: 4.11.0


The following modules are imported for different stages of the pipeline:

In [2]:
from utils import asl_citizen_MP_encoding
from utils import preprocessing_split 
from utils import train_model 
from utils import live_translation

## 1. Landmark Extraction
This step converts raw sign videos from the Citizen ASL dataset into landmark sequences using MediaPipe Holistic.

### Input
- Sign videos from the Citizen ASL dataset
- Split CSV files: `train.csv`, `test.csv`, and `val.csv` under the `splits/` directory
- Citizen ASL dataset folders and videos should not be modified or renamed from their original structure.

### What It Does
For each video, the pipeline:
1. Extracts 3D landmarks for:
    - 33 pose landmarks
    - 21 left-hand landmarks
    - 21 right-hand landmarks
    - A subset of 32 facial landmarks
2. Combines them into a `(frames, 107, 3)` NumPy array.
3. Stores only the `(x, y)` coordinates and the `gloss` in a `.npz` file.

### Output
- One `.npz` file per video saved under `processed_save_dir`
- Each `.npz` file contains:
    - `landmarks`: `np.ndarray` of shape `(frames, 107, 2)`
    - `gloss`: The sign label
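
The stacking and saving steps above can be illustrated with a minimal sketch. It uses synthetic arrays in place of real MediaPipe output, and the function names, the group ordering (pose, left hand, right hand, face), and the zero-fill for undetected groups are all assumptions for illustration rather than the behavior of `asl_citizen_MP_encoding`.

```python
import io
import numpy as np

N_POSE, N_HAND, N_FACE = 33, 21, 32  # per-frame landmark counts from the list above


def combine_frame(pose, left_hand, right_hand, face):
    """Stack one frame's landmark groups into a (107, 3) array.

    Assumed ordering: pose, left hand, right hand, face subset.
    A group that was not detected (None) is filled with zeros (assumption).
    """
    groups = [(pose, N_POSE), (left_hand, N_HAND), (right_hand, N_HAND), (face, N_FACE)]
    parts = [
        np.zeros((n, 3), dtype=np.float32) if g is None else np.asarray(g, dtype=np.float32)
        for g, n in groups
    ]
    return np.concatenate(parts, axis=0)


def save_video_landmarks(file, frames, gloss):
    """Keep only (x, y) and write `landmarks` and `gloss` to an .npz file."""
    sequence = np.stack(frames)  # (frames, 107, 3)
    np.savez_compressed(file, landmarks=sequence[..., :2], gloss=gloss)


# Synthetic stand-in for detector output: 5 frames, left hand missing in each.
rng = np.random.default_rng(0)
frames = [
    combine_frame(rng.random((N_POSE, 3)), None, rng.random((N_HAND, 3)), rng.random((N_FACE, 3)))
    for _ in range(5)
]

buffer = io.BytesIO()  # in-memory file; a real path works the same way
save_video_landmarks(buffer, frames, "HELLO")
buffer.seek(0)
loaded = np.load(buffer)
print(loaded["landmarks"].shape, str(loaded["gloss"]))
```

Dropping the `z` coordinate at save time reduces each file by a third, which matters when the whole dataset is stored one `.npz` per video.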


> **Note**: Processing all videos with MediaPipe Holistic takes approximately 3-4 days.

In [None]:
video_dir = '/ASL_Citizen/videos'     # folder containing the unmodified dataset videos
processed_save_dir = 'processed_all'  # destination for the per-video .npz files
asl_citizen_MP_encoding.process_asl_dataset(video_dir, processed_save_dir)

## 2. Preprocessing
This step prepares the extracted landmarks for model training by normalizing and padding the data and encoding the labels.

### What It Does
1. Generate Label and One-Hot Encoders
    - Reads the glosses from `train.csv`, `val.csv`, or `test.csv`
    - Creates:
        - A label encoder (gloss -> integer)
        - A one-hot encoder (integer -> one-hot vector)
        - A gloss_to_index.json dictionary for mapping (gloss -> integer)
2. Encode Labels
    - For each `.npz` file:
        - Loads the gloss label
        - Converts it to one-hot format using the encoders
    - Saves the resulting arrays as `y_train.npy`, `y_val.npy`, and `y_test.npy` under `save_dir`
3. Preprocess Landmark Features
    - Loads landmark arrays of shape `(frames, 107, 2)`
    - Removes:
        - Pose landmarks 0-10 and 23-32
    - Pads or repeats frames to a fixed length of 150
    - Applies normalization:
        - Anchor-based normalization for body, face, and arm landmarks.
        - Hand normalization to bound hand keypoints in `[-0.5,0.5]`, centered at `(0,0)`
    - Saves the processed data as `x_train.npy`, `x_val.npy`, and `x_test.npy`, each of shape `(videos, 160, 86, 2)`, under `save_dir`
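
The landmark pruning and length-fixing steps can be sketched as follows. This is an illustration under stated assumptions, not the code in `preprocessing_split`: it assumes the 33 pose landmarks occupy the first 33 of the 107 indices, that long clips are truncated, and that short clips are extended by repeating the last frame. The target length is left as a parameter; the example passes 150, the value quoted in the step list.

```python
import numpy as np

# Pose landmarks 0-10 and 23-32 are dropped. This assumes the pose group
# sits at indices 0-32 of the 107 landmarks (layout assumption).
DROPPED = set(range(0, 11)) | set(range(23, 33))
KEPT = [i for i in range(107) if i not in DROPPED]  # 86 indices remain


def fix_length(sequence, target_len):
    """Return `sequence` with exactly `target_len` frames.

    Longer clips are truncated; shorter clips are extended by repeating
    the last frame. Both rules are assumptions about "pads or repeats".
    """
    n = sequence.shape[0]
    if n == 0:
        raise ValueError("sequence has no frames")
    if n >= target_len:
        return sequence[:target_len]
    tail = np.repeat(sequence[-1:], target_len - n, axis=0)
    return np.concatenate([sequence, tail], axis=0)


clip = np.random.default_rng(0).random((40, 107, 2)).astype(np.float32)
pruned = clip[:, KEPT, :]                    # (40, 86, 2)
fixed = fix_length(pruned, target_len=150)   # (150, 86, 2)
print(pruned.shape, fixed.shape)
```

Removing 11 + 10 pose landmarks from 107 leaves the 86 landmarks that appear in the saved array shape.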

### Input
- `.npz` files from the landmark extraction step (under `save_dir_name/{split}`)
- Corresponding `train.csv`, `val.csv`, or `test.csv` file

### Output
- One-hot encoded label arrays: `y_train.npy`, `y_val.npy`, and `y_test.npy`
- Preprocessed landmark arrays: `x_train.npy`, `x_val.npy`, and `x_test.npy`
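
As a rough illustration of how the one-hot label arrays relate to the gloss dictionary, the sketch below builds both with plain NumPy. The helper names are hypothetical, and assigning integers in sorted gloss order is an assumption (it mirrors what scikit-learn's `LabelEncoder` does); the source does not state which encoder implementation `generate_gloss_dictionary` uses.

```python
import numpy as np


def build_gloss_to_index(glosses):
    """Map each distinct gloss to an integer, in sorted order (assumed)."""
    return {gloss: i for i, gloss in enumerate(sorted(set(glosses)))}


def one_hot(indices, num_classes):
    """Turn a list of class indices into a (samples, num_classes) array."""
    out = np.zeros((len(indices), num_classes), dtype=np.float32)
    out[np.arange(len(indices)), indices] = 1.0
    return out


# Tiny made-up split: four videos, three distinct glosses.
glosses = ["HELLO", "GO1", "THANKYOU", "GO1"]
gloss_to_index = build_gloss_to_index(glosses)
y = one_hot([gloss_to_index[g] for g in glosses], len(gloss_to_index))
print(gloss_to_index)
print(y)
```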

> Note: This step applies pose normalization, arm alignment, and hand bounding-box normalization to remove spatial and body-proportion bias. For a detailed explanation and justification of these choices, please refer to [our research paper](https://doi.org/10.26877/sj5scb03).
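
The hand bounding-box idea can be sketched in a few lines. This is one plausible reading of "bound hand keypoints in `[-0.5, 0.5]`, centered at `(0, 0)`", not the paper's exact procedure: scaling by the longer side of the box (to keep the aspect ratio) and treating an all-zero hand as undetected are both assumptions.

```python
import numpy as np


def normalize_hand(hand_xy, eps=1e-6):
    """Center one hand's (21, 2) keypoints at (0, 0) and scale into [-0.5, 0.5].

    The bounding box is scaled by its longer side so the hand's aspect
    ratio is preserved (assumption). A degenerate box, e.g. an undetected
    hand stored as all zeros (assumption), is returned unchanged.
    """
    hand_xy = np.asarray(hand_xy, dtype=np.float64)
    lo, hi = hand_xy.min(axis=0), hand_xy.max(axis=0)
    size = float((hi - lo).max())
    if size < eps:
        return hand_xy.copy()
    center = (lo + hi) / 2.0
    return (hand_xy - center) / size


# A synthetic hand located somewhere in the upper-right of the image.
hand = np.random.default_rng(0).random((21, 2)) * 0.2 + 0.6
normalized = normalize_hand(hand)
print(normalized.min(axis=0), normalized.max(axis=0))
```

After this step the hand's position in the frame and its apparent size no longer affect the features, so the model sees only the hand shape.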

In [None]:
# === 1. Generate Label Encoder, OneHot Encoder, and Gloss Dictionary ===
csv_path = 'ASL_Citizen/splits/test.csv'  # location of the train/val/test CSV
save_dict_path = 'gloss_to_index'         # output path and name of the dictionary
label_encoder, onehot_encoder, gloss_to_index = preprocessing_split.generate_gloss_dictionary(csv_path, save_dict_path)

In [None]:
len(gloss_to_index)  # number of glosses; each string maps to a unique integer (0-2730)

2731

In [None]:
save_dir = 'model_train_data'
processed_save_dir = 'processed_all'  # same directory used in the landmark extraction step

#Create y_train, y_test, and y_val
preprocessing_split.encode_labels(label_encoder, onehot_encoder, processed_save_dir, save_dir)

#Create X_train, X_test, and X_val
preprocessing_split.preprocess_and_save_x(processed_save_dir, save_dir, split='train')
preprocessing_split.preprocess_and_save_x(processed_save_dir, save_dir, split='test')
preprocessing_split.preprocess_and_save_x(processed_save_dir, save_dir, split='val')

## 3. Model Training

This section trains a GRU-based model using the preprocessed sign language data. The training logic is defined in `train_model.py`, which consists of two main components:

#### A. `data_loader(...)`
This function loads and optionally filters the training, validation, and test datasets:
- Loads `.npy` files containing pose features and one-hot encoded labels.
- Converts labels from one-hot encoded vectors back into integer class labels, a choice based on prediction results.
- Supports partial gloss filtering by mapping similar gloss variants to a common gloss (e.g., `GO1` and `GO2` -> `GO`).
- Optionally merges the test set into training for cases like leaderboard training.
- Returns: tuples of `(X_train, y_train)`, `(X_val, y_val)`, `(X_test, y_test)`, and a decoder dictionary (if filtered) (decoder maps integer -> gloss).
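
The filtering and label conversion described above might look roughly like the sketch below. All names are hypothetical, and several choices are assumptions rather than the behavior of `data_loader`: variants are recognized by a trailing number, each variant keeps its own class index (consistent with having more classes than unique glosses), and the decoder reports the base gloss.

```python
import re
import numpy as np


def base_gloss(gloss):
    """Strip a trailing variant number, e.g. GO1 -> GO (assumed naming rule)."""
    return re.sub(r"\d+$", "", gloss)


def build_filter(gloss_to_index, wanted):
    """Select every gloss whose base form is in `wanted`.

    Returns (remap, decoder): remap maps an original class index to a new
    contiguous one; decoder maps the new index to the base gloss.
    """
    wanted = {w.upper() for w in wanted}
    kept = sorted(i for g, i in gloss_to_index.items() if base_gloss(g).upper() in wanted)
    index_to_gloss = {i: g for g, i in gloss_to_index.items()}
    remap = {old: new for new, old in enumerate(kept)}
    decoder = {new: base_gloss(index_to_gloss[old]) for old, new in remap.items()}
    return remap, decoder


def filter_split(X, y_onehot, remap):
    """Convert one-hot labels to integers and drop samples outside the filter."""
    y_int = y_onehot.argmax(axis=1)
    mask = np.isin(y_int, list(remap))
    y_new = np.array([remap[i] for i in y_int[mask]], dtype=np.int64)
    return X[mask], y_new


gloss_to_index = {"APPLE": 0, "GO1": 1, "GO2": 2, "HELLO": 3}
remap, decoder = build_filter(gloss_to_index, wanted=["GO", "HELLO"])

X = np.arange(4, dtype=np.float32).reshape(4, 1)  # one dummy feature per sample
y = np.eye(4, dtype=np.float32)                   # sample i has class i
X_f, y_f = filter_split(X, y, remap)
print(decoder, y_f.tolist())
```

Integer labels pair naturally with a sparse categorical loss, which avoids holding a large one-hot matrix in memory during training.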

#### B. `train_gru_model(...)`
Trains a sequential GRU-based neural network using TensorFlow/Keras:
- Two GRU layers (`386` units followed by `192` units).
- Dropout layers for regularization.
- Final `Dense` layer with `softmax` for multi-class classification.
- Automatically saves the **best-performing model** on validation accuracy.
- If test data is passed, accuracy and macro F1 score are computed.
- The final model is saved as `best_model.keras`.
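
To make the two test metrics concrete, here is a small NumPy sketch of accuracy and macro F1. It is not taken from `train_model.py`, which may well call a library routine instead; the function name is hypothetical, and counting a class that never occurs as an F1 of 0 is a choice made here that a library may handle differently.

```python
import numpy as np


def accuracy_and_macro_f1(y_true, y_pred, num_classes):
    """Return (accuracy, macro F1) for integer class labels.

    Macro F1 is the unweighted mean of per-class F1 scores, so rare
    signs count as much as frequent ones.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float((y_true == y_pred).mean())

    f1_scores = []
    for c in range(num_classes):
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, float(np.mean(f1_scores))


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred, num_classes=3)
print(round(acc, 4), round(macro_f1, 4))
```

In this toy case the per-class F1 scores are 0.5, 0.8, and 2/3, so macro F1 is their plain average.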

---

Training was performed on **Kaggle** using GPU acceleration. The notebook is available [here](https://www.kaggle.com/code/tobypu/aslcitizen-top200-training).

We selected the **top 200 most commonly used signs** based on [HandSpeak's list of most-used signs](https://www.handspeak.com/word/most-used/). These signs are listed in a `.txt` file, one per line.

We also merged the test data into the training set for better inference.

In [None]:
import json

(train_X, train_y), (val_X, val_y), (_, _), decoder = train_model.data_loader(
    data_dir="/kaggle/input/citizen-asl-mediapipe-encoded-and-preprocessed",
    gloss_to_index_dir='/kaggle/input/citizen-asl-mediapipe-encoded-and-preprocessed/gloss_to_index.json',
    filtered_txt_path="/kaggle/input/top200citizen/Citizen200.txt",
    merge_test_to_train=True
)

# Save the index -> gloss mapping used later by the live translation step
with open("index_to_gloss_200.json", "w") as f:
    json.dump(decoder, f, indent=2)

# Train the GRU model; no test set is passed because it was merged into train
model = train_model.train_gru_model(
    train_X, train_y,
    val_X, val_y)


#### Model summary

In [4]:
from tensorflow.keras.models import load_model
model=load_model('models/best_model200.keras')
# Show a summary of the model architecture
model.summary()

## 4. Live Translation

The `start_live_feed()` function is responsible for capturing webcam input, performing real-time sign language recognition using a trained GRU model, and generating meaningful translations through a language model (OpenAI GPT).

#### Input Arguments
   - `model_path`: Path to the `.keras` file containing the trained GRU network that recognizes sequences of body keypoint features.
   - `encoder_path`: Path to the JSON file mapping predicted class indices to sign language glosses (e.g., `{"0": "HELLO", "1": "THANK-YOU", ...}`).
   - `client`: OpenAI client created from the API key stored in `.env`.
   - `threshold` (float): Controls the sensitivity to motion. A lower value triggers capture on smaller movements; a higher value requires more movement before capture starts. Default is `1.2`.
   - `webcam` (integer): Index of the webcam device to use. Default is `0`. If you have multiple webcams, try `1`, `2`, etc.
   - `complexity_setting`: Model complexity for the MediaPipe Holistic pipeline (`0`: fastest but least accurate, `1`: balanced, `2`: most accurate but slowest). Default is `0`.
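
To show how the file behind `encoder_path` turns a model output into a gloss, here is a minimal sketch. The helper names and the `min_confidence` cutoff are hypothetical and are not arguments of `start_live_feed()`. The one firm point is that JSON object keys are always strings, so a mapping saved with integer keys has to be converted back after loading.

```python
import json
import os
import tempfile


def load_decoder(path):
    """Load an index -> gloss JSON file and restore integer keys.

    JSON object keys are always strings, so a dict saved with integer
    keys comes back keyed by "0", "1", ...; convert them back to int.
    """
    with open(path, encoding="utf-8") as f:
        return {int(k): v for k, v in json.load(f).items()}


def decode_prediction(probabilities, decoder, min_confidence=0.0):
    """Return (gloss, confidence) for the most probable class, or None.

    `min_confidence` is a hypothetical cutoff, not a documented argument.
    """
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    confidence = float(probabilities[best])
    if confidence < min_confidence:
        return None
    return decoder[best], confidence


# Round trip through a temporary file to show the key conversion.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index_to_gloss_demo.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({0: "HELLO", 1: "THANK-YOU"}, f)
    decoder = load_decoder(path)

result = decode_prediction([0.1, 0.9], decoder)
print(result)
```

`probabilities` stands in for one row of the model's softmax output; a NumPy row works here as well as a plain list.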

#### **Flowchart Summary**
![Flowchart](flowchart.jpg)

#### **Creating `.env` API KEY file**
1. Open a text editor and add your OpenAI API key like this:
    ```bash
    API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    ```
2. Save the file as `.env` in your project repository
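
For readers curious how a key could be read from such a file, the sketch below parses simple `KEY=VALUE` lines using only the standard library. It is an illustration, not the implementation of `live_translation.get_client`, which is not shown in this notebook; `read_env_value` is a hypothetical helper and the key in the example is a placeholder.

```python
import os
import tempfile


def read_env_value(path, name):
    """Return the value of `name` from a simple KEY=VALUE .env file, or None."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and anything without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            if key.strip() == name:
                return value.strip().strip('"').strip("'")
    return None


# Demonstrate with a throwaway file and a placeholder key.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as f:
        f.write("# demo file\nAPI_KEY=sk-demo-not-a-real-key\n")
    api_key = read_env_value(env_path, "API_KEY")

print(api_key is not None)
```

With the `openai` 1.x package, a key obtained this way can be passed as `openai.OpenAI(api_key=api_key)`. Keeping `.env` out of version control avoids leaking the key.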

In [3]:
client = live_translation.get_client('.env') #Make sure to have API key ready

live_translation.start_live_feed(model_path='models/best_model200.keras',
                encoder_path='models/index_to_gloss_200.json',
                client=client)

2025-06-08 15:17:33.830891: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0000 00:00:1749370654.052871   48545 gl_context_egl.cc:85] Successfully initialized EGL. Major : 1 Minor: 5
I0000 00:00:1749370654.057886   48859 gl_context.cc:344] GL version: 3.2 (OpenGL ES 3.2 Mesa 25.0.6), renderer: AMD Radeon Graphics (radeonsi, renoir, ACO, DRM 3.61, 6.14.9-300.fc42.x86_64)
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 265ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 46ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 56ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 49ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 49ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 54ms/step
