Deep Learning-Based Sign Language Recognition from Videos and Cross-Lingual Translation to Indian Vernaculars
Deep learning pipeline that classifies sign-language video clips into English word labels and translates them into Hindi, Telugu, and Bengali.
Built as part of an Executive M.Tech project at IIT Patna, under the guidance of Dr. Chandranath Adak.
📄 Paper: Deep Learning-Based Sign Language Recognition from Videos and Cross-Lingual Translation to Indian Vernaculars — [arXiv link coming soon]
The system works in two stages:
- Recognition: A fine-tuned VideoMAE (MCG-NJU/videomae-base) video transformer classifies a 16-frame sign-language video clip into one of 13 English word labels.
- Translation: The predicted English label is translated into Hindi, Telugu, and Bengali using Meta AI's NLLB-200 (facebook/nllb-200-distilled-600M) multilingual translation model.
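
The recognition stage takes a fixed 16-frame clip, so each video has to be reduced to 16 frames before it reaches the model. Below is a minimal sketch of one common way to do that, uniform sampling across the clip. The function name `sample_frame_indices` is hypothetical and uniform spacing is an assumption; the project's notebook may sample frames differently.

```python
def sample_frame_indices(total_frames: int, num_frames: int = 16) -> list[int]:
    """Return num_frames indices spread evenly over a clip of total_frames frames.

    Assumption: uniform sampling. The original notebook may use another scheme.
    """
    if total_frames < 1 or num_frames < 1:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames == 1:
        return [0]
    # Spacing that maps the first sample to frame 0 and the last to the final frame.
    step = (total_frames - 1) / (num_frames - 1)
    # Clips shorter than num_frames repeat some indices instead of failing.
    return [round(i * step) for i in range(num_frames)]


if __name__ == "__main__":
    # A 31-frame clip yields every second frame.
    print(sample_frame_indices(31))
```

The resulting indices could then be handed to a video reader such as decord's `VideoReader.get_batch` to pull only the frames the model needs.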
A Streamlit demo app lets a user upload a video and see the predicted label plus all three translations once the clip has been processed.
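
The translation stage can be sketched with the Hugging Face `transformers` API for NLLB-200. This is an illustrative sketch, not the project's actual code: the helper names `load_nllb` and `translate_label` are hypothetical, and `max_new_tokens=16` is an assumed limit chosen because the inputs are single words. The language codes are the FLORES-200 codes that NLLB-200 uses.

```python
from typing import Any

# FLORES-200 language codes used by NLLB-200 for the three target languages.
NLLB_TARGETS = {"Hindi": "hin_Deva", "Telugu": "tel_Telu", "Bengali": "ben_Beng"}


def load_nllb(model_name: str = "facebook/nllb-200-distilled-600M"):
    """Load the tokenizer and model (downloads weights on first call)."""
    # Imported lazily so the rest of the sketch does not require transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    return tokenizer, model


def translate_label(label: str, tokenizer: Any, model: Any) -> dict[str, str]:
    """Translate one English label into every language in NLLB_TARGETS."""
    inputs = tokenizer(label, return_tensors="pt")
    translations = {}
    for language, code in NLLB_TARGETS.items():
        generated = model.generate(
            **inputs,
            # Forcing the first generated token selects the target language.
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(code),
            max_new_tokens=16,  # assumption: labels are single words
        )
        translations[language] = tokenizer.batch_decode(
            generated, skip_special_tokens=True
        )[0]
    return translations


if __name__ == "__main__":
    tok, mdl = load_nllb()
    print(translate_label("happy", tok, mdl))
```

Supporting another language under this design would only require adding its FLORES-200 code to `NLLB_TARGETS`.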
Fine-tuned on a 13-class subset (~2GB) of the AI4Bharat Indian Sign Language dataset from IIT Madras:
| Metric | Value |
|---|---|
| Training accuracy | 99.4% |
| Validation accuracy | 77.5% (31/40) |
| Classes | 13 |
| Epochs | 15 |
Most confusion occurs between visually similar signs (e.g., hat / dress / shirt) and a semantically related cluster (ugly / deaf / blind). See the paper for full error analysis.
pip install torch torchvision transformers decord scikit-learn
pip install streamlit opencv-python pillow sentencepiece pyngrok

Set your Hugging Face and ngrok tokens as environment variables rather than hardcoding them:
export HF_TOKEN="your_token_here"
export NGROK_AUTH_TOKEN="your_token_here"

This project uses a 13-class subset of the AI4Bharat ISL video corpus:
- 8 adjectives: loud, quiet, happy, sad, beautiful, ugly, deaf, blind
- 5 clothing nouns: hat, dress, suit, skirt, shirt
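
A classifier head needs a mapping between these 13 words and integer class ids. The sketch below builds one from the two word groups above. The alphabetical ordering is an assumption made for illustration; the trained checkpoint may assign ids in a different order.

```python
ADJECTIVES = ["loud", "quiet", "happy", "sad", "beautiful", "ugly", "deaf", "blind"]
CLOTHING = ["hat", "dress", "suit", "skirt", "shirt"]

# Assumption: class ids follow alphabetical order of the labels.
LABELS = sorted(ADJECTIVES + CLOTHING)

label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

if __name__ == "__main__":
    print(len(LABELS))
    print(label2id)
```

Dictionaries of this shape can be passed as `label2id` and `id2label` when loading `VideoMAEForVideoClassification` with `from_pretrained`, so that predictions decode to words rather than bare indices.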
Full dataset: AI4Bharat OpenHands
Other ISL sources explored (not used for training):
# Key hyperparameters
Train-Test Split: 80:20
Optimizer: AdamW (lr=1e-5)
Loss: CrossEntropyLoss
Epochs: 15
Batch size: 2

See notebooks/sign_language_classification.ipynb for the full training loop.
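
The 80:20 split listed above can be sketched as a seeded shuffle followed by a cut. This is an illustration under stated assumptions, not the notebook's code: the function name `train_val_split`, the seed value, and the unstratified shuffle are all hypothetical choices, and the 200 placeholder clip names in the usage section are made up.

```python
import random

# Values copied from the hyperparameter list above.
HYPERPARAMS = {"lr": 1e-5, "epochs": 15, "batch_size": 2, "val_fraction": 0.2}


def train_val_split(items, val_fraction: float = 0.2, seed: int = 42):
    """Shuffle items reproducibly and split them into (train, validation)."""
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(items)
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    clips = [f"clip_{i:03d}.mp4" for i in range(200)]  # placeholder file names
    train, val = train_val_split(clips, HYPERPARAMS["val_fraction"])
    print(len(train), len(val))
```

With classes as small as the ones here, a stratified split (for example scikit-learn's `train_test_split` with `stratify=`) would keep every class represented in validation.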
streamlit run app.py

Upload a .mp4 or .mov clip of a sign from the trained label set. The app returns the predicted English label with a confidence score, followed by Hindi, Telugu, and Bengali translations.
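
A confidence score of this kind is typically the softmax probability of the top class. The sketch below shows that computation in plain Python. It assumes the app derives confidence from the classifier's logits in this way, which the source does not confirm, and `top_prediction` is a hypothetical helper name.

```python
import math


def top_prediction(logits: list[float], labels: list[str]) -> tuple[str, float]:
    """Return the most likely label and its softmax probability."""
    if len(logits) != len(labels) or not logits:
        raise ValueError("logits and labels must be non-empty and of equal length")
    # Subtracting the largest logit avoids overflow in exp().
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


if __name__ == "__main__":
    # Made-up logits for three of the clothing classes.
    label, confidence = top_prediction([1.0, 3.0, 2.0], ["hat", "dress", "shirt"])
    print(f"{label}: {confidence:.1%}")
```

Softmax confidence tends to be overconfident on small datasets, so a high score on a visually similar sign should be read with care.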
- Small, imbalanced label set (13 classes, as few as 1–2 validation clips for some classes)
- Isolated-word classification only — no continuous sentence generation from unsegmented video
- Sensitive to lighting, camera angle, and signer style variation
- Single-word translation can be ambiguous without sentence context
- Offline/batch inference only — not real-time
See the paper's Limitations section for details.
- Scale to the full AI4Bharat vocabulary
- Move from isolated-word classification to continuous sign-to-sentence translation
- Extend to more Indian languages via NLLB-200
- Optimize for edge/mobile deployment
- AI4Bharat and IIT Madras for the ISL dataset
- VideoMAE and NLLB open-source teams
- Department of Computer Science and Engineering, IIT Patna
@misc{nandipalli2025signlanguage,
title={Deep Learning-Based Sign Language Recognition from Videos and Cross-Lingual Translation to Indian Vernaculars},
author={Nandipalli, Ramesh and Adak, Chandranath},
year={2025},
note={IIT Patna Executive M.Tech Thesis}
}

[Add a license, e.g., MIT; see https://choosealicense.com]

