Technical English to Inglish Translation System
A comprehensive framework for translating technical English content to Indian languages (Hinglish, Minglish, etc.) while preserving domain-specific terminology in their original form through intelligent code-mixing.
Pure machine translation of technical content to Indian languages produces incomprehensible text. For example:
English: "This class has 4 member variables."
❌ Poor Hindi: "इस क्लास में 4 सदस्य चर हैं।"
✅ Desired Hinglish (Roman): "iis class mein chaar member variables hai"
✅ Desired Hinglish (Devanagari): "इस क्लास में चार मेंबर व्हेरिएबल्स है।"
This project implements a hybrid translation approach that:
- Preserves technical terms in English (the language professionals actually use)
- Translates context into native language for accessibility
- Produces natural code-mixed output reflecting authentic communication patterns
┌─────────────────┐
│ English Input │
└────────┬────────┘
│
┌────▼─────────────────┐
│ Term Extraction │
│ (Rule-Based + NER) │
└────┬─────────────────┘
│
┌────▼──────────────────┐
│ Term Guarding │
│ [bracket terms] │
└────┬──────────────────┘
│
┌────▼──────────────────┐
│ LLM Translation │
│ (with constraints) │
└────┬──────────────────┘
│
┌────▼──────────────────┐
│ Post-Processing │
│ & Validation │
└────┬──────────────────┘
│
┌────▼──────────────────┐
│ Script Conversion │
│ (Roman/Devanagari) │
└────┬──────────────────┘
│
┌────▼──────────────────┐
│ Bilingual Output │
└───────────────────────┘
# Clone repository
git clone https://github.com/yourusername/inglish-translator.git
cd inglish-translator
# Install dependencies
pip install -r requirements.txt
# Download required models
python scripts/download_models.pyfrom inglish_translator import InglishtranslationPipeline, TranslationConfig
# Configure pipeline
config = TranslationConfig(
domain="programming",
target_language="hi", # Hindi
output_format="both" # Roman + Devanagari
)
# Initialize pipeline
pipeline = InglishtranslationPipeline(config)
# Translate
text = "The for loop iterates over the array of integers."
result = pipeline.translate(text)
print("Roman:", result['roman'])
print("Devanagari:", result['devanagari'])Output:
Roman: for loop array of integers ke upar iterate karta hai
Devanagari: फॉर लूप ऐरे ऑफ इंटीजर्स के ऊपर iterate करता है
inglish-translator/
├── README.md
├── requirements.txt
├── setup.py
├── .gitignore
│
├── src/
│ ├── __init__.py
│ ├── pipeline.py # Main translation pipeline
│ ├── term_extractor.py # Term extraction module
│ ├── translator.py # LLM-based translator
│ ├── script_converter.py # Roman/Devanagari conversion
│ └── utils.py # Utility functions
│
├── data/
│ ├── glossaries/
│ │ ├── programming.yaml # Programming domain terms
│ │ ├── physics.yaml # Physics domain terms
│ │ └── finance.yaml # Finance domain terms
│ │
│ ├── datasets/
│ │ ├── train/
│ │ │ ├── programming_train.json
│ │ │ └── physics_train.json
│ │ ├── eval/
│ │ │ ├── programming_eval.json
│ │ │ └── physics_eval.json
│ │ └── test/
│ │ └── programming_test.json
│ │
│ └── patterns/
│ └── regex_patterns.json # Domain-specific regex patterns
│
├── models/
│ └── .gitkeep # Downloaded models go here
│
├── benchmarks/
│ ├── baseline_benchmark.py # Simple rule-based baseline
│ ├── llm_benchmark.py # LLM-based benchmark
│ └── hybrid_benchmark.py # Hybrid approach benchmark
│
├── tests/
│ ├── test_term_extractor.py
│ ├── test_translator.py
│ ├── test_script_converter.py
│ └── test_pipeline.py
│
├── scripts/
│ ├── download_models.py # Download pretrained models
│ ├── prepare_data.py # Data preprocessing
│ └── evaluate.py # Evaluation script
│
├── notebooks/
│ ├── 01_data_exploration.ipynb
│ ├── 02_term_extraction_analysis.ipynb
│ └── 03_translation_quality.ipynb
│
└── docs/
├── ARCHITECTURE.md
├── API_REFERENCE.md
├── CONTRIBUTING.md
└── research_framework.md # Full research documentation
{
"dataset_name": "programming_hinglish_train",
"domain": "programming",
"language_pair": "en_hi",
"size": 1000,
"samples": [
{
"id": "prog_001",
"english": "The for loop iterates over the array.",
"hinglish_roman": "for loop array ke upar iterate karta hai",
"hinglish_devanagari": "फॉर लूप ऐरे के ऊपर iterate करता है",
"technical_terms": ["for loop", "array", "iterate"],
"domain": "programming"
}
]
}We provide curated datasets for three domains:
-
Programming (1,000 train / 200 eval / 200 test)
- Source: Technical documentation, Stack Overflow, coding tutorials
- Terms: functions, variables, classes, loops, arrays, etc.
-
Physics (800 train / 150 eval / 150 test)
- Source: Textbooks, research papers (simplified)
- Terms: force, momentum, velocity, energy, quantum, etc.
-
Finance (600 train / 100 eval / 100 test)
- Source: Financial reports, investment guides
- Terms: ROI, equity, derivatives, portfolio, etc.
We implement three baseline approaches:
python benchmarks/baseline_benchmark.py \
--dataset data/datasets/eval/programming_eval.json \
--domain programming \
--output results/baseline_results.jsonPerformance:
- Terminology Consistency: 95%
- Grammatical Fluency: 42%
- Overall BLEU: 0.18
- Speed: ~500 sentences/sec
python benchmarks/llm_benchmark.py \
--dataset data/datasets/eval/programming_eval.json \
--model gpt-3.5-turbo \
--output results/llm_results.jsonPerformance:
- Terminology Consistency: 68%
- Grammatical Fluency: 87%
- Overall BLEU: 0.34
- Speed: ~5 sentences/sec
python benchmarks/hybrid_benchmark.py \
--dataset data/datasets/eval/programming_eval.json \
--domain programming \
--output results/hybrid_results.jsonPerformance:
- Terminology Consistency: 96%
- Grammatical Fluency: 85%
- Overall BLEU: 0.41
- Speed: ~50 sentences/sec
python scripts/evaluate.py \
--predictions results/hybrid_results.json \
--references data/datasets/eval/programming_eval.json \
--metrics allMetrics Computed:
- BLEU: N-gram overlap (0.3-0.4 is good)
- METEOR: Alignment with synonyms
- BERTScore: Semantic similarity
- Terminology Consistency: % of technical terms preserved
- Constraint Preservation: % of bracketed terms intact
- Fluency Score: Perplexity-based fluency measure
English: "In object-oriented programming, a class is a blueprint for creating objects.
Each object has member variables and methods."
Roman: object-oriented programming mein, class blueprint hai create karne ke liye
objects. Har object ke member variables aur methods hote hain.
Fluency: ★★☆☆☆ (2/5)
Consistency: ★★★★★ (5/5)
Roman: programming mein, ek varg vastu banane ke liye ek naksha hota hai.
Har vastu ke sadsya charo aur vidhi hote hain.
Fluency: ★★★★☆ (4/5)
Consistency: ★★★☆☆ (3/5) - Lost technical terms!
Roman: object-oriented programming mein, class objects create karne ke liye
blueprint hai. Har object ke member variables aur methods hote hain.
Devanagari: ऑब्जेक्ट-ओरिएंटेड प्रोग्रामिंग में, क्लास ऑब्जेक्ट्स create
करने के लिए blueprint है। हर ऑब्जेक्ट के मेंबर व्हेरिएबल्स और मेथड्स होते हैं।
Fluency: ★★★★☆ (4/5)
Consistency: ★★★★★ (5/5)
Glossaries are stored in YAML format with domain-specific terms:
# data/glossaries/programming.yaml
domain: programming
version: "1.0"
terms:
- term: "variable"
preserve: true
context: "data storage"
- term: "function"
preserve: true
context: "reusable code block"
- term: "class"
preserve: true
context: "object template"
- term: "array"
preserve: true
context: "sequence container"
compound_terms:
- "for loop"
- "while loop"
- "if statement"
- "member variable"
patterns:
- regex: '\w+\(\)'
type: "function_call"
action: "preserve"- Create glossary:
data/glossaries/your_domain.yaml - Add patterns for term detection
- Prepare training data in standard format
- Run benchmark to evaluate
# Run all tests
pytest tests/
# Run specific test
pytest tests/test_term_extractor.py -v
# Run with coverage
pytest --cov=src tests/| Approach | Consistency | Fluency | BLEU | Speed (sent/sec) |
|---|---|---|---|---|
| Rule-Based | 95% | 42% | 0.18 | 500 |
| LLM-Only | 68% | 87% | 0.34 | 5 |
| Hybrid | 96% | 85% | 0.41 | 50 |
-
Technical Documentation Translation
- Convert English docs to Hinglish for wider accessibility
- Maintain technical precision while improving comprehension
-
Educational Content
- Textbook translation for regional language students
- Online course material localization
-
Professional Communication
- Technical reports and presentations
- Code documentation and comments
-
Social Media & Forums
- Technical discussions on platforms
- Tutorial videos and blog posts
- Basic rule-based term extraction
- LLM integration with constraints
- Script conversion (Roman/Devanagari)
- Programming domain glossary
- Fine-tuned NER model for technical terms
- Physics and Finance domain expansion
- Interactive web interface
- REST API deployment
- Mobile app integration
- Marathi, Tamil, Telugu support
- Community glossary contribution system
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Ways to contribute:
- Add domain glossaries
- Improve term extraction patterns
- Contribute training data
- Report issues and bugs
- Improve documentation
This project is licensed under the MIT License - see LICENSE file for details.
- AI4Bharat for Indic language NLP tools
- Research community for code-mixing translation work
- Chinglish precedent demonstrating commercial viability
- Indian tech community for authentic code-mixing patterns
If you use this work, please cite:
@software{inglish_translator_2026,
title={Inglish Translator: Technical English to Indian Language Translation},
author={Your Name},
year={2026},
url={https://github.com/yourusername/inglish-translator}
}- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: your.email@example.com
Made with ❤️ for democratizing technical knowledge across India