DLC8 - Vietnamese Translation & Localization System

Open-source Vietnamese translation and content localization system for social media marketing.

🎯 Purpose

DLC8 is an NLP pipeline for translating and localizing English marketing content into Vietnamese for social media platforms. Automated quality checks help keep the output culturally appropriate and grammatically correct.

Key Features:

✅ Glossary enforcement (preserve brand terms, ban inappropriate phrases)
✅ Tone analysis (formal, casual, promotional, informational)
✅ Quality scoring (grammar, word choice, readability)
✅ Vietnamese text normalization (abbreviations, number formats, punctuation)
✅ Diacritics validation (coverage analysis, missing diacritics detection, auto-correction)
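
To make the glossary feature concrete, here is a minimal sketch of what a post-translation glossary check could look like. It is not the project's implementation: the function name `check_glossary` and the message wording are assumptions made for illustration, and the real service may enforce the glossary differently (for example, inside the translation prompt).

```python
import re


def check_glossary(source, translation, preserve, banned):
    """Return a list of glossary violations found in `translation`.

    Hypothetical helper: `preserve` and `banned` mirror the fields of the
    "glossary" object accepted by POST /translate.
    """
    issues = []
    for term in preserve:
        # A preserved term that occurs in the source must survive verbatim.
        if term in source and term not in translation:
            issues.append(f"preserved term missing: {term}")
    for phrase in banned:
        # Match whole words only, so "best" does not flag "bestseller".
        pattern = rf"(?<!\w){re.escape(phrase)}(?!\w)"
        if re.search(pattern, translation, flags=re.IGNORECASE):
            issues.append(f"banned phrase present: {phrase}")
    return issues
```

An empty list means the translation passes both rules; each string otherwise names the term that failed.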

🚀 Quick Start

Prerequisites

Docker & Docker Compose
OpenAI API key (or local Ollama setup)

Installation

Clone the repository:

git clone https://github.com/YOUR_USERNAME/dlc8.git
cd dlc8

Set up environment:

cp .env.example .env
# Edit .env and add your OPENAI_API_KEY

Start services:

docker compose up -d

Verify services:

# Check NLP service health
curl http://localhost:8000/health

# Check backend service
curl http://localhost:3000/health

Usage Example

Translate with all NLP features:

curl -X POST http://localhost:3000/translate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Visit Dulce bakery at Q.1, TP.HCM! Prices from 50,000 VND. Open T2-T7.",
    "glossary": {
      "preserve": ["Dulce"],
      "banned": ["best"]
    },
    "nlp_postprocess": true
  }'

Response:

{
  "translation": "Ghé tiệm bánh Dulce tại Quận 1, Thành phố Hồ Chí Minh! Giá từ 50.000 VND. Mở cửa Thứ hai-Thứ bảy.",
  "tone": {
    "detected_tone": "promotional",
    "confidence": 0.78,
    "model": "vinai/phobert-base"
  },
  "quality": {
    "score": 82,
    "grade": "B",
    "issues": []
  }
}
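
The same request can be sent from Python. The sketch below uses only the standard library and assumes the endpoint, port, and field names shown in the curl example; the helper names `build_translate_request` and `translate` are illustrative and not part of the project.

```python
import json
import urllib.request

BACKEND_URL = "http://localhost:3000"  # backend port as listed in this README


def build_translate_request(text, preserve=(), banned=(), nlp_postprocess=True):
    """Build (but do not send) a POST /translate request."""
    payload = {
        "text": text,
        "glossary": {"preserve": list(preserve), "banned": list(banned)},
        "nlp_postprocess": nlp_postprocess,
    }
    return urllib.request.Request(
        f"{BACKEND_URL}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(text, **options):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_translate_request(text, **options)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the services running, `translate("Open T2-T7.", preserve=["Dulce"], banned=["best"])` should return a dict with the `translation`, `tone`, and `quality` keys shown in the response above.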

📚 Documentation

Complete Guides

NLP Enhancement Guide - Complete overview of all 5 phases
Phase 1: Glossary Enforcement
Phase 2: Tone & Quality Scoring
Phase 3: Text Normalization
Phase 4: Diacritics Validation
Production Test Summary

API Endpoints

Backend Service (Port 3000)

POST /translate - Main translation endpoint with NLP
GET /health - Service health check

NLP Service (Port 8000)

GET /health - Service health and dependency status
POST /nlp/postprocess - Phase 1-3 processing (glossary, tone, quality, normalization)
POST /normalize - Phase 3: Vietnamese text normalization
POST /validate-diacritics - Phase 4: Diacritics validation

Vietnamese Text Features

Phase 3: Normalization

Abbreviation expansion (26 common terms: TP.HCM, Q., T2-T7, etc.)
Vietnamese number formatting (1,000 → 1.000, 3.5% → 3,5%)
Punctuation normalization (spacing, quotes)
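
As an illustration of the first two rules, the sketch below expands three of the abbreviations named above and swaps English number separators for Vietnamese ones. It is a simplified approximation, not the service's code: the regular expressions, the `WEEKDAYS` table, and the function names are assumptions, and the real normalizer covers 26 abbreviation patterns.

```python
import re

# Hypothetical subset of the abbreviation table: T2..T7 are Monday..Saturday.
WEEKDAYS = {
    "2": "Thứ hai", "3": "Thứ ba", "4": "Thứ tư",
    "5": "Thứ năm", "6": "Thứ sáu", "7": "Thứ bảy",
}

# Swap "," and "." so 1,234.56 becomes 1.234,56.
_SWAP = str.maketrans(",.", ".,")

# English-style numbers: comma-grouped thousands (optional decimals) or plain decimals.
_NUMBER = re.compile(
    r"(?<![\w.,])(?:\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+\.\d+)(?!\.?\d)"
)


def format_numbers(text):
    """Rewrite English number separators in the Vietnamese style."""
    return _NUMBER.sub(lambda m: m.group(0).translate(_SWAP), text)


def expand_abbreviations(text):
    """Expand TP.HCM, Q.<n>, and T2..T7."""
    text = re.sub(r"\bTP\.\s?HCM\b", "Thành phố Hồ Chí Minh", text)
    text = re.sub(r"\bQ\.\s?(\d+)", r"Quận \1", text)
    return re.sub(r"\bT([2-7])\b", lambda m: WEEKDAYS[m.group(1)], text)


def normalize(text):
    # Numbers first, so separators are fixed before any text is expanded.
    return expand_abbreviations(format_numbers(text))
```

The number pattern deliberately skips dotted sequences such as version strings (`1.2.3`), which a plain separator swap would corrupt.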

Phase 4: Diacritics Validation

Coverage calculation (target: 20-35% for Vietnamese text)
Missing diacritics detection (50+ common words)
Auto-correction (conservative and aggressive modes)
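
The sketch below shows one way these checks could work. It assumes that "coverage" means the share of letters carrying a diacritic (counting đ/Đ), which this README does not define, and it uses a tiny hypothetical lookup table in place of the project's 50+ word dictionary. Function names are illustrative.

```python
import unicodedata

# Hypothetical entries; many unaccented Vietnamese forms are ambiguous,
# so a real table needs context or a conservative selection of words.
RESTORATIONS = {"khong": "không", "nguoi": "người", "viet": "việt"}


def has_diacritic(char):
    """True if the letter carries a tone or vowel mark, or is đ/Đ."""
    if char in "đĐ":  # đ has no canonical decomposition, so check it directly
        return True
    decomposed = unicodedata.normalize("NFD", char)
    return any(unicodedata.combining(c) for c in decomposed)


def diacritic_coverage(text):
    """Fraction of letters that carry a diacritic (0.0 for text without letters)."""
    # NFC first, so precomposed letters are counted once each.
    letters = [c for c in unicodedata.normalize("NFC", text) if c.isalpha()]
    if not letters:
        return 0.0
    return sum(has_diacritic(c) for c in letters) / len(letters)


def suggest_restorations(text, table=RESTORATIONS):
    """Return (found, suggestion) pairs for words that appear in the table."""
    suggestions = []
    for word in text.split():
        key = word.strip(".,!?").lower()
        if key in table:
            suggestions.append((key, table[key]))
    return suggestions
```

Text whose coverage falls well below the 20-35% target is a candidate for the missing-diacritics check; words absent from the table are left untouched, which is the cautious behaviour one would expect from a conservative mode.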

🏗️ Architecture

┌─────────────────┐
│   User/Client   │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────┐
│  Backend Service (Node.js)  │
│  - Translation API          │
│  - OpenAI integration       │
│  - Request routing          │
└────────┬────────────────────┘
         │
         ▼
┌───────────────────────────────────┐
│  NLP Service (Python/FastAPI)     │
│  - Glossary enforcement (Phase 1) │
│  - Tone analysis (Phase 2)        │
│  - Quality scoring (Phase 2)      │
│  - Text normalization (Phase 3)   │
│  - Diacritics validation (Phase 4)│
│  - PhoBERT (vinai/phobert-base)   │
└───────────────────────────────────┘

🤝 Contributing

We welcome contributions! This is an open-source project to improve Vietnamese NLP for marketing content.

Areas for Contribution:

Expand Vietnamese diacritics dictionary (currently 50+ words)
Add more abbreviation patterns (currently 26 patterns)
Improve quality scoring metrics
Test with real-world Vietnamese content
Report bugs or edge cases

How to Contribute

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes
Test thoroughly (see test files in root)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Testing

Exercise the NLP endpoints with the sample payloads in the repository root (the services must already be running):

# Test Phase 3 normalization
curl -X POST http://localhost:8000/normalize \
  -H "Content-Type: application/json" \
  -d @test_phase3_comprehensive.json

# Test Phase 4 diacritics
curl -X POST http://localhost:8000/validate-diacritics \
  -H "Content-Type: application/json" \
  -d @test_phase4_detection.json

# Test full pipeline
curl -X POST http://localhost:8000/nlp/postprocess \
  -H "Content-Type: application/json" \
  -d @test_vietnamese_product.json

📊 Project Status

Current Version: 0.5-production-phase4

Phase	Feature	Status
Phase 1	Glossary Enforcement	✅ Complete
Phase 2	Tone Analysis & Quality Scoring	✅ Complete
Phase 3	Vietnamese Text Normalization	✅ Complete
Phase 4	Diacritics Validation	✅ Complete
Phase 5	Smart Features (Hashtags, Emojis)	📋 Planned

🛠️ Technology Stack

Backend:

Node.js + Express
OpenAI API (gpt-4o-mini)

NLP Service:

Python 3.11
FastAPI
Transformers (HuggingFace)
PhoBERT (vinai/phobert-base)
underthesea (Vietnamese NLP)
ftfy, langdetect, rapidfuzz, emoji

Infrastructure:

Docker & Docker Compose
Uvicorn (ASGI server)

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

PhoBERT by VinAI Research for Vietnamese language models
underthesea for Vietnamese NLP tools
OpenAI for translation models
HuggingFace for transformer infrastructure

📧 Contact & Support

Issues: GitHub Issues
Discussions: GitHub Discussions

🗺️ Roadmap

Completed (v0.5):

✅ Core translation pipeline
✅ Vietnamese glossary enforcement
✅ Tone detection with PhoBERT
✅ Quality scoring system
✅ Text normalization (abbreviations, numbers, punctuation)
✅ Diacritics validation and auto-correction

In Progress:

🔄 Production usage testing
🔄 Dictionary expansion based on real data

Planned (v0.6+):

📋 Phase 5: Smart features (hashtags, emojis, CTAs)
📋 Web UI for testing
📋 API authentication
📋 Usage analytics
📋 Multi-language support beyond Vietnamese

Made with ❤️ for the Vietnamese marketing community
