Open-source Vietnamese translation and content localization system for social media marketing.
DLC8 is an NLP pipeline for translating and localizing English marketing content into Vietnamese for social media platforms. Automated quality checks help keep the Vietnamese output culturally appropriate and grammatically correct.
Key Features:
- ✅ Glossary enforcement (preserve brand terms, ban inappropriate phrases)
- ✅ Tone analysis (formal, casual, promotional, informational)
- ✅ Quality scoring (grammar, word choice, readability)
- ✅ Vietnamese text normalization (abbreviations, number formats, punctuation)
- ✅ Diacritics validation (coverage analysis, missing diacritics detection, auto-correction)
Prerequisites:
- Docker & Docker Compose
- OpenAI API key (or local Ollama setup)
Quick Start:
- Clone the repository:
    git clone https://github.com/YOUR_USERNAME/dlc8.git
    cd dlc8
- Set up environment:
    cp .env.example .env
    # Edit .env and add your OPENAI_API_KEY
- Start services:
    docker compose up -d
- Verify services:
    # Check NLP service health
    curl http://localhost:8000/health
    # Check backend service
    curl http://localhost:3000/health
Translate with all NLP features:
    curl -X POST http://localhost:3000/translate \
      -H "Content-Type: application/json" \
      -d '{
        "text": "Visit our bakery at Q.1, TP.HCM! Prices from 50,000 VND. Open T2-T7.",
        "glossary": {
          "preserve": ["Dulce"],
          "banned": ["best"]
        },
        "nlp_postprocess": true
      }'
Response:
    {
      "translation": "Ghé tiệm bánh Dulce tại Quận 1, Thành phố Hồ Chí Minh! Giá từ 50.000 VND. Mở cửa Thứ hai-Thứ bảy.",
      "tone": {
        "detected_tone": "promotional",
        "confidence": 0.78,
        "model": "vinai/phobert-base"
      },
      "quality": {
        "score": 82,
        "grade": "B",
        "issues": []
      }
    }
Documentation:
- NLP Enhancement Guide - Complete overview of all 5 phases
- Phase 1: Glossary Enforcement
- Phase 2: Tone & Quality Scoring
- Phase 3: Text Normalization
- Phase 4: Diacritics Validation
- Production Test Summary
Backend Service (port 3000):
- POST /translate - Main translation endpoint with NLP
- GET /health - Service health check
NLP Service (port 8000):
- GET /health - Service health and dependency status
- POST /nlp/postprocess - Phase 1-3 processing (glossary, tone, quality, normalization)
- POST /normalize - Phase 3: Vietnamese text normalization
- POST /validate-diacritics - Phase 4: Diacritics validation
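For callers who prefer a script over curl, the request to the main translation endpoint can be sketched in Python with only the standard library. This is a minimal sketch, not part of the project: the helper names `build_translate_payload` and `translate` are hypothetical, and the URL and JSON fields are copied from the curl example in this README.

```python
import json
import urllib.request

# Port taken from the curl examples in this README; adjust if your setup differs.
BACKEND_URL = "http://localhost:3000"


def build_translate_payload(text, preserve=(), banned=(), nlp_postprocess=True):
    """Build the JSON body for POST /translate, mirroring the curl example."""
    return {
        "text": text,
        "glossary": {"preserve": list(preserve), "banned": list(banned)},
        "nlp_postprocess": nlp_postprocess,
    }


def translate(text, preserve=(), banned=(), base_url=BACKEND_URL, timeout=30):
    """POST the payload to the backend and return the decoded JSON response."""
    payload = build_translate_payload(text, preserve, banned)
    request = urllib.request.Request(
        f"{base_url}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the services running, `translate("Visit our bakery at Q.1, TP.HCM!", preserve=["Dulce"], banned=["best"])` should return a dictionary shaped like the sample response, with `translation`, `tone`, and `quality` keys.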
Phase 3: Normalization
- Abbreviation expansion (26 common terms: TP.HCM, Q., T2-T7, etc.)
- Vietnamese number formatting (1,000 → 1.000, 3.5% → 3,5%)
- Punctuation normalization (spacing, quotes)
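The three normalization steps above can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the service's code: the function names are hypothetical, the abbreviation table holds only a few of the 26 terms, and the number rule assumes the input uses English separators.

```python
import re

# Hypothetical subset of the abbreviation table (the project lists 26 terms).
ABBREVIATIONS = {
    "TP.HCM": "Thành phố Hồ Chí Minh",
    "T2-T7": "Thứ hai-Thứ bảy",
}
# District shorthand such as "Q.1" or "Q. 3".
DISTRICT = re.compile(r"\bQ\.\s?(\d+)")
# Digits joined by "," or "." (50,000 or 3.5). Dates and version strings
# would match too, so a real implementation needs extra guards.
NUMBER = re.compile(r"\d+(?:[.,]\d+)+")
# English -> Vietnamese separators: swap "," and "." in one pass.
SWAP_SEPARATORS = str.maketrans({",": ".", ".": ","})


def expand_abbreviations(text):
    """Replace known abbreviations and district shorthand with full forms."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return DISTRICT.sub(r"Quận \1", text)


def format_numbers(text):
    """Convert 1,000 to 1.000 and 3.5 to 3,5."""
    return NUMBER.sub(lambda m: m.group(0).translate(SWAP_SEPARATORS), text)


def normalize_punctuation(text):
    """Drop spaces before punctuation and collapse repeated spaces."""
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)
    return re.sub(r" {2,}", " ", text).strip()


def normalize(text):
    """Run the three steps in order: abbreviations, numbers, punctuation."""
    return normalize_punctuation(format_numbers(expand_abbreviations(text)))
```

Abbreviations are expanded first so that the period in `Q.1` is gone before any punctuation rule runs.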
Phase 4: Diacritics Validation
- Coverage calculation (target: 20-35% for Vietnamese text)
- Missing diacritics detection (50+ common words)
- Auto-correction (conservative and aggressive modes)
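To make the coverage figure concrete, here is one plausible way to compute it in Python. The README does not define the metric, so this sketch assumes coverage means the share of letters that carry a diacritic (counting đ/Đ); the function names and the three-word dictionary are hypothetical stand-ins for the project's 50+ word list.

```python
import re
import unicodedata

# Range quoted above as the target for Vietnamese text.
TARGET_RANGE = (0.20, 0.35)

# Hypothetical sample of the "common words" dictionary: ASCII form -> correct form.
COMMON_WORDS = {"khong": "không", "duoc": "được", "nguoi": "người"}


def has_diacritic(ch):
    """True if the letter carries a combining mark, or is đ/Đ (no decomposition)."""
    if ch in "đĐ":
        return True
    return any(unicodedata.combining(c) for c in unicodedata.normalize("NFD", ch))


def diacritics_coverage(text):
    """Fraction of letters in the text that carry a diacritic (0.0 if no letters)."""
    letters = [ch for ch in unicodedata.normalize("NFC", text) if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(has_diacritic(ch) for ch in letters) / len(letters)


def in_target_range(text):
    """Check the coverage against the documented 20-35% target."""
    low, high = TARGET_RANGE
    return low <= diacritics_coverage(text) <= high


def find_missing_diacritics(text):
    """Return (word, suggestion) pairs for dictionary words written without diacritics."""
    return [
        (word, COMMON_WORDS[word.lower()])
        for word in re.findall(r"\w+", text)
        if word.lower() in COMMON_WORDS
    ]
```

Short strings can fall outside the target range even when they are correct, so a check like `in_target_range` is more meaningful on full posts than on single phrases.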
    ┌─────────────────┐
    │   User/Client   │
    └────────┬────────┘
             │
             ▼
    ┌─────────────────────────────┐
    │  Backend Service (Node.js)  │
    │  - Translation API          │
    │  - OpenAI integration       │
    │  - Request routing          │
    └────────┬────────────────────┘
             │
             ▼
    ┌─────────────────────────────────────┐
    │  NLP Service (Python/FastAPI)       │
    │  - Glossary enforcement (Phase 1)   │
    │  - Tone analysis (Phase 2)          │
    │  - Quality scoring (Phase 2)        │
    │  - Text normalization (Phase 3)     │
    │  - Diacritics validation (Phase 4)  │
    │  - PhoBERT (vinai/phobert-base)     │
    └─────────────────────────────────────┘
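As an illustration of the first stage the NLP service applies (glossary enforcement), the check below verifies a translation against the `preserve` and `banned` lists from the request. It is a hedged sketch: `check_glossary` and its return keys are hypothetical, and it only reports violations, whereas the actual service may also rewrite the text.

```python
import re


def check_glossary(translation, preserve=(), banned=()):
    """Report preserved terms that are missing and banned terms that appear."""
    # Brand terms must survive translation verbatim (case-sensitive).
    missing = [term for term in preserve if term not in translation]
    # Banned terms are matched as whole words, ignoring case,
    # so "best" does not flag "bestseller".
    found = [
        term
        for term in banned
        if re.search(rf"\b{re.escape(term)}\b", translation, flags=re.IGNORECASE)
    ]
    return {
        "missing_preserved": missing,
        "banned_found": found,
        "ok": not missing and not found,
    }
```

For the sample response in this README, `check_glossary(translation, ["Dulce"], ["best"])` would report `ok` as true, since "Dulce" is present and "best" is absent.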
We welcome contributions! This is an open-source project to improve Vietnamese NLP for marketing content.
Areas for Contribution:
- Expand Vietnamese diacritics dictionary (currently 50+ words)
- Add more abbreviation patterns (currently 26 patterns)
- Improve quality scoring metrics
- Test with real-world Vietnamese content
- Report bugs or edge cases
How to Contribute:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Test thoroughly (see test files in root)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Run the included test suite:
    # Test Phase 3 normalization
    curl -X POST http://localhost:8000/normalize \
      -H "Content-Type: application/json" \
      -d @test_phase3_comprehensive.json
    # Test Phase 4 diacritics
    curl -X POST http://localhost:8000/validate-diacritics \
      -H "Content-Type: application/json" \
      -d @test_phase4_detection.json
    # Test full pipeline
    curl -X POST http://localhost:8000/nlp/postprocess \
      -H "Content-Type: application/json" \
      -d @test_vietnamese_product.json
Current Version: 0.5-production-phase4
| Phase | Feature | Status |
|---|---|---|
| Phase 1 | Glossary Enforcement | ✅ Complete |
| Phase 2 | Tone Analysis & Quality Scoring | ✅ Complete |
| Phase 3 | Vietnamese Text Normalization | ✅ Complete |
| Phase 4 | Diacritics Validation | ✅ Complete |
| Phase 5 | Smart Features (Hashtags, Emojis) | Planned |
Backend:
- Node.js + Express
- OpenAI API (gpt-4o-mini)
NLP Service:
- Python 3.11
- FastAPI
- Transformers (HuggingFace)
- PhoBERT (vinai/phobert-base)
- underthesea (Vietnamese NLP)
- ftfy, langdetect, rapidfuzz, emoji
Infrastructure:
- Docker & Docker Compose
- Uvicorn (ASGI server)
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments:
- PhoBERT by VinAI Research for Vietnamese language models
- underthesea for Vietnamese NLP tools
- OpenAI for translation models
- HuggingFace for transformer infrastructure
Support:
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Completed (v0.5):
- ✅ Core translation pipeline
- ✅ Vietnamese glossary enforcement
- ✅ Tone detection with PhoBERT
- ✅ Quality scoring system
- ✅ Text normalization (abbreviations, numbers, punctuation)
- ✅ Diacritics validation and auto-correction
In Progress:
- Production usage testing
- Dictionary expansion based on real data
Planned (v0.6+):
- Phase 5: Smart features (hashtags, emojis, CTAs)
- Web UI for testing
- API authentication
- Usage analytics
- Multi-language support beyond Vietnamese
Made with ❤️ for the Vietnamese marketing community