redrover13/dlc8

DLC8 - Vietnamese Translation & Localization System

Open-source Vietnamese translation and content localization system for social media marketing.

License: MIT

🎯 Purpose

DLC8 is an NLP pipeline for translating and localizing English marketing content into Vietnamese for social media platforms. Automated checks (glossary enforcement, tone analysis, quality scoring, text normalization, and diacritics validation) help keep the output culturally appropriate and grammatically correct.

Key Features:

  • ✅ Glossary enforcement (preserve brand terms, ban inappropriate phrases)
  • ✅ Tone analysis (formal, casual, promotional, informational)
  • ✅ Quality scoring (grammar, word choice, readability)
  • ✅ Vietnamese text normalization (abbreviations, number formats, punctuation)
  • ✅ Diacritics validation (coverage analysis, missing diacritics detection, auto-correction)

🚀 Quick Start

Prerequisites

  • Docker & Docker Compose
  • OpenAI API key (or local Ollama setup)

Installation

  1. Clone the repository:
git clone https://github.com/YOUR_USERNAME/dlc8.git
cd dlc8
  2. Set up environment:
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
  3. Start services:
docker compose up -d
  4. Verify services:
# Check NLP service health
curl http://localhost:8000/health

# Check backend service
curl http://localhost:3000/health

Usage Example

Translate with all NLP features:

curl -X POST http://localhost:3000/translate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Visit our bakery at Q.1, TP.HCM! Prices from 50,000 VND. Open T2-T7.",
    "glossary": {
      "preserve": ["Dulce"],
      "banned": ["best"]
    },
    "nlp_postprocess": true
  }'

Response:

{
  "translation": "Ghé tiệm bánh Dulce tại Quận 1, Thành phố Hồ Chí Minh! Giá từ 50.000 VND. Mở cửa Thứ hai-Thứ bảy.",
  "tone": {
    "detected_tone": "promotional",
    "confidence": 0.78,
    "model": "vinai/phobert-base"
  },
  "quality": {
    "score": 82,
    "grade": "B",
    "issues": []
  }
}
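
For callers working from Python rather than curl, the same request can be sketched with only the standard library. The helper names `build_translate_request` and `summarize` and the default base URL are assumptions made for illustration; the payload and response fields mirror the example above, and none of this is taken from the DLC8 codebase.

```python
import json
import urllib.request


def build_translate_request(text, preserve=(), banned=(),
                            base_url="http://localhost:3000"):
    """Build (but do not send) a POST /translate request.

    Hypothetical helper: field names follow the curl example in this README.
    """
    payload = {
        "text": text,
        "glossary": {"preserve": list(preserve), "banned": list(banned)},
        "nlp_postprocess": True,
    }
    return urllib.request.Request(
        url=f"{base_url}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response_body):
    """Pull out the fields shown in the sample response, tolerating missing keys."""
    tone = response_body.get("tone", {})
    quality = response_body.get("quality", {})
    return {
        "translation": response_body.get("translation"),
        "tone": tone.get("detected_tone"),
        "score": quality.get("score"),
        "grade": quality.get("grade"),
    }


# Sending requires the backend started by `docker compose up -d`:
#   req = build_translate_request("Visit our bakery!", ["Dulce"], ["best"])
#   with urllib.request.urlopen(req, timeout=30) as resp:
#       print(summarize(json.load(resp)))
```

Building the request separately from sending it keeps the payload easy to inspect or log before any network call is made.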

📚 Documentation

Complete Guides

API Endpoints

Backend Service (Port 3000)

  • POST /translate - Main translation endpoint with NLP
  • GET /health - Service health check

NLP Service (Port 8000)

  • GET /health - Service health and dependency status
  • POST /nlp/postprocess - Phase 1-3 processing (glossary, tone, quality, normalization)
  • POST /normalize - Phase 3: Vietnamese text normalization
  • POST /validate-diacritics - Phase 4: Diacritics validation
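
To make the glossary step of `/nlp/postprocess` more concrete, the sketch below flags preserved terms that are missing from a translation and banned phrases that appear in it. It is a simplified stand-in rather than the service's implementation: the function name `check_glossary`, the matching rules (case-sensitive for preserved terms, case-insensitive whole-word for banned phrases), and the result shape are all assumptions.

```python
import re


def check_glossary(translation, preserve=(), banned=()):
    """Return glossary violations found in a translated string.

    Assumed semantics: 'preserve' terms must appear verbatim,
    'banned' phrases must not appear as whole words (any letter case).
    """
    missing = [term for term in preserve if term not in translation]
    found_banned = [
        phrase for phrase in banned
        if re.search(rf"(?<!\w){re.escape(phrase)}(?!\w)", translation, re.IGNORECASE)
    ]
    return {
        "missing_preserved": missing,
        "banned_found": found_banned,
        "ok": not missing and not found_banned,
    }


result = check_glossary("Ghé tiệm bánh Dulce tại Quận 1", ["Dulce"], ["best"])
```

Whole-word matching is used for banned phrases so that a banned "best" does not flag an unrelated word such as "bestseller".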

Vietnamese Text Features

Phase 3: Normalization

  • Abbreviation expansion (26 common terms: TP.HCM, Q., T2-T7, etc.)
  • Vietnamese number formatting (1,000 → 1.000, 3.5% → 3,5%)
  • Punctuation normalization (spacing, quotes)
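
The first two normalization rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the service's code: the abbreviation table below is a small subset chosen for the example (the real 26-entry table is not reproduced here), the function names are hypothetical, and the number rule assumes the input uses English conventions, so text that is already in Vietnamese format (such as 1.000) would be converted incorrectly.

```python
import re

# Illustrative subset only; assumed for this sketch.
ABBREVIATIONS = {
    "TP.HCM": "Thành phố Hồ Chí Minh",
    "T2": "Thứ hai",
    "T3": "Thứ ba",
    "T4": "Thứ tư",
    "T5": "Thứ năm",
    "T6": "Thứ sáu",
    "T7": "Thứ bảy",
}

# English-style number: grouped thousands with optional decimals, or a plain decimal.
_NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+\.\d+")
_SWAP = str.maketrans({",": ".", ".": ","})


def expand_abbreviations(text):
    """Expand known abbreviations (longest first), then the 'Q.<n>' district form."""
    for abbr in sorted(ABBREVIATIONS, key=len, reverse=True):
        pattern = rf"(?<!\w){re.escape(abbr)}(?!\w)"
        text = re.sub(pattern, ABBREVIATIONS[abbr], text)
    # 'Q.1' or 'Q. 1' becomes 'Quận 1'
    return re.sub(r"(?<!\w)Q\.\s?(\d+)", r"Quận \1", text)


def format_numbers(text):
    """Swap ',' and '.' inside English-formatted numbers (1,000 becomes 1.000)."""
    return _NUMBER.sub(lambda m: m.group(0).translate(_SWAP), text)


def normalize(text):
    """Apply abbreviation expansion, then number formatting."""
    return format_numbers(expand_abbreviations(text))
```

Swapping both separators in one `translate` call avoids the classic bug of replacing commas with dots and then turning every dot, including the new ones, back into a comma.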

Phase 4: Diacritics Validation

  • Coverage calculation (target: 20-35% for Vietnamese text)
  • Missing diacritics detection (50+ common words)
  • Auto-correction (conservative and aggressive modes)
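
One plausible way to compute coverage is the share of letters that carry a diacritic. The sketch below uses that definition, which is an assumption: this README does not state the exact formula the service uses. The function names are hypothetical, and the 20-35% range is taken from the target listed above.

```python
import unicodedata


def has_diacritic(ch):
    """True if the letter carries a combining mark after NFD, or is đ/Đ."""
    if ch in "đĐ":
        return True  # 'đ' has no canonical decomposition, so check it explicitly
    return any(unicodedata.combining(c) for c in unicodedata.normalize("NFD", ch))


def diacritic_coverage(text):
    """Fraction of alphabetic characters that carry a diacritic (0.0 if no letters)."""
    letters = [ch for ch in unicodedata.normalize("NFC", text) if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(has_diacritic(ch) for ch in letters) / len(letters)


def coverage_in_target(text, low=0.20, high=0.35):
    """Check coverage against the target range quoted in this README."""
    return low <= diacritic_coverage(text) <= high
```

Text with stripped diacritics, such as "Thanh pho Ho Chi Minh", scores 0.0 under this definition and falls outside the target range, which is the kind of signal a missing-diacritics check can act on.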

🏗️ Architecture

┌─────────────────┐
│   User/Client   │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────┐
│  Backend Service (Node.js)  │
│  - Translation API          │
│  - OpenAI integration       │
│  - Request routing          │
└────────┬────────────────────┘
         │
         ▼
┌────────────────────────────────────┐
│  NLP Service (Python/FastAPI)      │
│  - Glossary enforcement (Phase 1)  │
│  - Tone analysis (Phase 2)         │
│  - Quality scoring (Phase 2)       │
│  - Text normalization (Phase 3)    │
│  - Diacritics validation (Phase 4) │
│  - PhoBERT (vinai/phobert-base)    │
└────────────────────────────────────┘

🀝 Contributing

We welcome contributions! This is an open-source project to improve Vietnamese NLP for marketing content.

Areas for Contribution:

  1. Expand Vietnamese diacritics dictionary (currently 50+ words)
  2. Add more abbreviation patterns (currently 26 patterns)
  3. Improve quality scoring metrics
  4. Test with real-world Vietnamese content
  5. Report bugs or edge cases

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Test thoroughly (see test files in root)
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Testing

With the services running, exercise each endpoint using the included test payloads:

# Test Phase 3 normalization
curl -X POST http://localhost:8000/normalize \
  -H "Content-Type: application/json" \
  -d @test_phase3_comprehensive.json

# Test Phase 4 diacritics
curl -X POST http://localhost:8000/validate-diacritics \
  -H "Content-Type: application/json" \
  -d @test_phase4_detection.json

# Test full pipeline
curl -X POST http://localhost:8000/nlp/postprocess \
  -H "Content-Type: application/json" \
  -d @test_vietnamese_product.json

📊 Project Status

Current Version: 0.5-production-phase4

| Phase | Feature | Status |
|---|---|---|
| Phase 1 | Glossary Enforcement | ✅ Complete |
| Phase 2 | Tone Analysis & Quality Scoring | ✅ Complete |
| Phase 3 | Vietnamese Text Normalization | ✅ Complete |
| Phase 4 | Diacritics Validation | ✅ Complete |
| Phase 5 | Smart Features (Hashtags, Emojis) | 📋 Planned |

🛠️ Technology Stack

Backend:

  • Node.js + Express
  • OpenAI API (gpt-4o-mini)

NLP Service:

  • Python 3.11
  • FastAPI
  • Transformers (HuggingFace)
  • PhoBERT (vinai/phobert-base)
  • underthesea (Vietnamese NLP)
  • ftfy, langdetect, rapidfuzz, emoji

Infrastructure:

  • Docker & Docker Compose
  • Uvicorn (ASGI server)

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • PhoBERT by VinAI Research for Vietnamese language models
  • underthesea for Vietnamese NLP tools
  • OpenAI for translation models
  • HuggingFace for transformer infrastructure

📧 Contact & Support

🗺️ Roadmap

Completed (v0.5):

  • ✅ Core translation pipeline
  • ✅ Vietnamese glossary enforcement
  • ✅ Tone detection with PhoBERT
  • ✅ Quality scoring system
  • ✅ Text normalization (abbreviations, numbers, punctuation)
  • ✅ Diacritics validation and auto-correction

In Progress:

  • 🔄 Production usage testing
  • 🔄 Dictionary expansion based on real data

Planned (v0.6+):

  • 📋 Phase 5: Smart features (hashtags, emojis, CTAs)
  • 📋 Web UI for testing
  • 📋 API authentication
  • 📋 Usage analytics
  • 📋 Multi-language support beyond Vietnamese

Made with ❤️ for the Vietnamese marketing community
