This repository contains a multi-stage machine translation workflow built with Python and notebooks.
It includes:
- A baseline SMT spelling corrector (`app/smt.py`)
- An improved SMT spelling corrector with guardrails (`app/smt_improved.py`)
- A Gemini-powered dataset generator (`utils/smt-data-gen.py`)
- A complete 3-stage translation notebook (`final-3-stage-machine-translation.ipynb`)

Repository layout:
- `app/`: main SMT implementations
- `data/`: parallel corpus, LM corpus, and test text
- `utils/`: dataset generation utilities
- `final-3-stage-machine-translation.ipynb`: end-to-end pipeline notebook

Requirements:
- Python >=3.13
- Poetry (recommended)

Dependencies are defined in `pyproject.toml`.

From the repository root:

    poetry install

Baseline version:

    poetry run python app/smt.py

Improved version:

    poetry run python app/smt_improved.py

Use the notebook below for the full end-to-end system:

    final-3-stage-machine-translation.ipynb

The notebook is organized into three stages:
- Stage 1: Statistical spelling correction preprocessing
  - Builds and tests a statistical spelling corrector
  - Exposes helper functions and an object usable in a larger pipeline
- Stage 2: Complex English to simpler English conversion
  - Includes environment setup, utility functions, optional fine-tuning, and inference
  - Provides a `convert_lang_eng(...)` function for pipeline use
- Stage 3: Final neural translation
  - Uses a quantized MBART translation pipeline
  - Connects all stages in a full system flow and includes a Gradio interface
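
The three stages can be pictured as plain functions chained in order. The sketch below is only an illustration of that flow, not the notebook's code: `correct_spelling`, `translate`, and `run_pipeline` are hypothetical names, the single-string signature used for `convert_lang_eng` is an assumption (the notebook only names the function), and every stage body is a toy stand-in for the real model.

```python
from functools import reduce
from typing import Callable, Iterable

Stage = Callable[[str], str]


def correct_spelling(text: str) -> str:
    """Stage 1 stand-in (hypothetical): fix a few known misspellings."""
    fixes = {"teh": "the", "recieve": "receive"}
    return " ".join(fixes.get(word, word) for word in text.split())


def convert_lang_eng(text: str) -> str:
    """Stage 2 stand-in (signature assumed): swap complex words for simpler ones."""
    simpler = {"utilize": "use", "commence": "start"}
    return " ".join(simpler.get(word, word) for word in text.split())


def translate(text: str) -> str:
    """Stage 3 stand-in (hypothetical): tag the text instead of calling MBART."""
    return f"[translated] {text}"


def run_pipeline(text: str, stages: Iterable[Stage]) -> str:
    """Feed the output of each stage into the next, in order."""
    return reduce(lambda current, stage: stage(current), stages, text)


result = run_pipeline(
    "we utilize teh model",
    [correct_spelling, convert_lang_eng, translate],
)
print(result)  # [translated] we use the model
```

Keeping each stage as a `str -> str` callable means a stage can be swapped or skipped without touching the others, which is also a convenient shape for wiring into a UI callback.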

To run the dataset generator, set your API key first:

    export GEMINI_API_KEY="your_api_key"

Then run:

    poetry run python utils/smt-data-gen.py

This generates `parallel_corpus.txt` and `lm_corpus.txt` in the current working directory.
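
A small preflight check can catch a missing key before the generator runs and confirm the output files afterwards. This is a hedged sketch that is not part of the repository: the helper names `api_key_is_set` and `missing_outputs` are hypothetical, and only the environment variable and file names come from the instructions above.

```python
import os
from pathlib import Path

# File names taken from the instructions above; helper names are hypothetical.
EXPECTED_OUTPUTS = ("parallel_corpus.txt", "lm_corpus.txt")


def api_key_is_set(env=None) -> bool:
    """Return True if GEMINI_API_KEY is present and non-blank."""
    env = os.environ if env is None else env
    return bool(env.get("GEMINI_API_KEY", "").strip())


def missing_outputs(directory=".") -> list[str]:
    """List the expected output files that do not exist in `directory`."""
    base = Path(directory)
    return [name for name in EXPECTED_OUTPUTS if not (base / name).is_file()]


print("GEMINI_API_KEY set:", api_key_is_set())
print("Outputs not yet generated:", missing_outputs())
```

Run it once before the generator to check the key, and once after to confirm that both files landed in the current working directory.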