NLP Machine Translation

This repository contains a multi-stage machine translation workflow built with Python scripts and a Jupyter notebook.
It includes:

- A baseline SMT spelling corrector (app/smt.py)
- An improved SMT spelling corrector with guardrails (app/smt_improved.py)
- A Gemini-powered dataset generator (utils/smt-data-gen.py)
- A complete 3-stage translation notebook (final-3-stage-machine-translation.ipynb)

Repository structure

- app/: main SMT implementations
- data/: parallel corpus, LM corpus, and test text
- utils/: dataset generation utilities
- final-3-stage-machine-translation.ipynb: end-to-end pipeline notebook

Requirements

Python >=3.13
Poetry (recommended)

Dependencies are defined in:

pyproject.toml

Setup

From the repository root:

poetry install

Run the SMT corrector

Baseline version:

poetry run python app/smt.py

Improved version:

poetry run python app/smt_improved.py
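
For orientation, the data/ directory holds both a parallel corpus and an LM corpus, which is the usual input split for a noisy-channel corrector: the parallel corpus gives a channel model and the LM corpus gives a prior over clean words. The sketch below illustrates that idea only. It is not the code in app/smt.py or app/smt_improved.py. The function names train and correct, the word-for-word alignment, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train(parallel_pairs, lm_sentences):
    """Count (noisy, clean) word pairs and clean-word frequencies."""
    pair_counts = defaultdict(Counter)  # pair_counts[noisy][clean] = count
    clean_totals = Counter()            # how often each clean word was a target
    for noisy, clean in parallel_pairs:
        noisy_words, clean_words = noisy.split(), clean.split()
        # Assumption: sentence pairs align word-for-word; skip those that do not.
        if len(noisy_words) != len(clean_words):
            continue
        for n, c in zip(noisy_words, clean_words):
            pair_counts[n][c] += 1
            clean_totals[c] += 1
    lm_counts = Counter(w for s in lm_sentences for w in s.split())
    return pair_counts, clean_totals, lm_counts


def correct(sentence, pair_counts, clean_totals, lm_counts):
    """Replace each word with the candidate maximizing P(noisy|clean) * P(clean)."""
    lm_total = sum(lm_counts.values())
    vocab = len(lm_counts)
    corrected = []
    for word in sentence.split():
        candidates = pair_counts.get(word)
        if not candidates:
            corrected.append(word)  # never seen as a noisy form: leave unchanged
            continue

        def score(c):
            channel = candidates[c] / clean_totals[c]               # P(noisy | clean)
            prior = (lm_counts[c] + 1) / (lm_total + vocab + 1)     # add-one smoothed unigram
            return channel * prior

        corrected.append(max(candidates, key=score))
    return " ".join(corrected)


# Toy data, invented for this example.
pairs = [("teh cat", "the cat"), ("teh dog", "the dog"), ("the cat", "the cat")]
lm_text = ["the cat sat", "the dog ran"]
model = train(pairs, lm_text)
print(correct("teh cat sat", *model))  # the cat sat
```

Words that never appear on the noisy side of the parallel corpus pass through untouched, which is one simple form of guardrail against over-correction.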

3-stage notebook pipeline

Use the notebook below for the full end-to-end system:

final-3-stage-machine-translation.ipynb

The notebook is organized into three stages:

Stage 1: Statistical spelling correction preprocessing
- Builds and tests a statistical spelling corrector
- Exposes helper functions and an object usable in a larger pipeline
Stage 2: Complex English to simpler English conversion
- Includes environment setup, utility functions, optional fine-tuning, and inference
- Provides a convert_lang_eng(...) function for pipeline use
Stage 3: Final neural translation
- Uses a quantized mBART translation pipeline
- Connects all stages in a full system flow and includes a Gradio interface
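
The way the three stages feed into each other can be sketched as a chain of text-to-text functions. This is a minimal sketch under stated assumptions, not the notebook's code: run_pipeline, correct_spelling, simplify_english, and translate are hypothetical stand-ins with placeholder bodies. In the notebook, the Stage 2 role is played by convert_lang_eng(...), whose actual signature is not shown here.

```python
from typing import Callable, List

Stage = Callable[[str], str]


def run_pipeline(text: str, stages: List[Stage]) -> str:
    """Feed the output of each stage into the next one."""
    for stage in stages:
        text = stage(text)
    return text


# Hypothetical stand-ins; the real stages are models, not string replacements.
def correct_spelling(text: str) -> str:
    return text.replace("teh", "the")          # Stage 1 placeholder


def simplify_english(text: str) -> str:
    return text.replace("utilize", "use")      # Stage 2 placeholder


def translate(text: str) -> str:
    return f"[translated] {text}"              # Stage 3 placeholder


result = run_pipeline(
    "We utilize teh model",
    [correct_spelling, simplify_english, translate],
)
print(result)  # [translated] We use the model
```

Keeping each stage as a plain function of one string makes it straightforward to test a stage alone or to skip one while debugging.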

Generate training data with Gemini (optional)

Set your API key first:

export GEMINI_API_KEY="your_api_key"

Then run:

poetry run python utils/smt-data-gen.py

This generates parallel_corpus.txt and lm_corpus.txt in the current working directory.
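
Because the generator depends on GEMINI_API_KEY, a fail-fast check before any API call avoids a confusing error later. The helper below is a hypothetical sketch of such a check; whether utils/smt-data-gen.py performs one is not stated here, and the name require_api_key is an assumption.

```python
import os


def require_api_key(env=os.environ, name="GEMINI_API_KEY"):
    """Return the key from the environment, or raise a clear error if missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the generator")
    return key


# Usage with an explicit mapping, so the example does not depend on the real environment.
print(require_api_key({"GEMINI_API_KEY": "your_api_key"}))  # your_api_key
```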
