This project provides a translation model covering all Indo-European languages plus Chinese, built on a Transformer architecture. The project is organized to be easy to download, set up, and use for both inference and further training.
- Jieba + BPE tokenization for Chinese:
- Multiple tokenization methods were tested; for Chinese, pre-tokenization with Jieba followed by a pre-trained BPE model yielded the best results
- Trained on well-structured, high-quality parallel corpora
- Train BPE tokenizers for both source and target languages
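The BPE step above can be illustrated with a minimal sketch. In the project, Jieba first segments Chinese text into words and a pre-trained BPE model (trained with SentencePiece, per the inference notes below) encodes them; since those are external dependencies, this toy pure-Python version only shows how BPE merge rules are learned from a corpus. The example words are hypothetical.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words.

    Each word starts as a tuple of characters; at every step the most
    frequent adjacent symbol pair in the corpus is merged into one symbol.
    """
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        merged_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges

merges = learn_bpe_merges(["lower", "lowest", "low", "low"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

The learned merges ("l"+"o", then "lo"+"w") reflect the most frequent substrings of the toy corpus; a real tokenizer learns thousands of such merges and then applies them to unseen text.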
- Clone the repository:
git clone <repository_url>
cd Indo-European-Language-translation-app
- Install dependencies:
pip install -r resources/requirements.txt
English to Chinese translation example:
❯ python scripts/inference.py en zh "tell me how you feel"
Translated Sentence: 告訴我你感觉如何?
English to German translation example:
❯ python scripts/inference.py en de "today is such a good day"
Translated sentence: Heute ist eine gute Tag
- The script will load the best-performing model checkpoint from 'checkpoints/model.pt'
- It will process the input sentence, encode it with the trained SentencePiece model located under data/bpe/bpe.model, and output the translated sentence in the target language
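The decoding step of the inference flow above can be sketched independently of the trained model. This is a minimal greedy autoregressive loop with a stub in place of the real Transformer; `step_fn`, the BOS/EOS token ids, and the stub's behavior are all assumptions for illustration, not the project's actual API.

```python
def greedy_decode(step_fn, bos_id, eos_id, max_len=50):
    """Greedy autoregressive decoding: repeatedly feed the current prefix
    to the model and append the highest-scoring next token until EOS."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = step_fn(tokens)  # scores over the vocabulary for the next token
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Hypothetical stub "model": prefers token 5 while the prefix is short,
# then switches to EOS (id 2).
def stub_step(prefix):
    vocab = [0.0] * 10
    if len(prefix) < 4:
        vocab[5] = 1.0
    else:
        vocab[2] = 1.0
    return vocab

print(greedy_decode(stub_step, bos_id=1, eos_id=2))  # [1, 5, 5, 5, 2]
```

The real script would replace `stub_step` with a forward pass of the Transformer decoder conditioned on the encoded source sentence, and the resulting ids would be decoded back to text with the SentencePiece model.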
python scripts/train.py
- The script will read from data/source_lang-target_lang/encoded_data.json
- train_loader and validation_loader will be created using MyDataset and collate_fn from utils/dataset.py
- The model checkpoint will be saved in the checkpoints/source_lang-target_lang
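The batching step above relies on padding variable-length sentences so they can be stacked into tensors, which is the usual job of a `collate_fn`. Here is a plain-Python sketch of that idea without torch; the batch layout (source/target token-id pairs) and `pad_id=0` are assumptions, not necessarily what utils/dataset.py does.

```python
def collate_fn(batch, pad_id=0):
    """Pad the source and target token sequences in a batch to a common
    length so a DataLoader can stack them into rectangular tensors."""
    srcs, tgts = zip(*batch)

    def pad(seqs):
        max_len = max(len(s) for s in seqs)
        return [list(s) + [pad_id] * (max_len - len(s)) for s in seqs]

    return pad(srcs), pad(tgts)

# Two (source, target) examples of different lengths.
batch = [([4, 8, 15], [16, 23]), ([42], [7, 9, 11])]
src_batch, tgt_batch = collate_fn(batch)
print(src_batch)  # [[4, 8, 15], [42, 0, 0]]
print(tgt_batch)  # [[16, 23, 0], [7, 9, 11]]
```

In the actual training loop the padded positions would be masked out of the attention and the loss so they do not affect learning.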
The translation model uses a Transformer architecture consisting of an encoder and a decoder, following the well-known paper "Attention Is All You Need". Several model configurations were tested, and the final model is slightly less complex than the one in the original paper. The encoder processes the input source sentence, and the decoder generates the corresponding translation in the target language.
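The core operation shared by the encoder and decoder described above is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. As a sketch of what each layer computes, here it is in plain Python with toy 2-dimensional vectors (the real model works on batched tensors with multiple heads):

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    # Compatibility score of every query against every key.
    scores = [[sum(q[i] * k[i] for i in range(d_k)) / math.sqrt(d_k)
               for k in K] for q in Q]
    # Row-wise softmax turns scores into attention weights.
    weights = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    # Output: attention-weighted average of the value vectors.
    return [[sum(w[j] * V[j][i] for j in range(len(V)))
             for i in range(len(V[0]))] for w in weights]

Q = [[1.0, 0.0]]                    # one query
K = [[1.0, 0.0], [0.0, 1.0]]        # two keys
V = [[1.0, 2.0], [3.0, 4.0]]        # two values
out = scaled_dot_product_attention(Q, K, V)
```

Because the query aligns with the first key, the output is pulled toward the first value vector (roughly 0.67·[1, 2] + 0.33·[3, 4]).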
The dataset was obtained from the open-source OPUS corpus collection: https://opus.nlpl.eu/results/en&cmn/corpus-result-table
Contributions to this project are welcome. If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request on the project's GitHub repository.
Thanks to the contributors and researchers whose work and ideas have inspired and influenced this project. Special thanks to the developers of the Transformer architecture and of the Jieba and BPE tokenization tools.