
An English-Tamil Parallel Corpus

More information here

We have collected English-Tamil bilingual data from publicly available websites for NLP research involving Tamil. A standard set of processing steps was applied to the raw web data before it was released as a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpora cover texts from the Bible, cinema, and news domains.

Download raw data from here

Processed data

Processed data is available here

Remarks

  • Tokenizing Tamil words and learning sentences word by word is a ridiculous idea. In Tamil, each word is fused to the next with a contextual modifier. So I am going to try generating sentences character by character, which seems just as preposterous.
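
To make the character-by-character idea concrete, here is a minimal sketch of a character-level vocabulary in Python. It is not the code in this repository's tokenizer.py; the helper names (`build_char_vocab`, `encode`, `decode`) and the `<pad>`/`<unk>` tokens are assumptions made for illustration. It also treats every Unicode code point as one "character", so Tamil vowel signs and the pulli become separate symbols; splitting on grapheme clusters instead would be a different design choice.

```python
# Hypothetical helpers for illustration; not taken from the repository.
PAD, UNK = "<pad>", "<unk>"  # assumed special tokens


def build_char_vocab(sentences):
    """Map every distinct Unicode code point in the corpus to an integer id."""
    chars = sorted({ch for sentence in sentences for ch in sentence})
    itos = [PAD, UNK] + chars              # id -> symbol
    stoi = {ch: i for i, ch in enumerate(itos)}  # symbol -> id
    return stoi, itos


def encode(sentence, stoi):
    """Turn a string into a list of ids, one per code point."""
    return [stoi.get(ch, stoi[UNK]) for ch in sentence]


def decode(ids, itos):
    """Turn a list of ids back into a string."""
    return "".join(itos[i] for i in ids)


# Two toy words stand in for the Tamil side of the corpus.
corpus = ["தமிழ்", "நன்றி"]
stoi, itos = build_char_vocab(corpus)
ids = encode("தமிழ்", stoi)
roundtrip = decode(ids, itos)
```

The vocabulary stays small (a few hundred symbols at most for Tamil plus punctuation), at the cost of much longer sequences than word-level tokenization would produce.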

Citation

Morphological Processing for English-Tamil Statistical Machine Translation
Loganathan Ramasamy and Ondrej Bojar and Zdenek Zabokrtsky
2012, pp. 113-122, Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages (MTPIL-2012)