An English-Tamil Parallel Corpus
More information here
We have collected English-Tamil bilingual data from publicly available websites for NLP research involving Tamil. Standard preprocessing was applied to the raw web data to produce a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpora cover texts from the Bible, cinema, and news domains.
Download raw data from here
Processed data is available here
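Since the processed corpus is sentence aligned, a loader mostly needs to pair line N of the English side with line N of the Tamil side. The sketch below assumes the data arrives as two line-aligned sequences with one sentence per line. That layout is an assumption rather than something stated above, and `read_parallel` is a hypothetical helper name.

```python
from itertools import zip_longest


def read_parallel(en_lines, ta_lines):
    """Pair up line-aligned English and Tamil sentences.

    Assumes one sentence per line on each side (an assumption about the
    corpus layout). Accepts any iterables of strings, including open files.
    """
    pairs = []
    for n, (en, ta) in enumerate(zip_longest(en_lines, ta_lines), start=1):
        # zip_longest pads the shorter side with None, which exposes a
        # length mismatch instead of silently truncating the corpus.
        if en is None or ta is None:
            raise ValueError(f"line count mismatch at line {n}")
        en, ta = en.strip(), ta.strip()
        # Skip pairs where either side is empty after stripping whitespace.
        if en and ta:
            pairs.append((en, ta))
    return pairs
```

Two file objects opened with `encoding="utf-8"` can be passed in directly. Raising on a length mismatch is a deliberate choice here, because a single missing line would shift every later pair out of alignment.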
- Tokenizing Tamil words and learning sentences word by word is a ridiculous idea. In Tamil, each word is fused to the next with a contextual modifier. So I am gonna try to generate sentences character by character, which seems just as preposterous.
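The character-by-character idea can be sketched as a small vocabulary plus encode and decode helpers. This is only an illustration, not the method of the cited paper. It assumes "character" means a Unicode code point, so a Tamil vowel sign or pulli (virama) becomes its own symbol rather than being grouped with its consonant. The names `build_char_vocab`, `encode`, and `decode` are hypothetical.

```python
def build_char_vocab(sentences):
    """Map every distinct Unicode code point in the data to an integer id."""
    chars = sorted({ch for sentence in sentences for ch in sentence})
    return {ch: i for i, ch in enumerate(chars)}


def encode(sentence, vocab):
    """Turn a sentence into a list of character ids (one per code point)."""
    return [vocab[ch] for ch in sentence]


def decode(ids, vocab):
    """Invert encode(): map ids back to characters and join them."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


# The word for "Tamil", written with escapes so the code points are explicit:
# TA, MA, vowel sign I, LLLA, pulli (virama).
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
vocab = build_char_vocab([word])
ids = encode(word, vocab)
```

Here `ids` has five entries even though a reader sees three written syllables. If that split is too fine, grouping code points into grapheme clusters before building the vocabulary is one alternative.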
Morphological Processing for English-Tamil Statistical Machine Translation
Loganathan Ramasamy and Ondrej Bojar and Zdenek Zabokrtsky
2012, 113--122, Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages (MTPIL-2012)