Social Media Machine Translation Toolkit


Social Media Machine Translation Toolkit - Everything you need to start building Machine Translation Systems on Social Media

This toolkit was proposed as a project in MTMarathon 2013.

Proposed and Maintained by: Wang Ling

Contributors: Carolin Haas, Chris Dyer, Adam Lopez

Requirements: Moses, GIZA++ and KenLM - You can follow the guide in, which will get these installed

Usage: scripts/ source target rootdir mosesdir mosesexternaldir model

source - source language (ex: en)

target - target language (ex: zh)

rootdir - where this package is

mosesdir - the Moses installation

mosesexternaldir - where GIZA++ is, probably mosesdecoder/tools

model - where the model and results will be generated
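The six positional arguments can also be assembled programmatically. The following is only a minimal sketch: `build_command` and its `script` parameter are hypothetical names introduced for illustration, the script's file name is not given in the usage line so the caller has to supply it, and converting the directory arguments to absolute paths is an assumption made for convenience rather than a documented requirement.

```python
import os
from typing import List


def build_command(script: str, source: str, target: str, rootdir: str,
                  mosesdir: str, mosesexternaldir: str, model: str) -> List[str]:
    """Return the argument list in the order given by the usage line.

    `script` is a placeholder for the build script under scripts/;
    its actual file name must be filled in by the caller.
    """
    # Assumption: absolute paths are safer when the script changes directory.
    dirs = [os.path.abspath(d)
            for d in (rootdir, mosesdir, mosesexternaldir, model)]
    return [script, source, target] + dirs
```

The returned list could then be handed to `subprocess.run`, once the real script path is known.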


data/parallel/ - parallel dataset directory. To add more data, add files ending with ".en-cn" in the format " ||| ", i.e. with the two sides of each sentence pair separated by "|||" (see the existing data/parallel/microtopia.en-cn for an example).

scripts/ - script that builds an MT system using the existing data and evaluates it on the microblog test set. Run it without arguments for a description.

scripts/tokenize/ - directory containing tokenizers for different languages
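For the "|||"-separated files under data/parallel/ described above, a reader could look roughly like the sketch below. It rests on several assumptions that should be checked against data/parallel/microtopia.en-cn: one sentence pair per line, exactly one "|||" separator per line, UTF-8 encoding, and the English side coming first in an ".en-cn" file. The names `parse_pair` and `read_parallel` are hypothetical and not part of the toolkit.

```python
from typing import Iterator, Tuple


def parse_pair(line: str) -> Tuple[str, str]:
    """Split one line into its two sides around the '|||' separator."""
    parts = line.rstrip("\n").split("|||")
    if len(parts) != 2:
        # Assumption: a well-formed line has exactly one separator.
        raise ValueError("expected exactly one '|||' in: %r" % line)
    return parts[0].strip(), parts[1].strip()


def read_parallel(path: str) -> Iterator[Tuple[str, str]]:
    """Yield sentence pairs from a parallel file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_pair(line)
```

Running a new file through `read_parallel` before adding it to data/parallel/ is a cheap way to catch lines with a missing or duplicated separator.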

The parallel data is obtained from the following work, so if you use this toolkit please cite:

@inproceedings{wangling:acl2013,
  author = {Ling, Wang and Xiang, Guang and Dyer, Chris and Black, Alan and Trancoso, Isabel},
  title = {Microblogs as Parallel Corpora},
  booktitle = {Proceedings of the 51st Annual Meeting on Association for Computational Linguistics},
  series = {ACL '13},
  year = {2013},
  location = {Sofia, Bulgaria},
  numpages = {8},
  publisher = {Association for Computational Linguistics}
}