Social Media Machine Translation Toolkit

Social Media Machine Translation Toolkit - Everything you need to start building Machine Translation Systems on Social Media

This toolkit was proposed as a project in MTMarathon 2013.

Proposed and maintained by: Wang Ling

Contributors: Carolin Haas, Chris Dyer, Adam Lopez

Requirements: Moses, Giza++ and KenLM. You can follow the guide in, which will get these installed.

Usage: scripts/ source target rootdir mosesdir mosesexternaldir model

source - source language (e.g. en)

target - target language (e.g. zh)

rootdir - where this package is

mosesdir - the Moses installation

mosesexternaldir - where Giza++ is (probably mosesdecoder/tools)

model - where the model and results will be generated


data/parallel/ - parallel dataset directory. To add more data, add files ending with ".en-cn" in the format " ||| " (see the existing data/parallel/microtopia.en-cn for an example).

scripts/ - script that builds an MT system using the existing data and evaluates it on the microblog test set. Run it without arguments for a description.

scripts/tokenize/ - directory containing tokenizers for different languages
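
Before adding a new file to data/parallel/, it can help to check that every line parses. The sketch below is not part of the toolkit. It assumes each line holds exactly two fields separated by "|||" and that files are UTF-8 encoded; both are assumptions, so compare against data/parallel/microtopia.en-cn before relying on it. The function names read_parallel and write_parallel are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

SEPARATOR = "|||"  # field separator taken from the format description above


def read_parallel(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (left, right) pairs from lines assumed to look like 'left ||| right'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = [field.strip() for field in line.split(SEPARATOR)]
        if len(fields) != 2:
            # Assumption: exactly two fields per line; anything else is reported.
            raise ValueError(f"expected 2 fields, got {len(fields)}: {line!r}")
        yield fields[0], fields[1]


def write_parallel(pairs: Iterable[Tuple[str, str]], path: str) -> None:
    """Write pairs to path, one 'left ||| right' line per pair (UTF-8 assumed)."""
    with open(path, "w", encoding="utf-8") as out:
        for left, right in pairs:
            out.write(f"{left} {SEPARATOR} {right}\n")
```

To validate an existing file, open it with encoding="utf-8" and pass the file object to read_parallel; a ValueError points at the first line that does not split into two fields.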

The parallel data is obtained from So, if you use this toolkit please cite:

@inproceedings{wangling:acl2013,
  author    = {Ling, Wang and Xiang, Guang and Dyer, Chris and Black, Alan and Trancoso, Isabel},
  title     = {Microblogs as Parallel Corpora},
  booktitle = {Proceedings of the 51st Annual Meeting on Association for Computational Linguistics},
  series    = {ACL '13},
  year      = {2013},
  location  = {Sofia, Bulgaria},
  numpages  = {8},
  publisher = {Association for Computational Linguistics}
}
