System packages

  • Python 3.6+
  • git

Installing Python dependencies

  • pip3 install requests sh click
  • pip3 install regex docopt numpy scikit-learn scipy, if you want to use
  • git clone

This will create a new folder called unify-emotion-datasets.

Running the two scripts

First run the script that downloads all obtainable datasets:

  • cd unify-emotion-datasets # go inside the repository
  • python3

Please read the instructions carefully: you will be asked to read, and to confirm having read, the licenses and terms of use of each dataset. If a dataset cannot be obtained directly, you will be given instructions on how to obtain it.

Then run the script that unifies the downloaded datasets, which will be located in unify-emotion-datasets/datasets/:


This will create a new file called unified-dataset.jsonl in the same folder.
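Since the unified file is in JSON Lines format (one JSON object per line), it can be inspected with a few lines of Python. The following is a minimal sketch, not part of the repository: the helper names `read_jsonl` and `peek` are hypothetical, and the file path in the usage comment assumes you run it from the folder that contains unified-dataset.jsonl.

```python
import json
from itertools import islice
from typing import Iterator, List, TextIO


def read_jsonl(handle: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def peek(path: str, n: int = 3) -> List[dict]:
    """Return the first n records of the JSON Lines file at path."""
    with open(path, encoding="utf-8") as handle:
        return list(islice(read_jsonl(handle), n))


# Usage (assumed path; adjust to where the unify script wrote the file):
#     for record in peek("unified-dataset.jsonl"):
#         print(sorted(record.keys()))
```

Printing the keys of the first few records is a quick way to see which fields each instance carries before writing any further processing code.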

We also advise you to cite the papers corresponding to the datasets you use. You can find the corresponding BibTeX citations in the file datasets/ or while running


An Analysis of Annotated Corpora for Emotion Classification in Text

If you plan to use this corpus, please use this citation:

  author = {Bostan, Laura Ana Maria and Klinger, Roman},
  title = {An Analysis of Annotated Corpora for Emotion Classification in Text},
  booktitle = {Proceedings of the 27th International Conference on Computational Linguistics},
  year = {2018},
  publisher = {Association for Computational Linguistics},
  pages = {2104--2119},
  location = {Santa Fe, New Mexico, USA},
  url = {},
  pdf = {}

Experimenting with classification

If you want to reuse the code for the emotion classification task, see the script

python3 --help will show you the following:

Classify using MaxEnt algorithm

Usage:
    [options] <first> <second>
    [options] --all-vs <second>

    -j --json=<JSONFILE>  Filename of the json file [default: ../unified.jsonl]
    -a --all-vs=<dataset> Dataset name of the testing data
    -d --debug            Use a small word list and a fast classifier
    -o --output=<OUTPUT>  Output folder [default: .]
    -m --force-multi      Force using multi-label classification
    -k --keep-last        Quit immediately if results file found

For example, if you want to train on TEC and test on EmoInt, do the following:

python3 -d tec emoint 

The dataset names are the ones used in the source field of the file unified-dataset.jsonl.
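To see which dataset names are available, you can collect the distinct values of the source field. This is a small sketch under the assumption that every record has a `source` key; the function name `dataset_names` is hypothetical and not part of the repository.

```python
import json
from typing import Iterable, List


def dataset_names(lines: Iterable[str]) -> List[str]:
    """Return the sorted, distinct values of the `source` field."""
    names = set()
    for line in lines:
        if line.strip():  # skip blank lines
            names.add(json.loads(line)["source"])
    return sorted(names)


# Usage (assumed path to the unified file):
#     with open("unified-dataset.jsonl", encoding="utf-8") as handle:
#         print(dataset_names(handle))
```

The names it returns are the values to pass as `<first>` and `<second>` on the command line.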


Use jq for easy interaction with unified-dataset.jsonl

Examples of how to use it for various tasks:

  • Select the instances whose source is crowdflower or tec: jq 'select(.source=="crowdflower" or .source=="tec")' <unified-dataset.jsonl | less
  • Count, per dataset, how often instances are annotated with high surprise (above 0.5): jq 'select(.emotions.surprise > 0.5) | .source' <unified-dataset.jsonl | sort | uniq -c
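If jq is not available, the same two queries can be approximated in Python. The sketch below is illustrative only: the function names are hypothetical, and it assumes that `emotions` is an object mapping emotion names to numeric scores, as the jq expression `.emotions.surprise > 0.5` suggests. Records where the score is missing or not a number are skipped.

```python
import json
from collections import Counter
from typing import Iterable, Iterator, Sequence


def records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse each non-empty line as one JSON record."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def from_sources(lines: Iterable[str],
                 wanted: Sequence[str] = ("crowdflower", "tec")) -> Iterator[dict]:
    """Rough equivalent of: jq 'select(.source=="crowdflower" or .source=="tec")'"""
    return (r for r in records(lines) if r.get("source") in wanted)


def high_emotion_by_source(lines: Iterable[str],
                           emotion: str = "surprise",
                           threshold: float = 0.5) -> Counter:
    """Rough equivalent of:
    jq 'select(.emotions.surprise > 0.5) | .source' | sort | uniq -c
    """
    counts = Counter()
    for r in records(lines):
        score = (r.get("emotions") or {}).get(emotion)
        # Only numeric scores strictly above the threshold are counted.
        if isinstance(score, (int, float)) and score > threshold:
            counts[r.get("source")] += 1
    return counts


# Usage (assumed path to the unified file):
#     with open("unified-dataset.jsonl", encoding="utf-8") as handle:
#         for name, n in sorted(high_emotion_by_source(handle).items()):
#             print(n, name)
```

Both helpers take any iterable of lines, so they work on an open file handle as well as on a list of strings.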


A Survey and Experiments on Annotated Corpora for Emotion Classification in Text







