
multilabel-stance-detection-public

Public version of the code for the EMNLP 2019 paper: "Incorporating Label Dependencies in Multilabel Stance Detection" by William Ferreira and Andreas Vlachos.

Data Preparation

The paper explores methods of performing multilabel stance detection with reference to three multilabel datasets. The datasets cannot be provided with this code due to licensing conditions, and should be requested from the relevant authors:

  1. The Brexit Blog Corpus (BBC): Vasiliki Simaki, Carita Paradis, and Andreas Kerren. 2017. Stance classification in texts from blogs on the 2016 British referendum. In SPECOM 2017.

  2. US Election Twitter Corpus (ETC): Parinaz Sobhani, Diana Inkpen, and Xiaodan Zhu. 2019. Exploring deep neural networks for multi-target stance detection. Computational Intelligence, 35(1):82–97.

  3. The Moral Foundations Twitter Corpus (MFTC): Hoover, J., Portillo-Wightman, G., Yeh, L., Havaldar, S., Davani, A. M., Lin, Y., … Dehghani, M. (2019, April 10). Moral Foundations Twitter Corpus: A collection of 35k tweets annotated for moral sentiment. https://doi.org/10.31234/osf.io/w4f72

The code in this repo is written in Python 3.7.x. To run the code, we suggest you create a Python virtual environment using the Anaconda Python platform (https://www.anaconda.com/), for example:

conda create --name emnlp_2019_multilabel anaconda

In addition to the packages that come with Anaconda, you will need:

conda install -c conda-forge tensorflow-hub

conda install -c conda-forge keras

You will also need an installation of the FastText binary appropriate for your environment (see https://fasttext.cc/docs/en/support.html). The code in this repo predates the availability of the official FastText for Python library, and so uses the official binary version through a bespoke wrapper. Once the FastText binary is installed, you will need to set the environment variable FASTTEXT_HOME to point to the folder/dir containing the binary.
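The wrapper's interface is specific to this repo, but as a rough sketch (the binary location and the sanity check below are illustrative assumptions, not the repo's actual code), FASTTEXT_HOME would be used along these lines:

    import os
    import subprocess
    from pathlib import Path

    # Resolve the FastText binary from the FASTTEXT_HOME environment variable.
    fasttext_home = os.environ["FASTTEXT_HOME"]   # the folder/dir containing the 'fasttext' binary
    fasttext_bin = Path(fasttext_home) / "fasttext"

    # Sanity check: running the binary with no arguments prints its usage text.
    subprocess.run([str(fasttext_bin)], check=False)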

For each of the datasets, there is an associated script that pre-processes the original dataset to be used in the code:

BBC Dataset

Run the script prepare_bbc_dataset.py, which looks for a file called brexit_blog_corpus.xlsx in the same directory as the script. The script does some data cleaning and pre-processing (tokenizing, generating ELMo embeddings) and saves the output to the same directory. The output consists of the following files (a loading sketch follows the list):

a. bbc_dataset.csv - a comma-separated file consisting of an utterance ID, a tokenized utterance string, binary-valued columns for each of the ten stances (1 if the stance is expressed in the utterance, 0 otherwise), and a final column indicating whether the utterance is in the 80% training set or the 20% held-out test set.

b. bbc_dataset_folds.csv - a comma-separated file consisting of an utterance ID and, for each cv fold, a column indicating whether the utterance is in the train or test split of that fold, for all the data in the training set (see a.).

c. bbc_elmo_train_embeddings.csv - a comma-separated file consisting of an utterance ID and the vector representation of the ELMo embedding for the tokenized utterance, for all the utterances in the training set (see a.).

d. bbc_elmo_test_embeddings.csv - a comma-separated file consisting of an utterance ID and the vector representation of the ELMo embedding for the tokenized utterance, for all the utterances in the test set (see a.).
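As an illustrative loading sketch (file names as documented above; the comments describe the expected columns rather than asserting their exact headers):

    import pandas as pd

    # Load the prepared BBC files; all three are keyed by utterance ID.
    data = pd.read_csv("bbc_dataset.csv")
    folds = pd.read_csv("bbc_dataset_folds.csv")
    train_emb = pd.read_csv("bbc_elmo_train_embeddings.csv")

    print(data.columns.tolist())    # ID, tokenized utterance, ten stance columns, train/test indicator
    print(folds.columns.tolist())   # ID plus one train/test column per cv fold
    print(data.shape, folds.shape, train_emb.shape)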

ETC Dataset

Run the script prepare_tweet_dataset.py, which looks for a file called all_data_tweet_text.csv in the same directory as the script. The script splits the data in the file into the three target pairs: Donald Trump - Hillary Clinton (DT_HC), Donald Trump - Ted Cruz (DT_TC), and Hillary Clinton - Bernie Sanders (HC_BS). For each target pair the script does the following (see the loading sketch after the list):

a. combines the Train and Dev sets to produce a single train set, and keeps the existing test set as a hold-out test set,

b. splits the new training set into five train/test folds,

c. generates a comma-separated file for each target pair called tweet-x.csv, where x is one of {DT_HC, DT_TC, HC_BS}, with columns ID, Tweet, Target 1, Target 2, Test/Train/Dev, set, fold_1, fold_2, fold_3, fold_4, fold_5, where:

 i. Tweet is the tokenized tweet,
 ii. Target 1 is the first target stance (e.g. FOR)
 iii. Target 2 is the second target stance (e.g. AGAINST)
 iv. Test/Train/Dev is the original set designator
 v. set is the new set (i.e. train or test) designator
 vi. fold_i indicates whether the instance is in the train or test set for cv fold i in (1..5)

d. generates a comma-separated file consisting of an ID and the vector representation of the ELMo embedding for the tokenized tweet.
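For example, one fold of a target-pair file might be loaded as follows (file and column names as listed above; the exact 'train'/'test' values stored in the set and fold columns are an assumption):

    import pandas as pd

    # Load the prepared Donald Trump - Hillary Clinton target pair.
    df = pd.read_csv("tweet-DT_HC.csv")

    # Hold-out split from the 'set' column; cv splits from the fold_i columns.
    holdout_test = df[df["set"] == "test"]
    fold1_train = df[(df["set"] == "train") & (df["fold_1"] == "train")]
    fold1_test = df[(df["set"] == "train") & (df["fold_1"] == "test")]
    print(len(holdout_test), len(fold1_train), len(fold1_test))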

MFTC Dataset

Run the script prepare_mftc_dataset.py, which looks for a file called MFTC_V3_Text.json in the same directory as the script. The script requires an argument --corpus <corpus name>, where <corpus name> is one of ALM, Baltimore, BLM, Davidson, Election, MeToo or Sandy. For example:

  python prepare_mftc_dataset.py --corpus ALM

prepares the data for the ALM subset of the corpus, and generates two files: moral-dataset-ALM.csv and moral-dataset-ALM_elmo_embeddings.csv. The first file is comma-separated with columns: ID, Tweet, set, fold_1, fold_2, fold_3, fold_4, fold_5, where Tweet is the tokenized tweet text, set indicates whether the tweet is in the train or hold-out test set, and fold_i indicates whether the tweet is in the train or test set for fold i. The second file contains the ELMo embeddings for the tweets, keyed by ID. Repeat running the script for the remaining subsets: Baltimore, BLM, ..., Sandy.
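A small helper loop (not part of the repo; it simply shells out to the documented command) can prepare all seven subsets in one go:

    import subprocess

    # The seven MFTC subsets accepted by the --corpus argument.
    corpora = ["ALM", "Baltimore", "BLM", "Davidson", "Election", "MeToo", "Sandy"]

    for corpus in corpora:
        # Equivalent to: python prepare_mftc_dataset.py --corpus <corpus>
        subprocess.run(["python", "prepare_mftc_dataset.py", "--corpus", corpus], check=True)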

Model Estimation and Cross-validation

The script run_cv.py is used to run the various models against the different datasets, using cross-validation to explore the model parameter space. The cv parameter ranges are stored in the models variable in the script. There are various combinations of learning algorithm: FastText (FT) and Multi-task Learning (MTL); data encoding method: binary relevance (BR) and label power-set (LP); and loss function: binary cross-entropy and cross-label dependency (XLD). There are also three datasets, of which the ETC and MFTC datasets are further split into sub-datasets.

The main script parameters are --model-name and --dataset-name, where the model-name and dataset-name can take the following values:

--model-name

mlp-base                     - MTL model with binary cross-entropy loss
mlp-powerset                 - MTL model with label power-set encoding and categorical cross-entropy loss
mlp-cross-label-dependency   - MTL model with combined binary cross-entropy and cross-label dependency loss

fasttext-binary-relevance    - FastText model with binary relevance encoding
fasttext-powerset            - FastText model with label power-set encoding

--dataset-name

bbc              - BBC dataset
tweets-X         - ETC dataset, X in {DT_HC, DT_TC, HC_BS}
moral-dataset-X  - MFTC dataset, X in {ALM, BLM, Baltimore, Davidson, Election, MeToo, Sandy}

To run a specific model-name/dataset-name combination, for example, mlp-base and bbc, run the following command:

python run_cv.py --model-name mlp-base --dataset-name bbc
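To sweep every model/dataset combination listed above, a convenience loop along these lines (not part of the repo) would work:

    import itertools
    import subprocess

    models = ["mlp-base", "mlp-powerset", "mlp-cross-label-dependency",
              "fasttext-binary-relevance", "fasttext-powerset"]
    datasets = (["bbc"]
                + ["tweets-" + x for x in ("DT_HC", "DT_TC", "HC_BS")]
                + ["moral-dataset-" + x for x in
                   ("ALM", "BLM", "Baltimore", "Davidson", "Election", "MeToo", "Sandy")])

    # One cross-validation run per model/dataset combination.
    for model, dataset in itertools.product(models, datasets):
        subprocess.run(["python", "run_cv.py",
                        "--model-name", model, "--dataset-name", dataset], check=True)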

The output from the script is written to a sub-dir/folder called results, and is a pickle file with the name cv_results_<model-name>_<dataset-name>.pkl. The pickle file contains a Python dict with the following keys (a loading sketch follows the list):

'cv_results' - sklearn.model_selection.GridSearchCV.cv_results_

'best_params' - sklearn.model_selection.GridSearchCV.best_params_

'best_score' - sklearn.model_selection.GridSearchCV.best_score_

'y_test' - holdout test set labels

'y_pred' - predicted labels using holdout test set data and model trained on best cv params

'model_name' - model name

'dataset_name' - dataset name
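For example, a minimal sketch for loading one of these files (the path follows the naming convention above):

    import pickle

    # Load the cross-validation results for the mlp-base / bbc combination.
    with open("results/cv_results_mlp-base_bbc.pkl", "rb") as f:
        results = pickle.load(f)

    print(results["model_name"], results["dataset_name"])
    print("best params:", results["best_params"])
    print("best cv score:", results["best_score"])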

The Jupyter notebook cv_results_analysis.ipynb unpacks the pickle files and displays the holdout test set results for the best model of each type on a given dataset. To run it, set the dataset_key variable in the third cell to one of: 'bbc', 'tweet' or 'moral', for the BBC, ETC and MFTC datasets, respectively. For the BBC and MFTC datasets, the notebook calculates the usual metrics of accuracy, F1, precision and recall, and also Jaccard (as described in the paper). For the ETC dataset, it calculates the F1 score macro-averaged over the FOR and AGAINST labels, as described in https://www.aclweb.org/anthology/E17-2088. In addition, the notebook calculates the average scores for each model across the different sub-datasets, where appropriate (ETC & MFTC).
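A rough sketch of those metrics using scikit-learn is shown below; y_test and y_pred stand for the multilabel indicator arrays stored in the pickle files, and the toy values and averaging modes ('micro' and 'samples') are illustrative assumptions rather than the notebook's exact settings:

    import numpy as np
    from sklearn.metrics import (accuracy_score, f1_score, jaccard_score,
                                 precision_score, recall_score)

    # Toy multilabel indicator arrays standing in for results["y_test"] / results["y_pred"].
    y_test = np.array([[1, 0, 1], [0, 1, 0]])
    y_pred = np.array([[1, 0, 0], [0, 1, 0]])

    print("accuracy :", accuracy_score(y_test, y_pred))                    # exact-match (subset) accuracy
    print("F1       :", f1_score(y_test, y_pred, average="micro"))
    print("precision:", precision_score(y_test, y_pred, average="micro"))
    print("recall   :", recall_score(y_test, y_pred, average="micro"))
    print("Jaccard  :", jaccard_score(y_test, y_pred, average="samples"))  # per-sample Jaccard, averaged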

Experiments

Bootstrap training (section 5.1)

Section 5.1 of the paper looks at the distribution of scores when the MTL models with their best parameters, as chosen by cross-validation, are estimated on bootstrapped samples of the data. The bootstrap is done using the script run_mlp_model_boostrap.py which has the following parameters:

a. --n-samples - the number of samples to bootstrap (default=30)

b. --sample-frac - the fraction of the dataset to be sampled as training data (default=0.7)

c. --dataset-name - the dataset name, e.g. bbc, tweets-DT_HC, etc. (default=moral-dataset-MeToo)
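For example, to bootstrap the BBC dataset with the default settings made explicit:

  python run_mlp_model_boostrap.py --n-samples 30 --sample-frac 0.7 --dataset-name bbc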

The output of the script is saved in a pickle file called bootstrap_results_<dataset-name>.pkl in the results/ dir/folder. The Jupyter notebook bootstrap_analysis.ipynb is used to analyse and display the results of the bootstrapping output. To analyse a particular set of results, set the variable dataset_name to a dataset name, e.g. moral-dataset-MeToo, and run all the cells. The notebook extracts the bootstrap results and displays summary statistics of the bootstrap samples, KDE plots of the results, and the Welch test statistics.

Learning curve (section 5.2)

Section 5.2 of the paper looks at how MTL-XLD and MTL-LP compare as the training set size is increased. This experiment is performed for the BBC and MFTC datasets using the script run_training_size_bootstrap.py, which has the following parameters:

a. --n-samples - the number of samples to bootstrap (default=10)

b. --dataset-name - the dataset name, e.g. bbc, moral-dataset-MeToo, etc. (default=moral-dataset-MeToo)
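For example, to generate learning-curve results for the BBC dataset with the default number of samples made explicit:

  python run_training_size_bootstrap.py --n-samples 10 --dataset-name bbc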

The script output is in the form of a pickle file called training_size_bootstrap_<dataset-name>.pkl in the results/ dir/folder. The Jupyter notebook training_size_bootstrap_analysis.ipynb is used to analyse and display the results of the learning curve experiment. To analyse a particular set of results, set the variable dataset_name to a dataset name, e.g. moral-dataset-MeToo, and run all the cells.
