GitHub - tbose20/D-Ref: Implementation of the paper "Dynamically Refined Regularization for Improving Cross-corpora Hate Speech Detection"

D-Ref

Repository for the paper "Dynamically Refined Regularization for Improving Cross-corpora Hate Speech Detection", Tulika Bose, Nikolaos Aletras, Irina Illina and Dominique Fohr, at Findings of ACL 2022. Paper available at this link

Prerequisites

Install the necessary packages with:

conda env create -f environment.yml
conda activate D-Ref
pip install transformers

Getting the Data

Please note that we are not allowed to distribute the datasets. Please follow these instructions to retrieve the datasets used in the paper:

Dynamic: The dataset used for the experiments in the paper is an older version of the one present here. Since the older version is no longer available, the latest version of the dataset can be downloaded from the link above.

HatEval: Please fill this form to submit a request to the authors.

Waseem: The dataset is available as TweetIDs here as NAACL_SRW_2016.csv. Note that the dataset used for the experiments may not exactly match the dataset obtained after crawling the TweetIDs as tweets keep getting removed over time.

Training and Evaluating the models

Prepare each dataset in the same format as the example dataset tasks/sst/sst_dataset.csv. Name the data file <name of dataset sub-directory>_dataset.csv, e.g. sst/sst_dataset.csv.

Run the script run_tuning.sh to find the optimal values of the hyper-parameters 'alpha' and 'INP_per', which correspond to lambda and k, respectively. Running the script gives the validation set performance for each pair of 'alpha' and 'INP_per'; choose the pair with the best validation performance.

sh run_tuning.sh > tuning.txt
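
Once the validation performance for each pair has been read off tuning.txt, picking the optimal pair is a simple arg-max. The sketch below assumes the scores have already been collected by hand into a dictionary; the function name `best_pair`, the example values, and the idea that a higher score is better are illustrative assumptions, not part of the repository, and the layout of tuning.txt is not parsed here.

```python
def best_pair(val_scores):
    """Return the (alpha, INP_per) pair with the highest validation score.

    val_scores maps (alpha, INP_per) tuples to a validation score where
    higher is better (an assumption; use the metric the script reports).
    """
    if not val_scores:
        raise ValueError("no validation scores given")
    # max over the keys, ranked by their score; ties keep the first pair seen
    return max(val_scores, key=val_scores.get)


# Hypothetical scores copied by hand from tuning.txt.
scores = {
    (0.1, 10): 0.61,
    (0.1, 20): 0.63,
    (1.0, 10): 0.66,
    (1.0, 20): 0.64,
}
alpha, inp_per = best_pair(scores)
print(f"alpha={alpha} INP_per={inp_per}")
```

The selected values can then be passed to train_eval_bc.py through the -alpha and -perc_inp options.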

You can train and save the models with train_eval_bc.py script with the following options:

dataset : {HatEval, Dynamic, Waseem} The train sub-part of this dataset file should be the training set of the source corpus, whereas the validation sub-part should be the validation set of the target corpus (the corpus that is also passed to the 'out_dataset' argument). To obtain in-corpus performance, the test sub-part of this file can be kept as the test set of the source corpus.
out_dataset : {HatEval, Dynamic, Waseem} The train, validation and test sub-parts should all belong to the target corpus.
encoder : bert
data_dir : directory where the datasets are present
model_dir : directory for saved models
vanilla: flag to obtain the baseline results for BERT Van-FT. While running D-Ref, do not use this flag.
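
The layout described for the 'dataset' file (source training set, target validation set, source test set) can be sketched as below. This is only an illustration: the split column name `exp_split`, the split labels `train`/`dev`/`test`, and the helper names are assumptions that are not confirmed by this README, so check the header of tasks/sst/sst_dataset.csv for the actual format before reusing it.

```python
import os

# Assumed names: verify against tasks/sst/sst_dataset.csv before use.
SPLIT_COLUMN = "exp_split"                        # hypothetical split column
TRAIN, VALIDATION, TEST = "train", "dev", "test"  # hypothetical split labels


def dataset_file(data_dir, name):
    """Path implied by the naming convention <name>/<name>_dataset.csv."""
    return os.path.join(data_dir, name, f"{name}_dataset.csv")


def tag_rows(rows, split):
    """Copy each row (a dict) and mark which sub-part it belongs to."""
    return [{**row, SPLIT_COLUMN: split} for row in rows]


def build_source_file_rows(source_train, target_validation, source_test):
    """Rows for the file passed as 'dataset' in a cross-corpus run.

    Training rows come from the source corpus, validation rows from the
    target corpus, and test rows from the source corpus.
    """
    return (
        tag_rows(source_train, TRAIN)
        + tag_rows(target_validation, VALIDATION)
        + tag_rows(source_test, TEST)
    )
```

The file passed as 'out_dataset' needs no such mixing, since all three of its sub-parts come from the target corpus.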

Example script

python -u train_eval_bc.py -dataset HatEval -out_dataset Dynamic -encoder bert -data_dir tasks/ -model_dir models/ -alpha $alpha -perc_inp $INP_per
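
To run the example command for several hyper-parameter pairs, the argument list can be generated programmatically. The following is a minimal sketch that only builds and prints the commands using the options listed above; the helper names and the grid values are hypothetical, and run_tuning.sh may organise its search differently.

```python
import itertools
import shlex


def build_command(dataset, out_dataset, alpha, perc_inp,
                  data_dir="tasks/", model_dir="models/", encoder="bert"):
    """Argument list mirroring the example train_eval_bc.py invocation."""
    return [
        "python", "-u", "train_eval_bc.py",
        "-dataset", dataset,
        "-out_dataset", out_dataset,
        "-encoder", encoder,
        "-data_dir", data_dir,
        "-model_dir", model_dir,
        "-alpha", str(alpha),
        "-perc_inp", str(perc_inp),
    ]


def grid_commands(dataset, out_dataset, alphas, perc_inps):
    """One command per (alpha, perc_inp) pair in the grid."""
    return [
        build_command(dataset, out_dataset, alpha, perc_inp)
        for alpha, perc_inp in itertools.product(alphas, perc_inps)
    ]


# Hypothetical grid values, chosen only for illustration.
for cmd in grid_commands("HatEval", "Dynamic", alphas=[0.1, 1.0], perc_inps=[10, 20]):
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argument list can be passed to subprocess.run to launch the run directly.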

This code is adapted from the repository here

