Word embedding models have become a fundamental component in a wide range of Natural Language Processing (NLP) applications. However, embeddings trained on human-generated corpora have been demonstrated to inherit strong gender stereotypes that reflect social constructs. To address this concern, in this paper, we propose a novel training procedure for learning gender-neutral word embeddings. Our approach aims to preserve gender information in certain dimensions of word vectors while compelling other dimensions to be free of gender influence. Based on the proposed method, we generate a Gender-Neutral variant of GloVe (GN-GloVe). Quantitative and qualitative experiments demonstrate that GN-GloVe successfully isolates gender information without sacrificing the functionality of the embedding model.
The coreference model used in this paper is based on the 2017 version of end2end.
Please note that the embeddings we trained were not lowercased.
Please change the data path in debias.sh (line 18) to your own.
We also trained GN-GloVe on the 1-billion training data. It can be downloaded here. The vocabulary of this embedding contains 142527 tokens.
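As a quick sanity check after downloading, you may want to load the embedding and count its tokens. The snippet below is only a minimal sketch: it assumes the file uses the usual GloVe text layout (one token per line, followed by space-separated floats), which this README does not state explicitly, and `load_embeddings` is a hypothetical helper, not part of this repository. It runs on a tiny in-memory sample instead of the real file.

```python
import io


def load_embeddings(handle):
    """Parse GloVe-style text (assumed layout): token, then its vector values."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # Keep the token exactly as written; no lowercasing is applied.
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory stand-in for the downloaded file (made-up values).
sample = io.StringIO("He 0.1 0.2 0.3\nhe 0.4 0.5 0.6\nnurse 0.7 0.8 0.9\n")
vectors = load_embeddings(sample)
print(len(vectors))  # number of distinct tokens in the sample
```

To use it on the real file, pass an open file handle, e.g. `open(path, encoding="utf-8")`, with `path` pointing to wherever you saved the download. Because the embeddings were not lowercased, look words up with their original casing ("He" and "he" are separate entries in the sample above).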
Errata: in Tables 1 and 3, it should read "Hard-GloVe", and on page 5, it should read "OntoNotes".
The seed words we use in our paper are under wordlist.
The SemBias dataset can be found under SemBias. The last 40 instances are the "subset" we used in our paper.
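If you want to evaluate on the "subset" separately, the split can be sketched as below. This is an illustration under assumptions: it supposes the dataset is stored with one instance per line, and `split_sembias` is a hypothetical helper that is not part of this repository. The demo uses placeholder strings instead of real SemBias instances.

```python
def split_sembias(lines, subset_size=40):
    """Return (remaining instances, last `subset_size` instances)."""
    if subset_size <= 0:
        raise ValueError("subset_size must be positive")
    # Drop empty lines so trailing newlines do not shift the split.
    instances = [ln.strip() for ln in lines if ln.strip()]
    return instances[:-subset_size], instances[-subset_size:]


# Placeholder data standing in for the real file contents.
demo_lines = [f"instance-{i}\n" for i in range(100)]
rest, subset = split_sembias(demo_lines)
print(len(rest), len(subset))
```

With the real data, pass the lines of the SemBias file instead of `demo_lines`; the second returned list then holds the last 40 instances.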
You can run the code using "debias.sh" (please change the corresponding parameters first).
See the LICENSE file.