rimusa/embedding_bias

Repo for project on the geometry of Word Embeddings and how it influences bias downstream

This is the code for my dissertation project, "Exploring the Relationship Between Intrinsic and Extrinsic Bias in Spanish Word Embeddings". This document and the project proposal can be found in ./documents.

Prerequisites

To run the experiments from my project, you need:

  • Tweets extracted from the Twitter API. Some mirrors are available on archive.org.
    • These should go into the ./data/archive/[DATE] path.
    • If you use data from a different month, make sure you save it in the corresponding path; you might also have to modify some of the preprocessing scripts.
  • A hate speech classification dataset. My code assumes that you are using the HatEval dataset, but you can find many more at the hatespeechdata list.
    • If you use any other dataset, you might have to modify some of the preprocessing scripts.
    • If the dataset has different tags, you should probably remove any reference to the ./scripts/cnn2 path in any of the scripts you execute.
    • I tested my code with binary classification. It should work with more classes, but in that case I recommend checking Mugdha's version of the code.
  • Some parts of the code require a GPU; specifically, the extrinsic metric.
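As a concrete sketch of the expected layout: the [DATE] folder below uses an example month, not necessarily the one the scripts require, and the HatEval location is my assumption — check the preprocessing scripts for the actual path they read from.

```shell
# Expected tweet location: ./data/archive/[DATE]
# "2019-04" is only an illustrative month.
mkdir -p data/archive/2019-04
# Hypothetical location for the HatEval dataset; adjust to whatever
# path the preprocessing scripts actually expect.
mkdir -p data/hateval
```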

Finally, the environment you are using needs to have these packages:

TODO

Preparing the data

Before running any of the scripts, make sure you have the data in the correct paths. Then use the notebooks in ./scripts/data_cleaning/. These are numbered and have instructions on how to run them.

Running the MSc experiments

To run my experiments, you only need the run_msc.sh script after having prepared the data. Due to some issues that I had with environments at the time, the scripts assume that your current environment is called NAME and has PyTorch and gensim installed, and that you have a second environment called tf-gpu-cuda8 which has TensorFlow installed. This isn't ideal, I know, but that's how I had to do it back then.
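A sketch of how the two environments could be created with conda — the Python and package versions (and channels) here are guesses, not what the original setup used, so adjust them to your hardware:

```shell
# Main environment ("NAME" is whatever you call it) with PyTorch and gensim.
conda create -n NAME python=3.7 pytorch gensim -c pytorch
# Second environment with TensorFlow; the name suggests an older
# CUDA 8 era TensorFlow build, so you may need to pin a version.
conda create -n tf-gpu-cuda8 python=3.6 tensorflow-gpu
```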

Running new experiments

This will be done through the run.sh script. TODO

Citing

If you use this code, cite TODO.

You should also probably cite the data you are using and the people whose code this was based on. Note that many of these were modified for this project; most of these changes are commented in the code:

  • Preprocessing script by Mugdha Pandya.
  • XWEAT by Anne Lauscher and Goran Glavaš.
  • Attract-repel by Nikola Mrkšić. The version used here was updated to work in Python 3 by Rebecca Marchant.
  • The CNN proposed by Yoon Kim. The particular implementation we used was the one by GitHub user Shawn1993.
    • Note that the code in ./scripts/CNN/ is a heavily modified version: it can load pretrained embeddings in word2vec format, uses a special data loader, and is adapted for subtask B of the HatEval classification task. Be careful if using it.
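For reference, the word2vec text format that the modified CNN loads is simple: a header line with the vocabulary size and vector dimension, then one word per line followed by its vector components. A minimal stdlib-only sketch of a parser (the file contents and words below are made up; in practice you would use gensim's KeyedVectors.load_word2vec_format instead):

```python
import io

# Tiny hypothetical embedding file in word2vec text format:
# header "vocab_size dim", then "word v1 v2 ... v_dim" per line.
raw = "2 3\nhola 0.1 0.2 0.3\nodio -0.5 0.0 0.9\n"

def load_word2vec_text(fh):
    """Parse a word2vec-format text stream into a {word: [floats]} dict."""
    n_words, dim = map(int, fh.readline().split())
    vectors = {}
    for line in fh:
        parts = line.rstrip().split(" ")
        word, vec = parts[0], [float(x) for x in parts[1:]]
        assert len(vec) == dim, "vector length must match the header"
        vectors[word] = vec
    assert len(vectors) == n_words, "word count must match the header"
    return vectors

embeddings = load_word2vec_text(io.StringIO(raw))
print(embeddings["hola"])  # [0.1, 0.2, 0.3]
```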
