This repository contains the code and data for the paper "Debiasing Multilingual Word Embeddings: A Case Study of Three Indian Languages" by Srijan Bansal, Vishal Garimella, Ayush Suhane, and Animesh Mukherjee. Proceedings of ACM HyperText 2021.
- Create a conda environment with Python 3.7.11.
- Install the requirements inside the conda environment using:
pip install -r requirements.txt
- Run bash download_embedding.sh. The embeddings will be saved in the data/embedding/ folder.
Download options:
- All: All the datasets and embeddings will be downloaded.
- Essential: The fastText embeddings, aligned text files, and MKB datasets will be downloaded. The rest of the embeddings can be generated using the scripts.
- MKB: contains language pairs
- MKB_pickle: contains MKB dataset
- aligned_files: contains aligned files
- Bilingual: bilingual aligned embeddings (can be generated by generate_cross_lingual_embeddings.py)
- EQR: debiased embeddings (can be generated by running debias_run.py)
- LID: debiased embeddings (can be generated by running debias_run.py)
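To see which of these folders are already present after a download, a small check like the following may help. It is only a sketch: the paths are the ones mentioned in this README, the location of MKB_pickle is not stated here and is therefore left out, and the names `EXPECTED_DIRS` and `missing_data_dirs` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path

# Folder locations as mentioned in this README. The location of MKB_pickle
# is not stated, so it is omitted rather than guessed.
EXPECTED_DIRS = [
    "data/MKB",
    "data/aligned_files",
    "data/embedding/Bilingual",
    "data/embedding/EQR",
    "data/embedding/LID",
]


def missing_data_dirs(repo_root="."):
    """Return the expected folders that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_data_dirs()
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("All expected folders are present.")
```

With the Essential option, the Bilingual, EQR, and LID folders would be expected to show up as missing until the generation scripts have been run.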
- debias_run_runner.py: wrapper code for debias_run.py.
- Run using
python debias_run_runner.py
- debias_run.py: Generates debiased embeddings
- Run using
python debias_run.py lang1 lang2 pre num_lang_pairs
- ex:
python debias_run.py en hi LID 10
- generate_cross_lingual_embeddings_runner.py: wrapper code for generate_cross_lingual_embeddings.py
- Run using
python generate_cross_lingual_embeddings_runner.py
- generate_cross_lingual_embeddings.py: Aligns embeddings from language lang1 to language lang2
- Run using
python generate_cross_lingual_embeddings.py lang1 lang2
- ex:
python generate_cross_lingual_embeddings.py bn hi
- MKB/alignment_runner.py: wrapper code for alignment.py
- Run using
python MKB/alignment_runner.py
- MKB/alignment.py: Evaluates debiased embeddings
- Run using
python alignment.py pre1 pre2 lang1 lang2
- ex:
python alignment.py fin fin bn en
- MKB/bilingual_dict.py:
- Generates the pkl files. It may throw an error because of problems with the googletranslator API. In that case, the pkl files can be downloaded using MKB_pickle from download_embedding.sh
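For readers who want to run debias_run.py over several configurations without the bundled wrapper, the following is a minimal sketch of such a loop. It only reuses the argument order shown above (lang1 lang2 pre num_lang_pairs). The function names, the dry-run switch, and the use of EQR as a value for pre are assumptions made for illustration; the actual debias_run_runner.py may iterate differently.

```python
import subprocess


def debias_command(lang1, lang2, pre, num_lang_pairs):
    """Build the argument list for one debias_run.py call (README argument order)."""
    return ["python", "debias_run.py", lang1, lang2, pre, str(num_lang_pairs)]


def run_all(pairs, prefixes, num_lang_pairs, dry_run=True):
    """Build one command per (language pair, prefix) and optionally execute it."""
    commands = [
        debias_command(lang1, lang2, pre, num_lang_pairs)
        for (lang1, lang2) in pairs
        for pre in prefixes
    ]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))  # only show what would be run
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing run
    return commands


if __name__ == "__main__":
    # Hypothetical selection: only "en hi LID 10" appears verbatim in this README.
    run_all(pairs=[("en", "hi")], prefixes=["LID", "EQR"], num_lang_pairs=10)
```

Keeping `dry_run=True` by default prints the commands first, which makes it easy to compare them with the examples above before launching long jobs.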
- Download the datasets and embeddings using download_embedding.sh with the Essential option.
- Run generate_cross_lingual_embeddings_runner.py to generate all bilingual embeddings.
  - Input: MKB dataset and aligned text files from data/MKB and data/aligned_files
  - Output: Aligned embeddings will be saved to data/embedding/Bilingual/
- Run debias_run_runner.py to generate all debiased embeddings.
  - Input: Bilingual embeddings from data/embedding/Bilingual/
  - Output: Debiased embeddings will be saved to data/embedding/LID/, data/embedding/EQR/ etc
- Run MKB/alignment_runner.py to evaluate the debiased embeddings. Results will be saved to MKB/Alignment_runner.csv
  - Input: Debiased embeddings from data/embedding/LID/, data/embedding/EQR/ etc
  - Output: Debiased embedding scores will be saved to Alignment_Runner.csv
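The three generation and evaluation steps above can be chained in one script. The sketch below runs the runner scripts in the stated order and checks that each step's input folders exist before starting it. It assumes the scripts are launched from the repository root with no arguments. Whether MKB/alignment_runner.py must instead be run from inside MKB/ is not stated here, and `STEPS`, `missing_inputs`, and `run_pipeline` are hypothetical names.

```python
import subprocess
from pathlib import Path

# (script, input folders it reads), in the order of the steps above.
STEPS = [
    ("generate_cross_lingual_embeddings_runner.py",
     ["data/MKB", "data/aligned_files"]),
    ("debias_run_runner.py", ["data/embedding/Bilingual"]),
    ("MKB/alignment_runner.py", ["data/embedding/LID", "data/embedding/EQR"]),
]


def missing_inputs(inputs, repo_root="."):
    """Return the input folders that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in inputs if not (root / d).is_dir()]


def run_pipeline(repo_root="."):
    """Run each step in order; fail early if a step's inputs are missing."""
    for script, inputs in STEPS:
        # Inputs are checked right before the step, because earlier steps
        # are the ones that create the folders later steps read from.
        missing = missing_inputs(inputs, repo_root)
        if missing:
            raise FileNotFoundError(
                f"{script}: missing input folders {missing}")
        # Assumption: scripts are run from the repository root.
        subprocess.run(["python", script], cwd=repo_root, check=True)
```

Calling `run_pipeline()` from the repository root after the Essential download would then attempt all three steps in sequence and stop at the first one that fails.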