
Debiasing-Multilingual-Word-Embeddings-A-Case-Study-of-Three-Indian-Languages

This repository contains the code and data for the paper "Debiasing Multilingual Word Embeddings: A Case Study of Three Indian Languages" by Srijan Bansal, Vishal Garimella, Ayush Suhane, and Animesh Mukherjee, published in the Proceedings of ACM HyperText 2021.

Setup Environment

  • Create a conda environment with Python 3.7.11
  • Install the requirements inside the conda environment using: pip install -r requirements.txt

Download data

Run bash download_embedding.sh. The embeddings will be saved in the data/embedding/ folder. Download options:

  • All: all the datasets and embeddings will be generated
  • Essential: fastText embeddings, aligned text files, and MKB datasets will be downloaded. The rest of the embeddings can be generated using the scripts
  • MKB: contains language pairs
  • MKB_pickle: contains MKB dataset
  • aligned_files: contains aligned files
  • Bilingual: bilingual aligned embeddings (can be generated by running generate_cross_lingual_embeddings.py)
  • EQR: debiased embeddings (can be generated by running debias_run.py)
  • LID: debiased embeddings (can be generated by running debias_run.py)
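
After downloading, it can help to confirm which folders are in place before running any script. The following is a minimal sketch, assuming the directory names mentioned in this README (data/MKB, data/aligned_files, data/embedding/Bilingual, data/embedding/LID, data/embedding/EQR) match what download_embedding.sh creates. The helper missing_dirs is hypothetical and not part of the repository.

```python
from pathlib import Path

# Directory names are taken from this README; the exact on-disk layout
# produced by download_embedding.sh is an assumption.
EXPECTED_DIRS = [
    "data/MKB",
    "data/aligned_files",
    "data/embedding/Bilingual",
    "data/embedding/LID",
    "data/embedding/EQR",
]


def missing_dirs(root="."):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Run from the repository root to see what still has to be
    # downloaded or generated.
    for d in missing_dirs():
        print("missing:", d)
```

With the Essential option, Bilingual, LID and EQR are expected to be reported as missing until the generation scripts have been run.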

Scripts

  • debias_run_runner.py: wrapper code for debias_run.py.
    • Run using python debias_run_runner.py
  • debias_run.py: Generates debiased embeddings
    • Run using python debias_run.py lang1 lang2 pre num_lang_pairs
    • ex: python debias_run.py en hi LID 10
  • generate_cross_lingual_embeddings_runner.py: wrapper code for generate_cross_lingual_embeddings.py
    • Run using python generate_cross_lingual_embeddings_runner.py
  • generate_cross_lingual_embeddings.py: Aligns embeddings from language lang1 to lang2
    • Run using python generate_cross_lingual_embeddings.py lang1 lang2
    • ex: python generate_cross_lingual_embeddings.py bn hi
  • MKB/alignment_runner.py: wrapper code for alignment.py
    • Run using python MKB/alignment_runner.py
  • MKB/alignment.py: Evaluates debiased embeddings
    • Run using python alignment.py pre1 pre2 lang1 lang2
    • ex: python alignment.py fin fin bn en
  • MKB/bilingual_dict.py: Generates pkl files
    • It may throw an error because of problems with the googletranslator API. In that case, the pkl files can instead be downloaded using the MKB_pickle option of download_embedding.sh
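
The runner scripts are described as wrappers around the per-pair scripts. As an illustration of what such a wrapper might do, here is a minimal sketch that builds one debias_run.py command per language pair, following the documented argument order lang1 lang2 pre num_lang_pairs. The language list (taken from the examples above), the use of ordered pairs, and the helper debias_commands are assumptions; the actual debias_run_runner.py may work differently.

```python
import itertools
import sys

# Language codes that appear in this README's examples; the real runner
# may iterate over a different set (assumption).
LANGS = ["en", "hi", "bn"]


def debias_commands(pre="LID", num_lang_pairs=10, langs=LANGS):
    """Build `debias_run.py lang1 lang2 pre num_lang_pairs` for each ordered pair."""
    return [
        [sys.executable, "debias_run.py", lang1, lang2, pre, str(num_lang_pairs)]
        for lang1, lang2 in itertools.permutations(langs, 2)
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them, so the sketch is
    # safe to run outside the repository.
    for cmd in debias_commands():
        print("python", " ".join(cmd[1:]))
```

Each command could then be passed to subprocess.run from the repository root once the bilingual embeddings exist.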

How to run

  • Download the datasets and embeddings using download_embedding.sh with the Essential option
  • Run generate_cross_lingual_embeddings_runner.py to generate all bilingual embeddings.
    • Input: MKB dataset and aligned text files from data/MKB and data/aligned_files
    • Output: Aligned embeddings will be saved to data/embedding/Bilingual/
  • Run debias_run_runner.py to generate all debiased embeddings.
    • Input: Bilingual embeddings from data/embedding/Bilingual/
    • Output: Debiased embeddings will be saved to data/embedding/LID/, data/embedding/EQR/, etc.
  • Run MKB/alignment_runner.py to evaluate the debiased embeddings. Results will be saved to MKB/Alignment_runner.csv
    • Input: Debiased embeddings from data/embedding/LID/, data/embedding/EQR/, etc.
    • Output: Scores for the debiased embeddings will be saved to Alignment_Runner.csv
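
The three runner steps above depend on each other's outputs, so they must run in the listed order. The following is a minimal sketch of chaining them, assuming the download step has already been done, that every runner can be started from the repository root, and that each exits with a non-zero status on failure. The helper run_pipeline is hypothetical and not part of the repository.

```python
import subprocess
import sys

# Runner scripts in the order given in the steps above.
PIPELINE = [
    "generate_cross_lingual_embeddings_runner.py",  # writes data/embedding/Bilingual/
    "debias_run_runner.py",                         # writes data/embedding/LID/, EQR/, etc.
    "MKB/alignment_runner.py",                      # writes the evaluation CSV
]


def run_pipeline(dry_run=True):
    """Return the commands in order; execute them only when dry_run is False."""
    commands = [[sys.executable, script] for script in PIPELINE]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError on a non-zero exit,
            # so a failed step stops the later ones from running.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    for cmd in run_pipeline(dry_run=True):
        print("python", cmd[1])
```

Calling run_pipeline(dry_run=False) would execute the steps. If MKB/alignment_runner.py expects to be started from inside the MKB folder, the working directory for that step would need to be adjusted.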
