Code for Bilingual Sparse Embeddings from the NAACL 2016 paper
data
utils
README.md
fasta.m
fasta_biling.m
fasta_biling.sh
fasta_biling_solver.m


Sparse Bilingual Word Representations

Code for Sparse Bilingual Embeddings as described in Sparse Bilingual Word Representations for Cross-lingual Lexical Entailment.

Prerequisites

  • MATLAB

Getting Embeddings

Run 'sh fasta_biling.sh' with the following parameters (in order):

  • En Vocab File : One word per line (|e| lines)
  • Fr Vocab File : One word per line (|f| lines)
  • Dense En embeddings : One vector per line, each vector a space-separated list of floats (|e| lines)
  • Dense Fr embeddings : One vector per line, each vector a space-separated list of floats (|f| lines)
  • Alignment matrix : .mat file containing the crosslingual statistics matrix S (of size |e| x |f|)
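
Before running the script, it can help to confirm that the text inputs agree with each other. The following is a minimal Python sketch, not part of this repository; the function names are hypothetical. It checks only a vocab file against its dense embedding file, assumes blank lines carry no meaning, and does not inspect the .mat file.

```python
def read_vocab(path):
    """Return the list of words, one per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def read_vectors(path):
    """Return a list of vectors, one per non-empty line of space-separated floats."""
    with open(path, encoding="utf-8") as f:
        return [[float(x) for x in line.split()] for line in f if line.strip()]


def check_inputs(vocab_path, vectors_path):
    """Check that a vocab file and a dense embedding file are consistent.

    Returns (number of words, embedding dimension); raises ValueError otherwise.
    """
    vocab = read_vocab(vocab_path)
    vectors = read_vectors(vectors_path)
    if len(vocab) != len(vectors):
        raise ValueError(f"{len(vocab)} words but {len(vectors)} vectors")
    dims = {len(v) for v in vectors}
    if len(dims) > 1:
        raise ValueError(f"inconsistent vector lengths: {sorted(dims)}")
    dim = dims.pop() if dims else 0
    return len(vocab), dim
```

Calling `check_inputs` once per language gives the word counts |e| and |f|, which should also match the number of rows and columns of S respectively.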

Example files are available here.

The output of the above script will be two vector files, one for each language. These new vectors will be sparse and interpretable, with the dimensions aligned across languages!
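
Because the dimensions are aligned, an English vector and a French vector can be compared directly. Below is a minimal Python sketch of that idea using invented toy vectors; the helper names are hypothetical, and cosine similarity is only an illustrative choice, not necessarily the measure used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sparsity(v):
    """Fraction of entries that are exactly zero."""
    return sum(1 for x in v if x == 0.0) / len(v)


# Toy vectors, invented for illustration only.
en_dog = [0.0, 0.8, 0.0, 0.3]
fr_chien = [0.0, 0.7, 0.0, 0.4]
fr_table = [0.9, 0.0, 0.1, 0.0]

print(cosine(en_dog, fr_chien))  # high: the same dimensions are active
print(cosine(en_dog, fr_table))  # 0.0: no shared active dimensions
print(sparsity(en_dog))          # 0.5
```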

NB: There are other hyperparameters in the script that you should consider adjusting.

Data

The data folder contains

  • final_dataset.tsv - The French-English crosslingual lexical entailment dataset
  • bisparse_{en,fr}.txt - The French-English bilingual sparse vectors used to obtain the results in the paper

Utils

This folder contains other useful code:

  • top_dims.py - Interpret the dimensions given a (sparse) vector file
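
One common way to interpret a dimension of a sparse vector file is to list the words with the largest values on it. The sketch below illustrates that idea in Python on in-memory toy data. It is not the code in top_dims.py, whose options and input format are not described here, and `top_words` is a hypothetical name.

```python
def top_words(vocab, vectors, dim, k=3):
    """Return up to k (word, value) pairs with the largest nonzero values on dimension `dim`."""
    scored = [(word, vec[dim]) for word, vec in zip(vocab, vectors) if vec[dim] != 0.0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data, invented for illustration only.
vocab = ["dog", "cat", "table", "chair"]
vectors = [
    [0.8, 0.0],
    [0.6, 0.0],
    [0.0, 0.9],
    [0.0, 0.7],
]

for dim in range(2):
    print(dim, [word for word, _ in top_words(vocab, vectors, dim)])
# 0 ['dog', 'cat']
# 1 ['table', 'chair']
```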

If you use this code or the associated dataset, please cite the paper!

@InProceedings{VyasCarpuat2015,
    Title = {Sparse Bilingual Word Representations for Cross-lingual Lexical Entailment},
    Booktitle = {Proceedings of NAACL},
    Author = {Vyas, Yogarshi and Carpuat, Marine},
    Year = {2016},
    Location = {San Diego, United States of America}
}