Code for Sparse Bilingual Embeddings as described in Sparse Bilingual Word Representations for Cross-lingual Lexical Entailment.
Requirements:
- MATLAB
Run 'sh fasta_biling.sh' with the following parameters (in order):
- En Vocab File : One word per line (|e| lines)
- Fr Vocab File : One word per line (|f| lines)
- Dense En embeddings : One vector per line, each vector a space-separated list of floats (|e| lines)
- Dense Fr embeddings : One vector per line, each vector a space-separated list of floats (|f| lines)
- Alignment matrix : .mat file containing the crosslingual statistics matrix S (of size |e| x |f|)
Example files are available here.
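Before launching the script, it can help to confirm that the text inputs listed above agree with each other. The Python sketch below checks that a vocab file and its embedding file have the same number of lines and that every vector has the same dimension. The function names `read_lines` and `check_language_inputs` and the file names in the demonstration are hypothetical and are not part of this repository.

```python
import os
import tempfile


def read_lines(path):
    """Return the non-empty lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


def check_language_inputs(vocab_path, vectors_path):
    """Check one vocab/embedding pair and return (num_words, dimension).

    Assumes the formats described above: one word per line in the vocab
    file and one space-separated vector of floats per line in the
    embedding file.
    """
    words = read_lines(vocab_path)
    rows = [[float(x) for x in line.split()] for line in read_lines(vectors_path)]
    if len(words) != len(rows):
        raise ValueError(
            "vocab has %d lines but embeddings have %d" % (len(words), len(rows))
        )
    dims = {len(row) for row in rows}
    if len(dims) != 1:
        raise ValueError("vectors do not all share one dimension: %s" % sorted(dims))
    return len(words), dims.pop()


if __name__ == "__main__":
    # Tiny demonstration with made-up data (not the repository's example files).
    with tempfile.TemporaryDirectory() as tmp:
        vocab = os.path.join(tmp, "en_vocab.txt")
        vecs = os.path.join(tmp, "en_dense.txt")
        with open(vocab, "w", encoding="utf-8") as f:
            f.write("cat\ndog\n")
        with open(vecs, "w", encoding="utf-8") as f:
            f.write("0.1 0.2 0.3\n0.4 0.5 0.6\n")
        print(check_language_inputs(vocab, vecs))
```

The word counts returned for the English and French pairs should equal the row and column counts of S (|e| x |f|). Checking the .mat file itself is left out here because the name of the variable stored in it is not stated in this README.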
The output of the above script will be two vector files, one for each language. These new vectors will be sparse and interpretable, with the dimensions aligned across languages!
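One quick sanity check on the output is to measure how sparse the new vectors actually are. The sketch below is only an illustration: it assumes the output files use the same layout as the dense inputs (one space-separated vector of floats per line), which this README does not state explicitly, and the names `load_vectors` and `zero_fraction` are hypothetical.

```python
def load_vectors(path):
    """Read one space-separated vector of floats per non-empty line.

    Assumption: the sparse output files follow the same layout as the
    dense input files.
    """
    with open(path, encoding="utf-8") as handle:
        return [[float(x) for x in line.split()] for line in handle if line.strip()]


def zero_fraction(rows):
    """Return the fraction of entries that are exactly zero."""
    total = sum(len(row) for row in rows)
    if total == 0:
        raise ValueError("no vector entries found")
    zeros = sum(1 for row in rows for value in row if value == 0.0)
    return zeros / total


if __name__ == "__main__":
    # Made-up vectors standing in for the contents of an output file.
    toy = [[0.0, 0.0, 1.5], [0.0, 2.0, 0.0]]
    print(zero_fraction(toy))
```

A value close to 1.0 means most entries are zero. If the fraction stays low, the sparsity-related hyperparameters in the script are a natural place to look.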
NB: There are other hyperparameters in the script that you should consider adjusting.
The data folder contains
- final_dataset.tsv - The French-English crosslingual lexical entailment dataset
- bisparse_{en,fr}.txt - The French-English bilingual sparse vectors used to obtain the results in the paper
This folder contains some other useful code:
- top_dims.py - Interpret the dimensions given a (sparse) vector file
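To illustrate what interpreting a dimension can look like, the sketch below lists the words with the largest values on a chosen dimension. It is not the code in top_dims.py, whose interface is not described here. The function name `top_words` and the toy data are hypothetical, and the vocabulary and vectors are assumed to be already loaded as parallel lists.

```python
import heapq


def top_words(words, vectors, dim, k=5):
    """Return up to k (word, value) pairs with the largest value on `dim`.

    Entries that are zero or negative are skipped, on the assumption that
    only positive weights indicate that a word uses the dimension.
    """
    scored = (
        (row[dim], word) for word, row in zip(words, vectors) if row[dim] > 0
    )
    return [(word, value) for value, word in heapq.nlargest(k, scored)]


if __name__ == "__main__":
    # Made-up vocabulary and sparse vectors, for illustration only.
    words = ["cat", "dog", "car"]
    vectors = [[0.9, 0.0], [0.5, 0.1], [0.0, 0.8]]
    print(top_words(words, vectors, 0, k=2))  # [('cat', 0.9), ('dog', 0.5)]
```

Because the dimensions are aligned across languages, running the same lookup with the same `dim` on the English and the French vectors should bring up related words in both languages.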
If you use this code or the associated dataset, please cite the paper!
@InProceedings{VyasCarpuat2015,
Title = {Sparse Bilingual Word Representations for Cross-lingual Lexical Entailment},
Booktitle = {Proceedings of NAACL},
Author = {Vyas, Yogarshi and Carpuat, Marine},
Year = {2016},
Location = {San Diego, United States of America}
}