Limit_computer

A script for extracting target words from cross-lingual word embeddings, either by setting a lower limit on the cosine similarity or by selecting the k nearest neighbours. The script partially reuses code from MUSE (Conneau et al., 2017). The repository also includes a script for annotating the data manually. In demo.ipynb, you can visualize the graphs using a sample of a manually annotated dataframe. More information is available in the article Parallel, or Comparable? That Is the Question.

Requirements

Evaluating the cross-lingual embedding model

To evaluate aligned embeddings using a cosine similarity limit, add the --limit flag and set --treshold to the lowest acceptable cosine similarity. Without --limit, the script uses KNN search, and --treshold sets the number K of nearest neighbours to return.

With limit:

python eval.py --src_lng SRC_LNG --tgt_lng TGT_LNG --src_path SRC_PATH --tgt_path TGT_PATH --eval_df EVAL_DF --limit --treshold TRESHOLD --nmax NMAX --output OUTPUT

Example:

python eval.py --src_lng et --tgt_lng sk --src_path vectors-et.txt --tgt_path vectors-sk.txt --eval_df et-sk.csv --limit --treshold 0.6 --nmax 50000 --output df.csv

K nearest neighbours:

python eval.py --src_lng SRC_LNG --tgt_lng TGT_LNG --src_path SRC_PATH --tgt_path TGT_PATH --eval_df EVAL_DF --treshold TRESHOLD --nmax NMAX --output OUTPUT

Example:

python eval.py --src_lng et --tgt_lng sk --src_path vectors-et.txt --tgt_path vectors-sk.txt --eval_df et-sk.csv --treshold 3 --nmax 50000 --output df.csv
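The difference between the two selection modes can be illustrated with a minimal NumPy sketch. This is not the code from eval.py: the function name select_targets, its parameters, and the toy vectors are hypothetical and only mirror the behaviour described above, assuming the embeddings are already aligned in a shared space.

```python
import numpy as np


def normalize(matrix):
    """Scale each row to unit length so that dot products equal cosine similarities."""
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)


def select_targets(src_vecs, tgt_vecs, tgt_words, threshold, use_limit):
    """Hypothetical helper: pick target words for every source vector.

    use_limit=True  -> keep every target whose cosine similarity >= threshold
    use_limit=False -> keep the int(threshold) nearest neighbours
    """
    sims = normalize(src_vecs) @ normalize(tgt_vecs).T
    results = []
    for row in sims:
        order = np.argsort(-row)  # most similar target first
        if use_limit:
            picked = [i for i in order if row[i] >= threshold]
        else:
            picked = order[: int(threshold)]
        results.append([(tgt_words[i], float(row[i])) for i in picked])
    return results


# Toy data (made up for illustration): one source vector, three target vectors.
src = np.array([[1.0, 0.0]])
tgt = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
words = ["a", "b", "c"]

with_limit = select_targets(src, tgt, words, threshold=0.6, use_limit=True)
with_knn = select_targets(src, tgt, words, threshold=1, use_limit=False)
print(with_limit[0])
print(with_knn[0])
```

With a limit, the number of returned target words varies per source word and can be zero. With KNN, it is always fixed at K, regardless of how similar the neighbours actually are.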

Annotating the data

To manually annotate the data, simply run:

python annotate_data.py --src_lng SRC_LNG --tgt_lng TGT_LNG --df_path DF_PATH --limit LIMIT --output OUTPUT

Example:

python annotate_data.py --src_lng et --tgt_lng sk --df_path et-sk.csv --limit 0.6 --output annotated_df.csv

References

  • Please cite [1] if you find the resources in this repository useful.

[1] Denisová, M. (2022). Parallel, or Comparable? That Is the Question: The Comparison of Parallel and Comparable Data-based Methods for Bilingual Lexicon Induction. In Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022, pp. 3–13. Tribun EU.

@inproceedings{denisova2022,
   author = {Denisová, Michaela},
   title = {Parallel, or Comparable? That Is the Question: The Comparison of Parallel and Comparable Data-based Methods for Bilingual Lexicon Induction},
   booktitle = {Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022},
   pages = {3--13},
   publisher = {Tribun EU},
   url = {https://nlp.fi.muni.cz/raslan/2022/paper8.pdf},
   year = {2022}
}

Related work
