ximenina/BPEProductivity

Languages through the looking glass of BPE compression

(under construction)

  • scripts/ Scripts for processing the text and extracting productivity measures for the subwords obtained at each merge.

  • outputs/ Outputs of the scripts.

  • notebooks/ Notebooks that plot the BPE space, perform the clustering, and run further analyses using the generated data in outputs/.

  • If you're interested in using the language vectors that we derived for 47 languages from the Parallel Bible Corpus, you can go directly to outputs/BPEresults_productivity_corpusPBCtok_0_200_1/corpusPBCtok200_vectors.csv
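For readers unfamiliar with the "merges" mentioned in the scripts/ entry above, the following is a minimal sketch of how BPE learns merges from a word list. It is a generic textbook-style illustration, not code from this repository: the function name `bpe_merges`, its parameters, and the tie-breaking rule are assumptions made for the example, and the productivity measures computed by the actual scripts are not reproduced here.

```python
from collections import Counter


def bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words.

    Hypothetical helper for illustration only (not the repository's API).
    Returns (merges, vocab): the ordered list of merged symbol pairs and
    the final segmentation of each distinct word with its frequency.
    """
    # Start from characters: each distinct word is a tuple of symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is a single symbol; nothing left to merge
        # Most frequent pair; ties go to the lexicographically smallest
        # pair (an assumption chosen here only to make the result deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        # Rewrite every word, replacing each occurrence of the best pair
        # with the new merged symbol (the subword created by this merge).
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
        merges.append(best)
    return merges, vocab


if __name__ == "__main__":
    merges, vocab = bpe_merges(["low", "low", "lower", "lowest"], 3)
    for step, pair in enumerate(merges, start=1):
        print(step, pair, "->", "".join(pair))
```

Each iteration yields one new subword, which is the unit a per-merge measure would be computed over.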

Ximena Gutierrez-Vasques, Christian Bentz, Tanja Samardžić. Languages through the Looking Glass of BPE Compression. Computational Linguistics (2023)
