(under construction)
- scripts/: Set of scripts for processing the text and extracting productivity measures for the subwords obtained at each merge.
- outputs/: It contains the outputs of the scripts.
- notebooks/: These notebooks plot the BPE Space, do the clustering, and do some further analysis (using the generated data in outputs/).

If you're interested in using the language vectors that we derived for 47 languages from the Parallel Bible Corpus, you can go directly to outputs/BPEresults_productivity_corpusPBCtok_0_200_1/corpusPBCtok200_vectors.csv
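
As a minimal sketch of how that file could be read, the snippet below uses only the Python standard library. It assumes the CSV is comma-separated with a header row; the column layout is not documented here, so the code inspects the header rather than relying on specific column names. The helper name `load_vectors` is hypothetical and is not part of the scripts in this repository.

```python
import csv
from pathlib import Path

# Path as given in this README, relative to the repository root.
VECTORS_CSV = Path(
    "outputs/BPEresults_productivity_corpusPBCtok_0_200_1/corpusPBCtok200_vectors.csv"
)


def load_vectors(path):
    """Read a CSV into (header, rows) without assuming column names.

    Assumption: the file is comma-separated and its first row is a header.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        # Skip blank lines, which csv.reader yields as empty lists.
        rows = [row for row in reader if row]
    return header, rows


# Only runs when the file is present, e.g. from the repository root.
if VECTORS_CSV.exists():
    header, rows = load_vectors(VECTORS_CSV)
    print(f"{len(rows)} rows, {len(header)} columns")
    print("first columns:", header[:5])
```

Values are returned as strings; convert the numeric columns once you have checked which ones they are.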
Ximena Gutierrez-Vasques, Christian Bentz, Tanja Samardžić. Languages through the Looking Glass of BPE Compression. Computational Linguistics (2023)
