This repository contains code for the paper "A Joint Matrix Factorization Analysis of Multilingual Representations".
The following instructions will assume that conda is already installed on your system.
- Clone this repository.
- Run the following command to create the conda environment from the provided `environment.yml` file:
```
conda env create -f environment.yml
```
- Activate the environment using the following command:
```
conda activate joint-multilingual-analysis
```
We use the CoNLL 2017 Wikipedia dump for pretraining our models. To obtain the data, follow the instructions below, adapted from Fujinuma et al. (2022):
- Obtain the CoNLL 2017 Wikipedia dump from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989, replacing "English" with the appropriate language name for other languages. Then extract the .tar file:
```
wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1989/English-annotated-conll17.tar
tar -xvf English-annotated-conll17.tar
```
- Preprocess the dump to obtain the raw text by running:
```
./scripts/preprocess.sh
```
- Downsample by running:
```
./scripts/downsample_train_dev.sh
```
- Run the following script to pretrain monolingual models:
```
./scripts/train/pretrain_mono.sh
```
Change the `LANG` variable in the script to pretrain monolingual models for other languages.
- Run the following script to pretrain multilingual models:
```
./scripts/train/pretrain_multi.sh
```
The default setting uses all 33 languages mentioned in the paper. To train the other configurations described in Section 5.2 of the paper (e.g. pretraining on a diverse set of languages), pass `--langs_to_use` with the desired set of languages. Examples are also provided in the commented-out sections of the script.
We use treebanks from UD 2.1, which contain sentences annotated with morphosyntactic information. To obtain the data, follow the instructions below, adapted from Stanczak et al. (2022).
- Download the UD 2.1 treebanks and extract the .tgz file:
```
tar -xvzf ud-treebanks-v2.1.tgz
```
Then put them in `morphosyntactic_features/data/ud/ud-treebanks-v2.1`.
- Clone the modified UD converter to this repo's parent folder (e.g. if this repo's path is `/path/to/parent/folder/joint_multilingual_analysis/`, then the UD converter repo should be cloned under `/path/to/parent/folder/`). You can then convert the treebank annotations to the UniMorph schema using the following script:
```
./morphosyntactic_features/scripts/ud_to_um.sh
```
- Run the following commands to preprocess all the relevant treebanks using the relevant embeddings. This process may take a while.
```
cd morphosyntactic_features/
./scripts/preprocess_xlmr_base.sh
./scripts/preprocess_roberta_base.sh
```
To perform the PARAFAC2 analysis, you will first need to preprocess the embeddings obtained in the previous step. This can be done by running:
```
./scripts/analysis/preprocess.sh
```
This will create separate `*.npy` files in `./data/analysis_data/UD_*`, storing representations for each layer and each morphosyntactic attribute for the different languages. This step will also skip languages that do not have a particular morphosyntactic attribute.
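As a quick sanity check, the resulting arrays can be inspected with NumPy. The snippet below is an illustrative sketch only: the filename and array shape are hypothetical stand-ins, not the repository's actual output names.

```python
import os
import tempfile

import numpy as np

# Illustrative only: the real files live under ./data/analysis_data/UD_* and are
# produced by preprocess.sh; here we mimic one with random data to show how a
# per-layer, per-attribute representation file can be loaded and inspected.
rng = np.random.default_rng(0)
reps = rng.standard_normal((120, 768)).astype(np.float32)  # [num_tokens x embedding_dim]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "layer0-POS.npy")  # hypothetical filename
    np.save(path, reps)
    loaded = np.load(path)
    print(loaded.shape, loaded.dtype)  # (120, 768) float32
```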
Our PARAFAC2 implementation is based on the TensorLy library.
```
python src/analysis/parafac2.py \
    --treebanks_root /data/path/to/ud-treebanks-v2.1 \
    --output_dir /data/path/to/output/folder \
    --langs_to_use $LANGS \
    --filename_exp filename-for-your-experimental(multilingual)-model \
    --filename_ctl filename-for-your-control(monolingual)-model \
    --exp_name experiment-name \
    --do_parafac2 \
    --layer $LAYER \
    --rank $RANK \
    --verbose \
    --n_iter_max 100 \
    --attribute $ATTR
```
You will need to pass the following arguments:
- `--langs_to_use` sets the list of languages on which to perform the PARAFAC2 decomposition. E.g., `ar bg ca zh hr cs da nl en et fi fr gl de el he hi hu id it ja ko pl pt ro ru sk sl es sv tr uk vi` will run the decomposition over all 33 languages supported in the paper.
- `--layer` controls which layer (e.g. 0) to perform the decomposition on.
- `--rank` sets the number of components in the decomposition. In our paper, we set `RANK=768`, which is the dimension of our model embeddings.
- `--verbose` prints details from each iteration; only use this argument if you want that output.
- `--n_iter_max` sets the maximum number of iterations in the decomposition algorithm. If unset, the default is 2000.
- `--attribute` sets which morphosyntactic attribute to perform the decomposition on. The list can be found at `src/analysis/res/properties.lst`.
Alternatively, you can use the provided script to run the PARAFAC2 decomposition:
```
./scripts/analysis/analysis.sh
```
Once the decomposition has finished, the results will be saved as Python pickle files in the folder `./data/analysis_data/decompositions`. The filenames are in the format `{EXP_NAME}-LAYER-{LAYER_NUM}-ATTR-{ATTRIBUTE_NAME}.pickle`. You can then run analyses over these files.
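Given that filename format, result files can also be located and parsed programmatically. This is a small sketch that relies only on the stated `{EXP_NAME}-LAYER-{LAYER_NUM}-ATTR-{ATTRIBUTE_NAME}.pickle` pattern; `all33` and `POS` are example values used elsewhere in this README, and the helper function name is our own.

```python
import re

# Parse the documented result-filename pattern:
#   {EXP_NAME}-LAYER-{LAYER_NUM}-ATTR-{ATTRIBUTE_NAME}.pickle
PATTERN = re.compile(r"^(?P<exp>.+)-LAYER-(?P<layer>\d+)-ATTR-(?P<attr>.+)\.pickle$")

def parse_decomposition_filename(name):
    """Return (exp_name, layer, attribute), or None if the name does not match."""
    m = PATTERN.match(name)
    if m is None:
        return None
    return m.group("exp"), int(m.group("layer")), m.group("attr")

print(parse_decomposition_filename("all33-LAYER-0-ATTR-POS.pickle"))  # ('all33', 0, 'POS')
```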
The example below calculates the average signature value for each language at layer 0 for the `POS` (part-of-speech) attribute:
```python
import os
import pickle

import numpy as np

# load the pickled decomposition file
file_path = os.path.join("./data/analysis_data/decompositions",
                         "{}-LAYER-{}-ATTR-{}.pickle".format("all33", 0, "POS"))
with open(file_path, "rb") as f:
    (weights, factors, projections) = pickle.load(f)

# A contains the signature for each l-th language; it has shape [num_lang x embedding_dim]
A, B, C = factors

# average signature value for each language (averaged over the embedding dimension)
signature_avg = np.average(A, axis=1)
```
To read more about what `A`, `B`, `C`, `weights`, and `projections` represent, see the TensorLy PARAFAC2 documentation.
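For intuition about how these pieces fit together: under the PARAFAC2 model (in the TensorLy convention, which our implementation builds on), each language's representation matrix is approximated as `X_l ≈ P_l B diag(weights * A_l) C^T`, with the shared factors `B` and `C` and a per-language projection `P_l`. The following is a minimal synthetic sketch of that structure with toy shapes; it is not the repository's decomposition code, and the weight handling is our assumption.

```python
import numpy as np

# Toy sizes: 4 languages, rank-3 decomposition, 5-dim embeddings, 10 tokens each
num_lang, rank, emb_dim, n_tokens = 4, 3, 5, 10
rng = np.random.default_rng(0)

weights = np.ones(rank)
A = rng.standard_normal((num_lang, rank))   # per-language signatures (rows of A)
B = rng.standard_normal((rank, rank))       # shared factor
C = rng.standard_normal((emb_dim, rank))    # shared loadings over embedding dims
projections = [np.linalg.qr(rng.standard_normal((n_tokens, rank)))[0]
               for _ in range(num_lang)]    # orthonormal P_l, one per language

# Reconstruct each language slice: X_l = P_l @ B @ diag(weights * A_l) @ C^T
slices = [P @ B @ np.diag(weights * a) @ C.T for P, a in zip(projections, A)]
print(slices[0].shape)  # each slice is [n_tokens x emb_dim], i.e. (10, 5)
```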
```bibtex
@inproceedings{zhao-etal-2023-joint,
    title = "A Joint Matrix Factorization Analysis of Multilingual Representations",
    author = "Zhao, Zheng and
      Ziser, Yftah and
      Webber, Bonnie and
      Cohen, Shay",
    editor = "Bouamor, Houda and
      Pino, Juan and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.851",
    doi = "10.18653/v1/2023.findings-emnlp.851",
    pages = "12764--12783",
}
```