This repository contains code for the paper "A Joint Matrix Factorization Analysis of Multilingual Representations".
The following instructions will assume that conda is already installed on your system.
- Clone this repository.
- Run the following command to create the conda environment from the provided `environment.yml` file:
```
conda env create -f environment.yml
```
- Activate the environment using the following command:
```
conda activate joint-multilingual-analysis
```
We use the CoNLL 2017 Wikipedia dump for pretraining our models. To obtain the data, follow the instructions below, adapted from Fujinuma et al. (2022):
- Obtain the CoNLL 2017 Wikipedia dump from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989, replacing "English" with the appropriate language name for other languages. Then extract the .tar file:
```
wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1989/English-annotated-conll17.tar
tar -xvf English-annotated-conll17.tar
```
- Preprocess the dump to obtain the raw text by running:
```
./scripts/preprocess.sh
```
- Downsample by running:
```
./scripts/downsample_train_dev.sh
```
- Run the following script to pretrain monolingual models:
```
./scripts/train/pretrain_mono.sh
```
Change the `LANG` variable in the script to pretrain monolingual models for other languages.
- Run the following script to pretrain multilingual models:
```
./scripts/train/pretrain_multi.sh
```
The default setting uses all 33 languages mentioned in the paper. To train the other configurations described in Section 5.2 of the paper (e.g. pretraining on a diverse set of languages), pass `--langs_to_use` with the desired set of languages. Examples are also provided in the commented-out sections of the script.
We use treebanks from UD 2.1, which contain sentences annotated with morphosyntactic information. To obtain the data, follow the instructions below, adapted from Stanczak et al. (2022).
- Download the UD 2.1 treebanks and extract the .tgz file:
```
tar -xvzf ud-treebanks-v2.1.tgz
```
Then put them in `morphosyntactic_features/data/ud/ud-treebanks-v2.1`.
- Clone the modified UD converter to this repo's parent folder (e.g. if this repo's path is `/path/to/parent/folder/joint_multilingual_analysis/`, then the UD converter repo should be cloned under `/path/to/parent/folder/`). You can then convert the treebank annotations to the UniMorph schema using the following script:
```
./morphosyntactic_features/scripts/ud_to_um.sh
```
- Run the following commands to preprocess all the relevant treebanks using the relevant embeddings. This process may take a while.
```
cd morphosyntactic_features/
./scripts/preprocess_xlmr_base.sh
./scripts/preprocess_roberta_base.sh
```
To perform the PARAFAC2 analysis, you will first need to preprocess the embeddings obtained in the previous step. This can be done by running:
```
./scripts/analysis/preprocess.sh
```
This will create separate `*.npy` files in `./data/analysis_data/UD_*`, storing representations for each layer and each morphosyntactic attribute for the different languages. This step will also skip languages that do not have a particular morphosyntactic attribute.
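As a quick sanity check, the resulting arrays can be inspected with NumPy. The snippet below is an illustrative sketch only: the filename and array shape are hypothetical stand-ins, not the repository's actual output names.

```python
import os
import tempfile

import numpy as np

# Illustrative only: the real files live under ./data/analysis_data/UD_* and are
# produced by preprocess.sh; here we mimic one with random data to show how a
# per-layer, per-attribute representation file can be loaded and inspected.
rng = np.random.default_rng(0)
reps = rng.standard_normal((120, 768)).astype(np.float32)  # [num_tokens x embedding_dim]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "layer0-POS.npy")  # hypothetical filename
    np.save(path, reps)
    loaded = np.load(path)
    print(loaded.shape, loaded.dtype)  # (120, 768) float32
```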
Our PARAFAC2 implementation is based on the TensorLy library.
```
python src/analysis/parafac2.py \
    --treebanks_root /data/path/to/ud-treebanks-v2.1 \
    --output_dir /data/path/to/output/folder \
    --langs_to_use $LANGS \
    --filename_exp filename-for-your-experimental(multilingual)-model \
    --filename_ctl filename-for-your-control(monolingual)-model \
    --exp_name experiment-name \
    --do_parafac2 \
    --layer $LAYER \
    --rank $RANK \
    --verbose \
    --n_iter_max 100 \
    --attribute $ATTR
```
You will need to pass the following arguments:
- `--langs_to_use` sets the list of languages on which to perform the PARAFAC2 decomposition. E.g., `ar bg ca zh hr cs da nl en et fi fr gl de el he hi hu id it ja ko pl pt ro ru sk sl es sv tr uk vi` will run the decomposition over all 33 languages supported in the paper.
- `--layer` controls which layer (e.g. 0) to perform the decomposition on.
- `--rank` sets the number of components in the decomposition. In our paper, we set `RANK=768`, which is the dimension of our model embeddings.
- `--verbose` prints details from each iteration; only use this argument if you want that output.
- `--n_iter_max` sets the maximum number of iterations in the decomposition algorithm. If unset, the default is 2000.
- `--attribute` sets which morphosyntactic attribute to perform the decomposition on. The list can be found at `src/analysis/res/properties.lst`.
Alternatively, you can use the provided script to run the PARAFAC2 decomposition:
```
./scripts/analysis/analysis.sh
```
Once the decomposition has finished, the results will be saved as Python pickle files in the folder `./data/analysis_data/decompositions`. The filenames are in the format `{EXP_NAME}-LAYER-{LAYER_NUM}-ATTR-{ATTRIBUTE_NAME}.pickle`. You can then run analyses over these files.
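Given that filename format, result files can also be located and parsed programmatically. This is a small sketch that relies only on the stated `{EXP_NAME}-LAYER-{LAYER_NUM}-ATTR-{ATTRIBUTE_NAME}.pickle` pattern; `all33` and `POS` are example values used elsewhere in this README, and the helper function name is our own.

```python
import re

# Parse the documented result-filename pattern:
#   {EXP_NAME}-LAYER-{LAYER_NUM}-ATTR-{ATTRIBUTE_NAME}.pickle
PATTERN = re.compile(r"^(?P<exp>.+)-LAYER-(?P<layer>\d+)-ATTR-(?P<attr>.+)\.pickle$")

def parse_decomposition_filename(name):
    """Return (exp_name, layer, attribute), or None if the name does not match."""
    m = PATTERN.match(name)
    if m is None:
        return None
    return m.group("exp"), int(m.group("layer")), m.group("attr")

print(parse_decomposition_filename("all33-LAYER-0-ATTR-POS.pickle"))  # ('all33', 0, 'POS')
```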
The example below calculates the average signature value for each language at layer 0 for the `POS` (part-of-speech) attribute:
```python
import os
import pickle

import numpy as np

# load the pickled decomposition file
file_path = os.path.join("./data/analysis_data/decompositions",
                         "{}-LAYER-{}-ATTR-{}.pickle".format("all33", 0, "POS"))
with open(file_path, "rb") as f:
    (weights, factors, projections) = pickle.load(f)

# A contains the signature for each l-th language; it has shape [num_lang x embedding_dim]
A, B, C = factors

# average signature value for each language (averaged over the embedding dimension)
signature_avg = np.average(A, axis=1)
```
To read more about what `A`, `B`, `C`, `weights`, and `projections` represent, see the TensorLy PARAFAC2 documentation.
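For intuition about how these pieces fit together: under the PARAFAC2 model (in the TensorLy convention, which our implementation builds on), each language's representation matrix is approximated as `X_l ≈ P_l B diag(weights * A_l) C^T`, with the shared factors `B` and `C` and a per-language projection `P_l`. The following is a minimal synthetic sketch of that structure with toy shapes; it is not the repository's decomposition code, and the weight handling is our assumption.

```python
import numpy as np

# Toy sizes: 4 languages, rank-3 decomposition, 5-dim embeddings, 10 tokens each
num_lang, rank, emb_dim, n_tokens = 4, 3, 5, 10
rng = np.random.default_rng(0)

weights = np.ones(rank)
A = rng.standard_normal((num_lang, rank))   # per-language signatures (rows of A)
B = rng.standard_normal((rank, rank))       # shared factor
C = rng.standard_normal((emb_dim, rank))    # shared loadings over embedding dims
projections = [np.linalg.qr(rng.standard_normal((n_tokens, rank)))[0]
               for _ in range(num_lang)]    # orthonormal P_l, one per language

# Reconstruct each language slice: X_l = P_l @ B @ diag(weights * A_l) @ C^T
slices = [P @ B @ np.diag(weights * a) @ C.T for P, a in zip(projections, A)]
print(slices[0].shape)  # each slice is [n_tokens x emb_dim], i.e. (10, 5)
```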
```bibtex
@inproceedings{zhao-etal-2023-joint,
    title = "A Joint Matrix Factorization Analysis of Multilingual Representations",
    author = "Zhao, Zheng and
      Ziser, Yftah and
      Webber, Bonnie and
      Cohen, Shay",
    editor = "Bouamor, Houda and
      Pino, Juan and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.851",
    doi = "10.18653/v1/2023.findings-emnlp.851",
    pages = "12764--12783",
}
```