Toolkit for mean-field theoretic manifold analysis (MFTMA) of linguistic manifolds, implemented in Python. It was used to produce the results of the paper Emergence of Separable Manifolds in Deep Language Representations, ICML 2020.
First install required dependencies with
pip install -r requirements.txt
Then install the package via
pip install -e .
The following contains usage instructions for preparing the data, extracting the features, running the MFTMA analysis and plotting the results.
Prepare the data for MFTMA analysis given a labeled dataset (dataset_file), the list of relevant tags (tag_file) and a randomization seed (seed).
The dataset should be formatted such that each line contains the space-separated words of the
sentence, a tab character ('\t') and the corresponding space-separated tags, one tag per word.
Here is an example line from the Penn Treebank:
So how many , um , credit cards do you have ? \t UH WRB JJ , UH , NN NNS VBP PRP VB .
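The line format described above can be read with a few lines of Python. This is a minimal sketch, not the code used by prepare_data.py; the helper name parse_line is hypothetical.

```python
def parse_line(line):
    """Split one dataset line into parallel lists of words and tags."""
    # The sentence and its tags are separated by a single tab character.
    sentence, tag_field = line.rstrip("\n").split("\t")
    words = sentence.split()
    tags = tag_field.split()
    # Every word must have exactly one tag.
    if len(words) != len(tags):
        raise ValueError(f"{len(words)} words but {len(tags)} tags")
    return words, tags


line = "So how many , um , credit cards do you have ?\tUH WRB JJ , UH , NN NNS VBP PRP VB ."
words, tags = parse_line(line)
print(list(zip(words, tags))[:3])
```

Checking that the two lists have equal length catches lines where the tab separator is missing or the tagging is incomplete.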
The script samples the dataset such that each manifold contains between min_manifold_size and max_manifold_size samples. It produces a sample file that contains the line index, word index and tag of each randomly sampled word for MFTMA analysis.
usage: prepare_data.py [-h] [--dataset_file DATASET_FILE]
[--tag_file TAG_FILE] [--sample SAMPLE]
[--max_manifold_size MAX_MANIFOLD_SIZE]
[--min_manifold_size MIN_MANIFOLD_SIZE] [--seed SEED]
optional arguments:
-h, --help show this help message and exit
--dataset_file DATASET_FILE
Input file with the relevant dataset. Each line
contains the space-separated words of the sentence,
the tabular character (\t) and the space-separated
respective tags.
--tag_file TAG_FILE Input file with the list of tags to use for the MFTMA
analysis.
--sample SAMPLE Output file containing the line index, word index and
tag of the randomly sampled dataset.
--max_manifold_size MAX_MANIFOLD_SIZE
The maximal number of words per manifold.
--min_manifold_size MIN_MANIFOLD_SIZE
The minimal number of words per manifold.
--seed SEED Randomization seed.
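The sampling step can be sketched as follows. This illustrates the idea only and is not the implementation in prepare_data.py: the function name sample_manifolds is hypothetical, and dropping tags with fewer than min_size occurrences is an assumed reading of "each manifold contains between min_manifold_size and max_manifold_size samples".

```python
import random
from collections import defaultdict


def sample_manifolds(tagged_sentences, tags, min_size, max_size, seed):
    """Return {tag: [(line_idx, word_idx), ...]} with min_size..max_size entries per tag.

    tagged_sentences is a list of (words, tags) pairs; tags is a list of tag names.
    """
    rng = random.Random(seed)  # local generator, so the seed fully determines the sample
    wanted = set(tags)
    positions = defaultdict(list)
    for line_idx, (_, sentence_tags) in enumerate(tagged_sentences):
        for word_idx, tag in enumerate(sentence_tags):
            if tag in wanted:
                positions[tag].append((line_idx, word_idx))

    sample = {}
    for tag in tags:  # iterate in the given order to keep the result reproducible
        candidates = positions[tag]
        if len(candidates) < min_size:
            continue  # assumption: tags with too few occurrences are skipped
        if len(candidates) > max_size:
            candidates = rng.sample(candidates, max_size)
        sample[tag] = sorted(candidates)
    return sample


data = [
    (["do", "you", "have", "cards"], ["VBP", "PRP", "VB", "NNS"]),
    (["you", "have", "cards"], ["PRP", "VBP", "NNS"]),
    (["cards", "help"], ["NNS", "VBP"]),
]
sample = sample_manifolds(data, ["NNS", "PRP", "VB"], min_size=2, max_size=2, seed=0)
print(sample)
```

In this toy input, NNS occurs three times and is subsampled to two positions, PRP occurs twice and is kept whole, and VB occurs once and is left out.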
Extract linguistic features from a Transformer model given the dataset_file, the tag_file and the name of the pretrained transformer model, pretrained_model_name. Use mask to mask tokens and random_init to randomly initialize the pretrained model. We use the HuggingFace Transformers library.
Extracted features are stored under feature_dir.
usage: feature_extract.py [-h] [--dataset_file DATASET_FILE]
[--tag_file TAG_FILE] [--sample SAMPLE]
[--feature_dir FEATURE_DIR]
[--pretrained_model_name {bert-base-cased,openai-gpt,distilbert-base-uncased,roberta-base,albert-base-v1}]
[--mask] [--random_init]
optional arguments:
-h, --help show this help message and exit
--dataset_file DATASET_FILE
Input file with the relevant dataset. Each line
contains the space-separated words of the sentence,
the tabular character (\t) and the space-separated
respective tags.
--tag_file TAG_FILE Input file with the list of tags to use for the MFTMA
analysis.
--sample SAMPLE Input file containing the line index, word index and
tag of the randomly sampled dataset (output from
prepare_data.py).
--feature_dir FEATURE_DIR
Output feature data directory.
--pretrained_model_name {bert-base-cased,openai-gpt,distilbert-base-uncased,roberta-base,albert-base-v1}
Pretrained model name.
--mask Boolean indicating whether to mask relevant word.
--random_init Boolean indicating whether to randomly initialize the
model.
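To illustrate what masking the relevant word means, here is a minimal sketch that replaces the sampled word with a mask token before the sentence is tokenized. It is an assumption about the general approach, not the code in feature_extract.py, and the helper name mask_word is hypothetical. The mask string depends on the model: BERT-style tokenizers use "[MASK]" and RoBERTa uses "<mask>" (HuggingFace tokenizers expose it as tokenizer.mask_token).

```python
def mask_word(words, word_idx, mask_token="[MASK]"):
    """Return a copy of the sentence with the word at word_idx replaced by mask_token."""
    if not 0 <= word_idx < len(words):
        raise IndexError(f"word_idx {word_idx} out of range for {len(words)} words")
    masked = list(words)  # copy, so the original sentence is left untouched
    masked[word_idx] = mask_token
    return masked


sentence = ["do", "you", "have", "cards"]
masked_sentence = mask_word(sentence, 3)
print(" ".join(masked_sentence))
```

Here word_idx plays the role of the word index stored in the sample file.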
Run MFTMA analysis of the extracted features over layers, given the number of layers num_layers, the margin size to use in the analysis kappa and the number of Gaussian vectors to sample per manifold n_t.
We use a Python implementation of the MFTMA method, available at Replica Mean Field Theory Analysis of Object Manifolds. The relevant code has been copied under mftma.
MFTMA analysis output is stored under the mftma_analysis_dir folder.
usage: mftma_analysis.py [-h] [--feature_dir FEATURE_DIR]
[--mftma_analysis_dir MFTMA_ANALYSIS_DIR]
[--num_layers NUM_LAYERS] [--kappa KAPPA] [--n_t N_T]
MFTMA analysis over layers.
optional arguments:
-h, --help show this help message and exit
--feature_dir FEATURE_DIR
Input feature data directory.
--mftma_analysis_dir MFTMA_ANALYSIS_DIR
Location to output MFTMA analysis directory.
--num_layers NUM_LAYERS
Number of hidden layers.
--kappa KAPPA Margin size to use in the analysis (kappa > 0).
--n_t N_T Number of gaussian vectors to sample per manifold.
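The analysis yields one capacity, radius and dimension value per manifold, which are then summarized per layer. The sketch below shows one plausible way to do this, assuming the convention from the example usage of the Replica Mean Field Theory code (harmonic mean for capacity, arithmetic mean for radius and dimension). Whether mftma_analysis.py aggregates in exactly this way is an assumption, and the function name summarize_layer is hypothetical.

```python
from statistics import mean


def summarize_layer(capacities, radii, dimensions):
    """Collapse per-manifold MFTMA values into one summary per layer."""
    return {
        # Assumed convention: capacity is averaged harmonically, 1 / mean(1 / a_i).
        "capacity": len(capacities) / sum(1.0 / a for a in capacities),
        "radius": mean(radii),
        "dimension": mean(dimensions),
    }


# Toy per-manifold values for a single layer (made up for illustration).
summary = summarize_layer(
    capacities=[0.05, 0.1],
    radii=[1.0, 2.0],
    dimensions=[10.0, 30.0],
)
print(summary)
```

Repeating this for each of the num_layers layers gives the per-layer curves that the plotting step displays.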
Plot the MFTMA analysis results (classification capacity, manifold radius, manifold dimension and center correlation) over layers, given the location of the folder containing the MFTMA analysis output, mftma_analysis_dir.
usage: generate_plot.py [-h] [--num_layers NUM_LAYERS]
[--mftma_analysis_dir MFTMA_ANALYSIS_DIR]
optional arguments:
-h, --help show this help message and exit
--num_layers NUM_LAYERS
Number of hidden layers.
--mftma_analysis_dir MFTMA_ANALYSIS_DIR
Location to output MFTMA analysis directory.
If you find this code useful for your research, please cite our paper:
@InProceedings{mamou2020emergence,
title={Emergence of Separable Manifolds in Deep Language Representations},
author={Mamou, Jonathan and Le, Hang and Del Rio, Miguel and Stephenson, Cory and Tang, Hanlin and Kim, Yoon and Chung, SueYeon},
booktitle={37th International Conference on Machine Learning, ICML 2020},
year={2020},
organization={International Machine Learning Society (IMLS)}
}