Comparing Test Sets with IRT

This is a repository for the paper "Comparing Test Sets with Item Response Theory", to appear at ACL 2021.

Requirements

For training the models and generating new model responses using jiant (we use v2.1.1), please refer to the README in the IRT_experiment branch (see "Analyzing New Test Set(s)" below).

For IRT analysis, use Python 3 (we use Python 3.7) and install the required packages.

git clone https://github.com/nyu-mll/nlu-test-sets.git
cd nlu-test-sets
pip install -r requirements.txt

IRT Analysis

The data directory contains model responses for the 29 datasets described in the paper. The params directory contains the IRT parameters estimated from those responses. You can run the plot_virt.ipynb notebook to generate the plots.
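If Jupyter is available in your environment (install it with pip install jupyter if it is not pulled in by requirements.txt), one way to open the notebook from the repository root is:

jupyter notebook plot_virt.ipynb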

Analyzing New Test Set(s)

Generating Model Responses

To run the same analysis for new test set(s), first train the 18 Transformer models used in the paper on each dataset. We use jiant v2.1.1 for all of our experiments. The scripts for hyperparameter tuning, storing checkpoints, and evaluation can be found in the IRT_experiment branch. Please refer to the README in that branch if you want to run a new experiment and generate the corresponding model responses.

Fitting an IRT model

After obtaining the model responses for each test set, add them to the data folder. Then edit irt_scripts/estimate_irt_params.sh: add the new test sets to the TASKS list and update the path so that it refers to your working directory, as sketched below.
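For illustration, the edited lines might look like this. Only TASKS is named in this README; the task name mynewtask is a placeholder, and the script's actual path variable may be named differently:

# irt_scripts/estimate_irt_params.sh (hypothetical excerpt)
TASKS="... mynewtask"    # append your new test set name(s) to the existing list
# also update the working-directory path, e.g. /path/to/nlu-test-sets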

You can use the following command to estimate parameters:

bash irt_scripts/estimate_irt_params.sh

The script estimates the parameters and stores them in the params directory as two files, params.p and responses.p.
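As a quick sanity check, you can list the output and unpickle one of the files; this assumes the .p files are standard Python pickles, as the extension suggests:

ls -lh params/
python -c "import pickle; print(type(pickle.load(open('params/params.p', 'rb'))))"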

After that, you can rerun the plot_virt.ipynb notebook to generate plots for the new test set(s).
