iSEA: An Interactive Pipeline for Semantic Error Analysis of NLP Models

This is the official code repository for iSEA: An Interactive Pipeline for Semantic Error Analysis of NLP Models, by Jun Yuan, Jesse Vig, and Nazneen Rajani.

Repository Overview

This repository contains the following two parts:

pre-process/: This folder contains the code for pre-processing the text documents. Using a pre-trained DistilBERT model as an example, several Jupyter notebooks demonstrate how the data is processed. The notebooks cover:
- preprocessing of documents (tokenization, lemmatization, document embedding, etc.);
- model performance;
- high-level feature generation;
- rule generation;
- instance-level model explanation (SHAP values).
ui/: This folder contains the code and processed data for running the front end.
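
The rule-generation step listed under pre-process/ can be illustrated with a small sketch. This is not the repository's implementation: it assumes the simplest possible rule form ("the document contains token t"), uses a regex split as a stand-in for real tokenization, and the names `tokenize`, `token_error_rules`, and `min_support` are hypothetical. It ranks tokens whose presence coincides with an error rate above the overall error rate.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase and split into a set of word tokens (stand-in for a real tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def token_error_rules(docs, labels, preds, min_support=2):
    """Rank 'document contains token' rules by subpopulation error rate.

    Returns (token, support, error_rate) tuples for tokens that appear in at
    least `min_support` documents and whose error rate exceeds the overall
    error rate, sorted from highest to lowest error rate.
    """
    errors = [int(y != p) for y, p in zip(labels, preds)]
    if not errors:
        return []
    base_rate = sum(errors) / len(errors)

    support = defaultdict(int)  # number of documents containing the token
    wrong = defaultdict(int)    # number of those documents that are misclassified
    for text, err in zip(docs, errors):
        for tok in tokenize(text):
            support[tok] += 1
            wrong[tok] += err

    rules = [
        (tok, n, wrong[tok] / n)
        for tok, n in support.items()
        if n >= min_support and wrong[tok] / n > base_rate
    ]
    return sorted(rules, key=lambda r: (-r[2], -r[1], r[0]))


# Toy data (made up for illustration only).
docs = [
    "the flight was not good",
    "not a bad hotel",
    "great beach",
    "the food was great",
    "not great not terrible",
]
labels = [0, 1, 1, 1, 0]
preds = [1, 0, 1, 1, 1]
print(token_error_rules(docs, labels, preds))  # [('not', 3, 1.0)]
```

In this toy example the overall error rate is 0.6, and every document containing "not" is misclassified, so "contains not" is the only rule reported. A real pipeline would typically combine several conditions and high-level features per rule.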

System Architecture

We first pre-compute all the necessary information on the server, such as model output, analysis information, and error rules, and present it in the user interface. In response to user input, the server then calculates subpopulation-level information (errors, document statistics, aggregated SHAP values, etc.) and returns it to the UI.
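
As a rough illustration of the subpopulation-level computation described above, the sketch below filters pre-computed records with a user-supplied condition and aggregates a few statistics. The record layout (`tokens`, `label`, `pred`, `shap`), the function name `subpopulation_summary`, and the choice of mean absolute SHAP value as the aggregation are all assumptions made for this example, not the server code in ui/.

```python
from collections import defaultdict
from statistics import mean


def subpopulation_summary(records, condition):
    """Summarize the documents that satisfy a user-defined condition.

    Each record is assumed to be a dict with the keys 'tokens', 'label',
    'pred', and 'shap' (a token -> SHAP value mapping computed ahead of time).
    """
    subset = [r for r in records if condition(r)]
    if not subset:
        return {"size": 0, "error_rate": None, "avg_length": None, "mean_abs_shap": {}}

    # Collect absolute SHAP values per token across the subpopulation.
    shap_values = defaultdict(list)
    for r in subset:
        for token, value in r["shap"].items():
            shap_values[token].append(abs(value))

    return {
        "size": len(subset),
        "error_rate": mean(int(r["label"] != r["pred"]) for r in subset),
        "avg_length": mean(len(r["tokens"]) for r in subset),
        "mean_abs_shap": {t: mean(v) for t, v in shap_values.items()},
    }


# Toy records (made up for illustration only).
records = [
    {"tokens": ["not", "bad"], "label": 1, "pred": 0,
     "shap": {"not": -0.5, "bad": -0.25}},
    {"tokens": ["not", "good", "at", "all"], "label": 0, "pred": 0,
     "shap": {"not": -0.25, "good": 0.5}},
    {"tokens": ["great"], "label": 1, "pred": 1,
     "shap": {"great": 0.75}},
]
summary = subpopulation_summary(records, lambda r: "not" in r["tokens"])
print(summary["size"], summary["error_rate"])  # 2 0.5
```

Passing the condition as a callable keeps the filter independent of how the UI encodes a rule; a server would translate the user's selection into such a predicate before calling the summary function.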

Data & Model

In the paper, we present two use cases with the following data and models:

For the MultiNLI dataset, we first train a DistilBERT model on the government genre. We then analyze the model's performance on the travel genre. The checkpoint can be found here.
For the sentiment analysis task on a Twitter dataset, we use our pipeline to analyze the errors that the open-source twitter-roberta-base-sentiment model makes on the test data.

To apply iSEA to your own data or model, please follow the instructions in the pre-process/ folder for data preprocessing and the instructions in the ui/ folder for running the front end.

Citation

When referencing this repository, please cite this paper:

@misc{yuan22isea,
      title={iSEA: An Interactive Pipeline for Semantic Error Analysis of NLP Models}, 
      author={Yuan, Jun and Vig, Jesse and Rajani, Nazneen},
      year={2022},
      eprint={2203.04408},
      archivePrefix={arXiv},
      primaryClass={cs.HC},
      url={https://arxiv.org/abs/2203.04408}
}

License

This repository is released under the BSD-3 License.
