PLM4IL: An Extendible Incremental Learning Framework for Pretrained Language Models

Official released code for "Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models" (arXiv:2312.07887) and "Concept-1K: A Novel Benchmark for Instance Incremental Learning" (arXiv:2402.08526).


Introduction

This is a repository for Incremental Learning with Pretrained Language Models.

  • It supports both generative and discriminative models from transformers (see the sketch below).
  • It supports accelerate for distributed data parallel and model parallel training.
  • It supports wandb for logging.
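
As a minimal illustration (not the framework's own utils/backbone.py), a discriminative backbone predicts labels through a classification head, while a generative backbone decodes the label as text; both kinds are loaded from the transformers library:

from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

# Discriminative backbone (e.g., BERT) with a classification head on top.
disc_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
disc_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=15)  # the label space grows as new tasks arrive

# Generative backbone (e.g., GPT-2) that produces the label as text.
gen_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gen_model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = disc_tokenizer("set an alarm for 7 am", return_tensors="pt")
logits = disc_model(**batch).logits  # shape: (1, num_labels)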

Supported List

Scenarios

  • Class-Incremental Learning (CIL)
  • Task-Incremental Learning (TIL)
  • Instance-Incremental Learning (IIL)

Tasks

  • Text Classification
  • Intent Classification
  • Relation Extraction
  • Named Entity Recognition

Methods

More baselines will be released in the future!

General (Text/Intent) Classification

Named Entity Recognition

Originally Proposed for Image Classification

Datasets

Instance Incremental Learning

  • Concept-1K (The raw and the preprocessed Concept-1K are included in dataset/concept_1k, dataset/concept_1k_task10, dataset/concept_1k_task1).

Text Classification

  • Topic3datasets (agnews, dbpedia, yahoo)

Intent Classification

  • CLINC150
  • Banking77

Relation Extraction

  • FewRel
  • TACRED

Named Entity Recognition

  • Few-NERD
  • Ontonotes5
  • I2B2

Usage

Overview

.
├── main_CL.py                  # This is the Python file to be executed for running all experiments
├── utils                       # This folder contains the basic building blocks for incremental learning
│   ├── backbone.py             # This file loads backbone models from the transformers library
│   ├── buffer.py               # This file defines the replay buffer
│   ├── classifier.py           # This file loads Linear/CosineLinear classifiers
│   ├── wrapmodel.py            # This file wraps the model for using DeepSpeed with accelerate
│   ├── dataformat_preprocess.py# This file preprocesses the raw datasets into continual learning datasets
│   ├── dataloader.py           # This file prepares the input for language models
│   ├── dataset.py              # This file defines the format of different datasets for continual learning
│   ├── download_backbones.py   # This file downloads models in advance to avoid network problems
│   ├── evaluation.py           # This file defines the evaluation process for various tasks
│   ├── factory.py              # This file loads the various models from the ./models folder
│   ├── logger.py               # This file defines the logger
│   ├── metric.py               # This file defines the evaluation metrics for continual learning (see the sketch below)
│   ├── optimizer.py            # This file defines the optimizer for different models
│   ├── prompt.py               # This file defines the prompts used for different tasks
│   ├── probing.py              # This file computes the probing performance
│   └── config.py               # This file defines general parameters and settings for the experiments
├── config                  # This folder contains the hyper-parameters for each method on each dataset
├── dataset                 # This folder contains the datasets for continual learning
├── models                  # This folder contains the models for continual learning
└── experiments             # This folder contains the log data for each run
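
In continual learning, the model is trained on tasks one after another and evaluated on all tasks seen so far, which yields an accuracy matrix. The sketch below computes the two most common summary metrics, average accuracy and forgetting; whether utils/metric.py uses exactly these definitions is an assumption.

import numpy as np

# acc[t, i] = accuracy on task i after training on tasks 0..t (toy numbers).
acc = np.array([
    [0.95, 0.00, 0.00],
    [0.80, 0.93, 0.00],
    [0.72, 0.85, 0.94],
])
num_tasks = acc.shape[0]

# Average accuracy over all tasks after learning the last task.
avg_acc = acc[num_tasks - 1].mean()

# Forgetting: drop from the best accuracy ever reached on each earlier task.
forgetting = np.mean(
    [acc[:num_tasks - 1, i].max() - acc[num_tasks - 1, i]
     for i in range(num_tasks - 1)]
)

print(f"average accuracy = {avg_acc:.3f}, forgetting = {forgetting:.3f}")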

Quick Start

Step 1: prepare the environment

pip install -r requirement.txt

Step 2: prepare the dataset

Check the support_dataset_list in utils/dataformat_preprocess.py and select the datasets you want to experiment with.

Then, download the raw dataset to the folder dataset/{dataset-name}. For example, download clinc150 to the folder dataset/clinc150. The raw datasets can be downloaded here. We note that the raw data of Concept-1K is in dataset/concept_1k, the preprocessed Concept-1K for 10-step incremental learning is in dataset/concept_1k_task10, and the whole Concept-1K is in dataset/concept_1k_task1.

Next, execute preprocess_dataset.sh. It automatically preprocesses the default datasets for reproducing the results ('topic3datasets', 'clinc150', 'banking77', 'fewrel', 'tacred', 'conll2003', 'fewnerd', 'i2b2', 'ontonotes5') and creates the corresponding folders dataset/{dataset-for-continual-learning-name} (e.g., banking_task7). If you do not need to customize the datasets, you can skip to Step 3.

To customize the datasets, you can run utils/dataformat_preprocess.py with your own parameters (e.g., random seed, number of tasks). This process creates a new target folder dataset/{dataset-for-continual-learning-name}. In the target folder, two JSON files, continual_data.json and continual_config.json, will be saved. For example, you can prepare the clinc150 and fewrel datasets by running

python utils/dataformat_preprocess.py --dataset clinc150 --seed 1

and

python utils/dataformat_preprocess.py --dataset fewrel --seed 1

The program will create target folders dataset/clinc150_task15 and dataset/fewrel_task8.
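
To sanity-check the output, you can peek at the two generated JSON files. This is only an illustration; the exact schema of continual_data.json is not described here, so the snippet inspects just the top level:

import json
from pathlib import Path

target_dir = Path("dataset/clinc150_task15")  # produced by the command above

with open(target_dir / "continual_config.json") as f:
    continual_config = json.load(f)
with open(target_dir / "continual_data.json") as f:
    continual_data = json.load(f)

print("config keys:", sorted(continual_config))
print("data type:", type(continual_data).__name__)
if isinstance(continual_data, dict):
    print("first entries:", list(continual_data)[:5])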

For NER datasets, for example ontonotes5, you can run the following command

python utils/dataformat_preprocess.py --dataset ontonotes5 --seed 1 --base_task_entity 8 --incremental_task_entity 2 --seen_all_labels False

The program will create a target folder dataset/ontonotes5_task6_base8_inc2 (with the 18 entity types of Ontonotes5, a base task of 8 entities plus 2 entities per incremental task gives 1 + (18 - 8) / 2 = 6 tasks). We note that fixing the random seed ensures that exactly the same datasets are generated on different devices. Finally, the preprocessed datasets clinc150_task15, fewrel_task8, and ontonotes5_task6_base8_inc2 are ready for continual learning!

Step 3: select the yaml file for hyper-parameters

The yaml file contains the hyper-parameters for each method. For example, the hyper-parameters of SEQ* with and without pre-allocating future classifiers for generative backbones under the CIL setting are defined in config/CIL/generative_backbones/clinc150_task15/SEQ_pre_warm_fix.yaml and config/CIL/generative_backbones/clinc150_task15/SEQ_warm_fix.yaml, respectively.
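
If you are unsure what a method's hyper-parameters look like, you can inspect any of these files with a few lines of Python. This assumes standard YAML parsed with pyyaml and a key/value layout at the top level; the actual keys depend on the method:

import yaml  # pip install pyyaml

cfg_path = "config/CIL/generative_backbones/clinc150_task15/SEQ_warm_fix.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# List every hyper-parameter defined for this method/dataset combination.
for key in sorted(cfg):
    print(f"{key}: {cfg[key]}")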

Step 4: reproduce the results

The scripts for reproducing the probing study are in the folder reproduce_shell/exp-probing.

The scripts for reproducing the probing study with different pre-training steps are in the folder reproduce_shell/exp-probing-pretraining.

The scripts for reproducing the experiments of comparing SEQ* with SOTA methods are in the folder reproduce_shell/exp-sota.
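
For readers unfamiliar with probing: a probing study trains only a lightweight classifier on frozen backbone features and reports its accuracy, which measures how much task-relevant knowledge the backbone already encodes. Below is a minimal, self-contained illustration with a logistic-regression probe on toy data; it is not the repository's utils/probing.py:

import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
backbone = AutoModel.from_pretrained("bert-base-cased").eval()

texts = ["book a table for two", "what is my account balance",
         "play some jazz music", "transfer money to my savings"]
labels = [0, 1, 2, 1]  # toy intent labels

# Extract frozen [CLS] features; the backbone is never updated.
with torch.no_grad():
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    features = backbone(**batch).last_hidden_state[:, 0, :].numpy()

# Fit a linear probe on top of the frozen features and report its accuracy.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probing accuracy on the toy data:", probe.score(features, labels))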

If you want to run an experiment, execute main_CL.py. For example, you can run the SEQ method on the clinc150_task15 dataset with bert-base-cased using the following command:

python main_CL.py --exp_prefix {your-experiment-name} --cfg './config/clinc150_task15/SEQ_full.yaml' --backbone bert-base-cased --classifier Linear --training_epochs 5

If you want to use wandb for logging (see here for more help):

python main_CL.py --is_wandb True --wandb_project {your-project-name} --wandb_entity {your-entity-name} --exp_prefix {your-experiment-name} --cfg './config/clinc150_task15/SEQ_full.yaml' --backbone bert-base-cased --classifier Linear --training_epochs 5 

If you want to use accelerate for data/model parallel (see here for more help):

accelerate launch --config_file {your-accelerate-config-file} main_CL.py --is_wandb True --wandb_project {your-project-name} --wandb_entity {your-entity-name} --exp_prefix {your-experiment-name} --cfg './config/clinc150_task15/SEQ_full.yaml' --backbone bert-base-cased --classifier Linear --training_epochs 5 

Please refer to utils/config.py for more general parameters and models/{model-name}.py for more model-specific parameters.

Main Results

The results on the IIL scenario (see the main_results figure in the repository).

The results on the CIL and TIL scenarios (see the main_results figure in the repository).

Questions and Citation

If you have questions about this repository, please feel free to contact me at junhaozheng47@outlook.com.

If you find this repository useful, please consider citing our papers.

@misc{zheng2024concept1k,
      title={Concept-1K: A Novel Benchmark for Instance Incremental Learning}, 
      author={Junhao Zheng and Shengjie Qiu and Qianli Ma},
      year={2024},
      eprint={2402.08526},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
@misc{zheng2023learn,
      title={Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models}, 
      author={Junhao Zheng and Shengjie Qiu and Qianli Ma},
      year={2023},
      eprint={2312.07887},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
