
Software Entity Recognition with Noisy-label Learning

Official implementation for our paper "Software Entity Recognition with Noise-robust Learning", ASE 2023.

WikiSER

The WikiSER corpus includes 1.7M sentences with named-entity labels extracted from 79k Wikipedia articles. Software-relevant named entities are labeled under 12 fine-grained categories:

| Type | Examples |
| --- | --- |
| Algorithm | Auction algorithm, Collaborative filtering |
| Application | Adobe Acrobat, Microsoft Excel |
| Architecture | Graphics processing unit, Wishbone |
| Data_Structure | Array, Hash table, mXOR linked list |
| Device | Samsung Gear S2, iPad, Intel T5300 |
| Error Name | Buffer overflow, Memory leak |
| General_Concept | Memory management, Nouvelle AI |
| Language | C++, Java, Python, Rust |
| Library | Beautiful Soup, FastAPI |
| License | Cryptix General License, MIT License |
| Operating_System | Linux, Ubuntu, Red Hat OS, MorphOS |
| Protocol | TLS, FTPS, HTTP 404 |

WikiSER is organized by the Wikipedia article from which the data was scraped.

```
|-- Adobe_Flash.txt
|-- Linux.txt
|-- Java_(programming_language).txt
|-- ...
```

Each sentence is delimited by `<s>...</s>` and tokenized with stokenizer.

Download the full dataset from Hugging Face or this folder.
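This format is easy to consume programmatically. A minimal sketch for recovering the sentence list from one article file (the function name is ours, and it assumes each sentence sits verbatim between the `<s>...</s>` markers as described above):

```python
import re

def read_wikiser_sentences(path):
    """Return the sentences in a WikiSER article file, where each
    sentence is wrapped in <s>...</s> markers."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Non-greedy match so adjacent sentences are not merged.
    return re.findall(r"<s>(.*?)</s>", text, flags=re.DOTALL)
```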

Models

The finetuned checkpoints are available through Hugging Face: wikiser-bert-base and wikiser-bert-large.

You can load the model through the standard Transformers API:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("taidng/wikiser-bert-base")
model = AutoModelForTokenClassification.from_pretrained("taidng/wikiser-bert-base")
```
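Once loaded, the model can tag a sentence like any Transformers token-classification checkpoint. The helper below is an illustrative sketch (not part of this repo) that pairs each wordpiece with its predicted label:

```python
import torch

def tag_tokens(model, tokenizer, text):
    """Run token classification and pair each wordpiece with its
    predicted label string from the model's id2label mapping."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    pred_ids = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return [(tok, model.config.id2label[i]) for tok, i in zip(tokens, pred_ids)]

# Example usage (downloads the checkpoint from the Hub):
# tokenizer = AutoTokenizer.from_pretrained("taidng/wikiser-bert-base")
# model = AutoModelForTokenClassification.from_pretrained("taidng/wikiser-bert-base")
# print(tag_tokens(model, tokenizer, "Linux is written in C."))
```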

Train with Self-regularization

We suggest using conda to set up your environment. To begin, create a new environment from environment.yml; by default it is named "ser".

```shell
conda env create -f environment.yml
```

To start the training script with BERT and self-regularization:

```shell
python3 train_nll.py --model_name_or_path=bert-base-cased --alpha=10 --n_model=2 --dropout_prob=0.1 --data_dir=data/wikiser-small --epochs=25
```

- `--alpha`: positive multiplier weighting the agreement loss
- `--n_model`: number of forward passes k used for regularization
- `--data_dir`: one dataset out of `wikiser-small`, `sner`, and the relabeled `softner-9`
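Conceptually, self-regularization penalizes disagreement between the k dropout-perturbed forward passes of the same input. A minimal sketch of such an agreement term (the exact formulation used in train_nll.py may differ):

```python
import torch
import torch.nn.functional as F

def agreement_loss(logits_list):
    """Average KL divergence from each pass's predictive distribution
    to the mean distribution over all k passes (sketch of the
    agreement term; always non-negative, zero when passes agree)."""
    probs = [F.softmax(l, dim=-1) for l in logits_list]
    mean = torch.stack(probs).mean(dim=0)  # average over the k passes
    loss = 0.0
    for p in probs:
        # F.kl_div expects log-probabilities as input, probabilities as target.
        loss = loss + F.kl_div(torch.log(mean + 1e-12), p, reduction="batchmean")
    return loss / len(probs)

def total_loss(ce_loss, logits_list, alpha=10.0):
    """Cross-entropy plus the alpha-weighted agreement term."""
    return ce_loss + alpha * agreement_loss(logits_list)
```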

By default, training loss and evaluation statistics are logged to wandb.

Citation

If you find our work helpful, please cite:

@inproceedings{nguyen2023software,
  title={Software Entity Recognition with Noise-Robust Learning},
  author={Nguyen, Tai and Di, Yifeng and Lee, Joohan and Chen, Muhao and Zhang, Tianyi},
  booktitle={Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE'23)},
  year={2023},
  organization={IEEE/ACM}
}
