This repository is a benchmark for evaluating the address parsing task: segmenting a single inline postal address description into distinct chunks according to the USPS address component standard, as illustrated below.
Inline address description:

    1006 S Main Street W, Los Angeles, CA 90001

Parsed address description:

| Address component | Parsed result |
| ----------------- | ------------- |
| House number      | 1006 |
| Predirectional    | S |
| Base street name  | Main |
| Road type         | Street |
| Postdirectional   | W |
| City name         | Los Angeles |
| State             | CA |
| Zip code          | 90001 |
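To make the task concrete, the sketch below parses the clean example above with simple hand-written rules. This is purely illustrative and not part of the benchmark code: the function name `naive_parse` and its heuristics are assumptions, and real benchmark inputs contain errors that such rules cannot handle.

```python
def naive_parse(addr: str) -> dict:
    """Illustrative rule-based parse of one well-formed inline address.

    Only handles the clean example format shown above; not robust to the
    errors and variations present in the benchmark dataset.
    """
    # Split "street part, city, state zip" on commas.
    street_part, city, state_zip = [p.strip() for p in addr.split(",")]
    state, zipcode = state_zip.split()

    tokens = street_part.split()
    directions = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}

    result = {
        "House number": tokens[0],
        "City name": city,
        "State": state,
        "Zip code": zipcode,
    }
    rest = tokens[1:]
    if rest and rest[0] in directions:        # leading directional
        result["Predirectional"] = rest.pop(0)
    if rest and rest[-1] in directions:       # trailing directional
        result["Postdirectional"] = rest.pop()
    if rest:                                  # last remaining token = road type
        result["Road type"] = rest.pop()
    result["Base street name"] = " ".join(rest)
    return result

print(naive_parse("1006 S Main Street W, Los Angeles, CA 90001"))
```

The point of the benchmark is precisely that rules like these break down on noisy input, motivating the learned parsers described below.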
The `benchmark_dataset` folder contains synthesized low-quality address descriptions designed for evaluating postal geocoding, address parsing, NER systems, and more. It comprises 239,000 U.S. address records spanning all 50 states and the District of Columbia. To ensure diversity, unique combinations of address components other than the house number (street name, predirectional, postdirectional, city name, and postal code) are extracted from each U.S. state and D.C. In essence, one address description per street is obtained across all states and D.C., forming this distinctive dataset.
This dataset is further partitioned into three mutually exclusive subsets for training, validation, and testing. Each address description is annotated using the IOB (Inside-Outside-Beginning) tagging scheme, which assigns the appropriate address component label to each whitespace-delimited chunk.
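The snippet below illustrates what IOB annotation looks like for the example address. The exact label strings used in the dataset files are an assumption here; only the B-/I- prefix convention is standard, with multi-token components such as "Los Angeles" tagged B- then I-.

```python
# Illustrative IOB annotation of the example address, one label per
# whitespace-delimited chunk. Label names are hypothetical; the
# dataset's actual tag set may spell them differently.
tokens = ["1006", "S", "Main", "Street", "W,", "Los", "Angeles,", "CA", "90001"]
tags   = ["B-HouseNumber", "B-Predirectional", "B-StreetName", "B-RoadType",
          "B-Postdirectional", "B-City", "I-City", "B-State", "B-Zipcode"]

assert len(tokens) == len(tags)
for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```

Note how the two-word city produces a `B-City` followed by an `I-City`, which is what lets a tagger recover multi-token components from per-token labels.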
The distribution of address descriptions with different degrees of errors and variations is listed in the table below.
The input errors and variations contained in the benchmark dataset were detected and generated by mining transaction logs from an in-production geocoding system. In total, the dataset covers 21 types of input errors and variations affecting different address components, as listed below. These errors and variations are randomly injected into the address descriptions.
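A minimal sketch of random error injection is shown below, assuming nothing about the actual 21 mined error types: it applies just two generic perturbations (road-type abbreviation and adjacent-character swaps) with a seeded RNG. The function names and probabilities are illustrative, not the benchmark's real pipeline.

```python
import random

# Hypothetical abbreviation table; the real error types were mined from
# production geocoder logs and are more varied than this.
ABBREVIATIONS = {"Street": "St", "Avenue": "Ave", "Boulevard": "Blvd"}

def swap_typo(token: str, rng: random.Random) -> str:
    """Swap two adjacent characters to simulate a keystroke error."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)
    return token[:i] + token[i + 1] + token[i] + token[i + 2:]

def inject_errors(address: str, seed: int = 0) -> str:
    """Randomly perturb tokens of an address description (sketch)."""
    rng = random.Random(seed)
    out = []
    for tok in address.split():
        bare = tok.rstrip(",")
        r = rng.random()
        if bare in ABBREVIATIONS and r < 0.5:
            # Replace road type with its abbreviation, keeping punctuation.
            out.append(ABBREVIATIONS[bare] + ("," if tok.endswith(",") else ""))
        elif r < 0.2:
            out.append(swap_typo(tok, rng))
        else:
            out.append(tok)
    return " ".join(out)

print(inject_errors("1006 S Main Street W, Los Angeles, CA 90001"))
```

Because the RNG is seeded, the same input and seed always yield the same corrupted description, which keeps synthesized datasets reproducible.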
The figures displayed below, from top to bottom, illustrate the distribution of input errors and variations within the Training, Validation, and Test sets, respectively.
The `models` folder contains implementations of the evaluated address parsers, built on five different models. Each model represents a significant advance in NLP and previously achieved state-of-the-art performance.
- Recurrent neural network-based model
  - Bidirectional LSTM-CRF, built upon https://github.com/allanj/pytorch_neural_crf
- Transformer-based models
  - BERT
  - RoBERTa
  - DistilBERT

  The above transformer-based models are implemented using the Hugging Face Transformers library.
- Generative Pre-trained Transformer
  - GPT-3, implemented with the Promptify library and the GPT-3.5 Turbo API
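One practical detail when fine-tuning the transformer-based taggers is that subword tokenizers split words into pieces, so word-level IOB labels must be realigned to subword tokens. The sketch below shows that alignment with a toy stand-in tokenizer (`toy_wordpiece` is hypothetical, not a Hugging Face API); one common convention keeps the label on the first piece and converts continuation pieces to I- tags, while other setups mask continuations with -100 instead.

```python
def toy_wordpiece(word: str) -> list:
    """Toy stand-in for a WordPiece tokenizer: words longer than 4
    characters split into 4-character pieces, continuations get '##'."""
    if len(word) <= 4:
        return [word]
    return [word[:4]] + ["##" + word[i:i + 4] for i in range(4, len(word), 4)]

def align_labels(words: list, labels: list):
    """Expand word-level IOB labels to subword-level labels."""
    sub_tokens, sub_labels = [], []
    for word, label in zip(words, labels):
        pieces = toy_wordpiece(word)
        sub_tokens.extend(pieces)
        # First piece keeps the original label; continuation pieces
        # become I- tags (some setups use -100 to mask them instead).
        sub_labels.append(label)
        inside = label.replace("B-", "I-") if label != "O" else "O"
        sub_labels.extend([inside] * (len(pieces) - 1))
    return sub_tokens, sub_labels

words = ["1006", "Main", "Street"]
labels = ["B-HouseNumber", "B-StreetName", "B-RoadType"]
toks, labs = align_labels(words, labels)
print(list(zip(toks, labs)))
```

With a real tokenizer the same alignment logic applies; only `toy_wordpiece` would be replaced by the model's own tokenizer output.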
If you find this benchmark useful, please cite the following paper: *Is ChatGPT a game changer for geocoding - a benchmark for geocoding address parsing techniques*.
- Preprint version on arXiv
@misc{yin2023chatgpt,
title={Is ChatGPT a game changer for geocoding -- a benchmark for geocoding address parsing techniques},
author={Zhengcong Yin and Diya Li and Daniel W. Goldberg},
year={2023},
eprint={2310.14360},
archivePrefix={arXiv},
primaryClass={cs.CL}
}