This repository is a benchmark for evaluating the address parsing task: segmenting a single inline postal address description into distinct chunks according to the USPS address component standard, as illustrated below.
Inline address description:

    1006 S Main Street W, Los Angeles, CA 90001

Parsed address description:

| Address component | Parsed result |
| ----------------- | ------------- |
| House number      | 1006 |
| Predirectional    | S |
| Base street name  | Main |
| Road type         | Street |
| Postdirectional   | W |
| City name         | Los Angeles |
| State             | CA |
| Zip code          | 90001 |
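To make the task concrete, the sketch below parses the clean example above with simple hand-written rules. This is purely illustrative and not part of the benchmark code: the function name `naive_parse` and its heuristics are assumptions, and real benchmark inputs contain errors that such rules cannot handle.

```python
def naive_parse(addr: str) -> dict:
    """Illustrative rule-based parse of one well-formed inline address.

    Only handles the clean example format shown above; not robust to the
    errors and variations present in the benchmark dataset.
    """
    # Split "street part, city, state zip" on commas.
    street_part, city, state_zip = [p.strip() for p in addr.split(",")]
    state, zipcode = state_zip.split()

    tokens = street_part.split()
    directions = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}

    result = {
        "House number": tokens[0],
        "City name": city,
        "State": state,
        "Zip code": zipcode,
    }
    rest = tokens[1:]
    if rest and rest[0] in directions:        # leading directional
        result["Predirectional"] = rest.pop(0)
    if rest and rest[-1] in directions:       # trailing directional
        result["Postdirectional"] = rest.pop()
    if rest:                                  # last remaining token = road type
        result["Road type"] = rest.pop()
    result["Base street name"] = " ".join(rest)
    return result

print(naive_parse("1006 S Main Street W, Los Angeles, CA 90001"))
```

The point of the benchmark is precisely that rules like these break down on noisy input, motivating the learned parsers described below.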
The `benchmark_dataset` folder contains synthesized low-quality address descriptions designed for evaluating postal geocoding, address parsing, NER systems, and more. It comprises 239,000 U.S. address records spanning all 50 states and the District of Columbia. To ensure diversity, unique combinations of address components other than the house number (street name, predirectional, postdirectional, city name, and postal code) are extracted from each U.S. state and D.C. In essence, one address description per street is obtained across all states and D.C., forming this distinctive dataset.
This dataset is further partitioned into three mutually exclusive subsets for training, validation, and testing. Each address description is annotated using the IOB (Inside-Outside-Beginning) tagging scheme, which assigns the appropriate address component label to each whitespace-delimited chunk.
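The snippet below illustrates what IOB annotation looks like for the example address. The exact label strings used in the dataset files are an assumption here; only the B-/I- prefix convention is standard, with multi-token components such as "Los Angeles" tagged B- then I-.

```python
# Illustrative IOB annotation of the example address, one label per
# whitespace-delimited chunk. Label names are hypothetical; the
# dataset's actual tag set may spell them differently.
tokens = ["1006", "S", "Main", "Street", "W,", "Los", "Angeles,", "CA", "90001"]
tags   = ["B-HouseNumber", "B-Predirectional", "B-StreetName", "B-RoadType",
          "B-Postdirectional", "B-City", "I-City", "B-State", "B-Zipcode"]

assert len(tokens) == len(tags)
for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```

Note how the two-word city produces a `B-City` followed by an `I-City`, which is what lets a tagger recover multi-token components from per-token labels.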
The distribution of address descriptions with different degrees of errors and variations is listed in the table below.
The input errors and variations contained in the benchmark dataset were detected and generated by mining transaction logs from an in-production geocoding system. In total, the dataset covers 21 types of input errors and variations affecting different address components, as listed below. These errors and variations are randomly injected into the address descriptions.
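A minimal sketch of random error injection is shown below, assuming nothing about the actual 21 mined error types: it applies just two generic perturbations (road-type abbreviation and adjacent-character swaps) with a seeded RNG. The function names and probabilities are illustrative, not the benchmark's real pipeline.

```python
import random

# Hypothetical abbreviation table; the real error types were mined from
# production geocoder logs and are more varied than this.
ABBREVIATIONS = {"Street": "St", "Avenue": "Ave", "Boulevard": "Blvd"}

def swap_typo(token: str, rng: random.Random) -> str:
    """Swap two adjacent characters to simulate a keystroke error."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)
    return token[:i] + token[i + 1] + token[i] + token[i + 2:]

def inject_errors(address: str, seed: int = 0) -> str:
    """Randomly perturb tokens of an address description (sketch)."""
    rng = random.Random(seed)
    out = []
    for tok in address.split():
        bare = tok.rstrip(",")
        r = rng.random()
        if bare in ABBREVIATIONS and r < 0.5:
            # Replace road type with its abbreviation, keeping punctuation.
            out.append(ABBREVIATIONS[bare] + ("," if tok.endswith(",") else ""))
        elif r < 0.2:
            out.append(swap_typo(tok, rng))
        else:
            out.append(tok)
    return " ".join(out)

print(inject_errors("1006 S Main Street W, Los Angeles, CA 90001"))
```

Because the RNG is seeded, the same input and seed always yield the same corrupted description, which keeps synthesized datasets reproducible.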
The figures displayed below, from top to bottom, illustrate the distribution of input errors and variations within the Training, Validation, and Test sets, respectively.
The `models` folder contains implementations of the evaluated address parsers, built on five different models. Each model represents a significant advance in NLP and previously achieved state-of-the-art performance.
- Recurrent neural network-based model
  - Bidirectional LSTM-CRF, built upon https://github.com/allanj/pytorch_neural_crf
- Transformer-based models
  - BERT
  - RoBERTa
  - DistilBERT

  The above transformer-based models are implemented using the Hugging Face Transformers library.
- Generative Pre-trained Transformer
  - GPT-3, implemented with the Promptify library and the GPT-3.5 Turbo API
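One practical detail when fine-tuning the transformer-based taggers is that subword tokenizers split words into pieces, so word-level IOB labels must be realigned to subword tokens. The sketch below shows that alignment with a toy stand-in tokenizer (`toy_wordpiece` is hypothetical, not a Hugging Face API); one common convention keeps the label on the first piece and converts continuation pieces to I- tags, while other setups mask continuations with -100 instead.

```python
def toy_wordpiece(word: str) -> list:
    """Toy stand-in for a WordPiece tokenizer: words longer than 4
    characters split into 4-character pieces, continuations get '##'."""
    if len(word) <= 4:
        return [word]
    return [word[:4]] + ["##" + word[i:i + 4] for i in range(4, len(word), 4)]

def align_labels(words: list, labels: list):
    """Expand word-level IOB labels to subword-level labels."""
    sub_tokens, sub_labels = [], []
    for word, label in zip(words, labels):
        pieces = toy_wordpiece(word)
        sub_tokens.extend(pieces)
        # First piece keeps the original label; continuation pieces
        # become I- tags (some setups use -100 to mask them instead).
        sub_labels.append(label)
        inside = label.replace("B-", "I-") if label != "O" else "O"
        sub_labels.extend([inside] * (len(pieces) - 1))
    return sub_tokens, sub_labels

words = ["1006", "Main", "Street"]
labels = ["B-HouseNumber", "B-StreetName", "B-RoadType"]
toks, labs = align_labels(words, labels)
print(list(zip(toks, labs)))
```

With a real tokenizer the same alignment logic applies; only `toy_wordpiece` would be replaced by the model's own tokenizer output.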
If you find this benchmark useful, please cite the following paper: *Is ChatGPT a game changer for geocoding - a benchmark for geocoding address parsing techniques*.
- Preprint version on arXiv
@misc{yin2023chatgpt,
title={Is ChatGPT a game changer for geocoding -- a benchmark for geocoding address parsing techniques},
author={Zhengcong Yin and Diya Li and Daniel W. Goldberg},
year={2023},
eprint={2310.14360},
archivePrefix={arXiv},
primaryClass={cs.CL}
}