zhengcongyin/Geocoding-Address-Parsing-Benchmark

Is ChatGPT a game changer for geocoding - a benchmark for geocoding address parsing techniques

Benchmark description

Address parsing task

This repository provides a benchmark for evaluating the address parsing task: segmenting a single inline postal address description into distinct chunks according to the USPS address component standard, as illustrated below.

  • Inline address description: 1006 S Main Street W, Los Angeles, CA 90001

  • Parsed address description:

    | Address component | Parsed result |
    | --- | --- |
    | House number | 1006 |
    | Predirectional | S |
    | Base street name | Main |
    | Road type | Street |
    | Postdirectional | W |
    | City name | Los Angeles |
    | State | CA |
    | Zip code | 90001 |
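
As a minimal sketch of how a parsed result like the one above could be held in code, the following uses a Python dataclass with one field per address component. The class name `ParsedAddress`, its field names, and the `to_inline` helper are assumptions made for illustration and are not taken from this repository.

```python
from dataclasses import dataclass


@dataclass
class ParsedAddress:
    """Hypothetical container with one field per address component."""
    house_number: str = ""
    predirectional: str = ""
    base_street_name: str = ""
    road_type: str = ""
    postdirectional: str = ""
    city_name: str = ""
    state: str = ""
    zip_code: str = ""

    def to_inline(self) -> str:
        """Rebuild a single-line description, skipping empty components."""
        street = " ".join(
            part
            for part in (
                self.house_number,
                self.predirectional,
                self.base_street_name,
                self.road_type,
                self.postdirectional,
            )
            if part
        )
        region = " ".join(part for part in (self.state, self.zip_code) if part)
        return ", ".join(part for part in (street, self.city_name, region) if part)


example = ParsedAddress(
    house_number="1006",
    predirectional="S",
    base_street_name="Main",
    road_type="Street",
    postdirectional="W",
    city_name="Los Angeles",
    state="CA",
    zip_code="90001",
)
print(example.to_inline())
```

Keeping every component as an optional string makes it straightforward to represent addresses in which some components are missing.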

Benchmark dataset

The benchmark_dataset folder contains synthesized low-quality address descriptions for evaluating postal geocoding, address parsing, NER systems, and related tasks. The dataset comprises 239,000 U.S. address records spanning all 50 states and the District of Columbia. To ensure diversity, unique combinations of address components other than the house number (such as street name, predirectional, postdirectional, city name, and postal code) are extracted from each U.S. state and D.C. In essence, one address description per street is obtained across all states and D.C. The dataset is partitioned into three mutually exclusive subsets for training, validation, and testing. Each address description is annotated using the IOB (Inside–Outside–Beginning) tagging scheme, which assigns the appropriate address component label to each chunk segmented by white space.
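
To make the IOB annotation concrete, here is a small sketch that splits an address on white space and groups the tagged tokens back into components. The tag names (`B-HouseNumber`, `I-City`, and so on) and the function `iob_to_chunks` are hypothetical; the label set actually used in the dataset files may differ.

```python
def iob_to_chunks(tokens, tags):
    """Group whitespace-split tokens into (label, text) chunks using IOB tags.

    A "B-X" tag starts a chunk of type X, "I-X" continues it, and "O" marks
    a token outside any component. An "I-X" with no matching open chunk is
    treated leniently as the start of a new chunk.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    chunks = []
    current_label = None
    current_tokens = []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current_label
        if not continues:
            # Close the open chunk (if any) before starting a new one.
            if current_label is not None:
                chunks.append((current_label, " ".join(current_tokens)))
            current_label = label if prefix in ("B", "I") else None
            current_tokens = []
        if current_label is not None:
            # Commas stay attached to tokens after whitespace splitting.
            current_tokens.append(token.strip(","))
    if current_label is not None:
        chunks.append((current_label, " ".join(current_tokens)))
    return chunks


tokens = "1006 S Main Street W, Los Angeles, CA 90001".split()
tags = [  # hypothetical label names, one per token
    "B-HouseNumber", "B-Predirectional", "B-StreetName", "B-RoadType",
    "B-Postdirectional", "B-City", "I-City", "B-State", "B-Zip",
]
for label, text in iob_to_chunks(tokens, tags):
    print(f"{label}: {text}")
```

Note that "Los Angeles" spans two tokens, so the second token carries an `I-` tag and is merged into the same city chunk.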

The distribution of address descriptions with different degrees of errors and variations is listed in the table below.
[Image: table of the distribution of address descriptions by degree of errors and variations]

The input errors and variations in the benchmark dataset are based on patterns mined from the transaction logs of an in-production geocoding system. In total, the dataset contains 21 types of input errors and variations, which occur on different address components as listed below. These errors and variations are randomly injected into the address descriptions.

[Image: table of the 21 input errors and variations by address component]
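
The general idea of random injection can be sketched as follows. This is only an illustration under stated assumptions: the two perturbations shown (deleting a character and swapping adjacent characters) are generic typo models chosen for the example, not the 21 error types of this benchmark, and the names `inject_errors`, `PERTURBATIONS`, and the `rate` parameter are hypothetical.

```python
import random


def delete_random_char(value, rng):
    """Drop one character, simulating a missed keystroke."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value))
    return value[:i] + value[i + 1:]


def swap_adjacent_chars(value, rng):
    """Swap two neighbouring characters, simulating a transposition typo."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]


# Hypothetical perturbation pool; the benchmark's own error types differ.
PERTURBATIONS = [delete_random_char, swap_adjacent_chars]


def inject_errors(components, rng, rate=0.3):
    """Return a copy of `components` with each non-empty value perturbed
    with probability `rate`. The input dict is left unchanged."""
    noisy = {}
    for name, value in components.items():
        if value and rng.random() < rate:
            perturb = rng.choice(PERTURBATIONS)
            value = perturb(value, rng)
        noisy[name] = value
    return noisy


clean = {
    "house_number": "1006",
    "base_street_name": "Main",
    "road_type": "Street",
    "city_name": "Los Angeles",
    "state": "CA",
    "zip_code": "90001",
}
print(inject_errors(clean, random.Random(42), rate=0.5))
```

Passing an explicit `random.Random` instance keeps the injection reproducible, which matters when the same noisy split must be regenerated for later comparison.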

The figures displayed below, from top to bottom, illustrate the distribution of input errors and variations within the Training, Validation, and Test sets, respectively.

Baseline models

The models folder contains the implementations of the evaluated address parsers, which are built on five different models. Each model represented a significant advance in NLP and achieved state-of-the-art (SOTA) performance at the time of its introduction.

Citation

If you find this benchmark useful, please cite the following paper: Is ChatGPT a game changer for geocoding - a benchmark for geocoding address parsing techniques

  @misc{yin2023chatgpt,
      title={Is ChatGPT a game changer for geocoding -- a benchmark for geocoding address parsing techniques}, 
      author={Zhengcong Yin and Diya Li and Daniel W. Goldberg},
      year={2023},
      eprint={2310.14360},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

About

ACM SIG Spatial GeoSearch'23 Geocoding Address Parsing Benchmark
