
GazPNE2

Introduction

We present GazPNE2, a robust and general method for extracting place names from tweet texts. It fuses deep learning, global gazetteers (i.e., OpenStreetMap and GeoNames), and pretrained transformer models (i.e., BERT and BERTweet), and requires no manually annotated data. It can extract place names at both coarse (e.g., country and city) and fine-grained (e.g., street and creek) levels, as well as place names with abbreviations (e.g., ‘tx’ for ‘Texas’ and ‘studemont rd’ for ‘studemont road’).

Test Data

The data we used to evaluate our approach is as follows:

Result

Use the code

Prepare model data

Download the trained model and unzip the files into the model folder.
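A minimal sketch of the unzip step in Python, in case you prefer to script it; 'trained_model.zip' is a placeholder for the actual archive file you downloaded:

import zipfile
from pathlib import Path

# 'trained_model.zip' is a placeholder name for the downloaded archive;
# replace it with the file you actually downloaded.
Path('model').mkdir(exist_ok=True)
with zipfile.ZipFile('trained_model.zip') as archive:
    archive.extractall('model')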

Install

Java and Python 3.7 are required.

conda create -n gazpne2 python=3.7

conda activate gazpne2

pip install -r requirements.txt

Download pretrained BERTweet model

wget https://public.vinai.io/BERTweet_base_fairseq.tar.gz

tar -xzvf BERTweet_base_fairseq.tar.gz

On the first run, the pretrained BERT models will be automatically downloaded and cached on the local drive.
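If you want to warm the cache before the first run, the download can be triggered manually; a minimal sketch, assuming the Hugging Face transformers library is available in the environment and using 'bert-base-cased' as an assumed checkpoint name (the variant GazPNE2 actually loads may differ):

from transformers import AutoModel, AutoTokenizer

# Download and cache the tokenizer and model weights ahead of the first run.
# 'bert-base-cased' is an assumption; GazPNE2 may use a different BERT variant.
AutoTokenizer.from_pretrained('bert-base-cased')
AutoModel.from_pretrained('bert-base-cased')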

Test the code

A snippet of example code is shown below.

from main import GazPNE2
gazpne2=GazPNE2() # This will take around 30 seconds to load models
tweets = ["Associates at the Kuykendahl Rd & Louetta Rd. store in Spring, TX gave our customers a reason to smile",\
"Rockport TX any photos of damage down Corpus Christi Street and Hwy 35 area? #houstonflood"]
# It is faster to input multiple tweets at once than a single tweet multiple times.
locations = gazpne2.extract_location(tweets)
print(locations)
'''This will output:
{0: [{'LOC': 'Kuykendahl Rd', 'offset': (18, 30)}, {'LOC': 'Louetta Rd', 'offset': (34, 43)},
{'LOC': 'Spring', 'offset': (55, 60)}, {'LOC': 'TX', 'offset': (63, 64)}], 
1: [{'LOC': 'Corpus Christi Street', 'offset': (38, 58)}, {'LOC': 'Hwy 35', 'offset': (64, 69)},
{'LOC': 'Rockport', 'offset': (0, 7)}, {'LOC': 'TX', 'offset': (9, 10)}, {'LOC': 'houston', 'offset': (78, 84)}]}
'''
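The returned dictionary maps the index of each input tweet to its list of extracted places, so downstream code can iterate over it directly, for example:

# Print every extracted place together with its character offsets.
for tweet_index, places in locations.items():
    for place in places:
        start, end = place['offset']
        print(f"tweet {tweet_index}: {place['LOC']} ({start}, {end})")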

Execute the command below in case of a Java error.

spack load openjdk

To extract locations from a txt file, execute the following command. In the txt file, each line corresponds to one tweet message.

python -u main.py --input=0 --input_file=data/test.txt
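The same can be done from Python with the class shown above; a minimal sketch that reads one tweet per line from the example file:

from main import GazPNE2

# Each non-empty line of the txt file is treated as one tweet message.
with open('data/test.txt', encoding='utf-8') as f:
    tweets = [line.strip() for line in f if line.strip()]

gazpne2 = GazPNE2()
locations = gazpne2.extract_location(tweets)  # one batch call is faster than per-tweet calls
print(locations)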

To evaluate on our manually annotated datasets (3,000 tweets), execute the following command.

python -u main.py --input=2

To evaluate on the 19 public datasets, execute the following command. You will get results for only part of the datasets, since some are not publicly available.

python -u main.py --input=4

Datasets [a,b,c] can be obtained from https://rebrand.ly/LocationsDataset.

Datasets [e,f] can be obtained from https://revealproject.eu/geoparse-benchmark-open-dataset/.

Datasets [g,h] can be obtained by contacting the author of the data.

Citation

If you use the code, please cite the following publication:

@article{hu2022gazpne2,
  title={GazPNE2: A general place name extractor for microblogs fusing gazetteers and pretrained transformer models},
  author={Hu, Xuke and Zhou, Zhiyong and Sun, Yeran and Kersten, Jens and Klan, Friederike and Fan, Hongchao and Wiegmann, Matti},
  journal={IEEE Internet of Things Journal},
  volume={9},
  number={17},
  pages={16259--16271},
  year={2022},
  publisher={IEEE}
}
