We present a robust and general place name extraction method from tweet texts, named GazPNE2. It fuses deep learning, global gazetteers (i.e., OpenStreetMap and GeoNames), and pretrained transformer models (i.e., BERT and BERTweet), requiring no manually annotated data. It can extract place names at both coarse (e.g., country and city) and fine-grained (e.g., street and creek) levels and place names with abbreviations (e.g., ‘tx’ for ‘Texas’ and ‘studemont rd’ for ‘studemont road’).
The data we used to evaluate our approach are described below.
Download the trained model and unzip the files into the model folder.
Java and Python 3.7 are required.
conda create -n gazpne2 python=3.7
conda activate gazpne2
pip install -r requirements.txt
tar -xzvf BERTweet_base_fairseq.tar.gz
On the first run, the pretrained BERT models will be automatically downloaded and cached on the local drive.
A snippet of example code is shown below.
from main import GazPNE2
gazpne2 = GazPNE2()  # Loading the models takes around 30 seconds
tweets = ["Associates at the Kuykendahl Rd & Louetta Rd. store in Spring, TX gave our customers a reason to smile",\
"Rockport TX any photos of damage down Corpus Christi Street and Hwy 35 area? #houstonflood"]
# It is faster to input multiple tweets at once than one single tweet multiple times.
locations = gazpne2.extract_location(tweets)
print(locations)
'''This will output:
{0: [{'LOC': 'Kuykendahl Rd', 'offset': (18, 30)}, {'LOC': 'Louetta Rd', 'offset': (34, 43)},
{'LOC': 'Spring', 'offset': (55, 60)}, {'LOC': 'TX', 'offset': (63, 64)}],
1: [{'LOC': 'Corpus Christi Street', 'offset': (38, 58)}, {'LOC': 'Hwy 35', 'offset': (64, 69)},
{'LOC': 'Rockport', 'offset': (0, 7)}, {'LOC': 'TX', 'offset': (9, 10)}, {'LOC': 'houston', 'offset': (78, 84)}]}
'''
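The returned dictionary maps each tweet's index to the detected place names and their character offsets. A minimal post-processing sketch, using the sample output above, collects just the name strings per tweet:

```python
# Minimal post-processing sketch: gather the place-name strings per tweet.
# `locations` is copied from the sample output above (tweet 0 only).
locations = {
    0: [{'LOC': 'Kuykendahl Rd', 'offset': (18, 30)},
        {'LOC': 'Louetta Rd', 'offset': (34, 43)},
        {'LOC': 'Spring', 'offset': (55, 60)},
        {'LOC': 'TX', 'offset': (63, 64)}],
}

names_per_tweet = {idx: [hit['LOC'] for hit in hits]
                   for idx, hits in locations.items()}
print(names_per_tweet[0])  # ['Kuykendahl Rd', 'Louetta Rd', 'Spring', 'TX']
```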
Execute the command below in case of a Java error.
spack load openjdk
To extract locations from a txt file, execute the following command. Each line of the txt file corresponds to one tweet message.
python -u main.py --input=0 --input_file=data/test.txt
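This mode presumably reads one tweet per line and passes them to the extractor in a single batch, which is faster than per-tweet calls. A rough sketch of the equivalent Python (the read_tweets helper is hypothetical; only GazPNE2.extract_location comes from the snippet above):

```python
# Hypothetical sketch of what --input_file does: read one tweet per line
# and extract locations from all tweets in a single batch.
def read_tweets(path):
    """Return the non-empty lines of a txt file as a list of tweet strings."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Batch call (commented out, since it requires the downloaded models):
# from main import GazPNE2
# locations = GazPNE2().extract_location(read_tweets("data/test.txt"))
```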
To evaluate on our manually annotated dataset (3,000 tweets), execute the following command.
python -u main.py --input=2
To evaluate on the 19 public datasets, execute the following command. You will get results for only some of the datasets, since several are not publicly available.
python -u main.py --input=4
Datasets [a,b,c] can be obtained from https://rebrand.ly/LocationsDataset.
Datasets [e,f] can be obtained from https://revealproject.eu/geoparse-benchmark-open-dataset/.
Datasets [g,h] can be obtained by contacting the author of the data.
If you use the code, please cite the following publication:
@article{hu2022gazpne2,
  title={GazPNE2: A general place name extractor for microblogs fusing gazetteers and pretrained transformer models},
  author={Hu, Xuke and Zhou, Zhiyong and Sun, Yeran and Kersten, Jens and Klan, Friederike and Fan, Hongchao and Wiegmann, Matti},
  journal={IEEE Internet of Things Journal},
  volume={9},
  number={17},
  pages={16259--16271},
  year={2022},
  publisher={IEEE}
}