GazPNE2
Introduction
We present a robust and general method, named GazPNE2, for extracting place names from tweet texts. It fuses deep learning, global gazetteers (i.e., OpenStreetMap and GeoNames), and pretrained transformer models (i.e., BERT and BERTweet), requiring no manually annotated data. It can extract place names at both coarse (e.g., country and city) and fine-grained (e.g., street and creek) levels, as well as abbreviated place names (e.g., ‘tx’ for ‘Texas’ and ‘studemont rd’ for ‘studemont road’).
Test Data
The data we used to evaluate our approach is as follows:
Result
Use the code
Prepare model data
Download the trained model and unzip the files into the model folder.
Install
Java and Python 3.7 are required.
conda create -n gazpne2 python=3.7
conda activate gazpne2
pip install -r requirements.txt
Download pretrained BERTweet model
tar -xzvf BERTweet_base_fairseq.tar.gz
On the first run, the pretrained BERT models will be automatically downloaded and cached on the local drive.
Test the code
A snippet of example code is shown below.
from main import GazPNE2
gazpne2=GazPNE2() # This will take around 30 seconds to load models
tweets = ["Associates at the Kuykendahl Rd & Louetta Rd. store in Spring, TX gave our customers a reason to smile",\
"Rockport TX any photos of damage down Corpus Christi Street and Hwy 35 area? #houstonflood"]
# It is faster to input multiple tweets at once than to input a single tweet multiple times.
locations = gazpne2.extract_location(tweets)
print(locations)
'''This will output:
{0: [{'LOC': 'Kuykendahl Rd', 'offset': (18, 30)}, {'LOC': 'Louetta Rd', 'offset': (34, 43)},
{'LOC': 'Spring', 'offset': (55, 60)}, {'LOC': 'TX', 'offset': (63, 64)}],
1: [{'LOC': 'Corpus Christi Street', 'offset': (38, 58)}, {'LOC': 'Hwy 35', 'offset': (64, 69)},
{'LOC': 'Rockport', 'offset': (0, 7)}, {'LOC': 'TX', 'offset': (9, 10)}, {'LOC': 'houston', 'offset': (78, 84)}]}
'''
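Judging from the example output above, each 'offset' appears to be a pair of inclusive character indices into the original tweet, so a span can be recovered with tweet[start:end + 1]. A minimal sketch, reusing the first example tweet and three of the offsets shown above (the inclusive-end interpretation is an inference from that output, not a documented guarantee):

```python
# Recover place-name spans from the (start, end) offsets returned by
# extract_location. The end index appears to be inclusive, hence end + 1.
tweet = ("Associates at the Kuykendahl Rd & Louetta Rd. store in "
         "Spring, TX gave our customers a reason to smile")

locations = [
    {'LOC': 'Kuykendahl Rd', 'offset': (18, 30)},
    {'LOC': 'Spring', 'offset': (55, 60)},
    {'LOC': 'TX', 'offset': (63, 64)},
]

for loc in locations:
    start, end = loc['offset']
    span = tweet[start:end + 1]  # slice with inclusive end index
    print(loc['LOC'], '->', span)
```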
If you encounter a Java error, execute the command below.
spack load openjdk
To extract locations from a txt file, execute the following command. Each line of the txt file corresponds to one tweet message.
python -u main.py --input=0 --input_file=data/test.txt
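The same one-tweet-per-line convention can also be used with the library interface instead of the CLI. Below is a small sketch; load_tweets is a hypothetical helper name, not part of GazPNE2:

```python
def load_tweets(path):
    """Read a txt file with one tweet message per line.

    Returns a list of non-empty tweet strings, ready to be passed to
    GazPNE2.extract_location (batching all tweets in one call is faster
    than calling it once per tweet).
    """
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

# Hypothetical usage, assuming GazPNE2 has already been loaded:
# tweets = load_tweets('data/test.txt')
# locations = gazpne2.extract_location(tweets)
```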
To test on our manually annotated dataset (3,000 tweets), execute the following command.
python -u main.py --input=2
To test on the 19 public datasets, execute the following command. You will obtain results for only a subset of the datasets, since some are not publicly available.
python -u main.py --input=4
Datasets [a,b,c] can be obtained from https://rebrand.ly/LocationsDataset.
Datasets [e,f] can be obtained from https://revealproject.eu/geoparse-benchmark-open-dataset/.
Datasets [g,h] can be obtained by contacting the author of the data.
Citing
If you make use of GazPNE2 or any of its components, please cite the following publication:
X. Hu et al., "GazPNE2: A general place name extractor for microblogs fusing gazetteers and pretrained transformer models," in IEEE Internet of Things Journal, doi: 10.1109/JIOT.2022.3150967.
Contact
If you have any questions, feel free to contact Xuke Hu via xuke.hu@dlr.de