<a href="https://colab.research.google.com/github/segmue/Geo871_geoparsing_example/blob/main/GEO871_geoparser_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Example Code Different Location Reference Recognitions

This notebook contains sample code for the following methods:
- a simple gazetteer lookup
- using spaCy Natural Language Processing Models (NLP)
- geoparser package



**Important Note: The first two code blocks need ~15 Minutes to load, start them early**




In [9]:
%%capture
!pip install geoparser
!pip install geopandas
!pip install mapclassify
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm

Next, the code below downloads the geonames gazeteer (worldwide) and loads it in a SQL database. Smaller, or local gazeteers are not natively supported yet (This part downloads ca. 5 GB data, so give it some time)

In [10]:
!python -m geoparser download geonames

2024-10-23 12:04:46.251215: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-23 12:04:46.570493: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-23 12:04:46.657480: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-23 12:04:47.174654: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Database setup...
Downloading allCountries.zip: 100% 

## Swissnames3D Gazetteer Lookup

#### Introduction:
This task should show the limitations and possibilities of using a gazetteer (e.g. a list of placenames) for location reference recognition. For this, we will use a lightweight version of Swissnames3D. All rows in the original gazetter (which would also contain geometries) were just grouped by name and their occurances counted.


#### Questions:
- Test the 'lookup_words_in_gazeteer' function with different sentences, what do you notice?
- What problems do you detect?
- Could you think of possible solutions, how to improve this approach?

##### Small Coding task (voluntary)
- Which five placenames have the highest occurance in the swiss gazetteer?
- How many times does 'Zürich' occur in the gazetteer? What about alternative ways of spelling Zürich or as part of a whole string? (e.g. "Zürich HB")

In [None]:
import pandas as pd
swissnames = pd.read_csv('https://raw.githubusercontent.com/segmue/Geo871_geoparsing_example/af77e1653b92919eba234c793bb021dc190bbb7d/swissnames_dict.csv')
swissnames = dict(zip(swissnames.Name, swissnames.Count))

In [None]:
def lookup_words_in_gazeteer(strings):
  if not isinstance(strings, list):
    strings = [strings]
  counter = 0
  for string in strings:
      counter += 1
      words = string.split()
      ## Make found words a tuple, containing the dict Key and Value
      found_words = [(word, swissnames[word.lower()]) for word in words if word.lower() in swissnames]

      #found_words = [word for word in words if word.lower() in swissnames]
      print(f'Sample {counter} ------------')
      print(f"Placenames: {found_words}")

In [None]:
examples = ['Vom Bahnhof Buchs aus nehme ich den Zug nach Zürich',
            'Hier steht ein beispielhafter Satz']

match_words_with_gazeteer(examples)

Sample 1 ------------
Placenames: [('Bahnhof', 8), ('Buchs', 13), ('Zug', 15), ('Zürich', 2)]
Sample 2 ------------
Placenames: [('Satz', 9)]


## Example using SpaCy

#### Introduction
This time, we're using a pretrained language model to recognize locations in text. Spacy flags basically every wort in the sentence with a label. We're looking for entities, which have label LOC (Location), GPE (Geo-Political Entity) or FAC (Facility).
The next code blocks are either for English Language or German. Use whatever you like.

#### Questions:
- Again, test different sentences. Is it better than the gazetter lookup?
- Which of the two methods do you think would have a higher precision value? Which has higher recall values?

In [None]:
## English Language Model
import spacy

nlp = spacy.load("en_core_web_sm")

def find_locations(texts):
    for i, text in enumerate(texts, 1):
        doc = nlp(text)
        locations = [ent.text for ent in doc.ents if ent.label_ in {"LOC", "FAC", "GPE"}]
        print(f"text {i} ------")
        print(f"Locations: {locations if locations else 'None found'}\n")

# Example usage:
texts = [
    "Each morning, I walk to the Train Station in Chelsea, and take the train to Greenwich. The train even drives to Westminster and Brixton.",
    "There is a small town near the mountains.",
    "Berlin is the capital of Germany.",
    "Interestingly, the city of Bath in Somerset was founded by the Roman Empire, because of the hot springs."
]

find_locations(texts)

KeyboardInterrupt: 

In [None]:
## German Language Model

import spacy

nlp = spacy.load("de_core_news_sm")

def find_locations(texts):
    for i, text in enumerate(texts, 1):
        doc = nlp(text)
        locations = [ent.text for ent in doc.ents if ent.label_ in {"LOC", "FAC", "GPE"}]
        print(f"text {i} ------")
        print(f"Locations: {locations if locations else 'None found'}\n")

# Example usage:
examples = ['Vom Bahnhof Buchs aus nehme ich den Zug nach Zürich',
            'Hier steht ein beispielhafter Satz']

find_locations(examples)

## Geoparser

geoparser is a quite new package developed during a GIUZ Master Thesis.
It is a hybrid approach, using a global gazeteer (geonames), a NLP (spaCy) and a transformer deep learning architecture.
(https://github.com/dguzh/geoparser)

- We load a language model from spacy.
- The transformer model is a pre-trained Deep Learning Model.
- The Gazeteer used is a global Gazetter called "geonames"

## Exercise:
Here you have some example texts.
You can use these first examples to test the location reference recognition.

Task1:
- Run the code below and the next chunk. The output should be a list of recognized toponyms in the document. Are all locations recognized? Why not?

- Test some different sentences or phrasing. Does the parser recognize all your locations?


Task2 (voluntarily):
- Use the map_toponyms function to look at the fully geocoded locations in the map. Change the sentences as you like and see how the geocoder works.

In [11]:
from geoparser import Geoparser

geo = Geoparser(spacy_model='en_core_web_sm', transformer_model='dguzh/geo-all-MiniLM-L6-v2', gazetteer='geonames')

  from tqdm.autonotebook import tqdm, trange
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/2.40k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/741 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.43k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [12]:
def print_toponyms(text):
  if not isinstance(text, list):
    text = [text]
  parsed_docs = geo.parse(text)
  counter = 0
  for doc in parsed_docs:
    print(f"Sample {counter} ------------")
    counter += 1
    for toponym in doc.toponyms:
        print(toponym)
        if toponym.location:
            print(toponym.location['latitude'], toponym.location['longitude'])
        else:
            print("No location found")

In [13]:
text_samples = [
    "I love to buy Pork in York, New Yorkshire. But the ham in New York is also lovely",
    "Arabic is the 6th most common language in the United States",
    "Interestingly, the city of Bath in Somerset was founded by the Roman Empire, because of the hot springs."
    "I think I'll take a Bath today"
]
print_toponyms(text_samples)

Toponym Recognition...


Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Toponym Resolution...


Batches:   0%|          | 0/31 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Sample 0 ------------
York
53.95763 -1.08271
New Yorkshire
54.44158 -1.91088
New York
55.02485 -1.48619
Sample 1 ------------
the United States
18.34829 -64.98348
Sample 2 ------------
Somerset
51.08333 -3.0
the Roman Empire
No location found


In [30]:
import geopandas as gpd
from shapely.geometry import Point

def map_toponyms(docs):
    all_toponyms = []

    for counter, doc in enumerate(geo.parse(docs), start=1):
        for toponym in doc.toponyms:
            if toponym.location:
                latitude = toponym.location['latitude']
                longitude = toponym.location['longitude']

                toponym_data = {
                    'name': toponym,
                    'sample_number': counter,
                    'Text' : doc,
                    'geometry': Point(longitude, latitude)
                }
                all_toponyms.append(toponym_data)

    gdf = gpd.GeoDataFrame(all_toponyms, crs="EPSG:4326")
    return gdf.explore(marker_type="marker")

In [31]:
map_toponyms(text_samples)

Toponym Recognition...


Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Toponym Resolution...


Batches:   0%|          | 0/31 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

### Other Example, using Tweets
You can try out a corpus of disaster tweets from kaggle
https://www.kaggle.com/datasets/vstepanenko/disaster-tweets?resource=download

which were already uploaded on github

In [20]:
import pandas as pd
tweets = pd.read_csv('https://raw.githubusercontent.com/segmue/Geo871_geoparsing_example/refs/heads/main/tweets.csv')

list(tweets['text'])

In [32]:
# All 10'000 Tweets is a bit too large, so we take the first 200:
tweets_reduced = tweets['text'].tolist()[0:200]
map_toponyms(tweets_reduced)

Toponym Recognition...


Batches:   0%|          | 0/200 [00:00<?, ?it/s]

Toponym Resolution...


Batches:   0%|          | 0/71 [00:00<?, ?it/s]

Batches:   0%|          | 0/11 [00:00<?, ?it/s]