# Irchel Geoparser Workshop: Getting Started with Geoparser

This notebook is designed to help you explore and interact with the Geoparser library. By the end of this tutorial, you will have a solid understanding of how to use Geoparser for basic geoparsing tasks.

**Objectives:**

- Understand how to initialize and use Geoparser.
- Perform geoparsing on text strings and documents.
- Access and analyze geoparsing results.
- Visualize locations on a map.

---

## 📖 **Documentation**

This tutorial requires you to consult the Geoparser documentation: [docs.geoparser.app](https://docs.geoparser.app). Knowing where to find information about the library is crucial for working with it effectively, especially because future updates may change its functionality. The documentation is the definitive reference for how to use the package.

---

## 1. Geoparsing Text Strings <a name="geoparsing-text-strings"></a>

Let's begin by geoparsing some simple text strings directly within this notebook.

### 1.1 Initialize Geoparser <a name="task-initialize-geoparser"></a>

**Objective:** Initialize the Geoparser so that it's ready to parse text.

**Instructions:**

- Import the necessary class to use Geoparser.
- Initialize an instance of the `Geoparser` class.

By default, Geoparser uses the following configuration:

geoparser = Geoparser(
    spacy_model="en_core_web_sm",
    transformer_model="dguzh/geo-all-MiniLM-L6-v2",
    gazetteer="geonames"
)

These defaults prioritize speed over accuracy and are optimized for English texts. If you require higher accuracy and don’t mind increased computational cost, or need to process texts in other languages, you can specify different models as shown in the Advanced Usage section.


In [None]:
# To start using Geoparser, import it and create an instance of the Geoparser class:
from geoparser import Geoparser

In [3]:
geoparser = Geoparser()

### 1.2 Geoparse a List of Strings <a name="task-geoparse-a-list-of-strings"></a>

**Objective:** Use the Geoparser to parse a list of text strings.

**Instructions:**

- Use the example texts provided below.
- Use the initialized Geoparser to parse these texts.
- Store the results in a variable for further analysis.

**Example Texts:**

In [4]:
texts = [
    "The University of Zurich, located in the heart of Zurich, Switzerland, is the largest university in Switzerland. It was founded in 1833 and has since become a leading institution in Europe.",
    "Researchers at the University of Zurich frequently collaborate with institutions in Germany, Belgium, Japan and Australia.",
    "The Botanical Garden of the University of Zurich is a popular attraction, featuring plants from regions like the Amazon rainforest and the Himalayas.",
    "Notable alumni include Albert Einstein, who later worked at the Institute for Advanced Study in Princeton, New Jersey.",
    "Students at the University of Zurich have the opportunity to study abroad in places like Tokyo, Sydney, and Cape Town."
]

**Your Code:**

*(Write your code in the cell below.)*

In [5]:
docs = geoparser.parse(texts)

Toponym Recognition...

Toponym Resolution...

### 1.3 Exploring Geoparsing Results <a name="exploring-geoparsing-results"></a>

**Objective:** Understand and explore the structure of the results.

Now that you've parsed the texts, let's explore the results. Geoparsed documents are stored in `GeoDoc` objects, which contain `GeoSpan` objects for each recognized toponym. Each `GeoSpan` represents a toponym and provides access to resolved location information.

**Instructions:**

- Iterate over the parsed documents:
    - For each document:
      - Print the original text.
      - For each toponym found in the document:
        - Print the toponym text.
        - Access the resolved location information.
        - Print some attributes of the resolved location.

**Hints:**

- Use `for` loops to iterate over the documents and toponyms.
- Check if the `location` attribute of a toponym is valid (i.e. not `None`) before accessing its attributes.

**Your Code:**

*(Write your code in the cell below.)*

In [10]:
# Option 1. Iterate through the toponyms in a document to access the resolved locations:
for doc in docs:
    print(f"Document: {doc.text}")
    for toponym in doc.toponyms:
        print(f"- Toponym: {toponym.text}")
        location = toponym.location
        if location:
            print(f"  Resolved Location: {location['name']}, {location['country_name']}")
            print(f"  Feature Type: {location['feature_type']}")
            print(f"  Coordinates: ({location['latitude']}, {location['longitude']})")
            print(f"  Score: {toponym.score}")
        else:
            print("Location could not be resolved.")
    print()

Document: The University of Zurich, located in the heart of Zurich, Switzerland, is the largest university in Switzerland. It was founded in 1833 and has since become a leading institution in Europe.
- Toponym: Zurich
  Resolved Location: Zürich, Switzerland
  Feature Type: seat of a first-order administrative division
  Coordinates: (47.36667, 8.55)
  Score: 0.8428872227668762
- Toponym: Switzerland
  Resolved Location: Switzerland, Switzerland
  Feature Type: independent political entity
  Coordinates: (47.00016, 8.01427)
  Score: 0.8402798175811768
- Toponym: Switzerland
  Resolved Location: Switzerland, Switzerland
  Feature Type: independent political entity
  Coordinates: (47.00016, 8.01427)
  Score: 0.8402798175811768
- Toponym: Europe
  Resolved Location: Europe, None
  Feature Type: continent
  Coordinates: (48.69096, 9.14062)
  Score: 0.7988290190696716

Document: Researchers at the University of Zurich frequently collaborate with institutions in Germany, Belgium, Japan and A

In [9]:
# Option 2. When working with large datasets, it is recommended to access location data through the doc.locations property, which bundles the location retrieval of all toponyms within a document into a single database query:

for doc in docs:
    print(f"Document: {doc.text}")
    for toponym, location in zip(doc.toponyms, doc.locations):
        print(f"- Toponym: {toponym.text}")
        if location:
            print(f"  Resolved Location: {location['name']}, {location['country_name']}")
            print(f"  Feature Type: {location['feature_type']}")
            print(f"  Coordinates: ({location['latitude']}, {location['longitude']})")
            print(f"  Score: {toponym.score}")
        else:
            print("Location could not be resolved.")
    print()

Document: The University of Zurich, located in the heart of Zurich, Switzerland, is the largest university in Switzerland. It was founded in 1833 and has since become a leading institution in Europe.
- Toponym: Zurich
  Resolved Location: Zürich, Switzerland
  Feature Type: seat of a first-order administrative division
  Coordinates: (47.36667, 8.55)
  Score: 0.8428872227668762
- Toponym: Switzerland
  Resolved Location: Switzerland, Switzerland
  Feature Type: independent political entity
  Coordinates: (47.00016, 8.01427)
  Score: 0.8402798175811768
- Toponym: Switzerland
  Resolved Location: Switzerland, Switzerland
  Feature Type: independent political entity
  Coordinates: (47.00016, 8.01427)
  Score: 0.8402798175811768
- Toponym: Europe
  Resolved Location: Europe, None
  Feature Type: continent
  Coordinates: (48.69096, 9.14062)
  Score: 0.7988290190696716

Document: Researchers at the University of Zurich frequently collaborate with institutions in Germany, Belgium, Japan and A

Note that "University of Zurich" does not appear among the toponyms above: spaCy recognises it as an organization, not a location.

---

## 2. Geoparsing Text Documents <a name="geoparsing-text-documents"></a>

In this section, we'll work with actual text documents.

### 2.1 Loading Text Files <a name="loading-text-files"></a>

**Objective:** Load text files from your `data` folder.

**Instructions:**

- Prepare some text files that you would like to geoparse.
  - You can use your own documents or, for example, download text files from [this site](http://textfiles.com/wdirectory.html).
- Store the `.txt` files in a dedicated `data` folder in your working directory.
- Use the provided function `load_texts_from_directory` to load all `.txt` files from the `data` directory.

**Prepared Code:**

In [11]:
import os
import re



In [19]:
def load_texts_from_directory(directory):
    texts = []
    filenames = []
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                text = file.read()
                # Replace newlines followed by a non-capitalized letter with a space
                text = re.sub(r'\n([a-z])', r' \1', text)
                texts.append(text)
                filenames.append(filename)
    return texts, filenames



In [21]:
# Load the texts
texts, filenames = load_texts_from_directory('data')

In [22]:
# Check loaded texts
for filename, text in zip(filenames, texts):
    print(f"{filename}: {len(text)} characters")

bbscase.txt: 11159 characters
banned.txt: 15060 characters
bandbook.txt: 6090 characters
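
To see what the newline-joining step in `load_texts_from_directory` does, the short sketch below applies the same regular expression to a made-up sample string. The sample text is an assumption for illustration and is not taken from the workshop files.

```python
import re

# Hypothetical sample with hard line wraps, as found in many plain-text files.
sample = "The city of\nzurich lies on\nLake Zurich.\nIt is in Switzerland."

# Same pattern as in load_texts_from_directory: a newline followed by a
# lowercase ASCII letter is treated as a wrapped line and replaced by a space.
joined = re.sub(r'\n([a-z])', r' \1', sample)

print(repr(joined))
```

Line breaks that precede a capital letter are left untouched, so a wrapped multi-word place name can still contain a newline after loading. This is probably why some toponyms in the output of section 2.3 are printed across two lines.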


### 2.2 Geoparse Loaded Documents <a name="task-geoparse-loaded-documents"></a>

**Objective:** Use Geoparser to parse the loaded text documents.

**Instructions:**

- Use the Geoparser to parse the list of texts you just loaded.
- Store the results in a variable.

**Your Code:**

*(Write your code in the cell below.)*

In [25]:
# Geoparse the texts
docs = geoparser.parse(texts)
    

Toponym Recognition...

Toponym Resolution...

Token indices sequence length is longer than the specified maximum sequence length for this model (344 > 256). Running this sequence through the model will result in indexing errors

### 2.3 Analyzing Results <a name="analyzing-results"></a>

**Objective:** Analyze the geoparsing results qualitatively.

**Instructions:**

- For each document:
  - Print the filename.
  - Print the number of toponyms found.
  - List the toponyms and their resolved locations.
  - For each resolved toponym, also print the similarity score.

**Hints:**

- You can retrieve the sentence the toponym appears in using the `GeoSpan.sent.text` property.
- Use `for doc, filename in zip(docs, filenames)` to iterate through docs and filenames in parallel.

**Analyze the Results:**

- Qualitatively assess the results regarding toponym recognition and resolution.
- Consider how the similarity score reflects the quality of the resolution.

**Your Code:**

*(Write your code in the cell below.)*

In [28]:
# For each document geoparsed above:
#   - Print the filename.
#   - Print the number of toponyms found.
#   - List the toponyms and their resolved locations.
#   - For each resolved toponym, also print the similarity score.

for filename, doc in zip(filenames, docs):
    print(f"Filename: {filename}")
    print(f"Number of Toponyms: {len(doc.toponyms)}")
    for toponym, location in zip(doc.toponyms, doc.locations):
        print(f"- Toponym: {toponym.text}")
        if location:
            print(f"  Resolved Location: {location['name']}, {location['country_name']}")
            print(f"  Feature Type: {location['feature_type']}")
            print(f"  Coordinates: ({location['latitude']}, {location['longitude']})")
            print(f"  Score: {toponym.score}")
        else:
            print("Location could not be resolved.")
    print()

Filename: bbscase.txt
Number of Toponyms: 7
- Toponym: Munroe
Falls
  Resolved Location: Munroe Falls, United States
  Feature Type: populated place
  Coordinates: (41.1445, -81.43983)
  Score: 0.6490542888641357
- Toponym: Summit County
  Resolved Location: Summit County, United States
  Feature Type: second-order administrative division
  Coordinates: (41.12598, -81.53217)
  Score: 0.8736891746520996
- Toponym: the
City
  Resolved Location: Thompsonville, United States
  Feature Type: populated place
  Coordinates: (41.99704, -72.59898)
  Score: 0.6001893877983093
- Toponym: California
  Resolved Location: California, United States
  Feature Type: first-order administrative division
  Coordinates: (37.25022, -119.75126)
  Score: 0.9630898237228394
- Toponym: BBS'ing
Location could not be resolved.
- Toponym: Cleveland
  Resolved Location: Cleveland, United States
  Feature Type: seat of a second-order administrative division
  Coordinates: (41.4995, -81.69541)
  Score: 0.897909402847
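
One way to approach the question of how the similarity score reflects resolution quality is to flag unresolved or low-scoring toponyms for manual review. The sketch below uses the toponym and score pairs from the `bbscase.txt` excerpt above. The threshold of 0.7 and the helper name `needs_review` are assumptions chosen for illustration, not values recommended by the library.

```python
# (toponym text, similarity score) pairs copied from the bbscase.txt excerpt;
# line breaks inside toponyms are replaced by spaces, and None marks a
# toponym that could not be resolved.
results = [
    ("Munroe Falls", 0.6490542888641357),
    ("Summit County", 0.8736891746520996),
    ("the City", 0.6001893877983093),
    ("California", 0.9630898237228394),
    ("BBS'ing", None),
    ("Cleveland", 0.897909402847),
]

THRESHOLD = 0.7  # assumed cut-off, for illustration only


def needs_review(results, threshold):
    """Return toponyms that are unresolved or scored below the threshold."""
    return [text for text, score in results if score is None or score < threshold]


flagged = needs_review(results, THRESHOLD)
print(flagged)
```

With this cut-off, "the City" and "BBS'ing" are flagged, but so is "Munroe Falls", whose resolution looks plausible given the neighbouring Summit County. A single threshold is therefore a rough filter at best and does not replace qualitative inspection.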

### 2.4 Exploring Different Models <a name="task-exploring-different-models"></a>

**Objective:** Explore how using different spaCy and transformer models affects the geoparsing results in terms of toponym recognition and resolution.

**Instructions:**

- Reinitialize Geoparser with different configurations.
- Experiment with a different spaCy model (e.g. `en_core_web_trf`).
- Experiment with a different transformer model (e.g. `dguzh/geo-all-distilroberta-v1`).
- Compare the results (briefly).

**Hints:**

- You may need to install additional spaCy models if they are not already installed.

**Note:** The suggested models are expected to be more accurate, but can also significantly increase runtimes when used for large collections of text.

**Your Code:**

*(Write your code in the cell below.)*

In [46]:
def geoparse_spacy(model, texts):
    # Initialize Geoparser with the given spaCy model (e.g. "en_core_web_trf");
    # the transformer model and gazetteer keep their defaults
    geoparser = Geoparser(spacy_model=model)

    # Parse the texts
    docs = geoparser.parse(texts)

    # Print results like above
    for filename, doc in zip(filenames, docs):
        print(f"Filename: {filename}")
        print(f"Number of Toponyms: {len(doc.toponyms)}")
        for toponym, location in zip(doc.toponyms, doc.locations):
            print(f"- Toponym: {toponym.text}")
            if location:
                print(f"  Resolved Location: {location['name']}, {location['country_name']}")
                print(f"  Feature Type: {location['feature_type']}")
                print(f"  Coordinates: ({location['latitude']}, {location['longitude']})")
                print(f"  Score: {toponym.score}")
            else:
                print("Location could not be resolved.")
        print()

In [47]:
geoparse_spacy("en_core_web_trf", texts)

Toponym Recognition...

Toponym Resolution...

Token indices sequence length is longer than the specified maximum sequence length for this model (261 > 256). Running this sequence through the model will result in indexing errors

Filename: bbscase.txt
Number of Toponyms: 11
- Toponym: Munroe
  Resolved Location: Munroe Falls, United States
  Feature Type: populated place
  Coordinates: (41.1445, -81.43983)
  Score: 0.6490542888641357
- Toponym: Falls
  Resolved Location: Falls County, United States
  Feature Type: second-order administrative division
  Coordinates: (31.25327, -96.93585)
  Score: 0.5819347500801086
- Toponym: OH
  Resolved Location: Ohio, United States
  Feature Type: first-order administrative division
  Coordinates: (40.25034, -83.00018)
  Score: 0.8684738874435425
- Toponym: Munroe Falls
  Resolved Location: Munroe Falls, United States
  Feature Type: populated place
  Coordinates: (41.1445, -81.43983)
  Score: 0.6490542888641357
- Toponym: Summit County
  Resolved Location: Summit County, United States
  Feature Type: second-order administrative division
  Coordinates: (41.12598, -81.53217)
  Score: 0.8020797967910767
- Toponym: Munroe Falls
  Resolved Location: Munroe Falls, United States
 

In [48]:
def geoparse_transformer(model, texts):
    # Initialize Geoparser with the given transformer model
    # (e.g. "dguzh/geo-all-distilroberta-v1"); the spaCy model keeps its default
    geoparser = Geoparser(transformer_model=model)

    # Parse the texts
    docs = geoparser.parse(texts)

    # Print results like above
    for filename, doc in zip(filenames, docs):
        print(f"Filename: {filename}")
        print(f"Number of Toponyms: {len(doc.toponyms)}")
        for toponym, location in zip(doc.toponyms, doc.locations):
            print(f"- Toponym: {toponym.text}")
            if location:
                print(f"  Resolved Location: {location['name']}, {location['country_name']}")
                print(f"  Feature Type: {location['feature_type']}")
                print(f"  Coordinates: ({location['latitude']}, {location['longitude']})")
                print(f"  Score: {toponym.score}")
            else:
                print("Location could not be resolved.")
        print()

In [49]:
geoparse_transformer("dguzh/geo-all-distilroberta-v1", texts)

Toponym Recognition...

Toponym Resolution...

Token indices sequence length is longer than the specified maximum sequence length for this model (573 > 512). Running this sequence through the model will result in indexing errors

Filename: bbscase.txt
Number of Toponyms: 7
- Toponym: Munroe
Falls
  Resolved Location: Munroe Falls, United States
  Feature Type: populated place
  Coordinates: (41.1445, -81.43983)
  Score: 0.68886798620224
- Toponym: Summit County
  Resolved Location: Summit County, United States
  Feature Type: second-order administrative division
  Coordinates: (41.12598, -81.53217)
  Score: 0.8035531044006348
- Toponym: the
City
  Resolved Location: Sherman, United States
  Feature Type: populated place
  Coordinates: (41.57926, -73.49568)
  Score: 0.5464845895767212
- Toponym: California
  Resolved Location: California, United States
  Feature Type: first-order administrative division
  Coordinates: (37.25022, -119.75126)
  Score: 0.9667490720748901
- Toponym: BBS'ing
Location could not be resolved.
- Toponym: Cleveland
  Resolved Location: Cleveland, United States
  Feature Type: seat of a second-order administrative division
  Coordinates: (41.4995, -81.69541)
  Score: 0.8633217215538025
- T
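
For a brief comparison of recognition results, the sets of recognized toponyms from two runs can be compared directly. The sketch below uses the toponyms visible in the truncated `bbscase.txt` excerpts for the default spaCy model and for `en_core_web_trf`. Because both listings are cut off, the differences it reports are only indicative. The helper name `normalize` is hypothetical.

```python
# Toponyms as printed in the (truncated) excerpts above.
default_run = ["Munroe\nFalls", "Summit County", "the\nCity", "California", "BBS'ing", "Cleveland"]
trf_run = ["Munroe", "Falls", "OH", "Munroe Falls", "Summit County", "Munroe Falls"]


def normalize(toponyms):
    """Collapse internal whitespace (including newlines) and drop duplicates."""
    return {" ".join(t.split()) for t in toponyms}


default_set = normalize(default_run)
trf_set = normalize(trf_run)

shared = sorted(default_set & trf_set)        # recognized by both runs
only_default = sorted(default_set - trf_set)  # only in the default run
only_trf = sorted(trf_set - default_set)      # only in the en_core_web_trf run

print("Shared:", shared)
print("Only default:", only_default)
print("Only en_core_web_trf:", only_trf)
```

With complete results, the same comparison could be built from `toponym.text` for every toponym in each run's documents.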

---

## 3. Visualizing Locations on a Map <a name="visualizing-locations-on-a-map"></a>

Visualizing the extracted locations on a map can provide valuable insights.

### 3.1 Preparing Data for Mapping <a name="preparing-data-for-mapping"></a>

**Objective:** Extract coordinates of resolved locations and prepare them for mapping.

**Instructions:**

- Extract the coordinates of all resolved locations from the parsed documents.
- Prepare a list of tuples containing the latitude and longitude of each location.
  - The expected structure is a list of tuples: `[(latitude1, longitude1), (latitude2, longitude2), ...]`

**Your Code:**

*(Write your code in the cell below.)*

In [None]:
# Your code here
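
A minimal sketch of the extraction step is shown below. It runs on stand-in dictionaries shaped like the location dictionaries printed in section 1.3, so it does not depend on a parsed corpus. The function name `extract_coordinates` and the sample data are assumptions for illustration.

```python
def extract_coordinates(locations):
    """Collect (latitude, longitude) tuples, skipping unresolved locations."""
    return [
        (loc["latitude"], loc["longitude"])
        for loc in locations
        if loc  # unresolved toponyms yield None
    ]


# Stand-in data with the same keys as the location dictionaries used earlier.
sample_locations = [
    {"name": "Zürich", "latitude": 47.36667, "longitude": 8.55},
    None,
    {"name": "Switzerland", "latitude": 47.00016, "longitude": 8.01427},
]

sample_coordinates = extract_coordinates(sample_locations)
print(sample_coordinates)
```

With real results, the same function could be fed all locations at once, for example `extract_coordinates(loc for doc in docs for loc in doc.locations)`.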

### 3.2 Map the Geoparsed Locations <a name="task-map-the-geoparsed-locations"></a>

**Objective:** Create an interactive map displaying the resolved locations.

**Instructions:**

- Use the provided function `create_map` to generate a map with markers for each location.

**Prepared Code:**

In [None]:
import folium

def create_map(coordinates):
    # Initialize map
    m = folium.Map()

    # Add markers to the map
    for coord in coordinates:
        folium.Marker(
            location=[coord[0], coord[1]],
        ).add_to(m)

    return m

# Create and display the map
m = create_map(coordinates)  # Replace 'coordinates' with your variable
m
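
Because `folium.Map()` is created without a centre or zoom level, the markers may appear small on a world-scale view. One option is to compute a bounding box from the coordinates and pass it to the map's `fit_bounds` method. The sketch below shows only the pure-Python bounding-box step on sample coordinates taken from the outputs above. The helper name `bounding_box` is hypothetical, and the sketch ignores the special case of locations spanning the antimeridian.

```python
def bounding_box(coordinates):
    """Return [[south, west], [north, east]] for a list of (lat, lon) tuples."""
    lats = [lat for lat, _ in coordinates]
    lons = [lon for _, lon in coordinates]
    return [[min(lats), min(lons)], [max(lats), max(lons)]]


# Zürich, Cleveland and California, as resolved in the earlier outputs.
sample_coordinates = [(47.36667, 8.55), (41.4995, -81.69541), (37.25022, -119.75126)]

bounds = bounding_box(sample_coordinates)
print(bounds)

# With a folium map `m`, the view could then be adjusted via m.fit_bounds(bounds).
```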

---

## 4. Additional Exercises <a name="additional-exercises"></a>

### 4.1 Handling Ambiguous Toponyms <a name="handling-ambiguous-toponyms"></a>

**Objective:** Explore how contextual clues affect toponym disambiguation.

Ambiguous toponyms like "Paris" can refer to multiple places around the world. Geoparser uses contextual clues to disambiguate such toponyms.

**Instructions:**

- Parse the provided example texts.

**Analyze the Results:**

- Qualitatively assess how Geoparser resolves the toponym "Paris" in each case.
- How does the context influence the resolution?
- What limitations can you observe?

**Example Texts:**

In [None]:
ambiguous_texts = [
    "The Eiffel Tower is one of the most famous landmarks in Paris.",
    "After passing through Paris, we drove all the way to Dallas.",
    "I have friends living in Paris, Ontario, who love to kayak on the Grand River.",
    "The Governor of Texas visited Paris for the 2024 Summer Olympics.",
    "There are 34 places named Paris in the United States. Examples include Paris, Arkansas, Paris, Texas and Paris, Wisconsin."
]

**Your Code:**

*(Write your code in the cell below.)*

In [None]:
# Your code here

---

### 4.2 Filtering Candidate Locations <a name="filtering-candidate-locations"></a>

**Objective:** Use filters to restrict candidate locations during parsing.

In some cases, you might want to restrict the candidate locations considered during geoparsing. This can be done by applying filters during candidate generation.

**Instructions:**

- Use the ambiguous texts from the previous exercise.
- Apply a filter during parsing so that it only considers locations in France.

**Note:** In practice, only use filters if you are certain about the geographic scope of the location references in your texts. Otherwise, parsing without a filter is recommended.


**Your Code:**

*(Write your code in the cell below.)*

In [None]:
# Your code here

---

## 5. Bonus Exercise: Integrating Geoparser into a Workflow <a name="bonus-exercise"></a>

In this bonus exercise, we'll see how Geoparser can be integrated into a real-world workflow. We'll fetch news article headlines based on a keyword, parse them with Geoparser, and visualize the extracted locations on a map.

### 5.1 Fetching News Articles <a name="fetching-news-articles"></a>

**Objective:** Fetch news article headlines from the Google News RSS feed based on a keyword.

**Instructions:**

- Choose a keyword to search for news articles (e.g., "Technology", "Climate Change", etc.).
- Use the `fetch_news_articles` function to get the article titles.

**Prepared Code:**

In [None]:
import requests
import xml.etree.ElementTree as ET

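def fetch_news_articles(keyword):
    # Pass the keyword via `params` so that spaces and special characters
    # (e.g. in "Climate Change") are URL-encoded correctly
    url = "https://news.google.com/rss/search"
    response = requests.get(url, params={"q": keyword}, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors

    # Parse the RSS feed
    root = ET.fromstring(response.content)

    # Extract article titles, skipping items without a title
    articles = []
    for item in root.findall(".//item"):
        title = item.findtext("title")
        if title:
            articles.append(title)

    return articles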

# Choose a keyword
keyword = "Your Keyword Here"  # Replace with your chosen keyword

# Fetch articles
articles = fetch_news_articles(keyword)
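
The title extraction can be tried offline on a small hand-written feed. The sketch below parses an inline RSS sample with the same `ElementTree` calls. The feed content is invented for illustration and does not reflect the exact structure of a Google News response beyond the `item` and `title` elements used above.

```python
import xml.etree.ElementTree as ET

# Invented sample feed; only <item>/<title> matter for this example.
sample_rss = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>Flooding reported in Zurich</title></item>
    <item><title>New rail link between Tokyo and Osaka</title></item>
  </channel>
</rss>"""

# Parse from bytes, as fetch_news_articles does with response.content.
root = ET.fromstring(sample_rss.encode("utf-8"))

# ".//item" matches every <item> at any depth; the channel's own <title>
# is not inside an <item> and is therefore not collected.
titles = [item.findtext("title") for item in root.findall(".//item")]
print(titles)
```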

### 5.2 Parse the Articles with Geoparser <a name="task-parse-articles"></a>

**Objective:** Use Geoparser to parse the list of article titles.

**Instructions:**

- Use Geoparser to parse the `articles`.
- Store the parsed documents for further analysis.

**Your Code:**

*(Write your code in the cell below.)*

In [None]:
# Your code here

### 5.3 Collecting and Analyzing Toponym Data <a name="collecting-toponym-data"></a>

**Objective:** Collect toponym data from the parsed article headlines.

**Instructions:**

- Iterate over the parsed documents.
- For each toponym in a document:
  - Access the resolved location information.
  - Group the results by location and format the data in a structure suitable for mapping.

**Hints:**

- We want to create a data structure `toponym_data` that is a dictionary where:
  - The key is a tuple `(location_name, latitude, longitude)`, representing a referenced location.
  - The value is another dictionary with:
    - `"count"`: number of times the location is referenced.
    - `"sentences"`: list of sentences (article titles) where the location is referenced.
- We have initialized a `defaultdict` named `toponym_data` in the code cell below. You can add data to this `defaultdict` like this:
  - To increment the counter for a specific location:
    - `toponym_data[("Zurich", 47.3769, 8.5417)]["count"] += 1`
  - To add a sentence:
    - `toponym_data[("Zurich", 47.3769, 8.5417)]["sentences"].append("Zurich is a financial hub.")`

**Your Code:**

*(Write your code in the cell below.)*

In [None]:
from collections import defaultdict

# Initialize toponym_data
toponym_data = defaultdict(lambda: {"count": 0, "sentences": []})

# Collect toponym data
# Your code here
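
The grouping logic can be sketched independently of the parser. The example below fills a `defaultdict` of the shape described in the hints from a few invented records. The record tuples, the headlines and the variable name `sample_data` are assumptions for illustration.

```python
from collections import defaultdict

# Same structure as toponym_data, under a separate name to avoid clashes.
sample_data = defaultdict(lambda: {"count": 0, "sentences": []})

# Invented (name, latitude, longitude, sentence) records standing in for
# resolved toponyms and the headlines they occur in.
records = [
    ("Zürich", 47.36667, 8.55, "Zurich hosts a climate summit."),
    ("Cleveland", 41.4995, -81.69541, "Cleveland opens a new museum."),
    ("Zürich", 47.36667, 8.55, "Tram network in Zurich expands."),
]

for name, lat, lon, sentence in records:
    key = (name, lat, lon)  # one entry per distinct location
    sample_data[key]["count"] += 1
    sample_data[key]["sentences"].append(sentence)

for key, data in sample_data.items():
    print(key, data["count"], data["sentences"])
```

In the notebook, each record would instead be built from a resolved location (`location['name']`, `location['latitude']`, `location['longitude']`) and the sentence the toponym occurs in, skipping toponyms whose location is `None`.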

### 5.4 Visualizing the Locations on a Map <a name="visualizing-news-locations"></a>

**Objective:** Create a map with the collected location data.

**Instructions:**

- Use the `create_news_map` function to create a map using your `toponym_data`.

**Prepared Code:**

In [None]:
import folium

def create_news_map(toponym_data):
    # Initialize map centered at global coordinates
    m = folium.Map()

    for (name, lat, lon), data in toponym_data.items():
        # Prepare popup content
        sentences_html = "<br><br>".join(data["sentences"])
        popup_html = f"""
        <strong>{name}</strong><br><br>{sentences_html}
        """

        folium.CircleMarker(
            location=[lat, lon],
            radius=5 + data["count"] * 2,  # Size scales with the number of occurrences
            popup=folium.Popup(popup_html, max_width=300),
            color='crimson',
            fill=True,
            fill_color='crimson'
        ).add_to(m)
    return m

# Create and display the map
m = create_news_map(toponym_data)
m