# Irchel Geoparser Workshop: Fine-Tuning a Transformer Model

In this second notebook, we will learn how to fine-tune the transformer model used by Geoparser for specific text corpora. This tutorial will guide you through the process of preparing training data, training a custom model, and evaluating its performance.

**Objectives:**

- Learn how to annotate text data using the Geoparser Annotator web app.
- Understand how to prepare annotated data for model training and testing.
- Fine-tune the Geoparser model using custom annotations.
- Evaluate and compare the performance of the base model and the fine-tuned model.

---

## 📖 **Documentation**

This tutorial requires you to consult the Geoparser documentation: [docs.geoparser.app](https://docs.geoparser.app). Knowing where to find information about the library is crucial for working with it effectively, especially since future updates may change its functionality. The documentation is the definitive reference for how to use the package.

---

## 1. Exploring the Geoparser Annotator

The Geoparser Annotator is a browser-based tool for creating annotated datasets for training and evaluation. In this section, you will launch the Annotator app, upload texts, and familiarize yourself with its features.

### 1.1 Launch the Geoparser Annotator

**Objective:** Start the Geoparser Annotator web app on your device.

**Instructions:**

- Open your terminal or command prompt.
- Run the command below to launch the Annotator.
- Once the app starts, it will automatically open in your browser at `http://127.0.0.1:5000/`.
- Familiarize yourself with the Annotator interface.

**Hints:**
- Ensure that you run the launch command in an environment where you have `geoparser` installed.
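
One way to confirm that the active environment has the package before launching is a quick check from Python. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name and not part of Geoparser.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the top-level package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("geoparser"):
        print("geoparser is available in this environment.")
    else:
        print("geoparser was not found; activate the right environment or install it.")
```

If the check fails, the launch command will most likely fail as well, so it is worth fixing the environment first.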

**Launch command:**
```bash
python -m geoparser annotator
```

To stop the Annotator, press `Ctrl+C` in the terminal where it is running.

### 1.2 Upload and Annotate Texts

**Objective:** Upload texts to the annotator and familiarize yourself with the annotation process.

**Instructions:**

- Upload the `.txt` files from the previous tutorial to the Annotator.
- Correct the pre-recognised toponyms:
  - Delete invalid toponyms.
  - Adjust faulty string boundaries.
  - Tag missed toponyms.
- Annotate toponym locations:
  - Infer the locations from the surrounding context.
  - Use the search and filter functionality to find the correct locations in the database.
  - Annotate the location.
- Explore the different features:
  - Play around with session settings.
  - Add and remove documents.
  - Download the annotation session file.
  - Restart the Annotator and resume a previous session (from file or cache).

**Note:** This initial tinkering is to get you comfortable with the Annotator. The quality of annotations is not important for this task.

---

## 2. Preparing Training Data

To fine-tune the transformer model, we need annotated data. We will fine-tune a transformer using annotated tweets to optimize its performance on a tweet corpus.

### 2.1 Complete the Annotations

**Objective:** Finish annotating the provided annotation session to create a training dataset.

**Instructions:**

- Download the provided annotation session file (`training_annotations_incomplete.json`) that contains tweets with most toponyms already annotated.
- Upload the file to the Annotator.
- Finish annotating the remaining texts.
- Download the final annotations file (e.g. as `training_annotations.json`).
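
Before moving on, it can help to confirm that the downloaded file is intact. The following is a minimal sketch that only checks that the file parses as JSON and reports its top-level structure. It makes no assumption about the Annotator's session schema; `summarize_json_file` is a hypothetical helper name.

```python
import json
from pathlib import Path


def summarize_json_file(path):
    """Load a JSON file and describe its top-level structure without assuming a schema."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)  # raises json.JSONDecodeError if the file is corrupt

    if isinstance(data, dict):
        return {"type": "object", "size": len(data), "keys": sorted(data.keys())}
    if isinstance(data, list):
        return {"type": "array", "size": len(data), "keys": []}
    return {"type": type(data).__name__, "size": 1, "keys": []}


if __name__ == "__main__":
    annotations_path = Path("training_annotations.json")
    if annotations_path.exists():
        print(summarize_json_file(annotations_path))
    else:
        print(f"{annotations_path} not found in {Path.cwd()}")
```

A parse error at this point usually means the download was incomplete, which is easier to fix now than after a failed training run.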

### 2.2 Initialize GeoparserTrainer

**Objective:** Initialize the GeoparserTrainer with the transformer model we want to fine-tune.

**Instructions:**

- Import the necessary class to use GeoparserTrainer.
- Initialize an instance of the `GeoparserTrainer` class.
- Specify `dguzh/geo-all-MiniLM-L6-v2` as the transformer model to be fine-tuned.

**Your Code:**

*(Write your code in the cell below.)*

In [None]:
# Your code here

### 2.3 Load Training Data into GeoparserTrainer

**Objective:** Load your annotated training data into GeoparserTrainer.

**Instructions:**

- Load the completed training annotations file into the GeoparserTrainer.
- Store the prepared training documents in a variable.
- Use `include_unmatched=True` to avoid excluding annotations not recognized by spaCy.

**Your Code:**

*(Write your code in the cell below.)*

In [1]:
from geoparser import GeoparserTrainer


In [2]:

trainer = GeoparserTrainer(
    spacy_model="en_core_web_trf",
    transformer_model="dguzh/geo-all-MiniLM-L6-v2",
    gazetteer="geonames"
)

In [3]:
# Load the completed training annotations; keep annotations that spaCy did not recognize
train_docs = trainer.annotate("training_annotations.json", include_unmatched=True)

In [8]:
# Use the current working directory as the notebook location
# (`__file__` is not defined inside a notebook, so Path(__file__) raises a NameError)
from pathlib import Path
path = Path.cwd()
print(path)

/Users/skalli-adm/Library/Mobile Documents/com~apple~CloudDocs/Documents/GitHub Repositories/geoparser-workshop


In [10]:
# Train the model and save it to a folder next to this notebook
trainer.train(train_docs, output_path=str(path / "fine_tuned_model"), epochs=1, batch_size=8)


---

## 3. Evaluating the Base Model

Before fine-tuning, we'll evaluate the performance of the base model using the test dataset.

### 3.1 Load Test Data from the Provided JSON File

**Objective:** Load the test annotations to use for model evaluation.

**Instructions:**

- We've provided a fully annotated test annotations file (`test_annotations.json`).
- Load it into the GeoparserTrainer in the same way as the training data, and store the test documents in a separate variable.

**Your Code:**

*(Write your code in the cell below.)*

In [None]:
# Your code here

### 3.2 Evaluate the Base Model

**Objective:** Evaluate the performance of the base model using the test data.

**Instructions:**

- Resolve the toponyms in the test dataset to generate predictions for them using the base model.
- Evaluate these predictions to assess the model's performance.
- Store the evaluation results for later comparison.
- Print the evaluation results.

**Your Code:**

*(Write your code in the cell below.)*

In [None]:
# Your code here
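
The steps listed above could be wrapped in a small helper so that the same procedure can be reused later. This is only a sketch: it assumes the trainer exposes `resolve(docs)` and `evaluate(docs)` methods and that `evaluate` returns a dictionary of metrics. Both are assumptions to verify against the Geoparser documentation, and `evaluate_and_store` and `results` are hypothetical names.

```python
import copy


def evaluate_and_store(trainer, docs, label, results):
    """Resolve toponyms, evaluate the predictions, and keep a copy of the metrics.

    Assumption: `trainer` has `resolve(docs)` and `evaluate(docs)` methods.
    Check the Geoparser documentation for the exact names and return type.
    """
    trainer.resolve(docs)                    # generate predictions with the current model
    metrics = trainer.evaluate(docs)         # score predictions against the annotations
    results[label] = copy.deepcopy(metrics)  # snapshot so later runs cannot alter it
    return results[label]


# Hypothetical usage, assuming `trainer` and `test_docs` already exist:
# results = {}
# evaluate_and_store(trainer, test_docs, "base", results)
# print(results["base"])
```

Storing a copy under a label keeps the base model's scores available after the same documents are resolved again with another model.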

---

## 4. Fine-Tuning the Model

Now we'll fine-tune the Geoparser model using our annotated training data.

### 4.1 Train the Model With the Prepared Training Data

**Objective:** Fine-tune the Geoparser model with your training data.

**Instructions:**

- Use the initialized `GeoparserTrainer` to train the loaded transformer model using the prepared training documents.
- Specify an output path where the fine-tuned model will be saved (e.g., `"fine_tuned_model"`).

**Your Code:**

*(Write your code in the cell below.)*

In [None]:
# Your code here

### 4.2 Evaluate the Fine-Tuned Model

**Objective:** Evaluate the performance of the fine-tuned model using the test data.

**Instructions:**

- Follow the same procedure you used to evaluate the base model.
- Use the same test documents you used for evaluating the base model.

**Hints:**
- After training, the original transformer model loaded in the GeoparserTrainer is automatically replaced with the fine-tuned version.
- Resolving toponyms in previously processed GeoDoc objects will overwrite existing predictions with those generated by the new model.

**Your Code:**

*(Write your code in the cell below.)*

In [None]:
# Your code here

---

## 5. Comparing Model Performance

Finally, we'll compare the evaluation results of the base model and the fine-tuned model to see if there's an improvement.

### 5.1 Compare the Results Before and After Fine-Tuning

**Objective:** Analyze the evaluation metrics to determine the impact of fine-tuning.

**Analyze the Results:**

- Compare the evaluation metrics obtained from both models.
- Consider the costs and benefits of fine-tuning a model, and evaluate the scenarios in which it makes sense to undertake this process.
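
A side-by-side table can make the comparison easier to read. The sketch below assumes both evaluation results are dictionaries that map metric names to numbers. The metric names and values in the usage example are invented placeholders, not real results, and `compare_metrics` and `print_comparison` are hypothetical helper names.

```python
def compare_metrics(base, fine_tuned):
    """Return {metric: (base, fine_tuned, difference)} for metrics present in both."""
    comparison = {}
    for name in base:
        if name in fine_tuned:
            diff = fine_tuned[name] - base[name]
            comparison[name] = (base[name], fine_tuned[name], diff)
    return comparison


def print_comparison(comparison):
    """Print one row per metric with the base value, fine-tuned value, and change."""
    print(f"{'metric':<20}{'base':>12}{'fine-tuned':>12}{'change':>12}")
    for name, (before, after, diff) in comparison.items():
        print(f"{name:<20}{before:>12.3f}{after:>12.3f}{diff:>+12.3f}")


if __name__ == "__main__":
    # Placeholder values for illustration only; substitute your own results.
    base_results = {"accuracy": 0.5, "mean_error_km": 100.0}
    fine_tuned_results = {"accuracy": 0.75, "mean_error_km": 80.0}
    print_comparison(compare_metrics(base_results, fine_tuned_results))
```

Note that the sign of the change has to be read per metric: a positive change is an improvement for an accuracy-style score but a regression for an error distance.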