### Description
Multilingual NER model for identifying mountain names.
Supported languages: Ukrainian (UA), English (EN)

### Overview
This notebook demonstrates the inference capabilities of a BERT-based Named Entity Recognition (NER) model.
It showcases:
- Extracting entities from raw text.
- Handling both English and Ukrainian inputs.
- Analyzing performance on edge cases and ambiguous contexts.
- Reporting detailed metrics on the test set (quantitative evaluation).

### Setup & Loading

In [None]:
%load_ext autoreload
%autoreload 2

import pandas as pd
from IPython.display import display
from inference import MountainNER

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [51]:
model_path = "./mnt_ner_model"  # Adjust this path if needed!
ner = MountainNER(model_path=model_path)

print("Model loaded.")

Loading model from ./mnt_ner_model to cuda...
Model loaded.


### Visual inspection
Below we test the model on specific example sentences.

In [53]:
def visualize_predictions(text, entities):
    """
    Highlights entities using simple color codes (ANSI).
    """
    if not entities:
        print(f"{text} (No mountains found)")
        return

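    # ANSI escape codes for coloring
    BLUE_BOLD = "\033[1;94m"
    RESET = "\033[0m"  # Reset all text attributes (color and bold)

    highlighted_text = text

    # Caveat: str.replace runs once per list item, so a name that occurs
    # twice in `entities` ends up wrapped twice in the output.
    for entity in entities:
        # Wrap the found entity in color codes and brackets
        replacement = f"{BLUE_BOLD}[{entity}]{RESET}"
        highlighted_text = highlighted_text.replace(entity, replacement)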

    print(f"Prediction: {highlighted_text}")
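
Because `visualize_predictions` calls `str.replace` once per list item, repeated or overlapping entity names can be wrapped more than once. A minimal sketch of a single-pass alternative follows; `highlight_once` is a hypothetical helper, not part of the `inference` module, and it returns the string instead of printing it.

```python
import re

BLUE_BOLD = "\033[1;94m"
RESET = "\033[0m"


def highlight_once(text, entities):
    """Return `text` with every entity mention wrapped exactly once."""
    # Drop duplicates and empty strings; try longer names first so that
    # "Mount Everest" wins over a shorter overlapping name like "Everest".
    unique = sorted({e for e in entities if e}, key=len, reverse=True)
    if not unique:
        return text
    pattern = re.compile("|".join(re.escape(e) for e in unique))
    # re.sub scans the original text once, so inserted markup is never re-matched.
    return pattern.sub(lambda m: f"{BLUE_BOLD}[{m.group(0)}]{RESET}", text)


print(highlight_once(
    "Mount Kilimanjaro and K2",
    ["Mount Kilimanjaro", "K2", "Mount Kilimanjaro"],
))
```

Escaping each name with `re.escape` keeps characters such as `.` or `(` in an entity from being interpreted as regex syntax.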

In [61]:
# 1. Simple EN Test
text_en = "My dream is to climb Mount Everest and K2 before I turn 30."

# We access [0] because predict returns a batch list
raw_preds = ner.predict(text_en)[0]

print("Raw Prediction:")
print(f"{'TOKEN':<15} {'TAG'}")
print("-" * 30)

# Loop through and print nicely aligned
for word, tag in raw_preds:
    # Optional: Highlight the interesting tags to make them pop
    if tag != 'O':
        print(f"{word:<15} {tag}  <-- Found entity")
    else:
        print(f"{word:<15} {tag}")

print("\nCleaned version")

found_mountains = ner.clean_predictions(ner.predict(text_en))[0]
visualize_predictions(text_en, found_mountains)

Raw Prediction:
TOKEN           TAG
------------------------------
My              O
dream           O
is              O
to              O
climb           O
Mount           B-MNT  <-- Found entity
Everest         I-MNT  <-- Found entity
and             O
K2              B-MNT  <-- Found entity
before          O
I               O
turn            O
30              O
.               O

Cleaned version
Prediction: My dream is to climb [Mount Everest] and [K2] before I turn 30.
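
The raw output tags each token with `B-MNT` (beginning of a mountain name), `I-MNT` (continuation) or `O` (outside), and the cleaned version groups those tokens into names. The implementation of `clean_predictions` is not shown in this notebook; the sketch below is only an assumption about how such BIO tags can be merged, using a hypothetical helper `merge_bio` and joining tokens with single spaces.

```python
def merge_bio(tagged_tokens, label="MNT"):
    """Join consecutive B-/I- tagged tokens into entity strings."""
    entities, current = [], []
    for token, tag in tagged_tokens:
        if tag == f"B-{label}":
            # A new entity starts; close the previous one if it is still open.
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == f"I-{label}" and current:
            # Continuation of the entity that is currently open.
            current.append(token)
        else:
            # "O" (or an I- tag with no open entity) ends the current entity.
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


example = [
    ("climb", "O"), ("Mount", "B-MNT"), ("Everest", "I-MNT"),
    ("and", "O"), ("K2", "B-MNT"), ("before", "O"),
]
print(merge_bio(example))
```

For the tokens in `example` this yields `['Mount Everest', 'K2']`, matching the two names highlighted in the cleaned output.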


In [None]:
# 2. Multilingual Test
text_ua = "Говерла - найвища точка України, але Монблан вищий."

found_mountains_ua = ner.clean_predictions(ner.predict(text_ua))[0]
visualize_predictions(text_ua, found_mountains_ua)

Prediction: [1;94m[Говерла][0m - найвища точка України, але [1;94m[Монблан][0m вищий.


In [72]:
wiki_text = """Heights of mountains are typically measured above sea level. 
Using this metric, Mount Everest is the highest mountain on Earth, at 8,848 metres (29,029 ft).[78] 
There are at least 100 mountains with heights of over 7,200 metres (23,622 ft) above sea level, all of which are located in central and southern Asia. 
The highest mountains above sea level are generally not the highest above the surrounding terrain. 
There is no precise definition of surrounding base, but Denali,[79] Mount Kilimanjaro and Nanga Parbat are possible candidates for the tallest mountain on land by this measure. 
The bases of mountain islands are below sea level, and given this consideration Mauna Kea (4,207 m (13,802 ft) above sea level) is the world's tallest mountain and volcano, rising about 10,203 m (33,474 ft) from the Pacific Ocean floor.[80]

The highest mountains are not generally the most voluminous. Mauna Loa (4,169 m or 13,678 ft) is the largest mountain on Earth in terms of base area (about 2,000 sq mi or 5,200 km2) and volume (about 18,000 cu mi or 75,000 km3).[81] 
Mount Kilimanjaro is the largest non-shield volcano in terms of both base area (245 sq mi or 635 km2) and volume (1,150 cu mi or 4,793 km3). 
Mount Logan is the largest non-volcanic mountain in base area (120 sq mi or 311 km2)."""

# We get the first element [0] because our input is a single string
wiki_found = ner.clean_predictions(ner.predict(wiki_text))[0]

print(f"- Wikipedia Text Analysis ({len(wiki_found)} entities found) -")
visualize_predictions(wiki_text, wiki_found)

- Wikipedia Text Analysis (7 entities found) -
Prediction: Heights of mountains are typically measured above sea level. 
Using this metric, [1;94m[Mount Everest][0m is the highest mountain on Earth, at 8,848 metres (29,029 ft).[78] 
There are at least 100 mountains with heights of over 7,200 metres (23,622 ft) above sea level, all of which are located in central and southern Asia. 
The highest mountains above sea level are generally not the highest above the surrounding terrain. 
There is no precise definition of surrounding base, but [1;94m[Denali][0m,[79] [1;94m[[1;94m[Mount Kilimanjaro][0m][0m and [1;94m[Nanga Parbat][0m are possible candidates for the tallest mountain on land by this measure. 
The bases of mountain islands are below sea level, and given this consideration Mauna Kea (4,207 m (13,802 ft) above sea level) is the world's tallest mountain and volcano, rising about 10,203 m (33,474 ft) from the Pacific Ocean floor.[80]

The highest mountains are not generally t

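The model correctly identified multi-word entities such as "Mount Kilimanjaro" and "Mount Logan", but it is not perfect.

Missed entity: the model failed to identify *"Mauna Kea"*, which appears once in the text, in the 6th sentence.

Display note: the doubled brackets around "Mount Kilimanjaro" are an artifact of `visualize_predictions`, not a model error. The name occurs twice in the text and therefore twice in the entity list, so it is wrapped twice.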

### Analysing text traps

In [None]:
text_trap_ua = "Футбольний клуб \"Карпати\" зіграв унічию сьогодні."

found_ua = ner.clean_predictions(ner.predict(text_trap_ua))[0]
visualize_predictions(text_trap_ua, found_ua)

Prediction: Футбольний клуб "[Карпати]" зіграв унічию сьогодні.


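*Analysis: token overfitting*

The model incorrectly identified **"Карпати"** (here, a football club) as a mountain.
During training, the token "Карпати" likely appeared exclusively with the MNT label, so the model learned the token itself rather than its context.

To fix this, the dataset needs negative examples: sentences in which "Карпати" and similar names refer to something other than a mountain and are labeled O. If the dataset is generated with LLMs, the generation prompts can be adjusted to request such sentences.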
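
One way to check the overfitting hypothesis is to count which tags a token carries in the training data. The sketch below is illustrative only: the training file format is not shown in this notebook, so the list-of-`(token, tag)` structure, the helper `tag_distribution`, and the toy sentences are all assumptions.

```python
from collections import Counter


def tag_distribution(sentences, token):
    """Count how often `token` carries each tag across tagged sentences."""
    counts = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            if word == token:
                counts[tag] += 1
    return counts


# Hypothetical toy training set: "Карпати" is always tagged as a mountain.
train = [
    [("Карпати", "B-MNT"), ("взимку", "O")],              # "the Carpathians in winter"
    [("Їдемо", "O"), ("в", "O"), ("Карпати", "B-MNT")],   # "we are going to the Carpathians"
]
print(tag_distribution(train, "Карпати"))

# Adding a negative example: the same token, used as a club name, tagged O.
negative = [("ФК", "O"), ("Карпати", "O"), ("переміг", "O")]  # "FC Karpaty won"
augmented = train + [negative]
print(tag_distribution(augmented, "Карпати"))
```

A token whose counts contain only `B-MNT` or `I-MNT` is a candidate for this kind of targeted negative example.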

### Quantitative evaluation

In [None]:
metrics = ner.evaluate_file("./data/final/test.jsonl")

# Display Overall Results
print("\nOverall Performance")
df_overall = pd.DataFrame(metrics['Overall']).transpose()
display(df_overall.round(2))

# Display Per-Language Results
print("\nUkrainian Performance")
if 'UA' in metrics:
    df_ua = pd.DataFrame(metrics['UA']).transpose()
    display(df_ua.round(2))
else:
    print("No UA data found.")

print("\nEnglish Performance")
if 'EN' in metrics:
    df_en = pd.DataFrame(metrics['EN']).transpose()
    display(df_en.round(2))
else:
    print("No EN data found.")

Loading test data from ./data/final/test.jsonl...
Running inference on 569 examples...

Overall Performance


| | precision | recall | f1-score | support |
|---|---|---|---|---|
| MNT | 0.92 | 0.91 | 0.91 | 137.0 |



Ukrainian Performance


| | precision | recall | f1-score | support |
|---|---|---|---|---|
| MNT | 0.88 | 0.88 | 0.88 | 60.0 |



English Performance


| | precision | recall | f1-score | support |
|---|---|---|---|---|
| MNT | 0.95 | 0.92 | 0.93 | 77.0 |
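
The reports above give precision, recall and F1 for the `MNT` entity type. `evaluate_file` is not shown in this notebook, so the following is not its implementation; it is a minimal sketch of how these three numbers are commonly defined at entity level, assuming each entity is represented as a hypothetical `(sentence_id, start, end)` span and only exact span matches count as correct.

```python
def prf(gold, predicted):
    """Precision, recall and F1 over sets of exact-match entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy spans, not taken from the test set.
gold_spans = {(0, 5, 7), (0, 8, 9), (1, 0, 1)}
predicted_spans = {(0, 5, 7), (1, 0, 1), (1, 4, 5)}
print(prf(gold_spans, predicted_spans))
```

In this toy case two of the three predicted spans are correct and two of the three gold spans are found, so all three values equal 2/3.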


### Conclusion & Analysis

#### Performance Insights
* Precision is high in English (0.95): roughly 1 in 20 predicted mountains is a false positive.
* The model performs better on English (F1 0.93) than on Ukrainian (F1 0.88). A likely reason is that the underlying BERT model was pre-trained on a significantly larger corpus of English text.
* In Ukrainian, precision and recall are identical (0.88), so false positives and false negatives occur in roughly equal numbers: the model is just as likely to miss a mountain as to predict one that is not there.
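
To make these percentages more concrete, the rounded metrics can be turned back into approximate error counts. This is a rough sketch: `implied_counts` is a hypothetical helper, and because it works from values rounded to two decimals, the results are estimates that may be off by one.

```python
def implied_counts(precision, recall, support):
    """Estimate (TP, FP, FN) from rounded precision/recall and gold support."""
    true_positives = round(recall * support)           # recall = TP / support
    false_negatives = support - true_positives
    predicted_total = round(true_positives / precision)  # precision = TP / predicted
    false_positives = predicted_total - true_positives
    return true_positives, false_positives, false_negatives


for name, p, r, support in [("UA", 0.88, 0.88, 60), ("EN", 0.95, 0.92, 77)]:
    tp, fp, fn = implied_counts(p, r, support)
    print(f"{name}: ~{tp} correct, ~{fp} false positives, ~{fn} missed")
```

Under these assumptions the Ukrainian subset comes out at about 7 false positives and 7 misses out of 60 entities, and the English subset at about 4 false positives and 6 misses out of 77.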

#### Error Insights
* The model incorrectly identified "Карпати" as a mountain when it referred to the football club (overfitting to the token, caused by a lack of negative samples).
* The model missed "Mauna Kea" in the Wikipedia stress test, which shows that recall is not perfect even on clean encyclopedic English text.

#### Conclusion
1. Performance Summary
* The model achieved an overall F1-score of 0.91 on 569 test examples containing 137 mountain entities.
* English Performance (F1: 0.93): The model is highly reliable for English text.
* Ukrainian Performance (F1: 0.88): While effective, the model exhibits a slight performance drop in Ukrainian.

2. Key Strengths
* High Precision (0.92 overall, 0.95 in English): The model rarely produces false positives in standard contexts.
* Long-Context Handling: In the multi-paragraph Wikipedia passage the model found 7 entity mentions, with one known miss ("Mauna Kea").