# Mountain NER Dataset Preparation

**Description:** This notebook combines three scripts to generate and process data for training and testing a Named Entity Recognition (NER) model to identify mountain names in sentences.

**Author:** Bytsenko Anna

**Date:** 13.11.24





## Step 1: Data Generation

This section generates sentences about mountains and annotates them with BIO (Begin, Inside, Outside) labels.

In [26]:
# Importing Libraries
import random
import pandas as pd
import re
import requests
from bs4 import BeautifulSoup


    Step 1.1:

This step **parses a web page** to extract a **list of mountain names**. The code fetches a Britannica page that lists mountains, parses the HTML with BeautifulSoup, and collects the text of the nested list items that hold the names. Duplicates are then removed so that only unique mountain names remain.

In [None]:
# Scraping Mountain Names
# Define the URL for a list of mountain names
url = "https://www.britannica.com/topic/list-of-mountains-2009175"

# Fetch the webpage content
response = requests.get(url)
response.raise_for_status()  # Ensure the request was successful

# Parse the content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Extract mountain names from the webpage
mountain_names = []
for section in soup.select("section > ul > li"):  # top-level list items inside each section
    for li in section.find_all('li'):  # list items nested inside each top-level item
        name = li.get_text(strip=True)  # strip leading and trailing whitespace
        if name:
            mountain_names.append(name)

# Remove duplicates to ensure unique mountain names
mountain_names = list(dict.fromkeys(mountain_names))
print(f"Total unique mountain names extracted: {len(mountain_names)}")

Total unique mountain names extracted: 220
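
The deduplication step relies on `dict.fromkeys`, which keeps the first occurrence of each key and preserves insertion order. A minimal sketch of that behaviour, using a made-up sample list rather than the scraped data:

```python
# Hypothetical sample; the real list comes from the scraped page.
sample_names = ["Mount Everest", "K2", "Mount Everest", "Snowdon", "K2"]

# dict.fromkeys keeps the first occurrence of each name and preserves order,
# whereas converting to a set would not guarantee any particular order.
unique_names = list(dict.fromkeys(sample_names))
print(unique_names)
```

Keeping the original order makes the extracted list easier to compare against the source page.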


    Step 1.2:

The sentence **templates were generated with ChatGPT**. Each template contains a `{}` placeholder that is later filled by **randomly substituting** one of the mountain names extracted from the site.

In [28]:
# Defining Sentence Templates
# Templates for generating sentences about mountains
sentence_templates = [
    "{} is one of the most beautiful mountains in the world.",
    "Many climbers dream of reaching the summit of {}.",
    "The {} range stretches over a vast area.",
    "Climbing {} requires a lot of training and preparation.",
    "{} is a popular destination for tourists and hikers.",
    "The highest peak in the {} is well-known among mountaineers.",
    "Exploring the {} can be a thrilling experience.",
    "{} has a unique landscape compared to other mountains.",
    "{} is famous for its challenging climbing routes.",
    "Every year, thousands of adventurers visit {}.",
    "{} is known for its unpredictable weather and stunning views.",
    "The ecosystem around {} is home to various rare species.",
    "For centuries, {} has attracted explorers from around the world.",
    "Climbing {} is considered a rite of passage for experienced climbers.",
    "{} offers breathtaking views of the surrounding landscapes.",
    "The journey to the peak of {} is both difficult and rewarding.",
    "Many legends are told about the mysteries of {}.",
    "People come from afar to witness the beauty of {}.",
    "The path to {} is filled with both beauty and danger.",
    "Sunrises and sunsets at {} are unforgettable experiences.",
    "{} stands as a symbol of resilience and adventure.",
    "The climate on {} can change rapidly, posing risks to climbers.",
    "The altitude of {} challenges even the fittest athletes.",
    "The surrounding area of {} is a haven for nature lovers.",
    "In winter, {} becomes a snowy paradise for skiers and climbers.",
    "{} has inspired countless works of art and literature.",
    "The steep cliffs of {} attract the most daring adventurers.",
    "Reaching the top of {} is an achievement many aspire to.",
    "The slopes of {} are covered with diverse vegetation.",
    "Local communities revere {} as a sacred mountain.",
    "Stories of {} have been passed down through generations.",
    "The sheer size of {} is awe-inspiring to anyone who sees it.",
    "{} presents a unique challenge to those who attempt to conquer it.",
    "The landscape around {} changes dramatically with the seasons.",
    "During certain times of the year, {} is covered in beautiful wildflowers.",
    "Legends say that spirits protect the summit of {}.",
    "{} has long been a place of pilgrimage for travelers and explorers.",
    "The rugged terrain of {} demands skill and endurance from climbers.",
    "{} is a part of a chain of mountains that spans multiple countries.",
    "The snow-capped peaks of {} are visible from miles away.",
    "{} is surrounded by valleys and rivers, creating a picturesque scene.",
    "The harsh conditions on {} are not for the faint-hearted.",
    "The beauty of {} is matched by the challenges it presents.",
    "Expeditions to {} are often halted due to adverse weather.",
    "Many have tried and failed to reach the top of {}.",
    "The legends of {} add to its mystique and allure.",
    "{} has been a source of inspiration for poets and writers.",
    "The icy winds on {} can be relentless at higher altitudes.",
    "Every step toward the summit of {} tests one's willpower.",
    "Few places on Earth are as captivating as the {}.",
    "The path up {} is marked with stunning rock formations.",
    "The rivers near {} provide fresh water to local communities.",
    "Many believe that {} holds hidden treasures within its caves.",
    "Surviving the night on {} requires advanced camping skills.",
    "Reaching {} requires both mental and physical strength.",
    "The flora and fauna on {} are unique to its ecosystem.",
    "{} is a UNESCO World Heritage Site due to its natural beauty.",
    "Travelers often share stories of their journey to {}.",
    "{} holds cultural significance to indigenous communities nearby.",
    "The view from {}'s summit is worth the difficult journey.",
    "Exploring the base of {} is an adventure in itself.",
    "The legend of {} tells of lost travelers and hidden caves.",
    "Tourists flock to {} for its serene and majestic beauty.",
    "The forests surrounding {} are filled with diverse wildlife.",
    "During storms, {} becomes an even more formidable place.",
    "{} is home to ancient rock formations that date back millennia.",
    "Hiking to {}'s summit can take several days to accomplish.",
    "The mist that surrounds {} in the mornings adds to its mystique.",
    "Those who climb {} often speak of its spiritual energy.",
    "{} is part of a national park that protects its unique environment.",
    "The rugged trails on {} attract experienced hikers from around the world.",
    "Photographers travel to {} to capture its breathtaking views.",
    "The mountain range that includes {} is one of the longest in the world.",
    "Reaching the peak of {} is an unforgettable experience.",
    "{} is frequently covered in snow, even during the summer months.",
    "The wildlife near {} includes rare species like mountain goats and eagles.",
    "{} is known for its sharp, jagged peaks.",
    "Ancient civilizations left their mark on the slopes of {}.",
    "The scenic views around {} attract travelers from far and wide.",
    "The base of {} is often a meeting point for climbers and tourists alike.",
    "Some say the spirits of past climbers still roam {}.",
    "The beauty of {} has made it a popular filming location.",
    "The history of {} is filled with stories of exploration and adventure.",
    "The clouds often envelop {} in a blanket of mist.",
    "{} is a natural wonder that captures the imagination of all who visit.",
    "The descent from {} is as challenging as the ascent.",
    "Hikers near {} are advised to be cautious of sudden weather changes.",
    "{} has several trails that vary in difficulty, attracting all types of adventurers.",
    "The ecosystem around {} is fragile and needs to be preserved.",
    "The dense forests near {} add to its allure.",
    "{} is a major landmark and a point of pride for the local community.",
    "On clear days, the view from {} stretches for miles.",
    "The summit of {} can be reached only by the most experienced climbers.",
    "{} has unique geological features not found anywhere else.",
    "Local legends claim that {} has healing powers.",
    "The journey to {} requires careful planning and preparation.",
    "During the winter months, {} becomes almost inaccessible.",
    "The rocks on {} are treacherous, and climbers must tread carefully.",
    "Many expeditions have been launched to explore the depths of {}.",
    "{} is often referred to as the crown jewel of the region.",
    "The paths leading up to {} are winding and steep.",
    "{} has been the site of many scientific studies due to its unique environment.",
    "The river that flows from {} provides life to the valleys below.",
    "People come from around the globe to admire the grandeur of {}.",
    "Mountaineers respect {} for the challenges it presents.",
    "{}'s slopes are covered with dense forests and flowing streams.",
    "Legends say that {} is protected by ancient spirits.",
    "Locals often tell stories of hidden caves within {}.",
    "The peak of {} is one of the most beautiful places on Earth."
]
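
To illustrate how a template becomes a sentence, the sketch below fills the `{}` placeholder with `str.format`. The two templates are copied from the list above; the mountain name is an example value chosen for illustration, not necessarily one of the scraped names.

```python
templates = [
    "Many climbers dream of reaching the summit of {}.",
    "The {} range stretches over a vast area.",
]
name = "Mount Everest"  # illustrative value

# str.format replaces the single {} placeholder with the name
filled = [template.format(name) for template in templates]
for sentence in filled:
    print(sentence)
```

Templates that wrap the placeholder in extra words, such as "The {} range", can read oddly with some names, so it may be worth spot-checking a few generated sentences.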

    Step 1.3:

The code snippet is responsible for:

- `Tokenisation of the text into words:` Breaking down a sentence into individual tokens (words, numbers, punctuation, etc.) to prepare the text for analysis.

- `Annotation of tokens according to the BIO scheme:` Each token is assigned a label from the BIO set:

    *B-MOUNTAIN*: Indicates the beginning of the mountain name.

    *I-MOUNTAIN*: Denotes the continuation of the mountain name (for multi-word names).

    *O*: Denotes tokens that do not belong to mountain names.

In this way, the **functions prepare text data** for the Named Entity Recognition (NER) model by giving each token a corresponding label.

In [None]:
# Tokenization and Annotation Functions
# Function to tokenize sentences into words
def tokenize(sentence):
    # Match runs of word characters and keep periods as separate tokens;
    # other punctuation (commas, hyphens, apostrophes) is dropped
    return re.findall(r'\b\w+\b|\.', sentence)

# Function to annotate tokens with BIO labels
def bio_annotate(tokens, mountain_names):
    labels = []  # BIO label for each token
    skip = 0  # Tracks how many tokens to skip due to multi-word matches
    for i, token in enumerate(tokens):
        if skip > 0:  # Token was already labelled as part of a multi-word name
            skip -= 1
            continue
        matched = False  # Set to True once a mountain name matches at position i
        for mountain in mountain_names:  # Compare the tokens at position i with each mountain name
            # Split the name on whitespace only; names containing hyphens or
            # apostrophes will not line up with the output of tokenize()
            mountain_tokens = mountain.split()
            # Check whether the token sequence starting at i equals the mountain name
            if tokens[i:i+len(mountain_tokens)] == mountain_tokens:
                # Label the first token 'B-MOUNTAIN' and the remaining ones 'I-MOUNTAIN'
                labels += ['B-MOUNTAIN'] + ['I-MOUNTAIN'] * (len(mountain_tokens) - 1)
                skip = len(mountain_tokens) - 1  # Number of following tokens to skip
                matched = True  # A match has been found
                break
        if not matched:
            labels.append('O')  # Default label for non-entity tokens
    return labels
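
The preview output in Step 2 contains a sentence that starts with a mountain name yet is annotated entirely with `O`. A likely cause is that sentences are split with the regex tokenizer while mountain names are split with `str.split()`, so a name containing a hyphen produces different token sequences on each side. The sketch below shows the mismatch and one possible remedy: tokenizing the names with the same function. `bio_annotate_aligned` is a hypothetical variant written for illustration, not the notebook's function, and the ASCII spelling of the name is used only as an example.

```python
import re

def tokenize(sentence):
    # Same regex as the notebook: runs of word characters plus periods
    return re.findall(r'\b\w+\b|\.', sentence)

def bio_annotate_aligned(tokens, mountain_names):
    """Hypothetical variant: names are split with tokenize(), not str.split()."""
    # Try longer names first so a multi-word name wins over a shorter overlap
    candidates = sorted((tokenize(n) for n in mountain_names), key=len, reverse=True)
    labels = []
    i = 0
    while i < len(tokens):
        for candidate in candidates:
            n = len(candidate)
            if n and tokens[i:i + n] == candidate:
                labels += ['B-MOUNTAIN'] + ['I-MOUNTAIN'] * (n - 1)
                i += n  # jump past the matched name
                break
        else:
            labels.append('O')  # no name starts at this position
            i += 1
    return labels

name = "Qurnat al-Sawda"  # illustrative ASCII spelling of a hyphenated name
print(name.split())       # whitespace split keeps 'al-Sawda' together
print(tokenize(name))     # regex split breaks it at the hyphen

tokens = tokenize(f"Climbing {name} requires training.")
labels = bio_annotate_aligned(tokens, [name, "Mount Everest"])
print(list(zip(tokens, labels)))
```

With both sides tokenized the same way, the three tokens of the name receive `B-MOUNTAIN`, `I-MOUNTAIN`, `I-MOUNTAIN` instead of `O`.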

    Step 1.4:

This code **generates synthetic text data** for training a named entity recognition (NER) model. Each sentence is built by inserting a randomly chosen mountain name into a randomly chosen template, and every token is then labelled in the BIO (Begin-Inside-Outside) format so that the result can be used as a training set.

In [None]:
# Generating Annotated Sentences
# Generate sentences and their corresponding BIO annotations
sentences = []  # generated sentences
annotations = []  # BIO labels for each sentence

for _ in range(5000):  # Generate 5000 sentences
    # Choose a random mountain name from the list
    mountain = random.choice(mountain_names)
    # Choose a random template for inserting the mountain name
    template = random.choice(sentence_templates)
    # Create the sentence
    sentence = template.format(mountain)
    # Tokenize the resulting sentence
    tokens = tokenize(sentence)
    # Generate a BIO label for each token in the sentence
    # For example, the tokens 'Mount', 'Everest' are labelled ['B-MOUNTAIN', 'I-MOUNTAIN']
    labels = bio_annotate(tokens, mountain_names)
    # Store tokens and labels as space-separated strings
    sentences.append(" ".join(tokens))
    annotations.append(" ".join(labels))

# Create a DataFrame to store sentences and annotations
df = pd.DataFrame({
    "sentence": sentences, # column with sentences
    "annotation": annotations # column with BIO-marks
})

# Save the generated data to a CSV file
output_file_path = './data/annotated_mountain_sentences.csv'
df.to_csv(output_file_path, index=False, encoding="utf-8") # saving without indexes
print(f"Generated data saved to {output_file_path}")

Generated data saved to ./data/annotated_mountain_sentences.csv
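
Because the loop draws names and templates with `random.choice`, each run produces a different dataset. If reproducibility matters, one option is to draw from a seeded `random.Random` instance. The sketch below uses made-up names and two templates from the list above; the `generate` helper and the seed value are assumptions for illustration, not part of the notebook.

```python
import random

def generate(n, names, templates, seed):
    # A private generator leaves the global random state untouched
    rng = random.Random(seed)
    return [rng.choice(templates).format(rng.choice(names)) for _ in range(n)]

names = ["Mount Everest", "K2", "Snowdon"]  # hypothetical sample
templates = [
    "Climbing {} requires a lot of training and preparation.",
    "Every year, thousands of adventurers visit {}.",
]

run_a = generate(5, names, templates, seed=42)
run_b = generate(5, names, templates, seed=42)
print(run_a == run_b)  # the same seed yields the same sentences
```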


## Step 2: Data Processing

This section processes the generated dataset into the desired format for training an NER model.

This step performs **data preparation and preprocessing** for training a named entity recognition (NER) model. It loads the annotated dataset from a CSV file, splits each sentence and its annotation string into lists of tokens and labels, and saves the result to a new CSV file for further use.

In [None]:
# Reading the Annotated Dataset
input_file = './data/annotated_mountain_sentences.csv'
output_file = './data/annotated_data.csv'

# Read the data from the CSV file
# Two columns are expected in the input file: 'sentence' and 'annotation' (labels for each word).
data = pd.read_csv(input_file)

# Processing Sentences and Annotations
# Split sentences and annotations into tokenized lists
processed_sentences = []
processed_annotations = []

# Iterate over the rows of the DataFrame and
# process each sentence together with its annotation string
for _, row in data.iterrows():
    # Split the sentence into tokens (words)
    sentence = row['sentence'].split()
    # Split the annotation string into labels
    annotation = row['annotation'].split()
    # Add the tokenized sentence and its labels to the lists
    processed_sentences.append(sentence)
    processed_annotations.append(annotation)

# Combine into a new DataFrame
processed_df = pd.DataFrame({
    "sentence": processed_sentences,
    "annotation": processed_annotations
})

# Save the processed data to a new CSV file
processed_df.to_csv(output_file, index=False) # excluding indexes from file
print(f"Processed data saved to {output_file}")

# Preview of the Processed Data
# Display the first few rows of the processed dataset
print(processed_df.head())


Processed data saved to ./data/annotated_data.csv
                                            sentence  \
0  [Photographers, travel, to, Croagh, Patrick, t...   
1  [Legends, say, that, Guadalupe, Peak, is, prot...   
2  [The, slopes, of, Mount, Apo, are, covered, wi...   
3  [Qurnat, al, Sawdāʾ, has, been, a, source, of,...   
4  [Hekla, has, been, the, site, of, many, scient...   

                                          annotation  
0  [O, O, O, B-MOUNTAIN, I-MOUNTAIN, O, O, O, O, ...  
1  [O, O, O, B-MOUNTAIN, I-MOUNTAIN, O, O, O, O, ...  
2  [O, O, O, B-MOUNTAIN, I-MOUNTAIN, O, O, O, O, ...  
3         [O, O, O, O, O, O, O, O, O, O, O, O, O, O]  
4  [B-MOUNTAIN, O, O, O, O, O, O, O, O, O, O, O, ...  
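
One caveat with this format: CSV cells hold text, so a Python list written to CSV is stored as its string representation and comes back as a string when the file is read again. The sketch below walks through that round trip with the standard-library `csv` module and an in-memory buffer standing in for the pandas calls; `ast.literal_eval` is one way to turn the text back into a list. The sample row is taken from the preview above.

```python
import ast
import csv
import io

rows = [(["The", "slopes", "of", "Mount", "Apo"],
         ["O", "O", "O", "B-MOUNTAIN", "I-MOUNTAIN"])]

# Write the lists to an in-memory CSV; each list is stored as text
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sentence", "annotation"])
for tokens, labels in rows:
    writer.writerow([str(tokens), str(labels)])

# Read the row back: the cell is a string, not a list
buffer.seek(0)
record = next(csv.DictReader(buffer))
print(type(record["sentence"]))

# Parse the text back into Python lists
tokens_back = ast.literal_eval(record["sentence"])
labels_back = ast.literal_eval(record["annotation"])
print(tokens_back, labels_back)
```

When loading with pandas, the same parsing could be applied through the `converters` argument of `read_csv`, for example `converters={"sentence": ast.literal_eval}`.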


## Step 3: Test Data Generation

This code creates and processes the test dataset for the named entity recognition (NER) model, following the same procedure as for the training data but with a separate set of templates. In particular, it:

- `Generates a synthetic text dataset`: Creates sentences by inserting randomly chosen mountain names into template phrases.

- `Annotates these sentences`: Assigns a BIO (Begin, Inside, Outside) label to each token, with mountain names labelled 'B-MOUNTAIN' or 'I-MOUNTAIN'.

- `Saves the result to a CSV file`: Saves the annotated data in a table for later use.

- `Processes the annotated dataset`: Loads the saved annotated data and splits the tokens and their labels into lists for the model.

- `Generates the final table`: Saves the processed data to a new CSV file.
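
Since these steps repeat the training pipeline with a different template list and sample count, the generation and annotation part could be wrapped in one helper. The sketch below is an assumption about how such a refactor might look, not the notebook's code: `build_dataset` and `label_tokens` are hypothetical names, the matching is simplified to label only the name that was inserted, and the sample names are chosen for illustration.

```python
import random
import re

def tokenize(text):
    # Same regex as the notebook: runs of word characters plus periods
    return re.findall(r'\b\w+\b|\.', text)

def label_tokens(tokens, name_tokens):
    """Label the first occurrence of name_tokens with B-/I-MOUNTAIN, the rest O."""
    labels = ['O'] * len(tokens)
    n = len(name_tokens)
    for i in range(len(tokens) - n + 1):
        if n and tokens[i:i + n] == name_tokens:
            labels[i] = 'B-MOUNTAIN'
            labels[i + 1:i + n] = ['I-MOUNTAIN'] * (n - 1)
            break
    return labels

def build_dataset(templates, names, n_samples, seed=0):
    """Return a list of (tokens, labels) pairs for randomly filled templates."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        name = rng.choice(names)
        tokens = tokenize(rng.choice(templates).format(name))
        rows.append((tokens, label_tokens(tokens, tokenize(name))))
    return rows

test_rows = build_dataset(
    templates=["Tourists often describe {} as breathtaking and serene."],
    names=["Mount Apo", "Snowdon"],
    n_samples=3,
)
for tokens, labels in test_rows:
    print(tokens, labels)
```

Calling the helper once with the training templates and once with the test templates would avoid redefining the tokenizer and annotator in each cell.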

In [None]:
# Uses the same method as in Steps 1-2, where the training data was generated

# List of sentence templates with a '{}' placeholder for inserting mountain names
sentence_templates = [
    "{} is a must-visit destination for any nature enthusiast.",
    "The {} has been featured in many travel documentaries.",
    "Tourists often describe {} as breathtaking and serene.",
    "Local legends tell stories of mystical events at {}.",
    "Climbers regard {} as one of the ultimate challenges.",
    "The weather on {} changes rapidly, making it unpredictable.",
    "Hiking trails near {} attract adventurers from around the globe.",
    "{} is surrounded by stunning natural beauty.",
    "The cultural significance of {} cannot be overstated.",
    "Many artists have drawn inspiration from the beauty of {}.",
    "{} offers some of the best panoramic views in the world.",
    "Mountaineers train for months before attempting to scale {}.",
    "The history of {} is rich with fascinating stories of exploration.",
    "Visitors to {} often leave with unforgettable memories.",
    "Many expeditions have been launched to conquer {}.",
    "The peak of {} stands as a testament to human resilience.",
    "At sunrise, {} casts a magnificent glow over the surrounding landscape.",
    "Adventure seekers flock to {} for its extreme climbing conditions.",
    "The majestic {} is visible from miles away on clear days.",
    "The breathtaking scenery surrounding {} draws photographers from all over.",
    "{} is often regarded as the crown jewel of the region’s natural wonders.",
    "The extreme altitude of {} makes it one of the most challenging climbs.",
    "A journey to the summit of {} requires both mental and physical strength.",
    "Many people dream of standing at the top of {} and witnessing its grandeur.",
    "In the shadows of {} lies a diverse ecosystem that thrives in harsh conditions.",
    "The sheer scale of {} dwarfs the surrounding mountains.",
    "{} stands as a beacon for explorers and mountaineers worldwide.",
    "Rising above the clouds, {} offers an unparalleled view of the world below.",
    "The annual trekking season for {} draws hundreds of adventure seekers.",
    "The terrain surrounding {} is known for its ruggedness and beauty.",
    "Climbing {} is considered a rite of passage for many professional mountaineers.",
    "{} is a geographical wonder that attracts scientists and researchers alike.",
    "In winter, {} is covered in a blanket of snow, offering a magical landscape.",
    "The summit of {} is often described as a place of peace and solitude.",
    "The winds around {} can be fierce and unforgiving, adding to its mystique.",
    "Legends speak of ancient civilizations that once worshipped {}.",
    "At night, the stars shine brightly above {} like jewels in the sky.",
    "{} has become synonymous with adventure and exploration.",
    "Standing at the base of {}, you can't help but feel awe-struck by its size.",
    "The changing weather conditions on {} make it a popular subject for meteorological studies.",
    "The surrounding national park of {} is home to a diverse array of wildlife.",
    "Every year, thousands of trekkers set out to scale {}.",
    "The challenging terrain of {} tests the skills of even the most seasoned climbers.",
    "From the summit of {}, you can see a sprawling view of the entire valley.",
    "The slopes of {} are perfect for skiing during the winter season.",
    "{} has long been a symbol of strength and endurance in local culture.",
    "The quiet isolation of {} offers a unique opportunity for reflection and tranquility.",
    "The dangerous crevasses and ice fields around {} present a serious risk for climbers.",
    "{} is often seen as a symbol of national pride for the country it belongs to.",
    "The first successful ascent of {} was a historic achievement in mountaineering.",
    "The local communities living around {} have a deep connection to the mountain.",
    "The altitude sickness around {} is a common challenge for climbers.",
    "Many people come to {} for spiritual and physical renewal.",
    "The view from the top of {} is so breathtaking that it’s often described as life-changing.",
    "The massive glaciers surrounding {} are a sight to behold.",
    "{} is one of the most iconic landmarks in the world.",
    "At certain times of the year, {} is completely shrouded in mist, creating a mysterious atmosphere.",
    "The region surrounding {} is rich in history and culture.",
    "The monsoon season near {} brings unpredictable weather and challenges for trekkers.",
    "Every year, an increasing number of tourists visit {} to experience its awe-inspiring beauty.",
    "The sacred status of {} attracts pilgrims from around the world.",
    "The summit of {} is often the subject of daring feats by extreme climbers.",
    "Researchers continue to study the unique ecological systems around {}.",
    "Despite the harsh conditions, life around {} is surprisingly diverse and thriving."
]

# Tokenization and Annotation Functions
# Function to tokenize sentences into words
def tokenize(sentence):
    return re.findall(r'\b\w+\b|\.', sentence)

# Function to annotate tokens with BIO labels
def bio_annotate(tokens, mountain_names):
    labels = []
    skip = 0  # Tracks how many tokens to skip due to multi-word matches
    for i, token in enumerate(tokens):
        if skip > 0:
            skip -= 1
            continue
        matched = False
        for mountain in mountain_names:
            mountain_tokens = mountain.split()
            if tokens[i:i+len(mountain_tokens)] == mountain_tokens:
                labels += ['B-MOUNTAIN'] + ['I-MOUNTAIN'] * (len(mountain_tokens) - 1)
                skip = len(mountain_tokens) - 1
                matched = True
                break
        if not matched:
            labels.append('O')  # Default label for non-entity tokens
    return labels

# Generating Annotated Sentences
# Generate sentences and their corresponding BIO annotations
sentences = []
annotations = []

for _ in range(1000):  # Generate 1000 sentences
    mountain = random.choice(mountain_names)
    template = random.choice(sentence_templates)
    sentence = template.format(mountain)
    tokens = tokenize(sentence)
    labels = bio_annotate(tokens, mountain_names)
    sentences.append(" ".join(tokens))
    annotations.append(" ".join(labels))

# Create a DataFrame to store sentences and annotations
df = pd.DataFrame({
    "sentence": sentences,
    "annotation": annotations
})

# Save the generated data to a CSV file
output_file_path = './data/test_annotated_mountain_sentences.csv'
df.to_csv(output_file_path, index=False, encoding="utf-8")
print(f"Generated data saved to {output_file_path}")

# Reading the Annotated Dataset
input_file = './data/test_annotated_mountain_sentences.csv'
output_file = './data/test_annotated_data.csv'

# Read the data from the CSV file
data = pd.read_csv(input_file)

# Processing Sentences and Annotations
# Split sentences and annotations into tokenized lists
processed_sentences = []
processed_annotations = []

for _, row in data.iterrows():
    sentence = row['sentence'].split()
    annotation = row['annotation'].split()
    processed_sentences.append(sentence)
    processed_annotations.append(annotation)

# Combine into a new DataFrame
processed_df = pd.DataFrame({
    "sentence": processed_sentences,
    "annotation": processed_annotations
})

# Save the processed data to a new CSV file
processed_df.to_csv(output_file, index=False)
print(f"Processed data saved to {output_file}")

# Preview of the Processed Data
# Display the first few rows of the processed dataset
print(processed_df.head())



Generated data saved to ./data/test_annotated_mountain_sentences.csv
Processed data saved to ./data/test_annotated_data.csv
                                            sentence  \
0  [The, slopes, of, Mihintale, are, perfect, for...   
1  [The, summit, of, Concepción, Volcano, is, oft...   
2  [The, peak, of, Snowdon, stands, as, a, testam...   
3  [Every, year, an, increasing, number, of, tour...   
4  [The, Lugnaquillia, Mountain, has, been, featu...   

                                          annotation  
0   [O, O, O, B-MOUNTAIN, O, O, O, O, O, O, O, O, O]  
1  [O, O, O, B-MOUNTAIN, I-MOUNTAIN, O, O, O, O, ...  
2      [O, O, O, B-MOUNTAIN, O, O, O, O, O, O, O, O]  
3  [O, O, O, O, O, O, O, O, B-MOUNTAIN, O, O, O, ...  
4  [O, B-MOUNTAIN, I-MOUNTAIN, O, O, O, O, O, O, ...  
