# **Mountain Named Entity Recognition (NER) Project**

This project involves creating a dataset of sentences related to famous mountains from Wikipedia, preprocessing the data, and training a BERT model for Named Entity Recognition (NER) to identify mountain names within sentences.


## Step 1: Installing Required Libraries

First, we need to install the necessary Python libraries that will be used throughout the project.


In [None]:
!pip install openai==0.28
!pip install datasets
!pip install evaluate
!pip install seqeval


Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl.metadata (13 kB)
Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
Successfully installed openai-0.28.0
Collecting datasets
  Downloading datasets-3.0.2-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.2-py3-none-any.whl (472 kB)
Downloading dill-0.3.8-py3-none-any.whl

## Step 2: Importing Libraries and Downloading NLTK Data

We import the essential libraries and download necessary data for natural language processing.


In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import nltk


nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

- **requests**: To make HTTP requests for fetching Wikipedia pages.
- **BeautifulSoup**: For parsing HTML content from Wikipedia.
- **pandas**: For handling data in DataFrame structures.
- **re**: For regular expressions used in text cleaning.
- **nltk**: Natural Language Toolkit for text processing tasks.



## Step 3: Defining the List of Mountains

We create a list of mountain names that we want to gather information about.


In [None]:
mountain_names = [
    'Mount Everest', 'K2', 'Kangchenjunga', 'Lhotse', 'Makalu',
    'Cho Oyu', 'Dhaulagiri', 'Manaslu', 'Nanga Parbat', 'Annapurna'
]


## Step 4: Scraping Wikipedia for Sentences Containing Mountain Names

We define a function to scrape Wikipedia pages for each mountain and extract relevant sentences.


In [None]:
def get_sentences_from_wikipedia(mountain_name):
    mountain_url = mountain_name.replace(' ', '_')
    url = f'https://en.wikipedia.org/wiki/{mountain_url}'

    try:
        # A timeout keeps the scraper from hanging on a stalled connection
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            print(f"Could not retrieve page for {mountain_name}")
            return []

        soup = BeautifulSoup(response.content, 'html.parser')
        # Extract text from paragraphs
        paragraphs = soup.find_all('p')
        # Join with a space so the last sentence of one paragraph
        # cannot fuse with the first sentence of the next
        text_content = ' '.join(para.get_text() for para in paragraphs)

        # Clean text
        text_content = re.sub(r'\[[0-9]+\]', '', text_content)

        # Split into sentences
        sentences = nltk.sent_tokenize(text_content)

        relevant_sentences = []
        for sentence in sentences:
            if mountain_name in sentence:
                relevant_sentences.append(sentence)

        return relevant_sentences

    except Exception as e:
        print(f"Error processing {mountain_name}: {e}")
        return []


**Function Explanation:**

1. **URL Construction**: Converts the mountain name to a format suitable for Wikipedia URLs.
2. **Fetching Content**: Makes an HTTP request to fetch the Wikipedia page.
3. **Parsing HTML**: Uses BeautifulSoup to parse the HTML content and extract all paragraph texts.
4. **Cleaning Text**: Removes citation numbers like `[1]`, `[2]` using regular expressions.
5. **Sentence Tokenization**: Splits the cleaned text into individual sentences.
6. **Filtering Sentences**: Keeps only those sentences that mention the mountain name.
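
Steps 4 to 6 can be tried offline on a small string. The sketch below is illustrative only: the `clean_and_filter` helper and the sample text are made up for this example, and a naive regex split stands in for `nltk.sent_tokenize` so the snippet runs without downloading NLTK data.

```python
import re

def clean_and_filter(text, mountain_name):
    """Strip numeric citation markers, split into sentences, keep matches."""
    # Same pattern as in the notebook: removes markers such as [1] or [27]
    text = re.sub(r'\[[0-9]+\]', '', text)
    # Naive stand-in for nltk.sent_tokenize (assumption for this sketch):
    # split on whitespace that follows '.', '!' or '?'
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Exact, case-sensitive substring match, as in the original function
    return [s for s in sentences if mountain_name in s]

sample = (
    "K2 is the second-highest mountain on Earth.[1] "
    "It lies in the Karakoram range.[2] "
    "Many climbers regard K2 as a difficult ascent.[3][4]"
)
filtered = clean_and_filter(sample, "K2")
for sentence in filtered:
    print(sentence)
```

The middle sentence is dropped because it refers to the mountain only as "It", which shows how the substring filter discards sentences that mention the mountain indirectly.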


## Step 5: Creating the Dataset

We iterate over each mountain, scrape relevant sentences, and compile them into a DataFrame which is then saved as a CSV file.


In [None]:
data = []
for mountain in mountain_names:
    print(f"Processing {mountain}")
    sentences = get_sentences_from_wikipedia(mountain)
    for sentence in sentences:
        data.append({'sentence': sentence, 'mountain': mountain})


df = pd.DataFrame(data)


df.to_csv('mountain_dataset.csv', index=False)

print("Dataset creation complete. Saved to 'mountain_dataset.csv'.")


Processing Mount Everest
Processing K2
Processing Kangchenjunga
Processing Lhotse
Processing Makalu
Processing Cho Oyu
Processing Dhaulagiri
Processing Manaslu
Processing Nanga Parbat
Processing Annapurna
Dataset creation complete. Saved to 'mountain_dataset.csv'.


- **Data Collection**: For each mountain, relevant sentences are collected and stored along with the mountain's name.
- **DataFrame Creation**: Organizes the collected data into a tabular format.
- **Saving Data**: The DataFrame is saved as `mountain_dataset.csv` for later use.



## Step 6: Preparing Data for Model Training

We load the dataset, define labels for NER, and tokenize the sentences.


In [None]:
import torch
from transformers import BertTokenizerFast, BertForTokenClassification, Trainer, TrainingArguments
from datasets import Dataset, DatasetDict
import evaluate  # Updated import
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load dataset
df = pd.read_csv('mountain_dataset.csv')

# Define label list and mappings
label_list = ['O', 'B-MTN', 'I-MTN']
label_encoding_dict = {'O': 0, 'B-MTN': 1, 'I-MTN': 2}

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')





- **BERT Tokenizer**: Converts text into tokens that BERT can understand.
- **Labels**:
  - **O**: Outside of a named entity.
  - **B-MTN**: Beginning of a mountain name entity.
  - **I-MTN**: Inside a mountain name entity.
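
To make the tagging scheme concrete, here is a minimal sketch that reads a labeled sequence back into entity strings. The `extract_entities` helper is hypothetical (it is not part of the notebook) and assumes word-level tokens rather than BERT subwords.

```python
def extract_entities(tokens, labels):
    """Group B-MTN/I-MTN runs into entity strings (word-level tokens assumed)."""
    entities, current = [], []
    for token, label in zip(tokens, labels):
        if label == 'B-MTN':
            # A new entity starts; close any entity still open
            if current:
                entities.append(' '.join(current))
            current = [token]
        elif label == 'I-MTN' and current:
            # Continue the entity that is currently open
            current.append(token)
        else:
            # 'O' (or a stray I-MTN with nothing open) ends the current entity
            if current:
                entities.append(' '.join(current))
            current = []
    if current:
        entities.append(' '.join(current))
    return entities

tokens = ['Mount', 'Everest', 'and', 'K2', 'attract', 'climbers', '.']
labels = ['B-MTN', 'I-MTN', 'O', 'B-MTN', 'O', 'O', 'O']
entities = extract_entities(tokens, labels)
print(entities)
```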


### Tokenizing Sentences and Assigning Labels

We define a function to tokenize sentences and assign appropriate labels to each token.


In [None]:
def tokenize_and_label(row):
    sentence = row['sentence']
    mountain = row['mountain']

    # Tokenize sentence
    tokens = tokenizer.tokenize(sentence)

    # Initialize labels
    labels = ['O'] * len(tokens)

    # Tokenize mountain name
    mountain_tokens = tokenizer.tokenize(mountain)
    mountain_token_length = len(mountain_tokens)

    # Find the position of the mountain name in the tokenized sentence
    for i in range(len(tokens) - mountain_token_length + 1):
        if tokens[i:i+mountain_token_length] == mountain_tokens:
            labels[i] = 'B-MTN'
            for j in range(1, mountain_token_length):
                labels[i+j] = 'I-MTN'
            break  # Assuming mountain name appears only once per sentence

    return {'tokens': tokens, 'labels': labels}


**Function Explanation:**

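1. **Tokenization**: Splits the sentence and the mountain name into WordPiece tokens with the same tokenizer.
2. **Label Initialization**: Starts with every token labeled `O` (outside).
3. **Label Assignment**: Slides over the sentence tokens to find the first position where the mountain's token sequence matches, then labels the first matched token `B-MTN` (beginning) and the remaining ones `I-MTN` (inside). Because the loop stops at the first match, later mentions in the same sentence stay labeled `O`.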
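
Since the function above stops at the first match, a sentence that names the mountain twice ends up with one mention labeled `O`. One possible variant, sketched here with a hypothetical `label_all_occurrences` helper on plain token lists, keeps scanning after each match:

```python
def label_all_occurrences(tokens, entity_tokens):
    """Label every non-overlapping occurrence of entity_tokens with B-MTN/I-MTN."""
    labels = ['O'] * len(tokens)
    n = len(entity_tokens)
    if n == 0:
        return labels
    i = 0
    while i <= len(tokens) - n:
        if tokens[i:i + n] == entity_tokens:
            labels[i] = 'B-MTN'
            for j in range(1, n):
                labels[i + j] = 'I-MTN'
            i += n  # skip past the match instead of breaking out of the loop
        else:
            i += 1
    return labels

tokens = ['Cho', 'Oyu', 'is', 'near', 'Everest', ';', 'Cho', 'Oyu', 'is', 'lower', '.']
all_labels = label_all_occurrences(tokens, ['Cho', 'Oyu'])
print(list(zip(tokens, all_labels)))
```

Note that "Everest" stays `O` here, just as in the notebook: each row is labeled only for the mountain stored in its `mountain` column.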



### Applying the Tokenization and Labeling

We process all sentences and create a new DataFrame with tokens and labels.


In [None]:
processed_data = df.apply(tokenize_and_label, axis=1)


# Each element of processed_data is a dict with 'tokens' and 'labels',
# so the list of dicts converts directly into a two-column DataFrame
data_df = pd.DataFrame(processed_data.tolist())


### Splitting the Dataset into Training and Testing Sets

We split the data to train the model and evaluate its performance.


In [None]:
train_df, test_df = train_test_split(data_df, test_size=0.2, random_state=42)


train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
test_dataset = Dataset.from_pandas(test_df.reset_index(drop=True))
datasets = DatasetDict({'train': train_dataset, 'test': test_dataset})


- **Training Set**: 80% of the data used to train the model.
- **Testing Set**: 20% of the data used to evaluate the model's performance.


### Aligning Tokens with Labels

We prepare the data so that each token has a corresponding label, handling any tokenization discrepancies.


In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples['tokens'],
        is_split_into_words=True,
        truncation=True,
        padding='max_length',
        max_length=128
    )

    labels = []
    for i, label in enumerate(examples['labels']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label_encoding_dict[label[word_idx]])
            else:
                label_ids.append(label_encoding_dict[label[word_idx]] if label[word_idx].startswith('I') else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Tokenize datasets
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)


Map:   0%|          | 0/259 [00:00<?, ? examples/s]

Map:   0%|          | 0/65 [00:00<?, ? examples/s]

**Function Explanation:**

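- **Tokenization with Alignment**: `word_ids()` maps every subword back to the input item it came from, so each subword can receive the label of its source token.
- **Label Adjustment**: Assigns `-100` to positions that should be ignored by the loss: special tokens (`[CLS]`, `[SEP]`), padding, and continuation subwords of items labeled `O` or `B-MTN`. Continuation subwords of `I-MTN` items keep their label. `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.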
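
The alignment rule can be checked without loading a tokenizer. The sketch below mirrors the inner loop of `tokenize_and_align_labels` on a hand-written `word_ids` list; the subword split shown in the comment is a made-up illustration, not actual tokenizer output.

```python
LABEL_TO_ID = {'O': 0, 'B-MTN': 1, 'I-MTN': 2}

def align_labels(word_ids, word_labels):
    """Mirror of the notebook's alignment loop for a single example."""
    aligned, previous = [], None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens and padding are ignored by the loss
            aligned.append(-100)
        elif word_idx != previous:
            # First subword of an item takes that item's label
            aligned.append(LABEL_TO_ID[word_labels[word_idx]])
        else:
            # Continuation subword: keep I-* labels, ignore everything else
            label = word_labels[word_idx]
            aligned.append(LABEL_TO_ID[label] if label.startswith('I') else -100)
        previous = word_idx
    return aligned

# Hypothetical layout: [CLS] Mount Ever ##est rises [SEP] [PAD]
word_ids = [None, 0, 1, 1, 2, None, None]
word_labels = ['B-MTN', 'I-MTN', 'O']
aligned = align_labels(word_ids, word_labels)
print(aligned)
```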



## Step 7: Setting Up the BERT Model for NER

We load a pre-trained BERT model and prepare it for token classification tasks.


In [None]:
model = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=len(label_list))



Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


- **BERT Model**: A transformer-based model pre-trained on a large corpus of text.
- **Token Classification**: Fine-tuning BERT to classify each token in a sentence.


## Step 8: Defining Evaluation Metrics

We set up the evaluation metrics to assess the model's performance.


In [None]:
metric = evaluate.load("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_labels = [
        [label_list[l] for l in label if l != -100]
        for label in labels
    ]
    true_predictions = [
        [label_list[pred] for (pred, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }


- **Seqeval**: A library for evaluating sequence labeling tasks like NER.
- **Metrics**:
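  - **Precision**: Fraction of predicted entities that exactly match a true entity.
  - **Recall**: Fraction of true entities that were predicted exactly.
  - **F1 Score**: Harmonic mean of precision and recall.
  - **Accuracy**: Fraction of individual tokens labeled correctly. Because most tokens are `O`, accuracy can be high while entity-level scores are low (after epoch 1 the run reports 0.945 accuracy but an F1 of only 0.076).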
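
As a worked example of how the three entity-level scores relate, the sketch below computes them from raw counts. The counts (60 correct entities, 23 spurious, 5 missed) are an assumption: the notebook does not report them, and they were chosen only because they reproduce the epoch-3 precision, recall, and F1 shown in the training log.

```python
def precision_recall_f1(tp, fp, fn):
    """Entity-level scores from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed counts, not taken from the notebook output
precision, recall, f1 = precision_recall_f1(tp=60, fp=23, fn=5)
print(round(precision, 6), round(recall, 6), round(f1, 6))
```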


## Step 9: Configuring Training Arguments and Trainer

We set up the training parameters and initialize the Trainer for model training.


In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorForTokenClassification

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    save_strategy='epoch',
    logging_strategy='epoch',
    report_to='none',  # Disable wandb logging
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    greater_is_better=True
)

data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)




- **TrainingArguments**:
  - **output_dir**: Directory to save model checkpoints.
  - **evaluation_strategy**, **save_strategy**, **logging_strategy**: All set to `'epoch'`, so the model is evaluated, checkpointed, and logged once at the end of every epoch.
  - **learning_rate**: Initial learning rate for the optimizer.
  - **per_device_train_batch_size**, **per_device_eval_batch_size**: Number of samples per device in each training and evaluation step.
  - **num_train_epochs**: Number of training epochs.
  - **weight_decay**: Weight decay for regularization.
  - **save_total_limit**: Maximum number of checkpoints to keep on disk.
  - **load_best_model_at_end**: Reload the best checkpoint after training finishes.
  - **metric_for_best_model**: Metric used to pick the best checkpoint (here `f1`).
  - **greater_is_better**: Whether a higher metric score is better.
- **DataCollator**: `DataCollatorForTokenClassification` batches examples and pads inputs and labels to a common length.
- **Trainer**: Handles the training loop, evaluation, and checkpointing.


## Step 10: Training the Model

We initiate the training process.


In [None]:
trainer.train()


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.3839 | 0.134267 | 0.065217 | 0.092308 | 0.076433 | 0.945051 |
| 2 | 0.1019 | 0.049324 | 0.622222 | 0.861538 | 0.722581 | 0.982311 |
| 3 | 0.0545 | 0.037574 | 0.722892 | 0.923077 | 0.810811 | 0.986075 |


TrainOutput(global_step=99, training_loss=0.18010471083901144, metrics={'train_runtime': 1434.9966, 'train_samples_per_second': 0.541, 'train_steps_per_second': 0.069, 'total_flos': 50757353885952.0, 'train_loss': 0.18010471083901144, 'epoch': 3.0})
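
The `global_step=99` and `train_samples_per_second` figures follow from the dataset size and batch size. A quick check, assuming a single device and no gradient accumulation (neither is overridden in the `TrainingArguments` above):

```python
import math

train_examples = 259      # size of the training split shown in the Map output
batch_size = 8            # per_device_train_batch_size
epochs = 3                # num_train_epochs
train_runtime = 1434.9966 # seconds, from TrainOutput

# The last batch of each epoch is smaller, hence the ceiling
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime

print(steps_per_epoch, total_steps, round(samples_per_second, 3))
```

At roughly 0.54 samples per second, the run appears to have been slow for a dataset this small, so any tuning that multiplies the number of runs will multiply that cost.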

## Step 11: Saving the Trained Model and Tokenizer

After training, we save the model and tokenizer for future use.


In [None]:
# Save model and tokenizer
model.save_pretrained('ner_mountain_model')
tokenizer.save_pretrained('ner_mountain_model')

('ner_mountain_model/tokenizer_config.json',
 'ner_mountain_model/special_tokens_map.json',
 'ner_mountain_model/vocab.txt',
 'ner_mountain_model/added_tokens.json',
 'ner_mountain_model/tokenizer.json')

- **Saved Files**: The trained model and tokenizer are saved in the `ner_mountain_model` directory.



# What Can Be Improved


## 1. Limited Dataset Diversity

**Problem:** The current dataset is limited to sentences extracted from Wikipedia articles about ten specific mountains. This narrow scope may lead to a model that does not generalize well to other contexts or recognize mountain names outside the provided list.

- **Data Bias:** Training data heavily influences a model's ability to generalize. A dataset sourced from a single type of text (Wikipedia) may not capture the linguistic diversity found in other text sources, such as news articles, travel blogs, or scientific journals.
  
- **Entity Coverage:** Focusing on only ten mountains restricts the model's ability to recognize a broader range of mountain names, potentially failing to identify less prominent or differently named mountains.

**Proposed Solution:**

- **Expand the Mountain List:** Include a more comprehensive list of mountain names, possibly by incorporating data from databases like the Global Mountain Biodiversity Assessment (GMBA) or the Peakbagger database.

- **Diversify Data Sources:** Scrape sentences from various sources to capture different writing styles and contexts. This will help the model learn to recognize mountain names in diverse textual environments.

## 2. Inadequate Handling of Entity Variations

**Problem:** The current preprocessing and labeling approach assumes that mountain names appear exactly as specified in the list, failing to account for variations such as abbreviations, alternate spellings, or translations.


- **Entity Variations:** Mountain names may appear in text in various forms. For example, "Mount Everest" might be referred to as "Everest," "Mt. Everest," or even "Sagarmatha" (its Nepali name).

- **Case Sensitivity:** The labeling function is case-sensitive, potentially missing mountain names that are not capitalized properly due to typos or stylistic choices.

**Proposed Solution:**

- **Implement Fuzzy Matching:** Use more sophisticated string matching techniques, such as Levenshtein distance or regular expressions, to identify mountain names despite minor differences.

- **Case Normalization:** Compare names case-insensitively when generating labels. Because the model is `bert-base-cased`, keep the original casing in the text fed to the model, since capitalization is a useful cue for named entities.
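
A minimal sketch of alias-aware, case-insensitive matching is shown below. It uses a hand-written alias table and a regular expression rather than Levenshtein distance; the `ALIASES` dictionary and the `find_mentions` helper are hypothetical and would need to be extended per mountain.

```python
import re

# Hypothetical alias table; only the variants named in the text above
ALIASES = {
    'Mount Everest': ['Mount Everest', 'Mt. Everest', 'Everest', 'Sagarmatha'],
}

def find_mentions(sentence, aliases):
    """Return (start, end, text) for every alias match, ignoring case."""
    # Longest aliases first so "Mount Everest" wins over the shorter "Everest"
    ordered = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(a) for a in ordered) + r')\b',
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(sentence)]

sentence = "Locals call mt. everest Sagarmatha, and Everest draws many climbers."
mentions = find_mentions(sentence, ALIASES['Mount Everest'])
for start, end, text in mentions:
    print(start, end, text)
```

The character spans refer to the original sentence, so the casing of the text that reaches the model is left untouched; the spans would still have to be mapped onto token positions before assigning `B-MTN`/`I-MTN` labels.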

## 3. Suboptimal Model Performance Due to Limited Hyperparameter Tuning

**Problem:** The model is trained with a single, fixed set of hyperparameters (learning rate `2e-5`, batch size 8, 3 epochs, weight decay 0.01) and never explores the hyperparameter space, which may result in suboptimal performance.

**Proposed Solution:**

- **Hyperparameter Optimization:** Employ techniques such as grid search, random search, or Bayesian optimization to systematically explore hyperparameter combinations.

- **Cross-Validation:** Use k-fold cross-validation to assess model performance more reliably and to detect overfitting, which matters with a dataset of only a few hundred sentences.

- **Learning Rate Scheduling:** The `Trainer` already decays the learning rate linearly by default; adding a warmup phase or trying other schedules (e.g. cosine) may give better convergence.
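
To illustrate what a warmup phase changes, here is a small stand-alone sketch of a linear warmup followed by linear decay. The function name and the choice of 9 warmup steps are assumptions for illustration; the 99 total steps and the `2e-5` base rate come from the run above.

```python
def linear_schedule_with_warmup(step, total_steps, warmup_steps, base_lr):
    """Learning rate at a given step: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 99    # global_step reported by the training run
warmup_steps = 9    # assumed value for illustration (about 10% of training)
base_lr = 2e-5      # learning_rate used in TrainingArguments

lrs = [
    linear_schedule_with_warmup(s, total_steps, warmup_steps, base_lr)
    for s in range(total_steps + 1)
]
print(lrs[0], lrs[warmup_steps], lrs[total_steps])
```

With the `Trainer`, a comparable effect can be requested through the `warmup_steps` or `warmup_ratio` training arguments rather than by writing the schedule by hand.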







