# Natural Language Processing. Named entity recognition

### **description:** This file is a demo of the project. It explains the project logic and describes how to work with each file, including its purpose and content.

### author: Bytsenko Anna
date: 13.11.24

### Step 1: Run data generation.ipynb

This notebook, which includes explanations, **prepares a dataset for a Named Entity Recognition (NER)** task by generating, annotating, and processing data related to mountain names.

*It uses a combination of data scraping, annotation, and formatting steps.*

    1. Data Generation

`Description`: This section involves generating sentences about mountains and annotating them with BIO (Begin, Inside, Outside) labels to mark named entities.

    1.1 Mountain Name Scraping

Code Functionality:

- Scrapes a webpage containing a list of mountains (in this case, from Britannica).
- Extracts mountain names from specific HTML elements using the BeautifulSoup library.
- Ensures that only unique mountain names are retained.
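The extraction and de-duplication logic can be sketched as follows. This is only an illustration, not the notebook's code: the notebook uses BeautifulSoup, while this sketch swaps in the standard-library `html.parser` so it runs without extra dependencies. The page layout (names inside `<li><a>` elements), the class `MountainNameParser`, the function `extract_unique_names`, and the sample HTML are all assumptions.

```python
from html.parser import HTMLParser


class MountainNameParser(HTMLParser):
    """Collects the text of every <a> inside an <li> (assumed page layout)."""

    def __init__(self):
        super().__init__()
        self._in_li = False
        self._in_link = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
        elif tag == "a" and self._in_li:
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Keep only non-empty link text found inside a list item.
        if self._in_link and data.strip():
            self.names.append(data.strip())


def extract_unique_names(html):
    parser = MountainNameParser()
    parser.feed(html)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(parser.names))


# Hypothetical HTML fragment standing in for the scraped page.
sample_html = """
<ul>
  <li><a href="#">Mount Everest</a></li>
  <li><a href="#">K2</a></li>
  <li><a href="#">Mount Everest</a></li>
</ul>
"""
print(extract_unique_names(sample_html))  # ['Mount Everest', 'K2']
```

Keeping first-seen order (rather than using a plain `set`) makes the resulting list reproducible between runs.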

    1.2 Data Cleaning and Annotation

This step likely involves processing the scraped names and generating labeled datasets, though further analysis of the notebook is needed to confirm.

    2. Data Annotation

`Description`: This section annotates sentences with BIO (Begin-Inside-Outside) labels for mountains' names.

    2.1 Random Sentence Generation

`Description`: Random sentences are generated, embedding mountain names into various contexts. This creates examples where mountain names are either present or absent in natural-sounding sentences.

Key Steps:

- Define templates with placeholders for mountain names.
- Randomly select mountain names and fill placeholders.
- Store annotated sentences and corresponding labels.
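A minimal sketch of these key steps is shown below. The templates, the `{mountain}` placeholder, whitespace tokenization, and the helper names `annotate` and `generate` are assumptions made for illustration; the notebook may use different templates and a different storage format.

```python
import random

# Hypothetical templates; the last one has no placeholder, so every token is "O".
TEMPLATES = [
    "We climbed {mountain} last summer .",
    "The view from {mountain} is stunning .",
    "I stayed home and read a book .",
]


def annotate(template, mountain):
    """Fill the placeholder and return (tokens, BIO labels) of equal length."""
    tokens, labels = [], []
    for word in template.split():
        if word == "{mountain}":
            parts = mountain.split()
            tokens.extend(parts)
            # First word of the name is B-MOUNTAIN, the rest are I-MOUNTAIN.
            labels.extend(["B-MOUNTAIN"] + ["I-MOUNTAIN"] * (len(parts) - 1))
        else:
            tokens.append(word)
            labels.append("O")
    return tokens, labels


def generate(mountains, n, seed=0):
    """Draw n random (template, mountain) pairs and annotate each one."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [annotate(rng.choice(TEMPLATES), rng.choice(mountains)) for _ in range(n)]


tokens, labels = annotate(TEMPLATES[0], "Mount Everest")
print(list(zip(tokens, labels)))
# [('We', 'O'), ('climbed', 'O'), ('Mount', 'B-MOUNTAIN'), ('Everest', 'I-MOUNTAIN'),
#  ('last', 'O'), ('summer', 'O'), ('.', 'O')]
```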

    3. Test Dataset Preparation

The test dataset is built the same way as the training data: the two previous steps (data generation and annotation) are repeated with identical methods and in the same order.


### Step 2: Run train_model.py

`Description`: This code is part of a machine learning pipeline for training a Named Entity Recognition (NER) model to identify mountain names in text. 

The primary steps include:

- Loading and processing the annotated dataset containing sentences and their corresponding labels (e.g., B-MOUNTAIN, I-MOUNTAIN, O).

- Preparing data for tokenization and aligning labels with tokenized words.

- Initializing a pretrained BERT model for token classification.

- Splitting the data into training and testing sets.

- Training the model on the training set and evaluating it on the test set.

- Saving the trained model and tokenizer for future use.

- Generating evaluation metrics (e.g., classification report) to assess the model's performance.
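The label-alignment step is the least obvious one in the list above, so here is a hedged sketch of one common approach. It assumes word-level labels and a list of word indices per sub-token, such as the one returned by `word_ids()` on a Hugging Face fast tokenizer's output. The `LABEL2ID` mapping, the function name `align_labels`, and the choice to label only the first sub-token of each word are assumptions; `train_model.py` may do this differently.

```python
# Assumed label-to-id mapping; the script may use a different order.
LABEL2ID = {"O": 0, "B-MOUNTAIN": 1, "I-MOUNTAIN": 2}
IGNORE_INDEX = -100  # positions with this value are skipped by PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels):
    """Map word-level labels onto sub-tokens.

    word_ids: one entry per sub-token; None marks a special token such as [CLS].
    word_labels: one BIO label per original word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token
        elif word_id != previous:
            aligned.append(LABEL2ID[word_labels[word_id]])  # first sub-token of a word
        else:
            aligned.append(IGNORE_INDEX)  # continuation sub-token
        previous = word_id
    return aligned


# Words: ["Mount", "Kilimanjaro", "rises"]; the sub-token split is hypothetical.
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = ["B-MOUNTAIN", "I-MOUNTAIN", "O"]
print(align_labels(word_ids, word_labels))  # [-100, 1, 2, -100, -100, 0, -100]
```

Masking continuation sub-tokens keeps each word from being counted several times in the loss; copying the word's label to every sub-token is an equally valid alternative.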

### Step 3: Run inference.py 

`Description`: This code demonstrates the use of a trained BERT-based Named Entity Recognition (NER) model to identify and classify mountain names in given text inputs. It includes steps to load a pre-trained model, tokenize input sentences, make predictions, and handle subtokens to produce a coherent output. 

The key goals are to:

- Tokenize input text: Split the text into tokens that the model understands.

- Predict entity labels: Use the trained model to assign labels (e.g., B-MOUNTAIN, I-MOUNTAIN, O) to tokens.

- Post-process tokens: Merge subtokens and remove special tokens for a human-readable output.

- Display results: Present predictions for each token in the text.
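The post-processing step can be sketched as below. It assumes WordPiece-style sub-tokens, where continuation pieces start with `##`, and BERT's `[CLS]`, `[SEP]`, and `[PAD]` special tokens. The function name `merge_subtokens`, the rule that a merged word keeps the label of its first piece, and the example token split are assumptions rather than the exact behaviour of `inference.py`.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_subtokens(tokens, labels):
    """Drop special tokens and glue '##' pieces back onto the preceding word."""
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and words:
            words[-1] += token[2:]  # the word keeps the label of its first piece
        else:
            words.append(token)
            word_labels.append(label)
    return list(zip(words, word_labels))


# Hypothetical tokenizer output and model predictions.
tokens = ["[CLS]", "mount", "kil", "##iman", "##jar", "##o", "is", "high", "[SEP]"]
labels = ["O", "B-MOUNTAIN", "I-MOUNTAIN", "I-MOUNTAIN", "I-MOUNTAIN",
          "I-MOUNTAIN", "O", "O", "O"]

for word, label in merge_subtokens(tokens, labels):
    print(f"{word}\t{label}")
```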


### Additional step: Run evaluate_model.py

`Description`: This code evaluates a pre-trained Named Entity Recognition (NER) model for identifying mountain names in a given text. 

It performs the following steps:

- Loads a saved model and tokenizer.

- Processes test data, aligning labels with tokenized inputs.

- Prepares the data in a format suitable for evaluation.

- Uses the Trainer class from the Hugging Face library to predict and evaluate the model's performance on the test dataset.

- Generates a classification report and calculates additional evaluation metrics (precision, recall, and F1 score).

- Reports the results of the model performance analysis (details are at the end of the evaluate_model.py file):



### Key Observations

1. **High Recall Across All Classes**  
   The recall values of 1.00 for both `B-MOUNTAIN` and `I-MOUNTAIN` indicate that the model successfully identifies all true mountain-related entities in the dataset.  

2. **Slightly Lower Precision for `B-MOUNTAIN`**  
   The precision of 0.98 for `B-MOUNTAIN` suggests a small number of false positives. These could be due to non-mountain tokens being mistakenly classified as `B-MOUNTAIN`.  

3. **Perfect Performance for `I-MOUNTAIN`**  
   Both precision and recall for `I-MOUNTAIN` are 1.00, showing that the model flawlessly identifies continuation tokens.  

4. **Overall Weighted Metrics**  
   - Weighted Precision: **0.9903**  
     Indicates that, on average, 99% of all predicted entities (weighted by class size) are correct.  
   - Weighted Recall: **1.0000**  
     All true entities in the dataset are captured by the model.  
   - Weighted F1-Score: **0.9951**  
     Combines high precision and perfect recall, confirming robust model performance. 
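To make the weighted metrics concrete, the sketch below computes support-weighted precision, recall, and F1 from token-level labels in plain Python. It assumes token-level scoring over the two entity classes only; the function name `weighted_metrics` and the toy labels are illustrative and are not the project's data or code.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred, classes):
    """Average per-class precision/recall/F1, weighted by each class's true count."""
    support = Counter(t for t in y_true if t in classes)
    total = sum(support.values())
    if total == 0:
        return 0.0, 0.0, 0.0
    precision = recall = f1 = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = support[c] / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Toy example: one "O" token is wrongly predicted as B-MOUNTAIN (a false positive).
y_true = ["B-MOUNTAIN", "I-MOUNTAIN", "O", "O", "B-MOUNTAIN"]
y_pred = ["B-MOUNTAIN", "I-MOUNTAIN", "B-MOUNTAIN", "O", "B-MOUNTAIN"]
p, r, f = weighted_metrics(y_true, y_pred, ["B-MOUNTAIN", "I-MOUNTAIN"])
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
# precision=0.7778 recall=1.0000 f1=0.8667
```

Note that the weighted F1 is the support-weighted mean of the per-class F1 scores, not the harmonic mean of the weighted precision and weighted recall.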

 
*The near-ideal evaluation metrics are likely a consequence of the data itself: the sentences are generated synthetically from templates, so the training and test sets follow the same patterns.*