## Introduction
Named Entity Recognition (NER) is a core task in Natural Language Processing (NLP) that aims to identify and classify named entities, such as people, organizations, diseases, and drugs, in unstructured text. It is a challenging task because natural language is complex and ambiguous, and it plays a vital role in NLP applications such as information retrieval, question answering, and machine translation. In recent years, deep learning techniques have achieved state-of-the-art results on NER tasks. Deep learning is a subset of machine learning that trains artificial neural networks to learn from data and make predictions. One of the most popular deep learning architectures for NER is the Transformer, which was introduced in 2017 and has since become a cornerstone of modern NLP. Other deep learning techniques have also been applied to NER, including Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), as well as Transformer-based pre-trained models such as Bidirectional Encoder Representations from Transformers (BERT).

NER is especially useful in biomedical text mining. Earlier NER methods for biomedical text relied on dictionaries, hand-written rules, and traditional machine learning techniques, which are time-consuming to build and have been shown to perform worse than deep learning methods. However, accurate deep neural network systems for NER have only recently been developed. NER is complicated by two factors: the category of a named entity depends on the context of the surrounding text, and named entities have multiple definitions and evaluation criteria. Neural network (NN) systems are preferable to other machine learning systems because they require less feature engineering, which makes them more domain independent.

NER performance can be evaluated in different ways. Evaluation can be based on type, which asks whether the predicted label was correct regardless of the entity boundaries, or on text, which asks whether the predicted entity boundaries were correct regardless of the label. Precision is the number of correct predictions divided by the total number of predictions the system made. Recall is the number of entities the system predicted correctly divided by the number identified by human annotators. A statistic often used when evaluating NER performance is the F-score, the harmonic mean of precision and recall, which can be computed over both the type and text criteria.
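
The metrics described above can be illustrated with a minimal sketch. It assumes entities are represented as `(start, end, label)` tuples and counts a prediction as correct only on an exact match; the function names, offsets, and label strings are hypothetical and chosen for illustration. The text criterion is approximated by dropping the label and comparing boundaries only. A type criterion that tolerates boundary differences would need overlap matching, which is not shown here.

```python
def precision_recall_f1(gold, predicted):
    """Return (precision, recall, F1) for two collections of hashable items."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def boundaries_only(entities):
    """Drop the label so only the (start, end) boundaries are compared."""
    return {(start, end) for start, end, _ in entities}


# Hypothetical annotations for one document (offsets and labels are made up)
gold = {(0, 7, "PATHOGEN"), (20, 27, "MEDICINE"), (40, 48, "MEDICALCONDITION")}
predicted = {(0, 7, "PATHOGEN"), (20, 27, "MEDICALCONDITION")}

# Exact match: boundaries and label must both agree
exact_p, exact_r, exact_f1 = precision_recall_f1(gold, predicted)

# Text match: boundaries must agree, the label is ignored
text_p, text_r, text_f1 = precision_recall_f1(
    boundaries_only(gold), boundaries_only(predicted)
)

print(f"exact: P={exact_p:.2f} R={exact_r:.2f} F1={exact_f1:.2f}")
print(f"text:  P={text_p:.2f} R={text_r:.2f} F1={text_f1:.2f}")
```

In this toy case the second prediction has the right boundaries but the wrong label, so it counts under the text criterion and not under exact matching.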

In this research project, we explore various deep learning techniques for NER and compare their performance on a benchmark dataset, with the goal of improving the accuracy and efficiency of NER systems.

[@xiong2021Improving]




### Recurrent Neural Network (RNN)
A Recurrent Neural Network (RNN) is a deep learning architecture designed for sequential or time-series data. RNNs are often used for ordinal problems, where the output is not continuous but discrete and ordered, and temporal problems, where data is collected over time and the order of the observations matters. These problems include speech recognition, language translation, and other natural language processing (NLP) tasks. An RNN carries information from prior inputs forward and uses it to influence how the current input is processed and what output is produced. This use of prior inputs is what gives the RNN its memory and distinguishes it from other deep learning methods. RNNs are also distinguished by sharing the same parameters across every time step of the network.

A popular type of RNN architecture known as Long Short-Term Memory (LSTM) has been used in named entity recognition. The hidden layers of an LSTM are made up of individual units, each of which has three gates: an input gate that controls how much input information goes into the memory cell, a forget gate that controls how much historical information passes through from the previous state, and an output gate that controls how much information is passed on to the next step. RNNs allow information to persist across multiple steps, which enables the network to capture dependencies and context. This architecture makes RNNs a useful tool for NER, and RNN-based systems have been shown to perform better than previous NER systems.
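
To make the three gates concrete, the following is a minimal NumPy sketch of a single LSTM time step using the standard formulation. The function name `lstm_step`, the weight layout, and the sizes are assumptions chosen for illustration, and the weights are random rather than trained. The same `W` and `b` are reused at every step, which is the parameter sharing described above.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * hidden, input + hidden) and b has shape (4 * hidden,).
    The four row blocks hold the input gate, forget gate, output gate,
    and candidate cell values (an assumed layout for this sketch).
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0:hidden])                 # input gate: how much new information enters the cell
    f = sigmoid(z[hidden:2 * hidden])        # forget gate: how much of the previous cell state is kept
    o = sigmoid(z[2 * hidden:3 * hidden])    # output gate: how much of the cell state is exposed
    g = np.tanh(z[3 * hidden:4 * hidden])    # candidate values to write into the cell
    c_t = f * c_prev + i * g                 # updated memory cell
    h_t = o * np.tanh(c_t)                   # hidden state passed on to the next step
    return h_t, c_t


rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W = rng.normal(scale=0.1, size=(4 * hidden_size, input_size + hidden_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
sequence = rng.normal(size=(5, input_size))  # five time steps of made-up input

for x_t in sequence:
    # The same W and b are applied at every step
    h, c = lstm_step(x_t, h, c, W, b)

print(h.shape, c.shape)
```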


## Transformers and BERT: A Brief Overview

### Transformers

The Transformer is a neural network architecture introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017). The model is designed for sequence-to-sequence tasks, such as machine translation, and is known for its ability to process the positions of an input sequence in parallel rather than one at a time. This parallel processing makes the Transformer efficient to train and scalable. One of its key innovations is the self-attention mechanism, which allows the model to weigh the importance of different words in the input sequence relative to each other when making predictions. The model uses multi-head attention, which means it can attend to several different aspects of the input simultaneously. This ability to capture complex dependencies and relationships between words contributes to the model's strong performance. Given its effectiveness and efficiency in handling sequence data, the Transformer has become the foundation for many subsequent natural language processing (NLP) models and architectures. The architecture consists of the following key components:
1.	The Transformer model is structured into two core segments: an encoder responsible for interpreting and encoding the input sequence and a decoder that builds the final output based on the encoder's representation. Both the encoder and decoder are composed of a series of identical layers stacked on each other.
2.	Each layer in the encoder has two key components: multi-head self-attention, which allows the model to weigh the relevance of different input elements, and a feedforward neural network, which is applied to each position separately.
3.	Each layer in the decoder has three key components: multi-head self-attention, similar to the encoder; multi-head cross-attention, which attends to the encoder's output; and a feedforward neural network, like the one in the encoder.
4.	Multi-head attention is a feature that lets the model focus on different aspects of the input data with multiple "attention heads." This mechanism helps the model understand input data more effectively.
5.	Positional encoding gives the model information about the order of words in a sequence. It's added to the initial word embeddings and helps the model understand the position of each word.
6.	Once the decoder completes its processing, its output is passed through two further layers: a linear layer and a softmax layer. Together, these layers produce a probability distribution over the entire target vocabulary. In the simplest (greedy) decoding scheme, the model selects the word with the highest probability as the output for each position in the sequence.
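
The self-attention component listed above can be sketched in a few lines of NumPy. This is a single attention head using scaled dot-product attention as described in the original Transformer paper; the function names, dimensions, and random weights are assumptions for illustration only. Multi-head attention runs several such heads with separate projection matrices and concatenates their outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X has shape (seq_len, d_model); each W_* has shape (d_model, d_k).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of every position with every other position
    weights = softmax(scores, axis=-1)    # each row is a distribution over input positions
    return weights @ V, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))   # made-up token representations
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape, weights.shape)
```

Row `i` of `weights` shows how strongly position `i` attends to every position in the sequence, which is the "relevance" weighting described in the list.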


*[Figure placeholder: Transformer architecture]*


The Transformer model is a robust neural network for handling sequential data. It uses attention mechanisms to understand the relationships between different elements in the input and produces high-quality output for tasks like language translation.
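
As a complement to the positional encoding component described earlier, here is a minimal sketch of the sinusoidal encoding from the original Transformer paper, where even dimensions use a sine and odd dimensions use a cosine of the position scaled by a dimension-dependent frequency. The function name and sizes are assumptions, `d_model` is assumed to be even, and the embeddings are random stand-ins. BERT, discussed next, uses learned position embeddings instead of this fixed scheme.

```python
import numpy as np


def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding of shape (seq_len, d_model); d_model must be even."""
    positions = np.arange(seq_len)[:, None]      # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]     # even dimension indices, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe


seq_len, d_model = 6, 8
pe = positional_encoding(seq_len, d_model)

# The encoding is added to the word embeddings before the first layer
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(seq_len, d_model))  # made-up embeddings
inputs = embeddings + pe
print(inputs.shape)
```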


### BERT (Bidirectional Encoder Representations from Transformers)

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a pre-trained language model introduced in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2018). BERT has achieved state-of-the-art results on various NLP tasks, including text classification, named entity recognition, and question answering. One of its distinguishing features is its ability to capture bidirectional context. Unlike traditional language models that process text from left to right or right to left, BERT considers both the left and right context when making predictions. This bidirectional context is achieved through a pre-training objective called masked language modeling (MLM). In MLM, some tokens in the input text are randomly masked, and the model is trained to predict the masked tokens from the surrounding context. BERT is also pre-trained using a next-sentence prediction (NSP) objective, which trains the model to predict whether two sentences follow each other in the original text. This objective helps BERT understand the relationships between sentences. BERT's pre-training phase is conducted on large unannotated text corpora, resulting in a language model that captures rich language representations. The pre-trained BERT model is then fine-tuned on specific NLP tasks using relatively small amounts of labeled data, resulting in strong performance across diverse NLP benchmarks. BERT's success has led to the development of numerous variants and adaptations of the model, and it has become one of the most influential models in NLP research and applications.

The following figure presents examples of both pre-training tasks using a pair of input sentences. It also highlights how the BERT model transforms the input into token sequences for processing. The critical components are as follows:
1.	Input Representation: BERT's input representation includes three key components: token embeddings, segment embeddings, and position embeddings. These embeddings are combined to create a comprehensive representation of each token. The input sequence starts with the [CLS] token (classification token) and uses the [SEP] token (separator token) to separate the sentences.
2.	Masked Language Model (MLM): The MLM task involves randomly masking specific tokens in the input sequence and training the model to predict the masked tokens based on the context provided by the unmasked tokens. In the figure, the word "making" is masked and replaced with the [MASK] token, and the model predicts the original word.
3.	Next Sentence Prediction (NSP): The NSP task trains the model to understand the relationships between pairs of sentences. The model predicts whether the second sentence will likely follow the first sentence. This task is binary, with the possible predictions being "IsNext" (the second sentence follows the first) or "NotNext" (the second sentence is random). In the figure, the model predicts "IsNext."

*[Figure placeholder: BERT input representation and the MLM and NSP pre-training tasks]*

Overall, this figure provides a visual depiction of the input representation used by BERT and the two pre-training tasks that contribute to its language understanding capabilities. The MLM task helps BERT understand language context, while the NSP task allows it to understand sentence-level relationships. These pre-training tasks enable BERT to learn deep bidirectional representations, making it a powerful language model.
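
The input representation and the MLM task can be sketched in plain Python. This is a simplified illustration under several assumptions: the two sentences are hypothetical, tokens are split on whitespace (BERT itself uses WordPiece sub-word tokens), and every selected token is replaced by `[MASK]`. The BERT paper selects about 15% of tokens and replaces only most of them with `[MASK]`, leaving some unchanged or swapping in random tokens; that detail is omitted here. The function names are not part of any library.

```python
import random


def build_bert_input(sentence_a, sentence_b):
    """Return tokens, segment ids, and position ids for a sentence pair."""
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP]; segment 1 covers the rest
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of ordinary tokens with [MASK] (simplified MLM)."""
    rng = random.Random(seed)
    candidates = [i for i, tok in enumerate(tokens) if tok not in ("[CLS]", "[SEP]")]
    n_to_mask = max(1, round(len(candidates) * mask_prob))
    masked = list(tokens)
    targets = {}  # position -> original token the model should predict
    for i in sorted(rng.sample(candidates, n_to_mask)):
        targets[i] = tokens[i]
        masked[i] = "[MASK]"
    return masked, targets


# Hypothetical sentence pair
tokens, segment_ids, position_ids = build_bert_input(
    "The patient was given aspirin", "The headache improved overnight"
)
masked, targets = mask_tokens(tokens)

print(masked)
print(segment_ids)
print(targets)
```

For NSP, a training pair like this one would carry the label "IsNext" when the second sentence actually follows the first in the corpus, and "NotNext" when it was drawn at random.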






## Methods

### Data Set

The data set used in this project (`Corona2.json`) is available on Kaggle: https://www.kaggle.com/datasets/finalepoch/medical-ner?select=Corona2.json

### Data Set Description
The data was manually tagged with three entity categories (diseases, pathogens, and medications) for training NER systems. It includes the following columns:

- Text (The actual content)
- Starts (Position on where the label starts)
- Ends (Position on where the label ends)
- Labels (The actual label from the text)
- Categories:
  - Medical condition names (example: influenza, headache, malaria)
  - Medicine names (example: aspirin, penicillin, ribavirin, methotrexate)
  - Pathogens (example: Corona Virus, Zika Virus, cyanobacteria, E. coli)
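
To show how the Starts, Ends, and Labels columns relate to the Text column, here is a small sketch on a hypothetical record. It assumes the offsets are zero-based character positions with an exclusive end, as in Python slicing; this should be verified against the actual file. The sentence, the offsets, and the label strings are made up for illustration.

```python
# Hypothetical record in the same shape as the columns described above
record = {
    "text": "The patient took aspirin for a headache.",
    "entities": [(17, 24, "MEDICINE"), (31, 39, "MEDICALCONDITION")],
}


def entity_spans(record):
    """Return (surface text, label) pairs by slicing the text with each start and end offset."""
    text = record["text"]
    return [(text[start:end], label) for start, end, label in record["entities"]]


spans = entity_spans(record)
for surface, label in spans:
    print(f"{label}: {surface!r}")
```

Printing the sliced text for a few real records is a quick way to confirm that the offsets line up with the annotated entities.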

## Code Cell


In [None]:
!pip install -q nbclient
!pip install -q requests
!pip install -q pandas
!pip install -q nbformat
!pip install -q plotly

In [None]:
import requests

# Raw URL of the JSON file on GitHub
url = 'https://raw.githubusercontent.com/jsanc223/datasetCorona2/main/Corona2.json'

# Make an HTTP GET request; the timeout avoids hanging on a stalled connection
response = requests.get(url, timeout=30)
# Raise an error for unsuccessful responses so later cells never see an undefined `data`
response.raise_for_status()

# Parse the JSON data from the response
data = response.json()
# Print a short summary instead of dumping the whole file
print('Loaded', len(data['examples']), 'examples')

Parse the JSON data into a list of dictionaries so that it is easier to manipulate.

In [None]:
# Collect one dictionary per document: its text plus a list of (start, end, label) entities
training_data = []
for example in data['examples']:
    temp_dict = {}
    temp_dict['text'] = example['content']
    temp_dict['entities'] = []
    for annotation in example['annotations']:
        start = annotation['start']
        end = annotation['end']
        # Normalize tag names to upper case
        label = annotation['tag_name'].upper()
        temp_dict['entities'].append((start, end, label))
    training_data.append(temp_dict)
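
Sequence models such as LSTMs and BERT usually consume one label per token rather than character spans. As a hedged sketch of one common way to bridge the two, the code below converts a record in the `training_data` format into BIO tags (B for the first token of an entity, I for later tokens inside it, O for everything else). The `to_bio` helper, the simple regex tokenizer, and the example record are assumptions for illustration and are not part of the project code. End offsets are assumed to be exclusive.

```python
import re


def to_bio(text, entities):
    """Convert character-level (start, end, label) spans into token-level BIO tags."""
    # Split into word tokens and single punctuation marks, keeping character offsets
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        first = True
        for idx, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the entity if the two character ranges overlap
            if tok_start < end and tok_end > start:
                tags[idx] = ("B-" if first else "I-") + label
                first = False
    return [tok for tok, _, _ in tokens], tags


# Hypothetical record in the same shape as the entries of training_data
example = {
    "text": "Zika virus infection can cause fever.",
    "entities": [(0, 10, "PATHOGEN"), (31, 36, "MEDICALCONDITION")],
}

tokens, tags = to_bio(example["text"], example["entities"])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A model with a sub-word tokenizer would need an extra alignment step from these word-level tags to its own tokens, which is not shown here.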

Convert the data from a list of dictionaries to a DataFrame, with one row per entity annotation.

In [None]:
import pandas as pd
# Initialize empty lists to store the data for the DataFrame
texts = []
starts = []
ends = []
labels = []

# Iterate through the training_data to extract individual entity annotations
for example in training_data:
    text = example['text']
    for entity in example['entities']:
        start, end, label = entity
        # Append data to the lists
        texts.append(text)
        starts.append(start)
        ends.append(end)
        labels.append(label)

# Create a DataFrame from the lists
df = pd.DataFrame({'text': texts, 'start': starts, 'end': ends, 'label': labels})
df.head(5)

Data statistics

In [None]:
import plotly.express as px

# Count the occurrences of each label
label_counts = df['label'].value_counts()

# Create a DataFrame with labels and their respective counts
df_counts = pd.DataFrame({'label': label_counts.index, 'count': label_counts.values})

# Plot the frequency of each entity label using a bar plot in Plotly
fig = px.bar(df_counts, x='label', y='count', text='count', color='label',
             color_discrete_sequence=px.colors.qualitative.Plotly, title='Frequency of Entity Labels')

# Display the counter label inside the bars
fig.update_traces(textposition='inside')

# Update axis titles
fig.update_layout(xaxis_title='Entity Label', yaxis_title='Frequency')

fig.show()

In [None]:
import plotly.express as px

# Get the counts of each unique label
label_counts = df['label'].value_counts()
# Plot a pie chart using Plotly
fig = px.pie(label_counts, values=label_counts.values, names=label_counts.index, title='Proportion of Entity Labels', hole=0.3)
fig.update_traces(textinfo='percent+label', textfont_size=12)
fig.show()

In [None]:
import plotly.express as px

# Plot a histogram of entity start positions using Plotly
fig = px.histogram(df, x='start', nbins=30, title='Histogram of Entity Start Positions')
fig.update_layout(xaxis_title='Entity Start Position', yaxis_title='Frequency')
fig.show()

In [None]:
import plotly.express as px

# Create box plots for 'start' and 'end' columns using Plotly
fig = px.box(df, y=['start', 'end'], points='all', title='Box Plots of Start and End Entity Positions')
fig.update_layout(yaxis_title='Value', xaxis_title='Column')
fig.show()

## Analysis and Results

### Data and Visualization

### Statistical Modeling

### Conclusion

## References

https://arxiv.org/pdf/1910.10683.pdf

https://arxiv.org/pdf/1810.04805.pdf

https://arxiv.org/abs/1910.11470

https://ojs.aaai.org/index.php/AAAI/article/view/3861

https://arxiv.org/abs/1909.10649

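Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv. https://doi.org/10.48550/arXiv.1810.04805

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. arXiv. https://doi.org/10.48550/arXiv.1706.03762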