# Named Entity Recognition

## Objective
The aim of this project is to perform named entity recognition (NER) on the provided text. Our goal is to recognize only locations and dates.
However, we first run NER for all the predefined entity types in the spaCy library and then keep only those of interest (locations and dates).

In [1]:
# importing relevant packages...
import pandas as pd
import spacy
import random
from datasets import load_dataset
from tqdm import tqdm

In [2]:
# ! pip install datasets

## Data Extraction and Text manipulation

- For text manipulation, we remove only newlines and index numbers. Special characters are kept because some place names contain them.
- For named entity recognition we use the spaCy library. spaCy is an open-source NLP library that lets us perform tasks like **NER** very efficiently. We can either train our own custom NER model on labeled data or use one of its generic pre-trained models, which work well for entities such as *person names, places, dates, and numbers*.
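The index-stripping step described above can be sketched on a single paragraph. This is a minimal illustration and not the notebook's own code: it assumes every index looks like digits followed by a dot and whitespace (as in `5.  The applicant ...`), and the names `INDEX_PATTERN` and `clean_paragraph` are hypothetical.

```python
import re

# Assumption: a paragraph index is digits, a dot, then whitespace, e.g. "5.  ".
INDEX_PATTERN = re.compile(r"^\d+\.\s+")

def clean_paragraph(paragraph: str) -> str:
    """Remove the leading index and newlines; leave special characters untouched."""
    without_index = INDEX_PATTERN.sub("", paragraph)
    return without_index.replace("\n", " ").strip()

sample = "5.  The applicant is a journalist for DN.no, a Norwegian Internet-based version of the newspaper Dagens Næringsliv."
print(clean_paragraph(sample))
```

Anchoring the pattern at the start of the string means that double spaces or digits later in a paragraph are left alone.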

In [2]:
# loading data from datasets package...
dataset = load_dataset("ecthr_cases")

No config specified, defaulting to: ecthr_cases/alleged-violation-prediction
Reusing dataset ecthr_cases (C:\Users\rosha\.cache\huggingface\datasets\ecthr_cases\alleged-violation-prediction\1.1.0\8922a012792758e64921d4a66d42adf759e42838aae54a6a8871607f6399aecf)
100%|██████████| 3/3 [00:00<00:00, 85.44it/s]


In [30]:
dataset['test']['facts'][0][:3]

['5.  The applicant is a journalist for DN.no, a Norwegian Internet-based version of the newspaper Dagens Næringsliv (“DN”), published by the company DN Nye Medier AS.',
 '6.  On 23 June 2010 Mr X was indicted for market manipulation and insider trading under the 1997 Act on the Trade of Financial Assets (verdipapirhandelloven). He was accused of having requested Mr Y, an attorney, to draft a letter concerning the Norwegian Oil Company (“DNO”), a limited liability company quoted on the stock exchange. The letter, addressed to a trustee company representing the interests of bond holders in DNO (“the bond trustee company”), gave the impression that it had been written on behalf of a number of bond holders who were seriously concerned about the company’s liquidity, finances and future. In fact, it had been written only on Mr X’s behalf. He had owned only one bond, which he had acquired the same day as he had asked attorney Y to draft the letter.',
 '7.  Mr X had sent a copy of the above-m

In [3]:
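# Flattening each case's list of fact paragraphs into one string...
texts = []
counts = []
for i in tqdm(range(len(dataset['test']['facts']))):
    text = ""
    cnt = 0
    for j in dataset['test']['facts'][i]:
        cnt += 1
        # dropping the leading index ("5.  ...") by splitting once on the double space after it...
        text += j.split("  ", 1)[-1].strip() + ". "
    counts.append(cnt)
    # replacing newlines with spaces...
    texts.append(text.replace('\n', ' '))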

100%|██████████| 1000/1000 [00:38<00:00, 25.67it/s]


In [4]:
# Sanity check...
for i in tqdm(range(len(counts))):
    if len(dataset['test']['facts'][i]) != counts[i]:
        print(f"text: {len(dataset['test']['facts'][i])}, count: {counts[i]}")

100%|██████████| 1000/1000 [00:38<00:00, 25.84it/s]


## Modelling

The flowchart below shows the pipeline spaCy uses to perform named entity recognition: <br>
STEP 1: The text is passed to the *tokenizer*, which splits it into tokens (words, punctuation marks, etc.). <br>
STEP 2: The tokens are passed to the *tagger*, which assigns each token its part-of-speech (POS) tag. <br>
STEP 3: The tagged tokens are passed to the *parser*, which predicts the syntactic (grammatical) structure of the text. <br>
STEP 4: The *attribute ruler* may then be applied. It is typically used to handle exceptions for token attributes and to map values between attributes, such as mapping fine-grained POS tags (singular noun, plural noun) to a coarse-grained POS tag (noun). <br>
STEP 5: Lemmatization is performed to reduce each word to its base form. <br>
STEP 6: Finally, named entity recognition is performed. <br>
After NER, further operations such as text categorization can be performed, and custom components can also be added. <br>
![NER-Pipeline](../plots/NER-Pipeline.svg) <br>
Image Source: spacy.io
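The idea behind the steps above, a document object handed from one component to the next, can be sketched in plain Python. This is a toy illustration only and not how spaCy is implemented: the dict-based `Doc`, the component names, and the tagging and entity rules are all hypothetical.

```python
from typing import Callable, Dict, List

# Toy stand-in for a spaCy Doc: a plain dict that components annotate in turn.
Doc = Dict[str, list]

def toy_tokenizer(text: str) -> Doc:
    """Split on whitespace; real tokenizers also separate punctuation."""
    return {"tokens": text.split(), "tags": [], "ents": []}

def toy_tagger(doc: Doc) -> Doc:
    """Hypothetical rule: label digit-only tokens NUM, everything else X."""
    doc["tags"] = ["NUM" if tok.isdigit() else "X" for tok in doc["tokens"]]
    return doc

def toy_ner(doc: Doc) -> Doc:
    """Hypothetical rule: treat a four-digit NUM token as a DATE entity."""
    doc["ents"] = [
        (tok, "DATE")
        for tok, tag in zip(doc["tokens"], doc["tags"])
        if tag == "NUM" and len(tok) == 4
    ]
    return doc

def run_pipeline(text: str, components: List[Callable[[Doc], Doc]]) -> Doc:
    """Tokenize first, then pass the same doc through each component in order."""
    doc = toy_tokenizer(text)
    for component in components:
        doc = component(doc)
    return doc

doc = run_pipeline("The applicant was born in 1940", [toy_tagger, toy_ner])
print(doc["ents"])  # [('1940', 'DATE')]
```

Because every component takes and returns the same object, components can be added, removed, or reordered without changing the others, as long as each one finds the annotations it relies on.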

In [5]:
def modelling(texts, model):
    locations = [] # list to store all the location entities...
    dates = [] # list to store all the date entities...

    # iterating over all the texts in the test data and keeping only location (GPE) and date entities...
    for text in tqdm(texts):
        location = []
        date = []
        doc = model(text)
        for ent in doc.ents:
            if ent.label_ == 'GPE':
                location.append([ent.text, ent.label_])
            elif ent.label_ == 'DATE':
                date.append([ent.text, ent.label_])
        locations.append(location)
        dates.append(date)
    return locations, dates

In [6]:
# loading spaCy's large pre-trained model for the English language...
nlp = spacy.load("en_core_web_lg")
locations, dates = modelling(texts, nlp)

100%|██████████| 1000/1000 [02:57<00:00,  5.63it/s]


In [28]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [None]:
# ! python -m spacy download en_core_web_trf

In [10]:
# loading spaCy's pre-trained transformer model and running it on the first ten texts only...
trf = spacy.load('en_core_web_trf')
trf_locations, trf_dates = modelling(texts[:10], trf)

100%|██████████| 10/10 [00:54<00:00,  5.44s/it]


In [12]:
# Creating a dataframe to store each text with its location and date entities...
data = {'Text': texts, 'Location': locations, 'Date': dates}  # not named `dict`, to avoid shadowing the built-in
df = pd.DataFrame(data)

In [34]:
# Saving the recognized entities in csv file if needed for later analysis...
df.to_csv('../data/location-date.csv') 

In [13]:
df = pd.read_csv('../data/location-date.csv')
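A note on the round trip above: a CSV file stores text only, so list-valued cells come back as strings. The sketch below shows one way to restore them, assuming each list was written as its Python `repr`. It uses only the standard library so that it runs on its own, and `parse_entities` is a hypothetical helper name.

```python
import ast
import csv
import io

def parse_entities(cell: str) -> list:
    """Turn a stringified list such as "[['Odesa', 'GPE']]" back into a Python list."""
    return ast.literal_eval(cell)

# Write one row the way a list-valued cell ends up in a CSV file: as its repr string.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Location"])
writer.writerow([str([["Odesa", "GPE"]])])

# Read it back: the cell is a plain string until it is parsed.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_cell = rows[0]["Location"]
restored = parse_entities(raw_cell)
print(type(raw_cell).__name__, type(restored).__name__)  # str list
```

With pandas, the same function could be passed to `pd.read_csv` through its `converters` argument, for example `converters={'Location': parse_entities, 'Date': parse_entities}`.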

In [35]:
df.head()

| Unnamed: 0 | Text | Location | Date |
|---|---|---|---|
| 0 | The applicant is a journalist for DN.no, a Nor... | [[DN.no, GPE], [Oslo, GPE], [the United Kingdo... | [[23 June 2010, DATE], [1997, DATE], [the same... |
| 1 | The applicant was born in 1940 and lives in Od... | [[Odesa, GPE]] | [[1940, DATE], [March 2001, DATE], [12 January... |
| 2 | The applicant was born in 1965 and lives in Sm... | [] | [[1965, DATE], [9 November 2006, DATE], [6 Jan... |
| 3 | The applicant was born in 1967 and lives in Ky... | [[Kyiv, GPE], [Ukraine, GPE]] | [[1967, DATE], [August 2002, DATE], [7 Decembe... |
| 4 | The applicant was born in 1967 and lives in St... | [[Varna, GPE], [Varna, GPE], [Varna, GPE]] | [[1967, DATE], [11 March 2014, DATE], [between... |
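The preview shows that one place can be recognized several times in the same text (row 4 lists Varna three times). A minimal sketch for counting mentions per text, assuming each entity is stored as a `[text, label]` pair as above; `count_mentions` is a hypothetical helper, and the two sample rows are copied from the preview.

```python
from collections import Counter

# Two Location cells copied from the preview; each entity is a [text, label] pair.
row_3 = [["Kyiv", "GPE"], ["Ukraine", "GPE"]]
row_4 = [["Varna", "GPE"], ["Varna", "GPE"], ["Varna", "GPE"]]

def count_mentions(entities: list) -> Counter:
    """Count how often each entity text occurs within one text's entity list."""
    return Counter(text for text, _label in entities)

print(count_mentions(row_4))  # Counter({'Varna': 3})
print(sorted(count_mentions(row_3)))  # ['Kyiv', 'Ukraine']
```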


## Results

The table below shows that spaCy's pre-trained transformer model is slightly more accurate than its large pre-trained model, which can also be seen in our results. <br>
![model-performance](../plots/model-performance.png)

In [26]:
# Randomly picking one of the first ten test texts and displaying all its named entities...
rand_idx = random.randint(0, 9)
doc = nlp(df['Text'][rand_idx])
spacy.displacy.render(doc, style="ent", jupyter=True) # display in Jupyter...

In [22]:
# Displaying only location and date entities for the same text from previous step...
doc_viz = nlp(df['Text'][rand_idx])
new_ents = []
for ent in doc_viz.ents:
    if ent.label_ in ('GPE', 'DATE'):
        new_ents.append(ent)
doc_viz.ents = new_ents
spacy.displacy.render(doc_viz, style="ent", jupyter=True) # display in Jupyter...

In [23]:
doc = trf(df['Text'][rand_idx])
spacy.displacy.render(doc, style="ent", jupyter=True) # display in Jupyter...



In [24]:
# Displaying only location and date entities for the same text from previous step...
doc_viz = trf(df['Text'][rand_idx])
new_ents = []
for ent in doc_viz.ents:
    if ent.label_ in ('GPE', 'DATE'):
        new_ents.append(ent)
doc_viz.ents = new_ents
spacy.displacy.render(doc_viz, style="ent", jupyter=True) # display in Jupyter...

If you want to look at the code at a later stage, it is available on my GitHub. Scan the QR code and it will take you there. :)

![NER-QR](../plots/NER-QR.png) <br>
Or simply click on this [GitHub](https://github.com/roshan-pandey/Named-Entity-Recognition) link.