# Named Entity Recognition

## Objective
The aim of this project is to perform named entity recognition (NER) on the provided text. Only two entity types are of interest: Locations and Dates.
spaCy's pre-trained pipeline tags all of its pre-defined entity types at once, so we run the full NER step and afterwards keep only the entities of interest (Locations and Dates).

In [29]:
# importing relevant packages...
import pandas as pd
import spacy
import random
from datasets import load_dataset
from tqdm import tqdm

In [2]:
# ! pip install datasets

## Data Extraction and Text manipulation

In [3]:
# loading data from datasets package...
dataset = load_dataset("ecthr_cases")

No config specified, defaulting to: ecthr_cases/alleged-violation-prediction
Reusing dataset ecthr_cases (C:\Users\rosha\.cache\huggingface\datasets\ecthr_cases\alleged-violation-prediction\1.1.0\8922a012792758e64921d4a66d42adf759e42838aae54a6a8871607f6399aecf)
100%|██████████| 3/3 [00:00<00:00, 50.84it/s]


In [19]:
len(dataset['test']['facts'][1])

22

In [30]:
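texts = []
counts = []
# Read the column once; indexing dataset['test']['facts'][i] inside the loop
# can re-read the whole column on every iteration.
facts_column = dataset['test']['facts']
for facts in tqdm(facts_column):
    text = ""
    cnt = 0
    for paragraph in facts:
        cnt += 1
        # Keep only the part after the last double space
        # (intended to drop the leading paragraph number).
        text += paragraph.split("  ")[-1].strip() + ". "
    counts.append(cnt)
    # Flatten newlines once per case instead of after every paragraph.
    texts.append(text.replace('\n', ' '))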

100%|██████████| 1000/1000 [00:39<00:00, 25.27it/s]
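
The loop above appends `". "` to every paragraph, so a paragraph that already ends with a full stop ends up with two, and splitting on a double space also discards anything before a double space in the middle of a paragraph. The following is a minimal sketch of a stricter variant, not the method used in this notebook. It assumes each paragraph may start with a number such as `5.` followed by whitespace; the `join_facts` helper, the regular expression, and the sample paragraphs are all hypothetical.

```python
import re

# Assumption: a paragraph may start with a number like "5." followed by whitespace.
_NUMBER_PREFIX = re.compile(r"^\s*\d+\.\s+")

def join_facts(paragraphs):
    """Join fact paragraphs into one string, stripping leading numbering and newlines."""
    cleaned = []
    for paragraph in paragraphs:
        body = _NUMBER_PREFIX.sub("", paragraph).replace("\n", " ").strip()
        if not body:
            continue  # skip paragraphs that are empty after cleaning
        # Add a full stop only when the paragraph does not already end a sentence.
        if body[-1] not in ".!?":
            body += "."
        cleaned.append(body)
    return " ".join(cleaned)

# Made-up paragraphs in the assumed format.
sample = ["5.  The applicant was born in 1940.", "6.  He lives\nin Odesa"]
print(join_facts(sample))
```

Whether this is preferable depends on how the raw paragraphs are actually formatted, which is worth checking on a few cases before swapping it in.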


In [28]:
# Sanity check...
for i in tqdm(range(len(counts))):
    if len(dataset['test']['facts'][i]) != counts[i]:
        print(f"text: {len(dataset['test']['facts'][i])}, count: {counts[i]}")

## Modelling

In [32]:
# loading spaCy's large pre-trained model for the English language...
nlp = spacy.load("en_core_web_lg")

locations = [] # list to store all the location entities...
dates = [] # list to store all the date entities...

# iterating over all the texts in the test data and keeping only location (GPE) and date (DATE) entities...
for text in tqdm(texts):
    location = []
    date = []
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == 'GPE':
            location.append([ent.text, ent.label_])
        elif ent.label_ == 'DATE':
            date.append([ent.text, ent.label_])
    locations.append(location)
    dates.append(date)


100%|██████████| 1000/1000 [03:02<00:00,  5.48it/s]
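
The loop above keeps only `GPE` and `DATE`. In spaCy's English models, `GPE` covers countries, cities and states, while `LOC` covers non-GPE locations such as mountain ranges and bodies of water, so a "Locations" bucket may want both. Below is a minimal sketch of grouping entities through a configurable label map. It works on plain `(text, label)` pairs rather than a real `Doc`, and the `LABEL_GROUPS` mapping, the `group_entities` helper, and the sample pairs are hypothetical.

```python
# Hypothetical mapping from output bucket to the spaCy labels it should collect.
LABEL_GROUPS = {
    "Location": {"GPE", "LOC"},
    "Date": {"DATE"},
}

def group_entities(pairs, label_groups=LABEL_GROUPS):
    """Sort (text, label) pairs into buckets; labels in no bucket are dropped."""
    grouped = {name: [] for name in label_groups}
    for text, label in pairs:
        for name, labels in label_groups.items():
            if label in labels:
                grouped[name].append([text, label])
    return grouped

# Illustrative pairs, standing in for [(ent.text, ent.label_) for ent in doc.ents].
sample_pairs = [
    ("Odesa", "GPE"),
    ("1940", "DATE"),
    ("the Black Sea", "LOC"),
    ("the Court", "ORG"),
]
print(group_entities(sample_pairs))
```

Keeping the labels in one mapping means the choice between `GPE` only and `GPE` plus `LOC` can be changed in a single place.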


In [33]:
# Creating a dataframe to store each text with its location and date entities...
# (named `data` rather than `dict` to avoid shadowing the built-in type)
data = {'Text': texts, 'Location': locations, 'Date': dates}
df = pd.DataFrame(data)

In [34]:
# Saving the recognized entities in csv file if needed for later analysis...
df.to_csv('../data/location-date.csv') 
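
Because the `Location` and `Date` columns hold Python lists, they are written to the CSV as their string representation and come back as plain strings when the file is read again. The sketch below shows one way to recover the lists with `ast.literal_eval`. It uses the standard `csv` module and an in-memory buffer in place of `../data/location-date.csv`, and the sample row is made up.

```python
import ast
import csv
import io

# Made-up row in the same shape as the dataframe: text plus lists of [entity, label] pairs.
rows = [{
    "Text": "The applicant lives in Odesa.",
    "Location": [["Odesa", "GPE"]],
    "Date": [],
}]

# Write to an in-memory buffer; list cells are stored as their str() form.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["Text", "Location", "Date"])
writer.writeheader()
writer.writerows(rows)

# Read back: every cell is a string until it is parsed.
buffer.seek(0)
restored = []
for record in csv.DictReader(buffer):
    raw_location = record["Location"]  # e.g. "[['Odesa', 'GPE']]"
    record["Location"] = ast.literal_eval(raw_location)
    record["Date"] = ast.literal_eval(record["Date"])
    restored.append(record)

print(restored[0]["Location"])
```

With pandas, the same parsing could be applied while loading by passing `ast.literal_eval` for those columns through the `converters` argument of `pd.read_csv`.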

In [35]:
df.head()

| | Text | Location | Date |
|---|---|---|---|
| 0 | The applicant is a journalist for DN.no, a Nor... | [[DN.no, GPE], [Oslo, GPE], [the United Kingdo... | [[23 June 2010, DATE], [1997, DATE], [the same... |
| 1 | The applicant was born in 1940 and lives in Od... | [[Odesa, GPE]] | [[1940, DATE], [March 2001, DATE], [12 January... |
| 2 | The applicant was born in 1965 and lives in Sm... | [] | [[1965, DATE], [9 November 2006, DATE], [6 Jan... |
| 3 | The applicant was born in 1967 and lives in Ky... | [[Kyiv, GPE], [Ukraine, GPE]] | [[1967, DATE], [August 2002, DATE], [7 Decembe... |
| 4 | The applicant was born in 1967 and lives in St... | [[Varna, GPE], [Varna, GPE], [Varna, GPE]] | [[1967, DATE], [11 March 2014, DATE], [between... |


In [36]:
# Randomly picking a text from test data and displaying all the named entities...
rand_idx = random.randrange(len(df))  # any valid row index, without hard-coding the number of test cases
doc = nlp(df['Text'][rand_idx])
spacy.displacy.render(doc, style="ent", jupyter=True) # display in Jupyter

In [37]:
# Displaying only location and date entities for the same text from the previous step...
doc_viz = nlp(df['Text'][rand_idx])
# Keep only GPE and DATE spans; assigning to doc.ents replaces the entity list.
doc_viz.ents = [ent for ent in doc_viz.ents if ent.label_ in ('GPE', 'DATE')]
spacy.displacy.render(doc_viz, style="ent", jupyter=True) # display in Jupyter