# Text Classification - NER Dataset


## $\color{blue}{Sections:}$
* Admin
* Export - Data
* Import - Labelled Data
* Analysis



## $\color{blue}{Preamble:}$

This notebook prepares the data for fine-tuning the NER model.

We export 200 data points: 100 will be used to fine-tune an LLM on the task, and the other 100 will be used to test the improvement.

The labelling is done externally, and the labelled data is then reimported.
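
The 100/100 split described above could be sketched as follows. This is only an illustration: the source does not say how the two halves are chosen, so the random shuffle, the fixed seed, and the helper name `split_ids` are all assumptions, and the ids are stand-ins for the real row ids.

```python
import random


def split_ids(ids, n_finetune=100, seed=0):
    """Shuffle the ids reproducibly, then split them into fine-tuning and test sets."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed (assumed) keeps the split repeatable
    return shuffled[:n_finetune], shuffled[n_finetune:]


# Stand-in ids for the 200 exported rows (hypothetical; real ids come from the dataset).
finetune_ids, test_ids = split_ids(range(200))
print(len(finetune_ids), len(test_ids))
```

Keeping the two id lists disjoint is what makes the second half usable as a held-out test set.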

## $\color{blue}{Admin:}$


In [None]:
from google.colab import drive

drive.mount("/content/drive")
%cd '/content/drive/MyDrive/'

Mounted at /content/drive
/content/drive/MyDrive


## $\color{blue}{Export:}$


In [None]:
import pandas as pd

df = pd.read_pickle('class/datasets/df_train')

In [None]:
df.columns

Index(['index', 'master', 'book_idx', 'book', 'chapter_idx', 'chapter',
       'author', 'content', 'vanilla_embedding', 'vanilla_embedding.1',
       'ft_embedding', 'ft_embedding_pal'],
      dtype='object')

In [None]:
# Keep the first 200 rows; only the row id and the raw text are needed for labelling.
df_ner = df[:200][['index', 'content']]

In [None]:
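# Write the sample for external labelling.
# Note: by default to_csv also writes the DataFrame index as an unnamed first column.
df_ner.to_csv('class/datasets/df_ner')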

## $\color{blue}{Import:}$


In [None]:
df = pd.read_csv('class/datasets/df_ner_annotated')

In [None]:
df = df[['id', 'content', 'annotated_content']]
df.head()

|   | id | content | annotated_content |
|---|---|---|---|
| 0 | 8114 | “Is it John of Tuam?” “Are you sure of that ... | “Is it @@John of Tuam##Person ?” “Are you su... |
| 1 | 4951 | sibly there were several others. He personally... | sibly there were several others. He personally... |
| 2 | 4629 | Stephen, who was trying his dead best to yawn ... | @@Stephen##Person , who was trying his dead be... |
| 3 | 11556 | Now to the historical, for as Madam Mina write... | Now to the historical, for as @@Madam Mina##Pe... |
| 4 | 12262 | The harmonies which you mean are the mixed or ... | The harmonies which you mean are the mixed or ... |
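
The annotated text marks each entity inline as `@@entity text##Label`. Fine-tuning pipelines often expect character spans instead, so a conversion along the following lines may be useful. This is a sketch under assumptions, not part of the original notebook: the helper name `to_spans` and the sample sentence are hypothetical, and the plain text it returns keeps whatever spacing the annotation tool left around the markers.

```python
import re

# Inline markup: @@<entity text>##<Label>
ENTITY = re.compile(r"@@([^#]*)##(\w+)")


def to_spans(annotated):
    """Strip the inline markup and return (plain_text, [(start, end, label), ...])."""
    parts, spans = [], []
    cursor = 0  # position in the annotated string
    length = 0  # length of the plain text built so far
    for match in ENTITY.finditer(annotated):
        before = annotated[cursor:match.start()]
        parts.append(before)
        length += len(before)
        surface, label = match.group(1), match.group(2)
        spans.append((length, length + len(surface), label))
        parts.append(surface)
        length += len(surface)
        cursor = match.end()
    parts.append(annotated[cursor:])
    return "".join(parts), spans


# Made-up sentence in the same format as `annotated_content`.
plain, spans = to_spans("@@Stephen##Person went to @@Whitby##Location .")
print(plain)
print(spans)
```

Each span indexes into the returned plain text, so `plain[start:end]` recovers the entity string.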


## $\color{blue}{Analysis:}$


Count the entities of each type and the proportion of texts that contain at least one entity.

In [None]:
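import re

# Annotations have the form @@<entity text>##<Label>.
# Group 1 captures the entity text, group 2 the label (e.g. Person, Location).
pattern = r"@@([^#]*)##(\w+\b)\S*"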

In [None]:
all_entities = [re.findall(pattern, text) for text in df['annotated_content']]
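
To see what the pattern captures, here is a small self-contained check on a made-up sentence in the same `@@text##Label` format (the sentence is illustrative and not taken from the dataset).

```python
import re

pattern = r"@@([^#]*)##(\w+\b)\S*"

# Illustrative sentence, not from the dataset.
sample = "@@Stephen##Person walked from @@Dame Street##Location to the quay."

# With two capture groups, findall returns one (text, label) tuple per match.
entities = re.findall(pattern, sample)
print(entities)  # [('Stephen', 'Person'), ('Dame Street', 'Location')]

# A text without annotations yields an empty list.
print(re.findall(pattern, "No entities here."))  # []
```

Because the first group is `[^#]*`, an entity whose text itself contains a `#` would not be matched by this pattern.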

In [None]:
count_zeros = 0  # texts with no annotated entity
count_people = 0
count_places = 0
people_list = []
place_list = []
for entities in all_entities:
    if not entities:
        count_zeros += 1
    # Each match is a (text, label) tuple.
    for text, label in entities:
        if label == "Person":
            count_people += 1
            people_list.append(text)
        elif label == "Location":
            count_places += 1
            place_list.append(text)


print(f'Proportion of texts with entities = {(len(all_entities) - count_zeros) / len(all_entities)}.')
print(f'\nThere are {count_people} Person entities.')
print(f'\nThere are {count_places} Location entities.')

Proportion of texts with entities = 0.58.

There are 208 Person entities.

There are 50 Location entities.
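
The same summary can be computed more compactly with `collections.Counter`. The sketch below uses a toy stand-in for `all_entities` (four made-up texts), so its numbers are not the notebook's results.

```python
from collections import Counter

# Toy stand-in for `all_entities`: one list of (text, label) tuples per text.
all_entities = [
    [("Stephen", "Person"), ("Dublin", "Location")],
    [],
    [("Bloom", "Person")],
    [],
]

# Count every label across all texts in one pass.
label_counts = Counter(label for entities in all_entities for _, label in entities)

# Proportion of texts with at least one entity.
with_entities = sum(1 for entities in all_entities if entities)
proportion = with_entities / len(all_entities)

print(label_counts)
print(proportion)
```

This approach also picks up any label other than `Person` or `Location` without extra branches, which can help spot labelling typos.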


In [None]:
people_list = [el.lower().strip() for el in people_list]
place_list = [el.lower().strip() for el in place_list]

In [None]:
from collections import Counter
people_count = dict(Counter(people_list))
location_count = dict(Counter(place_list))

In [None]:
sorted(people_count.items(), key=lambda item: item[1], reverse=True)

[('bloom', 19),
 ('stephen', 12),
 ('mr bloom', 3),
 ('madam mina', 3),
 ('van helsing', 3),
 ('jonathan', 3),
 ('zoe', 3),
 ('martin cunningham', 3),
 ('lucy', 3),
 ('lynch', 3),
 ('long john fanning', 3),
 ('john of tuam', 2),
 ('mr cunningham', 2),
 ('lydian', 2),
 ('martha', 2),
 ('dolly', 2),
 ('bilder', 2),
 ('moll', 2),
 ('lenehan', 2),
 ('john wyse nolan', 2),
 ('gabriel', 2),
 ('florry', 2),
 ('dr. van helsing', 2),
 ('joe', 2),
 ('mina', 2),
 ('john', 2),
 ('mr fogarty', 1),
 ('captain john lever', 1),
 ('john eglinton', 1),
 ('mr best', 1),
 ('goodwin', 1),
 ('dedalus', 1),
 ('barney kiernan', 1),
 ('nannetti', 1),
 ('hynes', 1),
 ('mr patrick dignam', 1),
 ('morris', 1),
 ('cook', 1),
 ('professor', 1),
 ('john seward', 1),
 ('lionelleopold', 1),
 ('henry', 1),
 ('mady', 1),
 ('raoul', 1),
 ('julia', 1),
 ('freddy', 1),
 ('cochrane', 1),
 ('erin', 1),
 ('patrick', 1),
 ('calpornus', 1),
 ('potitus', 1),
 ('odyssus', 1),
 ('leopold bloom', 1),
 ('bous stephanoumenos', 1),
 (
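
The counts above treat variants such as `'bloom'` and `'mr bloom'`, or `'van helsing'` and `'dr. van helsing'`, as different people. One possible way to merge some of them is to strip a leading honorific before counting. The sketch below is an assumption about how this might be done, not part of the original analysis: the `TITLES` set and the `strip_title` helper are hypothetical, and the approach would still not merge `'leopold bloom'` with `'bloom'`.

```python
from collections import Counter

# Hypothetical list of honorifics to drop; extend as needed.
TITLES = {"mr", "mrs", "dr", "madam"}


def strip_title(name):
    """Lower-case the name and drop a single leading honorific, if present."""
    words = name.lower().strip().split()
    if len(words) > 1 and words[0].rstrip(".") in TITLES:
        words = words[1:]
    return " ".join(words)


# A few variants taken from the output above.
names = ["bloom", "mr bloom", "dr. van helsing", "van helsing", "madam mina", "mina"]
merged = Counter(strip_title(name) for name in names)
print(merged.most_common())
```

`Counter.most_common()` returns the items sorted by count, which also replaces the `sorted(..., key=..., reverse=True)` call.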

In [None]:
sorted(location_count.items(), key=lambda item: item[1], reverse=True)

[('ireland', 3),
 ('borgo pass', 2),
 ('hellas', 2),
 ('london', 2),
 ('alameda', 2),
 ('harcourt road', 1),
 ('seaside', 1),
 ('count’s house', 1),
 ('whitby', 1),
 ('drawing-room', 1),
 ('sleepy hollow', 1),
 ('lethe', 1),
 ('persia', 1),
 ('holyhead', 1),
 ('egan', 1),
 ('varna', 1),
 ('eire', 1),
 ('europe', 1),
 ('greek street', 1),
 ('ben howth', 1),
 ('roundtown', 1),
 ('lombard street', 1),
 ('moher', 1),
 ('connemara', 1),
 ('lough neagh', 1),
 ('giant’s causeway', 1),
 ('terenure', 1),
 ('dame street', 1),
 ('belvedere', 1),
 ('egypt', 1),
 ('geneva', 1),
 ('gibraltar', 1),
 ('livermore', 1),
 ('asculum', 1),
 ('mürzsteg', 1),
 ('austria', 1),
 ('nuremberg', 1),
 ('black sea', 1),
 ('england', 1),
 ('holmwood', 1),
 ('bukovina', 1),
 ('cork', 1),
 ('holeopen', 1),
 ('kinsale', 1)]