# Text Classification - NER Tagging (Off the shelf)



## $\color{blue}{Sections:}$
* Admin
* Datasets - Make a development and test dataset
* Prompt
* Inference - Get results from LLM
* Metrics

## $\color{blue}{Preamble:}$

In this notebook we test how well GPT-4o-mini can annotate our dataset with Person and Location entity tags off the shelf. We will subsequently finetune it and make a comparison.

The most appropriate model will be used to annotate our entire dataset, and these annotations will be used in downstream LLM and GNN tasks.

## $\color{blue}{Admin:}$


In [None]:
from google.colab import drive
from google.colab import userdata

drive.mount("/content/drive")
%cd '/content/drive/MyDrive/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive


In [None]:
import os
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

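Install LangChain and its OpenAI integration. Note that the Inference section of this notebook calls the OpenAI chat completions endpoint directly with `requests`.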

In [None]:
%%capture
!pip install -U -q langchain langchain_openai

## $\color{blue}{Datasets:}$


In [None]:
import pandas as pd
df = pd.read_csv('class/datasets/df_ner_annotated')
df = df[['id', 'content', 'annotated_content']]
df_train = df[:100]
df_dev = df[100:]

save...

Reload the data:

In [None]:
import pandas as pd
path = 'class/datasets/ner_annotated'
df_train = pd.read_pickle(path + 'train')
df_dev = pd.read_pickle(path + 'dev')
df_example = pd.read_pickle(path + 'example')

In [None]:
print(df_train.shape, df_dev.shape, df_example.shape)

(97, 3) (100, 4) (3, 3)


In [None]:
for el in df_example['content']:
  print(el)
  print('\n')

“Is it John of Tuam?”   “Are you sure of that now?” asked Mr Fogarty dubiously. “I thought it was some Italian or American.”   “John of Tuam,” repeated Mr Cunningham, “was the man.”   He drank and the other gentlemen followed his lead.


sibly there were several others. He personally, being of a sceptical bias, believed and didn’t make the smallest bones about saying so either that man or men in the plural were always hanging around on the waiting list about a lady,


Now to the historical, for as Madam Mina write not in her stenography, I must, in my cumbrous old fashion, that so each day of us may not go unrecorded. We got to the Borgo Pass just after sunrise yesterday morning. When I saw the signs of the dawn I got ready for the hypnotism.




In [None]:
for el in df_example['annotated_content']:
  print(el)
  print('\n')

“Is it @@John of Tuam##Person ?”   “Are you sure of that now?” asked @@Mr Fogarty##Person dubiously. “I thought it was some Italian or American.”   “@@John of Tuam##Person,” repeated @@Mr Cunningham##Person, “was the man.”   He drank and the other gentlemen followed his lead.


sibly there were several others. He personally, being of a sceptical bias, believed and didn’t make the smallest bones about saying so either that man or men in the plural were always hanging around on the waiting list about a lady,


Now to the historical, for as @@Madam Mina##Person write not in her stenography, I must, in my cumbrous old fashion, that so each day of us may not go unrecorded. We got to the @@Borgo Pass##Location  just after sunrise yesterday morning. When I saw the signs of the dawn I got ready for the hypnotism.




## $\color{blue}{Prompt:}$


A single example is enough for the model to learn the tagging format, but here we use three examples that cover both labels (Person and Location) and include one passage with no entities, so the LLM also sees the null case.
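
As a minimal sketch of how the same Input/Output layout could be assembled from a list of (content, annotated_content) pairs, such as the rows of `df_example`, rather than typed by hand. The helper name `build_template`, its arguments, and the `demo_` variables are hypothetical and not part of this notebook. The trailing `{}` slot is kept so the result can still be filled with `.format(text)`, and literal braces in the examples are doubled so that `.format` leaves them alone.

```python
def build_template(instructions, examples, reminder=""):
    """Assemble a few-shot prompt with a trailing {} slot for the new text.

    instructions: task description placed at the top of the prompt.
    examples: list of (plain_text, annotated_text) pairs.
    reminder: optional extra rule shown after the examples.
    """
    def esc(s):
        # Double literal braces so a later str.format() leaves them intact.
        return s.replace("{", "{{").replace("}", "}}")

    parts = [esc(instructions), "", "###Examples"]
    for plain, annotated in examples:
        parts.append(f"Input: {esc(plain)}")
        parts.append(f"Output: {esc(annotated)}")
        parts.append("")
    if reminder:
        parts.extend([esc(reminder), ""])
    parts.extend(["###Text", "Input: {}", "Output:"])
    return "\n".join(parts)


# Shortened demo examples, one with an entity and one without (the null case).
demo_examples = [
    ("We got to the Borgo Pass just after sunrise.",
     "We got to the @@Borgo Pass##Location just after sunrise."),
    ("He drank and the other gentlemen followed his lead.",
     "He drank and the other gentlemen followed his lead."),
]
demo_template = build_template(
    "Label the Location and Person entities in the ###Text section.",
    demo_examples,
    reminder="**DON'T LABEL PRONOUNS AS PERSON**",
)
demo_prompt = demo_template.format("Dog")
print(demo_prompt)
```

Building the examples from data keeps the prompt and the example DataFrame in sync if the examples are later changed.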

In [None]:
template = """The task is to label the Location and Person entities in the given ###Text section, Following the format in the ###Examples section.
The output should be identicle to the input with the exception of the Person and Location tags if required.

###Examples
Input: “Is it John of Tuam?”   “Are you sure of that now?” asked Mr Fogarty dubiously. “I thought it was some Italian or American.”   “John of Tuam,” repeated Mr Cunningham, “was the man.”   He drank and the other gentlemen followed his lead.
Output: “Is it @@John of Tuam##Person ?”   “Are you sure of that now?” asked @@Mr Fogarty##Person dubiously. “I thought it was some Italian or American.”   “@@John of Tuam##Person,” repeated @@Mr Cunningham##Person, “was the man.”   He drank and the other gentlemen followed his lead.

Input: sibly there were several others. He personally, being of a sceptical bias, believed and didn’t make the smallest bones about saying so either that man or men in the plural were always hanging around on the waiting list about a lady,
Output: sibly there were several others. He personally, being of a sceptical bias, believed and didn’t make the smallest bones about saying so either that man or men in the plural were always hanging around on the waiting list about a lady,

Input: Now to the historical, for as Madam Mina write not in her stenography, I must, in my cumbrous old fashion, that so each day of us may not go unrecorded. We got to the Borgo Pass just after sunrise yesterday morning. When I saw the signs of the dawn I got ready for the hypnotism.
Output: Now to the historical, for as @@Madam Mina##Person write not in her stenography, I must, in my cumbrous old fashion, that so each day of us may not go unrecorded. We got to the @@Borgo Pass##Location  just after sunrise yesterday morning. When I saw the signs of the dawn I got ready for the hypnotism.

**DON'T LABEL PRONOUNS AS PERSON**

###Text
Input: {}
Output:"""

In [None]:
prompt_in = 'Dog'
print(template.format(prompt_in))

The task is to label the Location and Person entities in the given ###Text section, Following the format in the ###Examples section.
The output should be identicle to the input with the exception of the Person and Location tags if required.

###Examples
Input: “Is it John of Tuam?”   “Are you sure of that now?” asked Mr Fogarty dubiously. “I thought it was some Italian or American.”   “John of Tuam,” repeated Mr Cunningham, “was the man.”   He drank and the other gentlemen followed his lead.
Output: “Is it @@John of Tuam##Person ?”   “Are you sure of that now?” asked @@Mr Fogarty##Person dubiously. “I thought it was some Italian or American.”   “@@John of Tuam##Person,” repeated @@Mr Cunningham##Person, “was the man.”   He drank and the other gentlemen followed his lead.

Input: sibly there were several others. He personally, being of a sceptical bias, believed and didn’t make the smallest bones about saying so either that man or men in the plural were always hanging around on the wait

## $\color{blue}{Inference:}$


In [None]:
import requests
URL = "https://api.openai.com/v1/chat/completions" # chat completions endpoint

system_message = "You are an excellent linguist"

key = userdata.get('OPENAI_API_KEY')
model = "gpt-4o-mini"
payload = {
  "model": model,
  "messages": [{"role": "system", "content": system_message}],
  "temperature": 0, # 0 makes the output as deterministic as possible
  "top_p": 1.0, # nucleus sampling: 1.0 keeps the full token distribution
  "n": 1, # number of responses to generate
  "stream": False,
  "presence_penalty": 0, # > 0 penalises tokens that have already appeared
  "frequency_penalty": 0, # > 0 penalises tokens in proportion to how often they have appeared
}

headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {key}"
}

In [None]:
import json

def get_predicted(response):
  """Return the message content of the first choice in an OpenAI response."""
  response.raise_for_status() # raise on HTTP errors (4xx/5xx) instead of parsing an error body
  out_dict = json.loads(response.content)
  return out_dict['choices'][0]['message']['content']

In [None]:
n = df_dev.shape[0]
responses = [""] * n
for i in range(n):
  if i % 20 == 0:
    print(i) # progress
  new_prompt = template.format(df_dev.iloc[i]["content"]) # make prompt (positional row access)
  # rebuild the message list for every request so prompts do not accumulate
  payload['messages'] = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": new_prompt},
  ]
  try:
    response = requests.post(URL, headers=headers, json=payload, timeout=80) # send request
    responses[i] = get_predicted(response) # extract content
  except Exception as e: # keep going, but record which row failed and why
    responses[i] = "fail"
    print(f"fail at row {i}: {e}")

0
20
40
60
80
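
The loop above records every failure as the same `"fail"` string, which hides whether the request timed out, the API returned an error, or the body had an unexpected shape. A minimal sketch of a parser that keeps these cases apart follows. It assumes the response shape already used by `get_predicted` (`choices[0].message.content`) and assumes that error bodies carry a top-level `error` object with a `message` field. The name `parse_completion` and the sample bodies are hypothetical. It works on the raw body, so it can be tried without a network call.

```python
import json


def parse_completion(raw_body):
    """Return (content, error) from a chat-completions response body.

    Exactly one of the two values is None. `raw_body` is the bytes or str
    payload, e.g. `response.content` from `requests`.
    """
    try:
        body = json.loads(raw_body)
    except (TypeError, ValueError) as exc:
        return None, f"invalid JSON: {exc}"

    if isinstance(body, dict) and "error" in body:
        # Assumed error shape: {"error": {"message": "..."}}
        err = body["error"]
        if isinstance(err, dict):
            return None, err.get("message", "unknown error")
        return None, str(err)

    try:
        return body["choices"][0]["message"]["content"], None
    except (KeyError, IndexError, TypeError):
        return None, "unexpected response shape"


# Hand-written sample bodies for illustration only.
ok_body = b'{"choices": [{"message": {"role": "assistant", "content": "@@Dublin##Location"}}]}'
err_body = b'{"error": {"message": "Rate limit reached", "type": "rate_limit_error"}}'

print(parse_completion(ok_body))
print(parse_completion(err_body))
print(parse_completion(b"not json"))
```

Storing the error text next to each row makes it possible to retry only the rows that failed for a transient reason.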


In [None]:
# Failed requests are stored as "fail" (or left as an empty string), so check for both
failed = [i for i, r in enumerate(responses) if r in ("", "fail")]
fail = len(failed)
print(fail)

0


In [None]:
df_dev['predictions'] = responses
df_dev['predictions']

Unnamed: 0,predictions
0,"And yet, my dear, let me whisper, I felt a thr..."
1,"I did not know what to do, the less as the how..."
2,Looks full up of bad gas. Must be an infernal ...
3,"Why, clearly, he said, then he and his boon co..."
4,"Secondly, I will show that all men who practis..."
...,...
95,an idea he utterly repudiated. Quite apart fro...
96,"—My dear @@Myles##Person, he said, flinging hi..."
97,a reward which a man might fairly expect who n...
98,and is ready to compete with him in word or de...
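
The prompt asks for output that is identical to the input apart from the tags, and that can be checked before scoring. A minimal sketch: strip the `@@...##Label` markup and compare with the original text while ignoring whitespace, since the reference annotations themselves add a space after some tags (for example `##Person ?`). The helper names `strip_tags` and `same_text` are hypothetical and the sample strings are shortened from the examples above.

```python
import re

TAG = re.compile(r"@@([^#]*)##\w+")


def strip_tags(text):
    """Remove @@...##Label markup, keeping only the entity text."""
    return TAG.sub(r"\1", text)


def squash(text):
    """Drop all whitespace so spacing differences around tags are ignored."""
    return "".join(text.split())


def same_text(original, tagged):
    """True if `tagged` equals `original` once tags and whitespace are removed."""
    return squash(strip_tags(tagged)) == squash(original)


original = "“Is it John of Tuam?” asked Mr Fogarty dubiously."
tagged = "“Is it @@John of Tuam##Person ?” asked @@Mr Fogarty##Person dubiously."
rewritten = "“Is it @@John##Person ?” asked @@Mr Fogarty##Person."

print(same_text(original, tagged))     # tags added, text preserved
print(same_text(original, rewritten))  # the model changed the wording
```

Applied row by row, for instance with `df_dev.apply(lambda r: same_text(r["content"], r["predictions"]), axis=1)`, this would flag predictions where the model altered the passage instead of only tagging it.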


In [None]:
import re
pattern = r"@@([^#]*)##(\w+\b)\S*"
all_entities = [re.findall(pattern, text) for text in df_dev['predictions']]

count_zeros = 0
count_people = 0
count_places = 0
people_list = []
place_list = []
for entity in all_entities:
  if len(entity) < 1:
    count_zeros += 1
  for tup in entity:
    if tup[1] == "Person":
      count_people += 1
      people_list.append(tup[0])
    elif tup[1] == "Location":
      count_places += 1
      place_list.append(tup[0])


print(f'Proportion of texts with entities = {(len(all_entities) - count_zeros) / len(all_entities)}.')
print(f'\nThere are {count_people} Person entities.')
print(f'\nThere are {count_places} Location entities.')

Proportion of texts with entities = 0.7.

There are 129 Person entities.

There are 43 Location entities.
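
To make the extraction pattern concrete, the sketch below applies it to a sample string stitched together from the reference annotations shown earlier. `[^#]*` captures the entity text, `\w+` captures the label, and the trailing `\S*` consumes any punctuation attached directly to the label. The `sample` string is assembled here for illustration only.

```python
import re

pattern = r"@@([^#]*)##(\w+\b)\S*"
sample = (
    "Now to the historical, for as @@Madam Mina##Person write not in her "
    "stenography. We got to the @@Borgo Pass##Location  just after sunrise. "
    "“@@John of Tuam##Person,” repeated @@Mr Cunningham##Person, “was the man.”"
)

# findall returns one (entity_text, label) tuple per tag
entities = re.findall(pattern, sample)
for text, label in entities:
    print(f"{label:8} {text}")
```

One limitation of this pattern is that an entity containing a `#` character would not be matched, because `[^#]*` stops at the first `#`.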


## $\color{blue}{Metrics:}$


In [None]:
def get_labels(true, predicted):
  """Return lower-cased, stripped Person and Location entities for the gold and predicted texts."""
  pattern = r"@@([^#]*)##(\w+\b)\S*"

  def split_by_label(text):
    # normalise entity strings so that matching ignores case and surrounding spaces
    people, locations = [], []
    for entity, label in re.findall(pattern, text):
      if label in ('Person', 'person'):
        people.append(entity.lower().strip())
      elif label in ('Location', 'location'):
        locations.append(entity.lower().strip())
    return people, locations

  true_people, true_locations = split_by_label(true)
  predicted_people, predicted_locations = split_by_label(predicted)

  return true_people, true_locations, predicted_people, predicted_locations





In [None]:
from copy import deepcopy
from collections import namedtuple

Stats = namedtuple("stats", ['TP','FP', 'FN'])

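def count_metrics(real, predicted):
  """Count TP, FP and FN for one text by exact string match.

  Each predicted entity can match at most one gold entity, so repeated
  mentions are counted separately.
  """
  predicted = list(predicted) # copy, because matched items are removed below
  TP = 0
  FN = 0

  for item in real:
    if item in predicted:
      TP += 1
      predicted.remove(item) # consume the match so it cannot be reused
    else:
      FN += 1

  FP = len(predicted) # predictions left unmatched

  return Stats(TP, FP, FN)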
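
The matching above is a multiset comparison: a gold mention can pair with at most one identical predicted mention. As a sketch of the same idea, the version below uses `collections.Counter`, where `&` gives the matched mentions and `-` gives what is left over on either side. The function name `count_metrics_counter` and the sample lists are hypothetical.

```python
from collections import Counter, namedtuple

Stats = namedtuple("stats", ["TP", "FP", "FN"])


def count_metrics_counter(real, predicted):
    """Multiset matching: a gold mention pairs with at most one equal prediction."""
    real_counts = Counter(real)
    predicted_counts = Counter(predicted)
    tp = sum((real_counts & predicted_counts).values())  # matched mentions
    fp = sum((predicted_counts - real_counts).values())  # predicted but not in gold
    fn = sum((real_counts - predicted_counts).values())  # gold but not predicted
    return Stats(tp, fp, fn)


# "john of tuam" appears twice in gold but only once in the predictions,
# so one mention is a true positive and the other a false negative.
gold = ["john of tuam", "mr fogarty", "john of tuam", "mr cunningham"]
pred = ["john of tuam", "mr fogarty", "he"]

result = count_metrics_counter(gold, pred)
print(result)
```

Here the result is 2 true positives, 1 false positive (`"he"`) and 2 false negatives, which is what the loop-based `count_metrics` gives for the same input.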


In [None]:
true = list(df_dev['annotated_content'])
predicted = list(df_dev['predictions'])

people = []
locations = []

for i in range(len(true)):
  true_people, true_locations, predicted_people, predicted_locations = get_labels(true[i], predicted[i])
  people.append(count_metrics(true_people, predicted_people))
  locations.append(count_metrics(true_locations, predicted_locations))


In [None]:
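def report_stats(lstr):
  """Sum TP/FP/FN over all texts (micro-average) and print F1, precision and recall."""
  TP = sum(item.TP for item in lstr)
  FP = sum(item.FP for item in lstr)
  FN = sum(item.FN for item in lstr)

  # guard against division by zero when there are no predictions or no gold entities
  precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0
  recall = TP / (TP + FN) if (TP + FN) > 0 else 0.0
  if (precision + recall) > 0:
    f1 = 2 * ((precision * recall) / (precision + recall))
  else:
    f1 = 0.0
  print('F1: ', f1)
  print('Precision: ', precision)
  print('Recall: ', recall)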

In [None]:
print('-' * 10)
print("PEOPLE STATS")
report_stats(people)
print('\n')
print('-' * 10)
print("LOCATION STATS")
report_stats(locations)
print('\n')
print('-' * 10)
print("OVERALL STATS")
report_stats(people + locations)


----------
PEOPLE STATS
F1:  0.7603305785123968
Precision:  0.7131782945736435
Recall:  0.8141592920353983


----------
LOCATION STATS
F1:  0.5079365079365079
Precision:  0.37209302325581395
Recall:  0.8


----------
OVERALL STATS
F1:  0.7081967213114754
Precision:  0.627906976744186
Recall:  0.8120300751879699
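
The printed ratios can be traced back to aggregate counts. The sketch below uses counts that are inferred from the output, not printed by the notebook: they assume the 129 Person and 43 Location totals reported in the Inference section are the predicted totals that entered the metrics. Under that assumption the figures are consistent with TP/FP/FN of 92/37/21 for people and 16/27/4 for locations. The helper name `prf` is hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from aggregate counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # algebraically equal to 2PR / (P + R)
    return precision, recall, f1


# Counts inferred from the printed ratios (an assumption, see the text above).
inferred = {
    "people": (92, 37, 21),
    "locations": (16, 27, 4),
}
# Overall counts are the element-wise sum of the two classes.
inferred["overall"] = tuple(sum(c) for c in zip(*inferred.values()))

for name, counts in inferred.items():
    p, r, f1 = prf(*counts)
    print(f"{name:10} TP/FP/FN={counts}  P={p:.4f}  R={r:.4f}  F1={f1:.4f}")
```

Read this way, the low Location precision means that 27 of the 43 predicted locations had no exact match in the reference annotations, while recall stays close to the Person class.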


In [None]:
df_dev.to_pickle(path + "dev")