In [1]:
import numpy as np
import pandas as pd
import os
import re
import matplotlib.pyplot as plt
import seaborn as sns
import openai

## Notebook to annotate HLS speeches for relevance
### A: one-hot coded labels

Codebooks:
- A1.0: zero shot
- A1.1: one shot
- A1.2: two shot

- A1.1.1: one shot with specific inclusion of context

Test setup:
- 5 different seeds
- Batches of 20 sentences
- 5 iterations
- Temperature: 0, 0.2, 0.6

Hypothesis: accuracy decreases with temperature
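
The design above amounts to a grid of seeds and temperatures. As a rough sketch of how that grid could be enumerated (the seed values are the ones used later in this notebook; the variable names `temperatures` and `runs` are only illustrative):

```python
from itertools import product

# Seeds used later in this notebook; temperatures as listed in the setup
seeds = [3644, 3441, 280, 5991, 7917]
temperatures = [0, 0.2, 0.6]

# Every (temperature, seed) combination is one annotation run
runs = list(product(temperatures, seeds))
for temperature, seed in runs:
    print(f"temperature={temperature}, seed={seed}")
```

Each run would then be repeated for the chosen number of iterations, so that accuracy can be compared across temperatures.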


Model selection:
As of 22-05-2024, gpt-4-turbo-2024-04-09 and gpt-4o seem to be the only GPT models that return a system fingerprint.

  #model= "gpt-4-turbo-2024-04-09"
  #model = "gpt-3.5-turbo-0125"


### Import text to annotate
Select only relevant columns of the full dataframe, in this case:
relevance_0, relevance_1, relevance_2

In [2]:
# Import numerical csv file
HLS_train = pd.read_csv('data/num/HLS_train_dummies.csv')

In [3]:
# Select only Japan for testing purposes
HLS_train_japan = HLS_train[HLS_train['id']=='COP19_japan']
HLS_train_japan

Unnamed: 0,id,Text,relevance_0,relevance_1,relevance_2,principle_0,principle_1,principle_2,principle_3,principle_4,...,unit_7,shape_0,shape_1,shape_2,shape_3,shape_4,shape_5,shape_6,shape_7,shape_8
0,COP19_japan,"Thank you, Mr. President .",1,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,COP19_japan,"On beha lf of the government of Japan , I wou...",1,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,COP19_japan,I would also like to expr ess my d eepest con...,1,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,COP19_japan,Mr. President: A fair and effective framewor...,0,0,1,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
4,COP19_japan,"In this regard, Japan firmly supports the est...",0,1,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
5,COP19_japan,Such a framework must be based on “nationally ...,0,0,1,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
6,COP19_japan,I will devote myself toward the s uccessful o...,0,1,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
7,COP19_japan,Mr. President: The Great East Japan Earthqua...,1,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
8,COP19_japan,Even under such circumstance both the public ...,1,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
9,COP19_japan,Our greenhouse gas emissions for the first co...,1,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [4]:
# Select only columns containing relevance labels
HLS_relevance = HLS_train_japan[['Text', 'relevance_0', 'relevance_1', 'relevance_2']]
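
Because the relevance labels are one-hot coded, each sentence should carry exactly one of the three labels. A minimal sketch of that sanity check follows; the frame `toy` is a hypothetical stand-in with the same column names as `HLS_relevance`, not the real data:

```python
import pandas as pd

# Hypothetical stand-in for HLS_relevance (same columns, invented rows)
toy = pd.DataFrame({
    'Text': ['a', 'b', 'c'],
    'relevance_0': [1, 0, 0],
    'relevance_1': [0, 1, 0],
    'relevance_2': [0, 0, 1],
})

label_cols = ['relevance_0', 'relevance_1', 'relevance_2']

# One-hot coding means every row sums to exactly 1 across the label columns
row_sums = toy[label_cols].sum(axis=1)
invalid_rows = toy[row_sums != 1]
print(f"{len(invalid_rows)} rows violate the one-hot constraint")
```

Running the same check on `HLS_relevance` before annotation would catch rows with no label or with several labels.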

### Import necessary files
- codebooks
- API key
- import gpt_annotate_num

In [5]:
# Load codebook - zero shot
with open('codebooks/A1.0', 'r', encoding='utf-8') as file:
    A10 = file.read()

In [6]:
# OpenAI key
with open('gpt_api_key.txt', 'r') as f:
    key = f.read().strip()

In [7]:
import gpt_annotate_num

### Prepare data for annotation
Compares column names in HLS_relevance to the codes identified by GPT-4o in the codebook. Seed for this identification is set to 1234.

In [10]:
# Prepare dataframe for annotation
text_to_annotate = gpt_annotate_num.prepare_data(HLS_relevance, A10, key, prep_codebook=True)

ChatCompletion(id='chatcmpl-9TljdNzK66sKsQvhJA3ijG46BDSSe', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='relevance_0, relevance_1, relevance_2', role='assistant', function_call=None, tool_calls=None))], created=1716882033, model='gpt-4o-2024-05-13', object='chat.completion', system_fingerprint='fp_43dfabdef1', usage=CompletionUsage(completion_tokens=12, prompt_tokens=389, total_tokens=401))

Categories to annotate:
1) relevance_0
2) relevance_1
3) relevance_2


Data is ready to be annotated using gpt_annotate()!

Glimpse of your data:
Shape of data:  (37, 6)
   unique_id                                               text  relevance_0  \
0          0                         Thank you, Mr. President .            1   
1          1   On beha lf of the government of Japan , I wou...            1   
2          2   I would also like to expr ess my d eepest con...            1   
3          3   Mr. President:  A fair and effective framewo

Fingerprint used: fp_43dfabdef1

The seed for text preparation is hardcoded into gpt_annotate. This ensures that only results with the same fingerprint are kept across all seeds and all iterations: every time GPT-4o is called, only results with this specific fingerprint are saved.
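
The fingerprint filter can be pictured roughly as follows. This is only a sketch of the idea, not the code inside `gpt_annotate_num`: the helper `fingerprint_matches`, the stand-in response objects, and the fingerprint `fp_other` are all hypothetical. The only real detail it relies on is that a chat completion response exposes a `system_fingerprint` attribute, as visible in the output above.

```python
from types import SimpleNamespace

def fingerprint_matches(response, expected_fingerprint):
    """Return True if the response reports the expected system fingerprint."""
    # getattr guards against responses that carry no fingerprint at all
    return getattr(response, 'system_fingerprint', None) == expected_fingerprint

# Hypothetical stand-ins for API responses (no API call is made here)
responses = [
    SimpleNamespace(system_fingerprint='fp_43dfabdef1', content='1,0,0'),
    SimpleNamespace(system_fingerprint='fp_other', content='0,1,0'),
    SimpleNamespace(system_fingerprint=None, content='0,0,1'),
]

# Keep only the responses produced under the expected fingerprint
kept = [r for r in responses if fingerprint_matches(r, 'fp_43dfabdef1')]
print(f"kept {len(kept)} of {len(responses)} responses")
```

Discarding non-matching responses trades some API calls for comparability: all retained annotations come from the same backend configuration.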

### Run gpt_annotate_num
Evaluation per seed:
- 5 different seeds
- Batches of 20 sentences
- 5 iterations

Returns 5 outputs per seed:
1. all_iterations_num_{seed}.csv
2. final_num_{seed}.csv
3. performance_metrics_{seed}
4. incorrect_{seed}.csv
5. fingerprints_all.csv


In [8]:
fingerprint = 'fp_43dfabdef1'
seeds = [3644, 3441, 280, 5991, 7917]

In [11]:
# Annotate the data - T0
for seed in seeds:
    gpt_annotate_num.gpt_annotate(text_to_annotate, A10, key, seed, fingerprint, experiment='A1.0', num_iterations=3, model="gpt-4o", temperature=0, batch_size=20, human_labels=True)

Seed: 3644: Iteration 1, batch 1: fingerprintmatch found!
Seed: 3644: Iteration 1, batch 2: fingerprintmatch found!
iteration:  1 completed
Seed: 3644: Iteration 2, batch 1: fingerprintmatch found!
Seed: 3644: Iteration 2, batch 2: fingerprintmatch found!
iteration:  2 completed
Seed: 3644: Iteration 3, batch 1: fingerprint does not match
Seed: 3644: Iteration 3, batch 2: fingerprintmatch found!
iteration:  3 completed
Seed: 3441: Iteration 1, batch 1: fingerprintmatch found!
Seed: 3441: Iteration 1, batch 2: fingerprintmatch found!
iteration:  1 completed
Seed: 3441: Iteration 2, batch 1: fingerprintmatch found!
Seed: 3441: Iteration 2, batch 2: fingerprintmatch found!
iteration:  2 completed
Seed: 3441: Iteration 3, batch 1: fingerprintmatch found!
Seed: 3441: Iteration 3, batch 2: fingerprintmatch found!
iteration:  3 completed
Seed: 280: Iteration 1, batch 1: fingerprintmatch found!
Seed: 280: Iteration 1, batch 2: fingerprintmatch found!
iteration:  1 completed
Seed: 280: Iteratio

Time to evaluate: 5 seeds, 5 iterations, 37 lines: 4 min

In [16]:
all_3644 = pd.read_csv('NUM_RESULT/A1.0/JAPAN - no temp/all_iterations_num_3644.csv')
all_3644

Unnamed: 0,unique_id,text,relevance_0_x,relevance_1_x,relevance_2_x,llm_query,relevance_0_y,relevance_1_y,relevance_2_y,iteration
0,0,"Thank you, Mr. President .",1,0,0,"0 Thank you, Mr. President .\n",1,0,0,1
1,0,"Thank you, Mr. President .",1,0,0,"0 Thank you, Mr. President .\n",1,0,0,2
2,0,"Thank you, Mr. President .",1,0,0,"0 Thank you, Mr. President .\n",1,0,0,3
3,0,"Thank you, Mr. President .",1,0,0,"0 Thank you, Mr. President .\n",1,0,0,4
4,0,"Thank you, Mr. President .",1,0,0,"0 Thank you, Mr. President .\n",1,0,0,5
...,...,...,...,...,...,...,...,...,...,...
168,35,"Through the Olympic Games , we will demonstra...",0,1,0,"35 Through the Olympic Games , we will demons...",0,0,1,4
169,36,Thank you for your kind attention.,1,0,0,36 Thank you for your kind attention.\n,1,0,0,1
170,36,Thank you for your kind attention.,1,0,0,36 Thank you for your kind attention.\n,1,0,0,2
171,36,Thank you for your kind attention.,1,0,0,36 Thank you for your kind attention.\n,1,0,0,3


### Evaluate Performance metrics
Print performance metrics

In [13]:
# Function to extract numbers from the filename string
def extract_numbers(filename):
    return re.findall(r'\d+', filename)

# Iterate through each file in the directory
directory = 'NUM_RESULT/A1.0/performance_metrics_num'

# Create empty list to store all dataframes
dataframes = []

# Open each file and make dataframe of complete scores
for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)
    df = pd.read_csv(file_path)

    seed = extract_numbers(filename)
    # Convert the list of numbers to a string
    seed_str = '_'.join(seed)

    # Add the extracted numbers to the first column entries
    if not df.empty and seed_str:
        # Ensure the first column is treated as a string
        df.iloc[:, 0] = seed_str + df.iloc[:, 0].astype(str)

    dataframes.append(df)

performance_all = pd.concat(dataframes, ignore_index=True)
performance_all

Unnamed: 0,Category,Accuracy,Precision,Recall,F1
0,0_280relevance_0,0.864865,1.0,0.705882,0.827586
1,0_280relevance_1,0.864865,0.782609,1.0,0.878049
2,0_280relevance_2,0.891892,0.0,0.0,0.0
3,0_3441relevance_0,0.837838,0.923077,0.705882,0.8
4,0_3441relevance_1,0.837838,0.772727,0.944444,0.85
5,0_3441relevance_2,0.891892,0.0,0.0,0.0
6,0_3644relevance_0,0.837838,0.923077,0.705882,0.8
7,0_3644relevance_1,0.837838,0.772727,0.944444,0.85
8,0_3644relevance_2,0.891892,0.0,0.0,0.0
9,0_5991relevance_0,0.837838,0.923077,0.705882,0.8
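
Prefixing the seed to the category name makes it awkward to compare one category across seeds. One possible alternative, sketched here under the assumption that the seed is stored in a separate (hypothetical) `Seed` column, is to group by category and summarise across seeds. The values are copied from the table above for seeds 280 and 3441:

```python
import pandas as pd

# Values copied from the table above; the 'Seed' column is an added assumption
performance = pd.DataFrame({
    'Seed': [280, 280, 280, 3441, 3441, 3441],
    'Category': ['relevance_0', 'relevance_1', 'relevance_2'] * 2,
    'Accuracy': [0.864865, 0.864865, 0.891892, 0.837838, 0.837838, 0.891892],
    'F1': [0.827586, 0.878049, 0.0, 0.8, 0.85, 0.0],
})

# Mean and standard deviation per category across seeds
summary = performance.groupby('Category')[['Accuracy', 'F1']].agg(['mean', 'std'])
print(summary)
```

A summary like this shows at a glance whether a weak score is specific to one seed or holds for all of them.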


### Evaluate full dataframes - to see if any inconsistencies occur
Hypothesis: string-based annotation performs significantly better than one-hot coded labels.
This could in itself be a limitation of the testing possibilities, since it limits the evaluation metrics that can be used.

In [22]:
result_all = pd.read_csv('NUM_RESULT/A1.0/all_iterations_num/all_iterations_num_280.csv')

result_all

Unnamed: 0,unique_id,text,relevance_0_x,relevance_1_x,relevance_2_x,llm_query,relevance_0_y,relevance_1_y,relevance_2_y,iteration
0,0,"Thank you, Mr. President .",1,0,0,"0 Thank you, Mr. President .\n",1,0,0,1
1,0,"Thank you, Mr. President .",1,0,0,"0 Thank you, Mr. President .\n",1,0,0,2
2,0,"Thank you, Mr. President .",1,0,0,"0 Thank you, Mr. President .\n",1,0,0,3
3,0,"Thank you, Mr. President .",1,0,0,"0 Thank you, Mr. President .\n",1,0,0,4
4,0,"Thank you, Mr. President .",1,0,0,"0 Thank you, Mr. President .\n",1,0,0,5
...,...,...,...,...,...,...,...,...,...,...
185,36,Thank you for your kind attention.,1,0,0,36 Thank you for your kind attention.\n,1,0,0,1
186,36,Thank you for your kind attention.,1,0,0,36 Thank you for your kind attention.\n,1,0,0,2
187,36,Thank you for your kind attention.,1,0,0,36 Thank you for your kind attention.\n,1,0,0,3
188,36,Thank you for your kind attention.,1,0,0,36 Thank you for your kind attention.\n,1,0,0,4


Observations: answers are consistent at temperature 0; labels do not switch between iterations. However, sentence 20 is included twice.
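
Both observations can also be checked programmatically rather than by eye. The sketch below uses a hypothetical miniature of an all_iterations file in which sentence 20 appears twice in iteration 1, and it assumes that the `_y` columns hold the model's labels:

```python
import pandas as pd

# Hypothetical miniature of an all_iterations file (invented rows)
toy = pd.DataFrame({
    'unique_id': [19, 19, 20, 20, 20],
    'iteration': [1, 2, 1, 1, 2],
    'relevance_0_y': [1, 1, 0, 0, 0],
    'relevance_1_y': [0, 0, 1, 1, 1],
    'relevance_2_y': [0, 0, 0, 0, 0],
})

# A sentence should occur once per iteration; anything more is a duplicate
duplicates = toy[toy.duplicated(subset=['unique_id', 'iteration'], keep=False)]

# Labels are consistent if each sentence has one distinct label combination
pred_cols = ['relevance_0_y', 'relevance_1_y', 'relevance_2_y']
distinct = toy.drop_duplicates(subset=['unique_id'] + pred_cols)
n_label_sets = distinct.groupby('unique_id').size()
inconsistent_ids = n_label_sets[n_label_sets > 1].index.tolist()

print("duplicated sentences:", duplicates['unique_id'].unique().tolist())
print("sentences with switching labels:", inconsistent_ids)
```

Applied to `result_all`, the first check would flag the repeated sentence and the second would confirm whether any label switches between iterations.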

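Performance really is a lot worse. Not a single relevant sentence is labelled correctly: for relevance_2, precision, recall and F1 are all 0, and this holds for every seed shown.
The accuracy of about 0.89 for relevance_2 is therefore misleading, since it mostly reflects the many sentences that do not carry this label.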
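
To see how accuracy can stay high while precision and recall are 0, consider a small worked example. The labels below are invented, and returning 0 when a denominator is 0 is an assumption about how the metrics were computed, not something confirmed by the notebook:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for one binary (one-hot) column."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Assumed convention: a zero denominator yields a score of 0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

# Invented example: one positive sentence out of ten, never predicted
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With a rare label, the true negatives dominate accuracy, which is why precision, recall and F1 are the more informative numbers for relevance_2.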