# Stage 2: Data Annotation

In this notebook, I use a Large Language Model (LLM) to annotate the sentiment of the sentences in the ESG document dataset, assigning a score of 0 for negative, 0.5 for neutral, and 1 for positive sentiment.  
The workflow has three steps: manually annotating ~500 sentences to create a "gold standard", setting up 2-3 LLMs for trial annotations, and experimenting with prompting strategies (zero-shot and few-shot) that are evaluated against this gold standard.

## Setup & Data Loading

In [3]:
# Imports
import os
import ast
import pandas as pd
import numpy as np

In [4]:
# Load the preprocessed data
cleaned_data = pd.read_csv('../data/checkpoints/enriched_cleaned_data.csv', delimiter = '|')

In [5]:
# Define a function to convert a string representation of a list to a list datatype
def string_to_list(string):
    try:
        return ast.literal_eval(string)
    except (ValueError, SyntaxError):
        print('List conversion failed')
        return []

# Convert the string representations of the lists to the correct 'list' datatype
cleaned_data['word_tokens'] = cleaned_data['word_tokens'].apply(string_to_list)
cleaned_data['sentence_tokens'] = cleaned_data['sentence_tokens'].apply(string_to_list)
cleaned_data['pos_tagged_word_tokens'] = cleaned_data['pos_tagged_word_tokens'].apply(string_to_list)
cleaned_data['pos_tagged_sentence_tokens'] = cleaned_data['pos_tagged_sentence_tokens'].apply(string_to_list)
cleaned_data['esg_topics'] = cleaned_data['esg_topics'].apply(string_to_list)
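
The conversion above is needed because a CSV file stores a list only as its string representation. The following minimal sketch reproduces this on a toy DataFrame (the data is invented for illustration) and uses an in-memory buffer instead of the checkpoint file:

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column (illustrative data, not from the dataset)
toy = pd.DataFrame({'word_tokens': [['esg', 'report'], ['net', 'zero']]})

# Round-trip through CSV in memory, using the same '|' separator as the checkpoints
buffer = io.StringIO()
toy.to_csv(buffer, index=False, sep='|')
buffer.seek(0)
reloaded = pd.read_csv(buffer, delimiter='|')

# After reloading, each cell is a plain string, no longer a list
cell_type_after_reload = type(reloaded.loc[0, 'word_tokens'])
print(cell_type_after_reload)

# ast.literal_eval safely parses the string back into a Python list
reloaded['word_tokens'] = reloaded['word_tokens'].apply(ast.literal_eval)
print(reloaded.loc[0, 'word_tokens'])
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so malformed or unexpected cell content raises an error instead of being executed.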

In [6]:
# Add some count features for the analysis
cleaned_data['cnt_word'] = cleaned_data['word_tokens'].apply(len)
cleaned_data['cnt_sentence'] = cleaned_data['sentence_tokens'].apply(len)
cleaned_data['cnt_esg'] = cleaned_data['esg_topics'].apply(len)

# Calculate ratio between words/sentences
cleaned_data['ratio_word_sentence'] = cleaned_data['cnt_word'] / cleaned_data['cnt_sentence']

# Convert date to correct datatype
cleaned_data['date'] = pd.to_datetime(cleaned_data['date'])

# Derive year and month to aggregate
cleaned_data['year_month'] = cleaned_data['date'].dt.strftime('%Y-%m')
cleaned_data['year'] = cleaned_data['date'].dt.strftime('%Y')
cleaned_data['month'] = cleaned_data['date'].dt.strftime('%m')

In [7]:
# Define function to save intermediary steps in a file
def csv_checkpoint(df, filename='checkpoint'):
    """
    Saves a DataFrame to a CSV file and loads it back into a DataFrame.

    Args:
        df (pandas.DataFrame): The DataFrame to save and load.
        filename (str): The name of the CSV file to save the DataFrame to (default: 'checkpoint').

    Returns:
        pandas.DataFrame: The loaded DataFrame.
    """
    if not os.path.exists('../data/checkpoints/'):  # Check if the directory exists and create it if it doesn't
        os.makedirs('../data/checkpoints/')

    # Save DataFrame to CSV
    df.to_csv(f'../data/checkpoints/{filename}.csv', index=False, sep='|')  # Save DataFrame to CSV with specified filename
    print(f'Saved DataFrame to {filename}.csv')

    # Load CSV back into DataFrame
    df = pd.read_csv(f'../data/checkpoints/{filename}.csv', delimiter='|')  # Load CSV back into DataFrame
    print(f'Loaded DataFrame from {filename}.csv')

    return df

## Manual sentence sentiment annotation

To define a "gold standard" for the sentiment, 500 randomly sampled sentences are manually annotated with:  
**0 = negative, 0.5 = neutral, 1 = positive**

In [141]:
# Create a deep copy so no reload from CSV files is necessary
documents = cleaned_data.copy(deep=True)

In [142]:
# Create a new column to store the sentence sentiment
documents['sentence_sentiment_value_llm'] = np.nan

In [143]:
# Explode the dataset based on the sentence tokens, so each row contains one sentence
documents = documents.explode('sentence_tokens')

# Preserve original index, so a later aggregation is possible
documents['original_index'] = documents.index

# Reset the index
documents = documents.reset_index(drop=True)
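
To illustrate why the original index is preserved, here is a small sketch on a toy DataFrame (invented data) showing that `explode` repeats the index label of the source row, so the sentences can later be grouped back into their documents:

```python
import pandas as pd

# Toy documents, each with a list of sentences (illustrative data only)
toy = pd.DataFrame({
    'title': ['Doc A', 'Doc B'],
    'sentence_tokens': [['first sentence', 'second sentence'], ['only sentence']],
})

# One row per sentence; the index label of the source document is repeated
exploded = toy.explode('sentence_tokens')

# Keep the document index as a column before it is overwritten
exploded['original_index'] = exploded.index
exploded = exploded.reset_index(drop=True)
print(exploded)

# Grouping by 'original_index' restores one row per document
restored = exploded.groupby('original_index').agg({
    'title': 'first',
    'sentence_tokens': list,
})
print(restored)
```

Resetting the index afterwards gives every sentence a unique label, which matters for any later step that selects or drops rows by index.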

In [144]:
# Separate the DataFrame into internal/external sentences with a defined ratio
internal = documents[documents['internal'] == 1]
external = documents[documents['internal'] == 0]

# Determine the number of samples from each group, 1000 sentences in total
n_internal = int(0.2 * 1000)  # 20% of samples
n_external = 1000 - n_internal  # Remaining samples

# Sample 1000 random sentences with a seed, so a re-run samples the same sentences
sampled_internal = internal.sample(n=n_internal, random_state=42)
sampled_external = external.sample(n=n_external, random_state=42)

# Concatenate the samples and shuffle the combined DataFrame
sampled_documents = pd.concat([sampled_internal, sampled_external])
sampled_documents = sampled_documents.sample(frac=1, random_state=42)

# Drop the sampled rows from the original DataFrame
documents = documents.drop(sampled_documents.index)
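
The sampling logic can be checked on a small scale. The sketch below uses a toy DataFrame with invented values and a total of 10 instead of 1000 samples; it assumes the index is unique, as it is here after the reset:

```python
import pandas as pd

# Toy sentence table: 30 internal and 70 external sentences (illustrative data)
toy = pd.DataFrame({
    'sentence': [f'sentence {i}' for i in range(100)],
    'internal': [1] * 30 + [0] * 70,
})

# Same 20/80 split as above, but with only 10 samples in total
n_total = 10
n_internal = int(0.2 * n_total)
n_external = n_total - n_internal

# Sample each group separately with a fixed seed, then shuffle the result
sampled = pd.concat([
    toy[toy['internal'] == 1].sample(n=n_internal, random_state=42),
    toy[toy['internal'] == 0].sample(n=n_external, random_state=42),
]).sample(frac=1, random_state=42)

# Dropping by index label only works reliably if every row has a unique label
remaining = toy.drop(sampled.index)

print(sampled['internal'].value_counts())
print(len(remaining))
```

Because both `sample` calls use a fixed `random_state`, a re-run selects the same rows as long as the input DataFrame is unchanged.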

In [145]:
# Check the sampled data
sampled_documents[['title','sentence_tokens','internal','sentence_sentiment_value_llm']].head()

| | title | sentence_tokens | internal | sentence_sentiment_value_llm |
|---|---|---|---|---|
| 529113 | Transcript levels in plasma contribute substan... | therefore adjust differences sample quality in... | 0 | NaN |
| 339673 | Absolutely everything you need to go bikepacki... | way little quicker easier make coffee porridge... | 0 | NaN |
| 390711 | STARTUP STAGE: Tripshifu connects experienced ... | started career multinational tata steel joinin... | 0 | NaN |
| 354554 | Automotive Aftermarket Market by Global Busine... | notable trend currently influencing dynamics a... | 0 | NaN |
| 420428 | Smashing Podcast Episode 50 With Marko Dugonji... | know never used tables layout | 0 | NaN |


In [None]:
# Loop the samples to annotate them
for idx, row in sampled_documents.iterrows():
    # Loop until valid input is received
    while True:
        # Print the title of the document and the sentence
        print(f"Title: {row['title']}\nSentence: {row['sentence_tokens']}\n")

        # Wait for user input
        sentiment = input("Enter sentiment value (+ for 1.0, - for 0.0, Enter for 0.5): ")

        # Check if the input is valid
        if sentiment == '+':
            sampled_documents.at[idx, 'sentence_sentiment_value_llm'] = 1.0
            break
        elif sentiment == '-':
            sampled_documents.at[idx, 'sentence_sentiment_value_llm'] = 0.0
            break
        elif sentiment == '':
            sampled_documents.at[idx, 'sentence_sentiment_value_llm'] = 0.5
            break
        else:
            print("Invalid input. Please try again.")

In [150]:
# Combine the annotated samples with the complete dataset
documents = pd.concat([documents, sampled_documents])

In [151]:
# Check the manually annotated data
documents[documents['sentence_sentiment_value_llm'].notnull()].head()

| | company | datatype | date | domain | esg_topics | internal | symbol | title | cleaned_content | word_tokens | ... | sentiment_value | cnt_word | cnt_sentence | cnt_esg | ratio_word_sentence | year_month | year | month | sentence_sentiment_value_llm | original_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 529113 | Qiagen | thinktank | 2022-03-17 | thelancet | [GenderDiversity, Privacy] | 0 | QIA | Transcript levels in plasma contribute substan... | aa remain underrepresented alzheimers disease ... | [aa, remain, underrepresented, alzheimers, dis... | ... | 0.100772 | 5054 | 280 | 2 | 18.05 | 2022-03 | 2022 | 3 | 1.0 | 7601 |
| 339673 | Beiersdorf | general | 2021-04-27 | cyclingweekly | [Compliance, Recycling, CustomerService, Gende... | 0 | BEI | Absolutely everything you need to go bikepacki... | get know area far intimately staying accommoda... | [get, know, area, far, intimately, staying, ac... | ... | 0.314033 | 5031 | 431 | 4 | 11.672854 | 2021-04 | 2021 | 4 | 0.0 | 2774 |
| 390711 | Deutsche Bank | general | 2022-09-26 | phocuswire | [HumanCapital, Social, Recruiting, Misinformat... | 0 | DBK | STARTUP STAGE: Tripshifu connects experienced ... | founded february currently five employee idea ... | [founded, february, currently, five, employee,... | ... | 0.249278 | 520 | 50 | 4 | 10.4 | 2022-09 | 2022 | 9 | 1.0 | 4228 |
| 354554 | Continental | general | 2021-10-05 | ecochunk | [RussianFederation] | 0 | CON | Automotive Aftermarket Market by Global Busine... | recording estimating analysing market data rep... | [recording, estimating, analysing, market, dat... | ... | 0.280167 | 676 | 36 | 1 | 18.777778 | 2021-10 | 2021 | 10 | 0.0 | 3046 |
| 420428 | Deutsche Telekom | business | 2022-08-09 | smashingmagazine | [CorporateCulture, HumanCapital, Environment, ... | 0 | DTE | Smashing Podcast Episode 50 With Marko Dugonji... | ask affect change ux design large organization... | [ask, affect, change, ux, design, large, organ... | ... | 0.203908 | 2392 | 320 | 4 | 7.475 | 2022-08 | 2022 | 8 | 0.5 | 4946 |


In [130]:
# Group by the 'original_index' and aggregate the columns
# Sort by the sentence-level index first, so the sentences of each document keep their original order
documents = documents.sort_index().groupby('original_index').agg({
    # Use 'first' for all document-level columns
    'title': 'first',
    'company': 'first',
    'datatype': 'first',
    'date': 'first',
    'domain': 'first',
    'esg_topics': 'first',
    'internal': 'first',
    'symbol': 'first',
    'cleaned_content': 'first',
    'word_tokens': 'first',
    'sentence_tokens': list,  # Combine the sentences of a document back into a list
    'pos_tagged_word_tokens': 'first',
    'pos_tagged_sentence_tokens': 'first',
    'market_cap_in_usd_b': 'first',
    'sector': 'first',
    'industry': 'first',
    'sentiment_value': 'first',
    'cnt_word': 'first',
    'cnt_sentence': 'first',
    'cnt_esg': 'first',
    'ratio_word_sentence': 'first',
    'year_month': 'first',
    'year': 'first',
    'month': 'first',
    'sentence_sentiment_value_llm': list  # Combine the sentence sentiments into a list; sentences without a manual annotation stay NaN
})

# Reset the index
documents = documents.reset_index(drop=True)

## LLM Annotation

### LLM Setups
As a first step, the LLMs to be compared need to be initialized.

### Testing Zero-Shot Strategies
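A zero-shot prompt contains only the task description and the sentence to classify. The sketch below shows one possible way to build such a prompt and to validate the model's answer against the three allowed scores. The function names `build_zero_shot_prompt` and `parse_sentiment_score` and the prompt wording are assumptions made for illustration; no specific LLM or API is called.

```python
# Allowed sentiment scores: 0 = negative, 0.5 = neutral, 1 = positive
VALID_SCORES = {0.0, 0.5, 1.0}


def build_zero_shot_prompt(sentence):
    """Hypothetical helper: build a prompt without any labeled examples."""
    return (
        "Classify the sentiment of the following sentence from an ESG-related document.\n"
        "Answer with a single number: 0 for negative, 0.5 for neutral, 1 for positive.\n\n"
        f"Sentence: {sentence}\n"
        "Sentiment:"
    )


def parse_sentiment_score(response_text):
    """Hypothetical helper: return the score as float, or None if the answer is invalid."""
    try:
        score = float(response_text.strip())
    except ValueError:
        return None
    return score if score in VALID_SCORES else None


prompt = build_zero_shot_prompt("know never used tables layout")
print(prompt)

# The raw model answer would be passed to the parser; here a fixed string stands in for it
print(parse_sentiment_score(" 0.5\n"))
```

Returning `None` for anything outside the three scores keeps malformed answers out of the annotation column, so they can be counted or retried separately.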

### Testing Few-Shot Strategies
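A few-shot prompt additionally shows the model a handful of labeled sentences before the sentence to classify. The following sketch shows one way to assemble such a prompt; the helper `build_few_shot_prompt`, the prompt wording, and the example sentences are invented for illustration and are not taken from the gold standard.

```python
# Task description shared by all prompts (wording is an assumption)
INSTRUCTION = (
    "Classify the sentiment of sentences from ESG-related documents.\n"
    "Answer with a single number: 0 for negative, 0.5 for neutral, 1 for positive."
)


def build_few_shot_prompt(sentence, examples):
    """Hypothetical helper: build a prompt from (sentence, score) example pairs."""
    blocks = [INSTRUCTION]
    for example_sentence, example_score in examples:
        # ':g' renders 1.0 as '1' and 0.5 as '0.5', matching the instruction
        blocks.append(f"Sentence: {example_sentence}\nSentiment: {example_score:g}")
    # The sentence to classify comes last, with the answer left open
    blocks.append(f"Sentence: {sentence}\nSentiment:")
    return "\n\n".join(blocks)


# Invented example sentences, one per sentiment class
examples = [
    ("company fined violating emission limits", 0.0),
    ("annual report published march", 0.5),
    ("company awarded prize reducing plastic waste", 1.0),
]

prompt = build_few_shot_prompt("supplier audit revealed unsafe working conditions", examples)
print(prompt)
```

If the examples are drawn from the manually annotated sentences, those sentences should be left out of the evaluation so the model is not scored on examples it has already seen.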

## Comparison
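One way to compare an LLM's annotations with the gold standard is sketched below. The function name `compare_to_gold` and the choice of metrics (exact-match accuracy, mean absolute error, and a confusion table) are assumptions, and the scores in the example are invented placeholders, not results.

```python
import pandas as pd


def compare_to_gold(gold, predicted):
    """Hypothetical helper: compare predicted scores with the manual annotations."""
    # Share of sentences where the LLM score equals the manual score
    accuracy = (gold == predicted).mean()
    # Average distance between the scores; confusing 0 with 1 counts twice as much as 0 with 0.5
    mae = (gold - predicted).abs().mean()
    # Rows: manual score, columns: LLM score
    confusion = pd.crosstab(gold, predicted)
    return accuracy, mae, confusion


# Invented placeholder scores for five sentences
gold = pd.Series([1.0, 0.0, 0.5, 0.5, 1.0], name='gold')
llm = pd.Series([1.0, 0.5, 0.5, 0.0, 1.0], name='llm')

accuracy, mae, confusion = compare_to_gold(gold, llm)
print(f"Accuracy: {accuracy:.2f}, MAE: {mae:.2f}")
print(confusion)
```

Because the three scores are ordered, the mean absolute error complements accuracy: it distinguishes a model that is often off by one class from a model that flips negative and positive.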

## LLM Annotation
Based on the comparison, XY demonstrated the best results for annotating the ESG documents. This LLM, together with the respective prompting strategy, is used to annotate all sentences.