# CLT Project - Stage III





- **Author:**             Arian Contessotto, Tim Giger, Levin Reichmuth
- **Submission Date:**    1 June 2023

## 1. Setup & Data Loading

If running on Colab, install the required packages and load data.

In [None]:
# Clone repo with dataset
!git clone https://github.com/syX113/hslu-nlp

In [None]:
# Check if files are loaded
!ls hslu-nlp/stage2/annotated/

In [None]:
# Required package installation
!transformers==4.28.0
!pip install torch

### 1.1 Import Packages & Downloads

In [1]:
# Imports
import torch
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from transformers import TrainingArguments, Trainer
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification
from transformers import XLNetTokenizerFast, XLNetForSequenceClassification
from transformers import EarlyStoppingCallback
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore")

2023-05-27 18:07:52.249868: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-05-27 18:17:05.413986: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


The final dataframe from stage one is loaded. These data are the basis for stage two.

In [2]:
# Define file name
esg_file = '../stage2/annotated/full_llm_annotated.csv' # Local filepath
#esg_file = 'hslu-nlp/stage2/annotated/full_llm_annotated.csv' # Filepath on Colab

# Define function to load and merge data
def load_data(file):

    # Load the data
    df = pd.read_csv(file, delimiter = '|')

    # Apply eval function
    df['esg_topics'] = df['esg_topics'].apply(eval)
    df['sentence_tokens'] = df['sentence_tokens'].apply(eval)
    df['sentiment_llm_continuous'] = df['sentiment_llm_continuous'].apply(eval)
    df['sentiment_llm_categorial'] = df['sentiment_llm_categorial'].apply(eval)

    return df

df = load_data(esg_file)

# Print shape and diyplay header
print(df.shape)
df.head()

(11071, 17)


Unnamed: 0,company,datatype,title,date,domain,esg_topics,internal,symbol,sentence_tokens,market_cap_in_usd_b,sector,industry,year_month,year,month,sentiment_llm_continuous,sentiment_llm_categorial
0,Beiersdorf,sustainability_report,BeiersdorfAG Sustainability Report 2021,2021-03-31,,"[CleanWater, GHGEmission, ProductLiability, Va...",1,BEI,[brands strategy sustainability agenda care be...,25.99,Consumer Staples,Household & Personal Products,2021-03,2021,3,"[0.4510161280632019, 0.6138720512390137, 0.226...","[0.5, 0.5, 0.0, 0.0, 0.5, 0.5, 0.5, 0.5, 1.0, ..."
1,Deutsche Telekom,sustainability_report,DeutscheTelekomAG Sustainability Report 2021,2021-03-31,,"[DataSecurity, Iso50001, GlobalWarming, Produc...",1,DTE,"[management facts, deutsche telekom cr report,...",101.78,Communication Services,Telecom Services,2021-03,2021,3,"[0.35756340622901917, 0.29088783264160156, 0.3...","[0.5, 0.0, 0.5, 0.5, 0.0, 0.5, 0.5, 0.5, 0.5, ..."
2,Vonovia,sustainability_report,VonoviaSE Sustainability Report 2021,2021-03-31,,"[Whistleblowing, DataSecurity, Vaccine, GHGEmi...",1,VNA,"[sustainable future, sustainability report dea...",20.35,Real Estate,Real Estate Services,2021-03,2021,3,"[0.4570336639881134, 0.45287153124809265, 0.26...","[0.5, 0.5, 0.0, 0.5, 0.5, 0.5, 0.0, 0.5, 0.0, ..."
3,Merck,sustainability_report,MerckKGaA Sustainability Report 2021,2021-03-31,,"[DataSecurity, DataMisuse, DrugResistance, Iso...",1,MRK,[management employees profile attractive emplo...,87.64,Healthcare,Drug Manufacturers—Specialty & Generic,2021-03,2021,3,"[0.36378589272499084, 0.6118267178535461, 0.48...","[0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, ..."
4,MTU,sustainability_report,MTUAeroEngines Sustainability Report 2020,2020-03-31,,"[WorkLifeBalance, Corruption, AirQuality, Data...",1,MTX,[sustainability goes far beyond climate action...,12.24,Industrials,Aerospace & Defense,2020-03,2020,3,"[0.46082836389541626, 0.46208637952804565, 0.4...","[0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.0, 0.5, 0.5, ..."


### 1.3 Create different Dataframes (Sentences & full Document)

In [3]:
def create_sentence_df(data):

    # Select relevant columns
    data = data[['internal','sentence_tokens','sentiment_llm_categorial']]

    # Explode the tokens, so each sentence is a row
    data = data.set_index(['internal']).apply(pd.Series.explode).reset_index()

    # Rename the columns and change order
    data.rename(columns={'sentence_tokens': 'sentence', 'sentiment_llm_categorial': 'sentiment'}, inplace=True)
    data = data[['internal', 'sentence', 'sentiment']]

    # Convert types
    data['internal'] = data['internal'].astype(int)
    data['sentence'] = data['sentence'].astype(str)
    data['sentiment'] = data['sentiment'].astype(float)
    
    return data

# Create sentence data
sentence_df = create_sentence_df(df)

# Display header and shape
print(sentence_df.shape)
sentence_df.head()

(678529, 3)


Unnamed: 0,internal,sentence,sentiment
0,1,brands strategy sustainability agenda care bey...,0.5
1,1,successfully reduced carbon footprint absolute...,0.5
2,1,end consumer business returned levels reduced ...,0.0
3,1,decoupling human economic activity natural res...,0.0
4,1,inspired beiersdorf ambitious sustainability a...,0.5


In [4]:
# Function to create document data
def create_document_df(data):

    # Join tokens
    data['document'] = data['sentence_tokens'].apply(' '.join)  # Convert tokens to strings

    # Compute the mean of the computed sentiment and discretize it
    def discretize_sentiment(value):
        if value <= 0.33:
            return 0.0
        elif value <= 0.66:
            return 0.5
        else:
            return 1.0

    data['sentiment'] = data['sentiment_llm_continuous'].apply(np.mean).apply(discretize_sentiment)

    # Convert types
    data['internal'] = data['internal'].astype(int)
    data['sentence'] = data['document'].astype(str)
    data['sentiment'] = data['sentiment'].astype(float)

    # Return needed columns and discretized mean of the sentiment
    return data[['internal', 'document', 'sentiment']]

# Create sentence data
document_df = create_document_df(df)

# Display header and shape
print(document_df.shape)
document_df.head()

(11071, 3)


Unnamed: 0,internal,document,sentiment
0,1,brands strategy sustainability agenda care bey...,0.5
1,1,management facts deutsche telekom cr report th...,0.5
2,1,sustainable future sustainability report dear ...,0.5
3,1,management employees profile attractive employ...,0.5
4,1,sustainability goes far beyond climate action ...,0.5


The subsets for the training should have equally distributed classes. In addition, external and internal documents should be represented.  
One of these conditions needs to be more "loose", we decide class equality is more important.

In [5]:
def balance_sentiment_and_internal(df):
    # Get minimum number of observations across sentiment classes
    min_internal_count = df['internal'].value_counts().min()
    
    # Get the minimum number of observations between internal == 0 and internal == 1
    min_sentiment_count = min(df[df['sentiment'] == 0].shape[0], df[df['sentiment'] == 1].shape[0], min_internal_count)

    # Create "balanced" dataframe
    balanced_df = pd.concat([df[df['sentiment'] == i].sample(min_sentiment_count, random_state=1) for i in df['sentiment'].unique()])

    return balanced_df

# Sample the sentence dataframe
sub_sentence_df = balance_sentiment_and_internal(sentence_df)

# Display header and shape
print(sub_sentence_df.shape)
sub_sentence_df.head()

(103812, 3)


Unnamed: 0,internal,sentence,sentiment
482121,0,july incyte eli lilly announced fda would meet...,0.5
459567,0,still cautious highly contagious delta strain ...,0.5
110344,1,march first publicprivate peer employees tied ...,0.5
267939,0,content plans featuring presentation followed ...,0.5
315677,0,pubmed scopus google scholar however link nets...,0.5


In [6]:
# Sample the document dataframe
sub_document_df = balance_sentiment_and_internal(document_df)

# Display header and shape
print(sub_document_df.shape)
sub_document_df.head()

(186, 3)


Unnamed: 0,internal,document,sentiment
10179,0,dgapnews ag key word quarterly interim stateme...,0.5
2292,0,president fraunhofer institute ceramic technol...,0.5
10338,0,yet father granted biopic still waiting twenty...,0.5
5472,0,stepped gear dedicated taskforce held inaugura...,0.5
8299,0,business technology platform critical piece la...,0.5


In [7]:
# Inspect sampling results
print('Sentence subset:')
print(sub_sentence_df['internal'].value_counts())
print(sub_sentence_df['sentiment'].value_counts())
print('\n')
print('Document subset:')
print(sub_document_df['internal'].value_counts())
print(sub_document_df['sentiment'].value_counts())

Sentence subset:
0    72896
1    30916
Name: internal, dtype: int64
0.5    34604
0.0    34604
1.0    34604
Name: sentiment, dtype: int64


Document subset:
0    186
Name: internal, dtype: int64
0.5    62
1.0    62
0.0    62
Name: sentiment, dtype: int64


In [8]:
# Drop uncessary column and reset index
document_df = document_df.drop(columns=['internal']).reset_index(drop=True)
sentence_df = sentence_df.drop(columns=['internal']).reset_index(drop=True)
sub_sentence_df = sub_sentence_df.drop(columns=['internal']).reset_index(drop=True)

As a result, the models can be evaluated and trained with 2 approaches:  
- A dataframe containing the full document and a discretized mean sentiment of all included sentences
- A dataframe containing each sentence with the corresponding discretized sentiment  
- Two sampled subset dataframes for moel evaluation

"Discretized" corresponds to the labels 0.0 (negative), 0.5 (neutral) and 1.0 (positive).

## 2. Model Finetuning

The evaluation for the model is based on the following conceptual approach:
1. Select multiple pretrained (Huggingface) models, based on previous stages
2. Train the selected models on a small subset of the full documents and the single sentences to keep the training time short
3. Compare the training outcomes of the different models on the two subsets and select the best model

In [9]:
# Function to compute the comparison metrics
def compute_metrics(p):
    pred, labels = p
    
    # Use the appropriate metrics, since we don't have discrete classes but a continous score 
    mse = mean_squared_error(y_true=labels, y_pred=pred)
    mae = mean_absolute_error(y_true=labels, y_pred=pred)
    r2 = r2_score(y_true=labels, y_pred=pred)

    return {"MSE": mse, "MAE": mae, "R2": r2}

In [10]:
# Load Tensorboard for training monitoring
%load_ext tensorboard

In [11]:
# Kill potential Tensorboard process, so it don't block the port
!pkill -f "tensorboard"

In [12]:
# Start Tensorboard to monitor training processes
%tensorboard --logdir ./evaluation/ --port 6010

### 1.1 Finetune Model 1: *distilbert-base-uncased*

As a first test, we use the lightweight "distilbert-base-uncased" model and fine-tune it on the full documents and the sentences, since the finetuned *"nlptown/bert-base-multilingual-uncased-sentiment"* demonstrated high alignment with the gold standard in stage 2.  
Since BERT only accepts 512 input word tokens, the full documents are heavyily truncated.  

🤗 page: https://huggingface.co/distilbert-base-uncased

In [13]:
# Define pretrained tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=1) # 1 label to get a continuous score between 0 and 1

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DistilBertTokenizer'. 
The class this function is called from is 'BertTokenizer'.
You are using a model of type distilbert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing BertForSequenceClassification: ['distilbert.transformer.layer.0.attention.k_lin.weight', 'distilbert.transformer.layer.5.sa_layer_norm.bias', 'distilbert.transformer.layer.3.attention.q_lin.weight', 'distilbert.transformer.layer.5.ffn.lin1.bias', 'distilbert.transformer.layer.2.attention.v_lin.weight', 'distilbert.transformer.layer.0.attention.q_lin.bias', 'distilbert.transformer.layer.0.attention.out_lin.bias', 'distilbert.transformer.layer.2.attentio

In [14]:
# Create the torch datasets to use data in PyTorch and override necessary methods
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

#### Finetune *distilbert-base-uncased* on sentence subset

In [15]:
# Split the data with a 70%, 15% and 15% ratio (train, valid, test)
X = list(sub_sentence_df["sentence"])
y = list(sub_sentence_df["sentiment"])
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3) # Split 70% train data
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5) # Split the other 30% in 50% each to get the correct ratio

# Tokenize the datasets
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
X_val_tokenized = tokenizer(X_val, padding=True, truncation=True, max_length=512)
X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)

# Create the train, validation and test dataset as PyTorch datasets
train_dataset_distilbert_sent = Dataset(X_train_tokenized, y_train)
val_dataset_distilbert_sent = Dataset(X_val_tokenized, y_val)
test_dataset_distilbert_sent = Dataset(X_test_tokenized, y_test)

In [16]:
# Define training arguments
args = TrainingArguments(
    output_dir="./evaluation/distilbert_sentences",
    evaluation_strategy="steps",
    eval_steps=1000,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    gradient_accumulation_steps=4,
    seed=0,
    optim="adamw_torch", # Use newer PyTorch optimizer
    learning_rate=2e-5,
    logging_steps=10,
    fp16=True,
    report_to='tensorboard')

# Define Huggingface Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset_distilbert_sent,
    eval_dataset=val_dataset_distilbert_sent,
    compute_metrics=compute_metrics
    #callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

In [17]:
# Delete GPU cache
torch.cuda.empty_cache()
# Train pre-trained model
trainer.train()
# Save the model
model.save_pretrained("./models/distilbert_sentences")

Step,Training Loss,Validation Loss


#### Finetune *distilbert-base-uncased* on document subset

In [18]:
# Split the data with a 70%, 15% and 15% ratio (train, valid, test)
X = list(sub_document_df["document"])
y = list(sub_document_df["sentiment"])
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3) # Split 70% train data
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5) # Split the other 30% in 50% each to get the correct ratio

# Tokenize the datasets
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
X_val_tokenized = tokenizer(X_val, padding=True, truncation=True, max_length=512)
X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)

# Create the train, validation and test dataset as PyTorch datasets
train_dataset_distilbert_doc = Dataset(X_train_tokenized, y_train)
val_dataset_distilbert_doc = Dataset(X_val_tokenized, y_val)
test_dataset_distilbert_doc = Dataset(X_test_tokenized, y_test)

In [19]:
# Define training arguments
args_distilbert = TrainingArguments(
    output_dir="./evaluation/distilbert_documents",
    evaluation_strategy="steps",
    eval_steps=10,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=100,
    gradient_accumulation_steps=4,
    seed=0,
    optim="adamw_torch", # Use newer PyTorch optimizer
    learning_rate=2e-5,
    logging_steps=1,
    fp16=True,
    report_to='tensorboard')

# Define Huggingface Trainer
trainer = Trainer(
    model=model,
    args=args_distilbert,
    train_dataset=train_dataset_distilbert_doc,
    eval_dataset=val_dataset_distilbert_doc,
    compute_metrics=compute_metrics
    #callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

In [20]:
# Delete GPU cache
torch.cuda.empty_cache()
# Train pre-trained model
trainer.train()
# Save the model
model.save_pretrained("./models/distilbert_documents")

Step,Training Loss,Validation Loss


### 1.2 Finetune Model 2: *roberta-base*

As a second model for the comparison, we choose RoBERTa. It is a further development of BERT and should perform better.  

This model benefits substantially from extended training duration, larger data batches, and an increase in dataset size. Its performance further increases by eliminating the next sentence prediction objective and integrating longer sequences during training.  
Lastly, the model's optimization is boosted by dynamically altering the masking pattern applied to the training data.

Reference: https://arxiv.org/pdf/1907.11692.pdf  
🤗 page: https://huggingface.co/roberta-base

In [21]:
# Load default methods again, since a few are overwritten for destillbert
from datasets import Dataset 

In [22]:
# Load the tokenizer for RoBERTa
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')

def tokenize_and_format_sentence(examples):

    # Tokenize the text and map sentiment to label
    tokenized_inputs = tokenizer(examples['sentence'], truncation=True, padding='max_length')
    labels = examples['sentiment']
    
    # Return both the tokenized inputs and labels
    return {**tokenized_inputs, 'labels': labels}

def tokenize_and_format_document(examples):
    # Same as above for documents
    tokenized_inputs = tokenizer(examples['document'], truncation=True, padding='max_length')
    labels = examples['sentiment']
    
    return {**tokenized_inputs, 'labels': labels}

In [23]:
# Load the base RoBERTa model
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=1)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'lm_head.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.

#### Finetune *roberta-base* on sentence subset

In [24]:
# Train, validation and test split (70%, 15% and 15%)
train_dataset, temp_df = train_test_split(sub_sentence_df, test_size=0.3, random_state=42)
val_dataset, test_dataset = train_test_split(temp_df, test_size=0.5, random_state=42)

# Convert pandas DataFrame to Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_dataset)
val_dataset = Dataset.from_pandas(val_dataset)
test_dataset = Dataset.from_pandas(test_dataset)

# Tokenizing the datasets
train_dataset_roberta_sent = train_dataset.map(tokenize_and_format_sentence, batched=True)
val_dataset_roberta_sent = val_dataset.map(tokenize_and_format_sentence, batched=True)
test_dataset_roberta_sent = test_dataset.map(tokenize_and_format_sentence, batched=True)

# Set the correct data format for PyTorch
train_dataset_roberta_sent.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
val_dataset_roberta_sent.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset_roberta_sent.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

Map:   0%|          | 0/72668 [00:00<?, ? examples/s]

Map:   0%|          | 0/15572 [00:00<?, ? examples/s]

Map:   0%|          | 0/15572 [00:00<?, ? examples/s]

In [25]:
# Prepare to train the model
args = TrainingArguments(
    output_dir="./evaluation/roberta_sentences",
    evaluation_strategy="steps",
    eval_steps=1000,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    gradient_accumulation_steps=4,
    seed=0,
    optim="adamw_torch", # Use newer PyTorch optimizer
    learning_rate=2e-5,
    logging_steps=10,
    fp16=True,
    report_to='tensorboard')

# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset_roberta_sent,
    eval_dataset=val_dataset_roberta_sent
)

In [26]:
# Delete GPU cache
torch.cuda.empty_cache()
# Train the model
trainer.train()
# Save the model
model.save_pretrained("./models/roberta_sentences")

Step,Training Loss,Validation Loss


#### Finetune *roberta-base* on document subset

In [27]:
# Train, validation and test split (70%, 15% and 15%)
train_dataset, temp_df = train_test_split(sub_document_df, test_size=0.3, random_state=42)
val_dataset, test_dataset = train_test_split(temp_df, test_size=0.5, random_state=42)

# Convert pandas DataFrame to Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_dataset)
val_dataset = Dataset.from_pandas(val_dataset)
test_dataset = Dataset.from_pandas(test_dataset)

# Tokenizing the datasets
train_dataset_roberta_doc = train_dataset.map(tokenize_and_format_document, batched=True)
val_dataset_roberta_doc = val_dataset.map(tokenize_and_format_document, batched=True)
test_dataset_roberta_doc = test_dataset.map(tokenize_and_format_document, batched=True)

# Set the correct data format for PyTorch
train_dataset_roberta_doc.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
val_dataset_roberta_doc.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset_roberta_doc.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

Map:   0%|          | 0/130 [00:00<?, ? examples/s]

Map:   0%|          | 0/28 [00:00<?, ? examples/s]

Map:   0%|          | 0/28 [00:00<?, ? examples/s]

In [28]:
# Prepare to train the model
args = TrainingArguments(
    output_dir="./evaluation/roberta_documents",
    evaluation_strategy="steps",
    eval_steps=10,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=100,
    gradient_accumulation_steps=4,
    seed=0,
    optim="adamw_torch", # Use newer PyTorch optimizer
    learning_rate=2e-5,
    logging_steps=1,
    fp16=True,
    report_to='tensorboard')

# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset_roberta_doc,
    eval_dataset=val_dataset_roberta_doc
)

In [29]:
# Delete GPU cache
torch.cuda.empty_cache()
# Train the model
trainer.train()
# Save the model
model.save_pretrained("./models/roberta_documents")

Step,Training Loss,Validation Loss


### 1.3 Finetune Model 3: *xlnet-base-cased*

XLNet is a pretraining model for natural language processing tasks that combines the advantages of both autoregressive language models and denoising autoencoding models like BERT. 
Unlike BERT, XLNet mitigates dependency issues between masked positions and avoids a pretrain-finetune discrepancy by employing a generalized autoregressive method. 
This approach allows for learning bidirectional contexts by maximizing expected likelihood over all possible factorization orders. 
Furthermore, it integrates the strengths of Transformer-XL, a leading autoregressive model, into its pretraining procedure. 
Empirical evidence suggests that XLNet surpasses BERT in performance across a range of tasks including question answering, natural language inference, sentiment analysis, and document ranking.

Reference: https://arxiv.org/abs/1906.08237  
🤗 page: https://huggingface.co/xlnet-base-cased


In [30]:
# Load the tokenizer
tokenizer = XLNetTokenizerFast.from_pretrained('xlnet-base-cased')
# Load the XLNet model
model = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=1)

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.bias', 'lm_loss.weight']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.bias', 'logits_proj.weight', 'logits_proj.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions a

In [31]:
# Adjust compute metrics function for XLNet
def compute_metrics_xlnet(eval_pred):
    logits, labels = eval_pred
    predictions = logits.detach().numpy()
    labels = labels.detach().numpy()
    
    mse = mean_squared_error(labels, predictions)
    mae = mean_absolute_error(labels, predictions)
    r2 = r2_score(labels, predictions)

    return {'mse': mse, 'mae': mae, 'r2': r2}

In [32]:
# Prepare the data
def prepare_data(sentences, labels):
    # Tokenize the inputs
    inputs = tokenizer(sentences, truncation=True, padding='max_length', max_length=512, return_tensors='pt')

    # Convert labels to tensors and resize to match input dimensions
    labels = torch.tensor(labels).unsqueeze(1).float()

    return inputs.input_ids, labels

# Create torch Dataset and adjust methods
class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        return {'input_ids': self.encodings[idx], 'labels': self.labels[idx]}

    def __len__(self):
        return len(self.labels)

#### Finetune *xlnet-base-cased* on sentence subset

In [33]:
# Split the data with 70%, 15% and 15% 
train_df, temp_df = train_test_split(sub_sentence_df, test_size=0.3, random_state=42)
val_df, test_df = train_test_split(sub_sentence_df, test_size=0.5, random_state=42)

# Get sentences and labels
train_sentences = train_df['sentence'].tolist()
train_labels = train_df['sentiment'].tolist()
val_sentences = val_df['sentence'].tolist()
val_labels = val_df['sentiment'].tolist()
test_sentences = test_df['sentence'].tolist()
test_labels = test_df['sentiment'].tolist()

# Prepare inputs and labels
train_input_ids, train_labels = prepare_data(train_sentences, train_labels)
val_input_ids, val_labels = prepare_data(val_sentences, val_labels)
test_input_ids, test_labels = prepare_data(test_sentences, test_labels)

# Create sentence datasets
train_dataset_xlnet_sent = SentimentDataset(train_input_ids, train_labels)
val_dataset_xlnet_sent = SentimentDataset(val_input_ids, val_labels)
test_dataset_xlnet_sent = SentimentDataset(test_input_ids, test_labels)

In [34]:
# Define TrainingArguments
training_args = TrainingArguments(
    output_dir="./evaluation/xlnet_sentences",
    evaluation_strategy="steps",
    eval_steps=1000,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    gradient_accumulation_steps=4,
    seed=0,
    optim="adamw_torch", # Use newer PyTorch optimizer
    learning_rate=2e-5,
    logging_steps=10,
    fp16=True,
    report_to='tensorboard')

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_xlnet_sent,
    eval_dataset=val_dataset_xlnet_sent,
    compute_metrics=compute_metrics_xlnet,
)

In [35]:
# Delete GPU cache
torch.cuda.empty_cache()
# Train pre-trained model
trainer.train()
# Save the model
model.save_pretrained("./models/xlnet_sentences")

Step,Training Loss,Validation Loss


AttributeError: 'numpy.ndarray' object has no attribute 'detach'

#### Finetune *xlnet-base-cased* on document subset

In [None]:
# Split the data with 70%, 15% and 15% 
train_df, temp_df = train_test_split(sub_document_df, test_size=0.3, random_state=42)
val_df, test_df = train_test_split(sub_document_df, test_size=0.5, random_state=42)

# Get sentences and labels
train_sentences = train_df['document'].tolist()
train_labels = train_df['sentiment'].tolist()
val_sentences = val_df['document'].tolist()
val_labels = val_df['sentiment'].tolist()
test_sentences = test_df['document'].tolist()
test_labels = test_df['sentiment'].tolist()

# Prepare inputs and labels
train_input_ids, train_labels = prepare_data(train_sentences, train_labels)
val_input_ids, val_labels = prepare_data(val_sentences, val_labels)
test_input_ids, test_labels = prepare_data(test_sentences, test_labels)

# Create sentence datasets
train_dataset_xlnet_doc = SentimentDataset(train_input_ids, train_labels)
val_dataset_xlnet_doc = SentimentDataset(val_input_ids, val_labels)
test_dataset_xlnet_doc = SentimentDataset(test_input_ids, test_labels)

In [None]:
# Define TrainingArguments
training_args = TrainingArguments(
    output_dir="./evaluation/xlnet_documents",
    evaluation_strategy="steps",
    eval_steps=10,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=100,
    gradient_accumulation_steps=4,
    seed=0,
    optim="adamw_torch", # Use newer PyTorch optimizer
    learning_rate=2e-5,
    fp16=True,
    report_to='tensorboard')

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_xlnet_doc,
    eval_dataset=val_dataset_xlnet_doc,
    compute_metrics=compute_metrics_xlnet,
)

In [None]:
# Delete GPU cache
torch.cuda.empty_cache()
# Train pre-trained model
trainer.train()
# Save the model
model.save_pretrained("./models/xlnet_documents")

### 1.4 Finetune Model 4: *flan-t5-base* (not working correctly)

🤗 page: https://huggingface.co/google/flan-t5-base

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, TrainingArguments, Trainer
tokenizer = T5Tokenizer.from_pretrained('google/flan-t5-base')
model = T5ForConditionalGeneration.from_pretrained('google/flan-t5-base')

# Use correct Dataset class and adjust needed methods
class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels.float()
        self.decoder_input_ids = torch.ones((len(labels),1), dtype=torch.long) # Initialize with start token

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        item['decoder_input_ids'] = self.decoder_input_ids[idx]
        return item

    def __len__(self):
        return len(self.labels)


def prepare_data(sentences, labels):
    # Tokenize the inputs with added task-specific prefix ("sentiment")
    inputs = tokenizer(["sentiment: " + sentence for sentence in sentences], truncation=True, padding='max_length', max_length=512, return_tensors='pt')

    # Convert labels to tensors
    labels = torch.tensor(labels.to_numpy())

    dataset = SentimentDataset(inputs, labels)
    return dataset

sentence_data = sub_sentence_df['sentence']
label_data = sub_sentence_df['sentiment']

train_frac = 0.8
train_size = int(train_frac * len(sentence_data))

train_sentences = sentence_data[:train_size]
train_labels = label_data[:train_size]

val_sentences = sentence_data[train_size:]
val_labels = label_data[train_size:]

train_dataset = prepare_data(train_sentences, train_labels)
val_dataset = prepare_data(val_sentences, val_labels)

def compute_metrics_flan(eval_pred):
    predictions, labels = eval_pred

    # Reduce predictions to a single value per sequence (e.g., using mean)
    predictions = predictions.mean(dim=-1)

    mse = mean_squared_error(y_true=labels, y_pred=predictions)
    mae = mean_absolute_error(y_true=labels, y_pred=predictions)
    r2 = r2_score(y_true=labels, y_pred=predictions)

    return {"MSE": mse, "MAE": mae, "R2": r2}

class SentimentTrainer(Trainer):
    def predict(self, test_dataset):
        predictions, labels, _ = super().predict(test_dataset)
        # convert predicted token ids to float values
        predictions = [float(tokenizer.decode(pred)) for pred in predictions]
        return predictions, labels
    
    # Adjust loss function to regression
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits

        # Reduce logits to a single value per sequence (e.g., using mean)
        logits = logits.mean(dim=-1)

        # Use MSE loss for regression
        loss_fct = torch.nn.MSELoss()
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss

training_args = TrainingArguments(
    output_dir="./evaluation/flant5_sentences",
    evaluation_strategy="steps",
    eval_steps=50,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    gradient_accumulation_steps=4,
    seed=0,
    optim="adamw_torch", # Use newer PyTorch optimizer
    learning_rate=2e-5,
    logging_steps=10,
    fp16=True,
    report_to='tensorboard')

trainer = SentimentTrainer(
    model=model,              
    args=training_args, 
    compute_metrics=compute_metrics_flan,     
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Delete GPU cache
torch.cuda.empty_cache()

# Train the model
trainer.train()

# Save the model
model.save_pretrained("./models/flant5_sentences")

T5 models are normally used for text-to-text tasks. Therefore, fine-tuning T5 for a classification task with a continous prediction between 0 and 1 is a bit of a diversion. Still, above code works and the training can be started.  
But the user-defined loss function does not work properly (it does not drop, but shows 0 throughout the training), and there is also uncertainty about the correct encodings.  

Therefore, this model is not considered.

## 3. Model Evaluation

The finetuned models are evaluated based on MSE, MAE and R2. In addition, the test datasets are used to test the predictions.

In [None]:
# Load trained distilbert model
model_distilbert_sent = BertForSequenceClassification.from_pretrained('./models/distilbert_sentences/', num_labels=1)
# Define distilbert test trainer
trainer_distilbert_sent = Trainer(model_distilbert_sent)
# Make predictions on the test subset
pred_distilbert_sent = trainer_distilbert_sent.predict(test_dataset_distilbert_sent)

In [None]:
# Load trained RoBERTa model
model_roberta_sent = RobertaForSequenceClassification.from_pretrained('./models/roberta_sentences/', num_labels=1)
# Define RoBERTa test trainer
trainer_roberta_sent = Trainer(model_roberta_sent)
# Make predictions on the test subset
pred_roberta_sent = trainer_roberta_sent.predict(test_dataset_roberta_sent)

In [None]:
# Load trained xlnet model
model_xlnet_sent = XLNetForSequenceClassification.from_pretrained('./models/xlnet_sentences/', num_labels=1)
# Define xlnet test trainer
trainer_xlnet_sent = Trainer(model_xlnet_sent)
# Make predictions on the test subset
pred_xlnet_sent = trainer_xlnet_sent.predict(test_dataset_xlnet_sent)

In [None]:
trainer_distilbert_sent.evaluate(eval_dataset=test_dataset_distilbert_sent)

In [None]:
trainer_roberta_sent.evaluate(eval_dataset=test_dataset_roberta_sent)

In [None]:
trainer_xlnet_sent.evaluate(eval_dataset=test_dataset_xlnet_sent)

Load Tensorboard to inspect the metrics.

In [None]:
# Kill potential Tensorboard process, so it don't block the port
!pkill -f "tensorboard"

In [None]:
# Start Tensorboard to monitor training process
%tensorboard --logdir ./evaluation/ --port 6010

In [None]:
# test_dataset_distilbert_doc
# test_dataset_roberta_doc
# test_dataset_xlnet_doc

## 4. Hyperparameter Tuning for selected Model

## 5. Full Training of selected Model

To speed up the full training, PEFT (https://github.com/huggingface/peft) is considered.  
RoBERTa supports LoRa, Prefix Tuning, P-Tuning and Prompt Tuning: https://github.com/huggingface/peft#sequence-classification

## 6. Evaluation of fully trained Model

## 7. Sentiment Prediction with selected Model

## 8. Compare internal vs. external