# DSA4265 Assignment 1: Fine-Tuning of FinBERT for Volatility Forecasting based on Mergers & Acquisitions-related Headlines

Mergers and acquisitions play a crucial role in shaping a company's outlook, directly influencing investor sentiment and stock performance. Depending on whether investors perceive the news positively or negatively, they may adjust their trading decisions, causing fluctuations in stock prices and increased volatility. By forecasting these volatility shifts based on a single headline, investors with lower risk appetites may choose to sell their holdings or avoid purchasing the stock altogether, as higher volatility, while potentially offering greater returns, also brings the risk of significant losses. 

With that said, the goal of this assignment is to assess the impact of mergers and acquisitions-related headlines on the short-term, medium-term, and long-term volatility of the closing prices of a company’s ordinary shares, and this is done through the fine-tuning of a FinBERT model.

## Part 1: Data Extraction

The following section describes the data extraction process and generation of the labelled dataframe. The tickers used for analysis are as such:

In [73]:
# Tickers used for Analysis
top_tickers = [
    'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA', 'META', 'NVDA', 'BRK-B', 'V', 'PYPL', 'MA',
    'BABA', 'JNJ', 'WMT', 'NFLX', 'DIS', 'NVDA', 'HD', 'PFE', 'ADBE', 'INTC', 'VZ', 'CSCO',
    'GS', 'BA', 'CVX', 'UNH', 'KO', 'XOM', 'PEP', 'MCD', 'NKE', 'IBM', 'GM', 'GE', 'INTU',
    'TXN', 'WFC', 'AMT', 'LMT', 'SPY', 'MS', 'GS', 'CAT', 'T', 'AXP', 'COP', 'MDT', 'MMM',
    'UPS', 'PLD', 'AMD', 'F', 'PFE', 'GILD', 'C', 'STZ', 'LVS', 'MGM', 'ZTS', 'BMY', 'DELL',
    'TMO', 'LLY', 'AMGN', 'DUK', 'STT', 'BA', 'AIG', 'CL', 'OXY', 'MU', 'SLB', 'HCA', 'HPE',
    'LUV', 'UAL', 'SPG', 'CME', 'CSX', 'ETN', 'FIS', 'REGN', 'CVS', 'MDLZ', 'SYK', 'WBA',
    'ZBRA', 'NOC', 'RMD', 'CHTR', 'TROW', 'RJF', 'FOXA', 'FOXF', 'HSY', 'KHC', 'NKE', 'TRV',
    'ADP', 'VFC', 'MSCI', 'MKC', 'BAX', 'LRCX', 'TGT', 'FIS', 'RTX', 'CDNS', 'DG',
    'LPLA', 'EXC', 'CTSH', 'MTB', 'STZ', 'WFC', 'COF', 'DHR', 'CCL', 'ECL', 'SPGI', 
    'TEL', 'WST', 'FMC', 'EL', 'VRSN', 'SWK', 'TMO', 'DE', 'ADBE', 'JNJ', 'SBUX', 'PFE',
    'HPE', 'MMM', 'ABT', 'INTU', 'MELI', 'BA', 'GOOG'
]

# Addition of .O to indicate Ordinary Shares
tickers_ordinary_share = [ticker + '.O' for ticker in top_tickers]

### Part 1a: Data Extraction from Refinitiv Workspace and yfinance

It is to note that the data obtained was sourced from Refinitiv Workspace, and the code to extract the dataframes were all copied and pasted from its in-built CodeBook. Following the data extraction, the stock prices were extracted with the help of yfinance libary. From these prices, the corresponding rolling volatilities were then computed.

In this assignment, I utilised 3 different windows to represent various types of volatilities:

1) 3-Day Rolling Window (Short-Term Volatility)
2) 5-Day Rolling Window (Medium-Term Volatility)
3) 10-Day Rolling Window (Long-Term Volatility)

By using these 3 different windows, greater insight on the impact of headlines on the different terms of volatilities (Short-Term, Medium-Term and Long-Term) is hoped to be attained. Subsequently, based on the metrics used (F1 score), I can then decide which type of volatility (short-term, medium-term or long-term) is the easiest to predict based on the headlines.

In [None]:
import refinitiv.data as rd
from refinitiv.data.content import news
from IPython.display import HTML
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Open Refinitiv Workspace Session
rd.open_session()

# Extraction of Merger News Headlines data from Refinitiv Workspace
start_dates = ['2022-01-01', '2024-01-01', '2024-07-01', '2025-01-01']
end_dates = ['2023-12-31', '2024-06-30', '2024-12-31', '2025-02-11']
csv_names = ['headlines_top_100_stocks_2022_2023.csv', 'headlines_top_100_stocks_2024_first_half.csv', 
            'headlines_top_100_stocks_2024_second_half.csv', 'headlines_top_100_stocks_2025.csv']
for i in range(4):
    compNews = pd.DataFrame()
    start_date = start_dates[i]
    end_date = end_dates[i]
    csv_name = csv_names[i]
    for ric in tickers_ordinary_share:
        try:
            cHeadlines = rd.news.get_headlines("Topic:MRG AND "+ "R:" + ric + " AND Language:LEN AND Source:RTRS", start= start_date, 
                                            end = end_date, count = 100)
            cHeadlines['cRIC'] = ric
            
            if len(compNews):
                compNews = pd.concat([compNews,cHeadlines])
            else:
                compNews = cHeadlines
        except Exception:
            pass

    compNews.to_csv(csv_name)

In [None]:
# Combination of extracted data from Refinitiv API to form the overall fine-tuning test set

import pandas as pd
headlines_2022_2023 = pd.read_csv('headlines_top_100_stocks_2022_2023.csv')
headlines_2024_1 = pd.read_csv('headlines_top_100_stocks_2024_first_half.csv')
headlines_2024_2 = pd.read_csv('headlines_top_100_stocks_2024_second_half.csv')
headlines_2025 = pd.read_csv('headlines_top_100_stocks_2025.csv')
combined = pd.concat([headlines_2022_2023,headlines_2024_1,headlines_2024_2,headlines_2025],axis=0)

combined = combined.drop_duplicates() # Avoiding any duplicate headlines

combined['cRIC'] = combined['cRIC'].str.replace('.O', '', regex=False)
combined = combined.rename(columns={'versionCreated':'published_date',
                                    'cRIC':'Ticker'})
combined = combined.drop(columns=['storyId','sourceCode'])
combined.to_csv('combined_headlines_refinitiv.csv')
combined

Unnamed: 0,published_date,headline,Ticker
0,2023-12-27 07:19:16.000,PRESS DIGEST- Wall Street Journal - Dec 27,AAPL
1,2023-12-19 00:58:19.000,PRESS DIGEST- Financial Times - Dec. 19,AAPL
2,2023-12-18 12:31:23.000,RPT-FOCUS-Goldman Sachs faces rocky exit from ...,AAPL
3,2023-12-05 05:14:57.902,Newscasts - Wall Street ends down as megacaps ...,AAPL
4,2023-12-04 21:00:38.184,Newscasts - U.S. Day Ahead: Treasury yields r...,AAPL
...,...,...,...
128,2025-01-22 15:34:31.000,FACTBOX-List of UK competition regulator cases...,GOOG
129,2025-01-22 11:49:59.000,UPDATE 3-UK boots out antitrust boss for faili...,GOOG
130,2025-01-17 12:00:00.000,RPT-BREAKINGVIEWS-Mega-merger boom threatens a...,GOOG
131,2025-01-16 22:00:00.000,BREAKINGVIEWS-Mega-merger boom threatens a sha...,GOOG


In [74]:
# Second Data Extraction: Stock price data from yfinance library

import yfinance as yf
def fetch_stock_data(tickers, start_date="2020-01-01", end_date="2025-02-14"):
    etf_data = yf.download(tickers, start=start_date, end=end_date)['Adj Close']
    return etf_data

# Get the data for the ETFs
stock_data = fetch_stock_data(top_tickers)

# Calculate the daily price change (percentage change) for each stock ticker
stock_price_changes = stock_data.pct_change().dropna(how='all')

# Calculating the volatility over a rolling 3-day window
rolling_volatility_3days = stock_price_changes.rolling(window=3).std().dropna(how='all')
rolling_volatility_5days = stock_price_changes.rolling(window=5).std().dropna(how='all')
rolling_volatility_10days = stock_price_changes.rolling(window=10).std().dropna(how='all')
rolling_volatility_3days.to_csv("stock_data_vol_3d.csv")
rolling_volatility_5days.to_csv("stock_data_vol_5d.csv")
rolling_volatility_10days.to_csv("stock_data_vol_10d.csv")

[*********************100%***********************]  124 of 124 completed


In [None]:
# Running Stock Volatility Dataset
stock_data_vol_3d = pd.read_csv('stock_data_vol_3d.csv')
stock_data_vol_5d = pd.read_csv('stock_data_vol_5d.csv')
stock_data_vol_10d = pd.read_csv('stock_data_vol_10d.csv')
headlines_df = pd.read_csv('combined_headlines_refinitiv.csv').drop(columns='Unnamed: 0')

### Generation of Volatility Dataframe
To generate the labels, it is to note that if the rolling volatility of the current share is more than or equal to twice of the previous rolling volatility, the label will be 1, otherwise it will be 0.

In [None]:
# Combining Function to merge volatility and headlines dataframe

import pandas as pd
def combining_fn(stock_data, headlines_df, window):
    stock_data['Date'] = pd.to_datetime(stock_data['Date'])

    lst = []
    prev_return_vol = None  # Initialize previous return value

    # Iterating through each row in stock_headlines
    for row in headlines_df.itertuples():
        ticker = row.Ticker
        published_date = pd.to_datetime(row.published_date)
        
        # Ensure the stock data date is localized correctly
        stock_data['Date'] = stock_data['Date'].dt.tz_localize(None)
        
        # Get the first volatility value after the published date
        return_vol = stock_data[stock_data['Date'] > published_date][ticker].iloc[0]

        if prev_return_vol is not None:  # Checking if we have a previous return value
            
            # Calculate the absolute difference and determine if current rolling volatility is >= 2 times the previous volatility
            if abs(return_vol - prev_return_vol) / abs(prev_return_vol) >= 1:
                lst.append(1)
            else:
                lst.append(0)
        else:
            # If no previous return (first iteration), append 0 (or handle as needed)
            lst.append(0)

        # Update the previous return value for the next iteration
        prev_return_vol = return_vol
        
    headlines_df['vol_label'] = lst # Balanced Dataframe
    stock_headlines = headlines_df[['headline', 'vol_label']]
    
    # Removal of any PRESS DIGEST article head since that is just the issue number (does not say anything about it; excluded from analysis)
    stock_headlines = stock_headlines[~stock_headlines['headline'].str.contains('PRESS DIGEST', na=False)] 
    stock_headlines.to_csv(f"compiled_df_{window}.csv")
    return stock_headlines

In [None]:
# Forming of dataframes for Short-Term, Medium-Term, and Long-Term Volatilities
stock_headlines_3d = combining_fn(stock_data_vol_3d, headlines_df, window = '3d')
stock_headlines_5d = combining_fn(stock_data_vol_5d, headlines_df, window = '5d')
stock_headlines_10d = combining_fn(stock_data_vol_10d, headlines_df, window = '10d')

In [None]:
# Loading the 3-Day, 5-Day and 10-Day Volatility forecast dataframes
data_3d = pd.read_csv('compiled_df_3d.csv', encoding='utf-8', encoding_errors='ignore').drop(columns='Unnamed: 0').rename(columns = {'headline': 'text','vol_label':'label'})
data_5d = pd.read_csv('compiled_df_5d.csv', encoding='utf-8', encoding_errors='ignore').drop(columns='Unnamed: 0').rename(columns = {'headline': 'text','vol_label':'label'})
data_10d = pd.read_csv('compiled_df_10d.csv', encoding='utf-8', encoding_errors='ignore').drop(columns='Unnamed: 0').rename(columns = {'headline': 'text','vol_label':'label'})
data_10d

Unnamed: 0,text,label
0,RPT-FOCUS-Goldman Sachs faces rocky exit from ...,0
1,Newscasts - Wall Street ends down as megacaps ...,0
2,Newscasts - U.S. Day Ahead: Treasury yields r...,0
3,Newscasts - U.S. Morning Call: Elon Musk curse...,0
4,Newscasts - U.S. stocks little changed ahead o...,0
...,...,...
1496,FACTBOX-List of UK competition regulator cases...,0
1497,UPDATE 3-UK boots out antitrust boss for faili...,0
1498,RPT-BREAKINGVIEWS-Mega-merger boom threatens a...,0
1499,BREAKINGVIEWS-Mega-merger boom threatens a sha...,0


### Part 1b: Handling of Imbalanced Dataframe

Due to imbalanced datasets, using accuracy as a performance metrics is not suitable especially for long-term volatility which saw few headlines that were associated with high volatilities. As such, the F1 score was used as the performance metrics instead. To make the data more balanced, two methods (Back-Translation and Contextual Word Replacement) were done sequentially.

#### Method 1: Back-Translation
This method refers to the process of taking each headline and translating it to a random language. The translated headline will then be translated back to English, and this generates new texts which are typically distinct from the original. Both the original and newly-generated texts are then combined to form a more balanced dataframe.

#### Method 2: Contextual Word Replacement
In this method, the BERT model was used to replace words in each headline in the newly combined dataset. The BERT model was utilised over other potential models such as RoBERTa and DistilBERT since it balances computational efficiency with accuracy in word replacement.

In [75]:
# Loading of relevant libraries and packages for Back-Translation and Word Replacement
import random
import pandas as pd
from deep_translator import GoogleTranslator
from transformers import BertTokenizer, BertForMaskedLM
import torch
import time

In [None]:
# Different languages that can be used
lang_lst = ["zh-CN", "fr", "ja", "de", "hi", "ta", "es", "id", "ko", 
    "ar", "pt", "ru", "it", "tr", "pl", "th", "ms", "bn", 
    "vi", "sv"]

def back_translate(text, languages, max_retries=3):
    for attempt in range(max_retries):
        lang = random.choice(languages)  # Pick a random language
        try:
            translated = GoogleTranslator(source="en", target=lang).translate(text)
            back_translated = GoogleTranslator(source=lang, target="en").translate(translated)
            if back_translated and back_translated.strip():
                return back_translated  # Return valid translation
        except Exception as e: # Handling errors
            print(f"Translation failed with {lang}, retrying ({attempt + 1}/{max_retries})... Error: {e}") 
            time.sleep(2)  # Wait before retrying
    return text  # Return original text if all retries fail

def generate_word_replacement(sentence, tokenizer, model): # Generates the word replacement based on tokenizer
    tokens = tokenizer.tokenize(sentence)
    if not tokens:
        return sentence  # Return original sentence if no tokens
    idx = random.randint(0, len(tokens) - 1)
    word_to_mask = tokens[idx]
    masked_sentence = sentence.replace(word_to_mask, '[MASK]', 1)
    inputs = tokenizer(masked_sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = outputs.logits
    predicted_token_id = torch.argmax(predictions[0, idx]).item()
    predicted_token = tokenizer.decode([predicted_token_id])
    return masked_sentence.replace('[MASK]', predicted_token, 1)

# Combination of back-translation and word replacement
def augment_data_with_back_translation_and_word_replacement(data, tokenizer, model, languages=lang_lst, num_backtranslations=3, random_seed=43): 
    rare_class = data[data['label'] == 1]
    augmented_texts = []
    
    for text in rare_class["text"]:
        sampled_langs = random.sample(languages, num_backtranslations) # Randomly sample languages from language list; will vary every iteration
        for lang in sampled_langs:
            augmented_texts.append((back_translate(text, languages), 1))
    
    augmented_df = pd.DataFrame(augmented_texts, columns=["text", "label"])
    
    augmented_sentences = [(generate_word_replacement(text, tokenizer, model), 1) for text in pd.concat([rare_class["text"], augmented_df["text"]])]
    
    augmented_replacement_df = pd.DataFrame(augmented_sentences, columns=["text", "label"])
    final_balanced_df = pd.concat([rare_class, augmented_df, augmented_replacement_df], ignore_index=True)
    return final_balanced_df.sample(frac=1, random_state=random_seed).reset_index(drop=True)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Generation of augmented data for Short-term, Medium-term and Long-term volatility dataframes
final_augmented_df_3d = augment_data_with_back_translation_and_word_replacement(data_3d, tokenizer, model, num_backtranslations = 3)
final_augmented_df_5d = augment_data_with_back_translation_and_word_replacement(data_5d, tokenizer, model, num_backtranslations = 6)
final_augmented_df_10d = augment_data_with_back_translation_and_word_replacement(data_10d, tokenizer, model, num_backtranslations = 9)
final_augmented_df_3d.to_csv('augmented_3d.csv')
final_augmented_df_5d.to_csv('augmented_5d.csv')
final_augmented_df_10d.to_csv('augmented_10d.csv')

In [None]:
# Construction of overall 'balanced' dataset
final_augmented_df_3d = pd.read_csv('augmented_3d.csv').drop(columns='Unnamed: 0')
final_augmented_df_5d = pd.read_csv('augmented_5d.csv').drop(columns='Unnamed: 0')
final_augmented_df_10d = pd.read_csv('augmented_10d.csv').drop(columns='Unnamed: 0')

# Extraction of 0-labels (majority class)
data_3d_label0 = data_3d[data_3d['label']==0]
data_5d_label0 = data_5d[data_5d['label']==0]
data_10d_label0 = data_10d[data_10d['label']==0]

# Merging of 0 and 1 labels to form combined dataset; shuffling of data is also done to mix the rows
combined_3d = pd.concat([final_augmented_df_3d,data_3d_label0], axis=0, ignore_index=True).sample(frac=1, random_state=43).reset_index(drop=True)
combined_5d = pd.concat([final_augmented_df_5d,data_5d_label0], axis=0, ignore_index=True).sample(frac=1, random_state=43).reset_index(drop=True)
combined_10d = pd.concat([final_augmented_df_10d,data_10d_label0], axis=0, ignore_index=True).sample(frac=1, random_state=43).reset_index(drop=True)

combined_3d.to_csv('compiled_df_3d_balanced.csv')
combined_5d.to_csv('compiled_df_5d_balanced.csv')
combined_10d.to_csv('compiled_df_10d_balanced.csv')

## Part 2: Fine-Tuning the Model

In this assignment, I chose to use the FinBERT model. This is because through its strong understanding of financial texts and its ability to easily discern the contextual meaning of the headline, the FinBERT model can discern how mergers & acquisitions-related news influence market stability. Apart from that, its bidirectional attention will also allow it to work well with short, yet information-rich texts, where every word and its context matter.

In [33]:
import torch
from torch import nn
from transformers import BertTokenizer, BertModel
from torch.optim import AdamW  
from torch.utils.data import Dataset, DataLoader
import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from transformers import get_scheduler

In [34]:
torch.manual_seed(43)

# Set device (GPU if available, else CPU)
device = (
    "mps" 
    if torch.backends.mps.is_available() 
    else "cuda"
    if torch.cuda.is_available() 
    else "cpu"
)
device = torch.device(device)
print(f"Using device: {device}")

Using device: cpu


In [7]:
combined_3d = pd.read_csv('compiled_df_3d_balanced.csv').drop(columns='Unnamed: 0')
combined_5d = pd.read_csv('compiled_df_5d_balanced.csv').drop(columns='Unnamed: 0')
combined_10d = pd.read_csv('compiled_df_10d_balanced.csv').drop(columns='Unnamed: 0')

In [36]:
# Train-Validation split: 80% train, 20% validation
train_texts_3d, val_texts_3d, train_labels_3d, val_labels_3d = train_test_split(
    combined_3d['text'].reset_index(drop=True),
    combined_3d['label'].reset_index(drop=True),
    test_size=0.2,
    random_state=43
)

train_texts_5d, val_texts_5d, train_labels_5d, val_labels_5d = train_test_split(
    combined_5d['text'].reset_index(drop=True),
    combined_5d['label'].reset_index(drop=True),
    test_size=0.2,
    random_state=43
)

train_texts_10d, val_texts_10d, train_labels_10d, val_labels_10d = train_test_split(
    combined_10d['text'].reset_index(drop=True),
    combined_10d['label'].reset_index(drop=True),
    test_size=0.2,
    random_state=43
)

In [None]:
# Initialize tokenizer (FinBert)
tokenizer = BertTokenizer.from_pretrained("ProsusAI/finbert")

# Creation of Tokenizer class
class Tokenizer(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        # Convert to list to ensure sequential indexing
        self.texts = texts.tolist() if hasattr(texts, 'tolist') else list(texts)
        self.labels = labels.tolist() if hasattr(labels, 'tolist') else list(labels)
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }
        
# Create datasets and data loaders for the different types of volatilities: short-term, medium-term, and long-term
train_dataset_3d = Tokenizer(train_texts_3d, train_labels_3d, tokenizer)
val_dataset_3d = Tokenizer(val_texts_3d, val_labels_3d, tokenizer)
train_loader_3d = DataLoader(train_dataset_3d, batch_size=32, shuffle=True)
val_loader_3d = DataLoader(val_dataset_3d, batch_size=32)

train_dataset_5d = Tokenizer(train_texts_5d, train_labels_5d, tokenizer)
val_dataset_5d = Tokenizer(val_texts_5d, val_labels_5d, tokenizer)
train_loader_5d = DataLoader(train_dataset_5d, batch_size=32, shuffle=True)
val_loader_5d = DataLoader(val_dataset_5d, batch_size=32)

train_dataset_10d = Tokenizer(train_texts_10d, train_labels_10d, tokenizer)
val_dataset_10d = Tokenizer(val_texts_10d, val_labels_10d, tokenizer)
train_loader_10d = DataLoader(train_dataset_10d, batch_size=32, shuffle=True)
val_loader_10d = DataLoader(val_dataset_10d, batch_size=32)

In [38]:
# Create the FinBERT-based model class
class VolClassifier(nn.Module):
    def __init__(self, n_classes=2):
        super(VolClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('ProsusAI/finbert')
        self.dropout = nn.Dropout(p=0.3)
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        output = self.dropout(pooled_output)
        return self.classifier(output)
    
# Initialize model
model = VolClassifier()
model = model.to(device)

In [None]:
# Setting device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Compute class weights (based on inverse frequency weighting) - Higher weight is assigned to the minority class
# 3d - Short-Run (SR) volatility, 5d - Medium-Run (MR) volatility, 10d - Long-Run (LR) volatility

class_counts_3d = torch.tensor([1335, 1328], dtype=torch.float)
class_weights_3d = class_counts_3d.sum() / class_counts_3d
class_weights_3d = class_weights_3d.to(device)

class_counts_5d = torch.tensor([1421, 1120], dtype=torch.float)
class_weights_5d = class_counts_5d.sum() / class_counts_5d
class_weights_5d = class_weights_5d.to(device)

class_counts_10d = torch.tensor([1454, 940], dtype=torch.float)
class_weights_10d = class_counts_10d.sum() / class_counts_10d
class_weights_10d = class_weights_10d.to(device)

### Fine-Tuning Features
During fine-tuning, the usage of a learning rate scheduler as well as gradual unfreezing of layers was adopted for the data. Incorporation of class weights using inverse frequency weights was also introduced to solve the issue of imbalanced dataframes.

#### Method 1: Incorporation of Class Weights for Handling Slight Imbalances in Data
With the data being slightly imbalanced even after the back-translation and word replacement steps, class weights are assigned using inverse frequency weights. This means greater weights are being assigned to the minority class, and smaller weights to the majority class.

#### Method 2: Learning Rate Scheduler
Given that FinBert requires smaller learning rates for fine-tuning (to avoid catastrophic forgetting of pre-trained knowledge), learning rate schedulers can help to extract meaningful signals from the space data within the financial texts. A linear decay scheduler with warmup was done in order to stabilise learning for the model. This helps the FinBERT model to better adapt to the headlines without abrupt weight updates. It also reduces overfitting risks by stabilizing learning in later epochs.

#### Method 3: Gradual Unfreezing
Gradual unfreezing is a layer-wise fine-tuning strategy where the model starts with only the final layers trainable, and progressively unfreezes earlier layers as training progresses. By training the model layer-by-layer, this allows the retention of generic linguistic knowledge while also adapting only relevant layers to the specific domain. Avoidance of catastrophic forgetting of pre-trained language patterns can be achieved, and this potentially improves overall stability of the model.

#### Complementarity of Methods
All 3 methods work well together to prevent overfitting and ensure proper training of the model. Gradual unfreezing helps to avoid losing the valuable representations learned during pre-training by gradually allowing the model to adjust the more task-specific layers without distorting the general-purpose language features in the early layers. This is complemented by the use of the Learning Rate Scheduler, where the rate of updates of the weights is faster. By using the scheduler, early exploration (rapid adjustments) will hence not disrupt the pre-trained knowledge, and this later ensures that the fine-tuning (small adjustments) process allows optimal model performance without overfitting. The incorporation of class weights also help to settle any potential imbalance within the dataset.


In [None]:
from sklearn.metrics import f1_score

def fine_tuner(training_loader, validation_loader, class_weights, term):
    
    # Best parameters
    optimizer = AdamW([
        {'params': model.bert.parameters(), 'lr': 1e-5},
        {'params': model.classifier.parameters(), 'lr': 1e-4}
    ])
    
    loss_fn = nn.CrossEntropyLoss(weight=class_weights) # Application of class_weights

    num_epochs = 3
    num_training_steps = len(training_loader) * num_epochs
    num_warmup_steps = int(0.1 * num_training_steps)

    # Learning rate scheduler
    lr_scheduler = get_scheduler(
        "linear", optimizer=optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps
    )

    def evaluate(model, val_loader):
        model.eval()
        total_loss, all_preds, all_labels = 0, [], []

        print("\nChecking validation dataset samples:")
        for i, batch in enumerate(val_loader):
            if i == 1:  # Print only the first batch
                print("Sample input_ids:", batch['input_ids'][0][:10])  # First 10 tokens of first sample
                print("Sample label:", batch['labels'][0].item())
                break

        with torch.no_grad(): # Disable gradient calculation (saves memory and speeds up inference)
            for batch in val_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)

                outputs = model(input_ids=input_ids, attention_mask=attention_mask)

                loss = loss_fn(outputs, labels) # Computing loss
                total_loss += loss.item() # Accumulation of loss

                preds = torch.argmax(outputs, dim=1) # Get predicted class
                all_preds.extend(preds.cpu().numpy())
                all_labels.extend(labels.cpu().numpy())

        val_f1_score = f1_score(all_labels, all_preds, average='binary') # Computation of F1 score
        val_loss = total_loss / len(val_loader)

        return val_loss, val_f1_score

    def train_phase(model, phase_name, num_epochs):
        best_f1 = 0
        print(f"\n======== Starting {phase_name} ========")

        # Print trainable layers
        print("Trainable layers:")
        for name, param in model.named_parameters():
            if param.requires_grad:
                print(name)

        # Check model weights before training
        print("\nModel classifier weights before training:", model.classifier.weight[:2])  # First two rows

        for epoch in range(num_epochs):
            model.train()
            total_loss = 0
            all_preds, all_labels = [], []

            for batch in tqdm(training_loader, desc=f"Training {phase_name} Epoch {epoch+1}/{num_epochs}"):
                optimizer.zero_grad()

                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)

                outputs = model(input_ids=input_ids, attention_mask=attention_mask)

                loss = loss_fn(outputs, labels)
                total_loss += loss.item()

                preds = torch.argmax(outputs, dim=1)
                all_preds.extend(preds.cpu().numpy())
                all_labels.extend(labels.cpu().numpy())

                loss.backward() # Backward pass
                optimizer.step() # Update model parameters based on computed gradients
                lr_scheduler.step() # Adjust the learning rate according to the scheduler

            train_loss = total_loss / len(training_loader)
            train_f1 = f1_score(all_labels, all_preds, average='binary')
            val_loss, val_f1 = evaluate(model, validation_loader) # Evaluation of model on validation dataset

            print(f"{phase_name} - Epoch {epoch+1}/{num_epochs}: "
                f"Train Loss: {train_loss:.4f}, Train F1: {train_f1:.4f}, "
                f"Val Loss: {val_loss:.4f}, Val F1: {val_f1:.4f}")
            
        if val_f1 > best_f1:
            best_f1 = val_f1
            torch.save(model.state_dict(), f'best_model_{phase_name}_{term}.pt')
            
        # Check model weights after training
        print("\nModel classifier weights after training:", model.classifier.weight[:2])  # First two rows

    # Phase 1: Freeze all layers except the classifier head
    for param in model.bert.parameters():
        param.requires_grad = False
    train_phase(model, "Phase1", num_epochs)

    # Phase 2: Unfreeze the last BERT layer
    for param in model.bert.encoder.layer[-1].parameters():
        param.requires_grad = True
    train_phase(model, "Phase2", num_epochs)

    # Phase 3: Unfreeze the last 2 BERT layers
    for i in range(-2, 0):
        for param in model.bert.encoder.layer[i].parameters():
            param.requires_grad = True
    train_phase(model, "Phase3", num_epochs)

    print("Training completed.")


In [None]:
# Short-Term Volatility Fine-Tuning
fine_tuner(training_loader=train_loader_3d, validation_loader=val_loader_3d, class_weights=class_weights_3d, term='3d')


Trainable layers:
classifier.weight
classifier.bias

Model classifier weights before training: tensor([[-0.0029, -0.0144,  0.0242,  ...,  0.0139, -0.0106,  0.0178],
        [-0.0042,  0.0065,  0.0119,  ..., -0.0198, -0.0064,  0.0336]],
       grad_fn=<SliceBackward0>)


Training Phase1 Epoch 1/3: 100%|██████████| 67/67 [05:34<00:00,  4.99s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 24482,  1011, 17235,  2850,  4160, 14572,  3020,  3805,  1997])
Sample label: 0
Phase1 - Epoch 1/3: Train Loss: 0.6570, Train F1: 0.6225, Val Loss: 0.6330, Val F1: 0.5774


Training Phase1 Epoch 2/3: 100%|██████████| 67/67 [05:13<00:00,  4.68s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 24482,  1011, 17235,  2850,  4160, 14572,  3020,  3805,  1997])
Sample label: 0
Phase1 - Epoch 2/3: Train Loss: 0.6506, Train F1: 0.6120, Val Loss: 0.6367, Val F1: 0.6112


Training Phase1 Epoch 3/3: 100%|██████████| 67/67 [05:56<00:00,  5.32s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 24482,  1011, 17235,  2850,  4160, 14572,  3020,  3805,  1997])
Sample label: 0
Phase1 - Epoch 3/3: Train Loss: 0.6528, Train F1: 0.6278, Val Loss: 0.6259, Val F1: 0.5995

Model classifier weights after training: tensor([[-0.0033, -0.0134,  0.0206,  ...,  0.0126, -0.0063,  0.0190],
        [-0.0039,  0.0055,  0.0155,  ..., -0.0185, -0.0106,  0.0324]],
       grad_fn=<SliceBackward0>)

Trainable layers:
bert.encoder.layer.11.attention.self.query.weight
bert.encoder.layer.11.attention.self.query.bias
bert.encoder.layer.11.attention.self.key.weight
bert.encoder.layer.11.attention.self.key.bias
bert.encoder.layer.11.attention.self.value.weight
bert.encoder.layer.11.attention.self.value.bias
bert.encoder.layer.11.attention.output.dense.weight
bert.encoder.layer.11.attention.output.dense.bias
bert.encoder.layer.11.attention.output.LayerNorm.weight
bert.encoder.layer.11.attention.output.LayerNorm.bias
bert.encoder.layer.11

Training Phase2 Epoch 1/3: 100%|██████████| 67/67 [06:45<00:00,  6.05s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 24482,  1011, 17235,  2850,  4160, 14572,  3020,  3805,  1997])
Sample label: 0
Phase2 - Epoch 1/3: Train Loss: 0.6444, Train F1: 0.6317, Val Loss: 0.6259, Val F1: 0.5995


Training Phase2 Epoch 2/3: 100%|██████████| 67/67 [05:31<00:00,  4.94s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 24482,  1011, 17235,  2850,  4160, 14572,  3020,  3805,  1997])
Sample label: 0
Phase2 - Epoch 2/3: Train Loss: 0.6485, Train F1: 0.6157, Val Loss: 0.6259, Val F1: 0.5995


Training Phase2 Epoch 3/3: 100%|██████████| 67/67 [06:45<00:00,  6.05s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 24482,  1011, 17235,  2850,  4160, 14572,  3020,  3805,  1997])
Sample label: 0
Phase2 - Epoch 3/3: Train Loss: 0.6535, Train F1: 0.6259, Val Loss: 0.6259, Val F1: 0.5995

Model classifier weights after training: tensor([[-0.0033, -0.0134,  0.0206,  ...,  0.0126, -0.0063,  0.0190],
        [-0.0039,  0.0055,  0.0155,  ..., -0.0185, -0.0106,  0.0324]],
       grad_fn=<SliceBackward0>)

Trainable layers:
bert.encoder.layer.10.attention.self.query.weight
bert.encoder.layer.10.attention.self.query.bias
bert.encoder.layer.10.attention.self.key.weight
bert.encoder.layer.10.attention.self.key.bias
bert.encoder.layer.10.attention.self.value.weight
bert.encoder.layer.10.attention.self.value.bias
bert.encoder.layer.10.attention.output.dense.weight
bert.encoder.layer.10.attention.output.dense.bias
bert.encoder.layer.10.attention.output.LayerNorm.weight
bert.encoder.layer.10.attention.output.LayerNorm.bias
bert.encoder.layer.10

Training Phase3 Epoch 1/3: 100%|██████████| 67/67 [05:31<00:00,  4.94s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 24482,  1011, 17235,  2850,  4160, 14572,  3020,  3805,  1997])
Sample label: 0
Phase3 - Epoch 1/3: Train Loss: 0.6480, Train F1: 0.6259, Val Loss: 0.6259, Val F1: 0.5995


Training Phase3 Epoch 2/3: 100%|██████████| 67/67 [04:33<00:00,  4.08s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 24482,  1011, 17235,  2850,  4160, 14572,  3020,  3805,  1997])
Sample label: 0
Phase3 - Epoch 2/3: Train Loss: 0.6495, Train F1: 0.6169, Val Loss: 0.6259, Val F1: 0.5995


Training Phase3 Epoch 3/3: 100%|██████████| 67/67 [04:31<00:00,  4.05s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 24482,  1011, 17235,  2850,  4160, 14572,  3020,  3805,  1997])
Sample label: 0
Phase3 - Epoch 3/3: Train Loss: 0.6510, Train F1: 0.6206, Val Loss: 0.6259, Val F1: 0.5995

Model classifier weights after training: tensor([[-0.0033, -0.0134,  0.0206,  ...,  0.0126, -0.0063,  0.0190],
        [-0.0039,  0.0055,  0.0155,  ..., -0.0185, -0.0106,  0.0324]],
       grad_fn=<SliceBackward0>)
Training completed.


In [None]:
# Medium-Term Volatility Fine-Tuning
fine_tuner(training_loader=train_loader_5d, validation_loader=val_loader_5d, class_weights=class_weights_5d, term='5d')


Trainable layers:
classifier.weight
classifier.bias

Model classifier weights before training: tensor([[-0.0033, -0.0134,  0.0206,  ...,  0.0126, -0.0063,  0.0190],
        [-0.0039,  0.0055,  0.0155,  ..., -0.0185, -0.0106,  0.0324]],
       grad_fn=<SliceBackward0>)


Training Phase1 Epoch 1/3: 100%|██████████| 64/64 [03:32<00:00,  3.32s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 12610,  1011, 13420, 12154, 12154, 24209,  2389,  9006,  2213])
Sample label: 1
Phase1 - Epoch 1/3: Train Loss: 0.6798, Train F1: 0.5213, Val Loss: 0.6483, Val F1: 0.6259


Training Phase1 Epoch 2/3: 100%|██████████| 64/64 [03:32<00:00,  3.31s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 12610,  1011, 13420, 12154, 12154, 24209,  2389,  9006,  2213])
Sample label: 1
Phase1 - Epoch 2/3: Train Loss: 0.6659, Train F1: 0.5719, Val Loss: 0.6486, Val F1: 0.5572


Training Phase1 Epoch 3/3: 100%|██████████| 64/64 [03:33<00:00,  3.33s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 12610,  1011, 13420, 12154, 12154, 24209,  2389,  9006,  2213])
Sample label: 1
Phase1 - Epoch 3/3: Train Loss: 0.6606, Train F1: 0.5587, Val Loss: 0.6442, Val F1: 0.6200

Model classifier weights after training: tensor([[-0.0061, -0.0111,  0.0236,  ...,  0.0128, -0.0070,  0.0192],
        [-0.0011,  0.0032,  0.0125,  ..., -0.0187, -0.0100,  0.0322]],
       grad_fn=<SliceBackward0>)

Trainable layers:
bert.encoder.layer.11.attention.self.query.weight
bert.encoder.layer.11.attention.self.query.bias
bert.encoder.layer.11.attention.self.key.weight
bert.encoder.layer.11.attention.self.key.bias
bert.encoder.layer.11.attention.self.value.weight
bert.encoder.layer.11.attention.self.value.bias
bert.encoder.layer.11.attention.output.dense.weight
bert.encoder.layer.11.attention.output.dense.bias
bert.encoder.layer.11.attention.output.LayerNorm.weight
bert.encoder.layer.11.attention.output.LayerNorm.bias
bert.encoder.layer.11

Training Phase2 Epoch 1/3: 100%|██████████| 64/64 [03:56<00:00,  3.70s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 12610,  1011, 13420, 12154, 12154, 24209,  2389,  9006,  2213])
Sample label: 1
Phase2 - Epoch 1/3: Train Loss: 0.6679, Train F1: 0.5541, Val Loss: 0.6442, Val F1: 0.6200


Training Phase2 Epoch 2/3: 100%|██████████| 64/64 [03:56<00:00,  3.70s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 12610,  1011, 13420, 12154, 12154, 24209,  2389,  9006,  2213])
Sample label: 1
Phase2 - Epoch 2/3: Train Loss: 0.6570, Train F1: 0.5804, Val Loss: 0.6442, Val F1: 0.6200


Training Phase2 Epoch 3/3: 100%|██████████| 64/64 [04:03<00:00,  3.81s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 12610,  1011, 13420, 12154, 12154, 24209,  2389,  9006,  2213])
Sample label: 1
Phase2 - Epoch 3/3: Train Loss: 0.6653, Train F1: 0.5585, Val Loss: 0.6442, Val F1: 0.6200

Model classifier weights after training: tensor([[-0.0061, -0.0111,  0.0236,  ...,  0.0128, -0.0070,  0.0192],
        [-0.0011,  0.0032,  0.0125,  ..., -0.0187, -0.0100,  0.0322]],
       grad_fn=<SliceBackward0>)

Trainable layers:
bert.encoder.layer.10.attention.self.query.weight
bert.encoder.layer.10.attention.self.query.bias
bert.encoder.layer.10.attention.self.key.weight
bert.encoder.layer.10.attention.self.key.bias
bert.encoder.layer.10.attention.self.value.weight
bert.encoder.layer.10.attention.self.value.bias
bert.encoder.layer.10.attention.output.dense.weight
bert.encoder.layer.10.attention.output.dense.bias
bert.encoder.layer.10.attention.output.LayerNorm.weight
bert.encoder.layer.10.attention.output.LayerNorm.bias
bert.encoder.layer.10

Training Phase3 Epoch 1/3: 100%|██████████| 64/64 [04:24<00:00,  4.14s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 12610,  1011, 13420, 12154, 12154, 24209,  2389,  9006,  2213])
Sample label: 1
Phase3 - Epoch 1/3: Train Loss: 0.6610, Train F1: 0.5651, Val Loss: 0.6442, Val F1: 0.6200


Training Phase3 Epoch 2/3: 100%|██████████| 64/64 [04:43<00:00,  4.43s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 12610,  1011, 13420, 12154, 12154, 24209,  2389,  9006,  2213])
Sample label: 1
Phase3 - Epoch 2/3: Train Loss: 0.6598, Train F1: 0.5586, Val Loss: 0.6442, Val F1: 0.6200


Training Phase3 Epoch 3/3: 100%|██████████| 64/64 [04:46<00:00,  4.48s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 12610,  1011, 13420, 12154, 12154, 24209,  2389,  9006,  2213])
Sample label: 1
Phase3 - Epoch 3/3: Train Loss: 0.6669, Train F1: 0.5558, Val Loss: 0.6442, Val F1: 0.6200

Model classifier weights after training: tensor([[-0.0061, -0.0111,  0.0236,  ...,  0.0128, -0.0070,  0.0192],
        [-0.0011,  0.0032,  0.0125,  ..., -0.0187, -0.0100,  0.0322]],
       grad_fn=<SliceBackward0>)
Training completed.


In [None]:
# Long-Term Volatility Fine-Tuning
fine_tuner(training_loader=train_loader_10d, validation_loader=val_loader_10d, class_weights=class_weights_10d, term='10d')


Trainable layers:
classifier.weight
classifier.bias

Model classifier weights before training: tensor([[-0.0061, -0.0111,  0.0236,  ...,  0.0128, -0.0070,  0.0192],
        [-0.0011,  0.0032,  0.0125,  ..., -0.0187, -0.0100,  0.0322]],
       grad_fn=<SliceBackward0>)


Training Phase1 Epoch 1/3: 100%|██████████| 60/60 [04:16<00:00,  4.28s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 10651,  1016,  1011,  2572,  2094,  2000,  9878,  8241, 12508])
Sample label: 0
Phase1 - Epoch 1/3: Train Loss: 0.6808, Train F1: 0.5091, Val Loss: 0.6541, Val F1: 0.5312


Training Phase1 Epoch 2/3: 100%|██████████| 60/60 [05:43<00:00,  5.73s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 10651,  1016,  1011,  2572,  2094,  2000,  9878,  8241, 12508])
Sample label: 0
Phase1 - Epoch 2/3: Train Loss: 0.6733, Train F1: 0.5283, Val Loss: 0.6516, Val F1: 0.5193


Training Phase1 Epoch 3/3: 100%|██████████| 60/60 [05:14<00:00,  5.24s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 10651,  1016,  1011,  2572,  2094,  2000,  9878,  8241, 12508])
Sample label: 0
Phase1 - Epoch 3/3: Train Loss: 0.6638, Train F1: 0.5361, Val Loss: 0.6484, Val F1: 0.5181

Model classifier weights after training: tensor([[-0.0063, -0.0109,  0.0246,  ...,  0.0154, -0.0060,  0.0185],
        [-0.0008,  0.0030,  0.0115,  ..., -0.0214, -0.0109,  0.0329]],
       grad_fn=<SliceBackward0>)

Trainable layers:
bert.encoder.layer.11.attention.self.query.weight
bert.encoder.layer.11.attention.self.query.bias
bert.encoder.layer.11.attention.self.key.weight
bert.encoder.layer.11.attention.self.key.bias
bert.encoder.layer.11.attention.self.value.weight
bert.encoder.layer.11.attention.self.value.bias
bert.encoder.layer.11.attention.output.dense.weight
bert.encoder.layer.11.attention.output.dense.bias
bert.encoder.layer.11.attention.output.LayerNorm.weight
bert.encoder.layer.11.attention.output.LayerNorm.bias
bert.encoder.layer.11

Training Phase2 Epoch 1/3: 100%|██████████| 60/60 [03:52<00:00,  3.88s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 10651,  1016,  1011,  2572,  2094,  2000,  9878,  8241, 12508])
Sample label: 0
Phase2 - Epoch 1/3: Train Loss: 0.6562, Train F1: 0.5578, Val Loss: 0.6484, Val F1: 0.5181


Training Phase2 Epoch 2/3: 100%|██████████| 60/60 [05:42<00:00,  5.71s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 10651,  1016,  1011,  2572,  2094,  2000,  9878,  8241, 12508])
Sample label: 0
Phase2 - Epoch 2/3: Train Loss: 0.6618, Train F1: 0.5278, Val Loss: 0.6484, Val F1: 0.5181


Training Phase2 Epoch 3/3: 100%|██████████| 60/60 [05:36<00:00,  5.62s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 10651,  1016,  1011,  2572,  2094,  2000,  9878,  8241, 12508])
Sample label: 0
Phase2 - Epoch 3/3: Train Loss: 0.6614, Train F1: 0.5357, Val Loss: 0.6484, Val F1: 0.5181

Model classifier weights after training: tensor([[-0.0063, -0.0109,  0.0246,  ...,  0.0154, -0.0060,  0.0185],
        [-0.0008,  0.0030,  0.0115,  ..., -0.0214, -0.0109,  0.0329]],
       grad_fn=<SliceBackward0>)

Trainable layers:
bert.encoder.layer.10.attention.self.query.weight
bert.encoder.layer.10.attention.self.query.bias
bert.encoder.layer.10.attention.self.key.weight
bert.encoder.layer.10.attention.self.key.bias
bert.encoder.layer.10.attention.self.value.weight
bert.encoder.layer.10.attention.self.value.bias
bert.encoder.layer.10.attention.output.dense.weight
bert.encoder.layer.10.attention.output.dense.bias
bert.encoder.layer.10.attention.output.LayerNorm.weight
bert.encoder.layer.10.attention.output.LayerNorm.bias
bert.encoder.layer.10

Training Phase3 Epoch 1/3: 100%|██████████| 60/60 [05:18<00:00,  5.31s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 10651,  1016,  1011,  2572,  2094,  2000,  9878,  8241, 12508])
Sample label: 0
Phase3 - Epoch 1/3: Train Loss: 0.6541, Train F1: 0.5501, Val Loss: 0.6484, Val F1: 0.5181


Training Phase3 Epoch 2/3: 100%|██████████| 60/60 [06:07<00:00,  6.13s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 10651,  1016,  1011,  2572,  2094,  2000,  9878,  8241, 12508])
Sample label: 0
Phase3 - Epoch 2/3: Train Loss: 0.6534, Train F1: 0.5456, Val Loss: 0.6484, Val F1: 0.5181


Training Phase3 Epoch 3/3: 100%|██████████| 60/60 [04:25<00:00,  4.42s/it]



Checking validation dataset samples:
Sample input_ids: tensor([  101, 10651,  1016,  1011,  2572,  2094,  2000,  9878,  8241, 12508])
Sample label: 0
Phase3 - Epoch 3/3: Train Loss: 0.6641, Train F1: 0.5260, Val Loss: 0.6484, Val F1: 0.5181

Model classifier weights after training: tensor([[-0.0063, -0.0109,  0.0246,  ...,  0.0154, -0.0060,  0.0185],
        [-0.0008,  0.0030,  0.0115,  ..., -0.0214, -0.0109,  0.0329]],
       grad_fn=<SliceBackward0>)
Training completed.


## Part 3: Hyperparameter Tuning of the Results
With multiple parameters available to vary to optimise the model, I tested multiple combinations of values of learning rates for the different layers within the FinBERT model. The following describes the parameters explored for the optimizer:
- BERT Layer Learning Rate = [1e-5, 2e-5]
- Classifier Learning Rate = [1e-5, 3e-5, 5e-5, 1e-3, 2e-3]
- Weight Decay = [1e-5]

It is to note that this document only contains the results to run the best model, and not the results of the different parameter combinations.

The following are some insights that have been obtained while varying the parameters:

1) Increasing classifier parameter resulted in worse performance of the model, with F1 scores decreasing in for all three models. Hence, the lower the Classifier Learning Rate, the better. However, beyond a certain point, decreasing the learning rate will result in slightly worse performance for the models as shown by Combinations 4 and 5.
2) Decreasing the learning rate for BERT layers resulted in generally better performance of the model for all three terms of volatility.

It is also to note that addition of weight decay = 1e-5 was attempted in Combination 1 for the short-term volatility. When the weight decay was included, the F1 score was lower at 0.4327 compared to the F1 score of 0.6030 when there was no weight decay. In view of this poor performance, weight decay was hence removed from consideration.

The different combinations attempted in this paper are as such:

Combination 1: BERT Learning Rate = 2e-5, Classification Learning Rate = 1e-3, Weight Decay = 0
- Volatility (SR) F1 = 0.6030
- Volatility (MR) F1 = 0.4110
- Volatility (LR) F1 = 0.5176

Combination 2: BERT Learning Rate = 2e-5, Classification Learning Rate = 2e-3, Weight Decay = 0
- Volatility (SR) F1 = 0.4493
- Volatility (MR) F1 = 0.4731
- Volatility (LR) F1 = 0.4988

Combination 3: BERT Learning Rate = 2e-5, Classification Learning Rate = 5e-4, Weight Decay = 0
- Volatility (SR) F1 = 0.5683
- Volatility (MR) F1 = 0.5521
- Volatility (LR) F1 = 0.5122

Combination 4: BERT Learning Rate = 1e-5, Classification Learning Rate = 3e-4, Weight Decay = 0
- Volatility (SR) F1 = 0.5995
- Volatility (MR) F1 = 0.6200
- Volatility (LR) F1 = 0.5181

Combination 5: BERT Learning Rate = 1e-5, Classification Learning Rate = 1e-4, Weight Decay = 0
- Volatility (SR) F1 = 0.5986
- Volatility (MR) F1 = 0.5757
- Volatility (LR) F1 = 0.5434

## Part 4: Evaluation and Interpretation of Results

The following describes some insights from the results, and some reasons for these differences:

1) Prior to balancing, upon fine-tuning the model based on the original data, the F1 score for all three were 0.1 or less. Following the balancing of the data, the F1 score for all three were significantly higher, with all 3 exceeding 0.5, indicating that the model is better than a random guess.
2) Among all three types of volatilities forecasted from the headlines, it appears that the medium-term volatility is most effective, with F1 scores reaching 0.62, followed by short-term which was approximately 0.6, and then long-term volatilities at 0.52. The poor performance of the long-term volatility fine-tuning could be attributed to the effect of the headlines being affected by other market-moving events such as interest rate changes and industrial trends, thereby resulting in false signals for volatility changes. Apart from that, mean reversion could have resulted in the stabilisation of prices over 10 days, hence reducing the distinct cause-effect relationships. The short-term volatility may not have performed as well potentially due to it being highly noisy since the headlines may simply cause a temporary spike, but this spike does not last beyond a day or two.

### Reflection on Strengths and Weaknesses of Analysis

#### Strengths
The strengths of fine-tuning this model can be summarised below:

1) Effective forecasting of the volatility of the stock based on the headlines is observed across all three terms, with the best performing model being the model forecasting the medium-term volatility (5-Day Rolling Window). This can be useful for fund managers who have lower risk-appetite clients or low-risk appetite traders who deal on a weekly basis. This fine-tuned model can guide their actions more effectively as they are no longer looking solely at the returns, but also at how much the prices will fluctuate.
2) With the inclusion of data in 2023, headlines relating to the AI Boom would have been in the dataframe. This likely provided more data about high volatility in the dataset across all stocks, thereby enabling better fine-tuning as it ensures a more balanced dataset.
3) The use of multiple methods to improve the model including the learning rate scheduler and gradual de-freezing of the FinBERT layers ensured that model was fine-tuned to the correct degree, thereby enabling a rise in F1 score following fine-tuning. The assignment of class weights to further handle potential imbalances in the data also aided in increasing the F1 score.

#### Weaknesses
However, some weaknesses observed is also stated below:

1) Since the original dataframe was highly imbalanced with a lack of headlines that resulted in spikes for volatilities, back-translation and word replacements would only rephrase the headlines originally available. This means that overfitting could have occurred due to few distinct headlines being available in the first place.
2) There was potentially lower accuracy in the word replacement step when preparing the dataset due to the use of the BERT model instead of RoBERTa which is more accurate, but is computationally more expensive.

### Potential Future Work
For further improvements, the following points could be considered to make the fine-tuning more effective:

1) Extraction of more data for volatile stocks as only data from 2022 to February 2025 was extracted. To obtain more labelled data, the extraction of headlines during the COVID-19 period (2020 to 2021) could have been done as many of the stocks were extremely volatile during that period.
2) Including more possible values of weight decay could be done to determine the optimal weight decay for a higher F1 score given that only one value was tested before weight decay was removed.
3) More combinations of learning rates for the different layers of the BERT model could be considered for further fine-tuning. 
4) Trying different pre-trained models such as RoBERTa and BERT models may be done since they may yield different types of results.

### Conclusion
To summarise, this assignment presents the capabilities of fine-tuning the FinBERT model to suit the task at hand of predicting the changes in volatility of a stock given a merger-related headline. While the actual results may not sufficiently justify its effectiveness in forecasting the volatilities of the stock, further refinements can be made to increase the F1 score further as many other parameters such as the optimal number of epochs were not considered. Future research can explore these optimizations to enhance the model’s predictive power, making it a more reliable tool for investors navigating merger-related market movements.