
# Instructions:

This notebook is forms the first part of your coursework assignment for Text Analytics in Spring 2025. You will need to read the instructions below and complete numbered tasks indicated by "TASK n". To complete the tasks, you will write code or explanations between the comments "#WRITE YOUR ANSWER HERE" and "#END OF ANSWER". For example:

TASK 0: Complete the function below to output "hello world". 


In [1]:
def demo_fun():
    # WRITE YOUR ANSWER HERE
    print("hello world")
    # END OF ANSWER

### DO NOT MODIFY
demo_fun()
###

hello world


There is also some code in the cell that should not be modified. This code saves your results to file in the correct format, which is necessary for us to be able to mark your answers. Before you submit your notebook, please make sure this code has not been modified, then restart your kernel, clear all cell outputs, run all of your code once again, then save the notebook. 

Please note:
  * The notebook you upload must include all the saved cell output after running all cells.
  * The notebook code must be complete so that it reproduces all your output when we run it. 
  * For this coursework, we recommend that you use your virtual environment that you created for the labs. The packages you need are: numpy, scipy, nltk, pytorch, transformers and datasets (from HuggingFace), pandas, matplotlib and scikit-learn. 

## Marking guidelines:
1. This notebook is worth 32% of the marks for the Text Analytics assignment.
1. The number of marks for each task is shown alongside the task.
1. We will evaluate the output of your code after running it, and marks will be awarded based on how well the output matches the task's instructions. 
1. We will give partial marks for incomplete or partially correct answers. 
1. We do not give additional marks for code style or comments, but clear code will help us to understand what you have done so that we can award partial marks where necessary. 
1. Unless the task asks you to implement something from scratch, there is no penalty for using software libraries in your implementation.

## Support:

The main source of support will be during the lab sessions. The TAs and lecturers will help you with questions about the lectures, the code provided for you in this notebook, and general questions about the topics we cover. For the assessed tasks, they can only answer clarifying questions about what you have to do. Please email Edwin if you have any other queries edwin.simpson@bristol.ac.uk and/or post your query to the Teams channel for this unit.

## Deadline:

The notebook must be submitted along with the second notebook on Blackboard before **Monday 28th April at 13.00**. 

## Submission:

For this part of the assignment, please zip up the folder containing this file and the 'outputs' directory, containing the output from this notebook as .csv files. Please name your notebook file like this:
   * Name this notebook 'text_analytics_part1_\<student number\>.ipynb'. Replace '\<student number\>' with your student number, which consists only of digits beginning with '2'. 
   * We mark anonymously, so please don't include your name in the notebook.

You can submit the file on Blackboard to the submission point "Text Analytics Part 1 Notebook". Remember that the assignment also has parts 2 and 3, described in the PDF file on Blackboard.

# Setup: random seeds

Each student will work with slightly different data splits and model weights, which will be determined by setting your 'random seed'. 
We will check that your results come from using your random seed. Please set the seed in the cell below by changing the value of 'my_student_number' to your own student number (not your username, the number you can see on eVision that contains only digits). 

Using the correct seed ensures that your results are reproducible when we rerun your notebook.

In [3]:
!pip install torch



In [72]:
import torch
import random
import numpy as np

def set_seed(seed: int = 42):
    random.seed(seed)  # Python's built-in random module
    np.random.seed(seed)  # NumPy
    torch.manual_seed(seed)  # PyTorch CPU
    torch.cuda.manual_seed(seed)  # PyTorch GPU (if available)
    torch.cuda.manual_seed_all(seed)  # Multi-GPU
    torch.backends.cudnn.deterministic = True  # Ensure deterministic behavior
    torch.backends.cudnn.benchmark = False  # Disable benchmark mode for reproducibility

### SET YOUR SEED TO YOUR STUDENT NUMBER HERE
my_student_number = 2411243
set_seed(my_student_number)

# Setup: loading the data

Let's make a folder to save the output of your work:

In [74]:
import os
import pandas as pd

In [None]:
os.mkdir('./outputs')

In [13]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-19.0.1-cp312-cp312-win_amd64.whl.metadata (3.4 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp312-cp312-win_amd64.whl.metadata (13 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py312-none-any.whl.metadata (7.2 kB)
Collecting huggingface-hub>=0.24.0 (from datasets)
  Downloading huggingface_hub-0.30.1-py3-none-any.whl.metadata (13 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
   ---------------------------------------- 0.0/491.2 kB ? eta -:--:--
   -------------------- ------------------- 256.0/491.2 kB 7.7 MB/s eta 0:00:01
   ---------------------------------------- 491.2/491.2 kB 5.1 MB/s eta 0:00:00
Downloading huggingface_hub-0.30.1-py3-none-any.whl (481 kB)
   ---------------------------------------- 0.0/481.2 kB ? eta -:--:--
   ---------------------------------- -

  You can safely remove it manually.


In [3]:
!pip install pyarrow



Now, let's load some more packages we will need later:

In [76]:
%load_ext autoreload
%autoreload 2

# Use HuggingFace's datasets library to access the Emotion dataset
from datasets import load_dataset
import numpy as np
from sklearn.model_selection import train_test_split 

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload



The dataset classifies paragraphs taken from corporate disclosures that discuss climate-related issues. It classifiers them into "risk" (0), "neutral" (1) or "opportunity" (2) representing the sentiment of the paragraph.

First we need to load the data. The data is already split into train, validation and test. The _validation_ set (also called 'development' set or 'devset') can be used to compute performance of your model when tuning hyperparameters, optimising combinations of features, or looking at the errors your model makes before improving it. This allows you to hold out the test set (i.e., not to look at it at all when developing your method) to give a fair evaluation of the model and how well it generalises to new examples. This avoids tuning the model to specific examples in the test set. An alternative approach to validation is to not use a single fixed validation set, but instead use [cross validation](https://scikit-learn.org/stable/modules/cross_validation.html). 

In [78]:
cache_dir = "./data_cache"

# load the original training set from HuggingFace
train_dataset = load_dataset(
    "climatebert/climate_sentiment",
    split="train",
    cache_dir=cache_dir,
)

# we're going to create a new validation set by splitting the data
dataset_splits = train_dataset.train_test_split(test_size=0.2)
train_dataset = dataset_splits["train"]
val_dataset = dataset_splits["test"]

train_texts = np.array(train_dataset["text"])
val_texts = np.array(val_dataset["text"])

train_labels = np.array(train_dataset["label"])
val_labels = np.array(val_dataset["label"])

print(f"Training dataset with {len(train_texts)} instances loaded")
print(f"Development/validation dataset with {len(val_texts)} instances loaded")

### DO NOT MODIFY
# save gold labels to file
pd.DataFrame(val_labels).to_csv('./outputs/val_labels.csv')

Training dataset with 800 instances loaded
Development/validation dataset with 200 instances loaded


In this notebook, you're going to build three different classifiers for this dataset, then compare how they work, and analyse the results. We are going to start by implementing a naïve Bayes classifier from scratch. 

We are going to begin by initialising some useful variables and doing some very simple pre-processing using CountVectorizer.

In [80]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize

num_classes = 3

def preprocess(train_texts):
    vectorizer = CountVectorizer(ngram_range=(2,2), tokenizer=word_tokenize)
    X = vectorizer.fit_transform(train_texts).toarray()
    num_features = X.shape[1]

    X_val = vectorizer.transform(val_texts).toarray()

    return X, X_val, vectorizer, num_features

X, X_val, vectorizer, num_features = preprocess(train_texts)



## TASK 1.1a

Complete the function below to compute the class priors, $p(y_n = c)$ for each class label $c$, where $y_n$ is the class label of document $n$. Do not use the Sklearn implementation to do this, but implement it yourself, e.g., using Numpy functions. The function must output the class priors as a list or Numpy array containing the probabilities. You do not need to apply any smoothing or regularisation.    (3 marks)

In [82]:
def compute_class_priors(texts, labels):
    priors = np.zeros(num_classes)

    ### WRITE YOUR ANSWER HERE
    total_count = len(labels)
    for i in range(num_classes):
        priors[i] = np.sum(labels == i) / total_count
    ### END OF ANSWER
    return priors

class_priors = compute_class_priors(train_texts, train_labels)
print(class_priors)

### DO NOT MODIFY
pd.DataFrame(class_priors).to_csv('./outputs/11a_class_priors.csv')

[0.35375 0.40375 0.2425 ]


## TASK 1.1b

Complete the function below to extract n-gram features from the text, then compute the liklihood $p(x_{ni} = w | y_n = c)$ that the $i$ th n-gram in document $n$ is $w$, given that the class of $n$ is $c$. Again, do not use the Sklearn implementation to do this, but implement it yourself, e.g., using Numpy functions. The function must output the likelihoods as a 2D Numpy array containing probabilities. You should apply smoothing by adding counts of +1 to the counts of each feature.  (3 marks)

In [84]:
def compute_feature_likelihoods(X, labels):

    likelihoods = np.ones((num_features, num_classes))  # a 2D numpy array where you can store the likelihoods. Note that all values are initialised to one.

    ### WRITE YOUR ANSWER HERE
    for c in range(num_classes):
        # Select all rows that belong to class c
        class_docs = X[labels == c]
        # Sum all documents in the class by column to get the total frequency for each feature
        class_feature_counts = class_docs.sum(axis=0)  # shape: (num_features,)
        # Plus 1 (already initialized to 1), so just add the frequency here
        likelihoods[:, c] += class_feature_counts
        # The total number of features (smoothed) for each class is normalized
        likelihoods[:, c] /= likelihoods[:, c].sum()
    ### END OF ANSWER
    return likelihoods

likelihoods = compute_feature_likelihoods(X, train_labels)

### DO NOT MODIFY
pd.DataFrame(likelihoods).to_csv('./outputs/11b_likelihoods.csv')

Now, we are going to use the code in the next cell to compute the log probabilities of each class for each text in the validation set. This code will use the previous functions you implemented, compute_class_priors and compute_feature_likelihoods. The log probabilities will be stored in the 'predictions' array.

In [86]:
from scipy.special import logsumexp


def NB_classify(class_priors, likelihoods, X_val):

    predictions = np.zeros((X_val.shape[0], num_classes))  # an empty numpy array to store the predictions in

    sum_of_log_likelihoods = X_val.dot(np.log(likelihoods))
    log_joint_prob = sum_of_log_likelihoods + np.log(class_priors)[None, :]
    for n, doc in enumerate(X_val):
        predictions[n, :] = log_joint_prob[n]
        predictions[n, :] -= logsumexp(predictions[n, :])
    return predictions

predictions = NB_classify(class_priors, likelihoods, X_val)
print(predictions)

[[-9.09830430e+00 -1.75665996e-04 -9.65983187e+00]
 [-1.33783274e+01 -5.75739441e+00 -3.16588727e-03]
 [-3.87865384e-09 -1.93677836e+01 -3.41478473e+01]
 [-2.14559093e+01 -2.67787941e-01 -1.44846770e+00]
 [-1.27494991e+01 -6.95415080e-04 -7.27553514e+00]
 [ 0.00000000e+00 -6.46803107e+01 -9.08665020e+01]
 [-1.36056900e+01 -1.35663862e-06 -1.59096110e+01]
 [-1.09406526e-01 -2.30920050e+00 -5.45067385e+00]
 [-1.62123297e+01 -2.46520472e-07 -1.56765332e+01]
 [-1.33019296e+01 -4.12109525e+00 -1.63615265e-02]
 [-3.70467945e+01 -1.24269420e+01 -4.00911529e-06]
 [-1.00563051e+01 -4.29152373e-05 -3.04286350e+01]
 [ 0.00000000e+00 -6.11452127e+01 -6.42534524e+01]
 [-9.28535441e-01 -5.04750216e-01 -6.71574838e+00]
 [-1.09427339e+00 -4.07676396e-01 -1.06747148e+01]
 [-5.19431721e-01 -1.88272385e+00 -1.37449835e+00]
 [ 0.00000000e+00 -3.98032181e+01 -3.95901461e+01]
 [-1.20236343e+01 -1.09426514e-05 -1.22177625e+01]
 [-2.57293708e+01 -7.70633689e+00 -4.50068011e-04]
 [ 0.00000000e+00 -3.88937993e+

Use the 'predictions' array above to compute and print the accuracy of the classifier on the validation set.   

In [88]:
from sklearn.metrics import accuracy_score

accuracy_score(val_labels, np.argmax(predictions, axis=1))

0.74

## TASK 1.1c

The simplicty of naïve Bayes means that we can quite easily interpret the model. In the code above, we used the functions you implemented, compute_feature_likelihoods and compute_class_priors, to train an NB classifier with our training set. Given this classifier, which are the five n-gram features that most strongly indicate that the document belongs to class 0? Store these features in the 'top_features' list below.    (4 marks)

In [90]:
top_features = []

### WRITE YOUR ANSWER HERE
# 1. Get all features of class 0 likelihood (P(f | class 0) for each bigram)
likelihoods_class_0 = likelihoods[:, 0]

# 2. Find the index of the top 5 largest values from largest to smallest
top_indices = np.argsort(likelihoods_class_0)[-5:][::-1]

# 3. Use the vectorizer to recover the corresponding feature names (bigram) for these indexes.
feature_names = vectorizer.get_feature_names_out()
top_features = [feature_names[i] for i in top_indices]
### END OF ANSWER

### DO NOT MODIFY
print(top_features)
pd.DataFrame(top_features).to_csv('./outputs/11c_top_feats.csv')
###

['climate change', ', and', 'of the', 'in the', 'to the']


Up to this point, the classifier used bigrams features extracted using CountVectorizer with the wordnet tokenizer. 

## TASK 1.1d

Your task is to improve the naïve Bayes classifier by changing the preprocessing or features only. It is up to you to decide how many changes are needed to improve the classifier -- a single change may be enough to achieve a good result (and maximum marks) and you should only include steps that help performance. Complete the 'preprocess_improved' function below, and run the cell to compute accuracy of the improved classifier on the validation set.     (3 marks)

In [92]:
def preprocess_improved(train_texts):
    ### WRITE YOUR ANSWER HERE

    vectorizer = CountVectorizer(ngram_range=(1,1), tokenizer=word_tokenize) # Change ngram_range=(1, 1)
    X = vectorizer.fit_transform(train_texts).toarray()
    num_features = X.shape[1]

    X_val = vectorizer.transform(val_texts).toarray()

    ### END OF ANSWER

    return X, X_val, vectorizer, num_features

X, X_val, vectorizer, num_features = preprocess_improved(train_texts)
class_priors = compute_class_priors(train_texts, train_labels)
likelihoods = compute_feature_likelihoods(X, train_labels)
predictions = NB_classify(class_priors, likelihoods, X_val)
predictions_nb = np.argmax(predictions, axis=1)

### DO NOT MODIFY
pd.DataFrame(predictions_nb).to_csv('./outputs/11d_improved_preds.csv')
accuracy_improved = accuracy_score(val_labels, predictions_nb)
print(accuracy_improved)
###



0.755


## TASK 1.2

Below is an implementation of a neural network classifier that we can apply to the same dataset. However, there are some mistakes in the code and some poor choices in the choice of parameters and architecture. Your task is to fix the errors, make better parameter choices, and improve the model's performance. **Modify the code within the next cell** to improve the neural network classifier, then run it and compute its accuracy using the code in the cell after that.   (8 marks)

In [94]:
### DO NOT MODIFY
set_seed(my_student_number)
###

### WRITE YOUR ANSWER HERE: MODIFY THE CODE WITHIN THIS CELL
 
from torch import nn
from torch import optim
from transformers import AutoTokenizer
from torch.utils.data import DataLoader

tokenizer = AutoTokenizer.from_pretrained("climatebert/distilroberta-base-climate-f")  

sequence_length = 64  ## truncate all docs longer than this. Pad all docs shorter than this. #Too short will truncate most of the key text information.
batch_size = 32 ## Small batches lead to unstable training

def tokenize_function(examples):
    return tokenizer(
        examples["text"],  # Adjust the key based on your dataset structure
        padding="max_length",  # Ensures equal sequence lengths
        truncation=True,       # Truncates longer sequences
        max_length=sequence_length,        # Adjust as needed
        return_tensors="pt"
    )

tokenized_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])  # Adjust column names
train_loader = DataLoader(tokenized_dataset, batch_size=batch_size, shuffle=True)

tokenized_dataset = val_dataset.map(tokenize_function, batched=True)
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])  # Adjust column names
val_loader = DataLoader(tokenized_dataset, batch_size=batch_size, shuffle=False)

class FFTextClassifier(nn.Module):
    
    def __init__(self, vocab_size, sequence_length, num_classes, embedding_size=128, hidden_size=128): ## Too small size to learn any meaningful information
        super(FFTextClassifier, self).__init__()

        self.embedding_size = embedding_size
        self.sequence_length = sequence_length

        # Here we just need to construct the components of our network. We don't need to connect them together yet.
        self.embedding_layer = nn.Embedding(vocab_size, embedding_size) # embedding layer
        self.hidden_layer = nn.Linear(self.embedding_size*sequence_length, hidden_size)  #nn.LSTM(self.embedding_size, hidden_size, bidirectional=True, batch_first=True) # Hidden layer
        self.activation = nn.ReLU() # Hidden layer
        self.output_layer = nn.Linear(hidden_size, num_classes) # Full connection layer
        
        
    def forward(self, input_words):
        # Input dimensions are:  (batch_size, seq_length)
        embedded_words = self.embedding_layer(input_words)  # (batch_size, seq_length, embedding_size)

        # flatten the sequence of embedding vectors for each document into a single vector.
        embedded_words = embedded_words.reshape(embedded_words.shape[0], self.sequence_length*self.embedding_size)  #(embedded_words.shape[0], self.sequence_length*self.embedding_size)  # batch_size, seq_length*embedding_size

        z = self.hidden_layer(embedded_words)   # (batch_size, seq_length, hidden_size)
        h = self.activation(z)                  # (batch_size, seq_length, hidden_size)

        output = self.output_layer(h)                      # (batch_size, num_classes)

        # Notice we haven't applied a softmax activation to the output layer -- it's not required by Pytorch's loss function.

        return output


    def run_training(self, train_dataloader, dev_dataloader):

        # training hyperparameters
        num_epochs = 10 ## The code only runs 1 epoch, which is too little. Models may not converge  
        learning_rate = 5e-4  ## learning rate for the gradient descent optimizer, related to the step size # Too lowlearning rate, will directly lead to loss of explosion

        loss_fn = nn.CrossEntropyLoss()  # create loss function object
        optimizer = optim.Adam(self.parameters(), lr=learning_rate)  # create the optimizer
        
        dev_losses = []
        ## early stopping parameters
        best_loss = float('inf')
        patience_counter = 0
        early_stopping_patience = 3
   
        for e in range(num_epochs):
            # Track performance on the training set as we are learning...
            train_losses = []

            self.train()  # Put the model in training mode.

            for i, batch in enumerate(train_dataloader):
                # Iterate over each batch of data

                optimizer.zero_grad()  # Reset the optimizer

                # Use the model to perform forward inference on the input data.
                # This will run the forward() function.
                output = self(batch['input_ids'])

                # Compute the loss for the current batch of data
                batch_loss = loss_fn(output, batch['label'].long()) ## The tag data type must be LongTensor

                # Perform back propagation to compute the gradients with respect to each weight
                batch_loss.backward()

                # Update the weights using the compute gradients
                optimizer.step()

                # Record the loss from this sample to keep track of progress.
                train_losses.append(batch_loss.item())

            print("Epoch: {}/{}".format((e+1), num_epochs),
                "Training Loss: {:.4f}".format(np.mean(train_losses)))

            self.eval()  # Switch model to evaluation mode

            dev_losses_epoch = []
            
            for dev_batch in dev_dataloader:
                dev_output = self(dev_batch['input_ids'])
                dev_loss = loss_fn(dev_output, dev_batch['label'].long()) ## The tag data type must be LongTensor

                # Save the loss on the dev set
                dev_losses_epoch.append(dev_loss.item())
                        
            dev_losses.append(np.mean(dev_losses_epoch))
                    
            print("Epoch: {}/{}".format((e+1), num_epochs),
                "Validation Loss: {:.4f}".format(dev_losses[-1]) )

            ## Add early stopping
            if dev_losses[-1] < best_loss:
                best_loss = dev_losses[-1]
                patience_counter = 0
            else:
                patience_counter += 1
                if patience_counter >= early_stopping_patience:
                    print("Early stopping triggered.")
                    break
        
        return dev_losses

def predict_nn(trained_model, data_loader):

    trained_model.eval()

    pred_labs = []  # predicted labels to return
    
    for batch in data_loader:
        test_output = trained_model(batch['input_ids'])
        predicted_labels = test_output.argmax(1)
        pred_labs.extend(predicted_labels.tolist())
    
    return pred_labs

vocab_size = len(tokenizer.get_vocab()) ## max(tokenizer.get_vocab().values()) + 1, may missize, error or waste parameter space on the embedding layer.
nn_classifier_model = FFTextClassifier(vocab_size, sequence_length, num_classes)
dev_losses = nn_classifier_model.run_training(train_loader, val_loader)

predictions_nn = predict_nn(nn_classifier_model, val_loader)

### END OF ANSWER 

### DO NOT MODIFY
pd.DataFrame(predictions_nn).to_csv("./outputs/12_nn_preds.csv")
accuracy_nn = accuracy_score(val_labels, predictions_nn)
print(accuracy_nn)
###

Epoch: 1/10 Training Loss: 1.0756
Epoch: 1/10 Validation Loss: 0.9855
Epoch: 2/10 Training Loss: 0.2348
Epoch: 2/10 Validation Loss: 1.0858
Epoch: 3/10 Training Loss: 0.0509
Epoch: 3/10 Validation Loss: 1.2463
Epoch: 4/10 Training Loss: 0.0156
Epoch: 4/10 Validation Loss: 1.3464
Early stopping triggered.
0.55


We now explore the use of transformers for building a text classifier. First, let's look at how the process a document. We'll chose one at random from the training set:

In [25]:
### DO NOT MODIFY
chosen_document = train_texts[np.random.randint(len(train_texts))]

## TASK 1.3a

Use the HuggingFace transformers library to load the pretrained BERT model "prajjwal1/bert-tiny". Obtain a document embedding for the chosen document given above. Comment your code to explain how it obtains a representation of the document.    (3 marks)

In [96]:
### DO NOT MODIFY
set_seed(my_student_number)
###

### WRITE YOUR ANSWER HERE
from transformers import AutoTokenizer, AutoModel
import torch
import pandas as pd
from datasets import load_dataset
import numpy as np 

# Load model and tokenizer (Tiny BERT)
tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
model = AutoModel.from_pretrained("prajjwal1/bert-tiny")

# Select the given document (the one randomly selected in the original notebook code)
chosen_document = train_texts[np.random.randint(len(train_texts))]

# Use tokenizer to encode documents for input_ids and attention_mask
inputs = tokenizer(
    chosen_document,
    return_tensors="pt",        # Return to PyTorch tensors
    truncation=True,            # Truncate long text
    padding="max_length",       # Fill to maximum length (default 512)
    max_length=128              # Custom length prevents overflow
)

# Input model gets output
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state's shape is (1, seq_len, hidden_dim)
# We take the output of the first token ([CLS] token) as the embedding representation of the entire document
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, hidden_dim)

doc_emb = []  # use this variable to store the document embedding

# Append the values from tensor to the doc_emb list
doc_emb.extend(cls_embedding.squeeze().tolist())
### END OF ANSWER

### DO NOT MODIFY
pd.DataFrame(doc_emb).to_csv('./outputs/13a_sen_emb.csv')
print(doc_emb)
###

[-0.7482838034629822, 1.717582106590271, -4.1115264892578125, -0.24005068838596344, 0.31813621520996094, 0.1779763102531433, -0.08531619608402252, 1.1746528148651123, -1.109370231628418, 0.7128076553344727, 1.4363352060317993, 1.5166279077529907, -0.11133590340614319, 0.3482499420642853, 1.14781653881073, -0.44232845306396484, -0.4152769446372986, -0.9417980313301086, -1.5371623039245605, 0.9611209630966187, -1.5699527263641357, 0.4559035301208496, 1.985022783279419, 0.9156677722930908, 2.5270280838012695, -0.6815090775489807, -0.41216835379600525, -0.6346782445907593, -0.5889894962310791, 0.8110159635543823, 0.16072402894496918, -2.1547656059265137, -0.5654200911521912, -0.4460534453392029, -0.6289471983909607, -0.288046658039093, -0.5391550064086914, 0.9059393405914307, -2.3136513233184814, -0.18432575464248657, -0.24609655141830444, 0.15847896039485931, 0.39705613255500793, -3.1327106952667236, -0.6353223323822021, -2.1962480545043945, -0.1935940384864807, 0.17113542556762695, -1.11

## TASK 1.3b

Using the same document embeddings method as the previous task, find the most similar document to the 'chosen_document' from within the validation set (from the 'val_texts' object). Use a standard similarity metric that considers the direction but not the magnitude of the embedding vectors. Use the same model as in task 1.8.  (2 marks)

In [98]:
### DO NOT MODIFY
set_seed(my_student_number)
###

### WRITE YOUR ANSWER HERE
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

most_similar_doc = ""  # use this variable to store the most similar document

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
model = AutoModel.from_pretrained("prajjwal1/bert-tiny")

# Gets the embedding of the chosen_document (same as 1.3a)
inputs = tokenizer(
    chosen_document,
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=128
)

with torch.no_grad():
    outputs = model(**inputs)

chosen_emb = outputs.last_hidden_state[:, 0, :]  # [CLS] token
chosen_emb = chosen_emb.squeeze().numpy().reshape(1, -1)  # shape: (1, hidden_size)

# Computes embedding and similarity for all validation set documents
val_embeddings = []
for text in val_texts:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    emb = outputs.last_hidden_state[:, 0, :].squeeze().numpy()
    val_embeddings.append(emb)

val_embeddings = np.stack(val_embeddings)  # shape: (N, hidden_size)

# Use cosine similarity
similarities = cosine_similarity(chosen_emb, val_embeddings)[0]  # shape: (N,)
most_similar_index = np.argmax(similarities)
most_similar_doc = val_texts[most_similar_index]
### END OF ANSWER

### DO NOT MODIFY
pd.DataFrame([chosen_document, most_similar_doc]).to_csv("./outputs/13b_most_similar.csv")
print(chosen_document)
print(most_similar_doc)
###

EXAMPLES OF RISKS Resource scarcity, coupled with increasing demand, could affect production, availability, quality and cost of raw materials. Increased frequency of extreme weather events, from floods to droughts, could cause disruption in our supply chain and impact the sourcing of raw materials, as well as the production and distribution of finished goods. Increased regulation and more stringent environmental standards could impact our business by affecting production costs and flexibility of operations. Our industry is sustained by many agricultural and manufacturing communities around the world. Failure to support them in preserving key skills and building more sustainable livelihoods could cause social, economic and operational challenges, ranging from community tensions and disruption to production, to a reduced talent pool.
Climate change presents an evolving set of risks and opportunities for Coles, and has the potential to contribute to and increase the exposure of other mate

## TASK 1.3c

Implement a classifier based on the same pretrained transformer model, "prajjwal1/bert-tiny". Evaluate your model's performance on the validation set. Use an 'auto class' from HuggingFace to build your classifier (see https://huggingface.co/docs/transformers/model_doc/auto).   (6 marks)

In [100]:
### DO NOT MODIFY
set_seed(my_student_number)
###

### WRITE YOUR ANSWER HERE
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.utils.data import DataLoader, Dataset
import torch
from sklearn.metrics import accuracy_score
import numpy as np

predictions_bert = []  # use this variable to store the predicted labels for the validation set

# Load tokenizer and classification model (num_labels=3, since it's three-class)
tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
model = AutoModelForSequenceClassification.from_pretrained("prajjwal1/bert-tiny", num_labels=3)

# Define the datasets class
class ClimateDataset(Dataset):
    def __init__(self, texts, labels):
        self.encodings = tokenizer(list(texts), padding="max_length", truncation=True, max_length=128, return_tensors="pt")
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "input_ids": self.encodings["input_ids"][idx],
            "attention_mask": self.encodings["attention_mask"][idx],
            "labels": self.labels[idx]
        }

# Build the Dataset and DataLoader
train_data = ClimateDataset(train_texts, train_labels)
val_data = ClimateDataset(val_texts, val_labels)

train_loader = DataLoader(train_data, batch_size=16, shuffle=True)
val_loader = DataLoader(val_data, batch_size=32)

# Model training
from torch import nn, optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

model.train()
num_epochs = 20
early_stopping_patience = 2

best_val_loss = float('inf')
patience_counter = 0
for epoch in range(num_epochs): 
    total_loss = 0
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    print(f"Epoch {epoch+1} completed. Train Loss: {total_loss/len(train_loader):.4f}")
    
    # Evaluate on validation set
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs.logits, labels)
            val_loss += loss.item()

    val_loss /= len(val_loader)
    print(f"Validation Loss: {val_loss:.4f}")

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= early_stopping_patience:
            print("Early stopping triggered.")
            break

# Model evaluation(Prediction validation)
model.eval()
with torch.no_grad():
    for batch in val_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        preds = outputs.logits.argmax(dim=1).cpu().tolist()
        predictions_bert.extend(preds)

### END OF ANSWER

### DO NOT MODIFY
pd.DataFrame(predictions_bert).to_csv('./outputs/13c_bert_preds.csv')
accuracy_tinybert = accuracy_score(val_dataset["label"], predictions_bert)
print(accuracy_tinybert)
###

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1 completed. Train Loss: 1.0787
Validation Loss: 1.0476
Epoch 2 completed. Train Loss: 1.0117
Validation Loss: 0.9841
Epoch 3 completed. Train Loss: 0.9382
Validation Loss: 0.9235
Epoch 4 completed. Train Loss: 0.8632
Validation Loss: 0.8708
Epoch 5 completed. Train Loss: 0.7843
Validation Loss: 0.7993
Epoch 6 completed. Train Loss: 0.7080
Validation Loss: 0.7452
Epoch 7 completed. Train Loss: 0.6360
Validation Loss: 0.7061
Epoch 8 completed. Train Loss: 0.5731
Validation Loss: 0.6802
Epoch 9 completed. Train Loss: 0.5160
Validation Loss: 0.6527
Epoch 10 completed. Train Loss: 0.4638
Validation Loss: 0.6475
Epoch 11 completed. Train Loss: 0.4165
Validation Loss: 0.6290
Epoch 12 completed. Train Loss: 0.3725
Validation Loss: 0.6063
Epoch 13 completed. Train Loss: 0.3304
Validation Loss: 0.6086
Epoch 14 completed. Train Loss: 0.2898
Validation Loss: 0.5897
Epoch 15 completed. Train Loss: 0.2536
Validation Loss: 0.5885
Epoch 16 completed. Train Loss: 0.2214
Validation Loss: 0.6022
E