## Automated Fact Checking For Climate Science Claims

Author: Zhi Hern Tom

The script runs on Colab using the basic T4 GPU. If you run out of CUDA memory, restart the kernel and run all cells. Earlier progress is saved to disk, so recovery is fast.

In this notebook, we aim to develop and test an automated system for fact-checking claims related to climate science. The approach leverages Natural Language Processing (NLP) techniques and pretrained models to verify the authenticity of claims.

#### Install and import requirements

Firstly, let's set up our environment by installing the necessary libraries and packages. This ensures that we have all the tools needed to execute the subsequent code cells.

In [1]:
%%capture
!pip install torch torchvision transformers 
!pip install pandas numpy scikit-learn nltk
!pip install ipywidgets tqdm

In [2]:
%%capture
# Import the libraries used for data processing, NLP tasks, and modeling.
import os
import re
import json
import time
import random
from tqdm.notebook import tqdm
from collections import Counter

import nltk
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

from transformers import BertTokenizer, BertModel
from transformers import DistilBertTokenizer, DistilBertModel

In [3]:
# Optionally mount Google Drive on Colab to access datasets or save models.

# If running on Colab, uncomment the following lines
# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
# Global variables: data paths, tokenizer settings, and device selection

data_dir = "data/"
# Change if on Colab
# data_dir = "drive/MyDrive/data/"

outputs_path = "outputs/"
prediction_path = "prediction/"

train_path = f"{data_dir}train-claims.json"
dev_path = f"{data_dir}dev-claims.json"
evidence_path = f"{data_dir}evidence.json"

token_len = 256
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

gpu = 0
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [5]:
# Define utility functions to check the existence of files and directories and create them if necessary.

# Check if file or directory exists
def check_file(path):
    if os.path.exists(path):
        return True
    else:
        print(f"{path} does not exist.")
        return False

def check_dir(path):
    if not os.path.exists(path):
        os.makedirs(path)
        print(f"Created directory: {path}")
    else:
        print(f"Directory already exists: {path}")

check_dir(data_dir)
check_dir(outputs_path)
check_dir(prediction_path)

Directory already exists: data/
Directory already exists: outputs/
Directory already exists: prediction/


### Section I: Evidence Retrieval

In [6]:
# Load claims, related evidences and labels
# If label is False, return only claims and claim ids
def load_data(path, label=False):
    claimid_list = []
    claim_list = []
    if label:
        evidences_list = []
        label_list = []
    
    with open(path, 'r') as f:
        data = json.load(f)
    for item in data:
        claimid_list.append(item)
        claim_list.append(data[item]['claim_text'])
        if label:
            evidences_list.append(data[item]['evidences'])
            label_list.append(data[item]['claim_label'])
    
    if label:
        return claimid_list, claim_list, evidences_list, label_list
    return claimid_list, claim_list

# Load evidences
def load_evidence(path):
    evidence_list = []
    with open(path, 'r') as f:
        data = json.load(f)
    for item in data:
        evidence_list.append(data[item])
    return evidence_list

We need to load the claims and evidences before processing them further. The functions above load the data from JSON files. Each claim is keyed by an ID and carries its text, a list of associated evidence IDs, and a label indicating its veracity.
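
To make the expected file layout concrete, here is a minimal sketch of one claims record and how its fields are read. The field names `claim_text`, `claim_label`, and `evidences` come from `load_data`; the ID, the text, and the label value are invented for illustration and are not taken from the real dataset.

```python
import json

# Hypothetical miniature of the claims file; the ID, text, and label value are
# invented for illustration and are not taken from the real dataset.
sample_json = """
{
  "claim-1": {
    "claim_text": "Sea levels are rising.",
    "claim_label": "SUPPORTS",
    "evidences": ["evidence-0", "evidence-2"]
  }
}
"""

claims = json.loads(sample_json)

for claim_id, record in claims.items():
    # Evidence IDs look like 'evidence-<n>'; the integer suffix indexes the evidence list.
    evidence_indices = [int(eid.split("-")[1]) for eid in record["evidences"]]
    print(claim_id, record["claim_label"], evidence_indices)
```

The integer suffix is what later cells use to look up evidence text by position, which assumes the evidence file lists entries in ID order.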

In [7]:
# Load the training claims, development claims, and evidence corpus from the JSON files.

train_claim_ids, train_claims, train_evidences, train_labels = load_data(train_path, label=True)
dev_claim_ids, dev_claims, dev_evidences, dev_labels = load_data(dev_path, label=True)
evidence_src = load_evidence(evidence_path)

#### Text preprocessing

Text data often contains noise in the form of irrelevant characters, different cases, or frequent words that don't add much semantic value. Cleaning the text involves multiple steps including removing special characters, converting to lowercase, tokenizing, removing stopwords, and lemmatizing to get to the base form of words.

In [8]:
# Define functions for text cleaning, including tokenization, stopword removal, and lemmatization.

# Import necessary libraries
import re
import nltk
from tqdm import tqdm
import json

# Define a set of English stopwords and initialize the WordNet Lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = nltk.stem.WordNetLemmatizer()

# Define a regex pattern to clean text - keeps only alphanumeric characters and spaces
clean_pattern = r'[^a-zA-Z0-9\s]+'

def text_clean(text):
    """
    Cleans and preprocesses the input text.
    
    Args:
    - text (str): The raw text to be cleaned.
    
    Returns:
    - clean_text (str): The cleaned text with stopwords removed and lemmatized.
    """
    # Remove special characters using the defined regex pattern
    clean_text = re.sub(clean_pattern, '', text)
    
    # Tokenize the cleaned text and convert to lowercase
    words = nltk.word_tokenize(clean_text.lower())
    
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    
    # Lemmatize the words
    words = [lemmatizer.lemmatize(word) for word in words]
    
    # Join the words to get the final cleaned text
    clean_text = ' '.join(words)
    
    return clean_text

def evidence_clean(texts):
    """
    Cleans a list of texts and saves the cleaned texts to a JSON file.
    
    Args:
    - texts (list of str): The list of raw texts to be cleaned.
    
    Outputs:
    - A JSON file containing cleaned texts.
    """
    clean_texts = []
    num_texts = len(texts)

    # Display a progress bar while cleaning texts
    print("Cleaning texts ...")
    pbar = tqdm(total=num_texts, dynamic_ncols=True, miniters=10000)
    
    for text in texts:
        clean_text = text_clean(text)
        clean_texts.append(clean_text)
        pbar.update(1)
    
    pbar.close()
    
    # Save cleaned texts to a JSON file
    with open(outputs_path + 'clean_evidence.json', 'w') as f:
        json.dump(clean_texts, f)
    
    print(f"Saved to {outputs_path}clean_evidence.json")
    # return clean_texts
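
The regex step of `text_clean` can be examined in isolation. The sketch below uses only the standard library and an invented sentence. It compares substituting the empty string, as the notebook does, with substituting a space, which is shown purely as a hypothetical alternative.

```python
import re

# Same pattern as in the notebook: anything that is not a letter, digit, or whitespace.
clean_pattern = r'[^a-zA-Z0-9\s]+'
text = "CO2 levels rose 40% since pre-industrial times."

# Substituting the empty string removes the character and joins its neighbors.
removed = re.sub(clean_pattern, '', text).lower().split()

# Substituting a space keeps the neighbors as separate tokens (hypothetical alternative).
spaced = re.sub(clean_pattern, ' ', text).lower().split()

print(removed)
print(spaced)
```

Substituting the empty string fuses `pre-industrial` into a single token, while a space would split it into two. Which behavior suits retrieval better is an empirical question that this sketch does not settle.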


In [9]:
# Load cleaned evidences
if not check_file(f"{outputs_path}clean_evidence.json"):
    evidence_clean(evidence_src) 
with open(outputs_path + 'clean_evidence.json', 'r') as f:
    clean_evidence_src = json.load(f)
    print(f"Loaded from {outputs_path}clean_evidence.json")

Loaded from outputs/clean_evidence.json


#### Jaccard similarity

Jaccard similarity is a measure of similarity between two sets. It's defined as the size of the intersection divided by the size of the union of the two sets. For our purpose, we'll treat each text as a set of words and compute the Jaccard similarity to filter relevant evidence for a given claim.

The next cell defines a function that computes the Jaccard similarity between two texts, and a filter that uses it to keep the top-k most similar evidences for a given claim. The comparison runs over the cleaned evidence loaded above, so each claim is passed through the same text cleaning function before scoring.

In [10]:

# Compute jaccard similarity between two texts
def jaccard_similarity(s1, s2):
    set1 = set(s1.split())
    set2 = set(s2.split())
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    # Two empty texts have an empty union; define their similarity as 0
    similarity = len(intersection) / len(union) if union else 0.0
    return similarity

# Use Jaccard similarity to filter evidences
def jaccard_filter(claim, k=100):
    clean_claim = text_clean(claim)
    res = []
    for i, ev in enumerate(clean_evidence_src):
        res.append((i, jaccard_similarity(clean_claim, ev)))
    return sorted(res, key = lambda x: x[1], reverse=True)[:k]
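
As a worked example of the definition, the sketch below scores two short, invented texts that are assumed to be already cleaned. The helper name `jaccard` is illustrative and separate from the notebook's function.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two whitespace-separated texts."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    # Two empty texts have an empty union; treat them as having no overlap.
    return len(set_a & set_b) / len(union) if union else 0.0

claim = "global sea level rising"
evidence = "sea level rising faster last century"

# 3 shared words (sea, level, rising) out of 7 distinct words overall.
score = jaccard(claim, evidence)
print(score)
```

Because the score depends only on set membership, repeated words and word order have no effect on it.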

#### TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) weights a word by how often it occurs in a document, discounted by how many documents in the corpus contain it. By converting the texts into TF-IDF vectors, we can compute the cosine similarity between a claim and every evidence, and keep the top-k most similar evidences for that claim.

In [11]:
# Define TF-IDF helpers: vectorization, pairwise similarity, and top-k filtering.

# Perform TF-IDF on given texts
def tfidf_evidence(texts):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_vectors = tfidf_vectorizer.fit_transform(texts)
    return tfidf_vectors, tfidf_vectorizer

# Cosine similarity between TF-IDF vectors of claim and evidence
def tfidf_similarity(claim, ev_id):
    evidence = clean_evidence_src[ev_id]
    claim_vector = ev_tfidf_vectorizer.transform([claim])
    evidence_vector = ev_tfidf_vectorizer.transform([evidence])
    similarity = cosine_similarity(claim_vector, evidence_vector)
    return similarity.item()

# Use TF-IDF similarity to filter evidences
def tfidf_filter(claim, k=100):
    claim = text_clean(claim)
    claim_vector = ev_tfidf_vectorizer.transform([claim])
    similarities = cosine_similarity(claim_vector, ev_tfidf_vectors).flatten()
    # Indices of the k highest scores, ordered from most to least similar
    top_k_indices = np.argsort(similarities)[-k:][::-1]
    top_k_scores = similarities[top_k_indices]
    return list(zip(top_k_indices, top_k_scores))

In [12]:
# Perform TF-IDF on all evidences
ev_tfidf_vectors, ev_tfidf_vectorizer = tfidf_evidence(clean_evidence_src)
print("TF-IDF vectors generated.")

TF-IDF vectors generated.
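
To illustrate the idea behind the TF-IDF filter, here is a simplified pure-Python sketch on an invented three-document corpus. It uses the plain weight tf × log(N / df) and, unlike the notebook, includes the claim when counting document frequencies. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its scores will differ from these.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "sea level rise",                    # stands in for the cleaned claim
    "sea level rise coastal flooding",   # evidence sharing three terms with the claim
    "solar panel cost decline",          # evidence sharing no terms with the claim
]
claim_vec, *evidence_vecs = tfidf_vectors(docs)
scores = [cosine(claim_vec, ev) for ev in evidence_vecs]
print(scores)
```

The first evidence scores above zero because it shares terms with the claim, and the second scores exactly zero because it shares none.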


#### DistilBERT

DistilBERT is a distilled version of BERT, a popular transformer model for NLP tasks. By using DistilBERT, we can generate embeddings for our texts, which can then be used to compute cosine similarity between them. This provides another method to filter evidence that is most similar to a given claim based on their DistilBERT embeddings.

In [13]:
# Load DistilBERT and define a function that generates text embeddings with it.

distilbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
distilbert_model = DistilBertModel.from_pretrained('distilbert-base-uncased').to(device)

# Generate embeddings for evidences using DistilBERT
def get_text_embedding(evidences, model, tokenizer, batch_size=64, show=True):
    """
    Generate embeddings for a list of evidences using the DistilBERT model.
    
    Args:
    - evidences (list): A list of text evidence.
    - model (transformers.Model): A DistilBERT model.
    - tokenizer (transformers.Tokenizer): A DistilBERT tokenizer.
    - batch_size (int): Size of each batch for processing. Default is 64.
    - show (bool): Flag to display a progress bar. Default is True.
    
    Returns:
    - embeddings (numpy.ndarray): Embeddings for the input evidences.
    """
    
    # Setting the model to evaluation mode
    model.eval()

    # List to store the generated embeddings
    embeddings = []

    # Splitting the evidences into batches
    batches = [evidences[i:i+batch_size] for i in range(0, len(evidences), batch_size)]
    
    # Progress bar initialization (if show is True)
    pbar = tqdm(total=len(batches), dynamic_ncols=True, miniters=1000) if show else None

    for batch in batches:
        # Tokenizing the batched evidences and getting the necessary inputs for the model
        tokenized = tokenizer.batch_encode_plus(batch, padding=True, truncation=True, max_length=token_len, return_tensors='pt')
        input_ids = tokenized['input_ids']
        attention_masks = tokenized['attention_mask']

        # Moving the tokenized data to the specified device (e.g., GPU)
        input_ids = input_ids.to(device)
        attention_masks = attention_masks.to(device)
        
        # Compute the embeddings for the batch
        with torch.no_grad():
            outputs = model(input_ids, attention_masks)
            
            # Extracting the [CLS] token's embedding for each evidence, which is commonly used as a representation of the entire text.
            batch_embeddings = outputs[0][:, 0, :].cpu().numpy()
        
        # Storing the batch embeddings
        embeddings.append(batch_embeddings)
        
        # Updating the progress bar (if show is True)
        pbar.update(1) if show else None
    
    # Closing the progress bar (if show is True)
    pbar.close() if show else None

    # Stacking the batched embeddings into a single numpy array
    embeddings = np.vstack(embeddings)

    return embeddings

In [14]:
# Load embeddings for all evidences
ev_embeddings = None
if not check_file(f"{outputs_path}ev_distilbert.npy"):
    print("Generating embeddings for evidences using DistilBERT ...")
    ev_embeddings = get_text_embedding(evidence_src, distilbert_model, distilbert_tokenizer)
    np.save(outputs_path + "ev_distilbert.npy", ev_embeddings)
else:
    ev_embeddings = np.load(outputs_path + "ev_distilbert.npy")
    print(f"Loaded from {outputs_path}ev_distilbert.npy")

Loaded from outputs/ev_distilbert.npy


In [15]:
# Filter evidences based on cosine similarity between DistilBERT embeddings of claim and evidence
def distilbert_filter(claim, k=100):
    claim_embedding = get_text_embedding([claim], distilbert_model, distilbert_tokenizer, show=False)
    similarities = cosine_similarity(claim_embedding, ev_embeddings).flatten()
    # Indices of the k highest scores, ordered from most to least similar
    top_k_indices = np.argsort(similarities)[-k:][::-1]
    top_k_scores = similarities[top_k_indices]
    return list(zip(top_k_indices, top_k_scores))
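
The ranking step can be sketched without any model. The example below ranks toy 3-dimensional vectors by cosine similarity and returns the best matches first. The vectors and the helper names `cosine` and `top_k` are invented for illustration; real `distilbert-base-uncased` embeddings have 768 dimensions.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query, candidates, k):
    """Return (index, score) pairs for the k most similar candidates, best first."""
    scored = ((i, cosine(query, c)) for i, c in enumerate(candidates))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy "embeddings" chosen so the expected ranking is easy to see.
claim_vec = [1.0, 0.0, 0.0]
evidence_vecs = [
    [0.0, 1.0, 0.0],   # orthogonal to the claim
    [2.0, 0.0, 0.0],   # same direction, different length
    [1.0, 1.0, 0.0],   # 45 degrees away
]
ranking = top_k(claim_vec, evidence_vecs, k=2)
print(ranking)
```

Cosine similarity ignores vector length, so the second vector matches the claim perfectly even though it is twice as long.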

#### Evaluation on filters (Jaccard, TF-IDF, DistilBERT)

To validate the evidence retrieval methods, we need an evaluation mechanism. The following functions measure how well a given filter (Jaccard, TF-IDF, or DistilBERT) retrieves relevant evidence, first for an individual claim and then for the entire development set. Performance is measured by checking the retrieved evidence against the ground truth.

In [16]:
# Function to evaluate the performance of a filter on a specific claim
# Parameters:
# - index: Index of the claim to evaluate
# - filter_func: The function to use for filtering evidences for the claim
# - k: Number of evidences to retrieve (default is 100)
# - show: Whether to print the results for each evidence (default is True)
def eval_filter(index, filter_func, k=100, show=True):
    # Retrieve the claim from the dev set using the provided index
    claim = dev_claims[index]
    
    # Retrieve the true evidences associated with the claim
    truths = dev_evidences[index]
    
    # Apply the filtering function to the claim to get the top k evidences
    evidences = [item[0] for item in filter_func(claim, k)]
    
    # Initialize counters for retrieved (t) and missed (f) ground-truth evidences
    t = 0
    f = 0
    
    # Loop through each true evidence and check if it's in the retrieved evidences
    for truth in truths:
        if int(truth[9:]) in evidences:     # evidence ids are of the form 'evidence-xxx'
            t += 1
            print(f"In: {truth}") if show else None
        else:
            f += 1
            print(f"Out: {truth}") if show else None
    
    # Print the number of retrieved and missed ground-truth evidences
    print(f"In: {t}, Out: {f}") if show else None
    
    # Return the counts for further processing or evaluation
    return t, f

# Function to evaluate the performance of a filter on the entire dev set
# Parameters:
# - filter_func: The function to use for filtering evidences for the claim
# - k: Number of evidences to retrieve for each claim (default is 100)
def eval_filter_dev(filter_func, k=100):
    # Initialize counters for true positives and false negatives
    t = 0
    f = 0
    
    # Initialize a progress bar to provide feedback during evaluation
    pbar = tqdm(total=len(dev_claims), dynamic_ncols=True)
    
    # Loop through each claim in the dev set and evaluate it
    for i in range(len(dev_claims)):
        t_, f_ = eval_filter(i, filter_func, k, False)
        t += t_
        f += f_
        pbar.update(1)
    pbar.close()
    
    # Print the aggregated results for the entire dev set
    print(f"In: {t}, Out: {f}, Total: {t+f}")

The cells below run each filter on the development set with k = 100. `In` counts the ground-truth evidences that appear among the retrieved candidates (true positives), and `Out` counts those that were missed (false negatives).

In [17]:
eval_filter_dev(jaccard_filter, 100)

  0%|          | 0/154 [00:00<?, ?it/s]

In: 174, Out: 317, Total: 491


In [18]:
eval_filter_dev(tfidf_filter, 100)

  0%|          | 0/154 [00:00<?, ?it/s]

In: 186, Out: 305, Total: 491


In [19]:
eval_filter_dev(distilbert_filter, 100)

  0%|          | 0/154 [00:00<?, ?it/s]

In: 115, Out: 376, Total: 491
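
The three runs are easier to compare as recall, that is, the share of ground-truth evidences found among the top 100 candidates. The short sketch below derives it from the `In` and `Out` counts printed above. The name `recall@100` is a label chosen here, not one used elsewhere in the notebook.

```python
# (retrieved, missed) counts copied from the three runs above, k = 100.
results = {
    "Jaccard": (174, 317),
    "TF-IDF": (186, 305),
    "DistilBERT": (115, 376),
}

# Recall = retrieved / (retrieved + missed).
recall = {name: hit / (hit + miss) for name, (hit, miss) in results.items()}
for name, value in recall.items():
    print(f"{name}: recall@100 = {value:.1%}")
```

This gives roughly 35.4% for Jaccard, 37.9% for TF-IDF, and 23.4% for DistilBERT, so TF-IDF retrieves the most ground-truth evidence of the three at this cutoff.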


#### BERT for evidence classification

Training the classifier requires both positive examples (relevant evidences) and negative examples (irrelevant evidences). The following functions pair each claim with both types of evidence and the corresponding labels. Negative examples are sampled in one of two ways: randomly from the evidence corpus, or as 'hard' negatives, the non-relevant evidences most similar to the claim under TF-IDF.

In [20]:
# Largest valid evidence index (random.randint is inclusive at both ends)
num_evidences = len(evidence_src) - 1

# Function to randomly select n negative cases
def get_rand_negative_ids(evidence_ids, n):
    res = []
    for i in range(n):
        temp_id = random.randint(0, num_evidences)
        # Ensure the randomly selected ID is not already in evidence_ids or res
        while temp_id in evidence_ids or temp_id in res:
            temp_id = random.randint(0, num_evidences)
        res.append(temp_id)
    return res

# Function to select n negative cases that have the highest similarity to the claim
def get_hard_negative_ids(claim, evidence_ids, filter_func, n):
    res = []
    # Retrieve similar_ids by sorting through similarities returned by filter_func
    similar_ids = filter_func(claim, 50)
    cnt = 0
    for i in range(n):
        temp_id = similar_ids[cnt][0]
        # Skip the IDs that are already used as positive evidence
        while temp_id in evidence_ids:
            cnt += 1
            temp_id = similar_ids[cnt][0]
        res.append(temp_id)
        cnt += 1
    return res

# Function to create DataFrame consisting of claim-evidence pairs along with labels
def pair_claim_evidence(claims, evs, hard=False, show=False):
    claim_list = []
    ev_list = []
    labels = []
    
    # Optional progress bar
    pbar = tqdm(total=len(claims), dynamic_ncols=True) if show else None
    
    # Loop through each claim
    for i, claim in enumerate(claims):
        # Add positive cases
        raw_evidence_ids = evs[i]
        evidence_ids = [int(num[9:]) for num in raw_evidence_ids]
        for evidence_id in evidence_ids:
            evidence_text = evidence_src[evidence_id]
            claim_list.append(claim)
            ev_list.append(evidence_text)
            labels.append(1)
        
        # Add negative cases
        num_positive = len(evidence_ids)
        if hard:
            # For hard negatives
            negative_ids = get_hard_negative_ids(claim, evidence_ids, tfidf_filter, num_positive)
        else:
            # For random negatives
            negative_ids = get_rand_negative_ids(evidence_ids, num_positive)
        
        for num in negative_ids:
            evidence_text = evidence_src[num]
            claim_list.append(claim)
            ev_list.append(evidence_text)
            labels.append(0)
        
        # Update the progress bar, if enabled
        pbar.update(1) if show else None
    
    # Close the progress bar, if enabled
    pbar.close() if show else None
    
    return pd.DataFrame({'Claim': claim_list, 'Evidence': ev_list, 'Label': labels})
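
The two sampling strategies can be sketched on a toy corpus of ten evidence indices. The function names and the ranking below are hypothetical and only illustrate the idea; building the full candidate list is fine at this scale, but would be slow if repeated per claim over a large corpus.

```python
import random

def sample_random_negatives(positive_ids, corpus_size, n, rng=random):
    """Draw n distinct evidence indices that are not in positive_ids."""
    excluded = set(positive_ids)
    candidates = [i for i in range(corpus_size) if i not in excluded]
    return rng.sample(candidates, n)

def sample_hard_negatives(ranked_ids, positive_ids, n):
    """Take the first n non-positive indices from a best-first ranking."""
    excluded = set(positive_ids)
    return [i for i in ranked_ids if i not in excluded][:n]

positives = [3, 7]
ranked = [7, 1, 3, 9, 4]   # hypothetical retrieval result, most similar first

rand_neg = sample_random_negatives(positives, corpus_size=10, n=2, rng=random.Random(0))
hard_neg = sample_hard_negatives(ranked, positives, n=2)
print(rand_neg, hard_neg)
```

Slicing the filtered ranking means this hard-negative sketch returns fewer than `n` items when the ranking runs out, rather than raising an `IndexError`.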

In [21]:
# Create training and development sets using the pairing functions. This ensures that each set has a balanced mix of positive and negative examples.

# Create training and dev sets using random sampling
train_evcls = pair_claim_evidence(train_claims, train_evidences)
dev_evcls = pair_claim_evidence(dev_claims, dev_evidences)

In [22]:
# Merge the training and development sets to create a comprehensive dataset for training and validation.

# Merge train and dev sets (random sampling)
train_evcls_f = pd.merge(train_evcls, dev_evcls, on=['Claim', 'Evidence', 'Label'], how='outer')

In [23]:
# Create training and dev sets using hard negative sampling
train_evcls_hard = None
dev_evcls_hard = None
train_evcls_hard_data = "evcls_hard_train.csv"
dev_evcls_hard_data = "evcls_hard_dev.csv"

if not check_file(f"{outputs_path}{train_evcls_hard_data}"):
    train_evcls_hard = pair_claim_evidence(train_claims, train_evidences, hard=True, show=True)
    train_evcls_hard.to_csv(f"{outputs_path}{train_evcls_hard_data}", index=False)
    print(f"Saved to {outputs_path}{train_evcls_hard_data}")
else:
    train_evcls_hard = pd.read_csv(f"{outputs_path}{train_evcls_hard_data}")

if not check_file(f"{outputs_path}{dev_evcls_hard_data}"):
    dev_evcls_hard = pair_claim_evidence(dev_claims, dev_evidences, hard=True, show=True)
    dev_evcls_hard.to_csv(f"{outputs_path}{dev_evcls_hard_data}", index=False)
    print(f"Saved to {outputs_path}{dev_evcls_hard_data}")
else:
    dev_evcls_hard = pd.read_csv(f"{outputs_path}{dev_evcls_hard_data}")

In [24]:
# Merge train and dev sets (hard negative sampling)
train_evcls_hard_f = pd.merge(train_evcls_hard, dev_evcls_hard, on=['Claim', 'Evidence', 'Label'], how='outer')

In [25]:
# Convert data to BERT input format
def convert_input(claim, evidence, tokenizer, maxlen):
    """
    Convert claim and evidence texts into BERT input format.
    
    Parameters:
    - claim (str): The claim text.
    - evidence (str): The evidence text supporting/refuting the claim.
    - tokenizer: The BERT tokenizer used for tokenizing the input texts.
    - maxlen (int): Maximum length for the tokenized sequence.
    
    Returns:
    - tokens_ids_t (torch.Tensor): Token IDs tensor.
    - attn_mask_t (torch.Tensor): Attention mask tensor.
    - seg_ids_t (torch.Tensor): Segment IDs tensor.
    """
    
    # Tokenize the claim and evidence.
    claim_tokens = tokenizer.tokenize(claim)
    evidence_tokens = tokenizer.tokenize(evidence)

    # Truncate tokens to fit the maximum length.
    # It ensures that the combined length of claim and evidence tokens fits within maxlen.
    while len(claim_tokens) + len(evidence_tokens) > maxlen - 3:
        if len(claim_tokens) > len(evidence_tokens):
            claim_tokens.pop()
        else:
            evidence_tokens.pop()

    # Arrange tokens with BERT's [CLS], [SEP] and [PAD] tokens.
    # Format: [CLS] + evidence + [SEP] + claim + [SEP] + [PAD] (if needed).
    tokens = ['[CLS]'] + evidence_tokens + ['[SEP]'] + claim_tokens + ['[SEP]']
    if len(tokens) < maxlen:
        tokens = tokens + ['[PAD]' for _ in range(maxlen - len(tokens))]

    # Create attention mask - 0 for [PAD] tokens, 1 for all others.
    attn_mask = [0 if token == '[PAD]' else 1 for token in tokens]

    # Create segment IDs - 0 for evidence tokens, 1 for claim tokens.
    # Note: It also accounts for [CLS] and the first [SEP] tokens.
    seg_ids = [0] * (len(evidence_tokens) + 2) + [1] * (maxlen - len(evidence_tokens) - 2)
    
    # Convert tokens to their respective IDs.
    tokens_ids = tokenizer.convert_tokens_to_ids(tokens)

    # Convert lists to PyTorch tensors.
    tokens_ids_t = torch.tensor(tokens_ids)
    attn_mask_t = torch.tensor(attn_mask)
    seg_ids_t   = torch.tensor(seg_ids)

    return tokens_ids_t, attn_mask_t, seg_ids_t
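
The layout that `convert_input` produces is easiest to see on a small example. The sketch below builds the tokens, attention mask, and segment IDs for one pair, with whitespace splitting standing in for the WordPiece tokenizer. It assumes the pair already fits within `maxlen`, so truncation is left out, and the helper name `layout_pair` is hypothetical.

```python
def layout_pair(evidence_tokens, claim_tokens, maxlen):
    """Build tokens, attention mask, and segment IDs for one evidence/claim pair."""
    tokens = ["[CLS]"] + evidence_tokens + ["[SEP]"] + claim_tokens + ["[SEP]"]
    tokens += ["[PAD]"] * (maxlen - len(tokens))
    attn_mask = [0 if tok == "[PAD]" else 1 for tok in tokens]
    # Segment 0 covers [CLS], the evidence, and the first [SEP]; the rest is segment 1.
    first_segment = len(evidence_tokens) + 2
    seg_ids = [0] * first_segment + [1] * (maxlen - first_segment)
    return tokens, attn_mask, seg_ids

# Whitespace splitting stands in for the WordPiece tokenizer here.
evidence = "ice sheets are melting".split()
claim = "sea level rises".split()
tokens, attn_mask, seg_ids = layout_pair(evidence, claim, maxlen=12)

print(tokens)
print(attn_mask)
print(seg_ids)
```

As in the notebook's function, padding positions receive segment ID 1. The attention mask is 0 at those positions, so they are masked out of attention either way.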

In [26]:
# Create dataset for evidence classification
class EVCLSDataset(Dataset):
    
    def __init__(self, datasrc, maxlen):
        self.data = datasrc
        self.tokenizer = bert_tokenizer
        self.maxlen = maxlen
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        claim = self.data.loc[index, 'Claim']
        evidence = self.data.loc[index, 'Evidence']
        label = self.data.loc[index, 'Label']

        tokens_ids_t, attn_mask_t, seg_ids_t = convert_input(claim, evidence, self.tokenizer, self.maxlen)
        return tokens_ids_t, attn_mask_t, seg_ids_t, label

In [27]:
# Create training and dev data loader (random sampling)
train_evcls_set = EVCLSDataset(train_evcls, token_len)
dev_evcls_set = EVCLSDataset(dev_evcls, token_len)
train_evcls_loader = DataLoader(train_evcls_set, batch_size = 32, shuffle=True)
dev_evcls_loader = DataLoader(dev_evcls_set, batch_size = 32, shuffle=True)

In [28]:
# Create training and dev data loader (hard negative sampling)
train_evcls_hard_set = EVCLSDataset(train_evcls_hard, token_len)
dev_evcls_hard_set = EVCLSDataset(dev_evcls_hard, token_len)
train_evcls_hard_loader = DataLoader(train_evcls_hard_set, batch_size = 32, shuffle=True)
dev_evcls_hard_loader = DataLoader(dev_evcls_hard_set, batch_size = 32, shuffle=True)

In [29]:
# Create data loader for merged dataset (random sampling)
train_evcls_f_set = EVCLSDataset(train_evcls_f, token_len)
train_evcls_f_loader = DataLoader(train_evcls_f_set, batch_size = 32, shuffle=True)

In [30]:
# Create data loader for merged dataset (hard negative sampling)
train_evcls_hard_f_set = EVCLSDataset(train_evcls_hard_f, token_len)
train_evcls_hard_f_loader = DataLoader(train_evcls_hard_f_set, batch_size = 32, shuffle=True)

In [31]:
# Define the structure of the evidence classification model
class EvidenceClassifier(nn.Module):
    """
    EvidenceClassifier scores an evidence-claim pair, encoded as a single
    BERT input sequence, for whether the evidence is relevant to the claim.

    Attributes:
    - bert_layer (BertModel): Pre-trained BERT model used for extracting sequence representations.
    - cls_layer (nn.Linear): Linear layer for classification based on BERT's [CLS] token representation.

    """

    def __init__(self):
        """
        Initializes the model components.
        """
        super(EvidenceClassifier, self).__init__()
        
        # Load a pre-trained BERT model. This will automatically download the model
        # the first time you run it and will use the 'bert-base-uncased' variant.
        self.bert_layer = BertModel.from_pretrained('bert-base-uncased')
        
        # Define a linear classifier that will be used on top of the BERT [CLS] token representation.
        # 768 is the size of the hidden representation from 'bert-base-uncased' and 
        # we're outputting a single value as our classification result.
        self.cls_layer = nn.Linear(768, 1)

    def forward(self, seq, attn_masks, seg_ids):
        """
        Forward pass for the model.

        Args:
        - seq (torch.Tensor): Input sequence tensor.
        - attn_masks (torch.Tensor): Attention masks to avoid attention to padding tokens.
        - seg_ids (torch.Tensor): Segment IDs used in BERT for distinguishing different sequences.

        Returns:
        - logits (torch.Tensor): Output logits for the classifier.
        """
        
        # Obtain BERT's last hidden state from the transformer.
        # The 'last_hidden_state' is a sequence of hidden states of the last layer of the model.
        outputs = self.bert_layer(seq, attention_mask=attn_masks, token_type_ids=seg_ids, return_dict=True)
        cont_reps = outputs.last_hidden_state
        
        # Extract the [CLS] token's representations which is the first token in BERT's output.
        # This representation is often used for classification tasks.
        cls_rep = cont_reps[:, 0]
        
        # Pass the [CLS] token representation through our classifier
        logits = self.cls_layer(cls_rep)

        return logits

In [32]:
# Compute accuracy for model of evidence classification
def acc_sigmoid(logits, labels):
    """
    Calculate accuracy for sigmoid-based binary classification.
    
    Args:
    - logits (torch.Tensor): The raw model output (before activation function).
    - labels (torch.Tensor): True labels.
    
    Returns:
    - torch.Tensor: Accuracy of the model predictions.
    """
    
    # Apply sigmoid activation function to logits to get probabilities
    probs = torch.sigmoid(logits.unsqueeze(-1))
    
    # Convert probabilities to binary values (0 or 1) using threshold 0.5
    soft_probs = (probs > 0.5).long()
    
    # Compute the accuracy by comparing predicted values with true labels
    acc = (soft_probs.squeeze() == labels).float().mean()
    return acc

# This function evaluates the model performance on a given dataset (e.g., dev set)
def evaluate(model, criterion, devloader, acc_func, cls_type=0):
    """
    Evaluate a model's performance on a dataset.
    
    Args:
    - model (torch.nn.Module): The model to be evaluated.
    - criterion (torch.nn.Module): Loss function.
    - devloader (torch.utils.data.DataLoader): DataLoader for the dataset.
    - acc_func (function): Function to compute accuracy.
    - cls_type (int): Type of classification. 0 for evidence classification, 1 for claim classification. Default is 0.
    
    Returns:
    - Tuple[torch.Tensor, float]: Mean accuracy and mean loss.
    """
    
    # Set the model to evaluation mode (disables dropout)
    model.eval()
    
    # Initialize mean accuracy and mean loss to 0
    mean_acc, mean_loss = 0, 0
    count = 0
    
    # Ensure no gradients are computed to save memory and speed up computation
    with torch.no_grad():
        # Loop through batches of data in the dataset
        for seq, attn_masks, seg_ids, labels in devloader:
            
            # Move tensors to GPU if available
            seq, attn_masks, seg_ids, labels = seq.to(device), attn_masks.to(device), seg_ids.to(device), labels.to(device)
            
            # Get model predictions (logits)
            logits = model(seq, attn_masks, seg_ids)
            
            # Compute the loss using provided criterion
            if cls_type == 0:  # For evidence classification
                mean_loss += criterion(logits.squeeze(-1), labels.float()).item()
            else:  # For claim classification
                mean_loss += criterion(logits, labels).item()
            
            # Update mean accuracy using provided accuracy function
            mean_acc += acc_func(logits, labels)
            count += 1
    
    # Return mean accuracy and mean loss
    return mean_acc / count, mean_loss / count
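
`evaluate` returns the mean of the per-batch accuracies, so a short final batch carries the same weight as a full one. The sketch below contrasts that with a sample-weighted accuracy. The function names and the batch counts are made up for illustration and are not part of the notebook.

```python
def mean_of_batch_means(batches):
    """Average the per-batch accuracies, which is what evaluate() does."""
    return sum(correct / size for correct, size in batches) / len(batches)


def sample_weighted_accuracy(batches):
    """Divide the total number of correct predictions by the total number of samples."""
    total_correct = sum(correct for correct, _ in batches)
    total_size = sum(size for _, size in batches)
    return total_correct / total_size


# Hypothetical (correct, batch_size) pairs; the last batch is smaller than the others.
batches = [(30, 32), (28, 32), (1, 4)]

print(mean_of_batch_means(batches))       # the short batch counts as much as a full one
print(sample_weighted_accuracy(batches))  # every sample counts equally
```

The two numbers agree whenever all batches have the same size, so the difference only matters when the dataset size is not a multiple of the batch size.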

In [33]:
# Encapsulate the training process
# acc_func: accuracy function (sigmoid or softmax)
# cls_type: 0 for evidence classification, 1 for claim classification
# full: True for using the merged dataset
# Based on the code from week 7's workshop
def train(model, criterion, optimizer, train_loader, dev_loader, max_eps, acc_func, name, cls_type=0, full=False):
    """
    Train the given model using the provided data and hyperparameters.

    Parameters:
    - model: The model to be trained.
    - criterion: Loss function used for training.
    - optimizer: Optimization algorithm.
    - train_loader: DataLoader providing training data in batches.
    - dev_loader: DataLoader providing development (validation) data in batches.
    - max_eps: Maximum number of epochs for training.
    - acc_func: Function to compute accuracy.
    - name: Name of the saved model.
    - cls_type: Classifier type (default is 0). Determines how to compute the loss.
    - full: If True, training is done using a merged dataset (default is False).

    Returns:
    None. The model's state_dict is saved when dev accuracy improves, or after every epoch if full is True.
    """
    
    # If using merged dataset, print a message
    if full:
        print("Using merged dataset ...")

    best_acc = 0  # Keep track of the best accuracy seen during training
    st = time.time()  # Start time for tracking duration

    # Iterate over the epochs
    for ep in range(max_eps):
        print(f"Epoch {ep} ...")
        model.train()  # Set the model to training mode
        
        # Iterate over batches of data from the train_loader
        for i, (seq, attn_masks, seg_ids, labels) in enumerate(train_loader):
            optimizer.zero_grad()  # Clear any accumulated gradients
            
            # Move data to GPU for faster computation
            seq, attn_masks, seg_ids, labels = seq.cuda(gpu), attn_masks.cuda(gpu), seg_ids.cuda(gpu), labels.cuda(gpu)
            
            # Forward pass: compute predictions
            logits = model(seq, attn_masks, seg_ids)
            
            # Compute the loss; consider cls_type for loss calculation
            loss = criterion(logits.squeeze(-1), labels.float()) if cls_type == 0 else criterion(logits, labels)
            loss.backward()  # Compute gradient of the loss with respect to model parameters
            optimizer.step()  # Update model parameters

            # Print training stats every 100 iterations
            if i % 100 == 0:
                acc = acc_func(logits, labels)  # Compute accuracy for current batch
                print(f"Iteration {i} of epoch {ep} complete. Loss: {loss.item()}; Accuracy: {acc}; Time: {round((time.time() - st), 2)}s")
                st = time.time()

        # If not in 'full' mode, evaluate the model on the development set after each epoch
        if not full:
            dev_acc, dev_loss = evaluate(model, criterion, dev_loader, acc_func, cls_type=cls_type)
            print(f"\nEpoch {ep} completed. Development Accuracy: {dev_acc}; Development Loss: {dev_loss}\n")
            
            # Save the model if its performance on the dev set has improved
            if dev_acc > best_acc:
                print(f"Best accuracy is improved from {best_acc} to {dev_acc}")
                best_acc = dev_acc
                torch.save(model.state_dict(), f"{outputs_path}{name}.dat")
                print(f"Model is saved to {outputs_path}{name}.dat\n")
        else:
            # If in 'full' mode, save the model after every epoch
            torch.save(model.state_dict(), f"{outputs_path}{name}.dat")
            print(f"Model is saved to {outputs_path}{name}.dat\n")
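
When `full` is False, `train` writes a checkpoint only on a strict improvement in dev accuracy. The sketch below isolates that rule so it is easy to see which epochs would write a file. `epochs_that_save` is an illustrative helper and the accuracies are made up.

```python
def epochs_that_save(dev_accuracies):
    """Return the epochs at which a 'save on strict improvement' rule writes a checkpoint."""
    best_acc = 0
    saved = []
    for ep, acc in enumerate(dev_accuracies):
        if acc > best_acc:  # same comparison as in train()
            best_acc = acc
            saved.append(ep)
    return saved


# Hypothetical dev accuracies for three epochs.
print(epochs_that_save([0.53, 0.52, 0.55]))  # [0, 2]
```

An epoch whose dev accuracy drops leaves the earlier checkpoint in place, so the file on disk always holds the best epoch seen so far.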

In [34]:
# Train: random sampling
evcls_name = "evcls"

# Check if the model file for the evidence classifier already exists.
# If it doesn't, then proceed to initialize and train the model.
if not check_file(f"{outputs_path}{evcls_name}.dat"):
    
    # Instantiate the EvidenceClassifier model.
    evcls_model = EvidenceClassifier()
    
    # Move the model to the specified device, e.g., GPU or CPU.
    evcls_model.to(device)
    
    # Define the loss function for the evidence classifier. 
    # Binary Cross Entropy with Logits Loss is suitable for binary classification tasks.
    evcls_criterion = nn.BCEWithLogitsLoss()
    
    # Define the optimizer for the model, using the Adam optimization algorithm.
    # A learning rate of 2e-5 is specified.
    evcls_optimizer = optim.Adam(evcls_model.parameters(), lr=2e-5)
    
    # Number of training epochs.
    num_epoch = 2
    
    # Train the evidence classifier model.
    # This function is assumed to train the model over the specified epochs, 
    # using the given loss function, optimizer, training data loader, 
    # and development data loader, and then saves the trained model.
    train(evcls_model, evcls_criterion, evcls_optimizer, train_evcls_loader, dev_evcls_loader, num_epoch, acc_sigmoid, evcls_name)

# If the model file already exists, print a message indicating the same.
else:
    print(f"{outputs_path}{evcls_name}.dat exists")

outputs/evcls.dat does not exist.
Epoch 0 ...
Iteration 0 of epoch 0 complete. Loss: 0.8964952230453491; Accuracy: 0.125; Time: 3.22s
Iteration 100 of epoch 0 complete. Loss: 0.011854280717670918; Accuracy: 1.0; Time: 21.02s
Iteration 200 of epoch 0 complete. Loss: 0.006558247376233339; Accuracy: 1.0; Time: 21.0s

Epoch 0 completed. Development Accuracy: 0.9879031777381897; Development Loss: 0.035833008328242405

Best accuracy is improved from 0 to 0.9879031777381897
Model is saved to outputs/evcls.dat

Epoch 1 ...
Iteration 0 of epoch 1 complete. Loss: 0.013068395666778088; Accuracy: 1.0; Time: 15.82s
Iteration 100 of epoch 1 complete. Loss: 0.0016046068631112576; Accuracy: 1.0; Time: 21.06s
Iteration 200 of epoch 1 complete. Loss: 0.001259674783796072; Accuracy: 1.0; Time: 21.02s

Epoch 1 completed. Development Accuracy: 0.9899193048477173; Development Loss: 0.04143776897821696

Best accuracy is improved from 0.9879031777381897 to 0.9899193048477173
Model is saved to outputs/evcls.da

In [35]:
# Train: hard negative sampling
evcls_hard_name = "evcls_hard"
if not check_file(f"{outputs_path}{evcls_hard_name}.dat"):
    evcls_hard_model = EvidenceClassifier()
    evcls_hard_model.to(device)
    evcls_hard_criterion = nn.BCEWithLogitsLoss()
    evcls_hard_optimizer = optim.Adam(evcls_hard_model.parameters(), lr=2e-5)
    num_epoch = 2
    train(evcls_hard_model, evcls_hard_criterion, evcls_hard_optimizer, train_evcls_hard_loader, dev_evcls_hard_loader, num_epoch, acc_sigmoid, evcls_hard_name)
else:
    print(f"{outputs_path}{evcls_hard_name}.dat exists")

outputs/evcls_hard.dat does not exist.
Epoch 0 ...
Iteration 0 of epoch 0 complete. Loss: 0.7445584535598755; Accuracy: 0.375; Time: 0.22s
Iteration 100 of epoch 0 complete. Loss: 0.4154742956161499; Accuracy: 0.75; Time: 21.0s
Iteration 200 of epoch 0 complete. Loss: 0.4395042061805725; Accuracy: 0.78125; Time: 21.0s

Epoch 0 completed. Development Accuracy: 0.8561216592788696; Development Loss: 0.33622004860831844

Best accuracy is improved from 0 to 0.8561216592788696
Model is saved to outputs/evcls_hard.dat

Epoch 1 ...
Iteration 0 of epoch 1 complete. Loss: 0.2198200523853302; Accuracy: 0.90625; Time: 15.81s
Iteration 100 of epoch 1 complete. Loss: 0.16976988315582275; Accuracy: 0.96875; Time: 21.0s
Iteration 200 of epoch 1 complete. Loss: 0.20921653509140015; Accuracy: 0.96875; Time: 20.93s

Epoch 1 completed. Development Accuracy: 0.869043231010437; Development Loss: 0.33505730066568623

Best accuracy is improved from 0.8561216592788696 to 0.869043231010437
Model is saved to out

In [36]:
# Train: random sampling (merged)
evcls_f_name = "evcls_f"
if not check_file(f"{outputs_path}{evcls_f_name}.dat"):
    evcls_f_model = EvidenceClassifier()
    evcls_f_model.to(device)
    evcls_f_criterion = nn.BCEWithLogitsLoss()
    evcls_f_optimizer = optim.Adam(evcls_f_model.parameters(), lr=2e-5)
    num_epoch = 1
    train(evcls_f_model, evcls_f_criterion, evcls_f_optimizer, train_evcls_f_loader, None, num_epoch, acc_sigmoid, evcls_f_name, full=True)
else:
    print(f"{outputs_path}{evcls_f_name}.dat exists")

outputs/evcls_f.dat does not exist.
Using merged dataset ...
Epoch 0 ...
Iteration 0 of epoch 0 complete. Loss: 0.3497883677482605; Accuracy: 1.0; Time: 0.21s
Iteration 100 of epoch 0 complete. Loss: 0.03620237484574318; Accuracy: 0.96875; Time: 21.03s
Iteration 200 of epoch 0 complete. Loss: 0.011501314118504524; Accuracy: 1.0; Time: 21.06s
Model is saved to outputs/evcls_f.dat



In [37]:
# Train: hard negative sampling (merged)
evcls_hard_f_name = "evcls_hard_f"
if not check_file(f"{outputs_path}{evcls_hard_f_name}.dat"):
    evcls_hard_f_model = EvidenceClassifier()
    evcls_hard_f_model.to(device)
    evcls_hard_f_criterion = nn.BCEWithLogitsLoss()
    evcls_hard_f_optimizer = optim.Adam(evcls_hard_f_model.parameters(), lr=2e-5)
    num_epoch = 1
    train(evcls_hard_f_model, evcls_hard_f_criterion, evcls_hard_f_optimizer, train_evcls_hard_f_loader, None, num_epoch, acc_sigmoid, evcls_hard_f_name, full=True)
else:
    print(f"{outputs_path}{evcls_hard_f_name}.dat exists")

outputs/evcls_hard_f.dat does not exist.
Using merged dataset ...
Epoch 0 ...
Iteration 0 of epoch 0 complete. Loss: 0.7082314491271973; Accuracy: 0.46875; Time: 0.23s
Iteration 100 of epoch 0 complete. Loss: 0.44683903455734253; Accuracy: 0.8125; Time: 21.01s
Iteration 200 of epoch 0 complete. Loss: 0.29592233896255493; Accuracy: 0.875; Time: 21.03s
Model is saved to outputs/evcls_hard_f.dat



In [38]:
# Load: random sampling
evcls_path = f"{outputs_path}{evcls_name}.dat"
evcls_model = EvidenceClassifier()
evcls_model.load_state_dict(torch.load(evcls_path))
evcls_model.to(device)
evcls_model.eval()
print(f"Loaded {evcls_path}")

Loaded outputs/evcls.dat


In [39]:
# Load: hard negative sampling
evcls_hard_path = f"{outputs_path}{evcls_hard_name}.dat"
evcls_hard_model = EvidenceClassifier()
evcls_hard_model.load_state_dict(torch.load(evcls_hard_path))
evcls_hard_model.to(device)
evcls_hard_model.eval()
print(f"Loaded {evcls_hard_path}")

Loaded outputs/evcls_hard.dat


In [40]:
# Load: random sampling (merged)
evcls_f_path = f"{outputs_path}{evcls_f_name}.dat"
evcls_f_model = EvidenceClassifier()
evcls_f_model.load_state_dict(torch.load(evcls_f_path))
evcls_f_model.to(device)
evcls_f_model.eval()
print(f"Loaded {evcls_f_path}")

Loaded outputs/evcls_f.dat


In [41]:
# Load: hard negative sampling (merged)
evcls_hard_f_path = f"{outputs_path}{evcls_hard_f_name}.dat"
evcls_hard_f_model = EvidenceClassifier()
evcls_hard_f_model.load_state_dict(torch.load(evcls_hard_f_path))
evcls_hard_f_model.to(device)
evcls_hard_f_model.eval()
print(f"Loaded {evcls_hard_f_path}")

Loaded outputs/evcls_hard_f.dat


#### Evidence retrieval

In [42]:
# Classify if the evidence is relevant to the claim
def evcls(claim, evidence, model, tokenizer):
    """Scores how relevant a piece of evidence is to a claim.

    Args:
    - claim (str): The claim statement.
    - evidence (str): The evidence statement.
    - model (torch.nn.Module): The trained evidence classifier.
    - tokenizer: Tokenizer for the model.

    Returns:
    - torch.Tensor: A scalar tensor holding the sigmoid relevance score (between 0 and 1).
    """
    # Convert the concatenated claim and evidence into model-ready format
    seq, attn_mask, seg_id = convert_input(claim, evidence, tokenizer, token_len)
    
    # Transfer inputs to GPU
    seq = seq.unsqueeze(0).cuda(gpu)
    attn_mask = attn_mask.unsqueeze(0).cuda(gpu)
    seg_id = seg_id.unsqueeze(0).cuda(gpu)
    
    # Model inference
    with torch.no_grad():
        return torch.sigmoid(model(seq, attn_mask, seg_id))[0][0]


# Retrieve evidences for a claim
def ev_retrieve(claim, model, hard_model, n=100, top=5):
    """Retrieves the top pieces of evidence for a given claim.

    Args:
    - claim (str): The claim statement.
    - model: Evidence classifier trained with random sampling.
    - hard_model: Evidence classifier trained with hard negative sampling.
    - n (int, optional): Number of candidates to retrieve using TF-IDF. Defaults to 100.
    - top (int, optional): Number of top evidences to return. Defaults to 5.

    Returns:
    - list: (evidence ID, composite score) tuples, sorted from highest to lowest score.
    """
    # Get n most similar evidences using TF-IDF
    candidates = tfidf_filter(claim, n)
    
    # Get embedding of the claim for cosine similarity calculation
    claim_emd = get_text_embedding([claim], distilbert_model, distilbert_tokenizer, show=False)[0]
    
    # Clean the claim text (NLTK) for Jaccard similarity calculation
    claim_clean = text_clean(claim)
    
    res = []
    for candidate in candidates:
        ev_id = candidate[0]
        ev = evidence_src[ev_id]
        ev_clean = clean_evidence_src[ev_id]
        
        # Calculate various similarity and relevance scores
        tfidf = candidate[1]
        jaccard = jaccard_similarity(claim_clean, ev_clean)
        distil_sim = cosine_similarity([claim_emd], [ev_embeddings[ev_id]]).item()
        
        # Get predictions from the models
        pred = evcls(claim, ev, model=model, tokenizer=bert_tokenizer).item()
        hard_pred = evcls(claim, ev, model=hard_model, tokenizer=bert_tokenizer).item()
        
        # Calculate a composite score based on the individual scores
        score = 0.05 * tfidf + 0.05 * jaccard + 0.2 * distil_sim + 0.4 * pred + 0.3 * hard_pred
        res.append((ev_id, score))
    
    # Sort the results based on the composite score and return the top results
    res = sorted(res, key = lambda x: x[1], reverse=True)
    return res[:top]
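
The composite score in `ev_retrieve` is a weighted sum of five signals whose weights add up to 1. The sketch below reproduces only that arithmetic and the final sort on made-up signal values. The candidate IDs and numbers are hypothetical, and `composite_score` and `rank` are illustrative names.

```python
# Weights copied from ev_retrieve.
WEIGHTS = {"tfidf": 0.05, "jaccard": 0.05, "distil_sim": 0.2, "pred": 0.4, "hard_pred": 0.3}


def composite_score(signals):
    """Weighted sum of the five relevance signals."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


def rank(candidates, top=5):
    """Score every candidate and return the best `top` as (ev_id, score) tuples."""
    scored = [(ev_id, composite_score(sig)) for ev_id, sig in candidates.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top]


# Hypothetical candidates: 101 has strong lexical overlap, 202 has strong classifier scores.
candidates = {
    101: {"tfidf": 0.9, "jaccard": 0.5, "distil_sim": 0.6, "pred": 0.2, "hard_pred": 0.1},
    202: {"tfidf": 0.3, "jaccard": 0.1, "distil_sim": 0.7, "pred": 0.9, "hard_pred": 0.8},
}

for ev_id, score in rank(candidates):
    print(ev_id, round(score, 2))
```

Because the two classifier outputs hold 70% of the weight, candidate 202 ranks first even though 101 has the higher TF-IDF and Jaccard scores.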


In [43]:
# Evaluate the performance of evidence retrieval on a claim in dev set
# show: True for printing stats
# full: True for using models trained on merged dataset
def eval_dev_evcls(index, n=100, top=5, show=True, full=False):
    # Select model
    model = evcls_f_model if full else evcls_model
    hard_model = evcls_hard_f_model if full else evcls_hard_model
    
    claim = dev_claims[index]
    truths = dev_evidences[index]
    evidences = [item[0] for item in ev_retrieve(claim, model, hard_model, n, top)]
    if show:
        print(evidences)
    t = 0
    f = 0
    for truth in truths:
        # Strip the "evidence-" prefix (9 characters) to get the integer ID
        if int(truth[9:]) in evidences:
            t += 1
            if show:
                print(f"In: {truth}")
        else:
            f += 1
            if show:
                print(f"Out: {truth}")
    if show:
        print(f"In: {t}, Out: {f}")
    return t, f

# Retrieve evidences for all claims from a dataset
# check: True for counting correctly retrieved evidences (compares against dev_evidences, so datasrc must be dev_claims)
# full: True for using models trained on merged dataset
def ev_retrieve_src(datasrc, n=100, top=2, check=False, full=False):
    # Select model
    model = evcls_f_model if full else evcls_model
    hard_model = evcls_hard_f_model if full else evcls_hard_model

    evidences = []
    t = 0
    f = 0
    pbar = tqdm(total=len(datasrc), desc="Retrieving evidences", dynamic_ncols=True)
    for i in range(len(datasrc)):
        claim = datasrc[i]
        evidences.append([item[0] for item in ev_retrieve(claim, model, hard_model, n, top)])
        if check:
            truths = dev_evidences[i]
            for truth in truths:
                if int(truth[9:]) in evidences[-1]:
                    t += 1
                else:
                    f += 1
        pbar.update(1)
    pbar.close()
    if check:
        print(f"Total: {t + f}, In: {t}, Out: {f}")
    return evidences
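
The In/Out counts above measure recall only. A per-claim precision, recall and F-score can be computed from the same ingredients, as in the sketch below. The helper names and the example IDs are hypothetical, and the official `eval.py` may compute its F-score differently.

```python
def parse_evidence_id(key):
    """Turn 'evidence-123' into 123, mirroring int(truth[9:]) in the notebook."""
    prefix = "evidence-"
    if not key.startswith(prefix):
        raise ValueError(f"unexpected evidence key: {key!r}")
    return int(key[len(prefix):])


def retrieval_prf(retrieved_ids, truth_keys):
    """Precision, recall and F1 of retrieved integer IDs against ground-truth keys."""
    truth_ids = {parse_evidence_id(k) for k in truth_keys}
    hits = len(truth_ids & set(retrieved_ids))
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(truth_ids) if truth_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical claim: five retrieved IDs, two ground-truth evidences, one overlap.
p, r, f = retrieval_prf([10, 20, 30, 40, 50], ["evidence-20", "evidence-99"])
print(p, r, f)
```

Retrieving five evidences per claim caps precision at the number of ground-truth evidences divided by five, which is one reason to tune `top` on the dev set.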

### Section II: Claim Classification

In [44]:
label2id = {'SUPPORTS': 0, 'REFUTES': 1, 'NOT_ENOUGH_INFO': 2, 'DISPUTED': 3}
id2label = {0: 'SUPPORTS', 1: 'REFUTES', 2: 'NOT_ENOUGH_INFO', 3: 'DISPUTED'}
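
The two dictionaries must stay inverse to each other. One way to guarantee that, shown here as an optional sketch, is to derive `id2label` from `label2id` instead of writing both by hand.

```python
label2id = {'SUPPORTS': 0, 'REFUTES': 1, 'NOT_ENOUGH_INFO': 2, 'DISPUTED': 3}

# Invert the mapping so the two dictionaries cannot drift apart.
id2label = {idx: label for label, idx in label2id.items()}

print(id2label)
```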

#### BERT for claim classification

In [45]:
# Create data for claim classification
def create_claimcls_data(claims, evs, labels):
    claim_list = []
    ev_list = []
    label_list = []
    
    for i, claim in enumerate(claims):
        ev_ids = evs[i]
        for num in ev_ids:
            ev_id = int(num[9:])
            evidence_text = evidence_src[ev_id]
            claim_list.append(claim)
            ev_list.append(evidence_text)
            label_list.append(labels[i])

    return pd.DataFrame({'Claim': claim_list, 'Evidence': ev_list, 'Label': label_list})
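
`create_claimcls_data` produces one row per (claim, evidence) pair, and every pair inherits the label of its claim. The sketch below shows the same flattening on toy data using plain dictionaries instead of a DataFrame. The claims, evidence texts and the `flatten_pairs` helper are made up for illustration.

```python
def flatten_pairs(claims, evs, labels, evidence_lookup):
    """Expand each claim into one row per ground-truth evidence."""
    rows = []
    for claim, ev_keys, label in zip(claims, evs, labels):
        for key in ev_keys:
            ev_id = int(key[len("evidence-"):])  # "evidence-7" -> 7
            rows.append({"Claim": claim, "Evidence": evidence_lookup[ev_id], "Label": label})
    return rows


# Toy inputs: the first claim has two evidences, the second has one.
toy_claims = ["c1", "c2"]
toy_evs = [["evidence-1", "evidence-2"], ["evidence-2"]]
toy_labels = ["SUPPORTS", "REFUTES"]
toy_evidence = {1: "e one", 2: "e two"}

rows = flatten_pairs(toy_claims, toy_evs, toy_labels, toy_evidence)
for row in rows:
    print(row)
```

Claims with many evidences therefore contribute more training rows than claims with few, which skews the label distribution towards labels that tend to come with more evidence.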

In [46]:
# Create training and development datasets for claim classification
train_claimcls = create_claimcls_data(train_claims, train_evidences, train_labels)
dev_claimcls = create_claimcls_data(dev_claims, dev_evidences, dev_labels)

In [47]:
# Merge training and development datasets for claim classification
# (outer join on all three columns, so rows from both sets are kept)
train_claimcls_f = pd.merge(train_claimcls, dev_claimcls, on=['Claim', 'Evidence', 'Label'], how='outer')

In [48]:
# Create dataset for claim classification
class ClaimCLSDataset(Dataset):

    def __init__(self, data, maxlen):
        """
        Constructor for the ClaimCLSDataset
        
        Args:
        - data (pd.DataFrame): DataFrame containing the data. 
                               Columns should include 'Claim', 'Evidence' and 'Label'
        - maxlen (int): Maximum length of the tokenized input (Claim + Evidence). 
                        Helps in padding/truncating.
        
        Attributes:
        - df (pd.DataFrame): Dataframe containing the data
        - tokenizer (Tokenizer): BERT tokenizer
        - maxlen (int): Maximum length of the tokenized input.
        """
        
        self.df = data
        self.tokenizer = bert_tokenizer  # Assuming bert_tokenizer is pre-defined
        self.maxlen = maxlen

    def __len__(self):
        """
        Returns the number of samples in the dataset.
        
        Returns:
        - int: Number of samples in the dataset
        """
        
        return len(self.df)

    def __getitem__(self, index):
        """
        Fetches the item at a given index from the dataframe and returns tokenized Claim, 
        Evidence, and their corresponding Label in tensor format.
        
        Args:
        - index (int): Index of the sample to fetch
        
        Returns:
        - tokens_ids_t (tensor): Tokenized and padded ids of the claim and evidence
        - attn_mask_t (tensor): Attention mask indicating real tokens (1) vs padded tokens (0)
        - seg_ids_t (tensor): Segment ids to distinguish between claim and evidence
        - label (tensor): Label of the given sample as tensor
        """
        
        claim = self.df.loc[index, 'Claim']
        evidence = self.df.loc[index, 'Evidence']
        # Convert the label to its corresponding id based on a pre-defined dictionary
        label = torch.tensor(label2id[self.df.loc[index, 'Label']])
        
        # Convert the claim and evidence to the required format for BERT
        tokens_ids_t, attn_mask_t, seg_ids_t = convert_input(claim, evidence, self.tokenizer, self.maxlen)
        
        return tokens_ids_t, attn_mask_t, seg_ids_t, label
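
The three tensors returned by `__getitem__` follow the usual BERT sentence-pair layout. The sketch below builds that layout from already-tokenized ID lists so the role of each tensor is visible. It is not the notebook's `convert_input`, which is defined earlier and may differ (for example in how it truncates). `build_pair_input` and the token IDs 7, 8, 9 are hypothetical.

```python
# Special token IDs used by bert-base-uncased.
CLS, SEP, PAD = 101, 102, 0


def build_pair_input(claim_ids, evidence_ids, maxlen):
    """Lay out [CLS] claim [SEP] evidence [SEP], then truncate and pad to maxlen."""
    tokens = [CLS] + claim_ids + [SEP] + evidence_ids + [SEP]
    # Segment 0 covers [CLS], the claim and the first [SEP]; segment 1 covers the rest.
    seg_ids = [0] * (len(claim_ids) + 2) + [1] * (len(evidence_ids) + 1)

    # Naive truncation: simply cut at maxlen (this can drop the final [SEP]).
    tokens, seg_ids = tokens[:maxlen], seg_ids[:maxlen]

    attn_mask = [1] * len(tokens)  # 1 marks real tokens
    pad = maxlen - len(tokens)
    return tokens + [PAD] * pad, attn_mask + [0] * pad, seg_ids + [0] * pad


tokens, attn_mask, seg_ids = build_pair_input([7, 8], [9], maxlen=8)
print(tokens)     # [101, 7, 8, 102, 9, 102, 0, 0]
print(attn_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
print(seg_ids)    # [0, 0, 0, 0, 1, 1, 0, 0]
```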

In [49]:
# Create training and development dataloaders for claim classification
train_claimcls_set = ClaimCLSDataset(train_claimcls, token_len)
dev_claimcls_set = ClaimCLSDataset(dev_claimcls, token_len)
train_claimcls_loader = DataLoader(train_claimcls_set, batch_size = 32, shuffle=True)
dev_claimcls_loader = DataLoader(dev_claimcls_set, batch_size = 32, shuffle=True)

In [50]:
# Create dataloader for merged dataset
train_claimcls_f_set = ClaimCLSDataset(train_claimcls_f, token_len)
train_claimcls_f_loader = DataLoader(train_claimcls_f_set, batch_size = 32, shuffle=True)

In [51]:
# Define the structure of the model for Claim classification
class ClaimClassifier(nn.Module):

    def __init__(self):
        super(ClaimClassifier, self).__init__()
        self.bert_layer = BertModel.from_pretrained('bert-base-uncased')
        self.cls_layer = nn.Linear(768, 4)

    def forward(self, seq, attn_masks, seg_ids):
        outputs = self.bert_layer(seq, attention_mask = attn_masks, token_type_ids = seg_ids, return_dict=True)
        cont_reps = outputs.last_hidden_state
        cls_rep = cont_reps[:, 0]
        logits = self.cls_layer(cls_rep)
        return logits

In [52]:
# Compute accuracy for model of claim classification
def acc_softmax(logits, labels):
    probs = torch.softmax(logits, dim=1)
    _, pred_labels = torch.max(probs, dim=1)
    total = labels.size(0)
    correct = torch.sum(pred_labels == labels).item()
    acc = correct / total
    return acc
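
Softmax preserves the ordering of the logits, so the predicted label is the same whether the argmax is taken over logits or over probabilities. The pure-Python sketch below checks this on made-up logits. The `softmax` and `argmax` helpers are written here for illustration and stand in for the `torch` calls.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logits for the four claim labels.
logits = [1.2, -0.3, 2.5, 0.1]
probs = softmax(logits)

print(argmax(logits), argmax(probs))  # both select index 2
```

The softmax call in `acc_softmax` is therefore optional for computing accuracy; it matters only when the probabilities themselves are needed.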

In [53]:
# Train
claimcls_name = "claimcls"
if not check_file(f"{outputs_path}{claimcls_name}.dat"):
    claimcls_model = ClaimClassifier()
    claimcls_model.to(device)
    claimcls_criterion = nn.CrossEntropyLoss()
    claimcls_optimizer = optim.Adam(claimcls_model.parameters(), lr=2e-5)
    num_epoch = 2
    train(claimcls_model, claimcls_criterion, claimcls_optimizer, train_claimcls_loader, dev_claimcls_loader, num_epoch, acc_softmax, claimcls_name, cls_type=1)
else:
    print(f"{outputs_path}{claimcls_name}.dat exists")

outputs/claimcls.dat does not exist.
Epoch 0 ...
Iteration 0 of epoch 0 complete. Loss: 1.5290237665176392; Accuracy: 0.1875; Time: 0.25s
Iteration 100 of epoch 0 complete. Loss: 0.9139746427536011; Accuracy: 0.59375; Time: 22.38s

Epoch 0 completed. Development Accuracy: 0.5323153409090909; Development Loss: 1.1569896638393402

Best accuracy is improved from 0 to 0.5323153409090909
Model is saved to outputs/claimcls.dat

Epoch 1 ...
Iteration 0 of epoch 1 complete. Loss: 0.7241110801696777; Accuracy: 0.75; Time: 9.33s
Iteration 100 of epoch 1 complete. Loss: 0.4082818627357483; Accuracy: 0.84375; Time: 23.8s

Epoch 1 completed. Development Accuracy: 0.5220170454545454; Development Loss: 1.50575041025877



In [54]:
# Train on merged dataset
claimcls_f_name = "claimcls_f"
if not check_file(f"{outputs_path}{claimcls_f_name}.dat"):
    claimcls_f_model = ClaimClassifier()
    claimcls_f_model.to(device)
    claimcls_f_criterion = nn.CrossEntropyLoss()
    claimcls_f_optimizer = optim.Adam(claimcls_f_model.parameters(), lr=2e-5)
    num_epoch = 1
    train(claimcls_f_model, claimcls_f_criterion, claimcls_f_optimizer, train_claimcls_f_loader, None, num_epoch, acc_softmax, claimcls_f_name, cls_type=1, full=True)
else:
    print(f"{outputs_path}{claimcls_f_name}.dat exists")

outputs/claimcls_f.dat does not exist.
Using merged dataset ...
Epoch 0 ...
Iteration 0 of epoch 0 complete. Loss: 1.494035243988037; Accuracy: 0.15625; Time: 0.26s
Iteration 100 of epoch 0 complete. Loss: 0.8271755576133728; Accuracy: 0.71875; Time: 22.9s
Model is saved to outputs/claimcls_f.dat



In [55]:
# Load model for claim classification
claimcls_path = f"{outputs_path}{claimcls_name}.dat"
claimcls_model = ClaimClassifier()
claimcls_model.load_state_dict(torch.load(claimcls_path))
claimcls_model.to(device)
claimcls_model.eval()
print(f"Loaded {claimcls_path}")

Loaded outputs/claimcls.dat


In [56]:
# Load model for claim classification (merged dataset)
claimcls_f_path = f"{outputs_path}{claimcls_f_name}.dat"
claimcls_f_model = ClaimClassifier()
claimcls_f_model.load_state_dict(torch.load(claimcls_f_path))
claimcls_f_model.to(device)
claimcls_f_model.eval()
print(f"Loaded {claimcls_f_path}")

Loaded outputs/claimcls_f.dat


#### Claim classification

In [57]:
# Classify a claim
def claimcls(claim, evidences, model, tokenizer):
    """
    Classify a claim based on provided evidences using a given model and tokenizer.
    
    Parameters:
    - claim (str): The statement/claim to be classified.
    - evidences (list): Integer evidence IDs (keys into evidence_src).
    - model (torch.nn.Module): The fine-tuned claim classifier.
    - tokenizer (object): Tokenizer to process the claim and evidences.
    
    Returns:
    int: The predicted label ID of the claim (see label2id).
    """

    res = []  # To store the predicted labels for each evidence

    # Iterate over each evidence
    for ev in evidences:
        # Convert the claim and evidence to model input format
        seq, attn_mask, seg_id = convert_input(claim, evidence_src[ev], tokenizer, token_len)

        # Move the inputs to GPU for faster computation
        seq = seq.unsqueeze(0).cuda(gpu)
        attn_mask = attn_mask.unsqueeze(0).cuda(gpu)
        seg_id = seg_id.unsqueeze(0).cuda(gpu)

        # Use the model to make a prediction
        with torch.no_grad():  # Disable gradient computation for inference
            logits = model(seq, attn_mask, seg_id)  # Get the model's raw outputs (logits)
            _, pred = torch.max(torch.softmax(logits, dim=1), dim=1)  # Convert logits to probabilities and get the label with highest probability
            pred = pred.item()  # Convert tensor to integer

            # If the prediction is not NOT_ENOUGH_INFO
            if pred != 2:
                res.append(pred)

    # If none of 'SUPPORTS', 'REFUTES', 'DISPUTED' are in the list, label it as NOT_ENOUGH_INFO
    if len(res) == 0:
        return 2

    # Return the most common prediction among all evidences
    return Counter(res).most_common(1)[0][0]
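
The aggregation rule in `claimcls` is a majority vote over the per-evidence predictions after discarding NOT_ENOUGH_INFO. The sketch below isolates that rule on hand-written prediction lists. `aggregate` is an illustrative name and the inputs are made up.

```python
from collections import Counter

NOT_ENOUGH_INFO = 2  # label ID from label2id


def aggregate(per_evidence_preds):
    """Majority vote over per-evidence label IDs, ignoring NOT_ENOUGH_INFO."""
    votes = [p for p in per_evidence_preds if p != NOT_ENOUGH_INFO]
    if not votes:
        return NOT_ENOUGH_INFO
    return Counter(votes).most_common(1)[0][0]


print(aggregate([2, 2, 2]))     # 2: no informative vote at all
print(aggregate([0, 2, 1, 0]))  # 0: SUPPORTS wins two votes to one
print(aggregate([2, 2, 2, 3]))  # 3: a single DISPUTED vote outweighs any number of 2s
print(aggregate([1, 0]))        # 1: ties go to the label seen first
```

Two consequences follow from this design: NOT_ENOUGH_INFO is predicted only when every evidence agrees on it, and ties depend on the order of the retrieved evidences.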

In [58]:
# Evaluate performance on a claim in dev set
# show: True for printing prediction and truth
def eval_dev_claimcls(index, ev_retrieve, model, tokenizer, show=True):
    """
    Evaluate a claim's classification against the actual label.

    Parameters:
    - index (int): Index of the claim in the development dataset.
    - ev_retrieve (list): Retrieved evidence IDs for every dev claim, indexed like dev_claims (shadows the ev_retrieve function inside this scope).
    - model: Pre-trained or fine-tuned model for claim classification.
    - tokenizer: Tokenizer corresponding to the model for text processing.
    - show (bool, optional): If True, prints prediction and truth. Defaults to True.

    Returns:
    - int: Model's prediction for the given claim.
    """
    pred = claimcls(dev_claims[index], ev_retrieve[index], model, tokenizer)
    truth = label2id[dev_labels[index]]
    if show:
        print(f"Pred: {pred}, Truth: {truth}")
    return pred

def claimcls_src(claims, evs, model, tokenizer, check=False):
    """
    Classify all claims in the dataset based on the retrieved evidences.

    Parameters:
    - claims (list): List of claims.
    - evs (list): Corresponding list of evidences for each claim.
    - model: Pre-trained or fine-tuned model for claim classification.
    - tokenizer: Tokenizer corresponding to the model for text processing.
    - check (bool, optional): If True, checks accuracy of predictions against dev_labels (only valid when claims is the dev set). Defaults to False.

    Returns:
    - list: List of predictions for each claim.
    """
    preds = []
    t = 0
    f = 0
    # Initializing the progress bar for visual feedback during prediction
    pbar = tqdm(total=len(claims), desc="Predicting claims", dynamic_ncols=True)

    for i in range(len(claims)):
        pred = claimcls(claims[i], evs[i], model, tokenizer)
        preds.append(pred)
        if check:
            # Verifying the prediction against the actual label for accuracy calculation
            truth = label2id[dev_labels[i]]
            if pred == truth:
                t += 1
            else:
                f += 1
        pbar.update(1)

    pbar.close()
    # Displaying summary of the prediction results if check is True
    if check:
        print(f"Total: {t + f}, Correct: {t}, Wrong: {f}, Accuracy: {round(t / (t + f), 2)}")
    return preds

### Section III: Evaluation & Prediction

In [59]:
# Format evidences for saving
def format_evidences(evidences):
    res = []
    for ev in evidences:
        res.append(f"evidence-{ev}")
    return res

# Formatting and saving results
def save_results(claim_ids, claims, labels, evidences, filename):
    data = {}
    for i in range(len(claim_ids)):
        claim_id = claim_ids[i]
        claim_text = claims[i]
        claim_label = id2label[labels[i]]
        evs = format_evidences(evidences[i])
        data[claim_id] = {"claim_text": claim_text,
                          "claim_label": claim_label,
                          "evidences": evs}

    with open(filename, "w") as f:
        json.dump(data, f)
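
`save_results` writes one JSON object keyed by claim ID. The sketch below builds a single record in the same shape and round-trips it through a temporary file. The claim ID `claim-0`, the claim text and the evidence IDs are hypothetical, and `to_record` is an illustrative helper.

```python
import json
import os
import tempfile

id2label = {0: 'SUPPORTS', 1: 'REFUTES', 2: 'NOT_ENOUGH_INFO', 3: 'DISPUTED'}


def to_record(claim_text, label_id, evidence_ids):
    """Build one entry in the same shape that save_results writes."""
    return {
        "claim_text": claim_text,
        "claim_label": id2label[label_id],
        "evidences": [f"evidence-{ev}" for ev in evidence_ids],
    }


# Hypothetical claim ID and content.
data = {"claim-0": to_record("Example claim.", 0, [12, 345])}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pred.json")
    with open(path, "w") as f:
        json.dump(data, f, indent=2)
    with open(path) as f:
        loaded = json.load(f)

print(json.dumps(loaded, indent=2))
```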

#### Evaluation on dev set

In [60]:
# Retrieve evidences for claims in dev set
dev_ev_retrieve = ev_retrieve_src(dev_claims, 150, 5, check=True, full=False)

Retrieving evidences:   0%|          | 0/154 [00:00<?, ?it/s]

Total: 491, In: 115, Out: 376


In [61]:
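# Classify claims in dev set
# Note: this rebinds dev_claimcls, which earlier held the DataFrame of dev claim-evidence pairs, to the list of predicted label IDs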
dev_claimcls = claimcls_src(dev_claims, dev_ev_retrieve, claimcls_model, bert_tokenizer, check=True)

Predicting claims:   0%|          | 0/154 [00:00<?, ?it/s]

Total: 154, Correct: 77, Wrong: 77, Accuracy: 0.5


In [62]:
dev_pred_output = "dev-pred.json"
dev_save_path = f"{prediction_path}{dev_pred_output}"
save_results(dev_claim_ids, dev_claims, dev_claimcls, dev_ev_retrieve, dev_save_path)
print(f"Saved to {dev_save_path}")

Saved to prediction/dev-pred.json


In [63]:
if check_file("eval.py"):
    !python eval.py --predictions prediction/dev-pred.json --groundtruth data/dev-claims.json

Evidence Retrieval F-score (F)    = 0.17878787878787883
Claim Classification Accuracy (A) = 0.5
Harmonic Mean of F and A          = 0.2633928571428572
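
The last line of the output can be reproduced from the two numbers above it, assuming `eval.py` uses the standard harmonic mean 2FA / (F + A). `harmonic_mean` is an illustrative helper, not a function from `eval.py`.

```python
def harmonic_mean(f, a):
    """Standard harmonic mean of two non-negative numbers."""
    return 2 * f * a / (f + a) if (f + a) > 0 else 0.0


f_score = 0.17878787878787883  # Evidence Retrieval F-score reported above
accuracy = 0.5                 # Claim Classification Accuracy reported above

print(harmonic_mean(f_score, accuracy))  # approximately 0.2634
```

The harmonic mean sits much closer to the smaller of the two values, so on these numbers evidence retrieval is what holds the combined score down.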


#### Prediction on test set

In [64]:
test_path = f"{data_dir}test-claims-unlabelled.json"
test_claim_ids, test_claims = load_data(test_path)

In [65]:
# Retrieve evidences for claims in test set
# test_ev_retrieve = ev_retrieve_src(test_claims, 150, 5, full=False)
test_ev_retrieve = ev_retrieve_src(test_claims, 150, 5, full=True)

Retrieving evidences:   0%|          | 0/153 [00:00<?, ?it/s]

In [66]:
# Classify claims in test set
# test_claimcls = claimcls_src(test_claims, test_ev_retrieve, claimcls_model, bert_tokenizer)
test_claimcls = claimcls_src(test_claims, test_ev_retrieve, claimcls_f_model, bert_tokenizer)

Predicting claims:   0%|          | 0/153 [00:00<?, ?it/s]

In [67]:
test_pred_output = "test-claims-predictions.json"
test_save_path = f"{prediction_path}{test_pred_output}"
save_results(test_claim_ids, test_claims, test_claimcls, test_ev_retrieve, test_save_path)
print(f"Saved to {test_save_path}")

Saved to prediction/test-claims-predictions.json


In [68]:
def test_insight(index):
    print(f"Claim: {test_claims[index]}")
    for ev in test_ev_retrieve[index]:
        print(evidence_src[ev])
    print(f"Prediction: {id2label[test_claimcls[index]]}")

In [69]:
test_insight(0)

Claim: The contribution of waste heat to the global climate is 0.028 W/m2.
Global forcing from waste heat was 0.028 W/m2 in 2005.
The global temperature increase since the beginning of the industrial period (taken as 1750) is about 0.8 °C (1.4 °F), and the radiative forcing due to CO 2 and other long-lived greenhouse gases – mainly methane, nitrous oxide, and chlorofluorocarbons – emitted since that time is about 2.6 W/m2.
Taking planetary heat uptake rate as the rate of ocean heat uptake estimated by the IPCC AR4 as 0.2 W/m2, yields a value for S of 2.1 °C (3.8 °F).
Without feedbacks the radiative forcing of approximately 3.7 W/m2, due to doubling CO 2 from the pre-industrial 280 ppm, would eventually result in roughly 1 °C global warming.
Solar irradiance is about 0.9 W/m2 brighter during solar maximum than during solar minimum, which correlated in measured average global temperature over the period 1959-2004.
Prediction: SUPPORTS


### End