#### COMP90042 2023 Project: Automated Fact Checking For Climate Science Claims

The goal of this project is to design and implement a system that takes a `claim` (a single sentence), retrieves one or more `evidence passages` (each also a single sentence) from a document store, and then uses these passages to classify the claim with one of four labels: `[SUPPORTS, REFUTES, NOT_ENOUGH_INFO, DISPUTED]`.

Our training data consists of a document store, which is a set of $D$ evidence passages `{evidence-1, evidence-2, ..., evidence-D}`, together with training and validation instances. Each instance is a tuple of the form `(claim_text, claim_label, evidence_list)`, where `evidence_list` is a subset of the passages in the document store.
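
As a rough illustration of this layout, the sketch below builds a tiny stand-in for the document store and one claim instance. The field names `claim_text`, `claim_label` and `evidences` are the ones used by the cells later in this notebook; the ids and sentences are hypothetical placeholders, not project data.

```python
# Hypothetical stand-ins: the ids and sentences are placeholders, not real project data.
document_store = {
    "evidence-0": "First evidence sentence.",
    "evidence-1": "Second evidence sentence.",
}

claims = {
    "claim-0": {  # hypothetical claim id
        "claim_text": "A single-sentence claim.",
        "claim_label": "SUPPORTS",
        "evidences": ["evidence-1"],
    },
}

LABELS = {"SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO", "DISPUTED"}

# Sanity checks: every label is valid and every evidence id exists in the store.
for claim_id, claim in claims.items():
    assert claim["claim_label"] in LABELS
    assert all(e in document_store for e in claim["evidences"])
```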

In this notebook, we explore the use of `BM25` for passage retrieval and evaluate its performance on the project dataset. Evaluation involves comparing the retrieved evidence passages for each data instance with the ground truth evidence list and computing precision, recall and F1 score. We then use the `average F1 score` across all data instances as the final evaluation metric.
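
A minimal sketch of this metric, assuming retrieved and gold evidence are given as lists of passage ids. The helper names `retrieval_prf` and `average_f1` are hypothetical and the ids are toy values.

```python
def retrieval_prf(retrieved_ids, gold_ids):
    """Precision, recall and F1 for one claim's retrieved evidence ids."""
    retrieved, gold = set(retrieved_ids), set(gold_ids)
    hits = len(retrieved & gold)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    # F1 is zero when nothing relevant was retrieved
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def average_f1(all_retrieved, all_gold):
    """Mean of the per-claim F1 scores across all instances."""
    scores = [retrieval_prf(r, g)[2] for r, g in zip(all_retrieved, all_gold)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy ids, not taken from the project data.
retrieved = [["evidence-1", "evidence-2"], ["evidence-3"]]
gold = [["evidence-2", "evidence-9"], ["evidence-4"]]
print(average_f1(retrieved, gold))
```

Averaging per-claim F1 gives every claim equal weight, regardless of how many gold passages it has.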

In [1]:
from nltk.tokenize import RegexpTokenizer
from collections import defaultdict
from unidecode import unidecode
import math, random
from tqdm import tqdm
import json
import numpy as np

#### Below is my implementation of a simple BM25 retrieval system. I will later replace this with the PyLucene implementation.
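
Reading the scoring off the `IR_System` code in the next cell, it appears to compute the following. This is a sketch of what this particular implementation does, not the textbook Okapi BM25 definition, which commonly uses a different IDF term and no norm division. The notation is introduced here for illustration: $\mathrm{tf}(w,d)$ is the raw count of term $w$ in document $d$, $|d|$ is the document length in tokens, $\overline{|d|}$ is the average document length, $\mathrm{df}(w)$ is the number of documents containing $w$, $N$ is the number of documents, and $\lVert \mathbf{v}_d \rVert$ is the per-document norm computed in `create_inverted_index`.

$$
\mathrm{tf}'(w,d) = \frac{\mathrm{tf}(w,d)\,(k+1)}{\mathrm{tf}(w,d) + k\left(1 - b + b\,\frac{|d|}{\overline{|d|}}\right)},
\qquad
\mathrm{idf}(w) = \log_{10}\frac{N}{\mathrm{df}(w)}
$$

$$
\mathrm{score}(q,d) = \frac{1}{\lVert \mathbf{v}_d \rVert}\sum_{w \in q} \mathrm{tf}'(w,d)\,\mathrm{idf}(w)
$$

With the defaults $k = 1.25$ and $b = 0.75$, the weighted term frequency $\mathrm{tf}'$ is bounded above by $k + 1 = 2.25$.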

In [2]:
class IR_System():
    def __init__(self, k = 1.25, b = 0.75):
        self.k = k
        self.b = b
        self.tokenizer = RegexpTokenizer(r'\w+') 

    def train(self, documents):
        self.documents = documents
        self.TFIDF, self.inverted_index, self.doc_tfidf_norms = self.create_inverted_index()
        
    def tokenize(self, sent):
        # Replace accented letters with regular letters
        sent = unidecode(sent)
        # tokenize into words, remove punctuation
        return self.tokenizer.tokenize(sent.lower())

    def create_inverted_index(self):
        N = len(self.documents)
        TFIDF = defaultdict(float)
        inverted_index = defaultdict(list)

        # compute term frequency and document frequencies
        TF, term_docs = self.compute_TF_weighted()

        # create inverted index
        print(f"Computing TFIDF and creating inverted index...")
        for w, docs in tqdm(term_docs.items(), total=len(term_docs)):
            for d in sorted(list(docs)):
                tfidf = TF[(w,d)] * math.log10(N/len(docs))
                inverted_index[w].append(d)
                TFIDF[(w,d)] = tfidf

        # compute document TFIDF vector norms
        print(f"Computing TFIDF vector norms...")
        doc_tfidf_norms = [0] * N
        for d, doc in tqdm(enumerate(self.documents), total=len(self.documents)):
            # each distinct term contributes once to the squared vector norm
            words = set(self.tokenize(doc))
            for w in words:
                doc_tfidf_norms[d] = doc_tfidf_norms[d] +  TFIDF[(w,d)]**2
            doc_tfidf_norms[d] = math.sqrt(doc_tfidf_norms[d])

        return TFIDF, inverted_index, doc_tfidf_norms  
          
    # weighted TF for BM25
    def compute_TF_weighted(self):
        TF = defaultdict(int)
        term_docs = defaultdict(set)
        doc_length = defaultdict(float)
        Dtotal = 0
        print(f"Computing TFIDF...")
        for d, doc in tqdm(enumerate(self.documents), total=len(self.documents)):
            words = self.tokenize(doc)
            for w in words:
                TF[(w, d)] += 1
                term_docs[w].add(d)
            doc_length[d] = len(words)
            Dtotal += len(words)
        Davg = Dtotal / len(self.documents)

        # compute BM25 weighted term frequencies
        TF_weighted = defaultdict(float)
        for (w,d), tf in TF.items():
            TF_weighted[(w,d)] = (tf * (self.k + 1)) / (tf + self.k * (1 - self.b + self.b * (doc_length[d]/Davg)))
        return TF_weighted, term_docs
    

    def retrieve_docs(self, query, topk=1):
        # tokenize the query exactly like the documents (unidecode + lowercase)
        query_words = self.tokenize(query)
        #print(f"query words: {query_words}")
        # get all documents which contain words from query
        docs = []
        for w in query_words:
            docs.extend([d for d in self.inverted_index[w]])
        # remove duplicates
        docs = list(set(docs))
        #print(f"docs: {docs}")    
        # score all these documents
        scores = np.zeros(len(docs))
        for i in range(len(docs)):
            d = docs[i]
            for w in query_words:
                scores[i] += self.TFIDF[(w,d)]
            scores[i] = scores[i] / self.doc_tfidf_norms[d]        
        #print(f"scores: {scores}")  

        sorted_indices = np.argsort(scores)[::-1]
        best_indices = sorted_indices[:topk]
        best_scores = scores[best_indices]
        topk_doc_indices = [docs[idx] for idx in best_indices]
        return topk_doc_indices, best_scores
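
To make the effect of `k` and `b` more concrete, here is a small standalone sketch of the weighted term frequency from `compute_TF_weighted`. The function name `bm25_tf` and the toy lengths are assumptions made for illustration; the formula and default parameter values mirror the class above.

```python
def bm25_tf(tf, doc_len, avg_len, k=1.25, b=0.75):
    """BM25-weighted term frequency, same expression as in compute_TF_weighted."""
    return (tf * (k + 1)) / (tf + k * (1 - b + b * doc_len / avg_len))


# Saturation: repeated occurrences add less and less weight (never more than k + 1).
for tf in (1, 2, 5, 20):
    print(f"tf={tf:2d}  weight={bm25_tf(tf, doc_len=20, avg_len=20):.3f}")

# Length normalization: the same count is worth less in a longer-than-average passage.
print(bm25_tf(2, doc_len=40, avg_len=20) < bm25_tf(2, doc_len=20, avg_len=20))
```

Because evidence passages are single sentences of fairly similar length, the saturation effect is probably the more relevant of the two here, though that is a guess rather than something measured in this notebook.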


#### Let's load up the data, both document store (`evidence.json`) and training/validation instances (`train-claims.json`, `dev-claims.json`).

In [3]:
# load the evidence passages
with open("project-data/evidence.json", "r") as evidence_file:
    document_store = json.load(evidence_file)
print(f"Number of evidence passages: {len(document_store)}")

# load the training data instances
with open("project-data/train-claims.json", "r") as train_file:
    train_data = json.load(train_file)
print(f"Number of training instances: {len(train_data)}")

# load the validation data instances
with open("project-data/dev-claims.json", "r") as dev_file:
    val_data = json.load(dev_file)    
print(f"Number of validation instances: {len(val_data)}")

Number of evidence passages: 1208827
Number of training instances: 1228
Number of validation instances: 154


In [4]:
# examples of some evidence passages
j = 0
for i, evidence_text in document_store.items():
    print(f"{i}: {evidence_text}")
    j += 1
    if j == 10:
        break

# examples of some training instances
print("")
j = 0
for i, claim in train_data.items():
    print(f"{i}: {claim}")
    j += 1
    if j == 10:
        break
    

evidence-0: John Bennet Lawes, English entrepreneur and agricultural scientist
evidence-1: Lindberg began his professional career at the age of 16, eventually moving to New York City in 1977.
evidence-2: ``Boston (Ladies of Cambridge)'' by Vampire Weekend
evidence-3: Gerald Francis Goyer (born October 20, 1936) was a professional ice hockey player who played 40 games in the National Hockey League.
evidence-4: He detected abnormalities of oxytocinergic function in schizoaffective mania, post-partum psychosis and how ECT modified oxytocin release.
evidence-5: With peak winds of 110 mph (175 km/h) and a minimum pressure of 972 mbar (hPa ; 28.71 inHg), Florence was the strongest storm of the 1994 Atlantic hurricane season.
evidence-6: He is currently a professor of piano at the University of Wisconsin -- Madison since August 2000.
evidence-7: In addition to known and tangible risks, unforeseeable black swan extinction events may occur, presenting an additional methodological problem.
evide

#### Let's train a BM25 model on this document store

In [5]:
# instantiate and train a retriever
retreiver = IR_System()
passages = list(document_store.values()) 
retreiver.train(passages)   

Computing TFIDF...


100%|██████████| 1208827/1208827 [00:25<00:00, 46668.53it/s]


Computing TFIDF and creating inverted index...


100%|██████████| 576915/576915 [00:34<00:00, 16624.13it/s] 


Computing TFIDF vector norms...


100%|██████████| 1208827/1208827 [00:20<00:00, 60324.22it/s]


#### Now let's run some tests. First, we pick a few example training instances and compare the retrieved passages with the ground truth evidence list. We also have to pick a value for the top-k parameter of the retriever; the example below uses `topk=20`.

In [6]:
passage_ids = list(document_store.keys())
train_claims = list(train_data.values())

In [30]:
example_claim = train_claims[125]
query = example_claim["claim_text"]
gold_evidence_list = example_claim["evidences"]
print(f"Query: {query}")
print(f"Label: {example_claim['claim_label']}")
print(f"Gold evidence list:")
for evidence_id in gold_evidence_list:
    print(f"{evidence_id}: {document_store[evidence_id]}")

# retrieve relevant evidence passages
topk_doc_indices, best_scores = retreiver.retrieve_docs(query, topk=20)
topk_evidence_ids = [passage_ids[idx] for idx in topk_doc_indices]
for i, evidence_id in enumerate(topk_evidence_ids):
    print(f"Score: {best_scores[i]}, {evidence_id}: {document_store[evidence_id]}")

# evaluation (precision, recall, F1)
intersection = set(topk_evidence_ids).intersection(gold_evidence_list)
# each ratio is computed once; guard against empty lists to avoid division by zero
precision = len(intersection) / len(topk_evidence_ids) if topk_evidence_ids else 0
recall = len(intersection) / len(gold_evidence_list) if gold_evidence_list else 0
f1 = (2*precision*recall/(precision + recall)) if (precision + recall) > 0 else 0 
print(f"Precision: {precision}, Recall: {recall}, F1: {f1}")

Query: Severe ‘snowmageddon’ winters are now strongly linked to soaring polar temperatures, say researchers, with deadly summer heatwaves and torrential floods also probably linked.
Label: NOT_ENOUGH_INFO
Gold evidence list:
evidence-611438: The El Niño phenomenon was blamed for the unusually high sea surface temperatures in the Pacific Ocean that moved east, thus pulling rainfall along with it.
evidence-392043: Eric Klinenberg has noted that in the United States, the loss of human life in hot spells in summer exceeds that caused by all other weather events combined, including lightning, rain, floods, hurricanes, and tornadoes.
evidence-236630: The most deadly heat wave in the history of Pakistan is the record-breaking heat wave of summer 2010 which occurred in the last ten days of May.
evidence-516337: The climate is characterized by hot, dry summers and cool, wet winters.
evidence-717735: Rapid, dramatic temperature swings were common, with temperatures sometimes reverting from norma