# **Fact checking, Natural Language Inference (NLI)**

**Authors**: Giacomo Berselli, Marco Cucè, Riccardo De Matteo

In [None]:
# print every output of a cell instead of only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

### 1. Libraries and Imports 

In [2]:
import os
import requests
import zipfile
import random
import string 

import torch

import numpy as np
import pandas as pd

import gensim
import gensim.downloader as gloader

import time 

from collections import OrderedDict, namedtuple

# Fix the random seeds to achieve reproducible results
torch.manual_seed(0)
random.seed(0)
np.random.seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True



In [None]:
print("Current work directory: {}".format(os.getcwd()))  # print the current working directory

data_folder = os.path.join(os.getcwd(), "data")  # folder, inside the current working directory, where all data will be stored

# create the data folder if it does not exist yet
os.makedirs(data_folder, exist_ok=True)

### 2. Data handling

First things first: we download the raw dataset, unzip it and store the csv file of each split in the dataset folder.

In [None]:
raw_dataset_path = os.path.join(data_folder,'raw_dataset')   #path of the raw dataset as downloaded 

def save_response_content(response, destination):    
    CHUNK_SIZE = 32768  # number of bytes read per chunk

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks                
                f.write(chunk)

def download_data(data_folder):
    zip_dataset_path = os.path.join(raw_dataset_path, 'fever_data.zip')
    data_url_id = "1wArZhF9_SHW17WKNGeLmX-QTYw9Zscl1"
    url = "https://docs.google.com/uc?export=download"

    if not os.path.exists(raw_dataset_path):
        os.makedirs(raw_dataset_path)

    if not os.path.exists(zip_dataset_path):
        print("Downloading FEVER data splits...")
        with requests.Session() as current_session:
            response = current_session.get(url, params={'id': data_url_id}, stream=True)
            response.raise_for_status()  # stop early on an HTTP error
            # write the streamed response while the session is still open
            save_response_content(response, zip_dataset_path)
        print("Download completed!")

        print("Extracting dataset...")
        with zipfile.ZipFile(zip_dataset_path) as loaded_zip:
            loaded_zip.extractall(raw_dataset_path)
        print("Extraction completed!")

download_data(data_folder)

Now that we have the csv files of the train, val and test splits, we load all three into a single pandas DataFrame, so that we can inspect and manipulate the dataset as a whole.
The DataFrame `df` is structured as follows:
- `claim`: the fact to verify
- `evidence`: one of the possibly multiple sentences in the dataset that support or refute the `claim`
- `id`: number associated with the fact to verify (different rows can have the same `id`)
- `label`: whether the evidence REFUTES or SUPPORTS the claim
- `split`: the split to which the claim belongs (train, val, test)


In [None]:
#encode the entire dataset in a pandas dataframe and add the split column
def encode_dataset(): 

    split_dfs = []
    for split in ['train','val','test']:
        split_path = os.path.join(raw_dataset_path,f"{split}_pairs.csv")
        split_df = pd.read_csv(split_path,index_col=0)
        split_df['split'] = split
        split_dfs.append(split_df)

    # DataFrame.append was removed in pandas 2.0, so the splits are joined with pd.concat
    df = pd.concat(split_dfs, ignore_index=True)
    df.columns = df.columns.str.lower()

    return df

df = encode_dataset()

Let's inspect the newly created dataset 

In [None]:
df.head()
print('The splits present in the dataframe are:',df['split'].unique())
print('Unique labels in the dataset:',df['label'].unique())

From the above results we can see that the dataset has been structured correctly.\
Now we print some values to check the size of the different splits and to retrieve useful information.

In [None]:
print('Dataframe shape:', df.shape)
print('Number of examples in train:',len(df[df['split']=='train']))
print('Number of examples in val:',len(df[df['split']=='val']))
print('Number of examples in test:',len(df[df['split']=='test']))

The training split clearly contains far more examples than the val and test splits.
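
Since different rows can share the same `id`, the counts printed above refer to claim-evidence pairs rather than to distinct claims. A minimal sketch of how the two quantities could be compared per split follows. It runs on a small made-up frame: `toy_df` and its values are illustrative assumptions, not data from the dataset, and only mimic the `id` and `split` columns of `df`.

```python
import pandas as pd

# Made-up rows that only mimic the `id` and `split` columns (assumption, not real data)
toy_df = pd.DataFrame({
    "id": [1, 1, 2, 3, 4, 5],
    "split": ["train", "train", "train", "train", "val", "test"],
})

# Rows (claim-evidence pairs) per split
pairs_per_split = toy_df["split"].value_counts()

# Distinct claims per split, counted through their unique ids
claims_per_split = toy_df.groupby("split")["id"].nunique()

print(pairs_per_split)
print(claims_per_split)
```

The same two expressions can be applied to `df` directly to obtain the real figures.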

The dataset should probably undergo some preprocessing before it can be used to train our model. Although this was already noticeable from the few rows of the dataframe printed above, let's now show one example of an evidence to make the required cleanup more evident.

In [None]:
print(list(df.sample(1)['evidence']))

### 3. Text preprocessing 

TODO: POSSIBLY FIND A WAY HERE TO SEE WHAT SHOULD BE CLEANED FROM THE DATASET

Both claims and evidence contain a lot of unwanted text: punctuation, symbols, meta-characters, foreign words, tags, etc. For some reason, claims are much cleaner than evidence. Nonetheless, we will preprocess both to end up with more manageable text, especially since the unwanted text does not contribute to the general meaning of each sentence, which is what we are interested in.
Our preprocessing pipeline will:
- drop everything before the first '\t' (every evidence seems to start with a number followed by '\t')
- delete all unnecessary spaces; only one space between each word will be left `DOES THIS INCLUDE \n \t \s ?`
- remove all tab and newline characters (there are many '\t' in the dataset) `???? PROBABLY NOT NEEDED`
- remove the round parentheses (-LRB- and -RRB-)
- drop words inside square brackets (everything that falls between -LSB- and -RSB-)
- delete all words that contain non-English/non-numerical characters (there are some Greek letters, for instance)
- remove 's `AND SIMILAR THINGS, TO BE DEFINED, OR MAYBE NOT`
- drop everything after the last dot character (after that there are often some other words similar to tags, which may be image descriptions or hyperlinks)
- remove punctuation
- set everything to lowercase
- convert the string into a list of words

`TODO: CHANGE THE ORDER AND ADD MORE STEPS (the current pipeline is only meant to let us move forward; other possible steps with nltk such as stopword removal, stemming, etc.)`
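
Two of the steps listed above, deleting words with non-English characters and removing 's, could be sketched as follows. This is only a rough illustration under stated assumptions: the helper names `drop_non_ascii_words` and `strip_possessive` are hypothetical, "non-English" is approximated by "non-ASCII", and the example sentence is made up rather than taken from the dataset.

```python
import re

def drop_non_ascii_words(sentence: str) -> str:
    """Keep only the whitespace-separated tokens made entirely of ASCII characters.

    Assumption: non-English words are approximated by tokens containing non-ASCII characters.
    """
    return " ".join(token for token in sentence.split() if token.isascii())

def strip_possessive(sentence: str) -> str:
    """Remove 's (straight or curly apostrophe), whether attached to a word or preceded by spaces."""
    return re.sub(r"\s*['’]s\b", "", sentence)

# Made-up example sentence (not from the dataset)
example = "Athens is Greece's capital , called Αθήνα in Greek"
cleaned = drop_non_ascii_words(strip_possessive(example))
print(cleaned)
```

Both helpers work on the raw string, so they would have to run before punctuation is removed and before the sentence is split into a list of words.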

In [None]:
import re 

def preprocess_pipeline(sentence:str):
    
    #drop everything before the first '\t' 
    sentence = sentence[sentence.find('\t')+1:]

    #drop everything after the last period
    period_idx = sentence.rfind('.')
    if period_idx!= -1:
        sentence = sentence[:period_idx]

    #remove all round parentheses
    sentence = sentence.replace('-LRB-','').replace('-RRB-','')

    #remove words inside square brackets (everything between -LSB- and -RSB-)
    sentence = re.sub(r"-LSB-.*?-RSB-","",sentence)

    #remove all punctuation
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))

    #put everything to lowercase
    sentence = sentence.lower()

    #remove all unnecessary spaces and return a list of words
    sentence = sentence.split()

    return sentence

To test our preprocessing pipeline, we apply it to an example in the dataset that we have identified as particularly tough in terms of the amount of cleanup required.

In [None]:
#retrieve the example at index 13 from the dataset. It is about Greece and its text is pretty messy
original_claim = df.loc[13,'claim']
original_evidence = df.loc[13,'evidence']

processed_claim = preprocess_pipeline(original_claim)
processed_evidence = preprocess_pipeline(original_evidence)

print('Original claim:',original_claim)
print('Processed claim:',processed_claim)
print('\nOriginal evidence:',original_evidence)
print('Processed evidence:',processed_evidence)


As we can see, the results for both the claim and the evidence after preprocessing are satisfactory. We therefore apply the preprocessing function to the entire dataset encoded as a DataFrame.

In [None]:
df['claim'] = df['claim'].apply(preprocess_pipeline)
df['evidence'] = df['evidence'].apply(preprocess_pipeline)

df.head(10)

### 4. Vocabulary

Next, we have to build the dictionaries that will be used for the numerical tokenization of the dataset and for the generation of the embedding matrix.

The function `build_vocab` takes as input the list of unique words in the whole dataset and creates:
- `word2int`: dictionary which associates each word with an integer.
- `int2word`: dictionary which associates each integer with the corresponding word.

These two dictionaries constitute a bijective mapping between words and indexes in the dataset.
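
As a minimal sketch of this idea, assuming the sentences are already tokenized into lists of words, the unique words can be collected and mapped to integers as shown here. The toy sentences and the names `toy_sentences`, `toy_unique_words`, `toy_word2int` and `toy_int2word` are illustrative only; indices start at 1 so that 0 stays free for a special token.

```python
# Toy tokenized sentences (illustrative only, not taken from the dataset)
toy_sentences = [
    ["athens", "is", "in", "greece"],
    ["greece", "is", "a", "country"],
]

# Collect the unique words in a deterministic (sorted) order
toy_unique_words = sorted({word for sentence in toy_sentences for word in sentence})

# Indices start at 1 so that 0 remains available for a special token
toy_word2int = {word: i + 1 for i, word in enumerate(toy_unique_words)}
toy_int2word = {i: word for word, i in toy_word2int.items()}

# Bijection check: mapping a word to its index and back returns the same word
assert all(toy_int2word[toy_word2int[word]] == word for word in toy_unique_words)

# Numerical encoding of the first toy sentence
encoded = [toy_word2int[word] for word in toy_sentences[0]]
print(encoded)
```

Sorting the unique words is a design choice of this sketch: it makes the word-to-index assignment reproducible across runs, which a plain `set` iteration order would not guarantee.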

In [3]:
Vocab = namedtuple('Vocabulary',['word2int','int2word','unique_words'])

def build_vocab(unique_words : list[str]): 
    """
        Builds the two dictionaries word2int and int2word and returns them,
        together with the list of unique words, as a Vocab namedtuple
    """
    
    word2int = OrderedDict()
    int2word = OrderedDict()

    for i, word in enumerate(unique_words):
        word2int[word] = i+1           #plus 1 since index 0 is reserved for a special token (presumably padding)
        int2word[i+1] = word
    
    return Vocab(word2int,int2word,unique_words)