# **_Fact checking, Natural Language Inference (NLI)_**

**Authors**: Giacomo Berselli, Marco Cucè, Riccardo De Matteo

### 1. Initial setup

In [None]:
# to print all output for a cell instead of only last one 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [None]:
#import all libraries and modules 

import os

import requests
import zipfile
import random
import string 

import torch

import numpy as np
import pandas as pd

import gensim
import gensim.downloader as gloader
from gensim.models import KeyedVectors

import time 
import logging

from collections import OrderedDict, namedtuple

In [None]:
print("Current work directory: {}".format(os.getcwd())) #print the current working directory 

data_folder = os.path.join(os.getcwd(),"data") # folder (inside the working directory) where all data will be stored

os.makedirs(data_folder, exist_ok=True)   #create the data folder if it does not exist yet

In [None]:
# Fix data seed to achieve reproducible results
torch.manual_seed(0)
random.seed(0)
np.random.seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

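#setup logging 
log = logging.getLogger('logger')
log.setLevel(logging.DEBUG)
if not log.handlers:   #avoid attaching a second handler when the cell is re-run
    fh = logging.FileHandler(os.path.join(data_folder,'log.txt'))
    fh.setLevel(logging.DEBUG)
    log.addHandler(fh)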

### 2. Data preparation

First of all, we download the FEVER dataset, unzip it and store the `.csv` file of each split in the dataset folder.

In [None]:
raw_dataset_path = os.path.join(data_folder,'raw_dataset')   #path of the raw dataset as downloaded 

def save_response_content(response, destination):    
    CHUNK_SIZE =32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks                
                f.write(chunk)

def download_data():
    zip_dataset_path = os.path.join(raw_dataset_path,'fever_data.zip')    
    data_url_id ="1wArZhF9_SHW17WKNGeLmX-QTYw9Zscl1"    
    url ="https://docs.google.com/uc?export=download"

    if not os.path.exists(raw_dataset_path):        
        os.makedirs(raw_dataset_path)

    if not os.path.exists(zip_dataset_path):
        print("Downloading FEVER data splits...")
        with requests.Session() as current_session:           
            response = current_session.get(url, params={'id': data_url_id}, stream=True)
            response.raise_for_status()   #stop early if the request failed
            #write the stream to disk while the session is still open
            save_response_content(response, zip_dataset_path)
        print("Download completed!")

        print("Extracting dataset...")
        with zipfile.ZipFile(zip_dataset_path) as loaded_zip:            
            loaded_zip.extractall(raw_dataset_path)
        print("Extraction completed!")

download_data()

Now that we have the `.csv` files of the train, val and test splits, we load all of them into a single pandas DataFrame, so that we can inspect and manipulate the data as a whole.
The DataFrame `df` is structured as follows: 
- `claim`: the fact to verify.
- `evidence`: one of the possibly multiple sentences in the dataset that support or refute the `claim`.
- `id`: number associated with the fact to verify (different rows can have the same `id`).
- `label`: whether the evidence REFUTES or SUPPORTS the claim.
- `split`: the split to which the claim belongs (train, val or test).


In [None]:
#encode the entire dataset in a pandas dataframe and add the split column
def encode_dataset(): 

    split_dfs = []
    for split in ['train','val','test']:
        split_path = os.path.join(raw_dataset_path,f"{split}_pairs.csv")
        split_df = pd.read_csv(split_path,index_col=0)
        split_df['split'] = split
        split_dfs.append(split_df)

    #DataFrame.append was removed in pandas 2.0, so the splits are concatenated in one go
    df = pd.concat(split_dfs,ignore_index=True)
    df.columns = df.columns.str.lower()

    return df 

df = encode_dataset()

Let's inspect the newly created dataset:

In [None]:
df.head()
print('The splits present in the dataframe are:',df['split'].unique())
print('Unique labels in the dataset:',df['label'].unique())

From the above results we can see that the dataset has been structured correctly.\
Now we print some values to check the size of the different splits and to retrieve some useful information.

In [None]:
print('Dataframe shape:', df.shape)
print('Number of examples in train:',len(df[df['split']=='train']))
print('Number of examples in val:',len(df[df['split']=='val']))
print('Number of examples in test:',len(df[df['split']=='test']))

The training split contains far more claim-evidence pairs than the val and test splits.

The dataset needs some preprocessing before it can be used to train our model. This was already noticeable in the few rows printed above; to make the required work more evident, let's print a randomly sampled evidence.

In [None]:
print(list(df.sample(1)['evidence']))

### 3. Text preprocessing 

Both claims and evidences contain a lot of unwanted text: punctuation, symbols, meta-characters, foreign words, tags, etc. Claims are noticeably cleaner than evidences. Nonetheless, we preprocess both of them to end up with more manageable text, especially since the unwanted text does not contribute to the general meaning of each sentence, which is what we are interested in.
Our preprocessing pipeline is composed of the following steps:
- drop everything before the first `\t` (every evidence seems to start with a number followed by `\t`).
- delete all unnecessary spaces (only one space is left between words). 
- remove all tab and newline characters (there are many `\t` in the dataset).
- remove the round parentheses (`-LRB-` and `-RRB-`).
- drop words inside square brackets (everything that falls between `-LSB-` and `-RSB-`).
- drop everything after the last period (what follows is often a list of tag-like words, possibly image descriptions or hyperlinks).
- remove punctuation.
- set everything to lowercase.

Then we have defined three different preprocessing functions, which build on `preprocess_pipeline` and add further steps:
- Type 1: standard preprocessing which simply returns a list of words.
- Type 2: transliterates Unicode characters to ASCII, removes words with non-ASCII characters, lemmatizes and returns a list of words.
- Type 3: transliterates Unicode characters to ASCII, removes words with non-ASCII characters, applies stemming, removes stop-words and returns a list of words.

In [None]:
import re 
import unidecode
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords

nltk.download('wordnet') 
nltk.download('stopwords')

def lemmatize_and_remove_non_ascii(sentence:str):
    """Remove unnecessary spaces, remove words with non ASCII characters and lemmatize"""
    sentence = sentence.split() #remove all unnecessary spaces and return a list of words
    lemmatizer = WordNetLemmatizer()
    sentence = [lemmatizer.lemmatize(word) for word in sentence if word.isascii()] #if a word has all ASCII characters: lemmatize, else: remove
    return sentence

def stemm_and_remove_non_ascii(sentence: str):
    """Remove unnecessary spaces, remove words with non ASCII characters and stem"""
    sentence = sentence.split() #remove all unnecessary spaces and return a list of words
    ps = PorterStemmer()
    sentence = [ps.stem(word) for word in sentence if word.isascii()] #if a word has all ASCII characters: stem, else: remove
    return sentence

def preprocess_pipeline(sentence:str):
    """Apply standard preprocessing"""
    
    #drop everything before the first '\t' 
    sentence = sentence[sentence.find('\t')+1:]

    #drop everything after the last period
    period_idx = sentence.rfind('.')
    if period_idx!= -1:
        sentence = sentence[:period_idx]

    #remove all round parentheses 
    sentence = sentence.replace('-LRB-','').replace('-RRB-','')

    #remove words inside square brackets
    sentence = re.sub(r"-LSB-.*?-RSB-","",sentence)

    #remove any leftover unmatched square brackets
    sentence = sentence.replace('-LSB-','').replace('-RSB-','')

    #replace all punctuation with spaces
    sentence = sentence.translate(str.maketrans(dict.fromkeys(string.punctuation,' ')))

    #substitute the character ˈ with a space 
    sentence = sentence.replace('ˈ',' ')

    #put everything to lowercase
    sentence = sentence.lower()

    return sentence

def preprocess_type1(sentence:str):
    """Apply standard preprocessing and return a list of words"""

    sentence = preprocess_pipeline(sentence)

    #remove all unnecessary spaces and return a list of words
    sentence = sentence.split()

    return sentence

def preprocess_type2(sentence:str):
    """Apply standard preprocessing, transliterates UNICODE characters in ASCII, 
    remove words with non ASCII characters, lemmatize and return a list of words"""

    sentence = preprocess_pipeline(sentence)

    #transliterates any UNICODE string into the closest possible representation in ASCII text
    sentence = unidecode.unidecode(sentence)

    #remove non-ascii words
    sentence = lemmatize_and_remove_non_ascii(sentence)

    return sentence

def preprocess_type3(sentence: str):
    """Apply standard preprocessing, transliterate UNICODE characters in ASCII,
    remove words with non ASCII characters, stem, remove stop-words and return a list of words"""
    sentence = preprocess_pipeline(sentence)
    sentence = unidecode.unidecode(sentence)
    stemmed = stemm_and_remove_non_ascii(sentence)
    stop_words = set(stopwords.words('english'))
    filter_stop_words = [word for word in stemmed if word not in stop_words]
    return filter_stop_words
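
To make the effect of the core cleaning steps concrete, here is a minimal, self-contained sketch that re-states them in compact form and applies them to an invented string. The names `clean_sentence` and `toy_evidence` are assumptions made for this illustration, and the string is not taken from the dataset; it only mimics the tab-separated layout described above.

```python
import re
import string

def clean_sentence(sentence: str) -> list:
    """Compact re-statement of the core cleaning steps (illustrative only)."""
    # keep only what follows the first tab (the leading sentence number is dropped)
    sentence = sentence[sentence.find('\t') + 1:]
    # drop the tag-like text that follows the last period
    period_idx = sentence.rfind('.')
    if period_idx != -1:
        sentence = sentence[:period_idx]
    # remove the round-parenthesis tokens, then whole square-bracket spans
    sentence = sentence.replace('-LRB-', '').replace('-RRB-', '')
    sentence = re.sub(r"-LSB-.*?-RSB-", "", sentence)
    # replace punctuation with spaces, lowercase, and split on whitespace
    sentence = sentence.translate(str.maketrans(dict.fromkeys(string.punctuation, ' ')))
    return sentence.lower().split()

# invented example string, not taken from the dataset
toy_evidence = "3\tAthens -LRB- the capital -RRB- is in Greece -LSB- 1 -RSB- .\tAthens\tGreece"
print(clean_sentence(toy_evidence))
```

The leading number, the bracket tokens and the trailing tab-separated tags all disappear, leaving only the lowercase words of the sentence.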

To test our preprocessing pipeline we will apply it to an example in the dataset that we have identified to be a pretty tough one in terms of amount of cleanup necessary.

In [None]:
#retrieve from the dataset the example at index 13. It is about Greece and its text is pretty messy.

original_claim = df.loc[13,'claim']
original_evidence = df.loc[13,'evidence']

processed_claim = preprocess_type2(original_claim)
processed_evidence = preprocess_type2(original_evidence)

print('Original claim:',original_claim)
print('Processed claim:',processed_claim,'\n')
print('Original evidence:',original_evidence)
print('Processed evidence:',processed_evidence,'\n')

As we can see, the preprocessed claim and evidence are both satisfactory, so we now apply the preprocessing function to the entire dataset encoded as a DataFrame.

In [None]:
df['claim'] = df['claim'].apply(preprocess_type2)
df['evidence'] = df['evidence'].apply(preprocess_type2)

df.head(10)

### 4. Vocabulary

Next, we have to build the dictionaries that will be used for the numerical tokenization of the dataset and for the generation of the embedding matrix.

The function `build_vocab` takes as input the list of unique words in the whole dataset and creates:
- `word2int`: dictionary which associates each word with an integer.
- `int2word`: dictionary which associates each integer with the corresponding word.

These two dictionaries constitute a bijective mapping between the words in the dataset and their indexes.

In [None]:
Vocab = namedtuple('Vocabulary',['word2int','int2word','unique_words'])

def build_vocab(unique_words : list[str]): 
    """
        Builds the dictionaries word2int, int2word and put them in the Vocabulary
    """
    word2int = OrderedDict()
    int2word = OrderedDict()

    for i, word in enumerate(unique_words):
        word2int[word] = i+1           #plus 1 since index 0 is reserved for the padding token 
        int2word[i+1] = word
    
    return Vocab(word2int,int2word,unique_words)
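
As a minimal sketch of the mapping described above, the following standalone snippet builds the two dictionaries for an invented three-word vocabulary and checks that indexing a sentence and mapping it back returns the original words. The names `toy_words`, `indexed` and `restored` are assumptions used only for this illustration.

```python
from collections import OrderedDict

toy_words = ['athens', 'capital', 'greece']   # invented toy vocabulary

# indexes start at 1 so that 0 stays free for the padding token
word2int = OrderedDict((word, i + 1) for i, word in enumerate(toy_words))
int2word = OrderedDict((i, word) for word, i in word2int.items())

sentence = ['greece', 'capital', 'athens']
indexed = [word2int[w] for w in sentence]     # words -> integers
restored = [int2word[i] for i in indexed]     # integers -> words

print(indexed)
print(restored == sentence)
print(0 in int2word)
```

Because every word receives a distinct index, the round trip is lossless, and index 0 never appears in `int2word`.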

The function `build_vocab` takes as input the list of all the unique words in the dataset, so we now retrieve it from the DataFrame in order to build the dictionaries.

In [None]:
unique_words_claim = df['claim'].explode().unique().tolist()  
unique_words_evidence = df['evidence'].explode().unique().tolist()

print('the number of unique words belonging to claims is:', len(unique_words_claim))
print('the number of unique words belonging to evidences is:', len(unique_words_evidence))

unique_words = sorted(set(unique_words_evidence + unique_words_claim), key=str)   #sorted so that word indexes are the same on every run
print('the number of unique words in the entire dataset is:', len(unique_words))


In [None]:
vocab = build_vocab(unique_words)

Now that we have the vocabulary, which contains the mapping between words and indexes, we can 'numberise' the dataset. In particular, we add 3 columns to the DataFrame:
- `idx_claim`: same as `claim` but with each word substituted by its index.
- `idx_evidence`: same as `evidence` but with each word substituted by its index.
- `idx_label`: label encoded as a unique integer (0: REFUTES, 1: SUPPORTS).

In [None]:
def build_indexed_dataframe(df: pd.DataFrame):

    df['idx_claim'] = df.claim.apply(lambda x:list(map(vocab.word2int.get,x)))
    df['idx_evidence'] = df.evidence.apply(lambda x:list(map(vocab.word2int.get,x)))

    df['label'] = df.label.astype('category')   #convert the label column into category dtype
    df['idx_label'] = df.label.cat.codes        #assign unique integer to each category

    return df 

def check_dataframe_numberization(df,vocab):

    """
       Checks that applying the reverse mapping to the numberized dataframe gives back the original dataframe 
    """

    claims = df['claim']
    evidences = df['evidence']

    idx_to_claims = df.idx_claim.apply(lambda x:list(map(vocab.int2word.get,x)))
    idx_to_evidences = df.idx_evidence.apply(lambda x:list(map(vocab.int2word.get,x)))

    if claims.equals(idx_to_claims) and evidences.equals(idx_to_evidences):
        print('All right with dataset numberization')
    else:
        raise Exception('There are problems with Dataset numberization')

df = build_indexed_dataframe(df)

check_dataframe_numberization(df,vocab)
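
The label codes follow from the way pandas orders string categories: when categories are inferred from the data they are sorted, so REFUTES receives 0 and SUPPORTS receives 1. The following plain-Python sketch mimics that behaviour on invented labels; `label2code` and `codes` are hypothetical names, and the snippet is an approximation of `cat.codes`, not a call to pandas.

```python
labels = ['SUPPORTS', 'REFUTES', 'SUPPORTS']   # invented toy labels

# categories are ordered lexically, so REFUTES comes before SUPPORTS
categories = sorted(set(labels))
label2code = {label: code for code, label in enumerate(categories)}
codes = [label2code[label] for label in labels]

print(label2code)
print(codes)
```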

Since the operation was successful, let's have a look at the numberized dataframe.

In [None]:
df.head()

### 5. Data Loaders 

To generate the mini-batches of each split that are passed to the network, we use the `torchtext` utility `BucketIterator`. It ensures that each mini-batch is composed of sequences of nearly the same length (depending on the chosen batch size), so that the minimum possible padding is added to each Tensor. Since `BucketIterator` expects a dataset object as input, we first wrap the DataFrame in a PyTorch Dataset.\
The remaining problem is how to define the length of an input to the model (which is used to create buckets of similar-length sequences), since for this task each input is composed of two sequences (claim and evidence). 


In [None]:
claim_len = df.claim.apply(len)
evidence_len = df.evidence.apply(len)
print('average length of a claim sentence:',claim_len.mean())
print('average length of an evidence sentence:',evidence_len.mean())
print('max difference in length of claim sentences:',claim_len.max() - claim_len.min())
print('max difference in length of evidence sentences:',evidence_len.max() - evidence_len.min())

Since the average evidence is much longer than the average claim, we decided to create buckets based on the length of the evidence and, only when that is equal, on the length of the claim. The mini-batches are therefore built by grouping evidences of similar size together with their corresponding claims.  

In [None]:
from torchtext.legacy.data import BucketIterator   #the legacy module exists only in torchtext 0.9-0.11
from torch.utils.data import Dataset

class DataframeDataset(Dataset):

    def __init__(self, dataframe: pd.DataFrame):

        dataframe = dataframe.copy()
        self.claims = dataframe['idx_claim']      #column of numberized claims 
        self.evidences = dataframe['idx_evidence']   #column of numberized evidences 
        self.labels = dataframe['idx_label']       #column of categorical label 
        self.claim_ids = dataframe['id']          #column of claim ids 

    def __len__(self):
        return len(self.claims)

    def __getitem__(self, idx):
        return {'claim': self.claims[idx],
                'evidence': self.evidences[idx],
                'label': self.labels[idx],
                'claim_id': self.claim_ids[idx]}

def create_dataloaders(b_s : int, dataframe: pd.DataFrame):     #b_s = batch_size
    
    train_df = dataframe[dataframe['split'] == 'train'].reset_index(drop=True)      
    val_df = dataframe[dataframe['split'] == 'val'].reset_index(drop=True)
    test_df = dataframe[dataframe['split'] == 'test'].reset_index(drop=True)

    #create DataframeDataset objects for each split 
    train_dataset = DataframeDataset(train_df)
    val_dataset = DataframeDataset(val_df)
    test_dataset = DataframeDataset(test_df)


    # Group similar length text sequences together in batches and return an iterator for each split.
    train_dataloader,val_dataloader,test_dataloader = BucketIterator.splits((train_dataset,val_dataset,test_dataset),
                                                        batch_sizes=(b_s,b_s,b_s), sort_key=lambda x: (len(x['evidence']),len(x['claim'])), 
                                                        repeat=True, sort=False, shuffle=True, sort_within_batch=True)
    
    return train_dataloader,val_dataloader,test_dataloader 
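
To illustrate why bucketing by length reduces padding, here is a simplified, framework-free sketch. It sorts invented examples with the same key used above and cuts them into consecutive batches, which is only an approximation of what `BucketIterator` does (the real iterator also shuffles). The helper names `padding_needed` and `make_batches` and the toy data are assumptions made for illustration.

```python
import random

random.seed(0)
# invented toy examples: lists of token ids with varying lengths
examples = [{'claim': [1] * random.randint(3, 8),
             'evidence': [1] * random.randint(5, 60)} for _ in range(64)]

def sort_key(example):
    # evidence length first, claim length only as a tie-breaker
    return (len(example['evidence']), len(example['claim']))

def make_batches(items, batch_size):
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def padding_needed(batches):
    """Total pad tokens if every evidence is padded to its batch maximum."""
    total = 0
    for batch in batches:
        longest = max(len(ex['evidence']) for ex in batch)
        total += sum(longest - len(ex['evidence']) for ex in batch)
    return total

unsorted_batches = make_batches(examples, 8)
bucketed_batches = make_batches(sorted(examples, key=sort_key), 8)

print('padding without bucketing:', padding_needed(unsorted_batches))
print('padding with bucketing:   ', padding_needed(bucketed_batches))
```

With equally sized batches, grouping sorted sequences never needs more padding than an arbitrary grouping, which is the property the data loaders rely on.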


Now let's check that the dataloaders have been created correctly. In order to do that, we print the indexed claim, the indexed evidence and the claim id of a randomly chosen element of the train dataloader's first batch, and then we look in the dataframe to see if the indexes correspond.

In [None]:
temp_batch_size = 128
tr, vl, ts = create_dataloaders(temp_batch_size, df)
random_idx = random.randint(0, temp_batch_size-1)
tr.init_epoch()
for batch_id, batch in enumerate(tr.batches):
    print("Claim: ", batch[random_idx]['claim'])
    print("Evidence: ", batch[random_idx]['evidence'])
    print("Label: ", batch[random_idx]['label'])
    print("Claim id: ", batch[random_idx]['claim_id'], "\n")
    print("Corresponding row in the dataset:")
    print(df[df['id'] == (batch[random_idx]['claim_id'])])   #expressions inside a loop are not displayed automatically, so print explicitly
    break

### 6. Word embeddings

We can finally build the embedding matrix that will be used by the embedding layer of our model to store pre-trained word embeddings and retrieve them using indices. 
The function `build_embedding_matrix`, given the embedding model and the `word2int` dictionary, constructs a matrix that stores at each word index the corresponding embedding vector found in GloVe. In particular, we decided to use GloVe as embedding model with a vector dimension of 300. 
 
OOV words are handled as follows:
- If a word in the dataset (identified by its unique integer) is present in the GloVe model, we store its embedding vector in the embedding matrix.
- Otherwise, we assign to the OOV word a random embedding vector of size 300, sampled from a uniform distribution.

First things first, we need to download the GloVe model through `gensim`.\
To avoid downloading the GloVe embeddings more than once, since the process is really slow, we store the `KeyedVectors` in the data folder. If this is not the first run and the file has already been saved, we load it from there.

In [None]:
emb_matrix_path = os.path.join(data_folder, "emb_matrix.npy")
glove_model_path = os.path.join(data_folder, "glove_vectors.txt") 

def download_glove_emb(force_download = False):   
    """
        Download the glove embedding model and returns it 
    """
    emb_model = None

    if os.path.exists(glove_model_path) and not force_download: 
        print('glove vectors already saved in data folder, retrieving the file...')
        emb_model = KeyedVectors.load_word2vec_format(glove_model_path, binary=True)
        print('vectors loaded')

    else:
        print('downloading glove embeddings...')        
        embedding_dimension=300

        download_path = "glove-wiki-gigaword-{}".format(embedding_dimension)
        emb_model = gloader.load(download_path)

        print('saving glove embeddings to file')  
        emb_model.save_word2vec_format(glove_model_path, binary=True)
        
    return emb_model

force_download = False      # to download glove model even if the vectors model has been already stored. Mainly for testing purposes

glove_embeddings = download_glove_emb(force_download)

Now that we have the GloVe embeddings, we can check if there are some Out Of Vocabulary (OOV) words in our processed dataset.\
A word is considered OOV if it is present in our dataset but not in the GloVe embeddings. 

In [None]:
def check_OOV_terms(embedding_model: gensim.models.keyedvectors.KeyedVectors, vocab):
    """
        Given the embedding model and the unique words in the dataframe, determines the out-of-vocabulary words 
    """
    oov_words = []
    idx_oov_words = []

    if embedding_model is None:
        print('WARNING: empty embeddings model')

    else: 
        for word in vocab.unique_words:
            try: 
                embedding_model[word]
            except (KeyError, TypeError):   #word missing from the embedding model (or not a string)
                oov_words.append(word) 
                idx_oov_words.append(vocab.word2int[word]) 
        
        print("Total number of unique words in dataset:",len(vocab.unique_words))
        print("Total OOV terms: {0} ({1:.2f}%)".format(len(oov_words), (float(len(oov_words)) / len(vocab.unique_words))*100))
        print("Some OOV terms:",random.sample(oov_words,min(15,len(oov_words))))
    
    return oov_words, idx_oov_words

oov_words, idx_oov_words = check_OOV_terms(glove_embeddings,vocab)

Using the GloVe embeddings and applying each of the 3 preprocessing types, we obtain the following results:
- Type 1: 3148 OOV words out of 35096 unique words (8.97%).
- Type 2: 2761 OOV words out of 31692 unique words (8.71%).
- Type 3: 8031 OOV words out of 26952 unique words (29.80%).

The first two types give similar outcomes, with the second reducing both the number of OOV words and the number of unique words in the dataset. The third type is probably too aggressive: the most likely reason is stemming which, unlike lemmatization, truncates words and often produces forms that are not real words and therefore have no GloVe embedding.
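
The percentages above can be re-derived from the reported counts with a few lines of arithmetic. The dictionary name `reported` is an assumption used only for this check; the numbers are the ones listed above.

```python
# (OOV words, unique words) as reported above for each preprocessing type
reported = {'Type 1': (3148, 35096), 'Type 2': (2761, 31692), 'Type 3': (8031, 26952)}

# OOV rate = OOV words / unique words, expressed as a percentage
oov_rates = {name: 100 * oov / total for name, (oov, total) in reported.items()}

for name, rate in oov_rates.items():
    print(f"{name}: {rate:.2f}%")
```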

Now let's build the embedding matrix.

In [None]:
def build_embedding_matrix(emb_model: gensim.models.keyedvectors.KeyedVectors,vocab) -> np.ndarray:
    """
        If the embedding for the word is present, add it to the embedding_matrix, otherwise insert a vector of random values.
        Return the embedding matrix
    """
    if emb_model is None:
        print('WARNING: empty embeddings model')
        return None
    
    print('Building embedding matrix...')

    embedding_dimension = emb_model.vector_size #how many numbers each emb vector is composed of                                                           
    embedding_matrix = np.zeros((len(vocab.word2int)+1, embedding_dimension), dtype=np.float32)   #create a matrix initialized with all zeros 

    for word, idx in vocab.word2int.items():
        try:
            embedding_vector = emb_model[word]
        except (KeyError, TypeError):
            embedding_vector = np.random.uniform(low=-0.05, high=0.05, size=embedding_dimension)

        embedding_matrix[idx] = embedding_vector     #assign the retrived or the generated vector to the corresponding index 
    
    print(f"Embedding matrix shape: {embedding_matrix.shape}")

    return embedding_matrix

embedding_matrix = build_embedding_matrix(glove_embeddings, vocab)
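
The construction can be sketched without GloVe or NumPy on a toy scale. In the snippet below, the dictionary `pretrained` stands in for the pre-trained vectors, `EMB_DIM` is reduced to 4, and the word `'xyzzy'` plays the role of an OOV term; all of these are assumptions made for illustration only.

```python
import random

random.seed(0)
EMB_DIM = 4    # toy dimension (the notebook uses 300)

# invented stand-in for the pre-trained vectors
pretrained = {'athens': [0.1, 0.2, 0.3, 0.4], 'greece': [0.5, 0.6, 0.7, 0.8]}
word2int = {'athens': 1, 'greece': 2, 'xyzzy': 3}   # 'xyzzy' is OOV

# row 0 is left as zeros for the padding token
matrix = [[0.0] * EMB_DIM for _ in range(len(word2int) + 1)]

for word, idx in word2int.items():
    if word in pretrained:
        matrix[idx] = list(pretrained[word])     # copy the known vector
    else:
        # OOV word: small random vector from a uniform distribution
        matrix[idx] = [random.uniform(-0.05, 0.05) for _ in range(EMB_DIM)]

for row in matrix:
    print(row)
```

The row index of each vector is exactly the integer assigned to the word in `word2int`, which is what lets an embedding layer look vectors up by token index.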

Let's have a look at the first few rows of the freshly created embedding matrix, to get a sense of it.

In [None]:
pd.DataFrame(embedding_matrix).head()

As we can see the very first row is full of zeros since that's a 'fake embedding' for the padding token which will never be used in practice.

To be completely sure that the embedding matrix has been built correctly, we check that the embedding vector stored at each index of the matrix is the same as the one retrieved from GloVe for the word to which that index corresponds. 

In [None]:
def check_id_corr(glove: gensim.models.keyedvectors.KeyedVectors, vocab, matrix, dataframe):
    """
        Checks whether the numberized dataframe and the index of the embedding matrix correspond
    """
    if glove is None:
        print('WARNING: empty model, remember to download GloVe first or set force_download to True')
        return 
    oov_words_ = []

    for indexed_sentence in dataframe['idx_claim']+dataframe['idx_evidence']:

        for token in indexed_sentence:
            embedding = matrix[token]
            word = vocab.int2word[token]
            if word in glove.key_to_index:
                assert(np.array_equal(embedding,glove[word]))
            else:
                oov_words_.append(word)

    print('Double check OOV number:',len(set(oov_words_)))

check_id_corr(glove_embeddings,vocab,embedding_matrix,df)

Since no error has been found, we can safely proceed with the next steps.

### 7. Model design

In [None]:
#pytorch imports

import torch.nn as nn
import torch.optim as optim
import torch.nn.utils.rnn as rnn
import torch.nn.functional as F

from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

#scikit-learn imports 
from sklearn.metrics import f1_score, accuracy_score

Since we want a flexible approach, we define a single custom model which builds the appropriate architecture based on the parameters stored in the `Architecture` tuple. The first field of the tuple stores the strategy chosen to embed the claim and evidence sentences. You can choose between:
- `mlp`: encode token sequences via a simple MLP layer.
- `rnn_last`: encode token sequences via an RNN and take the last state as the sentence embedding.
- `rnn_avg`: encode token sequences via an RNN and average all the output states.
- `bag_of_vectors`: compute the sentence embedding as the mean of its token embeddings.

The second parameter defines the technique chosen to merge evidence and claim sentence embeddings, as follows:
- `concat`: define the classification input as the concatenation of evidence and claim sentence embeddings.
- `sum`: define the classification input as the sum of evidence and claim sentence embeddings.
- `mean`: define the classification input as the mean of evidence and claim sentence embeddings.

The third parameter is a boolean that adds an extra feature, the cosine similarity. `cosine_sim` lets us check whether similarity information between the claim to verify and one of its associated evidences is useful for the classification.
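
As a minimal illustration of the simplest strategy, `bag_of_vectors`, the sketch below averages the token vectors of one padded sentence while ignoring the padding rows. It uses plain Python lists instead of tensors, and the name `mean_sentence_embedding` and the toy vectors are assumptions; the model itself works on whole batches.

```python
def mean_sentence_embedding(token_embeddings, length):
    """Average the first `length` token vectors, ignoring trailing padding rows."""
    dim = len(token_embeddings[0])
    real_tokens = token_embeddings[:length]
    return [sum(vec[d] for vec in real_tokens) / length for d in range(dim)]

# invented toy sentence: 2 real tokens followed by 1 zero padding row
padded_sentence = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
print(mean_sentence_embedding(padded_sentence, length=2))
```

Dividing by the true sentence length rather than by the padded length keeps the padding from shrinking the average.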

In [None]:
Architecture = namedtuple('Architecture',['sentence_emb_strat','merge_input','cosine_sim'])
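
The merge strategies and the optional cosine-similarity feature determine the size of the classifier input. The following framework-free sketch reproduces them on two invented 2-dimensional sentence embeddings; `Arch` is a hypothetical copy of the tuple above, `merge` is an illustrative helper, and the zero-norm case of the cosine similarity is deliberately ignored.

```python
import math
from collections import namedtuple

# hypothetical copy of the Architecture tuple, for illustration only
Arch = namedtuple('Arch', ['sentence_emb_strat', 'merge_input', 'cosine_sim'])

def merge(claim, evidence, arch):
    if arch.merge_input == 'concat':
        merged = claim + evidence                               # list concatenation
    elif arch.merge_input == 'sum':
        merged = [c + e for c, e in zip(claim, evidence)]
    elif arch.merge_input == 'mean':
        merged = [(c + e) / 2 for c, e in zip(claim, evidence)]
    else:
        raise ValueError('Incorrect name for input-merge strategy')
    if arch.cosine_sim:
        dot = sum(c * e for c, e in zip(claim, evidence))
        norm = math.sqrt(sum(c * c for c in claim)) * math.sqrt(sum(e * e for e in evidence))
        merged = merged + [dot / norm]                          # one extra feature
    return merged

claim, evidence = [1.0, 0.0], [1.0, 1.0]   # invented toy sentence embeddings
print(merge(claim, evidence, Arch('bag_of_vectors', 'concat', True)))
print(merge(claim, evidence, Arch('bag_of_vectors', 'sum', False)))
print(merge(claim, evidence, Arch('bag_of_vectors', 'mean', False)))
```

With embeddings of dimension `d`, `concat` yields `2 * d` values while `sum` and `mean` yield `d`, and the cosine similarity adds one more in every case.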

In [None]:
class Custom_model(nn.Module):
    """
        Class defining our model architecture  
    """
    def __init__(self, emb_matrix: np.ndarray, model_param : dict, architecture : Architecture, device) :
        super().__init__()

        
        self.device = device
        self.model_param = model_param
        self.architecture = architecture


        self.embedding_layer, self.word_embedding_dim = self.build_emb_layer(emb_matrix,model_param['pad_idx'],model_param['freeze_embedding'])

        if self.architecture.sentence_emb_strat == 'mlp':
            self.mlp = nn.Linear(model_param['max_tokens'],1)

        elif self.architecture.sentence_emb_strat in ('rnn_last','rnn_avg'):
            if self.model_param['rnn'] == 'lstm':
                self.rnn = nn.LSTM(self.word_embedding_dim, self.word_embedding_dim, batch_first = True) 
            else:
                self.rnn = nn.GRU(self.word_embedding_dim, self.word_embedding_dim, batch_first = True) 

        self.drop_layer = nn.Dropout(p=0.5)
        
        #determine the input dimension of the last layer that will classify each claim 
        classifier_input_dim = int(self.architecture.cosine_sim) + (
            self.word_embedding_dim * 2
            if self.architecture.merge_input == "concat"
            else self.word_embedding_dim)

        self.classifier = nn.Linear(classifier_input_dim,1)   


        self.to(self.device)  #move model to device , 'gpu' if possible 
    
    def get_name(self) -> str:

        return self.architecture.sentence_emb_strat+'_'+self.architecture.merge_input+('_cos_sim' if self.architecture.cosine_sim else '' )

    
    def build_emb_layer(self, weights_matrix: np.ndarray, pad_idx : int, freeze = True):
    
        matrix = torch.Tensor(weights_matrix).to(self.device)   #the embedding matrix 
        _ , embedding_dim = matrix.shape 

        emb_layer = nn.Embedding.from_pretrained(matrix, freeze=freeze, padding_idx = pad_idx)   #load pretrained weights in the layer; they stay fixed when freeze is True and are fine-tuned otherwise
        
        return emb_layer, embedding_dim
        

    def pad_batch(self,batch: list):    #pad each sentence of a batch to a common length
        """
            Input:  List of Tensors of variable length
            Output: Batch of tensors all padded to the same length 
        """
        batch = batch.copy()
        
        #if we are going to use 'mlp' as sentence embedding strategy, all the sentences should be padded to max_tokens length
        if self.architecture.sentence_emb_strat == 'mlp':
            batch[0] = nn.ConstantPad1d((0,self.model_param['max_tokens']-batch[0].shape[0]),0)(batch[0])  

        padded_batch = rnn.pad_sequence(batch,batch_first = True, padding_value = self.model_param['pad_idx'])

        padded_batch = padded_batch.to(self.device)    #move tensor to gpu if possible 

        return padded_batch


    def words_embedding(self, word_idxs):   #get embedding vectors for each token in sentence 
        """
            Input:  [batch_size, num_tokens]
            Output: [batch_size, num_tokens, embedding_dim]
        """
        return self.embedding_layer(word_idxs)
    
    def sentence_embedding(self, embeddings, sentence_lenghts):     #compute sentence embedding 
        """
            Input:  [batch_size, num_tokens, embedding_dim]
            Output: [batch_size, embedding_dim]
        """

        strat = self.architecture.sentence_emb_strat

        def mlp():  #compute sentence embedding via dense layer 
            
            reshaped_embeddings = embeddings.permute(0,2,1)     #swap last two dimensions since Linear operates only on last dimension

            sentence_emb = self.mlp(reshaped_embeddings)   

            return sentence_emb.squeeze(2)   #remove dimension of size 1 
        
        def rnn_last():    #take as sentence embedding the last state of the rnn 
            
            packed_embeddings = pack_padded_sequence(embeddings, sentence_lenghts.cpu(), batch_first=True, enforce_sorted=False)
            packed_out, hidden  = self.rnn(packed_embeddings)
            last_h = hidden[0] if isinstance(hidden, tuple) else hidden   #LSTM returns (h_n, c_n), GRU returns only h_n

            if torch.isnan(last_h.squeeze(0)).any():    #debug 
                log.debug('nan in last_h')  
                raise Exception(' nan in last h ')
            
            return last_h.squeeze(0)  #remove first dimension of 1 (TODO: if bidirectional or more than 1 layer this has to be handled)
        
        def rnn_avg():   #take as sentence embedding the average of all the states of the rnn corresponding to each word 

            packed_embeddings = pack_padded_sequence(embeddings, sentence_lenghts.cpu(), batch_first=True, enforce_sorted=False)
            packed_out, _  = self.rnn(packed_embeddings)

            unpacked_out, l = pad_packed_sequence(packed_out,batch_first=True)

            avg = unpacked_out.sum(dim=1).div(sentence_lenghts.unsqueeze(dim=1))

            return avg
        
        def bag_of_vectors():  #sentence embedding as the mean of its token embeddings

            avg = embeddings.sum(dim=1).div(embeddings.count_nonzero(dim=1))

            return avg 

        if strat == 'mlp':
            return mlp()
        elif strat == 'rnn_last':
            return rnn_last()
        elif strat == 'rnn_avg':
            return rnn_avg()
        elif strat == 'bag_of_vectors':
            return bag_of_vectors()
        else :
            raise ValueError('Incorrect name for sentence embedding strategy')

    def merge_sentence_emb(self,claims,evidences):
        """
            Input:  claims -> [batch_size, embedding_dim] , evidences -> [batch_size, embedding_dim]
            Output: [batch_size, dim] dim is based on strategy specified 
        """

        strat = self.architecture.merge_input

        if strat == 'concat':
            result = torch.cat((claims,evidences),dim=1)      #concatenate the two tensors 
        elif strat == 'sum': 
            result = torch.stack((claims,evidences), dim=0).sum(dim=0)    #sum the two tensors 
        elif strat == 'mean':
            result = torch.stack((claims,evidences),dim=0).mean(dim=0)    #compute mean of the two tensors 
        else :
            raise ValueError('Incorrect name for input-merge strategy')
        
        if self.architecture.cosine_sim :
            cosine_sim = F.cosine_similarity(claims,evidences).unsqueeze(-1)
            #append to the merged output one value: the cosine similarity between the two input tensors 
            result = torch.cat((result,cosine_sim),dim=1)

        return result 


    def forward(self, claims, claim_lengths, evidences, evidence_lengths):

        #pad the sentences to have fixed size 
        padded_claims = self.pad_batch(claims)
        # print('padded_claims = ', padded_claims)
        padded_evidences = self.pad_batch(evidences)
        # print('padded_evidence = ', padded_evidences)
        
        #embed each word in a sentence with a 300d vector 
        word_emb_claims = self.words_embedding(padded_claims)       
        # print('word_emb_claims = ', word_emb_claims)   
        word_emb_evidences = self.words_embedding(padded_evidences)
        # print('word_emb_evidences = ', word_emb_evidences)

        #compute sentence embedding
        sentence_emb_claims = self.sentence_embedding(word_emb_claims,claim_lengths)
        # print('sentence_emb_claims = ', sentence_emb_claims)
        sentence_emb_evidences = self.sentence_embedding(word_emb_evidences,evidence_lengths)
        # print('sentence_emb_evidences = ', sentence_emb_evidences)

        #merge multi-inputs 
        classification_input = self.merge_sentence_emb(sentence_emb_claims,sentence_emb_evidences)

        #optional dropout 
        if self.model_param['dropout']:
            classification_input = self.drop_layer(classification_input)

        #final classification 
        predictions = self.classifier(classification_input)

        predictions = predictions.squeeze(-1)   #remove the last dim of size 1 only, so a batch with a single example keeps its batch dimension 

        return predictions 
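
The width of the tensor returned by `merge_sentence_emb` depends on the chosen options, which matters when sizing the input layer of the classifier. The snippet below is only a sketch of that bookkeeping: `merged_dim` is a hypothetical helper that is not part of the notebook, and `sentence_dim` stands for whatever size the sentence embedding has.

```python
def merged_dim(sentence_dim: int, merge_input: str, cosine_sim: bool) -> int:
    """Hypothetical helper: feature size produced by merge_sentence_emb."""
    if merge_input == 'concat':
        dim = 2 * sentence_dim       # claim and evidence placed side by side
    elif merge_input in ('sum', 'mean'):
        dim = sentence_dim           # element-wise reduction keeps the size
    else:
        raise ValueError('Incorrect name for input-merge strategy')
    return dim + 1 if cosine_sim else dim   # one extra scalar feature

print(merged_dim(300, 'concat', False))   # 600
print(merged_dim(300, 'mean', True))      # 301
```

With `concat` the classifier therefore sees twice as many features as with `sum` or `mean`, plus one when the cosine similarity is appended.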
        

In the next cell we define the functions used by the 'train & eval' and 'test' pipelines to compute the metrics of the models.\
Two types of performance evaluation are possible:
- Multi-input classification evaluation: the simpler of the two. We compute metrics such as accuracy and F1-score on the individual (claim, evidence) pairs of our pre-processed dataset. In other words, we assess the performance of the chosen classifier.
- Claim verification evaluation: to give a verdict on the claim itself, we need to consider its whole evidence set. For a given claim, we collect all its (claim, evidence) pairs with their classification outputs and compute the final predicted claim label via majority voting.

In [None]:
#compute accuracy and f1-score 
def acc_and_f1(y_true: torch.LongTensor,y_pred: torch.LongTensor):
    """
        Compute accuracy and f1-score for an epoch 
    """
    acc = accuracy_score(y_true, y_pred)

    f1 = f1_score(y_true,y_pred,average='macro')

    return acc,f1

#construct y_true and y_pred lists to be passed to acc_and_f1 function, but based on majority voting strategy
def majority_voting(y_true, y_pred, y_ids):
    """
        Input: the list of true labels, the list of predicted labels, the list of claim ids used to compute majority voting
        Output: the list of true labels (one for each claim id), the list of labels predicted via majority voting

        Idea behind the implementation: since the same claim id can appear in several (claim, evidence) pairs, we start by 
        counting the number of occurrences of each claim id in the dataset and store them sorted by id (tensor 'a'). 
        Then, for each id, we count how many times it is predicted as supported in the tensor 'y_pred': a 'SUPPORTS' prediction 
        is the integer 1, so in practice we sum the 1s found and again store the results sorted by id (tensor 'b'). 
        Next we create the tensor 'true', containing the true label of each claim, sorted by id (so if the true label of the 
        first claim is SUPPORTS, the integer at index 0 is 1). Finally we create the tensor 'pred', which checks for each 
        element of 'b' (i.e. the number of SUPPORTS predictions for each id) whether it is greater than half the number of 
        occurrences of that id in the dataset. If so, the claim has been predicted as supported in most cases, otherwise as 
        refuted (an equal number of SUPPORTS and REFUTES counts as REFUTES). The result is again stored in a tensor sorted 
        by id.
    """
    start = time.perf_counter()

    search_sorted = torch.searchsorted(y_ids.unique(),y_ids)
    a = torch.bincount(search_sorted)           #number of occurrences for each id in the dataset
    b = torch.bincount(search_sorted, y_pred)   #for each id how many 1s (SUPPORTS) there are in the predicted tensor
    true = (torch.bincount(search_sorted, y_true) > 0).int()    #tensor (sorted on claim id) containing the true label for each claim (1: SUPPORTS, 0: REFUTES)
    pred = (b > (a / 2)).int()  #tensor (sorted on claim id) containing the predicted label for each claim (1: SUPPORTS, 0: REFUTES)
    end = time.perf_counter()

    log.debug('maj voting: %s',end-start)

    return true, pred

#compute accuracy and f1 score via majority voting 
def acc_and_f1_majority(y_true, y_pred, y_ids):

    y_true, y_pred = majority_voting(y_true,y_pred,y_ids)

    acc, f1 = acc_and_f1(y_true,y_pred)

    return acc, f1
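
To make the voting rule concrete, here is a plain-Python sketch of the same idea, meant only as a way to cross-check the tensorized `majority_voting` on small inputs. The name `majority_voting_reference` and the toy labels are made up for illustration, and the sketch assumes that all pairs of a claim share the same true label.

```python
from collections import defaultdict

def majority_voting_reference(y_true, y_pred, y_ids):
    """Plain-Python counterpart of majority_voting, for cross-checking only."""
    votes = defaultdict(list)   # claim id -> predicted labels of its pairs
    truth = {}                  # claim id -> true label (1: SUPPORTS, 0: REFUTES)
    for t, p, i in zip(y_true, y_pred, y_ids):
        votes[i].append(p)
        truth[i] = t
    ids = sorted(votes)         # results are ordered by claim id
    true = [truth[i] for i in ids]
    # strict majority of 1s -> SUPPORTS, a tie counts as REFUTES
    pred = [int(sum(votes[i]) > len(votes[i]) / 2) for i in ids]
    return true, pred

true, pred = majority_voting_reference(
    y_true=[1, 1, 1, 0, 0],
    y_pred=[1, 0, 1, 1, 0],
    y_ids=[7, 7, 7, 3, 3],
)
print(true, pred)   # [0, 1] [0, 1]
```

Claim 3 receives one SUPPORTS vote out of two, a tie, so it is labelled REFUTES. Claim 7 receives two out of three and is labelled SUPPORTS.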

### 8. Train and Validation pipelines

Next we will define the train and validation pipelines.

In [None]:
def train_loop(model: Custom_model, iterator : BucketIterator, optimizer: optim.Optimizer, criterion, device):
    """ Args:
         - model: the model instantiated with pre-defined hyperparameters.
         - iterator: dataloader for passing data to the network in batches 
         - optimizer: optimizer for backward pass 
         - criterion: loss function 
         - device: the CUDA device if it's available, the CPU otherwise 
    """
    start = time.perf_counter()

    batch_loss = 0
    
    #aggregate all the predictions and corresponding true labels (and claim ids) in tensors 
    all_pred , all_targ, all_ids = torch.LongTensor(), torch.LongTensor(), torch.LongTensor()

    model.train()
    
    iterator.init_epoch()  #generate and shuffles batches from dataloader #TODO: create_batches 

    for batch_id, batch in enumerate(iterator.batches):

        claims_batch = [torch.LongTensor(example['claim']) for example in batch]    #list of tensors of word ids, one per claim in the batch 
        evidences_batch = [torch.LongTensor(example['evidence']) for example in batch]     #list of tensors of word ids, one per evidence in the batch 

        claims_lengths = torch.Tensor([len(example['claim']) for example in batch])         #length of each claim sentence before padding 
        evidence_lengths = torch.Tensor([len(example['evidence']) for example in batch])         #length of each evidence sentence before padding 

        target_labels = torch.Tensor([example['label'] for example in batch])     #label of each example in a batch
        target_ids = torch.LongTensor([example['claim_id'] for example in batch])  #id of each claim in a batch 

        #move tensors to gpu if possible 
        claims_lengths = claims_lengths.to(device)
        evidence_lengths = evidence_lengths.to(device)
        target_labels = target_labels.to(device)    

        #zero the gradients 
        model.zero_grad(set_to_none=True)
        optimizer.zero_grad()            

        predictions = model(claims_batch,claims_lengths,evidences_batch,evidence_lengths)   #generate predictions 

        loss = criterion(predictions, target_labels)      #compute the loss 


        pred = (predictions > 0.0 ).int().cpu()              #get class label 

        #concatenate the new tensors with the one computed in previous steps
        all_pred = torch.cat((all_pred,pred))          
        all_targ = torch.cat((all_targ,target_labels.long().cpu()))
        all_ids = torch.cat((all_ids,target_ids))

        #backward pass 
        loss.backward()
        optimizer.step()

        batch_loss += loss.item()    #accumulate batch loss 


    acc, f1 = acc_and_f1(all_targ,all_pred)

    maj_acc, maj_f1 = acc_and_f1_majority(all_targ,all_pred,all_ids)

    loss = batch_loss/(batch_id+1)    #mean loss 

    end = time.perf_counter()
    log.debug('train epoch time: %s',end-start)

    return loss, acc, f1, maj_acc, maj_f1,predictions


def eval_loop(model: Custom_model, iterator: BucketIterator, criterion, device):
    """ Args:
         - model: the model instantiated with fixed hyperparameters.
         - iterator: dataloader for passing data to the network in batches 
         - criterion: loss function 
         - device: the CUDA device if it's available, the CPU otherwise 
    """
     
    start = time.perf_counter()

    batch_loss = 0
    
    all_pred , all_targ, all_ids = torch.LongTensor(), torch.LongTensor(), torch.LongTensor() 
    
    model.eval()   #model in eval mode 
    
    iterator.init_epoch()  #TODO create_batches 

    with torch.no_grad(): #without computing gradients since it is evaluation loop
    
        for batch_id, batch in enumerate(iterator.batches):
            
            claims_batch = [torch.LongTensor(example['claim']) for example in batch]    #list of tensors of word ids, one per claim in the batch 
            evidences_batch = [torch.LongTensor(example['evidence']) for example in batch]     #list of tensors of word ids, one per evidence in the batch 

            claims_lengths = torch.Tensor([len(example['claim']) for example in batch])         #length of each claim sentence before padding 
            evidence_lengths = torch.Tensor([len(example['evidence']) for example in batch])         #length of each evidence sentence before padding 

            target_labels = torch.Tensor([example['label'] for example in batch])     #label of each example in a batch
            target_ids = torch.LongTensor([example['claim_id'] for example in batch])  #id of each claim in a batch 

            #move tensors to gpu if possible 
            claims_lengths = claims_lengths.to(device)
            evidence_lengths = evidence_lengths.to(device)
            target_labels = target_labels.to(device)    

            
            predictions = model(claims_batch,claims_lengths,evidences_batch,evidence_lengths)   #generate predictions 

            loss = criterion(predictions, target_labels)      #compute the loss 

            
            pred = (predictions > 0.0 ).int().cpu()         #get class label 

            #concatenate the new tensors with the one computed in previous steps
            all_pred = torch.cat((all_pred,pred))          
            all_targ = torch.cat((all_targ,target_labels.long().cpu()))
            all_ids = torch.cat((all_ids,target_ids))

            batch_loss += loss.item()   #accumulate batch loss 
            

    acc, f1 = acc_and_f1(all_targ,all_pred)

    maj_acc, maj_f1 = acc_and_f1_majority(all_targ,all_pred,all_ids)

    loss = batch_loss/(batch_id+1)   #mean loss 

    end = time.perf_counter()
    log.debug('eval epoch time: %s',end-start)

    return loss, acc, f1, maj_acc, maj_f1, predictions

Now that we have the train and eval loops we can combine them. The two phases alternate at each epoch, so that we can follow the progress made by our model.\
Here we also instantiate the optimizer and the loss criterion with the parameters that we specify.\
The `train_and_eval` function takes as parameters the model to be trained and evaluated in turn over the entire dataset, the dataloaders that provide the mini-batches defined previously, and the hyperparameters.\
It returns a dictionary `model_metrics` with the results of the given architecture, containing:
- `model_name`: the name of the model evaluated. 
- `train_loss`, `train_acc`, `train_f1`: standard metrics computed at each epoch on the training set.
- `train_maj_acc`, `train_maj_f1`: metrics computed using the majority voting technique on the training set.
- `val_loss`, `val_acc`, `val_f1`: standard metrics computed at each epoch on the validation set.
- `val_maj_acc`, `val_maj_f1`: metrics computed using the majority voting technique on the validation set.

In [None]:
def train_and_eval(model: Custom_model, dataloaders: tuple[BucketIterator,...], param: dict, device):
    """
        Runs the train and eval loop and keeps track of all the metrics of the training model 
    """
    best_f1, best_epoch = -1, -1   #init best f1 score 
    model_name = model.get_name()

    log.debug('Train and Eval for model: %s \n',model_name)

    model_metrics = {
        "model_name": model_name,
        "train_loss": [],
        "train_acc": [],
        "train_f1": [],
        "train_maj_acc": [],
        "train_maj_f1": [],
        "val_loss": [],
        "val_acc": [],
        "val_f1": [],
        "val_maj_acc": [],
        "val_maj_f1": [],
    }

    criterion = nn.BCEWithLogitsLoss(pos_weight=param['weight_positive_class']).to(device)    #Binary Cross-Entropy loss that accepts raw logits and applies the sigmoid internally 

    optimizer = optim.Adam(model.parameters(), lr=param['lr'],  weight_decay=param['weight_decay'])   #L2 regularization 

    train_dataloader, eval_dataloader = dataloaders   #unpack dataloaders 

    for epoch in range(param['n_epochs']): #epoch loop

        start_time = time.perf_counter()
        
        train_metrics = train_loop(model, train_dataloader, optimizer, criterion, device) #train
        eval_metrics = eval_loop(model, eval_dataloader, criterion, device) #eval
        
        end_time = time.perf_counter()

        tot_epoch_time = end_time-start_time          

        train_epoch_loss, train_epoch_acc, train_epoch_f1, train_epoch_maj_acc, train_epoch_maj_f1, train_predictions = train_metrics
        eval_epoch_loss, eval_epoch_acc, eval_epoch_f1, eval_epoch_maj_acc, eval_epoch_maj_f1, eval_predictions = eval_metrics

        if eval_epoch_f1 >= best_f1:
            best_f1 = eval_epoch_f1
            best_epoch = epoch+1
            os.makedirs('models', exist_ok=True)   #create the checkpoint folder if it is missing
            torch.save(model.state_dict(),f'models/{model_name}.pt')


        #log Train and Validation metrics
        model_metrics['train_loss'].append(train_epoch_loss)
        model_metrics['train_acc'].append(train_epoch_acc)
        model_metrics['train_f1'].append(train_epoch_f1)
        model_metrics['train_maj_acc'].append(train_epoch_maj_acc)
        model_metrics['train_maj_f1'].append(train_epoch_maj_f1)
        model_metrics['val_loss'].append(eval_epoch_loss)
        model_metrics['val_acc'].append(eval_epoch_acc)
        model_metrics['val_f1'].append(eval_epoch_f1)
        model_metrics['val_maj_acc'].append(eval_epoch_maj_acc)
        model_metrics['val_maj_f1'].append(eval_epoch_maj_f1)
       
        
        log.debug('Elapsed time for epoch %s : %s \n',epoch+1,tot_epoch_time)

        print(f'Epoch: {epoch+1:02} | Epoch Time: {tot_epoch_time:.4f}')
        print(f'\tTrain Loss: {train_epoch_loss:.3f} | Train Acc: {train_epoch_acc*100:.2f}% | Train F1: {train_epoch_f1:.2f} | Train Maj Acc: {train_epoch_maj_acc*100:.2f}%  | Train Maj F1: {train_epoch_maj_f1:.2f}')
        print(f'\t Val. Loss: {eval_epoch_loss:.3f} | Val. Acc: {eval_epoch_acc*100:.2f}% | Val. F1: {eval_epoch_f1:.2f}  | Val Maj Acc: {eval_epoch_maj_acc*100:.2f}%  | Val Maj F1: {eval_epoch_maj_f1:.2f}')
    
    log.debug('Best Eval F1: %s, obtained at epoch: %s \n\n',best_f1,best_epoch)

    return model_metrics
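
Since `model_metrics` stores one value per epoch, the best epoch can also be recovered after training without reading the log. The following is a minimal sketch, assuming the dictionary layout described above. `best_epoch` is a hypothetical helper and the numbers in `toy_metrics` are made up.

```python
def best_epoch(model_metrics: dict, key: str = 'val_f1'):
    """Hypothetical helper: 1-based epoch with the highest value of `key`."""
    best_value, best_idx = -1, -1
    for idx, value in enumerate(model_metrics[key]):
        if value >= best_value:      # '>=' keeps the latest epoch on ties
            best_value, best_idx = value, idx + 1
    return best_idx, best_value

toy_metrics = {'val_f1': [0.61, 0.70, 0.68, 0.70]}   # made-up numbers
print(best_epoch(toy_metrics))   # (4, 0.7)
```

The `>=` comparison mirrors the checkpointing rule in `train_and_eval`: when two epochs reach the same validation F1, the later one is kept.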

Before training the models we have to define the parameters and hyperparameters passed to the train and validation loops. After a tuning phase, these are the values we use to train each architecture:
- `BATCH_SIZE = 128`: we want batches that are neither too small, to avoid noisy gradients, nor too big, to keep training fast. Since we didn't see big differences for batch sizes between 64 and 256, we set `BATCH_SIZE = 128`.
- `LR = 1e-3`: we obtained the best results using `Adam`, which adapts the step size of each parameter during training. It still requires an initial learning rate, and `1e-3` gave us the best results.
- `N_EPOCHS = 15`: with `DROPOUT = True`, the models reach their best performance after roughly 15 epochs and remain stable afterwards.
- `WEIGHT_DECAY = 1e-5`: L2 regularization, introduced to limit overfitting.
- `FREEZE = False`: when set to `False`, the weights of the embedding layer are trainable and are fine-tuned together with the rest of the network, which proved useful to obtain the best results.
- `DROPOUT = True`: regularization technique against overfitting, which proved particularly useful for the `mlp` and `bag_of_vectors` models.

In [None]:
#PARAMETERS, HYPERPARAMETERS AND USEFUL OBJECTS 

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'running on {DEVICE}')

PAD_IDX = 0                     # pad index

#hyperparameters
BATCH_SIZE = 128                # number of sentences in each mini-batch
LR = 1e-3                       # learning rate 
N_EPOCHS = 15                   # number of epochs
WEIGHT_DECAY = 1e-5             # regularization

#model parameters
FREEZE = False                  # whether to freeze the embedding layer (False makes it trainable)
RNN = 'lstm'                    # either gru or lstm
DROPOUT = True                  # whether to use the dropout layer or not  


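#to counteract class imbalance 
#note: value_counts() sorts by frequency, so this unpacking assumes SUPPORTS is the most frequent label in the train split 
(supports, refutes) = df.loc[df['split'] == 'train']['idx_label'].value_counts()    #number of supports and refutes in the train dataset 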
weight_positive_class = torch.Tensor([refutes/supports]).to(DEVICE)  #weight to give to positive class 

max_tokens = max(df.claim.apply(len).max(),df.evidence.apply(len).max())  #max number of tokens in a sentence in the entire dataset 


#train pipeline parameters dictionary 
train_param = {
    'lr': LR,
    'n_epochs': N_EPOCHS,
    'weight_decay': WEIGHT_DECAY,
    'weight_positive_class': weight_positive_class
    }

#model parameters dictionary
model_param = {
    'pad_idx' : PAD_IDX,
    'max_tokens' : max_tokens,
    'freeze_embedding' : FREEZE,   
    'rnn' : RNN,
    'dropout' : DROPOUT
}

#create dataloaders 
train_dataloader,val_dataloader,test_dataloader = create_dataloaders(BATCH_SIZE,df)
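
To see what `weight_positive_class` does once it is passed as `pos_weight` to `BCEWithLogitsLoss`, the per-example loss can be written out by hand. This is only a sketch in plain Python: the class counts are made up, and `weighted_bce_with_logits` is a hypothetical helper that reproduces the documented formula for a single example.

```python
import math

def weighted_bce_with_logits(logit: float, target: float, pos_weight: float) -> float:
    """Per-example loss: -[pos_weight*y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]."""
    p = 1.0 / (1.0 + math.exp(-logit))   # sigmoid of the raw logit
    return -(pos_weight * target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

supports, refutes = 800, 200          # made-up class counts
pos_weight = refutes / supports       # 0.25: down-weights the frequent positive class

loss_pos = weighted_bce_with_logits(0.0, 1.0, pos_weight)   # positive example
loss_neg = weighted_bce_with_logits(0.0, 0.0, pos_weight)   # negative example
print(round(loss_pos, 4), round(loss_neg, 4))   # 0.1733 0.6931
```

With the same uncertain logit, an error on a SUPPORTS example costs a quarter of an error on a REFUTES example, which compensates for SUPPORTS being four times more frequent in this toy setting.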

### 9. Train the Models 

We are now ready to train our models. In particular, we train 4 different models, each with a different sentence embedding strategy, while keeping the same claim-evidence input-merging strategy. 
As the input-merging baseline we chose the `concat` strategy, which manipulates the two inputs of the classifier (claim and evidence) the least: it simply concatenates them.

In this way we evaluate all the sentence embedding strategies and pick the best-performing one, which is then also tested with the other input-merging options.
Finally, the best architecture is also trained with the `cosine similarity` between claim and evidence as an additional feature of the classification input. 



In [None]:
# clear gpu memory 
import gc
def clean_gpu_cache():
    gc.collect()
    torch.cuda.empty_cache()

all_models_metrics = {}

#### 1) MLP 
Here we train the model with the `mlp` strategy that encodes the sentences via a Linear layer.

In [None]:
architecture1 = Architecture('mlp','concat',False)                    
model1 = Custom_model(embedding_matrix, model_param, architecture1, DEVICE)

model1_metrics = train_and_eval(model1, (train_dataloader,val_dataloader), train_param, DEVICE)

all_models_metrics[model1_metrics['model_name']] = model1_metrics

#### 2) RNNLastState 
Here we train the model with the `RNNLastState` strategy. It encodes token sequences via an RNN and takes the last state as the sentence embedding.

In [None]:
clean_gpu_cache()

architecture2 = Architecture('rnn_last','concat',False)                    
model2 = Custom_model(embedding_matrix, model_param, architecture2, DEVICE)

model2_metrics = train_and_eval(model2, (train_dataloader,val_dataloader), train_param,DEVICE)

all_models_metrics[model2_metrics['model_name']] = model2_metrics

#### 3) RNNAvg 
Here we train the model with the `RNNAvg` strategy, which encodes the sentences via an RNN and then averages all the output states.

In [None]:
clean_gpu_cache()

architecture3 = Architecture('rnn_avg','concat',False)                    
model3 = Custom_model(embedding_matrix, model_param, architecture3, DEVICE)

model3_metrics = train_and_eval(model3, (train_dataloader,val_dataloader), train_param,DEVICE)

all_models_metrics[model3_metrics['model_name']] = model3_metrics

#### 4) BagOfVectors
Here we train the model with the `BagOfVectors` strategy. The sentence embedding is computed as the mean of its token embeddings. 

In [None]:
clean_gpu_cache()

architecture4 = Architecture('bag_of_vectors','concat',False)                    
model4 = Custom_model(embedding_matrix, model_param, architecture4, DEVICE)

model4_metrics = train_and_eval(model4, (train_dataloader,val_dataloader), train_param,DEVICE)

all_models_metrics[model4_metrics['model_name']] = model4_metrics

Before going on to try different input-merging strategies, let's first plot the results of these first four runs of train and evaluation. 

In [None]:
import matplotlib.pyplot as plt

cols = ['train_loss','val_loss', 'train_acc','val_acc', 'train_f1', 'val_f1', 'train_maj_acc', 'val_maj_acc', 'train_maj_f1', 'val_maj_f1']
metrics = ['Loss', 'Accuracy', 'F1-Score', 'Majority Accuracy', 'Majority F1-Score']
rows = [name for name in all_models_metrics.keys()]
colors = ['lightsalmon', 'red', 'lightblue', 'blue', 'lightgreen', 'green', 'mediumorchid', 'darkviolet', 'gold', 'goldenrod']


fig, axes = plt.subplots(nrows=len(rows), ncols=len(metrics), figsize=(20, 20))

for ax, col in zip(axes[0], metrics):
    ax.set_title(col)

for ax, row in zip(axes[:,0], rows):
    ax.set_ylabel(row, size='large')

keys = list(all_models_metrics.keys())

for plt_row in range(len(rows)):
    num_metric = 0
    for plt_col in range(len(metrics)):
        axes[plt_row,plt_col].plot(all_models_metrics[keys[plt_row]][cols[num_metric]], color= colors[num_metric], label = 'Train') #plot train metrics
        axes[plt_row,plt_col].plot(all_models_metrics[keys[plt_row]][cols[num_metric+1]], color= colors[num_metric+1], label = 'Val') #plot validation metrics
        axes[plt_row,plt_col].legend(loc="best")
        num_metric += 2

fig.tight_layout()
plt.show();

As the plots above show, the RNN architectures perform slightly better than the others. In particular, `rnn_avg` appears to be the best-performing sentence embedding, reaching an F1-score of 0.77 on the validation set in relatively few epochs. `rnn_last` usually performs a bit worse, probably because it relies only on the last state of the network instead of aggregating all its outputs. Even `mlp` and `bag_of_vectors`, which are much simpler sentence embedding techniques, achieve acceptable results, usually reaching an F1-score of around 0.72 to 0.73. Finally, the two regularization techniques, L2 and Dropout, were fundamental in limiting overfitting, which still appears after a certain number of epochs but only to a minor extent.\
\
At this point we can select the best-performing sentence embedding strategy (`rnn_avg`) and train that architecture with the remaining input-merging strategies. We don't need to test the `concat` strategy again, since it was used in the previous runs. 

#### 1) Sum
Here we train the model with the `Sum` input-merging strategy. In this case, the input passed to the classifier is the sum of the evidence and claim sentence embeddings.

In [None]:
clean_gpu_cache()

architecture5 = Architecture('rnn_avg','sum',False)                    
model5 = Custom_model(embedding_matrix, model_param, architecture5, DEVICE)

model5_metrics = train_and_eval(model5, (train_dataloader,val_dataloader), train_param, DEVICE)

#### 2) Mean
Here we train the model with the `Mean` input-merging strategy. The final classification input is defined here as the mean of the evidence and claim sentence embeddings.

In [None]:
clean_gpu_cache()

architecture6 = Architecture('rnn_avg','mean',False)                    
model6 = Custom_model(embedding_matrix, model_param, architecture6, DEVICE)

model6_metrics = train_and_eval(model6, (train_dataloader,val_dataloader), train_param, DEVICE)

COMMENTS ON THE RESULTS 

Finally we can add an additional feature to the classification input of our best neural architecture. This additional value is the cosine similarity between the two sentence embeddings (claim and evidence) and it is concatenated to the input of the classifier.

We now train and evaluate a model whose architecture has `rnn_avg` as sentence embedding strategy, the best-performing claim-evidence merging strategy (`concat` in the cell below), and the additional feature of `cosine similarity` added to the classification input. 
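
As a reminder of what this extra feature measures, the cosine similarity of two vectors is their dot product divided by the product of their norms. The sketch below computes it in plain Python on two made-up vectors. The function and the `eps` guard against zero norms are illustrative and are not the notebook's implementation, which relies on `F.cosine_similarity`.

```python
import math

def cosine_similarity(u, v, eps: float = 1e-8) -> float:
    """Cosine of the angle between two vectors, with a small eps guarding zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (max(norm_u, eps) * max(norm_v, eps))

claim = [1.0, 0.0, 1.0]       # made-up sentence embeddings
evidence = [1.0, 1.0, 0.0]
print(round(cosine_similarity(claim, evidence), 2))   # 0.5
```

The value lies in [-1, 1] and does not depend on the length of the two vectors, so it adds a single scale-free feature to the classifier input.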

In [None]:
clean_gpu_cache()

architecture7 = Architecture('rnn_avg','concat',True)                    
model7 = Custom_model(embedding_matrix, model_param, architecture7, DEVICE)

model7_metrics = train_and_eval(model7, (train_dataloader,val_dataloader), train_param, DEVICE)

COMMENTS ON THE RESULTS, ANALYSIS, SUMMARY 

In [None]:
logging.shutdown()