# Motivation
This notebook applies a Universal Sentence Encoder for Question Answering on the manually reviewed dataset.

USE-QA is a greater-than-word length text encoder for question answer retrieval. It is trained on a variety of data sources and tasks, with the goal of learning text representations that are useful out-of-the-box to retrieve an answer given a question, as well as question and answers across different languages. [Source](https://tfhub.dev/google/universal-sentence-encoder-qa/3)


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
pd.set_option('max_colwidth', None)

## Data

The dataset we'll use is based on the *task_1-google_search_manually_reviewed_metadata.csv* file provided in the original data. The following processing steps have been applied to it:
* Cleaning of the data using code from this notebook: [Task 1: EDA + cleanup on manually reviewed](https://www.kaggle.com/didizlatkova/task-1-eda-cleanup-on-manually-reviewed)
* Manually going through the snippets and extracting sub-snippets which answer one or more of the 8 questions, defined in the task.
* Adding 8 columns, one per question, containing the extracted sub-snippets, or NaN if no sub-snippet answers the question.

The resulting data is exported to a new dataset called **bcg-manually-reviewed-cleaned**.

In [None]:
path = '/kaggle/input/bcg-manually-reviewed-cleaned'
file = f'{path}/manually_reviewed_cleaned.csv'
df = pd.read_csv(file, encoding = "ISO-8859-1")

In [None]:
df.head(1)

In [None]:
f"Using {df.shape[0]} entries"

## Evaluation

The evaluation of the approach will be performed in the following steps:
* Splitting the original texts from the 35 files into snippets
* Calculating a score between each snippet and each question using the pretrained model
* Comparing the highest scoring snippets with the manually extracted ones

### Split texts into snippets

To split the texts into snippets we'll use spaCy and its ability to return all sentences in a document. After that we apply additional filtering: we keep only sentences which contain more than 5 tokens and at least one verb.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

def get_snippets(text):
    '''
    Returns sentences in the text which contain more than 5
    whitespace-separated tokens and at least one verb.
    '''
    return [sent.text.strip() for sent in nlp(text).sents
            if len(sent.text.strip().split()) > 5
            and any(token.pos_ == 'VERB' for token in sent)]

### Calculate scores

#### Model Details
* Developed by researchers at Google, 2019, v2 [1].
* Transformer.
* Strong performance on question answer retrieval for English.
* Use the **question_encoder** signature to encode variable-length questions; the output is a 512-dimensional vector. The default signature is identical to the question_encoder signature.
* Use the **response_encoder** signature to encode the answer and the output is a 512 dimensional vector.
* The response_encoder signature accepts two input fields:
    * **input**: the answer text.
    * **context**: usually the text around the answer text, for example 2 sentences before plus 2 sentences after, or the paragraph containing the answer text. If you don't have context to include, you can duplicate the answer text into this field.
* All input text can have arbitrary length. However, it is recommended that question and response inputs be approximately one sentence in length.
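
The notebook below passes each snippet as its own context. As an alternative, the "2 sentences before plus 2 sentences after" suggestion could be sketched as follows. `build_contexts` is a hypothetical helper, not part of the model or of this notebook, and it assumes that the context should also include the answer sentence itself.

```python
def build_contexts(sentences, window=2):
    """Hypothetical helper: for each sentence, join it with up to
    `window` neighbouring sentences on each side."""
    contexts = []
    for i in range(len(sentences)):
        start = max(0, i - window)                  # clip at the start of the text
        end = min(len(sentences), i + window + 1)   # clip at the end of the text
        contexts.append(' '.join(sentences[start:end]))
    return contexts

# Toy input, only to show the windowing behaviour.
sentences = ['A.', 'B.', 'C.', 'D.', 'E.', 'F.']
contexts = build_contexts(sentences, window=2)
print(contexts[0])  # A. B. C.
print(contexts[3])  # B. C. D. E. F.
```

The resulting list has the same length as `sentences`, so it could be passed as the `context` argument alongside the snippets passed as `input`.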

In [None]:
!pip3 install "tensorflow_text>=2.0.0rc0"

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text

In [None]:
module = hub.load('https://tfhub.dev/google/universal-sentence-encoder-qa/3')

In [None]:
questions = ["Which is the First Year of the BCG Policy?",
             "Which is the last year of the BCG Policy?",
             "Is BCG vaccination mandatory for all children?",
             "What is the timing for the BCG vaccination (age)?",
             "Which BCG Strain has been used?",
             "Are revaccinations (boosters) recommended for BCG?",
             "What is the timing of BCG revaccination?",
             "Where in the body (arm) is the BCG Vaccine administered?"]

question_embeddings = module.signatures['question_encoder'](
            tf.constant(questions))

In [None]:
def read_text(row):
    code = row['alpha_2_code']
    filename = row['filename'].replace('.txt', '')
    filename = f'/kaggle/input/hackathon/task_1-google_search_txt_files_v2/{code}/{filename}.txt'
    
    with open(filename, 'r') as file:
        data = file.read().replace('\n', ' ')
    return data

question_names = ['first_year','last_year','is_mandatory','timing','strain','has_revaccinations','revaccination_timing','location']

def apply_USE_model(row):
    data = read_text(row)
    
    snippets = get_snippets(data)
    
    response_embeddings = module.signatures['response_encoder'](
        input=tf.constant(snippets),
        context=tf.constant(snippets))
    scores = np.inner(question_embeddings['outputs'], response_embeddings['outputs'])

    result = pd.DataFrame(scores.T, columns=question_names)
    result['sentence'] = snippets
    result['len'] = result['sentence'].apply(len)
    result['country'] = row['country']
    
    return result
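
The score matrix computed in `apply_USE_model` has one row per question and one column per snippet, which is why it is transposed before being turned into a DataFrame with one column per question. A minimal sketch of the shapes involved, using random vectors as stand-ins for the model's 512-dimensional outputs (the `toy_*` names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
toy_questions = rng.normal(size=(8, 512))    # stand-in for question_embeddings['outputs']
toy_responses = rng.normal(size=(20, 512))   # stand-in for response_embeddings['outputs']

# np.inner takes the dot product of every question vector with every snippet vector.
toy_scores = np.inner(toy_questions, toy_responses)
print(toy_scores.shape)    # (8, 20): questions x snippets
print(toy_scores.T.shape)  # (20, 8): snippets x questions

# Entry [i, j] is the dot product of question i and snippet j.
assert np.isclose(toy_scores[2, 5], toy_questions[2] @ toy_responses[5])

# Index of the highest-scoring snippet for each question.
best_snippet = toy_scores.argmax(axis=1)
print(best_snippet.shape)  # (8,)
```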

In [None]:
import tqdm

dfs = []
# total lets tqdm show a proper progress bar, since iterrows() has no length
for _, row in tqdm.tqdm(df.iterrows(), total=len(df)):
    result = apply_USE_model(row)
    dfs.append(result)

In [None]:
final_eval = pd.concat(dfs, ignore_index=True)

In [None]:
f"The evaluation is performed on {final_eval.shape[0]} snippets"

The highest-scoring answer for each of the 8 questions is listed below.

In [None]:
final_eval.loc[final_eval[question_names].idxmax()]

The top 3 answers for each question are displayed below.

In [None]:
for k in question_names:
    print(k)
    display(final_eval.sort_values(k, ascending=False)[[k, 'sentence','country','len']].head(3))

We can see that the model works best for answering questions about location.

A distribution of the overall scores for each question is shown below.

In [None]:
final_eval.drop('len', axis=1).plot.box(figsize=(15,5))

In [None]:
df.columns = ['alpha_2_code', 'country', 'url', 'filename', 'is_pdf','Comments',
              'Snippet'] + question_names + ['snippet_len', 'text_len']

Comparison of predicted snippets with the extracted ones (only for the first 10 entries):

In [None]:
for _, row in df.head(10).iterrows():
    print('----' * 10)
    print('ACTUAL:')
    print('\n'.join([f"<{i}>: {v}" for i, v in row[question_names].dropna().items()]))
    
    cols = list(row[question_names].dropna().index)
    
    result = apply_USE_model(row)
    print(f"Total snippets: {result.shape[0]}")
    
    for k in cols:
        display(result.sort_values(k, ascending=False)[[k, 'sentence','country','len']].head(3))
    

Going through the results shows that the model performs poorly on this task: in most cases the predicted sentences do not match the actual snippets from the reviewed dataset.
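
The comparison above is done by eye. One way to put a number on it would be a top-k hit rate. The sketch below is only an illustration: `is_match` and `top_k_hit` are hypothetical helpers, the substring-based matching rule is an assumption, and the example sentences are made up rather than taken from the dataset.

```python
def is_match(predicted, actual):
    """Hypothetical matching rule: one string contains the other,
    ignoring case and repeated whitespace."""
    p = ' '.join(predicted.lower().split())
    a = ' '.join(actual.lower().split())
    if not p or not a:
        return False
    return a in p or p in a

def top_k_hit(ranked_sentences, actual, k=3):
    """True if any of the k highest-ranked sentences matches the manual snippet."""
    return any(is_match(s, actual) for s in ranked_sentences[:k])

# Made-up sentences, ordered from highest to lowest score.
ranked = ['BCG is given at birth.',
          'The vaccine is injected in the left upper arm.',
          'BCG was introduced in 1950.']
actual = 'injected in the left upper arm'

print(top_k_hit(ranked, actual, k=1))  # False
print(top_k_hit(ranked, actual, k=3))  # True
```

In the loop above, `ranked` would correspond to `result.sort_values(k, ascending=False)['sentence'].tolist()` and `actual` to `row[k]`. Averaging the hits over all rows and questions would give a single score to track.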