# Semantic Question Answering System

---

> What is this? In this notebook, I attempt to put together a question answering system using current state-of-the-art libraries and models.

# Objective

- `Problem`  - **Return the most needed content (as opposed to all content).**
- `Solution` - *Build a system that allows a user to ask questions on the literature and receive the most needed content (e.g., the content/answer cannot be the size of the entire article!).*


To add to the problem of information overload, we cannot pass a full article to the question answering model, as this would be painfully slow. The code below attempts to solve this by first narrowing each query down to a small set of relevant sentences, so that answers come back fast enough for a human to use interactively.

> Overview/Summary of the QA System

<img src="https://github.com/amoux/corona/blob/master/src/img/SemanticQuestionAnsweringSystem.png?raw=true"/>

- Steps:

    - build the cord-19 dataset:
        - apply pre-processing and normalization to raw-texts.
        - transform (tokenize) raw-texts from n articles to sentences.
 
    - build the embedding store (similar to a DB but optimized for similarity search):
        - encode sentences to embeddings.
        
    - other:
        - extract questions from the dataset (papers).
        - apply the same steps used for the sentences to the questions.
        - build a terminology graph of the questions extracted.
        
    - final:
        - build the question answering engine.
        - query the questions extracted from the dataset.
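
The steps above can be sketched end to end in a few lines. This is only a toy illustration, not the notebook's implementation: `embed` is a hypothetical bag-of-words stand-in for the sentence encoder, and `retrieve` is a brute-force stand-in for the embedding store. The point is the shape of the flow: encode sentences, rank them against the question, and pass only the top matches on as context.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy stand-in for the sentence encoder: L2-normalized bag-of-words."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two normalized sparse vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def retrieve(question: str, sentences: List[str],
             topk: int = 2) -> List[Tuple[float, str]]:
    """Rank sentences by similarity to the question and keep the top-k."""
    q = embed(question)
    scored = [(cosine(q, embed(s)), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:topk]


sentences = [
    "The incubation period of the virus is around five days.",
    "Masks reduce transmission in crowded indoor settings.",
    "The study was funded by a national research grant.",
]
hits = retrieve("What is the incubation period of the virus?", sentences, topk=1)

# Only the retrieved sentences (not the full article) become the QA context.
context = " ".join(sentence for _, sentence in hits)
print(context)
```

In the real system the context built this way is what gets handed to the question answering model, which is why the retrieval step has to be fast.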

---

## Is too much information a problem?

> The following summary can help us understand the question.

- The output below comes from the same model used in this notebook. Note that the question itself restates the problem described above.
    - Question extracted from article: [Evaluation and mechanism for outcomes exploration of providing public health care in contract service in Rural China: a multiple-case study with complex adaptive systems design](https://bmcpublichealth.biomedcentral.com/articles/10.1186/s12889-015-1540-9)

        - **question**: *What would be the value to the user of federating search results from many discovery tools?*

        - **answer**  : `satisfaction and engagement`

        - **context** : *`To what degree would an optimal searching environment enhance the satisfaction and engagement of existing users? How can we better understand how our discovery tools are being used and assess whether we are returning the most needed content (as opposed to all content)? Likewise, participation in knowledge-generating cases, whether direct or vicarious, seems integral to learning or appreciating the nature of scientific research. The central coordination of this global DoD surveillance system afforded multiple opportunities for enhanced utilization of partner capabilities, as well as concise information sharing with other DoD organizations and external agencies (Table 2). e. g., sharing and promoting one's work, perpetuation of bias by discovery systems) They permit structured searches and comparison of data in different clearinghouses and give the user adequate information to find data and use it in an appropriate context [107].`*

# About

- **Live Application**:
    - I additionally put together a simple web-app that shows how the source-code and models implemented in this notebook can be used in an application setting.
        - application data based on:
            * dataset : `2020-04-24`
            * subsets : `comm_use_subset, noncomm_use_subset, biorxiv_medrxiv`
            * papers : `14,565`
            * text-source : `body_text`
            * embeddings/sentences: `2,569,779`
        
    - WebApp URL : [COVID-19 Semantic Question Answering System](http://corona-nlp.ngrok.io.ngrok.io/?fbclid=IwAR2h4wYcxXN00dEO-wZlisQzQO-MInla8Po98ZhyZBuPDBTdlout4_sQ9aE)
    

- **Transformer Models**:
    - base model for both models below : `allenai/scibert_scivocab_uncased`

        - `scibert_nli` (uncased) : First fine-tuned on the `AllNLI dataset`, then on train set of `STS benchmark` for sentence embeddings.
            - NOTE: This model is currently available only from my Google Drive and can only be used with the `sentence_transformers` library.
   
        - `scibert_nli_squad` (uncased) : First fine-tuned on the `AllNLI dataset`, then on the `SQuAD 2.0 dataset` for the question answering task.
            - Available for download on Huggingface's website: [Model: amoux/scibert_nli_squad](https://huggingface.co/amoux/scibert_nli_squad)
            
            
> Enough talk, let's build this thing!


In [None]:
!"/opt/conda/bin/python3.7" -m pip install --upgrade pip
!pip install googledrivedownloader
!pip install -U transformers
!pip install -U --no-deps sentence_transformers
!pip install bert-extractive-summarizer
!pip install scikit_learn

# IT REALLY SUCKS THIS MODEL COULD NOT BE INSTALLED! :(
# !pip install -U scispacy
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz

!pip install -U spacy thinc
!python -m spacy download en_core_web_sm
!conda install faiss-cpu -c pytorch --yes

In [None]:
import concurrent.futures
from multiprocessing import cpu_count
import functools
import json
import pickle
import random
import re
from collections import Counter
from dataclasses import dataclass, field
from pathlib import Path
from string import punctuation
from typing import (IO, Any, Dict, Callable, Iterator,
                    List, Sequence, Tuple, Union)

import faiss
import numpy as np
import spacy
import torch
from google_drive_downloader import GoogleDriveDownloader
from nltk.tokenize import word_tokenize
from sentence_transformers import SentenceTransformer
from spacy.lang.en import English
from spacy.tokens.span import Span
from summarizer import Summarizer
from tqdm.auto import tqdm
from transformers import (BertConfig, BertForQuestionAnswering, BertModel,
                          BertTokenizer)

## Download the Sentence Encoder Model

In [None]:
file_id = "1Qm7EL7eOsSgB66v5Zn_n-nAhR2OXo0UW"
filepath = "models/scibert-nli.zip"
gdrive = GoogleDriveDownloader()
gdrive.download_file_from_google_drive(file_id,
                                       dest_path=filepath, unzip=True)

# Source Code

> The following source code was written specifically for the CORD-19 Dataset.

In [None]:
def normalize_whitespace(string: str) -> str:
    """Normalize excessive whitespace."""
    linebreak = re.compile(r"(\r\n|[\n\v])+")
    nonebreaking_space = re.compile(r"[^\S\n\v]+", flags=re.UNICODE)
    return nonebreaking_space.sub(" ", linebreak.sub(r"\n", string)).strip()


def clean_punctuation(text: str) -> str:
    punct = re.compile("[{}]".format(re.escape(punctuation)))
    tokens = word_tokenize(text)
    text = " ".join(filter(lambda t: punct.sub("", t), tokens))
    return normalize_whitespace(text)


def clean_tokenization(sequence: str) -> str:
    """Clean up spaces before punctuations and abbreviated forms."""
    return (
        sequence.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" do not", " don't")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
        .replace(" / ", "/")
        .replace(" )", ")")
        .replace("( ", "(")
        .replace("[ ", "[")
        .replace(" ]", "]")
        .replace(" ;", ";")
        .replace(" - ", "-")
    )


class DataIO:
    @staticmethod
    def save_data(file_path: str, data_obj: Any) -> None:
        file_path = Path(file_path)
        # make sure the parent directory exists before writing the file
        file_path.parent.mkdir(parents=True, exist_ok=True)
        with file_path.open("wb") as pkl:
            pickle.dump(data_obj, pkl, pickle.HIGHEST_PROTOCOL)

    @staticmethod
    def load_data(file_path: str) -> Any:
        file_path = Path(file_path)
        with file_path.open("rb") as pkl:
            return pickle.load(pkl)

In [None]:
class PaperIndexer:
    def __init__(self, source: Union[str, List[str]],
                 index_start=1, sort_first=False, extension=".json"):
        self.index_start = index_start
        self.extension = extension
        self.is_files_sorted = sort_first
        self._bins: List[int] = []
        self.paths: List[Path] = []
        self.paper_index: Dict[str, int] = {}
        self.index_paper: Dict[int, str] = {}
        if not isinstance(source, list):
            source = [source]
        file_paths = []
        for path in source:
            path = Path(path)
            if path.is_dir():
                files = [file for file in path.glob(f"*{extension}")]
                if sort_first:
                    files.sort()
                file_paths.extend(files)
                self.paths.append(path)
                self._bins.append(len(files))
            else:
                raise ValueError(f"Path, {path} directory not found.")
        self._map_files_to_ids(file_paths)

    @property
    def num_papers(self):
        return len(self.index_paper)

    @property
    def source_name(self):
        if len(self.paths) == 1:
            return self.paths[0].name
        return [p.name for p in self.paths]

    def _map_files_to_ids(self, json_files: List[str]) -> None:
        for index, file in enumerate(json_files, self.index_start):
            paper_id = file.name.replace(self.extension, "")
            if paper_id not in self.paper_index:
                self.paper_index[paper_id] = index
                self.index_paper[index] = paper_id

    def _index_dirpath(self, index: int) -> Path:
        if index <= self._bins[0]:
            return self.paths[0]
        else:
            size = 0
            for i in range(len(self._bins)):
                size += self._bins[i]
                if index <= size:
                    return self.paths[i]

    def _load_data(self, paper_id: str):
        path = self._index_dirpath(self.paper_index[paper_id])
        file_path = path.joinpath(f"{paper_id}{self.extension}")
        with file_path.open("rb") as file:
            return json.load(file)

    def _encode(self, paper_ids: List[str]) -> List[int]:
        pid2idx = self.paper_index
        return [pid2idx[pid] for pid in paper_ids if pid in pid2idx]

    def _decode(self, indices: List[int]) -> List[str]:
        idx2pid = self.index_paper
        return [idx2pid[idx] for idx in indices if idx in idx2pid]

    def load_paper(self, index: int = None, paper_id: str = None):
        """Load a single paper and data by either index or paper ID."""
        if index is not None:
            paper = self.load_papers([index], None)
        elif paper_id is not None:
            paper = self.load_papers(None, [paper_id])
        else:
            raise ValueError("Expected either an index or a paper ID.")
        return paper[0]

    def load_papers(self, indices: List[int] = None, paper_ids: List[str] = None):
        """Load many papers and data by either indices or paper ID's."""
        if indices is not None:
            if isinstance(indices, list) and isinstance(indices[0], int):
                paper_ids = self._decode(indices)
                return [self._load_data(pid) for pid in paper_ids]
            else:
                raise ValueError("Indices not of type List[int].")

        elif paper_ids is not None:
            if isinstance(paper_ids, list) and isinstance(paper_ids[0], str):
                return [self._load_data(pid) for pid in paper_ids]
            else:
                raise ValueError("Paper ID's not of type List[str].")

    def __getitem__(self, item):
        if isinstance(item, int):
            return self.index_paper[item]
        elif isinstance(item, str):
            return self.paper_index[item]

    def __len__(self):
        return self.num_papers

    def __repr__(self):
        return "PaperIndexer(papers={}, files_sorted={}, source={})".format(
            self.num_papers, self.is_files_sorted, self.source_name)


@dataclass
class Sentences:
    indices: List[int] = field(default_factory=list, repr=False)
    counts: int = 0
    maxlen: int = 0
    strlen: int = 0

    def init_cluster(self) -> Dict[int, List[str]]:
        return dict([(index, []) for index in self.indices])

    def __len__(self):
        return self.counts


@dataclass
class Papers:
    sentences: Sentences = field(repr=False)
    cluster: Dict[int, List[str]] = field(repr=False)
    avg_strlen: float = field(init=False)
    num_papers: int = field(init=False)
    num_sents: int = field(init=False)
    _meta: List[Tuple[int, int]] = field(init=False, repr=False)

    def __post_init__(self):
        if isinstance(self.sentences, Sentences):
            for key, val in self.sentences.__dict__.items():
                setattr(self, key, val)
        self.avg_strlen = round(self.strlen / self.counts, 2)
        self.num_papers = len(self.indices)
        self.num_sents = self.counts
        self._meta = list(self._edges())

    def _edges(self):
        for i in self.indices:
            for j in range(0, len(self.cluster[i])):
                yield (i, j)

    def string(self, sent_id: int) -> str:
        """Retrieve a single string from a sentence ID.

        * Same as `self[sent_id]`
        """
        return self[sent_id]

    def lookup(self, sent_ids: List[int]) -> List[Dict[str, int]]:
        locs = []
        for i in sent_ids:
            node, item = self._meta[i]
            locs.append({"sent_id": i, "paper_id": node,
                         "loc": (node, item)})
        return locs

    def sents(self, paper_id: int) -> List[str]:
        """Retrieve all sentences belonging to the given paper ID."""
        return self.cluster[paper_id]

    def to_disk(self, path: str):
        """Save the current state to a directory."""
        DataIO.save_data(path, self)

    @staticmethod
    def from_disk(path: str):
        """Load the state from a directory."""
        return DataIO.load_data(path)

    def __len__(self):
        return self.num_sents

    def __getitem__(self, item):
        node, item = self._meta[item]
        return self.cluster[node][item]

    def __iter__(self):
        for index in self.cluster:
            for sentence in self.cluster[index]:
                yield sentence


def merge_papers(papers: List[Papers]) -> Papers:
    """Merge a list of instances of Papers into one."""
    if isinstance(papers, list):
        if not isinstance(papers[0], Papers):
            raise TypeError("Expected a List[Papers], but found "
                            f"a List[{type(papers[0])}] instead.")
    i = Sentences()
    c = i.init_cluster()
    for p in papers:
        i.strlen += p.strlen
        i.counts += p.counts
        i.maxlen = max(i.maxlen, p.maxlen)
        i.indices.extend(p.indices)
        c.update(p.cluster)
    return Papers(i, c)

In [None]:
def frequency_summarizer(text: Union[str, List[str]],
                         topk=7, min_tokens=30, nlp=None) -> str:
    """Frequency Based Summarization.

    :param text: sequences of strings or an iterable of string sequences.
    :param topk: number of topmost leading scored sentences.
    :param min_tokens: token limit; sentences longer than this are skipped.
    """
    if nlp is None:
        nlp = spacy.load("en_core_web_sm")

    doc = nlp(" ".join(text) if isinstance(text, list) else text)

    vocab = {}
    for token in doc:
        if not token.is_stop and not token.is_punct:
            if token.text not in vocab:
                vocab[token.text] = 1
            else:
                vocab[token.text] += 1

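    # normalize by the highest count, computed once so it does not
    # change while the values are being rescaled
    max_count = max(vocab.values(), default=1)
    for word in vocab:
        vocab[word] = vocab[word] / max_count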

    score = {}
    for sent in doc.sents:
        for token in sent:
            if len(sent) > min_tokens:
                continue
            if token.text in vocab:
                if sent not in score:
                    score[sent] = vocab[token.text]
                else:
                    score[sent] += vocab[token.text]

    nlargest = sorted(score, key=score.get, reverse=True)[:topk]
    summary = " ".join([sent.text for sent in nlargest])
    return summary


def common_tokens(texts: List[str], minlen=3, nlp=None,
                  pos_tags=("NOUN", "ADJ", "VERB", "ADV",)):
    """Top Common Tokens (removes stopwords and punctuation).

    :param texts: iterable of string sequences.
    :param minlen: dismiss tokens shorter than this length.
    :param nlp: use an existing spacy language instance.
    :param pos_tags: lemmatize tokens based on part-of-speech tags.
    """
    common = {}
    if nlp is None:
        nlp = spacy.load("en_core_web_sm")

    for doc in nlp.pipe(texts):
        tokens = []
        for token in doc:
            if token.is_stop:
                continue
            if token.pos_ in pos_tags:
                tokens.append(token.lemma_)
            else:
                tokens.append(token.text)

        text = " ".join(tokens)
        text = clean_punctuation(text)
        for token in word_tokenize(text):
            if len(token) < minlen:
                continue
            if token not in common:
                common[token] = 1
            else:
                common[token] += 1

    common = sorted(common.items(),
                    key=lambda k: k[1], reverse=True)
    return common


def extract_questions(papers: Papers, min_length=30, sentence_ids=False):
    """Extract questions from an instance of papers.

    :param min_length: minimum length of a question to consider.
    :param sentence_ids: whether to return the decoded ids `paper[index]`.
    """
    interrogative = ['how', 'why', 'when',
                     'where', 'what', 'whom', 'whose']
    sents = []
    ids = []
    seen = set()
    for index in tqdm(range(len(papers)), desc='sentences'):
        string = papers[index]
        if len(string) < min_length:
            continue
        toks = string.lower().split()
        if not toks or string in seen:
            continue
        if toks[0] in interrogative and toks[-1].endswith("?"):
            # skip duplicates here so that `sents` and `ids` stay aligned
            seen.add(string)
            sents.append(string)
            ids.append(index)

    questions = sents
    print(f'found {len(questions)} interrogative questions.')

    if not sentence_ids:
        return questions
    return questions, ids


class SpacySentenceTokenizer:
    def __init__(
        self,
        nlp_model="en_core_web_sm",
        disable=["ner", "tagger"],
        max_length=2_000_000,
    ):
        """Spacy Sentence Tokenizer.

        :params nlp_model: spaCy model to use for the tokenizer.
        :params disable: name of spaCy's pipeline components to disable.
        """
        self.nlp_model = nlp_model
        self.disable = disable
        self.max_length = max_length

    @property
    def cache(self):
        info = self.nlp.cache_info()
        if info.hits:
            return info.hits

    @functools.lru_cache()
    def nlp(self) -> English:
        nlp_ = spacy.load(self.nlp_model, disable=self.disable)
        nlp_.max_length = self.max_length
        return nlp_

    def tokenize(self, doc: str) -> List[Span]:
        """Split a string into a list of sentence spans."""
        doc = self.nlp()(doc)
        return list(doc.sents)

    def __repr__(self):
        model, pipe = self.nlp_model, self.disable
        return f"<SpacySentenceTokenizer({model}, disable={pipe})>"

In [None]:
class CORD19Dataset(PaperIndexer):
    def __init__(
            self,
            source: Union[str, List[str]],
            text_keys: Tuple[str] = ("abstract", "body_text",),
            index_start: int = 1,
            sort_first: bool = False,
            nlp_model: str = "en_core_web_sm",
            sentence_tokenizer: Callable = None,
    ):
        super(CORD19Dataset, self).__init__(source, index_start, sort_first)
        self.text_keys = text_keys
        self.sentence_tokenizer = sentence_tokenizer
        if sentence_tokenizer is not None:
            if not hasattr(sentence_tokenizer, 'tokenize'):
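                # instances have no ``__name__``, so fall back to the class name
                name = getattr(sentence_tokenizer, '__name__',
                               type(sentence_tokenizer).__name__)
                raise AttributeError(f'Callable[{name}]'
                                     ' missing ``self.tokenize()`` attribute.')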
        else:
            self.sentence_tokenizer = SpacySentenceTokenizer(nlp_model)

    def sample(self, k: int = None, seed: int = None) -> List[int]:
        """Return all paper indices, or a random sample of `k` of them.

        `k`: Use `k=-1` (or leave it as `None`) to return every indexed paper.
            Otherwise, pass `k=n` to draw `n` random indices from the
            available dataset files.
        """
        random.seed(seed)
        indices = list(self.index_paper.keys())
        if k is None or k == -1:
            return indices
        assert k <= self.num_papers
        return random.sample(indices, k=k)

    def title(self, index: int = None, paper_id: str = None) -> str:
        return self.load_paper(index, paper_id)["metadata"]["title"]

    def titles(self, indices: List[int] = None,
               paper_ids: List[str] = None) -> Iterator:
        for paper in self.load_papers(indices, paper_ids):
            yield paper["metadata"]["title"]

    def docs(self, indices: List[int] = None,
             paper_ids: List[str] = None, suffix="\n") -> Iterator:
        for paper in self.load_papers(indices, paper_ids):
            doc = []
            for key in self.text_keys:
                for line in paper[key]:
                    doc.append(line["text"])
            yield suffix.join(doc)

    def lines(self, indices: List[int] = None,
              paper_ids: List[str] = None) -> Iterator:
        for paper in self.load_papers(indices, paper_ids):
            for key in self.text_keys:
                for line in paper[key]:
                    yield line["text"]

    def build(self, indices: List[int], minlen: int = 20) -> Papers:
        """Return an instance of papers with texts transformed to sentences."""
        index = Sentences(indices)
        cluster = index.init_cluster()
        docs = self.docs(indices)

        for paper in cluster:
            for line in self.sentence_tokenizer.tokenize(next(docs)):
                string = normalize_whitespace(line.text)
                string = clean_tokenization(string)
                length = len(string)
                if length <= minlen:
                    continue
                if string not in cluster[paper]:
                    index.strlen += length
                    index.counts += 1
                    index.maxlen = max(index.maxlen, length)
                    cluster[paper].append(string)

        return Papers(index, cluster=cluster)

    def batch(self, indices: List[int], minlen=20, workers=None) -> Papers:
        maxsize = len(indices)
        workers = cpu_count() if workers is None else workers

        jobs = []
        for i in range(0, maxsize, workers):
            tasks = indices[i: min(i + workers, maxsize)]
            jobs.append(tasks)

        with tqdm(total=maxsize, desc="papers") as pbar:
            batch_: List[Papers] = []
            with concurrent.futures.ThreadPoolExecutor(workers) as pool:
                future_to_ids = {
                    pool.submit(self.build, job, minlen): job for job in jobs
                }
                for future in concurrent.futures.as_completed(future_to_ids):
                    ids = future_to_ids[future]
                    try:
                        papers = future.result()
                    except Exception as e:
                        print(f"{ids} generated an exception: {e}")
                        raise
                    else:
                        batch_.append(papers)
                        pbar.update(len(ids))

        return merge_papers(batch_)

    def __repr__(self):
        return "CORD19Dataset(papers={}, files_sorted={}, source={})".format(
            self.num_papers, self.is_files_sorted, self.source_name)

# CORD19Dataset

> The `CORD19Dataset` class loads the content of one or many papers by index (`int`) or by paper ID (`str`). This removes the need to load articles explicitly via file paths.
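
To make the index/paper-ID access pattern concrete, here is a minimal sketch of the two-way mapping behind it. `TinyIndexer` is a hypothetical, stripped-down stand-in for `PaperIndexer` (no file loading, made-up paper IDs); it only shows how an `int` decodes to a paper ID and a `str` encodes back to an index.

```python
from typing import Dict, List, Union


class TinyIndexer:
    """Hypothetical, simplified stand-in for the id <-> index mapping."""

    def __init__(self, paper_ids: List[str], index_start: int = 1):
        self.paper_index: Dict[str, int] = {}
        self.index_paper: Dict[int, str] = {}
        for index, paper_id in enumerate(paper_ids, index_start):
            self.paper_index[paper_id] = index
            self.index_paper[index] = paper_id

    def __getitem__(self, item: Union[int, str]) -> Union[str, int]:
        # An int decodes to a paper id; a str encodes to an index.
        if isinstance(item, int):
            return self.index_paper[item]
        return self.paper_index[item]


indexer = TinyIndexer(["a1b2", "c3d4", "e5f6"])  # made-up paper ids
print(indexer[2])
print(indexer["e5f6"])
```

Keeping both dictionaries costs a little memory but makes lookups in either direction a single dictionary access.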

In [None]:
source = "/kaggle/input/CORD-19-research-challenge/document_parses/pdf_json/"
cord19 = CORD19Dataset(source=source,
                       text_keys=("body_text",),
                       sort_first=True,
                       nlp_model="en_core_web_sm")
print(cord19)

In [None]:
# general view of how to access items via index/paper_id

paper_id = cord19[100]
title = cord19.title(None, paper_id)
lines = cord19.lines(None, [paper_id])

print(f'paper       : index=100 <-> id={paper_id}.json')
print(f'paper-title : {title}')
print(f'paper-lines : {len(list(lines))}')

# Papers to Sentences

> The method `cord19.batch()` applies the following pre-processing steps: it *skips duplicates, normalizes whitespace and tokenization, and splits texts into sentences*. For `4,500 papers` this yields roughly `970,000` sentences. The method also uses multithreading for faster batching. Still, because we need a GPU environment (for encoding sentences to embeddings), batching is slow on the limited CPU resources available there, even with multithreading. For comparison, on a machine with `8 cores` and an `SSD`, the same number of samples (4,500 papers) takes around `14 minutes`.
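
The idea behind the batching can be sketched without spaCy. In the toy version below, `to_sentences` is a hypothetical stand-in that splits on end punctuation with a regular expression (the notebook uses a spaCy sentence tokenizer instead), and `batch` fans the documents out over a thread pool. The sample texts are made up.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def to_sentences(text: str, minlen: int = 20) -> List[str]:
    """Normalize whitespace, split naively on end punctuation, skip duplicates."""
    text = re.sub(r"\s+", " ", text).strip()
    seen, sentences = set(), []
    for sent in re.split(r"(?<=[.!?])\s+", text):
        # drop short fragments and sentences already seen in this document
        if len(sent) <= minlen or sent in seen:
            continue
        seen.add(sent)
        sentences.append(sent)
    return sentences


def batch(docs: Dict[int, str], workers: int = 2) -> Dict[int, List[str]]:
    """Process each document in a thread pool, keyed by paper index."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(to_sentences, docs.values())
        return dict(zip(docs.keys(), results))


docs = {
    1: "Coronaviruses are enveloped RNA viruses.  Coronaviruses are enveloped RNA viruses. Short one.",
    2: "Transmission occurs mainly through respiratory droplets.\nSee Fig. 2.",
}
clusters = batch(docs)
print(clusters)
```

`pool.map` preserves input order, so each list of sentences lines up with its paper index without extra bookkeeping.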


## Error when running batch()

- **If you got an error when running the cell below**:
    - Unfortunately, this is a known bug related to multithreading in spaCy's ML library `thinc`. If you see this error, click `Cancel Run` in the toolbar above, make sure the cell is no longer running, and re-run it with `SHIFT + ENTER`. It should work on the next attempt.
    
[Issue on Github](https://github.com/explosion/spaCy/issues/4349)

- Error log:

```bash
Undefined operator: >>
    Called by (<thinc.neural._classes.function_layer.FunctionLayer object at 0x7fee2769ccd0>,
    <thinc.neural._classes.feed_forward.FeedForward object at 0x7fee300f7d10>)
  Available:
  Traceback:
  ├─ from_disk in /opt/conda/lib/python3.7/site-packages/spacy/util.py:654
  ├─── <lambda> in /opt/conda/lib/python3.7/site-packages/spacy/language.py:936
  └───── Tok2Vec in /opt/conda/lib/python3.7/site-packages/spacy/_ml.py:323
         >>> return _legacy_tok2vec.Tok2Vec(width, embed_size, **kwargs)
```
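
If re-running by hand gets tedious, the same "try again" step could be automated with a small retry helper. This is a generic sketch, not something the notebook does: `retry` and `flaky` are hypothetical names, and whether a programmatic retry clears the `thinc` error the same way a manual re-run does is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(func: Callable[[], T], attempts: int = 3, delay: float = 0.0) -> T:
    """Call func until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as error:  # broad on purpose: the failure is intermittent
            if attempt == attempts:
                raise
            print(f"attempt {attempt} failed ({error}); retrying")
            time.sleep(delay)


calls = {"n": 0}


def flaky() -> str:
    """Hypothetical stand-in for a call that fails once, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("Undefined operator: >>")
    return "ok"


result = retry(flaky)
print(result)
```

In this notebook the wrapped call would be something like `retry(lambda: cord19.batch(sample, minlen=25))`.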

# Tasks

> First, let's get all papers matching the tasks.
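
Matching papers to tasks boils down to a nearest-neighbor search: each task is a query, each title is a database row, and the selected papers are the union of the top-k rows over all queries. The brute-force sketch below shows that logic with made-up 2-D vectors; the notebook uses a faiss flat index over the encoder's embeddings instead, and `search` here is a hypothetical helper.

```python
from typing import List, Sequence


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(queries: List[Sequence[float]], database: List[Sequence[float]],
           topk: int) -> List[List[int]]:
    """For each query, return the positions of its topk nearest database rows."""
    results = []
    for q in queries:
        order = sorted(range(len(database)), key=lambda i: l2(q, database[i]))
        results.append(order[:topk])
    return results


# Made-up "title" and "task" embeddings for illustration only.
titles = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 9.0)]
tasks = [(0.0, 0.1), (5.0, 5.1)]

neighbors = search(tasks, titles, topk=2)

# Different tasks can return the same title, so deduplicate the union.
selected = sorted({i for row in neighbors for i in row})
print(neighbors, selected)
```

Because neighbors can overlap across tasks, the number of unique papers is usually smaller than `number of tasks * topk`, which is why a top-up step may be needed to reach a target sample size.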

In [None]:
# load the sentence encoder
encoder = SentenceTransformer('models/scibert-nli')

In [None]:
TASKS = [
    'Effectiveness of drugs being developed and tried to treat COVID-19 patients.',
    ('Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such '
     'as naproxen, clarithromycin, and minocycline that may exert effects on viral replication.'),
    'Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.',
    'Exploration of use of best animal models and their predictive value for a human vaccine.',
    ('Capabilities to discover a therapeutic (not vaccine) for the disease, and '
     'clinical effectiveness studies to discover therapeutics, to include antiviral agents.'),
    ('Alternative models to aid decision makers in determining how to prioritize and distribute scarce, '
     'newly proven therapeutics as production ramps up. This could include identifying approaches '
     'for expanding production capacity to ensure equitable and timely distribution to populations in need.'),
    'Efforts targeted at a universal coronavirus vaccine.',
    'Efforts to develop animal models and standardize challenge studies.',
    'Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers.',
    'Approaches to evaluate risk for enhanced disease after vaccination.',
    ('Assays to evaluate vaccine immune response and process development for vaccines, '
     'alongside suitable animal models in conjunction with therapeutics.')
]

# k=-1 returns every indexed paper (as int indices, named `paper_id` below)
sample = cord19.sample(-1)
paper_id_to_title = {}
for paper_id in tqdm(sample, desc='titles'):
    title = cord19.title(paper_id)
    title = normalize_whitespace(title)
    if len(title) <= 10:
        continue
    if paper_id not in paper_id_to_title:
        paper_id_to_title[paper_id] = title

# despite its name, this maps a row position in the faiss index -> paper index
paper_id_to_index = dict(enumerate(paper_id_to_title.keys()))
titles = list(paper_id_to_title.values())

# encode the titles (db) and tasks (queries)
titles_embed = np.array(
    encoder.encode(titles, show_progress_bar=False))
tasks_embed = np.array(
    encoder.encode(TASKS, show_progress_bar=False))

topk = 410  # we want the 410 nearest titles for each task (query)
ndim = titles_embed.shape[1]
index = faiss.IndexFlat(ndim)
index.add(titles_embed)

# query the topmost similar neighbors to the queries
D, I = index.search(tasks_embed, topk)
I

In [None]:
goal_size = 4500  # we want 4,500 papers for the dataset
gold_ids = []
for i in I.flatten().tolist():
    paper_id = paper_id_to_index[i]
    gold_ids.append(paper_id)

gold_ids = sorted(set(gold_ids))
ntotal = len(gold_ids)
print('number of uniques :', ntotal)

extra_ids = []
if ntotal < goal_size:
    needs = goal_size - ntotal
    count = 0
    for need_id in sample:
        if need_id in gold_ids:
            continue
        if count < needs:
            extra_ids.append(need_id)
            count += 1
            
    assert len(extra_ids)+len(gold_ids) == goal_size
    print(f'goal needed {len(extra_ids)} extra number of ids!')

# small test case (make sure the ids are in sync)
gold_id = gold_ids[10]
gold_title = cord19.title(gold_id)
print(f'id    : {gold_id}')
print(f'title : {gold_title}')
print(f'match : {paper_id_to_title[gold_id]}')

In [None]:
sample = []
sample.extend(gold_ids)
sample.extend(extra_ids)
sample.sort()

papers = cord19.batch(sample, minlen=25)
print(papers)

> **Warning** - We only needed the following objects to build the sample of paper ids, so we can now delete them to free up RAM.

In [None]:
del paper_id_to_title, paper_id_to_index, titles, titles_embed, tasks_embed, index

# Data Structure Overview

In [None]:
# access sentences via index (like a list)

sent_ids = []
for i in range(5):
    x = random.randint(i, papers.num_sents - 1)  # randint includes both ends
    sentence = papers[x]
    sent_ids.append(x)
    print(f"{x}:\t{sentence[:90]}")

In [None]:
# it is also possible to retrieve titles by sentence ids

for x in papers.lookup(sent_ids):
    title = cord19.title(x['paper_id'])
    print(f'{x["sent_id"]}:\t{title[:70]}')

In [None]:
# keep in mind only sentences are kept in memory, e.g.,
# titles are loaded from disk (since it's a less common action)

maxids = 10
for index in papers.indices[:maxids]:  # iterate over the indexed paper ids
    sents = len(papers.sents(index))   # retrieve all sentences for a single paper/article
    title = cord19.title(index)        # retrieve the title for the paper/article
    paper_id = cord19[index]           # decode the index id back to a string (paper/article file-id)
    
    print(f'paper_id: {paper_id}, num_sents: {sents}\n* {title[:90]}\n')
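
The lookups above work because every sentence gets one flat id that resolves to a `(paper index, position)` pair. Here is a minimal sketch of that idea with made-up data; `sentence` and `paper_of` are hypothetical helper names, standing in for what `Papers.__getitem__` and `Papers.lookup` do.

```python
from typing import Dict, List, Tuple

# Sentences grouped by paper index (made-up content).
cluster: Dict[int, List[str]] = {
    7: ["First sentence of paper 7.", "Second sentence of paper 7."],
    9: ["Only sentence of paper 9."],
}

# One (paper index, position) pair per sentence, in insertion order.
meta: List[Tuple[int, int]] = [
    (paper, pos) for paper, sents in cluster.items() for pos in range(len(sents))
]


def sentence(sent_id: int) -> str:
    """Resolve a flat sentence id to its text."""
    paper, pos = meta[sent_id]
    return cluster[paper][pos]


def paper_of(sent_id: int) -> int:
    """Resolve a flat sentence id to the paper it came from."""
    return meta[sent_id][0]


print(sentence(2), paper_of(2))
```

The flat id is what lets row `i` of the embedding matrix be traced back to both its sentence and its source paper.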

# Sentences to Embeddings

In [None]:
# encode the sentences to embeddings:
embedding = np.asarray(
    encoder.encode(papers, show_progress_bar=True)
)

In [None]:
assert embedding.shape[0] == len(papers)
print('shape :', embedding.shape)

## Faiss

> Faiss is a library for fast and efficient search and clustering of embeddings

Please refer to the project for more information. [faiss github repo](https://github.com/facebookresearch/faiss)
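
To give a feel for the kind of index built in the next cell, here is a pure-Python toy of the inverted-file (IVF) idea: vectors are bucketed by their nearest centroid, and a query only scans the closest bucket(s). `TinyIVF` is a hypothetical class for illustration, the centroids are fixed by hand rather than trained, and none of this reflects how faiss is implemented internally.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query: Vector, points: List[Vector], k: int) -> List[int]:
    """Positions of the k points closest to the query."""
    return sorted(range(len(points)), key=lambda i: dist(query, points[i]))[:k]


class TinyIVF:
    """Toy inverted-file index: vectors are bucketed by nearest centroid."""

    def __init__(self, centroids: List[Vector]):
        self.centroids = centroids
        self.lists: Dict[int, List[Tuple[int, Vector]]] = {
            i: [] for i in range(len(centroids))
        }
        self.ntotal = 0

    def add(self, vectors: List[Vector]) -> None:
        for vec in vectors:
            bucket = nearest(vec, self.centroids, 1)[0]
            self.lists[bucket].append((self.ntotal, vec))
            self.ntotal += 1

    def search(self, query: Vector, k: int, nprobe: int = 1) -> List[int]:
        # Only scan the nprobe buckets whose centroids are closest to the query.
        candidates = []
        for bucket in nearest(query, self.centroids, nprobe):
            candidates.extend(self.lists[bucket])
        candidates.sort(key=lambda item: dist(query, item[1]))
        return [vec_id for vec_id, _ in candidates[:k]]


index = TinyIVF(centroids=[(0.0, 0.0), (10.0, 10.0)])
index.add([(0.5, 0.2), (9.8, 10.1), (0.1, 0.9), (10.4, 9.7)])
result = index.search((10.0, 10.0), k=2)
print(result)
```

The trade-off is speed for recall: a true neighbor sitting in an unprobed bucket is missed, and raising `nprobe` scans more buckets to reduce that risk.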

In [None]:
nlist = 10   # number of coarse centroids (inverted lists)
hnsw_m = 32  # HNSW connectivity (links per node) of the coarse quantizer
n_dim = embedding.shape[1]
quantizer = faiss.IndexHNSWFlat(n_dim, hnsw_m)
index_ivf = faiss.IndexIVFFlat(quantizer, n_dim, nlist, faiss.METRIC_L2)
index_ivf.verbose = True
if not index_ivf.is_trained:
    index_ivf.train(embedding)
if index_ivf.ntotal == 0:
    index_ivf.add(embedding)
assert index_ivf.ntotal == embedding.shape[0]

## Saving

In [None]:
data_dir = Path('data')
data_dir.mkdir(parents=True, exist_ok=True)
sents_file = data_dir.joinpath(f'sents_{papers.num_papers}.pkl')
# embed_file = data_dir.joinpath(f'embed_{papers.num_papers}.npy')
index_file = data_dir.joinpath(f'index_{papers.num_papers}.index')

# save the db
papers.to_disk(sents_file)
# np.save(embed_file, embedding)  # saving the embedding requires +HDD
faiss.write_index(index_ivf, index_file.as_posix())

> **Warning** - Everything we need is now saved to disk, so we can delete these objects to free up RAM.

In [None]:
del embedding, index_ivf, cord19

# Dataset Questions

> In the following code we'll extract questions from the literature, cluster them with k-means (`faiss.Kmeans`), and group them by retrieving each centroid's nearest neighbors.
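
As a rough picture of this two-step grouping, the sketch below runs a tiny k-means in plain Python and then lists the closest points for each centroid. It is a toy stand-in for `faiss.Kmeans` plus a flat index search, written under simplifying assumptions: 2-d points instead of sentence embeddings, deterministic seeding with the first `k` points, and hypothetical helper names (`kmeans`, `top_neighbors`).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, niter=10):
    """Plain Lloyd iterations, seeded with the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(niter):
        groups = [[] for _ in range(k)]
        for p in points:
            closest = min(range(k), key=lambda c: sq_dist(centroids[c], p))
            groups[closest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def top_neighbors(centroids, points, topk):
    """For each centroid, the ids of its `topk` closest points (closest first)."""
    return [
        sorted(range(len(points)), key=lambda i: sq_dist(c, points[i]))[:topk]
        for c in centroids
    ]

toy_points = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0), (1.0, 0.0), (5.0, 6.0), (6.0, 5.0)]
toy_centroids = kmeans(toy_points, k=2)
toy_clusters = top_neighbors(toy_centroids, toy_points, topk=3)
print(toy_clusters)
```

Because neighbors are retrieved per centroid from the full set, the same point can show up under more than one centroid when `topk` is larger than a cluster's natural size.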

In [None]:
# get all the questions from the instance of papers:
questions = extract_questions(papers, min_length=40)

In [None]:
# encode the questions to embeddings
embedding = np.asarray(
    encoder.encode(questions, show_progress_bar=True)
)

In [None]:
nless = 1  # remove additional questions (optional)
nlist = 10  # n centroids fluctuates on random sampling partitions (papers)
niter = 20
items = len(questions)

# topk : top neighbors per centroid (at least one, in case few questions were found)
topk = max(1, (items // nlist) - (nlist * nless))
ndim = embedding.shape[1]

# build kmeans
kmeans = faiss.Kmeans(ndim, nlist, niter=niter, verbose=True)
kmeans.train(embedding)

# finally, build the indexer
index = faiss.IndexFlat(ndim)
index.add(embedding)
D, I = index.search(kmeans.centroids, topk)

# "sorting" the questions in relation to k-nn scores
cluster = [[] for _ in range(I.shape[0])]
for k in range(I.shape[0]):
    for nn in I[k]:
        cluster[k].append(questions[nn])

print(f'(centroids, neighbors) : {I.shape}')

In [None]:
nn = I.shape[1]
cats = {}
for k in range(I.shape[0]):
    toks = common_tokens(cluster[k])
    # keep an even number of tokens (they are paired into graph edges below)
    cats[k] = toks[:nn - 1 if nn % 2 else nn]

# preview the results
for k in cats:
    category = cats[k][0]
    entities = cats[k][1:6]
    print(f"{category}\t-> {entities}")

# Questions Terminology Graph

In [None]:
import graphviz

pairs = 2
edges = []
for cat in cats:
    common = cats[cat]
    maxlen = len(common)
    for i in range(0, maxlen, pairs):
        x = common[i: min(i + pairs, maxlen)]
        if len(x) < pairs:  # skip a trailing token that has no partner
            continue
        nodes, k = zip(*x)
        edges.append(nodes)

# build the questions graph
graph = graphviz.Digraph()
for tail, head in edges:
    graph.edge(tail, head)
    
graph  # visualize how the questions relate in terms of entities

In [None]:
graph.render('/kaggle/working/questions-graph-table.gv', view=True)

In [None]:
questions = {}
for k in cats:
    # we'll use the topmost (1st) token for 
    # each cluster as the "master" entity
    label = cats[k][0][0]
    if label not in questions:
        questions[label] = cluster[k]
    else:  # join groups with same label (if any)
        questions[label].extend(cluster[k])

# finally save the questions to use below!
DataIO.save_data(data_dir.joinpath('k-questions.plk'),
                 questions)

# we now have a good collection of
# questions we can use with the QA model!
print('topics :', questions.keys())

> **Warning** - Everything we need is now saved to disk, so we can delete these objects to free up RAM.

In [None]:
del index, quantizer, kmeans, papers, embedding, questions, cluster

# Question Answering Engine

In [None]:
class BertSummarizer:
    @staticmethod
    def load(model: str, tokenizer: BertTokenizer, device=None) -> Summarizer:
        config = BertConfig.from_pretrained(model)
        config.output_hidden_states = True
        bert_model = BertModel.from_pretrained(model, config=config)
        if device is not None:
            bert_model = bert_model.to(device)
        return Summarizer(custom_model=bert_model, custom_tokenizer=tokenizer)


class QuestionAnsweringEngine(CORD19Dataset):
    def __init__(self, source: Union[str, List[str]], papers: str,
                 index: str, encoder: str, model: str, **kwargs):
        """CORD-19 Dataset Question Answering Engine.

        :**kwargs: `sort_first:bool`, `nlp_model:str`
        """
        super(QuestionAnsweringEngine, self).__init__(source, **kwargs)
        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else "cpu"
        )
        self.papers = Papers.from_disk(papers)
        self.index = faiss.read_index(index)
        self.encoder = SentenceTransformer(encoder, device=self.device)
        self.tokenizer = BertTokenizer.from_pretrained(model,
                                                       do_lower_case=False)
        self.model = BertForQuestionAnswering.from_pretrained(model)
        self.model.to(self.device)
        self.nlp = self.sentence_tokenizer.nlp()
        self._freq_summarizer = frequency_summarizer
        self._bert_summarizer = BertSummarizer.load(model, device=self.device,
                                                    tokenizer=self.tokenizer)

    def compress(self, sentences: Union[str, List[str]], mode="freq") -> str:
        if mode == "freq":
            return self._freq_summarizer(sentences, nlp=self.nlp)
        elif mode == "bert":
            if isinstance(sentences, list):
                sentences = " ".join(sentences)
            return self._bert_summarizer(sentences)

    def encode(self, sentences: List[str]) -> np.ndarray:
        embedding = self.encoder.encode(sentences, show_progress_bar=False)
        return np.array(embedding)

    def similar(self, string: str, k=5) -> Tuple[np.ndarray, np.ndarray]:
        string = normalize_whitespace(string.replace("?", " "))
        embedd = self.encode([string])
        return self.index.search(embedd, k)

    def decode(self, question: str, context: str) -> Tuple[str, str]:
        inputs = self.tokenizer.encode_plus(question.strip(),
                                            text_pair=context,
                                            max_length=510,
                                            add_special_tokens=True,
                                            return_tensors='pt').to(self.device)
        with torch.no_grad():  # inference only, gradients are not needed
            top_k = self.model(**inputs)
        start, end = (torch.argmax(top_k[0]),
                      torch.argmax(top_k[1]) + 1)
        input_ids = inputs["input_ids"].tolist()
        answer = self.tokenizer.decode(input_ids[0][start:end],
                                       skip_special_tokens=True)
        if len(answer.strip()) > 0:  # did the model answer the question?
            context = self.tokenizer.decode(input_ids[0],
                                            skip_special_tokens=True)
        return answer, context

    def answer(self, question: str, k=15, mode: str = None) -> Dict[str, Any]:
        question = question.strip()
        dists, indices = self.similar(question, k=k)

        sentences = []
        for index in indices.flatten():
            string = self.papers[index]
            if string == question:
                string = self.papers[index + 1]

            doc = self.nlp(string)
            for sent in doc.sents:
                string = clean_tokenization(sent.text)
                if len(sent) > 1 and sent[0].is_title:
                    if (not sent[-1].like_num
                        and not sent[-1].is_bracket
                        and not sent[-1].is_quote
                        and not sent[-1].is_stop
                            and not sent[-1].is_punct):
                        string = f"{string}."
                if string in sentences:
                    continue
                sentences.append(string)

        context = " ".join(sentences)
        if mode in ("freq", "bert"):
            context = self.compress(context, mode=mode)
        context = normalize_whitespace(context)

        answer, context = self.decode(question, context)
        context = clean_tokenization(context)
        dists, indices = dists.tolist()[0], indices.tolist()[0]

        return {"answer": answer,
                "context": context, "dist": dists, "ids": indices}

In [None]:
# after building the db (index/papers) we can now load everything we
# need from a single configuration. note that if you set sort_first=True
# - it also needs to be set here.

engine_config = {
    'source': (
        '../input/CORD-19-research-challenge/document_parses/pdf_json'
    ),
    'papers': 'data/sents_4500.pkl',
    'index': 'data/index_4500.index',
    'encoder': 'models/scibert-nli',
    'model': 'amoux/scibert_nli_squad',
    'sort_first': True,
    'nlp_model': 'en_core_web_sm'
}

# load the clustered questions
KNNQ = DataIO.load_data('data/k-questions.plk')

In [None]:
# start the QA engine!
qa = QuestionAnsweringEngine(**engine_config)
print(qa)
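
Before running queries, it may help to see how `decode` turns model scores into an answer: it takes the argmax of the start logits and of the end logits independently, then slices the tokens between them. The sketch below reproduces only that step in plain Python; the tokens and scores are made up for illustration, and `best_span` and `extract_answer` are hypothetical helpers, not part of the engine.

```python
def best_span(start_logits, end_logits):
    """Return (start, end) token positions, `end` exclusive, via independent argmax."""
    start = max(range(len(start_logits)), key=start_logits.__getitem__)
    end = max(range(len(end_logits)), key=end_logits.__getitem__) + 1
    return start, end

def extract_answer(tokens, start_logits, end_logits):
    """Join the tokens inside the predicted span; empty string if the span is invalid."""
    start, end = best_span(start_logits, end_logits)
    if end <= start:  # end predicted before start: treat as "no answer"
        return ""
    return " ".join(tokens[start:end])

toy_tokens = ["the", "virus", "spreads", "through", "respiratory", "droplets", "."]
toy_start = [0.1, 0.2, 0.1, 0.3, 2.5, 0.4, 0.0]
toy_end = [0.0, 0.1, 0.2, 0.1, 0.6, 3.1, 0.2]
toy_answer = extract_answer(toy_tokens, toy_start, toy_end)
print(toy_answer)  # respiratory droplets
```

When the end position lands before the start position the slice is empty, which is why the engine treats an empty answer as a miss.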

In [None]:
def answer_questions_randomly(n: int, k=15, mode=None, context=False):
    random.seed(n + k)
    misses = 0
    for cat in KNNQ:
        # sample n questions at random (shuffling the first n always picks the same ones)
        pool = KNNQ[cat]
        questions = random.sample(pool, min(n, len(pool)))
        for question in questions:
            output = qa.answer(question, k=k, mode=mode)
            if len(output['answer']) == 0:
                misses += 1
                continue
            print(f'\n====== {cat.title()} ======\n')
            print(f"Q : {question}")
            print(f"A : {output['answer']}\n")
            if context:
                print(f"C : {output['context']}\n")

    total = len(KNNQ)*n
    score = round(((total - misses)/total)*100, 2)
    print(f'-------- score : {score}% --------\n')


def print_output(output, query: str = None, title_width=60):
    answer = output['answer']
    if query is not None:
        print(f"\nQ : {query}\n")
    print(f"Answer  : {answer[:1].upper() + answer[1:]}\n")
    print(f"Context : {output['context']}\n")
    print("\t================= TITLES 🤗 =================\n")

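    # map each retrieved sentence to its distance, then keep the closest
    # sentence per paper (assumes lookup() echoes the queried id as 'sent_id')
    dist_by_sent = dict(zip(output['ids'], output['dist']))
    paper_ids, paper_dist = [], {}
    for lookup in qa.papers.lookup(output['ids']):
        pid = lookup['paper_id']
        paper_ids.append(pid)
        dist = dist_by_sent.get(lookup['sent_id'], float('inf'))
        paper_dist[pid] = min(paper_dist.get(pid, dist), dist)
    paper_freq = Counter(paper_ids)
    sums = sum(paper_freq.values())

    for pid, freq in paper_freq.items():
        title = qa.title(pid).strip()
        if len(title) == 0:
            title = '< missing-title >'
        weight = round((freq/sums) * 100, 2)
        k_dist = round(paper_dist[pid], 2)
        print(f'D: {k_dist}\tW: {weight}% \t {title[:title_width]}')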
    print('\t______________________________________________\n')


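def contradiction(premise, category) -> float:
    # score the hypothesis "This text is about <category>" against the premise.
    # NOTE: the indexing below expects 3-way NLI logits, so the result is only
    # meaningful when `qa.model` is a sequence classifier fine-tuned on MNLI;
    # the engine above loads a question-answering head (BertForQuestionAnswering).
    hypothesis = f'This text is about {category}'
    input_ids = qa.tokenizer.encode(text=premise, text_pair=hypothesis,
                                    return_tensors='pt').to(qa.device)
    # keep logits 0 and 2 (assuming the contradiction, neutral, entailment
    # label order), drop neutral, and read off the entailment probability
    logits = qa.model(input_ids)[0]
    true_prob = logits[..., [0, 2]].softmax(dim=1)[..., 1].item()
    return round(true_prob*100, 2)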
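
The indexing in `contradiction` follows a common zero-shot recipe: given three-way NLI logits, drop the neutral score and take a softmax over the remaining two. The sketch below shows only that arithmetic in plain Python. The label order (contradiction, neutral, entailment) and the example logits are assumptions for illustration; the actual order depends on the checkpoint.

```python
import math

def entailment_probability(logits):
    """Softmax over (contradiction, entailment), ignoring the neutral logit.

    Assumes `logits` is ordered as (contradiction, neutral, entailment).
    Returns a percentage rounded to two decimals.
    """
    contradiction, _, entailment = logits
    peak = max(contradiction, entailment)  # subtract the max for numerical stability
    exp_c = math.exp(contradiction - peak)
    exp_e = math.exp(entailment - peak)
    return round(exp_e / (exp_c + exp_e) * 100, 2)

toy_prob = entailment_probability([-1.0, 0.5, 2.0])
print(toy_prob)  # 95.26
```

Dropping the neutral logit turns the three-way decision into a binary "is this label true or not" score, which is what the percentage printed in the next cell represents.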

# Results

In [None]:
categories = list(KNNQ.keys())

one_cat = categories[0]
question = random.choice(KNNQ[one_cat])
true_prob = contradiction(question, category=one_cat)

print(f"question: {question}\n")
print(f"probability that the question belongs to category '{one_cat}': {true_prob}%")

In [None]:
# output without compressing the context before the model

output = qa.answer(question, k=5, mode=None)
print_output(output, title_width=80)

In [None]:
# here the context is compressed via basic frequency
# metrics before passing it as input to the QA model

output = qa.answer(question, k=10, mode="freq")
print_output(output, title_width=80)
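
As a rough picture of what frequency-based compression means, the sketch below scores each sentence by the average frequency of its words across the whole context and keeps the top ones in their original order. It is a hypothetical stand-in written with the standard library, not the `frequency_summarizer` used by the engine, whose scoring may differ (for example in how it handles stop words).

```python
import re
from collections import Counter

def freq_compress(sentences, keep=2):
    """Keep the `keep` highest-scoring sentences, preserving document order."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    counts = Counter(tok for toks in tokenized for tok in toks)
    # score = mean frequency of the sentence's words over the whole context
    scores = [
        sum(counts[t] for t in toks) / len(toks) if toks else 0.0
        for toks in tokenized
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(top))

toy_sentences = [
    "Masks reduce transmission.",
    "The weather was mild.",
    "Masks reduce transmission in hospitals.",
]
toy_summary = freq_compress(toy_sentences, keep=2)
print(toy_summary)
```

Sentences that repeat the context's dominant terms survive, so the off-topic sentence is dropped and the QA model receives a shorter input.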

In [None]:
# here we use a transformer based summarization to compress the context

output = qa.answer(question, k=15, mode="bert")
print_output(output, title_width=80)

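> Let's now pass `n` randomly selected questions per *question-category* through all three modes (`None`, `'freq'`, `'bert'`) and see which mode performs better. Note that the random seed is `n + k`, so two runs shuffle the questions identically only when `n + k` matches (for example `n=3`, `k=25` in both the `freq` and `bert` runs below).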
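
Because the helper calls `random.seed(n + k)`, the shuffle order depends only on that sum. A quick check with the standard library, where the list is a made-up stand-in for one category's questions and `shuffled` is a hypothetical helper:

```python
import random

def shuffled(items, n, k):
    """Shuffle a copy of `items` after seeding with n + k, as the helper does."""
    random.seed(n + k)
    items = list(items)
    random.shuffle(items)
    return items

toy_questions = [f"q{i}" for i in range(20)]

# same n and k: identical seed, identical order
same_args = shuffled(toy_questions, n=3, k=15) == shuffled(toy_questions, n=3, k=15)
# different n and k with the same sum (18): still the same seed and order
same_sum = shuffled(toy_questions, n=3, k=15) == shuffled(toy_questions, n=4, k=14)
print(same_args, same_sum)  # True True
```

Runs whose `n + k` differ are seeded differently, so their question order is not directly comparable.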

In [None]:
answer_questions_randomly(n=3, k=15, mode=None, context=False)

In [None]:
answer_questions_randomly(n=3, k=25, mode="freq", context=False)

In [None]:
answer_questions_randomly(n=3, k=25, mode="bert", context=False)

# Task Results

> Here we simply pass the tasks as queries to the question answering model. We will test the model with only `freq` and `bert` modes.

In [None]:
def answer_tasks(k=15, mode=None, context=False):
    misses = 0
    for task in TASKS:
        output = qa.answer(task, k=k, mode=mode)
        if len(output['answer']) == 0:
            misses += 1
            continue
        print(f'\n==== << TASK >> ====\n')
        print(f"T : {task}\n")
        print(f"A : {output['answer']}\n")
        if context:
            print(f"C : {output['context']}\n")

    total = len(TASKS)
    score = round(((total - misses)/total)*100, 2)
    print(f'======== score >> {score}% ========\n')

In [None]:
answer_tasks(k=25, mode="freq")

In [None]:
answer_tasks(k=45, mode="bert", context=True)

- `D` : Distance (lower scores -> more similar).
- `W` : Weight, the percentage of the sentences retrieved for the context that come from that article.
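
The `W` column can be reproduced with a `Counter` over the paper ids of the retrieved sentences. A minimal sketch with made-up ids (`paper_weights` is a hypothetical helper mirroring the arithmetic in `print_output`):

```python
from collections import Counter

def paper_weights(paper_ids):
    """Percentage of retrieved sentences contributed by each paper."""
    freq = Counter(paper_ids)
    total = sum(freq.values())
    return {pid: round(count / total * 100, 2) for pid, count in freq.items()}

toy_ids = [12, 12, 7, 12, 3]  # hypothetical paper ids for five retrieved sentences
toy_weights = paper_weights(toy_ids)
print(toy_weights)  # {12: 60.0, 7: 20.0, 3: 20.0}
```

A paper that contributes many of the top-k sentences gets a high weight, which is a quick signal of how much the context leans on a single article.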

> As we can see, a single query for the most similar sentences automatically surfaces articles related to the query, perhaps more accurately than matching the query directly against the most similar titles.

In [None]:
for task in TASKS:
    output = qa.answer(task, k=45, mode='bert')
    if len(output['answer']) == 0:
        continue
    print_output(output, query=task, title_width=70)

## Final

> If you have any questions about the idea, models, or code please ask!