# Question answering using cdQA (BERT)

## Introduction

This notebook was made by an Atos BDS (Big Data) France team.

It is designed to **quickly find the best answers** to queries in a dedicated corpus when that corpus is **too big** or **too technical** for the end user.

### Main tooling

[cdQA](https://github.com/cdqa-suite/cdQA) (Closed Domain Question Answering) is an end-to-end closed-domain question answering system built on top of the HuggingFace [transformers](https://github.com/huggingface/transformers) library.

The cdQA architecture is based on two main components: the Retriever and the Reader. The diagram below shows how the system works.

![](https://miro.medium.com/max/1400/1*v7s0WvOj-Z-ZwVzWFuzR7Q.png)

When a question is sent to the system, the Retriever selects the documents in the database that are most likely to contain the answer. It works like the DrQA retriever: it builds TF-IDF features from uni-grams and bi-grams and computes the cosine similarity between the question and each document in the database.
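
The retrieval step described above can be sketched in plain Python. This is a simplified illustration, not the cdQA implementation: the whitespace tokenizer, the smoothed IDF formula, and the names `ngrams`, `tfidf_vectors`, `cosine` and `rank_documents` are all assumptions made for this example, and the three documents are toy sentences.

```python
import math
from collections import Counter


def ngrams(text):
    """Return the uni-grams and bi-grams of a lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf_vectors(texts):
    """Build one sparse TF-IDF vector (a dict term -> weight) per text."""
    counts = [Counter(ngrams(t)) for t in texts]
    n_docs = len(texts)
    # Document frequency: in how many texts each term appears.
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed inverse document frequency (an assumption of this sketch).
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_documents(question, documents):
    """Return document indices sorted from most to least similar to the question."""
    # For simplicity the question is included when computing the IDF weights.
    vectors = tfidf_vectors(documents + [question])
    question_vec, doc_vecs = vectors[-1], vectors[:-1]
    scores = [cosine(question_vec, d) for d in doc_vecs]
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)


# Toy documents, only meant to exercise the functions above.
docs = [
    "the incubation period of the virus is about five days",
    "copper surfaces reduce the persistence of the virus",
    "school closures are a social distancing measure",
]
ranking = rank_documents("incubation period of the virus", docs)
print(ranking)
```

The document that shares the most uni-grams and bi-grams with the question comes first, and a document with no term in common gets a similarity of zero.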

After selecting the most probable documents, the system divides each document into paragraphs and sends them, together with the question, to the Reader, which is a pre-trained deep learning model. The model used is the PyTorch version of the well-known NLP model BERT, made available by HuggingFace.

The Reader then outputs the most probable answer it can find in each paragraph. After the Reader, a final layer compares the answers using an internal score function and outputs the most likely one according to the scores. This description follows https://towardsdatascience.com/how-to-create-your-own-question-answering-system-easily-with-python-2ef8abc8eb5.
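
As a rough illustration of this final ranking step, the sketch below combines a reader score and a retriever score into one value and keeps the best candidate. The weighted sum, the default weight of 0.5, the field names `reader_score` and `retriever_score`, and the function name `best_answer` are assumptions for this example; the actual score function is internal to cdQA and may differ.

```python
def best_answer(candidates, retriever_weight=0.5):
    """Return the candidate with the highest combined score.

    Each candidate is a dict with a 'reader_score' and a 'retriever_score'.
    The weighted sum used here is a hypothetical stand-in for cdQA's
    internal score function.
    """
    def final_score(candidate):
        return ((1 - retriever_weight) * candidate["reader_score"]
                + retriever_weight * candidate["retriever_score"])

    return max(candidates, key=final_score)


# Toy candidates: one answer per paragraph read by the Reader.
candidates = [
    {"answer": "answer from paragraph A", "reader_score": 0.9, "retriever_score": 0.2},
    {"answer": "answer from paragraph B", "reader_score": 0.6, "retriever_score": 0.8},
]

print(best_answer(candidates)["answer"])
print(best_answer(candidates, retriever_weight=0.0)["answer"])
```

With equal weights the second candidate wins because its document matched the question much better; with the retriever ignored (`retriever_weight=0.0`) the first candidate wins on reader score alone.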

### Work Done and limitations

Attention was paid to make the pipeline query agnostic.

The notebook runs the pipeline on the first task only and loads the results for the other tasks from a GitHub repository to save time and compute.

Due to the number of subtasks, the depth of the analysis for the results stays low (mainly consistency tests).

### Implementation

The challenge here is to apply cdQA in a resource-limited environment: extract the original files, preprocess them to fit the cdQA retriever, and **extract value** directly.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
import json
from tqdm.notebook import tqdm  # notebook-friendly progress bar

# install a GPU version of pretrained BERT
!pip install cdqa
!pip install tqdm -U
!wget https://github.com/cdqa-suite/cdQA/releases/download/bert_qa_vGPU/bert_qa_vGPU-sklearn.joblib

from ast import literal_eval
#from cdqa.utils.filters import filter_paragraphs
from cdqa.pipeline.cdqa_sklearn import QAPipeline

### Focus on a task

Queries are grouped into major tasks, so the corpus is processed once per task.

In [None]:
task1_queries=['Range of incubation periods for the disease in humans, how this varies across age and health status, how long individuals are contagious, even after recovery',
'Prevalence of asymptomatic shedding and transmission, particularly children',
'Seasonality of transmission of coronavirus covid-19',
'Physical science of the coronavirus, charge distribution, adhesion to hydrophilic/phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding',
'Persistence and stability on a multitude of substrates and sources like nasal discharge, sputum, urine, fecal matter, blood',
'Persistence of virus on surfaces of different materials like copper, stainless steel, plastic',
'Natural history of the virus and shedding of it from an infected person',
'Implementation of diagnostics and products to improve clinical processes',
'Disease models, animal models for infection, disease and transmission',
'Tools and studies to monitor phenotypic change and potential adaptation of the virus',
'Immune response and immunity',
'Effectiveness of movement control strategies to prevent secondary transmission in health care and community settings',
'Effectiveness of personal protective equipment (PPE) and its usefulness to reduce risk of transmission in health care and community settings',
'Role of the environment in transmission of coronavirus covid 19']

task2_queries=['Data on potential risks factors,Smoking, pre-existing pulmonary disease,Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities,Neonates and pregnant women, Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.',
'Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors',
'Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups',
'Susceptibility of populations',
'Public health mitigation measures that could be effective for control'               
]

task3_queries=['Real-time tracking of whole genomes and a mechanism for coordinating the rapid dissemination of that information to inform the development of diagnostics and therapeutics and to track variations of the virus over time.',
'Access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences, and determine whether there is more than one strain in circulation. Multi-lateral agreements such as the Nagoya Protocol could be leveraged.',
'Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over.Evidence of whether farmers are infected, and whether farmers could have played a role in the origin. Surveillance of mixed wildlife- livestock farms for SARS-CoV-2 and other coronaviruses in Southeast Asia. Experimental infections to test host range for this pathogen.',
'Animal host(s) and any evidence of continued spill-over to humans',
'Socioeconomic and behavioral risk factors for this spill-over',
'Sustainable risk reduction strategies']

task4_queries=['Effectiveness of drugs being developed and tried to treat COVID-19 patients.',
'Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication.',
'Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.',
'Exploration of use of best animal models and their predictive value for a human vaccine.',
'Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.',
'Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up. This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need.',
'Efforts targeted at a universal coronavirus vaccine.',
'Efforts to develop animal models and standardize challenge studies',
'Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers',
'Approaches to evaluate risk for enhanced disease after vaccination',
'Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models [in conjunction with therapeutics]']

task5_queries=['Resources to support skilled nursing facilities and long term care facilities.',
'Mobilization of surge medical staff to address shortages in overwhelmed communities',
'Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) with/without other organ failure – particularly for viral etiologies',
'Extracorporeal membrane oxygenation (ECMO) outcomes data of COVID-19 patients',
'Outcomes data for COVID-19 after mechanical ventilation adjusted for age.',
'Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, including, but not limited to, possible cardiomyopathy and cardiac arrest.',
'Application of regulatory standards (e.g., EUA, CLIA) and ability to adapt care to crisis standards of care level.',
'Approaches for encouraging and facilitating the production of elastomeric respirators, which can save thousands of N95 masks.',
'Best telemedicine practices, barriers and facilitators, and specific actions to remove/expand them within and across state boundaries.',
'Guidance on the simple things people can do at home to take care of sick people and manage disease.',
'Oral medications that might potentially work.',
'Use of AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually.',
'Best practices and critical challenges and innovative solutions and technologies in hospital flow and organization, workforce protection, workforce allocation, community-based support resources, payment, and supply chain management to enhance capacity, efficiency, and outcomes.',
'Efforts to define the natural history of disease to inform clinical care, public health interventions, infection prevention control, transmission, and clinical trials',
'Efforts to develop a core clinical outcome set to maximize usability of data across a range of trials',
'Efforts to determine adjunctive and supportive interventions that can improve the clinical outcomes of infected patients (e.g. steroids, high flow oxygen)']

task6_queries=['Guidance on ways to scale up NPIs in a more coordinated way (e.g., establish funding, infrastructure and authorities to support real time, authoritative (qualified participants) collaboration with all states to gain consensus on consistent guidance and to mobilize resources to geographic areas where critical shortfalls are identified) to give us time to enhance our health care delivery system capacity to respond to an increase in cases.',
'Rapid design and execution of experiments to examine and compare NPIs currently being implemented. DHS Centers for Excellence could potentially be leveraged to conduct these experiments.',
'Rapid assessment of the likely efficacy of school closures, travel bans, bans on mass gatherings of various sizes, and other social distancing approaches.',
'Methods to control the spread in communities, barriers to compliance and how these vary among different populations..',
'Models of potential interventions to predict costs and benefits that take account of such factors as race, income, disability, age, geographic location, immigration status, housing status, employment status, and health insurance status.',
'Policy changes necessary to enable the compliance of individuals with limited resources and the underserved with NPIs.',
'Research on why people fail to comply with public health advice, even if they want to do so (e.g., social or financial costs may be too high).',
'Research on the economic impact of this or any pandemic. This would include identifying policy and programmatic alternatives that lessen/mitigate risks to critical government services, food distribution and supplies, access to critical household supplies, and access to health diagnoses, treatment, and needed care, regardless of ability to pay.']

task7_queries=['Are there geographic variations in the rate of COVID-19 spread?',
'Are there geographic variations in the mortality rate of COVID-19?',
'Is there any evidence to suggest geographic based virus mutations?']

task8_queries=['How widespread current exposure is to be able to make immediate policy recommendations on mitigation measures. Denominators for testing and a mechanism for rapidly sharing that information, including demographics, to the extent possible. Sampling methods to determine asymptomatic disease (e.g., use of serosurveys (such as convalescent samples) and early detection of disease (e.g., use of screening of neutralizing antibodies such as ELISAs).',
'Efforts to increase capacity on existing diagnostic platforms and tap into existing surveillance platforms.',
'Recruitment, support, and coordination of local expertise and capacity (public, private—commercial, and non-profit, including academic), including legal, ethical, communications, and operational issues.',
'National guidance and guidelines about best practices to states (e.g., how states might leverage universities and private laboratories for testing purposes, communications to public health officials and the public).',
'Development of a point-of-care test (like a rapid influenza test) and rapid bed-side tests, recognizing the tradeoffs between speed, accessibility, and accuracy.',
'Rapid design and execution of targeted surveillance experiments calling for all potential testers using PCR in a defined area to start testing and report to a specific entity. These experiments could aid in collecting longitudinal samples, which are critical to understanding the impact of ad hoc local interventions (which also need to be recorded).',
'Separation of assay development issues from instruments, and the role of the private sector to help quickly migrate assays onto those devices.',
'Efforts to track the evolution of the virus (i.e., genetic drift or mutations) and avoid locking into specific reagents and surveillance/detection schemes.',
'Latency issues and when there is sufficient viral load to detect the pathogen, and understanding of what is needed in terms of biological and environmental sampling.',
'Use of diagnostics such as host response markers (e.g., cytokines) to detect early disease or predict severe disease progression, which would be important to understanding best clinical practice and efficacy of therapeutic interventions.',
'Policies and protocols for screening and testing.',
'Policies to mitigate the effects on supplies associated with mass testing, including swabs and reagents.',
'Technology roadmap for diagnostics.',
'Barriers to developing and scaling up new diagnostic tests (e.g., market forces), how future coalition and accelerator models (e.g., Coalition for Epidemic Preparedness Innovations) could provide critical funding for diagnostics, and opportunities for a streamlined regulatory environment.',
'New platforms and technology (e.g., CRISPR) to improve response times and employ more holistic approaches to COVID-19 and future diseases.',
'Coupling genomics and diagnostic testing on a large scale',
'Enhance capabilities for rapid sequencing and bioinformatics to target regions of the genome that will allow specificity for a particular variant.',
'Enhance capacity (people, technology, data) for sequencing with advanced analytics for unknown pathogens, and explore capabilities for distinguishing naturally-occurring pathogens from intentional.',
'One Health surveillance of humans and potential sources of future spillover or ongoing exposure for this organism and future pathogens, including both evolutionary hosts (e.g., bats) and transmission hosts (e.g., heavily trafficked and farmed wildlife and domestic food and companion species), inclusive of environmental, demographic, and occupational risk factors.']

task9_queries=['Efforts to articulate and translate existing ethical principles and standards to salient issues in COVID-2019',
'Efforts to embed ethics across all thematic areas, engage with novel ethical issues that arise and coordinate to minimize duplication of oversight',
'Efforts to support sustained education, access, and capacity building in the area of ethics',
'Efforts to establish a team at WHO that will be integrated within multidisciplinary research and operational platforms and that will connect with existing and expanded global networks of social sciences.',
'Efforts to develop qualitative assessment frameworks to systematically collect information related to local barriers and enablers for the uptake and adherence to public health measures for prevention and control. This includes the rapid identification of the secondary impacts of these measures. (e.g. use of surgical masks, modification of health seeking behaviors for SRH, school closures)',
'Efforts to identify how the burden of responding to the outbreak and implementing public health measures affects the physical and psychological health of those providing care for Covid-19 patients and identify the immediate needs that must be addressed.',
'Efforts to identify the underlying drivers of fear, anxiety and stigma that fuel misinformation and rumor, particularly through social media.']

task10_queries=['Methods for coordinating data-gathering with standardized nomenclature.',
'Sharing response information among planners, providers, and others.',
'Understanding and mitigating barriers to information-sharing.',
'How to recruit, support, and coordinate local (non-Federal) expertise and capacity relevant to public health emergency response (public, private, commercial and non-profit, including academic).',
'Integration of federal/state/local public health surveillance systems.',
'Value of investments in baseline public health response infrastructure preparedness',
'Modes of communicating with target high-risk populations (elderly, health care workers).',
'Risk communication and guidelines that are easy to understand and follow (include targeting at risk populations’ families too).',
'Communication that indicates potential risk of disease to all population groups.',
'Misunderstanding around containment and mitigation.',
'Action plan to mitigate gaps and problems of inequity in the Nation’s public health capability, capacity, and funding to ensure all citizens in need are supported and can access information, surveillance, and treatment.',
'Measures to reach marginalized and disadvantaged populations.Data systems and research priorities and agendas incorporate attention to the needs and circumstances of disadvantaged populations and underrepresented minorities.',
'Mitigating threats to incarcerated people from COVID-19, assuring access to information, prevention, diagnosis, and treatment.',
'Understanding coverage policies (barriers and opportunities) related to testing, treatment, and care']


In [None]:
### Toolkit to extract data from JSON files

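def loadfiles(file_list,directory):
    """Load every JSON file of file_list found in directory."""
    all_files = []
    for filename in file_list:
        filename = directory + filename
        # use a context manager so each file handle is closed after loading
        with open(filename, 'rb') as f:
            all_files.append(json.load(f))
    return all_files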

def format_name(author):
    middle_name = " ".join(author['middle'])
    
    if author['middle']:
        return " ".join([author['first'], middle_name, author['last']])
    else:
        return " ".join([author['first'], author['last']])

def format_affiliation(affiliation):
    text = []
    location = affiliation.get('location')
    if location:
        text.extend(list(affiliation['location'].values()))
    
    institution = affiliation.get('institution')
    if institution:
        text = [institution] + text
    return ", ".join(text)

def format_authors(authors, with_affiliation=False):
    name_ls = []
    
    for author in authors:
        name = format_name(author)
        if with_affiliation:
            affiliation = format_affiliation(author['affiliation'])
            if affiliation:
                name_ls.append(f"{name} ({affiliation})")
            else:
                name_ls.append(name)
        else:
            name_ls.append(name)
    
    return name_ls

In [None]:
### Defining the dataframe schema we wanted to use
col_names = [
    'title',   
    'paragraphs',
    'authors',
    'affiliat',
    'paper_id'
]

Create the DataFrame in the format cdQA expects and keep the metadata columns for analysis.
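
To make the target format concrete, the sketch below turns one parsed CORD-19 JSON record into a `title` plus a list of `paragraphs`, the two columns later passed to the retriever. The field names come from the code in this notebook; the function name `to_cdqa_row` and the tiny record are hypothetical and only serve as an illustration.

```python
def to_cdqa_row(paper):
    """Turn one parsed CORD-19 JSON record into a title and a list of paragraphs.

    The abstract is optional in the JSON files; empty text blocks are skipped.
    """
    sections = paper.get("abstract", []) + paper["body_text"]
    paragraphs = [section["text"] for section in sections if section["text"]]
    return {"title": paper["metadata"]["title"], "paragraphs": paragraphs}


# Hypothetical, heavily shortened record (not taken from the real corpus).
paper = {
    "paper_id": "abc123",
    "metadata": {"title": "A toy paper", "authors": []},
    "abstract": [{"text": "Short abstract."}],
    "body_text": [{"text": "First paragraph."}, {"text": ""}, {"text": "Second paragraph."}],
}

row = to_cdqa_row(paper)
print(row)
```

Each paper becomes one row, and its abstract and body paragraphs are kept in reading order as a list of strings.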

In [None]:
def cleaned_dataframe(all_files):
    cleaned_files = []
    for file in tqdm(all_files):
        texte=[]
        if file['metadata']['title']:
            #if detect(text['text'])=='en':
            if('abstract' in file.keys()):
                for text in file['abstract']+file['body_text'] : 
                    if text['text']:
                  #if detect(text['text'])=='en':
                        texte.append(text['text'])
            else :
                for text in file['body_text'] : 
                    if text['text']:
                  #if detect(text['text'])=='en':
                        texte.append(text['text'])
            if texte:
                cleaned_files.append([file['metadata']['title'],texte,format_authors(file['metadata']['authors']),format_authors(file['metadata']['authors'], with_affiliation=True),file['paper_id']])
    clean_0_df = pd.DataFrame(cleaned_files, columns=col_names)
    return clean_0_df

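Helper to compute the batch boundaries

In [None]:
def customRange(start,end,step):
    """Return (begin, end) index pairs that cover [start, end) in batches of `step`.

    The last batch may hold up to `step + 15` items, which avoids
    splitting off a trailing batch of 15 items or fewer.
    """
    l=[]
    m=[]
    i = start
    while i < end-step-15:
        l.append(i)
        i += step
        m.append(i)
    l.append(i)
    m.append(end)
    return zip(l,m)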

**Retriever training**. Feed the model with all the articles in the corpus, batch by batch, to save compute.

**Prediction**. For each batch of the corpus, all the subtasks of the selected task are addressed and the results are appended to one CSV file per subtask.
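
The batch boundaries used for this step can be illustrated with a small standalone function. It mirrors the logic of the `customRange` helper in this notebook; the name `batch_ranges` and the `min_tail` parameter, which stands for the hard-coded 15, are introduced here for illustration only.

```python
def batch_ranges(start, end, step, min_tail=15):
    """Return (begin, end) index pairs covering [start, end) in batches of `step`.

    The loop stops as soon as at most `step + min_tail` items remain, so the
    last batch absorbs a small remainder instead of leaving a tiny batch.
    """
    ranges = []
    i = start
    while i < end - step - min_tail:
        ranges.append((i, i + step))
        i += step
    ranges.append((i, end))
    return ranges


print(batch_ranges(0, 1000, 400))
print(batch_ranges(0, 1215, 400))
print(batch_ranges(0, 1216, 400))
```

With 1215 articles the 15 leftover articles are merged into the third batch, while with 1216 articles the 16 leftover articles form a fourth batch of their own.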

In [None]:
def train_retriever_better(pipeline,readerSize,retrieverSize,df,batchsize,subset,taskQuery,numTask):
    """
    try to use batches to fit retriever
    """
    predictions=[]
    pipeline.reader.n_best_size = readerSize 
    pipeline.retriever.top_n = retrieverSize
    for (n,m) in customRange(0,df.shape[0],batchsize):
        print(n)
        pipeline.fit_retriever(df[n:m])
        pred=[]
        for query in taskQuery:
            prediction=pipeline.predict(query=query, return_all_preds= True)
            pred.append(prediction)
        predictions.append(pred)
    # iterate over the queries that were passed in, not over the global task1_queries
    for i in range(len(taskQuery)):
        l=[]
        for pred in predictions:
            l=l+pred[i]
        df=pd.DataFrame(l)
        df['title']=subset+'/'+df['title']
        df.to_csv('task'+str(numTask)+'_q_'+str(i)+'.csv', index=False, mode='a', header=False)

**Get the pretrained GPU reader for cdQA (BERT)**

In [None]:
cdqa_pipeline = QAPipeline(reader='bert_qa_vGPU-sklearn.joblib')

In [None]:
directories=['/kaggle/input/CORD-19-research-challenge/comm_use_subset/comm_use_subset/pdf_json/',
             '/kaggle/input/CORD-19-research-challenge/custom_license/custom_license/pdf_json/',
             '/kaggle/input/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/pmc_json/',
             '/kaggle/input/CORD-19-research-challenge/biorxiv_medrxiv/biorxiv_medrxiv/pdf_json/'
            ]

**Apply training and prediction to the queries of a task. The chosen batch size is 400, with 10 articles picked by the retriever and read by BERT.**

In [None]:
def processCorpusForquery(task,tasknum):
    """
    process corpus for a query and persist CSV
    """
    for directory in directories:
        subset_name=str(directory.split('/')[-3])
        print(subset_name)
        file_list=os.listdir(str(directory))
        all_files = []
        if(len(file_list))<10000:
            clean_0_df=cleaned_dataframe(loadfiles(file_list,directory))
            clean_0_df.to_csv(subset_name+'.csv',index=False)
            train_retriever_better(cdqa_pipeline,10,10,clean_0_df[['title','paragraphs']], 400, subset_name,task,tasknum)
        else:
            for (n,m) in customRange(0,len(file_list),9500):
                clean_0_df=cleaned_dataframe(loadfiles(file_list[n:m],directory))
                clean_0_df.to_csv(subset_name+'.csv',index=False)
                train_retriever_better(cdqa_pipeline,10,10,clean_0_df[['title','paragraphs']], 400, subset_name,task,tasknum)

In [None]:
processCorpusForquery(task1_queries,1)

## **Test a csv for a specific subtask**

In [None]:
df=pd.read_csv('task1_q_0.csv', header=None,names=["answer", "probability","start","end","qas_id","title","paragraph","retriver_score","final_score"])

## Results highlight

In [None]:
import matplotlib.pyplot as plt
import numpy as np
plt.rcdefaults()
# keep the 15 answers with the highest final score
s_pred_df_1=df.sort_values('final_score',ascending=False)[0:15]
objects = s_pred_df_1['answer']
y_pos = np.arange(len(objects))
performance = s_pred_df_1['final_score']

plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects,rotation='vertical')
# the bars show the final score; the answers are the x tick labels
plt.ylabel('final score')
plt.title('Scoring of answers')

plt.show() 

The chart above shows the 15 best-scoring answers found for the query "Range of incubation periods for the disease in humans, how this varies across age and health status, how long individuals are contagious, even after recovery". This query bundles three questions into one, which is why the answers differ so much, for example "3-8 days" or "10 days to months".
Two of these questions are about durations, so it is hard to match each answer with the part of the query it addresses.
Several answers look aberrant, such as "years to decade".
Some answers look very good, but the query should be split into separate questions to get a better understanding of the results.
It is interesting to see that some articles agree with each other.
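
One simple way to quantify that agreement is to count how many distinct articles return the same answer text. The sketch below is an illustration under stated assumptions: the function name `answers_by_support` is hypothetical, answers are compared after trimming and lower-casing only, and the rows are toy data rather than real results.

```python
from collections import defaultdict


def answers_by_support(rows):
    """Count the distinct article titles behind each normalized answer.

    Returns (answer, number_of_articles) pairs, most supported answer first.
    """
    support = defaultdict(set)
    for row in rows:
        # Normalize lightly so trivial variants of one answer are grouped.
        support[row["answer"].strip().lower()].add(row["title"])
    pairs = [(answer, len(titles)) for answer, titles in support.items()]
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


# Toy rows shaped like the 'answer' and 'title' columns of the result CSV.
rows = [
    {"answer": "3-8 days", "title": "paper A"},
    {"answer": "3-8 days ", "title": "paper B"},
    {"answer": "10 days to months", "title": "paper C"},
    {"answer": "3-8 Days", "title": "paper A"},
]

print(answers_by_support(rows))
```

An answer backed by several independent articles is a better candidate for manual review than one that appears in a single paragraph.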

In [None]:
import matplotlib.pyplot as plt
import numpy as np
plt.rcdefaults()
# keep the 15 rows with the highest retriever score
s_pred_df_1=df.sort_values('retriver_score',ascending=False)[0:15]
objects = s_pred_df_1['title']
y_pos = np.arange(len(objects))
performance = s_pred_df_1['retriver_score']

plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects,rotation='vertical')
# the bars show the retriever score; the article titles are the x tick labels
plt.ylabel('retriever score')
plt.title('Scoring of articles')

plt.show() 

## Answer by queries

In [None]:
from IPython.display import display, HTML 

In [None]:
def print_queries(query,taskNumber,path=None,k=3):
    for j in range(len(query)):
        if(path):
            df=pd.read_csv(path+'task'+str(taskNumber)+'/'+'task'+str(taskNumber)+'_q_'+str(j)+'.csv',header=None)
        else:
            df=pd.read_csv('task'+str(taskNumber)+'_q_'+str(j)+'.csv',header=None)
        display(HTML(f'<h2 class="question" > Question {j+1}: \n {query[j]} \n</h2>'))
        for i in range(0,k):
            display(HTML(f'<h3 class="answer">Answer {i+1}: \n {df[0][i]} </h3>'))
            display(HTML(f'<h4 >Article {i+1}: \n {df[5][i]} </h4> '))
            display(HTML(f'<h4 >Extract from the following paragraph : \n</h4>  {df[6][i]} '))

In [None]:
%%HTML
<style type="text/css">
h2.question {
     background-color: steelblue; 
     color: white; 
     padding: 8px; 
     padding-right: 30px; 
     font-size: 24px; 
     max-width: 1500px; 
     margin-top: 10px;
     margin-bottom:4px;
 }
h3.answer {
     background-color: skyblue; 
     color: black; 
     padding: 8px; 
     padding-right: 30px; 
     font-size: 20px; 
     max-width: 1500px; 
     margin-top: 4px;
     margin-bottom:4px;
 }
</style>

## **Answers for Task 1**

In [None]:
print_queries(task1_queries,1)

### Get already processed data for the other tasks

In [None]:
!git clone https://github.com/guizmo2000/guizmo2000-cdqaresults

In [None]:
print_queries(task2_queries,2,'guizmo2000-cdqaresults/')

In [None]:
print_queries(task3_queries,3,'guizmo2000-cdqaresults/')

In [None]:
print_queries(task4_queries,4,'guizmo2000-cdqaresults/')

In [None]:
print_queries(task5_queries,5,'guizmo2000-cdqaresults/')

In [None]:
print_queries(task6_queries,6,'guizmo2000-cdqaresults/')

In [None]:
print_queries(task7_queries,7,'guizmo2000-cdqaresults/')

In [None]:
print_queries(task8_queries,8,'guizmo2000-cdqaresults/')

In [None]:
print_queries(task9_queries,9,'guizmo2000-cdqaresults/')

In [None]:
print_queries(task10_queries,10,'guizmo2000-cdqaresults/')