# **Infection Among Cancer Patients 2.0**

![](https://sportslogohistory.com/wp-content/uploads/2018/09/georgia_tech_yellow_jackets_1991-pres-1.png)

**Executive Summary:** An unsupervised scientific literature understanding system that accepts natural language questions and returns specific answers from the CORD19 scientific paper corpus. The answers are wholly generated by the system from the publications cited below the answer. There is also a summary answer of the relevant text and a table with the top 5 articles for ease of scanning.

**APPROACH:**
- metadata.csv is loaded; the system uses full-text articles where available
- the natural language questions for the task are contained in a list
- focusing keywords restrict the search to documents that specifically relate to the topic
- the stop words are removed from the natural language questions before they are passed to spaCy for sentence comparison
- the full-text documents are parsed at sentence level and each sentence is compared to the question
- spaCy comparison: https://spacy.io/usage/vectors-similarity
- the most relevant sentences are returned in a dataframe and sent to BERT QA and the BERT summarizer
- BERT QA: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15848021.pdf
- BERT summarizer: https://pypi.org/project/bert-extractive-summarizer/
- the question, summary answers and table of relevant scientific papers are returned in HTML format for user review
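The filtering and ranking steps above can be sketched without any model downloads. This is a minimal illustration only: it swaps spaCy's word-vector similarity for a plain cosine similarity over word counts, and the names `bag_of_words`, `cosine`, `rank_sentences` and `toy_document` are hypothetical, not part of the notebook.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word tokens (a toy stand-in for a spaCy document vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_sentences(question, document, focus, top_n=3):
    """Keep sentences that mention the focus keyword, then rank them against the question."""
    q_vec = bag_of_words(question)
    sentences = [s.strip() for s in document.split(". ") if focus in s.lower()]
    scored = [(cosine(q_vec, bag_of_words(s)), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_n]

# assumption: a made-up three-sentence document, not taken from CORD-19
toy_document = (
    "Cancer patients had a higher risk of severe covid-19 events. "
    "The weather was mild during the study period. "
    "Patients with cancer were older on average"
)
ranked = rank_sentences(
    "Do cancer patients have a higher risk of a severe case?", toy_document, "cancer"
)
for score, sentence in ranked:
    print(f"{score:.2f}  {sentence}")
```

The sentence without the focus keyword is dropped before scoring, which mirrors how the notebook only scores sentences containing `focus`.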

**Scientific rigor:** Is the solution evidence based? i.e. does it leverage robust data?  Yes, the solution accesses the Open Research Dataset (CORD-19). CORD-19 is a resource of over 59,000 scholarly articles, including over 47,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.

**Scientific model/strategy:** Did the solution employ a robust scientific method? The solution relies on the underlying scientific documents to provide the scientific model or strategy. The system learns concepts and topics from a mass of documents and boils them down so the user can understand answers that are contained in the literature. The quality of the literature fed to the system will have an impact on the reliability of the answers to the questions presented.

**Unique and novel insight** a. Does the solution identify information (new data, features, insights etc) that is yet to be “uncovered?” Yes, this system provides novel insight into the posed topics relating to cancer. The biggest advantage is that each week, when the CORD19 documents are supplemented with thousands of new documents, the system can quickly analyze all the documents and revise answers based on new information reported in the updated documents.

**Market Translation and Applicability** a. Does the solution resolve an existing market need for either an individual, health institution or policy maker? Yes, when any researcher or decision maker wants to understand the broader concepts of a topic and learn what is known about specific or even general questions, the system can quickly review a huge corpus and provide answers that help humans focus their analysis or questions.

**Speed to market** a. Does it apply to an existing product vision such as a self assessment tool or policy decision-making tool? Yes, the system could easily be scaled to a web-based system that could be easily implemented with

**Longevity of solution in market** a. Is the solution one that could be used in various markets through time? Yes, the system can be provided with different documents and questions and can be repurposed to any research goal, such as understanding drug or vaccine targets.

**Ability of user to collaborate and contribute to other solutions within the Kaggle community** a. Did the user provide expertise and or resources in the form of datasets or models to their fellow Kaggle members? Yes, I have created and shared about 50 COVID-19 notebooks with Kaggle users. Here is a link to the notebooks I have shared: https://www.kaggle.com/mlconsult/notebooks In addition, my work has been cited on the Kaggle contributions page numerous times. Follow this link https://www.kaggle.com/covid-19-contributions and see the data scientist attributions at the bottom of the tables.
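For the BERT QA step listed under APPROACH, the answer is read off per-token start and end scores. Taking the two highest-scoring positions independently can give an end position before the start, which produces an empty answer. The sketch below shows one common alternative: pick the best pair with start <= end, then merge WordPiece fragments. The helper names, tokens and scores are all hypothetical and chosen by hand for illustration; this is not the method the notebook code uses.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score where start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, s_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best

def join_wordpieces(tokens):
    """Merge WordPiece continuation tokens ('##xyz') back into whole words."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

# assumption: hand-written tokens and scores, not real model output
tokens = ["[CLS]", "who", "[SEP]", "immuno", "##sup", "##pressed", "patients", "[SEP]"]
start_scores = [0.1, 0.0, 0.0, 4.0, 0.3, 0.2, 5.0, 0.0]
end_scores = [0.0, 0.0, 0.0, 0.1, 0.2, 4.0, 1.0, 0.0]

# independent argmax would give start=6 and end=5, i.e. an empty slice
start, end = best_span(start_scores, end_scores)
answer = join_wordpieces(tokens[start:end + 1])
print(start, end, answer)
```

The nested loop is quadratic in the number of tokens, which is acceptable for inputs capped at BERT's 512-token limit.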


In [None]:
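# install the extractive summarizer before importing it
!pip install bert-extractive-summarizer

import numpy as np
import pandas as pd
import spacy
import torch
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from summarizer import Summarizer
from transformers import BertTokenizer, BertForQuestionAnswering

# large English spaCy model with word vectors, used for sentence similarity
nlp = spacy.load('en_core_web_lg')

# BERT extractive summarizer
model = Summarizer()

# BERT tokenizer and SQuAD fine-tuned question-answering model
qa_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
qa_model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')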


In [None]:
import re
import os
import json
# keep only documents whose abstract mentions covid, -cov-2, cov2 or ncov
def search_focus(df):
    mask = df['abstract'].str.contains('covid|-cov-2|cov2|ncov')
    df = df[mask]
    df = df.drop_duplicates(subset='title', keep="first")
    return df

# load the metadata CSV, keeping only the columns used below
df=pd.read_csv('/kaggle/input/CORD-19-research-challenge/metadata.csv', usecols=['title','journal','abstract','authors','doi','publish_time','sha','full_text_file'])
print ('All CORD19 documents ',df.shape)
#fill na fields
df=df.fillna('no data provided')
#drop duplicate titles
df = df.drop_duplicates(subset='title', keep="first")
#keep only 2020 dated papers
df=df[df['publish_time'].str.contains('2020')]
# combine abstract and title into one lowercase search field
df["abstract"] = df["abstract"].str.lower()+' '+df["title"].str.lower()
# keep only the COVID-19 focused documents
df=search_focus(df)
print ("COVID-19 focused documents ",df.shape)
#df.head()

def format_body(body_text):
    texts = [(di['section'], di['text']) for di in body_text]
    texts_di = {di['section']: "" for di in body_text}
    
    for section, text in texts:
        texts_di[section] += text

    body = ""

    for section, text in texts_di.items():
        body += section
        body += "\n\n"
        body += text
        body += "\n\n"
    
    return body


base_dir = '/kaggle/input/CORD-19-research-challenge/'
for index, row in df.iterrows():
    # path to the parsed full-text JSON for this paper
    json_path = base_dir+row['full_text_file']+'/'+row['full_text_file']+'/pdf_json/'+row['sha']+'.json'
    # skip rows with several hashes (separated by ';') or without a full-text file
    if ';' not in row['sha'] and os.path.exists(json_path):
        with open(json_path) as json_file:
            data = json.load(json_file)
        body=format_body(data['body_text'])
        body=body.replace("\n", " ")
        # replace the abstract with the lowercased full text
        df.loc[index, 'abstract'] = body.lower()

df=df.drop(['full_text_file'], axis=1)
df=df.drop(['sha'], axis=1)
df.head()

In [None]:
from IPython.display import display, Markdown, Latex, HTML

def remove_stopwords(text,stopwords):
    # strip punctuation before tokenizing
    text = "".join(c for c in text if c not in ('!','.',',','?','(',')','-'))
    text_tokens = word_tokenize(text)
    # build the English stop word set once instead of reloading every language list per token
    stop_set = set(stopwords.words('english'))
    tokens_without_sw = [word for word in text_tokens if word not in stop_set]
    return ' '.join(tokens_without_sw)

def score_sentence(search,sentence):
    # spaCy similarity between the sentence and the question
    main_doc=nlp(sentence)
    search_doc=nlp(search)
    sent_score=main_doc.similarity(search_doc)
    return sent_score

def process_question(df,search,focus):
    df_table = pd.DataFrame(columns = ["pub_date","title","excerpt","rel_score"])
    df1 = df[df['abstract'].str.contains(focus)]
    search=remove_stopwords(search,stopwords)
    for index, row in df1.iterrows():
        sentences = row['abstract'].split('. ')
        pub_sentence=''
        hi_score=0
        for sentence in sentences:
            if len(sentence)>75 and focus in sentence:
                rel_score=score_sentence(search,sentence)
                if rel_score>.85:
                    sentence=sentence.capitalize()
                    if not sentence.endswith('.'):
                        sentence=sentence+'.'
                    pub_sentence=pub_sentence+' '+sentence
                    if rel_score>hi_score:
                        hi_score=rel_score
        if pub_sentence!='':
            # link the paper title to its DOI
            link='https://doi.org/'+row['doi']
            final_link='<p align="left"><a href="{}">{}</a></p>'.format(link,row["title"])
            # append publication date, linked title, excerpt and best sentence score
            df_table.loc[len(df_table)] = [row['publish_time'],final_link,pub_sentence,hi_score]
    df_table=df_table.sort_values(by=['rel_score'], ascending=False)
    return df_table

def prepare_summary_answer(text,model):
    #model = pipeline(task="summarization")
    return model(text)

def answer_question(question,text,tokenizer,model):
    # encode() adds [CLS] and [SEP] itself, so pass the question and text as a pair
    input_ids = tokenizer.encode(question, text)
    # segment 0 covers the question up to the first [SEP] (id 102), segment 1 the text
    token_type_ids = [0 if i <= input_ids.index(102) else 1 for i in range(len(input_ids))]
    # index the output so both tuple and ModelOutput return types work
    outputs = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
    start_scores, end_scores = outputs[0], outputs[1]
    all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    # join the tokens between the highest-scoring start and end positions
    answer=' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
    return answer

###### MAIN PROGRAM ######

# questions
#'''
search=[
'Do cancer patients have more risk of being infected with COVID-19?',
'Do cancer patients have a higher risk of a severe case?',
'Are cancer patients at a higher risk of mortality or death?',
'What type of cancer or malignancy can put patients at higher risk for severity or death?'
]
#'''

focus='cancer'

z=0

for question in search:
    # process with spacy model and return df
    df_table=process_question(df,question,focus)
    
    #use only the top 10 paper excerpts for question answering
    df_answer=df_table.head(10)
    
    qa_text=''
    
    #loop through df to assemble sentences
    for index, row in df_answer.iterrows():
        qa_text=qa_text+' '+row['excerpt']
    
    #keep the first 511 characters so the QA input stays well within BERT's 512-token limit
    qa_text=qa_text[:511]
    
    #use only the top 10 paper excerpts for summarization
    df_summary=df_table.head(10)
    
    summary_text=''
    
    #loop through df to assemble sentences
    for index, row in df_summary.iterrows():
        summary_text=summary_text+' '+row['excerpt']

    # show the question as a markdown heading
    display(Markdown('# '+question))
    
    qa_answer=answer_question(question,qa_text,qa_tokenizer,qa_model)
    qa_answer=qa_answer.replace(' ##','')
    display(HTML('<h4> Answer: </h4><i>'+qa_answer+'</i>'))
    
    #summarize questions
    summary_answer=prepare_summary_answer(summary_text,model)
    
    #summary_answer=summary_answer[0]['summary_text']
    display(HTML('<h4> Summarized Answer: </h4><i>'+summary_answer+'</i>'))
    display(HTML('<h5>results limited to 5 for ease of scanning</h5>'))
    
    #limit the size of the df for the html table
    df_table=df_table.head(5)
    
    #convert df_table to html and display
    df_table=HTML(df_table.to_html(escape=False,index=False))
    display(df_table)
    
    z=z+1

print ('done') 