# **TASK - 4 spaCy and BERT summarizer**

# This is version 4.0 (at least) of testing, which will now be combined with other knowledge learned to enhance performance for round 2.

![](https://sportslogohistory.com/wp-content/uploads/2018/09/georgia_tech_yellow_jackets_1991-pres-1.png)

**Executive Summary:** An unsupervised scientific literature understanding system that accepts natural language questions and returns specific answers from the CORD19 scientific paper corpus. The answers are generated entirely by the system from the publications cited below each answer. There is also a link, with the question pre-loaded, to the CORD19 web-based corpus (QA) search.

**PROBLEM:** When a new virus is discovered and causes a pandemic, scientists need information from all scientific sources that may help them combat it. The challenge, however, is that scientific papers are produced in large numbers and published very rapidly, making it nearly impossible for scientists to digest and understand the important findings in this mass of data.

**SOLUTION:** An unsupervised scientific literature understanding system that accepts natural language questions (each with a focusing keyword) and returns specific answers from the CORD19 scientific paper corpus.

**APPROACH:**
- metadata.csv is loaded - the system currently uses only the abstracts (with the title appended)
- the natural language questions for the task are contained in a list
- a parallel list of focusing keywords restricts the search to documents that specifically relate to each topic
- stop words are removed from the natural language questions before they are passed to spaCy for sentence comparison
- the abstracts are parsed at sentence level and each sentence is compared to the question
- the most relevant sentences are returned in a dataframe and sent to the BERT extractive summarizer (https://pypi.org/project/bert-extractive-summarizer/)
- the question, the summarized answer and a table of relevant scientific papers are returned in HTML format
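
The sentence-level steps above can be sketched without loading the spaCy model. The following is only a minimal illustration under stated assumptions, not the notebook's implementation: it substitutes a toy bag-of-words cosine similarity for spaCy's vector similarity, and the names `STOP_WORDS`, `tokenize`, `strip_stop_words`, `cosine` and `relevant_sentences`, the sample abstract, and the 0.3 threshold are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (assumption; the notebook uses NLTK's list).
STOP_WORDS = {"what", "is", "the", "of", "to", "and", "a", "in", "for", "are", "being"}

def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def strip_stop_words(text):
    """Return the tokens of `text` that are not stop words."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]

def cosine(tokens_a, tokens_b):
    """Bag-of-words cosine similarity (toy stand-in for spaCy's vector similarity)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def relevant_sentences(question, abstract, threshold=0.3):
    """Split an abstract into sentences and keep those scoring above the threshold."""
    query = strip_stop_words(question)
    scored = []
    for sentence in abstract.split(". "):
        score = cosine(query, strip_stop_words(sentence))
        if score > threshold:
            scored.append((score, sentence))
    # Highest-scoring sentences first
    return sorted(scored, reverse=True)

# Made-up abstract used purely for illustration
abstract = ("Remdesivir is an antiviral drug tested in covid-19 patients. "
            "The weather was mild during the study period. "
            "Several drugs showed limited effectiveness in covid-19 patients")
hits = relevant_sentences(
    "What is the effectiveness of drugs to treat covid-19 patients?", abstract)
for score, sentence in hits:
    print(f"{score:.2f}  {sentence}")
```

The unrelated sentence about the weather shares no tokens with the question and is dropped, while the two on-topic sentences are kept and ranked. In the notebook the surviving sentences are then concatenated and passed to the summarizer.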

**Pros:** The system uses questions from the task with only very slight alterations and returns responsive, summarized answers to specific questions. The focusing keyword is used simply to reduce the size of the dataframe that must be searched at sentence level.
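
As a rough illustration of how a focusing keyword shrinks the search space, the sketch below filters a few made-up records in plain Python. The `papers` list and the `filter_by_focus` helper are hypothetical and exist only for this example.

```python
# Hypothetical records standing in for rows of the metadata dataframe.
papers = [
    {"title": "Paper A", "abstract": "a candidate vaccine elicits antibodies in mice."},
    {"title": "Paper B", "abstract": "hospital capacity models for regional planning."},
    {"title": "Paper C", "abstract": "vaccine safety was evaluated in an animal model."},
]

def filter_by_focus(records, focus):
    """Keep only records whose lowercased abstract contains the focus keyword."""
    focus = focus.lower()
    return [r for r in records if focus in r["abstract"].lower()]

vaccine_papers = filter_by_focus(papers, "vaccine")
print([p["title"] for p in vaccine_papers])  # ['Paper A', 'Paper C']
```

The notebook applies the same idea with pandas via `df['abstract'].str.contains(focus)`, which by default interprets the keyword as a regular expression rather than a plain substring.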

**Cons:** The system currently uses only the abstracts of the papers, so it may miss the most relevant text for crafting responses. The next step will be to combine versions of the different methods to provide even better results.


In [None]:
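# install the BERT extractive summarizer before importing it
!pip install bert-extractive-summarizer
import numpy as np
import pandas as pd
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from summarizer import Summarizer

# large English spaCy model with word vectors, used for sentence similarity
nlp = spacy.load('en_core_web_lg')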

In [None]:
# keep only documents whose abstract mentions covid, -cov-2, cov2 or ncov
def search_focus(df):
    dfa = df[df['abstract'].str.contains('covid')]
    dfb = df[df['abstract'].str.contains('-cov-2')]
    dfc = df[df['abstract'].str.contains('cov2')]
    dfd = df[df['abstract'].str.contains('ncov')]
    frames=[dfa,dfb,dfc,dfd]
    df = pd.concat(frames)
    df=df.drop_duplicates(subset='title', keep="first")
    return df

# load the metadata from the CSV file, keeping only the columns used below
df=pd.read_csv('/kaggle/input/CORD-19-research-challenge/metadata.csv', usecols=['title','journal','abstract','authors','doi','publish_time','sha','full_text_file'])
print (df.shape)
#fill na fields
df=df.fillna('no data provided')
#drop duplicate titles
df = df.drop_duplicates(subset='title', keep="first")
#keep only 2020 dated papers
df=df[df['publish_time'].str.contains('2020')]
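# lowercase the abstracts and append the lowercased title (space-separated) so both are searched
df["abstract"] = df["abstract"].str.lower()+' '+df["title"].str.lower()
# keep only papers that mention the COVID-19 focus terms
df=search_focus(df)
print (df.shape)
# show the first 5 rows of the filtered dataframe
df.head()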

In [None]:
from IPython.display import display, HTML

def remove_stopwords(text,stopwords):
    # strip basic punctuation before tokenizing
    text = "".join(c for c in text if c not in ('!','.',',','?','(',')','-'))
    text_tokens = word_tokenize(text)
    # build the English stop word set once instead of reloading the list for every token
    stop_set = set(stopwords.words('english'))
    tokens_without_sw = [word for word in text_tokens if word not in stop_set]
    return ' '.join(tokens_without_sw)

def score_sentence(search,sentence):
    # spaCy similarity between the candidate sentence and the question
    main_doc=nlp(sentence)
    search_doc=nlp(search)
    sent_score=main_doc.similarity(search_doc)
    return sent_score

def process_question(df,search,focus):
    df_table = pd.DataFrame(columns = ["pub_date","title","excerpt","rel_score"])
    df1 = df[df['abstract'].str.contains(focus)]
    search=remove_stopwords(search,stopwords)
    for index, row in df1.iterrows():
        sentences = row['abstract'].split('. ')
        pub_sentence=''
        hi_score=0
        for sentence in sentences:
            if len(sentence)>75:
                rel_score=score_sentence(search,sentence)
                if rel_score>.82:
                    sentence=sentence.capitalize()
                    if sentence[-1]!='.':
                        sentence=sentence+'.'
                    pub_sentence=pub_sentence+' '+sentence
                    if rel_score>hi_score:
                        hi_score=rel_score
        if pub_sentence!='':
            authors=row["authors"].split(" ")
            link=row['doi']
            title=row["title"]
            score=hi_score
            linka='https://doi.org/'+link
            linkb=title
            final_link='<p align="left"><a href="{}">{}</a></p>'.format(linka,linkb)
            #author_link='<p align="left"><a href="{}">{}</a></p>'.format(linka,authors[0]+' et al.')
            #sentence=pub_sentence+' '+author_link
            sentence=pub_sentence
            #sentence='<p fontsize=tiny" align="left">'+sentence+'</p>'
            to_append = [row['publish_time'],final_link,sentence,score]
            df_length = len(df_table)
            df_table.loc[df_length] = to_append
    df_table=df_table.sort_values(by=['rel_score'], ascending=False)
    return df_table

def prepare_summary_answer(text,model):
    #model = pipeline(task="summarization")
    return model(text)

###### MAIN PROGRAM ######
model = Summarizer()
# questions
search=[
'What is the effectiveness of drugs being developed and tried to treat COVID-19 patients?',
'Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication',
'How are potential complications of Antibody-Dependent Enhancement ADE in vaccine recipients being researched?',
'Exploration of use of best animal models and their predictive value for a human vaccine',
'Capabilities to discover a therapeutic not vaccine for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.',
'Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up and identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need',
'What research and work is being done to develop a universal vaccine for coronavirus',
'What work and research has been done to develop animal models and standardize challenge studies',
'What work and research has been done to develop prophylaxis clinical studies and prioritize in healthcare workers',
'Approaches to evaluate risk for enhanced disease after vaccination',
'Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models in conjunction with therapeutics'
]
# main focus keywords
focus=['drugs','drugs','antibodies','animal model','therapeutic','models','vaccine','model','drugs','vaccine','animal']

z=0

for question in search:
    # process with spacy model and return df
    df_table=process_question(df,question,focus[z])
    
    #use only the top 20 papers' excerpts for summarization
    df_answers=df_table.head(20)
    
    text=''
    
    #loop through df to assemble sentences
    for index, row in df_answers.iterrows():
        text=text+' '+row['excerpt']
    
    display(HTML('<h2>'+question+'</h2>'))
    
    #summarize the assembled excerpts into an answer
    summary_answer=prepare_summary_answer(text,model)
    
    #summary_answer=summary_answer[0]['summary_text']
    display(HTML('<h4> Summarized Answer: </h4><i>'+summary_answer+'</i>'))
    display(HTML('<h5>results limited to 5 for ease of scanning</h5>'))
    
    #limit the size of the df for the html table
    df_table=df_table.head(5)
    
    #convert df_table to html and display
    df_table=HTML(df_table.to_html(escape=False,index=False))
    display(df_table)
    
    z=z+1

print ('done') 