Automatic text summarization is a common problem in machine learning and natural language processing (NLP). There are two main approaches to summarizing text:
* Extraction-based summarization, which pulls key phrases or sentences from the source document and combines them into a summary; and
* Abstraction-based summarization, which generates new phrases and sentences that convey the most useful information from the original text, much as a human writer would.
 
In general, abstractive summarization is the harder task, but it can produce more fluent and concise summaries than an extractive method.
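
To make the contrast concrete, here is a minimal sketch of the extractive approach. It is a toy word-frequency heuristic written for illustration only, not the method used in this project; the function name `extractive_summary` and its scoring rule are assumptions.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order.

    Toy heuristic (an assumption for illustration): a sentence's score is the
    average document-level frequency of its words.
    """
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Rank sentence indices by score, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return ' '.join(sentences[i] for i in keep)


text = "The virus spreads fast. The virus incubation period is long. Cats sleep."
print(extractive_summary(text, num_sentences=1))
```

An extractive method like this can only copy sentences that already exist in the source, whereas an abstractive model is free to rephrase and merge information across sentences.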
 
In our project, readers may still want to read the paragraphs that contain the predicted QA answer spans, so we summarize the top-k (k = 3) paragraphs passed on by the QA module to generate a paragraph-level abstractive summary.
Our system uses two different abstractive summarization models: [UniLM](https://github.com/microsoft/unilm/tree/master/s2s-ft) and [BART](https://github.com/pytorch/fairseq/tree/master/examples/bart), both of which have reported state-of-the-art results on summarization benchmarks (the [CNN/DM](https://cs.nyu.edu/~kcho/DMQA/) and [XSum](https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset) datasets). UniLM is a unified pre-trained model for language understanding and generation. BART is a sequence-to-sequence model pre-trained with a denoising objective for language generation, translation, and comprehension.

We fine-tuned the UniLM model on the [SumOnGraph](https://github.com/coshiang/SumOnGraph) biology dataset, which includes literature on five types of disease: cancer, cardiovascular disease, diabetes, allergy, and obesity. The original data comes from PubMed, a free resource supporting the search and retrieval of biomedical and life sciences literature with the aim of improving health, both globally and personally. For BART, we used a model fine-tuned on the CNN/DailyMail dataset.
 
We generate a summary for each answer-related paragraph from the QA module, then concatenate them directly to form our final **paragraph-level answer summary**.
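
The paragraph-level step can be sketched as follows. This is an illustration under stated assumptions rather than the `covidSumm` implementation: `summarize_paragraphs` and the stand-in summarizer `first_sentence` are hypothetical names, and joining the pieces with a single space is a choice made here for readability.

```python
from typing import Callable, List


def summarize_paragraphs(paragraphs: List[str],
                         summarize: Callable[[str], str],
                         top_k: int = 3) -> str:
    """Summarize the first top_k paragraphs and join the results into one string."""
    summaries = [summarize(p).replace('\n', ' ').strip() for p in paragraphs[:top_k]]
    # Skip empty summaries so the final text has no stray separators.
    return ' '.join(s for s in summaries if s)


def first_sentence(paragraph: str) -> str:
    """Hypothetical stand-in for a real model: keep only the first sentence."""
    return paragraph.split('. ')[0].rstrip('.') + '.'


paragraphs = [
    "Alpha one. Alpha two.",
    "Beta one. Beta two.",
    "Gamma one. Gamma two.",
    "Delta one.",
]
print(summarize_paragraphs(paragraphs, first_sentence))
```

Passing the summarizer in as a callable keeps the concatenation logic independent of whether UniLM or BART produces the individual summaries.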
 
We also produce an **article-level summary**, although it is not shown for this Kaggle task. It takes the whole article as input, generates a summary for each section (e.g., the Introduction and Methodology sections), and then concatenates the section summaries into a more fine-grained article-level summary that complements the abstract.
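
A minimal sketch of the section-by-section idea is shown below. The function name `summarize_article`, the dictionary input format, and the choice to prefix each piece with its section name are assumptions for illustration; the actual `covidSumm` code may organize this differently.

```python
from typing import Callable, Dict


def summarize_article(sections: Dict[str, str],
                      summarize: Callable[[str], str]) -> str:
    """Summarize each non-empty section and concatenate the results in order."""
    parts = []
    for name, body in sections.items():
        if not body.strip():
            continue  # nothing to summarize in an empty section
        parts.append(f"{name}: {summarize(body).strip()}")
    return ' '.join(parts)


# Hypothetical stand-in summarizer: keep the first five characters.
article = {
    "Introduction": "Hello world",
    "Methods": "   ",
    "Results": "Great findings",
}
print(summarize_article(article, lambda text: text[:5]))
```

Because Python dictionaries preserve insertion order, the combined summary follows the order of the sections in the article.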

In [None]:
!pip uninstall covidSumm -y
!pip install easydict

In [None]:
!pip install -i https://test.pypi.org/simple/ covidSumm==0.1.3

In [None]:
!pip install fairseq

In [None]:
import covidSumm

In [None]:
import requests
import json
import os
import argparse

In [None]:
from covidSumm.abstractive_utils import get_ir_result, result_to_json, get_qa_result

In [None]:
from covidSumm.abstractive_model import abstractive_summary_model
from covidSumm.abstractive_config import set_config
from covidSumm.abstractive_bart_model import *

In [None]:
args = set_config()
args['model_path'] = '/kaggle/input/carieabssummmodel/'

In [None]:
def get_summary_list(article_list, abstractive_model):
    """Summarize each article and flatten its summary pieces into one string."""
    summary_list = []
    for article in article_list:
        summary_results = abstractive_model.generate_summary(article)
        # Concatenate the pieces, replacing newlines with spaces.
        result = "".join(item.replace('\n', ' ') for item in summary_results)
        summary_list.append(result)
    return summary_list

def get_answer_summary(query, abstractive_model):
    """Summarize the top-3 QA paragraphs for a query into one answer summary."""
    paragraphs_list = get_qa_result(query, topk=3)
    answer_summary_list = abstractive_model.generate_summary(paragraphs_list)
    # Concatenate the per-paragraph summaries, replacing newlines with spaces.
    answer_summary = "".join(item.replace('\n', ' ') for item in answer_summary_list)
    return {'summary': answer_summary, 'question': query}

def get_article_summary(query, abstractive_summary_model):
    """Summarize the top-10 retrieved articles and attach their metadata."""
    article_list, meta_info_list = get_ir_result(query, topk=10)
    summary_list = get_summary_list(article_list, abstractive_summary_model)
    # Pair each article's metadata with its generated summary.
    return [result_to_json(meta_info, summary)
            for meta_info, summary in zip(meta_info_list, summary_list)]

In [None]:
from IPython.display import display, HTML
import pandas as pd

def display_summary(ans_summary_json, model_type):
    question = ans_summary_json['question']
    text = ans_summary_json['summary']
    question_HTML = '<div style="font-family: Times New Roman; font-size: 28px; padding-bottom:28px"><b>Query</b>: '+question+'</div>'
    display(HTML(question_HTML))

    execSum_HTML = '<div style="font-family: Times New Roman; font-size: 18px; margin-bottom:1pt"><b>' + model_type + ' Abstractive Summary:</b> '+text+'</div>'
    display(HTML(execSum_HTML))

def display_article_summary(result, query):
    question_HTML = '<div style="font-family: Times New Roman; font-size: 28px; padding-bottom:28px"><b>Query</b>: '+query+'</div>'
    pdata = []
    abstract = ""
    summary = ""
    for i in range(len(result)):
        if 'abstract' in result[i].keys():
            line = []
            # Link the title to the article's DOI page (the href value must be closed with a quote).
            context_2 = '<a href="https://doi.org/'
            context_2 += result[i]['doi']
            context_2 += '" target="_blank">'
            context_2 += result[i]['title']
            context_2 += '</a>'
            line.append(context_2)
            
            abstract = "<div> " 
            abstract += result[i]['abstract']
            abstract += " </div>"
            line.append(abstract)
            summary = "<div> " + result[i]['summary'] + " </div>"
            line.append(summary)


            pdata.append(line)
    display(HTML(question_HTML))
    df = pd.DataFrame(pdata, columns = ['Title','Abstract','Summary'])
    # Left-align the table cells before displaying the styled DataFrame.
    df = df.style.set_properties(**{'text-align': 'left'})
    display(df)

In [None]:
query = 'What is the range of incubation periods for COVID-19 in humans'

* Now we initialize **summarization model 1** (UniLM)

In [None]:
args = set_config()
args['model_path'] = '/kaggle/input/carieabssummmodel/'
summary_model_1 = abstractive_summary_model(config = args)

* Next we initialize **summarization model 2** (BART)

In [None]:
model_path = "/kaggle/input/bartsumm/bart.large.cnn"
summary_model_2 = Bart_model(model_path)

In [None]:
answer_summary_1 = get_answer_summary(query, summary_model_1)

In [None]:
display_summary(answer_summary_1, 'UniLM')

In [None]:
answer_summary_2 = get_bart_answer_summary(query, summary_model_2)

In [None]:
display_summary(answer_summary_2, 'BART')

In [None]:
article_summary_1 = get_article_summary(query, summary_model_1)

In [None]:
display_article_summary(article_summary_1, query)

In [None]:
article_summary_2 = get_bart_article_summary(query, summary_model_2)

In [None]:
display_article_summary(article_summary_2, query)

In [None]:
from covidSumm.abstractive_api import *
answer_summary_1 = abstractive_api_uni_para(query)
answer_summary_1

In [None]:
from covidSumm.abstractive_utils import *
test_answer = abstractive_api(query, 'unilm_para')
test_answer

In [None]:
test_answer = abstractive_api(query, 'bart_article')
test_answer