
# Automatically creating abstracts for scientific articles 

This notebook was designed with the COVID-19 Open Research Dataset Challenge (CORD-19) in mind. CORD-19 is a resource of over 138,000 scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses. The dataset is freely available and has been used by the scientific community to gain a better understanding of a number of topics related to COVID-19.

An important point is that a large share of the articles analysed here do not have an abstract (the articles without one number about 29% of the 107,032 that have one), even though abstracts are critical for literature reviews, meta-analyses and synthesis reviews. This notebook addresses this gap by presenting a model that automatically generates good-quality abstracts.

At the end of this notebook I compare an actual abstract from an article with those generated by two different models.



In [None]:
# Import common libraries and connect with Kaggle environment

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Import dataset directly from Kaggle

import glob
import json
import pandas as pd
from tqdm import tqdm
root_path = '/kaggle/input/CORD-19-research-challenge/'
all_json = glob.glob(f'{root_path}/**/*.json', recursive=True)
len(all_json)

In [None]:
# Create a dataframe 


metadata_path = f'{root_path}/metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
meta_df.head()

In [None]:
# Dropped columns with mixed dtype as they were not needed

meta_df.drop(columns=['who_covidence_id', 'arxiv_id'], inplace=True)
meta_df.head()

In [None]:
# Check the number of articles without an abstract

meta_df.abstract.isnull().sum()

In [None]:
# Count the articles that have an abstract (value_counts() skips NaN, so this is not the total number of rows)

meta_df.abstract.value_counts().sum()

In [None]:
# Ratio of articles without an abstract to articles with one (as a fraction, not a percentage of all rows)

meta_df.abstract.isnull().sum()/ meta_df.abstract.value_counts().sum()
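
The ratio above divides by the number of articles that have an abstract, because `value_counts()` ignores missing values. A minimal sketch of how that ratio differs from the share of all articles, using a small made-up list rather than the CORD-19 metadata (the list and the function name `missing_stats` are hypothetical):

```python
def missing_stats(abstracts):
    """Return (share_of_all, ratio_to_present) for a list where None marks a missing abstract."""
    n_missing = sum(a is None for a in abstracts)
    n_present = len(abstracts) - n_missing
    share_of_all = n_missing / len(abstracts)   # missing / all articles
    ratio_to_present = n_missing / n_present    # missing / articles with an abstract
    return share_of_all, ratio_to_present

# Made-up example: 2 of 10 articles lack an abstract
toy_abstracts = ["text"] * 8 + [None] * 2
share, ratio = missing_stats(toy_abstracts)
print(f"share of all articles: {share:.2f}")   # 0.20
print(f"ratio missing/present: {ratio:.2f}")   # 0.25
```

With pandas, the share of all rows could be obtained with `meta_df.abstract.isnull().mean()`.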

In [None]:
# Select only those without an abstract

abstracts = meta_df[meta_df.abstract.isnull()]

In [None]:
pd.options.display.max_colwidth = 2000
abstracts.head()

In [None]:
# Install transformers (sentence-transformers pulls it in as a dependency)

#!pip install -U sentence-transformers



In [None]:
# Import libraries
 

from transformers import pipeline
import requests
import pprint
import time
pp = pprint.PrettyPrinter(indent=14)

In [None]:
# Create a summarizer with the HuggingFace pipeline API, which keeps the implementation very simple
# (https://huggingface.co/transformers/main_classes/pipelines.html#summarizationpipeline)

summarizer = pipeline("summarization")

In [None]:
# Here I used a text version of the json file:  0001418189999fea7f7cbe3e82703d71c85a6fe5.json

with open('/content/Absence of surface expression of feline infectious peritonitis virus (FIPV).txt', 'r') as f:
    feline = f.read()
feline 
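
The text file above is a plain-text version of one of the dataset's JSON files. As a sketch, the same text could be assembled from the JSON directly, assuming the CORD-19 parse layout in which `body_text` is a list of paragraph objects that each carry a `text` field. The helper name `body_text_from_paper` and the sample dictionary are hypothetical.

```python
import json

def body_text_from_paper(paper):
    """Join the paragraphs under 'body_text' into one string (assumed CORD-19 JSON layout)."""
    paragraphs = paper.get("body_text", [])
    return "\n".join(p["text"] for p in paragraphs if p.get("text"))

# Hypothetical miniature paper standing in for a real file
sample = json.loads(
    '{"paper_id": "demo", "body_text": '
    '[{"text": "First paragraph."}, {"text": "Second paragraph."}]}'
)
print(body_text_from_paper(sample))
```

For a real article, the sample would be replaced by the result of `json.load` on one of the paths collected in `all_json`.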

In [None]:
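# Generate abstracts

# The T5 summarizer must exist before it is called; t5-base is the model named in the results
summarizer_t5 = pipeline("summarization", model="t5-base", tokenizer="t5-base")

# Note: the slices below cut the text by characters, not by tokens
abs_bart = summarizer(feline[:1022], min_length=5, pad_to_max_length=True)
abs_t5 = summarizer_t5(feline[:511], min_length=5, max_length=40)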

# Results

Let's have a look at the abstract written by the authors and those generated by the models. 


**Actual Abstract**

Feline infectious peritonitis virus (FIPV) positive cells are present in pyogranulomas and exudates from cats with FIP. These cells belong mainly to the monocyte/macrophage lineage. How these cells survive in immune cats is not known. In this study, FIPV positive cells were isolated from pyogranulomas and exudates of 12 naturally FIPV-infected cats and the presence of two immunologic targets, viral antigens and MHC I, on their surface was determined. The majority of the infected cells were confirmed to be cells from the monocyte/macrophage lineage. No surface expression of viral antigens was detected on FIPV positive cells. MHC I molecules were present on all the FIPV positive cells. After cultivation of the isolated infected cells, 52 ± 10% of the infected cells re-expressed viral antigens on the plasma membrane.

In conclusion, it can be stated that in FIP cats, FIPV replicates in cells of the monocyte/macrophage lineage without carrying viral antigens in their plasma membrane, which could allow them to escape from antibody-dependent cell lysis.



**Abstract generated by HuggingFace pipeline (Bart)**

Feline infectious peritonitis (FIP) is a fatal chronic disease in cats. Two forms can be distinguished. Cats suffering from the wet or effusive form have exudates in their body cavities. Exudate is absent in the second form, hence the name dry or non-effusive form.



**Abstract generated by HuggingFace pipeline (t5-base)**

feline infectious peritonitis virus (FIPV) causes fatal chronic disease in cats . characterized by granulomatous lesions formed at serosae of


The abstract generated by Bart seems to synthesize a complex article very well, and it is more concise than the actual abstract. What is even more impressive is that only the beginning of the article was used: the model accepts at most 1024 tokens, and the code above passes just the first 1022 characters of the text.
The abstracts were generated with different lengths, and setting *pad_to_max_length=True* returned the best result.

The second model did not return a good abstract. This is probably because it received even less input (the first 511 characters, for a model limited to 512 tokens), its output was capped at max_length=40, and I used a small model (t5-base) because larger models required a lot of computational power. So it would not be difficult to get even better results. But this simple implementation already provided a very good abstract and indicates how the dataset can be further improved.
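
One way around the input limit discussed above is to split the article into pieces, summarize each piece, and join the results. The following is only a sketch under assumptions: chunks are measured in whitespace-separated words as a rough stand-in for tokens, `chunk_words` and `summarize_long` are hypothetical helpers, and `summarize` is any function that maps a string to a shorter string. This approach was not tried in this notebook.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_long(text, summarize, max_words=400):
    """Summarize each chunk with the supplied callable and join the partial summaries."""
    return " ".join(summarize(chunk) for chunk in chunk_words(text, max_words))

# Stand-in summarizer for illustration only: keeps the first three words of each chunk
def first_three_words(chunk):
    return " ".join(chunk.split()[:3])

demo_text = " ".join(f"w{i}" for i in range(10))
print(chunk_words(demo_text, max_words=4))
print(summarize_long(demo_text, first_three_words, max_words=4))
```

With the pipeline, `summarize` could be something like `lambda chunk: summarizer(chunk)[0]["summary_text"]`. The default of 400 words is an arbitrary choice meant to stay below the 1024-token limit.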


The following were helpful in the development of this notebook:

https://colab.research.google.com/github/theamrzaki/COVID-19-BERT-ResearchPapers-Semantic-Search/blob/master/COVID_19_BERT_ResearchPapers_Semantic_Search.ipynb#scrollTo=gtDN1zQ8f63B

https://colab.research.google.com/drive/1iAIFX1QQiFm1F01vMmnAgFh4oH1H-K8W#scrollTo=NWzDuuEmICBM

https://huggingface.co/transformers/main_classes/pipelines.html#summarizationpipeline
