Aim: Extractive text summarization for PubMed article abstracts

Citation: The core of this extractive text summarization is based on Ng Wai Foong's post:
Extractive Text Summarization Using spaCy in Python 
https://medium.com/better-programming/extractive-text-summarization-using-spacy-in-python-88ab96d1fd97

My modifications:
1) Tracking case-sensitive terms and restoring their original case in the summary
2) Replacing "We" with "The authors" and "Our" with "The authors'"
3) Normalizing "Conclusion:" etc. to "In conclusion," and giving additional weight to these sentences
4) Additional text cleaning

In [1]:
#!pip install --user -U spacy

In [2]:
#!python -m spacy download en_core_web_lg

In [3]:
import spacy
import re

In [4]:
nlp = spacy.load('en_core_web_lg')

In [5]:
from collections import Counter
from string import punctuation

In [6]:
# Strip HTML tags
# Citation: https://stackoverflow.com/a/925630
from io import StringIO
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    # Flush any text the parser is still buffering (e.g. a trailing "&" without ";")
    s.close()
    return s.get_data()
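
As a quick illustration of what this step does, the sketch below runs the same HTMLParser-based approach on a made-up abstract fragment. The names TagStripper and strip_html and the sample string are assumptions for the example only; the explicit close() call is there so that any text the parser is still buffering gets flushed.

```python
from html.parser import HTMLParser
from io import StringIO


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, data):
        self.text.write(data)


def strip_html(fragment):
    parser = TagStripper()
    parser.feed(fragment)
    parser.close()  # flush text that the parser may still be buffering
    return parser.text.getvalue()


sample = "<b>Background:</b> <i>Clostridium ramosum</i> is an enteric anaerobe &amp; rarely pathogenic."
print(strip_html(sample))
# Background: Clostridium ramosum is an enteric anaerobe & rarely pathogenic.
```

Tags are dropped and the character reference is decoded, so later steps only see plain text.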

In [7]:
# Clean up line breaks, HTML tags, extra spaces, and first-person plural pronouns
def clean_text(text):
    
    # Citation: lrnq @ github for this space normalization step
    text = ' '.join(text.split())
    
    # Remove line breaks
    # Citation: https://stackoverflow.com/a/16566356
    text = text.replace('\n', ' ').replace('\r', '')
    
    # Strip HTML tags
    text = strip_tags(text)
    
    # Remove extra spaces
    # Citation: https://stackoverflow.com/a/1546244
    text = re.sub(' +', ' ', text)
    
    # Remove spaces before punctuation
    # Citation: https://stackoverflow.com/a/18878970
    text = re.sub(r'\s([?,.!"):;](?:\s|$))', r'\1', text)
    
    # Replace first-person plural pronouns ("We", "Our") with "The authors"
    text = text.replace('We ', 'The authors ')
    text = text.replace(' we ', ' the authors ')
    text = text.replace('Our ', "The authors' ")
    text = text.replace(' our ', " the authors' ")
    
    # Normalize "Conclusion(s)" headings to "In conclusion, "
    text = text.replace('CONCLUSION:', 'In conclusion, ')
    text = text.replace('CONCLUSION ', 'In conclusion, ')
    text = text.replace('CONCLUSIONS:', 'In conclusion, ')
    text = text.replace('CONCLUSIONS ', 'In conclusion, ')
    text = text.replace('Conclusion:', 'In conclusion, ')
    text = text.replace('Conclusion ', 'In conclusion, ')
    text = text.replace('Conclusions:', 'In conclusion, ')
    text = text.replace('Conclusions ', 'In conclusion, ')
    text = re.sub(' +', ' ', text)
    
    return text
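
The chained str.replace calls above only match a pronoun or heading when it is followed by a space (or a colon, for headings), so "We," or "our." would be left alone. As a possible alternative, the sketch below does the same two rewrites with regular expressions and word boundaries. The function name normalize_voice and both patterns are assumptions for illustration, not part of the original notebook.

```python
import re

# "We"/"we" and "Our"/"our" as whole words only
PRONOUNS = {
    "We": "The authors",
    "we": "the authors",
    "Our": "The authors'",
    "our": "the authors'",
}
PRONOUN_RE = re.compile(r"\b(We|we|Our|our)\b")

# "Conclusion(s)" or "CONCLUSION(S)" followed by a colon or whitespace
CONCLUSION_RE = re.compile(r"\b(?:Conclusions?|CONCLUSIONS?)(?::\s*|\s+)")


def normalize_voice(text):
    text = PRONOUN_RE.sub(lambda m: PRONOUNS[m.group(1)], text)
    text = CONCLUSION_RE.sub("In conclusion, ", text)
    return text


print(normalize_voice("CONCLUSIONS: We report, to our knowledge, the first case."))
# In conclusion, The authors report, to the authors' knowledge, the first case.
```

The word boundaries keep words such as "Power" or "weight" untouched, which plain substring replacement cannot guarantee in general.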

In [8]:
# Track case-sensitive tokens (acronyms, mixed-case and hyphenated terms) so their case can be restored later
def track_case(text):
    cased_words = []
    ignore = ["BACKGROUND", "PURPOSE", "METHODS", "RESULTS", "CONCLUSION", "DISCUSSION", "AIMS", "INTRODUCTION", "AND", "ANALYSIS", "DESIGN", "ETHICS", "DISSEMINATION", "TRIAL", "REGISTRATION", "NUMBER", "CONCLUSIONS", "AIMS/INTRODUCTION", "MATERIALS", "STUDY", "TYPE", "POPULATION", "FIELD", "STRENGTH/SEQUENCE", "STRENGTH",  "SEQUENCE", "ASSESSMENT", "STATISTICAL", "TESTS", "DATA", "LEVEL", "OF", "EVIDENCE", "OBJECTIVE", "MAIN", "KEY", "FINDINGS", "SIGNIFICANCE", "TRANSLATIONAL", "PERSPECTIVE"]
    doc = nlp(text)
    for token in doc:
        token = token.text
        if token != token.lower() and re.search('^[A-Z][a-z]+$', token) is None and len(token) > 1 and token not in ignore:
            cased_words.append(token)
    
    text = text.replace('(', ' ')
    text = text.replace(')', ' ')
    text = text.replace(':', ' ')
    text = text.replace(',', ' ')
    text = text.replace(';', ' ')
    
    # Since spaCy splits hyphenated terms, go through the text again with manual (whitespace) tokenization to identify hyphenated tokens
    for token in text.split():
        if "-" in token and token != token.lower() and re.search('^[A-Z][a-z]+$', token) is None:
            cased_words.append(token)
    
    return set(cased_words)    
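
To make the selection rule concrete, the sketch below applies the same test to whitespace-separated tokens, without spaCy. A token is kept when it contains an uppercase letter, is longer than one character, and is not simply a capitalized word such as "Background". The helper name is_case_sensitive and the sample sentence are assumptions for the example, and the heading ignore list is left out for brevity.

```python
import re


def is_case_sensitive(token):
    """True for tokens such as 'CT' or 'rDNA' whose casing must be preserved."""
    has_upper = token != token.lower()
    plain_capitalized = re.fullmatch(r"[A-Z][a-z]+", token) is not None
    return has_upper and not plain_capitalized and len(token) > 1


sample = "On day 2 CT examination and 16S rDNA sequencing identified Streptococcus constellatus"
print([t for t in sample.split() if is_case_sensitive(t)])
# ['CT', '16S', 'rDNA']
```

Ordinary capitalized words ("On", "Streptococcus") are skipped because the summary sentences are re-capitalized separately.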

In [9]:
def change(text, word):
    
    # Citation: lrnq @ github solution
    # Restore the original case of `word` wherever its lowercased form appears
    # as a whole token, optionally wrapped in parentheses and/or followed by
    # one of , . ; :
    lower = word.lower()
    bases = (lower, "(" + lower + ")")
    variants = set(bases) | {b + p for b in bases for p in ",.;:"}
    return ' '.join([x.replace(lower, word) if x in variants else x
                     for x in text.split()])

In [10]:
def fix_case(text, case_list):
    for word in case_list:
        text = change(text, word)
    return text
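
The sketch below shows the round trip that track_case and fix_case perform together: the summary is built from lowercased text, and the tracked terms are then written back with their original casing. It uses a regular expression instead of the token-by-token comparison above, so it is an alternative formulation rather than the notebook's own method; restore_case and the sample inputs are hypothetical.

```python
import re


def restore_case(text, cased_terms):
    """Replace lowercased occurrences of each term with its original casing."""
    for term in cased_terms:
        # The lookarounds stop a match from starting or ending inside a longer
        # word or hyphenated term.
        pattern = r"(?<![\w-])" + re.escape(term.lower()) + r"(?![\w-])"
        # A function replacement avoids backslash handling in the term itself.
        text = re.sub(pattern, lambda m, t=term: t, text)
    return text


lowered = "identified by 16s rdna sequencing (ct), not by mri-guided biopsy."
print(restore_case(lowered, {"16S", "rDNA", "CT", "MRI-guided"}))
# identified by 16S rDNA sequencing (CT), not by MRI-guided biopsy.
```

Because the boundaries are expressed in the pattern, surrounding parentheses and trailing punctuation do not need to be enumerated one by one.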
        

In [11]:
def top_sentence(text, limit):
    '''
    Args:
        text - the input text
        limit - the maximum number of sentences to return
    Returns:
        the highest-scoring sentences joined into one string, with case restored
    '''
    
    text = clean_text(text)
    cased = track_case(text)
    
    keyword = []
    pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
    
    # to lower and tokenize
    doc = nlp(text.lower())
    
    # loop over the tokens
    for token in doc:
        
        # ignore stop words and punctuation
        if token.text in nlp.Defaults.stop_words or token.text in punctuation:
            continue
        # keep words whose part of speech is in pos_tag
        if token.pos_ in pos_tag:
            keyword.append(token.text)
    
    # to normalize the weightage of the keywords
    freq_word = Counter(keyword)
    # nothing to score if no keywords were found (e.g. empty input)
    if not freq_word:
        return ''
    # get the frequency of the most common keyword
    max_freq = freq_word.most_common(1)[0][1]
    # normalize the frequency
    for w in freq_word:
        freq_word[w] = (freq_word[w]/max_freq)
    
    # manually increase the weight of "conclusion" so that sentences starting with "In conclusion," rank higher
    freq_word['conclusion']=10
      
    
    sent_strength = {}
    # loop over each sentence
    for sent in doc.sents:
        
        # loop over each word
        for word in sent:
            # decide if the word is a keyword
            if word.text in freq_word:
                # add the normalized keyword value to the sentence's running score
                sent_strength[sent] = sent_strength.get(sent, 0) + freq_word[word.text]
    
    summary = []
    # sort the sentences by score in descending order
    sorted_x = sorted(sent_strength.items(), key = lambda kv: kv[1], reverse=True)
    
    # keep the `limit` highest-scoring sentences
    for sent, _ in sorted_x[:limit]:
        # capitalize the first letter (the text was lowercased before scoring)
        summary.append(str(sent).capitalize())
    
    result = ' '.join(summary)
    
    # fix the case
    result2 = fix_case(result, cased)

    return result2
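
The scoring logic above can be condensed once tokenization is taken out of the picture. The sketch below is a spaCy-free approximation under stated assumptions: sentences are split on ".", "!" or "?" followed by whitespace (which would wrongly split abbreviations such as "C. ramosum"), every word outside a tiny hand-written stop list counts as a keyword (no part-of-speech filter), and the names score_sentences and STOP_WORDS are made up for the example. It keeps the two ideas from the function above: keyword counts are divided by the largest count, and "conclusion" gets a fixed weight of 10.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "was", "were"}  # toy list


def score_sentences(text, limit):
    # Naive sentence and word splitting (assumption; spaCy does this above)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenized = [re.findall(r"[a-z0-9-]+", s.lower()) for s in sentences]

    keywords = [w for words in tokenized for w in words if w not in STOP_WORDS]
    if not keywords:
        return []

    # Normalize each keyword count by the count of the most frequent keyword
    freq = Counter(keywords)
    max_freq = freq.most_common(1)[0][1]
    weights = {w: c / max_freq for w, c in freq.items()}
    weights["conclusion"] = 10  # same manual boost as above

    # A sentence's score is the sum of the weights of its keywords
    scores = [sum(weights.get(w, 0) for w in words) for words in tokenized]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:limit]]


text = ("Fever was common. Fever and rash were seen in two patients. "
        "In conclusion, rash is rare.")
print(score_sentences(text, 2))
# ['In conclusion, rash is rare.', 'Fever and rash were seen in two patients.']
```

Here the third sentence scores 11.5 (10 for "conclusion", 1 for "rash", 0.5 for "rare") and the second scores 3.5, so the boosted conclusion sentence comes first even though it is short.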
                

In [12]:
example_text = '''Background: Clostridium ramosum is a generally non-pathogenic enteric anaerobe, and Fournier's gangrene is a rare necrotizing soft tissue infection with male predisposition affecting the perineum and the genital area. We report, to our knowledge, the first case of Fournier's gangrene caused by C. ramosum in a female patient with multiple underlying conditions.

Case presentation: A 44-year-old woman with a 6-year history of insulin-dependent diabetes mellitus after total pancreatectomy and an 11-year history of central diabetes insipidus developed a pain in the genital area after a month of urinary catheter use. The lower abdominal pain worsened gradually over 2 weeks, and the pain, general fatigue, and loss of appetite prompted the patient's hospital admission. As she had severe edema in her pelvic and bilateral femoral areas, ceftriaxone was started empirically after collecting two sets of blood cultures. On hospital day 2, CT examination revealed the presence of necrotizing faciitis in the genital and pelvic areas, and the antibiotics were changed to a combination of meropenem, vancomycin, and clindamycin. Gram-positive cocci and gram-positive rods were isolated from blood cultures, which were finally identified as Streptococcus constellatus and C. ramosum using superoxide dismutase and 16S rDNA sequencing. An emergent surgery was performed on hospital day 2 to remove the affected tissue. Despite undergoing debridement and receiving combined antimicrobial chemotherapies, the patient's clinical improvement remained limited. The patient's condition continued to deteriorate, and she eventually died on hospital day 8. In the present case, the underlying diabetes mellitus, urinary incontinence due to central diabetes insipidus, undernutrition, and edema served as the predisposing conditions.

Conclusions: C. ramosum is a potentially opportunistic pathogen among immunosuppressed persons and a rare cause of necrotizing fasciitis.'''

In [13]:
print(top_sentence(example_text, 2))

In conclusion, C. ramosum is a potentially opportunistic pathogen among immunosuppressed persons and a rare cause of necrotizing fasciitis. A 44-year-old woman with a 6-year history of insulin-dependent diabetes mellitus after total pancreatectomy and an 11-year history of central diabetes insipidus developed a pain in the genital area after a month of urinary catheter use.
