# <center>Extractive Text Summarization Using Python and Gensim</center>

__<center>by Tauno Tanilas - 2020</center>__

## 1. Introduction

__Main goals:__
- Create a user interface for the input and output data.
- Summarize the text the user enters in the input field.
- Find the keywords of the entered text and use these keywords to extract matching sentences.
- After removing stopwords and cleaning the text, find the top 10 words and display them in a bar chart.

__Implemented algorithms:__
- TextRank

__Main technological components:__
- Anaconda Jupyter Notebook - for editing and running the notebook project.
- MyBinder - for opening notebooks in an executable environment, making the code reproducible by anyone.
- Python 3.7.3 - the programming language.
- Gensim 3.8.3 - an open-source library for unsupervised topic modeling and natural language processing, using statistical machine learning.
- Ipywidgets 7.5.1 - interactive user interface controls for exploring code and data.

## 2. The Importance of Automatic Text Summarization

In the book ["Automatic Text Summarization"](https://onlinelibrary.wiley.com/doi/book/10.1002/9781119004752?signUpSource=www.google.com/), data scientist Juan Manuel Torres Moreno gives six reasons why we need automatic text summarization tools:

- Summaries reduce reading time.
- When researching documents, summaries make the selection process easier.
- Automatic summarization improves the effectiveness of indexing.
- Automatic summarization algorithms are less biased than human summarizers.
- Personalized summaries are useful in question-answering systems as they provide personalized information.
- Using automatic or semi-automatic summarization systems enables commercial abstract services to increase the number of texts they are able to process.

## 3. What is Automatic Text Summarization?

<img src="images/text_summarization.jpg" width="500" align="left">

Summarization is the task of shortening a piece of text while preserving its key information and the meaning of its content. Compared to manual summarization, automating the task saves time and effort. It is therefore gaining popularity and is often used in academic research.

Text summarization can be applied in many NLP-related tasks such as text classification, question answering, legal text summarization, news summarization, and headline generation. In the book ["Advances in Automatic Text Summarization"](https://www.amazon.com/Advances-Automatic-Text-Summarization-Press/dp/0262133598/ref=as_li_ss_tl?ie=UTF8&qid=1503872626&sr=8-1&keywords=text+summarization&linkCode=sl1&tag=inspiredalgor-20&linkId=75d9f8d62261d17bdddf5c5c0f43881a), the authors provide a list of everyday examples of text summarization:

- headlines (from around the world)
- outlines (notes for students)
- minutes (of a meeting)
- previews (of movies)
- synopses (soap opera listings)
- reviews (of a book, CD, movie, etc.)
- digests (TV guide)
- biography (resumes, obituaries)
- abridgments (Shakespeare for children)
- bulletins (weather forecasts/stock market reports)
- sound bites (politicians on a current issue)
- histories (chronologies of salient events)

## 4. Types of Automatic Text Summarization

Depending on the use case and the type of documents, text summarization can be divided into two main approaches:

### 4.1. Extractive Methods

Extractive methods select phrases and sentences from the source document to make up the new summary. They use an unsupervised technique to measure the similarity between sentences and rank them. The benefit is that there is no need to train and build a model before using them.

#### Input document => Sentences similarity => Weight sentences => Select sentences with higher rank
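
The pipeline above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the method Gensim uses; the bag-of-words cosine similarity, the naive sentence splitter, and the names `split_sentences`, `cosine_similarity` and `extractive_summary` are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two bag-of-words Counters."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extractive_summary(text, n_sentences=2):
    sentences = split_sentences(text)
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    # Weight each sentence by its total similarity to all other sentences.
    weights = [
        sum(cosine_similarity(bags[i], bags[j]) for j in range(len(bags)) if j != i)
        for i in range(len(bags))
    ]
    # Select the highest-weighted sentences, then restore document order.
    top = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


sample = (
    "Cats chase mice. Cats chase birds and mice. "
    "The stock market fell sharply today. Birds fear cats."
)
print(extractive_summary(sample, n_sentences=2))
# -> ['Cats chase mice.', 'Cats chase birds and mice.']
```

The unrelated sentence about the stock market shares no words with the others, so its weight is zero and it is never selected.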

### 4.2. Abstractive Methods

Abstractive methods generate entirely new phrases and sentences to capture the meaning of the source document. Extractive methods often give better results than the abstractive approach, because abstractive methods have to solve harder problems such as semantic representation, inference, and natural language generation, whereas sentence extraction is purely data-driven.

#### Input document => Understand context => Semantics => Create own summary

## 5. The Working Principles of the TextRank Algorithm

<img src="images/textrank.png" width="500" align="left">

1. Combine all the input articles into a single text.
2. Split the text into individual sentences.
3. Find a vector representation (word vector) for each sentence.
4. Calculate the similarity between the sentence vectors and store the values in a matrix.
5. Transform the similarity matrix into a graph with sentences as nodes and similarity scores as edge weights, then run PageRank on the graph to score each sentence.
6. Choose a certain number of the highest-ranked sentences to form the final summary.
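
The steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Gensim's implementation: sentences are represented as word lists instead of word vectors, similarity is the shared-word count normalised by the logarithms of the sentence lengths, and the function names, the damping factor of 0.85 and the iteration count are choices made for this sketch.

```python
import math
import re


def split_sentences(text):
    """Step 2: naive split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_similarity(words_a, words_b):
    """Shared-word count normalised by the log of both sentence lengths."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids dividing by log(1) + log(1) == 0
    shared = len(set(words_a) & set(words_b))
    return shared / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(text, n_sentences=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []
    # Step 3 (simplified): represent each sentence by its lowercase words.
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
    # Step 4: similarity matrix with a zero diagonal.
    sim = [
        [overlap_similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Step 5: treat the matrix as a weighted graph and run PageRank
    # by power iteration.
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            for i in range(n)
        ]
    # Step 6: keep the top-ranked sentences in their original order.
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


sample = (
    "Cats chase mice. Cats chase birds and mice. "
    "The stock market fell sharply today. Birds fear cats."
)
print(textrank(sample, n_sentences=2))
# -> ['Cats chase mice.', 'Cats chase birds and mice.']
```

A sentence with no edges only ever receives the baseline score `(1 - damping) / n`, which is why isolated sentences end up at the bottom of the ranking.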

## 6. The Implementation of the TextRank Algorithm

To better understand how the implementation of the TextRank algorithm works, run the following cells. To do so, select a cell and choose the 'Run' command in the menu above. The implementation works best for English text.

Install missing libraries.

In [2]:
#!pip install "gensim<4.0"  # gensim.summarization was removed in Gensim 4.0
#!pip install texthero
#!pip install regex
#!pip install spacy
#!pip install beautifulsoup4

Import required libraries.

In [None]:
# data downloading
import urllib.request as ur
import bs4 as bs

# data preprocessing
import texthero as hero
from texthero import stopwords
import pandas as pd
import regex as rx

# modeling
import gensim
from gensim.summarization import keywords, summarize
import spacy as sp

# data visualization
import matplotlib.pyplot as plt
import ipywidgets as widgets

Define helper functions.

In [45]:
def scrape_data(url):
    """
    Take an input URL and return its article text.
    
    :param url: String to specify web address.
    :return: String of article text.
    """
    # Download the raw HTML; the context manager closes the connection afterwards.
    with ur.urlopen(url) as response:
        article = response.read()
    
    # Parse the HTML and collect every <p> element.
    parsed_article = bs.BeautifulSoup(article, 'html.parser')
    paragraphs = parsed_article.find_all('p')
    
    # Join with a space so sentences from adjacent paragraphs do not run together.
    return " ".join(p.text for p in paragraphs)
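
To see what paragraph extraction does without a network connection, here is a small offline sketch. It swaps BeautifulSoup for the standard library's `html.parser` module, so it is an illustration of the idea rather than the notebook's method; the `ParagraphExtractor` class and the sample HTML string are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.current = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.current = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            self.paragraphs.append("".join(self.current).strip())

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph.
        if self.in_paragraph:
            self.current.append(data)


sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>paragraph</b>.</p><div>skip</div>"
    "<p>Second paragraph.</p></body></html>"
)
parser = ParagraphExtractor()
parser.feed(sample_html)
print(" ".join(parser.paragraphs))
# -> First paragraph. Second paragraph.
```

Only text inside `<p>` elements survives, so headings, navigation and other page furniture are left out of the text passed to the summarizer.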

In [46]:
def remove_brackets(input_txt):
    """
    Take input text and remove bracketed reference markers such as [1] or [a].
    
    :param input_txt: String to specify input text.
    :return: String of output text.
    """
    # Raw string avoids invalid-escape warnings; the closing bracket is escaped explicitly.
    output_txt = rx.sub(r"\[[0-9a-zA-Z]+\]", "", input_txt)
    return output_txt
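
The effect of this pattern can be checked in isolation. The sketch below uses the standard `re` module instead of `regex`, and the function name `strip_bracket_markers` and the sample sentence are made up for illustration.

```python
import re


def strip_bracket_markers(text):
    """Remove markers made of letters or digits in square brackets, e.g. [12] or [b]."""
    return re.sub(r"\[[0-9a-zA-Z]+\]", "", text)


sample = (
    "The algorithm ranks sentences.[12] It needs no training data.[b] "
    "Tags like [citation needed] stay."
)
cleaned = strip_bracket_markers(sample)
print(cleaned)
# -> The algorithm ranks sentences. It needs no training data. Tags like [citation needed] stay.
```

Because the character class contains no space, bracketed text with several words is left untouched.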

In [47]:
def create_text_example(b):
    """
    Fill the input Textarea object with the example text.
    
    :param b: ipywidgets.widgets.widget_button.Button class to specify input event.
    """
    input_w_text.value = example_text

In [48]:
def delete_text(b):
    """
    Take input Textarea object and delete inserted text from it.
    
    :param b: ipywidgets.widgets.widget_button.Button class to specify input event.
    """
    input_w_text.value = ""
    summary_w_output_text.value = ""

In [49]:
def create_summary(b):
    """
    Take inserted text and create a summary from it.
    
    :param b: ipywidgets.widgets.widget_button.Button class to specify input event.
    """
    if (len(input_w_text.value) == 0) or (input_w_text.value == default_text):
        summary_w_output_text.value = "Your text of article is missing or is too short!"
        summary_w_info_label.value = ""
    else:
        # Use the gensim library to summarize the text; ratio is the fraction of the text to keep.
        summary_w_output_text.value = summarize(input_w_text.value, ratio=summary_w_slider.value / 100)
        summary_w_info_label.value = "Text reduced to {}% ({} words of {})".format(
            summary_w_slider.value,
            len(summary_w_output_text.value.split()),
            len(input_w_text.value.split()))

In [50]:
def keyword_suggestions(b):
    """
    Take inserted text and create keyword suggestions from it.
    
    :param b: ipywidgets.widgets.widget_button.Button class to specify input event.
    """
    if (len(input_w_text.value) == 0) or (input_w_text.value == default_text):
        keywords_w_output_text.value = "Your text of article is missing or is too short!"
    else: 
        keywords_w_input_text.value = "Searching keywords..."
        keywords_w_output_text.value = ""
        # Use the gensim library to extract up to 10 lemmatized keywords as a list.
        kw_suggestions = keywords(input_w_text.value, words=10, scores=False, split=True, lemmatize=True)
        # Convert the list to a comma-separated string and title-case each word.
        kw_suggestions = (', '.join(map(str, kw_suggestions))).title()
        keywords_w_input_text.value = kw_suggestions

In [51]:
def extract_keyword_sentences(b):
    """
    Take inserted keywords and extract sentences containing these keywords.
    
    :param b: ipywidgets.widgets.widget_button.Button class to specify input event.
    """
    if len(keywords_w_input_text.value) == 0:
        keywords_w_output_text.value = "Keywords are missing!"
    elif (len(input_w_text.value) == 0) or (input_w_text.value == default_text):
        keywords_w_output_text.value = "Your text of article is missing or is too short!"
    else:
        keywords_w_output_text.value = "Extracting sentences..."
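        # Split on commas and trim each keyword; multi-word keywords keep their inner spaces,
        # and empty entries are dropped because an empty keyword would match every sentence.
        user_keywords = [kw.strip().lower() for kw in keywords_w_input_text.value.split(",") if kw.strip()]
        input_sentences = input_w_text.value.split(".")
        extracted_sentences = []

        for kw_value in user_keywords:
            for st_value in input_sentences:
                if kw_value in st_value.lower():
                    sentence = remove_brackets(st_value).replace("\n", "").strip() + "."
                    # Skip sentences already matched by an earlier keyword.
                    if sentence not in extracted_sentences:
                        extracted_sentences.append(sentence)

        keywords_w_output_text.value = "\n\n".join(extracted_sentences)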
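
Keyword-based sentence extraction can also be tried outside the widget code. The following is a minimal sketch of one possible approach; the function name `sentences_with_keywords` and the sample text are hypothetical, and sentences are split naively on full stops as in the notebook.

```python
def sentences_with_keywords(text, keyword_string):
    """Return sentences that contain at least one of the comma-separated keywords."""
    # Trim each keyword and drop empty entries such as a trailing comma.
    keywords = [kw.strip().lower() for kw in keyword_string.split(",") if kw.strip()]
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    matches = []
    for kw in keywords:
        for sentence in sentences:
            # Case-insensitive substring match; each sentence is kept only once.
            if kw in sentence.lower() and sentence not in matches:
                matches.append(sentence)
    return [m + "." for m in matches]


sample = "TextRank builds a graph. Each node is a sentence. Edges carry similarity scores."
print(sentences_with_keywords(sample, "graph, Similarity Scores, "))
# -> ['TextRank builds a graph.', 'Edges carry similarity scores.']
```

Results are grouped by keyword, in the order the keywords were entered, rather than by position in the text.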

In [66]:
def create_top_words(b):
    """
    Take inserted text and show top words statistics.
    
    :param b: ipywidgets.widgets.widget_button.Button class to specify input event.
    """
    topwords_w_image.value = open("images/default_top_words.png", "rb").read()
    
    if (len(input_w_text.value) == 0) or (input_w_text.value == default_text):
        topwords_w_info_label.value = "Your text of article is missing or is too short!"
        
    else:
        topwords_w_info_label.value = "Starting to generate the Top Words..."
        
        # Split text to sentences and create a dataframe containing sentences and cleaned sentences columns.
        sentences_list = input_w_text.value.split(".")
        sentences_df = pd.DataFrame(sentences_list, columns = ["sentence"])
        sentences_df["clean_sentence"] = hero.clean(sentences_df["sentence"])
        
        # Get default stopwords.
        default_stopwords = stopwords.DEFAULT
        # Add a list of Estonian stopwords to the default stopwords.
        with open('stopwords/estonian-stopwords-lemmas.txt', encoding='utf-8') as stopwords_file:
            estonian_stopwords = set(stopwords_file.read().split())
        custom_stopwords = default_stopwords.union(estonian_stopwords)
        sentences_df['clean_sentence'] = hero.remove_stopwords(sentences_df['clean_sentence'], custom_stopwords)
        
        # Generate top words.
        tw_vis = hero.visualization.top_words(sentences_df["clean_sentence"]).head(10)
        
        # Create dataframe for chart annotation.
        tw_vis_list = tw_vis.tolist()
        tw_vis_df = tw_vis.to_frame()
        tw_vis_df.rename(columns={'clean_sentence':'kw_value'}, inplace=True)
        tw_vis_df.index.name = 'kw_name'
        tw_vis_df.reset_index(drop=False, inplace=True)
        
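        # Use the actual number of words so short texts with fewer than 10 words still plot.
        bar_x = list(range(1, len(tw_vis_list) + 1))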
        bar_height = tw_vis_list
        bar_tick_label = tw_vis_df['kw_name'].tolist()
        bar_label = tw_vis_list
        
        fig, ax = plt.subplots(figsize=(20, 7))
        bar_plot = plt.bar(bar_x, bar_height, tick_label=bar_tick_label)
        annotate_labels(bar_plot, ax, bar_label)
        
        plt.xlabel('Words')
        plt.ylabel('Value')
        plt.xticks(rotation=0, ha='center')
        plt.savefig("images/top_words.png", bbox_inches='tight')
        plt.close(fig)
        
        # display top words
        topwords_w_image.value = open("images/top_words.png", "rb").read()
        
        topwords_w_info_label.value = ""

In [64]:
def annotate_labels(b_plots, ax, bar_label):
    """
    Write each bar's value as a text label centered inside the bar.
    
    :param b_plots: matplotlib.container.BarContainer class.
    :param ax: matplotlib.axes._subplots.AxesSubplot class.
    :param bar_label: list to contain bars labels values.
    """
    for idx, b_plot in enumerate(b_plots):
        height = b_plot.get_height()
        ax.text(b_plot.get_x() + b_plot.get_width()/2., 0.5*height, bar_label[idx], ha='center', va='center')
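
The counting behind the top-words chart can be sketched with the standard library alone. This stands in for the texthero calls only as an illustration; the function name `top_words`, the tiny stopword set, and the sample text are assumptions.

```python
import re
from collections import Counter


def top_words(text, stopwords, n=10):
    """Return the n most frequent words that are not stopwords, with their counts."""
    # Lowercase the text and keep runs of letters only (digits and punctuation are dropped).
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)


sample_stopwords = {"the", "a", "of", "and", "is"}
sample = "The graph of sentences is a graph. The rank of a sentence is the score."
result = top_words(sample, sample_stopwords, n=3)
print(result[0])
# -> ('graph', 2)
```

The words become the tick labels of the bar chart and the counts become the bar heights.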

Download example summarization text.

In [54]:
default_text = "Copy and paste the text of your article here."
example_text = scrape_data('https://en.wikipedia.org/wiki/Bill_Gates')
example_text = example_text.strip()

Create user interface widgets.

In [55]:
# Create input widgets
input_w_text = widgets.Textarea(value=default_text, layout={'height': '300px', 'width': '100%'})

input_w_example_button = widgets.Button(description='Text example',
                                        button_style='info',
                                        tooltip='Text example')

input_w_delete_button = widgets.Button(description='Delete text',
                                       button_style='info',
                                       tooltip='Delete text')

In [56]:
# Create summary widgets
summary_w_slider_label = widgets.HTML(value="Choose the size of summary (%):")
summary_w_slider = widgets.FloatSlider(min=1.0, max=100.0, step=1.0, value=50.0)
summary_w_submit_button = widgets.Button(description='Create Summary', button_style='info', tooltip='Create Summary')
summary_w_info_label = widgets.HTML(value="")
summary_w_output_text = widgets.Textarea(value="", layout={'height': '300px', 'width': '100%'})
summary_w_hbox = widgets.HBox([summary_w_slider_label, summary_w_slider])

summary_w_gbox = widgets.GridBox(children=[summary_w_hbox, 
                                           summary_w_submit_button, 
                                           summary_w_info_label, 
                                           summary_w_output_text],
                                 layout=widgets.Layout(grid_template_columns='99%'))

In [57]:
# Create keywords widgets
keywords_w_header_label = widgets.Label(value="Enter the keywords you want to keep in your text separated by commas.")
keywords_w_sug_button = widgets.Button(description="Keyword suggestions", button_style='info', tooltip='Keyword suggestions')
keywords_w_submit_button = widgets.Button(description='Extract Sentences', button_style='info', tooltip='Extract Sentences')
keywords_w_hbox = widgets.HBox([keywords_w_sug_button, keywords_w_submit_button])
keywords_w_input_text = widgets.Text(value="", layout={'height': '30px', 'width': '100%'})
keywords_w_output_text = widgets.Textarea(value="", layout={'height': '300px', 'width': '100%'})

keywords_w_gbox = widgets.GridBox(children=[keywords_w_header_label,
                                            keywords_w_hbox,
                                            keywords_w_input_text, 
                                            keywords_w_output_text],
                                  layout=widgets.Layout(grid_template_columns='99%'))

In [58]:
# Create top words widgets
topwords_w_submit_button = widgets.Button(description='Create Top Words', button_style='info', tooltip='Create Top Words')
topwords_w_info_label = widgets.HTML(value="")
topwords_w_image = widgets.Image(value=open("images/default_top_words.png", "rb").read())

topwords_w_gbox = widgets.GridBox(children=[topwords_w_submit_button,
                                             topwords_w_info_label,
                                             topwords_w_image],
                                   layout=widgets.Layout(grid_template_columns='99%'))

Create CSS to set font style for Textarea widgets.

In [59]:
%%html
<style>
textarea, input {
    font-family: monospace;
    font-size: 20px;
}
</style>

Declare widget events to handle user commands.

In [60]:
# Handle input widgets events
input_w_example_button.on_click(create_text_example)
input_w_delete_button.on_click(delete_text)

# Handle summary widgets events
summary_w_submit_button.on_click(create_summary)

# Handle keyword widgets events
keywords_w_sug_button.on_click(keyword_suggestions)
keywords_w_submit_button.on_click(extract_keyword_sentences)

# Handle top words widgets events
topwords_w_submit_button.on_click(create_top_words)

Display input widgets.

In [61]:
input_w_hbox = widgets.HBox([input_w_example_button, input_w_delete_button])
widgets.GridBox(children=[input_w_hbox, input_w_text], layout=widgets.Layout(grid_template_columns='99%'))

Display output widgets.

In [62]:
tab_names = ['Summary', 'Keywords', 'Top Words']
tab = widgets.Tab()
tab.children = [summary_w_gbox, keywords_w_gbox, topwords_w_gbox]

for i, name in enumerate(tab_names):
    tab.set_title(i, name)
tab

## 7. Proposals for Further Developments

Automatic text summarization is a broad topic, and this article covers only a small part of it. Many tasks could be explored in further developments. Here are some ideas:

- Multiple-domain text summarization
- Cross-language text summarization
- Text summarization using other approaches such as RNNs, LSTMs and reinforcement learning
- Abstractive summarization, where deep learning plays a big role