# <center>Extractive Text Summarization Using Python and Gensim</center>

__<center>by Tauno Tanilas - 2020</center>__

## 1. Introduction

This is the Capstone Project for the [Advanced Data Science with IBM Specialization Certificate](https://www.coursera.org/specializations/advanced-data-science-ibm?fbclid=IwAR3p2ZthalkKNP1kcyK3VfxRBKyrRnOWeowrrus5UQt1IL3O8tB7jRWkc4E) course. The use case of this project is to create short summaries of long articles automatically. The task is approached as Extractive Text Summarization, where the summary is assembled from the most important sentences of the original text. It is accomplished with the TextRank algorithm as implemented in the Gensim library.

__Importance of Text Summarization:__
- 1
- 2
- 3

__Use case:__
- Automate the creation of summaries, keyword suggestions and top words statistics directly from the text of an article.

__Main goals:__
- Download an example article to be used as the summarization text.
- Create an extractive summary of a user-selected size with the TextRank algorithm.
- Suggest keywords for the text and extract the sentences that contain the chosen keywords.
- Clean the text and visualize its most frequent words.
- Combine these functions into an interactive user interface inside the notebook.

__Implemented algorithms:__
- TextRank

__Main technological components:__
- Anaconda Jupyter Notebook for editing and running the notebook project.
- Python 3.7.3 as a programming language.
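- Gensim (the `gensim.summarization` module, available in Gensim versions below 4.0) as the library for TextRank-based summarization and keyword extraction.
- Texthero for text cleaning and top words statistics.
- ipywidgets for the interactive user interface.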
- Laptop Dell Inspiron 3541: AMD 1.80 GHz, RAM 8.00 GB, AMD Radeon(TM) R4 Graphics.

The summarizer works for English text.

<img src="https://st2.ning.com/topology/rest/1.0/file/get/502818197?profile=original" width="500" align="left">

## 2. Types of Text Summarization

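Text summarization approaches are usually divided into two types:
- Extractive summarization selects the most important sentences of the original text and returns them unchanged.
- Abstractive summarization generates new sentences that paraphrase the original text.

This project uses Extractive Text Summarization.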

## 3. TextRank algorithm working principles
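
The core idea of TextRank can be illustrated with a small pure-Python sketch. It treats every sentence as a node of a graph, weights the edges by word overlap between sentences, ranks the nodes with a PageRank-style iteration, and keeps the top-ranked sentences in their original order. This is only an illustration under simplifying assumptions: the helper names (`split_sentences`, `similarity`, `textrank_summary`), the naive sentence splitter and the parameter defaults are hypothetical, and Gensim's `summarize` uses its own preprocessing and similarity function, so its output will differ.

```python
import re
from itertools import combinations
from math import log


def split_sentences(text):
    """Naively split text on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a, words_b):
    """Word overlap of two sentences, normalised by their lengths."""
    denominator = log(len(words_a) or 1) + log(len(words_b) or 1)
    if denominator == 0:
        return 0.0
    return len(words_a & words_b) / denominator


def textrank_summary(text, ratio=0.5, damping=0.85, iterations=50):
    """Return roughly `ratio` of the sentences, chosen by a TextRank-style score."""
    sentences = split_sentences(text)
    tokens = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)

    # Build a symmetric weighted graph: one node per sentence.
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = similarity(tokens[i], tokens[j])

    # Weighted PageRank by simple power iteration.
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]

    # Keep the best-scoring sentences and restore the original order.
    keep = max(1, round(n * ratio))
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(best))


text = ("Cats chase mice in the garden. Mice hide from cats in the garden. "
        "Cats and mice share the garden at night. Stock prices fell sharply today.")
print(textrank_summary(text, ratio=0.5))
```

The last sentence shares no words with the others, so it receives the lowest score and is left out of the two-sentence summary.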

### 3.1. Applying TextRank Algorithm

<img src="https://st2.ning.com/topology/rest/1.0/file/get/502818197?profile=original" width="500" align="left">

## 4. Data Acquiring

Install missing libraries.

In [None]:
#!pip install "gensim<4.0"  # gensim.summarization was removed in Gensim 4.0
#!pip install texthero

Import required libraries.

In [1]:
# data downloading
import urllib.request as ur
import bs4 as bs

# data preprocessing
import texthero as hero
from texthero import stopwords
import pandas as pd
import re

# modeling
import gensim
from gensim.summarization import keywords, summarize

# data visualization
import matplotlib.pyplot as plt
import ipywidgets as widgets

Define helper functions.

In [2]:
def scrape_data(url):
    """
    Take input url and return its article text.
    
    :param url: String to specify web address.
    :return: String of article text.
    """
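    # Download the raw HTML; the context manager closes the connection.
    with ur.urlopen(url) as scraped_data:
        article = scraped_data.read()
    
    parsed_article = bs.BeautifulSoup(article, 'lxml')
    paragraphs = parsed_article.find_all('p')
    
    # Join paragraphs with a newline so that sentences at paragraph
    # boundaries do not run together.
    article_text = "\n".join(p.text for p in paragraphs)
    return article_text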

In [3]:
def remove_brackets(input_txt):
    """
    Take input text and remove bracketed reference markers such as [1] or [a].
    
    :param input_txt: String to specify input text.
    :return: String of output text.
    """
    # Raw string avoids invalid escape sequence warnings for "\[".
    output_txt = re.sub(r"\[[0-9a-zA-Z]+\]", "", input_txt)
    return output_txt
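
A quick check of what the pattern in `remove_brackets` does and does not remove can be useful. The sketch below repeats the function so that it runs on its own; the sample sentence is made up for illustration. Markers consisting only of letters and digits are removed, while markers that contain spaces stay in place.

```python
import re


def remove_brackets(input_txt):
    """Remove bracketed markers made of letters and digits only, e.g. [1] or [a]."""
    return re.sub(r"\[[0-9a-zA-Z]+\]", "", input_txt)


sample = "Gates was born in Seattle.[1][a] He co-founded Microsoft[citation needed] in 1975."
cleaned = remove_brackets(sample)
print(cleaned)
# Gates was born in Seattle. He co-founded Microsoft[citation needed] in 1975.
```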

In [4]:
def create_text_example(b):
    """
    Fill the input Textarea object with the example text.
    
    :param b: ipywidgets.widgets.widget_button.Button class to specify input event.
    """
    input_w_text.value = example_text

In [5]:
def delete_text(b):
    """
    Take input Textarea object and delete inserted text from it.
    
    :param b: ipywidgets.widgets.widget_button.Button class to specify input event.
    """
    input_w_text.value = ""
    summary_w_output_text.value = ""

In [6]:
def create_summary(b):
    """
    Take inserted text and create a summary from it.
    
    :param b: ipywidgets.widgets.widget_button.Button class to specify input event.
    """
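    if (len(input_w_text.value) == 0) or (input_w_text.value == default_text):
        summary_w_output_text.value = "Your text of article is missing or is too short!"
        summary_w_info_label.value = ""
    else:
        try:
            # Use gensim library to summarize text; ratio is the fraction of sentences to keep.
            summary = summarize(input_w_text.value, ratio=summary_w_slider.value / 100)
        except ValueError:
            # gensim raises ValueError when the input has fewer than two sentences.
            summary_w_output_text.value = "Your text of article is missing or is too short!"
            summary_w_info_label.value = ""
            return
        summary_w_output_text.value = summary
        summary_w_info_label.value = "Text reduced to {:.0f}% ({} words of {})".format(
            summary_w_slider.value, len(summary.split()), len(input_w_text.value.split()))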

In [7]:
def keyword_suggestions(b):
    """
    Take inserted text and create keyword suggestions from it.
    
    :param b: ipywidgets.widgets.widget_button.Button class to specify input event.
    """
    if (len(input_w_text.value) == 0) or (input_w_text.value == default_text):
        keywords_w_output_text.value = "Your text of article is missing or is too short!"
    else: 
        keywords_w_output_text.value = ""
        # Use gensim library to extract keywords.
        kw_suggestions = keywords(input_w_text.value, words=10, scores=False, split=True, lemmatize=True)
        # Convert the list to a comma-separated string and capitalize each word.
        kw_suggestions = ', '.join(kw_suggestions).title()
        keywords_w_input_text.value = kw_suggestions

In [8]:
def extract_keyword_sentences(b):
    """
    Take inserted keywords and extract sentences containing these keywords.
    
    :param b: ipywidgets.widgets.widget_button.Button class to specify input event.
    """
    if len(keywords_w_input_text.value) == 0:
        keywords_w_output_text.value = "Keywords are missing!"
    elif input_w_text.value == default_text:
        keywords_w_output_text.value = "Your text of article is missing or is too short!"
    else:
        # Split on commas and trim surrounding whitespace so that multi-word keywords stay intact.
        user_keywords = [kw.strip() for kw in keywords_w_input_text.value.split(",") if kw.strip()]
        input_sentences = input_w_text.value.split(".")
        extracted_sentences = ""

        # Collect, keyword by keyword, every sentence that mentions the keyword.
        for kw_value in user_keywords:
            for st_value in input_sentences:
                if kw_value.lower() in st_value.lower():
                    extracted_sentences += remove_brackets(st_value).replace("\n", "").strip() + ".\n\n"
                    
        keywords_w_output_text.value = extracted_sentences.strip()
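
The keyword filtering idea can also be tried outside the widget interface. The following is a minimal sketch, not the notebook's exact logic: the function name `sentences_with_keywords` is hypothetical, sentences are split naively on full stops, and each matching sentence is returned once in its original order instead of being grouped by keyword.

```python
def sentences_with_keywords(text, keywords):
    """Return the sentences of `text` that mention at least one keyword.

    `keywords` is a comma-separated string, as typed into the keywords field.
    """
    wanted = [kw.strip().lower() for kw in keywords.split(",") if kw.strip()]
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s + "." for s in sentences if any(kw in s.lower() for kw in wanted)]


text = ("Bill Gates co-founded Microsoft. He was born in Seattle. "
        "Microsoft is based in Redmond.")
for sentence in sentences_with_keywords(text, "microsoft, Bill Gates"):
    print(sentence)
# Bill Gates co-founded Microsoft.
# Microsoft is based in Redmond.
```

Matching is a case-insensitive substring test, so a keyword such as "art" would also match "started"; a word-boundary regular expression would be stricter.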

In [9]:
def create_top_words(b):
    """
    Take inserted text and show top words statistics.
    
    :param b: ipywidgets.widgets.widget_button.Button class to specify input event.
    """
    topwords_w_image.value = open("images/default_top_words.png", "rb").read()
    
    if (len(input_w_text.value) == 0) or (input_w_text.value == default_text):
        topwords_w_info_label.value = "Your text of article is missing or is too short!"
        
    else:
        topwords_w_info_label.value = "Starting to generate the Top Words..."
        
        # Split text to sentences and create a dataframe containing sentences and cleaned sentences columns.
        sentences_list = input_w_text.value.split(".")
        sentences_df = pd.DataFrame(sentences_list, columns = ["sentence"])
        sentences_df["clean_sentence"] = hero.clean(sentences_df["sentence"])
        
        # Get default stopwords.
        default_stopwords = stopwords.DEFAULT
        # Add a list of Estonian stopwords to the stopwords.
        with open('stopwords/estonian-stopwords-lemmas.txt', encoding='utf-8') as stopwords_file:
            estonian_stopwords = set(stopwords_file.read().split())
        custom_stopwords = default_stopwords.union(estonian_stopwords)
        sentences_df['clean_sentence'] = hero.remove_stopwords(sentences_df['clean_sentence'], custom_stopwords)
        
        # Generate top words.
        tw_vis = hero.visualization.top_words(sentences_df["clean_sentence"]).head(10)
        
        # Create dataframe for chart annotation.
        tw_vis_list = tw_vis.tolist()
        tw_vis_df = tw_vis.to_frame()
        tw_vis_df.rename(columns={'clean_sentence':'kw_value'}, inplace=True)
        tw_vis_df.index.name = 'kw_name'
        tw_vis_df.reset_index(drop=False, inplace=True)
        
        # Use the actual number of words found; short texts can yield fewer than 10.
        bar_x = list(range(1, len(tw_vis_list) + 1))
        bar_height = tw_vis_list
        bar_tick_label = tw_vis_df['kw_name'].tolist()
        bar_label = tw_vis_list
        
        fig, ax = plt.subplots(figsize=(20, 7))
        bar_plot = plt.bar(bar_x, bar_height, tick_label=bar_tick_label)
        axes_font = {'fontname':'Arial', 'size':'15'}
        bars_font = {'fontname':'Arial', 'size':'15', 'color':'white'}
        
        annotate_labels(bar_plot, ax, bar_label, bars_font)
        
        plt.xlabel('Words', **axes_font)
        plt.ylabel('Value', **axes_font)
        plt.xticks(rotation=0, ha='center', **axes_font)
        plt.savefig("images/top_words.png", bbox_inches='tight')
        plt.close(fig)
        
        # Display top words.
        topwords_w_image.value = open("images/top_words.png", "rb").read()
        
        topwords_w_info_label.value = ""
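
The counting step behind the Top Words chart can be approximated with the standard library alone. This is a rough stand-in for the Texthero pipeline, not a reproduction of it: `top_words` and the tiny `STOPWORDS` set below are hypothetical names chosen for this sketch, and Texthero's cleaning performs additional steps that are omitted here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "is", "to", "it"}


def top_words(text, n=10, stopwords=STOPWORDS):
    """Return the `n` most frequent non-stopword words with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)


sample = "The cat and the dog. The cat sat. A dog is a dog, and the dog barked."
print(top_words(sample, n=2))
# [('dog', 4), ('cat', 2)]
```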

In [10]:
def annotate_labels(b_plots, ax, bar_label, bars_font):
    """
    Write each bar's value as a text label centered inside the bar.
    
    :param b_plots: matplotlib.container.BarContainer class.
    :param ax: matplotlib.axes._subplots.AxesSubplot class.
    :param bar_label: list to contain bars labels values.
    :param bars_font: dictionary to determine bars font style.
    """
    for idx, b_plot in enumerate(b_plots):
        height = b_plot.get_height()
        ax.text(b_plot.get_x() + b_plot.get_width()/2., 0.5*height,
                bar_label[idx], ha='center', va='center', **bars_font)

Download example summarization text.

In [11]:
default_text = "Copy and paste here your text of article."
example_text = scrape_data('https://en.wikipedia.org/wiki/Bill_Gates')
example_text = example_text.strip()

Create user interface widgets.

In [12]:
# Create input widgets
input_w_text = widgets.Textarea(value=default_text, layout={'height': '300px', 'width': '100%'})

input_w_example_button = widgets.Button(description='Text example',
                                        button_style='info',
                                        tooltip='Text example')

input_w_delete_button = widgets.Button(description='Delete text',
                                       button_style='info',
                                       tooltip='Delete text')

In [13]:
# Create summary widgets
summary_w_slider_label = widgets.HTML(value="Choose the size of summary (%):")
summary_w_slider = widgets.FloatSlider(min=1.0, max=100.0, step=1.0, value=50.0)
summary_w_submit_button = widgets.Button(description='Create Summary', button_style='info', tooltip='Create Summary')
summary_w_info_label = widgets.HTML(value="")
summary_w_output_text = widgets.Textarea(value="", layout={'height': '300px', 'width': '100%'})
summary_w_hbox = widgets.HBox([summary_w_slider_label, summary_w_slider])

summary_w_gbox = widgets.GridBox(children=[summary_w_hbox, 
                                           summary_w_submit_button, 
                                           summary_w_info_label, 
                                           summary_w_output_text],
                                 layout=widgets.Layout(grid_template_columns='99%'))

In [14]:
# Create keywords widgets
keywords_w_header_label = widgets.Label(value="Enter the keywords you want to keep in your text separated by commas.")
keywords_w_sug_button = widgets.Button(description="Keyword suggestions", button_style='info', tooltip='Keyword suggestions')
keywords_w_submit_button = widgets.Button(description='Extract Sentences', button_style='info', tooltip='Extract Sentences')
keywords_w_hbox = widgets.HBox([keywords_w_sug_button, keywords_w_submit_button])
keywords_w_input_text = widgets.Text(value="", layout={'height': '30px', 'width': '100%'})
keywords_w_output_text = widgets.Textarea(value="", layout={'height': '300px', 'width': '100%'})

keywords_w_gbox = widgets.GridBox(children=[keywords_w_header_label,
                                            keywords_w_hbox,
                                            keywords_w_input_text, 
                                            keywords_w_output_text],
                                  layout=widgets.Layout(grid_template_columns='99%'))

In [15]:
# Create top words widgets
topwords_w_submit_button = widgets.Button(description='Create Top Words', button_style='info', tooltip='Create Top Words')
topwords_w_info_label = widgets.HTML(value="")
topwords_w_image = widgets.Image(value=open("images/default_top_words.png", "rb").read())

topwords_w_gbox = widgets.GridBox(children=[topwords_w_submit_button,
                                             topwords_w_info_label,
                                             topwords_w_image],
                                   layout=widgets.Layout(grid_template_columns='99%'))

Create CSS to set font style for Textarea widget.

In [16]:
%%html
<style>
textarea, input {
    font-family: monospace;
    font-size: 20px;
}
</style>

Declare widget events to handle user commands.

In [17]:
# Handle input widgets events
input_w_example_button.on_click(create_text_example)
input_w_delete_button.on_click(delete_text)

# Handle summary widgets events
summary_w_submit_button.on_click(create_summary)

# Handle keyword widgets events
keywords_w_sug_button.on_click(keyword_suggestions)
keywords_w_submit_button.on_click(extract_keyword_sentences)

# Handle top words widgets events
topwords_w_submit_button.on_click(create_top_words)

Display input widgets.

In [18]:
input_w_hbox = widgets.HBox([input_w_example_button, input_w_delete_button])
widgets.GridBox(children=[input_w_hbox, input_w_text], layout=widgets.Layout(grid_template_columns='99%'))

GridBox(children=(HBox(children=(Button(button_style='info', description='Text example', style=ButtonStyle(), …

Display output widgets.

In [19]:
tab_names = ['Summary', 'Keywords', 'Top Words']
tab = widgets.Tab()
tab.children = [summary_w_gbox, keywords_w_gbox, topwords_w_gbox]

for i in range(len(tab_names)):
    tab.set_title(i, tab_names[i])
tab

Tab(children=(GridBox(children=(HBox(children=(HTML(value='Choose the size of summary (%):'), FloatSlider(valu…

## 5. Conclusions


## 6. Proposals for Further Developments

- Use Case: Increase Complexity (add more data and increase number of classes), implement image recognition on video files, objects counting etc.
- Data Preparation: Use pretrained model or partial manual control to reduce the time for data cleaning.
- Model Definition: Try different architecture or transfer learning.
- Model Training: Add GPU support to reduce the training time. This requires Nvidia GPU, as Keras doesn't work with AMD GPU yet.
- Model Tuning: Try different optimization and activation functions.