- Text summarization is the process of identifying the most relevant information in a document and producing a shorter version of it.
- Text can be summarized using the following two methods:
    * #### **Extractive Method** 
         This method selects specific keywords or sentences from the input and copies them to the output. It tends to work, but because it copies fragments without understanding them, the result may not read as a well-structured passage. Think of it as a highlighter.
    * #### **Abstractive Method**
         This method builds a neural network that models the relationship between the input and the output rather than merely copying words or sentences. It generates new, correctly structured sentences. Think of it as a pen.
___
- In this tutorial we will look at the __Extractive Method__ for text summarization using Natural Language Processing (NLP) libraries: __spaCy__, __gensim__ and __sumy__.

#### Get the text
To scrape content from the web, we will use the Beautiful Soup Python library, which parses web pages in HTML or XML format.

In [1]:
# import urlopen for fetching the page and BeautifulSoup for parsing it
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [2]:
# get title and text at specified url
def get_only_text(url):
    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    return soup.title.text, text

url = 'https://www.theverge.com/2019/5/17/18629003/us-phone-upgrades-apple-samsung-market'
# note: get_only_text returns a (title, text) tuple, so `text` holds both
text = get_only_text(url)

In [3]:
# print original text
print('Original Text :')
print(text)

Original Text :
('Three big reasons why Americans aren’t upgrading their phones - The Verge', 'Yes, this has to do with China Last month, Verizon and AT&T made official something you’ve probably been aware of for a while: American smartphone owners are upgrading a lot less than they used to. In fact, they’re hitting record lows at the two biggest US carriers, with people apparently more content than ever to keep hold of their existing device. This is a global trend, as the smartphone market is reaching maturity and saturation in many developed nations, and yet it’s most pronounced in the United States for a few reasons particular to the country. If you were to ask me to name the most exciting phones of 2019, top of my mind would be Huawei’s P30 Pro, with its exotic array of cameras and unmatched low-light photography, closely followed by the OnePlus 7 Pro and its gorgeous 90Hz screen. Is either of those phones available on AT&T or Verizon? Nope. Huawei is effectively banned by the US g
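The printed output starts with `('` because `get_only_text` returns a `(title, text)` tuple and the whole tuple is bound to `text`. A minimal sketch of the effect, using a hypothetical stand-in function (`fake_get_only_text` is not part of the notebook) so that it runs without network access:

```python
def fake_get_only_text():
    # hypothetical stand-in for get_only_text(url); returns (title, body)
    return ('Some title', 'First sentence. Second sentence.')

result = fake_get_only_text()

# str() on a tuple keeps the parentheses and quotes of its repr
as_string = str(result)

# unpacking gives the title and the body as separate, clean strings
title, body = result

print(as_string)
print(body)
```

Passing `body` rather than `str(result)` to the later steps would keep tokens such as `(` and `'` out of the analysis.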

### 1. [spaCy](https://spacy.io/)

In [4]:
# load spaCy library
import spacy

In [5]:
# load text pre-processing libraries
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [6]:
# build list of stopwords
stopwords = list(STOP_WORDS)

In [7]:
# check out 10 stopwords
stopwords[:10]

['one',
 'must',
 'at',
 'never',
 'seems',
 'then',
 'together',
 'with',
 'from',
 'serious']

In [8]:
# load the small English model (the 'en' shortcut was removed in spaCy 3.x)
nlp = spacy.load('en_core_web_sm')

In [9]:
# build nlp object from our original text
doc = nlp(str(text))

In [10]:
# checkout contents of doc
doc

('Three big reasons why Americans aren’t upgrading their phones - The Verge', 'Yes, this has to do with China Last month, Verizon and AT&T made official something you’ve probably been aware of for a while: American smartphone owners are upgrading a lot less than they used to. In fact, they’re hitting record lows at the two biggest US carriers, with people apparently more content than ever to keep hold of their existing device. This is a global trend, as the smartphone market is reaching maturity and saturation in many developed nations, and yet it’s most pronounced in the United States for a few reasons particular to the country. If you were to ask me to name the most exciting phones of 2019, top of my mind would be Huawei’s P30 Pro, with its exotic array of cameras and unmatched low-light photography, closely followed by the OnePlus 7 Pro and its gorgeous 90Hz screen. Is either of those phones available on AT&T or Verizon? Nope. Huawei is effectively banned by the US government, while

Build a word frequency table, i.e. a dictionary that maps each non-stopword to its count.

In [16]:
word_frequency = {}
for word in doc:
    if word.text not in stopwords:
        if word.text not in word_frequency.keys():
            word_frequency[word.text] = 1
        else:
            word_frequency[word.text] += 1
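The loop above counts tokens case-sensitively and does not filter punctuation, so punctuation marks and capitalized stop words such as `The` end up in the table. As a rough sketch of one alternative, assuming a tiny hand-written stop word set and a regex tokenizer in place of spaCy (both are illustrative stand-ins, not what the notebook uses), `collections.Counter` can build a table of lowercased words only:

```python
import re
from collections import Counter

# illustrative stop words only; spaCy's STOP_WORDS is much larger
SAMPLE_STOP_WORDS = {'the', 'a', 'is', 'of', 'to', 'and', 'in'}

def build_word_frequency(text, stop_words=SAMPLE_STOP_WORDS):
    # keep runs of letters, digits and apostrophes; punctuation is dropped
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

sample = 'The phone market is mature. The US phone market is saturated, too.'
sample_frequency = build_word_frequency(sample)
print(sample_frequency.most_common(3))
```

Lowercasing at counting time also means later lookups can lowercase tokens and still find them in the table.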

In [17]:
word_frequency

{'(': 3,
 "'": 4,
 'Three': 1,
 'big': 2,
 'reasons': 2,
 'Americans': 1,
 'n’t': 6,
 'upgrading': 2,
 'phones': 4,
 '-': 13,
 'The': 6,
 'Verge': 1,
 ',': 68,
 'Yes': 1,
 'China': 1,
 'Last': 1,
 'month': 1,
 'Verizon': 4,
 'AT&T': 3,
 'official': 1,
 '’ve': 2,
 'probably': 1,
 'aware': 1,
 ':': 3,
 'American': 2,
 'smartphone': 5,
 'owners': 1,
 'lot': 3,
 '.': 38,
 'In': 1,
 'fact': 1,
 '’re': 4,
 'hitting': 1,
 'record': 1,
 'lows': 1,
 'biggest': 1,
 'US': 11,
 'carriers': 3,
 'people': 8,
 'apparently': 1,
 'content': 2,
 'hold': 1,
 'existing': 2,
 'device': 2,
 'This': 1,
 'global': 1,
 'trend': 1,
 'market': 5,
 'reaching': 1,
 'maturity': 1,
 'saturation': 1,
 'developed': 1,
 'nations': 2,
 '’s': 13,
 'pronounced': 1,
 'United': 2,
 'States': 2,
 'particular': 1,
 'country': 2,
 'If': 2,
 'ask': 1,
 'exciting': 2,
 '2019': 2,
 'mind': 1,
 'Huawei': 5,
 'P30': 1,
 'Pro': 3,
 'exotic': 1,
 'array': 1,
 'cameras': 1,
 'unmatched': 1,
 'low': 1,
 'light': 1,
 'photography': 1,
 

In [18]:
# get maximum word frequency
maximum_word_frequency = max(word_frequency.values())

In [20]:
# normalize each word frequency by the maximum frequency
for word in word_frequency.keys():
    word_frequency[word] = word_frequency[word] / maximum_word_frequency
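The normalization divides each count by the largest count, so the most frequent token gets a weight of 1.0. A small sketch using a handful of counts copied from the table above (`normalize_frequencies` is a hypothetical helper name); it builds a new dictionary instead of updating the table in place:

```python
def normalize_frequencies(counts):
    # divide every count by the largest count so values fall in (0, 1]
    highest = max(counts.values())
    return {word: count / highest for word, count in counts.items()}

# a few counts copied from the word frequency table above
sample_counts = {',': 68, '.': 38, 'US': 11, 'people': 8, 'Three': 1}
normalized = normalize_frequencies(sample_counts)
print(normalized['Three'])
```

Because the comma is the most frequent token here, every real word ends up with a small weight; removing punctuation before counting would change the maximum.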

In [21]:
word_frequency

{'(': 0.04411764705882353,
 "'": 0.058823529411764705,
 'Three': 0.014705882352941176,
 'big': 0.029411764705882353,
 'reasons': 0.029411764705882353,
 'Americans': 0.014705882352941176,
 'n’t': 0.08823529411764706,
 'upgrading': 0.029411764705882353,
 'phones': 0.058823529411764705,
 '-': 0.19117647058823528,
 'The': 0.08823529411764706,
 'Verge': 0.014705882352941176,
 ',': 1.0,
 'Yes': 0.014705882352941176,
 'China': 0.014705882352941176,
 'Last': 0.014705882352941176,
 'month': 0.014705882352941176,
 'Verizon': 0.058823529411764705,
 'AT&T': 0.04411764705882353,
 'official': 0.014705882352941176,
 '’ve': 0.029411764705882353,
 'probably': 0.014705882352941176,
 'aware': 0.014705882352941176,
 ':': 0.04411764705882353,
 'American': 0.029411764705882353,
 'smartphone': 0.07352941176470588,
 'owners': 0.014705882352941176,
 'lot': 0.04411764705882353,
 '.': 0.5588235294117647,
 'In': 0.014705882352941176,
 'fact': 0.014705882352941176,
 '’re': 0.058823529411764705,
 'hitting': 0.01470

In [22]:
# get sentence tokens
sentence_list = [ sentence for sentence in doc.sents ]

In [27]:
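# score each sentence of fewer than 30 words by summing the normalized frequencies of its words
# note: the lookup lowercases each token while the table keys keep their original case,
# so capitalized words such as 'US' are not counted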
sentence_scores = {}  
for sent in sentence_list:  
        for word in sent:
            if word.text.lower() in word_frequency.keys():
                if len(sent.text.split(' ')) < 30:
                    if sent not in sentence_scores.keys():
                        sentence_scores[sent] = word_frequency[word.text.lower()]
                    else:
                        sentence_scores[sent] += word_frequency[word.text.lower()]
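In the scores printed below, `Nope.` receives 0.5588, which is exactly the weight of `.`. This appears to be because the lookup lowercases each token while the keys of `word_frequency` keep their original case, so `Nope` is never matched. A minimal sketch of the same scoring idea with consistently lowercased keys, assuming plain strings and whitespace tokenization instead of spaCy spans (`score_sentences` and the sample weights are hypothetical):

```python
def score_sentences(sentences, word_weights, max_words=30):
    scores = {}
    for sentence in sentences:
        words = sentence.lower().split()
        # skip long sentences up front, mirroring the 30-word check above
        if len(words) >= max_words:
            continue
        # strip surrounding punctuation so 'phones,' matches 'phones'
        cleaned = [w.strip('.,:;!?"\'()') for w in words]
        score = sum(word_weights.get(w, 0.0) for w in cleaned)
        # keep only sentences that matched at least one weighted word
        if score > 0:
            scores[sentence] = score
    return scores

# hypothetical weights with lowercased keys
sample_weights = {'us': 1.0, 'phones': 0.5, 'market': 0.25}
sample_sentences = [
    'The US market is saturated.',
    'Phones are expensive.',
    'Nope.',
]
sample_scores = score_sentences(sample_sentences, sample_weights)
print(sample_scores)
```

Checking the sentence length once per sentence, rather than once per word, gives the same result with less repeated work.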

In [28]:
# check sentence scores
sentence_scores

{In fact, they’re hitting record lows at the two biggest US carriers, with people apparently more content than ever to keep hold of their existing device.: 2.970588235294117,
 Is either of those phones available on AT&T or Verizon?: 0.08823529411764706,
 Nope.: 0.5588235294117647,
 Huawei is effectively banned by the US government, while OnePlus only has a distribution deal with T-Mobile in the country, which is better than nothing but still comparatively niche.: 2.9264705882352935,
 The typical American smartphone buyer knows a choice between only two brands: Apple and Samsung.: 0.7941176470588236,
 If you scroll down far enough, you’ll get to see the Red Hydrogen One, which is a garbage phone, but it’s from a US company: 3.4411764705882346,
 , so they let it in.: 1.5735294117647058,
 Chinese phone brands like Huawei and Xiaomi have leading positions in most of the world’s markets now, but in the US they’re almost entirely absent.: 2.1470588235294117,
 Even OnePlus is mostly a palatab

In [29]:
from heapq import nlargest

In [35]:
# get eight sentences with highest scores
summarized_sentences = nlargest(8, sentence_scores, key=sentence_scores.get)

In [36]:
# checkout summarized sentences
summarized_sentences

[The US government’s geopolitics is playing out in carrier stores, narrowing consumer choice to products from US companies, mainly Apple, or manufacturers from US-allied nations like South Korea.,
 Smartphones are still fun, exciting, and full of novel features, but you might have to go outside the United States to find one that’s both compelling and affordable.,
 If you scroll down far enough, you’ll get to see the Red Hydrogen One, which is a garbage phone, but it’s from a US company,
 The OnePlus 7 Pro is a rare exception, bringing a devastatingly handsome, bezel-less display to the sub-$700 market.,
 In fact, they’re hitting record lows at the two biggest US carriers, with people apparently more content than ever to keep hold of their existing device.,
 Huawei is effectively banned by the US government, while OnePlus only has a distribution deal with T-Mobile in the country, which is better than nothing but still comparatively niche.,
 “Incremental changes from one model to the nex

In [37]:
# convert each summarized sentence from a spaCy span to its text
final_sentences = [w.text for w in summarized_sentences]

In [38]:
# join sentences
summary = ' '.join(final_sentences)

In [39]:
# checkout final summary
summary

'The US government’s geopolitics is playing out in carrier stores, narrowing consumer choice to products from US companies, mainly Apple, or manufacturers from US-allied nations like South Korea. Smartphones are still fun, exciting, and full of novel features, but you might have to go outside the United States to find one that’s both compelling and affordable. If you scroll down far enough, you’ll get to see the Red Hydrogen One, which is a garbage phone, but it’s from a US company The OnePlus 7 Pro is a rare exception, bringing a devastatingly handsome, bezel-less display to the sub-$700 market. In fact, they’re hitting record lows at the two biggest US carriers, with people apparently more content than ever to keep hold of their existing device. Huawei is effectively banned by the US government, while OnePlus only has a distribution deal with T-Mobile in the country, which is better than nothing but still comparatively niche. “Incremental changes from one model to the next hasn’t bee

This frequency-based approach built on spaCy does not generate new sentences; it simply returns the original sentences with the highest scores, ranked by score rather than by their order in the article.
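`nlargest` returns sentences from highest to lowest score, so the summary above does not follow the order of the article. One possible refinement, sketched here with plain strings and a parallel list of scores (the helper name and sample data are hypothetical), is to pick the top positions and then sort them back into document order:

```python
from heapq import nlargest

def top_sentences_in_order(sentences, scores, n):
    # rank sentence positions by score, then sort the winners by position
    best_positions = nlargest(n, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(best_positions)]

sample_sentences = ['First point.', 'Minor aside.', 'Key finding.', 'Closing remark.']
sample_scores = [2.0, 0.5, 3.0, 1.0]
ordered = top_sentences_in_order(sample_sentences, sample_scores, 2)
print(' '.join(ordered))
```

The selection is unchanged; only the presentation order differs, which tends to make the extracted summary easier to follow.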

* * *

### 2. [gensim](https://radimrehurek.com/gensim/index.html)

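gensim has a built-in text summarizer and can also extract keywords. The `gensim.summarization` module was removed in gensim 4.0, so the code below requires gensim 3.x.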

In [40]:
# import gensim library
from gensim.summarization import summarize, keywords

The ratio parameter used below is a number between 0 and 1 that sets the proportion of sentences from the original text to include in the summary.

In [41]:
for ratio in [0.1,0.2,0.4]:
    summarized_text = summarize(str(text),ratio=ratio)
    print(f'Summary using Gensim for Ratio : {ratio}')
    print(summarized_text)
    print('\n')

Summary using Gensim for Ratio : 0.1
Apple did have a major redesign with the iPhone X in 2017, sparking a wave of upgrades from people who’d been waiting for such a dramatic change, but the company has otherwise kept to a conservative cadence when it comes to introducing new hardware features and capabilities.
Without the likes of Huawei to push them into more aggressive upgrade cycles, Apple and Samsung can afford to keep pace only with one another, at least in the US market.
Phone manufacturers and carriers in the US have shifted the most innovative and appealing devices to a price point that’s simply unattainable for a majority of people.


Summary using Gensim for Ratio : 0.2
('Three big reasons why Americans aren’t upgrading their phones - The Verge', 'Yes, this has to do with China Last month, Verizon and AT&T made official something you’ve probably been aware of for a while: American smartphone owners are upgrading a lot less than they used to.
Chinese phone brands like Huawei 

In [46]:
# get keywords and their scores for the original text
print('Keywords :')
print(keywords(str(text), words=20, scores=True ,lemmatize=True))

Keywords :
[('phone', 0.318452407845418), ('smartphones', 0.1685040707322157), ('apple', 0.14406767508313495), ('brand', 0.1432874845707694), ('markets', 0.13889852900491376), ('people', 0.13757362866634612), ('samsung', 0.1320181729882281), ('chinese', 0.12836914005244315), ('likes', 0.12340612862996207), ('new', 0.11677405478264472), ('prices', 0.11189707986166327), ('upgrade', 0.11184920733756953), ('vendors', 0.1062817234854944), ('bezel', 0.1034118904932478), ('line', 0.10160248723804759), ('consumers', 0.09659572460602217), ('carrier', 0.09659572460602181), ('low', 0.0931813628519074), ('galaxy', 0.08617360759661755), ('nations', 0.08567004917866335)]


- A good thing about the gensim text summarizer is that it maintains the order of the input sentences in the output.
- The number of sentences in the output can be controlled by the __ratio__ parameter. The default is 0.2.
- Gensim also extracts keywords from the text. We can play around with parameters like __ratio__, __words__, __scores__, __lemmatize__ or __split__ to get the required output.
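As a rough illustration of how __ratio__ relates to summary length, the sketch below assumes the sentence count is simply the number of input sentences times the ratio, truncated to an integer. This rule is an assumption made for illustration; gensim's exact rounding is not shown in this tutorial, and `sentences_for_ratio` is a hypothetical helper, not a gensim function.

```python
def sentences_for_ratio(total_sentences, ratio):
    # assumed rule: truncate total * ratio to a whole number of sentences
    if not 0 < ratio <= 1:
        raise ValueError('ratio must be in the range (0, 1]')
    return int(total_sentences * ratio)

# expected summary length for a hypothetical 40-sentence article
for ratio in [0.1, 0.2, 0.4]:
    print(ratio, sentences_for_ratio(40, ratio))
```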

___

### 3. [Sumy](https://pypi.org/project/sumy/) 

Sumy is a simple library and command-line utility for extracting summaries from HTML pages or plain text. The package also contains a simple evaluation framework for text summaries. The implemented summarization methods are:

- __Luhn__ - heuristic method, [reference](http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5392672)
- __Edmundson__ - heuristic method with previous statistic research, [reference](http://dl.acm.org/citation.cfm?doid=321510.321519)
- __Latent Semantic Analysis, LSA__ - one of the algorithms from [this author](http://scholar.google.com/citations?user=0fTuW_YAAAAJ&hl=en), who is probably using more advanced algorithms now. [Steinberger, J. and Ježek, K. Using latent semantic analysis in text summarization and summary evaluation. In Proceedings ISIM '04. 2004. pp. 93-100.](http://www.kiv.zcu.cz/~jstein/publikace/isim2004.pdf)
- __LexRank__ - Unsupervised approach inspired by algorithms PageRank and HITS, [reference](http://tangra.si.umich.edu/~radev/lexrank/lexrank.pdf)
- __TextRank__ - Unsupervised approach, also using PageRank algorithm, [reference](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)
- __SumBasic__ - Method that is often used as a baseline in the literature. Source: [Read about SumBasic](http://www.cis.upenn.edu/~nenkova/papers/ipm.pdf)
- __KL-Sum__ - Method that greedily adds sentences to a summary so long as it decreases the KL Divergence. Source: [Read about KL-Sum](http://www.aclweb.org/anthology/N09-1041)
- __Reduction__ - Graph-based summarization, where a sentence salience is computed as the sum of the weights of its edges to other sentences. The weight of an edge between two sentences is computed in the same manner as TextRank.
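To make the graph-based description of __Reduction__ more concrete, here is a minimal sketch. It assumes the word-overlap similarity from the TextRank paper (shared words divided by the sum of the log sentence lengths) and whitespace tokenization; it is not sumy's implementation, and the function names are hypothetical.

```python
import math

def overlap_similarity(words_a, words_b):
    # TextRank-style edge weight: shared words normalized by log lengths
    if not words_a or not words_b:
        return 0.0
    norm = math.log(len(words_a)) + math.log(len(words_b))
    if norm <= 0:
        return 0.0  # both sentences have a single word
    return len(set(words_a) & set(words_b)) / norm

def salience_scores(sentences):
    tokenized = [s.lower().split() for s in sentences]
    scores = []
    for i, words_i in enumerate(tokenized):
        # salience = sum of edge weights to every other sentence
        total = sum(
            overlap_similarity(words_i, words_j)
            for j, words_j in enumerate(tokenized)
            if j != i
        )
        scores.append(total)
    return scores

sample = ['apple sells phones', 'samsung sells phones', 'the weather is nice']
sample_salience = salience_scores(sample)
print(sample_salience)
```

The first two sentences share two words and therefore support each other, while the unrelated third sentence has no edges and gets a salience of zero.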

We can check out the different __sumy__ summarizers and see which one works best for our requirements.

In [8]:
# import sumy's text pre-processing libraries and summarizers
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.edmundson import EdmundsonSummarizer
from sumy.summarizers.kl import KLSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.random import RandomSummarizer
from sumy.summarizers.reduction import ReductionSummarizer
from sumy.summarizers.sum_basic import SumBasicSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer

In [14]:
# iterate through the summarizers to get a summary from each
summarizers = [KLSummarizer, LexRankSummarizer, LsaSummarizer, LuhnSummarizer, RandomSummarizer, ReductionSummarizer, 
               SumBasicSummarizer, TextRankSummarizer]

# select language
language = 'english'

# set number of sentences to be in the summary
sentence_count = 7

# url to fetch text
url = 'https://www.theverge.com/2019/5/17/18629003/us-phone-upgrades-apple-samsung-market'

# parse the url through an HTML parser and tokenize it
parser = HtmlParser.from_url(url, Tokenizer(language))

for summary in summarizers:
    name = summary.__name__    
    print(('='*53)+name+('='*53))
    summarizer = summary(Stemmer(language))                         # create summarizer object
    summarizer.stop_words = get_stop_words(language)                # get stop words
    for sentence in summarizer(parser.document, sentence_count):    # parse original text through the summarizer to get summary
        print(sentence)
    print('\n')

If you scroll down far enough, you’ll get to see the Red Hydrogen One, which is a garbage phone , but it’s from a US company, so they let it in.
Think about the things that make a Samsung Galaxy S10 compelling: a beautiful display with tiny bezels, a very good camera, a large battery with wireless charging, fast performance, water resistance, and, as a bonus, a headphone jack.
You’d certainly struggle to tell the difference between an iPhone X and XS, just as you would struggle to differentiate between an iPhone 6 and a 6S on first glance.
Huawei’s breakneck pursuit of new features has proven extremely enticing for phone buyers in Europe and across the rest of the world, with the Chinese vendor racking up 50 percent growth in phone shipments in the first quarter of 2019 while Samsung and Apple both faltered .
That strategy has worked surprisingly well, with consumers seeing only a marginal increase in their monthly cost and valuing the increased capabilities (or sheer aesthetic and lux

In my opinion, the __Latent Semantic Analysis (LSA) Summarizer__ provides the best summary out of all the sumy summarizers.

__Edmundson Summarizer__ requires additional parameters, bonus_words, stigma_words and null_words, to summarize text.
- Bonus words are the words that we want to see in our summary; they are the most informative, significant words.
- Stigma words and null words are non-significant words, similar to stop words.
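The role of the three word lists can be sketched with a toy cue-word scorer. This is a simplified reading of the cue-word idea, not sumy's implementation: the +1 and -1 weights, the helper name `cue_score` and the sample word lists are all assumptions made for illustration.

```python
def cue_score(sentence, bonus_words, stigma_words, null_words):
    score = 0
    for word in sentence.lower().split():
        word = word.strip('.,:;!?')
        if word in null_words:
            continue      # null words are ignored entirely
        if word in bonus_words:
            score += 1    # bonus words raise the score (assumed weight)
        elif word in stigma_words:
            score -= 1    # stigma words lower the score (assumed weight)
    return score

# hypothetical word lists; the bonus words match the cell below
bonus = {'china', 'apple', 'samsung'}
stigma = {'hardly', 'impossible'}
null = {'the', 'and', 'is'}

high = cue_score('Apple and Samsung dominate the market.', bonus, stigma, null)
low = cue_score('It is hardly surprising.', bonus, stigma, null)
print(high, low)
```

Sentences that mention the bonus words rank higher, which is consistent with the Edmundson output favoring sentences about Apple and Samsung.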

In [4]:
print('=============Edmundson_Summarizer=============')
summarizer = EdmundsonSummarizer(Stemmer(language))
summarizer.bonus_words = ("china", "apple", "samsung" )
summarizer.stigma_words = get_stop_words(language)
summarizer.null_words = get_stop_words(language)
for sentence in summarizer(parser.document, sentence_count):
    print(sentence)

Nope.
The typical American smartphone buyer knows a choice between only two brands: Apple and Samsung.
Geopolitics playing out in US carrier stores
The three-year-old Galaxy S7 has all of those things.
Increasingly, they’re bundling their phone line rentals with subscriptions to premium video or music services, and they’re offering long-term payment plans to help people buy the super flagship $1,000 phones that Apple, Samsung, and Google have been offering.
But there are two long-term issues for hardware manufacturers selling ultra pricey handsets.
Samsung’s cheapest Galaxy S10 variant, the S10E , is still $749.
