# Text Summarization

Text summarization is the task of shortening a long piece of text into a coherent, fluent summary that keeps only the main points of the document.
Automatic text summarization is a common problem in machine learning and natural language processing (NLP).

There are two broad approaches to text summarization:

* Extractive Summarization
* Abstractive Summarization


### Extractive Summarization
As the name suggests, this approach identifies the important sentences or phrases in the original text and extracts only those. The extracted sentences form the summary. The diagram below illustrates extractive summarization:

![Extractive_Summarization.png](images/Extractive_Summarization.png)
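
To make the idea concrete, here is a minimal sketch of extractive summarization using plain word-frequency scoring. It is not the algorithm gensim uses; the stopword list, the helper names (`split_sentences`, `tokenize`, `extractive_summary`), and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
             "it", "that", "this", "for", "on", "with", "as", "by", "was"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]

def extractive_summary(text, ratio=0.3):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    # Word frequencies over the whole document.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        tokens = tokenize(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    keep = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]

sample = ("Cats sleep a lot. Cats also hunt mice at night. "
          "The weather was mild. Many cats live with people.")
summary = extractive_summary(sample, ratio=0.5)
print(summary)
```

The summary is built only from sentences that already exist in the input, which is the defining property of the extractive approach. The `ratio` argument here mirrors the idea of keeping a fraction of the sentences, but its exact behaviour is specific to this sketch.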

In this notebook, we will use the `gensim.summarization` module for text summarization.

In [0]:
# ! pip install wikipedia
# gensim.summarization was removed in Gensim 4.0, so pin an earlier release.
# ! pip install "gensim<4.0"

In [2]:
# Import the Libraries
from gensim.summarization.summarizer import summarize 
from gensim.summarization import keywords 
import wikipedia 

In [3]:
# Get wiki content. 
wikisearch = wikipedia.page("Amitabh Bachchan") 
wikicontent = wikisearch.content  

In [4]:
# Summary (5% of the original content, i.e. ratio=0.05).
summ_per = summarize(wikicontent, ratio=0.05)
print("Percent summary") 
print(summ_per) 

Percent summary
Amitabh Bachchan (pronounced [əmɪˈtaːbʱ ˈbətʃːən]; born Inquilaab Srivastava; 11 October 1942) is an Indian film actor, film producer, television host, occasional playback singer and former politician.
He first gained popularity in the early 1970s for films such as Zanjeer, Deewaar and Sholay, and was dubbed India's "angry young man" for his on-screen roles in Bollywood.
Referred to as the Shahenshah of Bollywood, Sadi ka Mahanayak (Hindi for, "Greatest actor of the century"), Star of the Millennium, or Big B, he has since appeared in over 190 Indian films in a career spanning almost five decades.
Beyond the Indian subcontinent, he also has a large overseas following in markets including Africa (such as South Africa), the Middle East (especially Egypt), United Kingdom, Russia and parts of the United States.Bachchan has won numerous accolades in his career, including four National Film Awards as Best Actor, Dadasaheb Phalke Award as lifetime achievement award and many aw

In [5]:
# Summary (about 200 words)
summ_words = summarize(wikicontent, word_count=200)
print("Word count summary") 
print(summ_words)  

Word count summary
Beyond the Indian subcontinent, he also has a large overseas following in markets including Africa (such as South Africa), the Middle East (especially Egypt), United Kingdom, Russia and parts of the United States.Bachchan has won numerous accolades in his career, including four National Film Awards as Best Actor, Dadasaheb Phalke Award as lifetime achievement award and many awards at international film festivals and award ceremonies.
Salim Khan introduced Bachchan to Prakash Mehra, and Salim-Javed insisted that Bachchan be cast for the role.Zanjeer was a crime film with violent action, in sharp contrast to the romantically themed films that had generally preceded it, and it established Amitabh in a new persona—the "angry young man" of Bollywood cinema.
This led to Bachchan being dubbed as "angry young man", a journalistic catchphrase which became a metaphor for the dormant rage, frustration, restlessness, sense of rebellion and anti-establishment disposition of an en

In [6]:
# Use the "split" option if you want a list of strings instead of a single string.
splits = summarize(wikicontent, split=True) 
print("split summary") 
print(splits)

split summary
['Amitabh Bachchan (pronounced [əmɪˈtaːbʱ ˈbətʃːən]; born Inquilaab Srivastava; 11 October 1942) is an Indian film actor, film producer, television host, occasional playback singer and former politician.', 'He first gained popularity in the early 1970s for films such as Zanjeer, Deewaar and Sholay, and was dubbed India\'s "angry young man" for his on-screen roles in Bollywood.', 'Referred to as the Shahenshah of Bollywood, Sadi ka Mahanayak (Hindi for, "Greatest actor of the century"), Star of the Millennium, or Big B, he has since appeared in over 190 Indian films in a career spanning almost five decades.', 'Beyond the Indian subcontinent, he also has a large overseas following in markets including Africa (such as South Africa), the Middle East (especially Egypt), United Kingdom, Russia and parts of the United States.Bachchan has won numerous accolades in his career, including four National Film Awards as Best Actor, Dadasaheb Phalke Award as lifetime achievement award a

This module also supports keyword extraction. Keyword extraction works in the same way as summary generation (i.e. sentence extraction): the algorithm tries to find words that are important or representative of the entire text. The keywords are not always single words; multi-word keywords typically consist entirely of nouns.
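
As a rough illustration of ranking words by how representative they are, the sketch below builds a word co-occurrence graph and scores it with a PageRank-style iteration. It is a simplified stand-in, not gensim's implementation: the function name `rank_keywords`, the stopword list, and the `window`, `damping`, and `iterations` defaults are assumptions, and it returns single words only.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
             "it", "that", "this", "for", "on", "with", "as", "by", "was"}

def rank_keywords(text, window=2, damping=0.85, iterations=30, top_n=5):
    """Rank words by a PageRank-style score on a co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]

    # Undirected graph: link words that occur within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return []

    # Power iteration: each word shares its score equally among its neighbors.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

sample = ("Text summarization shortens long text. Extractive summarization "
          "selects sentences from the text, and keyword extraction selects "
          "words from the text.")
print(rank_keywords(sample))
```

Words that co-occur with many other well-connected words rise to the top, which is the same intuition as ranking sentences for a summary.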

In [7]:
# Get keywords
from gensim.summarization import keywords

print('Keywords:')
print(keywords(wikicontent))

Keywords:
bachchan
films
filming
indian film
awards
award
awarded
drama
amitabh
actor
actors
critics
critic
including
include
included
india
issn
bollywood
bollywoodization
roles
role
like
likeness
star
starring
year
years
hits
hit
career
careers
critical acclaim
critically acclaimed
doi
successful
success
successes
directed
time
times
timing
new
news
appeared
appearance
appearances
appears
appear
appearing
starred alongside
family
action comedy
filmfare
released
releases
releasing
release
hindi
named
cinema
later
movies
university
kabhie
kabhi
aur
office
major acting
salim
voice
kapoor
highest
called
calling
followed
jaya
character
characters
javed
producer television
politics
political
girl
girls
social
chopra
culture
khan
cultural nationalism
brother
industry
industries
national
movie scene
college
opposite
honoured
honour
honours
old
police
received
receiving
receive
period
padma
baritone
life
sikh
papers
lead
leading
man
popularity
popular
overseas following
legal battles
asian
br