# Gensim - Automated Text Summarisation

- The gensim implementation is based on the popular "TextRank" algorithm.
- TextRank is a graph-based ranking model for text processing.
- "Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community."
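
To make the "graph-based ranking" idea concrete, here is a minimal pure-Python sketch of TextRank-style sentence ranking. It is an illustration under stated assumptions, not gensim's implementation: the function names (`split_sentences`, `similarity`, `rank_sentences`, `summarize_sketch`), the naive regex sentence splitter, the word-overlap similarity computed on unique words, and the fixed iteration count are all choices made for this example.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Word overlap between two sentences, normalised by their log lengths."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids a zero denominator, since log(1) == 0
    overlap = len(words_a & words_b)
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with a weighted PageRank-style iteration."""
    n = len(sentences)
    # Sentences are the graph nodes; edge weights are pairwise similarities.
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize_sketch(text, top_n=2):
    """Return the top_n highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    scores = rank_sentences(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]
```

A sentence that shares no words with the rest of the text has no edges in the graph, so it keeps only the baseline score of `1 - damping` and is the last to be selected.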

In [1]:
from bs4 import BeautifulSoup
from requests import get


### Create a Function to Extract only Text from \<p> Tags

In [2]:
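def get_only_text(url):
    """
    Return the title and the text of the article
    at the specified URL.
    """
    page = get(url)
    # The 'lxml' parser requires the lxml package to be installed
    soup = BeautifulSoup(page.content, 'lxml')
    # Join the text of every <p> tag into a single string
    text = " ".join(p.text for p in soup.find_all("p"))

    # stripped_strings yields the <title> text without surrounding whitespace
    title = " ".join(soup.title.stripped_strings)
    return title, text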

### Calling the function with the desired News URL

In [3]:
title, text = get_only_text("https://en.wikinews.org/wiki/Global_markets_plunge")

In [4]:
title

'Global markets plunge - Wikinews, the free news source'

In [5]:
text

'Friday, October 10, 2008\xa0\n Stock markets across the world have fallen sharply with several seeing the biggest drop in their history. \n Asian markets saw the biggest sell-off. The Nikkei dropped 9.62% to reach a 20 year low. Japan also saw a collapse of a mid-size insurance company, Yamato Life Insurance Company, which declared bankruptcy.  The Hang Seng, which was one of the few markets that was positive yesterday, fell 7.19%. Australia dropped by 8.4% and South Korea saw a 9% fall. \n In Europe, markets dropped at the open with the FTSE losing 11%. They have recovered only sightly with all European markets losing more than 5%. The European sell off was more about the Asian lows then any specific news. European banks and financial institutes saw the most selling. Also, oil related companies saw large drops as an result of an expected decrease in oil consumption. \n\n The U.S. markets opened lower with the Dow Jones Industrial Average falling below 8,000, before recovering slightl

### Number of Words - Original Text

In [6]:
len(text.split())

1174

## Summarisation - using Gensim

In [7]:
# The summarization module was removed in gensim 4.0,
# so a 3.x release is required: python -m pip install "gensim==3.8.3"
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords
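
Because the import above fails on gensim 4 and later, a guarded import can make the notebook fail with a clearer message. The following is a small sketch; the helper name `load_gensim_summarizer` is hypothetical and not part of gensim, and it relies only on the two import paths already used in this notebook.

```python
def load_gensim_summarizer():
    """Return (summarize, keywords), or None if gensim.summarization is unavailable.

    Hypothetical helper for this notebook; it is not part of gensim.
    """
    try:
        from gensim.summarization.summarizer import summarize
        from gensim.summarization import keywords
    except ImportError:
        # Raised when gensim is missing, or when gensim >= 4.0 is installed
        # (the summarization module no longer exists there).
        return None
    return summarize, keywords


loaded = load_gensim_summarizer()
if loaded is None:
    print('gensim.summarization not found; try: python -m pip install "gensim==3.8.3"')
```

Returning `None` instead of raising lets later cells decide whether to skip the summarisation steps or stop.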

### Printing the Summarised Text

### Method 1 - Word Count

In [8]:
print("Title: " + title)
print("Summary: ")
print(summarize(text, word_count=100))

Title: Global markets plunge - Wikinews, the free news source
Summary: 
Bush made an address on the economy and said markets were being "driven by uncertainty and fear."
Charities, such as Cats Protection, today said that they have lost much of their funds in collapsing banks.
The Dow Jones Industrial Average fell to its lowest level in five years at 8,579.19, falling 679 points in one day.
“I think we quickly realised that we cannot solve the problems we have got as a result of the sub-prime market collapse simply by improving liquidity," he said speaking in Birmingham to business leaders earlier today.


In [9]:
print("Title: " + title)
print("Summary: ")
print(summarize(text, ratio=0.03))

Title: Global markets plunge - Wikinews, the free news source
Summary: 
Charities, such as Cats Protection, today said that they have lost much of their funds in collapsing banks.
The Dow Jones Industrial Average fell to its lowest level in five years at 8,579.19, falling 679 points in one day.


### Method 2 - Keywords

In [11]:
print("Keywords: ")
# ratio and lemmatize belong to keywords(), not print()
print(keywords(text, ratio=0.1, lemmatize=True))