Skip to content

Summarises text content from any website using extractive summarization

Notifications You must be signed in to change notification settings

rittikbasu/text-summarizer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Text Summarizer

About the project

This project is offered as a service through a simple flask website wherein a user can either provide a URL or enter text manually and get an elegant summary of the provided text. The underlying principal on which the algorithm works is extractive summarisation, where the important sentences or phrases from the original text are identified and extracted to form a summary.

Landing Page Mode of input: URL Summary Page

How it works?

  • We request the page source with urllib and then parse that page with BeautifulSoup and extract the text from the <p> tags.

  • Text Processing: First the paragraphs are converted into sentences then we remove all special characters, numbers, punctuation, and stop words (words such as is, an, a, the, for that do not add value to the meaning of a sentence) from the extracted text.

  • Tokenization: The text is divided into a series of tokens using the sent_tokenize() function of nltk. Tokenizing the sentences is done to get all the words present in the sentences.

  • A frequency table of the words is created to evaluate the weighted occurrence frequency of the words. The approach of weighing is based on frequencies i.e. every word/term is assigned a weight using tf-idf (term frequency – inverted document frequency) approach. The weight of a term = term frequency * inverse of document frequency.

  • We then substitute words with their weighted frequencies choosing all sentences above a certain weight threshold and ordering the selected sentences as they appear in the original article.

  • Finally after getting all the required parameters we generate the summary.

Prerequisites

  • urllib: for requesting a webpage
  • bs4: for parsing the web page
  • lxml: for processing html and xml with python
  • nltk: for performing natural language processing tasks