
# 1. Text Ranking method

## 1.1 Import the text from any URL

In [3]:
# Import BeautifulSoup and urlopen to fetch a page from Wikipedia.
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch a page and return its title and the joined text of all <p> elements.
def get_only_text(url):
    page = urlopen(url)
    # Name the parser explicitly to avoid bs4's "no parser specified" warning.
    soup = BeautifulSoup(page, 'html.parser')
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    print(text)
    return soup.title.text, text


# Wikipedia URL to summarize
url = "https://en.wikipedia.org/wiki/Natural_language_processing"

# Call the function created above; note that `text` is a (title, body) tuple
text = get_only_text(url)


Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.  The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
 Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
 Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, a task that involves the automated inter

In [4]:
# `text` is a (title, body) tuple, so this slice returns both elements in full,
# not the first 1000 characters of the body
text[:1000]

('Natural language processing - Wikipedia',
 'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.  The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.\n Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.\n Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intell

In [7]:
# Count the characters in the title and body combined
len(''.join(text))

9155
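
Because `get_only_text` returns a `(title, text)` tuple, `text[:1000]` above slices the tuple rather than the article body, and the count of 9155 includes the title. A minimal sketch of unpacking the pair first; the `result` value is a made-up stand-in, not fetched data, and the variable names are illustrative:

```python
# Hypothetical stand-in for the (title, body) tuple returned by get_only_text(url).
result = ("Example page title", "Example sentence. " * 100)

# Slicing a 2-tuple returns its elements unchanged, not 1000 characters.
assert result[:1000] == result

# Unpack first, then slice and measure the body on its own.
title, body = result
preview = body[:1000]
body_length = len(body)

print(title)
print(len(preview), body_length)
```

With the real page, the same unpacking (`title, body = get_only_text(url)`) would let the later cells work on the body string directly.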

## 1.2 Generate a summary using gensim's summarize

In [8]:
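# Import summarize and keywords from gensim
# (the summarization module was removed in gensim 4.0, so this needs gensim < 4.0)
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords
# str() turns the (title, body) tuple into its repr, which is why escaped
# \n sequences show up in the summary output
text = str(text)
# Summarize the text with ratio 0.1 (keep about 10% of the sentences)
summarize(text, ratio=0.1)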

'Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, a task that involves the automated interpretation and generation of natural language, but at the time not articulated as a problem separate from artificial intelligence.\\n The premise of symbolic NLP is well-summarized by John Searle\\\'s Chinese room experiment: Given a collection of rules (e.g., a Chinese phrasebook, with questions and matching answers), the computer emulates natural language understanding (or other NLP tasks) by applying those rules to the data it is confronted with.\\n Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules.\nSuch models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system

In [9]:
print(keywords(text, ratio=0.1))


learning
learn
cognitive
cognition
statistical
nlp
computers
computing
computational
task
tasks
large
largely
rules
natural language processing
grammar
grammars
process
processes
linguistics
modeling
models
model
neural
technical
technically
features
feature
real
turing
results
systems
research
researched
answers
answering
hand
intelligence
intelligent
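
gensim's `summarize` ranks sentences with a variation of TextRank. For intuition, here is a minimal standard-library sketch of the basic idea: build a sentence-similarity graph, then run a PageRank-style iteration over it. This is not gensim's implementation. The word-overlap similarity, the naive sentence splitter, and every function name below are assumptions chosen for illustration.

```python
import math
import re


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased set of alphanumeric tokens.
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(words_a, words_b):
    # Shared-word count normalized by the log of each sentence's length.
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_summary(text, n_sentences=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    words = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Weighted, undirected similarity graph stored as a dense matrix.
    weights = [
        [similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    # PageRank-style power iteration over the similarity graph.
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    # Keep the top-scoring sentences, restored to document order.
    top = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences])
    return [sentences[i] for i in top]


sample = (
    "Natural language processing studies how computers process human language. "
    "Early language processing systems relied on hand-written rules. "
    "Statistical models later replaced many hand-written rules. "
    "The weather was pleasant yesterday."
)
summary = textrank_summary(sample, n_sentences=2)
for sentence in summary:
    print(sentence)
```

The unrelated weather sentence shares no words with the others, so it receives only the baseline score and is left out of the summary.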


# 2. Feature-based text summarization

## 2.1 Install the sumy package to extract features from text using Luhn's algorithm

In [10]:
# Install sumy
!pip install sumy


Collecting sumy
  Downloading sumy-0.9.0-py2.py3-none-any.whl (87 kB)
Collecting breadability>=0.1.20
  Downloading breadability-0.1.20.tar.gz (32 kB)
Collecting pycountry>=18.2.23
  Downloading pycountry-20.7.3.tar.gz (10.1 MB)
Building wheels for collected packages: breadability, pycountry
  Building whee

## 2.2 Import packages

In [16]:
# Import the packages
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.luhn import LuhnSummarizer
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True
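
`LuhnSummarizer` is imported above, but the summarization cell in section 2.3 uses `LsaSummarizer`. For readers curious about the feature-based idea behind Luhn's method, here is a simplified standard-library sketch: words that recur across the text are treated as significant, and each sentence is scored by its densest cluster of significant words. This is not sumy's implementation. The stop-word list, the thresholds, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "was"}


def significant_words(text, min_count=2):
    # Words that are not stop words and occur at least min_count times.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return {w for w, c in Counter(words).items() if c >= min_count}


def luhn_score(sentence, significant, max_gap=4):
    # Score = max over clusters of (significant words in cluster)^2 / cluster length,
    # where a cluster allows at most max_gap other words between significant ones.
    tokens = re.findall(r"[a-z]+", sentence.lower())
    positions = [i for i, t in enumerate(tokens) if t in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1
        else:
            # Close the current cluster and start a new one at pos.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))


sentences = [
    "Statistical models learn rules from large corpora.",
    "Hand-written rules were common in early systems.",
    "Statistical models make soft decisions.",
    "The weather was pleasant.",
]
significant = significant_words(" ".join(sentences))
ranked = sorted(sentences, key=lambda s: luhn_score(s, significant), reverse=True)
print(ranked[0])
```

In sumy itself, swapping `LsaSummarizer` for the imported `LuhnSummarizer` in the cell below should be enough to try the library's own version.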

## 2.3 Extract the data

The HTML parser fetches the page and extracts its text, which is tokenized with NLTK.
The LSA summarizer then stems the words, ignores stop words, and selects the highest-ranked sentences.

In [17]:
# Extracting and summarizing
LANGUAGE = "english"
SENTENCES_COUNT = 10
url="https://en.wikipedia.org/wiki/Natural_language_processing"
parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
summarizer = LsaSummarizer(Stemmer(LANGUAGE))
summarizer.stop_words = get_stop_words(LANGUAGE)
for sentence in summarizer(parser.document, SENTENCES_COUNT):
  print(sentence)

However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web ), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical.
This is increasingly important in medicine and healthcare, where NLP is being used to analyze notes and text in electronic health records that would otherwise be inaccessible for study when seeking to improve care.
The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora (the plural form of corpus , is a set of documents, possibly with human or computer annotations) of typical real-world examples.
Increasingly, however, research has focused on statistical models , which make soft, probabilistic decisions based on attaching real-valued weights to each input feature (complex-valued embeddings , [17] and neural networks in general have al