## Automatic Text Summarization Using Python

##### Importing Necessary Libraries
> <em>bs4</em>, or Beautiful Soup, is a parser for HTML and XML<br>
> <em>urllib.request</em> is an extensible library for opening URLs<br>
> <em>re</em> is Python's regular expression module<br>
> <em>nltk</em> is a suite of libraries and programs for NLP using Python<br>
> <em>operator</em> is a module for efficient implementations of common operations<br>
> <em>nltk.download</em> downloads NLTK data packages (here the <em>punkt</em> tokenizer models and the <em>stopwords</em> corpus)

In [None]:
import bs4 as bs
import urllib.request as url
import re
import nltk
import operator

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

##### Getting Data
><p>The <em>data</em> variable stores the content read from the URL. The website HTML is parsed, and the text enclosed in <em>p</em> tags makes up the article paragraphs. The text used for summarization is a single string built by concatenating those paragraphs.</p>

In [None]:
# https://en.wikipedia.org/wiki/AI
# https://en.wikipedia.org/wiki/Russo-Ukrainian_War
# https://en.wikipedia.org/wiki/Uno_(card_game)

data = url.urlopen('https://en.wikipedia.org/wiki/AI').read()

parsed = bs.BeautifulSoup(data,'html.parser')

paragraphs = parsed.find_all('p') 

raw_text = ""
for paragraph in paragraphs:
    raw_text += paragraph.text
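
> To see what "the text enclosed in <em>p</em> tags" means without a network request, here is a minimal sketch on an inline HTML string. It swaps Beautiful Soup for the standard-library `html.parser` so it runs with no extra installs; the `ParagraphText` class and the sample HTML are hypothetical, for illustration only.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collects the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # how many <p> tags are currently open
        self.paragraphs = []    # one string per <p> element

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.depth += 1
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside <p> (headings, divs) is ignored
        if self.depth > 0:
            self.paragraphs[-1] += data


sample_html = (
    '<html><body><h1>Title</h1>'
    '<p>First <b>bold</b> paragraph.</p>'
    '<div>skip me</div>'
    '<p>Second paragraph.</p>'
    '</body></html>'
)

parser = ParagraphText()
parser.feed(sample_html)
paragraphs = parser.paragraphs
raw_text = ' '.join(paragraphs)
print(raw_text)
```

> Text inside nested tags such as <em>b</em> is kept, while the heading and the <em>div</em> are dropped, which is the same effect `find_all('p')` plus `.text` has on the real page.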


##### Preprocessing/Cleaning the Text
><p>We remove bracketed citation markers and collapse extra whitespace in the text.</p>
><p>The raw text keeps the original article's wording, while the formatted text replaces punctuation and special characters with spaces so that the word frequency table can be built.</p>
><p>The raw text is tokenized into sentences, and the formatted text is tokenized into words. The English stopwords to ignore are also stored. We initialize the frequency and score dictionaries as well.</p>

In [None]:
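# Remove bracketed citation markers such as [1]; [^\]]* keeps each match
# inside a single pair of brackets instead of spanning to the last ']'
raw_text = re.sub(r'\[[^\]]*\]', ' ', raw_text)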
raw_text = re.sub(r'\s+', ' ', raw_text)
sentences = nltk.sent_tokenize(raw_text)

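# Lowercase first so words match the lowercase stopword list and the
# lowercased sentence tokens used later for scoring
formatted_text = re.sub(r'[^a-zA-Z0-9]', ' ', raw_text.lower())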
formatted_text = re.sub(r'\s+', ' ', formatted_text)
words = nltk.word_tokenize(formatted_text)

stopwords = nltk.corpus.stopwords.words('english')

frequencies = {}
scores = {}
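
> The effect of the cleaning steps is easier to see on a short string. This is a minimal sketch using only `re`; the sample sentence is made up, and the citation pattern is written so that each match stays within one pair of brackets.

```python
import re

# Hypothetical sample standing in for scraped article text
sample = "AI[1] was founded in 1956.[2][3]   It uses search, logic & statistics!"

# Drop bracketed citation markers, one pair of brackets per match
no_citations = re.sub(r'\[[^\]]*\]', ' ', sample)
# Collapse runs of whitespace into single spaces
cleaned = re.sub(r'\s+', ' ', no_citations)

# Keep only letters and digits for the word-frequency pass
letters_only = re.sub(r'[^a-zA-Z0-9]', ' ', cleaned)
letters_only = re.sub(r'\s+', ' ', letters_only).strip()

# For comparison: a greedy pattern removes everything between
# the first '[' and the last ']' on the line
greedy = re.sub(r'\[.*\]', ' ', sample)

print(cleaned)
print(letters_only)
print(greedy)
```

> The greedy variant loses the words "was founded in 1956", which is why the per-bracket pattern is the safer choice for text with several citations per paragraph.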

##### Getting Word Frequency
>We loop through the words, skipping stopwords, and increment each word's count in the frequency table every time it is encountered
<br><br>
>Dividing each table entry by the maximum frequency gives the weighted frequency of each word, so the most common word has a weight of 1

In [None]:
for word in words:
    if word not in stopwords:
        if word not in frequencies.keys(): 
            frequencies[word] = 1
        else: 
            frequencies[word] += 1

max_frequency = max(frequencies.values())

for word in frequencies.keys():
    frequencies[word] /= max_frequency
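
> As a small worked example of the weighting, here is a sketch that uses `collections.Counter` in place of the manual dictionary updates. The token list and stopword set are made up for illustration.

```python
from collections import Counter

# Hypothetical tokens and stopwords, for illustration only
words = ['machine', 'learning', 'is', 'a', 'field', 'of',
         'machine', 'intelligence', 'machine', 'learning']
stopwords = {'is', 'a', 'of'}

# Count every non-stopword
counts = Counter(w for w in words if w not in stopwords)

# Normalize by the largest count so weights fall in (0, 1]
max_count = max(counts.values())
weighted = {word: count / max_count for word, count in counts.items()}

print(weighted)
```

> Here "machine" appears 3 times and gets weight 1, "learning" appears twice and gets 2/3, and the words seen once get 1/3.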

##### Finding the Most Important Sentences
> We loop through the sentences and tokenize each one into lowercase words. Only sentences shorter than <em>n</em> words (30 in our case) are scored, and only words that appear in the frequency table contribute
<br><br>
> A sentence's score starts at the weighted frequency of its first scored word, and the weighted frequency of each further scored word is added to it. The score is therefore the sum of the weighted frequencies of the sentence's words, so sentences with many high-frequency words score high, and vice versa.

In [None]:
for sentence in sentences:
    # Skip long sentences (30 or more space-separated words)
    if len(sentence.split(' ')) >= 30:
        continue
    for word in nltk.word_tokenize(sentence.lower()):
        # Sum the weighted frequency of every word found in the table
        if word in frequencies:
            scores[sentence] = scores.get(sentence, 0) + frequencies[word]
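
> The scoring rule can be traced by hand on a toy input. In this sketch the weights and sentences are made up, and a simple regex tokenizer stands in for `nltk.word_tokenize` so the example has no dependencies.

```python
import re

# Hypothetical weighted frequencies and sentences, for illustration only
weighted = {'machine': 1.0, 'learning': 0.5, 'field': 0.25}
sentences = [
    'Machine learning is a field.',
    'It is old.',
    'The machine works.',
]
MAX_WORDS = 30  # same cutoff as the notebook

scores = {}
for sentence in sentences:
    # Skip sentences with 30 or more space-separated words
    if len(sentence.split(' ')) >= MAX_WORDS:
        continue
    # Simple regex tokenizer standing in for nltk.word_tokenize
    for word in re.findall(r'[a-z0-9]+', sentence.lower()):
        if word in weighted:
            scores[sentence] = scores.get(sentence, 0.0) + weighted[word]

print(scores)
```

> The first sentence scores 1.0 + 0.5 + 0.25 = 1.75, the third scores 1.0, and the second gets no entry at all because none of its words are in the table.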

##### Summarizing
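>The scores are sorted in descending order. The loop then takes the top-scoring sentences, joins them with spaces, and outputs them as the summary. The intended summary length is <em>n</em> = 3 or 5 sentences; as written, the counter runs from 0 through 5, so the cell below keeps 6.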

In [None]:
scores = dict( sorted(scores.items(), key=operator.itemgetter(1), reverse=True))
count = 0
summary_text = ''
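for sentence in scores.keys():
    # count runs from 0 through 5, so this keeps the top 6 sentences;
    # use `count < 3` or `count < 5` for a 3- or 5-sentence summary
    if count <= 5:
        summary_text += sentence + ' '
        count += 1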

summary_text

' Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by animals including humans. A machine with general intelligence can solve a wide variety of problems with breadth and versatility similar to human intelligence. A superintelligence, hyperintelligence, or superhuman intelligence, is a hypothetical agent that would possess intelligence far surpassing that of the brightest and most gifted human mind. AI founder John McCarthy said: "Artificial intelligence is not, by definition, simulation of human intelligence". Data analysis is a fundamental property of artificial intelligence that enables it to be used in every facet of life from search results to the way people buy product. By 2000, solutions developed by AI researchers were being widely used, although in the 1990s they were rarely described as "artificial intelligence". '
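
> An alternative way to pick the top <em>n</em> sentences, sketched here with the standard-library `heapq.nlargest`, avoids the manual counter. The scores and the value of `n` are made up for illustration.

```python
import heapq

# Hypothetical sentence scores, for illustration only
scores = {
    'Sentence A.': 1.75,
    'Sentence B.': 0.5,
    'Sentence C.': 1.0,
    'Sentence D.': 2.25,
}
n = 3  # desired summary length in sentences

# Keys of the n highest-scoring entries, best first
top_sentences = heapq.nlargest(n, scores, key=scores.get)
summary_text = ' '.join(top_sentences)

print(summary_text)
```

> As in the notebook's loop, the sentences come out in score order rather than in the order they appear in the article.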