<a href="https://colab.research.google.com/github/rayehaarika597/Extractive-summarisation-using-web-scraping/blob/main/Copy_of_Text_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**IMPORTING ALL LIBRARIES NEEDED** 

In [1]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In this script, we first begin with importing the required libraries for web scraping i.e. BeautifulSoup. The urllib package is required for parsing the URL. Re is the library for regular expressions that are used for text pre-processing. The urlopen function will be used to scrape the data. The read() will read the data on the URL. Further on, we will parse the data with the help of the BeautifulSoup object and the lxml parser.

In most of the websites, the text is present in the paragraph tags. Hence we are using the find_all function to retrieve all the text which is wrapped within the paragraph tags.

In [2]:
import bs4 as bs
import urllib.request
import re
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Seventeen_(South_Korean_band)')
article = scraped_data.read()
parsed_article = bs.BeautifulSoup(article,'lxml') #lxml provides a very simple and powerful API for parsing XML and HTML webpages.
paragraphs = parsed_article.find_all('p') #p here means the paragraph tag in the given html pages
article_text = ""
for p in paragraphs:
    article_text += p.text

In [3]:
print(article)

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Seventeen (South Korean band) - Wikipedia</title>\n<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled";(function(){var cookie=

**TEXT PREPROCESSING**

The article_text will contain text without brackets which is the original text. We are not removing any other words or punctuation marks as we will use them directly to create the summaries.

The below code is used to create weighted frequencies and also to clean the text

In [4]:
# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text )
formatted_article_text = re.sub(r's+', ' ', formatted_article_text)

In [5]:
print(formatted_article_text)

 Seventeen  Korean        tylized in all cap  or a  SVT  i  a South Korean boy band formed by Pledi  Entertainment  The group con i t  of thirteen member   S Coup   Jeonghan  Jo hua  Jun  Ho hi  Wonwoo  Woozi  DK  Mingyu  The   Seungkwan  Vernon  and Dino  The group debuted on May           with the extended play  EP     Carat     which became the longe t charting K pop album of the year in the US    and the only rookie album to appear on Billboard       Be t K Pop Album  of       li t        Seventeen ha  relea ed four  tudio album   twelve EP  and three rei ue   Seventeen i  con idered a   elf producing  idol group  with the member  actively involved in  ongwriting and choreographing  among other a pect  of their mu ic and performance         They perform a  one group and are divided into three unit  hip hop  vocal  and performance each with a different area of  pecialization  They have been labeled  Performance King     Theater Kid  of K Pop   and  K Pop Performance Powerhou e   by 

**Convert text to sentences**


The sentences are broken down into words so that we have separate entities.

We are tokenizing the article_text object as it is unfiltered data while the formatted_article_text object has formatted data devoid of punctuations etc.

In [6]:
sentence_list = nltk.sent_tokenize(article_text)

All English stopwords from the nltk library are stored in the stopwords variable. Iterate over all the sentences, check if the word is a stopword. If the word is not a stopword, then check for its presence in the word_frequencies dictionary. If it doesn’t exist, then insert it as a key and set its value to 1. If it is already existing, just increase its count by 1.

In [7]:
stopwords = nltk.corpus.stopwords.words('english')
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

To find the weighted frequency, divide the frequency of the word by the frequency of the most occurring word.

In [8]:
maximum_frequncy = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)

**Calculate sentence scores**


We have calculated the weighted frequencies. Now scores for each sentence can be calculated by adding weighted frequencies for each word.

The sentence_scores dictionary has been created which will store the sentences as keys and their occurrence as values. Iterate over all the sentences, tokenize all the words in a sentence. If the word exists in word_frequences and also if the sentence exists in sentence_scores then increase its count by 1 else insert it as a key in the sentence_scores and set its value to 1. We are not considering longer sentences hence we have set the sentence length to 30.

In [9]:
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys(): # ?
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

In [10]:
print(sentence_scores)

{'\nSeventeen (Korean:\xa0세븐틴; stylized in all caps or as SVT) is a South Korean boy band formed by Pledis Entertainment.': 0.0425531914893617, 'The group consists of thirteen members: S.Coups, Jeonghan, Joshua, Jun, Hoshi, Wonwoo, Woozi, DK, Mingyu, The8, Seungkwan, Vernon, and Dino.': 0.4361702127659574, '[4][5] Seventeen has released four studio albums, twelve EPs and three reissues.': 0.1702127659574468, 'Seventeen is considered a "self-producing" idol group, with the members actively involved in songwriting and choreographing, among other aspects of their music and performances.': 0.48936170212765956, '[6][7] They perform as one group and are divided into three units—hip-hop, vocal, and performance—each with a different area of specialization.': 0.7872340425531915, 'They have been labeled "Performance Kings", "Theater Kids of K-Pop", and "K-Pop Performance Powerhouse\'" by various domestic and international media outlets.': 0.22340425531914893, '[12][13]\nBeginning in 2013, Sevent

The sentence_scores dictionary consists of the sentences along with their scores. Now, top N sentences can be used to form the summary of the article.
Here the heapq library has been used to pick the top 7 sentences to summarize the article.

In [11]:
import heapq
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)

[66][67] The album sold 700,000 copies in its first week[68][69] and won the group their first Daesang (grand prize) for Album of the Year. [38] The lead single "Don't Wanna Cry" became one of the group's most popular tracks, with its music video becoming Seventeen's first to reach 200 million views on YouTube. [96] Attacca sold two million copies, making it the group's first double million-selling album. In addition to success on domestic charts, the album charted on the Oricon Weekly Pop Album Chart in Japan. The group's donation went to the UNESCO Korean National Commission's Global Education Sharing Project to help those such as children and teenagers in Africa and Asia receive an education. [40][41][42] The group completed their first world tour, 2017 Seventeen 1st World Tour "Diamond Edge", on October 6. [6][7] They perform as one group and are divided into three units—hip-hop, vocal, and performance—each with a different area of specialization.
