<a href="https://colab.research.google.com/github/trungduyen0220/nlp-text-summarization/blob/master/Text_Summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarizer

### Import the required libraries

In [1]:
import re
import string
import random
import urllib.request
from collections import defaultdict
from heapq import nlargest
from string import digits, punctuation

import numpy as np
import pandas as pd
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

### Crawl data using Beautiful Soup

In [2]:
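url = "https://en.wikipedia.org/wiki/Blackpink"
# Download the page and decode it, ignoring bytes that are not valid UTF-8
request = urllib.request.urlopen(url).read().decode('utf8', 'ignore')
soup = BeautifulSoup(request, 'html.parser')
# Collect every <p> element; the article body lives in these paragraphs
text_p = soup.find_all('p')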

### Data preprocessing

#### 1. Handling strange data

In [3]:
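# Concatenate the text of every <p> element into one string
text = "".join(p.text for p in text_p)

# Normalize the stylized and Hangul spellings of the group's name
text = text.replace("BLΛƆKPIИK", "Blackpink").replace("blλɔkpiиk", "Blackpink").replace("블랙핑크", "Blackpink")
# Remove bracketed citation markers such as [1]
text = re.sub(r'\[.*?\]', "", text)
text = text.replace("\n", " ")
# Add a space after every period so that sentences joined without one can be split later
text = text.replace(".", ". ")
print(text)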

 Blackpink (Hangul: Blackpink; commonly stylized as BLACKPINK or Blackpink) is a South Korean girl group formed by YG Entertainment, consisting of members Jisoo, Jennie, Rosé, and Lisa.  The group debuted in August 2016 with their single album Square One, which featured "Whistle" and "Boombayah", their first number-one hits on South Korea's Gaon Digital Chart and the Billboard World Digital Song Sales chart, respectively.  Blackpink is the highest-charting female Korean act on the Billboard Hot 100, peaking at number 13 with their 2020 single "Ice Cream", and on the Billboard 200, peaking at number 24 with Kill This Love (2019).  They were the first Korean girl group to enter and top Billboard's Emerging Artists chart and to top the Billboard's World Digital Song Sales chart three times.  Blackpink is also the first female Korean act to receive a certification from the Recording Industry Association of America (RIAA) with their hit single "Ddu-Du Ddu-Du" (2018), which currently has the
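
The cleanup steps in the cell above can be tried on a short string without fetching the page. This is a minimal sketch: the helper name `clean_text` and the sample string are hypothetical and not part of the notebook.

```python
import re

def clean_text(raw: str) -> str:
    """Apply the same cleanup steps as the cell above to an arbitrary string."""
    # Drop bracketed citation markers such as [1] or [citation needed]
    no_refs = re.sub(r'\[.*?\]', "", raw)
    # Turn newlines into spaces so the paragraphs form one continuous string
    one_line = no_refs.replace("\n", " ")
    # Pad every period with a space; note this also touches decimals and abbreviations
    return one_line.replace(".", ". ")

# Made-up sample text with two citation markers and a line break
sample = "The group debuted in 2016.[1]\nThey released an album.[2]"
print(repr(clean_text(sample)))
```

Printing with `repr` makes the extra spaces introduced by the last step visible, which is why the notebook output shows two spaces between sentences.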

#### 2. Lemmatization and vocabulary processing

In [4]:
tokens = word_tokenize(text)
tags = nltk.pos_tag(tokens)
lemmatizer = WordNetLemmatizer()

# Keep only nouns, verbs, and adjectives, lemmatized with the matching WordNet POS.
# Tokens with any other tag are dropped.
words = ""
for token, pos_tag in tags:
    if pos_tag.startswith('N'):
        words += lemmatizer.lemmatize(token, pos=NOUN) + " "
    elif pos_tag.startswith('V'):
        words += lemmatizer.lemmatize(token, pos=VERB) + " "
    elif pos_tag.startswith('J'):
        words += lemmatizer.lemmatize(token, pos=ADJ) + " "

# Lowercase each word, replace digits and symbols with spaces, then strip remaining punctuation
temp = []
for t in words.split():
    t = t.lower().translate({ord(c): " " for c in "1234567890!@#$%^&*()[]{}/<>?\\|`~=_+'"}).translate(str.maketrans('', '', string.punctuation))
    s = ' '.join(t.split())
    if s != '':
        temp.append(s)

print(temp[:])

['blackpink', 'hangul', 'blackpink', 'stylize', 'blackpink', 'blackpink', 'be', 'south', 'korean', 'girl', 'group', 'form', 'yg', 'entertainment', 'consist', 'member', 'jisoo', 'jennie', 'rosé', 'lisa', 'group', 'debut', 'august', 'single', 'album', 'square', 'one', 'feature', 'whistle', 'boombayah', 'first', 'numberone', 'hit', 'south', 'korea', 'gaon', 'digital', 'chart', 'billboard', 'world', 'digital', 'song', 'sales', 'chart', 'blackpink', 'be', 'highestcharting', 'female', 'korean', 'act', 'billboard', 'hot', 'peak', 'number', 'single', 'ice', 'cream', 'billboard', 'peak', 'number', 'kill', 'love', 'be', 'first', 'korean', 'girl', 'group', 'enter', 'top', 'billboard', 'emerging', 'artists', 'chart', 'top', 'billboard', 'world', 'digital', 'song', 'sales', 'chart', 'time', 'blackpink', 'be', 'first', 'female', 'korean', 'act', 'receive', 'certification', 'recording', 'industry', 'association', 'america', 'riaa', 'hit', 'single', 'ddudu', 'ddudu', 'have', 'mostviewed', 'music', 'vi
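
The branching in the lemmatization cell maps Penn Treebank tags (as returned by `nltk.pos_tag`) to WordNet parts of speech. That mapping can be isolated in a small function, sketched below. The name `penn_to_wordnet` is hypothetical. The returned letters are the values of NLTK's `NOUN`, `VERB`, and `ADJ` constants, written out here so the sketch runs without NLTK installed.

```python
from typing import Optional

def penn_to_wordnet(tag: str) -> Optional[str]:
    """Map a Penn Treebank tag to a WordNet POS letter, or None to drop the token."""
    if tag.startswith('N'):
        return 'n'  # noun, same value as nltk's NOUN constant
    if tag.startswith('V'):
        return 'v'  # verb, same value as nltk's VERB constant
    if tag.startswith('J'):
        return 'a'  # adjective, same value as nltk's ADJ constant
    return None     # every other tag is discarded, as in the cell above

# A few example tags: proper noun, past-tense verb, superlative adjective, preposition
for tag in ["NNP", "VBD", "JJS", "IN"]:
    print(tag, penn_to_wordnet(tag))
```

With such a helper, the loop could call `lemmatizer.lemmatize(token, pos=...)` once for any tag that does not map to `None`.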

#### 3. Cleaning stop words

In [5]:
# Define irrelevant tokens: stop words, punctuation marks, and digits
stopword = set(stopwords.words('english') + list(punctuation) + list("0123456789"))
# Keep every token that is not in the stop list, preserving order
clean_token = [token for token in temp if token not in stopword]

# Sanity check: "be" is a stop word, so this loop should print nothing
for i in clean_token:
    if i == "be":
        print(i)

### Bag of words

A frequency distribution counts how often each token occurs in the cleaned vocabulary.

In [6]:
freq = nltk.FreqDist(clean_token)
top_words = freq.most_common(100)  # 100 most common words

# Show the 10 most frequent words with their counts
for pair in top_words[:10]:
    print(pair)

('blackpink', 78)
('group', 73)
('music', 55)
('first', 47)
('release', 39)
('debut', 36)
('video', 36)
('chart', 33)
('girl', 31)
('korean', 30)
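
`nltk.FreqDist` is a subclass of `collections.Counter`, so the same counting can be illustrated with the standard library alone. The token list below is invented for illustration.

```python
from collections import Counter

# Made-up tokens standing in for clean_token
toy_tokens = ["group", "music", "group", "debut", "music", "group"]

# Count occurrences of each token
toy_freq = Counter(toy_tokens)

# most_common(n) returns the n most frequent (token, count) pairs
print(toy_freq.most_common(2))  # [('group', 3), ('music', 2)]
```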


Tokenize the text from the web page into sentences

In [7]:
sentences = sent_tokenize(text) 
sentences[0]

' Blackpink (Hangul: Blackpink; commonly stylized as BLACKPINK or Blackpink) is a South Korean girl group formed by YG Entertainment, consisting of members Jisoo, Jennie, Rosé, and Lisa.'

Score each sentence by summing the frequencies of the words it contains, then keep the indices of the 10 highest-scoring sentences.

In [8]:
ranking = defaultdict(int)
for i, sent in enumerate(sentences):
    for word in word_tokenize(sent.lower()):
        if word in freq:
            ranking[i] += freq[word]

# Select the top 10 once, after every sentence has been scored
top_sentences = nlargest(10, ranking, ranking.get)
print(top_sentences)

[63, 163, 57, 164, 0, 110, 55, 11, 4, 46]
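
The scoring step can be checked by hand on a toy input. The sketch below assumes a simplified whitespace tokenizer in place of `word_tokenize`. The function name `rank_sentences`, the sentences, and the frequencies are all made up for illustration.

```python
from collections import Counter, defaultdict
from heapq import nlargest

def rank_sentences(sentences, freq, n):
    """Return the indices of the n sentences with the highest summed word frequency."""
    scores = defaultdict(int)
    for i, sent in enumerate(sentences):
        # Simplified tokenizer (an assumption): split on whitespace, trim '.' and ','
        for word in sent.lower().split():
            word = word.strip(".,")
            if word in freq:
                scores[i] += freq[word]
    # Indices ordered from highest to lowest score
    return nlargest(n, scores, key=scores.get)

toy_sentences = [
    "The group released new music.",
    "It rained on Tuesday.",
    "The music video broke a record for the group.",
]
toy_freq = Counter({"group": 3, "music": 2, "video": 1, "record": 1})

print(rank_sentences(toy_sentences, toy_freq, 2))  # [2, 0]
```

Sentence 2 scores 3 + 2 + 1 + 1 = 7 and sentence 0 scores 3 + 2 = 5, while sentence 1 contains no counted word and never enters the ranking. Because scores are sums, longer sentences tend to rank higher under this scheme.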


In [9]:
sentences[63]

'The music video for the song later went on to break the record for the most-liked music video by a Korean girl group on YouTube as well as the most-viewed K-pop group music video in the first 24 hours of release.'

In [10]:
# Join the selected sentences in their original document order
result = " ".join(sentences[j] for j in sorted(top_sentences))

result.strip()

'Blackpink (Hangul: Blackpink; commonly stylized as BLACKPINK or Blackpink) is a South Korean girl group formed by YG Entertainment, consisting of members Jisoo, Jennie, Rosé, and Lisa. Blackpink is also the first female Korean act to receive a certification from the Recording Industry Association of America (RIAA) with their hit single "Ddu-Du Ddu-Du" (2018), which currently has the most-viewed music video by a Korean group on YouTube. Blackpink\'s other accolades include the New Artist of the Year Awards at the 31st Golden Disc Awards and the 26th Seoul Music Awards, as well as recognition as the most powerful celebrities in South Korea by Forbes Korea in 2019, and as the first female Korean group on Forbes\' 30 Under 30 Asia. The first girl group to debut under YG Entertainment in seven years, Blackpink released their debut single album, Square One, on August 8, 2016, consisting of tracks "Boombayah" and "Whistle". "Playing with Fire" was Blackpink\'s second single to reach number o