<h1 align="center">Topic: Building a Text Summarizer</h1>

<h2>Importing required libraries</h2>

In [1]:
!pip install newspaper3k
!pip install gTTS



In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en import English
import numpy as np
import newspaper

<h2>Create a blank spaCy pipeline for sentence tokenization</h2>

In [3]:
nlp = English()
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x1263619f3c0>

In [4]:
nlp

<spacy.lang.en.English at 0x1262069baf0>

In [5]:

# Assign the article URL
url = 'https://timesofindia.indiatimes.com/sports/cricket/ipl/top-stories/ipl-2022-delhis-next-game-shifted-to-wankhede-from-pune-after-tim-seifert-tests-covid-19-positive/articleshow/90963118.cms'

# Download and parse the article (newspaper is already imported above)
url_i = newspaper.Article(url=url, language='en')
url_i.download()
url_i.parse()

In [6]:
print(url_i.text)
print("Length of the article", len(url_i.text))
text_corpus = url_i.text

MUMBAI: The cloud over the IPL game between Delhi Capitals and Punjab Kings was lifted an hour before the game on Wednesday after the BCCI confirmed that the match will go on despite a sixth COVID positive case being reported in the Delhi camp on game day.New Zealand cricketer Tim Seifert testing positive on the morning of the game raised serious doubts over the fixture but the rest of the Delhi squad members returning two negative tests put the match back on track."The entire Delhi Capitals contingent underwent 2 rounds of COVID testing today. Match No. 32 involving Delhi Capitals and Punjab Kings scheduled today at Brabourne – CCI will go ahead as per the schedule after the second round of COVID tests returned negative today," said the BCCI in a statement.High drama unfolded in the lead up to the game with Seifert becoming the sixth member from the Delhi contingent to test positive. Australian all-rounder Mitchell Marsh had tested positive on Monday.Punjab had reached the Brabourne S

<h2>Create a spaCy document for sentence-level tokenization</h2>

In [7]:
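# Strip newlines and run the pipeline so the sentencizer can mark sentence boundaries.
# Note: removing "\n" without inserting a space joins adjacent paragraphs directly.
doc = nlp(text_corpus.replace("\n", ""))
# Collect each detected sentence as a stripped string
sentences = [sent.text.strip() for sent in doc.sents]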

<h2>Peeking into our tokenized sentences</h2>

In [8]:
print("Sentences are: \n", sentences)

Sentences are: 
 ['MUMBAI: The cloud over the IPL game between Delhi Capitals and Punjab Kings was lifted an hour before the game on Wednesday after the BCCI confirmed that the match will go on despite a sixth COVID positive case being reported in the Delhi camp on game day.', 'New Zealand cricketer Tim Seifert testing positive on the morning of the game raised serious doubts over the fixture but the rest of the Delhi squad members returning two negative tests put the match back on track.', '"The entire Delhi Capitals contingent underwent 2 rounds of COVID testing today.', 'Match No.', '32 involving Delhi Capitals and Punjab Kings scheduled today at Brabourne – CCI will go ahead as per the schedule after the second round of COVID tests returned negative today," said the BCCI in a statement.', 'High drama unfolded in the lead up to the game with Seifert becoming the sixth member from the Delhi contingent to test positive.', 'Australian all-rounder Mitchell Marsh had tested positive on M

<h2>Creating sentence organizer</h2>

In [9]:
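# Map each sentence to its position in the article so the original order can be restored later
sentence_organizer = {sentence: index for index, sentence in enumerate(sentences)}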
print("Sentence organizer: \n", sentence_organizer)

Sentence organizer: 
 {'MUMBAI: The cloud over the IPL game between Delhi Capitals and Punjab Kings was lifted an hour before the game on Wednesday after the BCCI confirmed that the match will go on despite a sixth COVID positive case being reported in the Delhi camp on game day.': 0, 'New Zealand cricketer Tim Seifert testing positive on the morning of the game raised serious doubts over the fixture but the rest of the Delhi squad members returning two negative tests put the match back on track.': 1, '"The entire Delhi Capitals contingent underwent 2 rounds of COVID testing today.': 2, 'Match No.': 3, '32 involving Delhi Capitals and Punjab Kings scheduled today at Brabourne – CCI will go ahead as per the schedule after the second round of COVID tests returned negative today," said the BCCI in a statement.': 4, 'High drama unfolded in the lead up to the game with Seifert becoming the sixth member from the Delhi contingent to test positive.': 5, 'Australian all-rounder Mitchell Marsh h
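
One caveat with keying this lookup on the sentence text: if an article repeats a sentence verbatim, the later occurrence overwrites the earlier one and only the last index survives. The sketch below illustrates this on a made-up list (`toy_sentences` is hypothetical and not taken from the article) and shows that keeping (index, sentence) pairs avoids the collision.

```python
# Hypothetical toy data: the first and third sentences are identical.
toy_sentences = [
    "Rain stopped play.",
    "Play resumed.",
    "Rain stopped play.",
    "Delhi won.",
]

# Same construction as sentence_organizer: sentence text -> position.
organizer = {sentence: index for index, sentence in enumerate(toy_sentences)}

# The duplicate collapses into one key that keeps only the last index.
print(len(toy_sentences), len(organizer))
print(organizer["Rain stopped play."])

# Keeping (index, sentence) pairs preserves every occurrence.
indexed_sentences = list(enumerate(toy_sentences))
print(len(indexed_sentences))
```

For the article used here this does not appear to matter, but it is worth keeping in mind for pages that repeat captions or boilerplate lines.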

<h2>Creating TF-IDF model</h2>

In [10]:
# TF-IDF model
tf_idf_vectorizer = TfidfVectorizer(min_df=2,              # ignore terms found in fewer than 2 sentences
                                    ngram_range=(1, 3),    # use unigrams, bigrams and trigrams
                                    sublinear_tf=True,     # replace tf with 1 + log(tf)
                                    stop_words='english')  # drop common English stop words

In [11]:
# Fit the vectorizer, treating each sentence as one document
tf_idf_vectorizer.fit(sentences)

TfidfVectorizer(min_df=2, ngram_range=(1, 3), stop_words='english',
                sublinear_tf=True)
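
To make the weighting more concrete, here is a minimal pure-Python sketch of the formulas that scikit-learn documents for these settings: sublinear term frequency `1 + ln(tf)`, smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. It is not scikit-learn's implementation; it ignores n-grams, stop words and `min_df`, and the helper names and `toy_docs` are assumptions made up for illustration.

```python
import math
from collections import Counter


def sublinear_tf(count):
    """Dampened term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0


def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


def tfidf_row(tokens, doc_freq, n_docs):
    """L2-normalised TF-IDF weights for one tokenised sentence."""
    counts = Counter(t for t in tokens if t in doc_freq)
    raw = {
        term: sublinear_tf(count) * smooth_idf(n_docs, doc_freq[term])
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()} if norm else {}


# Hypothetical, already-tokenised "sentences".
toy_docs = [
    ["delhi", "match", "match"],
    ["delhi", "covid"],
    ["match", "covid", "covid"],
]

# Document frequency: in how many sentences each term occurs.
doc_freq = Counter(term for doc in toy_docs for term in set(doc))
rows = [tfidf_row(doc, doc_freq, len(toy_docs)) for doc in toy_docs]

for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first toy sentence, "match" occurs twice and "delhi" once with the same document frequency, so "match" receives the larger weight, but by a factor of `1 + ln(2)` rather than 2.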

In [12]:
# Transforming our sentences to TF-IDF vectors
sentence_vectors = tf_idf_vectorizer.transform(sentences)
print(sentence_vectors)

  (0, 59)	0.24872847042701354
  (0, 47)	0.24872847042701354
  (0, 39)	0.24872847042701354
  (0, 38)	0.22402990004556514
  (0, 36)	0.17598481399937277
  (0, 28)	0.2048722005047843
  (0, 24)	0.2048722005047843
  (0, 22)	0.24872847042701354
  (0, 20)	0.22402990004556514
  (0, 19)	0.32404366043231375
  (0, 17)	0.24872847042701354
  (0, 16)	0.2048722005047843
  (0, 15)	0.2322663005572307
  (0, 13)	0.17598481399937277
  (0, 9)	0.2048722005047843
  (0, 8)	0.24872847042701354
  (0, 7)	0.24872847042701354
  (0, 6)	0.2048722005047843
  (0, 4)	0.18921922024031018
  (1, 60)	0.29921366884846384
  (1, 54)	0.29921366884846384
  (1, 53)	0.29921366884846384
  (1, 52)	0.2695019521058735
  (1, 51)	0.2695019521058735
  (1, 46)	0.24645575415172485
  :	:
  (15, 16)	0.1676279374785198
  (15, 15)	0.19004199111565329
  (15, 13)	0.14399206591021557
  (15, 11)	0.20351145927626052
  (15, 10)	0.20351145927626052
  (15, 9)	0.1676279374785198
  (15, 6)	0.1676279374785198
  (15, 4)	0.15482055418951912
  (15, 3)	0.203

<h2>Performing sentence scoring</h2>

In [13]:
# Score each sentence by summing the TF-IDF weights in its row
sentence_scores = np.array(sentence_vectors.sum(axis=1)).ravel()
print(sentence_scores)

# Sanity check: there should be exactly one score per sentence
print(len(sentences) == len(sentence_scores))

[4.3110189  3.68888954 2.60954908 1.         4.3935779  2.5848844
 2.21900428 2.6091297  2.59786213 2.20654433 2.96510694 1.69688925
 2.61740099 2.7744219  2.43557758 5.01111891 2.95820572 1.40764955
 1.7300246 ]
True
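
A note on reading these scores: `TfidfVectorizer` L2-normalises each row by default, so a sentence with a single in-vocabulary term scores exactly 1, which is consistent with the 1.0 reported for the fragment 'Match No.'. More generally, a row with k equal non-zero weights sums to the square root of k, so summing rows tends to favour sentences that contain more vocabulary terms. The sketch below shows this on a hypothetical dense matrix (`raw` is made up for illustration).

```python
import numpy as np

# Hypothetical raw weights: row 0 has one vocabulary term, row 1 has four.
raw = np.array([
    [0.7, 0.0, 0.0, 0.0],
    [0.7, 0.7, 0.7, 0.7],
])

# L2-normalise each row, mirroring the vectorizer's default norm="l2".
unit_rows = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Same scoring rule as the notebook: sum the weights in each row.
toy_scores = unit_rows.sum(axis=1)
print(toy_scores)  # roughly 1.0 for row 0 and 2.0 for row 1
```

If this bias toward long sentences is unwanted, one possible adjustment is to divide each score by the number of non-zero entries in its row; that is a design choice rather than part of this notebook.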


In [14]:
# Getting top-n sentences
N = 3
top_n_sentences = [sentences[ind] for ind in np.argsort(sentence_scores, axis=0)[::-1][:N]]
#print(top_n_sentences)

<h2>Performing final summarization</h2>

In [15]:
# Let's now do the sentence ordering using our prebaked sentence_organizer
# Let's map the scored sentences with their indexes
mapped_top_n_sentences = [(sentence,sentence_organizer[sentence]) for sentence in top_n_sentences]
print("Our top_n_sentence with their index: \n")
for element in mapped_top_n_sentences:
    print(element)


# Ordering our top-n sentences in their original ordering
mapped_top_n_sentences = sorted(mapped_top_n_sentences, key = lambda x: x[1])
ordered_scored_sentences = [element[0] for element in mapped_top_n_sentences]

# Our final summary
summary = " ".join(ordered_scored_sentences)

Our top_n_sentence with their index: 

('34 – Delhi Capital versus Rajasthan Royals from MCA Stadium, Pune to Wankhede Stadium, Mumbai scheduled on April 22, 2022."The decision on the change of venue was made as a precautionary measure after Delhi Capitals registered the 6th COVID case with New Zealand wicketkeeper Mr Tim Seifert returning positive in today\'s RT-PCR testing," BCCI added.', 15)
('32 involving Delhi Capitals and Punjab Kings scheduled today at Brabourne – CCI will go ahead as per the schedule after the second round of COVID tests returned negative today," said the BCCI in a statement.', 4)
('MUMBAI: The cloud over the IPL game between Delhi Capitals and Punjab Kings was lifted an hour before the game on Wednesday after the BCCI confirmed that the match will go on despite a sixth COVID positive case being reported in the Delhi camp on game day.', 0)
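
The selection and reordering steps can also be expressed purely in terms of indices. The sketch below reuses the scores printed by the scoring cell and shows that sorting the top-N indices restores article order without a text-to-index lookup; the variable names `top_indices` and `ordered_indices` are assumptions introduced for illustration.

```python
import numpy as np

# Scores copied from the printed output of the scoring cell.
scores = np.array([
    4.3110189, 3.68888954, 2.60954908, 1.0, 4.3935779, 2.5848844,
    2.21900428, 2.6091297, 2.59786213, 2.20654433, 2.96510694, 1.69688925,
    2.61740099, 2.7744219, 2.43557758, 5.01111891, 2.95820572, 1.40764955,
    1.7300246,
])
N = 3

# Indices of the N highest scores, best first.
top_indices = np.argsort(scores)[::-1][:N]

# Sorting the indices puts the selected sentences back in article order.
ordered_indices = np.sort(top_indices)

print(top_indices.tolist())      # [15, 4, 0]
print(ordered_indices.tolist())  # [0, 4, 15]
```

With the notebook's `sentences` list in scope, `" ".join(sentences[i] for i in ordered_indices)` would then build the same summary as the cell above.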


<h2>Result / Summary</h2>

In [16]:
print("Summary: \n", summary)
print(len(summary))
print(len(text_corpus))




Summary: 
 MUMBAI: The cloud over the IPL game between Delhi Capitals and Punjab Kings was lifted an hour before the game on Wednesday after the BCCI confirmed that the match will go on despite a sixth COVID positive case being reported in the Delhi camp on game day. 32 involving Delhi Capitals and Punjab Kings scheduled today at Brabourne – CCI will go ahead as per the schedule after the second round of COVID tests returned negative today," said the BCCI in a statement. 34 – Delhi Capital versus Rajasthan Royals from MCA Stadium, Pune to Wankhede Stadium, Mumbai scheduled on April 22, 2022."The decision on the change of venue was made as a precautionary measure after Delhi Capitals registered the 6th COVID case with New Zealand wicketkeeper Mr Tim Seifert returning positive in today's RT-PCR testing," BCCI added.
814
2781


In [17]:
!pip install gTTS



In [18]:
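from gtts import gTTS
import os

# Convert the summary to speech and save it as an MP3 file
audio_file = "summarized TTS.mp3"
tts = gTTS(text=summary, lang='en', slow=False)
tts.save(audio_file)

# Play the file that was just saved (requires the mpg321 player).
# os.system returns the command's exit status; non-zero means playback failed.
os.system(f'mpg321 "{audio_file}"')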



1