<a href="https://colab.research.google.com/github/urvashiramdasani/Document-Summarization/blob/main/notebooks/20_newsgroups_using_HMM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

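This notebook summarizes a sample article from the 20 Newsgroups dataset with two approaches, a sentence-probability model (presented under "Hidden Markov Models") and a BERT-based extractive summarizer, and scores both against a human-written summary using ROUGE.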

## Setting up GitHub

In [1]:
!git clone https://github.com/urvashiramdasani/Document-Summarization.git

Cloning into 'Document-Summarization'...
remote: Enumerating objects: 18948, done.
remote: Counting objects: 100% (18948/18948), done.
remote: Compressing objects: 100% (18913/18913), done.
remote: Total 18948 (delta 58), reused 18893 (delta 33), pack-reused 0
Receiving objects: 100% (18948/18948), 19.52 MiB | 19.04 MiB/s, done.
Resolving deltas: 100% (58/58), done.


## Data Preprocessing

In [33]:
def read_article(file_name):
    # Read the file and keep lines 27-50 (the slice 26:50)
    with open(file_name, "r") as file:
        filedata = file.readlines()[26:50]

    # Strip surrounding spaces and newlines; punctuation is left untouched
    sentences = [line.strip(" \n") for line in filedata]

    return sentences
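
Python's `str.replace` matches its first argument literally, so a character class such as `[^a-zA-Z]` only acts as a pattern when passed to `re.sub`. The sketch below contrasts the two; `clean_line` is a hypothetical helper, not part of the notebook. Stripping punctuation this way would also remove the sentence boundaries that the later tokenization steps rely on, so it is shown for illustration only.

```python
import re

def clean_line(line):
    """Replace every run of non-letter characters with a single space."""
    return re.sub(r"[^a-zA-Z]+", " ", line).strip()

raw = "MARINER 1, the first U.S. attempt\n"

# Literal match: no such substring exists, so the text is unchanged
literal = raw.replace("[^a-zA-Z]", " ").strip(" \n")
print(literal)

# Regex match: digits and punctuation are replaced by spaces
print(clean_line(raw))
```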

In [34]:
# Reading a sample article

sentences = read_article("/content/Document-Summarization/data/20news-bydate-train/sci.space/59905")
sentences = list(filter(None, sentences))
print(sentences)

['MARINER 1, the first U.S. attempt to send a spacecraft to Venus, failed', 'minutes after launch in 1962. The guidance instructions from the ground', 'stopped reaching the rocket due to a problem with its antenna, so the', 'onboard computer took control. However, there turned out to be a bug in', 'the guidance software, and the rocket promptly went off course, so the', 'Range Safety Officer destroyed it. Although the bug is sometimes claimed', 'to have been an incorrect FORTRAN DO statement, it was actually a', 'transcription error in which the bar (indicating smoothing) was omitted', 'from the expression "R-dot-bar sub n" (nth smoothed value of derivative', 'of radius). This error led the software to treat normal minor variations', 'of velocity as if they were serious, leading to incorrect compensation.', 'MARINER 2 became the first successful probe to flyby Venus in December', 'of 1962, and it returned information which confirmed that Venus is a', 'very hot (800 degrees Fahrenheit, 

In [35]:
len(sentences)

21

## POS Tagging

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [53]:
txt = ""
for sentence in sentences:
  txt += sentence + " "  # join the lines into one text, separated by spaces
print(txt)

MARINER 1, the first U.S. attempt to send a spacecraft to Venus, failed minutes after launch in 1962. The guidance instructions from the ground stopped reaching the rocket due to a problem with its antenna, so the onboard computer took control. However, there turned out to be a bug in the guidance software, and the rocket promptly went off course, so the Range Safety Officer destroyed it. Although the bug is sometimes claimed to have been an incorrect FORTRAN DO statement, it was actually a transcription error in which the bar (indicating smoothing) was omitted from the expression "R-dot-bar sub n" (nth smoothed value of derivative of radius). This error led the software to treat normal minor variations of velocity as if they were serious, leading to incorrect compensation. MARINER 2 became the first successful probe to flyby Venus in December of 1962, and it returned information which confirmed that Venus is a very hot (800 degrees Fahrenheit, now revised to 900 degrees F.) world with

In [None]:
tokenized = sent_tokenize(txt)
tagged = []
for i in tokenized:
    wordsList = nltk.word_tokenize(i)
    wordsList = [w for w in wordsList if w not in stop_words]
    tagged.append(nltk.pos_tag(wordsList))

In [None]:
print(tagged)

[[('This', 'DT'), ('section', 'NN'), ('lightly', 'RB'), ('adapted', 'VBD'), ('original', 'JJ'), ('posting', 'NN'), ('Larry', 'NNP'), ('Klaes', 'NNP'), ('(', '('), ('klaes', 'VB'), ('@', 'NNP'), ('verga.enet.dec.com', 'NN'), (')', ')'), (',', ','), ('mostly', 'RB'), ('minor', 'JJ'), ('formatting', 'NN'), ('changes', 'NNS'), ('.', '.')], [('Matthew', 'NNP'), ('Wiener', 'NNP'), ('(', '('), ('weemba', 'JJ'), ('@', 'NNP'), ('libra.wistar.upenn.edu', 'NN'), (')', ')'), ('contributed', 'VBD'), ('section', 'NN'), ('Voyager', 'NNP'), (',', ','), ('section', 'NN'), ('Sakigake', 'NNP'), ('obtained', 'VBD'), ('ISAS', 'NNP'), ('material', 'NN'), ('posted', 'VBD'), ('Yoshiro', 'NNP'), ('Yamada', 'NNP'), ('(', '('), ('yamada', 'PRP'), ('@', 'NNP'), ('yscvax.ysc.go.jp', 'NN'), (')', ')'), ('.', '.')], [('US', 'NNP'), ('PLANETARY', 'NNP'), ('MISSIONS', 'NNP'), ('MARINER', 'NNP'), ('(', '('), ('VENUS', 'NNP'), (',', ','), ('MARS', 'NNP'), (',', ','), ('&', 'CC'), ('MERCURY', 'NNP'), ('FLYBYS', 'NNP'), (

In [None]:
annotations = set()

for sentence in tagged:
  for _, tag in sentence:
    annotations.add(tag)

print(annotations)

{'POS', 'CD', 'VBZ', 'DT', 'FW', '.', 'PRP', 'JJR', '``', "''", 'VBP', 'MD', 'VBN', '$', ',', 'WRB', 'JJS', 'PRP$', 'NN', '#', 'NNP', 'NNPS', 'RB', 'VBG', 'JJ', 'IN', '(', 'RBR', 'NNS', 'RP', 'VBD', 'CC', ')', 'VB', ':', 'WP'}


In [None]:
tagged[0][0][1]

'NNP'

In [None]:
corpus = ""
for i in range(len(tagged)):
  for j in range(len(tagged[i])):
    if tagged[i][j][1] != 'JJ' and tagged[i][j][1] != 'DT':
      corpus += tagged[i][j][0] + " "

print(corpus)

section lightly adapted posting Larry Klaes ( klaes @ verga.enet.dec.com ) , mostly formatting changes . Matthew Wiener ( @ libra.wistar.upenn.edu ) contributed section Voyager , section Sakigake obtained ISAS material posted Yoshiro Yamada ( yamada @ yscvax.ysc.go.jp ) . US PLANETARY MISSIONS MARINER ( VENUS , MARS , & MERCURY FLYBYS AND ORBITERS ) MARINER 1 , U.S. attempt send spacecraft Venus , failed minutes 1962 . guidance instructions ground stopped reaching rocket problem antenna , onboard computer took control . However , turned guidance software , rocket promptly went course , Range Safety Officer destroyed . Although bug sometimes claimed FORTRAN DO statement , actually transcription error bar ( indicating smoothing ) omitted expression `` sub n '' ( nth smoothed value radius ) . error led software treat variations velocity , compensation . MARINER 2 became first probe flyby Venus December 1962 , returned information confirmed Venus ( 800 degrees Fahrenheit , revised 900 degr
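
The tag filter above can also be written with a set of excluded tags. This is a sketch on a small hand-written `tagged_sample` list (the first tokens of the tagged output shown earlier) so that it runs without NLTK; `EXCLUDED_TAGS` and `filter_tags` are illustrative names, and unlike the cell above it does not leave a trailing space.

```python
EXCLUDED_TAGS = {"JJ", "DT"}  # adjectives and determiners, as in the cell above

def filter_tags(tagged_sentences, excluded=EXCLUDED_TAGS):
    """Join the words whose POS tag is not in `excluded`."""
    kept = [word
            for sentence in tagged_sentences
            for word, tag in sentence
            if tag not in excluded]
    return " ".join(kept)

tagged_sample = [[("This", "DT"), ("section", "NN"), ("lightly", "RB"),
                  ("adapted", "VBD"), ("original", "JJ"), ("posting", "NN")]]
print(filter_tags(tagged_sample))
```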

## Hidden Markov Models

In this section each line of the article is scored, together with its neighbouring lines, by an LSTM next-word language model trained on the article itself. Lines whose probability exceeds a cutoff form the summary.



In [36]:
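CONST = 2  # number of neighbouring lines used on each side of the current line
n = 4  # offset of the within-line neighbours
threshold = 1.0e-120  # probability cutoff (the filtering cell below hard-codes 1.0e-117)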

In [37]:
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from keras.preprocessing.sequence import pad_sequences
import numpy as np

In [38]:
# Preprocess data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
vocab = tokenizer.word_index
seqs = tokenizer.texts_to_sequences(sentences)
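
As a rough illustration of what `tokenizer.word_index` and `texts_to_sequences` produce, the sketch below builds a frequency-ranked index by hand. It is a simplification and not the Keras implementation: it only lowercases and splits on whitespace, whereas `Tokenizer` also filters punctuation by default. `build_word_index` and `to_sequence` are hypothetical helpers.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a rank, 1 = most frequent; 0 stays free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() keeps first-seen order among words with equal counts
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Convert a text to a list of indices, skipping unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

texts = ["the rocket failed", "the rocket went off course"]
word_index = build_word_index(texts)
print(word_index)
print(to_sequence("the rocket went sideways", word_index))
```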

In [39]:
def prepare_sentence(seq, maxlen):
    # Pads seq and slides windows
    x = []
    y = []
    for i, w in enumerate(seq):
        x_padded = pad_sequences([seq[:i]],
                                 maxlen=maxlen - 1,
                                 padding='pre')[0]  # Pads before each sequence
        x.append(x_padded)
        y.append(w)
    return x, y

# Pad sequences and slide windows
maxlen = max([len(seq) for seq in seqs])
x = []
y = []
for seq in seqs:
    x_windows, y_windows = prepare_sentence(seq, maxlen)
    x += x_windows
    y += y_windows
x = np.array(x)
y = np.array(y) - 1  # The word <PAD> does not constitute a class
y = np.eye(len(vocab))[y]  # One hot encoding
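
To make the sliding-window step concrete, here is a dependency-free sketch of what `prepare_sentence` computes for one sequence: each word becomes a target, and the words before it become a left-padded input. `prefix_windows` is an illustrative stand-in that uses plain lists rather than `pad_sequences`.

```python
def prefix_windows(seq, maxlen):
    """Return (inputs, targets): every prefix of seq, left-padded with zeros."""
    width = maxlen - 1
    inputs, targets = [], []
    for i, word in enumerate(seq):
        # Keep at most `width` of the most recent words before position i
        history = seq[:i][-width:] if width > 0 else []
        inputs.append([0] * (width - len(history)) + history)
        targets.append(word)
    return inputs, targets

inputs, targets = prefix_windows([3, 1, 2], maxlen=4)
for x_row, y_word in zip(inputs, targets):
    print(x_row, "->", y_word)
```

With `maxlen=4` the three inputs are `[0, 0, 0]`, `[0, 0, 3]` and `[0, 3, 1]`, with targets `3`, `1` and `2`.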

In [40]:
# Define model
model = Sequential()
model.add(Embedding(input_dim=len(vocab) + 1,  # vocabulary size. Adding an
                                               # extra element for <PAD> word
                    output_dim=5,  # size of embeddings
                    input_length=maxlen - 1))  # length of the padded sequences
model.add(LSTM(10))
model.add(Dense(len(vocab), activation='softmax'))
model.compile('rmsprop', 'categorical_crossentropy', metrics = ['accuracy'])

# Train network
model.fit(x, y, epochs=100)

Epoch 1/100
...
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7f433bcea850>

In [41]:
# Compute the probability of occurrence of a sentence
def probability_of_sentence(sentence):
  tok = tokenizer.texts_to_sequences([sentence])[0]
  x_test, y_test = prepare_sentence(tok, maxlen)
  x_test = np.array(x_test)
  y_test = np.array(y_test) - 1  # The word <PAD> does not constitute a class
  p_pred = model.predict(x_test)  # row i: distribution over the vocabulary given the first i words

  # Chain rule: sum the log-probabilities of the observed words, then exponentiate
  # Efficient version: np.exp(np.sum(np.log(np.diag(p_pred[:, y_test]))))
  log_p_sentence = 0
  for i, prob in enumerate(p_pred):
    log_p_sentence += np.log(prob[y_test[i]])
  return np.exp(log_p_sentence)
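
The function sums log-probabilities instead of multiplying the probabilities directly. The sketch below shows the same chain-rule computation on made-up conditional probabilities (the values are invented for illustration, not model output) and why the log form is safer for long inputs.

```python
import math

def sentence_log_probability(conditional_probs):
    """Sum of log P(w_i | history), i.e. the log of the chain-rule product."""
    return sum(math.log(p) for p in conditional_probs)

probs = [0.5, 0.25, 0.1]
log_p = sentence_log_probability(probs)
print(math.exp(log_p))  # close to 0.5 * 0.25 * 0.1 = 0.0125

# A long input: the direct product underflows to 0.0 in double precision,
# while the log-probability stays finite and comparable.
long_probs = [0.1] * 400
naive = 1.0
for p in long_probs:
    naive *= p
long_log_p = sentence_log_probability(long_probs)
print(naive, long_log_p)
```

Comparing log-probabilities directly, rather than exponentiating at the end, would avoid the underflow entirely.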

In [42]:
import pandas as pd

df = pd.DataFrame(columns = ['Sentence', 'Probability'])

In [43]:
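# Score every line together with its neighbouring lines.
# word_list slots: [line i-2, line i-1, part j-n, current part, part j+1, part j+n, line i+1, line i+2]
# The list is reused across iterations, so a slot that is not overwritten keeps its previous value.
word_list = ['', '', '', '', '', '', '', '']
rows = []

for i in range(len(sentences)):
  words = sentences[i].split(" .")  # split on the literal " ." separator
  num_words = len(words)
  for j in range(num_words):
    word_list[3] = words[j]
    if j + n < num_words:  # strict bound keeps words[j + n] in range
      word_list[5] = words[j + n]
    if j - n >= 0:
      word_list[2] = words[j - n]
    if j + 1 < num_words:
      word_list[4] = words[j + 1]
    if i - CONST >= 0:
      word_list[0] = sentences[i - CONST].split(" .")[j]
    if i - CONST + 1 >= 0:
      word_list[1] = sentences[i - CONST + 1].split(" .")[j]
    if i + CONST < len(sentences):
      word_list[7] = sentences[i + CONST].split(" .")[j]
    if i + CONST - 1 < len(sentences):
      word_list[6] = sentences[i + CONST - 1].split(" .")[j]
    final_str = " ".join(word_list)
    rows.append({'Sentence': final_str, 'Probability': probability_of_sentence(final_str)})

# DataFrame.append was removed in pandas 2.0, so the rows are collected first
df = pd.DataFrame(rows, columns=['Sentence', 'Probability'])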
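
The loop above scores each line together with its neighbouring lines. A simplified sketch of that idea: join the lines from `i - radius` to `i + radius`, clipped at the document boundaries. `context_window` and `radius` are hypothetical names; unlike the cell above, this version builds a fresh window for each line and does not reproduce the eight-slot layout of `word_list`.

```python
def context_window(lines, i, radius=2):
    """Join line i with up to `radius` lines before and after it."""
    start = max(0, i - radius)
    end = min(len(lines), i + radius + 1)
    return " ".join(lines[start:end])

lines = ["a", "b", "c", "d", "e"]
for i in range(len(lines)):
    print(i, repr(context_window(lines, i)))
```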

In [44]:
df

Unnamed: 0,Sentence,Probability
0,"MARINER 1, the first U.S. attempt to send a...",2.644688e-72
1,"MARINER 1, the first U.S. attempt to send a s...",2.442937e-98
2,"MARINER 1, the first U.S. attempt to send a sp...",1.3515860000000002e-120
3,minutes after launch in 1962. The guidance ins...,1.04866e-117
4,stopped reaching the rocket due to a problem w...,4.8927470000000005e-121
5,"onboard computer took control. However, there ...",2.085755e-117
6,"the guidance software, and the rocket promptly...",4.638989e-119
7,Range Safety Officer destroyed it. Although th...,4.738781e-122
8,to have been an incorrect FORTRAN DO statement...,2.003781e-122
9,transcription error in which the bar (indicati...,1.457077e-122


In [45]:
df.describe()

Unnamed: 0,Probability
count,21.0
mean,1.259375e-73
std,5.771183e-73
min,9.837845000000001e-128
25%,1.9974289999999998e-122
50%,1.3515860000000002e-120
75%,3.961797e-114
max,2.644688e-72


In [46]:
# Keep only the sentences whose probability exceeds 1.0e-117
df = df[df.Probability > 1.0e-117]

In [47]:
df

Unnamed: 0,Sentence,Probability
0,"MARINER 1, the first U.S. attempt to send a...",2.644688e-72
1,"MARINER 1, the first U.S. attempt to send a s...",2.442937e-98
3,minutes after launch in 1962. The guidance ins...,1.04866e-117
5,"onboard computer took control. However, there ...",2.085755e-117
13,MARINER 2 became the first successful probe to...,4.139742e-109
14,"of 1962, and it returned information which con...",9.019625e-112
15,"very hot (800 degrees Fahrenheit, now revised ...",1.2835e-112
16,with a cloud-covered atmosphere composed prima...,3.961797e-114
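
The same filter can be expressed without pandas. The sketch below applies the 1.0e-117 cutoff to four of the probabilities listed above (rows 2 to 5, rounded as displayed); `keep_above` is an illustrative helper.

```python
def keep_above(scored, cutoff):
    """Return the row ids whose probability is strictly above `cutoff`."""
    return [row_id for row_id, prob in scored.items() if prob > cutoff]

# Row id -> probability, taken from the table above
scored = {
    2: 1.351586e-120,
    3: 1.04866e-117,
    4: 4.892747e-121,
    5: 2.085755e-117,
}
kept = keep_above(scored, cutoff=1.0e-117)
print(kept)
```

Rows 3 and 5 survive and rows 2 and 4 are dropped, matching the filtered table.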


In [65]:
# Concatenate the selected context windows into one summary string
hmm_summary = "".join(df['Sentence'])

In [57]:
!pip install rouge-score

Collecting rouge-score
  Downloading https://files.pythonhosted.org/packages/1f/56/a81022436c08b9405a5247b71635394d44fe7e1dbedc4b28c740e09c2840/rouge_score-0.0.4-py2.py3-none-any.whl
Installing collected packages: rouge-score
Successfully installed rouge-score-0.0.4


In [69]:
human_summary = "Mariner1 was the first attempt to send a spacecraft to venus, which failed because of fault in antenna guidance instructions couldn't reach the rocket. Hence, the control was taken over onboard computer that had bug in software due to which even the slightest change in the velocity were considered significant leading to incorrect compensation. In December 1962, MARINER 2 became the successful probe to  venus and it return the information confirming venus is very hot, nearly 800 to 900 Fahrenheit with cloud covered atmosphere comprising of carbon dioxide. Sulfuric acid was confirmed later in 1978. On November 5 1964, MARINER3 was launched, but after being placed in the space, failed to eject from the protective shroud. Due to this, it couldn't get solar power for its panels and the probe died after running out of battery. It was intended for Mars fly y with MARINER4"

In [70]:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rouge3','rougeL'], use_stemmer=True)
scores = scorer.score(human_summary, hmm_summary)

In [71]:
scores

{'rouge1': Score(precision=0.26634382566585957, recall=0.7482993197278912, fmeasure=0.39285714285714285),
 'rouge2': Score(precision=0.10679611650485436, recall=0.3013698630136986, fmeasure=0.15770609318996412),
 'rouge3': Score(precision=0.0413625304136253, recall=0.11724137931034483, fmeasure=0.06115107913669065),
 'rougeL': Score(precision=0.19128329297820823, recall=0.5374149659863946, fmeasure=0.28214285714285714)}
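
For intuition about these numbers, the sketch below computes a ROUGE-N style precision, recall and F-measure from clipped n-gram overlap. It is a simplified stand-in, not the `rouge_score` implementation: it lowercases and splits on whitespace, with no stemming or punctuation handling, so its values will not match the library's exactly. `rouge_n` and `ngrams` are hypothetical helpers, and the two example strings are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f-measure) from clipped n-gram overlap."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # each n-gram counted at most min(ref, cand) times
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

reference = "the probe reached venus"
candidate = "the probe reached venus and mars too too"
precision, recall, f_measure = rouge_n(reference, candidate)
print(precision, recall, f_measure)
```

A long candidate that covers the reference but adds extra or repeated text gets high recall and low precision, which is the pattern in the scores above (ROUGE-1 recall about 0.75, precision about 0.27).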

In [73]:
hmm_summary

"   MARINER 1, the first U.S. attempt to send a spacecraft to Venus, failed   minutes after launch in 1962. The guidance instructions from the ground stopped reaching the rocket due to a problem with its antenna, so the MARINER 1, the first U.S. attempt to send a spacecraft to Venus, failed  minutes after launch in 1962. The guidance instructions from the ground   stopped reaching the rocket due to a problem with its antenna, so the onboard computer took control. However, there turned out to be a bug inminutes after launch in 1962. The guidance instructions from the ground stopped reaching the rocket due to a problem with its antenna, so the  onboard computer took control. However, there turned out to be a bug in   the guidance software, and the rocket promptly went off course, so the Range Safety Officer destroyed it. Although the bug is sometimes claimedonboard computer took control. However, there turned out to be a bug in the guidance software, and the rocket promptly went off cour

## BERT

In [49]:
!pip install bert-extractive-summarizer==0.4.2

Collecting bert-extractive-summarizer==0.4.2
  Downloading https://files.pythonhosted.org/packages/23/1d/71f0a5c7f81b1a87d4428a6a935e9ddeb5e662e41512952e11bd10533cd9/bert-extractive-summarizer-0.4.2.tar.gz
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)


In [50]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2MB)

In [51]:
import torch
from summarizer import Summarizer

In [52]:
model = Summarizer('distilbert-base-uncased')

In [54]:
resp = model(txt)
print(resp)

MARINER 1, the first U.S. attempt to send a spacecraft to Venus, failed minutes after launch in 1962. MARINER 2 became the first successful probe to flyby Venus in December of 1962, and it returned information which confirmed that Venus is a very hot (800 degrees Fahrenheit, now revised to 900 degrees F.) world with a cloud-covered atmosphere composed primarily of carbon dioxide (sulfuric acid was later confirmed in 1978).


In [55]:
print(len(resp), len(txt))

426 1445


In [72]:
print(scorer.score(human_summary, resp))

{'rouge1': Score(precision=0.75, recall=0.3673469387755102, fmeasure=0.49315068493150693), 'rouge2': Score(precision=0.38028169014084506, recall=0.18493150684931506, fmeasure=0.24884792626728108), 'rouge3': Score(precision=0.21428571428571427, recall=0.10344827586206896, fmeasure=0.1395348837209302), 'rougeL': Score(precision=0.625, recall=0.30612244897959184, fmeasure=0.4109589041095891)}
