<a href="https://colab.research.google.com/github/urvashiramdasani/Document-Summarization/blob/main/notebooks/20_newsgroups_using_HMM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook contains the code for summarizing documents from the 20 Newsgroups dataset using a hidden Markov model (HMM).

## Setting up GitHub

In [None]:
!git clone https://github.com/urvashiramdasani/Document-Summarization.git

Cloning into 'Document-Summarization'...
remote: Enumerating objects: 18916, done.
remote: Counting objects: 100% (18916/18916), done.
remote: Compressing objects: 100% (18881/18881), done.
remote: Total 18916 (delta 44), reused 18892 (delta 33), pack-reused 0
Receiving objects: 100% (18916/18916), 18.03 MiB | 20.32 MiB/s, done.
Resolving deltas: 100% (44/44), done.
Checking out files: 100% (18851/18851), done.


## Data Preprocessing

In [None]:
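def read_article(file_name):
    # Read the post and skip its first 15 lines.
    with open(file_name, "r") as file:
        filedata = file.readlines()[15:]

    sentences = []
    for sentence in filedata:
        # Note: str.replace matches the literal text "[^a-zA-Z]", not a regex,
        # so punctuation and digits are kept; only spaces and newlines are stripped.
        sentences.append(sentence.replace("[^a-zA-Z]", " ").strip(" \n"))

    return sentences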
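
Because `str.replace` treats `"[^a-zA-Z]"` as literal text, the call in `read_article` leaves each line unchanged apart from the `strip`. If removing non-letter characters is actually wanted, a regular-expression substitution along these lines could be used. This is a minimal sketch: `clean_line` is a hypothetical helper that is not part of the notebook, and applying it would also remove the periods that `sent_tokenize` relies on later.

```python
import re

def clean_line(line):
    # Hypothetical helper: replace every run of non-letter characters with one space.
    return re.sub(r"[^a-zA-Z]+", " ", line).strip()

raw = "MARINER 1, the first U.S. attempt to send a spacecraft to Venus, failed\n"

literal_result = raw.replace("[^a-zA-Z]", " ").strip(" \n")  # literal match: nothing replaced
regex_result = clean_line(raw)                               # regex match: digits and punctuation removed

print(literal_result)
print(regex_result)
```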

In [None]:
# Reading a sample article

sentences = read_article("/content/Document-Summarization/data/20news-bydate-train/sci.space/59905")
sentences = list(filter(None, sentences))
print(sentences)

['This section was lightly adapted from an original posting by Larry Klaes', '(klaes@verga.enet.dec.com), mostly minor formatting changes. Matthew', 'Wiener (weemba@libra.wistar.upenn.edu) contributed the section on', 'Voyager, and the section on Sakigake was obtained from ISAS material', 'posted by Yoshiro Yamada (yamada@yscvax.ysc.go.jp).', 'US PLANETARY MISSIONS', 'MARINER (VENUS, MARS, & MERCURY FLYBYS AND ORBITERS)', 'MARINER 1, the first U.S. attempt to send a spacecraft to Venus, failed', 'minutes after launch in 1962. The guidance instructions from the ground', 'stopped reaching the rocket due to a problem with its antenna, so the', 'onboard computer took control. However, there turned out to be a bug in', 'the guidance software, and the rocket promptly went off course, so the', 'Range Safety Officer destroyed it. Although the bug is sometimes claimed', 'to have been an incorrect FORTRAN DO statement, it was actually a', 'transcription error in which the bar (indicating smoothi

## POS Tagging

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
# Join the lines into one string, with a space after each line.
txt = "".join(sentence + " " for sentence in sentences)
print(txt)

This section was lightly adapted from an original posting by Larry Klaes (klaes@verga.enet.dec.com), mostly minor formatting changes. Matthew Wiener (weemba@libra.wistar.upenn.edu) contributed the section on Voyager, and the section on Sakigake was obtained from ISAS material posted by Yoshiro Yamada (yamada@yscvax.ysc.go.jp). US PLANETARY MISSIONS MARINER (VENUS, MARS, & MERCURY FLYBYS AND ORBITERS) MARINER 1, the first U.S. attempt to send a spacecraft to Venus, failed minutes after launch in 1962. The guidance instructions from the ground stopped reaching the rocket due to a problem with its antenna, so the onboard computer took control. However, there turned out to be a bug in the guidance software, and the rocket promptly went off course, so the Range Safety Officer destroyed it. Although the bug is sometimes claimed to have been an incorrect FORTRAN DO statement, it was actually a transcription error in which the bar (indicating smoothing) was omitted from the expression "R-dot-b

In [None]:
tokenized = sent_tokenize(txt)
tagged = []
for sentence in tokenized:
    # Tokenize, drop stop words (the comparison is case-sensitive), then POS-tag.
    wordsList = [w for w in word_tokenize(sentence) if w not in stop_words]
    tagged.append(nltk.pos_tag(wordsList))

In [None]:
print(tagged)

[[('This', 'DT'), ('section', 'NN'), ('lightly', 'RB'), ('adapted', 'VBD'), ('original', 'JJ'), ('posting', 'NN'), ('Larry', 'NNP'), ('Klaes', 'NNP'), ('(', '('), ('klaes', 'VB'), ('@', 'NNP'), ('verga.enet.dec.com', 'NN'), (')', ')'), (',', ','), ('mostly', 'RB'), ('minor', 'JJ'), ('formatting', 'NN'), ('changes', 'NNS'), ('.', '.')], [('Matthew', 'NNP'), ('Wiener', 'NNP'), ('(', '('), ('weemba', 'JJ'), ('@', 'NNP'), ('libra.wistar.upenn.edu', 'NN'), (')', ')'), ('contributed', 'VBD'), ('section', 'NN'), ('Voyager', 'NNP'), (',', ','), ('section', 'NN'), ('Sakigake', 'NNP'), ('obtained', 'VBD'), ('ISAS', 'NNP'), ('material', 'NN'), ('posted', 'VBD'), ('Yoshiro', 'NNP'), ('Yamada', 'NNP'), ('(', '('), ('yamada', 'PRP'), ('@', 'NNP'), ('yscvax.ysc.go.jp', 'NN'), (')', ')'), ('.', '.')], [('US', 'NNP'), ('PLANETARY', 'NNP'), ('MISSIONS', 'NNP'), ('MARINER', 'NNP'), ('(', '('), ('VENUS', 'NNP'), (',', ','), ('MARS', 'NNP'), (',', ','), ('&', 'CC'), ('MERCURY', 'NNP'), ('FLYBYS', 'NNP'), (
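
The output above still contains `'This'`, because the stop-word list is lowercase and the membership test is case-sensitive. The sketch below contrasts the two comparisons. It assumes a small stand-in set instead of NLTK's full English list so that it runs without any downloads; whether capitalized stop words should be dropped is a design choice the notebook does not state.

```python
# Small stand-in for NLTK's English stop-word list (assumption, for a self-contained example).
stop_words = {"this", "was", "from", "an", "by"}

words = ["This", "section", "was", "lightly", "adapted", "from",
         "an", "original", "posting", "by", "Larry", "Klaes"]

# Case-sensitive test, as in the cell above: "This" is kept.
case_sensitive = [w for w in words if w not in stop_words]

# Lowercasing only for the comparison: "This" is dropped, original casing is preserved.
case_insensitive = [w for w in words if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```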

In [None]:
# Collect the distinct POS tags that occur in the article.
annotations = {tag for sentence in tagged for _, tag in sentence}

print(annotations)

{'POS', 'CD', 'VBZ', 'DT', 'FW', '.', 'PRP', 'JJR', '``', "''", 'VBP', 'MD', 'VBN', '$', ',', 'WRB', 'JJS', 'PRP$', 'NN', '#', 'NNP', 'NNPS', 'RB', 'VBG', 'JJ', 'IN', '(', 'RBR', 'NNS', 'RP', 'VBD', 'CC', ')', 'VB', ':', 'WP'}
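
The set above shows which tags occur but not how often. A frequency count can be sketched with `collections.Counter`. The `tagged_sample` list below is copied from the first sentence of the printed `tagged` output so that the snippet runs without NLTK; in the notebook, `tagged` itself would take its place.

```python
from collections import Counter

# First tagged sentence, copied from the printed output of `tagged`.
tagged_sample = [[
    ('This', 'DT'), ('section', 'NN'), ('lightly', 'RB'), ('adapted', 'VBD'),
    ('original', 'JJ'), ('posting', 'NN'), ('Larry', 'NNP'), ('Klaes', 'NNP'),
    ('(', '('), ('klaes', 'VB'), ('@', 'NNP'), ('verga.enet.dec.com', 'NN'),
    (')', ')'), (',', ','), ('mostly', 'RB'), ('minor', 'JJ'),
    ('formatting', 'NN'), ('changes', 'NNS'), ('.', '.'),
]]

# Count how many tokens carry each tag.
tag_counts = Counter(tag for sentence in tagged_sample for _, tag in sentence)

print(tag_counts.most_common(2))
```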


In [None]:
tagged[0][0][1]

'DT'

In [None]:
# Build the corpus from every token that is not an adjective (JJ) or determiner (DT).
corpus = ""
for sentence in tagged:
  for word, tag in sentence:
    if tag not in ('JJ', 'DT'):
      corpus += word + " "

print(corpus)

section lightly adapted posting Larry Klaes ( klaes @ verga.enet.dec.com ) , mostly formatting changes . Matthew Wiener ( @ libra.wistar.upenn.edu ) contributed section Voyager , section Sakigake obtained ISAS material posted Yoshiro Yamada ( yamada @ yscvax.ysc.go.jp ) . US PLANETARY MISSIONS MARINER ( VENUS , MARS , & MERCURY FLYBYS AND ORBITERS ) MARINER 1 , U.S. attempt send spacecraft Venus , failed minutes 1962 . guidance instructions ground stopped reaching rocket problem antenna , onboard computer took control . However , turned guidance software , rocket promptly went course , Range Safety Officer destroyed . Although bug sometimes claimed FORTRAN DO statement , actually transcription error bar ( indicating smoothing ) omitted expression `` sub n '' ( nth smoothed value radius ) . error led software treat variations velocity , compensation . MARINER 2 became first probe flyby Venus December 1962 , returned information confirmed Venus ( 800 degrees Fahrenheit , revised 900 degr

## Hidden Markov Models



In [None]:
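# Split txt on " ." (a space followed by a period).
# Note: in the printed txt above, periods directly follow words,
# whereas corpus separates each period from the preceding token with a space.
txt = txt.split(" .")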


In [None]:
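# Containers for the HMM components: hidden states, observations,
# and the initial, transition, and emission probabilities.
states = []
obs = []
initial_probs = []
transmission_probs = []
emission_probs = []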
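
One common way to fill containers like these for a supervised HMM is relative-frequency counting. The sketch below assumes that POS tags act as hidden states and words as observations; this section does not show whether the notebook takes that route, so treat it as an illustration only. It uses dictionaries rather than lists, writes `transition_probs` for what the notebook calls `transmission_probs`, and runs on a tiny sample abridged from the printed `tagged` output.

```python
from collections import Counter, defaultdict

# Tiny sample abridged from the printed `tagged` output
# (assumption: tags are the hidden states, words are the observations).
tagged_sample = [
    [('Matthew', 'NNP'), ('Wiener', 'NNP'), ('contributed', 'VBD'), ('section', 'NN'), ('.', '.')],
    [('US', 'NNP'), ('PLANETARY', 'NNP'), ('MISSIONS', 'NNP')],
]

def normalize(counter):
    # Turn raw counts into probabilities that sum to 1.
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}

initial_counts = Counter()
transition_counts = defaultdict(Counter)
emission_counts = defaultdict(Counter)

for sentence in tagged_sample:
    if not sentence:
        continue
    tags = [tag for _, tag in sentence]
    initial_counts[tags[0]] += 1                    # tag that starts the sentence
    for prev_tag, next_tag in zip(tags, tags[1:]):  # adjacent tag pairs
        transition_counts[prev_tag][next_tag] += 1
    for word, tag in sentence:                      # word emitted by each tag
        emission_counts[tag][word] += 1

states = sorted(emission_counts)
initial_probs = normalize(initial_counts)
transition_probs = {s: normalize(c) for s, c in transition_counts.items()}
emission_probs = {s: normalize(c) for s, c in emission_counts.items()}

print(states)
print(initial_probs)
print(transition_probs['NNP'])
```

Counts from a single article are sparse, so a real model would usually need smoothing for unseen transitions and words.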