# Summraizer
Implement an extractive summarization technique to create concise summaries of longer documents. <br>
Expected Outcome: A tool that generates summaries by selecting and combining the most important sentences from the original text.

In [1]:
import nltk
import re
nltk.download('stopwords')

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

## Get raw text

In [2]:
DOCUMENT_PATH = "documents/harry_potter_plot.txt"
raw_text = ""
with open(DOCUMENT_PATH, "r") as file:
    raw_text = file.read()

raw_text

'The series follows the life of a boy named Harry Potter. In the first book, Harry Potter and the Philosopher\'s Stone (Harry Potter and the Sorcerer\'s Stone in the US), Harry lives in a cupboard under the stairs in the house of the Dursleys, his aunt, uncle and cousin, who all treat him poorly. At the age of 11, Harry discovers that he is a wizard. He meets a half-giant named Hagrid who gives him a letter of acceptance to attend the Hogwarts School of Witchcraft and Wizardry. Harry learns that his parents, Lily and James Potter, also had magical powers and were murdered by the dark wizard Lord Voldemort when Harry was a baby. When Voldemort attempted to kill Harry, his curse rebounded, seemingly killing Voldemort, and Harry survived with a lightning-shaped scar on his forehead. The event made Harry famous among the community of wizards and witches.\n\nHarry becomes a student at Hogwarts and is sorted into Gryffindor House. He gains the friendship of Ron Weasley, a member of a large b

## Tokenize into sentences

In [3]:
sentences = nltk.sent_tokenize(raw_text)
sentences

['The series follows the life of a boy named Harry Potter.',
 "In the first book, Harry Potter and the Philosopher's Stone (Harry Potter and the Sorcerer's Stone in the US), Harry lives in a cupboard under the stairs in the house of the Dursleys, his aunt, uncle and cousin, who all treat him poorly.",
 'At the age of 11, Harry discovers that he is a wizard.',
 'He meets a half-giant named Hagrid who gives him a letter of acceptance to attend the Hogwarts School of Witchcraft and Wizardry.',
 'Harry learns that his parents, Lily and James Potter, also had magical powers and were murdered by the dark wizard Lord Voldemort when Harry was a baby.',
 'When Voldemort attempted to kill Harry, his curse rebounded, seemingly killing Voldemort, and Harry survived with a lightning-shaped scar on his forehead.',
 'The event made Harry famous among the community of wizards and witches.',
 'Harry becomes a student at Hogwarts and is sorted into Gryffindor House.',
 'He gains the friendship of Ron We

## Clean the sentences

In [4]:
clean_sentences = []
for sentence in sentences:
    clean_sent = re.sub('[^a-zA-Z]', " ", sentence)
    clean_sent = clean_sent.lower()
    clean_sentences.append(clean_sent)

clean_sentences

['the series follows the life of a boy named harry potter ',
 'in the first book  harry potter and the philosopher s stone  harry potter and the sorcerer s stone in the us   harry lives in a cupboard under the stairs in the house of the dursleys  his aunt  uncle and cousin  who all treat him poorly ',
 'at the age of     harry discovers that he is a wizard ',
 'he meets a half giant named hagrid who gives him a letter of acceptance to attend the hogwarts school of witchcraft and wizardry ',
 'harry learns that his parents  lily and james potter  also had magical powers and were murdered by the dark wizard lord voldemort when harry was a baby ',
 'when voldemort attempted to kill harry  his curse rebounded  seemingly killing voldemort  and harry survived with a lightning shaped scar on his forehead ',
 'the event made harry famous among the community of wizards and witches ',
 'harry becomes a student at hogwarts and is sorted into gryffindor house ',
 'he gains the friendship of ron we

In [5]:
# Store the clean sentences indices in a dictionary
clean_sent_idx = {sent: i for i, sent in enumerate(clean_sentences)}

## Word Counts

In [6]:
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [7]:
# Find the word count for each word

word_count = {}

for sentence in clean_sentences:
    for word in sentence.split():
        if word in stopwords:
            continue
        if word not in word_count:
            word_count[word] = 0
        word_count[word] += 1

max_word_count = max(word_count.values())


# Normalize: word count/max_word count
for word, cnt in word_count.items():
    word_count[word] = cnt/max_word_count


# Display words and their normalized counts
sorted(word_count.items(), key=lambda x: -x[1])

[('harry', 1.0),
 ('voldemort', 0.5306122448979592),
 ('potter', 0.24489795918367346),
 ('hogwarts', 0.22448979591836735),
 ('dumbledore', 0.20408163265306123),
 ('ron', 0.16326530612244897),
 ('dark', 0.14285714285714285),
 ('hermione', 0.14285714285714285),
 ('sirius', 0.14285714285714285),
 ('death', 0.14285714285714285),
 ('ministry', 0.14285714285714285),
 ('book', 0.10204081632653061),
 ('school', 0.10204081632653061),
 ('learns', 0.10204081632653061),
 ('blood', 0.10204081632653061),
 ('snape', 0.10204081632653061),
 ('arts', 0.10204081632653061),
 ('stone', 0.08163265306122448),
 ('half', 0.08163265306122448),
 ('james', 0.08163265306122448),
 ('professor', 0.08163265306122448),
 ('defence', 0.08163265306122448),
 ('chamber', 0.08163265306122448),
 ('attacked', 0.08163265306122448),
 ('ginny', 0.08163265306122448),
 ('phoenix', 0.08163265306122448),
 ('revealed', 0.08163265306122448),
 ('group', 0.08163265306122448),
 ('eaters', 0.08163265306122448),
 ('horcrux', 0.081632653061

## Sentence Score
sentence score is sum of normalized word counts of each word in the sentence

In [8]:
sentence_scores = {sent:0 for sent in clean_sentences}  # Initialize each sentence score as 0

for sentence in clean_sentences:
    for word in sentence.split():
        if word not in stopwords:
            sentence_scores[sentence] += word_count[word]

sentence_scores

{'the series follows the life of a boy named harry potter ': 1.4489795918367347,
 'in the first book  harry potter and the philosopher s stone  harry potter and the sorcerer s stone in the us   harry lives in a cupboard under the stairs in the house of the dursleys  his aunt  uncle and cousin  who all treat him poorly ': 4.183673469387753,
 'at the age of     harry discovers that he is a wizard ': 1.1224489795918369,
 'he meets a half giant named hagrid who gives him a letter of acceptance to attend the hogwarts school of witchcraft and wizardry ': 0.6326530612244898,
 'harry learns that his parents  lily and james potter  also had magical powers and were murdered by the dark wizard lord voldemort when harry was a baby ': 3.4081632653061225,
 'when voldemort attempted to kill harry  his curse rebounded  seemingly killing voldemort  and harry survived with a lightning shaped scar on his forehead ': 3.306122448979591,
 'the event made harry famous among the community of wizards and witch

## Get the top sentences

In [9]:
from heapq import nlargest

# Percentage of total sentences to use in summary
percentage = 0.3
no_of_sentences = int(0.3 * len(clean_sentences))


# get the top clean sentences
top_clean_sentences = nlargest(no_of_sentences, sentence_scores)
# sort clean sentences by their order in raw_text
top_clean_sentences = sorted(top_clean_sentences, key=lambda sent: clean_sent_idx[sent])  


# get the actual top sentences from the top clean sentences
top_sentences = []
for clean_sent in top_clean_sentences:
    idx = clean_sent_idx[clean_sent]
    top_sentences.append(sentences[idx])


print(*top_sentences)

The series follows the life of a boy named Harry Potter. When Voldemort attempted to kill Harry, his curse rebounded, seemingly killing Voldemort, and Harry survived with a lightning-shaped scar on his forehead. The event made Harry famous among the community of wizards and witches. The trio develop an enmity with the rich pure-blood student Draco Malfoy. The first book concludes with Harry's confrontation with Voldemort, who, in his quest to regain a body, yearns to possess the Philosopher's Stone, a substance that bestows everlasting life. When Hermione is attacked and Ron's younger sister, Ginny Weasley, abducted, Harry and Ron uncover the chamber's secrets and enter it. With the help of Dumbledore's phoenix, Fawkes, and the Sword of Gryffindor, Harry slays the basilisk and destroys the diary. The dog is revealed to be Sirius Black. They are surrounded by dementors, but are saved by a figure resembling James who casts a stag Patronus. This is later revealed to be a future version of