In [1]:
## Text Summarization has two types:
## 1. Extractive Summarization
##      -> This type of summarization involves selecting the most important sentences from the original text and then extract them and combine them to make a summary.
##      -> It is like doing a copy-paste from the original text to make a summary.
##      -> It is like selecting the most important sentences from the original text and then extract them and combine them to make a summary.
## Extractive SUmmarization algorithms:
##      -> TextRank
##      -> LexRank
##      -> LSA (Latent Semantic Analysis)
##      -> Luhn’s Algorithm
## 2. Abstractive Summarization
##      -> This type of summarization involves generating entirely new sentences to capture the meaning of the original text.
##      -> It is like writing a summary from scratch.
##      -> It is like generating entirely new sentences to capture the meaning of the original text.
## Abstractive Summarization algorithms:
##      -> Since, we are generating from scratch some of the most effective algorithms are based on transformers architecture
##      -> GPT, BERT, T5, PEGASUS

In [2]:
## importing libraries

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import bs4 as BeautifulSoup
import urllib.request  

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

import re

# Download and prepare data

In [3]:
stop_words = stopwords.words('english')

def normalize_sentence(sentence):
    sentence = sentence.lower()
    sentence = re.sub(r'[^a-zA-Z\s]','',sentence)
    sentence = sentence.strip()
    
    tokens = word_tokenize(sentence)
    filtered_tokens = [token for token in tokens if token not in stop_words]

    sentence = ' '.join(filtered_tokens)
    return sentence



# fetching the content from the URL
fetched_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/20th_century')

article_read = fetched_data.read()

# parsing the URL content and storing in a variable
article_parsed = BeautifulSoup.BeautifulSoup(article_read,'html.parser')

# returning <p> tags
paragraphs = article_parsed.find_all('p')

# looping through the paragraphs and adding them to the variable
article_content = ''.join(paragraph.text for paragraph in paragraphs)

sentences = sent_tokenize(article_content)
# sentences = [normalize_sentence(sentence) for sentence in sentences]

print(len(sentences))

102


### Example of Extractive Summariation

In [4]:
## This is an example of extractive summarization
## Here, we are using the frequency of words to generate the summary. We assume that the sentences with the most frequent words are the most important sentences.
## This is a very basic example of extractive summarization.

## Create Word Frequency dictionary
words = word_tokenize(article_content)
ps = PorterStemmer()
freqTable = dict()
for word in words:
    word = word.lower()
    word = ps.stem(word)
    if word not in stop_words:
        freqTable[word] = freqTable.get(word, 0) + 1

## Calculating Sentence Scores
sentence_scores = dict()
for sentence in sentences:
    key = sentence[:15]
    words = word_tokenize(sentence)
    non_stop_words = 0
    for word in words:
        word = word.lower()
        word = ps.stem(word)
        if word in freqTable:
            non_stop_words += 1
            sentence_scores[key] = sentence_scores.get(key, 0) + freqTable[word]
    sentence_scores[key] = sentence_scores.get(key, 0) / non_stop_words

## Calculating the threshold
threshold = sum(sentence_scores.values()) / len(sentence_scores)

## Generating the summary
summary = ''
for sentence in sentences:
    if sentence_scores[sentence[:15]] > threshold:
        summary += " " + sentence

print(len(summary))

7621


## TextRank Algorithm

In [5]:
## Exaplantion of textrank algorithm:
## 1. Create a graph where the sentences are the vertices and the edges between the sentences are the weights.
## 2. The weights are calculated based on the similarity between the sentences.
## 3. The similarity between the sentences is calculated using cosine similarity.
## 4. The sentences are then ranked based on the weights of the edges.
## 5. The top N sentences are then selected to form the summary.



In [6]:
## Word Tokenization
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(sentences)
X.shape

(102, 834)

In [7]:
## Calculating the similarity matrix
similarity_matrix = cosine_similarity(X)
similarity_matrix.shape

(102, 102)

In [8]:
## Calculating textRank Scores
## The TextRank algorithm doesn't just select the sentences that are most similar to other sentences. 
## It selects the sentences that are most "important", 
## and importance is determined by a sentence's ability to represent other sentences in the document.

## In the TextRank algorithm, a sentence is considered important if it is similar to many other sentences that are themselves important. 
## This is similar to how web pages are ranked by Google's PageRank algorithm, which inspired TextRank.

## So, a sentence will have a high rank if it is similar to other sentences that also have high ranks. 
## This creates a kind of recursive definition of importance.


textrank_scores = np.ones(len(sentences))

## the damping factor represents the likelihood that the algorithm will randomly jump from one sentence to another during the calculation of the TextRank scores
damping_factor = 0.85

## TextRank algorithm
for iter in range(10):
    new_textrank_scores = np.ones(len(sentences)) * (1 - damping_factor)

    for sent_1_idx in range(len(sentences)):
        for sent_2_idx in range(len(sentences)):
            if sent_1_idx != sent_2_idx:
                new_textrank_scores[sent_1_idx] += (damping_factor * similarity_matrix[sent_2_idx][sent_1_idx] * textrank_scores[sent_2_idx])
    
    textrank_scores = new_textrank_scores


## Generating the summary
ranked_sentences = [sentence for _, sentence in sorted(zip(textrank_scores, sentences), reverse=True)]
summary = ' '.join(ranked_sentences[:5])
print(f"Original Length: {len(article_content)}, Summary Length: {len(summary)}, compression ratio: {len(summary)/len(article_content):.2f}")
print(summary)

Original Length: 15520, Summary Length: 784, compression ratio: 0.05
[4]
The 20th century was dominated by significant geopolitical events that reshaped the political and social structure of the globe: World War I, the Spanish flu pandemic, World War II and the Cold War. The world was undergoing its second major period of globalization; the first, which started in the 18th century, having been terminated by World War I. World population increased from about 1.6 billion people in 1901 to 6.1 billion at the century's end. The people of the Indian subcontinent, a sixth of the world population at the end of the 20th century, had attained an indigenous independence for the first time in centuries. At the beginning of the period, the British Empire was the world's most powerful nation,[7] having acted as the world's policeman for the past century.


## LSA

In [9]:
## LSA (Latent Semantic Analysis) analyzes relationships between a set of documents and the terms they contain.
## LSA is based on the principle that words that are close in meaning will occur in similar pieces of text.
## LSA is used to find the relationships between the words and the sentences in the text and then rank the sentences based on these relationships.
## These relationships are found using singular value decomposition (SVD) which is a matrix factorization technique.

## Steps:
## 1. Create a term-document matrix where the rows are the terms and the columns are the documents.
## 2. Apply SVD to the term-document matrix to get the three matrices U, S, and V.
## 3. The matrix U represents the relationship between the terms and the concepts.
## 4. The matrix S represents the strength of each concept.
## 5. The matrix V represents the relationship between the concepts and the documents.
## 6. The sentences are then ranked based on the relationship between the terms and the concepts.
## 7. The top N sentences are then selected to form the summary.

## X = USV^T



In [10]:
vectorizer = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
document_term_matrix = vectorizer.fit_transform(sentences)
document_term_matrix = document_term_matrix.toarray()
vocab = vectorizer.get_feature_names_out()
term_document_matrix = document_term_matrix.T
print(document_term_matrix.shape, term_document_matrix.shape)

(102, 934) (934, 102)


In [11]:
## Perform LSA - text summarization

## Sklearn implementation for LSA
lsa = TruncatedSVD(n_components=1, n_iter=10)
lsa.fit(term_document_matrix)

## Getting the sentence scores
lsa_scores = lsa.components_
lsa_scores = lsa_scores / lsa_scores.max()  
lsa_scores = lsa_scores * 100

## Generating the summary
ranked_sentences = [sentence for _, sentence in sorted(zip(lsa_scores[0], sentences), reverse=True)]
summary = ' '.join(ranked_sentences[:5])
summary = summary.replace('\n', '')
print(f"Original Length: {len(article_content)}, Summary Length: {len(summary)}, compression ratio: {len(summary)/len(article_content):.2f}")
print(summary)

Original Length: 15520, Summary Length: 824, compression ratio: 0.05
After the victory of the Allies in Europe, the war in Asia ended with the Soviet invasion of Manchuria and the dropping of two atomic bombs on Japan by the US, the first nation to develop nuclear weapons and the only one to use them in warfare. The people of the Indian subcontinent, a sixth of the world population at the end of the 20th century, had attained an indigenous independence for the first time in centuries. At the beginning of the period, the British Empire was the world's most powerful nation,[7] having acted as the world's policeman for the past century. Later in the 20th century, the development of computers led to the establishment of a theory of computation. In the latter half of the century, most of the European-colonized world in Africa and Asia gained independence in a process of decolonization.


In [12]:
## LSA for multiple components --> multiple components meaning multiple concepts/topics


lsa = TruncatedSVD(n_components=5, n_iter=10)
lsa.fit(term_document_matrix)

lsa_scores = lsa.components_
lsa_scores = lsa_scores / lsa_scores.max(axis=1)[:, np.newaxis]  ## Normalizing the scores for each concept
lsa_scores = lsa_scores * 100

for component in range(5):
    ## Generating the summary
    ranked_sentences = [sentence for _, sentence in sorted(zip(lsa_scores[component], sentences), reverse=True)]
    summary = ' '.join(ranked_sentences[:5])
    summary = summary.replace('\n', '')

    print(f"\n--- Concept {component + 1} ---")
    print(f"Original Length: {len(article_content)}, Summary Length: {len(summary)}, compression ratio: {len(summary)/len(article_content):.2f}")
    print(summary)


--- Concept 1 ---
Original Length: 15520, Summary Length: 824, compression ratio: 0.05
After the victory of the Allies in Europe, the war in Asia ended with the Soviet invasion of Manchuria and the dropping of two atomic bombs on Japan by the US, the first nation to develop nuclear weapons and the only one to use them in warfare. The people of the Indian subcontinent, a sixth of the world population at the end of the 20th century, had attained an indigenous independence for the first time in centuries. At the beginning of the period, the British Empire was the world's most powerful nation,[7] having acted as the world's policeman for the past century. Later in the 20th century, the development of computers led to the establishment of a theory of computation. In the latter half of the century, most of the European-colonized world in Africa and Asia gained independence in a process of decolonization.

--- Concept 2 ---
Original Length: 15520, Summary Length: 875, compression ratio: 0.06

In [13]:
## Generate one summary from all the concepts
lsa = TruncatedSVD(n_components=20, n_iter=10)
lsa.fit(term_document_matrix)

lsa_scores = lsa.components_
lsa_scores = lsa_scores / lsa_scores.max(axis=1)[:, np.newaxis] # normalize the scores
lsa_scores = lsa_scores * 100

## Generating the summary
ranked_sentences = [sentence for _, sentence in sorted(zip(lsa_scores.sum(axis=0), sentences), reverse=True)]
summary = ' '.join(ranked_sentences[:5])
summary = summary.replace('\n', '')
print(f"Original Length: {len(article_content)}, Summary Length: {len(summary)}, compression ratio: {len(summary)/len(article_content):.2f}")
print(summary)

Original Length: 15520, Summary Length: 769, compression ratio: 0.05
Due to continuing industrialization and expanding trade, many significant changes of the century were, directly or indirectly, economic and technological in nature. World population increased from about 1.6 billion people in 1901 to 6.1 billion at the century's end. The deaths from acts of war during the two world wars alone have been estimated at between 50 and 80 million. By the end of the 20th century, in many parts of the world, women had the same legal rights as men, and racism had come to be seen as unacceptable, a sentiment often backed up by legislation. [19] According to Charles Tilly, "Altogether, about 100 million people died as a direct result of action by organized military units backed by one government or another over the course of the century.
