In [1]:
text ="""India, officially known as the Republic of India, is a vibrant and diverse nation located in South Asia. With a population of over 1.4 billion people, it is the second most populous country in the world and the seventh-largest in terms of land area. Known for its rich cultural heritage, India is a land of contrasts where ancient traditions blend seamlessly with modern advancements. India has a history that dates back thousands of years, making it one of the world's oldest civilizations. The Indus Valley Civilization, which flourished around 2500 BCE, is considered one of the cradles of human civilization. Over the centuries, India has been ruled by various dynasties and empires, including the Maurya, Gupta, Mughal, and British Empires. Each of these periods has left an indelible mark on the country's cultural and architectural heritage. One of India’s most remarkable features is its cultural diversity. It is a melting pot of religions, languages, and traditions. The country is the birthplace of major world religions such as Hinduism, Buddhism, Jainism, and Sikhism. It is also home to significant populations of Muslims, Christians, and other faiths. With 22 officially recognized languages and hundreds of dialects, India’s linguistic landscape is unparalleled. Festivals are an integral part of Indian culture, reflecting its vibrant and diverse traditions. Diwali, Holi, Eid, Christmas, and Pongal are celebrated with great enthusiasm and bring people together in joyous unity. Indian music, dance, and cinema, particularly Bollywood, have gained international recognition and continue to captivate global audiences. India has made significant strides in economic development over the past few decades. It is one of the world’s fastest-growing economies, driven by sectors such as information technology, manufacturing, agriculture, and services. Cities like Bengaluru, Mumbai, and Hyderabad are hubs of innovation and entrepreneurship. Despite its progress, India faces challenges such as poverty, unemployment, and infrastructural gaps. However, government initiatives like "Make in India" and "Digital India" aim to address these issues and propel the nation toward sustainable development. India is blessed with diverse landscapes, ranging from the towering Himalayas in the north to the serene backwaters of Kerala in the south. The Thar Desert, lush forests, and pristine beaches add to its geographical richness. India is also home to a wide variety of flora and fauna, including several endangered species such as the Bengal tiger and the Asiatic lion. National parks and wildlife sanctuaries, such as Jim Corbett National Park and Kaziranga National Park, play a crucial role in preserving this biodiversity. India’s influence extends beyond its borders. It is a founding member of international organizations like the United Nations, the Non-Aligned Movement, and BRICS. India’s contributions to science, technology, and medicine have been noteworthy, with achievements like the Mars Orbiter Mission showcasing its technological prowess. Additionally, the Indian diaspora, spread across the globe, acts as a bridge between India and the world, fostering cultural and economic ties. India is a nation of immense potential and promise. Its rich history, cultural diversity, and dynamic economy make it a unique and fascinating country. While challenges persist, the resilience and ingenuity of its people continue to drive progress. As India moves forward, it remains a beacon of diversity, unity, and innovation, inspiring the world with its vibrant spirit."""

In [2]:
len(text)

3585

In [69]:
%pip install spacy

Note: you may need to restart the kernel to use updated packages.


In [70]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [71]:
%pip -m spacy download en_core_web_sm
#run this in terminal



Usage:   
  /usr/local/bin/python3 -m pip <command> [options]

no such option: -m
Note: you may need to restart the kernel to use updated packages.


In [72]:
nlp = spacy.load('en_core_web_sm')

In [73]:
doc = nlp(text)

In [74]:
tokens = [token.text.lower() for token in doc 
            if not token.is_stop and 
            not token.is_punct and
            token.text != '\n']

In [75]:
len(tokens)

317

In [76]:
tokens1 = []
stopwords = list(STOP_WORDS)
allowed_pos = ['ADJ', 'PROPN', 'VERB', 'NOUN']

for token in doc:
    if token.text in stopwords or token.text in punctuation:
        continue
    if token.pos_ in allowed_pos:
        tokens1.append(token.text)

In [77]:
len(tokens1)

299

In [78]:
from collections import Counter


In [79]:
freq = Counter(tokens)

In [80]:
freq

Counter({'india': 18,
         'world': 6,
         'cultural': 5,
         'country': 4,
         'like': 4,
         'vibrant': 3,
         'diverse': 3,
         'nation': 3,
         'people': 3,
         'traditions': 3,
         'diversity': 3,
         'indian': 3,
         'national': 3,
         'officially': 2,
         'known': 2,
         'south': 2,
         'land': 2,
         'rich': 2,
         'heritage': 2,
         'history': 2,
         'civilization': 2,
         'empires': 2,
         'including': 2,
         'religions': 2,
         'languages': 2,
         'home': 2,
         'significant': 2,
         'unity': 2,
         'international': 2,
         'continue': 2,
         'economic': 2,
         'development': 2,
         'technology': 2,
         'innovation': 2,
         'progress': 2,
         'challenges': 2,
         'park': 2,
         'republic': 1,
         'located': 1,
         'asia': 1,
         'population': 1,
         '1.4': 1,
         'billio

In [81]:
max_freq = max(freq.values())

max_freq

18

In [82]:
#normalising frequencies
for word in freq.keys():
    freq[word] = freq[word]/max_freq

In [83]:
sent_token = [sent.text for sent in doc.sents]

In [84]:
sent_token

['India, officially known as the Republic of India, is a vibrant and diverse nation located in South Asia.',
 'With a population of over 1.4 billion people, it is the second most populous country in the world and the seventh-largest in terms of land area.',
 'Known for its rich cultural heritage, India is a land of contrasts where ancient traditions blend seamlessly with modern advancements.',
 "India has a history that dates back thousands of years, making it one of the world's oldest civilizations.",
 'The Indus Valley Civilization, which flourished around 2500 BCE, is considered one of the cradles of human civilization.',
 'Over the centuries, India has been ruled by various dynasties and empires, including the Maurya, Gupta, Mughal, and British Empires.',
 "Each of these periods has left an indelible mark on the country's cultural and architectural heritage.",
 'One of India’s most remarkable features is its cultural diversity.',
 'It is a melting pot of religions, languages, and t

In [85]:
sent_score = {}

for sent in sent_token:
    for word in sent.split():
        if word.lower() in freq.keys():
            if sent not in sent_score.keys():
                sent_score[sent] = freq[word]
            else:
                sent_score[sent] += freq[word]

In [86]:
sent_score

{'India, officially known as the Republic of India, is a vibrant and diverse nation located in South Asia.': 0.7777777777777777,
 'With a population of over 1.4 billion people, it is the second most populous country in the world and the seventh-largest in terms of land area.': 1.0,
 'Known for its rich cultural heritage, India is a land of contrasts where ancient traditions blend seamlessly with modern advancements.': 0.9444444444444445,
 "India has a history that dates back thousands of years, making it one of the world's oldest civilizations.": 0.33333333333333337,
 'The Indus Valley Civilization, which flourished around 2500 BCE, is considered one of the cradles of human civilization.': 0.2777777777777778,
 'Over the centuries, India has been ruled by various dynasties and empires, including the Maurya, Gupta, Mughal, and British Empires.': 0.2222222222222222,
 "Each of these periods has left an indelible mark on the country's cultural and architectural heritage.": 0.555555555555555

In [87]:
import pandas as pd

In [88]:
txt = pd.DataFrame(list(sent_score.items()), columns=['sentence', 'score'])

In [89]:
txt

Unnamed: 0,sentence,score
0,"India, officially known as the Republic of Ind...",0.777778
1,"With a population of over 1.4 billion people, ...",1.0
2,"Known for its rich cultural heritage, India is...",0.944444
3,India has a history that dates back thousands ...,0.333333
4,"The Indus Valley Civilization, which flourishe...",0.277778
5,"Over the centuries, India has been ruled by va...",0.222222
6,Each of these periods has left an indelible ma...,0.555556
7,One of India’s most remarkable features is its...,0.388889
8,"It is a melting pot of religions, languages, a...",0.111111
9,The country is the birthplace of major world r...,0.777778


In [90]:
from heapq import nlargest

In [93]:
num_sent = 2
sents = nlargest(num_sent, sent_score, key= sent_score.get)

summary = " ".join(sents)

In [92]:
summary

'With a population of over 1.4 billion people, it is the second most populous country in the world and the seventh-largest in terms of land area.Known for its rich cultural heritage, India is a land of contrasts where ancient traditions blend seamlessly with modern advancements.'

## Second approach

In [96]:
%pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.6.0-cp312-cp312-macosx_12_0_arm64.whl.metadata (31 kB)
Collecting scipy>=1.6.0 (from scikit-learn)
  Downloading scipy-1.15.0-cp312-cp312-macosx_14_0_arm64.whl.metadata (61 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.5.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.6.0-cp312-cp312-macosx_12_0_arm64.whl (11.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m11.2/11.2 MB[0m [31m7.7 MB/s[0m eta [36m0:00:00[0m00:01[0m0:01[0m
[?25hDownloading joblib-1.4.2-py3-none-any.whl (301 kB)
Downloading scipy-1.15.0-cp312-cp312-macosx_14_0_arm64.whl (24.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.9/24.9 MB[0m [31m8.6 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hDownloading threadpoolctl-3.5.0-py3-none-any.whl (18 kB)
Installi

In [97]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [101]:
vectorizer = TfidfVectorizer()

sentences = [sent.text for sent in doc.sents]
vectors = vectorizer.fit_transform(sentences)

In [102]:
vectors = [nlp(sentence).vector for sentence in sentences]

In [103]:
vectors

[array([ 0.38839835, -0.33060288, -0.29257646,  0.02619797,  0.06465251,
        -0.03158513,  0.13368835,  0.02776655,  0.26087984, -0.10530698,
         0.11062695, -0.3213635 , -0.33520025,  0.2500936 , -0.2938026 ,
        -0.00915765,  0.02876836,  0.34982464,  0.08068117, -0.25144657,
        -0.298299  ,  0.12633279,  0.10962502, -0.15436274,  0.4274312 ,
        -0.07809234,  0.10536658, -0.34320772,  0.3584214 ,  0.26524752,
        -0.12901446, -0.10566539,  0.14897498,  0.14017746, -0.12990393,
        -0.23992586,  0.11201388,  0.21127169,  0.09690809, -0.3872399 ,
         0.01146693, -0.00204394, -0.12833712, -0.17904636,  0.12369457,
         0.63079715, -0.13364491,  0.0102569 ,  0.36382437,  0.02766543,
        -0.2707515 , -0.02035365, -0.44514588, -0.17435592, -0.23481442,
        -0.23940164,  0.04558674,  0.05812324,  0.24474937,  0.17684838,
        -0.20649078, -0.10583249,  0.20712177, -0.4680389 ,  0.13447577,
        -0.11581215,  0.40178698, -0.10002746, -0.1

In [104]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [105]:
#compute similarity score of each sent wrt to document
doc_vector = np.mean(vectors, axis=0)  # Document vector
scores = cosine_similarity(vectors, [doc_vector])


In [107]:
num_sent = 2

In [108]:
ranked_sentences = [(scores[i], sentences[i]) for i in range(len(sentences))]
ranked_sentences = sorted(ranked_sentences, reverse=True)
summary = " ".join([sent[1] for sent in ranked_sentences[:num_sent]])


In [109]:
summary

'The country is the birthplace of major world religions such as Hinduism, Buddhism, Jainism, and Sikhism. Diwali, Holi, Eid, Christmas, and Pongal are celebrated with great enthusiasm and bring people together in joyous unity.'

## Third approch - abractive text summary using transformers

In [110]:
from transformers import pipeline

In [111]:
summary_pipe = pipeline("summarization", model="t5-base", tokenizer="t5-base")

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
Device set to use 0


In [112]:
summary = summary_pipe(text, max_length = 50, min_length = 10)

I0000 00:00:1736451865.844896 3868720 service.cc:148] XLA service 0x16c170740 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1736451865.845328 3868720 service.cc:156]   StreamExecutor device (0): Host, Default Version
2025-01-10 01:14:25.883822: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1736451866.000862 3868720 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


In [123]:
summary = summary[0]['summary_text']

In [124]:
summary

'with a population of over 1.4 billion people, it is the second most populous country in the world and the seventh-largest in terms of land area . Known for its rich cultural heritage, India is a land'