# NLP Text Summarization

This notebook demonstrates a simple extractive text summarization pipeline built with NLTK: tokenization, stopword removal, stemming, word-frequency scoring, and selection of the top-scoring sentences.

In [1]:
# Import required libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from string import punctuation

In [7]:
# Download required NLTK resources (only run once)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # for newer NLTK versions
nltk.download('stopwords', quiet=True)

True

In [8]:
# Step 1: Input Text
text = """Natural Language Processing (NLP) is a field of Artificial Intelligence 
that enables computers to understand, interpret, and generate human language. 
NLP combines computational linguistics with machine learning and deep learning models. 
Applications of NLP include machine translation, sentiment analysis, and text summarization. 
Text summarization is a process of creating a short and coherent version of a longer document. 
It aims to capture the most important information while maintaining the original meaning."""

print("Original Text:\n", text)

Original Text:
 Natural Language Processing (NLP) is a field of Artificial Intelligence 
that enables computers to understand, interpret, and generate human language. 
NLP combines computational linguistics with machine learning and deep learning models. 
Applications of NLP include machine translation, sentiment analysis, and text summarization. 
Text summarization is a process of creating a short and coherent version of a longer document. 
It aims to capture the most important information while maintaining the original meaning.


In [9]:
# Step 2: Sentence Tokenization
sentences = sent_tokenize(text)
print("\nSentence Tokenization:\n", sentences)


Sentence Tokenization:
 ['Natural Language Processing (NLP) is a field of Artificial Intelligence \nthat enables computers to understand, interpret, and generate human language.', 'NLP combines computational linguistics with machine learning and deep learning models.', 'Applications of NLP include machine translation, sentiment analysis, and text summarization.', 'Text summarization is a process of creating a short and coherent version of a longer document.', 'It aims to capture the most important information while maintaining the original meaning.']


In [10]:
# Step 3: Word Tokenization
words = word_tokenize(text.lower())
print("\nWord Tokenization:\n", words[:20], "...")


Word Tokenization:
 ['natural', 'language', 'processing', '(', 'nlp', ')', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'enables', 'computers', 'to', 'understand', ',', 'interpret', ','] ...


In [11]:
# Step 4: Stopword Removal
stop_words = set(stopwords.words("english"))
filtered_words = [word for word in words if word not in stop_words and word not in punctuation]
print("\nAfter Stopword Removal:\n", filtered_words[:20], "...")


After Stopword Removal:
 ['natural', 'language', 'processing', 'nlp', 'field', 'artificial', 'intelligence', 'enables', 'computers', 'understand', 'interpret', 'generate', 'human', 'language', 'nlp', 'combines', 'computational', 'linguistics', 'machine', 'learning'] ...
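
The filter `word not in punctuation` is a substring test against the string `string.punctuation`. It works here because every punctuation token in the sample text is a single character. Multi-character tokens that `word_tokenize` can emit for other inputs, such as `...` or `''`, would pass through. Below is a minimal sketch of a stricter check; `is_punctuation` is a hypothetical helper name, not part of NLTK or of this notebook.

```python
from string import punctuation

# Set of individual punctuation characters, for per-character membership tests
PUNCT_CHARS = set(punctuation)


def is_punctuation(token):
    """Return True when a non-empty token consists only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)


# Illustrative tokens, not taken from the notebook's output
sample_tokens = ["nlp", ",", "...", "''", "u.s."]
kept = [tok for tok in sample_tokens if not is_punctuation(tok)]
print(kept)
```

In the notebook's list comprehension, the condition `word not in punctuation` could be swapped for `not is_punctuation(word)` under this assumption.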


In [12]:
# Step 5: Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]
print("\nAfter Stemming:\n", stemmed_words[:20], "...")


After Stemming:
 ['natur', 'languag', 'process', 'nlp', 'field', 'artifici', 'intellig', 'enabl', 'comput', 'understand', 'interpret', 'gener', 'human', 'languag', 'nlp', 'combin', 'comput', 'linguist', 'machin', 'learn'] ...


In [13]:
# Step 6: Frequency Distribution
word_freq = {}
for word in stemmed_words:
    word_freq[word] = word_freq.get(word, 0) + 1

# Normalize frequencies
max_freq = max(word_freq.values()) if word_freq else 1
for word in list(word_freq.keys()):
    word_freq[word] = word_freq[word] / max_freq

print("\nWord Frequencies:\n", word_freq)


Word Frequencies:
 {'natur': 0.3333333333333333, 'languag': 0.6666666666666666, 'process': 0.6666666666666666, 'nlp': 1.0, 'field': 0.3333333333333333, 'artifici': 0.3333333333333333, 'intellig': 0.3333333333333333, 'enabl': 0.3333333333333333, 'comput': 0.6666666666666666, 'understand': 0.3333333333333333, 'interpret': 0.3333333333333333, 'gener': 0.3333333333333333, 'human': 0.3333333333333333, 'combin': 0.3333333333333333, 'linguist': 0.3333333333333333, 'machin': 0.6666666666666666, 'learn': 0.6666666666666666, 'deep': 0.3333333333333333, 'model': 0.3333333333333333, 'applic': 0.3333333333333333, 'includ': 0.3333333333333333, 'translat': 0.3333333333333333, 'sentiment': 0.3333333333333333, 'analysi': 0.3333333333333333, 'text': 0.6666666666666666, 'summar': 0.6666666666666666, 'creat': 0.3333333333333333, 'short': 0.3333333333333333, 'coher': 0.3333333333333333, 'version': 0.3333333333333333, 'longer': 0.3333333333333333, 'document': 0.3333333333333333, 'aim': 0.3333333333333333, 
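
The counting and normalization loops above can also be expressed with `collections.Counter`. The following is a minimal sketch rather than the notebook's own code; the helper name `normalized_frequencies` and the short sample list are assumptions made for illustration.

```python
from collections import Counter


def normalized_frequencies(tokens):
    """Return each token's count divided by the highest count (hypothetical helper)."""
    counts = Counter(tokens)
    if not counts:
        # Nothing to normalize for an empty token list
        return {}
    max_count = max(counts.values())
    return {token: count / max_count for token, count in counts.items()}


# Illustrative sample of stems, not the notebook's full list
sample_stems = ["nlp", "languag", "nlp", "text", "languag", "nlp"]
freqs = normalized_frequencies(sample_stems)
print(freqs)
```

Dividing by the maximum count puts the most frequent stem at 1.0, which matches the scaling used in Step 6.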

In [14]:
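# Step 7: Sentence Scoring
# Note: word_freq is keyed by stems, but the tokens below are not stemmed,
# so only words identical to their stem (e.g. 'nlp', 'text') add to a score.
sentence_scores = {}
for sent in sentences:
    for word in word_tokenize(sent.lower()):
        if word in word_freq:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_freq[word]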

print("\nSentence Scores:\n", sentence_scores)


Sentence Scores:
 {'Natural Language Processing (NLP) is a field of Artificial Intelligence \nthat enables computers to understand, interpret, and generate human language.': 2.333333333333333, 'NLP combines computational linguistics with machine learning and deep learning models.': 1.3333333333333333, 'Applications of NLP include machine translation, sentiment analysis, and text summarization.': 2.0, 'Text summarization is a process of creating a short and coherent version of a longer document.': 2.6666666666666665}
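
In the scores above, the fifth sentence has no entry at all: none of its unstemmed tokens ("aims", "capture", "important", ...) appears as a key in the stemmed `word_freq` dictionary. One possible adjustment is to stem each token before the lookup. The sketch below assumes that approach; `score_sentences`, `toy_stem`, and the sample data are hypothetical and exist only for illustration, and `toy_stem` is a stand-in, not the Porter algorithm.

```python
def score_sentences(sentences, word_freq, tokenize, stem):
    """Sum the frequencies of each sentence's stemmed tokens (hypothetical helper)."""
    scores = {}
    for sent in sentences:
        total = 0.0
        for token in tokenize(sent.lower()):
            # Stem before the lookup so the key matches how word_freq was built
            total += word_freq.get(stem(token), 0.0)
        # Every sentence gets an entry, even when its score is zero
        scores[sent] = total
    return scores


def toy_stem(word):
    """Stand-in stemmer for this sketch only: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


# Illustrative data, not the notebook's values
sample_freq = {"aim": 0.5, "model": 1.0}
sample_sentences = ["It aims high.", "Models need data."]
scores = score_sentences(sample_sentences, sample_freq, str.split, toy_stem)
print(scores)
```

With the notebook's objects, the equivalent call would be `score_sentences(sentences, word_freq, word_tokenize, stemmer.stem)`. The resulting scores, and possibly the selected summary sentences, would differ from the output shown above.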


In [15]:
# Step 8: Generate Summary (Top N sentences)
import heapq
# nlargest returns the sentences by descending score, not in document order
summary_sentences = heapq.nlargest(2, sentence_scores, key=sentence_scores.get)
summary = " ".join(summary_sentences)

print("\n----- SUMMARY -----\n")
print(summary)


----- SUMMARY -----

Text summarization is a process of creating a short and coherent version of a longer document. Natural Language Processing (NLP) is a field of Artificial Intelligence 
that enables computers to understand, interpret, and generate human language.
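
The summary above lists the fourth sentence before the first because `heapq.nlargest` orders its result by score, and the first sentence keeps the line break from the input text. Below is a minimal sketch of one way to restore document order and collapse whitespace; `build_summary` and the sample scores are assumptions made for illustration.

```python
import heapq


def build_summary(sentences, scores, n=2):
    """Pick the n highest-scoring sentences, then emit them in document order."""
    top = heapq.nlargest(
        n,
        range(len(sentences)),
        key=lambda i: scores.get(sentences[i], 0.0),
    )
    # sorted() restores the original order; split/join collapses newlines and extra spaces
    return " ".join(" ".join(sentences[i].split()) for i in sorted(top))


# Illustrative data, not the notebook's values
sample_sentences = ["First \nsentence.", "Second sentence.", "Third sentence."]
sample_scores = {
    "First \nsentence.": 2.0,
    "Second sentence.": 1.0,
    "Third sentence.": 3.0,
}
summary_text = build_summary(sample_sentences, sample_scores)
print(summary_text)
```

Selecting by index rather than by sentence string also avoids collisions if the same sentence appears twice in the input.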
