
**EXPLANATION:**

Import Libraries: Load the modules used below: `re` and `string` for text cleanup, plus NLTK's tokenizer, stopword list, stemmer, and lemmatizer.



In [2]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

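Download NLTK Data: Fetch the resources the later steps depend on: the Punkt tokenizer models (`punkt`, `punkt_tab`) for splitting text into sentences and words, the stopword lists, and WordNet (`wordnet`, `omw-1.4`) for lemmatization.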



In [8]:
# Download required NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Read Text File: Open sample.txt and read its content for processing.



In [4]:
# Step 1: Read a sample text file (ensure a file named 'sample.txt' exists with 1000+ words)
with open('sample.txt', 'r', encoding='utf-8') as file:
    text = file.read()
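
The comment in this cell assumes `sample.txt` already exists and holds at least 1000 words. A minimal sketch of checking that up front follows. `load_text` and its `min_words` parameter are hypothetical names introduced here for illustration, and the word count is a plain whitespace split rather than anything NLTK-based.

```python
import tempfile
from pathlib import Path


def load_text(path, min_words=1000):
    """Read a UTF-8 text file and fail early if it is missing or too short.

    Hypothetical helper: not part of the original notebook.
    """
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"Expected a text file at {file_path}")
    text = file_path.read_text(encoding="utf-8")
    # A whitespace split is a rough word count, good enough for a sanity check.
    word_count = len(text.split())
    if word_count < min_words:
        raise ValueError(
            f"{file_path} has {word_count} words; expected at least {min_words}"
        )
    return text


# Demo on a throwaway file so the sketch runs without sample.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.txt"
    demo_path.write_text("technology shapes society " * 400, encoding="utf-8")
    demo_text = load_text(demo_path)
    print(len(demo_text.split()))  # 3 words repeated 400 times
```

In the notebook itself the call would be `text = load_text('sample.txt')`, assuming the file sits next to the notebook.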

Preview Input: Print the first 500 characters of the input text.

In [5]:
# Display the initial part of the input text
print("Sample input text:", text[:500], "\n...")

Sample input text: 
In the vast expanse of the digital age, information flows ceaselessly, shaping the contours of our understanding and interaction with the world.
Technology has become the heartbeat of society, influencing every facet of our existence—from communication and commerce to education and entertainment.
As we navigate this ever-evolving landscape, the importance of critical thinking and digital literacy cannot be overstated.

The history of technology is a testament to human ingenuity and resilience.  
...


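Normalize Text: Convert text to lowercase and remove punctuation. `string.punctuation` covers only ASCII characters, so the em dash in `existence—from` survives, and deleting the hyphen fuses "ever-evolving" into "everevolving".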



In [6]:
# Step 2: Text normalization
# Lowercasing and removing punctuation
normalized_text = text.lower()
normalized_text = re.sub(f"[{re.escape(string.punctuation)}]", "", normalized_text)
print("\nNormalized text sample:", normalized_text[:500], "\n...")


Normalized text sample: 
in the vast expanse of the digital age information flows ceaselessly shaping the contours of our understanding and interaction with the world
technology has become the heartbeat of society influencing every facet of our existence—from communication and commerce to education and entertainment
as we navigate this everevolving landscape the importance of critical thinking and digital literacy cannot be overstated

the history of technology is a testament to human ingenuity and resilience from the  
...
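
The normalized sample still contains the token `existence—from` and the fused word "everevolving". One possible way to avoid both is sketched here, using only the standard library: replace every character that is neither a word character nor whitespace with a space, then collapse the extra spaces. The function name `normalize` and the sample sentence are made up for this illustration, and this is an alternative to the cell above, not what the notebook runs.

```python
import re


def normalize(text):
    """Lowercase text and replace punctuation (ASCII or not) with spaces."""
    lowered = text.lower()
    # For str patterns, \w and \s are Unicode-aware in Python 3, so the em dash
    # counts as punctuation here. Note that the underscore is a word character
    # and is therefore kept.
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # Collapse runs of spaces and tabs left by the substitution; keep newlines.
    return re.sub(r"[ \t]+", " ", no_punct).strip()


sample = "Our existence—from commerce to education. This ever-evolving landscape!"
print(normalize(sample))
# our existence from commerce to education this ever evolving landscape
```

Substituting a space instead of an empty string is the design choice that keeps "ever" and "evolving" as separate tokens.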


Tokenize Text: Split the text into individual words (tokens).



In [9]:
# Step 3: Tokenization
tokens = word_tokenize(normalized_text)
print("\nTokenized sample:", tokens[:30], "\n...")


Tokenized sample: ['in', 'the', 'vast', 'expanse', 'of', 'the', 'digital', 'age', 'information', 'flows', 'ceaselessly', 'shaping', 'the', 'contours', 'of', 'our', 'understanding', 'and', 'interaction', 'with', 'the', 'world', 'technology', 'has', 'become', 'the', 'heartbeat', 'of', 'society', 'influencing'] 
...


Remove Stopwords: Remove common words like "the", "is", "and" to focus on important words.



In [10]:
# Step 4: Stopword removal
stop_words = set(stopwords.words('english'))
tokens_without_stopwords = [token for token in tokens if token not in stop_words]
print("\nSample after stopword removal:", tokens_without_stopwords[:30], "\n...")


Sample after stopword removal: ['vast', 'expanse', 'digital', 'age', 'information', 'flows', 'ceaselessly', 'shaping', 'contours', 'understanding', 'interaction', 'world', 'technology', 'become', 'heartbeat', 'society', 'influencing', 'every', 'facet', 'existence—from', 'communication', 'commerce', 'education', 'entertainment', 'navigate', 'everevolving', 'landscape', 'importance', 'critical', 'thinking'] 
...
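
Stopword removal can also drop words that carry meaning. In the processed text further down, "cannot be overstated" shrinks to "overstated", which suggests the tokenizer split "cannot" into "can" and "not" and both were filtered out. The sketch below shows one way to keep negations. `DEMO_STOP_WORDS` is a tiny stand-in set written for this example, not NLTK's list, and `remove_stopwords` is a hypothetical helper.

```python
# Tiny stand-in stopword set for illustration only; the notebook uses
# stopwords.words('english') from NLTK, which is much larger.
DEMO_STOP_WORDS = {"the", "of", "can", "not", "be", "no"}
NEGATIONS = {"not", "no"}


def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in the given stopword set."""
    return [token for token in tokens if token not in stop_words]


tokens = ["the", "importance", "of", "literacy", "can", "not", "be", "overstated"]

# Default behaviour: the negation disappears along with the other stopwords.
print(remove_stopwords(tokens, DEMO_STOP_WORDS))
# ['importance', 'literacy', 'overstated']

# Set difference removes the negations from the stopword set, so they survive.
print(remove_stopwords(tokens, DEMO_STOP_WORDS - NEGATIONS))
# ['importance', 'literacy', 'not', 'overstated']
```

With the real list, the same idea would read `stop_words = set(stopwords.words('english')) - NEGATIONS`. Whether to keep negations depends on the downstream task.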


Create Stemmer and Lemmatizer: Instantiate the Porter stemmer and the WordNet lemmatizer used in the next two steps.

In [11]:
# Step 5: Stemming and Lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

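Stem Words: Strip suffixes with the Porter algorithm to reduce each word to a stem (e.g., "running" → "run"). Stems are not always dictionary words: "technology" becomes "technolog".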


In [12]:
# Apply stemming
stemmed_tokens = [stemmer.stem(token) for token in tokens_without_stopwords]
print("\nStemmed tokens sample:", stemmed_tokens[:30], "\n...")


Stemmed tokens sample: ['vast', 'expans', 'digit', 'age', 'inform', 'flow', 'ceaselessli', 'shape', 'contour', 'understand', 'interact', 'world', 'technolog', 'becom', 'heartbeat', 'societi', 'influenc', 'everi', 'facet', 'existence—from', 'commun', 'commerc', 'educ', 'entertain', 'navig', 'everevolv', 'landscap', 'import', 'critic', 'think'] 
...
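
A short sketch of what the Porter stemmer does to individual words, using the same `PorterStemmer` class as the notebook. The word list is chosen for illustration. `PorterStemmer` is rule-based and needs no downloaded NLTK data, so this runs with only the `nltk` package installed.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words picked for illustration; several also appear in the output above.
words = ["running", "technology", "society", "importance", "import"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
# running -> run
# technology -> technolog
# society -> societi
# importance -> import
# import -> import
```

The last two lines show a known side effect of stemming: unrelated words such as "importance" and "import" can collapse to the same stem.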



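Lemmatize Words: Map each word to its dictionary base form using WordNet. `lemmatize()` treats every token as a noun unless a part of speech is passed, so this cell reduces plurals ("flows" → "flow") but leaves verb forms such as "shaping" unchanged; getting "better" → "good" requires `pos='a'`.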


In [13]:
# Apply lemmatization
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens_without_stopwords]
print("\nLemmatized tokens sample:", lemmatized_tokens[:30], "\n...")


Lemmatized tokens sample: ['vast', 'expanse', 'digital', 'age', 'information', 'flow', 'ceaselessly', 'shaping', 'contour', 'understanding', 'interaction', 'world', 'technology', 'become', 'heartbeat', 'society', 'influencing', 'every', 'facet', 'existence—from', 'communication', 'commerce', 'education', 'entertainment', 'navigate', 'everevolving', 'landscape', 'importance', 'critical', 'thinking'] 
...



Combine Tokens: Join processed words back into a single text string.


In [14]:
# Step 6: Join the lemmatized tokens into a single string
processed_text = ' '.join(lemmatized_tokens)
print("\nProcessed text sample:", processed_text[:500], "\n...")


Processed text sample: vast expanse digital age information flow ceaselessly shaping contour understanding interaction world technology become heartbeat society influencing every facet existence—from communication commerce education entertainment navigate everevolving landscape importance critical thinking digital literacy overstated history technology testament human ingenuity resilience wheel microchip invention built upon knowledge past creating tapestry progress innovation early day computing pioneer like ada love 
...



Save Output: Write the cleaned text to processed_output.txt.

In [15]:
# Save the processed text to an output file
with open('processed_output.txt', 'w', encoding='utf-8') as out_file:
    out_file.write(processed_text)

In [16]:
print("\nText preprocessing complete. Processed text saved to 'processed_output.txt'")



Text preprocessing complete. Processed text saved to 'processed_output.txt'
