<a href="https://colab.research.google.com/github/shahzadahmad3/Natural-Language-Processing/blob/main/NLP_Text_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Natural Language Processing (NLP)** is a field of artificial intelligence that focuses on enabling machines to understand, interpret, and generate human language. It combines linguistics, computer science, and machine learning to process text and speech data.

In [15]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m84.8 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


**Text preprocessing** is the first step in any NLP pipeline. It involves cleaning and preparing raw text data for analysis. Let’s break it down into steps:

In [16]:
#1. Lowercasing
text="I'm doing Basic NLP Text Preprocessing"
text=text.lower()
text

"i'm doing basic nlp text preprocessing"

In [17]:
#2. Tokenization
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
tokens=word_tokenize(text)
text

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


"i'm doing basic nlp text preprocessing"

In [18]:
#3. Removing Punctuation
import string
tokens=[word for word in tokens if word not in string.punctuation]
tokens

['i', "'m", 'doing', 'basic', 'nlp', 'text', 'preprocessing']

In [19]:
#4. Removing Stopwords
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords=set(stopwords.words('english'))
filtered_tokens=[word for word in tokens if word not in stopwords]
filtered_tokens


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


["'m", 'basic', 'nlp', 'text', 'preprocessing']

In [20]:
#5. Stemming and Lemmatization
# Stemming: Reducing words to their root form (e.g., "running" → "run").
# Lemmatization: Converting words to their base or dictionary form (e.g., "better" → "good").
from nltk.stem import WordNetLemmatizer, PorterStemmer
nltk.download('wordnet')
stem=PorterStemmer()
stemmed_tokens=[stem.stem(word) for word in filtered_tokens]
print(stemmed_tokens)

lemmatizer=WordNetLemmatizer()
lemmatized_tokens=[lemmatizer.lemmatize(word) for word in stemmed_tokens]
print(lemmatized_tokens)

["'m", 'basic', 'nlp', 'text', 'preprocess']
["'m", 'basic', 'nlp', 'text', 'preprocess']


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


**Activity: Preprocess a Real-World Dataset**
Let's apply what we've learned to a real-world dataset. We'll use the IMDb movie reviews dataset for sentiment analysis.

In [22]:
#Load dataset
import pandas as pd
url="https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
!wget {url}
!tar -xzf aclImdb_v1.tar.gz

--2025-03-08 05:44:27--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2025-03-08 05:45:22 (1.47 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [36]:
# Load positive and negative reviews
import os

def load_reviews(directory):
  review=[]
  for filename in os.listdir(directory):
    with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
      review.append(file.read())
    return review

positive_reviews=load_reviews('aclImdb/train/pos')[:100]
negative_reviews=load_reviews('aclImdb/train/neg')[:100]

In [48]:
# import nltk
# from nltk.tokenize import word_tokenize
# nltk.download('punkt_tab')
# from nltk.corpus import stopwords
# import string
# from nltk.stem import WordNetLemmatizer, PorterStemmer
def preprocessing(text):
  text=text.lower()
  tokens=word_tokenize(text)
  tokens=[word for word in tokens if word not in string.punctuation]
  # stopwords=set(stopwords.words('english'))
  filtered_tokens=[word for word in tokens if word not in stopwords]
  # stem=PorterStemmer()
  # tokens=[stem.stem(word) for word in tokens]
  lemmatizer=WordNetLemmatizer()
  tokens=[lemmatizer.lemmatize(word) for word in tokens]
  return tokens

positive_rev=[preprocessing(review) for review in positive_reviews]
negative_rev=[preprocessing(review) for review in negative_reviews]

In [49]:
print("Postive Reviews: ", positive_rev[0])  # Example output: ['film', 'great', 'acting', 'awesome']
print("Negative Reviews: ",negative_rev[0])  # Example output: ['movie', 'terrible', 'waste', 'time']

Postive Reviews:  ['i', 'do', "n't", 'know', 'why', 'people', 'always', 'want', 'deeper', 'meaning', 'in', 'movie', 'or', 'else', 'consider', 'them', 'worthless.', 'br', 'br', 'what', 'about', 'just', 'being', 'entertained', 'something', 'at', 'which', 'morgan', 'freeman', 'excels', 'he', 'get', 'a', 'chance', 'to', 'show', 'off', 'a', 'bit', 'paz', 'vega', 'his', 'co-star', 'get', 'a', 'career', 'boost', 'and', 'brad', 'silberling', 'get', 'a', 'name', 'to', 'draw', 'people', 'into', 'watching', 'his', 'movie.', 'br', 'br', 'i', 'thought', 'it', 'wa', 'a', 'good', 'movie', 'some', 'humor', 'some', 'pathos', 'some', 'bittersweetness', 'but', 'nothing', 'over', 'the', 'top', 'i', 'got', 'an', 'especial', 'kick', 'out', 'of', 'jim', 'parson', 'a', 'the', 'receptionist', 'at', 'a', 'construction', 'company', 'when', 'he', 'look', 'at', 'freeman', 'adoringly', 'and', 'say', '``', 'you', 'make', 'me', 'want', 'to', 'be', 'a', 'woman', "''", 'he', "'s", 'just', 'hilarious', 'the', 'fight', '