# **Comprehensive NLP Lab: From Preprocessing to Feature Extraction**

In this lab, you will explore a wide range of Natural Language Processing (NLP) techniques, from basic text preprocessing to advanced feature extraction and analysis. By the end of this lab, you will be able to:

1. **Tokenize** and preprocess text data.
2. Remove **stop words** and **punctuation**.
3. Apply **stemming** and **lemmatization**.
4. Extract features using **Bag of Words (BoW)** and **TF-IDF**.
5. Generate **n-grams** to capture contextual information.
6. Evaluate the impact of different preprocessing techniques on text data.

Let's dive in!

## **1. Setup the Environment**


Before we begin, ensure you have the necessary libraries installed. Run the following cell to install them:


In [1]:
!pip install nltk scikit-learn pandas matplotlib




Now, import the required libraries:

In [2]:

import nltk
import re
import string
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [3]:
# Download the NLTK resources used in this lab
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

## **2. Text Preprocessing**

### **Exercise 1: Tokenization and Stop Word Removal**

Tokenize the following text:

In [4]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."
def tokenize_words(text: str):
    """
    Return only word tokens (letters, incl. Unicode letters).
    Keeps internal apostrophes (e.g., don't).
    Excludes digits, underscores, and punctuation.
    """
    pattern = r"[^\W\d_]+(?:'[^\W\d_]+)?"  # letters-only words, optional internal apostrophe
    return re.findall(pattern, text, flags=re.UNICODE)

# Tokenize the text defined at the top of this cell
print(tokenize_words(text))
# ['Natural', 'Language', 'Processing', 'NLP', 'is', 'a', 'fascinating', 'field', 'of', 'study', 'It', 'involves', 'analyzing', 'and', 'understanding', 'human', 'language']


['Natural', 'Language', 'Processing', 'NLP', 'is', 'a', 'fascinating', 'field', 'of', 'study', 'It', 'involves', 'analyzing', 'and', 'understanding', 'human', 'language']
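
To see what this pattern keeps and what it drops, the following minimal sketch runs the same regular expression on a made-up sentence. The sentence and the names `demo` and `demo_tokens` are illustrative and not part of the lab.

```python
import re

# Same pattern as in tokenize_words: runs of letters, optionally followed by
# one apostrophe and more letters.
pattern = r"[^\W\d_]+(?:'[^\W\d_]+)?"

# Hypothetical input chosen to exercise the edge cases.
demo = "Don't stop: it's 2024 and snake_case isn't one word!"
demo_tokens = re.findall(pattern, demo)

# Digits are dropped entirely, and the underscore splits snake_case in two.
print(demo_tokens)
```

Because the only group in the pattern is non-capturing, `re.findall` returns whole matches rather than group contents.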


Remove stop words and store the result in a variable called `filtered_tokens`.

In [None]:
import re
import nltk
from nltk.corpus import stopwords

# Ensure NLTK stopwords are available
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

tokens = tokenize_words(text)

# Remove stop words (case-insensitive) and store the rest in filtered_tokens
filtered_tokens = [t for t in tokens if t.lower() not in stop_words]
print(filtered_tokens)

['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


In [8]:
print("Filtered Tokens:", filtered_tokens)

Filtered Tokens: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']
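
The `t.lower()` call in the filter matters because NLTK's English stop-word list is lowercase. As a small sketch of the difference, the example below uses a hand-written miniature stop-word set (`toy_stop_words` and `toy_tokens` are hypothetical stand-ins, not the NLTK list).

```python
# Hypothetical miniature stop-word set; the real NLTK list is much longer.
toy_stop_words = {"it", "is", "a", "of"}
toy_tokens = ["It", "is", "a", "fascinating", "field", "of", "study"]

# Case-sensitive membership test: "It" is not equal to "it", so it survives.
case_sensitive = [t for t in toy_tokens if t not in toy_stop_words]

# Case-insensitive test, as used in the exercise: "It" is removed too.
case_insensitive = [t for t in toy_tokens if t.lower() not in toy_stop_words]

print(case_sensitive)
print(case_insensitive)
```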


### **Exercise 2: Stemming and Lemmatization**

Apply stemming and lemmatization to the `filtered_tokens`. Compare the results.

In [9]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

# assumes you already have: filtered_tokens (a list of strings)

# Ensure required NLTK data is available: look under corpora/ first, then
# taggers/, and download the package only if both lookups fail
for pkg in ("wordnet", "omw-1.4", "averaged_perceptron_tagger"):
    try:
        nltk.data.find(f"corpora/{pkg}")
    except LookupError:
        try:
            nltk.data.find(f"taggers/{pkg}")
        except LookupError:
            nltk.download(pkg)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...


In [None]:

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()


Apply stemming and store the result in `stemmed_tokens`.

In [14]:
# Stemming
stemmed_tokens = [stemmer.stem(w) for w in filtered_tokens]

In [12]:
print("Stemmed Tokens:", stemmed_tokens)

Stemmed Tokens: ['natur', 'languag', 'process', 'nlp', 'fascin', 'field', 'studi', 'involv', 'analyz', 'understand', 'human', 'languag']


Apply lemmatization and store the result in `lemmatized_tokens`.

In [19]:
# Lemmatize without POS tagging (defaults to noun)
lemmatized_tokens = [lemmatizer.lemmatize(w) for w in filtered_tokens]

In [20]:
print("Lemmatized Tokens:", lemmatized_tokens)

Lemmatized Tokens: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']

Note that the lemmatized tokens are identical to `filtered_tokens`. Without a part-of-speech tag, `WordNetLemmatizer` treats every word as a noun, so verb forms such as `involves` and `analyzing` are left unchanged, whereas the stemmer lowercased and truncated almost every token.


## **3. Feature Extraction**

### **Exercise 3: Bag of Words (BoW)**

Use the `CountVectorizer` from `scikit-learn` to create a Bag of Words representation of the following corpus:

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

# creating bag of words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)



In [26]:

# Step 1: Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Step 2: Fit and transform the corpus into a BoW representation
X = vectorizer.fit_transform(corpus)

# Vocabulary and matrix
feature_names = vectorizer.get_feature_names_out()
bow_matrix = X.toarray()

print("Features:", feature_names)
print("Bag-of-Words matrix:\n", bow_matrix)

Features: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
Bag-of-Words matrix:
 [[0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 0 0 1 0]
 [0 1 1 0 1 0 1 1 1]]


In [24]:
print("Bag of Words:\n", X.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

Bag of Words:
 [[0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 0 0 1 0]
 [0 1 1 0 1 0 1 1 1]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
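
The matrix above can be reproduced by hand, which also explains why "I" is missing from the vocabulary. The sketch below is a plain-Python approximation of `CountVectorizer` with default settings (lowercasing, tokens of two or more word characters); the names `bow_tokens`, `vocabulary`, and `bow_rows` are illustrative, and the real vectorizer returns a sparse matrix rather than lists.

```python
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

# Default CountVectorizer behavior: lowercase the text and keep only tokens
# with at least two word characters, so the single-letter "I" is discarded.
token_pattern = re.compile(r"\b\w\w+\b")

def bow_tokens(doc):
    return token_pattern.findall(doc.lower())

tokenized = [bow_tokens(doc) for doc in corpus]

# The vocabulary is the sorted set of all tokens; each column is one term.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# Each row counts how often every vocabulary term occurs in one document.
bow_rows = []
for doc in tokenized:
    counts = Counter(doc)
    bow_rows.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_rows:
    print(row)
```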


### **Exercise 4: TF-IDF**

Use the `TfidfVectorizer` from `scikit-learn` to create a TF-IDF representation of the same corpus. Store the result in `X_tfidf`.

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]


# Step 1: Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Step 2: Fit and transform the corpus into a TF-IDF representation
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

In [32]:


print("TF-IDF:\n", X_tfidf.toarray())

print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

TF-IDF:
 [[0.         0.         0.         0.         0.         0.861037
  0.         0.50854232 0.        ]
 [0.65249088 0.         0.         0.65249088 0.         0.
  0.         0.38537163 0.        ]
 [0.         0.43238509 0.43238509 0.         0.43238509 0.
  0.43238509 0.2553736  0.43238509]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
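
To make these weights less opaque, here is a hand computation that should match the values above. It assumes the `TfidfVectorizer` defaults: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The variable names are illustrative.

```python
import math
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

# Same tokenization as the default vectorizer: lowercase, 2+ word characters.
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in corpus]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = {term: sum(term in doc for doc in tokenized) for term in vocabulary}

# Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

tfidf_rows = []
for doc in tokenized:
    counts = Counter(doc)
    weights = [counts[term] * idf[term] for term in vocabulary]
    # L2-normalize the row so that every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights))
    tfidf_rows.append([w / norm for w in weights])

for row in tfidf_rows:
    print([round(w, 6) for w in row])
```

Since "nlp" occurs in every document, its IDF is exactly 1, the smallest possible value, which is why it receives the lowest non-zero weight in each row.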


### **Exercise 5: N-grams**

Generate bigrams (2-grams) from the corpus using `CountVectorizer` with `ngram_range=(2, 2)`. Store the result in `X_bigram`.

In [33]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

# Step 1: Initialize the CountVectorizer with ngram_range=(2, 2)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))

# Step 2: Fit and transform the corpus into a bigram representation
X_bigram = bigram_vectorizer.fit_transform(corpus)


In [34]:
print("Bigrams:\n", X_bigram.toarray())
print("Bigram Vocabulary:", bigram_vectorizer.get_feature_names_out())

Bigrams:
 [[0 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 1 0]
 [1 1 0 1 0 1 0 1]]
Bigram Vocabulary: ['enjoy learning' 'in nlp' 'is amazing' 'learning new' 'love nlp'
 'new things' 'nlp is' 'things in']
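
A bigram is a pair of adjacent tokens. As a rough sketch of what the vectorizer does internally, the snippet below pairs each token with its successor after the default tokenization (lowercase, tokens of two or more word characters). The helper `word_bigrams` is a hypothetical name for illustration.

```python
import re

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

token_pattern = re.compile(r"\b\w\w+\b")

def word_bigrams(doc):
    # Tokenize first, then pair each token with the one that follows it.
    tokens = token_pattern.findall(doc.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

bigrams_per_doc = [word_bigrams(doc) for doc in corpus]
bigram_vocabulary = sorted({bg for doc in bigrams_per_doc for bg in doc})

print(bigrams_per_doc[0])
print(bigram_vocabulary)
```

Because "I" is dropped during tokenization, no bigram such as "i love" is formed, and the first document contributes only "love nlp".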


## **4. Advanced Exercise: Custom Preprocessing Pipeline**

### **Exercise 6: Build a Custom Preprocessing Pipeline**

Combine all the preprocessing steps (tokenization, stop word removal, punctuation removal, stemming/lemmatization) into a single function. 

In [35]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Ensure required NLTK data is available (only needed once)
try:
    _ = stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')
try:
    nltk.data.find('corpora/omw-1.4')
except LookupError:
    nltk.download('omw-1.4')

# Helper: word-only token pattern (letters incl. Unicode, optional internal apostrophe)
_WORD_PATTERN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?", flags=re.UNICODE)

# Build the stop-word set and the lemmatizer once, not on every call
_STOP_WORDS = set(stopwords.words('english'))
_LEMMATIZER = WordNetLemmatizer()

def text_preprocessing_pipeline(text):
    # Step 1: Tokenize the text (words + punctuation)
    raw_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text, flags=re.UNICODE)

    # Step 2: Remove stop words (case-insensitive)
    no_stops = [t for t in raw_tokens if t.lower() not in _STOP_WORDS]

    # Step 3: Remove punctuation and any token containing digits or underscores
    word_tokens = [t for t in no_stops if _WORD_PATTERN.fullmatch(t)]

    # Step 4: Lowercase and lemmatize (no POS tagging; defaults to noun)
    lemmatized_tokens = [_LEMMATIZER.lemmatize(t.lower()) for t in word_tokens]

    return lemmatized_tokens


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Apply this function to the following text:

In [38]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."
processed_text = text_preprocessing_pipeline(text)
print(processed_text)



['natural', 'language', 'processing', 'nlp', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


In [39]:
print("Processed Text:", processed_text)

Processed Text: ['natural', 'language', 'processing', 'nlp', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


## **5. Evaluation of Preprocessing Techniques**

### **Exercise 7: Compare Preprocessing Techniques**

Compare the results of stemming and lemmatization on the following sentence. Store the results in `stemmed_tokens` and `lemmatized_tokens`.

In [40]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Ensure required NLTK data is available
try:
    _ = stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [41]:
# Helper: the same letters-only tokenizer as in Exercise 1, with a precompiled pattern
word_pattern = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?", flags=re.UNICODE)

def tokenize_words(s):
    return word_pattern.findall(s)

In [42]:
# Input sentence

sentence = "The cats are playing with the mice in the garden."

# Step 1: Tokenize and preprocess the sentence and store the result in filtered_tokens
stop_words = set(stopwords.words('english'))
tokens = tokenize_words(sentence)
filtered_tokens = [t for t in tokens if t.lower() not in stop_words]

# Step 2: Apply stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(t.lower()) for t in filtered_tokens]


# Step 3: Apply lemmatization (without POS tagging; defaults to noun)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(t.lower()) for t in filtered_tokens]



In [43]:
print("Original Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Original Tokens: ['cats', 'playing', 'mice', 'garden']
Stemmed Tokens: ['cat', 'play', 'mice', 'garden']
Lemmatized Tokens: ['cat', 'playing', 'mouse', 'garden']
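
The two outputs differ for a structural reason: a stemmer applies suffix rules, so it cannot turn `mice` into `mouse`, while a lemmatizer looks words up in a dictionary and depends on the part of speech, which is why `playing` stays unchanged when treated as a noun. The toy sketch below mimics that contrast in plain Python. It is neither the Porter algorithm nor WordNet; `toy_stem`, `toy_lemmatize`, and `TOY_LEMMAS` are hypothetical stand-ins.

```python
def toy_stem(word):
    """Strip a couple of common suffixes (a toy rule set, not Porter)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy dictionary keyed by (word, part of speech), standing in for WordNet.
TOY_LEMMAS = {
    ("cats", "n"): "cat",
    ("mice", "n"): "mouse",
    ("playing", "v"): "play",
}

def toy_lemmatize(word, pos="n"):
    # Unknown (word, pos) pairs are returned unchanged.
    return TOY_LEMMAS.get((word, pos), word)

tokens = ["cats", "playing", "mice", "garden"]
toy_stems = [toy_stem(t) for t in tokens]
toy_lemmas_noun = [toy_lemmatize(t) for t in tokens]
toy_lemma_verb = toy_lemmatize("playing", pos="v")

print(toy_stems)
print(toy_lemmas_noun)
print(toy_lemma_verb)
```

In NLTK, `WordNetLemmatizer.lemmatize` accepts a `pos` argument in the same spirit, so passing a verb tag for `playing` is the usual way to get `play`.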


## **6. Real-World Dataset: Sentiment Analysis**

### **Exercise 8: Preprocess and Analyze Tweets**

In this exercise, you will work with a real-world dataset of tweets. The dataset contains 5000 positive and 5000 negative tweets. Your task is to preprocess the tweets and extract features for sentiment analysis.


In [44]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\twitter_samples.zip.


True

In [45]:
# Load the dataset
from nltk.corpus import twitter_samples

Load the dataset of positive and negative tweets. 

In [46]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Combine them into a single list called `all_tweets` and create a corresponding list of labels called `labels`.

In [None]:
# 1) Combining the tweets into a single list
all_tweets = list(positive_tweets) + list(negative_tweets)

# 2) Create the corresponding labels: one per tweet, "pos" for each positive
# tweet and "neg" for each negative one
labels = (["pos"] * len(positive_tweets)) + (["neg"] * len(negative_tweets))


# (optional) sanity checks
assert len(all_tweets) == len(labels), "Tweets and labels must be the same length."
assert labels.count("pos") == len(positive_tweets)
assert labels.count("neg") == len(negative_tweets)

print(f"Total samples: {len(all_tweets)} (pos={labels.count('pos')}, neg={labels.count('neg')})")


Total samples: 10000 (pos=5000, neg=5000)


In [48]:
# Print a sample tweet
print("Sample Tweet:", all_tweets[0])
print("Label:", labels[0])

Sample Tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Label: pos


### **Exercise 9: Preprocess Tweets**

Apply the custom preprocessing pipeline to the entire dataset of tweets. Store the result in `preprocessed_tweets`.

In [49]:
# Inputs:
# - all_tweets: list[str]  (combined positive + negative tweets)
# - text_preprocessing_pipeline(text) -> list[str]  (custom pipeline from Exercise 6)

# Apply the custom preprocessing pipeline to every tweet
preprocessed_tweets = [text_preprocessing_pipeline(t) for t in all_tweets]


In [50]:
# Print a sample preprocessed tweet
print("Preprocessed Tweets Sample:", preprocessed_tweets[0])

Preprocessed Tweets Sample: ['followfriday', 'top', 'engaged', 'member', 'community', 'week']
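
The three user handles vanish from this sample because of the two regular expressions in the pipeline, not because of any Twitter-specific rule. The sketch below replays only the tokenization and word-filter steps (no stop-word removal or lemmatization) on the sample tweet. The second input, `other`, is a made-up example.

```python
import re

# Step 1 pattern from the pipeline: word-like tokens or single punctuation marks.
raw_pattern = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
# Step 3 pattern: letters only, with an optional internal apostrophe.
word_pattern = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")

tweet = ("#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top "
         "engaged members in my community this week :)")

raw = raw_pattern.findall(tweet)
kept = [t for t in raw if word_pattern.fullmatch(t)]
dropped = [t for t in raw if not word_pattern.fullmatch(t)]

print(kept)     # letters-only tokens, including the hashtag text
print(dropped)  # punctuation plus handles containing "_" or digits

# Hypothetical tweet: a letters-only handle survives as an ordinary word.
other = "@nlp_fan thanks @alice"
other_kept = [t for t in raw_pattern.findall(other) if word_pattern.fullmatch(t)]
print(other_kept)
```

This means the filter removes handles only when they happen to contain digits or underscores, and the `:)` emoticon, a useful sentiment cue, is discarded as punctuation.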


In [None]:
# The pipeline returns a list of tokens per tweet, but the scikit-learn vectorizers
# used in Exercise 10 expect one string per document. Join each tweet's tokens into
# a space-separated string (preprocessed_texts) so the vectorizers can parse them later.
preprocessed_texts = [" ".join(tokens) for tokens in preprocessed_tweets]

# Quick sanity check
print(len(preprocessed_tweets), "tweets preprocessed")
print(preprocessed_tweets[0])  # first tweet's token list

10000 tweets preprocessed
['followfriday', 'top', 'engaged', 'member', 'community', 'week']


### **Exercise 10: Feature Extraction on Tweets**

Extract features from the preprocessed tweets using **Bag of Words** and **TF-IDF**. Store the results in `X_bow` and `X_tfidf`, respectively.

In [52]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Step 1: Create a Bag of Words representation.
# CountVectorizer builds a vocabulary from the corpus and encodes each tweet as a
# sparse row of term counts. Unigrams and bigrams are kept only if they appear in
# at least 2 tweets (min_df=2) and in at most 95% of the tweets (max_df=0.95).
bow_vectorizer = CountVectorizer(min_df=2, max_df=0.95, ngram_range=(1, 2))
X_bow = bow_vectorizer.fit_transform(preprocessed_texts)


# Step 2: Create a TF-IDF representation.
# TfidfVectorizer uses the same vocabulary settings but weights each count by its
# inverse document frequency (TF-IDF = term frequency * inverse document frequency),
# so terms that occur in many tweets are downweighted and rarer terms are upweighted.
tfidf_vectorizer = TfidfVectorizer(min_df=2, max_df=0.95, ngram_range=(1, 2))
X_tfidf = tfidf_vectorizer.fit_transform(preprocessed_texts)


# (optional) Inspect the vocabulary and matrix shapes
print("BoW features:", bow_vectorizer.get_feature_names_out()[:20])
print("X_bow shape:", X_bow.shape)
print("TF-IDF features:", tfidf_vectorizer.get_feature_names_out()[:20])
print("X_tfidf shape:", X_tfidf.shape)

BoW features: ['aa' 'aaahhh' 'aameen' 'aaroncarpenter' 'ab' 'abby' 'abhi' 'able'
 'able get' 'able meet' 'able see' 'able sleep' 'able tell' 'aboard' 'abp'
 'abroad' 'absolute' 'absolutely' 'absolutely amazing' 'abt']
X_bow shape: (10000, 7307)
TF-IDF features: ['aa' 'aaahhh' 'aameen' 'aaroncarpenter' 'ab' 'abby' 'abhi' 'able'
 'able get' 'able meet' 'able see' 'able sleep' 'able tell' 'aboard' 'abp'
 'abroad' 'absolute' 'absolutely' 'absolutely amazing' 'abt']
X_tfidf shape: (10000, 7307)
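
The vocabulary size of 7307 is the result of the `min_df` and `max_df` filters. As a simplified sketch of that filtering, the snippet below computes document frequencies for a tiny made-up corpus and applies the same thresholds: an integer `min_df` is an absolute document count, and a float `max_df` is a proportion of all documents. The corpus `docs` and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical, already tokenized mini-corpus.
docs = [
    ["rt", "good", "day"],
    ["rt", "good", "night"],
    ["rt", "bad", "day"],
    ["rt", "good", "morning"],
]

min_df = 2     # integer: keep terms found in at least 2 documents
max_df = 0.95  # float: keep terms found in at most 95% of documents

n_docs = len(docs)

# Count each term once per document, regardless of repetitions inside it.
doc_freq = Counter(term for doc in docs for term in set(doc))

kept_terms = sorted(
    term for term, count in doc_freq.items()
    if min_df <= count <= max_df * n_docs
)

print(dict(doc_freq))
print(kept_terms)
```

Here "rt" is removed for being in every document, and "night", "bad", and "morning" are removed for appearing only once.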


## **7. Conclusion**

In this lab, you explored a wide range of NLP techniques, from basic text preprocessing to advanced feature extraction and analysis. You also worked with a real-world dataset of tweets and applied your knowledge to preprocess and extract features for sentiment analysis.

