# **Comprehensive NLP Lab: From Preprocessing to Feature Extraction**

In this lab, you will explore a wide range of Natural Language Processing (NLP) techniques, from basic text preprocessing to advanced feature extraction and analysis. By the end of this lab, you will be able to:

1. **Tokenize** and preprocess text data.
2. Remove **stop words** and **punctuation**.
3. Apply **stemming** and **lemmatization**.
4. Extract features using **Bag of Words (BoW)** and **TF-IDF**.
5. Generate **n-grams** to capture contextual information.
6. Evaluate the impact of different preprocessing techniques on text data.

Let's dive in!

## **1. Setup the Environment**


Before we begin, ensure you have the necessary libraries installed. Run the following cell to install them:


In [2]:
pip install nltk==3.8.1 scikit-learn pandas matplotlib


Defaulting to user installation because normal site-packages is not writeable
Collecting nltk==3.8.1
  Using cached nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
  Attempting uninstall: nltk
    Found existing installation: nltk 3.9.1
    Uninstalling nltk-3.9.1:
      Successfully uninstalled nltk-3.9.1
Successfully installed nltk-3.8.1
Note: you may need to restart the kernel to use updated packages.


Now, import the required libraries:

In [3]:

import nltk
import re
import string
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [4]:
# Download NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')




[nltk_data] Downloading package punkt to /Users/test/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/test/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/test/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /Users/test/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/test/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## **2. Text Preprocessing**

### **Exercise 1: Tokenization and Stop Word Removal**

Tokenize the following text:

In [5]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."
tokens = word_tokenize(text)
print(tokens)
print(len(tokens))



['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'of', 'study', '!', 'It', 'involves', 'analyzing', 'and', 'understanding', 'human', 'language', '.']
21


Remove stop words and store the result in a variable called `filtered_tokens`.

In [6]:
stop_words = set(stopwords.words('english'))
print(stop_words)

# Convert to lowercase and filter out stop words
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print(filtered_tokens)
print(len(filtered_tokens))


{'s', 'to', 'o', 'were', "weren't", "couldn't", 'most', "shan't", 'had', 'haven', 'this', 'was', 'have', 'didn', 'itself', 'those', 'until', 'more', 'so', 'but', 'the', "we'd", 'at', "hadn't", 'yourself', 'in', 'am', 'isn', 'again', 'that', 'i', 'who', 'shan', 'ma', 'he', 'is', "that'll", 'off', 'does', 'why', 'out', 'should', 'my', 'being', 'between', 'himself', 'then', 'it', 'about', 'you', 'into', 'd', "doesn't", 'ours', 'your', 'and', "i'm", "don't", "he'll", 'during', 'down', "it'll", 'her', 'over', 'its', 'other', 'them', 'she', 'be', 'mustn', "hasn't", 'having', 'needn', "shouldn't", 'which', 'if', 'very', 'themselves', 'up', 'whom', 'below', "mustn't", 'ourselves', 'same', "didn't", "they've", 've', 'a', 'been', 'while', 'can', 'hers', 'wasn', 'm', 'myself', 'how', 'don', "they'd", "they're", 'nor', 'ain', 'an', 'now', "you'd", 'we', "wasn't", 'doesn', "she'd", 'only', 'these', 'did', 'with', "won't", 'me', 'wouldn', 'for', "we're", "i'll", 'by', 'yours', 'before', 'here', "we'
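The `word.lower()` call in the filter matters because the NLTK stop word list is entirely lowercase. The sketch below illustrates this with a hypothetical miniature stop word set (`MINI_STOP_WORDS` is an assumption made so the example runs without NLTK data; it is not the real list).

```python
# Hypothetical miniature stop word list, used only for illustration.
MINI_STOP_WORDS = {"is", "a", "of", "it", "and"}

tokens = ["It", "is", "a", "fascinating", "field", "of", "study"]

# Case-sensitive check: "It" slips through because the list is lowercase.
case_sensitive = [t for t in tokens if t not in MINI_STOP_WORDS]

# Lowercasing only for the comparison removes "It" while the kept tokens
# retain their original casing.
case_insensitive = [t for t in tokens if t.lower() not in MINI_STOP_WORDS]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing inside the comparison, rather than lowercasing the tokens themselves, is why `Natural` and `NLP` keep their capitals in the filtered output.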

In [7]:
# Remove punctuation
import string
import re

print(string.punctuation)
print(re.escape(string.punctuation))

# Character class matching any single ASCII punctuation character
pattern = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_words_no_punctuation = []
for token in filtered_tokens:
    new_token = pattern.sub('', token)  # strip punctuation characters from the token
    if new_token != '':  # skip tokens that consisted only of punctuation
        tokenized_words_no_punctuation.append(new_token)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
!"\#\$%\&'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~

In [8]:
print("Filtered Tokens:", tokenized_words_no_punctuation)

Filtered Tokens: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']
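As an alternative to the regular expression above, the same filtering can be sketched with `str.translate`. The helper name `strip_punctuation` is hypothetical, and the sample tokens are copied from the stop word step so the block runs on its own.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(tokens):
    """Remove ASCII punctuation from each token and drop tokens left empty."""
    cleaned = (token.translate(PUNCT_TABLE) for token in tokens)
    return [token for token in cleaned if token]

# Tokens as they look after stop word removal.
sample_tokens = ['Natural', 'Language', 'Processing', '(', 'NLP', ')',
                 'fascinating', 'field', 'study', '!', 'involves',
                 'analyzing', 'understanding', 'human', 'language', '.']
print(strip_punctuation(sample_tokens))
```

Like the regular expression, this only covers the characters in `string.punctuation`, so non-ASCII marks such as the curly apostrophe `’` pass through unchanged.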


### **Exercise 2: Stemming and Lemmatization**

Apply stemming and lemmatization to the punctuation-free tokens from Exercise 1 (`tokenized_words_no_punctuation`). Compare the results.

In [9]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

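for word in tokenized_words_no_punctuation:
    print("---- ",word,"----")
    # PS: Porter stem (rule-based suffix stripping, always lowercased)
    print('PS:',stemmer.stem(word))

    # First WN line: lemma with the default part of speech (noun)
    print('WN:',lemmatizer.lemmatize(word))
    # Second WN line: lemma when the word is treated as a verb
    print('WN:',lemmatizer.lemmatize(word,pos="v"))
    print()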


----  Natural ----
PS: natur
WN: Natural
WN: Natural

----  Language ----
PS: languag
WN: Language
WN: Language

----  Processing ----
PS: process
WN: Processing
WN: Processing

----  NLP ----
PS: nlp
WN: NLP
WN: NLP

----  fascinating ----
PS: fascin
WN: fascinating
WN: fascinate

----  field ----
PS: field
WN: field
WN: field

----  study ----
PS: studi
WN: study
WN: study

----  involves ----
PS: involv
WN: involves
WN: involve

----  analyzing ----
PS: analyz
WN: analyzing
WN: analyze

----  understanding ----
PS: understand
WN: understanding
WN: understand

----  human ----
PS: human
WN: human
WN: human

----  language ----
PS: languag
WN: language
WN: language



Apply stemming and store the result in `stemmed_tokens`

In [10]:
# your code here
stemmed_tokens = [stemmer.stem(word) for word in tokenized_words_no_punctuation]


In [11]:
print("Stemmed Tokens:", stemmed_tokens)

Stemmed Tokens: ['natur', 'languag', 'process', 'nlp', 'fascin', 'field', 'studi', 'involv', 'analyz', 'understand', 'human', 'languag']


Apply lemmatization and store the result in `lemmatized_tokens`

In [None]:
nltk.download('omw-1.4')
from nltk.corpus import wordnet

# pos_tag returns Penn Treebank tags, while the lemmatizer expects WordNet POS constants
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()  # first letter of the Penn Treebank tag
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)  # fall back to noun for any other tag

lemmatizer = WordNetLemmatizer()

lemmatized_tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokenized_words_no_punctuation]

print("Original:", tokenized_words_no_punctuation)
print("Lemmatized:", lemmatized_tokens)

Original: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']
Lemmatized: ['Natural', 'Language', 'Processing', 'NLP', 'fascinate', 'field', 'study', 'involves', 'analyze', 'understand', 'human', 'language']


[nltk_data] Downloading package omw-1.4 to /Users/test/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [13]:
print("Lemmatized Tokens:", lemmatized_tokens)




Lemmatized Tokens: ['Natural', 'Language', 'Processing', 'NLP', 'fascinate', 'field', 'study', 'involves', 'analyze', 'understand', 'human', 'language']


## **3. Feature Extraction**

### **Exercise 3: Bag of Words (BoW)**

Use the `CountVectorizer` from `scikit-learn` to create a Bag of Words representation of the following corpus:

In [68]:
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

In [74]:

# your code here
# Step 1: Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Step 2: Fit and transform the corpus into a BoW representation
X = vectorizer.fit_transform(corpus)



In [None]:
print("Bag of Words:\n", X.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

Bag of Words:
 [[0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 0 0 1 0]
 [0 1 1 0 1 0 1 1 1]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
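To see where these counts come from, the following pure-Python sketch rebuilds the same matrix by hand. It assumes `CountVectorizer`'s default settings (lowercasing and the token pattern `(?u)\b\w\w+\b`); the names `tokenize` and `bow` are illustrative only.

```python
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

# The default token pattern keeps only tokens of two or more word characters,
# which is why the single-letter word "I" is missing from the vocabulary.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    """Lowercase the document and extract tokens with the default pattern."""
    return TOKEN_PATTERN.findall(doc.lower())

tokenized = [tokenize(doc) for doc in corpus]

# The vocabulary is the alphabetically sorted set of all tokens.
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per vocabulary term.
bow = []
for doc in tokenized:
    counts = Counter(doc)
    bow.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow:
    print(row)
```

Each row lines up with the `X.toarray()` output: the column for `nlp` is 1 in every document, while every other term appears in exactly one document.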


### **Exercise 4: TF-IDF**

Use the `TfidfVectorizer` from `scikit-learn` to create a TF-IDF representation of the same corpus. Store the result in `X_tfidf`

In [None]:
# your code here
# Step 1: Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Step 2: Fit and transform the corpus into a TF-IDF representation
X_tfidf = tfidf_vectorizer.fit_transform(corpus)



In [77]:
print("TF-IDF:\n", X_tfidf.toarray())
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

TF-IDF:
 [[0.         0.         0.         0.         0.         0.861037
  0.         0.50854232 0.        ]
 [0.65249088 0.         0.         0.65249088 0.         0.
  0.         0.38537163 0.        ]
 [0.         0.43238509 0.43238509 0.         0.43238509 0.
  0.43238509 0.2553736  0.43238509]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
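The weights above can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings: raw term counts, smoothed IDF computed as `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The count matrix is copied from the Bag of Words output.

```python
import math

vocabulary = ['amazing', 'enjoy', 'in', 'is', 'learning', 'love', 'new', 'nlp', 'things']
counts = [
    [0, 0, 0, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 1, 1],
]

n_docs = len(counts)

# Document frequency: number of documents that contain each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm of the row
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 8) for v in row])
```

Because `nlp` occurs in all three documents, its IDF is the minimum value of 1, so it receives the lowest weight in every row, while terms unique to one document are weighted higher.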


### **Exercise 5: N-grams**

Generate bigrams (2-grams) from the corpus using `CountVectorizer`. Store the result in `X_bigram`.

In [78]:
# your code here
# Step 1: Initialize the CountVectorizer with ngram_range=(2, 2)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))

# Step 2: Fit and transform the corpus into a bigram representation
X_bigram = bigram_vectorizer.fit_transform(corpus)


In [79]:
print("Bigrams:\n", X_bigram.toarray())
print("Bigram Vocabulary:", bigram_vectorizer.get_feature_names_out())

Bigrams:
 [[0 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 1 0]
 [1 1 0 1 0 1 0 1]]
Bigram Vocabulary: ['enjoy learning' 'in nlp' 'is amazing' 'learning new' 'love nlp'
 'new things' 'nlp is' 'things in']
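A bigram is simply a pair of adjacent tokens. The sketch below generates word n-grams in plain Python, assuming the same lowercasing and token pattern as `CountVectorizer`'s defaults; `word_ngrams` is a hypothetical helper name.

```python
import re

def word_ngrams(text, n=2):
    """Return the word n-grams of `text` as space-joined strings."""
    # Lowercase, then keep tokens of two or more word characters.
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    # Slide a window of length n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("I enjoy learning new things in NLP."))
```

Passing `ngram_range=(1, 2)` to `CountVectorizer` instead of `(2, 2)` would keep the single words alongside the bigrams.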


## **4. Advanced Exercise: Custom Preprocessing Pipeline**

### **Exercise 6: Build a Custom Preprocessing Pipeline**

Combine all the preprocessing steps (tokenization, stop word removal, punctuation removal, stemming/lemmatization) into a single function. 

In [None]:
import string
import re
from nltk.corpus import wordnet

# your code here
def text_preprocessing_pipeline(text):
    # Step 1: Tokenize the text
    tokens = word_tokenize(text)

    # Step 2: Remove stop words (case-insensitive comparison)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

    # Step 3: Strip punctuation characters and drop tokens that become empty
    pattern = re.compile('[%s]' % re.escape(string.punctuation))
    stripped = (pattern.sub('', token) for token in filtered_tokens)
    tokenized_words_no_punctuation = [token for token in stripped if token]

    # Step 4: Lemmatize each token using its part-of-speech tag
    def get_wordnet_pos(word):
        tag = nltk.pos_tag([word])[0][1][0].upper()  # first letter of the Penn Treebank tag
        tag_dict = {"J": wordnet.ADJ,
                    "N": wordnet.NOUN,
                    "V": wordnet.VERB,
                    "R": wordnet.ADV}
        return tag_dict.get(tag, wordnet.NOUN)  # fall back to noun for any other tag

    lemmatizer = WordNetLemmatizer()

    lemmatized_tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokenized_words_no_punctuation]

    return lemmatized_tokens


Apply this function to the following text

In [19]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

processed_text = text_preprocessing_pipeline(text)



In [48]:
print("Processed Text:", processed_text)

Processed Text: ['Natural', 'Language', 'Processing', 'NLP', 'fascinate', 'field', 'study', 'involves', 'analyze', 'understand', 'human', 'language']


## **5. Evaluation of Preprocessing Techniques**

### **Exercise 7: Compare Preprocessing Techniques**

Compare the results of stemming and lemmatization on the following sentence. Store the results in `stemmed_tokens` and `lemmatized_tokens`

In [24]:
sentence = "The cats are playing with the mice in the garden."
# your code here
# Step 1: Tokenize and preprocess the sentence and store the result in filtered_tokens
def preprocessing_pipeline(text):
    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

    # Remove punctuation and drop tokens that become empty
    pattern = re.compile('[%s]' % re.escape(string.punctuation))
    tokenized_words_no_punctuation = []
    for token in filtered_tokens:
        new_token = pattern.sub('', token)
        if new_token != '':
            tokenized_words_no_punctuation.append(new_token)

    return tokenized_words_no_punctuation

filtered_tokens = preprocessing_pipeline(sentence)

# Step 2: Apply stemming
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

print(stemmed_tokens)

# Step 3: Apply lemmatization
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()  # first letter of the Penn Treebank tag
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)  # fall back to noun for any other tag

lemmatizer = WordNetLemmatizer()

lemmatized_tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in filtered_tokens]

print(lemmatized_tokens)


['cat', 'play', 'mice', 'garden']
['cat', 'play', 'mouse', 'garden']


In [23]:
print("Original Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Original Tokens: ['cats', 'playing', 'mice', 'garden']
Stemmed Tokens: ['cat', 'play', 'mice', 'garden']
Lemmatized Tokens: ['cat', 'play', 'mouse', 'garden']


## **6. Real-World Dataset: Sentiment Analysis**

### **Exercise 8: Preprocess and Analyze Tweets**

In this exercise, you will work with a real-world dataset of tweets. The dataset contains 5000 positive and 5000 negative tweets. Your task is to preprocess the tweets and extract features for sentiment analysis.


In [25]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/test/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In [26]:
# Load the dataset
from nltk.corpus import twitter_samples

Load the dataset of positive and negative tweets. 

In [27]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Combine them into a single list called ``all_tweets`` and create a corresponding list of labels called `labels`.

In [40]:
# your code here

# Combine the datasets; positive tweets come first, so the labels follow the same order
all_tweets = positive_tweets + negative_tweets
labels = ["positive"] * len(positive_tweets) + ["negative"] * len(negative_tweets)

# Every tweet must have exactly one label
assert len(all_tweets) == len(labels)

In [38]:
# Print a sample tweet
print("Sample Tweet:", all_tweets[0])
print("Label:", labels[0])

Sample Tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Label: positive


### **Exercise 9: Preprocess Tweets**

Apply the custom preprocessing pipeline to the entire dataset of tweets. Store the result in ``preprocessed_tweets``.

In [49]:
# Step 1: Apply the preprocessing pipeline to all tweets
# your code here

# Apply preprocessing
preprocessed_tweets = [text_preprocessing_pipeline(tweet) for tweet in all_tweets]

# Print results
for i, tokens in enumerate(preprocessed_tweets):
    print(f"\nTweet {i+1} tokens:")
    print(tokens)



Tweet 1 tokens:
['FollowFriday', 'FranceInte', 'PKuchly57', 'MilipolParis', 'top', 'engage', 'member', 'community', 'week']

Tweet 2 tokens:
['Lamb2ja', 'Hey', 'James', 'odd', 'Please', 'call', 'Contact', 'Centre', '02392441234', 'able', 'assist', 'Many', 'thanks']

Tweet 3 tokens:
['DespiteOfficial', 'listen', 'last', 'night', 'Bleed', 'amaze', 'track', 'Scotland']

Tweet 4 tokens:
['97sides', 'CONGRATS']

Tweet 5 tokens:
['yeaaaah', 'yippppy', 'accnt', 'verify', 'rqst', 'succeed', 'get', 'blue', 'tick', 'mark', 'fb', 'profile', '15', 'day']

Tweet 6 tokens:
['BhaktisBanter', 'PallaviRuhail', 'one', 'irresistible', 'FlipkartFashionFriday', 'http', 'tcoEbZ0L2VENM']

Tweet 7 tokens:
['nt', 'like', 'keep', 'lovely', 'customer', 'wait', 'long', 'hope', 'enjoy', 'Happy', 'Friday', 'LWWF', 'http', 'tcosmyYriipxI']

Tweet 8 tokens:
['Impatientraider', 'second', 'thought', '’', 'enough', 'time', 'DD', 'new', 'short', 'enter', 'system', 'Sheep', 'must', 'buying']

Tweet 9 tokens:
['Jgh', 'go'

In [50]:
# Print a sample preprocessed tweet
print("Preprocessed Tweets Sample:", preprocessed_tweets[0])

Preprocessed Tweets Sample: ['FollowFriday', 'FranceInte', 'PKuchly57', 'MilipolParis', 'top', 'engage', 'member', 'community', 'week']
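The preprocessed output shows some Twitter-specific noise: stripping punctuation turns links into fragments such as `http` plus a random-looking token, and fuses handles like `@France_Inte` into `FranceInte`. One possible refinement, sketched here as an assumption rather than a required part of the exercise, is to clean the raw text before tokenizing. `clean_tweet` is a hypothetical helper, and whether to discard mentions is a modelling choice.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # http:// or https:// up to the next whitespace
MENTION_RE = re.compile(r"@\w+")       # @username handles

def clean_tweet(text):
    """Strip URLs and @mentions and unwrap hashtags before tokenizing."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")       # keep the hashtag word, drop the symbol
    return " ".join(text.split())      # collapse repeated whitespace

sample = ("#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top "
          "engaged members in my community this week :)")
print(clean_tweet(sample))
```

The cleaned string could then be passed to `text_preprocessing_pipeline` in place of the raw tweet. Note that the mention pattern is deliberately simple and would also match the domain part of an email address.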


### **Exercise 10: Feature Extraction on Tweets**

Extract features from the preprocessed tweets using **Bag of Words** and **TF-IDF**. Store the results in ``X_bow`` and ``X_tfidf``, respectively.

In [53]:
# your code here
# Join each list of tokens back into a single space-separated string,
# because the vectorizers expect one string per document
preprocessed_strings = [' '.join(tokens) for tokens in preprocessed_tweets]

# Step 1: Create a Bag of Words representation
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(preprocessed_strings)

print("Bag of Words:\n", X_bow.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

# Step 2: Create a TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(preprocessed_strings)

print("TF-IDF:\n", X_tfidf.toarray())
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())



Bag of Words:
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
Vocabulary: ['00' '0001' '00128835' ... '인피니트' 'ｍｅ' 'ｓｅｅ']
TF-IDF:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Vocabulary: ['00' '0001' '00128835' ... '인피니트' 'ｍｅ' 'ｓｅｅ']


## **7. Conclusion**

In this lab, you tokenized text, removed stop words and punctuation, compared stemming with lemmatization, and built Bag of Words, TF-IDF, and bigram representations. You also applied the same preprocessing pipeline to the NLTK `twitter_samples` tweets and extracted features that can serve as input for sentiment analysis.

