# **Comprehensive NLP Lab: From Preprocessing to Feature Extraction**

In this lab, you will explore a wide range of Natural Language Processing (NLP) techniques, from basic text preprocessing to advanced feature extraction and analysis. By the end of this lab, you will be able to:

1. **Tokenize** and preprocess text data.
2. Remove **stop words** and **punctuation**.
3. Apply **stemming** and **lemmatization**.
4. Extract features using **Bag of Words (BoW)** and **TF-IDF**.
5. Generate **n-grams** to capture contextual information.
6. Evaluate the impact of different preprocessing techniques on text data.

Let's dive in!

## **1. Set Up the Environment**


Before we begin, ensure you have the necessary libraries installed. Run the following cell to install them:


In [1]:
!pip install nltk scikit-learn pandas matplotlib




Now, import the required libraries:

In [2]:

import nltk
import re
import string
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [3]:
# Download NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /Users/raneem/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/raneem/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/raneem/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /Users/raneem/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/raneem/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## **2. Text Preprocessing**

### **Exercise 1: Tokenization and Stop Word Removal**

Tokenize the following text:

In [4]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."
# Tokenization
tokens = word_tokenize(text)
print(tokens)


['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'of', 'study', '!', 'It', 'involves', 'analyzing', 'and', 'understanding', 'human', 'language', '.']


Remove stop words and store the result in a variable called `filtered_tokens`. The comparison uses `word.lower()` because the NLTK stop word list is all lowercase.

In [5]:
stop_words = set(stopwords.words('english'))
print(stop_words)

filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

{'their', 'doesn', 'haven', 'has', 't', 'should', 'no', "he'd", 'do', 'same', "that'll", 'you', "they're", 'did', 'not', "they'd", 'some', 'are', 'ain', 'me', 'needn', 'after', "mustn't", 'does', 'then', 'this', "i'll", 'isn', 'just', 'there', 'once', 'him', 'very', 'how', 'm', "we'd", 'or', 'weren', 'can', "it's", 'a', "they'll", "wouldn't", 'again', 'over', 'couldn', 'o', "didn't", "i'd", 'own', 'she', 'both', "you'd", 'won', 'by', 'in', 've', "hadn't", 'ourselves', "aren't", 'ma', 'until', 'mustn', 'theirs', 'such', 'd', 'if', 'what', 'yours', 'where', "needn't", 'through', 'his', "it'll", 'am', "hasn't", "won't", 'whom', 'wouldn', 'ours', 're', 'its', 'they', 'while', 'at', 'them', 'had', 'nor', 'as', "don't", 'between', 'aren', 'but', 'it', "you'll", 'each', "he's", 'off', "you've", 'with', 'before', "couldn't", 'don', 'now', 'yourself', 'why', "i've", 'he', 'to', 'below', 'too', 'and', 'few', 'doing', 'have', 'i', 'all', 'mightn', 'your', 'from', 'down', 'than', 'themselves', 'sh

In [6]:
print("Filtered Tokens:", filtered_tokens)

Filtered Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']


### **Exercise 2: Stemming and Lemmatization**

Apply stemming and lemmatization to the `filtered_tokens`. Compare the results.

In [7]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

Apply stemming and store the result in `stemmed_tokens`.

In [8]:
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

In [9]:
print("Stemmed Tokens:", stemmed_tokens)

Stemmed Tokens: ['natur', 'languag', 'process', '(', 'nlp', ')', 'fascin', 'field', 'studi', '!', 'involv', 'analyz', 'understand', 'human', 'languag', '.']


Apply lemmatization and store the result in `lemmatized_tokens`.

In [10]:
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]


In [11]:
print("Lemmatized Tokens:", lemmatized_tokens)

Lemmatized Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']
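
The lemmatized tokens above are almost identical to the input. `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, so verb forms such as 'involves' and 'analyzing' pass through unchanged. Unlike the stemmer, it also preserves the original casing. One possible remedy, sketched below, is to tag each token with `pos_tag` and map the Penn Treebank tag to a WordNet part-of-speech letter. The helper names `penn_to_wordnet` and `lemmatize_with_pos` are illustrative assumptions, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun, matching the lemmatizer's own default


def lemmatize_with_pos(tokens):
    """Lemmatize tokens using the part of speech predicted by pos_tag."""
    # Imported here so the mapping helper above can be used on its own.
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in pos_tag(tokens)
    ]
```

Calling `lemmatize_with_pos(filtered_tokens)` requires the `averaged_perceptron_tagger` and `wordnet` resources downloaded in the setup cell. The result depends on the tags the tagger assigns, so it is worth comparing it against the noun-only output above.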


## **3. Feature Extraction**

### **Exercise 3: Bag of Words (BoW)**

Use the `CountVectorizer` from `scikit-learn` to create a Bag of Words representation of the following corpus:

In [12]:
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]



In [13]:

# your code here
# Step 1: Initialize the CountVectorizer
vectorizer = CountVectorizer()
# Step 2: Fit and transform the corpus into a BoW representation
X = vectorizer.fit_transform(corpus)
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(bow_df)


   amazing  enjoy  in  is  learning  love  new  nlp  things
0        0      0   0   0         0     1    0    1       0
1        1      0   0   1         0     0    0    1       0
2        0      1   1   0         1     0    1    1       1


In [14]:
print("Bag of Words:\n", X.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

Bag of Words:
 [[0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 0 0 1 0]
 [0 1 1 0 1 0 1 1 1]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
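
Notice that "I" is missing from the vocabulary. By default, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters, so single-letter words are dropped. The following standard-library sketch reproduces the counts above under that assumption. The names `tokenize`, `vocabulary`, and `bow_rows` are illustrative.

```python
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]


def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters,
    # mirroring CountVectorizer's default token_pattern r"(?u)\b\w\w+\b".
    return re.findall(r"\b\w\w+\b", doc.lower())


tokenized = [tokenize(doc) for doc in corpus]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({token for doc in tokenized for token in doc})

# Each row holds the count of every vocabulary term in one document.
bow_rows = []
for doc in tokenized:
    counts = Counter(doc)
    bow_rows.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_rows:
    print(row)
```

To keep single-character tokens in scikit-learn, pass a different `token_pattern` to `CountVectorizer`, for example `r"(?u)\b\w+\b"`.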


### **Exercise 4: TF-IDF**

Use the `TfidfVectorizer` from `scikit-learn` to create a TF-IDF representation of the same corpus. Store the result in `X_tfidf`.

In [24]:
# your code here
# Step 1: Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
# Step 2: Fit and transform the corpus into a TF-IDF representation
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print(tfidf_df)

    amazing     enjoy        in        is  learning      love       new  \
0  0.000000  0.000000  0.000000  0.000000  0.000000  0.861037  0.000000   
1  0.652491  0.000000  0.000000  0.652491  0.000000  0.000000  0.000000   
2  0.000000  0.432385  0.432385  0.000000  0.432385  0.000000  0.432385   

        nlp    things  
0  0.508542  0.000000  
1  0.385372  0.000000  
2  0.255374  0.432385  


In [26]:
print("TF-IDF:\n", X_tfidf.toarray())
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

TF-IDF:
 [[0.         0.         0.         0.         0.         0.861037
  0.         0.50854232 0.        ]
 [0.65249088 0.         0.         0.65249088 0.         0.
  0.         0.38537163 0.        ]
 [0.         0.43238509 0.43238509 0.         0.43238509 0.
  0.43238509 0.2553736  0.43238509]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
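
To see where these numbers come from, the sketch below recomputes them by hand, assuming the default `TfidfVectorizer` settings: raw term counts, smoothed inverse document frequency `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and L2 normalization of each row. It uses only the standard library, and the variable names are illustrative.

```python
import math
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]


def tokenize(doc):
    # Same default tokenization as in the Bag of Words exercise:
    # lowercase, tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())


tokenized = [tokenize(doc) for doc in corpus]
vocabulary = sorted({token for doc in tokenized for token in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

# Smoothed IDF, as used when smooth_idf=True (the default).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}


def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    return [v / norm if norm else 0.0 for v in raw]


tfidf_rows = [tfidf_row(doc) for doc in tokenized]
for row in tfidf_rows:
    print([round(v, 6) for v in row])
```

"nlp" appears in all three documents, so its IDF is exactly 1, the lowest possible value. That is why it gets the smallest non-zero weight in every row.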


### **Exercise 5: N-grams**

Generate **bigrams (2-grams)** from the corpus using `CountVectorizer`. Store the result in `X_bigram`.

In [27]:
# your code here
# Step 1: Initialize the CountVectorizer with ngram_range=(2, 2)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
# Step 2: Fit and transform the corpus into a bigram representation
X_bigram = bigram_vectorizer.fit_transform(corpus)
bigram_df = pd.DataFrame(X_bigram.toarray(), columns=bigram_vectorizer.get_feature_names_out())
print(bigram_df)



   enjoy learning  in nlp  is amazing  learning new  love nlp  new things  \
0               0       0           0             0         1           0   
1               0       0           1             0         0           0   
2               1       1           0             1         0           1   

   nlp is  things in  
0       0          0  
1       1          0  
2       0          1  


In [28]:
print("Bigrams:\n", X_bigram.toarray())
print("Bigram Vocabulary:", bigram_vectorizer.get_feature_names_out())

Bigrams:
 [[0 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 1 0]
 [1 1 0 1 0 1 0 1]]
Bigram Vocabulary: ['enjoy learning' 'in nlp' 'is amazing' 'learning new' 'love nlp'
 'new things' 'nlp is' 'things in']
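
A bigram is simply a pair of adjacent tokens, so the vocabulary above can be rebuilt with a sliding window over each tokenized document. The sketch below is a plain-Python illustration. The helper `ngrams` is a hypothetical name, and the tokenization assumes the `CountVectorizer` defaults.

```python
import re

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]


def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())


def ngrams(tokens, n):
    # Slide a window of width n over the tokens and join each window with a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams_per_doc = [ngrams(tokenize(doc), 2) for doc in corpus]
bigram_vocabulary = sorted({bigram for doc in bigrams_per_doc for bigram in doc})

print(bigrams_per_doc[1])
print(bigram_vocabulary)
```

Because "I" is dropped before the window is applied, the first document yields the single bigram "love nlp". In scikit-learn, `ngram_range=(1, 2)` would keep the unigrams alongside the bigrams.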


## **4. Advanced Exercise: Custom Preprocessing Pipeline**

### **Exercise 6: Build a Custom Preprocessing Pipeline**

Combine all the preprocessing steps (tokenization, stop word removal, punctuation removal, stemming/lemmatization) into a single function. 

In [29]:
# your code here
def text_preprocessing_pipeline(text):
    # Step 1: Tokenize the text
    tokens = word_tokenize(text)
    print("Tokens:", tokens)
    # Step 2: Remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    print("Filtered Tokens:", filtered_tokens)
    # Step 3: Remove punctuation
    filtered_tokens = [word for word in filtered_tokens if word not in string.punctuation]
    print("Punctuation Removed:", filtered_tokens)
    # Step 4: Apply lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
    return lemmatized_tokens


Apply this function to the following text:

In [30]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

preprocessed_text = text_preprocessing_pipeline(text)
print("Preprocessed Text:", preprocessed_text)


Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'of', 'study', '!', 'It', 'involves', 'analyzing', 'and', 'understanding', 'human', 'language', '.']
Filtered Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']
Punctuation Removed: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']
Preprocessed Text: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']
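
One caveat about the punctuation step: `word not in string.punctuation` is a substring test against one long string, not a per-character check. It removes single punctuation marks, but multi-character tokens such as '...' or '``' (which `word_tokenize` can emit for ellipses and opening quotes) slip through. The sketch below contrasts the two checks. `is_punctuation` is a hypothetical helper, and the token list is made up for illustration.

```python
import string

PUNCTUATION = set(string.punctuation)


def is_punctuation(token):
    """Return True if the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in PUNCTUATION for ch in token)


tokens = ["Hello", "...", "``", "world", "!", ":)", "e-mail"]

# Substring test, as used in the pipeline above.
substring_filtered = [t for t in tokens if t not in string.punctuation]

# Per-character test: drops any token made up entirely of punctuation.
strict_filtered = [t for t in tokens if not is_punctuation(t)]

print(substring_filtered)  # ['Hello', '...', '``', 'world', ':)', 'e-mail']
print(strict_filtered)     # ['Hello', 'world', 'e-mail']
```

The stricter check also drops emoticons such as ':)'. That may be undesirable for sentiment analysis, where emoticons can carry useful signal.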


## **5. Evaluation of Preprocessing Techniques**

### **Exercise 7: Compare Preprocessing Techniques**

Compare the results of stemming and lemmatization on the following sentence. Store the results in `stemmed_tokens` and `lemmatized_tokens`.

In [32]:
sentence = "The cats are playing with the mice in the garden."
# your code here
# Step 1: Tokenize and preprocess the sentence and store the result in filtered_tokens
tokens = word_tokenize(sentence)
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
filtered_tokens = [word for word in filtered_tokens if word not in string.punctuation]

# Step 2: Apply stemming to the filtered tokens
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
# Step 3: Apply lemmatization to the filtered tokens
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]


In [33]:
print("Original Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Original Tokens: ['cats', 'playing', 'mice', 'garden']
Stemmed Tokens: ['cat', 'play', 'mice', 'garden']
Lemmatized Tokens: ['cat', 'playing', 'mouse', 'garden']


## **6. Real-World Dataset: Sentiment Analysis**

### **Exercise 8: Preprocess and Analyze Tweets**

In this exercise, you will work with a real-world dataset of tweets. The dataset contains 5000 positive and 5000 negative tweets. Your task is to preprocess the tweets and extract features for sentiment analysis.


In [34]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/raneem/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In [35]:
# Load the dataset
from nltk.corpus import twitter_samples

Load the dataset of positive and negative tweets. 

In [36]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Combine them into a single list called `all_tweets` and create a corresponding list of labels called `labels`.

In [37]:
# your code here
# Combine the datasets (positive tweets first, then negative)
all_tweets = positive_tweets + negative_tweets

# Create the matching labels: 1 = positive, 0 = negative
labels = [1] * len(positive_tweets) + [0] * len(negative_tweets)


In [38]:
# Print a sample tweet
print("Sample Tweet:", all_tweets[0])
print("Label:", labels[0])

Sample Tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Label: 1


### **Exercise 9: Preprocess Tweets**

Apply the custom preprocessing pipeline to the entire dataset of tweets. Store the result in `preprocessed_tweets`.

In [39]:
# Step 1: Apply the preprocessing pipeline to all tweets
preprocessed_tweets = [text_preprocessing_pipeline(tweet) for tweet in all_tweets]

# Step 2: Combine the tokens into a single string
preprocessed_tweets = [" ".join(tokens) for tokens in preprocessed_tweets]

# Step 3: Create a DataFrame with the preprocessed tweets and labels
tweets_df = pd.DataFrame({
    'tweets': preprocessed_tweets,
    'labels': labels
})

print(tweets_df.head())



Tokens: ['#', 'FollowFriday', '@', 'France_Inte', '@', 'PKuchly57', '@', 'Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':', ')']
Filtered Tokens: ['#', 'FollowFriday', '@', 'France_Inte', '@', 'PKuchly57', '@', 'Milipol_Paris', 'top', 'engaged', 'members', 'community', 'week', ':', ')']
Punctuation Removed: ['FollowFriday', 'France_Inte', 'PKuchly57', 'Milipol_Paris', 'top', 'engaged', 'members', 'community', 'week']
Tokens: ['@', 'Lamb2ja', 'Hey', 'James', '!', 'How', 'odd', ':', '/', 'Please', 'call', 'our', 'Contact', 'Centre', 'on', '02392441234', 'and', 'we', 'will', 'be', 'able', 'to', 'assist', 'you', ':', ')', 'Many', 'thanks', '!']
Filtered Tokens: ['@', 'Lamb2ja', 'Hey', 'James', '!', 'odd', ':', '/', 'Please', 'call', 'Contact', 'Centre', '02392441234', 'able', 'assist', ':', ')', 'Many', 'thanks', '!']
Punctuation Removed: ['Lamb2ja', 'Hey', 'James', 'odd', 'Please', 'call', 'Contact', 'Centre', '02392441234', 'able',
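
Because `text_preprocessing_pipeline` prints three lines per call, running it over all 10,000 tweets floods the cell output (truncated above). One option, short of removing the `print` calls, is to silence standard output while the list comprehension runs. The sketch below uses a stand-in function, `noisy_pipeline` (hypothetical), so that it runs on its own.

```python
import contextlib
import io


def noisy_pipeline(text):
    # Hypothetical stand-in for text_preprocessing_pipeline:
    # it prints an intermediate result, then returns the tokens.
    tokens = text.split()
    print("Tokens:", tokens)
    return tokens


sample_texts = ["top engaged members", "many thanks"]

# Everything printed inside the with-block goes to an in-memory buffer
# instead of the notebook output.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    quiet_results = [noisy_pipeline(text) for text in sample_texts]

print(quiet_results)
```

The captured text remains available through `buffer.getvalue()` if the intermediate steps need to be inspected later.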

In [40]:
# Print a sample preprocessed tweet
print("Preprocessed Tweets Sample:", preprocessed_tweets[0])

Preprocessed Tweets Sample: FollowFriday France_Inte PKuchly57 Milipol_Paris top engaged member community week
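
The sample above shows that user handles such as `France_Inte` survive preprocessing as ordinary words, because `word_tokenize` splits '@' and '#' into separate tokens that the punctuation step then removes. If handles are not wanted as features, one possible approach is to clean the raw tweet before tokenizing. The function `clean_tweet` below is a hypothetical helper using simple regular expressions, not part of the lab's pipeline.

```python
import re


def clean_tweet(tweet):
    """Strip URLs and user handles, and keep hashtag text without the '#'."""
    tweet = re.sub(r"https?://\S+", "", tweet)  # drop URLs
    tweet = re.sub(r"@\w+", "", tweet)          # drop user handles
    tweet = re.sub(r"#(\w+)", r"\1", tweet)     # keep the hashtag word only
    return " ".join(tweet.split())              # collapse leftover whitespace


sample = (
    "#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top "
    "engaged members in my community this week :)"
)
print(clean_tweet(sample))
# FollowFriday for being top engaged members in my community this week :)
```

NLTK also ships a `TweetTokenizer` class in `nltk.tokenize`, with a `strip_handles` option, that is designed for this kind of text and may be worth trying as an alternative to `word_tokenize`.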


### **Exercise 10: Feature Extraction on Tweets**

Extract features from the preprocessed tweets using **Bag of Words** and **TF-IDF**. Store the results in `X_bow` and `X_tfidf`, respectively.

In [41]:

# Step 1: Create a Bag of Words representation
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(preprocessed_tweets)
bow_df = pd.DataFrame(X_bow.toarray(), columns=vectorizer.get_feature_names_out())
print(bow_df)



# Step 2: Create a TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(preprocessed_tweets)
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print(tfidf_df)



      00  000  001  00128835  009  00962778381838  00am  00kouhey00  \
0      0    0    0         0    0               0     0           0   
1      0    0    0         0    0               0     0           0   
2      0    0    0         0    0               0     0           0   
3      0    0    0         0    0               0     0           0   
4      0    0    0         0    0               0     0           0   
...   ..  ...  ...       ...  ...             ...   ...         ...   
9995   0    0    0         0    0               0     0           0   
9996   0    0    0         0    0               0     0           0   
9997   0    0    0         0    0               0     0           0   
9998   0    0    0         0    0               0     0           0   
9999   0    0    0         0    0               0     0           0   

      00yckce7wj  01  ...  للعودة  مطعم_هاشم  एक  හව  다쇼  더쇼  에이핑크  인피니트  ｍｅ  \
0              0   0  ...       0          0   0   0   0   0     0 

## **7. Conclusion**

In this lab, you tokenized text, removed stop words and punctuation, compared stemming with lemmatization, and built Bag of Words, TF-IDF, and n-gram representations. You then applied the same preprocessing pipeline to a real-world dataset of tweets to extract features for sentiment analysis.

