# **Comprehensive NLP Lab: From Preprocessing to Feature Extraction**

In this lab, you will explore a wide range of Natural Language Processing (NLP) techniques, from basic text preprocessing to advanced feature extraction and analysis. By the end of this lab, you will be able to:

1. **Tokenize** and preprocess text data.
2. Remove **stop words** and **punctuation**.
3. Apply **stemming** and **lemmatization**.
4. Extract features using **Bag of Words (BoW)** and **TF-IDF**.
5. Generate **n-grams** to capture contextual information.
6. Evaluate the impact of different preprocessing techniques on text data.

Let's dive in!

## **1. Setup the Environment**


Before we begin, ensure you have the necessary libraries installed. Run the following cell to install them:


In [1]:
!pip install nltk scikit-learn pandas matplotlib




Now, import the required libraries:

In [2]:

import nltk
import re
import string
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [3]:
# Download NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

## **2. Text Preprocessing**

### **Exercise 1: Tokenization and Stop Word Removal**

Tokenize the following text:

In [4]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk

# Ensure stopwords are downloaded
nltk.download('stopwords')

# Tokenize the text into words
tokens = word_tokenize(text)

# Get the list of stop words in English
stop_words = set(stopwords.words('english'))

# Remove stop words and punctuation from the tokens
filtered_tokens = [word for word in tokens if word.lower() not in stop_words and word not in string.punctuation]

# Display the results
print("Original Text:", text)
print("Tokens:", tokens)
print("Filtered Tokens (After Stop Word Removal):", filtered_tokens)


Original Text: Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language.
Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'of', 'study', '!', 'It', 'involves', 'analyzing', 'and', 'understanding', 'human', 'language', '.']
Filtered Tokens (After Stop Word Removal): ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Remove stop words and store the result in a variable called `filtered_tokens`.

In [7]:
# Import necessary NLTK components
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Example text
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

# Tokenize the text into words
tokens = word_tokenize(text)

# Get the list of stop words in English
stop_words = set(stopwords.words('english'))

# Remove stop words and punctuation from the tokens
filtered_tokens = [word for word in tokens if word.lower() not in stop_words and word not in string.punctuation]

# Display the filtered tokens
print(filtered_tokens)



['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
print("Filtered Tokens:", filtered_tokens)

Filtered Tokens: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


### **Exercise 2: Stemming and Lemmatization**

Apply stemming and lemmatization to the `filtered_tokens`. Compare the results.

In [None]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

Apply stemming and store the result in `stemmed_tokens`.

In [8]:
# Import necessary NLTK components
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import nltk

# Ensure necessary NLTK data is downloaded
nltk.download('wordnet')

# Initialize the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Filtered tokens from the previous step
filtered_tokens = ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']

# Apply stemming to the filtered tokens
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

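# Apply lemmatization; lemmatize() treats each word as a noun unless a pos
# argument is given, so verb forms such as 'analyzing' come back unchanged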
lemmatized_tokens = [lemmatizer.lemmatize(word.lower()) for word in filtered_tokens]

# Display the results
print("Original Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Original Tokens: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']
Stemmed Tokens: ['natur', 'languag', 'process', 'nlp', 'fascin', 'field', 'studi', 'involv', 'analyz', 'understand', 'human', 'languag']
Lemmatized Tokens: ['natural', 'language', 'processing', 'nlp', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


In [9]:
print("Stemmed Tokens:", stemmed_tokens)

Stemmed Tokens: ['natur', 'languag', 'process', 'nlp', 'fascin', 'field', 'studi', 'involv', 'analyz', 'understand', 'human', 'languag']


Apply lemmatization and store the result in `lemmatized_tokens`.

In [10]:
# Import necessary NLTK components
from nltk.stem import WordNetLemmatizer
import nltk

# Ensure necessary NLTK data is downloaded
nltk.download('wordnet')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Example filtered tokens from the previous step (assuming stop words removed)
filtered_tokens = ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']

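# Apply lemmatization; lemmatize() treats each word as a noun unless a pos
# argument is given, so verb forms such as 'analyzing' come back unchanged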
lemmatized_tokens = [lemmatizer.lemmatize(word.lower()) for word in filtered_tokens]

# Display the lemmatized tokens
print("Lemmatized Tokens:", lemmatized_tokens)


Lemmatized Tokens: ['natural', 'language', 'processing', 'nlp', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [11]:
print("Lemmatized Tokens:", lemmatized_tokens)

Lemmatized Tokens: ['natural', 'language', 'processing', 'nlp', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


## **3. Feature Extraction**

### **Exercise 3: Bag of Words (BoW)**

Use the `CountVectorizer` from `scikit-learn` to create a Bag of Words representation of the following corpus:

In [12]:
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

# Define the corpus
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

# Step 1: Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Step 2: Fit and transform the corpus into a BoW representation
X = vectorizer.fit_transform(corpus)

# Convert the result into an array to view the token counts
bow_representation = X.toarray()

# Display the BoW representation (token counts)
print("BoW Representation (Token Counts):")
print(bow_representation)

# Display the feature names (words)
print("\nFeature Names (Words in the corpus):")
print(vectorizer.get_feature_names_out())


BoW Representation (Token Counts):
[[0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 0 0 1 0]
 [0 1 1 0 1 0 1 1 1]]

Feature Names (Words in the corpus):
['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']


In [14]:
print("Bag of Words:\n", X.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

Bag of Words:
 [[0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 0 0 1 0]
 [0 1 1 0 1 0 1 1 1]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
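
The vocabulary above has nine entries even though the corpus also contains the word "I". With its default settings, `CountVectorizer` lowercases the text and keeps only tokens made of two or more word characters, so single-letter words are dropped. The standard-library sketch below mimics that default behaviour to show where the counts come from. `build_bow` is a hypothetical helper written for illustration only; it is not part of scikit-learn.

```python
import re
from collections import Counter

# Same token pattern CountVectorizer uses by default: two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def build_bow(documents):
    """Return (vocabulary, count_matrix) for a list of strings (illustrative helper)."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    # Sort the vocabulary so the column order is alphabetical
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # a missing term counts as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

vocabulary, bow_matrix = build_bow(corpus)
print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row is one document and each column is one vocabulary term, so the rows printed here should match the `X.toarray()` output above.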


### **Exercise 4: TF-IDF**

Use the `TfidfVectorizer` from `scikit-learn` to create a TF-IDF representation of the same corpus. Store the result in `X_tfidf`.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the corpus
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

# Step 1: Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Step 2: Fit and transform the corpus into a TF-IDF representation
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

# Convert the result into an array to view the TF-IDF values
tfidf_representation = X_tfidf.toarray()

# Display the TF-IDF representation
print("TF-IDF Representation:")
print(tfidf_representation)

# Display the feature names (words in the corpus)
print("\nFeature Names (Words in the corpus):")
print(tfidf_vectorizer.get_feature_names_out())


TF-IDF Representation:
[[0.         0.         0.         0.         0.         0.861037
  0.         0.50854232 0.        ]
 [0.65249088 0.         0.         0.65249088 0.         0.
  0.         0.38537163 0.        ]
 [0.         0.43238509 0.43238509 0.         0.43238509 0.
  0.43238509 0.2553736  0.43238509]]

Feature Names (Words in the corpus):
['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']


In [16]:
print("TF-IDF:\n", X_tfidf.toarray())
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

TF-IDF:
 [[0.         0.         0.         0.         0.         0.861037
  0.         0.50854232 0.        ]
 [0.65249088 0.         0.         0.65249088 0.         0.
  0.         0.38537163 0.        ]
 [0.         0.43238509 0.43238509 0.         0.43238509 0.
  0.43238509 0.2553736  0.43238509]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
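
To see where these numbers come from, it can help to recompute them by hand. With its default settings, `TfidfVectorizer` multiplies each raw term count by a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and then scales every row to unit L2 length. The sketch below reproduces that calculation with the standard library only. `tfidf_rows` is a hypothetical helper for illustration, and it assumes the default tokenization (lowercasing, tokens of two or more word characters).

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tfidf_rows(documents):
    """Return (vocabulary, rows) using smoothed IDF and L2 row normalization."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term
    df = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

vocabulary, rows = tfidf_rows(corpus)
first_doc = dict(zip(vocabulary, rows[0]))
print(round(first_doc["love"], 6), round(first_doc["nlp"], 6))
```

Because "nlp" appears in all three documents, its IDF is the minimum value of 1, which is why it receives the smallest weight in every row even though it is the only term shared across the corpus.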


### **Exercise 5: N-grams**

Generate **bigrams** (2-grams) from the corpus using `CountVectorizer`. Store the result in `X_bigram`.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

# Define the corpus
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

# Step 1: Initialize the CountVectorizer with ngram_range=(2, 2) for bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))

# Step 2: Fit and transform the corpus into a bigram representation
X_bigram = bigram_vectorizer.fit_transform(corpus)

# Convert the result into an array to view the bigram counts
bigram_representation = X_bigram.toarray()

# Display the bigram representation (counts of bigrams)
print("Bigram Representation (Token Counts):")
print(bigram_representation)

# Display the feature names (bigrams in the corpus)
print("\nFeature Names (Bigrams in the corpus):")
print(bigram_vectorizer.get_feature_names_out())


Bigram Representation (Token Counts):
[[0 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 1 0]
 [1 1 0 1 0 1 0 1]]

Feature Names (Bigrams in the corpus):
['enjoy learning' 'in nlp' 'is amazing' 'learning new' 'love nlp'
 'new things' 'nlp is' 'things in']


In [18]:
print("Bigrams:\n", X_bigram.toarray())
print("Bigram Vocabulary:", bigram_vectorizer.get_feature_names_out())

Bigrams:
 [[0 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 1 0]
 [1 1 0 1 0 1 0 1]]
Bigram Vocabulary: ['enjoy learning' 'in nlp' 'is amazing' 'learning new' 'love nlp'
 'new things' 'nlp is' 'things in']
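
A bigram is simply a pair of adjacent tokens, so the vocabulary above can be reproduced with a sliding window over each document's tokens. The sketch below uses only the standard library; `word_ngrams` is a hypothetical helper, and it assumes the same default tokenization as `CountVectorizer` (lowercasing, tokens of two or more word characters). That tokenization is why "i love" never appears: the single-letter "I" is discarded before the pairs are formed.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(text, n=2):
    """Return the word n-grams of a string as space-joined strings."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    # zip over n staggered views of the token list to form the sliding window
    return [" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

for doc in corpus:
    print(word_ngrams(doc))
```

Passing `n=3` to the same helper yields trigrams, which corresponds to `ngram_range=(3, 3)` in `CountVectorizer`.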


## **4. Advanced Exercise: Custom Preprocessing Pipeline**

### **Exercise 6: Build a Custom Preprocessing Pipeline**

Combine all the preprocessing steps (tokenization, stop word removal, punctuation removal, stemming/lemmatization) into a single function.

In [22]:
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Ensure necessary NLTK data is downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize lemmatizer and get stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Define the custom preprocessing pipeline function
def text_preprocessing_pipeline(text):
    # Step 1: Tokenize the text
    tokens = word_tokenize(text)

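    # Steps 2-3: Remove stop words and punctuation in a single pass.
    # Note: `word not in string.punctuation` is a substring test, so a
    # multi-character token such as '...' is not removed by it.
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words and word not in string.punctuation]

    # Step 4: Lemmatize the lowercased tokens (treated as nouns by default)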
    lemmatized_tokens = [lemmatizer.lemmatize(word.lower()) for word in filtered_tokens]

    return lemmatized_tokens

# Apply this function to the given text
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."
processed_text = text_preprocessing_pipeline(text)

# Display the processed text
print("Processed Text:", processed_text)


Processed Text: ['natural', 'language', 'processing', 'nlp', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Apply this function to the following text:

In [23]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."
processed_text = text_preprocessing_pipeline(text)

# Display the processed text
print("Processed Text:", processed_text)


Processed Text: ['natural', 'language', 'processing', 'nlp', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']
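
One detail of the pipeline is worth checking: `word not in string.punctuation` asks whether the token is a substring of the punctuation string, not whether it consists of punctuation. Single characters such as `!` are caught, but a multi-character token such as `...` is not. The sketch below compares the two checks; `is_punctuation` is a hypothetical helper offered as one possible stricter test, not a change to the pipeline above.

```python
import string

def is_punctuation(token):
    """True if the token is non-empty and made up entirely of punctuation characters."""
    return bool(token) and all(ch in string.punctuation for ch in token)

# Columns: token, result of the substring test, result of the per-character test
for token in ["!", "...", "!!", "NLP"]:
    print(repr(token), token in string.punctuation, is_punctuation(token))
```

Swapping the per-character test into the list comprehension would also drop tokens like `...`, at the cost of changing the pipeline's output on text that contains them.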


## **5. Evaluation of Preprocessing Techniques**

### **Exercise 7: Compare Preprocessing Techniques**

Compare the results of stemming and lemmatization on the following sentence. Store the results in `stemmed_tokens` and `lemmatized_tokens`.

In [24]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import string

# Ensure necessary NLTK data is downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Example sentence
sentence = "The cats are playing with the mice in the garden."

# Step 1: Tokenize the sentence and remove stop words and punctuation
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(sentence)
filtered_tokens = [word for word in tokens if word.lower() not in stop_words and word not in string.punctuation]

# Step 2: Apply stemming
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

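# Step 3: Apply lemmatization; lemmatize() treats each word as a noun unless a
# pos argument is given, so the verb form 'playing' comes back unchanged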
lemmatized_tokens = [lemmatizer.lemmatize(word.lower()) for word in filtered_tokens]

# Print the results
print("Original Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)


Original Tokens: ['cats', 'playing', 'mice', 'garden']
Stemmed Tokens: ['cat', 'play', 'mice', 'garden']
Lemmatized Tokens: ['cat', 'playing', 'mouse', 'garden']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [25]:
print("Original Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Original Tokens: ['cats', 'playing', 'mice', 'garden']
Stemmed Tokens: ['cat', 'play', 'mice', 'garden']
Lemmatized Tokens: ['cat', 'playing', 'mouse', 'garden']
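
The two outputs differ in opposite directions: stemming shortens "playing" to "play" but leaves "mice" alone, while lemmatization maps "mice" to "mouse" but leaves "playing" alone because it treats every word as a noun by default. The toy sketch below illustrates why. A stemmer applies suffix rules without knowing the vocabulary, whereas a lemmatizer looks words up in a lexicon. Everything here (`toy_stem`, `toy_lemmatize`, `SUFFIX_RULES`, `IRREGULAR_NOUNS`) is hypothetical and deliberately simplistic; it is neither the Porter algorithm nor WordNet.

```python
# Toy suffix rules, checked in order; NOT the Porter algorithm
SUFFIX_RULES = ("ing", "s")

# Tiny hand-written lookup standing in for a real lexicon such as WordNet
IRREGULAR_NOUNS = {"mice": "mouse"}

def toy_stem(word):
    """Strip the first matching suffix, with no knowledge of the vocabulary."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up as a noun: irregular forms first, then a plural -s rule."""
    word = word.lower()
    if word in IRREGULAR_NOUNS:
        return IRREGULAR_NOUNS[word]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

tokens = ["cats", "playing", "mice", "garden"]
print("Stemmed:   ", [toy_stem(t) for t in tokens])
print("Lemmatized:", [toy_lemmatize(t) for t in tokens])
```

On these four tokens the toy functions give the same results as the NLTK output above, which makes the trade-off visible: rules generalize to unseen words but miss irregular forms, while a lookup handles irregular forms but only for the part of speech it is asked about.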


## **6. Real-World Dataset: Sentiment Analysis**

### **Exercise 8: Preprocess and Analyze Tweets**

In this exercise, you will work with a real-world dataset of tweets. The dataset contains 5000 positive and 5000 negative tweets. Your task is to preprocess the tweets and extract features for sentiment analysis.


In [26]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In [27]:
# Load the dataset
from nltk.corpus import twitter_samples

Load the dataset of positive and negative tweets.

In [28]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Combine them into a single list called `all_tweets` and create a corresponding list of labels called `labels`.

In [29]:
import nltk
from nltk.corpus import twitter_samples

# Ensure necessary NLTK data is downloaded
nltk.download('twitter_samples')

# Load the positive and negative tweets
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

# Step 1: Combine the positive and negative tweets into a single list called all_tweets
all_tweets = positive_tweets + negative_tweets

# Step 2: Create a corresponding list of labels, where 1 is for positive tweets and 0 for negative tweets
labels = [1] * len(positive_tweets) + [0] * len(negative_tweets)

# Display the combined data and labels
print("First 5 tweets:", all_tweets[:5])
print("First 5 labels:", labels[:5])


[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


First 5 tweets: ['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)', '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!', '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!', '@97sides CONGRATS :)', 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days']
First 5 labels: [1, 1, 1, 1, 1]


In [30]:
# Print a sample tweet
print("Sample Tweet:", all_tweets[0])
print("Label:", labels[0])

Sample Tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Label: 1


### **Exercise 9: Preprocess Tweets**

Apply the custom preprocessing pipeline to the entire dataset of tweets. Store the result in `preprocessed_tweets`.

In [31]:
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Ensure necessary NLTK data is downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize the lemmatizer and stop words list
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Define the custom preprocessing pipeline function
def text_preprocessing_pipeline(text):
    # Step 1: Tokenize the text
    tokens = word_tokenize(text)

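    # Steps 2-3: Remove stop words and punctuation in a single pass.
    # Note: `word not in string.punctuation` is a substring test, so a
    # multi-character token such as '...' is not removed by it.
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words and word not in string.punctuation]

    # Step 4: Lemmatize the lowercased tokens (treated as nouns by default)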
    lemmatized_tokens = [lemmatizer.lemmatize(word.lower()) for word in filtered_tokens]

    return lemmatized_tokens


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [33]:
# Apply the preprocessing pipeline to the entire dataset of tweets
preprocessed_tweets = [text_preprocessing_pipeline(tweet) for tweet in all_tweets]

# Print a sample preprocessed tweet
print("Preprocessed Tweets Sample:", preprocessed_tweets[0])


Preprocessed Tweets Sample: ['followfriday', 'france_inte', 'pkuchly57', 'milipol_paris', 'top', 'engaged', 'member', 'community', 'week']
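
The sample shows two side effects of applying a general-purpose pipeline to tweets: user handles survive as ordinary tokens (`france_inte`, `pkuchly57`), and the emoticon `:)` disappears because it is split into punctuation characters that are then filtered out. Handles rarely carry sentiment, whereas emoticons often do. The sketch below shows one possible cleaning step that could run before tokenization. `clean_tweet` and its regular expressions are assumptions made for illustration, not part of the lab's pipeline. NLTK also ships a `TweetTokenizer` with a `strip_handles` option that is designed for this kind of text.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")

def clean_tweet(text):
    """Remove URLs and @handles, keep the word part of hashtags, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)  # '#nlp' becomes 'nlp'
    return " ".join(text.split())

sample = "#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)"
print(clean_tweet(sample))
```

This step leaves emoticons in the string. Whether they reach the feature matrix still depends on the tokenizer and on the punctuation filter that follow.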


### **Exercise 10: Feature Extraction on Tweets**

Extract features from the preprocessed tweets using **Bag of Words** and **TF-IDF**. Store the results in `X_bow` and `X_tfidf`, respectively.

In [34]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Step 1: Create a Bag of Words representation
# Convert preprocessed tweets from list of tokens to string format for CountVectorizer
preprocessed_tweets_str = [' '.join(tweet) for tweet in preprocessed_tweets]

# Initialize CountVectorizer for Bag of Words
bow_vectorizer = CountVectorizer()

# Fit and transform the preprocessed tweets into a BoW representation
X_bow = bow_vectorizer.fit_transform(preprocessed_tweets_str)

# Step 2: Create a TF-IDF representation
# Initialize TfidfVectorizer for TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the preprocessed tweets into a TF-IDF representation
X_tfidf = tfidf_vectorizer.fit_transform(preprocessed_tweets_str)

# Display the shapes of the results
print("BoW Feature Matrix Shape:", X_bow.shape)
print("TF-IDF Feature Matrix Shape:", X_tfidf.shape)


BoW Feature Matrix Shape: (10000, 19900)
TF-IDF Feature Matrix Shape: (10000, 19900)


## **7. Conclusion**

In this lab, you tokenized text, removed stop words and punctuation, compared stemming with lemmatization, and built Bag of Words, TF-IDF, and bigram representations. You then applied the same preprocessing pipeline to a real-world dataset of 10,000 labeled tweets and produced feature matrices that can be used for sentiment analysis.

