# **Part 1: Text Preprocessing**



**Data:** The corpus is defined in the code cell below.

**1.1: Preprocessing From Scratch**

**Goal:** Write a function clean_text_scratch(text) that performs the following without using NLTK or spaCy:

1. Lowercasing: Convert text to lowercase.

2. Punctuation Removal: Use Python's re (regex) library or string methods to remove special characters (!, ., ,, :, ;, ..., ').

3. Tokenization: Split the string into a list of words based on whitespace.

4. Stopword Removal: Filter out words found in this list: ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'].

5. Simple Stemming: Create a helper function that removes suffixes 'ing', 'ly', 'ed', and 's' from the end of words.


Note: This is a "Naive" stemmer. It will break words like "sing" -> "s". This illustrates why we need libraries!

**Task:** Run this function on the first sentence of the corpus and print the result.

In [14]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [15]:
import re

def simple_stem(word):
    # Remove suffixes 'ing', 'ly', 'ed', 's'
    if word.endswith('ing'):
        word = word[:-3]
    elif word.endswith('ly'):
        word = word[:-2]
    elif word.endswith('ed'):
        word = word[:-2]
    elif word.endswith('s'):
        word = word[:-1]
    return word

def clean_text_scratch(text):
    # 1. Lowercasing
    text = text.lower()
    # 2. Punctuation Removal
    # First, remove the literal '...' sequence
    text = text.replace('...', '')
    # Then, remove the remaining single-character punctuation (! . , : ; ')
    text = re.sub(r"[!.,:;']", '', text)
    # 3. Tokenization
    words = text.split()
    # 4. Stopword Removal
    stopwords = ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or']
    words = [word for word in words if word not in stopwords]
    # 5. Simple Stemming
    words = [simple_stem(word) for word in words]
    return words

# Task: Run this function on the first sentence of the corpus and print the result.
cleaned_first_sentence_scratch = clean_text_scratch(corpus[0])
print(cleaned_first_sentence_scratch)

['artificial', 'intelligence', 'transform', 'world', 'however', 'ethical', 'concern', 'remain']
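
The note above warns that the naive stemmer over-strips short words ("sing" becomes "s"). As a rough sketch of one possible mitigation, which is not part of the assignment, the helper below strips a suffix only when a minimum number of characters would remain. The name `guarded_stem` and the threshold of 3 are assumptions chosen for illustration.

```python
def guarded_stem(word, min_stem=3):
    """Strip one suffix, but only if at least `min_stem` characters remain."""
    for suffix in ("ing", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

# Compare a few words, including the problematic "sing"
for w in ["sing", "transforming", "concerns", "is"]:
    print(w, "->", guarded_stem(w))
```

The guard is still only a heuristic: "this" becomes "thi", for example, which is why a dictionary-based lemmatizer is usually the safer choice.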


**1.2: Preprocessing Using Tools**

**Goal:** Use the nltk library to perform the same cleaning on the entire corpus.

**Steps:**

1. Use nltk.tokenize.word_tokenize.
2. Use nltk.corpus.stopwords.
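3. Use nltk.stem.WordNetLemmatizer to convert words to their root (e.g., "jumps" $\to$ "jump"). Note that lemmatize() treats every word as a noun by default, so a verb form such as "transforming" is reduced to "transform" only when pos='v' is passed.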


**Task:** Print the cleaned, lemmatized tokens for the second sentence (The pizza review).

In [16]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text_nltk(text):
    # 1. Lowercasing
    text = text.lower()
    # 2. Tokenization
    tokens = word_tokenize(text)
    # 3. Punctuation Removal and Stopword Removal and Lemmatization
    cleaned_tokens = []
    for word in tokens:
        # Remove punctuation by checking if the word is alphabetic
        if word.isalpha():
            if word not in stop_words:
                cleaned_tokens.append(lemmatizer.lemmatize(word))
    return cleaned_tokens

# Apply the cleaning function to the entire corpus
cleaned_corpus_nltk = [clean_text_nltk(sentence) for sentence in corpus]

# Print the cleaned, lemmatized tokens for the second sentence (corpus[1])
print(cleaned_corpus_nltk[1])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


['pizza', 'absolutely', 'delicious', 'service', 'terrible', 'wo', 'go', 'back']
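
The token 'wo' in this output comes from word_tokenize splitting "won't" into "wo" and "n't"; the isalpha() check then discards "n't". One possible workaround, sketched below, is to expand contractions before tokenizing. The `CONTRACTIONS` table and the `expand_contractions` helper are hypothetical and cover only a few forms.

```python
import re

# Hypothetical, deliberately tiny lookup table; a real one would be much larger.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "cannot",
    "n't": " not",  # generic fallback, e.g. "isn't" -> "is not"
}

def expand_contractions(text):
    """Replace known contractions, trying the longest patterns first."""
    for pattern in sorted(CONTRACTIONS, key=len, reverse=True):
        text = re.sub(re.escape(pattern), CONTRACTIONS[pattern], text)
    return text

print(expand_contractions("i won't go back."))
```

Both "will" and "not" are in NLTK's English stopword list, so clean_text_nltk would still drop them afterwards; keeping the negation would also require removing "not" from the stopword set.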


# **Part 2: Text Representation**

**2.1: Bag of Words (BoW)**

**Logic:**

**Build Vocabulary:** Create a list of all unique words in the entire corpus (after cleaning). Sort them alphabetically.

**Vectorize:** Write a function that takes a sentence and returns a list of numbers. Each number represents the count of a vocabulary word in that sentence.

**Task:** Print the unique Vocabulary list. Then, print the BoW vector for: "The quick brown fox jumps over the lazy dog."

In [17]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# NLTK downloads (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# 1. Build Vocabulary
all_words = []
for sentence_tokens in cleaned_corpus_nltk:
    all_words.extend(sentence_tokens)

vocabulary = sorted(set(all_words))

# Print the unique Vocabulary list
print("Unique Vocabulary:", vocabulary)

# 2. Vectorize function
def vectorize_bow(sentence, vocab, cleaning_function):
    # Clean the input sentence with the supplied cleaning function
    cleaned_sentence_tokens = cleaning_function(sentence)

    # Initialize vector with zeros
    vector = [0] * len(vocab)

    # Populate vector with word counts
    for word in cleaned_sentence_tokens:
        if word in vocab:
            vector[vocab.index(word)] += 1
    return vector

# Task: Print the BoW vector for: "The quick brown fox jumps over the lazy dog."
sentence_to_vectorize = "The quick brown fox jumps over the lazy dog."
bow_vector_for_sentence = vectorize_bow(sentence_to_vectorize, vocabulary, clean_text_nltk)

print(f"\nBoW Vector for '{sentence_to_vectorize}':", bow_vector_for_sentence)

Unique Vocabulary: ['absolutely', 'algebra', 'artificial', 'back', 'behind', 'brown', 'concern', 'data', 'delicious', 'dog', 'ethical', 'fox', 'go', 'hate', 'however', 'intelligence', 'involves', 'jump', 'lazy', 'learning', 'linear', 'love', 'machine', 'math', 'mind', 'nobler', 'pizza', 'question', 'quick', 'remain', 'science', 'service', 'statistic', 'terrible', 'transforming', 'whether', 'wo', 'world']

BoW Vector for 'The quick brown fox jumps over the lazy dog.': [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
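
vectorize_bow calls vocab.index(word) for every token, which scans the list each time. For larger vocabularies, a common alternative is to count the tokens once and then read off each vocabulary word's count. The sketch below assumes already-cleaned tokens and uses a shortened vocabulary for readability; `bow_vector` is a hypothetical name.

```python
from collections import Counter

def bow_vector(tokens, vocab):
    """Return the count of each vocabulary word in `tokens`, in vocab order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]  # a Counter returns 0 for missing words

# Shortened, alphabetically sorted vocabulary (a subset of the one printed above)
vocab = ["brown", "dog", "fox", "jump", "lazy", "pizza", "quick"]
tokens = ["quick", "brown", "fox", "jump", "lazy", "dog"]
print(bow_vector(tokens, vocab))  # [1, 1, 1, 1, 1, 0, 1]
```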


**2.2: BoW Using Tools**

**Task:** Use sklearn.feature_extraction.text.CountVectorizer.

**Steps:**

1. Instantiate the vectorizer.

2. fit_transform the raw corpus.

3. Convert the result to an array (.toarray()) and print it.

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

# 1. Instantiate the vectorizer.
vectorizer = CountVectorizer()

# 2. fit_transform the raw corpus.
bow_matrix = vectorizer.fit_transform(corpus)

# 3. Convert the result to an array (.toarray()) and print it.
print("BoW Matrix (CountVectorizer):")
print(bow_matrix.toarray())

BoW Matrix (CountVectorizer):
[[0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1]
 [1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0 0 0 0 1 0 1 0 2 0 0 0 2 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
  0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0]
 [0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0
  1 0 0 0 0 0 0 1 2 1 2 0 0 1 0 0]
 [0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0
  0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]]
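
The matrix has 52 columns, more than the 38-word vocabulary from 2.1, because CountVectorizer applies its own defaults to the raw corpus: it lowercases, keeps tokens of two or more word characters, and removes no stopwords. The sketch below imitates that default tokenization with the standard library to show where the columns come from. It is an approximation for illustration, not a replacement for the vectorizer, and `default_tokens` is a hypothetical helper.

```python
import re

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it.",
]

def default_tokens(text):
    """Approximate CountVectorizer's defaults: lowercase, tokens of 2+ word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Sorted set of all tokens = column order of the matrix
vocab = sorted({tok for doc in corpus for tok in default_tokens(doc)})
print(len(vocab))          # 52
print(vocab.index("the"))  # 44

# Count vector for the first sentence, aligned with the vocabulary
row0 = [default_tokens(corpus[0]).count(word) for word in vocab]
print(row0)
```

Single-character tokens such as "I" are dropped by the two-character minimum, and "won't" contributes only "won".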


**2.3: TF-IDF From Scratch (The Math)**

**Goal:** Manually calculate the score for the word "machine" in the last sentence:

"I love machine learning, but I hate the math behind it."

**Formula:**

*TF (Term Frequency):* $\frac{\text{Count of 'machine' in sentence}}{\text{Total words in sentence}}$

*IDF (Inverse Document Frequency):* $\log(\frac{\text{Total number of documents}}{\text{Number of documents containing 'machine'}})$ (Use math.log).

**Result:** TF * IDF.

**Task:** Print your manual calculation result.

In [23]:
import math

# The last sentence (corpus[5]) after NLTK cleaning and lemmatization
last_sentence_cleaned = cleaned_corpus_nltk[5]

# 1. Calculate TF (Term Frequency) for 'machine' in the last sentence
word_to_analyze = 'machine'

count_machine_in_sentence = last_sentence_cleaned.count(word_to_analyze)
total_words_in_last_sentence = len(last_sentence_cleaned)

tf = count_machine_in_sentence / total_words_in_last_sentence

print(f"TF for '{word_to_analyze}' in the last sentence: {tf:.4f}")

# 2. Calculate IDF (Inverse Document Frequency) for 'machine' across the entire corpus
total_documents = len(cleaned_corpus_nltk)
documents_containing_machine = 0

for doc in cleaned_corpus_nltk:
    if word_to_analyze in doc:
        documents_containing_machine += 1

# Guard against division by zero in case the word appears in no document
# (not the case here: 'machine' is present)
if documents_containing_machine == 0:
    raise ValueError(f"'{word_to_analyze}' does not appear in any document")
idf = math.log(total_documents / documents_containing_machine)

print(f"IDF for '{word_to_analyze}' across the corpus: {idf:.4f}")

# 3. Calculate TF-IDF
tfidf_score = tf * idf

print(f"\nManual TF-IDF score for '{word_to_analyze}' in the last sentence: {tfidf_score:.4f}")

TF for 'machine' in the last sentence: 0.1667
IDF for 'machine' across the corpus: 1.0986

Manual TF-IDF score for 'machine' in the last sentence: 0.1831
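
Plugging the numbers from this cell into the formula: the cleaned last sentence has 6 tokens, one of which is "machine", and "machine" appears in 2 of the 6 documents.

$$
\text{TF} = \frac{1}{6} \approx 0.1667, \qquad
\text{IDF} = \ln\frac{6}{2} = \ln 3 \approx 1.0986, \qquad
\text{TF} \times \text{IDF} \approx 0.1831
$$

The token counts come from the cleaned corpus (stopwords and punctuation removed), not from the raw sentence.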


**2.4: TF-IDF Using Tools**

**Task:** Use sklearn.feature_extraction.text.TfidfVectorizer.

**Steps:** Fit it on the corpus and print the vector for the first sentence.

**Observation:** Compare the score of unique words (like "Intelligence") vs common words (like "is"). Which is higher?

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. Instantiate TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# 2. Fit and transform the raw corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# 3. Print the TF-IDF vector for the first sentence
print("TF-IDF Vector for the first sentence (corpus[0]):")
print(tfidf_matrix[0].toarray())

# Observation: Compare the score of unique words (like "Intelligence") vs common words (like "is").
# Get feature names (vocabulary)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Get the TF-IDF vector for the first sentence as a dense array
first_sentence_vector = tfidf_matrix[0].toarray()[0]

# Find the index and score for 'intelligence'
try:
    idx_intelligence = list(feature_names).index('intelligence')
    score_intelligence = first_sentence_vector[idx_intelligence]
    print(f"\nTF-IDF score for 'intelligence': {score_intelligence:.4f}")
except ValueError:
    print("\n'intelligence' not found in vocabulary for TF-IDF.")

# Find the index and score for 'is'
try:
    idx_is = list(feature_names).index('is')
    score_is = first_sentence_vector[idx_is]
    print(f"TF-IDF score for 'is': {score_is:.4f}")
except ValueError:
    print("\n'is' not found in vocabulary for TF-IDF.")

TF-IDF Vector for the first sentence (corpus[0]):
[[0.         0.         0.         0.33454543 0.         0.
  0.         0.         0.         0.33454543 0.         0.
  0.         0.33454543 0.         0.         0.         0.33454543
  0.         0.33454543 0.         0.27433204 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.33454543 0.         0.         0.
  0.         0.         0.17139656 0.         0.         0.33454543
  0.         0.         0.         0.33454543]]

TF-IDF score for 'intelligence': 0.3345
TF-IDF score for 'is': 0.2743
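
These scores are not directly comparable to the manual formula in 2.3, because TfidfVectorizer with its default settings uses a smoothed IDF, $\ln\frac{1+n}{1+\text{df}} + 1$, and then L2-normalizes each row. The sketch below reproduces the first-sentence scores under those defaults. The document frequencies are counted by hand from the raw corpus: eight words occur only in this sentence, "is" occurs in 2 sentences, and "the" in 5.

```python
import math

n_docs = 6

def smoothed_idf(df, n=n_docs):
    """Smoothed IDF as used by TfidfVectorizer's defaults (smooth_idf=True)."""
    return math.log((1 + n) / (1 + df)) + 1

# Document frequency of each distinct token in corpus[0]; every token occurs once (tf = 1)
doc_freq = {
    "artificial": 1, "concerns": 1, "ethical": 1, "however": 1,
    "intelligence": 1, "remain": 1, "transforming": 1, "world": 1,
    "is": 2, "the": 5,
}

raw = {word: 1 * smoothed_idf(df) for word, df in doc_freq.items()}
norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 norm of the row
scores = {word: v / norm for word, v in raw.items()}

for word in ("intelligence", "is", "the"):
    print(f"{word}: {scores[word]:.4f}")
```

The rarer word "intelligence" scores higher than "is", and "the", which appears in five of the six sentences, scores lowest of the three.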


# **Part 3: Word Embeddings**

**3.1: Word2Vec Using Tools**

**Task:** Train a model using gensim.models.Word2Vec.

**Steps:**

1. Pass your cleaned tokenized corpus (from Part 1.2) to Word2Vec.

2. Set min_count=1 (since our corpus is small, we want to keep all words).

3. Set vector_size=10 (small vector size for easy viewing).

**Experiment:** Print the vector for the word "learning".

In [25]:
!pip install gensim
from gensim.models import Word2Vec

# Train the Word2Vec model on the cleaned, tokenized corpus
# (cleaned_corpus_nltk, built in Part 1.2)
model = Word2Vec(sentences=cleaned_corpus_nltk, min_count=1, vector_size=10)

# Print the vector for the word "learning"
word_to_experiment = "learning"
if word_to_experiment in model.wv:
    print(f"Vector for '{word_to_experiment}':")
    print(model.wv[word_to_experiment])
else:
    print(f"'{word_to_experiment}' not found in the vocabulary.")

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0
Vector for 'learning':
[-0.00535678  0.00238785  0.05107836  0.09016657 -0.09301379 -0.07113771
  0.06464887  0.08973394 -0.05023384 -0.03767424]


**3.3: Pre-trained GloVe (Understanding Global Context)**

**Task:** Use gensim.downloader to load 'glove-wiki-gigaword-50'.

**Analogy Task:** Compute the famous analogy: $\text{King} - \text{Man} + \text{Woman} = ?$

Use model.most_similar(positive=['woman', 'king'], negative=['man']).

**Question:** Does the model correctly guess "Queen"?

In [26]:
import gensim.downloader as api

# Load pre-trained GloVe model
glove_model = api.load('glove-wiki-gigaword-50')

# Analogy Task: Compute King - Man + Woman = ?
result = glove_model.most_similar(positive=['woman', 'king'], negative=['man'])

print("King - Man + Woman = ?")
for word, score in result:
    print(f"{word}: {score:.4f}")

# Question: Does the model correctly guess "Queen"?
is_queen_guessed = any(word == 'queen' for word, score in result)
if is_queen_guessed:
    print("\nYes, the model correctly guesses 'queen' (or a close variant) in the top results.")
else:
    print("\nNo, the model did not correctly guess 'queen' in the top results.")

King - Man + Woman = ?
queen: 0.8524
throne: 0.7664
prince: 0.7592
daughter: 0.7474
elizabeth: 0.7460
princess: 0.7425
kingdom: 0.7337
monarch: 0.7214
eldest: 0.7185
widow: 0.7099

Yes, the model correctly guesses 'queen' (or a close variant) in the top results.
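
most_similar combines the input vectors and ranks the rest of the vocabulary by cosine similarity to the result. The sketch below shows the idea on made-up 2-D vectors, where axis 0 loosely stands for "royalty" and axis 1 for "gender". The vectors are invented for illustration and are not GloVe values, and gensim additionally unit-normalizes each input vector before combining them, which this simplified version skips.

```python
import math

# Invented toy embeddings: (royalty, gender). NOT real GloVe values.
toy = {
    "king":  (0.9,  0.8),
    "queen": (0.9, -0.8),
    "man":   (0.1,  0.9),
    "woman": (0.1, -0.9),
    "apple": (-0.7, 0.1),
}

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# king - man + woman, computed component-wise
target = tuple(k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"]))

# Rank every word except the three inputs, which most_similar also excludes
inputs = {"king", "man", "woman"}
candidates = {w: cosine(target, v) for w, v in toy.items() if w not in inputs}
best = max(candidates, key=candidates.get)
print(best, round(candidates[best], 4))
```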


# **Part 5: Sentiment Analysis (The Application)**

**Concept:** Sentiment Analysis determines whether a piece of text is Positive, Negative, or Neutral. We will use VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK. VADER is specifically designed for social media text; it understands that capital letters ("LOVE"), punctuation ("!!!"), and emojis change the sentiment intensity.

**Task:**

1. Initialize the SentimentIntensityAnalyzer.

2. Pass the Pizza Review (corpus[1]) into the analyzer.

3. Pass the Math Complaint (corpus[5]) into the analyzer.

**Analysis:** Look at the compound score for both.

**Compound Score Range:** -1 (Most Negative) to +1 (Most Positive).

Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?

In [28]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
# Download VADER (run once)
nltk.download('vader_lexicon')

# 1. Initialize the SentimentIntensityAnalyzer.
sia = SentimentIntensityAnalyzer()

# 2. Pass the Pizza Review (corpus[1]) into the analyzer.
pizza_review = corpus[1]
pizza_sentiment = sia.polarity_scores(pizza_review)
print(f"\nPizza Review: '{pizza_review}'")
print(f"Sentiment: {pizza_sentiment}")

# 3. Pass the Math Complaint (corpus[5]) into the analyzer.
math_complaint = corpus[5]
math_sentiment = sia.polarity_scores(math_complaint)
print(f"\nMath Complaint: '{math_complaint}'")
print(f"Sentiment: {math_sentiment}")

# Analysis: Look at the compound score for both.
# Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?


Pizza Review: 'The pizza was absolutely delicious, but the service was terrible ... I won't go back.'
Sentiment: {'neg': 0.223, 'neu': 0.644, 'pos': 0.134, 'compound': -0.3926}

Math Complaint: 'I love machine learning, but I hate the math behind it.'
Sentiment: {'neg': 0.345, 'neu': 0.478, 'pos': 0.177, 'compound': -0.5346}


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
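
To turn a compound score into a label, a commonly used convention, suggested in the VADER project's documentation, is: positive at 0.05 or above, negative at -0.05 or below, and neutral in between. The helper below, with the hypothetical name `label_compound`, applies that convention to the two compound scores printed above.

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the output above
for name, score in [("Pizza Review", -0.3926), ("Math Complaint", -0.5346)]:
    print(f"{name}: {score} -> {label_compound(score)}")
```

Under this convention both sentences are labeled negative, so VADER does not treat the pizza review as mixed or neutral. The neg, neu, and pos fields still show the mix, while compound collapses it into a single number. VADER also gives extra weight to the clause after "but", which is consistent with the negative result for both sentences.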
