# **Part 1: Text Preprocessing**



**Data:** The corpus is defined in the code cell below.

**1.1: Preprocessing From Scratch**

**Goal:** Write a function clean_text_scratch(text) that performs the following without using NLTK or spaCy:

1. Lowercasing: Convert text to lowercase.

2. Punctuation Removal: Use Python's re (regex) library or string methods to remove special characters (!, ., ,, :, ;, ..., ').

3. Tokenization: Split the string into a list of words based on whitespace.

4. Stopword Removal: Filter out words found in this list: ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'].

5. Simple Stemming: Create a helper function that removes suffixes 'ing', 'ly', 'ed', and 's' from the end of words.


Note: This is a "Naive" stemmer. It will break words like "sing" -> "s". This illustrates why we need libraries!

**Task:** Run this function on the first sentence of the corpus and print the result.

In [None]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [None]:
import re

def naive_stemmer(word):
    suffixes = ['ing', 'ly', 'ed', 's']
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf):
            return word[:-len(suf)]
    return word

def clean_text_scratch(text):
    # 1. Lowercase
    text = text.lower()

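    # 2. Remove punctuation/special characters: keep only letters and whitespace
    #    (this also drops digits; the text is already lowercase at this point)
    text = re.sub(r"[^a-z\s]", "", text)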

    # 3. Tokenization
    tokens = text.split()

    # 4. Stopword removal
    stopwords = ['the','is','in','to','of','and','a','it','was','but','or']
    tokens = [word for word in tokens if word not in stopwords]

    # 5. Naive stemming
    tokens = [naive_stemmer(word) for word in tokens]

    return tokens

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

result = clean_text_scratch(corpus[0])
result



['artificial',
 'intelligence',
 'transform',
 'world',
 'however',
 'ethical',
 'concern',
 'remain']
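
The note about the naive stemmer can be checked directly. The sketch below repeats naive_stemmer from the cell above so that it runs on its own; the sample words are chosen here for illustration and are not part of the corpus.

```python
def naive_stemmer(word):
    # Same logic as the cell above: strip the first matching suffix.
    suffixes = ['ing', 'ly', 'ed', 's']
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf):
            return word[:-len(suf)]
    return word

# Illustrative words (not from the corpus) that the naive rule over-stems.
samples = ["sing", "bus", "red", "fly", "running"]
stems = {word: naive_stemmer(word) for word in samples}
print(stems)
# {'sing': 's', 'bus': 'bu', 'red': 'r', 'fly': 'f', 'running': 'runn'}
```

None of these words actually carries the suffix as a grammatical ending, which is why a rule based only on string endings damages them.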

**1.2: Preprocessing Using Tools**

**Goal:** Use the nltk library to perform the same cleaning on the entire corpus.

**Steps:**

1. Use nltk.tokenize.word_tokenize.
2. Use nltk.corpus.stopwords.
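3. Use nltk.stem.WordNetLemmatizer to convert words to their root (e.g., "jumps" $\to$ "jump", "transforming" $\to$ "transform"). Note that lemmatize treats every word as a noun by default, so "transforming" is reduced to "transform" only when it is called with pos='v'.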


**Task:** Print the cleaned, lemmatized tokens for the second sentence (The pizza review).

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_with_nltk(text):
    # 1. Tokenize
    tokens = word_tokenize(text)

    # 2. Lowercase and remove punctuation tokens
    tokens = [t.lower() for t in tokens if t not in string.punctuation]

    # 3. Stopword removal
    tokens = [t for t in tokens if t not in stop_words]

    # 4. Lemmatization
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    return tokens

pizza_review = corpus[1]

result_nltk = clean_with_nltk(pizza_review)
result_nltk


['pizza',
 'absolutely',
 'delicious',
 'service',
 'terrible',
 '...',
 'wo',
 "n't",
 'go',
 'back']
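
The '...' token survives in the output above because `t not in string.punctuation` is a substring test against one long string of punctuation characters, and '...' is not a substring of it. The sketch below contrasts that test with a stricter per-character check. The token list is a hand-written stand-in for word_tokenize output, and the name strict_filter is only illustrative.

```python
import string

# Hand-written stand-in for part of the word_tokenize output above.
tokens = ["delicious", ",", "...", "wo", "n't", "."]

# Substring test, as used in clean_with_nltk: '...' slips through.
substring_filter = [t for t in tokens if t not in string.punctuation]

# Stricter alternative: drop a token only if every character is punctuation.
strict_filter = [
    t for t in tokens
    if not all(ch in string.punctuation for ch in t)
]

print(substring_filter)  # ['delicious', '...', 'wo', "n't"]
print(strict_filter)     # ['delicious', 'wo', "n't"]
```

The fragments 'wo' and "n't" come from the tokenizer splitting "won't"; neither filter touches them because each contains letters.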

# **Part 2: Text Representation**

**2.1: Bag of Words (BoW)**

**Logic:**

**Build Vocabulary:** Create a list of all unique words in the entire corpus (after cleaning). Sort them alphabetically.

**Vectorize:** Write a function that takes a sentence and returns a list of numbers. Each number represents the count of a vocabulary word in that sentence.

**Task:** Print the unique Vocabulary list. Then, print the BoW vector for: "The quick brown fox jumps over the lazy dog."

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# NLTK downloads (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
import string

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_with_nltk(text):
    tokens = word_tokenize(text)
    tokens = [t.lower() for t in tokens if t not in string.punctuation]
    tokens = [t for t in tokens if t not in stop_words]
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens


cleaned_corpus = [clean_with_nltk(sentence) for sentence in corpus]

vocab = sorted({word for sent in cleaned_corpus for word in sent})

print("Vocabulary:\n", vocab)
print("\nVocabulary Size:", len(vocab))

def bow_vectorize(sentence, vocab):
    cleaned = clean_with_nltk(sentence)
    vector = [cleaned.count(word) for word in vocab]
    return vector

test_sentence = "The quick brown fox jumps over the lazy dog."
bow_vector = bow_vectorize(test_sentence, vocab)

print("\nBoW Vector:\n", bow_vector)


Vocabulary:
 ["'t", '...', 'absolutely', 'algebra', 'artificial', 'back', 'behind', 'brown', 'concern', 'data', 'delicious', 'dog', 'ethical', 'fox', 'go', 'hate', 'however', 'intelligence', 'involves', 'jump', 'lazy', 'learning', 'linear', 'love', 'machine', 'math', 'mind', "n't", 'nobler', 'pizza', 'question', 'quick', 'remain', 'science', 'service', 'statistic', 'terrible', 'transforming', 'whether', 'wo', 'world']

Vocabulary Size: 41

BoW Vector:
 [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
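
As a possible variation, the counting step can be done with collections.Counter, which tallies the sentence once instead of calling list.count for every vocabulary word. This is a sketch on a small hand-picked vocabulary, not the 41-word one above; the name bow_vectorize_counter and the sample tokens are illustrative.

```python
from collections import Counter

def bow_vectorize_counter(tokens, vocab):
    # Count every token once, then read the counts off in vocabulary order.
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

# Illustrative, already-cleaned inputs (not the full notebook vocabulary).
small_vocab = ["brown", "dog", "fox", "jump", "lazy", "quick"]
sample_tokens = ["quick", "brown", "fox", "jump", "lazy", "dog", "dog", "cat"]

small_vector = bow_vectorize_counter(sample_tokens, small_vocab)
print(small_vector)  # [1, 2, 1, 1, 1, 1]
```

Words that are not in the vocabulary ("cat" here) are simply ignored, which matches the behaviour of the list.count version.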


**2.2: BoW Using Tools**

**Task:** Use sklearn.feature_extraction.text.CountVectorizer.

**Steps:**

1. Instantiate the vectorizer.

2. fit_transform the raw corpus.

3. Convert the result to an array (.toarray()) and print it.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

bow_matrix = vectorizer.fit_transform(corpus)
bow_array = bow_matrix.toarray()

print("Vocabulary:\n", vectorizer.get_feature_names_out())
print("\nBoW Matrix:\n", bow_array)


Vocabulary:
 ['absolutely' 'algebra' 'and' 'artificial' 'back' 'be' 'behind' 'brown'
 'but' 'concerns' 'data' 'delicious' 'dog' 'ethical' 'fox' 'go' 'hate'
 'however' 'in' 'intelligence' 'involves' 'is' 'it' 'jumps' 'lazy'
 'learning' 'linear' 'love' 'machine' 'math' 'mind' 'nobler' 'not' 'or'
 'over' 'pizza' 'question' 'quick' 'remain' 'science' 'service'
 'statistics' 'terrible' 'that' 'the' 'tis' 'to' 'transforming' 'was'
 'whether' 'won' 'world']

BoW Matrix:
 [[0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1]
 [1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0 0 0 0 1 0 1 0 2 0 0 0 2 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
  0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0]
 [0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0
  1 0 0 0 0 0 0 1 2 1 2 0 0 1 0 0]
 [0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0
  0 0 0 1 0 1 0 0 0 0
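
The vocabulary above differs from the one in 2.1 because CountVectorizer, with default settings, only lowercases the text and extracts tokens of two or more word characters; it removes no stopwords and does no lemmatization. The sketch below imitates that default token pattern with the standard re module. It is an approximation for illustration, not a call into scikit-learn, and the function name default_pattern_tokens is hypothetical.

```python
import re

def default_pattern_tokens(text):
    # Lowercase, then keep runs of two or more word characters,
    # mirroring CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

review = ("The pizza was absolutely delicious, but the service "
          "was terrible ... I won't go back.")
review_tokens = default_pattern_tokens(review)
print(review_tokens)
# ['the', 'pizza', 'was', 'absolutely', 'delicious', 'but', 'the',
#  'service', 'was', 'terrible', 'won', 'go', 'back']
```

This explains why "won" appears in the vocabulary (the apostrophe splits "won't" and the lone "t" is too short) and why "I" is missing.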

**2.3: TF-IDF From Scratch (The Math)**

**Goal:** Manually calculate the score for the word "machine" in the last sentence:

"I love machine learning, but I hate the math behind it."

**Formula:**

*TF (Term Frequency):* $\frac{\text{Count of 'machine' in sentence}}{\text{Total words in sentence}}$

*IDF (Inverse Document Frequency):* $\log(\frac{\text{Total number of documents}}{\text{Number of documents containing 'machine'}})$ (Use math.log).

**Result:** TF * IDF.

**Task:** Print your manual calculation result.

In [None]:
import math

sentence = "I love machine learning but I hate the math behind it"
word = "machine"

words = sentence.lower().split()
tf = words.count(word) / len(words)

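# Note: an alternate six-sentence corpus, not the corpus list from Part 1;
# "machine" appears in two of its six documents (the third and the last).
documents = [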
    "The quick brown fox jumps over the lazy dog",
    "I absolutely love pizza, but the service was terrible",
    "Machine learning involves transforming data into intelligence",
    "To be or not to be, that is the question",
    "Math and statistics are behind artificial intelligence",
    "I love machine learning but I hate the math behind it"
]

processed_docs = [doc.lower().split() for doc in documents]

doc_count_with_word = sum(word in doc for doc in processed_docs)

idf = math.log(len(documents) / doc_count_with_word)


tfidf = tf * idf

print("TF:", tf)
print("IDF:", idf)
print("TF-IDF:", tfidf)


TF: 0.09090909090909091
IDF: 1.0986122886681098
TF-IDF: 0.09987384442437362


**2.4: TF-IDF Using Tools**

**Task:** Use sklearn.feature_extraction.text.TfidfVectorizer.

**Steps:** Fit it on the corpus and print the vector for the first sentence.

**Observation:** Compare the score of unique words (like "Intelligence") vs common words (like "is"). Which is higher?

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

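# Note: this cell redefines corpus with alternate sentences, so the "first sentence"
# is now the fox sentence, not the Artificial Intelligence one from Part 1.
corpus = [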
    "The quick brown fox jumps over the lazy dog",
    "I absolutely love pizza, but the service was terrible",
    "Machine learning involves transforming data into intelligence",
    "To be or not to be, that is the question",
    "Math and statistics are behind artificial intelligence",
    "I love machine learning but I hate the math behind it"
]

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(corpus)

vocab = vectorizer.get_feature_names_out()
print("Vocabulary:\n", vocab, "\n")

first_vector = X[0].toarray()
print("TF-IDF Vector for FIRST sentence:\n", first_vector, "\n")

print("Observation:")
print("Rare words get a higher IDF weight; words found in many documents, like 'the', get a lower one.")
print("The score also grows with in-sentence count: 'the' occurs twice here, so it still scores highest.")


Vocabulary:
 ['absolutely' 'and' 'are' 'artificial' 'be' 'behind' 'brown' 'but' 'data'
 'dog' 'fox' 'hate' 'intelligence' 'into' 'involves' 'is' 'it' 'jumps'
 'lazy' 'learning' 'love' 'machine' 'math' 'not' 'or' 'over' 'pizza'
 'question' 'quick' 'service' 'statistics' 'terrible' 'that' 'the' 'to'
 'transforming' 'was'] 

TF-IDF Vector for FIRST sentence:
 [[0.         0.         0.         0.         0.         0.
  0.34487217 0.         0.         0.34487217 0.34487217 0.
  0.         0.         0.         0.         0.         0.34487217
  0.34487217 0.         0.         0.         0.         0.
  0.         0.34487217 0.         0.         0.34487217 0.
  0.         0.         0.         0.40919714 0.         0.
  0.        ]] 

Observation:
Rare words get a higher IDF weight; words found in many documents, like 'the', get a lower one.
The score also grows with in-sentence count: 'the' occurs twice here, so it still scores highest.
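
In the vector above, "the" scores 0.409, which is higher than the 0.345 given to words that occur only in this sentence. The sketch below reproduces both numbers with the standard library, assuming TfidfVectorizer's default settings: raw term counts, smoothed IDF $\ln\frac{1+n}{1+\text{df}} + 1$, and L2 normalization of each row. The helper names (tokenize, smoothed_idf, tfidf_vector) are illustrative and not part of scikit-learn.

```python
import math
import re
from collections import Counter

docs = [
    "The quick brown fox jumps over the lazy dog",
    "I absolutely love pizza, but the service was terrible",
    "Machine learning involves transforming data into intelligence",
    "To be or not to be, that is the question",
    "Math and statistics are behind artificial intelligence",
    "I love machine learning but I hate the math behind it",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    # Assumed default: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(tokens):
    counts = Counter(tokens)
    raw = {t: c * smoothed_idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))  # L2 norm of the row
    return {t: w / norm for t, w in raw.items()}

scores = tfidf_vector(tokenized[0])
print(round(scores["the"], 4))    # 0.4092
print(round(scores["brown"], 4))  # 0.3449
```

"the" has the lowest IDF in the sentence because it appears in four of the six documents, but its count of 2 in this sentence more than offsets that. The added 1 in the smoothed IDF also keeps common words from dropping to zero, unlike the plain $\log$ formula used in 2.3.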


# **Part 3: Word Embeddings**

**3.1: Word2Vec Using Tools**

**Task:** Train a model using gensim.models.Word2Vec.

**Steps:**

1. Pass your cleaned tokenized corpus (from Part 1.2) to Word2Vec.

2. Set min_count=1 (since our corpus is small, we want to keep all words).

3. Set vector_size=10 (small vector size for easy viewing).

**Experiment:** Print the vector for the word "learning".

In [12]:
!pip install gensim

from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

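# Note: despite its name, this version uses NLTK and replaces the
# from-scratch clean_text_scratch defined in Part 1.1.
def clean_text_scratch(text):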
    text = text.lower()
    tokens = word_tokenize(text)

    cleaned = []
    for t in tokens:
        if t.isalpha() and t not in stop_words:
            cleaned.append(lemmatizer.lemmatize(t))
    return cleaned

corpus = [
    "The quick brown fox jumps over the lazy dog",
    "I absolutely love pizza, but the service was terrible",
    "Machine learning involves transforming data into intelligence",
    "To be or not to be, that is the question",
    "Math and statistics are behind artificial intelligence",
    "I love machine learning but I hate the math behind it"
]

cleaned_corpus = [clean_text_scratch(sent) for sent in corpus]

print("Cleaned Corpus:")
for c in cleaned_corpus:
    print(c)

model = Word2Vec(
    sentences=cleaned_corpus,
    vector_size=10,
    window=5,
    min_count=1,
    workers=4
)

print("\nVector for 'learning':")
print(model.wv["learning"])


Cleaned Corpus:
['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
['absolutely', 'love', 'pizza', 'service', 'terrible']
['machine', 'learning', 'involves', 'transforming', 'data', 'intelligence']
['question']
['math', 'statistic', 'behind', 'artificial', 'intelligence']
['love', 'machine', 'learning', 'hate', 'math', 'behind']

Vector for 'learning':
[-0.07512337 -0.0093043   0.0954107  -0.07318932 -0.02333602 -0.01938214
  0.08080113 -0.05932492  0.000434   -0.04756258]
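
To give a feel for what the window parameter controls, the sketch below lists the (center, context) word pairs that a fixed window produces for one cleaned sentence. It illustrates the windowing idea only: it is not gensim's training loop, which may differ in details such as how the effective window size is chosen, and the function name context_pairs is hypothetical.

```python
def context_pairs(tokens, window):
    # For each position, pair the center word with every neighbour
    # that lies at most `window` positions to the left or right.
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Last cleaned sentence from the output above; window=2 keeps the list short.
sentence_tokens = ["love", "machine", "learning", "hate", "math", "behind"]
pairs = context_pairs(sentence_tokens, window=2)

learning_context = [ctx for center, ctx in pairs if center == "learning"]
print(learning_context)  # ['love', 'machine', 'hate', 'math']
```

With window=5, as in the cell above, every word in these short sentences falls inside every other word's window.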


**3.3: Pre-trained GloVe (Understanding Global Context)**

**Task:** Use gensim.downloader to load 'glove-wiki-gigaword-50'

**Analogy Task:** Compute the famous analogy: $\text{King} - \text{Man} + \text{Woman} = ?$

Use model.most_similar(positive=['woman', 'king'], negative=['man']).

**Question:** Does the model correctly guess "Queen"?

In [14]:
import gensim.downloader as api

glove_model = api.load('glove-wiki-gigaword-50')

result = glove_model.most_similar(positive=['woman', 'king'], negative=['man'])

print(result[0])


('queen', 0.8523604273796082)
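
The second value in the result is a cosine similarity. The sketch below shows the idea behind the analogy query on hand-made 2-D vectors: add the positive words, subtract the negative one, and rank the remaining words by cosine similarity to the result. The vectors and the helper names (cosine, analogy) are hypothetical, and this is a simplification: gensim's most_similar works on the real 50-dimensional GloVe vectors and differs in details such as normalizing the input vectors first.

```python
import math

# Hypothetical toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":   (1.0, 1.0),
    "queen":  (1.0, -1.0),
    "man":    (0.0, 1.0),
    "woman":  (0.0, -1.0),
    "prince": (0.8, 0.9),
    "apple":  (-1.0, 0.1),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(positive, negative, vectors):
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    candidates = [
        (w, cosine(target, v)) for w, v in vectors.items() if w not in exclude
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

ranked = analogy(positive=["woman", "king"], negative=["man"], vectors=toy_vectors)
print(ranked[0][0])  # queen
```

In the toy space the target vector points exactly in the direction of "queen"; with real embeddings the match is only approximate, which is why the GloVe similarity above is about 0.85 rather than 1.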


# **Part 5: Sentiment Analysis (The Application)**

**Concept:** Sentiment Analysis determines whether a piece of text is Positive, Negative, or Neutral. We will use VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK. VADER is specifically designed for social media text; it understands that capital letters ("LOVE"), punctuation ("!!!"), and emojis change the sentiment intensity.

**Task:**

1. Initialize the SentimentIntensityAnalyzer.

2. Pass the Pizza Review (corpus[1]) into the analyzer.

3. Pass the Math Complaint (corpus[5]) into the analyzer.

**Analysis:** Look at the compound score for both.

**Compound Score Range:** -1 (Most Negative) to +1 (Most Positive).

Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?

In [16]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download VADER lexicon (run once)
nltk.download('vader_lexicon')

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

sia = SentimentIntensityAnalyzer()

pizza_review = corpus[1]
pizza_score = sia.polarity_scores(pizza_review)

math_complaint = corpus[5]
math_score = sia.polarity_scores(math_complaint)

print("Pizza Review:", pizza_review)
print("Sentiment:", pizza_score, "\n")

print("Math Complaint:", math_complaint)
print("Sentiment:", math_score, "\n")

mixed_sentence = "The food was delicious but the service was terrible."
mixed_score = sia.polarity_scores(mixed_sentence)

print("Mixed Sentence:", mixed_sentence)
print("Sentiment:", mixed_score)


Pizza Review: The pizza was absolutely delicious, but the service was terrible ... I won't go back.
Sentiment: {'neg': 0.223, 'neu': 0.644, 'pos': 0.134, 'compound': -0.3926} 

Math Complaint: I love machine learning, but I hate the math behind it.
Sentiment: {'neg': 0.345, 'neu': 0.478, 'pos': 0.177, 'compound': -0.5346} 

Mixed Sentence: The food was delicious but the service was terrible.
Sentiment: {'neg': 0.307, 'neu': 0.519, 'pos': 0.174, 'compound': -0.4215}


VADER registers both polarities in each sentence (the pos and neg proportions are both nonzero), but it does not return a neutral result. All three compound scores are clearly negative: -0.3926 for the Pizza Review, -0.5346 for the Math Complaint, and -0.4215 for the test sentence "The food was delicious but the service was terrible." This is consistent with VADER's rule for "but", which dampens sentiment words before the conjunction and amplifies those after it, so the negative clause that follows "but" dominates. The answer to the question above is therefore no: "delicious" and "terrible" together do not produce a mixed or neutral compound score; the mixed nature of the text shows up only in the separate pos, neg, and neu proportions.
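
A common convention for turning the compound score into a label is to call a text positive at 0.05 or above, negative at -0.05 or below, and neutral in between. The sketch below applies that convention to the three compound scores printed above. The function name label_from_compound and the dictionary keys are illustrative, and the threshold is a tunable assumption rather than something NLTK enforces.

```python
def label_from_compound(compound, threshold=0.05):
    # Conventional cut-offs; adjust the threshold for your own data.
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the output above.
compound_scores = {
    "pizza_review": -0.3926,
    "math_complaint": -0.5346,
    "mixed_sentence": -0.4215,
}

labels = {name: label_from_compound(score) for name, score in compound_scores.items()}
print(labels)
# {'pizza_review': 'negative', 'math_complaint': 'negative', 'mixed_sentence': 'negative'}
```

Under this convention all three sentences are labeled negative; none falls in the neutral band.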