# **Part 1: Text Preprocessing**



**Data:** given in the code cell below.

**1.1: Preprocessing From Scratch**

**Goal:** Write a function clean_text_scratch(text) that performs the following steps without using NLTK or spaCy:

1. Lowercasing: Convert text to lowercase.

2. Punctuation Removal: Use Python's re (regex) library or string methods to remove special characters (!, ., ,, :, ;, ..., ').

3. Tokenization: Split the string into a list of words based on whitespace.

4. Stopword Removal: Filter out words found in this list: ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'].

5. Simple Stemming: Create a helper function that removes suffixes 'ing', 'ly', 'ed', and 's' from the end of words.


**Note:** This is a naive stemmer. It strips suffixes blindly, so it mangles words such as "sing" (which becomes "s"). This illustrates why we need libraries!

**Task:** Run this function on the first sentence of the corpus and print the result.
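
The five steps above can be sketched for a single sentence as follows. This is a minimal illustration rather than a reference solution: the helper name `naive_stem` and the constants `STOPWORDS` and `SUFFIXES` are assumptions introduced here, and only the first matching suffix is stripped.

```python
import re

# Stopword and suffix lists taken from the task description above.
STOPWORDS = {'the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'}
SUFFIXES = ('ing', 'ly', 'ed', 's')

def naive_stem(word):
    # Strip the first matching suffix; deliberately naive, so "sing" becomes "s".
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def clean_text_scratch(text):
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r"[^\w\s]", "", text)                  # 2. remove punctuation
    tokens = text.split()                                # 3. whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]   # 4. stopword removal
    return [naive_stem(t) for t in tokens]               # 5. naive stemming

sentence = "Artificial Intelligence is transforming the world; however, ethical concerns remain!"
result = clean_text_scratch(sentence)
print(result)
print(naive_stem("sing"))
```

Because this version takes one string, it can be applied to any single sentence of the corpus, or mapped over the whole list to keep the tokens grouped per sentence.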

In [None]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [12]:
import re

def clean_text_scratch(texts):
    # Note: takes a list of sentences and returns one flat list of tokens.

    # Step 1: lowercase every sentence.
    lower_texts = [text.lower() for text in texts]

    # Step 2: remove punctuation (anything that is not a word character or whitespace).
    low_punc_texts = [re.sub(r"[^\w\s]", "", text) for text in lower_texts]

    # Step 3: tokenize on whitespace and flatten into a single list.
    words = [token for text in low_punc_texts for token in text.split()]

    # Step 4: drop the stopwords listed in the task.
    words_to_remove = ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or']
    filtered_words = [w for w in words if w not in words_to_remove]

    return filtered_words

def stemming(words, suffixes):
    # Step 5: naive stemming. Only the first matching suffix is removed from each word.
    stemmed_words = []
    for word in words:
        stemmed = False
        for suff in suffixes:
            if word.endswith(suff):
                stemmed_words.append(word[:-len(suff)])
                stemmed = True
                break
        if not stemmed:
            stemmed_words.append(word)
    return stemmed_words

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

suffixes = ['ing', 'ly', 'ed', 's']

fit_words = clean_text_scratch(corpus)
print(stemming(fit_words, suffixes))


['artificial', 'intelligence', 'transform', 'world', 'however', 'ethical', 'concern', 'remain', 'pizza', 'absolute', 'deliciou', 'service', 'terrible', 'i', 'wont', 'go', 'back', 'quick', 'brown', 'fox', 'jump', 'over', 'lazy', 'dog', 'be', 'not', 'be', 'that', 'question', 'whether', 'ti', 'nobler', 'mind', 'data', 'science', 'involve', 'statistic', 'linear', 'algebra', 'machine', 'learn', 'i', 'love', 'machine', 'learn', 'i', 'hate', 'math', 'behind']


**1.2: Preprocessing Using Tools**

**Goal:** Use the nltk library to perform the same cleaning on the entire corpus.

**Steps:**

1. Use nltk.tokenize.word_tokenize.
2. Use nltk.corpus.stopwords.
3. Use nltk.stem.WordNetLemmatizer to convert words to their root (e.g., "jumps" $\to$ "jump", "transforming" $\to$ "transform").


**Task:** Print the cleaned, lemmatized tokens for the second sentence (the pizza review).

In [8]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('universal_tagset')


from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

# Tokenize + lowercase
words = []
for sent in corpus:
    tokens = word_tokenize(sent)
    words.extend([token.lower() for token in tokens])

# Remove stopwords & punctuation
stop_words = set(stopwords.words("english"))
fit_words = [w for w in words if w.isalpha() and w not in stop_words]


def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return 'a'
    elif treebank_tag.startswith('V'):
        return 'v'
    elif treebank_tag.startswith('R'):
        return 'r'
    else:
        return 'n'

lemmatizer = WordNetLemmatizer()

# POS tagging
tagged_words = pos_tag(fit_words)

lemmatized_words = []

print("Original filtered words:", fit_words)

print("\nLemmatized words with POS:")
for word, tag in tagged_words:
    wn_pos = get_wordnet_pos(tag)
    lemma = lemmatizer.lemmatize(word, wn_pos)
    lemmatized_words.append(lemma)
    print(f"{word} ({wn_pos}) -> {lemma}")

print("\nFinal lemmatized corpus:", lemmatized_words)



Original filtered words: ['artificial', 'intelligence', 'transforming', 'world', 'however', 'ethical', 'concerns', 'remain', 'pizza', 'absolutely', 'delicious', 'service', 'terrible', 'wo', 'go', 'back', 'quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', 'question', 'whether', 'nobler', 'mind', 'data', 'science', 'involves', 'statistics', 'linear', 'algebra', 'machine', 'learning', 'love', 'machine', 'learning', 'hate', 'math', 'behind']

Lemmatized words with POS:
artificial (a) -> artificial
intelligence (n) -> intelligence
transforming (v) -> transform
world (n) -> world
however (r) -> however
ethical (a) -> ethical
concerns (n) -> concern
remain (v) -> remain
pizza (a) -> pizza
absolutely (r) -> absolutely
delicious (a) -> delicious
service (n) -> service
terrible (a) -> terrible
wo (n) -> wo
go (v) -> go
back (r) -> back
quick (a) -> quick
brown (n) -> brown
fox (n) -> fox
jumps (n) -> jump
lazy (a) -> lazy
dog (a) -> dog
question (n) -> question
whether (n) -> whether
nobler (n) ->

# **Part 2: Text Representation**

**2.1: Bag of Words (BoW)**

**Logic:**

**Build Vocabulary:** Create a list of all unique words in the entire corpus (after cleaning). Sort them alphabetically.

**Vectorize:** Write a function that takes a sentence and returns a list of numbers. Each number represents the count of a vocabulary word in that sentence.

**Task:** Print the unique Vocabulary list. Then, print the BoW vector for: "The quick brown fox jumps over the lazy dog."
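
The vocabulary and vectorize steps described above can be sketched without any external library. This is a minimal illustration under stated assumptions: the `tokenize` helper is hypothetical and simply keeps lowercase alphabetic runs, no stopwords are removed, and only two sentences of the corpus are used to keep the output short.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase alphabetic runs only; no stopword removal.
    return re.findall(r"[a-z]+", text.lower())

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "I love machine learning, but I hate the math behind it.",
]

# Build the vocabulary: unique tokens across all documents, sorted alphabetically.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

def bow_vector(sentence, vocab):
    # Count each vocabulary word in the sentence (0 if it is absent).
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocab]

vector = bow_vector(docs[0], vocabulary)
print(dict(zip(vocabulary, vector)))
```

Using a `Counter` means each sentence is scanned once, rather than calling `list.count` once per vocabulary word.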

In [11]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# NLTK downloads
nltk.download('punkt')
nltk.download('stopwords')

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

stop_words = set(stopwords.words("english"))

# Building Vocabulary
def preprocess(text):
    tokens = word_tokenize(text.lower())
    return [w for w in tokens if w.isalpha() and w not in stop_words]

all_words = []
for sent in corpus:
    all_words.extend(preprocess(sent))

# Unique sorted vocabulary
vocabulary = sorted(set(all_words))

print("Vocabulary:", vocabulary)

#  BoW vectorizer
def bow_vector(sentence, vocab):
    tokens = preprocess(sentence)
    return [tokens.count(word) for word in vocab]

test_sentence = "The quick brown fox jumps over the lazy dog."

bow = bow_vector(test_sentence, vocabulary)

print(bow)


Vocabulary: ['absolutely', 'algebra', 'artificial', 'back', 'behind', 'brown', 'concerns', 'data', 'delicious', 'dog', 'ethical', 'fox', 'go', 'hate', 'however', 'intelligence', 'involves', 'jumps', 'lazy', 'learning', 'linear', 'love', 'machine', 'math', 'mind', 'nobler', 'pizza', 'question', 'quick', 'remain', 'science', 'service', 'statistics', 'terrible', 'transforming', 'whether', 'wo', 'world']
[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]



**2.2: BoW Using Tools**

**Task:** Use sklearn.feature_extraction.text.CountVectorizer.

**Steps:**

1. Instantiate the vectorizer.

2. fit_transform the raw corpus.

3. Convert the result to an array (.toarray()) and print it.
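
One detail worth knowing when comparing this output with the from-scratch version: by default, CountVectorizer lowercases the text and extracts tokens with the pattern `(?u)\b\w\w+\b`, which keeps stopwords but drops single-character tokens. The sketch below applies that pattern with plain `re` to show the effect. It illustrates the default tokenization only and is not a substitute for the vectorizer; the helper name `default_tokens` is an assumption introduced here.

```python
import re

# Default token pattern documented for scikit-learn's CountVectorizer:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokens(text):
    # CountVectorizer lowercases by default before tokenizing.
    return TOKEN_PATTERN.findall(text.lower())

tokens = default_tokens("I won't go back.")
print(tokens)  # ['won', 'go', 'back']
```

This is why the resulting vocabulary can contain "won" and stopwords such as "the", while "i" is absent.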

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

vectorizer = CountVectorizer()
# Fit + transform corpus into BoW matrix
bow_matrix = vectorizer.fit_transform(corpus)

bow_array = bow_matrix.toarray()

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("\nBag of Words Matrix:")
print(bow_array)


Vocabulary:
['absolutely' 'algebra' 'and' 'artificial' 'back' 'be' 'behind' 'brown'
 'but' 'concerns' 'data' 'delicious' 'dog' 'ethical' 'fox' 'go' 'hate'
 'however' 'in' 'intelligence' 'involves' 'is' 'it' 'jumps' 'lazy'
 'learning' 'linear' 'love' 'machine' 'math' 'mind' 'nobler' 'not' 'or'
 'over' 'pizza' 'question' 'quick' 'remain' 'science' 'service'
 'statistics' 'terrible' 'that' 'the' 'tis' 'to' 'transforming' 'was'
 'whether' 'won' 'world']

Bag of Words Matrix:
[[0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1]
 [1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0 0 0 0 1 0 1 0 2 0 0 0 2 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
  0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0]
 [0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0
  1 0 0 0 0 0 0 1 2 1 2 0 0 1 0 0]
 [0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0
  0 0 0 1 0 1 

**2.3: TF-IDF From Scratch (The Math)**

**Goal:** Manually calculate the score for the word "machine" in the last sentence:

"I love machine learning, but I hate the math behind it."

**Formula:**

*TF (Term Frequency):* $\frac{\text{Count of 'machine' in sentence}}{\text{Total words in sentence}}$

*IDF (Inverse Document Frequency):* $\log(\frac{\text{Total number of documents}}{\text{Number of documents containing 'machine'}})$ (Use math.log).

**Result:** TF * IDF.

**Task:** Print your manual calculation result.
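
As a worked check, assuming the sentence is split on whitespace (11 tokens, with "machine" appearing once) and that "machine" occurs in 2 of the 6 documents:

$$
\text{TF} = \frac{1}{11} \approx 0.0909, \qquad
\text{IDF} = \ln\frac{6}{2} = \ln 3 \approx 1.0986, \qquad
\text{TF} \times \text{IDF} \approx 0.0909 \times 1.0986 \approx 0.0999
$$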

In [15]:
import math

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

target = "machine"

sentence = corpus[-1]

# TF Calculation
tokens = [w.lower() for w in sentence.split()]
tf = tokens.count(target) / len(tokens)
# IDF Calculation
total_docs = len(corpus)
docs_with_word = 0

for doc in corpus:
    if target in doc.lower().split():
        docs_with_word += 1

idf = math.log(total_docs / docs_with_word)

#  TF-IDF
tfidf = tf * idf

print("TF:", tf)
print("IDF:", idf)
print("TF-IDF:", tfidf)


TF: 0.09090909090909091
IDF: 1.0986122886681098
TF-IDF: 0.09987384442437362


**2.4: TF-IDF Using Tools**

**Task:** Use sklearn.feature_extraction.text.TfidfVectorizer.

**Steps:** Fit it on the corpus and print the vector for the first sentence.

**Observation:** Compare the score of unique words (like "Intelligence") vs common words (like "is"). Which is higher?
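
The scikit-learn scores are not directly comparable to the manual formula in 2.3. With its default settings, TfidfVectorizer uses raw counts for TF, a smoothed IDF of $\ln\frac{1+n}{1+\text{df}} + 1$, and then L2-normalizes each document vector. The sketch below reimplements that default in plain Python as an illustration; the helper names `tokenize` and `tfidf_weights` are assumptions introduced here, and the regex approximates the vectorizer's default tokenization.

```python
import math
import re
from collections import Counter

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it.",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(doc) for doc in corpus]
n_docs = len(docs)
# Document frequency: number of documents each term appears in.
df = Counter(tok for doc in docs for tok in set(doc))

def tfidf_weights(doc_tokens):
    counts = Counter(doc_tokens)
    # Raw count times smoothed IDF.
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    # L2-normalize so the document vector has unit length.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

weights = tfidf_weights(docs[0])
for term in ("intelligence", "is", "the"):
    print(term, round(weights[term], 4))
```

Terms that appear in only one document ("intelligence") receive the largest IDF, while "the", which appears in five of the six documents, receives the smallest.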

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

vectorizer = TfidfVectorizer()

#  Fit + transform
tfidf_matrix = vectorizer.fit_transform(corpus)

first_vector = tfidf_matrix[0].toarray()[0]

print("Vocabulary list:")
print(vectorizer.get_feature_names_out())

print("\nTF-IDF vector for FIRST sentence:")
print(first_vector)

print("\nObservation:")
print('Words unique to this sentence, like "intelligence", have HIGHER TF-IDF;')
print('words shared across documents, like "is" and "the", have LOWER TF-IDF.')


Vocabulary list:
['absolutely' 'algebra' 'and' 'artificial' 'back' 'be' 'behind' 'brown'
 'but' 'concerns' 'data' 'delicious' 'dog' 'ethical' 'fox' 'go' 'hate'
 'however' 'in' 'intelligence' 'involves' 'is' 'it' 'jumps' 'lazy'
 'learning' 'linear' 'love' 'machine' 'math' 'mind' 'nobler' 'not' 'or'
 'over' 'pizza' 'question' 'quick' 'remain' 'science' 'service'
 'statistics' 'terrible' 'that' 'the' 'tis' 'to' 'transforming' 'was'
 'whether' 'won' 'world']

TF-IDF vector for FIRST sentence:
[0.         0.         0.         0.33454543 0.         0.
 0.         0.         0.         0.33454543 0.         0.
 0.         0.33454543 0.         0.         0.         0.33454543
 0.         0.33454543 0.         0.27433204 0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.33454543 0.         0.         0.
 0.         0.         0.17139656 0.         0.         0.33454543
 0.         0.    

# **Part 3: Word Embeddings**

**3.1: Word2Vec Using Tools**

**Task:** Train a model using gensim.models.Word2Vec.

**Steps:**

1. Pass your cleaned tokenized corpus (from Part 1.2) to Word2Vec.

2. Set min_count=1 (since our corpus is small, we want to keep all words).

3. Set vector_size=10 (small vector size for easy viewing).

**Experiment:** Print the vector for the word "learning".

In [19]:
!pip install gensim

from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download resources
nltk.download('punkt')
nltk.download('stopwords')

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

stop_words = set(stopwords.words("english"))

def clean_and_tokenize(text):
    tokens = word_tokenize(text.lower())
    tokens = [w for w in tokens if w.isalpha() and w not in stop_words]
    return tokens

cleaned_corpus = [clean_and_tokenize(sentence) for sentence in corpus]

print("Cleaned Tokenized Corpus:")
print(cleaned_corpus)

#  TRAIN WORD2VEC MODEL
model = Word2Vec(
    sentences=cleaned_corpus,
    vector_size=10,
    min_count=1,
    workers=4
)

print("\nVector for word 'learning':")
print(model.wv["learning"])


Cleaned Tokenized Corpus:
[['artificial', 'intelligence', 'transforming', 'world', 'however', 'ethical', 'concerns', 'remain'], ['pizza', 'absolutely', 'delicious', 'service', 'terrible', 'wo', 'go', 'back'], ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog'], ['question', 'whether', 'nobler', 'mind'], ['data', 'science', 'involves', 'statistics', 'linear', 'algebra', 'machine', 'learning'], ['love', 'machine', 'learning', 'hate', 'math', 'behind']]

Vector for word 'learning':
[-0.00535678  0.00238785  0.05107836  0.09016657 -0.09301379 -0.07113771
  0.06464887  0.08973394 -0.05023384 -0.03767424]



**3.3: Pre-trained GloVe (Understanding Global Context)**

**Task:** Use gensim.downloader to load 'glove-wiki-gigaword-50'

**Analogy Task:** Compute the famous analogy: $\text{King} - \text{Man} + \text{Woman} = ?$

Use model.most_similar(positive=['woman', 'king'], negative=['man']).

**Question:** Does the model correctly guess "Queen"?
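
The arithmetic behind the analogy can be illustrated with toy vectors. Everything in this sketch is hypothetical: the two-dimensional vectors are invented so that one axis loosely encodes "royalty" and the other "gender", and the `analogy` helper is a simplified stand-in. It is not gensim's implementation, which works on unit-normalized vectors, but it follows the same idea of combining vectors and ranking the remaining words by cosine similarity.

```python
import math

# Toy 2-D vectors invented for illustration: (royalty, gender).
toy_vectors = {
    "king":   (0.9,  0.8),
    "queen":  (0.9, -0.8),
    "man":    (0.1,  0.8),
    "woman":  (0.1, -0.8),
    "prince": (0.7,  0.7),
    "apple":  (-0.5, 0.0),
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the lengths.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(positive, negative, vectors):
    # Add the positive vectors and subtract the negative ones.
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Rank every word that was not part of the query by cosine similarity.
    query = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in query]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

best = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(best)
```

Excluding the query words from the candidates matters: without it, the nearest neighbour of the combined vector is often one of the input words themselves.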

In [21]:
import gensim.downloader as api

glove_model = api.load('glove-wiki-gigaword-50')

result = glove_model.most_similar(
    positive=['king', 'woman'],
    negative=['man']
)

print(result)


[('queen', 0.8523604273796082), ('throne', 0.7664334177970886), ('prince', 0.7592144012451172), ('daughter', 0.7473883628845215), ('elizabeth', 0.7460219860076904), ('princess', 0.7424570322036743), ('kingdom', 0.7337412238121033), ('monarch', 0.721449077129364), ('eldest', 0.7184861898422241), ('widow', 0.7099431157112122)]


# **Part 5: Sentiment Analysis (The Application)**

**Concept:** Sentiment Analysis determines whether a piece of text is Positive, Negative, or Neutral. We will use VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK. VADER is specifically designed for social media text; it understands that capital letters ("LOVE"), punctuation ("!!!"), and emojis change the sentiment intensity.

**Task:**

1. Initialize the SentimentIntensityAnalyzer.

2. Pass the Pizza Review (corpus[1]) into the analyzer.

3. Pass the Math Complaint (corpus[5]) into the analyzer.

**Analysis:** Look at the compound score for both.

**Compound Score Range:** -1 (Most Negative) to +1 (Most Positive).

Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?
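
To turn a compound score into a label, a common convention (suggested by VADER's authors) is to treat scores of 0.05 or above as positive, -0.05 or below as negative, and anything in between as neutral. A minimal sketch of that rule follows; the function name `label_from_compound` and the sample scores are assumptions chosen only for illustration.

```python
def label_from_compound(compound, threshold=0.05):
    # Conventional VADER cut-offs: >= 0.05 positive, <= -0.05 negative, else neutral.
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical compound scores, chosen only to exercise each branch.
for score in (0.62, 0.0, -0.39):
    print(score, "->", label_from_compound(score))
```

Under this rule, a sentence counts as "mixed or neutral" only if its compound score falls inside the narrow band around zero.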

In [24]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

sia = SentimentIntensityAnalyzer()

pizza_review = corpus[1]
math_complaint = corpus[5]

pizza_score = sia.polarity_scores(pizza_review)
math_score = sia.polarity_scores(math_complaint)

print("Pizza Review:", pizza_review)
print("Sentiment:", pizza_score)

print("\nMath Complaint:", math_complaint)
print("Sentiment:", math_score)

print("\nInterpretation:")
print("Compound Score Range = -1 (negative) to +1 (positive).")
print("Notice the pizza review has both 'delicious' (positive) and 'terrible' (negative).")
print("Its compound score is still negative rather than neutral: VADER weights the clause after 'but' more heavily.")


Pizza Review: The pizza was absolutely delicious, but the service was terrible ... I won't go back.
Sentiment: {'neg': 0.223, 'neu': 0.644, 'pos': 0.134, 'compound': -0.3926}

Math Complaint: I love machine learning, but I hate the math behind it.
Sentiment: {'neg': 0.345, 'neu': 0.478, 'pos': 0.177, 'compound': -0.5346}

Interpretation:
Compound Score Range = -1 (negative) to +1 (positive).
Notice the pizza review has both 'delicious' (positive) and 'terrible' (negative).
Its compound score is still negative rather than neutral: VADER weights the clause after 'but' more heavily.


