# **Part 1: Text Preprocessing**



**Data:** The corpus is defined in the first code cell below.

**1.1: Preprocessing From Scratch**

**Goal:** Write a function clean_text_scratch(text) that performs the following without using NLTK or Spacy:

1. Lowercasing: Convert text to lowercase.

2. Punctuation Removal: Use Python's re (regex) library or string methods to remove special characters (!, ., ,, :, ;, ..., ').

3. Tokenization: Split the string into a list of words based on whitespace.

4. Stopword Removal: Filter out words found in this list: ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'].

5. Simple Stemming: Create a helper function that removes suffixes 'ing', 'ly', 'ed', and 's' from the end of words.


Note: This is a "Naive" stemmer. It will break words like "sing" -> "s". This illustrates why we need libraries!

**Task:** Run this function on the first sentence of the corpus and print the result.

In [1]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [2]:
import re
#write rest of the code here

text = "Artificial Intelligence is transforming the world; however, ethical concerns remain!"

def simple_stemmer(word):
    if word.endswith('ing'): return word[:-3]
    elif word.endswith('ly'): return word[:-2]
    elif word.endswith('ed'): return word[:-2]
    elif word.endswith('s'):  return word[:-1]
    return word


def clean_text_scratch(text):

    text = text.lower()
    text = re.sub(r'[!.,:;\']', '', text)
    tokens = text.split()
    stop_words = ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or']
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return [simple_stemmer(word) for word in filtered_tokens]

result = clean_text_scratch(text)
print(result)


['artificial', 'intelligence', 'transform', 'world', 'however', 'ethical', 'concern', 'remain']
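The warning about the naive stemmer is easy to check. The sketch below re-declares the same suffix rules and, purely for comparison, adds a hypothetical `guarded_stemmer` with a `min_stem` parameter (neither is part of the assignment) that refuses to strip a suffix when too little of the word would remain. It is an illustration of the failure mode, not a replacement for a real stemmer.

```python
SUFFIXES = ('ing', 'ly', 'ed', 's')

def simple_stemmer(word):
    """Naive stemmer from 1.1: strip the first matching suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def guarded_stemmer(word, min_stem=3):
    """Hypothetical variant: strip only if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

# Compare both stemmers on words that do and do not survive naive stripping.
for w in ['sing', 'is', 'transforming', 'concerns']:
    print(w, '->', simple_stemmer(w), '|', guarded_stemmer(w))
```

Short words such as "sing" and "is" are mangled by the naive version and left alone by the guarded one, while longer words are stemmed identically by both.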


**1.2: Preprocessing Using Tools**

**Goal:** Use the nltk library to perform the same cleaning on the entire corpus.

**Steps:**

1. Use nltk.tokenize.word_tokenize.
2. Use nltk.corpus.stopwords.
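3. Use nltk.stem.WordNetLemmatizer to convert words to their root (e.g., "jumps" $\to$ "jump"). Note that lemmatize treats every word as a noun unless a part of speech is given, so verb forms such as "transforming" become "transform" only when pos='v' is passed.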


**Task:** Print the cleaned, lemmatized tokens for the second sentence (The pizza review).

In [3]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#write rest of the code here
text = "The pizza was absolutely delicious, but the service was terrible ... I won't go back."

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
text = text.lower()
tokens = word_tokenize(text)

cleaned_tokens = []

for word in tokens:
    if word not in stop_words and word.isalnum():
        lemma = lemmatizer.lemmatize(word)
        cleaned_tokens.append(lemma)

print(cleaned_tokens)


['pizza', 'absolutely', 'delicious', 'service', 'terrible', 'wo', 'go', 'back']


# **Part 2: Text Representation**

**2.1: Bag of Words (BoW)**

**Logic:**

**Build Vocabulary:** Create a list of all unique words in the entire corpus (after cleaning). Sort them alphabetically.

**Vectorize:** Write a function that takes a sentence and returns a list of numbers. Each number represents the count of a vocabulary word in that sentence.

**Task:** Print the unique Vocabulary list. Then, print the BoW vector for: "The quick brown fox jumps over the lazy dog."
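The two steps above (build a sorted vocabulary, then count each vocabulary word per sentence) can be sketched in plain Python before bringing in NLTK. The two toy documents and the simplified `tokenize` helper are assumptions made for illustration; they are not the assignment corpus or its cleaning pipeline.

```python
import re
from collections import Counter

# Hypothetical toy documents, used only to keep the output small.
docs = [
    "the quick brown fox",
    "the lazy dog and the quick cat",
]

def tokenize(text):
    # Simplified cleaning (an assumption): lowercase, keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

# Step 1: vocabulary = all unique tokens across the corpus, sorted alphabetically.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# Step 2: one count per vocabulary word, in vocabulary order.
def bow_vector(sentence, vocab):
    counts = Counter(tokenize(sentence))
    return [counts[term] for term in vocab]

print(vocab)
print(bow_vector(docs[1], vocab))
```

Using a `Counter` means each sentence is scanned once, rather than once per vocabulary word as with repeated `list.count` calls.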

In [4]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# NLTK downloads (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# rest of the code here

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

target_sentence = "The quick brown fox jumps over the lazy dog."

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

#vocab

vocab_set = set()

for line in corpus:
    tokens = word_tokenize(line.lower())

    for i in tokens:
        if i.isalnum() and i not in stop_words:
            root_word = lemmatizer.lemmatize(i)
            vocab_set.add(root_word)

vocab = sorted(list(vocab_set))
print(vocab)

# target sentence
def vector_func(target_sentence, vocab):
    cleaned_sentence = []
    target_tokens = word_tokenize(target_sentence.lower())

    for i in target_tokens:
        if i.isalnum() and i not in stop_words:
            root_word = lemmatizer.lemmatize(i)
            cleaned_sentence.append(root_word)

    vector = []
    for i in vocab:
        count = cleaned_sentence.count(i)
        vector.append(count)

    return vector

vector = vector_func(target_sentence, vocab)

print(vector)


['absolutely', 'algebra', 'artificial', 'back', 'behind', 'brown', 'concern', 'data', 'delicious', 'dog', 'ethical', 'fox', 'go', 'hate', 'however', 'intelligence', 'involves', 'jump', 'lazy', 'learning', 'linear', 'love', 'machine', 'math', 'mind', 'nobler', 'pizza', 'question', 'quick', 'remain', 'science', 'service', 'statistic', 'terrible', 'transforming', 'whether', 'wo', 'world']
[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]



**2.2: BoW Using Tools**

**Task:** Use sklearn.feature_extraction.text.CountVectorizer.

**Steps:**

1. Instantiate the vectorizer.

2. fit_transform the raw corpus.

3. Convert the result to an array (.toarray()) and print it.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
#rest of the code here
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

vectorizer = CountVectorizer()
vector_matrix = vectorizer.fit_transform(corpus)
vector_array = vector_matrix.toarray()
print(vector_array)

[[0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1]
 [1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0 0 0 0 1 0 1 0 2 0 0 0 2 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
  0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0]
 [0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0
  1 0 0 0 0 0 0 1 2 1 2 0 0 1 0 0]
 [0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0
  0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]]
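The matrix above has 52 columns, while the hand-built vocabulary in 2.1 had 38 entries. A plausible explanation is that CountVectorizer, with default settings, keeps stop words, does no lemmatization, and drops single-character tokens. The sketch below mimics that behaviour in plain Python, assuming the documented defaults `lowercase=True` and `token_pattern=r"(?u)\b\w\w+\b"`; it is an approximation for building intuition, not the library's implementation.

```python
import re

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it.",
]

# Assumed default token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

# Sorted vocabulary and a term -> column index lookup.
vocab = sorted({tok for doc in corpus for tok in tokenize(doc)})
index = {term: i for i, term in enumerate(vocab)}

# Document-term count matrix as a list of rows.
matrix = []
for doc in corpus:
    row = [0] * len(vocab)
    for tok in tokenize(doc):
        row[index[tok]] += 1
    matrix.append(row)

print(len(vocab))  # number of columns
print([vocab[i] for i, count in enumerate(matrix[0]) if count])  # terms in sentence 1
```

With the real vectorizer, `vectorizer.get_feature_names_out()` returns the column labels, which makes the printed array much easier to read.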


**2.3: TF-IDF From Scratch (The Math)**

**Goal:** Manually calculate the score for the word "machine" in the last sentence:

"I love machine learning, but I hate the math behind it."

**Formula:**

*TF (Term Frequency):* $\frac{\text{Count of 'machine' in sentence}}{\text{Total words in sentence}}$

*IDF (Inverse Document Frequency):* $\log(\frac{\text{Total number of documents}}{\text{Number of documents containing 'machine'}})$ (Use math.log).

**Result:** TF * IDF.

**Task:** Print your manual calculation result.

In [6]:
# write code here
import math
import re

# 1. Data
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

target_sentence = corpus[5]
target_word = "machine"

def get_words(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text.split()

# TF
words = get_words(target_sentence)
tf = words.count(target_word) / len(words)

# IDF
docs_with_word = 0
for doc in corpus:
    if target_word in get_words(doc):
        docs_with_word += 1

idf = math.log(len(corpus) / docs_with_word)

print(tf * idf)

0.09987384442437362


**2.4: TF-IDF Using Tools**

**Task:** Use sklearn.feature_extraction.text.TfidfVectorizer.

**Steps:** Fit it on the corpus and print the vector for the first sentence.

**Observation:** Compare the score of unique words (like "Intelligence") vs common words (like "is"). Which is higher?

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
# rest of the code here

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

feature_names = tfidf_vectorizer.get_feature_names_out()
vector_values = tfidf_matrix[0].toarray()[0]

results = []
for i in range(len(feature_names)):
    if vector_values[i] > 0:
        results.append((feature_names[i], vector_values[i]))

for word, score in results:
    print(f"{word}: {score:.5f}")



artificial: 0.33455
concerns: 0.33455
ethical: 0.33455
however: 0.33455
intelligence: 0.33455
is: 0.27433
remain: 0.33455
the: 0.17140
transforming: 0.33455
world: 0.33455
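On the observation: words that appear in only one document (such as "intelligence") score 0.33455, higher than "is" (0.27433) and "the" (0.17140), which appear in more documents. These numbers differ from the textbook formula in 2.3 because TfidfVectorizer, by default, uses a smoothed IDF, $\ln\frac{1 + N}{1 + \text{df}} + 1$, a raw count for TF, and L2-normalizes each row. The sketch below reproduces the first row in plain Python under those assumed defaults (`smooth_idf=True`, `norm='l2'`, and the default token pattern); it is a hand-rolled approximation, not the library code.

```python
import math
import re

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it.",
]

def tokenize(doc):
    # Assumed default token pattern: runs of two or more word characters.
    return re.findall(r"(?u)\b\w\w+\b", doc.lower())

docs = [tokenize(d) for d in corpus]
n_docs = len(docs)

def smoothed_idf(term):
    # Document frequency: number of documents containing the term.
    df = sum(1 for d in docs if term in d)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    # Raw count times smoothed IDF, then L2 normalization of the row.
    raw = {t: tokens.count(t) * smoothed_idf(t) for t in set(tokens)}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

row = tfidf_row(docs[0])
for term in sorted(row):
    print(f"{term}: {row[term]:.5f}")
```

The ordering of the scores follows directly from document frequency: the more documents a word appears in, the smaller its IDF and therefore its weight.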


# **Part 3: Word Embeddings**

**3.1: Word2Vec Using Tools**

**Task:** Train a model using gensim.models.Word2Vec.

**Steps:**

1. Pass your cleaned tokenized corpus (from Part 1.2) to Word2Vec.

2. Set min_count=1 (since our corpus is small, we want to keep all words).

3. Set vector_size=10 (small vector size for easy viewing).

**Experiment:** Print the vector for the word "learning".

In [8]:
!pip install gensim
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#rest of the code here

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
tokenized_corpus = []

for text in corpus:
    text = text.lower()
    tokens = word_tokenize(text)

    cleaned_tokens = []
    for word in tokens:
        if word not in stop_words and word.isalnum():
            lemma = lemmatizer.lemmatize(word)
            cleaned_tokens.append(lemma)

    tokenized_corpus.append(cleaned_tokens)

model = Word2Vec(sentences=tokenized_corpus, min_count=1, vector_size=10)
vector = model.wv['learning']

print(vector)

[-0.00535678  0.00238785  0.05107836  0.09016657 -0.09301379 -0.07113771
  0.06464887  0.08973394 -0.05023384 -0.03767424]



**3.3: Pre-trained GloVe (Understanding Global Context)**

**Task:** Use gensim.downloader to load 'glove-wiki-gigaword-50'

**Analogy Task:** Compute the famous analogy: $\text{King} - \text{Man} + \text{Woman} = ?$

Use model.most_similar(positive=['woman', 'king'], negative=['man']).

**Question:** Does the model correctly guess "Queen"?

In [9]:
import gensim.downloader as api

# Load pre-trained GloVe model
glove_model = api.load('glove-wiki-gigaword-50')

#rest of the code here

result = glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print("\nAnalogy: King - Man + Woman = ?")
print(f"Model's Guess: {result[0][0]}")
print(f"Similarity Score: {result[0][1]:.4f}")


Analogy: King - Man + Woman = ?
Model's Guess: queen
Similarity Score: 0.8524
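To see what the analogy query does conceptually, the sketch below performs the same vector arithmetic on a handful of made-up 3-dimensional vectors and picks the nearest remaining word by cosine similarity. The `toy` vectors and the `analogy` helper are hypothetical; real GloVe vectors have 50 dimensions here, and gensim's implementation also normalizes the input vectors, which this sketch skips.

```python
import math

# Hypothetical 3-d vectors, roughly (royalty, gender, other). Not real GloVe values.
toy = {
    "king":  [0.9,  0.9, 0.1],
    "queen": [0.9, -0.9, 0.1],
    "man":   [0.1,  0.9, 0.2],
    "woman": [0.1, -0.9, 0.2],
    "apple": [-0.5, 0.0, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative, vectors):
    # Add the positive vectors, subtract the negative ones.
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    # Rank every word except the query words by cosine similarity to the target.
    exclude = set(positive) | set(negative)
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in exclude}
    return max(scores.items(), key=lambda kv: kv[1])

word, score = analogy(["king", "woman"], ["man"], toy)
print(word, round(score, 4))
```

Excluding the query words matters: without it, the nearest neighbour of the target is often one of the inputs (typically "king") rather than the intended answer.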


# **Part 5: Sentiment Analysis (The Application)**

**Concept:** Sentiment Analysis determines whether a piece of text is Positive, Negative, or Neutral. We will use VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK. VADER is specifically designed for social media text; it understands that capital letters ("LOVE"), punctuation ("!!!"), and emojis change the sentiment intensity.

**Task:**

1. Initialize the SentimentIntensityAnalyzer.

2. Pass the Pizza Review (corpus[1]) into the analyzer.

3. Pass the Math Complaint (corpus[5]) into the analyzer.

**Analysis:** Look at the compound score for both.

**Compound Score Range:** -1 (Most Negative) to +1 (Most Positive).

Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?

In [10]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
# Download VADER (run once)
nltk.download('vader_lexicon')
# rest of the code here

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

bcs = SentimentIntensityAnalyzer()

pizza_review = corpus[1]
pizza_scores = bcs.polarity_scores(pizza_review)

print(f"Review: '{pizza_review}'")
print(f"Scores: {pizza_scores}")
print(f"Compound Score: {pizza_scores['compound']}")
print("\n")

math_complaint = corpus[5]
math_scores = bcs.polarity_scores(math_complaint)

print(f"Review: '{math_complaint}'")
print(f"Scores: {math_scores}")
print(f"Compound Score: {math_scores['compound']}")
# VADER weights sentiment after "but" more heavily than sentiment before it,
# so the negative clause dominates in both sentences.
print("\nAnswer: No. Both compound scores are clearly negative rather than mixed or neutral.\n")


Review: 'The pizza was absolutely delicious, but the service was terrible ... I won't go back.'
Scores: {'neg': 0.223, 'neu': 0.644, 'pos': 0.134, 'compound': -0.3926}
Compound Score: -0.3926


Review: 'I love machine learning, but I hate the math behind it.'
Scores: {'neg': 0.345, 'neu': 0.478, 'pos': 0.177, 'compound': -0.5346}
Compound Score: -0.5346

Answer: No. Both compound scores are clearly negative rather than mixed or neutral.
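To turn a compound score into a label, a common convention (suggested in the VADER project's own documentation, and treated here as an assumption rather than a rule) is to call scores at or above 0.05 positive, at or below -0.05 negative, and anything in between neutral. The helper name `label_from_compound` is hypothetical; the two scores are the ones printed above.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label using a symmetric threshold."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the output of the cell above.
for name, compound in [("pizza review", -0.3926), ("math complaint", -0.5346)]:
    print(f"{name}: {compound:+.4f} -> {label_from_compound(compound)}")
```

Under this convention both sentences are labelled negative, so neither would count as a mixed or neutral result.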

