# **Part 1: Text Preprocessing**



**Data:** The corpus is defined in the code cell below.

**1.1: Preprocessing From Scratch**

**Goal:** Write a function clean_text_scratch(text) that performs the following steps without using NLTK or spaCy:

1. Lowercasing: Convert text to lowercase.

2. Punctuation Removal: Use Python's re (regex) library or string methods to remove special characters (!, ., ,, :, ;, ..., ').

3. Tokenization: Split the string into a list of words based on whitespace.

4. Stopword Removal: Filter out words found in this list: ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'].

5. Simple Stemming: Create a helper function that removes suffixes 'ing', 'ly', 'ed', and 's' from the end of words.


Note: This is a "Naive" stemmer. It will break words like "sing" -> "s". This illustrates why we need libraries!
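
A minimal sketch of why the naive rule fails, assuming the same suffix order as the task ('ing', 'ly', 'ed', 's'). The helper name naive_stem and the sample words are illustrative only:

```python
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ing", "ly", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix without checking what is left."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# "jumps" and "learning" come out fine; "sing", "red" and "is" are mangled
# because the rule never checks whether the remainder is still a word.
for word in ["jumps", "learning", "sing", "red", "is"]:
    print(word, "->", repr(naive_stem(word)))
```

Library stemmers such as the Porter stemmer avoid most of these cases by placing conditions on the remaining stem before a suffix is removed.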

**Task:** Run this function on the first sentence of the corpus and print the result.

In [None]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [10]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

import re

STOPWORDS = {'the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'}
SUFFIXES = ('ing', 'ly', 'ed', 's')

def stem(word):
    # Naive stemmer: strip the first matching suffix, with no check on what remains.
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def clean_text_scratch(data):
    # Accept a single sentence or a list of sentences; always return a list of token lists.
    if isinstance(data, str):
        data = [data]

    processed_sentences = []
    for sentence in data:
        sentence = sentence.lower()
        # Remove ! . , : ; and ' (the ellipsis "..." is already covered by ".").
        sentence = re.sub(r"[!.,:;']", '', sentence)
        tokens = sentence.split()
        filtered_tokens = [word for word in tokens if word not in STOPWORDS]
        processed_sentences.append([stem(word) for word in filtered_tokens])

    return processed_sentences

clean_text_scratch(corpus[0])


[['artificial',
  'intelligence',
  'transform',
  'world',
  'however',
  'ethical',
  'concern',
  'remain']]

**1.2: Preprocessing Using Tools**

**Goal:** Use the nltk library to perform the same cleaning on the entire corpus.

**Steps:**

1. Use nltk.tokenize.word_tokenize.
2. Use nltk.corpus.stopwords.
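3. Use nltk.stem.WordNetLemmatizer to convert words to their root (e.g., "jumps" $\to$ "jump", "transforming" $\to$ "transform"). Note that lemmatize() treats every word as a noun unless a part of speech is given, so verb forms such as "transforming" are only reduced when pos='v' is passed.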


**Task:** Print the cleaned, lemmatized tokens for the second sentence (The pizza review).

In [11]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def clean_text_nltk(data):
    # Accept a single sentence or a list of sentences; the caller's list is not modified.
    if isinstance(data, str):
        data = [data]

    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    punctuation = ['!', '.', ',', ':', ';', "'"]

    cleaned = []
    for sentence in data:
        # str.lower() returns a new string, so the result has to be kept.
        sentence = sentence.lower()
        for mark in punctuation:
            sentence = sentence.replace(mark, '')
        tokens = word_tokenize(sentence)
        tokens = [t for t in tokens if t not in stop_words]
        # lemmatize() defaults to pos='n' (noun).
        cleaned.append([lemmatizer.lemmatize(t) for t in tokens])
    return cleaned

clean_text_nltk(corpus[1])


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


[['pizza',
  'absolutely',
  'delicious',
  'service',
  'terrible',
  'wont',
  'go',
  'back']]

# **Part 2: Text Representation**

**2.1: Bag of Words (BoW)**

**Logic:**

**Build Vocabulary:** Create a list of all unique words in the entire corpus (after cleaning). Sort them alphabetically.

**Vectorize:** Write a function that takes a sentence and returns a list of numbers. Each number represents the count of a vocabulary word in that sentence.

**Task:** Print the unique Vocabulary list. Then, print the BoW vector for: "The quick brown fox jumps over the lazy dog."

In [12]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# NLTK downloads (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

def clean_text_scratch(data):
    import re
    if type(data) == str:
        data = [data]

    processed_sentences = []
    for sentence in data:
        sentence = sentence.lower()
        sentence = re.sub(r"[!.,:;']", '', sentence)
        tokens = sentence.split()

        stopwords_list = ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or']
        filtered_tokens = [word for word in tokens if word not in stopwords_list]

        def stem(word):
            suffixes = ['ing', 'ly', 'ed', 's']
            for suffix in suffixes:
                if word.endswith(suffix):
                    return word[:-len(suffix)]
            return word

        stemmed_tokens = [stem(word) for word in filtered_tokens]
        processed_sentences.append(stemmed_tokens)

    return processed_sentences

cleaned_corpus_nested = clean_text_scratch(corpus)

# Build the vocabulary: every unique cleaned token, sorted alphabetically.
vocabulary = sorted({w for sent in cleaned_corpus_nested for w in sent})
print("Vocabulary ({} tokens):".format(len(vocabulary)))
print(vocabulary)
print()

def BoW(sentence):
    # Clean the sentence with the same pipeline used to build the vocabulary.
    tokens = clean_text_scratch(sentence)[0]
    # One count per vocabulary word, in vocabulary order.
    return [tokens.count(word) for word in vocabulary]

print(BoW(corpus[2]))


Vocabulary (44 tokens):
['absolute', 'algebra', 'artificial', 'back', 'be', 'behind', 'brown', 'concern', 'data', 'deliciou', 'dog', 'ethical', 'fox', 'go', 'hate', 'however', 'i', 'intelligence', 'involve', 'jump', 'lazy', 'learn', 'linear', 'love', 'machine', 'math', 'mind', 'nobler', 'not', 'over', 'pizza', 'question', 'quick', 'remain', 'science', 'service', 'statistic', 'terrible', 'that', 'ti', 'transform', 'whether', 'wont', 'world']

[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


**2.2: BoW Using Tools**

**Task:** Use sklearn.feature_extraction.text.CountVectorizer.

**Steps:**

1. Instantiate the vectorizer.

2. fit_transform the raw corpus.

3. Convert the result to an array (.toarray()) and print it.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

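vectorizer = CountVectorizer()

# fit_transform returns a SciPy sparse matrix of shape (n_documents, n_features).
bow_sparse = vectorizer.fit_transform(corpus)

feature_names = vectorizer.get_feature_names_out()

# Dense 2-D NumPy array of word counts, one row per sentence.
bow_array = bow_sparse.toarray()

print(feature_names)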





['absolutely' 'algebra' 'and' 'artificial' 'back' 'be' 'behind' 'brown'
 'but' 'concerns' 'data' 'delicious' 'dog' 'ethical' 'fox' 'go' 'hate'
 'however' 'in' 'intelligence' 'involves' 'is' 'it' 'jumps' 'lazy'
 'learning' 'linear' 'love' 'machine' 'math' 'mind' 'nobler' 'not' 'or'
 'over' 'pizza' 'question' 'quick' 'remain' 'science' 'service'
 'statistics' 'terrible' 'that' 'the' 'tis' 'to' 'transforming' 'was'
 'whether' 'won' 'world']
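
The feature names above contain 'won' but no 'i' or 't'. A minimal sketch of why, assuming CountVectorizer's default settings (lowercase=True and token_pattern r"(?u)\b\w\w+\b", which keeps only runs of two or more word characters). The helper name default_tokens is illustrative and only imitates that default tokenization with the re module:

```python
import re

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokens(text):
    """Lowercase the text and keep tokens of at least two word characters."""
    return TOKEN_PATTERN.findall(text.lower())

# The apostrophe splits "won't" into "won" and "t"; single characters are dropped.
print(default_tokens("I won't go back."))
```

This is also why the vocabulary here differs from the one built in 2.1, which used a custom stopword list and stemmer.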


**2.3: TF-IDF From Scratch (The Math)**

**Goal:** Manually calculate the score for the word "machine" in the last sentence:

"I love machine learning, but I hate the math behind it."

**Formula:**

*TF (Term Frequency):* $\frac{\text{Count of 'machine' in sentence}}{\text{Total words in sentence}}$

*IDF (Inverse Document Frequency):* $\log(\frac{\text{Total number of documents}}{\text{Number of documents containing 'machine'}})$ (Use math.log).

**Result:** TF * IDF.

**Task:** Print your manual calculation result.

In [14]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

import math

target = 'machine'

# Term frequency: occurrences of the target among the whitespace tokens of the last sentence.
sentence_tokens = corpus[5].split()
count_machine_in_sentence = sentence_tokens.count(target)
total_words_in_sentence = len(sentence_tokens)
tf = count_machine_in_sentence / total_words_in_sentence

# Inverse document frequency: count every document that contains the target
# (an early break here would stop at the first match and always give 1).
total_documents = len(corpus)
count_documents_in_corpus_machine = sum(1 for doc in corpus if target in doc.lower().split())
idf = math.log(total_documents / count_documents_in_corpus_machine)

tf_idf = tf * idf

# tf = 1/11 and idf = ln(6/2), rounded to four decimals for display.
print(round(tf_idf, 4))


0.0999


**2.4: TF-IDF Using Tools**

**Task:** Use sklearn.feature_extraction.text.TfidfVectorizer.

**Steps:** Fit it on the corpus and print the vector for the first sentence.

**Observation:** Compare the score of unique words (like "Intelligence") vs common words (like "is"). Which is higher?
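
The scores from TfidfVectorizer will not match the manual formula from 2.3 exactly. Assuming scikit-learn's default settings (smooth_idf=True, norm='l2', raw counts as TF), the weighting can be sketched as follows, where $N$ is the number of documents and $\text{df}(t)$ is the number of documents containing term $t$:

$$\text{idf}(t) = \ln\!\left(\frac{1 + N}{1 + \text{df}(t)}\right) + 1, \qquad \text{tfidf}(t, d) = \frac{\text{count}(t, d)\cdot \text{idf}(t)}{\sqrt{\sum_{t'} \left(\text{count}(t', d)\cdot \text{idf}(t')\right)^2}}$$

With $N = 6$ sentences, "intelligence" appears in one sentence, giving $\ln(7/2) + 1 \approx 2.25$, while "is" appears in two, giving $\ln(7/3) + 1 \approx 1.85$. The rarer word therefore receives the larger weight, provided the vectorizer is fit on the whole corpus rather than on a single sentence.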

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

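tfidf_model = TfidfVectorizer()
# Fit on the whole corpus so that IDF reflects how many sentences contain each word.
tfidf_matrix = tfidf_model.fit_transform(corpus)

# Vector for the first sentence; print only its nonzero entries, labelled by term.
first_vector = tfidf_matrix.toarray()[0]
for term, score in zip(tfidf_model.get_feature_names_out(), first_vector):
    if score > 0:
        print(f"{term}: {score:.3f}")


artificial: 0.335
concerns: 0.335
ethical: 0.335
however: 0.335
intelligence: 0.335
is: 0.274
remain: 0.335
the: 0.171
transforming: 0.335
world: 0.335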


# **Part 3: Word Embeddings**

**3.1: Word2Vec Using Tools**

**Task:** Train a model using gensim.models.Word2Vec.

**Steps:**

1. Pass your cleaned tokenized corpus (from Part 1.2) to Word2Vec.

2. Set min_count=1 (since our corpus is small, we want to keep all words).

3. Set vector_size=10 (small vector size for easy viewing).

**Experiment:** Print the vector for the word "learning".

In [16]:
!pip install gensim
from gensim.models import Word2Vec

# cleaned_corpus_nested comes from clean_text_scratch (Part 2.1), not from Part 1.2,
# so "learning" is stored in the model as the stem "learn".
model = Word2Vec(
    sentences=cleaned_corpus_nested,
    vector_size=10,
    min_count=1
)

vector = model.wv['learn']
print(vector)

[ 0.07380094 -0.01535677 -0.04540383  0.06549373 -0.04860722 -0.01817802
  0.02879471  0.0099871  -0.08291139 -0.09449325]


**3.3: Pre-trained GloVe (Understanding Global Context)**

**Task:** Use gensim.downloader to load 'glove-wiki-gigaword-50'.

**Analogy Task:** Compute the famous analogy: $\text{King} - \text{Man} + \text{Woman} = ?$

Use model.most_similar(positive=['woman', 'king'], negative=['man']).

**Question:** Does the model correctly guess "Queen"?
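
The idea behind the analogy can be sketched with plain vector arithmetic. The example below is a simplified illustration, not gensim's implementation (which differs in details such as normalization). The 3-dimensional vectors are made up for the example, with axes loosely meaning (royalty, male, female), and the names TOY_VECTORS, cosine and analogy are hypothetical:

```python
import math

# Made-up toy vectors with axes (royalty, male, female); not real GloVe values.
TOY_VECTORS = {
    "king": (1.0, 1.0, 0.0),
    "queen": (1.0, 0.0, 1.0),
    "man": (0.0, 1.0, 0.0),
    "woman": (0.0, 0.0, 1.0),
    "prince": (0.8, 1.0, 0.0),
    "apple": (0.0, 0.1, 0.1),
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative, vectors=TOY_VECTORS):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # The query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy(positive=["woman", "king"], negative=["man"]))
```

In this toy space king - man + woman lands exactly on the vector for "queen"; with real embeddings the result is only the nearest neighbour, which is why most_similar returns a ranked list with similarity scores.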

In [19]:
import gensim.downloader as api

# Load pre-trained GloVe model
glove_model = api.load('glove-wiki-gigaword-50')

analogy = glove_model.most_similar(positive=['woman', 'king'], negative=['man'])
print(analogy)


[('queen', 0.8523604273796082), ('throne', 0.7664334177970886), ('prince', 0.7592144012451172), ('daughter', 0.7473883628845215), ('elizabeth', 0.7460219860076904), ('princess', 0.7424570322036743), ('kingdom', 0.7337412238121033), ('monarch', 0.721449077129364), ('eldest', 0.7184861898422241), ('widow', 0.7099431157112122)]


# **Part 5: Sentiment Analysis (The Application)**

**Concept:** Sentiment Analysis determines whether a piece of text is Positive, Negative, or Neutral. We will use VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK. VADER is specifically designed for social media text; it understands that capital letters ("LOVE"), punctuation ("!!!"), and emojis change the sentiment intensity.

**Task:**

1. Initialize the SentimentIntensityAnalyzer.

2. Pass the Pizza Review (corpus[1]) into the analyzer.

3. Pass the Math Complaint (corpus[5]) into the analyzer.

**Analysis:** Look at the compound score for both.

**Compound Score Range:** -1 (Most Negative) to +1 (Most Positive).

Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?

In [23]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
# Download VADER (run once)
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores(corpus[1]))
print(sia.polarity_scores(corpus[5]))

{'neg': 0.223, 'neu': 0.644, 'pos': 0.134, 'compound': -0.3926}
{'neg': 0.345, 'neu': 0.478, 'pos': 0.177, 'compound': -0.5346}
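
One common way to turn the compound scores above into labels is the ±0.05 cutoff suggested in VADER's documentation. A minimal sketch, where the helper name label_from_compound is an illustrative assumption and the threshold can be tuned per application:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the cell output above.
for name, compound in [("pizza review", -0.3926), ("math complaint", -0.5346)]:
    print(name, "->", label_from_compound(compound))
```

Under this cutoff both sentences are labelled negative, so the mixed pizza review is not treated as neutral: "terrible" and "won't go back" outweigh "delicious" in the compound score.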


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
