# **Part 1: Text Preprocessing**



**Data:** The corpus is given in the code cell below.

**1.1: Preprocessing From Scratch**

**Goal:** Write a function `clean_text_scratch(text)` that performs the following steps without using NLTK or spaCy:

1. Lowercasing: Convert text to lowercase.

2. Punctuation Removal: Use Python's re (regex) library or string methods to remove special characters (!, ., ,, :, ;, ..., ').

3. Tokenization: Split the string into a list of words based on whitespace.

4. Stopword Removal: Filter out words found in this list: ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'].

5. Simple Stemming: Create a helper function that removes suffixes 'ing', 'ly', 'ed', and 's' from the end of words.


Note: This is a "Naive" stemmer. It will break words like "sing" -> "s". This illustrates why we need libraries!
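
To make the note above concrete, here is a minimal sketch of such a naive suffix stripper. The name `naive_stem` and the sample words are illustrative assumptions, not part of the assignment.

```python
SUFFIXES = ("ing", "ly", "ed", "s")

def naive_stem(word):
    # Strip the first suffix that matches, as long as something is left over.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word

for w in ["jumps", "transforming", "sing", "family"]:
    print(w, "->", naive_stem(w))
```

"jumps" and "transforming" come out as expected, but "sing" collapses to "s" and "family" to "fami", because the rule only looks at the ending and knows nothing about the word itself.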

**Task:** Run this function on the first sentence of the corpus and print the result.

In [2]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [3]:
import re

STOPWORDS = {'the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'}
SUFFIXES = ("ing", "ly", "ed", "s")

def lowercase(s):
    # Manual ASCII lowercasing; equivalent to s.lower() for ASCII text.
    r = ""
    for ch in s:
        if "A" <= ch <= "Z":
            r += chr(ord(ch) - ord('A') + ord('a'))
        else:
            r += ch
    return r

def clean(s):
    # Remove every character that is not a word character or whitespace.
    return re.sub(r'[^\w\s]', '', s)

def helper(words):
    # Naive stemmer: strip the first matching suffix, never emptying the word.
    result = []
    for w in words:
        for suffix in SUFFIXES:
            if w.endswith(suffix) and len(w) > len(suffix):
                w = w[:-len(suffix)]
                break
        result.append(w)
    return result

def clean_text_scratch(text):
    # Lowercase, strip punctuation, tokenize on whitespace,
    # drop stopwords, then stem. The input string is not modified.
    tokens = clean(lowercase(text)).split()
    tokens = [w for w in tokens if w not in STOPWORDS]
    return helper(tokens)


In [60]:
words = clean_text_scratch(corpus[0])
print(words)

['artificial', 'intelligence', 'transform', 'world', 'however', 'ethical', 'concern', 'remain']


**1.2: Preprocessing Using Tools**

**Goal:** Use the nltk library to perform the same cleaning on the entire corpus.

**Steps:**

1. Use nltk.tokenize.word_tokenize.
2. Use nltk.corpus.stopwords.
3. Use nltk.stem.WordNetLemmatizer to convert words to their root (e.g., "jumps" $\to$ "jump", "transforming" $\to$ "transform").


**Task:** Print the cleaned, lemmatized tokens for the second sentence (the pizza review).

In [61]:
import re
import nltk
# NLTK downloads (run once)
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')
# nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Build the stopword set once; calling stopwords.words() for every token is slow.
stop_words = set(stopwords.words("english"))
wordl = WordNetLemmatizer()

def normalise(s):
    # Lowercase and strip punctuation so tokens can match the lowercase stopword list.
    return re.sub(r'[^\w\s]', '', s.lower())

def rem_stop(tokens):
    return [w for w in tokens if w not in stop_words]

def lem(tokens):
    # Lemmatize as a noun first (the default), then as a verb.
    tokens = [wordl.lemmatize(w) for w in tokens]
    return [wordl.lemmatize(w, pos='v') for w in tokens]

# Whole corpus
words = []
for sentence in corpus:
    words += word_tokenize(normalise(sentence))
print(words)

words1 = rem_stop(words)
print(words1)
tokens = lem(words1)
print(tokens)

# Second sentence (the pizza review)
words2 = rem_stop(word_tokenize(normalise(corpus[1])))
token2 = lem(words2)
print(token2)

['artificial', 'intelligence', 'is', 'transforming', 'the', 'world', 'however', 'ethical', 'concerns', 'remain', 'the', 'pizza', 'was', 'absolutely', 'delicious', 'but', 'the', 'service', 'was', 'terrible', 'i', 'wont', 'go', 'back', 'the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question', 'whether', 'tis', 'nobler', 'in', 'the', 'mind', 'data', 'science', 'involves', 'statistics', 'linear', 'algebra', 'and', 'machine', 'learning', 'i', 'love', 'machine', 'learning', 'but', 'i', 'hate', 'the', 'math', 'behind', 'it']
['artificial', 'intelligence', 'transforming', 'world', 'however', 'ethical', 'concerns', 'remain', 'pizza', 'absolutely', 'delicious', 'service', 'terrible', 'wont', 'go', 'back', 'quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', 'question', 'whether', 'tis', 'nobler', 'mind', 'data', 'science', 'involves', 'statistics', 'linear', 'algebra', 'machine', 'learning', 'love', 'machine', 'learni

# **Part 2: Text Representation**

**2.1: Bag of Words (BoW)**

**Logic:**

**Build Vocabulary:** Create a list of all unique words in the entire corpus (after cleaning). Sort them alphabetically.

**Vectorize:** Write a function that takes a sentence and returns a list of numbers. Each number represents the count of a vocabulary word in that sentence.
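
The two steps above can be sketched with the standard library alone. This is a minimal illustration on toy sentences, not the assignment corpus, and the helper names (`tokenize`, `build_vocab`, `bow_vector`) are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(sentence):
    # Lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", "", sentence.lower()).split()

def build_vocab(sentences):
    # Sorted list of unique tokens across all sentences.
    return sorted({tok for s in sentences for tok in tokenize(s)})

def bow_vector(sentence, vocab):
    # One count per vocabulary word, in vocabulary order.
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocab]

docs = ["The cat sat.", "The cat saw the dog!"]  # toy data for illustration
toy_vocab = build_vocab(docs)
print(toy_vocab)
print(bow_vector(docs[1], toy_vocab))
```

Because the vocabulary and the vector are built from the same tokenizer, every word in a sentence is guaranteed to have a matching vocabulary slot.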

**Task:** Print the unique Vocabulary list. Then, print the BoW vector for: "The quick brown fox jumps over the lazy dog."

In [62]:
from nltk.tokenize import word_tokenize

# NLTK downloads (run once)
# nltk.download('punkt')

# lowercase() and clean() are reused from Part 1.1.

def make_vocab(sentences):
    # Unique tokens in first-seen order. No lowercasing or punctuation
    # removal happens here, so the input should already be normalised.
    vocab = []
    for sentence in sentences:
        for word in word_tokenize(sentence):
            if word not in vocab:
                vocab.append(word)
    return vocab

def vectorise(sentences, vocab):
    # Count each vocabulary word across the cleaned sentences.
    # Works on new strings, so the caller's list is left unchanged.
    tokens = []
    for sentence in sentences:
        tokens += clean(lowercase(sentence)).split()
    return [tokens.count(word) for word in vocab]

# Normalise explicitly before building the vocabulary instead of
# relying on the corpus having been modified by an earlier cell.
vocab = make_vocab([clean(lowercase(s)) for s in corpus])
vocab.sort()
print(vocab)

count = vectorise(corpus, vocab)
print(count)

['absolutely', 'algebra', 'and', 'artificial', 'back', 'be', 'behind', 'brown', 'but', 'concerns', 'data', 'delicious', 'dog', 'ethical', 'fox', 'go', 'hate', 'however', 'i', 'in', 'intelligence', 'involves', 'is', 'it', 'jumps', 'lazy', 'learning', 'linear', 'love', 'machine', 'math', 'mind', 'nobler', 'not', 'or', 'over', 'pizza', 'question', 'quick', 'remain', 'science', 'service', 'statistics', 'terrible', 'that', 'the', 'tis', 'to', 'transforming', 'was', 'whether', 'wont', 'world']
[1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 1, 2, 1, 2, 1, 1, 1]


In [63]:
s = ["The quick brown fox jumps over the lazy dog."]
vocab1 = make_vocab(s)
bow = vectorise(s,vocab1)
print(vocab1)
print(bow)

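# make_vocab neither lowercases nor strips punctuation, so 'The', 'the', and '.' become separate entries,
# while vectorise counts on cleaned text and therefore finds no match for 'The' or '.'.
# The from-scratch cleaning handles this better, so the next cell builds the vocabulary with it.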

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
[0, 1, 1, 1, 1, 1, 2, 1, 1, 0]


In [64]:
def vocab_maker(sentences):
    # Unique cleaned tokens in first-seen order; the input list is not modified.
    unique = []
    for sentence in sentences:
        for w in clean(lowercase(sentence)).split():
            if w not in unique:
                unique.append(w)
    return unique

vocab2 = vocab_maker(s)
bow1 = vectorise(s, vocab2)
print(vocab2)
print(bow1)

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
[2, 1, 1, 1, 1, 1, 1, 1]


**2.2: BoW Using Tools**

**Task:** Use sklearn.feature_extraction.text.CountVectorizer.

**Steps:**

1. Instantiate the vectorizer.

2. fit_transform the raw corpus.

3. Convert the result to an array (.toarray()) and print it.

In [65]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
tok = vect.fit_transform(corpus)
print(tok.toarray())


[[0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1]
 [1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0 0 0 0 1 0 1 0 2 0 0 0 2 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
  0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0]
 [0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0
  1 0 0 0 0 0 0 1 2 1 2 0 0 1 0 0]
 [0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0
  0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]]


**2.3: TF-IDF From Scratch (The Math)**

**Goal:** Manually calculate the score for the word "machine" in the last sentence:

"I love machine learning, but I hate the math behind it."

**Formula:**

*TF (Term Frequency):* $\frac{\text{Count of 'machine' in sentence}}{\text{Total words in sentence}}$

*IDF (Inverse Document Frequency):* $\log(\frac{\text{Total number of documents}}{\text{Number of documents containing 'machine'}})$ (Use math.log).

**Result:** TF * IDF.
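
As a worked check of the formula, assume the sentence is split on whitespace (11 tokens, one of them "machine") and that "machine" appears in 2 of the 6 corpus sentences (the fifth and the sixth). With the natural logarithm that math.log computes:

$$\text{TF} = \frac{1}{11} \approx 0.0909, \qquad \text{IDF} = \ln\frac{6}{2} = \ln 3 \approx 1.0986, \qquad \text{TF} \times \text{IDF} \approx 0.0999$$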

**Task:** Print your manual calculation result.

In [66]:
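import math

s = corpus[5]
target = "machine"

def tf(sentence, target):
    # Term frequency: occurrences of target / number of whitespace tokens.
    words = sentence.lower().split()
    return words.count(target) / len(words)

def idf(n_docs, n_containing):
    # Inverse document frequency using the natural log.
    return math.log(n_docs / n_containing)

tot = len(corpus)                                         # 6 documents
d = sum(target in doc.lower().split() for doc in corpus)  # 2 contain "machine"

print(round(tf(s, target), 4))
print(round(idf(tot, d), 4))
print(round(tf(s, target) * idf(tot, d), 4))

0.0909
1.0986
0.0999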


**2.4: TF-IDF Using Tools**

**Task:** Use sklearn.feature_extraction.text.TfidfVectorizer.

**Steps:** Fit it on the corpus and print the vector for the first sentence.

**Observation:** Compare the score of unique words (like "Intelligence") vs common words (like "is"). Which is higher?
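
The scores from TfidfVectorizer will not equal the plain TF * IDF from 2.3, because its defaults (`smooth_idf=True`, `norm='l2'`) use a smoothed IDF, $\ln\frac{1 + n}{1 + \text{df}} + 1$, on raw counts and then scale each row to unit length. The sketch below reimplements those defaults in plain Python for the first sentence. The function names are hypothetical and the tokenizer is an approximation of the vectorizer's default pattern (lowercased tokens of two or more word characters).

```python
import math
import re
from collections import Counter

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it.",
]

def tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

def tfidf_row(doc, docs):
    n = len(docs)
    doc_tokens = [set(tokenize(d)) for d in docs]
    counts = Counter(tokenize(doc))
    raw = {}
    for term, tf in counts.items():
        df = sum(term in toks for toks in doc_tokens)
        raw[term] = tf * (math.log((1 + n) / (1 + df)) + 1)  # smoothed IDF
    norm = math.sqrt(sum(v * v for v in raw.values()))       # L2 normalisation
    return {term: v / norm for term, v in raw.items()}

row = tfidf_row(corpus[0], corpus)
print(round(row["intelligence"], 4), round(row["is"], 4))
```

"intelligence" occurs in one sentence and "is" in two, so the rarer word receives the higher weight.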

In [67]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfv = TfidfVectorizer()
tfidf = tfv.fit_transform(corpus)
words = (tfv.get_feature_names_out())
tfidf_values = (tfidf.toarray()[0])
tfidf_dict = dict(zip(words, tfidf_values))
print(tfidf_dict)

{'absolutely': np.float64(0.0), 'algebra': np.float64(0.0), 'and': np.float64(0.0), 'artificial': np.float64(0.3345454287016015), 'back': np.float64(0.0), 'be': np.float64(0.0), 'behind': np.float64(0.0), 'brown': np.float64(0.0), 'but': np.float64(0.0), 'concerns': np.float64(0.3345454287016015), 'data': np.float64(0.0), 'delicious': np.float64(0.0), 'dog': np.float64(0.0), 'ethical': np.float64(0.3345454287016015), 'fox': np.float64(0.0), 'go': np.float64(0.0), 'hate': np.float64(0.0), 'however': np.float64(0.3345454287016015), 'in': np.float64(0.0), 'intelligence': np.float64(0.3345454287016015), 'involves': np.float64(0.0), 'is': np.float64(0.27433203727401334), 'it': np.float64(0.0), 'jumps': np.float64(0.0), 'lazy': np.float64(0.0), 'learning': np.float64(0.0), 'linear': np.float64(0.0), 'love': np.float64(0.0), 'machine': np.float64(0.0), 'math': np.float64(0.0), 'mind': np.float64(0.0), 'nobler': np.float64(0.0), 'not': np.float64(0.0), 'or': np.float64(0.0), 'over': np.float64

# **Part 3: Word Embeddings**

**3.1: Word2Vec Using Tools**

**Task:** Train a model using gensim.models.Word2Vec.

**Steps:**

1. Pass your cleaned tokenized corpus (from Part 1.2) to Word2Vec.

2. Set min_count=1 (since our corpus is small, we want to keep all words).

3. Set vector_size=10 (small vector size for easy viewing).

**Experiment:** Print the vector for the word "learning".

In [68]:

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Lowercase and tokenize each sentence; Word2Vec expects a list of token lists.
snt = [word_tokenize(s.lower()) for s in corpus]

# min_count=1 keeps every word; vector_size=10 keeps the vectors easy to read.
w2v = Word2Vec(snt, vector_size=10, min_count=1)

print(w2v.wv["learning"])

[ 0.07309777  0.05074635  0.06763472  0.00753128  0.06360789 -0.03414256
 -0.00940346  0.05780506 -0.07521617 -0.03939066]


**3.3: Pre-trained GloVe (Understanding Global Context)**

**Task:** Use gensim.downloader to load 'glove-wiki-gigaword-50'

**Analogy Task:** Compute the famous analogy: $\text{King} - \text{Man} + \text{Woman} = ?$

Use model.most_similar(positive=['woman', 'king'], negative=['man']).

**Question:** Does the model correctly guess "Queen"?
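
To see what the `positive` and `negative` lists do, here is a simplified sketch of the vector arithmetic on made-up 2-D vectors. The numbers and the `analogy` helper are assumptions for illustration only; the real GloVe vectors have 50 dimensions, and gensim's implementation may differ in details such as vector normalisation.

```python
import math

# Made-up vectors with axes (royalty, gender), purely for illustration.
toy = {
    "king":  (0.9,  0.8),
    "queen": (0.9, -0.8),
    "man":   (0.1,  0.8),
    "woman": (0.1, -0.8),
    "apple": (-0.5, 0.1),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(positive, negative, vectors):
    # Add the positive vectors, subtract the negative ones, then return
    # the closest remaining word by cosine similarity.
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy(["woman", "king"], ["man"], toy))
```

The query words themselves are excluded from the candidates, which is why "king" is never returned as its own nearest neighbour.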

In [69]:
import gensim.downloader as api

# Load pre-trained GloVe model
glove_model = api.load('glove-wiki-gigaword-50')

predict = glove_model.most_similar(positive=['woman', 'king'], negative=['man'])
print(predict[0][0])

queen


# **Part 5: Sentiment Analysis (The Application)**

**Concept:** Sentiment Analysis determines whether a piece of text is Positive, Negative, or Neutral. We will use VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK. VADER is specifically designed for social media text; it understands that capital letters ("LOVE"), punctuation ("!!!"), and emojis change the sentiment intensity.

**Task:**

1. Initialize the SentimentIntensityAnalyzer.

2. Pass the Pizza Review (corpus[1]) into the analyzer.

3. Pass the Math Complaint (corpus[5]) into the analyzer.

**Analysis:** Look at the compound score for both.

**Compound Score Range:** -1 (Most Negative) to +1 (Most Positive).

Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?

In [70]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
# Download VADER (run once)
#nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores(corpus[1]))
print(sia.polarity_scores(corpus[5]))

{'neg': 0.235, 'neu': 0.623, 'pos': 0.141, 'compound': -0.3926}
{'neg': 0.345, 'neu': 0.478, 'pos': 0.177, 'compound': -0.5346}
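
One way to read these compound scores is to bucket them with a small threshold around zero. The cut-off of 0.05 used below is a commonly cited convention and is an assumption here, as is the helper name `label_compound`.

```python
def label_compound(compound, threshold=0.05):
    # Scores within +/-threshold of zero are treated as neutral (assumed convention).
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound values copied from the output above.
scores = {"pizza review": -0.3926, "math complaint": -0.5346}
for name, c in scores.items():
    print(name, "->", label_compound(c))
```

Under this convention both sentences are labelled negative, so the pizza review is not scored as neutral even though it contains "delicious" alongside "terrible".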
