# **Part 1: Text Preprocessing**



**Data:** given in the code cell below.

**1.1: Preprocessing From Scratch**

**Goal:** Write a function clean_text_scratch(text) that performs the following without using NLTK or spaCy:

1. Lowercasing: Convert text to lowercase.

2. Punctuation Removal: Use Python's re (regex) library or string methods to remove special characters (!, ., ,, :, ;, ..., ').

3. Tokenization: Split the string into a list of words based on whitespace.

4. Stopword Removal: Filter out words found in this list: ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'].

5. Simple Stemming: Create a helper function that removes suffixes 'ing', 'ly', 'ed', and 's' from the end of words.


Note: This is a "Naive" stemmer. It will break words like "sing" -> "s". This illustrates why we need libraries!
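To make these failure modes concrete, here is a minimal sketch of such a suffix stripper. The name `naive_stem` and the sample words are assumptions chosen for illustration; they are not part of the assignment.

```python
# Hypothetical helper: strips the first matching suffix, with no length
# or dictionary checks, which is exactly what makes it "naive".
SUFFIXES = ('ing', 'ly', 'ed', 's')

def naive_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

for word in ['jumps', 'transforming', 'sing', 'delicious', 'was']:
    print(f"{word} -> {naive_stem(word)}")
```

"jumps" and "transforming" come out as expected, but "sing" collapses to "s" and "delicious" loses its final "s", because the rule looks only at the ending and never at what remains.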

**Task:** Run this function on the first sentence of the corpus and print the result.

In [7]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [8]:
import re
#write rest of the code here
STOPWORDS = ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or']
SUFFIXES = ['ing', 'ly', 'ed', 's']

def simple_stem(word):
    # Naive stemmer: strip the first matching suffix, then stop.
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def clean_text_scratch(text):
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r'[^\w\s]', '', text)                  # 2. remove punctuation
    tokens = text.split()                                # 3. tokenize on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]   # 4. remove stopwords
    return [simple_stem(t) for t in tokens]              # 5. naive stemming

In [9]:
print(clean_text_scratch(corpus[0]))

['artificial', 'intelligence', 'transform', 'world', 'however', 'ethical', 'concern', 'remain']

**1.2: Preprocessing Using Tools**

**Goal:** Use the nltk library to perform the same cleaning on the entire corpus.

**Steps:**

1. Use nltk.tokenize.word_tokenize.
2. Use nltk.corpus.stopwords.
3. Use nltk.stem.WordNetLemmatizer to convert words to their root (e.g., "jumps" $\to$ "jump", "transforming" $\to$ "transform").


**Task:** Print the cleaned, lemmatized tokens for the second sentence (The pizza review).

In [10]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords,wordnet
from nltk.stem import WordNetLemmatizer
import string

#write rest of the code here
def get_wordnet_pos(tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default


# Lowercase and tokenize every sentence.
tokenised_text = [word_tokenize(sentence.lower()) for sentence in corpus]
print(tokenised_text)

# Drop stopwords and punctuation tokens. The punctuation check is a substring
# test against string.punctuation, so tokens such as '...' and "n't" remain.
stop_words = set(stopwords.words("english"))
filtered_text = [
    [word for word in sentence if word not in stop_words and word not in string.punctuation]
    for sentence in tokenised_text
]
print(filtered_text)

# POS-tag each sentence so the lemmatizer knows how each word is used.
pos_tagged_sentences = [nltk.pos_tag(sentence) for sentence in filtered_text]
print(pos_tagged_sentences)

# Lemmatize every word with its mapped WordNet POS.
lem = WordNetLemmatizer()
lemmatized_text = [
    [lem.lemmatize(word, pos=get_wordnet_pos(tag)) for (word, tag) in sentence]
    for sentence in pos_tagged_sentences
]
print(lemmatized_text)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


[['artificial', 'intelligence', 'is', 'transforming', 'the', 'world', ';', 'however', ',', 'ethical', 'concerns', 'remain', '!'], ['the', 'pizza', 'was', 'absolutely', 'delicious', ',', 'but', 'the', 'service', 'was', 'terrible', '...', 'i', 'wo', "n't", 'go', 'back', '.'], ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.'], ['to', 'be', ',', 'or', 'not', 'to', 'be', ',', 'that', 'is', 'the', 'question', ':', 'whether', "'t", 'is', 'nobler', 'in', 'the', 'mind', '.'], ['data', 'science', 'involves', 'statistics', ',', 'linear', 'algebra', ',', 'and', 'machine', 'learning', '.'], ['i', 'love', 'machine', 'learning', ',', 'but', 'i', 'hate', 'the', 'math', 'behind', 'it', '.']]
[['artificial', 'intelligence', 'transforming', 'world', 'however', 'ethical', 'concerns', 'remain'], ['pizza', 'absolutely', 'delicious', 'service', 'terrible', '...', 'wo', "n't", 'go', 'back'], ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog'], ['question', 'whether', "'t", 'nobler', '
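The filtered output above still contains tokens such as '...' and "n't". One likely reason, sketched below, is that `word not in string.punctuation` is a substring test against a single string, so a multi-character token is dropped only if it occurs contiguously in that string. The helper name `is_punctuation` is an assumption for illustration, not part of NLTK.

```python
import string

# string.punctuation is one string, so `in` performs a substring search.
print(',' in string.punctuation)    # a single punctuation character
print('...' in string.punctuation)  # three dots in a row

def is_punctuation(token):
    """Hypothetical check: True when every character is a punctuation mark."""
    return bool(token) and all(ch in string.punctuation for ch in token)

tokens = ['terrible', '...', 'wo', "n't", 'go', 'back', '.']
print([tok for tok in tokens if not is_punctuation(tok)])
```

A character-by-character check like this removes '...', while "n't" still passes because it contains letters; handling contractions would need a separate rule.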

# **Part 2: Text Representation**

**2.1: Bag of Words (BoW)**

**Logic:**

**Build Vocabulary:** Create a list of all unique words in the entire corpus (after cleaning). Sort them alphabetically.

**Vectorize:** Write a function that takes a sentence and returns a list of numbers. Each number represents the count of a vocabulary word in that sentence.

**Task:** Print the unique Vocabulary list. Then, print the BoW vector for: "The quick brown fox jumps over the lazy dog."

In [11]:
# Reuses `lemmatized_text`, the cleaned tokens produced in Part 1.2.

# Build the vocabulary: every unique cleaned word, sorted alphabetically.
vocabulary = sorted({word for sentence in lemmatized_text for word in sentence})
print(vocabulary)

def vectorise(tokens):
    # Count how often each vocabulary word occurs in the given token list.
    return [tokens.count(word) for word in vocabulary]

# lemmatized_text[2] holds the cleaned tokens of
# "The quick brown fox jumps over the lazy dog."
print(vectorise(lemmatized_text[2]))

["'t", '...', 'absolutely', 'algebra', 'artificial', 'back', 'behind', 'brown', 'concern', 'data', 'delicious', 'dog', 'ethical', 'fox', 'go', 'hate', 'however', 'intelligence', 'involve', 'jump', 'lazy', 'learn', 'learning', 'linear', 'love', 'machine', 'math', 'mind', "n't", 'nobler', 'pizza', 'question', 'quick', 'remain', 'science', 'service', 'statistic', 'terrible', 'transform', 'whether', 'wo', 'world']
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]


**2.2: BoW Using Tools**

**Task:** Use sklearn.feature_extraction.text.CountVectorizer.

**Steps:**

1. Instantiate the vectorizer.

2. fit_transform the raw corpus.

3. Convert the result to an array (.toarray()) and print it.

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
#rest of the code here
v= CountVectorizer()
print(v.fit_transform(corpus).toarray())


[[0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1]
 [1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0 0 0 0 1 0 1 0 2 0 0 0 2 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
  0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0]
 [0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0
  1 0 0 0 0 0 0 1 2 1 2 0 0 1 0 0]
 [0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0
  0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]]
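Each column of the matrix above corresponds to one vocabulary term. As a rough sketch of where those terms come from, the snippet below imitates CountVectorizer's default analysis (lowercasing, then keeping runs of two or more word characters) using only the standard library. The function name `default_tokens` is hypothetical, and the regular expression is assumed to mirror the vectorizer's default `token_pattern`.

```python
import re
from collections import Counter

def default_tokens(text):
    # Assumed equivalent of CountVectorizer's default tokenization:
    # lowercase, then keep tokens of at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

sentence = "The pizza was absolutely delicious, but the service was terrible ... I won't go back."
counts = Counter(default_tokens(sentence))
print(sorted(counts.items()))
```

Under these assumptions the single-letter "I" disappears and "won't" is reduced to "won", while "the" and "was" are each counted twice, which lines up with the 2s in the second row of the matrix.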


**2.3: TF-IDF From Scratch (The Math)**

**Goal:** Manually calculate the score for the word "machine" in the last sentence:

"I love machine learning, but I hate the math behind it."

**Formula:**

*TF (Term Frequency):* $\frac{\text{Count of 'machine' in sentence}}{\text{Total words in sentence}}$

*IDF (Inverse Document Frequency):* $\log(\frac{\text{Total number of documents}}{\text{Number of documents containing 'machine'}})$ (Use math.log).

**Result:** TF * IDF.
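As a worked instance of the formula, assume the sentence is split on whitespace (11 tokens, one of which is "machine") and that "machine" occurs in 2 of the 6 documents in the corpus. With the natural logarithm used by math.log:

$$\text{TF} = \frac{1}{11} \approx 0.0909, \qquad \text{IDF} = \ln\frac{6}{2} = \ln 3 \approx 1.0986, \qquad \text{TF} \times \text{IDF} \approx 0.0999$$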

**Task:** Print your manual calculation result.

In [13]:

import math

tf_sentence = "I love machine learning, but I hate the math behind it."

# TF: occurrences of 'machine' divided by the number of whitespace-separated tokens.
total_words = tf_sentence.split()
machine_count = total_words.count('machine')
tf = machine_count / len(total_words)

# IDF: log of (total documents / documents that contain 'machine').
docs_with_machine = sum('machine' in sentence.lower() for sentence in corpus)
idf = math.log(len(corpus) / docs_with_machine)

print(tf * idf)


0.09987384442437362


**2.4: TF-IDF Using Tools**

**Task:** Use sklearn.feature_extraction.text.TfidfVectorizer.

**Steps:** Fit it on the corpus and print the vector for the first sentence.

**Observation:** Compare the score of unique words (like "Intelligence") vs common words (like "is"). Which is higher?

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
# rest of the code here
vect= TfidfVectorizer()
transformed_output=vect.fit_transform(corpus)
print(vect.vocabulary_)
print(transformed_output[3])  # row 3 is the fourth sentence ("To be, or not to be..."); the first sentence is row 0


{'artificial': 3, 'intelligence': 19, 'is': 21, 'transforming': 47, 'the': 44, 'world': 51, 'however': 17, 'ethical': 13, 'concerns': 9, 'remain': 38, 'pizza': 35, 'was': 48, 'absolutely': 0, 'delicious': 11, 'but': 8, 'service': 40, 'terrible': 42, 'won': 50, 'go': 15, 'back': 4, 'quick': 37, 'brown': 7, 'fox': 14, 'jumps': 23, 'over': 34, 'lazy': 24, 'dog': 12, 'to': 46, 'be': 5, 'or': 33, 'not': 32, 'that': 43, 'question': 36, 'whether': 49, 'tis': 45, 'nobler': 31, 'in': 18, 'mind': 30, 'data': 10, 'science': 39, 'involves': 20, 'statistics': 41, 'linear': 26, 'algebra': 1, 'and': 2, 'machine': 28, 'learning': 25, 'love': 27, 'hate': 16, 'math': 29, 'behind': 6, 'it': 22}
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 13 stored elements and shape (1, 52)>
  Coords	Values
  (0, 21)	0.18951403836835215
  (0, 44)	0.23680832518670944
  (0, 46)	0.4622212982553308
  (0, 5)	0.4622212982553308
  (0, 33)	0.2311106491276654
  (0, 32)	0.2311106491276654
  (0, 43)	0.231110649127
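These scores differ from the manual formula in 2.3 because TfidfVectorizer, with its default settings, smooths the IDF as $\ln\frac{1 + n}{1 + \text{df}} + 1$ and then L2-normalizes each row. The sketch below reproduces that weighting with the standard library only; the helper names are hypothetical and the tokenization regex is assumed to match the vectorizer's default.

```python
import math
import re
from collections import Counter

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

def tokens(text):
    # Assumed equivalent of the vectorizer's default tokenization.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [Counter(tokens(doc)) for doc in corpus]
n_docs = len(docs)

def smoothed_idf(term):
    df = sum(term in doc for doc in docs)  # documents containing the term
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    raw = {term: count * smoothed_idf(term) for term, count in doc.items()}
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

first = tfidf_row(docs[0])
fourth = tfidf_row(docs[3])
print(round(fourth['is'], 4), round(fourth['to'], 4))
print(first['intelligence'] > first['is'])
```

If the assumptions hold, the values for the fourth sentence should land close to those printed above (about 0.19 for "is"), and in the first sentence the rarer word "intelligence" should score higher than the more common "is".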

# **Part 3: Word Embeddings**

**3.1: Word2Vec Using Tools**

**Task:** Train a model using gensim.models.Word2Vec.

**Steps:**

1. Pass your cleaned tokenized corpus (from Part 1.2) to Word2Vec.

2. Set min_count=1 (since our corpus is small, we want to keep all words).

3. Set vector_size=10 (small vector size for easy viewing).

**Experiment:** Print the vector for the word "learning".

In [30]:
!pip install gensim
from gensim.models import Word2Vec

# Note: vector_size is not set here, so gensim's default of 100 dimensions
# applies (the task asks for vector_size=10).
model = Word2Vec(
    window=10,
    min_count=1,   # keep every word, since the corpus is tiny
    workers=4
)
model.build_vocab(lemmatized_text, progress_per=10)
print(model.epochs)  # number of training epochs
model.train(lemmatized_text, total_examples=model.corpus_count, epochs=model.epochs)
model.wv["learning"]


5


array([ 8.13227147e-03, -4.45733406e-03, -1.06835726e-03,  1.00636482e-03,
       -1.91113955e-04,  1.14817743e-03,  6.11386076e-03, -2.02715401e-05,
       -3.24596534e-03, -1.51072862e-03,  5.89729892e-03,  1.51410222e-03,
       -7.24261976e-04,  9.33324732e-03, -4.92128357e-03, -8.38409644e-04,
        9.17541143e-03,  6.74942741e-03,  1.50285603e-03, -8.88256077e-03,
        1.14874600e-03, -2.28825561e-03,  9.36823711e-03,  1.20992784e-03,
        1.49006362e-03,  2.40640994e-03, -1.83600665e-03, -4.99963388e-03,
        2.32429506e-04, -2.01418041e-03,  6.60093315e-03,  8.94012302e-03,
       -6.74754381e-04,  2.97701475e-03, -6.10765442e-03,  1.69932481e-03,
       -6.92623248e-03, -8.69402662e-03, -5.90020278e-03, -8.95647518e-03,
        7.27759488e-03, -5.77203138e-03,  8.27635173e-03, -7.24354526e-03,
        3.42167495e-03,  9.67499893e-03, -7.78544787e-03, -9.94505733e-03,
       -4.32914635e-03, -2.68313056e-03, -2.71289347e-04, -8.83155130e-03,
       -8.61755759e-03,  

**3.3: Pre-trained GloVe (Understanding Global Context)**

**Task:** Use gensim.downloader to load 'glove-wiki-gigaword-50'

**Analogy Task:** Compute the famous analogy: $\text{King} - \text{Man} + \text{Woman} = ?$

Use model.most_similar(positive=['woman', 'king'], negative=['man']).

**Question:** Does the model correctly guess "Queen"?

In [34]:
import gensim.downloader as api

# Load pre-trained GloVe model
glove_model = api.load('glove-wiki-gigaword-50')
print(glove_model.most_similar(positive=['woman', 'king'], negative=['man'])[:5])
#rest of the code here

[('queen', 0.8523604273796082), ('throne', 0.7664334177970886), ('prince', 0.7592144012451172), ('daughter', 0.7473883628845215), ('elizabeth', 0.7460219860076904)]
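The analogy lookup can be pictured as vector arithmetic followed by a nearest-neighbour search. The sketch below uses hand-made two-dimensional vectors (a "royalty" axis and a "masculinity" axis) that are purely hypothetical; real GloVe vectors have 50 learned dimensions, and gensim's implementation differs in details such as normalization.

```python
import math

# Hypothetical toy embeddings: (royalty, masculinity).
toy_vectors = {
    'king':   (1.0, 1.0),
    'queen':  (1.0, -1.0),
    'man':    (0.0, 1.0),
    'woman':  (0.0, -1.0),
    'prince': (0.8, 0.9),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# king - man + woman
target = tuple(
    k - m + w
    for k, m, w in zip(toy_vectors['king'], toy_vectors['man'], toy_vectors['woman'])
)

# Rank every word except the three inputs by cosine similarity to the target.
candidates = {
    word: cosine(vec, target)
    for word, vec in toy_vectors.items()
    if word not in ('king', 'man', 'woman')
}
best = max(candidates, key=candidates.get)
print(best, round(candidates[best], 3))
```

Excluding the input words from the candidates matters: without that step, the closest vector to the target is often one of the inputs themselves.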


# **Part 5: Sentiment Analysis (The Application)**

**Concept:** Sentiment Analysis determines whether a piece of text is Positive, Negative, or Neutral. We will use VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK. VADER is specifically designed for social media text; it understands that capital letters ("LOVE"), punctuation ("!!!"), and emojis change the sentiment intensity.

**Task:**

1. Initialize the SentimentIntensityAnalyzer.

2. Pass the Pizza Review (corpus[1]) into the analyzer.

3. Pass the Math Complaint (corpus[5]) into the analyzer.

**Analysis:** Look at the compound score for both.

**Compound Score Range:** -1 (Most Negative) to +1 (Most Positive).

Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?

In [37]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
# Download VADER (run once)
nltk.download('vader_lexicon')
# rest of the code here
analyser=SentimentIntensityAnalyzer()
score_pizza=analyser.polarity_scores(corpus[1])
print(score_pizza)
math_complaint_score=analyser.polarity_scores(corpus[5])
print(math_complaint_score)


{'neg': 0.223, 'neu': 0.644, 'pos': 0.134, 'compound': -0.3926}
{'neg': 0.345, 'neu': 0.478, 'pos': 0.177, 'compound': -0.5346}


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
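To turn a compound score into a label, a common convention is to treat scores at or above +0.05 as positive, at or below -0.05 as negative, and anything in between as neutral. The sketch below applies that convention to the two compound scores printed above; the helper name `label_from_compound` and the cutoff value are assumptions, not something the analyzer returns.

```python
def label_from_compound(compound, threshold=0.05):
    """Hypothetical helper using the commonly cited +/-0.05 cutoffs."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Compound scores copied from the output above.
scores = {'pizza review': -0.3926, 'math complaint': -0.5346}
for name, compound in scores.items():
    print(name, label_from_compound(compound))
```

Under this convention both sentences are labelled negative, so the pizza review is not scored as mixed or neutral even though it contains "delicious".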
