# **Part 1: Text Preprocessing**



**Data:** The corpus is defined in the code cell below.

**1.1: Preprocessing From Scratch**

**Goal:** Write a function clean_text_scratch(text) that performs the following without using NLTK or Spacy:

1. Lowercasing: Convert text to lowercase.

2. Punctuation Removal: Use Python's re (regex) library or string methods to remove special characters (!, ., ,, :, ;, ..., ').

3. Tokenization: Split the string into a list of words based on whitespace.

4. Stopword Removal: Filter out words found in this list: ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'].

5. Simple Stemming: Create a helper function that removes suffixes 'ing', 'ly', 'ed', and 's' from the end of words.


Note: This is a "Naive" stemmer. It will break words like "sing" -> "s". This illustrates why we need libraries!
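
To make the note above concrete, here is a minimal sketch of such a suffix stripper. The helper name `strip_one_suffix` and the choice to remove only the first matching suffix are assumptions made for illustration; the assignment does not prescribe them.

```python
# Hypothetical helper: strips at most one suffix, checked in the listed order.
SUFFIXES = ("ing", "ly", "ed", "s")

def strip_one_suffix(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# "jumps" and "transforming" come out fine; "sing" and "bus" are damaged.
for word in ["jumps", "transforming", "sing", "bus"]:
    print(word, "->", strip_one_suffix(word))
```

The rule has no notion of which strings are real words, so it cannot tell the suffix in "jumps" from the final letters of "bus".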

**Task:** Run this function on the first sentence of the corpus and print the result.

In [None]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [60]:
import re

STOPWORDS = {'the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'}

def naive_stem(word):
  # Naive stemmer: suffixes are checked in order, so more than one may be stripped
  if word.endswith('ing'):
    word = word[:-3]
  if word.endswith(('ly', 'ed')):
    word = word[:-2]
  if word.endswith('s'):
    word = word[:-1]
  return word

def clean_text_scratch(text):
  text = text.lower()                                # 1. lowercase
  text = re.sub(r"[!.,:;']", "", text)               # 2. remove punctuation
  words = text.split()                               # 3. tokenize on whitespace
  words = [w for w in words if w not in STOPWORDS]   # 4. remove stopwords
  words = [naive_stem(w) for w in words]             # 5. naive stemming
  return ' '.join(words)

print(clean_text_scratch(corpus[0]))

artificial intelligence transform world however ethical concern remain


**1.2: Preprocessing Using Tools**

**Goal:** Use the nltk library to perform the same cleaning on the entire corpus.

**Steps:**

1. Use nltk.tokenize.word_tokenize.
2. Use nltk.corpus.stopwords.
3. Use nltk.stem.WordNetLemmatizer to convert words to their root (e.g., "jumps" $\to$ "jump", "transforming" $\to$ "transform").


**Task:** Print the cleaned, lemmatized tokens for the second sentence (The pizza review).

In [None]:
import re
import nltk

# NLTK resources (downloaded once per environment)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def clean_text(text):
  lemm = WordNetLemmatizer()
  stop_words = set(stopwords.words('english'))
  # Lowercase and strip punctuation first so that "The" and "delicious," match the stopword list
  text = re.sub(r"[!.,:;']", "", text.lower())
  tokens = word_tokenize(text)
  # lemmatize() defaults to pos='n', so verb forms such as "transforming" are left unchanged
  tokens = [lemm.lemmatize(tok) for tok in tokens if tok not in stop_words]
  return ' '.join(tokens)

clean_corpus = [clean_text(sentence) for sentence in corpus]
print(clean_corpus[1])

pizza absolutely delicious service terrible wont go back


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


# **Part 2: Text Representation**

**2.1: Bag of Words (BoW)**

**Logic:**

**Build Vocabulary:** Create a list of all unique words in the entire corpus (after cleaning). Sort them alphabetically.

**Vectorize:** Write a function that takes a sentence and returns a list of numbers. Each number represents the count of a vocabulary word in that sentence.

**Task:** Print the unique Vocabulary list. Then, print the BoW vector for: "The quick brown fox jumps over the lazy dog."
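
The two steps above can be sketched on a toy example before applying them to the real corpus. The three sentences and the names `docs` and `bow_vector` are made up for illustration; they assume the text has already been cleaned.

```python
from collections import Counter

# Toy, already-cleaned sentences (hypothetical data, not the assignment corpus)
docs = ["quick brown fox", "lazy dog", "quick quick dog"]

# Build vocabulary: unique words across all sentences, sorted alphabetically
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bow_vector(sentence, vocab):
    # Count each word once, then read the counts off in vocabulary order
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

print(vocabulary)
print(bow_vector("quick quick dog", vocabulary))
```

Every vector has the same length as the vocabulary, and words missing from a sentence simply get a count of 0.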

In [None]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# NLTK downloads (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def clean_text(text):
  lemm = WordNetLemmatizer()
  stop_words = set(stopwords.words('english'))
  # Lowercase and strip punctuation before filtering so "The" and "dog." match correctly
  text = re.sub(r"[!.,:;']", "", text.lower())
  return ' '.join(lemm.lemmatize(word) for word in text.split() if word not in stop_words)

clean_corpus = [clean_text(sentence) for sentence in corpus]
print("corpus after cleaning:", clean_corpus)

# Vocabulary: every unique word in the cleaned corpus, sorted alphabetically
un_words = sorted({word for sent in clean_corpus for word in sent.split()})
print("unique words sorted alphabetically:", un_words)

def vectorize(sent):
  # One count per vocabulary word, in vocabulary order
  words_sent = sent.split()
  return [words_sent.count(word) for word in un_words]

print(vectorize(clean_corpus[2]))

corpus after cleaning: ['artificial intelligence transforming world however ethical concern remain', 'pizza absolutely delicious service terrible wont go back', 'quick brown fox jump lazy dog', 'question whether ti nobler mind', 'data science involves statistic linear algebra machine learning', 'love machine learning hate math behind']
unique words sorted alphabetically: ['absolutely', 'algebra', 'artificial', 'back', 'behind', 'brown', 'concern', 'data', 'delicious', 'dog', 'ethical', 'fox', 'go', 'hate', 'however', 'intelligence', 'involves', 'jump', 'lazy', 'learning', 'linear', 'love', 'machine', 'math', 'mind', 'nobler', 'pizza', 'question', 'quick', 'remain', 'science', 'service', 'statistic', 'terrible', 'ti', 'transforming', 'whether', 'wont', 'world']
[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]



**2.2: BoW Using Tools**

**Task:** Use sklearn.feature_extraction.text.CountVectorizer.

**Steps:**

1. Instantiate the vectorizer.

2. fit_transform the raw corpus.

3. Convert the result to an array (.toarray()) and print it.

In [None]:
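from sklearn.feature_extraction.text import CountVectorizer

# The vocabulary printed below matches the cleaned corpus (clean_corpus from 2.1);
# fitting on the raw `corpus` would also keep stopwords and inflected forms such as "jumps".
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(clean_corpus)
feature_names = vectorizer.get_feature_names_out()
x_array = x.toarray()
print("unique word list:", feature_names)
print("Bag of words:", x_array)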

unique word list: ['absolutely' 'algebra' 'artificial' 'back' 'behind' 'brown' 'concern'
 'data' 'delicious' 'dog' 'ethical' 'fox' 'go' 'hate' 'however'
 'intelligence' 'involves' 'jump' 'lazy' 'learning' 'linear' 'love'
 'machine' 'math' 'mind' 'nobler' 'pizza' 'question' 'quick' 'remain'
 'science' 'service' 'statistic' 'terrible' 'ti' 'transforming' 'whether'
 'wont' 'world']
Bag of words: [[0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1
  0 0 1]
 [1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0
  0 1 0]
 [0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
  0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0
  1 0 0]
 [0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0
  0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0]]


**2.3: TF-IDF From Scratch (The Math)**

**Goal:** Manually calculate the score for the word "machine" in the last sentence:

"I love machine learning, but I hate the math behind it."

**Formula:**

*TF (Term Frequency):* $\frac{\text{Count of 'machine' in sentence}}{\text{Total words in sentence}}$

*IDF (Inverse Document Frequency):* $\log(\frac{\text{Total number of documents}}{\text{Number of documents containing 'machine'}})$ (Use math.log).

**Result:** TF * IDF.

**Task:** Print your manual calculation result.
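
As a worked instance of the formula, the arithmetic can be laid out step by step. This sketch assumes plain whitespace tokenization of the raw sentence (11 tokens, punctuation left attached) and takes the document counts directly from the corpus: "machine" appears in 2 of the 6 sentences. The variable names are illustrative.

```python
import math

sentence = "I love machine learning, but I hate the math behind it."
tokens = sentence.split()                    # whitespace tokens, punctuation left attached
tf = tokens.count("machine") / len(tokens)   # 1 occurrence out of 11 tokens

total_docs = 6        # sentences in the corpus
docs_with_term = 2    # "machine" occurs in the last two sentences
idf = math.log(total_docs / docs_with_term)  # natural logarithm

print(f"TF = {tf:.4f}, IDF = {idf:.4f}, TF-IDF = {tf * idf:.4f}")
```

The base of the logarithm only rescales the result: `math.log10` gives scores smaller by a factor of ln(10), so the ranking of words is unchanged.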

In [None]:
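import math

def tf(sent, word):
  # Term frequency: exact-token matches divided by the number of whitespace-separated tokens
  words_sent = sent.split()
  if not words_sent:
    return 0
  return words_sent.count(word) / len(words_sent)

def idf(word):
  # Document frequency: number of sentences containing `word` as an exact token
  # (raw text is used, so "learning," would not match "learning")
  count_docx = sum(1 for doc in corpus if word in doc.split())
  if count_docx == 0:
    return 0  # word absent from the corpus; avoids division by zero
  # Base-10 logarithm; math.log (natural log) would scale every score by ln(10)
  return math.log10(len(corpus) / count_docx)

def tfidf_for_word(sent, word):
  print("TF-IDF of", word, "=", tf(sent, word) * idf(word))

tfidf_for_word(corpus[5], 'machine')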


TF-IDF of machine = 0.043374659519969314


**2.4: TF-IDF Using Tools**

**Task:** Use sklearn.feature_extraction.text.TfidfVectorizer.

**Steps:** Fit it on the corpus and print the vector for the first sentence.

**Observation:** Compare the score of unique words (like "Intelligence") vs common words (like "is"). Which is higher?
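
One way to reason about this observation: with its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), `TfidfVectorizer` computes $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$ and then L2-normalizes each row. The sketch below redoes that arithmetic by hand for the first sentence. The document frequencies are counted from the corpus (for example, "the" appears in 5 of the 6 sentences), and the variable names are illustrative.

```python
import math

n_docs = 6
# Document frequency of each token in the first sentence (each token occurs once there)
doc_freq = {
    "artificial": 1, "intelligence": 1, "is": 2, "transforming": 1, "the": 5,
    "world": 1, "however": 1, "ethical": 1, "concerns": 1, "remain": 1,
}

def smoothed_idf(df, n=n_docs):
    # Smoothed IDF as used by scikit-learn's defaults
    return math.log((1 + n) / (1 + df)) + 1

raw = {term: smoothed_idf(df) for term, df in doc_freq.items()}  # tf is 1 for every term
norm = math.sqrt(sum(v * v for v in raw.values()))               # L2 norm of the row
weights = {term: v / norm for term, v in raw.items()}

for term in ("intelligence", "is", "the"):
    print(term, round(weights[term], 4))
```

The fewer documents a word appears in, the larger its IDF, so a word found only in this sentence outweighs one shared with other sentences.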

In [None]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_ = TfidfVectorizer()
tf_idf_vector = tf_idf_.fit_transform(corpus)
tf_idf_array = tf_idf_vector.toarray()
print(tf_idf_array[0])  # TF-IDF vector of the first sentence

# Compare a rare word ("intelligence") with a common one ("is") in the first sentence
feature_names = list(tf_idf_.get_feature_names_out())
index1 = feature_names.index('intelligence')
index2 = feature_names.index('is')
if tf_idf_array[0, index1] > tf_idf_array[0, index2]:
  print("tf-idf score for rare words are larger than common words\nintelligence:",tf_idf_array[0,index1],' ',"is:",tf_idf_array[0,index2])
else:
  print("tf-idf score for rare words are smaller than common words")



[0.         0.         0.         0.33454543 0.         0.
 0.         0.         0.         0.33454543 0.         0.
 0.         0.33454543 0.         0.         0.         0.33454543
 0.         0.33454543 0.         0.27433204 0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.33454543 0.         0.         0.
 0.         0.         0.17139656 0.         0.         0.33454543
 0.         0.         0.         0.33454543]
tf-idf score for rare words are larger than common words
intelligence: 0.3345454287016015   is: 0.27433203727401334


# **Part 3: Word Embeddings**

**3.1: Word2Vec Using Tools**

**Task:** Train a model using gensim.models.Word2Vec.

**Steps:**

1. Pass your cleaned tokenized corpus (from Part 1.2) to Word2Vec.

2. Set min_count=1 (since our corpus is small, we want to keep all words).

3. Set vector_size=10 (small vector size for easy viewing).

**Experiment:** Print the vector for the word "learning".
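
Step 1 asks for a tokenized corpus, that is, a list of token lists rather than a list of strings. A minimal sketch of the difference, using made-up cleaned sentences (the names `cleaned` and `tokenized` are illustrative):

```python
cleaned = ["love machine learning", "data science"]   # hypothetical cleaned sentences

# One list of word tokens per sentence
tokenized = [sentence.split() for sentence in cleaned]
print(tokenized)

# Iterating over a plain string yields single characters instead of words
print(list(cleaned[0])[:4])
```

If strings are passed where token lists are expected, the resulting vocabulary is likely to consist of single characters, and looking up a whole word such as "learning" would fail.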

In [None]:
!pip install gensim
from gensim.models import Word2Vec

# Word2Vec expects each sentence as a list of tokens; clean_corpus holds strings,
# so split each one (a bare string would be read character by character).
tokenized_corpus = [sentence.split() for sentence in clean_corpus]
model = Word2Vec(sentences=tokenized_corpus, vector_size=10, min_count=1)
print(model.wv.get_vector("learning"))

[-0.00536899  0.00237282  0.05103846  0.0900786  -0.09300981 -0.07119522
  0.06463154  0.08977251 -0.0501886  -0.03764008]


**3.3: Pre-trained GloVe (Understanding Global Context)**

**Task:** Use gensim.downloader to load 'glove-wiki-gigaword-50'

**Analogy Task:** Compute the famous analogy: $\text{King} - \text{Man} + \text{Woman} = ?$

Use model.most_similar(positive=['woman', 'king'], negative=['man']).

**Question:** Does the model correctly guess "Queen"?
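
The idea behind the analogy can be sketched with hand-made vectors. The 2-D embeddings below are hypothetical (one axis loosely standing for royalty, the other for gender) and were chosen so the arithmetic is easy to follow. Real GloVe vectors are learned from text, and gensim's `most_similar` works on normalized vectors, so this is an illustration of the principle rather than a reimplementation.

```python
import math

# Hypothetical 2-D embeddings: [royalty, gender]
vectors = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank the remaining words by cosine similarity to the target (input words excluded)
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)
```

Subtracting "man" and adding "woman" flips the gender component while leaving the royalty component intact, which is why the nearest remaining vector is "queen".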

In [None]:
import gensim.downloader as api

# Load pre-trained GloVe model
glove_model = api.load('glove-wiki-gigaword-50')

print(glove_model.most_similar(positive=['king','woman'],negative=['man']))

[('queen', 0.8523604273796082), ('throne', 0.7664334177970886), ('prince', 0.7592144012451172), ('daughter', 0.7473883628845215), ('elizabeth', 0.7460219860076904), ('princess', 0.7424570322036743), ('kingdom', 0.7337412238121033), ('monarch', 0.721449077129364), ('eldest', 0.7184861898422241), ('widow', 0.7099431157112122)]


# **Part 5: Sentiment Analysis (The Application)**

**Concept:** Sentiment Analysis determines whether a piece of text is Positive, Negative, or Neutral. We will use VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK. VADER is specifically designed for social media text; it understands that capital letters ("LOVE"), punctuation ("!!!"), and emojis change the sentiment intensity.

**Task:**

1. Initialize the SentimentIntensityAnalyzer.

2. Pass the Pizza Review (corpus[1]) into the analyzer.

3. Pass the Math Complaint (corpus[5]) into the analyzer.

**Analysis:** Look at the compound score for both.

**Compound Score Range:** -1 (Most Negative) to +1 (Most Positive).

Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?
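
To interpret a compound score, a common convention for VADER is to treat values of 0.05 or above as positive, values of -0.05 or below as negative, and anything in between as neutral. A minimal sketch of that rule follows; the function name `label_from_compound` and the sample values are illustrative and are not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    # Conventional VADER cut-offs: >= 0.05 positive, <= -0.05 negative, otherwise neutral
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

for compound in (0.6, 0.0, -0.4):   # illustrative values
    print(compound, label_from_compound(compound))
```

Under this rule a sentence counts as neutral or mixed only when its positive and negative words nearly cancel out.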

In [None]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon (run once)
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
scores1 = sia.polarity_scores(corpus[1])  # pizza review
scores5 = sia.polarity_scores(corpus[5])  # math complaint
print("polarity scores for pizza:",scores1)
print("polarity scores for math:",scores5)
print("delicious and terrible in the same sentence results in a compound score of",scores1['compound'],"indicating the pizza review is moderately negative")


polarity scores for pizza: {'neg': 0.223, 'neu': 0.644, 'pos': 0.134, 'compound': -0.3926}
polarity scores for math: {'neg': 0.345, 'neu': 0.478, 'pos': 0.177, 'compound': -0.5346}
delicious and terrible in the same sentence results in a compound score of -0.3926 indicating the pizza review is moderately negative
