# **Part 1: Text Preprocessing**



**Data:** given in the code cell below.

**1.1: Preprocessing From Scratch**

**Goal:** Write a function clean_text_scratch(text) that performs the following without using NLTK or spaCy:

1. Lowercasing: Convert text to lowercase.

2. Punctuation Removal: Use Python's re (regex) library or string methods to remove special characters (!, ., ,, :, ;, ..., ').

3. Tokenization: Split the string into a list of words based on whitespace.

4. Stopword Removal: Filter out words found in this list: ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'].

5. Simple Stemming: Create a helper function that removes suffixes 'ing', 'ly', 'ed', and 's' from the end of words.


Note: This is a "Naive" stemmer. It will break words like "sing" -> "s". This illustrates why we need libraries!
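The failure mode in the note above can be reproduced with a minimal sketch. The name `naive_stem` is hypothetical, and the sketch assumes the suffixes are tried in the order listed in step 5 and that at most one suffix is stripped, from the end of the word only.

```python
def naive_stem(word, suffixes=("ing", "ly", "ed", "s")):
    """Strip the first matching suffix from the end of the word (naive on purpose)."""
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Two words the stemmer handles well, then two it damages.
for word in ["jumps", "transforming", "sing", "this"]:
    print(word, "->", repr(naive_stem(word)))
```

"jumps" and "transforming" come out as expected, while "sing" collapses to "s" and "this" to "thi", because the function has no notion of which endings are real suffixes.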

**Task:** Run this function on the first sentence of the corpus and print the result.

In [2]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [3]:
import re
##1.1
STOP_WORDS = ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or']

def simple_stem(word): # helper function for simple stemming
  # strip the first matching suffix, and only from the end of the word
  for suffix in ('ing', 'ly', 'ed', 's'):
    if word.endswith(suffix):
      return word[:-len(suffix)]
  return word

def clean_text_scratch(text):
  text = text.lower() # converting text to lowercase
  text = re.sub(r"[^\w\s]", "", text) # removing punctuation
  tokens = text.split() # tokenisation on whitespace (no empty tokens)
  tokens = [word for word in tokens if word not in STOP_WORDS] # removing stop words
  return [simple_stem(word) for word in tokens] # simple stemming

print(clean_text_scratch(corpus[0])) # first sentence of the corpus

['artificial', 'intelligence', 'transform', 'world', 'however', 'ethical', 'concern', 'remain']


**1.2: Preprocessing Using Tools**

**Goal:** Use the nltk library to perform the same cleaning on the entire corpus.

**Steps:**

1. Use nltk.tokenize.word_tokenize.
2. Use nltk.corpus.stopwords.
3. Use nltk.stem.WordNetLemmatizer to convert words to their root (e.g., "jumps" $\to$ "jump", "transforming" $\to$ "transform").


**Task:** Print the cleaned, lemmatized tokens for the second sentence (The pizza review).

In [92]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
def clean_sentence(sentence):

  sentence = sentence.lower() # converting to lowercase

  translator = str.maketrans('', '', string.punctuation)
  clean_text = sentence.translate(translator)  # removing punctuation

  tokens = word_tokenize(clean_text) #tokenization

  stop_words = set(stopwords.words('english'))
  tokens = [word for word in tokens if word not in stop_words] #removing stop_words

  lemmatizer = WordNetLemmatizer()
  sentence = [lemmatizer.lemmatize(word,pos = 'v') for word in tokens] #lemmatising the verbs
  return sentence

cleaned_corpus = [clean_sentence(sentence) for sentence in corpus]
print(cleaned_corpus[1])

['pizza', 'absolutely', 'delicious', 'service', 'terrible', 'wont', 'go', 'back']




# **Part 2: Text Representation**

**2.1: Bag of Words (BoW)**

**Logic:**

**Build Vocabulary:** Create a list of all unique words in the entire corpus (after cleaning). Sort them alphabetically.

**Vectorize:** Write a function that takes a sentence and returns a list of numbers. Each number represents the count of a vocabulary word in that sentence.

**Task:** Print the unique Vocabulary list. Then, print the BoW vector for: "The quick brown fox jumps over the lazy dog."

In [72]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
# NLTK downloads (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
given = "The quick brown fox jumps over the lazy dog."

def clean_sentence2(sentence):
  sentence = sentence.lower()
  translator = str.maketrans('', '', string.punctuation)
  clean_text = sentence.translate(translator)
  tokens = word_tokenize(clean_text)
  lemmatizer = WordNetLemmatizer()
  sentence = [lemmatizer.lemmatize(word,pos = 'v') for word in tokens]
  return sentence

def unique_words(text):     # return the unique tokens in first-seen order
  unique_text = []
  for word in text:
    if word not in unique_text:
      unique_text.append(word)
  return unique_text

def clean_doc(doc):     # build the sorted vocabulary of the whole corpus
  tokens = [word for sentence in doc for word in clean_sentence2(sentence)]
  return sorted(unique_words(tokens))

def vectorize(sentence, vocabulary):  # count how often each vocabulary word occurs in the sentence
  tokens = clean_sentence2(sentence)
  return [tokens.count(w) for w in vocabulary]

vocabulary = clean_doc(corpus)
print(vocabulary)
print(vectorize(given, vocabulary))


['absolutely', 'algebra', 'and', 'artificial', 'back', 'be', 'behind', 'brown', 'but', 'concern', 'data', 'delicious', 'dog', 'ethical', 'fox', 'go', 'hate', 'however', 'i', 'in', 'intelligence', 'involve', 'it', 'jump', 'lazy', 'learn', 'linear', 'love', 'machine', 'math', 'mind', 'nobler', 'not', 'or', 'over', 'pizza', 'question', 'quick', 'remain', 'science', 'service', 'statistics', 'terrible', 'that', 'the', 'tis', 'to', 'transform', 'whether', 'wont', 'world']
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0]




**2.2: BoW Using Tools**

**Task:** Use sklearn.feature_extraction.text.CountVectorizer.

**Steps:**

1. Instantiate the vectorizer.

2. fit_transform the raw corpus.

3. Convert the result to an array (.toarray()) and print it.

In [71]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)  # learn the vocabulary and count the words
vocab = vectorizer.get_feature_names_out()
print(vocab)
print(bow_matrix.toarray()[2].tolist())  # BoW vector for corpus[2], the "quick brown fox" sentence

['absolutely' 'algebra' 'and' 'artificial' 'back' 'be' 'behind' 'brown'
 'but' 'concerns' 'data' 'delicious' 'dog' 'ethical' 'fox' 'go' 'hate'
 'however' 'in' 'intelligence' 'involves' 'is' 'it' 'jumps' 'lazy'
 'learning' 'linear' 'love' 'machine' 'math' 'mind' 'nobler' 'not' 'or'
 'over' 'pizza' 'question' 'quick' 'remain' 'science' 'service'
 'statistics' 'terrible' 'that' 'the' 'tis' 'to' 'transforming' 'was'
 'whether' 'won' 'world']
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0]
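The vocabulary printed above contains 'won' but no 'i', unlike the manual vocabulary in 2.1. A likely explanation is CountVectorizer's default tokenizer, which lowercases the text and keeps only runs of two or more word characters. The sketch below imitates that behaviour with the standard library alone; `sklearn_like_tokens` is a hypothetical helper name, not part of scikit-learn.

```python
import re

# Default token_pattern of CountVectorizer: runs of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def sklearn_like_tokens(text):
    """Approximate the default analyzer: lowercase, then regex tokenization."""
    return re.findall(TOKEN_PATTERN, text.lower())

# "I" is dropped (one character) and "won't" is cut at the apostrophe.
print(sklearn_like_tokens("I won't go back."))
```

Because the apostrophe is not a word character, "won't" yields "won", and the single-letter "t" and "I" are discarded.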


**2.3: TF-IDF From Scratch (The Math)**

**Goal:** Manually calculate the score for the word "machine" in the last sentence:

"I love machine learning, but I hate the math behind it."

**Formula:**

*TF (Term Frequency):* $\frac{\text{Count of 'machine' in sentence}}{\text{Total words in sentence}}$

*IDF (Inverse Document Frequency):* $\log(\frac{\text{Total number of documents}}{\text{Number of documents containing 'machine'}})$ (Use math.log).

**Result:** TF * IDF.

**Task:** Print your manual calculation result.

In [138]:
#2.3
import math

def score_of_word(word,sentence,doc):
  sentence = clean_sentence2(sentence)
  TF = sentence.count(word)/len(sentence) # term frequency

  # number of documents whose cleaned tokens contain the word (not a raw substring match)
  den = sum(1 for s in doc if word in clean_sentence2(s))
  if den == 0:
    return 0.0 # the word never occurs in the corpus
  IDF = math.log(len(doc)/den) #inverse document frequency

  return TF*IDF

print(score_of_word("machine",corpus[5],corpus))

0.09987384442437362
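The printed value can be checked by hand. After cleaning with clean_sentence2 (which keeps stopwords), the last sentence has 11 tokens, "machine" occurs once in it, and it appears in 2 of the 6 documents:

$$
\mathrm{TF} = \frac{1}{11} \approx 0.0909, \qquad
\mathrm{IDF} = \ln\frac{6}{2} = \ln 3 \approx 1.0986, \qquad
\mathrm{TF} \times \mathrm{IDF} \approx 0.0909 \times 1.0986 \approx 0.0999
$$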


**2.4: TF-IDF Using Tools**

**Task:** Use sklearn.feature_extraction.text.TfidfVectorizer.

**Steps:** Fit it on the corpus and print the vector for the first sentence.

**Observation:** Compare the score of unique words (like "Intelligence") vs common words (like "is"). Which is higher?

In [131]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(lowercase=True)
tfidf_matrix = vectorizer.fit_transform(corpus)

first_sentence_vector = tfidf_matrix[0].toarray()[0]  #for first sentence
vocab = vectorizer.get_feature_names_out()

for word, score in zip(vocab, first_sentence_vector):
    if score > 0:
        print(f"{word:<15}\t{score:.4f}")

# compare the score of a rare word ("intelligence") with a common word ("is")
intelligence_idx = list(vocab).index('intelligence')
intelligence_score = first_sentence_vector[intelligence_idx]
is_idx = list(vocab).index('is')
is_score = first_sentence_vector[is_idx]
print(f"intelligence score = {intelligence_score} >  is score {is_score}")

artificial     	0.3345
concerns       	0.3345
ethical        	0.3345
however        	0.3345
intelligence   	0.3345
is             	0.2743
remain         	0.3345
the            	0.1714
transforming   	0.3345
world          	0.3345
intelligence score = 0.3345454287016015 >  is score 0.27433203727401334
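These scores do not follow the plain TF * IDF formula from 2.3. The sketch below reproduces them by hand, assuming scikit-learn's default settings (smooth_idf=True, norm='l2', raw counts as TF). The document frequencies are read off the corpus, and `smooth_idf` is a hypothetical helper name used only for this illustration.

```python
import math

n_docs = 6  # number of sentences in the corpus
# Document frequencies read off the corpus; words unique to sentence 1 have df = 1.
df = {"intelligence": 1, "is": 2, "the": 5}

def smooth_idf(doc_freq, n=n_docs):
    """Smoothed IDF as used by default: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + doc_freq)) + 1

# Sentence 1 has eight words with df = 1, plus "is" and "the", each occurring once.
weights = [smooth_idf(1)] * 8 + [smooth_idf(df["is"]), smooth_idf(df["the"])]
norm = math.sqrt(sum(w * w for w in weights))  # L2 norm of the row

for word in df:
    print(f"{word:<15}{smooth_idf(df[word]) / norm:.4f}")
```

The three values agree with the 0.3345, 0.2743 and 0.1714 printed above: the rarer a word is across documents, the larger its weight within the sentence.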


# **Part 3: Word Embeddings**

**3.1: Word2Vec Using Tools**

**Task:** Train a model using gensim.models.Word2Vec.

**Steps:**

1. Pass your cleaned tokenized corpus (from Part 1.2) to Word2Vec.

2. Set min_count=1 (since our corpus is small, we want to keep all words).

3. Set vector_size=10 (small vector size for easy viewing).

**Experiment:** Print the vector for the word "learning".

In [96]:
!pip install gensim
from gensim.models import Word2Vec

#3.1
model = Word2Vec(
    sentences=cleaned_corpus,
    vector_size=10,
    min_count=1
)

vector = model.wv['learn']   # "learning" is not in the vocabulary: Part 1.2 lemmatized it to "learn"
print(vector)

[-0.00536899  0.00237282  0.05103846  0.0900786  -0.09300981 -0.07119522
  0.06463154  0.08977251 -0.0501886  -0.03764008]


**3.3: Pre-trained GloVe (Understanding Global Context)**

**Task:** Use gensim.downloader to load 'glove-wiki-gigaword-50'

**Analogy Task:** Compute the famous analogy: $\text{King} - \text{Man} + \text{Woman} = ?$

Use model.most_similar(positive=['woman', 'king'], negative=['man']).

**Question:** Does the model correctly guess "Queen"?

In [102]:
import gensim.downloader as api
model = api.load('glove-wiki-gigaword-50')

In [113]:
analogy = model.most_similar(positive=['woman', 'king'], negative=['man'])
print(f"King-Man+Woman = {analogy[0][0]}")
# Yes: the model's top-ranked answer is "queen".

King-Man+Woman = queen
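The arithmetic behind the analogy can be illustrated without the GloVe download. The sketch below uses hypothetical 2-D toy vectors (not GloVe values) and a simplified procedure: it adds and subtracts the raw vectors, then picks the remaining word with the highest cosine similarity. gensim additionally normalizes the vectors before combining them, which is omitted here.

```python
import math

# Hypothetical toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":   [0.9,  0.8],
    "queen":  [0.9, -0.8],
    "man":    [0.1,  0.8],
    "woman":  [0.1, -0.8],
    "prince": [0.8,  0.7],
    "apple":  [-0.2, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy(["woman", "king"], ["man"], toy_vectors))
```

With these toy values, king - man + woman lands on the vector chosen for "queen", so "queen" is returned. Real embeddings only approximate this geometry.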


# **Part 5: Sentiment Analysis (The Application)**

**Concept:** Sentiment Analysis determines whether a piece of text is Positive, Negative, or Neutral. We will use VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK. VADER is specifically designed for social media text; it understands that capital letters ("LOVE"), punctuation ("!!!"), and emojis change the sentiment intensity.

**Task:**

1. Initialize the SentimentIntensityAnalyzer.

2. Pass the Pizza Review (corpus[1]) into the analyzer.

3. Pass the Math Complaint (corpus[5]) into the analyzer.

**Analysis:** Look at the compound score for both.

**Compound Score Range:** -1 (Most Negative) to +1 (Most Positive).

Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?

In [114]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
# Download VADER (run once)
nltk.download('vader_lexicon')



In [129]:
sa = SentimentIntensityAnalyzer()
def Sentiment_analyser(review):
  scores = sa.polarity_scores(review)
  print(scores)
  print(scores['compound'])
  # map the compound score from [-1, 1] onto five equal bins of width 0.4, numbered 0 to 4
  rate = min(4, int((scores['compound'] + 1) / 0.4))

  print(f"Rate of the review in scale of 0 to 4 = {rate}")

Sentiment_analyser(corpus[1])      #"The pizza was absolutely delicious, but the service was terrible ... I won't go back."
Sentiment_analyser(corpus[5])      #"I love machine learning, but I hate the math behind it."

# The compound score is higher for corpus[1] (-0.3926) than for corpus[5] (-0.5346).
# Both sentences mix positive and negative words, but neither compound score is neutral: both are negative.
# VADER gives more weight to the clause after "but", which is negative in both sentences.

{'neg': 0.223, 'neu': 0.644, 'pos': 0.134, 'compound': -0.3926}
-0.3926
Rate of the review in scale of 0 to 4 = 1
{'neg': 0.345, 'neu': 0.478, 'pos': 0.177, 'compound': -0.5346}
-0.5346
Rate of the review in scale of 0 to 4 = 1
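To answer the question about a mixed or neutral result, the compound scores can be turned into labels. The sketch below assumes the commonly cited cut-offs of +0.05 and -0.05 for VADER's compound score; `label_compound` is a hypothetical helper, and the two scores are copied from the output above.

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label using the assumed cut-offs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores taken from the analyzer output above.
for name, compound in [("pizza review", -0.3926), ("math complaint", -0.5346)]:
    print(name, "->", label_compound(compound))
```

Under these cut-offs both sentences are labelled negative rather than neutral, even though each contains one strongly positive word.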
