# **Part 1: Text Preprocessing**



**Data:** given in the code cell below.

**1.1: Preprocessing From Scratch**

**Goal:** Write a function clean_text_scratch(text) that performs the following without using NLTK or spaCy:

1. Lowercasing: Convert text to lowercase.

2. Punctuation Removal: Use Python's re (regex) library or string methods to remove special characters (!, ., ,, :, ;, ..., ').

3. Tokenization: Split the string into a list of words based on whitespace.

4. Stopword Removal: Filter out words found in this list: ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'].

5. Simple Stemming: Create a helper function that removes suffixes 'ing', 'ly', 'ed', and 's' from the end of words.


Note: This is a "Naive" stemmer. It will break words like "sing" -> "s". This illustrates why we need libraries!
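
To make the note above concrete, here is a minimal sketch of such a naive suffix stripper, run on a few words chosen for illustration. The name naive_stem and the word list are assumptions, not part of the assignment; the suffix order follows step 5.

```python
# Hypothetical helper illustrating the naive stemmer described in step 5.
SUFFIXES = ("ing", "ly", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix; no check that a sensible stem remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Illustrative words (not all taken from the corpus)
for w in ["jumps", "transforming", "sing", "bed", "family"]:
    print(f"{w!r} -> {naive_stem(w)!r}")
```

The first two words come out as intended ("jump", "transform"), while "sing", "bed" and "family" are cut down to "s", "b" and "fami".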

**Task:** Run this function on the first sentence of the corpus and print the result.

In [2]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [3]:
import re
# write rest of the code here

STOPWORDS = {'the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'}
SUFFIXES = ("ing", "ly", "ed", "s")

def simple_stem(word):
    # Naive stemmer: strip the first matching suffix, if any
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def clean_text_scratch(text):
    text = text.lower()                                   # lowercasing
    text = re.sub(r"[!.,:;…']", "", text)                 # punctuation removal
    tokens = text.split()                                 # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    return [simple_stem(t) for t in tokens]               # simple stemming

print(clean_text_scratch(corpus[0]))

['artificial', 'intelligence', 'transform', 'world', 'however', 'ethical', 'concern', 'remain']


**1.2: Preprocessing Using Tools**

**Goal:** Use the nltk library to perform the same cleaning on the entire corpus.

**Steps:**

1. Use nltk.tokenize.word_tokenize.
2. Use nltk.corpus.stopwords.
3. Use nltk.stem.WordNetLemmatizer to convert words to their root (e.g., "jumps" $\to$ "jump"). Note that lemmatize() treats every word as a noun by default, so a verb form such as "transforming" is reduced to "transform" only when pos='v' is passed.


**Task:** Print the cleaned, lemmatized tokens for the second sentence (the pizza review).

In [4]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string


lemmat = WordNetLemmatizer()
a=corpus[1].lower()
translator = str.maketrans('', '', string.punctuation)
a = a.translate(translator)
stop = set(stopwords.words('english'))


tokens = word_tokenize(a)

cleaned_lemmatized = [lemmat.lemmatize(word) for word in tokens if  word not in stop]

print(f"Cleaned and lemmatized sentence: {cleaned_lemmatized}")


Cleaned and lemmatized sentence: ['pizza', 'absolutely', 'delicious', 'service', 'terrible', 'wont', 'go', 'back']


# **Part 2: Text Representation**

**2.1: Bag of Words (BoW)**

**Logic:**

**Build Vocabulary:** Create a list of all unique words in the entire corpus (after cleaning). Sort them alphabetically.

**Vectorize:** Write a function that takes a sentence and returns a list of numbers. Each number represents the count of a vocabulary word in that sentence.
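
The two steps above can be sketched without any libraries. This is a minimal illustration on two made-up, already-cleaned token lists; the names build_vocabulary and bow_vector_sketch and the toy documents are assumptions, and the cleaning step is deliberately left out.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Return the alphabetically sorted list of unique tokens across all documents."""
    return sorted({token for doc in tokenized_docs for token in doc})

def bow_vector_sketch(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Toy, already-cleaned documents (hypothetical, not the assignment corpus)
docs = [["quick", "brown", "fox"], ["lazy", "dog", "dog"]]
vocabulary = build_vocabulary(docs)
print(vocabulary)                              # ['brown', 'dog', 'fox', 'lazy', 'quick']
print(bow_vector_sketch(docs[1], vocabulary))  # [0, 2, 0, 1, 0]
```

Each position in the vector lines up with the word at the same position in the vocabulary, which is why the vocabulary has to be fixed and sorted before any sentence is vectorized.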

**Task:** Print the unique Vocabulary list. Then, print the BoW vector for: "The quick brown fox jumps over the lazy dog."

In [6]:
import string

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# NLTK downloads (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Build these once instead of on every call
lemmat = WordNetLemmatizer()
stop = set(stopwords.words('english'))
translator = str.maketrans('', '', string.punctuation)

def cleanl(t):
    # Lowercase, strip punctuation, tokenize, drop stopwords, lemmatize
    tokens = word_tokenize(t.lower().translate(translator))
    return [lemmat.lemmatize(word) for word in tokens if word not in stop]

# Collect every unique cleaned token in the corpus
vocab = set()
for sentence in corpus:
    vocab.update(cleanl(sentence))

# Sort vocabulary alphabetically
vocabulary = sorted(vocab)
print(f"Unique Vocabulary ({len(vocabulary)} words):")
print(vocabulary)

# Function to create BoW vector for a sentence
def bow_vector(sentence, vocab):
    # Map each vocabulary word to its position for fast lookups
    index = {word: i for i, word in enumerate(vocab)}
    vector = [0] * len(vocab)
    # Reuse cleanl so the sentence is cleaned exactly like the vocabulary
    for word in cleanl(sentence):
        if word in index:
            vector[index[word]] += 1
    return vector

# Test with the third sentence
test_sentence = "The quick brown fox jumps over the lazy dog."
vector = bow_vector(test_sentence, vocabulary)
print(f"\nTest sentence: '{test_sentence}'")
print(f"BoW vector: {vector}")

Unique Vocabulary (39 words):
['absolutely', 'algebra', 'artificial', 'back', 'behind', 'brown', 'concern', 'data', 'delicious', 'dog', 'ethical', 'fox', 'go', 'hate', 'however', 'intelligence', 'involves', 'jump', 'lazy', 'learning', 'linear', 'love', 'machine', 'math', 'mind', 'nobler', 'pizza', 'question', 'quick', 'remain', 'science', 'service', 'statistic', 'terrible', 'ti', 'transforming', 'whether', 'wont', 'world']

Test sentence: 'The quick brown fox jumps over the lazy dog.'
BoW vector: [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]



**2.2: BoW Using Tools**

**Task:** Use sklearn.feature_extraction.text.CountVectorizer.

**Steps:**

1. Instantiate the vectorizer.

2. fit_transform the raw corpus.

3. Convert the result to an array (.toarray()) and print it.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary from the raw corpus and count tokens per document
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)

# The result is a sparse matrix; convert it to a dense array for printing
bow_array = bow_matrix.toarray()
print("Bag of Words Matrix (from CountVectorizer):")
print(bow_array)
print(f"\nShape: {bow_array.shape} (documents x vocabulary size)")
print(f"\nFeature names (vocabulary): {vectorizer.get_feature_names_out()}")

Bag of Words Matrix (from CountVectorizer):
[[0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1]
 [1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0 0 0 0 1 0 1 0 2 0 0 0 2 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
  0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0]
 [0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0
  1 0 0 0 0 0 0 1 2 1 2 0 0 1 0 0]
 [0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0
  0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]]

Shape: (6, 52) (documents x vocabulary size)

Feature names (vocabulary): ['absolutely' 'algebra' 'and' 'artificial' 'back' 'be' 'behind' 'brown'
 'but' 'concerns' 'data' 'delicious' 'dog' 'ethical' 'fox' 'go' 'hate'
 'however' 'in' 'intelligence' 'involves' 'is' 'it' 'jumps' 'lazy'
 'learning' 'lin


**2.3: TF-IDF From Scratch (The Math)**

**Goal:** Manually calculate the score for the word "machine" in the last sentence:

"I love machine learning, but I hate the math behind it."

**Formula:**

*TF (Term Frequency):* $\frac{\text{Count of 'machine' in sentence}}{\text{Total words in sentence}}$

*IDF (Inverse Document Frequency):* $\log(\frac{\text{Total number of documents}}{\text{Number of documents containing 'machine'}})$ (Use math.log).

**Result:** TF * IDF.

**Task:** Print your manual calculation result.

In [22]:
import math
import re

word = "machine"
last_sentence = corpus[5]  # "I love machine learning, but I hate the math behind it."

def tokenize(text):
    # Lowercase and keep only runs of letters/apostrophes, so "learning," becomes "learning"
    return re.findall(r"[a-z']+", text.lower())

# Calculate TF (Term Frequency): exact token matches / total tokens
words_in_sentence = tokenize(last_sentence)
total_words = len(words_in_sentence)
count_word = words_in_sentence.count(word)
tf = count_word / total_words

# Calculate IDF (Inverse Document Frequency): log(total docs / docs containing the word)
total_documents = len(corpus)
docs_with_word = sum(1 for doc in corpus if word in tokenize(doc))
idf = math.log(total_documents / docs_with_word)

# Calculate TF-IDF
tf_idf = tf * idf
print("TF-IDF Score:")
print(f"  TF-IDF = TF × IDF = {tf:.4f} × {idf:.4f} = {tf_idf:.6f}")

TF-IDF Score:
  TF-IDF = TF × IDF = 0.0909 × 1.0986 = 0.099874
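
As a check on the printed value, the same calculation can be written out by hand, assuming the sentence is split into 11 words and that "machine" occurs in 2 of the 6 documents (the last two sentences of the corpus). Since math.log is the natural logarithm:

$$
\text{TF} = \frac{1}{11} \approx 0.0909, \qquad
\text{IDF} = \ln\frac{6}{2} = \ln 3 \approx 1.0986, \qquad
\text{TF} \times \text{IDF} \approx 0.0909 \times 1.0986 \approx 0.0999
$$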


**2.4: TF-IDF Using Tools**

**Task:** Use sklearn.feature_extraction.text.TfidfVectorizer.

**Steps:** Fit it on the corpus and print the vector for the first sentence.

**Observation:** Compare the score of unique words (like "Intelligence") vs common words (like "is"). Which is higher?

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)


tfidf_array = tfidf_matrix.toarray()


first_tfidf_vector = tfidf_array[0]
feature_names = tfidf_vectorizer.get_feature_names_out()

print("TF-IDF Vector for the first sentence:")
print(f"Sentence: {corpus[0]}\n")

print("Non-zero TF-IDF scores:")
for idx, score in enumerate(first_tfidf_vector):
    if score > 0:
        print(f"  {feature_names[idx]}: {score:.4f}")

print("\nObservation:")
print("Words unique to this sentence (like 'intelligence', 'artificial', 'transforming') have the highest scores.")
print("Common words (like 'is', 'the') have lower, but non-zero, scores because they appear in more documents.")
print(f"\nShape of TF-IDF matrix: {tfidf_array.shape}")

TF-IDF Vector for the first sentence:
Sentence: Artificial Intelligence is transforming the world; however, ethical concerns remain!

Non-zero TF-IDF scores:
  artificial: 0.3345
  concerns: 0.3345
  ethical: 0.3345
  however: 0.3345
  intelligence: 0.3345
  is: 0.2743
  remain: 0.3345
  the: 0.1714
  transforming: 0.3345
  world: 0.3345

Observation:
Words unique to this sentence (like 'intelligence', 'artificial', 'transforming') have the highest scores.
Common words (like 'is', 'the') have lower, but non-zero, scores because they appear in more documents.

Shape of TF-IDF matrix: (6, 52)
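
These scores differ from the manual formula in 2.3 because TfidfVectorizer, with its default settings, uses a smoothed IDF, $\ln\frac{1+n}{1+\text{df}} + 1$, and then scales each document vector to unit Euclidean length. The following is a rough standard-library sketch of that computation for the first sentence. The helper names are hypothetical, the corpus is repeated so the block runs on its own, and the regular expression is meant to mimic the vectorizer's default tokenization (lowercased words of two or more characters).

```python
import math
import re
from collections import Counter

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it.",
]

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(doc) for doc in corpus]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf_sketch(tokens):
    # Raw count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in Counter(tokens).items()
    }
    # L2-normalise so the document vector has length 1
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

scores = tfidf_sketch(docs[0])
for term in sorted(scores):
    print(f"{term}: {scores[term]:.4f}")
```

Under these assumptions the sketch gives 0.1714 for "the", 0.2743 for "is" and 0.3345 for the words that occur only in this sentence, matching the values printed above. The added 1 in the IDF is why even a word found in most documents keeps a non-zero score.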


# **Part 3: Word Embeddings**

**3.1: Word2Vec Using Tools**

**Task:** Train a model using gensim.models.Word2Vec.

**Steps:**

1. Pass your cleaned tokenized corpus (from Part 1.2) to Word2Vec.

2. Set min_count=1 (since our corpus is small, we want to keep all words).

3. Set vector_size=10 (small vector size for easy viewing).

**Experiment:** Print the vector for the word "learning".

In [19]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Tokenize the lowercased corpus (punctuation and stopwords are kept here)
a = [word_tokenize(i.lower()) for i in corpus]

# min_count=1 keeps every word; vector_size=10 keeps the vectors easy to read
vect = Word2Vec(a, vector_size=10, min_count=1)

# Exact values may differ from run to run
print(vect.wv["learning"])

[-0.08167583  0.04493266 -0.04121889  0.00810245  0.08528175 -0.0447215
  0.04544761 -0.06766256 -0.03566954  0.09391536]


**3.3: Pre-trained GloVe (Understanding Global Context)**

**Task:** Use gensim.downloader to load 'glove-wiki-gigaword-50'.

**Analogy Task:** Compute the famous analogy: $\text{King} - \text{Man} + \text{Woman} = ?$

Use model.most_similar(positive=['woman', 'king'], negative=['man']).

**Question:** Does the model correctly guess "Queen"?
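
To see what the analogy query does conceptually, here is a toy sketch using hand-made 2-dimensional vectors (hypothetical values, not GloVe embeddings). It adds and subtracts the vectors, then ranks the remaining words by cosine similarity. gensim's most_similar follows the same idea but, among other details, normalises the vectors first, so treat this as a simplification.

```python
import math

# Hypothetical toy embeddings: [royalty, gender]; not real GloVe vectors
vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.1,  1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(positive, negative):
    # Sum the positive vectors and subtract the negative ones
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Rank every word that was not part of the query by cosine similarity
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy(positive=["woman", "king"], negative=["man"]))  # queen
```

The query words themselves are excluded from the candidates; otherwise "king" would often be its own nearest neighbour.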

In [15]:
import gensim.downloader as api

# Load pre-trained GloVe model
glove = api.load('glove-wiki-gigaword-50')

# most_similar returns (word, similarity) pairs, best match first
predict = glove.most_similar(positive=['woman', 'king'], negative=['man'])
print('King - Man + Woman equals', predict[0][0])

King - Man + Woman equals queen


# **Part 5: Sentiment Analysis (The Application)**

**Concept:** Sentiment Analysis determines whether a piece of text is Positive, Negative, or Neutral. We will use VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK. VADER is specifically designed for social media text; it understands that capital letters ("LOVE"), punctuation ("!!!"), and emojis change the sentiment intensity.

**Task:**

1. Initialize the SentimentIntensityAnalyzer.

2. Pass the Pizza Review (corpus[1]) into the analyzer.

3. Pass the Math Complaint (corpus[5]) into the analyzer.

**Analysis:** Look at the compound score for both.

**Compound Score Range:** -1 (Most Negative) to +1 (Most Positive).

Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?

In [3]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
# Download VADER (run once)
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores(corpus[1]))
print(sia.polarity_scores(corpus[5]))

{'neg': 0.223, 'neu': 0.644, 'pos': 0.134, 'compound': -0.3926}
{'neg': 0.345, 'neu': 0.478, 'pos': 0.177, 'compound': -0.5346}
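
One way to read these compound scores is to map them to labels. The sketch below assumes the commonly used cut-offs of ±0.05 suggested by the VADER authors (an assumption here, not part of the assignment) and applies them to the two values printed above; label_from_compound is a hypothetical helper.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the output above
for name, compound in [("pizza review", -0.3926), ("math complaint", -0.5346)]:
    print(f"{name}: {compound} -> {label_from_compound(compound)}")
```

With these cut-offs both sentences are labelled negative, so the pizza review is not scored as neutral even though it contains both "delicious" and "terrible".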


