# **Part 1: Text Preprocessing**



**Data:** The corpus is defined in the code cell below.

**1.1: Preprocessing From Scratch**

**Goal:** Write a function clean_text_scratch(text) that performs the following steps without using NLTK or spaCy:

1. Lowercasing: Convert text to lowercase.

2. Punctuation Removal: Use Python's re (regex) library or string methods to remove special characters (!, ., ,, :, ;, ..., ').

3. Tokenization: Split the string into a list of words based on whitespace.

4. Stopword Removal: Filter out words found in this list: ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'].

5. Simple Stemming: Create a helper function that removes suffixes 'ing', 'ly', 'ed', and 's' from the end of words.


Note: This is a "Naive" stemmer. It will break words like "sing" -> "s". This illustrates why we need libraries!
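To make the failure mode concrete, the sketch below runs the same suffix rule on a few words and compares it with a guarded variant. The function `guarded_stem` and its `min_stem` threshold are assumptions made for illustration; they are not part of the assignment, and the guard is still only a heuristic.

```python
SUFFIXES = ['ing', 'ly', 'ed', 's']

def naive_stem(word):
    """Strip the first matching suffix with no checks at all."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def guarded_stem(word, min_stem=3):
    """Hypothetical variant: strip only if at least `min_stem` characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

# Compare both rules side by side
for w in ["sing", "bus", "jumps", "transforming"]:
    print(w, "->", naive_stem(w), "|", guarded_stem(w))
```

The guard keeps "sing" and "bus" intact, but it would still turn "string" into "str", which is why a dictionary-backed lemmatizer is usually preferred.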

**Task:** Run this function on the first sentence of the corpus and print the result.

In [None]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [None]:
import re

# Naive stemmer
def naive_stem(word):
    for suffix in ['ing', 'ly', 'ed', 's']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

# Cleaning function
def clean_text_scratch(text):
    text = text.lower()                              # 1. Lowercasing
    text = re.sub(r"[^\w\s]", "", text)              # 2. Remove punctuation
    tokens = text.split()                            # 3. Tokenization

    stopwords = ['the', 'is', 'in', 'to', 'of', 'and',
                 'a', 'it', 'was', 'but', 'or']      # 4. Stopwords
    tokens = [word for word in tokens if word not in stopwords]

    tokens = [naive_stem(word) for word in tokens]  # 5. Naive stemming
    return tokens

# Run on the first sentence of the corpus only (corpus is defined in the cell above)
first_sentence = corpus[0]

result = clean_text_scratch(first_sentence)
print(result)


['artificial', 'intelligence', 'transform', 'world', 'however', 'ethical', 'concern', 'remain']


**1.2: Preprocessing Using Tools**

**Goal:** Use the nltk library to perform the same cleaning on the entire corpus.

**Steps:**

1. Use nltk.tokenize.word_tokenize.
2. Use nltk.corpus.stopwords.
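3. Use nltk.stem.WordNetLemmatizer to convert words to their root (e.g., "jumps" $\to$ "jump"). Note: lemmatize() treats each word as a noun unless a part-of-speech tag is passed, so "transforming" $\to$ "transform" requires pos='v'.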


**Task:** Print the cleaned, lemmatized tokens for the second sentence (The pizza review).

In [None]:
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Given corpus
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

# Initialize tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Cleaning function using NLTK
def clean_text_nltk(text):
    text = text.lower()                              # Lowercasing
    text = re.sub(r"[^\w\s]", "", text)              # Remove punctuation
    tokens = word_tokenize(text)                    # Tokenization

    tokens = [word for word in tokens               # Stopword removal
              if word not in stop_words]

    tokens = [lemmatizer.lemmatize(word)            # Lemmatization
              for word in tokens]

    return tokens

# Apply to entire corpus
cleaned_corpus = [clean_text_nltk(sentence) for sentence in corpus]

# Print ONLY the second sentence result (pizza review)
print(cleaned_corpus[1])


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


['pizza', 'absolutely', 'delicious', 'service', 'terrible', 'wont', 'go', 'back']
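The token 'wont' in the output above is a side effect of deleting punctuation before tokenizing. The short sketch below, which uses only the standard library, shows how the choice between deleting the apostrophe and replacing it with a space changes the tokens. It illustrates the trade-off and is not a recommended fix.

```python
import re

text = "I won't go back."

# Deleting punctuation glues the contraction together.
deleted = re.sub(r"[^\w\s]", "", text.lower()).split()

# Replacing punctuation with a space splits it into two fragments instead.
spaced = re.sub(r"[^\w\s]", " ", text.lower()).split()

print(deleted)  # ['i', 'wont', 'go', 'back']
print(spaced)   # ['i', 'won', 't', 'go', 'back']
```

Neither result is the original word, so contractions usually need to be handled before punctuation is stripped.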


# **Part 2: Text Representation**

**2.1: Bag of Words (BoW)**

**Logic:**

**Build Vocabulary:** Create a list of all unique words in the entire corpus (after cleaning). Sort them alphabetically.

**Vectorize:** Write a function that takes a sentence and returns a list of numbers. Each number represents the count of a vocabulary word in that sentence.

**Task:** Print the unique Vocabulary list. Then, print the BoW vector for: "The quick brown fox jumps over the lazy dog."
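The two steps above can be sketched on a toy example before applying them to the real corpus. The token lists here are made-up data, and `collections.Counter` is one convenient way to count; the assignment does not require it.

```python
from collections import Counter

# Toy, already-cleaned token lists (hypothetical data for illustration)
docs = [
    ["quick", "brown", "fox"],
    ["lazy", "dog", "dog"],
]

# Build Vocabulary: every unique token, sorted alphabetically
vocabulary = sorted({token for doc in docs for token in doc})

def bow_vector(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

print(vocabulary)                        # ['brown', 'dog', 'fox', 'lazy', 'quick']
print(bow_vector(docs[1], vocabulary))   # [0, 2, 0, 1, 0]
```

Every vector has the same length as the vocabulary, which is what makes documents of different lengths comparable.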

In [None]:
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


# --------------------------------
# 1. GIVEN CORPUS
# --------------------------------
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

# --------------------------------
# 2. INITIALIZE STOPWORDS & LEMMATIZER
# --------------------------------
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# --------------------------------
# 3. CLEANING FUNCTION USING NLTK
# --------------------------------
def clean_text_nltk(text):
    text = text.lower()                      # Lowercasing
    text = re.sub(r"[^\w\s]", "", text)      # Remove punctuation
    tokens = word_tokenize(text)            # Tokenization

    tokens = [word for word in tokens if word not in stop_words]  # Stopword removal
    tokens = [lemmatizer.lemmatize(word) for word in tokens]     # Lemmatization

    return tokens

# --------------------------------
# 4. CLEAN THE ENTIRE CORPUS
# --------------------------------
cleaned_corpus = [clean_text_nltk(sentence) for sentence in corpus]

# --------------------------------
# 5. BUILD VOCABULARY (UNIQUE + SORTED)
# --------------------------------
vocabulary = sorted(set(word for sentence in cleaned_corpus for word in sentence))

print("Vocabulary List:")
print(vocabulary)

# --------------------------------
# 6. BoW VECTOR FUNCTION
# --------------------------------
def bow_vector(sentence, vocabulary):
    clean_sentence = clean_text_nltk(sentence)
    vector = []

    for word in vocabulary:
        vector.append(clean_sentence.count(word))

    return vector

# --------------------------------
# 7. TEST SENTENCE
# --------------------------------
test_sentence = "The quick brown fox jumps over the lazy dog."
vector = bow_vector(test_sentence, vocabulary)

print("\nBoW Vector for:")
print(test_sentence)
print(vector)



Vocabulary List:
['absolutely', 'algebra', 'artificial', 'back', 'behind', 'brown', 'concern', 'data', 'delicious', 'dog', 'ethical', 'fox', 'go', 'hate', 'however', 'intelligence', 'involves', 'jump', 'lazy', 'learning', 'linear', 'love', 'machine', 'math', 'mind', 'nobler', 'pizza', 'question', 'quick', 'remain', 'science', 'service', 'statistic', 'terrible', 'ti', 'transforming', 'whether', 'wont', 'world']

BoW Vector for:
The quick brown fox jumps over the lazy dog.
[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]




**2.2: BoW Using Tools**

**Task:** Use sklearn.feature_extraction.text.CountVectorizer.

**Steps:**

1. Instantiate the vectorizer.

2. fit_transform the raw corpus.

3. Convert the result to an array (.toarray()) and print it.
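CountVectorizer tokenizes the raw text itself. With default settings it lowercases the input, keeps only tokens of two or more word characters (its default token_pattern is `(?u)\b\w\w+\b`), and removes no stopwords. The sketch below imitates that rule with plain `re`; it is an approximation for illustration, not the library's code, and `sketch_tokenize` is a made-up name.

```python
import re

# Same idea as the vectorizer's default pattern: runs of 2+ word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def sketch_tokenize(text):
    """Lowercase the text and keep tokens with at least two word characters."""
    return TOKEN_PATTERN.findall(text.lower())

print(sketch_tokenize("I won't go back."))      # ['won', 'go', 'back']
print(sketch_tokenize("To be, or not to be"))   # ['to', 'be', 'or', 'not', 'to', 'be']
```

This explains why a vocabulary built this way contains stopwords such as 'and', 'be', and 'but', why it has no entry for "I", and why "won't" shows up as 'won'.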

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# --------------------------------
# 1. GIVEN RAW CORPUS (UNCLEANED)
# --------------------------------
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

# --------------------------------
# 2. INSTANTIATE THE VECTORIZER
# --------------------------------
vectorizer = CountVectorizer()

# --------------------------------
# 3. FIT AND TRANSFORM THE CORPUS
# --------------------------------
X = vectorizer.fit_transform(corpus)

# --------------------------------
# 4. CONVERT TO ARRAY AND PRINT
# --------------------------------
print("Bag of Words Matrix:")
print(X.toarray())

# --------------------------------
# 5. PRINT VOCABULARY (OPTIONAL BUT USEFUL)
# --------------------------------
print("\nVocabulary:")
print(vectorizer.get_feature_names_out())


Bag of Words Matrix:
[[0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1]
 [1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0 0 0 0 1 0 1 0 2 0 0 0 2 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
  0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0]
 [0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0
  1 0 0 0 0 0 0 1 2 1 2 0 0 1 0 0]
 [0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0
  0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]]

Vocabulary:
['absolutely' 'algebra' 'and' 'artificial' 'back' 'be' 'behind' 'brown'
 'but' 'concerns' 'data' 'delicious' 'dog' 'ethical' 'fox' 'go' 'hate'
 'however' 'in' 'intelligence' 'involves' 'is' 'it' 'jumps' 'lazy'
 'learning' 'linear' 'love' 'machine' 'math' 'mind' 'nobler' 'not' 'or'
 'over' 'pizza' 'question' 'q

**2.3: TF-IDF From Scratch (The Math)**

**Goal:** Manually calculate the score for the word "machine" in the last sentence:

"I love machine learning, but I hate the math behind it."

**Formula:**

*TF (Term Frequency):* $\frac{\text{Count of 'machine' in sentence}}{\text{Total words in sentence}}$

*IDF (Inverse Document Frequency):* $\log(\frac{\text{Total number of documents}}{\text{Number of documents containing 'machine'}})$ (Use math.log, which is the natural logarithm).

**Result:** TF * IDF.

**Task:** Print your manual calculation result.

In [None]:
import math
import re

# Given corpus
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

# Target word (the target sentence is the last one in the corpus)
word = "machine"

def tokenize(text):
    # Lowercase and keep runs of word characters, so "learning," becomes "learning"
    return re.findall(r"\w+", text.lower())

# -----------------------------
# 1. TERM FREQUENCY (TF)
# -----------------------------
words_in_sentence = tokenize(corpus[-1])
tf = words_in_sentence.count(word) / len(words_in_sentence)

# -----------------------------
# 2. INVERSE DOCUMENT FREQUENCY (IDF)
# -----------------------------
total_documents = len(corpus)

# Count documents containing the word as a whole token (not as a substring)
doc_count = sum(1 for sentence in corpus if word in tokenize(sentence))

idf = math.log(total_documents / doc_count)

# -----------------------------
# 3. TF-IDF
# -----------------------------
tf_idf = tf * idf

# -----------------------------
# 4. PRINT RESULTS
# -----------------------------
print("TF:", tf)
print("IDF:", idf)
print("TF-IDF Score for 'machine':", tf_idf)


TF: 0.09090909090909091
IDF: 1.0986122886681098
TF-IDF Score for 'machine': 0.09987384442437362
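The printed values can be checked by hand. The last sentence splits into 11 tokens, "machine" occurs once in it, and 2 of the 6 documents contain the word:

$$\text{TF} = \frac{1}{11} \approx 0.0909, \qquad \text{IDF} = \ln\frac{6}{2} = \ln 3 \approx 1.0986$$

$$\text{TF} \times \text{IDF} \approx 0.0909 \times 1.0986 \approx 0.0999$$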


**2.4: TF-IDF Using Tools**

**Task:** Use sklearn.feature_extraction.text.TfidfVectorizer.

**Steps:** Fit it on the corpus and print the vector for the first sentence.

**Observation:** Compare the score of unique words (like "Intelligence") vs common words (like "is"). Which is higher?
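TfidfVectorizer does not use the plain formula from 2.3. With default settings it uses raw counts as TF, a smoothed IDF of $\ln\frac{1 + n}{1 + \text{df}} + 1$, and then scales each row to unit L2 length. The sketch below reimplements that recipe with the standard library only, to make the comparison between rare and common words easier to reason about. The helper names `tokenize` and `tfidf_row` are made up, and the tokenizer only approximates the vectorizer's default.

```python
import math
import re
from collections import Counter

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it.",
]

def tokenize(text):
    # Approximation of the default tokenizer: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(doc) for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

def tfidf_row(tokens):
    """Raw count times smoothed IDF, then L2-normalize the row."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

row = tfidf_row(docs[0])
for term in ("intelligence", "is", "the"):
    print(term, round(row[term], 4))
```

A word that appears in a single document ("intelligence") gets the largest weight, "is" (two documents) gets less, and "the" (five documents) gets the least.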

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# -----------------------------
# GIVEN CORPUS
# -----------------------------
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

# -----------------------------
# 1. INSTANTIATE TF-IDF VECTORIZER
# -----------------------------
vectorizer = TfidfVectorizer()

# -----------------------------
# 2. FIT AND TRANSFORM THE CORPUS
# -----------------------------
X = vectorizer.fit_transform(corpus)

# -----------------------------
# 3. CONVERT TO ARRAY
# -----------------------------
tfidf_array = X.toarray()

# -----------------------------
# 4. PRINT TF-IDF VECTOR FOR FIRST SENTENCE
# -----------------------------
print("TF-IDF Vector for First Sentence:")
print(tfidf_array[0])

# -----------------------------
# 5. PRINT VOCABULARY (WORD ORDER)
# -----------------------------
print("\nVocabulary:")
print(vectorizer.get_feature_names_out())


TF-IDF Vector for First Sentence:
[0.         0.         0.         0.33454543 0.         0.
 0.         0.         0.         0.33454543 0.         0.
 0.         0.33454543 0.         0.         0.         0.33454543
 0.         0.33454543 0.         0.27433204 0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.33454543 0.         0.         0.
 0.         0.         0.17139656 0.         0.         0.33454543
 0.         0.         0.         0.33454543]

Vocabulary:
['absolutely' 'algebra' 'and' 'artificial' 'back' 'be' 'behind' 'brown'
 'but' 'concerns' 'data' 'delicious' 'dog' 'ethical' 'fox' 'go' 'hate'
 'however' 'in' 'intelligence' 'involves' 'is' 'it' 'jumps' 'lazy'
 'learning' 'linear' 'love' 'machine' 'math' 'mind' 'nobler' 'not' 'or'
 'over' 'pizza' 'question' 'quick' 'remain' 'science' 'service'
 'statistics' 'terrible' 'that' 'the' 'tis' 'to' 'transforming' 'was'
 'w

# **Part 3: Word Embeddings**

**3.1: Word2Vec Using Tools**

**Task:** Train a model using gensim.models.Word2Vec.

**Steps:**

1. Pass your cleaned tokenized corpus (from Part 1.2) to Word2Vec.

2. Set min_count=1 (since our corpus is small, we want to keep all words).

3. Set vector_size=10 (small vector size for easy viewing).

**Experiment:** Print the vector for the word "learning".

In [None]:
!pip install gensim
import re

import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# -----------------------------
# NLTK Downloads (run once)
# -----------------------------
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

# -----------------------------
# GIVEN CORPUS
# -----------------------------
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

# -----------------------------
# CLEANING FUNCTION (SAME AS PART 1.2, USING NLTK)
# -----------------------------
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text_nltk(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens

# -----------------------------
# CLEAN AND TOKENIZE ENTIRE CORPUS
# -----------------------------
cleaned_corpus = [clean_text_nltk(sentence) for sentence in corpus]

# -----------------------------
# TRAIN WORD2VEC MODEL
# -----------------------------
model = Word2Vec(
    sentences=cleaned_corpus,
    vector_size=10,
    min_count=1
)

# -----------------------------
# PRINT VECTOR FOR WORD "learning"
# -----------------------------
print("Word Vector for 'learning':")
print(model.wv["learning"])




Word Vector for 'learning':
[-0.00536899  0.00237282  0.05103846  0.0900786  -0.09300981 -0.07119522
  0.06463154  0.08977251 -0.0501886  -0.03764008]




**3.3: Pre-trained GloVe (Understanding Global Context)**

**Task:** Use gensim.downloader to load 'glove-wiki-gigaword-50'.

**Analogy Task:** Compute the famous analogy: $\text{King} - \text{Man} + \text{Woman} = ?$

Use model.most_similar(positive=['woman', 'king'], negative=['man']).

**Question:** Does the model correctly guess "Queen"?
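The idea behind the analogy query can be sketched with plain vector arithmetic: add the "positive" vectors, subtract the "negative" ones, and rank the remaining words by cosine similarity to the result. The 2-D vectors below are invented for illustration and are not GloVe values. The sketch is also a simplification, since gensim normalizes the input vectors before combining them.

```python
import math

# Hypothetical 2-D "embeddings": axis 0 ~ royalty, axis 1 ~ gender
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.5, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(positive, negative):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + v for t, v in zip(target, vectors[w])]
    for w in negative:
        target = [t - v for t, v in zip(target, vectors[w])]
    # The query words themselves are excluded from the candidates
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(analogy(positive=["woman", "king"], negative=["man"])[0][0])  # queen
```

Excluding the query words matters: without it, the nearest neighbour of the combined vector is often "king" itself.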

In [None]:
import gensim.downloader as api

# Load pre-trained GloVe model
glove_model = api.load('glove-wiki-gigaword-50')

# Analogy: King - Man + Woman = ?
result = glove_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

print("Top prediction for 'King - Man + Woman':", result[0])


Top prediction for 'King - Man + Woman': ('queen', 0.8523604273796082)


# **Part 5: Sentiment Analysis (The Application)**

**Concept:** Sentiment Analysis determines whether a piece of text is Positive, Negative, or Neutral. We will use VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK. VADER is specifically designed for social media text; it understands that capital letters ("LOVE"), punctuation ("!!!"), and emojis change the sentiment intensity.

**Task:**

1. Initialize the SentimentIntensityAnalyzer.

2. Pass the Pizza Review (corpus[1]) into the analyzer.

3. Pass the Math Complaint (corpus[5]) into the analyzer.

**Analysis:** Look at the compound score for both.

**Compound Score Range:** -1 (Most Negative) to +1 (Most Positive).

Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?

In [None]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
# Download VADER (run once)
nltk.download('vader_lexicon')

# Initialize the SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# Example corpus (stand-in sentences, NOT the Part 1 corpus; indices 1 and 5 mirror the task)
corpus = [
    "I love this product!",
    "The pizza was delicious but the service was slow.",  # corpus[1] - Pizza Review
    "I don't know what to say.",
    "This is amazing!",
    "Not bad, could be better.",
    "The professor's explanation was terrible and confusing."  # corpus[5] - Math Complaint
]

# Analyze Pizza Review
pizza_review = corpus[1]
pizza_score = sia.polarity_scores(pizza_review)
print("Pizza Review:", pizza_review)
print("VADER Score:", pizza_score)

# Analyze Math Complaint
math_complaint = corpus[5]
math_score = sia.polarity_scores(math_complaint)
print("\nMath Complaint:", math_complaint)
print("VADER Score:", math_score)


Pizza Review: The pizza was delicious but the service was slow.
VADER Score: {'neg': 0.0, 'neu': 0.773, 'pos': 0.227, 'compound': 0.3291}

Math Complaint: The professor's explanation was terrible and confusing.
VADER Score: {'neg': 0.5, 'neu': 0.5, 'pos': 0.0, 'compound': -0.6124}


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
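To turn a compound score into a label, a common convention is to treat scores at or above 0.05 as positive, at or below -0.05 as negative, and anything in between as neutral. The helper below is a minimal sketch of that rule. The function name `label_from_compound` is made up, and the 0.05 cut-off is an assumption that may need tuning for a given dataset.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label using a symmetric cut-off."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores printed above
print(label_from_compound(0.3291))   # positive
print(label_from_compound(-0.6124))  # negative
```

Under this rule the pizza sentence is labelled positive overall, even though it mixes praise and criticism. The 'neu' share of 0.773 is a better hint that the sentiment is mixed.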
