
# **Part 1: Text Preprocessing**



**Data:** The corpus is given in the code cell below.

**1.1: Preprocessing From Scratch**

**Goal:** Write a function clean_text_scratch(text) that performs the following steps without using NLTK or spaCy:

1. Lowercasing: Convert text to lowercase.

2. Punctuation Removal: Use Python's re (regex) library or string methods to remove special characters (!, ., ,, :, ;, ..., ').

3. Tokenization: Split the string into a list of words based on whitespace.

4. Stopword Removal: Filter out words found in this list: ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'].

5. Simple Stemming: Create a helper function that removes suffixes 'ing', 'ly', 'ed', and 's' from the end of words.


Note: This is a "Naive" stemmer. It will break words like "sing" -> "s". This illustrates why we need libraries!
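
To make the failure mode in the note concrete, here is a minimal sketch of a completely unguarded suffix stripper. The helper name `naive_stem` and the sample words are illustrative assumptions, not part of the assignment.

```python
def naive_stem(word):
    """Strip the first matching suffix, with no length or dictionary checks."""
    for suffix in ("ing", "ly", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# "jumps" and "quickly" come out fine; "sing" and "bus" are mangled
for word in ["jumps", "quickly", "sing", "bus"]:
    print(word, "->", naive_stem(word))
```

Adding a minimum-length check before stripping protects very short words such as "sing", but it cannot tell "bus" from a genuine plural, which is the gap a dictionary-backed lemmatizer fills.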

**Task:** Run this function on the first sentence of the corpus and print the result.

In [None]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [None]:
def simple_stem(word):
    # Naive suffix stripping; the length guards keep very short words intact
    if word.endswith("ing") and len(word) > 4:
        return word[:-3]
    if word.endswith("ly") and len(word) > 3:
        return word[:-2]
    if word.endswith("ed") and len(word) > 3:
        return word[:-2]
    if word.endswith("s") and len(word) > 2:
        return word[:-1]
    return word

def clean_text_scratch(text):
    text = text.lower()                       # 1. lowercase
    for ch in "!.,:;'":                       # 2. strip punctuation ("..." is just repeated ".")
        text = text.replace(ch, "")
    words = text.split()                      # 3. tokenize on whitespace
    stopwords = ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or']
    filtered_words = [word for word in words if word not in stopwords]   # 4. remove stopwords
    return [simple_stem(word) for word in filtered_words]                # 5. stem

In [None]:
print(clean_text_scratch(corpus[0]))

['artificial', 'intelligence', 'transform', 'world', 'however', 'ethical', 'concern', 'remain']


**1.2: Preprocessing Using Tools**

**Goal:** Use the nltk library to perform the same cleaning on the entire corpus.

**Steps:**

1. Use nltk.tokenize.word_tokenize.
2. Use nltk.corpus.stopwords.
3. Use nltk.stem.WordNetLemmatizer to convert words to their root (e.g., "jumps" $\to$ "jump", "transforming" $\to$ "transform").


**Task:** Print the cleaned, lemmatized tokens for the second sentence (The pizza review).

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Initialize tools
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_corpus(corpus):
    cleaned_corpus = []

    for document in corpus:
        # Lowercase
        document = document.lower()

        # Tokenize
        tokens = word_tokenize(document)

        # Remove punctuation, stopwords, and non-alphabetic tokens
        tokens = [
            word for word in tokens
            if word.isalpha() and word not in stop_words
        ]

        # Lemmatize
        lemmatized_tokens = [
            lemmatizer.lemmatize(word)
            for word in tokens
        ]

        cleaned_corpus.append(lemmatized_tokens)

    return cleaned_corpus

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [None]:
cleaned_corpus = clean_corpus(corpus)
print(cleaned_corpus[1])   # cleaned, lemmatized tokens of the pizza review


# **Part 2: Text Representation**

**2.1: Bag of Words (BoW)**

**Logic:**

**Build Vocabulary:** Create a list of all unique words in the entire corpus (after cleaning). Sort them alphabetically.

**Vectorize:** Write a function that takes a sentence and returns a list of numbers. Each number represents the count of a vocabulary word in that sentence.

**Task:** Print the unique Vocabulary list. Then, print the BoW vector for: "The quick brown fox jumps over the lazy dog."

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# NLTK downloads (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_corpus(corpus):
    cleaned = []
    for doc in corpus:
        tokens = word_tokenize(doc.lower())
        tokens = [
            lemmatizer.lemmatize(word)
            for word in tokens
            if word.isalpha() and word not in stop_words
        ]
        cleaned.append(tokens)
    return cleaned


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
def build_vocabulary(cleaned_corpus):
    vocab = set()
    for doc in cleaned_corpus:
        vocab.update(doc)
    return sorted(vocab)

In [None]:
def vectorize_sentence(sentence, vocabulary):
    tokens = word_tokenize(sentence.lower())
    tokens = [
        lemmatizer.lemmatize(word)
        for word in tokens
        if word.isalpha() and word not in stop_words
    ]

    vector = [tokens.count(word) for word in vocabulary]
    return vector

In [None]:
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Transforming data is essential for building models.",
    "Students are learning NLP techniques efficiently."
]

cleaned_corpus = clean_corpus(corpus)

vocabulary = build_vocabulary(cleaned_corpus)

print("Vocabulary:")
print(vocabulary)

sentence = "The quick brown fox jumps over the lazy dog."
bow_vector = vectorize_sentence(sentence, vocabulary)

print("\nBoW Vector for sentence:")
print(bow_vector)

Vocabulary:
['brown', 'building', 'data', 'dog', 'efficiently', 'essential', 'fox', 'jump', 'lazy', 'learning', 'model', 'nlp', 'quick', 'student', 'technique', 'transforming']

BoW Vector for sentence:
[1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0]


**2.2: BoW Using Tools**

**Task:** Use sklearn.feature_extraction.text.CountVectorizer.

**Steps:**

1. Instantiate the vectorizer.

2. fit_transform the raw corpus.

3. Convert the result to an array (.toarray()) and print it.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Transforming data is essential for building models.",
    "Students are learning NLP techniques efficiently."
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
X_array = X.toarray()

print("BoW Matrix:")
print(X_array)
print("\nVocabulary:")
print(vectorizer.get_feature_names_out())

BoW Matrix:
[[0 1 0 0 1 0 0 0 1 0 1 1 0 0 0 1 1 0 0 2 0]
 [0 0 1 1 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1]
 [1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 1 1 0 0]]

Vocabulary:
['are' 'brown' 'building' 'data' 'dog' 'efficiently' 'essential' 'for'
 'fox' 'is' 'jumps' 'lazy' 'learning' 'models' 'nlp' 'over' 'quick'
 'students' 'techniques' 'the' 'transforming']
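
The matrix above has 21 columns rather than the 16 of the from-scratch version because CountVectorizer, with its default settings, only lowercases the text and extracts tokens of two or more word characters; it does no stopword removal or lemmatization. The sketch below imitates that default tokenization with the standard library (the `tokenize` helper is a hypothetical stand-in, not scikit-learn code) and rebuilds the first row.

```python
import re
from collections import Counter

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Transforming data is essential for building models.",
    "Students are learning NLP techniques efficiently.",
]

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# Sorted set of every token seen in the corpus
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# Count each vocabulary word in the first document
first_counts = Counter(tokenize(docs[0]))
first_row = [first_counts[word] for word in vocabulary]

print(len(vocabulary))
print(first_row)
```

The single 2 in the row is "the", which the from-scratch pipeline would have removed as a stopword.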


**2.3: TF-IDF From Scratch (The Math)**

**Goal:** Manually calculate the score for the word "machine" in the last sentence:

"I love machine learning, but I hate the math behind it."

**Formula:**

*TF (Term Frequency):* $\frac{\text{Count of 'machine' in sentence}}{\text{Total words in sentence}}$

*IDF (Inverse Document Frequency):* $\log(\frac{\text{Total number of documents}}{\text{Number of documents containing 'machine'}})$ (Use math.log).

**Result:** TF * IDF.
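
As a rough sketch, the two formulas translate directly into Python. The function names and the three-document toy corpus below are assumptions made for illustration; they are not the assignment corpus.

```python
import math

def tf(term, tokens):
    # Share of the document's tokens that are `term`
    return tokens.count(term) / len(tokens)

def idf(term, documents):
    # Natural log of (number of documents / documents containing `term`);
    # assumes `term` occurs in at least one document
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing)

toy_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

cat_score = tf("cat", toy_docs[0]) * idf("cat", toy_docs)
the_score = tf("the", toy_docs[0]) * idf("the", toy_docs)
print(round(cat_score, 4))
print(the_score)
```

With this unsmoothed formula, a word that appears in every document ("the") has an IDF of log(1) = 0 and therefore a TF-IDF of 0.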

**Task:** Print your manual calculation result.

In [None]:
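import math

sentence = "I love machine learning, but I hate the math behind it."
words = sentence.lower().split()

# TF: 'machine' occurs once among the 11 whitespace-separated words
tf = words.count("machine") / len(words)

# IDF: 'machine' appears in 2 of the 6 documents of the Part 1 corpus
idf = math.log(6 / 2)

tf_idf = tf * idf
print(round(tf_idf, 4))

0.0999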


**2.4: TF-IDF Using Tools**

**Task:** Use sklearn.feature_extraction.text.TfidfVectorizer.

**Steps:** Fit it on the corpus and print the vector for the first sentence.

**Observation:** Compare the score of unique words (like "Intelligence") vs common words (like "is"). Which is higher?

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    "Artificial Intelligence is transforming the world",
    "Intelligence is the ability to learn",
    "Machine learning is a part of artificial intelligence"
]
vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(corpus)
# Convert to array
X_array = X.toarray()

# Vocabulary for reference
vocab = vectorizer.get_feature_names_out()

print("Vocabulary:")
print(vocab)

print("\nTF-IDF Vector for first sentence:")
print(X_array[0])     # words unique to this sentence ('transforming', 'world') score highest; words found in every document ('intelligence', 'is') score lowest

Vocabulary:
['ability' 'artificial' 'intelligence' 'is' 'learn' 'learning' 'machine'
 'of' 'part' 'the' 'to' 'transforming' 'world']

TF-IDF Vector for first sentence:
[0.         0.38737583 0.30083189 0.30083189 0.         0.
 0.         0.         0.         0.38737583 0.         0.50935267
 0.50935267]
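
These values differ from the plain TF * IDF formula because TfidfVectorizer, by default, uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and then scales each row to unit L2 length. The following sketch reproduces the first-sentence weights with the standard library under those assumed defaults; the tokenizer regex is an imitation of the default token pattern, not the library's own code.

```python
import math
import re

docs = [
    "Artificial Intelligence is transforming the world",
    "Intelligence is the ability to learn",
    "Machine learning is a part of artificial intelligence",
]

# Lowercase and keep tokens of two or more word characters (so "a" is dropped)
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
n_docs = len(tokenized)

def smoothed_idf(term):
    df = sum(term in doc for doc in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

first = tokenized[0]
raw = {t: first.count(t) * smoothed_idf(t) for t in set(first)}

# L2-normalise so the squared weights of the sentence sum to 1
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {t: round(v / norm, 4) for t, v in sorted(raw.items())}
print(weights)
```

Because of the "+ 1" in the smoothed IDF, words that occur in every document ("intelligence", "is") keep a small non-zero weight instead of dropping to 0.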


# **Part 3: Word Embeddings**

**3.1: Word2Vec Using Tools**

**Task:** Train a model using gensim.models.Word2Vec.

**Steps:**

1. Pass your cleaned tokenized corpus (from Part 1.2) to Word2Vec.

2. Set min_count=1 (since our corpus is small, we want to keep all words).

3. Set vector_size=10 (small vector size for easy viewing).
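
To see why min_count matters on a corpus this small, the sketch below counts word frequencies and applies the same kind of threshold by hand. It only imitates the vocabulary filter and is not gensim code; `surviving_vocab` is a hypothetical helper, and the tokenized sentences are a small illustrative corpus.

```python
from collections import Counter

tokenized_corpus = [
    ["artificial", "intelligence", "transform", "world"],
    ["intelligence", "ability", "learn"],
    ["machine", "learning", "part", "artificial", "intelligence"],
]

# Corpus-wide frequency of every token
counts = Counter(token for sentence in tokenized_corpus for token in sentence)

def surviving_vocab(min_count):
    # Words that occur at least `min_count` times
    return sorted(word for word, n in counts.items() if n >= min_count)

print(len(surviving_vocab(1)))
print(surviving_vocab(2))
print(surviving_vocab(5))
```

With gensim's default of min_count=5, no word in this corpus would survive, which is why the task asks for min_count=1.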

**Experiment:** Print the vector for the word "learning".

In [None]:
!pip install gensim
from gensim.models import Word2Vec

cleaned_corpus = [
    ["artificial", "intelligence", "transform", "world"],
    ["intelligence", "ability", "learn"],
    ["machine", "learning", "part", "artificial", "intelligence"]
]
model = Word2Vec(
    sentences=cleaned_corpus,
    vector_size=10,   # small vector size
    min_count=1,      # keep all words
    window=3,
    workers=1,
    seed=42           # for reproducibility
)
learning_vector = model.wv["learning"]
print("Vector for 'learning':")
print(learning_vector)

Vector for 'learning':
[-0.00990809 -0.05455226 -0.08157282  0.01091695  0.07757796 -0.08723656
  0.07165825  0.06552622 -0.04464806  0.02633288]


**3.3: Pre-trained GloVe (Understanding Global Context)**

**Task:** Use gensim.downloader to load 'glove-wiki-gigaword-50'

**Analogy Task:** Compute the famous analogy: $\text{King} - \text{Man} + \text{Woman} = ?$

Use model.most_similar(positive=['woman', 'king'], negative=['man']).

**Question:** Does the model correctly guess "Queen"?
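
The idea behind the analogy query can be sketched with plain vector arithmetic. The 2-D vectors below are made-up numbers chosen so the arithmetic works out, not real GloVe values, and `analogy` is a hypothetical helper. gensim's most_similar differs in detail (it works on unit-normalised vectors), but it likewise leaves the query words out of the results.

```python
import math

# Hypothetical toy embeddings, NOT real GloVe vectors
toy_vectors = {
    "man": (1.0, 0.0),
    "woman": (1.0, 1.0),
    "king": (3.0, 0.0),
    "queen": (3.0, 1.0),
    "prince": (2.0, 0.0),
    "apple": (0.0, -1.0),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(positive, negative):
    # Add the positive vectors, subtract the negative ones
    target = [0.0, 0.0]
    for word in positive:
        target = [t + x for t, x in zip(target, toy_vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, toy_vectors[word])]
    # Rank every other word by cosine similarity to the target
    candidates = [w for w in toy_vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(target, toy_vectors[w]))

print(analogy(positive=["woman", "king"], negative=["man"]))
```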

In [None]:
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")

# king - man + woman: the top hit should be close to "queen"
result = model.most_similar(positive=["woman", "king"], negative=["man"], topn=5)
print(result)

[('queen', 0.8523604273796082), ('throne', 0.7664334177970886), ('prince', 0.7592144012451172), ('daughter', 0.7473883628845215), ('elizabeth', 0.7460219860076904)]


# **Part 5: Sentiment Analysis (The Application)**

**Concept:** Sentiment Analysis determines whether a piece of text is Positive, Negative, or Neutral. We will use VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK. VADER is specifically designed for social media text; it understands that capital letters ("LOVE"), punctuation ("!!!"), and emojis change the sentiment intensity.

**Task:**

1. Initialize the SentimentIntensityAnalyzer.

2. Pass the Pizza Review (corpus[1]) into the analyzer.

3. Pass the Math Complaint (corpus[5]) into the analyzer.

**Analysis:** Look at the compound score for both.

**Compound Score Range:** -1 (Most Negative) to +1 (Most Positive).

Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?

In [None]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
# Download the VADER lexicon (only needed once)
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
corpus = [
    "I love machine learning.",
    "The pizza was delicious, but the service was terrible.",
    "Artificial intelligence is fascinating.",
    "This course is interesting.",
    "The lecture was okay.",
    "I hate the math behind machine learning."
]
pizza_review = corpus[1]
math_complaint = corpus[5]

pizza_scores = sia.polarity_scores(pizza_review)
math_scores = sia.polarity_scores(math_complaint)

print("Pizza Review Scores:")
print(pizza_scores)

print("\nMath Complaint Scores:")
print(math_scores)

Pizza Review Scores:
{'neg': 0.307, 'neu': 0.519, 'pos': 0.174, 'compound': -0.4215}

Math Complaint Scores:
{'neg': 0.425, 'neu': 0.575, 'pos': 0.0, 'compound': -0.5719}
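
To turn the compound scores above into labels, a common convention, assumed here, is the one suggested in the VADER documentation: at or above 0.05 is positive, at or below -0.05 is negative, and anything in between is neutral. The helper name `label_from_compound` is hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    # Thresholds follow the commonly cited VADER convention (an assumption here)
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the output above
scores = {"pizza review": -0.4215, "math complaint": -0.5719}
labels = {name: label_from_compound(c) for name, c in scores.items()}
for name, label in labels.items():
    print(name, "->", label)
```

Under this convention the pizza review comes out negative rather than mixed or neutral. A likely reason is that VADER gives more weight to the clause that follows "but", so "terrible" outweighs "delicious".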


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
