
# **Part 1: Text Preprocessing**



**Data:** The corpus is defined in the code cell below.

**1.1: Preprocessing From Scratch**

**Goal:** Write a function clean_text_scratch(text) that performs the following steps without using NLTK or spaCy:

1. Lowercasing: Convert text to lowercase.

2. Punctuation Removal: Use Python's re (regex) library or string methods to remove special characters (!, ., ,, :, ;, ..., ').

3. Tokenization: Split the string into a list of words based on whitespace.

4. Stopword Removal: Filter out words found in this list: ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'].

5. Simple Stemming: Create a helper function that removes suffixes 'ing', 'ly', 'ed', and 's' from the end of words.


Note: This is a "Naive" stemmer. It will break words like "sing" -> "s". This illustrates why we need libraries!

**Task:** Run this function on the first sentence of the corpus and print the result.

In [None]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [None]:
import re

def naive_stem(word):
    for suffix in ['ing', 'ly', 'ed', 's']:
        if word.endswith(suffix) and len(word) > len(suffix):
            word = word[:-len(suffix)]
            break
    return word

def clean_text_scratch(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    tokens = text.split()
    stopwords = ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or']
    tokens = [word for word in tokens if word not in stopwords]
    tokens = [naive_stem(word) for word in tokens]
    return tokens

# Run function on first sentence
result = clean_text_scratch(corpus[0])
print(result)


['artificial', 'intelligence', 'transform', 'world', 'however', 'ethical', 'concern', 'remain']
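
The first sentence happens to stem cleanly, which hides the weakness mentioned in the note above. The sketch below repeats the same suffix-stripping rule so that it runs on its own, then applies it to a few words chosen for illustration (they are not from the corpus) to show how easily it over-stems.

```python
def naive_stem(word):
    # Same rule as above: strip the first matching suffix, if any remains.
    for suffix in ("ing", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

# Illustrative words only; "jumps" stems correctly, the rest do not.
for word in ["jumps", "sing", "family", "bus", "red"]:
    print(f"{word!r} -> {naive_stem(word)!r}")
# 'jumps' -> 'jump'
# 'sing' -> 's'
# 'family' -> 'fami'
# 'bus' -> 'bu'
# 'red' -> 'r'
```

A rule that only looks at the last few characters cannot tell a real suffix from letters that belong to the root, which is the motivation for the lemmatizer in 1.2.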


**1.2: Preprocessing Using Tools**

**Goal:** Use the nltk library to perform the same cleaning on the entire corpus.

**Steps:**

1. Use nltk.tokenize.word_tokenize.
2. Use nltk.corpus.stopwords.
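3. Use nltk.stem.WordNetLemmatizer to convert words to their root (e.g., "jumps" $\to$ "jump"). Note that lemmatize() treats every word as a noun by default, so a verb form such as "transforming" is reduced to "transform" only when pos='v' is passed.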


**Task:** Print the cleaned, lemmatized tokens for the second sentence (The pizza review).

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def clean_text_nltk(text):
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalnum()]
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    return tokens

# Clean the second sentence (pizza review)
result = clean_text_nltk(corpus[1])
print(result)

['pizza', 'absolutely', 'delicious', 'service', 'terrible', 'wo', 'go', 'back']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


# **Part 2: Text Representation**

**2.1: Bag of Words (BoW)**

**Logic:**

**Build Vocabulary:** Create a list of all unique words in the entire corpus (after cleaning). Sort them alphabetically.

**Vectorize:** Write a function that takes a sentence and returns a list of numbers. Each number represents the count of a vocabulary word in that sentence.

**Task:** Print the unique Vocabulary list. Then, print the BoW vector for: "The quick brown fox jumps over the lazy dog."

In [None]:
# write your code here
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# NLTK downloads (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# --- Preprocessing Function Using NLTK ---
def clean_text_nltk(text):
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalnum()]
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens

# --- Clean Entire Corpus ---
cleaned_corpus = [clean_text_nltk(sentence) for sentence in corpus]


# --- Build Vocabulary ---
vocab = sorted({word for sent in cleaned_corpus for word in sent})


# --- BoW Vectorizer ---
def bow_vector(sentence, vocab):
    tokens = clean_text_nltk(sentence)
    vector = [tokens.count(word) for word in vocab]
    return vector


# Task Output
print("Vocabulary:")
print(vocab)

sentence = "The quick brown fox jumps over the lazy dog."
bow_result = bow_vector(sentence, vocab)

print("\nBoW Vector:")
print(bow_result)


Vocabulary:
['absolutely', 'algebra', 'artificial', 'back', 'behind', 'brown', 'concern', 'data', 'delicious', 'dog', 'ethical', 'fox', 'go', 'hate', 'however', 'intelligence', 'involves', 'jump', 'lazy', 'learning', 'linear', 'love', 'machine', 'math', 'mind', 'nobler', 'pizza', 'question', 'quick', 'remain', 'science', 'service', 'statistic', 'terrible', 'transforming', 'whether', 'wo', 'world']

BoW Vector:
[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
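
The vectorizer above calls tokens.count(word) once per vocabulary entry, which rescans the sentence each time. A possible variant, sketched here with only the standard library, counts the tokens once with collections.Counter. The function names and the second toy document are hypothetical and serve only as illustration.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    # Unique tokens across all documents, sorted alphabetically.
    return sorted({tok for doc in tokenized_docs for tok in doc})

def bow_vector_counter(tokens, vocab):
    # Count every token once, then look each vocabulary word up.
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

# Toy input: already-cleaned tokens (the second document is made up).
docs = [
    ["quick", "brown", "fox", "jump", "lazy", "dog"],
    ["lazy", "dog", "lazy"],
]
vocab_demo = build_vocab(docs)
print(vocab_demo)                               # ['brown', 'dog', 'fox', 'jump', 'lazy', 'quick']
print(bow_vector_counter(docs[1], vocab_demo))  # [0, 1, 0, 0, 2, 0]
```

Counter returns 0 for missing keys, so words absent from the sentence need no special handling.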




**2.2: BoW Using Tools**

**Task:** Use sklearn.feature_extraction.text.CountVectorizer.

**Steps:**

1. Instantiate the vectorizer.

2. fit_transform the raw corpus.

3. Convert the result to an array (.toarray()) and print it.

In [None]:
# your code here
from sklearn.feature_extraction.text import CountVectorizer

# 1. Instantiate vectorizer
vectorizer = CountVectorizer()

# 2. Fit + transform corpus
bow_matrix = vectorizer.fit_transform(corpus)

# 3. Convert to array
bow_array = bow_matrix.toarray()

# Print results
print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("\nBoW Matrix:")
print(bow_array)


Vocabulary:
['absolutely' 'algebra' 'and' 'artificial' 'back' 'be' 'behind' 'brown'
 'but' 'concerns' 'data' 'delicious' 'dog' 'ethical' 'fox' 'go' 'hate'
 'however' 'in' 'intelligence' 'involves' 'is' 'it' 'jumps' 'lazy'
 'learning' 'linear' 'love' 'machine' 'math' 'mind' 'nobler' 'not' 'or'
 'over' 'pizza' 'question' 'quick' 'remain' 'science' 'service'
 'statistics' 'terrible' 'that' 'the' 'tis' 'to' 'transforming' 'was'
 'whether' 'won' 'world']

BoW Matrix:
[[0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1]
 [1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0 0 0 0 1 0 1 0 2 0 0 0 2 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
  0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0]
 [0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0
  1 0 0 0 0 0 0 1 2 1 2 0 0 1 0 0]
 [0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0
  0 0 0 1 0 1 0 0 0 0 0
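
The vocabulary here differs from the one in 2.1: it keeps stopwords such as 'the', contains 'won' and 'tis', and has no entry for 'i'. This follows from CountVectorizer's defaults, which lowercase the text, remove no stopwords, and tokenize with the regular expression `(?u)\b\w\w+\b`, so only runs of two or more word characters survive. The sketch below applies that pattern with the standard re module; the helper name is hypothetical.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def sklearn_like_tokens(text):
    # Lowercase first, as the vectorizer does by default.
    return TOKEN_PATTERN.findall(text.lower())

review = ("The pizza was absolutely delicious, but the service was "
          "terrible ... I won't go back.")
tokens = sklearn_like_tokens(review)
print(tokens)
# ['the', 'pizza', 'was', 'absolutely', 'delicious', 'but', 'the',
#  'service', 'was', 'terrible', 'won', 'go', 'back']
```

The apostrophe splits "won't" into 'won' and a single 't', and single-character tokens such as 't' and 'i' are dropped.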

**2.3: TF-IDF From Scratch (The Math)**

**Goal:** Manually calculate the score for the word "machine" in the last sentence:

"I love machine learning, but I hate the math behind it."

**Formula:**

*TF (Term Frequency):* $\frac{\text{Count of 'machine' in sentence}}{\text{Total words in sentence}}$

*IDF (Inverse Document Frequency):* $\log(\frac{\text{Total number of documents}}{\text{Number of documents containing 'machine'}})$ (Use math.log).

**Result:** TF * IDF.

**Task:** Print your manual calculation result.

In [None]:
# write code here
import math

# Corpus
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

# Target sentence
sentence = corpus[-1]
word = "machine"

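# Step 1: Term Frequency (TF)
# Note: split() leaves punctuation attached (e.g. "learning,"). That is fine
# here because "machine" is never adjacent to punctuation in this corpus.
tokens = sentence.lower().split()
tf = tokens.count(word) / len(tokens)

# Step 2: Inverse Document Frequency (IDF)
num_docs = len(corpus)
docs_with_word = sum(1 for doc in corpus if word in doc.lower().split())

idf = math.log(num_docs / docs_with_word)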

# Step 3: TF-IDF
tfidf = tf * idf

print("TF:", tf)
print("IDF:", idf)
print("TF-IDF:", tfidf)


TF: 0.09090909090909091
IDF: 1.0986122886681098
TF-IDF: 0.09987384442437362
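
As a check on the printed values, the arithmetic can be written out by hand. Whitespace splitting gives 11 tokens for the last sentence, "machine" occurs once, and it appears in 2 of the 6 documents:

$$\text{TF} = \frac{1}{11} \approx 0.0909, \qquad \text{IDF} = \ln\frac{6}{2} = \ln 3 \approx 1.0986, \qquad \text{TF} \times \text{IDF} \approx 0.0909 \times 1.0986 \approx 0.0999$$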


**2.4: TF-IDF Using Tools**

**Task:** Use sklearn.feature_extraction.text.TfidfVectorizer.

**Steps:** Fit it on the corpus and print the vector for the first sentence.

**Observation:** Compare the score of unique words (like "Intelligence") vs common words (like "is"). Which is higher?

In [None]:
# code here
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. Create vectorizer
tfidf = TfidfVectorizer()

# 2. Fit and transform
tfidf_matrix = tfidf.fit_transform(corpus)

# 3. Vector for the FIRST sentence
first_sentence_vector = tfidf_matrix[0].toarray()

# 4. Print feature names and vector
print("Vocabulary (Feature Names):")
print(tfidf.get_feature_names_out())

print("\nTF-IDF Vector for First Sentence:")
print(first_sentence_vector)


Vocabulary (Feature Names):
['absolutely' 'algebra' 'and' 'artificial' 'back' 'be' 'behind' 'brown'
 'but' 'concerns' 'data' 'delicious' 'dog' 'ethical' 'fox' 'go' 'hate'
 'however' 'in' 'intelligence' 'involves' 'is' 'it' 'jumps' 'lazy'
 'learning' 'linear' 'love' 'machine' 'math' 'mind' 'nobler' 'not' 'or'
 'over' 'pizza' 'question' 'quick' 'remain' 'science' 'service'
 'statistics' 'terrible' 'that' 'the' 'tis' 'to' 'transforming' 'was'
 'whether' 'won' 'world']

TF-IDF Vector for First Sentence:
[[0.         0.         0.         0.33454543 0.         0.
  0.         0.         0.         0.33454543 0.         0.
  0.         0.33454543 0.         0.         0.         0.33454543
  0.         0.33454543 0.         0.27433204 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.33454543 0.         0.         0.
  0.         0.         0.17139656 0.         0.         0.33454543
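
These scores do not match the formula from 2.3 because TfidfVectorizer, with default settings, uses a smoothed IDF, $\ln\frac{1 + n}{1 + \text{df}} + 1$, multiplies it by the raw term count, and then L2-normalizes each row. The sketch below reproduces that calculation with the standard library only. The helper names are hypothetical, and the tokenizer assumes the default token pattern (two or more word characters).

```python
import math
import re
from collections import Counter

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it.",
]

def tokenize(text):
    # Assumed equivalent of the vectorizer's default tokenization.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

docs = [tokenize(doc) for doc in corpus]
n_docs = len(docs)
# Document frequency: number of documents containing each term.
doc_freq = Counter(tok for doc in docs for tok in set(doc))

def smoothed_idf(term):
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = {t: c * smoothed_idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 norm of the row
    return {t: v / norm for t, v in raw.items()}

row0 = tfidf_row(docs[0])
for term in ("intelligence", "is", "the"):
    print(term, round(row0[term], 4))
# intelligence 0.3345
# is 0.2743
# the 0.1714
```

The rarer the word across documents, the higher its weight: "intelligence" (one document) scores above "is" (two documents), which scores above "the" (five documents).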

# **Part 3: Word Embeddings**

**3.1: Word2Vec Using Tools**

**Task:** Train a model using gensim.models.Word2Vec.

**Steps:**

1. Pass your cleaned tokenized corpus (from Part 1.2) to Word2Vec.

2. Set min_count=1 (since our corpus is small, we want to keep all words).

3. Set vector_size=10 (small vector size for easy viewing).

**Experiment:** Print the vector for the word "learning".

In [None]:
!pip install gensim



In [None]:
#code here
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Preprocessing function from Part 1.2
def clean_text_nltk(text):
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalnum()]  # keep alphanumeric
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]  # remove stopwords
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]  # lemmatize
    return tokens

# Clean + tokenize entire corpus
cleaned_corpus = [clean_text_nltk(sentence) for sentence in corpus]

# Train Word2Vec model
model = Word2Vec(
    sentences=cleaned_corpus,
    vector_size=10,  # small vector size for easy viewing
    min_count=1,     # keep all words, corpus is small
    workers=4
)

# Print vector for the word "learning"
print("Vector for 'learning':")
print(model.wv["learning"])

Vector for 'learning':
[-0.00535678  0.00238785  0.05107836  0.09016657 -0.09301379 -0.07113771
  0.06464887  0.08973394 -0.05023384 -0.03767424]


**3.3: Pre-trained GloVe (Understanding Global Context)**

**Task:** Use gensim.downloader to load 'glove-wiki-gigaword-50'

**Analogy Task:** Compute the famous analogy: $\text{King} - \text{Man} + \text{Woman} = ?$

Use model.most_similar(positive=['woman', 'king'], negative=['man']).

**Question:** Does the model correctly guess "Queen"?

In [None]:
import gensim.downloader as api

# Load pre-trained GloVe model
glove_model = api.load('glove-wiki-gigaword-50')

# Analogy task: King - Man + Woman = ?
analogy_result = glove_model.most_similar(positive=['woman', 'king'], negative=['man'])

print("King - Man + Woman = ?")
print(analogy_result)

# Check if 'queen' is among the top results
does_guess_queen = any(word.lower() == 'queen' for word, score in analogy_result)

print(f"\nDoes the model correctly guess 'Queen'? {does_guess_queen}")

King - Man + Woman = ?
[('queen', 0.8523604273796082), ('throne', 0.7664334177970886), ('prince', 0.7592144012451172), ('daughter', 0.7473883628845215), ('elizabeth', 0.7460219860076904), ('princess', 0.7424570322036743), ('kingdom', 0.7337412238121033), ('monarch', 0.721449077129364), ('eldest', 0.7184861898422241), ('widow', 0.7099431157112122)]

Does the model correctly guess 'Queen'? True
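
The idea behind the analogy query can be sketched without downloading a model: add and subtract the word vectors, then rank the remaining words by cosine similarity to the result. The two-dimensional vectors below are invented for illustration (roughly "royalty" and "gender" axes) and are not GloVe values. The sketch also simplifies what gensim does, since most_similar normalizes the vectors before combining them.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, not taken from any trained model.
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "prince": [0.8, 0.7],
    "apple": [-0.5, 0.1],
}

def toy_analogy(positive, negative, vectors):
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Exclude the query words themselves, as most_similar does.
    query_words = set(positive) | set(negative)
    scored = [(w, cosine(target, vec))
              for w, vec in vectors.items() if w not in query_words]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = toy_analogy(["woman", "king"], ["man"], toy_vectors)
print(ranking[0][0])  # queen
```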


# **Part 5: Sentiment Analysis (The Application)**

**Concept:** Sentiment Analysis determines whether a piece of text is Positive, Negative, or Neutral. We will use VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK. VADER is specifically designed for social media text; it understands that capital letters ("LOVE"), punctuation ("!!!"), and emojis change the sentiment intensity.

**Task:**

1. Initialize the SentimentIntensityAnalyzer.

2. Pass the Pizza Review (corpus[1]) into the analyzer.

3. Pass the Math Complaint (corpus[5]) into the analyzer.

**Analysis:** Look at the compound score for both.

**Compound Score Range:** -1 (Most Negative) to +1 (Most Positive).

Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?

In [None]:
# code here
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download VADER (run once)
nltk.download('vader_lexicon')

# Initialize analyzer
sia = SentimentIntensityAnalyzer()

# Sentences
pizza_review = corpus[1]
math_complaint = corpus[5]

# Analyze sentiments
pizza_score = sia.polarity_scores(pizza_review)
math_score = sia.polarity_scores(math_complaint)

print("Pizza Review Sentiment:", pizza_score)
print("Math Complaint Sentiment:", math_score)


Pizza Review Sentiment: {'neg': 0.223, 'neu': 0.644, 'pos': 0.134, 'compound': -0.3926}
Math Complaint Sentiment: {'neg': 0.345, 'neu': 0.478, 'pos': 0.177, 'compound': -0.5346}
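
To turn the compound scores into labels, a common convention with VADER is to call a text positive at 0.05 or above, negative at -0.05 or below, and neutral in between. These thresholds are an assumption brought in here, not part of the assignment, and the helper name is hypothetical. The sketch applies them to the two compound values printed above.

```python
def label_compound(compound, threshold=0.05):
    # Thresholds follow a commonly used VADER convention (assumed here).
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the output above.
for name, compound in [("Pizza review", -0.3926), ("Math complaint", -0.5346)]:
    print(f"{name}: {label_compound(compound)}")
# Pizza review: negative
# Math complaint: negative
```

Under this convention the pizza review is labeled negative rather than mixed or neutral, even though it contains both "delicious" and "terrible".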


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
