# **Part 1: Text Preprocessing**



**Data:** given in the code cell below

**1.1: Preprocessing From Scratch**

**Goal:** Write a function clean_text_scratch(text) that performs the following without using NLTK or Spacy:

1. Lowercasing: Convert text to lowercase.

2. Punctuation Removal: Use Python's re (regex) library or string methods to remove special characters (!, ., ,, :, ;, ..., ').

3. Tokenization: Split the string into a list of words based on whitespace.

4. Stopword Removal: Filter out words found in this list: ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'].

5. Simple Stemming: Create a helper function that removes suffixes 'ing', 'ly', 'ed', and 's' from the end of words.


Note: This is a "Naive" stemmer. It will break words like "sing" -> "s". This illustrates why we need libraries!

**Task:** Run this function on the first sentence of the corpus and print the result.
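The five steps above can be sketched as a single function. This is a minimal sketch rather than a reference solution: the names `STOP_WORDS`, `SUFFIXES`, and `simple_stem` are illustrative choices, and the stemmer strips only the first matching suffix, which is exactly what makes it naive.

```python
import re

# Stopword list and suffixes copied from the task description.
STOP_WORDS = {"the", "is", "in", "to", "of", "and", "a", "it", "was", "but", "or"}
SUFFIXES = ("ing", "ly", "ed", "s")


def simple_stem(word):
    """Strip the first matching suffix; deliberately naive ("sing" -> "s")."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def clean_text_scratch(text):
    text = text.lower()                                   # 1. lowercase
    text = re.sub(r"[!.,:;']", "", text)                  # 2. drop the listed punctuation
    tokens = text.split()                                 # 3. whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 4. stopword removal
    return [simple_stem(t) for t in tokens]               # 5. naive stemming


first_sentence = "Artificial Intelligence is transforming the world; however, ethical concerns remain!"
print(clean_text_scratch(first_sentence))
print(simple_stem("sing"))
```

Removing punctuation before splitting matters here: splitting first would leave tokens such as `world;` that never match the stopword list or the suffix rules cleanly.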

In [None]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [None]:
import re
#write rest of the code here

##LOWERCASING
t="Artificial Intelligence is transforming the world; however, ethical concerns remain!"
t=t.lower()
print(t)
print(" ")##adds blank line after printing

##TOKENIZATION
doc=t.split()
print(doc)
print(" ")##adds blank line after printing

##REMOVE THE SPECIAL CHARACTERS
clean_text=" "
clean_text1=" "
char_to_remove="!.,:;'..."
for char in t:
  if char not in char_to_remove:
    clean_text+=char
  else:
    clean_text1+=char
print("Sentence after removing special characters:")
print(clean_text)
print(" ")##adds blank line after printing

##STOPWORD REMOVAL
##The task forbids NLTK and spaCy, so the stopword list is written out by hand
stop_words=["the","is","in","to","of","and","a","it","was","but","or"]
tokens=clean_text.split()##re-tokenize the punctuation-free text
filtered_tokens=[word for word in tokens if word not in stop_words]
print("Sentence after removing stopwords:")
print(filtered_tokens)
print(" ")##adds blank line after printing

##STEMMING
##Naive stemmer: strips the first matching suffix, so "sing" becomes "s"
def simple_stem(word):
  for suffix in ("ing","ly","ed","s"):
    if word.endswith(suffix):
      return word[:-len(suffix)]
  return word

stemmed_tokens=[simple_stem(word) for word in filtered_tokens]
print("Sentence after stemming:")
print(stemmed_tokens)






artificial intelligence is transforming the world; however, ethical concerns remain!
 
['artificial', 'intelligence', 'is', 'transforming', 'the', 'world;', 'however,', 'ethical', 'concerns', 'remain!']
 
Sentence after removing special characters:
 artificial intelligence is transforming the world however ethical concerns remain
 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**1.2: Preprocessing Using Tools**

**Goal:** Use the nltk library to perform the same cleaning on the entire corpus.

**Steps:**

1. Use nltk.tokenize.word_tokenize.
2. Use nltk.corpus.stopwords.
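3. Use nltk.stem.WordNetLemmatizer to convert words to their root (e.g., "jumps" $\to$ "jump", "transforming" $\to$ "transform"). Note that lemmatize() assumes a noun unless a part-of-speech tag is passed, so "transforming" only reduces to "transform" with pos='v'.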


**Task:** Print the cleaned, lemmatized tokens for the second sentence (The pizza review).

In [9]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#write rest of the code here
##TOKENIZATION
x="The pizza was absolutely delicious, but the service was terrible ... I won't go back."
words=word_tokenize(x)
print(words)
print(" ")##adds blank line after printing

##STOPWORDS
stop_words=set(stopwords.words('english'))
tokens=word_tokenize(x.lower())

filtered_tokens=[word for word in tokens if word not in stop_words]
print("Original:",tokens)
print("Filtered Tokens:",filtered_tokens)
print(" ")##adds blank line after printing

##LEMMATIZATION
lemmatizer=WordNetLemmatizer()
tokens1=word_tokenize(x)
##lemmatize() treats every token as a noun by default, which is why 'was' becomes 'wa'
lemmatized_words=[lemmatizer.lemmatize(word) for word in tokens1]
print(f"Original Text:{x}")
print(f"Lemmatized Text:{lemmatized_words}")


['The', 'pizza', 'was', 'absolutely', 'delicious', ',', 'but', 'the', 'service', 'was', 'terrible', '...', 'I', 'wo', "n't", 'go', 'back', '.']
 
Original: ['the', 'pizza', 'was', 'absolutely', 'delicious', ',', 'but', 'the', 'service', 'was', 'terrible', '...', 'i', 'wo', "n't", 'go', 'back', '.']
Filtered Tokens: ['pizza', 'absolutely', 'delicious', ',', 'service', 'terrible', '...', 'wo', "n't", 'go', 'back', '.']
 
Original Text:The pizza was absolutely delicious, but the service was terrible ... I won't go back.
Lemmatized Text:['The', 'pizza', 'wa', 'absolutely', 'delicious', ',', 'but', 'the', 'service', 'wa', 'terrible', '...', 'I', 'wo', "n't", 'go', 'back', '.']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


# **Part 2: Text Representation**

**2.1: Bag of Words (BoW)**

**Logic:**

**Build Vocabulary:** Create a list of all unique words in the entire corpus (after cleaning). Sort them alphabetically.

**Vectorize:** Write a function that takes a sentence and returns a list of numbers. Each number represents the count of a vocabulary word in that sentence.

**Task:** Print the unique Vocabulary list. Then, print the BoW vector for: "The quick brown fox jumps over the lazy dog."
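To make the two steps concrete before applying them to the full corpus, here is a small sketch on a hypothetical two-sentence toy corpus (`toy_corpus` and `bow_vector` are illustrative names, and the toy sentences need no cleaning).

```python
from collections import Counter

# Hypothetical toy corpus, small enough to check by hand.
toy_corpus = ["the cat sat", "the cat ate the fish"]

# Build vocabulary: every unique word across the corpus, sorted alphabetically.
vocabulary = sorted({word for sentence in toy_corpus for word in sentence.split()})


def bow_vector(sentence, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]


print(vocabulary)                             # ['ate', 'cat', 'fish', 'sat', 'the']
print(bow_vector(toy_corpus[1], vocabulary))  # [1, 1, 1, 0, 2]
```

Each position in the vector lines up with one vocabulary word, so "the" (last alphabetically) contributes the trailing 2 for the second sentence.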

In [None]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [34]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# NLTK downloads (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# rest of the code here
import re
from collections import Counter

def build_vocabulary(corpus):
  # Collect the cleaned words of every sentence, then keep the sorted unique set
  words=[]
  for sentence in corpus:
    cleaned_sentence = re.sub(r'[^\w\s]','',sentence).lower()
    words.extend(cleaned_sentence.split())
  return sorted(set(words))

def vectorize_sentence(sentence,vocabulary):
  # Count each vocabulary word in the cleaned sentence (0 if absent)
  cleaned_sentence = re.sub(r'[^\w\s]','',sentence).lower()
  words=cleaned_sentence.split()
  word_counts=Counter(words)
  return [word_counts.get(word,0) for word in vocabulary]

# Uses the corpus defined in the cell above
vocabulary=build_vocabulary(corpus)
sentence="The quick brown fox jumps over the lazy dog."
bow_vector=vectorize_sentence(sentence,vocabulary)

print("--BAG OF WORDS IMPLEMENTATION--")
print("Vocabulary:",vocabulary)
print("BOW Vector for sentence:",sentence)
print(bow_vector)




**2.2: BoW Using Tools**

**Task:** Use sklearn.feature_extraction.text.CountVectorizer.

**Steps:**

1. Instantiate the vectorizer.

2. fit_transform the raw corpus.

3. Convert the result to an array (.toarray()) and print it.

In [None]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
#rest of the code here
# Note: this cell fits TfidfVectorizer, so the matrix printed below holds TF-IDF weights, not raw counts
corpus=["Artificial Intelligence is transforming the world; however, ethical concerns remain!",
"The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
"The quick brown fox jumps over the lazy dog.",
"To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
"Data science involves statistics, linear algebra, and machine learning.",
"I love machine learning, but I hate the math behind it."]
tr_idf_model=TfidfVectorizer()
tf_idf_vector=tr_idf_model.fit_transform(corpus)
print(tf_idf_vector.shape)
print(tf_idf_vector.toarray())

(6, 52)
[[0.         0.         0.         0.33454543 0.         0.
  0.         0.         0.         0.33454543 0.         0.
  0.         0.33454543 0.         0.         0.         0.33454543
  0.         0.33454543 0.         0.27433204 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.33454543 0.         0.         0.
  0.         0.         0.17139656 0.         0.         0.33454543
  0.         0.         0.         0.33454543]
 [0.26995162 0.         0.         0.         0.26995162 0.
  0.         0.         0.22136419 0.         0.         0.26995162
  0.         0.         0.         0.26995162 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.26995162
  0.         0.         0.         0.         0.26995162 0.
  0.26995162 

**2.3: TF-IDF From Scratch (The Math)**

**Goal:** Manually calculate the score for the word "machine" in the last sentence:

"I love machine learning, but I hate the math behind it."

**Formula:**

*TF (Term Frequency):* $\frac{\text{Count of 'machine' in sentence}}{\text{Total words in sentence}}$

*IDF (Inverse Document Frequency):* $\log(\frac{\text{Total number of documents}}{\text{Number of documents containing 'machine'}})$ (Use math.log).

**Result:** TF * IDF.

**Task:** Print your manual calculation result.
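As a sanity check on the formula, the arithmetic can be worked by hand. This assumes punctuation is stripped and the text is lowercased, which leaves 11 words in the last sentence, and that "machine" appears in 2 of the 6 documents (the last two sentences):

$$\text{TF} = \frac{1}{11} \approx 0.0909$$

$$\text{IDF} = \log\left(\frac{6}{2}\right) = \ln 3 \approx 1.0986$$

$$\text{TF} \times \text{IDF} \approx 0.0909 \times 1.0986 \approx 0.0999$$

An IDF of exactly 0 would only arise if the word appeared in every document, so a zero score for "machine" points to a bug in the document-counting step rather than a property of the corpus.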

In [20]:
import math
import re

# The full corpus from the problem description
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

target_word = "machine"
last_sentence_index = len(corpus) - 1
last_sentence = corpus[last_sentence_index]


cleaned_last_sentence = re.sub(r'[^\w\s]', '', last_sentence).lower()
words_in_last_sentence = cleaned_last_sentence.split()


count_machine_in_sentence = words_in_last_sentence.count(target_word)


total_words_in_sentence = len(words_in_last_sentence)


if total_words_in_sentence > 0:
    tf = count_machine_in_sentence / total_words_in_sentence
else:
    tf = 0.0

print(f"Term Frequency (TF) for '{target_word}' in the last sentence: {tf:.4f}")


total_documents = len(corpus)
documents_containing_machine = 0

for doc in corpus:

    cleaned_doc = re.sub(r'[^\w\s]', '', doc).lower()
    if target_word in cleaned_doc.split():
        documents_containing_machine += 1


if documents_containing_machine > 0:
    idf = math.log(total_documents / documents_containing_machine)
else:
    idf = 0.0

print(f"Inverse Document Frequency (IDF) for '{target_word}' across the corpus: {idf:.4f}")

tf_idf_score = tf * idf
print(f"Manual TF-IDF score for '{target_word}' in the last sentence: {tf_idf_score:.4f}")

Term Frequency (TF) for 'machine' in the last sentence: 0.0909
Inverse Document Frequency (IDF) for 'machine' across the corpus: 1.0986
Manual TF-IDF score for 'machine' in the last sentence: 0.0999


**2.4: TF-IDF Using Tools**

**Task:** Use sklearn.feature_extraction.text.TfidfVectorizer.

**Steps:** Fit it on the corpus and print the vector for the first sentence.

**Observation:** Compare the score of unique words (like "Intelligence") vs common words (like "is"). Which is higher?

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
# rest of the code here
tr_idf_model=TfidfVectorizer()
corpus1=["Artificial Intelligence is transforming the world; however, ethical concerns remain!",
         "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
         "The quick brown fox jumps over the lazy dog.",
         "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
         "Data science involves statistics, linear algebra, and machine learning.",
         "I love machine learning, but I hate the math behind it."]
tf_idf_vector=tr_idf_model.fit_transform(corpus1)
tf_array=tf_idf_vector.toarray()

# Full TF-IDF matrix: one row per sentence, one column per vocabulary word
df_tf_idf=pd.DataFrame(tf_array,columns=tr_idf_model.get_feature_names_out())
print(df_tf_idf)

# Vector for the first sentence only (reuses the fitted matrix instead of refitting)
print(tf_array[0])

   absolutely   algebra       and  artificial      back        be    behind  \
0    0.000000  0.000000  0.000000    0.334545  0.000000  0.000000  0.000000   
1    0.269952  0.000000  0.000000    0.000000  0.269952  0.000000  0.000000   
2    0.000000  0.000000  0.000000    0.000000  0.000000  0.000000  0.000000   
3    0.000000  0.000000  0.000000    0.000000  0.000000  0.462221  0.000000   
4    0.000000  0.346171  0.346171    0.000000  0.000000  0.000000  0.000000   
5    0.000000  0.000000  0.000000    0.000000  0.000000  0.000000  0.370631   

      brown       but  concerns  ...  terrible      that       the       tis  \
0  0.000000  0.000000  0.334545  ...  0.000000  0.000000  0.171397  0.000000   
1  0.000000  0.221364  0.000000  ...  0.269952  0.000000  0.276607  0.000000   
2  0.352456  0.000000  0.000000  ...  0.000000  0.000000  0.361145  0.000000   
3  0.000000  0.000000  0.000000  ...  0.000000  0.231111  0.236808  0.231111   
4  0.000000  0.000000  0.000000  ...  0.000000

# **Part 3: Word Embeddings**

**3.1: Word2Vec Using Tools**

**Task:** Train a model using gensim.models.Word2Vec.

**Steps:**

1. Pass your cleaned tokenized corpus (from Part 1.2) to Word2Vec.

2. Set min_count=1 (since our corpus is small, we want to keep all words).

3. Set vector_size=10 (small vector size for easy viewing).

**Experiment:** Print the vector for the word "learning".

In [13]:
!pip install gensim
from gensim.models import Word2Vec
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


#rest of the code here
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go again",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]
cleaned_corpus=[]
# Build the stopword set and the lemmatizer once, outside the loop
stop_words=set(stopwords.words('english'))
lemmatizer=WordNetLemmatizer()
for sentence in corpus:
  # Strip punctuation (keep word characters and whitespace), then lowercase
  cleaned_sentence=re.sub(r'[^\w\s]','',sentence).lower()
  words=word_tokenize(cleaned_sentence)
  filtered_words=[word for word in words if word not in stop_words]
  lemmatized_words=[lemmatizer.lemmatize(word) for word in filtered_words]
  cleaned_corpus.append(lemmatized_words)

# Train the model after the entire corpus has been cleaned and tokenized
model=Word2Vec(cleaned_corpus,min_count=1,vector_size=10)

# Now print the vector for 'learning'
print("Vector for 'learning':")
print(model.wv['learning'])

Vector for 'learning':
[-0.00537893  0.00240059  0.0510531   0.09015553 -0.09299233 -0.07115052
  0.0646439   0.08971389 -0.05020833 -0.03767695]




**3.3: Pre-trained GloVe (Understanding Global Context)**

**Task:** Use gensim.downloader to load 'glove-wiki-gigaword-50'

**Analogy Task:** Compute the famous analogy: $\text{King} - \text{Man} + \text{Woman} = ?$

Use model.most_similar(positive=['woman', 'king'], negative=['man']).

**Question:** Does the model correctly guess "Queen"?
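The idea behind the `positive`/`negative` arguments can be illustrated without downloading GloVe. The sketch below uses hypothetical 2-D toy vectors (axes loosely meaning "royalty" and "gender"), not real embeddings, and a simplified `analogy` helper that adds the positive vectors, subtracts the negative ones, and ranks the remaining words by cosine similarity. gensim's own implementation also normalizes the vectors first, so treat this as an approximation of the idea.

```python
import math

# Hypothetical toy vectors (royalty, gender); NOT real GloVe values.
toy_vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.1, 1.0),
    "woman": (0.1, -1.0),
    "prince": (0.8, 0.9),
    "apple": (-1.0, 0.1),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(positive, negative, vectors):
    """Rank words by similarity to sum(positive) - sum(negative), skipping the inputs."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return sorted(candidates, key=lambda w: cosine(target, vectors[w]), reverse=True)


ranking = analogy(["woman", "king"], ["man"], toy_vectors)
print(ranking[0])
```

Excluding the input words from the candidates matters: otherwise "king" itself often sits closest to the target vector.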

In [19]:
import gensim.downloader as api

# Load pre-trained GloVe model
glove_model = api.load('glove-wiki-gigaword-50')

# --- Analogy Task ---
# Compute the analogy: King - Man + Woman = ?
result = glove_model.most_similar(positive=['woman', 'king'], negative=['man'])

print(f"King - Man + Woman = {result[0][0]}")

# Question: Does the model correctly guess "Queen"?
# Check if 'queen' is the top result
is_queen_correct = (result[0][0].lower() == 'queen')
print(f"Does the model correctly guess \"Queen\"? {is_queen_correct}")

King - Man + Woman = queen
Does the model correctly guess "Queen"? True


# **Part 5: Sentiment Analysis (The Application)**

**Concept:** Sentiment Analysis determines whether a piece of text is Positive, Negative, or Neutral. We will use VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK. VADER is specifically designed for social media text; it understands that capital letters ("LOVE"), punctuation ("!!!"), and emojis change the sentiment intensity.

**Task:**

1. Initialize the SentimentIntensityAnalyzer.

2. Pass the Pizza Review (corpus[1]) into the analyzer.

3. Pass the Math Complaint (corpus[5]) into the analyzer.

**Analysis:** Look at the compound score for both.

**Compound Score Range:** -1 (Most Negative) to +1 (Most Positive).

Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?
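To read the compound score consistently, it helps to turn it into a label. The sketch below assumes the ±0.05 cut-offs used in this section's code cell (`label_from_compound` is an illustrative helper, not part of NLTK) and applies them to the two compound scores recorded in this notebook's output.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Compound scores copied from the recorded output of this notebook.
recorded = {"pizza review": -0.3926, "math complaint": -0.5346}

for name, score in recorded.items():
    print(f"{name}: {score:+.4f} -> {label_from_compound(score)}")
```

Both sentences land on the negative side rather than neutral: in each, the negative word ("terrible", "hate") follows "but", and VADER tends to give more weight to the clause after a contrastive "but".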

In [23]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
# Download the VADER lexicon (run once); NLTK bundles the analyzer, so no separate vaderSentiment install is needed
nltk.download('vader_lexicon')

# Using NLTK's SentimentIntensityAnalyzer
def sentiment_scores(sentence):
    sid_obj = SentimentIntensityAnalyzer()
    sentiment_dict = sid_obj.polarity_scores(sentence)

    print(f"Sentiment Scores: {sentiment_dict}")
    print(f"Negative Sentiment: {sentiment_dict['neg']*100:.2f}%")
    print(f"Neutral Sentiment: {sentiment_dict['neu']*100:.2f}%")
    print(f"Positive Sentiment: {sentiment_dict['pos']*100:.2f}%")

    if sentiment_dict['compound'] >= 0.05:
        print("Overall Sentiment: Positive")
    elif sentiment_dict['compound'] <= -0.05:
        print("Overall Sentiment: Negative")
    else:
        print("Overall Sentiment: Neutral")

# Define the corpus for use in the task
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]


# Task 2: Pass the Pizza Review (corpus[1]) into the analyzer.
print("\n--- Analyzing Pizza Review (corpus[1]) ---")
pizza_review = corpus[1]
print(f"Statement: {pizza_review}")
sentiment_scores(pizza_review)

# Task 3: Pass the Math Complaint (corpus[5]) into the analyzer.
print("\n--- Analyzing Math Complaint (corpus[5]) ---")
math_complaint = corpus[5]
print(f"Statement: {math_complaint}")
sentiment_scores(math_complaint)


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!



--- Analyzing Pizza Review (corpus[1]) ---
Statement: The pizza was absolutely delicious, but the service was terrible ... I won't go back.
Sentiment Scores: {'neg': 0.223, 'neu': 0.644, 'pos': 0.134, 'compound': -0.3926}
Negative Sentiment: 22.30%
Neutral Sentiment: 64.40%
Positive Sentiment: 13.40%
Overall Sentiment: Negative

--- Analyzing Math Complaint (corpus[5]) ---
Statement: I love machine learning, but I hate the math behind it.
Sentiment Scores: {'neg': 0.345, 'neu': 0.478, 'pos': 0.177, 'compound': -0.5346}
Negative Sentiment: 34.50%
Neutral Sentiment: 47.80%
Positive Sentiment: 17.70%
Overall Sent