# **Assignment Notebook – Text Vectorization & Model Comparison**
**Student:** K. Yugavardhan  
**Mentor:** Resma Rani Nimalpuri  
**Program:** Audio Analysis Project  
**Dates Assigned:**  
- **4th December 2025** – Understanding text preprocessing and vectorization  
- **5th December 2025** – Comparing CountVectorizer, TF-IDF, and Word2Vec with classification models  

---

## **Objective**
The goal of this assignment is to learn and implement three common text representation techniques used in Natural Language Processing (NLP):

1. **CountVectorizer**  
2. **TF-IDF Vectorizer**  
3. **Word2Vec (averaged embeddings)**  

We compare their performance on a simple **10,000-sample text dataset**, using two ML models:

- **Logistic Regression**  
- **Multinomial Naive Bayes**

We also explore:

- **N-grams (1 to 3)**  
- **Feature explosion** when using n-gram models  
- **Basic hyperparameter tuning** using grid search with `GridSearchCV` (`C` for Logistic Regression, `alpha` for Naive Bayes)
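
One way the tuning step could be sketched, assuming scikit-learn's `GridSearchCV` wrapped around a `Pipeline`. The toy sentences, the parameter grids, the 3-fold setting, and the `tune` helper are illustrative assumptions, not settings taken from this notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data for illustration only (hypothetical, in the style of the notebook).
texts = [
    "I love machine learning",
    "This is a great project",
    "Audio processing is fun",
    "I dislike noisy data",
    "Speech recognition is useful",
    "I dislike slow models",
] * 20
labels = [1, 1, 1, 0, 1, 0] * 20


def tune(classifier, param_grid):
    """Grid-search one classifier on top of unigram + bigram counts."""
    pipeline = Pipeline([
        ("vec", CountVectorizer(ngram_range=(1, 2))),
        ("clf", classifier),
    ])
    # Parameter names use the "<step>__<param>" convention of Pipeline.
    search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
    search.fit(texts, labels)
    return search


lr_search = tune(LogisticRegression(max_iter=1000), {"clf__C": [0.01, 0.1, 1, 10]})
nb_search = tune(MultinomialNB(), {"clf__alpha": [0.1, 0.5, 1.0]})

print("Best C:", lr_search.best_params_["clf__C"], "CV accuracy:", lr_search.best_score_)
print("Best alpha:", nb_search.best_params_["clf__alpha"], "CV accuracy:", nb_search.best_score_)
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the cross-validation score is not inflated by vocabulary learned from the held-out fold.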

---

**This assignment helps build foundational understanding of:**
- How text is converted into numerical features  
- How different vectorizers affect model performance  
- Why n-grams increase feature size and risk overfitting  
- How dense embeddings (Word2Vec) differ from sparse, high-dimensional vectors  
- How hyperparameter tuning can improve model accuracy  

These concepts are essential for working on real-world projects like **audio transcription, topic segmentation**, and **podcast analysis**, where text processing is a major component.



# ***1 – Install Libraries***

In [15]:
!pip install -q kaggle kagglehub pandas scikit-learn gensim nltk

# ***2 — Download LibriSpeech train-clean-100 from Kaggle***

In [23]:
#!kaggle datasets download -d bacnguyenne/librispeech-train-clean-100 -p /content
import kagglehub

# Download latest version
path = kagglehub.dataset_download("bacnguyenne/librispeech-train-clean-100")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/bacnguyenne/librispeech-train-clean-100?dataset_version_number=3...


100%|██████████| 28.1G/28.1G [13:07<00:00, 38.3MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/bacnguyenne/librispeech-train-clean-100/versions/3


**Verify Download and Move**

In [25]:
!mkdir -p /content/librispeech_data

In [27]:
!mv /root/.cache/kagglehub/datasets/bacnguyenne/librispeech-train-clean-100/versions/3/* /content/librispeech_data/

In [28]:
!ls /content/librispeech_data

BOOKS.TXT     dev-clean    README.TXT	 test-clean	  train-clean-360
CHAPTERS.TXT  LICENSE.TXT  SPEAKERS.TXT  train-clean-100


# ***3 – Import libraries + Create a small 10,000-sample dataset***

In [33]:
import pandas as pd

data = {
    "text": [
        "I love machine learning",
        "This is a great project",
        "Audio processing is fun",
        "I dislike noisy data",
        "Speech recognition is useful",
        "Deep learning models are powerful"
    ] * 2000,   # 6 sentences x 2000 = 12,000 rows before sampling

    "label": [1,1,1,0,1,1] * 2000
}

df = pd.DataFrame(data)
df = df.sample(10000, random_state=42)

X = df["text"]
y = df["label"]

print("Dataset size:", len(X))

Dataset size: 10000


# ***4 – Train/Test Split***

In [35]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
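
Because five of the six base sentences carry label 1, the classes are imbalanced at roughly 5:1. A hedged sketch of a stratified split, which keeps that ratio the same in both partitions. Passing `stratify=` is an optional variation and not what the cell above does, and the placeholder texts are hypothetical:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Same 5:1 label pattern as the toy data, at a smaller size for illustration.
texts = [f"sample {i}" for i in range(600)]
labels = [1, 1, 1, 0, 1, 1] * 100

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Both partitions should show the same class proportions.
print("Train class counts:", Counter(y_tr))
print("Test class counts:", Counter(y_te))
```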

# ***5 – CountVectorizer + Logistic Regression***

In [37]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

cv = CountVectorizer(ngram_range=(1,3))
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

model_cv = LogisticRegression()
model_cv.fit(X_train_cv, y_train)

pred_cv = model_cv.predict(X_test_cv)
acc_cv = accuracy_score(y_test, pred_cv)

print("CountVectorizer Accuracy:", acc_cv)
print("Feature Count:", len(cv.vocabulary_))

CountVectorizer Accuracy: 0.9805
Feature Count: 377656
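
The vocabulary grows as the n-gram range widens, since every distinct bigram and trigram becomes its own column. A small sketch, using the six base sentences from section 3, that counts features for each upper n-gram limit. The loop and the `feature_counts` dictionary are illustrative names:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love machine learning",
    "This is a great project",
    "Audio processing is fun",
    "I dislike noisy data",
    "Speech recognition is useful",
    "Deep learning models are powerful",
]

feature_counts = {}
for max_n in (1, 2, 3):
    # ngram_range=(1, max_n) keeps unigrams and adds longer n-grams on top.
    vectorizer = CountVectorizer(ngram_range=(1, max_n))
    vectorizer.fit(corpus)
    feature_counts[max_n] = len(vectorizer.vocabulary_)
    print(f"ngram_range=(1, {max_n}): {feature_counts[max_n]} features")
```

Note that the default tokenizer drops single-character tokens such as "I" and "a", so they never appear in the vocabulary.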


# ***6 — TF-IDF + Multinomial Naive Bayes***

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

tfidf = TfidfVectorizer(ngram_range=(1,3))
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

model_nb = MultinomialNB()
model_nb.fit(X_train_tfidf, y_train)

pred_tfidf = model_nb.predict(X_test_tfidf)
acc_tfidf = accuracy_score(y_test, pred_tfidf)

print("TF-IDF Accuracy:", acc_tfidf)
print("Feature Count:", len(tfidf.vocabulary_))

TF-IDF Accuracy: 0.9655
Feature Count: 377656


# ***7 — Word2Vec***

In [39]:
from gensim.models import Word2Vec
import numpy as np

sentences = [text.split() for text in X_train]

w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1)

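def avg_vector(words):
    # Average the vectors of in-vocabulary words; fall back to a zero vector
    # when none of the words were seen during Word2Vec training.
    vectors = [w2v.wv[w] for w in words if w in w2v.wv]
    if not vectors:
        return np.zeros(w2v.wv.vector_size)
    return np.mean(vectors, axis=0)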

X_train_w2v = np.array([avg_vector(t.split()) for t in X_train])
X_test_w2v  = np.array([avg_vector(t.split()) for t in X_test])

model_w2v = LogisticRegression()
model_w2v.fit(X_train_w2v, y_train)

pred_w2v = model_w2v.predict(X_test_w2v)
acc_w2v = accuracy_score(y_test, pred_w2v)

print("Word2Vec Accuracy:", acc_w2v)

Word2Vec Accuracy: 0.9675
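
Averaging collapses a sentence of any length into one fixed-size dense vector, in contrast to the sparse count vectors used earlier. A toy sketch of that averaging with a hand-written embedding table in place of a trained model. The vectors, `VECTOR_SIZE`, and `average_embedding` are hypothetical, and the zero-vector fallback for sentences with no known words is one possible convention:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained Word2Vec model.
embeddings = {
    "audio": np.array([1.0, 0.0, 2.0]),
    "is": np.array([0.0, 1.0, 0.0]),
    "fun": np.array([2.0, 2.0, 1.0]),
}
VECTOR_SIZE = 3


def average_embedding(sentence):
    """Mean of the known word vectors; zeros if no word is in the vocabulary."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(VECTOR_SIZE)
    return np.mean(vectors, axis=0)


known = average_embedding("Audio is fun")
unknown = average_embedding("completely unseen words")
print("All words known:  ", known)
print("No words known:   ", unknown)
```

Word order is lost in the average, so "audio is fun" and "fun is audio" map to the same vector.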


# ***8 — Final Comparison Table***

In [40]:
results = pd.DataFrame({
    "Method": ["CountVectorizer", "TF-IDF", "Word2Vec"],
    "Accuracy": [acc_cv, acc_tfidf, acc_w2v]
})

results

|   | Method          | Accuracy |
|---|-----------------|----------|
| 0 | CountVectorizer | 0.9805   |
| 1 | TF-IDF          | 0.9655   |
| 2 | Word2Vec        | 0.9675   |
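
The table above pairs each vectorizer with a single classifier. A sketch of a fuller grid, assuming each sparse vectorizer is crossed with both classifiers through a scikit-learn pipeline. The toy data, dictionary names, and `max_iter` value are illustrative assumptions. Word2Vec is left out because `MultinomialNB` requires non-negative features, which averaged embeddings do not guarantee:

```python
from itertools import product

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data for illustration only (hypothetical).
texts = [
    "I love machine learning",
    "This is a great project",
    "Audio processing is fun",
    "I dislike noisy data",
    "Speech recognition is useful",
    "Deep learning models are powerful",
] * 50
labels = [1, 1, 1, 0, 1, 1] * 50

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

vectorizers = {"CountVectorizer": CountVectorizer, "TF-IDF": TfidfVectorizer}
classifiers = {
    "LogisticRegression": lambda: LogisticRegression(max_iter=1000),
    "MultinomialNB": MultinomialNB,
}

rows = []
for (vec_name, make_vec), (clf_name, make_clf) in product(
    vectorizers.items(), classifiers.items()
):
    # A fresh vectorizer and classifier per combination avoids shared state.
    pipeline = make_pipeline(make_vec(ngram_range=(1, 3)), make_clf())
    pipeline.fit(X_tr, y_tr)
    accuracy = accuracy_score(y_te, pipeline.predict(X_te))
    rows.append({"Vectorizer": vec_name, "Classifier": clf_name, "Accuracy": accuracy})

comparison = pd.DataFrame(rows)
print(comparison)
```

Recording the classifier next to the vectorizer makes it clear whether a difference in accuracy comes from the features or from the model.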
