
# Applying Machine Learning Models to Text Data
### Starter Notebook for NLP & Machine Learning Lab

In this lab, you'll learn how to:
1. Convert text into numerical features using vectorization.
2. Train and evaluate multiple ML models on text data.
3. Compare results across **MultinomialNB**, **DecisionTree**, **RandomForest**, and a simple **Neural Network**.


## Step 1: Import Libraries and Load Dataset

In [1]:

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

categories = ['rec.sport.baseball', 'sci.space', 'talk.politics.mideast', 'comp.graphics']
data = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))

X = data.data
y = data.target

print("Number of samples:", len(X))
print("Example text:\n", X[0][:400])


Number of samples: 2338
Example text:
 Apparently, my editor didn't do what I wanted it to do, so I'll try again.

i'm looking for any programs or code to do simple animation and/or
drawing using fractals in TurboPascal for an IBM
              Thanks in advance


## Step 2: Text Vectorization

In [2]:

count_vectorizer = CountVectorizer(stop_words='english', max_features=3000)
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=3000)

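# Note: the vectorizers are fitted on the full corpus before splitting, so test-set
# vocabulary and IDF statistics leak into the features. For a stricter evaluation,
# split the raw text first and fit the vectorizers on the training portion only.
X_count = count_vectorizer.fit_transform(X)
X_tfidf = tfidf_vectorizer.fit_transform(X)

# The same test_size and random_state give both feature sets an identical row split.
X_train_count, X_test_count, y_train, y_test = train_test_split(X_count, y, test_size=0.2, random_state=42)
X_train_tfidf, X_test_tfidf, _, _ = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)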

print("Count Vectorizer shape:", X_count.shape)
print("TF-IDF Vectorizer shape:", X_tfidf.shape)


Count Vectorizer shape: (2338, 3000)
TF-IDF Vectorizer shape: (2338, 3000)
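
The cell above fits both vectorizers on the full corpus and splits afterwards. A stricter alternative is to split the raw text first and fit the vectorizer on the training portion only. The sketch below illustrates that order of operations on a made-up six-document corpus (the `docs` and `labels` names are hypothetical stand-ins for `X` and `y`); it is not the notebook's original method, and the accuracies reported in this notebook were not produced this way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in corpus; in the notebook this would be X and y.
docs = [
    "the pitcher threw a fastball",
    "home run in the ninth inning",
    "the shuttle reached orbit",
    "nasa launched a new satellite",
    "the batter hit a double",
    "rocket engines burn liquid fuel",
]
labels = [0, 0, 1, 1, 0, 1]

# 1) Split the raw text first.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.33, random_state=42, stratify=labels
)

# 2) Learn the vocabulary and IDF weights from the training split only.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(docs_train)

# 3) Reuse the fitted vocabulary on the test split (transform, never fit).
X_test = vectorizer.transform(docs_test)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
```

Words that occur only in the test documents are ignored at transform time, which mirrors what happens when a deployed model meets unseen vocabulary.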


## Step 3: Model 1 — Multinomial Naïve Bayes

In [3]:

mnb = MultinomialNB()
mnb.fit(X_train_tfidf, y_train)
y_pred_nb = mnb.predict(X_test_tfidf)

print("Naïve Bayes Accuracy:", round(accuracy_score(y_test, y_pred_nb), 4))
print("\nClassification Report:\n", classification_report(y_test, y_pred_nb))


Naïve Bayes Accuracy: 0.8825

Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.93      0.92       132
           1       0.90      0.86      0.88       118
           2       0.78      0.91      0.84       111
           3       0.96      0.82      0.88       107

    accuracy                           0.88       468
   macro avg       0.89      0.88      0.88       468
weighted avg       0.89      0.88      0.88       468



## Step 4: Model 2 — Decision Tree

In [4]:

dt = DecisionTreeClassifier(max_depth=20, random_state=42)
dt.fit(X_train_tfidf, y_train)
y_pred_dt = dt.predict(X_test_tfidf)

print("Decision Tree Accuracy:", round(accuracy_score(y_test, y_pred_dt), 4))


Decision Tree Accuracy: 0.6325


## Step 5: Model 3 — Random Forest

In [6]:

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_tfidf, y_train)
y_pred_rf = rf.predict(X_test_tfidf)

print("Random Forest Accuracy:", round(accuracy_score(y_test, y_pred_rf), 4))


Random Forest Accuracy: 0.8226


## Step 6: Model 4 — Simple Neural Network

In [7]:

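from tensorflow.keras import Input

# Convert the sparse TF-IDF matrices to dense arrays for Keras.
X_train_dense = X_train_tfidf.toarray()
X_test_dense = X_test_tfidf.toarray()

model = Sequential([
    # An explicit Input layer replaces input_shape= on the first Dense layer,
    # which newer Keras versions warn about.
    Input(shape=(X_train_dense.shape[1],)),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dense(len(np.unique(y)), activation='softmax')  # one output unit per class
])

# Integer class labels pair with sparse_categorical_crossentropy (no one-hot encoding).
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train_dense, y_train, epochs=3, batch_size=64, validation_split=0.1, verbose=1)
test_loss, test_acc = model.evaluate(X_test_dense, y_test, verbose=0)
print(f"Neural Network Accuracy: {test_acc:.4f}")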


Epoch 1/3
27/27 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.4742 - loss: 1.3337 - val_accuracy: 0.6845 - val_loss: 1.1961
Epoch 2/3
27/27 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7641 - loss: 0.9639 - val_accuracy: 0.8717 - val_loss: 0.6864
Epoch 3/3
27/27 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9275 - loss: 0.4606 - val_accuracy: 0.9091 - val_loss: 0.3881
Neural Network Accuracy: 0.8782


## Step 7: Compare Results

In [8]:

results = {
    "MultinomialNB": accuracy_score(y_test, y_pred_nb),
    "DecisionTree": accuracy_score(y_test, y_pred_dt),
    "RandomForest": accuracy_score(y_test, y_pred_rf),
    "NeuralNetwork": test_acc
}
pd.DataFrame(results, index=["Accuracy"]).T.sort_values("Accuracy", ascending=False)


               Accuracy
MultinomialNB  0.882479
NeuralNetwork  0.878205
RandomForest   0.822650
DecisionTree   0.632479



## Step 8: Reflection & Discussion

**Discussion Prompts:**
- Which model performed best and why?
- Which model trained the fastest?
- How do vectorization and model type influence performance?
- When might you prefer a simple model like Naïve Bayes over a neural network?
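
For the prompt about training speed, one way to gather evidence is to time each `fit` call. The following is a minimal sketch on a made-up corpus; the `time_fit` helper and the toy `docs`/`labels` are hypothetical, and in the notebook you would pass `X_train_tfidf` and `y_train` instead. Absolute timings depend on hardware and data size, so no expected numbers are shown.

```python
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier


def time_fit(model, X, y):
    """Return the wall-clock seconds spent in model.fit(X, y)."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


# Hypothetical toy data; replace with X_train_tfidf / y_train in the notebook.
docs = [
    "the pitcher threw a fastball",
    "home run in the ninth inning",
    "the batter hit a double",
    "the shuttle reached orbit",
    "nasa launched a new satellite",
    "rocket engines burn liquid fuel",
]
labels = [0, 0, 0, 1, 1, 1]
X_toy = TfidfVectorizer().fit_transform(docs)

models = {
    "MultinomialNB": MultinomialNB(),
    "DecisionTree": DecisionTreeClassifier(max_depth=20, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}

timings = {name: time_fit(model, X_toy, labels) for name, model in models.items()}

# Print models from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f} s")
```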


# How do vectorization and model type influence performance?

In text classification, **the features you feed the model (vectorization)** and **the model family** interact tightly. Here’s how they influence performance—and how to pair them well.

## Vectorization → what the model “sees”

* **Bag-of-Words (counts)**: very sparse, high-dimensional; captures word presence/frequency but little order/semantics. Strong signal for topical data.
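* **TF-IDF**: reweights counts to downplay words that appear in most documents; often gives a modest F1 gain over raw counts, though the size of the gain depends on the dataset and the model. Key knobs: `min_df`, `max_df`, `sublinear_tf=True`, `ngram_range=(1, 2)`, and L2 normalization (`norm='l2'`, the scikit-learn default).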
* **Character n-grams**: robust to typos, morphology, and domain codes; great for noisy text, names, product IDs.
* **Static word embeddings** (Word2Vec/GloVe; averaged): dense, low-dimensional; adds semantics but loses precise token cues; helps when docs are short.
* **Contextual embeddings** (BERT/DistilBERT): dense, context-aware; best raw accuracy/F1, higher compute.
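
To make the TF-IDF reweighting above concrete, here is a minimal pure-Python sketch. It assumes the smoothed form idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is the variant scikit-learn's `TfidfVectorizer` applies by default; the three-document corpus and the `tfidf` helper are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "the rocket reached orbit",
    "the pitcher threw the ball",
    "the rocket engine failed",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vec = tfidf(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda item: item[1]):
    print(f"{term:8s} {weight:.3f}")
```

In the first document every term occurs once, so the weights differ only through IDF: "the" (present in all three documents) gets the lowest weight, "rocket" (two documents) sits in the middle, and "reached" and "orbit" (one document each) get the highest.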

## Model type → how it uses those features

* **Multinomial Naïve Bayes (MNB)**: designed for count features and also works well in practice with TF-IDF (especially unigrams/bigrams); fast, robust on sparse text; can underfit nuanced classes. Tune the smoothing parameter `alpha`.
* **Linear classifiers** (LogReg, Linear SVM): top performers on **TF-IDF/char n-grams**; scale to huge vocabularies; tune regularization (`C`) and class weights.
* **k-NN**: can work with **TF-IDF + cosine**; inference slow; sensitive to scaling/sparsity.
* **Trees/Random Forest/GBMs**: often mediocre on raw sparse TF-IDF. Each split tests a single feature, and any one term is absent from most documents, so a tree needs many splits to use the signal. They tend to do better after **dense embeddings** or strong feature selection.
* **Neural nets (MLP/CNN/RNN)**: prefer **dense embeddings**; with enough data, can beat linear models; need regularization and tuning.
* **Transformers**: use **contextual embeddings** end-to-end; usually best, with higher training/inference cost.
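
One way to see how model families respond to the same features is to hold the vectorizer fixed and swap only the classifier. The sketch below does that with cross-validation on a made-up twelve-document corpus (all data and names here are hypothetical); scores on such a tiny set are noisy and say nothing about real-world ranking, so treat it as a template for the 20 Newsgroups data rather than a benchmark.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 0 = baseball, 1 = space.
docs = [
    "the pitcher threw a fastball",
    "home run in the ninth inning",
    "the batter hit a double",
    "the catcher dropped the ball",
    "the team won the baseball game",
    "the umpire called a strike",
    "the shuttle reached orbit",
    "nasa launched a new satellite",
    "rocket engines burn liquid fuel",
    "the probe landed on mars",
    "astronauts repaired the space telescope",
    "the launch was delayed by weather",
]
labels = [0] * 6 + [1] * 6

candidates = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
}

scores = {}
for name, clf in candidates.items():
    # The vectorizer sits inside the pipeline, so it is refit on each training fold.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores[name] = cross_val_score(pipe, docs, labels, cv=3).mean()

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```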

## Good pairings (rule-of-thumb)

* **TF-IDF (1–2 grams)** → **MNB / Linear SVM / Logistic Regression** ✅
* **Char n-grams** → **Linear SVM / LogReg** (robust to noise) ✅
* **Averaged static embeddings** → **LogReg / MLP** ✅
* **Contextual embeddings (BERT)** → **Linear head or fine-tuned transformer** ✅
* **Raw TF-IDF** → **Trees/RF/GBM** ❌ (usually underperform unless heavily reduced)
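
The last pairing notes that tree ensembles usually need reduced features. A minimal sketch of one such reduction, assuming truncated SVD (latent semantic analysis) as the reduction step, is shown below. The corpus, the choice of `n_components=2`, and the forest size are illustrative assumptions, not values tuned for the notebook's dataset.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 0 = baseball, 1 = space.
docs = [
    "the pitcher threw a fastball",
    "home run in the ninth inning",
    "the batter hit a double",
    "the team won the baseball game",
    "the shuttle reached orbit",
    "nasa launched a new satellite",
    "rocket engines burn liquid fuel",
    "the probe landed on mars",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = make_pipeline(
    TfidfVectorizer(),                               # sparse, wide features
    TruncatedSVD(n_components=2, random_state=42),   # dense, low-dimensional projection
    RandomForestClassifier(n_estimators=50, random_state=42),
)
pipe.fit(docs, labels)

# Inspect the dense features the forest actually sees.
reduced = pipe[:-1].transform(docs)
print("Reduced feature shape:", reduced.shape)

preds = pipe.predict(["the rocket reached orbit"])
print("Predicted label:", preds[0])
```

On real data the number of components is usually in the tens or hundreds and is worth tuning alongside the forest's own hyperparameters.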

## How choices shift bias/variance & compute

* Higher n-grams and larger vocabularies ↓bias but ↑variance/overfit and memory. Use `min_df`, `max_df`, and regularization.
* Dense embeddings reduce dimensionality (↓variance) and add semantics (↓bias), which can help minority-class recall when training examples for those classes are scarce.
* Linear models on TF-IDF train fast and are strong baselines; transformers improve hardest confusions at higher cost.
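
The effect of n-gram range and `min_df` on vocabulary size can be checked directly. The sketch below counts features under three settings on a made-up four-document corpus; the `vocab_size` helper is hypothetical and exists only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus.
docs = [
    "the rocket reached orbit",
    "the rocket engine failed",
    "the pitcher threw the ball",
    "the pitcher struck out the batter",
]


def vocab_size(**vectorizer_kwargs):
    """Fit a CountVectorizer with the given settings and return its vocabulary size."""
    vectorizer = CountVectorizer(**vectorizer_kwargs).fit(docs)
    return len(vectorizer.vocabulary_)


unigrams = vocab_size(ngram_range=(1, 1))
with_bigrams = vocab_size(ngram_range=(1, 2))
pruned = vocab_size(ngram_range=(1, 2), min_df=2)  # keep terms seen in at least 2 docs

print("Unigrams only:        ", unigrams)
print("Unigrams + bigrams:   ", with_bigrams)
print("Same, with min_df=2:  ", pruned)
```

Adding bigrams grows the feature space, and `min_df` trims the rare n-grams that contribute most of that growth while carrying little reliable signal.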




# When might you prefer a simple model like Naïve Bayes over a neural network?
Prefer **Naïve Bayes (NB)** over a neural net when you need **speed, simplicity, and strong performance on sparse text** without heavy compute or tuning:

**When NB is a better choice**

* **Small datasets / few labels:** NB has few parameters (class priors plus per-class token probabilities) and can estimate them stably from limited data; NNs often overfit or need augmentation.
* **High-dimensional, sparse features (TF-IDF, bag-of-words, char n-grams):** NB is a natural fit and often very competitive.
* **Tight compute/latency budgets:** Trains in seconds, tiny memory/CPU footprint, great for on-device or serverless.
* **Rapid baselines & sanity checks:** Gives a strong, reliable benchmark fast; useful in model selection loops.
* **Stable, low-maintenance deployment:** Few hyperparameters, deterministic, easy to port (even to SQL/UDFs).
* **Streaming/online updates:** `partial_fit` lets you update with new batches without full retraining.
* **Interpretability:** Per-class log probabilities reveal which tokens drive decisions—handy for audits and feature hygiene.
* **Noisy text tasks:** Spam filtering, topic routing, language ID—NB is robust to lots of irrelevant features.

**Typical wins**

* Email/SMS **spam detection**
* News/forum **topic classification**
* **Document routing** and tag suggestion
* **Language detection** and simple sentiment with clear cues
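
The streaming point above can be sketched with `partial_fit`. This example assumes a `HashingVectorizer` as the feature step, since it is stateless and needs no fitted vocabulary; `alternate_sign=False` keeps the hashed features non-negative, which `MultinomialNB` requires. The spam/ham mini-batches are made up for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: the same text always maps to the same columns.
vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)
clf = MultinomialNB()
classes = [0, 1]  # 0 = ham, 1 = spam; every label must be declared up front

# Hypothetical mini-batches arriving over time.
batches = [
    (["win a free prize now", "meeting moved to friday"], [1, 0]),
    (["claim your free reward", "lunch at noon tomorrow"], [1, 0]),
    (["free prize inside click now", "see you at the meeting"], [1, 0]),
]

for texts, y in batches:
    X_batch = vectorizer.transform(texts)  # no fit step needed
    clf.partial_fit(X_batch, y, classes=classes)

seen = int(clf.class_count_.sum())
pred = clf.predict(vectorizer.transform(["free prize"]))

print("Documents seen so far:", seen)
print("Predicted class:", pred[0])
```

Each `partial_fit` call only updates running counts, so the cost of an update depends on the batch size rather than on everything seen so far.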

**When not to use NB**

* You need **long-range context/semantics** (sarcasm, coreference, subtle sentiment).
* You have **plenty of data/compute** and the task benefits from context (where transformers shine).

**Rule of Thumb**
* Start with TF-IDF (1–2 grams) + NB for a fast, strong baseline. If errors cluster around semantic nuance or long context, step up to a linear model or a small transformer.
