# HW2: Bag-of-Words Text Classification — Group 2
**CS6120: Natural Language Processing · Spring 2026**

This notebook implements a complete BOW text classification pipeline for Common Crawl web pages:
- **Part A (Week 1):** Text preprocessing and Document-Term Matrix creation with TF-IDF
- **Part B (Week 2):** Training classifiers, evaluation, prediction on unlabeled pages, and error analysis

In [1]:
# ── Import Libraries ──────────────────────────────────────────────
import pandas as pd
import numpy as np
import re
import warnings
warnings.filterwarnings('ignore')

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt

# Scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix, ConfusionMatrixDisplay,
    f1_score, accuracy_score
)
from sklearn.preprocessing import LabelEncoder

# XGBoost
from xgboost import XGBClassifier

RANDOM_STATE = 42
print("All libraries imported successfully.")

All libraries imported successfully.


## Part A: Document-Term Matrix Creation

### Step 1: Data Loading & Exploration

Load the **full Common Crawl dataset** (`dataset_with_assignments.csv`). This contains ~6,300 web pages with metadata and full text content. Key columns:
- `page_id` — unique identifier for each web page
- `url` — the source URL of the page
- `full_text` — complete raw text content scraped from the web page
- `domain`, `tld` — website domain and top-level domain
- `word_count`, `text_length`, `sentence_count` — pre-computed text statistics

> **Note:** This dataset does NOT contain category labels. Labels are in a separate file used in Part B.

In [None]:
# ── Load Full Common Crawl Dataset ────────────────────────────────
df = pd.read_csv('dataset_with_assignments.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nNull counts:\n{df.isnull().sum()}")
print(f"\nData types:\n{df.dtypes}")
df.head()

Dataset shape: (596, 7)

Columns: ['page_id', 'assigned_to', 'manual_label', 'manual_label_clean', 'manual_label_final', 'full_text', '_merge']

Null counts:
page_id               0
full_text             0
manual_label_final    0
dtype: int64

Data types:
page_id               float64
assigned_to            object
manual_label           object
manual_label_clean     object
manual_label_final     object
full_text              object
_merge                 object
dtype: object


| | page_id | assigned_to | manual_label | manual_label_clean | manual_label_final | full_text | _merge |
|---|---------|-------------|--------------|--------------------|--------------------|-----------|--------|
| 0 | 655.0 | Vinay Varshigan Sivakumar Jayalakshmi | ecommerce | ECOMMERCE | ECOMMERCE | Fall arrest and work positioning harness – All... | both |
| 1 | 656.0 | Vinay Varshigan Sivakumar Jayalakshmi | blog | BLOG | BLOG | Vermont Mountain Eats: Jay Peak - All Mountain... | both |
| 2 | 657.0 | Vinay Varshigan Sivakumar Jayalakshmi | news | NEWS | NEWS | 05/18/2020 Booking Report for Bulloch County -... | both |
| 3 | 658.0 | Vinay Varshigan Sivakumar Jayalakshmi | blog | BLOG | BLOG | Is Pickleball Easier than Tennis? \| AllRacket\... | both |
| 4 | 661.0 | Vinay Varshigan Sivakumar Jayalakshmi | other | OTHER | OTHER | Account Executive, Auto Finance - Greater Seat... | both |


In [None]:
# ── Exploratory Data Analysis ─────────────────────────────────────
# Top-Level Domain distribution
tld_counts = df['tld'].value_counts().head(15).reset_index()
tld_counts.columns = ['TLD', 'Count']

fig = px.bar(
    tld_counts, x='TLD', y='Count',
    color='TLD',
    title='Top 15 Top-Level Domains (TLDs)',
    text='Count'
)
fig.update_layout(xaxis_tickangle=-45, showlegend=False)
fig.update_traces(textposition='outside')
fig.show()

# Domain distribution
domain_counts = df['domain'].value_counts().head(15).reset_index()
domain_counts.columns = ['Domain', 'Count']

fig2 = px.bar(
    domain_counts, x='Domain', y='Count',
    color='Domain',
    title='Top 15 Domains',
    text='Count'
)
fig2.update_layout(xaxis_tickangle=-45, showlegend=False)
fig2.update_traces(textposition='outside')
fig2.show()

print(f"\nTotal pages: {len(df)}")
print(f"Unique domains: {df['domain'].nunique()}")
print(f"Unique TLDs: {df['tld'].nunique()}")
print(f"\nAssigned pages (for manual labeling): {df['assigned_to'].notna().sum()}")
print(f"Unassigned pages: {df['assigned_to'].isna().sum()}")


Total labeled pages: 596
Number of categories: 8

Class imbalance ratio (max/min): 158/13 = 12.2x


In [None]:
# ── Text Length Analysis ──────────────────────────────────────────
df['text_length'] = df['full_text'].str.len()

print("Text length statistics (full_text):")
print(df['text_length'].describe())

# Histogram of overall text lengths
fig = px.histogram(
    df, x='text_length', nbins=50,
    title='Text Length Distribution (All Pages)',
    labels={'text_length': 'Character Count', 'count': 'Number of Pages'}
)
fig.show()

# Word count distribution (pre-computed column)
fig2 = px.histogram(
    df, x='word_count', nbins=50,
    title='Word Count Distribution (All Pages)',
    labels={'word_count': 'Word Count', 'count': 'Number of Pages'}
)
fig2.show()

print(f"\nWord count statistics:")
print(df['word_count'].describe())

Text length statistics:
count      596.000000
mean      6535.971477
std       9110.613565
min        294.000000
25%       1748.500000
50%       3446.500000
75%       7771.000000
max      75582.000000
Name: text_length, dtype: float64


### Step 2: Text Preprocessing

Web page text contains significant noise that must be cleaned before vectorization. Our preprocessing pipeline applies the following steps (in order):

| Step | Operation | Rationale |
|------|-----------|-----------|
| 1 | Lowercase | Normalize case so "News" and "news" are treated as the same token |
| 2 | Remove URLs | URLs (http/https/www) are not meaningful for category classification |
| 3 | Remove email addresses | Email addresses are noisy and not category-discriminative |
| 4 | Remove special characters | Replace punctuation and HTML artifacts with spaces; letters, digits, underscores, and whitespace are kept (`\w` also matches underscores) |
| 5 | Remove standalone numbers | Pure numbers rarely help classification (prices, dates, etc.) |
| 6 | Collapse whitespace | Clean up extra spaces left by previous removal steps |
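
To see what the six steps do in practice, the sketch below traces them on two made-up strings. The function body mirrors the steps in the table; the sample strings and the name `clean_text_sketch` are illustrative and not taken from the dataset.

```python
import re

def clean_text_sketch(text):
    """Apply the six cleaning steps from the table, in order."""
    text = str(text).lower()                               # 1. lowercase
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)    # 2. URLs
    text = re.sub(r'\S+@\S+', '', text)                    # 3. email addresses
    text = re.sub(r'[^\w\s]', ' ', text)                   # 4. punctuation -> space
    text = re.sub(r'\b\d+\b', '', text)                    # 5. standalone numbers
    return re.sub(r'\s+', ' ', text).strip()               # 6. collapse whitespace

sample = "Visit https://shop.example.com NOW! Email sales@example.com, only $19 for 2 items."
print(clean_text_sketch(sample))
# visit now email only for items

# Digits glued to letters, and underscores, survive because \w matches both
print(clean_text_sketch("snake_case covid19 2020"))
# snake_case covid19
```

Step 5 only drops tokens made entirely of digits, so mixed tokens such as `covid19` stay in the vocabulary.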

In [None]:
# ── Text Cleaning Function ────────────────────────────────────────
def clean_text(text):
    """Clean raw web page text for TF-IDF vectorization."""
    text = str(text).lower()                                # 1. Lowercase
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)    # 2. Remove URLs
    text = re.sub(r'\S+@\S+', '', text)                     # 3. Remove email addresses
    text = re.sub(r'[^\w\s]', ' ', text)                    # 4. Remove special characters
    text = re.sub(r'\b\d+\b', '', text)                     # 5. Remove standalone numbers
    text = re.sub(r'\s+', ' ', text).strip()                # 6. Collapse whitespace
    return text

# Apply preprocessing to all pages
df['cleaned_text'] = df['full_text'].apply(clean_text)

# Show before/after examples
for i in [0, 100, 500]:
    print(f"{'='*80}")
    print(f"Page ID: {df.iloc[i]['page_id']} | Domain: {df.iloc[i]['domain']}")
    print(f"\nRAW (first 300 chars):\n{df.iloc[i]['full_text'][:300]}")
    print(f"\nCLEANED (first 300 chars):\n{df.iloc[i]['cleaned_text'][:300]}")
    print()

Page ID: 655.0 | Label: ECOMMERCE

RAW (first 300 chars):
Fall arrest and work positioning harness – Allied Safety Equipment Pvt. Ltd.
Skip to content
Sports
Professional
Search for:
Search
ENQUIRE NOW!
X
ACTIVITY
INDUSTRY
Rope Access
Confined Space
Facade Cleaning
Wind Mill
Bird Netting
Tree Care
Energy & Network
Framing & Roofing
STUNTS
Stunt Vest
Aerial

CLEANED (first 300 chars):
fall arrest and work positioning harness allied safety equipment pvt ltd skip to content sports professional search for search enquire now x activity industry rope access confined space facade cleaning wind mill bird netting tree care energy network framing roofing stunts stunt vest aerial acrobatic

Page ID: 662.0 | Label: BLOG

RAW (first 300 chars):
Blog – AllyTech Ai
Product
Title
Methodology
About platform
Plans
Cases
Support
Workshops
Blog
About company
Solutions
Title
Production
Healthcare
Government
Resources
Title
Marketplace
Integration
Training
Docs & API
Title
Knowledge base
Working with the AP

In [6]:
# ── Post-Cleaning Validation ──────────────────────────────────────
df['cleaned_length'] = df['cleaned_text'].str.len()

# Check for empty texts after cleaning
empty_count = (df['cleaned_text'].str.strip() == '').sum()
print(f"Empty texts after cleaning: {empty_count}")

print(f"\nCleaned text length statistics:")
print(df['cleaned_length'].describe())

# Compare raw vs cleaned lengths
fig = go.Figure()
fig.add_trace(go.Histogram(x=df['text_length'], name='Raw Text', opacity=0.6))
fig.add_trace(go.Histogram(x=df['cleaned_length'], name='Cleaned Text', opacity=0.6))
fig.update_layout(
    title='Text Length: Raw vs Cleaned',
    xaxis_title='Character Count',
    yaxis_title='Number of Pages',
    barmode='overlay'
)
fig.show()

Empty texts after cleaning: 0

Cleaned text length statistics:
count      596.000000
mean      5786.474832
std       8049.753075
min        194.000000
25%       1559.750000
50%       3192.500000
75%       6849.750000
max      70186.000000
Name: cleaned_length, dtype: float64


### Step 3: TF-IDF Vectorization

We use `TfidfVectorizer` from scikit-learn to convert the cleaned text of **all ~6,300 pages** into a numerical Document-Term Matrix with TF-IDF weights. Fitting on the full corpus ensures a comprehensive vocabulary.

**Parameter Choices and Justification:**

| Parameter | Value | Justification |
|-----------|-------|---------------|
| `max_features` | 3000 | Caps vocabulary to the top 3000 terms by frequency — keeps the matrix manageable while retaining the most informative terms |
| `min_df` | 3 | Ignores terms appearing in fewer than 3 documents — removes typos, misspellings, and very rare proper nouns |
| `max_df` | 0.85 | Ignores terms appearing in >85% of documents — removes ubiquitous words that don't discriminate between categories |
| `ngram_range` | (1, 2) | Includes unigrams and bigrams; bigrams capture phrases such as "breaking news" (NEWS) or "add cart" (ECOMMERCE), which is what "add to cart" becomes once the stop word "to" is removed |
| `stop_words` | 'english' | Removes common English stop words (the, is, at, etc.) that carry no category-specific meaning |
| `sublinear_tf` | True | Replaces the raw term count tf with 1 + log(tf), which dampens the effect of repeated terms in very long web pages |
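
As a rough illustration of the `sublinear_tf` row: scikit-learn documents this option as replacing a raw term count tf with 1 + log(tf). The sketch below applies that formula to a few made-up counts; `sublinear_tf_weight` is a hypothetical helper, not a library function.

```python
import math

def sublinear_tf_weight(tf):
    """Sublinear term-frequency scaling: 1 + ln(tf) for tf > 0, else 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

for tf in (1, 10, 100):
    print(tf, round(sublinear_tf_weight(tf), 3))
# 1 1.0
# 10 3.303
# 100 5.605
```

A term repeated 100 times therefore counts about 5.6 times as much as a term seen once, before IDF weighting and normalization are applied.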

In [None]:
# ── Build TF-IDF Document-Term Matrix ────────────────────────────
tfidf = TfidfVectorizer(
    max_features=3000,
    min_df=3,
    max_df=0.85,
    ngram_range=(1, 2),
    stop_words='english',
    sublinear_tf=True
)

# Fit and transform on ALL pages (full corpus)
tfidf_matrix = tfidf.fit_transform(df['cleaned_text'])

print(f"TF-IDF Matrix shape: {tfidf_matrix.shape}")
print(f"  → {tfidf_matrix.shape[0]} documents × {tfidf_matrix.shape[1]} features")
print(f"\nVocabulary size: {len(tfidf.vocabulary_)}")
print(f"Non-zero entries: {tfidf_matrix.nnz:,}")
print(f"Sparsity: {1 - tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1]):.4%}")

TF-IDF Matrix shape: (596, 3000)
  → 596 documents × 3000 features

Vocabulary size: 3000
Non-zero entries: 87,830
Sparsity: 95.0878%
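
The sparsity figure can be checked by hand from the matrix shape and the non-zero count reported above. This is only a worked arithmetic check using those recorded numbers; the variable names are illustrative.

```python
n_docs, n_features = 596, 3000   # matrix shape reported above
nnz = 87_830                     # non-zero entries reported above

density = nnz / (n_docs * n_features)
sparsity = 1 - density
avg_terms_per_doc = nnz / n_docs

print(f"Sparsity: {sparsity:.4%}")
print(f"Average non-zero features per document: {avg_terms_per_doc:.1f}")
# Sparsity: 95.0878%
# Average non-zero features per document: 147.4
```

In other words, a typical page activates roughly 147 of the 3,000 vocabulary entries.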


### Step 4: DTM Analysis

Analyze the Document-Term Matrix to understand vocabulary composition, sparsity, and the highest-weighted terms overall and per category.

In [8]:
# ── Top 20 Terms by Mean TF-IDF ──────────────────────────────────
feature_names = tfidf.get_feature_names_out()
mean_tfidf = np.array(tfidf_matrix.mean(axis=0)).flatten()

# Sort and get top 20
top_indices = mean_tfidf.argsort()[-20:][::-1]
top_terms = [feature_names[i] for i in top_indices]
top_scores = [mean_tfidf[i] for i in top_indices]

fig = px.bar(
    x=top_scores, y=top_terms,
    orientation='h',
    title='Top 20 Terms by Mean TF-IDF Score (Across All Documents)',
    labels={'x': 'Mean TF-IDF Score', 'y': 'Term'}
)
fig.update_layout(yaxis=dict(autorange='reversed'))
fig.show()

---

## Part B: Classification

### Step 5: Load Labeled Data & Prepare Features

Load the **labeled subset** (`english_pages_metadata_clean_with_labels.csv`) containing 596 manually labeled pages. We align these labels with the TF-IDF features already computed in Part A by matching on `page_id`.

This gives us:
- **Labeled features (X_labeled):** TF-IDF rows for the 596 labeled pages — used for training and evaluation
- **Unlabeled features (X_unlabeled):** TF-IDF rows for the remaining ~5,700 pages — used for prediction in Step 8

In [None]:
# ── Load Labeled Dataset ──────────────────────────────────────────
labeled_df = pd.read_csv('english_pages_metadata_clean_with_labels.csv')
print(f"Labeled dataset shape: {labeled_df.shape}")
print(f"Categories: {labeled_df['manual_label_final'].nunique()}")
print(f"\nLabel distribution:")
print(labeled_df['manual_label_final'].value_counts())

# ── Label Distribution Plot ──────────────────────────────────────
label_counts = labeled_df['manual_label_final'].value_counts().reset_index()
label_counts.columns = ['Category', 'Count']

fig = px.bar(
    label_counts, x='Category', y='Count',
    color='Category',
    title='Distribution of Manual Labels (8 Categories)',
    text='Count'
)
fig.update_layout(xaxis_tickangle=-45, showlegend=False)
fig.update_traces(textposition='outside')
fig.show()

print(f"\nClass imbalance ratio (max/min): {label_counts['Count'].max()}/{label_counts['Count'].min()} = {label_counts['Count'].max()/label_counts['Count'].min():.1f}x")

# ── Align Labeled Pages with TF-IDF Matrix ───────────────────────
# Merge to get row indices in df that correspond to labeled pages
merged = df[['page_id']].reset_index().merge(
    labeled_df[['page_id', 'manual_label_final']], on='page_id'
)
labeled_indices = merged['index'].values
y_labeled = merged['manual_label_final'].values

# Extract TF-IDF features for labeled and unlabeled pages
X_labeled = tfidf_matrix[labeled_indices]
unlabeled_mask = ~df['page_id'].isin(labeled_df['page_id'])
X_unlabeled = tfidf_matrix[unlabeled_mask]
unlabeled_df = df[unlabeled_mask].copy()

print(f"\n✅ Labeled pages matched: {len(labeled_indices)} / {len(labeled_df)}")
print(f"   X_labeled shape: {X_labeled.shape}")
print(f"   X_unlabeled shape: {X_unlabeled.shape}")
print(f"   Unlabeled pages to predict: {len(unlabeled_df)}")

Training set: 476 samples
Test set:     120 samples

Training label distribution:
manual_label_final
BLOG                 98
ECOMMERCE            91
EDUCATION            49
FORUM/DISCUSSION     37
GOVERNMENT           10
NEWS                 38
OTHER               126
TECHNICAL            27
Name: count, dtype: int64

Test label distribution:
manual_label_final
BLOG                25
ECOMMERCE           23
EDUCATION           12
FORUM/DISCUSSION     9
GOVERNMENT           3
NEWS                 9
OTHER               32
TECHNICAL            7
Name: count, dtype: int64


In [None]:
# ── Top 10 Terms per Category ────────────────────────────────────
feature_names = tfidf.get_feature_names_out()
categories = np.unique(y_labeled)
tfidf_labeled_dense = X_labeled.toarray()

rows = []
for cat in sorted(categories):
    mask = y_labeled == cat
    cat_mean = tfidf_labeled_dense[mask].mean(axis=0)
    top_idx = cat_mean.argsort()[-10:][::-1]
    for rank, idx in enumerate(top_idx, 1):
        rows.append({
            'Category': cat,
            'Term': feature_names[idx],
            'Mean_TF-IDF': cat_mean[idx],
            'Rank': rank
        })

top_terms_df = pd.DataFrame(rows)

fig = px.bar(
    top_terms_df, x='Mean_TF-IDF', y='Term', color='Category',
    facet_col='Category', facet_col_wrap=4,
    title='Top 10 TF-IDF Terms per Category',
    height=600
)
fig.update_layout(showlegend=False)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.update_yaxes(matches=None, autorange='reversed')
fig.show()

### Step 6: Train/Test Split

We use a **stratified** 80/20 train/test split on the 596 labeled pages to ensure every category is proportionally represented in both sets. This is critical given our class imbalance (e.g., GOVERNMENT has only 13 samples).
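
To make "proportionally represented" concrete, the sketch below allocates 20% of each category to the test set by simple rounding. This is an approximation for illustration, not scikit-learn's exact allocation algorithm, although it happens to reproduce the per-class counts recorded for this split. The class totals are the sums of the recorded training and test label distributions.

```python
# Labeled pages per category (train + test counts from the recorded label distributions)
class_totals = {
    'BLOG': 123, 'ECOMMERCE': 114, 'EDUCATION': 61, 'FORUM/DISCUSSION': 46,
    'GOVERNMENT': 13, 'NEWS': 47, 'OTHER': 158, 'TECHNICAL': 34,
}
test_size = 0.2

# Approximation: round each class's proportional share of the test set
test_counts = {label: round(n * test_size) for label, n in class_totals.items()}
train_counts = {label: class_totals[label] - test_counts[label] for label in class_totals}

print(test_counts['GOVERNMENT'], train_counts['GOVERNMENT'])   # 3 10
print(sum(test_counts.values()), sum(train_counts.values()))   # 120 476
```

With only 3 GOVERNMENT pages in the test set, a single misclassification moves that class's recall by a third, so per-class test metrics for small categories are very noisy.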

In [None]:
# ── Stratified Train/Test Split ───────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X_labeled, y_labeled,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y_labeled
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set:     {X_test.shape[0]} samples")
print(f"\nTraining label distribution:")
print(pd.Series(y_train).value_counts().sort_index())
print(f"\nTest label distribution:")
print(pd.Series(y_test).value_counts().sort_index())

### Step 7: Model Training

We train **three classifiers** on the labeled subset and compare their performance:
1. **Logistic Regression** — strong linear baseline for text; fast and interpretable
2. **Random Forest** — ensemble of decision trees; captures non-linear patterns; tuned with GridSearchCV
3. **XGBoost** — gradient-boosted trees; often state-of-the-art for tabular/sparse data; tuned with GridSearchCV

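Logistic Regression and Random Forest use `class_weight='balanced'` to address class imbalance; XGBoost has no `class_weight` option, so it receives equivalent per-sample weights at fit time.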
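
The "balanced" heuristic assigns each class the weight `n_samples / (n_classes * class_count)`, which is also the formula used for the XGBoost sample weights in this notebook. The sketch below evaluates it on the recorded training-split counts; the variable names are illustrative.

```python
# Training-split counts per category (from the recorded training label distribution)
train_counts = {
    'BLOG': 98, 'ECOMMERCE': 91, 'EDUCATION': 49, 'FORUM/DISCUSSION': 37,
    'GOVERNMENT': 10, 'NEWS': 38, 'OTHER': 126, 'TECHNICAL': 27,
}
n_samples = sum(train_counts.values())   # 476
n_classes = len(train_counts)            # 8

# "Balanced" heuristic: n_samples / (n_classes * class_count)
balanced_weight = {label: n_samples / (n_classes * n) for label, n in train_counts.items()}

for label in ('GOVERNMENT', 'OTHER'):
    print(f"{label:<12s} {balanced_weight[label]:.3f}")
# GOVERNMENT   5.950
# OTHER        0.472
```

Each GOVERNMENT training page therefore contributes about 12.6 times as much to the loss as an OTHER page, which mirrors the 126-to-10 count ratio in the training split.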

#### Classifier 1: Logistic Regression

In [11]:
# ── Logistic Regression ───────────────────────────────────────────
lr_model = LogisticRegression(
    max_iter=1000,
    class_weight='balanced',
    random_state=RANDOM_STATE,
    solver='lbfgs'  # multi_class omitted: deprecated in recent scikit-learn; lbfgs already fits a multinomial model
)
lr_model.fit(X_train, y_train)

lr_train_acc = lr_model.score(X_train, y_train)
lr_test_acc = lr_model.score(X_test, y_test)
print(f"Logistic Regression - Train Accuracy: {lr_train_acc:.4f}")
print(f"Logistic Regression - Test Accuracy:  {lr_test_acc:.4f}")

Logistic Regression - Train Accuracy: 0.9202
Logistic Regression - Test Accuracy:  0.5250


#### Classifier 2: Random Forest with GridSearchCV

In [12]:
# ── Random Forest with GridSearchCV ───────────────────────────────
rf = RandomForestClassifier(
    class_weight='balanced',
    random_state=RANDOM_STATE
)

param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
}

gs_rf = GridSearchCV(
    rf, param_grid_rf,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE),
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=1
)
gs_rf.fit(X_train, y_train)

print(f"\nBest Parameters: {gs_rf.best_params_}")
print(f"Best CV F1 (weighted): {gs_rf.best_score_:.4f}")

rf_model = gs_rf.best_estimator_
rf_test_acc = rf_model.score(X_test, y_test)
print(f"Random Forest - Test Accuracy: {rf_test_acc:.4f}")

Fitting 5 folds for each of 12 candidates, totalling 60 fits

Best Parameters: {'max_depth': 20, 'min_samples_split': 5, 'n_estimators': 100}
Best CV F1 (weighted): 0.5306
Random Forest - Test Accuracy: 0.5500


#### Classifier 3: XGBoost with GridSearchCV

In [None]:
# ── XGBoost with GridSearchCV ─────────────────────────────────────
# Encode labels for XGBoost (requires integer labels)
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

# Compute sample weights to handle class imbalance
from collections import Counter
class_counts = Counter(y_train_enc)
n_samples = len(y_train_enc)
n_classes = len(class_counts)
sample_weights = np.array([n_samples / (n_classes * class_counts[c]) for c in y_train_enc])

xgb = XGBClassifier(
    random_state=RANDOM_STATE,
    eval_metric='mlogloss'  # use_label_encoder dropped: deprecated and ignored by recent XGBoost releases
)

param_grid_xgb = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 1.0],
}

gs_xgb = GridSearchCV(
    xgb, param_grid_xgb,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE),
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=1
)
gs_xgb.fit(X_train, y_train_enc, sample_weight=sample_weights)

print(f"\nBest Parameters: {gs_xgb.best_params_}")
print(f"Best CV F1 (weighted): {gs_xgb.best_score_:.4f}")

xgb_model = gs_xgb.best_estimator_
xgb_test_acc = accuracy_score(y_test_enc, xgb_model.predict(X_test))
print(f"XGBoost - Test Accuracy: {xgb_test_acc:.4f}")
print(f"\nLabel mapping: {dict(zip(le.classes_, le.transform(le.classes_)))}")

Fitting 5 folds for each of 24 candidates, totalling 120 fits

Best Parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}
Best CV F1 (weighted): 0.4625
XGBoost - Test Accuracy: 0.5250

Label mapping: {'BLOG': np.int64(0), 'ECOMMERCE': np.int64(1), 'EDUCATION': np.int64(2), 'FORUM/DISCUSSION': np.int64(3), 'GOVERNMENT': np.int64(4), 'NEWS': np.int64(5), 'OTHER': np.int64(6), 'TECHNICAL': np.int64(7)}


### Step 8: Model Evaluation

We evaluate all three models using:
- **5-fold stratified cross-validation** on the training set (for robust performance estimates)
- **Classification report** on the held-out test set (precision, recall, F1 per class)
- **Confusion matrices** to visualize misclassification patterns
- **Model comparison table** summarizing all metrics
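
Because the comparison reports both macro and weighted F1, it may help to see how the two averages differ. The sketch below recomputes them from the per-class F1 scores and test supports recorded for Logistic Regression elsewhere in this notebook; only the variable names are invented.

```python
# (per-class F1, test support) for Logistic Regression, from the recorded outputs
per_class = {
    'BLOG': (0.6383, 25), 'ECOMMERCE': (0.6809, 23), 'EDUCATION': (0.2963, 12),
    'FORUM/DISCUSSION': (0.7778, 9), 'GOVERNMENT': (0.0000, 3), 'NEWS': (0.5000, 9),
    'OTHER': (0.4348, 32), 'TECHNICAL': (0.3077, 7),
}

total_support = sum(n for _, n in per_class.values())

# Macro: every class counts equally; weighted: classes count by test support
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)
weighted_f1 = sum(f1 * n for f1, n in per_class.values()) / total_support

print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
# macro F1:    0.4545
# weighted F1: 0.5228
```

Macro F1 is lower because the zero score on GOVERNMENT (3 test pages) counts as much as the score on OTHER (32 test pages); weighted F1 largely hides that failure.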

In [14]:
# ── Cross-Validation Scores ───────────────────────────────────────
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

# Logistic Regression CV
lr_cv = cross_val_score(lr_model, X_train, y_train, cv=cv, scoring='f1_weighted')
print(f"Logistic Regression  CV F1 (weighted): {lr_cv.mean():.4f} ± {lr_cv.std():.4f}")

# Random Forest CV (already computed via GridSearchCV)
print(f"Random Forest        CV F1 (weighted): {gs_rf.best_score_:.4f}")

# XGBoost CV (already computed via GridSearchCV)
print(f"XGBoost              CV F1 (weighted): {gs_xgb.best_score_:.4f}")

Logistic Regression  CV F1 (weighted): 0.5598 ± 0.0273
Random Forest        CV F1 (weighted): 0.5306
XGBoost              CV F1 (weighted): 0.4625


In [None]:
# ── Classification Reports ────────────────────────────────────────
models = {
    'Logistic Regression': (lr_model, y_test, lr_model.predict(X_test)),
    'Random Forest': (rf_model, y_test, rf_model.predict(X_test)),
    'XGBoost': (xgb_model, y_test_enc, xgb_model.predict(X_test)),
}

for name, (model, y_true, y_pred) in models.items():
    print(f"\n{'='*60}")
    print(f"  {name} — Classification Report")
    print(f"{'='*60}")
    if name == 'XGBoost':
        print(classification_report(y_true, y_pred, target_names=le.classes_))
    else:
        print(classification_report(y_true, y_pred))


  Logistic Regression — Classification Report
                  precision    recall  f1-score   support

            BLOG       0.68      0.60      0.64        25
       ECOMMERCE       0.67      0.70      0.68        23
       EDUCATION       0.27      0.33      0.30        12
FORUM/DISCUSSION       0.78      0.78      0.78         9
      GOVERNMENT       0.00      0.00      0.00         3
            NEWS       0.57      0.44      0.50         9
           OTHER       0.41      0.47      0.43        32
       TECHNICAL       0.33      0.29      0.31         7

        accuracy                           0.53       120
       macro avg       0.46      0.45      0.45       120
    weighted avg       0.53      0.53      0.52       120


  Random Forest — Classification Report
                  precision    recall  f1-score   support

            BLOG       0.65      0.60      0.62        25
       ECOMMERCE       0.74      0.61      0.67        23
       EDUCATION       0.43      0.25 

In [None]:
# ── Confusion Matrices (Plotly Heatmaps) ─────────────────────────
fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=['Logistic Regression', 'Random Forest', 'XGBoost'],
    horizontal_spacing=0.08
)

labels_sorted = sorted(np.unique(y_labeled))

for idx, (name, (model, y_true, y_pred)) in enumerate(models.items()):
    if name == 'XGBoost':
        cm = confusion_matrix(y_true, y_pred)
        display_labels = le.classes_
    else:
        cm = confusion_matrix(y_true, y_pred, labels=labels_sorted)
        display_labels = labels_sorted
    
    # Normalize by row (true labels)
    cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    heatmap = go.Heatmap(
        z=cm_norm,
        x=display_labels,
        y=display_labels,
        colorscale='Blues',
        text=cm,
        texttemplate='%{text}',
        showscale=(idx == 2)
    )
    fig.add_trace(heatmap, row=1, col=idx+1)

fig.update_layout(
    title='Confusion Matrices (Normalized by True Label)',
    height=500, width=1400
)
for i in range(3):
    fig.update_xaxes(title_text='Predicted', tickangle=-45, row=1, col=i+1)
    fig.update_yaxes(title_text='True', row=1, col=i+1)

fig.show()

In [None]:
# ── Model Comparison Summary Table ────────────────────────────────
# CV scores per model, looked up by name in the loop below
cv_f1 = {
    'Logistic Regression': lr_cv.mean(),
    'Random Forest': gs_rf.best_score_,
    'XGBoost': gs_xgb.best_score_,
}

results = []
for name, (model, y_true, y_pred) in models.items():
    results.append({
        'Model': name,
        'CV F1 (weighted)': cv_f1[name],
        'Test Accuracy': accuracy_score(y_true, y_pred),
        'Test F1 (weighted)': f1_score(y_true, y_pred, average='weighted'),
        'Test F1 (macro)': f1_score(y_true, y_pred, average='macro'),
    })

results_df = pd.DataFrame(results).round(4)
results_df = results_df.sort_values('Test F1 (weighted)', ascending=False)
print("\n📊 Model Comparison:")
results_df


📊 Model Comparison:


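| | Model | CV F1 (weighted) | Test Accuracy | Test F1 (weighted) | Test F1 (macro) |
|---|-------|------------------|---------------|--------------------|-----------------|
| 1 | Random Forest | 0.5306 | 0.55 | 0.5344 | 0.4411 |
| 0 | Logistic Regression | 0.5598 | 0.525 | 0.5228 | 0.4545 |
| 2 | XGBoost | 0.4625 | 0.525 | 0.5148 | 0.445 |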


### Step 9: Prediction on Unlabeled Pages

Select the best-performing model based on cross-validation F1 score and apply it to predict categories for all **unlabeled web pages** (~5,700 pages from the full dataset that were not manually labeled).

The output CSV contains: `page_id`, `url`, `predicted_category`.

In [18]:
# ── Select Best Model ─────────────────────────────────────────────
cv_scores_dict = {
    'Logistic Regression': lr_cv.mean(),
    'Random Forest': gs_rf.best_score_,
    'XGBoost': gs_xgb.best_score_
}

best_model_name = max(cv_scores_dict, key=cv_scores_dict.get)
print(f"✅ Best model by CV F1 (weighted): {best_model_name} ({cv_scores_dict[best_model_name]:.4f})")

# Map name to fitted model
best_model_map = {
    'Logistic Regression': lr_model,
    'Random Forest': rf_model,
    'XGBoost': xgb_model
}
best_model = best_model_map[best_model_name]
is_xgb_best = (best_model_name == 'XGBoost')

✅ Best model by CV F1 (weighted): Logistic Regression (0.5598)


In [None]:
# ── Predict on Unlabeled Pages ─────────────────────────────────────
predictions = best_model.predict(X_unlabeled)

if is_xgb_best:
    predicted_labels = le.inverse_transform(predictions)
else:
    predicted_labels = predictions

# Build output DataFrame with page_id, url, and predicted_category
output_df = pd.DataFrame({
    'page_id': unlabeled_df['page_id'].values,
    'url': unlabeled_df['url'].values,
    'predicted_category': predicted_labels
})

print(f"Total predictions: {len(output_df)}")
print(f"\nPredicted category distribution:")
print(output_df['predicted_category'].value_counts())

# Export to CSV
output_df.to_csv('group_2_predictions.csv', index=False)
print(f"\n✅ Saved predictions to group_2_predictions.csv")
output_df.head(10)

Total predictions: 596

Predicted category distribution:
predicted_category
OTHER               146
BLOG                123
ECOMMERCE           118
EDUCATION            72
NEWS                 46
FORUM/DISCUSSION     46
TECHNICAL            35
GOVERNMENT           10
Name: count, dtype: int64

✅ Saved predictions to group_2_predictions.csv


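| | page_id | predicted_category |
|---|---------|--------------------|
| 0 | 655 | ECOMMERCE |
| 1 | 656 | BLOG |
| 2 | 657 | NEWS |
| 3 | 658 | BLOG |
| 4 | 661 | OTHER |
| 5 | 662 | TECHNICAL |
| 6 | 663 | FORUM/DISCUSSION |
| 7 | 664 | OTHER |
| 8 | 665 | ECOMMERCE |
| 9 | 667 | EDUCATION |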


### Step 10: Error Analysis

Analyze misclassifications on the **test set** to understand model weaknesses and identify areas for improvement.

In [None]:
# ── Misclassification Analysis ────────────────────────────────────
# Use the best model's predictions on the test set
if is_xgb_best:
    y_pred_test = le.inverse_transform(xgb_model.predict(X_test))
    y_true_test = le.inverse_transform(y_test_enc)
else:
    y_pred_test = best_model.predict(X_test)
    y_true_test = y_test

# Build test DataFrame with predictions
test_results = pd.DataFrame({
    'true_label': y_true_test,
    'predicted': y_pred_test,
})
test_results['correct'] = test_results['predicted'] == test_results['true_label']

# Summary
n_correct = test_results['correct'].sum()
n_total = len(test_results)
print(f"Test set: {n_correct}/{n_total} correct ({n_correct/n_total:.1%})")
print(f"Misclassified: {n_total - n_correct} pages\n")

# Misclassifications per true class
misclassified = test_results[~test_results['correct']]
print("Misclassifications by true label:")
print(misclassified['true_label'].value_counts())
print(f"\nMost common confusion pairs:")
print(misclassified.groupby(['true_label', 'predicted']).size().sort_values(ascending=False).head(10))

Test set: 63/120 correct (52.5%)
Misclassified: 57 pages

Misclassifications by true label:
true_label
OTHER               17
BLOG                10
EDUCATION            8
ECOMMERCE            7
NEWS                 5
TECHNICAL            5
GOVERNMENT           3
FORUM/DISCUSSION     2
Name: count, dtype: int64

Most common confusion pairs:
true_label        predicted
BLOG              OTHER        6
OTHER             ECOMMERCE    5
ECOMMERCE         OTHER        4
TECHNICAL         OTHER        4
OTHER             EDUCATION    4
EDUCATION         OTHER        4
OTHER             BLOG         3
                  NEWS         3
GOVERNMENT        OTHER        2
FORUM/DISCUSSION  EDUCATION    2
dtype: int64


In [None]:
# ── Inspect Misclassified Examples ────────────────────────────────
print("Sample Misclassified Pages:\n")
for _, row in misclassified.head(5).iterrows():
    print(f"  True: {row['true_label']}  →  Predicted: {row['predicted']}")
    print()

Sample Misclassified Pages:

Page ID: 723
  True: OTHER  →  Predicted: TECHNICAL
  Text (first 200 chars): stephanie ouellette npi dietitian registered rates price transparency price transparency tx katy stephanie ouellette healthcare price transparency data for stephanie ouellette access comprehensive pri...

Page ID: 46
  True: OTHER  →  Predicted: EDUCATION
  Text (first 200 chars): cadotech solutions pvt ltd cadotech solutions pvt ltd toggle navigation authorized reseller products solidworks solutions cover all aspects of your product development process with a seamless integrat...

Page ID: 234
  True: BLOG  →  Predicted: OTHER
  Text (first 200 chars): si barber moral rights asserted si barber photo archive si barber moral rights asserted pic by si barber cook manager clair legge with charles grace at middleton primary kings lynn norfolk preparing t...

Page ID: 1051
  True: BLOG  →  Predicted: EDUCATION
  Text (first 200 chars): how to make the data center buildout in north caro

In [None]:
# ── Per-Class F1 Score Visualization ──────────────────────────────
all_labels = sorted(np.unique(y_labeled))

# f1_score is already imported in the first cell; average=None returns one score per label
class_f1 = f1_score(y_true_test, y_pred_test, labels=all_labels, average=None)

class_f1_df = pd.DataFrame({
    'Category': all_labels,
    'F1 Score': class_f1
}).sort_values('F1 Score')

fig = px.bar(
    class_f1_df, x='F1 Score', y='Category',
    orientation='h',
    title=f'Per-Class F1 Score — {best_model_name}',
    color='F1 Score',
    color_continuous_scale='RdYlGn',
    range_color=[0, 1]
)
fig.add_vline(x=class_f1_df['F1 Score'].mean(), line_dash='dash', line_color='gray',
              annotation_text=f"Mean: {class_f1_df['F1 Score'].mean():.3f}")
fig.show()

print("\nPer-class F1 scores:")
for _, row in class_f1_df.iterrows():
    bar = '█' * int(row['F1 Score'] * 20)
    print(f"  {row['Category']:<20s} {row['F1 Score']:.4f}  {bar}")


Per-class F1 scores:
  GOVERNMENT           0.0000  
  EDUCATION            0.2963  █████
  TECHNICAL            0.3077  ██████
  OTHER                0.4348  ████████
  NEWS                 0.5000  ██████████
  BLOG                 0.6383  ████████████
  ECOMMERCE            0.6809  █████████████
  FORUM/DISCUSSION     0.7778  ███████████████


### Discussion: Error Analysis & Insights

**Which classes are most confused?**
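- **GOVERNMENT** has only 13 labeled samples (10 in the training split, 3 in the test split), so the model has very little to learn from; Logistic Regression gets none of the 3 test pages right (F1 = 0.00)
- **OTHER** is a catch-all category that overlaps with nearly every other class (blogs about shopping → ECOMMERCE or OTHER?); it appears in 9 of the 10 most common confusion pairs
- **BLOG vs NEWS** can share similar vocabulary (article-style writing, date mentions, author names), although this pair does not appear among the 10 most common confusion pairs on this test split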
- **FORUM/DISCUSSION** may overlap with BLOG when forums have long single posts
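
To quantify how much of the error mass involves OTHER, the sketch below tallies the ten most common confusion pairs recorded for the test set. Only the top ten pairs are listed in the output, so the result is a lower bound on OTHER's share; the variable names are illustrative.

```python
# Ten most common (true, predicted) confusion pairs from the recorded test output
top_pairs = {
    ('BLOG', 'OTHER'): 6, ('OTHER', 'ECOMMERCE'): 5, ('ECOMMERCE', 'OTHER'): 4,
    ('TECHNICAL', 'OTHER'): 4, ('OTHER', 'EDUCATION'): 4, ('EDUCATION', 'OTHER'): 4,
    ('OTHER', 'BLOG'): 3, ('OTHER', 'NEWS'): 3, ('GOVERNMENT', 'OTHER'): 2,
    ('FORUM/DISCUSSION', 'EDUCATION'): 2,
}
n_misclassified = 57   # total test errors reported for Logistic Regression

listed = sum(top_pairs.values())
with_other = sum(n for pair, n in top_pairs.items() if 'OTHER' in pair)

print(f"Errors covered by the top-10 pairs: {listed}/{n_misclassified}")
print(f"Of those, involving OTHER: {with_other} ({with_other / n_misclassified:.0%} of all errors)")
# Errors covered by the top-10 pairs: 37/57
# Of those, involving OTHER: 35 (61% of all errors)
```

At least 35 of the 57 test errors have OTHER as either the true or the predicted label, which suggests that tightening the definition of OTHER could matter more than any single model change.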

**Reasons for misclassification:**
1. **Class imbalance:** GOVERNMENT (13) vs OTHER (158) — a 12:1 ratio makes it hard for the model to learn minority patterns
2. **Web page boilerplate:** Navigation menus, footers, cookie notices, and "Copyright ©" appear across all categories
3. **Ambiguous content:** Some pages genuinely span multiple categories (e.g., a government blog)
4. **Short pages:** Pages with very little text (min 294 chars) may lack enough discriminative vocabulary

**Suggestions for improvement:**
- **More labeled data** — especially for GOVERNMENT, TECHNICAL, and FORUM/DISCUSSION
- **Better preprocessing** — detect and remove boilerplate HTML (nav bars, footers, repeated site-wide text)
- **Feature engineering** — add text length, presence of specific patterns (prices → ECOMMERCE, dates → NEWS)
- **Oversampling** — use SMOTE or random oversampling for minority classes
- **Ensemble methods** — combine predictions from multiple models (voting classifier)
- **Neural approaches** — FFNN or transformer-based models (as shown in Week 3 Notebook 3) may capture richer patterns
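
As a minimal sketch of the oversampling suggestion, the function below duplicates minority-class rows at random until every class matches the largest one. It is a plain-Python illustration on made-up data, not the notebook's method; `random_oversample` is a hypothetical helper, and in practice it should be applied to the training split only so that duplicated pages never leak into the test set.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate minority-class rows at random until all classes match the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap up to the target size
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Toy data (made up): 6 OTHER pages vs 2 GOVERNMENT pages
rows = [f"doc{i}" for i in range(8)]
labels = ['OTHER'] * 6 + ['GOVERNMENT'] * 2
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

After resampling, both toy classes contain 6 rows. Random duplication adds no new information, so it tends to help linear models less than collecting additional labeled GOVERNMENT pages would.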

---

## Summary of Findings

- **Dataset:** TF-IDF vectorizer fitted on all ~6,300 Common Crawl pages; 596 manually labeled pages used for training/evaluation
- **Best Model:** Logistic Regression (selected by highest cross-validation F1)
- **Best CV F1 (weighted):** 0.5598 ± 0.0273
- **Test Accuracy:** 0.5250 for Logistic Regression (Random Forest reached 0.5500 on the same split)
- **Strongest Categories:** FORUM/DISCUSSION (F1 0.78), ECOMMERCE (0.68), BLOG (0.64)
- **Weakest Categories:** GOVERNMENT (F1 0.00), EDUCATION (0.30), TECHNICAL (0.31)
- **Key Preprocessing Decision:** Removing URLs, emails, and special characters cut the mean page length from about 6,536 to 5,786 characters (roughly 11%); `sublinear_tf=True` dampens repeated terms on long web pages
- **Class Imbalance Impact:** GOVERNMENT (13 samples) was the hardest category (F1 = 0.00 with Logistic Regression) even with `class_weight='balanced'`; the effect of class weighting was not measured against an unweighted baseline
- **Predictions:** ~5,700 unlabeled pages classified and exported to `group_2_predictions.csv`
- **Recommendations:** Collect more labeled data for minority classes; explore boilerplate removal and neural approaches for better accuracy