# BDACA II: Practice Session – From Classifier to Publishable Workflow

In this session, you'll put the lecture concepts into practice:

- Proper data splitting (and understanding when to use what)
- Building pipelines
- Hyperparameter tuning with cross-validation
- Understanding regularization
- Thorough evaluation
- Saving and reusing models

You have two files available:

- `labeled.csv` — your annotated training/evaluation data
- `unlabeled.csv` — data you might eventually want to classify

**Goal:** Build a robust, well-validated text classifier that you could defend
in a thesis or paper.

**Book reference:** This session builds on
[Chapter 11.4 of the CSS book](https://cssbook.net/content/chapter11.html#sec-supervised)
— you may want to keep it open for reference.

---


## 0. Setup


In [1]:
import pandas as pd
import numpy as np
import joblib  # for saving models

from sklearn.model_selection import (
    train_test_split, 
    StratifiedKFold, 
    cross_val_score, 
    GridSearchCV,
    RandomizedSearchCV  # alternative to GridSearchCV
)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# For reproducibility - ALWAYS set this!
RANDOM_STATE = 42

## 1. Load and Explore the Data

Before doing anything else: understand your data.


In [2]:
df = pd.read_csv("labeled.csv")
print(f"Dataset size: {len(df)} documents")
df.head()


In [None]:
# TODO: Examine the label distribution
# This is crucial - is it balanced? Imbalanced?
# Hint: use .value_counts()



**Question to consider:** Based on the label distribution, what should you keep
in mind when:

1. Splitting your data?
2. Choosing evaluation metrics?
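
One way to explore the second question is a small sketch with made-up labels. The 90/10 split and the always-majority "classifier" below are assumptions for illustration, not properties of `labeled.csv`. It shows how accuracy can look fine while macro F1 exposes a model that never finds the minority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 90 "neg" documents, 10 "pos" documents
y_true = np.array(["neg"] * 90 + ["pos"] * 10)

# A useless classifier that always predicts the majority class
y_majority = np.array(["neg"] * 100)

acc = accuracy_score(y_true, y_majority)
# zero_division=0 silences the warning for the class that is never predicted
f1_macro = f1_score(y_true, y_majority, average="macro", zero_division=0)
f1_weighted = f1_score(y_true, y_majority, average="weighted", zero_division=0)

print(f"Accuracy:      {acc:.3f}")
print(f"F1 (macro):    {f1_macro:.3f}")
print(f"F1 (weighted): {f1_weighted:.3f}")
```

Macro F1 averages the per-class scores without weighting by class size, so the failed minority class pulls it down sharply.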

---


## 2. Data Splitting Strategies

There are two main approaches (as covered in the lecture):

### Option A: Train / Validation / Test Split

- **Train** (e.g., 64%): fit models
- **Validation** (e.g., 16%): compare models, tune hyperparameters
- **Test** (e.g., 20%): final evaluation, touched only once

**When to use:** Very large datasets, computationally expensive models (e.g.,
deep learning)

### Option B: Train / Test Split + Cross-Validation

- **Train** (e.g., 80%): used for k-fold CV during tuning
- **Test** (e.g., 20%): final evaluation, touched only once

**When to use:** Smaller datasets where you want stable estimates (our case
today)

**Book reference:** See the
[CSS book section 11.4.1](https://cssbook.net/content/chapter11.html#sec-workflow)
for more on workflows.

---

We'll use **Option B** today. First, let's create our train/test split.


In [None]:
# TODO: Create a train/test split
# Requirements:
# - 80% train, 20% test
# - Use stratification (why? → preserves label distribution in both sets)
# - Set the random state for reproducibility

# Hint: check the parameters of train_test_split()

X = df['text']  # adjust column name if needed
y = df['label']  # adjust column name if needed

# X_train, X_test, y_train, y_test = train_test_split(...)



In [None]:
# Verify your split preserved the label distribution
print("Original distribution:")
print(y.value_counts(normalize=True).round(3))
print("\nTraining distribution:")
print(y_train.value_counts(normalize=True).round(3))
print("\nTest distribution:")
print(y_test.value_counts(normalize=True).round(3))

### (Optional) If you wanted a three-way split instead:

```python
# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)

# Second split: separate train and validation from the remainder
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.2, stratify=y_temp, random_state=RANDOM_STATE
)  # 0.2 of 0.8 = 0.16 of total
```

---


## 3. Building a Pipeline

Pipelines bundle preprocessing and classification into one object. This is
important because:

1. **Prevents data leakage** during cross-validation (vectorizer is fit only on
   training folds)
2. Makes hyperparameter tuning cleaner
3. Easier to save and deploy
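
The leakage point can be illustrated with a minimal sketch. The toy corpus and the names `leaky_vec` and `toy_pipe` are hypothetical. A vectorizer fitted on all texts learns vocabulary from held-out documents, whereas a pipeline's vectorizer only sees what is passed to `fit()`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; "zebra" occurs only in the held-out document
train_texts = ["good movie", "bad movie", "good plot", "bad plot"]
train_labels = ["pos", "neg", "pos", "neg"]
heldout_texts = ["good zebra"]

# Leaky approach: the vectorizer sees the held-out text before evaluation
leaky_vec = TfidfVectorizer().fit(train_texts + heldout_texts)

# Pipeline approach: fit() only ever sees the training texts
toy_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])
toy_pipe.fit(train_texts, train_labels)
clean_vocab = toy_pipe.named_steps["vectorizer"].vocabulary_

print("zebra in leaky vocabulary:   ", "zebra" in leaky_vec.vocabulary_)
print("zebra in pipeline vocabulary:", "zebra" in clean_vocab)
```

During cross-validation the pipeline is re-fitted on each set of training folds, so the same separation holds for every fold.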

**Book reference:** See
[CSS book Example 11.5](https://cssbook.net/content/chapter11.html#exm-basicpipe)
for pipeline basics.

Let's start with a simple baseline.


In [None]:
# A simple baseline pipeline
baseline_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression(solver="liblinear", random_state=RANDOM_STATE))
])

# Quick sanity check - does it run?
baseline_pipe.fit(X_train, y_train)
print(f"Baseline accuracy on training data: {baseline_pipe.score(X_train, y_train):.3f}")

**Question:** Why shouldn't we report that training accuracy as our result?
(Think: overfitting)

---


## 4. Cross-Validation

Instead of a single train/validation split, we use k-fold cross-validation to
get a more stable estimate of performance.

**Remember:** We're still only using `X_train` and `y_train` here. The test set
stays untouched!
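
To make the mechanics concrete, here is a rough sketch of what `cross_val_score` does under the hood, written as an explicit loop. The toy texts, labels, and the names `toy_pipe` and `toy_cv` are assumptions standing in for the course data. The loop is a simplification of the library's internals, not a copy of them.

```python
import numpy as np
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for X_train / y_train
texts = np.array([
    "great film", "awful film", "great acting", "awful acting", "loved it",
    "hated it", "great story", "awful story", "loved the cast", "hated the cast",
])
labels = np.array(["pos", "neg"] * 5)

toy_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])
toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, val_idx in toy_cv.split(texts, labels):
    fold_model = clone(toy_pipe)  # fresh, unfitted copy for every fold
    fold_model.fit(texts[train_idx], labels[train_idx])
    fold_pred = fold_model.predict(texts[val_idx])
    manual_scores.append(
        f1_score(labels[val_idx], fold_pred, average="macro", zero_division=0)
    )

# The one-liner should give the same per-fold scores
auto_scores = cross_val_score(toy_pipe, texts, labels, cv=toy_cv, scoring="f1_macro")
print("Manual loop:    ", np.round(manual_scores, 3))
print("cross_val_score:", np.round(auto_scores, 3))
```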


In [None]:
# Set up stratified k-fold cross-validation
# - Stratified: preserves label distribution in each fold
# - shuffle=True: randomize before splitting
# - random_state: for reproducibility

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

In [None]:
# TODO: Use cross_val_score to evaluate the baseline pipeline
# Try different scoring metrics: 'accuracy', 'f1_macro', 'f1_weighted'
#
# Which one(s) should you report given your label distribution?
# - Balanced classes → accuracy is fine
# - Imbalanced classes → F1 (especially macro) is more informative

# scores = cross_val_score(baseline_pipe, X_train, y_train, cv=cv, scoring="f1_macro")
# print(f"CV F1 (macro): {scores.mean():.3f} (+/- {scores.std():.3f})")



---

## 5. Understanding Regularization

Before we tune hyperparameters, let's understand what we're tuning.

### The Problem: High-Dimensional Text

Text classification has **many features** (words) and **relatively few
documents**. Without constraints, models will overfit by assigning large weights
to rare words that happen to correlate with labels in the training data.

### The Solution: Regularization

Regularization adds a penalty for large coefficients, forcing the model to be
simpler.

| Type            | Penalty                      | Effect                                | When to use                                        |
| --------------- | ---------------------------- | ------------------------------------- | -------------------------------------------------- |
| **L2 (Ridge)**  | Sum of squared coefficients  | Many small coefficients               | Default for text, good all-rounder                 |
| **L1 (Lasso)**  | Sum of absolute coefficients | Some coefficients become exactly zero | When you want feature selection / interpretability |
| **Elastic Net** | Mix of L1 and L2             | Compromise                            | When you want some sparsity but L1 is unstable     |

### The C Parameter

In scikit-learn, `C` controls regularization strength:

- **Small C** (e.g., 0.01) → Strong regularization → Simpler model
- **Large C** (e.g., 100) → Weak regularization → Model can fit training data
  more closely (risk of overfitting)

Note: C is the _inverse_ of regularization strength (λ in the lecture slides),
which can be confusing!
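
A minimal sketch of how `C` changes the size of the coefficients under an L2 penalty. It uses synthetic numeric data from `make_classification` as a stand-in for a document-term matrix, which is an assumption made only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a document-term matrix (not the course data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=42
)

coef_norms = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, max_iter=5000)  # default penalty is L2
    model.fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:>6}: L2 norm of coefficients = {coef_norms[C]:.3f}")
```

Smaller `C` means a stronger penalty, so the coefficient vector shrinks towards zero; larger `C` lets the weights grow.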


In [None]:
# Quick demonstration: effect of C on number of "active" features with L1
# (TfidfVectorizer and LogisticRegression were already imported in Setup)

vec = TfidfVectorizer(min_df=5, max_df=0.5)
X_train_vec = vec.fit_transform(X_train)

for C in [0.01, 0.1, 1, 10]:
    lr = LogisticRegression(C=C, penalty='l1', solver='liblinear', random_state=RANDOM_STATE)
    lr.fit(X_train_vec, y_train)
    # Count features with a non-zero weight for at least one class, so the
    # count stays comparable to the vocabulary size in the multiclass case too
    n_nonzero = (lr.coef_ != 0).any(axis=0).sum()
    print(f"C={C:5}: {n_nonzero:4} non-zero coefficients out of {lr.coef_.shape[1]}")

---

## 6. Hyperparameter Tuning with GridSearchCV

Now let's systematically search for better configurations.

The syntax for tuning pipeline parameters is: `stepname__parametername`
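
A short sketch of this naming scheme, using a throwaway pipeline (`demo_pipe` is a hypothetical name). `get_params()` lists every tunable name, and `set_params()` accepts the same `stepname__parametername` keys that a parameter grid uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

demo_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])

# List the parameter names that belong to the "classifier" step
classifier_params = [p for p in demo_pipe.get_params() if p.startswith("classifier__")]
print(classifier_params[:5])

# set_params uses the same names that a parameter grid would
demo_pipe.set_params(vectorizer__ngram_range=(1, 2), classifier__C=0.1)
print(demo_pipe.named_steps["vectorizer"].ngram_range)
print(demo_pipe.named_steps["classifier"].C)
```

If a grid key is misspelled, scikit-learn raises an error about an invalid parameter, so checking `get_params()` first can save a failed search.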

**Book reference:** See
[CSS book Example 11.6](https://cssbook.net/content/chapter11.html#exm-gridsearchlogreg)
for a similar grid search.


In [None]:
# Example: a small parameter grid
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams+bigrams
    "vectorizer__min_df": [1, 5],                  # minimum document frequency
    "classifier__C": [0.1, 1, 10],                 # regularization strength
}

# How many combinations is this?
n_combinations = 2 * 2 * 3
print(f"Grid has {n_combinations} combinations")
print(f"With 5-fold CV, that's {n_combinations * 5} model fits")

In [None]:
# TODO: Set up and run GridSearchCV
# - Use the pipeline (baseline_pipe or create a new one)
# - Use param_grid and cv defined above
# - Choose an appropriate scoring metric
# - Consider setting n_jobs=-1 to use all CPU cores
# - Consider setting verbose=1 to see progress

pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression(solver="liblinear", random_state=RANDOM_STATE))
])

# grid_search = GridSearchCV(
#     pipe, 
#     param_grid, 
#     cv=cv, 
#     scoring="f1_macro",
#     n_jobs=-1,
#     verbose=1
# )
# grid_search.fit(X_train, y_train)



In [None]:
# TODO: Inspect the results
# - What were the best parameters?
# - What was the best CV score?

# print(f"Best parameters: {grid_search.best_params_}")
# print(f"Best CV score: {grid_search.best_score_:.3f}")



In [None]:
# Optional: Look at all results as a DataFrame
# results_df = pd.DataFrame(grid_search.cv_results_)
# results_df[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']].sort_values('rank_test_score').head(10)



---

## 7. Grid Search vs. Random Search

As mentioned in the lecture:

- **Grid search**: tries all combinations → good for small grids
- **Random search**: samples random combinations → better for large search
  spaces

Random search is often more efficient because some hyperparameters matter more
than others. With the same computational budget, random search explores more
values of the important parameters.

**Rule of thumb:** Small grid (< 50 combinations) → Grid search. Larger → Random
search.
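
For a parameter like `C`, random search is usually paired with a log-uniform distribution. The sketch below is an illustration with an arbitrary sample size of 1000. It draws from `loguniform(0.01, 100)` and counts how many samples land in each order of magnitude.

```python
import numpy as np
from scipy.stats import loguniform

# Draw 1000 candidate values for C between 0.01 and 100
samples = loguniform(0.01, 100).rvs(size=1000, random_state=42)

# Count how many samples fall into each order of magnitude
edges = [0.01, 0.1, 1, 10, 100]
counts, _ = np.histogram(samples, bins=edges)
for low, high, n in zip(edges[:-1], edges[1:], counts):
    print(f"{low:>5} to {high:>5}: {n} samples")
```

Each decade receives roughly the same share of samples, whereas a plain uniform distribution over the same range would put almost everything between 10 and 100.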


In [None]:
# Example: RandomizedSearchCV with continuous distributions
from scipy.stats import loguniform

# For random search, you can specify distributions instead of lists
param_distributions = {
    "vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vectorizer__min_df": [1, 2, 3, 5, 10],
    "vectorizer__max_df": [0.5, 0.7, 0.9, 1.0],
    "classifier__C": loguniform(0.01, 100),  # samples from log-uniform distribution
    "classifier__penalty": ["l1", "l2"],
}

# The discrete options alone give 3 * 5 * 4 * 2 = 120 combinations, and C is
# continuous on top of that, so the full space cannot be enumerated.
# Random search samples n_iter combinations instead of trying all

# random_search = RandomizedSearchCV(
#     pipe,
#     param_distributions,
#     n_iter=50,  # try 50 random combinations
#     cv=cv,
#     scoring="f1_macro",
#     random_state=RANDOM_STATE,
#     n_jobs=-1
# )
# random_search.fit(X_train, y_train)

---

## 8. Comparing Multiple Classifiers (Your Turn)

The grid above only tested one classifier type. Now extend it to compare
different approaches.

**Task:** Create a comparison that includes:

- At least two different classifiers (e.g., LogisticRegression vs. MultinomialNB
  vs. LinearSVC)
- Different vectorizer settings
- Appropriate parameters for each classifier

**Hint:** You can have multiple param grids (as a list) to handle
classifier-specific parameters.
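
How a list of grids expands can be checked with `ParameterGrid`, the helper that enumerates the candidates of a grid. In this simplified sketch the strings `"logreg"` and `"naive_bayes"` are placeholders for real estimator objects, an assumption made to keep the printout readable.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical, simplified grids: the strings stand in for estimator objects
toy_grids = [
    {"classifier": ["logreg"], "classifier__C": [0.1, 1, 10]},
    {"classifier": ["naive_bayes"], "classifier__alpha": [0.1, 0.5, 1.0]},
]

candidates = list(ParameterGrid(toy_grids))
print(f"{len(candidates)} candidates in total")
for params in candidates:
    print(params)
```

Each dictionary is expanded on its own and the results are concatenated, so `C` is never combined with Naive Bayes and `alpha` is never combined with logistic regression.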

**Book reference:** See
[CSS book Example 11.4](https://cssbook.net/content/chapter11.html#exm-basiccomparisons)
for comparing multiple configurations.


In [None]:
# Example of multiple grids for different classifiers:

pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression())  # placeholder, will be replaced by grid
])

# When you have a list of grids, GridSearchCV tries each one
param_grids = [
    # Grid 1: Logistic Regression with L2
    {
        "vectorizer__ngram_range": [(1, 1), (1, 2)],
        "vectorizer__min_df": [1, 5],
        "classifier": [LogisticRegression(solver="liblinear", random_state=RANDOM_STATE)],
        "classifier__C": [0.1, 1, 10],
        "classifier__penalty": ["l2"],
    },
    # Grid 2: Logistic Regression with L1 (for comparison)
    {
        "vectorizer__ngram_range": [(1, 1), (1, 2)],
        "vectorizer__min_df": [1, 5],
        "classifier": [LogisticRegression(solver="liblinear", random_state=RANDOM_STATE)],
        "classifier__C": [0.1, 1, 10],
        "classifier__penalty": ["l1"],
    },
    # Grid 3: Naive Bayes (no C parameter!)
    {
        "vectorizer__ngram_range": [(1, 1), (1, 2)],
        "vectorizer__min_df": [1, 5],
        "classifier": [MultinomialNB()],
        "classifier__alpha": [0.1, 0.5, 1.0],  # smoothing parameter
    },
    # TODO: Add Grid 4 for LinearSVC
    # Hint: LinearSVC uses C for regularization, similar to LogisticRegression
]

# YOUR CODE HERE: Run this expanded grid search



---

## 9. Final Evaluation on Test Set

**Only now** do we touch the test set. We use the best model from our grid
search and evaluate it once.

This gives us an unbiased estimate of how our model will perform on new data.


In [None]:
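# GridSearchCV refits the best configuration on all of X_train by default
# (refit=True), so best_estimator_ is ready to use.
# This requires that you ran grid_search.fit(...) in Section 6.
best_model = grid_search.best_estimator_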

# Predict on the held-out test set
y_pred = best_model.predict(X_test)

In [None]:
# TODO: Generate a full classification report
# Look at precision, recall, F1 for EACH class

# print(classification_report(y_test, y_pred))



In [None]:
# TODO: Create and display a confusion matrix
# Which classes get confused with each other?

# cm = confusion_matrix(y_test, y_pred)
# disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=best_model.classes_)
# disp.plot(cmap='Blues')
# plt.title("Confusion Matrix on Test Set")
# plt.show()



---

## 10. Error Analysis

Numbers only tell part of the story. Looking at actual misclassifications often
reveals systematic issues.

**Task:** Examine some misclassified examples.
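
Before reading individual texts, it can help to rank which confusions occur most often. A minimal sketch with made-up labels follows. The `toy_results` frame and the classes A, B, C are hypothetical stand-ins for your test results.

```python
import pandas as pd

# Hypothetical labels and predictions, standing in for y_test and y_pred
toy_results = pd.DataFrame({
    "true_label": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "predicted":  ["A", "B", "B", "B", "B", "C", "A", "C"],
})

# Keep only the errors, then count each (true, predicted) pair
errors = toy_results[toy_results["true_label"] != toy_results["predicted"]]
confusions = (
    errors.groupby(["true_label", "predicted"])
    .size()
    .sort_values(ascending=False)
)
print(confusions)
```

The most frequent pair is a natural starting point for reading the misclassified texts.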


In [None]:
# Create a dataframe with predictions for analysis
test_results = pd.DataFrame({
    'text': X_test.values,
    'true_label': y_test.values,
    'predicted': y_pred,
    'correct': y_test.values == y_pred
})

# Look at some misclassifications
misclassified = test_results[~test_results['correct']]
print(f"Total misclassifications: {len(misclassified)} out of {len(test_results)} ({100*len(misclassified)/len(test_results):.1f}%)")

In [None]:
# TODO: Examine some misclassified examples
# - Are there patterns?
# - Are some errors understandable (ambiguous cases)?
# - Are some errors surprising (clear cases the model got wrong)?

# Random sample of errors:
# misclassified.sample(10)

# Or filter by specific confusion:
# misclassified[(misclassified['true_label'] == 'A') & (misclassified['predicted'] == 'B')]



---

## 11. Saving and Loading Your Model

Once you have a good model, you'll want to save it so you can:

1. Apply it to new data (`unlabeled.csv`) without retraining
2. Share it with others
3. Use it in production

**Important:** You save the entire pipeline (vectorizer + classifier), not just
the classifier!

**Book reference:** See
[CSS book Example 11.8](https://cssbook.net/content/chapter11.html#exm-reuse)
for saving and loading models.


In [None]:
# Save the best model
joblib.dump(best_model, "best_text_classifier.joblib")
print("Model saved to best_text_classifier.joblib")

In [None]:
# Later: load and use the model
loaded_model = joblib.load("best_text_classifier.joblib")

# Test that it works
sample_texts = ["This is a test document.", "Another example text here."]
predictions = loaded_model.predict(sample_texts)
print(f"Predictions: {predictions}")

In [None]:
# TODO: Apply your model to the unlabeled data

# unlabeled = pd.read_csv("unlabeled.csv")
# unlabeled['predicted_label'] = loaded_model.predict(unlabeled['text'])
# unlabeled.head()



---

## 12. Reflection Questions

Before you finish, consider these questions:

1. **Is the test performance notably worse than CV performance?**
   - If so, why might that be? (Hint: did tuning overfit to the CV folds, so
     that the best CV score is optimistic?)

2. **Are there classes that perform particularly poorly?** 
   - What could you do about it? (More data? Different features? Class weights?)

3. **If you were to write up these results for a thesis**, what would you report? 
   - Best model and its hyperparameters
   - CV performance (mean ± std)
   - Test set performance (precision, recall, F1 per class)
   - Confusion matrix
   - Error analysis insights

4. **What caveats would you mention?**
   - How well will this generalize to data from different sources/time periods?
   - What are the failure modes?

---


## 13. Bonus Challenges (if time permits)

### A. Threshold Tuning (Binary Classification)

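For binary classification: instead of using the default 0.5 threshold, find a
better threshold using the ROC curve. Choose the threshold on validation data
(or within cross-validation); selecting it on the test set would bias your
final estimate.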
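
A minimal sketch of one common selection rule, Youden's J (the threshold that maximizes TPR minus FPR). The synthetic data, the variable names, and the choice of Youden's J are all assumptions for illustration; other criteria, such as maximizing F1, are equally defensible. The threshold is picked on a separate validation split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (not the course data)
X_demo, y_demo = make_classification(
    n_samples=500, weights=[0.8, 0.2], random_state=42
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
val_proba = clf.predict_proba(X_val)[:, 1]  # probability of class 1

# Youden's J statistic: pick the threshold with the largest TPR - FPR
fpr, tpr, thresholds = roc_curve(y_val, val_proba)
best_idx = np.argmax(tpr - fpr)
best_threshold = thresholds[best_idx]

# Apply the custom threshold instead of the default 0.5
y_val_custom = (val_proba >= best_threshold).astype(int)

print(f"Chosen threshold: {best_threshold:.3f}")
print(f"TPR = {tpr[best_idx]:.3f}, FPR = {fpr[best_idx]:.3f}")
```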


In [None]:
# Only for binary classification!
# from sklearn.metrics import roc_curve, roc_auc_score

# Get predicted probabilities (not just class labels)
# Note: LinearSVC doesn't have predict_proba, use LogisticRegression
# Column 1 holds the probability of best_model.classes_[1], so that label
# must be the one passed as pos_label below
# y_proba = best_model.predict_proba(X_test)[:, 1]

# fpr, tpr, thresholds = roc_curve(y_test, y_proba, pos_label=best_model.classes_[1])
# auc = roc_auc_score(y_test, y_proba)

# plt.figure(figsize=(8, 6))
# plt.plot(fpr, tpr, label=f'ROC curve (AUC = {auc:.3f})')
# plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
# plt.xlabel('False Positive Rate')
# plt.ylabel('True Positive Rate')
# plt.title('ROC Curve')
# plt.legend()
# plt.show()



### B. Feature Importance

What words are most predictive for each class?

**Book reference:** See
[CSS book Example 11.9](https://cssbook.net/content/chapter11.html#exm-eli5) for
using eli5 to inspect features.


In [None]:
# For logistic regression, you can examine coefficients
# (Only works if best_model uses LogisticRegression)

# vectorizer = best_model.named_steps['vectorizer']
# classifier = best_model.named_steps['classifier']

# feature_names = vectorizer.get_feature_names_out()

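# For binary classification:
# coef_ has a single row; positive values push towards classifier.classes_[1],
# negative values towards classifier.classes_[0]
# coefs = classifier.coef_[0]
# feature_importance = pd.DataFrame({
#     'feature': feature_names,
#     'coefficient': coefs
# }).sort_values('coefficient', key=abs, ascending=False)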

# print("Top 20 most predictive features:")
# print(feature_importance.head(20))



### C. Handling Class Imbalance

If you have imbalanced classes, try `class_weight='balanced'` and see if it
helps the minority class.
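
To see what `'balanced'` does, the weights can be computed directly. This sketch uses a made-up 90/10 label vector as an assumption and compares scikit-learn's `compute_class_weight` with the documented formula `n_samples / (n_classes * count_per_class)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 "neg", 10 "pos"
y_demo = np.array(["neg"] * 90 + ["pos"] * 10)
classes, class_counts = np.unique(y_demo, return_counts=True)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# Same result by hand: n_samples / (n_classes * count_per_class)
manual_weights = len(y_demo) / (len(classes) * class_counts)

for label, w in zip(classes, weights):
    print(f"{label}: weight {w:.3f}")
```

Errors on the rare class count more heavily in the loss, which tends to trade some precision for better minority-class recall.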


In [None]:
# Compare with and without class_weight='balanced'

# pipe_balanced = Pipeline([
#     ("vectorizer", TfidfVectorizer(min_df=5, max_df=0.5)),
#     ("classifier", LogisticRegression(solver="liblinear", class_weight="balanced", random_state=RANDOM_STATE))
# ])

# scores_balanced = cross_val_score(pipe_balanced, X_train, y_train, cv=cv, scoring="f1_macro")
# print(f"With class_weight='balanced': {scores_balanced.mean():.3f} (+/- {scores_balanced.std():.3f})")

