# Regulations.gov: Comment Embeddings >>> Clustering >>> Post-processing

This notebook takes a cleaned Regulations.gov comment dataset and produces cluster labels plus cluster-level summaries for downstream text analysis (e.g., LDA / Wordfish).

**Pipeline:**
1. Loads a CSV of cleaned comments (must include a text column).
2. Computes SBERT embeddings for each comment.
3. Clusters comments (DBSCAN or graph connected-components at a similarity threshold).
4. Keeps **all original columns** and adds a cluster label column.
5. Generates:
   - a cluster size table
   - a de-duplicated “one row per cluster” dataset
   - optional exports for later modeling

> **Note:** Before running, edit `BASE_DIR`, `PREFIX`, and `CSV_FILENAME` in Section 0.6 to match your own file location.


## 0. Setup

This block prepares the Colab environment.

**What this block does**
1. Checks GPU availability (`nvidia-smi`).
2. Installs required packages.
3. Imports libraries used later.
4. Downloads the NLTK sentence tokenizer data (required for the sentence splitting in Section 2).
5. Mounts Google Drive.
6. Defines all input/output paths.

> **Note:** Edit `BASE_DIR`, `PREFIX`, and `CSV_FILENAME` to match your Drive folder and file name.

In [None]:
# ---- 0.1 GPU check ----
!nvidia-smi

In [None]:
# ---- 0.2 Install dependencies ----
!pip install -q sentence-transformers nltk pandas scikit-learn tqdm

# ---- 0.3 Imports ----
import os
import json
import numpy as np
import pandas as pd
import torch

from sentence_transformers import SentenceTransformer

import nltk
from nltk.tokenize import sent_tokenize

In [None]:
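# ---- 0.4 NLTK tokenizer data (used by sent_tokenize in Section 2) ----
nltk.download("punkt")
nltk.download("punkt_tab")  # newer NLTK releases load sentence-tokenizer data from "punkt_tab"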

In [None]:
# ---- 0.5 Mount Google Drive ----
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

In [None]:
# ---- 0.6 Project paths ----

BASE_DIR = "/content/drive/MyDrive/<your_project>"  # <-- EDIT to your drive
DATA_DIR = os.path.join(BASE_DIR, "data")

PREFIX = "irs_multi"  # <-- EDIT

# Input
CSV_FILENAME = f"{PREFIX}_comments_all_clean_irs_multi.csv"  # <-- EDIT
CSV_PATH = os.path.join(DATA_DIR, CSV_FILENAME)

# Outputs
EMBED_JSON_PATH = os.path.join(DATA_DIR, f"{PREFIX}_comments_all_clean_irs_multi_embeddings_allmpnet_SL.json")
DBSCAN_CSV_PATH = os.path.join(DATA_DIR, f"{PREFIX}_comments_all_clean_irs_multi_dbscan.csv")
DBSCAN_FULL_CSV_PATH = os.path.join(DATA_DIR, f"{PREFIX}_comments_all_clean_irs_multi_dbscan_full.csv")

print("CSV_PATH:", CSV_PATH)
print("EMBED_JSON_PATH:", EMBED_JSON_PATH)
print("DBSCAN_CSV_PATH:", DBSCAN_CSV_PATH)
print("DBSCAN_FULL_CSV_PATH:", DBSCAN_FULL_CSV_PATH)

## 1. Load the CSV and standardize required columns

We load the cleaned comments CSV and keep only the columns needed for the embedding pipeline.

**Expected columns**
- `commentId`: unique identifier for each comment
- `text_clean`: cleaned text used for sentence splitting and embedding

**What this step produces**
- A DataFrame `df` containing only:
  - `commentId` (as string)
  - `text_clean` (as string, with missing values filled as empty strings)

In [None]:
# ---- 1.1 Read CSV ----
df_raw = pd.read_csv(CSV_PATH)

print(f"Loaded CSV with {df_raw.shape[0]:,} rows and {df_raw.shape[1]:,} columns.")
print("First 10 column names:", list(df_raw.columns)[:10])

# ---- 1.2 Check required columns ----
required_cols = ["commentId", "text_clean"]
missing = [c for c in required_cols if c not in df_raw.columns]
if missing:
    raise ValueError(
        f"Missing required column(s): {missing}\n"
        f"Available columns include: {list(df_raw.columns)[:30]}"
    )

# ---- 1.3 Keep only columns we need ----
df = df_raw[required_cols].copy()

# ---- 1.4 Standardize types ----
df["commentId"] = df["commentId"].astype(str)
df["text_clean"] = df["text_clean"].fillna("").astype(str)

print("After selecting/cleaning columns:")
print(df.dtypes)

df.head()
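
Two properties of the input affect later steps: duplicate `commentId` values would multiply rows in the merge in Section 4.5, and empty texts receive a zero embedding in Section 2. A small check along these lines may help before embedding. `summarize_comment_table` is a hypothetical helper, and the toy rows are made up for illustration.

```python
import pandas as pd


def summarize_comment_table(
    df: pd.DataFrame, id_col: str = "commentId", text_col: str = "text_clean"
) -> dict:
    """Return simple counts worth checking before embedding."""
    text = df[text_col].fillna("").astype(str)
    return {
        "n_rows": len(df),
        # IDs that appear more than once (counted beyond their first occurrence)
        "n_duplicate_ids": int(df[id_col].astype(str).duplicated().sum()),
        # Missing or whitespace-only texts
        "n_empty_text": int((text.str.strip() == "").sum()),
    }


# Toy data for illustration only (not from the real dataset)
toy = pd.DataFrame({
    "commentId": ["A-1", "A-2", "A-2", "A-3"],
    "text_clean": ["I support the rule.", None, "   ", "Please reconsider."],
})
report = summarize_comment_table(toy)
print(report)
```

On the real data this would be called as `summarize_comment_table(df_raw)` right after reading the CSV.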

## 2. Sentence-level SBERT embeddings

Instead of embedding the full document directly, we:
1. Split each comment into sentences (`nltk.sent_tokenize`).
2. Embed each sentence using SBERT.
3. Average sentence embeddings to obtain a single **document embedding** per comment.

**Key settings**
- Model: `sentence-transformers/all-mpnet-base-v2`
- `MAX_SEQ_LENGTH`: truncation length in tokens *per sentence* (not per document)
- `min_sent_len`: minimum sentence length in characters; shorter sentences are dropped to reduce noise

**Output**
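- A function `embed_document_sentence_level(text)` that returns a 1D NumPy array
  with shape `(embedding_dim,)` (768 for this model). Empty texts, and texts with
  no sentence of at least `min_sent_len` characters, yield a zero vector.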

In [None]:
# ---- 2.1 Model setup ----
from sentence_transformers import SentenceTransformer

MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
MAX_SEQ_LENGTH = 128  # truncation applies to each sentence

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

model = SentenceTransformer(MODEL_NAME, device=device)
model.max_seq_length = MAX_SEQ_LENGTH  # sentence-level truncation


# ---- 2.2 Sentence-level embedding function ----
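def embed_document_sentence_level(text: str, min_sent_len: int = 5) -> np.ndarray:
    """Return the mean of the SBERT sentence embeddings of `text`.

    Sentences shorter than `min_sent_len` characters are dropped. If no
    sentence remains, a zero vector of the model's embedding size is returned.
    """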
    # Normalize whitespace
    text = " ".join(str(text).split())
    if not text:
        return np.zeros(model.get_sentence_embedding_dimension(), dtype=np.float32)

    # 1) Sentence splitting
    sentences = sent_tokenize(text)

    # 2) Filter very short sentences (noise reduction)
    sentences = [s.strip() for s in sentences if len(s.strip()) >= min_sent_len]
    if not sentences:
        return np.zeros(model.get_sentence_embedding_dimension(), dtype=np.float32)

    # 3) Encode all sentences (batching handled internally by SentenceTransformer)
    sent_embeddings = model.encode(
        sentences,
        show_progress_bar=False,
        convert_to_numpy=True
    )

    # 4) Mean pooling -> document embedding
    doc_embedding = sent_embeddings.mean(axis=0).astype(np.float32)

    return doc_embedding
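
Calling `model.encode` once per comment, as happens when this function is applied row by row, can leave the GPU underused when comments are short. A possible alternative is to encode the sentences of many comments in a single call and average per comment afterwards. The sketch below shows only that pooling logic. `mean_pool_documents` is a hypothetical helper, and `toy_encode` is a stand-in for the real encoder so the example runs without downloading a model.

```python
from typing import Callable, List, Sequence

import numpy as np


def mean_pool_documents(
    sentences_per_doc: Sequence[List[str]],
    encode_fn: Callable[[List[str]], np.ndarray],
    dim: int,
) -> np.ndarray:
    """Encode all sentences in one call, then average them per document.

    Documents without sentences get a zero vector, mirroring the function above.
    """
    flat = [s for sents in sentences_per_doc for s in sents]
    # doc_index[j] is the document that sentence j belongs to
    doc_index = np.array(
        [i for i, sents in enumerate(sentences_per_doc) for _ in sents],
        dtype=np.int64,
    )
    out = np.zeros((len(sentences_per_doc), dim), dtype=np.float32)
    if not flat:
        return out

    sent_emb = np.asarray(encode_fn(flat), dtype=np.float32)  # (n_sentences, dim)
    np.add.at(out, doc_index, sent_emb)  # sum sentence vectors per document
    counts = np.bincount(doc_index, minlength=len(sentences_per_doc))
    nonempty = counts > 0
    out[nonempty] /= counts[nonempty][:, None].astype(np.float32)  # sum -> mean
    return out


# Hypothetical stand-in encoder: (character count, word count) per sentence
def toy_encode(sentences: List[str]) -> np.ndarray:
    return np.array([[len(s), s.count(" ") + 1] for s in sentences], dtype=np.float32)


docs = [["Short one.", "A longer sentence here."], [], ["Only sentence."]]
doc_emb = mean_pool_documents(docs, toy_encode, dim=2)
print(doc_emb)
```

In the notebook, `encode_fn` could wrap `model.encode(sentences, convert_to_numpy=True, show_progress_bar=False)`, and `dim` could come from `model.get_sentence_embedding_dimension()`.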

## 3. Compute sentence-level embeddings and export to JSON

We compute one embedding per comment using `embed_document_sentence_level()` and save results to a JSON file.

**What this step does**
- Iterates over `text_clean`
- Produces an embedding matrix with shape `(N, dim)`
- Saves a record-per-row JSON (portable and easy to reload)

**Output file**
- `EMBED_JSON_PATH`: JSON list of records with:
  - `commentId`
  - `text_clean`
  - `embedding`

In [None]:
# ---- 3.1 Compute embeddings ----
from tqdm.auto import tqdm

comment_ids = df["commentId"].tolist()
texts = df["text_clean"].tolist()

use_cache = True  # reuse saved embeddings after a possible runtime disconnect

if use_cache and os.path.exists(EMBED_JSON_PATH):
    print("Found cached embeddings JSON. Loading:", EMBED_JSON_PATH)
    embed_df = pd.read_json(EMBED_JSON_PATH, orient="records")
    embeddings = np.vstack(embed_df["embedding"].apply(np.array).values).astype(np.float32)
    print("Loaded embedding matrix with shape:", embeddings.shape)

else:
    print(f"Computing sentence-level embeddings for {len(texts):,} comments...")

    embeddings_list = []
    for txt in tqdm(texts, total=len(texts)):
        emb = embed_document_sentence_level(txt)
        embeddings_list.append(emb)

    embeddings = np.vstack(embeddings_list).astype(np.float32)  # (N, dim)
    print("Embedding matrix shape:", embeddings.shape)

    # Build export table
    embed_df = pd.DataFrame({
        "commentId": comment_ids,
        "text_clean": texts,
        "embedding": embeddings.tolist()
    })

    # ---- 3.2 Save JSON ----
    os.makedirs(os.path.dirname(EMBED_JSON_PATH), exist_ok=True)
    tmp_path = EMBED_JSON_PATH + ".tmp"

    embed_df.to_json(tmp_path, orient="records", force_ascii=False)
    os.replace(tmp_path, EMBED_JSON_PATH)

    print("Saved embeddings to:", EMBED_JSON_PATH)

embed_df.head(3)
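
The cache branch in 3.1 loads whatever is stored at `EMBED_JSON_PATH` without checking that it belongs to the current `df`. A hedged sketch of an alignment check follows. `load_cached_embeddings` is a hypothetical helper, and the temporary file stands in for the real JSON.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def load_cached_embeddings(path: str, expected_ids: list) -> np.ndarray:
    """Load a records-style embeddings JSON and verify it matches expected_ids.

    Raises ValueError unless the cached IDs equal expected_ids in the same order.
    """
    cached = pd.read_json(path, orient="records")
    cached_ids = cached["commentId"].astype(str).tolist()
    if cached_ids != [str(i) for i in expected_ids]:
        raise ValueError("Cached embeddings do not match the current comment IDs; recompute.")
    return np.stack(cached["embedding"].to_numpy()).astype(np.float32)


# Toy round trip for illustration only
toy_ids = ["A-1", "A-2"]
toy = pd.DataFrame({"commentId": toy_ids, "embedding": [[0.1, 0.2], [0.3, 0.4]]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_embeddings.json")
    toy.to_json(path, orient="records")

    emb = load_cached_embeddings(path, toy_ids)
    try:
        load_cached_embeddings(path, ["A-1", "A-3"])
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True

print(emb.shape, mismatch_detected)
```

With the notebook's variables, the call would be `load_cached_embeddings(EMBED_JSON_PATH, comment_ids)`.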

## 4. DBSCAN clustering on comment embeddings

This section runs DBSCAN on comment embeddings and exports both:
- a lightweight cluster table, and
- a full merged table (original CSV + `dbscan_cluster`).

### Inputs
**Required**
- `EMBED_JSON_PATH`: JSON records containing at least:
  - `commentId`
  - `embedding` (vector stored as a Python list)
- `CSV_PATH`: original comments CSV (used to build the FULL output)

### Outputs
- `DBSCAN_CSV_PATH`: lightweight CSV with:
  - `commentId`, `dbscan_cluster`
- `DBSCAN_FULL_CSV_PATH`: full CSV with:
  - all original columns from `CSV_PATH` + `dbscan_cluster`

### DBSCAN settings (cosine distance)
- `eps`: neighborhood radius in cosine-distance space (cosine distance = 1 - cosine similarity)  
- `min_samples`: minimum number of points within `eps` of a point (the point itself included) for it to count as a core point
- Label `-1` indicates **noise** (unclustered points)
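
With cosine distance, `eps = 0.05` links comments whose cosine similarity is at least 0.95. One common heuristic for choosing `eps` is to sort the distance from each point to its k-th nearest neighbor and pick a value near the bend of that curve. The sketch below illustrates this with NumPy on toy vectors. `kth_neighbor_distances` is a hypothetical helper, and the dense distance matrix makes it suitable only for small samples.

```python
import numpy as np


def cosine_distance_matrix(X: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance (1 - cosine similarity) for the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return np.clip(1.0 - Xn @ Xn.T, 0.0, 2.0)


def kth_neighbor_distances(X: np.ndarray, k: int) -> np.ndarray:
    """Sorted distance from each point to its k-th nearest *other* point."""
    D = cosine_distance_matrix(X)
    np.fill_diagonal(D, np.inf)  # exclude each point from its own neighbor list
    return np.sort(np.sort(D, axis=1)[:, k - 1])


# Relationship between eps and cosine similarity
eps = 0.05
min_similarity = 1.0 - eps

# Toy vectors: three near-duplicates and one unrelated point
X_toy = np.array([[1.0, 0.0], [1.0, 0.01], [1.0, 0.02], [0.0, 1.0]])
kd = kth_neighbor_distances(X_toy, k=1)
print(min_similarity, kd)
```

Because `min_samples` counts the point itself, `k = MIN_SAMPLES - 1` is a natural choice when applying this to a random sample of the real embedding matrix.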

In [None]:
# ---- 4.1 Load embeddings JSON ----
df_embed = pd.read_json(EMBED_JSON_PATH, orient="records")
df_embed["commentId"] = df_embed["commentId"].astype(str)

print("df_embed rows:", len(df_embed))
print("df_embed columns:", df_embed.columns.tolist())

# ---- 4.2 Build DBSCAN input matrix X ----
X = np.stack(df_embed["embedding"].to_numpy()).astype(np.float32)
print("X shape:", X.shape)

# ---- 4.3 Load original CSV (required for FULL merged export) ----
df_full = pd.read_csv(CSV_PATH)
df_full["commentId"] = df_full["commentId"].astype(str)

print("df_full rows:", len(df_full))
print("df_full columns (first 12):", df_full.columns.tolist()[:12])

# ---- 4.4 Run DBSCAN ----
from sklearn.cluster import DBSCAN

# Tunable parameters
EPS = 0.05  # cosine-distance radius (cosine similarity >= 0.95)
MIN_SAMPLES = 5  # points within EPS (self included) needed for a core point

db = DBSCAN(
    eps=EPS,
    min_samples=MIN_SAMPLES,
    metric="cosine",
    n_jobs=-1
)

print("Fitting DBSCAN ...")
db_labels = db.fit_predict(X)
print("DBSCAN finished.")

# Summary: cluster sizes (including noise = -1)
unique_labels, counts = np.unique(db_labels, return_counts=True)

print("Top clusters by size (including noise = -1):")
for lbl, cnt in sorted(zip(unique_labels, counts), key=lambda x: -x[1])[:15]:
    print(f"  cluster {lbl}: {cnt} comments")

# Attach labels
df_embed["dbscan_cluster"] = db_labels
df_embed[["commentId", "dbscan_cluster"]].head()

In [None]:
# ---- 4.5 Merge dbscan_cluster back onto the original CSV ----
df_dbscan = df_embed[["commentId", "dbscan_cluster"]].copy()
df_dbscan["commentId"] = df_dbscan["commentId"].astype(str)

df_full_out = df_full.merge(df_dbscan, on="commentId", how="left")

print("Rows after merge:", len(df_full_out))
print("Rows without a cluster label:", int(df_full_out["dbscan_cluster"].isna().sum()))
print("Cluster columns in output:")
print([c for c in df_full_out.columns if "cluster" in c])

df_dbscan.to_csv(DBSCAN_CSV_PATH, index=False)
print("Saved lightweight DBSCAN CSV to:", DBSCAN_CSV_PATH)

df_full_out.to_csv(DBSCAN_FULL_CSV_PATH, index=False)
print("Saved FULL CSV to:", DBSCAN_FULL_CSV_PATH)

df_full_out[["commentId", "dbscan_cluster"]].head()
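
The pipeline overview lists graph connected components at a similarity threshold as an alternative to DBSCAN, but the notebook does not show it. A minimal sketch of that idea follows, assuming the dense similarity matrix fits in memory (so small N only). `connected_component_labels` and `min_similarity` are illustrative names, not part of the notebook.

```python
import numpy as np


def connected_component_labels(X: np.ndarray, min_similarity: float) -> np.ndarray:
    """Label rows of X so that pairs with cosine similarity >= min_similarity
    share a label, directly or through a chain of such pairs."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    sim = Xn @ Xn.T  # dense (N, N) matrix: fine for a sketch, too large for big N

    parent = list(range(len(X)))  # union-find forest

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Each pair above the threshold is an edge; merge the two components
    rows, cols = np.where(np.triu(sim >= min_similarity, k=1))
    for a, b in zip(rows.tolist(), cols.tolist()):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    roots = np.array([find(i) for i in range(len(X))])
    _, labels = np.unique(roots, return_inverse=True)  # relabel as 0..n_components-1
    return labels


# Toy vectors: two near-duplicate pairs and one point on its own
X_toy = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [1.0, 1.0]])
labels = connected_component_labels(X_toy, min_similarity=0.95)
print(labels)
```

Unlike DBSCAN, this labeling has no noise class: an isolated comment becomes its own single-member component, and chains of pairwise-similar comments can merge texts that are not similar to each other directly.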

## 5. DBSCAN post-processing (exports)

From `df_full_out` (original CSV + `dbscan_cluster`), export:

1. Noise only (`dbscan_cluster == -1`)  
2. One representative per non-noise cluster (most complete row; ties by longer text)  
3. All rows in clusters with size > 5  
4. Noise + representatives


In [None]:
# ---- 5.1 Split noise vs non-noise ----
is_noise = df_full_out["dbscan_cluster"] == -1
has_label = df_full_out["dbscan_cluster"].notna()
df_noise = df_full_out[is_noise].copy()
df_nonnoise = df_full_out[has_label & ~is_noise].copy()  # rows without a label (NaN) are excluded

print("Noise rows:", len(df_noise))
print("Non-noise rows:", len(df_nonnoise))

In [None]:
# ---- 5.2 Representatives per cluster ----
info_cols = [
    "title", "trackingNbr", "organizationName", "firstName", "lastName",
    "city", "stateProvinceRegion", "country", "combinedText",
]
info_cols = [c for c in info_cols if c in df_nonnoise.columns]

# Treat empty strings as missing for scoring
for c in info_cols:
    df_nonnoise[c] = df_nonnoise[c].replace(r"^\s*$", np.nan, regex=True)

df_nonnoise["info_nonnull"] = df_nonnoise[info_cols].notna().sum(axis=1) if info_cols else 0
df_nonnoise["text_len"] = (
    df_nonnoise["combinedText"].fillna("").astype(str).str.len()
    if "combinedText" in df_nonnoise.columns else 0
)

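# Take the first whole row per cluster after sorting. GroupBy.first() is avoided
# because it returns the first non-null value per column and can mix rows.
df_reps = (
    df_nonnoise
    .sort_values(["dbscan_cluster", "info_nonnull", "text_len"], ascending=[True, False, False])
    .drop_duplicates(subset="dbscan_cluster", keep="first")
    .drop(columns=["info_nonnull", "text_len"], errors="ignore")
    .reset_index(drop=True)
)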

print("Representatives:", len(df_reps))
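
When selecting one representative row per cluster, it is worth knowing that pandas `GroupBy.first()` returns the first non-null value of each column, which can combine values from different rows, whereas `drop_duplicates(keep="first")` keeps one whole row. The toy example below (made-up values) shows the difference.

```python
import numpy as np
import pandas as pd

# Two rows of the same cluster, already sorted so the preferred row comes first
toy = pd.DataFrame({
    "dbscan_cluster": [0, 0],
    "organizationName": ["Org A", np.nan],
    "city": [np.nan, "Denver"],
})

# First non-null value per column: "Org A" from row 0, "Denver" from row 1
mixed = toy.groupby("dbscan_cluster", as_index=False).first()

# First whole row per cluster: row 0 as it is, missing city included
whole = toy.drop_duplicates(subset="dbscan_cluster", keep="first")

print(mixed)
print(whole)
```

A representative built from mixed rows would describe a comment that does not exist in the data, which matters if the row is later traced back by `commentId`.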

In [None]:
# ---- 5.3 Clusters with size > 5 (full content) ----
cluster_counts = df_nonnoise["dbscan_cluster"].value_counts()
big_clusters = cluster_counts[cluster_counts > 5].index

df_big = (
    df_nonnoise[df_nonnoise["dbscan_cluster"].isin(big_clusters)]
    .drop(columns=["info_nonnull", "text_len"], errors="ignore")  # helper columns from 5.2
    .copy()
)

print("Big clusters (>5):", len(big_clusters))
print("Rows in big clusters:", len(df_big))
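
The overview lists a cluster size table among the outputs, while Section 4.4 only prints the largest clusters. One way to build such a table from the label column is sketched below. `cluster_size_table` and its column names (`n_comments`, `is_noise`, `share`) are assumptions for illustration.

```python
import pandas as pd


def cluster_size_table(labels: pd.Series) -> pd.DataFrame:
    """One row per cluster label with its size, noise flag, and share of all rows."""
    sizes = (
        labels.value_counts()
        .rename_axis("dbscan_cluster")
        .reset_index(name="n_comments")
    )
    sizes["is_noise"] = sizes["dbscan_cluster"] == -1
    sizes["share"] = sizes["n_comments"] / len(labels)
    # Largest clusters first; ties broken by label for a stable order
    return (
        sizes.sort_values(["n_comments", "dbscan_cluster"], ascending=[False, True])
        .reset_index(drop=True)
    )


# Toy labels for illustration only
toy_labels = pd.Series([-1, 0, 0, 0, 1, 1, -1, 0])
sizes = cluster_size_table(toy_labels)
print(sizes)
```

On the real data this could be called as `cluster_size_table(df_full_out["dbscan_cluster"])` and saved with `to_csv` next to the other exports.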


In [None]:
# ---- 5.4 Noise + representatives ----
df_noise_plus_reps = pd.concat([df_noise, df_reps], ignore_index=True)
print("Rows in noise + reps:", len(df_noise_plus_reps))


In [None]:
# ---- 5.5 Save outputs ----
tag = f"eps{EPS:g}_min{MIN_SAMPLES}"

noise_path      = os.path.join(DATA_DIR, f"{PREFIX}_dbscan_{tag}_noise_only.csv")
reps_path       = os.path.join(DATA_DIR, f"{PREFIX}_dbscan_{tag}_cluster_representatives.csv")
big_path        = os.path.join(DATA_DIR, f"{PREFIX}_dbscan_{tag}_clusters_gt5_full.csv")
noise_reps_path = os.path.join(DATA_DIR, f"{PREFIX}_dbscan_{tag}_noise_plus_reps.csv")

df_noise.to_csv(noise_path, index=False)
df_reps.to_csv(reps_path, index=False)
df_big.to_csv(big_path, index=False)
df_noise_plus_reps.to_csv(noise_reps_path, index=False)

print("Saved:")
print("1) Noise only:", noise_path)
print("2) Representatives:", reps_path)
print("3) Clusters > 5 (full):", big_path)
print("4) Noise + reps:", noise_reps_path)