# Discovering Topics from Voted Sentences

This notebook walks through a complete pipeline for extracting **topics** from a collection of single-sentence submissions that have been rated by participants. Each sentence can be voted on with one of three outcomes: **agree** (1), **disagree** (-1), or **pass/no opinion** (0). Participants joined at different times and new sentences were added over time, so some sentences were never exposed to earlier participants.  
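
To make the vote encoding concrete, the sketch below builds a small participant × sentence matrix and uses `NaN` for pairs with no recorded vote, so that an explicit pass (0) stays distinct from "never voted". The records and variable names are hypothetical and do not come from the dataset used later.

```python
import numpy as np

# Hypothetical (participant, sentence, vote) records; not from the real export.
records = [(0, 0, 1), (0, 1, -1), (1, 0, 0), (2, 2, 1)]

n_participants, n_sentences = 3, 3
# NaN marks "no vote recorded", which is different from an explicit pass (0).
vote_matrix = np.full((n_participants, n_sentences), np.nan)
for participant, sentence, vote in records:
    vote_matrix[participant, sentence] = vote

observed = ~np.isnan(vote_matrix)
print(vote_matrix)
print("observed votes:", int(observed.sum()), "of", vote_matrix.size)
```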

The pipeline addresses several challenges:

1. **Semantic deduplication:** multiple sentences may express the same idea; we group paraphrases into a single cluster before analysing votes.
2. **Exposure modelling:** because participants see only a subset of sentences (and exposure is not uniform), we model the probability that a participant has seen a sentence and correct for this in the vote-based embeddings.
3. **Signed matrix factorization with propensity weighting:** we learn low-dimensional representations of sentences based on participants' votes, using inverse-propensity weights to remove exposure bias and a special treatment for pass votes.
4. **Feature fusion and clustering:** we combine vote-derived embeddings with semantic sentence embeddings and cluster them using a density-based algorithm. An agglomerative step builds a hierarchy of topics.
5. **Topic labeling and social statistics:** we label clusters with keywords (via c‑TF‑IDF), show representative sentences, and summarise each topic with the distribution of agree/disagree/pass votes.

Throughout the notebook we provide commentary explaining each step and the choices behind it. Feel free to adjust hyperparameters, thresholds and weighting schemes to suit your data and domain.


## Installation (optional)

The environment used to run this notebook may already have the necessary libraries installed. If you encounter import errors, uncomment the following lines and run them to install the required packages:

```python
# !pip install sentence-transformers hdbscan umap-learn faiss-cpu networkx scikit-learn
```


In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

# Attempt to read the data exported from pol.is. If unavailable, fall back to a synthetic dataset.
comments_path = Path("comments.csv")
votes_path = Path("votes.csv")

if comments_path.exists() and votes_path.exists():
    comments_raw = pd.read_csv(comments_path)
    votes_raw = pd.read_csv(votes_path)

    # Normalise column names and timestamps
    sentences_df = comments_raw.rename(
        columns={
            "comment-id": "sentence_id",
            "comment-body": "text",
            "author-id": "author_id",
        }
    ).copy()
    sentences_df["timestamp"] = pd.to_datetime(
        sentences_df["timestamp"], unit="s", errors="coerce"
    )
    sentences_df = sentences_df[["sentence_id", "text", "timestamp", "author_id", "agrees", "disagrees", "moderated"]]
    sentences_df.sort_values("timestamp", inplace=True)

    votes_df = votes_raw.rename(
        columns={
            "comment-id": "sentence_id",
            "voter-id": "participant_id",
        }
    ).copy()
    votes_df["timestamp"] = pd.to_datetime(
        votes_df["timestamp"], unit="s", errors="coerce"
    )
    votes_df = votes_df[["participant_id", "sentence_id", "vote", "timestamp"]]
    votes_df.sort_values(["participant_id", "timestamp"], inplace=True)

    print(f"Loaded {len(sentences_df)} comments and {len(votes_df)} votes from pol.is export.")
else:
    print("Could not load pol.is export; generating a synthetic dataset instead.")
    # Synthetic dataset parameters
    num_sentences = 128
    num_participants = 60
    np.random.seed(0)
    # Create synthetic sentences with timestamps spread over 4 months
    base_time = np.datetime64("2025-01-01")
    times = base_time + np.random.randint(0, 120, size=num_sentences).astype("timedelta64[D]")
    sentences_df = pd.DataFrame(
        {
            "sentence_id": np.arange(num_sentences),
            "text": [f"Synthetic sentence {i}" for i in range(num_sentences)],
            "timestamp": times,
        }
    )
    # Generate participant join times randomly over the same period
    join_times = base_time + np.random.randint(0, 120, size=num_participants).astype("timedelta64[D]")
    # Synthetic votes: each participant votes on sentences added before they joined, with random agree/disagree/pass
    votes_records = []
    for p, join_time in enumerate(join_times):
        eligible_sentences = sentences_df[sentences_df["timestamp"] <= join_time]
        chosen = eligible_sentences.sample(frac=0.5, random_state=p)
        for _, row in chosen.iterrows():
            vote = np.random.choice([1, -1, 0], p=[0.5, 0.3, 0.2])
            votes_records.append(
                {
                    "participant_id": p,
                    "sentence_id": row["sentence_id"],
                    "vote": vote,
                    "timestamp": join_time,
                }
            )
    votes_df = pd.DataFrame(votes_records)
    votes_df["timestamp"] = pd.to_datetime(votes_df["timestamp"])
    print("Synthetic dataset created.")

sentences_df.head(), votes_df.head()


Loaded 61 comments and 4538 votes from pol.is export.


(    sentence_id                                               text  \
 6             0       FAccT should be open and welcoming community   
 7             1  FAccT should be a generator for alternative so...   
 8             2  FAccT is NOT doing well in inviting “non-acade...   
 9             3                  FAccT should have poster sessions   
 10            4  Facct is not inclusive to the local country/co...   
 
              timestamp  author_id  agrees  disagrees  moderated  
 6  2025-06-26 12:52:10          0      87          1          1  
 7  2025-06-26 12:53:45          0      42         19          1  
 8  2025-06-26 12:53:49          0      37         25          1  
 9  2025-06-26 12:53:54          0      61         12          1  
 10 2025-06-26 12:53:59          0      22         19          1  ,
       participant_id  sentence_id  vote           timestamp
 0                  0            0     1 2025-06-29 12:00:08
 3108               0           33     1 2025-0

## 1. Semantic Deduplication

Multiple participants may express the same idea using different words. Before aggregating votes, we group **paraphrases** into single units called *groups*. Our deduplication strategy proceeds in two stages:

1. **Bi-encoder retrieval:** We embed each sentence with a sentence-level transformer (e.g. `all-mpnet-base-v2`) and use cosine similarity to find candidate paraphrase pairs. Embedding similarity is fast and yields high recall but may include false positives.
2. **Cross-encoder re-ranking:** We pass each candidate pair through a more accurate cross-encoder model (e.g. `cross-encoder/ms-marco-MiniLM-L-6-v2`). This network processes both sentences jointly and predicts a score for the pair. We keep pairs above a chosen threshold as paraphrases. Note that this particular model is trained for passage-relevance ranking on MS MARCO rather than paraphrase detection, so its scores are not calibrated paraphrase probabilities and the threshold should be tuned on a sample of pairs.
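
The retrieval stage can be illustrated without any model. The sketch below assumes embeddings are already available as a NumPy array; the helper name `candidate_pairs` and the toy vectors are hypothetical, and the threshold mirrors the 0.80 cut-off used in the cell below.

```python
import numpy as np

def candidate_pairs(embeddings, threshold=0.80):
    """Return (i, j, score) for i < j whose cosine similarity meets the threshold."""
    # L2-normalise rows so that the dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    i_idx, j_idx = np.triu_indices(len(unit), k=1)  # each unordered pair once
    keep = sims[i_idx, j_idx] >= threshold
    return [(int(i), int(j), float(sims[i, j])) for i, j in zip(i_idx[keep], j_idx[keep])]

# Toy 2-D "embeddings": rows 0 and 1 point in nearly the same direction.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
pairs = candidate_pairs(toy)
print(pairs)
```

This brute-force version compares all pairs, which is fine for a few hundred sentences; for larger collections a top-k search such as the one used in the notebook cell is the more practical choice.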

The union of verified paraphrase pairs forms a graph; each connected component corresponds to one semantic group.  
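
If `networkx` is not at hand, connected components can also be found with a small union-find structure. The following is a minimal sketch; the function name `connected_groups` and the example pairs are illustrative assumptions.

```python
def connected_groups(n_items, pairs):
    """Assign a group id to each of n_items given undirected paraphrase pairs."""
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Relabel roots as consecutive group ids in order of first appearance.
    group_of_root = {}
    return [group_of_root.setdefault(find(i), len(group_of_root)) for i in range(n_items)]

# Sentences 0-1-2 are chained paraphrases; 3 and 4 stand alone.
group_ids = connected_groups(5, [(0, 1), (1, 2)])
print(group_ids)  # [0, 0, 0, 1, 2]
```

Because components are transitive, sentences 0 and 2 end up in the same group even though they were never directly paired. With a loose threshold this chaining can merge sentences that are only indirectly similar, which is one reason to keep the cross-encoder threshold conservative.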

If the `sentence-transformers` or `networkx` packages are unavailable, this step gracefully degrades by skipping deduplication (each sentence becomes its own group).


In [2]:
# Attempt to import sentence-transformers. If unavailable, deduplication will be skipped.
try:
    from sentence_transformers import CrossEncoder, SentenceTransformer, util
except ImportError as e:
    print("sentence-transformers not available:", e)
    SentenceTransformer = None
    util = None
    CrossEncoder = None

# Compute bi‑encoder embeddings if possible
bi_encoder = None
sent_emb = None
if SentenceTransformer is not None:
    try:
        bi_encoder = SentenceTransformer("all-mpnet-base-v2")
        sent_emb = bi_encoder.encode(
            sentences_df["text"].tolist(),
            convert_to_tensor=True,
            normalize_embeddings=True,
        )
    except Exception as e:
        print("Bi‑encoder unavailable or failed:", e)
        bi_encoder = None
        sent_emb = None

# Identify candidate paraphrase pairs using embedding similarity
paraphrase_pairs = []
if sent_emb is not None and util is not None:
    candidates = util.paraphrase_mining_embeddings(sent_emb, top_k=20)
    COS_THRESHOLD = 0.80
    paraphrase_pairs = [(i, j) for score, i, j in candidates if score >= COS_THRESHOLD]

# Refine pairs using cross‑encoder
verified_pairs = []
if paraphrase_pairs:
    if CrossEncoder is not None:
        try:
            ce_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
            texts = sentences_df["text"].tolist()
            pair_texts = [(texts[i], texts[j]) for i, j in paraphrase_pairs]
            ce_scores = ce_model.predict(pair_texts)
            CE_THRESHOLD = 0.50
            verified_pairs = [
                (i, j)
                for (i, j), s in zip(paraphrase_pairs, ce_scores)
                if s >= CE_THRESHOLD
            ]
        except Exception as e:
            print("Cross‑encoder unavailable or failed:", e)
            verified_pairs = paraphrase_pairs.copy()
    else:
        verified_pairs = paraphrase_pairs.copy()

# Build connected components of paraphrase graph
try:
    import networkx as nx

    G = nx.Graph()
    G.add_nodes_from(range(len(sentences_df)))
    G.add_edges_from(verified_pairs)
    components = list(nx.connected_components(G))
except Exception as e:
    print("networkx unavailable or dedup skipped:", e)
    components = [{i} for i in range(len(sentences_df))]

# Assign a group_id per sentence
group_mapping = {}
for cid, comp in enumerate(components):
    for idx in comp:
        sid = sentences_df.iloc[idx]["sentence_id"]
        group_mapping[sid] = cid
sentences_df["group_id"] = sentences_df["sentence_id"].map(group_mapping)

# Merge group_id into votes_df
votes_df = votes_df.merge(
    sentences_df[["sentence_id", "group_id"]], on="sentence_id", how="left"
)

print(f"Deduplication complete: {len(components)} groups identified.")

Deduplication complete: 60 groups identified.


## 2. Eligibility and Exposure Propensity

Participants joined at various times, and sentences were added over time. A participant *cannot* vote on a sentence that was submitted after they left the study; we use the time of their last vote as a proxy for when they left. To correct for this **structural missingness**, we build an *eligibility matrix* `E` where `E[i, g] = 1` if participant `i` could have seen sentence group `g`, and `0` otherwise.
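
The eligibility rule amounts to a single broadcast comparison. The sketch below uses hypothetical day offsets in place of real timestamps; the array names are illustrative.

```python
import numpy as np

# Hypothetical timestamps in days since the start of the study.
last_vote_day = np.array([2.0, 10.0, 30.0])   # one entry per participant
group_added_day = np.array([0.0, 5.0, 20.0])  # one entry per sentence group

# E[i, g] is True when participant i was still active once group g existed.
E = last_vote_day[:, None] >= group_added_day[None, :]
print(E.astype(int))
```

The same comparison works on `datetime64` arrays, which would avoid the nested Python loops at the cost of holding the full participant × group matrix in memory.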

Within the eligible pairs there is still **selection bias**: some sentences may be more likely to be shown or noticed than others. We model the probability that an eligible participant votes on a sentence using logistic regression.  

Features used in the exposure model:

* `sentence_age`: time difference (in days) between the participant’s last vote and the sentence’s submission.
* `active_days`: the span (in days) of each participant’s activity (last vote minus first vote).

The binary target is 1 if the participant voted on the sentence, 0 otherwise. The resulting predicted probability `p_hat` serves as our **propensity score**. If `scikit-learn` is unavailable, we fall back to a simple ratio: for each group, `p_hat = (# voters)/(# eligible)`.  
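
As a rough illustration of the ratio fallback, the sketch below computes a per-group propensity from boolean eligibility and vote matrices. The toy matrices and the name `p_hat_fallback` are assumptions for the example only.

```python
import numpy as np

# Hypothetical 3 participants x 2 groups.
eligible = np.array([[1, 1], [1, 0], [1, 1]], dtype=bool)
voted = np.array([[1, 0], [1, 0], [0, 1]], dtype=bool)

n_eligible = eligible.sum(axis=0)
n_voters = (voted & eligible).sum(axis=0)
# Guard against division by zero for groups nobody could have seen.
ratio = n_voters / np.maximum(n_eligible, 1)

# Every eligible pair in a group shares that group's ratio; others stay NaN.
p_hat_fallback = np.where(eligible, ratio[None, :], np.nan)
print(ratio)
```

This fallback ignores participant-level differences entirely, so it corrects only for how widely each group was voted on.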

We will later weight each observed vote by `1 / p_hat` to account for non-uniform exposure.
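
Inverse-propensity weights can become very large when `p_hat` is close to zero, and a handful of such votes can then dominate training. The notebook uses the raw `1 / p_hat`; clipping the propensity from below, as sketched here, is a common variance-control option and is shown only as an assumption. The `floor` value is illustrative, not tuned.

```python
import numpy as np

def inverse_propensity_weights(p, floor=0.05):
    """Return 1 / max(p, floor); `floor` is an illustrative choice, not a tuned value."""
    p = np.asarray(p, dtype=float)
    return 1.0 / np.clip(p, floor, 1.0)

weights = inverse_propensity_weights([0.9, 0.5, 0.01])
print(weights)
```

With a floor of 0.05 the weight for the rarely seen pair is capped at 20 instead of 100, trading a little bias for lower variance.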


In [3]:
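# Import scikit-learn defensively so the ratio fallback below can take over if it is missing
try:
    from sklearn.linear_model import LogisticRegression
except ImportError:
    LogisticRegression = None  # calling None raises TypeError, which triggers the fallback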

# Compute last and first vote times per participant
last_vote_time = votes_df.groupby("participant_id")["timestamp"].max()
first_vote_time = votes_df.groupby("participant_id")["timestamp"].min()
participant_active_days = (last_vote_time - first_vote_time).dt.total_seconds() / (
    24 * 3600.0
)

# Eligibility matrix: participants x group_id
users = votes_df["participant_id"].unique()
groups = sentences_df["group_id"].unique()
eligibility = pd.DataFrame(False, index=users, columns=groups)

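# Map each group to the timestamp of its earliest sentence
# (a group with several paraphrases becomes visible once the first one is posted)
group_time_map = sentences_df.groupby("group_id")["timestamp"].min().to_dict()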

# Populate eligibility: participant can see group if their last vote is after sentence timestamp
for u in users:
    last_time = last_vote_time[u]
    for g in groups:
        if pd.notna(last_time) and pd.notna(group_time_map[g]) and last_time >= group_time_map[g]:
            eligibility.loc[u, g] = True

# Build data for logistic regression: features and labels for eligible pairs
# Precompute the (participant, group) pairs with at least one vote for O(1) lookups
voted_pairs = set(zip(votes_df["participant_id"], votes_df["group_id"]))
rows = []
for u in users:
    for g in groups:
        if not eligibility.loc[u, g]:
            continue
        sent_time = group_time_map[g]
        lv_time = last_vote_time[u]
        sentence_age = (lv_time - sent_time).total_seconds() / (24 * 3600.0)
        active_days = participant_active_days[u]
        # outcome: 1 if user voted on this group
        voted = int((u, g) in voted_pairs)
        rows.append(
            {
                "participant_id": u,
                "group_id": g,
                "sentence_age": sentence_age,
                "active_days": active_days,
                "voted": voted,
            }
        )

exposure_df = pd.DataFrame(rows)

# Normalize features
X = exposure_df[["sentence_age", "active_days"]].fillna(0.0).values
y = exposure_df["voted"].values
mean = X.mean(axis=0)
std = X.std(axis=0) + 1e-12
X_norm = (X - mean) / std

# Initialise p_hat DataFrame
p_hat = pd.DataFrame(index=users, columns=groups, data=np.nan)

try:
    # Fit logistic regression
    clf = LogisticRegression(max_iter=200)
    clf.fit(X_norm, y)
    preds = clf.predict_proba(X_norm)[:, 1]
    exposure_df["probability"] = preds
    # Fill p_hat for eligible pairs
    for u, g, p in zip(
        exposure_df["participant_id"],
        exposure_df["group_id"],
        exposure_df["probability"],
    ):
        p_hat.loc[u, g] = p
    print("Exposure model fit with logistic regression.")
except Exception as e:
    # Fallback: ratio (#voters)/(#eligible) per group
    print("Logistic regression unavailable, using exposure ratio. Error:", e)
    n_voters = votes_df.groupby("group_id")["participant_id"].nunique()
    n_eligible = eligibility.sum(axis=0)
    for g in groups:
        prob = n_voters.get(g, 0) / max(n_eligible[g], 1)
        for u in users:
            if eligibility.loc[u, g]:
                p_hat.loc[u, g] = prob

print("Eligibility sample:")
print(eligibility.head())
print("\nPropensity sample:")
print(p_hat.head())


Exposure model fit with logistic regression.
Eligibility sample:
     0     1     2     3     4     5     6     7     8     9   ...     50  \
0  True  True  True  True  True  True  True  True  True  True  ...  False   
1  True  True  True  True  True  True  True  True  True  True  ...  False   
2  True  True  True  True  True  True  True  True  True  True  ...  False   
3  True  True  True  True  True  True  True  True  True  True  ...  False   
4  True  True  True  True  True  True  True  True  True  True  ...  False   

      51     52     53     54     55     56     57     58     59  
0  False  False  False  False  False  False  False  False  False  
1  False  False  False  False  False  False  False  False  False  
2  False  False  False  False  False  False  False  False  False  
3  False  False  False  False  False  False  False  False  False  
4  False  False  False  False  False  False  False  False  False  

[5 rows x 60 columns]

Propensity sample:
         0         1       

## 3. Vote Embedding via Signed Matrix Factorization with Propensity Weighting

Our next goal is to represent each sentence group by a low‑dimensional vector capturing how participants voted on it. We use a **signed matrix factorization** model with latent user and group factors (\(U\) and \(V\)) and biases (\(b_u\) and \(c_v\)). For each observed vote \(y_{ij}\) on group \(j\) by user \(i\), the score is

\[s_{ij} = U_i \cdot V_j + b_{u_i} + c_{v_j}.\]

We minimise a weighted loss:

* **Agree/Disagree (±1):** logistic loss \(\log(1+\exp(-y\, s))\).
* **Pass (0):** quadratic loss \(w_0 \cdot s^2\) pulling the score toward neutrality, with hyperparameter \(w_0\).
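
The two loss branches and their derivatives with respect to the score can be checked in isolation. The sketch below is a minimal illustration that leaves out the propensity weight; the function name `vote_loss_and_grad` is hypothetical, and the default `w0=0.3` simply mirrors the value used in the training cell.

```python
import numpy as np

def vote_loss_and_grad(s, y, w0=0.3):
    """Return (loss, dloss/ds) for score s and vote y in {-1, 0, +1}."""
    if y == 0:
        return w0 * s**2, 2.0 * w0 * s
    # log(1 + exp(-y*s)), computed stably via logaddexp.
    loss = np.logaddexp(0.0, -y * s)
    # d/ds log(1 + exp(-y*s)) = -y * sigmoid(-y*s) = -y / (1 + exp(y*s))
    grad = -y / (1.0 + np.exp(y * s))
    return loss, grad

# Check the analytic gradient against a central finite difference.
eps = 1e-6
for y in (1, -1, 0):
    s = 0.7
    _, grad = vote_loss_and_grad(s, y)
    numeric = (vote_loss_and_grad(s + eps, y)[0] - vote_loss_and_grad(s - eps, y)[0]) / (2 * eps)
    print(y, round(float(grad), 6), round(float(numeric), 6))
```

A finite-difference check like this is a cheap way to catch sign errors before running a hand-written SGD loop.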

Each term is weighted by the **inverse propensity** \(1/p_{ij}\) computed earlier to correct for selection bias. We also include \(\ell_2\) regularisation on the latent factors. Optimisation is performed with stochastic gradient descent (SGD).  
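
Putting the pieces together, the objective and the per-vote SGD step can be written out as follows. This is a reading of the description above rather than a formula taken from an external source; \(\Omega\) (the set of observed votes), \(\lambda\) (regularisation strength), \(\eta\) (learning rate) and \(\sigma\) (the logistic function) are notation introduced here.

\[\mathcal{L} = \sum_{(i,j)\in\Omega} \left[ \frac{1}{p_{ij}}\,\ell(y_{ij}, s_{ij}) + \lambda\left(\lVert U_i\rVert^2 + \lVert V_j\rVert^2\right) \right], \qquad \ell(y, s) = \begin{cases} \log(1+\exp(-y\,s)) & y = \pm 1 \\ w_0\, s^2 & y = 0 \end{cases}\]

With \(g_{ij} = \frac{1}{p_{ij}}\,\frac{\partial \ell}{\partial s}\), where \(\partial\ell/\partial s = -y\,\sigma(-y\,s)\) for \(y = \pm 1\) and \(2 w_0 s\) for \(y = 0\), one SGD step on a single vote is

\[U_i \leftarrow U_i - \eta\,(g_{ij} V_j + 2\lambda U_i), \quad V_j \leftarrow V_j - \eta\,(g_{ij} U_i + 2\lambda V_j), \quad b_{u_i} \leftarrow b_{u_i} - \eta\, g_{ij}, \quad c_{v_j} \leftarrow c_{v_j} - \eta\, g_{ij}.\]

Both factor gradients are evaluated before either factor is updated. Because the penalty sits inside the sum, a factor is regularised once for every observed vote it takes part in.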

The resulting item factors \(V\) serve as our vote‑based embeddings for each sentence group.


In [4]:
# Hyperparameters
k = 16  # latent dimensionality
lambda_reg = 1e-4  # regularisation strength
w0 = 0.3  # weight for neutral votes
epochs = 200
learning_rate = 0.05

# Map users and groups to indices
user_to_idx = {u: idx for idx, u in enumerate(users)}
group_to_idx = {g: idx for idx, g in enumerate(groups)}

# Build list of (u_idx, g_idx, y, p_hat)
train_data = []
for _, row in votes_df.iterrows():
    u_idx = user_to_idx[row["participant_id"]]
    g_idx = group_to_idx[row["group_id"]]
    y_val = row["vote"]
    p_val = p_hat.loc[row["participant_id"], row["group_id"]]
    # Skip votes without a usable propensity (ineligible pair or zero probability)
    if not np.isnan(p_val) and p_val > 0:
        train_data.append((u_idx, g_idx, y_val, p_val))

# Initialise latent factors and biases
num_users = len(users)
num_items = len(groups)
rng = np.random.default_rng(seed=42)
U = 0.1 * rng.standard_normal((num_users, k))
V = 0.1 * rng.standard_normal((num_items, k))
b_u = np.zeros(num_users)
c_v = np.zeros(num_items)


def sigmoid(x):
    # Numerically stable: 1 / (1 + exp(-x)) == exp(-log(1 + exp(-x)))
    return np.exp(-np.logaddexp(0.0, -x))


# Training loop
for epoch in range(epochs):
    rng.shuffle(train_data)  # use the seeded generator so runs are reproducible
    total_loss = 0.0
    for u_idx, g_idx, y_val, p_val in train_data:
        s = np.dot(U[u_idx], V[g_idx]) + b_u[u_idx] + c_v[g_idx]
        if y_val == 1 or y_val == -1:
            loss = np.logaddexp(0.0, -y_val * s)  # log(1 + exp(-y*s)) without overflow
            grad_s = -y_val * sigmoid(-y_val * s)
        else:
            loss = w0 * (s**2)
            grad_s = 2 * w0 * s
        # Inverse propensity weight
        loss *= 1.0 / p_val
        grad_s *= 1.0 / p_val
        total_loss += loss
        # Compute gradients with regularisation
        grad_u = grad_s * V[g_idx] + 2 * lambda_reg * U[u_idx]
        grad_v = grad_s * U[u_idx] + 2 * lambda_reg * V[g_idx]
        # Update parameters
        U[u_idx] -= learning_rate * grad_u
        V[g_idx] -= learning_rate * grad_v
        b_u[u_idx] -= learning_rate * grad_s
        c_v[g_idx] -= learning_rate * grad_s
    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1}/{epochs}, loss = {total_loss:.4f}")

vote_embeddings = V.copy()
print("First 5 vote embeddings:", vote_embeddings[:5])

Epoch 50/200, loss = 627.9315
Epoch 100/200, loss = 563.2507
Epoch 150/200, loss = 558.6116
Epoch 200/200, loss = 533.6448
First 5 vote embeddings: [[ 1.9783377  -1.11671204 -0.0860506  -0.00721439  0.92986988  4.19952215
  -3.05191696  0.27079637 -0.0730158  -0.49096503 -0.20508504 -0.35291271
  -2.16140411  0.35052903 -0.66082859 -1.14069716]
 [ 0.31238407 -0.02368577  1.87799615  1.63209758  0.01706437  0.28531172
  -2.26965972 -0.72445035 -0.85762453 -0.76144885  0.47072501 -1.32956924
   0.98490974 -0.51816394 -1.73413672 -1.24150323]
 [ 0.3129001   1.70313543  0.67811289  0.94326179  0.03002561  0.10028031
   0.54368785  0.8535412  -2.20133034 -0.1744629   0.70060969  0.33455487
   0.30804464  0.28529754  0.7193219  -0.82363672]
 [-0.2587769  -0.10141546 -2.45206763  0.65560158  0.19394021 -1.02930633
  -0.01423074  0.44706764 -0.1201821  -1.24550685  0.27081563 -2.99506991
  -1.29544618  1.24493907  1.37783327  1.00407781]
 [-1.48436691  0.46080736  0.04326834  0.38387733  0.40233809  0.63750902
   0.68518626  1.18925226 -0.38998604  0.63534448 -0.16568496  0.20342383


## 4. Text Embedding and Feature Fusion

While vote embeddings capture participants' opinions, we also want to incorporate **semantic information** from the sentences themselves. We obtain sentence embeddings using the same bi‑encoder as in the deduplication step (if available).  

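We standardise both vote and text embeddings to zero mean and unit variance and concatenate them. Normalising each fused vector to unit length makes squared Euclidean distance a monotone function of cosine distance (\(\lVert a - b\rVert^2 = 2(1 - \cos\theta)\)), so the Euclidean metric used for clustering ranks neighbours exactly as cosine distance would. We then cluster the fused representations using **HDBSCAN**, which determines the number of clusters automatically and labels low-density points as noise. If `hdbscan` is unavailable, or if it labels every point as noise, we fall back to `k`-means.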
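
The relationship between Euclidean and cosine distance on unit vectors can be checked numerically. The sketch below uses random stand-in matrices rather than the notebook's embeddings, and the helper `standardise` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
vote_block = rng.standard_normal((6, 4))   # stand-in for vote embeddings
text_block = rng.standard_normal((6, 10))  # stand-in for text embeddings

def standardise(x):
    # Zero mean and unit variance per column (epsilon avoids division by zero).
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-12)

fused = np.hstack([standardise(vote_block), standardise(text_block)])
fused /= np.linalg.norm(fused, axis=1, keepdims=True)  # unit-length rows

a, b = fused[0], fused[1]
sq_euclidean = float(np.sum((a - b) ** 2))
cosine_distance = 1.0 - float(a @ b)
print(sq_euclidean, 2 * cosine_distance)
```

One design point to keep in mind: after per-column standardisation every column carries unit variance, so the block with more columns contributes more to the distance. If the text embedding is much wider than the vote embedding, scaling each block (for example by one over the square root of its width) before concatenation is one way to balance them.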

After clustering, we compute centroids of each topic and apply agglomerative clustering to obtain a hierarchy of topics (super‑topics). The labels `topic` and `super_topic` are added to the `sentences_df` DataFrame.


In [5]:
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import StandardScaler, normalize

## Text embedding
text_embeddings = None
if "bi_encoder" in globals() and bi_encoder is not None:
    try:
        sentence_embeddings = bi_encoder.encode(
            sentences_df["text"].tolist(),
            convert_to_tensor=False,
            normalize_embeddings=True,
        )
        group_lookup = {g: idx for idx, g in enumerate(groups)}
        group_vectors = np.zeros((len(groups), sentence_embeddings.shape[1]), dtype=float)
        counts = np.zeros(len(groups), dtype=int)
        for emb, gid in zip(sentence_embeddings, sentences_df["group_id"].tolist()):
            idx = group_lookup.get(gid)
            if idx is None:
                continue
            group_vectors[idx] += emb
            counts[idx] += 1
        nonzero = counts > 0
        if np.any(nonzero):
            group_vectors[nonzero] /= counts[nonzero, None]
        text_embeddings = group_vectors if np.any(counts) else None
    except Exception as e:
        print("Error computing text embeddings:", e)

## Standardise vote and text embeddings
vote_scaled = None
if "vote_embeddings" in globals() and vote_embeddings is not None:
    vote_scaler = StandardScaler()
    vote_scaled = vote_scaler.fit_transform(vote_embeddings)

text_scaled = None
if text_embeddings is not None:
    text_scaler = StandardScaler()
    text_scaled = text_scaler.fit_transform(text_embeddings)

# Build feature matrix
if vote_scaled is not None and text_scaled is not None:
    features = np.hstack([vote_scaled, text_scaled])
elif vote_scaled is not None:
    features = vote_scaled
elif text_scaled is not None:
    features = text_scaled
else:
    raise ValueError("No features available for clustering.")

# Normalise to unit norm
features_norm = normalize(features)

## Clustering
cluster_labels = None
try:
    import hdbscan

    # min_cluster_size=15 is large for a few dozen groups; lower it if HDBSCAN returns only noise
    hdb = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=5, metric="euclidean")
    cluster_labels = hdb.fit_predict(features_norm)
    valid = cluster_labels[cluster_labels >= 0]
    if valid.size == 0:
        raise ValueError("HDBSCAN returned only noise; fallback to k-means.")
    print("HDBSCAN identified", len(np.unique(valid)), "clusters.")
except Exception as e:
    print("HDBSCAN unavailable or insufficient structure, falling back to k-means. Reason:", e)
    K = min(10, max(2, features_norm.shape[0] // 3))
    km = KMeans(n_clusters=K, random_state=42)
    cluster_labels = km.fit_predict(features_norm)
    print("k-means formed", len(np.unique(cluster_labels)), "clusters.")

group_cluster_map = {g: int(label) for g, label in zip(groups, cluster_labels)}
sentences_df["topic"] = sentences_df["group_id"].map(group_cluster_map)

# Compute centroids for each cluster (HDBSCAN marks noise as -1, which is not a topic)
clusters = np.unique(cluster_labels[cluster_labels >= 0])
centroids = np.array(
    [features_norm[cluster_labels == c].mean(axis=0) for c in clusters]
)

# Hierarchical clustering on centroids
agg = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.5, metric="euclidean", linkage="average"
)
hier_labels = agg.fit_predict(centroids)
hier_map = {int(c): int(h) for c, h in zip(clusters, hier_labels)}
sentences_df["super_topic"] = sentences_df["topic"].map(hier_map)

# Persist group-level feature matrix for downstream analysis
group_features_norm = features_norm
group_ids_ordered = list(groups)

sentences_df[["sentence_id", "group_id", "topic", "super_topic"]].head()


HDBSCAN unavailable or insufficient structure, falling back to k-means. Reason: HDBSCAN returned only noise; fallback to k-means.
k-means formed 10 clusters.




|  | sentence_id | group_id | topic | super_topic |
|---|---|---|---|---|
| 6 | 0 | 0 | 0 | 5 |
| 7 | 1 | 1 | 3 | 4 |
| 8 | 2 | 2 | 1 | 8 |
| 9 | 3 | 3 | 8 | 1 |
| 10 | 4 | 4 | 3 | 4 |


## 5. Topic Labeling and Social Statistics

To make the discovered topics interpretable, we extract keywords and representative sentences. We employ **class‑based TF‑IDF (c‑TF‑IDF)**: for each topic, we concatenate all sentences in that cluster into a single document and compute TF‑IDF scores over the vocabulary. The top‑scoring n‑grams are selected as keywords.
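
For intuition, here is a dependency-free sketch of one common formulation of c-TF-IDF: the term frequency within a topic, normalised by the topic's length, multiplied by `log(1 + A / f_t)`, where `A` is the average number of words per topic and `f_t` is the term's frequency across all topics. This weighting is an assumption for illustration and differs from what `TfidfVectorizer` computes on concatenated documents. The function name and toy sentences are hypothetical.

```python
import math
from collections import Counter

def ctfidf_keywords(docs_by_topic, top_k=3):
    """docs_by_topic maps topic id -> list of sentences; returns topic id -> top terms."""
    # One bag of words per topic (all sentences in the topic concatenated).
    tf = {t: Counter(" ".join(docs).lower().split()) for t, docs in docs_by_topic.items()}
    total_freq = Counter()
    for counts in tf.values():
        total_freq.update(counts)
    avg_words = sum(total_freq.values()) / len(tf)  # average words per topic

    keywords = {}
    for t, counts in tf.items():
        n_words = sum(counts.values())
        scores = {
            term: (count / n_words) * math.log(1 + avg_words / total_freq[term])
            for term, count in counts.items()
        }
        keywords[t] = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return keywords

toy = {
    0: ["poster sessions are useful", "more poster sessions please"],
    1: ["visa rules exclude attendees", "visa costs are high"],
}
toy_keywords = ctfidf_keywords(toy, top_k=2)
print(toy_keywords)
```

Terms that are frequent inside one topic but rare elsewhere score highest, while a word like "are" that appears in both topics is pushed down.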

We also present a few sentences closest to the cluster centroid in the fused embedding space.
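
Selecting representatives reduces to a nearest-neighbour lookup around each centroid. The sketch below shows the idea on toy 2-D features; the function name `nearest_to_centroid` and the data are illustrative assumptions.

```python
import numpy as np

def nearest_to_centroid(features, labels, cluster, top_n=3):
    """Return indices of the top_n rows of `features` in `cluster` closest to its centroid."""
    members = np.flatnonzero(labels == cluster)
    centroid = features[members].mean(axis=0)
    dists = np.linalg.norm(features[members] - centroid, axis=1)
    return members[np.argsort(dists)[:top_n]]

# Toy fused features: cluster 0 has three points, cluster 1 has one.
toy_features = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [10.0, 10.0]])
toy_labels = np.array([0, 0, 0, 1])
nearest = nearest_to_centroid(toy_features, toy_labels, cluster=0, top_n=2)
print(nearest)
```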

In addition, we summarise the voting behaviour within each topic using the following statistics:

* **Coverage:** the proportion of eligible participants who actually voted on sentences in this topic.
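* **Agree / Disagree / Pass:** the fraction of voting participants whose topic-level vote was +1, -1 or 0, respectively. Each participant counts once per topic: if they voted on several sentence groups in the topic, agree takes precedence over disagree, and disagree over pass.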
* **Polarity:** the mean vote value (agree = +1, disagree = -1, pass = 0).
* **Controversy:** the entropy (base 3) of the agree/pass/disagree distribution; higher values indicate more mixed opinions.

These metrics help identify topics with strong consensus versus contentious topics.
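
Polarity and controversy are easy to sanity-check on extreme cases. The sketch below computes both from raw counts; the function name is hypothetical, and unlike the notebook cell it skips zero-probability terms instead of adding a small epsilon.

```python
import math

def polarity_and_controversy(n_agree, n_disagree, n_pass):
    """Return (polarity, controversy) from vote counts; controversy is base-3 entropy."""
    total = n_agree + n_disagree + n_pass
    if total == 0:
        return float("nan"), float("nan")
    polarity = (n_agree - n_disagree) / total
    probs = [n / total for n in (n_agree, n_pass, n_disagree)]
    # Terms with p == 0 contribute nothing to the entropy.
    entropy = sum(-p * math.log(p, 3) for p in probs if p > 0)
    return polarity, entropy

print(polarity_and_controversy(10, 0, 0))  # unanimous agreement
print(polarity_and_controversy(5, 5, 5))   # evenly split three ways
```

Unanimous agreement gives polarity 1 and controversy 0, while an even three-way split gives polarity 0 and controversy close to 1. A topic split evenly between agree and disagree with no passes would also have polarity 0 but a lower controversy (about 0.63), so the two numbers are best read together.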


In [6]:
import math

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def compute_ctfidf(texts, labels, ngram_range=(1, 3), top_k=10):
    unique_labels = np.unique(labels)
    docs = [' '.join(np.array(texts)[labels == label]) for label in unique_labels]
    vectorizer = TfidfVectorizer(ngram_range=ngram_range, stop_words="english").fit(
        docs
    )
    tfidf_matrix = vectorizer.transform(docs)
    feature_names = np.array(vectorizer.get_feature_names_out())
    keywords = {}
    for i, label in enumerate(unique_labels):
        row = tfidf_matrix[i].toarray().flatten()
        idx = row.argsort()[::-1][:top_k]
        keywords[label] = feature_names[idx].tolist()
    return keywords


# Keywords per topic
keywords_per_topic = compute_ctfidf(
    sentences_df["text"].values,
    sentences_df["topic"].values,
    ngram_range=(1, 3),
    top_k=10,
)

# Representative sentences: nearest group centroids
rep_sentences = {}
if "group_ids_ordered" in globals():
    group_ids_array = np.array(group_ids_ordered)
    for c in clusters:
        group_mask = np.where(cluster_labels == c)[0]
        if group_mask.size == 0:
            continue
        centroid = centroids[np.where(clusters == c)[0][0]]
        dists = np.linalg.norm(group_features_norm[group_mask] - centroid, axis=1)
        nearest_group_indices = group_mask[np.argsort(dists)[:3]]
        nearest_group_ids = group_ids_array[nearest_group_indices]
        texts = (
            sentences_df[sentences_df["group_id"].isin(nearest_group_ids)]
            .sort_values("timestamp")["text"]
            .tolist()
        )
        rep_sentences[c] = texts[:3]
else:
    rep_sentences = {c: [] for c in clusters}


def compute_topic_stats(topic_id):
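    # Select by boolean mask: the index holds labels (not positions) after sorting, so .iloc would pick the wrong rows
    in_topic = sentences_df["topic"] == topic_id
    groups_in_topic = sentences_df.loc[in_topic, "group_id"].unique()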
    elig_users = 0
    voted_users = 0
    agree_cnt = 0
    disagree_cnt = 0
    pass_cnt = 0
    for u in users:
        eligible_any = False
        votes_for_topic = []
        for g in groups_in_topic:
            if eligibility.loc[u, g]:
                eligible_any = True
                rows = votes_df[
                    (votes_df["participant_id"] == u) & (votes_df["group_id"] == g)
                ]
                if not rows.empty:
                    votes_for_topic.append(rows.iloc[0]["vote"])
        if eligible_any:
            elig_users += 1
            if votes_for_topic:
                voted_users += 1
                # pick strongest vote for this topic
                v = sorted(votes_for_topic, key=lambda x: (abs(x), x), reverse=True)[0]
                if v == 1:
                    agree_cnt += 1
                elif v == -1:
                    disagree_cnt += 1
                else:
                    pass_cnt += 1
    coverage = voted_users / max(elig_users, 1)
    total = agree_cnt + disagree_cnt + pass_cnt
    if total > 0:
        agree_pct = agree_cnt / total
        disagree_pct = disagree_cnt / total
        pass_pct = pass_cnt / total
        polarity = (agree_cnt - disagree_cnt) / total
        probs = np.array([agree_pct, pass_pct, disagree_pct])
        entropy = -np.sum(probs * np.log(probs + 1e-12)) / np.log(3)
    else:
        agree_pct = disagree_pct = pass_pct = polarity = entropy = np.nan
    return {
        "coverage": coverage,
        "agree_pct": agree_pct,
        "disagree_pct": disagree_pct,
        "pass_pct": pass_pct,
        "polarity": polarity,
        "controversy": entropy,
    }


summary_rows = []
topic_stats = {c: compute_topic_stats(c) for c in clusters}

# Display summary per topic
for c in clusters:
    print(f"Topic {c}")
    print("  Keywords:", ", ".join(keywords_per_topic.get(c, [])))
    print("  Representative sentences:")
    for s in rep_sentences.get(c, []):
        print("   -", s)
    stats = topic_stats[c]
    coverage_txt = f"{stats['coverage']:.2%}"
    agree_txt = "nan" if np.isnan(stats["agree_pct"]) else f"{stats['agree_pct']:.1%}"
    disagree_txt = "nan" if np.isnan(stats["disagree_pct"]) else f"{stats['disagree_pct']:.1%}"
    pass_txt = "nan" if np.isnan(stats["pass_pct"]) else f"{stats['pass_pct']:.1%}"
    polarity_txt = "nan" if np.isnan(stats["polarity"]) else f"{stats['polarity']:.2f}"
    controversy_txt = "nan" if np.isnan(stats["controversy"]) else f"{stats['controversy']:.2f}"
    print(
        f"  Coverage: {coverage_txt}, Agree: {agree_txt}, Disagree: {disagree_txt}, Pass: {pass_txt}, Polarity: {polarity_txt}, Controversy: {controversy_txt}"
    )
    summary_rows.append(
        {
            "topic": int(c),
            "keywords": keywords_per_topic.get(c, []),
            "representatives": rep_sentences.get(c, []),
            "coverage": stats["coverage"],
            "agree_pct": stats["agree_pct"],
            "disagree_pct": stats["disagree_pct"],
            "pass_pct": stats["pass_pct"],
            "polarity": stats["polarity"],
            "controversy": stats["controversy"],
        }
    )

summary_df = pd.DataFrame(summary_rows).sort_values("topic").reset_index(drop=True)
summary_df


Topic 0
  Keywords: values, doing, facct doing, facct, values facct, values facct doing, community, expression values, respectful exchange interdisciplinary, respectful exchange
  Representative sentences:
   - FAccT is building a community that shares similar values
   - FAccT is doing well in sticking to their values
   - FAccT is doing well in selecting diverse locations that are fun and culturally enriched.
  Coverage: 88.28%, Agree: 92.0%, Disagree: 1.8%, Pass: 6.2%, Polarity: 0.90, Controversy: 0.29
Topic 1
  Keywords: workshops, industry, facct, years, panels workshops, panels, 24, ve, ve facct, doing
  Representative sentences:
   - 3 times I've been to FAccT (18, 23, 24) tech industry influence (e.g. who runs workshops) has been noticeable & sometimes problematic.
   - FAccT could require panels & workshops to be designed/led by representatives of that year's region. '24 had decent regional representation.
   - In 3 years I've been to FAccT it had plenty of industry input thro

|  | topic | keywords | representatives | coverage | agree_pct | disagree_pct | pass_pct | polarity | controversy |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | [values, doing, facct doing, facct, values fac... | [FAccT is building a community that shares sim... | 0.882812 | 0.920354 | 0.017699 | 0.061947 | 0.902655 | 0.291361 |
| 1 | 1 | [workshops, industry, facct, years, panels wor... | [3 times I've been to FAccT (18, 23, 24) tech ... | 0.859375 | 0.672727 | 0.181818 | 0.145455 | 0.490909 | 0.780124 |
| 2 | 2 | [facct, volunteer, volunteer run, academics fa... | [FAccT should be free for phd students, FAccT ... | 0.914062 | 0.811966 | 0.094017 | 0.094017 | 0.717949 | 0.55861 |
| 3 | 3 | [visa, facct, local, locations, attendees, mee... | [FAccT should NOT be repeatedly in visa-requir... | 0.921875 | 0.864407 | 0.025424 | 0.110169 | 0.838983 | 0.420819 |
| 4 | 4 | [faact, conference, future collaborators, usef... | [Faact is mostly focused on a single disciplin... | 0.84375 | 0.833333 | 0.027778 | 0.138889 | 0.805556 | 0.478472 |
| 5 | 5 | [room, facct, room nap, room facct exclusionar... | [FAccT should NOT be exclusionary, FAccT shoul... | 0.84375 | 0.851852 | 0.027778 | 0.12037 | 0.824074 | 0.446906 |
| 6 | 6 | [conference, participatory, design, creating, ... | [FAccT is not doing well in providing tangible... | 0.859375 | 0.845455 | 0.054545 | 0.1 | 0.790909 | 0.483202 |
| 7 | 7 | [narratives, western, attention, topics, facct... | [FAccT should NOT be centred around western na... | 0.796875 | 0.705882 | 0.107843 | 0.186275 | 0.598039 | 0.727353 |
| 8 | 8 | [sessions, poster, poster sessions, sessions f... | [FAccT should have poster sessions, the curren... | 0.875 | 0.830357 | 0.080357 | 0.089286 | 0.75 | 0.521268 |
| 9 | 9 | [facct, waste facct, waste, plastic, plastic w... | [FAccT should be vegetarian, FAccT should NOT ... | 0.859375 | 0.827273 | 0.081818 | 0.090909 | 0.745455 | 0.527639 |


## Conclusion

This notebook demonstrated an end‑to‑end approach to analysing single‑sentence submissions with crowd‑sourced votes. We handled semantic redundancy via paraphrase mining, estimated exposure probabilities to correct for non‑uniform visibility, learned vote‑driven embeddings through signed matrix factorisation weighted by inverse propensities, fused them with semantic sentence embeddings, clustered the fused vectors to find topics, and labelled those topics with keywords and social statistics.

The techniques showcased here are modular: you can swap out the embedding models, use alternative exposure models or matrix factorisation algorithms, or experiment with different clustering methods. The general principle remains: **model exposure**, **learn meaningful representations**, **cluster to discover structure**, and **provide interpretable summaries**.
