# Validation Setup for ConDynS

This notebook demonstrates the validation procedure for ConDynS, our similarity measure for comparing conversational dynamics, introduced in the paper [A Similarity Measure for Comparing Conversational Dynamics](https://arxiv.org/abs/2507.18956). It constructs anchor–positive–negative triplets of conversations from Reddit, where positives share similar dynamics with the anchor and negatives differ, and evaluates how well ConDynS distinguishes them relative to baseline similarity measures (e.g., SBERT cosine similarity, BERTScore), as demonstrated in the other demo notebook. This notebook follows the methodology described in the paper.

In [None]:
import json
from convokit import Corpus, download
from tqdm import tqdm
import scipy.stats as stats
import random
random.seed(4300)

In [None]:
corpus = Corpus(filename=download("conversations-gone-awry-cmv-corpus"))
corpus.print_summary_stats()

In [None]:
### Get the human and machine summary ids ###
human_summary_ids = corpus.get_conversation_ids(selector=lambda conversation: conversation.meta["summary_meta"] != []
and any(summary_meta["summary_type"] == "human_written_SCD" for summary_meta in conversation.meta["summary_meta"]))
machine_summary_ids = corpus.get_conversation_ids(selector=lambda conversation: conversation.meta["summary_meta"] != []
               and any(summary_meta["summary_type"] == "machine_generated_SCD" for summary_meta in conversation.meta["summary_meta"]))
pair_of = {}
for convo_id in human_summary_ids:
    convo = corpus.get_conversation(convo_id)
    pair_of[convo.id] = convo.meta['pair_id']

In [None]:
### Get pair info ###
human_summary_pair = []  # (calm, awry)
for convo_id in human_summary_ids:
    convo = corpus.get_conversation(convo_id)
    if convo.meta['has_removed_comment']:
        # This conversation went awry, so its paired conversation is the calm one.
        pair = (convo.meta['pair_id'], convo.id)
    else:
        # This conversation stayed calm, so its paired conversation is the awry one.
        pair = (convo.id, convo.meta['pair_id'])
    if pair not in human_summary_pair:
        human_summary_pair.append(pair)
print("Number of conversation pairs:", len(human_summary_pair))

In [None]:
ARTEFACTS_DIR = "./artefacts/"

# ConDynS Validation

Here we compute ConDynS on a subset of Reddit conversations with constructed triplets to validate the measure's usefulness in capturing and comparing conversational dynamics (discussed in detail in Section 5 of the paper). The sections below walk through the steps of the validation setup.

## Simulating Conversations

To construct the triplets used for validating ConDynS (see Section 5 of the paper), we simulate synthetic conversations from human-written SCDs provided in the ConvoKit corpus. These SCDs abstract away surface content while preserving conversational dynamics. By generating conversations from these summaries, we can also assign new topics—allowing us to test whether ConDynS remains sensitive to dynamics while being invariant to topical changes.
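
As a rough sketch of how the triplets are laid out, the snippet below builds one triplet per conversation from a `{convo_id: paired_convo_id}` map shaped like `pair_of`. It assumes, based on how the scoring cells later in this notebook index their inputs, that the anchor is the original conversation, the positive is the simulation generated from the anchor's own SCD, and the negative is the simulation generated from the paired conversation's SCD. The `Triplet` type, `build_triplets` helper, `original:` / `simulated:` prefixes, and the ids are hypothetical and used only for illustration.

```python
from typing import Dict, List, NamedTuple


class Triplet(NamedTuple):
    anchor: str    # original conversation
    positive: str  # simulation generated from the anchor's own SCD
    negative: str  # simulation generated from the paired conversation's SCD


def build_triplets(pair_of: Dict[str, str]) -> List[Triplet]:
    """Build one triplet per conversation from a {convo_id: paired_convo_id} map."""
    triplets = []
    for convo_id, paired_id in pair_of.items():
        triplets.append(
            Triplet(
                anchor=f"original:{convo_id}",
                positive=f"simulated:{convo_id}",
                negative=f"simulated:{paired_id}",
            )
        )
    return triplets


# Hypothetical ids, for illustration only.
toy_pair_of = {"calm_1": "awry_1", "awry_1": "calm_1"}
for triplet in build_triplets(toy_pair_of):
    print(triplet)
```

Each conversation serves as an anchor once, so a calm/awry pair yields two triplets.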

In [None]:
from convokit.convo_similarity.utils import format_transcript_from_convokit, get_human_summary
from convokit.genai import get_llm_client
from convokit.genai.genai_config import GenAIConfigManager

config = GenAIConfigManager()  # set up your own config first if you have not done so before
client = get_llm_client("gpt", config)

def gpt_query(prompt, **kwargs):
    response = client.generate(prompt, **kwargs)
    return response.text

In [None]:
### Extract topic of the conversations ###
topic_msg = """Here are two conversations of the same topic. Summarize the topic of the conversations in a concise phrase that accurately captures the main subject being discussed.
Here is the transcript of the first conversation:
{transcript1}

Here is the transcript of the second conversation:
{transcript2}

Now, write the topic of the conversation in a concise phrase:
"""
topic = {}
for calm_convo_id, awry_convo_id in tqdm(human_summary_pair):
    calm_transcript = format_transcript_from_convokit(corpus, calm_convo_id)
    awry_transcript = format_transcript_from_convokit(corpus, awry_convo_id)
    query = topic_msg.format(transcript1 = '\n'.join(calm_transcript), transcript2 = '\n'.join(awry_transcript))
    response = gpt_query(query)
    topic[calm_convo_id] = response
    topic[awry_convo_id] = response


In [None]:
### Simulate transcript ###
simulation_msg = """You are given a task to recreate an online conversation that occurred on Reddit. Here is a list of information you are given.
1. Topic of the conversation: {topic}
2. Trajectory summary that summarizes the conversational and speakers' dynamics: {trajectory_summary}

Each utterance of the transcript should be formatted as the following:
Speaker_ID (e.g. "SPEAKER2") : [Added text of the utterance]


#Output
Add your recreated conversation. Only generate the transcript of the conversation. 
"""
generated_transcripts = {}
for calm_convo_id, awry_convo_id in tqdm(human_summary_pair):
    calm_human_summary = get_human_summary(corpus, calm_convo_id)
    awry_human_summary = get_human_summary(corpus, awry_convo_id)
    calm_query = simulation_msg.format(topic=topic[calm_convo_id],trajectory_summary=calm_human_summary['summary_text'])
    calm_response = gpt_query(calm_query)
    generated_transcripts[calm_convo_id] = calm_response
    awry_query = simulation_msg.format(topic=topic[awry_convo_id],trajectory_summary=awry_human_summary['summary_text'])
    awry_response = gpt_query(awry_query)
    generated_transcripts[awry_convo_id] = awry_response

output = {}
for convo_id in generated_transcripts:
    output[convo_id] = {
        'generated_transcript': generated_transcripts[convo_id],  # key read back by the later cells
        'topic': topic[convo_id]
    }
with open(ARTEFACTS_DIR + "validation_gpt/transcript_simulations.json", "w") as f:
    json.dump(output, f, indent=4)

In [None]:
### Topic shuffle transcript simulation ###
# One topic per pair; both conversations in a pair share the same extracted topic.
topic_set = [topic[calm_convo_id] for calm_convo_id, _ in human_summary_pair]

new_topic = {}
for calm_convo_id, awry_convo_id in human_summary_pair:
    # Exclude the pair's own topic so the reassigned topic always differs.
    candidates = [t for t in topic_set if t != topic[calm_convo_id]]
    new_topic[calm_convo_id] = random.choice(candidates)
    new_topic[awry_convo_id] = random.choice(candidates)
for convo_id in new_topic:
    assert new_topic[convo_id] != topic[convo_id]
assert len(new_topic) == len(generated_transcripts)

generated_transcripts_topic_shuffled = {}
for calm_convo_id, awry_convo_id in tqdm(human_summary_pair):
    calm_human_summary = get_human_summary(corpus, calm_convo_id)
    awry_human_summary = get_human_summary(corpus, awry_convo_id)
    calm_query = simulation_msg.format(topic=new_topic[calm_convo_id],trajectory_summary=calm_human_summary['summary_text']) #Adding new topic 
    calm_response = gpt_query(calm_query)
    generated_transcripts_topic_shuffled[calm_convo_id] = calm_response
    awry_query = simulation_msg.format(topic=new_topic[awry_convo_id],trajectory_summary=awry_human_summary['summary_text']) #Adding new topic
    awry_response = gpt_query(awry_query)
    generated_transcripts_topic_shuffled[awry_convo_id] = awry_response

output = {}
for convo_id in generated_transcripts_topic_shuffled:
    output[convo_id] = {
        'generated_transcript': generated_transcripts_topic_shuffled[convo_id],
        'topic': new_topic[convo_id]
    }
with open(ARTEFACTS_DIR + "validation_gpt/transcript_simulations_topic_shuffled.json", "w") as f:
    json.dump(output, f, indent=4)

## Writing SCDs and SoPs

Now we generate the Summaries of Conversational Dynamics (SCDs) and extract their corresponding Sequences of Patterns (SoPs), which are required inputs for computing the ConDynS score. The SCDs provide high-level abstractions of conversational flow, while the SoPs capture the ordered interaction patterns needed for alignment. These representations are prepared for both real and simulated conversations to ensure consistency during the validation procedure.

In [None]:
from convokit.convo_similarity.summary import SCDWriter
scd_writer_gpt = SCDWriter(model_provider="gpt")

In [None]:
scd = {}
bulletpoints = {}
for convo_id in tqdm(pair_of):
    summary = scd_writer_gpt.get_scd_summary("\n\n".join(format_transcript_from_convokit(corpus, convo_id)))
    scd[convo_id] = summary
    bulletpoints[convo_id] = scd_writer_gpt.get_sop_from_summary(summary)

with open(ARTEFACTS_DIR + f"validation_gpt/scd_og.json", 'w') as f:
    json.dump(scd, f, indent=4)

with open(ARTEFACTS_DIR + f"validation_gpt/sop_og.json", 'w') as f:
    json.dump(bulletpoints, f, indent=4)

In [None]:
with open(ARTEFACTS_DIR + "validation_gpt/transcript_simulations.json", "r") as f:
    simulated_transcripts = json.load(f)

scd = {}
bulletpoints = {}
for convo_id in tqdm(pair_of):
    summary = scd_writer_gpt.get_scd_summary(simulated_transcripts[convo_id]['generated_transcript'])
    scd[convo_id] = summary
    bulletpoints[convo_id] = scd_writer_gpt.get_sop_from_summary(summary)

with open(ARTEFACTS_DIR + f"validation_gpt/scd_sim.json", 'w') as f:
    json.dump(scd, f, indent=4)

with open(ARTEFACTS_DIR + f"validation_gpt/sop_sim.json", 'w') as f:
    json.dump(bulletpoints, f, indent=4)

In [None]:
with open(ARTEFACTS_DIR + "validation_gpt/transcript_simulations_topic_shuffled.json", "r") as f:
    simulated_transcripts_topic_shuffled = json.load(f)

scd = {}
bulletpoints = {}
for convo_id in tqdm(pair_of):
    summary = scd_writer_gpt.get_scd_summary(simulated_transcripts_topic_shuffled[convo_id]['generated_transcript'])
    scd[convo_id] = summary
    bulletpoints[convo_id] = scd_writer_gpt.get_sop_from_summary(summary)

with open(ARTEFACTS_DIR + f"validation_gpt/scd_sim_topic_shuffled.json", 'w') as f:
    json.dump(scd, f, indent=4)

with open(ARTEFACTS_DIR + f"validation_gpt/sop_sim_topic_shuffled.json", 'w') as f:
    json.dump(bulletpoints, f, indent=4)

## Compute ConDynS Score

We are now ready to compute the ConDynS scores between conversation pairs. Using the SoP from one conversation and the transcript of the other, we apply the alignment procedure described in the paper to quantify how similar their dynamics are.

To validate ConDynS, we compare its scores within each triplet (anchor, positive, negative; see Section 5 of the paper). The measure is expected to assign a higher similarity score to the anchor–positive pair (which shares dynamics) than to the anchor–negative pair (which differs in dynamics). Accuracy is the proportion of triplets for which this condition holds. As reported in Table 1 of the paper, ConDynS achieves substantially higher accuracy than baseline methods across same-topic, different-topic, and adversarial-topic conditions.
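
The accuracy criterion can be illustrated with a small, self-contained sketch. The `triplet_accuracy` helper and the scores below are made up for illustration and are not results from the paper. The sketch assumes a strict comparison, so a tie counts as a miss, which matches the `score1 > score2` check used in the evaluation cell of this notebook.

```python
from typing import Sequence


def triplet_accuracy(positive_scores: Sequence[float], negative_scores: Sequence[float]) -> float:
    """Fraction of triplets whose anchor-positive score strictly exceeds the anchor-negative score."""
    if len(positive_scores) != len(negative_scores):
        raise ValueError("score lists must be aligned and of equal length")
    if not positive_scores:
        raise ValueError("at least one triplet is required")
    wins = sum(pos > neg for pos, neg in zip(positive_scores, negative_scores))
    return wins / len(positive_scores)


# Made-up scores for four triplets (not results from the paper).
toy_positive = [0.9, 0.7, 0.4, 0.8]
toy_negative = [0.2, 0.7, 0.6, 0.1]
print(triplet_accuracy(toy_positive, toy_negative))  # 0.5
```

Only the first and last triplets satisfy the condition here: the second is a tie and the third ranks the negative higher.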

In [None]:
from convokit.convo_similarity.condyns import ConDynS
condyns_gpt = ConDynS(model_provider="gpt", config=config)

In [None]:
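def evaluate_condyns_results(self_scores, pair_scores):
    """Print triplet accuracy and a Wilcoxon signed-rank test on the paired scores."""
    # A triplet counts as correct when the anchor-positive (self) score
    # strictly exceeds the anchor-negative (pair) score.
    performance = [score1 > score2 for score1, score2 in zip(self_scores, pair_scores)]
    print("Accuracy:", sum(performance) / len(performance), f"for {len(performance)} pairs")
    print(stats.wilcoxon(self_scores, pair_scores))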

In [None]:
### Load SCDs and SoPs ###
with open(ARTEFACTS_DIR + "validation_gpt/scd_og.json", "r") as f:
    scd_og = json.load(f)
with open(ARTEFACTS_DIR + "validation_gpt/sop_og.json", "r") as f:
    sop_og = json.load(f)
with open(ARTEFACTS_DIR + "validation_gpt/scd_sim.json", "r") as f:
    scd_sim = json.load(f)
with open(ARTEFACTS_DIR + "validation_gpt/sop_sim.json", "r") as f:
    sop_sim = json.load(f)
with open(ARTEFACTS_DIR + "validation_gpt/scd_sim_topic_shuffled.json", "r") as f:
    scd_sim_topic_shuffled = json.load(f)
with open(ARTEFACTS_DIR + "validation_gpt/sop_sim_topic_shuffled.json", "r") as f:
    sop_sim_topic_shuffled = json.load(f)

In [None]:
### Compute ConDynS with simulated transcripts ###
with open(ARTEFACTS_DIR + "validation_gpt/transcript_simulations.json", "r") as f:
    simulated_transcripts = json.load(f)

self_scores = []
self_results = {}

for convo_id in tqdm(pair_of):
    transcript1 = "\n\n".join(format_transcript_from_convokit(corpus, convo_id))
    transcript2 = simulated_transcripts[convo_id]['generated_transcript']
    sop1 = sop_og[convo_id]
    sop2 = sop_sim[convo_id]
    results = condyns_gpt.compute_bidirectional_similarity(transcript1, transcript2, sop1, sop2)
    self_results[convo_id] = results
    self_scores.append(condyns_gpt.compute_score_from_results(results))

pair_scores = []
pair_results = {}
for convo_id in tqdm(pair_of):
    transcript1 = "\n\n".join(format_transcript_from_convokit(corpus, convo_id))
    transcript2 = simulated_transcripts[pair_of[convo_id]]['generated_transcript']
    sop1 = sop_og[convo_id]
    sop2 = sop_sim[pair_of[convo_id]]
    results = condyns_gpt.compute_bidirectional_similarity(transcript1, transcript2, sop1, sop2)
    pair_results[convo_id] = results
    pair_scores.append(condyns_gpt.compute_score_from_results(results))

output = {"self" : self_results, "pair" : pair_results}
with open(ARTEFACTS_DIR + f"validation_gpt/condyns_og-sim.json", 'w') as f:
    json.dump(output, f, indent=4)

evaluate_condyns_results(self_scores, pair_scores)

In [None]:
### Compute ConDynS with topic shuffled simulated transcripts ###
with open(ARTEFACTS_DIR + "validation_gpt/transcript_simulations_topic_shuffled.json", "r") as f:
    simulated_transcripts_topic_shuffled = json.load(f)

self_scores = []
self_results = {}

for convo_id in tqdm(pair_of):
    transcript1 = "\n\n".join(format_transcript_from_convokit(corpus, convo_id))
    transcript2 = simulated_transcripts_topic_shuffled[convo_id]['generated_transcript']
    sop1 = sop_og[convo_id]
    sop2 = sop_sim_topic_shuffled[convo_id]
    results = condyns_gpt.compute_bidirectional_similarity(transcript1, transcript2, sop1, sop2)
    self_results[convo_id] = results
    self_scores.append(condyns_gpt.compute_score_from_results(results))

pair_scores = []
pair_results = {}
for convo_id in tqdm(pair_of):
    transcript1 = "\n\n".join(format_transcript_from_convokit(corpus, convo_id))
    transcript2 = simulated_transcripts_topic_shuffled[pair_of[convo_id]]['generated_transcript']
    sop1 = sop_og[convo_id]
    sop2 = sop_sim_topic_shuffled[pair_of[convo_id]]
    results = condyns_gpt.compute_bidirectional_similarity(transcript1, transcript2, sop1, sop2)
    pair_results[convo_id] = results
    pair_scores.append(condyns_gpt.compute_score_from_results(results))

output = {"self" : self_results, "pair" : pair_results}
with open(ARTEFACTS_DIR + f"validation_gpt/condyns_og-sim_topic_shuffled.json", 'w') as f:
    json.dump(output, f, indent=4)

evaluate_condyns_results(self_scores, pair_scores)

In [None]:
### Compute ConDynS with Adversarial simulated transcripts ###
with open(ARTEFACTS_DIR + f"validation_gpt/condyns_og-sim.json", 'r') as f:
    sim_results = json.load(f)

with open(ARTEFACTS_DIR + f"validation_gpt/condyns_og-sim_topic_shuffled.json", 'r') as f:
    topic_shuffle_results = json.load(f)

self_results = topic_shuffle_results['self']
pair_results = sim_results['pair']

self_scores = []
for convo_id in self_results:
    results = self_results[convo_id]
    self_scores.append(condyns_gpt.compute_score_from_results(results))

pair_scores = []
for convo_id in pair_results:
    results = pair_results[convo_id]
    pair_scores.append(condyns_gpt.compute_score_from_results(results))

evaluate_condyns_results(self_scores, pair_scores)
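
The adversarial condition above takes the positive scores from the topic-shuffled run (same dynamics, different topic) and the negative scores from the same-topic run (different dynamics, same topic). Because the two lists come from different files, it can help to align them by conversation id rather than rely on dictionary order. Below is a minimal sketch of that alignment, assuming each result has already been reduced to one float per conversation id. The `align_scores_by_id` helper, the ids, and the scores are hypothetical.

```python
from typing import Dict, List, Tuple


def align_scores_by_id(
    positive_by_id: Dict[str, float], negative_by_id: Dict[str, float]
) -> Tuple[List[float], List[float]]:
    """Return two score lists ordered by the conversation ids present in both dicts."""
    shared_ids = sorted(positive_by_id.keys() & negative_by_id.keys())
    positives = [positive_by_id[convo_id] for convo_id in shared_ids]
    negatives = [negative_by_id[convo_id] for convo_id in shared_ids]
    return positives, negatives


# Made-up scores keyed by hypothetical conversation ids.
topic_shuffled_self = {"c2": 0.6, "c1": 0.8}  # same dynamics, different topic
same_topic_pair = {"c1": 0.3, "c2": 0.7}      # different dynamics, same topic
positives, negatives = align_scores_by_id(topic_shuffled_self, same_topic_pair)
print(positives, negatives)
```

The aligned lists can then be compared position by position, so each positive score is matched against the negative score for the same anchor.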

## ConDynS SoP to SoP Alignment

Here we also include the ConDynS variant with SoP-to-SoP alignment presented in the paper, where both conversations are represented by their pattern sequences. This keeps order information but can miss overlapping patterns. It serves as a comparison that highlights the benefit of combining SoP precision with transcript recall in ConDynS.

In the code below, this variant is called Naive ConDynS because it is more "naive" than the full ConDynS above.

In [None]:
from convokit.convo_similarity.naive_condyns import NaiveConDynS
naive_condyns_gpt = NaiveConDynS(model_provider="gpt", config=config)

In [None]:
### Compute NaiveConDynS with simulated transcripts ###
with open(ARTEFACTS_DIR + "validation_gpt/transcript_simulations.json", "r") as f:
    simulated_transcripts = json.load(f)

self_scores = []
self_results = {}

for convo_id in tqdm(pair_of):
    transcript1 = "\n\n".join(format_transcript_from_convokit(corpus, convo_id))
    transcript2 = simulated_transcripts[convo_id]['generated_transcript']
    sop1 = sop_og[convo_id]
    sop2 = sop_sim[convo_id]
    results = naive_condyns_gpt.compute_bidirectional_naive_condyns(transcript1, transcript2, sop1, sop2)
    self_results[convo_id] = results
    self_scores.append(naive_condyns_gpt.compute_score_from_results(results))

pair_scores = []
pair_results = {}
for convo_id in tqdm(pair_of):
    transcript1 = "\n\n".join(format_transcript_from_convokit(corpus, convo_id))
    transcript2 = simulated_transcripts[pair_of[convo_id]]['generated_transcript']
    sop1 = sop_og[convo_id]
    sop2 = sop_sim[pair_of[convo_id]]
    results = naive_condyns_gpt.compute_bidirectional_naive_condyns(transcript1, transcript2, sop1, sop2)
    pair_results[convo_id] = results
    pair_scores.append(naive_condyns_gpt.compute_score_from_results(results))

output = {"self" : self_results, "pair" : pair_results}
with open(ARTEFACTS_DIR + f"validation_gpt/naive_condyns_og-sim.json", 'w') as f:
    json.dump(output, f, indent=4)

evaluate_condyns_results(self_scores, pair_scores)

In [None]:
### Compute NaiveConDynS with topic shuffled simulated transcripts ###
with open(ARTEFACTS_DIR + "validation_gpt/transcript_simulations_topic_shuffled.json", "r") as f:
    simulated_transcripts_topic_shuffled = json.load(f)

self_scores = []
self_results = {}

for convo_id in tqdm(pair_of):
    transcript1 = "\n\n".join(format_transcript_from_convokit(corpus, convo_id))
    transcript2 = simulated_transcripts_topic_shuffled[convo_id]['generated_transcript']
    sop1 = sop_og[convo_id]
    sop2 = sop_sim_topic_shuffled[convo_id]
    results = naive_condyns_gpt.compute_bidirectional_naive_condyns(transcript1, transcript2, sop1, sop2)
    self_results[convo_id] = results
    self_scores.append(naive_condyns_gpt.compute_score_from_results(results))

pair_scores = []
pair_results = {}
for convo_id in tqdm(pair_of):
    transcript1 = "\n\n".join(format_transcript_from_convokit(corpus, convo_id))
    transcript2 = simulated_transcripts_topic_shuffled[pair_of[convo_id]]['generated_transcript']
    sop1 = sop_og[convo_id]
    sop2 = sop_sim_topic_shuffled[pair_of[convo_id]]
    results = naive_condyns_gpt.compute_bidirectional_naive_condyns(transcript1, transcript2, sop1, sop2)
    pair_results[convo_id] = results
    pair_scores.append(naive_condyns_gpt.compute_score_from_results(results))

output = {"self" : self_results, "pair" : pair_results}
with open(ARTEFACTS_DIR + f"validation_gpt/naive_condyns_og-sim_topic_shuffled.json", 'w') as f:
    json.dump(output, f, indent=4)

evaluate_condyns_results(self_scores, pair_scores)

In [None]:
### Compute NaiveConDynS with Adversarial simulated transcripts ###
with open(ARTEFACTS_DIR + f"validation_gpt/naive_condyns_og-sim.json", 'r') as f:
    sim_results = json.load(f)

with open(ARTEFACTS_DIR + f"validation_gpt/naive_condyns_og-sim_topic_shuffled.json", 'r') as f:
    topic_shuffle_results = json.load(f)

self_results = topic_shuffle_results['self']
pair_results = sim_results['pair']

self_scores = []
for convo_id in self_results:
    results = self_results[convo_id]
    self_scores.append(naive_condyns_gpt.compute_score_from_results(results))

pair_scores = []
for convo_id in pair_results:
    results = pair_results[convo_id]
    pair_scores.append(naive_condyns_gpt.compute_score_from_results(results))

evaluate_condyns_results(self_scores, pair_scores)