## Differentially Private Synthetic Call Transcript Generation

This notebook generates differentially private synthetic call transcripts based on an input CSV file, using Microsoft's Private Evolution library and OpenAI's GPT models.

**Prerequisites:**
1.  **Install Dependencies:** 
    ```bash
    pip install "private-evolution[text] @ git+https://github.com/microsoft/DPSDA.git"
    pip install python-dotenv pandas
    ```
2.  **OpenAI API Key:** Create a `.env` file in the same directory as this notebook with your OpenAI API key:
    ```
    OPENAI_API_KEY=your_actual_openai_api_key
    ```
3.  **Input Data:** Place your input CSV file (e.g., `call_transcripts.csv`) in the same directory or update the `input_csv_path` variable below.
4.  **Run Cells:** Execute the cells sequentially.

### 1. Imports

In [2]:
import pandas as pd
import os
import numpy as np
import json
import logging # Import standard logging
from dotenv import load_dotenv

# Import necessary PE components
from pe.data.text import TextCSV # To load data from CSV
from pe.logging import setup_logging # For setting up logging
from pe.runner import PE # The main PE runner class
from pe.population import PEPopulation # Default population strategy
from pe.api.text import LLMAugPE # API for text generation using LLMs
from pe.llm import OpenAILLM # OpenAI LLM wrapper
from pe.embedding.text import SentenceTransformer # Sentence embedding model
from pe.histogram import NearestNeighbors # Histogram based on nearest neighbors
from pe.callback import SaveCheckpoints, ComputeFID, SaveTextToCSV # Callbacks for saving and evaluation
from pe.logger import CSVPrint, LogPrint # Loggers for results
from pe.constant.data import VARIATION_API_FOLD_ID_COLUMN_NAME # Constant for column name

pd.options.mode.copy_on_write = True

### 2. Configuration

Set the input/output paths, model parameters, and PE settings here.

In [3]:
# --- Configuration --- 
# Input CSV file path (Relative to this notebook)
input_csv_path = "real_estate_synthetic_data/basic_call_transcripts.csv"
# Column containing the text transcripts in your CSV
text_column_name = "TranscriptText"
# Columns containing labels (if any, leave empty for unconditional generation)
label_columns = [] # e.g., ["BrokerName"] if you want to condition on BrokerName

# Output directory for results and logs
exp_folder = "results/text/call_transcripts_openai_notebook"
# Directory where prompt files will be stored/read from
prompt_folder = "prompts/call_transcripts"

# LLM Configuration
# Consider using "gpt-4o-mini" or other suitable models
llm_model = "gpt-4o-mini"
# Adjust based on expected transcript length
max_completion_tokens = 256 
llm_temperature = 1.2 # Controls randomness (higher means more random)
llm_threads = 4 # Number of parallel API calls

# API Configuration (LLMAugPE)
# Adjust based on transcript characteristics
min_word_count_variation = 50
word_count_std_variation = 50 # Std deviation for word count variation
token_to_word_ratio_variation = 1.5 # Estimated ratio for token limits
max_completion_tokens_limit_variation = 512 # Hard limit for variation tokens
# Set blank_probabilities if using masking/blanking in variation prompt
blank_probabilities_variation = None # Example: 0.5 or [0.1, 0.2, ...] per iteration

# Embedding Configuration
# Options: "stsb-roberta-base-v2", "sentence-t5-base", etc.
embedding_model = "stsb-roberta-base-v2"

# Population Configuration
initial_variation_fold = 3 # Folds for initial random generation
next_variation_fold = 3 # Folds for subsequent iterations
keep_selected_population = True # Keep best samples from previous iteration
selection_mode_population = "rank" # "rank" or "sample"

# PE Runner Configuration
# Number of samples per iteration (first element is initial, rest are for iterations)
# Generates 100 samples per step: 1 initial generation + 10 refinement iterations
num_samples_schedule = [100] * 11 
# Privacy parameters
target_epsilon = 1.0
# Delta will be calculated based on data size later
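
The schedule above can be sanity-checked with a little arithmetic before spending API budget. The sketch below assumes, based on the run log in section 9 (`num_iterations=10` and `generating 100*4 initial synthetic samples`), that the first schedule entry is the initial population, the remaining entries are refinement iterations, and the initial generation requests `initial_variation_fold + 1` candidates per requested sample. The helper name `summarize_schedule` is hypothetical and not part of the PE library.

```python
def summarize_schedule(num_samples_schedule, initial_variation_fold):
    """Return (refinement_iterations, initial_candidates) for a PE schedule.

    Assumption: entry 0 is the initial population and every later entry
    is one refinement iteration.
    """
    if not num_samples_schedule:
        raise ValueError("num_samples_schedule must not be empty")
    refinement_iterations = len(num_samples_schedule) - 1
    # Assumption taken from the run log: fold + 1 candidates per initial sample.
    initial_candidates = num_samples_schedule[0] * (initial_variation_fold + 1)
    return refinement_iterations, initial_candidates


iterations, candidates = summarize_schedule([100] * 11, initial_variation_fold=3)
print(f"{iterations} refinement iterations, {candidates} initial candidates")
```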

### 3. Create Prompt Files

This cell creates the necessary JSON prompt files for the LLM API.

In [None]:
# Create the directory for prompts if it doesn't exist
os.makedirs(prompt_folder, exist_ok=True)

# --- Create random_api_prompt.json ---
random_prompt_path = os.path.join(prompt_folder, "random_api_prompt.json")
random_prompt = {
    "message_template": [
        {
            "role": "system",
            "content": "You are required to write a realistic call transcript between two people discussing real estate or mortgage-related topics. Ensure the conversation includes typical dialogue elements like greetings, questions, confirmations, and potentially mentions of financial details or personal information in a natural way. Remove any PII or sensitive information, and swap any names, addresses, or other identifying information with generic placeholders like [address],[number], [ssn]."
        },
        {
            "role": "user",
            "content": "Generate a call transcript about a real estate showing follow-up or mortgage pre-approval discussion."
        }
    ]
}
with open(random_prompt_path, 'w') as f:
    json.dump(random_prompt, f, indent=4)
print(f"Created random prompt file at: {random_prompt_path}")

# --- Create variation_api_prompt.json ---
variation_prompt_path = os.path.join(prompt_folder, "variation_api_prompt.json")
variation_prompt = {
    "message_template": [
        {
            "role": "system",
            "content": "You are a helpful assistant that modifies text while maintaining its core meaning and style. You will be given a call transcript. You need to remove all PII and information that would be considered sensitive or confidential, and swap any names, addresses, or other identifying information with generic placeholders like [address],[number], [ssn]."
        },
        {
            "role": "user",
            "content": "Please rephrase the following call transcript {tone}, keeping it realistic for a real estate or mortgage context:\n{sample}"
        }
    ],
    "replacement_rules": [
        {
            "constraints": {},
            "replacements": {
                "tone": [
                    "in a slightly different way",
                    "using different wording",
                    "with minor variations",
                    "while preserving the key information",
                    "in a similar conversational style"
                ]
            }
        }
    ]
}
# If you want to use blanking/masking similar to the yelp_openai example,
# you would need a more complex prompt structure like the one commented out below.
# variation_prompt = {
#     "message_template": [
#         {
#             "role": "system",
#             "content": "You are a helpful assistant that fills in blanks in text while maintaining the original context and style."
#         },
#         {
#             "role": "user",
#             "content": "Based on the context of a real estate or mortgage call transcript, fill in the blanks in the input sentences. If there are no blanks, output the original input sentences.\nInput: {masked_sample}\nFill-in-Blanks and your answer MUST be exactly {word_count} words:"
#         }
#     ]
# }
with open(variation_prompt_path, 'w') as f:
    json.dump(variation_prompt, f, indent=4)
print(f"Created variation prompt file at: {variation_prompt_path}")

Created random prompt file at: prompts/call_transcripts/random_api_prompt.json
Created variation prompt file at: prompts/call_transcripts/variation_api_prompt.json
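
To make the role of the `{tone}` and `{sample}` placeholders concrete, here is a minimal sketch of how one variation request could be assembled from the template. How the PE library actually applies `replacement_rules` is not shown in this notebook, so the uniform random choice below is an assumption, and `fill_variation_prompt` is a hypothetical helper rather than a library function.

```python
import random


def fill_variation_prompt(user_template, tones, sample, rng=None):
    """Fill a variation prompt with one randomly chosen tone and a transcript.

    Assumption: one replacement option is drawn uniformly at random.
    """
    rng = rng or random.Random()
    tone = rng.choice(tones)
    return user_template.format(tone=tone, sample=sample)


user_template = (
    "Please rephrase the following call transcript {tone}, "
    "keeping it realistic for a real estate or mortgage context:\n{sample}"
)
tones = ["in a slightly different way", "using different wording"]
prompt = fill_variation_prompt(
    user_template, tones, "Agent: Hello, this is [name].", rng=random.Random(0)
)
print(prompt)
```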


### 4. Setup Environment and Logging

In [5]:
# Load environment variables (like OPENAI_API_KEY from .env file)
load_dotenv()
print("Environment variables loaded.")

# Setup logging
os.makedirs(exp_folder, exist_ok=True)
log_file_path = os.path.join(exp_folder, "log.txt")
setup_logging(log_file=log_file_path)
logging.info("Starting synthetic call transcript generation.")
print(f"Logging setup complete. Log file: {log_file_path}")

Environment variables loaded.
Logging setup complete. Log file: results/text/call_transcripts_openai_notebook/log.txt
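
`load_dotenv()` does not raise when the `.env` file or the key is missing, so a quick check before the first API call can save a failed run. The sketch below uses only the standard library; `has_openai_key` is a hypothetical helper, and it only checks that the variable is non-empty, not that the key is valid.

```python
import os


def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY is present and not blank."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


# In the notebook, call this right after load_dotenv().
print("OPENAI_API_KEY set:", has_openai_key())
```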


### 5. Load Private Data

In [6]:
logging.info(f"Loading private data from: {input_csv_path}")
try:
    data = TextCSV(
        csv_path=input_csv_path,
        label_columns=label_columns, # Use configured label columns
        text_column=text_column_name # Use configured text column
    )
    num_private_samples = len(data.data_frame)
    logging.info(f"Loaded {num_private_samples} private samples.")
    print(f"Loaded {num_private_samples} private samples from {input_csv_path}")
    if num_private_samples == 0:
        raise ValueError("Loaded data is empty. Please check CSV path and content.")
    
    # Display first few rows (optional)
    # print("Sample data:")
    # print(data.data_frame.head())

except FileNotFoundError:
    logging.error(f"Error: Input CSV file not found at {input_csv_path}")
    print(f"Error: Input CSV file not found at {input_csv_path}")
    # Stop execution or handle error appropriately in a notebook context
    raise 
except ValueError as e:
    logging.error(f"Error loading data: {e}")
    print(f"Error loading data: {e}")
    raise
except Exception as e:
    logging.error(f"An unexpected error occurred during data loading: {e}")
    print(f"An unexpected error occurred during data loading: {e}")
    raise

Loaded 10000 private samples from real_estate_synthetic_data/basic_call_transcripts.csv
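
Before handing a file to `TextCSV`, it can help to confirm that the text column exists and to count how many rows actually contain a transcript. The following is a standard-library sketch, independent of the PE library; `count_usable_transcripts` is a hypothetical helper, and the in-memory CSV is made-up demo data.

```python
import csv
import io


def count_usable_transcripts(csv_file, text_column):
    """Count rows whose text column is present and not blank."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or text_column not in reader.fieldnames:
        raise KeyError(
            f"Column {text_column!r} not found; available: {reader.fieldnames}"
        )
    return sum(1 for row in reader if (row[text_column] or "").strip())


# Made-up demo data: the second row has an empty transcript.
demo_csv = io.StringIO(
    'TranscriptText,BrokerName\n'
    '"Hi, this is [name].",A\n'
    ',B\n'
    '"Thanks for calling.",C\n'
)
usable = count_usable_transcripts(demo_csv, "TranscriptText")
print(f"{usable} usable transcripts")
```

For a real file, pass an open file handle instead, for example `open(input_csv_path, newline="", encoding="utf-8")`.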


### 6. Define PE Components

In [7]:
# Define LLM
llm = OpenAILLM(
    max_completion_tokens=max_completion_tokens,
    model=llm_model,
    temperature=llm_temperature,
    num_threads=llm_threads
)
logging.info(f"LLM defined: {llm_model}")

# Define API for PE (using LLMAugPE)
api = LLMAugPE(
    llm=llm,
    random_api_prompt_file=random_prompt_path,
    variation_api_prompt_file=variation_prompt_path,
    min_word_count=min_word_count_variation,
    word_count_std=word_count_std_variation,
    token_to_word_ratio=token_to_word_ratio_variation,
    max_completion_tokens_limit=max_completion_tokens_limit_variation,
    blank_probabilities=blank_probabilities_variation,
)
logging.info("LLMAugPE API defined.")

# Define Embedding model
embedding = SentenceTransformer(model=embedding_model)
logging.info(f"Embedding defined: {embedding_model}")

# Define Histogram method
histogram = NearestNeighbors(
    embedding=embedding,
    mode="L2", # L2 distance for nearest neighbors
    lookahead_degree=0, # No lookahead in this example
)
logging.info("NearestNeighbors Histogram defined.")

# Define Population update strategy
population = PEPopulation(
    api=api,
    initial_variation_api_fold=initial_variation_fold,
    next_variation_api_fold=next_variation_fold,
    keep_selected=keep_selected_population,
    selection_mode=selection_mode_population
)
logging.info("PEPopulation defined.")

04/16/2025 01:41:57 AM [pe] [INFO ]  Using 1 OpenAI API keys
04/16/2025 01:41:57 AM [pe] [INFO ]  Using 4 threads for making concurrent API calls
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
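
The `NearestNeighbors` histogram is the step where private data influences the synthetic population. The sketch below illustrates the general idea as described for Private Evolution: each private embedding votes for its closest synthetic embedding under L2 distance, and Gaussian noise is added to the vote counts. It is a pure-Python illustration with toy 2-D points, not the library's implementation; the function names and the noise scale are assumptions.

```python
import math
import random


def nearest_neighbor_histogram(private_embeddings, synthetic_embeddings):
    """Count, for each synthetic point, how many private points are closest to it."""
    counts = [0] * len(synthetic_embeddings)
    for private in private_embeddings:
        nearest = min(
            range(len(synthetic_embeddings)),
            key=lambda j: math.dist(private, synthetic_embeddings[j]),  # L2 distance
        )
        counts[nearest] += 1
    return counts


def add_gaussian_noise(counts, noise_std, rng):
    """Add independent Gaussian noise to every count (illustrative noise scale)."""
    return [count + rng.gauss(0.0, noise_std) for count in counts]


private = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
synthetic = [(0.0, 0.0), (5.0, 5.0)]
votes = nearest_neighbor_histogram(private, synthetic)
noisy_votes = add_gaussian_noise(votes, noise_std=1.0, rng=random.Random(0))
print(votes, noisy_votes)
```

Synthetic samples with higher noisy counts are the ones a selection step would favor in the next iteration.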


### 7. Define Callbacks and Loggers

In [8]:
# Define Callbacks
save_checkpoints = SaveCheckpoints(os.path.join(exp_folder, "checkpoint"))
compute_fid = ComputeFID(
    priv_data=data,
    embedding=embedding,
    # Only compute FID on samples generated by random API initially, or final variation samples later
    filter_criterion={VARIATION_API_FOLD_ID_COLUMN_NAME: -1}
)
save_text_to_csv = SaveTextToCSV(output_folder=os.path.join(exp_folder, "synthetic_text"))
logging.info("Callbacks defined: SaveCheckpoints, ComputeFID, SaveTextToCSV")

# Define Loggers
csv_print = CSVPrint(output_folder=exp_folder)
log_print = LogPrint() # Prints progress to standard logging output
logging.info("Loggers defined: CSVPrint, LogPrint")

04/16/2025 01:42:44 AM [pe] [INFO ]  Embedding: computing PE.EMBEDDING.SentenceTransformer.stsb-roberta-base-v2 for 10000/10000 samples
04/16/2025 01:43:57 AM [pe] [INFO ]  Embedding: finished computing PE.EMBEDDING.SentenceTransformer.stsb-roberta-base-v2 for 10000/10000 samples
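
For intuition about what an FID-style score measures, the sketch below computes a simplified Fréchet distance between two sets of embeddings. It assumes diagonal covariances, which reduces the formula to a sum over dimensions of the squared mean gap plus the squared gap between standard deviations. This is a simplification for illustration only: full FID uses complete covariance matrices, and this code is not what `ComputeFID` runs. The name `diagonal_fid` is hypothetical.

```python
import math
import statistics


def diagonal_fid(real, synthetic):
    """Squared Fréchet distance between two point sets, assuming diagonal covariances."""
    total = 0.0
    for d in range(len(real[0])):
        real_d = [point[d] for point in real]
        synth_d = [point[d] for point in synthetic]
        mean_gap = statistics.fmean(real_d) - statistics.fmean(synth_d)
        std_gap = math.sqrt(statistics.variance(real_d)) - math.sqrt(
            statistics.variance(synth_d)
        )
        total += mean_gap**2 + std_gap**2
    return total


real = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
shifted = [(x + 1.0, y) for x, y in real]  # same spread, mean moved by 1 in dim 0
print(diagonal_fid(real, real), diagonal_fid(real, shifted))
```

Lower values mean the two embedding distributions are closer; identical sets give zero.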


### 8. Calculate Delta for DP

In [9]:
# Calculate delta for DP based on the number of private samples loaded earlier
# With N <= 1, log(N) is zero or undefined, so fall back to a fixed delta
if num_private_samples <= 1:
    delta = 1e-5  # Small default delta when there are too few samples
    logging.warning(f"Number of private samples is {num_private_samples}. Using default delta={delta}")
else:
    # Heuristic used here: delta = 1 / (N * ln N)
    delta = 1.0 / (num_private_samples * np.log(num_private_samples))
logging.info(f"Using delta={delta} based on {num_private_samples} private samples.")
print(f"Calculated delta={delta} for DP.")

Calculated delta=1.0857362047581295e-05 for DP.
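
The value above follows directly from the heuristic delta = 1 / (N · ln N) with N = 10000. Here is a standalone version of the same calculation using only the standard library; `heuristic_delta` is a hypothetical helper name, and the fallback mirrors the default used in the cell.

```python
import math


def heuristic_delta(num_private_samples, fallback=1e-5):
    """Return 1 / (N * ln N), or a fixed fallback when N <= 1."""
    if num_private_samples <= 1:
        return fallback
    return 1.0 / (num_private_samples * math.log(num_private_samples))


# ln(10000) is about 9.2103, so delta is about 1 / 92103.4, roughly 1.0857e-05.
delta_10k = heuristic_delta(10_000)
print(delta_10k)
```

Because N · ln N grows with the dataset size, larger private datasets yield a smaller delta under this rule.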


### 9. Define and Run PE Runner

In [10]:
# Define PE Runner
pe_runner = PE(
    priv_data=data,
    population=population,
    histogram=histogram,
    callbacks=[save_checkpoints, save_text_to_csv, compute_fid],
    loggers=[csv_print, log_print], # LogPrint will output to the notebook console/log file
    # DP mechanism defaults to Gaussian if not specified
)
logging.info("PE Runner defined.")

# Run Private Evolution
logging.info(f"Starting PE run with epsilon={target_epsilon}, delta={delta}")
print(f"Starting PE run... Check logs in '{log_file_path}' and results in '{exp_folder}'")
pe_runner.run(
    num_samples_schedule=num_samples_schedule,
    delta=delta,
    epsilon=target_epsilon,
    checkpoint_path=os.path.join(exp_folder, "checkpoint"), # Resume from checkpoint if exists
)

logging.info("Synthetic call transcript generation finished.")
print("PE run finished!")

Starting PE run... Check logs in 'results/text/call_transcripts_openai_notebook/log.txt' and results in 'results/text/call_transcripts_openai_notebook'

04/16/2025 01:44:10 AM [pe] [INFO ]  DP epsilon=1.0, delta=1.0857362047581295e-05, noise_multiplier=11.738744309185014, num_iterations=10.
04/16/2025 01:44:10 AM [pe] [WARNI]  fraction_per_label_id is not provided. Assuming the fraction of label ids in private data is public information.
04/16/2025 01:44:10 AM [pe] [INFO ]  Population: generating 100*4 initial synthetic samples for label 
04/16/2025 01:44:10 AM [pe] [INFO ]  RANDOM API: creating 100 samples for label 
04/16/2025 01:44:10 AM [pe] [INFO ]  RANDOM API: producing LLM requests
04/16/2025 01:44:10 AM [pe] [INFO ]  RANDOM API: getting LLM responses
04/16/2025 01:44:10 AM [pe] [INFO ]  Workload [1]
04/16/2025 01:44:10 AM [pe] [INFO ]  Workload [2]
04/16/2025 01:44:10 AM [pe] [INFO ]  Workload [3]
04/16/2025 01:44:10 AM [pe] [INFO ]  Workload [4]
04/16/2025 01:44:14 AM [pe] [INFO ]  Workload [4]
04/16/2025 01:44:14 AM [pe] [INFO ]  Workload [4]
04/16/2025 01:44:14 AM [pe] [INFO ]  Workload [4]
04/16/2025 01:44:14 AM [pe] [INFO ]  Workload [4]
04/16/2025 01:44:17 AM [pe] [INFO ]  Workload [4]
04/16/2025 01:44:17 AM [pe] [INFO ]  Workload [4]
04/16/2025 01:44:17 AM [pe] [INFO ]  Workload [4]
  7%|▋         | 7/100 [00:07<01:10,  1.32it/s]
04/16/2025 01:44:18 AM [pe] [INFO ]  Workl

KeyboardInterrupt: 