# Walkthrough: generate(toy_df, epsilon=...)

This notebook explains the generate function step by step using a small toy dataset.
The goal is that after reading the Markdown cells, it is clear what each part of the code does, and what intermediate objects look like (histograms, classTotals, p, classTuples, cond_attr, attr_vectors, ...).

In [2]:
import numpy as np
import pandas as pd

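# ---------- Helper functions ----------

def laplace_mech(v, sensitivity, epsilon):
    # Add Laplace noise with scale sensitivity/epsilon to v.
    # Note: without a `size` argument, np.random.laplace returns one scalar,
    # so the same noise value is added to every element of a Series v.
    return v + np.random.laplace(loc=0, scale=sensitivity/epsilon)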

def _build_noisy_histograms(df, className, attributesName, epsilon):
    """
    Phase 1: build a DP-noisy histogram P(attr, class) for each attribute.
    """
    numHistograms = len(attributesName)
    epsilon_per_hist = epsilon / numHistograms
    histograms = {}

    for attributeName in attributesName:
        counts = df[[attributeName, className]].value_counts()
        noisy_counts = laplace_mech(counts, sensitivity=1, epsilon=epsilon_per_hist)
        noisy_counts.clip(lower=0.0, inplace=True)
        histograms[attributeName] = noisy_counts

    return histograms


def _compute_class_distribution(histograms, numTuples):
    """
    From all histograms: compute classTotals, the class probabilities p(c),
    the number of synthetic tuples per class, and the class vectors.
    """
    classTotals = {}

    # classTotals[c] = sum of the counts over all attributes and attribute values
    for attr_name, hist in histograms.items():
        for (attr_val, class_val), count in hist.items():
            classTotals[class_val] = classTotals.get(class_val, 0.0) + count

    total = sum(classTotals.values())
    if total == 0:
        # Fallback: uniform distribution if noise and clipping set everything to 0
        gleich = 1.0 / len(classTotals)
        p = {c: gleich for c in classTotals}
    else:
        p = {c: classTotals[c] / total for c in classTotals}

    # Number of tuples per class
    classTuples = {c: round(numTuples * p[c]) for c in classTotals}

    # Class vector for each class
    class_vector = {c: [c] * classTuples[c] for c in classTuples}

    return classTotals, p, classTuples, class_vector


def _compute_conditional_attributes(histograms, classTotals):
    """
    Compute P(attr = a | class = c) for each attribute and each class.
    """
    cond_attr = {}

    for attr_name, hist in histograms.items():
        cond_attr[attr_name] = {}
        for c in classTotals:
            attr_counts = {}

            # For a fixed class c, accumulate the counts of the attribute values
            for (attr_val, class_val), count in hist.items():
                if class_val == c:
                    attr_counts[attr_val] = attr_counts.get(attr_val, 0.0) + count

            sum_c = sum(attr_counts.values())

            if sum_c == 0:
                # uniform fallback
                if attr_counts:  # class is present, but all counts are 0
                    gleich = 1.0 / len(attr_counts)
                    probs = {a: gleich for a in attr_counts}
                else:
                    # no values at all: empty dict, can be skipped later
                    probs = {}
            else:
                probs = {a: attr_counts[a] / sum_c for a in attr_counts}

            cond_attr[attr_name][c] = probs

    return cond_attr


def _sample_attribute_vectors(cond_attr, classTuples, random_state=None):
    """
    For each attribute and each class, build a vector of attribute values
    of length n_c = classTuples[c] according to P(attr | class).
    """
    if random_state is not None:
        np.random.seed(random_state)

    attr_vectors = {}

    for attr_name, class_dict in cond_attr.items():
        attr_vectors[attr_name] = {}
        for c, probs in class_dict.items():
            n_c = classTuples[c]

            # If no probabilities exist (empty dict)
            if not probs:
                # Fallback: vector of None placeholders, may be replaced/skipped later
                attr_vectors[attr_name][c] = [None] * n_c
                continue

            # Target count per attribute value
            target_count = {attr_val: round(p_val * n_c)
                            for attr_val, p_val in probs.items()}

            # Build the vector with the target count per value
            attr_vec_c = []
            for attr_val, cnt in target_count.items():
                attr_vec_c.extend([attr_val] * cnt)

            # Adjust the length (to n_c)
            current_len = len(attr_vec_c)
            diff = n_c - current_len

            if diff > 0:
                vals = list(probs.keys())
                extra = np.random.choice(vals, size=diff, p=list(probs.values()))
                attr_vec_c.extend(extra)
            elif diff < 0:
                remove_indices = np.random.choice(len(attr_vec_c), size=-diff, replace=False)
                for idx in sorted(remove_indices, reverse=True):
                    attr_vec_c.pop(idx)

            # Shuffle so that no structure / ordering is revealed
            np.random.shuffle(attr_vec_c)

            attr_vectors[attr_name][c] = attr_vec_c

    return attr_vectors


def _assemble_synthetic_dataframe(attr_vectors, class_vector, attributesName, className):
    """
    Build one DataFrame per class from the attribute vectors and class vectors
    and concatenate them into a single DataFrame.
    """
    blocks = {}
    for c, class_vec in class_vector.items():
        n_c = len(class_vec)
        block = {}

        for attributeName in attributesName:
            values = attr_vectors[attributeName][c]
            # Sanity check: trim/pad to length n_c if necessary
            if len(values) < n_c:
                values = values + [values[-1]] * (n_c - len(values))
            elif len(values) > n_c:
                values = values[:n_c]
            block[attributeName] = values

        block[className] = class_vec
        blocks[c] = block

    df_blocks = {c: pd.DataFrame(block) for c, block in blocks.items()}
    synthetic_df = pd.concat(df_blocks.values(), ignore_index=True)

    return synthetic_df


# ---------- Main function ----------

def generate(df, numTuples=None, numClass=None, epsilon=1.0, random_state=42):
    """
    Generate differentially private synthetic data based on df.
    Phase 1: DP histograms.
    Phase 2: sample class and attribute values and assemble the synthetic dataset.
    """
    print(df.shape)

    # resolve optional parameters
    if numTuples is None:
        numTuples = df.shape[0]
    if numClass is None:
        numClass = df.shape[1] - 1

    className = df.columns[numClass]
    attributesName = [col for col in df.columns if col != className]

    # ---- Phase 1: DP-Histogramme ----
    histograms = _build_noisy_histograms(df, className, attributesName, epsilon)

    # ---- Phase 2: Klassenverteilung, P(attr|class), Sampling, Zusammenbau ----
    classTotals, p, classTuples, class_vector = _compute_class_distribution(
        histograms, numTuples
    )

    cond_attr = _compute_conditional_attributes(histograms, classTotals)

    attr_vectors = _sample_attribute_vectors(
        cond_attr, classTuples, random_state=random_state
    )

    synthetic_df = _assemble_synthetic_dataframe(
        attr_vectors, class_vector, attributesName, className
    )

    # final shuffle of the whole DataFrame (rows)
    synthetic_df = synthetic_df.sample(frac=1, random_state=random_state).reset_index(drop=True)

    return synthetic_df




## 1) Toy dataset and function call

We use a small dataset with:

- two attributes: `job`, `education`

- one class variable: `income`

In the next code cell, we construct toy_df and then call:

`synthetic_toy = generate(toy_df, epsilon=1.0)`

Epsilon is the privacy parameter, with 1.0 being the default value.

In [13]:
toy_df = pd.DataFrame({
    "job": [
        "blue", "blue", "white",
        "white", "blue", "white"
    ],
    "education": [
        "low", "low", "high",
        "high", "high", "low"
    ],
    "income": [
        "<=50K", "<=50K", ">50K",
        ">50K", "<=50K", ">50K"
    ]
})

print("toy: \n",toy_df)

synthetic_toy = generate(toy_df, epsilon=1.0)

print("synthetic_toy: \n",synthetic_toy)

toy: 
      job education income
0   blue       low  <=50K
1   blue       low  <=50K
2  white      high   >50K
3  white      high   >50K
4   blue      high  <=50K
5  white       low   >50K
(6, 3)
synthetic_toy: 
      job education income
0   blue       low  <=50K
1   blue      high  <=50K
2  white       low   >50K
3   blue      high  <=50K
4  white       low   >50K
5  white      high   >50K


## 2) generate(...): parameters and defaults

The function signature is:

`generate(df, numTuples=None, numClass=None, epsilon=1.0, ...)`

Meaning:

- **df**: the input DataFrame (here: toy_df)

- **numTuples**: number of rows to generate

- **numClass**: index of the class column

By default, `numTuples` is the number of rows of `df`, so the synthetic dataset has the same size as the input, and `numClass` is the index of the last column. Set them explicitly to generate a different number of rows or to use another column as the class.

- **epsilon**: overall privacy budget

The budget is split evenly across the attribute histograms: $\varepsilon_{\text{per hist}} = \frac{\varepsilon}{\text{no. of attributes}}$



In our toy dataset:

- number of rows: N=6

- class column is the last column (income)

- attributes are the other columns (job, education)

- number of histograms: 2 → $\varepsilon_{\text{per hist}} = \varepsilon / 2$
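
To make the budget split concrete, the following sketch computes the per-histogram budget and the resulting Laplace scale for a few values of $\varepsilon$. The helper name `budget_per_histogram` is an assumption introduced here for illustration; the sensitivity of 1 matches the call in `_build_noisy_histograms`.

```python
import math

def budget_per_histogram(epsilon, num_attributes, sensitivity=1.0):
    """Hypothetical helper: even split of epsilon and the resulting Laplace scale."""
    epsilon_per_hist = epsilon / num_attributes
    scale = sensitivity / epsilon_per_hist   # Laplace scale b
    std = math.sqrt(2) * scale               # standard deviation of Laplace(0, b)
    return epsilon_per_hist, scale, std

# Two attributes (job, education), as in the toy dataset
for eps in (0.1, 1.0, 10.0):
    eps_h, scale, std = budget_per_histogram(eps, num_attributes=2)
    print(f"epsilon={eps}: epsilon_per_hist={eps_h}, scale={scale:.2f}, std={std:.2f}")
```

A smaller $\varepsilon$ or a larger number of attributes both increase the noise scale of every histogram.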

In [14]:
df = toy_df

# same default logic as in generate
numTuples = None
numClass = None
epsilon = 1.0

if numTuples is None:
    numTuples = df.shape[0]

if numClass is None:
    numClass = df.shape[1] - 1

className = df.columns[numClass]
attributesName = [col for col in df.columns if col != className]

numHistograms = len(attributesName)
epsilon_per_hist = epsilon / numHistograms

print("df.shape:", df.shape)
print("numTuples:", numTuples)
print("className:", className)
print("attributesName:", attributesName)
print("epsilon_per_hist:", epsilon_per_hist)

df.shape: (6, 3)
numTuples: 6
className: income
attributesName: ['job', 'education']
epsilon_per_hist: 0.5


## 3) Phase 1: building joint histograms (attribute, class)

The first main block of code constructs a dictionary of histograms.

### 3.1 What does one histogram represent?

For each attribute (e.g. job), the code computes:

`counts = df[[attributeName, className]].value_counts()`

This returns a pandas Series with:

- a MultiIndex: `(attr_val, class_val)`

- values: `count(attr_val, class_val)`

So for job, keys look like:

`("blue", "<=50K")`

`("white", ">50K")`

and values are the corresponding frequencies in the toy dataset.

### 3.2 What is stored in histograms?

histograms is a dict:

**key**: attribute name (e.g. "job")

**value**: noisy Series of counts with MultiIndex

In [15]:
# Raw (non-noisy) joint histograms from the toy dataset

job_income_hist = toy_df[["job", "income"]].value_counts()
edu_income_hist = toy_df[["education", "income"]].value_counts()

job_income_hist, edu_income_hist


(job    income
 blue   <=50K     3
 white  >50K      3
 Name: count, dtype: int64,
 education  income
 high       >50K      2
 low        <=50K     2
 high       <=50K     1
 low        >50K      1
 Name: count, dtype: int64)

The outputs above are pandas Series with a MultiIndex.

Each entry represents a joint count of the form:

$(\text{attribute value}, \text{class value}) \longrightarrow \text{count}$

For example, in the histogram for job:

the index ("blue", "<=50K") with value 3 means
that there are three records in the toy dataset with
job = "blue" and income = "<=50K".

All combinations that do not occur in the dataset are absent from the index and implicitly have count zero.

These joint histograms contain all information used by the algorithm in Phase 1, before differential privacy is applied.
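
Because absent combinations never appear in the index, they also never receive noise later on. If the zero cells should be made explicit, one option is to reindex the histogram over the full value domain. The following is a minimal sketch, assuming the domain can be taken from the values observed in each column; it is not part of `generate`.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "job": ["blue", "blue", "white", "white", "blue", "white"],
    "education": ["low", "low", "high", "high", "high", "low"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})

counts = toy_df[["job", "income"]].value_counts()

# Full cross product of the observed attribute and class values (assumed domain)
full_index = pd.MultiIndex.from_product(
    [sorted(toy_df["job"].unique()), sorted(toy_df["income"].unique())],
    names=["job", "income"],
)

# Missing combinations are filled with an explicit count of 0
full_counts = counts.reindex(full_index, fill_value=0)
print(full_counts)
```

The two combinations that do not occur in `toy_df`, `("blue", ">50K")` and `("white", "<=50K")`, now appear with count 0.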

## 4) Phase 1 continued: Laplace noise and clipping

After computing the true counts, the code applies the Laplace mechanism:

`noisy_counts = laplace_mech(counts, 1, epsilon_per_hist)`

Then:

`noisy_counts.clip(lower=0.0, inplace=True)`

Meaning:

- each count is perturbed by Laplace noise

- negative noisy counts are clipped to zero

This is the only step that reads the original data and spends privacy budget.
Everything after it works on the noisy histograms only and is post-processing (no additional privacy cost).

In [29]:
epsilon = 1.0
attributesName = ["job", "education"]
className = "income"

numHistograms = len(attributesName)
epsilon_per_hist = epsilon / numHistograms

histograms = {}

for attributeName in attributesName:
    counts = toy_df[[attributeName, className]].value_counts()
    noisy_counts = laplace_mech(counts, sensitivity=1, epsilon=epsilon_per_hist)
    noisy_counts.clip(lower=0.0, inplace=True)
    histograms[attributeName] = noisy_counts

# Display noisy histograms
histograms["job"], histograms["education"]


(job    income
 blue   <=50K     7.315299
 white  >50K      7.315299
 Name: count, dtype: float64,
 education  income
 high       >50K      2.270669
 low        <=50K     2.270669
 high       <=50K     1.270669
 low        >50K      1.270669
 Name: count, dtype: float64)

**Note**

Since the toy dataset contains only six records, the Laplace noise is large relative to the true counts: with `epsilon_per_hist = 0.5` and sensitivity 1 the noise scale is 2 (standard deviation about 2.83), while the true counts lie between 1 and 3.
Because the noise scale does not depend on the dataset size, the relative error shrinks as the counts grow.
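
In the output above, every cell of a histogram is shifted by the same amount (about +4.32 for `job` and +0.27 for `education`), because `np.random.laplace` is called without a `size` argument and returns one scalar per histogram. The Laplace mechanism for histograms is usually described with an independent draw per cell. A minimal sketch of that variant follows, assuming a NumPy `Generator` for the randomness; the variable names are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# True joint counts for (job, income) from the toy dataset
counts = pd.Series(
    [3, 3],
    index=pd.MultiIndex.from_tuples(
        [("blue", "<=50K"), ("white", ">50K")], names=["job", "income"]
    ),
    name="count",
)

sensitivity = 1.0
epsilon_per_hist = 0.5

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# One independent Laplace draw per histogram cell
noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon_per_hist, size=len(counts))

# Add the noise cell by cell, then clip negative values to zero
noisy = (counts + noise).clip(lower=0.0)
print(noisy)
```

With per-cell noise, the differences between cells are perturbed as well, which is not the case when a single offset is shared by the whole histogram.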

## 5) Phase 2 starts: computing classTotals

The next block computes classTotals, a dictionary mapping:

**key**: class label c

**value**: total noisy mass associated with class c

This is done by iterating through all noisy histograms:

- outer loop: over attributes (`for attr_name, hist in histograms.items()`)

- inner loop: over MultiIndex entries (`for (attr_val, class_val), count in hist.items()`)

For each entry, the noisy count is added to:

`classTotals[class_val] += count`

So conceptually:

$$
\text{classTotals}(c)
=
\sum_{A \in \mathcal{A}}
\sum_{a}
\tilde{N}(A = a, C = c)
$$
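
Note that every record contributes once per attribute, so without noise the grand total equals the number of attributes times $N$. The following sketch evaluates the sum on the raw (non-noisy) toy counts to show that this factor cancels when the totals are normalized; the snake_case variable names are illustrative.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "job": ["blue", "blue", "white", "white", "blue", "white"],
    "education": ["low", "low", "high", "high", "high", "low"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})

class_name = "income"
class_totals = {}

# Sum the raw joint counts over both attributes and all attribute values
for attr in ["job", "education"]:
    hist = toy_df[[attr, class_name]].value_counts()
    for (attr_val, class_val), count in hist.items():
        class_totals[class_val] = class_totals.get(class_val, 0.0) + count

total = sum(class_totals.values())
p = {c: class_totals[c] / total for c in class_totals}

print(class_totals)  # each class: 3 records x 2 attributes = 6
print(total)         # 2 attributes x 6 records = 12
print(p)             # the factor 2 cancels: 0.5 per class
```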

In [30]:
classTotals = {}

for attr_name, hist in histograms.items():
    for (attr_val, class_val), count in hist.items():
        if class_val not in classTotals:
            classTotals[class_val] = 0.0
        classTotals[class_val] += count

total = sum(classTotals.values())

classTotals, total


({'<=50K': 10.856638030035528, '>50K': 10.856638030035528}, 21.713276060071056)

## 6) From classTotals to p and classTuples

Next, the code computes a probability distribution over classes:

`p = {c: classTotals[c] / total for c in classTotals}`

$P(C=c)=\frac{\text{classTotals}(c)}{\sum_{c'} \text{classTotals}(c')}$


Then, the number of synthetic tuples per class is:

`classTuples = {c: round(numTuples * p[c]) for c in classTotals}`

So: $n_c = \text{round}\bigl(N \cdot P(C=c)\bigr)$

Finally, a class vector is built:

`class_vector[c] = [c] * classTuples[c]`

This fixes the class labels of the synthetic dataset before sampling the attributes.
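
Because each class count is rounded separately, the counts do not always add up to `numTuples`. Python's `round` rounds halves to the nearest even integer, so two classes with probability 0.5 and `numTuples = 5` yield 2 + 2 = 4 rows. The following is a sketch of one possible alternative, a largest-remainder allocation; the function `allocate_counts` is hypothetical and not what `generate` does.

```python
import math

def allocate_counts(p, num_tuples):
    """Hypothetical alternative: floor each share, then hand out the leftover
    rows to the classes with the largest fractional parts."""
    raw = {c: num_tuples * p_c for c, p_c in p.items()}
    counts = {c: math.floor(x) for c, x in raw.items()}
    remainder = num_tuples - sum(counts.values())
    by_fraction = sorted(raw, key=lambda c: raw[c] - counts[c], reverse=True)
    for c in by_fraction[:remainder]:
        counts[c] += 1
    return counts

p = {"<=50K": 0.5, ">50K": 0.5}

rounded = {c: round(5 * p_c) for c, p_c in p.items()}  # per-class rounding as in the notebook
allocated = allocate_counts(p, 5)

print(rounded, sum(rounded.values()))
print(allocated, sum(allocated.values()))
```

For the toy dataset with `numTuples = 6` both approaches give 3 rows per class, so the difference only shows up for other sizes or class distributions.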

In [31]:
# Compute class probabilities p(C=c)
p = {c: classTotals[c] / total for c in classTotals}

# Number of synthetic tuples per class
numTuples = toy_df.shape[0]
classTuples = {c: round(numTuples * p[c]) for c in classTotals}

# Class vectors (fixed labels for synthetic data)
class_vector = {c: [c] * classTuples[c] for c in classTuples}

p, classTuples, class_vector

({'<=50K': 0.5, '>50K': 0.5},
 {'<=50K': 3, '>50K': 3},
 {'<=50K': ['<=50K', '<=50K', '<=50K'], '>50K': ['>50K', '>50K', '>50K']})

## 7) Conditional distributions `cond_attr[attr][class]`

This step computes the conditional distributions of attribute values given the class.

For each attribute $A$ and each class $c$, the algorithm estimates:

$P(A = a \mid C = c)=\frac{\tilde{N}(A = a, C = c)}{\sum_{a'} \tilde{N}(A = a', C = c)}$

where $\tilde{N}(A = a, C = c)$ denotes the noisy histogram counts obtained in Phase 1.

### How `attr_counts` is computed

For a fixed attribute `attr_name` and class $c$, the code iterates over all
entries of the corresponding noisy histogram:

- only entries with `class_val == c` are considered,
- the noisy counts are accumulated per attribute value:
  `attr_counts[attr_val] += count`.

This yields a dictionary mapping attribute values $a$ to their total noisy
mass within class $c$.

### Normalization and edge cases

Let:

$\text{sum}_c = \sum_a \tilde{N}(A = a, C = c)$

- If $\text{sum}_c > 0$, the conditional probabilities are obtained by normalization.
- If $\text{sum}_c = 0$ but attribute values are present, a uniform distribution
  over the observed attribute values is used.
- If no information is available at all, an empty distribution is stored.

The result is stored in:

`cond_attr[attr_name][c]`

which is later used for sampling synthetic attribute values.
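
The three cases can be captured in a small function. The following is a sketch with a hypothetical helper `normalize_counts`, applied to the noisy `education` counts for class `<=50K` shown in section 4.

```python
def normalize_counts(attr_counts):
    """Hypothetical helper mirroring the three normalization cases above."""
    total = sum(attr_counts.values())
    if total > 0:
        # Regular case: divide each noisy count by the class-wise sum
        return {a: count / total for a, count in attr_counts.items()}
    if attr_counts:
        # All counts were clipped to 0: uniform over the observed values
        uniform = 1.0 / len(attr_counts)
        return {a: uniform for a in attr_counts}
    # No entries for this class at all
    return {}

# Noisy counts of education within class "<=50K" (values from the output in section 4)
probs = normalize_counts({"low": 2.270669, "high": 1.270669})
print(probs)

print(normalize_counts({"low": 0.0, "high": 0.0}))  # uniform fallback
print(normalize_counts({}))                         # empty distribution
```

The first call gives roughly 0.641 for `low` and 0.359 for `high`.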


In [32]:
# Compute conditional attribute distributions cond_attr
cond_attr = {}

for attr_name, hist in histograms.items():
    cond_attr[attr_name] = {}
    for c in classTotals:
        attr_counts = {}

        for (attr_val, class_val), count in hist.items():
            if class_val == c:
                attr_counts[attr_val] = attr_counts.get(attr_val, 0.0) + count

        sum_c = sum(attr_counts.values())

        if sum_c == 0:
            if attr_counts:
                gleich = 1.0 / len(attr_counts)
                probs = {a: gleich for a in attr_counts}
            else:
                probs = {}
        else:
            probs = {a: attr_counts[a] / sum_c for a in attr_counts}

        cond_attr[attr_name][c] = probs

cond_attr

{'job': {'<=50K': {'blue': 1.0}, '>50K': {'white': 1.0}},
 'education': {'<=50K': {'low': 0.6411895477322098,
   'high': 0.3588104522677903},
  '>50K': {'high': 0.6411895477322098, 'low': 0.3588104522677903}}}

## 8) Sampling: attr_vectors[attr][class]

Now we generate concrete vectors of attribute values.

For each attribute and each class:

- set `n_c = classTuples[c]`

- compute target counts:
`target_count[attr_val] = round(p_val * n_c)`

- construct a vector by repeating each `attr_val` `target_count[attr_val]` times

Then the code enforces that the vector length is exactly $n_c$:

- if too short → sample extra values using `np.random.choice(..., p=...)`

- if too long → remove random entries

Finally:
`np.random.shuffle(attr_vec_c)`

This prevents ordering artifacts.
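
The length correction only matters when the rounded target counts do not add up to $n_c$. The following sketch reproduces the steps in a standalone function and runs it on two made-up distributions that trigger each branch. The function name `build_attribute_vector` and the use of a NumPy `Generator` instead of the global `np.random` state are assumptions made for this illustration.

```python
import numpy as np

def build_attribute_vector(probs, n_c, rng):
    """Sketch of the sampling step for one attribute and one class."""
    vec = []
    for val, p_val in probs.items():
        vec.extend([val] * round(p_val * n_c))  # rounded target count per value

    diff = n_c - len(vec)
    if diff > 0:
        # Too short: draw the missing values according to probs
        extra = rng.choice(list(probs.keys()), size=diff, p=list(probs.values()))
        vec.extend(extra.tolist())
    elif diff < 0:
        # Too long: drop -diff randomly chosen positions
        drop = set(rng.choice(len(vec), size=-diff, replace=False).tolist())
        vec = [v for i, v in enumerate(vec) if i not in drop]

    rng.shuffle(vec)
    return vec

rng = np.random.default_rng(42)

# Four values with p = 0.25 and n_c = 2: every target rounds to 0, so 2 values are drawn
too_short = build_attribute_vector({"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}, 2, rng)

# Three values with p = 1/3 and n_c = 2: targets sum to 3, so 1 entry is removed
too_long = build_attribute_vector({"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}, 2, rng)

print(too_short, too_long)
```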

In [34]:
# Build attr_vectors (sampling step)
attr_vectors = {}

for attr_name, class_dict in cond_attr.items():
    attr_vectors[attr_name] = {}
    for c, probs in class_dict.items():
        n_c = classTuples[c]
        target_count = {}

        for attr_val, p_val in probs.items():
            target_count[attr_val] = round(p_val * n_c)

        attr_vec_c = []
        for attr_val, cnt in target_count.items():
            attr_vec_c.extend([attr_val] * cnt)

        current_len = len(attr_vec_c)
        diff = n_c - current_len

        if diff > 0:
            vals = list(probs.keys())
            extra = np.random.choice(vals, size=diff, p=list(probs.values()))
            attr_vec_c.extend(extra)
        elif diff < 0:
            remove_indices = np.random.choice(len(attr_vec_c), size=-diff, replace=False)
            for idx in sorted(remove_indices, reverse=True):
                attr_vec_c.pop(idx)

        np.random.shuffle(attr_vec_c)
        attr_vectors[attr_name][c] = attr_vec_c

In [35]:
# Inspect sampled attribute vectors for the toy dataset

for c in classTuples:
    print(f"Class: {c}")

    print(" job:")
    print("  length:", len(attr_vectors["job"][c]))
    print("  example:", attr_vectors["job"][c][:10])

    print(" education:")
    print("  length:", len(attr_vectors["education"][c]))
    print("  example:", attr_vectors["education"][c][:10])
    print()


Class: <=50K
 job:
  length: 3
  example: ['blue', 'blue', 'blue']
 education:
  length: 3
  example: ['low', 'high', 'low']

Class: >50K
 job:
  length: 3
  example: ['white', 'white', 'white']
 education:
  length: 3
  example: ['high', 'high', 'low']



## 9) Building blocks, concatenating, final shuffle

At this point we have:

- `class_vector[c]` containing class labels (length $n_c$)

- `attr_vectors[attr][c]` for each attribute (also length $n_c$)

The code builds one "block" per class. Each block is a dict:

- `block[attributeName] = attr_vectors[attributeName][c]`

- `block[className] = class_vector[c]`

Then:

`df_blocks[c] = pd.DataFrame(block)` creates one DataFrame per class

`synthetic_df = pd.concat(df_blocks.values(), ignore_index=True)` merges them

`synthetic_df = synthetic_df.sample(frac=1, random_state=42)` shuffles rows

In [36]:
# Build class-wise blocks and assemble the final synthetic DataFrame
blocks = {}

for c in classTuples:
    block = {}
    for attributeName in attributesName:
        block[attributeName] = attr_vectors[attributeName][c]
    block[className] = class_vector[c]
    blocks[c] = block

# Create one DataFrame per class
df_blocks = {c: pd.DataFrame(block) for c, block in blocks.items()}

# Concatenate all class DataFrames
synthetic_df = pd.concat(df_blocks.values(), ignore_index=True)

# Final shuffle of rows
synthetic_df = synthetic_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Display results
synthetic_df.head(), synthetic_df["income"].value_counts(), toy_df["income"].value_counts()

(     job education income
 0   blue       low  <=50K
 1   blue      high  <=50K
 2  white       low   >50K
 3   blue       low  <=50K
 4  white      high   >50K,
 income
 <=50K    3
 >50K     3
 Name: count, dtype: int64,
 income
 <=50K    3
 >50K     3
 Name: count, dtype: int64)

## 10) Interpretation: what is preserved and what can change?

Even on this toy dataset, we should expect:

Typically preserved (in expectation):

- the overall class balance $P(C)$

- conditional structure $P(A \mid C)$

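Allowed to change (due to DP noise and per-attribute sampling):

- exact histogram counts

- exact correlations / small-sample artifacts; dependencies between attributes are kept only as far as the class explains them, because each attribute is sampled independently given the class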

- some rare combinations may disappear or appear

This is the intended privacy-utility tradeoff controlled by $\varepsilon$.
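
One way to check these expectations is to compare the empirical $P(C)$ and $P(A \mid C)$ of the original and the synthetic table. The sketch below uses `pd.crosstab` and re-enters `synthetic_toy` from the output shown in section 1; the helper name `conditional_table` is an assumption introduced for this illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "job": ["blue", "blue", "white", "white", "blue", "white"],
    "education": ["low", "low", "high", "high", "high", "low"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})

# Synthetic rows copied from the printed output of generate(toy_df, epsilon=1.0)
synthetic_toy = pd.DataFrame({
    "job": ["blue", "blue", "white", "blue", "white", "white"],
    "education": ["low", "high", "low", "high", "low", "high"],
    "income": ["<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K"],
})

def conditional_table(df, attr, cls):
    """Empirical P(attr | cls): each column (class) sums to 1."""
    return pd.crosstab(df[attr], df[cls], normalize="columns")

# Class balance P(C)
class_real = toy_df["income"].value_counts(normalize=True)
class_synth = synthetic_toy["income"].value_counts(normalize=True)

# Conditional structure P(education | income)
real = conditional_table(toy_df, "education", "income")
synth = conditional_table(synthetic_toy, "education", "income")
max_gap = (real - synth).abs().to_numpy().max()

print(class_real, class_synth, sep="\n")
print(real, synth, sep="\n")
print("largest absolute gap in P(education | income):", max_gap)
```

On six rows a single swapped value already moves a conditional probability by one third, so gaps of this size say little about the behaviour on larger datasets.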

## 11) Connection to the full experimental study

This notebook demonstrates the mechanics of the pipeline with a tiny dataset.
In the full study, the exact same function is applied to real datasets, and we evaluate utility via classification performance for multiple
$\varepsilon$ values with repeated runs.

The toy walkthrough is meant to make the experimental pipeline fully understandable.