# Local Differential Privacy: Overview, Advantages, and Limitations

## What is Local Differential Privacy?
Local Differential Privacy (LDP) is a variant of Differential Privacy (DP) in which the privacy protection is applied on the user's device, before any data is shared with a central server or aggregator. Each individual's data is randomized locally, so the raw data is never exposed, not even to the data collector.

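As in standard DP, the privacy guarantee in LDP is controlled by the parameter **Epsilon (ε)**, the privacy budget:
- Smaller ε values provide stronger privacy at the cost of reduced utility.
- Larger ε values preserve more utility but weaken the privacy guarantee.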
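
For reference, the role of ε can be stated with the standard formal definition of ε-LDP. The symbols below are notation introduced here for illustration and do not appear in the text above: $M$ is the randomizer run on each device, $x$ and $x'$ are any two possible inputs of a single user, and $y$ is any possible output.

```latex
\Pr[M(x) = y] \;\le\; e^{\varepsilon} \, \Pr[M(x') = y]
\qquad \text{for all inputs } x, x' \text{ and all outputs } y
```

In words: whatever report the aggregator sees, it is at most $e^{\varepsilon}$ times more likely under one input than under any other, which is why a small ε limits what can be inferred about an individual.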

### Key Difference Between LDP and Standard DP
- In **standard (central) DP**, a trusted curator collects the raw data and adds noise to the results it releases.
- In **LDP**, each user adds noise locally, and the aggregator only ever receives noisy data.
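
The local model can be illustrated with binary randomized response, a classic LDP mechanism. The following is a minimal sketch rather than a production implementation: the function names, the synthetic population, and the choice of ε = 1.0 are all assumptions made for illustration.

```python
import math
import random


def randomized_response(true_bit: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_bit if rng.random() < p_truth else 1 - true_bit


def estimate_proportion(reports, epsilon: float) -> float:
    """Debias the mean of the noisy reports to estimate the true share of 1s."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = (1 - p_truth) + true_share * (2 * p_truth - 1)
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)


rng = random.Random(0)
true_bits = [1] * 3000 + [0] * 7000  # synthetic population with 30% ones
# Each "user" randomizes locally; the aggregator only sees `reports`.
reports = [randomized_response(b, epsilon=1.0, rng=rng) for b in true_bits]
estimate = estimate_proportion(reports, epsilon=1.0)
print(f"true share: 0.300, estimate from noisy reports: {estimate:.3f}")
```

The aggregator never sees `true_bits`, yet it can still recover the population share approximately, because the flipping probability is known and can be inverted on average.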

## Main Advantages
1. **Stronger Privacy**: Raw data is never shared, providing a higher level of trust for users.
2. **No Trusted Aggregator Needed**: Users do not have to trust the central server, because the privacy mechanism runs locally; they only need to trust the software on their own device.
3. **Scalability**: Ideal for decentralized and large-scale systems, such as surveys, IoT networks, and distributed data collection.
4. **Compliance**: Helps organizations adhere to privacy regulations by ensuring individual data remains private.

## Main Disadvantages
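1. **High Noise Levels**: Because every user randomizes their own value, LDP needs far more total noise than standard DP for the same ε. For a simple count over n users, the estimation error grows on the order of √n/ε, compared with roughly 1/ε when a trusted curator adds noise once, so the utility loss is often significant.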
2. **Lower Data Utility**: Aggregated results may be less accurate, especially for small datasets or low privacy budgets (small ε).
3. **Limited Applicability**: LDP is better suited for aggregate queries (e.g., averages, histograms) and may not perform well for complex analyses or machine learning tasks.
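
The noise gap between the two models can be made concrete for a counting query. This sketch compares closed-form standard deviations under two assumed mechanisms that the text does not prescribe: Laplace noise with scale 1/ε in the central model, and binary randomized response in the local model. The function names are hypothetical.

```python
import math


def central_laplace_std(epsilon: float) -> float:
    """Std. dev. of Laplace noise with scale 1/epsilon (count query, sensitivity 1)."""
    return math.sqrt(2.0) / epsilon


def local_rr_std(n: int, epsilon: float) -> float:
    """Std. dev. of the debiased count estimate under binary randomized response."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    # Each noisy report is a Bernoulli variable with variance p * (1 - p).
    return math.sqrt(n * p * (1.0 - p)) / (2.0 * p - 1.0)


for n in (1_000, 100_000):
    for eps in (0.5, 1.0, 2.0):
        print(f"n={n:>7} eps={eps}: "
              f"central std={central_laplace_std(eps):6.2f}  "
              f"local std={local_rr_std(n, eps):8.2f}")
```

The central error does not depend on n, while the local error grows with √n, which is why LDP deployments typically rely on very large user populations.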

## Limitations
1. **Utility-Privacy Trade-off**: Achieving high privacy in LDP often severely impacts the utility of the data.
2. **Inefficiency for Correlated Data**: LDP works independently on each data point, which may not capture correlations between data effectively.
3. **Scaling Challenges for Complex Tasks**: Applying LDP to more advanced machine learning models or multidimensional data can be computationally intensive and require sophisticated algorithms.
4. **Interpretability**: Similar to DP, the ε parameter in LDP is not intuitive for non-technical stakeholders.
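
One way to make ε more tangible for non-technical stakeholders is to translate it into the chance that a user's report is truthful. The sketch below assumes binary randomized response as the mechanism, which is an illustrative choice; `truth_probability` is a hypothetical helper name.

```python
import math


def truth_probability(epsilon: float) -> float:
    """Probability of reporting the true bit under binary randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


for eps in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"eps={eps:<4} likelihood-ratio bound e^eps={math.exp(eps):7.2f}  "
          f"truthful-answer probability={truth_probability(eps):.3f}")
```

At ε = 0 every report is a fair coin flip and carries no information; as ε grows, the report becomes almost always truthful and the protection correspondingly weaker.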

## Conclusion
Local Differential Privacy is an excellent approach for scenarios where users are unwilling to share raw data and trust in central aggregators is minimal. However, its noise-heavy nature makes it less suitable for applications requiring high-accuracy results or detailed data analysis. LDP is particularly well-suited for simple queries and privacy-sensitive domains like surveys, telemetry, and user behavior tracking.
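
As an illustration of the kind of simple query LDP handles well, the sketch below estimates a histogram over a small categorical domain using k-ary randomized response. The mechanism choice, the function names, the category labels, and the synthetic shares are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def k_rr(value, domain, epsilon, rng):
    """Keep the true value w.p. e^eps / (e^eps + k - 1), else report another one."""
    k = len(domain)
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_keep:
        return value
    return rng.choice([v for v in domain if v != value])


def estimate_histogram(reports, domain, epsilon):
    """Debias the observed report frequencies into estimates of the true shares."""
    k = len(domain)
    e = math.exp(epsilon)
    p = e / (e + k - 1)      # probability of keeping the true value
    q = 1.0 / (e + k - 1)    # probability of reporting one specific other value
    counts = Counter(reports)
    n = len(reports)
    return {v: (counts[v] / n - q) / (p - q) for v in domain}


rng = random.Random(0)
domain = ["A", "B", "C"]  # hypothetical survey answers
truth = ["A"] * 12000 + ["B"] * 6000 + ["C"] * 2000  # shares 0.6, 0.3, 0.1
reports = [k_rr(v, domain, epsilon=2.0, rng=rng) for v in truth]
hist = estimate_histogram(reports, domain, epsilon=2.0)
print({v: round(share, 3) for v, share in hist.items()})
```

The estimated shares always sum to 1, but an individual estimate can fall slightly below 0 for rare categories; that is a normal side effect of debiasing noisy counts.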

For a real-world LDP system, see Google's RAPPOR: https://github.com/google/rappor


In [1]:
!pip install sklearn-pandas

Collecting sklearn-pandas
  Downloading sklearn_pandas-2.2.0-py2.py3-none-any.whl.metadata (445 bytes)
Downloading sklearn_pandas-2.2.0-py2.py3-none-any.whl (10 kB)
Installing collected packages: sklearn-pandas
Successfully installed sklearn-pandas-2.2.0


In [6]:
!pip install kagglehub

Collecting kagglehub
  Downloading kagglehub-0.3.6-py3-none-any.whl.metadata (30 kB)
Collecting tqdm (from kagglehub)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Downloading kagglehub-0.3.6-py3-none-any.whl (51 kB)
Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, kagglehub
Successfully installed kagglehub-0.3.6 tqdm-4.67.1


In [7]:
import kagglehub
import matplotlib.pylab as pl
import matplotlib.patches as patches
import pandas as pd



In [8]:
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

In [9]:
# Download latest version
path = kagglehub.dataset_download("johnolafenwa/us-census-data")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/johnolafenwa/us-census-data?dataset_version_number=1...


100%|██████████| 703k/703k [00:00<00:00, 977kB/s]

Extracting files...
Path to dataset files: /home/rkruger/.cache/kagglehub/datasets/johnolafenwa/us-census-data/versions/1





In [11]:
names = ('age', 'workclass', 'fnlwgt', 'education', 'education-num',
         'marital-status', 'occupation', 'relationship',
         'race', 'sex', 'capital-gain', 'capital-loss',
         'hours-per-week', 'native-country', 'income',)

categorical = set(('workclass', 'education', 'marital-status',
                   'occupation', 'relationship', 'sex',
                   'native-country', 'race', 'income',))

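# NOTE: this reads a machine-specific local copy of the UCI Adult data set, not
# the kagglehub `path` printed by the previous cell; adjust it for your environment.
df = pd.read_csv(
    "/home/rkruger/github.com/rodkruger/ppml-python/archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    sep=",", header=None, names=names, index_col=False, engine='python')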

df.head()
df.nunique()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


age                  73
workclass             9
fnlwgt            21648
education            16
education-num        16
marital-status        7
occupation           15
relationship          6
race                  5
sex                   2
capital-gain        119
capital-loss         92
hours-per-week       94
native-country       42
income                2
dtype: int64

In [13]:
for name in categorical:
    df[name] = df[name].astype('category')


def get_spans(df, partition, scale=None):
    spans = {}

    for column in df.columns:
        if column in categorical:
            span = len(df[column][partition].unique())
        else:
            span = df[column][partition].max() - df[column][partition].min()
        if scale is not None:
            span = span / scale[column]
        spans[column] = span
        print("Column:", column, "Span:", span)

    return spans


full_spans = get_spans(df, df.index)

Column: age Span: 73
Column: workclass Span: 9
Column: fnlwgt Span: 1472420
Column: education Span: 16
Column: education-num Span: 15
Column: marital-status Span: 7
Column: occupation Span: 15
Column: relationship Span: 6
Column: race Span: 5
Column: sex Span: 2
Column: capital-gain Span: 99999
Column: capital-loss Span: 4356
Column: hours-per-week Span: 98
Column: native-country Span: 42
Column: income Span: 2


In [17]:
def split(df, partition, column):
    dfp = df[column][partition]
    if column in categorical:
        values = dfp.unique()
        lv = set(values[:len(values) // 2])
        rv = set(values[len(values) // 2:])
        return dfp.index[dfp.isin(lv)], dfp.index[dfp.isin(rv)]
    else:
        median = dfp.median()
        dfl = dfp.index[dfp < median]
        dfr = dfp.index[dfp >= median]
        return (dfl, dfr)


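def is_k_anonymous(df, partition, sensitive_column, k=3):
    # A partition is k-anonymous when it contains at least k rows.
    # df and sensitive_column are unused here; they keep the signature
    # compatible with the other is_valid predicates.
    return len(partition) >= k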


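def partition_dataset(df, feature_columns, sensitive_column,
                      scale, is_valid):
    # Mondrian-style greedy partitioning: keep splitting each partition along
    # its widest (normalized) column as long as both halves stay valid.
    finished_partitions = []
    partitions = [df.index]
    while partitions:
        partition = partitions.pop(0)
        spans = get_spans(df[feature_columns], partition, scale)
        # Try the columns from the widest to the narrowest relative span.
        for column, span in sorted(spans.items(), key=lambda x: -x[1]):
            lp, rp = split(df, partition, column)
            if not is_valid(df, lp, sensitive_column) or \
                    not is_valid(df, rp, sensitive_column):
                continue
            partitions.extend((lp, rp))
            break
        else:
            # No column produced a valid split, so this partition is final.
            finished_partitions.append(partition)
    return finished_partitions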

In [19]:
feature_columns = ['age', 'education-num']
sensitive_column = 'income'
finished_partitions = partition_dataset(df, feature_columns, sensitive_column, full_spans, is_k_anonymous)

Column: age Span: 1.0
Column: education-num Span: 1.0
Column: age Span: 0.2602739726027397
Column: education-num Span: 1.0
Column: age Span: 0.726027397260274
Column: education-num Span: 1.0
Column: age Span: 0.2602739726027397
Column: education-num Span: 0.5333333333333333
Column: age Span: 0.2602739726027397
Column: education-num Span: 0.4
Column: age Span: 0.726027397260274
Column: education-num Span: 0.5333333333333333
Column: age Span: 0.726027397260274
Column: education-num Span: 0.4
Column: age Span: 0.2602739726027397
Column: education-num Span: 0.4666666666666667
Column: age Span: 0.2602739726027397
Column: education-num Span: 0.0
Column: age Span: 0.2602739726027397
Column: education-num Span: 0.0
Column: age Span: 0.2465753424657534
Column: education-num Span: 0.3333333333333333
Column: age Span: 0.1506849315068493
Column: education-num Span: 0.5333333333333333
Column: age Span: 0.5616438356164384
Column: education-num Span: 0.5333333333333333
Column: age Span: 0.10958904109

In [28]:
def agg_categorical_column(series):
    # Sort the distinct values so the generalized label is deterministic.
    return [','.join(sorted(set(series)))]


def agg_numerical_column(series):
    return [series.mean()]


def build_anonymized_dataset(df, partitions, feature_columns, sensitive_column, max_partitions=None):
    aggregations = {}
    for column in feature_columns:
        if column in categorical:
            aggregations[column] = agg_categorical_column
        else:
            aggregations[column] = agg_numerical_column

    rows = []
    for i, partition in enumerate(partitions):
        if i % 100 == 1:
            print("Finished {} partitions...".format(i))
        if max_partitions is not None and i >= max_partitions:
            break

        grouped_columns = df.loc[partition].agg(aggregations)
        if isinstance(grouped_columns, pd.Series):
            grouped_columns = grouped_columns.to_frame().T

        sensitive_counts = df.loc[partition].groupby(
            sensitive_column, observed=False
        ).agg({sensitive_column: 'count'})

        values = grouped_columns.iloc[0].to_dict()
        for sensitive_value, count in sensitive_counts[sensitive_column].items():
            if count == 0:
                continue
            values.update({
                sensitive_column: sensitive_value,
                'count': count,
            })
            rows.append(values.copy())

    return pd.DataFrame(rows)


dfn = build_anonymized_dataset(df, finished_partitions, feature_columns, sensitive_column)

dfn.head()

Finished 1 partitions...
Finished 101 partitions...
Finished 201 partitions...
Finished 301 partitions...
Finished 401 partitions...


| | age | education-num | income | count |
| --- | --- | --- | --- | --- |
| 0 | [17.555555555555557] | [6.0] | <=50K | 207 |
| 1 | [21.0] | [10.0] | <=50K | 371 |
| 2 | [21.0] | [10.0] | >50K | 1 |
| 3 | [25.487465181058496] | [10.0] | <=50K | 333 |
| 4 | [25.487465181058496] | [10.0] | >50K | 26 |


In [29]:
def diversity(df, partition, column):
    return len(df[column][partition].unique())

In [46]:
def is_l_diverse(df, partition, sensitive_column, l=2):
    return diversity(df, partition, sensitive_column) >= l

finished_l_diverse_partitions = partition_dataset(
    df, feature_columns, sensitive_column, full_spans,
    lambda *args: is_k_anonymous(*args) and is_l_diverse(*args)
)

column_x, column_y = feature_columns[:2]
dfl = build_anonymized_dataset(
    df, finished_l_diverse_partitions, feature_columns, sensitive_column
)

#print(dfl.sort_values([column_x, column_y, sensitive_column]))
dfl.head()


Column: age Span: 1.0
Column: education-num Span: 1.0
Column: age Span: 0.2602739726027397
Column: education-num Span: 1.0
Column: age Span: 0.726027397260274
Column: education-num Span: 1.0
Column: age Span: 0.2602739726027397
Column: education-num Span: 0.5333333333333333
Column: age Span: 0.2602739726027397
Column: education-num Span: 0.4
Column: age Span: 0.726027397260274
Column: education-num Span: 0.5333333333333333
Column: age Span: 0.726027397260274
Column: education-num Span: 0.4
Column: age Span: 0.2602739726027397
Column: education-num Span: 0.4666666666666667
Column: age Span: 0.2602739726027397
Column: education-num Span: 0.0
Column: age Span: 0.2602739726027397
Column: education-num Span: 0.0
Column: age Span: 0.2465753424657534
Column: education-num Span: 0.3333333333333333
Column: age Span: 0.1506849315068493
Column: education-num Span: 0.5333333333333333
Column: age Span: 0.5616438356164384
Column: education-num Span: 0.5333333333333333
Column: age Span: 0.10958904109

| | age | education-num | income | count |
| --- | --- | --- | --- | --- |
| 0 | [21.013875123885036] | [10.0] | <=50K | 1998 |
| 1 | [21.013875123885036] | [10.0] | >50K | 20 |
| 2 | [17.8] | [6.912418300653595] | <=50K | 764 |
| 3 | [17.8] | [6.912418300653595] | >50K | 1 |
| 4 | [20.084821428571427] | [9.0] | <=50K | 1117 |


In [47]:
global_freqs = {}
total_count = float(len(df))
group_counts = df.groupby(sensitive_column, observed=False)[sensitive_column].agg('count')
for value, count in group_counts.to_dict().items():
    p = count / total_count
    global_freqs[value] = p

print(global_freqs)

{' <=50K': 0.7591904425539756, ' >50K': 0.2408095574460244}


In [49]:
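def t_closeness(df, partition, column, global_freqs):
    # Largest absolute gap between a value's share inside the partition and
    # its share in the full data set (a simple distance for categorical columns).
    total_count = float(len(partition))
    d_max = None
    # observed=False keeps categories that are absent from the partition
    # (count 0) and makes the pandas default explicit, avoiding a FutureWarning.
    group_counts = df.loc[partition].groupby(column, observed=False)[column].agg('count')
    for value, count in group_counts.to_dict().items():
        p = count / total_count
        d = abs(p - global_freqs[value])
        if d_max is None or d > d_max:
            d_max = d
    return d_max


def is_t_close(df, partition, sensitive_column, global_freqs, p=0.2):
    # p is the maximum allowed distance (the "t" in t-closeness).
    if sensitive_column not in categorical:
        raise ValueError("this method only works for categorical values")
    return t_closeness(df, partition, sensitive_column, global_freqs) <= p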


finished_t_close_partitions = partition_dataset(
    df, feature_columns, sensitive_column, full_spans,
    lambda *args: is_k_anonymous(*args) and is_t_close(*args, global_freqs))

dft = build_anonymized_dataset(df, finished_t_close_partitions,
                               feature_columns, sensitive_column)

#print the header
# print(dft.sort_values([column_x, column_y, sensitive_column]))
dft.head()


Column: age Span: 1.0
Column: education-num Span: 1.0
Column: age Span: 0.2602739726027397
Column: education-num Span: 1.0
Column: age Span: 0.726027397260274
Column: education-num Span: 1.0
Column: age Span: 0.2602739726027397
Column: education-num Span: 0.5333333333333333
Column: age Span: 0.2602739726027397
Column: education-num Span: 0.4
Column: age Span: 0.1232876712328767
Column: education-num Span: 1.0
Column: age Span: 0.589041095890411
Column: education-num Span: 1.0
Column: age Span: 0.2602739726027397
Column: education-num Span: 0.0
Column: age Span: 0.2465753424657534
Column: education-num Span: 0.3333333333333333
Column: age Span: 0.0410958904109589
Column: education-num Span: 1.0
Column: age Span: 0.0684931506849315
Column: education-num Span: 1.0
Column: age Span: 0.0958904109589041
Column: education-num Span: 1.0
Column: age Span: 0.4794520547945205
Column: education-num Span: 1.0
Column: age Span: 0.2328767123287671
Column: education-num Span: 0.06666666666666667
...

Column: age Span: 0.0
Column: education-num Span: 1.0
Column: age Span: 0.0
Column: education-num Span: 1.0
Column: age Span: 0.0
Column: education-num Span: 1.0
Column: age Span: 0.0136986301369863
Column: education-num Span: 1.0
Column: age Span: 0.0684931506849315
Column: education-num Span: 0.2
Column: age Span: 0.0684931506849315
Column: education-num Span: 0.2
Column: age Span: 0.3972602739726027
Column: education-num Span: 0.2
Column: age Span: 0.3972602739726027
Column: education-num Span: 0.2
Column: age Span: 0.0136986301369863
Column: education-num Span: 0.4666666666666667
Column: age Span: 0.0410958904109589
Column: education-num Span: 0.4666666666666667
Column: age Span: 0.3972602739726027
Column: education-num Span: 0.0
Column: age Span: 0.3972602739726027
Column: education-num Span: 0.4
Column: age Span: 0.0
Column: education-num Span: 0.06666666666666667
Column: age Span: 0.0136986301369863
Column: education-num Span: 0.06666666666666667
Column: age Span: 0.013698630136

Column: age Span: 0.0
Column: education-num Span: 0.26666666666666666
Column: age Span: 0.0
Column: education-num Span: 0.26666666666666666
Column: age Span: 0.0
Column: education-num Span: 0.0
Column: age Span: 0.0136986301369863
Column: education-num Span: 0.0
Column: age Span: 0.0
Column: education-num Span: 0.06666666666666667
Column: age Span: 0.0136986301369863
Column: education-num Span: 0.06666666666666667
Column: age Span: 0.0410958904109589
Column: education-num Span: 0.06666666666666667
Column: age Span: 0.0410958904109589
Column: education-num Span: 0.0
Column: age Span: 0.0
Column: education-num Span: 0.0
Column: age Span: 0.0136986301369863
Column: education-num Span: 0.0
Column: age Span: 0.0
Column: education-num Span: 0.0
Column: age Span: 0.0136986301369863
Column: education-num Span: 0.0
Column: age Span: 0.0
Column: education-num Span: 0.0
Column: age Span: 0.0136986301369863
Column: education-num Span: 0.0
Column: age Span: 0.0
Column: education-num Span: 0.1333333


| | age | education-num | income | count |
| --- | --- | --- | --- | --- |
| 0 | [26.687216806261155] | [8.10970753810243] | <=50K | 6837 |
| 1 | [26.687216806261155] | [8.10970753810243] | >50K | 446 |
| 2 | [25.8130180399805] | [10.0] | <=50K | 3755 |
| 3 | [25.8130180399805] | [10.0] | >50K | 347 |
| 4 | [29.41486068111455] | [13.307120743034055] | <=50K | 2239 |
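
The three partition criteria used in this notebook (k-anonymity, l-diversity, and the maximum frequency gap behind `t_closeness`) can be checked by hand on a toy table. This is a self-contained sketch: the data, the three partitions, and the helper `check_partition` are made up for illustration and are not part of the notebook's pipeline.

```python
import pandas as pd

# Synthetic toy data; the values and the partitions below are invented.
toy = pd.DataFrame({
    "age": [25, 27, 31, 33, 52, 58, 61, 64],
    "income": ["<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K", ">50K", ">50K"],
})
toy_partitions = [toy.index[:3], toy.index[3:6], toy.index[6:]]
# Share of each income value in the whole table.
toy_global = toy["income"].value_counts(normalize=True).to_dict()


def check_partition(frame, partition, sensitive, global_freqs, k=3, l=2):
    """Evaluate the three criteria for one partition of row labels."""
    values = frame.loc[partition, sensitive]
    local = values.value_counts(normalize=True).to_dict()
    # Largest gap between the partition's share and the global share of a value.
    max_gap = max(abs(local.get(v, 0.0) - g) for v, g in global_freqs.items())
    return {
        "k_anonymous": len(partition) >= k,
        "l_diverse": int(values.nunique()) >= l,
        "max_gap": float(max_gap),
    }


toy_results = [check_partition(toy, part, "income", toy_global)
               for part in toy_partitions]
for part, result in zip(toy_partitions, toy_results):
    print(list(part), result)
```

The last partition has only two rows with a single income value, so it fails both the k-anonymity and the l-diversity check, and its frequency gap is the largest of the three.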
