# Differential Privacy: Overview, Advantages, and Limitations

## What is Differential Privacy?
Differential Privacy (DP) is a mathematical framework for providing privacy guarantees while analyzing and sharing data. The core idea is to ensure that including or excluding any single individual's data changes the probability of any given analysis output only slightly, so the output reveals little about whether that individual took part. This is typically achieved by adding carefully calibrated random noise to the data or to the results of computations.
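
As a minimal sketch of what "calibrated noise" can mean, the Laplace mechanism (one standard way to obtain ε-DP for numeric queries) adds noise whose scale is the query's sensitivity divided by ε. The function name `laplace_count` and the toy data are illustrative assumptions, not part of any library.

```python
import numpy as np

def laplace_count(values, predicate, epsilon, rng):
    """Release a count under epsilon-DP using the Laplace mechanism (sketch)."""
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1.
    sensitivity = 1.0
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

rng = np.random.default_rng(0)
ages = [23, 35, 41, 52, 67, 29, 38]  # toy data, for illustration only
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_count(ages, lambda a: a >= 40, eps, rng)
    print(f"epsilon={eps:>4}: noisy count = {noisy:.2f} (true count = 3)")
```

Smaller values of ε give a larger noise scale, so the released count tends to sit further from the true value.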

The privacy guarantee is quantified by two parameters:
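- **Epsilon (ε)**: The privacy budget, which bounds how much any single individual's data can change the probability of an output. It controls the trade-off between privacy and accuracy: smaller ε provides stronger privacy but potentially less accurate results.
- **Delta (δ)**: Roughly, the probability that the ε guarantee fails to hold. Pure DP sets δ = 0; relaxed versions, known as (ε, δ)-DP or approximate DP, allow a small δ > 0.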
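
For reference, the standard (ε, δ) guarantee behind these two parameters can be written as follows. The symbols are introduced here for illustration: M is a randomized mechanism, D and D' are any two datasets that differ in one individual's record, and S is any set of possible outputs.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
```

Setting δ = 0 recovers pure ε-DP.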

## Main Advantages
1. **Strong Privacy Guarantees**: Protects individuals' data even against adversaries with auxiliary knowledge.
2. **Mathematical Rigor**: Provides provable and quantifiable privacy guarantees.
3. **Scalability**: Suitable for large datasets and machine learning models.
4. **Flexibility**: Can be applied to various applications, including statistics, machine learning, and synthetic data generation.
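5. **Resilience**: The guarantee holds up when multiple analyses are conducted on the same dataset: the privacy losses of the individual analyses combine into a quantifiable total rather than failing unpredictably (composition property).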
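
The composition property can be sketched with basic sequential composition, under which the ε values of analyses run on the same data simply add up. The helper name and the budget figures below are hypothetical; tighter accounting methods exist and are not shown here.

```python
def total_privacy_loss(epsilons):
    """Basic sequential composition: per-analysis epsilons on the same data add up."""
    if any(e <= 0 for e in epsilons):
        raise ValueError("each epsilon must be positive")
    return sum(epsilons)

budget = 1.0                 # hypothetical overall privacy budget
planned = [0.25, 0.25, 0.5]  # hypothetical per-query epsilons
spent = total_privacy_loss(planned)
print(f"total epsilon spent: {spent}, within budget: {spent <= budget}")
```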

## Main Disadvantages
1. **Utility Loss**: Adding noise to ensure privacy can degrade the accuracy of results, especially with smaller datasets or low ε values.
2. **Complex Implementation**: Requires careful tuning of privacy parameters and understanding of the underlying mathematics.
3. **Resource Intensive**: In some cases, computational requirements increase due to the additional noise and constraints.
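
To make the utility loss concrete, the following sketch releases a mean with Laplace noise and compares the average error across dataset sizes and ε values. It assumes values clipped to a known range and a publicly known dataset size; `dp_mean` and the synthetic data are illustrative, not taken from a library.

```python
import numpy as np

def dp_mean(x, lower, upper, epsilon, rng):
    """Epsilon-DP mean of values clipped to [lower, upper] (sketch, n treated as public)."""
    x = np.clip(x, lower, upper)
    # Replacing one record moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(x)
    return x.mean() + rng.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
for n in (100, 10_000):
    data = rng.uniform(0, 100, size=n)  # synthetic data, illustration only
    for eps in (0.01, 1.0):
        errors = [abs(dp_mean(data, 0, 100, eps, rng) - data.mean())
                  for _ in range(200)]
        print(f"n={n:>6}, epsilon={eps:<4}: mean abs error = {np.mean(errors):.3f}")
```

The noise scale is proportional to 1 / (n · ε), which is why small datasets and low ε values hurt accuracy the most.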

## Limitations
1. **Requires Large Datasets**: Differential Privacy works best with large datasets to mitigate utility loss from added noise.
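2. **No Absolute Privacy**: Privacy guarantees are probabilistic. ε bounds how much an output can reveal about any individual rather than reducing it to zero, and with δ > 0 there is a small probability that even this bound fails.
3. **Not a Universal Solution**: DP limits what an analysis reveals as a result of a given individual's participation; it does not eliminate all privacy risks (e.g., adversaries might still infer information about someone indirectly by combining population-level results with external data).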
4. **Interpretability Challenges**: Non-technical stakeholders may find it difficult to interpret and understand ε and δ parameters.
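
One way to make ε more tangible for stakeholders is to translate it into the factor e^ε, the largest ratio by which one individual's data can change the probability of any output under pure ε-DP. The helper name below is an illustrative assumption.

```python
import math

def max_probability_ratio(epsilon):
    """Worst-case factor by which any output's probability can change (pure epsilon-DP)."""
    return math.exp(epsilon)

for eps in (0.01, 0.1, 1.0, 5.0):
    ratio = max_probability_ratio(eps)
    print(f"epsilon={eps:<4}: output probabilities differ by at most a factor of {ratio:.2f}")
```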

## Conclusion
Differential Privacy is a powerful tool for balancing the need for data utility and individual privacy. However, its practical implementation requires careful consideration of privacy-utility trade-offs, parameter tuning, and domain-specific requirements.


In [None]:
!pip install diffprivlib

import diffprivlib.models as dp

In [19]:
from sklearn import datasets
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

In [8]:
import numpy as np

# UCI Adult dataset. The five numeric columns used as features are
# age, education-num, capital-gain, capital-loss, and hours-per-week.
DATA_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
TEST_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"
FEATURE_COLS = (0, 4, 10, 11, 12)
LABEL_COL = 14  # income label

X_train = np.loadtxt(DATA_URL, usecols=FEATURE_COLS, delimiter=",")
y_train = np.loadtxt(DATA_URL, usecols=LABEL_COL, dtype=str, delimiter=",")

# adult.test begins with a non-data header line, hence skiprows=1.
X_test = np.loadtxt(TEST_URL, usecols=FEATURE_COLS, delimiter=",", skiprows=1)
y_test = np.loadtxt(TEST_URL, usecols=LABEL_COL, dtype=str, delimiter=",", skiprows=1)

# Test labels carry a trailing "." (e.g. " >50K."); strip it to match the training labels.
y_test = np.array([a[:-1] for a in y_test])


In [13]:
from sklearn.naive_bayes import GaussianNB

nonprivate_clf = GaussianNB()
nonprivate_clf.fit(X_train, y_train)

print("Non-private test accuracy: %.2f%%" %
      (accuracy_score(y_test, nonprivate_clf.predict(X_test)) * 100))

dp_clf = dp.GaussianNB(epsilon=0.01)

dp_clf.fit(X_train, y_train)

print("Differentially private test accuracy (epsilon=%.2f): %.2f%%" %
      (dp_clf.epsilon, accuracy_score(y_test, dp_clf.predict(X_test)) * 100))

Non-private test accuracy: 79.64%
Differentially private test accuracy (epsilon=0.01): 78.30%




In [16]:
from sklearn.linear_model import LogisticRegression

lr = Pipeline([
    ('scaler', MinMaxScaler()),
    ('clf', LogisticRegression(solver="lbfgs"))
])

lr.fit(X_train, y_train)

print("Non-private test accuracy: %.2f%%" % (accuracy_score(y_test, lr.predict(X_test)) * 100))

dp_lr = Pipeline([
    ('scaler', MinMaxScaler()),
    ('clf', dp.LogisticRegression(epsilon=0.01))
])

dp_lr.fit(X_train, y_train)
print("Differentially private test accuracy (epsilon=%.2f): %.2f%%" %
      (dp_lr['clf'].epsilon, accuracy_score(y_test, dp_lr.predict(X_test)) * 100))


Non-private test accuracy: 81.04%
Differentially private test accuracy (epsilon=0.01): 76.38%




In [25]:
dataset = datasets.load_diabetes()

X_train, X_test, y_train, y_test = train_test_split(dataset.data,
                                                    dataset.target, test_size=0.2)

print("Train examples: %d, Test examples: %d" % (X_train.shape[0],
                                                 X_test.shape[0]))

Train examples: 353, Test examples: 89


In [28]:
from sklearn.linear_model import LinearRegression as sk_LinearRegression

regr = sk_LinearRegression()

regr.fit(X_train, y_train)

baseline = r2_score(y_test, regr.predict(X_test))
print("Non-private baseline: %.2f" % baseline)

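from diffprivlib.models import LinearRegression

# No bounds_X / bounds_y are passed here, so diffprivlib has to work with
# unspecified data bounds and emits the warning shown after the output.
regr = LinearRegression(epsilon=0.01)
regr.fit(X_train, y_train)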

print("R2 score for epsilon=%.2f: %.2f" % (regr.epsilon, r2_score(y_test, regr.predict(X_test))))

Non-private baseline: 0.44
R2 score for epsilon=0.01: -28921958697987.12


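diffprivlib warning (truncated in this export): `bounds_X` and `bounds_y` were not passed to the model. This will result in additional privacy leakage. To ensure differential privacy with no additional privacy loss, specify `bounds_X` and `bounds_y`.

The very negative R2 score is consistent with the utility loss described earlier: with only 353 training examples and epsilon=0.01, the added noise can dominate the fitted coefficients.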
