---
## Definitions
- Let $\mathcal{A} = \{a_1, \ldots, a_k\}$ be the set of attributes, and let $\mathcal{D}(a_i)$ be the domain of attribute $a_i$.
- A dataset $D = \{r_1, \ldots, r_n\}$ is a set of rows, where each $r_i \in \mathcal{D}(a_1) \times \ldots \times \mathcal{D}(a_k)$ is a tuple of $k$ values and represents one individual in the population.
- Let $\mathcal{S} \subseteq \mathcal{A}$ be a subset of attributes. We write $r_i[\mathcal{S}]$ for the values of individual $r_i$ on the attributes in $\mathcal{S}$, e.g., $r_2[\{\text{age}, \text{education}\}] = (23, \text{Bachelor})$.

Assumptions about the adversary:
  - She knows $D$, $\mathcal{A}$ and all $\mathcal{D}(a_i)$.
  - She has a single target $r_t \in D$ (random person from the population) and knows some quasi-identifiers (QIDs) about the target, i.e., there is a subset of attributes $\mathcal{Q} \subseteq \mathcal{A}$ and she knows $r_t[\mathcal{Q}]$.
  - For re-identification: The adversary is trying to find out which row corresponds to her target.
  - For attribute-inference: The adversary is trying to infer the value of a sensitive attribute $a_s \in \mathcal{A}$.

---
## Example dataset
| age | education   | income |
|-----|-------------|--------|
| 20  | Master      | low    |
| 30  | High School | medium |
| 30  | High School | low    |
| 30  | PhD         | medium |
| 30  | PhD         | medium |
| 55  | Bachelor    | high   |
| 55  | Bachelor    | high   |
| 55  | Bachelor    | medium |

---
## Re-identification

**Prior vulnerability:** The adversary's chance of re-identifying a random target before observing the data.

Before learning the QIDs of her target, the best the adversary can do is to guess one of the $n$ rows uniformly at random, so her expected probability of guessing the correct row is

$\begin{equation}
1/n \text{ .}\nonumber
\end{equation}$

In the example, the prior vulnerability is $1/8$.

**Posterior vulnerability:** The adversary's chance of re-identifying a random target after observing the data.

After learning $r_t[\mathcal{Q}]$, the adversary keeps only the records $\{r_i~:~r_i[\mathcal{Q}] = r_t[\mathcal{Q}]\}$, and the best she can do is to guess one of the rows in this subset. Her expected chance of success is

$\begin{equation}
\frac{1}{n} \sum\limits_{r_t \in D} \frac{1}{|~\{r_i~:~r_i[\mathcal{Q}] = r_t[\mathcal{Q}]\}~|}\nonumber
\end{equation}$

In the example, assume the adversary knows the age and education of her target. The distinct QID combinations, with the number of matching rows, are:
- (20, Master): 1 row
- (30, High School): 2 rows
- (30, PhD): 2 rows
- (55, Bachelor): 3 rows

Each target in a group of $m$ rows is re-identified with probability $1/m$, so the posterior is $\Large \frac{1 + 2\cdot \frac{1}{2} + 2\cdot \frac{1}{2} + 3\cdot \frac{1}{3}}{8} = \frac{4}{8} = \frac{1}{2}$.
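The calculation above can be reproduced directly with pandas. This is a minimal sketch that follows the formula in this section; the helper name `reid_posterior` and the variable names are illustrative assumptions, not part of any library.

```python
import pandas as pd

# The example dataset from the table above.
df = pd.DataFrame({
    "age": [20, 30, 30, 30, 30, 55, 55, 55],
    "education": ["Master", "High School", "High School", "PhD", "PhD",
                  "Bachelor", "Bachelor", "Bachelor"],
    "income": ["low", "medium", "low", "medium", "medium",
               "high", "high", "medium"],
})

def reid_posterior(df, qids):
    """Hypothetical helper: mean over all targets of 1 / (size of the target's QID group)."""
    # For every row, the number of rows sharing the same QID values.
    group_sizes = df.groupby(qids)[qids[0]].transform("size")
    return float((1.0 / group_sizes).mean())

prior_reid_manual = 1 / len(df)                                   # 1/n
posterior_reid_manual = reid_posterior(df, ["age", "education"])

print(f"prior: {prior_reid_manual:.3f}, posterior: {posterior_reid_manual:.3f}")
# prior: 0.125, posterior: 0.500
```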

---
## Attribute Inference

Assume the sensitive attribute is *income*.

**Prior vulnerability:** The adversary's chance of correctly guessing the sensitive attribute value of a random target before observing the data.

Before learning the QIDs of her target, the best she can do is to guess the most frequent attribute value in the dataset, so her expected probability of success is

$\begin{equation}
\max\limits_{v \in \mathcal{D}(a_s)} \frac{|~\{r_i~:~r_i[a_s] = v\}~|}{n}\text{ .}\nonumber
\end{equation}$

In the example, the most frequent income is "medium", which appears in 4 of the 8 records, so the prior vulnerability is $4/8 = 1/2$.
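As a quick check of this arithmetic, the relative frequency of the most common value can be computed with `value_counts`. This is only a sketch of the formula above on the example column; the variable names are illustrative.

```python
import pandas as pd

# The income column of the example dataset.
income = pd.Series(["low", "medium", "low", "medium", "medium",
                    "high", "high", "medium"])

# Relative frequency of each value; the adversary guesses the most frequent one.
frequencies = income.value_counts(normalize=True)
best_guess = frequencies.idxmax()
prior_ai_manual = float(frequencies.max())

print(best_guess, prior_ai_manual)
# medium 0.5
```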

**Posterior vulnerability:** The adversary's chance of correctly guessing the sensitive attribute value of a random target after observing the data.

After learning $r_t[\mathcal{Q}]$, the adversary keeps only the records $\{r_i:~r_i[\mathcal{Q}] = r_t[\mathcal{Q}]\}$, and the best she can do is to guess the attribute value with the highest frequency in this subset. Her expected chance of success is

$\begin{equation}
\frac{1}{n} \sum\limits_{r_t \in D} \max\limits_{v \in \mathcal{D}(a_s)} \frac{|~\{r_i~:~r_i[\mathcal{Q}] = r_t[\mathcal{Q}] \wedge r_i[a_s] = v\}~|}{|~\{r_i~:~r_i[\mathcal{Q}] = r_t[\mathcal{Q}]\}~|}\text{ .}\nonumber
\end{equation}$

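In the example, the most frequent income covers 1 of 1 rows for (20, Master), 1 of 2 rows for (30, High School), 2 of 2 rows for (30, PhD) and 2 of 3 rows for (55, Bachelor), so the posterior vulnerability is $\Large\frac{1 + 2\cdot \frac{1}{2} + 2\cdot 1 + 3\cdot \frac{2}{3}}{8} = \frac{6}{8} = \frac{3}{4}$.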
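The same number can be obtained with a pandas `groupby`. Summing the per-target fraction over a group of $m$ rows gives $m \cdot \frac{c}{m} = c$, where $c$ is the count of the most frequent sensitive value in that group, so the sketch below simply adds up those counts and divides by $n$. The helper name `ai_posterior` is an illustrative assumption, not a library function.

```python
import pandas as pd

# The example dataset from the table above.
df = pd.DataFrame({
    "age": [20, 30, 30, 30, 30, 55, 55, 55],
    "education": ["Master", "High School", "High School", "PhD", "PhD",
                  "Bachelor", "Bachelor", "Bachelor"],
    "income": ["low", "medium", "low", "medium", "medium",
               "high", "high", "medium"],
})

def ai_posterior(df, qids, sensitive):
    """Hypothetical helper: expected success of guessing the group's most frequent value."""
    # For each QID group, the number of rows holding the most frequent sensitive value.
    best_counts = df.groupby(qids)[sensitive].agg(lambda s: s.value_counts().max())
    return float(best_counts.sum() / len(df))

posterior_ai_manual = ai_posterior(df, ["age", "education"], "income")
print(posterior_ai_manual)
# 0.75
```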


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
sys.path.append("/Users/ramongonze/projects/privattacks")
import privattacks

In [2]:
df = pd.DataFrame({
    "age":[20,30,30,30,30,55,55,55],
    "education":["Master", "High School", "High School", "PhD", "PhD", "Bachelor", "Bachelor", "Bachelor"],
    "income":["low", "medium", "low", "medium", "medium", "high", "high", "medium"]
})
display(df)

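|   | age | education   | income |
|---|-----|-------------|--------|
| 0 | 20  | Master      | low    |
| 1 | 30  | High School | medium |
| 2 | 30  | High School | low    |
| 3 | 30  | PhD         | medium |
| 4 | 30  | PhD         | medium |
| 5 | 55  | Bachelor    | high   |
| 6 | 55  | Bachelor    | high   |
| 7 | 55  | Bachelor    | medium |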


In [3]:
# Define quasi-identifiers and sensitive attribute
qids = ["age", "education"]
sensitive = "income"

data = privattacks.data.Data(dataframe=df)
attack = privattacks.attacks.Attack(data)
prior_reid = attack.prior_reid()
prior_ai = attack.prior_ai(sensitive)
posterior_reid = attack.posterior_reid(qids)
posterior_ai = attack.posterior_ai(qids, sensitive)

print(f"[Re-identification]\n\tPrior vulnerability: {prior_reid:.5f}\n\tPosterior vulnerability: {posterior_reid:.5f}")
print(f"\n[Attribute inference - {sensitive}]\n\tPrior vulnerability: {prior_ai[sensitive]:.5f}\n\tPosterior vulnerability: {posterior_ai[sensitive]:.5f}")

[Re-identification]
	Prior vulnerability: 0.12500
	Posterior vulnerability: 0.50000

[Attribute inference - income]
	Prior vulnerability: 0.33333
	Posterior vulnerability: 0.75000


In [4]:
# Using the optimized method for running both re-identification and attribute inference
posterior_reid, posterior_ai = attack.posterior_reid_ai(qids, sensitive)
print(f"[Re-identification]\n\tPosterior vulnerability: {posterior_reid:.5f}")
print(f"\n[Attribute inference - {sensitive}]\n\tPosterior vulnerability: {posterior_ai[sensitive]:.5f}")

[Re-identification]
	Posterior vulnerability: 0.50000

[Attribute inference - income]
	Posterior vulnerability: 0.75000


In [5]:
# Generating histograms of individual posterior vulnerability (vulnerability of each record)
posterior_reid, hist_reid = attack.posterior_reid(qids, distribution=True)
print("[Re-identification distribution]")
display(hist_reid)

posterior_ai, hist_ai = attack.posterior_ai(qids, sensitive, distribution=True)
print("[Attribute inference distribution]")
display(hist_ai)

[Re-identification distribution]


array([1.        , 0.5       , 0.5       , 0.5       , 0.5       ,
       0.33333333, 0.33333333, 0.33333333])

[Attribute inference distribution]


{'income': array([1.        , 0.5       , 0.5       , 1.        , 1.        ,
        0.66666667, 0.66666667, 0.66666667])}
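The per-record values shown above tie back to the posterior formulas: averaging them over the $n = 8$ records should recover the aggregate posteriors of 0.5 and 0.75. The sketch below checks this with plain NumPy, using the values copied from the outputs as exact fractions rather than calling the library.

```python
import numpy as np

# Per-record posterior vulnerabilities, written as exact fractions.
per_record_reid = np.array([1, 1/2, 1/2, 1/2, 1/2, 1/3, 1/3, 1/3])
per_record_ai = np.array([1, 1/2, 1/2, 1, 1, 2/3, 2/3, 2/3])

# The aggregate posterior is the mean over all possible targets.
mean_reid = per_record_reid.mean()
mean_ai = per_record_ai.mean()

print(f"{mean_reid:.5f} {mean_ai:.5f}")
# 0.50000 0.75000
```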

In [6]:
# Generating histograms of individual posterior vulnerability (vulnerability of each record) in the optimized method
(posterior_reid, hist_reid), (posteriors_ai, hist_ai) = attack.posterior_reid_ai(qids, sensitive, distribution=True)

print("[Re-identification histogram]")
display(hist_reid)

print("[Attribute inference histogram]")
display(hist_ai)

[Re-identification histogram]


array([1.        , 0.5       , 0.5       , 0.5       , 0.5       ,
       0.33333333, 0.33333333, 0.33333333])

[Attribute inference histogram]


array({'income': [np.float64(1.0), np.float64(0.5), np.float64(0.5), np.float64(1.0), np.float64(1.0), np.float64(0.6666666666666666), np.float64(0.6666666666666666), np.float64(0.6666666666666666)]},
      dtype=object)

In [9]:
# Run re-identification and attribute inference attacks for all combinations of quasi-identifiers
results_reid = attack.posterior_reid_subset(
    qids=qids,
    num_min=1,
    num_max=len(qids),
    n_processes=2
)
print("[Re-identification results]")
display(results_reid)


results_ai = attack.posterior_ai_subset(
    qids=qids,
    sensitive=sensitive,
    num_min=1,
    num_max=len(qids),
    distribution=True,
    n_processes=2
)
print("[Attribute inference results]")
display(results_ai)

[Re-identification results]


|   | n_qids | qids          | posterior_reid |
|---|--------|---------------|----------------|
| 0 | 1      | education     | 0.5            |
| 1 | 1      | age           | 0.375          |
| 2 | 2      | age,education | 0.5            |


[Attribute inference results]


|   | n_qids | qids          | posterior_income | posterior_income_record                            |
|---|--------|---------------|------------------|----------------------------------------------------|
| 0 | 1      | age           | 0.75             | [1.00000000, 0.75000000, 0.75000000, 0.7500000... |
| 1 | 1      | education     | 0.75             | [1.00000000, 0.50000000, 0.50000000, 1.0000000... |
| 2 | 2      | age,education | 0.75             | [1.00000000, 0.50000000, 0.50000000, 1.0000000... |


In [8]:
# Run using the optimized method
results = attack.posterior_reid_ai_subset(
    qids=qids,
    sensitive=sensitive,
    num_min=1,
    num_max=len(qids),
    n_processes=2
)

print("[Re-identification and attribute inference results]")
display(results)

[Re-identification and attribute inference results]


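|   | n_qids | qids          | posterior_reid | posterior_income |
|---|--------|---------------|----------------|------------------|
| 0 | 1      | education     | 0.5            | 0.75             |
| 1 | 1      | age           | 0.375          | 0.75             |
| 2 | 2      | age,education | 0.5            | 0.75             |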
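The subset results can be cross-checked by hand with `itertools.combinations` and pandas. This is a sketch of the formulas from the Re-identification and Attribute Inference sections, not a description of how `privattacks` is implemented; the name `manual_results` is an illustrative assumption. It uses the fact that summing $1/m$ over a group of $m$ rows gives 1, so the re-identification posterior equals the number of distinct QID groups divided by $n$.

```python
from itertools import combinations

import pandas as pd

# The example dataset used throughout this notebook.
df = pd.DataFrame({
    "age": [20, 30, 30, 30, 30, 55, 55, 55],
    "education": ["Master", "High School", "High School", "PhD", "PhD",
                  "Bachelor", "Bachelor", "Bachelor"],
    "income": ["low", "medium", "low", "medium", "medium",
               "high", "high", "medium"],
})
qids = ["age", "education"]
sensitive = "income"

rows = []
for size in range(1, len(qids) + 1):
    for subset in combinations(qids, size):
        groups = df.groupby(list(subset))
        # Re-identification: number of distinct QID groups over n.
        reid = groups.ngroups / len(df)
        # Attribute inference: sum of each group's majority count over n.
        majority = groups[sensitive].agg(lambda s: s.value_counts().max())
        ai = float(majority.sum() / len(df))
        rows.append({
            "n_qids": size,
            "qids": ",".join(subset),
            "posterior_reid": reid,
            f"posterior_{sensitive}": ai,
        })

manual_results = pd.DataFrame(rows)
print(manual_results)
```

On the example dataset this yields the same posteriors as the tables above (0.375, 0.5 and 0.5 for re-identification, 0.75 for income in all three cases), although the row order may differ.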
