# Introduction

Data generated by Synthesized is privacy-preserving by design: the sampled data does not contain any row from the training set.
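
The claim above can be checked on any pair of datasets by counting exact row copies. The following is a minimal sketch in plain Python, not part of the Synthesized API; the function name `count_copied_rows` and the sample rows are hypothetical.

```python
from typing import Hashable, Iterable, Sequence

Row = Sequence[Hashable]


def count_copied_rows(original: Iterable[Row], synthetic: Iterable[Row]) -> int:
    """Count synthetic rows that exactly match some row of the original data."""
    original_rows = {tuple(row) for row in original}
    return sum(1 for row in synthetic if tuple(row) in original_rows)


# Hypothetical toy data: (name, age, city)
original = [("Alice", 34, "London"), ("Bob", 51, "Leeds")]
synthetic = [("Carol", 36, "London"), ("Dan", 49, "Leeds")]

print(count_copied_rows(original, synthetic))  # 0: no original row was reproduced
```

An exact-match check like this only detects verbatim copies; it says nothing about near-duplicates or about the attacks described later on this page.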

## Entity Annotation

If the dataset contains PII, the entity annotation module is available ([read more here](https://docs.synthesized.io/v1.4/user_guide/augmentation/annotations.html)). Annotating one or more columns as entities has the following benefits:
* **Privacy**. If a dataset contains personal data such as names, addresses, or bank accounts, and these columns are annotated as entities, the Synthesizer will automatically generate fake data for them.
* **Correlations**. Entities carry strict rules that need to be enforced, such as consistency between gender, title, and name, or between postcodes and cities. With entity annotation, the generator will sample coherent data for these fields.
* **Realistic data**. Annotated fields contain fake data that is not present in the original dataset but is still realistic. The Synthesizer can generate real UK addresses if required, as well as person details (names, phone numbers, emails, and passwords, among other fields) and bank accounts that look as real as possible.
* **User-defined Location**. Annotations have configurable fields, such as location. By default, persons have standard English names, but the annotation can also be configured to use other languages (such as Chinese or Russian, among others).
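
To illustrate what "coherent" sampling means for the correlated fields listed above, here is a minimal sketch in plain Python. It does not use the Synthesized annotation API; the lookup tables, the field names, and the function `sample_entity` are hypothetical stand-ins for whatever the generator uses internally.

```python
import random

# Hypothetical lookup tables; a real generator would draw on far larger ones.
PEOPLE = {
    "female": {"titles": ["Ms", "Mrs"], "names": ["Emma", "Olivia"]},
    "male": {"titles": ["Mr"], "names": ["James", "Oliver"]},
}
DISTRICT_TO_CITY = {"M1": "Manchester", "LS1": "Leeds", "B1": "Birmingham"}


def sample_entity(rng: random.Random) -> dict:
    """Sample one fake person whose correlated fields agree with each other."""
    # Draw the "parent" values first, then derive the dependent fields from them.
    gender = rng.choice(sorted(PEOPLE))
    district = rng.choice(sorted(DISTRICT_TO_CITY))
    return {
        "gender": gender,
        "title": rng.choice(PEOPLE[gender]["titles"]),
        "first_name": rng.choice(PEOPLE[gender]["names"]),
        "postcode_district": district,
        "city": DISTRICT_TO_CITY[district],
    }


rng = random.Random(0)  # seeded so repeated runs give the same rows
rows = [sample_entity(rng) for _ in range(5)]
for row in rows:
    print(row)
```

Sampling the title and name conditionally on gender, and the city from the postcode district, is what keeps each row internally consistent; sampling every column independently would not.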


## Differential Privacy

## Inference Attack

Synthesized’s privacy module provides various ways to assess the robustness of synthesized data against different types of attribute inference attacks. An attribute inference attack is one in which an adversary deduces, with significant probability, the value of a hidden sensitive attribute from the values of other attributes.

In practice, the attacker is assumed to have full access to the synthetic data and partial access to the original data. The attacker trains a model on the synthetic data and then uses the trained model to predict the unknown value of the sensitive attribute from the known attributes of the original data. It is therefore important to assess how vulnerable a synthetic dataset is to inference attacks, so that the privacy and confidentiality of the original data are preserved.

Synthesized provides two main classes to assess the risk of attribute inference attacks:

* `AttributeInferenceAttackML`
* `AttributeInferenceAttackCAP`

Either a Machine Learning (ML) model or a Correct Attribution Probability (CAP) model is fit to the synthetic data. The fitted model is then used to compute the privacy score of a sensitive column of the original dataset, using the other columns of the original dataset as predictors. Privacy scores lie between 0 and 1: 0 means negligible privacy and 1 means absolute privacy.


### Attribute Inference Attack using ML

`AttributeInferenceAttackML` trains a machine learning model to predict the sensitive attribute using the synthesized dataset. The fitted model is then used to predict the sensitive values in the original dataset. Finally, a privacy score is calculated based on the true value and the predicted value of the sensitive column in the original dataset.
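
The three steps above (fit on synthetic data, predict on original data, score) can be sketched in plain Python. This is an illustration only and does not call `AttributeInferenceAttackML`: a small k-nearest-neighbour classifier stands in for the ML model, and the score is assumed to be one minus the attacker's accuracy, which may differ from the metric Synthesized actually uses. All function names and data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def knn_predict(train_x: Sequence[Sequence[float]], train_y: Sequence[str],
                query_x: Sequence[Sequence[float]], k: int = 3) -> List[str]:
    """Predict a label for each query row by majority vote of its k nearest rows."""
    predictions = []
    for query in query_x:
        # Rank training rows by squared Euclidean distance to the query row.
        neighbours = sorted(
            zip(train_x, train_y),
            key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
        )[:k]
        votes = Counter(label for _, label in neighbours)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


def ml_privacy_score(synth_x, synth_y, orig_x, orig_y, k: int = 3) -> float:
    """Assumed score: 1 - accuracy of a model fit on synthetic data, tested on original data."""
    predicted = knn_predict(synth_x, synth_y, orig_x, k)
    accuracy = sum(p == t for p, t in zip(predicted, orig_y)) / len(orig_y)
    return 1.0 - accuracy


# Hypothetical data: predictors are (age, income in thousands), sensitive column is a risk band.
synth_x = [(25, 20), (30, 25), (52, 70), (60, 80)]
synth_y = ["low", "low", "high", "high"]
orig_x = [(27, 22), (58, 75)]
orig_y = ["low", "high"]

print(ml_privacy_score(synth_x, synth_y, orig_x, orig_y, k=1))  # 0.0: every sensitive value was inferred
```

In this toy case the attacker recovers every sensitive value, so the score is 0. The result depends on the model chosen, which is the dependence the CAP approach avoids.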

### Attribute Inference Attack using CAP

`AttributeInferenceAttackCAP` computes the privacy score using a Correct Attribution Probability (CAP) model, which measures the probability that an attribution is correct. Unlike the ML approach, it does not depend on the choice of an ML model or on its training.
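
One common way to compute a CAP-style score is sketched below in plain Python. It does not call `AttributeInferenceAttackCAP`, and the details are assumptions: for each original row, the synthetic rows sharing the same known attributes are collected, and the CAP of that row is the fraction of them carrying the true sensitive value (0 if there is no match). The privacy score is taken as one minus the mean CAP. Function and variable names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def cap_privacy_score(synth_keys: Sequence[Sequence[Hashable]],
                      synth_sensitive: Sequence[Hashable],
                      orig_keys: Sequence[Sequence[Hashable]],
                      orig_sensitive: Sequence[Hashable]) -> float:
    """Assumed score: 1 - mean correct attribution probability over original rows."""
    # For every combination of known attributes, count the sensitive values seen in synthetic data.
    counts: Dict[Tuple, Counter] = defaultdict(Counter)
    for key, value in zip(synth_keys, synth_sensitive):
        counts[tuple(key)][value] += 1

    total_cap = 0.0
    for key, true_value in zip(orig_keys, orig_sensitive):
        matches = counts.get(tuple(key))
        if matches:  # original rows with no matching synthetic row contribute a CAP of 0
            total_cap += matches[true_value] / sum(matches.values())
    return 1.0 - total_cap / len(orig_keys)


# Hypothetical data: known attributes are (gender, city), sensitive column is a diagnosis code.
synth_keys = [("F", "London"), ("F", "London"), ("M", "Leeds"), ("M", "Leeds")]
synth_sensitive = ["A", "B", "A", "A"]
orig_keys = [("F", "London"), ("M", "Leeds")]
orig_sensitive = ["A", "B"]

print(cap_privacy_score(synth_keys, synth_sensitive, orig_keys, orig_sensitive))  # 0.75
```

Here the first original row has a CAP of 1/2 and the second a CAP of 0, giving a mean CAP of 0.25 and a score of 0.75. No model is trained, so the result depends only on the data and on which attributes are treated as known.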


## Linkage Attack