# Fingerprinting Relational Data

In this notebook we show how to use our toolbox to embed a fingerprint into a dataset and how to detect one in suspicious data. We recommend reading [1] to familiarise yourself with the basics of relational data fingerprinting, its parameters and its terminology, since the toolbox uses the method from that paper as its algorithmic foundation.

We perform the following:
- 1) Import the dataset 
- 2) Define the fingerprinting scheme and its parameters (the fingerprint length and "gamma", the ratio #rows/#marks)
- 3) Embed the fingerprint into the data
- 4) Detect the fingerprint in the fingerprinted data
- 5) Additional examples:
    - 5.1) Save fingerprinted data to file
    - 5.2) Extraction from file
    - 5.3) Fingerprint a subset of columns

In [1]:
import pandas as pd
from scheme import Universal

## 1) Import data
Specify the path to the dataset (in csv format). Make sure that the first row of the file contains column names.

In [2]:
data_path = "datasets/adult.csv"

## 2) Defining the scheme
The class 'Universal' implements the fingerprinting scheme appropriate for both numerical and categorical data types. 

- The parameter 'gamma' is required. It is the main fingerprinting parameter and defines how many marks are expected to be embedded into the data: #marks = #rows / gamma. For example, if #rows(Adult dataset) = 48842 and gamma = 2, the total number of marks will be approximately 48842/2 = 24421. Due to the nature of the fingerprinting algorithm [1], the actual number of modified values will be approximately #marks/2, because about half of the marks set a data value to the value it already had.

- fingerprint_bit_length is an optional parameter that must be set to a power of 2; otherwise the insertion will throw an error. The default value is 32.

- number_of_recipients is an optional parameter that defines a maximum number of potential data recipients. The default is 100. The possible recipient IDs are then in a range [0, number_of_recipients]
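
The arithmetic behind 'gamma' can be checked with a few lines of plain Python. This is only a sketch of the formulas stated above; the helper names `expected_marks` and `is_power_of_two` are hypothetical and not part of the toolbox.

```python
def expected_marks(n_rows: int, gamma: float) -> float:
    """Expected number of marks: #marks = #rows / gamma."""
    return n_rows / gamma


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; mirrors the constraint on fingerprint_bit_length."""
    return n > 0 and (n & (n - 1)) == 0


n_rows = 48842                           # rows in the Adult dataset
marks = expected_marks(n_rows, gamma=2)
modifications = marks / 2                # about half of the marks leave the value unchanged

print(marks)                                      # 24421.0
print(modifications)                              # 12210.5
print(is_power_of_two(64), is_power_of_two(48))   # True False
```

Checking the fingerprint length up front this way avoids waiting for the insertion to fail on an invalid value.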

In [3]:
scheme = Universal(gamma=2, fingerprint_bit_length=64)

## 3) Fingerprint embedding

Fingerprint embedding is performed with the method 'insertion'. The following parameters must be specified:
- data (e.g. a file path (string), a pandas.DataFrame, ...)
- the owner's secret key, an arbitrary integer known only to the owner
- recipient ID 

In [4]:
fingerprinted_data = scheme.insertion(data_path, secret_key=12345678, recipient_id=0)

Start insertion algorithm...
	gamma: 2
	fingerprint length: 64

Generated fingerprint for recipient 0: 0100010000101011101010110001100011011011101111101000001100110001
Fingerprint inserted.
	marked tuples: ~50.26%
	single fingerprint bit embedded 383 times
Time: 12 sec.


We obtained a fingerprinted copy of the data intended for recipient 0.

From the printout above, we can see the parameters used for embedding and the exact fingerprint assigned to recipient 0.
Furthermore, we see that approximately 50% of the rows were marked, which is expected for the chosen gamma (i.e. 1 in 2 rows is marked).

All 64 fingerprint bits are used equally for marking (ca. 24421 marks in total). Therefore, each fingerprint bit has been embedded 383 times on average (roughly 24421/64; the small deviation is due to randomness).
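
The two figures above (about half of the rows marked, a few hundred embeddings per fingerprint bit) can be reproduced with a small simulation. The sketch below assumes that a keyed hash of the row identifier decides whether a row is marked and which fingerprint bit it carries, in the spirit of [1]; the functions `row_hash` and `simulate_marking` are hypothetical and do not reflect the toolbox's actual implementation.

```python
import hashlib
from collections import Counter


def row_hash(secret_key: int, row_id: int) -> int:
    """Keyed pseudorandom 64-bit integer for one row (assumed mechanism)."""
    digest = hashlib.sha256(f"{secret_key}|{row_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def simulate_marking(n_rows: int, gamma: int, fp_len: int, secret_key: int) -> Counter:
    """Count how often each fingerprint bit index would be embedded."""
    bit_usage = Counter()
    for row_id in range(n_rows):
        h = row_hash(secret_key, row_id)
        if h % gamma == 0:                        # row selected: 1 in gamma on average
            bit_usage[(h // gamma) % fp_len] += 1  # fingerprint bit carried by this row
    return bit_usage


n_rows = 48842
usage = simulate_marking(n_rows, gamma=2, fp_len=64, secret_key=12345678)
marked = sum(usage.values())

print(f"marked rows: {marked / n_rows:.2%}")
print(f"mean embeddings per bit: {marked / 64:.0f}")
```

Because the selection is pseudorandom, the marked fraction only fluctuates around 1/gamma, which is why the printout reports an approximate percentage.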

Let's try to spot some modifications:

In [5]:
#original
original = pd.read_csv(data_path)
original.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [6]:
fingerprinted_data.dataframe.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Wife,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Craft-repair,Wife,Black,Female,0,0,40,Cuba,<=50K


Comparing the two dataset snippets above, we can observe that:
- in row 3 the value for 'relationship' has changed from 'Husband' to 'Wife' and
- in row 4 the value for 'occupation' has changed from 'Prof-specialty' to 'Craft-repair'
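
Spotting such changes by eye does not scale, so a cell-by-cell comparison helps. The sketch below uses plain dictionaries holding rows 3 and 4 from the snippets above, reduced to two columns; `diff_cells` is a hypothetical helper, not a toolbox function.

```python
def diff_cells(original, modified):
    """Return (row, column, old, new) for every cell that differs."""
    changes = []
    for row_id, old_row in original.items():
        new_row = modified[row_id]
        for column, old_value in old_row.items():
            if new_row[column] != old_value:
                changes.append((row_id, column, old_value, new_row[column]))
    return changes


# Rows 3 and 4 from the two snippets above, reduced to two columns.
original_rows = {
    3: {"occupation": "Handlers-cleaners", "relationship": "Husband"},
    4: {"occupation": "Prof-specialty", "relationship": "Wife"},
}
fingerprinted_rows = {
    3: {"occupation": "Handlers-cleaners", "relationship": "Wife"},
    4: {"occupation": "Craft-repair", "relationship": "Wife"},
}

for change in diff_cells(original_rows, fingerprinted_rows):
    print(change)
# (3, 'relationship', 'Husband', 'Wife')
# (4, 'occupation', 'Prof-specialty', 'Craft-repair')
```

On the full data, pandas' `DataFrame.compare` should give a similar view, assuming both frames share the same index and columns.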

## 4) Fingerprint extraction

To extract the fingerprint and find the suspect, we use the 'detection' method of the scheme. To successfully extract the fingerprint, one must provide the owner's secret key.

In [7]:
suspect = scheme.detection(fingerprinted_data, secret_key=12345678)

Start detection algorithm...
	gamma: 2
	xi: 1
Potential fingerprint detected: 0100010000101011101010110001100011011011101111101000001100110001
Recipient 0 is suspected.
Runtime: 3 sec.


The detection algorithm successfully extracted the fingerprint that belongs to recipient 0 from the data. Observe how the detection fails when the wrong secret key is provided (i.e. only the owner can verify the recipient):

In [8]:
suspect = scheme.detection(fingerprinted_data, secret_key=123)

Start detection algorithm...
	gamma: 2
	xi: 1
Potential fingerprint detected: 1101111001001111100010010001100011100021110001010111110101000011
No one suspected.
Runtime: 3 sec.
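
The last step of both runs above, mapping an extracted bit string to a recipient or to nobody, can be sketched as follows. The sketch assumes that each recipient's fingerprint is derived from a hash of the secret key and the recipient ID, as in [1]; `make_fingerprint` and `find_suspect` are hypothetical names, and the bit strings they produce will not match the toolbox's output.

```python
import hashlib


def make_fingerprint(secret_key: int, recipient_id: int, length: int = 64) -> str:
    """Derive a recipient's bit string from the key and the ID (assumed scheme)."""
    digest = hashlib.sha256(f"{secret_key}|{recipient_id}".encode()).digest()
    bits = "".join(f"{byte:08b}" for byte in digest)  # 256 bits available
    return bits[:length]


def find_suspect(extracted: str, secret_key: int, recipient_ids=range(100)) -> int:
    """Return the recipient whose fingerprint matches, or -1 if nobody matches."""
    for recipient_id in recipient_ids:
        if make_fingerprint(secret_key, recipient_id, len(extracted)) == extracted:
            return recipient_id
    return -1


leaked = make_fingerprint(12345678, 0)   # fingerprint handed to recipient 0

print(find_suspect(leaked, secret_key=12345678))  # 0
print(find_suspect(leaked, secret_key=123))       # -1
```

In the real detection, a wrong key additionally yields a wrong extracted bit string, since the key also determines which values are read; the sketch covers the matching step only.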


## 5) Other examples
### 5.0) Embedding into a pandas.DataFrame

In [9]:
dataframe = pd.read_csv(data_path)

In [10]:
fingerprinted_data = scheme.insertion(dataframe, secret_key=12345678, recipient_id=0)

Start insertion algorithm...
	gamma: 2
	fingerprint length: 64

Generated fingerprint for recipient 0: 0100010000101011101010110001100011011011101111101000001100110001
Fingerprint inserted.
	marked tuples: ~50.26%
	single fingerprint bit embedded 383 times
Time: 12 sec.


### 5.1) Save the fingerprinted data to file

In [11]:
scheme.insertion(data_path, secret_key=12345678, recipient_id=0, write_to="fingerprinted/adult_0.csv")

Start insertion algorithm...
	gamma: 2
	fingerprint length: 64

Generated fingerprint for recipient 0: 0100010000101011101010110001100011011011101111101000001100110001
Fingerprint inserted.
	marked tuples: ~50.26%
	single fingerprint bit embedded 383 times
Time: 12 sec.


<datasets._dataset.Dataset at 0x1ef3c95af40>

### 5.2) Extract the fingerprint from the data file

In [12]:
scheme.detection("fingerprinted/adult_0.csv", secret_key=12345678)

Start detection algorithm...
	gamma: 2
	xi: 1
Potential fingerprint detected: 0100010000101011101010110001100011011011101111101000001100110001
Recipient 0 is suspected.
Runtime: 3 sec.


0

### 5.3) Fingerprint a subset of columns

The owner can specify which columns they want to fingerprint. This might be useful when some columns, due to their sensitivity, should stay unmodified, for example the target attribute for an assumed classification task for the dataset.

When embedding the fingerprint, one can do that by specifying a list of data column names to exclude: 

In [13]:
fingerprinted_data = scheme.insertion(dataframe, secret_key=12345678, recipient_id=0, exclude=['income'])

Start insertion algorithm...
	gamma: 2
	fingerprint length: 64

Generated fingerprint for recipient 0: 0100010000101011101010110001100011011011101111101000001100110001
Fingerprint inserted.
	marked tuples: ~50.26%
	single fingerprint bit embedded 383 times
Time: 11 sec.


Therefore, the 'income' column will stay in its original form: 

In [14]:
len(dataframe['income'].compare(fingerprinted_data.dataframe['income'])) == 0

True

Alternatively, specify the names of the columns to include in the fingerprint embedding:

In [15]:
fingerprinted_data = scheme.insertion(data_path, secret_key=12345678, recipient_id=0,
                                      include=['workclass','education','education-num','marital-status','occupation'],
                                      write_to='fingerprinted_data/adult_0_subset.csv')

Start insertion algorithm...
	gamma: 2
	fingerprint length: 64

Generated fingerprint for recipient 0: 0100010000101011101010110001100011011011101111101000001100110001
Fingerprint inserted.
	marked tuples: ~50.26%
	single fingerprint bit embedded 383 times
Time: 14 sec.
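
How the toolbox resolves 'include' and 'exclude' internally is not shown in this notebook. The sketch below is one plausible way to turn the two arguments into the list of columns eligible for marking; the helper `columns_to_mark` and the rule that only one of the two arguments may be given are assumptions made for illustration.

```python
def columns_to_mark(all_columns, include=None, exclude=None):
    """Resolve include/exclude into the ordered list of columns to mark."""
    if include is not None and exclude is not None:
        # Assumption: the two arguments are treated as mutually exclusive.
        raise ValueError("pass either 'include' or 'exclude', not both")
    if include is not None:
        unknown = set(include) - set(all_columns)
        if unknown:
            raise KeyError(f"unknown columns: {sorted(unknown)}")
        return [c for c in all_columns if c in include]
    excluded = set(exclude or [])
    return [c for c in all_columns if c not in excluded]


columns = ["age", "workclass", "education", "occupation", "income"]

print(columns_to_mark(columns, exclude=["income"]))
# ['age', 'workclass', 'education', 'occupation']
print(columns_to_mark(columns, include=["occupation", "workclass"]))
# ['workclass', 'occupation']
```

Keeping the original column order in the result makes the outcome independent of the order in which the names are passed.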


When extracting the fingerprint, the owner must again specify which columns were marked (the same holds when 'exclude' was used during embedding):

In [16]:
scheme.detection('fingerprinted_data/adult_0_subset.csv', secret_key=12345678, 
                 include=['workclass','education','education-num','marital-status','occupation'])

Start detection algorithm...
	gamma: 2
	xi: 1
Potential fingerprint detected: 0100010000101011101010110001100011011011101111101000001100110001
Recipient 0 is suspected.
Runtime: 3 sec.


0

Otherwise, the scheme does not extract the fingerprint correctly:

In [17]:
scheme.detection('fingerprinted_data/adult_0_subset.csv', secret_key=12345678)

Start detection algorithm...
	gamma: 2
	xi: 1
Potential fingerprint detected: 0100010110000111100110011110100111011001101111100110011101101010
No one suspected.
Runtime: 3 sec.


-1
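
One way to see why the column list matters: if the marked column of each selected row is chosen pseudorandomly among the candidate columns (an assumption about the mechanism, in the spirit of [1]), then a different candidate list makes the detector look at different cells. The sketch below compares the choices for the full column list and for the subset used above; `pick_column` is a hypothetical helper.

```python
import hashlib


def pick_column(secret_key: int, row_id: int, columns: list) -> str:
    """Pseudorandomly pick the column to mark in a row (assumed mechanism)."""
    digest = hashlib.sha256(f"{secret_key}|{row_id}".encode()).digest()
    return columns[int.from_bytes(digest[:8], "big") % len(columns)]


all_columns = ["age", "workclass", "fnlwgt", "education", "education-num",
               "marital-status", "occupation", "relationship", "race", "sex",
               "capital-gain", "capital-loss", "hours-per-week",
               "native-country", "income"]
subset = ["workclass", "education", "education-num", "marital-status", "occupation"]

secret_key = 12345678
n_rows = 1000
agreeing = sum(
    pick_column(secret_key, row_id, all_columns) == pick_column(secret_key, row_id, subset)
    for row_id in range(n_rows)
)

print(f"rows where both column lists select the same cell: {agreeing / n_rows:.1%}")
```

Under this assumption the two column lists rarely, if ever, select the same cell, so the detector mostly reads unmarked values and the extracted bit string matches no recipient.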

## References:
[1] Li, Yingjiu, Vipin Swarup, and Sushil Jajodia. "Fingerprinting relational databases: Schemes and specialties." IEEE Transactions on Dependable and Secure Computing 2.1 (2005): 34-45.