## Lesson 1: Toy Differential Privacy - Simple Database Queries

## Differential privacy 
It ensures that the removal or addition of a single database item does not significantly affect the outcome of any analysis, thereby protecting individual privacy. This means that an observer cannot determine whether a particular individual's information is included in the dataset, even if they have access to the output of the analysis.


# Create a simple 
Step one is to create our database - we're going to do this by initializing a random list of 1s and 0s (which are the entries in our database). 

Note - the number of entries directly corresponds to the number of people in our database.


In [2]:
import torch

num_entries = 5000
db= torch.rand(num_entries)> 0.5
db

tensor([False, False,  True,  ..., False, False,  True])

## Project: Generate Parallel Databases

Key to the definition of differenital privacy is the ability to ask the question "When querying a database, if I removed someone from the database, would the output of the query be any different?". Thus, in order to check this, we must construct what we term "parallel databases" which are simply databases with one entry removed.

In this first project, I want you to create a list of every parallel database to the one currently contained in the "db" variable. Then, I want you to create a function which both:

1. creates the initial database (db)

2. creates all parallel databases

In [3]:
db.shape

torch.Size([5000])

In [4]:
db[0:2]

tensor([False, False])

In [5]:
remove_index=2


In [6]:
db[3:]

tensor([False,  True, False,  ..., False, False,  True])

In [7]:
#concatenate both tensor
torch.cat((db[0:2],db[3:]))[0:5]

tensor([False, False, False,  True, False])

In [8]:
# define function
def get_parallel_db(db,remove_index):
    return torch.cat((db[0:remove_index], db[remove_index+1:]))

In [9]:
get_parallel_db(db,2)[0:5]

tensor([False, False, False,  True, False])

In [10]:
# Singe datapoint removed
get_parallel_db(db,2).shape

torch.Size([4999])

In [11]:
#Parallel_db

def get_parallel_dbs(db):
    parallel_dbs=list()

    for i in range(len(db)):
        pdb=get_parallel_db(db,i)
        parallel_dbs.append(pdb)
    return parallel_dbs

In [None]:
pbds=get_parallel_dbs(db)
pbds

In [13]:
print(pbds)

[tensor([False,  True, False,  ..., False, False,  True]), tensor([False,  True, False,  ..., False, False,  True]), tensor([False, False, False,  ..., False, False,  True]), tensor([False, False,  True,  ..., False, False,  True]), tensor([False, False,  True,  ..., False, False,  True]), tensor([False, False,  True,  ..., False, False,  True]), tensor([False, False,  True,  ..., False, False,  True]), tensor([False, False,  True,  ..., False, False,  True]), tensor([False, False,  True,  ..., False, False,  True]), tensor([False, False,  True,  ..., False, False,  True]), tensor([False, False,  True,  ..., False, False,  True]), tensor([False, False,  True,  ..., False, False,  True]), tensor([False, False,  True,  ..., False, False,  True]), tensor([False, False,  True,  ..., False, False,  True]), tensor([False, False,  True,  ..., False, False,  True]), tensor([False, False,  True,  ..., False, False,  True]), tensor([False, False,  True,  ..., False, False,  True]), tensor([False

In [14]:
def create_db_and_parallels(num_entries):
    db=torch.rand(num_entries)>0.5
    pbds=get_parallel_dbs(db)

    return db,pbds
    

In [15]:
db,pbds=create_db_and_parallels(20)

In [16]:
db[0:8]

tensor([ True, False, False,  True, False,  True,  True, False])

In [None]:
pbds

Parallel database created with each record with single index missed.

# Day 5 -Evaluation of DP of function
Intuitively, we want to be able to query our database and evaluate whether or not the result of the query is leaking "private" information. As mentioned previously, this is about evaluating whether the output of a query changes when we remove someone from the database. Specifically, we want to evaluate the maximum amount the query changes when someone is removed (maximum over all possible people who could be removed). So, in order to evaluate how much privacy is leaked, we're going to iterate over each person in the database and measure the difference in the output of the query relative to when we query the entire database.

Just for the sake of argument, let's make our first "database query" a simple sum. Aka, we're going to count the number of 1s in the database.

In [18]:
db,pdbs= create_db_and_parallels(300)

In [19]:
db

tensor([ True, False, False, False,  True,  True, False, False,  True, False,
        False, False,  True,  True,  True,  True, False, False,  True, False,
         True, False,  True,  True, False, False,  True,  True, False,  True,
        False, False, False, False, False, False,  True,  True,  True, False,
        False,  True,  True, False,  True, False,  True,  True, False,  True,
         True, False, False,  True,  True,  True, False,  True, False,  True,
        False,  True, False,  True,  True,  True, False,  True,  True,  True,
         True,  True, False, False, False, False, False, False, False, False,
         True,  True,  True, False,  True, False,  True,  True,  True,  True,
         True,  True, False, False, False,  True,  True,  True, False, False,
         True,  True,  True,  True,  True, False,  True, False,  True,  True,
         True,  True,  True,  True, False, False,  True, False, False,  True,
         True, False, False,  True, False,  True,  True, False, 

In [None]:
pdbs

In [21]:
def query(db):
    return db.sum()

In [22]:
full_db_result=query(db)
full_db_result

tensor(149)

## L1 Sensitivity
The maximum amount that the query changes when removing an individuals from the database


In [23]:
max_distance = 0
for pdb in pdbs:
    pdb_result=query(pdb)

    db_distance=torch.abs(pdb_result - full_db_result)

    if(db_distance > max_distance):
        max_distance = db_distance

In [24]:
max_distance

tensor(1)

In [25]:
# Convert to function
def sensitivity(query,n_entries=1000):
    db,pbs= create_db_and_parallels(n_entries)

    full_db_result = query(db)
    max_distance=0
    for pdb in pdbs:
        pdb_result=query(pdb)

        db_distance=torch.abs(pdb_result - full_db_result)

        if(db_distance > max_distance):
            max_distance = db_distance
    return max_distance


In [26]:
def query(db):
    return db.float().mean()

In [27]:
query(db)

tensor(0.4967)

In [28]:
sensitivity(query)

tensor(0.0080)

That sensitivity is WAY lower. Note the intuition here. "Sensitivity" is measuring how sensitive the output of the query is to a person being removed from the database. For a simple sum, this is always 1, but for the mean, removing a person is going to change the result of the query by rougly 1 divided by the size of the database (which is much smaller). Thus, "mean" is a VASTLY less "sensitive" function (query) than SUM

# Calculate L1 Sensitivity For Threshold
I want you to calculate the sensitivty for the "threshold" function.

First compute the sum over the database (i.e. sum(db)) and return whether that sum is greater than a certain threshold.

Then, I want to create databases of size 10 and threshold of 5 and calculate the sensitivity of the function.

Finally, re-initialize the database 10 times and calculate the sensitivity each time

In [29]:
def query(db,threshold=5):
    return (db.sum()>threshold).float()

In [30]:
db,pbds= create_db_and_parallels(10)
db.sum()

tensor(7)

In [31]:
#Chk for threshold value 5 (7>5)
query(db)

tensor(1.)

In [32]:
for i in range(10):
    sens_f=sensitivity(query,n_entries=10)
    print(sens_f)

tensor(1.)
0
tensor(1.)
tensor(1.)
0
0
0
0
tensor(1.)
tensor(1.)


## Project: Perform a Differencing Attack on Row 10

In this project, I want you to construct a database and then demonstrate how we can use two different sum queries to expose the value of the person represented by row 10 in the database (note, you'll need to use a database with at least 10 rows)

In [55]:
db,_ =create_db_and_parallels(300)
db

tensor([ True,  True,  True,  True, False,  True, False, False, False,  True,
         True,  True,  True,  True,  True,  True,  True, False, False, False,
         True,  True,  True,  True,  True,  True,  True,  True, False,  True,
         True,  True,  True,  True,  True,  True, False,  True, False,  True,
        False,  True, False, False,  True,  True, False, False, False,  True,
         True, False,  True, False,  True,  True,  True, False, False,  True,
         True, False,  True, False,  True, False, False,  True,  True,  True,
         True, False,  True, False,  True, False,  True,  True, False,  True,
         True, False,  True,  True, False,  True, False,  True, False,  True,
        False,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True, False,  True,  True,  True, False,  True,  True, False,
        False, False, False,  True, False,  True, False, False, False,  True,
        False,  True,  True,  True, False,  True,  True, False, 

In [52]:
pdb=get_parallel_db(db,remove_index=10)

In [56]:
db[10]

tensor(True)

In [57]:
sum(db)

tensor(162)

In [58]:
sum(pdb)

tensor(57)

In [37]:
# Differencing  attack using sum query

sum(db)-sum(pdb)

tensor(0)

In [43]:
# Differencing  attack using mean query

(sum(db).float()/len(db)) - (sum(pdb).float()/len(pdb))

tensor(-0.0058)