### scikit-mobility tutorials

# NUM - Privacy Risk Assessment

- Simulate privacy attacks and assess risk with a worst-case scenario framework

- First, we import the necessary libraries and utility functions

In [1]:
import numpy as np
import pandas as pd
from skmob.privacy import attacks
from skmob.core import trajectorydataframe
from skmob.utils.utils import frequency_vector, probability_vector, date_time_precision

- To more easily visualize how risk is computed, we use a dummy dataset. To construct one, we import some constants.

In [2]:
from skmob.utils import constants
latitude = constants.LATITUDE
longitude = constants.LONGITUDE
date_time = constants.DATETIME
user_id = constants.UID

lat_lons = np.array([[43.8430139, 10.5079940],
                     [43.5442700, 10.3261500],
                     [43.7085300, 10.4036000],
                     [43.7792500, 11.2462600],
                     [43.8430139, 10.5079940],
                     [43.7085300, 10.4036000],
                     [43.8430139, 10.5079940],
                     [43.5442700, 10.3261500],
                     [43.5442700, 10.3261500],
                     [43.7085300, 10.4036000],
                     [43.8430139, 10.5079940],
                     [43.7792500, 11.2462600],
                     [43.7085300, 10.4036000],
                     [43.5442700, 10.3261500],
                     [43.7792500, 11.2462600],
                     [43.7085300, 10.4036000],
                     [43.7792500, 11.2462600],
                     [43.8430139, 10.5079940],
                     [43.8430139, 10.5079940],
                     [43.5442700, 10.3261500],
                     [43.8430139, 10.5079940],
                     [43.8430139, 10.5079940],
                     [43.7792500, 11.2462600]])

traj = pd.DataFrame(lat_lons, columns=[latitude, longitude])

traj[date_time] = pd.to_datetime([
        '20110203 8:34:04', '20110203 9:34:04', '20110203 10:34:04', '20110204 10:34:04',
        '20110203 8:34:04', '20110203 9:34:04', '20110204 10:34:04', '20110204 11:34:04',
        '20110203 8:34:04', '20110203 9:34:04', '20110204 10:34:04', '20110204 11:34:04',
        '20110204 10:34:04', '20110204 11:34:04', '20110204 12:34:04',
        '20110204 10:34:04', '20110204 11:34:04', '20110205 12:34:04',
        '20110204 10:34:04', '20110204 11:34:04',
        '20110204 10:34:04', '20110204 11:34:04','20110205 12:34:04'])

# Assign each record to a user: users 1-3 have four points, user 6 has two, the others three
traj[user_id] = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 3 + [5] * 3 + [6] * 2 + [7] * 3

# Sort by user and time, then wrap the data in a TrajDataFrame
traj = traj.sort_values([user_id, date_time])
trjdat = trajectorydataframe.TrajDataFrame(traj, user_id=user_id)

In [3]:
trjdat

,lat,lng,datetime,uid
0,43.843014,10.507994,2011-02-03 08:34:04,1
1,43.54427,10.32615,2011-02-03 09:34:04,1
2,43.70853,10.4036,2011-02-03 10:34:04,1
3,43.77925,11.24626,2011-02-04 10:34:04,1
4,43.843014,10.507994,2011-02-03 08:34:04,2
5,43.70853,10.4036,2011-02-03 09:34:04,2
6,43.843014,10.507994,2011-02-04 10:34:04,2
7,43.54427,10.32615,2011-02-04 11:34:04,2
8,43.54427,10.32615,2011-02-03 08:34:04,3
9,43.70853,10.4036,2011-02-03 09:34:04,3


- We instantiate an attack, specifying the length k of the background knowledge that we want to simulate, i.e. the number of points of a trajectory that the adversary is assumed to know.

In [5]:
at = attacks.LocationAttack(k=2)

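- To compute the privacy risk of all the users in the data, call the assess_risk method on the TrajDataFrame. In the worst-case framework, the risk of a user is the highest re-identification probability over all of their background knowledge instances.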

In [6]:
r = at.assess_risk(trjdat)
r

,uid,risk
0,1,0.333333
1,2,0.5
2,3,0.333333
3,4,0.333333
4,5,0.25
5,6,0.25
6,7,0.5
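
The risk values above can be reproduced by hand. The following is a minimal sketch, not the library's implementation. It assumes that a background knowledge instance matches a trajectory when the trajectory contains every location of the instance at least as many times as the instance does. The four distinct coordinate pairs of the dummy dataset are replaced with the hypothetical labels A (43.8430139, 10.5079940), B (43.54427, 10.32615), C (43.70853, 10.4036) and D (43.77925, 11.24626), and all helper names are illustrative only.

```python
from collections import Counter
from itertools import combinations

# Locations visited by each user of the dummy dataset, abbreviated to
# hypothetical labels A-D (see the text above for the coordinates).
visits = {
    1: ["A", "B", "C", "D"],
    2: ["A", "C", "A", "B"],
    3: ["B", "C", "A", "D"],
    4: ["C", "B", "D"],
    5: ["C", "D", "A"],
    6: ["A", "B"],
    7: ["A", "A", "D"],
}

def matches(instance, trajectory):
    """Assumed matching rule: the trajectory contains every location of the
    instance at least as many times as the instance does."""
    need = Counter(instance)
    have = Counter(trajectory)
    return all(have[loc] >= n for loc, n in need.items())

def reid_prob(instance, visits):
    """1 / (number of users whose trajectory matches the instance)."""
    return 1.0 / sum(matches(instance, t) for t in visits.values())

def worst_case_risk(uid, visits, k=2):
    """Highest re-identification probability over all k-combinations."""
    points = visits[uid]
    size = min(k, len(points))  # assumption: shorter trajectories use all points
    return max(reid_prob(c, visits) for c in combinations(points, size))

risks = {uid: round(worst_case_risk(uid, visits), 6) for uid in visits}
print(risks)
# {1: 0.333333, 2: 0.5, 3: 0.333333, 4: 0.333333, 5: 0.25, 6: 0.25, 7: 0.5}
```

For example, user 7 reaches a risk of 0.5 because only users 2 and 7 visited location A twice, so an adversary who knows those two visits narrows the candidates down to two people.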


- To show the re-identification probability of every possible combination, use the instance_analysis parameter
- This is computationally more expensive, because all combinations are generated and evaluated, including those that are not needed to determine the risk
- Use with caution

In [7]:
r = at.assess_risk(trjdat, instance_analysis=True)
r

uid,instance,reid_prob
1,"([43.8430139, 10.507994, 2011-02-03 08:34:04, ...",0.25
1,"([43.8430139, 10.507994, 2011-02-03 08:34:04, ...",0.25
1,"([43.8430139, 10.507994, 2011-02-03 08:34:04, ...",0.25
1,"([43.54427, 10.32615, 2011-02-03 09:34:04, 1],...",0.25
1,"([43.54427, 10.32615, 2011-02-03 09:34:04, 1],...",0.333333
1,"([43.70853, 10.4036, 2011-02-03 10:34:04, 1], ...",0.25
2,"([43.8430139, 10.507994, 2011-02-03 08:34:04, ...",0.25
2,"([43.8430139, 10.507994, 2011-02-03 08:34:04, ...",0.5
2,"([43.8430139, 10.507994, 2011-02-03 08:34:04, ...",0.25
2,"([43.70853, 10.4036, 2011-02-03 09:34:04, 2], ...",0.25
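
To get a feeling for the cost of evaluating every combination, the number of instances can be counted directly. This is a rough sketch that assumes one instance per combination of k points, which is what the six rows listed for user 1 (four points, k=2) suggest. The 50-point trajectory is a hypothetical example, not part of the dataset.

```python
from math import comb

# Number of points per user in the dummy dataset (users 1 to 7)
points_per_user = {1: 4, 2: 4, 3: 4, 4: 3, 5: 3, 6: 2, 7: 3}
k = 2

# One instance per k-combination of a user's points (assumption)
instances_per_user = {uid: comb(n, k) for uid, n in points_per_user.items()}
total_instances = sum(instances_per_user.values())
print(instances_per_user[1])  # 6
print(total_instances)        # 28

# Growth with the background knowledge length for a hypothetical
# trajectory of 50 points
growth = {length: comb(50, length) for length in range(2, 6)}
print(growth)  # {2: 1225, 3: 19600, 4: 230300, 5: 2118760}
```

Each of these instances is then compared against the trajectories in the data, so the cost grows quickly with both trajectory length and k.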


- The targets parameter restricts the calculation to a subset of the users
- The probability of re-identification is still computed against the whole original dataset
- It can be combined with instance_analysis to isolate particular individuals and understand which combinations pose a threat

In [8]:
t = [1,2]
r = at.assess_risk(trjdat, targets=t, instance_analysis=True)
r

uid,instance,reid_prob
1,"([43.8430139, 10.507994, 2011-02-03 08:34:04, ...",0.25
1,"([43.8430139, 10.507994, 2011-02-03 08:34:04, ...",0.25
1,"([43.8430139, 10.507994, 2011-02-03 08:34:04, ...",0.25
1,"([43.54427, 10.32615, 2011-02-03 09:34:04, 1],...",0.25
1,"([43.54427, 10.32615, 2011-02-03 09:34:04, 1],...",0.333333
1,"([43.70853, 10.4036, 2011-02-03 10:34:04, 1], ...",0.25
2,"([43.8430139, 10.507994, 2011-02-03 08:34:04, ...",0.25
2,"([43.8430139, 10.507994, 2011-02-03 08:34:04, ...",0.5
2,"([43.8430139, 10.507994, 2011-02-03 08:34:04, ...",0.25
2,"([43.70853, 10.4036, 2011-02-03 09:34:04, 2], ...",0.25
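
The difference between restricting the targets and restricting the data can be illustrated with a small sketch. It is not the library's code. It uses hypothetical labels A to D for the four distinct locations of the dummy dataset and assumes that an instance matches a trajectory when the trajectory contains each of its locations at least as many times.

```python
from collections import Counter
from itertools import combinations

# Locations visited by each user, abbreviated to hypothetical labels A-D
visits = {
    1: ["A", "B", "C", "D"],
    2: ["A", "C", "A", "B"],
    3: ["B", "C", "A", "D"],
    4: ["C", "B", "D"],
    5: ["C", "D", "A"],
    6: ["A", "B"],
    7: ["A", "A", "D"],
}

def count_matches(instance, population):
    """Number of trajectories in `population` that contain the instance."""
    need = Counter(instance)
    return sum(
        all(Counter(traj)[loc] >= n for loc, n in need.items())
        for traj in population.values()
    )

def risk(uid, population, k=2):
    """Worst-case risk of `uid`, with matches counted over `population`."""
    return max(
        1 / count_matches(c, population)
        for c in combinations(visits[uid], k)
    )

targets = [1, 2]

# Risk of the targets only, but matched against every user in the data
risk_full = {uid: round(risk(uid, visits), 6) for uid in targets}
print(risk_full)    # {1: 0.333333, 2: 0.5}

# For contrast: filtering the data down to the targets first
subset = {uid: visits[uid] for uid in targets}
risk_subset = {uid: round(risk(uid, subset), 6) for uid in targets}
print(risk_subset)  # {1: 1.0, 2: 1.0}
```

Filtering the data beforehand would overstate the risk, because the other users who share the same locations would no longer be counted as candidates.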


- TODO: add more, e.g. different types of attack.