<a href="https://colab.research.google.com/gist/yakine8/d68a548b4abec5cacb5609511e837848/surrogate-generation-strategies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **SURROGATE GENERATION STRATEGIES**

#### Prerequisites 

1. Download or clone the GitHub repository of this implementation - [here](https://github.com/yakine8/Surrogate-generation-Strategies-in-De-identification)
1. Upload the required files to the current session storage (*date.py, location.py, dp.py, paper-data.pkl*)
1. Install the required modules via pip

In [None]:
!pip install py-dateinfer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting py-dateinfer
  Downloading py_dateinfer-0.4.5-py3-none-any.whl (17 kB)
Installing collected packages: py-dateinfer
Successfully installed py-dateinfer-0.4.5


### Import Libraries

In [1]:
import location
import pandas as pd
import numpy as np
import date

### Substitution strategies for entities: ***Date & Age*** and ***Location*** in a medical document

Example of a sequence in a medical document: 

---

```
Mr. Durand born in Dijon, 40 years old, was
admitted to the hospital from 02/12/2020 to February 26, 2020
following a road accident in Dijon
```


Sensitive information detected in this sentence: 

Date & Age : 
1. 02/12/2020
1. February 26, 2020
1. 40 years old


Location 
1. Dijon

Person
1. Durand

Let's substitute the detected sensitive information using the following strategies:

* Date & Age ⟹ ϵ-d.privacy + Laplace mechanism
* Location ⟹ ϵ-d.privacy + Exponential mechanism

Read the paper for more details [here](https://)

The global privacy budget ϵ is set to 1 (0.25 for location and 0.75 for date & age).

---
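
The budget split described above can be written down explicitly. This is a minimal sketch, assuming the per-entity budgets simply add up to the global budget; the variable names are hypothetical and only the numeric values come from the text.

```python
# Hypothetical names; only the numeric values come from the notebook.
GLOBAL_EPSILON = 1.0
budgets = {"location": 0.25, "date_age": 0.75}

# Assumption: the per-entity budgets add up to the global budget.
total = sum(budgets.values())
print(total)
```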

##### To reproduce the results of the paper, let's fix the random seed

In [None]:
SEED = 5

### Geographic Locations

In [3]:
locations = ['Dijon']
EPSILON = 0.25

#### Location data with health characteristics: Region Bourgogne Franche Comté (France)

##### Features
1. Overall population 
1. Cancer Incidence Rate
1. Stroke

In [4]:
data = pd.read_pickle("paper-data.pkl")
data

Unnamed: 0,city,city code,gps coordinates,overall population,cancer incidence rate,stroke
0,AGENCOURT,21001,"47.1250474303,4.98564114053",513,0.583601,0.874783
1,AISEY SUR SEINE,21006,"47.7479865538,4.57814842131",177,0.201360,0.301826
2,AMPILLY LE SEC,21012,"47.8100690603,4.50329961504",360,0.409545,0.613883
3,ANTHEUIL,21014,"47.1764418313,4.74942827901",62,0.070533,0.105724
4,ANTIGNY LA VILLE,21015,"47.1012813081,4.55975234629",102,0.116038,0.173934
...,...,...,...,...,...,...
3598,VILLEFRANCON,70557,"47.4129035904,5.74249429132",142,0.379823,0.300671
3599,VILLERS SUR SAULNOT,70567,"47.5481669716,6.64791604275",133,0.355750,0.281614
3600,VILORY,70569,"47.7301466997,6.22337661531",66,0.176538,0.139748
3601,VOUHENANS,70577,"47.6424367134,6.49403579676",380,1.016429,0.804612


In [5]:
# Retrieving the city code
codes = []
for loc in locations :
  code = data[data['city'] == loc.upper()]
  code = code['city code'].values[0]
  codes.append(code)

codes

[21231]
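
The lookup above raises an `IndexError` when a city name is missing from the table. A possible defensive variant is sketched below on a two-row toy table (city names and codes taken from this notebook's outputs); `city_code` is a hypothetical helper, not part of `location.py`.

```python
import pandas as pd

# Toy table: two rows taken from the outputs shown in this notebook.
toy = pd.DataFrame({
    "city": ["DIJON", "BEAUNE"],
    "city code": [21231, 21054],
})

def city_code(name, table):
    """Return the city code for `name`, or None if the city is not listed."""
    match = table.loc[table["city"] == name.upper(), "city code"]
    return None if match.empty else int(match.iloc[0])

print(city_code("Dijon", toy))
print(city_code("Paris", toy))
```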

In [6]:
def propabilities(code, data, features, RADIUS, K, EPSILON):
    
    # Retrieve the row matching the city code
    city = data[data['city code'] == code]
    
    # GPS coordinates
    try :
      coord = city['gps coordinates'].values[0]
      lat, lon = location.coord_to_latlong(coord)
      
      # Geographic distances to the other cities
      df = location.dist_from_others(lat, lon, data, RADIUS)
      
      # Normalize the features
      df = location.normalize_features(df, features)
      
      # Distances between feature vectors
      df = location.vector_distance(code, df, features)
      
      # Sort in ascending order of vector distance
      df = df.sort_values(by = 'distances')
      
      # Keep the first K rows: the list of candidate substitute cities
      df = df.head(K)
      
      # Scores (k is the number of features)
      k = len(features)
      df = location.scores(df, k=k)
      
      # Probabilities
      df = location.pdfs(df, EPSILON)
      
      # Normalize the probabilities so they sum to 1
      pdf = list(df['pdf'])
      pdf = pdf / np.linalg.norm(pdf, ord=1)
      df['normalized distribution'] = pdf
    
      return df
    except Exception as e :
      return e

In [7]:
def loc_substitution(code, code_probabilities, seed):
  # Choose the substitute according to the probabilities
  R = list(code_probabilities['city code'])
  np.random.seed(seed)
  choice = int(np.random.choice(R, 1, p=code_probabilities['normalized distribution'])[0])

  return choice
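
Because the draw is random, fixing the seed is what makes the substitution reproducible. The sketch below illustrates this with a local `numpy.random.Generator` rather than the global seed used above. This is a swapped-in alternative and will not reproduce the same draws as `np.random.seed`. The candidate codes and probabilities are illustrative values, and `sample_code` is a hypothetical helper.

```python
import numpy as np

def sample_code(codes, probs, seed):
    """Draw one candidate code with the given probabilities, reproducibly."""
    rng = np.random.default_rng(seed)  # local generator, no global state
    return int(rng.choice(codes, p=probs))

codes = [21231, 25056, 71076]   # illustrative candidates
probs = [0.5, 0.3, 0.2]         # illustrative probabilities, sum to 1

first = sample_code(codes, probs, seed=5)
second = sample_code(codes, probs, seed=5)
print(first == second)
```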


In [8]:
RADIUS = 100
K = 10
features = ['overall population', 'cancer incidence rate', 'stroke']
for code in codes :
  try: 
    code_probabilities = propabilities(code, data, features, RADIUS, K, EPSILON)
    display(code_probabilities)
    sub = loc_substitution(code, code_probabilities, SEED)
    orig = code_probabilities[code_probabilities['city code'] == code]['city'].values[0]
    subst = code_probabilities[code_probabilities['city code'] == sub]['city'].values[0]

    print("Original : ",orig," ",code, "===>>", "Sub : ",subst, " ",sub)
  except Exception as e:
    print("City code error:", e)

Unnamed: 0,city,city code,overall population,cancer incidence rate,stroke,overall population_normalized,cancer incidence rate_normalized,stroke_normalized,distances,scores,pdf,normalized distribution
99,DIJON,21231,160204,182.252004,273.184785,1.0,1.0,1.0,0.0,3.0,2.117,0.132613
924,BESANCON,25056,119249,134.135495,218.375283,0.744344,0.735974,0.799356,0.347525,2.652475,1.940836,0.121578
2175,CHALON SUR SAONE,71076,46603,52.730489,108.706972,0.290862,0.289288,0.397888,1.042888,1.957112,1.631138,0.102178
1368,DOLE,39198,24606,57.437117,55.290112,0.153549,0.315114,0.202343,1.381583,1.618417,1.498709,0.093882
2201,LE CREUSOT,71153,21935,24.819073,51.165964,0.136876,0.136132,0.187245,1.407732,1.592268,1.488944,0.093271
2249,MONTCEAU LES MINES,71306,18789,21.259429,43.82755,0.117238,0.116599,0.160381,1.454262,1.545738,1.471724,0.092192
1675,LONS LE SAUNIER,39300,18023,42.070599,40.497996,0.112456,0.230795,0.148193,1.475374,1.524626,1.463977,0.091707
538,BEAUNE,21054,21747,24.739921,37.083653,0.135703,0.135698,0.135694,1.497023,1.502977,1.456075,0.091212
2143,AUTUN,71014,14381,16.271853,33.545372,0.089721,0.089232,0.122741,1.519458,1.480542,1.447931,0.090701
3189,VESOUL,70550,15728,42.069461,33.302482,0.09813,0.230789,0.121852,1.520998,1.479002,1.447374,0.090667


Original :  DIJON   21231 ===>> Sub :  BESANCON   25056
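
The `scores`, `pdf` and `normalized distribution` columns above are numerically consistent with `scores = k - distances` (k = 3 features), `pdf = exp(ϵ · scores)` and an L1 normalization. The sketch below recomputes them from the displayed `distances` column. These formulas are inferred from the printed values, not taken from `location.py`, so treat them as an assumption.

```python
import numpy as np

EPSILON = 0.25
K_FEATURES = 3  # number of features used for the score

# 'distances' column copied from the table above (DIJON first).
distances = np.array([0.0, 0.347525, 1.042888, 1.381583, 1.407732,
                      1.454262, 1.475374, 1.497023, 1.519458, 1.520998])

scores = K_FEATURES - distances   # assumed score definition
pdf = np.exp(EPSILON * scores)    # assumed unnormalized weights
probs = pdf / pdf.sum()           # normalize so the weights sum to 1

print(np.round(probs, 6))
```

Closer candidates receive a higher probability, but with ϵ = 0.25 the distribution stays fairly flat: about 0.13 for Dijon itself against about 0.09 for the farthest candidates.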


### **Location** : After applying the algorithm => ***ϵ-d.privacy + Exponential mechanism***: 

```
Dijon, 21231 -> Besancon, 25056

```


### Date & Age Substitution

In [9]:
DATES = ['02/12/2020', 'February 26, 2020', '40 years old']
Epsilon = 0.75

In [10]:
def date_substitution(DATES, Epsilon, SEED):

    LOOKUP_TABLE = dict()

    if len(DATES) != 0: 
      EPSILON = Epsilon/len(DATES)

      print("List of dates to be processed : {}".format(DATES))
      DF_D, DF_A = date.parse_date(DATES)
      DF_D = date.df_to_date(DF_D)
      DF_D, DF_A = date.order_date(DF_D, DF_A)
  
      DF_DALL, DF_D = date.remove_duplicate_nan_date(DF_D)
      DF_AALL, DF_A = date.remove_duplicate_age(DF_A)
      nb_date, col = DF_D.shape
      nb_age, col = DF_A.shape

      if nb_date != 0 or nb_age != 0:
        
        DF_D, LIST_INTERVAL_D = date.set_interval_between_date(DF_D)
        DF_A, LIST_INTERVAL_A = date.set_interval_between_age(DF_A)

        DF_D, DF_A = date.noisy_interval(DF_D, DF_A, EPSILON, SEED)
        print("Noisy Intervals")
        display(DF_D)
        display(DF_A)

        DF_D = date.reconstruct_date_from_interval(DF_D)
        DF_A = date.reconstruct_age_from_interval(DF_A)
        print("Date Reconstruction with Intervals")
        display(DF_D)
        display(DF_A)

        DF_D = date.date_to_orignal_format(DF_D)

        LOOKUP_TABLE = date.construct_lookup_table(DF_D, DF_A)
        
        return LOOKUP_TABLE
      else:
        return "The detected dates are wrong or in wrong format"
    else :
      return "No date found"

In [13]:
date_substitution(DATES, Epsilon, SEED)

List of dates to be processed : ['02/12/2020', 'February 26, 2020', '40 years old']
Noisy Intervals


Unnamed: 0,Detected date,Date,Format,Date_Intervalles,Noisy_Intervals
0,02/12/2020,2020-02-12,%m/%d/%Y,14,10.752154
1,"February 26, 2020",2020-02-26,"%B %d, %Y",971,967.752154


Unnamed: 0,Age,Value,Age_Intervalles,Noisy_Intervals
0,40 years old,40,40,36.752154


Date Reconstruction with Intervals


Unnamed: 0,Detected date,Date,Format,Date_Intervalles,Noisy_Intervals,Dates_reconst
0,02/12/2020,2020-02-12,%m/%d/%Y,14,10.752154,20/02/2020
1,"February 26, 2020",2020-02-26,"%B %d, %Y",971,967.752154,01/03/2020


Unnamed: 0,Age,Value,Age_Intervalles,Noisy_Intervals,Noisy_Age
0,40 years old,40,40,36.752154,37


{'40 years old': '37 years old',
 '02/12/2020': '02/20/2020',
 'February 26, 2020': 'March 01, 2020'}

### **Dates & Ages** : After applying the algorithm => ***ϵ-d.privacy + Laplace Mechanism***: 

```
02/12/2020 -> 02/20/2020
February 26, 2020 -> March 01, 2020
40 years old -> 37 years old
```
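
In the output above, every interval is shifted by the same offset (14 → 10.752154, 40 → 36.752154), and the budget of 0.75 is divided equally among the three items. The sketch below shows one way such noise could be drawn, assuming a Laplace distribution with scale 1/ϵ per item (sensitivity 1). `laplace_noise` is a hypothetical helper; the actual logic lives in `date.noisy_interval`.

```python
import numpy as np

def laplace_noise(epsilon, seed):
    """Draw one Laplace sample with scale 1/epsilon (assumes sensitivity 1)."""
    np.random.seed(seed)
    return float(np.random.laplace(loc=0.0, scale=1.0 / epsilon))

items = ['02/12/2020', 'February 26, 2020', '40 years old']
eps_item = 0.75 / len(items)          # per-item budget, as in date_substitution

noise = laplace_noise(eps_item, seed=5)
noisy_age = int(round(40 + noise))    # perturb the age, then round to an integer
print(eps_item, noise, noisy_age)
```

A smaller per-item ϵ means a larger noise scale, so splitting the budget across more detected items makes each individual substitution noisier.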


Applying the random algorithm detailed in [this paper](https://arxiv.org/pdf/2209.09631.pdf) to the Person attribute:

```
Durand -> Julien
```

The final result of our substitution step:

```
Mr. Julien born in Besancon, 37 years old, was
admitted to the hospital from 02/20/2020 to March 01, 2020
following a road accident in Besancon
```
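
The final text can be obtained by applying the substitution pairs shown above to the original sentence. A minimal sketch, assuming a single-pass regex replacement so that one substitution never rewrites the output of another; `apply_substitutions` is a hypothetical helper, not part of the repository.

```python
import re

def apply_substitutions(text, lookup):
    """Replace every key of `lookup` found in `text` in a single pass."""
    # Longest keys first so that longer matches win over shorter ones.
    keys = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: lookup[m.group(0)], text)

original = ("Mr. Durand born in Dijon, 40 years old, was\n"
            "admitted to the hospital from 02/12/2020 to February 26, 2020\n"
            "following a road accident in Dijon")

# Substitution pairs taken from the results of this notebook.
lookup = {
    "Durand": "Julien",
    "Dijon": "Besancon",
    "40 years old": "37 years old",
    "02/12/2020": "02/20/2020",
    "February 26, 2020": "March 01, 2020",
}

print(apply_substitutions(original, lookup))
```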