### 01. Pseudonymization

In this notebook, we'll explore pseudonymization methods such as hashing, masking and format-preserving encryption.

For more reading on the topic, please see: 

- [Medium (Alex Ewerlöf): Anonymization vs. Pseudonymization](https://medium.com/@alexewerlof/gdpr-pseudonymization-techniques-62f7b3b46a56)
- [KIProtect: GDPR for Data Science](https://kiprotect.com/blog/gdpr_for_data_science.html)
- [IAPP: Anonymization and Pseudonymization Compared in relation to GDPR compliance](https://iapp.org/media/pdf/resource_center/PA_WP2-Anonymous-pseudonymous-comparison.pdf)

In [36]:
pip install requests


Note: you may need to restart the kernel to use updated packages.





In [37]:
pip install faker


Note: you may need to restart the kernel to use updated packages.





In [38]:
pip install ff3








In [39]:
import base64
from hashlib import blake2b

import pandas as pd
import json
import requests

from faker import Faker
from ff3 import FF3Cipher

#### Precheck: What is our data? 
- What information is contained in our data?
- What privacy concerns are there?
- How should we proceed?

In [40]:
df = pd.read_csv('data/iot_example.csv')

In [41]:
df.head()

Unnamed: 0,timestamp,username,temperature,heartrate,build,latest,note
0,2017-01-01T12:00:23,michaelsmith,12,67,4e6a7805-8faa-2768-6ef6-eb3198b483ac,0,interval
1,2017-01-01T12:01:09,kharrison,6,78,7256b7b0-e502-f576-62ec-ed73533c9c84,0,wake
2,2017-01-01T12:01:34,smithadam,5,89,9226c94b-bb4b-a6c8-8e02-cb42b53e9c90,0,
3,2017-01-01T12:02:09,eddierodriguez,28,76,2599ac79-e5e0-5117-b8e1-57e5ced036f7,0,update
4,2017-01-01T12:02:36,kenneth94,29,62,122f1c6a-403c-2221-6ed1-b5caa08f11e0,0,user


#### Section One: Hashing

- Applying the blake2b hash
- Allowing for de-pseudonymization
- Creating a reusable method for hashing

In [42]:
username = df.iloc[0,1]

In [43]:
username

'michaelsmith'

In [44]:
hasher = blake2b()
hasher.update(username)
hasher.hexdigest()

TypeError: Strings must be encoded before hashing

Oops. What went wrong? How can we fix it?

In [61]:
 
hasher = blake2b()
hasher.update(username.encode('utf-8'))  
print(hasher.hexdigest())  


a2a858011c0917154cdf8edce30d399e37df5f13217fa6d2959e453dd5245eb73a0787f0784d0c1969df51a48dc5a6664a59b724e33962be6ed4a9f0424ecb43


In [62]:
# %load solutions/proper_encoding.py
hasher = blake2b()
hasher.update(username.encode('utf-8'))
hasher.hexdigest()


'a2a858011c0917154cdf8edce30d399e37df5f13217fa6d2959e453dd5245eb73a0787f0784d0c1969df51a48dc5a6664a59b724e33962be6ed4a9f0424ecb43'

Great! Now we have a hash. Michael is safe! (or [is he?](https://nakedsecurity.sophos.com/2014/06/24/new-york-city-makes-a-hash-of-taxi-driver-data-disclosure/))
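The linked story is about unkeyed hashes over a small input space: anyone who can guess the inputs can hash the guesses and match them. Here is a minimal sketch of that idea, and of how a secret key changes the picture. The `candidates` list and `SECRET_KEY` are made up for illustration; only `blake2b` and its `key` parameter come from the standard library.

```python
from hashlib import blake2b


def plain_hash(value: str) -> str:
    """Unkeyed blake2b digest, as in the cells above."""
    return blake2b(value.encode('utf-8')).hexdigest()


def keyed_hash(value: str, key: bytes) -> str:
    """blake2b digest keyed with a secret (blake2b accepts keys up to 64 bytes)."""
    return blake2b(value.encode('utf-8'), key=key).hexdigest()


# Hypothetical guesses an attacker might try (e.g. a leaked username list).
candidates = ['kharrison', 'smithadam', 'michaelsmith']

# Dictionary attack: hash every guess and build a reverse lookup table.
lookup = {plain_hash(name): name for name in candidates}

published = plain_hash('michaelsmith')
recovered = lookup.get(published)      # the attacker gets the username back

# With a secret key, the attacker's unkeyed table no longer matches.
SECRET_KEY = b'illustrative-key-keep-out-of-code'  # assumption: stored securely elsewhere
protected = keyed_hash('michaelsmith', SECRET_KEY)
still_found = lookup.get(protected)    # None
```

The keyed digest is only as safe as the key: whoever holds `SECRET_KEY` can rebuild the same lookup table.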

But... what if we later need to determine that michaelsmith is a2a858011c091715...?

In [None]:
# Typing `hasher.` and pressing Tab lists the available methods; none of them reverses a digest.
[name for name in dir(hasher) if not name.startswith('_')]

Okay, let's try something that we can reverse...

In [45]:
# From https://stackoverflow.com/questions/2490334/simple-way-to-encode-a-string-according-to-a-password

def encode(key, clear):
    enc = []
    for i in range(len(clear)):
        key_c = key[i % len(key)]
        #print(key_c)
        enc_c = (ord(clear[i]) + ord(key_c)) % 256
        #print(enc_c)
        enc.append(enc_c)
    return base64.urlsafe_b64encode(bytes(enc))

def decode(key, enc):
    dec = []
    enc = base64.urlsafe_b64decode(enc)
    for i in range(len(enc)):
        key_c = key[i % len(key)]
        dec_c = chr((256 + enc[i] - ord(key_c)) % 256)
        dec.append(dec_c)
    return "".join(dec)

In [46]:
encode('supa_secret', username)

b'4N7TycDY0dbfzujb'

In [47]:
decode('supa_secret', b'4N7TycDY0dbfzujb')

'michaelsmith'

#### Challenge

- Can you come up with another string which will properly decode the secret which is *not* the same as the original key?
- Hint: Take a look at the encode method and use the print statements for a clue.

In [49]:
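# A separate XOR-based attempt; `key` is an integer here, not a string.
# Named xor_encode/xor_decode so the encode/decode functions above are not overwritten.
def xor_encode(key, secret):
    return "".join(chr(ord(c) ^ key) for c in secret)

def xor_decode(key, encoded):
    return "".join(chr(ord(c) ^ key) for c in encoded)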


In [None]:
# %load solutions/lockpick.py
decode('supa_secrets_for_you', b'4N7TycDY0dbfzujb')


Welp. Maybe that is not so great...
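Why does the longer key work? The scheme reads one key character per message character and only wraps around once the key runs out, so for a 12-character username only the first 12 key positions matter. The sketch below restates the two functions compactly and adds a hypothetical helper, `effective_key`, to show the characters that are actually consumed.

```python
import base64


def encode(key, clear):
    # Add each character code to the matching (cycled) key character code.
    enc = [(ord(c) + ord(key[i % len(key)])) % 256 for i, c in enumerate(clear)]
    return base64.urlsafe_b64encode(bytes(enc))


def decode(key, enc):
    raw = base64.urlsafe_b64decode(enc)
    return ''.join(chr((256 + b - ord(key[i % len(key)])) % 256)
                   for i, b in enumerate(raw))


def effective_key(key, length):
    """Key characters actually used for a message of `length` characters."""
    return ''.join(key[i % len(key)] for i in range(length))


token = encode('supa_secret', 'michaelsmith')

# Both keys reduce to the same 12 characters: 'supa_secrets'.
k1 = effective_key('supa_secret', 12)
k2 = effective_key('supa_secrets_for_you', 12)

same_plaintext = decode('supa_secrets_for_you', token) == decode('supa_secret', token)
```

Any key whose first 12 cycled characters spell `supa_secrets` decodes this particular token, which is one reason this kind of encoding should not be treated as encryption.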

#### Section Two: Data Masking and Tokenization

- What should we mask?
- How?
- What do we do if we need realistic values?

In [50]:
df.sample(2)

Unnamed: 0,timestamp,username,temperature,heartrate,build,latest,note
21951,2017-01-10T06:15:33,jorgeholmes,29,73,5c0e7f1c-ef92-cd78-e341-d87a08748cf8,0,sleep
59684,2017-01-25T08:27:47,teresa79,8,63,94862cbc-3d65-aa15-3c8a-54ed96fe280e,0,


In [51]:
# applymap is deprecated in pandas 2.1+, where DataFrame.map is the replacement.
super_masked = df.applymap(lambda x: 'NOPE')


In [52]:
super_masked.head()

Unnamed: 0,timestamp,username,temperature,heartrate,build,latest,note
0,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE
1,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE
2,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE
3,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE
4,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE


😜

Okay, no more jokes. But masking usually is just that: replacing your sensitive data with some sort of placeholder representation.
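A slightly less drastic mask keeps part of the value readable. This is a minimal sketch; `mask_value` and its parameters are hypothetical names, not part of the notebook or of any library.

```python
def mask_value(value: str, visible: int = 1, mask_char: str = '*') -> str:
    """Keep the first `visible` characters and replace the rest with `mask_char`."""
    if visible < 0:
        raise ValueError('visible must be non-negative')
    return value[:visible] + mask_char * max(len(value) - visible, 0)


masked = mask_value('michaelsmith')               # 'm***********'
masked_two = mask_value('kharrison', visible=2)   # 'kh*******'
```

Keeping the length and the first character is convenient for debugging, but it also leaks information, so how much to leave visible is a judgment call per column.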

Alternatively, we could tokenize the data, which means replacing it with random fictitious values. How do we tokenize this?

In [53]:
fakes = Faker()

In [54]:
fakes.name()

'Marissa Torres'

In [55]:
# Type `fakes.` and press Tab to browse the available providers (names, user names, ...).

In [56]:
fakes.user_name()

'qlong'

#### Challenge

Make a new column `pseudonym` which masks the data using the faker `user_name` method.

In [57]:
fake = Faker()

# Note: this rebinds df to a toy frame, so the IoT data loaded earlier is no longer in df.
data = {'original_name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']}
df = pd.DataFrame(data)

df['pseudonym'] = df['original_name'].apply(lambda _: fake.user_name())

print(df)

  original_name        pseudonym
0         Alice    johnwilkerson
1           Bob     matthewsivan
2       Charlie    brittneypoole
3         Diana            rneal
4           Eve  alexandravaldez


In [None]:
# %load solutions/masked_pseudonym.py
df['pseudonym'] = df['username'].map(
        lambda x: fakes.user_name())
df['pseudonym'].head()


Whaaaa!?!? Pretty cool, eh? 

(In case you want to read up on [how it works](https://github.com/joke2k/faker/blob/06d323f6cff95103d4ccda03f5d4ab2c45334e46/faker/providers/internet/__init__.py#L162))

But... we can't reverse it. The generator is tuned per locale (usually using probabilities based on names in that locale). That said, it works fabulously for test data!
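If reversibility matters, one common workaround is to keep a lookup table next to the tokens. The sketch below is an assumption-laden illustration: it swaps Faker for `secrets.token_hex` from the standard library so the example stays self-contained, and the `Tokenizer` class is a hypothetical name. It also gives the same user the same token every time, which a fresh Faker call per row does not.

```python
import secrets


class Tokenizer:
    """Maps each original value to one stable random token and back."""

    def __init__(self):
        self._forward = {}   # original -> token
        self._reverse = {}   # token -> original

    def tokenize(self, value: str) -> str:
        if value not in self._forward:
            token = 'user_' + secrets.token_hex(4)
            while token in self._reverse:   # regenerate on the rare collision
                token = 'user_' + secrets.token_hex(4)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token: str) -> str:
        return self._reverse[token]


tokenizer = Tokenizer()
first = tokenizer.tokenize('michaelsmith')
second = tokenizer.tokenize('michaelsmith')   # same token as `first`
original = tokenizer.detokenize(first)        # 'michaelsmith'
```

On a dataframe this would presumably be applied with something like `df['username'].map(tokenizer.tokenize)`. The lookup table is now as sensitive as the original column and needs the same protection.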

#### Section Three: Format-Preserving Encryption

In [64]:
key = "2DE79D232DF5585D68CE47882AE256D6"
tweak = "CBD09280979564"

c6 = FF3Cipher.withCustomAlphabet(key, tweak, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_")

plaintext = "michaelsmith"
ciphertext = c6.encrypt(plaintext)

ciphertext

'_4uKXSOIRRWR'

In [65]:
decrypted = c6.decrypt(ciphertext)
decrypted

'michaelsmith'

In [75]:
df['username'] = df['username'].map(c6.encrypt)

KeyError: 'username'

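Oh no! What does this message mean and how can we fix it? The `KeyError` says `df` has no `username` column: the Faker challenge cell above rebuilt `df` from a toy dictionary, so the IoT data has to be reloaded with `pd.read_csv` before retrying. The solution below addresses a separate issue, usernames that are too short for the cipher, by padding them.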

In [76]:
# %load solutions/pad_text.py
def add_padding_and_encrypt(cipher, username):
    if len(username) < 4:
        username += "X" * (4-len(username))
    return cipher.encrypt(username)




In [78]:
# The KeyError came from df no longer holding the IoT data, so reload it before encrypting.
df = pd.read_csv('data/iot_example.csv')
df['username'] = df['username'].map(lambda x: add_padding_and_encrypt(c6, x))

In [74]:
df['username'].head()

#### Questions

- What would happen if someone found our key?
- What happens if a username ends in X?
- What properties do we need in our data in order to maintain encryption-level security?

In [None]:
#1. They could decrypt all data encrypted with the key.
#2. Only usernames shorter than 4 characters are padded, so 'sarax' is unchanged. But a padded
#   name such as 'amy' -> 'amyX' cannot be told apart from a real username 'amyX' after decryption.
#3. Keep the key confidential, use secure and modern encryption methods, use
# systems like HSMs to protect keys, ensure encryption maintains integrity.
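The padding question can be checked without any cipher at all. This sketch mirrors the padding rule in `add_padding_and_encrypt`; the `pad` and `naive_unpad` helpers and the constants are hypothetical names introduced here for illustration.

```python
PAD_CHAR = 'X'
MIN_LEN = 4


def pad(username: str) -> str:
    """Right-pad with PAD_CHAR up to MIN_LEN, as in add_padding_and_encrypt."""
    return username + PAD_CHAR * max(MIN_LEN - len(username), 0)


def naive_unpad(padded: str) -> str:
    """Strip trailing PAD_CHAR characters; cannot tell padding from real data."""
    return padded.rstrip(PAD_CHAR)


collision = pad('amy') == pad('amyX')   # True: both become 'amyX'
damaged = naive_unpad(pad('maxX'))      # 'max', the real trailing X is lost
untouched = pad('sarax')                # 'sarax': long enough, never padded
```

One way around this, under the assumption that extra storage is acceptable, is to record the original length (or a padded flag) next to the ciphertext instead of guessing from the decrypted value.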

#### Additional Challenge

How would we build our own format-preserving encryption?

In [79]:
num_cipher = FF3Cipher.withCustomAlphabet(key, tweak, "0123456789")

In [80]:
example = "2017-01-01T12:00:23"

In [81]:
enc_date = num_cipher.encrypt(example.replace("T","").replace(":","").replace("-",""))

In [82]:
enc_ts = f"{enc_date[:4]}-{enc_date[4:6]}-{enc_date[6:8]}T{enc_date[8:10]}:{enc_date[10:12]}:{enc_date[12:14]}"

In [83]:
enc_ts

'0223-76-68T70:45:11'
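The strip, encrypt and re-slice steps above can be folded into one helper that works for any separator layout. This is a sketch under stated assumptions: `transform_digits` is a hypothetical helper, and `toy_shift` is a stand-in that is NOT encryption, used only so the example runs without the `ff3` package.

```python
DIGITS = set('0123456789')


def transform_digits(text: str, transform) -> str:
    """Apply `transform` to the digits of `text`, leaving separators in place."""
    digits = ''.join(ch for ch in text if ch in DIGITS)
    new_digits = transform(digits)
    if len(new_digits) != len(digits) or any(ch not in DIGITS for ch in new_digits):
        raise ValueError('transform must return the same number of digits')
    replacement = iter(new_digits)
    return ''.join(next(replacement) if ch in DIGITS else ch for ch in text)


def toy_shift(digits: str) -> str:
    """NOT encryption: adds 3 to every digit mod 10, only to exercise the helper."""
    return ''.join(str((int(d) + 3) % 10) for d in digits)


def toy_unshift(digits: str) -> str:
    """Inverse of toy_shift."""
    return ''.join(str((int(d) - 3) % 10) for d in digits)


example = '2017-01-01T12:00:23'
shifted = transform_digits(example, toy_shift)        # '5340-34-34T45:33:56'
restored = transform_digits(shifted, toy_unshift)     # back to `example`
```

With the notebook's cipher, `transform_digits(example, num_cipher.encrypt)` would presumably replace the manual slicing. As the `'0223-76-68T70:45:11'` output shows, preserving the format does not guarantee a valid date, so downstream date parsers may reject the result.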

#### Homework Challenge

Create a function that applies format-preserving encryption to another column in the data.

Return a new dataframe of just the pseudonymized data.

In [84]:

fake = Faker()

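def pseudonymize_data(df, pseudonym_column, preserve_column):
    # Work on a copy so the caller's dataframe is not modified in place.
    out = df.copy()
    out['pseudonym'] = out[pseudonym_column].map(lambda _: fake.user_name())

    # Note: preserve_column is returned unchanged, so it is not pseudonymized.
    return out[[preserve_column, 'pseudonym']]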

data = {'username': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
        'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'diana@example.com', 'eve@example.com']}

df = pd.DataFrame(data)

result_df = pseudonymize_data(df, 'username', 'email')

print(result_df)


                 email       pseudonym
0    alice@example.com     tammyharris
1      bob@example.com  greenelizabeth
2  charlie@example.com    crawfordmary
3    diana@example.com   stewartrobert
4      eve@example.com        sandra04
