<a href="https://colab.research.google.com/github/kokchun/Databehandling-AI22/blob/main/Lectures/L6-anonymisation.ipynb" target="_parent"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> &nbsp; for interacting with the code


---
# Lecture notes - Anonymization
---

This lecture note covers **anonymization**. It builds on pandas and on content from the previous course:

- Python programming

<p class = "alert alert-info" role="alert"><b>Note</b> that this lecture note gives only a brief introduction to anonymization. I encourage you to read further about the topic.</p>

Read more:

- [anonymization example](https://towardsdatascience.com/anonymise-sensitive-data-in-a-pandas-dataframe-column-with-hashlib-8e7ef397d91f)

- [wikipedia - Cryptographic Hash Function](https://en.wikipedia.org/wiki/Cryptographic_hash_function)

---


In [11]:
import hashlib as hl

# encode() returns the string as UTF-8 encoded bytes, which is what hashlib expects
string_to_hash = "gore_bord@gmail.com".encode()
string_to_hash2 = "Gore_bord@gmail.com".encode()
print(string_to_hash)  # note the b prefix: this is a bytes object, not a str

# we use SHA-256 as the cryptographic hash function; hashlib offers others, e.g. SHA-512
# notice that changing a single character (g -> G) gives a completely different hash value
print(f"Email (SHA-256): {hl.sha256(string_to_hash).hexdigest()}") 
print(f"Email (SHA-256): {hl.sha256(string_to_hash2).hexdigest()}")

b'gore_bord@gmail.com'
Email (SHA-256): 7d6d9a849ccc7f5febe065ebe3b4f39558fc96ef865e02333dd7b7426ff0c057
Email (SHA-256): 5a1d7bd95608d9067b72c9a04f97e82067922da234d68713fcd0236611737bd0


In [15]:
# same string and same hash function gives same hash value
hl.sha256("gore_bord@gmail.com".encode()).hexdigest() == hl.sha256(string_to_hash).hexdigest()

True
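
The two properties demonstrated above, determinism and sensitivity to small input changes, can be wrapped in a small helper. This is only a minimal sketch: the function name `sha256_hex` is an assumption introduced here for illustration and is not part of the lecture code.

```python
import hashlib as hl


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of text as a hexadecimal string."""
    # hashlib works on bytes, so the string is UTF-8 encoded first
    return hl.sha256(text.encode("utf-8")).hexdigest()


lower = sha256_hex("gore_bord@gmail.com")
upper = sha256_hex("Gore_bord@gmail.com")

# Deterministic: hashing the same input again gives the same digest
print(lower == sha256_hex("gore_bord@gmail.com"))

# Changing a single character gives a different digest
print(lower == upper)

# A SHA-256 hex digest always has 64 characters (256 bits), whatever the input length
print(len(lower), len(sha256_hex("a")))
```

Because the digest has a fixed length, it does not reveal how long the original string was.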

## Example with Pandas

Example taken from here:
- [towardsdatascience - anonymize sensitive data](https://towardsdatascience.com/anonymise-sensitive-data-in-a-pandas-dataframe-column-with-hashlib-8e7ef397d91f)

- dataset: [Kaggle - credit card customers](https://www.kaggle.com/sakshigoyal7/credit-card-customers)

In [20]:
import pandas as pd
df = pd.read_csv("Data/BankChurners.csv", usecols=[
                 "CLIENTNUM", "Attrition_Flag", "Customer_Age", "Gender", "Total_Trans_Amt"])
df.head(3)


| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Total_Trans_Amt |
|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 1144 |
| 1 | 818770008 | Existing Customer | 49 | F | 1291 |
| 2 | 713982108 | Existing Customer | 51 | M | 1887 |


In [22]:
df.columns

Index(['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender',
       'Total_Trans_Amt'],
      dtype='object')

In [25]:
df["CLIENTNUM"].head()  # dtype is int64, so it must be converted to str before hashing

0    768805383
1    818770008
2    713982108
3    769911858
4    709106358
Name: CLIENTNUM, dtype: int64

In [28]:
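# hashlib works on bytes, so convert the integers to strings that can be encoded
df["CLIENTNUM"] = df["CLIENTNUM"].astype(str)
# hash each client number with SHA-256 and keep the hex digest
hashes = df["CLIENTNUM"].apply(lambda x: hl.sha256(x.encode()).hexdigest())
hashes.head()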

0    c9bbef56f9d8292cb3cfa8ae91f9b9167390e6e4b514d5...
1    7996e2340d70489252370a5df035ec99381c8344cc3511...
2    6fb53dbc743724e086243b5bc288df62b4a6dc1b8bde92...
3    f86b86a1047317685f29c399059b199858685faf5ec6a8...
4    0d239470b0cb57e110cf60bc3865344ee2cdced6e3acdc...
Name: CLIENTNUM, dtype: object
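
One caveat about the hashes above: the client numbers are 9-digit integers, so someone who knows that could hash every candidate number with SHA-256 and match the results against the hashed column. A common mitigation is a keyed hash (HMAC), where a secret key is mixed into the computation. The following is a minimal sketch using the standard library `hmac` module. The names `secret_key` and `keyed_hash` are assumptions made for illustration, and how the key is stored is left out.

```python
import hashlib as hl
import hmac
import secrets

# Hypothetical secret key; in practice it would be stored securely, apart from the data
secret_key = secrets.token_bytes(32)


def keyed_hash(value: str, key: bytes) -> str:
    """Return the HMAC-SHA-256 of value under key as a hexadecimal string."""
    return hmac.new(key, value.encode(), hl.sha256).hexdigest()


client_id = "768805383"
plain = hl.sha256(client_id.encode()).hexdigest()
keyed = keyed_hash(client_id, secret_key)

# Same key and same input give the same value, so rows can still be linked
print(keyed == keyed_hash(client_id, secret_key))

# Plain SHA-256 of the same ID does not reproduce the keyed value
print(keyed == plain)
```

With pandas, the same function could be passed to `apply` in place of the lambda used above.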

In [32]:
# insert the hashes as a new column at position 1, right after CLIENTNUM
df.insert(1, "Hash values", hashes)
df.head()

| | CLIENTNUM | Hash values | Attrition_Flag | Customer_Age | Gender | Total_Trans_Amt |
|---|---|---|---|---|---|---|
| 0 | 768805383 | c9bbef56f9d8292cb3cfa8ae91f9b9167390e6e4b514d5... | Existing Customer | 45 | M | 1144 |
| 1 | 818770008 | 7996e2340d70489252370a5df035ec99381c8344cc3511... | Existing Customer | 49 | F | 1291 |
| 2 | 713982108 | 6fb53dbc743724e086243b5bc288df62b4a6dc1b8bde92... | Existing Customer | 51 | M | 1887 |
| 3 | 769911858 | f86b86a1047317685f29c399059b199858685faf5ec6a8... | Existing Customer | 40 | F | 1171 |
| 4 | 709106358 | 0d239470b0cb57e110cf60bc3865344ee2cdced6e3acdc... | Existing Customer | 40 | M | 816 |
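
The table above still contains the raw `CLIENTNUM` next to its hash, so the data is not yet anonymized. A possible last step is to keep only the hash column and drop the original identifier. The sketch below assumes a small stand-in DataFrame, built from the first rows shown above, so that it runs without the CSV file. The names `df_demo` and `df_anon` are introduced here for illustration.

```python
import hashlib as hl

import pandas as pd

# Small stand-in for BankChurners.csv, using the first rows shown above
df_demo = pd.DataFrame(
    {
        "CLIENTNUM": [768805383, 818770008, 713982108],
        "Customer_Age": [45, 49, 51],
        "Total_Trans_Amt": [1144, 1291, 1887],
    }
)

# Hash the identifier (cast to str first, since it must be encoded to bytes)
hashed_id = df_demo["CLIENTNUM"].astype(str).apply(
    lambda x: hl.sha256(x.encode()).hexdigest()
)

# Drop the raw identifier and keep only its hash as the first column
df_anon = df_demo.drop(columns="CLIENTNUM")
df_anon.insert(0, "Hash values", hashed_id)

print(df_anon.columns.tolist())
```

Since `drop` returns a new DataFrame, `df_demo` keeps the original identifiers. Only `df_anon` should be shared.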


---

Kokchun Giang

[LinkedIn][linkedIn_kokchun]

[GitHub portfolio][github_portfolio]

[linkedIn_kokchun]: https://www.linkedin.com/in/kokchungiang/
[github_portfolio]: https://github.com/kokchun/Portfolio-Kokchun-Giang

---
