<a href="https://colab.research.google.com/github/rtkilian/data-science-blogging/blob/main/hashlib_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Anonymise sensitive data in a pandas DataFrame column with hashlib

A common scenario that Data Scientists come across is sharing data with others. But what should you do if that data contains personally identifiable information (PII) such as email addresses, customer IDs or phone numbers?
A simple solution is to remove these fields before sharing the data. However, your analysis may rely on them: for example, an e-commerce transactional dataset needs customer IDs to show which customer bought which product.
Instead, you can anonymise the PII fields in your data using hashing.

## What is hashing?
Hashing is a one-way process that transforms a string of plaintext characters into a fixed-length string that is, for practical purposes, unique to that input. The hashing process has two important characteristics:
1. It is very difficult to convert a hashed string into its original form
2. The same plaintext string will produce the same hashed output

For these reasons, websites typically store a hash of your password in their database rather than the password itself.
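
Both characteristics are easy to check in a few lines. The sketch below uses Python's built-in `hashlib` module, which is covered in the next section. The helper name `sha256_hex` is my own choice for illustration and is not part of the library.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Return the SHA-256 hex digest of a string (UTF-8 encoded)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

first = sha256_hex("example@email.com")
second = sha256_hex("example@email.com")
other = sha256_hex("example@email.org")

print(first == second)                    # True: same input, same digest
print(first == other)                     # False: different input, different digest
print(len(first), len(sha256_hex("a")))   # 64 64: the length never changes
```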

## A simple example using hashlib
[hashlib](https://docs.python.org/3/library/hashlib.html) is a built-in module in Python that contains many popular hash algorithms. In our tutorial, we're going to be using SHA-256, which is part of the SHA-2 (Secure Hash Algorithm 2) family of algorithms.

Before we can convert our string, say an email address, to a hashed value, we must first convert it into bytes using UTF-8 encoding:

In [1]:
import hashlib

# Encode our string to bytes (UTF-8 is the default encoding)
stringToHash = 'example@email.com'.encode()

We can now hash it using SHA-256:

In [2]:
# Hash using SHA-256 and print
print('Email (SHA-256): ', hashlib.sha256(stringToHash).hexdigest())

Email (SHA-256):  36e96648c5410d00a7da7206c01237139f950bed21d8c729aae019dbe07964e7


That's it! Our fake email address has been successfully hashed.
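
One practical detail: the hash depends on the exact bytes, so stray whitespace or different capitalisation produces a completely different value. The sketch below shows one way to normalise an email before hashing. The helper `hash_email` and its normalisation rules (trim and lower-case) are assumptions for illustration, so adapt them to your own data.

```python
import hashlib

def hash_email(email: str) -> str:
    """Hash an email after trimming whitespace and lower-casing it.

    The normalisation rules are an assumption, not a standard.
    """
    normalised = email.strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

# Hashing the raw strings: the inputs differ, so the digests differ
raw_a = hashlib.sha256("example@email.com".encode()).hexdigest()
raw_b = hashlib.sha256(" Example@Email.com".encode()).hexdigest()
print(raw_a == raw_b)  # False

# Hashing after normalisation: both variants map to the same digest
print(hash_email("example@email.com") == hash_email(" Example@Email.com"))  # True
```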

## A complete example using pandas and hashlib
Now that we can apply hashlib to a single string, it's fairly straightforward to scale this example to a pandas DataFrame. We're going to use credit card customer data, available on [Kaggle](https://www.kaggle.com/sakshigoyal7/credit-card-customers), which was originally made available by [Analyttica TreasureHunt LEAPS](https://leaps.analyttica.com/).

**Scenario:** you need to share a list of credit card customers. You want to retain the field 'CLIENTNUM' as a customer can have multiple credit cards and you want to be able to uniquely identify them.

*Note: I store the CSV in a folder called 'data'. If you are using Google Colab, you will need to upload the file to that folder yourself.*

In [3]:
import pandas as pd

df = pd.read_csv('data/BankChurners.csv', usecols=['CLIENTNUM', 'Customer_Age', 'Gender', 'Attrition_Flag', 'Total_Trans_Amt'])
df.head()

|   | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Total_Trans_Amt |
|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 1144 |
| 1 | 818770008 | Existing Customer | 49 | F | 1291 |
| 2 | 713982108 | Existing Customer | 51 | M | 1887 |
| 3 | 769911858 | Existing Customer | 40 | F | 1171 |
| 4 | 709106358 | Existing Customer | 40 | M | 816 |
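
With the call above, pandas will typically infer `CLIENTNUM` as an integer column, while `hashlib` needs text that can be encoded to bytes. As a sketch of one option, the column can be read as strings at load time through the `dtype` argument of `pd.read_csv`. The snippet uses a small in-memory CSV built from two rows of the preview rather than the full file.

```python
import io
import pandas as pd

# Two rows copied from the preview, held in memory instead of a file
csv_text = (
    "CLIENTNUM,Customer_Age,Gender\n"
    "768805383,45,M\n"
    "818770008,49,F\n"
)

# Default behaviour: the ID column is parsed as integers
as_int = pd.read_csv(io.StringIO(csv_text))

# With dtype: the ID column is kept as text, ready for .encode()
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"CLIENTNUM": str})

print(as_int["CLIENTNUM"].dtype)                    # int64
print(isinstance(as_str["CLIENTNUM"].iloc[0], str)) # True
```

Reading IDs as text also preserves any leading zeros, which would otherwise be lost and change the resulting hash.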


After converting our 'CLIENTNUM' column to a string data type, we can then use pandas `.apply()` to hash all the strings in the column:

In [4]:
# Convert column to string
df['CLIENTNUM'] = df['CLIENTNUM'].astype(str)

# Apply hashing function to the column
df['CLIENTNUM_HASH'] = df['CLIENTNUM'].apply(
    lambda x: 
        hashlib.sha256(x.encode()).hexdigest()
)

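Our newly created column now holds a 64-character hash for each client number. Bear in mind that an unsalted hash of a short numeric ID can be recovered by hashing every possible ID and comparing the results, so the column is a pseudonym rather than a guarantee of anonymity.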
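
If guessable IDs are a concern, one option is a keyed hash, which cannot be reproduced without a secret key. The following is a minimal sketch using Python's built-in `hmac` module. The key value and the helper name `keyed_hash` are hypothetical, and this is not the approach used in the rest of this tutorial.

```python
import hashlib
import hmac

# Hypothetical secret: in practice, generate a random key and keep it private
SECRET_KEY = b"replace-with-a-long-random-secret"

def keyed_hash(value: str, key: bytes = SECRET_KEY) -> str:
    """Return the HMAC-SHA256 hex digest of a string under a secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

plain = hashlib.sha256("768805383".encode()).hexdigest()
keyed = keyed_hash("768805383")

print(keyed == keyed_hash("768805383"))  # True: still deterministic
print(keyed == plain)                    # False: differs from the unkeyed hash
print(len(keyed))                        # 64
```

The helper takes a single string, so it could be passed to `.apply()` on a string column in the same way as the lambda above.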

In [5]:
df[['CLIENTNUM', 'CLIENTNUM_HASH']].head()

|   | CLIENTNUM | CLIENTNUM_HASH |
|---|---|---|
| 0 | 768805383 | c9bbef56f9d8292cb3cfa8ae91f9b9167390e6e4b514d5... |
| 1 | 818770008 | 7996e2340d70489252370a5df035ec99381c8344cc3511... |
| 2 | 713982108 | 6fb53dbc743724e086243b5bc288df62b4a6dc1b8bde92... |
| 3 | 769911858 | f86b86a1047317685f29c399059b199858685faf5ec6a8... |
| 4 | 709106358 | 0d239470b0cb57e110cf60bc3865344ee2cdced6e3acdc... |
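
Before sharing, it is worth checking that the hash still identifies each customer uniquely and then dropping the original identifier. The sketch below does both on a toy DataFrame: the client numbers come from the preview, but the repeated third row and its amount are invented for illustration.

```python
import hashlib
import pandas as pd

# Toy frame: one client number is repeated on purpose (third row is invented)
toy = pd.DataFrame({
    "CLIENTNUM": ["768805383", "818770008", "768805383"],
    "Total_Trans_Amt": [1144, 1291, 500],
})

toy["CLIENTNUM_HASH"] = toy["CLIENTNUM"].apply(
    lambda x: hashlib.sha256(x.encode()).hexdigest()
)

# The hash column distinguishes as many customers as the original column
print(toy["CLIENTNUM"].nunique() == toy["CLIENTNUM_HASH"].nunique())  # True

# Drop the original identifier so that only the hash is shared
shareable = toy.drop(columns=["CLIENTNUM"])
print(list(shareable.columns))  # ['Total_Trans_Amt', 'CLIENTNUM_HASH']
```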


## Conclusion
After completing this tutorial, you should have a basic understanding of what a hash algorithm is. We saw how to use **hashlib** to hash a single string and how to apply the same approach to a pandas DataFrame column to anonymise sensitive information.

Do you have any questions? [**Tweet me**](https://twitter.com/rtkilian) or add me on [**LinkedIn**](https://www.linkedin.com/in/rtkilian/).

You can find all the code used in this post on [**GitHub**](https://github.com/rtkilian/data-science-blogging/blob/main/hashlib_pandas.ipynb).