# Anonymising Research Data

A practical Python implementation of the anonymisation guidelines from the [UK Data Service](https://www.ukdataservice.ac.uk/manage-data/legal-ethical/anonymisation)

## Pandas

This notebook will also serve as an introduction to the Python library [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html). **Pandas** is a library for handling tabular data. It's a lot more robust than **NumPy** for data that is not just numbers. Its main data format is the **DataFrame**, which is like a 2D array, but with column names and an index for the rows.
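
As a minimal sketch of what a DataFrame looks like, the toy table below (all names and values are made up for illustration) is built from a dictionary and then queried by column name and row index.

```python
import pandas as pd

# A toy table: each dictionary key becomes a named column (values are made up)
toy = pd.DataFrame(
    {"name": ["Ada", "Bob", "Cy"], "age": [36, 41, 29]},
    index=["row_a", "row_b", "row_c"],
)

print(toy.columns.tolist())     # the column names
print(toy.loc["row_b", "age"])  # look up one value by row index and column name
print(toy["age"].mean())        # numeric columns behave much like NumPy arrays
```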

Normally (in a ``real experiment``), we'd have the data saved in something like the **comma-separated values (.csv)** format.

We'll see how to load that, but first we'll see how I generated some fake data.

If you are running this using Google Colab, you'll need to run the cell below to copy across some data 

In [None]:
#ONLY DO THIS IF YOU ARE USING GOOGLE COLAB (NOT LOCAL!)
!git clone https://github.com/ual-cci/anonymising-data
%cd anonymising-data

In [None]:
#import the library (pandas) and give it an alias (pd)
import pandas as pd
import numpy as np
import os

In [None]:
#install the faker library
!pip3 install Faker

In [None]:
from faker import Faker

In [None]:
# Generating some fake data 

faker = Faker()
data = []
postcodes = pd.read_csv("postcodes.csv").values.flatten()
for _ in range(100):
    profile = faker.simple_profile()
    profile["address"] = np.random.choice(postcodes)
    profile["job"] = faker.job()
    profile["phone"] = faker.phone_number()
    profile["userid"] = faker.uuid4()
    profile["survey_answers"] = np.random.randint(1,6,20)
    profile["salary"] = np.random.normal(22000,2000)
    data.append(profile)

#High earning outliers
data[0]["salary"] = 120000
data[1]["salary"] = 140000
my_data = pd.DataFrame(data)

In [None]:
my_data.to_csv("my_fake_data.csv", index = False)

## Quantitative Data

### Remove direct identifiers from a dataset

Such identifiers are often not necessary for secondary research.

Let's first look at what information in here is **personally identifiable** but **not relevant** to research.

The ``.head()`` method lets us eyeball the first few entries in the dataset.


In [None]:
#Read in .csv file, tell Pandas which column contains dates
data = pd.read_csv("my_fake_data.csv", parse_dates = ["birthdate"])

In [None]:
data.head(5)

Some columns we need to delete completely; others we need to remove but keep a reference to.

We can use the Pandas function ``.drop()`` to remove columns entirely. This is for things that are identifiable and not useful for sharing with other researchers.

In [None]:
## Get rid of username, phone, userid
data = data.drop(["username", "phone", "userid"], axis = 1)

In [None]:
data.columns

We want to remove the name and email as they are identifiable, **but** we want to keep a reference to them so we can go back and identify the participants if we need to later.

We can save this reference separately in a **very secure and restricted place**. There is no reason for anyone but the lead researchers to ever have access to this. 

In [None]:
#Get unique() names 
names = data["name"].unique()

#New dictionaries to keep mappings
id_to_name = {}
name_to_id = {}

#Go through each name
for i, name in enumerate(names):
    
    #Make a new identifier
    identifier = "P"+str(i)
    
    #Get the email
    email = data[data["name"] == name]["mail"].item()
    
    #Save the name and email against the identifier
    id_to_name[identifier] = [name, email]
    
    #And reverse 
    name_to_id[name] = identifier

In [None]:
#Make a new column with the participant id
data["participant_id"] = [name_to_id[name] for name in data["name"]]

In [None]:
#Save the reference file (keep this secure!)
pd.DataFrame(id_to_name).T.to_csv("participant_lookup.csv", header = False)

In [None]:
#drop the name and email columns 
data = data.drop(["name", "mail"], axis = 1)
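
If a participant later needs to be re-identified, the saved reference file can be read back in. The sketch below is self-contained, so it uses a made-up, in-memory stand-in for ``participant_lookup.csv``; the column names are assumptions, chosen here because the file was saved without a header row.

```python
import io
import pandas as pd

# Stand-in for the contents of participant_lookup.csv (made-up people)
lookup_csv = io.StringIO(
    "P0,Ada Example,ada@example.org\n"
    "P1,Bob Example,bob@example.org\n"
)

# The file has no header row, so we supply (assumed) column names ourselves
lookup = pd.read_csv(lookup_csv, header=None, names=["participant_id", "name", "mail"])
lookup = lookup.set_index("participant_id")

# Go from an identifier back to the person
print(lookup.loc["P1", "name"])  # Bob Example
```

With the real file, the ``io.StringIO`` object would be replaced by the path to wherever the lookup is securely stored.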

### Aggregate or reduce the precision 

You can do this for:

* Age
    * Record birth year (not month, day)
* Place of residence
    * Record postcode sectors (first 3-4 characters)


### Dates

In Pandas, the date is specified in its own data type, so it's quite easy to manipulate, for example to reduce its precision.

We can reformat the column by giving it a new `format string`. Here, we give it ``'%Y'`` to tell it to **keep the year only**. Note that the result is a column of strings rather than dates.

You can find the documentation for formatting date strings in Python [here](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)
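
A small sketch of the same idea on made-up dates: ``.dt.strftime('%Y')`` gives the year as text, while the ``.dt.year`` accessor is an alternative that gives it as an integer, which may be more convenient if ages need to be calculated later.

```python
import pandas as pd

# Made-up birthdates, parsed into the Pandas datetime type
dates = pd.to_datetime(pd.Series(["1984-03-17", "1991-11-02"]))

years_as_text = dates.dt.strftime("%Y")  # year only, as strings
years_as_int = dates.dt.year             # year only, as integers

print(years_as_text.tolist())  # ['1984', '1991']
print(years_as_int.tolist())   # [1984, 1991]
```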

In [None]:
## Reformatting dates 
data["birthdate"] = data["birthdate"].dt.strftime('%Y')

### Post codes

For the postcode, we can remove the end to reduce precision.

For UK postcodes, the first part (the outward code) can be 2-4 characters, but the second part is always 3 characters. So we can consistently achieve the anonymisation by removing the **last 3 characters**.

Here we use the ``str.slice()`` function with a **negative index** to say ``stop 3 from the end``
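
As an illustration on a few example postcodes (not taken from ``postcodes.csv``, whose exact formatting is not shown here), the sketch below slices off the last 3 characters. If the postcodes contain a space, a trailing space is left behind, so an extra ``str.strip()`` is added as a precaution.

```python
import pandas as pd

# Example postcodes of different lengths, written with a space
postcodes = pd.Series(["SE15 5AA", "N1 9GU", "EC1A 1BB"])

# Drop the last 3 characters, then tidy up any leftover whitespace
outward = postcodes.str.slice(stop=-3).str.strip()

print(outward.tolist())  # ['SE15', 'N1', 'EC1A']
```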

In [None]:
## Reformatting strings (delete last 3 characters)
data["address"] = data["address"].str.slice(stop = -3) 

### Restrict upper or lower ranges 

We can hide outliers, as these may be used to identify people known to have atypical values.

For example, you can **top-code** or **bottom-code** high or low values respectively. This means grouping everyone above or below a given threshold into one category. 
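
One possible way to apply both limits in a single step is the ``clip()`` method, sketched below on made-up salaries. The thresholds here are hypothetical and chosen only for illustration; values beyond a limit are replaced by that limit.

```python
import pandas as pd

salaries = pd.Series([21000, 23500, 120000, 9000])  # made-up values

# Hypothetical thresholds, chosen only for this illustration
bottom_limit = 15000
top_limit = 70000

# Everything below bottom_limit becomes bottom_limit, everything above top_limit becomes top_limit
coded = salaries.clip(lower=bottom_limit, upper=top_limit)

print(coded.tolist())  # [21000, 23500, 70000, 15000]
```

When sharing the data, it is worth documenting that the limit values mean "this much or more" (or "this much or less") rather than an exact figure.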

In [None]:
## Threshold and replace 
top_limit = 70000
column = "salary"
row_indexer = data[column] > top_limit

#Using .loc and a mask as a [row_indexer, col_indexer]
data.loc[row_indexer, column] = top_limit

In [None]:
data

## Media Data 

### Audio (Voices)

Sometimes it's necessary to disguise voices in audio recordings. We can use the `librosa` package to re-pitch a voice but keep the time information the same.
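
To get a feel for how big a shift is, it helps to know that ``librosa`` measures pitch shifts in steps, which by default are semitones (12 per octave). The arithmetic below is a small sketch of what a shift of -4 steps does to a frequency; the 200 Hz starting pitch is just an illustrative value.

```python
# Each semitone multiplies the frequency by 2 ** (1 / 12)
n_steps = -4
ratio = 2 ** (n_steps / 12)

original_hz = 200.0  # illustrative pitch, not measured from any recording
shifted_hz = original_hz * ratio

print(round(ratio, 3))       # 0.794
print(round(shifted_hz, 1))  # 158.7
```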


In [None]:
#Install audio packages
!pip3 install librosa
!pip3 install soundfile

In [None]:
import librosa
import soundfile as sf

### Using your own files - Colab Users

You can run these examples using the test files we have given you by just leaving the paths as they are. 

You can use **your own files** by mounting your Google Drive by running the cell below. You'll then be able to access files from your Google Drive at the root path ``/content/drive``


In [None]:
#ONLY DO THIS IF YOU ARE USING GOOGLE COLAB
from google.colab import drive
drive.mount("/content/drive")

In [None]:
#Pick paths
file_path = 'voice.wav'
output_file = 'repitched_voice.wav'

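#Load in file (by default librosa converts to mono and resamples to 22050 Hz)
y, sr = librosa.load(file_path)
#Shift down 4 semitones (sr and n_steps must be keyword arguments in librosa 0.10+)
repitched_audio = librosa.effects.pitch_shift(y, sr=sr, n_steps=-4)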

#Save altered file
sf.write(output_file, repitched_audio, sr, subtype='PCM_24')

### Faces - Images

We can also use the ``OpenCV`` package to find faces and apply a ``Gaussian blur``

The cell below applies this to every image in a given folder, one at a time, using Python's ``os.walk`` function. This does a recursive walk through all the folders from a given top directory.
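
The sketch below shows ``os.walk`` on its own, using a temporary folder of empty placeholder files so it can run anywhere. The helper ``find_images`` and the list of extensions are assumptions for illustration; filtering by extension is one way to skip files that are not images.

```python
import os
import tempfile

# Assumed set of image extensions; extend as needed
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def find_images(top_folder):
    """Return the paths of all image files beneath top_folder (hypothetical helper)."""
    found = []
    for root, dirs, files in os.walk(top_folder):
        for name in files:
            extension = os.path.splitext(name)[1].lower()
            if extension in IMAGE_EXTENSIONS:
                found.append(os.path.join(root, name))
    return sorted(found)

# Build a throwaway folder tree with empty placeholder files
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "day1"))
    for rel in ["a.jpg", "notes.txt", ".DS_Store", os.path.join("day1", "b.PNG")]:
        open(os.path.join(top, rel), "w").close()
    images = [os.path.relpath(path, top) for path in find_images(top)]

print(images)  # the two image files, including the one in the sub-folder
```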

In [None]:
#Install computer vision packages
!pip3 install opencv-contrib-python

In [None]:
#adapted from https://www.geeksforgeeks.org/how-to-blur-faces-in-images-using-opencv-in-python/

import cv2
import matplotlib.pyplot as plt

top_folder = "images/"
for root, dirs, files in os.walk(top_folder, topdown=False):
    for name in files:
        if not name == ".DS_Store":
            image_path = os.path.join(root, name)
            print(image_path)
            # Reading an image using OpenCV
            image = cv2.imread(image_path)
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

            face_detect = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
            face_data = face_detect.detectMultiScale(image, 1.3, 2)

            # Each detected face is a rectangle (x, y, w, h), which is our region of interest (ROI)
            for (x, y, w, h) in face_data:
                roi = image[y:y+h, x:x+w]
                # applying a gaussian blur over this new rectangle area
                roi = cv2.GaussianBlur(roi, (23, 23), 30)
                # impose this blurred image on original image to get final image
                image[y:y+roi.shape[0], x:x+roi.shape[1]] = roi

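            #Convert back to BGR (the channel order cv2.imwrite expects)
            image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
            #Save next to the original, whatever the length of its extension
            cv2.imwrite(os.path.splitext(image_path)[0]+"_blurred.jpg", image)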

### Faces - Video

Now we see how we can go through each frame one by one and blur the faces.

At the end of the cell, we also put the audio back using the ``ffmpeg`` command-line tool, and optionally repitch it.

To install ffmpeg on macOS, the easiest way is through [Homebrew](https://brew.sh/) (which you might have to install as well!)

Use ``brew install ffmpeg``
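
Before running the video cell, it can be useful to check that ffmpeg is actually available. The sketch below also builds the audio-extraction command as a list of arguments for ``subprocess.run``, which is an alternative to joining strings that copes with spaces in file names. The helper name and the example file name are hypothetical; ``-y`` tells ffmpeg to overwrite an existing output file.

```python
import os
import shutil
import subprocess

def extract_audio_command(input_video_path, audio_path):
    """Build the ffmpeg arguments for pulling the audio out of a video (hypothetical helper)."""
    return ["ffmpeg", "-y", "-i", input_video_path, "-q:a", "0", "-map", "a", audio_path]

command = extract_audio_command("my video.mp4", "audio_temp.wav")
print(command)

# Only run the command if ffmpeg is installed and the example video exists
if shutil.which("ffmpeg") is not None and os.path.exists("my video.mp4"):
    subprocess.run(command, check=True)
else:
    print("Skipping: ffmpeg or the example video was not found")
```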

In [None]:
input_video_path = "louis.mp4"
output_video_path = "anon.mp4"

cap = cv2.VideoCapture(input_video_path)
#Get input video meta data
fps = cap.get(cv2.CAP_PROP_FPS)
width  = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter("blurred_temp.mp4", fourcc, fps, (width,  height))
face_detect = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

#For every video frame
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        print("End of stream")
        break
    
    #Blur face
    face_data = face_detect.detectMultiScale(frame, 1.2, 1)
    for (x, y, w, h) in face_data:
        roi = frame[y:y+h, x:x+w]
        roi = cv2.GaussianBlur(roi, (23, 23), 30)
        frame[y:y+roi.shape[0], x:x+roi.shape[1]] = roi
        
    #Write new frame
    out.write(frame)
    
#Close the input and output video files
cap.release()
out.release()

#Extract original audio
os.system("ffmpeg -i " + input_video_path + " -q:a 0 -map a audio_temp.wav")

#REPITCH AUDIO (comment out if not wanted!)
y, sr = librosa.load("audio_temp.wav")
repitched_audio = librosa.effects.pitch_shift(y, sr=sr, n_steps=-4)
sf.write("audio_temp.wav", repitched_audio, sr, subtype='PCM_24')

#Put audio back on output video
os.system("ffmpeg -i  blurred_temp.mp4 -i audio_temp.wav -map 0:v:0 -map 1:a:0 " + output_video_path)

#Delete temp files
os.remove("audio_temp.wav")
os.remove("blurred_temp.mp4")