# Coursework 2: Data Anonymisation and Privacy

Methodology:
1. Import required packages
2. Import dataset and identify its features
3. Preprocess features via removal and manipulation for anonymisation
4. Prepare dataset for export, identifying the relevant features for the different audiences
5. Decide on export format of dataset and sharing method

## Import Packages
Import the packages required for loading and anonymising the data

In [3]:
import pandas as pd

## Data Import
The data was found to have a row index and the following columns:

### Surveyee Info
* given_name <-- to remove
* surname <-- to remove
### Demographic Info
* gender
* birthdate <-- to remove and replace with banded age
* country_of_birth
* current_country (can be left but info below state level must be removed)
* phone_number <-- to remove
* postcode <-- to remove
### Financial Info
* national_insurance_number <-- to remove
* bank_account_number <-- to remove
### Body Info
* cc_status
* weight
* height
* blood_group
### Vices Info
* avg_n_drinks_per_week
* avg_n_cigret_per_week
### Other Info
* education_level
* n_countries_visited <-- of interest
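
The removals flagged in the list above could be sketched as follows. This is a minimal illustration on a two-row toy frame with made-up values, not the coursework data; `DIRECT_IDENTIFIERS` and `drop_direct_identifiers` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Columns flagged "to remove" in the list above
# (birthdate is handled separately because it is replaced by an age band).
DIRECT_IDENTIFIERS = [
    "given_name", "surname", "phone_number", "postcode",
    "national_insurance_number", "bank_account_number",
]


def drop_direct_identifiers(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without whichever direct-identifier columns are present."""
    present = [col for col in DIRECT_IDENTIFIERS if col in df.columns]
    return df.drop(columns=present)


# Toy rows with invented values, used only to show the effect.
toy = pd.DataFrame({
    "given_name": ["A", "B"],
    "surname": ["X", "Y"],
    "gender": ["F", "M"],
    "postcode": ["AA1 1AA", "BB2 2BB"],
    "n_countries_visited": [3, 7],
})
toy_clean = drop_direct_identifiers(toy)
print(list(toy_clean.columns))
```

Dropping by an explicit deny-list is one option; selecting an explicit allow-list of columns to keep has the advantage that any column added to the source later is excluded by default.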

In [18]:
data = pd.read_csv("Data/customer_information.csv")

In [10]:
data.shape

(1000, 18)

In [11]:
data.tail(5)

Unnamed: 0,given_name,surname,gender,birthdate,country_of_birth,current_country,phone_number,postcode,national_insurance_number,bank_account_number,cc_status,weight,height,blood_group,avg_n_drinks_per_week,avg_n_cigret_per_week,education_level,n_countries_visited
995,Allan,Hammond,M,1964-01-26,Nepal,United Kingdom,+447700900869,SA92 1SJ,ZZ 648472 T,72521708,0,92.7,1.98,A+,1.8,262.4,secondary,21
996,Robin,Morris,M,2002-06-19,Estonia,United Kingdom,(07700) 900743,TS27 2FD,ZZ 851919 T,14900523,0,56.1,1.85,B+,7.7,336.2,other,35
997,Stacey,Barnett,F,1956-04-26,Botswana,United Kingdom,+447700 900776,G89 7HN,ZZ783809T,28276780,0,94.9,2.0,O+,0.9,55.7,secondary,35
998,Jayne,Harrison,F,1962-08-16,Guernsey,United Kingdom,(07700)900596,CT5B 5BN,ZZ793814T,62820464,0,75.6,1.5,A+,4.7,430.5,bachelor,35
999,Oliver,Holmes,M,1957-01-10,Canada,United Kingdom,07700 900 536,SR56 7HG,ZZ 09 94 67 T,88029663,0,95.6,1.65,B-,0.7,34.6,masters,47


## Preprocessing
Remove the information that is not required and anonymise what remains

In [19]:
# present age in bands
# note: ages are relative to the year in which the notebook is run,
# and any age outside (0, 70] is left as NaN by pd.cut

data['birthYear'] = pd.to_datetime(data['birthdate'], format="%Y-%m-%d").dt.year
data['birthAge'] = pd.Timestamp.now().year - data['birthYear']
data.birthAge.unique() # inspect the ages to confirm they all fall inside the bins below
bins = [0, 10, 20, 30, 40, 50, 60, 70]
data['banded_age'] = pd.cut(data['birthAge'], bins)

In [21]:
# keep only the columns that do not directly identify a person

cleaned_data = data[[
    'gender', 'banded_age', 'country_of_birth', 'current_country',
    'cc_status', 'weight', 'height', 'blood_group',
    'avg_n_drinks_per_week', 'avg_n_cigret_per_week',
    'education_level', 'n_countries_visited',
]]
cleaned_data

Unnamed: 0,gender,banded_age,country_of_birth,current_country,cc_status,weight,height,blood_group,avg_n_drinks_per_week,avg_n_cigret_per_week,education_level,n_countries_visited
0,F,"(30, 40]",Armenia,United Kingdom,0,74.2,1.73,B+,6.5,218.8,phD,48
1,M,"(20, 30]",Northern Mariana Islands,United Kingdom,0,69.4,1.74,O-,0.7,43.6,primary,42
2,F,"(30, 40]",Venezuela,United Kingdom,0,98.6,1.88,B+,7.8,59.1,bachelor,9
3,F,"(20, 30]",Eritrea,United Kingdom,0,62.0,1.56,O+,4.6,284.2,primary,32
4,F,"(50, 60]",Ecuador,United Kingdom,0,96.3,1.81,A-,4.4,348.8,secondary,34
...,...,...,...,...,...,...,...,...,...,...,...,...
995,M,"(50, 60]",Nepal,United Kingdom,0,92.7,1.98,A+,1.8,262.4,secondary,21
996,M,"(10, 20]",Estonia,United Kingdom,0,56.1,1.85,B+,7.7,336.2,other,35
997,F,"(60, 70]",Botswana,United Kingdom,0,94.9,2.00,O+,0.9,55.7,secondary,35
998,F,"(50, 60]",Guernsey,United Kingdom,0,75.6,1.50,A+,4.7,430.5,bachelor,35
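
The age banding used to produce `banded_age` depends on the year in which the notebook is run, so the bands drift over time and anyone older than the top bin edge gets no band. A possible variant, sketched under assumptions, fixes a reference year and adds an open-ended top band. `REFERENCE_YEAR`, `band_age` and the band labels are hypothetical choices, and the bands here include their lower edge, which differs from the `(20, 30]` style intervals shown in the output.

```python
import pandas as pd

REFERENCE_YEAR = 2022  # assumption: fixed so that the bands do not change between runs


def band_age(birthdates: pd.Series, reference_year: int = REFERENCE_YEAR) -> pd.Series:
    """Map ISO-formatted birthdates to ten-year age bands with an open-ended top band."""
    ages = reference_year - pd.to_datetime(birthdates, format="%Y-%m-%d").dt.year
    edges = [0, 10, 20, 30, 40, 50, 60, 70, float("inf")]
    labels = ["0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70+"]
    # right=False makes each band include its lower edge, so age 20 falls in "20-29"
    return pd.cut(ages, bins=edges, labels=labels, right=False)


# Made-up birthdates, used only to show the mapping.
sample = pd.Series(["1964-01-26", "2002-06-19", "1940-03-02"])
banded = band_age(sample)
print(banded.tolist())
```

As in the notebook cell, age is approximated as a difference of years and ignores whether the birthday has already passed.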


## Prepare for distribution
### For Researchers
* Key variables of interest: n_countries_visited and cc_status

In [22]:
researcher_data = cleaned_data[['gender', 'banded_age', 'country_of_birth', 'current_country', 'cc_status', 'n_countries_visited']]
researcher_data

Unnamed: 0,gender,banded_age,country_of_birth,current_country,cc_status,n_countries_visited
0,F,"(30, 40]",Armenia,United Kingdom,0,48
1,M,"(20, 30]",Northern Mariana Islands,United Kingdom,0,42
2,F,"(30, 40]",Venezuela,United Kingdom,0,9
3,F,"(20, 30]",Eritrea,United Kingdom,0,32
4,F,"(50, 60]",Ecuador,United Kingdom,0,34
...,...,...,...,...,...,...
995,M,"(50, 60]",Nepal,United Kingdom,0,21
996,M,"(10, 20]",Estonia,United Kingdom,0,35
997,F,"(60, 70]",Botswana,United Kingdom,0,35
998,F,"(50, 60]",Guernsey,United Kingdom,0,35
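
Before a subset like this is shared, it may be worth checking how many respondents share each combination of quasi-identifiers, since a combination held by a single person can still single them out. The sketch below is one way to do that, in the spirit of a k-anonymity check. The choice of `QUASI_IDENTIFIERS`, the helper `smallest_group_size` and the toy rows are all assumptions made for illustration.

```python
import pandas as pd

# Assumed choice of quasi-identifiers; the appropriate set depends on the release.
QUASI_IDENTIFIERS = ["gender", "banded_age", "country_of_birth"]


def smallest_group_size(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Return the size of the smallest group of rows sharing the same quasi-identifier values."""
    return int(df.groupby(quasi_identifiers, observed=True).size().min())


# Toy rows with invented values.
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "M"],
    "banded_age": ["30-39", "30-39", "20-29", "20-29", "50-59"],
    "country_of_birth": ["Nepal", "Nepal", "Canada", "Canada", "Canada"],
})
k = smallest_group_size(toy, QUASI_IDENTIFIERS)
print(k)
```

A result of 1 means at least one row is unique on those columns, which would suggest coarser bands or a broader geography before release.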


### For Government
* Interested in the link between gene and education, or in a geographical point of view
* Key variables of interest: education_level, current_country, country_of_birth and cc_status

In [23]:
government_data = cleaned_data[['gender', 'banded_age', 'country_of_birth', 'current_country', 'cc_status', 'education_level']]
government_data

Unnamed: 0,gender,banded_age,country_of_birth,current_country,cc_status,education_level
0,F,"(30, 40]",Armenia,United Kingdom,0,phD
1,M,"(20, 30]",Northern Mariana Islands,United Kingdom,0,primary
2,F,"(30, 40]",Venezuela,United Kingdom,0,bachelor
3,F,"(20, 30]",Eritrea,United Kingdom,0,primary
4,F,"(50, 60]",Ecuador,United Kingdom,0,secondary
...,...,...,...,...,...,...
995,M,"(50, 60]",Nepal,United Kingdom,0,secondary
996,M,"(10, 20]",Estonia,United Kingdom,0,other
997,F,"(60, 70]",Botswana,United Kingdom,0,secondary
998,F,"(50, 60]",Guernsey,United Kingdom,0,bachelor


## Export data

In [26]:
# export as zip file?
researcher_compression_opts = dict(method='zip', archive_name='researcher_data.csv')  
researcher_data.to_csv('researcher_data.zip', index=True, compression=researcher_compression_opts)

government_compression_opts = dict(method='zip', archive_name='government_data.csv')  
government_data.to_csv('government_data.zip', index=True, compression=government_compression_opts)

In [27]:
# check that the zip file contains the expected CSV
import zipfile
with zipfile.ZipFile('researcher_data.zip') as zf:
    # list all files inside the zip archive
    names = zf.namelist()
names

['researcher_data.csv']
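
Beyond listing the archive members, the exported file can be read back to confirm that it holds only the intended columns. The following is a minimal round-trip sketch on a toy frame written to a temporary directory; the file names are hypothetical and the values are invented.

```python
import os
import tempfile
import zipfile

import pandas as pd

toy = pd.DataFrame({"gender": ["F", "M"], "n_countries_visited": [48, 42]})

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "toy_data.zip")  # hypothetical file name
    toy.to_csv(
        zip_path,
        index=False,
        compression=dict(method="zip", archive_name="toy_data.csv"),
    )

    # List the archive members; the context manager closes the file handle.
    with zipfile.ZipFile(zip_path) as zf:
        members = zf.namelist()

    # pandas infers zip compression from the .zip extension when reading.
    round_trip = pd.read_csv(zip_path)

print(members)
print(list(round_trip.columns))
```

Writing with `index=False` keeps the row index out of the file; the export cell above uses `index=True`, which adds the original row numbers as an extra column.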

## Thoughts for group sharing

* remove place of birth, ethnicity, name of social services in charge of care, rare disease or treatment?
* Aggregation for gov data?
* anonymise place of birth by continent?
* double check format of export
* prepare citation for dataset for those who want access to it
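
Two of the ideas above, generalising place of birth to continent and aggregating the government data, could be prototyped as follows. This is a sketch under assumptions: `COUNTRY_TO_CONTINENT` is a hypothetical and deliberately incomplete lookup, the column names `continent_of_birth` and `n_respondents` are invented, and the rows are toy values.

```python
import pandas as pd

# Hypothetical, deliberately incomplete lookup; a real one would cover every country.
COUNTRY_TO_CONTINENT = {
    "Nepal": "Asia",
    "Estonia": "Europe",
    "Botswana": "Africa",
    "Canada": "North America",
}

toy = pd.DataFrame({
    "country_of_birth": ["Nepal", "Estonia", "Botswana", "Canada", "Nepal"],
    "education_level": ["secondary", "other", "secondary", "masters", "secondary"],
})

# Generalise country to continent; countries missing from the lookup become "Unknown".
toy["continent_of_birth"] = (
    toy["country_of_birth"].map(COUNTRY_TO_CONTINENT).fillna("Unknown")
)

# Aggregate to counts per (continent, education level) instead of sharing row-level data.
aggregated = (
    toy.groupby(["continent_of_birth", "education_level"])
    .size()
    .reset_index(name="n_respondents")
)
print(aggregated)
```

Cells with very small counts in such a table may still need suppressing or merging before release.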