# Data Anonymization
## Author: Rida Naeem

In this Notebook, I walk through some manual ways to k-anonymize data using Python. To start, import the necessary libraries and load the data, which is a hospital dataset from the UK.

In [1]:
import numpy as np
import pandas as pd
import sys
import matplotlib.pyplot as plt

In [2]:
filename = "./hospital_ae_data.csv"
df = pd.read_csv(filename)

Let's learn a little bit about our data. First, what are the columns? This will tell us what information is available to us.

In [3]:
print(df.columns)

Index(['Health Service ID', 'Age', 'Time in A&E (mins)', 'Hospital',
       'Arrival Time', 'Treatment', 'Gender', 'Postcode'],
      dtype='object')


Next, let's figure out the size of the dataset and the types of variables we have. This will help us understand how many rows we are working with and how to manipulate them.

In [4]:
print(df.shape)

(10000, 8)


In [5]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Health Service ID   10000 non-null  object
 1   Age                 10000 non-null  int64 
 2   Time in A&E (mins)  10000 non-null  int64 
 3   Hospital            10000 non-null  object
 4   Arrival Time        10000 non-null  object
 5   Treatment           10000 non-null  object
 6   Gender              10000 non-null  object
 7   Postcode            10000 non-null  object
dtypes: int64(2), object(6)
memory usage: 625.1+ KB
None


Finally, let's take a look at the first 5 rows of the data.

In [6]:
print(df.head())

  Health Service ID  Age  Time in A&E (mins)  \
0      744-256-9522   42                  56   
1      303-192-8548   27                  40   
2      758-738-5880   54                  61   
3      452-291-9709   35                  51   
4      477-915-0508   14                  32   

                             Hospital         Arrival Time  \
0                   Kingston Hospital  2019-04-07 02:16:52   
1  Northwick Park & St Marks Hospital  2019-04-05 20:40:47   
2                     Barnet Hospital  2019-04-03 12:59:13   
3                 Hillingdon Hospital  2019-04-04 15:20:58   
4             The Royal Free Hospital  2019-04-01 11:27:12   

                       Treatment Gender  Postcode  
0                       Dressing   Male  EC3N 3DH  
1                   Central line   Male   W1Y 0HB  
2                   Central line   Male   NW6 6AF  
3  Other (consider alternatives)   Male   IG6 1GX  
4            Incision & drainage   Male   UB3 3EL  


Now we can get into our anonymization. As a reminder, quasi-identifiers don't directly reveal identity, but with background information they can be pieced together to identify an individual. For this dataset, I would consider the following variables quasi-identifiers:

In [7]:
quasi_identifier = ["Age", "Time in A&E (mins)", "Hospital", "Arrival Time", "Treatment", "Gender", "Postcode"]

Now let's calculate the current k-value for each variable to get an idea of how many instances of each value there are. I wrote a function that does the calculation and prints a report with the results.

In [8]:
def calculateK():
    # Count how many occurrences there are of each value in the dataset, separated by attribute (column)

    attribute_values = {}  # {attribute: {item in attribute: occurrences}}

    for attribute in quasi_identifier:
        current_col = df[attribute]  # all values in this column
        found_values = {}  # {item in attribute: occurrences}
        attribute_values[attribute] = found_values

        for item in current_col:
            if item in found_values:
                found_values[item] = found_values[item] + 1
            else:
                found_values[item] = 1

    print(attribute_values)

    # Find the k-anonymity value. This value tells us the lowest number of individuals that share a particular trait.

    min_attributes_values = []
    attributes = []

    for attribute in attribute_values:
        attributes.append(attribute)
        found_values = attribute_values[attribute]  # {item in attribute: occurrences}
        min_amt = sys.maxsize  # start from a very large integer so any real count is smaller

        for value in found_values:
            if found_values[value] < min_amt:
                min_amt = found_values[value]  # search for and update the min occurrences in an attribute

        min_values = []

        for value in found_values:
            if found_values[value] == min_amt:
                min_values.append(value) #the value(s) associated with the min amount

        min_attributes_values.append(min_amt)
        print("The minimum value(s) from", attribute,"is/are", min_values)
        print("with", min_amt, "occurrence(s)")
        print(" ")

    k_value = min(min_attributes_values)
    k_value_attributes = []

    for i, amount in enumerate(min_attributes_values):
        if amount == k_value:  # the k-value is the smallest overall number of occurrences
            k_value_attributes.append(attributes[i])

    print("The k-anonymity value is", k_value, "for", k_value_attributes)

Now let's call the function and see what the lowest numbers of occurrences are. This way we can target those variables first in our manual k-anonymization protocol. If we were using an algorithm to anonymize instead, we would not have to do this, but this notebook is meant to be a walkthrough of what is happening at the variable level.

In [9]:
calculateK()

{'Age': {42: 204, 27: 145, 54: 171, 35: 198, 14: 87, 51: 177, 52: 154, 28: 163, 79: 31, 17: 113, 43: 199, 16: 105, 75: 48, 23: 143, 12: 66, 3: 39, 31: 175, 49: 177, 25: 157, 36: 181, 48: 155, 47: 212, 65: 101, 45: 207, 13: 74, 68: 63, 63: 124, 30: 181, 53: 165, 34: 223, 15: 88, 41: 204, 6: 51, 18: 86, 84: 31, 19: 94, 90: 7, 86: 23, 69: 69, 44: 165, 26: 157, 60: 124, 64: 82, 29: 181, 70: 62, 76: 43, 24: 138, 33: 188, 40: 197, 61: 124, 66: 79, 72: 71, 38: 169, 55: 170, 21: 125, 11: 81, 96: 7, 62: 106, 56: 153, 46: 175, 4: 34, 50: 155, 0: 240, 57: 147, 71: 73, 67: 90, 22: 119, 58: 134, 32: 183, 8: 51, 20: 123, 10: 58, 92: 9, 87: 13, 88: 16, 37: 199, 39: 197, 5: 33, 80: 25, 1: 30, 78: 35, 59: 128, 85: 25, 98: 4, 74: 52, 2: 25, 77: 38, 73: 44, 81: 27, 9: 45, 100: 3, 97: 8, 91: 10, 82: 24, 93: 8, 7: 51, 99: 1, 89: 16, 83: 22, 103: 3, 95: 4, 101: 1, 112: 1, 94: 1, 108: 4, 104: 1, 105: 2}, 'Time in A&E (mins)': {56: 206, 40: 110, 61: 188, 51: 182, 32: 67, 72: 176, 37: 99, 105: 12, 33: 92, 65: 

Looks like some variables have a k-value of 1! A k-value of 1 essentially means that knowing one piece of information can identify someone within this dataset. If one trait is enough in one dataset, that trait can be found as an outlier in another dataset and possibly lead to a linkage attack. Let's start dealing with the columns that have outliers.
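The per-column report above can also be produced more compactly with pandas' `value_counts`. The following is a minimal sketch, not a replacement for `calculateK`: the `toy` frame and the helper name `min_count_per_column` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the hospital data; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [30, 30, 45, 45, 99],
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
})

def min_count_per_column(frame, columns):
    """Return {column: number of rows holding that column's rarest value}."""
    return {col: int(frame[col].value_counts().min()) for col in columns}

per_column = min_count_per_column(toy, ["Age", "Gender"])
print(per_column)  # {'Age': 1, 'Gender': 2}
```

Here the single 99-year-old gives `Age` a minimum count of 1, which is the same kind of outlier the report flags in the real data.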

Let's look at `Age` first. The outliers are all 94+, so we can generalize the data here and group together certain ages. This way these data points won't stand out on their own anymore.

Before we do this, let's find the "before" values of the combination of `Age` and `Time in A&E (mins)`. This way we can re-calculate at the end and see if we have improved.

In [10]:
df[['Age', 'Time in A&E (mins)']].value_counts().reset_index(name='count')

Unnamed: 0,Age,Time in A&E (mins),count
0,0,14,23
1,46,66,22
2,37,58,22
3,39,62,21
4,40,60,20
...,...,...,...
2326,65,102,1
2327,66,69,1
2328,23,16,1
2329,22,60,1


Right now, there are many combinations with just 1 occurrence. Let's see what we can do about that.
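To put a single number on a combination of columns, one option is to group by those columns and take the smallest group size. This is a sketch under assumptions: the `toy` frame and the helper name `combination_k` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy data: each column on its own looks fine, but every combination is unique.
toy = pd.DataFrame({
    "Age": [30, 30, 45, 45],
    "Time in A&E (mins)": [40, 55, 40, 55],
})

def combination_k(frame, columns):
    """Smallest number of rows that share one combination of values in `columns`."""
    # observed=True skips empty categories if a column is categorical.
    return int(frame.groupby(columns, observed=True).size().min())

k_age = combination_k(toy, ["Age"])
k_both = combination_k(toy, ["Age", "Time in A&E (mins)"])
print(k_age, k_both)  # 2 1
```

The example shows why the combination matters: each age appears twice, yet no two rows share both an age and a time.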

The Python library pandas has a function called `qcut`, which splits values into quantile-based bins so that each bin holds roughly the same number of rows. For now, let's just split up `Age` into 2 groups and see what happens.

In [11]:
bin = 2
newAge = pd.qcut(df['Age'], bin)
df["Age"] = newAge
df

Unnamed: 0,Health Service ID,Age,Time in A&E (mins),Hospital,Arrival Time,Treatment,Gender,Postcode
0,744-256-9522,"(41.0, 112.0]",56,Kingston Hospital,2019-04-07 02:16:52,Dressing,Male,EC3N 3DH
1,303-192-8548,"(-0.001, 41.0]",40,Northwick Park & St Marks Hospital,2019-04-05 20:40:47,Central line,Male,W1Y 0HB
2,758-738-5880,"(41.0, 112.0]",61,Barnet Hospital,2019-04-03 12:59:13,Central line,Male,NW6 6AF
3,452-291-9709,"(-0.001, 41.0]",51,Hillingdon Hospital,2019-04-04 15:20:58,Other (consider alternatives),Male,IG6 1GX
4,477-915-0508,"(-0.001, 41.0]",32,The Royal Free Hospital,2019-04-01 11:27:12,Incision & drainage,Male,UB3 3EL
...,...,...,...,...,...,...,...,...
9995,609-960-2516,"(41.0, 112.0]",82,Royal London Hospital,2019-04-03 14:48:14,Oral airway,Male,W11 2TA
9996,956-497-5150,"(-0.001, 41.0]",44,Barnet Hospital,2019-04-04 08:37:37,Lumbar puncture,Female,W8 9HZ
9997,337-244-8916,"(-0.001, 41.0]",46,Croydon University Hospital,2019-04-03 06:17:18,Anaesthesia,Male,W1V 3TE
9998,039-739-2337,"(-0.001, 41.0]",51,University Hospital Lewisham,2019-04-07 15:54:37,Removal foreign body,Female,N14 7PA


It looks like now instead of showing an exact age there is an age range, which is either 0 to 41 inclusive or 41 non-inclusive to 112 inclusive. Let's see how the k-value for `Age` has changed.

In [12]:
calculateK()

{'Age': {Interval(41.0, 112.0, closed='right'): 4803, Interval(-0.001, 41.0, closed='right'): 5197}, 'Time in A&E (mins)': {56: 206, 40: 110, 61: 188, 51: 182, 32: 67, 72: 176, 37: 99, 105: 12, 33: 92, 65: 211, 43: 160, 82: 105, 26: 58, 27: 53, 48: 165, 39: 126, 77: 131, 45: 164, 47: 169, 42: 138, 58: 216, 66: 191, 73: 162, 85: 79, 67: 175, 23: 40, 84: 93, 79: 117, 46: 137, 57: 171, 78: 140, 59: 162, 60: 212, 50: 181, 93: 57, 95: 44, 97: 41, 68: 159, 28: 62, 117: 4, 89: 55, 83: 96, 69: 183, 64: 196, 31: 86, 94: 48, 62: 225, 36: 90, 44: 144, 53: 176, 88: 71, 96: 46, 75: 160, 52: 190, 107: 12, 80: 104, 55: 183, 70: 143, 92: 55, 81: 100, 63: 200, 115: 6, 41: 118, 10: 4, 5: 8, 71: 177, 122: 2, 91: 53, 30: 54, 76: 152, 90: 67, 54: 186, 87: 80, 74: 177, 86: 86, 38: 115, 102: 30, 110: 12, 29: 64, 17: 13, 101: 31, 4: 5, 18: 22, 49: 171, 19: 26, 8: 9, 22: 46, 21: 35, 34: 88, 35: 93, 100: 33, 15: 15, 14: 25, 108: 12, 24: 44, 25: 31, 103: 17, 9: 11, 16: 20, 104: 25, 121: 4, 109: 9, 13: 15, 99: 33

Now the lowest number of occurrences is no longer 1; it is over 4,000! Such a high number will not always be necessary, so we could break up the data into more bins, but let's leave it at this for now and deal with `Time in A&E (mins)` in the same way next.

In [13]:
newTime = pd.qcut(df['Time in A&E (mins)'], bin)
df["Time in A&E (mins)"] = newTime
df

Unnamed: 0,Health Service ID,Age,Time in A&E (mins),Hospital,Arrival Time,Treatment,Gender,Postcode
0,744-256-9522,"(41.0, 112.0]","(0.999, 60.0]",Kingston Hospital,2019-04-07 02:16:52,Dressing,Male,EC3N 3DH
1,303-192-8548,"(-0.001, 41.0]","(0.999, 60.0]",Northwick Park & St Marks Hospital,2019-04-05 20:40:47,Central line,Male,W1Y 0HB
2,758-738-5880,"(41.0, 112.0]","(60.0, 132.0]",Barnet Hospital,2019-04-03 12:59:13,Central line,Male,NW6 6AF
3,452-291-9709,"(-0.001, 41.0]","(0.999, 60.0]",Hillingdon Hospital,2019-04-04 15:20:58,Other (consider alternatives),Male,IG6 1GX
4,477-915-0508,"(-0.001, 41.0]","(0.999, 60.0]",The Royal Free Hospital,2019-04-01 11:27:12,Incision & drainage,Male,UB3 3EL
...,...,...,...,...,...,...,...,...
9995,609-960-2516,"(41.0, 112.0]","(60.0, 132.0]",Royal London Hospital,2019-04-03 14:48:14,Oral airway,Male,W11 2TA
9996,956-497-5150,"(-0.001, 41.0]","(0.999, 60.0]",Barnet Hospital,2019-04-04 08:37:37,Lumbar puncture,Female,W8 9HZ
9997,337-244-8916,"(-0.001, 41.0]","(0.999, 60.0]",Croydon University Hospital,2019-04-03 06:17:18,Anaesthesia,Male,W1V 3TE
9998,039-739-2337,"(-0.001, 41.0]","(0.999, 60.0]",University Hospital Lewisham,2019-04-07 15:54:37,Removal foreign body,Female,N14 7PA


Now let's look at the combination of `Age` and `Time in A&E (mins)`:

In [14]:
df[['Age', 'Time in A&E (mins)']].value_counts().reset_index(name='count')

Unnamed: 0,Age,Time in A&E (mins),count
0,"(-0.001, 41.0]","(0.999, 60.0]",4659
1,"(41.0, 112.0]","(60.0, 132.0]",4329
2,"(-0.001, 41.0]","(60.0, 132.0]",538
3,"(41.0, 112.0]","(0.999, 60.0]",474


This is much better than the last table, which had 2331 rows versus just 4 now! Again, such severe grouping might not be necessary, and there are algorithms that will calculate the utility of different splits, but this walkthrough is meant to be a manual exploration due to the nature of the data we might actually use.
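One manual way to explore that trade-off is to try several bin counts and watch how the smallest combination shrinks. This is only a sketch: the data is randomly generated rather than taken from the hospital file, and the helper name `smallest_group_after_binning` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data; NOT the hospital dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.integers(0, 100, size=1000),
    "Time in A&E (mins)": rng.integers(1, 130, size=1000),
})

def smallest_group_after_binning(frame, columns, n_bins):
    """Bin each column with qcut, then return the smallest combination count."""
    binned = pd.DataFrame({
        # duplicates="drop" merges bins whose quantile edges coincide.
        col: pd.qcut(frame[col], n_bins, duplicates="drop") for col in columns
    })
    return int(binned.groupby(columns, observed=True).size().min())

results = {
    n: smallest_group_after_binning(toy, list(toy.columns), n) for n in (2, 4, 8)
}
for n_bins, k in results.items():
    print(f"{n_bins} bins per column -> smallest combination has {k} rows")
```

More bins keep more detail but leave fewer rows in the smallest group, so the choice depends on how much precision the analysis actually needs.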

Let's recalculate k now.

In [15]:
calculateK()

{'Age': {Interval(41.0, 112.0, closed='right'): 4803, Interval(-0.001, 41.0, closed='right'): 5197}, 'Time in A&E (mins)': {Interval(0.999, 60.0, closed='right'): 5133, Interval(60.0, 132.0, closed='right'): 4867}, 'Hospital': {'Kingston Hospital': 551, 'Northwick Park & St Marks Hospital': 353, 'Barnet Hospital': 489, 'Hillingdon Hospital': 478, 'The Royal Free Hospital': 366, "Queen's Hospital": 792, 'Chase Farm Hospital': 737, 'Princess Royal University Hospital': 238, 'Newham General Hospital': 707, 'The Whittington Hospital': 510, 'University Hospital Lewisham': 846, 'West Middlesex University Hospital': 212, 'Epsom General Hospital': 332, 'Homerton University Hospital': 392, "King's College Hospital": 341, 'Chelsea and Westminster Hospital': 744, "St Thomas' Hospital": 165, "St Mary's Hospital": 182, 'Royal London Hospital': 188, 'University College Hospital': 550, 'Croydon University Hospital': 89, 'Ealing Hospital': 416, 'Charing Cross Hospital': 217, 'Whipps Cross University H

These values are much better than what we had before! Let's deal with `Postcode` next. To do this, I will bring in another dataset that gives us some information about these postcodes.

In [16]:
filename = "./londonPostcodes.csv"
postcodes = pd.read_csv(filename)
postcodes


Unnamed: 0,Postcode,In Use?,Latitude,Longitude,Easting,Northing,Grid Ref,County,District,Ward,...,Distance to sea,LSOA21 Code,Lower layer super output area 2021,MSOA21 Code,Middle layer super output area 2021,Census output area 2021,IMD decile,Constituency Code 2024,Constituency Name 2024,Property Type
0,BR1 1AA,Yes,51.401546,0.015415,540291,168873,TQ402688,Greater London,Bromley,Bromley Town,...,28.0730,E01034386,Bromley 018G,E02000144,Bromley South,E00182600,8,E14001137,Bromley and Biggin Hill,Flat
1,BR1 1AB,Yes,51.406333,0.015208,540262,169405,TQ402694,Greater London,Bromley,Bromley Town,...,27.9776,E01000676,Bromley 008B,E02000134,Bromley North & Sundridge,E00003255,5,E14001137,Bromley and Biggin Hill,Flat
2,BR1 1AD,No,51.400057,0.016715,540386,168710,TQ403687,Greater London,Bromley,Bromley Town,...,28.0211,E01034386,Bromley 018G,E02000144,Bromley South,E00182639,8,E14001137,Bromley and Biggin Hill,
3,BR1 1AE,Yes,51.404543,0.014195,540197,169204,TQ401692,Greater London,Bromley,Bromley Town,...,28.0861,E01000677,Bromley 018C,E02000144,Bromley South,E00003266,7,E14001137,Bromley and Biggin Hill,Flat
4,BR1 1AF,Yes,51.401392,0.014948,540259,168855,TQ402688,Greater London,Bromley,Bromley Town,...,28.1083,E01034386,Bromley 018G,E02000144,Bromley South,E00182600,8,E14001137,Bromley and Biggin Hill,Flat
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
329928,WD3 8UX,Yes,51.624762,-0.494021,504345,192845,TQ043928,Greater London,Hillingdon,Harefield Village,...,65.3523,E01002438,Hillingdon 003A,E02000496,Harefield,E00012152,7,E14001454,"Ruislip, Northwood and Pinner",
329929,WD3 8UZ,Yes,51.626955,-0.494143,504332,193089,TQ043930,Greater London,Hillingdon,Harefield Village,...,65.4296,E01002438,Hillingdon 003A,E02000496,Harefield,E00012152,7,E14001454,"Ruislip, Northwood and Pinner",Flat
329930,WD3 8XD,Yes,51.628575,-0.499204,503978,193262,TQ039932,Greater London,Hillingdon,Harefield Village,...,65.8157,E01002438,Hillingdon 003A,E02000496,Harefield,E00012152,7,E14001454,"Ruislip, Northwood and Pinner",
329931,WD6 2RN,Yes,51.643273,-0.255930,520777,195270,TQ207952,Greater London,Barnet,High Barnet,...,50.6285,E01000290,Barnet 007F,E02000030,Totteridge & Barnet Gate,E00001422,5,E14001169,Chipping Barnet,Terraced


I tried a few different levels of specificity and came to the conclusion that using `County` would be best when combined with other variables. Feel free to change the value from `County` to something like `District` or `Ward` and see how the combination counts change. Let's go ahead and replace the `Postcode` value with a `County` value, and then rename the columns that we have changed so far to better describe what is being shown.

In [17]:
for i in df.index:
    curCode = df.iloc[i].Postcode
    if curCode in postcodes["Postcode"].values:
        newCode = postcodes.loc[postcodes['Postcode'] == curCode].get("County").values[0]
        df["Postcode"] = df["Postcode"].replace([curCode], newCode)
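The loop above checks one row at a time against the whole lookup table, which can be slow on large files. A vectorized alternative is to build a postcode-to-county mapping once and apply it with `Series.map`. This is a sketch with toy frames: `visits` and `lookup` are hypothetical names standing in for `df` and `postcodes`, and `ZZ9 9ZZ` is a made-up postcode used to show the unmatched case.

```python
import pandas as pd

# Toy stand-ins for df and postcodes; "ZZ9 9ZZ" is invented to show a miss.
visits = pd.DataFrame({"Postcode": ["EC3N 3DH", "W1Y 0HB", "ZZ9 9ZZ"]})
lookup = pd.DataFrame({
    "Postcode": ["EC3N 3DH", "W1Y 0HB"],
    "County": ["City of London", "Greater London"],
})

# One County per Postcode; keep the first match if a postcode repeats.
postcode_to_county = (
    lookup.drop_duplicates("Postcode").set_index("Postcode")["County"]
)

# map() yields NaN for postcodes missing from the lookup, so fall back to
# the original value, as the loop does for unmatched postcodes.
mapped = visits["Postcode"].map(postcode_to_county)
visits["Postcode"] = mapped.fillna(visits["Postcode"])
print(visits["Postcode"].tolist())
```

Swapping `"County"` for `"District"` or `"Ward"` in the mapping would give the finer levels of detail mentioned above.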


In [18]:
df = df.rename(columns={'Postcode': 'County', 'Age': 'Age Range', 'Time in A&E (mins)': 'Time Range'})
df

Unnamed: 0,Health Service ID,Age Range,Time Range,Hospital,Arrival Time,Treatment,Gender,County
0,744-256-9522,"(41.0, 112.0]","(0.999, 60.0]",Kingston Hospital,2019-04-07 02:16:52,Dressing,Male,City of London
1,303-192-8548,"(-0.001, 41.0]","(0.999, 60.0]",Northwick Park & St Marks Hospital,2019-04-05 20:40:47,Central line,Male,Greater London
2,758-738-5880,"(41.0, 112.0]","(60.0, 132.0]",Barnet Hospital,2019-04-03 12:59:13,Central line,Male,Greater London
3,452-291-9709,"(-0.001, 41.0]","(0.999, 60.0]",Hillingdon Hospital,2019-04-04 15:20:58,Other (consider alternatives),Male,Greater London
4,477-915-0508,"(-0.001, 41.0]","(0.999, 60.0]",The Royal Free Hospital,2019-04-01 11:27:12,Incision & drainage,Male,Greater London
...,...,...,...,...,...,...,...,...
9995,609-960-2516,"(41.0, 112.0]","(60.0, 132.0]",Royal London Hospital,2019-04-03 14:48:14,Oral airway,Male,Greater London
9996,956-497-5150,"(-0.001, 41.0]","(0.999, 60.0]",Barnet Hospital,2019-04-04 08:37:37,Lumbar puncture,Female,Greater London
9997,337-244-8916,"(-0.001, 41.0]","(0.999, 60.0]",Croydon University Hospital,2019-04-03 06:17:18,Anaesthesia,Male,Greater London
9998,039-739-2337,"(-0.001, 41.0]","(0.999, 60.0]",University Hospital Lewisham,2019-04-07 15:54:37,Removal foreign body,Female,Greater London


In [19]:
df.County.value_counts()

County
Greater London    9634
City of London     366
Name: count, dtype: int64

These counts are much better than before. Now let's deal with some of the `Gender` outliers. The `Gender` column includes values like `Male`, `Female`, `Not Specified`, and `Not Known`. Let's change `Not Specified` to `Not Known` for some more consistency.

In [20]:
df['Gender'] = df['Gender'].replace("Not Specified", "Not Known")


Additionally, in k-anonymization you must remove the columns that directly identify a person. So, let's go ahead and remove `Health Service ID`. I am also going to remove `Arrival Time` to simplify the analysis that follows.

In [21]:
dfNew = df.drop(columns=['Arrival Time', 'Health Service ID'])

Using the `drop_duplicates` function in pandas, we can see that we went from 10000 rows to 4124 rows, showing that we have a large number of duplicates now.

In [22]:
dfNew.drop_duplicates()

Unnamed: 0,Age Range,Time Range,Hospital,Treatment,Gender,County
0,"(41.0, 112.0]","(0.999, 60.0]",Kingston Hospital,Dressing,Male,City of London
1,"(-0.001, 41.0]","(0.999, 60.0]",Northwick Park & St Marks Hospital,Central line,Male,Greater London
2,"(41.0, 112.0]","(60.0, 132.0]",Barnet Hospital,Central line,Male,Greater London
3,"(-0.001, 41.0]","(0.999, 60.0]",Hillingdon Hospital,Other (consider alternatives),Male,Greater London
4,"(-0.001, 41.0]","(0.999, 60.0]",The Royal Free Hospital,Incision & drainage,Male,Greater London
...,...,...,...,...,...,...
9973,"(-0.001, 41.0]","(60.0, 132.0]",University Hospital Lewisham,Guidance/advice only,Female,Greater London
9980,"(41.0, 112.0]","(60.0, 132.0]",University College Hospital,Anaesthesia,Female,Greater London
9992,"(41.0, 112.0]","(60.0, 132.0]",Princess Royal University Hospital,Central line,Male,Greater London
9997,"(-0.001, 41.0]","(0.999, 60.0]",Croydon University Hospital,Anaesthesia,Male,Greater London
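Duplicates are what we want here, so it can also help to list the rows that still sit in small groups. The sketch below flags rows whose combination of values is shared by fewer than a chosen `k` rows. The `toy` frame, the string labels for the age ranges, and the helper name `rows_below_k` are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for dfNew; the range labels are simplified strings.
toy = pd.DataFrame({
    "Age Range": ["0-41", "0-41", "0-41", "42-112", "42-112"],
    "Gender": ["Male", "Male", "Male", "Female", "Not Known"],
})

def rows_below_k(frame, columns, k):
    """Return the rows whose combination of `columns` occurs fewer than k times."""
    # transform("size") gives every row the size of the group it belongs to.
    group_sizes = frame.groupby(columns, observed=True)[columns[0]].transform("size")
    return frame[group_sizes < k]

risky = rows_below_k(toy, ["Age Range", "Gender"], k=2)
print(risky)
```

In this toy frame the last two rows are each alone in their group, so they are the ones that would need further generalization or removal.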


Let's break this down a little further. How many unique combinations do we have across all our data?

In [23]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 150)
unique_combinations = dfNew.groupby(['Age Range', 'Time Range', 'Hospital', 'Gender']).size().reset_index(name='Count')
print(unique_combinations)

          Age Range     Time Range                            Hospital     Gender  Count
0    (-0.001, 41.0]  (0.999, 60.0]                     Barnet Hospital     Female    102
1    (-0.001, 41.0]  (0.999, 60.0]                     Barnet Hospital       Male    117
2    (-0.001, 41.0]  (0.999, 60.0]                     Barnet Hospital  Not Known      2
3    (-0.001, 41.0]  (0.999, 60.0]              Charing Cross Hospital     Female     56
4    (-0.001, 41.0]  (0.999, 60.0]              Charing Cross Hospital       Male     53
5    (-0.001, 41.0]  (0.999, 60.0]              Charing Cross Hospital  Not Known      0
6    (-0.001, 41.0]  (0.999, 60.0]                 Chase Farm Hospital     Female    181
7    (-0.001, 41.0]  (0.999, 60.0]                 Chase Farm Hospital       Male    149
8    (-0.001, 41.0]  (0.999, 60.0]                 Chase Farm Hospital  Not Known      7
9    (-0.001, 41.0]  (0.999, 60.0]    Chelsea and Westminster Hospital     Female    180
10   (-0.001, 41.0]  


Looks like we have about 311 here! This is a large number of combinations, but the counts are actually pretty good: most of the male-female splits are fairly even, and the only outliers are `Not Known`, which does not reveal any more information. To take a detailed look at the dataset, check out the full thing below. Otherwise, this wraps up the demo, where we applied k-anonymization by generalizing a few columns to protect PII! This could be expanded further with the `Arrival Time` column, but that data is stored as strings, which is not the best format for grouping, so I have left it out of this demo to keep things simple.
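For anyone who does want to try generalizing `Arrival Time`, one possible approach is to parse the strings with `pd.to_datetime` and keep only a coarse part of each timestamp. This is a sketch, not part of the demo: the three timestamps come from the first rows shown earlier, while the 6-hour periods and their labels are an arbitrary choice made for illustration.

```python
import pandas as pd

# Three arrival times copied from the first rows of the dataset shown earlier.
arrivals = pd.Series(
    ["2019-04-07 02:16:52", "2019-04-05 20:40:47", "2019-04-03 12:59:13"],
    name="Arrival Time",
)
parsed = pd.to_datetime(arrivals, format="%Y-%m-%d %H:%M:%S")

# Generalize to the day of the week instead of the exact date.
weekday = parsed.dt.day_name()

# Generalize the hour into four 6-hour periods (labels are an assumption).
period = pd.cut(
    parsed.dt.hour,
    bins=[0, 6, 12, 18, 24],
    right=False,  # intervals are [0, 6), [6, 12), [12, 18), [18, 24)
    labels=["Night", "Morning", "Afternoon", "Evening"],
)
print(pd.DataFrame({"Weekday": weekday, "Period": period}))
```

Either derived column could then be grouped and counted in the same way as the other quasi-identifiers.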

In [24]:
pd.set_option('display.max_rows', 13000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 150)
unique_combinations = dfNew.groupby(['Age Range', 'Time Range', 'Hospital', 'Gender', 'Treatment']).size().reset_index(name='Count')
unique_combinations = unique_combinations[unique_combinations['Count'] != 0]
unique_combinations


Unnamed: 0,Age Range,Time Range,Hospital,Gender,Treatment,Count
0,"(-0.001, 41.0]","(0.999, 60.0]",Barnet Hospital,Female,Anaesthesia,3
1,"(-0.001, 41.0]","(0.999, 60.0]",Barnet Hospital,Female,Arterial line,4
2,"(-0.001, 41.0]","(0.999, 60.0]",Barnet Hospital,Female,Bandage/support,6
3,"(-0.001, 41.0]","(0.999, 60.0]",Barnet Hospital,Female,Blood product transfusion,4
5,"(-0.001, 41.0]","(0.999, 60.0]",Barnet Hospital,Female,Central line,2
6,"(-0.001, 41.0]","(0.999, 60.0]",Barnet Hospital,Female,Chest drain,3
7,"(-0.001, 41.0]","(0.999, 60.0]",Barnet Hospital,Female,Defibrillation/pacing,1
9,"(-0.001, 41.0]","(0.999, 60.0]",Barnet Hospital,Female,Dressing,3
10,"(-0.001, 41.0]","(0.999, 60.0]",Barnet Hospital,Female,Dressing/wound review,5
11,"(-0.001, 41.0]","(0.999, 60.0]",Barnet Hospital,Female,Eye,1
