# $k$-Anonymity Experiment

In this notebook we explore the concept of $k$-anonymity using a publicly released health care dataset.

This dataset was provided by the [New York State Department of Health](https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/82xm-y6g8).

*The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified File contains discharge level detail on patient characteristics, diagnoses, treatments, services, and charges. This data file contains basic record level detail for the discharge. **The de-identified data file does not contain data that is protected health information (PHI) under HIPAA.** The health information is not individually identifiable; **all data elements considered identifiable have been redacted**. For example, the direct identifiers regarding a date have the day and month portion of the date removed.*

Even though protected health information has been removed, it may still be possible to re-identify individuals if the data contain **quasi-identifiers**: attributes that are not identifying on their own but whose combination can be unique enough to link a record to another dataset containing identifying information.
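To make the linking risk concrete, here is a minimal sketch of a linkage attack on toy data. Both tables, their column names, and the names in them are hypothetical and are not taken from the SPARCS file; the point is only that a join on shared attributes can attach identities to "de-identified" rows.

```python
import pandas as pd

# Hypothetical "de-identified" release: no names, but some attributes remain.
released = pd.DataFrame({
    "age_group": ["30 to 49", "30 to 49", "70 or Older"],
    "zip3": ["100", "148", "148"],
    "gender": ["F", "M", "M"],
    "diagnosis": ["asthma", "fracture", "pneumonia"],
})

# Hypothetical auxiliary dataset that carries names alongside the same attributes.
auxiliary = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age_group": ["30 to 49", "70 or Older"],
    "zip3": ["100", "148"],
    "gender": ["F", "M"],
})

quasi_identifiers = ["age_group", "zip3", "gender"]

# An inner join on the shared attributes attaches a name to each matching record.
linked = auxiliary.merge(released, on=quasi_identifiers, how="inner")
print(linked[["name", "diagnosis"]])
```

Each name matches exactly one released record here because every attribute combination in the toy release is unique, which is the situation $k$-anonymity is meant to rule out.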





In [1]:
import pandas as pd
df = pd.read_csv('https://www.dropbox.com/s/tqdp1cljfg7zcb6/hospital.csv?dl=1')



In [2]:
df.columns

Index(['Health Service Area', 'Hospital County',
       'Operating Certificate Number', 'Facility Id', 'Facility Name',
       'Age Group', 'Zip Code - 3 digits', 'Gender', 'Race', 'Ethnicity',
       'Length of Stay', 'Type of Admission', 'Patient Disposition',
       'Discharge Year', 'CCS Diagnosis Code', 'CCS Diagnosis Description',
       'CCS Procedure Code', 'CCS Procedure Description', 'APR DRG Code',
       'APR DRG Description', 'APR MDC Code', 'APR MDC Description',
       'APR Severity of Illness Code', 'APR Severity of Illness Description',
       'APR Risk of Mortality', 'APR Medical Surgical Description',
       'Payment Typology 1', 'Payment Typology 2', 'Payment Typology 3',
       'Attending Provider License Number',
       'Operating Provider License Number', 'Other Provider License Number',
       'Birth Weight', 'Abortion Edit Indicator',
       'Emergency Department Indicator', 'Total Charges', 'Total Costs'],
      dtype='object')

We will focus on the age, ZIP, gender, race, and ethnicity columns when looking for quasi-identifiers.

In [3]:
cols = ['Age Group','Zip Code - 3 digits','Gender','Race','Ethnicity']

Implement a technique to count, for each unique collection of the protected attributes, how many records share those same attributes.

In [4]:
counts_df = df.groupby(cols).size().reset_index(name='count')
counts_df

| | Age Group | Zip Code - 3 digits | Gender | Race | Ethnicity | count |
|---|---|---|---|---|---|---|
| 0 | 0 to 17 | 100 | F | Black/African American | Multi-ethnic | 10 |
| 1 | 0 to 17 | 100 | F | Black/African American | Not Span/Hispanic | 1697 |
| 2 | 0 to 17 | 100 | F | Black/African American | Spanish/Hispanic | 387 |
| 3 | 0 to 17 | 100 | F | Black/African American | Unknown | 7 |
| 4 | 0 to 17 | 100 | F | Multi-racial | Multi-ethnic | 161 |
| ... | ... | ... | ... | ... | ... | ... |
| 5568 | 70 or Older | OOS | M | Other Race | Unknown | 93 |
| 5569 | 70 or Older | OOS | M | White | Multi-ethnic | 6 |
| 5570 | 70 or Older | OOS | M | White | Not Span/Hispanic | 6216 |
| 5571 | 70 or Older | OOS | M | White | Spanish/Hispanic | 88 |


If any group contains a single record, that record is unique on the quasi-identifier columns and could therefore be re-identified. How many such records are there?

In [5]:
counts_df[counts_df['count'] == 1]

| | Age Group | Zip Code - 3 digits | Gender | Race | Ethnicity | count |
|---|---|---|---|---|---|---|
| 7 | 0 to 17 | 100 | F | Multi-racial | Unknown | 1 |
| 23 | 0 to 17 | 100 | M | Multi-racial | Unknown | 1 |
| 34 | 0 to 17 | 101 | F | Black/African American | Unknown | 1 |
| 38 | 0 to 17 | 101 | F | White | Multi-ethnic | 1 |
| 41 | 0 to 17 | 101 | F | White | Unknown | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 5510 | 70 or Older | 147 | M | Other Race | Unknown | 1 |
| 5516 | 70 or Older | 148 | F | Multi-racial | Not Span/Hispanic | 1 |
| 5525 | 70 or Older | 148 | M | Multi-racial | Not Span/Hispanic | 1 |
| 5546 | 70 or Older | OOS | F | Multi-racial | Multi-ethnic | 1 |


599 records are unique on these quasi-identifier columns.

By merging the count with the original dataset, we can see how many records would violate a $k$-anonymity constraint, for a given $k$.  Determine how many records violate the constraint for $k=2$ up to $10$.

In [6]:
import numpy as np
import matplotlib.pyplot as plt

# Attach each record's group size so it can be compared against k.
merged_df = pd.merge(df, counts_df, on=cols, how='left')
k = np.arange(2, 11)
violations = []
for val in k:
    # A record violates k-anonymity when its group has fewer than k members.
    violate = merged_df[merged_df['count'] < val]
    violations.append(violate)
    print(f"{len(violate)} records violate the {val}-anonymity constraint")

599 records violate the 2-anonymity constraint
1163 records violate the 3-anonymity constraint
1919 records violate the 4-anonymity constraint
2659 records violate the 5-anonymity constraint
3409 records violate the 6-anonymity constraint
4183 records violate the 7-anonymity constraint
5030 records violate the 8-anonymity constraint
5710 records violate the 9-anonymity constraint
6538 records violate the 10-anonymity constraint
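The same check can be written without a separate merge. The sketch below is one possible alternative, shown on a hypothetical toy frame: `group_sizes`, `k_of`, and `count_violations` are illustrative helper names, not part of the notebook. It uses `groupby(...).transform("size")` to give each row the size of its quasi-identifier group. Note that rows with missing values in the grouping columns are excluded from groups by default, so this assumes those columns are complete.

```python
import pandas as pd

def group_sizes(frame, quasi_identifiers):
    """Return, for every row, the number of rows sharing its quasi-identifier values."""
    return frame.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")

def k_of(frame, quasi_identifiers):
    """Largest k for which the frame is k-anonymous (the smallest group size)."""
    return int(group_sizes(frame, quasi_identifiers).min())

def count_violations(frame, quasi_identifiers, k):
    """Number of rows whose group has fewer than k members."""
    return int((group_sizes(frame, quasi_identifiers) < k).sum())

# Hypothetical toy data: group sizes are 3, 2, and 1.
toy = pd.DataFrame({
    "age_group": ["0 to 17", "0 to 17", "0 to 17", "18 to 29", "18 to 29", "30 to 49"],
    "gender": ["F", "F", "F", "M", "M", "F"],
})
qi = ["age_group", "gender"]

print(k_of(toy, qi))                 # 1
print(count_violations(toy, qi, 2))  # 1
print(count_violations(toy, qi, 3))  # 3
```

Because the violation count can only grow as $k$ increases, the output above it rises monotonically, just as in the toy example.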


Determine and implement a reasonable way to achieve 2-anonymity using suppression and generalization.

In [7]:
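copy = df.copy()
# Generalize: keep only the first 2 digits of the ZIP prefix and drop the
# finer 3-digit column so it cannot be used for linking.
copy['Zip Code - 2 digits'] = copy['Zip Code - 3 digits'].str[:2]
copy = copy.drop(columns=['Zip Code - 3 digits'])
cols = ['Age Group', 'Zip Code - 2 digits', 'Gender', 'Race', 'Ethnicity']

# Recount group sizes over the generalized quasi-identifiers.
counts_df = copy.groupby(cols).size().reset_index(name='count')
merged_df = pd.merge(copy, counts_df, on=cols, how='left')
# Suppress: drop every record that is still unique within its group.
merged_df = merged_df[merged_df['count'] > 1]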

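I further generalize the ZIP code by truncating it to its first 2 digits, and I suppress the records that remain unique on the quasi-identifier columns. Strictly speaking, suppression alone is enough to achieve 2-anonymity; the generalization is optional, but it merges small groups and so can reduce the number of records that have to be suppressed.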
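The trade-off between the two techniques can be illustrated on a small example. This is a sketch on hypothetical toy data, with `suppress_to_k` as an illustrative helper name; it compares how many records survive suppression alone versus generalization followed by suppression.

```python
import pandas as pd

def suppress_to_k(frame, quasi_identifiers, k):
    """Drop every row whose quasi-identifier group has fewer than k rows."""
    sizes = frame.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return frame[sizes >= k]

# Hypothetical toy data: three records have a unique 3-digit ZIP prefix.
toy = pd.DataFrame({
    "zip3": ["100", "101", "102", "148", "148"],
    "gender": ["F", "F", "F", "M", "M"],
})

# Suppression only: the three unique records are dropped.
only_suppressed = suppress_to_k(toy, ["zip3", "gender"], k=2)

# Generalize first (3-digit prefix -> 2-digit prefix), then suppress.
generalized = toy.assign(zip2=toy["zip3"].str[:2]).drop(columns=["zip3"])
generalized_then_suppressed = suppress_to_k(generalized, ["zip2", "gender"], k=2)

print(len(only_suppressed), len(generalized_then_suppressed))  # 2 5
```

In this toy case generalization keeps all five records at the cost of coarser ZIP information, whereas suppression alone keeps full ZIP detail but loses three records.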