# De-identification and K-Anonymization Process

## Purpose
This notebook implements the de-identification and k-anonymization process for clinical data in accordance with GDPR requirements. The process includes:

1. Creating a dataset from patient records
2. Anonymizing demographic and clinical information
3. Implementing k-anonymity to minimize re-identification risk
4. Backpropagating the anonymized data to maintain referential integrity
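
Step 3 rests on a simple criterion: a table is k-anonymous with respect to a set of quasi-identifiers when every combination of their values is shared by at least k rows. The sketch below computes the size of the smallest such group on toy data. The helper name `smallest_group_size` and the sample rows are illustrative assumptions and are not part of the project code.

```python
import pandas as pd

def smallest_group_size(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest equivalence class (hypothetical helper)."""
    # dropna=False keeps rows with missing quasi-identifiers in a group of their own
    return int(df.groupby(quasi_identifiers, dropna=False).size().min())

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "sex": [1, 1, 2, 2, 2],
    "natio_ref": ["ES", "ES", "ES", "ES", "FR"],
    "age_on_admission": [49, 49, 65, 65, 65],
})

k = smallest_group_size(toy, ["sex", "natio_ref", "age_on_admission"])
print(k)
```

The single (2, FR, 65) row forms a group of its own, so the toy table is only 1-anonymous. Suppressing or generalizing `natio_ref` would raise k.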

## Context
Early detection of clinical deterioration is crucial to prevent morbidity and mortality in hospitalized patients. This process creates anonymized data that can be safely shared with the international research community while preserving analytical value.

In [2]:
# Import libraries for database connectivity and data processing
import json
from sshtunnel import SSHTunnelForwarder
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine, inspect
import pandas as pd
import os
import sys

# Define root path and add to system path for custom imports
ROOT_PATH = "/Users/xaviborrat/Documents/GitHub/TFM_Clinical_Deterioration"
sys.path.append(ROOT_PATH)

# Import custom database connector class
import classes
from classes.xavi_con_class import db_connect as xcc

In [3]:
# Initialize connection to clinical database using SSH tunnel
# Configuration file contains sensitive connection parameters
config_path = os.path.join(ROOT_PATH, 'classes/config_tfm.json')
with open(config_path, 'r') as config_file:
    config = json.load(config_file)

# Create database connection object
datanex = xcc(
    ssh = config["ssh"],
    ssh_user = config["ssh_user"],
    ssh_host = config["ssh_host"],
    ssh_pkey = config["ssh_pkey"],
    db_host = config["db_host"],
    db_port = config["db_port"],
    db_user = config["db_user"],
    db_pass = config["db_pass"],
    flavour = config["flavour"],
    db = config["db"]
)
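
Because the connector reads ten keys from the configuration file, a missing key only surfaces as a `KeyError` partway through the call. A minimal sketch of an up-front check follows; `missing_config_keys` is a hypothetical helper and the example dictionary holds placeholder values, not real connection parameters.

```python
# Keys read from the configuration file in the cell above
REQUIRED_KEYS = [
    "ssh", "ssh_user", "ssh_host", "ssh_pkey",
    "db_host", "db_port", "db_user", "db_pass",
    "flavour", "db",
]

def missing_config_keys(config: dict) -> list[str]:
    """Return the required keys that are absent from config (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if key not in config]

# Placeholder configuration for illustration only
example = {"ssh": True, "ssh_user": "user", "db": "borrat_project"}
missing = missing_config_keys(example)
print(missing)
```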

## Data Extraction for K-Anonymity

This section creates the dataset that will undergo k-anonymization. It joins patient demographics with ward stays and diagnoses so that all quasi-identifiers are processed together.

In [None]:
# Create dataset for k-anonymization by joining demographic and clinical data
# First, drop the table if it exists to ensure clean creation
datanex.query_build('borrat_project', """DROP TABLE IF EXISTS borrat_project.k_anon_set""")

# Create table joining demographics, ward stays, and diagnostics
# This captures key quasi-identifiers that could lead to re-identification
datanex.query_build(db='borrat_project', query="""CREATE TABLE IF NOT EXISTS borrat_project.k_anon_set AS (
    SELECT
        de.patient_ref,
        de.sex,
        natio_ref,
        icd10_code,
        age_on_admission,
        stay_id,
        diag_id,
        demog_id
    FROM
        demographics_ds de
    JOIN
        ward_stays_ds ws ON de.patient_ref = ws.patient_ref
    JOIN
        diagnostics_ds dg ON dg.episode_ref = ws.episode_ref
);""")

# Retrieve the created data for processing
k_anon = datanex.query(db='borrat_project', query="""select * from k_anon_set;""")

Database [borrat_project] session created...
<> Query executed Sucessfully <>
Database [borrat_project] session created...
<> Query executed Sucessfully <>
Database [borrat_project] session created...
<> Query Sucessful <>


In [None]:
# Fill missing values with asterisk to ensure consistent handling in anonymization
# This standardizes representation of missing data before k-anonymization
k_anon = k_anon.fillna('*').replace({None: '*'})
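
The effect of the placeholder fill can be seen on a small frame. This is a sketch on invented rows, using only `fillna`. Note that the placeholder is a string, so a numeric column such as `age_on_admission` would no longer hold purely numeric values after the fill.

```python
import pandas as pd

# Toy rows with missing values, invented for illustration only
toy = pd.DataFrame({
    "natio_ref": ["ES", None, "FR"],
    "icd10_code": ["J96.00", "E03.9", None],
})

# Every missing entry is replaced by the same '*' placeholder
filled = toy.fillna("*")
print(filled)
```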

In [None]:
# Remove duplicate rows to reduce dataset size and ensure clean k-anonymization
k_anon.drop_duplicates(inplace=True)

# Export data to CSV for external k-anonymization processing
# The external tool will apply generalization and suppression techniques
k_anon.to_csv('k_anon_set.csv', index=False)

## Backpropagation of K-Anonymized Data

After external k-anonymization, this section reintegrates the anonymized data back into the database while maintaining relationships between tables. This ensures that all downstream analyses work with properly anonymized data.
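
The reintegration pattern used in the cells below can be sketched on toy data: build a key-to-value lookup from the anonymized file, drop the original column, and right-join on the key. The frames and their values are invented for illustration. Only the column names come from this notebook.

```python
import pandas as pd

# Toy stand-in for an original table
original = pd.DataFrame({
    "demog_id": [1, 2, 3],
    "sex": [1, 2, 1],
    "natio_ref": ["ES", "FR", "IT"],
})

# Toy stand-in for the flat anonymized file (keys repeat, demog_id 3 is absent)
anonymized = pd.DataFrame({
    "demog_id": [1, 1, 2],
    "natio_ref": ["ES", "ES", "*"],
})

# One row per key with its anonymized value
lookup = anonymized[["demog_id", "natio_ref"]].drop_duplicates()

# Drop the original column, then keep only keys present in the anonymized data
result = original.drop(columns=["natio_ref"]).merge(lookup, on="demog_id", how="right")
print(result)
```

With a right join, rows whose key is missing from the anonymized file (here `demog_id` 3) do not reach the output table.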

In [4]:
import pandas as pd
# Import the externally k-anonymized dataset
# This file contains data that has been processed to ensure k-anonymity (typically k≥5)
k_anon_done = pd.read_csv('data/anonymized_data.csv')

In [5]:
k_anon_done

Unnamed: 0,patient_ref,sex,natio_ref,icd10_code,age_on_admission,stay_id,diag_id,demog_id
0,2275378579,2,ES,J96.00,49,5765,1,1
1,2275378579,2,ES,E03.9,49,5765,2,1
2,2275378579,2,ES,D64.9,49,5765,3,1
3,2275378579,2,ES,C79.51,49,5765,4,1
4,2275378579,*,*,*,*,5765,5,1
...,...,...,...,...,...,...,...,...
333957,9663935179,1,ES,D69.6,65,2354,232232,18989
333958,9663935179,1,ES,D69.6,65,2529,232232,18989
333959,9663935179,1,ES,D61.818,65,2252,232233,18989
333960,9663935179,1,ES,D61.818,65,2354,232233,18989


In [3]:
# Collect the distinct ICD-10 codes recorded for each (stay, patient) pair
k_anon_done_group = k_anon_done.groupby(["stay_id", "patient_ref"])["icd10_code"].unique()



In [5]:
print(k_anon_done.columns)
print(k_anon_done.shape)


Index(['patient_ref', 'sex', 'natio_ref', 'icd10_code', 'age_on_admission',
       'stay_id', 'diag_id', 'demog_id'],
      dtype='object')
(333962, 8)


In [7]:
# The flat k-anonymized table repeats each key once per joined row,
# so duplicates are dropped to build one lookup table per target table

# Demographics table: anonymized nationality per demog_id
demog_join = k_anon_done[['demog_id', 'natio_ref']].drop_duplicates()
# Diagnostic events table: anonymized ICD-10 code per diag_id
diag_join = k_anon_done[['diag_id', 'icd10_code']].drop_duplicates()
# Ward stay events table: anonymized age on admission per stay_id
stay_join = k_anon_done[['stay_id', 'age_on_admission']].drop_duplicates()
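
In the first rows of the `k_anon_done` preview, `demog_id` 1 appears with `natio_ref` both `ES` and `*`. If that pattern holds in the full file, `drop_duplicates()` leaves more than one row for such a key and the right joins below would duplicate the matching rows. The sketch below lists keys that map to more than one value; `keys_with_conflicts` is a hypothetical helper and the toy frame is invented for illustration.

```python
import pandas as pd

def keys_with_conflicts(df: pd.DataFrame, key: str, value: str) -> list:
    """Return keys associated with more than one distinct value (hypothetical helper)."""
    lookup = df[[key, value]].drop_duplicates()
    counts = lookup.groupby(key)[value].nunique(dropna=False)
    return counts[counts > 1].index.tolist()

# Toy rows mirroring the pattern seen in the preview
toy = pd.DataFrame({
    "demog_id": [1, 1, 2],
    "natio_ref": ["ES", "*", "ES"],
})

conflicts = keys_with_conflicts(toy, "demog_id", "natio_ref")
print(conflicts)
```

An empty list would indicate that each key carries a single anonymized value, in which case the lookup tables have exactly one row per key.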

In [17]:
#-------------------------------------DEMOG--------------------------------------------
# Backpropagate nationality reference that has been k-anonymized

# Retrieve original demographics table
demog = datanex.query(db='borrat_project', query="""select * from demographics_ds;""")
# Remove duplicates to ensure clean merging
demog = demog.drop_duplicates(subset=['demog_id'])

# Drop the original nationality column so the anonymized version can replace it
# Note: only natio_ref is dropped here; a birth-date column, if present, is left untouched
demog.drop(columns=['natio_ref'], inplace=True)

# Merge with anonymized nationality data
# Using right join to ensure only anonymized records are kept
demog = demog.merge(demog_join, on='demog_id', how='right')

# Upload anonymized demographics table back to database
datanex.write_table_2(db='borrat_project', df=demog, table_name='demographics_kn')

Database [borrat_project] session created...
<> Query Sucessful <>
Database [borrat_project] session created...
<> Table [demographics_kn] created <>


In [18]:
demog_join.shape

(29698, 2)

In [26]:
#-------------------------------------DIAG_EVENTS--------------------------------------------
# Backpropagate anonymized diagnosis codes (ICD)

# Retrieve original diagnostics table
diag = datanex.query(db='borrat_project', query="""select * from diagnostics_ds;""")

# Remove original ICD codes to be replaced with anonymized versions
diag.drop(columns=['icd10_code'], inplace=True)

# Merge with anonymized ICD codes
# Right join ensures only anonymized records are kept
diag = diag.merge(diag_join, on='diag_id', how='right')

# Strip the asterisks that the anonymization step uses as placeholders from the ICD codes
diag.icd10_code = diag.icd10_code.apply(lambda s: s.replace('*', '') if isinstance(s, str) else s)

# Upload anonymized diagnostics table back to database
datanex.write_table_2(db='borrat_project', df=diag, table_name='diagnostics_kn')

Database [borrat_project] session created...
<> Query Sucessful <>
Database [borrat_project] session created...
<> Table [diagnostics_kn] created <>
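
The asterisk-stripping step in the cell above behaves as follows on a few sample values. This is a sketch: `strip_placeholder` is a hypothetical name for the same lambda, and the partially masked code `J96.**` is an assumed input, since the notebook does not show whether the external tool produces partial masks.

```python
def strip_placeholder(code):
    """Remove '*' characters from string codes; leave other values unchanged."""
    return code.replace("*", "") if isinstance(code, str) else code

# Sample inputs: an intact code, an assumed partial mask, a full suppression, a missing value
examples = ["J96.00", "J96.**", "*", None]
cleaned = [strip_placeholder(code) for code in examples]
print(cleaned)
```

A fully suppressed code becomes an empty string rather than a missing value, which matters for any downstream query that filters on NULL.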


In [28]:
#-------------------------------- WARD_STAY_EVENTS----------------
# Backpropagate anonymized age data to ensure patient age is properly de-identified

# Retrieve original ward stays table
stay = datanex.query(db='borrat_project', query="""select * from ward_stays_ds""")

# Remove original age at admission to be replaced with anonymized version
stay.drop(columns=['age_on_admission'], inplace=True)

# Merge with anonymized age data
# Right join ensures only anonymized records are kept
stay = stay.merge(stay_join, on='stay_id', how='right')

# Upload anonymized ward stays table back to database
datanex.write_table_2(db='borrat_project', df=stay, table_name='ward_stays_kn')

Database [borrat_project] session created...
<> Query Sucessful <>
Database [borrat_project] session created...
<> Table [ward_stays_kn] created <>
