## Data Transformation - Unstructured Data to Structured Data

***


### Aim of Project:

To transform the MIMIC-IV FHIR dataset into a structured format suitable for analysis. The NDJSON files containing patient, condition, and encounter resources are converted into a single CSV file that maps each patient ID to its conditions and to the start time of the encounter in which each condition was recorded. This involves extracting, processing, and merging the data into one analyzable dataset.

***

### 1] importing required packages

In [1]:
import json
import csv
import pandas as pd
from datetime import datetime
from dateutil import parser

### 2] loading the ndjson files

In [2]:

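def load_ndjson(file_path):
    #NDJSON stores one JSON object per line; blank lines are skipped so they do not break json.loads
    with open(file_path, 'r', encoding='utf-8') as file:
        return [json.loads(line) for line in file if line.strip()]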


patients_data = load_ndjson('Patient.ndjson')
conditions_data = load_ndjson('Condition.ndjson')
encounters_data = load_ndjson('Encounter.ndjson') + load_ndjson('EncounterICU.ndjson')
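
To make the file format concrete, here is a minimal sketch of the same line-by-line parsing, run against an in-memory string instead of a file. The two records are made-up stand-ins, not MIMIC-IV data, and `parse_ndjson` is a hypothetical helper name used only for this illustration.

```python
import io
import json

def parse_ndjson(stream):
    # One JSON object per line; skip blank lines such as a trailing newline.
    return [json.loads(line) for line in stream if line.strip()]

# Made-up sample records (not taken from the MIMIC-IV files).
sample = io.StringIO(
    '{"resourceType": "Patient", "id": "p1"}\n'
    '{"resourceType": "Patient", "id": "p2"}\n'
    '\n'
)

records = parse_ndjson(sample)
print(len(records))       # 2
print(records[0]["id"])   # p1
```

Each line is parsed independently, which is why a whole file can be read with a single list comprehension.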


### 3] viewing the ndjson files

In [3]:
print("First 2 records of patients_data:")

for patient in patients_data[:2]:
    print("\n", patient)

First 2 records of patients_data:

 {'resourceType': 'Patient', 'id': '0a8eebfd-a352-522e-89f0-1d4a13abdebc', 'meta': {'versionId': '1', 'lastUpdated': '2022-05-24T15:14:55.471-04:00', 'source': '#V0XlSRZTewCRRSjY', 'profile': ['http://fhir.mimic.mit.edu/StructureDefinition/mimic-patient']}, 'text': {'status': 'generated', 'div': '<div xmlns="http://www.w3.org/1999/xhtml"><div class="hapiHeaderText"><b>PATIENT_10000032 </b></div><table class="hapiPropertyTable"><tbody><tr><td>Identifier</td><td>10000032</td></tr><tr><td>Date of birth</td><td><span>06 May 2128</span></td></tr></tbody></table></div>'}, 'extension': [{'url': 'http://hl7.org/fhir/us/core/StructureDefinition/us-core-race', 'extension': [{'url': 'ombCategory', 'valueCoding': {'system': 'urn:oid:2.16.840.1.113883.6.238', 'code': '2106-3', 'display': 'White'}}, {'url': 'text', 'valueString': 'White'}]}, {'url': 'http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity', 'extension': [{'url': 'ombCategory', 'valueCodin

In [4]:
print("\nFirst 2 records of conditions_data:")

for condition in conditions_data[:2]:
    print("\n", condition)


First 2 records of conditions_data:

 {'resourceType': 'Condition', 'id': '0002fff8-11c5-5d6d-975a-b926a13bb02b', 'meta': {'versionId': '1', 'lastUpdated': '2022-05-24T15:51:35.263-04:00', 'source': '#CexDBHtjfcg8Ti57', 'profile': ['http://fhir.mimic.mit.edu/StructureDefinition/mimic-condition']}, 'identifier': [{'system': 'http://fhir.mimic.mit.edu/identifier/condition', 'value': '28108313-4-Z8546'}], 'category': [{'coding': [{'system': 'http://terminology.hl7.org/CodeSystem/condition-category', 'code': 'encounter-diagnosis'}]}], 'code': {'coding': [{'system': 'http://fhir.mimic.mit.edu/CodeSystem/diagnosis-icd10', 'code': 'Z8546', 'display': 'Personal history of malignant neoplasm of prostate'}]}, 'subject': {'reference': 'Patient/b410dd44-7d65-56f9-974f-2751e8aa80e2'}, 'encounter': {'reference': 'Encounter/ca52755d-7780-524a-a5f8-6c5d2fc2136a'}}

 {'resourceType': 'Condition', 'id': '0014d847-44bd-5bfa-ac44-f411071c1e72', 'meta': {'versionId': '1', 'lastUpdated': '2022-05-24T16:57:

In [5]:
print("\nFirst 2 records of encounters_data:")

for encounter in encounters_data[:2]:
    print("\n", encounter)


First 2 records of encounters_data:

 {'resourceType': 'Encounter', 'id': '0071a339-74cd-596a-9083-771d41d6d118', 'meta': {'versionId': '1', 'lastUpdated': '2022-05-24T16:22:58.682-04:00', 'source': '#yJ9Zy3hOLyEtjrG2', 'profile': ['http://fhir.mimic.mit.edu/StructureDefinition/mimic-encounter']}, 'identifier': [{'use': 'usual', 'system': 'http://fhir.mimic.mit.edu/identifier/encounter', 'value': '22429197', 'assigner': {'reference': 'Organization/ee172322-118b-5716-abbc-18e4c5437e15'}}], 'status': 'finished', 'class': {'system': 'http://fhir.mimic.mit.edu/CodeSystem/admission-class', 'code': 'EW EMER.'}, 'type': [{'coding': [{'system': 'http://snomed.info/sct', 'code': '453701000124103', 'display': 'In-person encounter (procedure)'}]}], 'serviceType': {'coding': [{'system': 'http://fhir.mimic.mit.edu/CodeSystem/services', 'code': 'TRAUM'}]}, 'priority': {'coding': [{'system': 'http://fhir.mimic.mit.edu/CodeSystem/admission-type', 'code': 'EW EMER.'}]}, 'subject': {'reference': 'Patie

### 4] creating a dictionary of conditions keyed by patient ID

In [6]:

patient_conditions = {}

for condition in conditions_data:
    
    #extracting the patient reference from each condition
    patient_ref = condition.get('subject', {}).get('reference', "")

    #processing the patient_ref string to extract the patient ID
    patient_id = patient_ref.split('/')[-1] if patient_ref else None
    
    if patient_id:
        if patient_id not in patient_conditions:
            patient_conditions[patient_id] = []
        patient_conditions[patient_id].append(condition)
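
The same grouping can also be sketched with `collections.defaultdict`, which removes the explicit "is the key present yet" check. This is an alternative formulation under the same assumptions about the `subject.reference` field, not a change to the cell above; `group_by_patient` and the sample records are hypothetical.

```python
from collections import defaultdict

def group_by_patient(conditions):
    # Map each patient ID to the list of its condition records.
    grouped = defaultdict(list)
    for condition in conditions:
        ref = condition.get("subject", {}).get("reference", "")
        # "Patient/<id>" -> "<id>"; records without a subject are skipped.
        patient_id = ref.split("/")[-1] if ref else None
        if patient_id:
            grouped[patient_id].append(condition)
    return dict(grouped)

# Made-up sample records (not taken from the MIMIC-IV files).
sample_conditions = [
    {"id": "c1", "subject": {"reference": "Patient/p1"}},
    {"id": "c2", "subject": {"reference": "Patient/p1"}},
    {"id": "c3", "subject": {"reference": "Patient/p2"}},
    {"id": "c4"},  # no subject, so it is skipped
]

grouped = group_by_patient(sample_conditions)
print({pid: [c["id"] for c in conds] for pid, conds in grouped.items()})
# {'p1': ['c1', 'c2'], 'p2': ['c3']}
```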

### 5] viewing the dictionary for one specific patient

In [26]:
#output will show the list of all conditions associated with a specific patient, looked up by its ID

patient_id = 'b410dd44-7d65-56f9-974f-2751e8aa80e2' 

specific_id = patient_conditions.get(patient_id, [])

print("Conditions for specific patient ID,", patient_id, ":\n")

print(specific_id)


Conditions for specific patient ID, b410dd44-7d65-56f9-974f-2751e8aa80e2 :

[{'resourceType': 'Condition', 'id': '0002fff8-11c5-5d6d-975a-b926a13bb02b', 'meta': {'versionId': '1', 'lastUpdated': '2022-05-24T15:51:35.263-04:00', 'source': '#CexDBHtjfcg8Ti57', 'profile': ['http://fhir.mimic.mit.edu/StructureDefinition/mimic-condition']}, 'identifier': [{'system': 'http://fhir.mimic.mit.edu/identifier/condition', 'value': '28108313-4-Z8546'}], 'category': [{'coding': [{'system': 'http://terminology.hl7.org/CodeSystem/condition-category', 'code': 'encounter-diagnosis'}]}], 'code': {'coding': [{'system': 'http://fhir.mimic.mit.edu/CodeSystem/diagnosis-icd10', 'code': 'Z8546', 'display': 'Personal history of malignant neoplasm of prostate'}]}, 'subject': {'reference': 'Patient/b410dd44-7d65-56f9-974f-2751e8aa80e2'}, 'encounter': {'reference': 'Encounter/ca52755d-7780-524a-a5f8-6c5d2fc2136a'}, 'start_time': 5616018000}, {'resourceType': 'Condition', 'id': '024ab8d3-e719-50c3-b70a-d3e1760ab100

### 6] creating a dictionary from the list of encounters

In [27]:

#the keys are encounter IDs and the values are the dictionaries representing the full details of those encounters

encounter_by_id = {encounter['id']: encounter for encounter in encounters_data}
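
Because `encounters_data` concatenates two files, a dict comprehension keeps only the last record for any ID that appears more than once. Whether the two encounter files actually share IDs is not established here, so the following sketch, using made-up records and a hypothetical `find_duplicate_ids` helper, shows one way to check before relying on the lookup.

```python
from collections import Counter

def find_duplicate_ids(resources):
    # Count how often each resource ID occurs and report the repeated ones.
    counts = Counter(resource["id"] for resource in resources)
    return sorted(rid for rid, n in counts.items() if n > 1)

# Made-up sample records (not taken from the MIMIC-IV files).
sample_encounters = [
    {"id": "e1", "status": "finished"},
    {"id": "e2", "status": "finished"},
    {"id": "e1", "status": "in-progress"},
]

duplicates = find_duplicate_ids(sample_encounters)
print(duplicates)  # ['e1']

# With a dict comprehension, the later record silently wins.
by_id = {enc["id"]: enc for enc in sample_encounters}
print(by_id["e1"]["status"])  # in-progress
```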

In [29]:
#viewing details for an encounter by its ID

details = encounter_by_id.get('0071a339-74cd-596a-9083-771d41d6d118')

print(details)

{'resourceType': 'Encounter', 'id': '0071a339-74cd-596a-9083-771d41d6d118', 'meta': {'versionId': '1', 'lastUpdated': '2022-05-24T16:22:58.682-04:00', 'source': '#yJ9Zy3hOLyEtjrG2', 'profile': ['http://fhir.mimic.mit.edu/StructureDefinition/mimic-encounter']}, 'identifier': [{'use': 'usual', 'system': 'http://fhir.mimic.mit.edu/identifier/encounter', 'value': '22429197', 'assigner': {'reference': 'Organization/ee172322-118b-5716-abbc-18e4c5437e15'}}], 'status': 'finished', 'class': {'system': 'http://fhir.mimic.mit.edu/CodeSystem/admission-class', 'code': 'EW EMER.'}, 'type': [{'coding': [{'system': 'http://snomed.info/sct', 'code': '453701000124103', 'display': 'In-person encounter (procedure)'}]}], 'serviceType': {'coding': [{'system': 'http://fhir.mimic.mit.edu/CodeSystem/services', 'code': 'TRAUM'}]}, 'priority': {'coding': [{'system': 'http://fhir.mimic.mit.edu/CodeSystem/admission-type', 'code': 'EW EMER.'}]}, 'subject': {'reference': 'Patient/24450f28-a039-57d8-95c9-d7ba5508ecd4

### 7] assigning times for the conditions

In [11]:

#time format conversion to UNIX

def convert_to_unix_timestamp(iso_format_time):
    
    #parsing the ISO 8601 string into a datetime object
    dt = parser.parse(iso_format_time)
    #converting to seconds since the UNIX epoch; a string without a UTC offset is interpreted in the local timezone
    return int(dt.timestamp())


#updating each medical condition record with the start time of the corresponding encounter

for conditions in patient_conditions.values():
    
    for condition in conditions:
        encounter_ref = condition.get('encounter', {}).get('reference', "")
        encounter_id = encounter_ref.split('/')[-1] if encounter_ref else None
        
        #default to None so every condition gets a 'start_time' key,
        #even when the encounter or its start time is missing
        condition['start_time'] = None
        
        if encounter_id and encounter_id in encounter_by_id:
            iso_format_time = encounter_by_id[encounter_id].get('period', {}).get('start', "")
            
            if iso_format_time:
                condition['start_time'] = convert_to_unix_timestamp(iso_format_time)
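
As a worked example of the conversion, the sketch below turns two ISO 8601 strings into UNIX seconds and decodes one of the `start_time` values shown in the patient output above. It uses the standard library's `datetime.fromisoformat` as a stand-in for `dateutil`'s parser (an assumption made so the example has no third-party dependency), and `to_unix` is a hypothetical name. The decoded value lands far in the future, which is consistent with MIMIC dates being shifted as part of de-identification.

```python
from datetime import datetime, timezone

def to_unix(iso_time):
    # Stand-in for convert_to_unix_timestamp, using the standard library
    # parser instead of dateutil (an assumption made for this sketch).
    return int(datetime.fromisoformat(iso_time).timestamp())

# One day after the epoch, expressed in UTC.
print(to_unix("1970-01-02T00:00:00+00:00"))  # 86400

# Midnight at UTC-04:00 is 04:00 UTC, i.e. 4 * 3600 seconds after the epoch.
print(to_unix("1970-01-01T00:00:00-04:00"))  # 14400

# Decoding a start_time value from the notebook output back into a date.
decoded = datetime.fromtimestamp(5616018000, tz=timezone.utc)
print(decoded.year)  # 2147
```

The offset in the string matters: the same wall-clock time with a different UTC offset produces a different timestamp.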



### 8] creating a csv file

In [12]:

csv_file = 'output_conditions.csv'

column_names = ['pid', 'time', 'code', 'description']

with open(csv_file, 'w', newline='') as csvfile:
    
    #initializing a CSV writer object that will use csvfile as the file to write to
    writer = csv.DictWriter(csvfile, fieldnames = column_names)
    writer.writeheader()
    
    for patient_id, conditions in patient_conditions.items():
        
        for condition in conditions:
            
            #extracting condition details from the first coding entry;
            #falls back to an empty entry when 'coding' is missing or empty
            codings = condition.get('code', {}).get('coding') or [{}]
            code = codings[0].get('code', "")
            description = codings[0].get('display', "")
            start_time = condition.get('start_time', "")
            
            writer.writerow({
                'pid': patient_id,
                'time': start_time,
                'code': code,
                'description': description
            })
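
Two details of `csv.DictWriter` matter for this output: descriptions that contain commas are quoted automatically, and a `None` start time is written as an empty field. The sketch below demonstrates both with an in-memory buffer; the patient ID is made up, and the code and description are borrowed from the sample output only for illustration.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["pid", "time", "code", "description"])
writer.writeheader()
writer.writerow({
    "pid": "p1",    # made-up patient ID
    "time": None,   # missing start time, written as an empty field
    "code": "49390",
    "description": "Asthma, unspecified type, unspecified",
})

output = buffer.getvalue()
print(output)

# Reading the text back recovers the description as a single field.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["description"])
```

Because the quoting is handled by the writer, the description column survives a round trip even though it contains the delimiter.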



In [13]:
csv_file

'output_conditions.csv'

### 9] viewing the csv file

In [14]:
final_csv = pd.read_csv('output_conditions.csv')

In [17]:
final_csv

| | pid | time | code | description |
|---|---|---|---|---|
| 0 | b410dd44-7d65-56f9-974f-2751e8aa80e2 | 5616018000 | Z8546 | Personal history of malignant neoplasm of pros... |
| 1 | b410dd44-7d65-56f9-974f-2751e8aa80e2 | 5387190060 | V1072 | Personal history of hodgkin's disease |
| 2 | b410dd44-7d65-56f9-974f-2751e8aa80e2 | 5639393940 | Z7902 | Long term (current) use of antithrombotics/ant... |
| 3 | b410dd44-7d65-56f9-974f-2751e8aa80e2 | 5426582400 | 49390 | Asthma, unspecified type, unspecified |
| 4 | b410dd44-7d65-56f9-974f-2751e8aa80e2 | 5426582400 | 2724 | Other and unspecified hyperlipidemia |
| ... | ... | ... | ... | ... |
| 4176 | 94abdf17-f13a-5eae-aac0-eca407bbfadd | 6243609900 | Z23 | Encounter for immunization |
| 4177 | 94abdf17-f13a-5eae-aac0-eca407bbfadd | 6243609900 | K219 | Gastro-esophageal reflux disease without esoph... |
| 4178 | 5ddeb201-5de6-5177-a116-fa82ce8ad2f2 | 5545862220 | Z85850 | Personal history of malignant neoplasm of thyroid |
| 4179 | 5ddeb201-5de6-5177-a116-fa82ce8ad2f2 | 5545862220 | G40909 | Epilepsy, unspecified, not intractable, withou... |


****