The Task:
Hi,

Here is your home task. Reply to this email with your solution attached.
If you have any questions, you may write to this email address.


Task

You are working as a data engineer at a healthcare organization that uses the Fast Healthcare Interoperability Resources (FHIR) standard for managing patient health data. Your team receives patient encounter data in FHIR format. Your task is to read the file into tables that can be queried easily by our analysts. Finally, you are required to write some simple aggregate queries over the tables in SQL.

Submit your result in any format that works for you. It can be a Python script, a Jupyter notebook (https://jupyter.org/install), or any other platform that will let us run your solution end-to-end. For SQL queries you may use SQLite3 (https://docs.python.org/3/library/sqlite3.html) or any other simple DB solution.

Model Tables

Once the JSON file is fixed, model the encounter data into tables. Choose an appropriate relational database schema to represent the data.
Provide a written explanation of the table structure and relationships, highlighting the normalization principles applied.

Write SQL Queries

Write SQL queries to perform the following tasks:
Retrieve a list of unique patient IDs along with their names.
Count the number of encounters for each patient name.
Find the name of the practitioner with the highest number of encounters.
Identify the most common condition observed across all encounters per month.
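
The four tasks above could be sketched as follows. This is only an illustration under assumptions: it uses a `patients` table and an `encounters` table with the column names from the schema described in this notebook, and the rows are made-up placeholder data, not values from the real file. Months are taken from the first seven characters of the ISO timestamp, and ties for the most common condition return every tied row.

```python
import sqlite3

# Toy in-memory database. Table and column names follow the schema described
# in this notebook; the rows are placeholders for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (
        patient_id   INTEGER PRIMARY KEY,
        patient_name TEXT NOT NULL
    );
    CREATE TABLE encounters (
        encounter_id       TEXT PRIMARY KEY,
        patient_id         INTEGER REFERENCES patients(patient_id),
        practitioner_name  TEXT,
        encounter_date     TEXT,
        condition_observed TEXT
    );
    INSERT INTO patients VALUES (1, 'Patient A'), (2, 'Patient B');
    INSERT INTO encounters VALUES
        ('e1', 1, 'Dr. X', '2023-11-22T03:00:00Z', 'Condition P'),
        ('e2', 1, 'Dr. X', '2023-11-23T09:00:00Z', 'Condition P'),
        ('e3', 2, 'Dr. Y', '2023-11-24T10:00:00Z', 'Condition Q'),
        ('e4', 2, 'Dr. X', '2023-12-01T08:00:00Z', 'Condition Q');
""")

QUERIES = {
    # 1. Unique patient IDs with their names
    "unique_patients": """
        SELECT DISTINCT patient_id, patient_name
        FROM patients
        ORDER BY patient_id
    """,
    # 2. Number of encounters per patient (grouped by ID so namesakes stay separate)
    "encounters_per_patient": """
        SELECT p.patient_name, COUNT(e.encounter_id) AS encounter_count
        FROM patients p
        LEFT JOIN encounters e ON e.patient_id = p.patient_id
        GROUP BY p.patient_id, p.patient_name
        ORDER BY p.patient_id
    """,
    # 3. Practitioner with the highest number of encounters
    "busiest_practitioner": """
        SELECT practitioner_name, COUNT(*) AS encounter_count
        FROM encounters
        GROUP BY practitioner_name
        ORDER BY encounter_count DESC, practitioner_name
        LIMIT 1
    """,
    # 4. Most common condition per month (YYYY-MM prefix of the timestamp)
    "top_condition_per_month": """
        WITH monthly AS (
            SELECT substr(encounter_date, 1, 7) AS month,
                   condition_observed,
                   COUNT(*) AS n
            FROM encounters
            GROUP BY substr(encounter_date, 1, 7), condition_observed
        )
        SELECT m.month, m.condition_observed, m.n
        FROM monthly m
        WHERE m.n = (SELECT MAX(x.n) FROM monthly x WHERE x.month = m.month)
        ORDER BY m.month, m.condition_observed
    """,
}

results = {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
for name, rows in results.items():
    print(name, rows)
```

Against the real database, only the `QUERIES` strings would be needed, for example with `pd.read_sql_query(sql, conn)`.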

In [1]:
import sqlite3
import pandas as pd
import json

In [8]:
conn = sqlite3.connect('test_healthcare_data.db')
cursor = conn.cursor()

Tables and Structure:

1. Patients Table:
Columns:
patient_id (Primary Key)
patient_name

2. Encounters Table:
Columns:
encounter_id (Primary Key)
patient_id (Foreign Key referencing Patients table)
status
type (Foreign Key referencing EncounterTypes table)
practitioner_name
encounter_date
condition_observed
location_id (Foreign Key referencing Locations table)
diagnosis_code (Foreign Key referencing Diagnoses table)

3. EncounterTypes Table:
Columns:
type_id (Primary Key)
coding_system
coding_code
coding_display
text

4. Locations Table:
Columns:
location_id (Primary Key)
location_reference
location_display

5. Diagnoses Table:
Columns:
diagnosis_code (Primary Key)
diagnosis_display
diagnosis_system

Relationships:
The patient_id column in the Encounters table is a foreign key referencing the patient_id in the Patients table.
The type column in the Encounters table is a foreign key referencing the type_id in the EncounterTypes table.
The location_id column in the Encounters table is a foreign key referencing the location_id in the Locations table.
The diagnosis_code column in the Encounters table is a foreign key referencing the diagnosis_code in the Diagnoses table.

Normalization Principles Applied:
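1NF (First Normal Form): Each column holds a single atomic (indivisible) value; repeating FHIR arrays such as type and location are not stored as lists.
2NF (Second Normal Form): Every table has a single-column primary key, so no partial dependencies can occur.
3NF (Third Normal Form): Attributes that describe an encounter type, a location, or a diagnosis are stored in their own tables and referenced by key, so non-prime attributes depend only on the primary key. One exception remains: condition_observed in the Encounters table repeats diagnosis_display, which strict 3NF would remove.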
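
The table structure and relationships above could be written as DDL roughly like this. It is a sketch under assumptions, not the schema actually created later in the notebook: the lowercase snake_case table names, the column types, and the use of TEXT keys for opaque FHIR identifiers are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign-key enforcement off unless it is enabled per connection
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    CREATE TABLE patients (
        patient_id   INTEGER PRIMARY KEY,
        patient_name TEXT NOT NULL
    );
    CREATE TABLE encounter_types (
        type_id        TEXT PRIMARY KEY,
        coding_system  TEXT,
        coding_code    TEXT,
        coding_display TEXT,
        text           TEXT
    );
    CREATE TABLE locations (
        location_id        TEXT PRIMARY KEY,
        location_reference TEXT,
        location_display   TEXT
    );
    CREATE TABLE diagnoses (
        diagnosis_code    TEXT PRIMARY KEY,
        diagnosis_display TEXT,
        diagnosis_system  TEXT
    );
    CREATE TABLE encounters (
        encounter_id       TEXT PRIMARY KEY,
        patient_id         INTEGER NOT NULL REFERENCES patients(patient_id),
        status             TEXT,
        type               TEXT REFERENCES encounter_types(type_id),
        practitioner_name  TEXT,
        encounter_date     TEXT,
        condition_observed TEXT,
        location_id        TEXT REFERENCES locations(location_id),
        diagnosis_code     TEXT REFERENCES diagnoses(diagnosis_code)
    );
""")

# List the tables that were created
table_names = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(table_names)
```

With enforcement enabled, inserting an encounter whose patient_id has no matching row in patients raises sqlite3.IntegrityError, which helps catch loading-order mistakes early.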

Example of an object:
{
  "encounter": {
    "resourcetype": "Bundle",
    "type": "searchset",
    "total": 1,
    "entry": [
      {
        "fullurl": "https://jade-forest-h.edu/sup-fhirproxy/api/FHIR/R4/Encounter/eDYzEfJtWqzyANXf5-1Efdx86jOMRWXizr-7iu.5QyG83",
        "resource": {
          "resourcetype": "Encounter",
          "id": "eDYzEfJtWqzyANXf5-1Efdx86jOMRWXizr-7iu.5QyG83",
          "status": "arrived",
          "type": [
            {
              "coding": [
                {
                  "system": "urn:oid:1.2.998.1251.341.13.202.3.7.10.698084.30",
                  "code": "2101020001",
                  "display": "Video Visit"
                }
              ],
              "text": "Video Visit"
            },
            {
              "coding": [
                {
                  "system": "urn:oid:1.2.998.1251.341.13.202.3.7.2.808267",
                  "code": "200415",
                  "display": "Video Visit"
                }
              ],
              "text": "Video Visit"
            },
            {
              "coding": [
                {
                  "system": "urn:oid:1.2.998.1251.341.13.202.3.7.10.698084.18875",
                  "code": "3",
                  "display": "Elective"
                }
              ],
              "text": "Elective"
            }
          ],
          "subject": {
            "reference": "Patient3015",
            "display": "Lily Young"
          },
          "participant": [
            {
              "period": {},
              "individual": {
                "display": "Self Referred",
                "type": "Practitioner"
              },
              "type": [
                {
                  "coding": [
                    {
                      "system": "http://hl7.org/fhir/v3/ParticipationType",
                      "code": "REF",
                      "display": "referrer"
                    }
                  ],
                  "text": "referrer"
                }
              ],
              "extension": []
            },
            {
              "reference": "Practitioner/Dr2002",
              "display": "Dr. Benjamin Carter"
            }
          ],
          "period": {
            "start": "2023-11-22T03:00:00Z",
            "_end": "2023-11-22T03:15:00Z"
          },
          "location": [
            {
              "location": {
                "reference": "Location/emFR3fX6yqT.KL--hFHkxMw3",
                "display": "Jade Forest Hospital Center",
                "identifier": {}
              },
              "period": {},
              "physicaltype": {
                "coding": []
              }
            }
          ],
          "reasoncode": [],
          "issue": [],
          "account": [],
          "extension": [],
          "hospitalization": {
            "admitsource": {
              "coding": []
            },
            "extension": [],
            "dischargedisposition": {
              "coding": []
            }
          },
          "partof": {
            "identifier": {}
          },
          "servicetype": {
            "coding": []
          },
          "diagnosis": {
            "reference": "F32.A",
            "display": "Depression, unspecified",
            "system": "ICD-CM-9"
          },
          "episodeofcare": [],
          "priority": {
            "coding": []
          }
        },
        "search": {
          "mode": "match"
        }
      }
    ]
  },
  "sys_insert": "2023-11-26 11:35:38.404169 UTC"
}

In [None]:
# Access certain fields in the object
# (`data` is one parsed object shaped like the example above)
resource = data['encounter']['entry'][0]['resource']

# Patient ID and name
patient_name = resource['subject']['display']
patient_id = int(resource['subject']['reference'].split('Patient')[1])

# Encounter ID, status, and start date
encounter_id = resource['id']
encounter_status = resource['status']
encounter_start_date = resource['period']['start']

# Encounter type: coding system, code, display, and text (first entry only)
encounter_type_coding_system = resource['type'][0]['coding'][0]['system']
encounter_type_coding_code = resource['type'][0]['coding'][0]['code']
encounter_type_coding_display = resource['type'][0]['coding'][0]['display']
encounter_type_text = resource['type'][0]['text']
encounter_type_id = encounter_type_coding_code
encounter_type = encounter_type_text

# Practitioner name (second participant in the example)
practitioner_name = resource['participant'][1]['display']

# Location ID, reference, and display
# The ID is an opaque string such as 'emFR3fX6yqT.KL--hFHkxMw3', so it is not cast to int
location_reference = resource['location'][0]['location']['reference']
location_id = location_reference.split('/')[-1]
location_display = resource['location'][0]['location']['display']

# Diagnosis code, display, and system
diagnosis_code = resource['diagnosis']['reference'].split('/')[-1]
diagnosis_display = resource['diagnosis']['display']
diagnosis_system = resource['diagnosis']['system']

# Condition observed (same field as the diagnosis display)
condition_observed = diagnosis_display
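
The practitioner lookup above relies on the practitioner being at index 1 of the participant list, which holds for the example object but may not hold for every encounter. A possible alternative, sketched here with a hypothetical helper named `find_practitioner_name`, is to pick the participant whose reference starts with "Practitioner/". It assumes other objects follow the same layout as the example.

```python
def find_practitioner_name(resource):
    """Return the display name of the first participant that references a Practitioner.

    Hypothetical helper: assumes practitioners appear as participant entries
    with a top-level 'reference' like 'Practitioner/<id>', as in the example object.
    """
    for participant in resource.get("participant", []):
        reference = participant.get("reference", "")
        if reference.startswith("Practitioner/"):
            return participant.get("display")
    return None  # no practitioner entry found


# Participant list trimmed from the example object above
example_resource = {
    "participant": [
        {"individual": {"display": "Self Referred", "type": "Practitioner"}},
        {"reference": "Practitioner/Dr2002", "display": "Dr. Benjamin Carter"},
    ]
}

practitioner_name = find_practitioner_name(example_resource)
print(practitioner_name)
```

Returning None instead of raising lets the caller decide whether an encounter without a practitioner should be skipped or stored with a NULL name.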


In [9]:
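# Read the entire content of the file as a single string
# (despite the .jsonl extension, the objects span multiple lines)
with open('fhir_encounters.jsonl', 'r') as file:
    data_str = file.read()

# Split on the '}\n{' boundary between consecutive objects;
# the split consumes the closing and opening brace at each boundary
json_objects = data_str.split('}\n{')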
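
Splitting on '}\n{' removes the braces at every boundary but leaves the opening brace of the first chunk and the closing brace of the last chunk in place, so the chunks are not uniform. One way to avoid string splitting altogether is `json.JSONDecoder.raw_decode`, which parses one value and reports where it ended. The sketch below uses a hypothetical helper named `iter_json_objects` and assumes the file is a sequence of well-formed JSON objects separated only by whitespace; any other damage in the file would need its own repair step.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so skip it here
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Parse one value starting at pos; raw_decode returns the value
        # and the index just past its end
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Small stand-in for the file content: three pretty-printed objects back to back
sample = '{\n  "a": 1\n}\n{\n  "a": 2\n}\n{\n  "a": 3\n}\n'
objects = list(iter_json_objects(sample))
print(len(objects))
```

With the real file, `list(iter_json_objects(data_str))` would take the place of the split and the later brace re-wrapping.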

In [10]:
# Create Patients Table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS patients (
        patient_id INTEGER PRIMARY KEY,
        patient_name TEXT NOT NULL
    )
''')

<sqlite3.Cursor at 0x7fe5d8b60e30>

In [11]:
# Query and display the contents of the 'patients' table
patients_df = pd.read_sql_query("SELECT * FROM patients", conn)
patients_df

Unnamed: 0,patient_id,patient_name


In [12]:
# Process each JSON object and insert its patient into the patients table
for line_number, json_str in enumerate(json_objects, start=1):
    # Restore the braces consumed by the split. The first chunk still has its
    # own '{' and the last chunk its own '}', so those two fail to decode below.
    json_str = f'{{{json_str}}}'

    try:
        data = json.loads(json_str)
        patient_name = data['encounter']['entry'][0]['resource']['subject']['display']
        patient_id = int(data['encounter']['entry'][0]['resource']['subject']['reference'].split('Patient')[1])

        # Check if patient_id already exists
        cursor.execute("SELECT 1 FROM patients WHERE patient_id=?", (patient_id,))
        exists = cursor.fetchone()

        if not exists:
            # Insert data into the patients table
            cursor.execute('''
                INSERT INTO patients (patient_id, patient_name)
                VALUES (?, ?)
            ''', (patient_id, patient_name))
        else:
            print(f"Skipping duplicate patient_id {patient_id}")

    except json.JSONDecodeError as e:
        # line_number is the 1-based index of the object, not a line in the file
        print(f"Error decoding JSON on line {line_number}: {e}")

# Persist the inserted rows
conn.commit()

Error decoding JSON on line 1: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
Skipping duplicate patient_id 3006
Skipping duplicate patient_id 3009
Skipping duplicate patient_id 3006
Skipping duplicate patient_id 3025
Skipping duplicate patient_id 3021
Skipping duplicate patient_id 3007
Skipping duplicate patient_id 3033
Skipping duplicate patient_id 3037
Skipping duplicate patient_id 3011
Skipping duplicate patient_id 3016
Skipping duplicate patient_id 3038
Skipping duplicate patient_id 3026
Skipping duplicate patient_id 3015
Error decoding JSON on line 50: Extra data: line 2212 column 2 (char 63808)
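
The check-then-insert pattern in the cell above can also be expressed with SQLite's INSERT OR IGNORE, which lets the primary key reject duplicates in a single statement. This is a minimal sketch on an in-memory database with placeholder rows, not the data from the file; the repeated ID stands in for a patient who has more than one encounter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, patient_name TEXT NOT NULL)"
)

# Placeholder rows for illustration only
rows = [(1, "Patient A"), (2, "Patient B"), (1, "Patient A")]

skipped = []
for patient_id, patient_name in rows:
    cur = conn.execute(
        "INSERT OR IGNORE INTO patients (patient_id, patient_name) VALUES (?, ?)",
        (patient_id, patient_name),
    )
    if cur.rowcount == 0:
        # Nothing was inserted because the primary key already existed
        skipped.append(patient_id)
conn.commit()

stored = conn.execute(
    "SELECT patient_id, patient_name FROM patients ORDER BY patient_id"
).fetchall()
print(stored, skipped)
```

The trade-off is that INSERT OR IGNORE silently keeps the first name seen for an ID, so a conflicting name for the same patient would go unnoticed unless it is checked separately.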


In [13]:
# Query and display the contents of the 'patients' table
patients_df = pd.read_sql_query("SELECT * FROM patients", conn)
patients_df

Unnamed: 0,patient_id,patient_name
0,3001,John Doe
1,3002,Jane Doe
2,3003,Michael Smith
3,3004,Emily Johnson
4,3005,David Brown
5,3006,Sophia Martinez
6,3007,William Taylor
7,3008,Emma Anderson
8,3009,Daniel White
9,3011,Liam Harris


In [14]:
patients_df.size  # number of cells (rows x columns), not the number of rows

70