### ML Data pre-processing
This notebook loads and cleans the data used to train the ML model: measurements such as patient heart rate and blood pressure recorded around the time the second dose was administered.

It should persist the data to the "out" directory, where the ML training notebook consumes it.

In [1]:
import root_config as rc

rc.configure()

from detectdd.auth_bigquery import BigQueryClient
from detectdd.serializer import Serializer

try:
    # The cohort file is produced by 01-cohort.ipynb
    cohort = Serializer().read_cohort()
except FileNotFoundError as err:
    # Re-raise with a hint, keeping the original error as the cause
    raise FileNotFoundError("Run [01-cohort.ipynb] at least once to create the cohort file in the /out directory") from err


big_query = BigQueryClient.auth()

cohort.describe()

Loaded cohort from ..\out\cohort-full.out
<google.oauth2.credentials.Credentials object at 0x00000173CA0A60B0> assignment-1-395912


| statistic | subject_id | stay_id | drug_a_item_id | drug_b_item_id | dose_b_time | event_count |
|---|---|---|---|---|---|---|
| count | 65814.0 | 65814.0 | 65814.0 | 65814.0 | 65814 | 65814.0 |
| mean | 14861841.983545 | 34935152.596742 | 225337.914061 | 223402.287659 | 2154-08-20 12:07:36.738081 | 7.232534 |
| min | 10002428.0 | 30000484.0 | 221347.0 | 221289.0 | 2110-01-19 21:15:00 | 1.0 |
| 25% | 12350214.75 | 32259031.0 | 225855.0 | 221744.0 | 2134-03-24 07:48:00 | 1.0 |
| 50% | 14767018.0 | 34928659.0 | 225869.0 | 221744.0 | 2155-03-14 18:23:00 | 3.0 |
| 75% | 17362900.0 | 37550329.0 | 225892.0 | 225798.0 | 2174-09-03 21:18:00 | 8.0 |
| max | 19999840.0 | 39999230.0 | 229618.0 | 229233.0 | 2209-05-31 15:15:00 | 304.0 |
| std | 2903605.512713 | 2950984.617329 | 1422.498681 | 2057.698826 |  | 16.848738 |


In [2]:
from detectdd.query_multiplexer import QueryMultiplexer, WhereClauseGenerator
from detectdd import config

icu = 'physionet-data.mimiciv_icu'

cs_mean_art_chart_itemids = {
    "PA mean pressure (PA Line)": 226857,
    "Arterial Blood Pressure mean": 220052,
    "Non Invasive Blood Pressure mean": 220181,
    "ART BP Mean": 225312,
}
query_multiplexer = QueryMultiplexer(BigQueryClient().auth())

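def fetch_blood_pressure_data():
    """Fetch mean arterial BP readings charted around each cohort dose_b_time."""
    # Renders as a SQL tuple literal, e.g. (226857, 220052, ...); needs at least two item ids
    item_ids = tuple(cs_mean_art_chart_itemids.values())

    # $where is left as a placeholder and passed on to multiplex_query with the where_clause below.
    # Every row that passes the WHERE filter has a mean-BP itemid, so abs_type is always "MEAN" here.
    sql_template = f"""SELECT ce.subject_id AS subject_id,
            ce.stay_id AS stay_id,
            ce.hadm_id AS hadm_id,
            ce.itemid AS item_id,
            ce.charttime AS chart_time,
            ce.valuenum AS bp,
            IF(ce.itemid IN {item_ids}, "MEAN", "SYSTOLIC") AS abs_type,
        FROM `{icu}.chartevents` AS ce
        WHERE ce.itemid IN {item_ids}  --measures of mean arterial BP
        AND ($where)"""

    # DATETIME_DIFF(..., HOUR) counts hour boundaries crossed rather than elapsed time,
    # so the edges of the 0 to 12 hour window are only accurate to the hour.
    where_fragment = "(ce.stay_id= $stay_id AND DATETIME_DIFF(ce.charttime, '$dose_b_time', HOUR) BETWEEN 0 AND 12)"

    # Map each stay_id to the list of its dose_b_time values
    multimap_data = {k: v.tolist() for k, v in cohort.groupby('stay_id')['dose_b_time']}
    where_clause = WhereClauseGenerator(where_fragment, "stay_id", "dose_b_time")
    return query_multiplexer.multiplex_query(sql_template, multi_map_data=multimap_data, where_clause=where_clause)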

results = fetch_blood_pressure_data()
results

<google.oauth2.credentials.Credentials object at 0x00000173CA144970> assignment-1-395912
Executing query 1, with 6664 pairs at 2023-10-23 19:57:10.811912
Got result with 99653 values
Executing query 2, with 5050 pairs at 2023-10-23 20:06:13.397627
Got result with 76379 values
Executing query 3, with 4039 pairs at 2023-10-23 20:12:59.855405
Got result with 61486 values
Executing query 4, with 3349 pairs at 2023-10-23 20:16:40.547709
Got result with 51480 values
Executing query 5, with 2847 pairs at 2023-10-23 20:19:36.499385
Got result with 43810 values
Executing query 6, with 2469 pairs at 2023-10-23 20:21:47.183168
Got result with 37974 values
Executing query 7, with 2186 pairs at 2023-10-23 20:23:14.149528
Got result with 33779 values
Executing query 8, with 1917 pairs at 2023-10-23 20:24:11.329104
Got result with 29692 values
Executing query 9, with 1717 pairs at 2023-10-23 20:24:58.985416
Got result with 27052 values
Executing query 10, with 1566 pairs at 2023-10-23 20:25:49.878316

Forbidden: 403 Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas

Location: US
Job ID: 8df57abf-c648-4365-bc53-29368367a69f
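
The log above reports each query as a batch of pairs. As a rough illustration of how (stay_id, dose_b_time) pairs might be expanded into the `$where` placeholder, the sketch below uses Python's `string.Template`. It is an assumption about the general approach, not the actual `WhereClauseGenerator` implementation; `build_where_clause`, `max_pairs`, and the example values are hypothetical.

```python
from string import Template

# Same shape as the where_fragment in cell 2; Template fills the $-placeholders.
WHERE_FRAGMENT = Template(
    "(ce.stay_id= $stay_id AND "
    "DATETIME_DIFF(ce.charttime, '$dose_b_time', HOUR) BETWEEN 0 AND 12)"
)


def build_where_clause(multimap, max_pairs):
    """Expand up to max_pairs (stay_id, dose_b_time) pairs into one OR-ed clause.

    Returns the clause plus the pairs left over for a later batch.
    (Hypothetical helper, not part of detectdd.)
    """
    # Flatten {stay_id: [time, ...]} into a list of (stay_id, time) pairs.
    pairs = [(stay_id, t) for stay_id, times in multimap.items() for t in times]
    batch, remaining = pairs[:max_pairs], pairs[max_pairs:]
    clause = "\n OR ".join(
        WHERE_FRAGMENT.substitute(stay_id=s, dose_b_time=t) for s, t in batch
    )
    return clause, remaining


# Made-up ids and timestamps, for illustration only.
example = {
    111: ["2150-01-01 08:00:00"],
    222: ["2150-01-02 08:00:00", "2150-01-03 20:30:00"],
}
clause, remaining = build_where_clause(example, max_pairs=2)
print(clause)
```

With `max_pairs=2`, the first two pairs go into the clause and the third is returned in `remaining` for the next batch. Splitting into batches like this is one plausible reason the run issues several queries rather than one.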


In [None]:
type(results)

In [None]:
len(results)

In [None]:
results
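
The three cells above were never executed: the fetch stopped at query 10 with a quota error, and the nine batches that had already completed were not kept. One way to avoid repeating that work is to save each batch as it finishes and skip saved batches on the next run. The sketch below shows the idea with the standard library only; `run_batches_with_checkpoints` and `flaky_fetch` are hypothetical names, and `flaky_fetch` merely simulates a query call that fails once. It is not how `QueryMultiplexer` works today.

```python
import pickle
import tempfile
from pathlib import Path


def run_batches_with_checkpoints(batches, fetch_batch, checkpoint_dir):
    """Run fetch_batch over batches, saving each result so a rerun can resume.

    Returns (results, failed_index); failed_index is None if every batch succeeded.
    """
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for index, batch in enumerate(batches, start=1):
        path = checkpoint_dir / f"batch-{index:04d}.pkl"
        if path.exists():
            # Reuse the saved result instead of spending quota on it again.
            results.append(pickle.loads(path.read_bytes()))
            continue
        try:
            result = fetch_batch(batch)
        except Exception as err:  # e.g. a 403 quota error from the query client
            print(f"Batch {index} failed: {err}; rerun later to resume")
            return results, index
        path.write_bytes(pickle.dumps(result))
        results.append(result)
    return results, None


calls = []


def flaky_fetch(batch):
    # Simulated query call: fails the first time it sees batch [5, 6].
    calls.append(batch)
    if batch == [5, 6] and calls.count(batch) == 1:
        raise RuntimeError("quota exceeded (simulated)")
    return [value * 10 for value in batch]


demo_batches = [[1, 2], [3, 4], [5, 6]]
with tempfile.TemporaryDirectory() as tmp:
    first_results, failed_at = run_batches_with_checkpoints(demo_batches, flaky_fetch, tmp)
    second_results, failed_again = run_batches_with_checkpoints(demo_batches, flaky_fetch, tmp)
```

In the demo, the first run stops at batch 3 and the second run fetches only that batch, loading the other two from disk. In this notebook the checkpoint directory could be the "out" directory; only unpickle files that this notebook wrote itself.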