# Demographics
Tidy table of each patient's demographics as of the index date. All tables start with `pld_demo`.

**Script**
* [scripts/pld/demographics.ipynb](./scripts/pld/demographics.ipynb)

**Prior Script(s)**
* [scripts/de/raven_demographics.ipynb](./scripts/de/raven_demographics.ipynb)

**Parameters**
* `in/pld/demographics.xlsx[param]`
* `in/pld/demographics.xlsx[race_ref]`
* `in/pld/demographics.xlsx[age_buckets]`

**Input**
* `coh_pt`
* `de_raven_demographics`

**Output**  
* `pld_demo`

**Review**
* [scripts/pld/demographics.html](./scripts/pld/demographics.html)


In [1]:
#Import libraries for this notebook
import pandas as pd  
from drg_connect import Snowflake
import numpy as np
import pickle
import seaborn as sb
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#Load connection variables to connect_dict
with open('../../out/conn/connect_dict.pickle', 'rb') as handle:
    connect_dict = pickle.load(handle)

#Create engine to connect to snowflake
snow = Snowflake(role=connect_dict['role'],
                 warehouse=connect_dict['warehouse'],
                 database=connect_dict['database'],
                 schema=connect_dict['schema'])

#Finish engine setup
engine = snow.engine
%load_ext sql_magic
%config SQL.conn_name = 'engine'  #Set the sql_magic connection engine
%config SQL.output_result = True  #Enable output to std out
%config SQL.notify_result = False #disable browser notifications


# Parameters
Create python variables of the parameters

 **Input**  
* `in/pld/demographics.xlsx[param]`

**Output**
* Python variables named after parameters with the value

In [None]:
#Create system variables from excel into script and review values in dictionary
input_df = pd.read_excel('../../in/pld/demographics.xlsx', sheet_name='param', skiprows=4, dtype=str)
var_dict = dict(zip(input_df.parameter, input_df.value))
#Bind each parameter name to its value as a notebook-level variable (no exec needed)
globals().update(var_dict)

#Check inputs
var_dict
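
The parameter step turns a two-column sheet into Python variables. Below is a minimal sketch of that idea using an in-memory stand-in for the `param` sheet. The rows and the `study_name` parameter are hypothetical, and the identifier check is an added safeguard rather than something the notebook does.

```python
import pandas as pd

# Hypothetical stand-in for the `param` sheet; real values live in the workbook.
input_df = pd.DataFrame({
    "parameter": ["index_dt", "study_name"],
    "value": ["2020-01-01", "demo"],
})

# Same construction as the notebook: parameter name -> value.
var_dict = dict(zip(input_df["parameter"], input_df["value"]))

# Added safeguard (assumption): refuse names that cannot be Python variables.
bad = [name for name in var_dict if not str(name).isidentifier()]
if bad:
    raise ValueError(f"Invalid parameter names: {bad}")

# Reading from the dictionary avoids creating variables dynamically.
index_dt = var_dict["index_dt"]
print(index_dt)
```

Looking values up in `var_dict` keeps the origin of each parameter explicit, while binding them as variables is what allows `{index_dt}` to be interpolated in the SQL cells.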

# Reference
Upload reference tables needed for the analysis

## Race Mapping
Uploads reference table to clean race & ethnicity and map to a single race variable

**Input**  
  * `in/pld/demographics.xlsx[race_ref]`

**Output**  
* `pld_demo_race_ref`

In [None]:
#Upload reference table from excel to snowflake and review snowflake output
df = pd.read_excel('../../in/pld/demographics.xlsx', sheet_name='race_ref', skiprows=4, dtype=str)

#Strip white space and make the join-key columns uppercase
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
df[['race','ethnicity','source']] =  \
    df[['race','ethnicity','source']].apply(lambda x: x.str.upper() if x.dtype == "object" else x)

#Upload to snowflake
snow.drop_table("pld_demo_race_ref")
snow.upload_dataframe(df,"pld_demo_race_ref")
del df
snow.select("SELECT * FROM pld_demo_race_ref")
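
The cleaning applied to the reference sheet can be tried on a small frame without touching Snowflake. This is a sketch with hypothetical rows; only the column names `race`, `ethnicity`, and `source` come from the cell above, and `race_standard` is taken from the later join.

```python
import pandas as pd

# Hypothetical rows standing in for the `race_ref` sheet (all columns are strings).
df = pd.DataFrame({
    "race": [" White ", "asian"],
    "ethnicity": ["Not Hispanic ", " unknown"],
    "source": ["albatross", " Pelican"],
    "race_standard": [" White", "Asian "],
})

# Trim every column, then upper-case only the columns used as join keys.
df = df.apply(lambda col: col.str.strip())
key_cols = ["race", "ethnicity", "source"]
df[key_cols] = df[key_cols].apply(lambda col: col.str.upper())

print(df)
```

Upper-casing only the key columns keeps `race_standard` in its display form while making the later `Trim(Upper(...))` join case-insensitive.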

## Age Buckets
Uploads the reference table used to map each age to an age bucket

**Input**  
  * `in/pld/demographics.xlsx[age_buckets]`

**Output**  
* `pld_demo_ref_age_bucket`

In [None]:
#Upload reference table from excel to snowflake and review snowflake output
df = pd.read_excel('../../in/pld/demographics.xlsx', sheet_name='age_buckets', skiprows=4, dtype=str)

#Strip white space and make the join-key columns uppercase
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
df[['age','age_bucket']] =  \
    df[['age','age_bucket']].apply(lambda x: x.str.upper() if x.dtype == "object" else x)

#Upload to snowflake
snow.drop_table("pld_demo_ref_age_bucket")
snow.upload_dataframe(df,"pld_demo_ref_age_bucket")
del df
snow.select("SELECT * FROM pld_demo_ref_age_bucket")

# Create Variables

## Age, Gender, DOB
Determine the age, gender, and date of birth for the patients in the cohort

**Parameters**
  * `index_dt`
  
**Input**
  * `coh_pt`
  * `de_raven_demographics`
  
**Output**  
* `pld_demo_age_gender`

In [None]:
%%read_sql
--Create age and gender table as of the index date
DROP TABLE IF EXISTS pld_demo_age_gender; 
CREATE TRANSIENT TABLE pld_demo_age_gender AS
      SELECT coh.patient_id,
             demo.gender,
             demo.date_of_birth,
             Round(datediff(d,demo.date_of_birth,'{index_dt}')/365,0) AS age 
        FROM coh_pt coh
             LEFT JOIN de_raven_demographics demo
                    ON coh.patient_id = demo.patient_id;
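
The age expression divides the day count by 365 and rounds to the nearest whole number, which is not the same as age in completed years. The sketch below mirrors that arithmetic in Python next to a birthday-based calculation. Both function names and the sample dates are hypothetical, and it makes no claim about which definition the study intends.

```python
from datetime import date

def age_rounded(dob: date, index_dt: date) -> int:
    """Mirror of the SQL arithmetic: days between the dates / 365, rounded."""
    return round((index_dt - dob).days / 365)

def age_completed_years(dob: date, index_dt: date) -> int:
    """Age in completed years: subtract one if the birthday has not occurred yet."""
    had_birthday = (index_dt.month, index_dt.day) >= (dob.month, dob.day)
    return index_dt.year - dob.year - (0 if had_birthday else 1)

dob, index_dt = date(2000, 7, 1), date(2020, 3, 1)
print(age_rounded(dob, index_dt), age_completed_years(dob, index_dt))
```

For this pair of dates the rounded value is 20 while the completed-years value is 19, so patients near a bucket boundary may land in a different age bucket depending on the definition.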

## Race & Ethnicity
Identify the patient race from EHR data

In [None]:
%%read_sql
-- Collecting ethnicity data from albatross
DROP TABLE IF EXISTS pld_demo_alb_race;
CREATE TRANSIENT TABLE pld_demo_alb_race AS 
    SELECT coh.patient_id      AS patient_id,
           alb.race            AS race,
           alb.ethnicity1      AS ethnicity,
           alb.lastupdatedttm  AS last_update_dt,
           'albatross'         AS source
    FROM coh_pt coh
         JOIN de_raven_demographics demo
           ON coh.patient_id = demo.patient_id
         JOIN rwd_db.rwd.albatross_ehr_patients alb
           ON (alb.encrypted_key_1 = demo.encrypted_key_1
              AND alb.encrypted_key_2 = demo.encrypted_key_2)
              -- Assumed intent: fall back to key 1 alone when key 2 is missing
              OR (alb.encrypted_key_1 = demo.encrypted_key_1
                 AND demo.encrypted_key_2 IS NULL);
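
In SQL, an equality comparison against NULL is never true, so a fallback branch only matches when the `IS NULL` test is on a different column from the one being compared. The sketch below shows a two-key match with a key-1-only fallback, assuming that is the intent of the join. It uses `sqlite3` as a local stand-in for Snowflake, and the tables `demo` and `ehr` and their rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE demo (patient_id INTEGER, key_1 TEXT, key_2 TEXT);
    CREATE TABLE ehr  (key_1 TEXT, key_2 TEXT, race TEXT);
    -- Patient 2 has no second key.
    INSERT INTO demo VALUES (1, 'a', 'x'), (2, 'b', NULL);
    INSERT INTO ehr  VALUES ('a', 'x', 'White'), ('b', 'y', 'Asian');
""")

# Match on both keys, or on key 1 alone when the cohort row lacks key 2.
rows = con.execute("""
    SELECT d.patient_id, e.race
      FROM demo d
      JOIN ehr e
        ON (e.key_1 = d.key_1 AND e.key_2 = d.key_2)
           OR (e.key_1 = d.key_1 AND d.key_2 IS NULL)
     ORDER BY d.patient_id
""").fetchall()

print(rows)
```

Patient 2 is matched only through the fallback branch, because `e.key_2 = d.key_2` evaluates to NULL when `d.key_2` is missing.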

In [None]:
%%read_sql
-- Collecting race information from pelican

DROP TABLE IF EXISTS pld_demo_race_pelican;

CREATE TRANSIENT TABLE pld_demo_race_pelican AS 
    SELECT coh.patient_id AS patient_id,
           race.race_name AS race,
           NULL           AS ethnicity,
           last_modified  AS last_update_dt,
           'pelican'      AS source
      FROM coh_pt coh
           JOIN de_raven_demographics demo
             ON coh.patient_id = demo.patient_id 
           JOIN rwd_db.rwd.pelican_deid pel
             ON (pel.encrypted_key_1 = demo.encrypted_key_1
                AND pel.encrypted_key_2 = demo.encrypted_key_2)
                -- Assumed intent: fall back to key 1 alone when key 2 is missing
                OR (pel.encrypted_key_1 = demo.encrypted_key_1
                    AND demo.encrypted_key_2 IS NULL)
           JOIN rwd_db.rwd.pelican_patient_race raceid
             ON raceid.patient_id = pel.patient_id
           JOIN rwd_db.rwd.pelican_race race
             ON raceid.race_id = race.race_id;   
             
SELECT Count(*) from pld_demo_race_pelican;

In [None]:
%%read_sql
-- Combine race records from both sources (UNION drops exact duplicate rows)
DROP TABLE IF EXISTS pld_demo_race_comb;
CREATE TRANSIENT TABLE pld_demo_race_comb AS
    SELECT patient_id, race, ethnicity, last_update_dt, source
      FROM pld_demo_alb_race
    UNION
    SELECT patient_id, race, ethnicity, last_update_dt, source
      FROM pld_demo_race_pelican;
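
`UNION` removes rows that are identical in every column, but it does not reduce a patient to a single race. A rough pandas equivalent, with hypothetical rows, makes the distinction visible.

```python
import pandas as pd

cols = ["patient_id", "race", "ethnicity", "last_update_dt", "source"]

# Hypothetical rows: one albatross record and a duplicated pelican record.
alb = pd.DataFrame(
    [(1, "White", "Unknown", "2020-01-01", "albatross")], columns=cols
)
pel = pd.DataFrame(
    [
        (1, "White", None, "2020-01-01", "pelican"),
        (1, "White", None, "2020-01-01", "pelican"),
    ],
    columns=cols,
)

# Stack both sources and drop exact duplicates, as UNION does.
comb = pd.concat([alb, pel], ignore_index=True).drop_duplicates()

print(comb)
```

The duplicated pelican row collapses to one, yet the patient still has one row per source. Picking a single race happens later, when the most recent mapped record is selected.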


In [None]:
%%read_sql
--Clean up null and unknown race/ethnicity values
BEGIN;
UPDATE pld_demo_race_comb
  SET race = 'Unknown'
WHERE race IS NULL
      OR race ilike '%Unknown%'
      OR race ilike '%other%';
COMMIT;

BEGIN;
UPDATE pld_demo_race_comb
  SET ethnicity = 'Unknown'
WHERE ethnicity IS NULL
      OR ethnicity ilike '%Unknown%'
      OR ethnicity ILIKE '%other%';
COMMIT;
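
The clean-up rule (NULL, or any value containing "unknown" or "other" regardless of case, becomes `Unknown`) can be previewed in pandas before running the update. The values below are hypothetical.

```python
import pandas as pd

# Hypothetical raw race values, including a missing one.
race = pd.Series(["White", None, "unknown race", "Other Pacific Islander", "Asian"])

# Same rule as the UPDATE: missing, or contains "unknown"/"other" in any case.
is_unknown = race.isna() | race.str.contains("unknown|other", case=False, na=False)
cleaned = race.mask(is_unknown, "Unknown")

print(cleaned.tolist())
```

Because the pattern matches anywhere in the string, a specific value such as "Other Pacific Islander" is also collapsed to `Unknown`, which is worth checking against the source values before relying on the counts.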

In [None]:
%%read_sql
--Pull together all race information
CREATE OR REPLACE TEMP TABLE tmp_race_all AS
    SELECT race.patient_id,
           race.race,
           race.ethnicity,
           race.source,
           race.last_update_dt,
           ref.race_standard
      FROM pld_demo_race_comb race
           LEFT JOIN pld_demo_race_ref ref
                  ON Trim(Upper(race.race)) = ref.race
                     AND Trim(Upper(race.ethnicity)) = ref.ethnicity
                     AND Trim(Upper(race.source)) = ref.source;
     
--Identify most recent race/ethnicity information
CREATE OR REPLACE TEMP TABLE tmp_max_dt AS
    SELECT patient_id,
           Max(last_update_dt) AS max_update_dt
      FROM tmp_race_all
     WHERE race_standard IS NOT NULL
     GROUP BY patient_id;
     
--Create final table
DROP TABLE IF EXISTS pld_demo_race_final;
CREATE TRANSIENT TABLE pld_demo_race_final AS
    SELECT race.patient_id,
           Max(race.race_standard) AS race_standard
      FROM tmp_race_all race
           JOIN tmp_max_dt dt
             ON race.patient_id = dt.patient_id
                AND race.last_update_dt = dt.max_update_dt
     GROUP BY race.patient_id;       
     
--Clean up the nan values and set them to null
BEGIN;
UPDATE pld_demo_race_final
   SET race_standard = CASE WHEN race_standard = 'nan' THEN NULL ELSE race_standard END;
COMMIT;
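
The selection logic above keeps, for each patient, the mapped race from the most recent update, and breaks ties on the same date by taking the alphabetical maximum. A pandas sketch of the same steps, using hypothetical rows:

```python
import pandas as pd

# Hypothetical stand-in for tmp_race_all; patient 1's newest row has no mapping.
race_all = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "last_update_dt": pd.to_datetime(
        ["2019-01-01", "2020-06-01", "2021-01-01", "2020-01-01"]
    ),
    "race_standard": ["Asian", "White", None, "Black"],
})

# Only rows with a mapped race are eligible (WHERE race_standard IS NOT NULL).
mapped = race_all.dropna(subset=["race_standard"])

# Latest update date per patient among the eligible rows.
max_dt = mapped.groupby("patient_id")["last_update_dt"].transform("max")
latest = mapped[mapped["last_update_dt"] == max_dt]

# Tie-break on the same date with Max(race_standard).
final = latest.groupby("patient_id")["race_standard"].max()

print(final.to_dict())
```

Patient 1 resolves to the 2020 record because the 2021 record has no mapped race. The alphabetical tie-break is arbitrary, so patients with conflicting same-day records may deserve a separate review.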

In [None]:
%%read_sql
--Review the counts
SELECT race_standard,
       Count(*) AS cnt,
       Count(*) / (SELECT Count(*)
                     FROM pld_demo_race_final) AS pct
  FROM pld_demo_race_final
 GROUP BY race_standard
 ORDER BY cnt desc;

# Final Demographic Table
Create the final demographic table for everyone

In [None]:
%%read_sql
--Create final demographics table
DROP TABLE IF EXISTS pld_demo;
CREATE TRANSIENT TABLE pld_demo AS
    SELECT coh.patient_id,
           demo.age,
           age.age_bucket,
           demo.gender,
           demo.date_of_birth,
           race.race_standard
      FROM coh_pt coh
           LEFT JOIN pld_demo_age_gender demo
                  ON coh.patient_id = demo.patient_id
           LEFT JOIN pld_demo_race_final race
                  ON coh.patient_id = race.patient_id
           LEFT JOIN pld_demo_ref_age_bucket age
                  ON age.age = demo.age;

In [None]:
%%read_sql
--Confirm basic counts
SELECT Count(*) AS row_cnt,
       Count(Distinct patient_id) AS patient_cnt
  FROM pld_demo;
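
The row count should equal the patient count. If the age bucket reference ever lists the same age twice, the left join fans out and the two numbers diverge. The pandas sketch below, with hypothetical frames, uses `validate="m:1"` to fail early in that situation.

```python
import pandas as pd

# Hypothetical cohort ages and bucket reference; patient 3 has no age.
demo = pd.DataFrame({"patient_id": [1, 2, 3], "age": [34.0, 67.0, None]})
age_ref = pd.DataFrame({"age": [34.0, 67.0], "age_bucket": ["18-44", "65+"]})

# validate="m:1" raises MergeError if an age appears more than once in age_ref.
pld_demo = demo.merge(age_ref, on="age", how="left", validate="m:1")

row_cnt = len(pld_demo)
patient_cnt = pld_demo["patient_id"].nunique()
print(row_cnt, patient_cnt)
```

Patients without an age keep their row and simply get no bucket, which matches the behaviour of the `LEFT JOIN` in the final table.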

In [None]:
%%read_sql
--Gender breakdown
SELECT gender,
       Count(*) AS cnt,
       Count(*) / (SELECT Count(*)
                     FROM pld_demo) AS pct
  FROM pld_demo
 GROUP BY gender;

In [None]:
%%read_sql
--Review age buckets for sanity
SELECT age_bucket,
       Count(*) AS row_cnt,
       Count(Distinct patient_id) AS pt_cnt,
       Count(*) / (SELECT Count(*)
                     FROM pld_demo) AS pt_pct
  FROM pld_demo
 GROUP BY age_bucket
 ORDER BY age_bucket;

In [None]:
%%read_sql
--Race Breakdown
SELECT race_standard,
       Count(*) AS cnt,
       Sum(CASE WHEN race_standard IS NOT NULL THEN 1 ELSE 0 END) 
          / (SELECT Count(*)
                     FROM pld_demo
                    WHERE race_standard IS NOT NULL) AS share,
       Count(*) / (SELECT Count(*)
                     FROM pld_demo) AS pct                  
  FROM pld_demo
 GROUP BY race_standard
 ORDER BY cnt DESC;

# Drop Tables
Drop the intermediate tables

In [2]:
%%read_sql
DROP TABLE IF EXISTS PLD_DEMO_AGE_GENDER;
DROP TABLE IF EXISTS PLD_DEMO_ALB_RACE;
DROP TABLE IF EXISTS PLD_DEMO_RACE_COMB;
DROP TABLE IF EXISTS PLD_DEMO_RACE_FINAL;
DROP TABLE IF EXISTS PLD_DEMO_RACE_PELICAN;
DROP TABLE IF EXISTS PLD_DEMO_RACE_REF;
DROP TABLE IF EXISTS PLD_DEMO_REF_AGE_BUCKET;


Unnamed: 0,status
0,PLD_DEMO_REF_AGE_BUCKET successfully dropped.
