## Free Lunch
### Goals:
1) Explore the baseline characteristics data, including when it exists and how (and how consistently) it is coded.

2) Whittle down studies to those we can assess balance on.

3) Assess balance of those studies. (This part is sketched but not written).

In [1]:
import pandas as pd
import numpy as np
from importlib import reload
from tqdm import tqdm_notebook as tqdm
import time

import pdaactconn as pc
from trialexplorer import AACTStudySet

import matplotlib.pyplot as plt
%matplotlib inline

We'll consider interventional studies that started between 2010 and 2017 for now. That gives us over 130,000 studies.

In [2]:
conn = pc.AACTConnection(source=pc.AACTConnection.REMOTE)
ss = AACTStudySet.AACTStudySet(conn=conn, 
                               tqdm_handler=tqdm)
ss.add_constraint("start_date >= '2010-01-01'")
ss.add_constraint("start_date <= '2017-12-31'")
ss.add_constraint("study_type = 'Interventional'")
ss.load_studies()

132961 studies loaded!


The baseline characteristics data on ClinicalTrials.gov is stored in the "baseline_measurements" table of the AACT database.

In [3]:
ss.add_dimensions(['baseline_measurements'])
ss.refresh_dim_data()
bm = ss.dimensions['baseline_measurements']

Successfuly added these 1 dimensions: ['baseline_measurements']
Failed to add these 0 dimensions: []


HBox(children=(IntProgress(value=0, max=266), HTML(value='')))

Syncing the temp table temp_cur_studies in 266 chunks x 500 records each

Creating index on the temp table
 - Loading dimension baseline_measurements
 -- Loading raw data
 -- Sorting index


Below we show the baseline measurements for the first study listed. There are twelve total measurements, four for each of three groups. We can intuit that Group B3 is all participants combined. We'll see that it is common to have one group be the total of the others. B1 and B2 are two different treatment groups. Unfortunately which type of group they are (e.g. treatment vs. control vs. total) is not in this table. We'll find that info soon.

This study has two count measures, both under "Gender": one for Male and one for Female. Count parameters don't have corresponding dispersion parameters. The other two measures are age and a Numeric Rating Scale (NRS-11) score. Both of these use a mean parameter with a standard deviation as the dispersion parameter.

In [4]:
bm.data.head(12)

nct_id,result_group_id,id,ctgov_group_code,classification,category,title,description,units,param_type,param_value,param_value_num,dispersion_type,dispersion_value,dispersion_value_num,dispersion_lower_limit,dispersion_upper_limit,explanation_of_na
NCT00125528,16204996,17336066,B3,,,Numeric Rating Scale (NRS-11),Numeric rating scale is an 11-point rating sca...,units on a scale,Mean,6.54,6.54,Standard Deviation,1.45,1.45,,,
NCT00125528,16204996,17336069,B3,,Male,Gender,,Participants,Count of Participants,16.0,16.0,,,,,,
NCT00125528,16204996,17336072,B3,,Female,Gender,,Participants,Count of Participants,25.0,25.0,,,,,,
NCT00125528,16204996,17336075,B3,,,Age,,years,Mean,54.71,54.71,Standard Deviation,11.06,11.06,,,
NCT00125528,16204997,17336067,B2,,,Numeric Rating Scale (NRS-11),Numeric rating scale is an 11-point rating sca...,units on a scale,Mean,6.6,6.6,Standard Deviation,1.5,1.5,,,
NCT00125528,16204997,17336070,B2,,Male,Gender,,Participants,Count of Participants,10.0,10.0,,,,,,
NCT00125528,16204997,17336076,B2,,,Age,,years,Mean,54.76,54.76,Standard Deviation,10.84,10.84,,,
NCT00125528,16204997,17336073,B2,,Female,Gender,,Participants,Count of Participants,11.0,11.0,,,,,,
NCT00125528,16204998,17336068,B1,,,Numeric Rating Scale (NRS-11),Numeric rating scale is an 11-point rating sca...,units on a scale,Mean,6.47,6.47,Standard Deviation,1.4,1.4,,,
NCT00125528,16204998,17336071,B1,,Male,Gender,,Participants,Count of Participants,6.0,6.0,,,,,,


Now that we've seen how baseline measurements are coded, we check for completeness. Unfortunately, fewer than 20,000 of our 130,000 studies have any baseline measurements.

In [5]:
bm.data.reset_index(drop=False, inplace=True)
bm.data.nct_id.nunique()

19812

Since baseline_measurements is connected to studies through result_groups in the AACT schema, baseline measurements probably exist only for studies with results posted. We test that by outer-joining baseline_measurements with the main studies table. The "units" column comes from baseline_measurements and has no NA values there, so after the join it is NA if and only if the study has no baseline measurements. Splitting on that column, we print the share of rows with no results submission date: about 99.9% of rows without measurements lack results, while every row with measurements has them. So the studies with measurements are almost exactly the studies with results submitted.

In [6]:
bm.data.units.isna().sum()

0

In [7]:
bm_joined = ss.studies.merge(bm.data, how='outer', on='nct_id')
print(bm_joined[bm_joined.units.isna()].results_first_submitted_date.isnull().mean())
print(bm_joined[bm_joined.units.notna()].results_first_submitted_date.isnull().mean())

0.9992487781597716
0.0
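
The same overlap check can also be written as a cross-tabulation, which makes the exceptions explicit. The sketch below is only an illustration on made-up toy frames: `studies` and `measurements` are hypothetical stand-ins for `ss.studies` and `bm.data`, and the values are invented.

```python
import pandas as pd

# Toy stand-ins for ss.studies and bm.data (values are made up for illustration).
studies = pd.DataFrame({
    "nct_id": ["A", "B", "C", "D"],
    "results_first_submitted_date": ["2015-01-01", None, "2016-06-01", None],
})
measurements = pd.DataFrame({
    "nct_id": ["A", "A", "C"],
    "units": ["years", "Participants", "years"],
})

# indicator=True adds a "_merge" column: "both" when the study has
# measurements, "left_only" when it does not.
has_bm = studies.merge(
    measurements[["nct_id"]].drop_duplicates(),
    how="left", on="nct_id", indicator=True,
)
has_bm["has_measurements"] = has_bm["_merge"] == "both"
has_bm["has_results"] = has_bm["results_first_submitted_date"].notna()

# Off-diagonal cells are studies where the two flags disagree.
overlap = pd.crosstab(has_bm["has_measurements"], has_bm["has_results"])
print(overlap)
```

Deduplicating the measurement ids before the merge keeps one row per study, so the table counts studies rather than measurement rows.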


We limit our investigation to studies with results, and add in the result_groups dimension to get more info on the groups (e.g. placebo vs. medication vs. total).

In [8]:
ss.add_constraint("results_first_submitted_date is not null")
ss.add_dimensions(['result_groups'])
ss.load_studies()
ss.refresh_dim_data()
bm = ss.dimensions['baseline_measurements']
rg = ss.dimensions['result_groups']

Successfuly added these 1 dimensions: ['result_groups']
Failed to add these 0 dimensions: []
19897 studies loaded!


HBox(children=(IntProgress(value=0, max=40), HTML(value='')))

Syncing the temp table temp_cur_studies in 40 chunks x 500 records each

Creating index on the temp table
 - Loading dimension baseline_measurements
 -- Loading raw data
 -- Sorting index
 - Loading dimension result_groups
 -- Loading raw data
 -- Sorting index


We can inner join result_groups and baseline_measurements without dropping any rows of baseline_measurements (not demonstrated here, but feel free to check). Now we can look at informative group names rather than just "B1" or "B2" as we saw above. We see that "Total" is indeed a standard group. Unfortunately, although most studies do include a combined group called "Total", that nomenclature is not uniform. We also see "All Study Participants", "All Participants", and "Entire Study Population" among the ten most common group titles. Treatment groups tend not to be repeated so heavily, since they are often named after the treatment.
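
One way to run the no-rows-dropped check is to compare row counts before and after the join. This is a minimal sketch on toy data: `result_groups` and `baseline` are hypothetical stand-ins for `rg.data` and `bm.data` flattened to plain columns, and the rows are invented.

```python
import pandas as pd

# Toy stand-ins for rg.data and bm.data, flattened to plain columns.
result_groups = pd.DataFrame({
    "nct_id": ["A", "A", "B"],
    "id": [1, 2, 3],
    "title": ["Placebo", "Total", "Drug X"],
})
baseline = pd.DataFrame({
    "nct_id": ["A", "A", "A", "B"],
    "result_group_id": [1, 1, 2, 3],
    "title": ["Age", "Gender", "Age", "Age"],
})

inner = pd.merge(
    result_groups, baseline,
    left_on=["nct_id", "id"], right_on=["nct_id", "result_group_id"],
    validate="one_to_many",  # raises if a result group key is duplicated
)

# If every measurement row found its group, the row counts match.
rows_dropped = len(baseline) - len(inner)
print(rows_dropped)
```

Because both frames have a `title` column, the merged frame carries `title_x` (group title) and `title_y` (measure title), which is where the `_x`/`_y` names in the cells that follow come from.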

In [9]:
combined_measures = pd.merge(rg.data, bm.data, left_on = ['nct_id', 'id'], right_on = ['nct_id', 'result_group_id'])
combined_measures.title_x.value_counts().head(10)

Total                      149420
Placebo                     28275
Control                      3983
All Study Participants       2042
Control Group                1973
All Participants             1938
Usual Care                   1924
Standard of Care             1034
Intervention                  967
Entire Study Population       777
Name: title_x, dtype: int64

In [10]:
combined_measures.title_x.value_counts().tail(10)

MISSION-Vet - IU Case Management                        1
Enhanced Implementation Approach GTO Case Management    1
Ripple Mapping Guided AT Ablation                       1
p52-p36- GAP Vaccine + Infectivity Challenge            1
Fluorodeoxythymidine PET/CT (FLT-PET/CT)                1
Clonidine as an Antimanic Agent                         1
Transdermal Estradiol or Placebo                        1
Infectivity Control                                     1
Conventional AT Ablation                                1
Cohort 16                                               1
Name: title_x, dtype: int64

How can we efficiently compare group balance under such heterogeneity? Well, it turns out that the most common case by far is a study with exactly three groups, just as we saw in our example study above...

In [11]:
num_groups = pd.DataFrame(combined_measures.groupby('nct_id').ctgov_group_code_x.nunique())
num_groups.columns = ['n_groups']
num_groups.n_groups.value_counts().head()

3    9150
1    5928
4    2298
5    1193
6     399
Name: n_groups, dtype: int64

...and we actually see that for these studies, "Total" is the only common name for the total group.

In [12]:
temp_combined_measures = combined_measures.merge(num_groups, on = 'nct_id')
temp_combined_measures[temp_combined_measures.n_groups == 3].title_x.value_counts().head(15)

Total                 99303
Placebo               17326
Control                3151
Control Group          1540
Usual Care             1513
Standard of Care        859
Intervention            790
Sugar Pill              625
Intervention Group      426
Control Arm             386
Placebo Group           380
Treatment               369
Standard Care           366
Vehicle                 347
Normal Saline           322
Name: title_x, dtype: int64

So our strategy for creating a workable sample will be:

1) Remove groups called "Total".

2) Keep studies that have exactly 2 groups remaining after that removal.
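
The two steps can be tried out on a toy frame first. This sketch uses invented rows and `groupby(...).transform("nunique")` instead of a merge; the column names mirror `combined_measures`, but `toy` itself is hypothetical.

```python
import pandas as pd

# Invented rows: study A has two arms plus "Total", B has one group,
# C has three arms plus "Total".
toy = pd.DataFrame({
    "nct_id": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "title_x": ["Drug", "Placebo", "Total", "Everyone", "Everyone",
                "Low", "High", "Placebo", "Total"],
    "ctgov_group_code_x": ["B1", "B2", "B3", "B1", "B1",
                           "B1", "B2", "B3", "B4"],
})

# Step 1: drop the combined group.
no_total = toy[toy.title_x != "Total"]

# Step 2: count the distinct remaining groups per study, aligned row by row,
# and keep only studies with exactly two.
n_groups = no_total.groupby("nct_id").ctgov_group_code_x.transform("nunique")
two_arm = no_total[n_groups == 2]
print(sorted(two_arm.nct_id.unique()))
```

Only study A survives here: B has a single group and C still has three after "Total" is removed.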

In [13]:
combined_measures = combined_measures[combined_measures.title_x != 'Total']
num_groups = combined_measures.groupby('nct_id').ctgov_group_code_x.nunique()
num_groups = pd.DataFrame(num_groups)
# Assign the column labels directly; mutating columns.values in place is not reliable.
num_groups.columns = ['n_groups']
combined_measures = combined_measures.merge(num_groups, on=['nct_id'])
combined_measures = combined_measures[combined_measures.n_groups==2]

After all this, we have over 9,000 studies remaining out of our original population of almost 20,000 studies with baseline measurements.

In [14]:
combined_measures.index.nunique()

9118

Our remaining studies look like the example below. We need to assess the imbalance of these studies.

In [15]:
test_group = combined_measures[combined_measures.index=='NCT00125528'][['ctgov_group_code_x' ,'category', 'title_y',
                                                                        'param_type', 'param_value_num',
                                                                        'dispersion_type', 'dispersion_value_num']]
test_group

nct_id,ctgov_group_code_x,category,title_y,param_type,param_value_num,dispersion_type,dispersion_value_num
NCT00125528,B2,,Age,Mean,54.76,Standard Deviation,10.84
NCT00125528,B2,Female,Gender,Count of Participants,11.0,,
NCT00125528,B2,,Numeric Rating Scale (NRS-11),Mean,6.6,Standard Deviation,1.5
NCT00125528,B2,Male,Gender,Count of Participants,10.0,,
NCT00125528,B1,Female,Gender,Count of Participants,14.0,,
NCT00125528,B1,,Age,Mean,54.65,Standard Deviation,11.56
NCT00125528,B1,,Numeric Rating Scale (NRS-11),Mean,6.47,Standard Deviation,1.4
NCT00125528,B1,Male,Gender,Count of Participants,6.0,,


The function below takes the baseline measurements for an individual study like the one shown directly above, separates on the group code, and merges back so the corresponding parameters are now in the same row for comparison. We need to decide how exactly to calculate imbalance once the data is in that convenient format.

In [16]:
study_balance_dat = combined_measures[['ctgov_group_code_x' ,'category', 'title_y',
                                       'param_type', 'param_value_num',
                                       'dispersion_type', 'dispersion_value_num']]

def calculate_imbalance_score(study_measures_df):
    # Split the study's rows by group code (exactly two groups are expected).
    group_codes = study_measures_df.ctgov_group_code_x.unique()
    g0 = group_codes[0]
    g1 = group_codes[1]
    # Copy the slices so that renaming columns does not touch the original frame.
    dat0 = study_measures_df[study_measures_df.ctgov_group_code_x == g0].copy()
    dat1 = study_measures_df[study_measures_df.ctgov_group_code_x == g1].copy()
    dat0.columns = ['ctgov_group_code_0', 'category', 'title', 'param_type',
                    'param_value_num_0', 'dispersion_type', 'dispersion_value_num_0']
    dat1.columns = ['ctgov_group_code_1', 'category', 'title', 'param_type',
                    'param_value_num_1', 'dispersion_type', 'dispersion_value_num_1']
    # One row per measure, with both groups' values side by side.
    dat_comb = pd.merge(dat0, dat1, on=['category', 'title', 'param_type', 'dispersion_type'])

    # TO DO: Create actual imbalance metric from dat_comb.
    raise NotImplementedError("imbalance metric not yet defined")

# Uncomment and run below after filling in imbalance calculation in function above.
#all_balance_studies = list(study_balance_dat.index.unique())
#imbalance_dict = dict()
#for stud in all_balance_studies:
    #current_study = study_balance_dat[study_balance_dat.index == stud]
    #imbalance_dict[stud] = calculate_imbalance_score(current_study)
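
The imbalance metric itself is still undecided. As one possible starting point, and only as an assumption rather than the notebook's chosen method, the sketch below computes an absolute standardized mean difference for measures reported as a mean with a standard deviation. The helper name `smd_for_mean_rows` is hypothetical, the pooled standard deviation ignores group sizes because they are not in this table, and count measures such as Gender are simply skipped.

```python
import numpy as np
import pandas as pd

def smd_for_mean_rows(dat_comb):
    """Hypothetical helper: absolute standardized mean difference per measure."""
    is_mean_sd = ((dat_comb.param_type == "Mean")
                  & (dat_comb.dispersion_type == "Standard Deviation"))
    rows = dat_comb[is_mean_sd].copy()
    # Unweighted pooled SD (an assumption: group sizes are not available here).
    pooled_sd = np.sqrt((rows.dispersion_value_num_0 ** 2
                         + rows.dispersion_value_num_1 ** 2) / 2)
    diff = (rows.param_value_num_0 - rows.param_value_num_1).abs()
    rows["smd"] = diff / pooled_sd
    return rows[["title", "smd"]]

# Values copied from the NCT00125528 example (B2 as group 0, B1 as group 1).
dat_comb = pd.DataFrame({
    "title": ["Age", "Numeric Rating Scale (NRS-11)", "Gender"],
    "param_type": ["Mean", "Mean", "Count of Participants"],
    "dispersion_type": ["Standard Deviation", "Standard Deviation", None],
    "param_value_num_0": [54.76, 6.6, 11.0],
    "param_value_num_1": [54.65, 6.47, 14.0],
    "dispersion_value_num_0": [10.84, 1.5, np.nan],
    "dispersion_value_num_1": [11.56, 1.4, np.nan],
})
smd = smd_for_mean_rows(dat_comb)
print(smd)
```

A full metric would still need a rule for count measures (for example, a difference in proportions) and a way to combine per-measure scores into one number per study.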