# Data Exploration of Childhood Asthma Management Program Study Teaching Dataset

*Please note: This notebook uses open access data*     
*Please note: BioDataCatalyst RAS Login in the BRH Profile Page needs to be authorized*

Qiong Liu

This teaching dataset was developed using the [Childhood Asthma Management Program (CAMP)](https://biolincc.nhlbi.nih.gov/media/teachingstudies/CAMP_Documentation.pdf?) as the data source. The trial was designed to assess the long-term effects of three treatments (`budesonide`, `nedocromil`, or `placebo`) on pulmonary function. The dataset includes longitudinal data for 695 participants from the CAMP trial. It was permuted and anonymized for teaching and training purposes and was not prepared to reproduce the primary outcome results.

In this tutorial, we demonstrate how to pull the CAMP study object file from the [BioData Catalyst data commons](https://gen3.biodatacatalyst.nhlbi.nih.gov/) into a BRH workspace, and how to explore and visualize the data using Python packages.

### Import Python libraries

In [None]:
import pandas as pd
import numpy as np
import pyreadstat
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
from scipy import stats

pd.set_option('mode.chained_assignment', None)

### Read object file

**Note**: Please authorize BioDataCatalyst RAS login first under [user profile page](https://brh.data-commons.org/identity).

In [None]:
!gen3 drs-pull object dg.4503/8d84511c-76f9-4464-8fdf-6dd668ed9c64

camp_df, camp_meta = pyreadstat.read_sav("camp_teach.sav", apply_value_formats=True)

In [None]:
# Display column names and column descriptions
# column_names_to_labels maps each column name to its descriptive label
col_names = camp_meta.column_names_to_labels
pd.DataFrame(list(col_names.items()), columns=['Name', 'Label'])

**FEV1** is the volume of air a person can force from their lungs in one second. The normal range of FEV1 varies from person to person: predicted values are based on standards for an average healthy person of the same age, race, height, and gender, so each person has their own predicted FEV1 value. Therefore, FEV1 as a percentage of the predicted value (**FEVPP**) can be used as the key measurement for pulmonary function assessment.
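As a rough illustration of the definition above, percent predicted is the observed value divided by the predicted value, multiplied by 100. The function name and the example numbers below are hypothetical and are not taken from the CAMP dataset.

```python
def percent_predicted(observed_fev1, predicted_fev1):
    """Return observed FEV1 as a percentage of the predicted FEV1 (same units for both)."""
    if predicted_fev1 <= 0:
        raise ValueError("predicted FEV1 must be positive")
    return 100.0 * observed_fev1 / predicted_fev1

# Hypothetical child: observed 1.8 L against a predicted 2.0 L
example_fevpp = percent_predicted(1.8, 2.0)
print(round(example_fevpp, 1))
```

A value below 100 means the observed FEV1 is lower than expected for a comparable healthy person.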

### Participants demographic data exploration and visualization

In [None]:
# add age group to the dataframe
def age_group(agelist):
    grouplabel1 = "Early Childhood (2-5yr)"
    grouplabel2 = "Middle Childhood (6-11yr)"
    grouplabel3 = "Early Adolescence (12-18yr)"
    grouplist = []
    for i in agelist:
        if i <= 5:
            grouplist.append(grouplabel1)
        elif i <= 11:
            grouplist.append(grouplabel2)
        elif i >= 12:
            grouplist.append(grouplabel3)
        else:
            grouplist.append("NA")
    return grouplist
camp_df['age_group'] = age_group(camp_df['age_rz'])

first_visit = camp_df.loc[(camp_df["visitc"]=="000")]
first_visit.head(3)
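The loop above can also be written with `pandas.cut`, which bins a numeric column in a single call. This is only a sketch on toy ages (hypothetical values), and it assumes ages are whole years. Unlike the loop, missing ages stay as `NaN` rather than the string `"NA"`.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical), including a missing value
ages = pd.Series([3, 5, 6, 11, 12, 17, np.nan])

age_labels = [
    "Early Childhood (2-5yr)",
    "Middle Childhood (6-11yr)",
    "Early Adolescence (12-18yr)",
]
# Right-closed bins: (-inf, 5], (5, 11], (11, inf)
age_groups = pd.cut(ages, bins=[-np.inf, 5, 11, np.inf], labels=age_labels)
print(age_groups.tolist())
```

The result is an ordered categorical, so group summaries and plots follow the age order rather than alphabetical order.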

In [None]:
# The number of rows in first_visit shows how many participants were enrolled in the study
first_visit.shape

In [None]:
# Shows the counts of both genders
first_visit['GENDER'].value_counts()

In [None]:
# Plot the composition of age groups by gender among participants in the CAMP study
count_sex_age = pd.crosstab(index=first_visit['age_group'], columns=first_visit['GENDER'])

# Take the labels from the crosstab index so they always line up with the counts
labels = count_sex_age.index.tolist()
pie_age_gender = make_subplots(1, 2, specs=[[{'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['Female', 'Male'])
pie_age_gender.add_trace(go.Pie(labels=labels, values=count_sex_age['f'], scalegroup='one',
                     name="Female"), 1, 1)
pie_age_gender.add_trace(go.Pie(labels=labels, values=count_sex_age['m'], scalegroup='one',
                     name="Male"), 1, 2)

pie_age_gender.update_layout(title_text='Gender and Age Characteristics of CAMP Study',
                         annotations=[dict(text='Female', x=0.225, y=0.47, font_size=15, showarrow=False),
                                      dict(text='Male', x=0.78, y=0.46, font_size=15, showarrow=False)],
                            width=800, height=400)
pie_age_gender.update_traces(hole=.4, hoverinfo="label+value+percent+name")
pie_age_gender.show()

In [None]:
# Plot the composition of ethnicity groups by gender among participants in the CAMP study
count_sex_ethnic = pd.crosstab(index=first_visit['ETHNIC'], columns=first_visit['GENDER'])

# These labels must follow the row order of count_sex_ethnic (check count_sex_ethnic.index)
ethnic_labels = ["black", "hispanic", "other", "white"]
pie_ethnic_gender = make_subplots(1, 2, specs=[[{'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['Female', 'Male'])
pie_ethnic_gender.add_trace(go.Pie(labels=ethnic_labels, values=count_sex_ethnic['f'], scalegroup='one',
                     name="Female"), 1, 1)
pie_ethnic_gender.add_trace(go.Pie(labels=ethnic_labels, values=count_sex_ethnic['m'], scalegroup='one',
                     name="Male"), 1, 2)

pie_ethnic_gender.update_layout(title_text='Gender and Ethnicity Characteristics of CAMP Study',
                         annotations=[dict(text='Female', x=0.225, y=0.47, font_size=15, showarrow=False),
                                      dict(text='Male', x=0.78, y=0.46, font_size=15, showarrow=False)],
                               width=800, height=400)
pie_ethnic_gender.update_traces(hole=.4, hoverinfo="label+value+percent+name")
pie_ethnic_gender.show()

In [None]:
# Counts of participants of different treatment groups
first_visit['TX'].value_counts()

- At the beginning of the study, a total of 695 participants were enrolled.
- The `budesonide` and `nedocromil` treatment groups each had 210 participants. A total of 141 and 134 participants were enrolled in the `placebo nedocromil` and `placebo budesonide` groups, respectively.
- More male participants (412) than female participants (283) were enrolled.
- Over three quarters of the participants at enrollment were in the middle childhood age group (6-11 yr). The age group composition is similar for male and female participants.
- Nearly three quarters of the participants were white. The rest were from the black, hispanic, or other ethnicity groups.
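Proportions like those quoted above can be computed directly rather than read off the pie charts. The sketch below uses a small toy table (hypothetical values, not the CAMP data) with the same column names as `first_visit`.

```python
import pandas as pd

# Toy enrollment table (hypothetical values, not the CAMP data)
toy_first_visit = pd.DataFrame({
    "GENDER": ["m", "m", "f", "m", "f", "m", "f", "m"],
    "age_group": ["Middle", "Middle", "Middle", "Early",
                  "Middle", "Middle", "Early", "Middle"],
})

# Share of each age group among all participants
age_share = toy_first_visit["age_group"].value_counts(normalize=True)

# Share of each age group within each gender (each column sums to 1)
age_share_by_gender = pd.crosstab(
    index=toy_first_visit["age_group"],
    columns=toy_first_visit["GENDER"],
    normalize="columns",
)
print(age_share)
print(age_share_by_gender)
```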

### Key measurements exploration and visualization

`PREFEVPP` is the FEV1 percentage of predicted before bronchodilator use. We chose this variable because it reflects the pulmonary function of participants without the confounding effect of the bronchodilator. In this section, we explore `PREFEVPP` across ethnicity groups and across follow-up visits.

In [None]:
# Keep only first-visit rows with a non-missing PREFEVPP value
first_visit_rmna = first_visit[first_visit['PREFEVPP'].notna()]

In [None]:
# Visualize the boxplots of PREFEVPP of different ethnicity groups at first visit
histo_PREFEVPP_ethnic  = px.box(first_visit_rmna, x='ETHNIC', 
                                y="PREFEVPP", color="GENDER", 
                                title="Boxplot of PREFEVPP at First Visit")
histo_PREFEVPP_ethnic.show()

In [None]:
camp_df['visit_month'] = camp_df['visitc'].astype(int)

visit_month_list = [0,12,24,36,48,60,72]
# Extract the ids that have a PREFEVPP value at every one of these visits
all_id = camp_df['id'].unique().tolist()
fig_id_list = []
for i in all_id:
    # Visits at which participant i has a non-missing PREFEVPP
    i_df = camp_df[(camp_df['id'] == i) & (camp_df['PREFEVPP'].notna())]
    i_df_visit = i_df['visit_month'].tolist()
    if set(visit_month_list).issubset(set(i_df_visit)):
        fig_id_list.append(i)
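The per-participant loop above can be slow on larger tables. One possible vectorized alternative, shown here on a toy table with hypothetical values, counts how many of the required visits each participant has with a non-missing measurement.

```python
import pandas as pd

# Toy long-format data (hypothetical): one row per participant visit
toy_df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3],
    "visit_month": [0, 12, 24, 0, 12, 24, 0, 24],
    "PREFEVPP": [90.0, 92.0, 95.0, 88.0, float("nan"), 91.0, 85.0, 87.0],
})
required_visits = [0, 12, 24]

# Keep rows that have a measurement and belong to a required visit
measured = toy_df.dropna(subset=["PREFEVPP"])
measured = measured[measured["visit_month"].isin(required_visits)]

# Count distinct required visits per participant
n_required = measured.groupby("id")["visit_month"].nunique()

# A participant qualifies only when every required visit is present
complete_ids = n_required[n_required == len(required_visits)].index.tolist()
print(complete_ids)
```

Here participant 2 is dropped because the 12-month value is missing, and participant 3 because the 12-month visit is absent.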

In [None]:
camp_df_subset = camp_df.loc[camp_df['id'].isin(fig_id_list) & (camp_df['visit_month'].isin(visit_month_list))]
line_PREFEVPP_visit =  px.box(camp_df_subset, x='visit_month', y='PREFEVPP', 
                              color="GENDER",
                              facet_row="TX",
                              width=800, height=800)
line_PREFEVPP_visit.show()

### Analysis of long-term effect of budesonide and nedocromil on pulmonary function

The purpose of this notebook is to assess the long-term effect of the two drugs. Given the number of observations available at the different follow-up visits, we selected 72 months (6 years after the start of treatment) as the time point for evaluating the long-term effect of the intervention.

In [None]:
# Define the list of ids that have PREFEVPP records at both the 72-month visit and the first visit
def intersection(lst1, lst2):
    return list(set(lst1) & set(lst2))

visit_72_id = camp_df.loc[(camp_df["visitc"]=="072") & (camp_df["PREFEVPP"].isna()==False),]['id'].tolist()
visit_0_id = camp_df.loc[(camp_df["visitc"]=="000") & (camp_df["PREFEVPP"].isna()==False),]['id'].tolist()

id_intersect =intersection(visit_72_id, visit_0_id)

visit_72_df = camp_df.loc[(camp_df['id'].isin(id_intersect)) & (camp_df['visit_month'].isin([72])),['id','PREFEVPP','POSFEVPP']]
visit_0_df = camp_df.loc[(camp_df['id'].isin(id_intersect)) & (camp_df['visit_month'].isin([0])),]

visit_72_df = visit_72_df.rename(columns={"PREFEVPP":"PREFEVPP_72", "POSFEVPP":"POSFEVPP_72"})

# merge two dfs 
fev1_72_df = pd.merge(visit_0_df, visit_72_df, how='inner', on='id')

# create a new variable PREFEVPP_diff that calculates the difference between PREFEVPP value at 72 and 0 month
fev1_72_df['PREFEVPP_diff'] = fev1_72_df['PREFEVPP_72']-fev1_72_df['PREFEVPP']
fev1_72_df.head()

In [None]:
# The number of participants of different treatment groups
fev1_72_df['TX'].value_counts()

In [None]:
# Histograms of PREFEVPP_diff for each of the 4 treatment groups
# The histograms suggest that PREFEVPP_diff is approximately normally distributed
fig_his = px.histogram(fev1_72_df, x="PREFEVPP_diff", facet_row="TX",
                       title="Histograms of PREFEVPP Diffs Between 72 and 0 Month", 
                       height=600, width=800)
fig_his.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig_his.update_layout(
    xaxis_title="Diff of PREFEVPP between 72 and 0 month"
)
fig_his.show()
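A histogram gives only a visual impression of normality. One optional complement, which is not part of the original analysis, is the Shapiro-Wilk test from `scipy.stats`. The sketch below runs it on synthetic data (hypothetical values); with the real data it would be applied to `PREFEVPP_diff` within each treatment group.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic differences (hypothetical), drawn from a normal distribution
synthetic_diff = rng.normal(loc=0.0, scale=10.0, size=150)

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
w_stat, p_value = stats.shapiro(synthetic_diff)
print(f"W = {w_stat:.3f}, p = {p_value:.3f}")
```

A large p-value means there is no evidence against normality. It does not prove that the data are normal.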

In [None]:
# Extract PREFEVPP_diff from each treatment group
bud_diff = fev1_72_df[fev1_72_df['TX']=='bud']['PREFEVPP_diff']
pbud_diff = fev1_72_df[fev1_72_df['TX']=='pbud']['PREFEVPP_diff']
ned_diff = fev1_72_df[fev1_72_df['TX']=='ned']['PREFEVPP_diff']
pned_diff = fev1_72_df[fev1_72_df['TX']=='pned']['PREFEVPP_diff']

In [None]:
# Welch's t-test (unequal variances) between the bud treatment group and the bud placebo group
stats.ttest_ind(bud_diff, pbud_diff, equal_var=False, nan_policy='raise')

In [None]:
# Welch's t-test (unequal variances) between the ned treatment group and the ned placebo group
stats.ttest_ind(ned_diff, pned_diff, equal_var=False, nan_policy='raise')

- The p-values of both t-tests were greater than 0.05, so we failed to reject the null hypothesis H0 that the two independent samples have identical average (expected) values.
- Therefore, no significant difference was found between either treatment and its placebo group in this dataset.
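The decision rule used in the bullets above can be made explicit in code. The sketch below applies the same `stats.ttest_ind` call with `equal_var=False` to synthetic data (hypothetical values, not the CAMP results) and compares the p-value with a significance level of 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic changes in percent predicted FEV1 (hypothetical, not CAMP data)
treated_diff = rng.normal(loc=1.0, scale=10.0, size=200)
placebo_diff = rng.normal(loc=0.0, scale=12.0, size=130)

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_value = stats.ttest_ind(treated_diff, placebo_diff, equal_var=False)

alpha = 0.05  # conventional significance level
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.3f}: {decision}")
```

Failing to reject H0 means the data do not show a difference at this significance level. It does not show that the two groups are equal.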