# Data Exploration of Framingham Heart Study Teaching Dataset 

*Please note: This notebook uses open access data*  
*Please note: BioDataCatalyst RAS Login in the BRH Profile Page needs to be authorized*


This teaching dataset was developed using the longitudinal [Framingham Heart Study](https://biolincc.nhlbi.nih.gov/teaching/) as the data source. It includes three clinical examinations and 20 years of follow-up data for a subset of the original Framingham cohort participants. The dataset was created for teaching and training purposes, and certain measures were applied to anonymize it. Detailed documentation on the variables can be found [here](https://biolincc.nhlbi.nih.gov/media/teachingstudies/FHS_Teaching_Longitudinal_Data_Documentation_2021a.pdf?link_time=2022-04-18_09:29:24.200176).

In this tutorial, we demonstrate how to pull the object file of the Framingham teaching dataset from the [BioData Catalyst data commons](https://gen3.biodatacatalyst.nhlbi.nih.gov/) into a BRH workspace, and how to explore and visualize the data using Python packages.




### Import Python libraries

In [None]:
import pandas as pd
import numpy as np
import pyreadstat
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px

pd.set_option('mode.chained_assignment', None)

### Pull Framingham data file 

In [None]:
!gen3 drs-pull object dg.4503/c97507dd-bb79-45ce-9186-3d14ad823f81

fram_df, fram_meta = pyreadstat.read_sav("frmgham2.sav", apply_value_formats=True)

### Data exploration

- Display column names and their descriptive labels from the metadata

In [None]:
col_names = fram_meta.column_names_to_labels

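# Header: the dictionary keys are column names, the values are their labels
print("{:<10} {:<10}".format('NAME', 'LABEL'))

# Print each column name with its label (str() guards against a missing label)
for key, value in col_names.items():
    print("{:<10} {:<10}".format(key, str(value)))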
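
The same name-to-label mapping can also be loaded into a small DataFrame, which makes it easier to sort and search than a printed listing. This is a minimal sketch, assuming the mapping is a plain dict of column name to label; `col_names_demo` and its three labels are hypothetical stand-ins, not the actual metadata.

```python
import pandas as pd

# Hypothetical stand-in for fram_meta.column_names_to_labels
col_names_demo = {
    "SEX": "Participant sex",
    "AGE": "Age at exam (years)",
    "BMI": "Body mass index",
}

# One row per variable: short column name plus its descriptive label
label_df = pd.DataFrame(
    {"NAME": list(col_names_demo.keys()), "LABEL": list(col_names_demo.values())}
)

# Case-insensitive search for variables whose label mentions a keyword
age_rows = label_df[label_df["LABEL"].str.contains("age", case=False, na=False)]
print(age_rows)
```

In the notebook, `col_names_demo` would be replaced by `col_names`.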

- Explore the demographic characteristics of participants

In [None]:
# Change the participant ID column type from numeric to string.
# Casting through int turns e.g. 1234.0 into "1234" without string slicing
# (this assumes RANDID has no missing values).
fram_df["RANDID"] = fram_df["RANDID"].astype(int).astype(str)

In [None]:
# Map the sex values to Male and Female (1 = Male; any other value becomes Female)
sex_group_list = ['Male' if i==1.0 else 'Female' for i in fram_df['SEX']]
fram_df['SEX_GROUP'] = sex_group_list

In [None]:
# Create a subset of participants from first visit 
visit_1 =  fram_df[fram_df['PERIOD']==1]
visit_1.head()

In [None]:
# Functions to categorize age, BMI, and education value into groups
# Define age groups within participants
def age_group(agelist):
    min_age = min(agelist)
    # Cast to int so the label reads e.g. "32-40 yr" rather than "32.0-40 yr"
    grouplabel1 = str(int(min_age)) + "-40 yr"
    grouplabel2 = "41-50 yr"
    grouplabel3 = "51-60 yr"
    grouplabel4 = "61-70 yr"
    grouplabel5 = "71-80 yr"
    grouplabel6 = "81-90 yr"
    grouplist = []
    for i in agelist:
        if i <= 40:
            grouplist.append(grouplabel1)
        elif i <=50:
            grouplist.append(grouplabel2)
        elif i <= 60:
            grouplist.append(grouplabel3)
        elif i <= 70:
            grouplist.append(grouplabel4)
        elif i <= 80:
            grouplist.append(grouplabel5)
        elif i <= 90:
            grouplist.append(grouplabel6)
        else:
            grouplist.append("NA")
            
    return grouplist

# Define BMI groups
def bmi_group(bmilist):
    grouplabel1 = "underweight"
    grouplabel2 = "healthy"
    grouplabel3 = "overweight"
    grouplabel4 = "obese"
    grouplist = []
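    for i in bmilist:
        if i < 18.5:
            grouplist.append(grouplabel1)
        elif i < 25:
            grouplist.append(grouplabel2)
        elif i < 30:
            grouplist.append(grouplabel3)
        elif i >= 30:
            grouplist.append(grouplabel4)
        else:
            # A missing BMI (NaN) fails every comparison; do not count it as obese
            grouplist.append("NA")
    return grouplist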

# Define education groups
def edu_group(edu_list):
    grouplabel1 = "0-11 years"
    grouplabel2 = "High School Diploma, GED"
    grouplabel3 = "Some College, Vocational School"
    grouplabel4 = "College (BS, BA) degree or more"
    grouplist = []
    for i in edu_list:
        if i == 1.0:
            grouplist.append(grouplabel1)
        elif i == 2.0:
            grouplist.append(grouplabel2)
        elif i == 3.0:
            grouplist.append(grouplabel3)
        elif i == 4.0:
            grouplist.append(grouplabel4)
        else:
            grouplist.append("NA")
    return grouplist

In [None]:
visit_1["AGE_GROUP"] = age_group(list(visit_1["AGE"]))
visit_1["BMI_GROUP"] = bmi_group(list(visit_1["BMI"]))
visit_1["EDUC_GROUP"] = edu_group(list(visit_1["EDUC"]))
visit_1[["AGE", "AGE_GROUP", "BMI", "BMI_GROUP", 'EDUC', 'EDUC_GROUP']].head(5)
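
The three grouping functions above bin values with explicit loops. As an alternative, `pandas.cut` can express the same idea more compactly. The sketch below uses toy values rather than the real `visit_1` columns, and the `"<=40 yr"` label is a simplification of the dynamic lower bound used in `age_group`.

```python
import numpy as np
import pandas as pd

# Toy values standing in for visit_1["AGE"] and visit_1["BMI"]
demo = pd.DataFrame({
    "AGE": [35, 40, 41, 66, 85],
    "BMI": [17.0, 18.5, 24.9, 30.0, np.nan],
})

# Age: right-closed bins (the default), so 40 falls in "<=40 yr" and 41 in "41-50 yr"
age_bins = [0, 40, 50, 60, 70, 80, 90]
age_labels = ["<=40 yr", "41-50 yr", "51-60 yr", "61-70 yr", "71-80 yr", "81-90 yr"]
demo["AGE_GROUP"] = pd.cut(demo["AGE"], bins=age_bins, labels=age_labels)

# BMI: left-closed bins (right=False), so 18.5 is "healthy" and 30.0 is "obese"
bmi_bins = [0, 18.5, 25, 30, np.inf]
bmi_labels = ["underweight", "healthy", "overweight", "obese"]
demo["BMI_GROUP"] = pd.cut(demo["BMI"], bins=bmi_bins, labels=bmi_labels, right=False)

print(demo)
```

With `pd.cut`, a missing input stays missing in the output instead of being assigned to a group.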

In [None]:
# Count the number of participants based on group category
pt_sex_age = pd.crosstab(index=visit_1['AGE_GROUP'], columns=visit_1['SEX_GROUP'])
pt_sex_bmi = pd.crosstab(index=visit_1['BMI_GROUP'], columns=visit_1['SEX_GROUP'])
pt_sex_edu = pd.crosstab(index=visit_1['EDUC_GROUP'], columns=visit_1['SEX_GROUP'])
pd.concat([pt_sex_age, pt_sex_bmi, pt_sex_edu], keys=['age', 'bmi', 'education'])

In [None]:
# Plot the counts of these demographic characteristics among participants
# Reformat the dataframe before plotting
demo_df = pd.concat([pt_sex_age, pt_sex_bmi, pt_sex_edu], keys=['Age','BMI','Education'])
demo_df.reset_index(inplace=True)
demo_df = pd.melt(demo_df, id_vars=['level_0', 'level_1'], value_vars=['Female', 'Male'], ignore_index=False)
demo_df = demo_df.rename({'level_0':"Group type", 'level_1':'Group value', 'value':'Counts', 'SEX_GROUP':'Sex'}, axis="columns")

# Show bar plot of participant counts based on gender, age, BMI, and education group
demo_fig = px.bar(demo_df, x="Group value", y="Counts", color="Sex", facet_row="Group type",
                  width=900, height=700, title="Age Group, BMI, and Education Characteristics of Participants in the Framingham Heart Study")
demo_fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
demo_fig.update_xaxes(matches='x')
demo_fig.show()

**Demographic characteristics of FHS participants**:
- Most participants in the Framingham Heart Study were recruited from the **41-50** and **51-60** age groups.
- At the first visit, few participants fell into the obese or underweight BMI groups. The majority were in either the **healthy** or **overweight** BMI group. There were more females than males in the healthy BMI group (1,233 vs. 703).
- Nearly three quarters of the participants had either 0-11 years of education or a high school diploma/GED. Fewer participants had a college degree.
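
Statements about shares, such as "nearly three quarters", are easier to check against proportions than against raw counts. The sketch below shows one way to get them with `value_counts(normalize=True)` and `pd.crosstab(..., normalize="columns")`. The four toy rows are illustrative only; in the notebook the inputs would be `visit_1['EDUC_GROUP']` and `visit_1['SEX_GROUP']`.

```python
import pandas as pd

# Toy data standing in for visit_1[["EDUC_GROUP", "SEX_GROUP"]]
toy = pd.DataFrame({
    "EDUC_GROUP": ["0-11 years", "0-11 years", "High School Diploma, GED",
                   "College (BS, BA) degree or more"],
    "SEX_GROUP": ["Female", "Male", "Female", "Male"],
})

# Share of all participants in each education group
edu_share = toy["EDUC_GROUP"].value_counts(normalize=True)

# Within-sex proportions: each column sums to 1
edu_by_sex = pd.crosstab(index=toy["EDUC_GROUP"], columns=toy["SEX_GROUP"],
                         normalize="columns")

print(edu_share.round(2))
print(edu_by_sex.round(2))
```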

### Risk factor exploration

- Risk factor at different time points

In [None]:
# add bmi group, age group, and education group to the fram_df
fram_df["AGE_GROUP"] = age_group(list(fram_df["AGE"]))
fram_df["BMI_GROUP"] = bmi_group(list(fram_df["BMI"]))
fram_df["EDUC_GROUP"] = edu_group(list(fram_df["EDUC"]))
fram_df.head(5)

In [None]:
# extract a subset of data for risk factor exploration

rf_df =  fram_df[['RANDID','SEX_GROUP', 'TIME', 'PERIOD', 'BMI', 'AGE', 
                  'SYSBP', 'DIABP', 'CIGPDAY', 'TOTCHOL', 'HDLC', 
                  'LDLC', 'GLUCOSE']]

In [None]:
# Unpivot a dataframe from wide to long format
rf_df_melt =  pd.melt(rf_df, id_vars=['RANDID','SEX_GROUP','TIME', 'PERIOD'], 
                      value_vars=['BMI', 'AGE', 'SYSBP', 'DIABP', 'CIGPDAY', 
                                  'TOTCHOL', 'HDLC', 'LDLC', 'GLUCOSE'])
rf_df_melt.head()

The next block shows the distributions of two risk factor variables, AGE and BMI, at the three visits.

In [None]:
# Violin plots (with embedded box plots) of AGE and BMI at the three visits
rf_fig1 =  px.violin(data_frame=rf_df_melt.loc[rf_df_melt['variable'].isin(['AGE', 'BMI']),],
                 x='PERIOD', y='value', width=800, height=500, color='variable', box=True)
rf_fig1.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
rf_fig1.update_xaxes(title="Visit")
rf_fig1.show()

- The median ages of participants at the three visits were 49, 55, and 60, respectively.
- At all three visits, the median BMI was around 25.5, which falls into the `overweight` category.
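
The medians quoted above can be read off the plot, but they can also be computed directly with a `groupby`. This is a minimal sketch on toy rows, assuming `PERIOD` is numeric as in `rf_df`; the numbers are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for rf_df (one row per participant per visit)
toy_rf = pd.DataFrame({
    "PERIOD": [1, 1, 1, 2, 2, 3],
    "AGE": [40, 50, 60, 46, 56, 62],
    "BMI": [22.0, 26.0, np.nan, 24.0, 28.0, 27.0],
})

# Median of each variable per visit; missing values are skipped by default
medians = toy_rf.groupby("PERIOD")[["AGE", "BMI"]].median()
print(medians)
```

In the notebook, the equivalent call would be `rf_df.groupby('PERIOD')[['AGE', 'BMI']].median()`.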

The next block shows the distribution of several variables, including Systolic Blood Pressure (SYSBP), Diastolic Blood Pressure (DIABP), Serum Total Cholesterol (TOTCHOL), and Casual Serum Glucose (GLUCOSE), at three visits.

In [None]:
# Box plots of selected risk factors at the three visits
rf_fig2 =  px.box(data_frame=rf_df_melt.loc[rf_df_melt['variable'].isin(['SYSBP', 'DIABP',
                                                                         'TOTCHOL','GLUCOSE']),],
                 x='PERIOD', y='value', width=950, height=500, color='variable')
rf_fig2.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
rf_fig2.update_xaxes(title="Visit")
rf_fig2.show()

- Risk factors and events heatmap visualization

The next block converts some of the risk factor values into binary indicators using a threshold. For instance, an age of 60 or over is treated as a risk factor.

In [None]:
# Extract only the final visit of each participant and
# the known risk factors associated with cardiovascular disease incidence

visit_final =  fram_df.drop_duplicates("RANDID", keep='last')
visit_rf = visit_final[['RANDID','AGE', 'SEX_GROUP','CURSMOKE', 'TOTCHOL', 'BMI', 
                            'GLUCOSE', 'PREVCHD', 'PREVSTRK', 'PREVHYP', 'DIABETES', 'HEARTRTE']]

visit_rf['TOTCHOL_RF'] = [1 if i>200 else 0 for i in visit_rf['TOTCHOL']]
visit_rf['BMI_RF'] = [1 if i>25 else 0 for i in visit_rf['BMI']]
visit_rf['GLUCOSE_RF'] = [1 if i>200 else 0 for i in visit_rf['GLUCOSE']]
visit_rf['HEARTRTE_RF'] = [1 if i>100 else 0 for i in visit_rf['HEARTRTE']]
visit_rf['AGE_RF'] = [1 if i>=60 else 0 for i in visit_rf['AGE']]

visit_rf = visit_rf.drop(['AGE','TOTCHOL','BMI','GLUCOSE', 'HEARTRTE', 'SEX_GROUP'], axis=1)
visit_rf.head()
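
One caveat with the list comprehensions above: a missing measurement (NaN) fails the `>` comparison and is therefore coded as 0, which is indistinguishable from "below the threshold". The sketch below shows one possible way to keep missing values missing. The helper `threshold_flag` is hypothetical (it is not part of the notebook or of pandas), and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

def threshold_flag(values: pd.Series, cutoff: float, inclusive: bool = False) -> pd.Series:
    """Return 1.0/0.0 flags for values above a cutoff, keeping NaN where the input is missing."""
    above = values.ge(cutoff) if inclusive else values.gt(cutoff)
    # where() keeps the flag only for non-missing inputs and inserts NaN elsewhere
    return above.astype(float).where(values.notna())

toy = pd.DataFrame({"TOTCHOL": [180.0, 240.0, np.nan], "AGE": [59, 60, 71]})
toy["TOTCHOL_RF"] = threshold_flag(toy["TOTCHOL"], 200)
toy["AGE_RF"] = threshold_flag(toy["AGE"], 60, inclusive=True)
print(toy)
```

A missing flag then propagates into any sum built with `+`, so it is worth deciding explicitly whether such participants should be excluded or counted as not at risk.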

The next block combines the risk factor dataframe with the disease event data and generates a correlation heatmap of these variables.

In [None]:
# combine the visit_rf with event data

visit_event = visit_final[['RANDID', 'ANGINA','HOSPMI','MI_FCHD','ANYCHD', 'STROKE', 'CVD', 'HYPERTEN','DEATH']]
visit_rf_event = visit_rf.merge(visit_event, left_on="RANDID", right_on="RANDID")

visit_rf_event.head(5)

In [None]:
# generate correlation heatmap
visit_rf_event_noid = visit_rf_event.drop(['RANDID'],axis=1)

sns.heatmap(visit_rf_event_noid.corr())

In [None]:
# Create a column that sums all the binary risk factors
rf_cols = ['CURSMOKE', 'PREVCHD', 'PREVSTRK', 'PREVHYP', 'DIABETES',
           'TOTCHOL_RF', 'BMI_RF', 'GLUCOSE_RF', 'HEARTRTE_RF', 'AGE_RF']
# skipna=False matches chained `+`: a missing value in any column yields NaN
visit_rf_event['RF_SUM'] = visit_rf_event[rf_cols].sum(axis=1, skipna=False)

In [None]:
visit_rf_event.tail(5)

The next two blocks build a table of counts that cross-tabulates the risk factor sum against the event sum.

In [None]:
visit_rf_event['EVENT_SUM'] = visit_rf_event['ANYCHD'] + visit_rf_event['STROKE'] + visit_rf_event['CVD'] + visit_rf_event['HYPERTEN']


In [None]:
rf_count = pd.crosstab(index=visit_rf_event['RF_SUM'], columns=visit_rf_event['EVENT_SUM'])

rf_count

The next block creates a stacked bar chart showing the composition of disease events in each risk factor group.

In [None]:
rf_count.reset_index(inplace=True)
rf_count_melt = pd.melt(rf_count, id_vars=['RF_SUM'], value_vars=[0,1,2,3,4], ignore_index=False)
rf_event_fig = px.bar(rf_count_melt, x='RF_SUM', y='value', 
                      color='EVENT_SUM', title="Risk Factors and Disease Events",
                     color_discrete_sequence=px.colors.sequential.Sunsetdark)
rf_event_fig.show()

- Most of the participants had at least two risk factors at their final visit.
- As the number of risk factors increases, the share of participants with multiple disease events rises markedly. For instance, the majority of participants with three risk factors (the tallest bar) had at least one disease event.
- Among participants with at least three risk factors, multiple disease events (2, 3, or 4) become noticeably more common, while few or none of these participants had zero disease events.
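
These observations are read from the bar heights. They can be quantified by computing, for each risk factor count, the group size and the share of participants with at least one event. The sketch below uses toy rows, so the numbers are illustrative only; in the notebook the input would be `visit_rf_event[['RF_SUM', 'EVENT_SUM']]`.

```python
import pandas as pd

# Toy rows standing in for visit_rf_event[["RF_SUM", "EVENT_SUM"]]
toy = pd.DataFrame({
    "RF_SUM":    [0, 0, 1, 1, 3, 3, 3, 3],
    "EVENT_SUM": [0, 0, 0, 1, 1, 2, 0, 4],
})

# Per risk factor count: number of participants and share with >= 1 event
summary = (
    toy.assign(ANY_EVENT=toy["EVENT_SUM"] >= 1)
       .groupby("RF_SUM")
       .agg(n=("ANY_EVENT", "size"), share_any_event=("ANY_EVENT", "mean"))
)
print(summary)
```

This is a descriptive summary only; it does not test whether the differences between groups are statistically significant.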