Yuchen Gong

Statistical Analysis Plan (SAP)
Dataset: CDC Diabetes Health Indicators (BRFSS)

Objective One: Dataset Setup

To begin the analysis, we downloaded the CDC Diabetes Health Indicators dataset 
from the Behavioral Risk Factor Surveillance System (BRFSS) (https://www.cdc.gov/brfss/annual_data/annual_data.htm).
The dataset includes individual-level responses about health behaviors, demographics, and chronic conditions such as diabetes, hypertension, and high cholesterol.

In [None]:
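import pandas as pd

# Load the dataset; assumes diabetes_data.csv sits in the working directory
df = pd.read_csv("diabetes_data.csv")
df.head()  # preview the first five rows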

We verified column names, checked for missing values, and identified the variables needed for analysis. The main variables include:

Diabetes_binary – diabetes diagnosis (0 = No, 1 = Yes)

BMI – body mass index

HighBP – high blood pressure (0/1)

HighChol – high cholesterol (0/1)

Age – age category

Smoker – smoking status (0/1)

PhysActivity – physical activity (0/1)
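The column and missing-value checks described above could be scripted roughly as follows. This is a minimal sketch: the helper name `check_columns` is hypothetical, and the toy frame holds made-up values (not BRFSS records) so the example runs without the CSV file.

```python
import pandas as pd

# Variables of interest, as listed above
REQUIRED = ["Diabetes_binary", "BMI", "HighBP", "HighChol",
            "Age", "Smoker", "PhysActivity"]

def check_columns(df, required=REQUIRED):
    """Return (required columns that are absent, NaN counts for those present)."""
    missing_cols = [c for c in required if c not in df.columns]
    present = [c for c in required if c in df.columns]
    na_counts = df[present].isna().sum()
    return missing_cols, na_counts

# Toy frame for illustration only (made-up values, one column deliberately absent)
toy = pd.DataFrame({
    "Diabetes_binary": [0, 1, 0],
    "BMI": [24.0, None, 31.0],
    "HighBP": [0, 1, 1],
    "HighChol": [0, 1, 0],
    "Age": [5, 9, 7],
    "Smoker": [0, 1, 0],
})
missing_cols, na_counts = check_columns(toy)
print("Missing columns:", missing_cols)
print(na_counts)
```

On the real data the same call would be `check_columns(df)`; an empty list and all-zero counts would indicate that the variables are present and complete.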

Objective Two: Summary Statistics

Descriptive statistics were calculated for all variables of interest to understand their distributions and data quality.

In [None]:
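# Only a cell's last expression is displayed, so print the first result explicitly
print(df.describe())
# Proportion of respondents in each diabetes class
df['Diabetes_binary'].value_counts(normalize=True)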

We also created histograms and boxplots to visualize distributions.
Missing or implausible BMI values (e.g., <10 or >70) were excluded.
Binary categorical variables (e.g., Smoker, PhysActivity) were encoded as 0/1.
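The BMI exclusion rule could be applied along these lines. This is a sketch under the stated thresholds (values below 10 or above 70 are dropped); the function name `clean_bmi` is hypothetical and the toy values are invented for illustration.

```python
import pandas as pd

def clean_bmi(df, low=10, high=70):
    """Drop rows whose BMI is missing or outside the range [low, high]."""
    # between() is inclusive on both ends and evaluates to False for NaN,
    # so missing BMI values are removed together with implausible ones.
    keep = df["BMI"].between(low, high)
    return df.loc[keep].copy()

# Toy frame for illustration only (made-up values)
toy = pd.DataFrame({
    "BMI": [8.0, 22.5, None, 45.0, 95.0],
    "Diabetes_binary": [0, 0, 1, 1, 0],
})
cleaned = clean_bmi(toy)
print(f"Dropped {len(toy) - len(cleaned)} of {len(toy)} rows")
```

Reporting the number of dropped rows alongside the cleaned data keeps the exclusion step auditable.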

This stage provided a basic understanding of the dataset and confirmed that it was suitable for regression analysis.

Objective Three: Statistical Analysis Plan
Aim 1: Determine which demographic and lifestyle factors are most strongly associated with diabetes.

A binary logistic regression model will be used, as the dependent variable (Diabetes_binary) is binary.

Rationale: Logistic regression allows us to estimate the adjusted association of each predictor with diabetes while controlling for the others. Odds ratios and 95% confidence intervals will be reported.
Python functions: statsmodels.api.Logit() to specify the model, its fit() method to estimate it, and summary() on the fitted results.

Aim 2: Test whether BMI and physical activity can predict diabetes risk.

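A predictive logistic regression model with BMI and PhysActivity as predictors will be trained using an 80/20 train-test split.
We will evaluate it on the held-out 20% using accuracy, the confusion matrix, and the ROC curve with its area under the curve (AUC).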

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc, classification_report

Rationale: These metrics show how well the model distinguishes between diabetic and non-diabetic individuals.
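The train-test workflow for Aim 2 might look roughly like the sketch below. The helper `fit_and_evaluate`, the stratified split, and `random_state=42` are assumptions added for reproducibility, not choices stated in the plan. Note also that scikit-learn's `LogisticRegression` applies L2 regularization by default, unlike the statsmodels model in Aim 1. The data are simulated with invented coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

def fit_and_evaluate(df, outcome, predictors, seed=42):
    """Train on 80% of the rows and report test-set metrics on the other 20%."""
    X, y = df[predictors], df[outcome]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    y_pred = model.predict(X_test)                    # class labels at the 0.5 cutoff
    y_pred_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_pred_proba),
        "n_test": len(y_test),
    }

# Simulated data for illustration only (coefficients are invented)
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "BMI": rng.normal(28, 5, n),
    "PhysActivity": rng.integers(0, 2, n),
})
log_odds = (-4 + 0.1 * toy["BMI"] - 0.5 * toy["PhysActivity"]).to_numpy()
toy["Diabetes_binary"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

metrics = fit_and_evaluate(toy, "Diabetes_binary", ["BMI", "PhysActivity"])
print(metrics)
```

Because diabetes is the minority class, accuracy alone can look high for a model that rarely predicts a positive case, so the confusion matrix and AUC are worth reading together with it.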

Aim 3: Compare mean BMI and physical activity levels between diabetic and non-diabetic groups.

A two-sample t-test will assess the difference in mean BMI, and a chi-square test of independence will compare the proportion of physically active respondents, since PhysActivity is binary.

In [None]:
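from scipy import stats

# Two-sample t-test on BMI (SciPy's default assumes equal variances)
t_stat, t_p = stats.ttest_ind(df[df['Diabetes_binary']==1]['BMI'],
                              df[df['Diabetes_binary']==0]['BMI'])
# Chi-square test of independence: diabetes status vs. physical activity
chi2, chi_p, dof, expected = stats.chi2_contingency(pd.crosstab(df['Diabetes_binary'], df['PhysActivity']))
print(t_stat, t_p, chi2, chi_p)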

Rationale: These tests determine whether diabetic and non-diabetic groups differ significantly in continuous and categorical factors.

Objective Four: Planned Figures and Visualizations

We will produce the following figures to illustrate our findings and support each research question:

1. Boxplot of BMI by Diabetes Status

Purpose: Compare BMI distributions between diabetic and non-diabetic groups.

In [None]:
import seaborn as sns
sns.boxplot(x='Diabetes_binary', y='BMI', data=df)

2. Correlation Heatmap of Predictors

Purpose: Visualize relationships and detect multicollinearity.

In [None]:
sns.heatmap(df.drop(columns='Diabetes_binary').corr(), annot=True, fmt='.2f', cmap='coolwarm')  # predictors only

3. ROC Curve for Diabetes Prediction Model

Purpose: Evaluate classification performance.

In [None]:
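import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)  # test labels and predicted probabilities from the Aim 2 model
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr,tpr):.2f}")
plt.legend()  # without this call the AUC label is not drawn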