In [1]:
print("Temi Falae")

Temi Falae


In [None]:
import pandas as pd
import pyreadstat

df, meta = pyreadstat.read_sav("Obesity.sav")
categorical_vars = [
    "Gender",
    "Race",
    "Marital_status",
    "Education",
    "Occupation",
    "Household_income",
    "WC_coded",
    "BMI_WHO",
    "BMI_CPG",
    "BMI_2cat",
    "BMI_2cat_WHO",
    "BMI_2cat_overweight",
    "BMI_2cat_overweightCPG"
]

continuous_vars = [
    "Age",
    "Weight",
    "Height",
    "WC",
    "Systolic_BP",
    "Diastolic_BP",
    "Confectionary_scale",
    "Nutrition_Confectionary",
    "Fruits_scale",
    "Nutrition_Fruits",
    "Vege_scale",
    "Nutrition_Veggies",
    "BMI"
]


for col in categorical_vars:
    df[col] = df[col].astype("category")

print("Converted categorical variables:\n", categorical_vars)



# Continuous variables summary
print("\n===== Summary Statistics: Continuous Variables =====\n")
display(df[continuous_vars].describe())

# Categorical variables frequency tables
print("\n===== Frequency Tables: Categorical Variables =====\n")
for col in categorical_vars:
    print(f"\n--- {col} ---")
    print(df[col].value_counts(dropna=False))
    print()


print("\n===== Missing Values Per Variable =====\n")
print(df.isnull().sum())


In [2]:
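def classify_hypertension(row):
    """Return 1 if SBP >= 130 or DBP >= 80, 0 if both are below, NaN if undeterminable."""
    sbp, dbp = row["Systolic_BP"], row["Diastolic_BP"]
    if sbp >= 130 or dbp >= 80:
        return 1   # hypertensive
    if pd.isna(sbp) or pd.isna(dbp):
        # Comparisons with NaN are False, so a missing reading would
        # otherwise be silently counted as "not hypertensive".
        return float("nan")
    return 0   # not hypertensive

df["Hypertension"] = df.apply(classify_hypertension, axis=1)
df["Hypertension"] = df["Hypertension"].astype("category")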


In [3]:
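# Counts (not shown: a notebook cell displays only its last expression)
df["Hypertension"].value_counts()
# Percentages (shown below)
df["Hypertension"].value_counts(normalize=True) * 100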


Hypertension
1    60.0
0    40.0
Name: proportion, dtype: float64


# **Statistical Analysis Plan (SAP)**

## **1. Study Question and Aims**

**Primary question:**
What is the association between dietary patterns (and diet quality scores) and the prevalence of hypertension?

**Aims:**

1. Describe demographic, anthropometric, and dietary characteristics of the sample.
2. Compare the prevalence of hypertension across dietary groups and diet-quality levels.
3. Determine whether dietary patterns or diet-quality scores predict hypertension after adjusting for confounders.
4. Conduct subgroup and sensitivity analyses to test robustness.


## **2. Overview of Analytic Approach**

* Create a binary **Hypertension** variable (SBP ≥130 mmHg or DBP ≥80 mmHg).
* Perform descriptive statistics (means/SD, medians/IQR, frequencies).
* Conduct univariate tests: chi-square, t-tests, Mann–Whitney U.
* Build logistic regression models (progressive adjustment).
* Perform diagnostics: VIF, ROC/AUC, calibration.
* Conduct subgroup + interaction tests.
* Produce tables and visualizations.


## **3. Data Preparation & Initial Exploration**

### **3.1 Variable Definitions**

* **Outcome:**

  * `Hypertension` = 1 if Systolic_BP ≥130 mmHg or Diastolic_BP ≥80 mmHg; else 0.
* **Primary exposures:**

  * Dietary pattern variable, and/or dietary scales (Fruits, Vegetables, Confectionary).
* **Covariates:**

  * Age, Gender, Race, BMI, Education, Income, Occupation, WC, etc.

### **3.2 Missing Data Plan**

* Summarize missingness with `df.isnull().sum()`.
* If **<5%** missing → complete-case analysis.
* If **≥5%** missing and the data are plausibly missing at random (MAR) → consider multiple imputation.
* Imputation model should include all variables used in analysis.
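
The decision rule above can be sketched as a small helper. This is a minimal illustration on toy data, not the study dataset; the function name `missingness_plan` and the column `needs_imputation_review` are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

def missingness_plan(df: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Tabulate missingness per column and flag columns at or above `threshold` percent."""
    pct_missing = df.isnull().mean() * 100
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "pct_missing": pct_missing.round(2),
        # True means complete-case analysis may not be adequate for this variable
        "needs_imputation_review": pct_missing >= threshold,
    })
    return summary.sort_values("pct_missing", ascending=False)

# Toy data standing in for the study DataFrame (values are made up)
toy = pd.DataFrame({
    "Age": [34, 51, np.nan, 47, 62, 29, 55, 40, 38, 44],
    "BMI": [22.1, 30.4, 27.8, 25.0, 31.2, 24.3, 28.9, 26.5, 23.7, 29.1],
})
report = missingness_plan(toy)
print(report)
```

The flag only identifies candidates; whether the MAR assumption is plausible still has to be judged from the study design and the observed missingness patterns.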

### **3.3 Outliers & Distributions**

* Use histograms, boxplots, Q-Q plots for continuous data.
* Identify outliers via:

  * z-scores (|z| > 3)
  * IQR rule (values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR)
* Consider transformations or non-parametric tests for skewed data.
* Conduct sensitivity analyses with outliers removed.
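
Both outlier rules can be applied side by side, as in the sketch below. The helper `flag_outliers` and the toy BMI values are illustrative assumptions, not part of the study data.

```python
import numpy as np
import pandas as pd

def flag_outliers(values: pd.Series, z_cut: float = 3.0, iqr_mult: float = 1.5) -> pd.DataFrame:
    """Flag observations by the z-score rule and by the IQR rule."""
    x = values.astype(float)
    z = (x - x.mean()) / x.std(ddof=1)          # sample z-scores
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - iqr_mult * iqr, q3 + iqr_mult * iqr
    return pd.DataFrame({
        "value": x,
        "z_outlier": z.abs() > z_cut,
        "iqr_outlier": (x < lower) | (x > upper),
    })

# Made-up BMI values with one extreme observation
toy_bmi = pd.Series([22, 23, 24, 24, 25, 25, 26, 26, 27, 28, 60], name="BMI")
flags = flag_outliers(toy_bmi)
print(flags)
```

The two rules need not agree: an extreme value inflates the mean and SD that the z-score is built from, so in small samples the IQR rule is often the more sensitive of the two.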

### **3.4 Multicollinearity Assessment**

* Compute VIF for continuous predictors.
* VIF > 5 indicates potentially problematic multicollinearity.
* If multiple highly correlated dietary scales exist, consider:

  * Using single scales,
  * Creating composite indices, or
  * Running PCA for dietary patterns.


## **4. Univariate Analyses**

### **4.1 Descriptive Statistics**

* **Table 1:** Sample characteristics

  * Continuous: mean ± SD or median [IQR]
  * Categorical: n (%)
* Include overall hypertension prevalence.

### **4.2 Univariate Comparisons**

* **Categorical × categorical:** Chi-square test (or Fisher’s exact test when expected cell counts are below 5).
* **Continuous × binary:** t-test (if normal) or Mann–Whitney U (if non-normal).
* Report p-values and effect sizes.
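
The comparisons listed above might look like the following with `scipy.stats`. The data are simulated, Welch's unequal-variance t-test is used here as one reasonable choice of t-test, and Cohen's d is shown as one possible effect size; none of these choices is prescribed by the plan itself.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated stand-in for the study data
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    "Hypertension": rng.integers(0, 2, n),
    "Gender": rng.choice(["Male", "Female"], n),
    "Age": rng.normal(45, 12, n),
})

# Categorical x categorical: chi-square, falling back to Fisher's exact (2x2 only)
table = pd.crosstab(toy["Gender"], toy["Hypertension"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
if (expected < 5).any():
    _, p_cat = stats.fisher_exact(table.to_numpy())
else:
    p_cat = p_chi2

# Continuous x binary: Welch t-test and Mann-Whitney U
g1 = toy.loc[toy["Hypertension"] == 1, "Age"]
g0 = toy.loc[toy["Hypertension"] == 0, "Age"]
t_stat, p_t = stats.ttest_ind(g1, g0, equal_var=False)
u_stat, p_u = stats.mannwhitneyu(g1, g0, alternative="two-sided")

# Effect size: Cohen's d with a pooled standard deviation
pooled_sd = np.sqrt(
    ((len(g1) - 1) * g1.var(ddof=1) + (len(g0) - 1) * g0.var(ddof=1))
    / (len(g1) + len(g0) - 2)
)
cohens_d = (g1.mean() - g0.mean()) / pooled_sd

print(f"Gender x Hypertension: p = {p_cat:.3f}")
print(f"Age by Hypertension: t-test p = {p_t:.3f}, Mann-Whitney p = {p_u:.3f}, d = {cohens_d:.2f}")
```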


## **5. Multivariable Modeling Strategy**

### **5.1 Progressive Logistic Regression Models**

Report ORs and 95% CIs.

**Model 0 (Unadjusted):**
`Hypertension ~ Dietary_variable`

**Model 1 (Demographics):**
`Hypertension ~ Dietary_variable + Age + Gender`

**Model 2 (SES Adjustment):**
`Hypertension ~ Dietary_variable + Age + Gender + Race + Education + Household_income`

**Model 3 (Full Model):**
`Hypertension ~ Dietary_variable + Age + Gender + Race + Education + Household_income + BMI + WC`
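
A sketch of the progressive-adjustment idea with `statsmodels.formula.api`, on simulated data. `Fruits_scale` stands in for `Dietary_variable`, only a subset of the covariates is simulated, and the helper `odds_ratio_table` is a hypothetical name. Note that `smf.logit` expects a numeric 0/1 outcome, so a `category`-typed `Hypertension` column would need converting to integers first.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data; coefficients below are arbitrary and only make the example non-trivial
rng = np.random.default_rng(2)
n = 500
toy = pd.DataFrame({
    "Fruits_scale": rng.normal(0, 1, n),
    "Age": rng.normal(45, 12, n),
    "Gender": rng.choice(["Male", "Female"], n),
    "BMI": rng.normal(27, 4, n),
})
log_odds = -4 + 0.05 * toy["Age"] + 0.06 * toy["BMI"] - 0.3 * toy["Fruits_scale"]
toy["Hypertension"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Abridged versions of Models 0, 1 and 3 (SES covariates omitted from the toy data)
formulas = {
    "Model 0": "Hypertension ~ Fruits_scale",
    "Model 1": "Hypertension ~ Fruits_scale + Age + Gender",
    "Model 3 (abridged)": "Hypertension ~ Fruits_scale + Age + Gender + BMI",
}

def odds_ratio_table(fit) -> pd.DataFrame:
    """Exponentiate coefficients and 95% CI bounds to get ORs; drop the intercept."""
    ci = fit.conf_int()
    table = pd.DataFrame({
        "OR": np.exp(fit.params),
        "CI_low": np.exp(ci[0]),
        "CI_high": np.exp(ci[1]),
        "p": fit.pvalues,
    })
    return table.drop(index="Intercept")

results = {name: smf.logit(f, data=toy).fit(disp=False) for name, f in formulas.items()}
for name, fit in results.items():
    print(f"\n{name}")
    print(odds_ratio_table(fit).round(3))
```

Tracking how the OR for the dietary variable moves from Model 0 to the fullest model gives a rough view of how much of the crude association is explained by the added covariates.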

### **5.2 Exposure Parameterizations**

* Diet variables as continuous (OR per SD increase).
* Categorical versions (tertiles/quartiles).
* If using PCA: include PC1, PC2 as exposures.
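
The first two parameterizations can be derived from a raw score as sketched below. The score is simulated, and the column names `Fruits_z` and `Fruits_tertile` are hypothetical.

```python
import numpy as np
import pandas as pd

# Simulated diet score (not study data)
rng = np.random.default_rng(3)
toy = pd.DataFrame({"Fruits_scale": rng.normal(50, 10, 300)})

# Continuous: standardize so a one-unit change equals one SD,
# which makes the fitted OR an "OR per SD increase"
score = toy["Fruits_scale"]
toy["Fruits_z"] = (score - score.mean()) / score.std(ddof=1)

# Categorical: tertiles with (approximately) equal group sizes
toy["Fruits_tertile"] = pd.qcut(score, q=3, labels=["T1 (lowest)", "T2", "T3 (highest)"])

print(toy["Fruits_tertile"].value_counts().sort_index())
```

With heavily tied scores `pd.qcut` may fail to find unique bin edges; in that case cut points would need to be chosen manually.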


## **6. Model Diagnostics & Performance**

### **6.1 Assumptions & Fit**

* Check that each continuous predictor is approximately linear in the log-odds of hypertension.
* Calculate VIF for multicollinearity.
* Detect influential observations (Cook’s distance, leverage).
* Rerun models excluding influential points if needed.

### **6.2 Model Fit & Discrimination**

* Hosmer–Lemeshow goodness-of-fit test.
* ROC curve and AUC for discrimination.
* Calibration plots (predicted vs observed probabilities).
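
A rough sketch of these three checks, applied to simulated outcomes and predicted probabilities. Neither `scipy` nor `scikit-learn` ships a Hosmer–Lemeshow test as far as this sketch assumes, so `hosmer_lemeshow` below is a hand-written helper implementing the usual decile-based statistic with g − 2 degrees of freedom.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score

def hosmer_lemeshow(y_true, y_prob, groups: int = 10):
    """Hosmer-Lemeshow statistic and p-value using quantile groups of predicted risk."""
    d = pd.DataFrame({"y": np.asarray(y_true), "p": np.asarray(y_prob)})
    d["bin"] = pd.qcut(d["p"], q=groups, duplicates="drop")
    g = d.groupby("bin", observed=True)
    observed = g["y"].sum()      # observed events per group
    expected = g["p"].sum()      # expected events per group
    size = g["y"].count()
    stat = (((observed - expected) ** 2) / (expected * (1 - expected / size))).sum()
    dof = len(size) - 2
    return stat, stats.chi2.sf(stat, dof)

# Simulated predictions that are well calibrated by construction
rng = np.random.default_rng(4)
y_prob = rng.uniform(0.05, 0.95, 1000)
y_true = rng.binomial(1, y_prob)

auc = roc_auc_score(y_true, y_prob)
hl_stat, hl_p = hosmer_lemeshow(y_true, y_prob)
# Points for a calibration plot: observed event rate vs mean predicted probability per bin
obs_rate, mean_pred = calibration_curve(y_true, y_prob, n_bins=10, strategy="quantile")

print(f"AUC = {auc:.3f}; Hosmer-Lemeshow chi2 = {hl_stat:.2f}, p = {hl_p:.3f}")
```

In the real analysis `y_prob` would come from the fitted model (for a statsmodels fit, `fit.predict()`), and `obs_rate` against `mean_pred` would be plotted with a 45-degree reference line.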


## **7. Subgroup & Interaction Analyses**

### **7.1 Subgroups**

Run stratified logistic models for:

* Sex (Male/Female)
* Age groups (<50 vs ≥50)
* BMI categories (Normal, Overweight/Obese)

### **7.2 Interaction Terms**

Include:

* `Dietary_variable * Gender`
* `Dietary_variable * Age_group`

Report interaction p-values and stratified ORs.
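
One way to obtain an interaction p-value is a likelihood-ratio test comparing models with and without the product term, followed by stratum-specific fits. The sketch below uses simulated data with `Fruits_scale` standing in for `Dietary_variable`; the effect sizes are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Simulated data in which the diet effect exists only among women
rng = np.random.default_rng(5)
n = 600
toy = pd.DataFrame({
    "Fruits_scale": rng.normal(0, 1, n),
    "Gender": rng.choice(["Male", "Female"], n),
    "Age": rng.normal(45, 12, n),
})
is_female = (toy["Gender"] == "Female").astype(float)
log_odds = -2 + 0.04 * toy["Age"] - 0.5 * toy["Fruits_scale"] * is_female
toy["Hypertension"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Nested models: main effects only vs main effects plus interaction
main = smf.logit("Hypertension ~ Fruits_scale + Gender + Age", data=toy).fit(disp=False)
inter = smf.logit("Hypertension ~ Fruits_scale * Gender + Age", data=toy).fit(disp=False)

# Likelihood-ratio test for the interaction term
lr_stat = 2 * (inter.llf - main.llf)
df_diff = int(inter.df_model - main.df_model)
p_interaction = stats.chi2.sf(lr_stat, df_diff)
print(f"Interaction LR test: chi2 = {lr_stat:.2f}, df = {df_diff}, p = {p_interaction:.4f}")

# Stratified ORs for the dietary variable
stratified_or = {}
for level, subset in toy.groupby("Gender"):
    fit = smf.logit("Hypertension ~ Fruits_scale + Age", data=subset).fit(disp=False)
    stratified_or[level] = float(np.exp(fit.params["Fruits_scale"]))
print(stratified_or)
```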


## **8. Sensitivity Analyses**

* Use alternative hypertension thresholds (e.g., SBP ≥140 mmHg or DBP ≥90 mmHg).
* Exclude outliers and re-run models.
* Compare complete-case vs imputed models if applicable.
* Run models with individual diet variables vs grouped diet variables.
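
For the threshold sensitivity analysis, the outcome definition can be parameterized so that both cut-offs come from the same code path. The helper `hypertension_flag` and the blood pressure readings below are illustrative assumptions; the helper leaves the flag missing when a reading is absent and the available one does not already exceed its cut-off.

```python
import numpy as np
import pandas as pd

def hypertension_flag(df: pd.DataFrame, sbp_cut: float, dbp_cut: float) -> pd.Series:
    """1.0 if either reading meets its cut-off, 0.0 if both are below, NaN otherwise."""
    sbp, dbp = df["Systolic_BP"], df["Diastolic_BP"]
    elevated = (sbp >= sbp_cut) | (dbp >= dbp_cut)
    normal = (sbp < sbp_cut) & (dbp < dbp_cut)   # False whenever a reading is missing
    flag = pd.Series(np.nan, index=df.index)
    flag[elevated] = 1.0
    flag[normal] = 0.0
    return flag

# Made-up readings, including one missing systolic value
toy = pd.DataFrame({
    "Systolic_BP": [118, 135, 142, 125, np.nan],
    "Diastolic_BP": [76, 78, 95, 85, 70],
})
toy["HTN_130_80"] = hypertension_flag(toy, 130, 80)
toy["HTN_140_90"] = hypertension_flag(toy, 140, 90)

# Prevalence (%) under each definition, ignoring undeterminable rows
print(toy[["HTN_130_80", "HTN_140_90"]].mean() * 100)
```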


## **9. Multiple Testing Considerations**

* Main hypotheses remain primary analyses.
* Exploratory analyses may use FDR correction if needed.
* Emphasize effect sizes and confidence intervals over p-values alone.
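
If FDR correction is applied to the exploratory analyses, the Benjamini–Hochberg procedure in `statsmodels` is one option. The p-values below are hypothetical placeholders, not results.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from a set of exploratory tests
p_values = np.array([0.001, 0.012, 0.030, 0.045, 0.200, 0.650])

# Benjamini-Hochberg FDR control at 5%
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for raw, adj, keep in zip(p_values, p_adjusted, reject):
    print(f"raw p = {raw:.3f}  adjusted p = {adj:.4f}  significant after FDR: {keep}")
```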


## **10. Planned Tables and Figures**

### Tables

1. **Table 1:** Descriptive characteristics.
2. **Table 2:** Hypertension prevalence by dietary groups.
3. **Table 3:** Logistic regression results for Models 0–3.
4. Supplemental: VIF table, sensitivity analysis tables.

### Figures

1. Figure 1: Hypertension prevalence by dietary pattern (bar chart).
2. Figure 2: Scatterplot of diet score vs Systolic BP with regression line.
3. Supplemental: correlation heatmap, ROC curve, calibration plot.


## **11. Statistical Software**

Analyses will be performed using:

* `pandas`, `numpy`, `pyreadstat`
* `matplotlib`, `seaborn`
* `scipy.stats`
* `statsmodels.api` & `statsmodels.formula.api`
* `sklearn.metrics`


