# NHANES project on Number of Teeth and Intrinsic Capacity: descriptive and regression analysis
> This notebook collects all analyses of the NHANES dataset for a medical paper project.

Requirements and Information:
1. NHANES dataset from 2009/10 to 2013/14
2. 5 Intrinsic Capacity domains
    1. Locomotion:
        - Standing up from armless chair difficulty (PFQ061I)
    2. Cognitive Function:
        - Trouble concentrating on things (DPQ070)
    3. Vitality:
        - Weight change intentional (WHQ060)
        - Poor appetite or overeating (DPQ050)
    4. Psychological status:
        - Feeling down, depressed, or hopeless (DPQ020)
        - Have little interest in doing things (DPQ010)
    5. Sensory domain:
        - Have serious difficulty hearing? (DLQ010)
        - Have serious difficulty seeing? (DLQ020)
3. Outcome:
    - number of teeth (OHXDEN)
    - consists of 2 categories: patients with < 20 teeth, patients with >= 20 teeth
    - Other categorization:
        1. Edentulous : 0 teeth
        2. Severe Loss : 1-9 teeth
        3. Moderate Loss : 10-19 teeth
        4. Nearly Complete : >=20 teeth
4. Confounding Variables:
    - Gender (RIAGENDR)
    - Age at screening (RIDAGEYR)
    - Race (RIDRETH1)
    - Education (DMDEDUC2)
    - Poverty income ratio (INDFMPIR)
    - Smoking status (SMQ020)
5. Mediators:
    - Heart failure (MCQ160B)
    - Coronary heart disease (MCQ160C)
    - Angina (MCQ160D)
    - Heart attack (MCQ160E)
    - Stroke (MCQ160F)
    - Liver disease (MCQ160L)
    - Cancer (MCQ220)
    - Diabetes (DIQ010)
    - High blood pressure (BPQ020)
6. Age >= 60
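
The domain-to-variable mapping listed above can also be kept as a small lookup object, which makes it easier to select columns consistently later on. This is only a sketch: the object names `ic_domains` and `ic_vars` are hypothetical and do not appear elsewhere in the notebook, and the codes are the ones listed in the requirements (the loading cells later rename some of them to harmonize cycles).

```r
# Hypothetical lookup table: IC domain -> NHANES variable codes listed above
ic_domains <- list(
  locomotion    = "PFQ061I",
  cognition     = "DPQ070",
  vitality      = c("WHQ060", "DPQ050"),
  psychological = c("DPQ020", "DPQ010"),
  sensory       = c("DLQ010", "DLQ020")
)

# Flatten to a character vector, e.g. for column selection or completeness checks
ic_vars <- unlist(ic_domains, use.names = FALSE)

print(ic_vars)
print(lengths(ic_domains))  # number of items per domain
```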

## NHANES 2009/10 to 2013/14: Intrinsic Capacity and Teeth counts

### Import Libraries

In [5]:
library(haven)
library(nhanesA)
library(survey)
library(MASS)
library(dplyr)
library(tidyr)
library(tidyverse)
library(ggplot2)
library(readr)
library(flextable)
library(officer)
library(nnet)
library(broom)

Loading required package: grid

Loading required package: Matrix

Loading required package: survival


Attaching package: 'survey'


The following object is masked from 'package:graphics':

    dotchart



Attaching package: 'dplyr'


The following object is masked from 'package:MASS':

    select


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



Attaching package: 'tidyr'


The following objects are masked from 'package:Matrix':

    expand, pack, unpack


-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v forcats   1.0.0     v readr     2.1.5
v ggplot2   3.5.1     v stringr   1.5.1
v lubridate 1.9.4     v tibble    3.2.1
v purrr     1.0.2     
-- Conflicts -------------------

### Configurations

In [6]:
path_to_data_09_10 <- "/Users/silvanoquarto/Desktop/LAVORO/MEDICAL_PHYSICS/Med-Physics/data/NHANES/2009_10/"
path_to_data_11_12 <- "/Users/silvanoquarto/Desktop/LAVORO/MEDICAL_PHYSICS/Med-Physics/data/NHANES/2011_12/"
path_to_data_13_14 <- "/Users/silvanoquarto/Desktop/LAVORO/MEDICAL_PHYSICS/Med-Physics/data/NHANES/2013_14/"

### Load Dataset & Feature Selection

In [3]:
# Datasets for 2009/10 period

demo_09_10 <- read_xpt(file.path(path_to_data_09_10, "DEMO_F.xpt"))

demo_09_10_selected <- demo_09_10 %>%
  select(SEQN, RIAGENDR, RIDAGEYR, RIDRETH1, DMDEDUC2, INDFMPIR)

alcohol_09_10 <- read_xpt(file.path(path_to_data_09_10, "ALQ_F.xpt"))

alcohol_09_10_selected <- alcohol_09_10 %>%
  select(SEQN, ALQ101)

smoking_09_10 <- read_xpt(file.path(path_to_data_09_10, "SMQ_F.xpt.txt"))

smoking_09_10_selected <- smoking_09_10 %>%
    select(SEQN, SMQ020)

med_conditions_09_10 <- read_xpt(file.path(path_to_data_09_10, "MCQ_F.xpt"))

med_conditions_09_10_selected <- med_conditions_09_10 %>%
    select(SEQN, MCQ140, MCQ160B, MCQ160C, MCQ160D, MCQ160E, MCQ160F, MCQ160L, MCQ220)

# MCQ140 (trouble seeing) stands in for DLQ020, which is only available in 2013/14
med_conditions_09_10_selected <- med_conditions_09_10_selected %>%
  rename(DLQ020 = MCQ140)


blood_pressure_09_10 <- read_xpt(file.path(path_to_data_09_10, "BPQ_F.xpt"))

blood_pressure_09_10_selected <- blood_pressure_09_10 %>%
    select(SEQN, BPQ020)


diabetes_09_10 <- read_xpt(file.path(path_to_data_09_10, "DIQ_F.xpt"))

diabetes_09_10_selected <- diabetes_09_10 %>%
    select(SEQN, DIQ010)


teeth_09_10 <- read_xpt(file.path(path_to_data_09_10, "OHXDEN_F.xpt.txt"))

selected_cols <- colnames(teeth_09_10)[grepl("^OHX\\d{2}TC", colnames(teeth_09_10))]

teeth_09_10_selected <- teeth_09_10 %>%
    select(SEQN, all_of(selected_cols))


locomotion_09_10 <- read_xpt(file.path(path_to_data_09_10, "PFQ_F.xpt"))

locomotion_09_10_selected <- locomotion_09_10 %>%
    select(SEQN, PFQ061I)


mental_health_09_10 <- read_xpt(file.path(path_to_data_09_10, "DPQ_F.xpt"))

mental_health_09_10_selected <- mental_health_09_10 %>%
    select(SEQN, DPQ010, DPQ020, DPQ050, DPQ070)

# Dichotomize DPQ070 to the DLQ040 coding: 0-1 -> 2 (no), 2-3 -> 1 (yes); other codes unchanged
mental_health_09_10_selected <- mental_health_09_10_selected %>%
  mutate(DPQ070 = case_when(
    DPQ070 %in% c(0, 1) ~ 2,
    DPQ070 %in% c(2, 3) ~ 1,
    TRUE ~ DPQ070
  ))

mental_health_09_10_selected <- mental_health_09_10_selected %>%
  rename(DLQ040 = DPQ070)


weight_history_09_10 <- read_xpt(file.path(path_to_data_09_10, "WHQ_F.xpt"))

weight_history_09_10_selected <- weight_history_09_10 %>%
    select(SEQN, WHQ060)


audiometry_09_10 <- read_xpt(file.path(path_to_data_09_10, "AUQ_F.xpt"))

audiometry_09_10_selected <- audiometry_09_10 %>%
    select(SEQN, AUQ131)

# Dichotomize AUQ131 to the DLQ010 coding: 1-3 -> 2 (no), 4-6 -> 1 (yes); other codes unchanged
audiometry_09_10_selected <- audiometry_09_10_selected %>%
  mutate(AUQ131 = case_when(
    AUQ131 %in% c(1, 2, 3) ~ 2,
    AUQ131 %in% c(4, 5, 6) ~ 1,
    TRUE ~ AUQ131
  ))

audiometry_09_10_selected <- audiometry_09_10_selected %>%
  rename(DLQ010 = AUQ131)
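
The recodes above harmonize the 2009/10 items with the yes/no disability questions available in 2013/14. A minimal base-R sketch of the same idea follows; the helper `dichotomize()` and the toy vectors are hypothetical and are not part of the notebook.

```r
# Hypothetical helper: collapse an ordinal item to 1 = yes / 2 = no,
# leaving every other code (e.g. refused, don't know, NA) untouched.
dichotomize <- function(x, no_codes, yes_codes) {
  out <- x
  # Both assignments index on the original x, so recoded values never chain
  out[x %in% no_codes]  <- 2
  out[x %in% yes_codes] <- 1
  out
}

# Toy values, not real NHANES records
dpq070 <- c(0, 1, 2, 3, 9, NA)
auq131 <- c(1, 3, 4, 6, 99, NA)

dlq040 <- dichotomize(dpq070, no_codes = c(0, 1), yes_codes = c(2, 3))
dlq010 <- dichotomize(auq131, no_codes = 1:3, yes_codes = 4:6)

print(dlq040)
print(dlq010)
```

Indexing on the original vector mirrors how `case_when()` evaluates all conditions against the input column, so a value recoded to 2 is never picked up again by a later rule.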

In [4]:
# Datasets for 2011/12 period

demo_11_12 <- read_xpt(file.path(path_to_data_11_12, "DEMO_G.xpt.txt"))

demo_11_12_selected <- demo_11_12 %>%
  select(SEQN, RIAGENDR, RIDAGEYR, RIDRETH1, DMDEDUC2, INDFMPIR)

alcohol_11_12 <- read_xpt(file.path(path_to_data_11_12, "ALQ_G.xpt.txt"))

alcohol_11_12_selected <- alcohol_11_12 %>%
  select(SEQN, ALQ101)


smoking_11_12 <- read_xpt(file.path(path_to_data_11_12, "SMQ_G.xpt.txt"))

smoking_11_12_selected <- smoking_11_12 %>%
    select(SEQN, SMQ020)


med_conditions_11_12 <- read_xpt(file.path(path_to_data_11_12, "MCQ_G.xpt.txt"))

med_conditions_11_12_selected <- med_conditions_11_12 %>%
    select(SEQN, MCQ140, MCQ160B, MCQ160C, MCQ160D, MCQ160E, MCQ160F, MCQ160L, MCQ220)

med_conditions_11_12_selected <- med_conditions_11_12_selected %>%
  rename(DLQ020 = MCQ140)


blood_pressure_11_12 <- read_xpt(file.path(path_to_data_11_12, "BPQ_G.xpt.txt"))

blood_pressure_11_12_selected <- blood_pressure_11_12 %>%
    select(SEQN, BPQ020)


diabetes_11_12 <- read_xpt(file.path(path_to_data_11_12, "DIQ_G.xpt.txt"))

diabetes_11_12_selected <- diabetes_11_12 %>%
    select(SEQN, DIQ010)


teeth_11_12 <- read_xpt(file.path(path_to_data_11_12, "OHXDEN_G.xpt.txt"))

selected_cols <- colnames(teeth_11_12)[grepl("^OHX\\d{2}TC", colnames(teeth_11_12))]

teeth_11_12_selected <- teeth_11_12 %>%
    select(SEQN, all_of(selected_cols))


locomotion_11_12 <- read_xpt(file.path(path_to_data_11_12, "PFQ_G.xpt.txt"))

locomotion_11_12_selected <- locomotion_11_12 %>%
    select(SEQN, PFQ061I)


mental_health_11_12 <- read_xpt(file.path(path_to_data_11_12, "DPQ_G.xpt.txt"))

mental_health_11_12_selected <- mental_health_11_12 %>%
    select(SEQN, DPQ010, DPQ020, DPQ050, DPQ070)

# Same DPQ070 -> DLQ040 dichotomization as in 2009/10
mental_health_11_12_selected <- mental_health_11_12_selected %>%
  mutate(DPQ070 = case_when(
    DPQ070 %in% c(0, 1) ~ 2,
    DPQ070 %in% c(2, 3) ~ 1,
    TRUE ~ DPQ070
  ))

mental_health_11_12_selected <- mental_health_11_12_selected %>%
  rename(DLQ040 = DPQ070)


weight_history_11_12 <- read_xpt(file.path(path_to_data_11_12, "WHQ_G.xpt.txt"))

weight_history_11_12_selected <- weight_history_11_12 %>%
    select(SEQN, WHQ060)


audiometry_11_12 <- read_xpt(file.path(path_to_data_11_12, "AUQ_G.xpt.txt"))

audiometry_11_12_selected <- audiometry_11_12 %>%
  select(SEQN, AUQ054)

# Same hearing dichotomization as in 2009/10, applied to AUQ054 in this cycle
audiometry_11_12_selected <- audiometry_11_12_selected %>%
  mutate(AUQ054 = case_when(
    AUQ054 %in% c(1, 2, 3) ~ 2,
    AUQ054 %in% c(4, 5, 6) ~ 1,
    TRUE ~ AUQ054
  ))

audiometry_11_12_selected <- audiometry_11_12_selected %>%
  rename(DLQ010 = AUQ054)

In [5]:
# Datasets for 2013/14 period

demo_13_14 <- read_xpt(file.path(path_to_data_13_14, "DEMO_H.xpt.txt"))

demo_13_14_selected <- demo_13_14 %>%
    select(SEQN, RIAGENDR, RIDAGEYR, RIDRETH1, DMDEDUC2, INDFMPIR)


alcohol_13_14 <- read_xpt(file.path(path_to_data_13_14, "ALQ_H.xpt.txt"))

alcohol_13_14_selected <- alcohol_13_14 %>%
    select(SEQN, ALQ101)


smoking_13_14 <- read_xpt(file.path(path_to_data_13_14, "SMQ_H.xpt.txt"))

smoking_13_14_selected <- smoking_13_14 %>%
    select(SEQN, SMQ020)


med_conditions_13_14 <- read_xpt(file.path(path_to_data_13_14, "MCQ_H.xpt.txt"))

med_conditions_13_14_selected <- med_conditions_13_14 %>%
    select(SEQN, MCQ160B, MCQ160C, MCQ160D, MCQ160E, MCQ160F, MCQ160L, MCQ220)


blood_pressure_13_14 <- read_xpt(file.path(path_to_data_13_14, "BPQ_H.xpt.txt"))

blood_pressure_13_14_selected <- blood_pressure_13_14 %>%
    select(SEQN, BPQ020)


diabetes_13_14 <- read_xpt(file.path(path_to_data_13_14, "DIQ_H.xpt.txt"))

diabetes_13_14_selected <- diabetes_13_14 %>%
    select(SEQN, DIQ010)


teeth_13_14 <- read_xpt(file.path(path_to_data_13_14, "OHXDEN_H.xpt.txt"))

selected_cols <- colnames(teeth_13_14)[grepl("^OHX\\d{2}TC", colnames(teeth_13_14))]

teeth_13_14_selected <- teeth_13_14 %>%
    select(SEQN, all_of(selected_cols))


locomotion_13_14 <- read_xpt(file.path(path_to_data_13_14, "PFQ061I.txt"))

locomotion_13_14_selected <- locomotion_13_14 %>%
    select(SEQN, PFQ061I)


disability_13_14 <- read_xpt(file.path(path_to_data_13_14, "DLQ040.txt"))

disability_13_14_selected <- disability_13_14 %>%
    select(SEQN, DLQ010, DLQ020, DLQ040)


mental_health_13_14 <- read_xpt(file.path(path_to_data_13_14, "DPQ-.txt"))

mental_health_13_14_selected <- mental_health_13_14 %>%
    select(SEQN, DPQ010, DPQ020, DPQ050)


weight_history_13_14 <- read_xpt(file.path(path_to_data_13_14, "WHQ060.txt"))

weight_history_13_14_selected <- weight_history_13_14 %>%
    select(SEQN, WHQ060)

In [6]:
dim(teeth_09_10_selected)
dim(teeth_11_12_selected)
dim(teeth_13_14_selected)

### Merge datasets without NA and missing values
> Merge the component tables of each cycle, stack the three cycles, and then exclude patients with missing or invalid values
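
The merge-then-exclude logic can be sketched on toy data before running it on the real files. The example below is an illustration only: it uses base R `merge(..., all = TRUE)` as a stand-in for `full_join()`, and the three small tables and their values are made up.

```r
# Toy component tables keyed by SEQN (values are made up)
demo  <- data.frame(SEQN = 1:5, RIDAGEYR = c(45, 62, 70, 81, 66))
ic    <- data.frame(SEQN = c(1, 2, 3, 5), DPQ010 = c(0, 1, 9, 2))
teeth <- data.frame(SEQN = c(2, 3, 4, 5), OHX02TC = c(2, 4, 2, NA))

# Horizontal union: full outer join of every table on SEQN
merged <- Reduce(function(x, y) merge(x, y, by = "SEQN", all = TRUE),
                 list(demo, ic, teeth))

# Exclusions applied in sequence, keeping the sample size after each step
step_age   <- merged[merged$RIDAGEYR >= 60, ]
step_ic    <- step_age[!is.na(step_age$DPQ010) &
                         !(step_age$DPQ010 %in% c(7, 9)), ]
step_teeth <- step_ic[!is.na(step_ic$OHX02TC), ]

print(c(all = nrow(merged), age60 = nrow(step_age),
        ic = nrow(step_ic), teeth = nrow(step_teeth)))
```

Reporting the row count after every exclusion, as the notebook does with `dim()`, gives the numbers needed for a participant flow diagram.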

In [7]:
# Merge datasets demographics and intrinsic capacity data

datasets_09_10 <- list(
  demo_09_10_selected, alcohol_09_10_selected, smoking_09_10_selected, med_conditions_09_10_selected,
  blood_pressure_09_10_selected, diabetes_09_10_selected,
  locomotion_09_10_selected, mental_health_09_10_selected,
  weight_history_09_10_selected, audiometry_09_10_selected, teeth_09_10_selected
)

datasets_11_12 <- list(
  demo_11_12_selected, alcohol_11_12_selected, smoking_11_12_selected, med_conditions_11_12_selected,
  blood_pressure_11_12_selected, diabetes_11_12_selected,
  locomotion_11_12_selected, mental_health_11_12_selected,
  weight_history_11_12_selected, audiometry_11_12_selected, teeth_11_12_selected
)

datasets_13_14 <- list(
  demo_13_14_selected, alcohol_13_14_selected, smoking_13_14_selected, med_conditions_13_14_selected,
  blood_pressure_13_14_selected, diabetes_13_14_selected,
  locomotion_13_14_selected, disability_13_14_selected, mental_health_13_14_selected,
  weight_history_13_14_selected, teeth_13_14_selected
)

# Horizontal union for period 2009/10, 2011/12, 2013/14

df_09_10 <- Reduce(function(x, y) full_join(x, y, by = "SEQN"), datasets_09_10)

df_11_12 <- Reduce(function(x, y) full_join(x, y, by = "SEQN"), datasets_11_12)

df_13_14 <- Reduce(function(x, y) full_join(x, y, by = "SEQN"), datasets_13_14)

# Vertical union

df_final <- bind_rows(df_09_10, df_11_12, df_13_14)

print("Dimensions before removing NA values")
dim(df_final)

# Filter with AGE >= 60

df_final_age_60 <- subset(df_final, RIDAGEYR >= 60)

print("Dimensions with AGE >= 60")
dim(df_final_age_60)

# Excluding patients with missing values in Intrinsic Capacity features

df_final_excluding_IC <- df_final_age_60[complete.cases(df_final_age_60[, c('DLQ020', 'PFQ061I', 'DPQ010', 'DPQ020', 'DPQ050', 'DLQ040', 'WHQ060', 'DLQ010')]), ]

# Drop special codes in a single pass: 7/77 = refused, 9/99 = don't know
# (77 can survive in DLQ010 when it comes from the recoded AUQ131/AUQ054 items)
df_final_excluding_IC <- df_final_excluding_IC %>%
  filter(!if_any(c(DLQ020, PFQ061I, DLQ040, WHQ060, DLQ010,
                   DPQ020, DPQ050, DPQ010), ~ . %in% c(7, 9, 77, 99)))

print("Dimensions without IC missing values")
dim(df_final_excluding_IC)

# Excluding patients with no examinations for Teeth counts

df_final_excluding_teeth <- df_final_excluding_IC %>%
  filter(rowSums(!is.na(select(., starts_with("OHX")))) > 0)

print("Dimensions without Teeth counts missing values")
dim(df_final_excluding_teeth)

# Excluding patients with missing values in Confounding features

df_final_excluding_confounding <- df_final_excluding_teeth[complete.cases(df_final_excluding_teeth[, 
                                  c('RIAGENDR', 'RIDAGEYR', 'RIDRETH1', 'DMDEDUC2', 'INDFMPIR', 'ALQ101', 'SMQ020', 'MCQ160B',
                                  'MCQ160C', 'MCQ160D', 'MCQ160E', 'MCQ160F', 'MCQ160L', 'MCQ220', 'BPQ020', 'DIQ010')]), ]

df_final_merged <- df_final_excluding_confounding %>%
  filter(!if_any(c(DMDEDUC2, ALQ101, MCQ160B, MCQ160C, MCQ160D,
                   MCQ160E, MCQ160F, MCQ160L, BPQ020, DIQ010), ~ . == 9))

print("Dimensions without Confounding missing values")
dim(df_final_merged)

[1] "Dimensions before removing NA values"


[1] "Dimensions with AGE >= 60"


[1] "Dimensions without IC missing values"


[1] "Dimensions without Teeth counts missing values"


[1] "Dimensions without Confounding missing values"


### Teeth counts

Preprocessed features:
- total number of teeth
- binary category: >=20 teeth or < 20 teeth
- edentulous category
- Other categorization:
    1. Edentulous : 0 teeth
    2. Severe Loss : 1-9 teeth
    3. Moderate Loss : 10-19 teeth
    4. Nearly Complete : >=20 teeth
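
The two outcome codings listed above can be checked on a handful of boundary values. This is a small sketch with toy counts; the shortened labels are chosen for the illustration and differ from the label strings used in the notebook's `count_teeth()`.

```r
# Toy tooth counts covering every bin boundary
total_teeth <- c(0, 1, 9, 10, 19, 20, 32)

# Binary outcome: 1 if >= 20 teeth, 0 otherwise
has_20_or_more <- as.integer(total_teeth >= 20)

# Four-level outcome with right-closed bins: (-Inf,0], (0,9], (9,19], (19,32]
teeth_category <- cut(
  total_teeth,
  breaks = c(-Inf, 0, 9, 19, 32),
  labels = c("Edentulous", "Severe Loss", "Moderate Loss", "Nearly Complete"),
  right  = TRUE
)

print(data.frame(total_teeth, has_20_or_more, teeth_category))
```

With `right = TRUE`, a count of exactly 20 falls in the last bin, so the four-level variable agrees with the binary cut-off.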

In [8]:
# Functions to find patients with zero permanent teeth and, among them,
# those with at least one tooth position not coded 4 (e.g. root fragments)

find_patients_no_teeth <- function(df) {

  teeth_cols <- grep("^OHX\\d{2}TC$", names(df), value = TRUE)
  
  no_teeth_patients <- df[rowSums(df[, teeth_cols] == 2, na.rm = TRUE) == 0, ]
  
  return(no_teeth_patients)
}

find_patients_with_non4_values <- function(df) {
  
  teeth_cols <- grep("^OHX\\d{2}TC$", names(df), value = TRUE)

  no_teeth_patients <- find_patients_no_teeth(df)

  patients_with_non4 <- no_teeth_patients[rowSums(no_teeth_patients[, teeth_cols] != 4, na.rm = TRUE) > 0, ]

  return(patients_with_non4)

}

test_edentolus <- find_patients_with_non4_values(df_final_merged)
head(test_edentolus)

SEQN,RIAGENDR,RIDAGEYR,RIDRETH1,DMDEDUC2,INDFMPIR,ALQ101,SMQ020,DLQ020,MCQ160B,...,OHX23TC,OHX24TC,OHX25TC,OHX26TC,OHX27TC,OHX28TC,OHX29TC,OHX30TC,OHX31TC,OHX32TC
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
60998,2,60,2,1,2.64,2,1,2,2,...,4,4,4,4,5,4,4,4,4,4
64973,1,64,3,3,1.13,1,2,2,2,...,4,4,4,4,5,4,4,4,4,4
74674,2,67,1,1,0.97,2,2,2,2,...,4,4,4,4,4,4,5,4,4,4
78009,1,80,4,2,1.39,1,1,1,2,...,5,5,5,5,5,5,5,4,4,4
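
The rows above have no permanent teeth but still carry codes other than 4. The sketch below reproduces that situation on a toy matrix. It assumes the tooth-count coding as read from the NHANES oral health codebook (2 = permanent tooth present, 3 = implant, 4 = tooth not present, 5 = root fragment), which should be verified against the cycle-specific documentation; the data frame `tc` is made up and has 4 positions instead of 32.

```r
# Toy tooth-count table: 3 participants x 4 tooth positions
tc <- data.frame(
  OHX01TC = c(2, 4, 4),
  OHX02TC = c(2, 4, 5),
  OHX03TC = c(4, 4, 4),
  OHX04TC = c(2, 4, 3)
)

# Permanent teeth present (code 2)
total_teeth <- rowSums(tc == 2, na.rm = TRUE)

# Edentulous as defined in this notebook: every position coded 4 or 5
edentulous <- rowSums(tc == 4 | tc == 5, na.rm = TRUE) == ncol(tc)

# No permanent teeth, but at least one position not coded 4
no_teeth_not_all_missing <- total_teeth == 0 & rowSums(tc != 4, na.rm = TRUE) > 0

print(data.frame(total_teeth, edentulous, no_teeth_not_all_missing))
```

The third toy participant has zero permanent teeth yet is not flagged as edentulous because one position holds an implant, which shows why `total_teeth == 0` and the `edentulous` flag can disagree.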


In [9]:
# Function to calculate total number of teeth for each patient and categorize it

count_teeth <- function(df) {

  teeth_cols <- grep("^OHX\\d{2}TC$", names(df), value = TRUE)
  
  df$total_teeth <- rowSums(df[, teeth_cols] == 2, na.rm = TRUE)
  
  # Binary category: 1 if >=20 teeth, 0 otherwise
  df$has_20_or_more_teeth <- ifelse(df$total_teeth >= 20, 1, 0)
  
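  # Edentulous patients: all 32 tooth positions coded 4 (not present) or 5 (root fragment)
  df$edentulous <- ifelse(rowSums(df[, teeth_cols] == 4 | df[, teeth_cols] == 5, na.rm = TRUE) == length(teeth_cols), 1, 0)
  
  # Other possible categories: Edentulous, Severe, Moderate, Nearly Complete.
  # NOTE: the last label reads "(25-32)" although the bin covers 20-32 teeth;
  # it is kept unchanged because later cells match on this exact string.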
  df$teeth_category <- cut(
    df$total_teeth,
    breaks = c(-Inf, 0, 9, 19, 32),
    labels = c("Edentulous", "Severe Loss (1-9)", "Moderate Loss (10-19)", "Nearly Complete (25-32)"),
    right = TRUE
  )
  
  df <- df[, !names(df) %in% teeth_cols]
  
  return(df)
}

df <- as.data.frame(count_teeth(df_final_merged))
head(df)

row,SEQN,RIAGENDR,RIDAGEYR,RIDRETH1,DMDEDUC2,INDFMPIR,ALQ101,SMQ020,DLQ020,MCQ160B,...,DPQ010,DPQ020,DPQ050,DLQ040,WHQ060,DLQ010,total_teeth,has_20_or_more_teeth,edentulous,teeth_category
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
1,51633,1,80,3,4,1.27,1,1,1,2,...,0,0,0,2,1,2,6,0,0,Severe Loss (1-9)
2,51654,1,66,3,4,2.2,1,1,2,2,...,0,0,0,2,2,2,26,1,0,Nearly Complete (25-32)
3,51680,2,60,4,4,2.59,1,1,1,2,...,0,1,1,2,1,2,20,1,0,Nearly Complete (25-32)
4,51687,1,78,3,5,5.0,1,1,2,2,...,0,0,0,2,2,2,28,1,0,Nearly Complete (25-32)
5,51736,2,60,1,3,0.68,1,1,2,2,...,0,1,0,2,1,2,21,1,0,Nearly Complete (25-32)
6,51861,1,80,3,3,4.94,1,2,2,2,...,0,0,0,2,1,1,8,0,0,Severe Loss (1-9)


In [10]:
# Saving preprocessed

write.csv(df, "/Users/silvanoquarto/Desktop/LAVORO/MEDICAL_PHYSICS/Med-Physics/data/NHANES/preprocessed_df_teeth_09_14.csv", row.names=FALSE)

### Descriptive Analysis

In [1]:
# Load preprocessed and cleaned df

df_test <- read.csv("/Users/silvanoquarto/Desktop/LAVORO/MEDICAL_PHYSICS/Med-Physics/data/NHANES/preprocessed_df_teeth_09_14.csv",
               header = TRUE)

head(df_test)

row,SEQN,RIAGENDR,RIDAGEYR,RIDRETH1,DMDEDUC2,INDFMPIR,ALQ101,SMQ020,DLQ020,MCQ160B,...,DPQ010,DPQ020,DPQ050,DLQ040,WHQ060,DLQ010,total_teeth,has_20_or_more_teeth,edentulous,teeth_category
,<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,...,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<chr>
1,51633,1,80,3,4,1.27,1,1,1,2,...,0,0,0,2,1,2,6,0,0,Severe Loss (1-9)
2,51654,1,66,3,4,2.2,1,1,2,2,...,0,0,0,2,2,2,26,1,0,Nearly Complete (25-32)
3,51680,2,60,4,4,2.59,1,1,1,2,...,0,1,1,2,1,2,20,1,0,Nearly Complete (25-32)
4,51687,1,78,3,5,5.0,1,1,2,2,...,0,0,0,2,2,2,28,1,0,Nearly Complete (25-32)
5,51736,2,60,1,3,0.68,1,1,2,2,...,0,1,0,2,1,2,21,1,0,Nearly Complete (25-32)
6,51861,1,80,3,3,4.94,1,2,2,2,...,0,0,0,2,1,1,8,0,0,Severe Loss (1-9)


In [2]:
table(df_test$DMDEDUC2, df_test$teeth_category)

   
    Edentulous Moderate Loss (10-19) Nearly Complete (25-32) Severe Loss (1-9)
  1         34                    25                      29                18
  2         44                    31                      37                20
  3         44                    40                      72                23
  4         33                    30                     135                23
  5          6                    34                     100                 9

In [3]:
table(df_test$RIDRETH1, df_test$teeth_category)

   
    Edentulous Moderate Loss (10-19) Nearly Complete (25-32) Severe Loss (1-9)
  1         11                    15                      45                 9
  2         10                    12                      29                 9
  3         82                    61                     190                37
  4         49                    64                      92                33
  5          9                     8                      17                 5

In [13]:
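# Using the srvyr package (more intuitive, dplyr-style approach)
library(srvyr)

# NOTE: this design needs SDMVPSU, SDMVSTRA and a weight column `wt` in df_final_merged;
# the loading cells above do not select them, so they must be added before this cell runs.
svy_obj <- df_final_merged %>%
  as_survey_design(
    ids = SDMVPSU,
    strata = SDMVSTRA,
    weights = wt,
    nest = TRUE
  )

# Population estimate
pop_est <- svy_obj %>%
  summarize(pop = survey_total(1, vartype = "ci"))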

print(pop_est)

# A tibble: 1 x 3
       pop  pop_low  pop_upp
     <dbl>    <dbl>    <dbl>
1 8351527. 7173828. 9529225.


In [None]:
library(survey)

# Create a "Total" variable for counting
df_final_merged$Total <- 1

# Handle lonely PSUs (strata with a single PSU); svydesign() has no `options`
# argument, so this is set through the global survey option instead
options(survey.lonely.psu = "adjust")

# Estimate using svytotal
svy_design <- svydesign(
  id = ~SDMVPSU,
  strata = ~SDMVSTRA,
  weights = ~wt,
  nest = TRUE,
  data = df_final_merged
)

# Compute the total estimate
pop_total <- svytotal(~Total, svy_design)
print(pop_total)
print(confint(pop_total))

# Estimate by teeth category
# NOTE: teeth_category is created by count_teeth(), so it must be present in the design data
teeth_totals <- svyby(~Total, ~teeth_category, svy_design, svytotal)
print(teeth_totals)
print(confint(teeth_totals))

        total     SE
Total 8351526 585735
        2.5 %  97.5 %
Total 7203508 9499545
                                 teeth_category     Total       se
Edentulous                           Edentulous 1487304.1 219644.5
Moderate Loss (10-19)     Moderate Loss (10-19) 1469842.4 178245.4
Nearly Complete (25-32) Nearly Complete (25-32) 4559454.5 432282.1
Severe Loss (1-9)             Severe Loss (1-9)  834925.6 128546.2
                            2.5 %  97.5 %
Edentulous              1056808.6 1917799
Moderate Loss (10-19)   1120487.9 1819197
Nearly Complete (25-32) 3712197.2 5406712
Severe Loss (1-9)        582979.5 1086872
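
The design objects above rely on a weight column `wt` whose construction is not shown in this notebook. A minimal sketch of one common way to build it follows, assuming the 2-year MEC examination weights (`WTMEC2YR`) are the relevant ones because tooth counts come from the examination component, and following the NHANES guidance of dividing each 2-year weight by the number of pooled cycles. The toy data frame and its weights are made up.

```r
# Toy records from three pooled 2-year cycles (weights are made up)
toy <- data.frame(
  SEQN     = 1:6,
  cycle    = rep(c("2009-10", "2011-12", "2013-14"), each = 2),
  WTMEC2YR = c(30000, 12000, 45000, 9000, 24000, 60000)
)

n_cycles <- length(unique(toy$cycle))

# Pooled-cycle weight: each 2-year weight divided by the number of cycles combined
toy$wt <- toy$WTMEC2YR / n_cycles

# Point estimate of the represented population
# (no variance here; standard errors need the full survey design)
pop_estimate <- sum(toy$wt)
print(pop_estimate)
```

Without the division, the pooled sample would appear to represent roughly three times the actual population, which inflates every weighted total.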


#### Descriptive Analysis in csv table format
> TO BE UPDATED: non-normal variables, ordinal variables, p-values 

In [51]:
# Function for the descriptive analysis

create_descriptive_table <- function(df) {
  require(tableone)
  require(dplyr)

  df$has_20_or_more_teeth <- factor(df$has_20_or_more_teeth, levels = c(0, 1),
                                    labels = c("< 20", ">= 20"))
  
  # Recoding categorical variables
  df <- df %>%
    mutate(
      `Age (years)` = RIDAGEYR,
      `Number of teeth` = total_teeth,
      `Ratio of family income` = INDFMPIR,
      Gender = factor(RIAGENDR, levels = c(1, 2), 
                      labels = c("Male", "Female")),
      
      Ethnicity = factor(RIDRETH1, levels = 1:5, 
                         labels = c("Mexican American", "Other Hispanic",
                                    "Non-Hispanic White", "Non-Hispanic Black",
                                    "Other Race")),
      
      Education = factor(DMDEDUC2, levels = 1:5,
                         labels = c("Less than 9th grade", "9-11th grade",
                                    "High school graduate",
                                    "Some college/AA degree",
                                    "College graduate or above")),
      
      Smoking = factor(SMQ020, levels = c(1, 2),
                       labels = c("Yes", "No")),
      
      `Alcohol intake` = factor(ALQ101, levels = c(1, 2),
                                labels = c("Over 12 alcohol drinks/1 yr",
                                "Under 12 alcohol drinks/1 yr")),
      
      `Heart Failure` = factor(MCQ160B, levels = c(1, 2), 
                               labels = c("Yes", "No")),
      `Coronary Heart` = factor(MCQ160C, levels = c(1, 2),
                                labels = c("Yes", "No")),
      Angina = factor(MCQ160D, levels = c(1, 2),
                      labels = c("Yes", "No")),
      `Heart Attack` = factor(MCQ160E, levels = c(1, 2),
                              labels = c("Yes", "No")),
      Stroke = factor(MCQ160F, levels = c(1, 2),
                      labels = c("Yes", "No")),
      Cancer = factor(MCQ220, levels = c(1, 2),
                      labels = c("Yes", "No")),
      Liver = factor(MCQ160L, levels = c(1, 2),
                     labels = c("Yes", "No")),
      Hypertension = factor(BPQ020, levels = c(1,2),
                            labels = c("Yes", "No")),
      Diabetes = factor(DIQ010, levels = 1:3,
                        labels = c("Yes", "No", "Borderline"))
    )
  
  continuous_vars <- c("Age (years)", "Number of teeth", "Ratio of family income")
  
  categorical_vars <- c("Gender", "Ethnicity", "Education", "Smoking", "Alcohol intake",
                        "Heart Failure", "Coronary Heart", "Angina", "Heart Attack",
                        "Stroke", "Cancer", "Liver", "Hypertension", "Diabetes")

  table1 <- CreateTableOne(vars = c(continuous_vars, categorical_vars),
                           strata = "has_20_or_more_teeth",
                           data = df,
                           test = TRUE)

  cont_vars_overall <- c("Age (years)", "Number of teeth", "Ratio of family income")
  
  table_overall <- CreateTableOne(vars = c(cont_vars_overall, categorical_vars),
                                  data = df,
                                  test = FALSE)
  
  formatted_table <- print(table1,
                           #nonnormal = continuous_vars,
                           nonnormal = NULL,
                           contDigits = 2,
                           showAllLevels = TRUE,
                           printToggle = FALSE,
                           smd = FALSE)
  
  formatted_table_overall <- print(table_overall,
                                   #nonnormal = continuous_vars,
                                   nonnormal = NULL,
                                   contDigits = 2,
                                   showAllLevels = TRUE,
                                   printToggle = FALSE,
                                   smd = FALSE)
  
  final_table <- list("Stratified by 20 teeth as cut-off" = formatted_table, 
                      "Overall" = formatted_table_overall)
  
  return(final_table)
}

In [52]:
# Test and debugging
results <- create_descriptive_table(df_test)

results

Loading required package: tableone



variable,level,< 20,>= 20,p,test
n,,414,373,,
Age (years) (mean (SD)),,70.76 (6.81),68.05 (6.68),<0.001,
Number of teeth (mean (SD)),,6.99 (6.78),24.86 (2.99),<0.001,
Ratio of family income (mean (SD)),,1.99 (1.36),2.97 (1.60),<0.001,
Gender (%),Male,216 (52.2),186 (49.9),0.565,
,Female,198 (47.8),187 (50.1),,
Ethnicity (%),Mexican American,35 ( 8.5),45 (12.1),0.015,
,Other Hispanic,31 ( 7.5),29 ( 7.8),,
,Non-Hispanic White,180 (43.5),190 (50.9),,
,Non-Hispanic Black,146 (35.3),92 (24.7),,

variable,level,Overall
n,,787
Age (years) (mean (SD)),,69.47 (6.88)
Number of teeth (mean (SD)),,15.46 (10.39)
Ratio of family income (mean (SD)),,2.45 (1.55)
Gender (%),Male,402 (51.1)
,Female,385 (48.9)
Ethnicity (%),Mexican American,80 (10.2)
,Other Hispanic,60 ( 7.6)
,Non-Hispanic White,370 (47.0)
,Non-Hispanic Black,238 (30.2)


#### Descriptive Analysis in word table format

##### Multi labels

In [26]:
# Descriptive analysis with multi classes:
# "20 teeth or more", "10-19 teeth", "1-9 teeth", "Edentulous"

create_descriptive_table <- function(df, survey_design = NULL) {
  require(gtsummary)
  require(dplyr)
  require(survey)
  require(srvyr)
  
  # Check if survey design is provided
  use_survey_design <- !is.null(survey_design)
  
  # Recode and transform variables
  df <- df %>%
    mutate(
      teeth_category = factor(teeth_category,
                              levels = c("Nearly Complete (25-32)", "Moderate Loss (10-19)",
                                         "Severe Loss (1-9)", "Edentulous"),
                              labels = c("20 teeth or more", "10-19 teeth",
                                          "1-9 teeth", "Edentulous")),

      # Continuous variables
      `Age (years)` = RIDAGEYR,
      `Number of teeth` = total_teeth,
      `Ratio of family income` = INDFMPIR,
      
      # Categorical and Ordinal variables
      Gender = factor(RIAGENDR, levels = c(1, 2), 
                      labels = c("Male", "Female")),
      
      Education = factor(DMDEDUC2, levels = 1:5,
                         labels = c("Less than 9th grade", "9-11th grade",
                                    "High school graduate",
                                    "Some college/AA degree",
                                    "College graduate or above"),
                         ordered = TRUE),
      
      Ethnicity = factor(RIDRETH1, levels = 1:5, 
                         labels = c("Mexican American", "Other Hispanic",
                                    "Non-Hispanic White", "Non-Hispanic Black",
                                    "Other Race")),
      
      Smoking = factor(SMQ020, levels = c(1, 2),
                       labels = c("Yes", "No")),
      
      `Alcohol intake` = factor(ALQ101, levels = c(1, 2),
                                labels = c("Over 12 alcohol drinks/1 yr",
                                "Under 12 alcohol drinks/1 yr")),
      
      `Heart Failure` = factor(MCQ160B, levels = c(1, 2), 
                               labels = c("Yes", "No")),
      `Coronary Heart` = factor(MCQ160C, levels = c(1, 2),
                                labels = c("Yes", "No")),
      Angina = factor(MCQ160D, levels = c(1, 2),
                      labels = c("Yes", "No")),
      `Heart Attack` = factor(MCQ160E, levels = c(1, 2),
                              labels = c("Yes", "No")),
      Stroke = factor(MCQ160F, levels = c(1, 2),
                      labels = c("Yes", "No")),
      Cancer = factor(MCQ220, levels = c(1, 2),
                      labels = c("Yes", "No")),
      Liver = factor(MCQ160L, levels = c(1, 2),
                     labels = c("Yes", "No")),
      Hypertension = factor(BPQ020, levels = c(1, 2),
                            labels = c("Yes", "No")),
      Diabetes = factor(DIQ010, levels = 1:3,
                        labels = c("Yes", "No", "Borderline"))
    )
  
  # Normality test a priori for continuous variables
  continuous_vars <- c("Age (years)", "Ratio of family income")
  
  normality_results <- list()
  
  for (var in continuous_vars) {
    # Limit: 5000 observations for Shapiro-Wilk test
    if (length(na.omit(df[[var]])) > 5000) {
      sample_data <- sample(na.omit(df[[var]]), 5000)
    } else {
      sample_data <- na.omit(df[[var]])
    }
    
    test_result <- shapiro.test(sample_data)
    normality_results[[var]] <- test_result$p.value > 0.05
    message(var, " p-value: ", test_result$p.value)
  }
  
  message("Normality test results:")
  for (var in names(normality_results)) {
    message(var, ": ", ifelse(normality_results[[var]], "Normal", "Non-normal"))
  }
  
  variables_to_include <- c(
    "Age (years)", "Gender", "Ethnicity", "Education", "Ratio of family income",
    "Number of teeth", "Smoking", "Alcohol intake",
    "Heart Failure", "Coronary Heart", "Angina", "Heart Attack",
    "Stroke", "Cancer", "Liver", "Hypertension", "Diabetes"
  )
  
  # Variables using median (IQR) or mean (SD)
  median_vars <- names(normality_results)[!unlist(normality_results)]
  median_vars <- c(median_vars, "Number of teeth")

  stat_labels <- list(
    "Age (years)" = "Age (mean, SD)",
    "Number of teeth" = "Number of teeth (median, IQR)",
    "Ratio of family income" = "Ratio of family income (mean, SD)",
    "Gender" = "Gender (n, %)",
    "Ethnicity" = "Ethnicity (n, %)",
    "Education" = "Education (n, %)",
    "Smoking" = "Smoking (n, %)",
    "Alcohol intake" = "Alcohol intake (n, %)",
    "Heart Failure" = "Heart Failure (n, %)",
    "Coronary Heart" = "Coronary Heart (n, %)",
    "Angina" = "Angina (n, %)",
    "Heart Attack" = "Heart Attack (n, %)",
    "Stroke" = "Stroke (n, %)",
    "Cancer" = "Cancer (n, %)",
    "Liver" = "Liver (n, %)",
    "Hypertension" = "Hypertension (n, %)",
    "Diabetes" = "Diabetes (n, %)"
  )
  
  for (var in names(normality_results)) {
    if (normality_results[[var]]) {
      stat_labels[[var]] <- gsub("\\(median, IQR\\)", "(mean, SD)", stat_labels[[var]])
    } else {
      stat_labels[[var]] <- gsub("\\(mean, SD\\)", "(median, IQR)", stat_labels[[var]])
    }
  }

  # Define statistics
  stat_list <- list(
    all_continuous() ~ "{mean} ({sd})",
    all_categorical() ~ "{n} ({p}%)"
  )
  for (var in median_vars) {
    stat_list[[var]] <- "{median} ({p25}, {p75})"
  }
  
    # Create table with gtsummary - use tbl_summary for consistent display
    # but calculate weighted population for the additional column
    table_strat <- df %>%
      tbl_summary(
        by = teeth_category,
        include = all_of(variables_to_include),
        statistic = stat_list,
        label = stat_labels,
        missing = "ifany",
        missing_text = "Missing",
        digits = all_continuous() ~ 2,
        value = all_categorical() ~ "level",
        type = all_categorical() ~ "categorical"
      ) %>%
      add_p(
        test = list(
          continuous_vars[unlist(normality_results[continuous_vars])] ~ "anova", 
          continuous_vars[!unlist(normality_results[continuous_vars])] ~ "kruskal.test",
          `Number of teeth` ~ "kruskal.test",
          all_categorical() ~ "chisq.test",
          #Ethnicity ~ "fisher.test",  # some categories have few patients
          Education ~ "kruskal.test"  # it is an ordinal feature
        )
      ) %>%
      add_overall()
  
  # Calculate N for each group
  total_n <- nrow(df)
  n_by_group <- df %>%
    group_by(teeth_category) %>%
    summarize(n = n()) %>%
    pull(n, name = teeth_category)
  
  # Add weighted N in millions column only if survey design is provided
  if (use_survey_design) {
    survey_obj <- survey_design %>% 
      as_survey(options = list(lonely.psu = "adjust"))

    total_pop_in_millions <- survey_obj %>%
      summarize(pop = survey_total(1)) %>%
      mutate(pop_millions = pop/1000000) %>%
      pull(pop_millions)

    pop_by_group_in_millions <- survey_obj %>%
      group_by(teeth_category) %>%
      summarize(pop = survey_total(1)) %>%
      mutate(pop_millions = pop/1000000)
    
    table_strat <- table_strat %>%
      modify_table_body(
        ~.x %>%
          dplyr::mutate(
            weighted_n = case_when(
              is.na(row_type) ~ "", 
              row_type == "label_header" ~ "**Weighted N (Millions)**",
              row_type == "label" ~ "",
              row_type == "level" & !is.na(variable) & !is.na(label) ~ "",
              TRUE ~ ""
            )
          ) %>%
          dplyr::relocate(weighted_n, .before = stat_0)
      )
    
    # Add weighted N column as an additional column
    table_strat <- table_strat %>%
      modify_header(
        label = "**Characteristics**",
        weighted_n = paste0("**Weighted N**\n**in Millions**")
      )
    
    # Add weighted population estimates for each row
    if (use_survey_design) {
      # Function to calculate weighted population for a specific variable and level
      calculate_weighted_pop <- function(var_name, level = NULL) {
        tryCatch({
          if (is.null(level)) {
            # For continuous variables: sum weights where variable is not NA
            var_data <- survey_design$variables[[var_name]]
            weights_sum <- sum(weights(survey_design, "analysis")[!is.na(var_data)]) / 1000000
            return(weights_sum)
          } else {
            # For categorical variables and specific levels
            var_data <- survey_design$variables[[var_name]]
            level_match <- var_data == level & !is.na(var_data)
            weights_sum <- sum(weights(survey_design, "analysis")[level_match]) / 1000000
            return(weights_sum)
          }
        }, error = function(e) {
          message("Error calculating weighted population for ", var_name, 
                  if(!is.null(level)) paste(" level:", level), ": ", e$message)
          return(NA)
        })
      }
      
      # Update each row with weighted population estimate
      table_strat$table_body <- table_strat$table_body %>%
        rowwise() %>%
        mutate(
          weighted_n = case_when(
            !is.na(row_type) & row_type == "label" & !is.na(variable) ~ 
              sprintf("%.2f", calculate_weighted_pop(variable)),
            !is.na(row_type) & row_type == "level" & !is.na(variable) & !is.na(label) ~
              sprintf("%.2f", calculate_weighted_pop(variable, label)),
            TRUE ~ weighted_n
          )
        ) %>%
        ungroup()
    }
  } else {
    # If no survey design, just modify the headers without weighted N
    table_strat <- table_strat %>%
      modify_header(
        label = "**Characteristics**",
        stat_0 = paste0("**Total**\nN = ", total_n), 
        stat_1 = paste0("**20 teeth or more**\nN = ", ifelse("20 teeth or more" %in% names(n_by_group), n_by_group["20 teeth or more"], 0)), 
        stat_2 = paste0("**10-19 teeth**\nN = ", ifelse("10-19 teeth" %in% names(n_by_group), n_by_group["10-19 teeth"], 0)),
        stat_3 = paste0("**1-9 teeth**\nN = ", ifelse("1-9 teeth" %in% names(n_by_group), n_by_group["1-9 teeth"], 0)), 
        stat_4 = paste0("**Edentulous**\nN = ", ifelse("Edentulous" %in% names(n_by_group), n_by_group["Edentulous"], 0)),
        p.value = "**P-value**"
      )
  }
  
  # Notes
  table_strat <- table_strat %>%
    modify_footnote(
      update = all_stat_cols() ~ "Values are n (%) for categorical variables, median (IQR) for non-normally distributed continuous variables, and mean (SD) for normally distributed continuous variables."
    )
  
  if (use_survey_design) {
    # modify_footnote() has no `add` argument; attach the note to the weighted_n column
    table_strat <- table_strat %>%
      modify_footnote(
        weighted_n ~ "Weighted N in millions represents the estimated US population based on NHANES survey weights."
      )
  }
  
  return(table_strat)
}

In [23]:
# Saving un-weighted results in docx format

result_table <- create_descriptive_table(df_test)
flex_table <- result_table %>% as_flex_table()

library(flextable)
library(officer)
save_as_docx(flex_table, path = "/Users/silvanoquarto/Desktop/LAVORO/MEDICAL_PHYSICS/Med-Physics/results/NHANES_09_14_teeth/descriptive_table_multi_labels.docx")

Age (years) p-value: 3.36317085020386e-21

Ratio of family income p-value: 1.92157997592047e-22

Normality test results:

Age (years): Non-normal

Ratio of family income: Non-normal

[1m[22m[33m![39m For variable `Ethnicity` (`teeth_category`) and [34m"statistic"[39m, [34m"p.value"[39m, and
  [34m"parameter"[39m statistics: [33mChi-squared approximation may be incorrect[39m

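The chi-squared warning for `Ethnicity` usually means that some expected cell counts are small. The following is a minimal sketch, on invented counts (not NHANES data), of how one might inspect the expected counts and fall back to a Monte Carlo Fisher exact test. The table `toy_tab` and the threshold of 5 are illustrative assumptions; in the table function this would roughly correspond to the commented-out `Ethnicity ~ "fisher.test"` option.

```r
# Toy contingency table (hypothetical counts, NOT taken from NHANES)
toy_tab <- matrix(
  c(40, 35, 30, 25,
     3,  2,  4,  1,
    50, 45, 20, 15),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    ethnicity = c("Group A", "Group B", "Group C"),
    teeth = c("20 teeth or more", "10-19 teeth", "1-9 teeth", "Edentulous")
  )
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(toy_tab), colSums(toy_tab)) / sum(toy_tab)

# Rule of thumb (an assumption here): the approximation is doubtful
# when any expected count is below 5
small_cells <- any(expected < 5)

if (small_cells) {
  # A simulated p-value keeps Fisher's test tractable for tables larger than 2x2
  set.seed(123)
  res <- fisher.test(toy_tab, simulate.p.value = TRUE, B = 10000)
} else {
  res <- chisq.test(toy_tab)
}
res$p.value
```

Whether an exact test is appropriate for the real table depends on the actual cell counts, which should be checked on `df_test` itself.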

In [7]:
# Select weights

# Weights from Demographic datasets

demo_09_10 <- read_xpt(file.path(path_to_data_09_10, "DEMO_F.xpt"))

demo_09_10_weights <- demo_09_10 %>%
    select(SEQN, RIDAGEYR, WTMEC2YR, SDMVPSU, SDMVSTRA)

demo_11_12 <- read_xpt(file.path(path_to_data_11_12, "DEMO_G.xpt.txt"))

demo_11_12_weights <- demo_11_12 %>%
    select(SEQN, RIDAGEYR, WTMEC2YR, SDMVPSU, SDMVSTRA)

demo_13_14 <- read_xpt(file.path(path_to_data_13_14, "DEMO_H.xpt.txt"))

demo_13_14_weights <- demo_13_14 %>%
    select(SEQN, RIDAGEYR, WTMEC2YR, SDMVPSU, SDMVSTRA)

weights_09_10 <- list(
  demo_09_10_weights
)

weights_11_12 <- list(
  demo_11_12_weights
)

weights_13_14 <- list(
  demo_13_14_weights
)

# Horizontal union for period 2009/10, 2011/12, 2013/14

wt_09_10 <- Reduce(function(x, y) full_join(x, y, by = "SEQN"), weights_09_10)

wt_11_12 <- Reduce(function(x, y) full_join(x, y, by = "SEQN"), weights_11_12)

wt_13_14 <- Reduce(function(x, y) full_join(x, y, by = "SEQN"), weights_13_14)

# Vertical union

wt_final <- bind_rows(wt_09_10, wt_11_12, wt_13_14)

print("Dimensions before removing NA values")
dim(wt_final)

# Filter with AGE >= 60

wt_final_age_60 <- subset(wt_final, RIDAGEYR >= 60)

print("Dimensions with AGE >= 60")
dim(wt_final_age_60)

wt_final_age_60 <- subset(wt_final_age_60, select = -RIDAGEYR)
head(wt_final_age_60)

# Merge wt_final_age_60 with my final data frame 

df_final_merged <- df_test %>%
  inner_join(wt_final_age_60, by = "SEQN")

dim(df_final_merged)

[1] "Dimensions before removing NA values"


[1] "Dimensions with AGE >= 60"


SEQN,WTMEC2YR,SDMVPSU,SDMVSTRA
<dbl>,<dbl>,<dbl>,<dbl>
51628,21000.339,2,75
51633,12381.115,1,77
51635,22502.507,1,79
51645,9590.458,1,75
51654,55670.35,2,86
51661,6385.327,2,88
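
Because the merge below uses an inner join, respondents without a matching weight record are silently dropped. A small sketch, using invented identifiers rather than real `SEQN` values, of how one might count the rows that would be lost before merging; `analysis` and `wts` are hypothetical stand-ins for `df_test` and `wt_final_age_60`.

```r
# Invented identifiers (NOT real NHANES SEQN values)
analysis <- data.frame(SEQN = c(1, 2, 3, 4), total_teeth = c(6, 26, 20, 0))
wts      <- data.frame(SEQN = c(2, 3, 4, 5), WTMEC2YR = c(100, 200, 300, 400))

# Rows of the analysis table with no matching weight record
unmatched <- setdiff(analysis$SEQN, wts$SEQN)

# Base-R equivalent of an inner join on SEQN
merged <- merge(analysis, wts, by = "SEQN")

length(unmatched)  # number of respondents the join will drop
nrow(merged)       # rows that survive the join
```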


In [8]:
# Preprocessing for WTMEC2YR: divide it by the number of NHANES cycles combined (3 in our case)

df_final_merged$wt <- df_final_merged$WTMEC2YR / 3

head(df_final_merged)

Unnamed: 0_level_0,SEQN,RIAGENDR,RIDAGEYR,RIDRETH1,DMDEDUC2,INDFMPIR,ALQ101,SMQ020,DLQ020,MCQ160B,...,WHQ060,DLQ010,total_teeth,has_20_or_more_teeth,edentulous,teeth_category,WTMEC2YR,SDMVPSU,SDMVSTRA,wt
Unnamed: 0_level_1,<dbl>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,...,<int>,<int>,<int>,<int>,<int>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
1,51633,1,80,3,4,1.27,1,1,1,2,...,1,2,6,0,0,Severe Loss (1-9),12381.115,1,77,4127.038
2,51654,1,66,3,4,2.2,1,1,2,2,...,2,2,26,1,0,Nearly Complete (25-32),55670.35,2,86,18556.783
3,51680,2,60,4,4,2.59,1,1,1,2,...,1,2,20,1,0,Nearly Complete (25-32),18341.27,1,79,6113.757
4,51687,1,78,3,5,5.0,1,1,2,2,...,2,2,28,1,0,Nearly Complete (25-32),42248.559,2,82,14082.853
5,51736,2,60,1,3,0.68,1,1,2,2,...,1,2,21,1,0,Nearly Complete (25-32),7084.377,1,81,2361.459
6,51861,1,80,3,3,4.94,1,2,2,2,...,1,1,8,0,0,Severe Loss (1-9),22127.661,1,76,7375.887
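
Dividing the two-year weight by the number of pooled cycles makes the combined weights sum to an average population over the period rather than to three times that population. The toy example below uses invented weights (not NHANES values) to illustrate the arithmetic; the data frame `toy` and its columns are assumptions made for the sketch.

```r
# Hypothetical two-year weights for three cycles (invented values, NOT NHANES data)
toy <- data.frame(
  cycle    = rep(c("2009-10", "2011-12", "2013-14"), each = 2),
  WTMEC2YR = c(1000, 2000, 1500, 2100, 1200, 1800)
)

n_cycles <- length(unique(toy$cycle))  # 3 cycles pooled
toy$wt   <- toy$WTMEC2YR / n_cycles    # combined weight for the pooled period

# Each cycle's weights sum to that cycle's own population estimate
pop_per_cycle <- tapply(toy$WTMEC2YR, toy$cycle, sum)

# The pooled weights sum to the average of the per-cycle estimates
sum(toy$wt)
mean(pop_per_cycle)
```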


In [27]:
# Function to create the survey design for NHANES datasets (to be updated for the case of three different two-year periods)

create_nhanes_design <- function(df) {

  # Recode and transform variables
  df <- df %>%
    mutate(
      teeth_category = factor(teeth_category,
                              levels = c("Nearly Complete (25-32)", "Moderate Loss (10-19)",
                                         "Severe Loss (1-9)", "Edentulous"),
                              labels = c("20 teeth or more", "10-19 teeth",
                                          "1-9 teeth", "Edentulous")),
      
      # Continuous variables
      `Age (years)` = RIDAGEYR,
      `Number of teeth` = total_teeth,
      `Ratio of family income` = INDFMPIR,
      
      # Categorical and Ordinal variables
      Gender = factor(RIAGENDR, levels = c(1, 2), 
                      labels = c("Male", "Female")),
      
      Education = factor(DMDEDUC2, levels = 1:5,
                         labels = c("Less than 9th grade", "9-11th grade",
                                    "High school graduate",
                                    "Some college/AA degree",
                                    "College graduate or above"),
                         ordered = TRUE),
      
      Ethnicity = factor(RIDRETH1, levels = 1:5, 
                         labels = c("Mexican American", "Other Hispanic",
                                    "Non-Hispanic White", "Non-Hispanic Black",
                                    "Other Race")),
      
      Smoking = factor(SMQ020, levels = c(1, 2),
                       labels = c("Yes", "No")),
      
      `Alcohol intake` = factor(ALQ101, levels = c(1, 2),
                                labels = c("Over 12 alcohol drinks/1 yr",
                                "Under 12 alcohol drinks/1 yr")),
      
      `Heart Failure` = factor(MCQ160B, levels = c(1, 2), 
                               labels = c("Yes", "No")),
      `Coronary Heart` = factor(MCQ160C, levels = c(1, 2),
                                labels = c("Yes", "No")),
      Angina = factor(MCQ160D, levels = c(1, 2),
                      labels = c("Yes", "No")),
      `Heart Attack` = factor(MCQ160E, levels = c(1, 2),
                              labels = c("Yes", "No")),
      Stroke = factor(MCQ160F, levels = c(1, 2),
                      labels = c("Yes", "No")),
      Cancer = factor(MCQ220, levels = c(1, 2),
                      labels = c("Yes", "No")),
      Liver = factor(MCQ160L, levels = c(1, 2),
                     labels = c("Yes", "No")),
      Hypertension = factor(BPQ020, levels = c(1, 2),
                            labels = c("Yes", "No")),
      Diabetes = factor(DIQ010, levels = 1:3,
                        labels = c("Yes", "No", "Borderline"))
    )
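  # svydesign() has no `options` argument (it would be absorbed by `...`);
  # lonely-PSU handling is set through the global survey option instead.
  options(survey.lonely.psu = "adjust")
  design <- svydesign(
    id = ~SDMVPSU,
    strata = ~SDMVSTRA,
    weights = ~wt,
    nest = TRUE,
    data = df
  )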
  return(design)
}

In [28]:
# Saving weighted results in docx format

nhanes_design <- create_nhanes_design(df_final_merged)

result_table_weighted <- create_descriptive_table(df_final_merged, nhanes_design)
flex_table_weighted <- result_table_weighted %>% as_flex_table()

save_as_docx(flex_table_weighted,
            path = "/Users/silvanoquarto/Desktop/LAVORO/MEDICAL_PHYSICS/Med-Physics/results/NHANES_09_14_teeth/multi_labels_weighted_descriptive_analysis.docx")

Age (years) p-value: 3.36317085020386e-21

Ratio of family income p-value: 1.92157997592047e-22

Normality test results:

Age (years): Non-normal

Ratio of family income: Non-normal

[1m[22m[33m![39m For variable `Ethnicity` (`teeth_category`) and [34m"statistic"[39m, [34m"p.value"[39m, and
  [34m"parameter"[39m statistics: [33mChi-squared approximation may be incorrect[39m

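The `Weighted N` column of the weighted table is obtained by summing the analysis weights over the respondents in each row (all non-missing values for a variable, or one level of a categorical variable). A base-R sketch of that idea on invented data follows; the helper `weighted_pop_millions` is hypothetical and only mirrors the logic of `calculate_weighted_pop()`.

```r
# Invented respondents with pooled weights (NOT NHANES data)
toy <- data.frame(
  Gender = factor(c("Male", "Female", "Female", NA, "Male")),
  wt     = c(2.5e6, 1.0e6, 3.0e6, 0.5e6, 1.5e6)
)

# Hypothetical helper: sum the weights of rows with non-missing values,
# optionally restricted to a single level, and express the result in millions
weighted_pop_millions <- function(x, w, level = NULL) {
  keep <- !is.na(x)
  if (!is.null(level)) keep <- keep & x == level
  sum(w[keep]) / 1e6
}

weighted_pop_millions(toy$Gender, toy$wt)            # all non-missing rows
weighted_pop_millions(toy$Gender, toy$wt, "Female")  # one level only
```

Summing weights gives point estimates only; standard errors would still need the full survey design.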

##### Binary label

In [None]:
# Descriptive analysis with two classes: "< 20 teeth", ">= 20 teeth"

create_descriptive_table <- function(df, survey_design = NULL) {
  require(gtsummary)
  require(dplyr)
  
  # Flag: TRUE when a survey design is supplied (weighted table)
  use_survey_design <- !is.null(survey_design)
  
  # Recode and transform variables
  df <- df %>%
    mutate(
      has_20_or_more_teeth = factor(has_20_or_more_teeth, levels = c(0, 1),
                                    labels = c("< 20 teeth", ">= 20 teeth")),
      
      # Continuous variables
      `Age (years)` = RIDAGEYR,
      `Number of teeth` = total_teeth,
      `Ratio of family income` = INDFMPIR,
      
      # Categorical and Ordinal variables
      Gender = factor(RIAGENDR, levels = c(1, 2), 
                      labels = c("Male", "Female")),
      
      Education = factor(DMDEDUC2, levels = 1:5,
                         labels = c("Less than 9th grade", "9-11th grade",
                                    "High school graduate",
                                    "Some college/AA degree",
                                    "College graduate or above"),
                         ordered = TRUE),
      
      Ethnicity = factor(RIDRETH1, levels = 1:5, 
                         labels = c("Mexican American", "Other Hispanic",
                                    "Non-Hispanic White", "Non-Hispanic Black",
                                    "Other Race")),
      
      Smoking = factor(SMQ020, levels = c(1, 2),
                       labels = c("Yes", "No")),
      
      `Alcohol intake` = factor(ALQ101, levels = c(1, 2),
                                labels = c("Over 12 alcohol drinks/1 yr",
                                "Under 12 alcohol drinks/1 yr")),
      
      `Heart Failure` = factor(MCQ160B, levels = c(1, 2), 
                               labels = c("Yes", "No")),
      `Coronary Heart` = factor(MCQ160C, levels = c(1, 2),
                                labels = c("Yes", "No")),
      Angina = factor(MCQ160D, levels = c(1, 2),
                      labels = c("Yes", "No")),
      `Heart Attack` = factor(MCQ160E, levels = c(1, 2),
                              labels = c("Yes", "No")),
      Stroke = factor(MCQ160F, levels = c(1, 2),
                      labels = c("Yes", "No")),
      Cancer = factor(MCQ220, levels = c(1, 2),
                      labels = c("Yes", "No")),
      Liver = factor(MCQ160L, levels = c(1, 2),
                     labels = c("Yes", "No")),
      Hypertension = factor(BPQ020, levels = c(1, 2),
                            labels = c("Yes", "No")),
      Diabetes = factor(DIQ010, levels = 1:3,
                        labels = c("Yes", "No", "Borderline"))
    )
  
  # Normality test a priori for continuous variables
  continuous_vars <- c("Age (years)", "Ratio of family income")
  
  normality_results <- list()
  
  for (var in continuous_vars) {
    # Limit: 5000 observations for Shapiro-Wilk test
    if (length(na.omit(df[[var]])) > 5000) {
      sample_data <- sample(na.omit(df[[var]]), 5000)
    } else {
      sample_data <- na.omit(df[[var]])
    }
    
    test_result <- shapiro.test(sample_data)
    normality_results[[var]] <- test_result$p.value > 0.05
    message(var, " p-value: ", test_result$p.value)
  }
  
  message("Normality test results:")
  for (var in names(normality_results)) {
    message(var, ": ", ifelse(normality_results[[var]], "Normal", "Non-normal"))
  }
  
  variables_to_include <- c(
    "Age (years)", "Gender", "Ethnicity", "Education", "Ratio of family income",
    "Number of teeth", "Smoking", "Alcohol intake",
    "Heart Failure", "Coronary Heart", "Angina", "Heart Attack",
    "Stroke", "Cancer", "Liver", "Hypertension", "Diabetes"
  )
  
  # Variables using median (IQR) or mean (SD)
  median_vars <- names(normality_results)[!unlist(normality_results)]
  median_vars <- c(median_vars, "Number of teeth")

  stat_labels <- list(
    "Age (years)" = "Age (mean, SD)",
    "Number of teeth" = "Number of teeth (median, IQR)",
    "Ratio of family income" = "Ratio of family income (mean, SD)",
    "Gender" = "Gender (n, %)",
    "Ethnicity" = "Ethnicity (n, %)",
    "Education" = "Education (n, %)",
    "Smoking" = "Smoking (n, %)",
    "Alcohol intake" = "Alcohol intake (n, %)",
    "Heart Failure" = "Heart Failure (n, %)",
    "Coronary Heart" = "Coronary Heart (n, %)",
    "Angina" = "Angina (n, %)",
    "Heart Attack" = "Heart Attack (n, %)",
    "Stroke" = "Stroke (n, %)",
    "Cancer" = "Cancer (n, %)",
    "Liver" = "Liver (n, %)",
    "Hypertension" = "Hypertension (n, %)",
    "Diabetes" = "Diabetes (n, %)"
  )
  
  for (var in names(normality_results)) {
    if (normality_results[[var]]) {
      stat_labels[[var]] <- gsub("\\(median, IQR\\)", "(mean, SD)", stat_labels[[var]])
    } else {
      stat_labels[[var]] <- gsub("\\(mean, SD\\)", "(median, IQR)", stat_labels[[var]])
    }
  }

  # Define statistics
  stat_list <- list(
    all_continuous() ~ "{mean} ({sd})",
    all_categorical() ~ "{n} ({p}%)"
  )
  for (var in median_vars) {
    stat_list[[var]] <- "{median} ({p25}, {p75})"
  }
  
  # Create table with gtsummary
  if (use_survey_design) {
    # Survey design if weighted results
    table_strat <- survey_design %>%
      tbl_svysummary(
        by = has_20_or_more_teeth,
        include = all_of(variables_to_include),
        statistic = stat_list,
        label = stat_labels,
        missing = "ifany",
        missing_text = "Missing",
        value = all_categorical() ~ "level",
        type = all_categorical() ~ "categorical"
      ) %>%
      add_p() %>%
      add_overall()
  } else {
    # Without survey design
    table_strat <- df %>%
      tbl_summary(
        by = has_20_or_more_teeth,
        include = all_of(variables_to_include),
        statistic = stat_list,
        label = stat_labels,
        missing = "ifany",
        missing_text = "Missing",
        digits = all_continuous() ~ 2,
        value = all_categorical() ~ "level",
        type = all_categorical() ~ "categorical"
      ) %>%
      add_p(
        test = list(
          continuous_vars[unlist(normality_results[continuous_vars])] ~ "t.test", 
          continuous_vars[!unlist(normality_results[continuous_vars])] ~ "wilcox.test",
          `Number of teeth` ~ "wilcox.test",
          all_categorical() ~ "chisq.test",
          Education ~ "kruskal.test"
        )
      ) %>%
      add_overall()
  }
  
  # Calculate N for each group
  total_n <- nrow(df)
  n_by_group <- df %>%
    group_by(has_20_or_more_teeth) %>%
    summarize(n = n()) %>%
    pull(n, name = has_20_or_more_teeth)
  
  table_strat <- table_strat %>%
    modify_header(
      label = "**Characteristics**",
      stat_0 = paste0("**Total**\nN = ", total_n), 
      stat_1 = paste0("**< 20 teeth**\nN = ", ifelse("< 20 teeth" %in% names(n_by_group), n_by_group["< 20 teeth"], 0)), 
      stat_2 = paste0("**>= 20 teeth**\nN = ", ifelse(">= 20 teeth" %in% names(n_by_group), n_by_group[">= 20 teeth"], 0)),
      p.value = "**P-value**"
    )
  
  # Notes
  table_strat <- table_strat %>%
    modify_footnote(
      update = all_stat_cols() ~ "Values are n (%) for categorical variables, median (IQR) for non-normally distributed continuous variables, and mean (SD) for normally distributed continuous variables."
    )
  
  return(table_strat)
}

In [None]:
# Descriptive analysis with two classes: "< 20 teeth", ">= 20 teeth"

create_descriptive_table_fixed <- function(df, survey_design = NULL) {
  require(gtsummary)
  require(dplyr)
  require(survey)
  
  # Check if survey design is provided
  use_survey_design <- !is.null(survey_design)
  
  # Recode and transform variables
  df <- df %>%
    mutate(
      has_20_or_more_teeth = factor(has_20_or_more_teeth, levels = c(0, 1),
                                    labels = c("< 20 teeth", ">= 20 teeth")),
      
      # Continuous variables
      `Age (years)` = RIDAGEYR,
      `Number of teeth` = total_teeth,
      `Ratio of family income` = INDFMPIR,
      
      # Categorical and Ordinal variables
      Gender = factor(RIAGENDR, levels = c(1, 2), 
                      labels = c("Male", "Female")),
      
      Education = factor(DMDEDUC2, levels = 1:5,
                         labels = c("Less than 9th grade", "9-11th grade",
                                    "High school graduate",
                                    "Some college/AA degree",
                                    "College graduate or above"),
                         ordered = TRUE),
      
      Ethnicity = factor(RIDRETH1, levels = 1:5, 
                         labels = c("Mexican American", "Other Hispanic",
                                    "Non-Hispanic White", "Non-Hispanic Black",
                                    "Other Race")),
      
      Smoking = factor(SMQ020, levels = c(1, 2),
                       labels = c("Yes", "No")),
      
      `Alcohol intake` = factor(ALQ101, levels = c(1, 2),
                                labels = c("Over 12 alcohol drinks/1 yr",
                                "Under 12 alcohol drinks/1 yr")),
      
      `Heart Failure` = factor(MCQ160B, levels = c(1, 2), 
                               labels = c("Yes", "No")),
      `Coronary Heart` = factor(MCQ160C, levels = c(1, 2),
                                labels = c("Yes", "No")),
      Angina = factor(MCQ160D, levels = c(1, 2),
                      labels = c("Yes", "No")),
      `Heart Attack` = factor(MCQ160E, levels = c(1, 2),
                              labels = c("Yes", "No")),
      Stroke = factor(MCQ160F, levels = c(1, 2),
                      labels = c("Yes", "No")),
      Cancer = factor(MCQ220, levels = c(1, 2),
                      labels = c("Yes", "No")),
      Liver = factor(MCQ160L, levels = c(1, 2),
                     labels = c("Yes", "No")),
      Hypertension = factor(BPQ020, levels = c(1, 2),
                            labels = c("Yes", "No")),
      Diabetes = factor(DIQ010, levels = 1:3,
                        labels = c("Yes", "No", "Borderline"))
    )
  
  # Normality test a priori for continuous variables
  continuous_vars <- c("Age (years)", "Ratio of family income")
  
  normality_results <- list()
  
  for (var in continuous_vars) {
    # Limit: 5000 observations for Shapiro-Wilk test
    if (length(na.omit(df[[var]])) > 5000) {
      sample_data <- sample(na.omit(df[[var]]), 5000)
    } else {
      sample_data <- na.omit(df[[var]])
    }
    
    test_result <- shapiro.test(sample_data)
    normality_results[[var]] <- test_result$p.value > 0.05
    message(var, " p-value: ", test_result$p.value)
  }
  
  message("Normality test results:")
  for (var in names(normality_results)) {
    message(var, ": ", ifelse(normality_results[[var]], "Normal", "Non-normal"))
  }
  
  variables_to_include <- c(
    "Age (years)", "Gender", "Ethnicity", "Education", "Ratio of family income",
    "Number of teeth", "Smoking", "Alcohol intake",
    "Heart Failure", "Coronary Heart", "Angina", "Heart Attack",
    "Stroke", "Cancer", "Liver", "Hypertension", "Diabetes"
  )
  
  # Variables using median (IQR) or mean (SD)
  median_vars <- names(normality_results)[!unlist(normality_results)]
  median_vars <- c(median_vars, "Number of teeth")

  stat_labels <- list(
    "Age (years)" = "Age (mean, SD)",
    "Number of teeth" = "Number of teeth (median, IQR)",
    "Ratio of family income" = "Ratio of family income (mean, SD)",
    "Gender" = "Gender (n, %)",
    "Ethnicity" = "Ethnicity (n, %)",
    "Education" = "Education (n, %)",
    "Smoking" = "Smoking (n, %)",
    "Alcohol intake" = "Alcohol intake (n, %)",
    "Heart Failure" = "Heart Failure (n, %)",
    "Coronary Heart" = "Coronary Heart (n, %)",
    "Angina" = "Angina (n, %)",
    "Heart Attack" = "Heart Attack (n, %)",
    "Stroke" = "Stroke (n, %)",
    "Cancer" = "Cancer (n, %)",
    "Liver" = "Liver (n, %)",
    "Hypertension" = "Hypertension (n, %)",
    "Diabetes" = "Diabetes (n, %)"
  )
  
  for (var in names(normality_results)) {
    if (normality_results[[var]]) {
      stat_labels[[var]] <- gsub("\\(median, IQR\\)", "(mean, SD)", stat_labels[[var]])
    } else {
      stat_labels[[var]] <- gsub("\\(mean, SD\\)", "(median, IQR)", stat_labels[[var]])
    }
  }

  # Define statistics
  stat_list <- list(
    all_continuous() ~ "{mean} ({sd})",
    all_categorical() ~ "{n} ({p}%)"
  )
  for (var in median_vars) {
    stat_list[[var]] <- "{median} ({p25}, {p75})"
  }
  
    # Create table with gtsummary - use tbl_summary for consistent display
    # but calculate weighted population for the additional column
    table_strat <- df %>%
      tbl_summary(
        by = has_20_or_more_teeth,
        include = all_of(variables_to_include),
        statistic = stat_list,
        label = stat_labels,
        missing = "ifany",
        missing_text = "Missing",
        digits = all_continuous() ~ 2,
        value = all_categorical() ~ "level",
        type = all_categorical() ~ "categorical"
      ) %>%
      add_p(
        test = list(
          continuous_vars[unlist(normality_results[continuous_vars])] ~ "t.test", 
          continuous_vars[!unlist(normality_results[continuous_vars])] ~ "wilcox.test",
          `Number of teeth` ~ "wilcox.test",
          all_categorical() ~ "chisq.test",
          Education ~ "kruskal.test"
        )
      ) %>%
      add_overall()
  
  # Calculate N for each group
  total_n <- nrow(df)
  n_by_group <- df %>%
    group_by(has_20_or_more_teeth) %>%
    summarize(n = n()) %>%
    pull(n, name = has_20_or_more_teeth)
  
  # Add weighted N in millions column only if survey design is provided
  if (use_survey_design) {
    # Total population estimate
    total_pop_in_millions <- tryCatch({
      total_pop_estimate <- svytotal(~1, survey_design, na.rm = TRUE)
      coef(total_pop_estimate) / 1000000
    }, error = function(e) {
      message("Using alternative method to calculate total population")
      # Alternative approach: sum all weights
      sum(weights(survey_design, "analysis")) / 1000000
    })
    
    # Population estimates by group
    pop_by_group_in_millions <- tryCatch({
      pop_by_group <- svyby(~1, ~has_20_or_more_teeth, survey_design, svytotal, na.rm = TRUE)
      coef(pop_by_group) / 1000000
    }, error = function(e) {
      message("Using alternative method to calculate group populations")
      result <- tapply(weights(survey_design, "analysis"),
                      survey_design$variables$has_20_or_more_teeth,
                      sum) / 1000000
      result
    })
    
    table_strat <- table_strat %>%
      modify_table_body(
        ~.x %>%
          dplyr::mutate(
            weighted_n = case_when(
              is.na(row_type) ~ "", 
              row_type == "label_header" ~ "**Weighted N (Millions)**",
              row_type == "label" ~ "",
              row_type == "level" & !is.na(variable) & !is.na(label) ~ "",
              TRUE ~ ""
            )
          ) %>%
          dplyr::relocate(weighted_n, .before = stat_0)
      )
    
    # Add weighted N column as an additional column
    table_strat <- table_strat %>%
      modify_header(
        label = "**Characteristics**",
        weighted_n = paste0("**Weighted N**\n**in Millions**")
      )
    
    # Add weighted population estimates for each row
    if (use_survey_design) {
      # Function to calculate weighted population for a specific variable and level
      calculate_weighted_pop <- function(var_name, level = NULL) {
        tryCatch({
          if (is.null(level)) {
            # For continuous variables: sum weights where variable is not NA
            var_data <- survey_design$variables[[var_name]]
            weights_sum <- sum(weights(survey_design, "analysis")[!is.na(var_data)]) / 1000000
            return(weights_sum)
          } else {
            # For categorical variables and specific levels
            var_data <- survey_design$variables[[var_name]]
            level_match <- var_data == level & !is.na(var_data)
            weights_sum <- sum(weights(survey_design, "analysis")[level_match]) / 1000000
            return(weights_sum)
          }
        }, error = function(e) {
          message("Error calculating weighted population for ", var_name, 
                  if(!is.null(level)) paste(" level:", level), ": ", e$message)
          return(NA)
        })
      }
      
      # Update each row with weighted population estimate
      table_strat$table_body <- table_strat$table_body %>%
        rowwise() %>%
        mutate(
          weighted_n = case_when(
            !is.na(row_type) & row_type == "label" & !is.na(variable) ~ 
              sprintf("%.2f", calculate_weighted_pop(variable)),
            !is.na(row_type) & row_type == "level" & !is.na(variable) & !is.na(label) ~
              sprintf("%.2f", calculate_weighted_pop(variable, label)),
            TRUE ~ weighted_n
          )
        ) %>%
        ungroup()
    }
  } else {
    # If no survey design, just modify the headers without weighted N
    table_strat <- table_strat %>%
      modify_header(
        label = "**Characteristics**",
        stat_0 = paste0("**Total**\nN = ", total_n), 
        stat_1 = paste0("**< 20 teeth**\nN = ", ifelse("< 20 teeth" %in% names(n_by_group), n_by_group["< 20 teeth"], 0)), 
        stat_2 = paste0("**>= 20 teeth**\nN = ", ifelse(">= 20 teeth" %in% names(n_by_group), n_by_group[">= 20 teeth"], 0)),
        p.value = "**P-value**"
      )
  }
  
  # Notes
  table_strat <- table_strat %>%
    modify_footnote(
      update = all_stat_cols() ~ "Values are n (%) for categorical variables, median (IQR) for non-normally distributed continuous variables, and mean (SD) for normally distributed continuous variables."
    )
  
  if (use_survey_design) {
    # modify_footnote() has no `add` argument; attach the note to the weighted_n column
    table_strat <- table_strat %>%
      modify_footnote(
        weighted_n ~ "Weighted N in millions represents the estimated US population based on NHANES survey weights."
      )
  }
  
  return(table_strat)
}

In [61]:
# Saving un-weighted results in docx format

result_table <- create_descriptive_table_fixed(df_test)
flex_table <- result_table %>% as_flex_table()

library(flextable)
library(officer)
save_as_docx(flex_table, path = "/Users/silvanoquarto/Desktop/LAVORO/MEDICAL_PHYSICS/Med-Physics/results/NHANES_09_14_teeth/final_tabella_descrittiva.docx")

Age (years) p-value: 3.36317085020386e-21

Ratio of family income p-value: 1.92157997592047e-22

Normality test results:

Age (years): Non-normal

Ratio of family income: Non-normal
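
Both variables are flagged as non-normal. With samples this large, Shapiro-Wilk tends to reject even for mild departures from normality, and the unseeded `sample()` call makes the subsample differ between runs whenever more than 5000 values are available. The sketch below shows one way to make that step reproducible; `shapiro_flag` and its `seed` argument are hypothetical, and the data are simulated rather than taken from NHANES.

```r
# Hypothetical helper: Shapiro-Wilk on at most `max_n` observations, with a fixed seed
shapiro_flag <- function(x, max_n = 5000, alpha = 0.05, seed = 42) {
  x <- x[!is.na(x)]
  if (length(x) > max_n) {
    set.seed(seed)               # make the subsample reproducible
    x <- sample(x, max_n)
  }
  p <- shapiro.test(x)$p.value   # shapiro.test() accepts between 3 and 5000 values
  list(p.value = p, normal = p > alpha, n_used = length(x))
}

# Simulated right-skewed data (illustration only, NOT NHANES values)
set.seed(1)
skewed <- rexp(6000, rate = 1)
res <- shapiro_flag(skewed)
res$n_used
res$normal
```

A histogram or Q-Q plot of the real variables is a useful complement, since the p-value alone says little about how large the departure from normality is.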



In [9]:
# Select weights

# Weights from Demographic datasets

demo_09_10 <- read_xpt(file.path(path_to_data_09_10, "DEMO_F.xpt"))

demo_09_10_weights <- demo_09_10 %>%
    select(SEQN, RIDAGEYR, WTMEC2YR, SDMVPSU, SDMVSTRA)

demo_11_12 <- read_xpt(file.path(path_to_data_11_12, "DEMO_G.xpt.txt"))

demo_11_12_weights <- demo_11_12 %>%
    select(SEQN, RIDAGEYR, WTMEC2YR, SDMVPSU, SDMVSTRA)

demo_13_14 <- read_xpt(file.path(path_to_data_13_14, "DEMO_H.xpt.txt"))

demo_13_14_weights <- demo_13_14 %>%
    select(SEQN, RIDAGEYR, WTMEC2YR, SDMVPSU, SDMVSTRA)

weights_09_10 <- list(
  demo_09_10_weights
)

weights_11_12 <- list(
  demo_11_12_weights
)

weights_13_14 <- list(
  demo_13_14_weights
)

# Horizontal join by SEQN within each period (2009/10, 2011/12, 2013/14);
# each list currently holds a single table, so Reduce() returns it unchanged

wt_09_10 <- Reduce(function(x, y) full_join(x, y, by = "SEQN"), weights_09_10)

wt_11_12 <- Reduce(function(x, y) full_join(x, y, by = "SEQN"), weights_11_12)

wt_13_14 <- Reduce(function(x, y) full_join(x, y, by = "SEQN"), weights_13_14)

# Stack the three cycles row-wise

wt_final <- bind_rows(wt_09_10, wt_11_12, wt_13_14)

print("Dimensions before removing NA values")
dim(wt_final)

# Keep participants aged 60 or older

wt_final_age_60 <- subset(wt_final, RIDAGEYR >= 60)

print("Dimensions with AGE >= 60")
dim(wt_final_age_60)

# Drop RIDAGEYR so it is not duplicated when joining with df_test, which already contains it
wt_final_age_60 <- subset(wt_final_age_60, select = -RIDAGEYR)
head(wt_final_age_60)

# Join the weights and design variables to the final analysis data frame by SEQN

df_final_merged <- df_test %>%
  inner_join(wt_final_age_60, by = "SEQN")

dim(df_final_merged)

[1] "Dimensions before removing NA values"


[1] "Dimensions with AGE >= 60"


SEQN,WTMEC2YR,SDMVPSU,SDMVSTRA
<dbl>,<dbl>,<dbl>,<dbl>
51628,21000.339,2,75
51633,12381.115,1,77
51635,22502.507,1,79
51645,9590.458,1,75
51654,55670.35,2,86
51661,6385.327,2,88
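
Because the join on `SEQN` is an inner join, participants without a matching weight row are dropped silently. A small check along the following lines can make that visible. This is a base R sketch on made-up data: `analysis_df` and `weights_df` are hypothetical stand-ins for `df_test` and `wt_final_age_60`, and `merge()` is used here as the base R counterpart of `inner_join()`.

```r
# Hypothetical stand-ins for df_test and wt_final_age_60 (values are made up)
analysis_df <- data.frame(SEQN = c(101, 102, 103, 104),
                          total_teeth = c(6, 26, 20, 28))
weights_df <- data.frame(SEQN = c(101, 102, 104, 105),
                         WTMEC2YR = c(12000, 55000, 42000, 9000),
                         SDMVPSU = c(1, 2, 2, 1),
                         SDMVSTRA = c(77, 86, 82, 75))

# Participants in the analysis data that have no matching weight row
missing_weights <- setdiff(analysis_df$SEQN, weights_df$SEQN)

# Base R equivalent of an inner join on SEQN
merged <- merge(analysis_df, weights_df, by = "SEQN")

# SEQN should stay unique after the join; duplicates would inflate the weights
has_duplicates <- anyDuplicated(merged$SEQN) > 0

cat("Rows before join:", nrow(analysis_df), "\n")
cat("Rows after join: ", nrow(merged), "\n")
cat("SEQN without weights:", missing_weights, "\n")
```

Comparing the row counts before and after the join, together with the list of unmatched `SEQN` values, shows whether any participants were lost at this step.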


In [10]:
# Preprocessing for WTMEC2YR: divide it by the number of NHANES cycles pooled (3 in our case)

df_final_merged$wt <- df_final_merged$WTMEC2YR / 3

head(df_final_merged)

,SEQN,RIAGENDR,RIDAGEYR,RIDRETH1,DMDEDUC2,INDFMPIR,ALQ101,SMQ020,DLQ020,MCQ160B,...,WHQ060,DLQ010,total_teeth,has_20_or_more_teeth,edentulous,teeth_category,WTMEC2YR,SDMVPSU,SDMVSTRA,wt
,<dbl>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,...,<int>,<int>,<int>,<int>,<int>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
1,51633,1,80,3,4,1.27,1,1,1,2,...,1,2,6,0,0,Severe Loss (1-9),12381.115,1,77,4127.038
2,51654,1,66,3,4,2.2,1,1,2,2,...,2,2,26,1,0,Nearly Complete (25-32),55670.35,2,86,18556.783
3,51680,2,60,4,4,2.59,1,1,1,2,...,1,2,20,1,0,Sufficient (20-24),18341.27,1,79,6113.757
4,51687,1,78,3,5,5.0,1,1,2,2,...,2,2,28,1,0,Nearly Complete (25-32),42248.559,2,82,14082.853
5,51736,2,60,1,3,0.68,1,1,2,2,...,1,2,21,1,0,Sufficient (20-24),7084.377,1,81,2361.459
6,51861,1,80,3,3,4.94,1,2,2,2,...,1,1,8,0,0,Severe Loss (1-9),22127.661,1,76,7375.887
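
The reason for dividing by the number of cycles can be illustrated with a toy example: each 2-year weight already scales its cycle up to the full population, so stacking three cycles without rescaling would count that population three times. The sketch below uses base R and made-up weights; the data frame `toy` is hypothetical.

```r
# Hypothetical toy data: two participants per cycle with made-up 2-year MEC weights
toy <- data.frame(
  cycle = rep(c("2009-10", "2011-12", "2013-14"), each = 2),
  WTMEC2YR = c(100, 200, 150, 150, 120, 180)
)

n_cycles <- length(unique(toy$cycle))

# Pooled-cycle weight: 2-year weight divided by the number of pooled cycles
toy$wt <- toy$WTMEC2YR / n_cycles

# Each cycle on its own estimates a population of 300 in this toy example
per_cycle_total <- tapply(toy$WTMEC2YR, toy$cycle, sum)
print(per_cycle_total)

# Naive stacking triples the estimated population; the rescaled weights do not
cat("Sum of unscaled weights:", sum(toy$WTMEC2YR), "\n")
cat("Sum of rescaled weights:", sum(toy$wt), "\n")
```

Deriving `n_cycles` from the data instead of hard-coding 3 is a design choice of this sketch; it would require a cycle identifier column, which the data frame in this notebook may not carry.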


In [64]:
# Function to create the survey design for NHANES datasets (to be updated for the case of three different two-year periods)

create_nhanes_design <- function(df) {

  # Recode and transform variables
  df <- df %>%
    mutate(
      has_20_or_more_teeth = factor(has_20_or_more_teeth, levels = c(0, 1),
                                    labels = c("< 20 teeth", ">= 20 teeth")),
      
      # Continuous variables
      `Age (years)` = RIDAGEYR,
      `Number of teeth` = total_teeth,
      `Ratio of family income` = INDFMPIR,
      
      # Categorical and Ordinal variables
      Gender = factor(RIAGENDR, levels = c(1, 2), 
                      labels = c("Male", "Female")),
      
      Education = factor(DMDEDUC2, levels = 1:5,
                         labels = c("Less than 9th grade", "9-11th grade",
                                    "High school graduate",
                                    "Some college/AA degree",
                                    "College graduate or above"),
                         ordered = TRUE),
      
      Ethnicity = factor(RIDRETH1, levels = 1:5, 
                         labels = c("Mexican American", "Other Hispanic",
                                    "Non-Hispanic White", "Non-Hispanic Black",
                                    "Other Race")),
      
      Smoking = factor(SMQ020, levels = c(1, 2),
                       labels = c("Yes", "No")),
      
      # ALQ101: had at least 12 alcohol drinks in one year (1 = Yes, 2 = No)
      `Alcohol intake` = factor(ALQ101, levels = c(1, 2),
                                labels = c("Over 12 alcohol drinks/1 yr",
                                           "Under 12 alcohol drinks/1 yr")),
      
      `Heart Failure` = factor(MCQ160B, levels = c(1, 2), 
                               labels = c("Yes", "No")),
      `Coronary Heart` = factor(MCQ160C, levels = c(1, 2),
                                labels = c("Yes", "No")),
      Angina = factor(MCQ160D, levels = c(1, 2),
                      labels = c("Yes", "No")),
      `Heart Attack` = factor(MCQ160E, levels = c(1, 2),
                              labels = c("Yes", "No")),
      Stroke = factor(MCQ160F, levels = c(1, 2),
                      labels = c("Yes", "No")),
      Cancer = factor(MCQ220, levels = c(1, 2),
                      labels = c("Yes", "No")),
      Liver = factor(MCQ160L, levels = c(1, 2),
                     labels = c("Yes", "No")),
      Hypertension = factor(BPQ020, levels = c(1, 2),
                            labels = c("Yes", "No")),
      Diabetes = factor(DIQ010, levels = 1:3,
                        labels = c("Yes", "No", "Borderline"))
    )

  # Complex survey design: PSUs nested within strata, weighted by the pooled-cycle weight `wt`
  design <- svydesign(
    id = ~SDMVPSU,
    strata = ~SDMVSTRA,
    weights = ~wt,
    nest = TRUE,
    data = df
  )
  return(design)
}
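
One consequence of recoding with explicit `levels` is worth keeping in mind: any code not listed in `levels` becomes `NA`. For the yes/no questionnaire items this means that "Refused" (7) and "Don't know" (9) responses are treated as missing. A minimal base R sketch with a hypothetical vector of raw codes:

```r
# Hypothetical raw codes for a yes/no item:
# 1 = Yes, 2 = No, 7 = Refused, 9 = Don't know, NA = not asked / missing
raw_codes <- c(1, 2, 2, 7, 9, 1, NA)

recoded <- factor(raw_codes, levels = c(1, 2), labels = c("Yes", "No"))

# Codes not listed in `levels` (here 7 and 9) become NA alongside the original NA
print(table(recoded, useNA = "ifany"))

# Number of non-missing responses that the recoding turned into NA
n_set_to_na <- sum(is.na(recoded)) - sum(is.na(raw_codes))
print(n_set_to_na)
```

Counting `n_set_to_na` for each recoded variable gives a quick view of how many responses are excluded by the recoding rather than by genuinely missing data.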

In [65]:
# Saving weighted results in docx format

nhanes_design <- create_nhanes_design(df_final_merged)

result_table_weighted <- create_descriptive_table_fixed(df_final_merged, nhanes_design)
flex_table_weighted <- result_table_weighted %>% as_flex_table()

save_as_docx(
  flex_table_weighted,
  path = "/Users/silvanoquarto/Desktop/LAVORO/MEDICAL_PHYSICS/Med-Physics/results/NHANES_09_14_teeth/final_weighted_descriptive_analysis.docx"
)

Age (years) p-value: 3.36317085020386e-21

Ratio of family income p-value: 1.92157997592047e-22

Normality test results:

Age (years): Non-normal

Ratio of family income: Non-normal

Using alternative method to calculate total population

Using alternative method to calculate group populations



### Regression Analysis

#### Load Preprocessed Data + Recoded outcome and some variables