# NHANES project about Number of Teeth and Intrinsic Capacity: descriptive and regression analysis
> This notebook collects all the analyses run on the NHANES dataset for a medical paper project.

Requirements and Information:
1. NHANES dataset, cycles 2009/10 to 2013/14
2. 5 Intrinsic Capacity domains
    1. Locomotion:
        - Difficulty standing up from an armless chair (PFQ061I)
    2. Cognitive Function:
        - Trouble concentrating on things (DPQ070)
    3. Vitality:
        - Weight change intentional (WHQ060)
        - Poor appetite or overeating (DPQ050)
    4. Psychological status:
        - Feeling down, depressed, or hopeless (DPQ020)
        - Have little interest in doing things (DPQ010)
    5. Sensory domain:
        - Have serious difficulty hearing? (DLQ010)
        - Have serious difficulty seeing? (DLQ020)
3. Outcome:
    - number of teeth (OHXDEN)
    - consists of 2 categories: patients with < 20 teeth, patients with >= 20 teeth
4. Confounding Variables:
    - Gender (RIAGENDR)
    - Age at screening (RIDAGEYR)
    - Race (RIDRETH1)
    - Education (DMDEDUC2)
    - Poverty income ratio (INDFMPIR)
    - Smoking status (SMQ020)
5. Mediators:
    - Heart failure (MCQ160B)
    - Coronary heart disease (MCQ160C)
    - Stroke (MCQ160F)
    - Liver disease (MCQ160L)
    - Cancer (MCQ220)
    - Diabetes (DIQ010)
    - High blood pressure (BPQ020)
6. Age >= 60

TO-DO LIST:

1. Replicate the whole analysis already done for periodontitis.
2. For now, the only change is to replace periodontitis with the binary **number of teeth** variable.
3. Data preparation is practically ready: data loading, building a single df, feature selection, patient exclusion. The only change concerns the handling of teeth (what to do with a patient who has 30 values for the number of teeth but two NA? RESOLVED: if the variable item was "tooth not present", it means tooth loss. ALSO: how should edentulous patients be handled?)
4. Start right away with a descriptive analysis --> follow the examples given by Vittorio.
5. Are the confounders and mediators fine, or does something need to be added or removed?
6. Then run a univariate analysis straight away; the handling of the IC features will stay the same.

## NHANES 2009/10 to 2013/14: Intrinsic Capacity and Teeth Counts

### Import Libraries

In [1]:
library(haven)
library(nhanesA)
library(survey)
library(MASS)      # loaded before dplyr so that dplyr::select() masks MASS::select()
library(dplyr)
library(tidyr)
library(tidyverse)
library(ggplot2)
library(readr)
library(flextable)
library(officer)
library(nnet)
library(broom)


### Configurations

In [2]:
# Local folders holding the downloaded XPT files, one per NHANES cycle
path_to_data_09_10 <- "/Users/silvanoquarto/Desktop/LAVORO/MEDICAL_PHYSICS/Med-Physics/data/NHANES/2009_10/"
path_to_data_11_12 <- "/Users/silvanoquarto/Desktop/LAVORO/MEDICAL_PHYSICS/Med-Physics/data/NHANES/2011_12/"
# Note: despite its name, path_to_data_12_13 points to the 2013/14 cycle
path_to_data_12_13 <- "/Users/silvanoquarto/Desktop/LAVORO/MEDICAL_PHYSICS/Med-Physics/data/NHANES/2013_14/"
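
The three folders share a common root, so the paths could also be derived from a single base directory. The sketch below is only an illustration: `nhanes_root`, `cycle_dirs` and `paths` are hypothetical names that do not appear elsewhere in this notebook, and `"path/to/NHANES"` is a placeholder.

```r
# Hypothetical base directory (placeholder): replace with the local NHANES folder
nhanes_root <- "path/to/NHANES"

# One sub-folder per survey cycle
cycle_dirs <- c("2009_10", "2011_12", "2013_14")

# Named character vector mapping each cycle to its full path
paths <- setNames(file.path(nhanes_root, cycle_dirs), cycle_dirs)

paths[["2013_14"]]
```

Indexing by cycle name keeps the variable name and the folder it points to in sync.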

### Load Dataset & Feature Selection

In [3]:
# Datasets for 2009/10 period

demo_09_10 <- read_xpt(file.path(path_to_data_09_10, "DEMO_F.xpt"))

demo_09_10_selected <- demo_09_10 %>%
  select(SEQN, RIAGENDR, RIDAGEYR, RIDRETH1, DMDEDUC2, INDFMPIR)

alcohol_09_10 <- read_xpt(file.path(path_to_data_09_10, "ALQ_F.xpt"))

alcohol_09_10_selected <- alcohol_09_10 %>%
  select(SEQN, ALQ101)

smoking_09_10 <- read_xpt(file.path(path_to_data_09_10, "SMQ_F.xpt.txt"))

smoking_09_10_selected <- smoking_09_10 %>%
    select(SEQN, SMQ020)

med_conditions_09_10 <- read_xpt(file.path(path_to_data_09_10, "MCQ_F.xpt"))

# MCQ140 (trouble seeing even with glasses) is used as the vision item and
# renamed to DLQ020 to match the 2013/14 variable name
med_conditions_09_10_selected <- med_conditions_09_10 %>%
  select(SEQN, MCQ140, MCQ160B, MCQ160C, MCQ160D, MCQ160E, MCQ160F, MCQ160L, MCQ220) %>%
  rename(DLQ020 = MCQ140)


blood_pressure_09_10 <- read_xpt(file.path(path_to_data_09_10, "BPQ_F.xpt"))

blood_pressure_09_10_selected <- blood_pressure_09_10 %>%
    select(SEQN, BPQ020)


diabetes_09_10 <- read_xpt(file.path(path_to_data_09_10, "DIQ_F.xpt"))

diabetes_09_10_selected <- diabetes_09_10 %>%
    select(SEQN, DIQ010)


teeth_09_10 <- read_xpt(file.path(path_to_data_09_10, "OHXDEN_F.xpt.txt"))

selected_cols <- colnames(teeth_09_10)[grepl("^OHX\\d{2}TC", colnames(teeth_09_10))]

teeth_09_10_selected <- teeth_09_10 %>%
    select(SEQN, all_of(selected_cols))


locomotion_09_10 <- read_xpt(file.path(path_to_data_09_10, "PFQ_F.xpt"))

locomotion_09_10_selected <- locomotion_09_10 %>%
    select(SEQN, PFQ061I)


mental_health_09_10 <- read_xpt(file.path(path_to_data_09_10, "DPQ_F.xpt"))

# Harmonize DPQ070 (trouble concentrating, 0-3 frequency scale) with the
# yes/no coding of DLQ040: 0-1 -> 2 (no), 2-3 -> 1 (yes); other codes and NA are kept as-is
mental_health_09_10_selected <- mental_health_09_10 %>%
  select(SEQN, DPQ010, DPQ020, DPQ050, DPQ070) %>%
  mutate(DPQ070 = case_when(
    DPQ070 %in% c(0, 1) ~ 2,
    DPQ070 %in% c(2, 3) ~ 1,
    TRUE ~ DPQ070
  )) %>%
  rename(DLQ040 = DPQ070)


weight_history_09_10 <- read_xpt(file.path(path_to_data_09_10, "WHQ_F.xpt"))

weight_history_09_10_selected <- weight_history_09_10 %>%
    select(SEQN, WHQ060)


audiometry_09_10 <- read_xpt(file.path(path_to_data_09_10, "AUQ_F.xpt"))

# Harmonize AUQ131 (self-rated hearing) with the yes/no coding of DLQ010:
# 1-3 -> 2 (no serious difficulty), 4-6 -> 1 (serious difficulty); other codes and NA are kept as-is
audiometry_09_10_selected <- audiometry_09_10 %>%
  select(SEQN, AUQ131) %>%
  mutate(AUQ131 = case_when(
    AUQ131 %in% c(1, 2, 3) ~ 2,
    AUQ131 %in% c(4, 5, 6) ~ 1,
    TRUE ~ AUQ131
  )) %>%
  rename(DLQ010 = AUQ131)
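
The recoding steps in this cell all follow the same pattern: a multi-level item from an older cycle is collapsed onto the 1 = yes / 2 = no coding used by the 2013/14 disability items. A minimal sketch of that pattern on toy data follows; `recode_yes_no` and `toy` are hypothetical names introduced only for illustration.

```r
library(dplyr)

# Toy item on a 0-3 frequency scale; 9 stands for "don't know", NA for no answer
toy <- tibble(SEQN = 1:6, DPQ070 = c(0, 1, 2, 3, 9, NA))

# Hypothetical helper: collapse the listed codes onto 1 = yes / 2 = no
recode_yes_no <- function(x, yes_codes, no_codes) {
  case_when(
    x %in% yes_codes ~ 1,
    x %in% no_codes  ~ 2,
    TRUE             ~ x   # other codes and NA pass through for later exclusion
  )
}

toy <- toy %>%
  mutate(DLQ040 = recode_yes_no(DPQ070, yes_codes = c(2, 3), no_codes = c(0, 1)))

toy$DLQ040
```

Because the unmapped codes pass through unchanged, the later exclusion step can still find and drop them.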

In [4]:
# Datasets for 2011/12 period

demo_11_12 <- read_xpt(file.path(path_to_data_11_12, "DEMO_G.xpt.txt"))

demo_11_12_selected <- demo_11_12 %>%
  select(SEQN, RIAGENDR, RIDAGEYR, RIDRETH1, DMDEDUC2, INDFMPIR)

alcohol_11_12 <- read_xpt(file.path(path_to_data_11_12, "ALQ_G.xpt.txt"))

alcohol_11_12_selected <- alcohol_11_12 %>%
  select(SEQN, ALQ101)


smoking_11_12 <- read_xpt(file.path(path_to_data_11_12, "SMQ_G.xpt.txt"))

smoking_11_12_selected <- smoking_11_12 %>%
    select(SEQN, SMQ020)


med_conditions_11_12 <- read_xpt(file.path(path_to_data_11_12, "MCQ_G.xpt.txt"))

# MCQ140 (trouble seeing even with glasses) is used as the vision item and
# renamed to DLQ020 to match the 2013/14 variable name
med_conditions_11_12_selected <- med_conditions_11_12 %>%
  select(SEQN, MCQ140, MCQ160B, MCQ160C, MCQ160D, MCQ160E, MCQ160F, MCQ160L, MCQ220) %>%
  rename(DLQ020 = MCQ140)


blood_pressure_11_12 <- read_xpt(file.path(path_to_data_11_12, "BPQ_G.xpt.txt"))

blood_pressure_11_12_selected <- blood_pressure_11_12 %>%
    select(SEQN, BPQ020)


diabetes_11_12 <- read_xpt(file.path(path_to_data_11_12, "DIQ_G.xpt.txt"))

diabetes_11_12_selected <- diabetes_11_12 %>%
    select(SEQN, DIQ010)


teeth_11_12 <- read_xpt(file.path(path_to_data_11_12, "OHXDEN_G.xpt.txt"))

selected_cols <- colnames(teeth_11_12)[grepl("^OHX\\d{2}TC", colnames(teeth_11_12))]

teeth_11_12_selected <- teeth_11_12 %>%
    select(SEQN, all_of(selected_cols))


locomotion_11_12 <- read_xpt(file.path(path_to_data_11_12, "PFQ_G.xpt.txt"))

locomotion_11_12_selected <- locomotion_11_12 %>%
    select(SEQN, PFQ061I)


mental_health_11_12 <- read_xpt(file.path(path_to_data_11_12, "DPQ_G.xpt.txt"))

# Harmonize DPQ070 (trouble concentrating, 0-3 frequency scale) with the
# yes/no coding of DLQ040: 0-1 -> 2 (no), 2-3 -> 1 (yes); other codes and NA are kept as-is
mental_health_11_12_selected <- mental_health_11_12 %>%
  select(SEQN, DPQ010, DPQ020, DPQ050, DPQ070) %>%
  mutate(DPQ070 = case_when(
    DPQ070 %in% c(0, 1) ~ 2,
    DPQ070 %in% c(2, 3) ~ 1,
    TRUE ~ DPQ070
  )) %>%
  rename(DLQ040 = DPQ070)


weight_history_11_12 <- read_xpt(file.path(path_to_data_11_12, "WHQ_G.xpt.txt"))

weight_history_11_12_selected <- weight_history_11_12 %>%
    select(SEQN, WHQ060)


audiometry_11_12 <- read_xpt(file.path(path_to_data_11_12, "AUQ_G.xpt.txt"))

# Harmonize AUQ054 (self-rated hearing) with the yes/no coding of DLQ010:
# 1-3 -> 2 (no serious difficulty), 4-6 -> 1 (serious difficulty); other codes and NA are kept as-is
audiometry_11_12_selected <- audiometry_11_12 %>%
  select(SEQN, AUQ054) %>%
  mutate(AUQ054 = case_when(
    AUQ054 %in% c(1, 2, 3) ~ 2,
    AUQ054 %in% c(4, 5, 6) ~ 1,
    TRUE ~ AUQ054
  )) %>%
  rename(DLQ010 = AUQ054)

In [5]:
# Datasets for 2013/14 period

demo_13_14 <- read_xpt(file.path(path_to_data_12_13, "DEMO_H.xpt.txt"))

demo_13_14_selected <- demo_13_14 %>%
    select(SEQN, RIAGENDR, RIDAGEYR, RIDRETH1, DMDEDUC2, INDFMPIR)


alcohol_13_14 <- read_xpt(file.path(path_to_data_12_13, "ALQ_H.xpt.txt"))

alcohol_13_14_selected <- alcohol_13_14 %>%
    select(SEQN, ALQ101)


smoking_13_14 <- read_xpt(file.path(path_to_data_12_13, "SMQ_H.xpt.txt"))

smoking_13_14_selected <- smoking_13_14 %>%
    select(SEQN, SMQ020)


med_conditions_13_14 <- read_xpt(file.path(path_to_data_12_13, "MCQ_H.xpt.txt"))

med_conditions_13_14_selected <- med_conditions_13_14 %>%
    select(SEQN, MCQ160B, MCQ160C, MCQ160D, MCQ160E, MCQ160F, MCQ160L, MCQ220)


blood_pressure_13_14 <- read_xpt(file.path(path_to_data_12_13, "BPQ_H.xpt.txt"))

blood_pressure_13_14_selected <- blood_pressure_13_14 %>%
    select(SEQN, BPQ020)


diabetes_13_14 <- read_xpt(file.path(path_to_data_12_13, "DIQ_H.xpt.txt"))

diabetes_13_14_selected <- diabetes_13_14 %>%
    select(SEQN, DIQ010)


teeth_13_14 <- read_xpt(file.path(path_to_data_12_13, "OHXDEN_H.xpt.txt"))

selected_cols <- colnames(teeth_13_14)[grepl("^OHX\\d{2}TC", colnames(teeth_13_14))]

teeth_13_14_selected <- teeth_13_14 %>%
    select(SEQN, all_of(selected_cols))


locomotion_13_14 <- read_xpt(file.path(path_to_data_12_13, "PFQ061I.txt"))

locomotion_13_14_selected <- locomotion_13_14 %>%
    select(SEQN, PFQ061I)


disability_13_14 <- read_xpt(file.path(path_to_data_12_13, "DLQ040.txt"))

disability_13_14_selected <- disability_13_14 %>%
    select(SEQN, DLQ010, DLQ020, DLQ040)


mental_health_13_14 <- read_xpt(file.path(path_to_data_12_13, "DPQ-.txt"))

mental_health_13_14_selected <- mental_health_13_14 %>%
    select(SEQN, DPQ010, DPQ020, DPQ050)


weight_history_13_14 <- read_xpt(file.path(path_to_data_12_13, "WHQ060.txt"))

weight_history_13_14_selected <- weight_history_13_14 %>%
    select(SEQN, WHQ060)

In [6]:
dim(teeth_09_10_selected)
dim(teeth_11_12_selected)
dim(teeth_13_14_selected)

### Merge datasets and exclude NA and missing values
> Merge all the data from each dataset, then exclude patients with missing information
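
The merge works in two stages: within a cycle, the component tables are joined side by side on `SEQN`; across cycles, the resulting tables are stacked. A minimal sketch on toy tables follows; the table names and values are invented for illustration.

```r
library(dplyr)

# Toy component tables for one cycle
demo  <- tibble(SEQN = c(1, 2, 3), RIDAGEYR = c(61, 72, 45))
teeth <- tibble(SEQN = c(1, 3),    OHX02TC  = c(2, 4))
smoke <- tibble(SEQN = c(2, 3),    SMQ020   = c(1, 2))

# Horizontal merge: one row per SEQN, NA where a component has no record
merged <- Reduce(function(x, y) full_join(x, y, by = "SEQN"),
                 list(demo, teeth, smoke))

# Toy second cycle that lacks the tooth column
other_cycle <- tibble(SEQN = 4, RIDAGEYR = 80, SMQ020 = 1)

# Vertical stack: columns absent from a cycle are filled with NA
stacked <- bind_rows(merged, other_cycle)
```

Since `full_join` keeps every participant and `bind_rows` fills absent columns with NA, no rows are lost at this stage; all exclusions happen in the explicit filtering steps.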

In [7]:
# Merge datasets demographics and intrinsic capacity data

datasets_09_10 <- list(
  demo_09_10_selected, alcohol_09_10_selected, smoking_09_10_selected, med_conditions_09_10_selected,
  blood_pressure_09_10_selected, diabetes_09_10_selected,
  locomotion_09_10_selected, mental_health_09_10_selected,
  weight_history_09_10_selected, audiometry_09_10_selected, teeth_09_10_selected
)

datasets_11_12 <- list(
  demo_11_12_selected, alcohol_11_12_selected, smoking_11_12_selected, med_conditions_11_12_selected,
  blood_pressure_11_12_selected, diabetes_11_12_selected,
  locomotion_11_12_selected, mental_health_11_12_selected,
  weight_history_11_12_selected, audiometry_11_12_selected, teeth_11_12_selected
)

datasets_13_14 <- list(
  demo_13_14_selected, alcohol_13_14_selected, smoking_13_14_selected, med_conditions_13_14_selected,
  blood_pressure_13_14_selected, diabetes_13_14_selected,
  locomotion_13_14_selected, disability_13_14_selected, mental_health_13_14_selected,
  weight_history_13_14_selected, teeth_13_14_selected
)

# Horizontal union for period 2009/10, 2011/12, 2013/14

df_09_10 <- Reduce(function(x, y) full_join(x, y, by = "SEQN"), datasets_09_10)

df_11_12 <- Reduce(function(x, y) full_join(x, y, by = "SEQN"), datasets_11_12)

df_13_14 <- Reduce(function(x, y) full_join(x, y, by = "SEQN"), datasets_13_14)

# Vertical union

df_final <- bind_rows(df_09_10, df_11_12, df_13_14)

print("Dimensions before removing NA values")
dim(df_final)

# Filter with AGE >= 60

df_final_age_60 <- subset(df_final, RIDAGEYR >= 60)

print("Dimensions with AGE >= 60")
dim(df_final_age_60)

# Excluding patients with missing values in Intrinsic Capacity features

ic_cols <- c('DLQ020', 'PFQ061I', 'DPQ010', 'DPQ020', 'DPQ050', 'DLQ040', 'WHQ060', 'DLQ010')

df_final_excluding_IC <- df_final_age_60[complete.cases(df_final_age_60[, ic_cols]), ]

# Drop "refused" / "don't know" answers (codes 7, 9 and 99)
df_final_excluding_IC <- df_final_excluding_IC %>%
  filter(!if_any(all_of(ic_cols), ~ . %in% c(7, 9, 99)))

print("Dimensions without IC missing values")
dim(df_final_excluding_IC)

# Excluding patients with no examinations for Teeth counts

df_final_excluding_teeth <- df_final_excluding_IC %>%
  filter(rowSums(!is.na(select(., starts_with("OHX")))) > 0)

print("Dimensions without Teeth counts missing values")
dim(df_final_excluding_teeth)

# Excluding patients with missing values in Confounding features

df_final_excluding_confounding <- df_final_excluding_teeth[complete.cases(df_final_excluding_teeth[, 
                                  c('RIAGENDR', 'RIDAGEYR', 'RIDRETH1', 'DMDEDUC2', 'INDFMPIR', 'ALQ101', 'SMQ020', 'MCQ160B',
                                  'MCQ160C', 'MCQ160D', 'MCQ160E', 'MCQ160F', 'MCQ160L', 'MCQ220', 'BPQ020', 'DIQ010')]), ]

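# Drop "don't know" answers (code 9) among the confounders and mediators.
# Note: code 7 (refused) and the SMQ020 / MCQ220 columns are not checked here.
df_final_merged <- df_final_excluding_confounding %>%
  filter(!if_any(c(DMDEDUC2, ALQ101, MCQ160B, MCQ160C, MCQ160D,
                   MCQ160E, MCQ160F, MCQ160L, BPQ020, DIQ010), ~ . == 9))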

print("Dimensions without Confounding missing values")
dim(df_final_merged)
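
The exclusion steps in this cell can also be tracked in a small flow table, which is handy when reporting the sample size at each stage. The sketch below runs on toy data; `toy`, `ic_cols_toy` and `flow` are hypothetical names, and only two IC columns are used to keep it short.

```r
library(dplyr)

toy <- tibble(
  SEQN     = 1:6,
  RIDAGEYR = c(59, 60, 65, 70, 80, 75),
  DPQ010   = c(0, 1, 9, NA, 2, 0),
  DLQ010   = c(2, 2, 1, 2, 7, 1)
)

ic_cols_toy <- c("DPQ010", "DLQ010")

# Step 1: age filter
step_age <- toy %>% filter(RIDAGEYR >= 60)

# Step 2: drop rows with NA in any IC column
step_na <- step_age %>% filter(if_all(all_of(ic_cols_toy), ~ !is.na(.x)))

# Step 3: drop "refused" / "don't know" codes
step_ic <- step_na %>% filter(!if_any(all_of(ic_cols_toy), ~ .x %in% c(7, 9, 99)))

# Sample size after each step
flow <- tibble(
  step = c("all", "age >= 60", "IC not NA", "IC answered"),
  n    = c(nrow(toy), nrow(step_age), nrow(step_na), nrow(step_ic))
)
flow
```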


### Teeth counts
> two categories: >=20 teeth or < 20 teeth

In [None]:
# Helper functions to check for patients with zero permanent teeth and, among them,
# patients whose tooth positions are not all coded "not present" (e.g. root fragments)

find_patients_no_teeth <- function(df) {
  # Select the tooth-count columns (OHXxxTC)
  teeth_cols <- grep("^OHX\\d{2}TC$", names(df), value = TRUE)
  
  # Keep the patients without a single value of 2 (permanent tooth present)
  no_teeth_patients <- df[rowSums(df[, teeth_cols] == 2, na.rm = TRUE) == 0, ]
  
  # Return the rows of the patients without permanent teeth
  return(no_teeth_patients)
}

find_patients_with_non4_values <- function(df) {
  
  teeth_cols <- grep("^OHX\\d{2}TC$", names(df), value = TRUE)

  no_teeth_patients <- find_patients_no_teeth(df)

  # Among them, keep those with at least one position coded other than 4 (tooth not present)
  patients_with_non4 <- no_teeth_patients[rowSums(no_teeth_patients[, teeth_cols] != 4, na.rm = TRUE) > 0, ]

  return(patients_with_non4)

}

test_edentolus <- find_patients_with_non4_values(df_final_merged)
head(test_edentolus)

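| SEQN | RIAGENDR | RIDAGEYR | RIDRETH1 | DMDEDUC2 | INDFMPIR | ALQ101 | SMQ020 | DLQ020 | MCQ160B | ... | OHX23TC | OHX24TC | OHX25TC | OHX26TC | OHX27TC | OHX28TC | OHX29TC | OHX30TC | OHX31TC | OHX32TC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ... | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 60998 | 2 | 60 | 2 | 1 | 2.64 | 2 | 1 | 2 | 2 | ... | 4 | 4 | 4 | 4 | 5 | 4 | 4 | 4 | 4 | 4 |
| 64973 | 1 | 64 | 3 | 3 | 1.13 | 1 | 2 | 2 | 2 | ... | 4 | 4 | 4 | 4 | 5 | 4 | 4 | 4 | 4 | 4 |
| 74674 | 2 | 67 | 1 | 1 | 0.97 | 2 | 2 | 2 | 2 | ... | 4 | 4 | 4 | 4 | 4 | 4 | 5 | 4 | 4 | 4 |
| 78009 | 1 | 80 | 4 | 2 | 1.39 | 1 | 1 | 1 | 2 | ... | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 4 | 4 | 4 |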
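
The difference between "no permanent teeth" and "every position missing" can be seen on a toy example with three tooth positions. This is a sketch under assumptions: the code meanings used here (2 = permanent tooth present, 3 = dental implant, 4 = tooth not present, 5 = root fragment) reflect my reading of the NHANES dentition codebook and should be checked against it, and `toy_teeth` is an invented table.

```r
# Toy table with three tooth positions instead of 32
toy_teeth <- data.frame(
  SEQN    = 1:4,
  OHX01TC = c(4, 4, 2, 4),
  OHX02TC = c(4, 5, 4, 3),
  OHX03TC = c(4, 4, 4, 4)
)

cols <- grep("^OHX\\d{2}TC$", names(toy_teeth), value = TRUE)

# Number of permanent teeth (code 2) per patient
n_permanent <- rowSums(toy_teeth[, cols] == 2, na.rm = TRUE)

# Every position coded 4 or 5 (missing tooth or root fragment)
all_missing_or_root <- rowSums(toy_teeth[, cols] == 4 | toy_teeth[, cols] == 5,
                               na.rm = TRUE) == length(cols)

# Every position coded exactly 4
only_code4 <- rowSums(toy_teeth[, cols] != 4, na.rm = TRUE) == 0

data.frame(SEQN = toy_teeth$SEQN, n_permanent, all_missing_or_root, only_code4)
```

Patient 2 has a root fragment, so the "all positions coded 4" check fails even though no permanent tooth is present. Patient 4 also has zero permanent teeth, but one position holds a different code, so the 4-or-5 rule does not flag that row.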


In [34]:
# Function to calculate the total number of teeth for each patient and categorize it

count_teeth <- function(df) {

  teeth_cols <- grep("^OHX\\d{2}TC$", names(df), value = TRUE)
  
  # Count the positions coded 2 (permanent tooth present); NA positions are not counted
  df$total_teeth <- rowSums(df[, teeth_cols] == 2, na.rm = TRUE)
  
  # Binary category: 1 if >= 20 teeth, 0 otherwise
  df$has_20_or_more_teeth <- ifelse(df$total_teeth >= 20, 1, 0)
  
  # Edentulous patients: all 32 tooth positions coded 4 or 5
  df$edentulous <- ifelse(rowSums(df[, teeth_cols] == 4 | df[, teeth_cols] == 5, na.rm = TRUE) == length(teeth_cols), 1, 0)
  
  # Other possible categories: Edentulous, Severe, Moderate, Sufficient, Nearly Complete
  df$teeth_category <- cut(
    df$total_teeth,
    breaks = c(-Inf, 0, 9, 19, 24, 32),
    labels = c("Edentulous", "Severe Loss (1-9)", "Moderate Loss (10-19)", "Sufficient (20-24)", "Nearly Complete (25-32)"),
    right = TRUE
  )
  
  df <- df[, !names(df) %in% teeth_cols]
  
  return(df)
}

df <- as.data.frame(count_teeth(df_final_merged))
head(df)

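|   | SEQN | RIAGENDR | RIDAGEYR | RIDRETH1 | DMDEDUC2 | INDFMPIR | ALQ101 | SMQ020 | DLQ020 | MCQ160B | ... | DPQ010 | DPQ020 | DPQ050 | DLQ040 | WHQ060 | DLQ010 | total_teeth | has_20_or_more_teeth | edentulous | teeth_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ... | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 51633 | 1 | 80 | 3 | 4 | 1.27 | 1 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 2 | 1 | 2 | 6 | 0 | 0 | Severe Loss (1-9) |
| 2 | 51654 | 1 | 66 | 3 | 4 | 2.2 | 1 | 1 | 2 | 2 | ... | 0 | 0 | 0 | 2 | 2 | 2 | 26 | 1 | 0 | Nearly Complete (25-32) |
| 3 | 51680 | 2 | 60 | 4 | 4 | 2.59 | 1 | 1 | 1 | 2 | ... | 0 | 1 | 1 | 2 | 1 | 2 | 20 | 1 | 0 | Sufficient (20-24) |
| 4 | 51687 | 1 | 78 | 3 | 5 | 5.0 | 1 | 1 | 2 | 2 | ... | 0 | 0 | 0 | 2 | 2 | 2 | 28 | 1 | 0 | Nearly Complete (25-32) |
| 5 | 51736 | 2 | 60 | 1 | 3 | 0.68 | 1 | 1 | 2 | 2 | ... | 0 | 1 | 0 | 2 | 1 | 2 | 21 | 1 | 0 | Sufficient (20-24) |
| 6 | 51861 | 1 | 80 | 3 | 3 | 4.94 | 1 | 2 | 2 | 2 | ... | 0 | 0 | 0 | 2 | 1 | 1 | 8 | 0 | 0 | Severe Loss (1-9) |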
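
As a quick check of the category boundaries, the same `cut()` call can be run on the edge values of each interval. With `right = TRUE` every interval is closed on the right, so 9, 19 and 24 fall in the lower category and 20 is the first count in the ">= 20 teeth" group. The vector below is a toy input chosen for illustration.

```r
# Boundary values of each category
total_teeth <- c(0, 1, 9, 10, 19, 20, 24, 25, 32)

teeth_category <- cut(
  total_teeth,
  breaks = c(-Inf, 0, 9, 19, 24, 32),
  labels = c("Edentulous", "Severe Loss (1-9)", "Moderate Loss (10-19)",
             "Sufficient (20-24)", "Nearly Complete (25-32)"),
  right = TRUE
)

# Binary outcome used in the analysis
has_20_or_more_teeth <- ifelse(total_teeth >= 20, 1, 0)

data.frame(total_teeth, teeth_category, has_20_or_more_teeth)
```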


In [35]:
# Save the preprocessed dataset

write.csv(df, "/Users/silvanoquarto/Desktop/LAVORO/MEDICAL_PHYSICS/Med-Physics/data/NHANES/preprocessed_df_teeth_09_14.csv", row.names=FALSE)