##### Disease risks and longevity scores on UKBB 

## Preprocessing UKBB phenotypic data (Jan 2021)

##### Initialize and load required packages

In [1]:
source(here::here("code/init.R"))
source(here::here("code/ukbb_preprocessing.R"))
source(here::here("code/models.R"))
options(tgutil.cache=FALSE)

### Loading full dataset

In [2]:
ukbb_data <- load_data()

`data_frame()` was deprecated in tibble 1.1.0.
i Please use `tibble()` instead.
i The deprecated feature was likely used in the ukbtools package.
  Please report the issue to the authors.


### Date of birth (dob) and death (dod)
#### Extracting dob / dod / race info from full dataset

In [3]:
ukbb_demog <- get_demog_data(ukbb_data) %cache_df% here('output/ukbb_demog.csv') %>% as_tibble()
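
The `%cache_df%` operator used in the cells of this notebook appears to save each result to the given CSV path and reuse the file on later runs. The following is a minimal sketch of that pattern in base R, using a hypothetical operator named `%cache_csv%`; it is not the `tgutil` implementation and does not model the `tgutil.cache` option.

```r
# Hypothetical caching operator (illustration only, not the tgutil code).
# If `path` exists, read it; otherwise evaluate `expr`, write it, and return it.
`%cache_csv%` <- function(expr, path) {
    if (file.exists(path)) {
        return(read.csv(path, stringsAsFactors = FALSE))
    }
    result <- expr  # forces evaluation of the lazily passed expression
    write.csv(result, path, row.names = FALSE)
    result
}

# Toy "expensive" computation that counts how often it actually runs
counter <- 0
slow <- function() {
    counter <<- counter + 1
    data.frame(id = 1:3, x = c(0.1, 0.2, 0.3))
}

path <- tempfile(fileext = ".csv")
a <- slow() %cache_csv% path  # computed and written to disk
b <- slow() %cache_csv% path  # read back from disk, slow() is not called
```

Because R evaluates function arguments lazily, the left-hand expression is only computed when the cache file is missing.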

Show the column names of `ukbb_demog`

In [4]:
colnames(ukbb_demog)

### Extracting diagnoses from all sources
The sources are hospitalizations, hesin follow-ups, self-reported questionnaires, first occurrences, and general practice clinic follow-ups.

In [5]:
ukbb_diagnosis <- get_diagnosis_data(ukbb_data, ukbb_demog ) %cache_df% here('output/ukbb_diagnosis.csv') %>% as_tibble()

### Loading lab data

In [6]:
ukbb_visits <- get_visit_data(ukbb_demog) %cache_df% here('output/ukbb_visits.csv') %>% as_tibble()
ukbb_labs <- get_labs_data(ukbb_data, ukbb_visits) %cache_df% here('output/ukbb_labs.csv') %>% as_tibble() %>% 
    mutate(sex=c('male', 'female')[sex]) %>% 
    inner_join(ln_ukbb_labs() %>% mutate(field=as.numeric(ukbb_code)) %>% select(field), by = "field")


[1m[22m[36mi[39m In argument: `value = as.numeric(value)`.
[33m![39m NAs introduced by coercion"


### Normalize labs

In [7]:
ukbb_labs$q <- ln_normalize_multi_ukbb(ukbb_labs %>% select(id, lab_code=field, age, sex, value))

> Downloading to a temporary directory /tmp/6935891.1.all.q/RtmplmKP6K.

> Extracting data to /tmp/6935891.1.all.q/RtmplmKP6K.

> Extracting data to /tmp/6935891.1.all.q/RtmplmKP6K.

v Data downloaded successfully.

i Converting umol/L to mg/dL for lab Urine Creatinine. Using the formula `0.011312 * x`.

i Converting g/L to g/dL for lab Albumin. Using the formula `0.1 * x`.

i Converting umol/L to mg/dL for lab Direct Bilirubin. Using the formula `0.058467 * x`.

i Converting mmol/L to mg/dL for lab Urea. Using the formula `6.006 * x`.

i Converting mmol/L to mg/dL for lab Calcium [output truncated]

In [8]:
head(ukbb_labs %>% select(field, description, age, sex, value, q))

| field | description | age | sex | value | q |
|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` |
| 30000 | White blood cell (leukocyte) count | 57.70959 | female | 6.1 | 0.4178591 |
| 30000 | White blood cell (leukocyte) count | 46.4 | female | 11.35 | 0.980127 |
| 30000 | White blood cell (leukocyte) count | 57.98356 | male | 10.12 | 0.9570285 |
| 30000 | White blood cell (leukocyte) count | 67.73425 | female | 5.4 | 0.1806039 |
| 30000 | White blood cell (leukocyte) count | 41.46849 | female | 8.44 | 0.8015539 |
| 30000 | White blood cell (leukocyte) count | 63.21918 | male | 5.6 | 0.2272401 |
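
The `q` column above lies between 0 and 1 and, in the rows shown, increases with the raw value, which is what one would expect from a quantile score. As a conceptual illustration only, the sketch below maps values to quantiles of a single synthetic reference sample with `ecdf()`. It is not how `ln_normalize_multi_ukbb` is implemented; that call also receives age and sex, which this toy example ignores, and the reference distribution here is invented.

```r
set.seed(1)

# Synthetic reference sample (assumption for illustration, not real lab data)
reference <- rlnorm(1000, meanlog = log(7), sdlog = 0.3)

# Empirical CDF: fraction of reference values at or below a given value
to_quantile <- ecdf(reference)

# Map a few raw values to quantiles in [0, 1]
values <- c(5.4, 6.1, 11.35)
q <- to_quantile(values)
```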


### Computing disease onset

In [9]:
cancer_codes <- build_cancer_icd9_icd10_dictionary(ukbb_data)
ukbb_diseases <- get_diseases(ukbb_diagnosis, cancer_codes) %cache_df% here('output/ukbb_diseases.csv') %>% as_tibble()

### Computing parent survival data

In [10]:
parents <- get_parents_survival(ukbb_data) %cache_df% here('output/ukbb_parents.csv') %>% as_tibble()

### Free up memory

In [11]:
rm(ukbb_data)
gc()

| | used | (Mb) | gc trigger | (Mb) | max used | (Mb) |
|---|---|---|---|---|---|---|
| Ncells | 3859814 | 206.2 | 62787469 | 3353.3 | 78484336 | 4191.6 |
| Vcells | 398493495 | 3040.3 | 8623833456 | 65794.7 | 10779791820 | 82243.3 |


## Computing longevity and disease model scores
We use the `mldpEHR` package to infer scores from the models that were generated using the Clalit database.
We start by loading the models.

### Load prediction models

In [12]:
models_dir <- 'data/models/'
predictors <- c('longevity', 'diabetes', 'ckd', 'copd', 'cvd', 'liver') %>% 
    purrr::set_names() %>% 
    purrr::map(function(m) 
    {
        readr::read_rds(paste0(models_dir, m, '.rds')) %>% 
            purrr::imap( ~ c(.x, age=as.numeric(.y), feature_names=list(unique(unlist(purrr::map(.x$model, ~ .x$feature_names))))))
    })



### Gathering all potential model features
Each predictor uses its own set of features.
Because these feature sets overlap extensively across predictors, we gather all features and compute them once.


In [13]:
potential_features <- unique(unlist(purrr::map(predictors, function(predictor) {
    purrr::map(predictor, function(p) {
        p$feature_names
    })
})))
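
The gathering step can be tried on a toy structure. The sketch below uses base R and a made-up nested list that mimics the assumed shape of `predictors` (predictor, then age model, then `feature_names`); the feature names are placeholders, not the real model features.

```r
# Toy stand-in for `predictors`: predictor -> age model -> feature_names
predictors_toy <- list(
    longevity = list(
        `80` = list(feature_names = c("age", "sex", "lab.101")),
        `75` = list(feature_names = c("age", "sex", "lab.102"))
    ),
    diabetes = list(
        `80` = list(feature_names = c("age", "lab.101", "lab.103"))
    )
)

# Flatten every feature_names vector and keep each name once
all_features <- unique(unlist(lapply(predictors_toy, function(predictor) {
    lapply(predictor, function(p) p$feature_names)
})))
```

Here `all_features` holds the five distinct names, in order of first appearance.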

### Computing all features for all patients

In [14]:
#building features to be used by all predictors (longevity, diseases)
ukbb_to_clalit <- tgutil::fread('data/ukbb_lab_field_to_clalit_lab.csv')
features <- purrr::map2_df(predictors[[1]], names(predictors[[1]]), function(model, age_model) {
    message(age_model)
    age_model <- as.numeric(age_model)
    labs_features <- ukbb_labs %>% filter(age<age_model, age>age_model-5, !is.na(q)) %>% 
        left_join(ukbb_to_clalit %>% select(field, track), by="field") %>% 
        mutate(feature=paste0(track, '.quantiles_1_years_minus1095')) %>% 
        filter(feature %in% potential_features) %>% 
        group_by(id, feature) %>% summarize(value=mean(q), .groups="drop")

    disease_features <- ukbb_diseases %>% filter(age <= age_model) %>% 
        mutate(feature=paste0('WZMN.', cohort, '_minus43800_0')) %>% 
        filter(feature %in% potential_features) %>% 
        distinct(id, feature) %>% 
        mutate(value=1)

    ids <- unique(c(labs_features$id, disease_features$id))

    #adding female/male/age info
    features_tidy <- data.frame(id=ids, feature="age", value=age_model) %>% 
        bind_rows(ukbb_demog %>% filter(id %in% ids) %>% mutate(feature="male", value= sex==1) %>% select(id, feature, value)) %>% 
        bind_rows(labs_features) %>% 
        bind_rows(disease_features)

    #moving from tidy format
    features <- features_tidy %>% pivot_wider(id_cols='id', names_from='feature') %>% 
        mutate(sex=2-male)

    #setting missing disease values to 0
    disease_feature_names <- grep('WZMN.disease', colnames(features), value=TRUE)
    features[,disease_feature_names][is.na(features[,disease_feature_names])] <- 0

    #adding missing features
    missing_features <- setdiff(potential_features, colnames(features))
    features[,missing_features] <- NA
    
    #requiring RBC
    features <- features %>% filter(!is.na(lab.101.quantiles_1_years_minus1095))
    return(features)
}) %cache_df% here('output/ukbb_mldp_features.csv') %>% as_tibble()


80

75

70

65

60

55

50

45

40

35

30
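
For each age model, the lab step in the cell above keeps measurements taken in the five years before the model age and averages the normalized values per patient and feature. A minimal base R sketch of that windowing on invented data follows; the ids, ages, quantiles, and the feature name are all made up.

```r
# Invented lab records: one feature, two patients
labs_toy <- data.frame(
    id = c(1, 1, 1, 2, 2),
    feature = "lab.101",
    age = c(61.2, 63.8, 66.0, 64.9, 58.0),
    q = c(0.40, 0.60, 0.90, 0.20, 0.70)
)

age_model <- 65

# Keep records with age strictly inside (age_model - 5, age_model) and a non-missing q
keep <- labs_toy$age < age_model & labs_toy$age > age_model - 5 & !is.na(labs_toy$q)
in_window <- labs_toy[keep, ]

# Average q per patient and feature
lab_features <- aggregate(q ~ id + feature, data = in_window, FUN = mean)
```

Patient 1 keeps two records (mean 0.5) and patient 2 keeps one (0.2); the records at ages 66.0 and 58.0 fall outside the window.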



#### Compute scores

In [15]:
predictor_scores <- purrr::map2_df(predictors, names(predictors), ~ mldp_predict_multi_age(features, .x) %>% mutate(predictor=.y))

In [16]:
#note: setting disease score for patients that are already sick to NA
pop <- predictor_scores %>% filter(predictor == "longevity") %>% 
    select(id, age, sex, longevity=score, longevity_q=quantile) %>% 
    mutate(sex=factor(c('male', 'female')[sex], levels=c('male', 'female'))) %>% 
    left_join(predictor_scores %>% filter(predictor != "longevity") %>% 
        select(id, age, predictor, score) %>% 
        left_join(ukbb_diseases %>% select(id, disease_age=age, predictor=cohort)) %>% 
        mutate(score = ifelse(!is.na(disease_age) & disease_age < age, NA, score)) %>% 
        pivot_wider(id_cols=c("id", "age"), names_from="predictor", values_from="score")
) %cache_df% here('output/pop_scores.csv') %>% as_tibble()
head(pop %>% select(-id))


[1m[22mJoining with `by = join_by(id, predictor)`
[1m[22mJoining with `by = join_by(id, age)`


| age | sex | longevity | longevity_q | diabetes | ckd | copd | cvd | liver |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 45 | female | 0.9675882 | 0.1925792 | 0.08815524 | 0.04167553 | 0.02798628 | 0.2060382 | 0.022592036 |
| 45 | male | 0.9119281 | 0.1060348 | 0.15909692 | 0.03919249 | 0.09064247 | 0.29449 | 0.008487539 |
| 45 | female | 0.9969831 | 0.4574361 | 0.20979514 | 0.13855365 | 0.12811647 | 0.6370679 | 0.022256322 |
| 45 | male | 0.9945255 | 0.398427 | 0.07791844 | 0.05965531 | 0.04335517 | 0.2117718 | 0.033393441 |
| 45 | female | 0.9838166 | 0.2651281 | 0.08008288 | 0.02218681 | 0.06593383 | 0.1207745 | 0.009952694 |
| 45 | male | 0.9377217 | 0.1345166 | 0.03196407 | 0.04930571 | 0.01263509 | 0.1157229 | 0.014305011 |
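
The masking rule noted in the cell above (a disease score is set to `NA` when the patient's onset age is earlier than the score age) can be checked on a toy example. The sketch below uses base R `merge()` in place of `left_join()`, and all ids, ages, and scores are invented.

```r
# Invented scores at age 45 for one predictor
scores <- data.frame(
    id = c(1, 2, 3),
    age = 45,
    predictor = "diabetes",
    score = c(0.10, 0.20, 0.30)
)

# Invented onset ages: patient 2 before age 45, patient 3 after, patient 1 never
onsets <- data.frame(
    id = c(2, 3),
    predictor = "diabetes",
    disease_age = c(40, 50)
)

# Left join, then blank out scores for patients who were already diagnosed
merged <- merge(scores, onsets, by = c("id", "predictor"), all.x = TRUE)
merged <- merged[order(merged$id), ]
already_sick <- !is.na(merged$disease_age) & merged$disease_age < merged$age
merged$score <- ifelse(already_sick, NA, merged$score)
```

Only patient 2 loses the score; patient 1 has no recorded onset and patient 3's onset comes after the score age.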
