In [6]:
source("src/sampling.R")
source('constants.R')
source('src/statistics.R')
source('src/preprocess.R')
source('src/fit.R')

set.seed(1001)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:data.table’:

    between, first, last


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




# Read dataset

In [2]:
meta  <- fread('data/meta.csv')
prs   <- fread('data/prs.csv')
pheno <- fread('data/pheno.csv')

# Print summary statistics of the cases

In [3]:
cases_summary(pheno)

Disease Name: diabetes 
Disease Code: HC221 
   Disease 1 Disease 0
SA       460      5479
WB      1451     75331
Ratio of SA 1 to WB 1 cases: 0.3170227 
SA cases with disease in Train: 0 
SA cases with disease in Val: 0 
SA cases with disease in Test: 0 

Disease Name: renal_failure 
Disease Code: HC294 
   Disease 1 Disease 0
SA        76      2644
WB      1835     78166
Ratio of SA 1 to WB 1 cases: 0.04141689 
SA cases with disease in Train: 0 
SA cases with disease in Val: 0 
SA cases with disease in Test: 0 

Disease Name: myocardial_infarction 
Disease Code: HC326 
   Disease 1 Disease 0
SA       151      3234
WB      1760     77576
Ratio of SA 1 to WB 1 cases: 0.08579545 
SA cases with disease in Train: 0 
SA cases with disease in Val: 0 
SA cases with disease in Test: 0 

Disease Name: asthma 
Disease Code: HC382 
   Disease 1 Disease 0
SA       285     10909
WB      1626     69901
Ratio of SA 1 to WB 1 cases: 0.1752768 
SA cases with disease in Train: 0 
SA cases with disease 

# Set populations and model directory for WB-SA

In [4]:
populations <- list('white_british', 's_asian')
model_dir_path <- 'models/glinternet/under_sampling/WB_SA_metabolomics'
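The model directory above refers to under-sampling, which `sample_data()` from `src/sampling.R` presumably implements; that source is not shown in this notebook. As a rough illustration only, random under-sampling of the majority class could be sketched as follows. The helper `under_sample()`, its arguments, and the toy data are all hypothetical and are not the notebook's own code.

```r
# Hypothetical helper (an assumption, not the notebook's sample_data()):
# randomly under-sample every class down to the size of the smallest class.
under_sample <- function(X, y) {
    stopifnot(nrow(X) == length(y))
    # Row indices grouped by class label
    idx_by_class <- split(seq_along(y), y)
    n_min <- min(lengths(idx_by_class))
    # Draw n_min rows per class; sample.int() avoids sample()'s
    # special-casing of length-1 vectors
    keep <- unlist(
        lapply(idx_by_class, function(idx) idx[sample.int(length(idx), n_min)]),
        use.names = FALSE
    )
    keep <- sort(keep)
    list(X = X[keep, , drop = FALSE], y = y[keep])
}

# Toy data (made up for illustration): 10 cases, 90 controls
set.seed(1001)
toy_X <- data.frame(age = 1:100, prs = rnorm(100))
toy_y <- rep(c(1, 0), times = c(10, 90))

balanced <- under_sample(toy_X, toy_y)
print(table(balanced$y))
```

Only the training split would normally be resampled this way; the validation and test splits keep their original class ratio so that reported AUCs reflect the real prevalence.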

# Fit models and report selected effects and test AUC per disease

In [7]:
aucs_with_prs  <- list()

for (disease in DISEASE_CODES) {
    
    prepared_data <- prepare_data(meta, prs, pheno)
    preprocessed_data <- preprocess_data(
        populations, disease, 
        prepared_data$pheno, prepared_data$meta, 
        prepared_data$prs, use_prs=TRUE, prepared_data$demo, use_demo=TRUE)
    
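    # NOTE: sampled_data is not used below; fit_models() receives preprocessed_data
    sampled_data        <- sample_data(preprocessed_data, 'under_sampling')
    # system.time() does not auto-print inside a for loop, so print it explicitly
    print(system.time({cv_fit <- fit_models(model_dir_path, disease, preprocessed_data, update_models=FALSE)}))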
    
    i_1Std <- which(cv_fit$lambdaHat1Std == cv_fit$lambda)
    coefs <- coef(cv_fit$glinternetFit)[[i_1Std]]
    
    main_effects <- coefs$mainEffects$cont
    
    cat('Main effects:\n')
    for (effect in main_effects) {
        col_index <- effect
        print(colnames(preprocessed_data$X_train[, ..col_index]))
    }

    cat('\nInteractions:\n')
    
    if (length(coefs$interactions$catcont) > 0) {
        for (row in seq_len(nrow(coefs$interactions$catcont))) {
            var_idx <- coefs$interactions$catcont[row, 2]
            print(colnames(preprocessed_data$X_train[, ..var_idx]))
        }
    }
    
    X_test <- preprocessed_data$X_test
    y_test <- preprocessed_data$y_test

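    # Assuming cv_fit is a glinternet.cv object, predict() uses lambdaHat by
    # default, whereas the effects printed above are taken at lambdaHat1Std
    predictions <- as.vector(predict(cv_fit, X_test, type = "response"))

    # pROC's argument is `quiet`, not `quietly`
    roc_curve <- roc(y_test, predictions, quiet = TRUE)
    auc_score <- as.numeric(auc(roc_curve))
    # Store by disease code (assumes character codes such as 'HC221')
    aucs_with_prs[[disease]] <- auc_score
    cat('\nAUC score:', auc_score)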
    cat('\n\n####################################################################################################')
    cat('\n####################################################################################################\n')

}

       PRS_HC221 age sex population     Total_C   non_HDL_C   Remnant_C
    1: -0.322643  68   1          0  1.68751073  1.81708129  1.07378772
    2: -0.744962  79   0          0 -0.83329166 -1.34938732 -0.65547621
    3:  0.181231  61   1          0 -0.82099739 -0.92062825 -0.47259763
    4: -0.137105  79   0          0  3.19675008  3.45613443  1.71405582
    5:  0.363630  69   0          0 -0.23965437  0.18789434  0.19959743
   ---                                                                 
82717: -0.423486  52   0          0  0.51746474  0.74499367  0.57617313
82718: -0.500670  74   1          0  0.43371654  0.44234425  0.12780314
82719: -0.544700  77   1          0 -0.51591114 -0.45034725 -0.18533347
82720: -0.432580  57   0          0 -0.00575579 -0.60172002 -0.38961791
82721: -0.935360  67   0          0 -0.34817818 -0.06855723  0.01497893
            VLDL_C Clinical_LDL_C       LDL_C        HDL_C    Total_TG
    1:  0.83207065   1.2241989093  0.74317667 -0.129471244  1.661

Timing stopped at: 55.8 0.042 55.97

