In [1]:
suppressWarnings(suppressPackageStartupMessages({
    library(tidyverse)
    library(data.table)
}))


In [2]:
source('0_parameters.sh')


# Aggregate the predictive performance of the PRS models


- Read the predictive performance table stored in the raw output file: `snpnet.eval.2_refit.tsv`
- Filter to keep the traits that are:
  - Present in the `GBE_ID.lst`
  - NOT present in the MRP blacklist traits
  - Present in the trait category file (removal of the duplicated traits)
- Join with
  - The trait info (trait names and trait category)
  - The significance of the incremental predictive performance
- Save the following three files:
  - All the results across population groups
  - The WB test set predictive performance table
  - The WB test set predictive performance table for the traits with significant predictive performance


In [102]:
f <- '/oak/stanford/groups/mrivas/projects/PRS/private_output/202009_batch/snpnet.eval.2_refit.tsv'
lst_f <- 'GBE_ID.lst'
cat_f <- '/oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/05_gbe/extras/20200812_GBE_category/GBE_category.20201024.tsv'

# output

eval_full_f <- file.path(data_d, 'eval_full.tsv')
summary_f <- file.path(data_d, 'traits.tsv')
summary_sig_f <- file.path(data_d, 'traits_significant.tsv')
# the results are also copied to the OAK directory and to the GBE server


## read the reference datasets

### list of traits of interest


In [26]:
lst_f %>% fread(header=FALSE) %>% pull() -> lst


In [27]:
lst %>% length()

### MRP blacklist

In [24]:
mrp_blacklist_f %>% fread(header=FALSE) %>% pull() -> mrp_blacklist


In [25]:
mrp_blacklist %>% length()

In [29]:
setdiff(lst, mrp_blacklist) %>% length()

### trait category info

- Some of the duplicated phenotypes present in the `lst_f` file were removed in the latest version of the GBE category file.
- Because of this, we keep only the traits that are also present in the category file (via `intersect()` below). This has the same effect as using `inner_join` instead of `left_join` and leaves a non-redundant set of traits in our results.
- For more information on the GBE category, please look at the following documentation
- https://github.com/rivas-lab/ukbb-tools/tree/master/05_gbe/extras/20200812_GBE_category
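
The difference between the two joins can be illustrated with a minimal sketch on toy data. The IDs `T1` to `T3` and the category labels are hypothetical and are not taken from the GBE files; `T2` stands in for a duplicated trait that was dropped from the category file.

```r
suppressPackageStartupMessages(library(dplyr))

# Toy data (hypothetical IDs): 'T2' is absent from the category table
toy_traits <- tibble(GBE_ID = c('T1', 'T2', 'T3'))
toy_cat_df <- tibble(
    GBE_ID = c('T1', 'T3'),
    trait_category = c('Disease', 'Biomarkers')
)

# left_join keeps every row of toy_traits; 'T2' gets NA for trait_category
toy_left <- toy_traits %>% left_join(toy_cat_df, by = 'GBE_ID')

# inner_join keeps only the IDs present in both tables
toy_inner <- toy_traits %>% inner_join(toy_cat_df, by = 'GBE_ID')

# intersect() on the ID vectors selects the same set of traits
toy_final <- intersect(toy_traits$GBE_ID, toy_cat_df$GBE_ID)
```

Restricting the ID list with `intersect()` first means a later `left_join` against the category table cannot introduce rows with a missing category.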


In [30]:
cat_f %>% 
fread(
    colClasses = c('#GBE_category'='character', 'GBE_ID'='character'),
    select=c('#GBE_category', 'GBE_ID', 'GBE_short_name')
) %>%
rename('trait_category'='#GBE_category', 'trait_name'='GBE_short_name') -> cat_df


In [33]:
lst %>% setdiff(mrp_blacklist) %>% intersect(cat_df %>% pull(GBE_ID)) -> final_list_of_traits

final_list_of_traits %>% length()


### Significance of the PRS (P-values)

In [34]:
file.path(res_d, PRS_pval_f) %>% fread() %>%
rename('phe'='#phe') -> PRS_pval_df


In [35]:
setdiff(
final_list_of_traits, PRS_pval_df %>% filter(variable == 'PRS') %>% pull(phe) 
)

## PRS performance table

Read the performance table, keep the traits in `final_list_of_traits`, and join it with the trait category info and the WB test set P-values.


In [86]:
f %>%
fread(
    colClasses = c('#GBE_ID'='character'),
    select = c(
        '#GBE_ID', 'split', 'geno', 'covar', 'geno_covar', 'geno_delta', 'n_variables', 'family'
    )
) %>%
rename('GBE_ID'='#GBE_ID') %>%
filter(GBE_ID %in% final_list_of_traits) %>% unique() %>%
left_join(cat_df, by='GBE_ID') %>%
left_join(
    PRS_pval_df %>% filter(variable == 'PRS') %>%
    select(phe, P) %>% rename('WB_test_P' = 'P'),
    by=c('GBE_ID'='phe')
) %>%
rename('trait'='GBE_ID') %>%
select(
    trait_category, trait, trait_name, family,
    split, geno, covar, geno_covar, geno_delta,
    n_variables, WB_test_P
) %>%
arrange(WB_test_P, -n_variables) %>%
mutate(
    # significance cutoff: 0.05 / 2000 = 2.5e-5
    is_significant_in_WB = WB_test_P < (0.05/2000)
) -> eval_full_df
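
How the significance flag behaves, including for a missing P-value, can be sketched on toy data. The trait labels and P-values below are made up for illustration; only the `0.05/2000` cutoff comes from the cell above.

```r
suppressPackageStartupMessages(library(dplyr))

# same cutoff as in the cell above (2.5e-5)
p_thr <- 0.05 / 2000

# Toy P-values (hypothetical): one below the cutoff, one above, one missing
toy_df <- tibble(
    trait = c('A', 'B', 'C'),
    WB_test_P = c(1e-10, 1e-3, NA)
) %>%
mutate(is_significant_in_WB = WB_test_P < p_thr)

# filter() drops rows where the condition is FALSE or NA
toy_sig <- toy_df %>% filter(is_significant_in_WB)
```

A trait without a P-value gets `NA` for the flag, so it stays in the full table but is excluded from the significant subset.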


How many traits do we have in each population group?


In [87]:
eval_full_df %>% count(split) %>% arrange(-n)

split,n
<chr>,<int>
non_british_white,1617
test,1617
train_val,1617
s_asian,1615
african,1607
e_asian,1528


### Generate filtered set of tables

In [88]:
eval_full_df %>%
filter(split == 'test') %>%
select(-split) -> summary_df


In [92]:
summary_df %>%
filter(is_significant_in_WB) %>%
select(-is_significant_in_WB) -> summary_sig_df


In [94]:
eval_full_df %>% dim() %>% print()
summary_df %>% dim() %>% print()
summary_sig_df %>% dim() %>% print()


[1] 9601   12
[1] 1617   11
[1] 428  10
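
As a quick consistency check, the per-split counts printed earlier should add up to the number of rows in the full table. The sketch below copies those counts by hand from the `count(split)` output; the variable names are introduced here only for illustration.

```r
# per-split counts copied from the count(split) output above
n_per_split <- c(
    non_british_white = 1617, test = 1617, train_val = 1617,
    s_asian = 1615, african = 1607, e_asian = 1528
)

# the total should match the number of rows of eval_full_df (9601)
total_rows <- sum(n_per_split)
print(total_rows)
```

The column counts are consistent as well: `summary_df` drops `split` (12 to 11 columns), and `summary_sig_df` additionally drops `is_significant_in_WB` (11 to 10 columns).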


### write the results to files

In [99]:
eval_full_df %>%
rename('#trait_category' = 'trait_category') %>%
fwrite(eval_full_f, sep='\t', na = "NA", quote=F)


In [100]:
summary_df %>%
rename('#trait_category' = 'trait_category') %>%
fwrite(summary_f, sep='\t', na = "NA", quote=F)


In [101]:
summary_sig_df %>%
rename('#trait_category' = 'trait_category') %>%
fwrite(summary_sig_f, sep='\t', na = "NA", quote=F)
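
Reading one of these files back requires undoing the `#` prefix on the first column name. Below is a minimal round-trip sketch on toy data written to a temporary file; the toy values and the names `toy_df`, `tmp_f`, and `toy_back` are hypothetical and do not refer to the real output files.

```r
suppressPackageStartupMessages({
    library(dplyr)
    library(data.table)
})

# Toy table (hypothetical values) with one missing P-value
toy_df <- tibble(
    trait_category = c('Disease', 'Biomarkers'),
    trait = c('T1', 'T3'),
    WB_test_P = c(1e-10, NA)
)
tmp_f <- tempfile(fileext = '.tsv')

# write with the '#'-prefixed header, as in the cells above
toy_df %>%
rename('#trait_category' = 'trait_category') %>%
fwrite(tmp_f, sep = '\t', na = 'NA', quote = FALSE)

# read it back and strip the '#' from the first column name
tmp_f %>%
fread(colClasses = c('trait' = 'character')) %>%
rename('trait_category' = '#trait_category') -> toy_back
```

Writing missing values as the string `NA` matches the default `na.strings` of `fread`, so they are read back as missing values.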
