# master PRS file

We will create the master PRS file using the following steps:

- read the trait list
- read the biomarker trait list
- read the PRS files from the biomarker project (35 traits)
  - rename the columns as needed
  - be careful with the handling of 32 + 3 biomarkers
  - the following 3 traits do NOT have the original measurement
    - AST ALT ratio
    - eGFR
    - NAP (non-albumin protein)
    - so we will use the INI300.. numbers for those
- read the master PRS file from the previous iteration
- join them together
- join with covariates
- sort the columns
- save it to a file
- compress it (outside of the notebook)


In [1]:
suppressWarnings(suppressPackageStartupMessages({
    library(tidyverse)
    library(data.table)
}))



In [40]:
source('paths.sh')
source(snpnet_helper)
source(fPRS_helper)


In [18]:
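# read the full trait list; strip the leading '#' from the header column name
trait_list_f %>%
fread() %>%
rename_with(function(x){str_replace(x, '#', '')}, starts_with("#")) -> trait_list_df

# read the biomarker trait list (annotation to trait mapping) the same way
biomarkers_mapping_f %>%
fread() %>%
rename_with(function(x){str_replace(x, '#', '')}, starts_with("#")) -> biomarkers_mapping_df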



In [19]:
trait_list_df        %>% dim %>% print
biomarkers_mapping_df %>% dim %>% print


[1] 1565    5
[1] 35  4


In [28]:
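# read one PRS file per biomarker and name its score column PRS_<trait>
seq_len(nrow(biomarkers_mapping_df)) %>%
lapply(function(biomarkers_idx){

    biomarkers_snpnet_f %>%
    str_replace_all(
        '__TRAIT__', biomarkers_mapping_df$annotation[biomarkers_idx]
    ) %>%
    read_sscore_file() %>%
    # rename the third column (assumed to be the score) to PRS_<trait>
    rename(
        !!sprintf('PRS_%s', biomarkers_mapping_df$trait[biomarkers_idx]) := 3
    )
}) %>%
# keep only the individuals present in every per-trait table
reduce(function(x,y){inner_join(x,y,by=c("FID", "IID"))}) -> PRS_biomarkers_df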
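
The rename-then-join pattern above can be tried on toy data. The following is a minimal sketch, not the project's code: `toy_scores`, its column names, and its values are hypothetical stand-ins for the tables that `read_sscore_file()` returns, and it assumes the score sits in the third column.

```r
suppressPackageStartupMessages(library(tidyverse))

# Hypothetical per-trait score tables (made-up values, not project data)
toy_scores <- list(
    A = tibble(FID = c('1', '2'), IID = c('1', '2'), SCORE1_SUM = c(0.1, 0.2)),
    B = tibble(FID = c('1', '2'), IID = c('1', '2'), SCORE1_SUM = c(1.5, -0.5))
)

names(toy_scores) %>%
lapply(function(trait){
    toy_scores[[trait]] %>%
    # rename the third column by position to PRS_<trait>
    rename(!!sprintf('PRS_%s', trait) := 3)
}) %>%
# join the per-trait tables on the sample identifiers
reduce(function(x, y){inner_join(x, y, by = c('FID', 'IID'))}) -> toy_PRS_df

colnames(toy_PRS_df)
```

Renaming by position avoids depending on the score column's original name, but it silently picks the wrong column if the file layout changes, so checking the column names of one input file first may be worthwhile.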


In [35]:
PRS202009_f %>%
read_sscore_file(
    columns = paste0(
        'PRS_',
        trait_list_df %>%
        filter(trait_category != 'Biomarkers') %>%
        pull(trait)
    )
) -> PRS_others_df


In [43]:
# read the covariates
sample_qc_f %>%
fread(
    colClasses = c(
        '#FID'='character', 'IID'='character', 'population' = "character",
        'split' = "character",
        'split_nonWB' = "character"
    )
) %>%
rename_with(
    function(x){str_replace(x, '#', '')}, starts_with("#")
) -> sample_qc_df


In [48]:
sample_qc_df      %>% dim %>% print
PRS_others_df     %>% dim %>% print
PRS_biomarkers_df %>% dim %>% print


[1] 488377     95
[1] 488377   1532
[1] 488377     37


In [39]:
# number of PRS columns, excluding FID and IID from each table: 1530 + 35 = 1565
ncol(PRS_others_df) - 2 + ncol(PRS_biomarkers_df) - 2

In [50]:
sample_qc_df %>%
left_join(
    PRS_biomarkers_df, by=c("FID", "IID")
) %>%
left_join(
    PRS_others_df, by=c("FID", "IID")
) %>%
select(all_of(c(
    colnames(sample_qc_df),
    paste0(
        'PRS_',
        trait_list_df %>%
        pull(trait)
    )
))) -> master_PRS_df


In [51]:
master_PRS_df %>% dim %>% print

# sample QC columns (95) + PRS columns (1565) should equal the column count
95 + 1565

[1] 488377   1660


In [53]:
# output path without the .gz suffix (escape the dot so it matches literally)
str_replace(PRS202110_f, '\\.gz$', '')

In [54]:
# write the uncompressed table; restore the '#' prefix on the first header column
master_PRS_df %>%
rename('#FID' = 'FID') %>%
fwrite(
    str_replace(PRS202110_f, '\\.gz$', ''),
    sep='\t', na = "NA", quote=FALSE
)
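
The notebook leaves compression to a separate step outside of R. If plain gzip output is acceptable (an assumption; the tool actually used for this file is not stated here), `fwrite()` in recent versions of data.table can compress while writing when the file name ends in `.gz`. A minimal sketch with a hypothetical toy table and a temporary path:

```r
suppressPackageStartupMessages(library(data.table))

# Hypothetical stand-in for master_PRS_df (made-up values)
toy_df <- data.table('#FID' = c('1', '2'), IID = c('1', '2'), PRS_A = c(0.1, 0.2))

# Hypothetical output path; with compress = 'auto' the .gz suffix selects gzip
toy_out_f <- tempfile(fileext = '.tsv.gz')

fwrite(toy_df, toy_out_f, sep = '\t', na = 'NA', quote = FALSE, compress = 'auto')

# Read the first line back through a gzip connection to check the header
toy_header <- readLines(gzfile(toy_out_f), n = 1)
toy_header
```

This writes standard gzip, not the block-compressed format produced by bgzip, so it would not be a substitute if downstream tools need an indexable file.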
