In [1]:
# R code to compute the PAC (proteomic aging clock) score used in our analyses
library('data.table')
library('dplyr')
library('impute')
source("pac_proteomic_age.R")



In [2]:
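# Load the participant IDs (eids) of set 3 (the test set)
set3_eids <- fread('Data/Processed/Full/full_test_eids.csv', data.table = FALSE)

# Proteomics data (Olink, instance 0)
allprots3000 <- fread('Data/Olink_UKBB_3000prots/Olink3000_Instance0.csv', data.table = FALSE)
# Previously imputed protein set; loaded here but not used in the cells below
ourset <- fread('proteins_UKBB_imputed.csv', data.table = FALSE)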

# Impute missing values following the approach of https://doi.org/10.1111/acel.14195
rownames(allprots3000) <- allprots3000$eid
# Remove eid and the columns not to be imputed according to Kuo et al.
proteins_to_impute <- select(allprots3000, -c(eid, glipr1, npm1, pcolce))
imputed_data <- impute.knn(
  as.matrix(proteins_to_impute),
  k = 10, rowmax = 0.5, colmax = 0.8, maxp = 1500, rng.seed = 7
)$data
imputed_data <- as.data.frame(imputed_data)
imputed_data$eid <- rownames(imputed_data)  # Bring back eid (as character) for joining

# Reorder columns
allprots3000_imputed <- imputed_data[, c("eid", setdiff(names(imputed_data), "eid"))]

# Keep Set 3
allprots3000_set3 <- subset(allprots3000_imputed, eid %in% set3_eids$V1)

# Age
basic_info <- fread('Data/basicinfo_instance_0.csv', data.table = FALSE) %>%
  select(eid, age_center.0.0) %>%
  mutate(eid = as.character(eid)) %>%
  rename(age = age_center.0.0)

# Add age to the imputed protein table (eid is character on both sides)
allprots3000_set3 <- left_join(allprots3000_set3, basic_info, by = "eid")

“4286 rows with more than 50 % entries missing;
 mean imputation used for these rows”


Cluster size 48727 broken into 24214 24513 
Cluster size 24214 broken into 13480 10734 
Cluster size 13480 broken into 528 12952 
Done cluster 528 
Cluster size 12952 broken into 5768 7184 
Cluster size 5768 broken into 3366 2402 
Cluster size 3366 broken into 145 3221 
Done cluster 145 
Cluster size 3221 broken into 2087 1134 
Cluster size 2087 broken into 873 1214 
Done cluster 873 
Done cluster 1214 
Done cluster 2087 
Done cluster 1134 
Done cluster 3221 
Done cluster 3366 
Cluster size 2402 broken into 1793 609 
Cluster size 1793 broken into 309 1484 
Done cluster 309 
Done cluster 1484 
Done cluster 1793 
Done cluster 609 
Done cluster 2402 
Done cluster 5768 
Cluster size 7184 broken into 3698 3486 
Cluster size 3698 broken into 1321 2377 
Done cluster 1321 
Cluster size 2377 broken into 457 1920 
Done cluster 457 
Cluster size 1920 broken into 1323 597 
Done cluster 1323 
Done cluster 597 
Done cluster 1920 
Done cluster 2377 
Done cluster 3698 
Cluster size 3486 broken into 17
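
The warning above says that rows with more than 50 % missing entries fall back to mean imputation (the `rowmax = 0.5` setting). As a minimal sketch, assuming a toy matrix rather than the real Olink data, the rows that would trigger this fallback can be counted before imputing. The object names `toy`, `row_missing`, `col_missing` and `flagged_rows` are made up for illustration.

```r
# Toy matrix (hypothetical values): 3 participants x 4 proteins
toy <- matrix(c(1, NA, NA, NA,
                2,  3, NA,  5,
                4,  5,  6,  7),
              nrow = 3, byrow = TRUE)

# Fraction of missing entries per row and per column
row_missing <- rowMeans(is.na(toy))
col_missing <- colMeans(is.na(toy))

# Rows above the rowmax threshold would be mean-imputed instead of kNN-imputed
flagged_rows <- which(row_missing > 0.5)

# Columns above the colmax threshold would make impute.knn stop with an error
any_col_over <- any(col_missing > 0.8)

print(flagged_rows)
print(any_col_over)
```

Running the same two checks on `proteins_to_impute` before calling `impute.knn` would show in advance how many participants are affected.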

In [3]:
nrow(allprots3000_set3)
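
A row count alone does not show whether every participant received an age in the join. The sketch below, which uses made-up IDs and base R `merge` as a stand-in for `left_join`, illustrates the two checks that seem useful here: the key columns share one type, and the number of unmatched rows is known. All data and object names in it are hypothetical.

```r
# Toy tables (hypothetical IDs, not real UK Biobank eids)
proteins <- data.frame(eid = c("1001", "1002", "1003"),
                       prot_a = c(0.1, 0.2, 0.3),
                       stringsAsFactors = FALSE)
ages <- data.frame(eid = c(1001L, 1003L), age = c(55, 62))

# Cast the key to character so both sides have the same type
ages$eid <- as.character(ages$eid)

# all.x = TRUE keeps every protein row, like a left join
joined <- merge(proteins, ages, by = "eid", all.x = TRUE)

# The join must not add or drop rows; unmatched rows show up as NA ages
stopifnot(nrow(joined) == nrow(proteins))
n_missing_age <- sum(is.na(joined$age))
print(n_missing_age)
```

On the real table, the equivalent check would be `sum(is.na(allprots3000_set3$age))`.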

In [4]:
# Compute the PAC score for set 3 and pair it with the participant IDs
pac_set3 <- pac_proteomic_age(allprots3000_set3)
df_pac_set3 <- cbind(allprots3000_set3[, "eid"], as.data.frame(pac_set3))
names(df_pac_set3) <- c("eid", "PAC")
fwrite(df_pac_set3, 'Data/Other_Biomarkers/pac_set3.csv')
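
Before relying on the exported file, it may be worth confirming that scores and IDs line up. This is a minimal sketch under the assumption that `pac_proteomic_age()` returns one numeric score per input row, which the source does not confirm. The `scores` vector is a made-up stand-in for its output, and base R `write.csv`/`read.csv` replace `fwrite`/`fread` to keep the example dependency-free.

```r
# Hypothetical IDs and scores standing in for the real inputs
ids <- c("1001", "1002", "1003")
scores <- c(48.2, 61.7, 55.0)

# One score per participant, no missing values
stopifnot(length(scores) == length(ids), !anyNA(scores))

df_scores <- data.frame(eid = ids, PAC = scores, stringsAsFactors = FALSE)

# Round trip through a temporary CSV, reading eid back as character
out_file <- tempfile(fileext = ".csv")
write.csv(df_scores, out_file, row.names = FALSE)
round_trip <- read.csv(out_file, colClasses = c(eid = "character"))

print(identical(round_trip$eid, ids))
```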

In [5]:
# Save the imputed set 3 proteins (with age) for use with the other biomarkers
fwrite(allprots3000_set3, 'Data/Other_Biomarkers/proteins_set3_imputed_other_biomarkers.csv')