# Additional PCA computation for individuals not included in the 5 population definitions
#### Yosuke Tanigawa (ytanigaw@stanford.edu)
#### Last update: 2020/5/22

We defined 5 populations in UK Biobank. Please see the `README.md` file for more details.

After applying the standard QC filter, 28,551 individuals remain who are not included in any of those 5 populations.

This notebook generates a keep file (`.phe`) listing those 28,551 individuals, along with sex-specific versions (`.F.phe` and `.M.phe`). We then apply [`sample_qc_v3.PCA.sh`](sample_qc_v3.PCA.sh) to compute the PCs for those individuals.


In [1]:
suppressWarnings(suppressPackageStartupMessages({
    library(tidyverse)
    library(data.table)
}))


In [20]:
# input
sqc_f <- '/oak/stanford/groups/mrivas/ukbb24983/sqc/population_stratification_w24983_20200313/ukb24983_master_sqc.20200313.phe'

# output
out_f <- '/oak/stanford/groups/mrivas/ukbb24983/sqc/population_stratification_w24983_20200522/ukb24983_others.phe'


In [3]:
sqc_df <- fread(sqc_f)


In [4]:
# Count QC-passing individuals per population label
sqc_df %>%
filter(
    used_in_pca_calculation == 1, 
    excess_relatives == 0, 
    het_missing_outliers == 0, 
    putative_sex_chromosome_aneuploidy == 0
) %>%
count(population) %>%
arrange(-n)


| population `<chr>` | n `<int>` |
|---|---|
| white_british | 337138 |
| | 28551 |
| non_british_white | 24905 |
| s_asian | 7885 |
| african | 6497 |
| e_asian | 1154 |
| e_asian_outlier | 618 |
| s_asian_outlier | 77 |


In [5]:
# Number of individuals in the 5 populations
337138 + 24905 + 7885 + 6497 + 1154

In [6]:
# The 5 populations plus the 28,551 individuals without a population label
337138 + 24905 + 7885 + 6497 + 1154 + 28551

In [7]:
# Ratio of the combined sample size to the 5-population sample size
(337138 + 24905 + 7885 + 6497 + 1154 + 28551) / (337138 + 24905 + 7885 + 6497 + 1154)
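
The three cells above hard-code the counts from the table. As a minimal sketch, the same quantities can be computed from named values; `pop_n`, `n_others`, `n_5pop`, `n_total` and `ratio` are hypothetical names introduced here for illustration, and the numbers are copied from the table output above.

```r
# Counts copied from the table above; `pop_n` is a hypothetical name.
pop_n <- c(
    white_british     = 337138,
    non_british_white = 24905,
    s_asian           = 7885,
    african           = 6497,
    e_asian           = 1154
)
n_others <- 28551  # QC-passing individuals without a population label

n_5pop  <- sum(pop_n)         # individuals in the 5 populations
n_total <- n_5pop + n_others  # 5 populations plus the unlabeled individuals
ratio   <- n_total / n_5pop   # relative size of the combined set

print(c(n_5pop = n_5pop, n_total = n_total))
print(round(ratio, 4))
```

With these counts, `n_5pop` is 377579 and `n_total` is 406130, so adding the unlabeled individuals increases the sample size by roughly 7.6%.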

In [21]:
# Keep file: all QC-passing individuals without a population label
sqc_df %>%
filter(
    used_in_pca_calculation == 1,
    excess_relatives == 0,
    het_missing_outliers == 0,
    putative_sex_chromosome_aneuploidy == 0
) %>%
filter(is.na(population)) %>%
select(FID, IID) %>%
fwrite(out_f, sep='\t', na = "NA", quote=FALSE, col.names=FALSE)


In [22]:
# Keep file restricted to sex == 0, written with the .F.phe suffix
sqc_df %>%
filter(
    used_in_pca_calculation == 1,
    excess_relatives == 0,
    het_missing_outliers == 0,
    putative_sex_chromosome_aneuploidy == 0
) %>%
filter(is.na(population)) %>%
filter(sex == 0) %>%
select(FID, IID) %>%
fwrite(str_replace(out_f, '\\.phe$', '.F.phe'), sep='\t', na = "NA", quote=FALSE, col.names=FALSE)


In [23]:
# Keep file restricted to sex == 1, written with the .M.phe suffix
sqc_df %>%
filter(
    used_in_pca_calculation == 1,
    excess_relatives == 0,
    het_missing_outliers == 0,
    putative_sex_chromosome_aneuploidy == 0
) %>%
filter(is.na(population)) %>%
filter(sex == 1) %>%
select(FID, IID) %>%
fwrite(str_replace(out_f, '\\.phe$', '.M.phe'), sep='\t', na = "NA", quote=FALSE, col.names=FALSE)


In [24]:
out_f
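
The three export cells above repeat the same QC filter. One possible way to avoid the repetition is to factor the filter and the file writing into small helpers. This is only a sketch under stated assumptions: `filter_unassigned()`, `write_keep()`, `toy_df` and `toy_out` are hypothetical names that do not appear in the original notebook, and the toy table contains made-up values standing in for the real sample QC file.

```r
suppressPackageStartupMessages({
    library(dplyr)
    library(data.table)
})

# Hypothetical helper: QC-passing individuals without a population label.
filter_unassigned <- function(df) {
    df %>%
        filter(
            used_in_pca_calculation == 1,
            excess_relatives == 0,
            het_missing_outliers == 0,
            putative_sex_chromosome_aneuploidy == 0,
            is.na(population)
        )
}

# Hypothetical helper: write a headerless, tab-separated FID/IID keep file.
write_keep <- function(df, path) {
    df %>%
        select(FID, IID) %>%
        fwrite(path, sep = '\t', na = 'NA', quote = FALSE, col.names = FALSE)
}

# Toy data standing in for the real sample QC table (values are made up).
toy_df <- data.frame(
    FID = 1:6, IID = 1:6,
    used_in_pca_calculation            = c(1, 1, 1, 0, 1, 1),
    excess_relatives                   = c(0, 0, 0, 0, 1, 0),
    het_missing_outliers               = 0,
    putative_sex_chromosome_aneuploidy = 0,
    population = c(NA, NA, 'white_british', NA, NA, NA),
    sex        = c(0, 1, 0, 1, 0, 1)
)

toy_out <- tempfile(fileext = '.phe')  # placeholder for out_f
others  <- filter_unassigned(toy_df)

# Same three outputs as above: everyone, sex == 0 (.F.phe), sex == 1 (.M.phe)
write_keep(others, toy_out)
write_keep(filter(others, sex == 0), sub('\\.phe$', '.F.phe', toy_out))
write_keep(filter(others, sex == 1), sub('\\.phe$', '.M.phe', toy_out))
```

In the real notebook, `sqc_df` and `out_f` would take the place of `toy_df` and `toy_out`. Keeping the filter in one function means a change to the QC criteria is applied to all three keep files at once.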