# Step 1) Sample-level QC (sqc) and initial population assignment based on global genotype PCs and self-reported ancestry information

#### Yosuke Tanigawa (ytanigaw@stanford.edu)
#### Last update: 2020/9/1

We define the following 5 populations in UK Biobank: `white_british`, `non_british_white`, `s_asian`, `african`, and `e_asian`. Please see the `README.md` file for more details.


In [2]:
suppressWarnings(suppressPackageStartupMessages({
    library(tidyverse)
    library(data.table)
}))


## Step 1-0) Read the file names

In [19]:
source('0_file_names.R')


In [20]:
file_names <- get_file_names()

## Step 1-1) Read the input files

- The `sqc` (sample quality control) file from UK Biobank lists individuals in the same order as the array fam file.
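
Because the two files share the same row order, sample IDs can be attached to the `sqc` rows by position rather than by a join key. The sketch below illustrates that idea with toy in-memory tables. The values and the variable names (`fam_df`, `sqc_df`, `sqc_with_ids_df`) are assumptions made for illustration; the actual reading logic lives in `read_master_sqc()`, which is not shown in this notebook.

```r
# Toy stand-ins for the array fam file and the sqc file (hypothetical values).
fam_df <- data.frame(
    FID = c(1001L, 1002L, 1003L),
    IID = c(1001L, 1002L, 1003L)
)
sqc_df <- data.frame(
    het_missing_outliers    = c(0L, 1L, 0L),
    used_in_pca_calculation = c(1L, 0L, 1L)
)

# Positional binding is only safe when both tables have the same number of rows.
stopifnot(nrow(fam_df) == nrow(sqc_df))

# Attach the sample IDs to the sqc rows by row position.
sqc_with_ids_df <- cbind(fam_df, sqc_df)
print(sqc_with_ids_df)
```

A row-count check like the one above catches truncated files, but it cannot detect a reordered file, so the shared ordering remains an assumption that has to hold upstream.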


In [3]:
source('../sample_qc_functions.R')

In [4]:
master_sqc_df <- read_master_sqc(file_names)


### Characterize some counts

#### QC filter

In [36]:
master_sqc_df %>% count(
    putative_sex_chromosome_aneuploidy,
    het_missing_outliers,
    excess_relatives,
    used_in_pca_calculation
) %>%
arrange(-n)


putative_sex_chromosome_aneuploidy,het_missing_outliers,excess_relatives,used_in_pca_calculation,n
<int>,<int>,<int>,<int>,<int>
0,0,0,1,406825
0,0,0,0,79745
0,1,0,0,968
1,0,0,1,394
1,0,0,0,257
0,0,1,0,187
1,0,1,0,1


In [44]:
png(file=sprintf('%s.QC.UpSetR.wo_PCA_flag.png', file_names$out_figs_prefix), width=800, height=600, units="px", family = "Helvetica")
UpSetR::upset(
    UpSetR::fromList(list(
        'putative_sex_chromosome_aneuploidy=1' = master_sqc_df %>% 
        filter(putative_sex_chromosome_aneuploidy == 1) %>% pull(IID),
        
        'het_missing_outliers=1' = master_sqc_df %>% 
        filter(het_missing_outliers == 1) %>% pull(IID),

        'excess_relatives=1' = master_sqc_df %>% 
        filter(excess_relatives == 1) %>% pull(IID)
    )),
    mainbar.y.label = "Number of removed individuals",
    sets.x.label = "# individuals", nsets = 20, nintersects = NA,
    text.scale = 1.5, order.by = "freq", show.numbers = "yes"
)
dev.off()


In [45]:
png(file=sprintf('%s.QC.UpSetR.png', file_names$out_figs_prefix), width=800, height=600, units="px", family = "Helvetica")
UpSetR::upset(
    UpSetR::fromList(list(
        'putative_sex_chromosome_aneuploidy=1' = master_sqc_df %>% 
        filter(putative_sex_chromosome_aneuploidy == 1) %>% pull(IID),
        
        'het_missing_outliers=1' = master_sqc_df %>% 
        filter(het_missing_outliers == 1) %>% pull(IID),

        'excess_relatives=1' = master_sqc_df %>% 
        filter(excess_relatives == 1) %>% pull(IID),
        
        'used_in_pca_calculation=0' = master_sqc_df %>% 
        filter(used_in_pca_calculation == 0) %>% pull(IID)
    )), 
    mainbar.y.label = "Number of removed individuals",
    sets.x.label = "# individuals", nsets = 20, nintersects = NA,
    text.scale = 1.5, order.by = "freq", show.numbers = "yes"
)
dev.off()


#### Summary (number of individuals by self-reported ethnicity)

- `n` is the total number of individuals
- `n_QC` is the number of individuals who passed the 4 QC filters above
- `n_QC_PCA` is the number of **unrelated** individuals who passed the 4 QC filters above
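
As a rough illustration of how per-label counts of this kind can be tallied, the sketch below computes `n`, `n_QC`, and `n_QC_PCA` on a toy table. It assumes that passing QC means none of the three exclusion flags is set, and that the unrelated subset is the one with `used_in_pca_calculation == 1`. The exact definition used here is inside `show_counts_for_self_reported_ethnicity()`, which is not shown, so this should be read as one plausible reading and not as that function's implementation.

```r
# Toy data: one row per individual (all values are made up for illustration).
toy_sqc_df <- data.frame(
    f21000_sub_label = c("British", "British", "British", "Irish", "Irish"),
    putative_sex_chromosome_aneuploidy = c(0L, 0L, 1L, 0L, 0L),
    het_missing_outliers               = c(0L, 0L, 0L, 0L, 1L),
    excess_relatives                   = c(0L, 0L, 0L, 0L, 0L),
    used_in_pca_calculation            = c(1L, 0L, 1L, 1L, 0L)
)

# Assumption: "passing QC" means none of the three exclusion flags is set.
pass_qc <- with(toy_sqc_df,
    putative_sex_chromosome_aneuploidy == 0 &
    het_missing_outliers == 0 &
    excess_relatives == 0
)
# Assumption: the unrelated subset is the one used in the PCA calculation.
pass_qc_pca <- pass_qc & toy_sqc_df$used_in_pca_calculation == 1

# Encode each count as a 0/1 indicator so that a grouped sum gives the tally.
toy_sqc_df$n        <- 1L
toy_sqc_df$n_QC     <- as.integer(pass_qc)
toy_sqc_df$n_QC_PCA <- as.integer(pass_qc_pca)

toy_counts_df <- aggregate(
    cbind(n, n_QC, n_QC_PCA) ~ f21000_sub_label,
    data = toy_sqc_df, FUN = sum
)
print(toy_counts_df)
```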


In [9]:
master_sqc_df %>% 
show_counts_for_self_reported_ethnicity()

f21000_top_label,f21000_sub_label,f21000,n,n_QC,n_QC_PCA
<chr>,<chr>,<int>,<int>,<int>,<int>
Do not know,Do not know,-1.0,200,200,181
Prefer not to answer,Prefer not to answer,-3.0,1531,1526,1307
White,White,1.0,525,522,441
White,British,1001.0,430740,429184,355071
White,Irish,1002.0,12576,12515,10492
White,Any other white background,1003.0,15632,15542,14539
Mixed,Mixed,2.0,46,46,42
Mixed,White and Black Caribbean,2001.0,594,593,527
Mixed,White and Black African,2002.0,397,396,362
Mixed,White and Asian,2003.0,795,794,719


### Save PCA plots

In [10]:
plot_pca_all       <- master_sqc_df %>% plot_pca_self_reported() +
labs(
    title='Genotype-based global PCs across self-reported ethnicity in UK Biobank', 
    x='Genotype-based global PC1',
    y='Genotype-based global PC2',
    color='Self-reported ethnicity'
)

plot_pca_top_label <- plot_pca_all + facet_wrap( ~ f21000_top_label, ncol=3) 


In [11]:
# for(ext in c('png', 'pdf')){
for(ext in c('png')){    
    ggsave(
        sprintf('%s.PCA.self.reported.ethnicity.%s', file_names$out_figs_prefix, ext),
        plot_pca_all
    )
    
    ggsave(
        sprintf('%s.PCA.self.reported.ethnicity.facet.%s', file_names$out_figs_prefix, ext),
        plot_pca_top_label
    )
    
}

Saving 6.67 x 6.67 in image

Saving 6.67 x 6.67 in image



## Step 1-2) Define the population groups based on thresholds on PC1 and PC2

In [12]:
master_sqc_pop_df <- master_sqc_df %>% 
define_populations()
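
The body of `define_populations()` lives in `sample_qc_functions.R` and is not shown here. As a rough illustration of what a threshold-based assignment on PC1 and PC2 could look like, the sketch below labels toy individuals using rectangular bounds on the two PCs. The thresholds, the labels `pop_a` and `pop_b`, and the helper name `assign_population_sketch` are all hypothetical and are not the values used in this analysis. The real function may also use self-reported ancestry, which this sketch ignores.

```r
# Hypothetical helper: assign a population label from PC1/PC2 rectangles.
# The bounds below are invented for illustration only.
assign_population_sketch <- function(pc1, pc2) {
    population <- rep(NA_character_, length(pc1))
    population[pc1 >= -20 & pc1 <= 40 & pc2 >= -25 & pc2 <= 10] <- "pop_a"
    population[pc1 >= 260 & pc2 >= -50 & pc2 <= 50] <- "pop_b"
    population
}

# Toy PCs for three individuals (made-up values).
toy_pc_df <- data.frame(
    IID = c(1L, 2L, 3L),
    PC1 = c(-10, 300, 150),
    PC2 = c(3, 20, -200)
)
toy_pc_df$population <- assign_population_sketch(toy_pc_df$PC1, toy_pc_df$PC2)
print(toy_pc_df)
```

Individuals that fall outside every rectangle keep `NA` as their population, which matches the use of `drop_na(population)` when the population names are listed in this step.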


In [13]:
pops <- master_sqc_pop_df %>% 
drop_na(population) %>% 
select(population) %>% 
unique() %>% 
pull()


### The number of individuals in the initial population assignment

In [14]:
master_sqc_pop_df %>% show_population_counts()

population,UKBB,UKBL,w_exome,wo_exome,n
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
white_british,300094,37035,34392,302737,337129
non_british_white,22406,2499,2694,22211,24905
s_asian,7962,0,893,7069,7962
african,6497,0,847,5650,6497
e_asian,1772,0,187,1585,1772


## Step 1-3) Write the initial population definition to phe files
Every time this notebook is run, we should save the new phenotype (`.phe`) files to a new directory that has the date in its name. The root directory is `/oak/stanford/groups/mrivas/ukbb24983/sqc/` on Sherlock, and the directory name follows the pattern `population_stratification_w24983_YYYYMMDD`.

These file paths are now configured in `0_file_names.R`.
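
One way to reduce the risk of overwriting an earlier run is to build the date-stamped directory name programmatically and stop if the directory already exists. The sketch below shows that pattern. The helper `create_new_dir_or_stop` is hypothetical, and a temporary directory stands in for the real root on Sherlock; the actual paths come from `0_file_names.R`, whose contents are not shown here.

```r
# A fresh temporary path stands in for the real root directory in this sketch.
root_dir <- tempfile('sqc_root_')

# Build the date-stamped directory name (YYYYMMDD suffix from today's date).
out_dir <- file.path(
    root_dir,
    sprintf('population_stratification_w24983_%s', format(Sys.Date(), '%Y%m%d'))
)

# Hypothetical guard: stop instead of reusing an existing directory.
create_new_dir_or_stop <- function(path) {
    if (dir.exists(path)) {
        stop(sprintf('Directory already exists, refusing to overwrite: %s', path))
    }
    dir.create(path, recursive = TRUE)
    invisible(path)
}

create_new_dir_or_stop(out_dir)
print(out_dir)
```

With a guard like this, running the notebook twice on the same day raises an error on the second run, which forces an explicit decision before any existing `.phe` files are replaced.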

### [warning] Please check and update the paths in `0_file_names.R`. Otherwise, we will LOSE the files from the previous iteration.

Please uncomment the following `fwrite` commands after you have double-checked the file paths.


In [50]:
file_names$pop_refinement_pca

In [18]:
# for (pop in pops){
#     print(pop)
    
#     master_sqc_pop_df %>% filter(population == pop) %>%
#     select(FID, IID) %>% 
#     fwrite(file.path(file_names$pop_refinement_pca, paste0('ukb24983_', pop, '.phe')), sep='\t', col.names = F)    
# }


[1] "white_british"
[1] "e_asian"
[1] "non_british_white"
[1] "s_asian"
[1] "african"


## Next Step: Run PCA for each population group for population refinement

Please see `2_pca1_refinement.sh`.