# Sample QC, Redaction and Population Splitting

#### Yosuke Tanigawa (ytanigaw@stanford.edu)

#### 2017/11/13 (updated on 2018/5/12)

#### Guhan Venkataraman (guhan@stanford.edu) (updated on 2018/11/2)

## Purposes of this notebook

Re-use this notebook whenever individuals need to be removed from the study. From time to time, BioBank sends out a list of individuals who have asked to have their data removed. This notebook streamlines the removal of those individuals and the subsequent redefinition of the population subsets. To reiterate, this notebook:

- Removes individuals that meet Sample Quality Control exclusion criteria
- Removes individuals that have been redacted from BioBank
    - Update 11/1/2018 - `/oak/stanford/groups/mrivas/ukbb24983/sqc/w24983_20181016.csv` contains the superset of all redacted individuals.
- Subsequently, creates `.phe` files that subset individuals into the following populations:
  - African
  - South Asian
  - East Asian
  - White British

In [1]:
import pandas as pd
import os
from functools import reduce

# SQC (Sample Quality Control)
- The following metrics were recommended by the UK BioBank as sample quality control metrics to watch out for:
  1. used_in_pca_calculation == 0
  1. het_missing_outliers == 1
  1. excess_relatives == 1
  1. putative_sex_chromosome_aneuploidy == 1
  1. FID < 0 (if the family ID is less than 0, that means that the individual has opted out of the study)
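The criteria above combine with a logical OR: a sample is excluded as soon as any one of them holds. The following is a minimal sketch on a toy table, not the real data. It assumes, as the cells below do, that the SQC rows and the `.fam` rows are in the same order. The names `toy_sqc`, `toy_fam`, `toy_mask` and `toy_excluded`, and all values, are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the SQC table and the .fam file (hypothetical values).
toy_sqc = pd.DataFrame({
    'used_in_pca_calculation':            [1, 0, 1, 1, 1],
    'het_missing_outliers':               [0, 0, 1, 0, 0],
    'excess_relatives':                   [0, 0, 0, 0, 0],
    'putative_sex_chromosome_aneuploidy': [0, 0, 0, 0, 0],
})
toy_fam = pd.DataFrame({'FID': [101, 102, 103, -1, 105]})

# A sample is excluded if ANY criterion holds, hence `|` rather than `&`.
toy_mask = (
    (toy_sqc.used_in_pca_calculation == 0) |
    (toy_sqc.het_missing_outliers == 1) |
    (toy_sqc.excess_relatives == 1) |
    (toy_sqc.putative_sex_chromosome_aneuploidy == 1) |
    (toy_fam.FID < 0)
)

# Collect the IDs of the flagged rows.
toy_excluded = set(toy_fam.loc[toy_mask, 'FID'])
print(sorted(toy_excluded))
```

Here the second row fails the PCA criterion, the third is a heterozygosity/missingness outlier, and the fourth has a negative FID, so three of the five toy samples are excluded.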

In [2]:
# We import the SQC files from the OAK space
with open('/oak/stanford/groups/mrivas/ukbb24983/sqc/download/ukb_sqc_v2.fields.txt') as f:
    sqc_columns = [x for x in f.read().splitlines() if len(x) > 0]

In [3]:
sqc = pd.read_csv(
    '/oak/stanford/groups/mrivas/ukbb24983/sqc/download/ukb_sqc_v2.txt',
    sep=r'\s+', names=sqc_columns
)
# We get an idea of what is inside the SQC files by looking at the shape and columns
print(sqc.shape)
print(sqc.columns)

(488377, 89)
Index([u'affymetrix_field_1', u'affymetrix_field_2', u'genotyping_array',
       u'Batch', u'Plate_Name', u'Well', u'Cluster_CR', u'dQC',
       u'Internal_Pico_ng_uL', u'Submitted_Gender', u'Inferred_Gender',
       u'X_intensity', u'Y_intensity', u'Submitted_Plate_Name',
       u'Submitted_Well', u'sample_qc_missing_rate', u'heterozygosity',
       u'heterozygosity_pc_corrected', u'het_missing_outliers',
       u'putative_sex_chromosome_aneuploidy', u'in_kinship_table',
       u'excluded_from_kinship_inference', u'excess_relatives',
       u'in_white_British_ancestry_subset', u'used_in_pca_calculation', u'PC1',
       u'PC2', u'PC3', u'PC4', u'PC5', u'PC6', u'PC7', u'PC8', u'PC9', u'PC10',
       u'PC11', u'PC12', u'PC13', u'PC14', u'PC15', u'PC16', u'PC17', u'PC18',
       u'PC19', u'PC20', u'PC21', u'PC22', u'PC23', u'PC24', u'PC25', u'PC26',
       u'PC27', u'PC28', u'PC29', u'PC30', u'PC31', u'PC32', u'PC33', u'PC34',
       u'PC35', u'PC36', u'PC37', u'PC38', u'PC39

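The table has many fields. We use four of them (`used_in_pca_calculation`, `het_missing_outliers`, `excess_relatives` and `putative_sex_chromosome_aneuploidy`) to filter out individuals who meet the exclusion criteria above; the fifth criterion, `FID < 0`, comes from the `.fam` file.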

## Import of the `.fam` file
We import the genotype file and get an idea of its size and what it looks like.

In [4]:
genotype_fam_file = '/oak/stanford/groups/mrivas/ukbb24983/fam/ukb2498_cal_v2_s488370.fam'
genotype_fam = pd.read_csv(
    genotype_fam_file, sep=r'\s+',
    names = ['FID', 'IID', 'father', 'mother', 'sex_1=male_2=female_0=unkonwn', 'batch']
)
print(genotype_fam.shape)
genotype_fam.head()

(488377, 6)


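|   | FID     | IID     | father | mother | sex_1=male_2=female_0=unkonwn | batch      |
|---|---------|---------|--------|--------|-------------------------------|------------|
| 0 | 2502845 | 2502845 | 0      | 0      | 1                             | Batch_b001 |
| 1 | 2314965 | 2314965 | 0      | 0      | 2                             | Batch_b001 |
| 2 | 1142584 | 1142584 | 0      | 0      | 2                             | Batch_b001 |
| 3 | 3665122 | 3665122 | 0      | 0      | 2                             | Batch_b001 |
| 4 | 4377492 | 4377492 | 0      | 0      | 2                             | Batch_b001 |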


### List of individuals to exclude
As mentioned before, these exclusion criteria were recommended/required by the BioBank.

In [5]:
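# NOTE: the masks built from `sqc` are applied to `genotype_fam` by row position.
# This assumes both files list the samples in the same order (both have 488377 rows).
exclude_sqc = set(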
    genotype_fam['FID'][
        (sqc.used_in_pca_calculation == 0) |
        (sqc.het_missing_outliers == 1) |
        (sqc.excess_relatives == 1) |
        (sqc.putative_sex_chromosome_aneuploidy == 1) |
        (genotype_fam.FID < 0)
    ]
)
# The total number of people excluded via SQC metrics and initial redactions
print("Total number of exclusions (initial redactions + SQC): " + str(len(exclude_sqc)))

Total number of exclusions (initial redactions + SQC): 81559


## Additional redactions
- As of November 2018, we need to remove 79 individuals, specified in the file `/oak/stanford/groups/mrivas/ukbb24983/sqc/w24983_20181016.csv`
- *IMPORTANT*: As the list is updated in the future, the lab should run this notebook again, replacing the path to the file below with the path to a file that supersets all redacted individuals thus far.
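Before swapping in a newer redaction file, it may be worth confirming that it really is a superset of the previous one. The following is a minimal sketch of such a check. The helper names `read_ids` and `missing_from_new` are hypothetical, the IDs are made up, and the parsing assumes whitespace-separated integer IDs, as in the cell below.

```python
def read_ids(text):
    """Parse whitespace-separated integer IDs into a set."""
    return {int(tok) for tok in text.split()}


def missing_from_new(old_text, new_text):
    """Return the IDs present in the old list but absent from the new one."""
    return read_ids(old_text) - read_ids(new_text)


# Made-up file contents; in practice these would be read from the two CSV files.
old_list = "1000011\n1000022\n"
new_list = "1000011\n1000022\n1000033\n"

missing = missing_from_new(old_list, new_list)
if missing:
    print("New list is NOT a superset; missing IDs: " + str(sorted(missing)))
else:
    print("New list supersets the old list.")
```

An empty result means every previously redacted individual is still covered by the new file.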

In [6]:
redacted = set()
# Each whitespace-separated token is one ID; `with` closes the file even if parsing fails
with open('/oak/stanford/groups/mrivas/ukbb24983/sqc/w24983_20181016.csv', 'r') as list_indivs:
    for val in list_indivs.read().split():
        redacted.add(int(val))
print("Total number of redacted individuals: " + str(len(redacted)))
# We add the newly redacted individuals to the previous set of SQC rejects/initial redactions
exclude = exclude_sqc.union(redacted)
# This leads to the total number of exclusions...
print("Total number of exclusions (redactions + SQC rejections): " + str(len(exclude)))

Total number of redacted individuals: 79
Total number of exclusions (redactions + SQC rejections): 81623


## Developing cohorts: White British European
Now that we have the individuals we need to exclude, we can subset the remaining samples into populations. We start with the white British cohort, which is flagged by the `in_white_British_ancestry_subset` column of the sample QC file.

In [7]:
white_british_unfiltered = set(genotype_fam['FID'][(sqc.in_white_British_ancestry_subset == 1)])
print("Number of white British people in dataset before exclusion: " + str(len(white_british_unfiltered)))
ethnic_groups = {}
ethnic_groups['white_british'] = white_british_unfiltered - exclude
print("Number of white British people in dataset after exclusion: " + str(len(ethnic_groups['white_british'])))

Number of white British people in dataset before exclusion: 409703
Number of white British people in dataset after exclusion: 337151


### ID mapping

In [8]:
mapping_df = pd.read_csv(
    '/oak/stanford/groups/mrivas/ukbb24983/phe_qc/ukb24983_2228_mapping.tsv', sep='\t',
    names=['24983', '2228']
)
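The population cells below use this table to translate IDs before applying the exclusions. The following is a minimal sketch of that merge on toy data. It assumes, judging from how the merge is written, that the `2228` column holds the IDs used in the `.eigenvec` files and the `24983` column holds the IDs used in the rest of this notebook. All IDs and the `toy_*` names are made up.

```python
import pandas as pd

# Toy mapping table and toy eigenvec IDs (hypothetical values).
toy_mapping = pd.DataFrame({'24983': [11, 12, 13], '2228': [901, 902, 903]})
toy_eigenvec = pd.DataFrame({'#FID': [901, 903, 999]})  # 999 has no mapping

# The default merge is an inner join, so unmapped IDs are dropped silently.
mapped = toy_eigenvec.merge(toy_mapping, left_on='#FID', right_on='2228')
toy_ids = set(mapped['24983'])

# A left join with indicator=True makes the dropped IDs visible.
checked = toy_eigenvec.merge(
    toy_mapping, left_on='#FID', right_on='2228', how='left', indicator=True
)
unmapped = set(checked.loc[checked['_merge'] == 'left_only', '#FID'])

print(sorted(toy_ids))
print(sorted(unmapped))
```

If `unmapped` is non-empty on real data, the "before exclusion" counts understate the number of rows in the `.eigenvec` file.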

## African

In [13]:
african_unfiltered = set(    
    pd.read_csv(
        '/scratch/PI/mrivas/users/annashch/pca/african/pca_results_v2_chrom1_african.eigenvec', 
        sep=r'\s+'
    ).merge(
        mapping_df,
        left_on = '#FID',
        right_on = '2228'
    )['24983']
)
print("Number of African people in dataset before exclusion: " + str(len(african_unfiltered)))
ethnic_groups['african'] = african_unfiltered - exclude
print("Number of African people in dataset after exclusion: " + str(len(ethnic_groups['african'])))

Number of African people in dataset before exclusion: 7073
Number of African people in dataset after exclusion: 6497


## South Asian

In [10]:
s_asian_unfiltered = set(
    pd.read_csv(
        '/scratch/PI/mrivas/users/annashch/pca/s_asian/pca_results_v2_chrom1_s_asian.eigenvec', 
        sep=r'\s+'
    ).merge(
        mapping_df,
        left_on = '#FID',
        right_on = '2228'
    )['24983']
)
print("Number of South Asian people in dataset before exclusion: " + str(len(s_asian_unfiltered)))
ethnic_groups['s_asian'] = s_asian_unfiltered - exclude
print("Number of South Asian people in dataset after exclusion: " + str(len(ethnic_groups['s_asian'])))

Number of South Asian people in dataset before exclusion: 8111
Number of South Asian people in dataset after exclusion: 7363


## East Asian

In [11]:
e_asian_unfiltered = set(
    pd.read_csv(
        '/scratch/PI/mrivas/users/annashch/pca/e_asian/pca_results_v2_chrom1_e_asian.eigenvec', 
        sep=r'\s+'
    ).merge(
        mapping_df,
        left_on = '#FID',
        right_on = '2228'
    )['24983']
)
print("Number of East Asian people in dataset before exclusion: " + str(len(e_asian_unfiltered)))
ethnic_groups['e_asian'] = e_asian_unfiltered - exclude
print("Number of East Asian people in dataset after exclusion: " + str(len(ethnic_groups['e_asian'])))

Number of East Asian people in dataset before exclusion: 2156
Number of East Asian people in dataset after exclusion: 2061


## Write to files
Every time this notebook is run, we should save the new phenotype (`.phe`) files to a new directory that has the date in the name. We have kept the root directory as `/oak/stanford/groups/mrivas/ukbb24983/sqc/` on Sherlock, and the name of the directory as `population_stratification_w24983_YYYYMMDD`.
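One way to follow this naming convention while guarding against accidental overwrites is to build the directory name from a date and let the directory creation fail if it already exists. The following is a minimal sketch, not the notebook's own method. The helper `make_dated_dir` is hypothetical, and the demo writes to a temporary directory rather than to the real root.

```python
import datetime
import os
import tempfile


def make_dated_dir(root, date=None):
    """Create population_stratification_w24983_YYYYMMDD under `root`.

    Raises OSError if the directory already exists, so earlier
    outputs are never silently overwritten.
    """
    date = date or datetime.date.today()
    name = 'population_stratification_w24983_' + date.strftime('%Y%m%d')
    path = os.path.join(root, name)
    os.makedirs(path)  # fails if `path` is already there
    return path


# Demo in a throwaway directory (stand-in for the real root on Sherlock).
demo_root = tempfile.mkdtemp()
demo_dir = make_dated_dir(demo_root, datetime.date(2018, 11, 2))
print(os.path.basename(demo_dir))
```

Calling `make_dated_dir` a second time with the same root and date raises an error instead of reusing the directory.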

## It is imperative you rename the directory below, OR ELSE ALL THE PREVIOUS FILES GET OVERWRITTEN. I've commented it out in the GitHub version because of this.

In [12]:
#write_dir = '/oak/stanford/groups/mrivas/ukbb24983/sqc/population_stratification_w24983_YYYYMMDD/'

In [None]:
for key in ethnic_groups.keys():
    pd.DataFrame({
        'FID': sorted(ethnic_groups[key]),
        'IID': sorted(ethnic_groups[key])
    }).to_csv(
        os.path.join(write_dir, 'ukb24983_{}.phe'.format(key)), sep='\t', index=False, header=False
    )

# Summary of number of individuals

| Population    | Individuals after exclusion |
|---------------|----------------------------:|
| White British |                     337,151 |
| African       |                       6,497 |
| South Asian   |                       7,363 |
| East Asian    |                       2,061 |