# Population stratification for Multi-ethnic group analysis

#### Yosuke Tanigawa (ytanigaw@stanford.edu)
#### 2017/11/13 (updated on 2018/5/12)


### Multi-ethnic group analysis is based on the PCA conducted by Anna Shcherbina

### This note aims to extend (and replace) what we've done in the previous analysis for the European British population
- https://github.com/rivas-lab/ukbb24983wiki/blob/master/genotype_data/sqc/ukb24983_remove_from_sqc_file.ipynb

## objective

- Create `phe` files to subset individuals to each of the following populations.
  - African
  - South Asian
  - East Asian
  - European British
    - We've defined exclusion file here:
    - https://github.com/rivas-lab/ukbb24983wiki/blob/master/genotype_data/sqc/ukb24983_remove_from_sqc_file.ipynb

- For African, South Asian, and East Asian populations, we will adopt the criteria from Anna

## redacted individuals
- The following files list the individuals redacted from the study. We need to remove them. UKBB sent us these lists via email.
    - `/oak/stanford/groups/mrivas/ukbb24983/phe_qc/w2498_20170726.phe`
    - `/oak/stanford/groups/mrivas/ukbb24983/phe_qc/w24983_20180503.phe`
- The results of this population stratification will be written to
    - `/oak/stanford/groups/mrivas/ukbb24983/sqc/population_stratification_<name of the redaction file>`
        - For example:
            - `/oak/stanford/groups/mrivas/ukbb24983/sqc/population_stratification_w2498_20170726`
            - `/oak/stanford/groups/mrivas/ukbb24983/sqc/population_stratification_w24983_20180503`
    - We create a symbolic link to the most recent version of population definition
        - `/oak/stanford/groups/mrivas/ukbb24983/sqc/population_stratification -> population_stratification_w24983_20180503`
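
A minimal sketch of how the symbolic link could be repointed to the newest version. The helper name `update_latest_link` is hypothetical, and the demonstration runs in a scratch directory rather than under the real `/oak` path; it assumes a POSIX file system.

```python
import os
import tempfile


def update_latest_link(sqc_dir, version_dir_name, link_name="population_stratification"):
    """Point sqc_dir/link_name at version_dir_name using a relative symlink."""
    link_path = os.path.join(sqc_dir, link_name)
    tmp_link = link_path + ".tmp"
    # Build the new link under a temporary name, then rename it over the old one
    # so that the link never disappears in between.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(version_dir_name, tmp_link)
    os.replace(tmp_link, link_path)
    return link_path


# Demonstration in a scratch directory (not the real /oak path)
demo_root = tempfile.mkdtemp()
for name in ["population_stratification_w2498_20170726",
             "population_stratification_w24983_20180503"]:
    os.makedirs(os.path.join(demo_root, name))

update_latest_link(demo_root, "population_stratification_w2498_20170726")
link = update_latest_link(demo_root, "population_stratification_w24983_20180503")
print(os.readlink(link))  # population_stratification_w24983_20180503
```

Using a relative target keeps the link valid if the whole `sqc` directory is moved or mounted elsewhere.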


In [1]:
import pandas as pd
import os

In [13]:
from functools import reduce

# sqc file
- Sample Quality Control file provided by consortia
- We will use the following criteria to exclude individuals
  1. used_in_pca_calculation == 0
  1. het_missing_outliers == 1
  1. excess_relatives == 1
  1. putative_sex_chromosome_aneuploidy == 1
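
The four criteria are combined with a logical OR, so an individual failing any one of them is excluded. A small sketch on a made-up table (the values in `toy_sqc` are invented for illustration; only the column names come from the sqc file) shows how to count failures per criterion as well as the combined total:

```python
import pandas as pd

# Toy stand-in for the sqc table: one row per individual
toy_sqc = pd.DataFrame({
    "used_in_pca_calculation":            [1, 0, 1, 1, 1],
    "het_missing_outliers":               [0, 0, 1, 0, 0],
    "excess_relatives":                   [0, 0, 0, 1, 0],
    "putative_sex_chromosome_aneuploidy": [0, 0, 0, 0, 0],
})

# One boolean column per exclusion criterion
flags = pd.DataFrame({
    "not_used_in_pca": toy_sqc["used_in_pca_calculation"] == 0,
    "het_missing_outlier": toy_sqc["het_missing_outliers"] == 1,
    "excess_relatives": toy_sqc["excess_relatives"] == 1,
    "sex_chr_aneuploidy": toy_sqc["putative_sex_chromosome_aneuploidy"] == 1,
})

# An individual is excluded if any single criterion is met
fails_any = flags.any(axis=1)

print(flags.sum())            # failures per criterion
print(int(fails_any.sum()))   # individuals failing at least one criterion
```

Keeping the per-criterion flags makes it easy to see which filter removes the most individuals.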

In [2]:
with open('/oak/stanford/groups/mrivas/ukbb24983/sqc/download/ukb_sqc_v2.fields.txt') as f:
    sqc_columns = [x for x in f.read().splitlines() if len(x) > 0]

In [3]:
sqc = pd.read_csv(
    '/oak/stanford/groups/mrivas/ukbb24983/sqc/download/ukb_sqc_v2.txt',
    sep=r'\s+', names=sqc_columns
)

In [4]:
sqc.shape

(488377, 89)

In [5]:
sqc.columns

Index(['affymetrix_field_1', 'affymetrix_field_2', 'genotyping_array', 'Batch',
       'Plate_Name', 'Well', 'Cluster_CR', 'dQC', 'Internal_Pico_ng_uL',
       'Submitted_Gender', 'Inferred_Gender', 'X_intensity', 'Y_intensity',
       'Submitted_Plate_Name', 'Submitted_Well', 'sample_qc_missing_rate',
       'heterozygosity', 'heterozygosity_pc_corrected', 'het_missing_outliers',
       'putative_sex_chromosome_aneuploidy', 'in_kinship_table',
       'excluded_from_kinship_inference', 'excess_relatives',
       'in_white_British_ancestry_subset', 'used_in_pca_calculation', 'PC1',
       'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9', 'PC10', 'PC11',
       'PC12', 'PC13', 'PC14', 'PC15', 'PC16', 'PC17', 'PC18', 'PC19', 'PC20',
       'PC21', 'PC22', 'PC23', 'PC24', 'PC25', 'PC26', 'PC27', 'PC28', 'PC29',
       'PC30', 'PC31', 'PC32', 'PC33', 'PC34', 'PC35', 'PC36', 'PC37', 'PC38',
       'PC39', 'PC40', 'in_Phasing_Input_chr1', 'in_Phasing_Input_chr2',
       'in_Phasing_Inpu

## 488370 individuals
- We are focusing on 488370 individuals with genotype data

In [6]:
genotype_fam_file = '/oak/stanford/groups/mrivas/ukbb24983/fam/ukb2498_cal_v2_s488370.fam'

In [7]:
genotype_fam = pd.read_csv(
    genotype_fam_file, sep=r'\s+',
    names = ['FID', 'IID', 'father', 'mother', 'sex_1=male_2=female_0=unkonwn', 'batch']
)

In [8]:
print(genotype_fam.shape)
genotype_fam.head()

(488377, 6)


Unnamed: 0,FID,IID,father,mother,sex_1=male_2=female_0=unkonwn,batch
0,2502845,2502845,0,0,1,Batch_b001
1,2314965,2314965,0,0,2,Batch_b001
2,1142584,1142584,0,0,2,Batch_b001
3,3665122,3665122,0,0,2,Batch_b001
4,4377492,4377492,0,0,2,Batch_b001


In [9]:
genotype_fam[genotype_fam.FID < 0]

Unnamed: 0,FID,IID,father,mother,sex_1=male_2=female_0=unkonwn,batch
57326,-1,-1,0,0,0,redacted
61153,-2,-2,0,0,0,redacted
68758,-3,-3,0,0,0,redacted
127546,-4,-4,0,0,0,redacted
308491,-5,-5,0,0,0,redacted
387484,-6,-6,0,0,0,redacted
409932,-7,-7,0,0,0,redacted


- This means there are 488370 ( = 488377 - 7) individuals in this fam file.

### list of individuals to exclude

In [10]:
exclude_sqc = set(
    genotype_fam['FID'][
        (sqc.used_in_pca_calculation == 0) |
        (sqc.het_missing_outliers == 1) |
        (sqc.excess_relatives == 1) |
        (sqc.putative_sex_chromosome_aneuploidy == 1) |
        (genotype_fam.FID < 0)
    ]
)

In [11]:
len(exclude_sqc)

81559
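
The mask above is computed from `sqc` but applied to `genotype_fam`, so it relies on both tables listing individuals in the same row order. A hedged sketch of a sanity check follows; the function name `check_row_alignment` is hypothetical, and comparing the `Batch` column of `sqc` with the `batch` column of the fam file is an assumption about how the two files correspond, shown here on toy data.

```python
import pandas as pd


def check_row_alignment(sqc_df, fam_df):
    """Return the number of non-redacted rows whose batch labels disagree."""
    if len(sqc_df) != len(fam_df):
        raise ValueError("sqc and fam have different numbers of rows")
    if not sqc_df.index.equals(fam_df.index):
        raise ValueError("sqc and fam do not share the same index")
    # Redacted rows carry negative IDs and a 'redacted' batch label in the fam file
    redacted_rows = fam_df["FID"] < 0
    mismatched = (sqc_df["Batch"] != fam_df["batch"]) & ~redacted_rows
    return int(mismatched.sum())


# Toy data (values invented for illustration)
toy_sqc = pd.DataFrame({"Batch": ["Batch_b001", "Batch_b001", "Batch_b002"]})
toy_fam = pd.DataFrame({
    "FID": [11, -1, 13],
    "batch": ["Batch_b001", "redacted", "Batch_b002"],
})

n_mismatched = check_row_alignment(toy_sqc, toy_fam)
print(n_mismatched)
```

A result of zero is consistent with aligned rows; it does not prove alignment, since many individuals share a batch.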

## additional remove file
- The redaction files list 6 individuals in total that we need to remove

In [12]:
reducted_individuals_dir = '/oak/stanford/groups/mrivas/ukbb24983/phe_qc/'

In [14]:
reducted_dict = dict([])
for f in ['w2498_20170726', 'w24983_20180503']:
    reducted_dict[f] = set(
        pd.read_csv(
            os.path.join(reducted_individuals_dir, '{}.phe'.format(f)),
            sep=r'\s+', names=['FID', 'IID'],
        )['FID']
    )
reducted = reduce(lambda x, y: x.union(y), reducted_dict.values())

In [15]:
reducted

{1213560, 2331559, 2730934, 2828647, 3299994, 5649059}

In [16]:
exclude = exclude_sqc.union(reducted)

In [17]:
len(exclude)

81564
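
The exclusion list grows by 5 rather than 6 because one of the six redacted IDs is already in `exclude_sqc`: the size of a union is |A| + |B| - |A ∩ B|, so 81,564 = 81,559 + 6 - 1. The same bookkeeping on toy sets (the IDs below are invented for illustration):

```python
# Toy sets standing in for exclude_sqc and the redacted IDs
exclude_sqc_toy = {101, 102, 103, 104}
redacted_toy = {104, 201, 202}

overlap = exclude_sqc_toy & redacted_toy
combined = exclude_sqc_toy | redacted_toy

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
assert len(combined) == len(exclude_sqc_toy) + len(redacted_toy) - len(overlap)

# Applying the same identity to the counts reported in this notebook
implied_overlap = 81559 + 6 - 81564
print(implied_overlap)  # 1
```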

## White British European 

In [18]:
white_british_unfiltered = set(genotype_fam['FID'][(sqc.in_white_British_ancestry_subset == 1)])

In [19]:
len(white_british_unfiltered)

409703

In [20]:
ethnic_groups = dict([])

In [21]:
ethnic_groups['white_british'] = white_british_unfiltered - exclude

In [22]:
len(ethnic_groups['white_british'])

337198

### 337,208, 337,205, 337,199, or 337,198 ??
- 337,198 is the number based on `w24983_20180503`
- 337,199 was the number based on `w2498_20170726`
- 337,205 and 337,208 included some individuals that we should have removed.
    - Please check the 2017/11/13 version of the notebook (commit ID: c914a3957557c0374d4e7dca6dc148bf033e6750)
    - https://github.com/rivas-lab/ukbb24983wiki/blob/c914a3957557c0374d4e7dca6dc148bf033e6750/genotype_data/sqc/Multi-ethnic-group-analysis.ipynb


### ID mapping

In [23]:
mapping_df = pd.read_csv(
    '/oak/stanford/groups/mrivas/ukbb24983/phe_qc/ukb24983_2228_mapping.tsv', sep='\t',
    names=['24983', '2228']
)

## African

In [24]:
african_unfiltered = set(    
    pd.read_csv(
        '/scratch/PI/mrivas/users/annashch/pca/african/pca_results_v2_chrom1_african.eigenvec', 
        sep=r'\s+'
    ).merge(
        mapping_df,
        left_on = '#FID',
        right_on = '2228'
    )['24983']
)

In [25]:
ethnic_groups['african'] = african_unfiltered - exclude

In [26]:
len(african_unfiltered), len(ethnic_groups['african'])

(7073, 6498)
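
The default inner merge silently drops any eigenvec row whose ID is missing from the mapping table. One way to make such losses visible is a left merge with `indicator=True`; this is a sketch on toy data (all IDs and PC values below are invented), not the procedure used above.

```python
import pandas as pd

# Toy mapping between the two application IDs
toy_mapping = pd.DataFrame({
    "24983": [1001, 1002, 1003],
    "2228": [9001, 9002, 9003],
})

# Toy eigenvec table keyed by the 2228 application ID
toy_eigenvec = pd.DataFrame({
    "#FID": [9001, 9003, 9999],
    "IID": [9001, 9003, 9999],
    "PC1": [0.01, -0.02, 0.03],
})

merged = toy_eigenvec.merge(
    toy_mapping, left_on="#FID", right_on="2228",
    how="left", indicator=True,
)

# Rows present only on the left side have no entry in the mapping table
unmapped = merged.loc[merged["_merge"] == "left_only", "#FID"].tolist()
mapped_ids = set(merged.loc[merged["_merge"] == "both", "24983"].astype(int))

print(unmapped)
print(sorted(mapped_ids))
```

If `unmapped` is empty, the inner merge and the left merge give the same set of IDs.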

## South Asian

In [27]:
s_asian_unfiltered = set(
    pd.read_csv(
        '/scratch/PI/mrivas/users/annashch/pca/s_asian/pca_results_v2_chrom1_s_asian.eigenvec', 
        sep=r'\s+'
    ).merge(
        mapping_df,
        left_on = '#FID',
        right_on = '2228'
    )['24983']
)

In [28]:
ethnic_groups['s_asian'] = s_asian_unfiltered - exclude

In [29]:
len(s_asian_unfiltered), len(ethnic_groups['s_asian'])

(8111, 7363)

## East Asian

In [30]:
e_asian_unfiltered = set(
    pd.read_csv(
        '/scratch/PI/mrivas/users/annashch/pca/e_asian/pca_results_v2_chrom1_e_asian.eigenvec', 
        sep=r'\s+'
    ).merge(
        mapping_df,
        left_on = '#FID',
        right_on = '2228'
    )['24983']
)

In [31]:
ethnic_groups['e_asian'] = e_asian_unfiltered - exclude

In [32]:
len(e_asian_unfiltered), len(ethnic_groups['e_asian'])

(2156, 2061)

## write to files

In [33]:
write_dir = '/oak/stanford/groups/mrivas/ukbb24983/sqc/population_stratification_w24983_20180503/'

In [34]:
for key in ethnic_groups.keys():
    pd.DataFrame({
        'FID': sorted(ethnic_groups[key]),
        'IID': sorted(ethnic_groups[key])
    }).to_csv(
        os.path.join(write_dir, 'ukb24983_{}.phe'.format(key)), sep='\t', index=False, header=False
    )
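
A possible follow-up check, sketched on toy groups written to a scratch directory (the IDs are invented, and this check is not part of the original notebook): read each `phe` file back, confirm that FID equals IID, and look for individuals assigned to more than one population.

```python
import itertools
import os
import tempfile

import pandas as pd

# Toy groups standing in for ethnic_groups
toy_groups = {"african": {5, 3}, "e_asian": {9}, "s_asian": {7, 8}}
out_dir = tempfile.mkdtemp()

# Write the files with the same layout as above: FID and IID, tab-separated, no header
for key, ids in toy_groups.items():
    ordered = sorted(ids)
    pd.DataFrame({"FID": ordered, "IID": ordered}).to_csv(
        os.path.join(out_dir, "ukb24983_{}.phe".format(key)),
        sep="\t", index=False, header=False,
    )

# Read the files back and verify their contents
read_back = {}
for key in toy_groups:
    df = pd.read_csv(
        os.path.join(out_dir, "ukb24983_{}.phe".format(key)),
        sep="\t", names=["FID", "IID"],
    )
    assert (df["FID"] == df["IID"]).all()
    read_back[key] = set(df["FID"].tolist())

# Pairwise intersections; every one should be empty
overlaps = {
    (a, b): read_back[a] & read_back[b]
    for a, b in itertools.combinations(sorted(read_back), 2)
}
print(overlaps)
```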

# summary

| | number of individuals |
| --- | --- |
| White British | 337,198 |
| African       |   6,498 |
| South Asian   |   7,363 |
| East Asian    |   2,061 |
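
Hand-copied counts are easy to mistype, so the table could instead be generated from the `ethnic_groups` dictionary. A minimal sketch, using a hypothetical helper `summary_rows` and toy sets in place of the real groups:

```python
def summary_rows(groups, labels):
    """Format group sizes as a markdown table with thousands separators."""
    rows = ["| | number of individuals |", "| --- | --- |"]
    for key, label in labels.items():
        rows.append("| {} | {:,} |".format(label, len(groups[key])))
    return "\n".join(rows)


# Toy groups (sizes invented for illustration)
toy_groups = {
    "white_british": set(range(1200)),
    "african": set(range(2000, 2045)),
}
labels = {"white_british": "White British", "african": "African"}

table = summary_rows(toy_groups, labels)
print(table)
```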