# Genotype array covariate for multi-ethnic groups

#### Mamie Wang (szmamie@stanford.edu)

#### 2018/02/19

This note investigates the `Array` covariate being dropped in the GLM regression step of the GWAS pipeline.

### Objective
* Produce a contingency table of array-type counts for each of the four populations (`white_british`, `african`, `e_asian`, `s_asian`)

In [1]:
import pandas as pd

## Genotype array covariate in sqc file
* The `genotyping_array` column of the sqc file gives the array type (`UKBL` or `UKBB`)

In [2]:
with open('/oak/stanford/groups/mrivas/ukbb24983/sqc/download/ukb_sqc_v2.fields.txt') as f:
    sqc_columns = [x for x in f.read().splitlines() if len(x) > 0]

In [3]:
sqc = pd.read_csv(
    '/oak/stanford/groups/mrivas/ukbb24983/sqc/download/ukb_sqc_v2.txt',
    sep=r'\s+', names=sqc_columns  # raw string avoids an invalid-escape warning
)

In [4]:
list(sqc.columns)[0:10]

['affymetrix_field_1',
 'affymetrix_field_2',
 'genotyping_array',
 'Batch',
 'Plate_Name',
 'Well',
 'Cluster_CR',
 'dQC',
 'Internal_Pico_ng_uL',
 'Submitted_Gender']

In [5]:
sqc.iloc[:,2].unique()

array(['UKBB', 'UKBL'], dtype=object)

In [11]:
sqc.shape

(488377, 89)

## Fam file and population stratification file

In [14]:
fam = pd.read_csv(
    '/oak/stanford/groups/mrivas/ukbb24983/fam/ukb2498_cal_v2_s488370.fam', sep=r'\s+',
    names=['FID', 'IID', 'father', 'mother', 'sex', 'batch']
)

In [15]:
fam.shape

(488377, 6)

In [38]:
white_british = pd.read_csv(
    "/oak/stanford/groups/mrivas/ukbb24983/sqc/population_stratification/ukb24983_white_british.phe",
    sep=r'\s+', names=["FID", "IID"]
)
african = pd.read_csv(
    "/oak/stanford/groups/mrivas/ukbb24983/sqc/population_stratification/ukb24983_african.phe",
    sep=r'\s+', names=["FID", "IID"]
)

s_asian = pd.read_csv(
    "/oak/stanford/groups/mrivas/ukbb24983/sqc/population_stratification/ukb24983_s_asian.phe",
    sep=r'\s+', names=["FID", "IID"]
)

e_asian = pd.read_csv(
    "/oak/stanford/groups/mrivas/ukbb24983/sqc/population_stratification/ukb24983_e_asian.phe",
    sep=r'\s+', names=["FID", "IID"]
)

In [16]:
sqc_fam = pd.concat([fam, sqc], axis=1)
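
The column-wise concatenation above pairs each fam row with the sqc row at the same position, so it relies on both files listing individuals in the same order. A minimal sketch of that assumption, using small made-up frames (`fam_toy` and `sqc_toy` are hypothetical stand-ins, not UKBB data):

```python
import pandas as pd

# Toy stand-ins for the fam and sqc tables (hypothetical values).
fam_toy = pd.DataFrame({'FID': [1, 2, 3], 'IID': [1, 2, 3]})
sqc_toy = pd.DataFrame({'genotyping_array': ['UKBB', 'UKBL', 'UKBB']})

# axis=1 concatenation aligns on the index, so both frames need the
# same number of rows and the same default index.
assert len(fam_toy) == len(sqc_toy)
assert fam_toy.index.equals(sqc_toy.index)

sqc_fam_toy = pd.concat([fam_toy, sqc_toy], axis=1)
print(sqc_fam_toy)
```

In this note both `sqc.shape` and `fam.shape` report 488377 rows, which is the precondition the length check expresses.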

In [34]:
def genotypeArrayCount(phe):
    # Print the number of individuals per genotyping array for one population.
    print(
        phe.merge(sqc_fam, on=['FID', 'IID'], how='left')
        .groupby('genotyping_array')['IID']
        .count()
    )
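
The counting step is a left merge followed by a group-by. One detail worth knowing: individuals in the population file with no match in the merged table get a missing `genotyping_array`, and `groupby` leaves them out of the counts by default. A small sketch on made-up data (all frames and IDs here are hypothetical):

```python
import pandas as pd

# Toy version of the fam + sqc table (hypothetical values).
sqc_fam_toy = pd.DataFrame({
    'FID': [1, 2, 3, 4],
    'IID': [1, 2, 3, 4],
    'genotyping_array': ['UKBB', 'UKBL', 'UKBB', 'UKBB'],
})
# Toy population list; ID 9 is deliberately absent from sqc_fam_toy.
phe_toy = pd.DataFrame({'FID': [1, 2, 4, 9], 'IID': [1, 2, 4, 9]})

merged = phe_toy.merge(sqc_fam_toy, on=['FID', 'IID'], how='left')

# groupby drops rows whose key is NaN, so unmatched IDs are not counted.
counts = merged.groupby('genotyping_array')['IID'].count()
n_unmatched = int(merged['genotyping_array'].isna().sum())

print(counts)
print('unmatched individuals:', n_unmatched)
```

Comparing the sum of the counts with the number of rows in the population file is a quick way to spot unmatched individuals.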

In [39]:
genotypeArrayCount(white_british)

genotyping_array
UKBB    300158
UKBL     37041
Name: IID, dtype: int64


In [35]:
genotypeArrayCount(african)

genotyping_array
UKBB    6497
UKBL       1
Name: IID, dtype: int64


In [36]:
genotypeArrayCount(s_asian)

genotyping_array
UKBB    7347
UKBL      16
Name: IID, dtype: int64


In [37]:
genotypeArrayCount(e_asian)

genotyping_array
UKBB    2060
UKBL       1
Name: IID, dtype: int64


## Summary

| Population (number of individuals) | UKBB | UKBL |
| --- | ---: | ---: |
| White British | 300,158 | 37,041 |
| African | 6,497 | 1 |
| South Asian | 7,347 | 16 |
| East Asian | 2,060 | 1 |
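
To put the counts in proportion, the share of each population genotyped on the UKBL array can be computed directly from the table. A minimal sketch; the dictionary layout and variable names are illustrative, and the counts are the ones reported above:

```python
# Counts copied from the summary table.
counts = {
    'white_british': {'UKBB': 300158, 'UKBL': 37041},
    'african': {'UKBB': 6497, 'UKBL': 1},
    's_asian': {'UKBB': 7347, 'UKBL': 16},
    'e_asian': {'UKBB': 2060, 'UKBL': 1},
}

ukbl_fraction = {}
for population, by_array in counts.items():
    total = by_array['UKBB'] + by_array['UKBL']
    ukbl_fraction[population] = by_array['UKBL'] / total
    print(f"{population}: {by_array['UKBL']} of {total} on UKBL "
          f"({ukbl_fraction[population]:.4%})")
```

With a single UKBL individual in each of the African and East Asian groups, an array indicator is nearly constant within those groups, which would be consistent with the covariate dropping described at the top of this note.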