# Liftover and genotype data creation

Create the genotype array data and regenerate the genotype files for the White European, Asian and African subsets, to be used in the REGENIE burden tests of different traits.

Original file:
`~/UKBiobank_Yale_transfer/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed`

Starting file for the liftover, obtained after running the first part of variant_qc and sample_qc in the notebook `project/UKBB_GWAS_dev/workflow/UKBB_QC_genoarray.ipynb`:
`~/UKBiobank/genotype_files_processed/082621_sampleqc_call90/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.082621_sampleqc_call90.filtered.bed`

In [13]:
liftover_sos=~/project/bioworkflows/GWAS/liftover.ipynb
input_file=/mnt/vast/hpc/csg/UKBiobank/genotype_files_processed/082621_sampleqc_call90/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.082621_sampleqc_call90.filtered.bim 
output_file=UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples 
cwd=/mnt/vast/hpc/csg/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/
fr=hg19 
to=hg38 
tpl_file=~/project/bioworkflows/admin/csg.yml
liftover_sbatch=/mnt/vast/hpc/csg/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/liftover_hg19tohg38_$(date +"%Y-%m-%d").sbatch

In [14]:
liftover_args="""default
    --cwd $cwd
    --input_file $input_file
    --output_file $output_file
    --fr $fr 
    --to $to
    --remove-missing
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $liftover_sos \
    --to-script $liftover_sbatch \
    --args "$liftover_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/vast/hpc/csg/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/liftover_hg19tohg38_2023-01-20.sbatch
INFO: Workflow csg (ID=wc6ae0c5fa24262a9) is executed successfully with 1 completed step.



**I had to run the workflow interactively after logging into a compute node**

```
qrsh -l h_vmem=100G -l h_rt=200:00:00 -q csg2.q -l t_pri
module load Singularity/3.5.3
 sos run /home/dmc2245/project/bioworkflows/GWAS/liftover.ipynb     default    --cwd /mnt/vast/hpc/csg/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/    --input_file /mnt/vast/hpc/csg/UKBiobank/genotype_files_processed/082621_sampleqc_call90/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.082621_sampleqc_call90.filtered.bim    --output_file UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples    --fr hg19     --to hg38    --remove-missing
```

PLINK then reported the following error:
```
Error:
/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.bim
has a split chromosome. Use --make-pgen + --sort-vars to remedy this.
```

The remedy applied here is to rewrite the fileset with PLINK 1.9 `--make-bed`:

`plink --bfile [old name] --make-bed --out [new name]`

### Sort the bed file to eliminate the split-chromosome error

```
qrsh -l h_vmem=100G -l h_rt=200:00:00 -q csg2.q -l t_pri
module load PLINK/1.9.10
plink --bfile /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples \
--make-bed --out /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted
```
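A quick way to confirm whether a `.bim` file still has a split chromosome is to check that every chromosome code occupies a single contiguous block of rows. The sketch below is only an illustration: `find_split_chromosomes` is a hypothetical helper (not part of PLINK or the liftover workflow), and it checks contiguity only, not the position order within a chromosome.

```python
import io
import pandas as pd

def find_split_chromosomes(bim):
    """Return chromosome codes that appear in more than one contiguous block.

    `bim` is a path or file-like object in PLINK .bim layout
    (whitespace-delimited, chromosome code in the first column).
    """
    chroms = pd.read_csv(bim, sep=r"\s+", header=None, usecols=[0], dtype=str)[0]
    # Keep only the first row of each run of identical chromosome codes
    blocks = chroms[chroms != chroms.shift()]
    counts = blocks.value_counts()
    return sorted(counts[counts > 1].index)

# Toy example: chromosome 1 is interrupted by a chromosome 2 variant
toy_bim = "1 rs1 0 100 A G\n2 rs2 0 50 A G\n1 rs3 0 200 A G\n"
print(find_split_chromosomes(io.StringIO(toy_bim)))  # ['1']
```

An empty list for the `.sorted.bim` file would indicate that the re-written fileset no longer triggers this particular error.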

In [150]:
%save -f ~/project/UKBB_GWAS_dev/output/plink_sorting_hg38_geno_array.sh

#!/bin/sh
#$ -l h_rt=200:00:00
#$ -l h_vmem=100G
#$ -N plink_sorting_hg38_geno_array
#$ -o /home/dmc2245/project/UKBB_GWAS_dev/output/plink_sorting_hg38_geno_array-$JOB_ID.out
#$ -e /home/dmc2245/project/UKBB_GWAS_dev/output/plink_sorting_hg38_geno_array-$JOB_ID.err
#$ -j y
#$ -q csg.q
#$ -S /bin/bash
module load PLINK/1.9
plink --bfile /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples \
--make-bed --out /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted


## Generate the phenotype files for the different ancestries

In [9]:
import pandas as pd
import numpy as np
from datetime import datetime

### Read-in database

In [3]:
# collect the necessary column names of the database for our analysis

with open("/mnt/mfs/statgen/UKBiobank/data/ukbb_databases/ukb47922_updatedAug2021/ukb47922.tab") as fp:
    line = fp.readline() # header
    header = line.rstrip("\n").split("\t") # drop the trailing newline so the last column name is clean
    
    indiv = ["f.eid"]
    reported_sex = ["f.31.0.0"]
    genetic_sex = ["f.22001.0.0"]
    white_british = ["f.22006.0.0"]
    ethnicity = [col.strip('"') for col in header if "f.21000." in col]
    year_of_birth = [col.strip('"') for col in header if "f.34." in col]
    month_of_birth = [col.strip('"') for col in header if "f.52." in col]
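The same column selection can be sketched with pandas alone, since `pd.read_csv(..., nrows=0)` parses only the header row. This is a minimal illustration on an in-memory toy table; `select_fields` and the toy column set are hypothetical and not part of the original analysis.

```python
import io
import pandas as pd

def select_fields(tab, exact, prefixes):
    """Return the columns of a tab-separated file that match an exact
    name or start with one of the given field prefixes."""
    header = pd.read_csv(tab, sep="\t", nrows=0).columns  # header row only
    return [c for c in header if c in exact or c.startswith(tuple(prefixes))]

# Toy table standing in for the UK Biobank .tab file
toy_tab = (
    "f.eid\tf.31.0.0\tf.21000.0.0\tf.21000.1.0\tf.210001.0.0\tf.34.0.0\n"
    "1\t0\t1001\t\t5\t1960\n"
)
cols = select_fields(io.StringIO(toy_tab), exact={"f.eid", "f.31.0.0"},
                     prefixes=["f.21000.", "f.34."])
toy_df = pd.read_csv(io.StringIO(toy_tab), sep="\t", dtype="string", usecols=cols)
print(cols)
```

Matching on a prefix that ends with a dot (`"f.21000."`) avoids picking up unrelated fields whose ID merely starts with the same digits.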

In [4]:
combined_cols = indiv  + ethnicity + reported_sex + genetic_sex +  year_of_birth + month_of_birth + white_british

In [5]:
print(datetime.now())

2023-01-23 10:12:13.311799


In [6]:
df = pd.read_csv("/mnt/mfs/statgen/UKBiobank/data/ukbb_databases/ukb47922_updatedAug2021/ukb47922.tab", dtype="string", sep='\t', usecols=combined_cols)
df

Unnamed: 0,f.eid,f.31.0.0,f.34.0.0,f.52.0.0,f.21000.0.0,f.21000.1.0,f.21000.2.0,f.22001.0.0,f.22006.0.0
0,1000019,0,1960,11,1001,,,0,1
1,1000022,1,1954,8,1001,,,1,1
2,1000035,1,1944,5,1001,,,1,1
3,1000046,0,1946,3,1001,,,0,1
4,1000054,0,1942,1,1001,,,0,1
...,...,...,...,...,...,...,...,...,...
502456,6025409,0,1946,11,1001,1001,,0,1
502457,6025411,0,1960,11,1001,,,0,1
502458,6025425,0,1963,8,1001,,,0,1
502459,6025438,1,1952,9,1001,,,1,1


In [7]:
print(datetime.now())

2023-01-23 12:08:16.569455


### Count ancestry of individuals in the full database

In [8]:
# set of answers for the ethnicity question
set(df[ethnicity[0]].to_list()).union( set(df[ethnicity[1]].to_list()) , set(df[ethnicity[2]].to_list()))

{'-1',
 '-3',
 '1',
 '1001',
 '1002',
 '1003',
 '2',
 '2001',
 '2002',
 '2003',
 '2004',
 '3',
 '3001',
 '3002',
 '3003',
 '3004',
 '4',
 '4001',
 '4002',
 '4003',
 '5',
 '6',
 <NA>}

In [31]:
# these should align with all possible options for ethnicity answers except for <NA>, Do not know, and Prefer not to answer
white = ['1001', '1002', '1','1003']
african = ['4001','2001', '4002', '2002', '4', '4003' ]
asian = ['3001', '3002', '2003', '3004', '3003', '3']
mixed = ['2', '2004']
chinese = ['5']
other = ['6']

# figure out the ancestry of each individual
def ancestry(row):
    temp = [x for x in row[ethnicity] if not pd.isna(x) and x != "-3" and x != "-1"]
    if len(temp) == 0:
        return "Unknown"
    
    if len(set(temp)) == 1 and temp[0] in white: # if we have only one unique answer and the answer is in the white variable
        return "_".join(temp[0].split(" ")) # return the unique answer
    
    if len([x for x in temp if x in white]) == len(temp):
        return "Inconsistent_white"
    if len([x for x in temp if x in asian]) == len(temp):
        return "Asian"
    if len([x for x in temp if x in african]) == len(temp):
        return "African"
    if len([x for x in temp if x in mixed]) == len(temp):
        return "Mixed"
    if len([x for x in temp if x in chinese]) == len(temp):
        return "Chinese"
    if len([x for x in temp if x in other]) == len(temp):
        return "Other"
    return "Inconsistent"
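To make the decision rules easier to check, here is a compact, set-based sketch of the same classification applied to a plain list of answers. It is meant as an illustration on toy inputs, not a replacement for the function above; `classify`, `WHITE` and `GROUPS` are hypothetical names, while the code lists are copied from the cell above.

```python
import pandas as pd

WHITE = {"1001", "1002", "1003", "1"}
GROUPS = {
    "Asian": {"3001", "3002", "3003", "3004", "2003", "3"},
    "African": {"4001", "4002", "4003", "2001", "2002", "4"},
    "Mixed": {"2", "2004"},
    "Chinese": {"5"},
    "Other": {"6"},
}

def classify(answers):
    """Classify one individual from the answers given across visits."""
    # Drop missing values, "Do not know" (-1) and "Prefer not to answer" (-3)
    codes = {a for a in answers if not pd.isna(a) and a not in ("-1", "-3")}
    if not codes:
        return "Unknown"
    if codes <= WHITE:
        # A single consistent white code is returned as-is
        return next(iter(codes)) if len(codes) == 1 else "Inconsistent_white"
    for label, members in GROUPS.items():
        if codes <= members:  # every answer falls in the same group
            return label
    return "Inconsistent"

print(classify(["1001", pd.NA, "1001"]))  # 1001
print(classify(["1001", "1002", pd.NA]))  # Inconsistent_white
print(classify(["1001", "3001", pd.NA]))  # Inconsistent
```

Because the groups do not overlap, the order in which they are tested does not change the result.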

In [32]:
df_white = df.copy()

In [28]:
# ancestry_all (defined in the "Count ancestry in the qc'ed database" section below) keeps the individual codes, which gives a more detailed count per ancestry. For practical reasons the ancestry function is used everywhere else
df_white["ethnicity"] = df_white[ethnicity].apply(ancestry_all, axis=1)

In [29]:
df_white.groupby(['ethnicity']).count()

ethnicity,f.eid,f.31.0.0,f.34.0.0,f.52.0.0,f.21000.0.0,f.21000.1.0,f.21000.2.0,f.22001.0.0,f.22006.0.0
1,553,553,553,553,553,5,1,529,0
1001,442342,442342,442342,442342,442337,18619,10593,430782,409433
1002,13021,13021,13021,13021,13021,395,229,12575,0
1003,16149,16149,16149,16149,16148,377,199,15636,0
2,49,49,49,49,49,0,0,46,0
2001,617,617,617,617,617,9,18,594,0
2002,421,421,421,421,421,15,6,398,0
2003,825,825,825,825,825,18,10,796,0
2004,1019,1019,1019,1019,1019,14,7,982,0
3,44,44,44,44,44,1,1,43,0


## Subset for the individuals present in the QCed sample

In [17]:
qc_ind = pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/genotype_files_processed/082621_sampleqc_call90/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.082621_sampleqc_call90.filtered.fam', dtype="string", sep='\t', names=["f.eid", "IID", "father","mother","sex", "pheno"])

In [18]:
qc_ind

Unnamed: 0,f.eid,IID,father,mother,sex,pheno
0,1000019,1000019,0,0,2,-9
1,1000022,1000022,0,0,1,-9
2,1000035,1000035,0,0,1,-9
3,1000046,1000046,0,0,2,-9
4,1000054,1000054,0,0,2,-9
...,...,...,...,...,...,...
486411,6025390,6025390,0,0,2,-9
486412,6025409,6025409,0,0,2,-9
486413,6025411,6025411,0,0,2,-9
486414,6025425,6025425,0,0,2,-9


In [19]:
qc_list = set([str(i) for i in qc_ind['f.eid'].to_list()])

def matches_qc_individuals(row):
    return row["f.eid"] in qc_list

In [20]:
filtered = df[df[["f.eid"]].apply(matches_qc_individuals, axis=1)]
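The row-wise `apply` above works, but the same subset can usually be obtained much faster with the vectorized `Series.isin`. The sketch below compares the two on toy data; the toy frames `db` and `fam` are hypothetical stand-ins for `df` and `qc_ind`.

```python
import pandas as pd

# Toy stand-ins for the phenotype database and the QC'ed .fam file
db = pd.DataFrame({"f.eid": ["1", "2", "3", "4"],
                   "f.31.0.0": ["0", "1", "1", "0"]}, dtype="string")
fam = pd.DataFrame({"f.eid": ["1", "3"]}, dtype="string")

# Vectorized membership test
filtered_fast = db[db["f.eid"].isin(fam["f.eid"])]

# Row-wise equivalent, as in the cell above
qc_ids = set(fam["f.eid"])
filtered_slow = db[db[["f.eid"]].apply(lambda row: row["f.eid"] in qc_ids, axis=1)]

print(filtered_fast.equals(filtered_slow))  # True
```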

In [21]:
filtered

Unnamed: 0,f.eid,f.31.0.0,f.34.0.0,f.52.0.0,f.21000.0.0,f.21000.1.0,f.21000.2.0,f.22001.0.0,f.22006.0.0
0,1000019,0,1960,11,1001,,,0,1
1,1000022,1,1954,8,1001,,,1,1
2,1000035,1,1944,5,1001,,,1,1
3,1000046,0,1946,3,1001,,,0,1
4,1000054,0,1942,1,1001,,,0,1
...,...,...,...,...,...,...,...,...,...
502455,6025390,0,1942,3,1001,,,0,
502456,6025409,0,1946,11,1001,1001,,0,1
502457,6025411,0,1960,11,1001,,,0,1
502458,6025425,0,1963,8,1001,,,0,1


### Count ancestry in the qc'ed database

In [22]:
# these should align with all possible options for ethnicity answers except for <NA>, Do not know, and Prefer not to answer
white = ['1001', '1002', '1','1003']
african = ['4001','2001', '4002', '2002', '4', '4003' ]
asian = ['3001', '3002', '2003', '3004', '3003', '3']
mixed = ['2', '2004']
chinese = ['5']
other = ['6']

# figure out the ancestry of each individual
def ancestry_all(row):
    temp = [x for x in row[ethnicity] if not pd.isna(x) and x != "-3" and x != "-1"]
    if len(temp) == 0:
        return "Unknown"
    
    if len(set(temp)) == 1 and temp[0] in white: # if we have only one unique answer and the answer is in the white variable
        return "_".join(temp[0].split(" ")) # return the unique answer
    
    if len([x for x in temp if x in white]) == len(temp):
        return "Inconsistent_white"
    
    if len(set(temp)) == 1 and temp[0] in asian: # if we have only one unique answer and the answer is in the asian variable
        return "_".join(temp[0].split(" ")) # return the unique answer
    
    if len(set(temp)) == 1 and temp[0] in african: # if we have only one unique answer and the answer is in the african variable
        return "_".join(temp[0].split(" ")) # return the unique answer
    
    if len(set(temp)) == 1 and temp[0] in mixed: # if we have only one unique answer and the answer is in the mixed variable
        return "_".join(temp[0].split(" ")) # return the unique answer
    
    if len([x for x in temp if x in chinese]) == len(temp):
        return "Chinese"
    if len([x for x in temp if x in other]) == len(temp):
        return "Other"
    return "Inconsistent"

In [23]:
df_qc = filtered.copy()

In [24]:
df_qc["ethnicity"] = df_qc[ethnicity].apply(ancestry_all, axis=1)

In [25]:
df_qc.groupby(['ethnicity']).count()

ethnicity,f.eid,f.31.0.0,f.34.0.0,f.52.0.0,f.21000.0.0,f.21000.1.0,f.21000.2.0,f.22001.0.0,f.22006.0.0
1,526,526,526,526,526,5,1,526,0
1001,429231,429231,429231,429231,429226,18462,10350,429231,407993
1002,12518,12518,12518,12518,12518,385,216,12518,0
1003,15543,15543,15543,15543,15542,374,192,15543,0
2,45,45,45,45,45,0,0,45,0
2001,593,593,593,593,593,9,18,593,0
2002,396,396,396,396,396,15,6,396,0
2003,796,796,796,796,796,18,10,796,0
2004,981,981,981,981,981,13,7,981,0
3,42,42,42,42,42,1,1,42,0


In [34]:
df_qc["ethnicity"] = df_qc[ethnicity].apply(ancestry, axis=1)

### Define white individuals 

For the PCA analysis we keep everyone who self-identified as white, plus the `Inconsistent_white` and `Unknown` groups. We first flag the non-white individuals so that they can be excluded.

In [35]:
def find_non_white(row):
    return row["ethnicity"] not in white and row["ethnicity"] != "Unknown" and row["ethnicity"] != "Inconsistent_white"

In [36]:
ex_non_white = df_qc[["ethnicity"]].apply(find_non_white, axis=1)

In [37]:
df_qc_white = df_qc[~ex_non_white]

In [38]:
df_qc_white

Unnamed: 0,f.eid,f.31.0.0,f.34.0.0,f.52.0.0,f.21000.0.0,f.21000.1.0,f.21000.2.0,f.22001.0.0,f.22006.0.0,ethnicity
0,1000019,0,1960,11,1001,,,0,1,1001
1,1000022,1,1954,8,1001,,,1,1,1001
2,1000035,1,1944,5,1001,,,1,1,1001
3,1000046,0,1946,3,1001,,,0,1,1001
4,1000054,0,1942,1,1001,,,0,1,1001
...,...,...,...,...,...,...,...,...,...,...
502455,6025390,0,1942,3,1001,,,0,,1001
502456,6025409,0,1946,11,1001,1001,,0,1,1001
502457,6025411,0,1960,11,1001,,,0,1,1001
502458,6025425,0,1963,8,1001,,,0,1,1001


In [39]:
df_qc_white = df_qc_white.rename(columns={'f.eid': 'IID', 'f.31.0.0': 'sex'})

In [42]:
df_qc_white['FID'] = df_qc_white['IID']

In [43]:
df_qc_white

Unnamed: 0,IID,sex,f.34.0.0,f.52.0.0,f.21000.0.0,f.21000.1.0,f.21000.2.0,f.22001.0.0,f.22006.0.0,ethnicity,FID
0,1000019,0,1960,11,1001,,,0,1,1001,1000019
1,1000022,1,1954,8,1001,,,1,1,1001,1000022
2,1000035,1,1944,5,1001,,,1,1,1001,1000035
3,1000046,0,1946,3,1001,,,0,1,1001,1000046
4,1000054,0,1942,1,1001,,,0,1,1001,1000054
...,...,...,...,...,...,...,...,...,...,...,...
502455,6025390,0,1942,3,1001,,,0,,1001,6025390
502456,6025409,0,1946,11,1001,1001,,0,1,1001,6025409
502457,6025411,0,1960,11,1001,,,0,1,1001,6025411
502458,6025425,0,1963,8,1001,,,0,1,1001,6025425


In [44]:
df_qc_white[["FID","IID","ethnicity"]].to_csv("/mnt/mfs/statgen/UKBiobank/phenotype_files/white_IID/012323_ukb47922_white_expanded_qc_460649.iid", sep="\t", index=False)

In [50]:
df_qc_white=pd.read_csv("/mnt/mfs/statgen/UKBiobank/phenotype_files/white_IID/012323_ukb47922_white_expanded_qc_460649.iid", dtype="string", sep="\t")

In [51]:
rel_white= pd.read_csv('~/UKBiobank/results/092821_PCA_results_500K/092821_king/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.white_europeans.filtered.092821_king.related_id', dtype="string", sep=" ", names=["FID", "IID"])

In [52]:
rel_white

Unnamed: 0,FID,IID
0,1000019,1000019
1,1000035,1000035
2,1000054,1000054
3,1000224,1000224
4,1000255,1000255
...,...,...
109214,6025176,6025176
109215,6025180,6025180
109216,6025322,6025322
109217,6025425,6025425


In [53]:
df_qc_white_unrel=df_qc_white[~df_qc_white.FID.isin(rel_white.FID)]

In [54]:
df_qc_white_unrel

Unnamed: 0,FID,IID,ethnicity
1,1000022,1000022,1001
3,1000046,1000046,1001
5,1000063,1000063,1001
6,1000078,1000078,1001
7,1000081,1000081,1001
...,...,...,...
460642,6025363,6025363,1001
460643,6025378,6025378,1001
460644,6025390,6025390,1001
460645,6025409,6025409,1001


In [55]:
df_qc_white_unrel.to_csv("/mnt/mfs/statgen/UKBiobank/phenotype_files/white_IID/012323_ukb47922_white_expanded_qc_unrel_351430.iid", sep="\t", index=False)

In [56]:
df_qc_white_rel=df_qc_white[df_qc_white.FID.isin(rel_white.FID)]

In [57]:
df_qc_white_rel

Unnamed: 0,FID,IID,ethnicity
0,1000019,1000019,1001
2,1000035,1000035,1001
4,1000054,1000054,1001
20,1000224,1000224,1001
23,1000255,1000255,1001
...,...,...,...
460626,6025176,6025176,1001
460627,6025180,6025180,1001
460638,6025322,6025322,1001
460647,6025425,6025425,1001


In [58]:
df_qc_white_rel.to_csv("/mnt/mfs/statgen/UKBiobank/phenotype_files/white_IID/012323_ukb47922_white_expanded_qc_rel_109219.iid", sep="\t", index=False)
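The related and unrelated subsets should partition the white sample: the counts encoded in the file names add up, 351430 + 109219 = 460649. A minimal sketch of that sanity check on toy data follows; `split_by_relatedness` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def split_by_relatedness(samples, related):
    """Split `samples` into (unrelated, related) by FID membership in `related`."""
    is_related = samples["FID"].isin(related["FID"])
    unrel, rel = samples[~is_related], samples[is_related]
    # The two subsets must partition the input: nothing lost, no overlap
    assert len(unrel) + len(rel) == len(samples)
    assert set(unrel["FID"]).isdisjoint(rel["FID"])
    return unrel, rel

toy_samples = pd.DataFrame({"FID": ["1", "2", "3", "4", "5"],
                            "IID": ["1", "2", "3", "4", "5"]}, dtype="string")
toy_related = pd.DataFrame({"FID": ["2", "5"], "IID": ["2", "5"]}, dtype="string")

toy_unrel, toy_rel = split_by_relatedness(toy_samples, toy_related)
print(len(toy_unrel), len(toy_rel))  # 3 2
```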

#### Select individuals and remove variants HWE<1e-15

In [4]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38
# bfile with variant_qc_1 N=486416 contains all samples and variants=674,489
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.bed
#To remove related samples
keep_samples=/mnt/mfs/statgen/UKBiobank/phenotype_files/white_IID/012323_ukb47922_white_expanded_qc_unrel_351430.iid
gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
gwasqc_sbatch=$cwd/hg38_white_european_variantqc2_$(date +"%Y-%m-%d").sbatch
# common variants 1% MAF
maf_filter=0.01
#call rate 99%
geno_filter=0.01
#hwe
hwe_filter=1e-15
# Set mind filter to 0 not to filter out more individuals based on sample missingness
mind_filter=0
mem='30G'
name='hg38_white_eur_qc'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/hg38_white_european_variantqc2_2023-01-30.sbatch
INFO: Workflow csg (ID=wa31b293079fa4c8a) is executed successfully with 1 completed step.
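To illustrate what the three variant thresholds mean, the sketch below applies them to the genotype counts of a single biallelic variant. It assumes the workflow forwards `maf_filter`, `geno_filter` and `hwe_filter` to PLINK's `--maf`, `--geno` and `--hwe`, which is not verified here. `variant_passes` is a hypothetical helper, and the HWE p-value uses a 1-df chi-square approximation rather than the exact test PLINK uses.

```python
import math

def variant_passes(n_aa, n_ab, n_bb, n_missing, maf=0.01, geno=0.01, hwe=1e-15):
    """Return True if a variant survives the missingness, MAF and HWE filters."""
    n_called = n_aa + n_ab + n_bb
    n_total = n_called + n_missing
    # Missingness: drop variants with more than `geno` missing calls
    if n_called == 0 or n_missing / n_total > geno:
        return False
    # Minor allele frequency: drop variants below `maf`
    p = (2 * n_aa + n_ab) / (2 * n_called)
    if min(p, 1 - p) < maf:
        return False
    if p in (0.0, 1.0):
        return True  # monomorphic: HWE is not testable
    # Chi-square goodness of fit against Hardy-Weinberg proportions (1 df)
    expected = (n_called * p * p, 2 * n_called * p * (1 - p), n_called * (1 - p) ** 2)
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
    p_hwe = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square with 1 df
    return p_hwe >= hwe

print(variant_passes(490, 420, 90, 0))   # True: in HWE, common, fully called
print(variant_passes(490, 420, 90, 20))  # False: about 2% missing calls
print(variant_passes(500, 0, 500, 0))    # False: no heterozygotes at all
```

With `hwe=0` the last filter never removes anything, which is how the later cells switch off filtering on already filtered data.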



#### Select individuals and remove variants HWE<1e-8

In [5]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/hwe_1e-8
# bfile with variant_qc_1 N=486416 contains all samples and variants=674,489
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.bed
#To remove related samples
keep_samples=/mnt/mfs/statgen/UKBiobank/phenotype_files/white_IID/012323_ukb47922_white_expanded_qc_unrel_351430.iid
gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
gwasqc_sbatch=$cwd/hg38_white_european_variantqc2_hwe1e-8$(date +"%Y-%m-%d").sbatch
# common variants 1% MAF
maf_filter=0.01
#call rate 99%
geno_filter=0.01
#hwe
hwe_filter=1e-8
# Set mind filter to 0 not to filter out more individuals based on sample missingness
mind_filter=0
mem='30G'
name='hg38_white_eur_qc'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/hwe_1e-8/hg38_white_european_variantqc2_hwe1e-82023-01-30.sbatch
INFO: Workflow csg (ID=wa795ccc4b346a471) is executed successfully with 1 completed step.



#### Final files for white Europeans after sample selection and variant qc_2

In [59]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/final_files
# bfile with variant_qc_1 N=486416 contains all samples and variants=674,489
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.bed
#To keep related and unrelated samples
keep_samples=/mnt/mfs/statgen/UKBiobank/phenotype_files/white_IID/012323_ukb47922_white_expanded_qc_460649.iid
# To keep variants after geno=0.01, maf=0.01 and hwe 1e-15
keep_variants=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.bim
gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
gwasqc_sbatch=$cwd/hg38_white_european_final_$(date +"%Y-%m-%d").sbatch
mem='30G'
name='hg38_white_eur_qc'
job_size=1
numThreads=2
geno=0.0
mind=0.0
hwe=0.0

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --geno_filter $geno
    --mind_filter $mind
    --hwe_filter $hwe
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/final_files/hg38_white_european_final_2023-01-31.sbatch
INFO: Workflow csg (ID=wed2f26d4276f8614) is executed successfully with 1 completed step.



#### Step 1. PCA for white Europeans: keep unrelated individuals and do LD pruning (get bed file)

In [60]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/ldprun_unrelated
## Use the QC'ed version of the genotype array already filtered to the white European individuals
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/final_files/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.extracted.bed
#To keep the samples of unrelated white European individuals only
keep_samples=/mnt/mfs/statgen/UKBiobank/phenotype_files/white_IID/012323_ukb47922_white_expanded_qc_unrel_351430.iid
#GWAS QC variables: set all the filters to 0 so the already filtered data is not filtered again
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
#LD pruning variables
window=50
shift=10
r2=0.1
gwas_sbatch=$cwd/ldprun_unrelated_whiteEUR_$(date +"%Y-%m-%d").sbatch
numThreads=20
mem='30G'

gwasqc_args="""qc
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/ldprun_unrelated/ldprun_unrelated_whiteEUR_2023-01-31.sbatch
INFO: Workflow csg (ID=wb6ca060407205956) is executed successfully with 1 completed step.
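The `window`, `shift` and `r2` settings describe a sliding-window pruning: within each window of 50 variants, pairs with squared correlation above 0.1 are thinned, then the window moves forward by 10 variants. The sketch below is a simplified greedy version for illustration only. It assumes the workflow passes these values to a PLINK-style pairwise pruning step, and the rule for which variant of a pair is dropped (always the later one here) is a simplification that may differ from PLINK. `ld_prune` is a hypothetical name.

```python
import numpy as np

def ld_prune(genotypes, window=50, shift=10, r2=0.1):
    """Greedy sliding-window pruning on a samples x variants dosage matrix.

    Returns the column indices of the variants that are kept.
    """
    n_variants = genotypes.shape[1]
    keep = np.ones(n_variants, dtype=bool)
    for start in range(0, n_variants, shift):
        # Variants of the current window that are still retained
        idx = [i for i in range(start, min(start + window, n_variants)) if keep[i]]
        if len(idx) < 2:
            continue
        corr2 = np.corrcoef(genotypes[:, idx], rowvar=False) ** 2
        for a in range(len(idx)):
            if not keep[idx[a]]:
                continue
            for b in range(a + 1, len(idx)):
                if keep[idx[b]] and corr2[a, b] > r2:
                    keep[idx[b]] = False  # simplification: drop the later variant
    return np.flatnonzero(keep)

# Toy data: three mutually uncorrelated variants plus a copy of the first one
toy_geno = np.array([
    [0, 0, 0, 0],
    [0, 0, 2, 0],
    [0, 2, 0, 0],
    [0, 2, 2, 0],
    [2, 0, 0, 2],
    [2, 0, 2, 2],
    [2, 2, 0, 2],
    [2, 2, 2, 2],
], dtype=float)
print(ld_prune(toy_geno))  # [0 1 2]
```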



#### Step 2. PCA: run for unrelated white Europeans

In [69]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/pca_unrelated
#This is the bfile originated after filtering unrelated individuals and pruning
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/ldprun_unrelated/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.extracted.filtered.prune.bed
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/white_IID/012323_ukb47922_white_expanded_qc_unrel_351430.iid
label_col=ethnicity
pop_col=ethnicity
pca_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb
pca_sbatch=$cwd/flashpca_whiteEUR_unrelated_genoarray_$(date +"%Y-%m-%d").sbatch
k=10
maha_k=5
min_axis=""
max_axis=""
homogeneous=TRUE

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --min_axis $min_axis
    --max_axis $max_axis
    --homogeneous $homogeneous
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/pca_unrelated/flashpca_whiteEUR_unrelated_genoarray_2023-01-31.sbatch
INFO: Workflow csg (ID=w749ee929472f7952) is executed successfully with 1 completed step.



#### Step 3. Get the bed file for related individuals

In [66]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/related_whiteEUR
## Use qc'ed genotype array
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/final_files/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.extracted.bed
keep_samples=/mnt/mfs/statgen/UKBiobank/phenotype_files/white_IID/012323_ukb47922_white_expanded_qc_rel_109219.iid
#Keep the same variants as above
keep_variants=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/ldprun_unrelated/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.extracted.filtered.prune.in
#GWAS QC variables
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
gwas_sbatch=$cwd/flashpca_whiteEUR_related_qc_genoarray_$(date +"%Y-%m-%d").sbatch
numThreads=20
mem='30G'


gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/related_whiteEUR/flashpca_whiteEUR_related_qc_genoarray_2023-01-31.sbatch
INFO: Workflow csg (ID=w569295ddb9de6d21) is executed successfully with 1 completed step.



#### Step 4. PCA: project back related individuals

In [67]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/project_related_whiteEUR
#This is the bfile originated after filtering related individuals
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/related_whiteEUR/*.bed
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/white_IID/012323_ukb47922_white_expanded_qc_460649.iid
pca_model=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/pca_unrelated/*.pca.rds
pca_sbatch=$cwd/flashpca_whiteEUR_related_genoarray_projected_$(date +"%Y-%m-%d").sbatch
label_col=ethnicity
pop_col=ethnicity
k=10
maha_k=5
prob=0.997
pval=0.05
## Set the --homogeneous TRUE option to treat all the pops as one
homogeneous=TRUE
## For the plot you need to use the *.projected.rds and not the *.projected.mahalanobis.rds
#plot_data=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.rds
#outlier_file=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.outliers


pca_args="""project_samples
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --pca_model $pca_model
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --prob $prob
    --pval $pval
    --homogeneous $homogeneous
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/project_related_whiteEUR/flashpca_whiteEUR_related_genoarray_projected_2023-01-31.sbatch
INFO: Workflow csg (ID=wc73d8ac3fde3ebca) is executed successfully with 1 completed step.
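The idea behind steps 2 to 4 is to fit the principal components on unrelated individuals only, then place the related individuals in the same PC space using the fitted loadings. The sketch below shows that projection with plain NumPy under simplifying assumptions: genotypes are only mean-centered (real tools typically also scale each variant), and `fit_pca` and `project` are hypothetical names, not functions of the PCA workflow.

```python
import numpy as np

def fit_pca(x_unrelated, k):
    """Fit a k-component PCA on a samples x variants matrix.

    Returns the per-variant means, the variant loadings and the sample scores.
    """
    mean = x_unrelated.mean(axis=0)
    u, s, vt = np.linalg.svd(x_unrelated - mean, full_matrices=False)
    loadings = vt[:k].T        # variants x k
    scores = u[:, :k] * s[:k]  # samples x k
    return mean, loadings, scores

def project(x_new, mean, loadings):
    """Project new samples onto the PCs, centering with the training means."""
    return (x_new - mean) @ loadings

rng = np.random.default_rng(0)
x_unrel = rng.normal(size=(20, 6))  # toy stand-in for unrelated genotypes
x_rel = rng.normal(size=(5, 6))     # toy stand-in for related genotypes

mean, loadings, scores = fit_pca(x_unrel, k=3)
projected = project(x_rel, mean, loadings)
print(projected.shape)  # (5, 3)
```

Projecting the training samples themselves reproduces their scores, which is a convenient check that the model and the projection use the same centering.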



#### Step 5. PCA: plot and look for outliers

In [11]:
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
job_size=1
numThreads=2
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/project_related_whiteEUR/plot
#This is the bfile originated after filtering related individuals
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/related_whiteEUR/*.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/white_IID/012323_ukb47922_white_expanded_qc_460649.iid
pca_model=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/pca_unrelated/012323_ukb47922_white_expanded_qc_460649.pca.rds
plot_data=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/project_related_whiteEUR/012323_ukb47922_white_expanded_qc_460649.pca.projected.rds
outlier_file=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/project_related_whiteEUR/012323_ukb47922_white_expanded_qc_460649.pca.outliers
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$cwd/flashpca_whiteEUR_related_genoarray_plot_$(date +"%Y-%m-%d").sbatch
k=10
pca_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb

pca_args="""plot_pca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --label_col $label_col
    --pop_col $pop_col
    --plot_data $plot_data
    --outlier_file $outlier_file
    --k $k
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg  \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/project_related_whiteEUR/plot/flashpca_whiteEUR_related_genoarray_plot_2023-02-02.sbatch
INFO: Workflow csg (ID=w43f05903bc6f2d1e) is executed successfully with 1 completed step.


#### Get individual ID and variants text files and final qc'ed bed files

In [10]:
%save -f /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/final_files_no_outliers/final_genoarray_whiteEUR.sh
#!/bin/sh
#$ -l h_rt=36:00:00
#$ -l h_vmem=30G
#$ -N final_genoarray_whiteEUR_2023-01-31
#$ -o /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/final_files_no_outliers/final_genoarray_whiteEUR_2023-01-31.out
#$ -e /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/final_files_no_outliers/final_genoarray_whiteEUR_2023-01-31.err
#$ -j y
#$ -q csg.q
#$ -S /bin/bash
export PATH=$HOME/miniconda3/bin:$PATH
module load PLINK/2.0
plink2 \
    --bfile /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/final_files/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.extracted \
    --remove /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/PCA/project_related_whiteEUR/*.pca.projected.outliers \
    --write-snplist --write-samples --no-id-header \
    --threads 20 \
    --make-bed \
    --out /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_white_european_460649ind_hg38/final_files_no_outliers/


### Define Asian individuals

In [45]:
# these should align with all possible options for ethnicity answers except for <NA>, Do not know, and Prefer not to answer
white = ['1001', '1002', '1','1003']
african = ['4001','2001', '4002', '2002', '4', '4003' ]
asian = ['3001', '3002', '2003', '3004', '3003', '3']
mixed = ['2', '2004']
chinese = ['5']
other = ['6']

# figure out the ancestry of each individual
def ancestry(row):
    temp = [x for x in row[ethnicity] if not pd.isna(x) and x != "-3" and x != "-1"]
    if len(temp) == 0:
        return "Unknown"
    
    if len(set(temp)) == 1 and temp[0] in asian: # if we have only one unique answer and the answer is in the asian variable
        return "_".join(temp[0].split(" ")) # return the unique answer
    
    if len([x for x in temp if x in asian]) == len(temp):
        return "Inconsistent_asian"
    if len([x for x in temp if x in white]) == len(temp):
        return "White"
    if len([x for x in temp if x in african]) == len(temp):
        return "African"
    if len([x for x in temp if x in mixed]) == len(temp):
        return "Mixed"
    if len([x for x in temp if x in chinese]) == len(temp):
        return "Chinese"
    if len([x for x in temp if x in other]) == len(temp):
        return "Other"
    return "Inconsistent"
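
The classification rule above can be checked on toy data. This is a minimal sketch, not the notebook's code: it assumes each individual has up to three answers to the ethnicity question (one per visit), that `-3` and `-1` are the "Prefer not to answer" and "Do not know" codes, and it works on plain lists instead of a DataFrame. `classify`, `GROUPS`, `SKIP` and `toy` are hypothetical names used only for illustration.

```python
# Toy restatement of the ancestry rule (hypothetical names, not the notebook's code).
ASIAN = {"3001", "3002", "2003", "3004", "3003", "3"}
GROUPS = {
    "White": {"1001", "1002", "1", "1003"},
    "African": {"4001", "2001", "4002", "2002", "4", "4003"},
    "Mixed": {"2", "2004"},
    "Chinese": {"5"},
    "Other": {"6"},
}
SKIP = {"-3", "-1"}  # assumed codes for "Prefer not to answer" / "Do not know"


def classify(answers):
    """Label one individual from their answers across visits (None = missing)."""
    valid = [a for a in answers if a is not None and a not in SKIP]
    if not valid:
        return "Unknown"
    if len(set(valid)) == 1 and valid[0] in ASIAN:
        return valid[0]                # one consistent Asian code: keep the code
    if all(a in ASIAN for a in valid):
        return "Inconsistent_asian"    # different codes, but all of them Asian
    for label, codes in GROUPS.items():
        if all(a in codes for a in valid):
            return label
    return "Inconsistent"              # answers span more than one group


toy = {
    "same_code": ["3001", None, "3001"],
    "two_asian_codes": ["3001", "3002", None],
    "no_answer": ["-3", None, None],
    "mixed_groups": ["3001", "1001", None],
}
labels = {name: classify(ans) for name, ans in toy.items()}
for name, label in labels.items():
    print(name, "->", label)
```

Individuals labelled with a single Asian code or with `Inconsistent_asian` are the ones that `find_asian` keeps in the next cells.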

In [46]:
df_qc_asian = filtered.copy()

In [47]:
df_qc_asian["ethnicity"] = df_qc_asian[ethnicity].apply(ancestry, axis=1)

In [48]:
def find_asian(row):
    return row["ethnicity"] in asian or row["ethnicity"] == "Inconsistent_asian"

In [49]:
inc_asian = df_qc_asian[["ethnicity"]].apply(find_asian, axis=1)

In [50]:
print(sum(inc_asian), "individuals considered asian")

10189 individuals considered asian


In [59]:
df_qc_asian=df_qc_asian[inc_asian]
df_qc_asian.groupby(['ethnicity']).count()

ethnicity,f.eid,f.31.0.0,f.34.0.0,f.52.0.0,f.21000.0.0,f.21000.1.0,f.21000.2.0,f.22001.0.0,f.22006.0.0
2003,796,796,796,796,796,18,10,796,0
3,42,42,42,42,42,1,1,42,0
3001,5651,5651,5651,5651,5651,70,82,5651,0
3002,1739,1739,1739,1739,1739,39,26,1739,0
3003,221,221,221,221,221,1,2,221,0
3004,1732,1732,1732,1732,1732,18,11,1732,0
Inconsistent_asian,8,8,8,8,8,3,6,8,0


In [60]:
df_qc_asian = df_qc_asian.rename(columns={'f.eid': 'IID', 'f.31.0.0': 'sex'})

In [61]:
df_qc_asian['FID'] = df_qc_asian['IID']

In [62]:
df_qc_asian

Unnamed: 0,IID,sex,f.34.0.0,f.52.0.0,f.21000.0.0,f.21000.1.0,f.21000.2.0,f.22001.0.0,f.22006.0.0,ethnicity,FID
89,1000906,1,1962,5,3003,,,1,,3003,1000906
186,1001874,0,1947,8,3004,,,0,,3004,1001874
270,1002712,0,1965,12,3001,,,0,,3001,1002712
301,1003025,1,1942,10,3001,,,1,,3001,1003025
307,1003083,1,1961,10,3001,,,1,,3001,1003083
...,...,...,...,...,...,...,...,...,...,...,...
502347,6024313,0,1964,5,3004,,,0,,3004,6024313
502399,6024837,1,1965,5,3001,,,1,,3001,6024837
502405,6024898,0,1950,9,3001,,,0,,3001,6024898
502425,6025096,0,1965,9,2003,,,0,,2003,6025096


In [63]:
df_qc_asian[["FID","IID","ethnicity"]].to_csv("/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/012323_ukb47922_asian_qc_10189.iid", sep="\t", index=False)

In [54]:
df_qc_asian=pd.read_csv("/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/012323_ukb47922_asian_qc_10189.iid", sep="\t", dtype="string")

In [55]:
call95_asian= pd.read_csv("/mnt/mfs/statgen/UKBiobank/genotype_files_processed/010622_asian_10189ind/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_asian_10189ind.filtered.mindrem.id", skiprows=[0], names=["FID", "IID"], dtype="string", sep='\t')
call95_asian

Unnamed: 0,FID,IID
0,1011968,1011968


In [56]:
df_qc_asian_call95=df_qc_asian[~df_qc_asian.FID.isin(call95_asian.FID)]

In [74]:
df_qc_asian_call95[["FID","IID", "ethnicity"]].to_csv("/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/012323_ukb47922_asian_qc_call95_10188.iid", sep="\t", index=False)

In [80]:
# Relatedness per UKB variable f.22021.0.0>1
rel_asian= pd.read_csv("/mnt/mfs/statgen/UKBiobank/genotype_files_processed/010622_asian_10189ind/010722_sampleQC_IID_asian_related.txt", sep='\t', dtype="string", header=None)
rel_asian

Unnamed: 0,0,1
0,4492584,4492584


In [58]:
# Relatedness per king calculation
rel_king_asian = pd.read_csv("/mnt/mfs/statgen/UKBiobank/genotype_files_processed/010622_asian_10189ind/010722_king_asian/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.010722_king_asian.related_id", names=["FID", "IID"],sep=" ", dtype="string", header=None)
rel_king_asian

Unnamed: 0,FID,IID
0,1011060,1011060
1,1011143,1011143
2,1013331,1013331
3,1023766,1023766
4,1028494,1028494
...,...,...
1106,6013816,6013816
1107,6015306,6015306
1108,6019749,6019749
1109,6021570,6021570


In [65]:
#create related and unrelated files
##unrelated
df_qc_asian_call95_unrel=df_qc_asian_call95[~df_qc_asian_call95.FID.isin(rel_king_asian.FID)]

In [69]:
df_qc_asian_call95_unrel.to_csv("/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/012323_ukb47922_asian_qc_call95_unrel_9077.iid", sep="\t", index=False)

In [67]:
## related
df_qc_asian_call95_rel=df_qc_asian_call95[df_qc_asian_call95.FID.isin(rel_king_asian.FID)]

In [71]:
df_qc_asian_call95_rel.to_csv("/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/012323_ukb47922_asian_qc_call95_rel_1111.iid", sep="\t", index=False)

In [103]:
df_qc_asian_unrel=df_qc_asian[~df_qc_asian.FID.isin(call95_asian.FID) & ~df_qc_asian.FID.isin(rel_king_asian.FID)]

In [104]:
df_qc_asian_unrel

Unnamed: 0,IID,sex,f.34.0.0,f.52.0.0,f.21000.0.0,f.21000.1.0,f.21000.2.0,f.22001.0.0,f.22006.0.0,ethnicity,FID
89,1000906,1,1962,5,3003,,,1,,3003,1000906
186,1001874,0,1947,8,3004,,,0,,3004,1001874
270,1002712,0,1965,12,3001,,,0,,3001,1002712
301,1003025,1,1942,10,3001,,,1,,3001,1003025
307,1003083,1,1961,10,3001,,,1,,3001,1003083
...,...,...,...,...,...,...,...,...,...,...,...
502342,6024266,0,1943,2,3001,,,0,,3001,6024266
502399,6024837,1,1965,5,3001,,,1,,3001,6024837
502405,6024898,0,1950,9,3001,,,0,,3001,6024898
502425,6025096,0,1965,9,2003,,,0,,2003,6025096


In [105]:
df_qc_asian_unrel[["FID","IID"]].to_csv("/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/012323_ukb47922_asian_qc_unrel_9077.iid", sep="\t", index=False)
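
The sample counts encoded in the file names give a quick sanity check: the unrelated and related sets should partition the call-rate-filtered set (9077 + 1111 = 10188). Below is a minimal sketch of that logic on toy IDs, using plain Python instead of pandas. `split_by_relatedness` and the toy IDs are hypothetical and only illustrate the check.

```python
def split_by_relatedness(all_ids, low_call_ids, related_ids):
    """Drop low-call-rate samples, then split the rest into (unrelated, related)."""
    low_call = set(low_call_ids)
    related = set(related_ids)
    kept = [i for i in all_ids if i not in low_call]
    unrelated = [i for i in kept if i not in related]
    rel = [i for i in kept if i in related]
    # The two lists must partition the kept samples.
    assert len(unrelated) + len(rel) == len(kept)
    return unrelated, rel


# Toy IDs; "9" is in the related list but absent from the cohort, which is harmless.
all_ids = ["1", "2", "3", "4", "5", "6"]
unrel, rel = split_by_relatedness(all_ids, low_call_ids=["2"], related_ids=["4", "5", "9"])
print(len(unrel), "unrelated;", len(rel), "related")

# Same check applied to the counts written in the Asian file names above.
assert 9077 + 1111 == 10188
```

Running the same comparison on the real `.iid` files would catch a mismatch between a file name and its actual number of rows.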

#### Select individuals and remove HWE < 1e-15

In [1]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38
# bfile with variant_qc_1 N=486416 contains all samples and variants=674,489
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.bed
#To remove related samples
keep_samples=/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/012323_ukb47922_asian_qc_unrel_9077.iid
gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
gwasqc_sbatch=$cwd/hg38_asian_variantqc2_$(date +"%Y-%m-%d").sbatch
# common variants 1% MAF
maf_filter=0.01
#call rate 99%
geno_filter=0.01
#hwe
hwe_filter=1e-15
# Set mind filter to 0 not to filter out more individuals based on sample missingness
mind_filter=0
mem='30G'
name='hg38_asianqc'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/hg38_asian_variantqc2_2023-01-30.sbatch
INFO: Workflow csg (ID=wda6679e6762a601a) is executed successfully with 1 completed step.
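
To make the thresholds concrete, here is a minimal sketch of what `maf_filter=0.01` and `geno_filter=0.01` mean for a single variant. It assumes the workflow applies them PLINK-style (drop variants with minor allele frequency below the MAF threshold, drop variants with a missing-call rate above the geno threshold). How `GWAS_QC.ipynb` implements this internally is not shown here. `variant_stats`, `passes_filters` and the toy genotypes are hypothetical.

```python
def variant_stats(genotypes):
    """Return (maf, missing_rate) for one variant.

    genotypes: alternate-allele counts per sample (0, 1 or 2), None if missing.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return 0.0, missing_rate
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq), missing_rate


def passes_filters(genotypes, maf_filter=0.01, geno_filter=0.01):
    """True if the variant survives both the MAF and the missingness filter."""
    maf, missing_rate = variant_stats(genotypes)
    return maf >= maf_filter and missing_rate <= geno_filter


# Toy variants, 200 samples each.
common = [0] * 100 + [1] * 80 + [2] * 20          # MAF 0.30, no missing calls
rare = [0] * 199 + [1]                            # MAF 0.0025, below the 1% cutoff
poorly_called = [0] * 100 + [1] * 90 + [None] * 10  # 5% missing, above the 1% cutoff

for name, g in [("common", common), ("rare", rare), ("poorly_called", poorly_called)]:
    print(name, variant_stats(g), passes_filters(g))
```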



#### Select individuals and remove HWE < 1e-8

In [7]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/hwe_1e-8
# bfile with variant_qc_1 N=486416 contains all samples and variants=674,489
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.bed
#To remove related samples
keep_samples=/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/012323_ukb47922_asian_qc_unrel_9077.iid
gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
gwasqc_sbatch=$cwd/hg38_asian_variantqc2_hwe1e-8_$(date +"%Y-%m-%d").sbatch
# common variants 1% MAF
maf_filter=0.01
#call rate 99%
geno_filter=0.01
#hwe
hwe_filter=1e-8
# Set mind filter to 0 not to filter out more individuals based on sample missingness
mind_filter=0
mem='30G'
name='hg38_asianqc'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/hwe_1e-8/hg38_asian_variantqc2_hwe1e-8_2023-01-30.sbatch
INFO: Workflow csg (ID=wcbc523ef1f626070) is executed successfully with 1 completed step.
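
The two runs above differ only in `hwe_filter` (1e-15 versus 1e-8). The sketch below shows how a variant can survive the looser cutoff and fail the stricter one. It uses a 1-degree-of-freedom chi-square test for Hardy-Weinberg equilibrium as a stand-in. PLINK itself uses an exact test, so the p-values here are an approximation and the genotype counts are invented. `hwe_chisq_pvalue` is a hypothetical helper.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for departure from Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Toy genotype counts for 1000 samples, allele frequency 0.5 in all three cases.
p_equilibrium = hwe_chisq_pvalue(250, 500, 250)   # matches expectation exactly
p_moderate = hwe_chisq_pvalue(303, 394, 303)      # heterozygote deficit
p_severe = hwe_chisq_pvalue(350, 300, 350)        # strong heterozygote deficit

for name, pval in [("equilibrium", p_equilibrium), ("moderate", p_moderate), ("severe", p_severe)]:
    kept_1e15 = pval >= 1e-15
    kept_1e8 = pval >= 1e-8
    print(f"{name}: p={pval:.3g} kept at 1e-15: {kept_1e15}, kept at 1e-8: {kept_1e8}")
```

The `moderate` variant is the interesting case: it is kept with `hwe_filter=1e-15` and removed with `hwe_filter=1e-8`.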



#### Final file for Asians after variant qc_2 (contains related and unrelated individuals)

In [2]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/final_files
# bfile with variant_qc_1 N=486416 contains all samples and variants=674,489
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.bed
#To keep related and unrelated asians
keep_samples=/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/012323_ukb47922_asian_qc_call95_10188.iid
# To keep variants using geno=0.01, maf=0.01 and hwe=1e-15
keep_variants=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.bim
gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
gwasqc_sbatch=$cwd/hg38_asian_ind_var_qc_$(date +"%Y-%m-%d").sbatch
mem='30G'
name='hg38_asianqc'
job_size=1
numThreads=2
geno=0.0
mind=0.0
hwe=0.0

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --geno_filter $geno
    --mind_filter $mind
    --hwe_filter $hwe
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/final_files/hg38_asian_ind_var_qc_2023-01-31.sbatch
INFO: Workflow csg (ID=waad2f31a456bc389) is executed successfully with 1 completed step.
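
Here `keep_variants` points to a `.bim` file rather than a plain ID list. A PLINK `.bim` has six whitespace-separated columns (chromosome, variant ID, genetic position, base-pair position, allele 1, allele 2). The sketch below pulls out the variant IDs. Whether the workflow reads the second column in exactly this way is an assumption. `read_bim_ids` is a hypothetical helper and the two toy records are invented.

```python
import io


def read_bim_ids(handle):
    """Return the variant IDs (second column) from an open PLINK .bim file."""
    ids = []
    for line in handle:
        fields = line.split()
        if len(fields) >= 6:          # skip blank or malformed lines
            ids.append(fields[1])
    return ids


# Invented records, only to show the column layout.
toy_bim = io.StringIO(
    "1\tchr1:1000:A:G\t0\t1000\tG\tA\n"
    "1\tchr1:2000:C:T\t0\t2000\tT\tC\n"
)
variant_ids = read_bim_ids(toy_bim)
print(variant_ids)
```

With a real file, the same function can be used as `read_bim_ids(open(path))` to count how many variants passed the previous QC step.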



#### step 1. PCA for Asians: keep unrelated and do LD pruning (get bed file)

In [37]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/ldprun_unrelated
## Use the qc version of the genotype array with the already filtered asian individuals
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/final_files/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.extracted.bed
#To keep the samples of asian and unrelated individuals only
keep_samples=/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/012323_ukb47922_asian_qc_unrel_9077.iid
#GWAS QC variables: leave all the variables in 0 so there's no more filtering in the already filtered data
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
#LD pruning variables
window=50
shift=10
r2=0.1
gwas_sbatch=$cwd/ldprun_unrelated_asian_$(date +"%Y-%m-%d").sbatch
numThreads=20
mem='30G'

gwasqc_args="""qc
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/ldprun_unrelated/ldprun_unrelated_asian_2023-01-30.sbatch
INFO: Workflow csg (ID=w5db99fd0b20fa2d9) is executed successfully with 1 completed step.
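
The pruning settings (`window=50`, `shift=10`, `r2=0.1`) look like PLINK's pairwise LD pruning parameters: window size in variants, step size, and r² threshold. That the workflow forwards them to PLINK in this way is an assumption. The sketch below is a simplified greedy version that conveys the idea only. It ignores `shift`, and PLINK's actual procedure differs in detail. `r_squared`, `greedy_prune` and the toy genotypes are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:          # monomorphic variant: no correlation defined
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def greedy_prune(variants, window=50, r2=0.1):
    """Return indices of variants kept by a simplified greedy LD pruning.

    A variant is kept only if its r^2 with every already-kept variant
    within the previous `window` positions is at or below `r2`.
    """
    kept = []
    for i, genotypes in enumerate(variants):
        recent = [j for j in kept if i - j < window]
        if all(r_squared(genotypes, variants[j]) <= r2 for j in recent):
            kept.append(i)
    return kept


# Toy genotypes for 4 samples.
variants = [
    [0, 0, 2, 2],   # kept (first variant)
    [0, 0, 2, 2],   # identical to variant 0, r^2 = 1, pruned
    [0, 2, 0, 2],   # uncorrelated with variant 0, r^2 = 0, kept
    [0, 1, 1, 2],   # r^2 = 0.5 with variant 0, pruned
]
kept = greedy_prune(variants, window=50, r2=0.1)
print(kept)
```

The retained indices play the role of the `.prune.in` list that step 3 reuses for the related individuals.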



#### step 2 PCA: run for unrelated Asian

In [72]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/pca_unrelated
#This is the bfile generated after keeping only the unrelated individuals and LD pruning
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/ldprun_unrelated/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.extracted.filtered.prune.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/012323_ukb47922_asian_qc_call95_unrel_9077.iid
label_col=ethnicity
pop_col=ethnicity
pca_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb
pca_sbatch=$cwd/flashpca_asian_unrelated_genoarray_$(date +"%Y-%m-%d").sbatch
k=10
maha_k=5
min_axis=""
max_axis=""
homogeneous=TRUE

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --min_axis $min_axis
    --max_axis $max_axis
    --homogeneous $homogeneous
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/pca_unrelated/flashpca_asian_unrelated_genoarray_2023-01-31.sbatch
INFO: Workflow csg (ID=wf6048da0d9ee6611) is executed successfully with 1 completed step.



#### step 3. Get the bed file for related

In [73]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/related_asian
## Use qc'ed genotype array
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/final_files/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.extracted.bed
keep_samples=/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/012323_ukb47922_asian_qc_call95_rel_1111.iid
#Keep the same variants as above
keep_variants=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/ldprun_unrelated/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.extracted.filtered.prune.in
#GWAS QC variables
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
gwas_sbatch=$cwd/flashpca_asian_related_qc_genoarray_$(date +"%Y-%m-%d").sbatch
numThreads=20
mem='30G'


gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/related_asian/flashpca_asian_related_qc_genoarray_2023-01-31.sbatch
INFO: Workflow csg (ID=w41eac5ecc15c0f2e) is executed successfully with 1 completed step.



#### step 4. PCA project back related individuals

In [76]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/project_related_asian
#This is the bfile generated after keeping only the related individuals
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/related_asian/*.bed
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/012323_ukb47922_asian_qc_call95_10188.iid
pca_model=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/pca_unrelated/*.pca.rds
pca_sbatch=$cwd/flashpca_asian_related_genoarray_projected_$(date +"%Y-%m-%d").sbatch
label_col=ethnicity
pop_col=ethnicity
k=10
maha_k=5
prob=0.997
pval=0.05
## Set --homogeneous TRUE to treat all the pops as a single group
homogeneous=TRUE
## For the plot you need to use the *.projected.rds and not the *.projected.mahalanobis.rds
#plot_data=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.rds
#outlier_file=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.outliers


pca_args="""project_samples
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --pca_model $pca_model
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --prob $prob
    --pval $pval
    --homogeneous $homogeneous
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/project_related_asian/flashpca_asian_related_genoarray_projected_2023-01-31.sbatch
INFO: Workflow csg (ID=wff275343d758f495) is executed successfully with 1 completed step.
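
The `prob` and `maha_k` arguments suggest that outliers are flagged by Mahalanobis distance on the first PCs. A common way to do this is to compare each sample's squared distance from the reference cloud with a chi-square quantile at `prob`. Whether `PCA.ipynb` does exactly this is an assumption. The sketch below uses 2 PCs (the notebook sets `maha_k=5`) because the chi-square quantile with 2 degrees of freedom has the closed form -2·ln(1 - prob), which keeps the example free of external libraries. All names and coordinates are hypothetical.

```python
import math


def mahalanobis_sq_2d(points, reference):
    """Squared Mahalanobis distance of each 2-D point from the reference cloud."""
    n = len(reference)
    mx = sum(p[0] for p in reference) / n
    my = sum(p[1] for p in reference) / n
    sxx = sum((p[0] - mx) ** 2 for p in reference) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in reference) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in reference) / (n - 1)
    det = sxx * syy - sxy * sxy       # determinant of the 2x2 covariance matrix
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = v^T * inverse(covariance) * v, written out for the 2x2 case
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def flag_outliers(points, reference, prob=0.997):
    """True for points whose squared distance exceeds the chi-square(2 df) quantile."""
    cutoff = -2.0 * math.log(1.0 - prob)
    return [d > cutoff for d in mahalanobis_sq_2d(points, reference)]


# Toy PC coordinates: a 3x3 grid as the reference (unrelated) samples,
# and two projected samples, one inside the cloud and one far away.
reference = [(x, y) for x in (-1.0, 0.0, 1.0) for y in (-1.0, 0.0, 1.0)]
projected = [(0.5, 0.5), (4.0, 4.0)]
flags = flag_outliers(projected, reference, prob=0.997)
print(flags)
```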



#### step 5. PCA: plot and look for outliers

In [77]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/project_related_asian/plot
#This is the bfile generated after keeping only the related individuals
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/related_asian/*.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/012323_ukb47922_asian_qc_call95_10188.iid
pca_model=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/pca_unrelated/*.pca.rds
plot_data=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/project_related_asian/*.pca.projected.rds
outlier_file=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/project_related_asian/*.pca.projected.outliers
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$cwd/flashpca_asian_related_genoarray_plot_$(date +"%Y-%m-%d").sbatch
k=10

pca_args="""plot_pca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --label_col $label_col
    --pop_col $pop_col
    --plot_data $plot_data
    --outlier_file $outlier_file
    --k $k
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg  \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/project_related_asian/plot/flashpca_asian_related_genoarray_plot_2023-01-31.sbatch
INFO: Workflow csg (ID=w9226415ae531a493) is executed successfully with 1 completed step.



#### Get individual ID and variants text files and final qc'ed bed files

In [6]:
%save -f /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/final_files_no_outliers/final_genoarray_asians.sh
#!/bin/sh
# N=10,157 samples and 444,076 variants
#$ -l h_rt=36:00:00
#$ -l h_vmem=30G
#$ -N hg38_final_genoarray_2023-01-31
#$ -o /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/final_files_no_outliers/hg38_final_genoarray_2023-01-31.out
#$ -e /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/final_files_no_outliers/hg38_final_genoarray_2023-01-31.err
#$ -j y
#$ -q csg.q
#$ -S /bin/bash
export PATH=$HOME/miniconda3/bin:$PATH
module load PLINK/2.0
plink2 \
    --bfile /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/final_files/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.extracted \
    --remove /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/PCA/project_related_asian/012323_ukb47922_asian_qc_call95_10188.pca.projected.outliers \
    --write-snplist --write-samples --no-id-header \
    --threads 20 \
    --make-bed \
    --out /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_asian_10189ind_hg38/final_files_no_outliers/UKB_genotypedatadownloaded083019_hg38_10057ind_447567var_ASIAN

### Define African individuals

In [64]:
# these should align with all possible options for ethnicity answers except for <NA>, Do not know, and Prefer not to answer
white = ['1001', '1002', '1','1003']
african = ['4001','2001', '4002', '2002', '4', '4003' ]
asian = ['3001', '3002', '2003', '3004', '3003', '3']
mixed = ['2', '2004']
chinese = ['5']
other = ['6']

# figure out the ancestry of each individual
def ancestry(row):
    temp = [x for x in row[ethnicity] if not pd.isna(x) and x != "-3" and x != "-1"]
    if len(temp) == 0:
        return "Unknown"
    
    if len(set(temp)) == 1 and temp[0] in african: # if we have only one unique answer and the answer is in the african variable
        return "_".join(temp[0].split(" ")) # return the unique answer
    
    if len([x for x in temp if x in african]) == len(temp):
        return "Inconsistent_african"
    if len([x for x in temp if x in white]) == len(temp):
        return "White"
    if len([x for x in temp if x in asian]) == len(temp):
        return "Asian"
    if len([x for x in temp if x in mixed]) == len(temp):
        return "Mixed"
    if len([x for x in temp if x in chinese]) == len(temp):
        return "Chinese"
    if len([x for x in temp if x in other]) == len(temp):
        return "Other"
    return "Inconsistent"

In [65]:
df_qc_afr = filtered.copy()

In [66]:
df_qc_afr["ethnicity"] = df_qc_afr[ethnicity].apply(ancestry, axis=1)

In [67]:
def find_african(row):
    return row["ethnicity"] in african or row["ethnicity"] == "Inconsistent_african"

In [68]:
inc_african = df_qc_afr[["ethnicity"]].apply(find_african, axis=1)

In [69]:
print(sum(inc_african), "individuals considered african")

8621 individuals considered african


In [70]:
# Filter the African individuals
df_qc_afr = df_qc_afr[inc_african]
df_qc_afr.groupby(['ethnicity']).count()

ethnicity,f.eid,f.31.0.0,f.34.0.0,f.52.0.0,f.21000.0.0,f.21000.1.0,f.21000.2.0,f.22001.0.0,f.22006.0.0
2001,593,593,593,593,593,9,18,593,0
2002,396,396,396,396,396,15,6,396,0
4,25,25,25,25,25,0,0,25,0
4001,4286,4286,4286,4286,4286,44,40,4286,0
4002,3199,3199,3199,3199,3199,38,20,3199,0
4003,116,116,116,116,116,2,0,116,0
Inconsistent_african,6,6,6,6,6,4,2,6,0


In [71]:
df_qc_afr

Unnamed: 0,f.eid,f.31.0.0,f.34.0.0,f.52.0.0,f.21000.0.0,f.21000.1.0,f.21000.2.0,f.22001.0.0,f.22006.0.0,ethnicity
68,1000697,0,1965,10,4001,,,0,,4001
143,1001447,0,1961,6,4001,,,0,,4001
145,1001465,0,1942,6,2001,,,0,,2001
199,1002004,1,1957,2,4002,,,1,,4002
234,1002354,0,1962,9,4001,,,0,,4001
...,...,...,...,...,...,...,...,...,...,...
502221,6023054,0,1941,9,4001,,,0,,4001
502390,6024740,0,1945,12,4001,,,0,,4001
502417,6025018,0,1960,2,4001,,,0,,4001
502443,6025273,0,1947,7,4001,,,0,,4001


In [72]:
df_qc_afr = df_qc_afr.rename(columns={'f.eid': 'IID', 'f.31.0.0': 'sex'})

In [73]:
df_qc_afr['FID'] = df_qc_afr['IID']

In [74]:
df_qc_afr

Unnamed: 0,IID,sex,f.34.0.0,f.52.0.0,f.21000.0.0,f.21000.1.0,f.21000.2.0,f.22001.0.0,f.22006.0.0,ethnicity,FID
68,1000697,0,1965,10,4001,,,0,,4001,1000697
143,1001447,0,1961,6,4001,,,0,,4001,1001447
145,1001465,0,1942,6,2001,,,0,,2001,1001465
199,1002004,1,1957,2,4002,,,1,,4002,1002004
234,1002354,0,1962,9,4001,,,0,,4001,1002354
...,...,...,...,...,...,...,...,...,...,...,...
502221,6023054,0,1941,9,4001,,,0,,4001,6023054
502390,6024740,0,1945,12,4001,,,0,,4001,6024740
502417,6025018,0,1960,2,4001,,,0,,4001,6025018
502443,6025273,0,1947,7,4001,,,0,,4001,6025273


In [75]:
df_qc_afr[["FID","IID","ethnicity"]].to_csv("/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/012323_ukb47922_african_qc_8621.iid", sep="\t", index=False)

In [10]:
df_qc_afr=pd.read_csv("/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/012323_ukb47922_african_qc_8621.iid", sep="\t", dtype="string")

In [11]:
call95_african= pd.read_csv('~/UKBiobank/genotype_files_processed/010622_african_9096ind/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_african_9096ind.filtered.mindrem.id', skiprows=[0], names=["FID", "IID"], dtype="string", sep='\t')

In [111]:
call95_african

Unnamed: 0,FID,IID
0,3656538,3656538
1,3733695,3733695
2,4958925,4958925
3,5991763,5991763


In [12]:
df_qc_afr_call95=df_qc_afr[~df_qc_afr.FID.isin(call95_african.FID)]

In [14]:
df_qc_afr_call95.to_csv("/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/012323_ukb47922_african_qc_call95_8617.iid", sep="\t", index=False)

In [16]:
# Relatedness per UKB variable f.22021.0.0>1
rel_african= pd.read_csv("/mnt/mfs/statgen/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_sampleQC_IID_african_related.txt", sep='\t', dtype="string", header=None)
rel_african

Unnamed: 0,0,1
0,3304568,3304568
1,4094127,4094127


In [17]:
# Relatedness per KING calculation
rel_king_african = pd.read_csv("~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_king_african/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.010722_king_african.related_id", names=["FID", "IID"],sep=" ", dtype="string", header=None)
rel_king_african

Unnamed: 0,FID,IID
0,1000697,1000697
1,1001447,1001447
2,1006879,1006879
3,1009579,1009579
4,1011881,1011881
...,...,...
1336,6005674,6005674
1337,6013990,6013990
1338,6017509,6017509
1339,6018745,6018745


In [18]:
df_qc_afr_unrel=df_qc_afr[~df_qc_afr.FID.isin(call95_african.FID) & ~df_qc_afr.FID.isin(rel_king_african.FID)]

In [19]:
df_qc_afr_unrel

Unnamed: 0,FID,IID,ethnicity
2,1001465,1001465,2001
3,1002004,1002004,4002
4,1002354,1002354,4001
5,1002390,1002390,4002
6,1002608,1002608,4001
...,...,...,...
8616,6023054,6023054,4001
8617,6024740,6024740,4001
8618,6025018,6025018,4001
8619,6025273,6025273,4001


In [20]:
df_qc_afr_unrel.to_csv("/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/012323_ukb47922_african_qc_unrel_7276.iid", sep="\t", index=False)

In [23]:
df_qc_afr_rel=df_qc_afr[~df_qc_afr.FID.isin(call95_african.FID) & df_qc_afr.FID.isin(rel_king_african.FID)]

In [24]:
df_qc_afr_rel

Unnamed: 0,FID,IID,ethnicity
0,1000697,1000697,4001
1,1001447,1001447,4001
14,1006879,1006879,4001
19,1009579,1009579,2001
22,1011881,1011881,4001
...,...,...,...
8589,6005674,6005674,2001
8600,6013990,6013990,4002
8606,6017509,6017509,2001
8608,6018745,6018745,4002


In [25]:
df_qc_afr_rel.to_csv("/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/012323_ukb47922_african_qc_rel_1341.iid", sep="\t", index=False)
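
As a sanity check on the three ID files written above, the unrelated and related sets should be disjoint and together make up the call-rate-filtered set. The counts in the file names agree with that (7276 + 1341 = 8617, and 8621 - 4 = 8617). A minimal sketch with plain Python sets and toy IDs; `partition_samples` is a hypothetical helper and not part of the pipeline:

```python
def partition_samples(all_ids, low_call_ids, related_ids):
    """Split QC'ed sample IDs into unrelated and related sets.

    Mirrors the boolean masks used above: samples failing the call-rate
    filter are dropped first, then the rest are split by relatedness.
    """
    passed = set(all_ids) - set(low_call_ids)
    related = passed & set(related_ids)
    unrelated = passed - related
    return passed, unrelated, related


# Toy IDs for illustration only (not real UK Biobank identifiers)
all_ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
low_call_ids = ["s6"]
related_ids = ["s1", "s2", "s9"]  # "s9" is not in the cohort and is ignored

passed, unrelated, related = partition_samples(all_ids, low_call_ids, related_ids)

# The two subsets must be disjoint and add up to the call-rate-filtered set
assert unrelated.isdisjoint(related)
assert len(unrelated) + len(related) == len(passed)
print(len(passed), len(unrelated), len(related))
```

The same two assertions could be run on the `FID` columns of `df_qc_afr_call95`, `df_qc_afr_unrel` and `df_qc_afr_rel` before submitting any job.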

#### Select individuals and remove variants with HWE < 1e-15

In [2]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38
# bfile with variant_qc_1 N=486416 and variants=674,489 hg38
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.bed
#To keep unrelated samples
keep_samples=/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/012323_ukb47922_african_qc_unrel_7276.iid
gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
gwasqc_sbatch=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/hg38_african_variantqc2_$(date +"%Y-%m-%d").sbatch
maf_filter=0.01
geno_filter=0.01
hwe_filter=1e-15
# Set mind filter to 0 not to filter out more individuals based on sample missingness
mind_filter=0
mem='30G'
name='hg38_012323'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/hg38_african_variantqc2_2023-01-30.sbatch
INFO: Workflow csg (ID=w15f2a42f363f6ee3) is executed successfully with 1 completed step.
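
To illustrate what an HWE threshold such as `hwe_filter=1e-15` acts on, here is a rough sketch that computes a Hardy-Weinberg p-value from genotype counts. It uses the 1-degree-of-freedom chi-square approximation, which is an assumption made for brevity: PLINK's `--hwe` uses an exact test, and the internals of `GWAS_QC.ipynb` are not shown in this notebook. `hwe_chisq_pvalue` is a hypothetical helper.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) approximation of the HWE test p-value."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic variant: the test is not informative
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2.0))


in_equilibrium = hwe_chisq_pvalue(250, 500, 250)  # counts match p^2, 2pq, q^2
no_heterozygotes = hwe_chisq_pvalue(500, 0, 500)  # extreme heterozygote deficit

for label, pval in [("in_equilibrium", in_equilibrium),
                    ("no_heterozygotes", no_heterozygotes)]:
    print(label, "removed at 1e-15:", pval < 1e-15)
```

A variant is dropped when its p-value falls below the threshold, so `1e-15` is more permissive than `1e-8`: it removes only variants with very strong departures from equilibrium.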



#### Select individuals and remove variants HWE < 1e-8

In [10]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/hwe_1e-8
# bfile with variant_qc_1 N=486416 and variants=674,489 hg38
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.bed
#To keep unrelated samples
keep_samples=/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/012323_ukb47922_african_qc_unrel_7276.iid
gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
gwasqc_sbatch=$cwd/hg38_african_variantqc2_hwe1e-8_$(date +"%Y-%m-%d").sbatch
maf_filter=0.01
geno_filter=0.01
hwe_filter=1e-8
# Set mind filter to 0 not to filter out more individuals based on sample missingness
mind_filter=0
mem='30G'
name='hg38_012323'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/hwe_1e-8/hg38_african_variantqc2_hwe1e-8_2023-01-30.sbatch
INFO: Workflow csg (ID=w2054e3258676deb8) is executed successfully with 1 completed step.



#### Final file for Africans after individual and variant qc_2

In [7]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/final_files
# bfile with variant_qc_1 N=486416 and variants=674,489 hg38
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.bed
#To keep related and unrelated samples
keep_samples=/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/012323_ukb47922_african_qc_call95_8617.iid
# To keep variants after geno=0.01, maf=0.01 and hwe=1e-15
keep_variants=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.bim
gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
gwasqc_sbatch=$cwd/hg38_african_variantqc2_hwe1e-8_$(date +"%Y-%m-%d").sbatch
mem='30G'
name='hg38_012323'
job_size=1
numThreads=2
geno=0.0
mind=0.0
hwe=0.0

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --geno_filter $geno
    --mind_filter $mind
    --hwe_filter $hwe
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/final_files/hg38_african_variantqc2_hwe1e-8_2023-01-31.sbatch
INFO: Workflow csg (ID=wadc7b85a19596c3f) is executed successfully with 1 completed step.
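
Once the job has finished, the sample and variant counts of the resulting bfile can be checked by counting lines in the `.fam` and `.bim` files (one line per sample and per variant, respectively). A minimal sketch on toy files; `count_records` and `check_bfile` are hypothetical helpers, and for the real data the expected sample count would be the 8617 IDs in the `keep_samples` file:

```python
import os
import tempfile


def count_records(path):
    """Count non-empty lines (.fam: one per sample, .bim: one per variant)."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def check_bfile(prefix, expected_samples=None, expected_variants=None):
    """Return (n_samples, n_variants) and raise if they differ from expectations."""
    n_samples = count_records(prefix + ".fam")
    n_variants = count_records(prefix + ".bim")
    if expected_samples is not None and n_samples != expected_samples:
        raise ValueError(f"expected {expected_samples} samples, found {n_samples}")
    if expected_variants is not None and n_variants != expected_variants:
        raise ValueError(f"expected {expected_variants} variants, found {n_variants}")
    return n_samples, n_variants


# Toy bfile written to a temporary directory for illustration only
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "toy")
    with open(prefix + ".fam", "w") as fam:
        fam.write("1 1 0 0 0 -9\n2 2 0 0 0 -9\n")
    with open(prefix + ".bim", "w") as bim:
        bim.write("19\tchr19:260970:A:T\t0\t260970\tA\tT\n")
    counts = check_bfile(prefix, expected_samples=2, expected_variants=1)

print(counts)
```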



#### step 1. PCA for Africans: keep unrelated and do LD pruning (get bed file)

In [21]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/ldprun_unrelated
## Use the QC'ed version of the genotype array, already restricted to the African individuals
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/final_files/*.bed
#To keep only the unrelated African individuals
keep_samples=/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/012323_ukb47922_african_qc_unrel_7276.iid
#GWAS QC variables: set all the filters to 0 so the already filtered data is not filtered further
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
#LD pruning variables
window=50
shift=10
r2=0.1
gwas_sbatch=$cwd/ldprun_unrelated_african_$(date +"%Y-%m-%d").sbatch
numThreads=20
mem='30G'

gwasqc_args="""qc
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/ldprun_unrelated/ldprun_unrelated_asian_2023-01-31.sbatch
INFO: Workflow csg (ID=w5c56d09b5cc2b696) is executed successfully with 1 completed step.
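
The `window` and `r2` settings above control how aggressively correlated variants are thinned before PCA. The sketch below is a deliberately simplified greedy version of pairwise LD pruning on toy genotype vectors: it keeps a variant only if its squared correlation with every previously kept variant inside the window is at most `r2_max`. It does not model the `shift` step or the tie-breaking rules of any real tool, and the workflow's own implementation is not shown in this notebook; `pearson_r2` and `greedy_prune` are hypothetical helpers.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 codes)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic variant: correlation is undefined, treat as 0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def greedy_prune(genotypes, window=50, r2_max=0.1):
    """Return indices of variants kept after greedy pairwise pruning."""
    kept = []
    for j, candidate in enumerate(genotypes):
        # Only compare against kept variants at most window - 1 positions back
        nearby = [i for i in kept if j - i < window]
        if all(pearson_r2(genotypes[i], candidate) <= r2_max for i in nearby):
            kept.append(j)
    return kept


# Toy data: variant 1 duplicates variant 0, variant 2 is uncorrelated with it
toy_genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [0, 2, 1, 1, 2, 0],
]
kept_indices = greedy_prune(toy_genotypes, window=50, r2_max=0.1)
print(kept_indices)
```

With `r2=0.1` only nearly independent variants survive, which is the usual goal when the pruned set is used to compute principal components.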



#### step 2. PCA: run for unrelated Africans

In [30]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/pca_unrelated
#This is the bfile generated after keeping the unrelated individuals and pruning
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/ldprun_unrelated/*.filtered.prune.bed
#You need to input the correct file containing only the unrelated individuals, otherwise you'll get NA as a label
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/012323_ukb47922_african_qc_unrel_7276.iid
label_col=ethnicity
pop_col=ethnicity
pca_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb
pca_sbatch=$cwd/flashpca_african_unrelated_genoarray_$(date +"%Y-%m-%d").sbatch
k=10
maha_k=5
min_axis=""
max_axis=""
homogeneous=TRUE

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --min_axis $min_axis
    --max_axis $max_axis
    --homogeneous $homogeneous
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/pca_unrelated/flashpca_african_unrelated_genoarray_2023-01-31.sbatch
INFO: Workflow csg (ID=w2d2f5cd8191ffc9e) is executed successfully with 1 completed step.



#### step 3. Get the bed file for related

In [27]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/related_african
## Use qc'ed genotype array
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/final_files/*.sorted.filtered.extracted.bed
keep_samples=/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/012323_ukb47922_african_qc_rel_1341.iid
#Keep the same variants as above
keep_variants=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/ldprun_unrelated/*.sorted.filtered.extracted.filtered.prune.in
#GWAS QC variables
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
gwas_sbatch=$cwd/flashpca_african_related_qc_genoarray_$(date +"%Y-%m-%d").sbatch
numThreads=20
mem='30G'


gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg (index=0) is ignored due to saved signature
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/related_african/flashpca_african_related_qc_genoarray_2023-01-31.sbatch
INFO: Workflow csg (ID=wdcf68ccfe350a5de) is ignored with 1 ignored step.



#### step 4. PCA project back related individuals

In [29]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/project_related_african
#This is the bfile generated in step 3, containing only the related individuals
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/related_african/*.bed
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/012323_ukb47922_african_qc_call95_8617.iid
pca_model=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/pca_unrelated/*.pca.rds
pca_sbatch=$cwd/flashpca_african_related_genoarray_projected_$(date +"%Y-%m-%d").sbatch
label_col=ethnicity
pop_col=ethnicity
k=10
maha_k=5
prob=0.997
pval=0.05
## Set --homogeneous TRUE to treat all the populations as a single one
homogeneous=TRUE
## For the plot you need to use the *.projected.rds and not the *.projected.mahalanobis.rds
#plot_data=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.rds
#outlier_file=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.outliers


pca_args="""project_samples
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --pca_model $pca_model
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --prob $prob
    --pval $pval
    --homogeneous $homogeneous
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/project_related_african/flashpca_african_related_genoarray_projected_2023-01-31.sbatch
INFO: Workflow csg (ID=w2f210550b4dac3c1) is executed successfully with 1 completed step.
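
The outlier step compares each projected sample with the reference cloud of unrelated samples in PC space. The sketch below shows one way such a rule could work, under stated assumptions: it uses only two PCs (the run above sets `maha_k=5`), treats the PCs as uncorrelated so the Mahalanobis distance reduces to a variance-scaled sum of squares, and uses a chi-square cutoff at `prob`, which for 2 degrees of freedom has the closed form `-2 * ln(1 - prob)`. Whether `PCA.ipynb` uses exactly this rule is not shown in this notebook; `mahalanobis_sq` and `flag_outliers` are hypothetical helpers.

```python
import math


def mahalanobis_sq(score, means, variances):
    """Squared Mahalanobis distance assuming uncorrelated components."""
    return sum((s - m) ** 2 / v for s, m, v in zip(score, means, variances))


def flag_outliers(projected, reference, prob=0.997):
    """Return IDs whose distance from the reference centroid exceeds the cutoff.

    projected: dict of sample ID -> (PC1, PC2) for the projected samples
    reference: list of (PC1, PC2) scores of the reference (unrelated) samples
    """
    n = len(reference)
    means = [sum(r[d] for r in reference) / n for d in range(2)]
    variances = [
        sum((r[d] - means[d]) ** 2 for r in reference) / (n - 1) for d in range(2)
    ]
    # Chi-square quantile with 2 degrees of freedom has a closed form
    cutoff = -2.0 * math.log(1.0 - prob)
    return [
        sample_id
        for sample_id, score in projected.items()
        if mahalanobis_sq(score, means, variances) > cutoff
    ]


# Toy PC scores for illustration only
reference_scores = [(-1.0, -1.0), (1.0, 1.0), (-1.0, 1.0), (1.0, -1.0)]
projected_scores = {"A": (0.5, -0.5), "B": (6.0, 0.0)}
outliers = flag_outliers(projected_scores, reference_scores, prob=0.997)
print(outliers)
```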



#### step 5. PCA: plot and look for outliers

In [32]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/project_related_african/plot
#This is the bfile generated in step 3, containing only the related individuals
genoFile=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/related_african/*.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/012323_ukb47922_african_qc_call95_8617.iid
pca_model=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/pca_unrelated/*.pca.rds
plot_data=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/project_related_african/*.pca.projected.rds
outlier_file=/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/project_related_african/*.pca.projected.outliers
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$cwd/flashpca_african_related_genoarray_plot_$(date +"%Y-%m-%d").sbatch
k=10

pca_args="""plot_pca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --label_col $label_col
    --pop_col $pop_col
    --plot_data $plot_data
    --outlier_file $outlier_file
    --k $k
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg  \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/project_related_african/plot/flashpca_african_related_genoarray_plot_2023-01-31.sbatch
INFO: Workflow csg (ID=w106abc66920250ef) is executed successfully with 1 completed step.



#### Get individual ID and variants text files and final qc'ed bed files

In [65]:
%save -f /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/final_files_no_outliers/final_genoarray_africans_hg38.sh
#!/bin/sh
#$ -l h_rt=36:00:00
#$ -l h_vmem=30G
#$ -N final_genoarray_africans_hg38_2023-01-31
#$ -o /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/final_files_no_outliers/final_genoarray_africans_hg38_2023-01-31.out
#$ -e /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/final_files_no_outliers/final_genoarray_africans_hg38_2023-01-31.err
#$ -j y
#$ -q csg.q
#$ -S /bin/bash
export PATH=$HOME/miniconda3/bin:$PATH
module load PLINK/2.0
plink2 \
    --bfile /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/final_files/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.extracted \
    --remove /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/project_related_african/012323_ukb47922_african_qc_call95_8617.pca.projected.outliers \
    --write-snplist --write-samples --no-id-header \
    --threads 20 \
    --make-bed \
    --out /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/final_files_no_outliers/UKB_genotypedatadownloaded083019_hg38_8591ind_346727var_AFRICAN

In [64]:
module load PLINK/2.0
plink2 \
    --bfile /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/final_files/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.extracted \
    --remove /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/project_related_african/012323_ukb47922_african_qc_call95_8617.pca.projected.outliers \
    --write-snplist --write-samples --no-id-header \
    --threads 20 \
    --make-bed \
    --out /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/final_files_no_outliers/UKB_genotypedatadownloaded083019_hg38_8591ind_346727var_AFRICAN

PLINK v2.00a2.3LM 64-bit Intel (24 Jan 2020)   www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/final_files_no_outliers/UKB_genotypedatadownloaded083019_hg38_8591ind_346727var_AFRICAN.log.
Options in effect:
  --bfile /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/final_files/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.sorted.filtered.extracted
  --make-bed
  --no-id-header
  --out /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/final_files_no_outliers/UKB_genotypedatadownloaded083019_hg38_8591ind_346727var_AFRICAN
  --remove /mnt/mfs/statgen/UKBiobank/genotype_files_processed/012323_african_9096ind_hg38/PCA/project_related_african/012323_ukb47922_african_qc_call95_8617.pca.projected.outliers
  --threads 20
  --write-samples
  --write-snplist

Start
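
The output prefix encodes the counts expected after outlier removal (`8591ind`, `346727var`). Assuming that naming convention holds, the numbers can be parsed from the prefix and compared with the line counts of the `--write-samples` and `--write-snplist` outputs once the run has finished. A small sketch; `expected_counts` is a hypothetical helper:

```python
import re


def expected_counts(prefix):
    """Extract (n_individuals, n_variants) from a '<N>ind_<M>var' style prefix."""
    match = re.search(r"_(\d+)ind_(\d+)var", prefix)
    if match is None:
        raise ValueError(f"no '<N>ind_<M>var' pattern in {prefix!r}")
    return int(match.group(1)), int(match.group(2))


prefix = "UKB_genotypedatadownloaded083019_hg38_8591ind_346727var_AFRICAN"
n_ind, n_var = expected_counts(prefix)

# 8617 samples enter this step, so the implied number of removed outliers is:
n_outliers = 8617 - n_ind
print(n_ind, n_var, n_outliers)
```

If the `.outliers` file passed to `--remove` does not contain that many sample IDs, the file name and the data are out of sync.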

In [141]:
file <- "/mnt/mfs/statgen/UKBiobank/genotype_files_processed/012023_hg38_674489var_s486416ind/UKB_genotypedatadownloaded083019_hg38_674489variants_486416samples.bim"

df <- read.table(file, header= FALSE, stringsAsFactors = FALSE)
colnames(df) <- c("chr", "var","CM","pos", "a1", "a2")

In [146]:
df[df$chr == 19 ,]

Unnamed: 0_level_0,chr,var,CM,pos,a1,a2
Unnamed: 0_level_1,<int>,<chr>,<dbl>,<int>,<chr>,<chr>
614991,19,chr19:260970:A:T,0,260970,A,T
614992,19,chr19:267039:C:T,0,267039,C,T
614993,19,chr19:267614:A:C,0,267614,A,C
614994,19,chr19:277715:C:G,0,277715,C,G
614995,19,chr19:280712:T:C,0,280712,T,C
614996,19,chr19:282181:T:C,0,282181,T,C
614997,19,chr19:282753:A:G,0,282753,A,G
614998,19,chr19:287703:A:G,0,287703,A,G
614999,19,chr19:288062:A:G,0,288062,A,G
615000,19,chr19:288123:T:C,0,288123,T,C
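
In the rows above, each variant ID has the form `chr<chrom>:<pos>:<a1>:<a2>` and agrees with the `chr` and `pos` columns. A quick consistency check of that kind can be sketched as follows, assuming whitespace-separated `.bim` lines; `mismatched_ids` is a hypothetical helper and the third toy row is deliberately inconsistent:

```python
def mismatched_ids(bim_lines):
    """Return variant IDs whose chromosome or position disagrees with the columns."""
    bad = []
    for line in bim_lines:
        chrom, var_id, _cm, pos, _a1, _a2 = line.split()
        id_chrom, id_pos = var_id.split(":")[:2]
        if id_chrom != "chr" + chrom or id_pos != pos:
            bad.append(var_id)
    return bad


toy_rows = [
    "19 chr19:260970:A:T 0 260970 A T",
    "19 chr19:267039:C:T 0 267039 C T",
    "19 chr19:999:A:C 0 267614 A C",  # toy row with a stale position in the ID
]
bad_ids = mismatched_ids(toy_rows)
print(bad_ids)
```

An empty list on the real `.bim` would indicate that the IDs were regenerated from the lifted-over coordinates rather than carried over from hg19.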


In [None]:
df[df$gender == 'M',]