# Estimate the rare variant heritability of age-related hearing impairment traits

# Aim


# Concepts


SNP-based heritability ($h^2_{SNP}$): the proportion of phenotypic variance captured by common SNPs.

Possible explanations for the discrepancy between pedigree-based heritability ($h^2_{ped}$) and $h^2_{SNP}$:

1. Causal variants are not well tagged by common SNPs because they are rare
2. Pedigree heritability is overestimated because of confounding with environmental effects or non-additive genetic variation


## 1. Calculate the SNP-based heritability from common variants and compare it to the available literature

First, a genetic relationship matrix (GRM) must be calculated from all of the autosomal SNPs.


## Input files

### Phenotype files

These files have already been QC'ed. They include individuals with each of the hearing impairment traits and control individuals without hearing-impairment-related phenotypes.

Mega-sample 

H-aid:

* /mnt/vast/hpc/csg/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2.tsv

H-diff:

* /mnt/vast/hpc/csg/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl_PC1_2.tsv

H-noise:

* /mnt/vast/hpc/csg/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl_PC1_2.tsv

H-both:

* /mnt/vast/hpc/csg/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl_PC1_2.tsv


### Genotype files

Original exome sequence files in plink format are here: 
* /mnt/vast/hpc/csg/UKBiobank/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed

QC applied to these VCF files:

- DP-SNPs=10
- DP-indels=10
- GQ=20
- AB-SNP=0.15
- AB-indel=0.20
- geno=0.1

Samples with missingness >10% in the genotype array (`--mind 0.1`):

* ~/UKBiobank/data/exome_files/project_VCF/072721_run/merged_plink/mind_0.1/cache/ukb23155_qc_merged.mind_0.1.filtered.mindrem.id

An extra QC step is needed here to make sure we have the best-quality variants for the heritability calculation.

### Selecting white European samples

Wainschtein et al. (2022) calculate principal components (20 PCs) in two rounds, one with common variants and one with rare variants. LD pruning also uses different parameters for each of these analyses:

- Common: MAF 0.01-0.5, window 50Kb, r2=0.1
- Rare: MAF 0.004 (MAC=5) - 0.01, window 100Kb, r2=0.05
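
The two MAF bins listed above can be separated from a frequency table before pruning. This is a minimal sketch, assuming a PLINK 1.9 `--freq` style table (`.frq`, MAF in column 5). The toy table, the file names, and the choice of which bin boundaries are inclusive are assumptions for illustration, not the paper's code.

```shell
# Toy frequency table in .frq layout: CHR SNP A1 A2 MAF NCHROBS (made-up values)
tmp=$(mktemp -d)
cat > "$tmp/toy.frq" <<'EOF'
CHR SNP A1 A2 MAF NCHROBS
1 var1 A G 0.2500 1000
1 var2 C T 0.0050 1000
1 var3 G A 0.0100 1000
1 var4 T C 0.0001 1000
EOF

# Common bin: 0.01 <= MAF <= 0.5 (boundary handling is an assumption)
awk -v lo=0.01 -v hi=0.5 'NR > 1 && $5 >= lo && $5 <= hi {print $2}' \
    "$tmp/toy.frq" > "$tmp/common.snplist"
# Rare bin: 0.004 <= MAF < 0.01 (boundary handling is an assumption)
awk -v lo=0.004 -v hi=0.01 'NR > 1 && $5 >= lo && $5 < hi {print $2}' \
    "$tmp/toy.frq" > "$tmp/rare.snplist"

common=$(paste -sd' ' "$tmp/common.snplist")
rare=$(paste -sd' ' "$tmp/rare.snplist")
echo "common: $common"
echo "rare: $rare"
```

Each list could then be passed to a separate pruning run with the window and r2 values given for its bin.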

In our case, we use our previously defined white European population, which was classified from the genotype array data with common variants by calculating 10 PCs and using the Mahalanobis distance to detect outliers.

* /mnt/vast/hpc/csg/UKBiobank/results/083021_PCA_results/europeans/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.fam

### Removing related individuals 

In this step we keep only the unrelated European individuals for the heritability calculations. We use a kinship threshold of 0.0625 to remove individuals related up to the third degree.

* remove_samples=/mnt/vast/hpc/csg/UKBiobank/results/083021_PCA_results/090221_king/*.related_id
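
As an illustration of the threshold, related pairs can be listed from a pairwise kinship table. This is a minimal sketch, assuming a whitespace-delimited table with the kinship coefficient in column 5; the layout, the toy values, and the file names are hypothetical and do not reproduce the actual KING output used here.

```shell
# Toy pairwise kinship table (assumed columns: FID1 ID1 FID2 ID2 Kinship)
tmp=$(mktemp -d)
cat > "$tmp/toy.kin0" <<'EOF'
FID1 ID1 FID2 ID2 Kinship
1 1 2 2 0.2500
3 3 4 4 0.0700
5 5 6 6 0.0100
EOF

# Report the pairs at or above the 0.0625 threshold used in this step
threshold=0.0625
awk -v t="$threshold" 'NR > 1 && $5 >= t {print $2, $4, $5}' \
    "$tmp/toy.kin0" > "$tmp/related_pairs.txt"
cat "$tmp/related_pairs.txt"
n_pairs=$(wc -l < "$tmp/related_pairs.txt" | tr -d ' ')
echo "related pairs: $n_pairs"
```

Deciding which member of each pair to drop is a separate step that this sketch does not cover.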

### Remove individuals showing excess of heterozygosity based on GRM off-diagonal?

It is unclear whether this step is necessary.

Here we start with the QC'ed exome sequence data and generate additional files with more stringent QC for the heritability calculation:

1. MAF: keep all rare variants (the Wainschtein 2022 paper uses `--maf 0.0001`)
2. `--geno 0.1` (the same threshold we used originally for our exome QC)
3. `--hwe 0.00000001`
4. `--mind 0.05` (originally we did not remove individuals based on missingness for the exome QC)
5. `--snps_only`: add this option to remove indels from the calculation (in this pipeline I decided to keep both SNPs and indels)

### Build the GRM from the exome sequence data?

## Step 1. Extra QC on the exome data

### Select white European unrelated individuals

In [24]:
eur <- read.table("~/UKBiobank/results/083021_PCA_results/europeans/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.fam", header=F)
colnames(eur) <- c("FID", "IID","fatherid", "motherid", "sex", "pheno")
dim(eur)
head(eur)

| | FID | IID | fatherid | motherid | sex | pheno |
|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 1000019 | 1000019 | 0 | 0 | 2 | -9 |
| 2 | 1000035 | 1000035 | 0 | 0 | 1 | -9 |
| 3 | 1000078 | 1000078 | 0 | 0 | 2 | -9 |
| 4 | 1000081 | 1000081 | 0 | 0 | 1 | -9 |
| 5 | 1000198 | 1000198 | 0 | 0 | 2 | -9 |
| 6 | 1000210 | 1000210 | 0 | 0 | 1 | -9 |


In [30]:
unrel <- read.table("~/UKBiobank/results/083021_PCA_results/090221_ldprun_unrelated/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.090221_ldprun_unrelated.filtered.fam")
colnames(unrel) <- c("FID", "IID","fatherid", "motherid", "sex", "pheno")
dim(unrel)
head(unrel)

| | FID | IID | fatherid | motherid | sex | pheno |
|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 1000019 | 1000019 | 0 | 0 | 2 | -9 |
| 2 | 1000078 | 1000078 | 0 | 0 | 2 | -9 |
| 3 | 1000081 | 1000081 | 0 | 0 | 1 | -9 |
| 4 | 1000198 | 1000198 | 0 | 0 | 2 | -9 |
| 5 | 1000210 | 1000210 | 0 | 0 | 1 | -9 |
| 6 | 1000236 | 1000236 | 0 | 0 | 1 | -9 |


In [29]:
outlier <- read.table("~/UKBiobank/results/083021_PCA_results/090321_PCA_related_pval0.005/030821_ukb42495_exomed_white_189010ind.090321_PCA_related_pval0.005.pca.projected.outliers")
colnames(outlier) <- c("FID", "IID")
dim(outlier)
head(outlier)

| | FID | IID |
|---|---|---|
| | `<int>` | `<int>` |
| 1 | 1008606 | 1008606 |
| 2 | 1010412 | 1010412 |
| 3 | 1028129 | 1028129 |
| 4 | 1032822 | 1032822 |
| 5 | 1035752 | 1035752 |
| 6 | 1044288 | 1044288 |


Select white European individuals that are unrelated and remove ancestry outliers obtained by PCA calculation

In [36]:
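# Drop ancestry outliers; logical indexing stays correct even when no IID matches
# (`-which()` with zero matches would return an empty data frame)
eur_unrel <- unrel[!(unrel$IID %in% outlier$IID), ]
# Keep only the FID and IID columns
eur_unrel_drop <- eur_unrel[, c("FID", "IID")]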

In [38]:
head(eur_unrel_drop)
dim(eur_unrel_drop)

| | FID | IID |
|---|---|---|
| | `<int>` | `<int>` |
| 1 | 1000019 | 1000019 |
| 2 | 1000078 | 1000078 |
| 3 | 1000081 | 1000081 |
| 4 | 1000198 | 1000198 |
| 5 | 1000210 | 1000210 |
| 6 | 1000236 | 1000236 |


In [40]:
write.table(eur_unrel_drop, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/ukb42495_exomed_white_europeans_unrelated_no_outliers_167652.id", sep="\t", row.names = FALSE, col.names =FALSE, quote=FALSE)
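
The same exclusion can also be done directly on the files from the shell. This is a minimal sketch with toy inputs; the IDs and file names are made up, and it is an alternative illustration rather than the step that produced the file above.

```shell
tmp=$(mktemp -d)
# Toy .fam (FID IID father mother sex pheno) and outlier list (FID IID)
printf '%s\n' '101 101 0 0 2 -9' '102 102 0 0 1 -9' '103 103 0 0 2 -9' > "$tmp/unrel.fam"
printf '%s\n' '102 102' > "$tmp/outliers.txt"

# First file: remember the outlier IIDs; second file: print FID and IID of everyone else.
# Testing FILENAME (instead of NR == FNR) keeps this correct when the outlier list is empty.
awk 'FILENAME == ARGV[1] {drop[$2]; next} !($2 in drop) {print $1 "\t" $2}' \
    "$tmp/outliers.txt" "$tmp/unrel.fam" > "$tmp/keep.id"
cat "$tmp/keep.id"
kept=$(cut -f2 "$tmp/keep.id" | paste -sd, -)
```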

### Apply HWE only on the subgroup of controls which are the same for our ARHI analyses

In [2]:
# Get the unrelated controls from our phenotype files
eur_unrel <-  read.table("/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/ukb42495_exomed_white_europeans_unrelated_no_outliers_167652.id", header=F)
colnames(eur_unrel) <- c("FID", "IID")
head(eur_unrel)

| | FID | IID |
|---|---|---|
| | `<int>` | `<int>` |
| 1 | 1000019 | 1000019 |
| 2 | 1000078 | 1000078 |
| 3 | 1000081 | 1000081 |
| 4 | 1000198 | 1000198 |
| 5 | 1000210 | 1000210 |
| 6 | 1000236 | 1000236 |


In [4]:
# Load the f3393 phenotype file to subset for the controls 
f3393 <- read.table("~/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl", header=T)
head(f3393)
nrow(f3393)

| | FID | IID | sex | f3393 | age |
|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 1001384 | 1001384 | 1 | 1 | 61 |
| 2 | 1002548 | 1002548 | 0 | 1 | 62 |
| 3 | 1002888 | 1002888 | 0 | 1 | 68 |
| 4 | 1002944 | 1002944 | 0 | 1 | 65 |
| 5 | 1003258 | 1003258 | 0 | 1 | 74 |
| 6 | 1004843 | 1004843 | 0 | 1 | 64 |


In [7]:
library('dplyr')
f3393_ctrl <- f3393 %>% filter(f3393==0) %>%
          select('FID', 'IID')

In [13]:
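# Caution: this pipe starts from `f3393` (cases and controls), not from `f3393_ctrl`;
# start it from `f3393_ctrl` to restrict the list to controls only
f3393_ctrl_unrel <- f3393 %>% filter(IID %in% eur_unrel$IID) %>% select("FID", "IID")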
nrow(f3393_ctrl_unrel)
head(f3393_ctrl_unrel)

| | FID | IID |
|---|---|---|
| | `<int>` | `<int>` |
| 1 | 1001384 | 1001384 |
| 2 | 1002548 | 1002548 |
| 3 | 1002888 | 1002888 |
| 4 | 1002944 | 1002944 |
| 5 | 1003258 | 1003258 |
| 6 | 1004843 | 1004843 |


In [14]:
write.table(f3393_ctrl_unrel, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/ukb42495_exomed_white_europeans_unrelated_no_outliers_ctrl_ARHI_92040.id", sep="\t", row.names = FALSE, col.names =TRUE, quote=FALSE)

In [33]:
# Get the variants that pass HWE qc filter
## Select White European unrelated individuals 
## Do some extra QC on the exome data
UKBB_PATH=/mnt/vast/hpc/csg/UKBiobank
cwd=$UKBB_PATH/results/ARHI_heritability/051922_white_eur_unrel
## Use the exome filtered file
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
## To keep only the samples of white Europeans unrelated individuals with outliers removed that are controls for ARHI samples so we can apply HWE
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/ukb42495_exomed_white_europeans_unrelated_no_outliers_ctrl_ARHI_92040.id
# Do not set a MAF filter, this will keep both common and rare variants
maf_filter=0 
# No geno filter in this subset of controls
geno_filter=0
# Set a HWE filter 1x10^-8
hwe_filter=0.00000001
# Do not set a sample missingness filter at this point, otherwise many samples would be removed
mind_filter=0
# Keep both SNPs and indels in the heritability calculation
snps_only=False
meta_only=True
other_args=""
gwas_sbatch=$UKBB_PATH/results/ARHI_heritability/051922_white_eur_unrel/hwe_eur_unrel_control_ARHI_$(date +"%Y-%m-%d").sbatch
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif
numThreads=20
job_size=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --meta_only $meta_only
    --other_args $other_args
    --numThreads $numThreads 
    --job_size $job_size
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"


INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg (index=0) is ignored due to saved signature
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/hwe_eur_unrel_control_ARHI_2022-05-19.sbatch
INFO: Workflow csg (ID=w72163a39fda9e30b) is ignored with 1 ignored step.



In [18]:
# Count number of variants after VCF_QC
for file in $(ls -v /mnt/vast/hpc/csg/UKBiobank/data/exome_files/project_VCF/072721_run/plink/ukb23156_c*.merged.filtered.bim);
do 
    echo "${file##*/}" >> /mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/072721_ukb23156_c1_22.merged.filtered.snpcount.txt
    cat $file | wc -l  >> /mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/072721_ukb23156_c1_22.merged.filtered.snpcount.txt;
done




In [36]:
# Count number of variants after HWE filter 
for file in $(ls -v /mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c*.merged.filtered.filtered.snplist);
do 
    wc -l $file  >> /mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/051922_ukb23156_c1_22.merged.filtered.hwe_ctrl_ARHI.snpcount.txt;
done
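
The counting loops above append with `>>`, so rerunning a cell adds the same counts again. A minimal sketch of an alternative that writes one `file<TAB>count` line per chromosome plus a total, using toy files with hypothetical names:

```shell
tmp=$(mktemp -d)
# Toy per-chromosome variant lists standing in for the *.snplist files
printf 'v1\nv2\nv3\n' > "$tmp/toy_c1.snplist"
printf 'v4\nv5\n' > "$tmp/toy_c2.snplist"

# One "file<TAB>count" line per list; ">" overwrites, so reruns do not accumulate
for f in "$tmp"/toy_c*.snplist; do
    printf '%s\t%s\n' "${f##*/}" "$(wc -l < "$f" | tr -d ' ')"
done > "$tmp/snpcount.txt"

# Sum the second column to get the genome-wide total
total=$(awk -F'\t' '{s += $2} END {print s}' "$tmp/snpcount.txt")
cat "$tmp/snpcount.txt"
echo "total: $total"
```

With 22 chromosomes, a plain glob sorts `c10` before `c2`; the `ls -v` used above gives natural chromosome order if that matters for the table.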




In [39]:
# Concatenate the per-chromosome lists; ">" overwrites, so a rerun does not append the same entries again
cat $(ls -v /mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c*.merged.filtered.filtered.snplist) > \
/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c1_22.filtered.filtered.snplist




### Note: I've found some duplicated records in the extracted variants
```
      2 chr10:5642571:T:TA
      2 chr11:2915280:C:CGT
      2 chr14:51243825:C:CT
      2 chr14:73265273:A:AGG
      2 chr14:73654828:C:CT
      2 chr15:36594814:T:TA
      2 chr17:20156278:A:AC
      2 chr19:45408051:C:CA
      2 chr19:9980591:A:ATCTC
      2 chr1:149935279:CA:C
      2 chr1:240493541:A:AT
      2 chr1:53074691:T:TG
      2 chr20:49159148:T:TA
      2 chr22:36481640:C:CGG
      2 chr2:127637295:A:AC
      2 chr3:111991354:AT:A
      2 chr3:98391562:G:GA
      2 chr5:169670501:T:TTTTTA
      2 chr5:58975878:C:CT
      2 chr5:96783875:A:ACT
      2 chr6:42683028:G:GT
      2 chr7:100885457:G:GC
      2 chr9:27217785:A:AT
      2 chr9:92474739:G:GTCC
```
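
A list like the one above can be produced, and the ID list de-duplicated, with standard tools. This is a minimal sketch on a toy list; the file names are hypothetical, and the two IDs are taken from the output above.

```shell
tmp=$(mktemp -d)
# Toy variant list with one repeated ID
printf '%s\n' 'chr10:5642571:T:TA' 'chr11:2915280:C:CGT' 'chr10:5642571:T:TA' > "$tmp/toy.snplist"

# IDs that occur more than once
sort "$tmp/toy.snplist" | uniq -d > "$tmp/dup.txt"
# Keep the first occurrence of each ID, preserving the original order
awk '!seen[$0]++' "$tmp/toy.snplist" > "$tmp/dedup.snplist"

cat "$tmp/dup.txt"
n_unique=$(wc -l < "$tmp/dedup.snplist" | tr -d ' ')
echo "unique IDs: $n_unique"
```

This only cleans the ID list. If two distinct records in a `.bim` file share the same ID, extracting by that ID still selects both.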

### Remove variants that do not pass HWE in ARHI controls and get final bed file per chromosome

In [40]:
# Extract the variants that passed the HWE filter in the ARHI controls
## Keep all white European unrelated individuals
## Apply the variant missingness filter to the exome data
UKBB_PATH=/mnt/vast/hpc/csg/UKBiobank
cwd=$UKBB_PATH/results/ARHI_heritability/051922_white_eur_unrel
## Use the exome filtered file
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
## Keep all white European unrelated individuals with ancestry outliers removed
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/ukb42495_exomed_white_europeans_unrelated_no_outliers_167652.id
keep_variants=~/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c1_22.filtered.filtered.snplist
# Do not set a MAF filter, this will keep both common and rare variants
maf_filter=0 
# Set a variant missingness filter of 10%
geno_filter=0.1
# Set the HWE filter to 0 because there is no need to apply it again here
hwe_filter=0
# Do not set a sample missingness filter at this point, otherwise many samples would be removed
mind_filter=0
# Keep both SNPs and indels in the heritability calculation
snps_only=False
other_args=""
gwas_sbatch=$UKBB_PATH/results/ARHI_heritability/051922_white_eur_unrel/eur_unrel_ARHI_QC_$(date +"%Y-%m-%d").sbatch
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif
numThreads=20
job_size=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --other_args $other_args
    --numThreads $numThreads 
    --job_size $job_size
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"


INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/eur_unrel_ARHI_QC_2022-05-19.sbatch
INFO: Workflow csg (ID=w3655ddad4c106b79) is executed successfully with 1 completed step.



In [41]:
# Count number of variants after geno=0.1 filter 
for file in $(ls -v /mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c*.merged.filtered.filtered.extracted.bim);
do 
    wc -l $file  >> /mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/051922_ukb23156_c1_22.merged.filtered.geno0.1_ARHI.snpcount.txt;
done




## Step 2. Merge plink files and use `--mind 0.05` filter

From the Wainschtein 2022 paper

```
plink \
--bfile ${BED_file_merged} \
--merge-list ${list_beds} \
--make-bed \
--maf 0.0001 \
--geno 0.05  \
--hwe 0.000001 \
--mind 0.05 \
--out ${BED_file_merged_QC} \
--threads ${ncpu}
```
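
The `--merge-list` file used in the command above holds one fileset per line; PLINK 1.9 accepts a bare prefix per line. A minimal sketch of building such a list, where the prefix is a hypothetical placeholder and chromosome 1 is assumed to be the fileset passed with `--bfile`:

```shell
tmp=$(mktemp -d)
prefix="$tmp/toy_c"   # hypothetical per-chromosome fileset prefix

# Chromosome 1 goes to --bfile, so the list holds chromosomes 2 to 22
chr=2
while [ "$chr" -le 22 ]; do
    echo "${prefix}${chr}"
    chr=$((chr + 1))
done > "$tmp/merge_list.txt"

n_entries=$(wc -l < "$tmp/merge_list.txt" | tr -d ' ')
head -n 2 "$tmp/merge_list.txt"
echo "entries: $n_entries"
```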

### Get the file for rare variants MAF<0.01

In [43]:
genoFile=`echo /mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c{1..22}.merged.filtered.filtered.extracted.bed`
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/051922_merged_bed
# Select only the variants with MAF between 0 and 0.01
maf_max_filter=0.01
# Do not set a minimum MAF filter, so that all variants below 0.01 are kept
maf_filter=0
# No need to filter again in the merge
geno_filter=0
# HWE already applied only to controls of ARHI sample
hwe_filter=0
# Set a sample missingness of 5%
mind_filter=0.05
name='ukb23156_merged_eur_unrel_rarevarsMAFbelow0.01'

gwas_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/051922_merged_bed/rareMAFbelow0.01_eur_unrel_exome_merged$(date +"%Y-%m-%d").sbatch
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/genotype_formatting.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

merge_args="""merge_plink
    --cwd $cwd
    --genoFile $genoFile
    --name $name
    --maf_filter $maf_filter
    --maf_max_filter $maf_max_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$merge_args"


INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_merged_bed/rareMAFbelow0.01_eur_unrel_exome_merged2022-05-19.sbatch
INFO: Workflow csg (ID=w184d6d32cb096add) is executed successfully with 1 completed step.



### Get the file for common variants MAF>0.01

In [44]:
genoFile=`echo /mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c{1..22}.merged.filtered.filtered.extracted.bed`
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/051922_merged_bed
# Set a MAF filter of 0.01 to keep variants above that threshold
maf_filter=0.01
# No need to filter for variant missingness again
geno_filter=0
# HWE already applied only to controls of ARHI sample
hwe_filter=0
# Set a sample missingness of 5%
mind_filter=0.05
name='ukb23156_merged_eur_unrel_commonvarsMAFabove0.01'
gwas_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/051922_merged_bed/commonMAFabove0.01_eur_unrel_exome_merged_$(date +"%Y-%m-%d").sbatch
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/genotype_formatting.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

merge_args="""merge_plink
    --cwd $cwd
    --genoFile $genoFile
    --name $name
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$merge_args"


INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_merged_bed/commonMAFabove0.01_eur_unrel_exome_merged_2022-05-19.sbatch
INFO: Workflow csg (ID=wd209f89c42e59fdb) is executed successfully with 1 completed step.



### Get a file with both common and rare variants

In [46]:
genoFile=`echo /mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c{1..22}.merged.filtered.filtered.extracted.bed`
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/051922_merged_bed
# Do not set a MAF filter, this will keep both common and rare variants
maf_filter=0 
# Do not filter again for variant missingness
geno_filter=0
# HWE already applied only to controls of ARHI sample
hwe_filter=0
# Set a sample missingness of 5%
mind_filter=0.05
name='ukb23156_merged_eur_unrel_allvars'
gwas_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/051922_merged_bed/allvars_eur_unrel_exome_merged_$(date +"%Y-%m-%d").sbatch
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/genotype_formatting.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

merge_args="""merge_plink
    --cwd $cwd
    --genoFile $genoFile
    --name $name
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$merge_args"


INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_merged_bed/allvars_eur_unrel_exome_merged_2022-05-19.sbatch
INFO: Workflow csg (ID=wadfa7b47d59121a2) is executed successfully with 1 completed step.



# Create the GRM matrix

Here we need to create a GRM per bed file for common, rare and both types of variants

```

# Compute the GRM in 99 parts, one GCTA run per part
# (GCTA stands for the GCTA executable on your system)
for i in {1..99}; do
    GCTA \
    --bfile ${BED_file_merged_QC} \
    --extract ${list_variants_LD_bin} \
    --make-grm-part 99 "$i" \
    --thread-num ${ncpu} \
    --out ${GRM_out} \
    --make-grm-alg 1
done

# Merge all GRM parts together
cat ${GRM_out}.part_99_*.grm.id > ${GRM_out}.grm.id
cat ${GRM_out}.part_99_*.grm.bin > ${GRM_out}.grm.bin
cat ${GRM_out}.part_99_*.grm.N.bin > ${GRM_out}.grm.N.bin
```
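
Concatenating the parts silently yields an incomplete GRM if any part job failed. A minimal sketch of a completeness check before merging, using stand-in files; the prefix, the number of parts, and the exact part file names in this toy example are assumptions for illustration.

```shell
tmp=$(mktemp -d)
GRM_out="$tmp/toy_grm"   # hypothetical output prefix
n_parts=3                # 99 in the command above; 3 keeps the toy example small

# Stand-in part files; in a real run GCTA writes these
i=1
while [ "$i" -le "$n_parts" ]; do
    echo "fam$i id$i" > "${GRM_out}.part_${n_parts}_${i}.grm.id"
    i=$((i + 1))
done

# Count the parts that exist and concatenate only when none is missing
found=$(ls "${GRM_out}".part_"${n_parts}"_*.grm.id | wc -l | tr -d ' ')
if [ "$found" -eq "$n_parts" ]; then
    cat "${GRM_out}".part_"${n_parts}"_*.grm.id > "${GRM_out}.grm.id"
    status="merged"
else
    status="missing $((n_parts - found)) part(s)"
fi
echo "$status"
```

The same count could be repeated for the `.grm.bin` and `.grm.N.bin` parts before their respective `cat` commands.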

## Get the GRM for the rare variant subset

In [47]:
bfile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_rarevarsMAFbelow0.01.bed
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_grm_rare
numThreads=20
gcta_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_grm/rarevars_grm_eur_unrel_gtca_$(date +"%Y-%m-%d").sbatch
gcta_sos=~/project/bioworkflows/GWAS/LMM.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/lmm.sif
# The phenoFile is supplied only because the LMM workflow requires it; it is not actually needed for the GRM calculation
phenoFile=/mnt/vast/hpc/csg/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2.tsv
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
mem='8G'
walltime='48h'

gcta_args="""gcta
    --cwd $cwd
    --bfile $bfile
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --numThreads $numThreads
    --container_lmm $container
    --walltime $walltime
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gcta_sos \
    --to-script $gcta_sbatch \
    --args "$gcta_args"


INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_grm/rarevars_grm_eur_unrel_gtca_2022-05-20.sbatch
INFO: Workflow csg (ID=w3a0b3db8c49eabe5) is executed successfully with 1 completed step.



## Get the GRM for common variant subset

In [48]:
bfile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.bed
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_grm_common
numThreads=20
gcta_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_grm_common/commonvars_eur_unrel_gtca_$(date +"%Y-%m-%d").sbatch
gcta_sos=~/project/bioworkflows/GWAS/LMM.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/lmm.sif
# The phenoFile is supplied only because the LMM workflow requires it; it is not actually needed for the GRM calculation
phenoFile=/mnt/vast/hpc/csg/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2.tsv
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
mem='8G'
walltime='48h'

gcta_args="""gcta
    --cwd $cwd
    --bfile $bfile
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --numThreads $numThreads
    --container_lmm $container
    --walltime $walltime
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gcta_sos \
    --to-script $gcta_sbatch \
    --args "$gcta_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_grm_common/commonvars_eur_unrel_gtca_2022-05-20.sbatch
INFO: Workflow csg (ID=w5dfba0004114fab4) is executed successfully with 1 completed step.



## Get the GRM for all variants (rare + common)

In [49]:
bfile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_allvars.bed
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_grm_allvars
numThreads=20
gcta_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_grm_allvars/allvars_grm_eur_unrel_$(date +"%Y-%m-%d").sbatch
gcta_sos=~/project/bioworkflows/GWAS/LMM.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/lmm.sif
# The phenoFile is supplied only because the LMM workflow requires it; it is not actually needed for the GRM calculation
phenoFile=/mnt/vast/hpc/csg/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2.tsv
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
mem='8G'
walltime='48h'

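gcta_args="""gcta
    --cwd $cwd
    --bfile $bfile
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --numThreads $numThreads
    --container_lmm $container
    --walltime $walltime
    --mem $mem
"""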

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gcta_sos \
    --to-script $gcta_sbatch \
    --args "$gcta_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_grm_allvars/allvars_grm_eur_unrel_2022-05-20.sbatch
INFO: Workflow csg (ID=w8df80d9b5785842a) is executed successfully with 1 completed step.



In [None]:
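###### Create a file listing multiple GRMs in a directory (full paths are needed) ######
# Strip the .grm.bin suffix explicitly; cutting on "." breaks when a prefix contains a different number of dots
for i in *.grm.bin ; do readlink -f "$i" | sed 's/\.grm\.bin$//' >> ${mgrm_file_path}; done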
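
Before passing the list to GCTA, it can help to confirm that every listed prefix has all three GRM files. A minimal sketch with empty stand-in files; the prefixes and the list file name are hypothetical.

```shell
tmp=$(mktemp -d)
# Stand-in GRM file sets (empty files)
for p in rare common; do
    touch "$tmp/$p.grm.id" "$tmp/$p.grm.bin" "$tmp/$p.grm.N.bin"
done
touch "$tmp/broken.grm.bin"   # deliberately incomplete set, to show the check failing
printf '%s\n' "$tmp/rare" "$tmp/common" "$tmp/broken" > "$tmp/mgrm.txt"

# Each line of the list is a GRM prefix; collect prefixes with a missing component
incomplete=""
while IFS= read -r prefix; do
    for ext in grm.id grm.bin grm.N.bin; do
        if [ ! -e "$prefix.$ext" ]; then
            incomplete="$incomplete ${prefix##*/}"
            break
        fi
    done
done < "$tmp/mgrm.txt"
echo "incomplete:$incomplete"
```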

## Do LD pruning for each group of variants

```
# Run the pruning once per autosome
for i in {1..22}; do
    plink \
    --bfile ${BED_file_merged_QC} \
    --chr "$i" \
    --extract ${list_variants_bin} \
    --indep-pairwise 50 5 0.1 \
    --out ${out_indep_var}_chr"$i" \
    --threads ${ncpu}
done
```
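
PLINK's `--indep-pairwise` writes the retained variants to a `.prune.in` file for each run, so per-chromosome runs leave one list per chromosome. A minimal sketch of combining them into a single list, using stand-in files for two chromosomes; the prefix and variant IDs are made up.

```shell
tmp=$(mktemp -d)
out_indep_var="$tmp/toy_pruned"   # hypothetical output prefix
# Stand-in .prune.in files for two chromosomes
printf 'chr1:100:A:G\nchr1:200:C:T\n' > "${out_indep_var}_chr1.prune.in"
printf 'chr2:300:G:A\n' > "${out_indep_var}_chr2.prune.in"

# Concatenate in chromosome order into one list that can be given to --extract
: > "${out_indep_var}_all.prune.in"
chr=1
while [ "$chr" -le 2 ]; do
    cat "${out_indep_var}_chr${chr}.prune.in" >> "${out_indep_var}_all.prune.in"
    chr=$((chr + 1))
done
n_kept=$(wc -l < "${out_indep_var}_all.prune.in" | tr -d ' ')
echo "variants kept: $n_kept"
```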

### LD pruning for common variants

In [61]:
## Selected White European unrelated individuals 
UKBB_PATH=/mnt/vast/hpc/csg/UKBiobank
cwd=$UKBB_PATH/results/ARHI_heritability/052022_ldprun_common
## Use the exome filtered file
genoFile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.bed
maf_filter=0 
geno_filter=0
hwe_filter=0
mind_filter=0
window=50
shift=10
r2=0.1
gwas_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_ldprun_common/common_ldprun_eur_unrel_$(date +"%Y-%m-%d").sbatch
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg2.yml
container=~/containers/bioinfo.sif
numThreads=20
job_size=1
mem='30G'

gwasqc_args="""qc:2
    --cwd $cwd
    --genoFile $genoFile
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"


INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_ldprun_common/common_ldprun_eur_unrel_2022-05-20.sbatch
INFO: Workflow csg (ID=w39d73fa64d48acf5) is executed successfully with 1 completed step.



### LD pruning for rare variants

In [62]:
## Selected White European unrelated individuals 
UKBB_PATH=/mnt/vast/hpc/csg/UKBiobank
cwd=$UKBB_PATH/results/ARHI_heritability/052022_ldprun_rare
## Use the exome filtered file
genoFile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_rarevarsMAFbelow0.01.bed
window=2000
shift=400
r2=0.01
gwas_sbatch=$UKBB_PATH/results/ARHI_heritability/052022_ldprun_rare/rare_ldprun_eur_unrel_$(date +"%Y-%m-%d").sbatch
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif
numThreads=20
job_size=1
mem='30G'

gwasqc_args="""qc:2
    --cwd $cwd
    --genoFile $genoFile
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"


INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_ldprun_rare/rare_ldprun_eur_unrel_2022-05-20.sbatch
INFO: Workflow csg (ID=wd5c42b2fe0425a2b) is executed successfully with 1 completed step.



## Recalculate PCs for the subset of unrelated individuals

In the paper they use plink2 to calculate the PCs:

```
plink2 \
--bfile ${BED_file_merged_QC} \
--extract ${list_variants_bin} \
--pca 20 approx \
--out ${PCA_out} \
--thread-num ${ncpu}
```

In our case, we have developed a pipeline that uses flashpca, so we use that one instead.

### PCA for f3393 common variants of the 50K samples

#### Step 1

In [7]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/
gwas_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/common_pca_50K_$(date +"%Y-%m-%d").sbatch
## Use the LD-pruned common-variant exome file
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_ldprun_common/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.filtered.prune.bed
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.50Ksubset.keep_id

# GWAS QC variables: set all of these to 0 to avoid any further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg (index=0) is ignored due to saved signature
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/common_pca_50K_2022-06-13.sbatch
INFO: Workflow csg (ID=wd66b7460fcddc18a) is ignored with 1 ignored step.



#### Step 2

In [8]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/
#This is the bfile obtained in step 1
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/*.bed
# Format FID, IID, pop
# PC's are actually being calculated for 50K individuals that are white European and unrelated
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/flashpca50K_common_$(date +"%Y-%m-%d").sbatch
k=10
min_axis=0
max_axis=0
pca_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/flashpcaR.sif

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/flashpca50K_common_2022-06-13.sbatch
INFO: Workflow csg (ID=w041e9c3bb3e48c12) is executed successfully with 1 completed step.



#### Step 1 (this run is wrong: more than 50K individuals were kept for the PCA calculation)

In [63]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_pca_common/f3393
gwas_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_pca_common/common_pca_f3393_exome_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_ldprun_common/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.filtered.prune.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.keep_id

# GWAS QC variables: set all of these to 0 to avoid doing any more filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_common/common_pca_f3393_exome_2022-05-20.sbatch
INFO: Workflow csg (ID=w9bd46fcb4c4e3372) is executed successfully with 1 completed step.
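
Because this run kept more individuals than intended, a quick row count of the filtered `.fam` file against the keep list could catch the problem before the PCA is launched. The following is a minimal sketch and not part of the original workflow: `check_sample_count` and its arguments are hypothetical names, and it assumes a standard PLINK `.fam` file and a keep list with one sample per line.

```bash
# Hypothetical helper: compare the number of samples in a PLINK .fam file
# with the number of IDs in the keep list that was used for filtering.
check_sample_count() {
    local fam_file=$1
    local keep_file=$2
    local n_fam n_keep
    # One line per sample in both files
    n_fam=$(wc -l < "$fam_file" | tr -d ' ')
    n_keep=$(wc -l < "$keep_file" | tr -d ' ')
    if [ "$n_fam" -le "$n_keep" ]; then
        echo "OK: $n_fam samples in .fam, $n_keep IDs in keep list"
    else
        echo "MISMATCH: $n_fam samples in .fam exceed $n_keep IDs in keep list"
        return 1
    fi
}
```

It would be called with the `.fam` file written by the QC step and the `keep_samples` file defined in the cell above; a non-zero return code signals that more samples survived than were requested.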



#### Step 2

In [64]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_pca_common/f3393
#This is the bfile obtained in step 1
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_pca_common/f3393/cache/*.bed
# Format FID, IID, pop
# PCs are calculated for the 91950 individuals who are white, unrelated and have phenotype data for f3393
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_pca_common/flashpca_f3393_common_$(date +"%Y-%m-%d").sbatch
k=10
min_axis=0
max_axis=0
pca_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/flashpcaR.sif

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_common/flashpca_f3393_common_2022-05-20.sbatch
INFO: Workflow csg (ID=w90c9447dc037bccd) is executed successfully with 1 completed step.



#### Merge the phenofile with the PC calculation for the available individuals (unrelated and white European) for common variants

In [None]:
pheno <- read.table("~/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl", header=T)
head(pheno)
dim(pheno)

In [None]:
pca <- read.table("/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_pca_common/f3393/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.pca.txt", header=T)
head(pca)
dim(pca)

In [3]:
f3393 <- merge(pheno, pca, by = c('FID', 'IID'), all.y=TRUE)
head(f3393)
dim(f3393)
library(dplyr)
f3393_final <- select(f3393, -c('ID', 'ethnicity'))
head(f3393_final)

Unnamed: 0_level_0,FID,IID,sex,f3393,age,ID,ethnicity,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000078,1000078,1,0,60,1000078:1000078,British,0.001867425,0.013093557,-0.012545642,-0.0077773685,-0.0076396044,-0.004067034,0.003904697,-0.0066333866,-0.011400679,-0.003770311
2,1000081,1000081,0,0,67,1000081:1000081,British,0.134179544,-0.015138242,0.014278056,0.0142731188,-0.0085332226,-0.010936588,0.011577007,0.0004865037,0.008102219,-0.002117615
3,1000236,1000236,0,0,70,1000236:1000236,British,0.008867756,0.011425836,-0.023071386,0.0064096586,0.0033986306,0.021169599,-0.003277936,0.0054276548,-0.01065791,-0.03410221
4,1000331,1000331,1,0,53,1000331:1000331,Any_other_white_background,0.080695947,-0.003566874,-0.002728336,-0.0039599944,0.0005021192,0.003650339,0.029506097,0.0038201997,-0.013963517,-0.01629406
5,1000340,1000340,1,0,54,1000340:1000340,British,-0.012073413,-0.005182776,-0.00929079,0.0007751943,-0.0024663547,-0.006647973,0.002676356,-0.0018374353,0.002831617,-0.001478281
6,1000415,1000415,0,0,65,1000415:1000415,Irish,-0.018177064,-0.053073714,-0.000446653,-0.0024920664,0.0170787094,-0.003519533,0.005667462,0.0251297768,-0.006116507,0.005747638




Unnamed: 0_level_0,FID,IID,sex,f3393,age,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000078,1000078,1,0,60,0.001867425,0.013093557,-0.012545642,-0.0077773685,-0.0076396044,-0.004067034,0.003904697,-0.0066333866,-0.011400679,-0.003770311
2,1000081,1000081,0,0,67,0.134179544,-0.015138242,0.014278056,0.0142731188,-0.0085332226,-0.010936588,0.011577007,0.0004865037,0.008102219,-0.002117615
3,1000236,1000236,0,0,70,0.008867756,0.011425836,-0.023071386,0.0064096586,0.0033986306,0.021169599,-0.003277936,0.0054276548,-0.01065791,-0.03410221
4,1000331,1000331,1,0,53,0.080695947,-0.003566874,-0.002728336,-0.0039599944,0.0005021192,0.003650339,0.029506097,0.0038201997,-0.013963517,-0.01629406
5,1000340,1000340,1,0,54,-0.012073413,-0.005182776,-0.00929079,0.0007751943,-0.0024663547,-0.006647973,0.002676356,-0.0018374353,0.002831617,-0.001478281
6,1000415,1000415,0,0,65,-0.018177064,-0.053073714,-0.000446653,-0.0024920664,0.0170787094,-0.003519533,0.005667462,0.0251297768,-0.006116507,0.005747638


In [4]:
write.table(f3393_final, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.greml_pheno", sep="\t", row.names = FALSE, col.names =TRUE, quote=FALSE)
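
Since the merge above uses `all.y=TRUE`, any individual present in the PCA output but absent from the phenotype file would be written with `NA` in `sex`, `f3393` and `age`. A minimal sketch for counting such rows in the written file is given below; `count_missing` is a hypothetical helper name, and it assumes the tab-delimited layout with a header line produced by the `write.table` call.

```bash
# Hypothetical helper: count data rows whose value in a named column is NA or empty.
# Assumes a tab-delimited file with a header line.
count_missing() {
    local file=$1
    local column=$2
    awk -F'\t' -v col="$column" '
        # Locate the requested column in the header
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
        # Count missing values in that column
        idx && ($idx == "NA" || $idx == "") { n++ }
        # Exit with status 2 if the column was not found
        END { if (!idx) exit 2; print n + 0 }
    ' "$file"
}
```

A count above zero for `f3393` would suggest that some PCA samples have no phenotype and may need to be dropped before GREML.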

### PCA for f3393 common variants of the 80K samples

#### Step 1

In [4]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060722_pca_common_80K/f3393
gwas_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060722_pca_common_80K/common_pca_f3393_80K_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_ldprun_common/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.filtered.prune.bed
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.80Ksubset.keep_id

# GWAS QC variables: set all of these to 0 to avoid doing any more filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/060722_pca_common_80K/common_pca_f3393_80K_2022-06-07.sbatch
INFO: Workflow csg (ID=w81b967b0cda647cd) is executed successfully with 1 completed step.



#### Step 2

In [8]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060722_pca_common_80K/f3393
#This is the bfile obtained in step 1
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060722_pca_common_80K/f3393/*.bed
# Format FID, IID, pop
# PCs are calculated for the 80K individuals who are white, unrelated and have phenotype data for f3393
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060722_pca_common_80K/flashpca_f3393_common80K_$(date +"%Y-%m-%d").sbatch
k=10
min_axis=0
max_axis=0
pca_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/flashpcaR.sif

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/060722_pca_common_80K/flashpca_f3393_common80K_2022-06-07.sbatch
INFO: Workflow csg (ID=w0d063444a8caa3d2) is executed successfully with 1 completed step.



#### Get the phenotype file for GREML of the 80K

In [12]:
pheno <- read.table("~/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl", header=T)
head(pheno)
dim(pheno)

Unnamed: 0_level_0,FID,IID,sex,f3393,age
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>
1,1001384,1001384,1,1,61
2,1002548,1002548,0,1,62
3,1002888,1002888,0,1,68
4,1002944,1002944,0,1,65
5,1003258,1003258,0,1,74
6,1004843,1004843,0,1,64


In [13]:
pca <- read.table("/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060722_pca_common_80K/f3393/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.pca.txt", header=T)
head(pca)
dim(pca)

Unnamed: 0_level_0,ID,FID,IID,ethnicity,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<fct>,<int>,<int>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000081:1000081,1000081,1000081,British,0.134658064,0.014188187,0.0135583292,0.0143258827,0.0085594803,0.0075761687,-0.010186438,-0.001022228,0.008934933,-0.002508883
2,1000236:1000236,1000236,1000236,British,0.008812187,-0.01145655,-0.0233585109,0.0081630403,-0.0029101523,-0.0194662282,0.007941112,0.00856768,-0.006907107,-0.03616751
3,1000331:1000331,1000331,1000331,Any_other_white_background,0.080859632,0.003496805,-0.0029620121,-0.0034742864,-0.0009168386,-0.0114138724,-0.028556378,0.006783471,-0.013180153,-0.01621851
4,1000340:1000340,1000340,1000340,British,-0.01245736,0.005989548,-0.0095406987,0.0009187248,0.0026318657,0.0071420228,-0.005840268,-0.001536171,0.002371991,-9.876745e-05
5,1000415:1000415,1000415,1000415,Irish,-0.018346574,0.053126182,-0.0005572422,-0.0018209019,-0.0180591232,0.0026329042,-0.008080767,0.024710975,-0.001029565,0.005543847
6,1000421:1000421,1000421,1000421,British,0.003190009,0.025497201,0.0111553077,-0.0079503301,-0.0092678599,-0.0002117366,-0.003330029,0.004412672,0.007056397,-0.01543426


In [14]:
f3393 <- merge(pheno, pca, by = c('FID', 'IID'), all.y=TRUE)
head(f3393)
dim(f3393)
library(dplyr)
f3393_final <- select(f3393, -c('ID', 'ethnicity'))
head(f3393_final)

Unnamed: 0_level_0,FID,IID,sex,f3393,age,ID,ethnicity,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000081,1000081,0,0,67,1000081:1000081,British,0.134658064,0.014188187,0.0135583292,0.0143258827,0.0085594803,0.0075761687,-0.010186438,-0.001022228,0.008934933,-0.002508883
2,1000236,1000236,0,0,70,1000236:1000236,British,0.008812187,-0.01145655,-0.0233585109,0.0081630403,-0.0029101523,-0.0194662282,0.007941112,0.00856768,-0.006907107,-0.03616751
3,1000331,1000331,1,0,53,1000331:1000331,Any_other_white_background,0.080859632,0.003496805,-0.0029620121,-0.0034742864,-0.0009168386,-0.0114138724,-0.028556378,0.006783471,-0.013180153,-0.01621851
4,1000340,1000340,1,0,54,1000340:1000340,British,-0.01245736,0.005989548,-0.0095406987,0.0009187248,0.0026318657,0.0071420228,-0.005840268,-0.001536171,0.002371991,-9.876745e-05
5,1000415,1000415,0,0,65,1000415:1000415,Irish,-0.018346574,0.053126182,-0.0005572422,-0.0018209019,-0.0180591232,0.0026329042,-0.008080767,0.024710975,-0.001029565,0.005543847
6,1000421,1000421,1,0,64,1000421:1000421,British,0.003190009,0.025497201,0.0111553077,-0.0079503301,-0.0092678599,-0.0002117366,-0.003330029,0.004412672,0.007056397,-0.01543426




Unnamed: 0_level_0,FID,IID,sex,f3393,age,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000081,1000081,0,0,67,0.134658064,0.014188187,0.0135583292,0.0143258827,0.0085594803,0.0075761687,-0.010186438,-0.001022228,0.008934933,-0.002508883
2,1000236,1000236,0,0,70,0.008812187,-0.01145655,-0.0233585109,0.0081630403,-0.0029101523,-0.0194662282,0.007941112,0.00856768,-0.006907107,-0.03616751
3,1000331,1000331,1,0,53,0.080859632,0.003496805,-0.0029620121,-0.0034742864,-0.0009168386,-0.0114138724,-0.028556378,0.006783471,-0.013180153,-0.01621851
4,1000340,1000340,1,0,54,-0.01245736,0.005989548,-0.0095406987,0.0009187248,0.0026318657,0.0071420228,-0.005840268,-0.001536171,0.002371991,-9.876745e-05
5,1000415,1000415,0,0,65,-0.018346574,0.053126182,-0.0005572422,-0.0018209019,-0.0180591232,0.0026329042,-0.008080767,0.024710975,-0.001029565,0.005543847
6,1000421,1000421,1,0,64,0.003190009,0.025497201,0.0111553077,-0.0079503301,-0.0092678599,-0.0002117366,-0.003330029,0.004412672,0.007056397,-0.01543426


In [18]:
write.table(f3393_final, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.common80K.greml_pheno", sep="\t", row.names = FALSE, col.names =TRUE, quote=FALSE)

### PCA for f3393 rare variants in the 50K group

#### Step 1

In [2]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052622_pca_rare_50K/f3393
gwas_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052622_pca_rare_50K/rare50K_pca_f3393_exome_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_ldprun_rare/*.prune.bed
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.50Ksubset.keep_id

# GWAS QC variables: set all of these to 0 to avoid doing any more filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052622_pca_rare_50K/rare50K_pca_f3393_exome_2022-05-26.sbatch
INFO: Workflow csg (ID=w31b939c960d416df) is executed successfully with 1 completed step.



#### Step 2

In [3]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052622_pca_rare_50K/f3393
#This is the bfile obtained in step 1
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052622_pca_rare_50K/f3393/ukb23156_merged_eur_unrel_rarevarsMAFbelow0.01.filtered.prune.filtered.bed
# Format FID, IID, pop
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052622_pca_rare_50K/flashpca_f3393_rare50k_$(date +"%Y-%m-%d").sbatch
k=10
min_axis=0
max_axis=0
pca_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=/mnt/vast/hpc/csg/containers/flashpcaR.sif
walltime='60h'

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --container $container
    --walltime $walltime
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052622_pca_rare_50K/flashpca_f3393_rare50k_2022-05-31.sbatch
INFO: Workflow csg (ID=w2faab7fac9677616) is executed successfully with 1 completed step.



#### Merge the phenofile 50K with the PC calculation for the available individuals (unrelated and white European) for rare variants

In [4]:
pheno <- read.table("~/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl", header=T)
head(pheno)
dim(pheno)

Unnamed: 0_level_0,FID,IID,sex,f3393,age
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>
1,1001384,1001384,1,1,61
2,1002548,1002548,0,1,62
3,1002888,1002888,0,1,68
4,1002944,1002944,0,1,65
5,1003258,1003258,0,1,74
6,1004843,1004843,0,1,64


In [5]:
pca <- read.table("~/UKBiobank/results/ARHI_heritability/052622_pca_rare_50K/f3393/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.pca.txt", header=T)
head(pca)
dim(pca)

Unnamed: 0_level_0,ID,FID,IID,ethnicity,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<fct>,<int>,<int>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000081:1000081,1000081,1000081,British,-0.0019568461,-0.0003087856,-0.0020169598,-0.001798777,0.001177783,-0.003362221,0.0121367445,-0.0020889197,-0.0056102374,0.0007451252
2,1000331:1000331,1000331,1000331,Any_other_white_background,-0.0004171203,-0.0001401596,-0.0003095201,-0.000211136,0.00069921,-0.0036493841,0.0039540739,-0.0017140885,-0.0034396176,-0.001264749
3,1000340:1000340,1000340,1000340,British,0.000173436,5.561481e-05,0.0001703869,6.928099e-05,-9.218887e-05,0.0002012874,-0.0005118113,0.0001987724,0.000268085,-0.000162182
4,1000421:1000421,1000421,1000421,British,0.0002647704,9.941067e-05,0.0002086665,0.0001034935,-8.353187e-05,0.0002334827,-0.0004849626,0.0003358854,0.0006122641,0.0001673864
5,1000439:1000439,1000439,1000439,British,0.0002492869,0.0001016217,0.0001869445,0.0001501699,-0.0001035148,0.000188998,-0.0005558054,0.0001782244,0.0001964613,8.653552e-05
6,1000752:1000752,1000752,1000752,British,0.0001850879,7.896201e-05,0.0001708455,0.000110915,-2.766921e-05,0.0002038623,-0.0006285896,0.0002629819,0.0001643356,-2.829154e-05


In [6]:
f3393 <- merge(pheno, pca, by = c('FID', 'IID'), all.y=TRUE)
head(f3393)
dim(f3393)
library(dplyr)
f3393_final <- select(f3393, -c('ID', 'ethnicity'))
head(f3393_final)

Unnamed: 0_level_0,FID,IID,sex,f3393,age,ID,ethnicity,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000081,1000081,0,0,67,1000081:1000081,British,-0.0019568461,-0.0003087856,-0.0020169598,-0.001798777,0.001177783,-0.003362221,0.0121367445,-0.0020889197,-0.0056102374,0.0007451252
2,1000331,1000331,1,0,53,1000331:1000331,Any_other_white_background,-0.0004171203,-0.0001401596,-0.0003095201,-0.000211136,0.00069921,-0.0036493841,0.0039540739,-0.0017140885,-0.0034396176,-0.001264749
3,1000340,1000340,1,0,54,1000340:1000340,British,0.000173436,5.561481e-05,0.0001703869,6.928099e-05,-9.218887e-05,0.0002012874,-0.0005118113,0.0001987724,0.000268085,-0.000162182
4,1000421,1000421,1,0,64,1000421:1000421,British,0.0002647704,9.941067e-05,0.0002086665,0.0001034935,-8.353187e-05,0.0002334827,-0.0004849626,0.0003358854,0.0006122641,0.0001673864
5,1000439,1000439,1,0,59,1000439:1000439,British,0.0002492869,0.0001016217,0.0001869445,0.0001501699,-0.0001035148,0.000188998,-0.0005558054,0.0001782244,0.0001964613,8.653552e-05
6,1000752,1000752,1,0,53,1000752:1000752,British,0.0001850879,7.896201e-05,0.0001708455,0.000110915,-2.766921e-05,0.0002038623,-0.0006285896,0.0002629819,0.0001643356,-2.829154e-05




Unnamed: 0_level_0,FID,IID,sex,f3393,age,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000081,1000081,0,0,67,-0.0019568461,-0.0003087856,-0.0020169598,-0.001798777,0.001177783,-0.003362221,0.0121367445,-0.0020889197,-0.0056102374,0.0007451252
2,1000331,1000331,1,0,53,-0.0004171203,-0.0001401596,-0.0003095201,-0.000211136,0.00069921,-0.0036493841,0.0039540739,-0.0017140885,-0.0034396176,-0.001264749
3,1000340,1000340,1,0,54,0.000173436,5.561481e-05,0.0001703869,6.928099e-05,-9.218887e-05,0.0002012874,-0.0005118113,0.0001987724,0.000268085,-0.000162182
4,1000421,1000421,1,0,64,0.0002647704,9.941067e-05,0.0002086665,0.0001034935,-8.353187e-05,0.0002334827,-0.0004849626,0.0003358854,0.0006122641,0.0001673864
5,1000439,1000439,1,0,59,0.0002492869,0.0001016217,0.0001869445,0.0001501699,-0.0001035148,0.000188998,-0.0005558054,0.0001782244,0.0001964613,8.653552e-05
6,1000752,1000752,1,0,53,0.0001850879,7.896201e-05,0.0001708455,0.000110915,-2.766921e-05,0.0002038623,-0.0006285896,0.0002629819,0.0001643356,-2.829154e-05


In [7]:
write.table(f3393_final, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.greml_pheno_rarevars", sep="\t", row.names = FALSE, col.names =TRUE, quote=FALSE)

### PCA for f3393 rare variants in the 80K group

#### Step 1

In [11]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060722_pca_rare_80K/f3393
gwas_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060722_pca_rare_80K/rare80K_pca_f3393_exome_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_ldprun_rare/*.prune.bed
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.80Ksubset.keep_id

# GWAS QC variables: set all of these to 0 to avoid doing any more filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg (index=0) is ignored due to saved signature
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/060722_pca_rare_80K/rare80K_pca_f3393_exome_2022-06-07.sbatch
INFO: Workflow csg (ID=wa733fe160ca0d594) is ignored with 1 ignored step.



#### Step 2

In [22]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060722_pca_rare_80K/f3393
#This is the bfile obtained in step 1
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060722_pca_rare_80K/f3393/ukb23156_merged_eur_unrel_rarevarsMAFbelow0.01.filtered.prune.filtered.bed
# Format FID, IID, pop
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060722_pca_rare_80K/flashpca_f3393_rare80k_$(date +"%Y-%m-%d").sbatch
k=10
min_axis=0
max_axis=0
pca_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=/mnt/vast/hpc/csg/containers/flashpcaR.sif
walltime='60h'

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --container $container
    --walltime $walltime
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/060722_pca_rare_80K/flashpca_f3393_rare80k_2022-06-07.sbatch
INFO: Workflow csg (ID=w64f9966a6e8ac734) is executed successfully with 1 completed step.



# Calculate the heritability

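For a case-control trait, the heritability should be estimated with the population prevalence supplied through `--prevalence`, so that GCTA reports the estimate on the liability scale as well as on the observed scale. For example: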

```
gcta64 --grm test --pheno test_cc.phen --reml --prevalence 0.01 --out test --thread-num 10
```
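
GCTA reads the phenotype, discrete covariates and quantitative covariates from separate header-less files (FID and IID in the first two columns). The sketch below shows one way the merged phenotype table could be split into those three inputs. `split_greml_pheno` is a hypothetical helper, and the column positions assume the layout shown in the merged table earlier in this notebook (FID, IID, sex, f3393, age, then the PCs).

```bash
# Hypothetical helper: split a tab-delimited phenotype table with a header
# (FID, IID, sex, f3393, age, PC1..PCn) into header-less GCTA input files.
split_greml_pheno() {
    local in_file=$1
    local out_prefix=$2
    # Phenotype file: FID IID f3393
    awk -F'\t' 'NR > 1 { print $1, $2, $4 }' "$in_file" > "$out_prefix.phen"
    # Discrete covariate file: FID IID sex
    awk -F'\t' 'NR > 1 { print $1, $2, $3 }' "$in_file" > "$out_prefix.covar"
    # Quantitative covariate file: FID IID age PC1..PCn
    awk -F'\t' 'NR > 1 {
        line = $1 " " $2
        for (i = 5; i <= NF; i++) line = line " " $i
        print line
    }' "$in_file" > "$out_prefix.qcovar"
}

# Example GREML call using the three files (not run here; $grm_prefix,
# $prevalence and $out_prefix are placeholders to be chosen by the analyst):
# gcta64 --grm "$grm_prefix" --pheno "$out_prefix.phen" \
#        --covar "$out_prefix.covar" --qcovar "$out_prefix.qcovar" \
#        --reml --prevalence "$prevalence" --out "$out_prefix" --thread-num 10
```

The prevalence is left as a placeholder on purpose: it should be the population prevalence of the trait, not the case fraction in the sample.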

## f3393 common variants

### Generate random samples of 50,000 and 80,000 individuals for GREML computation

In [1]:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/mnt/mfs/cluster/R-Deb10_Libs





In [1]:
pheno <- read.table("/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.greml_pheno", header=T)
head(pheno)


In [8]:
# Get a subset of 50K individuals
subset_ind <- pheno[sample(nrow(pheno), 50000), ]
nrow(subset_ind)

In [2]:
# Get a subset of 80K individuals
subset_ind <- pheno[sample(nrow(pheno), 80000), ]
nrow(subset_ind)

In [4]:
keep_samples <- subset_ind[, c('FID', 'IID')]

In [5]:
head(keep_samples)

Unnamed: 0_level_0,FID,IID
Unnamed: 0_level_1,<int>,<int>
90075,5921756,5921756
44900,3455376,3455376
49091,3689005,3689005
32700,2784217,2784217
10668,1582481,1582481
3703,1202137,1202137


In [17]:
# Write to a file for the 50K individuals
write.table(keep_samples, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.50Ksubset.keep_id", sep="\t", row.names = FALSE, col.names =FALSE, quote=FALSE)

In [6]:
# Write to a file for the 80K individuals
write.table(keep_samples, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.80Ksubset.keep_id", sep="\t", row.names = FALSE, col.names =FALSE, quote=FALSE)
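
The R cells above draw the subsets without a fixed seed, so rerunning them would select different individuals. As a hedged alternative, the sketch below draws a reproducible subset directly from the phenotype file. `sample_ids` is a hypothetical helper, it assumes a tab-delimited file with a header and FID and IID in the first two columns, and the random sequence depends on the `awk` implementation in use.

```bash
# Hypothetical helper: draw a reproducible random subset of n individuals.
# Prints tab-separated FID and IID without a header (the keep_id layout).
sample_ids() {
    local pheno_file=$1
    local n=$2
    local seed=$3
    # Attach a seeded random key to every data row, sort by the key,
    # keep the first n rows and drop the key again.
    awk -F'\t' -v seed="$seed" '
        BEGIN { srand(seed) }
        NR > 1 { printf "%.10f\t%s\t%s\n", rand(), $1, $2 }
    ' "$pheno_file" \
        | LC_ALL=C sort -k1,1n \
        | awk -v n="$n" 'NR <= n' \
        | cut -f2,3
}
```

Calling it twice with the same seed returns the same individuals, which makes the 50K and 80K keep lists reproducible as long as the input file and `awk` version do not change.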

### Select the 50K samples from the GRM 

This did not work: I could not subset the individuals from the GRM that had already been created, so the GRM is recalculated for the 50K subset below.

In [18]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/50K_greml
# GRM previously calculated from the common variants
grm=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_grm_common/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.grm.bin
# Format FID, IID
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.50Ksubset.keep_id
grm_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/50K_greml/f3393_greml_common50K_$(date +"%Y-%m-%d").sbatch
greml_sos=~/project/UKBB_GWAS_dev/workflow/GREML.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/lmm.sif
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.50Ksubset.greml_pheno
phenoCol=f3393
name=50Ksubset
mem='50G'

grm_args="""grm_processing
    --cwd $cwd
    --grm $grm
    --keep_samples $keep_samples
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --name $name
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $greml_sos \
    --to-script $grm_sbatch \
    --args "$grm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/50K_greml/f3393_greml_common50K_2022-05-25.sbatch
INFO: Workflow csg (ID=wd0fabc9ffcf610d1) is executed successfully with 1 completed step.
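
One thing worth ruling out when subsetting an existing GRM fails, sketched under assumptions: GCTA writes the sample IDs of a GRM to a companion `.grm.id` file (FID and IID per line), and a keep list can only select individuals whose FID/IID pairs appear there. The snippet runs on toy files in a temporary directory; the helper name `missing_from_grm`, the file names and the IDs are made up for illustration.

```shell
# Toy inputs (hypothetical IDs, not real participants)
tmpdir=$(mktemp -d)
printf '%s\t%s\n' 1 1 2 2 3 3 4 4 > "$tmpdir/toy.grm.id"
printf '%s\t%s\n' 2 2 4 4 9 9     > "$tmpdir/toy.keep_id"

# Print keep-list entries whose FID/IID pair is absent from the GRM ID file
missing_from_grm() {
    local grm_id=$1 keep=$2
    awk 'NR == FNR { seen[$1 FS $2] = 1; next }
         !(($1 FS $2) in seen) { print $1 "\t" $2 }' "$grm_id" "$keep"
}

missing=$(missing_from_grm "$tmpdir/toy.grm.id" "$tmpdir/toy.keep_id")
# Count non-empty lines; grep -c exits non-zero when the count is 0
n_missing=$(printf '%s\n' "$missing" | grep -c . || true)
echo "keep-list entries not in GRM: $n_missing"
rm -rf "$tmpdir"
```

A non-zero count would mean the keep list and the GRM were built from different sample sets, which is one of several possible explanations and not a diagnosis of the failure seen here.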



#### Recalculate GRM for 50K

In [1]:
bfile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.bed
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052522_grm_common_50K
numThreads=20
gcta_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/commonvars_eur_unrel_gtca_50K_$(date +"%Y-%m-%d").sbatch
gcta_sos=~/project/bioworkflows/GWAS/LMM.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/lmm.sif
# The phenotype file is passed only because the LMM workflow requires it; it is not used for the GRM calculation
phenoFile=/mnt/vast/hpc/csg/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2.tsv
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
mem='8G'
walltime='48h'
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.50Ksubset.keep_id

gcta_args="""gcta
    --cwd $cwd
    --bfile $bfile
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --numThreads $numThreads
    --keep_samples $keep_samples
    --container_lmm $container
    --walltime $walltime
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gcta_sos \
    --to-script $gcta_sbatch \
    --args "$gcta_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/commonvars_eur_unrel_gtca_50K_2022-05-26.sbatch
INFO: Workflow csg (ID=w3fa3dd1b77d3c0aa) is executed successfully with 1 completed step.


#### GREML for 50K individuals

In [1]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052622_greml_common_50K_prev0.01
# GRM recalculated for the 50K subset
grm=~/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.grm.bin
# Phenotype file with the phenotype and covariate columns
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.50Ksubset.greml_pheno
greml_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052622_greml_common_50K_prev0.01/f3393_greml_common_50K_$(date +"%Y-%m-%d").sbatch
greml_sos=~/project/UKBB_GWAS_dev/workflow/GREML.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/lmm.sif
phenoCol=f3393
covarCol=sex
qCovarCol=`echo age PC{1..10}`
prevalence=0.01

greml_args="""greml
    --cwd $cwd
    --grm $grm
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --prevalence $prevalence
    --container $container   
"""

sos run  ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $greml_sos \
    --to-script $greml_sbatch \
    --args "$greml_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052622_greml_50K/f3393_greml_common_50K_2022-05-26.sbatch
INFO: Workflow csg (ID=wee86e652bfdbb854) is executed successfully with 1 completed step.



### GREML with a prevalence of 30%, which was far too high
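
Why the prevalence setting matters, as a short sketch: GCTA's `--prevalence` option converts the heritability estimated on the observed 0/1 scale to the liability scale, following Lee et al. (2011). Assuming the `greml` step passes `--prevalence` through to GCTA, the conversion is

$$
h^2_{l} = h^2_{o} \, \frac{K(1-K)}{z^2} \, \frac{K(1-K)}{P(1-P)}
$$

where $K$ is the assumed population prevalence, $P$ is the proportion of cases in the analysed sample, and $z$ is the height of the standard normal density at the threshold that cuts off a proportion $K$. The observed-scale estimate $h^2_{o}$ does not depend on $K$, so changing `prevalence` only rescales the liability-scale value.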

In [5]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052622_greml_common_50K_prev0.3
# GRM recalculated for the 50K subset
grm=~/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.grm.bin
# Phenotype file with the phenotype and covariate columns
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.50Ksubset.greml_pheno
greml_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052622_greml_common_50K_prev0.3/f3393_greml_common_50K_$(date +"%Y-%m-%d").sbatch
greml_sos=~/project/UKBB_GWAS_dev/workflow/GREML.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/lmm.sif
phenoCol=f3393
covarCol=sex
qCovarCol=`echo age PC{1..10}`
prevalence=0.3

greml_args="""greml
    --cwd $cwd
    --grm $grm
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --prevalence $prevalence
    --container $container   
"""

sos run  ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $greml_sos \
    --to-script $greml_sbatch \
    --args "$greml_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052622_greml_50K_prev0.3/f3393_greml_common_50K_2022-05-26.sbatch
INFO: Workflow csg (ID=w3797ec58440ab2af) is executed successfully with 1 completed step.



### Corrected prevalence of 6% as calculated from the data
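
As a quick check of the 6% figure: the phenotype file name records 6436 cases and 96601 controls, and 6436 / (6436 + 96601) ≈ 0.062. The sketch below shows one way to compute the same proportion from a phenotype table. It runs on a toy file, the helper name `case_fraction` is made up for illustration, and it assumes a header row and a 0/1 coding of the phenotype column.

```shell
tmpdir=$(mktemp -d)
# Toy phenotype table (hypothetical rows): 1 case among 4 individuals
printf '%s\t%s\t%s\t%s\t%s\n' \
    FID IID sex f3393 age \
    1 1 0 1 60 \
    2 2 1 0 55 \
    3 3 0 0 62 \
    4 4 1 0 48 > "$tmpdir/toy_pheno.tsv"

# Fraction of cases (coded 1) among cases and controls (coded 0) in a named column
case_fraction() {
    local pheno=$1 col=$2
    awk -v col="$col" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { bad = 1; exit 1 }   # column name not found in the header
            next
        }
        $c == 1 { cases++ }
        $c == 0 { ctrls++ }
        END { if (!bad && cases + ctrls > 0) printf "%.4f\n", cases / (cases + ctrls) }
    ' "$pheno"
}

toy_prev=$(case_fraction "$tmpdir/toy_pheno.tsv" f3393)
# Same arithmetic on the counts taken from the phenotype file name
name_prev=$(awk 'BEGIN { printf "%.4f\n", 6436 / (6436 + 96601) }')
echo "toy file: $toy_prev   file-name counts: $name_prev"
rm -rf "$tmpdir"
```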

In [8]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060122_greml_common_50K_prev0.06
# GRM recalculated for the 50K subset
grm=~/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.grm.bin
# Phenotype file with the phenotype and covariate columns
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.50Ksubset.greml_pheno
greml_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060122_greml_common_50K_prev0.06/f3393_greml_common_50K_prev0.06_$(date +"%Y-%m-%d").sbatch
greml_sos=~/project/UKBB_GWAS_dev/workflow/GREML.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/lmm.sif
phenoCol=f3393
covarCol=sex
qCovarCol=`echo age PC{1..10}`
# Use the prevalence calculated from the actual data
prevalence=0.06
walltime='36h'
mem='250G'
greml_args="""greml
    --cwd $cwd
    --grm $grm
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --prevalence $prevalence
    --container $container  
    --walltime $walltime
    --mem $mem
"""

sos run  ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $greml_sos \
    --to-script $greml_sbatch \
    --args "$greml_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/060122_greml_common_50K_prev0.06/f3393_greml_common_50K_prev0.06_2022-06-01.sbatch
INFO: Workflow csg (ID=w1d8bb48e4388f731) is executed successfully with 1 completed step.



### Recalculate the GRM of common variants for the 80,000 random samples

In [10]:
bfile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.bed
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060122_grm_common_80K
numThreads=20
gcta_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060122_grm_common_80K/commonvars_eur_unrel_gtca_80K_$(date +"%Y-%m-%d").sbatch
gcta_sos=~/project/bioworkflows/GWAS/LMM.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/lmm.sif
# The phenotype file is passed only because the LMM workflow requires it; it is not used for the GRM calculation
phenoFile=/mnt/vast/hpc/csg/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2.tsv
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
mem='8G'
walltime='48h'
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.80Ksubset.keep_id

gcta_args="""gcta
    --cwd $cwd
    --bfile $bfile
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --numThreads $numThreads
    --keep_samples $keep_samples
    --container_lmm $container
    --walltime $walltime
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gcta_sos \
    --to-script $gcta_sbatch \
    --args "$gcta_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg (index=0) is ignored due to saved signature
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/060122_grm_common_80K/commonvars_eur_unrel_gtca_80K_2022-06-01.sbatch
INFO: Workflow csg (ID=we22deef3b9961b7c) is ignored with 1 ignored step.



### Run GREML for common variants with the 80K individuals

In [21]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060722_greml_common_80K_prev0.06
# GRM recalculated for the 80K subset
grm=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060122_grm_common_80K/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.grm.bin
# Phenotype file with the phenotype and covariate columns
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.common80K.greml_pheno
greml_sbatch=~/UKBiobank/results/ARHI_heritability/060722_greml_common_80K_prev0.06/f3393_greml_common_80K_prev0.06_$(date +"%Y-%m-%d").sbatch
greml_sos=~/project/UKBB_GWAS_dev/workflow/GREML.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/lmm.sif
phenoCol=f3393
covarCol=sex
qCovarCol=`echo age PC{1..10}`
# Use the prevalence calculated from the actual data
prevalence=0.06
walltime='36h'
mem='250G'
greml_args="""greml
    --cwd $cwd
    --grm $grm
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --prevalence $prevalence
    --container $container  
    --walltime $walltime
    --mem $mem
"""

sos run  ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $greml_sos \
    --to-script $greml_sbatch \
    --args "$greml_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/UKBiobank/results/ARHI_heritability/060722_greml_common_80K_prev0.06/f3393_greml_common_80K_prev0.06_2022-06-07.sbatch
INFO: Workflow csg (ID=w69db70580a71628f) is executed successfully with 1 completed step.



## f3393 rare variants

#### Recalculate GRM for 50K random samples from the rare variants

In [None]:
bfile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_rarevarsMAFbelow0.01.bed
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/053122_grm_rare_50K
numThreads=1
gcta_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/053122_grm_rare_50K/rarevars_eur_unrel_gtca_50K_$(date +"%Y-%m-%d").sbatch
gcta_sos=~/project/bioworkflows/GWAS/LMM.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=/mnt/vast/hpc/csg/containers/lmm.sif
# The phenotype file is passed only because the LMM workflow requires it; it is not used for the GRM calculation
phenoFile=/mnt/vast/hpc/csg/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2.tsv
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
mem='8G'
walltime='48h'
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.50Ksubset.keep_id

gcta_args="""gcta
    --cwd $cwd
    --bfile $bfile
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --numThreads $numThreads
    --keep_samples $keep_samples
    --container_lmm $container
    --walltime $walltime
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gcta_sos \
    --to-script $gcta_sbatch \
    --args "$gcta_args"

#### GREML for 50K individuals rare variants

Note: the job was given 100G of memory and was killed because it ran out of memory.
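
A rough size estimate can help when choosing `mem`. A dense n × n matrix of 8-byte doubles takes 8n² bytes, so one such matrix is about 18.6 GiB for 50,000 individuals and about 47.7 GiB for 80,000. How many matrices of that size the REML step holds in memory at once is not stated here and is treated as an unknown; the helper name `dense_gib` is made up for illustration.

```shell
# Size in GiB of one dense n x n matrix of 8-byte doubles
dense_gib() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", n * n * 8 / (1024 ^ 3) }'
}

gib_50k=$(dense_gib 50000)
gib_80k=$(dense_gib 80000)
echo "50K: ${gib_50k} GiB per matrix; 80K: ${gib_80k} GiB per matrix"
```

On this estimate, 100G holds only about five such matrices for 50K individuals, so an out-of-memory kill is plausible if the solver needs more than that at once.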

In [3]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/053122_greml_rare_50K
# Rare-variant GRM recalculated for the 50K subset
grm=~/UKBiobank/results/ARHI_heritability/053122_grm_rare_50K/ukb23156_merged_eur_unrel_rarevarsMAFbelow0.01.grm.bin
# Phenotype file with the phenotype and covariate columns
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.greml_pheno_rarevars
greml_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/053122_greml_rare_50K/f3393_greml_rare_50K_$(date +"%Y-%m-%d").sbatch
greml_sos=~/project/UKBB_GWAS_dev/workflow/GREML.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/lmm.sif
phenoCol=f3393
covarCol=sex
qCovarCol=`echo age PC{1..10}`
prevalence=0.06
reml_priors='0.0287975 0.0294562'

greml_args="""greml
    --cwd $cwd
    --grm $grm
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --prevalence $prevalence
    --reml_priors $reml_priors
    --container $container   
"""

sos run  ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $greml_sos \
    --to-script $greml_sbatch \
    --args "$greml_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/053122_greml_rare_50K/f3393_greml_rare_50K_2022-06-28.sbatch
INFO: Workflow csg (ID=w5df973877f333a69) is executed successfully with 1 completed step.



#### GREML for 80K individuals rare variants set

## f2247 common variants

### Get the 50K and 80K random samples for f2247

In [49]:
# This file contains the list of individuals that are white European and unrelated which was used to build the GRM
white_unrel <-  read.table("/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_ldprun_common/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.filtered.prune.fam", header=F)
colnames(white_unrel)<- c('FID','IID')
head(white_unrel)
dim(white_unrel)

|   | FID | IID | NA | NA | NA | NA |
|---|---|---|---|---|---|---|
|   | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 1000019 | 1000019 | 0 | 0 | 0 | -9 |
| 2 | 1000078 | 1000078 | 0 | 0 | 0 | -9 |
| 3 | 1000081 | 1000081 | 0 | 0 | 0 | -9 |
| 4 | 1000198 | 1000198 | 0 | 0 | 0 | -9 |
| 5 | 1000210 | 1000210 | 0 | 0 | 0 | -9 |
| 6 | 1000236 | 1000236 | 0 | 0 | 0 | -9 |


In [39]:
pheno <- read.table("~/UKBiobank/results/083021_PCA_results/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl.keep_id", header=F)
colnames(pheno) <- c('FID','IID')
head(pheno)

|   | FID | IID |
|---|---|---|
|   | `<int>` | `<int>` |
| 1 | 1000198 | 1000198 |
| 2 | 1000396 | 1000396 |
| 3 | 1000494 | 1000494 |
| 4 | 1001076 | 1001076 |
| 5 | 1001123 | 1001123 |
| 6 | 1001196 | 1001196 |


In [54]:
pheno_white_unrel <- pheno[which(pheno$IID %in% white_unrel$IID),]
dim(pheno_white_unrel)

In [55]:
# Get a subset of 50K individuals
subset_ind <- pheno_white_unrel[sample(nrow(pheno_white_unrel), 50000), ]
nrow(subset_ind)

In [56]:
keep_samples <- subset_ind[, c('FID', 'IID')]

In [57]:
head(keep_samples)

|   | FID | IID |
|---|---|---|
|   | `<int>` | `<int>` |
| 13757 | 2524692 | 2524692 |
| 101913 | 3931896 | 3931896 |
| 11955 | 2318742 | 2318742 |
| 100700 | 3867865 | 3867865 |
| 135518 | 5684544 | 5684544 |
| 119426 | 4842102 | 4842102 |


In [58]:
# Write to a file for the 50K individuals
write.table(keep_samples, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl.50Ksubset.keep_id", sep="\t", row.names = FALSE, col.names =FALSE, quote=FALSE)

In [59]:
# Get a subset of 80K individuals
subset_ind <- pheno_white_unrel[sample(nrow(pheno_white_unrel), 80000), ]
nrow(subset_ind)

In [60]:
# Write to a file for the 80K individuals (refresh keep_samples so it holds the 80K draw, not the earlier 50K one)
keep_samples <- subset_ind[, c('FID', 'IID')]
write.table(keep_samples, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl.80Ksubset.keep_id", sep="\t", row.names = FALSE, col.names = FALSE, quote=FALSE)
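
The subsets above are drawn with `sample()` and no seed is shown, so rerunning the cells would select different individuals. A minimal sketch of a repeatable draw from the command line, assuming a two-column FID/IID list: each line gets a pseudo-random key from awk's `rand()` seeded with a fixed value, the lines are sorted by that key, and the first N are kept. The function name `seeded_subset`, the seed and the toy IDs are made up for illustration. The draw is repeatable for a given awk implementation but may differ between implementations.

```shell
# Draw n lines from infile using a fixed seed so the draw can be repeated
seeded_subset() {
    local infile=$1 n=$2 seed=$3
    awk -v seed="$seed" 'BEGIN { srand(seed) } { print rand() "\t" $0 }' "$infile" \
        | LC_ALL=C sort -k1,1g | head -n "$n" | cut -f2-
}

tmpdir=$(mktemp -d)
# Toy ID list: 100 hypothetical individuals with FID equal to IID
seq 1 100 | awk '{ print $1 "\t" $1 }' > "$tmpdir/toy_ids.txt"

seeded_subset "$tmpdir/toy_ids.txt" 10 42 > "$tmpdir/subset_a.txt"
seeded_subset "$tmpdir/toy_ids.txt" 10 42 > "$tmpdir/subset_b.txt"

n_rows=$(wc -l < "$tmpdir/subset_a.txt" | tr -d ' ')
n_uniq=$(sort -u "$tmpdir/subset_a.txt" | wc -l | tr -d ' ')
# Two draws with the same seed should be identical
same=$(cmp -s "$tmpdir/subset_a.txt" "$tmpdir/subset_b.txt" && echo yes || echo no)
echo "rows: $n_rows, unique: $n_uniq, repeatable: $same"
rm -rf "$tmpdir"
```

In R, the equivalent step would be to call `set.seed()` with a fixed value before `sample()`.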

### Calculate the PCs

#### Step 1

In [61]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2247
gwas_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2247/common_f2247_pca_50K_$(date +"%Y-%m-%d").sbatch
## Use the QC'ed, LD-pruned common-variant genotype file
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_ldprun_common/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.filtered.prune.bed
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl.50Ksubset.keep_id

# GWAS QC variables: set all of these to 0 to skip further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2247/common_f2247_pca_50K_2022-06-13.sbatch
INFO: Workflow csg (ID=wcc8a9966b0e6048c) is executed successfully with 1 completed step.



#### Step 2

In [1]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2247
#This is the bfile obtained in step 1
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2247/*.bed
# Format FID, IID, pop
# PCs are calculated for the 50K white European, unrelated individuals
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/083021_PCA_results/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2247/flashpca_f2247_50Kcommon_$(date +"%Y-%m-%d").sbatch
k=10
min_axis=0
max_axis=0
pca_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/flashpcaR.sif
numThreads=1

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --container $container
"""

sos run  ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2247/flashpca_f2247_50Kcommon_2022-06-14.sbatch
INFO: Workflow csg (ID=wc240fcce5c26454f) is executed successfully with 1 completed step.



### Merge the phenotype file with the PCs

In [27]:
pheno <- read.table("~/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl", header=T)
head(pheno)
dim(pheno)

|   | FID | IID | sex | f2247 | age |
|---|---|---|---|---|---|
|   | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 1000198 | 1000198 | 1 | 1 | 41 |
| 2 | 1000396 | 1000396 | 0 | 1 | 48 |
| 3 | 1000494 | 1000494 | 0 | 1 | 61 |
| 4 | 1001076 | 1001076 | 0 | 1 | 69 |
| 5 | 1001123 | 1001123 | 1 | 1 | 62 |
| 6 | 1001196 | 1001196 | 0 | 1 | 60 |


In [28]:
pca <- read.table("~/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2247/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl.pca.txt", header=T)
head(pca)
dim(pca)

|   | ID | FID | IID | ethnicity | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | `<fct>` | `<int>` | `<int>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1000078:1000078 | 1000078 | 1000078 | British | 0.0001682157 | -0.010868328 | 0.013926516 | 0.007888838 | 0.005840653 | -0.004907483 | 0.003455339 | -0.002962793 | -0.008213459 | 0.011089191 |
| 2 | 1000081:1000081 | 1000081 | 1000081 | British | -0.1377972901 | 0.019088647 | -0.01236802 | -0.013887942 | 0.0066285082 | -0.012289974 | -0.002614088 | -0.002112217 | 0.010352732 | 0.001305127 |
| 3 | 1000236:1000236 | 1000236 | 1000236 | British | -0.009666206 | -0.008954983 | 0.026675716 | -0.005429788 | -0.0057268152 | 0.012727944 | 0.015520513 | 0.001436914 | -0.009764744 | 0.032244368 |
| 4 | 1000340:1000340 | 1000340 | 1000340 | British | 0.0136983803 | 0.004912169 | 0.010146371 | -0.002739609 | -0.0001738624 | -0.008639668 | -0.004547821 | -0.00466864 | 0.003564449 | 0.005583519 |
| 5 | 1000415:1000415 | 1000415 | 1000415 | Irish | 0.017182761 | 0.05020254 | -0.001868328 | 0.003220866 | -0.0153254036 | -0.00936021 | 0.003488297 | 0.026279814 | 0.001351374 | -0.002582746 |
| 6 | 1000674:1000674 | 1000674 | 1000674 | British | 0.0125119033 | -0.023940619 | 0.016911804 | -0.002264127 | 0.0002390969 | 0.011951628 | -0.01542622 | -0.004994943 | 0.003275061 | -0.013175549 |


In [29]:
f2247 <- merge(pheno, pca, by = c('FID', 'IID'), all.y=TRUE)
head(f2247)
dim(f2247)
library(dplyr)
f2247_final <- select(f2247, -c('ID', 'ethnicity'))
head(f2247_final)

|   | FID | IID | sex | f2247 | age | ID | ethnicity | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1000078 | 1000078 | 1 | 0 | 60 | 1000078:1000078 | British | 0.0001682157 | -0.010868328 | 0.013926516 | 0.007888838 | 0.005840653 | -0.004907483 | 0.003455339 | -0.002962793 | -0.008213459 | 0.011089191 |
| 2 | 1000081 | 1000081 | 0 | 0 | 67 | 1000081:1000081 | British | -0.1377972901 | 0.019088647 | -0.01236802 | -0.013887942 | 0.0066285082 | -0.012289974 | -0.002614088 | -0.002112217 | 0.010352732 | 0.001305127 |
| 3 | 1000236 | 1000236 | 0 | 0 | 70 | 1000236:1000236 | British | -0.009666206 | -0.008954983 | 0.026675716 | -0.005429788 | -0.0057268152 | 0.012727944 | 0.015520513 | 0.001436914 | -0.009764744 | 0.032244368 |
| 4 | 1000340 | 1000340 | 1 | 0 | 54 | 1000340:1000340 | British | 0.0136983803 | 0.004912169 | 0.010146371 | -0.002739609 | -0.0001738624 | -0.008639668 | -0.004547821 | -0.00466864 | 0.003564449 | 0.005583519 |
| 5 | 1000415 | 1000415 | 0 | 0 | 65 | 1000415:1000415 | Irish | 0.017182761 | 0.05020254 | -0.001868328 | 0.003220866 | -0.0153254036 | -0.00936021 | 0.003488297 | 0.026279814 | 0.001351374 | -0.002582746 |
| 6 | 1000674 | 1000674 | 0 | 0 | 41 | 1000674:1000674 | British | 0.0125119033 | -0.023940619 | 0.016911804 | -0.002264127 | 0.0002390969 | 0.011951628 | -0.01542622 | -0.004994943 | 0.003275061 | -0.013175549 |



|   | FID | IID | sex | f2247 | age | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1000078 | 1000078 | 1 | 0 | 60 | 0.0001682157 | -0.010868328 | 0.013926516 | 0.007888838 | 0.005840653 | -0.004907483 | 0.003455339 | -0.002962793 | -0.008213459 | 0.011089191 |
| 2 | 1000081 | 1000081 | 0 | 0 | 67 | -0.1377972901 | 0.019088647 | -0.01236802 | -0.013887942 | 0.0066285082 | -0.012289974 | -0.002614088 | -0.002112217 | 0.010352732 | 0.001305127 |
| 3 | 1000236 | 1000236 | 0 | 0 | 70 | -0.009666206 | -0.008954983 | 0.026675716 | -0.005429788 | -0.0057268152 | 0.012727944 | 0.015520513 | 0.001436914 | -0.009764744 | 0.032244368 |
| 4 | 1000340 | 1000340 | 1 | 0 | 54 | 0.0136983803 | 0.004912169 | 0.010146371 | -0.002739609 | -0.0001738624 | -0.008639668 | -0.004547821 | -0.00466864 | 0.003564449 | 0.005583519 |
| 5 | 1000415 | 1000415 | 0 | 0 | 65 | 0.017182761 | 0.05020254 | -0.001868328 | 0.003220866 | -0.0153254036 | -0.00936021 | 0.003488297 | 0.026279814 | 0.001351374 | -0.002582746 |
| 6 | 1000674 | 1000674 | 0 | 0 | 41 | 0.0125119033 | -0.023940619 | 0.016911804 | -0.002264127 | 0.0002390969 | 0.011951628 | -0.01542622 | -0.004994943 | 0.003275061 | -0.013175549 |


In [30]:
# Missing phenotypes appear as NA after the merge, so test with is.na() rather than comparing to ""
which(is.na(f2247_final$f2247))
sum(is.na(f2247_final$f2247))

In [31]:
write.table(f2247_final, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl.common50K_PC1_10.greml_pheno", sep="\t", row.names = FALSE, col.names =TRUE, quote=FALSE)
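
Because the merge keeps every row of the PCA table (`all.y=TRUE`), any individual without a phenotype row would carry `NA` in the written file. A small sketch of a check on the tab-separated output, run here on a toy file: the helper name `count_missing` and the toy rows are made up for illustration, and it assumes missing values are written as `NA` (the `write.table` default) or left empty.

```shell
# Count rows whose value in a named column is NA or empty (tab-separated file with header)
count_missing() {
    local file=$1 col=$2
    awk -F'\t' -v col="$col" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { bad = 1; exit 1 }   # column name not found in the header
            next
        }
        $c == "NA" || $c == "" { n++ }
        END { if (!bad) print n + 0 }
    ' "$file"
}

tmpdir=$(mktemp -d)
# Toy merged table (hypothetical rows): individual 2 has PCs but no phenotype
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    FID IID sex f2247 age PC1 \
    1 1 1 0 60 0.01 \
    2 2 NA NA NA -0.02 \
    3 3 0 1 55 0.00 > "$tmpdir/toy.greml_pheno"

miss_pheno=$(count_missing "$tmpdir/toy.greml_pheno" f2247)
miss_pc1=$(count_missing "$tmpdir/toy.greml_pheno" PC1)
echo "missing f2247: $miss_pheno, missing PC1: $miss_pc1"
rm -rf "$tmpdir"
```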

### Recalculate GRM for 50K individuals f2247

In [52]:
bfile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.bed
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/f2247
numThreads=20
gcta_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/f2247/common_f2247_eur_unrel_gtca_50K_$(date +"%Y-%m-%d").sbatch
gcta_sos=~/project/bioworkflows/GWAS/LMM.ipynb
tpl_file=~/project/bioworkflows/admin/csg2.yml
container=~/containers/lmm.sif
# The phenotype file is passed only because the LMM workflow requires it; it is not used for the GRM calculation
phenoFile=/mnt/vast/hpc/csg/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl
phenoCol=f2247
covarCol=sex
qCovarCol="age"
mem='8G'
walltime='48h'
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl.50Ksubset.keep_id

gcta_args="""gcta
    --cwd $cwd
    --bfile $bfile
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --numThreads $numThreads
    --keep_samples $keep_samples
    --container_lmm $container
    --walltime $walltime
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gcta_sos \
    --to-script $gcta_sbatch \
    --args "$gcta_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/f2247/common_f2247_eur_unrel_gtca_50K_2022-06-14.sbatch
INFO: Workflow csg (ID=wd9a1e9b71fa79687) is executed successfully with 1 completed step.



### Run GREML for common variants with 50K individuals 

In [56]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060922_greml_f2247_common50K_prev0.32
# GRM recalculated for the f2247 50K subset
grm=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/f2247/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.grm.bin
# Phenotype file with the phenotype and covariate columns
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl.common50K_PC1_10.greml_pheno
greml_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060922_greml_f2247_common50K_prev0.32/f2247_greml_common_50K_$(date +"%Y-%m-%d").sbatch
greml_sos=~/project/UKBB_GWAS_dev/workflow/GREML.ipynb
tpl_file=~/project/bioworkflows/admin/csg2.yml
container=~/containers/lmm.sif
phenoCol=f2247
covarCol=sex
qCovarCol=`echo age PC{1..10}`
prevalence=0.32
mem="100G"

greml_args="""greml
    --cwd $cwd
    --grm $grm
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --prevalence $prevalence
    --mem $mem
    --container $container   
"""

sos run  ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $greml_sos \
    --to-script $greml_sbatch \
    --args "$greml_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/060922_greml_f2247_common50K_prev0.32/f2247_greml_common_50K_2022-06-14.sbatch
INFO: Workflow csg (ID=w8d10416ba6431b9b) is executed successfully with 1 completed step.



## f2257 common variants

### Get the 50K and 80K random samples for f2257

In [2]:
# This file contains the list of individuals that are white European and unrelated which was used to build the GRM
white_unrel <-  read.table("~/UKBiobank/results/ARHI_heritability/052022_ldprun_common/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.filtered.prune.fam", header=F)
colnames(white_unrel)<- c('FID','IID')
head(white_unrel)
dim(white_unrel)

|   | FID | IID | NA | NA | NA | NA |
|---|---|---|---|---|---|---|
|   | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 1000019 | 1000019 | 0 | 0 | 0 | -9 |
| 2 | 1000078 | 1000078 | 0 | 0 | 0 | -9 |
| 3 | 1000081 | 1000081 | 0 | 0 | 0 | -9 |
| 4 | 1000198 | 1000198 | 0 | 0 | 0 | -9 |
| 5 | 1000210 | 1000210 | 0 | 0 | 0 | -9 |
| 6 | 1000236 | 1000236 | 0 | 0 | 0 | -9 |


In [5]:
pheno <- read.table("~/UKBiobank/results/083021_PCA_results/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl.keep_id", header=F)
colnames(pheno) <- c('FID','IID')
head(pheno)
dim(pheno)

|   | FID | IID |
|---|---|---|
|   | `<int>` | `<int>` |
| 1 | 1000035 | 1000035 |
| 2 | 1000198 | 1000198 |
| 3 | 1000304 | 1000304 |
| 4 | 1000396 | 1000396 |
| 5 | 1000494 | 1000494 |
| 6 | 1000551 | 1000551 |


In [4]:
pheno_white_unrel <- pheno[which(pheno$IID %in% white_unrel$IID),]
dim(pheno_white_unrel)

In [6]:
# Get a subset of 50K individuals
subset_ind <- pheno_white_unrel[sample(nrow(pheno_white_unrel), 50000), ]
nrow(subset_ind)

In [7]:
keep_samples <- subset_ind[, c('FID', 'IID')]

In [8]:
head(keep_samples)

|   | FID | IID |
|---|---|---|
|   | `<int>` | `<int>` |
| 7532 | 1570915 | 1570915 |
| 136298 | 4670666 | 4670666 |
| 40482 | 4106711 | 4106711 |
| 6493 | 1488858 | 1488858 |
| 42342 | 4246213 | 4246213 |
| 80491 | 1772970 | 1772970 |


In [9]:
# Write to a file for the 50K individuals
write.table(keep_samples, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl.50Ksubset.keep_id", sep="\t", row.names = FALSE, col.names =FALSE, quote=FALSE)

In [10]:
# Get a subset of 80K individuals
subset_ind <- pheno_white_unrel[sample(nrow(pheno_white_unrel), 80000), ]
nrow(subset_ind)

In [11]:
# Write to a file for the 80K individuals (refresh keep_samples so it holds the 80K draw, not the earlier 50K one)
keep_samples <- subset_ind[, c('FID', 'IID')]
write.table(keep_samples, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl.80Ksubset.keep_id", sep="\t", row.names = FALSE, col.names = FALSE, quote=FALSE)

#### PCA Step 1 50K

In [12]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2257
gwas_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2257/common_f2257_pca_50K_$(date +"%Y-%m-%d").sbatch
## Use the QC'ed, LD-pruned common-variant genotype file
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_ldprun_common/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.filtered.prune.bed
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl.50Ksubset.keep_id

# GWAS QC variables: set all of these to 0 to skip any additional filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2257/common_f2257_pca_50K_2022-06-14.sbatch
INFO: Workflow csg (ID=w6536308cf565914b) is executed successfully with 1 completed step.



#### PCA Step 2 50K

In [25]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2257
#This is the bfile obtained in step 1
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2257/*.bed
# Format FID, IID, pop
# The PCs are calculated only for the 50K white European, unrelated individuals kept in step 1
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/083021_PCA_results/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2257/flashpca_f2257_50Kcommon_$(date +"%Y-%m-%d").sbatch
k=10
min_axis=0
max_axis=0
pca_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/flashpcaR.sif
numThreads=1

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --container $container
"""

sos run  ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2257/flashpca_f2257_50Kcommon_2022-06-14.sbatch
INFO: Workflow csg (ID=wc75cdd85d4690081) is executed successfully with 1 completed step.



### Merge the phenotype file with the PCs (50K)

In [33]:
pheno <- read.table("~/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl", header=T)
head(pheno)
dim(pheno)

Unnamed: 0_level_0,FID,IID,sex,f2257,age
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>
1,1000035,1000035,0,1,63
2,1000198,1000198,1,1,41
3,1000304,1000304,1,1,56
4,1000396,1000396,0,1,48
5,1000494,1000494,0,1,61
6,1000551,1000551,0,1,68


In [34]:
pca <- read.table("~/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/f2257/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl.pca.txt", header=T)
head(pca)
dim(pca)

Unnamed: 0_level_0,ID,FID,IID,ethnicity,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<fct>,<int>,<int>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000198:1000198,1000198,1000198,British,0.008703159,0.016927991,-0.0142239732,-0.0091254594,-0.007455701,-0.016742283,-0.00915246,-0.0010389604,-7.670805e-05,-0.014548517
2,1000236:1000236,1000236,1000236,British,-0.008314325,-0.010150314,-0.0249975681,-0.0002223465,0.001298904,-0.02200564,-0.00473087,-0.0049540757,0.009384657,0.029767361
3,1000331:1000331,1000331,1000331,Any_other_white_background,-0.082581777,0.004069695,-0.0050828535,0.0033783106,-0.001947084,-0.006427138,0.036089759,-0.0005694666,0.008023191,0.010743691
4,1000340:1000340,1000340,1000340,British,0.012075342,0.00059985,-0.0105988782,0.0007342952,0.004920901,0.006027913,0.002311576,0.0063891532,0.003816363,0.006731609
5,1000396:1000396,1000396,1000396,British,0.020225342,0.033027462,0.0202408038,-0.0094411387,0.022433478,-0.009788024,0.013531119,0.0006526454,0.02029744,-0.005424345
6,1000415:1000415,1000415,1000415,Irish,0.017341955,0.05373382,0.0007680537,0.0023536418,-0.01811991,0.004422258,0.0014442,-0.0249729215,-0.002747917,-0.005898875


In [35]:
f2257 <- merge(pheno, pca, by = c('FID', 'IID'), all.y=TRUE)
head(f2257)
dim(f2257)
library(dplyr)
f2257_final <- select(f2257, -c('ID', 'ethnicity'))
head(f2257_final)

Unnamed: 0_level_0,FID,IID,sex,f2257,age,ID,ethnicity,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000198,1000198,1,1,41,1000198:1000198,British,0.008703159,0.016927991,-0.0142239732,-0.0091254594,-0.007455701,-0.016742283,-0.00915246,-0.0010389604,-7.670805e-05,-0.014548517
2,1000236,1000236,0,0,70,1000236:1000236,British,-0.008314325,-0.010150314,-0.0249975681,-0.0002223465,0.001298904,-0.02200564,-0.00473087,-0.0049540757,0.009384657,0.029767361
3,1000331,1000331,1,0,53,1000331:1000331,Any_other_white_background,-0.082581777,0.004069695,-0.0050828535,0.0033783106,-0.001947084,-0.006427138,0.036089759,-0.0005694666,0.008023191,0.010743691
4,1000340,1000340,1,0,54,1000340:1000340,British,0.012075342,0.00059985,-0.0105988782,0.0007342952,0.004920901,0.006027913,0.002311576,0.0063891532,0.003816363,0.006731609
5,1000396,1000396,0,1,48,1000396:1000396,British,0.020225342,0.033027462,0.0202408038,-0.0094411387,0.022433478,-0.009788024,0.013531119,0.0006526454,0.02029744,-0.005424345
6,1000415,1000415,0,0,65,1000415:1000415,Irish,0.017341955,0.05373382,0.0007680537,0.0023536418,-0.01811991,0.004422258,0.0014442,-0.0249729215,-0.002747917,-0.005898875


Unnamed: 0_level_0,FID,IID,sex,f2257,age,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000198,1000198,1,1,41,0.008703159,0.016927991,-0.0142239732,-0.0091254594,-0.007455701,-0.016742283,-0.00915246,-0.0010389604,-7.670805e-05,-0.014548517
2,1000236,1000236,0,0,70,-0.008314325,-0.010150314,-0.0249975681,-0.0002223465,0.001298904,-0.02200564,-0.00473087,-0.0049540757,0.009384657,0.029767361
3,1000331,1000331,1,0,53,-0.082581777,0.004069695,-0.0050828535,0.0033783106,-0.001947084,-0.006427138,0.036089759,-0.0005694666,0.008023191,0.010743691
4,1000340,1000340,1,0,54,0.012075342,0.00059985,-0.0105988782,0.0007342952,0.004920901,0.006027913,0.002311576,0.0063891532,0.003816363,0.006731609
5,1000396,1000396,0,1,48,0.020225342,0.033027462,0.0202408038,-0.0094411387,0.022433478,-0.009788024,0.013531119,0.0006526454,0.02029744,-0.005424345
6,1000415,1000415,0,0,65,0.017341955,0.05373382,0.0007680537,0.0023536418,-0.01811991,0.004422258,0.0014442,-0.0249729215,-0.002747917,-0.005898875


In [36]:
write.table(f2257_final, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl.common50K_PC1_10.greml_pheno", sep="\t", row.names = FALSE, col.names =TRUE, quote=FALSE)
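
Because the merge uses `all.y=TRUE`, any individual present in the PCA output but absent from the phenotype file would be written with `NA` in `sex`, the trait column and `age`. A minimal sketch of a check on the written file is shown below; `count_na_rows` is a hypothetical helper, the toy file is made up, and the column names are taken from the table above.

```shell
# Sketch: confirm that a tab-delimited phenotype file has the requested columns
# and count the rows in which any of those columns is NA.
# count_na_rows is a hypothetical helper; arguments are the file, then column names.
count_na_rows() {
    pheno_file=$1
    shift
    awk -F '\t' -v cols="$*" '
        NR == 1 {
            n = split(cols, want, " ")
            # Map each header name to its column position
            for (i = 1; i <= NF; i++) idx[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(want[j] in idx)) { print "missing column: " want[j]; bad = 1; exit }
            }
            next
        }
        {
            # Count the row once if any requested column is NA
            for (j = 1; j <= n; j++) {
                if ($(idx[want[j]]) == "NA") { na++; break }
            }
        }
        END { if (!bad) print na + 0 }
    ' "$pheno_file"
}

# Toy demonstration: two individuals, one with a missing trait value
toy_pheno=$(mktemp)
printf 'FID\tIID\tsex\tf2257\tage\tPC1\n' > "$toy_pheno"
printf '1\t1\t0\t1\t63\t0.01\n' >> "$toy_pheno"
printf '2\t2\t1\tNA\t41\t-0.02\n' >> "$toy_pheno"
na_rows=$(count_na_rows "$toy_pheno" sex f2257 age PC1)
missing_msg=$(count_na_rows "$toy_pheno" sex PC2)
echo "$na_rows"
echo "$missing_msg"
rm -f "$toy_pheno"
```

A result of `0` on the real `.greml_pheno` file would indicate that every individual with PCs also has a phenotype and covariates.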

### Recalculate GRM for 50K individuals for f2257

In [50]:
bfile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.bed
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/f2257
numThreads=20
gcta_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/f2257/common_f2257_eur_unrel_gtca_50K_$(date +"%Y-%m-%d").sbatch
gcta_sos=~/project/bioworkflows/GWAS/LMM.ipynb
tpl_file=~/project/bioworkflows/admin/csg2.yml
container=~/containers/lmm.sif
# The phenotype file is passed only because the LMM workflow requires it; it is not used in the GRM calculation
phenoFile=/mnt/vast/hpc/csg/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl
phenoCol=f2257
covarCol=sex
qCovarCol="age PC1 PC2"
mem='8G'
walltime='48h'
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl.50Ksubset.keep_id

gcta_args="""gcta
    --cwd $cwd
    --bfile $bfile
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --numThreads $numThreads
    --keep_samples $keep_samples
    --container_lmm $container
    --walltime $walltime
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gcta_sos \
    --to-script $gcta_sbatch \
    --args "$gcta_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/f2257/common_f2257_eur_unrel_gtca_50K_2022-06-14.sbatch
INFO: Workflow csg (ID=wddcc7cf05eaa282f) is executed successfully with 1 completed step.
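
Once the GRM job has finished, it may be worth confirming that the three files GCTA writes for a binary GRM (`.grm.bin`, `.grm.N.bin`, `.grm.id`) exist and that `.grm.id` lists the expected number of individuals. The sketch below is an assumption about how such a check could look; `check_grm` is a hypothetical helper and the toy files only stand in for real GRM output.

```shell
# Sketch: check that a GCTA binary GRM is complete and has the expected sample count.
# check_grm is a hypothetical helper; it takes the GRM prefix (without .grm.bin)
# and the expected number of individuals.
check_grm() {
    grm_prefix=$1
    expected_n=$2
    for ext in grm.bin grm.N.bin grm.id; do
        # -s is true only if the file exists and is not empty
        if [ ! -s "$grm_prefix.$ext" ]; then
            echo "missing or empty: $grm_prefix.$ext"
            return 1
        fi
    done
    n_ids=$(wc -l < "$grm_prefix.grm.id" | tr -d ' ')
    if [ "$n_ids" -ne "$expected_n" ]; then
        echo "unexpected sample count: $n_ids (expected $expected_n)"
        return 1
    fi
    echo "OK"
}

# Toy demonstration with placeholder files for two individuals
toy_dir=$(mktemp -d)
printf 'x' > "$toy_dir/toy.grm.bin"
printf 'x' > "$toy_dir/toy.grm.N.bin"
printf '1\t1\n2\t2\n' > "$toy_dir/toy.grm.id"
grm_ok=$(check_grm "$toy_dir/toy" 2)
grm_bad=$(check_grm "$toy_dir/toy" 3 || true)
echo "$grm_ok"
echo "$grm_bad"
rm -rf "$toy_dir"
```

For the 50K run the expected count would be 50000, provided every individual in the keep file is present in the bfile.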



### Run GREML with common variants for 50K individuals

In [54]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060922_greml_f2257_common50K_prev0.40
# GRM computed in the previous step for the 50K individuals
grm=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/f2257/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.grm.bin
# Phenotype file written above: FID, IID, sex, f2257, age, PC1-PC10
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl.common50K_PC1_10.greml_pheno
greml_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060922_greml_f2257_common50K_prev0.40/f2257_greml_common50K_$(date +"%Y-%m-%d").sbatch
greml_sos=~/project/UKBB_GWAS_dev/workflow/GREML.ipynb
tpl_file=~/project/bioworkflows/admin/csg2.yml
container=~/containers/lmm.sif
phenoCol=f2257
covarCol=sex
qCovarCol=`echo age PC{1..10}`
prevalence=0.40
mem="100G"
greml_args="""greml
    --cwd $cwd
    --grm $grm
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --prevalence $prevalence
    --mem $mem
    --container $container   
"""

sos run  ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $greml_sos \
    --to-script $greml_sbatch \
    --args "$greml_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/060922_greml_f2257_common50K_prev0.40/f2257_greml_common50K_2022-06-14.sbatch
INFO: Workflow csg (ID=wd29319332cee05f9) is executed successfully with 1 completed step.
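
The `prevalence` value matters because GREML on a case-control trait estimates heritability on the observed (0/1) scale, and a population prevalence is needed to express it on the liability scale. Assuming the workflow forwards `prevalence` to GCTA's `--prevalence` option (the contents of `GREML.ipynb` are not shown here, so this is an assumption), the standard conversion is:

$$
h^2_{L} = h^2_{O}\,\frac{K^2 (1-K)^2}{z^2\, P (1-P)}, \qquad z = \phi\!\left(\Phi^{-1}(1-K)\right)
$$

Here $h^2_{O}$ is the observed-scale estimate, $K$ is the assumed population prevalence (0.40 for H-noise in this run), $P$ is the proportion of cases among the individuals analyzed, $\phi$ is the standard normal density and $\Phi^{-1}$ is the standard normal quantile function. For $K = 0.40$ the threshold is $\Phi^{-1}(0.60) \approx 0.253$, which gives $z \approx 0.386$. The symbols $K$, $P$ and $z$ are introduced here for illustration and do not appear elsewhere in this notebook.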



## H-both f2247 & f2257 common variants

### Get the 50K and 80K random samples for the combined trait

In [14]:
# This file contains the list of individuals that are white European and unrelated which was used to build the GRM
white_unrel <-  read.table("~/UKBiobank/results/ARHI_heritability/052022_ldprun_common/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.filtered.prune.fam", header=F)
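# Only the first two .fam columns (FID, IID) are named; the remaining four are unused and show up as NA
colnames(white_unrel)<- c('FID','IID')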
head(white_unrel)
dim(white_unrel)

Unnamed: 0_level_0,FID,IID,NA,NA,NA,NA
Unnamed: 0_level_1,<int>,<int>,<int>,<int>.1,<int>.2,<int>.3
1,1000019,1000019,0,0,0,-9
2,1000078,1000078,0,0,0,-9
3,1000081,1000081,0,0,0,-9
4,1000198,1000198,0,0,0,-9
5,1000210,1000210,0,0,0,-9
6,1000236,1000236,0,0,0,-9


In [15]:
pheno <- read.table("~/UKBiobank/results/083021_PCA_results/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl.keep_id", header=F)
colnames(pheno) <- c('FID','IID')
head(pheno)
dim(pheno)

Unnamed: 0_level_0,FID,IID
Unnamed: 0_level_1,<int>,<int>
1,1000198,1000198
2,1000396,1000396
3,1000494,1000494
4,1001076,1001076
5,1001123,1001123
6,1001196,1001196


In [16]:
pheno_white_unrel <- pheno[which(pheno$IID %in% white_unrel$IID),]
dim(pheno_white_unrel)
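
The same overlap can be cross-checked outside R directly on the files. The sketch below counts how many individuals in a keep file also appear in a `.fam` file; `count_overlap` is a hypothetical helper and the toy files are made up. Unlike the R code above, which matches on `IID` only, it matches on the `FID`/`IID` pair; the two are equivalent when `FID` equals `IID`, as in the tables shown here.

```shell
# Sketch: count individuals in a keep file (FID IID) that are also in a .fam file.
# count_overlap is a hypothetical helper, not part of the workflows above.
count_overlap() {
    fam_file=$1
    keep_file=$2
    # First pass stores FID/IID pairs from the .fam; second pass counts matches in the keep file
    awk 'NR == FNR { in_fam[$1 FS $2]; next }
         ($1 FS $2) in in_fam { n++ }
         END { print n + 0 }' "$fam_file" "$keep_file"
}

# Toy demonstration: three individuals in the .fam, two of them in the keep file
toy_fam=$(mktemp)
toy_ids=$(mktemp)
printf '1 1 0 0 0 -9\n2 2 0 0 0 -9\n3 3 0 0 0 -9\n' > "$toy_fam"
printf '2\t2\n3\t3\n4\t4\n' > "$toy_ids"
n_overlap=$(count_overlap "$toy_fam" "$toy_ids")
echo "$n_overlap"
rm -f "$toy_fam" "$toy_ids"
```

The count should agree with the first element of `dim(pheno_white_unrel)`, and it must be at least as large as the subset size requested from `sample()`.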

In [17]:
# Get a subset of 50K individuals
subset_ind <- pheno_white_unrel[sample(nrow(pheno_white_unrel), 50000), ]
nrow(subset_ind)

In [18]:
keep_samples <- subset_ind[, c('FID', 'IID')]

In [19]:
head(keep_samples)

Unnamed: 0_level_0,FID,IID
Unnamed: 0_level_1,<int>,<int>
108883,4662538,4662538
97299,4061891,4061891
10280,2341886,2341886
37369,5887215,5887215
69492,2611022,2611022
61252,2189392,2189392


In [20]:
# Write to a file for the 50K individuals
write.table(keep_samples, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl.50Ksubset.keep_id", sep="\t", row.names = FALSE, col.names =FALSE, quote=FALSE)

In [21]:
# Get a subset of 80K individuals
subset_ind <- pheno_white_unrel[sample(nrow(pheno_white_unrel), 80000), ]
nrow(subset_ind)

In [22]:
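# Write to a file for the 80K individuals
# keep_samples must be rebuilt from the 80K draw; otherwise the 50K IDs from In [18] are written again
keep_samples <- subset_ind[, c('FID', 'IID')]
write.table(keep_samples, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl.80Ksubset.keep_id", sep="\t", row.names = FALSE, col.names =FALSE, quote=FALSE)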

#### PCA Step 1 50K

In [23]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/combined
gwas_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/combined/common_combined_pca_50K_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052022_ldprun_common/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.filtered.prune.bed
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl.50Ksubset.keep_id

# GWAS QC variables: set all of these to 0 to skip any additional filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/combined/common_combined_pca_50K_2022-06-14.sbatch
INFO: Workflow csg (ID=w22bfc9f27d9bd16e) is executed successfully with 1 completed step.



#### PCA Step 2 50K

In [26]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/combined
#This is the bfile obtained in step 1
genoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/combined/*.bed
# Format FID, IID, pop
# The PCs are calculated only for the 50K white European, unrelated individuals kept in step 1
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/083021_PCA_results/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/combined/flashpca_combined_50Kcommon_$(date +"%Y-%m-%d").sbatch
k=10
min_axis=0
max_axis=0
pca_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/flashpcaR.sif
numThreads=1

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --container $container
"""

sos run  ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/combined/flashpca_combined_50Kcommon_2022-06-14.sbatch
INFO: Workflow csg (ID=w89763e40a9a75bda) is executed successfully with 1 completed step.



### Merge the phenotype file with the PCs (50K)

In [42]:
pheno <- read.table("~/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl", header=T)
head(pheno)
dim(pheno)

Unnamed: 0_level_0,FID,IID,sex,f2247_f2257,age
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>
1,1000198,1000198,1,1,41
2,1000396,1000396,0,1,48
3,1000494,1000494,0,1,61
4,1001076,1001076,0,1,69
5,1001123,1001123,1,1,62
6,1001196,1001196,0,1,60


In [43]:
pca <- read.table("~/UKBiobank/results/ARHI_heritability/061322_pca_common_50K/combined/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl.pca.txt", header=T)
head(pca)
dim(pca)

Unnamed: 0_level_0,ID,FID,IID,ethnicity,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<fct>,<int>,<int>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000081:1000081,1000081,1000081,British,0.1340994077,0.015762098,-0.012304563,-0.013436178,0.006676593,0.012044666,0.011305502,0.0052594946,-0.007379954,-0.0002535988
2,1000198:1000198,1000198,1000198,British,-0.0078943889,0.016439942,0.012239718,-0.010269737,-0.00627822,-0.017395729,-0.002464998,-0.0064620805,0.005931359,0.0059562117
3,1000331:1000331,1000331,1000331,Any_other_white_background,0.0808571833,0.004574555,0.001596097,0.004930975,0.001433612,-0.001703583,0.026327249,-0.0059474518,0.003164743,-0.0141142041
4,1000396:1000396,1000396,1000396,British,-0.0201384818,0.034682309,-0.019682605,-0.011249187,0.019086357,-0.001988183,0.017769778,-0.0132614443,0.023200946,-0.0009004311
5,1000439:1000439,1000439,1000439,British,-0.0008244728,-0.01567398,0.019379132,0.005052614,-0.001708328,-0.012906163,-0.003569291,-0.0027146156,0.004485849,0.0051293334
6,1000858:1000858,1000858,1000858,Any_other_white_background,0.1672089665,0.033710157,0.004041379,0.027007462,0.006825866,-0.008038509,-0.005028552,-0.0001475687,-0.016071285,-0.0027806141


In [44]:
hboth <- merge(pheno, pca, by = c('FID', 'IID'), all.y=TRUE)
head(hboth)
dim(hboth)
library(dplyr)
hboth_final <- select(hboth, -c('ID', 'ethnicity'))
head(hboth_final)

Unnamed: 0_level_0,FID,IID,sex,f2247_f2257,age,ID,ethnicity,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000081,1000081,0,0,67,1000081:1000081,British,0.1340994077,0.015762098,-0.012304563,-0.013436178,0.006676593,0.012044666,0.011305502,0.0052594946,-0.007379954,-0.0002535988
2,1000198,1000198,1,1,41,1000198:1000198,British,-0.0078943889,0.016439942,0.012239718,-0.010269737,-0.00627822,-0.017395729,-0.002464998,-0.0064620805,0.005931359,0.0059562117
3,1000331,1000331,1,0,53,1000331:1000331,Any_other_white_background,0.0808571833,0.004574555,0.001596097,0.004930975,0.001433612,-0.001703583,0.026327249,-0.0059474518,0.003164743,-0.0141142041
4,1000396,1000396,0,1,48,1000396:1000396,British,-0.0201384818,0.034682309,-0.019682605,-0.011249187,0.019086357,-0.001988183,0.017769778,-0.0132614443,0.023200946,-0.0009004311
5,1000439,1000439,1,0,59,1000439:1000439,British,-0.0008244728,-0.01567398,0.019379132,0.005052614,-0.001708328,-0.012906163,-0.003569291,-0.0027146156,0.004485849,0.0051293334
6,1000858,1000858,0,0,61,1000858:1000858,Any_other_white_background,0.1672089665,0.033710157,0.004041379,0.027007462,0.006825866,-0.008038509,-0.005028552,-0.0001475687,-0.016071285,-0.0027806141


Unnamed: 0_level_0,FID,IID,sex,f2247_f2257,age,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000081,1000081,0,0,67,0.1340994077,0.015762098,-0.012304563,-0.013436178,0.006676593,0.012044666,0.011305502,0.0052594946,-0.007379954,-0.0002535988
2,1000198,1000198,1,1,41,-0.0078943889,0.016439942,0.012239718,-0.010269737,-0.00627822,-0.017395729,-0.002464998,-0.0064620805,0.005931359,0.0059562117
3,1000331,1000331,1,0,53,0.0808571833,0.004574555,0.001596097,0.004930975,0.001433612,-0.001703583,0.026327249,-0.0059474518,0.003164743,-0.0141142041
4,1000396,1000396,0,1,48,-0.0201384818,0.034682309,-0.019682605,-0.011249187,0.019086357,-0.001988183,0.017769778,-0.0132614443,0.023200946,-0.0009004311
5,1000439,1000439,1,0,59,-0.0008244728,-0.01567398,0.019379132,0.005052614,-0.001708328,-0.012906163,-0.003569291,-0.0027146156,0.004485849,0.0051293334
6,1000858,1000858,0,0,61,0.1672089665,0.033710157,0.004041379,0.027007462,0.006825866,-0.008038509,-0.005028552,-0.0001475687,-0.016071285,-0.0027806141


In [45]:
write.table(hboth_final, "/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl.common50K_PC1_10.greml_pheno", sep="\t", row.names = FALSE, col.names =TRUE, quote=FALSE)

### Recalculate the GRM for the specific 50K individuals in the combined trait

In [48]:
bfile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.bed
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/combined
numThreads=20
gcta_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/combined/common_combined_eur_unrel_gtca_50K_$(date +"%Y-%m-%d").sbatch
gcta_sos=~/project/bioworkflows/GWAS/LMM.ipynb
tpl_file=~/project/bioworkflows/admin/csg2.yml
container=~/containers/lmm.sif
# The phenotype file is passed only because the LMM workflow requires it; it is not used in the GRM calculation
phenoFile=/mnt/vast/hpc/csg/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl
phenoCol=f2247_f2257
covarCol=sex
qCovarCol="age PC1 PC2"
mem='8G'
walltime='48h'
keep_samples=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl.50Ksubset.keep_id

gcta_args="""gcta
    --cwd $cwd
    --bfile $bfile
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --numThreads $numThreads
    --keep_samples $keep_samples
    --container_lmm $container
    --walltime $walltime
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gcta_sos \
    --to-script $gcta_sbatch \
    --args "$gcta_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/combined/common_combined_eur_unrel_gtca_50K_2022-06-14.sbatch
INFO: Workflow csg (ID=w1480acec28d452ea) is executed successfully with 1 completed step.



### Run GREML with common variants for 50K individuals

In [53]:
## Columbia's cluster
cwd=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060922_greml_combined_common50K_prev0.28
# GRM computed in the previous step for the 50K individuals
grm=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/052522_grm_common_50K/combined/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.grm.bin
# Phenotype file written above: FID, IID, sex, f2247_f2257, age, PC1-PC10
phenoFile=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl.common50K_PC1_10.greml_pheno
greml_sbatch=/mnt/vast/hpc/csg/UKBiobank/results/ARHI_heritability/060922_greml_combined_common50K_prev0.28/combined_greml_common50K_$(date +"%Y-%m-%d").sbatch
greml_sos=~/project/UKBB_GWAS_dev/workflow/GREML.ipynb
tpl_file=~/project/bioworkflows/admin/csg2.yml
container=~/containers/lmm.sif
phenoCol=f2247_f2257
covarCol=sex
qCovarCol=`echo age PC{1..10}`
prevalence=0.28
mem="100G"

greml_args="""greml
    --cwd $cwd
    --grm $grm
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --prevalence $prevalence
    --mem $mem
    --container $container   
"""

sos run  ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $greml_sos \
    --to-script $greml_sbatch \
    --args "$greml_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/060922_greml_combined_common50K_prev0.28/combined_greml_common50K_2022-06-14.sbatch
INFO: Workflow csg (ID=wc8b077166fde3e32) is executed successfully with 1 completed step.

