# Estimate the rare variant heritability of age-related hearing impairment traits

# Aim


# Concepts


Proportion of phenotypic variance captured by common SNPs - SNP-based heritability (h^2SNP)

Discrepancy between h^2ped and h^2SNP:

1. Causal variants are not well tagged by common SNPs because they are rare
2. Pedigree heritability is overestimated because of confounding with environmental effects or non-additive genetic variation


## 1. Calculate the SNP-based heritability from common variants and compared to available literature

First of all a GRM from all of the autosomal SNPs needs to be calculated.


## Input files

### Phenotype files

These files have already been QC'ed to include individuals with each of the hearing impairment traits and control individuals without hearing impairment related phenotypes

Mega-sample 

H-aid:

* /mnt/mfs/statgen/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2.tsv

H-diff:

* /mnt/mfs/statgen/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl_PC1_2.tsv

H-noise:

* /mnt/mfs/statgen/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl_PC1_2.tsv

H-both:

* /mnt/mfs/statgen/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl_PC1_2.tsv


### Genotype files

Original exome sequence files in plink format are here: 
* /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed

QC done in these VCF files was: 

- DP-SPNs=10
- DP-indels=10
- GQ=20
- AB-SNP=0.15
- AB-indel=0.20
- geno=0.1

Samples with missingness >10% `-mind 0.1` in the genotype array

* ~/UKBiobank/data/exome_files/project_VCF/072721_run/merged_plink/mind_0.1/cache/ukb23155_qc_merged.mind_0.1.filtered.mindrem.id

Extra QC step is needed here to make sure we have the best quality variants for heritability calculation

### Selecting white European samples

According to Wainschtein et al 2022, they do two rounds of PC's calculations (20 PC's) one with common variants and one with rare variants. The prunning is also done using different parameters for each of these analyses

- Common: MAF 0.01-0.5, window 50Kb, r2=0.1
- Rare: MAF 0.004 (MAC=5) - 0.01, window 100Kb, r2=0.05

In our case, we will use our already defined white European population that was classified using the genotype array data with common variants, calculating 10 PC's and the manhalanobis distance to dected outliers. 

* /mnt/mfs/statgen/UKBiobank/results/083021_PCA_results/europeans/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.fam

### Removing related individuals 

In this step we just want to keep the unrelated European individuals for heritability calculations. We use a kinship=0.0625 (to remove related individuals up to third degree)

* remove_samples=/mnt/mfs/statgen/UKBiobank/results/083021_PCA_results/090221_king/*.related_id

### Remove individuals showing excess of heterozygosity based on GRM off-diagonal?

Don't know if this is necessary or not

Here we start with QC'ed exome sequence data but we will generate additional files with more stringent QC for heritability calculation

1. MAF keep all rare-variants (Wainschtein 2022 paper uses `--maf 0.0001` )
2. `--geno 0.05` (originally for our exome QC we used a `--geno 0.1`)
3. `--hwe 0.000001` 
4. `--mind 0.05` (originally we did not remove individuals based on mind for the exome QC)
5. `--snps_only` add this option to remove indels from calculation

### Build the GRM from the exome sequence data?

## Step 1. Extra QC on the exome data

### Select white European unrelated individuals

In [24]:
eur <- read.table("~/UKBiobank/results/083021_PCA_results/europeans/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.fam", header=F)
colnames(eur) <- c("FID", "IID","fatherid", "motherid", "sex", "pheno")
dim(eur)
head(eur)

Unnamed: 0_level_0,FID,IID,fatherid,motherid,sex,pheno
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>
1,1000019,1000019,0,0,2,-9
2,1000035,1000035,0,0,1,-9
3,1000078,1000078,0,0,2,-9
4,1000081,1000081,0,0,1,-9
5,1000198,1000198,0,0,2,-9
6,1000210,1000210,0,0,1,-9


In [30]:
unrel <- read.table("~/UKBiobank/results/083021_PCA_results/090221_ldprun_unrelated/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.090221_ldprun_unrelated.filtered.fam")
colnames(unrel) <- c("FID", "IID","fatherid", "motherid", "sex", "pheno")
dim(unrel)
head(unrel)

Unnamed: 0_level_0,FID,IID,fatherid,motherid,sex,pheno
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>
1,1000019,1000019,0,0,2,-9
2,1000078,1000078,0,0,2,-9
3,1000081,1000081,0,0,1,-9
4,1000198,1000198,0,0,2,-9
5,1000210,1000210,0,0,1,-9
6,1000236,1000236,0,0,1,-9


In [29]:
outlier <- read.table("~/UKBiobank/results/083021_PCA_results/090321_PCA_related_pval0.005/030821_ukb42495_exomed_white_189010ind.090321_PCA_related_pval0.005.pca.projected.outliers")
colnames(outlier) <- c("FID", "IID")
dim(outlier)
head(outlier)

Unnamed: 0_level_0,FID,IID
Unnamed: 0_level_1,<int>,<int>
1,1008606,1008606
2,1010412,1010412
3,1028129,1028129
4,1032822,1032822
5,1035752,1035752
6,1044288,1044288


Select white European individuals that are unrelated and remove ancestry outliers obtained by PCA calculation

In [36]:
eur_unrel <- unrel[-which(unrel$IID %in% outlier$IID), ]
eur_unrel_drop <- eur_unrel[ -c(3:6)]

In [38]:
head(eur_unrel_drop)
dim(eur_unrel_drop)

Unnamed: 0_level_0,FID,IID
Unnamed: 0_level_1,<int>,<int>
1,1000019,1000019
2,1000078,1000078
3,1000081,1000081
4,1000198,1000198
5,1000210,1000210
6,1000236,1000236


In [40]:
write.table(eur_unrel_drop, "/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/pheno_files/ukb42495_exomed_white_europeans_unrelated_no_outliers_167652.id", sep="\t", row.names = FALSE, col.names =FALSE, quote=FALSE)

### Apply HWE only on the subgroup of controls which are the same for our ARHI analyses

In [2]:
# Get the unrelated controls from out phenotype files
eur_unrel <-  read.table("/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/pheno_files/ukb42495_exomed_white_europeans_unrelated_no_outliers_167652.id", header=F)
colnames(eur_unrel) <- c("FID", "IID")
head(eur_unrel)

Unnamed: 0_level_0,FID,IID
Unnamed: 0_level_1,<int>,<int>
1,1000019,1000019
2,1000078,1000078
3,1000081,1000081
4,1000198,1000198
5,1000210,1000210
6,1000236,1000236


In [4]:
# Load the f3393 phenotype file to subset for the controls 
f3393 <- read.table("~/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl", header=T)
head(f3393)
nrow(f3393)

Unnamed: 0_level_0,FID,IID,sex,f3393,age
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>
1,1001384,1001384,1,1,61
2,1002548,1002548,0,1,62
3,1002888,1002888,0,1,68
4,1002944,1002944,0,1,65
5,1003258,1003258,0,1,74
6,1004843,1004843,0,1,64


In [7]:
library('dplyr')
f3393_ctrl <- f3393 %>% filter(f3393==0) %>%
          select('FID', 'IID')

In [13]:
f3393_ctrl_unrel <- f3393 %>% filter(IID %in% eur_unrel$IID) %>% select("FID", "IID")        
nrow(f3393_ctrl_unrel)
head(f3393_ctrl_unrel)

Unnamed: 0_level_0,FID,IID
Unnamed: 0_level_1,<int>,<int>
1,1001384,1001384
2,1002548,1002548
3,1002888,1002888
4,1002944,1002944
5,1003258,1003258
6,1004843,1004843


In [14]:
write.table(f3393_ctrl_unrel, "/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/pheno_files/ukb42495_exomed_white_europeans_unrelated_no_outliers_ctrl_ARHI_92040.id", sep="\t", row.names = FALSE, col.names =TRUE, quote=FALSE)

In [33]:
# Get the variants that pass HWE qc filter
## Select White European unrelated individuals 
## Do some extra QC on the exome data
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
cwd=$UKBB_PATH/results/ARHI_heritability/051922_white_eur_unrel
## Use the exome filtered file
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
## To keep only the samples of white Europeans unrelated individuals with outliers removed that are controls for ARHI samples so we can apply HWE
keep_samples=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/pheno_files/ukb42495_exomed_white_europeans_unrelated_no_outliers_ctrl_ARHI_92040.id
# Do not set a MAF filter, this will keep both common and rare variants
maf_filter=0 
# No geno filter in this subset of controls
geno_filter=0
# Set a HWE filter 1x10^-8
hwe_filter=0.00000001
# Do not set a sample missingness filter at this point, otherwise many samples would be removed
mind_filter=0
# Keep both SNPs and indels in the heritability calculation
snps_only=False
meta_only=True
other_args=""
gwas_sbatch=$UKBB_PATH/results/ARHI_heritability/051922_white_eur_unrel/hwe_eur_unrel_control_ARHI_$(date +"%Y-%m-%d").sbatch
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif
numThreads=20
job_size=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --meta_only $meta_only
    --other_args $other_args
    --numThreads $numThreads 
    --job_size $job_size
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"


INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m (index=0) is [32mignored[0m due to saved signature
INFO: [32mcsg[0m output:   [32m/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/hwe_eur_unrel_control_ARHI_2022-05-19.sbatch[0m
INFO: Workflow csg (ID=w72163a39fda9e30b) is ignored with 1 ignored step.



In [18]:
# Count number of variants after VCF_QC
for file in $(ls -v /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/072721_run/plink/ukb23156_c*.merged.filtered.bim);
do 
    echo "${file##*/}" >> /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/072721_ukb23156_c1_22.merged.filtered.snpcount.txt
    cat $file | wc -l  >> /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/072721_ukb23156_c1_22.merged.filtered.snpcount.txt;
done




In [36]:
# Count number of variants after HWE filter 
for file in $(ls -v /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c*.merged.filtered.filtered.snplist);
do 
    wc -l $file  >> /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_ukb23156_c1_22.merged.filtered.hwe_ctrl_ARHI.snpcount.txt;
done




In [39]:
cat $(ls -v /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c**.merged.filtered.filtered.snplist) >> \
/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c1_22.filtered.filtered.snplist




### Note: I've found some duplicated records in the extracted variants
```
      2 chr10:5642571:T:TA
      2 chr11:2915280:C:CGT
      2 chr14:51243825:C:CT
      2 chr14:73265273:A:AGG
      2 chr14:73654828:C:CT
      2 chr15:36594814:T:TA
      2 chr17:20156278:A:AC
      2 chr19:45408051:C:CA
      2 chr19:9980591:A:ATCTC
      2 chr1:149935279:CA:C
      2 chr1:240493541:A:AT
      2 chr1:53074691:T:TG
      2 chr20:49159148:T:TA
      2 chr22:36481640:C:CGG
      2 chr2:127637295:A:AC
      2 chr3:111991354:AT:A
      2 chr3:98391562:G:GA
      2 chr5:169670501:T:TTTTTA
      2 chr5:58975878:C:CT
      2 chr5:96783875:A:ACT
      2 chr6:42683028:G:GT
      2 chr7:100885457:G:GC
      2 chr9:27217785:A:AT
      2 chr9:92474739:G:GTCC
```

### Remove variants that do not pass HWE in ARHI controls and get final bed file per chromosome

In [40]:
# Get the variants that pass HWE qc filter
## Select White European unrelated individuals 
## Do some extra QC on the exome data
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
cwd=$UKBB_PATH/results/ARHI_heritability/051922_white_eur_unrel
## Use the exome filtered file
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
## To keep only the samples of white Europeans unrelated individuals with outliers removed that are controls for ARHI samples so we can apply HWE
keep_samples=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/pheno_files/ukb42495_exomed_white_europeans_unrelated_no_outliers_167652.id
keep_variants=~/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c1_22.filtered.filtered.snplist
# Do not set a MAF filter, this will keep both common and rare variants
maf_filter=0 
# No geno filter in this subset of controls
geno_filter=0.1
# Set a HWE filter to 0 because there's not need to do it here
hwe_filter=0
# Do not set a sample missingness filter at this point, otherwise many samples would be removed
mind_filter=0
# Keep both SNPs and indels in the heritability calculation
snps_only=False
other_args=""
gwas_sbatch=$UKBB_PATH/results/ARHI_heritability/051922_white_eur_unrel/eur_unrel_ARHI_QC_$(date +"%Y-%m-%d").sbatch
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif
numThreads=20
job_size=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --other_args $other_args
    --numThreads $numThreads 
    --job_size $job_size
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"


INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/eur_unrel_ARHI_QC_2022-05-19.sbatch[0m
INFO: Workflow csg (ID=w3655ddad4c106b79) is executed successfully with 1 completed step.



In [41]:
# Count number of variants after geno=0.1 filter 
for file in $(ls -v /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c*.merged.filtered.filtered.extracted.bim);
do 
    wc -l $file  >> /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_ukb23156_c1_22.merged.filtered.geno0.1_ARHI.snpcount.txt;
done




## Step 2. Merge plink files and use `--mind 0.05` filter

From the Wainschtein 2022 paper

```
plink \
--bfile ${BED_file_merged} \
--merge-list ${list_beds} \
--make-bed \
--maf 0.0001 \
--geno 0.05  \
--hwe 0.000001 \
--mind 0.05 \
--out ${BED_file_merged_QC} \
--threads ${ncpu}
```

### Get the file for rare variants MAF<0.01

In [43]:
genoFile=`echo /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c{1..22}.merged.filtered.filtered.extracted.bed`
cwd=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_merged_bed
# Select only the variants with MAF between 0 and 0.01
maf_max_filter=0.01
# Do not set a MAF filter to keep all variants below 0.01
maf_filter=0
# No need to filter again in the merge
geno_filter=0
# HWE already applied only to controls of ARHI sample
hwe_filter=0
# Set a sample missingness of 5%
mind_filter=0.05
name='ukb23156_merged_eur_unrel_rarevarsMAFbelow0.01'

gwas_sbatch=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_merged_bed/rareMAFbelow0.01_eur_unrel_exome_merged$(date +"%Y-%m-%d").sbatch
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/genotype_formatting.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

merge_args="""merge_plink
    --cwd $cwd
    --genoFile $genoFile
    --name $name
    --maf_filter $maf_filter
    --maf_max_filter $maf_max_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$merge_args"


INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_merged_bed/rareMAFbelow0.01_eur_unrel_exome_merged2022-05-19.sbatch[0m
INFO: Workflow csg (ID=w184d6d32cb096add) is executed successfully with 1 completed step.



### Get the file for common variants MAF>0.01

In [44]:
genoFile=`echo /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c{1..22}.merged.filtered.filtered.extracted.bed`
cwd=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_merged_bed
# Set a MAF filter of 0.01 to keep variants above that threshold
maf_filter=0.01
# No need to filter for variant missigness again
geno_filter=0
# HWE already applied only to controls of ARHI sample
hwe_filter=0
# Set a sample missingness of 5%
mind_filter=0.05
name='ukb23156_merged_eur_unrel_commonvarsMAFabove0.01'
gwas_sbatch=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_merged_bed/commonMAFabove0.01_eur_unrel_exome_merged_$(date +"%Y-%m-%d").sbatch
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/genotype_formatting.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

merge_args="""merge_plink
    --cwd $cwd
    --genoFile $genoFile
    --name $name
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$merge_args"


INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_merged_bed/commonMAFabove0.01_eur_unrel_exome_merged_2022-05-19.sbatch[0m
INFO: Workflow csg (ID=wd209f89c42e59fdb) is executed successfully with 1 completed step.



### Get a file with both common and rare variants

In [46]:
genoFile=`echo /mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_white_eur_unrel/ukb23156_c{1..22}.merged.filtered.filtered.extracted.bed`
cwd=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_merged_bed
# Do not set a MAF filter, this will keep both common and rare variants
maf_filter=0 
# Do not filter again for variant missigness
geno_filter=0
# HWE already applied only to controls of ARHI sample
hwe_filter=0
# Set a sample missingness of 5%
mind_filter=0.05
name='ukb23156_merged_eur_unrel_allvars'
gwas_sbatch=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_merged_bed/allvars_eur_unrel_exome_merged_$(date +"%Y-%m-%d").sbatch
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/genotype_formatting.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

merge_args="""merge_plink
    --cwd $cwd
    --genoFile $genoFile
    --name $name
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$merge_args"


INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/051922_merged_bed/allvars_eur_unrel_exome_merged_2022-05-19.sbatch[0m
INFO: Workflow csg (ID=wadfa7b47d59121a2) is executed successfully with 1 completed step.



# Create the GRM matrix

Here we need to create a GRM per bed file for common, rare and both types of variants

```

i={1..99}
GCTA \
--bfile ${BED_file_merged_QC} \
--extract ${list_variants_LD_bin} \
--make-grm-part 99 "$i" \
--thread-num ${ncpu} \
--out ${GRM_out} \
--make-grm-alg 1


#Merge all GRM parts together

cat ${GRM_out}.part_99_*.grm.id > ${GRM_out}.grm.id
cat ${GRM_out}.part_99_*.grm.bin > ${GRM_out}.grm.bin
cat ${GRM_out}.part_99_*.grm.N.bin > ${GRM_out}.grm.N.bin
```

## Get GRM for rare variant substet

In [47]:
bfile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_rarevarsMAFbelow0.01.bed
cwd=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_grm_rare
numThreads=20
gcta_sbatch=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_grm/rarevars_grm_eur_unrel_gtca_$(date +"%Y-%m-%d").sbatch
gcta_sos=~/project/bioworkflows/GWAS/LMM.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/lmm.sif
#Phenofile is just inputed as a requirement for the LMM to run but it is not actually needed for GRM calculation
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2.tsv
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
mem='8G'
walltime='48h'

gcta_args="""gcta
    --cwd $cwd
    --bfile $bfile
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --numThreads $numThreads
    --phenoFile $phenoFile
    --container_lmm $container
    --walltime $walltime
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gcta_sos \
    --to-script $gcta_sbatch \
    --args "$gcta_args"


INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_grm/rarevars_grm_eur_unrel_gtca_2022-05-20.sbatch[0m
INFO: Workflow csg (ID=w3a0b3db8c49eabe5) is executed successfully with 1 completed step.



## Get the GRM for common variant subset

In [48]:
bfile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.bed
cwd=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_grm_common
numThreads=20
gcta_sbatch=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_grm_common/commonvars_eur_unrel_gtca_$(date +"%Y-%m-%d").sbatch
gcta_sos=~/project/bioworkflows/GWAS/LMM.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/lmm.sif
#Phenofile is just inputed as a requirement for the LMM to run but it is not actually needed for GRM calculation
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2.tsv
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
mem='8G'
walltime='48h'

gcta_args="""gcta
    --cwd $cwd
    --bfile $bfile
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --numThreads $numThreads
    --phenoFile $phenoFile
    --container_lmm $container
    --walltime $walltime
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gcta_sos \
    --to-script $gcta_sbatch \
    --args "$gcta_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_grm_common/commonvars_eur_unrel_gtca_2022-05-20.sbatch[0m
INFO: Workflow csg (ID=w5dfba0004114fab4) is executed successfully with 1 completed step.



## Get the GRM for all variants (rare + common)

In [49]:
bfile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_allvars.bed
cwd=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_grm_allvars
numThreads=20
gcta_sbatch=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_grm_allvars/allvars_grm_eur_unrel_$(date +"%Y-%m-%d").sbatch
gcta_sos=~/project/bioworkflows/GWAS/LMM.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/lmm.sif
#Phenofile is just inputed as a requirement for the LMM to run but it is not actually needed for GRM calculation
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2.tsv
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
mem='8G'
walltime='48h'

gcta_args="""merge_plink
    --cwd $cwd
    --bfile $bfile
    --numThreads $numThreads
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gcta_sos \
    --to-script $gcta_sbatch \
    --args "$gcta_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_grm_allvars/allvars_grm_eur_unrel_2022-05-20.sbatch[0m
INFO: Workflow csg (ID=w8df80d9b5785842a) is executed successfully with 1 completed step.



In [None]:
###### Create a file containing multiple GRMs in a directory (need full path) ######

for i in *.grm.bin ; do readlink -f "$i"  | cut -d'.' -f1-2 >>  ${mgrm_file_path}; done

## Do LD prunning for each group of variants

```
i={1..22}
plink \
--bfile ${BED_file_merged_QC} \
--chr "$i" \
--extract ${list_variants_bin} \
--indep-pairwise 50 5 0.1 \
--out ${out_indep_var}_chr"$i" \
--threads ${ncpu}
```

### LD pruning for common variants

In [61]:
## Selected White European unrelated individuals 
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
cwd=$UKBB_PATH/results/ARHI_heritability/052022_ldprun_common
## Use the exome filtered file
genoFile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.bed
maf_filter=0 
geno_filter=0
hwe_filter=0
mind_filter=0
window=50
shift=10
r2=0.1
gwas_sbatch=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_ldprun_common/common_ldprun_eur_unrel_$(date +"%Y-%m-%d").sbatch
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg2.yml
container=~/containers/bioinfo.sif
numThreads=20
job_size=1
mem='30G'

gwasqc_args="""qc:2
    --cwd $cwd
    --genoFile $genoFile
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"


INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_ldprun_common/common_ldprun_eur_unrel_2022-05-20.sbatch[0m
INFO: Workflow csg (ID=w39d73fa64d48acf5) is executed successfully with 1 completed step.



### LD pruning for rare variants

In [62]:
## Selected White European unrelated individuals 
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
cwd=$UKBB_PATH/results/ARHI_heritability/052022_ldprun_rare
## Use the exome filtered file
genoFile=~/UKBiobank/results/ARHI_heritability/051922_merged_bed/ukb23156_merged_eur_unrel_rarevarsMAFbelow0.01.bed
window=2000
shift=400
r2=0.01
gwas_sbatch=$UKBB_PATH/results/ARHI_heritability/052022_ldprun_rare/rare_ldprun_eur_unrel_$(date +"%Y-%m-%d").sbatch
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif
numThreads=20
job_size=1
mem='30G'

gwasqc_args="""qc:2
    --cwd $cwd
    --genoFile $genoFile
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"


INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_ldprun_rare/rare_ldprun_eur_unrel_2022-05-20.sbatch[0m
INFO: Workflow csg (ID=wd5c42b2fe0425a2b) is executed successfully with 1 completed step.



## Recalculate PC's for the subset of unrelated

In the paper they use plink2 to calculate the PC's

```
plink2 \
--bfile ${BED_file_merged_QC} \
--extract ${list_variants_bin} \
--pca 20 approx \
--out ${PCA_out} \
--thread-num ${ncpu}
```

In our case we have developed a pipeline that uses flashpca. We use that one instead

### PCA for f3393 common variants

#### Step 1

In [63]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_common/f3393
gwas_sbatch=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_common/common_pca_f3393_exome_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_ldprun_common/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.filtered.prune.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.keep_id

#GWAS QC variables set all of this variables to 0 to avoid doing more filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_common/common_pca_f3393_exome_2022-05-20.sbatch[0m
INFO: Workflow csg (ID=w9bd46fcb4c4e3372) is executed successfully with 1 completed step.



#### Step 2. 

In [64]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_common/f3393
#This is the bfile obtained in step 1
genoFile=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_common/f3393/cache/*.bed
# Format FID, IID, pop
phenoFile=/mnt/mfs/statgen/UKBiobank/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_common/flashpca_f3393_common_$(date +"%Y-%m-%d").sbatch
k=10
min_axis=0
max_axis=0
pca_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/flashpcaR.sif

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --container $container
"""

sos run  ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_common/flashpca_f3393_common_2022-05-20.sbatch[0m
INFO: Workflow csg (ID=w90c9447dc037bccd) is executed successfully with 1 completed step.



### Merge the phenofile with the PC calculation for the available individuals (unrelated and white European) for common variants

In [1]:
pheno <- read.table("~/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl", header=T)
head(pheno)
dim(pheno)

  msg['msg_id'] = self._parent_header['header']['msg_id']


In [2]:
pca <- read.table("/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_common/f3393/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.pca.txt", header=T)
head(pca)
dim(pca)

Unnamed: 0_level_0,ID,FID,IID,ethnicity,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<fct>,<int>,<int>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000078:1000078,1000078,1000078,British,0.001867425,0.013093557,-0.012545642,-0.0077773685,-0.0076396044,-0.004067034,0.003904697,-0.0066333866,-0.011400679,-0.003770311
2,1000081:1000081,1000081,1000081,British,0.134179544,-0.015138242,0.014278056,0.0142731188,-0.0085332226,-0.010936588,0.011577007,0.0004865037,0.008102219,-0.002117615
3,1000236:1000236,1000236,1000236,British,0.008867756,0.011425836,-0.023071386,0.0064096586,0.0033986306,0.021169599,-0.003277936,0.0054276548,-0.01065791,-0.03410221
4,1000331:1000331,1000331,1000331,Any_other_white_background,0.080695947,-0.003566874,-0.002728336,-0.0039599944,0.0005021192,0.003650339,0.029506097,0.0038201997,-0.013963517,-0.01629406
5,1000340:1000340,1000340,1000340,British,-0.012073413,-0.005182776,-0.00929079,0.0007751943,-0.0024663547,-0.006647973,0.002676356,-0.0018374353,0.002831617,-0.001478281
6,1000415:1000415,1000415,1000415,Irish,-0.018177064,-0.053073714,-0.000446653,-0.0024920664,0.0170787094,-0.003519533,0.005667462,0.0251297768,-0.006116507,0.005747638


In [3]:
f3393 <- merge(pheno, pca, by = c('FID', 'IID'), all.y=TRUE)
head(f3393)
dim(f3393)
library(dplyr)
f3393_final <- select(f3393, -c('ID', 'ethnicity'))
head(f3393_final)

Unnamed: 0_level_0,FID,IID,sex,f3393,age,ID,ethnicity,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000078,1000078,1,0,60,1000078:1000078,British,0.001867425,0.013093557,-0.012545642,-0.0077773685,-0.0076396044,-0.004067034,0.003904697,-0.0066333866,-0.011400679,-0.003770311
2,1000081,1000081,0,0,67,1000081:1000081,British,0.134179544,-0.015138242,0.014278056,0.0142731188,-0.0085332226,-0.010936588,0.011577007,0.0004865037,0.008102219,-0.002117615
3,1000236,1000236,0,0,70,1000236:1000236,British,0.008867756,0.011425836,-0.023071386,0.0064096586,0.0033986306,0.021169599,-0.003277936,0.0054276548,-0.01065791,-0.03410221
4,1000331,1000331,1,0,53,1000331:1000331,Any_other_white_background,0.080695947,-0.003566874,-0.002728336,-0.0039599944,0.0005021192,0.003650339,0.029506097,0.0038201997,-0.013963517,-0.01629406
5,1000340,1000340,1,0,54,1000340:1000340,British,-0.012073413,-0.005182776,-0.00929079,0.0007751943,-0.0024663547,-0.006647973,0.002676356,-0.0018374353,0.002831617,-0.001478281
6,1000415,1000415,0,0,65,1000415:1000415,Irish,-0.018177064,-0.053073714,-0.000446653,-0.0024920664,0.0170787094,-0.003519533,0.005667462,0.0251297768,-0.006116507,0.005747638



Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




Unnamed: 0_level_0,FID,IID,sex,f3393,age,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1000078,1000078,1,0,60,0.001867425,0.013093557,-0.012545642,-0.0077773685,-0.0076396044,-0.004067034,0.003904697,-0.0066333866,-0.011400679,-0.003770311
2,1000081,1000081,0,0,67,0.134179544,-0.015138242,0.014278056,0.0142731188,-0.0085332226,-0.010936588,0.011577007,0.0004865037,0.008102219,-0.002117615
3,1000236,1000236,0,0,70,0.008867756,0.011425836,-0.023071386,0.0064096586,0.0033986306,0.021169599,-0.003277936,0.0054276548,-0.01065791,-0.03410221
4,1000331,1000331,1,0,53,0.080695947,-0.003566874,-0.002728336,-0.0039599944,0.0005021192,0.003650339,0.029506097,0.0038201997,-0.013963517,-0.01629406
5,1000340,1000340,1,0,54,-0.012073413,-0.005182776,-0.00929079,0.0007751943,-0.0024663547,-0.006647973,0.002676356,-0.0018374353,0.002831617,-0.001478281
6,1000415,1000415,0,0,65,-0.018177064,-0.053073714,-0.000446653,-0.0024920664,0.0170787094,-0.003519533,0.005667462,0.0251297768,-0.006116507,0.005747638


In [4]:
write.table(f3393_final, "/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.greml_pheno", sep="\t", row.names = FALSE, col.names =TRUE, quote=FALSE)

### PCA for f3393 rare variants

#### Step 1

In [63]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_rare/f3393
gwas_sbatch=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_rare/rare_pca_f3393_exome_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_ldprun_rare/*.prune.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.keep_id

#GWAS QC variables set all of this variables to 0 to avoid doing more filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'
gwasqc_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/bioinfo.sif

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --container $container
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_common/common_pca_f3393_exome_2022-05-20.sbatch[0m
INFO: Workflow csg (ID=w9bd46fcb4c4e3372) is executed successfully with 1 completed step.



#### Step 2. 

In [64]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_common/f3393
#This is the bfile obtained in step 1
genoFile=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_common/f3393/cache/*.bed
# Format FID, IID, pop
phenoFile=/mnt/mfs/statgen/UKBiobank/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_common/flashpca_f3393_common_$(date +"%Y-%m-%d").sbatch
k=10
min_axis=0
max_axis=0
pca_sos=~/project/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/flashpcaR.sif

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --container $container
"""

sos run  ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_pca_common/flashpca_f3393_common_2022-05-20.sbatch[0m
INFO: Workflow csg (ID=w90c9447dc037bccd) is executed successfully with 1 completed step.



# Calculate the heritability

For a case-control study it should be estimated like

```
gcta64 --grm test --pheno test_cc.phen --reml --prevalence 0.01 --out test --thread-num 10
```

## f3393 common variants

In [5]:
## Columbia's cluster
cwd=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/greml
#This is the bfile obtained in step 1
grm=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/052022_grm_common/ukb23156_merged_eur_unrel_commonvarsMAFabove0.01.grm.bin
# Format FID, IID, pop
phenoFile=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/pheno_files/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.PC1_10.greml_pheno
greml_sbatch=/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/greml/f3393_greml_common_$(date +"%Y-%m-%d").sbatch
greml_sos=~/project/UKBB_GWAS_dev/workflow/GREML.ipynb
tpl_file=~/project/bioworkflows/admin/csg.yml
container=~/containers/lmm.sif
phenoCol=f3393
covarCol=sex
qCovarCol=`echo age PC{1..10}`
prevalence=0.01
greml_args="""greml
    --cwd $cwd
    --grm $grm
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --covarCol $covarCol
    --qCovarCol $qCovarCol
    --prevalence $prevalence
    --numThreads $numThreads 
    --container $container   
"""

sos run  ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $greml_sos \
    --to-script $greml_sbatch \
    --args "$greml_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/mfs/statgen/UKBiobank/results/ARHI_heritability/greml/f3393_greml_common_2022-05-24.sbatch[0m
INFO: Workflow csg (ID=w9ce7c0faf4470ac1) is executed successfully with 1 completed step.

