# fastGWA analyses for UK Biobank data

For binary and quantitative traits.

### Aim

To perform genetic association analysis for binary traits (asthma and type 2 diabetes) using the fastGWA software and UK Biobank imputed data for ~500K individuals.

### Method and workflow overview

1. Data from UKB: phenotype files, genotype files (`.fam`, `.bed`, `.bim`) and imputed genotypes (`.bgen`, `.bgi`, `.sample`)
2. Install fastGWA, which is part of the GCTA software, on Yale's HRC cluster (in this case the precompiled binary is used). Instructions can be found at https://cnsgenomics.com/software/gcta/#Download
3. Run the fastGWA analysis to obtain summary statistics for the association analysis of binary traits

### Input data

Step 1: Create the Genetic Relationship Matrix (GRM)
1. genotype file in PLINK binary format: `.bim`, `.fam`, `.bed`
2. phenotype file: required in PLINK format (`.phe`)

Asthma:
```
~/project/phenotypes_UKB/Asthma_casesbyICD10codesANDselfreport_controlsbyselfreportandicd10_noautoimmuneincontrols_fastGWA.phe
~/project/phenotypes_UKB/Asthma_casesbyICD10codesANDselfreport_controlsbyselfreportandicd10_noautoimmuneincontrols_fastGWA_covSEX.txt
~/project/phenotypes_UKB/Asthma_casesbyICD10codesANDselfreport_controlsbyselfreportandicd10_noautoimmuneincontrols_fastGWA_covAGE.txt
```

Diabetes:

```
~/project/phenotypes_UKB/diabetes_casesbyICD10andselfreport_controlswithoutautoiummune_052820_fastGWA.phe
~/project/phenotypes_UKB/diabetes_casesbyICD10andselfreport_controlswithoutautoiummune_052820_fastGWA_covSEX.txt
~/project/phenotypes_UKB/diabetes_casesbyICD10andselfreport_controlswithoutautoiummune_052820_fastGWA_covAGE.txt
```

3. covariates file: in PLINK format (the `_covSEX.txt` and `_covAGE.txt` files listed above)
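
As a rough illustration of the layout these files are expected to follow, the sketch below writes a phenotype file and the two covariate files for a few made-up individuals. It assumes the PLINK-style convention read by GCTA's `--pheno`, `--covar` and `--qcovar` options: no header line, whitespace-separated columns, with family ID and individual ID first. The IDs, values, file names and the `write_plink_table` helper are hypothetical.

```python
import os
import tempfile

def write_plink_table(path, rows):
    """Write rows of (FID, IID, value, ...) as a headerless, space-separated file."""
    with open(path, "w") as handle:
        for row in rows:
            handle.write(" ".join(str(field) for field in row) + "\n")

# Hypothetical individuals: (FID, IID, case status, sex, age)
individuals = [
    ("1001", "1001", 1, "F", 54),
    ("1002", "1002", 0, "M", 61),
    ("1003", "1003", 0, "F", 47),
]

out_dir = tempfile.mkdtemp()
pheno_path = os.path.join(out_dir, "example_fastGWA.phe")
covar_path = os.path.join(out_dir, "example_fastGWA_covSEX.txt")
qcovar_path = os.path.join(out_dir, "example_fastGWA_covAGE.txt")

# One file per role: phenotype, qualitative covariate, quantitative covariate
write_plink_table(pheno_path, [(fid, iid, status) for fid, iid, status, _, _ in individuals])
write_plink_table(covar_path, [(fid, iid, sex) for fid, iid, _, sex, _ in individuals])
write_plink_table(qcovar_path, [(fid, iid, age) for fid, iid, _, _, age in individuals])

with open(pheno_path) as handle:
    pheno_lines = handle.read().splitlines()
print(pheno_lines)
```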

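GCTA options for computing the GRM (`--make-grm` and `--make-grm-bin` are equivalent; `--make-grm-part m i` splits the GRM into `m` parts and computes the `i`-th part in the current run):

```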
--make-grm
--make-grm-bin
--make-grm-part m i
```

## Running fastGWA

On the Yale Farnam cluster:

Asthma

```
sos run ~/project/pleiotropy_UKB/workflow/fastGWA.ipynb -c ~/project/pleiotropy_UKB/farnam.yml -q farnam -J 100 -s build &> sos-submission-fastGWA-asthma-051920.log
```

Diabetes
```
sos run ~/project/pleiotropy_UKB/workflow/fastGWA.ipynb -c ~/project/pleiotropy_UKB/farnam.yml -q farnam -J 100 -s build &> sos-submission-fastGWA-diabetes-052820.log
```

```
[global]
# Output directory for generated files
# NOTE: this points to the asthma folder while the phenotype files below are the diabetes ones; adjust per trait
parameter: cwd = path('~/scratch60/2020-04_fastGWA/asthma')
# Genotype file in PLINK binary format
parameter: genoFile = path('/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.bed')
# Paths to the bgen files (chromosomes 1 to 22)
parameter: bgenFile = paths([f'/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb_imp_chr{x+1}_v3.bgen' for x in range(22)])
# Path to the sample file
parameter: sampleFile = path('/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample')
# Original phenotype file for binary trait 1 (asthma), kept for reference
#parameter: phenoFile = path('/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/Asthma_casesbyICD10codesANDselfreport_controlsbyselfreportandicd10_noautoimmuneincontrols_forbolt030720')
# Phenotype file with fastGWA formatting (currently set to the diabetes file)
parameter: phenoFile = path('~/project/phenotypes_UKB/diabetes_casesbyICD10andselfreport_controlswithoutautoiummune_052820_fastGWA.phe')
# Qualitative covariate (SEX) file with fastGWA formatting (currently set to the diabetes file)
parameter: covarFile = path('~/project/phenotypes_UKB/diabetes_casesbyICD10andselfreport_controlswithoutautoiummune_052820_fastGWA_covSEX.txt')
# Quantitative covariate (AGE) file with fastGWA formatting (currently set to the diabetes file)
parameter: qcovarFile = path('~/project/phenotypes_UKB/diabetes_casesbyICD10andselfreport_controlswithoutautoiummune_052820_fastGWA_covAGE.txt')
# Minimum MAF to be used
parameter: bgenMinMAF = 0.001
# Minimum info score to be used
parameter: bgenMinINFO = 0.8
# Number of parts into which the GRM calculation is partitioned
parameter: parts = 100
# Number of threads to use
parameter: numThreads = 6
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
```
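
The `bgenFile` parameter relies on a list comprehension to expand one path per autosome. A minimal plain-Python sketch of that expansion is shown here; `bgen_dir` is a hypothetical placeholder for the imputed-data folder, not a real path.

```python
# Hypothetical directory standing in for the imputed-data folder
bgen_dir = "/path/to/imputed"

# range(22) yields 0..21, so x + 1 covers autosomes 1..22
bgen_files = [f"{bgen_dir}/ukb_imp_chr{x + 1}_v3.bgen" for x in range(22)]

print(len(bgen_files))
print(bgen_files[0])
print(bgen_files[-1])
```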

## Step 1: Creation of the GRM
The GRM only needs to be created once and can be reused for every phenotype analyzed with the same genotype data. In this step the GRM calculation is split into multiple parts to reduce computation time.

```
# Partition the GRM calculation into `parts` jobs (100 by default), each requesting 48G of memory
[gcta_1]
depends: executable("gcta64")
# Zero-padded label for each partition, e.g. 100_001 ... 100_100
part_number = [f'{parts}_{format(x+1, "0" + str(len(str(parts))))}' for x in range(parts)]
input: genoFile, for_each = 'part_number'
output: f'{cwd}/{_input:bn}.part_{_part_number}.grm.bin',
        f'{cwd}/{_input:bn}.part_{_part_number}.grm.N.bin',
        f'{cwd}/{_input:bn}.part_{_part_number}.grm.id'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "${ }", workdir = cwd, stderr = False, stdout = False
    # --make-grm-part expects two integers (m i), so pass only the numeric index of the label
    gcta64 \
    --bfile ${_input[0]:n} \
    --make-grm-part ${parts} ${int(_part_number[len(str(parts)) + 1:])} \
    --thread-num ${numThreads} \
    --out ${_output[0]:nnn}
```
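
The partition labels built by `part_number` in this step can be reproduced in plain Python, which may help when checking which file belongs to which part. This is only a sketch of the naming logic: the labels combine the total number of parts with a zero-padded index, and the integer index can be recovered by splitting on the underscore.

```python
parts = 100

# Pad the part index to the width of `parts` (3 digits for 100 parts)
width = len(str(parts))
part_labels = [f"{parts}_{x + 1:0{width}d}" for x in range(parts)]

# Recover the integer part index from a label, e.g. "100_007" -> 7
part_indices = [int(label.split("_")[1]) for label in part_labels]

print(part_labels[0], part_labels[-1])
print(part_indices[:3])
```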

## Step 2: Combine all the GRM parts into one file

```
# Merge all the parts together (Linux, Mac)
[gcta_2]
input: group_by = 'all'
output: f'{cwd}/{genoFile:bn}.grm.bin',
        f'{cwd}/{genoFile:bn}.grm.N.bin',
        f'{cwd}/{genoFile:bn}.grm.id'
bash: expand = "${ }", workdir = cwd, stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    # The input holds the results of all parts, 3 files per part (.grm.bin, .grm.N.bin, .grm.id),
    # so every third item, starting at offsets 0, 1 and 2, belongs to the same file type
    cat ${paths(_input[::3])} > ${_output[0]}
    cat ${paths(_input[1::3])} > ${_output[1]}
    cat ${paths(_input[2::3])} > ${_output[2]}
    #rm ${paths(_input)}
```
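
The `[::3]`, `[1::3]` and `[2::3]` slices used in this step pick out one file type each from the flattened list of inputs. A small sketch with a hypothetical three-part run (the `geno` prefix and part labels are made up) shows how the slicing groups the files:

```python
# Hypothetical flattened input: 3 files per part, in part order
prefix = "geno"
flat_inputs = []
for part in ("3_1", "3_2", "3_3"):
    flat_inputs += [
        f"{prefix}.part_{part}.grm.bin",
        f"{prefix}.part_{part}.grm.N.bin",
        f"{prefix}.part_{part}.grm.id",
    ]

# Every third item, starting at offsets 0, 1 and 2
bin_files = flat_inputs[0::3]
n_files = flat_inputs[1::3]
id_files = flat_inputs[2::3]

print(bin_files)
print(n_files)
print(id_files)
```

Because the parts are concatenated in list order, the input list has to stay sorted by part number for the merged files to line up.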

## Step 3: Make a sparse GRM to be used in the association analyses

```
# Make a sparse GRM from the merged dense GRM
[gcta_3]
depends: executable("gcta64")
output: sparse_grm = f'{_input[0]:nn}.grm.sp'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G', cores = 1, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", workdir = cwd, stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    # Read the dense GRM by prefix and write the sparse GRM under the same prefix
    gcta64 --grm ${_input[0]:nn} --make-bK-sparse 0.05 --out ${_output[0]:nn}
```
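
To give an intuition for what thresholding at 0.05 does, the sketch below applies a cutoff to a tiny made-up relatedness matrix. It is a conceptual illustration only, not GCTA's implementation: it assumes the diagonal is always kept and that off-diagonal pairs are kept only when their value exceeds the cutoff. The `sparsify` helper and the matrix values are hypothetical.

```python
def sparsify(grm, cutoff=0.05):
    """Return (i, j, value) for lower-triangle entries kept after thresholding."""
    entries = []
    for i, row in enumerate(grm):
        for j in range(i + 1):
            value = row[j]
            # Keep the diagonal and any pair related above the cutoff
            if i == j or value > cutoff:
                entries.append((i, j, value))
    return entries

# Hypothetical 3x3 dense GRM: individuals 0 and 2 are close relatives
dense_grm = [
    [1.00, 0.01, 0.26],
    [0.01, 0.98, 0.00],
    [0.26, 0.00, 1.02],
]

sparse_entries = sparsify(dense_grm)
print(sparse_entries)
```

With most pairs in a biobank being essentially unrelated, only a small fraction of off-diagonal entries survives the cutoff, which is what keeps the mixed model tractable at this sample size.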

## Step 4: Run the single variant association analysis using fastGWA

```
# fastGWA mixed model (based on the sparse GRM generated above)
[fastGWA]
depends: executable("gcta64")
input: bgenFile, group_by = 1, group_with = dict(info=[(phenoFile, sampleFile, named_output('sparse_grm')[0], qcovarFile, covarFile)] * len(bgenFile))
# Use base names (:bn) so that the outputs land inside cwd
output: f'{cwd}/{_input[0]:bn}.{_input.info[0]:bn}.fastGWA', f'{cwd}/{_input[0]:bn}.{_input.info[0]:bn}.fastGWA.log'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G', cores = numThreads, tags = f'{step_name}_{_output[1]:bn}'
bash: expand = "${ }", workdir = cwd, stderr = False, stdout = False
    gcta64 \
    --bgen ${_input} \
    --sample ${_input.info[1]} \
    --grm-sparse ${_input.info[2]:nn} \
    --maf ${bgenMinMAF} \
    --info ${bgenMinINFO} \
    --fastGWA-mlm \
    --pheno ${_input.info[0]} \
    --qcovar ${_input.info[3]} \
    --covar ${_input.info[4]} \
    --threads ${numThreads} \
    --out ${_output[0]:n}

    # gcta64 writes <prefix>.log; rename it to the declared <prefix>.fastGWA.log output
    mv ${_output[0]:n}.log ${_output[1]}
```
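
Once the per-chromosome results are available, a common next step is to pull out genome-wide significant variants. The sketch below does this with the standard library, assuming a tab-separated file whose header includes `SNP` and `P` columns. The column names, the example rows and the `significant_hits` helper are illustrative assumptions, so check the header of your own output before relying on them.

```python
import csv
import io

# Made-up rows standing in for a .fastGWA result file
example_output = (
    "CHR\tSNP\tPOS\tA1\tA2\tN\tAF1\tBETA\tSE\tP\n"
    "1\trs0001\t10583\tA\tG\t450000\t0.12\t0.004\t0.002\t0.045\n"
    "1\trs0002\t20391\tT\tC\t450000\t0.31\t0.021\t0.003\t2.6e-12\n"
)

def significant_hits(handle, threshold=5e-8):
    """Return the rows whose P value is below the threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["P"]) < threshold]

hits = significant_hits(io.StringIO(example_output))
print([row["SNP"] for row in hits])
```

For real files, replace the `io.StringIO` object with `open(path)` for each chromosome's output.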

### Output files

1. **gcta_1, run for x parts (in the example above x=100, so this step creates 400 files):**
* test.part_{_part_number}.grm.bin
* test.part_{_part_number}.grm.N.bin 
* test.part_{_part_number}.grm.id
* test.part_{_part_number}.log (the program creates the log file so there is no need for .stderr and .stdout)

2. **gcta_2: this step creates 5 output files:**
* test.grm.bin (a binary file containing the lower triangle elements of the GRM)
* test.grm.N.bin (a binary file containing the number of SNPs used to calculate the GRM)
* test.grm.id (no header line; columns are family ID and individual ID)
* test.grm.stderr
* test.grm.stdout

3. **gcta_3: this step creates 3 output files:**
* test.grm.sp (sparse GRM made from the dense GRM)
* test.grm.sp.stderr
* test.grm.sp.stdout

4. **fastGWA: this step creates 2 output files per chromosome:**
* test{chr1:22}.fastGWA
* test{chr1:22}.fastGWA.log
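
As a quick sanity check on the gcta_1 file count above, the expected file names can be enumerated in plain Python. The sketch assumes the `test` prefix used in this list and the zero-padded part labels produced by the workflow.

```python
parts = 100
width = len(str(parts))

# Four files per part: three GRM files plus the log written by the program
suffixes = ["grm.bin", "grm.N.bin", "grm.id", "log"]
expected_files = [
    f"test.part_{parts}_{index:0{width}d}.{suffix}"
    for index in range(1, parts + 1)
    for suffix in suffixes
]

print(len(expected_files))
print(expected_files[:4])
```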