# fastGWA analyses for UK Biobank data: qualitative traits

### Aim

To perform genetic association analysis for binary traits (asthma and diabetes) using the fastGWA software and UK Biobank imputed data from ~500K individuals.

### Method and workflow overview

1. Data from UKB: phenotype file, genotype files (`.fam`, `.bed`, `.bim`) and imputed genotypes (`.bgen`, `.bgi`, `.sample`)
2. Install fastGWA, which is part of the GCTA software, on Yale's HRC cluster (the precompiled binary is used here). Instructions: https://cnsgenomics.com/software/gcta/#Download
3. Run fastGWA analysis to obtain summary statistics for association analysis of binary traits

### Input data

Step 1: Create the Genetic Relationship Matrix
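1. genotype file in PLINK binary format: `.bim`, `.fam`, `.bed`
2. phenotype file: plain text without a header line; columns are family ID, individual ID and phenotype (passed to `--pheno`)
3. covariate files: quantitative covariates (`--qcovar`) and discrete covariates (`--covar`), as passed to the fastGWA command in this workflow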
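
As a rough illustration of the plain-text layout GCTA reads for `--pheno` (no header line; family ID, individual ID, phenotype), the sketch below writes a tiny phenotype file into memory. The helper name `write_gcta_pheno`, the sample IDs and the 0/1 case-control codes are all hypothetical and only serve the example.

```python
import io


def write_gcta_pheno(rows, handle):
    """Write (FID, IID, phenotype) rows as space-delimited text with no header."""
    for fid, iid, value in rows:
        handle.write(f"{fid} {iid} {value}\n")


# Hypothetical individuals and case/control codes, for illustration only
example_rows = [("1001", "1001", 1), ("1002", "1002", 0), ("1003", "1003", 1)]

buf = io.StringIO()
write_gcta_pheno(example_rows, buf)
pheno_text = buf.getvalue()
print(pheno_text, end="")
```

The same three-column pattern extends to covariate files, which carry the two ID columns followed by one column per covariate.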


```
--make-grm
--make-grm-bin
--make-grm-part m i
```
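
In `--make-grm-part m i`, `m` is the total number of parts and `i` is the index of the part computed by a given run, so a full GRM needs one run per `i` from 1 to `m`. A minimal sketch of how those command lines could be generated is shown below; the function name `grm_part_commands` and the file prefixes are hypothetical, and the commands are only built and printed, not executed.

```python
def grm_part_commands(bfile_prefix, out_prefix, parts, threads):
    """Build one gcta64 argument list per GRM partition (1..parts)."""
    return [
        [
            "gcta64",
            "--bfile", bfile_prefix,
            "--make-grm-part", str(parts), str(i),
            "--thread-num", str(threads),
            "--out", out_prefix,
        ]
        for i in range(1, parts + 1)
    ]


# Hypothetical prefixes; a real run would point at the UKB PLINK files
commands = grm_part_commands("ukb_geno", "ukb_geno.fastGWA", parts=3, threads=6)
for cmd in commands:
    print(" ".join(cmd))
```

Each list could be handed to `subprocess.run` or written into a cluster job script, one job per part.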

In [None]:
[global]
# the output directory for generated files
parameter: cwd = path('~/results/pleiotropy/2020-04_fastGWA/asthma/')
# Genotype file in plink binary format
parameter: genoFile = path('/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.bed')
# Phenotype file for binary trait 1 (asthma)
parameter: phenoFile = path('/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/Asthma_casesbyICD10codesANDselfreport_controlsbyselfreportandicd10_noautoimmuneincontrols_forbolt030720')
# Number of parts into which the GRM calculation is partitioned
parameter: parts = 100
# Index of each partition (1..parts)
parameter: part_number = [f'{x+1}' for x in range(parts)]
# Specific number of threads to use
parameter: numThreads = 6
# For cluster jobs, number of commands to run per job
parameter: job_size = 1

In [None]:
[gcta]
depends: executable("gcta64")
input: genoFile, phenoFile
output: f'{cwd}/{genoFile:bn}.fastGWA.grm.bin', f'{cwd}/{genoFile:bn}.fastGWA.grm.N.bin', f'{cwd}/{genoFile:bn}.fastGWA.grm.id', f'{cwd}/{genoFile:bn}.fastGWA.grm.sp'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", workdir = cwd, stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    # Compute the GRM in ${parts} parts, one gcta64 run per part
    # (memory and walltime are set in the task line above)
    for i in $(seq 1 ${parts}); do
        gcta64 \
        --bfile ${_input[0]:n} \
        --make-grm-part ${parts} $i \
        --thread-num ${numThreads} \
        --out ${_output[0]:nn}
    done

    # Merge all the parts together (Linux, Mac)
    cat ${_output[0]:nn}.part_${parts}_*.grm.id > ${_output[2]}
    cat ${_output[0]:nn}.part_${parts}_*.grm.bin > ${_output[0]}
    cat ${_output[0]:nn}.part_${parts}_*.grm.N.bin > ${_output[1]}

    # Make a sparse GRM from the merged dense GRM
    # (--grm takes the file prefix, not the .grm.bin file itself)
    gcta64 --grm ${_output[0]:nn} --make-bK-sparse 0.05 --out ${_output[3]:nn}

    # fastGWA mixed model (based on the sparse GRM generated above)
    gcta64 \
    --bfile ${_input[0]:n} \
    --grm-sparse ${_output[3]:nn} \
    --fastGWA-mlm \
    --pheno ${_input[1]} \
    --qcovar pc.txt \
    --covar fixed.txt \
    --threads ${numThreads} \
    --out geno_assoc

### Output files

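The names below use the `test` prefix from the GCTA documentation; in this workflow the GRM prefix is `{genoFile:bn}.fastGWA`.

1. `test.grm.bin`: binary file containing the lower-triangle elements of the GRM
2. `test.grm.N.bin`: binary file containing the number of SNPs used to calculate the GRM
3. `test.grm.id`: no header line; columns are family ID and individual ID
4. `test.grm.sp`: sparse GRM made from the dense GRM
5. `geno_assoc.fastGWA`: association summary statistics written by the fastGWA run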
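
To inspect the dense GRM outside GCTA, the sketch below reads a `.grm.bin` file under the layout described in the GCTA documentation: 4-byte floats holding the lower triangle including the diagonal, stored row by row in the order of `.grm.id`. The function names are hypothetical, and the three-individual GRM written to a temporary file is made-up demo data, not UK Biobank output.

```python
import os
import tempfile
from array import array


def read_grm_lower_triangle(bin_path, n):
    """Read n*(n+1)/2 float32 values (lower triangle with diagonal) from a .grm.bin file."""
    values = array("f")
    with open(bin_path, "rb") as fh:
        values.fromfile(fh, n * (n + 1) // 2)
    return values


def grm_value(values, i, j):
    """Return the GRM entry for individuals i and j (0-based, order as in .grm.id)."""
    if j > i:
        i, j = j, i
    return values[i * (i + 1) // 2 + j]


# Made-up GRM for 3 individuals: rows (1.0), (0.25, 1.0), (0.5, 0.125, 1.0)
demo = array("f", [1.0, 0.25, 1.0, 0.5, 0.125, 1.0])
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.grm.bin")
    with open(demo_path, "wb") as fh:
        demo.tofile(fh)
    grm = read_grm_lower_triangle(demo_path, n=3)

print(grm_value(grm, 2, 0), grm_value(grm, 0, 2))
```

For a cohort of ~500K individuals the dense file is very large, so a lookup like this is mainly practical on subsets or on small test GRMs.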