# Heritability estimation using GREML 

## Aim

The objective of this notebook is to estimate the heritability of complex traits from both common and rare variants.

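## Input data

- A genetic relationship matrix (GRM) in GCTA binary format, passed as `--grm` (e.g. `common_var.grm.bin`).
- A whitespace-delimited phenotype file with a header line that includes the `FID` and `IID` columns, passed as `--phenoFile`.
- Optionally, a covariate file in the same layout, passed as `--covarFile`. The phenotype file is used when no covariate file is given.
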
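## Output data

- For the `greml` step, a `.hsq` file written to `--cwd` and named after the GRM prefix and the phenotype file, together with the matching `.stdout` and `.stderr` logs.
- For the `grm_processing` step, a GRM restricted to the samples listed in `--keep_samples`, written to `--cwd` with a `.grm.bin` extension.
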
## Parameters

*--reml*

Perform a REML (restricted maximum likelihood) analysis. This option is usually followed by the option --grm (one GRM) or --mgrm (multiple GRMs) to estimate the variance explained by the SNPs that were used to estimate the GRM.

*--reml-no-constrain*

By default, if the estimate of a variance component escapes from the parameter space (i.e. becomes negative), it is reset to a small positive value, Vp * 10^-6, where Vp is the phenotypic variance. If the estimate keeps escaping from the parameter space, it is constrained to Vp * 10^-6. If the option --reml-no-constrain is specified, the program allows the estimate of a variance component to be negative, which may push the estimated proportion of variance explained by all the SNPs above 100%.
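
The constraint described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not GCTA's implementation; the function name `constrain_component` and the `allow_negative` switch are hypothetical.

```python
def constrain_component(estimate, vp, allow_negative=False):
    """Return the variance component value after applying the constraint.

    estimate: current estimate of one variance component
    vp: phenotypic variance
    allow_negative: hypothetical switch mimicking --reml-no-constrain
    """
    if allow_negative or estimate >= 0:
        # Inside the parameter space, or constraints switched off: keep as is
        return estimate
    # Negative estimate: reset to the small positive value Vp * 10^-6
    return vp * 1e-6


# A negative estimate is reset by default and kept when constraints are off
print(constrain_component(-0.3, vp=2.0))
print(constrain_component(-0.3, vp=2.0, allow_negative=True))
```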

*--reml-lrt 1*

Calculate the log likelihood of a reduced model with one or multiple genetic variance components dropped from the full model, and calculate the LRT and p-value. By default, GCTA always calculates and reports the LRT for the first genetic variance component, i.e. --reml-lrt 1, unless you re-specify this option, e.g. --reml-lrt 2, assuming there are at least two genetic variance components included in the analysis. You can also test multiple components simultaneously, e.g. --reml-lrt 1 2 4. See FAQ #1 for more details.
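
As a rough sketch of the test described above, the snippet below computes the LRT statistic and a p-value for a single dropped variance component. It assumes the usual convention that a constrained estimate is tested against a 50:50 mixture of a point mass at zero and a chi-square with 1 degree of freedom (so the chi-square tail is halved), while an unconstrained estimate is tested against a plain chi-square with 1 degree of freedom. The function name and arguments are hypothetical; check the GCTA FAQ before relying on the exact convention.

```python
import math


def lrt_pvalue(logl_full, logl_reduced, constrained=True):
    """LRT statistic and p-value for one dropped variance component."""
    # LRT = 2 * (logL of full model - logL of reduced model), floored at zero
    lrt = max(2.0 * (logl_full - logl_reduced), 0.0)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    chi2_tail = math.erfc(math.sqrt(lrt / 2.0))
    # Assumption: halve the tail when the estimate is constrained to be >= 0
    p_value = 0.5 * chi2_tail if constrained else chi2_tail
    return lrt, p_value


lrt, p = lrt_pvalue(logl_full=-940.0, logl_reduced=-945.0)
print(f"LRT = {lrt:.2f}, p = {p:.3g}")
```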

*--reml-no-lrt*

Turn off the LRT.

*--prevalence 0.01*

Specify the disease prevalence for a case-control study. Once this option is specified, GCTA will transform the estimate of variance explained, V(1)/Vp, on the observed scale to that on the underlying (liability) scale, V(1)/Vp_L. The prevalence should be taken from general-population estimates in the literature rather than estimated from the sample.
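
To make the scale change concrete, here is a minimal sketch of a commonly used observed-to-liability transformation for ascertained case-control samples. It is an assumption that this matches what GCTA computes internally; the function name and the `case_fraction` argument (the proportion of cases in the sample) are hypothetical.

```python
from statistics import NormalDist


def observed_to_liability(h2_observed, prevalence, case_fraction):
    """Transform a heritability estimate from the observed to the liability scale.

    prevalence: population prevalence K of the disease
    case_fraction: proportion P of cases in the analysed sample
    """
    std_normal = NormalDist()
    # Liability threshold that cuts off the upper K tail of a standard normal
    threshold = std_normal.inv_cdf(1.0 - prevalence)
    # Height of the standard normal density at the threshold
    z = std_normal.pdf(threshold)
    k, p = prevalence, case_fraction
    return h2_observed * (k * (1.0 - k)) ** 2 / (z ** 2 * p * (1.0 - p))


# Example: prevalence of 1% with an equal number of cases and controls
print(round(observed_to_liability(0.2, prevalence=0.01, case_fraction=0.5), 3))
```

Because the factor depends strongly on the prevalence, an inaccurate prevalence value translates directly into an inaccurate liability-scale estimate.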

## References

## Command interface

## Global parameter settings

In [None]:
[global]
# the output directory for generated files
parameter: cwd = path
# Calculated genetic relationship matrix
parameter: grm = path
# Phenotype file
parameter: phenoFile = path
# Disease prevalence; set to a non-zero value when analyzing a binary (case-control) trait
parameter: prevalence = 0.0
# Phenotype to be analyzed (specify the column)
parameter: phenoCol = list
# Covariate file path. Will use phenoFile if empty
parameter: covarFile = path('.')
# Summary statistics format file path used for unifying output column names. Will not unify names if empty
parameter: formatFile = path('.')
# Qualitative covariates to be used in the analysis
parameter: covarCol = []
# Quantitative covariates to be used in the analysis
parameter: qCovarCol = []
# Specific number of threads to use
parameter: numThreads = 1
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
parameter: mem = '240G'
parameter: walltime = '240h'
# The container with the lmm software. Can be either a dockerhub image or a singularity `sif` file.
# Default is set to using dockerhub image
parameter: container = 'statisticalgenetics/lmm:2.4'
if not covarFile.is_file():
    covarFile = phenoFile
cwd = path(f"{cwd:a}")

## Illustration of minimal working example

On a minimal working example (MWE) dataset,
```
sos run GREML.ipynb greml \
    --cwd output \
    --phenoFile phenotype_greml.txt \
    --phenoCol f3393 \
    --covarCol sex \
    --qCovarCol `echo age {PC1..10}` \
    --grm common_var.grm.bin \
    --prevalence 0.0
```
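
Once the run finishes, the estimates can be pulled out of the resulting `.hsq` file. The sketch below assumes the file is tab-delimited with a `Source`, `Variance`, `SE` header followed by summary rows that carry a single value; the numbers in `SAMPLE_HSQ` are placeholders that only illustrate the layout, not results from this dataset, and `parse_hsq` is a hypothetical helper.

```python
def parse_hsq(text):
    """Map each row label of a .hsq file to a (value, standard error) pair."""
    rows = {}
    for line in text.strip().splitlines()[1:]:  # skip the header line
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # ignore blank or malformed lines
        value = float(fields[1])
        # Summary rows (e.g. logL, n) are assumed to have no SE column
        se = float(fields[2]) if len(fields) > 2 and fields[2] else None
        rows[fields[0]] = (value, se)
    return rows


# Placeholder content in the assumed layout (not real results)
SAMPLE_HSQ = "\n".join([
    "Source\tVariance\tSE",
    "V(G)\t0.40\t0.16",
    "V(e)\t0.60\t0.16",
    "Vp\t1.00\t0.03",
    "V(G)/Vp\t0.40\t0.16",
    "logL\t-950.0",
    "n\t1000",
])

h2, h2_se = parse_hsq(SAMPLE_HSQ)["V(G)/Vp"]
print(f"h2 = {h2} (SE {h2_se})")
```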

# GREML workflow implementation

In [None]:
[greml]
# extract and prepare phenotype & covariate files
import pandas as pd
# Read every column as a string so that IDs and values are written back unchanged
dat = pd.read_csv(phenoFile, header=0, sep=r'\s+', dtype=str)
# Write missing values as "NA"
dat = dat.fillna("NA")
if len(phenoCol) > 0:
    dat.to_csv(f"{cwd}/{phenoFile:bn}.gcta_phenotype", sep=' ', index=False, columns = ['FID', 'IID'] + phenoCol)
# Covariate files are named after covarFile so that they match the paths passed to gcta64
dat = pd.read_csv(covarFile, header=0, sep=r'\s+', dtype=str)
dat = dat.fillna("NA")
if len(covarCol) > 0:
    dat.to_csv(f"{cwd}/{covarFile:bn}.gcta_covar", sep=' ', index=False, columns = ['FID', 'IID'] + covarCol)
if len(qCovarCol) > 0:
    dat.to_csv(f"{cwd}/{covarFile:bn}.gcta_qcovar", sep=' ', index=False, columns = ['FID', 'IID'] + qCovarCol)
# Optional starting values passed to --reml-priors
parameter: reml_priors = ''
input: grm
output: f'{cwd}/{_input:bnn}.{phenoFile:bn}.hsq' 
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container=container, expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
        
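# Optional flags are only emitted when the corresponding parameter is set
gcta64 \
    --reml \
    --grm ${grm:nn} \
    --pheno ${cwd}/${phenoFile:bn}.gcta_phenotype \
    ${('--prevalence %s' % prevalence) if prevalence > 0 else ''} \
    ${('--qcovar %s/%s.gcta_qcovar' % (cwd, covarFile.stem)) if qCovarCol else ''} \
    ${('--covar %s/%s.gcta_covar' % (cwd, covarFile.stem)) if covarCol else ''} \
    --thread-num ${numThreads} \
    --out ${_output:n} \
    ${('--reml-priors %s' % reml_priors ) if reml_priors else ''} \
    --reml-no-lrt \
    --reml-no-constrain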

In [None]:
[grm_processing]
parameter: keep_samples = path('.')
parameter: name = ""
input: grm
output: f'{cwd}/{_input:bnn}.{name}.grm.bin' 
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container=container, expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
        
gcta64 \
    --grm ${grm:nn}\
    --keep ${keep_samples} \
    --make-grm \
    --thread-num ${numThreads} \
    --out ${_output:nn}
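
The file given to `keep_samples` is passed straight to `gcta64 --keep`. A minimal sketch for producing such a file is shown below, assuming the expected layout is one sample per line with the family ID followed by the individual ID and no header; the helper name `write_keep_file` and the sample IDs are hypothetical.

```python
def write_keep_file(samples, out_path):
    """Write (FID, IID) pairs, one per line, and return the number written."""
    count = 0
    with open(out_path, "w") as handle:
        for fid, iid in samples:
            # Whitespace-separated family ID and individual ID, no header line
            handle.write(f"{fid}\t{iid}\n")
            count += 1
    return count


# Hypothetical sample IDs used only to show the layout
n_written = write_keep_file([("fam1", "ind1"), ("fam2", "ind2")], "keep_samples.txt")
print(f"wrote {n_written} samples")
```

The resulting file could then be supplied to the `grm_processing` step through `--keep_samples keep_samples.txt`, together with a `--name` for the output GRM.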