
Command line usage

Once the package has been installed with pip, it provides two command line entry points: the main command, indirect-GWAS, and a helper command, compute-feature-partial-covariance. Both commands document their options via the -h/--help flag (e.g. indirect-GWAS -h).

indirect-GWAS

This is the main command. It requires the user to provide three key pieces of information:

  1. Projection coefficients
  2. GWAS summary statistics for the feature phenotypes
  3. Feature phenotype partial covariance matrix

Projection coefficients are given using -P/--projection-coefficients.

The argument should point to a delimited file (e.g. .csv, .csv.gz, .tsv). Rows of this file correspond to features and columns correspond to projections. Unless --no-index is given, the first column should contain the feature names and the first row (header) should contain the projection names.

For example, the following is a valid file:

feature,projection_1,projection_2
feature_1,1,1
feature_2,2,-1
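
These coefficients define each projection as a linear combination of the feature phenotypes (here, projection_1 = feature_1 + 2 × feature_2 and projection_2 = feature_1 - feature_2). The following is a minimal sketch of what the coefficients encode, assuming hypothetical file names; note that indirect GWAS itself never needs the individual-level feature values:

import pandas as pd

# Projection coefficients: rows are features, columns are projections
P = pd.read_csv("projection_coefficients.csv", index_col=0)

# Individual-level feature phenotypes (samples x features); hypothetical file
features = pd.read_csv("feature_phenotypes.csv", index_col=0)

# Each projected phenotype is a linear combination of the features,
# e.g. projection_1 = 1 * feature_1 + 2 * feature_2
projections = features @ P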

GWAS summary statistics

This argument takes one or more files giving GWAS summary statistics on the feature phenotypes. The names of these files should correspond to the feature names. These files must contain, at minimum, the following columns: a variant identifier, the coefficient estimates, and the standard errors. This argument must be paired with additional arguments that specify the relevant column names.

The GWAS summary statistics must also provide the degrees of freedom for these regressions. A few different options are available:

  1. A constant value for every GWAS regression (use this only if the degrees of freedom are the same for every feature and variant) (--degrees-of-freedom)

  2. A column in the summary statistics files giving the degrees of freedom (--degrees-of-freedom-column)

  3. Via the sample size (N) and the number of covariates (K), in which case DF = N - K - 1:

    a. Sample size is either a constant for all (--equal-sample-size) or a column in the summary statistic files (--sample-size-column)

    b. Number of covariates is either a constant for all (--equal-number-of-covariates) or a column in the summary statistic files (--number-of-covariates-column)

For example, given the following file:

variant_id,beta,se,n_samples,n_covar
rs123456,2.1,1.0,10000,5
...

the following arguments would be appropriate:

--variant-id-column variant_id
--coefficient-column beta
--standard-error-column se
--sample-size-column n_samples
--number-of-covariates-column n_covar
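
Under option 3, the degrees of freedom for each variant are then derived as DF = N - K - 1. A minimal sketch of that calculation, assuming the (hypothetical) file above is named feature_1.csv:

import pandas as pd

# Hypothetical feature summary statistics file with the columns shown above
sumstats = pd.read_csv("feature_1.csv")

# Option 3: DF = N - K - 1; for rs123456 this gives 10000 - 5 - 1 = 9994
degrees_of_freedom = sumstats["n_samples"] - sumstats["n_covar"] - 1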

Feature phenotype partial covariance matrix

This is the covariance matrix (features x features) of the feature phenotypes after they have been adjusted for covariates. To remove the effects of covariates, regress each feature phenotype on the covariates (phenotype ~ covariates), compute the predicted phenotypes ($\hat{y}$), compute the residuals ($y - \hat{y}$), then compute the covariance matrix of these residuals across all feature phenotypes.

The following is a valid file:

feature_id,feature_1,feature_2
feature_1,1.0,0.55
feature_2,0.55,0.999

When individual-level data are available and all regressions are linear models, this matrix can be computed easily using compute-feature-partial-covariance or any statistical software. For linear mixed models, use the residuals if the method provides them, or refer to the documentation/outputs of the specific method. See the section below on using SAIGE summary statistics for one example.
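
For linear models, a rough sketch of that residualize-then-covary procedure in Python (file and column names here are hypothetical) looks like this:

import pandas as pd
import statsmodels.api as sm

# Hypothetical individual-level data with feature phenotypes and covariates
data = pd.read_csv("individual_level_data.csv")
features = ["feature_1", "feature_2"]
covariates = ["covariate_1", "covariate_2"]

# Regress each feature phenotype on the covariates (plus an intercept)
# and keep the residuals y - y_hat
X = sm.add_constant(data[covariates])
residuals = pd.DataFrame({
    feature: sm.OLS(data[feature], X).fit().resid for feature in features
})

# Feature phenotype partial covariance matrix (features x features)
feature_partial_covariance = residuals.cov()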

compute-feature-partial-covariance

This is a simple helper script that computes the feature phenotype partial covariance matrix when individual-level data are available. The phenotype and covariate files are specified with --pheno and --covar (optional); if --covar is not given, the covariates are assumed to be in the --pheno file. Sample ID columns are specified with --sample-id-column (default: #FID, IID). The names of the phenotype and covariate columns are specified with --pheno-name and --covar-name; when these are not provided, they default to all columns other than the sample IDs. Note that either --covar or --covar-name must be specified so that covariates can be distinguished from phenotypes.

An intercept can be added to the covariates with --add-intercept.

This script shares several options with indirect-GWAS, such as --separator, --float-format, and --output.
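
For example, with phenotypes and covariates in separate files (file names here are hypothetical), arguments along the following lines would be appropriate:

--pheno phenotypes.tsv
--covar covariates.tsv
--add-intercept
--output feature_partial_covariance.csv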

Python usage

Overall, this package can be used in two simple steps:

  1. Load the data into an xarray dataset (using one of the three from_* functions or manually)
  2. Compute the indirect GWAS summary statistics using indirect_GWAS.gwas_indirect

For example:

import indirect_GWAS
import pandas as pd

# Load data into Python
gwas_coef_df = pd.read_csv("gwas_coef.csv", index_col=0)               # variants x features
gwas_stderr_df = pd.read_csv("gwas_stderr.csv", index_col=0)           # variants x features
gwas_sample_size_df = pd.read_csv("gwas_sample_size.csv", index_col=0) # variants x features
feature_partial_cov = pd.read_csv("partial_cov.csv", index_col=0)      # features x features

projection_coefficients = pd.DataFrame(
    [[1], [-1], [5], [0.5]], index=gwas_coef_df.columns)    # features x projections

# Load data into xarray
ds = indirect_GWAS.from_summary_statistics(
    projection_coefficients=projection_coefficients,
    feature_partial_covariance=feature_partial_cov,
    feature_gwas_coefficients=gwas_coef_df,
    feature_gwas_standard_error=gwas_stderr_df,
    feature_gwas_sample_size=gwas_sample_size_df,
    n_covar=10,  # Number of covariates in each feature GWAS (including the intercept)
)

# Compute indirect GWAS summary statistics
beta, se, t_stat, p = indirect_GWAS.gwas_indirect(ds)

Using SAIGE summary statistics

Indirect GWAS can be used with SAIGE only when the covariance of the Step 1 residuals is known or can be computed. See the SAIGE documentation for more information about this.

In short, if you run step1_fitNULLGLMM.R, SAIGE produces an RData file with the suffix .rda. This file contains an object, modglmm, whose residuals element (modglmm$residuals) holds the residuals needed to compute the feature phenotype partial covariance matrix. Here is a short snippet of R code showing how you could compute this matrix given a directory of SAIGE step 1 output files.

library(tidyverse)

rda_files <- list.files(".", "\\.rda$", full.names = TRUE)

full_df <- data.frame()

for (file in rda_files) {
    env <- new.env()
    load(file, envir = env)
    full_df <- bind_rows(
        full_df, 
        data.frame(IID = env$modglmm$sampleID, resid = env$modglmm$residuals,
                   pheno = str_replace(basename(file), "\\.rda$", ""))
    )
}

wide_resid_df <- full_df %>%
    pivot_wider(id_cols = IID, names_from = pheno, values_from = resid)

# Drop the sample ID column before computing the covariance matrix
feature_partial_covariance_matrix <- wide_resid_df %>%
    select(-IID) %>%
    cov()

Illustration of the method

For simplicity we'll use the mtcars dataset here. Suppose that we are interested in the projected phenotype hp - mpg, with vs as the genotype variable and gear as an adjusted covariate.

The direct regression equivalent of this is the following:

import pandas as pd
import statsmodels.api as sm

# Load data
mtcars = sm.datasets.get_rdataset("mtcars", "datasets").data

# Perform the direct regression
projection = mtcars["hp"] - mtcars["mpg"]
X = mtcars[["vs", "gear"]].assign(const=1.)
direct_reg = sm.OLS(projection, X).fit()

# Print summary information
direct_summary = pd.DataFrame({
    "coef": direct_reg.params, "se": direct_reg.bse,
    "t_stat": direct_reg.tvalues, "p": direct_reg.pvalues
}).loc["vs"]

Returns

vs
coef -106.103
se 18.8541
t_stat -5.62757
p 4.44445e-06

The indirect version of this regression uses only hp and mpg summary information.

import indirect_GWAS

# Perform feature regressions first
feature_regs = {y: sm.OLS(mtcars[y], X).fit() for y in ["hp", "mpg"]}
feature_coef = pd.DataFrame({y: [reg.params["vs"]] for y, reg in feature_regs.items()})
feature_stderr = pd.DataFrame({y: [reg.bse["vs"]] for y, reg in feature_regs.items()})
# Partial covariance is computed from residuals after regressing out only the covariates
feature_partial_cov = pd.DataFrame({
    y: sm.OLS(mtcars[y], X[["gear", "const"]]).fit().resid
    for y in feature_regs
}).cov()

# Setup the indirect GWAS dataset
ds = indirect_GWAS.from_summary_statistics(
    projection_coefficients=pd.DataFrame([[1], [-1]], index=["hp", "mpg"]),
    feature_partial_covariance=feature_partial_cov,
    feature_gwas_coefficients=feature_coef,
    feature_gwas_standard_error=feature_stderr,
    feature_gwas_sample_size=mtcars.shape[0],
    n_covar=2,  # (includes intercept)
)

# Perform indirect GWAS on the projection
beta, se, t_stat, p = indirect_GWAS.gwas_indirect(ds)

# Print summary information
indirect_summary = pd.Series({
    "coef": beta.item(), "se": se.item(), "t_stat": t_stat.item(), "p": p.item()})

Returns

vs
coef -106.103
se 18.8541
t_stat -5.62757
p 4.44445e-06

These results are identical to 12 decimal places.

assert (direct_summary - indirect_summary).abs().max() < 1e-12