Pipeline for automatic integration of open chromatin regions (peaks) and GWAS data.

shooshtarilab/ochroGWAS

ochroGWAS

This page contains data related to the following paper:

Title:

Single-cell chromatin accessibility data combined with GWAS improves detection of relevant cell types in 59 complex phenotypes

Authors:

Akash Chandra Das (akashchandra@iitg.ac.in), Aidin Foroutan (aidin.foroutan@uwo.ca), Brian Qian (bqian7@uwo.ca), Nader Hosseini Naghavi (nhosse2@uwo.ca), Kayvan Shabani (kshaban2@uwo.ca), Parisa Shooshtari (pshoosh@uwo.ca)

This repository contains the pipeline for integrating genome-wide association study (GWAS) data with open chromatin region (peak) data using linkage disequilibrium score regression (LDSC). These procedures require high-performance computing, which in this project was done on Compute Canada. All bash scripts are therefore customised to run on Compute Canada clusters. The first 4 lines of each bash script tell the cluster which resources to allocate to the job, while the next three lines create a virtual environment for the job with the versions of python and bedtools required by ldsc. The scripts are commented to explain the various steps involved. hg38 has been used for all files. For more information on how the ldsc python scripts work, please refer to https://github.com/bulik/ldsc/wiki.

For all the following steps, we used a bulk-sequencing dataset (https://dhs.ccm.sickkids.ca/) and a single-cell ATAC sequencing dataset (https://www.cell.com/cell/fulltext/S0092-8674(21)01279-4) and integrated them with GWAS data (https://www.nature.com/articles/s41588-021-00931-x). The peak files and ".sumstats" files required for the following steps need to be downloaded from the resources mentioned above.

1. Processing

Required files:

  1. Simple BED files containing open chromatin region base-pair locations
    1. Format: CHRX -- Start -- End
    2. IMPORTANT: Ensure that each cell type has a single .bed file containing its peaks for the entire genome, and that the file name contains no spaces (e.g. CD14.positive.monocyte.bed).
  2. PLINK .bim files
  3. Hapmap3 SNPs file
  4. Baseline LD file for reference genome
  5. Hapmap3 weights excluding the HLA region (no-HLA weights)
  6. GWAS .sumstats file
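
Before running the pipeline, the BED requirements above can be checked with a short script. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and the check that chromosome names start with "chr" is an assumption based on the CHRX format listed above.

```python
import re


def check_bed_name(filename):
    """Return True if the name ends in .bed and contains no whitespace."""
    return filename.endswith(".bed") and not re.search(r"\s", filename)


def check_bed_line(line):
    """Return True if a line starts with chromosome, start, end and start < end."""
    fields = line.split()
    if len(fields) < 3:
        return False
    chrom, start, end = fields[:3]
    # Coordinates must be non-negative integers.
    if not (start.isdigit() and end.isdigit()):
        return False
    # Assumption: chromosome names carry a "chr" prefix (e.g. chr1, chrX).
    return chrom.lower().startswith("chr") and int(start) < int(end)
```

A file passes if its name satisfies check_bed_name and every non-empty line satisfies check_bed_line.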

Step 1 → Step1_Annotations.sh

Uses make_annot.py to make annotation files for all cell types (.bed files).

Inputs:

--bed-file <bedfiles directory> → Peak data directory
--bimfile <PLINK directory> → PLINK directory

Outputs:

--annot-file <output directory> → Output directory for the annotation files

Function:

  • Creates a binary annotation file for each chromosome of every cell type, marking which SNPs fall within the open chromatin regions. A SNP is represented by 1 if it lies in an open chromatin region and by 0 otherwise.
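
The idea behind the binary annotation can be illustrated with a small, self-contained sketch. This is not how make_annot.py is implemented; it is a simplified stand-in that assumes 1-based SNP positions (as in a .bim file) and BED-style peaks (0-based start, exclusive end) on a single chromosome. The function names are hypothetical.

```python
import bisect


def merge_intervals(peaks):
    """Merge overlapping (start, end) half-open intervals on one chromosome."""
    merged = []
    for start, end in sorted(peaks):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def annotate_snps(snp_positions, peaks):
    """Return 1 for each 1-based SNP position that lies inside a peak, else 0."""
    merged = merge_intervals(peaks)
    starts = [start for start, _ in merged]
    annot = []
    for pos in snp_positions:
        zero_based = pos - 1  # convert the 1-based SNP position to BED coordinates
        # Index of the last merged peak whose start is <= the SNP position.
        i = bisect.bisect_right(starts, zero_based) - 1
        inside = i >= 0 and zero_based < merged[i][1]
        annot.append(1 if inside else 0)
    return annot
```

For example, annotate_snps([101, 301], [(100, 300)]) marks the first SNP as inside the peak and the second as outside.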

Step 2 → Step2_LDSC.sh

Uses ldsc.py to calculate LD scores from the annotation files. The required inputs are the PLINK files passed to --bfile (.bed, .bim and .fam), the annotation files from the previous step and the Hapmap3 SNPs file.

Inputs:

--bfile <PLINK directory> → PLINK directory
--annot <annotation directory> → Annotation directory from previous step
--print-snps <hapmap3 SNPs directory> → Hapmap3 SNP directory

Outputs:

--out <output directory> → Output directory for the generated LD scores

Function:

  • Runs LDSC to generate LD score files for each of the listed cell types.
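
As a rough illustration of how the inputs and outputs above fit together, the sketch below assembles one ldsc.py call for a single chromosome of a single cell type. It is not taken from Step2_LDSC.sh. The --l2, --ld-wind-cm and --thin-annot flags follow the partitioned LD score example on the LDSC wiki and are assumptions about what the script passes; the function name and all file paths are hypothetical.

```python
def ldscore_command(bfile_prefix, annot_file, snp_list, out_prefix):
    """Build the argument list for one LD score run (one chromosome, one cell type)."""
    return [
        "python", "ldsc.py",
        "--l2",                    # LD score estimation mode (from the LDSC wiki)
        "--bfile", bfile_prefix,   # PLINK prefix shared by the .bed/.bim/.fam files
        "--ld-wind-cm", "1",       # LD window, value taken from the LDSC wiki example
        "--annot", annot_file,     # annotation file produced in Step 1
        "--thin-annot",            # assumption: annotation holds only the annotation column
        "--out", out_prefix,       # prefix for the LD score output files
        "--print-snps", snp_list,  # Hapmap3 SNP list
    ]


# Hypothetical paths for chromosome 22 of one cell type.
example = ldscore_command(
    "plink/chr22",
    "annot/CD14.positive.monocyte.22.annot.gz",
    "hapmap3/hm.22.snp",
    "ldscores/CD14.positive.monocyte.22",
)
```

The resulting list can be passed to subprocess.run once the paths point at real files.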

Step 3 → Step3_GWAS.sh

(Prerequisite)

An important prerequisite for this step is generating the ".ldct" file. This file lists each cell type together with the location of its LD score files. Step3(pre)_CreateLDCT.py can be used for this purpose, or the file can be created manually. The format is as follows:

CellType1 ~/ldscores/CellType1.
CellType2 ~/ldscores/CellType2.

and so on for all cell types.
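
For readers who prefer to generate the file themselves, the format shown above can be produced with a few lines of Python. This is a minimal sketch and not the contents of Step3(pre)_CreateLDCT.py; the function names are hypothetical, and it assumes the LD score files of each cell type share the prefix <directory>/<cell type>. as in the example above.

```python
def build_ldct_lines(cell_types, ldscore_dir):
    """Return one line per cell type: its name, a space, then its LD score prefix."""
    base = ldscore_dir.rstrip("/")
    # The trailing "." matches the format shown above.
    return ["{0} {1}/{0}.".format(name, base) for name in cell_types]


def write_ldct(cell_types, ldscore_dir, out_path):
    """Write the .ldct file to out_path."""
    with open(out_path, "w") as handle:
        handle.write("\n".join(build_ldct_lines(cell_types, ldscore_dir)) + "\n")
```

For example, write_ldct(["CellType1", "CellType2"], "~/ldscores", "celltypes.ldct") writes the two lines shown above.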

GWAS Integration

Integrates the GWAS sumstats data with the LD scores from the previous step. Below are the inputs and outputs for Step3_GWAS.sh.

Inputs:

--h2-cts <SUMSTATS file> → .sumstats file of the phenotype to be analysed
--ref-ld-chr <Baseline LD> → Baseline LD files for reference genome
--ref-ld-chr-cts <ldct file> → LDCT file for reference (mentioned above)
--w-ld-chr <No HLA weights> → No HLA weights for Hapmap3 directory

Outputs:

--out <Output directory> → Directory to output files for generated p-values

Function:

  • Integrates GWAS with previously generated LD score files and outputs p-values.

Further Steps

The final outputs of the above steps are text files containing the p-values of association of all cell types for each GWAS. Each file is named after the GWAS used for integration and lists all the cell types in increasing order of p-value. These p-values should be adjusted using the Benjamini-Hochberg correction with an FDR threshold of 0.05, and the same should be done for every phenotype. Once that has been done, the results can be concatenated and visualised, for example as heatmaps.
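
The Benjamini-Hochberg adjustment can be sketched in plain Python as below. This is an illustrative implementation under the assumption that the correction is applied to the cell-type p-values of one GWAS at a time; the function names are hypothetical and it is not the code used to produce the results in the paper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant(p_values, fdr=0.05):
    """Flag the cell types whose adjusted p-value is at or below the FDR threshold."""
    return [q <= fdr for q in benjamini_hochberg(p_values)]
```

Calling significant on the p-values of one GWAS returns one flag per cell type, in the same order as the input.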

2. Visualisation

The R scripts in the Scripts folder can be run to visualise the analysis and generate the figures in the paper. The scripts follow the order of the figures, and the input files they require are in the Input_Data folder. The input files and scripts are named after the figure they are used for; for example, Figure3_immune_cells_adult.csv and Figure3_immune_cells_fetal.csv are used by Figure3_immune_cells_script.R to generate Figure3.png. The same convention applies to the rest. The outputs that can be generated (the main and supplementary figures of the paper) are in the Output_Figures folder.

3. Supplementary Tables

This folder contains the supplementary tables mentioned in the paper. Table 1 contains information about the GWAS phenotypes considered in this study and their cases and controls (obtained from the supplementary data provided by Sakaue et al. in their paper titled "A cross-population atlas of genetic associations for 220 human phenotypes"). An additional column, the % of cases/control, was added in this study to give an idea of the variation in the data that was gathered. Table 2 contains information about the single-cell cell types, the number of nuclei that were studied (obtained from the study by Zhang et al. titled "A single-cell atlas of chromatin accessibility in the human genome"), and the number of peaks observed per cell type. It also classifies each cell type into the categories used in this study.
