scHiC 2.0: Sequence and analysis pipeline of single-cell Hi-C datasets

Introduction

This repository provides the code used to process and analyze single-cell Hi-C libraries in the paper: Cell-cycle dynamics of chromosomal organisation at single-cell resolution by Nagano and Lubling et al., Nature, 2017.

It is composed of two main parts (see more details below):

Sequence processing: relevant code under the map3c folder. It processes a de-multiplexed input fastq paired-end reads files into a list of contacts.
Single-cell analysis: relevant code under the analysis folder. It contains the code that builds the data and creates the figures that appear in the paper.

Requirements

Perl (with module List::MoreUtils)
bowtie2
R with these packages:
- misha (see below how to install it)
- shaman
- ggplot2
- plyr
- dplyr
- tidyr
- KernSmooth
- RColorBrewer

Installing a misha genomic database##

misha is an R package for genomic analysis, developed by the Tanay lab, and used throughout this repository. To install the package type in your R session:

#!r
install.packages("https://schic2.s3.eu-west-1.amazonaws.com

/schic2/misha_3.5.6.tar.gz", repos=NULL)

We supply an mm9 genomic database with the genomic and epigenetic data required to run the sequence pipeline and the downstream analysis. Download and unpack this archive (5.5 Gb after unpacking). The trackdb folder in this tar is the root directory of the genomic database, later referred to as groot.

Sequence Processing

Processing starts with a pair of fastq files for each cell. The multiplexed fastq files and the cells' indices that allow de-multiplexation are available in GEO under accession GSE94489. Briefly, Each multiplexed sequencing run comprises 4 FASTQ files:

Forward Sequencing Reads
Reverse Sequencing Reads
Barcodes 1 (8bps)
Barcodes 2 (8bps)

The order of the reads in the FASTQ files correspond to one another. The relevant barcode information can be obtained from the first read in the two barcode files. Reads with unexpected barcodes are written to a single additional file. For more details see the perl script split_barcodes (note that it is tailored made to run in Babaraham's cluster environment).

Detailed instructions how to use use the complete pipeline are available in the map3c directory README The final step of the pipeline is to upload the contact map of a cell into the genomic database. We supply all contact maps below to allow you to spare you from rerunning the sequence processing step if you prefer:

Cells	Condition	Batches	Link	Size
Haploids	2i	All	gz	1.2 Gb
Haploids	Serum	All	gz	842 Mb
Diploids	2i	1CDU	gz	461 Mb
Diploids	2i	1CDX1	gz	511 Mb
Diploids	2i	1CDX2	gz	618 Mb
Diploids	2i	1CDX3	gz	618 Mb
Diploids	2i	1CDX4	gz	779 Mb
Diploids	2i	1CDES	gz	589 Mb
Diploids	Serum	1CDS1	gz	831 Mb
Diploids	Serum	1CDS2	gz	1.1 Gb

Assuming you unpack the archives into a directory pointed by data_dir, and you unpack the genomic database into mm9_db, run the following code in R to upload contact maps into the genomic database (update support_sge to TRUE if sge is supported):

#!r
library(misha)

# define the genomic database we're working on
gdb.init(sprintf("%s/trackdb", mm9_db)) 

# get the list of directories to work on
dirs = list.files(data_dir)

# parse dirs into track names
nms = paste0("scell.nextera.", gsub("-", "_", gsub(".", "_", dirs, fixed=T), fixed=T))

# fends file is the list of GATC fragment ends
fends = sprintf("%s/seq/redb/GATC.fends", mm9_db)

# create tracks directory 
dir.create(sprintf("%s/trackdb/tracks/scell/nextera", mm9_db), showWarnings=F, recursive=T)

# uploading contact files to misha
if (support_sge) {
    # build the commands
    commands = paste0("gtrack.2d.import_contacts(\"", nms, "\", \"\", \"", paste(base_dir, dirs, "adj", sep="/"), "\", fends, allow.duplicates=F)", collapse=",")
    
    # submit jobs to the sge cluster
    res = eval(parse(text=paste("gcluster.run2(",commands,")")))
}
else {
  # upload cell by cell
  for (i in seq_along(nms)) {
    gtrack.2d.import_contacts(nms[i], "", sprintf("%s/%s/adj", data_dir, dirs[i]), fends, allow.duplicates=F)
  }
}

Translating contact maps into genomic coordinates

Each row in the above supplied contact maps represents a single contact. The columns in the files are fend1, fend2 and count, where fend1/2 are the IDs of the interacting fragment ends. You can use the GATC.fends file (also available in the genomic database supplied above under the seq/redb directory) to translate the fend ids into genomic coordinates. The GATC.fends file maps each fend ID (first column) to chromosome and coordinate (2nd and 3rd columns).

Analyse Single-Cell Hi-C datasets

Once the contact maps of the cells are uploaded to the genomic database, you can start their analysis. All code and parameter files are supploid under the analysis directory.

We supply configuration files for the diploids and haploids datasets under analysis/config. Make sure to update the paths in those files before you issue any command (look for the "UPDATE REQUIRED" to easily find them).

Run the following commands to build the data and generate all figures (open the R session when you're in the analysis directory):

#!r

# load the code
source("paper.r")

# build the tables (need to run once, submits jobs to the sge cluster, takes a while). Supply the diploid (hyb) or haploid (hap) parameters file
sch_build_all(spec_params_fn="config/hyb_2i_params.r")

# After sch_build_all finished, subsequent R sessions can load the tables:
sch_load_tables(spec_params_fn="config/hyb_2i_params.r")

# Once tables are built, finish building the data (need to run once, submits jobs to the sge cluster, takes a while)
build_stats()

# Generate figures
paper_figs()

Metadata information on the cells is available in the features table at GSE94489. It contains experimental information on each cell, virtual sorting data and the features that were computed and used in the paper.

Who do I talk to?

For help, please contact yaniv.lubling@weizmann.ac.il

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
analysis		analysis
map3c		map3c
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

scHiC 2.0: Sequence and analysis pipeline of single-cell Hi-C datasets

Introduction

Requirements

Installing a misha genomic database##

Sequence Processing

Translating contact maps into genomic coordinates

Analyse Single-Cell Hi-C datasets

Who do I talk to?

About

Releases

Packages

Contributors 3

Languages

tanaylab/schic2

Folders and files

Latest commit

History

Repository files navigation

scHiC 2.0: Sequence and analysis pipeline of single-cell Hi-C datasets

Introduction

Requirements

Installing a misha genomic database##

Sequence Processing

Translating contact maps into genomic coordinates

Analyse Single-Cell Hi-C datasets

Who do I talk to?

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages