This repository provides the code used to process and analyze single-cell Hi-C libraries in the paper: Cell-cycle dynamics of chromosomal organisation at single-cell resolution by Nagano and Lubling et al., Nature, 2017.
It is composed of two main parts (see more details below):
- Sequence processing: relevant code under the map3c folder. It processes a de-multiplexed input fastq paired-end reads files into a list of contacts.
- Single-cell analysis: relevant code under the analysis folder. It contains the code that builds the data and creates the figures that appear in the paper.
- Perl (with module List::MoreUtils)
- bowtie2
- R with these packages:
- misha (see below how to install it)
- shaman
- ggplot2
- plyr
- dplyr
- tidyr
- KernSmooth
- RColorBrewer
misha is an R package for genomic analysis, developed by the Tanay lab, and used throughout this repository. To install the package type in your R session:
#!r
install.packages("https://schic2.s3.eu-west-1.amazonaws.com
/schic2/misha_3.5.6.tar.gz", repos=NULL)
We supply an mm9 genomic database with the genomic and epigenetic data required to run the sequence pipeline and the downstream analysis. Download and unpack this archive (5.5 Gb after unpacking). The trackdb folder in this tar is the root directory of the genomic database, later referred to as groot.
Processing starts with a pair of fastq files for each cell. The multiplexed fastq files and the cells' indices that allow de-multiplexation are available in GEO under accession GSE94489. Briefly, Each multiplexed sequencing run comprises 4 FASTQ files:
- Forward Sequencing Reads
- Reverse Sequencing Reads
- Barcodes 1 (8bps)
- Barcodes 2 (8bps)
The order of the reads in the FASTQ files correspond to one another. The relevant barcode information can be obtained from the first read in the two barcode files. Reads with unexpected barcodes are written to a single additional file. For more details see the perl script split_barcodes (note that it is tailored made to run in Babaraham's cluster environment).
Detailed instructions how to use use the complete pipeline are available in the map3c directory README The final step of the pipeline is to upload the contact map of a cell into the genomic database. We supply all contact maps below to allow you to spare you from rerunning the sequence processing step if you prefer:
Cells | Condition | Batches | Link | Size |
---|---|---|---|---|
Haploids | 2i | All | gz | 1.2 Gb |
Haploids | Serum | All | gz | 842 Mb |
Diploids | 2i | 1CDU | gz | 461 Mb |
Diploids | 2i | 1CDX1 | gz | 511 Mb |
Diploids | 2i | 1CDX2 | gz | 618 Mb |
Diploids | 2i | 1CDX3 | gz | 618 Mb |
Diploids | 2i | 1CDX4 | gz | 779 Mb |
Diploids | 2i | 1CDES | gz | 589 Mb |
Diploids | Serum | 1CDS1 | gz | 831 Mb |
Diploids | Serum | 1CDS2 | gz | 1.1 Gb |
Assuming you unpack the archives into a directory pointed by data_dir, and you unpack the genomic database into mm9_db, run the following code in R to upload contact maps into the genomic database (update support_sge to TRUE if sge is supported):
#!r
library(misha)
# define the genomic database we're working on
gdb.init(sprintf("%s/trackdb", mm9_db))
# get the list of directories to work on
dirs = list.files(data_dir)
# parse dirs into track names
nms = paste0("scell.nextera.", gsub("-", "_", gsub(".", "_", dirs, fixed=T), fixed=T))
# fends file is the list of GATC fragment ends
fends = sprintf("%s/seq/redb/GATC.fends", mm9_db)
# create tracks directory
dir.create(sprintf("%s/trackdb/tracks/scell/nextera", mm9_db), showWarnings=F, recursive=T)
# uploading contact files to misha
if (support_sge) {
# build the commands
commands = paste0("gtrack.2d.import_contacts(\"", nms, "\", \"\", \"", paste(base_dir, dirs, "adj", sep="/"), "\", fends, allow.duplicates=F)", collapse=",")
# submit jobs to the sge cluster
res = eval(parse(text=paste("gcluster.run2(",commands,")")))
}
else {
# upload cell by cell
for (i in seq_along(nms)) {
gtrack.2d.import_contacts(nms[i], "", sprintf("%s/%s/adj", data_dir, dirs[i]), fends, allow.duplicates=F)
}
}
Each row in the above supplied contact maps represents a single contact. The columns in the files are fend1, fend2 and count, where fend1/2 are the IDs of the interacting fragment ends. You can use the GATC.fends file (also available in the genomic database supplied above under the seq/redb directory) to translate the fend ids into genomic coordinates. The GATC.fends file maps each fend ID (first column) to chromosome and coordinate (2nd and 3rd columns).
Once the contact maps of the cells are uploaded to the genomic database, you can start their analysis. All code and parameter files are supploid under the analysis directory.
We supply configuration files for the diploids and haploids datasets under analysis/config. Make sure to update the paths in those files before you issue any command (look for the "UPDATE REQUIRED" to easily find them).
Run the following commands to build the data and generate all figures (open the R session when you're in the analysis directory):
#!r
# load the code
source("paper.r")
# build the tables (need to run once, submits jobs to the sge cluster, takes a while). Supply the diploid (hyb) or haploid (hap) parameters file
sch_build_all(spec_params_fn="config/hyb_2i_params.r")
# After sch_build_all finished, subsequent R sessions can load the tables:
sch_load_tables(spec_params_fn="config/hyb_2i_params.r")
# Once tables are built, finish building the data (need to run once, submits jobs to the sge cluster, takes a while)
build_stats()
# Generate figures
paper_figs()
- Metadata information on the cells is available in the features table at GSE94489. It contains experimental information on each cell, virtual sorting data and the features that were computed and used in the paper.
For help, please contact yaniv.lubling@weizmann.ac.il