ALLHiC: scaffolding an auto polyploid sugarcane genome

To better understand the process of ALLHiC, we provide the example to show how we use ALLHiC to scaffold an auto-tetraploid sugarcane S. spontaneum AP85-441 genome, which has been published in Nature Genetics (https://www.nature.com/articles/s41588-018-0237-2).

Overview of scaffolding of auto-tetraploid sugarcane AP85-441 genome

Basic information of S. spontaneum AP85-441 genome

Accession name: S. spontaneum AP85-441
Ploidy: 1n=4x=32
Estimated genome size: 3.4 Gb
Assembled genome size: 3.13 Gb
Contig N50: 45 kb

Hi-C sequencing information

No. of Hi-C libraries: 4
Restriction enzyme sites: HindIII
Data Size: ~300 Gb
Unique mapped ratio: 11.90%
Validate rate: 88.80%
Dangling End Rate: 11.38%

Hi-C reads mapping and filtering

We use bwa index and samtools faidx to index your draft genme assembly

bwa index -a bwtsw draft.asm.fasta  
samtools faidx draft.asm.fasta

Aligning Hi-C reads to the draft assembly

bwa aln -t 24 draft.asm.fasta reads_R1.fastq.gz > sample_R1.sai  
bwa aln -t 24 draft.asm.fasta reads_R2.fastq.gz > sample_R2.sai  
bwa sampe draft.asm.fasta sample_R1.sai sample_R2.sai reads_R1.fastq.gz reads_R2.fastq.gz > sample.bwa_aln.sam

Filtering SAM file and removing low-quality mapping hits

PreprocessSAMs.pl sample.bwa_aln.sam draft.asm.fasta MBOI
filterBAM_forHiC.pl sample.bwa_aln.REduced.paired_only.bam sample.clean.sam  
samtools view -bt draft.asm.fasta.fai sample.clean.sam > sample.clean.bam

Separate homologous groups to reduce scaffolding complexity

We aimed to solve the problem of scaffolding for each of the homologous groups (HGs) and thus partitioned the 36,560 AP85 contigs that are anchored in the allele table and 94.56 million uniquely mapped reads to 10 pre-set homologous groups according to sorghum chromosomes (SbChr01 to SbChr10).

partition.pl -g Allele.gene.table -r draft.asm.fasta -b sample.clean.bam -d wrk_dir

Now you will find a list of folders named by chromosome names (Chr01 to Chr10) in the wrk_dir directory, with each containing a subsetted bam (sample.clean.bam), a list of contig names/sequences that matched in the same homologous chromosome (ctg.list and seq.fasta)

Build superscaffolds using ALLHiC pipeline

Prune

For each set of homologous chromosome groups, the down-sampled contigs and mapped reads as well as the allele table identified before were used to remove two types of noisy Hi-C signals: 1) inter-alleleic reads and 2) week signals between collapsed regions and phased contigs.

ALLHiC_prune -i Allele.ctg.table -b sample.clean.bam -r seq.fasta

Partition

Subsequently, contigs were partitioned to a user pre-defined number of groups (K value) based on the Hi-C signals extracted from pruned bam file. Since the basic chromosome numbers in S. spontaneum and S. bicolor are different (8 vs 10), large chromosome rearrangement events were expected when scaffolding the genome. We carefully tested K value (number of groups in the partition step) from 4 to 8 and inspected the reasonable results by comparing the AP85 scaffolds with sorghum chromosomes as they have very short divergence time and high level of collinearity was expected.

ALLHiC_partition -r seq.fasta -b prune.bam -e AAGCTT -k 4 -m 25

Rescue (Optional)

After pruning the Hi-C signals, we may have a chance to lose some informatic links. In the Rescue step, ALLHiC will search for contigs that are not involved in partition step from original un-pruned bam files and assigned them to specific clusters according Hi-C signal density.

ALLHiC_rescue -r seq.fasta -b sample.clean.bam -c prunning.clusters.txt -i prunning.counts_AAGCTT.txt

Optimize

The following steps will apply genetic algorithm (GA) to optimize the ordering and orientation of contigs using Hi-C paired-end reads.

allhic extract sample.clean.bam seq.fasta --RE AAGCTT
for K in {1..4};do allhic optimize group${K}.txt sample.clean.clm;done

Build

The build step reconstructs each chromosome by concatenating the contigs, adding gaps between the contigs and generating the final genome release in FASTA format.

ALLHiC_build seq.fasta

Linking super-scaffolds to chromosomal level assembly

After running ALLHiC on each set of homologous chromosome groups, we got 48 super-scaffolds, among which 32 were assigned to specific chromosomes. Hi-C signals among the 48 super-scaffolds provide evidence to join them into chromosomes. Similar to our prune method, we firstly remove inter-allelic Hi-C signals as well as weak links (signal density less than 0.35), and then searches for the best linkage signals for each super-scaffold. Super-scafolds are clustered together if they have mutually best Hi-C signals. This goal can be achived by manual check. We also provide a script, though its not robust and need modification.

perl ALLHiC/script/link_superscaffold.pl

Globally rescue

The strategy we used to assign contigs and mapping Hi-C reads to pre-set of groups based on synteny analysis with a closely related species (here sorghum) indeed reduce the complexity and difficulty, however, will lose contigs which do not have gene annotated. To rescue these lost contigs, ALLHiC rescue calculated the signal density between unplaced contigs and anchored contigs (which have been assigned to a specific group). The unplaced contigs will be concentrated to specific partition for further ordering and orientation if they have best signal density with corresponding clusters.
This step can be achived by re-using ALLHiC_rescue, but need users to prepare a cluster file to indicate which contigs should be assigned to the same chromosome.
Prepare the restriction enzyme sites

allhic extract sample.clean.bam draft.asm.fasta --RE AAGCTT

Globally rescue unplaced contigs

ALLHiC_rescue -r draft.asm.fasta -b sample.clean.bam -c user-prepared.clusters.txt -i sample.clean.counts_AAGCTT.txt
for K in {1..48}; do allhic optimize group${K}.txt sample.clean.clm;done

Provide feedback

Saved searches

Use saved searches to filter your results more quickly