ALLHiC: scaffolding an auto polyploid sugarcane genome
To better understand the process of ALLHiC, we provide the example to show how we use ALLHiC to scaffold an auto-tetraploid sugarcane S. spontaneum AP85-441 genome, which has been published in Nature Genetics (https://www.nature.com/articles/s41588-018-0237-2).
- Accession name: S. spontaneum AP85-441
- Ploidy: 1n=4x=32
- Estimated genome size: 3.4 Gb
- Assembled genome size: 3.13 Gb
- Contig N50: 45 kb
- No. of Hi-C libraries: 4
- Restriction enzyme sites: HindIII
- Data Size: ~300 Gb
- Unique mapped ratio: 11.90%
- Validate rate: 88.80%
- Dangling End Rate: 11.38%
We use bwa index and samtools faidx to index your draft genme assembly
bwa index -a bwtsw draft.asm.fasta
samtools faidx draft.asm.fasta
Aligning Hi-C reads to the draft assembly
bwa aln -t 24 draft.asm.fasta reads_R1.fastq.gz > sample_R1.sai
bwa aln -t 24 draft.asm.fasta reads_R2.fastq.gz > sample_R2.sai
bwa sampe draft.asm.fasta sample_R1.sai sample_R2.sai reads_R1.fastq.gz reads_R2.fastq.gz > sample.bwa_aln.sam
Filtering SAM file and removing low-quality mapping hits
PreprocessSAMs.pl sample.bwa_aln.sam draft.asm.fasta MBOI
filterBAM_forHiC.pl sample.bwa_aln.REduced.paired_only.bam sample.clean.sam
samtools view -bt draft.asm.fasta.fai sample.clean.sam > sample.clean.bam
We aimed to solve the problem of scaffolding for each of the homologous groups (HGs) and thus partitioned the 36,560 AP85 contigs that are anchored in the allele table and 94.56 million uniquely mapped reads to 10 pre-set homologous groups according to sorghum chromosomes (SbChr01 to SbChr10).
partition.pl -g Allele.gene.table -r draft.asm.fasta -b sample.clean.bam -d wrk_dir
Now you will find a list of folders named by chromosome names (Chr01 to Chr10) in the wrk_dir directory, with each containing a subsetted bam (sample.clean.bam), a list of contig names/sequences that matched in the same homologous chromosome (ctg.list and seq.fasta)
- Prune
For each set of homologous chromosome groups, the down-sampled contigs and mapped reads as well as the allele table identified before were used to remove two types of noisy Hi-C signals: 1) inter-alleleic reads and 2) week signals between collapsed regions and phased contigs.
ALLHiC_prune -i Allele.ctg.table -b sample.clean.bam -r seq.fasta
- Partition
Subsequently, contigs were partitioned to a user pre-defined number of groups (K value) based on the Hi-C signals extracted from pruned bam file. Since the basic chromosome numbers in S. spontaneum and S. bicolor are different (8 vs 10), large chromosome rearrangement events were expected when scaffolding the genome. We carefully tested K value (number of groups in the partition step) from 4 to 8 and inspected the reasonable results by comparing the AP85 scaffolds with sorghum chromosomes as they have very short divergence time and high level of collinearity was expected.
ALLHiC_partition -r seq.fasta -b prune.bam -e AAGCTT -k 4 -m 25
- Rescue (Optional)
After pruning the Hi-C signals, we may have a chance to lose some informatic links. In the Rescue step, ALLHiC will search for contigs that are not involved in partition step from original un-pruned bam files and assigned them to specific clusters according Hi-C signal density.
ALLHiC_rescue -r seq.fasta -b sample.clean.bam -c prunning.clusters.txt -i prunning.counts_AAGCTT.txt
- Optimize
The following steps will apply genetic algorithm (GA) to optimize the ordering and orientation of contigs using Hi-C paired-end reads.
allhic extract sample.clean.bam seq.fasta --RE AAGCTT
for K in {1..4};do allhic optimize group${K}.txt sample.clean.clm;done
- Build
The build step reconstructs each chromosome by concatenating the contigs, adding gaps between the contigs and generating the final genome release in FASTA format.
ALLHiC_build seq.fasta
After running ALLHiC on each set of homologous chromosome groups, we got 48 super-scaffolds, among which 32 were assigned to specific chromosomes. Hi-C signals among the 48 super-scaffolds provide evidence to join them into chromosomes. Similar to our prune method, we firstly remove inter-allelic Hi-C signals as well as weak links (signal density less than 0.35), and then searches for the best linkage signals for each super-scaffold. Super-scafolds are clustered together if they have mutually best Hi-C signals. This goal can be achived by manual check. We also provide a script, though its not robust and need modification.
perl ALLHiC/script/link_superscaffold.pl
The strategy we used to assign contigs and mapping Hi-C reads to pre-set of groups based on synteny analysis with a closely related species (here sorghum) indeed reduce the complexity and difficulty, however, will lose contigs which do not have gene annotated. To rescue these lost contigs, ALLHiC rescue calculated the signal density between unplaced contigs and anchored contigs (which have been assigned to a specific group). The unplaced contigs will be concentrated to specific partition for further ordering and orientation if they have best signal density with corresponding clusters.
This step can be achived by re-using ALLHiC_rescue, but need users to prepare a cluster file to indicate which contigs should be assigned to the same chromosome.
Prepare the restriction enzyme sites
allhic extract sample.clean.bam draft.asm.fasta --RE AAGCTT
Globally rescue unplaced contigs
ALLHiC_rescue -r draft.asm.fasta -b sample.clean.bam -c user-prepared.clusters.txt -i sample.clean.counts_AAGCTT.txt
for K in {1..48}; do allhic optimize group${K}.txt sample.clean.clm;done
© 2017 - present, ALLHiC authors