Genome scaffolding using 10x data

Description

Scaff10x is a pipeline for genome scaffolding using 10x data. Barcode tags are extracted from the raw sequencing reads and appended to the read names for further processing. The reads are aligned to the assembly with either BWA or SMALT. Barcodes are then sorted together with contigs and mapping coordinates. A relation matrix is constructed to record the barcodes shared among contigs that may be linked. Contigs are ordered and oriented after a nearest-neighbour search.

Suppose you have an assembly and 10x reads:

    genome-assembly.fasta read_1.fastq read_2.fastq

(1) Barcode extraction:

    /tmp/scaff10x/scaff_BC-reads-1 read_1.fastq read-BC_1.fastq read-BC_1.name > try.out
    /tmp/scaff10x/scaff_BC-reads-2 read-BC_1.name read_2.fastq read-BC_2.fastq > try.out

(2) Run Scaff10x

Using SMALT (https://sourceforge.net/projects/smalt/):

    /tmp/scaff10x/scaff10x -nodes 30 -matrix 2000 -reads 8 -edge 50000 -link 6 -block 2500 -align smalt genome-assembly.fasta read-BC_1.fastq read-BC_2.fastq scaffolds.fasta > try.out

    nodes (30)    - number of CPUs requested
    matrix (2000) - relation matrix size
    reads (10)    - minimum number of reads per barcode
    edge (50000)  - edge length of mapped reads to consider for scaffolding
    link (8)      - minimum number of shared barcodes
    block (2500)  - length used to determine nearest neighbours

or BWA:

    /tmp/scaff10x/scaff10x -nodes 30 -matrix 2000 -reads 8 -edge 50000 -link 6 -block 2500 -align bwa genome-assembly.fasta read-BC_1.fastq read-BC_2.fastq scaffolds.fasta > try.out

If you already have a SAM file, try this:

    /tmp/scaff10x/scaff10x -nodes 30 -matrix 2000 -reads 8 -edge 50000 -link 6 -block 2500 -sam /lustre/scratch117/sciops/team117/hpag/zn1/project/human/10xg/new/human.sam genome-assembly.fasta read-BC_1.fastq read-BC_2.fastq scaffolds.fasta > try.out

Note:

  a. To run scaff10x, you need to give the full path /tmp/scaff10x/scaff10x.
  b. You also need to give the full path of the SAM file. However, the target assembly file and the two read files should be in your working directory (a full path won't work for them).
  c. SMALT is notably slower than BWA, so try BWA first.
  d. The block value is very important. The default value of 2500 is very conservative; you may increase it to, say, 5000 or 10000 to improve the length of the scaffolds.
  e. The default values of "reads" and "link" are based on 30X read coverage. You need to increase them if you have, say, 60X reads.
  f. For the best results, you may run the pipeline twice, using the scaffolded output of the first run as the genome-assembly.fasta input for the second.

(3) Install

    gunzip scaff10x.tar.gz
    tar xvf scaff10x.tar
    make

Please contact Zemin Ning ( zn1@sanger.ac.uk ) for any further information.
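
Steps (1) and (2), together with the two-round suggestion in note f, can be collected into a small wrapper. The sketch below is only an illustration: the function names, the SCAFF10X_DIR variable and the scaffolds-round1.fasta / scaffolds-round2.fasta file names are made up for this example and are not part of Scaff10x. It assumes the second round reuses the same barcoded reads. It only prints the commands, using the flags shown above, so they can be checked before anything is run; the "> try.out" redirection is left to the user.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Scaff10x: prints the commands for
# barcode extraction and two scaffolding rounds instead of running them.

# Install directory of the Scaff10x binaries (assumed; override if needed).
SCAFF10X_DIR="${SCAFF10X_DIR:-/tmp/scaff10x}"

# Print the two barcode-extraction commands from step (1).
# Arguments: read 1 file, read 2 file (both in the working directory).
barcode_cmds() {
    r1="$1"; r2="$2"
    echo "$SCAFF10X_DIR/scaff_BC-reads-1 $r1 read-BC_1.fastq read-BC_1.name"
    echo "$SCAFF10X_DIR/scaff_BC-reads-2 read-BC_1.name $r2 read-BC_2.fastq"
}

# Print one scaff10x command from step (2).
# Arguments: aligner (bwa or smalt), block length, input assembly, output file.
scaff10x_cmd() {
    aligner="$1"; block="$2"; assembly="$3"; out="$4"
    echo "$SCAFF10X_DIR/scaff10x -nodes 30 -matrix 2000 -reads 8 -edge 50000 -link 6 -block $block -align $aligner $assembly read-BC_1.fastq read-BC_2.fastq $out"
}

# Two rounds as in note f: round 2 takes the round 1 scaffolds as its assembly.
two_round_plan() {
    barcode_cmds read_1.fastq read_2.fastq
    scaff10x_cmd bwa 2500 genome-assembly.fasta scaffolds-round1.fasta
    scaff10x_cmd bwa 2500 scaffolds-round1.fasta scaffolds-round2.fasta
}

two_round_plan
```

BWA is used here because note c recommends trying it first; passing smalt as the first argument of scaff10x_cmd, or a larger block value as the second (note d), changes only the printed command.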