shingocat/lrscaf

LRScaf: improving draft genomes using long noisy reads

A hybrid assembly strategy is a reasonable and promising way to combine the strengths and offset the weaknesses of Next-Generation Sequencing (NGS) and Third-Generation Sequencing (TGS) technologies. Following this principle, we present LRScaf (Long Reads Scaffolder), a toolkit that uses TGS data to improve draft genome assemblies. Its main features are short running time, accuracy, and contiguity. Scaffolding a rice genome can be done in 20 minutes with the minimap mapper. In human, LRScaf improved the draft assembly NG50 from 127.5 Kb to 10.4 Mb on the 20x PacBio CHM1 dataset and from 115.7 Kb to 17.4 Mb on the ~35x Nanopore NA12878 dataset.


################################################################################
Requirements
################################################################################
Java version: 1.8+.

################################################################################
Building LRScaf project
################################################################################
There are two ways to build and run this project:
  • Use the prebuilt jar package named LRScaf-<version>.jar, found under the target folder in the releases. Run it with the following command (the configuration XML file is described below):
    >java -jar LRScaf-<version>.jar -x <configure.xml>
  • Compile the source code yourself: download the source code, then compile and build the project with Maven <https://maven.apache.org/> in the following steps:
    # 1. download the latest release and unzip the package
    >unzip lrscaf-<version>.zip
    # 2. change the working folder
    >cd lrscaf-<version>
    # 3. compile the source code and package the project; a jar package named
    # LRScaf-<version>.jar will be created under the target folder.
    >mvn package

    ################################################################################
    Quick start
    ################################################################################
    # XML configuration style
    >java -jar LRScaf-<version>.jar -x <configure.xml>
    # or command-line in short style
    >java -jar LRScaf-<version>.jar -c <draft_assembly.fasta> -a <alignment.m4> -t <m4> -o <output_folder> [options]
    # or command-line in long style
    >java -jar LRScaf-<version>.jar --contig <draft_assembly.fasta> --alignedFile <alignment.m4> -t <m4> --output <output_folder> [options]
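
When LRScaf is called from a pipeline script, the short-style command above can be assembled as an argument list. This is a minimal sketch in Python, not part of LRScaf: the helper name `lrscaf_command` and the file names are hypothetical, and only the flags shown above (-c, -a, -t, -o) are used.

```python
import shlex


def lrscaf_command(jar, contig, aligned_file, aligned_type, output, extra=()):
    """Build the short-style LRScaf command line as an argument list.

    Hypothetical helper: it only arranges the flags from the quick-start
    example and does not check that the files exist.
    """
    return [
        "java", "-jar", jar,
        "-c", contig,
        "-a", aligned_file,
        "-t", aligned_type,
        "-o", output,
        *extra,
    ]


# Placeholder paths, for illustration only.
cmd = lrscaf_command("LRScaf.jar", "draft_assembly.fasta", "alignment.m4", "m4", "output_folder")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.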

    ################################################################################
    An Oryza sativa L. Tutorial
    ################################################################################
    # Improving a draft assembly with LRScaf generally takes three steps.
    # The first step: generate a draft assembly using NGS reads.
    # Download the NGS dataset (prefetch SRR8446493) and extract the NGS reads (fastq-dump SRR8446493);
    # Download the TGS dataset under the project PRJNA318714 on NCBI and extract TGS reads at about 20-fold coverage;
    # Construct the NGS draft assembly using SOAPdenovo2 (More details: https://sourceforge.net/projects/soapdenovo2/)
    >SOAPdenovo127mer pregraph -s ./assembly.config -d 1 -K 83 -R -p 48 -o ./83/83
    >SOAPdenovo127mer contig -R -g ./83/83
    >SOAPdenovo127mer map -p 48 -s ./assembly.config -g ./83/83
    >SOAPdenovo127mer scaff -p 48 -L 150 -F -g ./83/83
    # The content of "assembly.config" file:
    #   max_rd_len=150
    #   [LIB]
    #   avg_ins=300
    #   reverse_seq=0
    #   asm_flags=3
    #   q1=read_R1.fq
    #   q2=read_R2.fq

    # The second step: align the TGS long reads against the draft assembly.
    # Map the TGS long reads against the draft assembly with minimap2 or BLASR.
    >minimap2 -t 8 ./draft.fa ./tgs20x.fa >./aln.mm
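
The alignment file written by minimap2 is in PAF format, and the identity threshold in the configuration is applied to its records. This is a minimal sketch in Python for inspecting that file. It assumes identity is the number of matching bases divided by the alignment block length (PAF columns 10 and 11); LRScaf's exact formula is not stated here. The function names and the example record are hypothetical.

```python
def paf_identity(line):
    """Return (query_name, target_name, identity) for one PAF record.

    PAF columns (1-based): 1 = query name, 6 = target name,
    10 = number of matching bases, 11 = alignment block length.
    Identity is taken as column 10 / column 11 (an assumption, see lead-in).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("a PAF record needs at least 12 tab-separated fields")
    matches = int(fields[9])
    block_len = int(fields[10])
    identity = matches / block_len if block_len else 0.0
    return fields[0], fields[5], identity


def filter_paf(lines, min_identity=0.1):
    """Keep records whose identity is at least min_identity."""
    return [rec for rec in map(paf_identity, lines) if rec[2] >= min_identity]


# Made-up record, for illustration only.
example = "read1\t12000\t100\t9100\t+\tcontig7\t50000\t2000\t11000\t1800\t9000\t60"
print(paf_identity(example))
```

Running `filter_paf` over ./aln.mm with a few candidate thresholds shows how many alignments each value would keep before it is written into the configuration file.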

    # The last step: improve the draft assembly using LRScaf.
    >java -Xms100g -Xmx100g -jar LRScaf.jar -x ./scafconf.xml
    # The content of "scafconf.xml" file:
    # <scaffold>
    #   <input>
    #     <contig>./draft.fa</contig>
    #     <mm>./aln.mm</mm>
    #   </input>
    #   <output>./</output>
    #   <paras>
    #     <min_contig_length>1200</min_contig_length>
    #     <identity>0.1</identity>
    #     <min_overlap_length>960</min_overlap_length>
    #     <min_overlap_ratio>0.8</min_overlap_ratio>
    #     <max_overhang_length>1000</max_overhang_length>
    #     <max_overhang_ratio>0.1</max_overhang_ratio>
    #     <max_end_length>1000</max_end_length>
    #     <max_end_ratio>0.1</max_end_ratio>
    #     <min_supported_links>1</min_supported_links>
    #     <tip_length>1000</tip_length>
    #     <ratio>0.2</ratio>
    #     <repeat_mask>true</repeat_mask>
    #     <iqr_time>3</iqr_time>
    #     <mmcm>20</mmcm> <!--only for Minimap Alignment.-->
    #     <process>4</process>
    #   </paras>
    # </scaffold>

    ################################################################################
    Parameters of LRScaf
    ################################################################################

    LRScaf accepts parameters from an XML configuration file or from the command line; the XML configuration file is recommended. A template configuration file in XML format, named "scafconf.xml", is included in the project. On the command line, LRScaf supports GNU-like options in long (dash-dash) and short (dash) style. The following table gives the meaning of each parameter and its default value, if available.

    The first and second columns are the command-line parameters in long style and the corresponding short style.

    The third column is the code in the XML configuration file. NA means not available in the XML configuration file.

    The fourth column gives the details and default value of this option, if available.

    | Parameter | Abbreviation | XML Code | Details |
    |---|---|---|---|
    | xml | x | NA | The XML configuration file. All command-line parameters are ignored if this is set. |
    | contig | c | contig | The contigs file of the draft assembly in fasta format. |
    | m5 | m5 | m5 | The alignment file in -m 5 format of BLASR. |
    | m4 | m4 | m4 | The alignment file in -m 4 format of BLASR. |
    | sam | sam | sam | The alignment file in SAM format of BLASR. |
    | mm | mm | mm | The alignment file in PAF format of Minimap. |
    | output | o | output | The output folder. |
    | miniCntLen | micl | min_contig_length | The minimum contig length to be included for scaffolding. Default: <200> bp. |
    | identity | i | identity | The identity threshold for filtering invalid alignments. Default: <0.8>. This value must be modified according to the mapper. For a BLASR alignment file, a higher value means a higher identity. For a Minimap alignment file, the value should not be larger than 0.3 and could be set to 0.1. |
    | miniOLLen | mioll | min_overlap_length | The minimum overlap length of a contig. Default: <160> bp. |
    | miniOLRatio | miolr | min_overlap_ratio | The minimum overlap length ratio of a contig. Default: <0.8>. If the overlap length is larger than miniOLLen, the ratio of the overlap length is computed as overlap_length/contig_length. |
    | maOHLen | maohl | max_overhang_length | The maximum overhang length of a contig. Default: <300> bp. |
    | maOHRatio | maohr | max_overhang_ratio | The maximum overhang ratio of a contig. Default: <0.1>. If the overhang length is less than maohl, the ratio of the overhang length is computed as overhang_length/contig_length. |
    | maELen | mael | max_end_length | The maximum ending length of a long read. Default: <300> bp. |
    | maERatio | maer | max_end_ratio | The maximum ending ratio of a long read. Default: <0.1>. The ending length (ending_len) is computed as long_read_length * maer, then def_ending_len = (mael >= ending_len ? ending_len : mael). |
    | miSLN | misl | min_supported_links | The minimum number of supporting links. Default: <1>. If the depth of long reads is less than 10x, misl could be set to 1. |
    | ratio | r | ratio | The ratio for deleting error-prone edges in divergence nodes. Default: <0.2>. |
    | mr | mr | repeat_mask | The indicator for masking repeats. Default: <true>. Masking repeats reduces the divergent nodes in the scaffolding graph and improves the contiguity of assemblies. Setting it to true is recommended. |
    | tiplength | tl | tip_length | The maximum tip length. Default: <1500> bp. |
    | iqrtime | iqrt | iqr_time | The IQR times for setting contigs as repeats by their coverages. Default: <1.5>. |
    | mmcm | mmcm | mmcm | The parameter to filter invalid Minimap alignments. Default: <8>. Only for Minimap alignment. |
    | process | p | process | The multi-threads setting. Default: <4>. |
    | help | h | NA | Print this help information. |
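
The length and ratio thresholds in the table can be read as simple predicates. This is a minimal sketch in Python, not LRScaf's actual code. The ending-length rule follows the formula given for maERatio. The way the overlap and overhang checks are combined (both the length and the ratio condition must hold) is an assumption. The function names are hypothetical.

```python
def effective_end_length(read_length, mael=300, maer=0.1):
    """def_ending_len = (mael >= ending_len ? ending_len : mael),
    i.e. the smaller of mael and read_length * maer."""
    ending_len = read_length * maer
    return ending_len if mael >= ending_len else mael


def overlap_ok(overlap_length, contig_length, mioll=160, miolr=0.8):
    """Assumed reading: the overlap must reach mioll bp, and
    overlap_length / contig_length must reach miolr."""
    if overlap_length < mioll:
        return False
    return overlap_length / contig_length >= miolr


def overhang_ok(overhang_length, contig_length, maohl=300, maohr=0.1):
    """Assumed reading: the overhang must not exceed maohl bp, and
    overhang_length / contig_length must not exceed maohr."""
    if overhang_length > maohl:
        return False
    return overhang_length / contig_length <= maohr


# A 10 kb read: 10000 * 0.1 is larger than mael = 300, so mael is used.
print(effective_end_length(10000))
# A 900 bp overlap on a 1000 bp contig passes both overlap checks.
print(overlap_ok(900, 1000))
```

With the defaults, the ratio conditions dominate for short contigs and reads, while the absolute lengths cap what is tolerated on long ones.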

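
The iqr_time parameter marks contigs as repeats by their coverage. This is a minimal sketch in Python of one common way to do that, assuming an upper-fence rule (coverage above Q3 + iqr_time * IQR); the exact rule used by LRScaf is not described here. The function names and the coverage values are hypothetical.

```python
import statistics


def repeat_threshold(coverages, iqr_time=1.5):
    """Upper fence Q3 + iqr_time * (Q3 - Q1) over the contig coverages."""
    q1, _median, q3 = statistics.quantiles(coverages, n=4)
    return q3 + iqr_time * (q3 - q1)


def flag_repeats(coverage_by_contig, iqr_time=1.5):
    """Return the names of contigs whose coverage exceeds the upper fence."""
    threshold = repeat_threshold(list(coverage_by_contig.values()), iqr_time)
    return sorted(name for name, cov in coverage_by_contig.items() if cov > threshold)


# Made-up long-read coverages per contig, for illustration only.
coverage = {"c1": 10, "c2": 12, "c3": 11, "c4": 9, "c5": 13, "c6": 60}
print(flag_repeats(coverage, iqr_time=1.5))
print(flag_repeats(coverage, iqr_time=3))
```

Under this reading, a larger iqr_time raises the fence, so fewer contigs are treated as repeats.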
    ################################################################################
    XML Configuration File Content
    ################################################################################
    <?xml version="1.0" encoding="UTF-8"?>
    <scaffold>
      <!--The input file for scaffolding, including contigs and aligned files (i.e. m5, m4 or mm file) -->
      <input>
        <contig>Draft assembly in fasta format.</contig>
        <m4>The aligned file in BLASR -m 4 format.</m4>
      </input>
      <!-- The output folder for scaffolding -->
      <output>The output folder.</output>
      <!-- The parameters for scaffolding-->
      <paras>
        <!--More details are shown in README.md-->
        <min_contig_length>500</min_contig_length>
        <identity>0.8</identity>
        <min_overlap_length>400</min_overlap_length>
        <min_overlap_ratio>0.8</min_overlap_ratio>
        <max_overhang_length>500</max_overhang_length>
        <max_overhang_ratio>0.1</max_overhang_ratio>
        <max_end_length>500</max_end_length>
        <max_end_ratio>0.1</max_end_ratio>
        <min_supported_links>2</min_supported_links>
        <tip_length>1500</tip_length>
        <ratio>0.2</ratio>
        <repeat_mask>true</repeat_mask>
        <iqr_time>3</iqr_time>
        <mmcm>8</mmcm> <!--only for Minimap Alignment.-->
        <process>4</process>
      </paras>
    </scaffold>
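
For batch runs it can be convenient to generate this file instead of editing the template by hand. This is a minimal sketch in Python that writes the same element layout as the template above. The helper name `build_scafconf` is hypothetical, and the element names are taken from the XML Code column of the parameter table.

```python
import xml.etree.ElementTree as ET

# Alignment element names listed in the parameter table.
ALIGNED_FORMATS = {"m5", "m4", "sam", "mm"}


def build_scafconf(contig, aligned_file, aligned_format, output, paras):
    """Return the text of a scafconf.xml-style configuration.

    paras maps XML codes (e.g. "identity", "min_contig_length") to values;
    booleans are written as "true"/"false" as in the template.
    """
    if aligned_format not in ALIGNED_FORMATS:
        raise ValueError("aligned_format must be one of m5, m4, sam, mm")
    root = ET.Element("scaffold")
    input_el = ET.SubElement(root, "input")
    ET.SubElement(input_el, "contig").text = contig
    ET.SubElement(input_el, aligned_format).text = aligned_file
    ET.SubElement(root, "output").text = output
    paras_el = ET.SubElement(root, "paras")
    for key, value in paras.items():
        text = str(value).lower() if isinstance(value, bool) else str(value)
        ET.SubElement(paras_el, key).text = text
    return ET.tostring(root, encoding="unicode")


xml_text = build_scafconf(
    "./draft.fa", "./aln.mm", "mm", "./",
    {"min_contig_length": 1200, "identity": 0.1, "repeat_mask": True, "mmcm": 20},
)
print(xml_text)
```

The returned string can be written to a file and passed to LRScaf with the -x option.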

    ################################################################################
    Licence
    ################################################################################

    LRScaf is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
    You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.


    If you have any questions, please feel free to contact me <qinmao@caas.cn>.