Linear is a long-read analysis framework that employs methods more flexible and efficient than assembly- or alignment-based ones. Linear is compatible with existing software, including SAMtools, SV callers, and IGV.
Please make sure the following prerequisites are installed before building from source.
- GNU/Linux GCC ≥ 4.9.0
- CMake ≥ 3.0.0
- zlib ≥ 1.2
#To install prerequisites for Debian, Ubuntu, etc.
sudo apt-get install cmake
sudo apt-get install zlib1g zlib1g-dev
#To install prerequisites for RedHat, Fedora, etc.
sudo dnf install cmake
sudo dnf install zlib-devel
#To install prerequisites for Arch, Manjaro, etc.
#(a separate zlib-dev package is not needed)
sudo pacman -S cmake
#To build from source, run the following on the command line
mkdir -p build/release && cd $_
cmake [path to source]
make linear -j4 #use 4 threads to compile
# Specify modules in Linear
linear module [options]
# Check generic help for available modules
linear -h
Linear - options and arguments.
====================================
SYNOPSIS
Linear <submodules> -h for help
DESCRIPTION
-h, --help
Display this help message.
--version
Display version information.
AVAILABLE SUBMODULES:
filter: The submodule is to detect SVs signals hidden in long reads.
It takes input as long reads and outputs SAM/BAM. Type "linear filter -h" for more info.
The filter module named Leaf (pipeline B in the figure) is an ultra-fast SV filter for population-scale long-read SV detection. It is built on generative models, which are computationally efficient and effective in detecting intra-read SVs. Leaf outputs SAM/BAM*, which is compatible with alignment-based software.
#Example 1: Reads in FASTA/FASTQ format (.fa, .fastq, optionally .gz) are supported as input.
linear filter read.fa(stq)(.gz) genome.fa(.gz)
#The filter prints the following status while running
Linear: ALIgNment-free methods for long-read vARiants resolution
--Read genomes
File: all.fa.gz [24 sequences; 2945 mbases; Elapsed time[s] 19.75 100%]
--Index::Initiate[100%]
Index::Hash [100%]
End creating index Elapsed time[s] 22
--SRR9001768.fa
I/O::in :273300 cpu:32.70[s] speed:8358.11[rds/thd/s]
I/O::out:270400 cpu:135.54[s] speed:1995.00[rds/thd/s]
Compute:273000 cpu:472.13[s] speed:1578.24[rds/thd/s]
Processed:270400 time:138.27[s] speed:1955.62[rds/s]
#Example 2: With more than 2 input files, put the argument x between the reads and the references.
linear filter *.fa(stq)(.gz) x *.fa(.gz)
#Example 3: For options help
linear filter -h
Linear filter - options and arguments.
===========================================
SYNOPSIS
Linear filter [OPTIONS] read.fa/fastq(.gz) genome.fa(.gz)
DESCRIPTION
-h, --help
Display this help message.
--version
Display version information.
Basic options:
-o, --output STR
Set the prefix of output. The filter will use the prefix of the filename of reads as the prefix of output if
the option isn't set
-ot, --output_type INT
Set the format of the output file. 1 to enable .APF, an approximate map file for non-standard application; 2 to
enable .SAM {DEFAULT}; 4 to enable .BAM; Set values 3 (3=1+2) to enable both .apf and .sam
-t, --thread INT
Set the number of threads to run. -t 4 {DEFAULT}
-g, --gap_len INT
Set the minimal length of gaps. -g 50 {DEFAULT}. -g 0 to turn off map of gaps.
-rg, --read_group STR
Set the name of the read group specified in the SAM header
-sn, --sample_name STR
Set the name of the sample specified in the SAM header
More options (tweak):
-dup, --duplication INT
Redetect duplications for signals of insertions. Enabling (-dup 1) this option will treat many insertions as
duplications. This option is off (-dup 0) {DEFAULT}
-b, --bal_flag INT
Set to Enable/Disable dynamic balancing tasks schedule. -b 1(Enable) {DEFAULT}
-p, --preset INT
Set predefined sets of parameters. -p 0 {DEFAULT} -p 1 efficient -p 2 additional
-i, --index_type INT
Choose the type of indices{1, 2}. -i 1 {DEFAULT}
-c, --apx_c_flag INT
0 to turn off apx map
-f, --feature_type INT
Set types of features {1,2}. -f 2 (2-mer, 48bases){DEFAULT}
-r, --reform_ccs_cigar_flag INT
Enable/Disable compressing the cigar string for Pacbio CCS reads. -r 0(Disable) {DEFAULT}
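The -ot values in the help text above read like bit flags (1 for .apf, 2 for .sam, 4 for .bam), and the help text gives 3 = 1 + 2 as the way to request both .apf and .sam. A minimal sketch of building that value in the shell follows. The variable names, the helper `combine_output_types`, and the file names `reads.fa` and `genome.fa` are placeholders introduced here, not part of Linear. The help text only documents the combination 3; whether other sums are accepted is an assumption that should be checked against `linear filter -h`.

```shell
# Output-type flags as listed in the help text for -ot.
OT_APF=1   # approximate map file (.apf)
OT_SAM=2   # .sam (default)
OT_BAM=4   # .bam

# Hypothetical helper: combine any number of flags with bitwise OR.
combine_output_types() {
    local mask=0 flag
    for flag in "$@"; do
        mask=$(( mask | flag ))
    done
    echo "$mask"
}

# Request both .apf and .sam, the combination documented in the help text.
ot=$(combine_output_types "$OT_APF" "$OT_SAM")

# Print the resulting command line; reads.fa and genome.fa are placeholder names.
echo "linear filter -ot $ot reads.fa genome.fa"
```

Printing the command rather than running it makes it easy to inspect the value before starting a long job.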
Compatibility with samtools 1.10 has been tested. Results of the filter are compatible with 'samtools view', 'samtools index' and 'samtools sort'.
pbsv is an SV caller for PacBio long reads. Compatibility with pbsv has been tested. Set the sample and group names appropriately with the option -s when using pbsv discover.
SVIM is an SV caller for PacBio and ONT reads. SVIM takes SAM/BAM as input. Compatibility of the filter with SVIM has been tested, and results of the filter can be processed directly by SVIM with default settings.
cuteSV is an SV caller for PacBio and ONT reads. cuteSV takes SAM/BAM as input. Compatibility of the filter with cuteSV has been tested, and results of the filter can be processed directly by cuteSV with default settings.
IGV is a sequencing visualization tool. Compatibility with IGV has been tested. Please use samtools to convert and index the results of the filter before using IGV. The indexed BAM* can be visualized directly by IGV.
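The convert-and-index step for IGV can be sketched with `samtools sort` and `samtools index`. The function name `sam_to_indexed_bam` and the file name `reads.sam` are placeholders introduced here. The actual output name depends on the -o prefix or on the name of the read file.

```shell
# Sort a SAM file by coordinate, write it as BAM, and index it so IGV can load it.
# Usage: sam_to_indexed_bam <input.sam>
# Prints the name of the sorted BAM file it wrote.
sam_to_indexed_bam() {
    local sam="$1"
    local bam="${sam%.sam}.sorted.bam"            # reads.sam -> reads.sorted.bam
    samtools sort -o "$bam" "$sam" || return 1    # output format is taken from the .bam extension
    samtools index "$bam" || return 1             # writes the index next to the BAM
    echo "$bam"
}

# Example call (reads.sam stands in for the filter's SAM output):
# sam_to_indexed_bam reads.sam
```

The sorted BAM and its index can then be opened in IGV together with the reference genome used for filtering.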
SAM/BAM* is an extension of standard SAM/BAM for virtual alignment. It is a superset of the standard SAM/BAM. It also supports ordinary alignments, for which the SAM/BAM* record is identical to the standard SAM/BAM record.
Three fields in the standard format are redefined:
- The 6th column, the cigar string (denoted by cigar*), is redefined. The cigar* string includes 4 types of cigar pairs, as shown in the following figure, where the virtual alignments from A to E are expressed by the cigar pairs =I, =D, XI, and XD.
- The 10th column, SEQ*, is a subsequence from the read or the reference.
- The 12th column, tag* 'SA:Z', is redefined. Other tags are identical to the standard tags, which are described in the SAM/BAM format specification and its list of optional tags.
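To inspect a cigar* string, it can help to split it into (length, operation) pairs and total the lengths per operation. The sketch below only tokenizes the string. It does not interpret the cigar pairs =I, =D, XI, and XD, whose meaning is defined by the figure. The helper names `cigar_pairs` and `cigar_totals` are introduced here for illustration, and the input is a short prefix of the cigar* in the example record.

```shell
# Print one "<length> <op>" pair per line for a CIGAR-style string.
cigar_pairs() {
    printf '%s\n' "$1" \
        | grep -oE '[0-9]+[MIDNSHP=X]' \
        | sed -E 's/^([0-9]+)(.)$/\1 \2/'
}

# Sum the lengths for each operation and print "<op> <total>", sorted by operation.
cigar_totals() {
    cigar_pairs "$1" \
        | awk '{ total[$2] += $1 } END { for (op in total) print op, total[op] }' \
        | LC_ALL=C sort
}

# Prefix of the cigar* from the example record.
cigar_totals '49S6=1I34=1I31=5I'
# Prints:
# = 71
# I 7
# S 49
```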
#An example of a record in SAM/BAM*.
#SEQs are generated according to the cigars rather than copied from segments of the read.
#Bases in SEQs corresponding to '49S' are from the read;
#Bases in SEQs corresponding to '6=' are from the genome;
#Bases in SEQs corresponding to '1I' are from the read;
#Bases in SEQs corresponding to '35X' are from the read
# if the base is unequal to the corresponding base in the genome,
# otherwise 'N' is inserted.
#The SA:Z tag is generated according to the cigars and SEQs.
@HD VN:1.6
@SQ SN:chr10 LN:135534747
@PG PN:Linear
@RG ID:1 SM:1
m140612_082500_42156_c100652082550000001823118110071461_s1_p0/104454/5061_10840
0 chr10 59256034 255 49S6=1I34=1I31=5I30=1I110=2I1=1I49=3I11=2I49=1I6=4I44=2I74
=16I1=1I51=17I96=35X26I66=5I40=3I70=2I101=5319S * 0 0 TAGCATAAGCTCTTTAGTTTAATTAG
ATCAGACATTTGTCAATGTTTGTGTCAATGGTTGGCTTTTGTTGCCTTTGCTTTTAGTGTTTTAAGTCATGAAGTCTTTG
...CCACTTGTGTAGAGAGGATGTGGAGAAAAAGAAATGCTTTTACACAGTTGGTGGGAGTGTAAATTCGTTCAACCACT
GTAGAAGACAGTGTTGTGATTCCTCAAGACACACNNNTTTTNCGCNNNTTTAANNNCTTTGNAGAACCCAACAATTAATA
...AGCTGGAAACCATCATTCTCAGCAAACTAACACAGGAACAGAAAACCAAACAC * SA:Z:chr10,59257622,-
,4379S320M5I4884S,255,27;chr10,59257982,+,1371S3138M338I146S,255,528;
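The SA:Z value in the record above can be split into one line per segment for a quick look. This sketch assumes each segment has six comma-separated fields in the order reference name, position, strand, cigar, mapping quality, and edit distance, terminated by a semicolon, as the example record suggests. Because SAM/BAM* redefines this tag, that layout is an assumption rather than a documented guarantee. The helper name `sa_segments` is introduced here for illustration.

```shell
# Split an SA:Z value into one tab-separated line per segment:
# reference name, position, strand, cigar.
# Assumes six comma-separated fields per segment (see the note above).
sa_segments() {
    local sa="${1#SA:Z:}"                 # drop the tag prefix if present
    printf '%s\n' "$sa" \
        | tr ';' '\n' \
        | awk -F',' 'NF == 6 { printf "%s\t%s\t%s\t%s\n", $1, $2, $3, $4 }'
}

# The SA:Z value from the example record, joined onto one line.
sa_segments 'SA:Z:chr10,59257622,-,4379S320M5I4884S,255,27;chr10,59257982,+,1371S3138M338I146S,255,528;'
```

Segments that do not have exactly six fields, including the empty piece after the final semicolon, are skipped.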