sanjaysinghrathi/AVLR-Mapper

AVLR-Mapper

Next-generation sequencing (NGS) technologies generate huge volumes of genetic data, and conventional single-processor sequence alignment tools can no longer keep pace with it. Cloud computing and MapReduce frameworks, which use thousands of commodity machines to store and process huge datasets, are therefore emerging as the solution of choice for ever-growing genomics data. Many MapReduce-based sequence alignment tools, such as CloudBurst, BlastReduce, CloudAligner, and SparkBWA, are used for faster and more efficient alignment of genome sequences.

These aligners are mainly either hashing-based, like CloudBurst, or indexing-based, like SparkBWA. In hashing-based aligners, either the reference genome or the reads are hashed, and the other is scanned to find matches. Most of these methods are fast and efficient for short reads but suffer from lower accuracy and higher memory requirements on long reads. In indexing-based aligners, an index such as a suffix array or FM-index (Full-text index in Minute space) is built over the reference genome, and reads are mapped using that index. These aligners are accurate, have a smaller memory footprint, and can handle long reads, but they still have limitations: index generation is slow, each read is searched against the entire index, and they cannot handle highly dynamic genomes that are updated daily.

To address these issues, we propose AVLR-Mapper, a suffix-array-based approach that aligns variable-length reads to the reference genome efficiently and accurately. It uses a hybrid model that combines seed-and-extend with a suffix array. The index is partitioned by prefix, and each partition can reside on a different machine. Spark's distributed architecture greatly reduces index generation time, and only one partition of the index is needed to find and map a read.

We tested the effectiveness, efficiency, and scalability of our approach on standard and real-life genome datasets. Read AVLR-Mapper.docx for instructions on using this tool.
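
To make the prefix-partitioned suffix array idea concrete, here is a minimal single-process sketch in plain Python. It is not the AVLR-Mapper implementation and does not use Spark. The function names (`build_partitioned_index`, `find_seed`, `map_read`), the parameters (`prefix_len`, `seed_len`, `max_mismatches`), and the mismatch-counting extension step are all assumptions chosen for illustration. In a distributed setting, each entry of the dictionary below would correspond to a partition that could live on a different machine.

```python
from collections import defaultdict


def build_partitioned_index(reference, prefix_len=2):
    """Group suffix start positions by their first `prefix_len` characters,
    then sort each group by the suffix it points to (a per-prefix suffix array)."""
    partitions = defaultdict(list)
    for pos in range(len(reference) - prefix_len + 1):
        partitions[reference[pos:pos + prefix_len]].append(pos)
    for positions in partitions.values():
        positions.sort(key=lambda p: reference[p:])
    return dict(partitions)


def find_seed(reference, positions, seed):
    """Binary search a single partition for suffixes that start with `seed`."""
    n = len(seed)
    lo, hi = 0, len(positions)
    while lo < hi:
        mid = (lo + hi) // 2
        if reference[positions[mid]:positions[mid] + n] < seed:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(positions) and reference[positions[lo]:positions[lo] + n] == seed:
        hits.append(positions[lo])
        lo += 1
    return hits


def map_read(reference, index, read, seed_len=4, prefix_len=2, max_mismatches=1):
    """Seed-and-extend: locate the read's leading seed in one partition only,
    then extend each hit over the full read and count mismatches."""
    seed = read[:seed_len]
    positions = index.get(seed[:prefix_len], [])  # only this partition is consulted
    alignments = []
    for start in find_seed(reference, positions, seed):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # read would run past the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            alignments.append((start, mismatches))
    return sorted(alignments)


# Hypothetical toy data, not taken from the project's datasets.
reference = "ACGTACGTTAGCACGTAG"
index = build_partitioned_index(reference)
print(map_read(reference, index, "ACGTAG"))
```

Because the read's prefix selects the partition, the lookup never touches the rest of the index, which is the property the description above relies on. A real aligner would also try seeds from other offsets in the read and handle insertions and deletions during extension; this sketch only counts substitutions.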
