Author: Saima Sultana Tithi and Hong Tran
####Introduction BAM_ABS is a tool which simulates a Bayesian model that computes the posterior probability of mapping a multiread to each candidate genomic location, taking advantage of uniquely aligned reads. The inputs of this tool are a set of ambiguously mapped reads and a set of uniquely mapped reads from Bismark tool; and the output is the most probable genomic locations for those ambiguously mapped reads. The most probable genomic location is the location which has the highest calculated probability.
####System Requirement This software package has been tested on Ubuntu 14.04 LTS. To run this program, user needs to have samtools, perl, bedtools, and g++ compiler installed on his/her Linux/UNIX system. This program has been tested using g++ 4.8.4.
####Compilation If you receive BAM_ABS as a compressed file, first decompress it. Then use the following commands to create the executable file:
make clean make
For this command to work, the user needs g++ installed on his/her system. You can use the following command to install g++:
sudo apt-get install g++
####Execution #####a) Pre-process the input data: Step 1: If Perl is not installed in the system, then prior to this step, Perl needs to be installed. Convert_to_bed_unite.pl has two options --ambiguous and --unique to indicate whether the input file is ambiguous alignments or unique alignments. Here, run Convert_to_bed_unite.pl to convert ambiguous read file to bed format with --ambiguous option.
- Input: Ambiguous reads in SAM format (output of Bismark tool)
- Output: Ambiguous reads in BED format (ambiguous_read_file.bed)
perl Convert_to_bed_unite.pl --ambiguous ambiguous_read_file.sam
Step 2: Prior to execute this step, Samtools need to be installed on the system. After installing Samtools, run Samtools to get overlapped unique reads in sam format
- Input argument 1: Ambiguous reads in BED format (output of step 1)
- Input argument 2: Unique reads in BAM format (output of Bismark tool)
- Output: Unique reads in SAM format with mapping quality greater than a given value
samtools view -L ambiguous_read_file.bed all_unique_reads.bam -q 20 > unique_reads.sam
The above command will only retain reads with MAQ(Mapping Quality) > 20 with no header
Step 3: Run the following command to get rid of duplicates from the unique reads
- Input: Unique reads in SAM format (output of step 2)
- Output: Unique reads with no duplicate in SAM format
sort -n -r -k3,3 -k4,4 -k5,5 unique_reads.sam|uniq -u > unique_reads_nodup.sam
Step 4: If Perl is not installed in the system, then prior to this step, Perl needs to be installed. Run Convert_to_bed_unite.pl to convert unique read file to bed format with --unique option.
- Input: Unique reads with no duplicate in SAM format (output of step 3)
- Output: Unique reads with no duplicate in BED format (unique_reads_nodup.bed)
perl Convert_to_bed_unite.pl --unique unique_reads_nodup.sam
Setp 5: Prior to execute this step, Bedtools need to be installed. After installing Bedtools, to get overlapped unique reads by using Bedtools, run the following command in the bedtools folder
- Input argument 1 (ambiguous_read_file.bed): Ambiguous reads in BED format
- Input argument 2 (unique_reads_nodup.bed): Unique reads with no duplicate in BED format (output of step 4)
- Output (unique_overlap_read_file.txt): All overlapping unique reads in txt format
./intersectBed -a ambiguous_read_file.bed -b unique_reads_nodup.bed -wb -wa > unique_overlap_read_file.txt
#####b) Score the multi-reads: Run main.exe in BAM_ABS folder using the following command:
./main file.fa ambiguous_read_file.sam unique_overlap_read_file.txt
- Input argument 1 (file.fa): The reference file in Fasta format
- Input argument 2 (ambiguous_read_file.sam): The file containing all ambiguously mapped reads in SAM format
- Input argument 3 (unique_overlap_read_file.txt): The file containing all uniquely mapped reads which are overlapped with multi-reads or ambiguously mapped reads in txt format (output of step 5)
- Output (Reads_with_highest_probable_location.sam): Output file contains multi-reads along with the most probable genomic location in SAM format. This file only contains those multi-reads for which a probable genomic location can be calculated using our model.
####SNP and Methylation Rate BAM_ABS folder also contains a file "snp_methylation_rate.txt". This file contains snp and methylation rate. If the user wants to change any rate, he or she needs to modify this file. Please do not delete this file or do not change the format of this file, only modify the rate part if necessary.
####Result This tool will generate one output file: Reads_with_highest_probable_location.sam. Reads_with_highest_probable_location.sam contains multi-reads along with the most probable genomic location in SAM format.
- Input file:
- The reference file for mouse (in Fasta format): mm10.fa (you can download this file from http://hgdownload.cse.ucsc.edu/downloads.html#mouse)
- Multiread file (in SAM format): L5_10_sample0.1_ambiguous_final
- Overlapping uniquely mapped reads: L5_sample0.1_10_unique_overlap.txt
- Output file: Multireads aligned to highest probable locations (in SAM format): Reads_with_highest_probable_location.sam
BAM_ABS command for the given example:
./main $BAM_ABS_Home$/mm10.fa $BAM_ABS_Home$/Sample_input_output/input/L5_10_sample0.1_ambiguous_final $BAM_ABS_Home$/Sample_input_output/input/L5_sample0.1_10_unique_overlap.txt