NGS experiments produce sequence reads, and the first step in NGS data analysis is to align (map) these reads to a reference genome. Alignment is computationally demanding because of the huge volumes of both NGS data and reference genomes. Tools outside R are available for this task; the most commonly used include BWA and Bowtie. Describing these two methods in detail is beyond the scope of this book. Nevertheless, a short explanation of Bowtie is offered in the See also section of this recipe.

1. First, download an example file from within R as follows:

In [7]:
# mode = "wb" writes the BAM file in binary mode, which keeps it intact on Windows
download.file(url = "http://genome.ucsc.edu/goldenPath/help/examples/bamExample.bam",
              destfile = "bamExample.bam", mode = "wb")

2. To read the BAM file, use the scanBam function from the Rsamtools package as follows (the file is also provided with the code files on the book's web page for ease of access):

In [2]:
library(Rsamtools)

Loading required package: GenomeInfoDb
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq,
    Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
    rowSums, sapply, setdiff, sort, table, tapply, union, unique,
    unsplit, which, which.max, which.min

Loading required package

In [9]:
# An identical copy of the file is used here; it was downloaded with a different program.
bam <- scanBam("bamExample(1).bam")

3. List the attribute names of the first list element of the data just read, as follows:

In [10]:
names(bam[[1]])

4. Check the number of records in the file as follows:

In [12]:
countBam("bamExample(1).bam")

  space start end width              file records nucleotides
1    NA    NA  NA    NA bamExample(1).bam   36142     1474950


5. To read only selected attributes, pass them as a parameter object by typing the following commands:

In [14]:
what <- c("rname", "strand", "pos", "qwidth", "seq")
param <- ScanBamParam(what=what)
bam2 <- scanBam("bamExample(1).bam", param=param)
names(bam2[[1]])
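
The same parameter object can also restrict which records are read, not just which fields. The following is a minimal sketch, assuming the file from step 1 is saved as `bamExample.bam`; the names `bam_path`, `param_minus`, and `bam_minus` are illustrative and not part of the recipe. It uses the `flag` argument of `ScanBamParam` together with `scanBamFlag` to keep only reverse-strand reads, and it skips itself if Rsamtools or the file is missing.

```r
what <- c("rname", "strand", "pos", "qwidth", "seq")
bam_path <- "bamExample.bam"  # assumed file name from step 1; adjust if yours differs

if (requireNamespace("Rsamtools", quietly = TRUE) && file.exists(bam_path)) {
  library(Rsamtools)
  # Keep only reverse-strand reads, and only the selected fields.
  param_minus <- ScanBamParam(what = what,
                              flag = scanBamFlag(isMinusStrand = TRUE))
  bam_minus <- scanBam(bam_path, param = param_minus)
  # Every record returned should now report the "-" strand.
  print(table(bam_minus[[1]]$strand))
} else {
  message("Rsamtools or ", bam_path, " not available; skipping the example.")
}
```

Filtering at read time this way avoids loading records that would be discarded later, which can matter for larger BAM files.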

6. Convert the data to a DataFrame object as follows:
 

In [15]:
bam_df <- do.call("DataFrame", bam[[1]])
head(bam_df)

DataFrame with 6 rows and 13 columns
                                qname      flag    rname   strand       pos
                          <character> <integer> <factor> <factor> <integer>
1                  SRR010939.15011799        35       21        +  33019936
2                  SRR010939.15011799        19       21        -  33019947
3                   SRR006419.2418801        16       21        -  33019958
4 -XAT_0001_FC208BFAAXX:5:149:585:182        16       21        -  33019960
5 -XAH_0003_FC203BTAAXX:1:191:243:169        16       21        -  33019963
6                   ERR001302.6114800        19       21        -  33019965
     qwidth      mapq       cigar     mrnm      mpos     isize
  <integer> <integer> <character> <factor> <integer> <integer>
1        76        99         76M       21  33019947        87
2        76        99         76M       21  33019936       -87
3        51        76         51M       NA         0         0
4        47        91         47M       
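
To illustrate roughly what `do.call` does in step 6, the base-R sketch below builds a small named list by hand and turns it into a table. The list `toy_bam` is a hypothetical stand-in for `bam[[1]]`, filled with values copied from the first three rows of the output above, and base `data.frame` is used in place of `DataFrame` so that the example runs without Bioconductor.

```r
# Hypothetical stand-in for bam[[1]]: a named list of equal-length vectors.
# Values are copied from the first three rows of the head(bam_df) output.
toy_bam <- list(
  qname  = c("SRR010939.15011799", "SRR010939.15011799", "SRR006419.2418801"),
  flag   = c(35L, 19L, 16L),
  rname  = factor(c("21", "21", "21")),
  strand = factor(c("+", "-", "-"), levels = c("+", "-", "*")),
  pos    = c(33019936L, 33019947L, 33019958L)
)

# do.call() passes each list element as a named argument, so this call is
# equivalent to data.frame(qname = ..., flag = ..., rname = ..., ...).
toy_df <- do.call("data.frame", toy_bam)
print(toy_df)
```

Each list element becomes one column, which is why the approach only works when all elements have the same length.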

7. From this DataFrame object, count the records that fulfill certain conditions (here, reads on reference sequence 21 whose flag equals 16) as follows:

In [16]:
table(bam_df$rname == '21' & bam_df$flag == 16)


FALSE  TRUE 
31894  4248 