FASTQ data holds the sequences (the bases) together with their corresponding quality scores (Phred), encoded as ASCII characters, as explained in the introductory part of the chapter. Once read into the R workspace, the data is ready to be analyzed. However, it usually needs some preprocessing so that the reads meet the quality and selection criteria we are interested in, for example a minimum Phred score or a particular strand. This preprocessing involves quality assessment and filtering, and this recipe covers both.
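To make the ASCII encoding concrete, here is a minimal base-R sketch that decodes a quality string into Phred scores and error probabilities. It assumes the common Phred+33 offset; the quality string and the helper name `decode_phred` are made up for illustration and are not part of the recipe.

```r
# Decode a quality string into integer Phred scores.
# Assumption: Phred+33 encoding (older Illumina data may use an offset of 64).
decode_phred <- function(qual_string, offset = 33L) {
  utf8ToInt(qual_string) - offset
}

qual <- "II5+!"                    # hypothetical quality string for a 5-base read
scores <- decode_phred(qual)       # one integer score per base

# Phred definition: Q = -10 * log10(p), so p = 10^(-Q / 10)
error_prob <- 10^(-scores / 10)

print(scores)
print(error_prob)
```

A higher score means a lower probability that the base call is wrong, which is why the recipe filters for higher Phred scores.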

1. First, download the required files (note that the download might take a while). Any other download tool can be used for this step as well.

2. Unzip the downloaded files from within R as follows:

In [3]:
install.packages("R.utils") # install the R.utils from CRAN

"unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5:
  cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5/PACKAGES'"

package 'R.utils' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Administrator\AppData\Local\Temp\RtmpAXvpso\downloaded_packages


In [2]:
library(R.utils)

Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.22.0 (2018-04-21) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'

The following objects are masked from 'package:methods':

    getClasses, getMethods

The following objects are masked from 'package:base':

    attach, detach, gc, load, save

R.utils v2.6.0 (2017-11-04) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'

The following object is masked from 'package:utils':

    timestamp

The following objects are masked from 'package:base':




In [11]:
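invisible(lapply(list.files(pattern = "\\.fastq\\.gz$"), gunzip)) # decompresses each .gz file and removes the original

Note: the file used here is gzip-compressed (.fastq.gz), so gunzip() is the matching function. Calling bunzip2() on it fails with:
ERROR: Error in decompressFile.default(filename = filename, ..., ext = ext, FUN = FUN): Argument 'filename' and 'destname' are identical: SRR351672.fastq.gz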
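A mismatch between the decompression function and the file's actual compression can be checked up front. The sketch below is a base-R illustration, not part of the recipe: it reads the first bytes of a file and compares them with the gzip and bzip2 signatures. The helper name `detect_compression` and the temporary files are hypothetical.

```r
# Guess the compression format from the leading "magic" bytes of a file.
detect_compression <- function(path) {
  magic <- readBin(path, what = "raw", n = 3L)
  if (length(magic) >= 2L && identical(magic[1:2], as.raw(c(0x1f, 0x8b)))) {
    "gzip"                                   # gzip streams start with 1f 8b
  } else if (length(magic) >= 3L && identical(magic[1:3], charToRaw("BZh"))) {
    "bzip2"                                  # bzip2 streams start with "BZh"
  } else {
    "unknown"
  }
}

# Hypothetical one-read FASTQ record written with both compressors
record <- c("@read1", "ACGT", "+", "IIII")

tmp_gz <- tempfile(fileext = ".fastq.gz")
con <- gzfile(tmp_gz, "w"); writeLines(record, con); close(con)

tmp_bz <- tempfile(fileext = ".fastq.bz2")
con <- bzfile(tmp_bz, "w"); writeLines(record, con); close(con)

print(detect_compression(tmp_gz))
print(detect_compression(tmp_bz))
```

With the format known, the matching R.utils function (gunzip() or bunzip2()) can be chosen instead of relying on the file extension.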


3. To assess the quality, use the FastqQuality function from the ShortRead library as follows:

In [1]:
library(ShortRead)

Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq,
    Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
    rowSums, sapply, setdiff, sort, table, tapply, union, unique,
    unsplit, which, which.max, which.min

Loading required package: BiocParallel
"package 'BiocParallel' 

In [2]:
myFiles <- list.files(getwd(), "fastq", full.names = TRUE) # full paths of all FASTQ files in the working directory
myFiles

In [3]:
myFQ <- lapply(myFiles, readFastq)

In [4]:
myQual <- FastqQuality(quality(quality(myFQ[[1]])))

4. Convert the quality scores into a matrix as follows:


In [5]:
readM <- as(myQual, "matrix")
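The shape of such a matrix is easier to picture with a small example. The following base-R sketch builds a reads-by-cycles matrix by hand from three made-up quality strings. It assumes Phred+33 encoding and pads shorter reads with NA; both are assumptions of this sketch, and the helper `to_quality_matrix` is hypothetical rather than a ShortRead function.

```r
# Hypothetical quality strings of unequal length
quals <- c("IIII5", "II5+", "I5+!!")

# Build a matrix with one row per read and one column per sequencing cycle.
to_quality_matrix <- function(quals, offset = 33L) {
  n_cycles <- max(nchar(quals))
  rows <- lapply(quals, function(q) {
    s <- utf8ToInt(q) - offset                    # Phred+33 assumed
    c(s, rep(NA_integer_, n_cycles - length(s)))  # pad short reads with NA
  })
  do.call(rbind, rows)
}

qual_matrix <- to_quality_matrix(quals)
cycle_means <- colMeans(qual_matrix, na.rm = TRUE)  # mean quality per cycle

print(qual_matrix)
print(cycle_means)
```

Each column then describes the quality distribution at one cycle, which is the layout the per-cycle boxplot in the next step relies on.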

5. Visualize the results as a boxplot with the following command:

In [6]:
boxplot(as.data.frame(readM), outline = FALSE, main="Per Cycle Read Quality", xlab="Cycle", ylab="Phred Quality")

ERROR: Error: cannot allocate vector of size 84.0 Mb
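If the boxplot call runs out of memory, as in the error above, one possible workaround is to plot a random subset of reads and to pass the matrix to boxplot() directly instead of converting it to a data frame first. The sketch below uses a simulated matrix named `readM_demo` as a stand-in for readM; the matrix size, the subset size of 1000 reads, and the variable names are assumptions chosen for illustration.

```r
set.seed(1)

# Simulated stand-in for readM: 5000 reads x 36 cycles of Phred scores
readM_demo <- matrix(sample(2:40, 5000 * 36, replace = TRUE),
                     nrow = 5000, ncol = 36)

# Keep at most 1000 randomly chosen reads to reduce memory use
n_keep <- min(1000L, nrow(readM_demo))
keep <- sample(nrow(readM_demo), n_keep)
readM_sub <- readM_demo[keep, , drop = FALSE]

# boxplot() accepts a matrix and summarizes each column (cycle);
# plot = FALSE returns only the five-number summaries
stats <- boxplot(readM_sub, outline = FALSE, plot = FALSE)

# Draw the plot from the precomputed summaries
bxp(stats, outline = FALSE, main = "Per Cycle Read Quality",
    xlab = "Cycle", ylab = "Phred Quality")
```

A random subset normally preserves the per-cycle trend, although rare low-quality reads may be missed, so the subset size is a trade-off between memory and detail.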


6. Another useful preprocessing step is filtering the sequences while reading alignment data. For this, you first need alignment data. Bowtie data is available on the book's web page under the name myBowtie.txt; either copy the file to your working directory or set the directory that contains it as your working directory. Then read the data as follows:

In [None]:
myData <- readAligned("myBowtie.txt", type = "Bowtie")

7. To check the read alignments, look at the myData object as follows:

In [None]:
myData

8. Different filters can be used, based on, for example, chromosome, sequence length, or strand. To use one, first define the required filter (in this case, a filter for the + strand) as follows:

In [1]:
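library(ShortRead) # strandFilter() comes from ShortRead, so load it first in a fresh session
strand <- strandFilter("+")

Note: when this cell was run without ShortRead loaded, it failed with:
ERROR: Error in strandFilter("+"): could not find function "strandFilter"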


9. Then, apply the filter you created while reading the data as follows:

In [None]:
myRead_strand <- readAligned("myBowtie.txt", filter=strand, type= "Bowtie")

10. Combine more than one filter with the compose function and then use the combined filter as follows:

In [None]:
chromosome <- chromosomeFilter("3") 

In [None]:
myFilt <- compose(strand, chromosome)

In [None]:
myRead_filt <- readAligned("myBowtie.txt", filter = myFilt, type = "Bowtie") # myFilt, as defined above (R is case-sensitive)
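The idea behind combining filters can be sketched in base R: each filter is a function that returns one logical value per read, and the composed filter keeps only the reads that pass all of them. This is an illustration of the concept, not ShortRead's implementation. The data frame and the helpers `make_strand_filter`, `make_chromosome_filter`, and `compose_filters` are hypothetical, and the chromosome filter here uses exact matching as a simplification.

```r
# Hypothetical alignment table: one row per aligned read
alignments <- data.frame(
  chromosome = c("3", "3", "1", "3"),
  strand     = c("+", "-", "+", "+"),
  stringsAsFactors = FALSE
)

# Each factory returns a filter: a function giving TRUE for reads to keep
make_strand_filter <- function(strand) function(x) x$strand == strand
make_chromosome_filter <- function(chr) function(x) x$chromosome == chr

# Compose filters by combining their logical results with AND
compose_filters <- function(...) {
  filters <- list(...)
  function(x) Reduce(`&`, lapply(filters, function(f) f(x)))
}

plus_on_chr3 <- compose_filters(make_strand_filter("+"),
                                make_chromosome_filter("3"))

keep <- plus_on_chr3(alignments)   # logical vector, one entry per read
filtered <- alignments[keep, ]

print(filtered)
```

Keeping filters as small single-purpose functions makes it easy to add further criteria, such as read length, without changing the existing ones.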

11. Again, take a look at the filtered data (myRead_filt) as follows:

In [None]:
myRead_filt