additions to readme, added more ggplot2 fun doc

commit 6ddaa523c1206d8802846ec16bd7d0aa5b7fdab7 1 parent 7ef8abd
@vsbuffalo authored
44 README.md
@@ -12,11 +12,47 @@ as well, but please open an issue first.
## About
-qrqc is a Bioconductor package is a fast and extensible package that
-reports basic quality and summary statistics on FASTQ and FASTA files,
-including base and quality distribution by position, sequence length
-distribution, and common sequences.
+qrqc (short for "Quick Read Quality Control") is a fast and extensible
+package that reports basic quality and summary statistics on FASTQ and
+FASTA files, including base and quality distribution by position,
+sequence length distribution, and common sequences.
## License
GNU General Public License, version 2.
+
+## FAQ
+
+### Why `ggplot2`?
+
+I've had some feature requests for `qrqc` since its release, mostly
+related to customizing the graphics. Since data accessibility and
+custom graphics were the reason I created `qrqc`, I initially rewrote
+`qrqc` to provide more graphics options through `lattice`. However,
+all the graphics parameters I added led to large numbers of arguments
+to functions and high complexity. This rewrite uses `ggplot2`, which
+is an excellent way to create graphics, since any graphics object can
+be further manipulated.
+
+### Why do you use Monte Carlo simulations to generate the smooth curve?
+
+`qrqc` is fast because it bins the quality scores of bases by
+positions; there is data summarization done by `readSeqFile`. To
+create a smooth curve, the function needs multiple data points (not
+binned data), which I simulate via Monte Carlo draws from the quality
+distribution by position. This is an approximation, but it leads to a
+smooth curve which can create a useful visual tool in assessing
+quality drops.
+
+### What do I do about bad quality regions?
+
+Illumina reads often have poor 3'-end qualities. I've noticed that
+HiSeq machines also produce poor quality 5'-ends. For increased
+mapping rates and better assemblies, it is generally advisable that
+these poor quality regions be trimmed off. Nik Joshi's `sickle`
+tool can do this; you can get it here
+<http://github.com/najoshi/sickle>.
+
+3'-end adapter contamination can be difficult to recognize (and thus
+remove) due to poor quality and likely incorrect bases. I've developed
+a tool called `scythe` that removes
5 qrqc/R/ggplotting-methods.R
@@ -27,6 +27,7 @@ function(x) {
colnames(gc) <- c('position', 'gc')
return(gc)
})
+ gcd
})
setMethod("getBase", signature(x="SequenceSummary"),
@@ -106,7 +107,7 @@ function(x, fun) {
setMethod("qualPlot", signature(x="FASTQSummary"),
# Plot a single FASTQSummary object.
-function(x, smooth=TRUE, extreme.color="grey", quantile.color="orange",
+function(x, smooth=TRUE, extreme.color="grey", quartile.color="orange",
mean.color="blue", median.color=NULL) {
qd <- getQual(x)
p <- ggplot(qd)
@@ -122,7 +123,7 @@ function(x, smooth=TRUE, extreme.color="grey", quantile.color="orange",
setMethod("qualPlot", signature(x="list"),
# Plot a list of FASTQSummary objects as facets.
-function(x, smooth=TRUE, extreme.color="grey", quantile.color="orange",
+function(x, smooth=TRUE, extreme.color="grey", quartile.color="orange",
mean.color="blue", median.color=NULL) {
if (!length(names(x)))
stop("A list passed into qualPlot must have named elements.")
17 qrqc/man/getBase-methods.Rd
@@ -2,15 +2,16 @@
\docType{methods}
\alias{getBase-methods}
\alias{getBase,SequenceSummary-method}
-\title{Get a Data Frame of Bae Frequency Data from a \code{SequenceSummary} object}
+\title{Get a Data Frame of Base Frequency Data from a \code{SequenceSummary} object}
\description{
An object that inherits from class \code{SequenceSummary} contains
- base data by position gathered by \code{readSeqFile}. \code{getBase}
+ base frequency data by position gathered by \code{readSeqFile}. \code{getBase}
is an accessor function that reshapes the base frequency data by position
into a data frame.
This accessor function is useful if you want to map variables to
- custom \code{ggplot2} aesthetics.
+ custom \code{ggplot2} aesthetics. Base proportions can be accessed
+ with \code{getBaseProp}.
}
@@ -36,7 +37,6 @@
\section{Methods}{
\describe{
-
\item{\code{signature(x = "SequenceSummary")}}{
\code{getBase} is an accessor function that works on any object read
in with \code{readSeqFile}; that is, objects that inherit from
@@ -50,11 +50,18 @@
s.fastq <- readSeqFile(system.file('extdata', 'test.fastq',
package='qrqc'))
- # A custom base quality plot
+ # A custom base plot
ggplot(getBase(s.fastq)) + geom_line(aes(x=position, y=frequency,
color=base)) + facet_grid(. ~ base) + scale_color_dna()
}
+
+\seealso{getGC}
+\seealso{getSeqlen}
+\seealso{getBaseProp}
+\seealso{getQual}
+\seealso{getMCQual}
\seealso{basePlot}
+
\keyword{methods}
\keyword{accessor}
63 qrqc/man/getBaseProp-methods.Rd
@@ -2,16 +2,65 @@
\docType{methods}
\alias{getBaseProp-methods}
\alias{getBaseProp,SequenceSummary-method}
-\title{ ~~ Methods for Function \code{getBaseProp} ~~}
+\title{Get a Data Frame of Base Proportion Data from a \code{SequenceSummary} object}
\description{
- ~~ Methods for function \code{getBaseProp} ~~
+ An object that inherits from class \code{SequenceSummary} contains
+ base frequency data by position gathered by \code{readSeqFile}. \code{getBaseProp}
+ is an accessor function that reshapes the base frequency data by position
+ into a data frame and calculates the proportions of each base per position.
+
+ This accessor function is useful if you want to map variables to
+ custom \code{ggplot2} aesthetics. Base frequencies can be accessed
+ with \code{getBase}.
}
-\section{Methods}{
-\describe{
-\item{\code{signature(x = "SequenceSummary")}}{
-%% ~~describe this method here~~
+
+\usage{
+ getBaseProp(x, drop=TRUE)
+}
+
+\arguments{
+ \item{x}{an S4 object that inherits from \code{SequenceSummary} from
+ \code{readSeqFile}.}
+ \item{drop}{a logical value indicating whether to drop bases that
+ don't have any counts.}
+}
+
+
+\value{
+ \code{getBaseProp} returns a \code{data.frame} with columns:
+
+ \item{position}{the position in the read.}
+ \item{base}{the base.}
+ \item{proportion}{the proportion of a base found per position in the read.}
}
+
+\section{Methods}{
+ \describe{
+ \item{\code{signature(x = "SequenceSummary")}}{
+ \code{getBaseProp} is an accessor function that works on any object read
+ in with \code{readSeqFile}; that is, objects that inherit from
+ \code{SequenceSummary}.
+ }
}}
+
+\author{Vince Buffalo <vsbuffalo@ucdavis.edu>}
+\examples{
+ ## Load a FASTQ file, with sequence hashing.
+ s.fastq <- readSeqFile(system.file('extdata', 'test.fastq',
+ package='qrqc'))
+
+ # A custom base plot
+ ggplot(getBaseProp(s.fastq)) + geom_line(aes(x=position, y=proportion,
+ color=base)) + facet_grid(. ~ base) + scale_color_dna()
+}
+
+\seealso{getGC}
+\seealso{getSeqlen}
+\seealso{getBase}
+\seealso{getQual}
+\seealso{getMCQual}
+\seealso{basePlot}
+
\keyword{methods}
-\keyword{ ~~ other possible keyword(s) ~~ }
+\keyword{accessor}
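Reviewer note: the proportion calculation that the `getBaseProp` documentation describes can be sketched in base R as follows. This is a minimal illustration, not the package's implementation. The `base.freq` data frame is made-up data that only mirrors the `position`, `base` and `frequency` columns used in the `getBase` example, and the names `totals` and `base.prop` are hypothetical.

```r
# Hypothetical long-format base counts (assumed layout: one row per
# position/base pair, as in the getBase example's aesthetics).
base.freq <- data.frame(
  position = rep(1:3, each = 4),
  base = rep(c("A", "C", "G", "T"), times = 3),
  frequency = c(10, 20, 30, 40,
                25, 25, 25, 25,
                0, 50, 50, 0))

# Total number of bases observed at each position, repeated per row.
totals <- ave(base.freq$frequency, base.freq$position, FUN = sum)

# Proportion of each base at its position.
base.prop <- data.frame(position = base.freq$position,
                        base = base.freq$base,
                        proportion = base.freq$frequency / totals)
```

Within each position the proportions sum to one, which makes reads of different depths comparable on the same axis.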
63 qrqc/man/getGC-methods.Rd
@@ -2,16 +2,65 @@
\docType{methods}
\alias{getGC-methods}
\alias{getGC,SequenceSummary-method}
-\title{ ~~ Methods for Function \code{getGC} ~~}
+\title{Get a Data Frame of GC Content from a \code{SequenceSummary} object}
\description{
- ~~ Methods for function \code{getGC} ~~
+ An object that inherits from class \code{SequenceSummary} contains
+ base frequency data by position gathered by \code{readSeqFile}. \code{getGC}
+ is an accessor function that reshapes the base frequency data into a
+ data frame and returns the GC content by position.
+
+ This accessor function is useful if you want to map variables to
+ custom \code{ggplot2} aesthetics. Frequencies or proportions of all
+ bases (not just GC) can be accessed with \code{getBase} and
+ \code{getBaseProp} respectively.
}
-\section{Methods}{
-\describe{
-\item{\code{signature(x = "SequenceSummary")}}{
-%% ~~describe this method here~~
+\usage{
+ getGC(x)
+}
+
+\arguments{
+ \item{x}{an S4 object that inherits from \code{SequenceSummary} from
+ \code{readSeqFile}.}
+}
+
+
+\value{
+ \code{getGC} returns a \code{data.frame} with columns:
+
+ \item{position}{the position in the read.}
+ \item{gc}{GC content per position in the read.}
}
+
+\section{Methods}{
+ \describe{
+ \item{\code{signature(x = "SequenceSummary")}}{
+ \code{getGC} is an accessor function that works on any object read
+ in with \code{readSeqFile}; that is, objects that inherit from
+ \code{SequenceSummary}.
+ }
}}
+
+\author{Vince Buffalo <vsbuffalo@ucdavis.edu>}
+\examples{
+ ## Load a FASTQ file, with sequence hashing.
+ s.fastq <- readSeqFile(system.file('extdata', 'test.fastq',
+ package='qrqc'))
+
+ # A custom GC plot
+ d <- merge(getQual(s.fastq), getGC(s.fastq), by.x="position", by.y="position")
+ p <- ggplot(d) + geom_linerange(aes(x=position, ymin=lower,
+ ymax=upper, color=gc)) + scale_color_gradient("GC content", low="red",
+ high="blue") + scale_y_continuous("quality")
+ p
+}
+
+\seealso{getSeqlen}
+\seealso{getBase}
+\seealso{getBaseProp}
+\seealso{getQual}
+\seealso{getMCQual}
+
+\seealso{gcPlot}
\keyword{methods}
-\keyword{ ~~ other possible keyword(s) ~~ }
+\keyword{accessor}
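Reviewer note: one plausible reading of "GC content by position" is the share of G and C counts among all bases at each position. The sketch below assumes that definition; the commit does not show how `getGC` actually computes it. The input `base.freq` is made-up data, and `is.gc`, `gc.counts`, `totals` and `gc.df` are hypothetical names.

```r
# Hypothetical per-position base counts (assumed layout: position, base,
# frequency), not data from the package.
base.freq <- data.frame(
  position = rep(1:3, each = 4),
  base = rep(c("A", "C", "G", "T"), times = 3),
  frequency = c(10, 20, 30, 40,
                40, 10, 10, 40,
                0, 50, 50, 0))

# Flag the rows that hold G or C counts.
is.gc <- base.freq$base %in% c("G", "C")

# Sum G + C counts and all counts at each position.
gc.counts <- tapply(base.freq$frequency * is.gc, base.freq$position, sum)
totals <- tapply(base.freq$frequency, base.freq$position, sum)

# Assumed definition: GC content = (G + C) / all bases, per position.
gc.df <- data.frame(position = as.integer(names(totals)),
                    gc = as.vector(gc.counts / totals))
```

A data frame in this shape, with `position` and `gc` columns, can be merged with per-position quality data on `position`, as the documented example does.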
62 qrqc/man/getMCQual-methods.Rd
@@ -2,16 +2,64 @@
\docType{methods}
\alias{getMCQual-methods}
\alias{getMCQual,FASTQSummary-method}
-\title{ ~~ Methods for Function \code{getMCQual} ~~}
+\title{Get a Data Frame of Simulated Qualities from a \code{FASTQSummary} object}
\description{
- ~~ Methods for function \code{getMCQual} ~~
+ An object that inherits from class \code{FASTQSummary} contains
+ base quality data by position gathered by \code{readSeqFile}. \code{getMCQual}
+ generates simulated quality data for each base from this binned
+ quality data that can be used for adding smoothed lines via lowess.
+
+ This accessor function is useful if you want to map variables to
+ custom \code{ggplot2} aesthetics.
}
-\section{Methods}{
-\describe{
-\item{\code{signature(x = "FASTQSummary")}}{
-%% ~~describe this method here~~
+
+\usage{
+ getMCQual(x, n=100)
+}
+
+\arguments{
+ \item{x}{an S4 object that inherits from \code{FASTQSummary} from
+ \code{readSeqFile}.}
+ \item{n}{a numeric value indicating the number of quality values to
+ draw per base.}
+}
+
+
+\value{
+ \code{getMCQual} returns a \code{data.frame} with columns:
+
+ \item{position}{the position in the read.}
+ \item{quality}{simulated quality scores.}
}
+
+\section{Methods}{
+ \describe{
+ \item{\code{signature(x = "FASTQSummary")}}{
+ \code{getMCQual} is a function that works on any object with class
+ \code{FASTQSummary} read in with \code{readSeqFile}.
+ }
}}
+
+\author{Vince Buffalo <vsbuffalo@ucdavis.edu>}
+\examples{
+ ## Load a FASTQ file, with sequence hashing.
+ s.fastq <- readSeqFile(system.file('extdata', 'test.fastq',
+ package='qrqc'))
+
+ # A custom quality plot
+ ggplot(getQual(s.fastq)) + geom_linerange(aes(x=position, ymin=lower,
+ ymax=upper), color="grey") + geom_smooth(aes(x=position, y=quality),
+ data=getMCQual(s.fastq), color="blue", se=FALSE)
+}
+
+\seealso{getGC}
+\seealso{getSeqlen}
+\seealso{getBase}
+\seealso{getBaseProp}
+\seealso{getQual}
+
+\seealso{qualPlot}
+
\keyword{methods}
-\keyword{ ~~ other possible keyword(s) ~~ }
+\keyword{accessor}
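Reviewer note: the Monte Carlo idea behind `getMCQual` (draw quality values per position from the binned quality distribution, then smooth them) can be sketched in base R. This is an illustration under assumptions, not the package's code: the `binned` count matrix is simulated, `simulateQual` is a hypothetical helper, and only the default of 100 draws per position and the use of lowess come from the documentation above.

```r
set.seed(42)

# Hypothetical binned data: rows are read positions, columns are counts of
# each quality score 0..40. The mean quality is made to fall towards the
# 3'-end to mimic a quality drop.
qualities <- 0:40
n.positions <- 50
binned <- t(sapply(seq_len(n.positions), function(pos) {
  mu <- 38 - 20 * (pos / n.positions)^2
  scores <- pmin(pmax(round(rnorm(1000, mu, 4)), 0), 40)
  tabulate(scores + 1, nbins = length(qualities))
}))

# Monte Carlo step: for every position, draw n quality values with
# probability proportional to the binned counts at that position.
simulateQual <- function(binned, qualities, n = 100) {
  do.call(rbind, lapply(seq_len(nrow(binned)), function(pos) {
    data.frame(position = pos,
               quality = sample(qualities, n, replace = TRUE,
                                prob = binned[pos, ]))
  }))
}

mc <- simulateQual(binned, qualities, n = 100)

# Smooth the simulated points; lowess needs individual points, which is
# why the binned counts are expanded first.
fit <- lowess(mc$position, mc$quality)
```

Larger `n` gives a more stable curve at the cost of more points to smooth; the result remains an approximation of the underlying quality distribution.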
9 qrqc/man/getQual-methods.Rd
@@ -63,7 +63,12 @@
ymax=upper, color=mean)) + scale_color_gradient("mean quality",
low="red", high="green") + scale_y_continuous("quality")
}
-\seealso{getQual}
-\seealso{list2df}
+\seealso{getGC}
+\seealso{getSeqlen}
+\seealso{getBase}
+\seealso{getBaseProp}
+\seealso{getMCQual}
+
+\seealso{qualPlot}
\keyword{methods}
\keyword{accessor}
61 qrqc/man/getSeqlen-methods.Rd
@@ -2,16 +2,63 @@
\docType{methods}
\alias{getSeqlen-methods}
\alias{getSeqlen,SequenceSummary-method}
-\title{ ~~ Methods for Function \code{getSeqlen} ~~}
+\title{Get a Data Frame of Sequence Lengths from a \code{SequenceSummary} object}
\description{
- ~~ Methods for function \code{getSeqlen} ~~
+ An object that inherits from class \code{SequenceSummary} contains
+ sequence length data gathered by \code{readSeqFile}. \code{getSeqlen}
+ is an accessor function that returns the sequence length data.
+
+ This accessor function is useful if you want to map variables to
+ custom \code{ggplot2} aesthetics.
}
-\section{Methods}{
-\describe{
-\item{\code{signature(x = "SequenceSummary")}}{
-%% ~~describe this method here~~
+
+\usage{
+ getSeqlen(x)
+}
+
+\arguments{
+ \item{x}{an S4 object that inherits from \code{SequenceSummary} from
+ \code{readSeqFile}.}
+}
+
+
+\value{
+ \code{getSeqlen} returns a \code{data.frame} with columns:
+
+ \item{length}{the sequence length.}
+ \item{count}{the number of reads with this sequence length.}
}
+
+\section{Methods}{
+ \describe{
+ \item{\code{signature(x = "SequenceSummary")}}{
+ \code{getSeqlen} is an accessor function that works on any object read
+ in with \code{readSeqFile}; that is, objects that inherit from
+ \code{SequenceSummary}.
+ }
}}
+
+\author{Vince Buffalo <vsbuffalo@ucdavis.edu>}
+\examples{
+ ## Load a FASTQ file, with sequence hashing.
+ s.trimmed.fastq <- readSeqFile(system.file('extdata', 'test-trimmed.fastq',
+ package='qrqc'))
+
+ # A custom plot - a bit contrived, but should show power
+ d <- merge(getSeqlen(s.trimmed.fastq), getQual(s.trimmed.fastq),
+ by.x="length", by.y="position")
+ ggplot(d) + geom_linerange(aes(x=length, ymin=0, ymax=count),
+ color="grey") + geom_linerange(aes(x=length, ymin=lower, ymax=upper),
+ color="blue") + scale_y_continuous("quality/count") + theme_bw()
+}
+
+\seealso{getGC}
+\seealso{getBase}
+\seealso{getBaseProp}
+\seealso{getQual}
+\seealso{getMCQual}
+\seealso{seqlenPlot}
+
\keyword{methods}
-\keyword{ ~~ other possible keyword(s) ~~ }
+\keyword{accessor}