diff --git a/DESCRIPTION b/DESCRIPTION
index 0121d825..b46c1519 100755
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -11,6 +11,7 @@ LazyData: true
RoxygenNote: 6.1.1
biocViews:
Biarch: true
+Depends: R (>= 3.5.0)
Imports: tibble, dplyr, magrittr, tidyr, ggplot2, readr, widyr, foreach, rlang, purrr
Suggests:
parallel,
diff --git a/vignettes/introduction.Rmd b/vignettes/introduction.Rmd
index f2a64f39..f0cade85 100755
--- a/vignettes/introduction.Rmd
+++ b/vignettes/introduction.Rmd
@@ -1,5 +1,5 @@
---
-title: "Introduction to the ttBulk package"
+title: "Overview of the ttBulk package"
author: "Stefano Mangiola"
package: ttBulk
output:
@@ -11,7 +11,16 @@ vignette: >
%\VignetteEngine{knitr::rmarkdown}
---
-```{r, echo=FALSE, include=FALSE, }
+
+
+
+
+
+
+
+
+
+```{r, echo=FALSE, include=FALSE}
library(knitr)
knitr::opts_chunk$set(cache = TRUE, warning = FALSE,
message = FALSE, cache.lazy = FALSE)
@@ -43,21 +52,22 @@ my_theme =
-ttBulk is a collection of wrappers for bulk tanscriptomic analyses that follows the "tidy" paradigm. The data structure is a tibble with a column for
+ttBulk is a collection of wrapper functions for bulk transcriptomic analyses that follows the "tidy" paradigm. The data structure is a tibble with columns for
+ sample identifier column
+ transcript identifier column
+ `read count` column
+ annotation (and other info) columns
+**RD: why are cell type and read count quoted?**
```{r}
ttBulk::counts # Accessible via ttBulk::counts
```
-Every function takes this tructure as input and outputs: (i) this structure with addictional information (action="add"); or (ii) this structure with the isolated new information (action="get")
+Every function takes this structure as input and outputs either (i) this structure with additional information, or (ii) this structure with **the isolated new information** (I don't get this), via "add" and "get" actions respectively. **More details about the actions will be explained in section xx and xx**.
+**RD: the actions are a bit out of context to me without seeing examples of what the functions can do.**
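The difference between the two actions can be illustrated with a minimal sketch. It does not call ttBulk; `toy_counts` and the per-sample `total` summary are made-up stand-ins, and the join-based behaviour is an assumption about what "add" and "get" mean based on the description above.

```r
library(tibble)
library(dplyr)

# Toy tidy counts table: one row per sample/transcript pair (made-up values)
toy_counts <- tibble(
  sample       = rep(c("s1", "s2"), each = 2),
  transcript   = rep(c("geneA", "geneB"), times = 2),
  `read count` = c(10, 30, 5, 15)
)

# Hypothetical per-sample result standing in for any computed annotation
library_size <- toy_counts %>%
  group_by(sample) %>%
  summarise(total = sum(`read count`))

# "get": only the isolated new information, one row per sample
result_get <- library_size

# "add": the original table with the new column joined onto every row
result_add <- toy_counts %>% left_join(library_size, by = "sample")
```

Under this reading, "get" returns a compact table without the redundancy of repeating a per-sample value on every transcript row, while "add" keeps everything in one table for further piping.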
In brief you can:
-
+ Going from BAM/SAM to a tidy data frame of counts (FeatureCounts)
+ Adding gene symbols from ensembl IDs
+ Aggregating duplicated gene symbols
@@ -73,7 +83,8 @@ In brief you can:
# Aggregate `transcripts`
-Aggregating duplicated transcripts (e.g., isoforms, ensembl). For example, we often have to convert ensembl to gene symbol, in doing so we have to deal with duplicated symbols
+ttBulk provides the `aggregate_duplicates` function to aggregate duplicated transcripts (e.g., isoforms, ensembl). For example, we often have to convert **transcript names(?) from** ensembl to gene symbol, but in doing so we have to deal with duplicated symbols. **`aggregate_duplicates` takes xx as arguments and returns xx.**
+**RD: The example in vignette doesn't really show what is changed by this function. "number of merged tr.." column is truncated**
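To make the effect of aggregation visible, here is a minimal sketch on made-up data. It uses plain dplyr rather than `aggregate_duplicates`; summing the counts of duplicated symbols and the `n_merged` column name are assumptions for illustration, not the documented behaviour of the function.

```r
library(tibble)
library(dplyr)

# Toy table where the symbol TP53 appears twice for the same sample
toy_counts <- tibble(
  sample       = c("s1", "s1", "s1"),
  transcript   = c("TP53", "TP53", "GAPDH"),
  `read count` = c(5, 7, 20)
)

# Collapse duplicated sample/transcript pairs into a single row
toy_aggr <- toy_counts %>%
  group_by(sample, transcript) %>%
  summarise(
    n_merged     = n(),                # how many rows were merged
    `read count` = sum(`read count`)   # assumed aggregation: sum
  ) %>%
  ungroup()
```

The two TP53 rows become one, and `n_merged` records how many rows went into each aggregated value.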
```{r, cache=TRUE}
counts.aggr =
@@ -91,10 +102,14 @@ counts.aggr
# Normalise `read counts`
-For visualisation purposes or ad hoc analyses, we may want to calculate the normalised read counts for library size (e.g., with TMM algorithm). These new values will be added to the original data set as ` normalised`
+For visualisation purposes or ad hoc analyses, we may want to calculate the normalised read counts for library size (e.g., with the TMM algorithm). **RD: citation perhaps?**
+These new values will be added to the original data set as ` normalised`.
+
+**RD: The columns are truncated again.**
```{r, cache=TRUE}
-counts.norm = counts.aggr %>% normalise_counts(sample, transcript, `read count`)
+counts.norm = counts.aggr %>%
+ normalise_counts(sample, transcript, `read count`)
counts.norm
```
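As an illustration of what library-size normalisation does, the sketch below scales made-up counts to counts per million. This is deliberately a simpler technique than TMM and is not what `normalise_counts` computes; it only shows the idea of making samples with different sequencing depth comparable.

```r
library(tibble)
library(dplyr)

# Two toy samples with the same composition but a 5x difference in depth
toy_counts <- tibble(
  sample       = rep(c("s1", "s2"), each = 2),
  transcript   = rep(c("geneA", "geneB"), times = 2),
  `read count` = c(10, 30, 50, 150)
)

# Counts per million: divide by the sample's library size, scale to 1e6
toy_norm <- toy_counts %>%
  group_by(sample) %>%
  mutate(`read count normalised` = `read count` / sum(`read count`) * 1e6) %>%
  ungroup()
```

After scaling, both samples have identical normalised values, since they differ only in depth.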
@@ -108,60 +123,62 @@ counts.norm %>%
my_theme
```
-# Reduce `dimensions`
+# Reduce `dimensions`
-For visualisation purposes or ad hoc analyses, we may want to reduce the dimentions of our data, for example using PCA or MDS algorithms. These new values will be added to the original data set.
+For visualisation purposes or ad hoc analyses, we may want to reduce the dimensions of our data, for example using PCA or MDS algorithms.
+**These new values can be added to the original data set, or retrieved with respect to samples, via the `reduce_dimensions` function.**
+**RD: MDS and PCA are not really different sections since they use the same function. Perhaps just explain what methods are supported by the function and show MDS and PCA as examples?**
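For readers unfamiliar with the two methods, here is a minimal base R sketch on a made-up sample-by-transcript matrix. It uses `stats::cmdscale` and `stats::prcomp` as stand-ins; which implementations `reduce_dimensions` actually wraps is not stated here, so treat this only as an illustration of the concepts.

```r
set.seed(1)

# Toy matrix: 6 samples (rows) x 20 transcripts (columns), made-up values
expr <- matrix(
  rnorm(6 * 20), nrow = 6,
  dimnames = list(paste0("s", 1:6), paste0("gene", 1:20))
)

# MDS: classical scaling of the Euclidean distances between samples
mds <- stats::cmdscale(dist(expr), k = 2)
colnames(mds) <- c("Dimension 1", "Dimension 2")

# PCA: first two principal components of the centred matrix
pca <- stats::prcomp(expr, center = TRUE, scale. = FALSE)$x[, 1:2]
```

Both results have one row per sample and one column per retained component. With Euclidean distances, classical MDS and PCA give the same coordinates up to sign, which is one reason the two can be presented as options of a single function.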
## MDS
-
```{r, cache=TRUE}
-counts.norm.MDS =
+counts.norm.MDS =
counts.norm %>%
reduce_dimensions(value_column = `read count normalised`, method="MDS" , elements_column = sample, feature_column = transcript, components = 1:10)
counts.norm.MDS
```
```{r, cache=TRUE}
-counts.norm.MDS %>%
+counts.norm.MDS %>%
select(contains("Dimension"), sample, `Cell type`) %>%
distinct() %>%
- GGally::ggpairs(columns = 1:6, ggplot2::aes(colour=`Cell type`))
+ GGally::ggpairs(columns = 1:6, ggplot2::aes(colour=`Cell type`))
```
-## PCA
+## adding PCA values
```{r, cache=TRUE}
-counts.norm.PCA =
+counts.norm.PCA =
counts.norm %>%
reduce_dimensions(value_column = `read count normalised`, method="PCA" , elements_column = sample, feature_column = transcript, components = 1:10)
counts.norm.PCA
```
```{r, cache=TRUE}
-counts.norm.PCA %>%
+counts.norm.PCA %>%
select(contains("PC"), sample, `Cell type`) %>%
distinct() %>%
- GGally::ggpairs(columns = 1:6, ggplot2::aes(colour=`Cell type`))
+ GGally::ggpairs(columns = 1:6, ggplot2::aes(colour=`Cell type`))
```
# Rotate `dimensions`
-For visualisation purposes or ad hoc analyses, we may want to rotate the reduced dimentions (or any two numeric columns really) of our data, of a set angle. The rotated dimensions will be added to the original data set as ` rotated ` by default, or as specified in the input arguments.
+For visualisation purposes or ad hoc analyses, we may want to rotate the reduced dimensions (or any two numeric columns really) of our data, of a set angle. The rotated dimensions will be added to the original data set as ` rotated ` by default, or as specified in the input arguments.
+**RD: For visualisation purposes or ad hoc analyses × 3**
```{r, cache=TRUE}
counts.norm.MDS.rotated =
counts.norm.MDS %>%
rotate_dimensions(`Dimension 1`, `Dimension 2`, rotation_degrees = 45, elements_column = sample)
```
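The rotation itself is the standard 2D rotation matrix applied to each pair of coordinates. The helper below is a hypothetical sketch, not the ttBulk implementation; in particular, the counter-clockwise direction for positive angles is an assumption.

```r
# Hypothetical helper: rotate points (x, y) by `degrees` counter-clockwise
rotate <- function(x, y, degrees) {
  theta <- degrees * pi / 180  # convert degrees to radians
  data.frame(
    x_rotated = x * cos(theta) - y * sin(theta),
    y_rotated = x * sin(theta) + y * cos(theta)
  )
}

# Rotate the two unit vectors by 45 degrees
rotated <- rotate(x = c(1, 0), y = c(0, 1), degrees = 45)
```

Distances between points are unchanged by the rotation, so it affects only how the plot is oriented, not the relationships between samples.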
-
+**RD: These(⇩) are not really sections, but just two values to be compared? Perhaps just indicate that with fig titles and explain.**
## Original
```{r, cache=TRUE}
counts.norm.MDS.rotated %>%
distinct(sample, `Dimension 1`,`Dimension 2`, `Cell type`) %>%
- ggplot(aes(x=`Dimension 1`, y=`Dimension 2`, color=`Cell type` )) +
+ ggplot(aes(x=`Dimension 1`, y=`Dimension 2`, color=`Cell type` )) +
geom_point() +
my_theme
```
@@ -171,15 +188,17 @@ counts.norm.MDS.rotated %>%
```{r, cache=TRUE}
counts.norm.MDS.rotated %>%
distinct(sample, `Dimension 1 rotated 45`,`Dimension 2 rotated 45`, `Cell type`) %>%
- ggplot(aes(x=`Dimension 1 rotated 45`, y=`Dimension 2 rotated 45`, color=`Cell type` )) +
+ ggplot(aes(x=`Dimension 1 rotated 45`, y=`Dimension 2 rotated 45`, color=`Cell type` )) +
geom_point() +
my_theme
```
# Differential transcirption
+**RD: This section title doesn't have a verb.**
-We may want to test for differential transcription between sample-wise factors of interest (e.g., with edgeR). The statistics will be added to the original data. In this example we set action ="get" to just output the non redundant gene statistics.
+We may want to test for differential transcription between sample-wise factors of interest (e.g., with edgeR). The statistics **what stats? maybe explain a little.** will be added to the original data. In this example we set `action ="get"` to just output the non-redundant gene statistics **such as ...**.
+**RD: explain the args? counts_mini is a new dataset so perhaps show that first?**
```{r, cache=TRUE}
ttBulk::counts_mini %>%
annotate_differential_transcription(
@@ -187,31 +206,33 @@ ttBulk::counts_mini %>%
sample_column = sample,
transcript_column = transcript,
counts_column = `read count`,
- action="get")
+ action="get")
```
# Adjust `read counts`
-Adjust `read counts` for (known) unwanted variation. For visualisation purposes or ad hoc analyses, we may want to adjust our normalised counts to remove known unwanted variation. The adjusted counts will be added to the original data set as ` adjusted`. The formulation is similar to a linear model, where the first covariate is the factor of interest and the second covariate is the unwanted variation. At the moment just an unwanted covariated is allowed at the time.
+**`adjust_counts` function adjusts** `read counts` for (known) unwanted variation. For visualisation purposes or ad hoc analyses, we may want to adjust our normalised counts to remove known unwanted variation. The adjusted counts will be added to the original data set as ` adjusted`. The formulation is similar to a linear model, where the first covariate is the factor of interest and the second covariate is the unwanted variation. At the moment just one unwanted covariate is allowed at the **(a?)** time.
+
+**"## Standardizing Data across genes" is still in the cache?**
```{r, cache=TRUE}
-counts.norm.adj =
+counts.norm.adj =
counts.norm %>%
-
+
# Add fake batch and factor of interest
- left_join(
- (.) %>%
- distinct(sample) %>%
- mutate(batch = sample(0:1, n(), replace = T))
+ left_join(
+ (.) %>%
+ distinct(sample) %>%
+ mutate(batch = sample(0:1, n(), replace = T))
) %>%
mutate(factor_of_interest = `Cell type` == "b_cell") %>%
-
+
# Add covariate
adjust_counts(
- ~ factor_of_interest + batch,
- sample,
- transcript,
- `read count normalised`,
+ ~ factor_of_interest + batch,
+ sample,
+ transcript,
+ `read count normalised`,
action = "get"
)
@@ -219,34 +240,36 @@ counts.norm.adj
```
# Cell type composition
+**No verb in the title**
-We may want to infer the cell type composition of our samples (e.g., with cibersort). The cell type proportions will be added to the original data.
-
+We may want to infer the cell type composition of our samples (e.g., with **cibersort (I don't know this one)**). The cell type proportions will be added to **or "get"** the original data **with `annotate_cell_type` function**.
+**columns truncated**
```{r, cache=TRUE}
-counts.cibersort =
- ttBulk::counts %>%
- annotate_cell_type(sample, transcript, `read count`, action="add")
+counts.cibersort =
+ ttBulk::counts %>%
+ annotate_cell_type(sample, transcript, `read count`, action="add")
counts.cibersort
```
-We can plot the distributions of cell types across samples, and compare them with the nominal cell type labels to check for the purity of isolation.
+We can plot the distributions of cell types across samples, and compare them with the nominal cell type labels to check for the purity of isolation **using xx function**.
```{r, cache=TRUE}
counts.cibersort %>%
rename(`Cell type experimental` = `Cell type.x`, `Cell type estimated` = `Cell type.y`) %>%
distinct(sample, `Cell type experimental`, `Cell type estimated`, proportion) %>%
- ggplot(aes(x=`Cell type estimated`, y=proportion, fill=`Cell type experimental`)) +
- geom_boxplot() +
+ ggplot(aes(x=`Cell type estimated`, y=proportion, fill=`Cell type experimental`)) +
+ geom_boxplot() +
facet_wrap(~`Cell type experimental`) +
- my_theme +
+ my_theme +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), aspect.ratio=1/5)
```
-# Cluster
+# Cluster **xx data**
For visualisation purposes or ad hoc analyses, we may want to cluster our data (e.g., k-means sample-wise). The cluster annotation will be added to the original data set.
+**similar to the PCA/MDS, maybe just explain the xx function supports k-means clustering by xxx?**
## k-means
```{r, cache=TRUE}
@@ -262,12 +285,12 @@ We can add cluster annotation to the MDS dimesion reduced data set and plot
counts.norm.MDS %>%
annotate_clusters(
value_column = `read count normalised`,
- elements_column = sample,
- feature_column = transcript,
- number_of_clusters = 2
- ) %>%
+ elements_column = sample,
+ feature_column = transcript,
+ number_of_clusters = 2
+ ) %>%
distinct(sample, `Dimension 1`, `Dimension 2`, cluster) %>%
- ggplot(aes(x=`Dimension 1`, y=`Dimension 2`, color=cluster)) +
+ ggplot(aes(x=`Dimension 1`, y=`Dimension 2`, color=cluster)) +
geom_point() +
my_theme
```
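Sample-wise k-means can be sketched in base R as follows. The matrix is made up, and `stats::kmeans` is used as a stand-in; how `annotate_clusters` prepares the data or which options it passes is not shown here.

```r
set.seed(42)

# Toy matrix: two well-separated groups of 3 samples, 10 transcripts each
expr <- rbind(
  matrix(rnorm(3 * 10, mean = 0),  nrow = 3),
  matrix(rnorm(3 * 10, mean = 10), nrow = 3)
)
rownames(expr) <- paste0("s", 1:6)

# Cluster samples (rows) into two groups; nstart repeats random starts
fit <- stats::kmeans(expr, centers = 2, nstart = 10)

# One cluster label per sample, ready to be joined back onto a tidy table
clusters <- data.frame(
  sample  = rownames(expr),
  cluster = factor(fit$cluster)
)
```

The resulting `clusters` table has one row per sample, which is the shape needed to join the annotation back onto a tidy counts table by `sample`.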
@@ -276,15 +299,19 @@ We can add cluster annotation to the MDS dimesion reduced data set and plot
For visualisation purposes or ad hoc analyses, we may want to remove redundant elements from the original data set (e.g., samples or transcripts).
+**I don't get this part. What's the motivation for removing this information?**
+**Similar to above subsections, "e.g. ..." sounds like more than these two modes are supported. But it is unclear what is supported.**
+
## Use correlation or sample removal
+**sample removal?**
```{r, cache=TRUE}
-counts.norm.non_redundant =
- counts.norm.MDS %>%
+counts.norm.non_redundant =
+ counts.norm.MDS %>%
drop_redundant(
method = "correlation",
elements_column = sample,
- feature_column = transcript,
+ feature_column = transcript,
value_column = `read count normalised`
)
```
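One way to picture correlation-based redundancy removal is the greedy filter below. It is a sketch under stated assumptions: the data are made up, the 0.9 threshold is arbitrary, and the rule (drop a sample if it correlates strongly with any earlier sample) is a simplification, not the algorithm used by `drop_redundant`.

```r
set.seed(7)

# Toy matrix: transcripts in rows, samples in columns (made-up values)
base <- rnorm(50)
expr <- cbind(
  s1 = base,
  s2 = base + rnorm(50, sd = 0.01),  # near-duplicate of s1
  s3 = rnorm(50)                     # unrelated sample
)

# Pairwise sample correlations; keep only the lower triangle so that
# each row is compared with the samples that come before it
cors <- cor(expr)
cors[upper.tri(cors, diag = TRUE)] <- 0

# Flag samples that correlate above the threshold with an earlier sample
threshold <- 0.9  # assumed cut-off for illustration
redundant <- apply(abs(cors) > threshold, 1, any)
kept <- colnames(expr)[!redundant]
```

Here `s2` is dropped because it carries almost the same information as `s1`, while `s1` and `s3` are kept.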
@@ -294,7 +321,7 @@ We can visualise how the reduced redundancy with the reduced dimentions look lik
```{r, cache=TRUE}
counts.norm.non_redundant %>%
distinct(sample, `Dimension 1`, `Dimension 2`, `Cell type`) %>%
- ggplot(aes(x=`Dimension 1`, y=`Dimension 2`, color=`Cell type`)) +
+ ggplot(aes(x=`Dimension 1`, y=`Dimension 2`, color=`Cell type`)) +
geom_point() +
my_theme
@@ -303,8 +330,8 @@ counts.norm.non_redundant %>%
## Use reduced dimensions
```{r, cache=TRUE}
-counts.norm.non_redundant =
- counts.norm.MDS %>%
+counts.norm.non_redundant =
+ counts.norm.MDS %>%
drop_redundant(
method = "reduced_dimensions",
elements_column = sample,
@@ -316,11 +343,11 @@ counts.norm.non_redundant =
We can visualise how the reduced redundancy with the reduced dimentions look like
-
+**Maybe explain briefly what information the plot shows? Same with the previous plots as well.**
```{r, cache=TRUE}
counts.norm.non_redundant %>%
distinct(sample, `Dimension 1`, `Dimension 2`, `Cell type`) %>%
- ggplot(aes(x=`Dimension 1`, y=`Dimension 2`, color=`Cell type`)) +
+ ggplot(aes(x=`Dimension 1`, y=`Dimension 2`, color=`Cell type`)) +
geom_point() +
my_theme
@@ -328,13 +355,19 @@ counts.norm.non_redundant %>%
# Other useful wrappers
-We can convert a list of BAM/SAM files into a tidy data frame of annotated counts, via FeatureCounts
+**some filler sentences? To conclude the above and explain these functions a bit more?**
+
+**e.g. above functions do xxx on single BAM/SAM file, however, xxx is also supported via xx function.**
+
+We can convert a list of BAM/SAM files into a tidy data frame of annotated counts, via FeatureCounts.
+
+**I don't know what FeatureCounts is :(**
```{r eval=FALSE}
counts = bam_sam_to_featureCounts_tibble(file_names, genome = "hg38")
```
-We can add gene symbols from ensembl identifiers
+We can add gene symbols from ensembl identifiers **give an example of application?**
```{r eval = F}
counts_ensembl %>% annotate_symbol(ens)