# Monocle 3 on Colab in R

[![](https://reconstrue.github.io/single_cell_on_colab/monocle_cover_image.png)](https://cole-trapnell-lab.github.io/monocle3/)


This code can be found on GitHub in the repo [@reconstrue/single_cell_on_colab](https://github.com/reconstrue/single_cell_on_colab/tree/master/tools/monocle), where there is also a companion notebook, [monocle3_on_colab_in_python.ipynb](https://github.com/reconstrue/single_cell_on_colab/tree/master/tools/monocle/monocle3_on_colab_in_python.ipynb). The companion does the same thing as this notebook but offers a better UI experience on Colab.


## Introduction

[Monocle 3](https://cole-trapnell-lab.github.io/monocle3/) is "an analysis toolkit for single-cell RNA-seq." It is [MIT](https://github.com/cole-trapnell-lab/monocle3/blob/master/LICENSE.md)-licensed code out of Seattle's Lake Union area.

This R Jupyter notebook started as a simple test of whether [Monocle 3](https://cole-trapnell-lab.github.io/monocle3/) could be deployed and exercised on Colab. It can, mostly; more on that below.

## Legal

This code is licensed under the Apache License, Version 2.0. This basically means you can do WHATEVER you want with it but don't come crying to me when someone gets an eye poked out.

<img src="http://reconstrue.com/assets/images/reconstrue_combo_mark.svg" width="200px" align="left"/>

In [0]:
# Copyright 2019-2020 Reconstrue LLC. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

## Results
This experiment went surprisingly well. The one big problem that needs to be solved is that there is a bit of UI that seems to be made for RStudio but which has not been made to work on Colab. (The author of this notebook is not an experienced R developer.)

Specifically, this limitation relates to selecting a subset of a scatter plot; seemingly, interactive R code has not been fully worked out on Colab. If this one small-ish issue were solved, there would be an example Jupyter notebook showing folks how to deploy Monocle 3 on Colab.


## R on Colab
Monocle 3 is written in R. Turns out Colab can run R notebooks, although this is not widely known because, as of now (2019-11-20), R is not an officially supported language on Colab (and it shows; more on that below).

On 2019-06-18 JFT found IRkernel's Demo.html via [stackoverflow: How to use R with Google Colaboratory?](https://stackoverflow.com/a/54595286). That is, this page was built out on Colab starting from IRkernel's demo notebook ([Demo.ipynb](https://github.com/IRkernel/IRkernel/blob/master/example-notebooks/Demo.ipynb)), which is [MIT licensed](https://github.com/IRkernel/IRkernel/blob/master/DESCRIPTION#L20).

The core point is to start with Demo.ipynb because it has the JSON metadata specifying that the notebook is designed for an R kernel, which Colab will then provide:
```
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "language_info": {
      "codemirror_mode": "r",
      "file_extension": ".r",
      "mimetype": "text/x-r-source",
      "name": "R",
      "pygments_lexer": "r",
      "version": "3.3.1"
    },
    "kernelspec": {
      "display_name": "R",
      "language": "R",
      "name": "ir"
    }
  }...
```
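
For readers who would rather start from a blank notebook than from Demo.ipynb, the following is a minimal sketch that writes an empty `.ipynb` whose metadata requests the `ir` kernel. It assumes the `jsonlite` package is available and that the `kernelspec` entry alone is enough for Colab to pick the R kernel (not verified here); the helper `make_r_notebook()` and the file name are made up for illustration.

```r
library(jsonlite)

# Hypothetical helper: a bare nbformat-4 notebook that asks for the R kernel.
make_r_notebook <- function() {
  list(
    nbformat = 4L,
    nbformat_minor = 0L,
    metadata = list(
      kernelspec = list(display_name = "R", language = "R", name = "ir")
    ),
    cells = list()  # no cells yet; serialized as an empty JSON array
  )
}

# Write the skeleton to a temporary file (the file name is arbitrary).
nb_path <- file.path(tempdir(), "empty_r_notebook.ipynb")
write_json(make_r_notebook(), nb_path, auto_unbox = TRUE, pretty = TRUE)
```

The resulting file could then be uploaded to Colab like any other notebook.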

## Install Monocle 3

One of the nice features of Colab (essentially Google hosting Jupyter notebooks for free) is that, for Python, the vast majority of packages one would use are pre-installed, which makes for very quick initialization. Currently this is very much not the case for R, as we will see shortly: the installs are slow and annoying. C'est la vie, but it is another detail that makes R on Colab nowhere near as cool as Python on Colab. It is a real drag given the frequent disconnects upon idle, as well as the hard 12-hour limit imposed on Colab sessions.


### System Configuration
The Monocle folks did nice work making it easy to install Monocle, as documented in [Installing Monocle 3](https://cole-trapnell-lab.github.io/monocle3/docs/installation/). Nonetheless, on Colab the following rigmarole needs to happen before installing Monocle, otherwise various support libraries (e.g. units) will fail to install.

While hacking to get Monocle 3 running on Colab, various dependency libraries did not install without a fight. It turned out that [others have had similar issues](https://github.com/datacarpentry/r-raster-vector-geospatial/issues/138#issue-313014296) (these are R-on-Colab issues, not Monocle 3 issues), and the solution seems to be:
```
!apt-get -qq install -y libudunits2-dev libgdal-dev libgeos-dev libproj-dev 
```
But unfortunately, IRkernel does not seem to handle !magics, so `apt-get` has to be invoked via an R `system()` call.
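
Because `system()` only returns an exit status, a failed `apt-get` is easy to miss. A minimal sketch of a wrapper that stops on failure might look like the following; `run_shell()` is a hypothetical helper, not something IRkernel or Colab provides.

```r
# Hypothetical helper: run a shell command from R and stop if it fails.
run_shell <- function(cmd) {
  # With intern = FALSE (the default), system() returns the exit status.
  status <- system(cmd, ignore.stdout = TRUE, ignore.stderr = TRUE)
  if (status != 0) {
    stop(sprintf("command failed (exit status %d): %s", status, cmd))
  }
  invisible(status)
}

# Harmless smoke test; on Colab one would pass an apt-get command instead, e.g.
# run_shell("apt-get -qq -y install libudunits2-dev")
run_shell("true")
```

Dropping the two `ignore.*` arguments would bring back the command's own output, which can double as a progress indicator.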

**Note:** this next cell can take some minutes; feel free to remove the `suppressMessages()` wrappers if you're into very long status messages, which nonetheless can serve as a sort of progress indicator.

In [13]:
# https://mothergeo-py.readthedocs.io/en/latest/development/how-to/gdal-ubuntu-pkg.html#before-you-begin-python-3-6
suppressMessages(system("apt-get -qq -y install python3.6-dev"))

# Need to add this repo, otherwise libgdal-dev will cause apt-get to return 100 b/c of a 404. See:
# https://mothergeo-py.readthedocs.io/en/latest/development/how-to/gdal-ubuntu-pkg.html#install-gdal-ogr
# https://github.com/datacarpentry/r-raster-vector-geospatial/issues/138#issue-313014296

suppressMessages(system("add-apt-repository -y ppa:ubuntugis/ppa"))
suppressMessages(system("apt-get -qq update"))
suppressMessages(system("apt-get -qq -y install --fix-missing libudunits2-dev python-gdal gdal-bin libgdal-dev", intern=TRUE))
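
One rough way to confirm that the system packages landed before starting the slow R installs is to look for the command-line tools they ship. This is only a sketch: `tools_available()` is a hypothetical helper, and the assumption that `libgdal-dev` provides a `gdal-config` executable should be checked on the VM in question.

```r
# Hypothetical helper: report which command-line tools are on the PATH.
tools_available <- function(tools) {
  found <- nzchar(Sys.which(tools))  # Sys.which() returns "" for missing tools
  names(found) <- tools
  found
}

# Assumption: libgdal-dev installs a `gdal-config` executable.
print(tools_available(c("gdal-config")))
```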

### Stock Installation

The stock Monocle 3 install instructions begin with installing Bioconductor.

In [3]:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()
# BiocManager::valid()


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.2 (2019-12-12)

Installing package(s) 'BiocVersion'

Old packages: 'callr', 'curl', 'devtools', 'digest', 'DT', 'farver',
  'jsonlite', 'knitr', 'mime', 'processx', 'ps', 'remotes', 'rprojroot',
  'rstudioapi', 'stringi', 'svglite', 'xfun', 'xtable', 'lattice', 'nlme'



The second step in the stock install instructions is to actually install Monocle 3. For whatever reason, Bioconductor needs to be told explicitly to install the packages that `devtools::install_github()` would otherwise skip.

This install is long and has no progress indicator.
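
Since each install is slow, it can help to check first which dependencies are actually missing. The sketch below does that with base R only; `missing_packages()` is a hypothetical helper, and the package list is the one reported as skipped in this notebook's output.

```r
# Hypothetical helper: return the subset of `pkgs` that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The packages that install_github() reported as skipped.
bioc_deps <- c("Biobase", "BiocGenerics", "DelayedArray", "DelayedMatrixStats",
               "limma", "S4Vectors", "SingleCellExperiment",
               "SummarizedExperiment")

to_install <- missing_packages(bioc_deps)
message(length(to_install), " of ", length(bioc_deps), " packages still missing")

# Only the missing ones would then need installing:
# if (length(to_install) > 0) BiocManager::install(to_install)
```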

In [4]:
# Test one at a time to find the problem

# [X]: "Biobase"
# [X]: "DelayedArray"
# [X]: "DelayedMatrixStats"
# [X]: "limma"
# [X]: "S4Vectors"
# [ ]: "SingleCellExperiment"
# [ ]: "SummarizedExperiment"

options(install.packages.check.source = "yes")
# Problem child: BiocManager::install(c("SingleCellExperiment"))
BiocManager::install(c("GenomicRanges"))

Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.2 (2019-12-12)

Installing package(s) 'GenomicRanges'

also installing the dependencies ‘bitops’, ‘RCurl’, ‘GenomeInfoDbData’, ‘zlibbioc’, ‘BiocGenerics’, ‘S4Vectors’, ‘IRanges’, ‘GenomeInfoDb’, ‘XVector’


Old packages: 'callr', 'curl', 'devtools', 'digest', 'DT', 'farver',
  'jsonlite', 'knitr', 'mime', 'processx', 'ps', 'remotes', 'rprojroot',
  'rstudioapi', 'stringi', 'svglite', 'xfun', 'xtable', 'lattice', 'nlme'



In [5]:
# Issue, devtools::install_github('cole-trapnell-lab/monocle3') will report:
#   Skipping 8 packages not available: Biobase, BiocGenerics, DelayedArray, DelayedMatrixStats, limma, S4Vectors, SingleCellExperiment, SummarizedExperiment
# So, explicitly installing these seemed to help:

BiocManager::install(c("Biobase", "DelayedArray", "DelayedMatrixStats", "limma", "S4Vectors", "SingleCellExperiment", "SummarizedExperiment"))

Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.2 (2019-12-12)

Installing package(s) 'Biobase', 'DelayedArray', 'DelayedMatrixStats', 'limma',
  'S4Vectors', 'SingleCellExperiment', 'SummarizedExperiment'

also installing the dependencies ‘formatR’, ‘lambda.r’, ‘futile.options’, ‘futile.logger’, ‘snow’, ‘rhdf5’, ‘Rhdf5lib’, ‘matrixStats’, ‘BiocParallel’, ‘HDF5Array’


Old packages: 'callr', 'curl', 'devtools', 'digest', 'DT', 'farver',
  'jsonlite', 'knitr', 'mime', 'processx', 'ps', 'remotes', 'rprojroot',
  'rstudioapi', 'stringi', 'svglite', 'xfun', 'xtable', 'lattice', 'nlme'



In [6]:
# TODO: note this is repeated later. This is here just for isolated testing.
BiocManager::install("batchelor")

Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.2 (2019-12-12)

Installing package(s) 'batchelor'

also installing the dependencies ‘beeswarm’, ‘vipor’, ‘gridExtra’, ‘RcppAnnoy’, ‘RcppHNSW’, ‘irlba’, ‘rsvd’, ‘ggbeeswarm’, ‘viridis’, ‘BiocNeighbors’, ‘BiocSingular’, ‘scater’, ‘beachmat’


Old packages: 'callr', 'curl', 'devtools', 'digest', 'DT', 'farver',
  'jsonlite', 'knitr', 'mime', 'processx', 'ps', 'remotes', 'rprojroot',
  'rstudioapi', 'stringi', 'svglite', 'xfun', 'xtable', 'lattice', 'nlme'



Back to stock install step 2:

In [7]:
devtools::install_github('cole-trapnell-lab/monocle3')


Downloading GitHub repo cole-trapnell-lab/monocle3@master



leidenbase   (NA     -> c22a7d01f...) [GitHub]
ggrepel      (NA     -> 0.8.1       ) [CRAN]
grr          (NA     -> 0.9.5       ) [CRAN]
igraph       (NA     -> 1.2.4.2     ) [CRAN]
lmtest       (NA     -> 0.9-37      ) [CRAN]
pbapply      (NA     -> 1.4-2       ) [CRAN]
pbmcapply    (NA     -> 1.5.0       ) [CRAN]
pheatmap     (NA     -> 1.0.12      ) [CRAN]
plotly       (NA     -> 4.9.2       ) [CRAN]
proxy        (NA     -> 0.4-23      ) [CRAN]
pscl         (NA     -> 1.5.2       ) [CRAN]
RANN         (NA     -> 2.6.1       ) [CRAN]
rsample      (NA     -> 0.0.5       ) [CRAN]
RhpcBLASctl  (NA     -> 0.20-17     ) [CRAN]
Rtsne        (NA     -> 0.15        ) [CRAN]
slam         (NA     -> 0.1-47      ) [CRAN]
spdep        (NA     -> 1.1-3       ) [CRAN]
speedglm     (NA     -> 0.3-2       ) [CRAN]
uwot         (NA     -> 0.1.5       ) [CRAN]
digest       (0.6.23 -> 0.6.24      ) [CRAN]
zoo          (NA     -> 1.8-7       ) [CRAN]
jsonlite     (1.6    -> 1.6.1       ) [CRAN]
hexbin  

Skipping 10 packages not available: Biobase, SingleCellExperiment, batchelor, BiocGenerics, DelayedArray, DelayedMatrixStats, limma, Matrix.utils, S4Vectors, SummarizedExperiment

Downloading GitHub repo cole-trapnell-lab/leidenbase@master



igraph (NA -> 1.2.4.2) [CRAN]


Installing 1 packages: igraph

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



[32m✔[39m  [90mchecking for file ‘/tmp/Rtmp02vLl7/remotes7b79b6c298/cole-trapnell-lab-leidenbase-c22a7d0/DESCRIPTION’[39m[36m[39m
[90m─[39m[90m  [39m[90mpreparing ‘leidenbase’:[39m[36m[39m
[32m✔[39m  [90mchecking DESCRIPTION meta-information[39m[36m[39m
[90m─[39m[90m  [39m[90mcleaning src[39m[36m[39m
[90m─[39m[90m  [39m[90mchecking for LF line-endings in source and make files and shell scripts[39m[36m[39m
[90m─[39m[90m  [39m[90mchecking for empty or unneeded directories[39m[36m[36m (591ms)[36m[39m
[90m─[39m[90m  [39m[90mbuilding ‘leidenbase_0.1.0.tar.gz’[39m[36m[39m
   


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing 54 packages: ggrepel, grr, igraph, lmtest, Matrix.utils, pbapply, pbmcapply, pheatmap, plotly, proxy, pscl, RANN, rsample, RhpcBLASctl, Rtsne, slam, spdep, speedglm, uwot, digest, zoo, jsonlite, hexbin, data.table, furrr, mime, xtable, sp, spData, sf, deldir, LearnBayes, coda, expm, gmodels, stringi, FNN, RSpectra, RcppParallel, RcppProgress, dqrng, farver, curl, future, globals, listenv, raster, classInt, units, e1071, gdata, gtools, RcppEigen, sitmo

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



ERROR: ignored

In [9]:
# The following is the recommended way to test an install of Monocle 3.
# It should dump out a bunch of config stats.

# TODO: the startup messages are verbose; wrap the call in suppressMessages()
# for brevity, and remove the wrapper to debug if things go awry.
library(monocle3)

ERROR: ignored

And that's it. Their setup is simple, after tackling some Colab-specific annoyances. Install is slow on Colab, but c'est la vie. With more experience, perhaps there is a way to streamline this (the author is not an experienced R developer).
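
Because a failed install only surfaces as a terse error on Colab, a quick programmatic check can be handy before moving on. A minimal sketch, using only base R (`describe_package()` is a hypothetical helper):

```r
# Hypothetical helper: say whether a package can be loaded, and at what version.
describe_package <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    sprintf("%s %s is installed", pkg, as.character(utils::packageVersion(pkg)))
  } else {
    sprintf("%s is NOT installed", pkg)
  }
}

message(describe_package("monocle3"))
```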



## Exercise Monocle 3

Hannah Pliner, one of [the two core devs of Monocle 3](https://github.com/cole-trapnell-lab/monocle3/graphs/contributors), gave presentations on Monocle 3 at [Brotman Baty Institute's Single Cell Symposium](https://brotmanbaty.org/event/single-cell-symposium/). Her [tutorial content](http://staff.washington.edu/hpliner/) is online and includes some test data and [an R script for doing the basics with Monocle 3](http://staff.washington.edu/hpliner/scripts/20190603_tutorial_script.R). The data is C. elegans data from [Cao & Packer et al. 2017](https://science.sciencemag.org/content/357/6352/661), as explained in [the Monocle 3 docs](https://cole-trapnell-lab.github.io/monocle3/monocle3_docs/#clustering-and-classifying-your-cells).

Here, Pliner's script is copied with minimal modification to get things going on Colab, including breaking it out into multiple code cells.

In [0]:
# Load up Pliner's test data
expression_matrix <- readRDS(url("http://staff.washington.edu/hpliner/data/cao_l2_expression.rds"))
cell_metadata <- readRDS(url("http://staff.washington.edu/hpliner/data/cao_l2_colData.rds"))
gene_annotation <- readRDS(url("http://staff.washington.edu/hpliner/data/cao_l2_rowData.rds"))

cds <- new_cell_data_set(expression_matrix,
                         cell_metadata = cell_metadata,
                         gene_metadata = gene_annotation)
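
When swapping in other data, the three inputs need to line up: cells are the columns of the expression matrix and the rows of the cell metadata, genes are the rows of the expression matrix and of the gene metadata, and the gene metadata carries a `gene_short_name` column. The sketch below checks this on toy data; `check_cds_inputs()` and the toy values are made up for illustration and are not part of Monocle 3.

```r
# Hypothetical helper: verify that the three inputs are mutually consistent.
check_cds_inputs <- function(expression_matrix, cell_metadata, gene_metadata) {
  stopifnot(
    identical(colnames(expression_matrix), rownames(cell_metadata)),
    identical(rownames(expression_matrix), rownames(gene_metadata)),
    "gene_short_name" %in% colnames(gene_metadata)
  )
  invisible(TRUE)
}

# Toy data: 2 genes x 3 cells.
toy_expr <- matrix(c(0, 2, 1, 0, 3, 5), nrow = 2,
                   dimnames = list(c("geneA", "geneB"),
                                   c("cell1", "cell2", "cell3")))
toy_cells <- data.frame(batch = c("a", "a", "b"),
                        row.names = colnames(toy_expr))
toy_genes <- data.frame(gene_short_name = rownames(toy_expr),
                        row.names = rownames(toy_expr))

check_cds_inputs(toy_expr, toy_cells, toy_genes)
```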

In [0]:
# Test out the accessor functions:
colData(cds)

rowData(cds)

head(counts(cds))



In [0]:
# Preprocess the cds - in default mode, this function normalizes the
# data and runs PCA
cds <- preprocess_cds(cds, num_dim = 100)

# Run UMAP to get a low dimension representation, and plot
cds <- reduce_dimension(cds)
plot_cells(cds)
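
To give a feel for what "normalizes the data and runs PCA" means, here is a rough base-R illustration on simulated counts. It is not Monocle 3's implementation: the size-factor formula, the `log1p` transform, and all variable names are assumptions chosen for brevity.

```r
set.seed(42)
n_genes <- 20
n_cells <- 50

# Simulated count matrix: genes in rows, cells in columns.
toy_counts <- matrix(rpois(n_genes * n_cells, lambda = 5), nrow = n_genes,
                     dimnames = list(paste0("gene", seq_len(n_genes)),
                                     paste0("cell", seq_len(n_cells))))

# Per-cell size factors: each cell's total count relative to the average cell.
size_factors <- colSums(toy_counts) / mean(colSums(toy_counts))

# Divide each column (cell) by its size factor, then log-transform.
log_norm <- log1p(sweep(toy_counts, 2, size_factors, "/"))

# PCA with cells as observations; keep the first 5 components.
pca <- prcomp(t(log_norm), center = TRUE, scale. = TRUE, rank. = 5)
pcs <- pca$x
dim(pcs)  # one row per cell, one column per component
```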



In [0]:
# Cluster cells and view the clusters and partitions (super-clusters)
cds <- cluster_cells(cds)
head(partitions(cds, reduction_method = "UMAP"))
head(clusters(cds, reduction_method = "UMAP"))

plot_cells(cds, color_cells_by="partition", group_cells_by="partition")


In [0]:
plot_cells(cds, color_cells_by="cluster", group_cells_by="cluster")

In [0]:
# Subset cells interactively

# Original code, which errors on Colab. I guess the script
# was built for RStudio or such, not for Jupyter widgetry.
#   Error: choose_cells only works in interactive mode.
# cds_subset <- choose_cells(cds)

# This hack just bypasses the subsetting, making the rest of this notebook
# not so interesting, but we're just testing if the code runs on Colab, and it does.
cds_subset <- cds
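
A possible non-interactive stand-in for `choose_cells()` is to pick cells by cluster label. The sketch below shows the idea on a toy matrix; `cells_in_clusters()` is a hypothetical helper, and the commented Monocle 3 line at the end is an untested assumption about how it would carry over.

```r
# Hypothetical helper: logical mask of cells whose cluster is in `keep`.
cells_in_clusters <- function(cluster_of, keep) {
  as.character(cluster_of) %in% as.character(keep)
}

# Toy stand-ins: a per-cell cluster factor and a genes x cells matrix.
toy_clusters <- factor(c("1", "2", "1", "3"))
names(toy_clusters) <- paste0("cell", 1:4)
toy_matrix <- matrix(1:8, nrow = 2,
                     dimnames = list(c("geneA", "geneB"), names(toy_clusters)))

mask <- cells_in_clusters(toy_clusters, keep = c(1, 3))
toy_subset <- toy_matrix[, mask, drop = FALSE]
colnames(toy_subset)

# With Monocle 3 loaded, the analogous call would presumably be:
# cds_subset <- cds[, cells_in_clusters(clusters(cds), keep = c(1, 3))]
```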


In [0]:
# Compare genes across chosen clusters (first 100 genes for speed)
gene_fits <- fit_models(cds_subset[1:100,], model_formula_str = "~cluster")
fit_coefs <- coefficient_table(gene_fits)
head(fit_coefs)

# Find top marker genes for each cluster
marker_genes <- top_markers(cds[1:1000,], genes_to_test_per_group = 3)
head(marker_genes)

tops_sig <- subset(marker_genes, marker_test_q_value < .05)

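# Plot expression of the significant marker genes (assumes the
# `gene_short_name` column returned by top_markers()).
plot_cells(cds_subset, genes = unique(tops_sig$gene_short_name), show_trajectory_graph = FALSE)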


generate_garnett_marker_file(marker_genes)
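
The filter-then-pick-top-markers step can be sketched in base R on a toy table. The column names mirror the ones used above (`marker_score` is assumed to be among the columns `top_markers()` returns), the values are invented, and `top_n_per_group()` is a hypothetical helper.

```r
# Toy marker table with invented values.
toy_markers <- data.frame(
  gene_short_name     = c("g1", "g2", "g3", "g4", "g5"),
  cell_group          = c("1", "1", "1", "2", "2"),
  marker_score        = c(0.9, 0.4, 0.7, 0.8, 0.2),
  marker_test_q_value = c(0.001, 0.2, 0.01, 0.03, 0.5),
  stringsAsFactors    = FALSE
)

# Hypothetical helper: keep significant rows, then the n best-scoring per group.
top_n_per_group <- function(markers, n = 1, q_cutoff = 0.05) {
  sig <- markers[markers$marker_test_q_value < q_cutoff, , drop = FALSE]
  sig <- sig[order(sig$cell_group, -sig$marker_score), , drop = FALSE]
  # Rank rows within each group after sorting by descending score.
  rank_in_group <- ave(seq_len(nrow(sig)), sig$cell_group, FUN = seq_along)
  sig[rank_in_group <= n, , drop = FALSE]
}

best <- top_n_per_group(toy_markers, n = 1)
best$gene_short_name
```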


## Next

Stopping here: the goal was to see whether Monocle 3 can be deployed on Colab. Mostly, yes.

The next steps would seem to be:
- Solve the problem around `choose_cells()`
- Use Monocle on data from the wild, rather than prepackaged test data in *.rds files