# Clustering and classifying your cells
Single-cell experiments are often performed on tissues containing many cell types. Monocle 3 provides a simple set of functions you can use to group your cells according to their gene expression profiles into clusters. Often cells form clusters that correspond to one cell type or a set of highly related cell types. Monocle 3 uses techniques to do this that are widely accepted in single-cell RNA-seq analysis and similar to the approaches used by Seurat, scanpy, and other tools.
In this section, you will learn how to cluster cells using Monocle 3. We will demonstrate the main functions used for clustering with the C. elegans data from Cao & Packer et al. This study described how to do single-cell RNA-seq with combinatorial indexing in a protocol called "sci-RNA-seq". Cao & Packer et al. used sci-RNA-seq to produce the first single-cell RNA-seq analysis of a whole animal, so there are many cell types represented in the data. You can learn more about the dataset and see how the authors performed the original analysis at the UW Genome Sciences RNA Atlas of the Worm site.


You can load the data into Monocle 3 like this:

In [None]:
library(monocle3)
library(dplyr) # imported for some downstream data manipulation

expression_matrix <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/cao_l2_expression.rds"))
cell_metadata <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/cao_l2_colData.rds"))
gene_annotation <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/cao_l2_rowData.rds"))

cds <- new_cell_data_set(expression_matrix,
                         cell_metadata = cell_metadata,
                         gene_metadata = gene_annotation)

## Pre-process the data
Now that the data's all loaded up, we need to pre-process it. This step is where you tell Monocle 3 how you want to normalize the data, whether to use Principal Components Analysis (the standard for RNA-seq) or Latent Semantic Indexing (common in ATAC-seq), and how to remove any batch effects. We will just use the standard PCA method in this demonstration. When using PCA, you should specify the number of principal components you want Monocle to compute.

In [None]:
cds <- preprocess_cds(cds, num_dim = 100)

It's a good idea to check that you're using enough PCs to capture most of the variation in gene expression across all the cells in the data set. You can look at the fraction of variation explained by each PC using `plot_pc_variance_explained()`:

In [None]:
plot_pc_variance_explained(cds)

We can see that using more than 100 PCs would capture only a small amount of additional variation, and each additional PC makes downstream steps in Monocle slower.
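
To make "fraction of variation explained" concrete, here is a minimal sketch that computes it by hand. It uses base R's `prcomp()` on simulated data rather than Monocle 3's own PCA implementation, and the matrix sizes, the planted signal, and the 90% threshold are all arbitrary assumptions chosen for illustration.

```r
# Illustrative only: base-R PCA on simulated data, not Monocle 3's internal code.
set.seed(1)

# 200 simulated "cells" (rows) by 50 simulated "genes" (columns)
expr <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)

# Add a shared signal to the first five genes so that one PC stands out
signal <- rnorm(200)
expr[, 1:5] <- expr[, 1:5] + 3 * signal

pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Fraction of total variance explained by each PC, and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)

# Smallest number of PCs capturing at least 90% of the variance (arbitrary cutoff)
n_pcs_90 <- which(cum_explained >= 0.9)[1]

plot(var_explained, type = "b",
     xlab = "PC", ylab = "Fraction of variance explained")
```

The same reasoning applies to the Monocle plot: once the curve flattens out, additional PCs add little information but still cost time downstream.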

## Reduce dimensionality and visualize the cells
Now we're ready to visualize the cells. To do so, you can use either t-SNE, which is very popular in single-cell RNA-seq, or UMAP, which is increasingly common. Monocle 3 uses UMAP by default, as we feel that it is both faster and better suited for clustering and trajectory analysis in RNA-seq. To reduce the dimensionality of the data down into the X, Y plane so we can plot it easily, call `reduce_dimension()`:

In [None]:
cds <- reduce_dimension(cds)

To plot the data, use Monocle's main plotting function, `plot_cells()`:

In [None]:
plot_cells(cds)

Each point in the plot above represents a different cell in the `cell_data_set` object `cds`. As you can see, the cells form many groups, some with thousands of cells, some with only a few. Cao & Packer et al. manually annotated each cell according to type by looking at which genes it expresses. We can color the cells in the UMAP plot by the authors' original annotations using the `color_cells_by` argument to `plot_cells()`.

In [None]:
plot_cells(cds, color_cells_by="cao_cell_type")

You can see that many of the cell types land very close to one another in the UMAP plot.

Except for a few cases described in a moment, `color_cells_by` can be the name of any column in `colData(cds)`. Note that when `color_cells_by` is a categorical variable, labels are added to the plot, with each label positioned roughly in the middle of all the cells that have that label.
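
To see which columns are available for coloring, you can inspect the cell metadata directly. This is a small sketch that assumes `cds` is the object built earlier in this section; the columns it prints depend on the metadata you loaded.

```r
library(monocle3)

# Assumes `cds` is the cell_data_set created above with new_cell_data_set().
# List the per-cell metadata columns that color_cells_by can refer to
colnames(colData(cds))

# Peek at the first few rows to see what each column contains
head(colData(cds))
```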

You can also color your cells according to how much of a gene or set of genes they express:
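
A minimal sketch of such a call uses the `genes` argument of `plot_cells()`. The four gene names below are illustrative choices, and the call assumes they appear in the `gene_short_name` column of `rowData(cds)`; substitute genes of interest from your own data.

```r
library(monocle3)

# Assumes `cds` has already been through preprocess_cds() and reduce_dimension().
# The gene names are examples only; they must match gene_short_name in rowData(cds).
plot_cells(cds, genes = c("cpna-2", "egl-21", "ram-2", "inos-1"))
```

Each gene is drawn in its own panel, with cells colored by that gene's expression.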

> ## Faster clustering with UMAP
> If you have a relatively large dataset (more than 10,000 cells), you may want to take advantage of options that can accelerate UMAP. Passing `umap.fast_sgd=TRUE` to `reduce_dimension()` will use a fast stochastic gradient descent method inside of UMAP. If your computer has multiple cores, you can use the `cores` argument to make UMAP multithreaded. However, invoking `reduce_dimension()` with either of these options will make it produce slightly different output each time you run it. If this is acceptable to you, you could see significant reductions in the running time of `reduce_dimension()`.
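
The options described in the callout above can be combined in one call, as in the sketch below. The value `cores = 4` is an arbitrary assumption, so set it to match your machine, and the result is stored under the hypothetical name `cds_fast` so that the default embedding in `cds` is left untouched.

```r
library(monocle3)

# Assumes `cds` has already been through preprocess_cds().
# umap.fast_sgd and cores trade exact reproducibility for speed.
cds_fast <- reduce_dimension(cds, umap.fast_sgd = TRUE, cores = 4)

plot_cells(cds_fast)
```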

If you want, you can also use t-SNE to visualize your data. First, call `reduce_dimension()` with `reduction_method="tSNE"`.

In [None]:
cds <- reduce_dimension(cds, reduction_method="tSNE")

Then, when you call `plot_cells()`, pass `reduction_method="tSNE"` to it as well:

In [None]:
plot_cells(cds, reduction_method="tSNE", color_cells_by="cao_cell_type")

You can use both UMAP and t-SNE on the same `cds` object; one won't overwrite the results of the other. However, you must specify which one you want in downstream functions like `plot_cells()`.
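
One way to confirm that both embeddings are present is to look at the reduced-dimension slots of the object. This sketch assumes both `reduce_dimension()` calls above have been run on `cds`, and that the results are stored under the names `"UMAP"` and `"tSNE"`.

```r
library(monocle3)

# Assumes reduce_dimension() has been run on `cds` for both UMAP and tSNE.
# Names of all stored low-dimensional representations
names(SingleCellExperiment::reducedDims(cds))

# First few cell coordinates from each embedding
head(SingleCellExperiment::reducedDims(cds)[["UMAP"]])
head(SingleCellExperiment::reducedDims(cds)[["tSNE"]])
```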

## Check for and remove batch effects
When performing gene expression analysis, it's important to check for batch effects, which are systematic differences in the transcriptome of cells measured in different experimental batches. These could be technical in nature, such as those introduced during the single-cell RNA-seq protocol, or biological, such as those that might arise from different litters of mice. How to recognize batch effects and account for them so that they don't confound your analysis can be a complex issue, but Monocle provides tools for dealing with them.

You should always check for batch effects when you perform dimensionality reduction. To do so, add a column to the `colData` that encodes which batch each cell is from. Then you can simply color the cells by batch. Cao & Packer et al. included a "plate" annotation in their data, which specifies which sci-RNA-seq plate each cell originated from. Coloring the UMAP by plate reveals:

In [None]:
plot_cells(cds, color_cells_by="plate", label_cell_groups=FALSE)