# Import Dataset Classes

In [1]:
import os
# Work from the parent directory so that relative paths such as 'data/' resolve there
os.chdir('..')

from scvi.dataset import LoomDataset, CsvDataset, Dataset10X 
from scvi.dataset import BrainLargeDataset, CortexDataset, PbmcDataset, RetinaDataset, HematoDataset, CbmcDataset, BrainSmallDataset



# Generic Datasets
`scvi v0.1.3` supports dataset loading for the following three generic file formats: 
* `.loom` files
* `.csv` files 
* datasets from the `10x` website

### Loading a `.loom` file
Any `.loom` file can be loaded by initializing `LoomDataset` with its `filename`.

Optional parameters: 
* `save_path`: directory where the file is saved (defaults to `data/`)
* `url`: URL of the dataset, if the file needs to be downloaded from the web
* `new_n_genes`: the number of genes to keep after subsampling; set it to `False` to turn off subsampling

In [2]:
loom_dataset = LoomDataset("Cortex.loom", 
                             save_path='data/',
                             url='http://loom.linnarssonlab.org/clone/Previously%20Published/Cortex.loom')

Downloading data
Preprocessing dataset
Finished preprocessing dataset
Downsampling from 21135 to 558 genes
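The log above reports that the gene set was reduced, which is what `new_n_genes` controls. This text does not state how genes are chosen, so the following is only a minimal sketch under one common assumption: keep the genes with the highest variance across cells. The helper `top_variance_genes` is hypothetical and not part of `scvi`, and the toy matrix assumes cells in rows and genes in columns.

```python
import numpy as np

def top_variance_genes(counts, n_genes):
    """Return sorted column indices of the n_genes highest-variance genes.

    counts: 2-D array, cells (rows) by genes (columns).
    Hypothetical helper for illustration; not part of scvi.
    """
    counts = np.asarray(counts, dtype=np.float64)
    variances = counts.var(axis=0)
    # argsort is ascending, so reverse it to put the most variable genes first
    order = np.argsort(variances)[::-1]
    return np.sort(order[:n_genes])

# Toy count matrix: 4 cells x 5 genes
counts = np.array([
    [0, 10, 1, 5, 0],
    [0,  0, 1, 6, 3],
    [0, 20, 1, 5, 0],
    [0,  5, 1, 4, 9],
])
keep = top_variance_genes(counts, n_genes=2)
subsampled = counts[:, keep]  # same cells, only the selected genes
```

Here the constant genes (columns 0 and 2) are dropped first, which is the usual motivation for variance-based selection: genes that never change carry no information about differences between cells.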


### Loading a `.csv` file 
Any `.csv` file can be loaded by initializing `CsvDataset` with its `filename`.

Optional parameters: 
* `save_path`: directory where the file is saved (defaults to `data/`)
* `url`: URL of the dataset, if the file needs to be downloaded from the web
* `compression`: set to `gzip`, `bz2`, `zip`, or `xz` to load a compressed `.csv` file
* `new_n_genes`: the number of genes to keep after subsampling; set it to `False` to turn off subsampling

Note: `CsvDataset` currently only supports `.csv` files laid out as genes (rows) by cells (columns).
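If a file is laid out the other way around (cells by genes), it has to be transposed before loading. The sketch below uses only the standard library and assumes the first row and first column hold labels; `transpose_csv` is a hypothetical helper, not part of `scvi`, and it reads the whole file into memory, so it suits small to medium files.

```python
import csv
import io

def transpose_csv(src, dst):
    """Read a cells-by-genes CSV from src and write it genes-by-cells to dst.

    src and dst are text file objects. Row and column labels are swapped
    together with the data. Hypothetical helper for illustration.
    """
    rows = list(csv.reader(src))
    # zip(*rows) turns each column into a row; every row must have the same length
    transposed = zip(*rows)
    csv.writer(dst, lineterminator="\n").writerows(transposed)

# In-memory stand-ins for real files, so the example runs without touching disk
cells_by_genes = io.StringIO(
    ",GeneA,GeneB,GeneC\n"
    "cell1,0,3,1\n"
    "cell2,2,0,5\n"
)
genes_by_cells = io.StringIO()
transpose_csv(cells_by_genes, genes_by_cells)
print(genes_by_cells.getvalue())
```

With real data, the same function can be called on file objects opened with `open(path, newline="")`.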

In [3]:
csv_dataset = CsvDataset("GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz", 
                        save_path='data/',
                        compression='gzip', 
                        url = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE100866&format=file&file=GSE100866%5FCBMC%5F8K%5F13AB%5F10X%2DRNA%5Fumi%2Ecsv%2Egz")

Downloading data
Preprocessing dataset
Finished preprocessing dataset
Downsampling from 36280 to 600 genes


### Loading a dataset from the `10x` website

`10x` has published several datasets on its [website](https://www.10xgenomics.com).
Initialize `Dataset10X` by passing in the name of one of the datasets that `scvi` currently supports: `frozen_pbmc_donor_a`, `frozen_pbmc_donor_b`, `frozen_pbmc_donor_c`, `pbmc8k`, `pbmc4k`, `t_3k`, `t_4k`, and `neuron_9k`.

Optional parameters: 
* `save_path`: directory where the file is saved (defaults to `data/`)
* `type`: either `filtered` (the default) or `raw`, selecting which of the two versions available from `10x` to load
* `new_n_genes`: the number of genes to keep after subsampling; set it to `False` to turn off subsampling
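Because only the dataset names listed above are supported, it can be convenient to check a name before triggering a download. The following is a small sketch of such a guard; `validate_10x_request` and the `dataset_type` argument name are hypothetical and not part of `scvi` (the argument mirrors the `type` parameter described above, renamed to avoid shadowing the Python built-in).

```python
# Names and types copied from the lists above
SUPPORTED_10X_DATASETS = frozenset({
    "frozen_pbmc_donor_a", "frozen_pbmc_donor_b", "frozen_pbmc_donor_c",
    "pbmc8k", "pbmc4k", "t_3k", "t_4k", "neuron_9k",
})
VALID_TYPES = ("filtered", "raw")

def validate_10x_request(name, dataset_type="filtered"):
    """Raise ValueError for an unsupported name or type; otherwise return both.

    Hypothetical helper for illustration; not part of scvi.
    """
    if name not in SUPPORTED_10X_DATASETS:
        supported = ", ".join(sorted(SUPPORTED_10X_DATASETS))
        raise ValueError(f"Unknown 10x dataset {name!r}; expected one of: {supported}")
    if dataset_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {VALID_TYPES}, got {dataset_type!r}")
    return name, dataset_type

args = validate_10x_request("neuron_9k")
```

The returned pair can then be passed on to `Dataset10X`, for example as the dataset name and the `type` keyword.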

In [11]:
tenX_dataset = Dataset10X("neuron_9k")

Preprocessing dataset
Finished preprocessing dataset
Downsampling from 27998 to 3000 genes


# Built-In Datasets

We've also implemented seven built-in datasets to make it easier to reproduce results from the scVI paper. 

Descriptions of the seven built-in datasets: 
* **BRAIN-LARGE**: 1.3 million mouse brain cells, spanning the cortex, hippocampus, and subventricular zone, profiled with 10x Chromium
* **CORTEX**: 3,005 mouse cortex cells profiled with the Smart-seq2 protocol, with the addition of UMI. To facilitate comparison with other methods, we use a filtered set of 558 highly variable genes
* **PBMC**: 12,039 human peripheral blood mononuclear cells profiled with 10x
* **RETINA**: 27,499 mouse retinal bipolar neurons, profiled in two batches using the Drop-seq technology
* **HEMATO**: 4,016 cells from two batches, profiled using inDrop
* **CBMC**: 8,617 cord blood mononuclear cells profiled using 10x, along with measurements for 13 well-characterized monoclonal antibodies in each cell
* **BRAIN-SMALL**: 9,128 mouse brain cells profiled using 10x

### Loading the `BRAIN-LARGE` dataset
The `BRAIN-LARGE` dataset can be used to demonstrate the scalability of scVI.

In [5]:
brain_large_dataset = BrainLargeDataset() 

Downloading data
Preprocessing Brain Large data
720 genes subsampled
1306127 cells subsampled
Finished preprocessing data


### Loading the `CORTEX`, `PBMC`, and `RETINA` datasets
The `CORTEX` dataset exhibits a clear high-level subpopulation structure, which has been inferred by the authors of the original publication using computational tools and annotated by inspection of specific genes or transcriptional programs. Similar levels of annotation are provided with the `PBMC` and `RETINA` datasets.

In [7]:
cortex_dataset = CortexDataset() 
pbmc_dataset = PbmcDataset() 
retina_dataset = RetinaDataset()

Preprocessing Cortex data
Finished preprocessing Cortex data
Downloading data.zip
Preprocessing pbmc data
Finished preprocessing pbmc data
Preprocessing dataset
Finished preprocessing dataset


### Loading the `HEMATO` dataset
The `HEMATO` dataset is an example of a case where gene expression varies continuously rather than forming discrete subpopulations.

In [8]:
hemato_dataset = HematoDataset() 

Downloading data
Preprocessing Hemato data
Finished preprocessing Hemato data


### Loading the `CBMC` dataset
The `CBMC` dataset can be used to analyze how well the latent spaces inferred by dimensionality-reduction algorithms summarize protein marker abundance.

In [9]:
cbmc_dataset = CbmcDataset()

Downloading data
Preprocessing dataset
Finished preprocessing dataset
Downsampling from 36280 to 600 genes


### Loading the `BRAIN-SMALL` dataset
The `BRAIN-SMALL` dataset complements `PBMC` for studying how zero abundance and quality-control metrics correlate with the generative posterior parameters.

In [10]:
brain_small_dataset = BrainSmallDataset()

Preprocessing dataset
Finished preprocessing dataset
Downsampling from 27998 to 3000 genes
