Datasets10x #37

Edouard360 · 2018-06-15T13:03:30Z

Noticed that the 10x datasets share a common organisation.

Dataset10X("pbmc8k"), Dataset10X("t_3k"), Dataset10X("frozen_pbmc_donor_a") all worked for me.

These will be useful for reproducing some harmonisation results.

@imyiningliu: this also covers BrainSmallDataset if I do Dataset10X("neuron_9k"), but you might want to add some extra preprocessing steps (for instance, selecting the most variable genes).

Also, this needs unit tests for coverage.
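For illustration, the "most variable genes" step mentioned above could look like the sketch below. It is a stdlib-only toy, not the preprocessing used in this repository: the function name `select_most_variable_genes`, the plain-list data layout and the choice of population variance as the ranking criterion are all assumptions.

```python
from statistics import pvariance


def select_most_variable_genes(counts, gene_names, n_genes):
    """Keep the n_genes columns of `counts` with the highest variance.

    counts: list of rows (one per cell), each a list of per-gene counts.
    gene_names: list of gene names, one per column of `counts`.
    (Hypothetical helper, not part of the repository.)
    """
    n_total = len(gene_names)
    # Variance of each gene (column) across all cells.
    variances = [pvariance([row[j] for row in counts]) for j in range(n_total)]
    # Indices of the n_genes most variable columns, put back in column order.
    top = sorted(range(n_total), key=lambda j: variances[j], reverse=True)[:n_genes]
    keep = sorted(top)
    filtered = [[row[j] for j in keep] for row in counts]
    return filtered, [gene_names[j] for j in keep]


# Toy data: 3 cells x 3 genes; g1 is constant, so it is dropped.
counts = [
    [0, 5, 1],
    [0, 0, 1],
    [0, 9, 2],
]
filtered, kept = select_most_variable_genes(counts, ["g1", "g2", "g3"], n_genes=2)
print(kept)      # ['g2', 'g3']
print(filtered)  # [[5, 1], [0, 1], [9, 2]]
```

On real data the same idea would normally be applied to a NumPy or sparse matrix rather than nested lists.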

* Semi supervised:
  * Add `get_indices` and `get_data_loaders` functions, useful for semi_supervised training
  * Correction of the semi-supervised training scheme, where if dataloaders were unbalanced (more labelled samples than unlabelled for instance), the actual loss minimized was wrong.
  * Better naming convention
  * optional in SVAEC: `LinearLogRegClassifier` classifier, only updated every epoch, that acts like a NN on the forward pass - ie. for unlabelled samples, computations are still entirely happening on GPU.
* Others:
  * added `get_latent_mean` and `get_latents` functions. (Only `get_latent` before). Still backward compatible, but removes the need to call .cpu().numpy(). Useful for models with more than one hidden stochastic layer.
  * Correct wrong multinomial formula
  * Add `base.py` file to normalize algorithms methods (VAE, VAEC, SVAEC)

* Adding notebook example
* Adding benchmark metrics

* Optimize computations for SVAEC and VAEC for unlabelled batches (unnecessary computations were done before)

* Dataset10X("frozen_pbmc_donor_a"), Dataset10X("pbmc8k"), Dataset10X("t_4k"), ... download the corresponding datasets from the 10X website and extract the archive. The datasets share a common organisation, so perhaps factorising their loading is a good idea.
* `concat_datasets` merges multiple datasets based on genes_names
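As a rough sketch of what merging on gene names involves, the example below keeps only the genes shared by every dataset, aligns the columns, and stacks the cells. It is not the `concat_datasets` implementation from this PR: the function name `concat_by_gene_names`, the list-based layout, the intersection policy and the returned batch indices are assumptions made for illustration.

```python
def concat_by_gene_names(datasets):
    """Stack cells from several datasets, keeping only genes shared by all.

    datasets: list of (counts, gene_names) pairs, where counts is a list of
    rows (one per cell) and gene_names labels the columns.
    Returns (merged_counts, shared_gene_names, batch_indices).
    (Hypothetical helper; assumes gene names are unique within a dataset.)
    """
    # Genes present in every dataset, kept in the order of the first dataset.
    shared = set(datasets[0][1])
    for _, names in datasets[1:]:
        shared &= set(names)
    shared_names = [g for g in datasets[0][1] if g in shared]

    merged, batch_indices = [], []
    for batch, (counts, names) in enumerate(datasets):
        # Map each gene name to its column in this dataset, then reorder.
        column = {g: j for j, g in enumerate(names)}
        order = [column[g] for g in shared_names]
        for row in counts:
            merged.append([row[j] for j in order])
            batch_indices.append(batch)
    return merged, shared_names, batch_indices


# Toy data: dataset a has 1 cell and 3 genes, dataset b has 2 cells and 2 genes.
a = ([[1, 2, 3]], ["CD3E", "CD8A", "MS4A1"])
b = ([[7, 8], [9, 10]], ["MS4A1", "CD3E"])
merged, genes, batches = concat_by_gene_names([a, b])
print(genes)    # ['CD3E', 'MS4A1']
print(merged)   # [[1, 3], [8, 7], [10, 9]]
print(batches)  # [0, 1, 1]
```

Tracking which dataset each cell came from is useful for the harmonisation experiments mentioned above, which is why the sketch returns a batch index per cell.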

jeff-regier · 2018-06-15T15:28:27Z

Great, let's merge this as is. @imyiningliu If you'll use this new class for the BrainSmallDataset then test_brain_small will start testing this new code.

@Edouard360 Do we need to test it with additional datasets that have slightly different formats than brain_small? Or do these 10x datasets have the same format?

jeff-regier · 2018-06-15T15:29:32Z

Also, @Edouard360 what about brain_large? Is that format very different than these datasets?

Edouard360 · 2018-06-15T16:46:53Z

All of the aforementioned datasets come in the same format, with three files: matrix.mtx, genes.tsv and barcodes.tsv. The brain_large dataset, however, is in h5 format, which Dataset10X cannot currently handle. But if other datasets are also in .h5 format, it might be worth adding an extra specification at __init__ time (using something similar to the mapping in the available_datasets object).
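To make the three-file layout concrete, here is a minimal stdlib-only sketch of reading such a directory. It is not the Dataset10X code: the function names are hypothetical, and it assumes that matrix.mtx is a Matrix Market coordinate file stored as genes x cells with integer counts, and that genes.tsv holds the gene symbol in its last tab-separated column. A real loader would more likely use `scipy.io.mmread` and keep the matrix sparse.

```python
import os
import tempfile


def read_tsv_column(path, column=0):
    """Return one column of a tab-separated file as a list of strings."""
    with open(path) as handle:
        return [line.rstrip("\n").split("\t")[column] for line in handle if line.strip()]


def load_10x_directory(path):
    """Load matrix.mtx, genes.tsv and barcodes.tsv as a dense cells x genes list.

    (Hypothetical helper; dense lists are only practical for toy inputs.)
    """
    # Assumption: the last column of genes.tsv is the gene symbol.
    gene_names = read_tsv_column(os.path.join(path, "genes.tsv"), column=-1)
    barcodes = read_tsv_column(os.path.join(path, "barcodes.tsv"))

    with open(os.path.join(path, "matrix.mtx")) as handle:
        # Drop the "%%MatrixMarket" banner and any "%" comment lines.
        lines = [line for line in handle if line.strip() and not line.startswith("%")]
    # First remaining line: number of rows (genes), columns (cells), entries.
    n_genes, n_cells, _ = (int(x) for x in lines[0].split())
    # Entries are 1-indexed "gene cell value"; transpose to cells x genes.
    counts = [[0] * n_genes for _ in range(n_cells)]
    for entry in lines[1:]:
        gene, cell, value = entry.split()
        counts[int(cell) - 1][int(gene) - 1] = int(value)
    return counts, gene_names, barcodes


# Toy directory with made-up identifiers: 3 genes, 2 cells, 3 non-zero counts.
with tempfile.TemporaryDirectory() as tmp:
    files = {
        "matrix.mtx": "%%MatrixMarket matrix coordinate integer general\n%\n"
                      "3 2 3\n1 1 4\n3 1 1\n2 2 7\n",
        "genes.tsv": "ENSG01\tCD3E\nENSG02\tCD8A\nENSG03\tMS4A1\n",
        "barcodes.tsv": "AAAC-1\nAAAG-1\n",
    }
    for name, text in files.items():
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write(text)
    counts, gene_names, barcodes = load_10x_directory(tmp)

print(counts)      # [[4, 0, 1], [0, 7, 0]]
print(gene_names)  # ['CD3E', 'CD8A', 'MS4A1']
print(barcodes)    # ['AAAC-1', 'AAAG-1']
```

Supporting .h5 inputs would then amount to choosing a different reader per dataset, for example via a format entry next to each name in a mapping like available_datasets.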

Edouard360 added 9 commits June 13, 2018 20:13

adding missing modules

eded90c

improve naming convention

5a17d75

notebook and benchmarks

25d07d0

* Adding notebook example
* Adding benchmark metrics

flake8 warning

d0ab4e0

correcting notebook output

0e26f32

optimize unlabelled foward pass

fca635d

* Optimize computations for SVAEC and VAEC for unlabelled batches (unnecessary computations were done before)

Merge branch 'master' into datasets10x

293eb96

jeff-regier merged commit 241ad35 into master Jun 15, 2018

jeff-regier deleted the datasets10x branch June 15, 2018 15:30

jeff-regier mentioned this pull request Jun 15, 2018

use Dataset10x for brain_small #38

Closed
