Datasets10x #37

Merged
merged 9 commits into from
Jun 15, 2018

Edouard360 (Contributor) commented Jun 15, 2018

I noticed that the 10x datasets share a common file organisation.

Dataset10X("pbmc8k"), Dataset10X("t_3k"), and Dataset10X("frozen_pbmc_donor_a") all worked for me.

These will be useful for reproducing some harmonisation results.

@imyiningliu: this also covers BrainSmallDataset, via Dataset10X("neuron_9k"), but you might want to add some extra preprocessing steps (for instance, selecting the most variable genes).
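
To illustrate the preprocessing step mentioned above, here is a minimal sketch of keeping only the most variable genes. It is not taken from this PR: `select_most_variable_genes` and its arguments are hypothetical names, and it works on plain Python lists rather than on whatever arrays the dataset classes actually hold.

```python
from statistics import pvariance


def select_most_variable_genes(counts, gene_names, n_top):
    """Keep the n_top genes with the highest variance across cells.

    counts: list of rows, one per cell, each a list of per-gene counts.
    gene_names: list of gene names aligned with the columns of counts.
    (Hypothetical helper for illustration; not part of the PR.)
    """
    n_genes = len(gene_names)
    # Variance of each gene (column) across all cells.
    variances = [pvariance([row[g] for row in counts]) for g in range(n_genes)]
    # Rank genes by decreasing variance, keep n_top, restore column order.
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    top = sorted(ranked[:n_top])
    filtered = [[row[g] for g in top] for row in counts]
    return filtered, [gene_names[g] for g in top]
```

Ranking on raw variance is only one possible criterion; a dispersion-based ranking would be a drop-in replacement for the `variances` line.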

This also needs unit tests for coverage.

* Semi supervised:

    * Add `get_indices` and `get_data_loaders` functions, useful for semi-supervised training

    * Fix the semi-supervised training scheme: when the data loaders were unbalanced (for instance, more labelled samples than unlabelled), the loss actually minimized was wrong.

    * Better naming convention

    * Optional in SVAEC: a `LinearLogRegClassifier` classifier, updated only once per epoch, that acts like a NN on the forward pass, i.e. for unlabelled samples, computations still happen entirely on the GPU.

* Others:

    * Added `get_latent_mean` and `get_latents` functions (only `get_latent` existed before). Still backward compatible, but removes the need to call `.cpu().numpy()`. Useful for models with more than one stochastic hidden layer.

    * Correct wrong multinomial formula

    * Add `base.py` file to normalize algorithms methods (VAE, VAEC, SVAEC)
* Adding notebook example

* Adding benchmark metrics
* Optimize computations for SVAEC and VAEC for unlabelled batches (unnecessary computations were done before)
* Dataset10X("frozen_pbmc_donor_a"), Dataset10X("pbmc8k"), Dataset10X("t_4k"), ... download the corresponding datasets from the 10X website and extract the archives. The datasets share a common structure, so factoring out their loading is probably a good idea.

* `concat_datasets` merges multiple datasets based on genes_names
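
One way to read "merging based on gene names" is sketched below: keep only the genes present in every dataset, reorder each dataset's columns to match, and stack the cells. This is an assumption about the behaviour, not the PR's code; `concat_on_shared_genes` and the `(counts, gene_names)` pair layout are hypothetical, and plain lists stand in for the real arrays.

```python
def concat_on_shared_genes(datasets):
    """Stack cells from several datasets, keeping only genes common to all.

    datasets: list of (counts, gene_names) pairs, where counts is a list of
    rows (one per cell) whose columns align with gene_names.
    Returns (merged_counts, shared_genes, batch_indices).
    (Hypothetical helper; the PR's concat_datasets may differ.)
    """
    # Genes present in every dataset, in the order of the first dataset.
    shared = set(datasets[0][1])
    for _, names in datasets[1:]:
        shared &= set(names)
    shared_genes = [g for g in datasets[0][1] if g in shared]

    merged, batch_indices = [], []
    for batch, (counts, names) in enumerate(datasets):
        # Map each gene name to its column in this dataset.
        column = {g: i for i, g in enumerate(names)}
        cols = [column[g] for g in shared_genes]
        for row in counts:
            merged.append([row[c] for c in cols])
            # Remember which dataset each cell came from.
            batch_indices.append(batch)
    return merged, shared_genes, batch_indices
```

Returning the batch index per cell keeps the origin of each row available after the merge, which is what harmonisation experiments typically need.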

jeff-regier (Contributor) commented Jun 15, 2018

Great, let's merge this as is. @imyiningliu If you use this new class for BrainSmallDataset, then test_brain_small will start testing this new code.

@Edouard360 Do we need to test it with additional datasets whose formats differ slightly from brain_small's? Or do these 10x datasets all have the same format?

jeff-regier (Contributor) commented

Also, @Edouard360, what about brain_large? Is its format very different from that of these datasets?

@jeff-regier jeff-regier merged commit 241ad35 into master Jun 15, 2018
@jeff-regier jeff-regier deleted the datasets10x branch June 15, 2018 15:30
Edouard360 (Contributor, Author) commented

All of the aforementioned datasets come in the same format, with three files: matrix.mtx, genes.tsv, and barcodes.tsv. The brain_large dataset, however, is distributed in .h5 format, which Dataset10X cannot currently handle. If other datasets are also in .h5 format, it might be worth adding an extra specification at __init__ time (using something similar to the mapping in the available_datasets object).
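
A rough sketch of that idea follows: a name-to-format mapping decides which reader to call, and the three-file layout is read with the standard library only. Everything here is illustrative rather than the actual Dataset10X code: `DATASET_FORMATS`, `read_mtx_triplet`, and `load_dataset` are hypothetical names, the mapping contents are an assumption, matrix.mtx is assumed to be in Matrix Market coordinate format with genes as rows, and the .h5 branch is deliberately left unimplemented.

```python
import csv
import os

# Hypothetical registry: dataset name -> on-disk format.
# The names come from this thread; the mapping itself is an assumption.
DATASET_FORMATS = {
    "pbmc8k": "mtx",
    "neuron_9k": "mtx",
    "brain_large": "h5",
}


def read_mtx_triplet(directory):
    """Read matrix.mtx, genes.tsv and barcodes.tsv from a directory.

    Returns (entries, gene_names, barcodes), where entries maps
    (cell_index, gene_index) to a count. The file is assumed to be
    genes x cells, so indices are swapped to cells x genes here.
    """
    with open(os.path.join(directory, "genes.tsv"), newline="") as f:
        # Each row is "<gene id>\t<gene name>"; keep the last column.
        gene_names = [row[-1] for row in csv.reader(f, delimiter="\t") if row]
    with open(os.path.join(directory, "barcodes.tsv")) as f:
        barcodes = [line.strip() for line in f if line.strip()]

    entries = {}
    with open(os.path.join(directory, "matrix.mtx")) as f:
        # Skip the "%%MatrixMarket" banner and "%" comment lines.
        data = (ln for ln in f if ln.strip() and not ln.startswith("%"))
        n_genes, n_cells, _ = (int(x) for x in next(data).split())
        if n_genes != len(gene_names) or n_cells != len(barcodes):
            raise ValueError("matrix.mtx shape does not match genes/barcodes")
        for line in data:
            gene, cell, value = line.split()
            # Matrix Market indices are 1-based.
            entries[(int(cell) - 1, int(gene) - 1)] = float(value)
    return entries, gene_names, barcodes


def load_dataset(name, directory):
    """Dispatch on the registered format; .h5 is left unimplemented."""
    fmt = DATASET_FORMATS[name]
    if fmt == "mtx":
        return read_mtx_triplet(directory)
    raise NotImplementedError(f"format {fmt!r} is not handled in this sketch")
```

In practice a sparse-matrix reader such as `scipy.io.mmread` would replace the hand-written parser; the dictionary of entries is used here only to keep the sketch dependency-free.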
