Datasets10x #37
* Semi-supervised:
  * Add `get_indices` and `get_data_loaders` functions, useful for semi-supervised training.
  * Correct the semi-supervised training scheme: when the dataloaders were unbalanced (for instance, more labelled samples than unlabelled), the loss actually minimized was wrong.
  * Better naming convention.
  * Optional in SVAEC: a `LinearLogRegClassifier` classifier, updated only once per epoch, that acts like a NN on the forward pass, i.e. for unlabelled samples the computations still happen entirely on the GPU.
* Others:
  * Add `get_latent_mean` and `get_latents` functions (only `get_latent` existed before). Still backward compatible, but removes the need to call `.cpu().numpy()`. Useful for models with more than one hidden stochastic layer.
  * Correct the wrong multinomial formula.
  * Add a `base.py` file to normalize the methods of the algorithms (VAE, VAEC, SVAEC).
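To illustrate the unbalanced-dataloader issue mentioned in the list above, here is a minimal sketch of one common way to pair batches from two loaders of different lengths: cycle the shorter one so that the longer one is fully visited in each epoch. This is an assumption about the general technique, not necessarily the scheme implemented in this PR; `paired_batches` is a hypothetical helper name, and the loaders are assumed to be any sized iterables (such as lists or PyTorch `DataLoader` objects).

```python
from itertools import cycle


def paired_batches(labelled_loader, unlabelled_loader):
    """Yield (labelled_batch, unlabelled_batch) pairs for one epoch.

    Hypothetical helper: the shorter loader is cycled so that every batch of
    the longer loader is visited exactly once, instead of the epoch stopping
    as soon as the shorter loader runs out.
    """
    n_labelled = len(labelled_loader)
    n_unlabelled = len(unlabelled_loader)
    if n_labelled == 0 or n_unlabelled == 0:
        return  # nothing to pair
    if n_labelled >= n_unlabelled:
        # zip stops when the (longer) labelled loader is exhausted.
        yield from zip(labelled_loader, cycle(unlabelled_loader))
    else:
        yield from zip(cycle(labelled_loader), unlabelled_loader)
```

Note that `itertools.cycle` replays the batches it saw on the first pass, so a shuffling loader would not be reshuffled within the epoch. Whether the two loss terms should also be reweighted by the loader sizes is a separate design choice that this sketch does not address.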
* Adding notebook example
* Adding benchmark metrics
* Optimize computations for SVAEC and VAEC for unlabelled batches (unnecessary computations were done before)
* `Dataset10X("frozen_pbmc_donor_a")`, `Dataset10X("pbmc8k")`, `Dataset10X("t_4k")`, ... download the corresponding datasets from the 10X website and extract the archive. These datasets share a common file organisation, so factorising their loading is probably a good idea.
* `concat_datasets` merges multiple datasets based on `genes_names`.
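As a rough illustration of merging datasets by gene names, the sketch below keeps only the genes shared by all inputs and reorders each count matrix to a common column order before stacking the cells. The function name `concat_by_gene_names`, the `(counts, gene_names)` tuple layout, and the choice of an intersection (rather than a union with zero filling) are assumptions made for this example; the PR's `concat_datasets` may behave differently. Plain lists are used to keep the example dependency-free.

```python
def concat_by_gene_names(datasets):
    """Stack cells from several datasets on their shared genes.

    `datasets` is a list of (counts, gene_names) pairs, where `counts` is a
    list of rows (one per cell) aligned with `gene_names`.
    Returns (merged_counts, shared_gene_names).
    """
    if not datasets:
        raise ValueError("at least one dataset is required")

    # Assumption: keep only genes present in every dataset (exact name match).
    shared = set(datasets[0][1])
    for _, names in datasets[1:]:
        shared &= set(names)
    shared_names = sorted(shared)  # deterministic common column order

    merged = []
    for counts, names in datasets:
        column_of = {name: j for j, name in enumerate(names)}
        idx = [column_of[name] for name in shared_names]
        # Reorder each cell's counts to the common gene order.
        merged.extend([row[j] for j in idx] for row in counts)
    return merged, shared_names
```

For example, merging a dataset with genes `G1, G2, G3` and another with genes `G3, G1` would keep only `G1` and `G3`, in that order, for all cells.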
Great, let's merge this as is.
@imyiningliu If you'll use this new class for the
@Edouard360 Do we need to test it with additional datasets that have slightly different formats than |
Also, @Edouard360 what about |
All of the aforementioned datasets come in the same format with 3 files |
Noticed some common structure in the organisation of the 10x datasets.
`Dataset10X("pbmc8k")`, `Dataset10X("t_3k")`, `Dataset10X("frozen_pbmc_donor_a")` all worked for me. These will be useful for reproducing some harmonisation results.
@imyiningliu: that also includes `BrainSmallDataset` if I do `Dataset10X("neuron_9k")`, but you might want to add some preprocessing steps (for instance, selecting the most variable genes).
Also, needs unit tests for coverage.
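For the suggestion above about selecting the most variable genes, a minimal sketch could look like the following. It ranks genes by the raw variance of their counts across cells, which is a deliberately simple criterion chosen for illustration; real pipelines often use a normalized dispersion instead. The function name `select_most_variable_genes` and the plain-list data layout are hypothetical and not part of this PR.

```python
from statistics import pvariance


def select_most_variable_genes(counts, gene_names, n_top):
    """Keep the `n_top` genes with the highest variance across cells.

    `counts` is a list of rows (one per cell) aligned with `gene_names`.
    Returns (filtered_counts, kept_gene_names), preserving the original
    relative order of the kept genes.
    """
    n_genes = len(gene_names)
    # Population variance of each gene (column) over all cells.
    variances = [pvariance([row[j] for row in counts]) for j in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:n_top])  # restore original column order
    filtered = [[row[j] for j in keep] for row in counts]
    return filtered, [gene_names[j] for j in keep]
```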