# Advanced autotune tutorial

**DISCLAIMER: Most experiments in this notebook require one or more GPUs to keep their runtime a matter of hours.**
**DISCLAIMER: To use our new autotune feature in parallel mode, you need to install [MongoDB](https://docs.mongodb.com/manual/installation/) first.**

In this notebook, we give an in-depth tutorial on `scVI`'s new `autotune` module.

Overall, the new module enables users to perform parallel hyperparameter search for any scVI model on any number of GPUs/CPUs. Although the search may be performed sequentially using only one GPU/CPU, we will focus on the parallel case.
Note that GPUs are much faster here, as they are particularly well suited to the gradient back-propagation used to train neural networks.

Additionally, we provide the code used to generate the results presented in our [Hyperoptimization blog post](https://yoseflab.github.io/2019/07/05/Hyperoptimization/). For an in-depth analysis of the results obtained on three gold standard scRNAseq datasets (Cortex, PBMC and BrainLarge), please refer to the above blog post. In the blog post, we also suggest guidelines on how and when to use our auto-tuning feature.

In [1]:
import sys

sys.path.append("../../")
sys.path.append("../")

%matplotlib inline

In [36]:
import logging
import os
import pickle
import scanpy
import anndata

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from hyperopt import hp

import scvi
from scvi.dataset import cortex, pbmc_dataset, brainlarge_dataset, annotation_simulation
from scvi.inference import UnsupervisedTrainer
from scvi.inference.autotune import auto_tune_scvi_model
from scvi.models import VAE

In [3]:
logger = logging.getLogger("scvi.inference.autotune")
logger.setLevel(logging.WARNING)

In [4]:
def allow_notebook_for_test():
    print("Testing the autotune advanced notebook")

# test_mode = True
test_mode = False


def if_not_test_else(x, y):
    if not test_mode:
        return x
    else:
        return y


save_path = "data/"
n_epochs = if_not_test_else(1000, 1)
n_epochs_brain_large = if_not_test_else(50, 1)
max_evals = if_not_test_else(100, 1)
reserve_timeout = if_not_test_else(180, 5)
fmin_timeout = if_not_test_else(300, 10)

## Default usage

For the sake of principled simplicity, we provide an all-default approach to hyperparameter search for any `scVI` model.
The few lines below present an example of how to perform hyper-parameter search for `scVI` on the Cortex dataset.

Note that, by default, the model used is `scVI`'s `VAE` and the trainer is the `UnsupervisedTrainer`.

Also, the default search space is as follows:
* `n_latent`: [5, 15]
* `n_hidden`: {64, 128, 256}
* `n_layers`: [1, 5]
* `dropout_rate`: {0.1, 0.3, 0.5, 0.7}
* `reconstruction_loss`: {"zinb", "nb"}
* `lr`: {0.01, 0.005, 0.001, 0.0005, 0.0001}
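
To make the notation above concrete, here is a minimal sketch that draws a few candidate configurations from this space with the standard library only. It assumes that the integer ranges are inclusive on both ends, and it samples uniformly at random, which is a simplification and not necessarily how `hyperopt` explores the space. The names `DEFAULT_SPACE` and `sample_configuration` are hypothetical and not part of `scVI`.

```python
import random

# Mirror of the default search space listed above. Treating the integer
# ranges as inclusive on both ends is an assumption of this sketch.
DEFAULT_SPACE = {
    "n_latent": range(5, 16),
    "n_hidden": (64, 128, 256),
    "n_layers": range(1, 6),
    "dropout_rate": (0.1, 0.3, 0.5, 0.7),
    "reconstruction_loss": ("zinb", "nb"),
    "lr": (0.01, 0.005, 0.001, 0.0005, 0.0001),
}


def sample_configuration(rng):
    """Draw one candidate configuration uniformly at random (hypothetical helper)."""
    return {name: rng.choice(list(values)) for name, values in DEFAULT_SPACE.items()}


rng = random.Random(0)
for _ in range(3):
    print(sample_configuration(rng))
```

Under the inclusive-range assumption, the grid holds 11 x 3 x 5 x 4 x 2 x 5 = 6,600 distinct configurations, which gives a sense of why an exhaustive search is impractical with 100 evaluations.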

On a more practical note, verbosity varies in the following way:
* `logger.setLevel(logging.WARNING)` will show a progress bar.
* `logger.setLevel(logging.INFO)` will show global logs including the number of jobs done.
* `logger.setLevel(logging.DEBUG)` will show detailed logs for each training (e.g. the parameters tested).
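
As a small convenience, the three levels above can be switched by name. The sketch below uses only the standard `logging` module and the logger name from the setup cell of this notebook; the helper `set_autotune_verbosity` is hypothetical and not part of `scVI`.

```python
import logging

# Logger name taken from the setup cell at the top of this notebook.
AUTOTUNE_LOGGER_NAME = "scvi.inference.autotune"


def set_autotune_verbosity(level_name):
    """Set the autotune logger level from a name such as "INFO" (hypothetical helper)."""
    level = getattr(logging, level_name.upper())
    logger = logging.getLogger(AUTOTUNE_LOGGER_NAME)
    logger.setLevel(level)
    return logger


logger = set_autotune_verbosity("debug")
print(logger.level == logging.DEBUG)
```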

This function's behaviour can be customized. Please refer to the rest of this tutorial, as well as to the function's documentation, for information about the available parameters.

#### Running the hyperoptimization process.

In [5]:
cortex = scvi.dataset.cortex(save_path=save_path)

[2020-07-14 12:48:16,395] INFO - scvi.dataset._utils | File /Users/galen/scVI/tests/notebooks/data/expression.bin already downloaded
[2020-07-14 12:48:16,397] INFO - scvi.dataset.cortex | Loading Cortex data from /Users/galen/scVI/tests/notebooks/data/expression.bin


Transforming to str index.


[2020-07-14 12:48:24,686] INFO - scvi.dataset.cortex | Finished loading Cortex data
[2020-07-14 12:48:25,554] INFO - scvi.dataset._anndata | Using data from adata.X
[2020-07-14 12:48:25,555] INFO - scvi.dataset._anndata | No batch_key inputted, assuming all cells are same batch
[2020-07-14 12:48:25,559] INFO - scvi.dataset._anndata | Using labels from adata.obs["labels"]
[2020-07-14 12:48:25,560] INFO - scvi.dataset._anndata | Computing library size prior per batch
[2020-07-14 12:48:26,163] INFO - scvi.dataset._anndata | Successfully registered anndata object containing 3005 cells, 19972 genes, and 1 batches 
Registered keys:['X', 'batch_indices', 'local_l_mean', 'local_l_var', 'labels']


In [6]:
best_trainer, trials = auto_tune_scvi_model(
    gene_dataset=cortex,
    parallel=True,
    exp_key="cortex_dataset",
    train_func_specific_kwargs={"n_epochs": n_epochs},
    max_evals=max_evals,
    reserve_timeout=reserve_timeout,
    fmin_timeout=fmin_timeout,
)

[2020-07-14 12:48:26,169] INFO - scvi.inference.autotune.all | Starting experiment: cortex_dataset
[2020-07-14 12:48:26,171] DEBUG - scvi.inference.autotune.all | Using default parameter search space.
[2020-07-14 12:48:26,172] DEBUG - scvi.inference.autotune.all | Adding default early stopping behaviour.
[2020-07-14 12:48:26,174] INFO - scvi.inference.autotune.all | Fixed parameters: 
model: 
{}
trainer: 
{'early_stopping_kwargs': {'early_stopping_metric': 'elbo', 'save_best_state_metric': 'elbo', 'patience': 50, 'threshold': 0, 'reduce_lr_on_plateau': True, 'lr_patience': 25, 'lr_factor': 0.2}, 'metrics_to_monitor': ['elbo']}
train method: 
{'n_epochs': 1}
[2020-07-14 12:48:26,174] INFO - scvi.inference.autotune.all | Starting parallel hyperoptimization
[2020-07-14 12:48:26,188] DEBUG - scvi.inference.autotune.all | Starting MongoDb process, logs redirected to ./mongo/mongo_logfile.txt.
[2020-07-14 12:48:26,209] ERROR - scvi.inference.autotune.all | Caught (2, "No such file or directo

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/Users/galen/anaconda3/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/Users/galen/anaconda3/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/galen/anaconda3/lib/python3.7/logging/handlers.py", line 1475, in _monitor
    record = self.dequeue(True)
  File "/Users/galen/anaconda3/lib/python3.7/logging/handlers.py", line 1424, in dequeue
    return self.queue.get(block)
  File "/Users/galen/anaconda3/lib/python3.7/multiprocessing/queues.py", line 94, in get
    res = self._recv_bytes()
  File "/Users/galen/anaconda3/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/Users/galen/anaconda3/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/Users/galen/anaconda3/lib/python3.7/multiprocessing/connection.py", lin

FileNotFoundError: [Errno 2] No such file or directory: 'mongod': 'mongod'

#### Returned objects

The `trials` object contains detailed information about each run.
`trials.trials` is an `Iterable` in which each element corresponds to a single run. Each element can be used as a dictionary for which the key "result" yields a dictionary containing the outcome of the run as defined in our default objective function (or the user's custom version). For example, it will contain information on the hyperparameters used (under the "space" key), the resulting metric (under the "loss" key) and the status of the run.

The `best_trainer` object can be used directly as an scVI `Trainer` object. It is the result of a training on the whole dataset provided using the optimal set of hyperparameters found.
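
The structure described above can be explored with plain dictionary access. The sketch below works on a mock list standing in for `trials.trials`: the keys follow the description above, while the values, the flat layout of the "space" entry, and the helper `best_trial` are invented for illustration. The `"ok"` and `"fail"` strings are the status values we assume `hyperopt` reports.

```python
# Stand-in for `trials.trials`: the keys follow the description above, but the
# values are invented for illustration and are not real results.
mock_trials = [
    {"result": {"loss": 1250.3, "status": "ok", "space": {"n_layers": 1, "lr": 0.001}}},
    {"result": {"loss": 1231.8, "status": "ok", "space": {"n_layers": 2, "lr": 0.005}}},
    {"result": {"status": "fail"}},
]


def best_trial(trial_list):
    """Return the successful trial with the lowest loss, or None if there is none."""
    finished = [t for t in trial_list if t["result"].get("status") == "ok"]
    if not finished:
        return None
    return min(finished, key=lambda t: t["result"]["loss"])


best = best_trial(mock_trials)
print(best["result"]["space"], best["result"]["loss"])
```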

## Custom hyperparameter space

Although our default search space can be a good one in a number of cases, we still provide an easy way to use custom values for the hyperparameter search space.
These are broken down into three categories:
* Hyperparameters for the `Trainer` instance (if any).
* Hyperparameters for the `Trainer` instance's `train` method (e.g. `lr`).
* Hyperparameters for the model instance (e.g. `n_layers`).

To build your own hyperparameter space, follow the scheme used in `scVI`'s codebase as well as the sample below.
Note that the various spaces you define have to follow the `hyperopt` syntax, for which you can find a detailed description [here](https://github.com/hyperopt/hyperopt/wiki/FMin#2-defining-a-search-space).
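
One detail of that syntax is easy to get wrong: a log-uniform prior such as `hp.loguniform(label, low, high)` takes its bounds in log space and returns `exp(uniform(low, high))`. The sketch below emulates that behaviour with the standard library only, so it runs without `hyperopt`; the helpers `loguniform_bounds` and `sample_loguniform` are hypothetical names introduced for illustration.

```python
import math
import random


def loguniform_bounds(low_value, high_value):
    """Convert bounds given in natural units into log-space bounds."""
    return math.log(low_value), math.log(high_value)


def sample_loguniform(rng, log_low, log_high):
    """Emulate a log-uniform draw: exp of a uniform sample in log space."""
    return math.exp(rng.uniform(log_low, log_high))


log_low, log_high = loguniform_bounds(1e-4, 1e-3)
rng = random.Random(0)
samples = [sample_loguniform(rng, log_low, log_high) for _ in range(1000)]
print(log_low, log_high)
print(min(samples), max(samples))
```

For a learning rate between 0.0001 and 0.001, the log-space bounds are therefore about -9.21 and -6.91. Passing -4.0 and -3.0 directly would instead cover roughly 0.018 to 0.050.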

For example, if you wanted to search over a continuous range of dropout rates in [0.1, 0.3] and a continuous learning rate in [0.0001, 0.001], you could use the following search space.

In [7]:
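space = {
    "model_tunable_kwargs": {"dropout_rate": hp.uniform("dropout_rate", 0.1, 0.3)},
    # hp.loguniform expects its bounds in log space: it samples exp(uniform(low, high)).
    "train_func_tunable_kwargs": {
        "lr": hp.loguniform("lr", np.log(0.0001), np.log(0.001))
    },
}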

best_trainer, trials = auto_tune_scvi_model(
    gene_dataset=cortex,
    space=space,
    parallel=True,
    exp_key="cortex_dataset_custom_space",
    train_func_specific_kwargs={"n_epochs": n_epochs},
    max_evals=max_evals,
    reserve_timeout=reserve_timeout,
    fmin_timeout=fmin_timeout,
)

[2020-07-14 12:48:37,149] INFO - scvi.inference.autotune.all | Starting experiment: cortex_dataset_custom_space
[2020-07-14 12:48:37,150] DEBUG - scvi.inference.autotune.all | Adding default early stopping behaviour.
[2020-07-14 12:48:37,151] INFO - scvi.inference.autotune.all | Fixed parameters: 
model: 
{}
trainer: 
{'early_stopping_kwargs': {'early_stopping_metric': 'elbo', 'save_best_state_metric': 'elbo', 'patience': 50, 'threshold': 0, 'reduce_lr_on_plateau': True, 'lr_patience': 25, 'lr_factor': 0.2}, 'metrics_to_monitor': ['elbo']}
train method: 
{'n_epochs': 1}
[2020-07-14 12:48:37,152] INFO - scvi.inference.autotune.all | Starting parallel hyperoptimization
[2020-07-14 12:48:37,154] DEBUG - scvi.inference.autotune.all | Starting MongoDb process, logs redirected to ./mongo/mongo_logfile.txt.
[2020-07-14 12:48:37,168] ERROR - scvi.inference.autotune.all | Caught (2, "No such file or directory: 'mongod'") in auto_tune_scvi_model, starting cleanup.
Traceback (most recent call las

## Custom objective metric

By default, our autotune process tracks the marginal negative log likelihood of the best state of the model according to the held-out Evidence Lower BOund (ELBO). If you want to track a different early stopping metric and optimize a different loss, you can use `auto_tune_scvi_model`'s parameters.

For example, if you had a dataset coming from two batches (i.e. two merged datasets) and wanted to optimize the hyperparameters for the batch mixing entropy, you could use the code below, which makes use of the `metric_name` argument of `auto_tune_scvi_model`. This can work for any metric that is implemented in the `Posterior` class you use. You may also specify the name of the `Posterior` attribute you want to use (e.g. "train_set").
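
One way to picture how the two string arguments could be resolved is a pair of attribute lookups: first the posterior on the trainer, then the metric method on the posterior. The sketch below is an assumption made for illustration and is not taken from the autotune implementation. `MockPosterior`, `MockTrainer` and `evaluate_metric` are hypothetical, and the returned number is invented.

```python
class MockPosterior:
    """Stand-in for a `Posterior`; the returned value is invented."""

    def entropy_batch_mixing(self):
        return 0.62


class MockTrainer:
    """Stand-in for a trainer exposing a `train_set` posterior attribute."""

    def __init__(self):
        self.train_set = MockPosterior()


def evaluate_metric(trainer, posterior_name, metric_name):
    """Look up the posterior by attribute name, then call the metric method by name."""
    posterior = getattr(trainer, posterior_name)
    metric = getattr(posterior, metric_name)
    return metric()


print(evaluate_metric(MockTrainer(), "train_set", "entropy_batch_mixing"))
```

With this kind of lookup, a misspelled `metric_name` or `posterior_name` would surface as an `AttributeError`, so it is worth checking both names against the classes you use.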

In [8]:
pbmc = pbmc_dataset(save_path=save_path)

[2020-07-14 12:48:42,028] INFO - scvi.dataset._utils | File data/gene_info_pbmc.csv already downloaded
[2020-07-14 12:48:42,029] INFO - scvi.dataset._utils | File data/pbmc_metadata.pickle already downloaded
[2020-07-14 12:48:42,080] INFO - scvi.dataset._utils | File data/pbmc8k/filtered_gene_bc_matrices.tar.gz already downloaded
[2020-07-14 12:48:42,082] INFO - scvi.dataset.dataset10X | Extracting tar file
[2020-07-14 12:49:00,757] INFO - scvi.dataset.dataset10X | Removing extracted data at data/pbmc8k/filtered_gene_bc_matrices
[2020-07-14 12:49:01,171] INFO - scvi.dataset._utils | File data/pbmc4k/filtered_gene_bc_matrices.tar.gz already downloaded
[2020-07-14 12:49:01,173] INFO - scvi.dataset.dataset10X | Extracting tar file
[2020-07-14 12:49:10,146] INFO - scvi.dataset.dataset10X | Removing extracted data at data/pbmc4k/filtered_gene_bc_matrices
[2020-07-14 12:49:11,460] INFO - scvi.dataset._anndata | Using data from adata.X
[2020-07-14 12:49:11,461] INFO - scvi.dataset._anndata | 

In [9]:
best_trainer, trials = auto_tune_scvi_model(
    gene_dataset=pbmc,
    metric_name="entropy_batch_mixing",
    posterior_name="train_set",
    parallel=True,
    exp_key="pbmc_entropy_batch_mixing",
    train_func_specific_kwargs={"n_epochs": n_epochs},
    max_evals=max_evals,
    reserve_timeout=reserve_timeout,
    fmin_timeout=fmin_timeout,
)

[2020-07-14 12:49:11,581] INFO - scvi.inference.autotune.all | Starting experiment: pbmc_entropy_batch_mixing
[2020-07-14 12:49:11,582] DEBUG - scvi.inference.autotune.all | Using default parameter search space.
[2020-07-14 12:49:11,584] DEBUG - scvi.inference.autotune.all | Adding default early stopping behaviour.
[2020-07-14 12:49:11,586] INFO - scvi.inference.autotune.all | Fixed parameters: 
model: 
{}
trainer: 
{'early_stopping_kwargs': {'early_stopping_metric': 'elbo', 'save_best_state_metric': 'elbo', 'patience': 50, 'threshold': 0, 'reduce_lr_on_plateau': True, 'lr_patience': 25, 'lr_factor': 0.2}, 'metrics_to_monitor': ['elbo']}
train method: 
{'n_epochs': 1}
[2020-07-14 12:49:11,587] INFO - scvi.inference.autotune.all | Starting parallel hyperoptimization
[2020-07-14 12:49:11,589] DEBUG - scvi.inference.autotune.all | Starting MongoDb process, logs redirected to ./mongo/mongo_logfile.txt.
[2020-07-14 12:49:11,607] ERROR - scvi.inference.autotune.all | Caught (2, "No such file

## Custom objective function

Below, we describe, using one of our synthetic datasets, how to tune our annotation model `SCANVI` for, e.g., better accuracy on a 20% subset of the labelled data. Note that the model is trained in a semi-supervised framework, which is why we have a labelled and an unlabelled dataset. Please refer to the original [paper](https://www.biorxiv.org/content/10.1101/532895v1) for details on SCANVI!

In this case, as described in our `annotation` notebook, we may want to form the labelled/unlabelled sets using batch indices. Unfortunately, that requires a little "by hand" work. Even in that case, we are able to leverage the new autotune module to perform hyperparameter tuning. In order to do so, one has to write one's own objective function and feed it to `auto_tune_scvi_model`.

One can proceed as described below.
Note three important conditions:
* Since it is going to be pickled, the objective should not be implemented in the `__main__` module, i.e. an executable script or a notebook.
* The objective should have the search space as its first argument and a boolean `is_best_training` as its second.
* If not using a custom search space, it should be expected to take the form of a dictionary with the following keys:
    * `"model_tunable_kwargs"`
    * `"trainer_tunable_kwargs"`
    * `"train_func_tunable_kwargs"`
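
The conditions above can be sketched as a skeleton objective. This is a stand-in, not the helper used in this notebook: training is replaced by a placeholder, the score is an invented number, and the convention of returning the trained object when `is_best_training` is true is an assumption of this sketch. The name `sketch_objective` is hypothetical.

```python
def sketch_objective(space, is_best_training=False, dataset=None, n_epochs=1):
    """Hypothetical objective: search space first, `is_best_training` second."""
    # Missing keys fall back to empty dicts so that partial spaces also work.
    model_kwargs = space.get("model_tunable_kwargs", {})
    trainer_kwargs = space.get("trainer_tunable_kwargs", {})
    train_kwargs = space.get("train_func_tunable_kwargs", {})

    # Placeholder for building and training a model; a real objective would
    # train on `dataset` for `n_epochs` and compute e.g. an accuracy here.
    trained = {"model": model_kwargs, "trainer": trainer_kwargs, "train": train_kwargs}
    score = 1.0 / (1.0 + len(model_kwargs) + len(train_kwargs))  # invented number

    if is_best_training:
        # Final retraining with the best hyperparameters: hand back the trained object.
        return trained
    # hyperopt minimizes, so a score to maximize is returned with a flipped sign.
    return {"loss": -score, "status": "ok", "space": space}


space = {
    "model_tunable_kwargs": {"n_layers": 2},
    "trainer_tunable_kwargs": {},
    "train_func_tunable_kwargs": {"lr": 0.001},
}
print(sketch_objective(space)["loss"])
```

To satisfy the first condition, a function like this would live in an importable module (as the helper imported in the next cell does) and not in the notebook itself.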

In [None]:
from notebooks.utils.autotune_advanced_notebook import custom_objective_hyperopt

In [38]:
synthetic_dataset = annotation_simulation(1)
objective_kwargs = dict(dataset=synthetic_dataset, n_epochs=n_epochs)
best_trainer, trials = auto_tune_scvi_model(
    custom_objective_hyperopt=custom_objective_hyperopt,
    objective_kwargs=objective_kwargs,
    parallel=True,
    exp_key="synthetic_dataset_scanvi",
    max_evals=max_evals,
    reserve_timeout=reserve_timeout,
    fmin_timeout=fmin_timeout,
)

[2020-07-14 14:19:44,868] INFO - scvi.dataset._utils | File /Users/galen/scVI/tests/notebooks/data/simulation_1.loom already downloaded


NameError: name 'custom_objective_hyperopt' is not defined

## Delayed populating, for very large datasets.

**DISCLAIMER: We don't actually need this for the BrainLarge dataset with 720 genes, this is just an example.**

The fact is that after building the objective function and feeding it to `hyperopt`, it is pickled and sent to the `MongoWorkers`. Thus, if you pass a loaded dataset as a partial argument to the objective function, and this dataset exceeds 4 GB, you'll get a `PickleError` (objects larger than 4 GB can't be pickled).

To remedy this issue, in case you have a very large dataset for which you want to perform hyperparameter optimization, you should subclass `scVI`'s `DownloadableDataset` or use one of its many existing subclasses, such that the dataset can be populated inside the objective function which is called by each worker.
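
The idea can be illustrated with a toy comparison of what would have to be pickled in each case. The sketch below only pickles the keyword arguments, as a proxy for the payload shipped to each worker. The loader, the path and the objective are hypothetical stand-ins and do not use `scVI`'s `DownloadableDataset`.

```python
import pickle


def load_dataset(path):
    """Hypothetical loader; a real one would read the file at `path`."""
    return list(range(100_000))  # placeholder for a large in-memory dataset


def objective(space, dataset=None, dataset_path=None):
    """Toy objective that populates the dataset lazily when only a path is given."""
    if dataset is None:
        dataset = load_dataset(dataset_path)
    return {"loss": float(len(dataset)), "status": "ok", "space": space}


# Eager: the loaded dataset itself is part of the pickled payload.
eager_kwargs = {"dataset": load_dataset("unused")}
# Delayed: only a short string travels; each worker loads the data itself.
delayed_kwargs = {"dataset_path": "path/to/dataset.h5ad"}  # made-up path

eager_size = len(pickle.dumps(eager_kwargs))
delayed_size = len(pickle.dumps(delayed_kwargs))
print(eager_size, delayed_size)

result = objective({"lr": 0.001}, **delayed_kwargs)
print(result["loss"])
```

The trade-off is that every worker pays the loading cost on each call, which is usually acceptable when the alternative is not being able to pickle the dataset at all.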

In [10]:
if test_mode:
    brain_large_dataset_path = '../data/brainlarge_dataset_test.h5ad'
else:
    #brain_large_dataset_path = path_to_processed_brain_large
    pass

best_trainer, trials = auto_tune_scvi_model(
    gene_dataset=brain_large_dataset_path,
    delayed_populating=True,
    parallel=True,
    exp_key="brain_large_dataset",
    max_evals=max_evals,
    trainer_specific_kwargs={
        "early_stopping_kwargs": {
            "early_stopping_metric": "elbo",
            "save_best_state_metric": "elbo",
            "patience": 20,
            "threshold": 0,
            "reduce_lr_on_plateau": True,
            "lr_patience": 10,
            "lr_factor": 0.2,
        }
    },
    train_func_specific_kwargs={"n_epochs": n_epochs_brain_large},
    reserve_timeout=reserve_timeout,
    fmin_timeout=fmin_timeout,
)

[2020-07-14 12:49:17,599] INFO - scvi.inference.autotune.all | Starting experiment: brain_large_dataset
[2020-07-14 12:49:17,599] DEBUG - scvi.inference.autotune.all | Using default parameter search space.
[2020-07-14 12:49:17,601] INFO - scvi.inference.autotune.all | Fixed parameters: 
model: 
{}
trainer: 
{'early_stopping_kwargs': {'early_stopping_metric': 'elbo', 'save_best_state_metric': 'elbo', 'patience': 20, 'threshold': 0, 'reduce_lr_on_plateau': True, 'lr_patience': 10, 'lr_factor': 0.2}}
train method: 
{'n_epochs': 1}
[2020-07-14 12:49:17,602] INFO - scvi.inference.autotune.all | Starting parallel hyperoptimization
[2020-07-14 12:49:17,604] DEBUG - scvi.inference.autotune.all | Starting MongoDb process, logs redirected to ./mongo/mongo_logfile.txt.
[2020-07-14 12:49:17,620] ERROR - scvi.inference.autotune.all | Caught (2, "No such file or directory: 'mongod'") in auto_tune_scvi_model, starting cleanup.
Traceback (most recent call last):
  File "/Users/galen/scVI/scvi/inferenc

## Blog post reproducibility

Below, we provide some code to reproduce the results of our [blog post](https://yoseflab.github.io/2019/07/05/Hyperoptimization/) on hyperparameter search with scVI.
Note that this can also be used as a tutorial on how to make sense of the output of the autotuning process, the `Trials` object.

In `scvi` version 1.0, we removed the dataset corruption feature, so please use an earlier version of scVI for full reproducibility.

## Cortex, Pbmc and BrainLarge hyperparameter optimization

First off, we run the default hyperparameter optimization procedure (default search space, 100 runs) on each of the three datasets of our study:
* The Cortex dataset (done above)
* The Pbmc dataset
* The Brain Large dataset (done above)

In [11]:
best_trainer, trials = auto_tune_scvi_model(
    gene_dataset=pbmc,
    parallel=True,
    exp_key="pbmc_dataset",
    max_evals=max_evals,
    train_func_specific_kwargs={"n_epochs": n_epochs},
    reserve_timeout=reserve_timeout,
    fmin_timeout=fmin_timeout,
)

[2020-07-14 12:49:21,421] INFO - scvi.inference.autotune.all | Starting experiment: pbmc_dataset
[2020-07-14 12:49:21,422] DEBUG - scvi.inference.autotune.all | Using default parameter search space.
[2020-07-14 12:49:21,424] DEBUG - scvi.inference.autotune.all | Adding default early stopping behaviour.
[2020-07-14 12:49:21,425] INFO - scvi.inference.autotune.all | Fixed parameters: 
model: 
{}
trainer: 
{'early_stopping_kwargs': {'early_stopping_metric': 'elbo', 'save_best_state_metric': 'elbo', 'patience': 50, 'threshold': 0, 'reduce_lr_on_plateau': True, 'lr_patience': 25, 'lr_factor': 0.2}, 'metrics_to_monitor': ['elbo']}
train method: 
{'n_epochs': 1}
[2020-07-14 12:49:21,426] INFO - scvi.inference.autotune.all | Starting parallel hyperoptimization
[2020-07-14 12:49:21,429] DEBUG - scvi.inference.autotune.all | Starting MongoDb process, logs redirected to ./mongo/mongo_logfile.txt.
[2020-07-14 12:49:21,444] ERROR - scvi.inference.autotune.all | Caught (2, "No such file or directory

## Handy class to handle the results of each experiment

In the helper `autotune_advanced_notebook.py`, we have implemented a `Benchmarkable` class which will help with things such as benchmark computation, results visualisation in dataframes, etc.

In [None]:
from notebooks.utils.autotune_advanced_notebook import Benchmarkable

## Make experiment benchmarkables

Below, we use our helper class to store and process the results of the experiments.
It allows us to generate:
* Imputed values from scVI
* Dataframes containing:
    * For each dataset, the results of each trial along with the parameters used.
    * For all datasets, the best result and the associated hyperparameters

In [None]:
results_path = "."

In [None]:
cortex = Benchmarkable(
    global_path=results_path, exp_key="cortex_dataset", name="Cortex tuned"
)
pbmc = Benchmarkable(
    global_path=results_path, exp_key="pbmc_dataset", name="Pbmc tuned"
)
brain_large = Benchmarkable(
    global_path=results_path, exp_key="brain_large_dataset", name="Brain Large tuned"
)

#### Train each VAE

In [None]:
n_epochs_one_shot = if_not_test_else(400, 1)

In [None]:
stats = cortex.uns['scvi_summary_stats']
# Multiplying by False sets n_batch to 0, i.e. the model ignores batch information.
vae = VAE(stats['n_genes'], n_batch=stats['n_batch'] * False)
trainer = UnsupervisedTrainer(
    vae, cortex, train_size=0.75, use_cuda=True, frequency=1
)
trainer.train(n_epochs=n_epochs_one_shot, lr=1e-3)
with open("trainer_cortex_one_shot", "wb") as f:
    pickle.dump(trainer, f)
with open("model_cortex_one_shot", "wb") as f:
    torch.save(vae, f)

In [None]:
stats = pbmc.uns['scvi_summary_stats']
vae = VAE(stats['n_genes'], n_batch=stats['n_batch'] * False)
trainer = UnsupervisedTrainer(
    vae, pbmc, train_size=0.75, use_cuda=True, frequency=1
)
trainer.train(n_epochs=n_epochs_one_shot, lr=1e-3)
with open("trainer_pbmc_one_shot", "wb") as f:
    pickle.dump(trainer, f)
with open("model_pbmc_one_shot", "wb") as f:
    torch.save(vae, f)

In [33]:
if test_mode:
    brain_large = anndata.read("../data/brainlarge_dataset_test.h5ad")
else:
    brain_large = brainlarge_dataset()
stats = brain_large.uns['scvi_summary_stats']

vae = VAE(stats['n_genes'], n_batch=stats['n_batch'] * False)
trainer = UnsupervisedTrainer(
    vae, brain_large, train_size=0.75, use_cuda=True, frequency=1
)
trainer.train(n_epochs=n_epochs_brain_large, lr=1e-3)
with open("trainer_brain_large_one_shot", "wb") as f:
    pickle.dump(trainer, f)
with open("model_brain_large_one_shot", "wb") as f:
    torch.save(vae, f)

AttributeError: 'AnnData' object has no attribute ''

Again, we use our helper class to contain, preprocess and access the results of each experiment.

In [None]:
cortex_one_shot = Benchmarkable(
    trainer_fname="trainer_cortex_one_shot",
    model_fname="model_cortex_one_shot",
    name="Cortex default",
    is_one_shot=True,
)
pbmc_one_shot = Benchmarkable(
    trainer_fname="trainer_pbmc_one_shot",
    model_fname="model_pbmc_one_shot",
    name="Pbmc default",
    is_one_shot=True,
)
brain_large_one_shot = Benchmarkable(
    trainer_fname="trainer_brain_large_one_shot",
    model_fname="model_brain_large_one_shot",
    name="Brain Large default",
    is_one_shot=True,
)

## Hyperparameter space `DataFrame`

Our helper class allows us to get one dataframe per experiment summarizing the results of each trial.
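
As a rough picture of what such a summary contains, the sketch below flattens a mock list of trials into one row per run and writes it out as CSV using the standard library only. It is not the helper class's implementation: the trial values are invented, and the nested layout of the "space" entry and the function `trials_to_rows` are assumptions made for illustration.

```python
import csv
import io

# Invented trials; the nested "space" layout is an assumption of this sketch.
mock_trials = [
    {
        "result": {
            "loss": 1250.3,
            "space": {
                "model_tunable_kwargs": {"n_layers": 1},
                "train_func_tunable_kwargs": {"lr": 0.001},
            },
        }
    },
    {
        "result": {
            "loss": 1231.8,
            "space": {
                "model_tunable_kwargs": {"n_layers": 2},
                "train_func_tunable_kwargs": {"lr": 0.005},
            },
        }
    },
]


def trials_to_rows(trial_list):
    """Flatten each trial into one row: run index, loss, then its hyperparameters."""
    rows = []
    for run_index, trial in enumerate(trial_list):
        result = trial["result"]
        row = {"run index": run_index, "loss": result["loss"]}
        for group in result["space"].values():
            row.update(group)
        rows.append(row)
    # Best (lowest loss) run first.
    return sorted(rows, key=lambda row: row["loss"])


rows = trials_to_rows(mock_trials)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```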

In [20]:
cortex_df = cortex.obs
cortex_df.to_csv("cortex_df")
cortex.obs

| | labels | precise_labels | cell_type | _scvi_batch | _scvi_local_l_mean | _scvi_local_l_var |
|---|---|---|---|---|---|---|
| 0 | 2 | 1 | interneurons | 0 | 9.343505 | 0.423329 |
| 1 | 2 | 1 | interneurons | 0 | 9.343505 | 0.423329 |
| 2 | 2 | 1 | interneurons | 0 | 9.343505 | 0.423329 |
| 3 | 2 | 1 | interneurons | 0 | 9.343505 | 0.423329 |
| 4 | 2 | 1 | interneurons | 0 | 9.343505 | 0.423329 |
| 5 | 2 | 1 | interneurons | 0 | 9.343505 | 0.423329 |
| 6 | 2 | 1 | interneurons | 0 | 9.343505 | 0.423329 |
| 7 | 2 | 1 | interneurons | 0 | 9.343505 | 0.423329 |
| 8 | 2 | 1 | interneurons | 0 | 9.343505 | 0.423329 |
| 9 | 2 | 1 | interneurons | 0 | 9.343505 | 0.423329 |


In [19]:
pbmc_df = pbmc.obs
pbmc_df.to_csv("pbmc_df")
pbmc.obs

| | batch | n_counts | labels | str_labels | _scvi_batch | _scvi_local_l_mean | _scvi_local_l_var |
|---|---|---|---|---|---|---|---|
| AAACCTGAGCTAGTGG-1 | 0 | 4520.0 | 2 | CD4 T cells | 0 | 7.227619 | 0.158229 |
| AAACCTGCACATTAGC-1 | 0 | 2788.0 | 2 | CD4 T cells | 0 | 7.227619 | 0.158229 |
| AAACCTGCACTGTTAG-1 | 0 | 4667.0 | 1 | CD14+ Monocytes | 0 | 7.227619 | 0.158229 |
| AAACCTGCATAGTAAG-1 | 0 | 4440.0 | 1 | CD14+ Monocytes | 0 | 7.227619 | 0.158229 |
| AAACCTGCATGAACCT-1 | 0 | 3224.0 | 3 | CD8 T cells | 0 | 7.227619 | 0.158229 |
| AAACCTGGTAAGAGGA-1 | 0 | 5205.0 | 2 | CD4 T cells | 0 | 7.227619 | 0.158229 |
| AAACCTGGTAGAAGGA-1 | 0 | 5493.0 | 1 | CD14+ Monocytes | 0 | 7.227619 | 0.158229 |
| AAACCTGGTCCAGTGC-1 | 0 | 4419.0 | 2 | CD4 T cells | 0 | 7.227619 | 0.158229 |
| AAACCTGGTGTCTGAT-1 | 0 | 3343.0 | 2 | CD4 T cells | 0 | 7.227619 | 0.158229 |
| AAACCTGGTTTGCATG-1 | 0 | 6360.0 | 1 | CD14+ Monocytes | 0 | 7.227619 | 0.158229 |


In [34]:
brain_large_df = brain_large.obs
brain_large_df.to_csv("brain_large_df")
brain_large.obs

| | labels | batch | n_counts | n_genes | _scvi_batch | _scvi_labels | _scvi_local_l_mean | _scvi_local_l_var |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 2451.0 | 497 | 0 | 0 | 7.937451 | 0.340575 |
| 1 | 0.0 | 0.0 | 1119.0 | 385 | 0 | 0 | 7.937451 | 0.340575 |
| 2 | 0.0 | 0.0 | 2456.0 | 499 | 0 | 0 | 7.937451 | 0.340575 |
| 3 | 0.0 | 0.0 | 1761.0 | 463 | 0 | 0 | 7.937451 | 0.340575 |
| 4 | 0.0 | 0.0 | 4464.0 | 597 | 0 | 0 | 7.937451 | 0.340575 |
| 5 | 0.0 | 0.0 | 6051.0 | 614 | 0 | 0 | 7.937451 | 0.340575 |
| 6 | 0.0 | 0.0 | 1232.0 | 421 | 0 | 0 | 7.937451 | 0.340575 |
| 7 | 0.0 | 0.0 | 2107.0 | 492 | 0 | 0 | 7.937451 | 0.340575 |
| 8 | 0.0 | 0.0 | 1679.0 | 463 | 0 | 0 | 7.937451 | 0.340575 |
| 9 | 0.0 | 0.0 | 2470.0 | 534 | 0 | 0 | 7.937451 | 0.340575 |


## Best run `DataFrame`

Using the previous dataframes we are able to build one containing the best results along with the results obtained with the default parameters.

In [22]:
cortex_best = cortex_df.iloc[0]
cortex_best.name = "Cortex tuned"
cortex_default = pd.Series(
    [
        cortex_one_shot.best_performance,
        1, 128, 10, "zinb", 0.1, 0.001, 400, None, None
    ],
    index=cortex_best.index
)
cortex_default.name = "Cortex default"
pbmc_best = pbmc_df.iloc[0]
pbmc_best.name = "Pbmc tuned"
pbmc_default = pd.Series(
    [
        pbmc_one_shot.best_performance,
        1, 128, 10, "zinb", 0.1, 0.001, 400, None, None
    ],
    index=pbmc_best.index
)
pbmc_default.name = "Pbmc default"
brain_large_best = brain_large_df.iloc[0]
brain_large_best.name = "Brain Large tuned"
brain_large_default = pd.Series(
    [
        brain_large_one_shot.best_performance,
        1, 128, 10, "zinb", 0.1, 0.001, 400, None, None
    ],
    index=brain_large_best.index
)
brain_large_default.name = "Brain Large default"
df_best = pd.concat(
    [cortex_best,
     cortex_default,
     pbmc_best,
     pbmc_default,
     brain_large_best,
     brain_large_default
    ],
    axis=1
)
df_best = df_best.iloc[np.logical_not(np.isin(df_best.index, ["n_params", "run index"]))]
df_best

NameError: name 'cortex_one_shot' is not defined

## Handy class to compare the results of each experiment

We use a second handy class to compare all of these results.
Specifically, the `PlotBenchmarkables` class allows us to retrieve:
* A `DataFrame` containing the runtime information of each experiment.
* A `DataFrame` comparing the different benchmarks (negative marginal LL, imputation) between tuned and default VAEs.
* For each dataset, a plot aggregating the ELBO histories of each run.

In [None]:
from notebooks.utils.autotune_advanced_notebook import PlotBenchmarkables

In [None]:
tuned_benchmarkables = {
    "cortex": cortex,
    "pbmc": pbmc,
    "brain large": brain_large,
}
one_shot_benchmarkables = {
    "cortex": cortex_one_shot,
    "pbmc": pbmc_one_shot,
    "brain large": brain_large_one_shot
}
plotter = PlotBenchmarkables(
    tuned_benchmarkables=tuned_benchmarkables,
    one_shot_benchmarkables=one_shot_benchmarkables,
)

## Runtime `DataFrame`

In [35]:
df_runtime = plotter.get_runtime_dataframe()
df_runtime

NameError: name 'plotter' is not defined

## Results `DataFrame` for best runs

In [None]:
def highlight_min(data, color="yellow"):
    attr = "background-color: {}".format(color)
    if data.ndim == 1:  # Series from .apply(axis=0) or axis=1
        is_min = data == data.min()
        return [attr if v else "" for v in is_min]
    else:  # from .apply(axis=None)
        is_min = data == data.min().min()
        return pd.DataFrame(np.where(is_min, attr, ""),
                            index=data.index, columns=data.columns)

In [None]:
df_results = plotter.get_results_dataframe()
styler = df_results.style.apply(highlight_min, axis=0, subset=pd.IndexSlice["cortex", :])
styler = styler.apply(highlight_min, axis=0, subset=pd.IndexSlice["pbmc", :])
styler = styler.apply(highlight_min, axis=0, subset=pd.IndexSlice["brain large", :])
styler

## ELBO Histories plot

In the ELBO histories plotted below, the runs are colored from red to green, where red is the first run and green the last one.
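
For intuition, a red-to-green coloring of this kind can be obtained by linearly interpolating an RGB triple over the run index. The sketch below is an assumption about how such a gradient could be built, not the colormap actually used by `plot_histories`; `run_color` is a hypothetical helper.

```python
def run_color(run_index, n_runs):
    """Linear red-to-green RGB color for a run (hypothetical helper)."""
    if n_runs < 2:
        return (1.0, 0.0, 0.0)
    t = run_index / (n_runs - 1)
    return (1.0 - t, t, 0.0)


colors = [run_color(i, 5) for i in range(5)]
print(colors[0], colors[-1])
```

Tuples of this form can be passed as the `color` argument of `matplotlib` plotting calls, so later runs stand out in green if the search improves over time.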

In [None]:
plt.rcParams["figure.dpi"] = 200
plt.rcParams["figure.figsize"] = (10, 7)

In [None]:
ylims_dict = {
    "cortex": [1225, 1600],
    "pbmc": [1325, 1600],
    "brain large": [140, 160],
}
plotter.plot_histories(figsize=(17, 5), ylims_dict=ylims_dict, filename="elbo_histories_all", alpha=0.1)