# Annotation Tutorial

**NB**: please refer to the scVI-dev notebook for an introduction to the scVI package.

In this notebook, we investigate how semi-supervised learning combined with the probabilistic modelling of latent variables in scVI can help address the annotation problem.

The annotation problem consists of labelling cells, i.e. **inferring their cell types**, when only some of the labels are known.

In [1]:
cd ../..

/home/ubuntu/scVI


### Loading Config

In [2]:
import json
with open('docs/notebooks/annotation.config.json') as f:
    config = json.load(f)
print(config)

# Optional settings: a missing key falls back to None (or to 'data/' for save_path)
n_epochs_all = config.get('n_epochs')
save_path = config.get('save_path', 'data/')
n_samples_tsne = config.get('n_samples_tsne')
n_samples_posterior_density = config.get('n_samples_posterior_density')
train_size = config.get('train_size')
M_sampling_all = config.get('M_sampling')
M_permutation_all = config.get('M_permutation')
n_labelled_samples_per_class = config.get('n_labelled_samples_per_class')

{'save_path': 'data/'}


In [3]:
from scvi.dataset import CortexDataset
from scvi.models import SCANVI, VAE
from scvi.inference import JointSemiSupervisedTrainer

gene_dataset = CortexDataset(save_path=save_path)

File data/expression.bin already downloaded
Preprocessing Cortex data
Finished preprocessing Cortex data


We instantiate the SCANVI model and train it for 100 epochs (the default in the cell below, unless `n_epochs` is set in the config). Only the labels of `trainer.labelled_set` are used for training; to validate the results, the labels of `trainer.unlabelled_set` are used at test time. The accuracy on the unlabelled set reaches about 92% at the end of training.

In [4]:
gene_dataset = CortexDataset(save_path=save_path)

use_batches=False
use_cuda=True

n_epochs = 100 if n_epochs_all is None else n_epochs_all
n_cl = 10 if n_labelled_samples_per_class is None else n_labelled_samples_per_class
scanvi = SCANVI(gene_dataset.nb_genes, gene_dataset.n_batches, gene_dataset.n_labels)
trainer = JointSemiSupervisedTrainer(scanvi, gene_dataset, 
                                     n_labelled_samples_per_class=n_cl, 
                                     classification_ratio=100)
trainer.train(n_epochs=n_epochs)

trainer.unlabelled_set.accuracy()

File data/expression.bin already downloaded
Preprocessing Cortex data
Finished preprocessing Cortex data
training: 100%|██████████| 100/100 [00:40<00:00,  2.45it/s]


0.9189097103918228
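
To make the role of `n_labelled_samples_per_class` more concrete, here is a minimal sketch of how a dataset might be split into a small labelled set and a large unlabelled set. It is not the code used by `JointSemiSupervisedTrainer`; the helper `split_labelled_unlabelled` and the toy labels are hypothetical and only illustrate the idea.

```python
import numpy as np

def split_labelled_unlabelled(labels, n_per_class, seed=0):
    """Hypothetical helper: keep up to n_per_class labelled cells per class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).ravel()
    labelled = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        # Sample without replacement; small classes contribute all their cells
        chosen = rng.choice(idx, size=min(n_per_class, idx.size), replace=False)
        labelled.append(chosen)
    labelled = np.sort(np.concatenate(labelled))
    # Every remaining cell is treated as unlabelled during training
    unlabelled = np.setdiff1d(np.arange(labels.size), labelled)
    return labelled, unlabelled

toy_labels = np.repeat(np.arange(7), 50)  # toy data: 7 classes, 50 cells each
labelled_idx, unlabelled_idx = split_labelled_unlabelled(toy_labels, n_per_class=10)
print(len(labelled_idx), len(unlabelled_idx))
```

With 10 labelled cells per class, the classifier sees only a small fraction of the labels during training, which is why the accuracy on the unlabelled set is the quantity of interest.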

**Benchmarking against other algorithms**

We can compare SCANVI against random forest and SVM classifiers, whose hyperparameters are tuned by grid search with 3-fold cross-validation. This is performed automatically by the functions **`compute_accuracy_svc`** and **`compute_accuracy_rf`**.

These functions take as input the numpy arrays corresponding to the labelled and unlabelled sets, which are obtained in the cell below with the **`raw_data()`** method of each set.
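
As an illustration of grid search with 3-fold cross-validation, the following sketch uses scikit-learn's `GridSearchCV` on synthetic data. The parameter grids and the dataset are assumptions chosen for the example; they are not necessarily those used inside `compute_accuracy_svc` and `compute_accuracy_rf`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the (labelled, unlabelled) expression matrices
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, y_train = X[:100], y[:100]
X_test, y_test = X[100:], y[100:]

# Illustrative grids (assumed, not taken from scVI)
svc_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
rf_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

# cv=3 selects the best hyperparameters by 3-fold cross-validation on the
# training set, then refits the best model on all training data
svc_search = GridSearchCV(SVC(), svc_grid, cv=3).fit(X_train, y_train)
rf_search = GridSearchCV(RandomForestClassifier(random_state=0), rf_grid, cv=3).fit(X_train, y_train)

# Accuracy of the refitted best models on the held-out data
svc_acc = svc_search.score(X_test, y_test)
rf_acc = rf_search.score(X_test, y_test)
print("SVC:", svc_search.best_params_, svc_acc)
print("RF: ", rf_search.best_params_, rf_acc)
```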

Each of `compute_accuracy_svc` and `compute_accuracy_rf` reports its results as `Accuracy` named tuples, which give a finer-grained view of the accuracy through the following attributes:

- **unweighted**: the standard definition of accuracy, i.e. the fraction of correctly classified cells

- **weighted**: the accuracy obtained when every class is given the same weight, i.e. the mean of the per-class accuracies. It differs noticeably from the unweighted accuracy only if the dataset is unbalanced.

- **worst**: the lowest per-class accuracy

- **accuracy_classes**: the accuracy for each class
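
The four attributes above can be reproduced with a few lines of numpy. The sketch below is a re-implementation for illustration only, not the scVI source: the function `accuracy_summary` is hypothetical, and the named tuple simply reuses the field names shown in the printed results.

```python
from collections import namedtuple

import numpy as np

Accuracy = namedtuple("Accuracy", ["unweighted", "weighted", "worst", "accuracy_classes"])

def accuracy_summary(y_true, y_pred):
    """Hypothetical helper computing the four accuracy attributes."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # Fraction of correct predictions within each true class
    per_class = [float(np.mean(y_pred[y_true == c] == c)) for c in np.unique(y_true)]
    return Accuracy(
        unweighted=float(np.mean(y_true == y_pred)),  # overall fraction correct
        weighted=float(np.mean(per_class)),           # every class counts equally
        worst=min(per_class),                         # hardest class
        accuracy_classes=per_class,
    )

# Toy unbalanced example: 4 cells of class 0, 2 cells of class 1
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])
scores = accuracy_summary(y_true, y_pred)
print(scores)
```

In this toy example the unweighted accuracy is 4/6 while the weighted accuracy is (0.75 + 0.5) / 2 = 0.625, showing how the two diverge when classes have different sizes.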


Compute the accuracy scores for the random forest and SVC classifiers:

In [5]:
from scvi.inference.annotation import compute_accuracy_rf, compute_accuracy_svc

data_train, labels_train = trainer.labelled_set.raw_data()
data_test, labels_test = trainer.unlabelled_set.raw_data()
svc_scores = compute_accuracy_svc(data_train, labels_train, data_test, labels_test)
rf_scores = compute_accuracy_rf(data_train, labels_train, data_test, labels_test)

print("\nSVC score test :\n", svc_scores[0][1])
print("\nRF score test :\n", rf_scores[0][1])


SVC score test :
 Accuracy(unweighted=0.8701873935264055, weighted=0.8465908861701248, worst=0.7223650385604113, accuracy_classes=[0.794392523364486, 0.9066666666666666, 0.8857142857142857, 0.7954545454545454, 0.9345679012345679, 0.8869752421959096, 0.7223650385604113])

RF score test :
 Accuracy(unweighted=0.8933560477001703, weighted=0.874333009653023, worst=0.6503856041131105, accuracy_classes=[0.883177570093458, 0.8977777777777778, 0.9964285714285714, 0.8181818181818182, 0.9604938271604938, 0.9138858988159311, 0.6503856041131105])
