# Convert from CSV to AnnData-Zarr

## Motivation

Vitessce is a web-based visualization tool that loads data from files in [particular formats](http://vitessce.io/docs/data-types-file-types/). For best performance, these files should be as small as possible. The [Zarr](https://zarr.readthedocs.io/en/stable/) format helps in two ways: it supports compression, and it supports chunking (splitting a large array into multiple smaller files), which enables loading only the subset of the data that is required for a particular visualization.

In [21]:
import pandas as pd
import numpy as np
from anndata import AnnData
from os.path import join
from vitessce.data_utils import (
    optimize_adata,
    VAR_CHUNK_SIZE,
)

## Load data from CSV files using pandas

For this example, we are starting from "raw" data that was saved to CSV files. Using [pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) we can load the CSVs into pandas DataFrame objects. We can explore how the data is organized in each CSV by checking the first 5 rows using the [DataFrame.head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method.

For example, this CSV contains a cell-by-gene expression matrix, where the rows represent cells, and the columns represent genes:

In [3]:
matrix_df = pd.read_csv(join("raw_data", "habib17.cell_by_gene_matrix.csv"), index_col=0)
matrix_df.head()

| index | LINC00115 | RP11-54O7.1 | LINC02593 | SAMD11 | ISG15 | RP11-54O7.11 | MXRA8 | MRPL20 | RP4-758J18.13 | ANKRD65 | ... | RP11-539G18.2 | RP11-592B15.3 | RP11-698N11.4 | SIK3-IT1 | AC011526.1 | CTA-357J21.1 | RP11-28F1.2 | RP11-638I8.1 | RNVU1-20 | RP3-511B24.6 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hHP1_ACTCAATAGCAA-habib17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| hHP1_TTCCCGTTAAAG-habib17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| hHP1_GTCATTGAATCA-habib17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| hHP1_CACCTTCAATAC-habib17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| hHP1_ATACATGTTGTC-habib17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
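
The effect of `index_col=0` can be seen on a tiny in-memory CSV. This is a minimal sketch with made-up cell IDs, gene names, and values (none of them come from the habib17 files); it only illustrates how the first CSV column becomes the row index rather than a data column.

```python
import io

import pandas as pd

# Made-up miniature CSV: the first column holds cell IDs,
# the remaining columns hold per-gene values.
csv_text = """cell_id,GENE_A,GENE_B
cell_1,0.0,1.5
cell_2,2.0,0.0
cell_3,0.0,0.0
"""

# index_col=0 tells pandas to use the first column as the row index
# instead of treating it as a regular data column.
toy_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(toy_df.shape)           # (3, 2): the index is not counted as a column
print(toy_df.index.tolist())  # ['cell_1', 'cell_2', 'cell_3']
print(toy_df.head(2))         # first two rows only
```

Because the cell IDs live in the index, `toy_df.values` contains only the numeric values, which is what makes the later conversion to a NumPy array straightforward.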


This next CSV contains cell type annotations:

In [7]:
cell_type_df = pd.read_csv(join("raw_data", "habib17.cell_type_annotations.csv"), index_col=0)
cell_type_df.head()

| index | CellType |
| --- | --- |
| hHP1_ACTCAATAGCAA-habib17 | exCA1 |
| hHP1_TTCCCGTTAAAG-habib17 | exCA3 |
| hHP1_GTCATTGAATCA-habib17 | ASC1 |
| hHP1_CACCTTCAATAC-habib17 | exCA1 |
| hHP1_ATACATGTTGTC-habib17 | exCA3 |


We can add a new column called `CoarseCellType` to construct a cell type hierarchy. For example, we map `GABA1` and `GABA2` to the coarser `GABA` annotation, and `ASC1` and `ASC2` to the coarser `ASC` annotation.

In [20]:
# Apply a function to every row of the pandas Series for the "CellType" column.
# The function returns the "coarse" value corresponding to each "fine" cell type value.
cell_type_df["CoarseCellType"] = cell_type_df["CellType"].apply(lambda fine_cell_type: (
    "GABA" if fine_cell_type.startswith("GABA") else (
        "ASC" if fine_cell_type.startswith("ASC") else fine_cell_type
    )
))
cell_type_df.head()

| index | CellType | CoarseCellType |
| --- | --- | --- |
| hHP1_ACTCAATAGCAA-habib17 | exCA1 | exCA1 |
| hHP1_TTCCCGTTAAAG-habib17 | exCA3 | exCA3 |
| hHP1_GTCATTGAATCA-habib17 | ASC1 | ASC |
| hHP1_CACCTTCAATAC-habib17 | exCA1 | exCA1 |
| hHP1_ATACATGTTGTC-habib17 | exCA3 | exCA3 |
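
If more coarse groups are needed, the nested conditional can become hard to read. One possible alternative is sketched here: a small helper that loops over a tuple of prefixes. The helper name `to_coarse` and the constant `COARSE_PREFIXES` are hypothetical names introduced for illustration, and the input values are a made-up sample rather than the real annotation column.

```python
import pandas as pd

# Prefixes to collapse into a coarse label (from the GABA/ASC example in the text).
COARSE_PREFIXES = ("GABA", "ASC")


def to_coarse(fine_cell_type, prefixes=COARSE_PREFIXES):
    """Return the first matching prefix, or the input unchanged if none matches."""
    for prefix in prefixes:
        if fine_cell_type.startswith(prefix):
            return prefix
    return fine_cell_type


# Made-up sample of fine cell type labels.
fine = pd.Series(["exCA1", "GABA2", "ASC1", "exCA3"], name="CellType")

# Series.map applies the function to each element.
coarse = fine.map(to_coarse)
print(coarse.tolist())  # ['exCA1', 'GABA', 'ASC', 'exCA3']
```

Adding another coarse group then only requires extending the tuple of prefixes.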


This third CSV contains a 2-dimensional UMAP dimensionality reduction that was computed on the gene expression matrix. Note that Vitessce loads pre-processed dimensionality reduction coordinates, and does not perform any dimensionality reduction "on-the-fly".

In [9]:
umap_df = pd.read_csv(join("raw_data", "habib17.umap.csv"), index_col=0)
umap_df.head()

| index | UMAP_1 | UMAP_2 |
| --- | --- | --- |
| hHP1_ACTCAATAGCAA-habib17 | 3.140266 | -7.16688 |
| hHP1_TTCCCGTTAAAG-habib17 | -3.105793 | -3.203529 |
| hHP1_GTCATTGAATCA-habib17 | 6.181531 | 3.414144 |
| hHP1_CACCTTCAATAC-habib17 | 2.862645 | -7.548567 |
| hHP1_ATACATGTTGTC-habib17 | -4.022884 | -4.216279 |


## Instantiate a new AnnData object

While Zarr is an efficient format for storing multidimensional arrays, it does not dictate how multiple individual arrays are organized in a larger data structure. [AnnData](https://anndata.readthedocs.io/en/latest/) fills this gap by defining a data structure for observation-by-feature matrices and many types of associated metadata. This works nicely for the single-cell transcriptomics use case: think of cells as observations (rows) and genes as features (columns).

In this example, we are going to use the following fields of the AnnData object:
- `X`: the observation-by-feature (i.e., cell-by-gene) expression matrix, stored as a 2D NumPy array
- `obs`: a pandas DataFrame whose rows match the rows of the `X` matrix (same number and ordering of rows in `obs` as rows in `X`)
- `var`: a pandas DataFrame whose rows match the _columns_ of the `X` matrix (same number and ordering of rows in `var` as columns in `X`)
- `obsm`: a Python `dict`:
    - keys are strings, with the convention to begin with the prefix `X_` (e.g., `X_umap` to store an array of UMAP coordinates)
    - values are multidimensional NumPy arrays where the rows (i.e., elements of the zeroth dimension) match the rows of the `X` matrix
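
The shape relationships in the list above can be written down as explicit checks. The following is a minimal sketch using only NumPy and pandas; `check_anndata_inputs` is a hypothetical helper introduced for illustration and is not part of the `anndata` package, and the toy inputs are made up.

```python
import numpy as np
import pandas as pd


def check_anndata_inputs(X, obs, var, obsm):
    """Raise ValueError if the inputs break the shape constraints described above."""
    n_obs, n_vars = X.shape
    if len(obs) != n_obs:
        raise ValueError(f"obs has {len(obs)} rows, but X has {n_obs} rows")
    if len(var) != n_vars:
        raise ValueError(f"var has {len(var)} rows, but X has {n_vars} columns")
    for key, arr in obsm.items():
        if arr.shape[0] != n_obs:
            raise ValueError(f"obsm[{key!r}] has {arr.shape[0]} rows, but X has {n_obs} rows")
    return n_obs, n_vars


# Toy inputs with made-up values: 4 cells (observations) and 3 genes (features).
toy_X = np.zeros((4, 3))
toy_obs = pd.DataFrame(
    {"CellType": ["a", "b", "a", "c"]},
    index=[f"cell_{i}" for i in range(4)],
)
toy_var = pd.DataFrame(index=["g1", "g2", "g3"])
toy_obsm = {"X_umap": np.zeros((4, 2))}

print(check_anndata_inputs(toy_X, toy_obs, toy_var, toy_obsm))  # (4, 3)
```

Note that these checks cover only the number of rows and columns; they say nothing about whether the rows are in the same order.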


<img width="300" src="https://anndata.readthedocs.io/en/latest/_images/anndata_schema.svg"/>

In [10]:
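# Per-cell annotations: one row per cell, in the same order as the rows of X.
obs = cell_type_df
# Per-gene metadata: an empty DataFrame whose index holds the gene names.
var = pd.DataFrame(data=[], index=matrix_df.columns.values.tolist(), columns=[])
# Expression values as a 2D NumPy array (cells x genes).
X = matrix_df.values
# Cell-aligned arrays, here the 2D UMAP coordinates.
obsm = {"X_umap": umap_df.values}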
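
Because `.values` returns a plain NumPy array without row labels, the code above relies on the CSV files listing cells in the same order. A hedged sketch of how that assumption could be checked, and repaired if needed, is shown here with made-up toy DataFrames (the `toy_` names are hypothetical and stand in for `matrix_df` and `umap_df`).

```python
import pandas as pd

# Toy frames with made-up values; the UMAP rows are deliberately in a different order.
toy_matrix_df = pd.DataFrame({"g1": [0.0, 1.0, 2.0]}, index=["c1", "c2", "c3"])
toy_umap_df = pd.DataFrame(
    {"UMAP_1": [3.0, 1.0, 2.0], "UMAP_2": [0.3, 0.1, 0.2]},
    index=["c3", "c1", "c2"],
)

# Index.equals is True only when both indexes hold the same labels in the same order.
print(toy_umap_df.index.equals(toy_matrix_df.index))  # False

# .loc with the matrix index reorders the UMAP rows to follow the matrix rows.
# It raises a KeyError if any label is missing from toy_umap_df.
aligned_umap_df = toy_umap_df.loc[toy_matrix_df.index]
print(aligned_umap_df.index.equals(toy_matrix_df.index))  # True
print(aligned_umap_df["UMAP_1"].tolist())  # [1.0, 2.0, 3.0]
```

The same check applies to the cell type annotations before they are used as `obs`.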

In [15]:
X.shape # (number of rows, number of cols)

(13067, 5782)

In [16]:
obs.shape

(13067, 2)

In [17]:
var.shape

(5782, 0)

In [24]:
# Use the AnnData constructor to instantiate a new object.
adata = AnnData(X=X, obs=obs, var=var, obsm=obsm)
adata


AnnData object with n_obs × n_vars = 13067 × 5782
    obs: 'CellType', 'CoarseCellType'
    obsm: 'X_umap'

We can use the `optimize_adata` function from the `vitessce` Python package to optimize the performance of any AnnData object and prepare it for usage with Vitessce. This function discards unused fields of the object and casts numerical data types to smaller types (when the numerical values are not changed by the operation).

In [25]:
adata = optimize_adata(
    adata,
    obs_cols=["CoarseCellType", "CellType"],
    obsm_keys=["X_umap"],
    optimize_X=True,
)
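
The idea of casting to a smaller type only when the values are unchanged can be illustrated with NumPy alone. This is a minimal sketch of the general technique, not the implementation used by `optimize_adata`; the helper name `downcast_if_lossless` and the list of candidate dtypes are assumptions made for illustration.

```python
import numpy as np


def downcast_if_lossless(arr, candidates=(np.uint8, np.uint16, np.float32)):
    """Return arr cast to the first candidate dtype that preserves every value."""
    for dtype in candidates:
        # Out-of-range casts may emit warnings; they are rejected by the check below.
        with np.errstate(all="ignore"):
            casted = arr.astype(dtype)
        # Keep the cast only if a round-trip comparison shows no value changed.
        if np.array_equal(casted, arr):
            return casted
    # No candidate was lossless, so keep the original array.
    return arr


# Made-up examples (all start as float64).
counts = np.array([[0.0, 3.0], [255.0, 1.0]])  # small whole numbers
halves = np.array([0.5, 1.25])                 # exactly representable in float32
tenths = np.array([0.1, 0.2])                  # not exactly representable in float32

print(downcast_if_lossless(counts).dtype)  # uint8
print(downcast_if_lossless(halves).dtype)  # float32
print(downcast_if_lossless(tenths).dtype)  # float64
```

A `uint8` array needs one byte per value instead of the eight used by `float64`, which is why this kind of cast can shrink a count matrix considerably.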

## Save the AnnData object to a Zarr store ("AnnData-Zarr")

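To finish, we save the `AnnData` object to a Zarr store using the `write_zarr` method. The `chunks` argument sets the chunk shape for the `X` matrix: here each chunk spans all rows (cells) and `VAR_CHUNK_SIZE` columns (genes).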

In [26]:
adata.write_zarr(join("processed_data", "habib17.zarr"), chunks=(adata.shape[0], VAR_CHUNK_SIZE))
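
With a chunk shape of `(n_obs, chunk_width)`, all values for a single gene fall into exactly one chunk, so only that chunk needs to be fetched to display the gene. The arithmetic can be sketched in plain Python. The helper name `chunk_for_gene` is hypothetical, and the chunk width of 10 is an assumed value chosen for illustration; it is not claimed to be the actual value of `VAR_CHUNK_SIZE`.

```python
import math


def chunk_for_gene(gene_index, n_obs, n_vars, var_chunk_size):
    """Describe the chunk holding one gene's column when chunks=(n_obs, var_chunk_size)."""
    # Number of chunks along the gene (column) axis; the last one may be partial.
    n_var_chunks = math.ceil(n_vars / var_chunk_size)
    # Integer division gives the chunk position along the gene axis.
    chunk_col = gene_index // var_chunk_size
    col_start = chunk_col * var_chunk_size
    col_stop = min(col_start + var_chunk_size, n_vars)
    return {
        "chunk_position": (0, chunk_col),  # only one chunk along the cell axis
        "rows": (0, n_obs),                # half-open range of rows in the chunk
        "columns": (col_start, col_stop),  # half-open range of columns in the chunk
        "n_var_chunks": n_var_chunks,
    }


# Matrix shape taken from the notebook output above; chunk width is an assumption.
info = chunk_for_gene(gene_index=1234, n_obs=13067, n_vars=5782, var_chunk_size=10)
print(info["chunk_position"])  # (0, 123)
print(info["columns"])         # (1230, 1240)
print(info["n_var_chunks"])    # 579
```

Smaller chunk widths mean less data transferred per gene but more files in the store, so the width is a trade-off rather than a value to minimize.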