# Cell Types Database

![651806289 MinIP](http://reconstrue.com/projects/brightfield_neurons/demo_images/651806289_minip_cubehelix_wide.png)

[The Cell Types Database](http://celltypes.brain-map.org/) is one of the basic data products produced by The Allen Institute. They are constructing an atlas of the types of cells found in the brains of mice and humans. For a primer with contextual history, state of the art, and goals related to the Cell Types Database, see the Allen Institute's 36-minute talk on the subject, presented at their 2019 Showcase symposium: [Cell Types: From Data to Taxonomy to Product](https://www.youtube.com/watch?v=NCTq7GHakqg).


There are multiple ways the cells are represented in the database: electrophysiology spike train recordings, transcriptomics, simulation models (GLIF or perisomatic), etc. Of particular interest for this project is the morphology data – the skeletons in the `*.swc` files.
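
For readers new to the format: an SWC file is plain text with one skeleton node per line and seven whitespace-separated fields (node id, type code, x, y, z, radius, parent id). Lines starting with `#` are comments, and a parent id of -1 marks the root. Below is a minimal parsing sketch under that assumption. The `SwcNode` type, the `parse_swc` helper, and the three-node sample are made up for illustration; they are not Allen data or Allen SDK code.

```python
from typing import List, NamedTuple


class SwcNode(NamedTuple):
    """One sample point of a skeleton (hypothetical helper type)."""
    node_id: int
    type_code: int
    x: float
    y: float
    z: float
    radius: float
    parent_id: int  # -1 marks the root node


def parse_swc(text: str) -> List[SwcNode]:
    """Parse SWC text into a list of nodes, skipping comments and blank lines."""
    nodes = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        n, t, x, y, z, r, p = line.split()[:7]
        nodes.append(SwcNode(int(n), int(t), float(x), float(y), float(z), float(r), int(p)))
    return nodes


# Made-up three-node skeleton: a soma with one short dendrite
sample_swc = """# id type x y z radius parent
1 1 0.0 0.0 0.0 5.0 -1
2 3 4.0 0.0 0.0 0.5 1
3 3 8.0 1.0 0.0 0.4 2
"""

nodes = parse_swc(sample_swc)
roots = [n.node_id for n in nodes if n.parent_id == -1]
print(len(nodes), "nodes; root ids:", roots)
```

The parent ids are what turn a flat list of points into a tree, which is the "skeleton" referred to throughout this notebook.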

The Allen has [about 500 SWC files for mouse neurons](http://celltypes.brain-map.org/data?donor__species=Mus%20musculus&nr__reconstruction_type=[full,dendrite-only]). Those ~500 are inside the red circle in the following Venn diagram of all mouse neurons in the Cell Types DB.

![](http://reconstrue.com/projects/brightfield_neurons/demo_images/brain_map_venn.png)

The main problem from The Allen's perspective is that they would like the red circle to be as big as the main outer circle. It takes many hours to manually trace skeletons, and The Allen processes hundreds of such cells a year. This is a serious manual labor bottleneck.

It would seem that this sort of task is a perfect candidate for automation via deep learning, CNNs, and related recent advances in computer vision. Unfortunately, this is proving to be nontrivial, although algorithmic improvements in the last few years are starting to make larger projects feasible.


## Data access options

The data can be accessed in three ways:
- Web UI at brain-map.org
- RMA, the Allen's RESTful HTTP API
- Programmatically via the Allen SDK (Python)

This is a Jupyter notebook of Python code so the Allen SDK is the natural way to go about accessing the data. Sometimes allensdk does not have an existing method to access some bit of data, in which case RMA is the fallback.


A nice feature of the Allen Institute's set-up is that they do not require *any* auth to get to the public data in any of the three methods listed.


### brain-map.org web UI

[brain-map.org](http://brain-map.org) is where the target images reside. The repository has a web UI, wherein the image stack can be viewed. Here is an example from their documentation [[*](http://help.brain-map.org/display/celltypes/Physiology+and+Morphology)]:

>displays two orthogonal projections of the biocytin filled neuron and the neuron's 3D morphology reconstruction. From this page, you can also view the stack of high resolution images used for the reconstruction.

![](http://help.brain-map.org/download/attachments/8323624/MorphBrowse.PNG?version=1&modificationDate=1476664307214&api=v2)

So, we can explore the web UI to preview what the images look like but we want to download them via Python code:

> You can also access the data programatically and obtain sample code to run your own model simulations. For more details go to the Download page. 



### RESTful RMA

The second way to access the data is through the RESTful "RMA" interface. This is the core way to get **all** data.


### Allen SDK

The Allen Institute first came up with a RESTful interface to their resources, called [RMA](http://help.brain-map.org/pages/viewpage.action?pageId=5308449). RMA is a [HATEOAS](https://restfulapi.net/hateoas/) style RESTful API. Later they added the Python SDK as client-side convenience wrapper code around the RMA.

The `allensdk` is Python code which provides a programmatic interface to the info available via RMA. It also maintains a cache of files for performance purposes (`allensdk.core.cell_types_cache.CellTypesCache`).

Although `allensdk` can provide metadata about cells in the repository, it does not have methods to acquire the raw image stack. To get the raw images, RMA is the only option. So, `allensdk` can provide IDs of available cells, but further work is required to then iterate through the stack and grab each file.


## Explore RESTful RMA

Their documentation includes [example URLs for fetching data](http://help.brain-map.org/display/celltypes/API#API-morphology_image_download). Let's express those in a Jupyter notebook.

json_query_url = "http://api.brain-map.org/api/v2/data/query.json?criteria=model::ProjectionImage,rma::criteria,[specimen_id$eq313862022]"




### Set up

#### Installations

As discussed in the previous notebook, [allensdk.ipynb](http://reconstrue.com/data_sources/allen_institute/allensdk_on_colab.html), the Allen SDK needs to be installed.

In [6]:
# Simple install way unfortunately causes version fights about pandas
#!pip3 -q install allensdk
#!pip3 show pandas

# This way has a tweaked requirements.txt, avoiding error messages
# Note: GitHub no longer serves the unauthenticated git:// protocol, so use https
!pip3 install -q git+https://github.com/reconstrue/AllenSDK

  Building wheel for allensdk (setup.py) ... done


### Images into markdown

First up would be simply an image in markdown, using their sample URL:
> [`http://api.brain-map.org/api/v2/section_image_download/323637357`](http://api.brain-map.org/api/v2/section_image_download/323637357)

That image is a JPEG, less than 2 MB, ~5k x ~7k pixels. According to the docs:
>images were first stitched from tiles in Tiff format, white balanced and finally converted to JPEG 2000 file format. Aperio ScanScope images were first converted to JPEG 2000 format, then orientation adjusted and white balanced. In either case, the final products were images in JPEG 2000 compressed format for further pipeline processing and analysis.


The square grid is an artifact of the image tile stitching algorithm that has been used to assemble this whole slide image:

<img src="http://api.brain-map.org/api/v2/section_image_download/323637357" height='450px' />

### Images into Python



#### URL encode queries
First thing to note: some of the examples they give need to be URL encoded. Without encoding, results will still come back, but they will not be the expected ones. For example, the docs provide this URL:
```
http://api.brain-map.org/api/v2/data/query.json?criteria=model::ProjectionImage,rma::criteria,[specimen_id$eq313862022]
```
That will return some JSON, but the results are not restricted to specimen_id 313862022.

But URL encode it and it works as expected:
```
http://api.brain-map.org/api/v2/data/query.json?criteria=model%3A%3AProjectionImage%2Crma%3A%3Acriteria%2C%5Bspecimen_id%24eq313862022%5D
```
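
The encoding step can be checked without touching the network. This is a small sketch using only the standard library; the variable names are ours, and the criteria string is the one from the docs example above.

```python
import urllib.parse

RMA_JSON_ROOT = "http://api.brain-map.org/api/v2/data/query.json?criteria="

# Criteria string copied from the Allen documentation example
criteria = "model::ProjectionImage,rma::criteria,[specimen_id$eq313862022]"

# quote() percent-encodes ':', ',', '[', ']' and '$';
# letters, digits and '_' are left untouched
encoded_criteria = urllib.parse.quote(criteria)
encoded_url = RMA_JSON_ROOT + encoded_criteria
print(encoded_url)

# Decoding gives back the original criteria, so nothing is lost
round_trip = urllib.parse.unquote(encoded_criteria)
print(round_trip == criteria)
```

Only the criteria part is encoded; the `?criteria=` prefix has to stay as-is so the server still sees a normal query string.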


In [0]:
# Download an image file

#an_img_url = "http://api.brain-map.org/api/v2/section_image_download/321549675"
an_img_url = "http://api.brain-map.org/api/v2/section_image_download/323637357"
an_img_file_name = "/content/an_image"
!wget --no-verbose --progress=bar:force:noscroll -O {an_img_file_name} {an_img_url} 

In [0]:
# Get stats on image file just downloaded
import imghdr  # standard library; deprecated in Python 3.11, removed in 3.13

print(f"Detected image file type: {imghdr.what(an_img_file_name)}")
!echo -----------
!ls -lh {an_img_file_name}

In [0]:
!ffprobe {an_img_file_name}

In [0]:
import PIL.Image

slide_img = PIL.Image.open(an_img_file_name)
display(slide_img)


### Request styles

Seemingly when querying, data can be requested in multiple formats: XML, JSON, and CSV.
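
A sketch of how the three request styles could be generated from one criteria string. The `rma_query_url` helper is hypothetical (it is not part of the Allen SDK), and it assumes all three formats follow the same `query.<format>` URL pattern that the XML and JSON examples from the docs use.

```python
import urllib.parse

RMA_QUERY_BASE = "http://api.brain-map.org/api/v2/data/query"


def rma_query_url(criteria: str, fmt: str = "json") -> str:
    """Build an RMA query URL for the given response format (hypothetical helper)."""
    if fmt not in ("xml", "json", "csv"):
        raise ValueError(f"unsupported format: {fmt}")
    # Only the criteria are percent-encoded; the endpoint prefix stays readable
    return f"{RMA_QUERY_BASE}.{fmt}?criteria={urllib.parse.quote(criteria)}"


criteria = "model::ProjectionImage,rma::criteria,[specimen_id$eq313862022]"
for fmt in ("xml", "json", "csv"):
    print(rma_query_url(criteria, fmt))
```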


### As XML

In [0]:
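import urllib.parse
# URL encode the criteria so the specimen filter is applied as written
xml_criteria = urllib.parse.quote("model::ProjectionImage,rma::criteria,[specimen_id$eq313862022]")
xml_request_url = "http://api.brain-map.org/api/v2/data/query.xml?criteria=" + xml_criteria
xml_file_name = "response.xml"
!wget -O {xml_file_name} {xml_request_url}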

In [0]:
!cat {xml_file_name}

That's nice: for each image stack, they provide both MaximumIntensityProjection and MinimumIntensityProjection from both the frontal view plane (xy) and one of the two side views (yz plane).

### As JSON

This is exactly the same as the above XML response, except in the requested URL `/query.xml?` is changed to `/query.json?` 

In [0]:
import urllib.parse

json_file_name = "/content/response.json"

query_url_root = 'http://api.brain-map.org/api/v2/data/query.json?criteria='
query_encoded = urllib.parse.quote('model::ProjectionImage,rma::criteria,[specimen_id$eq313862022]')
json_query_url = query_url_root + query_encoded

In [0]:
!wget -O {json_file_name} {json_query_url}

Cell Types DB docs say [[*](http://help.brain-map.org/display/celltypes/API#API-download_swc)]:
> The API provides programmatic access to the microscopy images used for reconstruction, axis-oriented projections of those images, and morphological reconstructions. A cell can have up to four axis-oriented projections of the images used for reconstruction:
> - XY minimum intensity projection
> - YZ minimum intensity projection
> - XY maximum intensity projection
> - YZ maximum intensity projection
>
> The reconstruction images display a dark, biocytin-filled cell on a light background. The maximum intensity projections are constructed from inverted and contrast-enhanced versions of the morphology images, resulting in a light cell on a dark background.


In [0]:
import json

with open(json_file_name) as f:
  eg_data = json.load(f)

print(json.dumps(eg_data, indent=2))

So, the above is saying that for specimen_id `313862022` there are 4 images available:
- 2 Min intensity projections (XY plane and YZ plane)
- 2 Max intensity projections (XY plane and YZ plane)

For brightfield, the more natural projection is minimum, not maximum, and the XY projection is the more interesting of the two-part mugshot projections. So, we want the one with:
```
"image_type": "MinimumIntensityProjection - xy"
```
Or equivalently: `projection_function` == min and `axes` == xy.

Finally, for a download URL, grab that projection's `id` and stick it on the end of the following:
```
http://api.brain-map.org/api/v2/section_image_download/
```

For example,
```
http://api.brain-map.org/api/v2/section_image_download/323637357
```

Which, in markdown at width=200px looks like:

<img src="http://api.brain-map.org/api/v2/section_image_download/323637357" width="200px" />
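
The selection and URL-building steps above can be sketched as a small function over the rows of the `msg` array. The `minip_xy_url` helper and the sample rows are hypothetical. Only the `image_type` string for the XY minimum projection and the id `323637357` appear in the text above; pairing that id with the XY minimum projection is an assumption based on the example URL, and the other `image_type` strings and ids are placeholders.

```python
from typing import Optional

DOWNLOAD_ROOT = "http://api.brain-map.org/api/v2/section_image_download/"


def minip_xy_url(rows: list) -> Optional[str]:
    """Return the download URL of the XY minimum intensity projection, if any."""
    for row in rows:
        if row.get("image_type") == "MinimumIntensityProjection - xy":
            return f"{DOWNLOAD_ROOT}{row['id']}"
    return None


# Hypothetical rows shaped like the "msg" array of the JSON response.
# Ids 1, 2 and 3 are placeholders, not real image ids.
sample_rows = [
    {"id": 1, "image_type": "MaximumIntensityProjection - xy"},
    {"id": 2, "image_type": "MaximumIntensityProjection - yz"},
    {"id": 323637357, "image_type": "MinimumIntensityProjection - xy"},
    {"id": 3, "image_type": "MinimumIntensityProjection - yz"},
]

print(minip_xy_url(sample_rows))
```

Returning `None` when no row matches keeps the caller in charge of what to do with cells that lack a projection.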




In [0]:
minips_collection_url = "http://api.brain-map.org/api/v2/section_image_download/"

#### Query full image stack

http://help.brain-map.org/display/celltypes/API:
>Find all images used for reconstruction for a layer 4 spiny cell (Specimen 313862022)
```
http://api.brain-map.org/api/v2/data/query.xml?criteria=
model::SubImage
,rma::criteria,data_set[specimen_id$eq313862022]
```

In [0]:
json_file_name = "/content/response.json"

cell_id = 313862022  # Specimen ID from the documentation example above
query_terms = f'model::SubImage,rma::criteria,data_set[specimen_id$eq{cell_id}]'

query_url_root = 'http://api.brain-map.org/api/v2/data/query.json?criteria='
query_encoded = urllib.parse.quote(query_terms)
json_query_url = query_url_root + query_encoded

!wget -O {json_file_name} {json_query_url}

with open(json_file_name) as f:
  eg_data = json.load(f)

print(json.dumps(eg_data, indent=2))

Seems they all have `data_set_id`: 321549626 so that is the ID of the image stack? Or is the stack only part of it?

### RmaApi

The above sort of hand-rolled query construction gets old quickly and is error-prone. So, the Allen folks have Python utility classes for such tasks: [RMA Database and Service API](http://alleninstitute.github.io/AllenSDK/data_api_client.html).


### ImageDownloadApi

http://alleninstitute.github.io/AllenSDK/allensdk.api.queries.image_download_api.html#allensdk.api.queries.image_download_api.ImageDownloadApi

Note:
>By default, an unfiltered full-sized image with the highest quality is returned as a download if no parameters are provided.

### To Pandas

The JSON is shaped like this:
```json
{
  "success": true,
  "id": 0,
  "start_row": 0,
  "num_rows": 50,
  "total_rows": 2686,
  "msg": [
    {
```
There is some pagination going on via `start_row`, `num_rows`, and `total_rows`.
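
To fetch all 2686 rows rather than the first 50, the query would have to be repeated with increasing `start_row` values. Below is a sketch of the bookkeeping only, with no network calls. The helper names are ours, and the `rma::options[...]` paging syntax is an assumption that should be checked against the RMA documentation before relying on it.

```python
import urllib.parse

QUERY_ROOT = "http://api.brain-map.org/api/v2/data/query.json?criteria="


def page_starts(total_rows: int, num_rows: int) -> list:
    """Return the start_row value of every page needed to cover total_rows."""
    return list(range(0, total_rows, num_rows))


def paged_query_url(criteria: str, start_row: int, num_rows: int) -> str:
    """Append paging options to an RMA criteria string and URL encode the result."""
    # Assumption: RMA accepts paging as rma::options[start_row$eq..][num_rows$eq..]
    options = f",rma::options[start_row$eq{start_row}][num_rows$eq{num_rows}]"
    return QUERY_ROOT + urllib.parse.quote(criteria + options)


# Numbers taken from the response header shown above
starts = page_starts(total_rows=2686, num_rows=50)
print(len(starts), "pages; last page starts at row", starts[-1])

criteria = "model::SubImage,rma::criteria,data_set[specimen_id$eq313862022]"
print(paged_query_url(criteria, starts[1], 50))
```

With 50 rows per page, 2686 rows need 54 requests, the last one starting at row 2650.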

The `msg` array is what we want to feed to Pandas. Here's a hacky, lazy way to perform that task via `pandas.read_json`.



In [0]:
rows_json = eg_data["msg"]

# Write to FS
processed_json_file_name = "/content/query_trimmed.json"
with open(processed_json_file_name, 'w') as json_dest_file:
  json.dump(rows_json, json_dest_file) 

# Test file just written to
with open(processed_json_file_name) as f:
  test_data = json.load(f)

print(json.dumps(test_data, indent=2))



In [0]:
import pandas as pd
from google.colab.data_table import DataTable  # Colab-only interactive table

# specimen_id is the id of the cell imaged, id is the id of an image
query_df = pd.read_json(processed_json_file_name)
DataTable(query_df.sort_values(by=['id'])) # Adds filtering UI
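
A less roundabout alternative to the file round trip is to hand the list of row dicts straight to the `pandas.DataFrame` constructor. Here is a sketch using a hypothetical stand-in for the parsed response; the two rows and their `id` values are placeholders, not real query results.

```python
import pandas as pd

# Hypothetical stand-in for the parsed JSON response; ids are placeholders
sample_response = {
    "success": True,
    "num_rows": 2,
    "msg": [
        {"id": 20, "specimen_id": 313862022},
        {"id": 10, "specimen_id": 313862022},
    ],
}

# A list of dicts maps directly to rows, with the dict keys as column names
sample_df = pd.DataFrame(sample_response["msg"]).sort_values(by=["id"])
print(sample_df)
```

This skips writing and re-reading `/content/query_trimmed.json`, though keeping the file may still be handy for inspecting the raw rows later.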

### Count mouse SWC files

The Allen Institute's brain-map.org has data on many neurons, not all of which can/should be used to train ML models. The AllenSDK can be used to query for a count of the number of mouse neuron cells that have both brightfield image stacks and SWC skeleton files.


- [cell_types_cache Python source code](https://alleninstitute.github.io/AllenSDK/_modules/allensdk/core/cell_types_cache.html)
- [Example Python usage](https://allensdk.readthedocs.io/en/latest/_static/examples/nb/cell_types.html#Cell-Morphology-Reconstructions)




In [0]:
from allensdk.core.cell_types_cache import CellTypesCache
from allensdk.api.queries.cell_types_api import CellTypesApi
import pprint
pp = pprint.PrettyPrinter(indent=4)

ctc = CellTypesCache(manifest_file='cell_types/manifest.json')

cells = ctc.get_cells(require_reconstruction=True, require_morphology=True, species=[CellTypesApi.MOUSE])
print('Number of mouse cells with images and SWC files: %i' % len(cells))


pp.pprint(cells[0])

In [0]:
# TODO: junk
# this saves the NWB file to 'cell_types/specimen_464212183/ephys.nwb'
# cell_specimen_id = 464212183
# data_set = ctc.get_ephys_data(cell_specimen_id)

In [0]:
!ls cell_types

In [0]:
cells_df = pd.DataFrame(cells).sort_values(by=['id'], ascending=False)
DataTable(cells_df)

## Brightfield training data

From a model training perspective, the skeleton in an SWC file can be seen as the "labels" for "the labeled training data." For training purposes, we're only interested in the subset of cells in the Cell Types Database that have both a skeleton and a microscopy image stack. The image stack is the input to the model to be built, and the SWC file is the output. Each SWC file represents many hours of manual labor by trained specialists reviewing and editing it.


In [4]:
# Query the Cell Types DB for files with skeletons a.k.a. reconstructions

# via https://allensdk.readthedocs.io/en/latest/cell_types.html#cell-types-cache
from allensdk.core.cell_types_cache import CellTypesCache

ctc = CellTypesCache(manifest_file='cell_types/manifest.json')

# a list of cell metadata for cells with reconstructions, download if necessary
cells = ctc.get_cells(require_reconstruction=True)
print('Number of cells with SWC files: %i' % len(cells))

Number of cells with SWC files: 637


Some of those are human cells, in addition to the roughly 500 mouse cells. Human brains are much bigger than mouse brains. Training should focus on one species. The Allen has many more mouse neurons than human neurons. So, train on mouse neurons only. To query for "mouse neurons only" via the SDK, ask for `species` == `CellTypesApi.MOUSE`.


In [5]:
from allensdk.api.queries.cell_types_api import CellTypesApi

# We want mouse cells that have images and skeletons, both.
# Former is data; latter is training labels a.k.a. gold standards.
cells = ctc.get_cells(require_reconstruction=True, require_morphology=True, species=[CellTypesApi.MOUSE])
print('Number of mouse cells with images and SWC files: %i' % len(cells))


Number of mouse cells with images and SWC files: 485


So, The Allen's Cell Types Database can be used as a training dataset consisting of about 500 samples.

## References
[Cell Types DB Physiology and Morphology whitepaper](http://help.brain-map.org/display/celltypes/Physiology+and+Morphology)

[cell_types_cache docs](https://allensdk.readthedocs.io/en/latest/allensdk.core.cell_types_cache.html#allensdk.core.cell_types_cache.CellTypesCache.get_cells)
