# Accessing the fly connectome dataset with CAVE

This tutorial provides a high-level overview of how to access FlyWire's dataset through CAVE. CAVE is the [connectome annotation versioning engine](https://www.biorxiv.org/content/10.1101/2023.07.26.550598v1.abstract), a service infrastructure for managing connectomics datasets that is hosted in the cloud for broad access. CAVE supports proofreading of datasets and their analysis even while proofreading is ongoing.

# CAVEclient and setup

The CAVEclient is a Python library that facilitates communication with a CAVE system. It can be installed with

`pip install caveclient`

and imported like so:

In [1]:
import caveclient

## CAVE account setup

Every user needs to create a CAVE account and download a user token to access CAVE's services programmatically. FlyWire's data is publicly available, which means that a new user account needs no extra permissions to access the data.

A Google account (or Google-enabled account) is required to create a CAVE account.

#### Start here if you do not have a CAVE account or are not sure

Log in to CAVE to set up a new account. To do this, go to this [website](https://prod.flywire-daf.com/materialize/views/datastack/flywire_fafb_public).

#### Once you have an account: Set up your token

Create a new token by running the next cell. Then copy the token and insert it into the argument of the following cell. Rerun these two cells together to make sure that the correct token is stored on your machine. You can copy your token and store it on as many machines as you like. If you think your token has been compromised, reset it by rerunning the cell below to create a new one.

In [3]:
client = caveclient.CAVEclient()
client.auth.setup_token(make_new=True)

New Tokens need to be acquired by hand. Please follow the following steps:
                1) Go to: https://global.daf-apis.com/auth/api/v1/create_token to create a new token.
                2) Log in with your Google credentials and copy the token shown afterward.
                3a) Save it to your computer with: client.auth.save_token(token="PASTE_YOUR_TOKEN_HERE")
                or
                3b) Set it for the current session only with client.auth.token = "PASTE_YOUR_TOKEN_HERE"
                Note: If you need to save or load multiple tokens, please read the documentation for details.


In [5]:
client.auth.save_token(token="your token goes here", overwrite=True)

## Datastacks

Datasets in CAVE are organized as datastacks. These are a combination of an EM dataset, a segmentation and a set of annotations. The datastack for FlyWire's public release is `flywire_fafb_public`. When you instantiate your client with this datastack, it loads all relevant information to access it.

In [8]:
datastack_name = "flywire_fafb_public"
client = caveclient.CAVEclient(datastack_name)

## Materialization versions

Data in CAVE is timestamped and periodically versioned: each (materialization) version corresponds to a specific timestamp. Individual versions are made publicly available. The materialization service provides annotation queries to the dataset. It is available under `client.materialize`.

Currently the following versions are publicly available:

In [9]:
client.materialize.get_versions()

[630]

And these are their associated timestamps:

In [11]:
for version in client.materialize.get_versions():
    print(f"Version {version}: {client.materialize.get_timestamp(version)}")

Version 630: 2023-03-21 08:10:01.194185+00:00


The client will automatically query the latest materialization version. You can specify a `materialization_version` for every query if you want to access a specific version.
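
Pinning a version can be sketched as a small wrapper that forwards an explicit `materialization_version` to `query_table`. The helper name `query_at_version` is hypothetical and not part of CAVEclient; it assumes `client` was instantiated with a datastack as shown above.

```python
def query_at_version(client, table_name, version, **query_kwargs):
    """Query `table_name` at one explicit materialization version.

    Illustrative helper (not part of CAVEclient). It first checks that
    the requested version is among the available ones, then forwards
    the version to `query_table`.
    """
    available = client.materialize.get_versions()
    if version not in available:
        raise ValueError(
            f"Version {version} is not available; choose from {available}"
        )
    return client.materialize.query_table(
        table_name, materialization_version=version, **query_kwargs
    )
```

For example, `query_at_version(client, "proofread_neurons", 630)` requests the table at version 630. Pinning the version in analysis code keeps results reproducible once newer versions are released.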

# Querying the dataset

Let's have a look at which annotation tables are available:

In [12]:
client.materialize.get_tables()

['fly_synapses_neuropil_v2',
 'neuron_information_v2',
 'proofread_neurons',
 'nuclei_v1',
 'synapses_nt_v1',
 'hierarchical_neuron_annotations']

## Querying neurons

The `proofread_neurons` table contains all neurons that were released in a given version. The dataset contains many more segments that were either not proofread because they are small, or belong to non-neuronal cells or other structures such as trachea. Therefore, knowing the list of all segments that represent proofread neurons is useful. It can be queried in full:

In [30]:
proofread_neurons_df = client.materialize.query_table("proofread_neurons")
proofread_neurons_df

Unnamed: 0,id,valid,pt_supervoxel_id,pt_root_id,pt_position
0,1,t,78253067652813181,720575940628857210,"[443342, 203965, 157450]"
1,2,t,82053323167374089,720575940626838909,"[664417, 227538, 77011]"
2,3,t,81842766824648491,720575940626046919,"[653666, 259618, 110467]"
3,4,t,82405647924682371,720575940630311383,"[687457, 254763, 80194]"
4,5,t,82756942157675987,720575940633370649,"[707653, 222503, 145366]"
...,...,...,...,...,...
127973,127974,t,77198292605339502,720575940624334775,"[384508, 252336, 207800]"
127974,127975,t,83251928414696891,720575940625178772,"[733464, 363772, 112720]"
127975,127976,t,76706536096887135,720575940625567758,"[353200, 298616, 227080]"
127976,127977,t,75297580732428963,720575940641724661,"[273840, 204492, 246800]"


In the table above, each row is a proofread neuron. Segment IDs (aka neuron IDs) are called `root_ids` in CAVE. Each annotation is anchored to at least one point in space, and the segment found at that point provides the associated root ID; in this case `pt_root_id` and `pt_position` are the most relevant columns.

Positions in this table were chosen to lie in the backbone of each neuron. This was found to be the most robust location for identifying a neuron, as some neurons do not have cell bodies and cell bodies are not central to fly neurons. A separate table, created by [Mu et al.](https://www.biorxiv.org/content/10.1101/2021.11.04.467197v1.abstract), represents all _cell_ nuclei in the brain:

In [34]:
nuclei_df = client.materialize.query_table("nuclei_v1")
nuclei_df

Unnamed: 0,id,valid,volume,pt_supervoxel_id,pt_root_id,pt_position,bb_start_position,bb_end_position
0,7393349,t,26.141245,82827379285852979,720575940626838909,"[709888, 227744, 57160]","[708032, 226144, 54760]","[711904, 229440, 59280]"
1,7416439,t,11.523400,82827998029690530,720575940627484553,"[710592, 263392, 129800]","[708800, 262112, 128200]","[712064, 264832, 131080]"
2,7415038,t,32.895959,83038623024880664,720575940626046919,"[722528, 234656, 77000]","[720768, 232896, 74480]","[724224, 236576, 79360]"
3,7415013,t,53.711176,83038760463398837,720575940630311383,"[722912, 244032, 65200]","[720480, 242144, 62480]","[725600, 246208, 67760]"
4,7415848,t,9.280717,83038554439606237,720575940633370649,"[721984, 229792, 119560]","[720832, 228320, 118240]","[723232, 231264, 120840]"
...,...,...,...,...,...,...,...,...
143135,4389032,t,244.330332,79377386675948158,720575940629762043,"[511328, 113024, 48560]","[503744, 104128, 44600]","[520704, 122080, 53040]"
143136,8558952,t,253.359391,84870066067479558,720575940636923511,"[830720, 344608, 149840]","[821120, 335680, 146960]","[840576, 354912, 152880]"
143137,3076633,t,261.782528,78042030341189115,720575940623944072,"[433152, 210560, 223440]","[428544, 206976, 219120]","[437632, 214048, 227600]"
143138,3125634,t,274.994586,78113635431454405,720575940621426568,"[434208, 282112, 23400]","[429728, 278496, 19040]","[439264, 285568, 27320]"


Not all neurons have a cell body in the brain (e.g. sensory and ascending neurons), and for ~6,000 intrinsic neurons the segmentation did not reach the cell bodies, which sit at the outer layer of the brain. In those cases, the `pt_root_id` is 0. This table also contains glia cell bodies, as well as a few false positive annotations.
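
A small sketch of how these caveats translate into filtering with pandas. It uses a few values copied from the outputs above plus one made-up row with `pt_root_id` 0 (marked in the code); the variable names are illustrative. With real data, the frame and the ID list would come from the `query_table` calls above.

```python
import pandas as pd

# (id, pt_root_id) pairs copied from the nuclei_v1 output above,
# plus one made-up row to illustrate a nucleus without a segment.
nuclei_sample = pd.DataFrame(
    {
        "id": [7393349, 7416439, 7415038, 1],
        "pt_root_id": [
            720575940626838909,
            720575940627484553,
            720575940626046919,
            0,  # made-up row: pt_root_id 0 means no segment at this point
        ],
    }
)

# Root IDs copied from the first three rows of the proofread_neurons output.
# This short list is incomplete, so a nucleus dropped here is not
# necessarily unproofread in the full table.
proofread_ids = [
    720575940628857210,
    720575940626838909,
    720575940626046919,
]

# Drop nuclei that are not attached to any segment.
with_segment = nuclei_sample[nuclei_sample["pt_root_id"] != 0]

# Keep only nuclei whose segment appears in the proofread list.
proofread_nuclei = with_segment[with_segment["pt_root_id"].isin(proofread_ids)]
print(proofread_nuclei["id"].tolist())
```

Filtering on `pt_root_id` before matching avoids treating all unattached nuclei as if they belonged to one segment with ID 0.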

Every table has an associated description which provides further context and references to publications. This feature is provided by the annotation service, which can be reached at `client.annotation`.

In [37]:
print(client.annotation.get_table_metadata("nuclei_v1")["description"])


FlyWire nucleus description
Nucleus version: 20210322

Nuclei in this table consist of center points (in nm), volume (in μm3), and bounding boxes (in nm).

The nucleus segmentation was generated by Shang Mu (smu@princeton.edu, Seung Lab at Princeton University) using a 2D convolutional neural network (CNN) and heuristic interpolations. The training data was assembled from annotations by Selden Koolman, Merlin Moore, Sarah Morejohn, Ben Silverman, Kyle Willie, Ryan Willie, Szi-chieh Yu and Shang Mu.

As this data was generated using a 2D, rather than 3D, neural network, defects are present in the detected nuclei, particularly where there are large defects in section alignment or a number of consecutive missing sections.

False positive fragments, nucleus fragments and partial nuclei are the most common type of defects. A simple, rudimentary method for cleaning up is to disregard small fragments by thresholding by segment size or by the z-dimension of the bounding boxes. A size threshol
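
The clean-up suggested in the description can be sketched as a simple filter on the `volume` column. The threshold value is cut off in the description above, so `min_volume_um3` below is a placeholder parameter and not a recommended value; the helper name `drop_small_nuclei` is also hypothetical.

```python
import pandas as pd


def drop_small_nuclei(nuclei_df: pd.DataFrame, min_volume_um3: float) -> pd.DataFrame:
    """Keep only rows whose `volume` (in cubic micrometers) reaches the threshold."""
    return nuclei_df[nuclei_df["volume"] >= min_volume_um3].reset_index(drop=True)


# Volumes copied from three rows of the nuclei_v1 output above.
sample = pd.DataFrame(
    {
        "id": [7415848, 7416439, 7415013],
        "volume": [9.280717, 11.523400, 53.711176],
    }
)

# 10.0 is an arbitrary value chosen for illustration only.
kept = drop_small_nuclei(sample, min_volume_um3=10.0)
print(kept["id"].tolist())
```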

Positions can be useful for analysis. The CAVEclient provides some convenience functions: 

Splitting of position columns into separate x, y, and z columns:

In [38]:
nuclei_df = client.materialize.query_table("nuclei_v1", split_positions=True)
nuclei_df

Unnamed: 0,id,valid,volume,pt_position_x,pt_position_y,pt_position_z,pt_supervoxel_id,pt_root_id,bb_start_position_x,bb_start_position_y,bb_start_position_z,bb_end_position_x,bb_end_position_y,bb_end_position_z
0,7393349,t,26.141245,709888,227744,57160,82827379285852979,720575940626838909,708032,226144,54760,711904,229440,59280
1,7416439,t,11.523400,710592,263392,129800,82827998029690530,720575940627484553,708800,262112,128200,712064,264832,131080
2,7415038,t,32.895959,722528,234656,77000,83038623024880664,720575940626046919,720768,232896,74480,724224,236576,79360
3,7415013,t,53.711176,722912,244032,65200,83038760463398837,720575940630311383,720480,242144,62480,725600,246208,67760
4,7415848,t,9.280717,721984,229792,119560,83038554439606237,720575940633370649,720832,228320,118240,723232,231264,120840
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
143135,4389032,t,244.330332,511328,113024,48560,79377386675948158,720575940629762043,503744,104128,44600,520704,122080,53040
143136,8558952,t,253.359391,830720,344608,149840,84870066067479558,720575940636923511,821120,335680,146960,840576,354912,152880
143137,3076633,t,261.782528,433152,210560,223440,78042030341189115,720575940623944072,428544,206976,219120,437632,214048,227600
143138,3125634,t,274.994586,434208,282112,23400,78113635431454405,720575940621426568,429728,278496,19040,439264,285568,27320


Defining the position resolution. Resolutions are always given in nanometers per unit, so this query asks for points in micrometers (1 micrometer = 1000 nanometers):

In [39]:
nuclei_df = client.materialize.query_table("nuclei_v1", desired_resolution=[1000, 1000, 1000])
nuclei_df

Unnamed: 0,id,valid,volume,pt_supervoxel_id,pt_root_id,pt_position,bb_start_position,bb_end_position
0,7393349,t,26.141245,82827379285852979,720575940626838909,"[709.888, 227.744, 57.16]","[708.032, 226.144, 54.76]","[711.904, 229.44, 59.28]"
1,7416439,t,11.523400,82827998029690530,720575940627484553,"[710.592, 263.392, 129.8]","[708.8, 262.112, 128.2]","[712.064, 264.832, 131.08]"
2,7415038,t,32.895959,83038623024880664,720575940626046919,"[722.528, 234.656, 77.0]","[720.768, 232.896, 74.48]","[724.224, 236.576, 79.36]"
3,7415013,t,53.711176,83038760463398837,720575940630311383,"[722.912, 244.032, 65.2]","[720.48, 242.144, 62.48]","[725.6, 246.208, 67.76]"
4,7415848,t,9.280717,83038554439606237,720575940633370649,"[721.984, 229.792, 119.56]","[720.832, 228.32, 118.24]","[723.232, 231.264, 120.84]"
...,...,...,...,...,...,...,...,...
143135,4389032,t,244.330332,79377386675948158,720575940629762043,"[511.328, 113.024, 48.56]","[503.744, 104.128, 44.6]","[520.704, 122.08, 53.04]"
143136,8558952,t,253.359391,84870066067479558,720575940636923511,"[830.72, 344.608, 149.84]","[821.12, 335.68, 146.96]","[840.576, 354.912, 152.88]"
143137,3076633,t,261.782528,78042030341189115,720575940623944072,"[433.152, 210.56, 223.44]","[428.544, 206.976, 219.12]","[437.632, 214.048, 227.6]"
143138,3125634,t,274.994586,78113635431454405,720575940621426568,"[434.208, 282.112, 23.4]","[429.728, 278.496, 19.04]","[439.264, 285.568, 27.32]"
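
The conversion behind `desired_resolution` amounts to a per-axis division. The sketch below reproduces it by hand for the first nucleus, assuming the stored positions are in nanometers as in the earlier `nuclei_v1` output; `to_resolution` is a hypothetical helper, not a CAVEclient function.

```python
def to_resolution(position_nm, desired_resolution):
    """Convert an [x, y, z] position in nanometers to the desired units.

    `desired_resolution` gives the size of one output unit in nanometers
    for each axis, e.g. [1000, 1000, 1000] for micrometers.
    """
    return [coord / res for coord, res in zip(position_nm, desired_resolution)]


# pt_position of the first row of the nuclei_v1 output, in nanometers.
print(to_resolution([709888, 227744, 57160], [1000, 1000, 1000]))
# prints [709.888, 227.744, 57.16]
```

The result matches the `pt_position` of the first row in the table above.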


## Querying annotations - hierarchical annotations from Schlegel et al., 2023

[Schlegel et al.](https://www.biorxiv.org/content/10.1101/2023.06.27.546055v2.abstract) introduced hierarchical annotations for all proofread neurons in the dataset. Figure 1 from their paper (shown below) outlines the hierarchy and renders individual groups of neurons.



![](https://www.biorxiv.org/content/biorxiv/early/2023/07/15/2023.06.27.546055/F1.large.jpg)

To load all annotations (this will take ~20s):

In [29]:
hierarchical_annos_df = client.materialize.query_table("hierarchical_neuron_annotations")
hierarchical_annos_df

Unnamed: 0,id,valid,target_id,classification_system,cell_type,id_ref,valid_ref,pt_supervoxel_id,pt_root_id,pt_position
0,391232,t,60947,flow,intrinsic,60947,t,83532853635373500,720575940627005443,"[751197, 330784, 106807]"
1,391233,t,60946,flow,intrinsic,60946,t,84307115980143621,720575940638062141,"[796396, 343997, 119089]"
2,391234,t,60945,flow,intrinsic,60945,t,85151060007686276,720575940634330991,"[845576, 316831, 145515]"
3,391235,t,60960,flow,intrinsic,60960,t,85010459891445545,720575940610409946,"[835953, 325029, 132976]"
4,391236,t,60961,flow,intrinsic,60961,t,83673659843320298,720575940628587324,"[761422, 334448, 110534]"
...,...,...,...,...,...,...,...,...,...,...
377694,760321,t,127378,cell_type,T4a,127378,t,76001267906418794,720575940640761728,"[314129, 204855, 184422]"
377695,760322,t,127393,cell_type,T5a,127393,t,76705299213588578,720575940644327584,"[352578, 225656, 253139]"
377696,760323,t,127697,cell_type,T4a,127697,t,75790161740758808,720575940637229198,"[301502, 206108, 197151]"
377697,760324,t,127776,cell_type,T4a,127776,t,75157049402858357,720575940627302745,"[263197, 218814, 256978]"


The `classification_system` column encodes the level of the hierarchy a given annotation belongs to. Annotations for `flow` and `super_class` are available for all neurons, but finer annotations are only available for subsets of neurons. We can see that in the annotation counts for each hierarchy level:

In [17]:
hierarchical_annos_df["classification_system"].value_counts()

flow              127978
super_class       127978
cell_class         97672
cell_type          17404
cell_sub_class      6667
Name: classification_system, dtype: int64
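
Because each hierarchy level is stored as a separate row, it can be convenient to reshape the table so that each neuron has one row and each level one column. A minimal sketch with pandas on a hand-built toy frame: the root IDs `1` and `2` and the pairing of labels to neurons are made up for illustration, and only the column names and label values are taken from the output above.

```python
import pandas as pd

# Toy long-format annotations: one row per (neuron, hierarchy level).
annos = pd.DataFrame(
    {
        "pt_root_id": [1, 1, 2],  # made-up IDs, not real root IDs
        "classification_system": ["flow", "cell_type", "flow"],
        "cell_type": ["intrinsic", "T4a", "intrinsic"],
    }
)

# One row per neuron, one column per hierarchy level. If a neuron has
# several annotations at the same level, "first" keeps only the first.
wide = annos.pivot_table(
    index="pt_root_id",
    columns="classification_system",
    values="cell_type",
    aggfunc="first",
)
print(wide)
```

Levels that are missing for a neuron show up as `NaN`, which mirrors the fact that finer annotations exist only for subsets of neurons.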

## Subqueries - hierarchical annotations from Schlegel et al., 2023 continued

CAVE imposes a limit on the number of rows that can be loaded at once. The current limit is `500,000` rows. This ensures that the system keeps working for everyone and prevents accidentally large queries to the server, which is particularly relevant for synapse queries.

While the hierarchical annotation table is small enough to be loaded as a whole, it takes a few seconds to load. If we are only interested in a subset of the data, we can use a filter on any column to reduce the data footprint:

In [28]:
flow_annos_df = client.materialize.query_table("hierarchical_neuron_annotations", filter_equal_dict={"classification_system": "flow"})
flow_annos_df

Unnamed: 0,id,valid,target_id,classification_system,cell_type,id_ref,valid_ref,pt_supervoxel_id,pt_root_id,pt_position
0,391232,t,60947,flow,intrinsic,60947,t,83532853635373500,720575940627005443,"[751197, 330784, 106807]"
1,391233,t,60946,flow,intrinsic,60946,t,84307115980143621,720575940638062141,"[796396, 343997, 119089]"
2,391234,t,60945,flow,intrinsic,60945,t,85151060007686276,720575940634330991,"[845576, 316831, 145515]"
3,391235,t,60960,flow,intrinsic,60960,t,85010459891445545,720575940610409946,"[835953, 325029, 132976]"
4,391236,t,60961,flow,intrinsic,60961,t,83673659843320298,720575940628587324,"[761422, 334448, 110534]"
...,...,...,...,...,...,...,...,...,...,...
127973,503938,t,109940,flow,intrinsic,109940,t,74664055675374018,720575940627905383,"[234729, 191865, 193858]"
127974,503939,t,110216,flow,intrinsic,110216,t,82689047382048903,720575940628019985,"[702449, 370626, 173721]"
127975,503940,t,111137,flow,intrinsic,111137,t,81984535373702682,720575940639551797,"[663170, 321144, 196507]"
127976,503941,t,112845,flow,intrinsic,112845,t,74171543118689555,720575940611095160,"[205765, 196206, 178359]"


Besides `filter_equal_dict`, the CAVEclient provides `filter_in_dict` and `filter_out_dict` as options to restrict what data is loaded. Examples:

In [27]:
cell_class_type_annos_df = client.materialize.query_table("hierarchical_neuron_annotations", filter_in_dict={"classification_system": ["cell_class", "cell_type"]})
cell_class_type_annos_df

Unnamed: 0,id,valid,target_id,classification_system,cell_type,id_ref,valid_ref,pt_supervoxel_id,pt_root_id,pt_position
0,631921,t,24809,cell_class,ALIN,24809,t,81350666718746524,720575940618149686,"[624510, 288189, 127797]"
1,631922,t,3439,cell_class,ALIN,3439,t,80717416606245362,720575940628675433,"[586065, 292954, 82718]"
2,631923,t,42371,cell_class,ALIN,42371,t,77337311972566683,720575940630166583,"[389107, 148134, 167223]"
3,631924,t,15670,cell_class,ALIN,15670,t,80787785350455829,720575940630738683,"[592115, 291905, 83257]"
4,631925,t,34597,cell_class,ALIN,34597,t,78676860598498552,720575940624706062,"[468615, 298327, 130296]"
...,...,...,...,...,...,...,...,...,...,...
115071,760321,t,127378,cell_type,T4a,127378,t,76001267906418794,720575940640761728,"[314129, 204855, 184422]"
115072,760322,t,127393,cell_type,T5a,127393,t,76705299213588578,720575940644327584,"[352578, 225656, 253139]"
115073,760323,t,127697,cell_type,T4a,127697,t,75790161740758808,720575940637229198,"[301502, 206108, 197151]"
115074,760324,t,127776,cell_type,T4a,127776,t,75157049402858357,720575940627302745,"[263197, 218814, 256978]"


In [33]:
no_flow_annos_df = client.materialize.query_table("hierarchical_neuron_annotations", filter_out_dict={"classification_system": ["flow"]})
no_flow_annos_df

Unnamed: 0,id,valid,target_id,classification_system,cell_type,id_ref,valid_ref,pt_supervoxel_id,pt_root_id,pt_position
0,391232,t,60947,flow,intrinsic,60947,t,83532853635373500,720575940627005443,"[751197, 330784, 106807]"
1,391233,t,60946,flow,intrinsic,60946,t,84307115980143621,720575940638062141,"[796396, 343997, 119089]"
2,391234,t,60945,flow,intrinsic,60945,t,85151060007686276,720575940634330991,"[845576, 316831, 145515]"
3,391235,t,60960,flow,intrinsic,60960,t,85010459891445545,720575940610409946,"[835953, 325029, 132976]"
4,391236,t,60961,flow,intrinsic,60961,t,83673659843320298,720575940628587324,"[761422, 334448, 110534]"
...,...,...,...,...,...,...,...,...,...,...
377694,760321,t,127378,cell_type,T4a,127378,t,76001267906418794,720575940640761728,"[314129, 204855, 184422]"
377695,760322,t,127393,cell_type,T5a,127393,t,76705299213588578,720575940644327584,"[352578, 225656, 253139]"
377696,760323,t,127697,cell_type,T4a,127697,t,75790161740758808,720575940637229198,"[301502, 206108, 197151]"
377697,760324,t,127776,cell_type,T4a,127776,t,75157049402858357,720575940627302745,"[263197, 218814, 256978]"
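
When a query could exceed the row limit, one common pattern is to split a long list of IDs into chunks and issue one `filter_in_dict` query per chunk. The sketch below assumes `client` is the CAVEclient from above; `chunked`, `query_in_chunks`, and the default chunk size are hypothetical choices, not part of the CAVEclient API.

```python
def chunked(values, chunk_size):
    """Yield consecutive slices of `values` with at most `chunk_size` items."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def query_in_chunks(client, table_name, column, values, chunk_size=1000):
    """Run one `filter_in_dict` query per chunk and return all results.

    Illustrative helper: the chunk size has to be chosen so that each
    individual query stays below the server's row limit.
    """
    results = []
    for chunk in chunked(list(values), chunk_size):
        results.append(
            client.materialize.query_table(
                table_name, filter_in_dict={column: chunk}
            )
        )
    return results
```

The returned DataFrames can then be combined with `pandas.concat(results, ignore_index=True)`.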


## Community annotations

## Synapse queries

FlyWire uses automatically annotated synapses that were produced by [Buhmann et al.](https://www.nature.com/articles/s41592-021-01183-7). Automation of synapse annotation is critical for circuit analysis but one should keep in mind that the classifier may contain biases that lead to better or worse results in different brain regions. For instance, this classifier was trained on data acquired from neuropils in the central brain and might perform worse in the optic lobe or for sensory neurons.

In total, Buhmann et al. identified ~244 million _putative_ synapses. Each synapse is a link from a pre- to a postsynaptic site. As presynapses in the fly are usually polysynaptic, there are usually multiple synapses for each presynaptic site.

In [40]:
client.materialize.get_annotation_count("synapses_nt_v1")

244358226
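
Because the synapse table is far larger than the row limit, synapse queries need to be restricted, for example to the outputs of a few neurons. A hedged sketch: it assumes that `synapses_nt_v1` has a `pre_pt_root_id` column, which is not shown in this tutorial and should be checked against the table metadata first. `outputs_of` is a hypothetical helper.

```python
def outputs_of(client, root_ids, synapse_table="synapses_nt_v1"):
    """Query synapses whose presynaptic side belongs to one of `root_ids`.

    Assumes the table has a `pre_pt_root_id` column (not verified here).
    """
    return client.materialize.query_table(
        synapse_table,
        filter_in_dict={"pre_pt_root_id": list(root_ids)},
    )
```

For example, `outputs_of(client, [720575940628857210])` would request the output synapses of the first neuron in the `proofread_neurons` table above.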