# Access and analysis of the MICrONS dataset

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sdorkenw/AMATH_342_neuronal_circuits/blob/main/mouse_visual_cortex_connectivity_MICrONS.ipynb)

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
   
<b> Electron Microscopy (EM) data enables morphological reconstruction of neurons and detection of their synaptic connectivity </b>. The <a href=https://www.microns-explorer.org/cortical-mm3>MICrONS dataset </a> is one of the largest volume EM datasets currently available and spans all layers of visual cortex. We will use this dataset to query the connectivity between neurons in the visual cortex. 
    
<em> Note on data access: </em> To make our lives easier, we have already queried most of the data needed for this exercise from the database and made it available as versioned files that can be read with pandas. The entire dataset is hosted using the <a href=https://caveconnectome.github.io/sections/cave_overview.html> Connectome Annotation Versioning Engine (CAVE) </a>. A separate notebook shows how to use CAVE to generate the files used in this notebook. 
    
    
</div>

In [None]:
mat_version = 1300

# Paths to compiled data for this workshop
syn_data_url = f"https://github.com/sdorkenw/MICrONS_workshop_CNC_forum/raw/refs/heads/main/data/synapses_proofread_V1_neurons_{mat_version}.feather"
cell_data_url = f"https://github.com/sdorkenw/MICrONS_workshop_CNC_forum/raw/refs/heads/main/data/cell_info_{mat_version}.feather" 

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
   
For convenience, we also use the package <b>skeleton_plot</b>, which handles rendering the precomputed skeletons, and <b>caveclient</b>, which is used to query CAVE. Both packages can be installed with pip.
    
</div>

In [None]:
!pip install -q skeleton_plot
!pip install -q caveclient

In [None]:
import skeleton_plot as skelplot

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import matplotlib
import seaborn as sns

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

    
## CAVE token

After following the instructions in the CAVE setup notebook, copy your token to replace `PASTE_TOKEN_HERE` below. You can always get your token here: https://global.daf-apis.com/auth/api/v1/user/token

Look for `"token"`; the string that follows it is your token.
    
</div>

In [None]:
my_token = "PASTE_TOKEN_HERE"

try:
    import caveclient
    client = caveclient.CAVEclient("minnie65_public", auth_token=my_token)
except Exception:
    # Import or client setup failed; cells that query skeletons will not work
    client = None

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

    
## Proofreading and data quality

Understanding the variability in data quality is critical when interpreting the MICrONS data.

Automated segmentation of neuronal processes in dense EM imaging is challenging at the size of entire neurons, which can have millimeters of axons and dendrites. The automated segmentation algorithms used in the EM data for this project are not perfect, so proofreading is necessary to obtain accurate reconstructions of a cell and confidence in its connectivity.

In the MICrONS dataset, the general rule is that the dendrites of cells with a cell body are sufficiently proofread to trust synaptic connections onto a cell. Axons, on the other hand, require so much proofreading that only ~1,650 cells have proofread axons.

Axon and dendrite compartment status are marked separately, as proofreading effort was applied differently to the different compartments in some cells. In all cases, a status of `TRUE` indicates that false merges have been comprehensively removed, and the compartment is at least ‘clean’. Consult the ‘strategy’ column if completeness of the compartment is relevant to your research.

Some cells were extended to different degrees of completeness, or with different research goals in mind. This is denoted by 'strategy_axon', which may be one of:

<ul>
    <li>none: No cleaning, and no extension, and status is `FALSE`. </li>
    <li>axon_partially_extended: The axon was extended outward from the soma, following each branch to its termination. Output synapses represent a sampling of potential partners. </li>
    <li>axon_interareal: The axon was extended with a preference for branches that projected to other brain areas. Some axon branches were fully extended, but local connections may be incomplete. Output synapses represent a sampling of potential partners. </li>
    <li>axon_fully_extended: The axon was extended outward from the soma, following each branch to its termination. After initial extension, every endpoint was identified, manually inspected, and extended again if possible. Output synapses represent a largely complete sampling of partners. </li>
</ul>

<b> For this class, we treat all cells with at least `axon_partially_extended` as equally trustworthy.</b> 
    
</div>

In [None]:
cell_df = pd.read_feather(cell_data_url)

In [None]:
cell_df["proofreading_strategy"].value_counts()

In [None]:
proofread_root_ids = np.array(cell_df[cell_df["proofreading_strategy"] != ""]["pt_root_id"])
len(proofread_root_ids)
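
The selection rule described above can also be written as an explicit filter on the strategy names. The following is a minimal sketch on a made-up table: the root IDs are invented, and the strategy strings are assumed to match the names listed above, so check them against the output of `value_counts()` before applying this to the real `cell_df`.

```python
import pandas as pd

# Toy stand-in for cell_df; root IDs and strategies are made up for illustration.
toy_cell_df = pd.DataFrame({
    "pt_root_id": [101, 102, 103, 104],
    "proofreading_strategy": [
        "",
        "axon_partially_extended",
        "axon_fully_extended",
        "axon_interareal",
    ],
})

# Assumption: these strings match the values stored in the real table.
trusted_strategies = [
    "axon_partially_extended",
    "axon_interareal",
    "axon_fully_extended",
]

# Keep only cells whose axon received some form of extension.
trusted_mask = toy_cell_df["proofreading_strategy"].isin(trusted_strategies)
trusted_root_ids = toy_cell_df.loc[trusted_mask, "pt_root_id"].to_numpy()
print(trusted_root_ids)
```

Listing the accepted strategies explicitly makes it easy to tighten the criterion later, for example to `axon_fully_extended` only.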

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

    
## Synapse information

There are <b>337 million synapses</b> in the MICrONS dataset.

We have collected all of the input and output synapses for the proofread cells in version 1300. If you are interested in working with this dataset in the future, we added a notebook that shows how the data was queried.

Note that synapse queries always return the list of every synapse between the neurons in the query, even if there are multiple synapses between the same pair of neurons. A common pattern to generate a list of connections between unique pairs of neurons is to group by the root ids of the presynaptic and postsynaptic neurons and then count the number of synapses between them. 
    
    
</div>

In [None]:
syn_df = pd.read_feather(syn_data_url)
syn_df.shape
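
The group-and-count pattern described above could look like the sketch below. It runs on a small made-up synapse table so that it is self-contained; only the column names `pre_pt_root_id`, `post_pt_root_id` and `size` are taken from this notebook, and the names `toy_syn_df`, `conn_df`, `n_syn` and `sum_size` are illustrative.

```python
import pandas as pd

# Toy synapse table: one row per synapse (IDs and sizes are made up).
toy_syn_df = pd.DataFrame({
    "pre_pt_root_id":  [1, 1, 1, 2, 2, 3],
    "post_pt_root_id": [2, 2, 3, 3, 3, 1],
    "size":            [100, 250, 80, 40, 60, 500],
})

# One row per unique (pre, post) pair, with the synapse count and summed size.
conn_df = (
    toy_syn_df
    .groupby(["pre_pt_root_id", "post_pt_root_id"])
    .agg(n_syn=("size", "count"), sum_size=("size", "sum"))
    .reset_index()
)
print(conn_df)
```

Applied to `syn_df`, the same call would yield one row per connected pair of neurons rather than one row per synapse.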

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

    
## Neuron morphology: plotting skeletons
    
</div>

In [None]:
skel_color_map = {3: "firebrick", 4: "salmon", 2: "steelblue", 1: "olive"}

def plot_skeleton(sk_d: dict, plot_soma=True, ax=None):
    """Plots a skeleton.
    
    Args:
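        sk_d: skeleton dict as returned from CAVEclient
        plot_soma: whether to draw the soma
        ax: matplotlib axes to draw on; a new figure is created if None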
        
    Returns: 
        ax: plot axes
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(7, 10), dpi=150)
    
    skelplot.plot_tools.plot_verts(np.array(sk_d["vertices"]), np.array(sk_d["edges"]), radius=np.array(sk_d["radius"]) / 1000 * 1,
                                   ax=ax, skel_colors = sk_d["compartment"], title = "", skel_alpha = 1., 
                                   line_width = 5, x = 'x', y = 'y',  plot_soma = plot_soma, soma_node = sk_d["root"],
                                   color = 'darkslategray', soma_size = 120, invert_y = True, 
                                   skel_color_map = skel_color_map, x_min_max = None, 
                                   y_min_max = None, capstyle = "round", joinstyle = "round",)

    ax.spines['right'].set_visible(False) 
    ax.spines['left'].set_visible(False) 
    ax.spines['top'].set_visible(False) 
    ax.spines['bottom'].set_visible(False)
    ax.axis('off')
    
    return ax

In [None]:
# Select one of the proofread cells
root_id = proofread_root_ids[2]

# Query the skeleton from CAVEclient
if client is None:
    raise Exception("CAVEclient not initialized - did you set your token?")
else:
    sk_d = client.skeleton.get_skeleton(root_id, output_format="dict")

In [None]:
# Plot skeleton in 2D
ax = plot_skeleton(sk_d)
plt.show()

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

    
## Neuron morphology & synapses
    
</div>

In [None]:
def filter_synapse_table(synapse_table: pd.DataFrame, pre_root_ids=None, post_root_ids=None):   
    """Filter synapse table by pre and post root ids.

    Args:
        synapse_table: synapse table with pre_pt_root_ids and post_pt_root_ids as pd.DataFrame
        pre_root_ids: np.ndarray, list or pd.Series of root_ids to filter on the presynaptic side
        post_root_ids: np.ndarray, list or pd.Series of root_ids to filter on the postsynaptic side

    Returns:
        synapse_table: filtered synapse table
    """
    
    if pre_root_ids is not None:
        assert isinstance(pre_root_ids, (np.ndarray, list, pd.core.series.Series)), f"IDs have to be of type np.ndarray, list or pd.Series; got {type(pre_root_ids)}"
        pre_m = np.isin(synapse_table["pre_pt_root_id"], pre_root_ids)
    else:
        pre_m = np.ones(len(synapse_table), dtype=bool)
        
    if post_root_ids is not None:
        assert isinstance(post_root_ids, (np.ndarray, list, pd.core.series.Series)), f"IDs have to be of type np.ndarray, list or pd.Series; got {type(post_root_ids)}"
        post_m = np.isin(synapse_table["post_pt_root_id"], post_root_ids)
    else:
        post_m = np.ones(len(synapse_table), dtype=bool)
        
    return synapse_table[pre_m & post_m]

In [None]:
# Select the synapses where your root_id is the presynaptic ('upstream') cell; the outputs of your cell
pre_syns = filter_synapse_table(syn_df, pre_root_ids=[root_id])

In [None]:
pre_syns.head(5)

In [None]:
# Plot your skeleton and overlay synapses
ax = plot_skeleton(sk_d)

sns.scatterplot(data=pre_syns, x="ctr_pt_position_x", y="ctr_pt_position_y", 
                s=5, color="b", ax=ax, edgecolor=None, zorder=100) 

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">
<p><b>Task:</b> Use the filter function above to create a dataframe of all postsynaptic synapses and create a version of the plot above that includes these postsynaptic synapses.</p>
    
</div>

In [None]:
# Select the synapses where your root_id is postsynaptic ('downstream'); the inputs to your cell
post_syns = filter_synapse_table(syn_df, post_root_ids=[root_id])

In [None]:
# Plot skeleton with input and output synapses
ax = plot_skeleton(sk_d)

sns.scatterplot(data=pre_syns, x="ctr_pt_position_x", y="ctr_pt_position_y", 
                s=5, color="b", ax=ax, edgecolor=None, zorder=100) 

sns.scatterplot(data=post_syns, x="ctr_pt_position_x", y="ctr_pt_position_y", 
                s=5, color="r", ax=ax, edgecolor=None, zorder=100) 

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

## Synapse matrix

The synapses of neurons create a network of synaptic connections. One way of visualizing this connectivity is in a matrix form. Pandas provides the `pivot_table` function that we can use to make a matrix out of the tabular synapse data. For now, we will limit ourselves to the synapses between the proofread neurons.
    
Each synapse has a `size` value assigned to it. How to aggregate the sizes of multiple synapses between two neurons depends on the research question.

</div>

In [None]:
proof_proof_syn_table = filter_synapse_table(syn_df, pre_root_ids=proofread_root_ids, post_root_ids=proofread_root_ids)

# matrix of log-summed synapse size
syn_mat = proof_proof_syn_table.pivot_table(index="pre_pt_root_id", columns="post_pt_root_id", 
                                            values="size", aggfunc=lambda x: np.log10(np.sum(x))).fillna(0)

# Make sure the matrix is square: use the presynaptic IDs for both rows and columns
syn_mat = syn_mat.reindex(columns=np.array(syn_mat.index), fill_value=0)
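
To illustrate how the choice of aggregation changes the matrix, here is a hedged sketch that builds count, sum and mean matrices from a made-up synapse table. The helper `connectivity_matrix` is hypothetical and not part of this notebook; unlike the cell above, it uses the union of pre- and postsynaptic IDs for both axes.

```python
import numpy as np
import pandas as pd

# Toy synapse table: one row per synapse (IDs and sizes are made up).
toy_syn_df = pd.DataFrame({
    "pre_pt_root_id":  [1, 1, 1, 2, 2, 3],
    "post_pt_root_id": [2, 2, 3, 3, 3, 1],
    "size":            [100, 250, 80, 40, 60, 500],
})

def connectivity_matrix(syn_table: pd.DataFrame, aggfunc) -> pd.DataFrame:
    """Hypothetical helper: square pre x post matrix of aggregated synapse size."""
    mat = syn_table.pivot_table(index="pre_pt_root_id", columns="post_pt_root_id",
                                values="size", aggfunc=aggfunc)
    # Use the same sorted set of IDs for rows and columns so the matrix is square.
    ids = np.union1d(syn_table["pre_pt_root_id"], syn_table["post_pt_root_id"])
    return mat.reindex(index=ids, columns=ids).fillna(0)

count_mat = connectivity_matrix(toy_syn_df, "count")  # number of synapses per pair
sum_mat = connectivity_matrix(toy_syn_df, "sum")      # total synapse size per pair
mean_mat = connectivity_matrix(toy_syn_df, "mean")    # average synapse size per pair
print(sum_mat)
```

Counts emphasize how many contacts exist, sums approximate total connection strength, and means describe the typical size of an individual synapse in a connection.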

In [None]:
fig, ax = plt.subplots(figsize=(7, 5), dpi=150)
sns.heatmap(syn_mat, cmap="gray_r", xticklabels=[], yticklabels=[], 
            ax=ax, square=True,
            cbar_kws={"label": "Log summed synapse size (AU)"})

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

Above, we aggregated synapses between the same two neurons by summing their sizes and then taking the base-10 logarithm. It has been shown that connection strengths are generally log-normally distributed. Below we plot the connection strengths along a logarithmic axis:  


</div>

In [None]:
fig, ax = plt.subplots(figsize=(5, 5), dpi=150)
sns.histplot(x=10**np.array(syn_mat).flatten()[np.array(syn_mat).flatten() > 0], ax=ax, bins=np.logspace(2, 6, 101))
ax.set_xscale("log")
ax.set_xlabel("Connection size (AU)")
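
As a rough illustration of what "log-normally distributed" means, the sketch below draws synthetic values (not MICrONS data; the distribution parameters are arbitrary assumptions) and compares the skewness of the raw values with that of their logarithms. For log-normal data the logarithms are approximately normal, so their skewness should be close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for connection sizes; the parameters are arbitrary.
sizes = rng.lognormal(mean=np.log(3000), sigma=1.0, size=10_000)
log_sizes = np.log10(sizes)

def skewness(values: np.ndarray) -> float:
    """Sample skewness: third central moment divided by std cubed."""
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / values.std() ** 3)

skew_raw = skewness(sizes)      # strongly right-skewed on a linear axis
skew_log = skewness(log_sizes)  # roughly symmetric on a logarithmic axis
print(f"skewness raw: {skew_raw:.2f}, skewness log10: {skew_log:.2f}")
```

The same comparison could be run on the nonzero entries of the connection matrix to check how well the log-normal description fits the real data.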

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

## Cell type tables

Identifying the putative ‘cell type’ from the EM morphology is a process that involves both manual and automatic classifications. Subsets of the dataset have been manually classified by anatomists at the Allen Institute, and these ground truth labels were used to train and refine different automated ‘feature classifiers’ over time. 

The diversity of manual and automated cell type classifications available in the dataset reflects the fact that defining ‘cell types’ in the dataset is an active area of research and must be contextualized against the purpose and resolution of the cell-typing being performed.

<b> Manual Cell Types (V1 Column)</b>

A subset of nucleus detections in a 100 µm column (n=2204) in VISp were manually classified by anatomists at the Allen Institute into categories of cell subclasses, first distinguishing cells into classes of non-neuronal, excitatory and inhibitory. Excitatory cells were separated into laminar sub-classes (L23, L4), 3 sub-types of layer 5 cells (ET, IT, NP) and 2 classes of layer 6 cells (IT, CT). Inhibitory cells were classified into Bipolar (BPC), Basket (BC), Martinotti (MC), or Unsure (Unsure). These neuronal calls are available from the CAVEclient under the table name `allen_v1_column_types_slanted_ref`, which references the nucleus id of the cell, and in the column `cell_type_manual` in the prepared data for this workshop. 

Non-neuronal manual cell type calls enumerate astrocytes, microglia, pericytes, oligodendrocytes (oligo), and oligodendrocyte precursor cells (OPC), and are available in the table `aibs_column_nonneuronal_ref`.

<b>Automated Cell Type classification (Soma-nucleus model)</b>

Models were trained based upon the manual Column Neuron labels, as described in <a href=https://www.biorxiv.org/content/10.1101/2022.07.20.499976v1>Elabbady et al. BioRxiv 2023.</a> Each nucleus was analyzed for a variety of features, and a model was trained on an independent dataset to distinguish neurons from non-neuronal detections. Non-neuron detections include both glial cells and false positive detections. The nucleus segmentation detected 171,818 connected components of nucleus objects, of which this model classified 82K as neurons. Evaluation of this model on 1,316 cells in the volume shows that it has a recall of 99.6% for neurons and a precision of 96.9%. All nucleus detections and the results of this model can be queried and linked to the cellular segmentation using the CAVEclient with the table name `nucleus_neuron_svm`. 

Combining those features with the nucleus features, we trained a hierarchical model on the manual labels to predict cell classes and sub-classes across a large number of neurons. This is available as the CAVE table `aibs_metamodel_celltypes_v661`, and in the column `cell_type_auto` in the prepared data for this workshop.
    
</div>

In [None]:
cell_df["cell_type_auto"].value_counts()

In [None]:
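# Skeleton queries need a working CAVEclient
if client is None:
    raise Exception("CAVEclient not initialized - did you set your token?")

# Plot one proofread example cell per automated cell type
for ct_name, df in cell_df[cell_df["proofreading_strategy"] != ""].groupby("cell_type_auto"):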
    root_id = int(df.iloc[-1]["pt_root_id"])
    
    sk_d = client.skeleton.get_skeleton(root_id, output_format="dict")

    ax = plot_skeleton(sk_d)
    ax.set_title(ct_name)
    plt.show()

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

## Sorting the synapse matrix with cell types
    
Let's combine the synaptic connectivity with the cell type information. Below we provide logic for sorting a connectivity matrix using a list of labels.    
    
</div>

In [None]:
def sort_matrix_by_types(mat: pd.DataFrame, 
                         labels: pd.DataFrame, 
                         label_type_col: str = "cell_type_auto", 
                         label_id_col: str = "pt_root_id", 
                         post_labels: pd.DataFrame = None, 
                         post_label_type_col: str = None, 
                         post_label_id_col: str = None):
    """Sorts (synapse) matrix by labels.

    This function assumes a square synapse matrix!

    Args:
        mat: synapse matrix as pandas DataFrame
        labels: DataFrame with labels, e.g. the output of client.materialize.query_table('aibs_metamodel_celltypes_v661')
        label_type_col: column name in labels for cell types
        label_id_col: column name in labels for root ids
        post_labels: DataFrame with labels for the postsynaptic side (columns); defaults to labels
        post_label_type_col: column name in post_labels for cell types; defaults to label_type_col
        post_label_id_col: column name in post_labels for root ids; defaults to label_id_col

    Returns:
        mat_sorted: sorted matrix
        pre_mat_labels: sorted labels of the presynaptic cells (rows)
        post_mat_labels: sorted labels of the postsynaptic cells (columns)
    """
    
    if post_labels is None:
        post_labels = labels
    if post_label_type_col is None:
        post_label_type_col = label_type_col
    if post_label_id_col is None:
        post_label_id_col = label_id_col
        
    mat_sorted = mat.copy()
    
    pre_mat_labels = np.array(labels.set_index(label_id_col).loc[mat_sorted.index][label_type_col])
    pre_sorting = np.argsort(pre_mat_labels)

    post_mat_labels = np.array(post_labels.set_index(post_label_id_col).loc[mat_sorted.T.index][post_label_type_col])
    post_sorting = np.argsort(post_mat_labels)

    mat_sorted = mat_sorted.iloc[pre_sorting].T.iloc[post_sorting].T

    return mat_sorted, pre_mat_labels[pre_sorting], post_mat_labels[post_sorting]

In [None]:
# Select the synapses between proofread cells
proof_proof_syn_table = filter_synapse_table(syn_df, pre_root_ids=proofread_root_ids, post_root_ids=proofread_root_ids)

syn_mat = proof_proof_syn_table.pivot_table(index="pre_pt_root_id", columns="post_pt_root_id", 
                                            values="size", aggfunc=lambda x: np.log10(np.sum(x))).fillna(0)

syn_mat = syn_mat.reindex(columns=np.array(syn_mat.index), fill_value=0)

# sort the matrix by cell types to render sensibly in heatmap
syn_mat_ct, syn_mat_cell_types, _ = sort_matrix_by_types(syn_mat, cell_df, label_type_col="cell_type_auto")
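
Besides sorting individual cells, connectivity and cell types can also be combined by collapsing the synapse table into a type-by-type matrix. The sketch below is one possible way to do this on made-up data; the cell type labels `type_A` and `type_B` are placeholders rather than real MICrONS labels, and the variable names are illustrative.

```python
import pandas as pd

# Toy synapse table and cell table (IDs, sizes and labels are made up).
toy_syn_df = pd.DataFrame({
    "pre_pt_root_id":  [1, 1, 1, 2, 2, 3],
    "post_pt_root_id": [2, 2, 3, 3, 3, 1],
    "size":            [100, 250, 80, 40, 60, 500],
})
toy_cell_df = pd.DataFrame({
    "pt_root_id": [1, 2, 3],
    "cell_type_auto": ["type_A", "type_A", "type_B"],
})

# Lookup from root ID to cell type, then annotate both sides of every synapse.
type_lookup = toy_cell_df.set_index("pt_root_id")["cell_type_auto"]
typed_syn_df = toy_syn_df.assign(
    pre_type=toy_syn_df["pre_pt_root_id"].map(type_lookup),
    post_type=toy_syn_df["post_pt_root_id"].map(type_lookup),
)

# Number of synapses from each presynaptic type onto each postsynaptic type.
type_mat = typed_syn_df.pivot_table(index="pre_type", columns="post_type",
                                    values="size", aggfunc="count", fill_value=0)
print(type_mat)
```

Note that raw counts depend on how many cells of each type were sampled, so some normalization would likely be needed before comparing types.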

In [None]:
# add colormap for cell type
cts, ct_idx = np.unique(syn_mat_cell_types, return_inverse=True)
ct_colors = plt.get_cmap("tab10")(ct_idx)

fig, ax = plt.subplots(figsize=(7, 5), dpi=150)
sns.heatmap(syn_mat_ct, cmap="gray_r", xticklabels=[], yticklabels=[], 
            ax=ax, square=True,
            cbar_kws={"label": "Log sum synapse size (AU)"})


# Adding row and column colors for cell types
for i, color in enumerate(ct_colors):
    ax.add_patch(plt.Rectangle(xy=(-0.01, i), width=0.01, height=1, color=color, lw=0,
                               transform=ax.get_yaxis_transform(), clip_on=False))

for i, color in enumerate(ct_colors):
    ax.add_patch(plt.Rectangle(xy=(i, 1), height=0.01, width=1, color=color, lw=0,
                               transform=ax.get_xaxis_transform(), clip_on=False))

# add a legend for the cell types
legend_elements = [matplotlib.lines.Line2D([0], [0], color=plt.get_cmap("tab10")(i), label=ct) for i, ct in enumerate(cts)]
plt.legend(handles=legend_elements, loc='upper left', bbox_to_anchor=(1.3, 1), title="cell types")
plt.show()