# Tutorial 01: Data Access

**Author:** Alexander Bates  
**Python Version**

---

## Introduction

This tutorial covers access to a curated collection of connectome data from the major *Drosophila* connectome projects. The curated data include:

- **Whole-system datasets**: maleCNS, BANC (brain + ventral nerve cord)
- **Brain-only datasets**: FAFB, Hemibrain
- **VNC-only dataset**: MANC

This Python tutorial demonstrates:
- Direct GCS data access using `gcsfs` and `pyarrow`
- Efficient streaming of large Parquet files
- Interactive visualizations with Plotly
- Metadata exploration and filtering
- Synapse data analysis with lazy loading

---

## Setup and Configuration

In [1]:
# Dataset configuration
# Options: "banc_746", "fafb_783", "manc_121", "hemibrain_121", "malecns_09"
DATASET = "banc_746"
DATASET_ID = "banc_746_id"

# Data location - can be GCS bucket or local path
# Option 1 (GCS - default): Access data directly from Google Cloud Storage
DATA_PATH = "gs://sjcabs_2025_data"

# Option 2 (Local): Use local copy if you've downloaded the data
# DATA_PATH = "/path/to/local/sjcabs_data"
# Example: DATA_PATH = "~/data/sjcabs_data"

# Detect if using GCS or local path
USE_GCS = DATA_PATH.startswith("gs://")

# Image output directory
import os
IMG_DIR = "images/tutorial_01"
os.makedirs(IMG_DIR, exist_ok=True)

print(f"Dataset: {DATASET}")
print(f"Data location: {DATA_PATH}")
print(f"Using GCS: {USE_GCS}")
print(f"Images will be saved to: {IMG_DIR}")

Dataset: banc_746
Data location: gs://sjcabs_2025_data
Using GCS: True
Images will be saved to: images/tutorial_01


## Import Packages

We'll use:
- **pandas**: Data manipulation and analysis
- **pyarrow**: Efficient reading of Feather and Parquet files
- **gcsfs**: Google Cloud Storage filesystem interface
- **plotly**: Interactive visualizations
- **navis**: Neuron analysis tools (for future tutorials)

In [2]:
# Import all common packages and helper functions
import sys
sys.path.insert(0, '.')
from setup_helpers import *

print(f"pandas version: {pd.__version__}")

✓ Packages loaded successfully
pandas version: 2.3.3


## Data Location Options

This tutorial supports two data access modes:

### Option 1: Direct GCS Access (Default)
Access data directly from Google Cloud Storage - no manual download required!
- **Pros:** No local storage needed, always up-to-date
- **Cons:** Slower for large files, requires authentication & internet

**GCS bucket location:** `gs://sjcabs_2025_data/`

### Option 2: Local Access (Faster for Repeated Use)
Download data once with `gsutil`, then access locally:

```bash
# Create the local target directory first
mkdir -p ~/data/sjcabs_data/banc

# Download specific dataset files (e.g., BANC metadata + synapses)
gsutil -m cp gs://sjcabs_2025_data/banc/banc_746_meta.feather ~/data/sjcabs_data/banc/
gsutil -m cp gs://sjcabs_2025_data/banc/banc_746_synapses.parquet ~/data/sjcabs_data/banc/

# Or download entire dataset directory
gsutil -m cp -r gs://sjcabs_2025_data/banc ~/data/sjcabs_data/
```

Then update the configuration cell:
```python
DATA_PATH = "~/data/sjcabs_data"  # Use your local path
```
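
Some libraries do not expand `~` in file paths, so it can help to normalise the path once in the configuration cell. A minimal sketch, where `normalise_data_path` is a hypothetical helper (it is not part of `setup_helpers`):

```python
import os


def normalise_data_path(data_path):
    """Return a clean data root for either a GCS bucket or a local folder.

    Hypothetical helper for illustration; not part of the tutorial's setup_helpers.
    """
    if data_path.startswith("gs://"):
        # Bucket paths only need the trailing slash removed so joins stay tidy
        return data_path.rstrip("/")
    # Expand "~" and make the path absolute so every library sees the same location
    return os.path.abspath(os.path.expanduser(data_path))


print(normalise_data_path("gs://sjcabs_2025_data/"))
print(normalise_data_path("~/data/sjcabs_data"))
```

Calling this on `DATA_PATH` before `USE_GCS` is computed leaves the GCS case unchanged apart from the trailing slash.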

---

## Setup GCS Access

**Authentication required:** Before running this tutorial with GCS, authenticate with Google Cloud:

```bash
# Install gcloud CLI if you haven't already:
# https://cloud.google.com/sdk/docs/install

# Authenticate with your Google account
gcloud auth application-default login
```

This creates credentials that `gcsfs` will use automatically.

In [3]:
# Setup GCS filesystem if needed
if USE_GCS:
    print("Setting up Google Cloud Storage access...")
    gcs = gcsfs.GCSFileSystem(token='google_default')
    print("✓ GCS filesystem initialized")
else:
    gcs = None
    print("Using local filesystem")

Setting up Google Cloud Storage access...
✓ GCS filesystem initialized


## Helper Functions

Define utility functions for path construction and data reading:

In [4]:
def construct_path(data_root, dataset, file_type="meta", space_suffix=None):
    """
    Construct file paths for dataset files.
    Note: Skeleton files do not include version numbers in their filenames.
    
    Parameters
    ----------
    data_root : str
        Root data directory (can be gs:// or local path)
    dataset : str
        Dataset name with version (e.g., "banc_746")
    file_type : str
        Type of file: "meta", "synapses", "edgelist", "skeletons"
    space_suffix : str, optional
        Space name for skeletons (defaults to native space, e.g., "banc_space")
    
    Returns
    -------
    str
        Full path to the file
    """
    # Extract dataset name (e.g., "banc" from "banc_746")
    dataset_name = dataset.split("_")[0]
    
    # Determine file extension
    extensions = {
        "meta": ".feather",
        "synapses": ".parquet",
        "edgelist": ".feather",
        "edgelist_simple": ".feather",
        "skeletons": ""  # No extension - it's a directory
    }
    
    if file_type not in extensions:
        raise ValueError(f"Unknown file_type: {file_type}. Choose from: {list(extensions.keys())}")
    
    extension = extensions[file_type]
    
    # Construct filename
    if file_type == "skeletons":
        # Skeleton files don't include version number and have specific naming
        # Pattern: {dataset_name}_{space_name}_[l2_]swc (directory)
        # e.g., banc_banc_space_l2_swc/, fafb_fafb_space_swc/
        
        # Default space is the native space for the dataset
        if space_suffix is None:
            space_suffix = f"{dataset_name}_space"
        
        # BANC uses l2 skeletons, others don't
        if dataset_name == "banc":
            filename = f"{dataset_name}_{space_suffix}_l2_swc{extension}"
        else:
            filename = f"{dataset_name}_{space_suffix}_swc{extension}"
    else:
        # Other file types include the full dataset name with version
        filename = f"{dataset}_{file_type}{extension}"
    
    # Combine into full path
    full_path = f"{data_root}/{dataset_name}/{filename}"
    
    return full_path

def read_feather_gcs(path, gcs_fs=None):
    """
    Read Feather file from GCS or local path.
    
    Parameters
    ----------
    path : str
        Path to Feather file (can be gs:// or local)
    gcs_fs : gcsfs.GCSFileSystem, optional
        GCS filesystem object (required for GCS paths)
    
    Returns
    -------
    pd.DataFrame
        Loaded data
    """
    if path.startswith("gs://"):
        if gcs_fs is None:
            raise ValueError("gcs_fs required for GCS paths")
        
        print(f"Reading from GCS: {path}")
        
        # Strip gs:// prefix for gcsfs
        gcs_path = path.replace("gs://", "")
        
        with gcs_fs.open(gcs_path, 'rb') as f:
            df = feather.read_feather(f)
        
        print(f"✓ Loaded {len(df):,} rows")
        return df
    else:
        print(f"Reading from local path: {path}")
        df = pd.read_feather(path)
        print(f"✓ Loaded {len(df):,} rows")
        return df


def save_plot(fig, name, width=1200, height=600):
    """
    Save plotly figure to image file.
    
    Parameters
    ----------
    fig : plotly.graph_objects.Figure
        Plotly figure to save
    name : str
        Filename (without extension)
    width : int
        Width in pixels
    height : int
        Height in pixels
    """
    filename = os.path.join(IMG_DIR, f"{name}.png")
    fig.write_image(filename, width=width, height=height, scale=2)
    print(f"✓ Saved plot to {filename}")


print("✓ Helper functions defined")

✓ Helper functions defined


## Setup File Paths

Construct paths to metadata and synapse files:

In [5]:
# Construct file paths
meta_path = construct_path(DATA_PATH, DATASET, "meta")
synapse_path = construct_path(DATA_PATH, DATASET, "synapses")

print("File paths:")
print(f"  Metadata: {meta_path}")
print(f"  Synapses: {synapse_path}")

File paths:
  Metadata: gs://sjcabs_2025_data/banc/banc_746_meta.feather
  Synapses: gs://sjcabs_2025_data/banc/banc_746_synapses.parquet


---

## Reading Connectome Metadata

### Understanding the File Formats

Our data files use two columnar formats from the Apache ecosystem, both readable with PyArrow:
- **Feather** (`.feather`) for metadata - smaller files (~10 MB), loaded entirely into memory
- **Parquet** (`.parquet`) for synapses - large files (4-15 GB), supports lazy loading and predicate pushdown

**Why Parquet for synapses?**
- ✓ Column selection: download only needed columns
- ✓ Row filtering: skip row groups that cannot match a filter, so less data is downloaded
- ✓ Compression: smaller file sizes
- ✓ Efficient for analytical queries on large datasets

### Load Metadata

For metadata (~10 MB), we can load the entire dataset into memory:

In [6]:
# Read metadata into memory
meta_full = read_feather_gcs(meta_path, gcs_fs=gcs)

print(f"\nDataset: {DATASET}")
print(f"Total neurons: {len(meta_full):,}")
print(f"Metadata columns: {len(meta_full.columns)}")
print(f"\nAvailable columns:")
print(meta_full.columns.tolist())

# Display first few rows
meta_full.head()

Reading from GCS: gs://sjcabs_2025_data/banc/banc_746_meta.feather


✓ Loaded 168,791 rows

Dataset: banc_746
Total neurons: 168,791
Metadata columns: 18

Available columns:
['banc_746_id', 'supervoxel_id', 'region', 'side', 'hemilineage', 'nerve', 'flow', 'super_class', 'cell_class', 'cell_sub_class', 'cell_type', 'neurotransmitter_predicted', 'neurotransmitter_score', 'cell_function', 'cell_function_detailed', 'body_part_sensory', 'body_part_effector', 'status']


Unnamed: 0,banc_746_id,supervoxel_id,region,side,hemilineage,nerve,flow,super_class,cell_class,cell_sub_class,cell_type,neurotransmitter_predicted,neurotransmitter_score,cell_function,cell_function_detailed,body_part_sensory,body_part_effector,status
0,720575941569192238,74803281603754231,central_brain,right,VPNp1_medial,,intrinsic,central_brain_intrinsic,,,"(PLP191,PLP192)a",acetylcholine,0.7534,,,,,
1,720575941574697871,74873512908765054,central_brain,right,VPNp1_medial,,intrinsic,central_brain_intrinsic,,,"(PLP191,PLP192)a",acetylcholine,0.7976,,,,,
2,720575941652939029,77477362601861709,central_brain,left,VPNp1_medial,,intrinsic,central_brain_intrinsic,,,"(PLP191,PLP192)a",dopamine,0.5825,,,,,TRACING_ISSUE_2
3,720575941452014202,74310563223910394,central_brain,right,VPNp1_medial,,intrinsic,central_brain_intrinsic,,,"(PLP191,PLP192)a",acetylcholine,0.5704,,,,,
4,720575941565035527,77406993858043384,central_brain,left,VPNp1_medial,,intrinsic,central_brain_intrinsic,,,"(PLP191,PLP192)a",acetylcholine,0.6317,,,,,


### Proofread Neurons (Important Concept!)

This metadata table contains all of the "identified" neurons in the dataset.

You may encounter neuron IDs that are not in this metadata table, e.g., in the synapse table. Those are "fragments" that have not been linked up to full neurons.

Let's get our list of **"proofread" identified neurons**, since these are mainly what we will want for analysis.

In [7]:
# Get proofread neuron IDs
# These are the validated, manually curated neurons in the dataset
proofread_ids = meta_full[DATASET_ID].values

print(f"Number of proofread neurons: {len(proofread_ids):,}")
print(f"Example IDs: {proofread_ids[:5].tolist()}")

Number of proofread neurons: 168,791
Example IDs: ['720575941569192238', '720575941574697871', '720575941652939029', '720575941452014202', '720575941565035527']
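
To show how this list is typically used, here is a minimal sketch with made-up IDs (the toy `synapses` table and its values are hypothetical). The IDs printed above are strings, so membership tests only behave as expected when the synapse table stores IDs with the same dtype.

```python
import pandas as pd

# Toy data for illustration only; real IDs come from the metadata and synapse tables
proofread_ids = ["101", "102", "103"]
synapses = pd.DataFrame({
    "pre":  ["101", "102", "999", "103"],
    "post": ["102", "888", "101", "777"],
})

# Cast both sides to string so an int/str mismatch cannot silently empty the result
pre_ok = synapses["pre"].astype(str).isin(proofread_ids)
post_ok = synapses["post"].astype(str).isin(proofread_ids)

# Keep only synapses where both partners are proofread neurons (no fragments)
both_proofread = synapses[pre_ok & post_ok]
print(both_proofread)
```

This sketch uses the strict `&` condition. The synapse-loading cell later in this tutorial uses the looser `|` condition, which keeps a synapse if either partner is proofread.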


<p align="center">
  <img src="../inst/images/cns.png" alt="Central nervous system structure" width="80%">
</p>

### Example: Filtering Kenyon Cells

[Kenyon cells](https://en.wikipedia.org/wiki/Kenyon_cell) are the principal neurons of the insect [mushroom body](https://en.wikipedia.org/wiki/Mushroom_bodies), forming parallel pathways for associative memory. They integrate multi-sensory (but mostly olfactory) information and can number in the thousands per fly brain.

Let's filter for Kenyon cells:
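
A minimal sketch of such a filter, shown on a toy table so that it runs on its own (the rows are made up; in the tutorial the same expression would be applied to `meta_full`). It matches on the `cell_class` column, which is where `kenyon_cell` appears in the BANC metadata.

```python
import pandas as pd

# Toy stand-in for meta_full; values are illustrative only
meta_toy = pd.DataFrame({
    "banc_746_id": ["1", "2", "3", "4"],
    "cell_class": ["kenyon_cell", "transmedullary", None, "kenyon_cell"],
    "side": ["right", "left", "right", "left"],
})

# na=False treats unlabelled neurons as non-matches instead of propagating NaN
is_kc = meta_toy["cell_class"].str.contains("kenyon_cell", case=False, na=False)
kenyon_cells = meta_toy[is_kc]

print(f"Found {len(kenyon_cells)} Kenyon cells")
print(kenyon_cells["side"].value_counts())
```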

---

## Exploring the Metadata

### Hierarchical Classification

The metadata uses a hierarchical classification system. See the full schema [here](../data/meta_data_entries.csv). It is based largely on the hierarchical scheme developed in [Schlegel et al., 2024](https://pubmed.ncbi.nlm.nih.gov/39358521/).

**Hierarchy:** flow → super_class → cell_class → cell_sub_class → cell_type

Alongside this hierarchy, the metadata carries additional non-hierarchical labels (e.g. `neurotransmitter_predicted`, `nerve`, `hemilineage`).

<p align="center">
  <img src="../inst/images/meta_data_hierarchy_2.png" alt="Meta data hierarchy" width="80%">
</p>

In [8]:
# Use full metadata
meta = meta_full.copy()

# Count neurons by classification level
flow_counts = meta['flow'].value_counts().dropna().sort_values(ascending=False)
super_counts = meta['super_class'].value_counts().dropna().sort_values(ascending=False)
class_counts = meta['cell_class'].value_counts().dropna().sort_values(ascending=False)

print("\nFlow categories:")
print(flow_counts)

print("\nTop 10 super_classes:")
print(super_counts.head(10))

print("\nTop 10 cell_classes:")
print(class_counts.head(10))


Flow categories:
flow
intrinsic    82299
afferent     15462
efferent      1031
Name: count, dtype: int64

Top 10 super_classes:
super_class
optic_lobe_intrinsic            31656
central_brain_intrinsic         29068
sensory                         14940
ventral_nerve_cord_intrinsic    12785
visual_projection                5765
ascending                        1839
descending                       1312
motor                             836
sensory_ascending                 512
visual_centrifugal                449
Name: count, dtype: int64

Top 10 cell_classes:
cell_class
transverse_neuron            9547
transmedullary               5233
bristle_neuron               5188
kenyon_cell                  4316
lamina_monopolar             3747
medulla_intrinsic            3558
lobula_columnar              3250
distal_medulla               2914
single_leg_neuromere         2844
olfactory_receptor_neuron    2170
Name: count, dtype: int64
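
The flow counts above sum to 98,792, well short of the 168,791 rows in the table, because `value_counts()` omits missing labels by default. A small sketch of how one might measure label coverage at each level, using a toy table (values are made up):

```python
import pandas as pd

levels = ["flow", "super_class", "cell_class"]

# Toy stand-in for the metadata table; values are illustrative only
meta_toy = pd.DataFrame({
    "flow": ["intrinsic", "afferent", None, "intrinsic"],
    "super_class": ["central_brain_intrinsic", "sensory", None, None],
    "cell_class": ["kenyon_cell", None, None, None],
})

# Fraction of rows that carry a label at each level of the hierarchy
coverage = meta_toy[levels].notna().mean()
print(coverage)

# dropna=False keeps unlabelled rows as their own category in the counts
flow_counts_all = meta_toy["flow"].value_counts(dropna=False)
print(flow_counts_all)
```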


### Understanding Neuropils

Our synapses have been roughly mapped to "neuropils", which are human-defined regions of the nervous system. These definitions are based on lumps and grooves on the surface of neural tissue and on boundaries in synapse density, but in at least some cases they roughly correlate with functional circuits.

Our brain neuropils are transformed into connectome spaces from the light-level demarcations of [Ito et al., 2014](https://pubmed.ncbi.nlm.nih.gov/24559671/), shown below.

This means that the volumes can be slightly the wrong shape and shifted by a few microns in space. As a consequence, a neuropil-based search can catch neurons that are not actually in the canonical mushroom body calyx.

Neuropils are simply helpful guides through the nervous system, like countries on a map. Countries correlate with geography, but if you want to understand geology you generally ignore their human-made borders. Likewise, in connectomics, neuropils are guides that set your sights on the right location, but real answers come from connectivity and from thinking carefully about your results.

<p align="center">
  <img src="../inst/images/brain_neuropils_ito_et_al_2014.jpg" alt="Brain Neuropils from Ito et al. 2014" width="80%">
</p>

Our ventral nerve cord neuropils come from Court et al. 2020, see [here](https://pubmed.ncbi.nlm.nih.gov/32931755/). See below.

<p align="center">
  <img src="../inst/images/vnc_neuropils_court_et_al_2020.jpg" alt="VNC Neuropils from Court et al. 2020" width="80%">
</p>

---

## Working with Synapse Data

Synapse files can be very large (4-15 GB for full datasets). For this tutorial, we'll use a **pre-filtered subset** focusing on the mushroom body region, which is much faster to load.

### Filtering Mushroom Body Synapses

The [mushroom body](https://en.wikipedia.org/wiki/Mushroom_bodies) (MB) is the insect brain structure for associative learning and memory. The **mushroom body calyx** (MB_CA) is the primary input region of the mushroom body, where [Kenyon cells](https://en.wikipedia.org/wiki/Kenyon_cell) receive olfactory and other sensory information from projection neurons.

We provide pre-filtered synapse data for common brain regions to speed up analysis. For mushroom body, we use: `banc/mushroom_body/banc_746_mushroom_body_synapses.feather`

For more details on mushroom body organization and function, see [Li et al. 2020](https://pubmed.ncbi.nlm.nih.gov/33315010/) and [Aso et al. 2014](https://pubmed.ncbi.nlm.nih.gov/25535793/).

---

## Working with Synapse Data (Optimized Parquet Filtering)

Synapse files are large (4-15 GB). With **Parquet format** and PyArrow's **predicate pushdown**:
- ✓ **Filtering at read time**: filters are applied BEFORE the full data is downloaded
- ✓ **Column selection**: Download only needed columns
- ✓ Works on multi-GB files efficiently

**How predicate pushdown works:**
1. PyArrow applies simple filters (==, !=, <, >, in) at the Parquet file level, using the statistics stored for each row group
2. Only row groups that can contain matching rows are read from disk/GCS

### Filtering Mushroom Body Calyx Synapses

As described above, the **mushroom body calyx** (MB_CA) is the primary input region of the mushroom body, where Kenyon cells receive olfactory and other sensory information from projection neurons.

The cell below loads the pre-filtered mushroom body file and keeps right-side synapses that involve at least one proofread neuron. The commented-out block at the end of the cell shows the slower alternative: reading the full Parquet file with PyArrow filters and then selecting `MB_CA` neuropils.

In [9]:
# Load pre-filtered mushroom body synapses from feather file
# This is MUCH faster than querying the full 4-15 GB Parquet file!
mb_synapses_path = f"{DATA_PATH}/banc/mushroom_body/{DATASET}_mushroom_body_synapses.feather"

print(f"Loading pre-filtered MB synapses from: {mb_synapses_path}")
mb_synapses = read_feather_gcs(mb_synapses_path, gcs_fs=gcs)

# Filter for right side and proofread neurons
mb_synapses = mb_synapses[
    (mb_synapses['side'] == 'right') &
    (mb_synapses['pre'].isin(proofread_ids) | mb_synapses['post'].isin(proofread_ids))
].copy()

print(f"✓ Loaded {len(mb_synapses):,} mushroom body synapses (right side, proofread neurons)")
print(f"Unique presynaptic neurons/fragments: {mb_synapses['pre'].nunique():,}")
print(f"Unique postsynaptic neurons/fragments: {mb_synapses['post'].nunique():,}")

mb_synapses.head()

# NOTE: Alternative approach for querying the full Parquet file
# If you need to work with the full synapse dataset or filter by other regions,
# you can use PyArrow's predicate pushdown so that filtering happens while reading.
# This is commented out because it takes 10-20 minutes to download and filter:
#
# synapse_path = construct_path(DATA_PATH, DATASET, "synapses")  # Full Parquet file
# parquet_path = synapse_path.replace("gs://", "") if USE_GCS else synapse_path
# filesystem = gcs if USE_GCS else None
# 
# # Apply filters at Parquet file level (while reading, before conversion to pandas)
# table = pq.read_table(
#     parquet_path,
#     columns=['id', 'pre', 'post', 'neuropil', 'side'],
#     filters=[('side', '==', 'right')],  # Applied while reading the file
#     filesystem=filesystem
# )
# 
# # Convert to pandas and apply additional filters
# synapses_subset = table.to_pandas()
# mb_synapses = synapses_subset[
#     synapses_subset['neuropil'].str.startswith('MB_CA', na=False) &
#     (synapses_subset['pre'].isin(proofread_ids) | synapses_subset['post'].isin(proofread_ids))
# ].copy()

Loading pre-filtered MB synapses from: gs://sjcabs_2025_data/banc/mushroom_body/banc_746_mushroom_body_synapses.feather
Reading from GCS: gs://sjcabs_2025_data/banc/mushroom_body/banc_746_mushroom_body_synapses.feather


✓ Loaded 2,377,573 rows

✓ Loaded 2,108,201 mushroom body synapses (right side, proofread neurons)
Unique presynaptic neurons/fragments: 80,343
Unique postsynaptic neurons/fragments: 869,700


Unnamed: 0,id,size,pre,post,X,Y,Z,side,region,neuropil,acetylcholine,dopamine,gaba,glutamate,histamine,octopamine,serotonin,tyramine,syn_top_nt,syn_top_p
0,135533439,3.0,720575941689127692,720575941043230364,479408.0,72496.0,192060.0,right,central_brain,CRE,,,,,,,,,,
1,64728984,24.0,720575941613035437,720575941460130412,367795.0,181656.0,234232.0,right,central_brain,MB_CA,14978.0,382.0,12018.0,10818.0,9625.0,356.0,8020.0,868.0,acetylcholine,14978.0
2,75063931,18.0,720575941612228454,720575940537043974,392764.0,183918.0,215990.0,right,central_brain,MB_CA,14381.0,14058.0,8843.0,8249.0,50.0,2317.0,9772.0,4340.0,acetylcholine,14381.0
3,108085750,16.0,720575941542533405,720575941591715838,448560.0,55816.0,164509.0,right,central_brain,CRE,14540.0,9486.0,13525.0,10291.0,5294.0,7497.0,10242.0,8249.0,acetylcholine,14540.0
4,115387133,4.0,720575941626148042,720575941519202732,458704.0,73792.0,113625.0,right,central_brain,AL,,,,,,,,,,


### Identify Mushroom Body Calyx Neurons

Define MB calyx neurons as those with ≥100 synapses (inputs or outputs) within the MB calyx.

As noted before, neuropils are just guides: the real way to define calyx neurons is as neurons that connect with the dendrites of Kenyon cells. So we will add that filter as well.

In [10]:
# Get Kenyon cell IDs
kc_ids = meta[meta['cell_class'].str.contains('kenyon_cell', case=False, na=False)][DATASET_ID].values
print(f"Found {len(kc_ids):,} Kenyon cells")

# Count outputs per neuron (filter for proofread AND connections to/from Kenyon cells)
mb_outputs = (
    mb_synapses[
        mb_synapses['pre'].isin(proofread_ids) &
        (mb_synapses['pre'].isin(kc_ids) | mb_synapses['post'].isin(kc_ids))
    ]
    .groupby('pre')
    .size()
    .reset_index(name='n_outputs')
    .query('n_outputs >= 100')
)

# Count inputs per neuron (filter for proofread AND connections to/from Kenyon cells)
mb_inputs = (
    mb_synapses[
        mb_synapses['post'].isin(proofread_ids) &
        (mb_synapses['pre'].isin(kc_ids) | mb_synapses['post'].isin(kc_ids))
    ]
    .groupby('post')
    .size()
    .reset_index(name='n_inputs')
    .query('n_inputs >= 100')
)

# Combine to get all MB calyx neurons
mb_neurons = pd.unique(pd.concat([mb_outputs['pre'], mb_inputs['post']]))
print(f"Neurons with ≥100 synapses in MB calyx (connecting to/from Kenyon cells): {len(mb_neurons):,}")

# Get metadata for MB calyx neurons
mb_meta = meta[meta[DATASET_ID].isin(mb_neurons)].copy()

# Check how many are Kenyon cells
n_kc = mb_meta['cell_class'].str.contains('kenyon_cell', case=False, na=False).sum()
n_other = len(mb_meta) - n_kc

print(f"  Kenyon cells: {n_kc:,}")
print(f"  Other neurons: {n_other:,}")

Found 4,316 Kenyon cells


Neurons with ≥100 synapses in MB calyx (connecting to/from Kenyon cells): 1,843
  Kenyon cells: 1,685
  Other neurons: 158
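
The preview of `mb_synapses` shown earlier includes rows from other neuropils (`CRE`, `AL`), so the counts in this cell appear to cover the whole pre-filtered mushroom body file, not just the calyx. If calyx-only counts are wanted, one option is to restrict on the `neuropil` column first. A sketch on toy data (values are made up; the `MB_CA` prefix match follows the commented-out Parquet example):

```python
import pandas as pd

# Toy stand-in for mb_synapses; values are illustrative only
syn_toy = pd.DataFrame({
    "pre": ["p1", "p1", "p2", "p1", "p3"],
    "post": ["k1", "k2", "k1", "k3", "k1"],
    "neuropil": ["MB_CA", "MB_CA", "CRE", "MB_CA", None],
})
min_synapses = 2  # the tutorial uses 100; lowered here to suit the toy table

# Keep calyx synapses only; na=False drops rows without a neuropil label
calyx = syn_toy[syn_toy["neuropil"].str.startswith("MB_CA", na=False)]

# Count presynaptic sites per neuron inside the calyx and apply the threshold
n_outputs = calyx.groupby("pre").size()
calyx_neurons = n_outputs[n_outputs >= min_synapses].index.tolist()
print(calyx_neurons)
```

Applying the same restriction to the real data would change the neuron counts reported above.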


### Characterize Non-Kenyon MB Calyx Neurons

What other neuron types are present in the mushroom body calyx?

In [11]:
# Prepare data - focus on non-Kenyon cells
mb_meta_clean = mb_meta.copy()
mb_meta_clean['super_class'] = mb_meta_clean['super_class'].fillna('other')
mb_meta_clean['cell_class'] = mb_meta_clean['cell_class'].fillna('other')
mb_meta_clean['cell_sub_class'] = mb_meta_clean['cell_sub_class'].fillna('other')
mb_meta_clean['is_kenyon'] = mb_meta_clean['cell_class'].str.contains('kenyon_cell', case=False, na=False)

# Filter out Kenyon cells
mb_non_kc = mb_meta_clean[~mb_meta_clean['is_kenyon']]

# Count by classification levels
super_counts = mb_non_kc['super_class'].value_counts().sort_values(ascending=True)
class_counts = mb_non_kc['cell_class'].value_counts().head(15).sort_values(ascending=True)
subclass_counts = mb_non_kc['cell_sub_class'].value_counts().head(15).sort_values(ascending=True)

# Create subplots
fig = make_subplots(
    rows=3, cols=1,
    subplot_titles=('Super Class', 'Cell Class (Top 15)', 'Cell Sub-Class (Top 15)'),
    vertical_spacing=0.12,
    specs=[[{"type": "bar"}], [{"type": "bar"}], [{"type": "bar"}]]
)

# Super class
fig.add_trace(
    go.Bar(
        y=super_counts.index,
        x=super_counts.values,
        orientation='h',
        marker_color='lightblue',
        text=super_counts.values,
        textposition='outside',
        hovertemplate='<b>%{y}</b><br>Count: %{x}<extra></extra>'
    ),
    row=1, col=1
)

# Cell class
fig.add_trace(
    go.Bar(
        y=class_counts.index,
        x=class_counts.values,
        orientation='h',
        marker_color='lightgreen',
        text=class_counts.values,
        textposition='outside',
        hovertemplate='<b>%{y}</b><br>Count: %{x}<extra></extra>'
    ),
    row=2, col=1
)

# Cell sub-class
fig.add_trace(
    go.Bar(
        y=subclass_counts.index,
        x=subclass_counts.values,
        orientation='h',
        marker_color='lightcoral',
        text=subclass_counts.values,
        textposition='outside',
        hovertemplate='<b>%{y}</b><br>Count: %{x}<extra></extra>'
    ),
    row=3, col=1
)

fig.update_layout(
    title_text=f"Non-Kenyon MB Calyx Neurons: {DATASET}",
    title_x=0.5,
    title_xanchor='center',
    height=1200,
    showlegend=False,
    template="plotly_white"
)

fig.update_xaxes(title_text="Count")

fig.show()

---

## Summary

In this tutorial you learned how to:

✓ Access connectome data from Google Cloud Storage or local paths  
✓ Use PyArrow's predicate pushdown for efficient GCS/Parquet queries  
✓ Load small metadata files (`.feather`) into memory  
✓ Explore neuron metadata and hierarchical classifications  
✓ Filter synapse data by brain region using string matching  
✓ Identify and characterise neurons by connectivity patterns  
✓ Create publication-quality interactive visualisations  
✓ Compare datasets to identify biological vs technical variation

**Next tutorial:** [02_neuron_morphology.ipynb](02_neuron_morphology.ipynb) - Load and visualise 3D neuron skeletons

---

## Session Information

### Summary Visualization

Create a summary comparing Kenyon vs non-Kenyon MB calyx neurons:

In [12]:
# Prepare summary data
mb_summary = pd.DataFrame({
    'Category': ['Kenyon Cells', 'Other MB Neurons'],
    'Count': [n_kc, n_other]
})

mb_summary['Percentage'] = (mb_summary['Count'] / mb_summary['Count'].sum() * 100).round(1)

# Create pie chart with plotly
fig = go.Figure()

fig.add_trace(go.Pie(
    labels=mb_summary['Category'],
    values=mb_summary['Count'],
    text=mb_summary.apply(lambda x: f"{x['Count']:,}<br>({x['Percentage']:.1f}%)", axis=1),
    textposition='inside',
    textfont=dict(size=14, color='white'),
    marker=dict(
        colors=['#E69F00', '#56B4E9'],
        line=dict(color='white', width=2)
    ),
    hovertemplate='<b>%{label}</b><br>Count: %{value:,}<br>Percentage: %{percent}<extra></extra>'
))

fig.update_layout(
    title=dict(
        text=f"Mushroom Body Calyx Neurons: {DATASET}<br><sub>Neurons with ≥100 synapses in MB calyx</sub>",
        x=0.5,
        xanchor='center'
    ),
    height=500,
    template="plotly_white"
)

fig.show()

# Print summary
print("\nMushroom Body Calyx Summary:")
print(mb_summary.to_string(index=False))


Mushroom Body Calyx Summary:
        Category  Count  Percentage
    Kenyon Cells   1685        91.4
Other MB Neurons    158         8.6


---

## Summary

In this tutorial, we covered:

1. **Data Access**: Loading connectome data from GCS or local files
2. **Metadata Exploration**: Understanding the hierarchical classification system
3. **Proofread Neurons**: Important concept for filtering validated neurons
4. **Efficient Data Loading**: Using PyArrow and Parquet for large synapse files
5. **Data Analysis**: Filtering and characterizing mushroom body neurons
6. **Visualization**: Creating interactive plots with Plotly

### Key Takeaways

- **Always filter for proofread neurons** when analyzing connectivity
- **Use Parquet for large files** - it supports efficient filtering and column selection
- **Stream from GCS** when possible to avoid large downloads
- **Interactive plots** with Plotly make exploration easier

---

## Next Steps

Try exploring other datasets or neuropils:
- Change `DATASET` to explore other connectomes (fafb_783, manc_121, etc.)
- Filter for different neuropils (e.g., 'AL' for antennal lobe, 'LH' for lateral horn)
- Analyze different cell types or neurotransmitters
- Look at Tutorial 02 for neuron morphology analysis
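
When switching datasets, `DATASET_ID` needs to change along with `DATASET`. A minimal sketch of deriving both together: `dataset_config` is a hypothetical helper, and the `<dataset>_id` column pattern is an assumption based on the BANC example (`banc_746_id`), so check the metadata columns of other datasets before relying on it.

```python
def dataset_config(dataset):
    """Derive related settings from a dataset string such as "banc_746".

    Hypothetical helper; the "<dataset>_id" pattern is assumed from the BANC example.
    """
    name, _, version = dataset.partition("_")
    if not name or not version:
        raise ValueError(f"Expected '<name>_<version>', got: {dataset!r}")
    return {
        "dataset": dataset,
        "name": name,          # e.g. "banc", used as the sub-directory name
        "version": version,    # e.g. "746"
        "id_column": f"{dataset}_id",
    }


config = dataset_config("fafb_783")
print(config["id_column"])
```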

---

**Tutorial complete!** 🎉