# Tutorial 01: Data Access

**Author:** Alexander Bates  
**Python Version**

---

## Introduction

This tutorial covers how to access a curated collection of connectome data from the major *Drosophila* connectome projects. Our curated data includes:

- **Whole-system datasets**: maleCNS, BANC (brain + ventral nerve cord)
- **Brain-only datasets**: FAFB, Hemibrain
- **VNC-only dataset**: MANC

This Python tutorial demonstrates:
- Direct GCS data access using `gcsfs` and `pyarrow`
- Efficient streaming of large Parquet files
- Interactive visualizations with Plotly
- Metadata exploration and filtering
- Synapse data analysis with lazy loading

---

## Setup and Configuration

In [1]:
# Dataset configuration
# Options: "banc_746", "fafb_783", "manc_121", "hemibrain_121", "malecns_09"
DATASET = "banc_746"
DATASET_ID = "banc_746_id"

# Data location - can be GCS bucket or local path
# Option 1 (GCS - default): Access data directly from Google Cloud Storage
DATA_PATH = "gs://brain-and-nerve-cord_exports/sjcabs_data"

# Option 2 (Local): Use local copy if you've downloaded the data
# DATA_PATH = "/path/to/local/sjcabs_data"
# Example: DATA_PATH = "~/data/sjcabs_data"

# Detect if using GCS or local path
USE_GCS = DATA_PATH.startswith("gs://")

# Image output directory
import os
IMG_DIR = "images/tutorial_01"
os.makedirs(IMG_DIR, exist_ok=True)

print(f"Dataset: {DATASET}")
print(f"Data location: {DATA_PATH}")
print(f"Using GCS: {USE_GCS}")
print(f"Images will be saved to: {IMG_DIR}")

Dataset: banc_746
Data location: gs://brain-and-nerve-cord_exports/sjcabs_data
Using GCS: True
Images will be saved to: images/tutorial_01
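
The configuration cell sets `DATASET_ID` by hand, so switching `DATASET` means editing two lines. Below is a minimal sketch of deriving one from the other. It assumes every dataset follows the `<dataset>_id` naming seen for `banc_746`, which this tutorial does not verify for the other connectomes; the helper name `dataset_id_column` is hypothetical.

```python
# Dataset options listed in the configuration cell
KNOWN_DATASETS = ["banc_746", "fafb_783", "manc_121", "hemibrain_121", "malecns_09"]


def dataset_id_column(dataset):
    """Return the assumed ID column name for a dataset (hypothetical helper).

    Assumption: the ID column is always "<dataset>_id", as it is for banc_746.
    """
    if dataset not in KNOWN_DATASETS:
        raise ValueError(f"Unknown dataset: {dataset!r}. Choose from: {KNOWN_DATASETS}")
    return f"{dataset}_id"


DATASET = "banc_746"
DATASET_ID = dataset_id_column(DATASET)
print(DATASET_ID)
```

If a dataset turns out to use a different ID column name, checking `meta_full.columns` after loading the metadata is the reliable way to find it.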


## Import Packages

We'll use:
- **pandas**: Data manipulation and analysis
- **pyarrow**: Efficient reading of Feather and Parquet files
- **gcsfs**: Google Cloud Storage filesystem interface
- **plotly**: Interactive visualizations
- **navis**: Neuron analysis tools (for future tutorials)

In [2]:
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq
import gcsfs
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import re
from pathlib import Path

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)

print("✓ Packages loaded successfully")
print(f"Package versions: {pd.__version__} (pandas), {pa.__version__} (pyarrow)")

✓ Packages loaded successfully
Package versions: 2.3.3 (pandas), 22.0.0 (pyarrow)


## Data Location Options

This tutorial supports two data access modes:

### Option 1: Direct GCS Access (Default)
Access data directly from Google Cloud Storage - no manual download required!
- **Pros:** No local storage needed, always up-to-date
- **Cons:** Slower for large files, requires authentication & internet

**GCS bucket location:** `gs://brain-and-nerve-cord_exports/sjcabs_data/`

### Option 2: Local Access (Faster for Repeated Use)
Download data once with `gsutil`, then access locally:

```bash
# Download specific dataset (e.g., BANC metadata + synapses)
gsutil -m cp gs://brain-and-nerve-cord_exports/sjcabs_data/banc/banc_746_meta.feather ~/data/sjcabs_data/banc/
gsutil -m cp gs://brain-and-nerve-cord_exports/sjcabs_data/banc/banc_746_synapses.parquet ~/data/sjcabs_data/banc/

# Or download entire dataset directory
gsutil -m cp -r gs://brain-and-nerve-cord_exports/sjcabs_data/banc ~/data/sjcabs_data/
```

Then update the configuration cell:
```python
DATA_PATH = "~/data/sjcabs_data"  # Use your local path
```

---

## Setup GCS Access

**Authentication required:** Before running this tutorial with GCS, authenticate with Google Cloud:

```bash
# Install gcloud CLI if you haven't already:
# https://cloud.google.com/sdk/docs/install

# Authenticate with your Google account
gcloud auth application-default login
```

This creates credentials that `gcsfs` will use automatically.

In [3]:
# Setup GCS filesystem if needed
if USE_GCS:
    print("Setting up Google Cloud Storage access...")
    gcs = gcsfs.GCSFileSystem(token='google_default')
    print("✓ GCS filesystem initialized")
else:
    gcs = None
    print("Using local filesystem")

Setting up Google Cloud Storage access...
✓ GCS filesystem initialized


## Helper Functions

Define utility functions for path construction and data reading:

In [4]:
def construct_path(data_root, dataset, file_type="meta", space_suffix=None):
    """
    Construct file paths for dataset files.
    Note: Skeleton files do not include version numbers in their filenames.
    
    Parameters
    ----------
    data_root : str
        Root data directory (can be gs:// or local path)
    dataset : str
        Dataset name with version (e.g., "banc_746")
    file_type : str
        Type of file: "meta", "synapses", "edgelist", "skeletons"
    space_suffix : str, optional
        Space name for skeletons (defaults to native space, e.g., "banc_space")
    
    Returns
    -------
    str
        Full path to the file
    """
    # Extract dataset name (e.g., "banc" from "banc_746")
    dataset_name = dataset.split("_")[0]
    
    # Determine file extension
    extensions = {
        "meta": ".feather",
        "synapses": ".parquet",
        "edgelist": ".feather",
        "edgelist_simple": ".feather",
        "skeletons": ""  # No extension - it's a directory
    }
    
    if file_type not in extensions:
        raise ValueError(f"Unknown file_type: {file_type}. Choose from: {list(extensions.keys())}")
    
    extension = extensions[file_type]
    
    # Construct filename
    if file_type == "skeletons":
        # Skeleton files don't include version number and have specific naming
        # Pattern: {dataset_name}_{space_name}_[l2_]swc (directory)
        # e.g., banc_banc_space_l2_swc/, fafb_fafb_space_swc/
        
        # Default space is the native space for the dataset
        if space_suffix is None:
            space_suffix = f"{dataset_name}_space"
        
        # BANC uses l2 skeletons, others don't
        if dataset_name == "banc":
            filename = f"{dataset_name}_{space_suffix}_l2_swc{extension}"
        else:
            filename = f"{dataset_name}_{space_suffix}_swc{extension}"
    else:
        # Other file types include the full dataset name with version
        filename = f"{dataset}_{file_type}{extension}"
    
    # Combine into full path
    full_path = f"{data_root}/{dataset_name}/{filename}"
    
    return full_path

def read_feather_gcs(path, gcs_fs=None):
    """
    Read Feather file from GCS or local path.
    
    Parameters
    ----------
    path : str
        Path to Feather file (can be gs:// or local)
    gcs_fs : gcsfs.GCSFileSystem, optional
        GCS filesystem object (required for GCS paths)
    
    Returns
    -------
    pd.DataFrame
        Loaded data
    """
    if path.startswith("gs://"):
        if gcs_fs is None:
            raise ValueError("gcs_fs required for GCS paths")
        
        print(f"Reading from GCS: {path}")
        
        # Strip gs:// prefix for gcsfs
        gcs_path = path.replace("gs://", "")
        
        with gcs_fs.open(gcs_path, 'rb') as f:
            df = feather.read_feather(f)
        
        print(f"✓ Loaded {len(df):,} rows")
        return df
    else:
        print(f"Reading from local path: {path}")
        df = pd.read_feather(path)
        print(f"✓ Loaded {len(df):,} rows")
        return df


def save_plot(fig, name, width=1200, height=600):
    """
    Save plotly figure to image file.
    
    Parameters
    ----------
    fig : plotly.graph_objects.Figure
        Plotly figure to save
    name : str
        Filename (without extension)
    width : int
        Width in pixels
    height : int
        Height in pixels
    """
    filename = os.path.join(IMG_DIR, f"{name}.png")
    fig.write_image(filename, width=width, height=height, scale=2)
    print(f"✓ Saved plot to {filename}")


print("✓ Helper functions defined")

✓ Helper functions defined
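
Skeleton paths follow a different naming rule from the other file types, so it can help to see what the helper produces for them. The sketch below restates only the skeleton branch of `construct_path` as a standalone function so that it runs on its own. The name `skeleton_dir` is hypothetical and not part of the tutorial's helpers, and whether these directories exist in the bucket for every dataset is not checked here.

```python
def skeleton_dir(data_root, dataset, space_suffix=None):
    """Condensed restatement of the skeleton branch of construct_path."""
    # "banc" from "banc_746"; skeleton names carry no version number
    dataset_name = dataset.split("_")[0]
    # Default to the dataset's native space, e.g. "banc_space"
    if space_suffix is None:
        space_suffix = f"{dataset_name}_space"
    # BANC uses l2 skeletons, the other datasets do not
    l2 = "_l2" if dataset_name == "banc" else ""
    return f"{data_root}/{dataset_name}/{dataset_name}_{space_suffix}{l2}_swc"


root = "gs://brain-and-nerve-cord_exports/sjcabs_data"
print(skeleton_dir(root, "banc_746"))
print(skeleton_dir(root, "fafb_783"))
```

The two printed paths end in `banc_banc_space_l2_swc` and `fafb_fafb_space_swc`, matching the patterns given in the helper's comments.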


## Setup File Paths

Construct paths to metadata and synapse files:

In [5]:
# Construct file paths
meta_path = construct_path(DATA_PATH, DATASET, "meta")
synapse_path = construct_path(DATA_PATH, DATASET, "synapses")

print("File paths:")
print(f"  Metadata: {meta_path}")
print(f"  Synapses: {synapse_path}")

File paths:
  Metadata: gs://brain-and-nerve-cord_exports/sjcabs_data/banc/banc_746_meta.feather
  Synapses: gs://brain-and-nerve-cord_exports/sjcabs_data/banc/banc_746_synapses.parquet


---

## Reading Connectome Metadata

### Understanding the File Formats

Our data files use two Apache Arrow formats:
- **Feather** (`.feather`) for metadata - smaller files (~10 MB), loaded entirely into memory
- **Parquet** (`.parquet`) for synapses - large files (4-15 GB), supports lazy loading and predicate pushdown

**Why Parquet for synapses?**
- ✓ Column selection: download only needed columns
- ✓ Row filtering: skip row groups that cannot match a filter, using statistics stored in the file, before downloading them
- ✓ Compression: smaller file sizes
- ✓ Efficient for analytical queries on large datasets

### Load Metadata

For metadata (~10 MB), we can load the entire dataset into memory:

In [6]:
# Read metadata into memory
meta_full = read_feather_gcs(meta_path, gcs_fs=gcs)

print(f"\nDataset: {DATASET}")
print(f"Total neurons: {len(meta_full):,}")
print(f"Metadata columns: {len(meta_full.columns)}")
print(f"\nAvailable columns:")
print(meta_full.columns.tolist())

# Display first few rows
meta_full.head()

Reading from GCS: gs://brain-and-nerve-cord_exports/sjcabs_data/banc/banc_746_meta.feather


✓ Loaded 168,793 rows

Dataset: banc_746
Total neurons: 168,793
Metadata columns: 18

Available columns:
['banc_746_id', 'supervoxel_id', 'region', 'side', 'hemilineage', 'nerve', 'flow', 'super_class', 'cell_class', 'cell_sub_class', 'cell_type', 'neurotransmitter_predicted', 'neurotransmitter_score', 'cell_function', 'cell_function_detailed', 'body_part_sensory', 'body_part_effector', 'status']


| | banc_746_id | supervoxel_id | region | side | hemilineage | nerve | flow | super_class | cell_class | cell_sub_class | cell_type | neurotransmitter_predicted | neurotransmitter_score | cell_function | cell_function_detailed | body_part_sensory | body_part_effector | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 720575941569192238 | 74803281603754231 | central_brain | right | VPNp1_medial | | intrinsic | central_brain_intrinsic | | | (PLP191,PLP192)a | acetylcholine | 0.7534 | | | | | |
| 1 | 720575941574697871 | 74873512908765054 | central_brain | right | VPNp1_medial | | intrinsic | central_brain_intrinsic | | | (PLP191,PLP192)a | acetylcholine | 0.7976 | | | | | |
| 2 | 720575941652939029 | 77477362601861709 | central_brain | left | VPNp1_medial | | intrinsic | central_brain_intrinsic | | | (PLP191,PLP192)a | dopamine | 0.5825 | | | | | TRACING_ISSUE_2 |
| 3 | 720575941452014202 | 74310563223910394 | central_brain | right | VPNp1_medial | | intrinsic | central_brain_intrinsic | | | (PLP191,PLP192)a | acetylcholine | 0.5704 | | | | | |
| 4 | 720575941565035527 | 77406993858043384 | central_brain | left | VPNp1_medial | | intrinsic | central_brain_intrinsic | | | (PLP191,PLP192)a | acetylcholine | 0.6317 | | | | | |


### Proofread Neurons (Important Concept!)

This metadata table contains all of the "identified" neurons in the dataset.

You may encounter neuron IDs that are not in this metadata table, e.g., in the synapse table. Those are "fragments" that have not been linked up to full neurons.

Let's get our list of **"proofread" identified neurons**, since these are the ones we will mainly want for analysis.

In [7]:
# Get proofread neuron IDs
# These are the validated, manually curated neurons in the dataset
proofread_ids = meta_full[DATASET_ID].values

print(f"Number of proofread neurons: {len(proofread_ids):,}")
print(f"Example IDs: {proofread_ids[:5].tolist()}")

Number of proofread neurons: 168,793
Example IDs: ['720575941569192238', '720575941574697871', '720575941652939029', '720575941452014202', '720575941565035527']
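
The example IDs above are strings. When matching them against another table, the ID columns need to share a dtype, because `isin` does not treat the integer `101` and the string `"101"` as equal. The sketch below uses made-up IDs, and the integer-typed `synapse_pre` column is a hypothetical case for illustration, not a statement about how the synapse file stores its IDs.

```python
import pandas as pd

proofread = pd.Series(["101", "102", "103"])  # string IDs, as in the metadata
synapse_pre = pd.Series([101, 102, 999])      # hypothetical integer IDs

# Mismatched dtypes: nothing matches, and no error is raised
n_naive = int(synapse_pre.isin(proofread).sum())

# Cast to a common dtype before comparing
n_cast = int(synapse_pre.astype(str).isin(proofread).sum())

print(n_naive, n_cast)
```

A silent zero-match result is easy to miss, so it is worth checking `dtype` on both ID columns before filtering.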


### Example: Filtering Kenyon Cells

[Kenyon cells](https://en.wikipedia.org/wiki/Kenyon_cell) are the principal neurons of the insect [mushroom body](https://en.wikipedia.org/wiki/Mushroom_bodies), forming parallel pathways for associative memory. They integrate multi-sensory (but mostly olfactory) information and can number in the thousands per fly brain.

Let's filter for Kenyon cells:

In [8]:
# Filter for Kenyon cells
kenyon_cells = meta_full[meta_full['cell_class'].str.contains('kenyon_cell', case=False, na=False)]

print(f"Found {len(kenyon_cells):,} Kenyon cells in {DATASET}")
kenyon_cells.head()

Found 4,316 Kenyon cells in banc_746


| | banc_746_id | supervoxel_id | region | side | hemilineage | nerve | flow | super_class | cell_class | cell_sub_class | cell_type | neurotransmitter_predicted | neurotransmitter_score | cell_function | cell_function_detailed | body_part_sensory | body_part_effector | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38440 | 720575941470693643 | 75576719448477120 | central_brain | right | | | intrinsic | central_brain_intrinsic | kenyon_cell | KCa'b' | KCa'b'-ap1 | acetylcholine | 0.8895 | | | | | TRACING_ISSUE_MISSING_AXON |
| 38441 | 720575941442029659 | 75225769215359882 | central_brain | right | | | intrinsic | central_brain_intrinsic | kenyon_cell | KCa'b' | KCa'b'-ap1 | | 0.0 | | | | | TRACING_ISSUE_MISSING_AXON |
| 38442 | 720575941466804246 | 75225700563016133 | central_brain | right | | | intrinsic | central_brain_intrinsic | kenyon_cell | KCa'b' | KCa'b'-ap2 | acetylcholine | 0.8534 | | | | | TRACING_ISSUE_MISSING_AXON |
| 38443 | 720575941639255669 | 75436875447646395 | central_brain | right | | | intrinsic | central_brain_intrinsic | kenyon_cell | KCa'b' | KCa'b'-ap2 | acetylcholine | 0.8235 | | | | | TRACING_ISSUE_MISSING_AXON |
| 38444 | 720575941518579054 | 75296000520443547 | central_brain | right | | | intrinsic | central_brain_intrinsic | kenyon_cell | KCa'b' | KCa'b'-ap2 | acetylcholine | 0.8723 | | | | | TRACING_ISSUE_MISSING_AXON |


---

## Exploring the Metadata

### Hierarchical Classification

The metadata uses a hierarchical classification system.

**Hierarchy:** flow → super_class → cell_class → cell_sub_class → cell_type

In [9]:
# Use full metadata
meta = meta_full.copy()

# Count neurons by classification level
flow_counts = meta['flow'].value_counts().dropna().sort_values(ascending=False)
super_counts = meta['super_class'].value_counts().dropna().sort_values(ascending=False)
class_counts = meta['cell_class'].value_counts().dropna().sort_values(ascending=False)

print("\nFlow categories:")
print(flow_counts)

print("\nTop 10 super_classes:")
print(super_counts.head(10))

print("\nTop 10 cell_classes:")
print(class_counts.head(10))


Flow categories:
flow
intrinsic    82298
afferent     15462
efferent      1031
Name: count, dtype: int64

Top 10 super_classes:
super_class
optic_lobe_intrinsic            31655
central_brain_intrinsic         29069
sensory                         14939
ventral_nerve_cord_intrinsic    12785
visual_projection                5765
ascending                        1839
descending                       1312
motor                             836
sensory_ascending                 512
visual_centrifugal                449
Name: count, dtype: int64

Top 10 cell_classes:
cell_class
transverse_neuron            9547
transmedullary               5233
bristle_neuron               5188
kenyon_cell                  4316
lamina_monopolar             3747
medulla_intrinsic            3558
lobula_columnar              3250
distal_medulla               2914
single_leg_neuromere         2844
olfactory_receptor_neuron    2170
Name: count, dtype: int64
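
The per-level counts above treat each classification level separately. To see how the levels nest, one option is to group by several levels at once. The sketch below uses a small synthetic table with made-up rows whose labels are borrowed from the output above; it is an illustration, not a result from the dataset.

```python
import pandas as pd

# Synthetic rows using labels that appear in the metadata (made-up counts)
toy = pd.DataFrame({
    "flow": ["intrinsic", "intrinsic", "intrinsic", "afferent", None],
    "super_class": ["central_brain_intrinsic", "central_brain_intrinsic",
                    "optic_lobe_intrinsic", "sensory", None],
    "cell_class": ["kenyon_cell", "kenyon_cell", "transmedullary",
                   "bristle_neuron", None],
})

# Count neurons for each observed path through the hierarchy;
# rows with a missing level are dropped
counts = (
    toy.groupby(["flow", "super_class", "cell_class"], dropna=True)
    .size()
    .reset_index(name="n_neurons")
    .sort_values("n_neurons", ascending=False)
)

print(counts.to_string(index=False))
```

On the real metadata the same pattern would be applied to `meta`, optionally extending the grouping to `cell_sub_class` and `cell_type`.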


### Neurotransmitter Distribution

Neurotransmitter predictions are based on [Eckstein & Bates et al. (2024) *Cell*](https://doi.org/10.1016/j.cell.2024.03.016).

Let's visualize the distribution using Plotly:

In [10]:
# Count neurotransmitter predictions
nt_counts = meta['neurotransmitter_predicted'].value_counts().dropna().sort_values(ascending=True)

# Create interactive bar plot with Plotly
fig = go.Figure()

fig.add_trace(go.Bar(
    y=nt_counts.index,
    x=nt_counts.values,
    orientation='h',
    marker=dict(
        color=nt_counts.values,
        colorscale='Viridis',
        showscale=False
    ),
    text=nt_counts.values,
    textposition='outside',
    texttemplate='%{text:,}',
    hovertemplate='<b>%{y}</b><br>Count: %{x:,}<extra></extra>'
))

fig.update_layout(
    title=dict(
        text=f"Neurotransmitter Predictions: {DATASET}<br><sub>Based on Eckstein & Bates et al. (2024)</sub>",
        x=0.5,
        xanchor='center'
    ),
    xaxis_title="Number of Neurons",
    yaxis_title="Predicted Neurotransmitter",
    height=400,
    template="plotly_white",
    font=dict(size=12),
    margin=dict(l=150, r=100, t=100, b=50)
)

# Save plot (static image export requires the kaleido package)
try:
    save_plot(fig, "neurotransmitter_distribution", width=1000, height=500)
except Exception as e:
    print(f"Note: Could not save plot (likely missing kaleido package): {e}")
    print("To enable saving: pip install -U kaleido")

fig.show()

✓ Saved plot to images/tutorial_01/neurotransmitter_distribution.png


### Understanding Neuropils

Our synapses have been roughly mapped to "neuropils", which are human-defined regions of the nervous system. These boundaries are based on lumps and grooves on the surface of neural tissue and on changes in synapse density, and they correlate only roughly, and only in some cases, with functional circuits.

Our brain neuropils are transformed into connectome spaces from the light-level demarcations of [Ito et al. (2014)](https://pubmed.ncbi.nlm.nih.gov/24559671/), shown below.

<p align="center">
  <img src="../inst/images/brain_neuropils_ito_et_al_2014.jpg" alt="Brain Neuropils from Ito et al. 2014" width="80%">
</p>

Our ventral nerve cord neuropils come from [Court et al. (2020)](https://pubmed.ncbi.nlm.nih.gov/32931755/), shown below.

<p align="center">
  <img src="../inst/images/vnc_neuropils_court_et_al_2020.jpg" alt="VNC Neuropils from Court et al. 2020" width="80%">
</p>

---

## Working with Synapse Data (Optimized Parquet Filtering)

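Synapse files are large (4-15 GB). With **Parquet format** and PyArrow's **predicate pushdown**:
- ✓ **Filtering at read time**: PyArrow applies filters while reading, so row groups that cannot match are skipped before their data is downloaded
- ✓ **Column selection**: Download only needed columns
- ✓ **Less data to handle**: filtering for `side == 'right'` at read time keeps roughly half of the rows
- ✓ Works on multi-GB files efficiently

**How predicate pushdown works:**
1. PyArrow applies simple filters (==, !=, <, >, in) at the Parquet file level, inside the reader rather than on the storage server
2. Row groups whose stored statistics rule out a match are not read from disk/GCS; rows in the remaining row groups are filtered as they are decoded
3. For one-sided queries this roughly halves the rows returned; how much data transfer is saved depends on how the file's row groups are organized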

### Filtering Mushroom Body Calyx Synapses

The [mushroom body](https://en.wikipedia.org/wiki/Mushroom_bodies) (MB) is the insect brain structure for associative learning and memory. The **mushroom body calyx** (MB_CA) is the primary input region of the mushroom body, where [Kenyon cells](https://en.wikipedia.org/wiki/Kenyon_cell) receive olfactory and other sensory information from projection neurons. For performance, we'll focus on the calyx rather than the entire mushroom body structure.

For more details on mushroom body organization and function, see [Li et al. 2020](https://pubmed.ncbi.nlm.nih.gov/33315010/) and [Aso et al. 2014](https://pubmed.ncbi.nlm.nih.gov/25535793/).

Let's extract MB calyx synapses using **optimized PyArrow filtering**:

In [11]:
print("\n=== OPTIMIZED PARQUET FILTERING ===")
print(f"\nReading synapses from: {synapse_path}")
print("(Using PyArrow predicate pushdown for server-side filtering)\n")

# Setup filesystem for reading
if USE_GCS:
    # Strip gs:// prefix for PyArrow
    parquet_path = synapse_path.replace("gs://", "")
    filesystem = gcs
else:
    parquet_path = synapse_path
    filesystem = None

print("Step 1: Applying PyArrow filters at read time...")
print("  - Filtering for side == 'right' (server-side)")
print("  - Selecting only needed columns: ['id', 'pre', 'post', 'neuropil', 'side']")
print("  This significantly reduces data transfer!")
print()

# Use PyArrow filters for predicate pushdown
# Filter for side == 'right' BEFORE reading data (huge speedup!)
filters = [
    ('side', '==', 'right')
]

# Read with filters applied at parquet file level
# This means only matching rows are read from disk/network
import time
start_time = time.time()

table = pq.read_table(
    parquet_path,
    columns=['id', 'pre', 'post', 'neuropil', 'side'],
    filters=filters,  # Applied at file read level!
    filesystem=filesystem
)

read_time = time.time() - start_time
print(f"✓ Read complete in {read_time:.1f}s")
print(f"  Rows after server-side filtering: {len(table):,}")
print()

print("Step 2: Converting to pandas and applying remaining filters...")
# Convert to pandas (now with ~50% fewer rows)
synapses_subset = table.to_pandas()

# Apply remaining filters in pandas
# (the tuple-based filter syntax has no "startswith" operator, so these are simpler to express here)
print("  - Filtering for neuropil starts with 'MB_CA'")
print("  - Filtering for proofread neurons only")

mb_synapses = synapses_subset[
    synapses_subset['neuropil'].str.startswith('MB_CA', na=False) &
    (synapses_subset['pre'].isin(proofread_ids) | synapses_subset['post'].isin(proofread_ids))
].copy()

total_time = time.time() - start_time
print(f"\n✓ Done! Total time: {total_time:.1f}s")
print(f"✓ Loaded {len(mb_synapses):,} mushroom body calyx synapses (right side only)")
print(f"Unique presynaptic neurons/fragments: {mb_synapses['pre'].nunique():,}")
print(f"Unique postsynaptic neurons/fragments: {mb_synapses['post'].nunique():,}")

mb_synapses.head()


=== OPTIMIZED PARQUET FILTERING ===

Reading synapses from: gs://brain-and-nerve-cord_exports/sjcabs_data/banc/banc_746_synapses.parquet
(Using PyArrow predicate pushdown for server-side filtering)

Step 1: Applying PyArrow filters at read time...
  - Filtering for side == 'right' (server-side)
  - Selecting only needed columns: ['id', 'pre', 'post', 'neuropil', 'side']
  This significantly reduces data transfer!



✓ Read complete in 997.6s
  Rows after server-side filtering: 80,957,640

Step 2: Converting to pandas and applying remaining filters...


  - Filtering for neuropil starts with 'MB_CA'
  - Filtering for proofread neurons only



✓ Done! Total time: 1077.9s
✓ Loaded 475,920 mushroom body calyx synapses (right side only)
Unique presynaptic neurons/fragments: 11,560
Unique postsynaptic neurons/fragments: 190,110


| | id | pre | post | neuropil | side |
|---|---|---|---|---|---|
| 101 | 64103966 | 720575941455841587 | 720575941393776435 | MB_CA | right |
| 118 | 64728984 | 720575941613035437 | 720575941460130412 | MB_CA | right |
| 129 | 75063931 | 720575941612228454 | 720575940537043974 | MB_CA | right |
| 154 | 64623507 | 720575941431510751 | 720575941477870913 | MB_CA | right |
| 318 | 84437892 | 720575941539204813 | 720575941570755060 | MB_CA | right |


### Identify Mushroom Body Calyx Neurons

Define MB calyx neurons as those with ≥100 synapses (inputs or outputs) within the MB calyx:

In [12]:
# Count outputs per neuron (filter for proofread)
mb_outputs = (
    mb_synapses[mb_synapses['pre'].isin(proofread_ids)]
    .groupby('pre')
    .size()
    .reset_index(name='n_outputs')
    .query('n_outputs >= 100')
)

# Count inputs per neuron (filter for proofread)
mb_inputs = (
    mb_synapses[mb_synapses['post'].isin(proofread_ids)]
    .groupby('post')
    .size()
    .reset_index(name='n_inputs')
    .query('n_inputs >= 100')
)

# Combine to get all MB calyx neurons
mb_neurons = pd.unique(pd.concat([mb_outputs['pre'], mb_inputs['post']]))
print(f"Neurons with ≥100 synapses in MB calyx: {len(mb_neurons):,}")

# Get metadata for MB calyx neurons
mb_meta = meta[meta[DATASET_ID].isin(mb_neurons)].copy()

# Check how many are Kenyon cells
n_kc = mb_meta['cell_class'].str.contains('kenyon_cell', case=False, na=False).sum()
n_other = len(mb_meta) - n_kc

print(f"  Kenyon cells: {n_kc:,}")
print(f"  Other neurons: {n_other:,}")

Neurons with ≥100 synapses in MB calyx: 599
  Kenyon cells: 325
  Other neurons: 274
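
To make the selection rule concrete, the sketch below applies the same logic to a handful of made-up synapses, with a threshold of 2 instead of 100. A neuron is kept if it has enough outputs or enough inputs; the two counts are not summed. All IDs here are invented, and `frag9` stands in for an unproofread fragment.

```python
import pandas as pd

# Made-up synapses: each row is one pre -> post connection
toy_syn = pd.DataFrame({
    "pre":  ["n1", "n1", "n1", "n2", "n3", "frag9"],
    "post": ["n2", "n2", "n3", "n1", "n1", "n1"],
})
proofread = {"n1", "n2", "n3"}
THRESHOLD = 2  # the tutorial uses 100

# Outputs and inputs per proofread neuron
n_out = toy_syn.loc[toy_syn["pre"].isin(proofread), "pre"].value_counts()
n_in = toy_syn.loc[toy_syn["post"].isin(proofread), "post"].value_counts()

# Side-by-side table; neurons missing from one side get 0
per_neuron = (
    pd.concat([n_out.rename("n_outputs"), n_in.rename("n_inputs")], axis=1)
    .fillna(0)
    .astype(int)
)

# Keep neurons that pass the threshold on either side
selected = per_neuron[
    (per_neuron["n_outputs"] >= THRESHOLD) | (per_neuron["n_inputs"] >= THRESHOLD)
]

print(per_neuron.sort_index())
print(sorted(selected.index))
```

Keeping both counts in one table also makes it easy to switch to a combined rule, such as thresholding `n_outputs + n_inputs`, if that definition suits an analysis better.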


### Characterize Non-Kenyon MB Calyx Neurons

What other neuron types are present in the mushroom body calyx?

In [13]:
# Prepare data - focus on non-Kenyon cells
mb_meta_clean = mb_meta.copy()
mb_meta_clean['super_class'] = mb_meta_clean['super_class'].fillna('other')
mb_meta_clean['cell_class'] = mb_meta_clean['cell_class'].fillna('other')
mb_meta_clean['cell_sub_class'] = mb_meta_clean['cell_sub_class'].fillna('other')
mb_meta_clean['is_kenyon'] = mb_meta_clean['cell_class'].str.contains('kenyon_cell', case=False, na=False)

# Filter out Kenyon cells
mb_non_kc = mb_meta_clean[~mb_meta_clean['is_kenyon']]

# Count by classification levels
super_counts = mb_non_kc['super_class'].value_counts().sort_values(ascending=True)
class_counts = mb_non_kc['cell_class'].value_counts().head(15).sort_values(ascending=True)
subclass_counts = mb_non_kc['cell_sub_class'].value_counts().head(15).sort_values(ascending=True)

# Create subplots
fig = make_subplots(
    rows=3, cols=1,
    subplot_titles=('Super Class', 'Cell Class (Top 15)', 'Cell Sub-Class (Top 15)'),
    vertical_spacing=0.12,
    specs=[[{"type": "bar"}], [{"type": "bar"}], [{"type": "bar"}]]
)

# Super class
fig.add_trace(
    go.Bar(
        y=super_counts.index,
        x=super_counts.values,
        orientation='h',
        marker_color='lightblue',
        text=super_counts.values,
        textposition='outside',
        hovertemplate='<b>%{y}</b><br>Count: %{x}<extra></extra>'
    ),
    row=1, col=1
)

# Cell class
fig.add_trace(
    go.Bar(
        y=class_counts.index,
        x=class_counts.values,
        orientation='h',
        marker_color='lightgreen',
        text=class_counts.values,
        textposition='outside',
        hovertemplate='<b>%{y}</b><br>Count: %{x}<extra></extra>'
    ),
    row=2, col=1
)

# Cell sub-class
fig.add_trace(
    go.Bar(
        y=subclass_counts.index,
        x=subclass_counts.values,
        orientation='h',
        marker_color='lightcoral',
        text=subclass_counts.values,
        textposition='outside',
        hovertemplate='<b>%{y}</b><br>Count: %{x}<extra></extra>'
    ),
    row=3, col=1
)

fig.update_layout(
    title_text=f"Non-Kenyon MB Calyx Neurons: {DATASET}",
    title_x=0.5,
    title_xanchor='center',
    height=1200,
    showlegend=False,
    template="plotly_white"
)

fig.update_xaxes(title_text="Count")

fig.show()

### Summary Visualization

Create a summary comparing Kenyon vs non-Kenyon MB calyx neurons:

In [14]:
# Prepare summary data
mb_summary = pd.DataFrame({
    'Category': ['Kenyon Cells', 'Other MB Neurons'],
    'Count': [n_kc, n_other]
})

mb_summary['Percentage'] = (mb_summary['Count'] / mb_summary['Count'].sum() * 100).round(1)

# Create pie chart with plotly
fig = go.Figure()

fig.add_trace(go.Pie(
    labels=mb_summary['Category'],
    values=mb_summary['Count'],
    text=mb_summary.apply(lambda x: f"{x['Count']:,}<br>({x['Percentage']:.1f}%)", axis=1),
    textposition='inside',
    textfont=dict(size=14, color='white'),
    marker=dict(
        colors=['#E69F00', '#56B4E9'],
        line=dict(color='white', width=2)
    ),
    hovertemplate='<b>%{label}</b><br>Count: %{value:,}<br>Percentage: %{percent}<extra></extra>'
))

fig.update_layout(
    title=dict(
        text=f"Mushroom Body Calyx Neurons: {DATASET}<br><sub>Neurons with ≥100 synapses in MB calyx</sub>",
        x=0.5,
        xanchor='center'
    ),
    height=500,
    template="plotly_white"
)

fig.show()

# Print summary
print("\nMushroom Body Calyx Summary:")
print(mb_summary.to_string(index=False))


Mushroom Body Calyx Summary:
        Category  Count  Percentage
    Kenyon Cells    325        54.3
Other MB Neurons    274        45.7


---

## Summary

In this tutorial, we covered:

1. **Data Access**: Loading connectome data from GCS or local files
2. **Metadata Exploration**: Understanding the hierarchical classification system
3. **Proofread Neurons**: Important concept for filtering validated neurons
4. **Efficient Data Loading**: Using PyArrow and Parquet for large synapse files
5. **Data Analysis**: Filtering and characterizing mushroom body neurons
6. **Visualization**: Creating interactive plots with Plotly

### Key Takeaways

- **Always filter for proofread neurons** when analyzing connectivity
- **Use Parquet for large files** - it supports efficient filtering and column selection
- **Stream from GCS** when possible to avoid large downloads
- **Interactive plots** with Plotly make exploration easier

---

## Next Steps

Try exploring other datasets or neuropils:
- Change `DATASET` to explore other connectomes (fafb_783, manc_121, etc.)
- Filter for different neuropils (e.g., 'AL' for antennal lobe, 'LH' for lateral horn)
- Analyze different cell types or neurotransmitters
- Look at Tutorial 02 for neuron morphology analysis

---

**Tutorial complete!** 🎉