---

**Runtime Configuration:** This notebook has a paired setup script at `runtimes/fly_connectome_04_indirect_connectivity_post_startup.sh` which provides the complete installation recipe for all dependencies. This script can be used as a post-startup script for Google Colab to automatically configure the environment.

---

## Introduction

# Core Tutorial

## Setup and Load Data

In [1]:
# Import required packages
import pandas as pd
import numpy as np
import pyarrow.feather as feather
import gcsfs
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import umap
from scipy.cluster.hierarchy import linkage, dendrogram
from InfluenceCalculator import InfluenceCalculator
from joblib import Parallel, delayed
import multiprocessing
import warnings
warnings.filterwarnings('ignore')

# Set up parallelization across multiple CPU cores
# Python advantage: We can parallelize influence calculations across cores
# (The R version of this tutorial runs serially)
n_cores = max(1, multiprocessing.cpu_count() - 1)
print("✓ All packages imported successfully")
print(f"Detected {multiprocessing.cpu_count()} CPU cores")
print(f"Will use {n_cores} cores for parallel influence calculations")
print("Note: Parallelization speeds up calculations significantly!")

✓ All packages imported successfully
Detected 10 CPU cores
Will use 9 cores for parallel influence calculations
Note: Parallelization speeds up calculations significantly!
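
The pattern used later in this tutorial, mapping one function over many source groups on a pool of workers, can be sketched with the standard library alone. This is a minimal illustration that uses `concurrent.futures.ThreadPoolExecutor` as a stand-in for joblib; `score_group` and the toy `groups` dictionary are hypothetical placeholders, not part of the tutorial's data. Threads generally only give a real speed-up when the underlying work releases Python's global interpreter lock, as many NumPy and SciPy routines do.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def score_group(name, ids):
    """Stand-in for one influence calculation: returns (group name, number of seeds)."""
    return name, len(ids)


# Hypothetical groups of seed neuron IDs
groups = {"a": [1, 2, 3], "b": [4], "c": []}

# Leave one core free, as the tutorial does
n_workers = max(1, (os.cpu_count() or 2) - 1)

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    # pool.map returns results in the same order as the inputs
    results = list(pool.map(lambda item: score_group(*item), groups.items()))

print(results)
```

Because results come back in input order, they can be matched to their source groups without extra bookkeeping.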


In [2]:
# Import standard libraries and data science packages
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
import gcsfs
import networkx as nx

# Import helper functions from utils
from utils import construct_path, read_feather_gcs

print(f"pandas version: {pd.__version__}")
print(f"networkx version: {nx.__version__}")

✓ Packages loaded successfully


In [3]:
# Environment detection and Colab setup (auto-configured)
try:
    import google.colab
    IN_COLAB = True
    
    # Colab setup
    
    # Authenticate
    from google.colab import auth
    auth.authenticate_user()
    print("✓ Authenticated with Google Cloud")
    
    # Download utils.py
    import urllib.request, os
    HELPER_URL = "https://raw.githubusercontent.com/sjcabs/fly_connectome_data_tutorial/main/python/utils.py"
    if not os.path.exists("utils.py"):
        urllib.request.urlretrieve(HELPER_URL, "utils.py")
    
    print("✓ Colab environment ready\n")
except ImportError:
    IN_COLAB = False
    # Local environment - no output needed
    pass
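
A download-if-missing helper only works when the existence check and the save target are the same file; otherwise a later `from utils import ...` cannot find the module. The sketch below illustrates that idea with a hypothetical `fetch_if_missing` function and a local `file://` URL so it runs offline. It is an illustration, not the tutorial's own setup code.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_if_missing(url: str, dest: Path) -> bool:
    """Download `url` to `dest` unless `dest` already exists.

    Returns True when a download happened, False when the file was reused.
    """
    if dest.exists():
        return False
    urllib.request.urlretrieve(url, str(dest))
    return True


# Demonstration with a local file standing in for the remote helper module
workdir = Path(tempfile.mkdtemp())
source = workdir / "source_utils.py"
source.write_text("def hello():\n    return 'hi'\n")

dest = workdir / "utils.py"
first = fetch_if_missing(source.as_uri(), dest)   # performs the copy
second = fetch_if_missing(source.as_uri(), dest)  # reuses the existing file
print(first, second, dest.read_text() == source.read_text())
```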

In [4]:
# Configuration
DATASET = "banc_746"
DATASET_ID = "banc_746_id"

# Data location (GCS bucket or local path)
DATA_PATH = "gs://sjcabs_2025_data"
USE_GCS = DATA_PATH.startswith("gs://")

# Setup image output directory
import os
IMG_DIR = "images/tutorial_04"
os.makedirs(IMG_DIR, exist_ok=True)

print(f"Working with dataset: {DATASET}")
print(f"Data location: {DATA_PATH}")


Working with dataset: banc_746
Data location: gs://sjcabs_2025_data


In [5]:
# Setup GCS access if needed
if USE_GCS:
    gcs = gcsfs.GCSFileSystem(token='google_default')
    print("✓ GCS filesystem initialized")
else:
    gcs = None
    print("Using local filesystem")

✓ GCS filesystem initialized


In [6]:
# Load metadata
meta_path = construct_path(DATA_PATH, DATASET, "meta")
print(f"Loading metadata from: {meta_path}")
meta = read_feather_gcs(meta_path, gcs_fs=gcs)
print(f"✓ Loaded {len(meta):,} neurons")
print(f"\nMetadata columns: {list(meta.columns)}")
meta.head(3)

Loading metadata from: gs://sjcabs_2025_data/banc/banc_746_meta.feather
📦 Loading from cache: sjcabs_2025_data_banc_banc_746_meta.feather
✓ Loaded 168,759 rows (cached)
✓ Loaded 168,759 neurons

Metadata columns: ['banc_746_id', 'supervoxel_id', 'region', 'side', 'hemilineage', 'nerve', 'flow', 'super_class', 'cell_class', 'cell_sub_class', 'cell_type', 'neurotransmitter_predicted', 'neurotransmitter_score', 'cell_function', 'cell_function_detailed', 'body_part_sensory', 'body_part_effector', 'status']


| | banc_746_id | supervoxel_id | region | side | hemilineage | nerve | flow | super_class | cell_class | cell_sub_class | cell_type | neurotransmitter_predicted | neurotransmitter_score | cell_function | cell_function_detailed | body_part_sensory | body_part_effector | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 720575941569192238 | 74803281603754231 | central_brain | right | VPNp1_medial | | intrinsic | central_brain_intrinsic | | | (PLP191,PLP192)a | acetylcholine | 0.7534 | | | | | |
| 1 | 720575941574697871 | 74873512908765054 | central_brain | right | VPNp1_medial | | intrinsic | central_brain_intrinsic | | | (PLP191,PLP192)a | acetylcholine | 0.7976 | | | | | |
| 2 | 720575941652939029 | 77477362601861709 | central_brain | left | VPNp1_medial | | intrinsic | central_brain_intrinsic | | | (PLP191,PLP192)a | dopamine | 0.5825 | | | | | TRACING_ISSUE_2 |


In [7]:
# Load edgelist
edgelist_path = construct_path(DATA_PATH, DATASET, "edgelist_simple")
print(f"Loading edgelist from: {edgelist_path}")
edgelist_simple = read_feather_gcs(edgelist_path, gcs_fs=gcs)
print(f"✓ Loaded {len(edgelist_simple):,} connections")
print(f"\nEdgelist columns: {list(edgelist_simple.columns)}")
edgelist_simple.head(3)

Loading edgelist from: gs://sjcabs_2025_data/banc/banc_746_simple_edgelist.feather
📦 Loading from cache: sjcabs_2025_data_banc_banc_746_simple_edgelist.feather


✓ Loaded 111,225,103 rows (cached)
✓ Loaded 111,225,103 connections

Edgelist columns: ['pre', 'post', 'count', 'norm', 'total_input']


| | pre | post | count | norm | total_input |
|---|---|---|---|---|---|
| 0 | 720575941509220642 | 720575941277394247 | 1 | 1.0 | 1 |
| 1 | 720575941526837604 | 720575940420901192 | 1 | 1.0 | 1 |
| 2 | 720575941508750721 | 720575941576493706 | 1 | 0.5 | 2 |


## Filter Strong Connections

To speed up influence calculations, we filter out weak connections (fewer than 5 synapses):

In [8]:
# Filter for connections with at least 5 synapses
edgelist_filtered = edgelist_simple[edgelist_simple['count'] >= 5].copy()

print(f"Original connections: {len(edgelist_simple):,}")
print(f"After filtering (≥5 synapses): {len(edgelist_filtered):,}")
print(f"Retained: {100 * len(edgelist_filtered) / len(edgelist_simple):.1f}%")

Original connections: 111,225,103
After filtering (≥5 synapses): 1,927,964
Retained: 1.7%
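
The share of connections retained is not the same as the share of synapses retained: the connections that survive the threshold are the strong ones, so the synapse share is always at least as large. A minimal sketch on a toy edgelist (hypothetical values, and a hypothetical helper named `threshold_summary`) shows how both shares could be compared across thresholds before settling on one.

```python
import pandas as pd

# Toy edgelist (hypothetical values) with the same 'count' column as the tutorial data
toy_edges = pd.DataFrame({
    "pre":   [1, 1, 2, 3, 3, 4],
    "post":  [2, 3, 3, 4, 1, 1],
    "count": [1, 12, 2, 7, 1, 5],
})


def threshold_summary(edges: pd.DataFrame, thresholds=(1, 2, 5, 10)) -> pd.DataFrame:
    """For each threshold, report the share of connections and of synapses kept."""
    total_edges = len(edges)
    total_synapses = edges["count"].sum()
    rows = []
    for t in thresholds:
        kept = edges[edges["count"] >= t]
        rows.append({
            "threshold": t,
            "pct_connections": 100 * len(kept) / total_edges,
            "pct_synapses": 100 * kept["count"].sum() / total_synapses,
        })
    return pd.DataFrame(rows)


summary = threshold_summary(toy_edges)
print(summary.round(1))
```

In this toy case a threshold of 5 keeps half of the connections but most of the synapses.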


## Example 1: Sensory Influence on Dopaminergic Neurons

Let's examine how sensory neurons influence mushroom body dopaminergic neurons. This is biologically relevant because:

- Dopaminergic neurons provide **teaching signals** for associative memory
- They are hypothesised to receive unconditioned sensory information
- **PAM** dopamine neurons are involved in appetitive (reward) learning
- **PPL1** dopamine neurons are involved in aversive (punishment) learning

### Define Source and Target Neurons

In [9]:
# Source: All sensory neurons (afferent flow)
sensory_neurons = meta[meta['flow'] == 'afferent'][['banc_746_id', 'cell_sub_class', 'cell_type']].drop_duplicates()

print(f"Found {len(sensory_neurons):,} sensory neurons")

# Get unique sensory sub-classes
sensory_sub_classes = sorted([x for x in sensory_neurons['cell_sub_class'].unique() if pd.notna(x)])

print(f"\nSensory sub-classes (n={len(sensory_sub_classes)}):")
print(", ".join(sensory_sub_classes[:10]))

# Target: All mushroom body dopaminergic neurons
mb_dopamine_neurons = meta[
    meta['cell_class'] == 'mushroom_body_dopaminergic_neuron'
][['banc_746_id', 'cell_sub_class', 'cell_type']].drop_duplicates()

print(f"\nFound {len(mb_dopamine_neurons):,} mushroom body dopamine neurons")

# Get unique MB dopamine types
mb_da_types = sorted(mb_dopamine_neurons['cell_type'].unique())

print(f"MB dopamine types (n={len(mb_da_types)}):")
print(", ".join(mb_da_types[:10]))

Found 15,461 sensory neurons

Sensory sub-classes (n=109):
abdomen_multidendritic_neuron, abdomen_orphan_neuron, abdomen_oxygenation_neuron, abdomen_strand_neuron, abdominal_ppk_neuron, abdominal_terminalia_bristle, abdominal_wall_multidendritic_neuron, antenna_bristle_neuron, antenna_campaniform_sensillum_neuron, antenna_hygrosensory_receptor_neuron

Found 255 mushroom body dopamine neurons
MB dopamine types (n=35):
PAM01, PAM01_b, PAM02, PAM03, PAM04, PAM04_a, PAM05, PAM06, PAM06_b, PAM07


### Set Up Influence Calculator

In [10]:
print("Initializing influence calculator...")
print("This may take a few minutes for large networks...\n")

# Prepare data for InfluenceCalculator
# The package expects a SQLite database with:
# - 'edgelist_simple' table with columns: pre, post, count, norm, post_count
# - 'meta' table with column: root_id (plus any other metadata)

# Check edgelist column names and rename if needed
edgelist_cols = list(edgelist_filtered.columns)
print(f"Edgelist columns: {', '.join(edgelist_cols)}")

if 'pre' in edgelist_cols and 'post' in edgelist_cols:
    edgelist_for_ic = edgelist_filtered.copy()
else:
    # Need to rename columns
    pre_col = f"pre_{DATASET_ID}"
    post_col = f"post_{DATASET_ID}"
    
    edgelist_for_ic = edgelist_filtered.rename(columns={
        pre_col: 'pre',
        post_col: 'post'
    })

# Add post_count column if not present (required by InfluenceCalculator)
if 'post_count' not in edgelist_for_ic.columns:
    edgelist_for_ic['post_count'] = edgelist_for_ic['count'] / edgelist_for_ic['norm']

# Prepare metadata with root_id column
meta_for_ic = meta.rename(columns={DATASET_ID: 'root_id'})

# Convert ID columns to string for SQLite compatibility
meta_for_ic['root_id'] = meta_for_ic['root_id'].astype(str)
edgelist_for_ic['pre'] = edgelist_for_ic['pre'].astype(str)
edgelist_for_ic['post'] = edgelist_for_ic['post'].astype(str)

print(f"\n✓ Data prepared for influence calculator")
print(f"  Edgelist: {len(edgelist_for_ic):,} connections")
print(f"  Metadata: {len(meta_for_ic):,} neurons\n")

# Create temporary SQLite database
import sqlite3
import tempfile

temp_db = tempfile.NamedTemporaryFile(suffix='.sqlite', delete=False)
temp_db_path = temp_db.name
temp_db.close()

print(f"Creating temporary SQLite database: {temp_db_path}")

conn = sqlite3.connect(temp_db_path)

# Write tables to database
print("Writing edgelist to database...")
edgelist_for_ic.to_sql('edgelist_simple', conn, if_exists='replace', index=False)

print("Writing metadata to database...")
meta_for_ic.to_sql('meta', conn, if_exists='replace', index=False)

conn.close()

print("✓ Database created successfully\n")

# Initialize the influence calculator
# This uses the InfluenceCalculator Python package
print("Initializing calculator (this may take several minutes)...")

ic_dataset = InfluenceCalculator(
    filename=temp_db_path,
    signed=False,
    count_thresh=5
)

print("\n✓ Influence calculator initialized")
print("Network ready for influence calculations")

Initializing influence calculator...
This may take a few minutes for large networks...

Edgelist columns: pre, post, count, norm, total_input

✓ Data prepared for influence calculator
  Edgelist: 1,927,964 connections
  Metadata: 168,759 neurons

Creating temporary SQLite database: /var/folders/dy/z4y74tc548b8w1526qf2m0p00000gn/T/tmph76_w00_.sqlite
Writing edgelist to database...


Writing metadata to database...


✓ Database created successfully

Initializing calculator (this may take several minutes)...



✓ Influence calculator initialized
Network ready for influence calculations
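
Before handing a database file to the calculator, it can be useful to confirm that the tables and columns described above were actually written. The sketch below does this on toy tables (hypothetical values) using only `sqlite3` and pandas; it checks the file layout and makes no assumption about how `InfluenceCalculator` reads it.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Toy tables (hypothetical values) using the column names the tutorial writes
toy_edgelist = pd.DataFrame({
    "pre": ["101", "102"],
    "post": ["102", "103"],
    "count": [6, 9],
    "norm": [0.5, 0.25],
})
toy_edgelist["post_count"] = toy_edgelist["count"] / toy_edgelist["norm"]
toy_meta = pd.DataFrame({"root_id": ["101", "102", "103"],
                         "cell_type": ["A", "B", "C"]})

with tempfile.NamedTemporaryFile(suffix=".sqlite", delete=False) as tmp:
    db_path = tmp.name

conn = sqlite3.connect(db_path)
try:
    toy_edgelist.to_sql("edgelist_simple", conn, if_exists="replace", index=False)
    toy_meta.to_sql("meta", conn, if_exists="replace", index=False)

    # List tables and columns straight from SQLite's own catalogue
    tables = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name", conn
    )["name"].tolist()
    edge_cols = pd.read_sql_query(
        "PRAGMA table_info(edgelist_simple)", conn
    )["name"].tolist()
    n_edges = int(pd.read_sql_query(
        "SELECT COUNT(*) AS n FROM edgelist_simple", conn
    )["n"].iloc[0])
finally:
    conn.close()
    os.remove(db_path)  # clean up the temporary file

print(tables)
print(edge_cols)
print(n_edges)
```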


### Calculate Influence Scores

Now we calculate influence scores from each sensory sub-class to all MB dopaminergic neurons:

In [11]:
print(f"Calculating influence scores for {len(sensory_sub_classes)} sensory sub-classes...")
print(f"Using {n_cores} cores for parallel processing\n")
print("Note: This will take time - influence calculations involve matrix operations on the full network\n")

# Get MB dopamine neuron IDs for filtering
mb_dopamine_ids = set(mb_dopamine_neurons['banc_746_id'].astype(str).values)

# Define function to calculate influence for one sensory sub-class
def calculate_influence_for_subclass(i, sensory_sub_class):
    """Calculate influence from one sensory sub-class to MB dopamine neurons."""
    # Get IDs for this sensory sub-class (as strings to match database)
    sensory_ids = sensory_neurons[
        sensory_neurons['cell_sub_class'] == sensory_sub_class
    ]['banc_746_id'].astype(str).tolist()
    
    # Skip if no neurons found
    if len(sensory_ids) == 0:
        return None
    
    # Calculate influence from this sensory sub-class
    influence_df = ic_dataset.calculate_influence(
        seed_ids=sensory_ids,
        silenced_neurons=[]
    )
    
    # Ensure id is string type
    influence_df['id'] = influence_df['id'].astype(str)
    
    # Find the influence score column (may have different names)
    influence_col = [col for col in influence_df.columns if 'Influence_score' in col][0]
    
    # Add adjusted influence (log-transform with offset, floor at 0)
    adjusted_inf = np.log(influence_df[influence_col]) + 24
    adjusted_inf[adjusted_inf < 0] = 0
    influence_df['adjusted_influence'] = adjusted_inf
    
    # Filter to MB dopamine neurons and join with metadata
    influence_scores = influence_df[
        influence_df['id'].isin(mb_dopamine_ids)
    ].merge(
        meta[['banc_746_id', 'cell_sub_class', 'cell_type']].drop_duplicates().assign(
            banc_746_id=lambda x: x['banc_746_id'].astype(str)
        ),
        left_on='id',
        right_on='banc_746_id',
        how='left'
    ).rename(columns={
        'cell_sub_class': 'target_class',
        'cell_type': 'target_type'
    })
    
    influence_scores['source'] = sensory_sub_class
    
    return influence_scores

# Run calculations in parallel (joblib's verbose=10 prints progress messages)

all_influence_scores_list = Parallel(n_jobs=n_cores, verbose=10, backend='threading')(
    delayed(calculate_influence_for_subclass)(i, sc) 
    for i, sc in enumerate(sensory_sub_classes)
)

# Remove None results (from empty sensory sub-classes)
all_influence_scores_list = [df for df in all_influence_scores_list if df is not None]


# Combine all results
all_influence_scores = pd.concat(all_influence_scores_list, ignore_index=True)

print(f"Total influence scores calculated: {len(all_influence_scores):,}")

# Show sample of results
print("\nSample of influence scores:")
display_cols = ['source', 'id', 'adjusted_influence', 'target_type']
# Find the influence column name
influence_col = [col for col in all_influence_scores.columns if 'Influence_score' in col]
if influence_col:
    display_cols.insert(2, influence_col[0])

print(all_influence_scores[display_cols].head(10).to_string())

Calculating influence scores for 109 sensory sub-classes...
Using 9 cores for parallel processing

Note: This will take time - influence calculations involve matrix operations on the full network



[Parallel(n_jobs=9)]: Using backend ThreadingBackend with 9 concurrent workers.


[Parallel(n_jobs=9)]: Done   7 tasks      | elapsed:   11.0s


[Parallel(n_jobs=9)]: Done  14 tasks      | elapsed:   18.6s


[Parallel(n_jobs=9)]: Done  23 tasks      | elapsed:   26.9s


[Parallel(n_jobs=9)]: Done  32 tasks      | elapsed:   36.7s


[Parallel(n_jobs=9)]: Done  43 tasks      | elapsed:   44.5s


[Parallel(n_jobs=9)]: Done  54 tasks      | elapsed:   54.3s


[Parallel(n_jobs=9)]: Done  67 tasks      | elapsed:  1.1min


[Parallel(n_jobs=9)]: Done  80 tasks      | elapsed:  1.3min


[Parallel(n_jobs=9)]: Done 103 out of 109 | elapsed:  1.6min remaining:    5.7s


Total influence scores calculated: 15,478

Sample of influence scores:
                          source                  id  Influence_score_(unsigned)  adjusted_influence target_type
0  abdomen_multidendritic_neuron  720575941477857076                4.377201e-05           13.963484      PPL102
1  abdomen_multidendritic_neuron  720575941536157930                2.141022e-05           13.248358      PPL108
2  abdomen_multidendritic_neuron  720575941527558500                8.665820e-06           12.343876      PPL103
3  abdomen_multidendritic_neuron  720575941445894826                4.119797e-06           11.600293       PAM02
4  abdomen_multidendritic_neuron  720575941689127692                3.511251e-05           13.743047      PPL101
5  abdomen_multidendritic_neuron  720575941552246783                2.045579e-05           13.202756      PPL101
6  abdomen_multidendritic_neuron  720575941685802735                5.559298e-08            7.294791       PAM06
7  abdomen_multidendritic

[Parallel(n_jobs=9)]: Done 109 out of 109 | elapsed:  1.6min finished
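
The adjusted influence used above is the natural log of the raw score plus an offset of 24, floored at zero. Since e^-24 is roughly 3.8e-11, any raw score below that value maps to 0. The sketch below wraps the transform in a hypothetical helper, `adjust_influence`, and applies it to two raw scores taken from the sample output plus two small values chosen for illustration.

```python
import numpy as np
import pandas as pd


def adjust_influence(raw, offset: float = 24.0) -> pd.Series:
    """Natural log of the raw score plus an offset, floored at zero."""
    raw = pd.Series(raw, dtype=float)
    with np.errstate(divide="ignore"):  # log(0) gives -inf, which the clip turns into 0
        adjusted = np.log(raw) + offset
    return adjusted.clip(lower=0)


# First two values appear in the sample output; the last two are illustrative
raw_scores = [4.377201e-05, 5.559298e-08, 1e-12, 0.0]
adjusted = adjust_influence(raw_scores)
print(adjusted.round(4).tolist())
```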


### Aggregate by Cell Type

In [12]:
# Find the influence column name dynamically
influence_col = [col for col in all_influence_scores.columns if 'Influence_score' in col]
if not influence_col:
    raise ValueError("No Influence_score column found in results")
influence_col = influence_col[0]

# Aggregate influence scores by source and target cell type
# First join with meta to get source_class (cell_class) from source (cell_sub_class)
meta_source = meta[['cell_sub_class', 'cell_class']].drop_duplicates('cell_sub_class')
all_influence_scores_ct = all_influence_scores.merge(
    meta_source.rename(columns={'cell_sub_class': 'source', 'cell_class': 'source_class'}),
    on='source',
    how='left'
)

# Group by source (sensory sub-class) and target_type
all_influence_scores_ct = all_influence_scores_ct.groupby(
    ['source', 'target_type']
).agg({
    influence_col: 'sum',
    'id': 'count'
}).reset_index().rename(columns={
    'id': 'n_targets',
    influence_col: 'influence'
})

# Recalculate adjusted_influence from summed influence (NOT sum of adjusted_influence!)
all_influence_scores_ct['adjusted_influence'] = np.log(all_influence_scores_ct['influence']) + 24
all_influence_scores_ct['adjusted_influence'] = all_influence_scores_ct['adjusted_influence'].replace([np.inf, -np.inf], 0)

# Filter out missing values
all_influence_scores_ct = all_influence_scores_ct[
    all_influence_scores_ct['target_type'].notna() & 
    all_influence_scores_ct['source'].notna()
]

print(f"Aggregated to {len(all_influence_scores_ct):,} source-target type pairs")

# Show top influences
print("\nTop 10 sensory → dopamine influences:")
print(all_influence_scores_ct.nlargest(10, 'adjusted_influence')[[
    'source', 'target_type', 'adjusted_influence', 'n_targets'
]].to_string())

Aggregated to 3,379 source-target type pairs

Top 10 sensory → dopamine influences:
                                        source target_type  adjusted_influence  n_targets
338          antenna_olfactory_receptor_neuron      PPL202           21.029078          2
337          antenna_olfactory_receptor_neuron      PPL201           20.433512          3
317          antenna_olfactory_receptor_neuron       PAM07           19.890152          7
307       antenna_hygrosensory_receptor_neuron      PPL202           19.622970          2
339          antenna_olfactory_receptor_neuron      PPL203           19.391313          1
2494   pharynx_internal_taste_sensillum_neuron       PAM11           19.375766          6
1950  maxillary_palp_olfactory_receptor_neuron      PPL202           19.370953          2
401      antenna_thermosensory_receptor_neuron      PPL203           19.140260          1
312          antenna_olfactory_receptor_neuron       PAM02           19.123631         13
400      antenna
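
The adjusted score is recomputed from the summed raw influence because a log transform does not commute with a sum: log(a + b) is not log(a) + log(b). A toy example with hypothetical values makes the difference concrete. Two neurons with equal raw scores should raise the adjusted score by ln(2), not double it.

```python
import numpy as np
import pandas as pd

# Hypothetical raw scores for two neurons of the same target type
toy = pd.DataFrame({
    "source": ["s1", "s1"],
    "target_type": ["T1", "T1"],
    "influence": [1e-6, 1e-6],
})
toy["adjusted_influence"] = np.log(toy["influence"]) + 24

grouped = toy.groupby(["source", "target_type"], as_index=False).agg(
    influence=("influence", "sum"),
    summed_adjusted=("adjusted_influence", "sum"),  # the incorrect route
)
# The route used in the tutorial: transform after summing raw scores
grouped["adjusted_from_sum"] = np.log(grouped["influence"]) + 24

print(grouped[["summed_adjusted", "adjusted_from_sum"]])
```

Summing adjusted values would inflate the score for target types that simply contain more neurons.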

## Visualisation: Influence Heatmap

Let's create an interactive heatmap showing influence from sensory sub-classes to dopamine neuron types:

In [13]:
# Create a matrix for heatmap
# Filter out olfactory neurons
filtered_scores = all_influence_scores_ct[
    ~all_influence_scores_ct['source'].str.contains('olfactory_receptor_neuron', case=False, na=False)
]

influence_matrix = filtered_scores.pivot(
    index='source',
    columns='target_type',
    values='adjusted_influence'
).fillna(0)

# Filter rows with low influence (similar to R version)
row_sums = influence_matrix.sum(axis=1)
influence_matrix = influence_matrix[row_sums > 100]

print(f"Heatmap matrix dimensions: {influence_matrix.shape[0]} x {influence_matrix.shape[1]}")
print(f"(After filtering for sources with total influence > 100)\n")

# Min-max normalize by column (per target type, like R version)
# This shows which source has the strongest influence for each target
influence_matrix_norm = influence_matrix.copy()
for col in influence_matrix_norm.columns:
    col_data = influence_matrix_norm[col]
    col_min = col_data.min()
    col_max = col_data.max()
    if col_max > col_min:
        influence_matrix_norm[col] = (col_data - col_min) / (col_max - col_min)
    else:
        influence_matrix_norm[col] = 0

# Replace any inf/nan values
influence_matrix_norm = influence_matrix_norm.replace([np.inf, -np.inf], 0).fillna(0)

# Perform hierarchical clustering
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

# Cluster rows (sources)
row_linkage = linkage(pdist(influence_matrix_norm.values, metric='euclidean'), method='ward')
row_order = leaves_list(row_linkage)

# Cluster columns (targets)
col_linkage = linkage(pdist(influence_matrix_norm.T.values, metric='euclidean'), method='ward')
col_order = leaves_list(col_linkage)

# Reorder matrix
influence_matrix_ordered = influence_matrix_norm.iloc[row_order, col_order]

# Create interactive heatmap with plotly
fig = go.Figure(data=go.Heatmap(
    z=influence_matrix_ordered.values,
    x=influence_matrix_ordered.columns,
    y=influence_matrix_ordered.index,
    colorscale='Viridis',
    colorbar=dict(title='Normalized<br>Influence'),
    hovertemplate='Source: %{y}<br>Target: %{x}<br>Normalized Influence: %{z:.2f}<extra></extra>'
))

fig.update_layout(
    title='Sensory Influence on MB Dopaminergic Neurons',
    xaxis_title='Target: MB Dopamine Neuron Type',
    yaxis_title='Source: Sensory Sub-Class',
    width=1200,
    height=1000,
    xaxis={'tickangle': -45, 'tickfont': {'size': 8}},
    yaxis={'tickfont': {'size': 8}}
)

# Save as HTML
fig.write_html(f"{IMG_DIR}/{DATASET}_influence_heatmap.html")

print("✓ Interactive heatmap saved")
fig.show()

Heatmap matrix dimensions: 102 x 31
(After filtering for sources with total influence > 100)



✓ Interactive heatmap saved
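
The column-wise min-max scaling in the heatmap cell can also be written without an explicit loop. The sketch below is an equivalent vectorized form on a toy matrix (hypothetical values); `minmax_by_column` is an illustrative helper name. As in the loop version, a column whose values are all equal is set to zero instead of dividing by zero.

```python
import pandas as pd


def minmax_by_column(matrix: pd.DataFrame) -> pd.DataFrame:
    """Scale each column to [0, 1]; constant columns become all zeros."""
    col_min = matrix.min(axis=0)
    col_range = matrix.max(axis=0) - col_min
    # Zero ranges become NaN so the division yields NaN, which is then filled with 0
    scaled = (matrix - col_min) / col_range.where(col_range > 0)
    return scaled.fillna(0.0)


toy_matrix = pd.DataFrame(
    {"PAM01": [2.0, 6.0, 10.0], "PPL101": [5.0, 5.0, 5.0]},
    index=["source_a", "source_b", "source_c"],
)
scaled = minmax_by_column(toy_matrix)
print(scaled)
```

After this scaling, each column answers "which source is strongest for this target", so values are comparable within a column but not across columns.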


## Visualisation: UMAP of Influence Patterns

We can also visualise the influence patterns using UMAP, where each point is a dopaminergic neuron:

In [14]:
# Aggregate influence by individual neuron
all_influence_scores_n = all_influence_scores.groupby(
    ['source', 'id']
).agg({
    'Influence_score_(unsigned)': 'sum',
    'target_type': 'first',
    'target_class': 'first'
}).reset_index().rename(columns={
    'Influence_score_(unsigned)': 'influence'
})

# Recalculate adjusted_influence from summed raw influence
all_influence_scores_n['adjusted_influence'] = np.log(all_influence_scores_n['influence']) + 24
all_influence_scores_n['adjusted_influence'] = all_influence_scores_n['adjusted_influence'].replace([np.inf, -np.inf], 0)

all_influence_scores_n = all_influence_scores_n[
    all_influence_scores_n['id'].notna() & 
    all_influence_scores_n['source'].notna()
]

# Create matrix: rows = neurons, columns = sensory sub-classes
influence_matrix_umap = all_influence_scores_n.pivot(
    index='id',
    columns='source',
    values='adjusted_influence'
).fillna(0)

print(f"UMAP input matrix: {influence_matrix_umap.shape[0]} neurons x {influence_matrix_umap.shape[1]} sensory sub-classes\n")

# Run UMAP
import umap.umap_ as umap_lib

reducer = umap_lib.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
umap_result = reducer.fit_transform(influence_matrix_umap.values)

# Create data frame for plotting
umap_df = pd.DataFrame({
    'id': influence_matrix_umap.index,
    'UMAP1': umap_result[:, 0],
    'UMAP2': umap_result[:, 1]
}).merge(
    meta[['banc_746_id', 'cell_type', 'cell_sub_class', 'cell_class']].drop_duplicates().assign(
        banc_746_id=lambda x: x['banc_746_id'].astype(str)  # match the string 'id' column
    ),
    left_on='id',
    right_on='banc_746_id'
)

# Plot by cell sub-class
fig = px.scatter(
    umap_df,
    x='UMAP1',
    y='UMAP2',
    color='cell_sub_class',
    title=f'UMAP of MB Dopamine Neurons by Sensory Influence Patterns<br>Coloured by cell sub-class (n = {len(umap_df)} neurons)',
    labels={'cell_sub_class': 'Cell Sub-Class'},
    width=1200,
    height=600,
    template='plotly_white'
)

fig.update_traces(marker=dict(size=8, opacity=0.7))
fig.write_html(f"{IMG_DIR}/{DATASET}_influence_umap_subclass.html")
print("✓ Saved UMAP by cell sub-class")
fig.show()

# Plot by cell type
fig = px.scatter(
    umap_df,
    x='UMAP1',
    y='UMAP2',
    color='cell_type',
    title=f'UMAP of MB Dopamine Neurons by Sensory Influence Patterns<br>Coloured by cell type (n = {len(umap_df)} neurons)',
    labels={'cell_type': 'Cell Type'},
    width=1200,
    height=600,
    template='plotly_white'
)

fig.update_traces(marker=dict(size=8, opacity=0.7))
fig.write_html(f"{IMG_DIR}/{DATASET}_influence_umap_type.html")
print("✓ Saved UMAP by cell type")
fig.show()

UMAP input matrix: 142 neurons x 109 sensory sub-classes



OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.


✓ Saved UMAP by cell sub-class


✓ Saved UMAP by cell type
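
The matrix passed to UMAP has one row per target neuron and one column per sensory source, so each neuron is described by its vector of adjusted influence scores. A small sketch with hypothetical values shows the long-to-wide reshaping step on its own; source and neuron names here are placeholders.

```python
import pandas as pd

# Hypothetical long-format scores: one row per (source, target neuron) pair
long_scores = pd.DataFrame({
    "source": ["olfactory", "olfactory", "taste", "thermo"],
    "id": ["n1", "n2", "n1", "n2"],
    "adjusted_influence": [12.0, 8.5, 9.0, 3.0],
})

# Rows become neurons, columns become sources; missing pairs are treated as zero influence
wide = long_scores.pivot(
    index="id", columns="source", values="adjusted_influence"
).fillna(0)

print(wide)
print(wide.shape)
```

Note that `pivot` raises an error if any (id, source) pair appears more than once, which is why the tutorial aggregates by `['source', 'id']` first.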


## Example 2: Abdominal Neurons by Effector Control

Let's look at another example. Rather than calculating influence between sensory neurons and a target population, let's define a source population and calculate influence to **effector neurons** (motor and endocrine neurons).

The [abdominal neuromere](https://www.virtualflybrain.org/term/adult-abdominal-neuromere-fbbt_00110173/) is a little-studied region of the fly central nervous system. Let's see if we can break its neurons down into "functional modules" based on their influence on different effector systems.

In [15]:
# Load abdominal neuromere subset to get neuron IDs
subset_name = "abdominal_neuromere"
dataset_base = DATASET.rsplit("_", 1)[0]  # Remove version number
subset_path = f"{DATA_PATH}/{dataset_base}/{subset_name}/{DATASET}_{subset_name}_simple_edgelist.feather"

print(f"Loading abdominal subset from: {subset_path}")
abdominal_edgelist = read_feather_gcs(subset_path, gcs_fs=gcs)

# Get unique neuron IDs from this subset
abdominal_ids = set(abdominal_edgelist['pre'].tolist() + abdominal_edgelist['post'].tolist())

# Filter to intrinsic neurons (not afferent/efferent, not ascending/descending)
abdominal_neurons = meta[meta[DATASET_ID].isin(abdominal_ids)]

abdominal_source_neurons = abdominal_neurons[
    (~abdominal_neurons['flow'].isin(['afferent', 'efferent'])) &
    (~abdominal_neurons['super_class'].isin(['ascending', 'descending', 'sensory']))
][[DATASET_ID, 'cell_type', 'cell_class', 'cell_sub_class']].drop_duplicates()

print(f"Found {len(abdominal_source_neurons):,} intrinsic abdominal neurons")

# Get all effector neurons (motor and endocrine)
effector_neurons = meta[
    meta['flow'] == 'efferent'
][[DATASET_ID, 'cell_type', 'cell_class', 'cell_sub_class']].drop_duplicates()

effector_ids = set(effector_neurons[DATASET_ID].astype(str).values)

print(f"Found {len(effector_neurons):,} effector neurons")
print(f"\nEffector types (first 10): {sorted([x for x in effector_neurons['cell_sub_class'].unique() if x is not None])[:10]}")

Loading abdominal subset from: gs://sjcabs_2025_data/banc/abdominal_neuromere/banc_746_abdominal_neuromere_simple_edgelist.feather
📦 Loading from cache: sjcabs_2025_data_banc_abdominal_neuromere_banc_746_abdominal_neuromere_simple_edgelist.feather


✓ Loaded 419,089 rows (cached)
Found 2,323 intrinsic abdominal neurons
Found 1,031 effector neurons

Effector types (first 10): ['abdomen_neurosecretory_cell', 'corpus_allatum_neurosecretory_cell', 'digestive_tract_neurosecretory_cell', 'front_leg_motor_neuron', 'front_leg_neurosecretory_cell', 'haltere_power_neuron', 'haltere_steering_neuron', 'hind_leg_motor_neuron', 'hind_leg_neurosecretory_cell', 'middle_leg_motor_neuron']
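
The source selection above combines two exclusion masks: one on `flow` and one on `super_class`. The sketch below reproduces that logic on a toy metadata table. The `neuron_id` column and the `motor` label are hypothetical; the other labels are taken from the filters and output shown in this tutorial.

```python
import pandas as pd

# Toy metadata (hypothetical rows)
toy_meta = pd.DataFrame({
    "neuron_id": [1, 2, 3, 4, 5],
    "flow": ["intrinsic", "afferent", "efferent", "intrinsic", "intrinsic"],
    "super_class": ["central_brain_intrinsic", "sensory", "motor",
                    "ascending", "descending"],
})

# Exclude neurons that carry signals into or out of the nervous system ...
not_peripheral = ~toy_meta["flow"].isin(["afferent", "efferent"])
# ... and neurons that link brain and nerve cord or are sensory
not_projection = ~toy_meta["super_class"].isin(["ascending", "descending", "sensory"])

intrinsic = toy_meta[not_peripheral & not_projection]
print(intrinsic["neuron_id"].tolist())
```

Keeping the two masks as named variables makes it easy to inspect how many neurons each condition removes.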


In [16]:
print(f"Calculating influence from {len(abdominal_source_neurons):,} abdominal neurons to effectors...")
print(f"Using {n_cores} cores for parallel processing\n")

# Get unique cell types
abdominal_source_cts = abdominal_source_neurons[
    abdominal_source_neurons['cell_type'].notna()
]['cell_type'].unique()

print(f"Processing {len(abdominal_source_cts)} abdominal neuron types...")

# Define function for parallel processing
def calculate_abdominal_influence(source_ct):
    """Calculate influence from one abdominal cell type to all effectors."""
    # Get IDs for this cell type
    source_ids = abdominal_source_neurons[
        abdominal_source_neurons['cell_type'] == source_ct
    ][DATASET_ID].astype(str).tolist()
    
    if len(source_ids) == 0:
        return None
    
    # Calculate influence
    influence_df = ic_dataset.calculate_influence(
        seed_ids=source_ids,
        silenced_neurons=[]
    )
    
    # Ensure id is string
    influence_df['id'] = influence_df['id'].astype(str)
    
    # Find influence column
    influence_col = [col for col in influence_df.columns if 'Influence_score' in col][0]
    
    # Add adjusted influence
    adjusted_inf = np.log(influence_df[influence_col]) + 24
    adjusted_inf[adjusted_inf < 0] = 0
    influence_df['adjusted_influence'] = adjusted_inf
    
    # Filter to effector neurons
    influence_scores = influence_df[
        influence_df['id'].isin(effector_ids)
    ].merge(
        effector_neurons.assign(**{DATASET_ID: lambda x: x[DATASET_ID].astype(str)}),
        left_on='id',
        right_on=DATASET_ID,
        how='left'
    ).rename(columns={
        'cell_type': 'target_type',
        'cell_class': 'target_class',
        'cell_sub_class': 'target_sub_class'
    })
    
    influence_scores['source'] = source_ct
    
    return influence_scores

# Run in parallel
abdominal_influence_list = Parallel(n_jobs=n_cores, verbose=10, backend='threading')(
    delayed(calculate_abdominal_influence)(ct)
    for ct in abdominal_source_cts
)

# Remove None results
abdominal_influence_list = [df for df in abdominal_influence_list if df is not None]

# Combine
abdominal_influence = pd.concat(abdominal_influence_list, ignore_index=True)

print(f"\nCalculated {len(abdominal_influence):,} influence scores")

Calculating influence from 2,323 abdominal neurons to effectors...
Using 9 cores for parallel processing

Processing 675 abdominal neuron types...

[Parallel(n_jobs=9)]: Using backend ThreadingBackend with 9 concurrent workers.
[Parallel(n_jobs=9)]: Done   7 tasks      | elapsed:   11.2s
[Parallel(n_jobs=9)]: Done  14 tasks      | elapsed:   19.0s
[Parallel(n_jobs=9)]: Done  23 tasks      | elapsed:   24.5s
[Parallel(n_jobs=9)]: Done  32 tasks      | elapsed:   32.9s
[Parallel(n_jobs=9)]: Done  43 tasks      | elapsed:   42.1s
[Parallel(n_jobs=9)]: Done  54 tasks      | elapsed:   51.9s
[Parallel(n_jobs=9)]: Done  67 tasks      | elapsed:  1.1min
[Parallel(n_jobs=9)]: Done  80 tasks      | elapsed:  1.2min
[Parallel(n_jobs=9)]: Done  95 tasks      | elapsed:  1.4min
[Parallel(n_jobs=9)]: Done 110 tasks      | elapsed:  1.7min
[Parallel(n_jobs=9)]: Done 127 tasks      | elapsed:  1.9min
[Parallel(n_jobs=9)]: Done 144 tasks      | elapsed:  2.1min
[Parallel(n_jobs=9)]: Done 163 tasks      | elapsed:  2.4min
[Parallel(n_jobs=9)]: Done 182 tasks      | elapsed:  2.7min
[Parallel(n_jobs=9)]: Done 203 tasks      | elapsed:  3.0min
[Parallel(n_jobs=9)]: Done 224 tasks      | elapsed:  3.3min
[Parallel(n_jobs=9)]: Done 247 tasks      | elapsed:  3.6min
[Parallel(n_jobs=9)]: Done 270 tasks      | elapsed:  3.9min
[Parallel(n_jobs=9)]: Done 295 tasks      | elapsed:  4.3min
[Parallel(n_jobs=9)]: Done 320 tasks      | elapsed:  4.6min
[Parallel(n_jobs=9)]: Done 347 tasks      | elapsed:  5.0min
[Parallel(n_jobs=9)]: Done 374 tasks      | elapsed:  5.4min
[Parallel(n_jobs=9)]: Done 403 tasks      | elapsed:  5.8min
[Parallel(n_jobs=9)]: Done 432 tasks      | elapsed:  6.2min
[Parallel(n_jobs=9)]: Done 463 tasks      | elapsed:  6.6min
[Parallel(n_jobs=9)]: Done 494 tasks      | elapsed:  7.0min
[Parallel(n_jobs=9)]: Done 527 tasks      | elapsed:  7.5min
[Parallel(n_jobs=9)]: Done 560 tasks      | elapsed:  7.9min
[Parallel(n_jobs=9)]: Done 595 tasks      | elapsed:  8.4min
[Parallel(n_jobs=9)]: Done 630 tasks      | elapsed:  8.9min
[Parallel(n_jobs=9)]: Done 675 out of 675 | elapsed:  9.5min finished

Calculated 675,000 influence scores


In [17]:
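# Imports needed for the clustering step below
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, leaves_list

# Find the influence score column
influence_col = [col for col in abdominal_influence.columns if 'Influence_score' in col][0]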

# Aggregate by source and target
abdominal_influence_ct = abdominal_influence.groupby(
    ['source', 'target_sub_class']
).agg({
    influence_col: 'sum',
    'id': 'count'
}).reset_index().rename(columns={
    'id': 'n_targets',
    influence_col: 'influence'
})

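# Recalculate adjusted influence on the summed scores: log-transform, then add the offset of 24
# A summed influence of 0 gives log(0) = -inf, which is mapped to 0 on the next line
abdominal_influence_ct['adjusted_influence'] = np.log(abdominal_influence_ct['influence']) + 24
abdominal_influence_ct['adjusted_influence'] = abdominal_influence_ct['adjusted_influence'].replace([np.inf, -np.inf], 0)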

# Filter out missing
abdominal_influence_ct = abdominal_influence_ct[
    abdominal_influence_ct['source'].notna() &
    abdominal_influence_ct['target_sub_class'].notna()
]

print(f"Aggregated to {len(abdominal_influence_ct):,} source-target pairs")

# Create matrix for heatmap
abdominal_matrix = abdominal_influence_ct.pivot(
    index='source',
    columns='target_sub_class',
    values='adjusted_influence'
).fillna(0)

# Filter rows with low total influence
row_sums = abdominal_matrix.sum(axis=1)
abdominal_matrix = abdominal_matrix[row_sums > 50]

print(f"Heatmap matrix: {abdominal_matrix.shape[0]} sources x {abdominal_matrix.shape[1]} targets\n")

# Min-max normalize by column
abdominal_matrix_norm = abdominal_matrix.copy()
for col in abdominal_matrix_norm.columns:
    col_data = abdominal_matrix_norm[col]
    col_min = col_data.min()
    col_max = col_data.max()
    if col_max > col_min:
        abdominal_matrix_norm[col] = (col_data - col_min) / (col_max - col_min)
    else:
        abdominal_matrix_norm[col] = 0

abdominal_matrix_norm = abdominal_matrix_norm.replace([np.inf, -np.inf], 0).fillna(0)

# Cluster and reorder
row_linkage = linkage(pdist(abdominal_matrix_norm.values, metric='euclidean'), method='ward')
row_order = leaves_list(row_linkage)

col_linkage = linkage(pdist(abdominal_matrix_norm.T.values, metric='euclidean'), method='ward')
col_order = leaves_list(col_linkage)

abdominal_matrix_ordered = abdominal_matrix_norm.iloc[row_order, col_order]

# Create heatmap
fig = go.Figure(data=go.Heatmap(
    z=abdominal_matrix_ordered.values,
    x=abdominal_matrix_ordered.columns,
    y=abdominal_matrix_ordered.index,
    colorscale='Viridis',
    colorbar=dict(title='Normalized<br>Influence'),
    hovertemplate='Source: %{y}<br>Target: %{x}<br>Normalized Influence: %{z:.2f}<extra></extra>'
))

fig.update_layout(
    title='Abdominal Neuron Influence on Effector Neurons',
    xaxis_title='Target: Effector Neuron Sub-Class',
    yaxis_title='Source: Abdominal Neuron Type',
    width=1400,
    height=1200,
    xaxis={'tickangle': -45, 'tickfont': {'size': 6}},
    yaxis={'tickfont': {'size': 4}}
)

fig.write_html(f"{IMG_DIR}/{DATASET}_abdominal_influence_heatmap.html")
print("✓ Saved abdominal influence heatmap")
fig.show()

Aggregated to 16,875 source-target pairs
Heatmap matrix: 670 sources x 25 targets

✓ Saved abdominal influence heatmap
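
The normalisation and ordering steps in the cell above can be checked on a small toy matrix. The sketch below is a vectorised equivalent of the column-wise min-max loop, followed by the same Ward clustering and leaf ordering. The matrix values and the `src_*` / `tgt_*` labels are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

# Toy adjusted-influence matrix (hypothetical values): sources x targets
toy = pd.DataFrame(
    [[10.0, 2.0, 5.0],
     [12.0, 2.0, 1.0],
     [1.0, 2.0, 9.0],
     [0.0, 2.0, 8.0]],
    index=["src_a", "src_b", "src_c", "src_d"],
    columns=["tgt_1", "tgt_2", "tgt_3"],
)

# Column-wise min-max scaling; constant columns (range 0) are set to 0
col_range = toy.max() - toy.min()
toy_norm = ((toy - toy.min()) / col_range.replace(0, np.nan)).fillna(0.0)

# Ward linkage on Euclidean distances, then read off the dendrogram leaf order
row_order = leaves_list(linkage(pdist(toy_norm.values, metric="euclidean"), method="ward"))
col_order = leaves_list(linkage(pdist(toy_norm.T.values, metric="euclidean"), method="ward"))

# Reorder rows and columns so similar profiles sit next to each other
toy_ordered = toy_norm.iloc[row_order, col_order]
print(toy_ordered)
```

Because each column is scaled independently, the heatmap colours compare sources within a target, not influence magnitudes across targets.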


# Your Turn: New Challenge

Re-run Example 1, but switch from the BANC dataset to the maleCNS dataset. What do you notice? Then try a different target population: instead of the dopaminergic neurons of the mushroom body, look at sensory influence onto mushroom body output neurons (MBONs) by filtering for cell types containing "MBON".

```python
# To work with a different dataset, change the DATASET and DATASET_ID variables at the top:
# DATASET = "malecns_09"
# DATASET_ID = "malecns_09_id"

# Then re-run the entire notebook to see how the results differ!
# Differences likely reflect differences in annotation between projects
```
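
To select MBONs as targets, one option is a substring match on `cell_type`, in the same way the pC1 selection in Extension 1 works. Here is a minimal sketch on a made-up metadata table. The rows and IDs are hypothetical; only the `root_id` and `cell_type` column names follow the notebook.

```python
import pandas as pd

# Hypothetical metadata rows; real metadata comes from the dataset's meta table
toy_meta = pd.DataFrame({
    "root_id": ["1", "2", "3", "4"],
    "cell_type": ["MBON01", "PAM04", None, "MBON14"],
})

# na=False treats neurons without a cell_type annotation as non-matches
mbon_meta = toy_meta[toy_meta["cell_type"].str.contains("MBON", na=False)]
mbon_ids = mbon_meta["root_id"].unique()
print(f"Found {len(mbon_ids)} MBONs: {list(mbon_ids)}")
```

The resulting IDs would then take the place of the dopaminergic target IDs used in Example 1.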

# Extensions

Below are more involved analyses, with longer compute times. Working through these will show you how to think about sensory and effector influence together, plot a UMAP based on influence scores, and interpret the biology of our results.

## Extension 1: Specific olfactory channel influence onto pC1 neurons in BANC and maleCNS

pC1 neurons are a small cluster of sexually dimorphic, doublesex/fruitless-positive neurons in the Drosophila central brain that act as a hub for integrating social cues and controlling sex-specific internal state and behaviour. In the male literature they are often referred to as the P1 cluster; in both sexes they sit at the top of a hierarchy that gates courtship, aggression, and related states.

Since BANC is a female nervous system and maleCNS is a male one, we can directly compare information flow onto this sexually dimorphic cell type across the two.

We are interested in seeing which antennal lobe glomeruli (olfactory and thermosensory) and which gustatory neuron cell sub-classes influence pC1 neurons in both datasets.

First, let's read the BANC and maleCNS metadata and select our pC1 neurons with a regex search for "pC1" in the `cell_type` column.

In [18]:
# Load datasets for pC1 analysis
print("Loading BANC and maleCNS datasets...")
print("This may take a few minutes...\n")

# Dataset 1: BANC (female)
dataset1 = "banc_746"
dataset_id1 = "banc_746_id"
meta_path1 = construct_path(DATA_PATH, dataset1, "meta")
meta1 = read_feather_gcs(meta_path1, gcs_fs=gcs).rename(columns={dataset_id1: 'root_id'})
meta1['root_id'] = meta1['root_id'].astype(str)

edgelist_path1 = construct_path(DATA_PATH, dataset1, "edgelist_simple")
edgelist_simple1 = read_feather_gcs(edgelist_path1, gcs_fs=gcs)
edgelist_simple1 = edgelist_simple1[edgelist_simple1['norm'] >= 0.0001].copy()
edgelist_simple1['pre'] = edgelist_simple1['pre'].astype(str)
edgelist_simple1['post'] = edgelist_simple1['post'].astype(str)

print(f"✓ Loaded BANC: {len(meta1):,} neurons, {len(edgelist_simple1):,} connections")

# Dataset 2: maleCNS (male)
dataset2 = "malecns_09"
dataset_id2 = "malecns_09_id"
meta_path2 = construct_path(DATA_PATH, dataset2, "meta")
meta2 = read_feather_gcs(meta_path2, gcs_fs=gcs).rename(columns={dataset_id2: 'root_id'})
meta2['root_id'] = meta2['root_id'].astype(str)

edgelist_path2 = construct_path(DATA_PATH, dataset2, "edgelist_simple")
edgelist_simple2 = read_feather_gcs(edgelist_path2, gcs_fs=gcs)
edgelist_simple2 = edgelist_simple2[edgelist_simple2['norm'] >= 0.0001].copy()
edgelist_simple2['pre'] = edgelist_simple2['pre'].astype(str)
edgelist_simple2['post'] = edgelist_simple2['post'].astype(str)

print(f"✓ Loaded maleCNS: {len(meta2):,} neurons, {len(edgelist_simple2):,} connections")

Loading BANC and maleCNS datasets...
This may take a few minutes...

📦 Loading from cache: sjcabs_2025_data_banc_banc_746_meta.feather
✓ Loaded 168,759 rows (cached)
📦 Loading from cache: sjcabs_2025_data_banc_banc_746_simple_edgelist.feather


✓ Loaded 111,225,103 rows (cached)


✓ Loaded BANC: 168,759 neurons, 111,142,612 connections
📦 Loading from cache: sjcabs_2025_data_malecns_malecns_09_meta.feather


✓ Loaded 165,114 rows (cached)
📦 Loading from cache: sjcabs_2025_data_malecns_malecns_09_simple_edgelist.feather


✓ Loaded 142,881,142 rows (cached)


✓ Loaded maleCNS: 165,114 neurons, 142,076,309 connections


In [19]:
# Select pC1 neurons with a regex match on cell_type

# BANC pC1 neurons
pc1_neurons_banc = meta1[meta1['cell_type'].str.contains('pC1', na=False, regex=True)]
pc1_ids_banc = pc1_neurons_banc['root_id'].unique()
print(f"BANC (female): Found {len(pc1_ids_banc)} pC1 neurons")

# maleCNS pC1 neurons  
pc1_neurons_male = meta2[meta2['cell_type'].str.contains('pC1', na=False, regex=True)]
pc1_ids_male = pc1_neurons_male['root_id'].unique()
print(f"maleCNS (male): Found {len(pc1_ids_male)} pC1 neurons")

# Get olfactory and gustatory source neurons from both datasets
# For BANC
olfactory_neurons_banc = meta1[
    meta1['cell_class'] == 'olfactory_receptor_neuron'
][['root_id', 'cell_sub_class', 'cell_type']].drop_duplicates()

gustatory_neurons_banc = meta1[
    meta1['cell_class'] == 'gustatory_receptor_neuron'
][['root_id', 'cell_sub_class', 'cell_type']].drop_duplicates()

print(f"\nBANC sources:")
print(f"  Olfactory neurons: {len(olfactory_neurons_banc)}")
print(f"  Gustatory neurons: {len(gustatory_neurons_banc)}")

# For maleCNS
olfactory_neurons_male = meta2[
    meta2['cell_class'] == 'olfactory_receptor_neuron'
][['root_id', 'cell_sub_class', 'cell_type']].drop_duplicates()

gustatory_neurons_male = meta2[
    meta2['cell_class'] == 'gustatory_receptor_neuron'
][['root_id', 'cell_sub_class', 'cell_type']].drop_duplicates()

print(f"\nmaleCNS sources:")
print(f"  Olfactory neurons: {len(olfactory_neurons_male)}")
print(f"  Gustatory neurons: {len(gustatory_neurons_male)}")

BANC (female): Found 10 pC1 neurons
maleCNS (male): Found 8 pC1 neurons

BANC sources:
  Olfactory neurons: 2169
  Gustatory neurons: 0

maleCNS sources:
  Olfactory neurons: 2635
  Gustatory neurons: 0


### Calculate Influence to pC1 Neurons

**Note:** Extension 1 demonstrates the setup for cross-dataset pC1 influence analysis. The full analysis involves:

1. Initializing influence calculators for both BANC and maleCNS
2. Calculating influence from olfactory glomeruli to pC1 neurons in both datasets
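3. Calculating influence from gustatory neurons to pC1 neurons (the `cell_class == 'gustatory_receptor_neuron'` filter above matched 0 neurons in both datasets, so the gustatory selection would likely need adjusting first)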
4. Comparing male vs female influence patterns with side-by-side heatmaps

This would require substantial computation time (~30+ minutes per dataset). The code structure follows Example 1, adapted for cross-dataset comparison.

For a complete implementation, students can:
- Initialize `InfluenceCalculator` objects for each dataset
- Use parallel processing to calculate influence scores
- Aggregate and visualise as shown in Examples 1 and 2
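
As a rough illustration of the final comparison step only, the sketch below aggregates made-up influence scores by source sub-class for each dataset, applies the same `log(influence) + 24` adjustment used earlier in this notebook, and aligns the two datasets side by side. The scores, the `glom_*` names, and the column names (`dataset`, `source_sub_class`, `influence`) are hypothetical and do not come from `InfluenceCalculator` output.

```python
import numpy as np
import pandas as pd

# Hypothetical per-target influence scores for two datasets
toy_scores = pd.DataFrame({
    "dataset": ["banc", "banc", "banc", "malecns", "malecns", "malecns"],
    "source_sub_class": ["glom_A", "glom_A", "glom_B", "glom_A", "glom_B", "glom_B"],
    "influence": [1e-9, 3e-9, 2e-10, 5e-9, 1e-10, 3e-10],
})

# Sum influence over target neurons for each dataset/source pair
summed = toy_scores.groupby(
    ["dataset", "source_sub_class"], as_index=False
)["influence"].sum()

# Log-transform with the +24 offset; log(0) = -inf is mapped to 0
with np.errstate(divide="ignore"):
    summed["adjusted_influence"] = (np.log(summed["influence"]) + 24).replace(
        [np.inf, -np.inf], 0
    )

# One row per source, one column per dataset, ready for side-by-side plotting
comparison = summed.pivot(
    index="source_sub_class", columns="dataset", values="adjusted_influence"
)
comparison["female_minus_male"] = comparison["banc"] - comparison["malecns"]
print(comparison.round(2))
```

Because the adjustment is logarithmic, the difference column is a log-ratio of summed influence between the two datasets, which is one simple way to flag candidate sex-specific inputs before plotting heatmaps.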

# Summary

In this tutorial, you learned how to:

1. ✓ Calculate indirect connectivity using the influence metric
2. ✓ Set up and use the InfluenceCalculator package
3. ✓ Analyse influence from sensory neurons to dopaminergic neurons
4. ✓ Visualise influence patterns with heatmaps and UMAP
5. ✓ Interpret biological significance of influence scores

## Key Takeaways

- **Indirect connectivity matters**: The influence metric reveals how signals propagate through multi-synaptic pathways, capturing functional relationships that direct connectivity alone misses

- **Random walks capture signal flow**: By simulating random walks through the connectome with biologically-inspired stopping probabilities, influence quantifies the likelihood of signal transmission between neurons

- **Strong connections dominate**: Filtering for strong synaptic connections (e.g., ≥5 synapses) dramatically reduces computational load whilst preserving the most functionally relevant pathways

- **Cell type aggregation reveals patterns**: Summing influence scores by cell type transforms neuron-level complexity into interpretable functional relationships between neural populations

- **Visualisation is essential**: Heatmaps reveal specific source-target relationships, whilst UMAP embeddings expose global patterns and functional groupings in high-dimensional influence data

- **Cross-dataset comparisons are powerful**: Analysing influence patterns across datasets (e.g., BANC vs maleCNS) reveals conserved circuit motifs and sex-specific differences in connectivity

- **Biological validation**: Influence patterns align with known biology (e.g., sensory neurons influence dopaminergic learning centres) whilst revealing novel pathways for experimental investigation

## Session Info

In [20]:
# Print session information
import sys
import platform

print("Python Implementation")
print(f"Python version: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"\nKey packages:")

packages = [
    'pandas', 'numpy', 'plotly', 'scipy',
    'umap', 'joblib', 'gcsfs'
]

for pkg in packages:
    try:
        mod = __import__(pkg)
        version = getattr(mod, '__version__', 'unknown')
        print(f"  {pkg}: {version}")
    except ImportError:
        print(f"  {pkg}: not installed")

print(f"\nDataset: {DATASET}")
print(f"Data path: {DATA_PATH}")
print(f"\nNotebook completed successfully!")

Python Implementation
Python version: 3.10.19 (main, Oct 21 2025, 16:37:10) [Clang 20.1.8 ]
Platform: macOS-15.6.1-arm64-arm-64bit

Key packages:
  pandas: 2.3.3
  numpy: 2.2.5
  plotly: 6.5.0
  scipy: 1.15.3
  umap: 0.5.9.post2
  joblib: 1.5.3
  gcsfs: 2025.12.0

Dataset: banc_746
Data path: gs://sjcabs_2025_data

Notebook completed successfully!
