# Reproducibility Distance Analysis with Single-Cell Integration

## Overview
This notebook performs distance-based reproducibility analysis using integrated bulk RNA-seq and single-cell data. It focuses on hierarchical clustering, marker gene visualization, and cross-platform reproducibility assessment.

## Objectives
1. Load sample distance matrices from integrated bulk + single-cell data
2. Perform hierarchical clustering with sample-specific visualization
3. Identify and visualize marker genes from single-cell clusters
4. Generate expression heatmaps for reproducibility assessment
5. Compare gene expression patterns across conditions and genotypes
6. Assess technical reproducibility between experimental replicates

## Expected Outputs
- Hierarchical clustering dendrograms with sample coloring
- Expression heatmaps for marker genes across conditions
- Cross-condition reproducibility metrics
- Publication-ready plots for manuscript figures
- Summary statistics for experimental validation

## Input Requirements
- Sample distance matrix from batch-corrected expression data
- Sample metadata with genotype, condition, and replicate information
- VST-normalized expression data for visualization
- Single-cell marker gene lists for each major cell cluster
- Single-cell metadata for condition mapping

## Analysis Pipeline
1. **Data Loading**: Import distance matrices, metadata, and expression data
2. **Hierarchical Clustering**: Generate dendrograms with sample annotations
3. **Marker Gene Integration**: Load single-cell markers and map to bulk data
4. **Expression Visualization**: Create heatmaps for marker genes
5. **Reproducibility Assessment**: Compare patterns across genotypes
6. **Results Export**: Save plots and summary statistics

## Configuration and Parameters

In [None]:
# Centralized Configuration
import os
import warnings
warnings.filterwarnings('ignore')

# Set reproducibility
import numpy as np
np.random.seed(42)

# Analysis parameters
TOP_CLUSTERS = 3              # Number of top clusters per sample to analyze
TOP_MARKERS_PER_CLUSTER = 10  # Number of top marker genes per cluster
HEATMAP_VMIN = 0             # Minimum value for heatmap visualization
HEATMAP_VMAX = 2             # Maximum value for heatmap visualization
DENDROGRAM_HEIGHT = 4        # Height for dendrogram plots
DENDROGRAM_WIDTH = 24        # Width for dendrogram plots
HEATMAP_HEIGHT = 10          # Height for heatmap plots
HEATMAP_WIDTH = 20           # Width for heatmap plots

# File paths
BASE_DIR = os.getcwd()
DATA_DIR = os.path.join(BASE_DIR, "bulk")
FIGURES_DIR = os.path.join(BASE_DIR, "figures")
RESULTS_DIR = os.path.join(BASE_DIR, "reproducibility_plots")

# Input files
DISTANCES_FILE = os.path.join(DATA_DIR, "reproducibility_genotypes_deseq2_limma_corr_pca_distances_wSC.tsv")
METADATA_FILE = os.path.join(DATA_DIR, "reproducibility_genotypes_wSC_meta.tsv")
VST_FILE = os.path.join(DATA_DIR, "reproducibility_genotypes_deseq2_vsd_wSC.tsv")
CONDITION_SUMMARY_FILE = os.path.join(DATA_DIR, "condition_summary.tsv")

# Single-cell data paths (update these paths to match your actual file locations)
SC_MARKERS_POST_FILE = "/home/jjanssens/jjans/analysis/iNeuron_morphogens/final/marker_genes/iGlut_post_dr_clustered_raw_merged_markers.tsv"
SC_META_POST_FILE = "/home/jjanssens/jjans/analysis/iNeuron_morphogens/final/scanpy/iGlut_post_dr_clustered_raw_merged_meta.tsv"
SC_MARKERS_PRE_FILE = "/home/jjanssens/jjans/analysis/iNeuron_morphogens/final/marker_genes/iGlut_pre_dr_clustered_raw_merged_markers.tsv"
SC_META_PRE_FILE = "/home/jjanssens/jjans/analysis/iNeuron_morphogens/final/scanpy/iGlut_pre_dr_clustered_raw_merged_meta_fixed.tsv"

# Sample ID mapping
SAMPLE_TO_ID = {
    '1': 'p1_D4',   '2': 'p1_D8',   '3': 'p1_D10',
    '4': 'p1_B4',   '5': 'p1_B8',   '6': 'p1_B10',
    '7': 'p3_C2',   '8': 'p3_F2',   '9': 'p3_D1',
    '10': 'p3_F4',  '11': 'p3_G1',  '12': 'p3_G10'
}

TESTED_CONDITIONS = ['p1_D4', 'p1_D8', 'p1_D10', 'p1_B4', 'p1_B8', 'p1_B10',
                    'p3_C2', 'p3_F2', 'p3_D1', 'p3_F4', 'p3_G1', 'p3_G10']

# Color scheme for sample visualization
SAMPLE_COLORS = ['red', 'green', 'blue', 'cyan', 'magenta', 'gold', 
                'black', 'lightblue', 'orange', 'purple', 'brown', 'pink']

# Create output directories
os.makedirs(FIGURES_DIR, exist_ok=True)
os.makedirs(RESULTS_DIR, exist_ok=True)

print("Configuration loaded successfully")
print(f"Base directory: {BASE_DIR}")
print(f"Data directory: {DATA_DIR}")
print(f"Figures directory: {FIGURES_DIR}")
print(f"Results directory: {RESULTS_DIR}")
print(f"Top clusters per sample: {TOP_CLUSTERS}")
print(f"Top markers per cluster: {TOP_MARKERS_PER_CLUSTER}")

## Library Imports

In [None]:
# Core data analysis libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Set matplotlib parameters for publication quality
mpl.rcParams['pdf.fonttype'] = 42
mpl.rcParams['ps.fonttype'] = 42
mpl.rcParams['figure.dpi'] = 300
mpl.rcParams['savefig.dpi'] = 350
mpl.rcParams['savefig.bbox'] = 'tight'
mpl.rcParams['savefig.pad_inches'] = 0.1

# Clustering and analysis libraries
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform

print("All libraries loaded successfully")
print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")
print(f"matplotlib version: {mpl.__version__}")
print(f"seaborn version: {sns.__version__}")

## Data Loading and Validation

In [None]:
import os
import pandas as pd

# File paths (DISTANCES_FILE, METADATA_FILE, VST_FILE and the SC_*_FILE paths)
# are defined once in the configuration cell. They are not redefined here, so
# the loader and the session report always refer to the same files.

# Comprehensive data loading and validation
def load_and_validate_data():
    """Load all required data files with validation"""
    print("Loading distance matrix and metadata...")
    
    # Load main datasets
    try:
        # Distance matrix
        print(f"Loading distance matrix from: {DISTANCES_FILE}")
        distance_df = pd.read_csv(DISTANCES_FILE, sep="\t", index_col=0)
        print(f"Distance matrix loaded: {distance_df.shape}")
        
        # Metadata
        print(f"Loading metadata from: {METADATA_FILE}")
        metadata = pd.read_csv(METADATA_FILE, sep="\t", index_col=0)
        print(f"Metadata loaded: {metadata.shape}")
        
        # VST normalized expression data
        print(f"Loading VST data from: {VST_FILE}")
        vst_data = pd.read_csv(VST_FILE, sep="\t", index_col=0)
        print(f"VST data loaded: {vst_data.shape}")
        
    except FileNotFoundError as e:
        print(f"Error: Required file not found - {e}")
        raise
    except Exception as e:
        print(f"Error loading main datasets: {e}")
        raise
        
    # Load single-cell data (with fallback if files don't exist)
    sc_data = {}
    try:
        # Single-cell post markers and metadata
        if os.path.exists(SC_MARKERS_POST_FILE):
            sc_data['markers_post'] = pd.read_csv(SC_MARKERS_POST_FILE, sep="\t", index_col=0)
            print(f"SC post markers loaded: {sc_data['markers_post'].shape}")
        else:
            print(f"Warning: SC post markers file not found: {SC_MARKERS_POST_FILE}")
            
        if os.path.exists(SC_META_POST_FILE):
            sc_data['meta_post'] = pd.read_csv(SC_META_POST_FILE, sep="\t", index_col=0)
            print(f"SC post metadata loaded: {sc_data['meta_post'].shape}")
        else:
            print(f"Warning: SC post metadata file not found: {SC_META_POST_FILE}")
            
        # Single-cell pre markers and metadata
        if os.path.exists(SC_MARKERS_PRE_FILE):
            sc_data['markers_pre'] = pd.read_csv(SC_MARKERS_PRE_FILE, sep="\t", index_col=0)
            print(f"SC pre markers loaded: {sc_data['markers_pre'].shape}")
        else:
            print(f"Warning: SC pre markers file not found: {SC_MARKERS_PRE_FILE}")
            
        if os.path.exists(SC_META_PRE_FILE):
            sc_data['meta_pre'] = pd.read_csv(SC_META_PRE_FILE, sep="\t", index_col=0)
            print(f"SC pre metadata loaded: {sc_data['meta_pre'].shape}")
        else:
            print(f"Warning: SC pre metadata file not found: {SC_META_PRE_FILE}")
            
    except Exception as e:
        print(f"Warning: Error loading single-cell data: {e}")
        print("Continuing with bulk data analysis only...")
    
    # Data validation
    print("\nData validation:")
    
    # Check distance matrix is square
    if distance_df.shape[0] != distance_df.shape[1]:
        raise ValueError("Distance matrix must be square")
    print(f"✓ Distance matrix is square: {distance_df.shape}")
    
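    # Check sample overlap. The set decides membership; the list keeps the
    # distance-matrix order (a set has no stable order, and recent pandas
    # versions reject sets as indexers)
    shared_samples = set(distance_df.index) & set(metadata.index) & set(vst_data.columns)
    common_samples = [s for s in distance_df.index if s in shared_samples]
    print(f"✓ Common samples across datasets: {len(common_samples)}")
    
    if len(common_samples) == 0:
        raise ValueError("No common samples found between datasets")
    
    # Align datasets to common samples
    distance_df = distance_df.loc[common_samples, common_samples]
    metadata = metadata.loc[common_samples]
    vst_data = vst_data[common_samples]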
    
    print(f"✓ Datasets aligned to {len(common_samples)} common samples")
    
    # Display data overview
    print(f"\nData overview:")
    print(f"Distance matrix: {distance_df.shape}")
    print(f"Metadata columns: {list(metadata.columns)}")
    print(f"VST expression: {vst_data.shape}")
    print(f"Genotype distribution: {metadata['genotype'].value_counts().to_dict()}")
    print(f"Sample distribution: {metadata['sample'].value_counts().to_dict()}")
    
    return {
        'distance_df': distance_df,
        'metadata': metadata,
        'vst_data': vst_data,
        'sc_data': sc_data
    }

# Load all data
data_dict = load_and_validate_data()
distance_df = data_dict['distance_df']
metadata = data_dict['metadata']
vst_data = data_dict['vst_data']
sc_data = data_dict['sc_data']

print("\nData loading completed successfully!")
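
The alignment step can be illustrated on toy data. This is a minimal sketch, assuming hypothetical sample names (`wt_1`, `wt_2`, `ko_1`) and a made-up gene (`GENE_A`) that do not come from the real input files; it shows how an ordered list of shared samples keeps all three tables in the same order.

```python
import pandas as pd

# Toy inputs (hypothetical names and values, not taken from the real files)
distance_df = pd.DataFrame(
    [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]],
    index=["wt_1", "wt_2", "ko_1"],
    columns=["wt_1", "wt_2", "ko_1"],
)
metadata = pd.DataFrame({"genotype": ["ko", "wt"]}, index=["ko_1", "wt_1"])
vst_data = pd.DataFrame(
    [[5.0, 6.0, 7.0]], index=["GENE_A"], columns=["ko_1", "wt_2", "wt_1"]
)

# The set intersection decides membership; the list keeps distance-matrix order
shared = set(distance_df.index) & set(metadata.index) & set(vst_data.columns)
common_samples = [s for s in distance_df.index if s in shared]

# All three tables are subset with the same ordered list
aligned_dist = distance_df.loc[common_samples, common_samples]
aligned_meta = metadata.loc[common_samples]
aligned_vst = vst_data[common_samples]

print(common_samples)
```

Because the same list indexes every table, row `i` of the distance matrix, row `i` of the metadata, and column `i` of the expression matrix describe the same sample.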

## Hierarchical Clustering Analysis

In [None]:
import numpy as np
import scipy.cluster.hierarchy as sch

# SAMPLE_COLORS (12 entries, one per sample) comes from the configuration cell.
# Redefining it here with fewer colors would leave some samples without a color.

# Comprehensive hierarchical clustering analysis
def perform_hierarchical_clustering(distance_matrix, metadata):
    """Perform hierarchical clustering and generate dendrogram"""
    print("Performing hierarchical clustering analysis...")
    
    # Convert square distance matrix to condensed form for scipy
    def square_to_condensed(square_matrix):
        """Convert square distance matrix to condensed format"""
        if not isinstance(square_matrix, np.ndarray):
            square_matrix = square_matrix.values
        
        # Ensure matrix is square
        assert square_matrix.shape[0] == square_matrix.shape[1], "Distance matrix must be square"
        
        # Extract upper triangle (excluding diagonal)
        triu_indices = np.triu_indices(square_matrix.shape[0], k=1)
        condensed_matrix = square_matrix[triu_indices]
        return condensed_matrix
    
    # Convert to condensed format
    condensed_distances = square_to_condensed(distance_matrix)
    print(f"Converted {distance_matrix.shape} square matrix to {condensed_distances.shape} condensed format")
    
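    # Perform hierarchical clustering.
    # Note: Ward linkage is only well defined for Euclidean distances, which is
    # assumed to hold for the PCA-based distances loaded above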
    linkage_matrix = sch.linkage(condensed_distances, method='ward')
    print("Hierarchical clustering completed using Ward linkage")
    
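    # Create sample-to-color mapping (fail early if the palette is too short,
    # instead of hitting a KeyError when leaf colors are assigned)
    unique_samples = sorted(metadata['sample'].unique().astype(str))
    if len(unique_samples) > len(SAMPLE_COLORS):
        raise ValueError(
            f"{len(unique_samples)} samples but only {len(SAMPLE_COLORS)} colors defined"
        )
    sample_colors = dict(zip(unique_samples, SAMPLE_COLORS))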
    
    # Create leaf color mapping
    leaf_colors = {}
    for idx, sample in metadata['sample'].items():
        leaf_colors[idx] = sample_colors[str(sample)]
    
    print(f"Created color mapping for {len(unique_samples)} unique samples")
    print(f"Sample colors: {sample_colors}")
    
    return {
        'linkage_matrix': linkage_matrix,
        'condensed_distances': condensed_distances,
        'leaf_colors': leaf_colors,
        'sample_colors': sample_colors
    }

# Assuming distance_df and metadata are defined
# Perform hierarchical clustering
clustering_results = perform_hierarchical_clustering(distance_df, metadata)
linkage_matrix = clustering_results['linkage_matrix']
leaf_colors = clustering_results['leaf_colors']
sample_colors = clustering_results['sample_colors']

print("\nHierarchical clustering completed successfully!")
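
The square-to-condensed conversion used above can be checked on a small example. The sketch below assumes four hypothetical samples placed in a 2-D embedding; it compares the manual upper-triangle extraction with `scipy.spatial.distance.squareform` and then runs Ward linkage on the result.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist, squareform

# Four hypothetical samples in a 2-D embedding (for example two principal components)
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
square = squareform(pdist(points))  # symmetric 4 x 4 matrix with a zero diagonal

# Upper triangle without the diagonal, read row by row
condensed_manual = square[np.triu_indices(square.shape[0], k=1)]
# squareform performs the same conversion when given a square matrix
condensed_scipy = squareform(square)

# Ward linkage on the condensed vector, then cut the tree into two clusters
linkage_matrix = sch.linkage(condensed_manual, method="ward")
cluster_ids = sch.fcluster(linkage_matrix, t=2, criterion="maxclust")

print(condensed_manual.shape, linkage_matrix.shape, cluster_ids)
```

For `n` samples the condensed vector has `n * (n - 1) / 2` entries and the linkage matrix has `n - 1` rows. `squareform` additionally checks symmetry and a zero diagonal by default, which can be a useful sanity check on a distance matrix loaded from disk.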


In [None]:
import os
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

# leaf_colors (sample name -> color) was built by perform_hierarchical_clustering()
# in the previous cell and is reused here, so both cells share one color mapping.

# Generate publication-quality dendrogram
def generate_dendrogram(linkage_matrix, sample_labels, leaf_colors, save_prefix="reproducibility_dendrogram"):
    """Generate and save colored dendrogram"""
    print("Generating dendrogram with sample-specific coloring...")
    
    # Create figure
    plt.figure(figsize=(DENDROGRAM_WIDTH, DENDROGRAM_HEIGHT))
    
    # Generate dendrogram
    dendrogram = sch.dendrogram(
        linkage_matrix,
        labels=sample_labels,
        leaf_rotation=90,
        leaf_font_size=12,
        color_threshold=0
    )
    
    # Apply custom colors to leaf labels
    ax = plt.gca()
    x_labels = ax.get_xmajorticklabels()
    
    colored_labels = []
    for label in x_labels:
        txt = label.get_text()
        if txt in leaf_colors:
            color_code = leaf_colors[txt]
            label.set_color(color_code)
            colored_labels.append(txt)
    
    # Improve plot aesthetics
    plt.title('Hierarchical Clustering of Samples\n(Colored by Sample ID)', 
              fontsize=16, fontweight='bold', pad=20)
    plt.xlabel('Sample ID', fontsize=14, fontweight='bold')
    plt.ylabel('Distance', fontsize=14, fontweight='bold')
    plt.grid(True, alpha=0.3)
    
    # Save plots
    png_file = os.path.join(FIGURES_DIR, f"{save_prefix}_colored_by_sample.png")
    pdf_file = os.path.join(FIGURES_DIR, f"{save_prefix}_colored_by_sample.pdf")
    
    plt.savefig(png_file, dpi=350, bbox_inches='tight', pad_inches=0.1)
    plt.savefig(pdf_file, dpi=350, bbox_inches='tight', pad_inches=0.1)
    
    print(f"Dendrogram saved to: {png_file}")
    print(f"Dendrogram saved to: {pdf_file}")
    
    plt.show()
    plt.close()
    
    return {
        'dendrogram': dendrogram,
        'colored_labels': colored_labels,
        'png_file': png_file,
        'pdf_file': pdf_file
    }

# Generate dendrogram
dendrogram_results = generate_dendrogram(
    linkage_matrix, 
    distance_df.index.tolist(), 
    leaf_colors
)

print("\nDendrogram generation completed successfully!")
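
The link between leaf order and label colors can be inspected without drawing anything. This is a minimal sketch with hypothetical sample names and a hypothetical `leaf_colors` mapping; `no_plot=True` makes `sch.dendrogram` return only the layout, whose `ivl` entry lists the leaf labels from left to right.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Hypothetical sample names and colors (two replicates of two samples)
labels = ["s1_rep1", "s2_rep1", "s1_rep2", "s2_rep2"]
leaf_colors = {
    "s1_rep1": "red", "s1_rep2": "red",
    "s2_rep1": "blue", "s2_rep2": "blue",
}

# Replicates of the same sample sit close together in this toy embedding
points = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 1.0], [5.0, 1.0]])
linkage_matrix = sch.linkage(pdist(points), method="ward")

# no_plot=True returns the dendrogram layout without creating a figure
layout = sch.dendrogram(linkage_matrix, labels=labels, no_plot=True)
leaf_order = layout["ivl"]  # leaf labels in left-to-right plotting order

# Colors in the same order as the tick labels; unknown names fall back to black
tick_colors = [leaf_colors.get(name, "black") for name in leaf_order]

print(list(zip(leaf_order, tick_colors)))
```

When replicates cluster together, neighboring tick labels share a color, which is the visual cue the colored dendrogram relies on.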


## Marker Gene Analysis and Visualization

In [None]:
# Comprehensive marker gene extraction and heatmap generation
def extract_and_visualize_markers(sc_data, metadata, vst_data):
    """Extract marker genes from single-cell data and create visualizations"""
    print("Extracting marker genes and generating visualizations...")
    
    # Check if single-cell data is available
    if not sc_data or not all(key in sc_data for key in ['markers_post', 'meta_post', 'markers_pre', 'meta_pre']):
        print("Warning: Single-cell files not available. Using top variable genes instead.")
        # Use top variable genes as fallback
        gene_variance = vst_data.var(axis=1)
        marker_genes = gene_variance.nlargest(500).index.tolist()
        print(f"Using top {len(marker_genes)} variable genes for analysis")
        return marker_genes
    
    try:
        # Access single-cell data
        sc_markers_post = sc_data['markers_post']
        sc_meta_post = sc_data['meta_post']
        sc_markers_pre = sc_data['markers_pre']
        sc_meta_pre = sc_data['meta_pre']
        
        print("Single-cell data loaded successfully")
        
        # Create condition columns
        sc_meta_post['condition'] = (sc_meta_post['AP_axis'] + "_" + 
                                   sc_meta_post['DV_axis'] + "_" + 
                                   sc_meta_post['Basal_media'])
        sc_meta_pre['condition'] = (sc_meta_pre['AP_axis'] + "_" + 
                                  sc_meta_pre['DV_axis'] + "_" + 
                                  sc_meta_pre['Basal_media'])
        
        # Extract markers for each sample
        all_markers = []
        samples = sorted([str(x) for x in metadata['sample'].unique()])
        
        for sample in samples:
            if sample in SAMPLE_TO_ID:
                sample_id = SAMPLE_TO_ID[sample]
                print(f"Processing sample {sample} (ID: {sample_id})")
                
                # Determine sample type and get appropriate data
                if 'p1' in sample_id:
                    markers_df = sc_markers_post
                    meta_df = sc_meta_post
                elif 'p3' in sample_id:
                    markers_df = sc_markers_pre  
                    meta_df = sc_meta_pre
                else:
                    continue
                
                # Get top clusters for this sample
                sample_cells = meta_df[meta_df['parse_id'] == sample_id]
                if sample_cells.empty:
                    continue
                    
                top_clusters = sample_cells['final_clustering'].value_counts().head(TOP_CLUSTERS).index.tolist()
                
                # Extract markers for each cluster
                for cluster in top_clusters:
                    cluster_markers = markers_df[markers_df['cluster_old'] == cluster]
                    if not cluster_markers.empty:
                        top_markers = cluster_markers.index[:TOP_MARKERS_PER_CLUSTER].tolist()
                        all_markers.extend(top_markers)
        
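        # Filter markers that exist in bulk data
        available_markers = [m for m in all_markers if m in vst_data.index]
        # Remove duplicates while keeping first-seen order, so heatmap rows are
        # ordered identically on every run (set() gives no stable order)
        marker_genes = list(dict.fromkeys(available_markers))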
        
        print(f"Extracted {len(marker_genes)} unique marker genes")
        return marker_genes
        
    except Exception as e:
        print(f"Error processing single-cell data: {e}")
        print("Using top variable genes as fallback")
        gene_variance = vst_data.var(axis=1) 
        marker_genes = gene_variance.nlargest(500).index.tolist()
        return marker_genes

# Extract marker genes
marker_genes = extract_and_visualize_markers(sc_data, metadata, vst_data)
print(f"\nMarker gene extraction completed: {len(marker_genes)} genes identified")

# Create condition summaries if single-cell data is available
if sc_data and 'meta_post' in sc_data and 'meta_pre' in sc_data:
    try:
        sc_meta_post = sc_data['meta_post'].copy()
        sc_meta_pre = sc_data['meta_pre'].copy()
        
        sc_meta_post['condition'] = sc_meta_post['AP_axis']+"_"+sc_meta_post['DV_axis']+"_"+sc_meta_post['Basal_media']
        sc_meta_pre['condition'] = sc_meta_pre['AP_axis']+"_"+sc_meta_pre['DV_axis']+"_"+sc_meta_pre['Basal_media']
        
        cond_post = sc_meta_post.groupby('condition').head(n=1)
        cond_pre = sc_meta_pre.groupby('condition').head(n=1)
        cond_combined = pd.concat([cond_post, cond_pre])
        cond_combined.index = cond_combined['parse_id']
        
        print("Single-cell condition summaries created successfully")
    except Exception as e:
        print(f"Warning: Could not create condition summaries: {e}")
else:
    print("Single-cell data not available for condition summaries")
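
The marker selection logic can be traced on a toy table. The sketch below reuses the column names from the cell above (`parse_id`, `final_clustering`, `cluster_old`) but fills them with hypothetical clusters and genes; it assumes the marker table is already ranked within each cluster, as the notebook does when it takes the first rows per cluster.

```python
import pandas as pd

TOP_CLUSTERS = 2
TOP_MARKERS_PER_CLUSTER = 2

# Hypothetical per-cell metadata: one row per cell
meta = pd.DataFrame({
    "parse_id": ["p1_D4"] * 6 + ["p1_D8"] * 2,
    "final_clustering": ["c0", "c0", "c0", "c1", "c1", "c2", "c2", "c2"],
})

# Hypothetical marker table: gene name as index, ranked within each cluster
markers = pd.DataFrame(
    {"cluster_old": ["c0", "c0", "c0", "c1", "c1", "c2"]},
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_B", "GENE_D", "GENE_E"],
)

# Most abundant clusters among the cells of one sample
cells = meta[meta["parse_id"] == "p1_D4"]
top_clusters = cells["final_clustering"].value_counts().head(TOP_CLUSTERS).index.tolist()

# First rows per cluster are taken as its top markers
selected = []
for cluster in top_clusters:
    ranked = markers[markers["cluster_old"] == cluster]
    selected.extend(ranked.index[:TOP_MARKERS_PER_CLUSTER].tolist())

# dict.fromkeys drops duplicates but keeps first-seen order
marker_genes = list(dict.fromkeys(selected))

print(top_clusters, marker_genes)
```

`GENE_B` is a marker of both clusters and appears once in the final list. Keeping first-seen order makes the gene order, and therefore the heatmap row order, stable between runs.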

In [None]:
# Comprehensive heatmap generation for reproducibility analysis
def generate_reproducibility_heatmaps(marker_genes, vst_data, metadata):
    """Generate heatmaps for reproducibility analysis across genotypes"""
    print("Generating reproducibility heatmaps...")
    
    if not marker_genes:
        print("No marker genes available for heatmap generation")
        return
    
    # Sort metadata for consistent ordering
    metadata_sorted = metadata.sort_values(by=['genotype', 'sample'])
    unique_genotypes = metadata_sorted['genotype'].unique()
    
    print(f"Generating heatmaps for {len(unique_genotypes)} genotypes: {list(unique_genotypes)}")
    
    # Generate heatmaps for each genotype
    for genotype in unique_genotypes:
        print(f"\nProcessing genotype: {genotype}")
        
        # Get samples for this genotype
        genotype_samples = metadata_sorted[metadata_sorted['genotype'] == genotype].index.tolist()
        
        if not genotype_samples:
            print(f"  No samples found for genotype {genotype}")
            continue
            
        print(f"  Samples: {len(genotype_samples)}")
        
        # Extract expression data for marker genes and samples
        try:
            expression_subset = vst_data.loc[marker_genes, genotype_samples]
            
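            # Z-score normalization per gene (row-wise across this genotype's samples);
            # genes with zero variance yield NaN and appear blank in the heatmap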
            expression_zscore = ((expression_subset.T - expression_subset.T.mean()) / 
                               expression_subset.T.std()).T
            
            # Create and save Z-score heatmap
            plt.figure(figsize=(HEATMAP_WIDTH, HEATMAP_HEIGHT))
            sns.heatmap(expression_zscore, 
                       vmin=HEATMAP_VMIN, vmax=HEATMAP_VMAX, 
                       cmap='Greys', cbar=True, 
                       xticklabels=True, yticklabels=False)
            
            plt.title(f'Marker Gene Expression - {genotype}\n(Z-score normalized)', 
                     fontsize=16, fontweight='bold')
            plt.xlabel('Samples', fontsize=12)
            plt.ylabel(f'Marker Genes (n={len(marker_genes)})', fontsize=12)
            
            # Save plot
            output_file = os.path.join(RESULTS_DIR, f"{genotype}_heatmap_zscore.png")
            plt.savefig(output_file, dpi=350, bbox_inches='tight', pad_inches=0.1)
            print(f"  Z-score heatmap saved: {output_file}")
            
            plt.show()
            plt.close()
            
            # Min-max normalization heatmap
            expression_minmax = ((expression_subset.T - expression_subset.T.min()) / 
                               (expression_subset.T.max() - expression_subset.T.min())).T
            
            plt.figure(figsize=(HEATMAP_WIDTH, HEATMAP_HEIGHT))
            sns.heatmap(expression_minmax, 
                       vmin=0.5, vmax=1, 
                       cmap='Greys', cbar=True,
                       xticklabels=True, yticklabels=False)
            
            plt.title(f'Marker Gene Expression - {genotype}\n(Min-Max normalized)', 
                     fontsize=16, fontweight='bold')
            plt.xlabel('Samples', fontsize=12)
            plt.ylabel(f'Marker Genes (n={len(marker_genes)})', fontsize=12)
            
            # Save plot
            output_file = os.path.join(RESULTS_DIR, f"{genotype}_heatmap_minmax.png")
            plt.savefig(output_file, dpi=350, bbox_inches='tight', pad_inches=0.1)
            print(f"  Min-max heatmap saved: {output_file}")
            
            plt.show()
            plt.close()
            
        except Exception as e:
            print(f"  Error generating heatmap for {genotype}: {e}")
            continue
    
    print("\nHeatmap generation completed successfully!")

# Generate all heatmaps
generate_reproducibility_heatmaps(marker_genes, vst_data, metadata)
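
The two per-gene scalings used for the heatmaps can be written without the double transpose. This is a small sketch on a hypothetical genes-by-samples matrix; it also shows what happens to a gene with constant expression.

```python
import numpy as np
import pandas as pd

# Hypothetical genes x samples matrix
expr = pd.DataFrame(
    [[1.0, 2.0, 3.0], [10.0, 10.0, 10.0], [4.0, 8.0, 6.0]],
    index=["GENE_A", "GENE_FLAT", "GENE_C"],
    columns=["s1", "s2", "s3"],
)

# Row-wise z-score: subtract each gene's mean, divide by its standard deviation
row_mean = expr.mean(axis=1)
row_std = expr.std(axis=1)  # sample standard deviation (ddof=1), the pandas default
zscore = expr.sub(row_mean, axis=0).div(row_std, axis=0)

# Row-wise min-max: rescale each gene to the interval [0, 1]
row_min = expr.min(axis=1)
row_range = expr.max(axis=1) - row_min
minmax = expr.sub(row_min, axis=0).div(row_range, axis=0)

# A constant gene gives 0/0 = NaN in both scalings; such rows can be dropped
zscore_clean = zscore.dropna(how="any")

print(zscore)
print(minmax)
```

Z-scores are centered on zero, so a color range starting at 0 (as set by `HEATMAP_VMIN`) shows only above-average expression and maps everything below the gene's mean to the lightest color.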

In [None]:
# Cross-sample reproducibility analysis
def analyze_cross_sample_reproducibility(distance_df):
    """Analyze reproducibility between matching samples across conditions"""
    print("Analyzing cross-sample reproducibility...")
    
    try:
        import re
        from collections import Counter
        
        # Extract sample identifiers
        all_samples = distance_df.index
        sample_ids = [re.sub(".*_", "", x) for x in all_samples]
        
        # Find matching samples across conditions
        sample_counts = Counter(sample_ids)
        matching_samples = [x for x in sample_counts if sample_counts[x] > 1]
        
        print(f"Found {len(matching_samples)} samples with replicates across conditions")
        
        if not matching_samples:
            print("No matching samples found for reproducibility analysis")
            return None
        
        # Calculate distances for matching vs non-matching samples
        dist_results = []
        for sample1 in matching_samples:
            for sample2 in matching_samples:
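                # Find all instances of these samples in the distance matrix.
                # Compare the extracted ID exactly: endswith() would also match
                # longer IDs that share a suffix (e.g. "_11" when looking for "1")
                sample1_indices = [x for x, sid in zip(all_samples, sample_ids) if sid == sample1]
                sample2_indices = [x for x, sid in zip(all_samples, sample_ids) if sid == sample2]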
                
                for s1_idx in sample1_indices:
                    for s2_idx in sample2_indices:
                        if s1_idx != s2_idx:  # Don't compare sample to itself
                            distance = distance_df.loc[s1_idx, s2_idx]
                            is_matching = 1 if sample1 == sample2 else 0
                            dist_results.append({
                                'sample1': sample1,
                                'sample2': sample2, 
                                'distance': distance,
                                'is_matching': is_matching
                            })
        
        dist_df_results = pd.DataFrame(dist_results)
        
        if dist_df_results.empty:
            print("No distance comparisons could be made")
            return None
        
        # Generate reproducibility plot
        plt.figure(figsize=(10, 6))
        sns.boxplot(x='is_matching', y='distance', data=dist_df_results)
        plt.xlabel('Sample Matching (0=Different, 1=Same)', fontsize=12)
        plt.ylabel('Distance', fontsize=12)
        plt.title('Cross-Sample Reproducibility Analysis\n(Lower distances indicate better reproducibility)', 
                  fontsize=14, fontweight='bold')
        
        # Add statistics
        matching_distances = dist_df_results[dist_df_results['is_matching'] == 1]['distance']
        different_distances = dist_df_results[dist_df_results['is_matching'] == 0]['distance']
        
        plt.text(0.5, plt.ylim()[1] * 0.9, 
                f'Mean distance (same sample): {matching_distances.mean():.3f}\n'
                f'Mean distance (different samples): {different_distances.mean():.3f}',
                ha='center', va='top', bbox=dict(boxstyle='round', facecolor='lightgray'))
        
        # Save plot
        output_file = os.path.join(RESULTS_DIR, "cross_sample_reproducibility.png")
        plt.savefig(output_file, dpi=350, bbox_inches='tight', pad_inches=0.1)
        print(f"Reproducibility plot saved: {output_file}")
        
        plt.show()
        plt.close()
        
        # Summary statistics
        print(f"\nReproducibility Summary:")
        print(f"Same sample mean distance: {matching_distances.mean():.4f} ± {matching_distances.std():.4f}")
        print(f"Different samples mean distance: {different_distances.mean():.4f} ± {different_distances.std():.4f}")
        print(f"Reproducibility ratio: {matching_distances.mean() / different_distances.mean():.4f}")
        
        return dist_df_results
        
    except Exception as e:
        print(f"Error in reproducibility analysis: {e}")
        return None

# Perform cross-sample reproducibility analysis
reproducibility_results = analyze_cross_sample_reproducibility(distance_df)

if reproducibility_results is not None:
    print("\nCross-sample reproducibility analysis completed successfully!")
else:
    print("\nCross-sample reproducibility analysis could not be completed")
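
The same-sample versus different-sample comparison can be reproduced on a toy matrix. The sketch assumes a hypothetical naming scheme `<genotype>_<sample>` and made-up distances; it extracts the sample ID with the same regular expression as the cell above and visits each unordered pair once.

```python
import re
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical names following "<genotype>_<sample>"; the distances are made up
names = ["wt_1", "ko_1", "wt_11", "ko_11"]
dist = pd.DataFrame(
    [[0.0, 1.0, 8.0, 9.0],
     [1.0, 0.0, 9.0, 8.0],
     [8.0, 9.0, 0.0, 2.0],
     [9.0, 8.0, 2.0, 0.0]],
    index=names, columns=names,
)

# Keep the text after the last underscore as the sample ID
sample_id = {name: re.sub(".*_", "", name) for name in names}

# Visit each unordered pair once and sort it into same-sample or different-sample
within, between = [], []
for a, b in combinations(names, 2):
    target = within if sample_id[a] == sample_id[b] else between
    target.append(dist.loc[a, b])

# Ratio below 1 means replicates of a sample are closer than unrelated samples
ratio = np.mean(within) / np.mean(between)

# Suffix matching is too permissive: every name here ends with "1"
suffix_hits = [n for n in names if n.endswith("1")]

print(within, between, round(ratio, 3), suffix_hits)
```

Exact comparison of the extracted ID keeps sample `1` and sample `11` apart, whereas a plain suffix test would merge them. Iterating over unordered pairs also avoids counting each distance twice.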

## Analysis Summary and Results

### Key Findings

This notebook runs the reproducibility distance analysis with single-cell integration in four steps:

1. **Data Integration**: Loads and validates the bulk RNA-seq distance matrix, metadata, and VST-normalized expression data, plus the single-cell marker gene tables when those files are available

2. **Hierarchical Clustering**: Generates publication-quality dendrograms with sample-specific coloring to visualize sample relationships and clustering patterns

3. **Marker Gene Analysis**: Extracts marker genes from the single-cell data and creates expression heatmaps for reproducibility assessment across genotypes

4. **Cross-Sample Reproducibility**: Compares distances between matching and non-matching samples as a distance-based measure of reproducibility

### Outputs Generated

- **Hierarchical clustering dendrograms** with sample-specific color coding
- **Expression heatmaps** for marker genes across conditions (Z-score and min-max normalized)
- **Reproducibility assessment plots** comparing distances between matching vs non-matching samples
- **Publication-ready figures** saved in multiple formats (PNG and PDF)

### Technical Notes

- Marker genes come from the single-cell cluster markers when those files are available; otherwise the 500 most variable genes are used as a fallback
- The number of marker genes and samples analyzed is printed by the session information cell below
- Heatmaps and the reproducibility plot are saved to the `reproducibility_plots` directory; dendrograms are saved to the `figures` directory
- The random seed (42) and package versions are recorded to support reproducibility

In [None]:
# Session Information and Package Versions
import sys
import datetime

print("="*50)
print("SESSION INFORMATION")
print("="*50)
print(f"Analysis completed: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Python version: {sys.version}")
print(f"Working directory: {os.getcwd()}")

print("\nPackage Versions:")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"matplotlib: {mpl.__version__}")
print(f"seaborn: {sns.__version__}")

try:
    import scipy
    print(f"scipy: {scipy.__version__}")
except ImportError:
    print("scipy: not available")

print("\nAnalysis Parameters:")
print(f"Top clusters analyzed: {TOP_CLUSTERS}")
print(f"Top markers per cluster: {TOP_MARKERS_PER_CLUSTER}")
print(f"Heatmap visualization range: {HEATMAP_VMIN} to {HEATMAP_VMAX}")
print(f"Random seed: 42")

print("\nInput Files:")
print(f"Distance matrix: {DISTANCES_FILE}")
print(f"Metadata: {METADATA_FILE}")
print(f"VST data: {VST_FILE}")
print(f"SC markers (post): {SC_MARKERS_POST_FILE}")
print(f"SC markers (pre): {SC_MARKERS_PRE_FILE}")

print("\nOutput Directories:")
print(f"Figures: {FIGURES_DIR}")
print(f"Results: {RESULTS_DIR}")

print(f"\nTotal marker genes analyzed: {len(marker_genes) if 'marker_genes' in locals() else 'Not available'}")
print(f"Total samples in analysis: {distance_df.shape[0] if 'distance_df' in locals() else 'Not available'}")

print("\n" + "="*50)
print("ANALYSIS COMPLETED SUCCESSFULLY")
print("="*50)