# Endothelial Cell Proteomics Analysis

This notebook demonstrates an end-to-end data science pipeline for analyzing endothelial cell proteomics data.

## Overview

This analysis pipeline includes:
- Data loading and preprocessing
- Quality control assessment
- Data normalization
- Principal Component Analysis (PCA)
- Partial Least Squares Discriminant Analysis (PLS-DA)
- Differential expression analysis
- Interactive visualizations
- Comprehensive reporting

## Publication

This code supports the analysis presented in:
*"Multi-omics analysis of endothelial cells reveals the metabolic diversity that underlies endothelial cell functions"*

bioRxiv: https://doi.org/10.1101/2025.03.03.641143

In [None]:
# Import the proteomics analysis class
from proteomics_analysis import ProteomicsAnalyzer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set up plotting
plt.style.use('seaborn-v0_8')
%matplotlib inline

## 1. Initialize the Analyzer and Load Data

In [None]:
# Initialize the proteomics analyzer
analyzer = ProteomicsAnalyzer(data_path=".", output_dir="results")

# Load the proteomics data
# Make sure you have the 'Suppl_table_1.xlsx' file in your working directory
analyzer.load_data("Suppl_table_1.xlsx")

In [None]:
# Preview the raw data structure
print(f"Raw data shape: {analyzer.raw_data.shape}")
print(f"Columns: {list(analyzer.raw_data.columns[:10])}...")  # Show first 10 columns
analyzer.raw_data.head()

## 2. Data Preprocessing and Quality Control

In [None]:
# Preprocess the data (remove missing values, create metadata)
analyzer.preprocess_data()

# Display metadata information
print("Metadata overview:")
print(analyzer.metadata.head())
print(f"\nCell lines: {analyzer.metadata['cell_line'].unique()}")
print(f"Time points: {analyzer.metadata['time_point'].unique()}")
print(f"Cell types: {analyzer.metadata['cell_type'].unique()}")

In [None]:
# Normalize the data using z-score normalization
analyzer.normalize_data(method='zscore')

print(f"Normalized data shape: {analyzer.normalized_data.shape}")
norm_values = analyzer.normalized_data.values
print(f"Data range after normalization: {norm_values.min():.2f} to {norm_values.max():.2f}")

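For readers unfamiliar with z-score normalization, here is a minimal sketch on a toy table. It assumes proteins are rows, samples are columns, and each protein is scaled across samples. Whether `normalize_data(method='zscore')` scales per protein or per sample is not shown in this notebook, so the axis choice is an assumption, and the `zscore_rows` helper and toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def zscore_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Center each row to mean 0 and scale it to unit sample standard deviation."""
    mean = df.mean(axis=1)
    std = df.std(axis=1, ddof=1)
    return df.sub(mean, axis=0).div(std, axis=0)


# Hypothetical log-intensities: two proteins measured in three samples.
toy = pd.DataFrame(
    {"s1": [20.0, 15.0], "s2": [22.0, 15.5], "s3": [24.0, 16.0]},
    index=["protein_a", "protein_b"],
)

z = zscore_rows(toy)
print(z.round(2))
```

After this transformation both proteins are on the same scale even though their raw ranges differ. A protein with identical values in every sample has zero standard deviation and would come out as NaN, so such rows usually need to be filtered beforehand.
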
In [None]:
# Perform quality control analysis
qc_results = analyzer.quality_control(save_plots=True)

# Display QC summary
print("Data completeness statistics:")
print(qc_results['completeness'])

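The structure of `qc_results['completeness']` is not documented in this notebook. As a rough illustration of what a completeness statistic can look like, the sketch below computes the fraction of measured (non-missing) values per sample and per protein on a hypothetical toy table; the variable names are illustrative and not part of `ProteomicsAnalyzer`.

```python
import numpy as np
import pandas as pd

# Hypothetical intensity table: four proteins (rows) in two samples (columns).
toy = pd.DataFrame(
    {"s1": [20.0, np.nan, 18.0, 17.0], "s2": [21.0, 16.0, np.nan, np.nan]},
    index=["p1", "p2", "p3", "p4"],
)

# Fraction of proteins with a measured value in each sample.
sample_completeness = toy.notna().mean(axis=0)

# Fraction of samples with a measured value for each protein.
protein_completeness = toy.notna().mean(axis=1)

# Proteins quantified in every sample (what a strict missing-value filter keeps).
complete_proteins = toy.dropna(axis=0, how="any")

print(sample_completeness)
print(f"Proteins without missing values: {len(complete_proteins)} of {len(toy)}")
```
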
## 3. Principal Component Analysis (PCA)

In [None]:
# Perform PCA analysis
pca_results = analyzer.perform_pca(n_components=10, save_plots=True)

# Display explained variance
print("Explained variance ratio:")
for i, var in enumerate(pca_results['explained_variance_ratio'][:5]):
    print(f"PC{i+1}: {var:.3f} ({var*100:.1f}%)")

# Index 1 of the cumulative array covers PC1 + PC2 (zero-based indexing)
cum_var_2pc = pca_results['cumulative_variance'][1]
print(f"\nCumulative variance explained by first 2 PCs: {cum_var_2pc:.3f} ({cum_var_2pc*100:.1f}%)")

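To make the explained-variance numbers above easier to interpret, here is a minimal NumPy sketch of how they can be derived from a singular value decomposition. It assumes samples are rows and proteins are columns, and it uses random data as a stand-in. It is not the implementation behind `perform_pca`, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical matrix: 12 samples (rows) x 50 proteins (columns).
X = rng.normal(size=(12, 50))

# PCA operates on column-centered data.
Xc = X - X.mean(axis=0)

# Singular values relate to component variances: var_k = s_k**2 / (n_samples - 1).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance = s**2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()
cumulative_variance = np.cumsum(explained_variance_ratio)

# Sample coordinates on each principal component.
scores = U * s

print(f"PC1 + PC2 explain {cumulative_variance[1]:.3f} of the total variance")
```

Because arrays are zero-indexed, element `[1]` of the cumulative array is the variance explained by the first two components together.
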
## 4. PLS-DA Analysis for Cell Type Classification

In [None]:
# Perform PLS-DA to identify proteins that discriminate between blood
# endothelial cells (BECs) and lymphatic endothelial cells (LECs)
plsda_results = analyzer.perform_plsda(
    target_variable='cell_type', 
    n_components=2, 
    n_permutations=1000, 
    save_plots=True
)

# Display significant proteins
print(f"Proteins with high weights (LEC-associated): {len(plsda_results['significant_proteins']['high_weights'])}")
print(f"Proteins with low weights (BEC-associated): {len(plsda_results['significant_proteins']['low_weights'])}")

# Show top proteins
print("\nTop 10 LEC-associated proteins:")
print(plsda_results['significant_proteins']['high_weights'].sort_values(ascending=False).head(10))

print("\nTop 10 BEC-associated proteins:")
print(plsda_results['significant_proteins']['low_weights'].sort_values().head(10))

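The `n_permutations=1000` argument suggests that the PLS-DA result is checked against shuffled class labels. How `perform_plsda` does this is not shown here, so the following is only a generic sketch of a label-permutation test. It assumes a one-dimensional score per sample (for example, a first latent-variable score) and uses the absolute difference in group means as the test statistic. The function name, the toy scores, and the choice of statistic are all hypothetical.

```python
import numpy as np


def permutation_p_value(scores, labels, n_permutations=1000, seed=0):
    """Label-permutation p-value for the absolute difference in group means."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    if len(groups) != 2:
        raise ValueError("expected exactly two groups")

    def statistic(lab):
        return abs(scores[lab == groups[0]].mean() - scores[lab == groups[1]].mean())

    observed = statistic(labels)
    # Null distribution: the same statistic after shuffling the labels.
    null = np.array(
        [statistic(rng.permutation(labels)) for _ in range(n_permutations)]
    )
    # The add-one correction keeps the estimate strictly above zero.
    return (1 + np.sum(null >= observed)) / (1 + n_permutations)


# Hypothetical scores for 6 BEC and 6 LEC samples with clear separation.
labels = np.array(["BEC"] * 6 + ["LEC"] * 6)
separated = np.r_[np.zeros(6), np.ones(6)] + np.linspace(-0.1, 0.1, 12)

p = permutation_p_value(separated, labels)
print(f"Permutation p-value: {p:.4f}")
```

A small p-value means that few shuffled labelings separate the groups as well as the real labels do. This guards against a supervised model appearing to discriminate classes purely by overfitting.
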
## 5. Differential Expression Analysis

In [None]:
# Perform differential analysis comparing D5 vs D2 for each cell line
cell_lines = analyzer.metadata['cell_line'].unique()
diff_results_summary = {}

for cell_line in cell_lines:
    ref_condition = f"{cell_line}_D2"
    comp_condition = f"{cell_line}_D5"
    
    # Check if both conditions exist
    if (ref_condition in analyzer.metadata['condition'].values and 
        comp_condition in analyzer.metadata['condition'].values):
        
        print(f"\nAnalyzing {comp_condition} vs {ref_condition}...")
        
        diff_results = analyzer.differential_analysis(
            reference_condition=ref_condition,
            comparison_conditions=[comp_condition],
            statistical_test='ttest',
            multiple_testing_correction='qvalue'
        )
        
        # Summarize results
        comparison_key = f"{comp_condition}_vs_{ref_condition}"
        if comparison_key in diff_results:
            results_df = diff_results[comparison_key]
            significant = results_df[results_df['significant']]
            upregulated = significant[significant['log2_fc'] > 0]
            downregulated = significant[significant['log2_fc'] < 0]
            
            diff_results_summary[cell_line] = {
                'total_significant': len(significant),
                'upregulated': len(upregulated),
                'downregulated': len(downregulated)
            }
            
            print(f"  - Significant proteins: {len(significant)}")
            print(f"  - Upregulated: {len(upregulated)}")
            print(f"  - Downregulated: {len(downregulated)}")

# Create summary plot
if diff_results_summary:
    summary_df = pd.DataFrame(diff_results_summary).T
    
    fig, ax = plt.subplots(1, 1, figsize=(10, 6))
    summary_df[['upregulated', 'downregulated']].plot(kind='bar', ax=ax, 
                                                     color=['red', 'blue'], alpha=0.7)
    ax.set_title('Differential Expression Summary (D5 vs D2)')
    ax.set_ylabel('Number of Proteins')
    ax.set_xlabel('Cell Line')
    ax.legend(['Upregulated', 'Downregulated'])
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

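Testing thousands of proteins at once requires a multiple-testing correction, which is what the `multiple_testing_correction='qvalue'` argument requests. The analyzer's own procedure is not shown in this notebook. As an illustration of the general idea, the sketch below implements the Benjamini-Hochberg adjustment, a related but simpler false discovery rate method that is swapped in here and is not necessarily what `'qvalue'` computes. The helper name and p-values are hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


# Hypothetical raw p-values for four proteins.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
significant_mask = adjusted < 0.05

print(adjusted)
print(f"Significant at FDR 0.05: {significant_mask.sum()} of {len(p_values)}")
```
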
## 6. Interactive Protein Visualization

In [None]:
# Define a helper function to plot specific proteins on demand
def plot_protein_expression(protein_name, cell_lines=None, time_points=None):
    """
    Plot expression levels of a specific protein across conditions.
    
    Parameters:
    -----------
    protein_name : str
        Name of the protein to plot
    cell_lines : list, optional
        List of cell lines to include
    time_points : list, optional
        List of time points to include
    """
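    # Find the protein in the data (literal, case-insensitive substring match;
    # regex=False avoids misreading special characters, na=False skips missing labels)
    protein_matches = analyzer.clean_data.index[
        analyzer.clean_data.index.str.contains(protein_name, case=False, regex=False, na=False)
    ]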
    
    if len(protein_matches) == 0:
        print(f"Protein '{protein_name}' not found in dataset")
        return
    
    if len(protein_matches) > 1:
        print(f"Multiple matches found: {list(protein_matches)}")
        print(f"Using first match: {protein_matches[0]}")
    
    protein_gene = protein_matches[0]
    
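    # Get protein data; the [2:-2] slice assumes the first two and last two
    # columns of clean_data are annotation columns rather than samples
    protein_data = analyzer.clean_data.loc[protein_gene, analyzer.clean_data.columns[2:-2]]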
    
    # Create plotting dataframe
    plot_df = pd.DataFrame({
        'intensity': protein_data.values,
        'sample_id': protein_data.index
    })
    
    # Merge with metadata
    plot_df = plot_df.merge(analyzer.metadata.set_index('sample_id'), left_on='sample_id', right_index=True)
    
    # Filter data if specified
    if cell_lines:
        plot_df = plot_df[plot_df['cell_line'].isin(cell_lines)]
    if time_points:
        plot_df = plot_df[plot_df['time_point'].isin(time_points)]
    
    # Create plot
    fig, ax = plt.subplots(1, 1, figsize=(12, 6))
    
    sns.boxplot(data=plot_df, x='condition', y='intensity', ax=ax)
    sns.stripplot(data=plot_df, x='condition', y='intensity', ax=ax, 
                 color='black', alpha=0.7, size=8)
    
    ax.set_title(f'Expression of {protein_gene}', fontsize=14, fontweight='bold')
    ax.set_xlabel('Condition', fontweight='bold')
    ax.set_ylabel('Log2 Intensity', fontweight='bold')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    # Print summary statistics
    print(f"\nSummary statistics for {protein_gene}:")
    summary = plot_df.groupby('condition')['intensity'].agg(['mean', 'std', 'count'])
    print(summary)

# Example usage
plot_protein_expression('PECAM1', time_points=['D2', 'D5'])

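The lookup in `plot_protein_expression` uses substring matching, so a short query can hit several unrelated proteins, and only the first hit is plotted. The sketch below contrasts substring matching with an exact gene-name match. It assumes, hypothetically, that index labels may hold several gene names separated by `;`. The real label format in `Suppl_table_1.xlsx` may differ, and `exact_gene_matches` is an illustrative helper, not part of `ProteomicsAnalyzer`.

```python
import numpy as np
import pandas as pd

# Hypothetical index labels; the real dataset's labels may be formatted differently.
index = pd.Index(["PECAM1", "ICAM1", "ICAM2", "VCAM1", "VWF;VWFA1"])

# Substring search: 'CAM1' hits several different genes.
substring_hits = index[index.str.contains("CAM1", case=False, regex=False)]


def exact_gene_matches(index: pd.Index, gene: str) -> pd.Index:
    """Return labels where `gene` equals one of the ';'-separated names."""
    gene = gene.upper()
    mask = np.array(
        [
            gene in {name.strip().upper() for name in str(label).split(";")}
            for label in index
        ],
        dtype=bool,
    )
    return index[mask]


print("Substring hits for 'CAM1':", list(substring_hits))
print("Exact hits for 'ICAM1':", list(exact_gene_matches(index, "ICAM1")))
print("Exact hits for 'vwf':", list(exact_gene_matches(index, "vwf")))
```
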
## 7. Generate Comprehensive Report

In [None]:
# Generate a comprehensive HTML report
report_html = analyzer.generate_report(save_html=True)

print("Analysis completed successfully!")
print(f"Results saved to: {analyzer.output_dir}")
print("\nGenerated files:")
print("- analysis_report.html: Comprehensive analysis report")
print("- quality_control.png: Quality control plots")
print("- pca_analysis.html: Interactive PCA plots")
print("- plsda_cell_type.png: PLS-DA analysis plots")
print("- volcano_plot_*.html: Interactive volcano plots")
print("- differential_analysis_*.csv: Differential expression results")

## 8. Advanced Analysis Examples

In [None]:
# Example: Pathway enrichment analysis preparation
# Get significant proteins from PLS-DA for pathway analysis
from pathlib import Path

if 'plsda' in analyzer.results:
    sig_proteins = analyzer.results['plsda']['significant_proteins']
    lec_proteins = [str(p) for p in sig_proteins['high_weights'].index]
    bec_proteins = [str(p) for p in sig_proteins['low_weights'].index]
    
    print(f"LEC-associated proteins for pathway analysis: {len(lec_proteins)}")
    print("First 10 LEC proteins:", lec_proteins[:10])
    
    print(f"\nBEC-associated proteins for pathway analysis: {len(bec_proteins)}")
    print("First 10 BEC proteins:", bec_proteins[:10])
    
    # Save protein lists for external pathway analysis.
    # Wrapping in Path() keeps the `/` joins valid even if output_dir is a plain string.
    output_dir = Path(analyzer.output_dir)
    (output_dir / 'lec_associated_proteins.txt').write_text('\n'.join(lec_proteins))
    (output_dir / 'bec_associated_proteins.txt').write_text('\n'.join(bec_proteins))
    
    print("\nProtein lists saved for pathway analysis.")

In [None]:
# Example: Custom analysis - Correlation between specific proteins
proteins_of_interest = ['PECAM1', 'VWF', 'ICAM1', 'VCAM1']  # Example endothelial markers

# Find these proteins in the dataset
found_proteins = []
for protein in proteins_of_interest:
    # Literal, case-insensitive substring match; the first hit is used below
    matches = analyzer.clean_data.index[
        analyzer.clean_data.index.str.contains(protein, case=False, regex=False, na=False)
    ]
    if len(matches) > 0:
        found_proteins.append(matches[0])
        print(f"Found: {protein} -> {matches[0]}")
    else:
        print(f"Not found: {protein}")

if len(found_proteins) > 1:
    # Calculate correlation matrix
    protein_data = analyzer.normalized_data.loc[found_proteins].T
    correlation_matrix = protein_data.corr(method='spearman')
    
    # Plot correlation heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(correlation_matrix, annot=True, cmap='RdBu_r', center=0,
                square=True, fmt='.2f', cbar_kws={'shrink': 0.8})
    plt.title('Correlation Matrix: Endothelial Markers')
    plt.tight_layout()
    plt.show()
    
    print("\nCorrelation matrix:")
    print(correlation_matrix)

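The correlation cell above uses `method='spearman'`, which measures monotonic rather than strictly linear association. A minimal sketch on hypothetical data shows the underlying idea: Spearman correlation is the Pearson correlation computed on ranks, so it stays at 1 or -1 for any strictly monotonic relationship.

```python
import numpy as np
import pandas as pd

# Hypothetical samples-by-proteins table with monotonic but non-linear relations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy = pd.DataFrame(
    {"protein_a": x, "protein_b": x**3, "protein_c": -np.exp(x)}
)

# Spearman correlation: rank each column, then take the Pearson correlation.
spearman = toy.rank().corr(method="pearson")

# Plain Pearson correlation on the raw values, for comparison.
pearson = toy.corr(method="pearson")

print("Spearman:\n", spearman.round(3))
print("Pearson:\n", pearson.round(3))
```

Rank-based correlation is less sensitive to outliers and to the scale of the intensities, which can make it a reasonable default for exploratory comparisons between proteins.
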
## Summary

This notebook demonstrates a comprehensive data science approach to proteomics analysis, including:

1. **Data Management**: Proper loading, cleaning, and preprocessing
2. **Quality Control**: Systematic assessment of data quality
3. **Statistical Analysis**: PCA, PLS-DA, and differential expression
4. **Visualization**: Publication-quality static and interactive plots
5. **Reporting**: Automated generation of comprehensive reports
6. **Reproducibility**: Well-documented, modular code structure

The analysis pipeline is designed to be:
- **Scalable**: Can handle large proteomics datasets
- **Flexible**: Easy to modify for different experimental designs
- **Reproducible**: Consistent results with proper documentation
- **Interactive**: Supports exploratory data analysis

For more advanced analyses, consider:
- Network analysis (protein-protein interactions)
- Pathway enrichment analysis
- Machine learning classification
- Time-series analysis
- Integration with other omics data