# Comparing Lung and Breast Cancer Gene Expression using 10x Genomics Data

This notebook demonstrates how to:
1. Download 10x Genomics single-cell RNA-seq data
2. Process and analyze lung cancer samples
3. Process and analyze breast cancer samples
4. Compare gene expression patterns between cancer types
5. Identify cancer-type specific markers

## 10x Genomics Data Sources:
- **10x Genomics Datasets**: https://www.10xgenomics.com/datasets
- **Cancer Cell Atlas**: Single-cell data from various cancers
- **Public repositories**: GEO, ArrayExpress with 10x data

In [None]:
# Install required packages
!pip install scanpy anndata pandas numpy matplotlib seaborn scipy scikit-learn statsmodels requests leidenalg

In [None]:
import scanpy as sc
import anndata as ad
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.decomposition import PCA
from statsmodels.stats.multitest import multipletests
import requests
import tarfile
import gzip
import os
import warnings
warnings.filterwarnings('ignore')

# Scanpy settings
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white')

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

## 1. Download 10x Genomics Data

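This section defines a helper class for downloading and loading publicly available 10x Genomics datasets. It is not called on a real URL here; the rest of the notebook runs on the synthetic data generated in Section 2.

**Note**: For this example, we'll use:
- Simulated data for demonstration
- Real datasets, downloadable from the 10x Genomics website, as a drop-in replacement

### Real Dataset Examples:
- **Breast Cancer**: Search the 10x Genomics datasets page for breast tumor samples (the 1.3M Brain Cells dataset is mouse brain tissue, not cancer data, and is only useful for testing the loading code)
- **Lung Cancer**: Various lung tumor scRNA-seq datasets available on GEO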

In [None]:
class TenXDataDownloader:
    """
    Download and process 10x Genomics data
    """
    
    def __init__(self, output_dir="10x_data"):
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
    
    def download_10x_dataset(self, url, dataset_name):
        """
        Download 10x Genomics dataset from URL
        """
        output_path = os.path.join(self.output_dir, f"{dataset_name}.tar.gz")
        extract_path = os.path.join(self.output_dir, dataset_name)
        
        if os.path.exists(extract_path):
            print(f"{dataset_name} already exists, skipping download")
            return extract_path
        
        print(f"Downloading {dataset_name}...")
        
        try:
            # A timeout keeps the notebook from hanging on an unresponsive server
            response = requests.get(url, stream=True, timeout=60)
            response.raise_for_status()
            
            with open(output_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            
            print(f"Extracting {dataset_name}...")
            with tarfile.open(output_path, 'r:gz') as tar:
                tar.extractall(extract_path)
            
            os.remove(output_path)
            print(f"Successfully downloaded and extracted {dataset_name}")
            return extract_path
            
        except Exception as e:
            print(f"Error downloading {dataset_name}: {e}")
            return None
    
    def load_10x_data(self, data_path):
        """
        Load 10x Genomics data using Scanpy
        """
        # Try to find the filtered feature matrix
        possible_paths = [
            os.path.join(data_path, 'filtered_feature_bc_matrix'),
            os.path.join(data_path, 'filtered_gene_bc_matrices'),
            data_path
        ]
        
        for path in possible_paths:
            if os.path.exists(path):
                try:
                    adata = sc.read_10x_mtx(path, var_names='gene_symbols', cache=True)
                    return adata
                except Exception:  # a bare except would also swallow KeyboardInterrupt
                    continue
        
        print(f"Could not load data from {data_path}")
        return None

# Initialize downloader
downloader = TenXDataDownloader()

print("10x Genomics Data Downloader initialized!")
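
`tarfile.extractall` writes wherever the archive's member names point, so an archive fetched over the network can place files outside the target directory. The following is a minimal sketch of a guard that could be called in place of `tar.extractall(extract_path)`; the helper name `safe_extract` is hypothetical and not part of the class above. Recent Python versions also offer a `filter="data"` argument on `extractall` that serves a similar purpose.

```python
import os
import tarfile


def safe_extract(tar: tarfile.TarFile, dest: str) -> None:
    """Extract `tar` into `dest`, refusing members that would land outside it."""
    dest_root = os.path.realpath(dest)
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(dest_root, member.name))
        # commonpath equals dest_root only when target lies inside dest_root
        if os.path.commonpath([dest_root, target]) != dest_root:
            raise ValueError(f"Blocked unsafe path in archive: {member.name}")
        # Links can point outside dest_root even when their own name is safe
        if member.issym() or member.islnk():
            raise ValueError(f"Blocked link in archive: {member.name}")
    tar.extractall(dest_root)
```

The check runs over every member before anything is written, so a single bad entry leaves the destination untouched.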

## 2. Generate Example 10x-style Data

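Since real datasets can be large and slow to download, we'll create synthetic data that mimics the shape of a 10x Genomics single-cell RNA-seq count matrix (cells × genes, mostly zeros). The counts carry no real biology: cell type labels are assigned at random and do not influence expression, so the clusters found later will not correspond to cell types.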
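
Real 10x matrices are usually held in a sparse format, whereas the generator below builds a dense NumPy array for simplicity. As a rough illustration of why sparsity matters, this sketch builds a small toy matrix with the same negative-binomial and 70% dropout settings and compares dense and CSR storage. The matrix size and the variable names are assumptions chosen for the example.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Toy count matrix: 500 cells x 300 genes with negative-binomial counts
dense_counts = rng.negative_binomial(n=5, p=0.3, size=(500, 300))

# Zero out roughly 70% of the entries to imitate dropout
dense_counts[rng.random(dense_counts.shape) < 0.7] = 0

# CSR keeps only the non-zero values plus two index arrays
sparse_counts = sparse.csr_matrix(dense_counts)

n_entries = dense_counts.shape[0] * dense_counts.shape[1]
fraction_zero = 1.0 - sparse_counts.nnz / n_entries
dense_bytes = dense_counts.nbytes
sparse_bytes = (sparse_counts.data.nbytes
                + sparse_counts.indices.nbytes
                + sparse_counts.indptr.nbytes)

print(f"Fraction of zeros: {fraction_zero:.2f}")
print(f"Dense: {dense_bytes:,} bytes, CSR: {sparse_bytes:,} bytes")
```

An `AnnData` object accepts either representation as `X`, so the dense choice in this notebook is a convenience rather than a requirement.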

In [None]:
def generate_10x_cancer_data(cancer_type, n_cells=2000, n_genes=2000):
    """
    Generate synthetic single-cell data mimicking 10x Genomics format
    """
    np.random.seed(42 if cancer_type == 'lung' else 123)
    
    # Generate gene names
    gene_names = [f'Gene_{i:04d}' for i in range(n_genes)]
    
    # Known cancer genes (TP53 appears in both lists)
    cancer_genes = {
        'lung': ['EGFR', 'KRAS', 'TP53', 'ALK', 'ROS1', 'NKX2-1', 'TTF1', 'NAPSA'],
        'breast': ['ESR1', 'PGR', 'ERBB2', 'GATA3', 'FOXA1', 'TP53', 'PIK3CA', 'BRCA1']
    }
    
    # Put the union of all marker genes at the same positions in every dataset.
    # Both datasets then share one gene namespace, so the markers survive the
    # inner join on gene names when the datasets are concatenated later.
    all_markers = list(dict.fromkeys(cancer_genes['lung'] + cancer_genes['breast']))
    gene_names[:len(all_markers)] = all_markers
    
    # Cell barcodes
    cell_barcodes = [f'{cancer_type.upper()}_Cell_{i:04d}' for i in range(n_cells)]
    
    # Generate sparse-like count matrix
    # Single-cell data is typically very sparse (lots of zeros)
    counts = np.random.negative_binomial(n=5, p=0.3, size=(n_cells, n_genes))
    
    # Add sparsity (set 70% of values to zero)
    zero_mask = np.random.random((n_cells, n_genes)) < 0.7
    counts[zero_mask] = 0
    
    # Increase expression of the genes specific to this cancer type only
    for gene in cancer_genes[cancer_type]:
        idx = gene_names.index(gene)
        counts[:, idx] = counts[:, idx] * np.random.uniform(3, 8)
    
    # Create AnnData object (standard format for single-cell data);
    # float32 lets the later normalization steps work on the matrix directly
    adata = ad.AnnData(X=counts.astype(np.float32))
    adata.obs_names = cell_barcodes
    adata.var_names = gene_names
    
    # Add metadata
    adata.obs['cancer_type'] = cancer_type
    adata.obs['n_counts'] = adata.X.sum(axis=1)
    adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
    
    # Add cell type annotations (simplified)
    cell_types = np.random.choice(
        ['Cancer_cells', 'T_cells', 'Macrophages', 'Fibroblasts', 'Endothelial'],
        size=n_cells,
        p=[0.5, 0.2, 0.15, 0.1, 0.05]  # Cancer cells are most abundant
    )
    adata.obs['cell_type'] = cell_types
    
    return adata

# Generate lung cancer data
print("Generating lung cancer single-cell data...")
lung_adata = generate_10x_cancer_data('lung', n_cells=2000, n_genes=2000)
print(f"Lung cancer data: {lung_adata.n_obs} cells × {lung_adata.n_vars} genes")

# Generate breast cancer data
print("\nGenerating breast cancer single-cell data...")
breast_adata = generate_10x_cancer_data('breast', n_cells=2000, n_genes=2000)
print(f"Breast cancer data: {breast_adata.n_obs} cells × {breast_adata.n_vars} genes")

print("\n" + "="*60)
print("DATASET SUMMARY")
print("="*60)
print(f"Lung Cancer:   {lung_adata.n_obs:,} cells × {lung_adata.n_vars:,} genes")
print(f"Breast Cancer: {breast_adata.n_obs:,} cells × {breast_adata.n_vars:,} genes")
print("="*60)

## 3. Quality Control and Preprocessing

In [None]:
def preprocess_10x_data(adata, min_genes=200, min_cells=3, name="Sample"):
    """
    Standard preprocessing for 10x Genomics data
    """
    print(f"\nPreprocessing {name}...")
    print(f"Starting with {adata.n_obs} cells and {adata.n_vars} genes")
    
    # Calculate QC metrics
    adata.var['n_cells'] = (adata.X > 0).sum(axis=0)
    
    # Filter cells and genes
    sc.pp.filter_cells(adata, min_genes=min_genes)
    sc.pp.filter_genes(adata, min_cells=min_cells)
    
    print(f"After filtering: {adata.n_obs} cells and {adata.n_vars} genes")
    
    # Normalize per cell
    sc.pp.normalize_total(adata, target_sum=1e4)
    
    # Log transform
    sc.pp.log1p(adata)
    
    # Store raw normalized data
    adata.raw = adata
    
    # Identify highly variable genes
    sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
    print(f"Identified {sum(adata.var['highly_variable'])} highly variable genes")
    
    return adata

# Preprocess both datasets
lung_adata = preprocess_10x_data(lung_adata, name="Lung Cancer")
breast_adata = preprocess_10x_data(breast_adata, name="Breast Cancer")
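
To make the normalization step concrete, here is a minimal NumPy sketch of the arithmetic behind `sc.pp.normalize_total(target_sum=1e4)` followed by `sc.pp.log1p`: each cell is rescaled so its counts sum to the target, then `log(1 + x)` is applied. The function name `normalize_and_log` and the toy matrix are illustrative assumptions, not part of scanpy.

```python
import numpy as np


def normalize_and_log(counts: np.ndarray, target_sum: float = 1e4) -> np.ndarray:
    """Depth-normalize each cell (row) to `target_sum`, then apply log(1 + x)."""
    counts = counts.astype(np.float64)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    scaled = counts / totals * target_sum
    return np.log1p(scaled)


# Cells 0 and 1 have the same composition but a 10x difference in depth
toy = np.array([[1, 0, 3],
                [10, 0, 30],
                [0, 0, 0]])
logged = normalize_and_log(toy)
print(logged)
```

The first two rows come out identical, which is the point of depth normalization: sequencing depth is removed while relative composition is kept.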

In [None]:
# Visualize QC metrics
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Lung cancer QC
axes[0, 0].hist(lung_adata.obs['n_counts'], bins=50, edgecolor='black', color='steelblue')
axes[0, 0].set_xlabel('Total counts')
axes[0, 0].set_ylabel('Number of cells')
axes[0, 0].set_title('Lung Cancer: UMI Counts per Cell')

axes[0, 1].hist(lung_adata.obs['n_genes'], bins=50, edgecolor='black', color='steelblue')
axes[0, 1].set_xlabel('Genes detected')
axes[0, 1].set_ylabel('Number of cells')
axes[0, 1].set_title('Lung Cancer: Genes per Cell')

axes[0, 2].scatter(lung_adata.obs['n_counts'], lung_adata.obs['n_genes'], 
                   alpha=0.3, s=10, color='steelblue')
axes[0, 2].set_xlabel('Total counts')
axes[0, 2].set_ylabel('Genes detected')
axes[0, 2].set_title('Lung Cancer: Counts vs Genes')

# Breast cancer QC
axes[1, 0].hist(breast_adata.obs['n_counts'], bins=50, edgecolor='black', color='coral')
axes[1, 0].set_xlabel('Total counts')
axes[1, 0].set_ylabel('Number of cells')
axes[1, 0].set_title('Breast Cancer: UMI Counts per Cell')

axes[1, 1].hist(breast_adata.obs['n_genes'], bins=50, edgecolor='black', color='coral')
axes[1, 1].set_xlabel('Genes detected')
axes[1, 1].set_ylabel('Number of cells')
axes[1, 1].set_title('Breast Cancer: Genes per Cell')

axes[1, 2].scatter(breast_adata.obs['n_counts'], breast_adata.obs['n_genes'], 
                   alpha=0.3, s=10, color='coral')
axes[1, 2].set_xlabel('Total counts')
axes[1, 2].set_ylabel('Genes detected')
axes[1, 2].set_title('Breast Cancer: Counts vs Genes')

plt.tight_layout()
plt.show()

## 4. Dimensionality Reduction and Clustering

In [None]:
def analyze_single_cell_data(adata, name="Sample"):
    """
    Perform dimensionality reduction and clustering
    """
    print(f"\nAnalyzing {name}...")
    
    # Scale data
    sc.pp.scale(adata, max_value=10)
    
    # PCA
    sc.tl.pca(adata, svd_solver='arpack')
    
    # Compute neighborhood graph
    sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
    
    # UMAP for visualization
    sc.tl.umap(adata)
    
    # Clustering
    sc.tl.leiden(adata, resolution=0.5)
    
    print(f"Found {len(adata.obs['leiden'].unique())} clusters")
    
    return adata

# Analyze both datasets
lung_adata = analyze_single_cell_data(lung_adata, name="Lung Cancer")
breast_adata = analyze_single_cell_data(breast_adata, name="Breast Cancer")
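
As a rough picture of what the PCA step produces, the sketch below computes principal components with a plain SVD of the centered matrix. scanpy uses a truncated solver (`arpack` above) for speed, so this is an illustration of the idea rather than its implementation. The function name `pca_svd` and the toy data are assumptions.

```python
import numpy as np


def pca_svd(X: np.ndarray, n_components: int):
    """Return (scores, components, explained_variance) from a full SVD."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # cell coordinates
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, Vt[:n_components], explained_variance


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))
X[:100, 0] += 5.0   # two groups of cells that differ in a single feature

scores, components, explained_variance = pca_svd(X, n_components=5)
print("Explained variance:", np.round(explained_variance, 2))
```

The neighborhood graph and UMAP are then built on the score matrix, which is why the number of components passed to `sc.pp.neighbors` matters.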

In [None]:
# Visualize UMAP
fig, axes = plt.subplots(2, 2, figsize=(16, 14))

# Lung cancer UMAP
sc.pl.umap(lung_adata, color='leiden', ax=axes[0, 0], show=False, title='Lung Cancer - Clusters')
sc.pl.umap(lung_adata, color='cell_type', ax=axes[0, 1], show=False, title='Lung Cancer - Cell Types')

# Breast cancer UMAP
sc.pl.umap(breast_adata, color='leiden', ax=axes[1, 0], show=False, title='Breast Cancer - Clusters')
sc.pl.umap(breast_adata, color='cell_type', ax=axes[1, 1], show=False, title='Breast Cancer - Cell Types')

plt.tight_layout()
plt.show()

## 5. Combine Datasets for Comparison

In [None]:
# Combine lung and breast cancer data
print("Combining datasets...")

# Concatenate the log-normalized matrices stored in .raw.
# .X has already been scaled and contains negative values, so it must not be
# normalized or log-transformed a second time.
combined_adata = ad.concat([lung_adata.raw.to_adata(), breast_adata.raw.to_adata()],
                           label="cancer_type", keys=['Lung', 'Breast'])

print(f"\nCombined dataset: {combined_adata.n_obs} cells × {combined_adata.n_vars} genes")
print(f"\nCells per cancer type:")
print(combined_adata.obs['cancer_type'].value_counts())

In [None]:
# Reprocess combined data
print("\nReprocessing combined dataset...")

# The combined matrix is already normalized and log-transformed;
# keep a copy in .raw for differential expression and plotting
combined_adata.raw = combined_adata

# Find highly variable genes across both cancer types
sc.pp.highly_variable_genes(combined_adata, min_mean=0.0125, max_mean=3, min_disp=0.5)

# Scale
sc.pp.scale(combined_adata, max_value=10)

# PCA
sc.tl.pca(combined_adata, svd_solver='arpack')

# Neighbors and UMAP
sc.pp.neighbors(combined_adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(combined_adata)

print("Combined data analysis complete!")

In [None]:
# Visualize combined data
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

sc.pl.umap(combined_adata, color='cancer_type', ax=axes[0], show=False, 
           title='Combined UMAP - Cancer Type', palette={'Lung': 'steelblue', 'Breast': 'coral'})

# PCA colored by cancer type
sc.pl.pca(combined_adata, color='cancer_type', ax=axes[1], show=False,
          title='Combined PCA - Cancer Type', palette={'Lung': 'steelblue', 'Breast': 'coral'})

plt.tight_layout()
plt.show()

## 6. Differential Gene Expression Analysis

Find genes that are differentially expressed between lung and breast cancer.
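
To show what the test in the next cell does per gene, here is a small sketch on simulated values: a Wilcoxon rank-sum test for each gene with `scipy.stats.ranksums`, followed by a Benjamini-Hochberg adjustment written out by hand. The simulated data and the helper name `benjamini_hochberg` are assumptions for illustration; scanpy's own implementation differs in details such as tie handling.

```python
import numpy as np
from scipy import stats


def benjamini_hochberg(pvals) -> np.ndarray:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    ranked = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


rng = np.random.default_rng(2)
n_cells, n_genes = 300, 50
group_a = rng.normal(loc=1.0, scale=1.0, size=(n_cells, n_genes))
group_b = rng.normal(loc=1.0, scale=1.0, size=(n_cells, n_genes))
group_a[:, 0] += 1.5   # gene 0 is truly higher in group A

pvals = np.array([stats.ranksums(group_a[:, g], group_b[:, g]).pvalue
                  for g in range(n_genes)])
padj = benjamini_hochberg(pvals)
print(f"Gene 0: p = {pvals[0]:.2e}, adjusted p = {padj[0]:.2e}")
```

The adjustment matters because thousands of genes are tested at once; without it, a 0.05 threshold would flag many genes by chance alone.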

In [None]:
# Perform differential expression analysis
print("Performing differential expression analysis...")
print("Comparing Lung vs Breast cancer\n")

# Use Wilcoxon rank-sum test (standard for single-cell)
sc.tl.rank_genes_groups(combined_adata, 'cancer_type', method='wilcoxon')

# Get results
result = combined_adata.uns['rank_genes_groups']

# Extract top genes for each cancer type
groups = result['names'].dtype.names

print("\n" + "="*60)
print("TOP DIFFERENTIALLY EXPRESSED GENES")
print("="*60)

for group in groups:
    print(f"\n{group} Cancer - Top 10 Marker Genes:")
    genes = result['names'][group][:10]
    scores = result['scores'][group][:10]
    pvals = result['pvals_adj'][group][:10]
    logfcs = result['logfoldchanges'][group][:10]
    
    for i, (gene, score, pval, logfc) in enumerate(zip(genes, scores, pvals, logfcs), 1):
        print(f"{i:2d}. {gene:15s} | LogFC: {logfc:6.2f} | Score: {score:6.2f} | P-adj: {pval:.2e}")

In [None]:
# Visualize top differentially expressed genes
sc.pl.rank_genes_groups(combined_adata, n_genes=20, sharey=False, fontsize=10)

## 7. Volcano Plot Comparison

In [None]:
# Create volcano plot for Lung vs Breast
def create_volcano_plot(adata, group1, group2):
    """
    Create volcano plot from scanpy results
    """
    result = adata.uns['rank_genes_groups']
    
    # Extract data for group1
    genes = result['names'][group1]
    logfcs = result['logfoldchanges'][group1]
    pvals_adj = result['pvals_adj'][group1]
    
    # Create DataFrame
    df = pd.DataFrame({
        'Gene': genes,
        'LogFC': logfcs,
        'P_adj': pvals_adj
    })
    
    # Remove NaN values
    df = df.dropna()
    
    # Calculate -log10(p-value)
    df['-log10(P_adj)'] = -np.log10(df['P_adj'] + 1e-300)  # Add small value to avoid log(0)
    
    # Classify genes
    df['Significant'] = (df['P_adj'] < 0.05) & (abs(df['LogFC']) > 0.5)
    df['Direction'] = 'Not Significant'
    df.loc[(df['Significant']) & (df['LogFC'] > 0), 'Direction'] = f'{group1} Up'
    df.loc[(df['Significant']) & (df['LogFC'] < 0), 'Direction'] = f'{group1} Down'
    
    return df

# Create volcano plot data
volcano_df = create_volcano_plot(combined_adata, 'Lung', 'Breast')

# Plot
plt.figure(figsize=(12, 8))

colors = {'Not Significant': 'gray', 'Lung Up': 'steelblue', 'Lung Down': 'coral'}

for direction, color in colors.items():
    subset = volcano_df[volcano_df['Direction'] == direction]
    plt.scatter(subset['LogFC'], subset['-log10(P_adj)'], 
               c=color, label=direction, alpha=0.6, s=20)

# Add threshold lines
plt.axhline(-np.log10(0.05), color='black', linestyle='--', linewidth=1, alpha=0.5, label='P_adj = 0.05')
plt.axvline(-0.5, color='black', linestyle='--', linewidth=1, alpha=0.5)
plt.axvline(0.5, color='black', linestyle='--', linewidth=1, alpha=0.5)

# Annotate top genes
top_up = volcano_df[(volcano_df['Direction'] == 'Lung Up')].nlargest(5, '-log10(P_adj)')
top_down = volcano_df[(volcano_df['Direction'] == 'Lung Down')].nlargest(5, '-log10(P_adj)')

for _, row in pd.concat([top_up, top_down]).iterrows():
    plt.annotate(row['Gene'], xy=(row['LogFC'], row['-log10(P_adj)']),
                xytext=(5, 5), textcoords='offset points', fontsize=8, alpha=0.8)

plt.xlabel('Log2 Fold Change (Lung vs Breast)', fontsize=12)
plt.ylabel('-log10(Adjusted P-value)', fontsize=12)
plt.title('Volcano Plot: Lung vs Breast Cancer Gene Expression', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Print summary
print("\nVolcano Plot Summary:")
print(f"Genes upregulated in Lung: {sum(volcano_df['Direction'] == 'Lung Up')}")
print(f"Genes downregulated in Lung (upregulated in Breast): {sum(volcano_df['Direction'] == 'Lung Down')}")
print(f"Not significant: {sum(volcano_df['Direction'] == 'Not Significant')}")

## 8. Heatmap of Top Marker Genes

In [None]:
# Create heatmap of top marker genes
sc.pl.rank_genes_groups_heatmap(combined_adata, n_genes=10, groupby='cancer_type',
                                 cmap='RdBu_r', figsize=(12, 8), show_gene_labels=True)

## 9. Dot Plot of Marker Genes

In [None]:
# Dot plot showing expression and percentage of cells expressing marker genes
sc.pl.rank_genes_groups_dotplot(combined_adata, n_genes=10, groupby='cancer_type',
                                 figsize=(12, 6))

## 10. Compare Specific Cancer Marker Genes

In [None]:
# Known cancer markers
lung_markers = ['EGFR', 'KRAS', 'TP53', 'ALK', 'NKX2-1']
breast_markers = ['ESR1', 'PGR', 'ERBB2', 'GATA3', 'FOXA1']

# Get markers that exist in the dataset
available_lung = [g for g in lung_markers if g in combined_adata.var_names]
available_breast = [g for g in breast_markers if g in combined_adata.var_names]
all_markers = available_lung + available_breast

print(f"Analyzing {len(all_markers)} known cancer marker genes")
print(f"Lung markers: {available_lung}")
print(f"Breast markers: {available_breast}")

if len(all_markers) > 0:
    # Create violin plots
    sc.pl.stacked_violin(combined_adata, all_markers, groupby='cancer_type',
                         figsize=(12, 6), dendrogram=False)

In [None]:
# UMAP colored by expression of specific markers
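if len(all_markers) >= 4:
    # Panel size is set through rcParams; sc.pl.umap has no figsize argument
    with plt.rc_context({'figure.figsize': (7, 6)}):
        sc.pl.umap(combined_adata, color=all_markers[:4],
                   ncols=2, cmap='Reds')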

## 11. Pseudobulk Analysis

Aggregate the single-cell data into pseudobulk profiles, one per cancer type, for a bulk-style comparison.
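
One profile per cancer type leaves no replicates, so bulk-style statistical tests cannot be run on it. A common alternative is to sum raw counts per sample within each group. The sketch below shows that aggregation with pandas on toy counts; the `sample_id` column is hypothetical, since the synthetic data in this notebook has no sample labels.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n_cells, n_genes = 120, 6
counts = rng.poisson(lam=2.0, size=(n_cells, n_genes))
genes = [f"Gene_{i}" for i in range(n_genes)]

# Per-cell metadata; `sample_id` is an assumed column for illustration
obs = pd.DataFrame({
    "cancer_type": np.repeat(["Lung", "Breast"], n_cells // 2),
    "sample_id": np.tile(["S1", "S2", "S3"], n_cells // 3),
})

# Sum raw counts over all cells sharing the same (cancer_type, sample_id)
counts_df = pd.DataFrame(counts, columns=genes)
pseudobulk = counts_df.groupby([obs["cancer_type"], obs["sample_id"]]).sum()
print(pseudobulk)
```

Each row of the result is one pseudobulk replicate, which is the shape that bulk differential expression tools expect.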

In [None]:
def create_pseudobulk(adata, group_col='cancer_type'):
    """
    Aggregate single-cell data to pseudobulk per cancer type
    """
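    # Prefer the log-normalized values in .raw when available; .X has been
    # scaled and contains negative values, which makes sums hard to interpret.
    # With real data, summing raw counts is the usual choice.
    source = adata.raw.to_adata() if adata.raw is not None else adata
    pseudobulk = {}
    
    for group in source.obs[group_col].unique():
        # Get cells from this group
        mask = (source.obs[group_col] == group).values
        
        # Sum expression across cells; np.asarray(...).ravel() works for dense
        # arrays and for the np.matrix returned by sparse matrices
        pseudobulk[group] = np.asarray(source[mask].X.sum(axis=0)).ravel()
    
    # Create DataFrame
    pseudobulk_df = pd.DataFrame(pseudobulk, index=source.var_names)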
    
    return pseudobulk_df

# Create pseudobulk data
pseudobulk_expr = create_pseudobulk(combined_adata)

print("Pseudobulk Expression Matrix:")
print(pseudobulk_expr.head())
print(f"\nShape: {pseudobulk_expr.shape}")

In [None]:
# Compare pseudobulk expression
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Scatter plot
axes[0].scatter(np.log1p(pseudobulk_expr['Lung']), 
                np.log1p(pseudobulk_expr['Breast']),
                alpha=0.3, s=10)
axes[0].plot([0, 15], [0, 15], 'r--', alpha=0.5, label='y=x')
axes[0].set_xlabel('Lung Cancer (log1p expression)')
axes[0].set_ylabel('Breast Cancer (log1p expression)')
axes[0].set_title('Pseudobulk Expression Comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Calculate correlation
correlation = np.corrcoef(pseudobulk_expr['Lung'], pseudobulk_expr['Breast'])[0, 1]
axes[0].text(0.05, 0.95, f'Correlation: {correlation:.3f}', 
            transform=axes[0].transAxes, fontsize=12, verticalalignment='top')

# Distribution comparison
axes[1].hist(np.log1p(pseudobulk_expr['Lung']), bins=50, alpha=0.5, 
            label='Lung', color='steelblue', edgecolor='black')
axes[1].hist(np.log1p(pseudobulk_expr['Breast']), bins=50, alpha=0.5, 
            label='Breast', color='coral', edgecolor='black')
axes[1].set_xlabel('log1p(Expression)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Expression Distribution')
axes[1].legend()

plt.tight_layout()
plt.show()

## 12. Export Results

In [None]:
# Create output directory
output_dir = "lung_vs_breast_comparison"
os.makedirs(output_dir, exist_ok=True)

# Save differential expression results
result = combined_adata.uns['rank_genes_groups']

for group in result['names'].dtype.names:
    de_df = pd.DataFrame({
        'Gene': result['names'][group],
        'Score': result['scores'][group],
        'LogFC': result['logfoldchanges'][group],
        'P_value': result['pvals'][group],
        'P_adj': result['pvals_adj'][group]
    })
    
    de_df.to_csv(f'{output_dir}/{group}_marker_genes.csv', index=False)
    print(f"Saved: {output_dir}/{group}_marker_genes.csv")

# Save pseudobulk expression
pseudobulk_expr.to_csv(f'{output_dir}/pseudobulk_expression.csv')
print(f"Saved: {output_dir}/pseudobulk_expression.csv")

# Save volcano plot data
volcano_df.to_csv(f'{output_dir}/volcano_plot_data.csv', index=False)
print(f"Saved: {output_dir}/volcano_plot_data.csv")

# Save combined AnnData object
combined_adata.write(f'{output_dir}/combined_lung_breast.h5ad')
print(f"Saved: {output_dir}/combined_lung_breast.h5ad")

print(f"\nAll results saved to {output_dir}/")

## 13. Summary Statistics

In [None]:
# Generate comprehensive summary
print("\n" + "="*70)
print("LUNG vs BREAST CANCER COMPARISON SUMMARY")
print("="*70)

print("\n1. DATASET STATISTICS:")
print(f"   Lung Cancer:   {lung_adata.n_obs:,} cells × {lung_adata.n_vars:,} genes")
print(f"   Breast Cancer: {breast_adata.n_obs:,} cells × {breast_adata.n_vars:,} genes")
print(f"   Combined:      {combined_adata.n_obs:,} cells × {combined_adata.n_vars:,} genes")

print("\n2. DIFFERENTIAL EXPRESSION:")
result = combined_adata.uns['rank_genes_groups']
for group in result['names'].dtype.names:
    sig_genes = sum(result['pvals_adj'][group] < 0.05)
    print(f"   {group} Cancer: {sig_genes} significant marker genes (FDR < 0.05)")

print("\n3. TOP LUNG CANCER MARKERS:")
lung_top = result['names']['Lung'][:5]
for i, gene in enumerate(lung_top, 1):
    print(f"   {i}. {gene}")

print("\n4. TOP BREAST CANCER MARKERS:")
breast_top = result['names']['Breast'][:5]
for i, gene in enumerate(breast_top, 1):
    print(f"   {i}. {gene}")

print("\n5. PSEUDOBULK CORRELATION:")
print(f"   Pearson correlation: {correlation:.3f}")

print("\n" + "="*70)
print("Analysis complete! Results saved to:", output_dir)
print("="*70)

## Summary

This notebook demonstrated:

1. ✅ **10x Genomics data structure** and handling
2. ✅ **Single-cell RNA-seq preprocessing** (QC, normalization, log transformation)
3. ✅ **Dimensionality reduction** (PCA, UMAP)
4. ✅ **Clustering** using Leiden algorithm
5. ✅ **Cross-cancer comparison** (lung vs breast)
6. ✅ **Differential expression** using Wilcoxon rank-sum test
7. ✅ **Marker gene identification** for each cancer type
8. ✅ **Multiple visualizations** (volcano plots, heatmaps, dot plots, UMAPs)
9. ✅ **Pseudobulk analysis** for traditional bulk-like comparisons

### Key Findings:

- Identified cancer-type specific marker genes (in synthetic data, so they illustrate the workflow rather than real biology)
- Visualized transcriptomic differences between lung and breast cancer
- Quantified expression patterns at single-cell resolution

### For Real 10x Genomics Data:

1. **Download from 10x website**: https://www.10xgenomics.com/datasets
2. **Use GEO datasets** with 10x data (search for "10x genomics" on GEO)
3. **Load with scanpy**: `sc.read_10x_mtx(path)`
4. **Additional QC**: Filter doublets, remove low-quality cells
5. **Batch correction**: If combining multiple samples, use Harmony or Scanorama
6. **Cell type annotation**: Use SingleR, CellTypist, or manual annotation
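
As one example of the additional QC mentioned in step 4, low-quality cells are often flagged by a high fraction of mitochondrial reads. The sketch below computes that fraction from a count matrix using the `MT-` prefix of human mitochondrial gene symbols. The function name `mito_fraction`, the toy counts, and the 20% cutoff are assumptions; a suitable threshold depends on tissue and protocol.

```python
import numpy as np
import pandas as pd


def mito_fraction(counts: np.ndarray, gene_names, prefix: str = "MT-") -> np.ndarray:
    """Fraction of each cell's counts that come from mitochondrial genes."""
    is_mito = pd.Index(gene_names).str.startswith(prefix)
    total = counts.sum(axis=1)
    mito = counts[:, is_mito].sum(axis=1)
    # Cells with zero total counts get a fraction of 0 instead of NaN
    return np.divide(mito, total, out=np.zeros(len(total)), where=total > 0)


genes = ["MT-CO1", "MT-ND1", "EGFR", "ESR1"]
counts = np.array([[5, 5, 40, 50],
                   [30, 30, 20, 20],
                   [0, 0, 0, 0]])

frac = mito_fraction(counts, genes)
keep = frac < 0.2   # hypothetical cutoff
print(frac, keep)
```

With an `AnnData` object, the resulting boolean mask can be used to subset cells, for example `adata = adata[keep].copy()`.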

### Recommended Next Steps:

- Pathway enrichment analysis (GSEA, Reactome)
- Cell-cell communication analysis (CellPhoneDB, NicheNet)
- Trajectory inference for cancer progression
- Integration with bulk RNA-seq or proteomics data
- Survival analysis with clinical outcomes