# GEO (Gene Expression Omnibus) Data Analysis

This notebook demonstrates how to download and analyze gene expression data from GEO.

GEO is a public repository that archives high-throughput genomics data including:
- Microarray data
- RNA-seq data
- ChIP-seq data
- Other high-throughput functional genomics data (for example, methylation arrays)

## Methods:
1. Download data using GEOparse library
2. Alternative: Download the series matrix file directly from the NCBI FTP site
3. Clean and log-transform expression data
4. Differential expression analysis
5. Visualization

In [None]:
# Install required packages
!pip install GEOparse pandas numpy matplotlib seaborn scipy scikit-learn statsmodels requests

In [None]:
import GEOparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.multitest import multipletests
import requests
import gzip
import os
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

## 1. Download GEO Dataset using GEOparse

**Example datasets:**
- GSE48558: Breast cancer vs normal tissue (Microarray)
- GSE62944: Colorectal cancer (RNA-seq)
- GSE68849: Prostate cancer (Microarray)
- GSE183947: Lung cancer (RNA-seq)

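We'll use GSE48558 as the example. After downloading, compare the printed title and summary with the description above: the accession determines which metadata columns and group labels the later cells can rely on.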

In [None]:
# Download GEO dataset
# This may take a few minutes depending on dataset size

geo_id = "GSE48558"  # Breast cancer dataset

print(f"Downloading {geo_id} from GEO...")
print("This may take several minutes...\n")

# Download and parse
gse = GEOparse.get_GEO(geo=geo_id, destdir="./geo_data")

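print("\nDataset downloaded successfully!")
print(f"Title: {gse.metadata['title'][0]}")
# Series-level metadata may store the organism under 'sample_organism'
# rather than 'organism', so look the key up defensively
organism = gse.metadata.get('sample_organism', gse.metadata.get('organism', ['N/A']))[0]
print(f"Organism: {organism}")
print(f"Platform: {gse.metadata['platform_id'][0]}")
print(f"Number of samples: {len(gse.gsms)}")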

## 2. Explore Dataset Metadata

In [None]:
# Display dataset information
print("Dataset Summary:")
print(f"Summary: {gse.metadata['summary'][0][:500]}...\n")

# Get platform information
platform_id = gse.metadata['platform_id'][0]
print(f"Platform ID: {platform_id}")

# List all samples
print(f"\nSample IDs (first 10):")
sample_ids = list(gse.gsms.keys())
print(sample_ids[:10])

In [None]:
# Extract sample metadata
sample_metadata = []

for gsm_name, gsm in gse.gsms.items():
    # Extract relevant metadata
    metadata_dict = {
        'sample_id': gsm_name,
        'title': gsm.metadata['title'][0] if 'title' in gsm.metadata else 'N/A',
        'source': gsm.metadata['source_name_ch1'][0] if 'source_name_ch1' in gsm.metadata else 'N/A',
    }
    
    # Extract characteristics
    if 'characteristics_ch1' in gsm.metadata:
        for char in gsm.metadata['characteristics_ch1']:
            if ':' in char:
                key, value = char.split(':', 1)
                metadata_dict[key.strip()] = value.strip()
    
    sample_metadata.append(metadata_dict)

# Create metadata DataFrame
metadata_df = pd.DataFrame(sample_metadata)

print("Sample Metadata:")
print(metadata_df.head(10))
print(f"\nMetadata columns: {metadata_df.columns.tolist()}")
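
The characteristics parsing in the loop above can be pulled into a small helper, which makes it easier to test on its own. This is a minimal sketch: `parse_characteristics` is a hypothetical name, and lower-casing keys and joining repeated keys with `; ` are assumptions made here, not GEO or GEOparse behavior.

```python
def parse_characteristics(entries):
    """Turn GEO-style 'key: value' strings into a dict.

    Entries without a colon are skipped. If a key repeats, the values are
    joined with '; ' instead of silently overwriting the earlier one.
    """
    parsed = {}
    for entry in entries:
        if ':' not in entry:
            continue
        # Split only on the first colon so values may contain colons themselves
        key, value = (part.strip() for part in entry.split(':', 1))
        key = key.lower()
        parsed[key] = f"{parsed[key]}; {value}" if key in parsed else value
    return parsed


example = ["tissue: breast tumor", "Age: 54", "batch 2", "treatment: drug: A"]
print(parse_characteristics(example))
```

Lower-casing the keys keeps column names such as `Tissue` and `tissue` from ending up in separate DataFrame columns.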

In [None]:
# Identify sample groups (tumor vs normal)
# This will vary by dataset - inspect the metadata to find the right column

# For GSE48558, look for tissue type information
print("Identifying sample groups...\n")

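# Show value counts for the candidate grouping columns present in this dataset
possible_cols = ['tissue', 'sample type', 'tissue type', 'source', 'disease state', 'cell type']
candidate_cols = [col for col in possible_cols if col in metadata_df.columns]

for col in candidate_cols:
    print(f"\n{col}:")
    print(metadata_df[col].value_counts())

# Create a condition column from the metadata.
# Prefer the 'tissue' column; any value without 'tumor' or 'cancer' is labeled 'Normal',
# so check the counts printed at the end of this cell before trusting the grouping.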
if 'tissue' in metadata_df.columns:
    metadata_df['condition'] = metadata_df['tissue'].apply(
        lambda x: 'Tumor' if 'tumor' in str(x).lower() or 'cancer' in str(x).lower() 
        else 'Normal'
    )
else:
    # Fallback: use source column
    metadata_df['condition'] = metadata_df['source'].apply(
        lambda x: 'Tumor' if 'tumor' in str(x).lower() or 'cancer' in str(x).lower() 
        else 'Normal'
    )

print("\nSample distribution:")
print(metadata_df['condition'].value_counts())
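
Keyword matching like the cell above labels every sample that lacks a tumor keyword as `Normal`, which can hide samples that belong to neither group. A stricter variant is sketched below. The function name `label_condition` and both keyword lists are assumptions chosen for illustration; real datasets usually need their own lists.

```python
def label_condition(text,
                    tumor_terms=("tumor", "tumour", "cancer", "carcinoma"),
                    normal_terms=("normal", "healthy", "control")):
    """Classify a free-text sample description by keyword.

    Returns 'Tumor', 'Normal', 'Ambiguous' (both kinds of keyword found),
    or 'Unknown' (no keyword found).
    """
    value = str(text).lower()
    is_tumor = any(term in value for term in tumor_terms)
    is_normal = any(term in value for term in normal_terms)
    if is_tumor and is_normal:
        return "Ambiguous"
    if is_tumor:
        return "Tumor"
    if is_normal:
        return "Normal"
    return "Unknown"


for description in ["Breast tumor", "normal breast",
                    "normal tissue adjacent to tumor", "AML patient"]:
    print(f"{description!r} -> {label_condition(description)}")
```

Samples labeled `Ambiguous` or `Unknown` are best resolved by hand from the GEO record before running any group comparison.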

## 3. Extract Expression Data

In [None]:
# Extract expression data from all samples
print("Extracting expression data...")

expression_data = {}

for gsm_name, gsm in gse.gsms.items():
    table = gsm.table
    if table.empty:
        # Some samples (often RNA-seq) ship without an embedded data table
        continue

    # Index each sample by probe ID so samples align on probe, not on row position
    if 'ID_REF' in table.columns:
        table = table.set_index('ID_REF')
        table = table[~table.index.duplicated(keep='first')]

    # For microarray data, the expression values are usually in 'VALUE'
    if 'VALUE' in table.columns:
        expression_data[gsm_name] = table['VALUE']
    else:
        # Fall back to the first column whose name looks like an expression value
        for col in table.columns:
            if any(key in col.lower() for key in ('value', 'signal', 'expression')):
                expression_data[gsm_name] = table[col]
                break

# Build the probes x samples matrix; pandas aligns the Series on their probe-ID index.
# Passing the probe IDs separately via `index=` would reindex the unlabeled Series
# and produce an all-NaN matrix.
expression_df = pd.DataFrame(expression_data)

print(f"\nExpression matrix shape: {expression_df.shape}")
print(f"Number of probes/genes: {expression_df.shape[0]}")
print(f"Number of samples: {expression_df.shape[1]}")
print("\nFirst few rows and columns:")
print(expression_df.iloc[:5, :5])
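
When per-sample tables are combined into one matrix, pandas aligns on index labels rather than on row position. The toy example below illustrates why each sample should be indexed by probe ID before the matrix is built. The probe IDs, sample names, and values are invented for the illustration.

```python
import pandas as pd

# Two toy per-sample tables shaped like a GEO sample table; note the different row order
sample_a = pd.DataFrame({"ID_REF": ["p1", "p2", "p3"], "VALUE": [5.0, 6.0, 7.0]})
sample_b = pd.DataFrame({"ID_REF": ["p3", "p1", "p2"], "VALUE": [7.5, 5.5, 6.5]})

# Index each sample by probe ID, then let pandas align the columns on that index
matrix = pd.DataFrame({
    "GSM_A": sample_a.set_index("ID_REF")["VALUE"],
    "GSM_B": sample_b.set_index("ID_REF")["VALUE"],
})
print(matrix)

# Passing probe IDs through `index=` does not relabel an unindexed Series:
# pandas reindexes it instead, and no integer label matches a probe ID
misaligned = pd.DataFrame({"GSM_A": sample_a["VALUE"]}, index=sample_a["ID_REF"])
print("All values missing:", misaligned["GSM_A"].isna().all())
```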

## 4. Data Quality Control and Preprocessing

In [None]:
# Convert to numeric and handle missing values
expression_df = expression_df.apply(pd.to_numeric, errors='coerce')

print(f"Missing values: {expression_df.isna().sum().sum()}")

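# Remove probes that are missing in more than 20% of samples.
# dropna's `thresh` is the minimum number of non-missing values a row needs to be kept.
max_missing = int(0.2 * expression_df.shape[1])
expression_df = expression_df.dropna(thresh=expression_df.shape[1] - max_missing)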

# Fill remaining missing values with row median
expression_df = expression_df.apply(lambda row: row.fillna(row.median()), axis=1)

print(f"\nAfter cleanup:")
print(f"Expression matrix shape: {expression_df.shape}")
print(f"Missing values: {expression_df.isna().sum().sum()}")
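
The `thresh` argument of `dropna` counts non-missing values, which is easy to read the wrong way round. The toy matrix below shows the filter and the row-median fill on three invented probes, assuming a 20% limit on missing values per probe.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0],            # complete
     [1.0, np.nan, 3.0, 4.0, 5.0],         # 20% missing -> kept
     [np.nan, np.nan, np.nan, 4.0, 5.0]],  # 60% missing -> dropped
    index=["probe_ok", "probe_some_na", "probe_mostly_na"],
)

n_samples = toy.shape[1]
max_missing = int(0.2 * n_samples)  # allow at most 20% missing values per probe
filtered = toy.dropna(thresh=n_samples - max_missing)

# Fill the remaining gaps with each probe's own median
filled = filtered.apply(lambda row: row.fillna(row.median()), axis=1)
print(filled)
```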

In [None]:
# Visualize data distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Sample distribution boxplot
# Select subset of samples for visualization
sample_subset = expression_df.columns[:20]
expression_df[sample_subset].boxplot(ax=axes[0], rot=90)
axes[0].set_ylabel('Expression Value')
axes[0].set_title('Expression Distribution by Sample (first 20)')
axes[0].tick_params(axis='x', labelsize=6)

# Overall distribution histogram
axes[1].hist(expression_df.values.flatten(), bins=100, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Expression Value')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Overall Expression Distribution')

plt.tight_layout()
plt.show()

print("\nExpression summary statistics:")
print(expression_df.describe())

## 5. Log Transform if Needed

In [None]:
# Check if data is already log-transformed
# Log-transformed data typically has values between 0-20
# Raw intensity data can have very large values

max_val = expression_df.max().max()
min_val = expression_df.min().min()

print(f"Expression range: {min_val:.2f} to {max_val:.2f}")

if max_val > 100:  # Likely not log-transformed
    print("Data appears to be raw intensity values. Applying log2 transformation...")
    expression_df_log = np.log2(expression_df + 1)  # Add 1 to avoid log(0)
    print(f"After log2 transformation: {expression_df_log.min().min():.2f} to {expression_df_log.max().max():.2f}")
else:
    print("Data appears to be already log-transformed.")
    expression_df_log = expression_df.copy()

# Use the log-transformed data for further analysis
expression_final = expression_df_log.copy()
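
A single outlier can push the maximum above 100 even when the data are already on a log scale. One possible variant of the check uses an upper quantile instead of the maximum. This is a sketch only: the function name `looks_raw`, the 99th percentile, and the cutoff of 100 are assumptions, not an established rule.

```python
import numpy as np

def looks_raw(values, cutoff=100.0, quantile=0.99):
    """Heuristic: True if the upper tail exceeds `cutoff`, suggesting unlogged data."""
    flat = np.asarray(values, dtype=float).ravel()
    flat = flat[~np.isnan(flat)]  # ignore missing values
    return bool(np.quantile(flat, quantile) > cutoff)


raw = np.array([[120.0, 3500.0], [48.0, 900.0]])  # invented raw intensities
logged = np.log2(raw + 1)

print("raw looks unlogged:", looks_raw(raw))
print("log2 data looks unlogged:", looks_raw(logged))
```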

## 6. Match Samples with Metadata

In [None]:
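# Keep samples present in both tables, preserving the column order of the expression matrix
# (a plain set intersection would return the samples in arbitrary order)
metadata_ids = set(metadata_df['sample_id'])
common_samples = [s for s in expression_final.columns if s in metadata_ids]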

print(f"Samples in both expression data and metadata: {len(common_samples)}")

# Subset and order
expression_final = expression_final[common_samples]
metadata_df = metadata_df[metadata_df['sample_id'].isin(common_samples)]
metadata_df = metadata_df.set_index('sample_id').loc[common_samples].reset_index()

print(f"\nFinal dataset: {expression_final.shape[0]} genes × {expression_final.shape[1]} samples")
print(f"\nSample distribution:")
print(metadata_df['condition'].value_counts())
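
Matching samples between two tables is simplest to reason about when the result keeps a predictable order. The helper below is a small sketch of an order-preserving intersection; `ordered_common` is a hypothetical name and the sample IDs are invented.

```python
def ordered_common(expression_columns, metadata_ids):
    """Return the IDs found in both inputs, in the order of `expression_columns`."""
    keep = set(metadata_ids)  # set lookup keeps the scan linear
    return [sample for sample in expression_columns if sample in keep]


cols = ["GSM3", "GSM1", "GSM2"]
meta = ["GSM1", "GSM2", "GSM4"]
print(ordered_common(cols, meta))  # ['GSM1', 'GSM2']
```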

## 7. Principal Component Analysis (PCA)

In [None]:
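# Perform PCA on standardized data (samples as rows, probes as columns)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(expression_final.T)

# PCA cannot return more components than there are samples or probes
pca = PCA(n_components=min(10, *scaled_data.shape))
pca_result = pca.fit_transform(scaled_data)

# Create PCA DataFrame
pca_df = pd.DataFrame(
    pca_result,
    columns=[f'PC{i+1}' for i in range(pca_result.shape[1])],
    index=expression_final.columns
)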

# Add condition information
pca_df = pca_df.merge(
    metadata_df[['sample_id', 'condition']].set_index('sample_id'),
    left_index=True,
    right_index=True
)

# Plot PCA
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# PC1 vs PC2
for condition, color in [('Tumor', 'red'), ('Normal', 'blue')]:
    mask = pca_df['condition'] == condition
    axes[0].scatter(
        pca_df.loc[mask, 'PC1'],
        pca_df.loc[mask, 'PC2'],
        c=color,
        label=condition,
        s=100,
        alpha=0.7,
        edgecolors='black'
    )

axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)')
axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)')
axes[0].set_title(f'PCA: {geo_id} Samples')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Scree plot (one bar per fitted component)
n_pcs = len(pca.explained_variance_ratio_)
axes[1].bar(range(1, n_pcs + 1), pca.explained_variance_ratio_ * 100)
axes[1].set_xlabel('Principal Component')
axes[1].set_ylabel('Variance Explained (%)')
axes[1].set_title('Scree Plot')
axes[1].set_xticks(range(1, n_pcs + 1))

plt.tight_layout()
plt.show()

print(f"Cumulative variance explained by first 3 PCs: {pca.explained_variance_ratio_[:3].sum()*100:.1f}%")
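
The percentages on the PCA axes and in the scree plot are shares of the total variance. As a rough illustration of where such numbers come from, the sketch below computes them with a singular value decomposition in plain NumPy on random data. It is not a description of scikit-learn's internals, and the matrix size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # 6 samples x 4 features, random toy data

# Center each feature, then decompose: Xc = U * S * Vt
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values are proportional to the variance along each component
explained_variance_ratio = S**2 / np.sum(S**2)

# Sample coordinates in component space (the values a PCA scatter plot shows)
scores = U * S

print(np.round(explained_variance_ratio * 100, 1))
```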

## 8. Differential Expression Analysis

In [None]:
def perform_differential_expression_geo(expression_df, metadata_df):
    """
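    Compare Tumor vs Normal samples probe by probe with a two-sample t-test.

    Assumes `expression_df` holds log2 values, so the difference of the group
    means is the log2 fold change. Returns a DataFrame sorted by p-value, with
    Benjamini-Hochberg FDR and a significance flag (FDR < 0.05 and |Log2FC| > 1).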
    """
    results = []
    
    # Get sample IDs for each condition
    tumor_samples = metadata_df[metadata_df['condition'] == 'Tumor']['sample_id'].values
    normal_samples = metadata_df[metadata_df['condition'] == 'Normal']['sample_id'].values
    
    print(f"Analyzing {len(tumor_samples)} tumor vs {len(normal_samples)} normal samples\n")
    
    for gene in expression_df.index:
        tumor_expr = expression_df.loc[gene, tumor_samples].values
        normal_expr = expression_df.loc[gene, normal_samples].values
        
        # Calculate statistics
        mean_tumor = np.mean(tumor_expr)
        mean_normal = np.mean(normal_expr)
        log2fc = mean_tumor - mean_normal
        
        # Two-sample t-test (SciPy assumes equal variances by default;
        # pass equal_var=False for Welch's test)
        t_stat, p_value = stats.ttest_ind(tumor_expr, normal_expr)
        
        results.append({
            'Gene': gene,
            'Mean_Tumor': mean_tumor,
            'Mean_Normal': mean_normal,
            'Log2FC': log2fc,
            'T_statistic': t_stat,
            'P_value': p_value
        })
    
    results_df = pd.DataFrame(results)
    
    # Remove NaN p-values
    results_df = results_df.dropna(subset=['P_value'])
    
    # Multiple testing correction
    _, results_df['FDR'], _, _ = multipletests(
        results_df['P_value'],
        method='fdr_bh'
    )
    
    # Add significance label
    results_df['Significant'] = (results_df['FDR'] < 0.05) & (abs(results_df['Log2FC']) > 1)
    
    # Sort by p-value
    results_df = results_df.sort_values('P_value')
    
    return results_df

# Perform differential expression
print("Performing differential expression analysis...")
de_results = perform_differential_expression_geo(expression_final, metadata_df)

# Summary
n_significant = de_results['Significant'].sum()
n_upregulated = ((de_results['Significant']) & (de_results['Log2FC'] > 0)).sum()
n_downregulated = ((de_results['Significant']) & (de_results['Log2FC'] < 0)).sum()

print(f"\n{'='*60}")
print(f"DIFFERENTIAL EXPRESSION RESULTS ({geo_id})")
print(f"{'='*60}")
print(f"Total genes analyzed: {len(de_results)}")
print(f"Significant genes (FDR < 0.05, |Log2FC| > 1): {n_significant}")
print(f"Upregulated in tumor: {n_upregulated}")
print(f"Downregulated in tumor: {n_downregulated}")
print(f"{'='*60}")

print(f"\nTop 20 differentially expressed genes:")
print(de_results[['Gene', 'Log2FC', 'P_value', 'FDR', 'Significant']].head(20))
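
The FDR column comes from the Benjamini-Hochberg procedure selected with `method='fdr_bh'`. To show what that adjustment does, here is a minimal NumPy sketch of the procedure. It is written for illustration and is not the statsmodels implementation; `benjamini_hochberg` is a name introduced here.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downwards
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

For these four p-values the adjusted values work out to 0.02, 0.04, 0.04 and 0.02.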

## 9. Volcano Plot

In [None]:
# Prepare data for volcano plot
de_results['-log10(FDR)'] = -np.log10(de_results['FDR'])

# Create volcano plot
plt.figure(figsize=(12, 8))

# Not significant
not_sig = de_results[~de_results['Significant']]
plt.scatter(not_sig['Log2FC'], not_sig['-log10(FDR)'], 
           c='gray', alpha=0.4, s=10, label='Not significant')

# Significant and upregulated
sig_up = de_results[(de_results['Significant']) & (de_results['Log2FC'] > 0)]
plt.scatter(sig_up['Log2FC'], sig_up['-log10(FDR)'], 
           c='red', alpha=0.7, s=30, label=f'Upregulated ({len(sig_up)})')

# Significant and downregulated
sig_down = de_results[(de_results['Significant']) & (de_results['Log2FC'] < 0)]
plt.scatter(sig_down['Log2FC'], sig_down['-log10(FDR)'], 
           c='blue', alpha=0.7, s=30, label=f'Downregulated ({len(sig_down)})')

# Add threshold lines
plt.axhline(-np.log10(0.05), color='black', linestyle='--', linewidth=1, alpha=0.5, label='FDR = 0.05')
plt.axvline(-1, color='black', linestyle='--', linewidth=1, alpha=0.5)
plt.axvline(1, color='black', linestyle='--', linewidth=1, alpha=0.5)

# Annotate top genes
top_genes = de_results.nsmallest(10, 'FDR')
for _, row in top_genes.iterrows():
    plt.annotate(row['Gene'], 
                xy=(row['Log2FC'], row['-log10(FDR)']),
                xytext=(5, 5), textcoords='offset points',
                fontsize=8, alpha=0.7)

plt.xlabel('Log2 Fold Change (Tumor vs Normal)', fontsize=12)
plt.ylabel('-log10(FDR)', fontsize=12)
plt.title(f'Volcano Plot: {geo_id} Differential Expression', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
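
Very small adjusted p-values can underflow to exactly 0, and `-log10(0)` is infinite, which drops those points from the volcano plot. One possible guard is to clip before taking the logarithm. The helper name `neg_log10` and the choice of the smallest positive float as the floor are assumptions made for this sketch.

```python
import numpy as np

def neg_log10(p_values):
    """Return -log10(p), clipping zeros to the smallest positive float first."""
    p = np.asarray(p_values, dtype=float)
    return -np.log10(np.clip(p, np.finfo(float).tiny, 1.0))


volcano_y = neg_log10([0.0, 0.01, 1.0])
print("all finite:", np.isfinite(volcano_y).all())
```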

## 10. Heatmap of Top Differentially Expressed Genes

In [None]:
# Select up to 50 significant genes with the smallest FDR
significant = de_results[de_results['Significant']]
n_top_genes = min(50, len(significant))

if n_top_genes > 0:
    top_genes = significant.nsmallest(n_top_genes, 'FDR')['Gene'].values
    
    # Subset expression data
    heatmap_data = expression_final.loc[top_genes]
    
    # Z-score normalize
    heatmap_data_zscore = heatmap_data.sub(heatmap_data.mean(axis=1), axis=0).div(
        heatmap_data.std(axis=1), axis=0
    )
    
    # Create color map for conditions
    col_colors = metadata_df.set_index('sample_id')['condition'].map(
        {'Tumor': 'red', 'Normal': 'blue'}
    )
    
    # Reorder to match heatmap columns
    col_colors = col_colors.loc[heatmap_data_zscore.columns]
    
    # Create clustered heatmap (clustermap creates its own figure)
    g = sns.clustermap(
        heatmap_data_zscore,
        col_colors=col_colors,
        cmap='RdBu_r',
        center=0,
        vmin=-3,
        vmax=3,
        figsize=(14, 10),
        cbar_kws={'label': 'Z-score'},
        yticklabels=True,
        xticklabels=False,
        row_cluster=True,
        col_cluster=True
    )
    plt.suptitle(f'Top {n_top_genes} Differentially Expressed Genes ({geo_id})', 
                 y=0.98, fontsize=14, fontweight='bold')
    plt.show()
else:
    print("No significant genes found for heatmap")
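
Row-wise z-scoring divides by each gene's standard deviation, so a gene with identical values in every sample produces missing values that clustering cannot handle. A guarded version might look like the sketch below. `row_zscore` is a hypothetical helper, and dropping constant rows is one option among several.

```python
import numpy as np
import pandas as pd

def row_zscore(df):
    """Z-score each row; rows with zero variance are dropped."""
    centered = df.sub(df.mean(axis=1), axis=0)
    std = df.std(axis=1).replace(0, np.nan)  # avoid dividing by zero
    return centered.div(std, axis=0).dropna(how="all")


toy = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]], index=["g1", "g2"])
zscored = row_zscore(toy)
print(zscored)
```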

## 11. Expression Profiles of Top Genes

In [None]:
# Plot up to 6 significant genes with the smallest FDR
significant = de_results[de_results['Significant']]
n_plot = min(6, len(significant))

if n_plot > 0:
    top_genes_plot = significant.nsmallest(n_plot, 'FDR')['Gene'].values
    
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    
    tumor_samples = metadata_df[metadata_df['condition'] == 'Tumor']['sample_id'].values
    normal_samples = metadata_df[metadata_df['condition'] == 'Normal']['sample_id'].values
    
    for idx, gene in enumerate(top_genes_plot):
        gene_expr = expression_final.loc[gene]
        
        tumor_expr = gene_expr[tumor_samples].values
        normal_expr = gene_expr[normal_samples].values
        
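        # Box plot (tick labels are set separately because the `labels`
        # keyword is deprecated in newer Matplotlib releases)
        bp = axes[idx].boxplot([tumor_expr, normal_expr], patch_artist=True)
        axes[idx].set_xticklabels(['Tumor', 'Normal'])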
        
        bp['boxes'][0].set_facecolor('red')
        bp['boxes'][1].set_facecolor('blue')
        
        # Add points
        axes[idx].scatter([1]*len(tumor_expr), tumor_expr, alpha=0.3, c='darkred', s=20)
        axes[idx].scatter([2]*len(normal_expr), normal_expr, alpha=0.3, c='darkblue', s=20)
        
        # Get stats
        gene_stats = de_results[de_results['Gene'] == gene].iloc[0]
        
        axes[idx].set_ylabel('Expression (log2)')
        axes[idx].set_title(
            f"{gene}\nLog2FC={gene_stats['Log2FC']:.2f}, FDR={gene_stats['FDR']:.2e}"
        )
        axes[idx].grid(True, alpha=0.3)
    
    # Hide unused subplots
    for idx in range(n_plot, 6):
        axes[idx].set_visible(False)
    
    plt.tight_layout()
    plt.show()
else:
    print("No significant genes to plot")

## 12. Save Results

In [None]:
# Create output directory
output_dir = f"geo_analysis_{geo_id}"
os.makedirs(output_dir, exist_ok=True)

# Save differential expression results
de_results.to_csv(f'{output_dir}/differential_expression_results.csv', index=False)
print(f"Saved: {output_dir}/differential_expression_results.csv")

# Save significant genes
significant_genes = de_results[de_results['Significant']]
significant_genes.to_csv(f'{output_dir}/significant_genes.csv', index=False)
print(f"Saved: {output_dir}/significant_genes.csv ({len(significant_genes)} genes)")

# Save the processed (log2-scale) expression matrix
expression_final.to_csv(f'{output_dir}/normalized_expression.csv')
print(f"Saved: {output_dir}/normalized_expression.csv")

# Save metadata
metadata_df.to_csv(f'{output_dir}/sample_metadata.csv', index=False)
print(f"Saved: {output_dir}/sample_metadata.csv")

print(f"\nAll results saved to {output_dir}/")

## 13. Alternative: Download GEO Data Manually via FTP

If GEOparse doesn't work or you want more control, you can download the series matrix file directly from the NCBI FTP site, which is also served over HTTPS.

In [None]:
# Alternative method: Download Series Matrix file directly
def download_geo_series_matrix(geo_id, output_dir="geo_data_manual"):
    """
    Download GEO series matrix file directly from NCBI FTP
    """
    import urllib.request
    
    os.makedirs(output_dir, exist_ok=True)
    
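    # Construct the download URL (the NCBI FTP tree is served over HTTPS)
    # Format: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE48nnn/GSE48558/matrix/
    # (the directory name replaces the last three digits of the accession with 'nnn')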
    geo_num = geo_id.replace('GSE', '')
    series_prefix = geo_num[:-3] + 'nnn'
    
    base_url = f"https://ftp.ncbi.nlm.nih.gov/geo/series/GSE{series_prefix}/{geo_id}/matrix/"
    filename = f"{geo_id}_series_matrix.txt.gz"
    url = base_url + filename
    
    print(f"Downloading from: {url}")
    
    output_path = os.path.join(output_dir, filename)
    
    try:
        urllib.request.urlretrieve(url, output_path)
        print(f"Downloaded to: {output_path}")
        
        # Extract
        with gzip.open(output_path, 'rb') as f_in:
            with open(output_path.replace('.gz', ''), 'wb') as f_out:
                f_out.write(f_in.read())
        
        print(f"Extracted to: {output_path.replace('.gz', '')}")
        return output_path.replace('.gz', '')
        
    except Exception as e:
        print(f"Error downloading: {e}")
        return None

# Example usage (commented out - uncomment to use)
# matrix_file = download_geo_series_matrix("GSE48558")
print("Alternative download method available (see code above)")
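
Once a series matrix file is on disk, the expression table can be read with pandas, because every metadata line in that format starts with `!`. The sketch below builds the download URL the same way as the function above and parses a tiny in-memory stand-in for a real file. The helper names and the toy file content are invented for illustration; series that span several platforms may use different file names, so treat the URL pattern as an assumption to verify.

```python
import io
import pandas as pd

def series_matrix_url(geo_id):
    """Build the series matrix URL for a GSE accession such as 'GSE48558'."""
    digits = geo_id.upper().replace("GSE", "")
    prefix = digits[:-3] + "nnn"  # e.g. '48558' -> '48nnn'
    return (f"https://ftp.ncbi.nlm.nih.gov/geo/series/GSE{prefix}/"
            f"{geo_id}/matrix/{geo_id}_series_matrix.txt.gz")


def read_series_matrix(handle):
    """Read the expression table, skipping metadata lines that start with '!'."""
    return pd.read_csv(handle, sep="\t", comment="!", index_col=0)


toy_file = io.StringIO(
    '!Series_title\t"toy series"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"p1"\t5.1\t6.2\n'
    '"p2"\t7.3\t8.4\n'
    '!series_matrix_table_end\n'
)
toy_matrix = read_series_matrix(toy_file)

print(series_matrix_url("GSE48558"))
print(toy_matrix)
```

`read_series_matrix` also accepts a file path, and pandas can read the `.gz` file directly without extracting it first.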

## Summary

This notebook demonstrated:

1. ✅ **Downloading GEO data** using GEOparse
2. ✅ **Extracting sample metadata** and organizing samples
3. ✅ **Processing expression data** (cleaning, log transformation)
4. ✅ **Quality control** (missing values, distribution checks)
5. ✅ **PCA** for sample visualization
6. ✅ **Differential expression analysis** with statistical testing
7. ✅ **Multiple testing correction** (FDR)
8. ✅ **Visualizations** (volcano plots, heatmaps, box plots)
9. ✅ **Results export** for further analysis

### Tips for Using GEO Data:

1. **Browse GEO** at https://www.ncbi.nlm.nih.gov/geo/ to find datasets
2. **Check sample metadata** carefully - column names vary by dataset
3. **Verify data type** - microarray vs RNA-seq require different processing
4. **Check preprocessing** - some datasets are already normalized
5. **Read the paper** - understand experimental design and sample groups

### Other Useful GEO Datasets:

- **GSE62944**: Colorectal cancer RNA-seq
- **GSE68849**: Prostate cancer microarray
- **GSE183947**: Lung cancer RNA-seq
- **GSE29431**: Gastric cancer microarray
- **GSE50760**: Pancreatic cancer RNA-seq