# 📊 Preprocessing of Breast Cancer Gene Expression Data (GSE96058)

This notebook performs preprocessing of the GSE96058 breast cancer RNA-seq dataset. The main goal is to clean and normalize the expression matrix and clinical metadata to prepare them for downstream analyses (e.g., clustering, differential expression, or signature evaluation).

---

## ✅ Included Steps:

1. **Load gene expression matrix**
2. **Quality control (QC)**  
   - Remove low-expressed genes  
   - Remove low-quality samples  
3. **Z-score normalization**  
   - Standardize gene expression across samples  
4. **Merge technical replicates**  
   - Average repeated samples  
5. **Parse clinical metadata**  
   - Extract structured annotations from the SOFT file  

---

**Input:**  
- `GSE96058_gene_expression_*.csv`  
- `GSE96058_family.soft`

**Output:**  
- Cleaned and normalized expression matrix  
- Matched clinical annotation DataFrame

This notebook does not perform analysis; it is designed purely for robust, reproducible preprocessing of large-scale RNA-seq data.


## 📥 Step 1: Load Gene Expression Matrix

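We begin by loading the gene expression matrix from a CSV file. The matrix contains log-transformed expression values (e.g., log2 TPM or similar), with genes as rows and samples as columns. This dataset includes 3,273 samples and 136 technical replicates from the GSE96058 breast cancer cohort. If the transform adds a small offset before taking the log, unexpressed genes appear as negative values rather than zeros, so the `> 0` threshold used in Step 2 should be checked against the actual value range.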


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import re

# File path (modify as needed)
file_path = "./data/GSE96058_gene_expression_3273_samples_and_136_replicates_transformed.csv"

# Read the expression matrix
df = pd.read_csv(file_path, index_col=0)
print("✅ Data loaded successfully")
print(f"Original shape: {df.shape}")


## 🧹 Step 2: Quality Control

We perform initial quality control to filter out low-quality or non-informative data.

### 2.1 Remove low-expressed genes  
Genes expressed (value > 0) in fewer than 10% of all samples are removed. These genes are likely uninformative or not expressed in most tumors.

### 2.2 Visualize sample-level total expression  
We compute and plot the total expression per sample after filtering genes, to inspect for outlier samples.

### 2.3 Remove low-quality samples  
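Samples whose total expression falls below the 5th percentile are treated as low quality and removed from further analysis. Because the cutoff is a percentile, this step always discards roughly 5% of samples, so the histogram from 2.2 should be checked to confirm that the removed samples are genuine outliers.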
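
To make the two filters concrete, here is a minimal sketch on a made-up 4 × 10 matrix. The gene names, sample names, and values are invented for illustration; only the thresholds (10% of samples, 5th percentile) come from the steps above.

```python
import numpy as np
import pandas as pd

# Toy matrix: 4 genes (rows) x 10 samples (columns); all values are invented
samples = [f"S{i}" for i in range(1, 11)]
toy = pd.DataFrame(0.0, index=["gene_a", "gene_b", "gene_c", "gene_d"], columns=samples)
toy.loc["gene_a"] = 5.0                        # expressed in every sample
toy.loc["gene_a", "S10"] = 0.5                 # S10 is a low-signal sample
toy.loc["gene_b", ["S1", "S2"]] = 2.0          # expressed in 2 of 10 samples
toy.loc["gene_c", ["S1", "S2", "S3"]] = 1.0    # expressed in 3 of 10 samples
# gene_d stays at zero everywhere

# 2.1: keep genes with expression > 0 in at least 10% of samples
nonzero_fraction = (toy > 0).mean(axis=1)
toy_qc = toy[nonzero_fraction >= 0.1]

# 2.3: drop samples whose total expression is below the 5th percentile
sample_total = toy_qc.sum(axis=0)
threshold = np.percentile(sample_total, 5)
toy_filtered = toy_qc.loc[:, sample_total >= threshold]

print(toy_filtered.shape)
```

In this toy case only `gene_d` and sample `S10` are removed. On real data the percentile cutoff removes the lowest ~5% of samples whether or not they look like outliers.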


In [None]:
# ========== Quality Control ==========

# 2.1 Remove genes that are lowly expressed in most samples
# (genes with expression > 0 in fewer than 10% of samples are removed)
nonzero_fraction = (df > 0).mean(axis=1)  # fraction of samples with expression > 0
df_qc = df[nonzero_fraction >= 0.1]  # Keep genes expressed in ≥10% of samples
print(f"🔍 Genes retained after QC: {df_qc.shape[0]} / {df.shape[0]}")

# 2.2 Check total expression per sample
sample_total_expr = df_qc.sum(axis=0)

plt.figure(figsize=(12, 4))
sns.histplot(sample_total_expr, bins=50, kde=True)
plt.title("Sample-wise Total Expression (After Filtering Low-expressed Genes)")
plt.xlabel("Total Expression")
plt.ylabel("Sample Count")
plt.tight_layout()
plt.show()

# 2.3 Remove low-quality samples with extremely low total expression
# (below the 5th percentile of per-sample totals)
threshold = np.percentile(sample_total_expr, 5)
df_qc = df_qc.loc[:, sample_total_expr >= threshold]
print(f"✅ Shape after removing low-expression samples: {df_qc.shape}")

## ⚖️ Step 3: Z-score Normalization

To make the expression values comparable across genes, we perform Z-score normalization on each gene (row-wise):

- Subtract the gene’s mean expression across samples
- Divide by its standard deviation

This is a standard preprocessing step for clustering and PCA, ensuring all genes are on the same scale.
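
As a small worked example of the row-wise Z-score, the sketch below standardizes a made-up 3 × 3 matrix. One gene is constant across samples, so its standard deviation is zero and its Z-scores are undefined. Dropping such genes is an assumption made here for illustration, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Invented values; "gene_const" has zero variance across samples
toy = pd.DataFrame(
    {"S1": [1.0, 4.0, 2.0], "S2": [2.0, 4.0, 4.0], "S3": [3.0, 4.0, 6.0]},
    index=["gene_a", "gene_const", "gene_b"],
)

gene_mean = toy.mean(axis=1)
gene_std = toy.std(axis=1)              # pandas default: sample std (ddof=1)
gene_std = gene_std.replace(0, np.nan)  # mark zero-variance genes explicitly

# Subtract each gene's mean and divide by its standard deviation
toy_z = toy.sub(gene_mean, axis=0).div(gene_std, axis=0)

# Assumption: genes whose Z-scores are undefined are dropped
toy_z = toy_z.dropna(axis=0, how="all")
print(toy_z)
```

Both remaining genes end up as `[-1, 0, 1]` even though their raw scales differ by a factor of two, which is the effect the normalization is meant to have.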


In [None]:

# ==========  Normalization ==========

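# Perform Z-score normalization per gene (for clustering/PCA).
# Vectorized equivalent of a row-wise apply; genes with zero variance become NaN.
gene_mean = df_qc.mean(axis=1)
gene_std = df_qc.std(axis=1)  # sample standard deviation (ddof=1)
df_zscore = df_qc.sub(gene_mean, axis=0).div(gene_std, axis=0)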

# Plot the distribution of Z-scored expression values
plt.figure(figsize=(6, 4))
sns.histplot(df_zscore.values.flatten(), bins=100, kde=True)
plt.title("Z-score Normalized Expression Distribution")
plt.xlabel("Z-score")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()


## 🧪 Step 4: Handle Technical Replicates

Some samples have technical replicates (columns ending with "repl"). To reduce redundancy:

- We detect replicate-original sample pairs.
- For each pair, we average the expression values of the original and its replicate.
- The replicate column is then removed.

This results in a cleaner expression matrix with no redundant columns.
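
The pairing and averaging logic can be sketched on a toy matrix as follows. The sample names (`F1`, `F1repl`, and so on) and values are hypothetical; the only thing taken from the text above is the `repl` suffix convention.

```python
import pandas as pd

# Invented matrix: F1 has a replicate, F2 does not, and F9repl has no original
toy = pd.DataFrame(
    {"F1": [1.0, 2.0], "F1repl": [3.0, 4.0], "F2": [5.0, 6.0], "F9repl": [7.0, 8.0]},
    index=["gene_a", "gene_b"],
)

suffix = "repl"
# Build (original, replicate) pairs together so they cannot fall out of alignment
pairs = [
    (col[: -len(suffix)], col)
    for col in toy.columns
    if col.endswith(suffix) and col[: -len(suffix)] in toy.columns
]

merged = toy.copy()
for base, repl in pairs:
    # Replace the original column with the mean of original and replicate
    merged[base] = toy[[base, repl]].mean(axis=1)

# Remove only the replicate columns that were actually merged
merged = merged.drop(columns=[repl for _, repl in pairs])
print(merged)
```

Note that `F9repl` survives because it has no matching original; whether such orphaned replicates should be kept, renamed, or dropped is a choice the analyst has to make.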


In [None]:
# ---------- Identify replicate pairs ----------
# Work on a copy under a new name so the raw matrix `df` from Step 1 is not overwritten
df_rep = df_zscore.copy()

# Replicate samples end with "repl"; the original sample shares the same base name.
# Build (base, replicate) pairs together so the two lists cannot fall out of alignment
# when a replicate has no matching original column.
replicate_pairs = [
    (re.sub(r"repl$", "", s), s)
    for s in df_rep.columns
    if s.endswith("repl") and re.sub(r"repl$", "", s) in df_rep.columns
]

print(f"Found {len(replicate_pairs)} replicate pairs.")

# ---------- Prepare data for correlation & scatterplot ----------
correlations = []      # For correlation barplot
scatter_data = []      # For joint scatterplot (sampled genes)
rng = np.random.default_rng(0)  # fixed seed so the sampled genes are reproducible

for base, repl in replicate_pairs:
    base_expr = df_rep[base]
    repl_expr = df_rep[repl]

    # Compute Pearson correlation
    corr = base_expr.corr(repl_expr)
    correlations.append({'sample': base, 'correlation': corr})

    # Sample up to 1000 genes for scatterplot clarity
    n_genes = min(1000, len(df_rep.index))
    sampled_genes = rng.choice(df_rep.index, size=n_genes, replace=False)
    scatter_data.append(pd.DataFrame({
        'base_expr': base_expr.loc[sampled_genes],
        'repl_expr': repl_expr.loc[sampled_genes],
        'pair': base  # Used for color grouping (optional)
    }))

In [None]:
# ---------- Barplot of all correlations ----------
corr_df = pd.DataFrame(correlations)

plt.figure(figsize=(14, 5))
sns.barplot(x='sample', y='correlation', data=corr_df)
plt.xticks(rotation=90)
plt.title("Pearson Correlation Between Original and Replicate Samples")
plt.axhline(0.95, color='red', linestyle='--', label='Recommended threshold')
plt.ylabel("Pearson r")
plt.xlabel("Sample")
plt.legend()
plt.tight_layout()
plt.show()


In [None]:
# ---------- Combined scatterplot of replicate expression ----------
scatter_df = pd.concat(scatter_data, axis=0)

plt.figure(figsize=(6, 6))
sns.scatterplot(data=scatter_df, x='base_expr', y='repl_expr', hue='pair', alpha=0.3, s=10, legend=False)
# Identity line spanning the full data range (Z-scores include negative values)
lo = scatter_df[['base_expr', 'repl_expr']].min().min()
hi = scatter_df[['base_expr', 'repl_expr']].max().max()
plt.plot([lo, hi], [lo, hi], color='red', linestyle='--')
plt.xlabel("Original Sample Expression")
plt.ylabel("Replicate Sample Expression")
plt.title("Expression Consistency Across Replicate Pairs (Sampled Genes)")
plt.tight_layout()
plt.show()

In [None]:
# Flag pairs with Pearson r < 0.9 (note: looser than the 0.95 reference line in the barplot)
low_corr_samples = corr_df[corr_df['correlation'] < 0.9]
print(f"Number of low-correlation replicate pairs: {len(low_corr_samples)}")
print(low_corr_samples)

In [None]:
# Drop both members of each low-correlation pair, so no orphaned replicate column remains
low_corr_bases = list(low_corr_samples["sample"])
low_corr_cols = low_corr_bases + [f"{s}repl" for s in low_corr_bases]
df_zscore = df_zscore.drop(columns=low_corr_cols, errors="ignore")

# Map each remaining replicate column to its original sample name
# (strip only the trailing "repl", not every occurrence of the substring)
repl_to_original = {
    col: col[:-len('repl')]
    for col in df_zscore.columns
    if col.endswith('repl')
}

# Create a new DataFrame to store merged expression values
df_merged = df_zscore.copy()

# For samples with replicates, average the replicate and original expression
for repl, orig in repl_to_original.items():
    # Check the matrix we are actually indexing (df_zscore, not df_qc)
    if orig in df_zscore.columns:
        # Replace original column with the mean of original and replicate
        df_merged[orig] = (df_zscore[orig] + df_zscore[repl]) / 2
        # Drop the replicate column
        df_merged.drop(columns=[repl], inplace=True)

print(f"✅ Shape after merging replicates: {df_merged.shape}")


## 🧾 Step 5: Parse Clinical Metadata from SOFT File

This section extracts clinical metadata from the original GEO SOFT file (`GSE96058_family.soft`), which contains detailed sample annotations.
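
To show the record layout the parser relies on, here is a minimal sketch that runs the same idea on an invented in-memory excerpt. The accession numbers, titles, and values below are made up; only the `!Sample_title` and `!Sample_characteristics_ch1` line prefixes reflect the SOFT layout.

```python
import io

# Invented excerpt that mimics the layout of a SOFT family file
soft_excerpt = io.StringIO(
    "^SAMPLE = GSM0000001\n"
    "!Sample_title = F1\n"
    "!Sample_characteristics_ch1 = age at diagnosis: 56\n"
    "!Sample_characteristics_ch1 = er status: 1\n"
    "^SAMPLE = GSM0000002\n"
    "!Sample_title = F2\n"
    "!Sample_characteristics_ch1 = age at diagnosis: 61\n"
)

records = []
for raw in soft_excerpt:
    line = raw.strip()
    if line.startswith("!Sample_title"):
        # A title line opens a new record
        records.append({"sample_id": line.split(" = ", 1)[1]})
    elif line.startswith("!Sample_characteristics_ch1") and records:
        content = line.split(" = ", 1)[1]
        # Split on the first colon only, so values may themselves contain colons
        key, _, value = content.partition(":")
        records[-1][key.strip().lower().replace(" ", "_")] = value.strip()

print(records)
```

Each sample becomes one dictionary, and fields missing for a sample (here `er_status` for `F2`) simply become `NaN` when the list is converted to a DataFrame.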


In [None]:
# ========== Step: Clinical Data Processing ==========

import re
import pandas as pd

# Load the SOFT file
soft_file = "./data/GSE96058_family.soft"

# Initialize sample list and tracking variables
samples = []
current_sample = {}
sample_id = None

# Parse the SOFT file line by line
with open(soft_file, "r") as f:
    for line in f:
        line = line.strip()

        # Detect the start of a new sample
        if line.startswith("!Sample_title"):
            # If a previous sample was being tracked, store it
            if current_sample:
                samples.append(current_sample)
            current_sample = {}
            sample_id = line.split(" = ", 1)[1]  # split once, in case the title contains " = "
            current_sample["sample_id"] = sample_id

        # Parse clinical characteristics
        elif line.startswith("!Sample_characteristics_ch1 ="):
            content = line.split("= ", 1)[1]
            # Each characteristics line is expected to hold one "key: value" pair,
            # e.g. "age: 56" or "ER status: positive"; split on the first colon only
            key, sep, value = content.partition(":")
            if sep:
                key = key.strip().lower().replace(" ", "_")  # normalize field name
                current_sample[key] = value.strip()

# Add the final sample if it exists
if current_sample:
    samples.append(current_sample)

# Convert to DataFrame
df_meta = pd.DataFrame(samples)

# Display results
print(f"✅ Parsed {len(df_meta)} samples with {df_meta.shape[1]} metadata fields.")
print(df_meta.head())
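
The stated output is a *matched* annotation table, but the cell above does not yet align `df_meta` with the expression matrix. One possible way to do that is sketched below on toy data. It assumes that the parsed `sample_id` values (taken from `!Sample_title`) use the same names as the expression matrix columns, which should be verified on the real files; all names and values here are invented.

```python
import pandas as pd

# Invented stand-ins for df_merged (genes x samples) and df_meta (one row per sample)
expr = pd.DataFrame(
    {"F1": [0.1, -0.2], "F3": [1.0, 0.5], "F2": [-0.7, 0.3]},
    index=["gene_a", "gene_b"],
)
meta = pd.DataFrame(
    {"sample_id": ["F1", "F2", "F4"], "er_status": ["1", "0", "1"]}
)

# Index the metadata by sample name (assumed to match expression column names)
meta_indexed = meta.drop_duplicates("sample_id").set_index("sample_id")

# Keep only samples present in both tables, in expression-matrix order
shared = [s for s in expr.columns if s in meta_indexed.index]
expr_matched = expr[shared]
meta_matched = meta_indexed.loc[shared]

print(f"Matched {len(shared)} samples")
```

After this step the columns of `expr_matched` and the rows of `meta_matched` are in the same order, so downstream code can rely on positional alignment.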
