# Data Upload & Processing: Fetch and Preprocess TCGA-LUAD Expression & Methylation Data
Purpose: This notebook downloads TCGA-LUAD expression (RNA-seq) and methylation (array) data for tumor and normal samples from UCSC Xena. It then filters and imputes missing values and aligns matched tumor/normal samples.

## Input files expected (read directly from UCSC Xena public links):

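- `meth_url`: ("https://tcga.xenahubs.net/download/TCGA.LUAD.sampleMap/HumanMethylation450.gz") : CpGs × samples as downloaded, transposed to samples × CpGs after loading (β-values; row index = sample IDs)
- `expr_url`: ("https://tcga.xenahubs.net/download/TCGA.LUAD.sampleMap/HiSeqV2.gz") : genes × samples as downloaded, transposed to samples × genes after loading (log2-scale expression values; Xena lists the unit as log2(norm_count+1), not fold change; row index = sample IDs)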

## Outputs produced:
- `X_meth.csv` : Processed methylation data, samples × CpGs (β-values after missingness filtering and mean imputation; row index = sample IDs)

- `Y_expr.csv` : Processed expression data, samples × genes (z-scored expression; row index = sample IDs)

- `methylation_data_matched.csv`: Methylation data for pairs of matched tumor and normal samples (CpG β-values; row index = sample IDs)

- `expression_data_matched.csv`: Expression data for the matched tumor and normal samples that also have expression data (z-scored expression; row index = sample IDs)

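- `long_format_tumor_normal_samples.csv`: sample metadata in long format with columns SampleID, Condition (Tumor/Normal), PatientID

- `matched_sample_metadata.csv`: one row per matched tumor/normal pair with columns sample_id_tumor, base_id, sample_type_tumor, sample_id_normal, sample_type_normal, has_normal_pair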

- `y_labels.csv`: Samples and their clinical status with columns sample_id, label ("Tumor": 1, "Normal": 0)


## 1) Imports

In [None]:
# This script uses UCSC Xena-hosted TCGA data for LUAD (Lung Adenocarcinoma)
# It downloads methylation and expression data and matches tumor-normal pairs

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

## 2) Setup & download

In [None]:
# === 1. Download methylation and expression data ===
# UCSC Xena public links for TCGA-LUAD
meth_url = "https://tcga.xenahubs.net/download/TCGA.LUAD.sampleMap/HumanMethylation450.gz"
expr_url = "https://tcga.xenahubs.net/download/TCGA.LUAD.sampleMap/HiSeqV2.gz"

# Read both datasets
meth_df = pd.read_csv(meth_url, sep='\t', index_col=0)
expr_df = pd.read_csv(expr_url, sep='\t', index_col=0)

# Transpose to have samples as rows
meth_df = meth_df.T
expr_df = expr_df.T

print('Shapes of methylation and expression files:')
print('Methylation (samples x CpGs):', meth_df.shape)
print('Expression (samples x genes):', expr_df.shape)

Shapes of methylation and expression files:
Methylation (samples x CpGs): (492, 485577)
Expression (samples x genes): (576, 20530)
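
The Xena matrices are stored with features as rows and samples as columns, which is why the cell above transposes them. A minimal sketch of that orientation step on an in-memory toy table follows; the helper name `load_xena_matrix` and the toy IDs are illustrative assumptions, not part of the notebook.

```python
import io

import pandas as pd

# Tiny stand-in for a Xena matrix: the first column holds feature IDs,
# the remaining columns are samples (all values here are made up).
toy_tsv = (
    "sample\tTCGA-AA-0001-01\tTCGA-AA-0001-11\n"
    "GENE_A\t9.5\t8.1\n"
    "GENE_B\t0.0\t1.2\n"
    "GENE_C\t4.4\t4.0\n"
)


def load_xena_matrix(source):
    """Read a features x samples TSV and return it as samples x features."""
    df = pd.read_csv(source, sep="\t", index_col=0)
    return df.T


toy_expr = load_xena_matrix(io.StringIO(toy_tsv))
print(toy_expr.shape)  # 2 samples, 3 genes
```

The same function accepts a URL or a local path, so a downloaded copy of the file could be reused instead of fetching it on every run.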


In [None]:
expr_df.head()

sample,ARHGEF10L,HIF3A,RNF17,RNF10,RNF11,RNF13,GTF2IP1,REM1,MTVR2,RTN4RL2,...,TULP2,NPY5R,GNGT2,GNGT1,TULP3,PTRF,BCL6B,GSTK1,SELP,SELS
TCGA-69-7978-01,9.9898,4.2598,0.4181,10.3657,11.1718,10.5897,12.2708,4.767,0.0,8.2023,...,1.8836,0.742,6.2348,0.0,9.452,12.7565,8.2668,11.24,6.1209,9.8977
TCGA-62-8399-01,10.4257,11.6239,0.0,11.5489,11.02,9.2843,12.154,5.7125,0.4628,5.5819,...,0.4628,1.5316,4.4464,1.3294,9.5226,12.21,8.5437,10.3491,8.6398,9.7315
TCGA-78-7539-01,9.6264,9.1362,1.1231,11.6692,10.4679,10.4649,12.6559,4.3943,0.3725,3.5365,...,2.9588,0.0,6.04,3.9201,9.2765,10.6498,6.1814,11.1659,6.097,10.354
TCGA-50-5931-11,8.6835,9.4824,0.8221,11.7341,11.6787,11.5412,11.9285,5.9466,0.8221,3.3528,...,0.0,2.4876,6.3782,0.0,8.6781,14.6956,9.7151,10.591,9.5115,10.4914
TCGA-73-4658-01,9.2078,5.0288,0.0,11.6209,11.3414,10.9376,12.0539,6.0942,0.0,7.4156,...,0.0,0.6557,6.3898,1.1048,9.2697,13.0036,8.9786,10.6777,8.4187,10.3142


## 3) Data Processing (Handling missing data and Imputation)
### Methylation data processing

In [None]:
# Filter out CpGs with too much missing data
na_threshold = 0.2
valid_cpgs = meth_df.columns[meth_df.isna().mean() < na_threshold]
meth_df = meth_df[valid_cpgs]

# Impute missing values
imputer = SimpleImputer(strategy="mean")
meth_imputed = imputer.fit_transform(meth_df)

# Standardization is skipped: β-values are proportions bounded in [0, 1],
# so they are already on a common scale across CpGs.
# scaler = StandardScaler()
# meth_scaled = scaler.fit_transform(meth_imputed)

# Save preprocessed matrix (imputed, unscaled β-values)
# X_meth = pd.DataFrame(meth_scaled, index=meth_df.index, columns=valid_cpgs)
X_meth = pd.DataFrame(meth_imputed, index=meth_df.index, columns=valid_cpgs)
X_meth.to_csv("X_meth.csv")
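
To make the filter-then-impute logic concrete, here is a small pandas-only sketch on toy β-values. The probe names and numbers are invented, and `fillna` with column means is used here as a stand-in for `SimpleImputer(strategy="mean")`.

```python
import numpy as np
import pandas as pd

# Toy beta-value matrix: 6 samples x 3 CpG probes (values are made up).
toy_meth = pd.DataFrame(
    {
        "cg_keep": [0.1, 0.2, np.nan, 0.3, 0.4, 0.5],     # 1/6 missing -> kept
        "cg_drop": [0.8, np.nan, np.nan, 0.9, 0.7, 0.6],  # 2/6 missing -> dropped
        "cg_full": [0.5, 0.6, 0.55, 0.45, 0.4, 0.5],      # complete -> kept
    },
    index=["s1", "s2", "s3", "s4", "s5", "s6"],
)

na_threshold = 0.2

# Keep probes whose fraction of missing samples is strictly below the threshold.
keep = toy_meth.columns[toy_meth.isna().mean() < na_threshold]
filtered = toy_meth[keep]

# Replace each remaining NaN with the mean of the observed values in its column.
imputed = filtered.fillna(filtered.mean())
print(imputed)
```

Because the comparison is strict (`<`), a probe missing in exactly 20% of samples would be dropped.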

### Expression data processing

In [None]:
# Step 1: Drop genes with too many missing values
missing_threshold = 0.2
expr_df = expr_df.loc[:, expr_df.isna().mean() < missing_threshold]

In [None]:
# Step 2: Impute remaining missing values
expr_df = expr_df.fillna(0)  # Simple zero imputation; consider mean or KNN for better estimates

In [None]:
# Step 3: Normalize (Z-score)
scaler = StandardScaler()
X_expr = pd.DataFrame(scaler.fit_transform(expr_df), index=expr_df.index, columns=expr_df.columns)

In [None]:
# Step 4: Save result (the z-scored matrix X_expr is written to Y_expr.csv)
X_expr.to_csv("Y_expr.csv")

print("✅ Expression data preprocessing complete. Output saved as Y_expr.csv")

✅ Expression data preprocessing complete. Output saved as Y_expr.csv
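
For readers who want to see what the z-score step computes, the sketch below reproduces it with plain pandas on a toy matrix. The gene names and values are invented. It assumes the population standard deviation (`ddof=0`) and leaves constant genes at zero, which is the convention `StandardScaler` follows.

```python
import pandas as pd

# Toy expression matrix: 3 samples x 3 genes (values are made up).
toy_expr = pd.DataFrame(
    {
        "GENE_A": [8.0, 10.0, 12.0],
        "GENE_B": [0.0, 0.0, 0.0],   # constant gene, zero variance
        "GENE_C": [1.0, 2.0, 6.0],
    },
    index=["s1", "s2", "s3"],
)

mean = toy_expr.mean()
std = toy_expr.std(ddof=0)   # population standard deviation per gene
std = std.replace(0, 1.0)    # avoid 0/0 for constant genes; they stay at 0 after centering

z = (toy_expr - mean) / std
print(z.round(3))
```

Constant genes carry no information after scaling, so dropping them before modeling may be worth considering.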


## 4) Aligning matched tumor and normal samples

In [None]:
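# 1. Identify tumor and normal samples from the TCGA barcode
def extract_base_id(sample_id):
    # Patient-level ID: the first three barcode fields, e.g. "TCGA-50-5931"
    return "-".join(sample_id.split("-")[:3])

def extract_sample_type(sample_id):
    # The fourth field starts with the two-digit sample type code.
    # Only "11" (solid tissue normal) is treated as Normal; every other code is labeled Tumor.
    sample_type_code = sample_id.split("-")[3][:2]
    return "Normal" if sample_type_code == "11" else "Tumor"

X_meth['base_id'] = X_meth.index.map(extract_base_id)
X_meth['sample_type'] = X_meth.index.map(extract_sample_type)
# X_expr shares its index with expr_df; use X_expr's own index for clarity
X_expr['base_id'] = X_expr.index.map(extract_base_id)
X_expr['sample_type'] = X_expr.index.map(extract_sample_type)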
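
The helpers above treat only code `11` as normal. TCGA sample type codes are grouped by range (01 to 09 tumor, 10 to 19 normal, 20 to 29 control), so a slightly stricter parser could look like the sketch below. The function name `parse_tcga_sample_id` and the "Control"/"Unknown" labels are illustrative assumptions, and this version would classify code `10` differently from the notebook.

```python
def parse_tcga_sample_id(sample_id):
    """Split a TCGA sample barcode into (patient_id, sample_group).

    Hypothetical helper: groups the two-digit sample type code by range
    instead of testing for "11" only.
    """
    parts = sample_id.split("-")
    if len(parts) < 4:
        raise ValueError(f"Not a sample-level TCGA barcode: {sample_id!r}")

    patient_id = "-".join(parts[:3])
    code = int(parts[3][:2])  # e.g. "01A" -> 1, "11" -> 11

    if 1 <= code <= 9:
        group = "Tumor"
    elif 10 <= code <= 19:
        group = "Normal"
    elif 20 <= code <= 29:
        group = "Control"
    else:
        group = "Unknown"
    return patient_id, group


print(parse_tcga_sample_id("TCGA-50-5931-11"))
print(parse_tcga_sample_id("TCGA-50-5931-01A"))
```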

In [None]:
# 2. Match tumor and normal pairs
meth_meta = X_meth[['base_id', 'sample_type']].reset_index().rename(columns={'index': 'sample_id'})
meth_tumor = meth_meta[meth_meta['sample_type'] == 'Tumor']
meth_normal = meth_meta[meth_meta['sample_type'] == 'Normal']

# Inner join to find matched base_ids
matched = pd.merge(meth_tumor, meth_normal, on='base_id', suffixes=('_tumor', '_normal'))
matched['has_normal_pair'] = True
print('Tumor/normal matched sample pairs with methylation data:', matched.shape)

Tumor/normal matched sample pairs with methylation data: (29, 6)
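
An inner merge on `base_id` returns one row per tumor/normal combination, so a patient with more than one tumor sample would appear more than once. The toy sketch below shows that behavior and one possible way to keep a single pair per patient. The sample IDs are invented, and the tie-breaking rule (keep the lowest tumor ID) is an assumption, not something the notebook does.

```python
import pandas as pd

# Toy metadata: patient 0002 has no normal sample, patient 0003 has two tumor samples.
meta = pd.DataFrame(
    {
        "sample_id": [
            "TCGA-AA-0001-01", "TCGA-AA-0001-11",
            "TCGA-AA-0002-01",
            "TCGA-AA-0003-01", "TCGA-AA-0003-02", "TCGA-AA-0003-11",
        ]
    }
)
meta["base_id"] = meta["sample_id"].str.slice(0, 12)
meta["sample_type"] = meta["sample_id"].str.slice(13, 15).map(
    lambda code: "Normal" if code == "11" else "Tumor"
)

tumor = meta[meta["sample_type"] == "Tumor"]
normal = meta[meta["sample_type"] == "Normal"]

# Inner join: only patients with both a tumor and a normal sample survive.
pairs = pd.merge(tumor, normal, on="base_id", suffixes=("_tumor", "_normal"))
print(len(pairs), "rows for", pairs["base_id"].nunique(), "patients")

# One possible rule: keep the pair with the lowest tumor sample ID per patient.
one_per_patient = pairs.sort_values("sample_id_tumor").drop_duplicates("base_id")
print(one_per_patient[["base_id", "sample_id_tumor", "sample_id_normal"]])
```

Comparing `len(pairs)` with `pairs["base_id"].nunique()` is a quick way to check whether any patient is counted twice.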


In [None]:
# 3. Subset methylation and expression to matched pairs
matched_ids = matched['sample_id_tumor'].tolist() + matched['sample_id_normal'].tolist()
meth_df_matched = X_meth.loc[matched_ids].drop(columns=['base_id', 'sample_type'])
expr_df_matched = X_expr.loc[X_expr.index.isin(matched_ids)].drop(columns=['base_id', 'sample_type'])


Samples with matching expression and methylation data

In [None]:
# 4. Save outputs
meth_df_matched.to_csv("methylation_data_matched.csv")
expr_df_matched.to_csv("expression_data_matched.csv")
matched.to_csv("matched_sample_metadata.csv", index=False)

print(f"✅ Saved:\n- {meth_df_matched.shape[0]} Methylation samples creating {len(matched)} pairs of Tumor/normal samples\n- {expr_df_matched.shape[0]} samples with expression and methylation data")

✅ Saved:
- 58 Methylation samples creating 29 pairs of Tumor/normal samples
- 47 samples with expression and methylation data
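
The expression subset is smaller than the methylation subset because it keeps any matched sample that happens to have expression data, whether or not its partner does. If complete pairs in both data types are needed, a filter along the following lines could be applied. The IDs are toy values and the variable names are illustrative assumptions.

```python
import pandas as pd

# Toy pair table (as produced by the merge) and toy expression sample IDs.
pairs = pd.DataFrame(
    {
        "base_id": ["P1", "P2", "P3"],
        "sample_id_tumor": ["P1-01", "P2-01", "P3-01"],
        "sample_id_normal": ["P1-11", "P2-11", "P3-11"],
    }
)
expr_ids = pd.Index(["P1-01", "P1-11", "P2-01", "P3-11", "P9-01"])

# Count used in the notebook: matched samples present in the expression data.
matched_ids = pairs["sample_id_tumor"].tolist() + pairs["sample_id_normal"].tolist()
n_any = int(expr_ids.isin(matched_ids).sum())

# Stricter alternative: keep a pair only if BOTH members have expression data.
has_tumor = pairs["sample_id_tumor"].isin(expr_ids)
has_normal = pairs["sample_id_normal"].isin(expr_ids)
complete_pairs = pairs[has_tumor & has_normal]

print(n_any, "matched samples with expression;", len(complete_pairs), "complete pair(s)")
```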


In [None]:
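# 5. Save matched_sample_metadata.csv in long format
# Read the original wide-format CSV (one row per tumor/normal pair)
df = pd.read_csv("matched_sample_metadata.csv")

# Rename columns by name rather than by position, so a change in column order cannot mislabel them
df = df.rename(columns={
    "sample_id_tumor": "Tumor_sampleID",
    "base_id": "PatientID",
    "sample_type_tumor": "Tumor_condition",
    "sample_id_normal": "Normal_sampleID",
    "sample_type_normal": "Normal_condition",
})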

# Create two rows per patient: one for tumor, one for normal
df_long = pd.DataFrame({
    "SampleID": pd.concat([df["Tumor_sampleID"], df["Normal_sampleID"]], ignore_index=True),
    "Condition": pd.concat([df["Tumor_condition"], df["Normal_condition"]], ignore_index=True),
    "PatientID": pd.concat([df["PatientID"], df["PatientID"]], ignore_index=True)
})

# Save or preview
df_long.to_csv("long_format_tumor_normal_samples.csv", index=False)
print(df_long.head())

          SampleID Condition     PatientID
0  TCGA-44-6778-01     Tumor  TCGA-44-6778
1  TCGA-50-5931-01     Tumor  TCGA-50-5931
2  TCGA-44-6144-01     Tumor  TCGA-44-6144
3  TCGA-44-2668-01     Tumor  TCGA-44-2668
4  TCGA-44-2665-01     Tumor  TCGA-44-2665
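
The same wide-to-long reshaping can also be written with `DataFrame.melt`, which may be easier to extend if more per-sample columns are added later. This is a sketch on toy rows rather than a replacement for the cell above; the patient and sample IDs are invented.

```python
import pandas as pd

# Toy wide table: one row per patient with a tumor and a normal sample ID.
wide = pd.DataFrame(
    {
        "PatientID": ["P1", "P2"],
        "Tumor_sampleID": ["P1-01", "P2-01"],
        "Normal_sampleID": ["P1-11", "P2-11"],
    }
)

# Stack the two sample-ID columns into one; the source column name becomes the condition.
long_df = wide.melt(
    id_vars=["PatientID"],
    value_vars=["Tumor_sampleID", "Normal_sampleID"],
    var_name="Condition",
    value_name="SampleID",
)
long_df["Condition"] = long_df["Condition"].str.replace("_sampleID", "", regex=False)

# Sanity check: every patient contributes exactly two rows.
assert long_df.groupby("PatientID").size().eq(2).all()
print(long_df)
```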


In [None]:
# 6. Save y labels for each sample
# Label samples: tumor = 1, normal = 0

# Load the wide-format pair metadata written in step 4
metadata_df = pd.read_csv("matched_sample_metadata.csv")

# Check the column names
print(metadata_df.columns)

# Stack the tumor and normal columns into a single sample_id/label table

y_labels_t = metadata_df[["sample_id_tumor", "sample_type_tumor"]].rename(columns={"sample_id_tumor": "sample_id" , "sample_type_tumor": "label"})
y_labels_n = metadata_df[["sample_id_normal", "sample_type_normal"]].rename(columns={"sample_id_normal": "sample_id" , "sample_type_normal": "label"})
y_labels_df = pd.concat([y_labels_t, y_labels_n])
y_labels_df = y_labels_df.reset_index(drop=True)
y_labels_df["label"] = y_labels_df["label"].map({"Tumor": 1, "Normal": 0})

y_labels_df.to_csv("y_labels.csv", index=False)

print("Saved y_labels.csv")



Index(['sample_id_tumor', 'base_id', 'sample_type_tumor', 'sample_id_normal',
       'sample_type_normal', 'has_normal_pair'],
      dtype='object')
Saved y_labels.csv
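
When `y_labels.csv` is used downstream, the label order should match the row order of the feature matrix, since the file lists all tumor samples first. A minimal sketch of that alignment on toy data follows; the matrix `X` and its values are illustrative assumptions.

```python
import pandas as pd

# Toy feature matrix (rows = samples) and labels stored in a different order.
X = pd.DataFrame(
    {"cg_a": [0.1, 0.9, 0.2, 0.8]},
    index=["P1-01", "P1-11", "P2-01", "P2-11"],
)
y_labels = pd.DataFrame(
    {
        "sample_id": ["P2-11", "P1-01", "P2-01", "P1-11"],
        "label": [0, 1, 1, 0],
    }
)

# Reorder labels to follow the rows of X; a missing sample would show up as NaN.
y = y_labels.set_index("sample_id")["label"].reindex(X.index)
if y.isna().any():
    raise ValueError("Some samples in X have no label")

print(y)
```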
