# 🔬 Differentially Methylated CpG Selection

This notebook identifies CpGs that are differentially methylated between matched tumor and normal samples using:
- ✅ Paired t-test

The top-ranked CpGs will be used to define DMRs (differentially methylated regions) later in this project. Each CpG is reported with its Δβ (mean tumor β minus mean normal β) and its paired t-test p-value, and the list is sorted by p-value.

## Input files expected
- `methylation_data_matched.csv`: Methylation data for matched tumor/normal sample pairs (CpG β-values; row index = sample IDs, columns = CpGs)

- `y_labels.csv`: Clinical status of each sample, with columns `sample_id` and `label` (Tumor = 1, Normal = 0)


## Output produced
- `sorted_CpGs.csv`: Differentially methylated CpGs sorted by p-value (columns: `CpG`, `delta_beta`, `p_value`)

## 1) Setup & upload

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_rel

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load the methylation data from Google Drive (rows = samples, columns = CpGs)
file_path = '/content/drive/My Drive/GenomicsProject_data/methylation_data_matched.csv'
X_meth = pd.read_csv(file_path, index_col=0)

In [None]:
# Load the clinical status of each sample (1 = tumor, 0 = normal) as a Series
file_path = '/content/drive/My Drive/GenomicsProject_data/y_labels.csv'
y = pd.read_csv(file_path, index_col=0).squeeze()
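
Before splitting the samples, it can help to confirm that the label index lines up with the methylation rows and that the β-values fall in the expected [0, 1] range. The following is a minimal sketch of such a check; `check_inputs` is a hypothetical helper that is not part of the original notebook, and the toy data are made up to stand in for the real files.

```python
import pandas as pd


def check_inputs(X_meth: pd.DataFrame, y: pd.Series) -> dict:
    """Summarize how well the label index matches the methylation rows."""
    shared = X_meth.index.intersection(y.index)
    return {
        # Samples that have both a label and a methylation profile
        "n_samples_with_both": len(shared),
        # Labelled samples that have no methylation row
        "labels_without_methylation": sorted(set(y.index) - set(X_meth.index)),
        # Any label value other than 0 (normal) or 1 (tumor)
        "unexpected_labels": sorted(set(y.unique()) - {0, 1}),
        # Beta-values are proportions, so anything outside [0, 1] is suspect
        "beta_out_of_range": bool(((X_meth < 0) | (X_meth > 1)).any().any()),
    }


# Toy data standing in for the real files (IDs and values are made up).
X_toy = pd.DataFrame(
    {"cgA": [0.10, 0.80, 0.15], "cgB": [0.90, 0.20, 0.85]},
    index=["P1-01", "P1-11", "P2-01"],
)
y_toy = pd.Series({"P1-01": 1, "P1-11": 0, "P2-01": 1, "P2-11": 0}, name="label")

report = check_inputs(X_toy, y_toy)
print(report)
```

On the real data the same call would be `check_inputs(X_meth, y)`; a non-empty `labels_without_methylation` list would flag samples that cannot take part in the pairing step.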

In [None]:
# Alternative: load the methylation data and labels from the working directory.
# "methylation_data_matched.csv" (matched pairs only) is used instead of
# "X_meth.csv" (all samples); both files have the same CpG columns.
# X_meth = pd.read_csv("methylation_data_matched.csv", index_col=0)
# y = pd.read_csv("y_labels.csv", index_col=0).squeeze()

In [None]:
# Split paired tumor and normal samples
tumor_ids = y[y == 1].index
normal_ids = y[y == 0].index

In [None]:
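# Ensure paired format: match tumor ("-01") and normal ("-11") samples by patient ID.
# Only the trailing sample-type suffix is swapped, so a "-01" that occurs
# inside the patient ID itself is left untouched.
normal_set = set(normal_ids)
common_ids = [i for i in tumor_ids if i.endswith("-01") and i[:-3] + "-11" in normal_set]
pairs = [(i, i[:-3] + "-11") for i in common_ids]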
print('Number of tumor/normal sample pairs with Methylation data:', len(pairs), 'pairs')

Number of tumor/normal sample pairs with Methylation data: 29 pairs
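
The pairing step relies on the TCGA barcode convention in which the sample-type code `01` marks a primary tumor and `11` a solid-tissue normal sample from the same patient. The sketch below shows suffix-only matching on made-up IDs; `pair_tumor_normal` is a hypothetical helper, not a function from the notebook. It also illustrates why swapping only the trailing suffix is safer than replacing every `-01` in the ID.

```python
def pair_tumor_normal(tumor_ids, normal_ids, tumor_suffix="-01", normal_suffix="-11"):
    """Return (tumor_id, normal_id) tuples for patients that have both samples."""
    normal_set = set(normal_ids)
    pairs = []
    for tumor_id in tumor_ids:
        if not tumor_id.endswith(tumor_suffix):
            continue  # not a recognised tumor sample ID
        # Swap only the trailing sample-type suffix
        normal_id = tumor_id[: -len(tumor_suffix)] + normal_suffix
        if normal_id in normal_set:
            pairs.append((tumor_id, normal_id))
    return pairs


# Made-up IDs; the first one contains "-01" inside the patient segment.
tumors = ["TCGA-XX-0101-01", "TCGA-XX-2222-01", "TCGA-XX-3333-01"]
normals = ["TCGA-XX-0101-11", "TCGA-XX-3333-11"]

toy_pairs = pair_tumor_normal(tumors, normals)
print(toy_pairs)

# Replacing every "-01" would also rewrite the patient segment:
naive = "TCGA-XX-0101-01".replace("-01", "-11")
print(naive)  # "TCGA-XX-1101-11", which matches no normal sample
```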


In [None]:
X_meth.head(5)

| Sample ID | cg13332474 | cg00651829 | cg17027195 | cg09868354 | cg03050183 | cg06819656 | cg04244851 | cg19669385 | cg04244855 | cg17689707 | ... | cg19358568 | cg27295654 | cg03116837 | cg15678817 | cg14483317 | cg11692435 | cg10230711 | cg16651827 | cg18138552 | cg07883722 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TCGA-44-6778-01 | 0.0315 | 0.0424 | 0.0365 | 0.06 | 0.0893 | 0.9021 | 0.6482 | 0.7225 | 0.8478 | 0.0461 | ... | 0.7214 | 0.6495 | 0.4088 | 0.3967 | 0.0774 | 0.6282 | 0.0784 | 0.5802 | 0.0652 | 0.9145 |
| TCGA-50-5931-01 | 0.0328 | 0.0239 | 0.0362 | 0.0837 | 0.1224 | 0.8346 | 0.7195 | 0.8076 | 0.7837 | 0.0182 | ... | 0.2089 | 0.8318 | 0.4326 | 0.6481 | 0.0804 | 0.9505 | 0.0912 | 0.7189 | 0.071 | 0.847 |
| TCGA-44-6144-01 | 0.0387 | 0.0281 | 0.0271 | 0.0706 | 0.1197 | 0.8406 | 0.686 | 0.7457 | 0.7907 | 0.0324 | ... | 0.8344 | 0.7582 | 0.4781 | 0.5769 | 0.0847 | 0.8031 | 0.1556 | 0.7195 | 0.069 | 0.9121 |
| TCGA-44-2668-01 | 0.0209 | 0.05385 | 0.0156 | 0.0614 | 0.1295 | 0.8367 | 0.82565 | 0.8807 | 0.8806 | 0.33995 | ... | 0.90065 | 0.81055 | 0.532 | 0.34755 | 0.0944 | 0.97 | 0.0747 | 0.73955 | 0.06525 | 0.87545 |
| TCGA-44-2665-01 | 0.0123 | 0.2486 | 0.03595 | 0.06225 | 0.09515 | 0.81995 | 0.81165 | 0.6812 | 0.8205 | 0.07875 | ... | 0.9228 | 0.8344 | 0.42865 | 0.54995 | 0.0687 | 0.98255 | 0.1012 | 0.75365 | 0.0715 | 0.8638 |


## 2) Calculate paired t-test statistics and Δβ for each CpG

In [None]:
# Paired t-test and delta beta, computed for all CpGs at once
tumor_idx = [tumor for tumor, normal in pairs]
normal_idx = [normal for tumor, normal in pairs]
# Rows are aligned pair by pair: row k of each matrix belongs to the same patient
tumor_mat = X_meth.loc[tumor_idx].to_numpy()
normal_mat = X_meth.loc[normal_idx].to_numpy()
t_stat, p_val = ttest_rel(tumor_mat, normal_mat, axis=0)
delta_beta = tumor_mat.mean(axis=0) - normal_mat.mean(axis=0)
dmr_stats = list(zip(X_meth.columns, delta_beta, p_val))
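
For intuition, a paired t-test is the same as a one-sample t-test on the per-patient differences (tumor minus normal), and Δβ is simply the mean of those differences. The sketch below checks both identities on simulated β-values; the data, the number of CpGs, and the effect size are made up for illustration and do not come from the project.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

# Simulated beta-values: 29 matched pairs (rows) and 4 CpGs (columns).
rng = np.random.default_rng(0)
n_pairs, n_cpgs = 29, 4
normal = rng.uniform(0.2, 0.8, size=(n_pairs, n_cpgs))
tumor = np.clip(normal + rng.normal(0.05, 0.1, size=(n_pairs, n_cpgs)), 0, 1)

# Per-patient differences, one column per CpG
diff = tumor - normal

# Paired t-test on the two matrices vs. one-sample t-test on the differences
t_rel, p_rel = ttest_rel(tumor, normal, axis=0)
t_one, p_one = ttest_1samp(diff, 0.0, axis=0)

# The t statistic written out by hand: mean(d) / (sd(d) / sqrt(n))
t_manual = diff.mean(axis=0) / (diff.std(axis=0, ddof=1) / np.sqrt(n_pairs))

# Delta beta as a difference of group means equals the mean paired difference
delta_beta = tumor.mean(axis=0) - normal.mean(axis=0)

print(np.allclose(t_rel, t_one), np.allclose(t_rel, t_manual))
print(np.allclose(delta_beta, diff.mean(axis=0)))
```

Because the test works on within-patient differences, between-patient variation in baseline methylation does not inflate the variance estimate, which is the main reason to prefer it over an unpaired test for matched samples.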

In [None]:
# Convert to a DataFrame, sort by p-value (most significant first), and save
dmr_df = pd.DataFrame(dmr_stats, columns=["CpG", "delta_beta", "p_value"])
dmr_df = dmr_df.sort_values("p_value").reset_index(drop=True)
dmr_df.to_csv("sorted_CpGs.csv", index=False)
print('Number of CpG with Δβ and p_value(sorted):', dmr_df.shape[0], 'CpGs')
dmr_df.head(5)


Number of CpG with Δβ and p_value(sorted): 395636 CpGs


| | CpG | delta_beta | p_value |
|---|---|---|---|
| 0 | cg04864807 | 0.513935 | 4.491369e-22 |
| 1 | cg11201447 | -0.436955 | 5.731186e-20 |
| 2 | cg25247520 | -0.403264 | 5.732552e-20 |
| 3 | cg08443563 | -0.339449 | 5.920719e-20 |
| 4 | cg12595013 | 0.32151 | 1.225674e-19 |
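
Downstream selection can combine both statistics, keeping CpGs that are significant and also show a large enough methylation shift. The following is a rough sketch of such a filter; `select_cpgs` and its cut-offs (`max_p`, `min_abs_delta`) are hypothetical placeholders rather than values used in this project, and the toy table is made up.

```python
import numpy as np
import pandas as pd


def select_cpgs(dmr_df, max_p=1e-5, min_abs_delta=0.2, top_n=None):
    """Keep CpGs passing both a p-value and an absolute delta-beta cut-off."""
    keep = (dmr_df["p_value"] <= max_p) & (dmr_df["delta_beta"].abs() >= min_abs_delta)
    selected = dmr_df.loc[keep].sort_values("p_value").reset_index(drop=True)
    # Positive delta-beta: higher methylation in tumor (hyper); negative: lower (hypo)
    selected["direction"] = np.where(selected["delta_beta"] > 0, "hyper", "hypo")
    return selected if top_n is None else selected.head(top_n)


# Toy table with the same columns as sorted_CpGs.csv (values are made up).
toy_df = pd.DataFrame(
    {
        "CpG": ["cg_a", "cg_b", "cg_c", "cg_d"],
        "delta_beta": [0.51, -0.44, 0.05, -0.30],
        "p_value": [1e-20, 1e-18, 1e-12, 1e-3],
    }
)

selected = select_cpgs(toy_df)
print(selected)
```

In the toy table, `cg_c` is dropped because its shift is too small and `cg_d` because its p-value is too large. On the real results the call would be `select_cpgs(dmr_df, ...)` with thresholds chosen for the study.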
