##  Post-DEG & DMA Processing: Filtering Expression and Methylation Matrices

After identifying differentially expressed genes (DEGs) and differentially methylated probes (DMAs), we subset the original expression and methylation matrices to the significant genes and probes only. Downstream analyses then operate on features that passed the significance filters rather than on the full matrices.

---

###  Input Files

| File | Description |
|------|-------------|
| `PAAD_DEG_Significant_Results.csv` | List of significantly differentially expressed genes |
| `PAAD_DMA_Significant_Results.csv` | List of significantly differentially methylated probes |
| `PAAD_TOIL_RSEM_TPM_Levels.csv` | Original gene expression matrix (TPM, log2-transformed) |
| `PAAD_Methylation_Levels.csv` | Original DNA methylation matrix (beta values) |

---

###  Step 1: Filter Expression Matrix by DEGs

1. Load the list of significant DEGs from `PAAD_DEG_Significant_Results.csv`.
2. Extract the unique HGNC gene symbols.
3. Filter the original expression matrix to retain only genes listed in the DEG results.
4. Save the filtered matrix to `PAAD_TOIL_RSEM_TPM_Levels_After_DEG_DMA.csv`.

> **Output**: Expression matrix with only DEGs retained (3,317 genes in the run recorded below).
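
The steps above can be illustrated on a toy matrix. This is a minimal sketch, not the project data: the gene symbols, sample names, and values are made up for illustration, and only the `HGNC_Symbol` column name is taken from the notebook code.

```python
import pandas as pd

# Toy expression matrix: one row per gene, one column per sample (hypothetical values).
expr_df = pd.DataFrame({
    "HGNC_Symbol": ["KRAS", "TP53", "GAPDH", "ACTB"],
    "Sample_A": [5.1, 3.2, 10.4, 11.0],
    "Sample_B": [6.3, 2.8, 10.1, 10.9],
})

# Toy DEG table; the repeated symbol and the missing value mimic a messy results file.
deg_df = pd.DataFrame({"HGNC_Symbol": ["KRAS", "TP53", "KRAS", None]})

# Steps 1-2: load the DEG list and extract the unique, non-missing symbols.
deg_genes = deg_df["HGNC_Symbol"].dropna().unique()

# Step 3: keep only the rows whose symbol appears in the DEG list.
filtered_expr_df = expr_df[expr_df["HGNC_Symbol"].isin(deg_genes)]
print(filtered_expr_df)
```

Because `isin` builds a boolean row mask, the sample columns are left untouched and the original row order is preserved.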

---

###  Step 2: Filter Methylation Matrix by DMAs

1. Load the list of significant DMAs from `PAAD_DMA_Significant_Results.csv`.
2. Extract the unique probe IDs.
3. Filter the original methylation matrix to retain only probes listed in the DMA results.
4. Save the filtered matrix to `PAAD_Methylation_Levels_After_DEG_DMA.csv`.

> **Output**: Methylation matrix with only DMAs retained (10,248 probes in the run recorded below).
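
Methylation matrices can be large, so one optional variation is to filter while reading the file in chunks instead of loading it whole. The sketch below is an assumption about how that could look, not what the notebook does: it uses an in-memory CSV with made-up probe IDs and beta values as a stand-in for the methylation file, and a deliberately tiny `chunksize` so the chunking is visible.

```python
import io
import pandas as pd

# Stand-in for the methylation CSV (hypothetical probe IDs and beta values).
csv_text = """Probe_ID,Sample_A,Sample_B
cg00000001,0.12,0.15
cg00000002,0.80,0.78
cg00000003,0.45,0.50
cg00000004,0.05,0.07
cg00000005,0.66,0.61
"""

# Hypothetical set of significant probes.
dma_probes = {"cg00000002", "cg00000005"}

# Read a few rows at a time and keep only the matching probes from each chunk.
kept_chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    kept_chunks.append(chunk[chunk["Probe_ID"].isin(dma_probes)])

# Stitch the retained rows back into a single matrix.
filtered_meth_df = pd.concat(kept_chunks, ignore_index=True)
print(filtered_meth_df)
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file path and `chunksize` set to something much larger; peak memory then scales with the chunk size plus the retained rows rather than with the full matrix.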

---

###  Purpose

By performing this targeted filtering:
- We focus subsequent correlation or integrative analyses on genes and probes that are statistically significant.
- Non-significant genes and probes are dropped, which shrinks both matrices and reduces memory use and run time.
- The output matrices are ready for downstream analyses such as methylation-expression correlation, integrative clustering, or machine learning-based classification.

---

>  Both outputs are saved under the same project directory for consistent access and version control.


In [1]:
# Author: Yibai Tang
# Description: Filter gene expression and methylation data based on DEG and DMA results

import pandas as pd

# 1. Input file paths
deg_file = r'D:\project data\M-28\NTU_DATA_CLEANED\PAAD_DEG_Significant_Results.csv'
dma_file = r'D:\project data\M-28\NTU_DATA_CLEANED\PAAD_DMA_Significant_Results.csv'
expression_file = r'D:\project data\M-28\NTU_DATA_CLEANED\PAAD_TOIL_RSEM_TPM_Levels.csv'
methylation_file = r'D:\project data\M-28\NTU_DATA_CLEANED\PAAD_Methylation_Levels.csv'

# 2. Output file paths
filtered_expression_output = r'D:\project data\M-28\NTU_DATA_CLEANED\PAAD_TOIL_RSEM_TPM_Levels_After_DEG_DMA.csv'
filtered_methylation_output = r'D:\project data\M-28\NTU_DATA_CLEANED\PAAD_Methylation_Levels_After_DEG_DMA.csv'

# -------------------------------
# Filter the expression data
# -------------------------------
# Read the DEG results and collect the unique, non-missing gene symbols
deg_df = pd.read_csv(deg_file)
deg_genes = deg_df['HGNC_Symbol'].dropna().unique()

# Read the original expression matrix
expr_df = pd.read_csv(expression_file)

# Keep only the rows whose HGNC_Symbol appears in the DEG results
filtered_expr_df = expr_df[expr_df['HGNC_Symbol'].isin(deg_genes)]

# Save the filtered expression matrix
filtered_expr_df.to_csv(filtered_expression_output, index=False)
print(f"The filtered expression matrix has been saved, total {filtered_expr_df.shape[0]} gene to：{filtered_expression_output}")

# -------------------------------
# Filter the methylation data
# -------------------------------
# Read the DMA results and collect the unique, non-missing probe IDs
dma_df = pd.read_csv(dma_file)
dma_probes = dma_df['Probe_ID'].dropna().unique()

# Read the original methylation data (the first column is Probe_ID)
meth_df = pd.read_csv(methylation_file)

# Keep only the rows whose Probe_ID appears in the DMA results
filtered_meth_df = meth_df[meth_df['Probe_ID'].isin(dma_probes)]

# Save the filtered methylation matrix
filtered_meth_df.to_csv(filtered_methylation_output, index=False)
print(f"The methylation matrices after screening have been saved, total {filtered_meth_df.shape[0]} probes to：{filtered_methylation_output}")

The filtered expression matrix has been saved, total 3317 gene to：D:\project data\M-28\NTU_DATA_CLEANED\PAAD_TOIL_RSEM_TPM_Levels_After_DEG_DMA.csv
The methylation matrices after screening have been saved, total 10248 probes to：D:\project data\M-28\NTU_DATA_CLEANED\PAAD_Methylation_Levels_After_DEG_DMA.csv
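
The printed totals are row counts of the filtered matrices, which need not equal the number of unique IDs in the results files: an ID may be absent from the matrix, or may occur on more than one row. A small check along the following lines could make that explicit. It is a sketch under stated assumptions: `summarize_filter` is a hypothetical helper that is not part of the notebook, and the IDs in the example are made up.

```python
import pandas as pd

def summarize_filter(requested_ids, matrix_ids):
    """Compare the IDs requested for filtering with the IDs present in a matrix."""
    requested = pd.Series(requested_ids).dropna().unique()
    ids = pd.Series(matrix_ids)
    kept = ids[ids.isin(requested)]  # IDs of the rows the filter would retain
    return {
        "requested": len(requested),                      # unique IDs asked for
        "matched_rows": len(kept),                        # rows that survive the filter
        "duplicated_rows": int(kept.duplicated().sum()),  # repeats among surviving rows
        "missing": sorted(set(requested) - set(ids)),     # requested IDs not in the matrix
    }

# Hypothetical example: one requested symbol is absent, one appears on two rows.
summary = summarize_filter(
    requested_ids=["KRAS", "TP53", "NOTINMATRIX"],
    matrix_ids=["KRAS", "TP53", "TP53", "GAPDH"],
)
print(summary)
```

In the notebook, the same helper could be called with `deg_genes` and `expr_df['HGNC_Symbol']`, or with `dma_probes` and `meth_df['Probe_ID']`.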
