## Overview
Microarray expression data is measured at the probe level, where multiple probes may map to the same gene or transcript.
In this notebook, we:
1. Load the Affymetrix GPL570 platform annotation
2. Map probe-level differential expression results to genes using Entrez Gene IDs
3. Resolve multiple probes per gene by averaging probe-level log2 fold changes per Entrez Gene ID
4. Generate a gene-level differential expression table suitable for downstream functional and pathway analyses

In [3]:
import pandas as pd
import numpy as np

#### Probe-to-Gene Mapping and Systems-Level Interpretation
Differential expression analysis in the previous notebook identified probe-level transcriptional changes between Duchenne muscular dystrophy (DMD) patient muscle samples and controls. However, Affymetrix microarray platforms measure expression at the probe level, and multiple probes may map to the same gene. To enable biologically meaningful interpretation and downstream pathway analysis, differentially expressed probes must therefore be mapped to gene-level identifiers and consolidated appropriately. This notebook translates probe-level differential expression results into gene-level signals, which can then be interpreted in the context of molecular pathways and disease mechanisms relevant to DMD.



##### Probe Annotation Strategy
Affymetrix probes can map to one or more genes, and a single gene may be represented by multiple probes. To ensure biologically interpretable results, probes are annotated with gene symbols and Entrez Gene IDs from the GPL570 platform annotation. When multiple probes map to the same Entrez Gene ID, their probe-level signals are consolidated into a single gene-level differential expression value.

In [21]:
# Load probe-level DE results from Notebook 2
de_dmd_ctrl = pd.read_csv(
    r"C:\Users\Lenovo\Desktop\UG Stream\Notebooks\02 - DE_DMD_vs_Control.csv",
    index_col=0
)

de_dmd_ctrl.index.name = "probe_id"

de_dmd_ctrl.head()


probe_id,mean_control,mean_dmd,log2FC
34471_at,4.320852,13.744169,9.423317
205940_at,5.229859,13.853884,8.624025
206717_at,5.43154,13.735486,8.303946
202311_s_at,4.035683,10.616064,6.580381
209875_s_at,2.515965,9.046366,6.530402


In [23]:
assert de_dmd_ctrl.index.str.contains("_at").any()

In [25]:
annot_df = pd.read_csv(
    r"C:\Users\Lenovo\Desktop\UG Stream\GPL570-55999 (1).txt",
    sep="\t",
    comment="#",
    low_memory=False
)

annot_df.head()

,ID,GB_ACC,SPOT_ID,Species Scientific Name,Annotation Date,Sequence Type,Sequence Source,Target Description,Representative Public ID,Gene Title,Gene Symbol,ENTREZ_GENE_ID,RefSeq Transcript ID,Gene Ontology Biological Process,Gene Ontology Cellular Component,Gene Ontology Molecular Function
0,1007_s_at,U48705,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Huma...,U48705,discoidin domain receptor tyrosine kinase 1 //...,DDR1 /// MIR4640,780 /// 100616237,NM_001202521 /// NM_001202522 /// NM_001202523...,0001558 // regulation of cell growth // inferr...,0005576 // extracellular region // inferred fr...,0000166 // nucleotide binding // inferred from...
1,1053_at,M87338,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,GenBank,M87338 /FEATURE= /DEFINITION=HUMA1SBU Human re...,M87338,"replication factor C (activator 1) 2, 40kDa",RFC2,5982,NM_001278791 /// NM_001278792 /// NM_001278793...,0000278 // mitotic cell cycle // traceable aut...,0005634 // nucleus // inferred from electronic...,0000166 // nucleotide binding // inferred from...
2,117_at,X51757,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,X51757 /FEATURE=cds /DEFINITION=HSP70B Human h...,X51757,heat shock 70kDa protein 6 (HSP70B'),HSPA6,3310,NM_002155,0000902 // cell morphogenesis // inferred from...,0005737 // cytoplasm // inferred from direct a...,0000166 // nucleotide binding // inferred from...
3,121_at,X69699,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,GenBank,X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens...,X69699,paired box 8,PAX8,7849,NM_003466 /// NM_013951 /// NM_013952 /// NM_0...,0001655 // urogenital system development // in...,0005634 // nucleus // inferred from direct ass...,0000979 // RNA polymerase II core promoter seq...
4,1255_g_at,L36861,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,L36861 /FEATURE=expanded_cds /DEFINITION=HUMGC...,L36861,guanylate cyclase activator 1A (retina),GUCA1A,2978,NM_000409 /// XM_006715073,0007165 // signal transduction // non-traceabl...,0001750 // photoreceptor outer segment // infe...,0005509 // calcium ion binding // inferred fro...


In [27]:
annot_df = annot_df[[
    "ID",
    "Gene Symbol",
    "ENTREZ_GENE_ID"
]].rename(columns={
    "ID": "probe_id",
    "Gene Symbol": "gene_symbol",
    "ENTREZ_GENE_ID": "entrez_id"
})

annot_df.head()

,probe_id,gene_symbol,entrez_id
0,1007_s_at,DDR1 /// MIR4640,780 /// 100616237
1,1053_at,RFC2,5982
2,117_at,HSPA6,3310
3,121_at,PAX8,7849
4,1255_g_at,GUCA1A,2978


In [29]:
annot_df.shape

(54675, 3)

In [31]:
annot_df["probe_id"].isna().sum()

0

In [33]:
annot_df["entrez_id"].isna().sum()

10541

Not all probes map to an Entrez Gene ID: 10,541 of the 54,675 probes on the platform have none. Probes without gene assignments are excluded from gene-level analyses.

In [36]:
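# Clean Entrez IDs: keep the primary (first-listed) mapping when multiple IDs exist.
# astype(str) turns missing values into the string "nan"; to_numeric below
# coerces that back to NaN, so unmapped probes stay missing.
annot_df["entrez_id"] = (
    annot_df["entrez_id"]
    .astype(str)
    .str.split(" /// ")
    .str[0]
)

# Convert to numeric (invalid entries become NaN)
annot_df["entrez_id"] = pd.to_numeric(annot_df["entrez_id"], errors="coerce")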


In [38]:
annot_df["entrez_id"].head(10)

0      780.0
1     5982.0
2     3310.0
3     7849.0
4     2978.0
5     7318.0
6     7067.0
7    11099.0
8     6352.0
9     1571.0
Name: entrez_id, dtype: float64

In [40]:
annot_df["entrez_id"].isna().sum()

10541

## Handling multiple gene mappings
Some Affymetrix probes map to multiple genes, represented in the platform annotation as multiple Entrez Gene IDs separated by `///`.
For downstream gene-level analysis, only the primary (first-listed) Entrez Gene ID is retained for each probe. Probes without an Entrez Gene ID assignment are excluded from gene-level aggregation.
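The sketch below illustrates the primary-ID rule on a few toy rows and additionally flags which probes were multi-mapped, so the effect of the rule can be inspected. The flag columns (`n_entrez_ids`, `multi_mapped`, `primary_entrez_id`) and the unmapped probe name are hypothetical additions for illustration and are not part of the notebook's pipeline.

```python
import pandas as pd

# Toy annotation rows in the GPL570 style; " /// " separates multiple mappings.
# The third probe ID is hypothetical and stands in for an unmapped probe.
toy_annot = pd.DataFrame({
    "probe_id": ["1007_s_at", "1053_at", "hypothetical_unmapped_at"],
    "entrez_id": ["780 /// 100616237", "5982", None],
})

id_lists = toy_annot["entrez_id"].str.split(" /// ")

# Number of Entrez IDs listed per probe (0 for unmapped probes).
toy_annot["n_entrez_ids"] = id_lists.str.len().fillna(0).astype(int)
toy_annot["multi_mapped"] = toy_annot["n_entrez_ids"] > 1

# Primary (first-listed) ID as a nullable integer; unmapped probes stay <NA>.
toy_annot["primary_entrez_id"] = pd.to_numeric(
    id_lists.str[0], errors="coerce"
).astype("Int64")

print(toy_annot)
```

Keeping a `multi_mapped` flag alongside the primary ID makes it possible to check later whether any top-ranked gene depends on an ambiguous probe.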

In [43]:
de_dmd_ctrl = de_dmd_ctrl.copy()
de_dmd_ctrl.index.name = "probe_id"
de_dmd_ctrl.head()

probe_id,mean_control,mean_dmd,log2FC
34471_at,4.320852,13.744169,9.423317
205940_at,5.229859,13.853884,8.624025
206717_at,5.43154,13.735486,8.303946
202311_s_at,4.035683,10.616064,6.580381
209875_s_at,2.515965,9.046366,6.530402


In [45]:
assert de_dmd_ctrl.index.str.contains("_at").any()

In [47]:
de_annot = (
    de_dmd_ctrl
    .reset_index()
    .merge(
        annot_df,
        on="probe_id",
        how="left"
    )
)

de_annot.head()

,probe_id,mean_control,mean_dmd,log2FC,gene_symbol,entrez_id
0,34471_at,4.320852,13.744169,9.423317,MYH8,4626.0
1,205940_at,5.229859,13.853884,8.624025,MYH3,4621.0
2,206717_at,5.43154,13.735486,8.303946,MYH8,4626.0
3,202311_s_at,4.035683,10.616064,6.580381,COL1A1,1277.0
4,209875_s_at,2.515965,9.046366,6.530402,SPP1,6696.0


In [49]:
# Shape check
print("DE rows (probes):", de_dmd_ctrl.shape[0])
print("Merged rows:", de_annot.shape[0])

DE rows (probes): 54675
Merged rows: 54675


In [51]:
# How many probes got an Entrez ID?
de_annot["entrez_id"].notna().sum()

44134

In [53]:
# How many unique probes survived?
de_annot["probe_id"].nunique()

54675

### Probe-Level Annotation
Probe-level differential expression results were merged with the Affymetrix GPL570 platform annotation on the probe identifier (the annotation's `ID` column, renamed to `probe_id`). This step associates each probe with its corresponding gene symbol and Entrez Gene ID where available. The left join keeps all 54,675 probes; probes without gene assignments are retained at this stage and excluded only during gene-level aggregation.
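As a minimal sketch of how this merge could be made self-checking, the toy example below uses two options of `DataFrame.merge`: `validate="one_to_one"`, which raises an error if a probe ID is duplicated on either side, and `indicator=True`, which records whether each row found an annotation match. The third probe and its log2FC value are hypothetical; the notebook itself relies on the row-count checks shown earlier.

```python
import pandas as pd

# Two real probe rows from the results above plus one hypothetical probe
# that is absent from the annotation table.
toy_de = pd.DataFrame({
    "probe_id": ["34471_at", "205940_at", "hypothetical_probe_at"],
    "log2FC": [9.423317, 8.624025, 0.1],
})
toy_annot = pd.DataFrame({
    "probe_id": ["34471_at", "205940_at"],
    "gene_symbol": ["MYH8", "MYH3"],
    "entrez_id": [4626, 4621],
})

merged = toy_de.merge(
    toy_annot,
    on="probe_id",
    how="left",             # keep every DE probe, annotated or not
    validate="one_to_one",  # fail loudly if probe IDs are duplicated
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

n_missing = int((merged["_merge"] == "left_only").sum())
print(merged)
print("Probes absent from the annotation:", n_missing)
```

Note that a probe can be present in the annotation (`_merge == "both"`) and still have a missing Entrez ID, so this check complements rather than replaces the `notna()` count.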

In [56]:
# Keep only probes with valid Entrez Gene IDs
de_annot_valid = de_annot.dropna(subset=["entrez_id"]).copy()

print("Probes before filtering:", de_annot.shape[0])
print("Probes after filtering:", de_annot_valid.shape[0])


Probes before filtering: 54675
Probes after filtering: 44134


In [58]:
de_annot_valid["entrez_id"] = de_annot_valid["entrez_id"].astype("Int64")

#### Filtering probes without gene assignments
Not all Affymetrix probes map to a known gene. Probes lacking Entrez Gene ID annotations were excluded prior to gene-level aggregation to ensure that downstream analyses operate on biologically interpretable units.

In [61]:
gene_de = (
    de_annot_valid
    .groupby("entrez_id")
    .agg(
        mean_log2FC=("log2FC", "mean"),
        n_probes=("probe_id", "count")
    )
    .sort_values("mean_log2FC", ascending=False)
)

gene_de.head()

entrez_id,mean_log2FC,n_probes
4626,8.863631,2
4621,8.624025,1
5320,6.528468,1
4636,6.368076,1
158763,5.804189,1


In [63]:
gene_de.shape

(21180, 2)

#### Gene-level aggregation strategy

Multiple Affymetrix probes may target the same gene due to alternative transcripts or probe redundancy.

To obtain a single gene-level estimate of differential expression, probe-level log2 fold changes were averaged (arithmetic mean) across all probes sharing an Entrez Gene ID.

The number of contributing probes per gene was retained to provide transparency regarding probe coverage.
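The toy example below reproduces the mean aggregation for the two MYH8 probes (Entrez Gene ID 4626) and the single MYH3 probe (4621) from the results above. It also adds per-gene minimum and maximum log2 fold changes and a `mixed_direction` flag; these extra columns are an assumption for illustration, not part of the notebook's output, and are one possible way to spot genes whose probes disagree in sign and are therefore poorly summarized by a mean.

```python
import pandas as pd

# Probe-level values taken from the merged table above.
toy = pd.DataFrame({
    "probe_id": ["34471_at", "206717_at", "205940_at"],
    "entrez_id": [4626, 4626, 4621],
    "log2FC": [9.423317, 8.303946, 8.624025],
})

gene_summary = (
    toy.groupby("entrez_id")
    .agg(
        mean_log2FC=("log2FC", "mean"),
        min_log2FC=("log2FC", "min"),    # illustrative extra column
        max_log2FC=("log2FC", "max"),    # illustrative extra column
        n_probes=("probe_id", "count"),
    )
    .sort_values("mean_log2FC", ascending=False)
)

# Flag genes whose probes point in opposite directions.
gene_summary["mixed_direction"] = (
    (gene_summary["min_log2FC"] < 0) & (gene_summary["max_log2FC"] > 0)
)

print(gene_summary)
```

For gene 4626 the mean of 9.423317 and 8.303946 matches the `mean_log2FC` of about 8.8636 reported in the gene-level table.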

In [66]:
# Most upregulated genes
gene_de.head(10)

entrez_id,mean_log2FC,n_probes
4626,8.863631,2
4621,8.624025,1
5320,6.528468,1
4636,6.368076,1
158763,5.804189,1
5816,5.747369,1
4969,5.123637,2
6192,4.888431,1
100506621,4.655324,1
55228,4.523857,1


In [68]:
# Most downregulated genes
gene_de.tail(10)


entrez_id,mean_log2FC,n_probes
165186,-4.014346,1
10647,-4.183059,1
10265,-4.218394,1
8549,-4.231001,2
340156,-4.30352,3
26628,-4.351001,1
9211,-4.485525,1
115019,-4.507204,1
64850,-4.524759,1
7306,-5.708823,1


In [70]:
gene_de["mean_log2FC"].describe()

count    21180.000000
mean         0.052062
std          0.648572
min         -5.708823
25%         -0.173561
50%         -0.000061
75%          0.192173
max          8.863631
Name: mean_log2FC, dtype: float64

In [80]:
gene_de.to_csv(
    r"C:\Users\Lenovo\Desktop\UG Stream\Notebooks\03 - Gene_Level_DE_Entrez.csv"
)
annot_df.to_csv(
    r"C:\Users\Lenovo\Desktop\UG Stream\Notebooks\03 - Probe_to_Gene_Annotation.csv",
    index=False
)
print("Saved")

Saved


### Notebook 03 - Summary and Next Steps

In this notebook, we transitioned from probe-level differential expression results to a biologically interpretable **gene-level differential expression profile** for Duchenne Muscular Dystrophy (DMD).

Specifically, we:
- Loaded the Affymetrix GPL570 platform annotation and extracted probe-to-gene mappings  
- Resolved probes mapping to multiple genes by retaining the primary Entrez Gene ID  
- Merged probe-level differential expression results with annotation using probe identifiers  
- Excluded probes without valid Entrez Gene ID assignments  
- Aggregated probe-level log2 fold changes to obtain **gene-level differential expression estimates**, while retaining probe counts per gene  

The final output of this notebook is a curated **gene-level differential expression table**, indexed by Entrez Gene ID, suitable for downstream functional analysis and target prioritization.

**Next steps** will involve leveraging this gene-level profile to perform higher-level biological interpretation, including pathway and functional enrichment analyses, disease stage-specific expression comparisons, and prioritization of candidate genes for therapeutic targeting and molecular docking studies.


#### Outputs Saved

This notebook generated and saved a gene-level differential expression table indexed by Entrez Gene ID, representing aggregated probe-level transcriptional changes in Duchenne Muscular Dystrophy. A cleaned probe-to-gene annotation table derived from the GPL570 platform was also saved to ensure reproducibility of downstream analyses.
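A downstream notebook would typically reload the gene-level table with Entrez Gene ID as the index. The sketch below shows that round trip using an in-memory buffer in place of the saved file; `gene_de_demo` and `buffer` are hypothetical names, and the two rows are copied from the top and bottom of the gene-level results above.

```python
import io

import pandas as pd

# Two rows from the gene-level table, indexed by Entrez Gene ID.
gene_de_demo = pd.DataFrame(
    {"mean_log2FC": [8.863631, -5.708823], "n_probes": [2, 1]},
    index=pd.Index([4626, 7306], name="entrez_id"),
)

# Write to an in-memory buffer (stand-in for the CSV file on disk).
buffer = io.StringIO()
gene_de_demo.to_csv(buffer)
buffer.seek(0)

# Reload with the Entrez Gene ID column restored as the index.
reloaded = pd.read_csv(buffer, index_col="entrez_id")

# Raises an AssertionError if values, dtypes, or index differ.
pd.testing.assert_frame_equal(reloaded, gene_de_demo)
print(reloaded)
```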
