# Gene ID mapping

![](./images/Module1/Gene_ID_Conversion.jpg)
## Learning Objectives:
1. Understanding different probe set IDs.
2. Mapping probe IDs to gene identifiers and symbols.

### Understanding different probe set IDs

Gene set or pathway analysis requires that the gene sets and the expression data use the same type of gene ID
(Entrez Gene ID, gene symbol, probe set ID, etc.). However, this is frequently not the case for the data we have.
For example, our gene sets mostly use Entrez Gene IDs, but microarray datasets are frequently labeled by probe
set IDs (or RefSeq transcript IDs, etc.). Therefore, we need to convert or map the probe set IDs to Entrez Gene IDs.
Here, we will use the `GSE48350` dataset from the previous section to demonstrate gene ID mapping.
To find out what kind of probe set IDs the dataset uses, we navigate to the GEO record page of `GSE48350`. Under the `Platform`
tab, we can find the probe ID information.
![probID](./images/Module1/ProbeID.png)

From the record page, we know that the dataset was generated on a single platform, the Affymetrix Human Genome U133 Plus 2.0 Array.
To convert or map the probe set IDs to Entrez Gene IDs, we need to find the corresponding annotation package from
<a href="https://bioconductor.org/">Bioconductor</a>. For this dataset, we need the `hgu133plus2.db` annotation database and the `AnnotationDbi` package, which provides the query interface.
We can install both packages using the following R commands:

In [28]:
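suppressMessages({
    # Install BiocManager from CRAN if it is not already available
    if (!require("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    # Install the annotation database and the query interface without updating other packages
    suppressWarnings(BiocManager::install("hgu133plus2.db", update = FALSE))
    suppressWarnings(BiocManager::install("AnnotationDbi", update = FALSE))
})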

In the data processing section, we have successfully downloaded the dataset `GSE48350` and saved it to the `data` sub-directory.
Now we can load the dataset and inspect the probe set IDs using the following commands.

In [29]:
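# Load the dataset saved in the data processing section
data <- readRDS("./data/GSE48350.rds")
expression_data <- data$expression_data
# The row names of the expression matrix are the probe set IDs
probeIDs <- rownames(expression_data)
head(probeIDs)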

A **Probe Set ID** is the identifier that refers to a set of probe pairs selected to represent expressed sequences on an array. The suffix designations are assigned when the array is designed.

Probe set names never change, but the suffix gives you an idea of what was known about the sequence at the time of design:

* **_at** = all the probes hit one known transcript.
* **_a** = all probes in the set hit alternate transcripts from the same gene.
* **_s** = all probes in the set hit transcripts from different genes.
* **_x** = some probes hit transcripts from different genes.

For HG-U133 arrays, the **_a** designation was not used; an **_s** probe set on these arrays has the same meaning that **_a** has on array designs that do use it.
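
To see which designations occur in a set of probe IDs, the suffix can be pulled out with a regular expression. The following is a minimal sketch in base R; `probe_suffix` is a hypothetical helper written for this illustration, and the example IDs are copied from the lookup table in this section. Note that `1255_g_at` carries a `_g` designation, which is not covered by the list above.

```r
# Example probe set IDs copied from the lookup table in this section
probe_ids <- c("1007_s_at", "1053_at", "117_at", "121_at", "1255_g_at", "1294_at")

# Hypothetical helper (not part of any package): return the designation of each probe ID
probe_suffix <- function(ids) {
  # IDs that do not end in "_at" stay NA
  suffix <- rep(NA_character_, length(ids))
  suffix[grepl("_at$", ids)] <- "_at"
  # A single lowercase letter before the final "_at" is the designation (e.g. "_s", "_x")
  has_letter <- grepl("_[a-z]_at$", ids)
  suffix[has_letter] <- sub("^.*(_[a-z])_at$", "\\1", ids[has_letter])
  suffix
}

suffixes <- probe_suffix(probe_ids)
# Tabulate how often each designation occurs
print(table(suffixes))
```

Applied to the full `probeIDs` vector, the same call gives a quick overview of how many probe sets may hit more than one gene.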

### Mapping probe IDs into gene identifiers and symbols

For pathway analysis, databases such as Gene Ontology and KEGG use gene symbol annotation. Therefore, probe set IDs should be mapped to gene symbols for use in the later modules. We can use the pre-built annotation databases available from Bioconductor to perform this mapping.

`AnnotationDbi` is an R package that provides an interface for connecting to and querying annotation databases stored in SQLite, while `hgu133plus2.db` is the annotation database that maps the probe IDs of this array to human gene symbols. We can use the following command to load both packages:

In [30]:
suppressMessages({
  library(hgu133plus2.db)
  library(AnnotationDbi)
})

Here, we can build a lookup table that maps the probe IDs to gene names and gene symbols using the following command:

In [31]:
suppressMessages({
  # keytype states explicitly that the keys are probe set IDs
  annotLookup <- AnnotationDbi::select(hgu133plus2.db, keys = probeIDs, keytype = 'PROBEID',
                                       columns = c('PROBEID', 'GENENAME', 'SYMBOL'))
})

In [32]:
head(annotLookup)

| | PROBEID | GENENAME | SYMBOL |
|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` |
| 1 | 1007_s_at | discoidin domain receptor tyrosine kinase 1 | DDR1 |
| 2 | 1053_at | replication factor C subunit 2 | RFC2 |
| 3 | 117_at | heat shock protein family A (Hsp70) member 6 | HSPA6 |
| 4 | 121_at | paired box 8 | PAX8 |
| 5 | 1255_g_at | guanylate cyclase activator 1A | GUCA1A |
| 6 | 1294_at | ubiquitin like modifier activating enzyme 7 | UBA7 |
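
Since the gene sets mentioned at the start of this section mostly use Entrez Gene IDs, the same annotation database can also return that identifier. The sketch below uses `AnnotationDbi::mapIds` with the `ENTREZID` column. The choice of `multiVals = "first"` is an assumption made here for simplicity and silently keeps only one gene when a probe maps to several. The block is guarded so that it is skipped when the annotation packages are not installed.

```r
# Probe IDs copied from the lookup table in this section
example_probes <- c("1007_s_at", "1053_at", "117_at")

# Run the query only when both packages are available
if (requireNamespace("hgu133plus2.db", quietly = TRUE) &&
    requireNamespace("AnnotationDbi", quietly = TRUE)) {
  entrez <- AnnotationDbi::mapIds(
    hgu133plus2.db::hgu133plus2.db,
    keys = example_probes,
    column = "ENTREZID",     # identifier type to return
    keytype = "PROBEID",     # identifier type of the keys
    multiVals = "first"      # assumption: keep the first match per probe
  )
  # Named character vector: names are probe IDs, values are Entrez Gene IDs
  print(entrez)
}
```

Adding `'ENTREZID'` to the `columns` argument of `select` is an alternative that keeps every match as a separate row.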


We can count the unique probe IDs and the unique gene symbols they map to using the following command:

In [33]:
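# Count unique probe IDs and unique gene symbols in the lookup table
# Note: unique() keeps NA, so unmapped probes (if any) add one to the symbol count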
print(paste0("There are ",length(unique(annotLookup$PROBEID))," probe IDs that mapped to ",length(unique(annotLookup$SYMBOL))," gene symbols"))

[1] "There are 54675 probe IDs that mapped to 22243 gene symbols"
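
Because `unique()` treats `NA` as a value, it can help to count mapped symbols and unmapped probes separately, and to check for probes that map to more than one symbol. The following sketch uses a small made-up table with the same column names as `annotLookup`; the probe IDs and symbols in it are hypothetical.

```r
# Made-up lookup table with the same column names as annotLookup
toy_lookup <- data.frame(
  PROBEID = c("a_at", "b_at", "b_at", "c_at", "d_at"),
  SYMBOL  = c("G1", "G1", "G2", NA, "G3"),
  stringsAsFactors = FALSE
)

n_probes          <- length(unique(toy_lookup$PROBEID))
n_symbols_incl_na <- length(unique(toy_lookup$SYMBOL))           # NA counted as a symbol
n_symbols         <- length(unique(na.omit(toy_lookup$SYMBOL)))  # NA excluded
n_unmapped        <- length(unique(toy_lookup$PROBEID[is.na(toy_lookup$SYMBOL)]))

# Probes that map to more than one distinct symbol
mapped <- unique(toy_lookup[!is.na(toy_lookup$SYMBOL), c("PROBEID", "SYMBOL")])
symbols_per_probe <- table(mapped$PROBEID)
multi_symbol_probes <- names(symbols_per_probe)[symbols_per_probe > 1]

print(c(probes = n_probes, symbols = n_symbols, unmapped = n_unmapped))
print(multi_symbol_probes)
```

Replacing `toy_lookup` with `annotLookup` applies the same checks to the real table.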


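From the lookup table, we can see that a single gene symbol can be mapped to multiple probe IDs, which is consistent with the table containing far fewer gene symbols (22243) than probe IDs (54675).
Depending on the analysis performed, users can choose which of the mapped probe IDs represents each gene of interest.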
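
One common convention for choosing a representative probe is to keep, for each gene symbol, the probe with the highest mean expression across samples. This is only one of several possible rules and is not prescribed by this module. The sketch below uses made-up expression values and a made-up lookup table so that it runs on its own.

```r
# Made-up expression matrix: 4 probes x 3 samples
toy_expr <- matrix(
  c(5.1, 5.3, 5.0,
    7.2, 7.0, 7.4,
    3.3, 3.1, 3.2,
    6.0, 6.1, 5.9),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("p1_at", "p2_s_at", "p3_at", "p4_at"), c("S1", "S2", "S3"))
)

# Made-up lookup table with the same column names as annotLookup
toy_lookup <- data.frame(
  PROBEID = c("p1_at", "p2_s_at", "p3_at", "p4_at"),
  SYMBOL  = c("GENE_A", "GENE_A", "GENE_B", NA),
  stringsAsFactors = FALSE
)

# Drop probes without a gene symbol
mapped <- toy_lookup[!is.na(toy_lookup$SYMBOL), ]
# Mean expression of each remaining probe across samples
mapped$mean_expr <- rowMeans(toy_expr[mapped$PROBEID, , drop = FALSE])
# Sort so that the highest-expressed probe comes first within each symbol
mapped <- mapped[order(mapped$SYMBOL, -mapped$mean_expr), ]
# Keep the first (highest-expressed) probe per symbol
best <- mapped[!duplicated(mapped$SYMBOL), ]

# Gene-level expression matrix with symbols as row names
gene_expr <- toy_expr[best$PROBEID, , drop = FALSE]
rownames(gene_expr) <- best$SYMBOL
print(gene_expr)
```

Other rules, such as keeping the probe with the highest variance or averaging all probes of a gene, follow the same pattern with a different summary in place of `rowMeans`.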
