# Load libraries

In [2]:
library(biomaRt)
library(org.Hs.eg.db)

Loading required package: AnnotationDbi

Loading required package: stats4

Loading required package: BiocGenerics


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
    tapply, union, unique, unsplit, which.max, which.min


Loading required package: Biobase

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


Loading required package: IRanges

Loading required package: S4Vectors



In [3]:
refseq_ids <- c("NR_075077","NM_001005337","NM_012102")

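Here I show two ways to convert RefSeq IDs to gene names, using `org.Hs.eg.db` and `biomaRt`. In this three-ID example, `org.Hs.eg.db` recovers all three RefSeq IDs, while the `biomaRt` query, which filters on `refseq_mrna`, recovers only the two `NM_` IDs and misses the non-coding `NR_` ID. This suggests that `org.Hs.eg.db` is the more convenient option for RefSeq IDs and `biomaRt` for Ensembl IDs, although three IDs are too few to settle the question.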

https://www.bioconductor.org/help/course-materials/2015/UseBioconductorFeb2015/A01.5_Annotation.html

# Using org.Hs.eg.db

In [4]:
keytypes(org.Hs.eg.db)

In [5]:
columns(org.Hs.eg.db)

In [7]:
cols <- c("SYMBOL", "GENENAME", "ENSEMBL")

select(org.Hs.eg.db, keys=refseq_ids, 
       columns=cols, keytype="REFSEQ")

'select()' returned 1:1 mapping between keys and columns



REFSEQ,SYMBOL,GENENAME,ENSEMBL
<chr>,<chr>,<chr>,<chr>
NR_075077,C1orf141,chromosome 1 open reading frame 141,ENSG00000203963
NM_001005337,PKP1,plakophilin 1,ENSG00000081277
NM_012102,RERE,arginine-glutamic acid dipeptide repeats,ENSG00000142599
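When only one column is needed, `AnnotationDbi::mapIds()` can be used instead of `select()`; it returns a named vector with one value per key. The sketch below is an illustration rather than part of the original analysis: the helper names `refseq_to_symbol` and `unmapped_ids` are hypothetical, and `NM_000000000` is a made-up ID used only to stand in for a key without a symbol.

```r
# Hypothetical helper (not in the original notebook): map RefSeq IDs to
# gene symbols, keeping the first symbol when a key has several matches.
# The default `db` is only evaluated when the function is called, so
# org.Hs.eg.db must be installed at call time.
refseq_to_symbol <- function(ids, db = org.Hs.eg.db::org.Hs.eg.db) {
  AnnotationDbi::mapIds(db, keys = ids, column = "SYMBOL",
                        keytype = "REFSEQ", multiVals = "first")
}

# Hypothetical helper: list the input IDs that came back without a symbol.
unmapped_ids <- function(symbols) names(symbols)[is.na(symbols)]

# Offline illustration using the symbols from the table above, plus a
# made-up ID ("NM_000000000") standing in for an unmapped key.
example_symbols <- c(NR_075077 = "C1orf141", NM_001005337 = "PKP1",
                     NM_012102 = "RERE", NM_000000000 = NA)
unmapped <- unmapped_ids(example_symbols)
print(unmapped)
```

With the packages loaded, `refseq_to_symbol(refseq_ids)` could replace the hard-coded `example_symbols` vector.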


# Using biomaRt

In [9]:
## Using biomaRt, assign gene IDs to RefSeq IDs
ensembl_mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))

In [10]:
## list the attributes available in this dataset
listAttributes(mart = ensembl_mart)

name,description,page
<chr>,<chr>,<chr>
ensembl_gene_id,Gene stable ID,feature_page
ensembl_gene_id_version,Gene stable ID version,feature_page
ensembl_transcript_id,Transcript stable ID,feature_page
ensembl_transcript_id_version,Transcript stable ID version,feature_page
ensembl_peptide_id,Protein stable ID,feature_page
ensembl_peptide_id_version,Protein stable ID version,feature_page
ensembl_exon_id,Exon stable ID,feature_page
description,Gene description,feature_page
chromosome_name,Chromosome/scaffold name,feature_page
start_position,Gene start (bp),feature_page


In [11]:
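## query with the Mart object created above (ensembl_mart);
## the refseq_mrna filter matches mRNA accessions only
ref_gn <- biomaRt::getBM(filters="refseq_mrna",
                         attributes=c("refseq_mrna","hgnc_symbol","ensembl_gene_id"),
                         values=refseq_ids, mart=ensembl_mart, useCache = FALSE)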

In [12]:
ref_gn

refseq_mrna,hgnc_symbol,ensembl_gene_id
<chr>,<chr>,<chr>
NM_001005337,PKP1,ENSG00000081277
NM_012102,RERE,ENSG00000142599
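`NR_075077` is missing from this result, most likely because the query filters on `refseq_mrna` while `NR_` accessions denote non-coding RNAs. One possible workaround, sketched below, is to group the IDs by prefix and query each group with its own filter. The filter name `refseq_ncrna` and both helper functions are assumptions for illustration, not part of the original notebook; check the filter names with `listFilters(ensembl_mart)` before relying on them.

```r
# Assumed prefix-to-filter mapping; verify both filter names with
# listFilters() on your Mart object.
prefix_to_filter <- c(NM = "refseq_mrna", NR = "refseq_ncrna")

# Hypothetical helper: group RefSeq IDs by the filter matching their
# two-letter prefix. IDs with an unlisted prefix are dropped by split().
group_by_filter <- function(ids) {
  split(ids, unname(prefix_to_filter[substr(ids, 1, 2)]))
}

# Hypothetical helper: run one getBM() query per filter and stack the
# results under a common "refseq_id" column. Needs biomaRt and network
# access when called.
query_refseq_by_prefix <- function(ids, mart) {
  ids_by_filter <- group_by_filter(ids)
  results <- lapply(names(ids_by_filter), function(f) {
    res <- biomaRt::getBM(filters = f,
                          attributes = c(f, "hgnc_symbol", "ensembl_gene_id"),
                          values = ids_by_filter[[f]],
                          mart = mart, useCache = FALSE)
    names(res)[names(res) == f] <- "refseq_id"
    res
  })
  do.call(rbind, results)
}

refseq_ids <- c("NR_075077", "NM_001005337", "NM_012102")
ids_by_filter <- group_by_filter(refseq_ids)
print(ids_by_filter)
```

Calling `query_refseq_by_prefix(refseq_ids, ensembl_mart)` would then issue one query per filter; whether the `NR_` ID is recovered depends on the Ensembl release in use.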


# Session information

In [13]:
sessionInfo()

R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so;  LAPACK version 3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Denver
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] org.Hs.eg.db_3.19.1  AnnotationDbi_1.66.0 IRanges_2.38.0      
[4] S4Vectors_0.42.0     Biobase_2.64.0       BiocGenerics_0.50.0 
[7] biomaRt_2.60.0      

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3          utf8_1.2.4       