# DataCamp: Introduction to Bioconductor in R

## What is Bioconductor?
In this chapter you will get hands-on with Bioconductor, the specialized repository for bioinformatics software, developed and maintained by the R community. You will learn how to install and use Bioconductor packages. You will also be introduced to S4 objects and functions, because most Bioconductor packages are built on the S4 object system. Finally, you will use a real genomic dataset of a fungus (yeast) to explore the BSgenome package.

### Bioconductor version

In [1]:
# Check R version
R.version

# Session info
sessionInfo()

               _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          4                           
minor          1.2                         
year           2021                        
month          11                          
day            01                          
svn rev        81115                       
language       R                           
version.string R version 4.1.2 (2021-11-01)
nickname       Bird Hippie                 

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22538)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] fansi_0.5.0     digest_0.6.27   utf8_1.2.2      crayon_1.4.1   
 [5] IRdisplay_1.1   repr_1.1.3      lifecycle_1.0.1 jsonlite_1.7.2 
 [9] evaluate_0.14   pillar_1.6.3    rlang_0.4.11    uuid_0.1-4     
[13] vctrs_0.3.8     ellipsis_0.3.2  IRkernel_1.3    tools_4.1.2    
[17] fastmap_1.1.0   compiler_4.1.2  base64enc_0.1-3 pbdZMQ_0.3-6   
[21] htmltools_0.5.2

### BiocManager to install packages
BSgenome is a Bioconductor data package that contains representations of several genomes. Because installing its dependencies usually takes some time, the package has already been installed for you with the following code:


In [3]:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("BSgenome")

'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cran.r-project.org


Bioconductor version 3.13 (BiocManager 1.30.16), R 4.1.2 (2021-11-01)

"package(s) not installed when version(s) same as current; use `force = TRUE` to
  re-install: 'BSgenome'"
Old packages: 'backports', 'brio', 'broom', 'cli', 'cpp11', 'crayon', 'DBI',
  'digest', 'dtplyr', 'fansi', 'fs', 'generics', 'glmnet', 'glue', 'jsonlite',
  'knitr', 'languageserver', 'mvtnorm', 'openssl', 'pillar', 'pkgload',
  'raster', 'Rcpp', 'readr', 'remotes', 'repr', 'rex', 'rJava', 'rjson',
  'rlang', 'rsconnect', 'rvest', 'sp', 'stringi', 'testthat', 'tibble',
  'tinytex', 'tzdb', 'uuid', 'vroom', 'withr', 'xfun', 'xgboost', 'xml2',
  'yaml', 'class', 'foreign', 'later', 'MASS', 'nlme', 'nnet', 'rpart',
  'spatial'
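
The output above shows that `BiocManager::install()` does not reinstall a package that is already current. The same check can be done up front. The sketch below is one possible way to do it; `ensure_bioc_package()` is a hypothetical helper name, not a BiocManager function, and it assumes that skipping updates of other packages is acceptable.

```r
# Hypothetical helper (not part of BiocManager): install a Bioconductor
# package only when it is missing, then report the installed version.
ensure_bioc_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
    }
    # update = FALSE and ask = FALSE leave other installed packages alone
    BiocManager::install(pkg, update = FALSE, ask = FALSE)
  }
  packageVersion(pkg)
}

# A package that is already installed is left untouched
ensure_bioc_package("stats")
```

Calling `ensure_bioc_package("BSgenome")` would follow the same path: nothing is installed if the package can already be loaded.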



In [3]:
# Load the BSgenome package
library("BSgenome")
# Check the version of the BSgenome package
packageVersion("BSgenome")

[1] '1.60.0'

### S4 class definition
We will use the class `BSgenome`, which is already loaded for you.

Let's check the formal definition of this class by using the function `showClass("className")`. Check the BSgenome class results and find its parent classes (Extends) and the classes that inherit from it (Subclasses).

In [8]:
showClass("BSgenome")

Class "BSgenome" [package "BSgenome"]

Slots:
                                                                     
Name:               pkgname     single_sequences   multiple_sequences
Class:            character OnDiskNamedSequences        RdaCollection
                                                                     
Name:               seqinfo        user_seqnames   injectSNPs_handler
Class:              Seqinfo            character    InjectSNPsHandler
                                                                     
Name:           .seqs_cache         .link_counts             metadata
Class:          environment          environment                 list

Extends: "Annotated"

Known Subclasses: "MaskedBSgenome"
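
To see where the Slots, Extends, and Known Subclasses entries come from, it can help to define a small S4 class by hand. The following is a minimal sketch in base R; `ToyAnnotated` and `ToyGenome` are hypothetical classes made up for illustration and are not part of BSgenome.

```r
library(methods)

# Hypothetical parent class with a single slot
setClass("ToyAnnotated", slots = c(metadata = "list"))

# Hypothetical child class: inherits the metadata slot and adds two more
setClass("ToyGenome",
         contains = "ToyAnnotated",
         slots = c(pkgname = "character", seqlengths = "integer"))

# Prints the slots and the "Extends" line, as for BSgenome above
showClass("ToyGenome")

# The parent now lists ToyGenome under "Known Subclasses"
showClass("ToyAnnotated")

toy <- new("ToyGenome",
           pkgname = "toy.pkg",
           seqlengths = c(chrA = 10L, chrB = 20L))

isS4(toy)                # TRUE
is(toy, "ToyAnnotated")  # TRUE, because ToyGenome extends ToyAnnotated
```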


### Interaction with classes
Let's say we have an object called `a_genome` of class `BSgenome`. With `a_genome`, you can ask questions like these:
```
# What is a_genome's main class?
class(a_genome)  # "BSgenome"

# What are a_genome's other classes?
is(a_genome)  # "BSgenome", "Annotated"

# Is a_genome an S4 representation?
isS4(a_genome)  # TRUE

```
If you want to find out more about the `a_genome` object, you can call `show(a_genome)` or use one of the more specific accessors listed by `.S4methods(class = "BSgenome")`.

In [9]:
.S4methods(class = "BSgenome")

 [1] $               [[              as.list         coerce         
 [5] commonName      countPWM        export          extractAt      
 [9] getSeq          injectSNPs      length          masknames      
[13] matchPWM        mseqnames       names           organism       
[17] provider        providerVersion releaseDate     releaseName    
[21] seqinfo         seqinfo<-       seqnames        seqnames<-     
[25] show            snpcount        snplocs         SNPlocs_pkgname
[29] sourceUrl       vcountPattern   Views           vmatchPattern  
[33] vcountPDict     vmatchPDict     bsgenomeName    metadata       
[37] metadata<-     
see '?methods' for accessing help and source code
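
Each entry in the list above is a generic function with a method for the class. How such accessors are attached to a class can be sketched in base R as follows. `MiniGenome` and `toyOrganism()` are hypothetical names used only for illustration; the two sequence lengths are taken from the yeast `seqinfo()` output in these notes.

```r
library(methods)

# MiniGenome is a hypothetical class used only for illustration
setClass("MiniGenome",
         slots = c(organism = "character", seqlengths = "integer"))

# A new generic plus a method gives the class an accessor
setGeneric("toyOrganism", function(x) standardGeneric("toyOrganism"))
setMethod("toyOrganism", "MiniGenome", function(x) x@organism)

# Existing generics such as length() and show() can get methods too
setMethod("length", "MiniGenome", function(x) length(x@seqlengths))
setMethod("show", "MiniGenome", function(object) {
  cat("MiniGenome for", object@organism,
      "with", length(object), "sequences\n")
})

g <- new("MiniGenome",
         organism = "Saccharomyces cerevisiae",
         seqlengths = c(chrI = 230218L, chrM = 85779L))

g               # auto-printing dispatches to the show() method
toyOrganism(g)  # accessor defined above
length(g)       # number of sequences stored in the object

# List the S4 methods available for the class
.S4methods(class = "MiniGenome")
```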

In [6]:
library('BSgenome.Scerevisiae.UCSC.sacCer3')
a_genome <- BSgenome.Scerevisiae.UCSC.sacCer3

In [7]:
# Investigate the a_genome using show()
show(a_genome)

# Investigate some other accessors
organism(a_genome)
provider(a_genome)
seqinfo(a_genome)

Yeast genome:
# organism: Saccharomyces cerevisiae (Yeast)
# genome: sacCer3
# provider: UCSC
# release date: April 2011
# 17 sequences:
#   chrI    chrII   chrIII  chrIV   chrV    chrVI   chrVII  chrVIII chrIX  
#   chrX    chrXI   chrXII  chrXIII chrXIV  chrXV   chrXVI  chrM           
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)


Seqinfo object with 17 sequences (1 circular) from sacCer3 genome:
  seqnames seqlengths isCircular  genome
  chrI         230218      FALSE sacCer3
  chrII        813184      FALSE sacCer3
  chrIII       316620      FALSE sacCer3
  chrIV       1531933      FALSE sacCer3
  chrV         576874      FALSE sacCer3
  ...             ...        ...     ...
  chrXIII      924431      FALSE sacCer3
  chrXIV       784333      FALSE sacCer3
  chrXV       1091291      FALSE sacCer3
  chrXVI       948066      FALSE sacCer3
  chrM          85779       TRUE sacCer3

In [4]:
# available genomes in BSgenome
available.genomes()

'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cran.r-project.org




In [5]:
## Install yeast genome
BiocManager::install('BSgenome.Scerevisiae.UCSC.sacCer3')

'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cran.r-project.org


Bioconductor version 3.13 (BiocManager 1.30.16), R 4.1.2 (2021-11-01)

Installing package(s) 'BSgenome.Scerevisiae.UCSC.sacCer3'

installing the source package 'BSgenome.Scerevisiae.UCSC.sacCer3'


Old packages: 'backports', 'brio', 'broom', 'cli', 'cpp11', 'crayon', 'DBI',
  'digest', 'dtplyr', 'fansi', 'fs', 'generics', 'glmnet', 'glue', 'jsonlite',
  'knitr', 'languageserver', 'mvtnorm', 'openssl', 'pillar', 'pkgload',
  'raster', 'Rcpp', 'readr', 'remotes', 'repr', 'rex', 'rJava', 'rjson',
  'rlang', 'rsconnect', 'rvest', 'sp', 'stringi', 'testthat', 'tibble',
  'tinytex', 'tzdb', 'uuid', 'vroom', 'withr', 'xfun', 'xgboost', 'xml2',
  'yaml', 'class', 'foreign', 'later', 'MASS', 'nlme', 'nnet', 'rpart',
  'spatial'



### Discovering the yeast genome
You'll begin to explore the yeast genome for yourself using the package `BSgenome.Scerevisiae.UCSC.sacCer3`, which is already installed for you.

As with other data in R, you can use `head()` and `tail()` to explore the yeastGenome object. You can also subset the genome by chromosome by using `$` syntax as follows: `object_name$chromosome_name`.

The names of the chromosomes can be obtained using the `names()` function, and `nchar()` can be used to count the number of characters in a sequence.

In [8]:
# Load the yeast genome
library(BSgenome.Scerevisiae.UCSC.sacCer3)

# Assign data to the yeastGenome object
yeastGenome <- BSgenome.Scerevisiae.UCSC.sacCer3
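
The exploration functions mentioned in this section can be tried as in the sketch below. It assumes the `BSgenome.Scerevisiae.UCSC.sacCer3` package is installed, and it applies `head()` and `tail()` to the vector of sequence lengths returned by `seqlengths()`, which is one reasonable reading of the text rather than the only one.

```r
library(BSgenome.Scerevisiae.UCSC.sacCer3)
yeastGenome <- BSgenome.Scerevisiae.UCSC.sacCer3

# Number of sequences and their names
length(yeastGenome)
names(yeastGenome)

# First and last few sequence lengths
head(seqlengths(yeastGenome))
tail(seqlengths(yeastGenome))

# Subset one chromosome with $ and count its characters
chrM <- yeastGenome$chrM
class(chrM)
nchar(chrM)
```

Going by the `seqinfo()` output earlier in these notes, the genome should report 17 sequences, and chrM should be 85779 bases long.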

### Partitioning the yeast genome
Genomes are often big, but interest usually lies in specific regions of them. Therefore, we need to subset a genome by extracting parts of it. To pick a sequence interval, use `getSeq()` and specify the name of the chromosome and the start and end of the sequence interval.

The following example will select the bases of "chrI" from 100 to 150.

`getSeq(yeastGenome, names = "chrI", start = 100, end = 150)`

Note: `names` is optional; if not specified, it will return all chromosomes. The parameters `start` and `end` are also optional and, if not specified, will take the default values 1 and the length of the sequence, respectively.
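
Because `start` and `end` are both inclusive, an interval from 100 to 150 holds 51 bases, not 50. The sketch below checks this and the default values; it assumes the yeast genome package is installed, and the variable names are illustrative.

```r
library(BSgenome.Scerevisiae.UCSC.sacCer3)
yeastGenome <- BSgenome.Scerevisiae.UCSC.sacCer3

# Bases 100 to 150 of chrI, both ends included
region <- getSeq(yeastGenome, names = "chrI", start = 100, end = 150)
length(region)  # end - start + 1

# Omitting start begins at position 1
first300 <- getSeq(yeastGenome, names = "chrM", end = 300)
length(first300)

# Omitting both start and end returns the whole chromosome
wholeM <- getSeq(yeastGenome, names = "chrM")
length(wholeM)
```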

In [10]:
# Load the yeast genome
library(BSgenome.Scerevisiae.UCSC.sacCer3)

# Assign data to the yeastGenome object
yeastGenome <- BSgenome.Scerevisiae.UCSC.sacCer3

# Get the first 300 bases of chrM
getSeq(yeastGenome, names = "chrM", end = 300)

300-letter DNAString object
seq: TTCATAATTAATTTTTTATATATATATTATATTATA...AATTAATTAA