# Week 02: Working with Biostrings

## [Biostrings](https://bioconductor.org/packages/release/bioc/html/Biostrings.html)
The Biostrings package contains classes and functions for representing biological strings such as DNA, RNA and amino acid sequences. In addition, the package provides pattern matching (useful for short read alignment) and a pairwise alignment function implementing Smith-Waterman local alignment and Needleman-Wunsch global alignment, the classic sequence alignment algorithms (see Durbin et al. 1998 for a description). There are also functions for reading and writing sequence files such as FASTA.

### Dependencies

In [2]:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Biostrings")



In [3]:
library(Biostrings)

### Representing sequences
There are two basic types of containers for representing strings. One container represents a single string (say a chromosome or a single short read) and the other container represents a set of strings (say a set of short reads). There are different classes intended to represent different types of sequences such as DNA or RNA sequences.

In [4]:
dna1 <- DNAString("ACGT-N")
dna1

6-letter DNAString object
seq: ACGT-N

In [5]:
DNAStringSet("ADE")

ERROR: Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start, : key 69 (char 'E') not in lookup table


In [6]:
dna2 <- DNAStringSet(c("ACGT", "GTCA", "GCTA"))
dna2

DNAStringSet object of length 3:
    width seq
[1]     4 ACGT
[2]     4 GTCA
[3]     4 GCTA

Note that the alphabet of a DNAString is an extended alphabet: `-` (for a gap, such as an insertion or deletion in an alignment) and N are allowed. In fact, all IUPAC codes are allowed. These are ambiguity codes, each standing for a set of possible bases (for example, the code “M” represents either an “A” or a “C”). A list of IUPAC codes can be obtained by

In [7]:
IUPAC_CODE_MAP
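
As a small illustration of how these alphabets can be inspected, the sketch below looks up a single ambiguity code and checks which letters a DNAString accepts. The variable name `m_bases` is illustrative only.

```r
library(Biostrings)

# IUPAC_CODE_MAP is a named character vector: the names are IUPAC codes,
# the values are the bases each code can stand for.
m_bases <- IUPAC_CODE_MAP[["M"]]
m_bases

# DNA_ALPHABET lists every letter a DNAString accepts,
# including the gap symbol "-".
"-" %in% DNA_ALPHABET

# "E" is not in the alphabet, which is why DNAStringSet("ADE") fails.
"E" %in% DNA_ALPHABET
```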

Indexing into a DNAString retrieves a subsequence (similar to the standard R function substr), whereas indexing into a DNAStringSet gives you a subset of sequences.

In [8]:
dna1[2:4]

3-letter DNAString object
seq: CGT

In [9]:
dna2[2:3]

DNAStringSet object of length 2:
    width seq
[1]     4 GTCA
[2]     4 GCTA

Note that `[[` allows you to get a single element of a DNAStringSet as a DNAString. This is very similar to `[` and `[[` for lists.

In [10]:
dna2[[2]] 

4-letter DNAString object
seq: GTCA

In [11]:
# DNAStringSet objects can have names, like ordinary vectors
names(dna2) <- paste0("seq", 1:3)
dna2

DNAStringSet object of length 3:
    width seq                                               names               
[1]     4 ACGT                                              seq1
[2]     4 GTCA                                              seq2
[3]     4 GCTA                                              seq3

**The full set of string classes is**

- `DNAString`[Set]: DNA sequences.
- `RNAString`[Set]: RNA sequences.
- `AAString`[Set]: Amino acid sequences (protein).
- `BString`[Set]: “Big” sequences, using any kind of letter.

In addition you will often see references to `XString`[Set] in the documentation. An `XString`[Set] is basically any of the above classes.

These classes seem very similar to standard `character()` vectors from base R, but there are important differences. The differences are mostly about efficiency when you deal with either (a) many sequences or (b) very long strings (think whole chromosomes).

### Basic functionality

Basic character functionality is supported, like

- `length`, `names`.
- `c` and `rev` (reverse the sequence).
- `width`, `nchar` (number of characters in each sequence).
- `==`, `duplicated`, `unique`.
- `as.character` or `toString`: converts to a base `character()` vector.
- `sort`, `order`.
- `chartr`: convert some letters into other letters.
- `subseq`, `subseq<-`, `extractAt`, `replaceAt`.
- `replaceLetterAt`.
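
A few of the functions listed above can be combined as in the following sketch. The object `seqs` is an illustrative stand-in with the same sequences as `dna2`, and the other variable names are likewise hypothetical.

```r
library(Biostrings)

# Hypothetical example data, mirroring the dna2 object used in this section.
seqs <- DNAStringSet(c(seq1 = "ACGT", seq2 = "GTCA", seq3 = "GCTA"))

# subseq() extracts the same range from every sequence in the set.
first_two <- subseq(seqs, start = 1, end = 2)

# as.character() converts back to a base R character vector.
as.character(first_two)

# c() concatenates sets; unique() then drops the repeated sequences.
combined <- c(seqs, seqs)
length(combined)
length(unique(combined))
```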

In [12]:
width(dna2)

In [13]:
sort(dna2)

DNAStringSet object of length 3:
    width seq                                               names               
[1]     4 ACGT                                              seq1
[2]     4 GCTA                                              seq3
[3]     4 GTCA                                              seq2

**Note that `rev` on a `DNAStringSet` just reverses the order of the elements, whereas `rev` on a `DNAString` actually reverses the string.**

In [15]:
rev(dna2)

DNAStringSet object of length 3:
    width seq                                               names               
[1]     4 GCTA                                              seq3
[2]     4 GTCA                                              seq2
[3]     4 ACGT                                              seq1

In [16]:
rev(dna1)

6-letter DNAString object
seq: N-TGCA

### Biological functionality

There are also functions which are related to the biological interpretation of the sequences, including

- `reverse`: reverse the sequence.
- `complement`, `reverseComplement`: (reverse) complement the sequence.
- `translate`: translate the DNA or RNA sequence into amino acids.

In [17]:
translate(dna2)

"in 'x[[1]]': last base was ignored"
"in 'x[[2]]': last base was ignored"
"in 'x[[3]]': last base was ignored"


AAStringSet object of length 3:
    width seq                                               names               
[1]     1 T                                                 seq1
[2]     1 V                                                 seq2
[3]     1 A                                                 seq3
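
The warnings above arise because each sequence has 4 bases, one more than a whole codon, so the trailing base is dropped. As a sketch of the warning-free case, the example below uses a made-up 9-base sequence (`cds` is a hypothetical name) whose length is a multiple of 3, and also checks that `reverseComplement` agrees with applying `complement` and then `reverse`.

```r
library(Biostrings)

# Hypothetical 9-base coding sequence: ATG (Met), GCC (Ala), TAA (stop).
cds <- DNAString("ATGGCCTAA")

# The length is a multiple of 3, so no base is ignored during translation.
protein <- translate(cds)
as.character(protein)

# reverseComplement() gives the same result as complement() then reverse().
rc <- reverseComplement(cds)
as.character(rc) == as.character(reverse(complement(cds)))
```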

In [18]:
reverseComplement(dna1)

6-letter DNAString object
seq: N-ACGT

### Counting letters

We very often want to count the letters in sequences in various ways. Examples include:

- Compute the GC content of a set of sequences.
- Compute the frequencies of dinucleotides in a set of sequences.
- Compute a position weight matrix from a set of aligned sequences.

There is a rich set of functions for doing this quickly.

- `alphabetFrequency`, `letterFrequency`: compute the frequency of all characters (`alphabetFrequency`) or only specific letters (`letterFrequency`).
- `dinucleotideFrequency`, `trinucleotideFrequency`, `oligonucleotideFrequency`: compute frequencies of dinucleotides (2 bases), trinucleotides (3 bases) and oligonucleotides (general number of bases).
- `letterFrequencyInSlidingView`: letter frequencies, but in sliding views along the string.
- `consensusMatrix`: consensus matrix; almost a position weight matrix.
  
Let’s look at some examples. Note how the output expands to a matrix when you use the functions on a DNAStringSet:

In [19]:
alphabetFrequency(dna1)

In [20]:
alphabetFrequency(dna2)

A,C,G,T,M,R,W,S,Y,K,V,H,D,B,N,-,+,.
1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [21]:
letterFrequency(dna2, "GC")

G|C
2
2
2
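
The counts above can be turned into GC content (a fraction per sequence) with the `as.prob` argument. The sketch below uses made-up sequences (`seqs` is a hypothetical name) so that the fractions differ between rows, and also shows dinucleotide counts for the same set.

```r
library(Biostrings)

# Hypothetical example sequences with different GC content.
seqs <- DNAStringSet(c("GGCC", "ATAT", "ACGT"))

# GC content per sequence: fraction of letters that are G or C.
gc <- letterFrequency(seqs, "GC", as.prob = TRUE)
gc[, 1]

# Dinucleotide counts: one row per sequence, one column per dinucleotide.
di <- dinucleotideFrequency(seqs)
di[, c("GG", "GC", "CC", "AT", "TA")]
```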


In [22]:
consensusMatrix(dna2, as.prob = TRUE)

0,1,2,3,4
A,0.3333333,0.0,0.0,0.6666667
C,0.0,0.6666667,0.3333333,0.0
G,0.6666667,0.0,0.3333333,0.0
T,0.0,0.3333333,0.3333333,0.3333333
M,0.0,0.0,0.0,0.0
R,0.0,0.0,0.0,0.0
W,0.0,0.0,0.0,0.0
S,0.0,0.0,0.0,0.0
Y,0.0,0.0,0.0,0.0
K,0.0,0.0,0.0,0.0
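
The matrix above has one row per letter of the extended DNA alphabet, most of them zero. As a sketch of a more compact view, the `baseOnly` argument collapses everything that is not A, C, G or T into a single row. The object `seqs` is an illustrative copy of the sequences in `dna2`.

```r
library(Biostrings)

# Same sequences as dna2, recreated so the example is self-contained.
seqs <- DNAStringSet(c("ACGT", "GTCA", "GCTA"))

# baseOnly = TRUE keeps rows for A, C, G, T and pools the rest as "other".
cm <- consensusMatrix(seqs, as.prob = TRUE, baseOnly = TRUE)
rownames(cm)

# Each column is a position; the probabilities in a column sum to 1.
colSums(cm)
```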


In [23]:
sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22543)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] Biostrings_2.60.2   GenomeInfoDb_1.28.4 XVector_0.32.0     
[4] IRanges_2.26.0      S4Vectors_0.30.2    BiocGenerics_0.38.0

loaded via a namespace (and not attached):
 [1] zlibbioc_1.38.0        uuid_0.1-4             rlang_0.4.12          
 [4] fastmap_1.1.0          fansi_1.0.2            tools_4.1.2           
 [7] utf8_1.2.2             htmltools_0.5.2        ellipsis_0.3.2        
[10] digest_0.6.27          lifecycle_1.0.1        crayon_1.4.2          
[13] GenomeInfoDbD