---
title: "5. Tips and Tricks"
output: rmarkdown::html_vignette
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Multiple restez paths
It is not advisable to download the entire GenBank database to your machine. It is also best to limit the size of any single database: databases that are too large will be slow to query and are more likely to cause memory issues. For example, a query may demand more memory than is available on your machine. One solution is to set multiple `restez` paths on your machine.
You can set up a separate path for each set of domains. Alternatively, you can download a single set of domains and then create additional databases from the same downloaded files using the `alt_restez_path` argument. Do also make use of `restez_path_unset` to disconnect and unset the `restez` path.
```{r, eval=FALSE}
# a larger database from the same download files in rodents_path
db_create(alt_restez_path = rodents_path, max_length = 2000)
```
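
The second approach can be wrapped in a small helper. The following is only a sketch: `build_second_db`, `downloads_path` and `second_path` are hypothetical names, both directories are assumed to exist already, and it assumes that a connection must be opened before `db_create()` is called. Nothing runs until the function is called.

```r
# Hypothetical helper: build a second database in `second_path` from the
# files already downloaded under `downloads_path`.
build_second_db <- function(downloads_path, second_path, max_length = 2000) {
  # Point restez at the new location; the downloaded files stay where they are
  restez::restez_path_set(filepath = second_path)
  # Disconnect and unset the path on exit, even if db_create() errors
  on.exit({
    restez::restez_disconnect()
    restez::restez_path_unset()
  })
  restez::restez_connect()
  # Read the download files from the other restez path
  restez::db_create(alt_restez_path = downloads_path, max_length = max_length)
  invisible(second_path)
}
```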
## Connecting and disconnecting
Always ensure you disconnect after connecting to a `restez` path. Not doing so may lead to strange database errors such as 'seg faults', or you may even be prevented from connecting to a database again until you restart R. In scripts, you should always place `restez_disconnect()` at the end of the script or at the point where you have stopped making queries. If you are making queries from your own custom function, you should use `on.exit`. This allows you to run 'clean up' code whenever a function exits, even if it errors.
```{r pathset, include=FALSE}
pkgwd <- sub(pattern = 'vignettes', replacement = '' , x = getwd())
rodents_path <- file.path(pkgwd, 'rodents')
```
```{r on-exit}
suppressMessages(library(restez))
random_definition <- function() {
  suppressMessages(restez_connect())
  on.exit(restez_disconnect())
  if (restez_ready()) {
    # deliberate mistake
    id <- sample(list_db_ids(n = NULL), 1)[[1]]
    return(gb_definition_get(id))
  }
}
restez_path_set(rodents_path)
(definition <- random_definition())
# not connected outside of function!
(restez_ready())
```
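
To see that `on.exit` really does run when a function errors, here is a minimal base R sketch that needs no database. The function `risky_query` and the `cleanup_log` vector are made up for illustration; the logged string stands in for a call to `restez_disconnect()`.

```r
# Records every time the clean-up code runs
cleanup_log <- character(0)

risky_query <- function(fail = FALSE) {
  # Stand-in for restez_disconnect(): runs on normal exit and on error
  on.exit(cleanup_log <<- c(cleanup_log, "disconnected"), add = TRUE)
  if (fail) {
    stop("query failed")
  }
  "result"
}

ok <- risky_query()
# Catch the error so the script continues; clean-up has still been run
err <- tryCatch(risky_query(fail = TRUE),
                error = function(e) conditionMessage(e))
# One entry per call, including the call that errored
length(cleanup_log)
```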
## Which domain?
The `db_download` function lists the GenBank domains that can be downloaded. You can work out which GenBank domain a sequence belongs to from the three-letter code towards the end of its `LOCUS` line. For example, the top of the record for this sequence indicates that it is in the rodent domain (`ROD`).
```
LOCUS LT548182 456 bp DNA linear ROD 23-NOV-2016
DEFINITION TPA_inf: Cavia porcellus GLNH gene for globin H.
ACCESSION LT548182
VERSION LT548182.1
```
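
If you have many records to check, the code can be pulled out programmatically. This is a minimal base R sketch, assuming the `LOCUS` line ends with the date and that the three-letter code is the field immediately before it, as in the record above.

```r
# LOCUS line copied from the record above
locus <- "LOCUS LT548182 456 bp DNA linear ROD 23-NOV-2016"

# Split on runs of whitespace; the last field is the date,
# the one before it is the three-letter domain code
fields <- strsplit(trimws(locus), "\\s+")[[1]]
domain_code <- fields[length(fields) - 1]
domain_code
```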
## Database performance and behaviour
The `restez` package database is built with [`MonetDBLite`](https://github.com/hannesmuehleisen/MonetDBLite-R).
If you encounter any errors that include the phrase "Server says", then an issue is
likely to have occurred within the database. Please raise such issues on
[GitHub](https://github.com/ropensci/restez/issues). But keep the following
factors in mind:
* Is your request from the database likely to return an object too large for
your computer's RAM? If the size of the database is 5 GB, then a request pulling
all of the sequence data and information into an R session is likely to be
around 5 GB as well.
* Are you building and storing the database on a separate USB drive? It has
been noted that database behaviour can be unusual on separate USB drives. When
reporting an issue, please provide information about your USB drive's format,
size and USB connections.
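
For the first point, a rough estimate of how much memory a request will need can be made before running it. The sketch below uses only base R and simulated sequences. The sequence length (500 bp) and the total number of records (`n_total`) are hypothetical values that you would replace with figures for your own database.

```r
set.seed(1)
# Simulate a small sample of 100 sequences, each 500 bp long
sample_seqs <- vapply(
  seq_len(100),
  function(i) paste(sample(c("A", "C", "G", "T"), 500, replace = TRUE),
                    collapse = ""),
  character(1)
)

# Average memory used per sequence once held in R
bytes_per_seq <- as.numeric(object.size(sample_seqs)) / length(sample_seqs)

# Hypothetical number of sequences a full request would return
n_total <- 1e6
estimated_gb <- bytes_per_seq * n_total / 1024^3
estimated_gb
```

If the estimate is close to, or larger than, the RAM available on your machine, consider requesting the sequences in smaller batches.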