Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow the use of a custom datafiles in db_load_* functions #72

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

arendsee
Copy link
Collaborator

Description

This commit adds a "path" keyword argument to the db_download_* files that allows the user to bypass the taxonomy database download step and use a custom local file.

Related Issue

Fixes #71, or at least provides a workaround.

Example

# This command will
#  1. download the NCBI taxonomy dump as "taxdmp.zip" using curl and a hard-coded URL
#  2. unzip the file and build the sqlite database
db_download_ncbi()

# This command bypasses the first download step. This allows
# the user to retrieve the data from a different source, modify it,
# or use a different tool to retrieve it (e.g., wget, rsync, or whatever).
db_download_ncbi(path="taxdmp.zip")

@arendsee
Copy link
Collaborator Author

The issue reported in #71 about curl timing out is happening in the tests here. It appears to be unrelated to my new code.

On my personal Linux machine this works:

> db_url <- 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip'
> db_path_file = "z.zip"
> curl::curl_download(db_url, db_path_file, quiet = TRUE)

This also works:

> taxizedb::db_download_ncbi()

Using libcurl 8.5.0, OpenSSL/3.2.0, R 4.3.2, and curl_5.2.0

On my server at work I encountered the same time out problem as the check here and @bergalu in #71.

Copy link
Collaborator

@stitam stitam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @arendsee for opening this PR. I thought about the issue some more and suggested an alternative solution that does not require an additional argument. Just a thought, I'm okay with your implementation as well.

If you want to go with your version, then I think overwrite rules should be more explicit for file.copy(). Let me know what you think.

# download data
mssg(verbose, 'downloading...')
curl::curl_download(db_url, db_path_file, quiet = TRUE)
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @arendsee for this. I am wondering if it is possible to implement a solution that does not require an extra argument. How do you feel about something like this?

if (!file.exists(db_path_file) | overwrite) {
  # download data
  mssg(verbose, 'downloading...')
  curl::curl_download(db_url, db_path_file, quiet = TRUE)
}

This approach is more complex for users because it requires them to copy the manually downloaded file to the right location and rename it (we could rename db_path_file in line 50 to taxdmp.zip to eliminate the need to rename). On the other hand it resolves the issue without changing/complicating the user interface. Maybe we could add a note/guide in the documentation for those who want to use their own, manually downloaded database. What do you think of this alternative?

mssg(verbose, 'downloading...')
curl::curl_download(db_url, db_path_file, quiet = TRUE)
} else {
file.copy(path, db_path_file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is already a taxdump.zip in the cache directory, then file.copy() will not do anything because the default behaviour of the function is overwrite = FALSE. However if we call db_download_ncbi() with overwrite = TRUE and a manually downloaded taxdump.zip, we might expect to overwrite the old cache file with the new, manually downloaded file. Maybe the overwrite argument should be called explicitly?

file.copy(path, db_path_file, overwrite = overwrite)

This way if db_download_ncbi() is called with overwrite = TRUE then the manually downloaded taxdump.zip will overwrite matching files in the cache directory.

@KaiAragaki KaiAragaki mentioned this pull request Mar 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

change db and cache path
2 participants