Is it possible to encrypt arrow files using the cyphr package? #50

Open
marianschmidt opened this issue Jun 9, 2022 · 2 comments


@marianschmidt

Hi, I have been experimenting with the cyphr package and have hit the memory limit with large .RData files. As an alternative, the arrow package supports partitioning large data when writing files. I tried to register a new method for arrow::write_dataset(), but wrapping the call in cyphr::encrypt() fails with a "Permission denied" error (cyphr's other built-in write functions work fine). A reprex with iris is below.

# packages
library(cyphr)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

# To do anything we first need a key:
key <- cyphr::key_sodium(sodium::keygen())

# Register new method for arrow::write_dataset()
cyphr::rewrite_register("arrow", "write_dataset", "path")
ls(cyphr:::db)
#>  [1] "arrow::write_dataset" "base::load"           "base::readLines"     
#>  [4] "base::readRDS"        "base::save"           "base::saveRDS"       
#>  [7] "base::writeLines"     "readxl::read_excel"   "readxl::read_xls"    
#> [10] "readxl::read_xlsx"    "utils::read.csv"      "utils::read.csv2"    
#> [13] "utils::read.delim"    "utils::read.delim2"   "utils::read.table"   
#> [16] "utils::write.csv"     "utils::write.csv2"    "utils::write.table"  
#> [19] "writexl::write_xlsx"

# Trying to encrypt with cyphr results in error message of denied permissions
cyphr::encrypt(write_dataset(iris, tempfile(), partitioning = c("Species")), 
               key)
#> Warning in file(con, "rb"): cannot open file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpKw7PXv\filed4c33d93cd0d4c2d2f10cf'
#> Permission denied
#> Error in file(con, "rb"): cannot open the connection
#> Warning in file.remove(paths[ok]):  cannot remove file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpKw7PXv\filed4c33d93cd0d4c2d2f10cf'
#> 'Permission denied'

Created on 2022-06-09 by the reprex package (v2.0.1)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2022-06-09
#>  pandoc   2.17.1.1 @ C:/Program Files/RStudio/bin/quarto/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  arrow       * 8.0.0   2022-05-09 [1] CRAN (R 4.2.0)
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.2.0)
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.2.0)
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
#>  cyphr       * 1.1.2   2021-05-17 [1] CRAN (R 4.2.0)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.2.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr         1.0.9   2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.2.0)
#>  knitr         1.39    2022-04-26 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.2.0)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.2.0)
#>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.14    2022-04-25 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  sodium        1.2.0   2021-10-21 [1] CRAN (R 4.2.0)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
#>  styler        1.7.0   2022-03-13 [1] CRAN (R 4.2.0)
#>  tibble        3.1.7   2022-05-03 [1] CRAN (R 4.2.0)
#>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.31    2022-05-10 [1] CRAN (R 4.2.0)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] C:/Users/ga27jar/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
@richfitz
Member

Hi @marianschmidt; the assumption made by cyphr::encrypt() (and decrypt()) is that each read/write operation will write exactly one file. It looks like with partitioning the arrow::write_dataset call is creating three files, one per partition, and that breaks the model. I don't think that this is easily worked around with the simple call-rewriting approach that cyphr uses, because the logic around partitioned reads and writes happens in compiled code in that package.
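
To make the single-file assumption concrete, here is a minimal sketch (not part of the original comment) that contrasts a call writing exactly one file with `arrow::write_dataset()`, whose `path` argument names a directory even when no partitioning is requested. Trying to open a directory as a binary file would plausibly explain the `file(con, "rb")` failure in the reprex, though that reading is an assumption and is not confirmed in this thread.

```r
library(cyphr)
library(arrow)

key <- cyphr::key_sodium(sodium::keygen())

# saveRDS() writes exactly one file, which matches the model cyphr expects
rds_path <- tempfile(fileext = ".rds")
cyphr::encrypt(saveRDS(iris, rds_path), key)
roundtrip <- cyphr::decrypt(readRDS(rds_path), key)

# write_dataset() treats `path` as a directory, even without partitioning
ds_path <- tempfile()
arrow::write_dataset(iris, ds_path)
path_is_dir <- dir.exists(ds_path)
n_files <- length(list.files(ds_path, recursive = TRUE))
```

Here `roundtrip` should match `iris`, while `ds_path` ends up as a directory containing the data file rather than being a file itself.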

Options here are:

  • you can skip the magic-but-friendly encrypt/decrypt wrapper and instead do the encryption and decryption of each partition yourself. Write out the partitioned file, loop over each directory and encrypt each file, deleting the cleartext files as you go. Reverse the process on read, making sure that you don't leave cleartext files floating around on error.
  • if it looks like this is a really common need, then we could look at supporting partitioned arrow files, but as we don't have the need for this ourselves at the moment it might sit on the backlog for a while.
  • we could try and fix whatever the "memory limit with large .RData files" issue is. Is that an issue on your machine (cannot allocate vector of XXX size) or an issue with cyphr (something to do with long vectors)? If you could work up a reprex or give us an idea of how big big is, that would be useful.
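
The first option could be sketched roughly as follows. This is an illustration under stated assumptions, not an API that cyphr provides: `write_dataset_encrypted()` and `read_dataset_encrypted()` are hypothetical helper names, the cleartext is staged in a temporary directory, and dplyr is assumed to be installed for `collect()`. The calls to `cyphr::encrypt_file()` and `cyphr::decrypt_file()` operate on one file at a time, which is the case cyphr is designed for.

```r
library(cyphr)
library(arrow)

key <- cyphr::key_sodium(sodium::keygen())

# Hypothetical helper: write the dataset in cleartext to a staging directory,
# encrypt each file into `dest_dir`, then delete the staging copy.
write_dataset_encrypted <- function(data, dest_dir, key, ...) {
  staging <- tempfile("cleartext_")
  # Remove the cleartext even if an error interrupts the loop
  on.exit(unlink(staging, recursive = TRUE), add = TRUE)

  arrow::write_dataset(data, staging, ...)
  files <- list.files(staging, recursive = TRUE)

  for (f in files) {
    dest <- file.path(dest_dir, f)
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    cyphr::encrypt_file(file.path(staging, f), key, dest = dest)
  }
  invisible(files)
}

# Hypothetical helper: decrypt every file into a temporary directory, read the
# dataset into memory, and remove the cleartext again on exit.
read_dataset_encrypted <- function(src_dir, key) {
  staging <- tempfile("cleartext_")
  on.exit(unlink(staging, recursive = TRUE), add = TRUE)

  for (f in list.files(src_dir, recursive = TRUE)) {
    dest <- file.path(staging, f)
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    cyphr::decrypt_file(file.path(src_dir, f), key, dest = dest)
  }
  # collect() pulls the data into memory before the cleartext is deleted
  as.data.frame(dplyr::collect(arrow::open_dataset(staging)))
}

enc_dir <- tempfile("encrypted_")
written <- write_dataset_encrypted(iris, enc_dir, key, partitioning = "Species")
restored <- read_dataset_encrypted(enc_dir, key)
```

The encrypted tree keeps the partition directory names (for example `Species=setosa`), so the partition values themselves are not hidden. Cleartext still touches disk briefly in the staging directory, so that directory should live somewhere that is not backed up or shared.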

@marianschmidt
Author

marianschmidt commented Jun 10, 2022

Hi @richfitz; Thanks a lot for your prompt reply and sharing possible solutions.

  1. Unfortunately, I think the problem is not limited to partitioned arrow files, since the unpartitioned case also fails (see reprex below).

  2. Possible solutions:

  • Write first, then encrypt: I would be a bit concerned about data security here. We work on NAS drives that are backed up automatically, and if a backup runs before the unencrypted file is deleted, the cleartext ends up in the backup.
  • Supporting arrow in cyphr: Of course, I would be a huge fan of this, as I see cyphr as currently the best way to work with encrypted files collaboratively in R, and the arrow data format has a growing fanbase in the R community. But I totally understand that implementing it might take a while.
  • Fix the memory limit issue: This seems to be a general issue with the way cyphr currently operates (it tries to write out the data as one object and so hits R's internal memory limit). I created a separate issue with a reprex for that: Memory limit error for encrypt() #51.
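
One way around the backup concern might be to never write cleartext to disk at all. The sketch below is an assumption-laden illustration, not something proposed in this thread: it swaps partitioned Parquet for one Arrow IPC stream per group, serializes each group to a raw vector with `arrow::write_to_raw()`, and writes only the output of `cyphr::encrypt_data()` to the destination. The `.arrow.enc` suffix, the `read_chunk()` helper, and the temporary `dest_dir` standing in for the NAS location are all hypothetical.

```r
library(cyphr)
library(arrow)

key <- cyphr::key_sodium(sodium::keygen())
dest_dir <- tempfile("encrypted_")  # stand-in for the NAS location
dir.create(dest_dir)

# Split by the grouping column so each chunk is serialized and encrypted
# separately; only ciphertext is ever written to `dest_dir`.
chunks <- split(iris, iris$Species)
for (name in names(chunks)) {
  raw_ipc <- arrow::write_to_raw(chunks[[name]], format = "stream")
  cyphr::encrypt_data(raw_ipc, key,
                      dest = file.path(dest_dir, paste0(name, ".arrow.enc")))
}

# Reading reverses the steps, again without a cleartext file on disk.
read_chunk <- function(path, key) {
  cipher <- readBin(path, what = "raw", n = file.size(path))
  as.data.frame(arrow::read_ipc_stream(cyphr::decrypt_data(cipher, key)))
}
restored <- do.call(rbind, lapply(list.files(dest_dir, full.names = TRUE),
                                  read_chunk, key = key))
```

Each chunk still has to fit in memory twice (once serialized, once encrypted), so this only helps when the data can be split into pieces that are small enough. The result is also not an arrow dataset that `open_dataset()` can read directly.
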
# packages
library(cyphr)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

# To do anything we first need a key:
key <- cyphr::key_sodium(sodium::keygen())

# Register new method for arrow::write_dataset()
cyphr::rewrite_register("arrow", "write_dataset", "path")
ls(cyphr:::db)
#>  [1] "arrow::write_dataset" "base::load"           "base::readLines"     
#>  [4] "base::readRDS"        "base::save"           "base::saveRDS"       
#>  [7] "base::writeLines"     "readxl::read_excel"   "readxl::read_xls"    
#> [10] "readxl::read_xlsx"    "utils::read.csv"      "utils::read.csv2"    
#> [13] "utils::read.delim"    "utils::read.delim2"   "utils::read.table"   
#> [16] "utils::write.csv"     "utils::write.csv2"    "utils::write.table"  
#> [19] "writexl::write_xlsx"

# arrow::write_dataset() without encryption is working 
# both for partitioned and unpartitioned parquet files
arrow::write_dataset(iris, "myfile_arrow_part", partitioning = c("Species"))
list.files("myfile_arrow_part", recursive = TRUE)
#> [1] "Species=setosa/part-0.parquet"     "Species=versicolor/part-0.parquet"
#> [3] "Species=virginica/part-0.parquet"
arrow::write_dataset(iris, "myfile_arrow")
list.files("myfile_arrow")
#> [1] "part-0.parquet"

# Trying to encrypt with cyphr results in error message of denied permissions
cyphr::encrypt(write_dataset(iris, "myfile_encrypt_part", partitioning = c("Species")), 
               key)
#> Warning in file(con, "rb"): cannot open file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpyED7sT\myfile_encrypt_part20f83dde1b0f':
#> Permission denied
#> Error in file(con, "rb"): cannot open the connection
#> Warning in file.remove(paths[ok]): cannot remove file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpyED7sT\myfile_encrypt_part20f83dde1b0f',
#> reason 'Permission denied'

# The problem persists when writing small data without partitioning
cyphr::encrypt(write_dataset(iris, "myfile_encrypt"),
               key)
#> Warning in file(con, "rb"): cannot open file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpyED7sT\myfile_encrypt20f844c072c':
#> Permission denied
#> Error in file(con, "rb"): cannot open the connection
#> Warning in file.remove(paths[ok]): cannot remove file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpyED7sT\myfile_encrypt20f844c072c',
#> reason 'Permission denied'

Created on 2022-06-10 by the reprex package (v2.0.1)

Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  German_Germany.1252
#>  ctype    German_Germany.1252
#>  tz       Europe/Berlin
#>  date     2022-06-10
#>  pandoc   2.17.1.1 @ C:/Program Files/RStudio/bin/quarto/bin/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date (UTC) lib source
#>  arrow       * 8.0.0   2022-05-09 [1] CRAN (R 4.1.3)
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.2)
#>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.1.2)
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.1.2)
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.1.3)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.1.3)
#>  cyphr       * 1.1.2   2021-05-17 [1] CRAN (R 4.1.2)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.1.2)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
#>  dplyr         1.0.9   2022-04-28 [1] CRAN (R 4.1.3)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.2)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.1.2)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.1.3)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.2)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
#>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.1.2)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.1.2)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.2)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.2)
#>  knitr         1.39    2022-04-26 [1] CRAN (R 4.1.3)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.2)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.2)
#>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.1.3)
#>  rmarkdown     2.14    2022-04-25 [1] CRAN (R 4.1.3)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.2)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>  sodium        1.2.0   2021-10-21 [1] CRAN (R 4.1.2)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.2)
#>  tibble        3.1.7   2022-05-03 [1] CRAN (R 4.1.3)
#>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.1.2)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.1.3)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.1.3)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
#>  xfun          0.31    2022-05-10 [1] CRAN (R 4.1.3)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.1.2)
#> 
#>  [1] C:/Users/ga27jar/Documents/R/win-library/4.1
#>  [2] C:/Program Files/R/R-4.1.3/library
#> 
#> ------------------------------------------------------------------------------
