Using parquet parser seems to fail #854

Closed
Giqles opened this issue Mar 30, 2022 · 5 comments

@Giqles

Giqles commented Mar 30, 2022

System details

Output of sessioninfo::session_info():

R version 4.1.3 (2022-03-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plumber_1.1.0.9000 grf_2.1.0          data.table_1.14.2  arrow_7.0.0       
[5] qs_0.25.3          jsonlite_1.8.0    

loaded via a namespace (and not attached):
 [1] stringfish_0.15.5   Rcpp_1.0.8.3        magrittr_2.0.2     
 [4] webutils_1.1        tidyselect_1.1.2    bit_4.0.4          
 [7] RApiSerialize_0.1.0 lattice_0.20-45     R6_2.5.1           
[10] rlang_1.0.2         tools_4.1.3         grid_4.1.3         
[13] bit64_4.0.5         RcppParallel_5.1.5  assertthat_0.2.1   
[16] lifecycle_1.0.1     Matrix_1.4-0        purrr_0.3.4        
[19] later_1.3.0         vctrs_0.3.8         promises_1.2.0.1   
[22] glue_1.6.2          stringi_1.7.6       swagger_3.33.1     
[25] compiler_4.1.3

Example application or steps to reproduce the problem

This is trying to set up a container to use with SageMaker batch transforms. Based on the SageMaker examples, the payload will be sent as a POST request like:

curl -v -X POST --data-binary @payload.snappy.parquet -H "Content-Type: application/vnd.apache.parquet" http://localhost:8080/invocations --output payload.snappy.parquet.out

Then the plumber.R file looks something like this.

#' Parse input and return prediction from model
#' @post /invocations
#' @parser parquet
#' @serializer parquet
function(req) {
    require(data.table)
    dt <- as.data.table(req$argsBody)
    print(dt)
}
# <simpleError in rawToChar(value): embedded nul in string: '...

Describe the problem in detail

Based on #661, I was expecting the above to print a data.table object with the same structure as the parquet file, which has 5 rows and 12 columns and opens fine locally with pandas:

>>> import pandas as pd
>>> df = pd.read_parquet('payload.snappy.parquet')
>>> df.head()
[....]
[5 rows x 12 columns]

I looked at following some other tips on parsers for other content types, but kept running into a version of this embedded nul in string error when trying to process the binary data. Lots of those examples are based on sending data using forms, which I don't think I can use here.
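For context, the error message itself can be reproduced without plumber. The sketch below is only an illustration, not the code path plumber actually takes: it assumes that something calls rawToChar() on a binary body. The `payload` bytes are made up for the example; they start with the Parquet magic bytes "PAR1" followed by a nul byte.

```r
# Hypothetical stand-in for a binary request body: the Parquet magic
# bytes "PAR1" followed by a few arbitrary bytes, including a nul (0x00).
payload <- as.raw(c(0x50, 0x41, 0x52, 0x31, 0x00, 0x15, 0x04))

# rawToChar() cannot represent a nul inside an R string, so it errors;
# capture the error message instead of stopping.
msg <- tryCatch(
  rawToChar(payload),
  error = function(e) conditionMessage(e)
)
print(msg)

# The bytes before the nul convert without trouble.
magic <- rawToChar(payload[1:4])
print(magic)
```

This suggests the body is being treated as text somewhere before it reaches a Parquet reader, although where that happens in this setup is not confirmed.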

@meztez
Collaborator

meztez commented Mar 30, 2022

What do you get if you read the file directly in R with arrow::read_parquet?

dt <- arrow::read_parquet('payload.snappy.parquet')

@Giqles
Author

Giqles commented Mar 30, 2022

That works as expected:

> dt <- arrow::read_parquet('/opt/ml/payload.snappy.parquet')
> class(dt)
# [1] "tbl_df"     "tbl"        "data.frame"

And then setDT etc. work as you'd expect.

@Giqles
Author

Giqles commented Mar 30, 2022

After some digging I think I found a fix. This seems to be a bug(?) in writeBin, or in the C code it references. If I change the app to use readr::write_file to save the raw content to a file, I don't get the issue and the file reads back happily. I'm guessing you don't want to introduce a general dependency on readr, but it would probably make sense to add it for the parquet parser at least. I can make a PR if that'd be useful.

parser_pq <- function(...) {
    function(value, ...) {
        tmp <- tempfile()
        on.exit({
            if (file.exists(tmp)) file.remove(tmp)
        }, add = TRUE)
        readr::write_file(value, tmp)
        arrow::read_parquet(tmp)
    }
}
register_parser("pq", parser_pq, fixed = "application/parquet")

#' Parse input and return prediction from model
#' @post /invocations
#' @parser pq
function(req) {
    setDT(req$argsBody)
    print(req$argsBody)
}

@schloerke
Collaborator

> I can make a PR

Yes, please. Thank you! (Related PR #849)

I have not seen this type of error before with writeBin(). (I've seen errors about encodings but not with raw data.) It'll be good to ask the team about the differences between the two methods.
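One way to check whether writeBin() changes the bytes is a round trip through a temporary file. This is a minimal sketch using base R only; `payload` is a made-up raw vector standing in for the request body, not the actual Parquet file from this issue.

```r
# Hypothetical binary payload containing nul (0x00) and high (0xff) bytes.
payload <- as.raw(c(0x50, 0x41, 0x52, 0x31, 0x00, 0xff, 0x00,
                    0x50, 0x41, 0x52, 0x31))

# Write the raw vector to a temporary file with base R.
tmp <- tempfile(fileext = ".parquet")
writeBin(payload, tmp)

# Read back exactly as many bytes as the file holds.
back <- readBin(tmp, what = "raw", n = file.size(tmp))

# TRUE means the bytes survived the round trip unchanged.
round_trip_ok <- identical(payload, back)
print(round_trip_ok)

# Clean up the temporary file.
unlink(tmp)
```

Running the same comparison on the real request body would show whether the write step is where the two approaches differ, or whether the difference lies elsewhere.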

I am ok with {readr} being required for {arrow} based parsers. I hope we can stay away from making {readr} a dependency for all file-based parsers.

Giqles mentioned this issue Jun 2, 2022

@Giqles
Author

Giqles commented Jun 2, 2022

I can't reproduce this consistently; closing.

Giqles closed this as completed Jun 2, 2022