'docs_bulk'-functionality for ingest attachment-plugin ('pipeline_attachment') ? #253

Closed
Aeilert opened this issue May 7, 2019 · 6 comments

Aeilert commented May 7, 2019

I have a question about pushing documents in bulk with the ingest attachment-plugin. This used to work by setting an additional parameter, query = 'pipeline=attachment', in docs_bulk (tested with version 0.8.4), but no longer seems to work with the current version of the package.

When using docs_bulk with a pipeline like the one below, the data is pushed through to Elasticsearch but bypasses the plugin. The result is an index containing a base64-encoded data-field rather than the list of fulltext-fields you would expect.

This does not work
# Create ingest attachment pipeline 
body.pipeline <- '{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "target_field": "fulltext",
        "indexed_chars" : -1,
        "on_failure" : [
          {
            "set" : {
              "field" : "error",
              "value" : "{{ _ingest.on_failure_message }}"
            }
          }
        ]
      }
    },
    {
      "remove": {
        "field": "data"
      }
    }
  ]
}'
pipeline_create(es.con, id = "attachment", body = body.pipeline)
# Create test-index
index_create(es.con, index = "myindex")

# List of base64-encoded documents w/ some metadata 
docs <- list(
  list(data = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
       category = "lorem ipsum"),
  list(data = "aGVsbG8gd29ybGQgaGVsbG8gd29ybGQ=",
       category = "hello world")
)
# Push documents to Elastic 
docs_bulk(conn = es.con, x = docs, index = "myindex", type = '_doc', 
          doc_ids = 1:2, es_ids = FALSE, query = 'pipeline=attachment')
# Data was not pushed correctly 
Search(es.con,"myindex")
...
$hits$hits
$hits$hits[[1]]
$hits$hits[[1]]$`_index`
[1] "myindex"

$hits$hits[[1]]$`_type`
[1] "_doc"

$hits$hits[[1]]$`_id`
[1] "1"

$hits$hits[[1]]$`_score`
[1] 1

$hits$hits[[1]]$`_source`
$hits$hits[[1]]$`_source`$data
[1] "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

$hits$hits[[1]]$`_source`$category
[1] "lorem ipsum"
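The same check can be done programmatically instead of by reading the printed list. This is only a minimal sketch, assuming the result has the `$hits$hits[[i]]$_source` shape printed above. `unprocessed_ids` is a hypothetical helper, not part of the elastic package, and `res` is a hand-built stand-in for a Search() result.

```r
# Hypothetical helper (not part of the elastic package): return the _id of
# every hit whose _source still has the raw `data` field and no `fulltext`
# field, i.e. documents the attachment pipeline did not touch.
unprocessed_ids <- function(res, raw_field = "data", target_field = "fulltext") {
  hits <- res$hits$hits
  skipped <- vapply(hits, function(h) {
    src <- names(h[["_source"]])
    raw_field %in% src && !(target_field %in% src)
  }, logical(1))
  vapply(hits[skipped], function(h) as.character(h[["_id"]]), character(1))
}

# Hand-built stand-in for a Search() result: hit 1 was not processed,
# hit 2 shows what a processed document would look like.
res <- list(hits = list(hits = list(
  list(`_id` = "1",
       `_source` = list(
         data = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
         category = "lorem ipsum")),
  list(`_id` = "2",
       `_source` = list(
         fulltext = list(content = "hello world hello world"),
         category = "hello world"))
)))

unprocessed_ids(res)
```

An empty result would mean every returned hit went through the pipeline.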

I could of course use pipeline_attachment, but I have several thousand files and want to take advantage of the bulk API. Maybe this could be solved with a docs_bulk wrapper for pipeline_attachment? Or just a parameter to add 'pipeline=attachment' to the POST-statement (not sure why passing the query option to crul doesn't work)?
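For reference, here is roughly what such a bulk request consists of: a newline-delimited body of action/source pairs, plus the pipeline name as a query-string parameter on the URL. This is a sketch, not the package's implementation. `bulk_body` is a hypothetical helper, the host `localhost:9200` is an assumed address, and nothing is actually sent.

```r
library(jsonlite)

# Same two documents as in the report above
docs <- list(
  list(data = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
       category = "lorem ipsum"),
  list(data = "aGVsbG8gd29ybGQgaGVsbG8gd29ybGQ=",
       category = "hello world")
)

# Hypothetical helper: build the newline-delimited body for the _bulk endpoint.
# Each document contributes an action line followed by a source line.
bulk_body <- function(docs, index, ids) {
  stopifnot(length(docs) == length(ids))
  lines <- unlist(Map(function(doc, id) {
    action <- list(index = list(`_index` = index, `_id` = as.character(id)))
    c(as.character(toJSON(action, auto_unbox = TRUE)),
      as.character(toJSON(doc, auto_unbox = TRUE)))
  }, docs, ids), use.names = FALSE)
  # The bulk API expects the body to end with a newline
  paste0(paste(lines, collapse = "\n"), "\n")
}

body <- bulk_body(docs, index = "myindex", ids = 1:2)

# The pipeline is selected via the query string; the host is an assumption
url <- paste0("http://localhost:9200", "/_bulk", "?pipeline=attachment")

cat(url, body, sep = "\n")
```

Without the `?pipeline=attachment` part the same body is indexed as-is, which matches the base64 `data` field seen in the first search result.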

As an example of the functionality I'm looking for I created a simple wrapper-function for pipeline_attachment. I'm not saying this should be the solution. It's just to illustrate the functionality.

This does work
DocsBulkAttachment <- function (conn, x, index = NULL, type = NULL, chunk_size = 1000, 
            doc_ids = NULL, es_ids = TRUE, raw = FALSE, quiet = FALSE, pipeline = 'attachment',
            sleep = 1, ...) {
    elastic:::is_conn(conn)
    elastic:::assert(quiet, "logical")
    if (is.null(index)) {
      stop("index can't be NULL when passing a list", call. = FALSE)
    }
    if (is.null(type)) 
      type <- "_doc" #index
    elastic:::check_doc_ids(x, doc_ids)
    if (is.factor(doc_ids)) 
      doc_ids <- as.character(doc_ids)
    x <- unname(x)
    x <- elastic:::check_named_vectors(x)
    rws <- seq_len(length(x))
    data_chks <- split(rws, ceiling(seq_along(rws)/chunk_size))
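    if (is.null(doc_ids)) {
      # id_chks is used for every document below, so fail early without ids
      stop("doc_ids can't be NULL in this wrapper", call. = FALSE)
    }
    id_chks <- split(doc_ids, ceiling(seq_along(doc_ids)/chunk_size))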

    resl <- vector(mode = "list", length = length(data_chks))
    for (i in seq_along(data_chks)) {
      if (!quiet) {
        pb <- txtProgressBar(min = 0, max = length(data_chks[[i]]), 
                             initial = 0, style = 3)
        on.exit(close(pb))
      }
      
      resl2 <- vector(mode = "list", length = length(data_chks[[i]]))
      for(y in seq_along(data_chks[[i]])){
        resl2[[y]] <- pipeline_attachment(conn, index = index, type = type, pipeline = pipeline, 
                            body = x[data_chks[[i]]][[y]], id = id_chks[[i]][y])    
        if (!quiet)
          setTxtProgressBar(pb, y)
      }
      resl[[i]] <- resl2
      Sys.sleep(sleep)
    }
    return(resl)
}
index_create(es.con, index = "myindex2")
DocsBulkAttachment(es.con, index = "myindex2", x = docs, type = '_doc', 
                   doc_ids = 1:2, pipeline = "attachment")
# Data was pushed correctly 
Search(es.con,"myindex2")
...
$hits$hits
$hits$hits[[1]]
$hits$hits[[1]]$`_index`
[1] "myindex2"

$hits$hits[[1]]$`_type`
[1] "_doc"

$hits$hits[[1]]$`_id`
[1] "1"

$hits$hits[[1]]$`_score`
[1] 1

$hits$hits[[1]]$`_source`
$hits$hits[[1]]$`_source`$fulltext
$hits$hits[[1]]$`_source`$fulltext$content_type
[1] "application/rtf"

$hits$hits[[1]]$`_source`$fulltext$language
[1] "ro"

$hits$hits[[1]]$`_source`$fulltext$content
[1] "Lorem ipsum dolor sit amet"

$hits$hits[[1]]$`_source`$fulltext$content_length
[1] 28


$hits$hits[[1]]$`_source`$category
[1] "lorem ipsum"

I'm using R 3.5.3 and Elasticsearch 7.0.0 w/ Docker. I have installed the ingest-attachment plugin. See below for other session info.

Session Info
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] elastic_1.0.0.9100

loaded via a namespace (and not attached):
 [1] compiler_3.5.3  R6_2.4.0        tools_3.5.3     httpcode_0.2.0  curl_3.3        Rcpp_1.0.1      urltools_1.7.3  triebeard_0.3.0 crul_0.7.4     
[10] jsonlite_1.6   
Dockerfile
FROM docker.elastic.co/elasticsearch/elasticsearch:7.0.0

RUN bin/elasticsearch-plugin install --batch ingest-attachment

COPY config/. ./config/

sckott (Contributor) commented May 7, 2019

thanks for the detailed report, i'll take a look soon

Aeilert (Author) commented May 8, 2019

Great.

sckott (Contributor) commented May 15, 2019

So I make sure I understand:

  • The second return example with Search(es.con,"myindex2") is what you want back?
  • You said query = 'pipeline=attachment' in the docs_bulk call used to work. Do you know whether it was sent as a query parameter in the HTTP request? It's hard to say how it used to be handled. I changed to a different HTTP client a while back, and right now any additional parameters passed via ... go only to curl options, so a parameter named query would not do anything
  • I do get the same result with your DocsBulkAttachment function
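The mechanism described in the second point can be illustrated with plain R. The functions below are hypothetical stand-ins, not the elastic internals, and they do not reconstruct either version of the package. They only show how a named argument in `...` can reach the URL under one forwarding scheme and be silently ignored under another.

```r
# Hypothetical stand-ins, not the elastic internals.
build_url <- function(path, query = NULL) {
  if (is.null(query)) path else paste0(path, "?", query)
}

# Forwards everything in `...` to the URL builder, so `query` takes effect.
old_style <- function(path, ...) build_url(path, ...)

# Treats `...` purely as curl options; a `query` entry lands among those
# options and never touches the URL.
new_style <- function(path, ...) {
  curl_opts <- list(...)
  list(url = build_url(path), opts = curl_opts)
}

old_style("_bulk", query = "pipeline=attachment")
new_style("_bulk", query = "pipeline=attachment")$url
```

The second call raises no error, which is consistent with the report: the documents are indexed, just without the pipeline.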

Aeilert (Author) commented May 16, 2019

To answer your questions:

  • Yes, it is the result of the second example I'm looking for. (Where the ingest pipeline converts the base64-encoded data-field to a target-field in ES).
  • Yes, I think it was used as an additional parameter to the underlying HTTP request. Similar to this workaround: ingest API #191. Was it the httr-package back then? I did notice the switch to crul, but I'm not so familiar with this package. You could probably do something similar with this. To use an attachment-plugin you basically just need to add ?pipeline=attachment to the PUT-statement.

@sckott sckott added this to the v1.1 milestone Jun 3, 2019
@sckott sckott closed this as completed in 08e1401 Jan 10, 2020

sckott (Contributor) commented Jan 10, 2020

> Was it the httr-package back then?

Yes, it was the httr pkg back then

Okay, just pushed a change, I think this should work for you now, if you're still interested. See the query param in docs_bulk
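A sketch of how the original example might look with that change. It assumes an installed elastic version that includes the commit referenced above and that `query` takes a named list; check `?docs_bulk` for your version. `push_with_pipeline` is a hypothetical wrapper, and it is only defined here, not called, since it needs a running cluster.

```r
# Assumption: docs_bulk() in the installed elastic version has a `query`
# argument that accepts a named list of query-string parameters.
# push_with_pipeline is a hypothetical convenience wrapper.
push_with_pipeline <- function(conn, docs, index, ids, pipeline = "attachment") {
  elastic::docs_bulk(
    conn, docs,
    index = index,
    doc_ids = ids,
    es_ids = FALSE,
    query = list(pipeline = pipeline)
  )
}

# Usage against a running cluster with the ingest-attachment plugin installed
# and the "attachment" pipeline already created:
# es.con <- elastic::connect()
# push_with_pipeline(es.con, docs, index = "myindex", ids = 1:2)
```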

sckott (Contributor) commented Jan 10, 2020

@Aeilert 👆
