New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for sys 2.0: no longer shQuote on Windows #8

Merged
merged 2 commits into from Nov 9, 2018

Conversation

Projects
None yet
2 participants
@jeroen
Contributor

jeroen commented Nov 5, 2018

The new version of sys automatically quotes arguments on Windows when needed. Unfortunately this breaks the shQuote() workaround. You need to remove the shQuote() otherwise the arguments will be double quoted.

Could you help me check if this now works with sys 2.0?

I'm sorry for the breaking change, but at least now the system calls are consistent on all platforms ;)

@goodmansasha

This comment has been minimized.

Collaborator

goodmansasha commented Nov 7, 2018

Jeroen, I'm checking it out.

@goodmansasha

This comment has been minimized.

Collaborator

goodmansasha commented Nov 7, 2018

@jeroen,

Sys 2.0 breaks rtika in a way more complex than just quotes for Windows. There is something else different in the way it waits for the java program to finish.

With Sys 1.6, the Java program Tika says it "COMPLETED_NORMALLY" and the devtools::test() shows all the tests pass. The following is Tika's std_out and std_err.

`
input <- c(
system.file("extdata", "jsonlite.pdf", package = "rtika"),
system.file("extdata", "curl.pdf", package = "rtika"),
system.file("extdata", "table.docx", package = "rtika"),
system.file("extdata", "xml2.pdf", package = "rtika"),
system.file("extdata", "R-FAQ.html", package = "rtika"),
system.file("extdata", "calculator.jpg", package = "rtika"),
system.file("extdata", "tika.apache.org.zip", package = "rtika")
)
text <- tika(input, quiet = FALSE)

INFO about to start driver
BatchProcess:No config file set via -bc, relying on tika-app-batch-config.xml or default-tika-batch-config.xml
INFO BatchProcess: Nov 07, 2018 11:42:22 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
INFO BatchProcess: WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
INFO BatchProcess: See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
INFO BatchProcess: for optional dependencies.
INFO BatchProcess:
INFO BatchProcess: Nov 07, 2018 11:42:22 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
INFO BatchProcess: WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
INFO BatchProcess: you've excluded the TesseractOCRParser from the default parser.
INFO BatchProcess: Tesseract may dramatically slow down content extraction (TIKA-2359).
INFO BatchProcess: As of Tika 1.15 (and prior versions), Tesseract is automatically called.
INFO BatchProcess: In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
INFO BatchProcess: Nov 07, 2018 11:42:22 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
INFO BatchProcess: WARNING: org.xerial's sqlite-jdbc is not loaded.
INFO BatchProcess: Please provide the jar on your classpath to parse sqlite files.
INFO BatchProcess: See tika-parsers/pom.xml for the correct version.
INFO BatchProcess: randomCrawl attribute is ignored by FSListCrawler
BatchProcess:BatchProcess starting up
BatchProcess:Using fallback font LiberationSans for base font Symbol
BatchProcess:Using fallback font LiberationSans for base font ZapfDingbats
BatchProcess:Processed 0 documents in 1 second.
BatchProcess:There have been 0 handled exceptions.
BatchProcess:There are 2 file processors still active.
BatchProcess:The directory crawler has considered 7 files, and it has added 7 files.
BatchProcess:
BatchProcess:The directory crawler has completed its crawl.
BatchProcess:
BatchProcess:No Unicode mapping for asciigrave.Var (96) in font PYEYFS+Inconsolata-zi4r
BatchProcess:Processed 3 documents in 2 seconds.
BatchProcess:There have been 0 handled exceptions.
BatchProcess:There are 2 file processors still active.
BatchProcess:The directory crawler has considered 7 files, and it has added 7 files.
BatchProcess:
BatchProcess:The directory crawler has completed its crawl.
BatchProcess:
BatchProcess:Processed 5 documents in 3 seconds.
BatchProcess:There have been 0 handled exceptions.
BatchProcess:There are 2 file processors still active.
BatchProcess:The directory crawler has considered 7 files, and it has added 7 files.
BatchProcess:
BatchProcess:The directory crawler has completed its crawl.
BatchProcess:
BatchProcess:Processed 6 documents in 4 seconds.
BatchProcess:There have been 0 handled exceptions.
BatchProcess:There is one file processor still active.
BatchProcess:The directory crawler has considered 7 files, and it has added 7 files.
BatchProcess:
BatchProcess:The directory crawler has completed its crawl.
BatchProcess:
BatchProcess:Main thread in TikaFSBatchCLI has finished processing.
BatchProcess:
BatchProcess:
BatchProcess:ParallelFileProcessingResult{considered=7, added=7, consumed=7, numberHandledExceptions=0, secondsElapsed=4.446, exitStatus=0, causeForTermination='COMPLETED_NORMALLY'}
INFO The child process has finished with an exit value of: 0
INFO Process driver has completed
`

With sys 2.0, there are inconsistent errors and devtools::test() shows inconsistent errors in various tests. These have to do with how sys 2.0 and Tika communicate that a job is done. Looking closer at the Tika std_out and std_err messages, I see that Tika will stop early because of a 'USER_INTERRUPTION' .

INFO about to start driver BatchProcess:No config file set via -bc, relying on tika-app-batch-config.xml or default-tika-batch-config.xml INFO BatchProcess: Nov 07, 2018 11:44:09 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem INFO BatchProcess: WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed. INFO BatchProcess: See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io INFO BatchProcess: for optional dependencies. INFO BatchProcess: INFO BatchProcess: Nov 07, 2018 11:44:09 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem INFO BatchProcess: WARNING: Tesseract OCR is installed and will be automatically applied to image files unless INFO BatchProcess: you've excluded the TesseractOCRParser from the default parser. INFO BatchProcess: Tesseract may dramatically slow down content extraction (TIKA-2359). INFO BatchProcess: As of Tika 1.15 (and prior versions), Tesseract is automatically called. INFO BatchProcess: In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig. INFO BatchProcess: Nov 07, 2018 11:44:09 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem INFO BatchProcess: WARNING: org.xerial's sqlite-jdbc is not loaded. INFO BatchProcess: Please provide the jar on your classpath to parse sqlite files. INFO BatchProcess: See tika-parsers/pom.xml for the correct version. INFO BatchProcess: randomCrawl attribute is ignored by FSListCrawler BatchProcess:BatchProcess starting up BatchProcess:Using fallback font LiberationSans for base font Symbol BatchProcess:Using fallback font LiberationSans for base font ZapfDingbats BatchProcess:Processed 0 documents in 1 second. BatchProcess:There have been 0 handled exceptions. BatchProcess:There are 2 file processors still active. BatchProcess:The directory crawler has considered 7 files, and it has added 7 files. BatchProcess: BatchProcess:The directory crawler has completed its crawl. BatchProcess: BatchProcess:Process is shutting down now. BatchProcess:Main thread in TikaFSBatchCLI has finished processing. BatchProcess: BatchProcess: BatchProcess:ParallelFileProcessingResult{considered=7, added=7, consumed=2, numberHandledExceptions=0, secondsElapsed=1.539, exitStatus=0, causeForTermination='USER_INTERRUPTION'} INFO Process driver has completed

At the end, it reads:

BatchProcess:ParallelFileProcessingResult{considered=7, added=7, consumed=2, numberHandledExceptions=0, secondsElapsed=1.539, exitStatus=0, causeForTermination='USER_INTERRUPTION'}

Yet, when I start debugging the tika() function and put a browser() in the body of that function, the following strange behavior occurs where Tika waits 10 seconds, and produces messages like:

"Waited after an early termination for 10024ms, but there was at least one active consumer"

BatchProcess:ParallelFileProcessingResult{considered=0, added=0, consumed=7, numberHandledExceptions=0, secondsElapsed=10.064, exitStatus=0, causeForTermination='USER_INTERRUPTION'}

The full output is:

INFO about to start driver BatchProcess:No config file set via -bc, relying on tika-app-batch-config.xml or default-tika-batch-config.xml INFO BatchProcess: Nov 07, 2018 11:45:15 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem INFO BatchProcess: WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed. INFO BatchProcess: See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io INFO BatchProcess: for optional dependencies. INFO BatchProcess: INFO BatchProcess: Nov 07, 2018 11:45:15 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem INFO BatchProcess: WARNING: Tesseract OCR is installed and will be automatically applied to image files unless INFO BatchProcess: you've excluded the TesseractOCRParser from the default parser. INFO BatchProcess: Tesseract may dramatically slow down content extraction (TIKA-2359). INFO BatchProcess: As of Tika 1.15 (and prior versions), Tesseract is automatically called. INFO BatchProcess: In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig. INFO BatchProcess: Nov 07, 2018 11:45:15 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem INFO BatchProcess: WARNING: org.xerial's sqlite-jdbc is not loaded. INFO BatchProcess: Please provide the jar on your classpath to parse sqlite files. INFO BatchProcess: See tika-parsers/pom.xml for the correct version. INFO BatchProcess: randomCrawl attribute is ignored by FSListCrawler BatchProcess:BatchProcess starting up BatchProcess:Using fallback font LiberationSans for base font Symbol BatchProcess:Using fallback font LiberationSans for base font ZapfDingbats BatchProcess:Processed 0 documents in 1 second. BatchProcess:There have been 0 handled exceptions. BatchProcess:There is one file processor still active. BatchProcess:The directory crawler has considered 7 files, and it has added 7 files. BatchProcess: BatchProcess:The directory crawler has completed its crawl. BatchProcess: BatchProcess:Process is shutting down now. BatchProcess:Processed 2 documents in 2 seconds. BatchProcess:There have been 0 handled exceptions. BatchProcess:There is one file processor still active. BatchProcess:The directory crawler has considered 7 files, and it has added 7 files. BatchProcess: BatchProcess:The directory crawler has completed its crawl. BatchProcess: BatchProcess:Process is shutting down now. BatchProcess:Processed 2 documents in 3 seconds. BatchProcess:There have been 0 handled exceptions. BatchProcess:There is one file processor still active. BatchProcess:The directory crawler has considered 7 files, and it has added 7 files. BatchProcess: BatchProcess:The directory crawler has completed its crawl. BatchProcess: BatchProcess:Process is shutting down now. BatchProcess:No Unicode mapping for asciigrave.Var (96) in font PYEYFS+Inconsolata-zi4r BatchProcess:Processed 6 documents in 4 seconds. BatchProcess:There have been 0 handled exceptions. BatchProcess:There is one file processor still active. BatchProcess:The directory crawler has considered 7 files, and it has added 7 files. BatchProcess: BatchProcess:The directory crawler has completed its crawl. BatchProcess: BatchProcess:Process is shutting down now. BatchProcess:Processed 6 documents in 5 seconds (1 docs per sec). BatchProcess:There have been 0 handled exceptions. BatchProcess:There is one file processor still active. BatchProcess:The directory crawler has considered 7 files, and it has added 7 files. BatchProcess: BatchProcess:The directory crawler has completed its crawl. BatchProcess: BatchProcess:Process is shutting down now. BatchProcess:Processed 7 documents in 6 seconds (1 docs per sec). BatchProcess:There have been 0 handled exceptions. BatchProcess:There is one file processor still active. BatchProcess:The directory crawler has considered 7 files, and it has added 7 files. BatchProcess: BatchProcess:The directory crawler has completed its crawl. BatchProcess: BatchProcess:Process is shutting down now. BatchProcess:Processed 7 documents in 7 seconds (0 docs per sec). BatchProcess:There have been 0 handled exceptions. BatchProcess:There is one file processor still active. BatchProcess:The directory crawler has considered 7 files, and it has added 7 files. BatchProcess: BatchProcess:The directory crawler has completed its crawl. BatchProcess: BatchProcess:Process is shutting down now. BatchProcess:Processed 7 documents in 8 seconds (0 docs per sec). BatchProcess:There have been 0 handled exceptions. BatchProcess:There is one file processor still active. BatchProcess:The directory crawler has considered 7 files, and it has added 7 files. BatchProcess: BatchProcess:The directory crawler has completed its crawl. BatchProcess: BatchProcess:Process is shutting down now. BatchProcess:Processed 7 documents in 9 seconds (0 docs per sec). BatchProcess:There have been 0 handled exceptions. BatchProcess:There is one file processor still active. BatchProcess:The directory crawler has considered 7 files, and it has added 7 files. BatchProcess: BatchProcess:The directory crawler has completed its crawl. BatchProcess: BatchProcess:Process is shutting down now. BatchProcess:Processed 7 documents in 10 seconds (0 docs per sec). BatchProcess:There have been 0 handled exceptions. BatchProcess:There is one file processor still active. BatchProcess:The directory crawler has considered 7 files, and it has added 7 files. BatchProcess: BatchProcess:The directory crawler has completed its crawl. BatchProcess: BatchProcess:Process is shutting down now. BatchProcess:Waited after an early termination for 10024ms, but there was at least one active consumer BatchProcess:Main thread in TikaFSBatchCLI has finished processing. BatchProcess: BatchProcess: BatchProcess:ParallelFileProcessingResult{considered=0, added=0, consumed=7, numberHandledExceptions=0, secondsElapsed=10.064, exitStatus=0, causeForTermination='USER_INTERRUPTION'} INFO Process driver has completed

Now, I also tried processx::run() and found that it also is leading to intermittent problems also, I think slightly different, so this is a bit worrisome for me. Maybe I could ask the Tika people to make Tika signal completion in a certain consistent way, but I don't know what to ask.

@jeroen

This comment has been minimized.

Contributor

jeroen commented Nov 7, 2018

Hmm strange, I don't think that much has changed internally in sys 2.0. What is a minimal example (least amount of parameters) to reproduce the problem?

@jeroen

This comment has been minimized.

Contributor

jeroen commented Nov 7, 2018

So to clarify, the behavior has only changed on Windows and not on other operating systems?

@jeroen

This comment has been minimized.

Contributor

jeroen commented Nov 7, 2018

@goodmansasha I just fixed a small bug in sys 2.0. Could you test if this changes anything? Easiest way to install is running the following in a clean r session (where sys is not already loaded).

remotes::install_github("jeroen/sys")
@goodmansasha

This comment has been minimized.

Collaborator

goodmansasha commented Nov 7, 2018

I tried the updated sys but the package tests still produce errors with this version (but not 1.6). If I reinstall a prior version of sys (version 1.6) the devtools::test(); produces no errors, but 2.0 and this patch do. I'll try to present a more concrete example, as devtools::test(); is a bit vague ...

@jeroen

This comment has been minimized.

Contributor

jeroen commented Nov 7, 2018

OK thanks for debugging. Note that the entire devtools stack itself is also going through some flux right now so there could be orthogonal problems there as well.

@goodmansasha

This comment has been minimized.

Collaborator

goodmansasha commented Nov 8, 2018

Here is an example that works on my machine but requires you restart R. The 'quiet = FALSE' shows what Tika is saying on the command line.

`
input <- c(
system.file("extdata", "jsonlite.pdf", package = "rtika"),
system.file("extdata", "curl.pdf", package = "rtika"),
system.file("extdata", "table.docx", package = "rtika"),
system.file("extdata", "xml2.pdf", package = "rtika"),
system.file("extdata", "R-FAQ.html", package = "rtika"),
system.file("extdata", "calculator.jpg", package = "rtika"),
system.file("extdata", "tika.apache.org.zip", package = "rtika")
)

text <- rtika::tika(input, quiet = FALSE)

remove.packages('sys')
devtools::install_version("sys", version = "1.6", repos = "http://cran.us.r-project.org")

!!! RESTART R !!!!!

input <- c(
system.file("extdata", "jsonlite.pdf", package = "rtika"),
system.file("extdata", "curl.pdf", package = "rtika"),
system.file("extdata", "table.docx", package = "rtika"),
system.file("extdata", "xml2.pdf", package = "rtika"),
system.file("extdata", "R-FAQ.html", package = "rtika"),
system.file("extdata", "calculator.jpg", package = "rtika"),
system.file("extdata", "tika.apache.org.zip", package = "rtika")
)

text <- rtika::tika(input, quiet = FALSE)
`

@jeroen

This comment has been minimized.

Contributor

jeroen commented Nov 8, 2018

I'm unable to reproduce this, it seems to work fine on my windows vm. Let my try another Windows server...

tika

@goodmansasha

This comment has been minimized.

Collaborator

goodmansasha commented Nov 8, 2018

@jeroen

This comment has been minimized.

Contributor

jeroen commented Nov 8, 2018

Ah okay, that's a problem separate from this PR then. This PR fixes the quoting on Windows specifically. Let's merge this one, and then I'll go debug the other issue on Linux.

Which ubuntu docker image are you using?

@jeroen

This comment has been minimized.

Contributor

jeroen commented Nov 8, 2018

By the way, could you go to https://ci.appveyor.com/projects/new and flip on the switch for rtika to enable the Windows CI?

@goodmansasha

This comment has been minimized.

Collaborator

goodmansasha commented Nov 9, 2018

@goodmansasha

This comment has been minimized.

Collaborator

goodmansasha commented Nov 9, 2018

@goodmansasha

This comment has been minimized.

Collaborator

goodmansasha commented Nov 9, 2018

container.zip my Dockerfile and included files.

@goodmansasha

This comment has been minimized.

Collaborator

goodmansasha commented Nov 9, 2018

Let's merge this one, and then I'll go debug the other issue on Linux.

Alright.

@goodmansasha goodmansasha merged commit fca0ca2 into master Nov 9, 2018

0 of 4 checks passed

continuous-integration/appveyor/branch AppVeyor build cancelled
Details
continuous-integration/travis-ci/pr The Travis CI build could not complete due to an error
Details
continuous-integration/travis-ci/push The Travis CI build could not complete due to an error
Details
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
@goodmansasha

This comment has been minimized.

Collaborator

goodmansasha commented Nov 9, 2018

@jeroen

This comment has been minimized.

Contributor

jeroen commented Nov 9, 2018

OK I'm going to look into this. Btw appveyor now works: https://ci.appveyor.com/project/goodmansasha/rtika/history

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment