Upgrade to PDFBox 2.0.0 #52

Closed
jazzido opened this issue Dec 3, 2015 · 12 comments

@jazzido
Contributor

jazzido commented Dec 3, 2015

A stable release of PDFBox 2.0 is around the corner (they're at rc2 now), so it makes sense to start thinking about upgrading.

Our ObjectExtractor class extends PDFBox 1.8 PageDrawer, which changed substantially in 2.0.

Also, PDF rendering improved substantially in PDFBox 2.0, so we might be able to drop JPedal in Tabula and use PDFBox for rendering.

@beng06

beng06 commented Mar 23, 2016

For your reference, Apache released PDFBox 2.0 on March 21, 2016.

http://sdtimes.com/apache-pdfbox-2-0-is-released/

@kapil-mangtani

We recently upgraded our project to PDFBox 2.0, so most of the tabula code doesn't work now. A lot of functions have been moved or deleted in PDFBox, so it's getting very hard to make the changes ourselves. I understand you are working very hard on it, but roughly when can we expect the migration in tabula? Thanks and cheers!

@jazzido
Contributor Author

jazzido commented Mar 31, 2016

Hi @kapil-mangtani,

The migration to pdfbox 2.0 hasn't even started. pdfbox 2 has a completely different API, so it's going to be quite a bit of work. As much as I'd like to work on it, there are other priorities and Tabula is a labor of love.

Unfortunately, I can't give you a timeline. If you'd like to contribute a patch, however, we'll be happy to work with you on integrating it into the master branch.

@kapil-mangtani

Thanks for the swift reply.

I am trying to rewrite the ObjectExtractor class along with its PageIterator, and would love to contribute if this works out.

@jazzido
Contributor Author

jazzido commented Apr 28, 2016

Some initial explorations:

Our ObjectExtractor is a subclass of PageDrawer (PDFBox 1.8). That class is now meant for rendering a PDF onto a Graphics2D (the docs state that "If you want to do custom graphics processing rather than Graphics2D rendering, then you should subclass PDFGraphicsStreamEngine instead"). ObjectExtractor mines both graphics and text elements, so we need hooks for both. Unfortunately, there is no single class in PDFBox 2.0 that can interpret both.

The solution would have to be a class that inherits from PDFStreamEngine and combines the functionality of PDFGraphicsStreamEngine, PageDrawer and PDFTextStripper.

Additionally, the new StreamEngines no longer operate on a PDDocument, but on a PDPage. We'll need to modify PageIterator and ObjectExtractor accordingly.
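The shape of that design can be sketched without PDFBox at all. The following is a stand-alone model, not PDFBox code: `PageOp`, `PageEngine`, `CombinedExtractor`, `onText` and `onPath` are hypothetical names invented for illustration. It only shows the idea of one engine that walks a single page and exposes hooks for both text and graphics, so that one subclass can collect both in a single pass. In a real port the base class would be PDFStreamEngine and the hooks would be whatever callbacks PDFBox 2.0 provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-alone model of a single engine with text and graphics hooks (all names hypothetical). */
final class CombinedEngineSketch {
    enum Kind { TEXT, PATH }

    /** Hypothetical stand-in for one operation in a page's content stream. */
    static final class PageOp {
        final Kind kind;
        final String payload;
        PageOp(Kind kind, String payload) { this.kind = kind; this.payload = payload; }
    }

    /** Hypothetical base engine: processes one page (not a whole document) and calls a hook per operation. */
    abstract static class PageEngine {
        final void processPage(List<PageOp> page) {
            for (PageOp op : page) {
                if (op.kind == Kind.TEXT) {
                    onText(op.payload);
                } else {
                    onPath(op.payload);
                }
            }
        }
        protected abstract void onText(String text);
        protected abstract void onPath(String path);
    }

    /** Collects text elements and path segments in the same pass over the page. */
    static final class CombinedExtractor extends PageEngine {
        final List<String> texts = new ArrayList<String>();
        final List<String> paths = new ArrayList<String>();
        @Override protected void onText(String text) { texts.add(text); }
        @Override protected void onPath(String path) { paths.add(path); }
    }

    /** Runs a fresh extractor over one page and returns it with its collected elements. */
    static CombinedExtractor extract(List<PageOp> page) {
        CombinedExtractor extractor = new CombinedExtractor();
        extractor.processPage(page);
        return extractor;
    }

    public static void main(String[] args) {
        List<PageOp> page = Arrays.asList(
            new PageOp(Kind.PATH, "line 0,0 -> 100,0"),
            new PageOp(Kind.TEXT, "Total"),
            new PageOp(Kind.PATH, "line 0,20 -> 100,20"));
        CombinedExtractor extractor = extract(page);
        System.out.println(extractor.texts.size() + " text element(s), "
            + extractor.paths.size() + " path segment(s)");
    }
}
```

Taking a page rather than a document as the unit of work mirrors the change noted above: a page iterator would call the extractor once per page instead of handing it the whole document.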

@subhashbylaiah

Hi @kapil-mangtani, @jazzido

We have run into the same problem as Kapil.
Have you been able to make any progress on the migration to 2.0?

Regards
Subhash

@jazzido
Contributor Author

jazzido commented Jul 1, 2016

Hi @subhashbylaiah,

No, we haven't made much progress. However, if you are interested in sponsoring the development of this, or contributing a patch, let us know.

@jazzido
Contributor Author

jazzido commented Oct 3, 2016

I've started to do some real work on this issue (be0b41a)

Things are looking good. In addition, the pdfbox 2 version is faster than 1.8.
Unscientific benchmarks ahead:

With PDFBox 1.8:

for i in `seq 1 10`; do time java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1 src/test/resources/technology/tabula/argentina_diputados_voting_record.pdf > /dev/null; done
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.71s user 0.27s system 177% cpu 2.249 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.64s user 0.25s system 176% cpu 2.197 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.58s user 0.23s system 173% cpu 2.199 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.47s user 0.23s system 171% cpu 2.151 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.44s user 0.23s system 172% cpu 2.132 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.43s user 0.22s system 169% cpu 2.160 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.51s user 0.24s system 173% cpu 2.162 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.55s user 0.24s system 173% cpu 2.184 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.66s user 0.24s system 176% cpu 2.217 total

With PDFBox 2.0.3

for i in `seq 1 10`; do time java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1 src/test/resources/technology/tabula/argentina_diputados_voting_record.pdf > /dev/null; done
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  2.91s user 0.23s system 208% cpu 1.501 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  2.96s user 0.23s system 206% cpu 1.539 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  2.91s user 0.23s system 201% cpu 1.561 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  3.19s user 0.23s system 212% cpu 1.613 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  3.01s user 0.23s system 209% cpu 1.548 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  3.01s user 0.23s system 208% cpu 1.552 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  3.18s user 0.23s system 176% cpu 1.926 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  3.07s user 0.23s system 214% cpu 1.543 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  2.92s user 0.23s system 203% cpu 1.544 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  3.10s user 0.24s system 171% cpu 1.945 total
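To make the comparison easier to read, the wall-clock "total" column of the runs above can be summarized. This is a minimal sketch: the class and method names are illustrative, and the arrays simply copy the nine 1.8 totals and ten 2.0.3 totals shown in the output above.

```java
import java.util.Arrays;

/** Summarizes the wall-clock "total" column from the timing runs above. */
final class BenchmarkSummary {
    // Wall-clock totals in seconds, copied from the nine PDFBox 1.8 lines.
    static final double[] PDFBOX_1_8 = {
        2.249, 2.197, 2.199, 2.151, 2.132, 2.160, 2.162, 2.184, 2.217
    };
    // Wall-clock totals in seconds, copied from the ten PDFBox 2.0.3 lines.
    static final double[] PDFBOX_2_0_3 = {
        1.501, 1.539, 1.561, 1.613, 1.548, 1.552, 1.926, 1.543, 1.544, 1.945
    };

    /** Arithmetic mean; NaN for an empty sample. */
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    /** Median, which is less sensitive than the mean to the two slower 2.0.3 runs. */
    static double median(double[] xs) {
        if (xs.length == 0) {
            return Double.NaN;
        }
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        System.out.printf("1.8   mean=%.3fs median=%.3fs%n",
            mean(PDFBOX_1_8), median(PDFBOX_1_8));
        System.out.printf("2.0.3 mean=%.3fs median=%.3fs%n",
            mean(PDFBOX_2_0_3), median(PDFBOX_2_0_3));
        System.out.printf("mean ratio 2.0.3/1.8 = %.3f%n",
            mean(PDFBOX_2_0_3) / mean(PDFBOX_1_8));
    }
}
```

On these numbers the mean wall-clock time drops from about 2.18 s to about 1.63 s, roughly 25% lower. The runs cover one PDF on one machine, so the figure is indicative only.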

@jazzido
Contributor Author

jazzido commented Nov 22, 2016

Leaving a comment here for future reference: when this issue is ready to be resolved, let's make sure that we don't regress the accuracy of the table detector.

We have the output of the Travis builds as a baseline: https://travis-ci.org/tabulapdf/tabula-java/jobs/177768121

@gudipatiharitha

We ran sample PDFs with the master branch and with the pdfbox2-0 working branch. Tables that were identified correctly on master are not extracted on this branch (along with other errors in cell information). Looking at the Travis build results, we saw that many of the test cases are still failing. Is there a timeline for merging this branch into master? We would like to contribute to speed up the release. Would fixing the failing test cases be the best way to proceed?

@jazzido
Contributor Author

jazzido commented Mar 7, 2017

Just for the record: melisabok/tabula-java@pdfbox2.0 now passes all the tests.

We expect to merge @melisabok's fantastic work in the coming weeks.

@jazzido
Contributor Author

jazzido commented Mar 8, 2017

We have a pull request: #146 — Will review and integrate with master in the coming days.

Those of you (@gudipatiharitha, @subhashbylaiah, @beng06, @kapil-mangtani, @chezou) interested in helping test this, please build from melisabok/tabula-java@pdfbox2.0

jazzido added a commit that referenced this issue Mar 27, 2017
* Starting with upgrade to PDFBox 2.0 (#52)

* 2.0

* little progress in upgrading to pdfbox 2

* upgrade to pdfbox 2 starting to show signs of life

* Fix TextElement creation

* fix tabs

* Use the code from LegacyPDFStreamEngine to create the TextElements

* Fix removeText function using the example:

org.apache.pdfbox.examples.util.RemoveAllText

* close the document

* close removed text document

* fix array serialization

* add spanning cells test with CSV format

* - Remove capheight calculation
- Temporally set height

* Test writer two tables checking the json result object instead of the string

Add a test writer two tables for CSV output

* Fix pageTransform when there is a rotation
Add more csv tests

* fix path iterator

* update json tests

* update json outputs

* upgrade pdfbox version

* back to the old implementation and catch the IndexOutOfBoundsException

* Remove hardcoded code

* Remove more hardcoded code

* test all the elements of the detected table

* Change the expected table top value

* Increase the threshold factor to support a greater headings

* Fix rectangle comparator.

* fix wrong expected column size, 5 instead of 6.

add more tests

* update expected table, more spaces are expected to respect the alingment.

* when the text value has length > 1, clean the spaces.

* clean code

* remove stackstrace

* add log error

* upgrade all dependencies

* code formatting

* setting pom to snapshot version
@jazzido jazzido closed this as completed Apr 9, 2017
EmpowerZ pushed a commit to EmpowerZ/tabula-java that referenced this issue Oct 23, 2020
* Starting with upgrade to PDFBox 2.0 (tabulapdf#52)
