Exception raised when specifying area with pdf that has spare blank pages #130

jjelosua · 2016-12-16T15:53:46Z

java -jar ./tools/tabula-0.9.1-jar-with-dependencies.jar pdf_with_blank_page.pdf --pages all --spreadsheet -u -a 0,0,864.567,1105.51

Exception in thread "main" java.util.NoSuchElementException
    at java.util.ArrayList$Itr.next(ArrayList.java:854)
    at java.util.Collections.min(Collections.java:635)
    at technology.tabula.Page.getArea(Page.java:68)
    at technology.tabula.CommandLineApp.extractFile(CommandLineApp.java:163)
    at technology.tabula.CommandLineApp.extractFileInto(CommandLineApp.java:138)
    at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:128)
    at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:104)
    at technology.tabula.CommandLineApp.main(CommandLineApp.java:74)

Example doc:
pdf_with_blank_page.pdf

I will send a one-line PR to fix it.
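
The top frames of the trace show `Collections.min` failing inside `Page.getArea`, which is consistent with it being handed an empty collection when a page has no text. The following is a minimal sketch of that failure mode only, not tabula's actual code: the class `EmptyMinDemo`, the helper `minTop`, and the use of a `List<Float>` to stand in for a page's text positions are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NoSuchElementException;

public class EmptyMinDemo {
    // Hypothetical stand-in for "smallest top coordinate of the text on a page".
    // Collections.min throws NoSuchElementException when the collection is empty.
    static float minTop(List<Float> textTops) {
        return Collections.min(textTops);
    }

    // Returns true if an empty list (a blank page in this sketch) triggers the exception.
    static boolean throwsOnEmpty() {
        try {
            minTop(new ArrayList<Float>());
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("blank page throws: " + throwsOnEmpty());
    }
}
```

Running `main` should report that the empty case throws, which is the same exception type as in the trace above.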

Commit messages from a squashed commit that referenced this issue:

Fix TextElement creation
fix tabs
Use the code from LegacyPDFStreamEngine to create the TextElements
Fix removeText function using the example: org.apache.pdfbox.examples.util.RemoveAllText
close the document
close removed text document
fix array serialization
add spanning cells test with CSV format
Remove capheight calculation; temporarily set height
Test writer two tables checking the json result object instead of the string; add a test writer two tables for CSV output
Fix pageTransform when there is a rotation; add more csv tests
fix path iterator
update json tests
Refactor table equality assertions for better reporting
Moved test fixture to a CSV file
rename spreadsheet/no-spreadsheet to lattice/stream to match web UI in CLI arguments and names of extraction algorithms
adjust expected output to use lattice/stream instead of spreadsheet/basic names for extraction method
ignore area restrictions on blank page. closes tabulapdf#130
Revert "ignore area restrictions on blank page. closes tabulapdf#130" (This reverts commit dfd5f2f.)
more consistent naming of a variable :)
fix and test for empty areas, which should have no text content
various additional null/empty checks to avoid exceptions when the user selects empty pages or regions
Update acknowledgments
tabula 0.9.2
update version
-t/--stream, -l/--lattice in #whichExtractionMethod
Comment on line above
update json outputs
upgrade pdfbox version
back to the old implementation and catch the IndexOutOfBoundsException
Remove hardcoded code
Remove more hardcoded code
test all the elements of the detected table
Change the expected table top value
Increase the threshold factor to support larger headings
Fix rectangle comparator.
fix wrong expected column size, 5 instead of 6.
add more tests
update expected table, more spaces are expected to respect the alignment.
when the text value has length > 1, clean the spaces.
clean code
remove stack trace
add log error

jjelosua pushed a commit to jjelosua/tabula-java that referenced this issue Dec 16, 2016

ignore area if the page has no text. closes tabulapdf#130

6c41a07

jazzido closed this as completed in dfd5f2f Dec 16, 2016

jazzido added a commit that referenced this issue Dec 16, 2016

Merge pull request #131 from jjelosua/master

9514e1b

ignore area if the page has no text. closes #130
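
The idea in this commit message, skipping the area restriction when a page has no text, can be sketched as an emptiness check placed before any minimum is computed. This is only an illustration under stated assumptions and not the code from the PR: the class `BlankPageGuard`, the helper `topOfText`, and the representation of a page as a `List<Float>` of text top coordinates are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

public class BlankPageGuard {
    // Hypothetical helper: returns the smallest top coordinate of the text,
    // or an empty Optional when the page has no text at all.
    static Optional<Float> topOfText(List<Float> textTops) {
        if (textTops == null || textTops.isEmpty()) {
            return Optional.empty(); // blank page: nothing to restrict
        }
        return Optional.of(Collections.min(textTops));
    }

    public static void main(String[] args) {
        List<List<Float>> pages = new ArrayList<>();
        pages.add(Arrays.asList(72.0f, 140.5f)); // a page with text
        pages.add(new ArrayList<Float>());       // a spare blank page

        for (int i = 0; i < pages.size(); i++) {
            Optional<Float> top = topOfText(pages.get(i));
            if (top.isPresent()) {
                System.out.println("page " + (i + 1) + ": top of text at " + top.get());
            } else {
                System.out.println("page " + (i + 1) + ": no text, area ignored");
            }
        }
    }
}
```

Returning an `Optional` pushes the blank-page decision to the caller, so a document with spare blank pages can be processed page by page without an exception ending the whole run.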

jeremybmerrill added a commit that referenced this issue Dec 30, 2016

Revert "ignore area restrictions on blank page. closes #130"

a361735

This reverts commit dfd5f2f.

EmpowerZ pushed a commit to EmpowerZ/tabula-java that referenced this issue Oct 23, 2020

ignore area restrictions on blank page. closes tabulapdf#130

14c9746

EmpowerZ pushed a commit to EmpowerZ/tabula-java that referenced this issue Oct 23, 2020

Merge pull request tabulapdf#131 from jjelosua/master

b4d6bfa

ignore area if the page has no text. closes tabulapdf#130

EmpowerZ pushed a commit to EmpowerZ/tabula-java that referenced this issue Oct 23, 2020

Revert "ignore area restrictions on blank page. closes tabulapdf#130"

4f66614

This reverts commit dfd5f2f.
