Exception raised when specifying area with pdf that has spare blank pages #130

Closed
jjelosua opened this issue Dec 16, 2016 · 0 comments

Comments

@jjelosua
Contributor

java -jar ./tools/tabula-0.9.1-jar-with-dependencies.jar pdf_with_blank_page.pdf --pages all --spreadsheet -u -a 0,0,864.567,1105.51

Exception in thread "main" java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:854)
at java.util.Collections.min(Collections.java:635)
at technology.tabula.Page.getArea(Page.java:68)
at technology.tabula.CommandLineApp.extractFile(CommandLineApp.java:163)
at technology.tabula.CommandLineApp.extractFileInto(CommandLineApp.java:138)
at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:128)
at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:104)
at technology.tabula.CommandLineApp.main(CommandLineApp.java:74)
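
The top frames of the trace show `Collections.min` failing inside `ArrayList$Itr.next`, which is what happens when `Collections.min` is given an empty list. A minimal sketch of that failure in isolation follows. The class `EmptyMinDemo` and the helper `minThrows` are hypothetical names for illustration only, and the assumption that `Page.getArea` takes the minimum over the page's text elements is inferred from the trace, not taken from the tabula source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical demo class; not part of tabula-java.
public class EmptyMinDemo {

    // Returns true if Collections.min throws NoSuchElementException for the list.
    static boolean minThrows(List<Float> values) {
        try {
            Collections.min(values);
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Stand-in for a blank page: no text elements, so no coordinates to compare.
        List<Float> blankPage = new ArrayList<>();
        // Stand-in for a page with text: some coordinates are available.
        List<Float> pageWithText = Arrays.asList(12.5f, 3.0f);

        System.out.println("blank page throws: " + minThrows(blankPage));
        System.out.println("page with text throws: " + minThrows(pageWithText));
    }
}
```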

Example doc:
pdf_with_blank_page.pdf

I will send a one-line PR to fix it.

jjelosua pushed a commit to jjelosua/tabula-java that referenced this issue Dec 16, 2016
jazzido added a commit that referenced this issue Dec 16, 2016
ignore area if the page has no text. closes #130
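
The commit message above describes skipping the area when the page has no text. One way to sketch that kind of guard, assuming the failure comes from taking a minimum over an empty list, is to check for emptiness before calling `Collections.min`. The class `SafeMin` and the method `minOrEmpty` are hypothetical and are not the code from the actual commit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical helper; illustrates the guard, not the real tabula-java fix.
public class SafeMin {

    // Returns the smallest element, or Optional.empty() when there is nothing to compare.
    static <T extends Comparable<? super T>> Optional<T> minOrEmpty(List<T> values) {
        if (values.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(Collections.min(values));
    }

    public static void main(String[] args) {
        // Blank page stand-in: the caller can skip the area restriction.
        Optional<Float> blank = minOrEmpty(new ArrayList<Float>());
        System.out.println("blank page has a minimum: " + blank.isPresent());

        // Page with text stand-in: the minimum is available as usual.
        Optional<Float> withText = minOrEmpty(Arrays.asList(12.5f, 3.0f));
        System.out.println("page with text has a minimum: " + withText.isPresent());
    }
}
```

Returning an `Optional` makes the caller decide what to do with a blank page, rather than letting the exception escape from the command-line entry point.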
jeremybmerrill added a commit that referenced this issue Dec 30, 2016
melisabok added a commit to melisabok/tabula-java that referenced this issue Mar 8, 2017
# The first commit's message is:
# This is a combination of 17 commits.
# The first commit's message is:
Fix TextElement creation

# This is the 2nd commit message:

fix tabs

# This is the 3rd commit message:

Use the code from LegacyPDFStreamEngine to create the TextElements

# This is the 4th commit message:

Fix removeText function using the example:

org.apache.pdfbox.examples.util.RemoveAllText

# This is the 5th commit message:

close the document

# This is the 6th commit message:

close removed text document

# This is the 7th commit message:

fix array serialization

# This is the 8th commit message:

add spanning cells test with CSV format

# This is the 9th commit message:

- Remove capheight calculation
- Temporarily set height

# This is the 10th commit message:

Test writer two tables checking the json result object instead of the string

Add a test writer two tables for CSV output

# This is the 11th commit message:

Fix pageTransform when there is a rotation
Add more csv tests

# This is the 12th commit message:

fix path iterator

# This is the 13th commit message:

update json tests

# This is the 14th commit message:

Refactor table equality assertions for better reporting

# This is the 15th commit message:

Moved test fixture to a CSV file

# This is the 16th commit message:

rename spreadsheet/no-spreadsheet to lattice/stream to match web UI in CLI arguments and names of extraction algorithms

# This is the 17th commit message:

adjust expected output to use lattice/stream instead of spreadsheet/basic names for extraction method

# This is the 2nd commit message:

ignore area restrictions on blank page. closes tabulapdf#130

# This is the 3rd commit message:

Revert "ignore area restrictions on blank page. closes tabulapdf#130"

This reverts commit dfd5f2f.

# This is the 4th commit message:

more consistent naming of a variable :)

# This is the 5th commit message:

fix and test for empty areas; which should have no text content

# This is the 6th commit message:

various additional null/empty checks to avoid exceptions when the user selects empty pages or regions

# This is the 7th commit message:

Update acknowledgments

# This is the 8th commit message:

tabula 0.9.2

# This is the 9th commit message:

update version

# This is the 10th commit message:

-t/--stream, -l/--lattice in #whichExtractionMethod

# This is the 11th commit message:

Comment on line above
melisabok added a commit to melisabok/tabula-java that referenced this issue Mar 8, 2017

# This is the 12th commit message:

update json outputs

# This is the 2nd commit message:

upgrade pdfbox version
melisabok added a commit to melisabok/tabula-java that referenced this issue Mar 8, 2017

back to the old implementation and catch the IndexOutOfBoundsException

Remove hardcoded code

Remove more hardcoded code

test all the elements of the detected table

Change the expected table top value

Increase the threshold factor to support larger headings

Fix rectangle comparator.

fix wrong expected column size, 5 instead of 6.

add more tests

update expected table; more spaces are expected to respect the alignment.

when the text value has length > 1, clean the spaces.

clean code

remove stack trace

add log error
EmpowerZ pushed a commit to EmpowerZ/tabula-java that referenced this issue Oct 23, 2020
EmpowerZ pushed a commit to EmpowerZ/tabula-java that referenced this issue Oct 23, 2020
ignore area if the page has no text. closes tabulapdf#130
EmpowerZ pushed a commit to EmpowerZ/tabula-java that referenced this issue Oct 23, 2020