Vertical text messing up unrelated tables #126

Open
SeguinBe opened this issue Nov 28, 2016 · 4 comments

Comments

@SeguinBe

Hi,

First of all, thanks for the great tool.

I was trying to extract the data from the following PDF: pdf

If I select the following two tables like this:
capture d ecran 2016-11-28 a 15 04 08

The second table gets extracted properly but not the first one, which seems to be messed up by some 'Info' text. These 'Info' boxes appear vertically in some headers (between 'Executed Elements' and 'Base Value') that are not part of the selected tables.

capture d ecran 2016-11-28 a 15 08 44

My best guess is that there is an issue in the layout rendering and that the rotation of the vertical text is somehow not registered properly. Maybe the x and y coordinates get swapped, which would explain why the 'Info' texts all end up in the first table and none in the following tables.
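
To make the hypothesis concrete, here is a minimal sketch of how a swapped coordinate pair could pull a label into an unrelated selection. The `Rect` class, the selection area, and all coordinates are made up for illustration; none of them come from Tabula or from the PDF in question.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Hypothetical selection area, in page coordinates (origin at top-left)."""
    left: float
    top: float
    width: float
    height: float

    def contains_point(self, x: float, y: float) -> bool:
        # A point is inside when it lies within both the horizontal and vertical extent.
        return (self.left <= x <= self.left + self.width
                and self.top <= y <= self.top + self.height)


# Assumed selection drawn around the first table.
first_table = Rect(left=50.0, top=100.0, width=500.0, height=200.0)

# Assumed true position of a vertical 'Info' label, further down the page.
true_x, true_y = 150.0, 400.0

# With the correct coordinates the label is outside the selection.
outside = first_table.contains_point(true_x, true_y)

# If x and y were swapped, the same label would land inside the selection.
inside_when_swapped = first_table.contains_point(true_y, true_x)

print(outside, inside_when_swapped)
```

With these assumed numbers the label is only picked up by the first selection once its coordinates are swapped, which would match the symptom described above.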

Anyway, any tip on how to get around this would be greatly appreciated.

PS: I'm posting here because I mostly use the CLI version; I used the web version only for the screenshots. Both versions of course produce the same output.

@jeremybmerrill
Member

Hi @SeguinBe: Thanks for the bug report. I think you're probably exactly right about the source of the problem: the rotation of the text isn't understood properly. My initial guess is that this is a PDFBox bug, but I'm not 100% sure. I'm not sure we have a good solution or workaround, unfortunately. Manuel is hard at work incorporating the latest version of PDFBox into Tabula, which will hopefully solve the problem, but I've got no suggestions until then.

@SeguinBe
Author

SeguinBe commented Dec 4, 2016

Ok, thanks for the feedback. I will wait for it then :-)

@lazar-basiq

Hi,

It seems that this issue is still present.
Does anyone know if there is any workaround for this? :)

Thanks

@lazar-basiq

Hi,

I think I found a quick (and probably dirty) way to resolve this issue.
The problem seems to be that, when the text orientation is vertical, PDFBox returns the x and y coordinates of the text element swapped.
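
As a rough illustration of the idea (not the contents of the diff, and not Tabula's or PDFBox's actual API), the workaround could look like the sketch below. The `TextElement` class, its `direction` field, and the `normalize` helper are hypothetical names, and the sketch assumes that elements rotated by 90 or 270 degrees are exactly the ones that arrive with x and y swapped.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TextElement:
    """Hypothetical stand-in for an extracted text chunk."""
    text: str
    x: float
    y: float
    direction: int  # assumed text rotation in degrees: 0, 90, 180 or 270


def normalize(element: TextElement) -> TextElement:
    """Swap x and y back for vertical text; leave horizontal text unchanged."""
    if element.direction % 180 == 90:  # 90 or 270 degrees, i.e. vertical text
        return replace(element, x=element.y, y=element.x)
    return element


# Assumed example values: a vertical label reported with swapped coordinates.
reported = TextElement(text="Info", x=400.0, y=150.0, direction=90)
fixed = normalize(reported)

# Horizontal text passes through untouched.
header = TextElement(text="Base Value", x=200.0, y=390.0, direction=0)
same = normalize(header)

print(fixed, same)
```

Whether width and height need the same treatment, and whether every vertical element is affected, is something the sketch does not settle.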

Here is a diff with the change I made: tabula-text-direction_fix.diff

@jeremybmerrill can someone from Tabula verify whether the above change may have any negative effects, or suggest a different solution for this issue?

If you think that this patch is OK, I can create a pull request for it :)

Thanks,
Lazar
