Vertical text messing up unrelated tables #126
Comments
Hi @SeguinBe: Thanks for the bug report. I think you're probably right about the source of the problem: the rotation of the text isn't being interpreted properly. My initial guess is that this is a PDFBox bug, but I'm not 100% sure. Unfortunately, I don't think we have a good solution or workaround right now. Manuel is hard at work incorporating the latest version of PDFBox into Tabula, which will hopefully solve the problem, but I have no suggestions until then.
Ok, thanks for the feedback. I will wait for it then :-)
Hi, it seems that this issue is still present. Thanks
Hi, I think I found a quick (and probably dirty) way to resolve this issue. Here is the diff with the change I made: tabula-text-direction_fix.diff @jeremybmerrill can someone from Tabula verify whether the above change may have any negative effects, or suggest a different solution for this issue? If you think this patch is ok, I can create a pull request for it :) Thanks,
Hi,
First of all, thanks for the great tool.
I was trying to extract the data from the following PDF: pdf
If I extract the following two tables like this:
The second table gets extracted properly, but not the first one, which seems to be messed up by some 'Info' text. These 'Info' boxes appear vertically in some headers (between 'Executed Elements' and 'Base Value') that are not part of the selected tables.
My best guess is that there is an issue in the layout rendering and that the rotation of the vertical text is somehow not registered properly. Maybe the x and y coordinates get swapped, which would explain why the 'Info' texts all end up in the first table and none in the following tables.
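To make that guess concrete, here is a minimal Python sketch of the suspected failure mode. It is not Tabula's or PDFBox's actual code: the `Glyph` type, the `swap_if_rotated` and `select` helpers, and all coordinates are made up for illustration. It only shows how exchanging x and y for rotated text could pull an 'Info' label into a selection it does not belong to, and how skipping rotated text would avoid that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Hypothetical text element: position on the page plus rotation in degrees."""
    text: str
    x: float
    y: float
    rotation: int = 0


def swap_if_rotated(g):
    """Simulate the suspected bug: vertical text gets its x and y exchanged."""
    if g.rotation % 180 == 90:
        return Glyph(g.text, g.y, g.x, g.rotation)
    return g


def select(glyphs, region, drop_rotated=False):
    """Return the texts whose position lies inside region = (left, top, right, bottom)."""
    left, top, right, bottom = region
    return [
        g.text
        for g in glyphs
        if not (drop_rotated and g.rotation % 360 != 0)
        and left <= g.x <= right
        and top <= g.y <= bottom
    ]


# Made-up page: one header cell inside the selection, one vertical 'Info' label outside it.
page = [Glyph("Executed Elements", 100, 300), Glyph("Info", 320, 120, rotation=90)]
first_table = (50, 250, 500, 400)

# What the extractor would see if the suspected coordinate swap really happens.
reported = [swap_if_rotated(g) for g in page]

print(select(page, first_table))                         # true positions
print(select(reported, first_table))                     # swapped positions
print(select(reported, first_table, drop_rotated=True))  # rotated text skipped
```

With the true positions the 'Info' label stays outside the selection; with swapped coordinates it lands inside the first table. Dropping rotated text before applying the selection is one possible workaround under these assumptions, at the cost of losing any vertical text that really does belong to a table.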
Anyway, any tips on how to get around this would be greatly appreciated.
PS: I'm posting here because I mostly use the CLI version. I used the web version just for the screenshots; both versions of course produce the same output.