Add possibility to pass additional PDFMiner parameters for get_page_layout() #170
Comments
Having an option to specify kwargs for PDFMiner sounds good. Can you show me the structure of one of these PDFs? Just curious.
Hi @vinayak-mehta,
This is result with
Line 1 for example appends the "L" after "es Blancs". Compared to this output with
I've quickly looked at the underlying issue with letters in the wrong order in the cells in the example above. I believe it's because the x-position is not taken into account when building text in cells (at least for my virtually all-horizontal data). When debugging, I noticed that
) |
Thanks for the detailed report and for looking into the text setter method! You're correct: it doesn't compare the x-position of horizontal and vertical text when assigning text to a cell. This behavior should be corrected. At the same time, users should be able to pass in PDFMiner kwargs to get the best parsing results. Let me look into this.
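To illustrate the ordering problem discussed above, here is a minimal sketch of assembling a cell's text with the x-position taken into account. It is not Camelot's actual implementation: `TextFragment`, `assemble_cell_text`, the `y_tolerance` value, and the coordinates are all hypothetical, chosen only to reproduce the "L" landing after "es Blancs" symptom.

```python
from dataclasses import dataclass


@dataclass
class TextFragment:
    """Hypothetical stand-in for a piece of text placed inside a table cell."""
    text: str
    x0: float  # left edge, in PDF points
    y0: float  # bottom edge, in PDF points (y grows upward in PDF space)


def assemble_cell_text(fragments, y_tolerance=2.0):
    """Join fragments in reading order: top to bottom, then left to right."""
    # Highest y first, because the top of the page has the largest y.
    ordered = sorted(fragments, key=lambda f: -f.y0)

    # Group fragments whose baselines are within y_tolerance into one line.
    lines = []
    for frag in ordered:
        if lines and abs(lines[-1][0].y0 - frag.y0) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    # Within each line, the x-position decides the order.
    return "\n".join(
        "".join(f.text for f in sorted(line, key=lambda f: f.x0))
        for line in lines
    )


# Fragments arrive out of order, as in the report.
fragments = [
    TextFragment("es Blancs", x0=12.0, y0=700.0),
    TextFragment("L", x0=5.0, y0=700.5),
]
cell_text = assemble_cell_text(fragments)
```

Grouping by a y tolerance rather than exact equality is a design choice for this sketch, since glyphs on the same visual line often have slightly different baselines.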
@redapple You can now pass PDFMiner LAParam kwargs using |
Thanks for the heads up, @vinayak-mehta!
On some PDFs, PDFMiner has issues when detect_vertical is passed as True, and hence the generation of rows is wrong, with some letters not following reading order.
On a local version of camelot-py, I'm getting better results by forcing detect_vertical=False here.
Would it be possible to have an argument in .read_pdf() to set detect_vertical? Just like there is a margins argument.
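The request could be sketched roughly as follows: library defaults are merged with caller-supplied overrides before the layout parameters are built. This is only an illustration under stated assumptions. `LayoutParams` is a stand-in for PDFMiner's `LAParams` that shows just a few fields, and `build_layout_params`, `extra_params`, and `LIBRARY_DEFAULTS` are hypothetical names, not Camelot's API.

```python
from dataclasses import dataclass, fields


@dataclass
class LayoutParams:
    """Stand-in for PDFMiner's LAParams; only a few fields are modelled."""
    char_margin: float = 2.0
    line_margin: float = 0.5
    word_margin: float = 0.1
    detect_vertical: bool = False


# Hypothetical defaults a wrapping library might force on every call.
LIBRARY_DEFAULTS = {"detect_vertical": True}


def build_layout_params(extra_params=None):
    """Merge library defaults with user overrides; user values win."""
    params = dict(LIBRARY_DEFAULTS)
    params.update(extra_params or {})

    # Reject misspelled keys early instead of silently ignoring them.
    valid = {f.name for f in fields(LayoutParams)}
    unknown = set(params) - valid
    if unknown:
        raise TypeError(f"unknown layout parameters: {sorted(unknown)}")

    return LayoutParams(**params)


default_params = build_layout_params()
overridden = build_layout_params({"detect_vertical": False})
```

With this shape, a caller who hits the reading-order problem can switch off vertical detection for a single document without patching the library.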