Add possibility to pass additional PDFMiner parameters for get_page_layout() #170

redapple · 2018-10-23T15:17:01Z

On some PDFs, PDFMiner has issues when detect_vertical is passed as True: rows are generated incorrectly, with some letters out of reading order.

On a local version of camelot-py, I'm getting better results by forcing detect_vertical=False here.

Would it be possible to have an argument in .read_pdf() to set detect_vertical? Just like there is a margins argument.

vinayak-mehta · 2018-10-24T17:12:39Z

Having an option to specify kwargs for PDFMiner sounds good. Can you show me the structure of one of these PDFs? Just curious.

redapple · 2018-10-26T09:07:41Z

Hi @vinayak-mehta,
here's one example: https://www.chateaudemassillan.fr/_media/files/livre%20de%20cave%20juin%202018.pdf
with this simple stream-parsing script, run on page 2:

import camelot

tables = camelot.read_pdf('livre de cave juin 2018.pdf', flavor='stream', pages='2')
print(tables[0].df)

This is the result with detect_vertical=True (camelot's default):

                                                    0        1
0                            V i n s   a u   Ve r r e         
1                                        es Blancs  L   12.5CL
2                                A.O.P Côtes du Rhône         
3   Domaine de la Guicharde «  Autour de la chapel...      8 €
4                                    A.O.P Vacqueyras         
5               Domaine de Montvac  « Melodine » 2016     10 €
6                           A.O.P Châteauneuf du Pape         
7                          Domaine de Beaurenard 2017     13 €
8                          A.O.P Côteaux du Languedoc         
9           Villa Tempora « Un temps pour elle » 2014      9 €
10                            A.O.P Côtes de Provence         
11                           Château Grand Boise 2017      9 €
12                                     es Rosés     L  12,5 CL
13                               A.O.P Côtes du Rhône         
14   Domaine de la Florane « A fleur de Pampre » 2016      8 €
15  Famille Coulon (Domaine Beaurenard) Biotifulfo...      8 €
16                                   A.O.P Vacqueyras         
17                            Domaine de Montvac 2017      9 €
18                                    A.O.P Languedoc         
19                   Domaine de Joncas « Nébla » 2015      8 €
20           Villa Tempora « L’arroseur arrosé » 2015      9 €
21                            A.O.P Côtes de Provence         
22       Château Grand Boise « Sainte Victoire » 2017      9 €
23                                Château Léoube 2016     10 €
24                                      es Rouges   L    12,CL
25                               A.O.P Côtes du Rhône         
26               Domaine de Dionysos « La Cigalette »      8 €
27  Château Saint Estève d’Uchaux « Grande Réserve...      9 €
28   Domaine de la Guicharde « Cuvée Massillan » 2016      9 €
29       Domaine de la Florane « Terre Pourpre » 2014     10 €
30  L’Oratoire St Martin « Réserve des Seigneurs »...     11 €
31                                 A.O.P Saint Joseph         
32           Domaine Monier Perréol « Châtelet » 2015     13 €
33                          A.O.P Châteauneuf du Pape         
34                         Domaine de Beaurenard 2011     15 €
35                                       A.O.P Cornas         
36              Domaine Lionnet « Terre Brûlée » 2012     15 €

In row 1, for example, the "L" is appended after "es Blancs" instead of coming before it.

Compared to this output with detect_vertical=False:

                                                    0        1
0                            V i n s   a u   Ve r r e         
1                                          Les Blancs   12.5CL
2                                A.O.P Côtes du Rhône         
3   Domaine de la Guicharde «  Autour de la chapel...      8 €
4                                    A.O.P Vacqueyras         
5               Domaine de Montvac  « Melodine » 2016     10 €
6                           A.O.P Châteauneuf du Pape         
7                          Domaine de Beaurenard 2017     13 €
8                          A.O.P Côteaux du Languedoc         
9           Villa Tempora « Un temps pour elle » 2014      9 €
10                            A.O.P Côtes de Provence         
11                           Château Grand Boise 2017      9 €
12                                          Les Rosés  12,5 CL
13                               A.O.P Côtes du Rhône         
14   Domaine de la Florane « A fleur de Pampre » 2016      8 €
15  Famille Coulon (Domaine Beaurenard) Biotifulfo...      8 €
16                                   A.O.P Vacqueyras         
17                            Domaine de Montvac 2017      9 €
18                                    A.O.P Languedoc         
19                   Domaine de Joncas « Nébla » 2015      8 €
20           Villa Tempora « L’arroseur arrosé » 2015      9 €
21                            A.O.P Côtes de Provence         
22       Château Grand Boise « Sainte Victoire » 2017      9 €
23                                Château Léoube 2016     10 €
24                                         Les Rouges    12,CL
25                               A.O.P Côtes du Rhône         
26               Domaine de Dionysos « La Cigalette »      8 €
27  Château Saint Estève d’Uchaux « Grande Réserve...      9 €
28   Domaine de la Guicharde « Cuvée Massillan » 2016      9 €
29       Domaine de la Florane « Terre Pourpre » 2014     10 €
30  L’Oratoire St Martin « Réserve des Seigneurs »...     11 €
31                                 A.O.P Saint Joseph         
32           Domaine Monier Perréol « Châtelet » 2015     13 €
33                          A.O.P Châteauneuf du Pape         
34                         Domaine de Beaurenard 2011     15 €
35                                       A.O.P Cornas         
36              Domaine Lionnet « Terre Brûlée » 2012     15 €

redapple · 2018-10-26T09:19:48Z

I've quickly looked at the underlying issue with letters in the wrong order in the cells in the example above.

I believe it's because the x-position is not taken into account when building text in cells (at least for my virtually all-horizontal data), especially at this line in Stream._generate_table().

When debugging, I noticed that LTTextLineHorizontal instances were put into the correct cell. But then, when looping over the LTTextLineVertical instances, their text is appended to the current cell content, so it lands at the end even when the x-position of the vertical text is actually to the left of the current cell content. This is because of the Cell.text setter method:

    @text.setter
    def text(self, t):
        self._text = ''.join([self._text, t])
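To illustrate the idea of taking the x-position into account, here is a minimal sketch. It is not camelot's code and not necessarily how the fix was implemented: the `Cell` class, its `add_text()` method, and the coordinates are all hypothetical, and y-positions (multi-line cells) are ignored. Each text fragment is stored with its left x-coordinate, and the cell text is built by sorting on that coordinate instead of appending in arrival order.

```python
class Cell:
    """Hypothetical stand-in for a table cell; NOT camelot's Cell class."""

    def __init__(self):
        # Each entry is an (x0, text) pair, where x0 is the left edge
        # of the text line as reported by the layout analysis.
        self._fragments = []

    def add_text(self, x0, text):
        """Record a text fragment together with its left x-coordinate."""
        self._fragments.append((x0, text))

    @property
    def text(self):
        # sorted() is stable, so fragments sharing the same x0 keep
        # the order in which they were added.
        ordered = sorted(self._fragments, key=lambda frag: frag[0])
        return ''.join(t for _, t in ordered)


cell = Cell()
cell.add_text(120.0, 'es Blancs')  # horizontal text line, assigned first
cell.add_text(110.0, 'L')          # vertical text line, assigned later (made-up x0)
print(cell.text)                   # prints "Les Blancs"
```

With plain appending, the same two calls would yield "es BlancsL", which matches the kind of misordering seen in the detect_vertical=True output above.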

vinayak-mehta · 2018-10-28T09:22:07Z

Thanks for the detailed report and looking into the text setter method! You're correct, it doesn't compare the x-position of horizontal and vertical text when assigning text to a cell. This behavior should be corrected.

At the same time, users should be able to pass in pdfminer kwargs to get the best parsing results. Let me look into this.

vinayak-mehta · 2018-12-19T13:20:21Z

@redapple You can now pass PDFMiner LAParams kwargs using layout_kwargs in read_pdf(). Check out this section of the docs for more details: https://camelot-py.readthedocs.io/en/master/user/advanced.html#tweak-layout-generation
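
For the example PDF from this thread, the call could look roughly like the sketch below. It assumes a camelot version that includes layout_kwargs and that the file has been downloaded locally. The helper names `stream_read_options` and `read_stream_tables` are made up for illustration; only `camelot.read_pdf()` and its `flavor`, `pages`, and `layout_kwargs` arguments come from camelot itself.

```python
def stream_read_options(pages, **laparams):
    """Build keyword arguments for camelot.read_pdf() (hypothetical helper).

    Any extra keyword arguments are forwarded to PDFMiner's LAParams
    through camelot's layout_kwargs argument.
    """
    return {
        'flavor': 'stream',
        'pages': pages,
        'layout_kwargs': dict(laparams),
    }


def read_stream_tables(path, pages='2', **laparams):
    """Read tables with the stream flavor, passing LAParams overrides."""
    # Imported lazily so that stream_read_options() can be used and
    # inspected without camelot being installed.
    import camelot

    return camelot.read_pdf(path, **stream_read_options(pages, **laparams))


# Usage, assuming the PDF sits in the working directory:
#
#   tables = read_stream_tables('livre de cave juin 2018.pdf',
#                               detect_vertical=False)
#   print(tables[0].df)
```

This is equivalent to calling camelot.read_pdf(..., flavor='stream', pages='2', layout_kwargs={'detect_vertical': False}) directly; the wrapper only makes the forwarded options easy to see.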

redapple · 2018-12-19T13:28:39Z

Thanks for the heads up @vinayak-mehta !

vinayak-mehta added enhancement easy labels Oct 24, 2018

vinayak-mehta added bug and removed easy labels Oct 28, 2018

vinayak-mehta mentioned this issue Dec 2, 2018

Account for x-position when assigning text to cell #213

Closed

vinayak-mehta added this to the v0.6.0 milestone Dec 2, 2018

vinayak-mehta mentioned this issue Dec 4, 2018

Question: Why is position of first character of input row changed to last character of the same line in output table? #215

Closed

vinayak-mehta mentioned this issue Dec 17, 2018

[MRG] Add option to pass pdfminer kwargs #232

Merged

vinayak-mehta closed this as completed in #232 Dec 19, 2018
