Question: Why is position of first character of input row changed to last character of the same line in output table? #215

Closed
amdhacks opened this issue Dec 3, 2018 · 9 comments

@amdhacks

amdhacks commented Dec 3, 2018

Will share input file shortly. Unable to share from restricted network at this moment.

@amdhacks
Author

amdhacks commented Dec 3, 2018

You can see in the snapshot below that the first character of rows 2-10 is cropped and appended as the last character of the same row in the output table.
Snapshot:
issue

The input file is:
document-page3.pdf

Any idea why this is happening and how it can be fixed?

Thanks.

@abhibisht89

I am also having the same issue with the position of the first character.

@vinayak-mehta
Contributor

Hi @amdhacks and @abhibisht89! Thanks for the report. This is a known issue (more details in #170 and #213). You can expect a fix by the end of this week. Until then, you can try the workaround mentioned in #170: find the path where camelot is installed and pass detect_vertical=False in base.py.

@amdhacks
Author

amdhacks commented Dec 4, 2018

My base.py file after the change is given below:

# -*- coding: utf-8 -*-

import os

from ..utils import get_page_layout, get_text_objects


class BaseParser(object):
    """Defines a base parser.
    """
    def _generate_layout(self, filename):
        self.filename = filename
        self.layout, self.dimensions = get_page_layout(
            self.filename,
            char_margin=self.char_margin,
            line_margin=self.line_margin,
            word_margin=self.word_margin)
        self.horizontal_text = get_text_objects(self.layout, ltype="lh")
        self.vertical_text = get_text_objects(self.layout, ltype="lv")
        self.pdf_width, self.pdf_height = self.dimensions
        self.rootname, __ = os.path.splitext(self.filename)
        self.detect_vertical = False

But I do not see any improvement. The first character is still shown as the last character of the same row in the output.

Am I missing something?

@vinayak-mehta
Contributor

Sorry, I should've been more specific. You need to add detect_vertical=True to get_page_layout. You can check out its definition in utils.py.

@vinayak-mehta
Contributor

@amdhacks This is fixed now. It will be more configurable after #170.

@amdhacks
Author

@vinayak-mehta, I have updated camelot to version 0.4.1, but the first character, which was previously shown as the last character of the row, now appears as a separate single character in the first column, as shown below.
Output now:
C lass A
N et Asset Value at 31 December 5,111,372
N umber of outstanding units at 31 December 49,136
N et Asset Value per unit at 31 December 104.03
C lass B
N et Asset Value at 31 December 49,144,825
N umber of outstanding units at 31 December 471,555
N et Asset Value per unit at 31 December 104.22

Please suggest.
Thanks.

@vinayak-mehta
Contributor

@amdhacks Please install the latest version i.e. v0.5.0. The table isn't being detected correctly for this case. You'll need to specify a table area.

$ camelot --format csv --output output.csv stream -T 70,690,550,170 input.pdf
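As far as I understand the camelot documentation, the value passed to -T is a table area string of the form x1,y1,x2,y2, where (x1, y1) is the left-top corner and (x2, y2) is the right-bottom corner in PDF coordinate space (origin at the bottom-left of the page). The sketch below is only a sanity check for such a string before running the command. The helper name parse_table_area is hypothetical and not part of camelot.

```python
def parse_table_area(area):
    """Parse an "x1,y1,x2,y2" string into four floats and sanity-check it.

    Hypothetical helper for illustration; camelot does its own parsing.
    """
    parts = [float(p) for p in area.split(",")]
    if len(parts) != 4:
        raise ValueError("expected four comma-separated numbers: %r" % area)
    x1, y1, x2, y2 = parts
    # PDF coordinates start at the bottom-left corner of the page, so the
    # top edge (y1) must be larger than the bottom edge (y2).
    if not (x1 < x2 and y1 > y2):
        raise ValueError("expected left-top then right-bottom: %r" % area)
    return x1, y1, x2, y2


# The area from the command above.
x1, y1, x2, y2 = parse_table_area("70,690,550,170")
width, height = x2 - x1, y1 - y2
print(width, height)
```

If the check fails, the two corners have most likely been swapped or the y values were measured from the top of the page instead of the bottom.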

@sjm20066

@vinayak-mehta I'm having the exact same issue as @amdhacks, and I am on version 0.7.3.
The output shows the same problem:
C lass A
N et Asset Value at 31 December 5,111,372
N umber of outstanding units at 31 December 49,136
N et Asset Value per unit at 31 December 104.03
C lass B
N et Asset Value at 31 December 49,144,825
N umber of outstanding units at 31 December 471,555
N et Asset Value per unit at 31 December 104.22

Also, the table boundaries in the input PDF are as clearly defined as they can be.
Is there any solution for this?
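No fix for this case is confirmed in this thread. As a stopgap, the split first letter can be rejoined after extraction. The sketch below is a post-processing workaround only, not a camelot feature, and the function name rejoin_first_char is hypothetical. It assumes each affected cell starts with a single capital letter, one space, and then lowercase text, as in the output above.

```python
import re

# A lone capital letter at the start of the text, followed by one space
# and then a lowercase letter (e.g. "C lass A", "N et Asset Value ...").
_SPLIT_INITIAL = re.compile(r"^([A-Z]) (?=[a-z])")


def rejoin_first_char(text):
    """Remove the stray space after a detached leading capital letter."""
    return _SPLIT_INITIAL.sub(r"\1", text)


rows = [
    "C lass A",
    "N et Asset Value at 31 December 5,111,372",
    "Class B",  # already correct, left unchanged
]
fixed = [rejoin_first_char(row) for row in rows]
print(fixed)
```

Note that this heuristic would also join a legitimate one-letter word such as "A" to the word after it, so it is only safe on columns known to be affected.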
