.gitignore
README
page_to_cells.py
pdf_table_scraper.py

PDF table scraper
-----------------

This script attempts to extract tabular data from a PDF file.

It treats every page of the PDF as a table and attempts to make sense of it.
The output should be much easier to parse and reasonably clean, but the results
still need to be checked manually.

It currently exports the data as .html (for visualization), as well as .csv or
Python pickle, for reuse in another script.
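
As a rough illustration of that reuse, the sketch below reads a CSV export back
into Python. It assumes the file holds one table row per line with ordinary
comma-separated cells, which this README does not confirm; `load_table` is a
hypothetical helper name, not part of the scraper.

```python
import csv


def load_table(csv_path):
    """Return the rows of a CSV export as a list of lists of strings."""
    # newline="" lets the csv module handle line breaks inside quoted cells.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [row for row in csv.reader(handle)]
```

Calling `load_table("table.csv")` on a file written with `--csv table.csv`
should then give one list per table row, ready for the manual check mentioned
above.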

    ~/pdf_table_scraper$ ./pdf_table_scraper.py -h
    usage: pdf_table_scraper.py [-h] [--vskip VSKIP] [--page PAGE] [--html HTML]
                                [--csv CSV] [--pickle PICKLE] [--tmp_xml TMP_XML]
                                [-v]
                                filename

    Extracts a table from a .pdf file

    positional arguments:
      filename           the .pdf file

    optional arguments:
      -h, --help         show this help message and exit
      --vskip VSKIP      max vertical space between consecutive lines in the same
                         paragraph (usually ~8)
      --page PAGE        run the script on a specific page
      --html HTML        A filename for html output
      --csv CSV          A filename for csv output
      --pickle PICKLE    A filename for Python .pickle output
      --tmp_xml TMP_XML  A temporary XML file (output of pdftohtml)
      -v                 Increase the verbosity level
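
The script can also be driven from another Python program. The sketch below
only assembles a command line from the options listed in the help text;
`build_command`, `run_scraper` and the file names are hypothetical, and it
assumes `pdf_table_scraper.py` is executable in the current directory.

```python
import subprocess


def build_command(pdf_path, page=None, vskip=None, html=None, csv_out=None,
                  pickle_out=None, script="./pdf_table_scraper.py"):
    """Assemble an argv list using the options shown in the help text."""
    command = [script]
    optional = [("--page", page), ("--vskip", vskip), ("--html", html),
                ("--csv", csv_out), ("--pickle", pickle_out)]
    for flag, value in optional:
        # Only pass the options the caller actually set.
        if value is not None:
            command += [flag, str(value)]
    # The .pdf file is the single positional argument.
    command.append(pdf_path)
    return command


def run_scraper(pdf_path, **options):
    """Run the scraper; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(pdf_path, **options), check=True)
```

For example, `run_scraper("report.pdf", page=3, csv_out="page3.csv")` would
execute `./pdf_table_scraper.py --page 3 --csv page3.csv report.pdf`. How the
script numbers pages is not stated here, so check the result against the PDF.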