GitHub - regardscitoyens/PDF_table

PDF table scraper
-----------------

This script attempts to extract tabular data from a PDF file.

It treats every page of the PDF as a single table and attempts to make sense
of it. The output should be much easier to parse and reasonably clean, but
the results still need to be checked manually.

It currently exports the data as .html (for visualization), and as .csv or
Python pickle for reuse in another script.
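
As a minimal sketch of the reuse step, the .csv output can be read back with
Python's standard csv module. This assumes the file is ordinary comma-separated
text, UTF-8 encoded, with one table row per record; none of that is confirmed
here. The helper names load_table and drop_empty_rows are hypothetical. The
layout of the pickle output is not documented in this README, so it is not
shown.

```python
import csv


def load_table(path):
    """Read a CSV file into a list of rows (each row is a list of cell strings).

    Assumption: comma-delimited, UTF-8 encoded text.
    """
    # newline="" lets the csv module handle line breaks inside quoted cells
    with open(path, newline="", encoding="utf-8") as handle:
        return [row for row in csv.reader(handle)]


def drop_empty_rows(rows):
    """Discard rows whose cells are all blank, a typical first cleanup pass."""
    return [row for row in rows if any(cell.strip() for cell in row)]
```

Since the results need manual checking anyway, a helper like drop_empty_rows is
only a starting point for whatever cleanup a given PDF requires.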

    ~/pdf_table_scraper$ ./pdf_table_scraper.py -h
    usage: pdf_table_scraper.py [-h] [--vskip VSKIP] [--page PAGE] [--html HTML]
                                [--csv CSV] [--pickle PICKLE] [--tmp_xml TMP_XML]
                                [-v]
                                filename

    Extracts a table from a .pdf file

    positional arguments:
      filename           the .pdf file

    optional arguments:
      -h, --help         show this help message and exit
      --vskip VSKIP      max vertical space between consecutive lines in the same
                         paragraph (usually ~8)
      --page PAGE        run the script on a specific page
      --html HTML        A filename for html output
      --csv CSV          A filename for csv output
      --pickle PICKLE    A filename for Python .pickle output
      --tmp_xml TMP_XML  A temporary XML file (output of pdftohtml)
      -v                 Increase the verbosity level
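
To process several files or pages from another Python script, the command line
above can be assembled programmatically. The following is a sketch only: it
uses just the options listed in the help text, while the helper names
build_command and run_scraper, the default script path, and any file names
passed in are hypothetical. The help text does not say whether page numbers
start at 0 or 1, so check that before relying on --page.

```python
import subprocess


def build_command(pdf_path, page=None, vskip=None, html=None, csv=None,
                  pickle=None, script="./pdf_table_scraper.py"):
    """Assemble an argument list from the options shown in the help text."""
    command = [script]
    options = [("--page", page), ("--vskip", vskip), ("--html", html),
               ("--csv", csv), ("--pickle", pickle)]
    for flag, value in options:
        # Only pass the options the caller actually set
        if value is not None:
            command += [flag, str(value)]
    # The PDF file is the single positional argument
    command.append(pdf_path)
    return command


def run_scraper(pdf_path, **options):
    """Run the script; check=True raises CalledProcessError on a failing exit."""
    return subprocess.run(build_command(pdf_path, **options), check=True)
```

For example, build_command("report.pdf", page=3, csv="out.csv") yields the
argument list for ./pdf_table_scraper.py --page 3 --csv out.csv report.pdf.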