PDF table scraper
-----------------

This script attempts to extract the data of a table from a PDF file. It treats every page of the PDF as a single table and attempts to make sense of it. The output should be much easier to parse and reasonably clean, but the results still need to be checked manually.

It currently exports the data as .html (for visualization), and as .csv or a Python pickle for reuse in another script.

    ~/pdf_table_scraper$ ./pdf_table_scraper.py -h
    usage: pdf_table_scraper.py [-h] [--vskip VSKIP] [--page PAGE] [--html HTML]
                                [--csv CSV] [--pickle PICKLE] [--tmp_xml TMP_XML]
                                [-v]
                                filename

    Extracts a table from a .pdf file

    positional arguments:
      filename           the .pdf file

    optional arguments:
      -h, --help         show this help message and exit
      --vskip VSKIP      max vertical space between consecutive lines in the same
                         paragraph (usually ~8)
      --page PAGE        run the script on a specific page
      --html HTML        A filename for html output
      --csv CSV          A filename for csv output
      --pickle PICKLE    A filename for Python .pickle output
      --tmp_xml TMP_XML  A temporary XML file (output of pdftohtml)
      -v                 Increase the verbosity level
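
Since the .csv output is meant for reuse in another script, here is a minimal sketch of how such a script might read it back. It assumes the file is ordinary comma-separated text encoded as UTF-8, which this README does not confirm. The helper names `load_table` and `drop_empty_rows` are hypothetical and are not part of pdf_table_scraper.py.

```python
import csv


def load_table(path):
    """Read a CSV file into a list of rows, each row a list of strings.

    Assumption: the scraper's CSV output is plain comma-separated UTF-8 text.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return [row for row in csv.reader(handle)]


def drop_empty_rows(rows):
    """Discard rows whose cells are all blank.

    This is an illustrative cleanup step for the manual check,
    not something the scraper itself is known to do.
    """
    return [row for row in rows if any(cell.strip() for cell in row)]
```

For example, after running the script with `--csv table.csv` (the file name is illustrative), `drop_empty_rows(load_table("table.csv"))` returns the remaining rows as lists of strings, ready for manual inspection or further processing.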