A hacky workspace for developing workflows to analyze a massive (80k+) collection of PDF documents.
The idea: point the tool at a massive directory of PDF files and extract data from it.
This project starts off with a few hacky Python experiments, but the eventual goal is to build an app that is worthy of the collection.
Different approaches will be taken towards mining the data: static analysis, text extraction, and eventual use in ML.
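The real pipeline lives in py/process_pdfs.py. Purely as an illustration of the core idea, here is a minimal sketch that walks a directory and pulls text out of each PDF; it assumes pypdf, which is an assumption for the example, not necessarily the library this repo uses.

```python
from pathlib import Path

from pypdf import PdfReader  # assumption: any PDF text-extraction library would do


def extract_collection(root: str) -> dict[str, str]:
    """Walk a directory tree and return {pdf path: extracted text}."""
    texts: dict[str, str] = {}
    for pdf_path in Path(root).rglob("*.pdf"):
        try:
            reader = PdfReader(pdf_path)
            texts[str(pdf_path)] = "\n".join(
                page.extract_text() or "" for page in reader.pages
            )
        except Exception as err:  # corrupt files are common in large collections
            print(f"skipping {pdf_path}: {err}")
    return texts


if __name__ == "__main__":
    texts = extract_collection("archive/testCollection")
    print(f"extracted text from {len(texts)} PDFs")
```

Unlike this in-memory sketch, the actual workflow persists its results under data/ (see process.sh below).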
.
├── LICENSE
├── README.md
├── archive
│   └── testCollection
├── data
├── pdf_tool_env
├── py
│   ├── process_pdfs.py
│   └── query_pdfs.py
├── process.sh
├── query.sh
└── tools.sh
- tools.sh is used to set up the tooling for the project, including a pyenv environment.
- For now, Python 3.11 is required on macOS and Linux. Other Python versions may get a pyenv configuration when needed.
- The archive/testCollection/ and data/ directories are ignored via .gitignore!
- Put your own test PDF collection in archive/testCollection/
- process.sh/query.sh will depend on the contents of data/
- Do not commit your data to the repo - be sure to inspect .gitignore
- A Rust toolchain (at least rustc>1.7) needs to be installed for the tokenizer packages installed by pip; see the preflight sketch after this list.
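As a quick sanity check of the requirements above, something like the following hypothetical helper (not part of the repo; tools.sh remains the supported setup path) can verify the interpreter version and the Rust toolchain before installing packages:

```python
import shutil
import subprocess
import sys


def check_prereqs() -> None:
    """Hypothetical preflight check mirroring the requirements listed above."""
    if sys.version_info[:2] != (3, 11):
        sys.exit(f"Python 3.11 required, found {sys.version.split()[0]}")
    rustc = shutil.which("rustc")
    if rustc is None:
        sys.exit("rustc not found; the tokenizer packages installed by pip need a Rust toolchain")
    version = subprocess.run([rustc, "--version"], capture_output=True, text=True).stdout.strip()
    print(f"found {version}")


if __name__ == "__main__":
    check_prereqs()
```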
To do a test run, first populate the archive/testCollection/ directory with some PDFs of your own. You can use the populate.sh script for this; it will copy 50 random files from a directory of your choice.
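populate.sh is the supported way to do this. Purely for illustration, the same idea in Python would be roughly the following (the populate helper below is hypothetical, not part of the repo):

```python
import random
import shutil
from pathlib import Path


def populate(source_dir: str, dest_dir: str = "archive/testCollection", n: int = 50) -> None:
    """Copy up to n randomly chosen PDFs from source_dir into the test collection."""
    pdfs = list(Path(source_dir).rglob("*.pdf"))
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    for pdf in random.sample(pdfs, min(n, len(pdfs))):
        shutil.copy2(pdf, dest_dir)


if __name__ == "__main__":
    populate("/path/to/your/pdf/collection")
```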
Once you've put some files in archive/testCollection/, you can run the following:
sh process.sh && sh query.sh
This will process the PDFs in the archive and then give you a REPL-like query interface for searching text in the database created by process.sh.
Note that there are still bugs in the indexing to be fixed, but limited results can already be obtained with this workflow.
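For orientation, here is a minimal sketch of what the query side could look like. It assumes the database is a SQLite file under data/ with an FTS5 table named pages(path, text); both the path and the schema are assumptions for this example, and the actual ones are defined by py/process_pdfs.py and py/query_pdfs.py.

```python
import sqlite3


def query_repl(db_path: str = "data/pdfs.db") -> None:
    """REPL-like loop: read a search term, print matching paths with snippets."""
    con = sqlite3.connect(db_path)
    while True:
        term = input("query> ").strip()
        if not term or term in {"quit", "exit"}:
            break
        rows = con.execute(
            "SELECT path, snippet(pages, 1, '[', ']', '...', 10) "
            "FROM pages WHERE pages MATCH ? LIMIT 10",
            (term,),
        ).fetchall()
        for path, snip in rows:
            print(f"{path}: {snip}")
    con.close()


if __name__ == "__main__":
    query_repl()
```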