Document AI PDF Splitter Sample

NOTE: This sample is deprecated. Use Document AI Toolbox to split PDFs based on output from a Splitter/Classifier processor.

This project uses Document AI Splitter/Classifier processors to identify split points, then uses PikePDF to split the PDF documents at those points.

Designed to work with the following processors:

For more information about Document AI splitters, see Document splitters behavior.
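The splitting step itself can be sketched as follows. This is a minimal illustration, not the sample's actual code: the helper names `output_name` and `split_pdf` are hypothetical, the `.pdf` extension and the single-page `pgN` form are assumptions, and the page indexes are assumed to be zero-based as they come back from the processor. Only the `pgFIRST-LAST_TYPE_SOURCE` naming pattern is taken from the Quick start section.

```python
from pathlib import Path


def output_name(pages: list[int], doc_type: str, source: str) -> str:
    """Build a name such as pg1-2_1040sc_2020_multi_document.pdf.

    `pages` holds zero-based page indexes; the name uses one-based numbers.
    """
    first, last = pages[0] + 1, pages[-1] + 1
    span = f"pg{first}" if first == last else f"pg{first}-{last}"
    return f"{span}_{doc_type}_{Path(source).stem}.pdf"


def split_pdf(
    source: str, parts: list[tuple[str, list[int]]], out_dir: str = "."
) -> list[str]:
    """Write one PDF per (doc_type, zero-based page indexes) pair."""
    # Imported lazily so output_name can be used without pikepdf installed.
    import pikepdf

    written = []
    # Keep the source open while saving: copied pages still reference it.
    with pikepdf.Pdf.open(source) as src:
        for doc_type, pages in parts:
            dst = pikepdf.Pdf.new()
            for index in pages:
                dst.pages.append(src.pages[index])
            target = str(Path(out_dir) / output_name(pages, doc_type, source))
            dst.save(target)
            written.append(target)
    return written
```

For example, `split_pdf("multi_document.pdf", [("1040sc_2020", [0, 1])])` would write the first two pages to `pg1-2_1040sc_2020_multi_document.pdf` in the current directory.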

Quick start

  1. Install Python
  2. Install the prerequisites: pip install -r requirements.txt
  3. Install the Google Cloud SDK
  4. Run gcloud init, create a new project, and enable billing
  5. Enable the Document AI API: gcloud services enable documentai.googleapis.com
  6. Set up application default authentication: gcloud auth application-default login
  7. Run the sample: python main.py -i multi_document.pdf
    • The split sub-documents should appear in your current directory with file names like pg1-2_1040sc_2020_multi_document.
    • The raw Document output from Document AI should also appear in a JSON file, multi_document.json.
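To inspect the split points recorded in the raw output, something like the following may help. It is a sketch under assumptions: it expects the JSON to use the camelCase field names of the Document protobuf JSON mapping (`entities`, `type`, `pageAnchor`, `pageRefs`, `page`), with page indexes serialized as strings and a zero index possibly omitted. The function names are hypothetical and the sample may write its JSON differently.

```python
import json


def split_points(document: dict) -> list[tuple[str, list[int]]]:
    """Return (document type, zero-based page indexes) for each entity."""
    parts = []
    for entity in document.get("entities", []):
        refs = entity.get("pageAnchor", {}).get("pageRefs", [])
        # A missing "page" key is treated as page 0 (assumed default value).
        pages = sorted(int(ref.get("page", 0)) for ref in refs)
        if pages:
            parts.append((entity.get("type", "unknown"), pages))
    return parts


def load_split_points(path: str) -> list[tuple[str, list[int]]]:
    """Read a Document JSON file and extract its split points."""
    with open(path, encoding="utf-8") as handle:
        return split_points(json.load(handle))
```

Calling `load_split_points("multi_document.json")` after running the sample would list each detected sub-document with the pages it covers.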

Setup

Install dependencies

  1. Install pyenv: https://github.com/pyenv/pyenv#installation
  2. Use pyenv to install the latest version of Python 3. For example, to install Python 3.10.1, run: pyenv install 3.10.1
  3. Create a Python virtual environment with the installed version of Python 3. For example, to create a Python 3.10.1 virtual environment called docai-splitter, run: pyenv virtualenv 3.10.1 docai-splitter
  4. Clone this repo and cd to the root of the repo
  5. Configure pyenv to use the virtual environment created in step 3 whenever you are in this repo: pyenv local docai-splitter
  6. Install the prerequisites: pip install -r requirements.txt

Set up Google Cloud

  1. Install the Cloud SDK: https://cloud.google.com/sdk/docs/install
  2. Run gcloud init to create a new project, then link a billing account to the project
  3. Enable the Document AI API: gcloud services enable documentai.googleapis.com
  4. Set up application default authentication: gcloud auth application-default login

Running the sample

  1. Run the sample: python main.py -i multi_document.pdf
  2. Check that the PDFs created in the current directory are sub-documents of multi_document.pdf.
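For orientation, the request that a script like this sends to a Splitter/Classifier processor can be sketched with the google-cloud-documentai client library. This is an illustration under assumptions rather than the contents of main.py: `project_id`, `location`, and `processor_id` are placeholders you would supply, and `api_endpoint` and `process_pdf` are hypothetical helper names.

```python
def api_endpoint(location: str) -> str:
    """Regional Document AI endpoint, for example us-documentai.googleapis.com."""
    return f"{location}-documentai.googleapis.com"


def process_pdf(project_id: str, location: str, processor_id: str, pdf_path: str):
    """Send a local PDF to a processor and return the resulting Document."""
    # Imported lazily so api_endpoint works without the client library installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=api_endpoint(location))
    )
    name = client.processor_path(project_id, location, processor_id)

    with open(pdf_path, "rb") as handle:
        raw_document = documentai.RawDocument(
            content=handle.read(), mime_type="application/pdf"
        )

    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    # Uses the application default credentials configured during setup.
    return client.process_document(request=request).document
```

Each entity in the returned document's `entities` field describes one detected sub-document: `entity.type_` holds its class and `entity.page_anchor.page_refs` holds the pages it spans.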

Testing

Linting

  1. Install dependencies:

    pip install -U pylint
  2. Run the linter:

    pylint *.py

Unit tests

  1. Run the unit tests: python main_test.py

Manual

  1. Run the sample: python main.py -i multi_document.pdf
  2. Check that the PDFs created in the current directory are sub-documents of multi_document.pdf.