- First make sure PyTorch 1.7.1 (or later) and torchvision are installed.
- `pip install git+https://github.com/openai/CLIP.git` - OpenAI's CLIP model for matching text with images
- `pip install numpy pandas ftfy regex tqdm PyPDF2 python-dotenv openai`
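
To confirm the PyTorch requirement before going further, a small check along the following lines may help. It is a minimal sketch, not part of this repository: the helper names (`parse_version`, `installed_version`, `torch_is_recent_enough`) are hypothetical, and it only reads the installed package metadata rather than importing `torch`.

```python
import re
from importlib import metadata

# Minimum PyTorch version named in the instructions above.
MIN_TORCH = (1, 7, 1)


def parse_version(text):
    """Turn a string such as '1.13.1+cu117' into (1, 13, 1).

    Local build tags after '+' and non-numeric suffixes are ignored.
    """
    core = text.split("+")[0]
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def torch_is_recent_enough(version_text, minimum=MIN_TORCH):
    """True when version_text is at least the required minimum."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    for name in ("torch", "torchvision"):
        print(name, installed_version(name) or "NOT INSTALLED")
    torch_version = installed_version("torch")
    if torch_version and not torch_is_recent_enough(torch_version):
        print("PyTorch is older than 1.7.1; please upgrade.")
```
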
- Setup pdf2image. Instructions given here:

  Linux and macOS
  - Set up poppler using the instructions given in https://pdf2image.readthedocs.io/en/latest/installation.html
  - `pip install pdf2image`

  Windows
  - Download the latest poppler package from https://github.com/oschwartz10612/poppler-windows/releases/, which is the most up-to-date build.
  - Move the extracted directory to the desired place on your system
  - Add the `bin/` directory to your PATH
  - Test that all went well by opening cmd and making sure that you can call `pdftoppm -h`
  - If it is still not working, point the `poppler_path` argument to the `\bin` folder, as is already done inside the file.
  - `pip install pdf2image`
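
The PATH check and the `poppler_path` fallback described above could be sketched roughly as follows. This is an illustration under assumptions, not the code used in this repository: `resolve_poppler_path` and `pdf_to_images` are hypothetical helpers, and the Windows directory is a placeholder to replace with wherever you extracted poppler. `convert_from_path` and its `poppler_path` argument come from pdf2image.

```python
import shutil


def resolve_poppler_path(fallback_dir, which=shutil.which):
    """Return None if pdftoppm is already on PATH, else the fallback bin folder.

    pdf2image searches PATH on its own when poppler_path is None.
    """
    if which("pdftoppm"):
        return None
    return fallback_dir


def pdf_to_images(pdf_path, fallback_dir=r"C:\path\to\poppler\bin"):
    """Convert a PDF into a list of PIL images, one per page.

    fallback_dir is a placeholder path (an assumption for this sketch).
    """
    # Imported here so the helper above can be used without pdf2image installed.
    from pdf2image import convert_from_path

    return convert_from_path(
        pdf_path, poppler_path=resolve_poppler_path(fallback_dir)
    )
```

Calling `pdf_to_images("sample.pdf")` would then use the poppler found on PATH when available and the explicit folder otherwise.
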
- Setup pytesseract. Instructions given here:

  Linux and macOS
  - Set up the latest version of Tesseract OCR (5+), which pytesseract wraps, using https://studysection.com/blog/quick-guide-to-install-and-remove-tesseract-ocr-5-on-ubuntu-18-04/
  - Make sure the correct tesseract language packages are installed for your use. Helpful guide: https://ocrmypdf.readthedocs.io/en/latest/languages.html

  Windows
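
Regardless of platform, one way to confirm that the language packages you need are present is to parse the output of `tesseract --list-langs`. The sketch below is only an illustration: the helper names are hypothetical, the language codes in the usage example are placeholders, and it assumes the command prints a header line starting with "List of" followed by one language code per line.

```python
import shutil
import subprocess


def parse_language_list(output):
    """Extract language codes from `tesseract --list-langs` output."""
    codes = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header (assumed to start with "List of").
        if not line or line.startswith("List of"):
            continue
        codes.add(line)
    return codes


def missing_languages(required, available):
    """Return the required codes that are not installed, sorted."""
    return sorted(set(required) - set(available))


def installed_languages():
    """Ask the tesseract executable which languages it can see."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some older releases write the list to stderr instead of stdout.
    return parse_language_list(result.stdout or result.stderr)


if __name__ == "__main__":
    needed = ["eng"]  # placeholder: list the languages your documents use
    print("Missing:", missing_languages(needed, installed_languages()))
```
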