content = pdf2text.extract() taking a lot of time before crashing colab #4

Closed
mobassir94 opened this issue Jan 24, 2022 · 1 comment

mobassir94 commented Jan 24, 2022

Thank you for making this awesome library. I am trying to build a Bengali tafsir reader using your repository.
Here is the code that I tried in Colab:

!pip install gTTS
#!pip install PyPDF2
!pip install playsound
!pip install multilingual-pdf2text==1.1.0
!apt install tesseract-ocr
!apt install libtesseract-dev
!apt-get install poppler-utils 

!apt-get install tesseract-ocr-ara
!apt-get install tesseract-ocr-ben

from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging
logging.basicConfig(level=logging.INFO)


def main():
    ## create document for extraction with configurations
    pdf_document = Document(
        document_path='/content/tafsir.pdf',
        language='ben'
        )
    pdf2text = PDF2Text(document=pdf_document)
    content = pdf2text.extract()
    for page in content:
        print(page['text'])

if __name__ == "__main__":
    main()

It takes a long time and appears to be stuck after printing this:

INFO:multilingual_pdf2text.doc2img.parse_document:Parsing document from pdf to image
INFO:multilingual_pdf2text.ocr.image_to_text:Extracting text from images via OCR

After a few minutes Colab crashes. It seems the notebook crashes after exhausting all of Colab's available RAM.
The PDF book that I am trying to read with this library is written in Bangla and Arabic. Here is the link to the PDF: https://i-onlinemedia.net/downloads/books/quran-tafsir/tafsir_ibn_kasir/Tafsir_Ibn_Kasir_Part-1-2-3.pdf
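
If the crash really is caused by every page image being held in memory at once, one possible workaround is to convert and OCR the PDF a few pages at a time. The sketch below is not this library's API: it bypasses multilingual-pdf2text and calls the pdf2image and pytesseract packages directly, assuming both are installed along with poppler-utils and the `ben` and `ara` Tesseract language packs. The helper names `page_batches` and `ocr_pdf_in_batches`, the batch size, and the DPI value are hypothetical choices for illustration.

```python
from typing import Iterator, Tuple


def page_batches(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (first_page, last_page) ranges covering the document."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_pdf_in_batches(
    pdf_path: str,
    lang: str = "ben+ara",   # assumption: Bengali and Arabic packs are installed
    batch_size: int = 5,     # assumption: small enough to fit in Colab RAM
    dpi: int = 200,          # assumption: lower DPI trades accuracy for memory
) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, text) pairs, rendering only one batch of pages at a time."""
    # Imported lazily so that page_batches can be used without these packages.
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_batches(total_pages, batch_size):
        # Only the pages in this range are rendered to images.
        images = convert_from_path(
            pdf_path, dpi=dpi, first_page=first, last_page=last
        )
        for offset, image in enumerate(images):
            yield first + offset, pytesseract.image_to_string(image, lang=lang)
        # Drop the batch so its images can be garbage-collected before the next one.
        del images
```

With this, looping over `ocr_pdf_in_batches('/content/tafsir.pdf')` and printing each page's text should keep at most `batch_size` page images in memory at a time. OCR on a book of this size will still be slow, so testing on a short page range first is probably worthwhile.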
