A Streamlit web application that uses PyOCR to extract text from images and PDFs.
Before running this application, you need to install Tesseract-OCR and Poppler on your system:
On Ubuntu/Debian:
sudo apt-get update
sudo apt-get install tesseract-ocr poppler-utils
On Windows:
- Download and install Tesseract-OCR from: https://github.com/UB-Mannheim/tesseract/wiki
- Download and install Poppler from: http://blog.alivate.com.au/poppler-windows/
- Add both Tesseract and Poppler to your system PATH
On macOS (with Homebrew):
brew install tesseract poppler
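After installing the prerequisites above, it can help to confirm that both tools are actually reachable from your PATH. The following is a minimal sketch, not part of the application itself. The executable names are assumptions: `tesseract` for Tesseract-OCR and `pdftoppm` (one of the Poppler command-line tools) for Poppler. `find_missing_tools` is a hypothetical helper name.

```python
import shutil

# Assumed executable names: `tesseract` is the Tesseract CLI, and `pdftoppm`
# is one of the command-line tools shipped with Poppler.
REQUIRED_TOOLS = {"Tesseract-OCR": "tesseract", "Poppler": "pdftoppm"}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the display names of tools whose executable is not on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("Tesseract-OCR and Poppler were both found on PATH.")
```

If either tool is reported as missing on Windows, the most likely cause is that its `bin` directory has not been added to PATH, or that the terminal was opened before PATH was changed.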
Installation:
- Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install the required Python packages:
pip install -r requirements.txt
Usage:
- Activate your virtual environment if not already activated:
source venv/bin/activate # On Windows: venv\Scripts\activate
- Run the Streamlit app:
streamlit run app.py
- Open your web browser and navigate to the URL shown in the terminal (typically http://localhost:8501)
To extract text from an image:
- Upload an image file (PNG, JPG, or JPEG)
- Click "Extract Text" to process the image
- View the extracted text
- Download the text if needed
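The image workflow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `app.py`: the helper names `is_supported_image` and `extract_text_from_image` are hypothetical, Pillow is assumed for loading the image, and the first OCR tool that PyOCR detects is assumed to be Tesseract.

```python
import io
from pathlib import Path

# File types listed in this README for image uploads.
SUPPORTED_IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def is_supported_image(filename):
    """Return True if the file name has a supported image extension."""
    return Path(filename).suffix.lower() in SUPPORTED_IMAGE_SUFFIXES


def extract_text_from_image(image_bytes, lang="eng"):
    """Run OCR on raw image bytes and return the recognized text."""
    # Imported lazily so the helper above works without the OCR stack.
    import pyocr
    import pyocr.builders
    from PIL import Image

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found; is Tesseract installed and on PATH?")
    tool = tools[0]  # assumption: the first detected tool is Tesseract

    image = Image.open(io.BytesIO(image_bytes))
    return tool.image_to_string(
        image, lang=lang, builder=pyocr.builders.TextBuilder()
    )
```

In a Streamlit app, the bytes would typically come from `st.file_uploader(...)` via the uploaded file's `getvalue()` method, and the returned string could be passed to `st.download_button(...)` to offer the text as a download.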
To extract text from a PDF:
- Upload a PDF file
- Select a specific page to process or choose to process all pages
- Click "Extract Text" to process the selected page or "Process All Pages" to extract text from all pages
- View the extracted text
- Download the text (individual page or all pages)
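The PDF workflow above adds a conversion step: each page is rendered to an image before OCR. The sketch below shows one way this could look. It assumes the `pdf2image` package (a Poppler wrapper) performs the rendering, which this README does not confirm; the function names and the page header format in the combined text are hypothetical.

```python
def pages_to_process(page_count, selected_page=None):
    """Return 1-based page numbers: one selected page, or every page."""
    if selected_page is None:
        return list(range(1, page_count + 1))
    if not 1 <= selected_page <= page_count:
        raise ValueError(f"Page {selected_page} is outside 1..{page_count}")
    return [selected_page]


def extract_text_from_pdf(pdf_bytes, page_numbers, ocr_page, dpi=200):
    """Render each requested page to an image and OCR it.

    `ocr_page` is any callable that maps a PIL image to a string.
    """
    # Assumption: pdf2image is installed; it needs Poppler on PATH.
    from pdf2image import convert_from_bytes

    texts = {}
    for number in page_numbers:
        # first_page and last_page are 1-based and inclusive, so this
        # renders exactly one page per call.
        images = convert_from_bytes(
            pdf_bytes, dpi=dpi, first_page=number, last_page=number
        )
        texts[number] = ocr_page(images[0])
    return texts


def combine_pages(texts):
    """Join per-page text into one string, with a header before each page."""
    return "\n\n".join(
        f"--- Page {number} ---\n{texts[number]}" for number in sorted(texts)
    )
```

Rendering one page per call keeps memory use low for long documents, at the cost of invoking Poppler once per page. The page count needed by `pages_to_process` could be read with `pdf2image.pdfinfo_from_bytes`, which returns a dictionary that includes a `"Pages"` entry.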
Features:
- Image upload support (PNG, JPG, JPEG)
- PDF upload support with multi-page processing
- Page selection for PDFs
- Real-time text extraction
- Download extracted text
- User-friendly interface
- Support for multiple file formats (PNG, JPG, JPEG, and PDF)
Notes:
- For best results, use clear, well-lit images with good contrast
- The application uses English by default for OCR
- Processing time may vary depending on file size and complexity
- PDF processing may take longer than image processing due to the conversion step
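Since English is the default OCR language, recognizing other languages would require both the matching Tesseract language data and a way to select it. The helper below is a hypothetical sketch of that selection logic, not something the application is known to implement. It assumes Tesseract's three-letter language codes, such as `eng` or `deu`.

```python
DEFAULT_LANGUAGE = "eng"  # the README states English is the default


def choose_language(available, preferred=DEFAULT_LANGUAGE):
    """Pick an OCR language code from those installed.

    Returns `preferred` when it is installed, otherwise falls back to
    English, and raises if neither is available.
    """
    if preferred in available:
        return preferred
    if DEFAULT_LANGUAGE in available:
        return DEFAULT_LANGUAGE
    raise ValueError(
        f"Neither {preferred!r} nor {DEFAULT_LANGUAGE!r} is installed"
    )
```

With PyOCR, the list of installed languages can be obtained from the detected tool through `tool.get_available_languages()`, and the chosen code can then be passed as the `lang` argument of `tool.image_to_string(...)`.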