A powerful tool for extracting text from various file formats including images, PDFs, Word documents, and Excel spreadsheets. Available as both an Electron desktop application and a Python web application.
- Extract text from multiple file formats (PNG, JPG, PDF, DOCX, XLSX, etc.)
- Copy extracted text to clipboard
- Cross-platform support
- Native desktop experience
- System tray integration
- File association support
- Accessible from any device on your network
- No installation required (just a web browser)
- Responsive design for mobile and desktop
- Python 3.8 or higher
- pip (Python package manager)
- Tesseract OCR (required for image text extraction)
  - Windows: download the installer from UB Mannheim, run it, then add Tesseract to your system PATH or update the path in the script
  - macOS: brew install tesseract
  - Linux (Debian/Ubuntu): sudo apt-get install tesseract-ocr
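Once Tesseract is installed, it can help to confirm that the executable is actually reachable before running any OCR. The following is a minimal sketch that assumes the binary is named `tesseract` and sits on PATH; the helper names `find_tesseract` and `tesseract_version` are hypothetical and not part of this project.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path to the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line of `<executable> --version`, or None if it cannot be run."""
    try:
        result = subprocess.run(
            [executable, "--version"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    lines = output.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract was not found on PATH")
    else:
        print(f"Found Tesseract at {path}: {tesseract_version(path)}")
```

If the lookup prints "not found" even though Tesseract is installed, the install directory is probably missing from PATH.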
- Clone the repository
  git clone https://github.com/sajidsdk19/Extracting-Text-From-Image-Document-Sheet.git
  cd Extracting-Text-From-Image-Document-Sheet
- Create and activate a virtual environment (recommended)
  # On Windows
  python -m venv venv
  .\venv\Scripts\activate
  # On macOS/Linux
  python3 -m venv venv
  source venv/bin/activate
- Install the required Python packages
  pip install -r requirements.txt
# Basic usage
python text_extractor.py path/to/your/file
# Save output to a file
python text_extractor.py input.pdf -o output.txt
# Copy text to clipboard
python text_extractor.py input.png --clipboard
# Show help
python text_extractor.py --help
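The commands above suggest a positional input path plus an `-o` option and a `--clipboard` switch. As a rough illustration of how such a command line could be parsed with `argparse`, here is a sketch; it is not taken from `text_extractor.py`, and the long option name `--output` and the function `build_parser` are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        prog="text_extractor.py",
        description="Extract text from a file.",
    )
    parser.add_argument("input", help="path to the file to process")
    # "--output" is an assumed long form; the usage examples only show "-o".
    parser.add_argument("-o", "--output", help="write the extracted text to this file")
    parser.add_argument(
        "--clipboard",
        action="store_true",
        help="copy the extracted text to the clipboard",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input, args.output, args.clipboard)
```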
- Images: .png, .jpg, .jpeg (using OCR)
- Documents: .pdf, .docx
- Spreadsheets: .xlsx
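To check up front whether a file falls into one of the categories listed above, the extension can be looked up in a small table. This is only a sketch of that idea: the names `SUPPORTED_EXTENSIONS` and `classify_file` are hypothetical, and the project's own dispatch logic may differ.

```python
from pathlib import Path

# Extensions taken from the supported formats list above.
SUPPORTED_EXTENSIONS = {
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".pdf": "document",
    ".docx": "document",
    ".xlsx": "spreadsheet",
}


def classify_file(path):
    """Return 'image', 'document', or 'spreadsheet', or None if the extension is not listed."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively, so ".PNG" matches ".png"
    return SUPPORTED_EXTENSIONS.get(suffix)


if __name__ == "__main__":
    for name in ["scan.PNG", "report.docx", "data.xlsx", "notes.txt"]:
        print(name, "->", classify_file(name))
```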
You can also use the text extractor as a Python module:
from text_extractor import extract_text_from_file
text = extract_text_from_file('document.pdf')
print(text)
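When several files need processing, the module call above can be wrapped in a loop that collects results and failures separately. The sketch below takes the extractor as a parameter so it runs without the project installed; `extract_many` is a hypothetical helper, and it assumes the extractor raises an exception on failure, which the source does not confirm.

```python
def extract_many(paths, extractor):
    """Run `extractor` on each path and return (texts, errors) keyed by path."""
    texts, errors = {}, {}
    for path in paths:
        key = str(path)
        try:
            texts[key] = extractor(key)
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            errors[key] = str(exc)
    return texts, errors


# Possible usage with this project's module (file names are placeholders):
# from text_extractor import extract_text_from_file
# texts, errors = extract_many(["document.pdf", "scan.png"], extract_text_from_file)
```

Passing the extractor in as an argument also makes the loop easy to exercise with a stand-in function.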
- Install Python Dependencies:
  python -m pip install -r requirements.txt
- Start the Web Server:
  python app.py
- Access the Web Interface: Open your browser and go to http://127.0.0.1:5000
- Navigate to the Electron app directory:
  cd electron-app
- Install Node.js Dependencies:
  npm install
- Start the Electron App:
  npm start
- Upload a file by drag-and-drop or file browser
- Click Extract Text to process
- Use the Copy button to copy extracted text
- Click Choose File to select a document
- Click Extract to process the file
- Copy the extracted text using the Copy button
Extracting-Text-From-Image-Document-Sheet/
├── electron-app/ # Electron GUI application
├── templates/ # Web interface templates
├── app.py # Flask web server
├── create_test_image.py # Utility to create test images
├── requirements.txt # Python dependencies
├── text_extractor.py # Main text extraction module
└── README.md # This file
- Tesseract not found
  - Ensure Tesseract is installed and added to your system PATH
  - Or update the path in text_extractor.py
- Missing Dependencies
  pip install -r requirements.txt
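To narrow down which dependency is missing before reinstalling, one option is to ask Python whether each module can be located. This is a small sketch; `missing_modules` is a hypothetical helper, and the module names to check are assumptions, since the contents of requirements.txt are not shown here. Note that import names can differ from pip package names.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in the current environment."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    # "flask" is assumed here because app.py is described as a Flask web server.
    print(missing_modules(["flask"]))
```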
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.