Text Extractor

Text Extractor pulls text out of images, PDFs, Word documents, and Excel spreadsheets. It is available as a command-line tool, a Python web application, and an Electron desktop application.

Features

Common Features

  • Extract text from multiple file formats (PNG, JPG/JPEG, PDF, DOCX, XLSX)
  • Copy extracted text to clipboard
  • Runs on Windows, macOS, and Linux

Electron App Features

  • Native desktop experience
  • System tray integration
  • File association support

Web Interface Features

  • Accessible from any device on your network
  • No installation required (just a web browser)
  • Responsive design for mobile and desktop

Prerequisites

System Requirements

  • Python 3.8 or higher
  • pip (Python package manager)
  • Tesseract OCR (required for image text extraction)

Installing Tesseract OCR

Windows:

  1. Download the Tesseract installer for Windows from UB Mannheim
  2. Run the installer
  3. Add Tesseract to your system PATH, or update the Tesseract path in text_extractor.py

macOS:

brew install tesseract

Linux (Ubuntu/Debian):

sudo apt-get install tesseract-ocr
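
After installing, it can help to confirm that the `tesseract` executable is actually reachable before running the extractor. The following is a minimal sketch using only the Python standard library; the helper names `find_tesseract` and `tesseract_version` are illustrative and are not part of this project.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract(command: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(command)


def tesseract_version(executable: str) -> Optional[str]:
    """Return the first line printed by `<executable> --version`, or None on failure."""
    try:
        result = subprocess.run(
            [executable, "--version"],
            capture_output=True,
            text=True,
            check=False,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The executable is missing, not runnable, or did not answer in time.
        return None
    # Some Tesseract builds print the version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract was not found on PATH")
    else:
        print(f"Found Tesseract at {path}: {tesseract_version(path)}")
```

If the check reports that Tesseract is missing, revisit the PATH step for your platform above.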

Installation

  1. Clone the repository

    git clone https://github.com/sajidsdk19/Extracting-Text-From-Image-Document-Sheet.git
    cd Extracting-Text-From-Image-Document-Sheet
  2. Create and activate a virtual environment (recommended)

    # On Windows
    python -m venv venv
    .\venv\Scripts\activate
    
    # On macOS/Linux
    python3 -m venv venv
    source venv/bin/activate
  3. Install the required Python packages

    pip install -r requirements.txt
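
If you are unsure whether the virtual environment from step 2 is active, the interpreter itself can tell you. This is a small sketch that relies on the standard `sys.prefix` and `sys.base_prefix` attributes, which differ inside an environment created with `python -m venv`; the function name `in_virtual_environment` is made up for illustration.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtual_environment():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment detected; packages will install globally")
```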

Usage

Command Line Interface

# Basic usage
python text_extractor.py path/to/your/file

# Save output to a file
python text_extractor.py input.pdf -o output.txt

# Copy text to clipboard
python text_extractor.py input.png --clipboard

# Show help
python text_extractor.py --help
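
For readers who want to see how a command line like the one above could be parsed, here is a minimal `argparse` sketch. It is not the project's actual parser: the long option `--output`, the help strings, and the function name `build_parser` are assumptions; only the positional file argument, `-o`, and `--clipboard` come from the examples above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        prog="text_extractor.py",
        description="Extract text from an image, PDF, Word, or Excel file.",
    )
    parser.add_argument("file", help="path to the input file")
    # "-o" appears in the README; the long form "--output" is an assumption.
    parser.add_argument("-o", "--output", help="write the extracted text to this file")
    parser.add_argument(
        "--clipboard",
        action="store_true",
        help="copy the extracted text to the clipboard",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```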

Supported File Formats

  • Images: .png, .jpg, .jpeg (using OCR)
  • Documents: .pdf, .docx
  • Spreadsheets: .xlsx
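
The format list above amounts to a lookup from file extension to extraction strategy. The sketch below shows one way to express that lookup; the names `SUPPORTED_FORMATS` and `classify_file` are hypothetical, and the real module may dispatch differently.

```python
from pathlib import Path

# Extension-to-category table taken from the list above.
SUPPORTED_FORMATS = {
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".pdf": "document",
    ".docx": "document",
    ".xlsx": "spreadsheet",
}


def classify_file(path: str) -> str:
    """Return the category for a path, or raise ValueError if it is unsupported."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively: ".JPG" == ".jpg"
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported file format: {suffix or '(no extension)'}")
    return SUPPORTED_FORMATS[suffix]


if __name__ == "__main__":
    for name in ("scan.JPG", "report.pdf", "budget.xlsx"):
        print(name, "->", classify_file(name))
```

Checking the extension up front lets a caller reject unsupported files before any OCR or parsing work starts.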

Python Module

You can also use the text extractor as a Python module:

from text_extractor import extract_text_from_file

text = extract_text_from_file('document.pdf')
print(text)
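
When processing several files, it is often useful to keep going after one file fails. The helper below is a sketch of that pattern; `extract_many` is a hypothetical name, and it takes the extraction function as a parameter because the exceptions raised by `extract_text_from_file` are not documented here.

```python
from typing import Callable, Dict, Iterable, Tuple


def extract_many(
    paths: Iterable[str],
    extractor: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run `extractor` on each path and return (texts, errors), both keyed by path."""
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        key = str(path)
        try:
            texts[key] = extractor(key)
        except Exception as exc:  # broad on purpose: the failure modes are unknown
            errors[key] = f"{type(exc).__name__}: {exc}"
    return texts, errors
```

Assuming the import shown above works in your environment, a call would look like `texts, errors = extract_many(["a.pdf", "b.png"], extract_text_from_file)`.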

Installation Methods

Option 1: Python Web Interface

  1. Install Python Dependencies:

    python -m pip install -r requirements.txt
  2. Start the Web Server:

    python app.py
  3. Access the Web Interface: Open your browser and go to http://127.0.0.1:5000
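
If the page does not load, a quick way to tell whether the server from step 2 is listening is to try a TCP connection to the address above. This is a standard-library sketch; `wait_for_server` is an illustrative helper, and the host and port defaults simply mirror the URL in step 3.

```python
import socket
import time


def wait_for_server(
    host: str = "127.0.0.1",
    port: int = 5000,
    timeout: float = 10.0,
    interval: float = 0.2,
) -> bool:
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    if wait_for_server():
        print("Server is reachable at http://127.0.0.1:5000")
    else:
        print("Nothing is listening on port 5000; is app.py running?")
```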

Option 2: Electron Desktop App

  1. Navigate to the Electron app directory:

    cd electron-app
  2. Install Node.js Dependencies:

    npm install
  3. Start the Electron App:

    npm start

Graphical Interface Usage

Web Interface

  1. Upload a file by dragging and dropping it or by using the file browser
  2. Click Extract Text to process
  3. Use the Copy button to copy extracted text

Electron App

  1. Click Choose File to select a document
  2. Click Extract to process the file
  3. Copy the extracted text using the Copy button

Project Structure

Extracting-Text-From-Image-Document-Sheet/
├── electron-app/         # Electron GUI application
├── templates/            # Web interface templates
├── app.py                # Flask web server
├── create_test_image.py  # Utility to create test images
├── requirements.txt      # Python dependencies
├── text_extractor.py     # Main text extraction module
└── README.md             # This file

Troubleshooting

  1. Tesseract not found

    • Ensure Tesseract is installed and added to your system PATH
    • Or update the path in text_extractor.py
  2. Missing Dependencies

    pip install -r requirements.txt
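
To see which dependencies are missing before reinstalling everything, you can ask Python whether each module can be located. The sketch below uses `importlib.util.find_spec`; the function name `missing_modules` is hypothetical, and the module names you pass in are up to you, since the contents of requirements.txt are not listed here. Note that a package's import name can differ from the name used with pip.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    # Placeholder names for illustration; replace them with the import names
    # of the packages listed in requirements.txt.
    wanted = ["json", "some_module_that_is_not_installed"]
    missing = missing_modules(wanted)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```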

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
