Text Extractor

Text Extractor extracts text from images, PDFs, Word documents, and Excel spreadsheets. It is available as both an Electron desktop application and a Python web application.

Features

Common Features

Extract text from multiple file formats (PNG, JPG/JPEG, PDF, DOCX, XLSX)
Copy extracted text to clipboard
Cross-platform support

Electron App Features

Native desktop experience
System tray integration
File association support

Web Interface Features

Accessible from any device on your network
No installation required (just a web browser)
Responsive design for mobile and desktop

Prerequisites

System Requirements

Python 3.8 or higher
pip (Python package manager)
Tesseract OCR (required for image text extraction)
Node.js and npm (required only for the Electron desktop app)

Installing Tesseract OCR

Windows:

Download the installer from UB Mannheim
Run the installer
Add Tesseract to your system PATH, or update the Tesseract path in text_extractor.py

macOS:

```
brew install tesseract
```

Linux (Ubuntu/Debian):

```
sudo apt-get install tesseract-ocr
```
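
On any platform, you can confirm that the `tesseract` executable is reachable before running the extractor. The following is a minimal sketch using only the Python standard library; the helper name `find_tesseract` is hypothetical and is not part of this project.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path to the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract was not found on PATH; install it or add it to PATH.")
    else:
        # `tesseract --version` reports the installed version
        # (some builds write it to stderr instead of stdout).
        result = subprocess.run([path, "--version"], capture_output=True, text=True)
        lines = (result.stdout or result.stderr).splitlines()
        print(f"Found Tesseract at {path}: {lines[0] if lines else 'version unknown'}")
```

If the script reports that Tesseract was not found, revisit the PATH step for your platform.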

Installation

Clone the repository

```
git clone https://github.com/sajidsdk19/Extracting-Text-From-Image-Document-Sheet.git
cd Extracting-Text-From-Image-Document-Sheet
```

Create and activate a virtual environment (recommended)

```
# On Windows
python -m venv venv
.\venv\Scripts\activate

# On macOS/Linux
python3 -m venv venv
source venv/bin/activate
```
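
To check that the virtual environment is actually active before installing packages, a short sketch like the one below can help. It relies only on the standard `sys` module; the function name `in_virtual_environment` is made up for this example.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter is running inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtual_environment())
    print("Interpreter:", sys.executable)
```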

Install the required Python packages
```
pip install -r requirements.txt
```

Usage

Command Line Interface

```
# Basic usage
python text_extractor.py path/to/your/file

# Save output to a file
python text_extractor.py input.pdf -o output.txt

# Copy text to clipboard
python text_extractor.py input.png --clipboard

# Show help
python text_extractor.py --help
```
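
For readers curious how options like these are typically wired up, here is an illustrative `argparse` sketch that accepts the same arguments as the commands above. It is not the parser used in text_extractor.py; `build_parser` and the `output` attribute name are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser accepting the options shown in the usage examples."""
    parser = argparse.ArgumentParser(
        prog="text_extractor.py",
        description="Extract text from an image, PDF, DOCX, or XLSX file.",
    )
    parser.add_argument("file", help="path to the input file")
    parser.add_argument("-o", dest="output", default=None,
                        help="write the extracted text to this file")
    parser.add_argument("--clipboard", action="store_true",
                        help="copy the extracted text to the clipboard")
    return parser


if __name__ == "__main__":
    # Parse a fixed argument list so the sketch can be run without real files.
    args = build_parser().parse_args(["input.pdf", "-o", "output.txt"])
    print(args.file, args.output, args.clipboard)
```

`argparse` adds `--help` automatically, which matches the last command above.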

Supported File Formats

Images: .png, .jpg, .jpeg (using OCR)
Documents: .pdf, .docx
Spreadsheets: .xlsx
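
If you want to filter files before passing them to the extractor, the list above can be expressed as a lookup by extension. This is a sketch only; `SUPPORTED_FORMATS` and `classify_file` are hypothetical names and may not match how text_extractor.py dispatches internally.

```python
from pathlib import Path

# Extensions taken from the list above, grouped by category.
SUPPORTED_FORMATS = {
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".pdf": "document",
    ".docx": "document",
    ".xlsx": "spreadsheet",
}


def classify_file(path):
    """Return 'image', 'document', or 'spreadsheet'; raise ValueError if unsupported."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    try:
        return SUPPORTED_FORMATS[suffix]
    except KeyError:
        raise ValueError(
            f"Unsupported file format: {suffix or '(no extension)'}"
        ) from None
```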

Python Module

You can also use the text extractor as a Python module:

```
from text_extractor import extract_text_from_file

text = extract_text_from_file('document.pdf')
print(text)
```
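
When processing several files, it can be useful to keep going after a failure. The sketch below wraps any single-file extractor, such as `extract_text_from_file`, in a small batch helper. `extract_many` is a hypothetical helper, and which exceptions the real extractor raises is not documented here, so the sketch catches `Exception` broadly.

```python
def extract_many(paths, extractor):
    """Run `extractor` on each path and return (texts, errors), both keyed by path."""
    texts, errors = {}, {}
    for path in paths:
        try:
            texts[path] = extractor(path)
        except Exception as exc:
            # Record the failure and continue so one bad file does not stop the batch.
            errors[path] = str(exc)
    return texts, errors


# Usage with this project (assumes text_extractor.py is importable):
#     from text_extractor import extract_text_from_file
#     texts, errors = extract_many(["document.pdf", "test_image.png"],
#                                  extract_text_from_file)
```

Passing the extractor in as an argument keeps the helper independent of the module, which also makes it easy to test with a stand-in function.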

Installation Methods

Option 1: Python Web Interface

Install Python Dependencies:

```
python -m pip install -r requirements.txt
```

Start the Web Server:
```
python app.py
```
Access the Web Interface: Open your browser and go to http://127.0.0.1:5000
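
If the page does not load, you can check from Python whether anything is answering at that address. This is a minimal standard-library sketch; `is_server_up` is a hypothetical helper, and the default URL is simply the address given above.

```python
import urllib.error
import urllib.request


def is_server_up(url="http://127.0.0.1:5000", timeout=2.0):
    """Return True if an HTTP server answers at `url`, False if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just with an error status (for example 404).
        return True
    except OSError:
        # Covers URLError, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    print("Web interface reachable:", is_server_up())
```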

Option 2: Electron Desktop App

Navigate to the Electron app directory:
```
cd electron-app
```
Install Node.js Dependencies:
```
npm install
```
Start the Electron App:
```
npm start
```

Usage

Web Interface

Upload a file by drag-and-drop or file browser
Click Extract Text to process
Use the Copy button to copy extracted text

Electron App

Click Choose File to select a document
Click Extract to process the file
Copy the extracted text using the Copy button

Project Structure

```
Extracting-Text-From-Image-Document-Sheet/
├── electron-app/         # Electron GUI application
├── templates/            # Web interface templates
├── app.py                # Flask web server
├── create_test_image.py  # Utility to create test images
├── requirements.txt      # Python dependencies
├── text_extractor.py     # Main text extraction module
└── README.md             # This file
```

Troubleshooting

Tesseract not found

- Ensure Tesseract is installed and added to your system PATH
- Or update the path in text_extractor.py

Missing Dependencies

```
pip install -r requirements.txt
```
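
To see which Python modules are still missing after installing, a quick import check can narrow things down. The sketch below uses `importlib.util.find_spec` from the standard library. The module names in the example list are assumptions; check requirements.txt for the packages this project actually needs.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names, for illustration only.
    assumed = ["pytesseract", "PIL", "flask"]
    missing = missing_modules(assumed)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```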

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
