This Python script extracts text from PDF files, splits it into chunks, and saves the chunks as both JSON and HTML files. It is useful for processing large documents and preparing text for downstream use, such as creating social media content from books.
- Extracts text from PDF files 📄
- Saves text as JSON and HTML files 📊
- Accept input from a file explorer.
- Add a web interface.
- Improve chunking with AI models.
- Python 3.6+
- PyPDF2 library
- PyQt5
1. Clone this repository:

   ```shell
   git clone https://github.com/thethmuu/book2socialfeed.git
   ```

2. Navigate to the project directory:

   ```shell
   cd book2socialfeed
   ```

3. Install the required packages:

   ```shell
   pip install -r requirements.txt
   ```

4. Run the script:

   ```shell
   python main.py
   ```

5. Answer the following prompts:
   - PDF file name 📁
   - Number of pages to skip (default is 1) ⏭️
   - Chunk size (default is 50) 📏
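The chunking step can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the helper name `chunk_text` and the assumption that `chunk_size` counts sentences are ours.

```python
# Illustrative sketch of the chunking step; the real main.py may differ,
# and whether chunk_size counts sentences or words is an assumption here.
import re


def chunk_text(text, chunk_size=50):
    """Split text into chunks of roughly chunk_size sentences each."""
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]
```

For example, `chunk_text("One. Two. Three.", chunk_size=2)` yields two chunks: the first two sentences joined together, then the remainder.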
The script generates:

- `output.json`: an array of the extracted text chunks.
- `{input_filename}_output.html`: a basic styled representation of the chunks, where `{input_filename}` is the name of the input PDF file (truncated to 20 characters if necessary) so the output is easy to match to its source.
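Writing the two output files can be sketched like this. `save_chunks` is a hypothetical helper; the actual HTML styling and file handling in `main.py` may differ.

```python
# Illustrative sketch of the output step; the exact HTML produced by
# main.py is an assumption here.
import html
import json
from pathlib import Path


def save_chunks(chunks, input_filename, out_dir="."):
    """Write chunks to output.json and {input_filename}_output.html."""
    out_dir = Path(out_dir)
    # output.json holds the chunks as a plain JSON array.
    (out_dir / "output.json").write_text(
        json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    # The HTML file name uses the PDF's base name, truncated to 20 chars.
    base = Path(input_filename).stem[:20]
    body = "\n".join(f"<p>{html.escape(c)}</p>" for c in chunks)
    doc = f"<!DOCTYPE html>\n<html><body>\n{body}\n</body></html>"
    (out_dir / f"{base}_output.html").write_text(doc, encoding="utf-8")
    return base
```

Escaping each chunk with `html.escape` keeps any `<` or `&` characters in the extracted text from breaking the generated page.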
Modify `chunk_size` and `skip_pages` in the script to change the defaults.
Contributions and feature requests are welcome! Check the issues page.
This project is MIT licensed.