Python script to extract text from PDFs, chunk it, and output as JSON for showing them like social media posts

thethmuu/book2socialfeed

Book2SocialFeed 📚➡️📱

This Python script extracts text from PDF files, splits it into chunks, and saves the chunks as JSON and HTML files. It is useful for breaking large documents into short pieces for further processing, such as turning a book into social-media-style posts.

Features 🌟

  • Extracts text from PDF files 📄
  • Splits the extracted text into chunks ✂️
  • Saves the chunks as JSON and HTML files 📊
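
The extraction and chunking steps could look roughly like the sketch below. It is an illustration only, not the code in main.py: the function names extract_pages and chunk_words are hypothetical, and it assumes the chunk size counts words, which this README does not specify. It uses PdfReader from PyPDF2 2.x or later.

```python
from typing import List


def extract_pages(pdf_path: str, skip_pages: int = 1) -> List[str]:
    """Return the text of every page after the first `skip_pages` pages."""
    # Imported here so chunk_words() below works without PyPDF2 installed.
    from PyPDF2 import PdfReader  # PyPDF2 2.x+; older releases used PdfFileReader

    reader = PdfReader(pdf_path)
    pages = list(reader.pages)[skip_pages:]
    # extract_text() can come back empty for image-only pages.
    return [page.extract_text() or "" for page in pages]


def chunk_words(text: str, chunk_size: int = 50) -> List[str]:
    """Split text into chunks of at most `chunk_size` words (assumed unit)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

For example, chunk_words(" ".join(extract_pages("book.pdf"))) would give one list of chunks for the whole document; "book.pdf" is a placeholder file name.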

Roadmap 🛣️

  • Accept input from file explorer.
  • Add a web interface.
  • Improve chunking with AI models.

Requirements 🛠️

  • Python 3.6+
  • PyPDF2 library
  • PyQt5

Installation 🚀

  1. Clone this repository:

    git clone https://github.com/thethmuu/book2socialfeed.git
  2. Navigate to the project directory:

    cd book2socialfeed
  3. Install the required packages:

    pip install -r requirements.txt

Usage 🖥️

  1. Run the script:

    python main.py
  2. Answer the following prompts:

    • PDF file name 📁
    • Number of pages to skip (default is 1) ⏭️
    • Chunk size (default is 50) 📏
  3. The script generates:

    • output.json: Extracted text chunks
    • {input_filename}_output.html: Basic styled representation of the chunks, where {input_filename} is the name of the PDF file (truncated to 20 characters if necessary).
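
The naming rule in step 3 can be sketched as follows. This is a guess at the behavior, not the script's actual code: html_output_name and write_outputs are hypothetical names, the sketch assumes the 20-character limit applies to the file name without its .pdf extension, and the HTML markup is a placeholder rather than the styling the script produces.

```python
import html
import json
from pathlib import Path
from typing import List

MAX_NAME_LENGTH = 20  # the README says names are truncated to 20 characters


def html_output_name(pdf_name: str) -> str:
    """Build '{input_filename}_output.html' from the PDF file name."""
    # Assumption: the extension is dropped before truncating.
    stem = Path(pdf_name).stem[:MAX_NAME_LENGTH]
    return f"{stem}_output.html"


def write_outputs(chunks: List[str], pdf_name: str, out_dir: str = ".") -> None:
    """Write output.json and the HTML file into out_dir."""
    out = Path(out_dir)
    with open(out / "output.json", "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False, indent=2)
    # Escape each chunk so characters such as '<' display as text.
    posts = "\n".join(
        f'<div class="post">{html.escape(chunk)}</div>' for chunk in chunks
    )
    page = f"<!DOCTYPE html>\n<html><body>\n{posts}\n</body></html>\n"
    (out / html_output_name(pdf_name)).write_text(page, encoding="utf-8")
```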

Output 📊

  • output.json contains an array of text chunks.
  • {input_filename}_output.html displays the text chunks in a simple format. It is named after the input PDF file for easier identification.
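
As a rough example of consuming the JSON output, the sketch below loads output.json and prints each chunk as a numbered post. It assumes every array element is a plain string; load_chunks and render_feed are hypothetical helpers, not part of the script.

```python
import json
from typing import List


def load_chunks(path: str = "output.json") -> List[str]:
    """Read the chunk array written by the script."""
    with open(path, encoding="utf-8") as f:
        chunks = json.load(f)
    if not isinstance(chunks, list):
        raise ValueError("expected a JSON array of text chunks")
    return chunks


def render_feed(chunks: List[str]) -> str:
    """Format chunks as numbered posts separated by blank lines."""
    total = len(chunks)
    return "\n\n".join(
        f"[{i}/{total}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
```

Calling print(render_feed(load_chunks())) from the project directory would show the feed in the terminal.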

Customization ⚙️

To change the defaults, edit chunk_size and skip_pages in the script.
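
One way such defaults might be wired to the prompts is sketched below. The constant names and the parse_int helper are hypothetical, and the script may handle input differently; only the default values (1 and 50) come from this README.

```python
DEFAULT_SKIP_PAGES = 1   # default stated in the Usage section
DEFAULT_CHUNK_SIZE = 50  # default stated in the Usage section


def parse_int(answer: str, default: int) -> int:
    """Return `default` for an empty answer, else the answer as an int."""
    answer = answer.strip()
    if not answer:
        return default
    value = int(answer)  # raises ValueError for non-numeric input
    if value < 0:
        raise ValueError("value must not be negative")
    return value
```

A prompt could then read skip_pages = parse_int(input("Pages to skip: "), DEFAULT_SKIP_PAGES).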

Contributing 🤝

Contributions and feature requests are welcome! Check the issues page.

License 📜

This project is MIT licensed.
