Module for automatic summarization of text documents and HTML pages.
Updated May 6, 2024 · Python
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Heuristic-based boilerplate removal tool
A simple library and set of tools for parsing, modifying, and composing SRT files.
AWS Lambda functions to extract text from various binary formats.
hotpdf is a fast PDF parsing library, built on top of pdfminer.six, for extracting and finding text within PDF documents
Entity Disambiguation as text extraction (ACL 2022)
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
Benchmarking PDF libraries
Python port of Boilerpipe library
PDF text data extraction web app with OCR for scanned documents
Python-based software to unpack KindleGen-generated ebooks
A very simple news crawler with a funny name
NLP pre-/post-processing tools.
Simple PDF-to-text conversion with Python using PDFtk and PyPDF2
This is a highly efficient Python wrapper for tesseract-ocr.
Python scripts that convert PDF files to text, split them into chunks, and store their vector representations using GPT4All embeddings in a Chroma DB. Also included is a script to query the Chroma DB for similarity search based on user input.
A Python asyncio wrapper for Tesseract-OCR.