Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Updated Jun 4, 2024 - Python
A very simple news crawler with a funny name
AI Media and Misinformation Content Analysis Tool: Analyze text and images
Extract embedded metadata from HTML markup
Module for automatic summarization of text documents and HTML pages.
OCR with Tesseract and OpenCV: Extract text from images effortlessly. Preprocess with OpenCV for accuracy. Display results and save output. Easy integration for document digitization and data entry automation.
Heuristic-based boilerplate removal tool
Dataiku DSS plugin to perform optical character recognition (OCR) using the Tesseract engine.
Tika-Python is a Python binding to the Apache Tika™ REST services, allowing Tika to be called natively from Python.
UnchainedText: Break free from PDFs! Easily extract raw text to .txt for preprocessing.
hotpdf is a fast PDF parsing library, built on top of pdfminer.six, for extracting and finding text within PDF documents.
A simple library and set of tools for parsing, modifying, and composing SRT files.
API to get text from multiple types of files
ScholarVista analyses research papers and extracts/plots information about them. It uses Grobid to extract all the content of the research papers. Then all this data is plotted and displayed using Python.
YiraBot: Simplifying Web Scraping for All. A user-friendly tool for developers and enthusiasts, offering command-line ease and Python integration. Ideal for research, SEO, and data collection.