Stars
A one-stop, open-source, high-quality data extraction tool for converting PDF to Markdown and JSON.
Toolkit for linearizing PDFs for LLM datasets/training
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
Python PDF Parser (Not actively maintained). Check out pdfminer.six.
Official s3cmd repo -- Command line tool for managing S3 compatible storage services (including Amazon S3 and CloudFront).
A parser for Google Scholar, written in Python
openFDA is an FDA project to provide open APIs, raw data downloads, documentation and examples, and a developer community for an important collection of FDA public datasets.
vbench: A tool for benchmarking your code through time, for showing performance improvement or regressions
Python library for reading and writing warc files
A more complete example of programming with PDFMiner, which continues where the default documentation stops
A package that allows R developers to use Hadoop MapReduce
Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)
Simplified and standard interface to a number of cheminformatics toolkits
This is the repository for NAACL'25 paper "TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning"
gathering point for open source OCR scripts and diffs
This sample Python application demonstrates how to access the Compute Engine API using the Google Python API Client Library.
Fetches Google Scholar queries and outputs info or BibTeX data
Scanning script for the Noisebridge book scanner
Access index of web pages in Common Crawl
This simple tool helps researchers and students to visualize who is collaborating with whom in their field of research.
A Python program that takes input from the webcam and saves an image frame in JPEG format every 30 seconds.
linearregression / warcat
Forked from chfoo/warcat
Tool and library for handling Web ARChive (WARC) files.
darkseed / ossocr
Forked from artunit/ossocr
gathering point for open source OCR scripts and diffs