andrewdefries

33 stars written in Python

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.

Python 28,129 2,174 Updated Mar 13, 2025

NLTK Source

Python 13,898 2,917 Updated Mar 12, 2025

Toolkit for linearizing PDFs for LLM datasets/training

Python 9,755 640 Updated Mar 14, 2025

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Python 8,782 1,588 Updated Jun 10, 2024

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Python 5,284 1,126 Updated Dec 7, 2022

Official s3cmd repo -- Command line tool for managing S3 compatible storage services (including Amazon S3 and CloudFront).

Python 4,669 909 Updated Jan 27, 2025

A parser for Google Scholar, written in Python

Python 2,127 775 Updated Sep 10, 2022

openFDA is an FDA project to provide open APIs, raw data downloads, documentation and examples, and a developer community for an important collection of FDA public datasets.

Python 599 137 Updated Dec 26, 2022

vbench: A tool for benchmarking your code through time, for showing performance improvement or regressions

Python 244 41 Updated Oct 12, 2017

Python library for reading and writing warc files

Python 239 114 Updated Mar 7, 2022

A more complete example of programming with PDFMiner, which continues where the default documentation stops

Python 214 115 Updated Dec 3, 2019

Data Parallel Python

Python 207 26 Updated May 10, 2013

Index URLs in Common Crawl

Python 193 47 Updated Sep 19, 2017

A package that allows R developers to use Hadoop MapReduce

Python 159 149 Updated Jul 21, 2020

Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)

Python 157 29 Updated Aug 27, 2020

Simplified and standard interface to a number of cheminformatics toolkits

Python 87 27 Updated Nov 4, 2023

This is the repository for NAACL'25 paper "TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning"

Python 47 1 Updated Oct 24, 2024

gathering point for open source OCR scripts and diffs

Python 43 7 Updated Jun 27, 2014

This sample Python application demonstrates how to access the Compute Engine API using the Google Python API Client Library.

Python 41 16 Updated Sep 22, 2015

Fetches Google Scholar queries and outputs info or BibTeX data

Python 32 13 Updated Feb 18, 2014

The Historian's WARC Toolkit

Python 15 4 Updated May 14, 2015

Scanning script for the Noisebridge book scanner

Python 14 3 Updated May 12, 2017

Access index of web pages in Common Crawl

Python 9 2 Updated Apr 28, 2015

A Django app for accessing Mechanical Turk

Python 9 4 Updated Jul 28, 2020

Performance module for Python

Python 7 Updated Jun 4, 2012

This simple tool helps researchers and students to visualize who is collaborating with whom in their field of research.

Python 4 1 Updated May 27, 2012

Try Python for Cheminformatics

Python 3 Updated Jul 18, 2010

A Python program that takes input from the webcam and saves an image frame in JPEG format every 30 seconds.

Python 2 1 Updated Jul 22, 2013

Tool and library for handling Web ARChive (WARC) files.

Python 1 1 Updated May 28, 2014

gathering point for open source OCR scripts and diffs

Python 1 Updated Apr 20, 2013