Stars
A Step-by-Step Guide to Scraping eBay Product Data
Code samples from the book Web Scraping with Python http://shop.oreilly.com/product/0636920034391.do
Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024
Perforator is a cluster-wide continuous profiling tool designed for large data centers
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Wo…
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
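An inference library for sequence-based table recognition algorithms, integrating table recognition algorithms such as PP-Structure and modelscope.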
A one-stop, open-source, high-quality data extraction tool for converting PDF to Markdown and JSON.
A curated list of awesome packages, articles, and other cool resources from the Scrapy community.
Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
Lightweight and extensible compatibility layer between dataframe libraries!
Machine learning and natural language processing analysis of earnings call transcripts, using logistic regression classification to make 'buy', 'sell', or 'hold' calls on stocks.
Collection of publicly available IPTV channels from all over the world
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
🔊 Text-Prompted Generative Audio Model
A course on aligning smol models.
NVIDIA AI Blueprint for multimodal PDF data extraction for enterprise RAG
Create fast graph language models from converted PDF documents for knowledge extraction and Q&A.
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
A Unified Toolkit for Deep Learning-Based Table Extraction
A PyTorch implementation of DTrOCR: Decoder-only Transformer for Optical Character Recognition
2nd place solution for the ICDAR 2021 Competition on Scientific Literature Parsing, Task B.
Get your documents ready for gen AI
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Simple package to extract text with coordinates from programmatic PDFs
A High-efficiency Open-source Toolkit for the Table-to-LaTeX Task
A Comprehensive Toolkit for High-Quality PDF Content Extraction