# text-extraction

Here are 89 public repositories matching this topic...

miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

python nlp pagerank-algorithm text-extraction reduction summarization html-page summary lsa sumy textteaser summarizer html-extraction html-extractor

Updated May 6, 2024
Python

adbar / trafilatura

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Updated May 15, 2024
Python

chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services, allowing Tika to be called natively from Python.

Updated Apr 14, 2024
Python

miso-belica / jusText

Heuristic-based boilerplate removal tool

python text-extraction html-parser html-parsing

Updated May 9, 2024
Python

cdown / srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

python library tools command-line text-extraction subtitles subtitle srt subtitles-parsing mit-license command-line-tool subtitle-parser subtitle-fixer

Updated Mar 19, 2024
Python

skylander86 / lambda-text-extractor

AWS Lambda functions to extract text from various binary formats.

pdf ocr aws-lambda lambda-functions tesseract text-extraction searchable-pdfs pdf-ocr-extraction

Updated Feb 7, 2018
Python

weareprestatech / hotpdf

hotpdf is a fast PDF parsing library, built on top of pdfminer.six, for extracting text and finding text within PDF documents

python pdf text-extraction text-search

Updated Mar 26, 2024
Python

SapienzaNLP / extend

Entity Disambiguation as text extraction (ACL 2022)

nlp natural-language-processing acl pytorch text-extraction entity-linking entity-disambiguation entity-disambiguation-models acl2022

Updated Apr 17, 2022
Python

vsymbol / CUTIE

CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)

computer-vision deep-learning text-extraction

Updated Dec 8, 2022
Python

py-pdf / benchmarks

Benchmarking PDF libraries

pdf benchmark text-extraction mupdf data-extraction pypdf2 poppler-utils

Updated Oct 31, 2023
Python

jmriebold / BoilerPy3

Python port of Boilerpipe library

text-extraction boilerpipe boilerpy html-text-extraction full-text-extraction

Updated Nov 1, 2023
Python

nainiayoub / pdf-text-data-extractor

PDF text data extraction web app with OCR for scanned documents

python pdf ocr text-extraction pdf-to-text ocr-text-reader ocr-python streamlit streamlit-webapp

Updated Jul 6, 2023
Python

iscc / mobi

Python-based software to unpack KindleGen-generated ebooks

mobi text-extraction kindle

Updated Feb 6, 2023
Python

flairNLP / fundus

A very simple news crawler with a funny name

python nlp rss sitemap crawler scraper corpus text-extraction web-scraping news-crawler commoncrawl web-corpus news-scraping cc-news

Updated May 15, 2024
Python

fourdigits / wagtail_textract

Text extraction for Wagtail document search

search django wagtail tesseract text-extraction textract

Updated Oct 25, 2023
Python

hscspring / pnlp

NLP pre-/post-processing tools.

nlp concurrency text-extraction chinese-nlp text-processing preprocessing normalization text-cleaning nlp-preprocess nlp-enhancer text-length

Updated Jan 12, 2024
Python

asepmaulanaismail / pdf-to-txt-python

Simple PDF-to-text conversion with Python using PDFtk and PyPDF2

python pdf python3 text-extraction pdf-to-text pypdf2 pdftk pdf-extractor

Updated Oct 1, 2023
Python

Altabeh / tesseract-ocr-wrapper

This is a highly efficient python wrapper for tesseract-ocr.

multiprocessing text-extraction leptonica tesseract-ocr xpdf

Updated May 19, 2022
Python

Govind-S-B / pdf-to-text-chroma-search

Python scripts that convert PDF files to text, split them into chunks, and store their vector representations in a Chroma DB using GPT4All embeddings. A separate script queries the Chroma DB for similarity search based on user input.

text-extraction similarity-search pdf-processing vector-embeddings chromadb

Updated Oct 23, 2023
Python

amenezes / aiopytesseract

A Python asyncio wrapper for Tesseract-OCR.

ocr tesseract text-extraction asyncio tesseract-ocr optical-character-recognition pdftotext pytesseract pytesseract-ocr

Updated Feb 5, 2024
Python
