Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Updated Jul 26, 2024 - Python
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Bitextor generates translation memories from multilingual websites
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Python library for handling audio datasets.
OpusFilter - Parallel corpus processing toolkit
Utilities for Processing the Switchboard Dialogue Act Corpus
An open source reimplementation of Benny Brodda's BETA in Python
A set of workflows for corpus building through OCR, post-correction and normalisation
Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework
A parser for annotated MuseScore 3 files.
Python library for extracting quantitative, reproducible metrics of multi-level alignment between two speakers in naturalistic language corpora.
Utilities for Processing the Meeting Recorder Dialogue Act Corpus
MFTE Python is a Python version of Le Foll's MFTE (Multi Feature Tagger of English), which was originally written in Perl. It extends the original with semantic tags from Biber (2006) and Biber et al. (1999), along with other specific tags.
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts
Yet another search platform for linguistic corpora.
Searching in-memory corpus with Corpus Query Language (CQL)
Measure the similarity of text corpora for 74 languages
Python library for using the Korp API
Scripts for building a geo-located web corpus using Common Crawl data