N-gram language model that learns n-gram probabilities from a given corpus and generates new sentences from it, using the conditional probabilities of the next word given the previously generated words and phrases.
Updated Feb 8, 2018 - Python
Forpus is a Python library for converting plain-text corpora into various corpus formats.
uniblock: scoring and filtering corpora with Unicode block information (and more).
Collection of tools for building diachronic/historical word vectors
Utilities for Processing the Saarbrücken Corpus of Spoken English
Utilities for Processing the bAbI Tasks Corpus
Utilities for Processing the Dialogue State Tracking Challenge 3 Corpus
Utilities for Processing the FRAMES Corpus
Python scripts for preprocessing the Penn Treebank and the Chinese Treebank
Diarization A to Z - Kaldi to Gecko to Kaldi and corpus and back
An information retrieval system based on the vector space model, written in Python. It also implements bigram indices for phrasal query search and champion list retrieval. The project report compares the total retrieval times.
Utilities for Processing the HCRC Map Task Corpus
Utilities for Processing the Meeting Recorder Dialogue Act Corpus
Utilities for Processing the BT Oasis Corpus
Utilities for Processing the Switchboard Dialogue Act Corpus
Takes a text and extracts the sentences written in a given language, using that language's lexicon.
Simple collocation-driven rhyme recognition. Contains pre-trained models for Czech, Dutch, English, French, German, Russian, and Spanish poetry.
Python scripts for the construction of the LEXB parallel corpus of South Tyrolean legislation (IT-DE).
Split-corpus is a package that divides text corpora into meaningful parts that are as close to a specified size as possible.
This package provides Python utility classes and static methods that wrap third-party software commonly used in text processing, such as Unitex-GramLab, TreeTagger, Apache Tika, and Google Tesseract.
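The n-gram language model described in the first entry lends itself to a short illustration. The following is a minimal sketch of the general technique for the bigram case (n = 2), not the code of that repository: the function names `train_bigrams` and `generate`, the boundary markers `<s>` and `</s>`, and the toy corpus are all assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

# Hypothetical sentence-boundary markers (not taken from any listed repository).
BOS, EOS = "<s>", "</s>"


def train_bigrams(sentences):
    """Count bigrams and return P(next | prev) as nested dicts."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = [BOS] + sentence.split() + [EOS]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    # Normalize each row of counts into a conditional distribution.
    return {
        prev: {word: c / sum(nxts.values()) for word, c in nxts.items()}
        for prev, nxts in counts.items()
    }


def generate(model, rng, max_len=20):
    """Sample one word at a time from P(next | prev) until EOS or max_len."""
    words, prev = [], BOS
    while len(words) < max_len:
        candidates = model[prev]
        nxt = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if nxt == EOS:
            break
        words.append(nxt)
        prev = nxt
    return " ".join(words)


# Toy corpus, invented for illustration only.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
model = train_bigrams(corpus)
sentence = generate(model, random.Random(0))
```

Here `model["the"]["cat"]` is the relative frequency of "cat" after "the" in the toy corpus. A practical model would usually add smoothing and support higher-order n-grams, both of which this sketch omits.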