framework for making streamcorpus data
data
docs
examples
scripts
sphinx-docs
streamcorpus_pipeline
.gitignore
.gitmodules
LICENSE.txt
MANIFEST.in
Makefile
README.md
conf.py
distribute_setup.py
index.rst
setup.py
tox.ini
version.py

StreamCorpus Pipeline

streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.

The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
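
The idea of running a chain of transforms over the items in a chunk can be illustrated with a small toy model. This is only a sketch using the Python standard library; it does not use the real streamcorpus or streamcorpus_pipeline APIs. `ToyStreamItem`, `toy_clean_html`, `toy_clean_visible`, `toy_hyperlink_labels`, and `run_pipeline` are hypothetical names invented for this example, and the choice to blank out tags with spaces of equal length is an assumption made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class ToyStreamItem:
    """Hypothetical stand-in for a stream item (NOT streamcorpus.StreamItem)."""
    raw: str
    clean_html: Optional[str] = None
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


# A transform takes an item, fills in some fields, and returns the item.
Transform = Callable[[ToyStreamItem], ToyStreamItem]

TAG_RE = re.compile(r"<[^>]*>")
WIKI_HREF_RE = re.compile(r'href="(https?://[^"]*wikipedia\.org/[^"]*)"')


def toy_clean_html(item: ToyStreamItem) -> ToyStreamItem:
    # Toy normalization: collapse runs of whitespace in the raw markup.
    item.clean_html = " ".join(item.raw.split())
    return item


def toy_clean_visible(item: ToyStreamItem) -> ToyStreamItem:
    # Assumption for this sketch: replace each tag with spaces of the same
    # length, so character offsets in clean_visible match clean_html.
    item.clean_visible = TAG_RE.sub(lambda m: " " * len(m.group(0)), item.clean_html)
    return item


def toy_hyperlink_labels(item: ToyStreamItem) -> ToyStreamItem:
    # Record the target of every link that points at a wikipedia.org URL.
    item.labels.extend(WIKI_HREF_RE.findall(item.clean_html))
    return item


def run_pipeline(items: Iterable[ToyStreamItem],
                 transforms: List[Transform]) -> Iterator[ToyStreamItem]:
    # Apply each transform, in order, to every item in the "chunk".
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item


# A "chunk" here is just a list holding one toy item.
chunk = [ToyStreamItem(
    raw='<p>See  <a href="https://en.wikipedia.org/wiki/Python">Python</a></p>')]
processed = list(run_pipeline(
    chunk, [toy_clean_html, toy_clean_visible, toy_hyperlink_labels]))
```

The order of the transforms matters in this sketch: `toy_clean_visible` and `toy_hyperlink_labels` both read the `clean_html` field, so `toy_clean_html` has to run first. Taggers are left out because they depend on external tools.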

Read more at streamcorpus.org