StreamCorpus Pipeline

streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.

The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
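
To illustrate the idea of chaining transforms over stream items, here is a minimal sketch in plain Python. It does not use the streamcorpus or streamcorpus_pipeline API: the `Item` class, the transform functions, and `run_pipeline` are hypothetical stand-ins, and the regex-based cleanup is only a toy approximation of real HTML cleansing.

```python
import html
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a stream item (not the real StreamItem)."""
    raw: str
    clean_html: Optional[str] = None
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


# A transform takes an item and returns it, or None to drop it.
Transform = Callable[[Item], Optional[Item]]


def make_clean_html(item: Item) -> Optional[Item]:
    # Toy cleanup: remove <script> blocks and collapse whitespace.
    text = re.sub(r"(?is)<script.*?</script>", "", item.raw)
    item.clean_html = re.sub(r"\s+", " ", text).strip()
    return item


def make_clean_visible(item: Item) -> Optional[Item]:
    # Toy cleanup: replace tags with spaces, then decode HTML entities.
    text = re.sub(r"<[^>]+>", " ", item.clean_html or "")
    item.clean_visible = re.sub(r"\s+", " ", html.unescape(text)).strip()
    return item


def label_wikipedia_links(item: Item) -> Optional[Item]:
    # Record every hyperlink that points at a Wikipedia page.
    pattern = r'href="(https?://[a-z]+\.wikipedia\.org/[^"]+)"'
    item.labels = re.findall(pattern, item.clean_html or "")
    return item


def run_pipeline(items: Iterable[Item],
                 transforms: List[Transform]) -> Iterator[Item]:
    """Apply each transform in order; skip items that a transform drops."""
    for item in items:
        current: Optional[Item] = item
        for transform in transforms:
            current = transform(current)
            if current is None:
                break
        if current is not None:
            yield current


raw = ('<p>See <a href="https://en.wikipedia.org/wiki/Python">Python</a>'
       ' &amp; more.</p><script>var x = 1;</script>')
results = list(run_pipeline(
    [Item(raw=raw)],
    [make_clean_html, make_clean_visible, label_wikipedia_links],
))
```

Each stage only reads fields set by earlier stages, so the order of the transform list matters: in this sketch, `make_clean_visible` and `label_wikipedia_links` both depend on `clean_html` having been filled in first.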

Read more at streamcorpus.org