Civilized Data Processing in Python



Civilized Data Processing

All data in pipette flows through a pipeline that begins with a source.  Data takes the form of a stream of documents.  Pipelines are lazy: they are evaluated only when you iterate over them or explicitly invoke the pull method.
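
To make the laziness concrete, here is a minimal sketch using plain Python generators. It does not use pipette and is not how pipette is implemented; `read_docs` and `log` are hypothetical names made up for the illustration.

```python
def read_docs(rows, log):
    """Hypothetical source: yields one document (a dict) at a time."""
    for row in rows:
        log.append(row["id"])  # record the moment a row is actually read
        yield row


log = []
rows = [{"id": 1}, {"id": 2}, {"id": 3}]

# Building the pipeline does no work yet.
pipeline = (doc for doc in read_docs(rows, log))
assert log == []

# Pulling one document evaluates exactly one step.
first = next(pipeline)
assert log == [1]

# Iterating over the rest drains the remaining documents.
rest = list(pipeline)
assert log == [1, 2, 3]
```

The same idea applies to any lazily evaluated stream: setting up the steps is cheap, and the cost is paid only when documents are pulled through.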

It works like so:

  from pipette import *

  s = Source('bigdata.csv')
  def change_variables(doc):
      """Change some variable names.

      Notice that I don't have to return the doc.  The document is returned
      to the pipeline unless something non-None was returned.
      """
      doc.date = doc.pop('DATE')  # 'date' is an assumed attribute name
      doc.is_important = int(doc.pop('STATUS') == 'important')
      doc.category = doc.CAT

  def keep_import(doc):
      """Filter functions must return a boolean."""
      return bool(doc.is_important)

  def date(doc):
      """Remember that group_by works like itertools.groupby, in that it only
      groups consecutive entries.
      """

  # slice mutates the pipeline, like all methods

  for date, doc in s:
      print(date, len(doc))
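
Two conventions from the docstrings above can be illustrated without pipette: a step that returns None keeps the (mutated) document, and grouping only merges consecutive entries. The sketch below uses only the standard library. `apply_step`, `rename`, and `by_date` are hypothetical helpers invented for the example, not part of pipette's API.

```python
from itertools import groupby


def apply_step(func, doc):
    """Hypothetical helper: keep the doc itself when func returns None."""
    result = func(doc)
    return doc if result is None else result


def rename(doc):
    # Mutates the document in place and implicitly returns None.
    doc["date"] = doc.pop("DATE")


docs = [
    apply_step(rename, d)
    for d in (
        {"DATE": "2012-01-01"},
        {"DATE": "2012-01-02"},
        {"DATE": "2012-01-01"},  # same date as the first, but not adjacent
    )
]


def by_date(doc):
    return doc["date"]


# Unsorted input: 2012-01-01 appears as two separate groups.
unsorted_groups = [(k, len(list(g))) for k, g in groupby(docs, key=by_date)]

# Sorting by the same key first yields one group per date.
sorted_groups = [
    (k, len(list(g)))
    for k, g in groupby(sorted(docs, key=by_date), key=by_date)
]

print(unsorted_groups)
print(sorted_groups)
```

If the input is not already ordered by the grouping key, sorting it first is the usual way to get one group per key, at the cost of reading the whole stream into memory.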