Hi. I changed the name of this library to engarde. Check it out here.
Data Scientists Against Dirty Data (DSADD)
A Python package for defensive data analysis. (Name to be determined.)
Supports Python 2.7+ and 3.4+.
Data are messy. You want to assert that certain invariants about your data hold across operations or updates to the raw data. This is a lightweight way of placing some additional structure on semi-structured data sources like CSVs.
There are two main ways of using the library. First, as decorators:
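    from dsadd.decorators import none_missing, unique_index, is_shape

    # The result of f is checked for missing values.
    @none_missing
    def f(df1, df2):
        return df1.add(df2)

    # The result must have shape (1290, 10) and a unique index.
    @is_shape((1290, 10))
    @unique_index
    def make_design_matrix(path='data.csv'):
        out = ...
        return out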
Second, interactively (probably with the DataFrame.pipe method, which requires pandas>=0.16.2):
    >>> import dsadd.checks as dc
    >>> (df1.reindex_like(df2)
    ...     .pipe(dc.unique_index)
    ...     .cumsum()
    ...     .pipe(dc.within_range(0, 100))
    ... )
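To make the chaining pattern concrete without depending on the package itself, here is a minimal, self-contained sketch. The two check functions below are stand-ins written for this example, not the library's own implementations, and the `within_range(df, low, high)` signature is an assumption. Each one takes a DataFrame, raises on failure, and otherwise returns the DataFrame unchanged so the chain can continue.

```python
import pandas as pd


def unique_index(df):
    # Stand-in check (not the library's code): fail on duplicate index labels.
    if not df.index.is_unique:
        raise AssertionError("index is not unique")
    return df


def within_range(df, low, high):
    # Stand-in check with an assumed signature: every value must lie in [low, high].
    if not ((df >= low) & (df <= high)).all().all():
        raise AssertionError("values outside [{}, {}]".format(low, high))
    return df


df1 = pd.DataFrame({"x": [10, 20, 30]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"x": [0, 0]}, index=["a", "b"])

# Each .pipe call passes the intermediate DataFrame as the first argument;
# extra positional arguments follow the function.
result = (
    df1.reindex_like(df2)
    .pipe(unique_index)
    .cumsum()
    .pipe(within_range, 0, 100)
)
```

Because every check returns its input, a check can be dropped into or removed from the chain without changing the computed result.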
Functions take a DataFrame (and optionally arguments) and return a DataFrame.
If used as a decorator, the result of the decorated function is checked.
Any failed check raises an AssertionError.
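The relationship between a check and its decorator form can be sketched roughly as follows. This is an illustration only: the `checked` factory is a hypothetical helper, the `none_missing` body is a stand-in rather than the package's code, and raising AssertionError on failure is an assumption of the sketch.

```python
from functools import wraps

import pandas as pd


def none_missing(df):
    # Stand-in check: raise if any cell is NaN, otherwise return df unchanged.
    if df.isnull().any().any():
        raise AssertionError("DataFrame contains missing values")
    return df


def checked(check, *check_args, **check_kwargs):
    # Hypothetical decorator factory: apply `check` to the wrapped function's result.
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # The check either raises or hands the same DataFrame back.
            return check(result, *check_args, **check_kwargs)
        return wrapper
    return decorate


@checked(none_missing)
def add(df1, df2):
    return df1.add(df2)


clean = pd.DataFrame({"a": [1.0, 2.0]})
dirty = pd.DataFrame({"a": [1.0, float("nan")]})

total = add(clean, clean)  # passes the check and returns the sum

try:
    add(clean, dirty)  # the sum contains NaN, so the check raises
    failed = False
except AssertionError:
    failed = True
```

Writing the check as a plain function first and deriving the decorator from it keeps the two usage styles in step: the same function serves both `.pipe` chains and decorated functions.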
- better NaN ignoring (e.g. is_monotonic)
- better subsetting / column-specific things
- better error messages