# `tidy-harness`

A _tidy_ `pandas.DataFrame` with `scikit-learn` models, interactive `bokeh` visualizations, and `jinja2` templates.

## Usage

### Example: Modeling Fisher's 🌸 Data

```python
from harness import Harness
from pandas import Categorical
from sklearn import datasets, discriminant_analysis
iris = datasets.load_iris()

# Harness is just a dataframe
df = Harness(
    data=iris['data'], index=Categorical(iris['target']),
    estimator=discriminant_analysis.LinearDiscriminantAnalysis(),
    feature_level=-1, # the feature level indicates an index
                      # in the dataframe. -1 is the last index.
)

# Fit the model with 50 random rows.
df.sample(50).fit()

# Transform the dataframe
transformed = df.transform()
transformed.set_index(
    df.index
    .rename_categories(iris['target_names'])
    .rename('species'), append=True, inplace=True,
)

# Plot the dataframe using Bokeh charts.
with transformed.reset_index().DataSource(x=0, y=1) as source:
    source.Scatter(color='species')
    source.show()
```

### More Examples

More examples can be found in the [`tests`](https://github.com/tonyfast/tidy-harness/tree/master/tests) directory.  Tap the __Ⓣ key__ in the GitHub interface to search the files quickly.

## Install

For the time being, install directly from GitHub:

```bash
pip install git+https://github.com/tonyfast/tidy-harness
```

## Background

`harness` began as a response to the need to keep `scikit-learn` models close to a `pandas.DataFrame`.  Because a DataFrame holds __[Tidy Data](http://vita.had.co.nz/papers/tidy-data.pdf)__, its rows and columns can track samples and features across many estimations.  That bookkeeping makes it easier to design a testing harness for data science.

The `DataFrame` has a powerful declarative syntax; consider the `groupby` and `rolling` APIs.  Scientific computing and visualization are trending toward declarative and functional syntaxes, as seen in [altair](https://github.com/altair-viz/altair), dask, and scikit-learn.

`tidy-harness` aims to provide a chainable interface between `pandas.DataFrame` objects and other popular scientific computing libraries in the Python ecosystem.  The initial `harness` extensions:

* attach a `scikit-learn` estimator to the dataframe.
* attach a shared `jinja2` environment to render narratives about the dataframes.
* add `bokeh` plotting methods with a `contextmanager` for interactive visualization development.
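
To illustrate the context-manager idea in the last item, here is a minimal sketch that uses only the standard library. `PlotSource`, `data_source`, and `scatter` are hypothetical names invented for this illustration; they are not part of `harness` or `bokeh`, and the real `DataSource` may work quite differently.

```python
from contextlib import contextmanager


class PlotSource:
    """Hypothetical stand-in for a plotting data source (not the harness API)."""

    def __init__(self, rows, x, y):
        self.rows = rows
        self.x = x
        self.y = y
        self.glyphs = []  # glyph specifications queued for rendering

    def scatter(self, color=None):
        # Record a scatter glyph; a real implementation would call bokeh here.
        self.glyphs.append(
            {"kind": "scatter", "x": self.x, "y": self.y, "color": color}
        )
        return self  # returning self keeps the calls chainable


@contextmanager
def data_source(rows, x, y):
    """Yield a PlotSource bound to the x and y columns, then release the rows."""
    source = PlotSource(rows, x, y)
    try:
        yield source
    finally:
        # Cleanup runs even if plotting raises; here it only drops the row reference.
        source.rows = None


rows = [{"a": 1.0, "b": 2.0, "species": "setosa"}]
with data_source(rows, x="a", y="b") as source:
    source.scatter(color="species")

print(source.glyphs)
```

The `with` block scopes the column bindings to one place, so several glyphs can share the same `x` and `y` without repeating them.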

## Development

> The development scripts can be run through this notebook.

Jupyter notebooks are used for all Python development in this project.  The key features are:

* `watchdog`, a file system watcher that converts notebooks to Python scripts with `nbconvert`.  _Tests are not converted._
* `nbconvert` with the `--execute` flag to run notebooks and fill in their outputs.  _The current goal is for the notebooks to be viewable in a GitHub repo._
* `pytest-ipynb` to run tests directly on the notebooks.
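
As a rough picture of what converting a notebook to a script involves, the sketch below joins the code cells of a version 4 notebook file using only the standard library. It is a simplified stand-in, not how `nbconvert` is implemented; `notebook_to_script` is a hypothetical helper, and it makes no attempt to handle IPython magics such as `%%script`.

```python
import json


def notebook_to_script(notebook_json: str) -> str:
    """Join the code cells of an nbformat-4 notebook into one Python script."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of line strings.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            chunks.append(source.rstrip("\n"))
    return "\n\n".join(chunks) + "\n"


# A tiny in-memory notebook used only for this illustration.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["x = 1\n", "y = x + 1"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": "print(y)"},
    ],
})

print(notebook_to_script(example))
```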

### Making the python module

The script below:

* Installs a develop copy of `harness`
* Listens for file system events to convert notebooks to `python` scripts.

```ipython
%%script bash --bg
python setup.py develop
watchmedo tricks tricks.yaml
```

Starting job # 0 in a separate thread.


```ipython
# Execute this cell to stop watching the files
%killbgscripts
```

All background processes were killed.


### Build & Run Tests

The tests require `pytest` and `pytest-ipynb`.

```ipython
%%script bash
jupyter nbconvert tests/*.ipynb --execute --to notebook --inplace
py.test
```