Merge branch 'develop' into release-0.1.2
tgsmith61591 committed Oct 22, 2016
2 parents be01d2c + 4ead822 commit 757cc54
Showing 16 changed files with 543 additions and 116 deletions.
7 changes: 3 additions & 4 deletions .travis.yml
@@ -46,12 +46,11 @@ install:
- if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then
conda install --yes -c dan_blanchard python-coveralls nose-cov;
fi
- chmod 777 ./.travis/install.sh
- ./.travis/install.sh
- chmod 777 ./build_tools/travis/install.sh
- ./build_tools/travis/install.sh
- pip install coveralls
- pip install matplotlib
- pip install seaborn
- pip install tox
- pip install http://h2o-release.s3.amazonaws.com/h2o/rel-turchin/9/Python/h2o-3.8.2.9-py2.py3-none-any.whl
- python setup.py develop

@@ -72,4 +71,4 @@ script:
after_success:
- if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then
coveralls;
fi
fi
15 changes: 7 additions & 8 deletions README.md
@@ -8,13 +8,13 @@


# scikit-util
What began as a succinct set of [sklearn](https://github.com/scikit-learn/scikit-learn) extension classes and utilities (as well as implementations of preprocessors from R packages like [caret](https://github.com/topepo/caret)) grew to bridge functionality between sklearn and [H2O](https://github.com/h2oai/h2o-3). Now, scikit-util (skutil) brings the best of both worlds to H2O and sklearn, delivering an easy transition into the world of distributed computing that H2O offers, while providing the same, familiar interface that sklearn users have come to know and love. View the [documentation here](https://tgsmith61591.github.io/skutil)
What began as a modest, succinct set of [sklearn](https://github.com/scikit-learn/scikit-learn) extension classes and utilities (as well as implementations of preprocessors from R packages like [caret](https://github.com/topepo/caret)) grew to bridge functionality between sklearn and [H2O](https://github.com/h2oai/h2o-3). Now, scikit-util (skutil) brings the best of both worlds to H2O and sklearn, delivering an easy transition into the world of distributed computing that H2O offers, while providing the same, familiar interface that sklearn users have come to know and love. __View the [documentation here](https://tgsmith61591.github.io/skutil)__



### Pre-installation
Skutil depends on the ability to compile Fortran code. For different platforms, there are different ways to install `gcc`:
- Mac OS (__note__: this can take a while):
Skutil adapts code from several R packages, and thus depends on the ability to compile Fortran code using `gcc`. For different platforms, there are different ways to install `gcc` (the easiest, of course, being [Homebrew](http://brew.sh/)):
- __Mac OS__ (__note__: this can take a while):
```bash
brew install gcc
```
@@ -24,7 +24,7 @@ There is a bug in some setups that will still cause issues in symlinking the `gc
brew link --overwrite gcc
```

- Linux:
- __Linux__:
```bash
sudo apt-get install gcc
```
@@ -36,7 +36,7 @@ sudo apt-get install gcc

### Installation:

Installation is easy. After cloning the project onto your machine, simply use the `setup.py` file:
Installation is easy. After cloning the project onto your machine and installing the required dependencies, simply use the `setup.py` file:

```bash
git clone https://github.com/tgsmith61591/skutil.git
@@ -45,9 +45,9 @@ python setup.py install
```


### Contributing:
### Installing for ongoing development:

If you'd like to fork skutil and will be running some tests, your setup is a bit different. Rather than using the `install` arg, use `develop`. This creates a symlink in the local directory so that as you make changes, they are automatically reflected and you don't have to re-install every time. For more information on `develop` vs. `install`, see [this](http://stackoverflow.com/questions/19048732/python-setup-py-develop-vs-install) StackOverflow question. Note that after running setup with `develop`, you may have to uninstall before re-running with `install`. *If you are experiencing the dreaded* `no module named dqrsl` *issue and your GCC is up-to-date, it's likely a* `develop` *vs.* `install` *issue. Try uninstalling, clearing the egg from the local folder (or popping the local path from* `sys.path`*) and running setup with the* `install` *option.*
If you'd like to fork skutil to contribute to the codebase and intend to run some tests, your setup is a bit different. Rather than using the `install` arg, use `develop`. This creates a symlink in the local directory so that as you make changes, they are automatically reflected and you don't have to re-install every time. For more information on `develop` vs. `install`, see [this](http://stackoverflow.com/questions/19048732/python-setup-py-develop-vs-install) StackOverflow question. Note that after running setup with `develop`, you may have to uninstall before re-running with `install`. *If you are experiencing the dreaded* `no module named dqrsl` *issue and your GCC is up-to-date, it's likely a* `develop` *vs.* `install` *issue. Try uninstalling, clearing the egg from the local folder (or popping the local path from* `sys.path`*) and running setup with the* `install` *option.*

```bash
git clone https://github.com/tgsmith61591/skutil.git
@@ -58,6 +58,5 @@ nosetests


#### Examples:
- See the [wiki](https://github.com/tgsmith61591/skutil/wiki)
- See the [example ipython notebooks](https://github.com/tgsmith61591/skutil/tree/master/doc/examples)
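
The README's note on `develop` vs. `install` comes down to which copy of the package Python actually imports. As a minimal diagnostic sketch (the helper name `locate_package` is hypothetical and not part of skutil), the standard library can report where a package would be loaded from without importing it:

```python
import importlib.util
import os


def locate_package(name):
    """Return the directory a top-level package would be imported from, or None.

    Hypothetical helper for illustration; it only inspects the import
    machinery and does not import the package itself.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(os.path.abspath(spec.origin))


if __name__ == "__main__":
    # Assumption: the package is importable under the name "skutil".
    print(locate_package("skutil"))
```

If the printed directory sits inside the cloned repository, the `develop` link is most likely the copy in use; a path under `site-packages` suggests a regular `install`.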

6 changes: 3 additions & 3 deletions .travis/install.sh → build_tools/travis/install.sh
@@ -1,10 +1,10 @@
#!/bin/bash

case "${COMBINATION}" in
python-2-7-sklearn-0-17)
"python-2-7-sklearn-0-17" | "python-3-5-sklearn-0-17")
pip install scikit-learn==0.17.1
;;
python-2-7-sklearn-0-18)
"python-2-7-sklearn-0-18" | "python-3-5-sklearn-0-18")
pip install scikit-learn==0.18
;;
esac
esac
34 changes: 25 additions & 9 deletions skutil/h2o/balance.py
@@ -32,7 +32,7 @@ def _validate_x_y_ratio(X, y, ratio):
X : H2OFrame
The frame from which to sample
y_name : str
y : str
The name of the column that is the response class
Returns
@@ -89,7 +89,8 @@ class H2OOversamplingClassBalancer(_BaseH2OBalancer):
target_feature : str
The name of the response column. The response column must be
bi-class, no more or less.
more than a single class and less than
``skutil.preprocessing.balance.BalancerMixin._max_classes``
ratio : float, optional (default=0.2)
The target ratio of the minority records to the majority records. If the
@@ -114,8 +115,14 @@ def balance(self, X):
Parameters
----------
X : H2OFrame, shape [n_samples, n_features]
The data to balance
X : H2OFrame, shape=[n_samples, n_features]
The imbalanced dataset.
Returns
-------
Xb : H2OFrame
The balanced H2OFrame
"""
# check on state of X
frame = _check_is_frame(X)
@@ -129,7 +136,8 @@ def balance(self, X):
# since H2O won't allow us to resample (it's considered rearranging)
# we need to rbind at each point of duplication... this can be pretty
# inefficient, so we might need to get clever about this...
return reorder_h2o_frame(frame, sample_idcs)
Xb = reorder_h2o_frame(frame, sample_idcs)
return Xb


class H2OUndersamplingClassBalancer(_BaseH2OBalancer):
@@ -153,7 +161,8 @@ class H2OUndersamplingClassBalancer(_BaseH2OBalancer):
target_feature : str
The name of the response column. The response column must be
biclass, no more or less.
more than a single class and less than
``skutil.preprocessing.balance.BalancerMixin._max_classes``
ratio : float, optional (default=0.2)
The target ratio of the minority records to the majority records. If the
@@ -181,8 +190,14 @@ def balance(self, X):
Parameters
----------
X : H2OFrame, shape [n_samples, n_features]
The data to balance
X : H2OFrame, shape=[n_samples, n_features]
The imbalanced dataset.
Returns
-------
Xb : H2OFrame
The balanced H2OFrame
"""

# check on state of X
@@ -196,4 +211,5 @@ def balance(self, X):
# since there are no feature_names, we can just slice
# the h2o frame as is, given the indices:
idcs = partitioner.get_indices(self.shuffle)
return frame[idcs, :] if not self.shuffle else reorder_h2o_frame(frame, idcs)
Xb = frame[idcs, :] if not self.shuffle else reorder_h2o_frame(frame, idcs)
return Xb
49 changes: 46 additions & 3 deletions skutil/h2o/encode.py
@@ -19,8 +19,24 @@ def _val_vec(y):

class _H2OVecSafeOneHotEncoder(BaseH2OTransformer):
"""Safely one-hot encodes an H2OVec into an H2OFrame of
one-hot encoded dummies. Skips previously unseen levels
in the transform section.
one-hot encoded dummies. Whereas H2O's default behavior for
previously-unseen factor levels is to error, the
``_H2OVecSafeOneHotEncoder`` skips previously-unseen levels
in the ``transform`` section, returning 'nan' (which H2O
interprets as ``NA``).
Parameters
----------
feature_names : array_like (str), optional (default=None)
The list of names on which to fit the transformer.
target_feature : str, optional (default=None)
The name of the target feature (is excluded from the fit)
for the estimator.
exclude_features : iterable or None, optional (default=None)
Any names that should be excluded from ``feature_names``
"""

_min_version = '3.8.2.9'
@@ -34,6 +50,19 @@ def __init__(self):
max_version=self._max_version)

def fit(self, y):
"""Fit the encoder.
Parameters
----------
y : H2OFrame
The frame to fit
Returns
-------
self
"""
# validate y
y = _val_vec(y)

@@ -52,6 +81,20 @@ def fit(self, y):
return self

def transform(self, y):
"""Transform a new 1d frame after fit.
Parameters
----------
y : H2OFrame, 1d
The 1d frame to transform
Returns
-------
output : H2OFrame, 1d
The transformed H2OFrame
"""
# make sure is fitted, validate y
check_is_fitted(self, 'classes_')
y = _val_vec(y)
@@ -157,7 +200,7 @@ def transform(self, X):
Returns
-------
X_transform : H2OFrame
X : H2OFrame
The transformed H2OFrame
"""
check_is_fitted(self, 'encoders_')
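
The `_H2OVecSafeOneHotEncoder` docstring says previously-unseen levels are skipped at transform time and come back as 'nan'. The following is a minimal sketch of that behavior on plain Python lists. The class name `SafeOneHotSketch` is hypothetical, and filling the entire row with NaN for an unseen level is an assumption about the output shape, not something the diff confirms.

```python
import math


class SafeOneHotSketch:
    """Toy one-hot encoder that tolerates unseen levels (illustrative only)."""

    def fit(self, values):
        # remember the levels seen during fit, in a stable order
        self.classes_ = sorted(set(values))
        return self

    def transform(self, values):
        index = {level: i for i, level in enumerate(self.classes_)}
        width = len(self.classes_)
        rows = []
        for value in values:
            if value in index:
                row = [0.0] * width
                row[index[value]] = 1.0
            else:
                # unseen level: emit NaN instead of raising an error
                row = [math.nan] * width
            rows.append(row)
        return rows
```

Fitting on `["a", "b", "a"]` and transforming `["b", "c"]` yields `[0.0, 1.0]` for the known level and a row of NaN values for the unseen level `"c"`.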
9 changes: 8 additions & 1 deletion skutil/h2o/frame.py
@@ -13,14 +13,21 @@

def _check_is_1d_frame(X):
"""Check whether X is an H2OFrame
and that it's a 1d column.
and that it's a 1d column. If not, will
raise an ``AssertionError``
Parameters
----------
X : H2OFrame
The H2OFrame
Raises
------
``AssertionError`` if the ``X`` variable
is not a 1-dimensional H2OFrame.
Returns
-------
28 changes: 27 additions & 1 deletion skutil/h2o/grid_search.py
@@ -22,7 +22,7 @@
from .base import _check_is_frame, BaseH2OFunctionWrapper, validate_x_y, VizMixin
from skutil.base import overrides
from ..utils import report_grid_score_detail
from ..utils.metaestimators import if_delegate_has_method
from ..utils.metaestimators import if_delegate_has_method, if_delegate_isinstance
from skutil.grid_search import _CVScoreTuple, _check_param_grid
from ..metrics import GainsStatisticalReport
from .split import *
@@ -632,6 +632,32 @@ def fit_predict(self, frame):
"""
return self.fit(frame).predict(frame)

@if_delegate_isinstance(delegate='best_estimator_', instance_type=(H2OEstimator, H2OPipeline))
def download_pojo(self, path="", get_jar=True):
"""This method is injected at runtime if the ``best_estimator_``
is an instance of an ``H2OEstimator`` or ``H2OPipeline``. This method downloads
the POJO from a fitted estimator (for a pipeline, from its final estimator).
Parameters
----------
path : string, optional (default="")
Where to save the POJO.
get_jar : bool, optional (default=True)
Whether to get the jar from the POJO.
Returns
-------
None or string
Returns None if ``path`` is ""; otherwise, the filepath
where the POJO was saved.
"""
is_h2o = isinstance(self.best_estimator_, H2OEstimator)
return h2o.download_pojo(self.best_estimator_ if is_h2o else self.best_estimator_._final_estimator,
path=path, get_jar=get_jar)

@overrides(VizMixin)
def plot(self, timestep, metric):
check_is_fitted(self, 'best_estimator_')
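
The diff imports `if_delegate_isinstance` from `..utils.metaestimators` and uses it to expose `download_pojo` only when `best_estimator_` has a suitable type, but the decorator's body is not shown here. One way such a decorator could work is a descriptor that raises `AttributeError` on access when the delegate has the wrong type, so that `hasattr` reports the method as missing. The sketch below is an assumption along those lines; `if_delegate_isinstance_sketch`, `Exportable`, and `Search` are hypothetical names used only for illustration.

```python
from functools import update_wrapper


class _IfDelegateIsInstance:
    """Descriptor that exposes a method only when a delegate has the right type."""

    def __init__(self, fn, delegate, instance_type):
        self.fn = fn
        self.delegate = delegate
        self.instance_type = instance_type
        update_wrapper(self, fn)  # keep the wrapped method's name and docstring

    def __get__(self, obj, owner=None):
        if obj is None:
            return self.fn  # accessed on the class, not an instance
        # raises AttributeError on its own if the delegate is not set yet
        target = getattr(obj, self.delegate)
        if not isinstance(target, self.instance_type):
            raise AttributeError(
                "%s is unavailable: %s is a %s"
                % (self.fn.__name__, self.delegate, type(target).__name__)
            )
        return self.fn.__get__(obj, owner)


def if_delegate_isinstance_sketch(delegate, instance_type):
    """Hypothetical stand-in for the decorator imported in the diff."""
    return lambda fn: _IfDelegateIsInstance(fn, delegate, instance_type)


class Exportable:
    """Placeholder for an estimator type that supports export."""


class Search:
    def __init__(self, best):
        self.best_estimator_ = best

    @if_delegate_isinstance_sketch("best_estimator_", Exportable)
    def download(self):
        return "ok"
```

Here `Search(Exportable()).download()` succeeds, while `hasattr(Search(object()), "download")` is false because the descriptor raises `AttributeError` during lookup.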
