Merge branch 'develop' into release-0.1.2

tgsmith61591 · Oct 22, 2016 · 757cc54
2 parents be01d2c + 4ead822
commit 757cc54

Showing 16 changed files with 543 additions and 116 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -46,12 +46,11 @@ install:
   - if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then
       conda install --yes -c dan_blanchard python-coveralls nose-cov;
     fi
-  - chmod 777 ./.travis/install.sh
-  - ./.travis/install.sh
+  - chmod 777 ./build_tools/travis/install.sh
+  - ./build_tools/travis/install.sh
   - pip install coveralls
   - pip install matplotlib
   - pip install seaborn
-  - pip install tox
   - pip install http://h2o-release.s3.amazonaws.com/h2o/rel-turchin/9/Python/h2o-3.8.2.9-py2.py3-none-any.whl
   - python setup.py develop
 
@@ -72,4 +71,4 @@ script:
 after_success:
   - if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then
       coveralls;
-    fi
+    fi
diff --git a/README.md b/README.md
@@ -8,13 +8,13 @@
 
 
 # scikit-util
-What began as a succinct set of [sklearn](https://github.com/scikit-learn/scikit-learn) extension classes and utilities (as well as implementations of preprocessors from R packages like [caret](https://github.com/topepo/caret)) grew to bridge functionality between sklearn and [H2O](https://github.com/h2oai/h2o-3).  Now, scikit-util (skutil) brings the best of both worlds to H2O and sklearn, delivering an easy transition into the world of distributed computing that H2O offers, while providing the same, familiar interface that sklearn users have come to know and love. View the [documentation here](https://tgsmith61591.github.io/skutil)
+What began as a modest, succinct set of [sklearn](https://github.com/scikit-learn/scikit-learn) extension classes and utilities (as well as implementations of preprocessors from R packages like [caret](https://github.com/topepo/caret)) grew to bridge functionality between sklearn and [H2O](https://github.com/h2oai/h2o-3).  Now, scikit-util (skutil) brings the best of both worlds to H2O and sklearn, delivering an easy transition into the world of distributed computing that H2O offers, while providing the same, familiar interface that sklearn users have come to know and love. __View the [documentation here](https://tgsmith61591.github.io/skutil)__
 
 
 
 ### Pre-installation
-Skutil depends on the ability to compile Fortran code. For different platforms, there are different ways to install `gcc`:
-  - Mac OS (__note__: this can take a while):
+Skutil adapts code from several R packages, and thus depends on the ability to compile Fortran code using `gcc`. The way to install `gcc` differs by platform (on Mac OS, the easiest route is [Homebrew](http://brew.sh/)):
+  - __Mac OS__ (__note__: this can take a while):
 ```bash
 brew install gcc
 ```
@@ -24,7 +24,7 @@ There is a bug in some setups that will still cause issues in symlinking the `gc
 brew link --overwrite gcc
 ```
 
-  - Linux:
+  - __Linux__:
 ```bash
 sudo apt-get install gcc
 ```
@@ -36,7 +36,7 @@ sudo apt-get install gcc
 
 ### Installation:
 
-Installation is easy. After cloning the project onto your machine, simply use the `setup.py` file:
+Installation is easy. After cloning the project onto your machine and installing the required dependencies, simply use the `setup.py` file:
 
 ```bash
 git clone https://github.com/tgsmith61591/skutil.git
@@ -45,9 +45,9 @@ python setup.py install
 ```
 
 
-### Contributing:
+### Installing for ongoing development:
 
-If you'd like to fork skutil and will be running some tests, your setup is a bit different. Rather than using the `install` arg, use `develop`. This creates a symlink in the local directory so that as you make changes, they are automatically reflected and you don't have to re-install every time. For more information on `develop` vs. `install`, see [this](http://stackoverflow.com/questions/19048732/python-setup-py-develop-vs-install) StackOverflow question. Note that after running setup with `develop`, you may have to uninstall before re-running with `install`. *If you are experiencing the dreaded* `no module named dqrsl` *issue and your GCC is up-to-date, it's likely a* `develop` *vs.* `install` *issue. Try uninstalling, clearing the egg from the local folder (or popping the local path from* `sys.path`*) and running setup with the* `install` *option.*
+If you'd like to fork skutil to contribute to the codebase and intend to run some tests, your setup is a bit different. Rather than using the `install` arg, use `develop`. This creates a symlink in the local directory so that as you make changes, they are automatically reflected and you don't have to re-install every time. For more information on `develop` vs. `install`, see [this](http://stackoverflow.com/questions/19048732/python-setup-py-develop-vs-install) StackOverflow question. Note that after running setup with `develop`, you may have to uninstall before re-running with `install`. *If you are experiencing the dreaded* `no module named dqrsl` *issue and your GCC is up-to-date, it's likely a* `develop` *vs.* `install` *issue. Try uninstalling, clearing the egg from the local folder (or popping the local path from* `sys.path`*) and running setup with the* `install` *option.*
 
 ```bash
 git clone https://github.com/tgsmith61591/skutil.git
@@ -58,6 +58,5 @@ nosetests
 
 
 #### Examples:
-  - See the [wiki](https://github.com/tgsmith61591/skutil/wiki)
   - See the [example ipython notebooks](https://github.com/tgsmith61591/skutil/tree/master/doc/examples)
 
diff --git a/.travis/install.sh → build_tools/travis/install.sh b/.travis/install.sh → build_tools/travis/install.sh
@@ -1,10 +1,10 @@
 #!/bin/bash
 
 case "${COMBINATION}" in
-    python-2-7-sklearn-0-17)
+    "python-2-7-sklearn-0-17" | "python-3-5-sklearn-0-17")
         pip install scikit-learn==0.17.1
         ;;
-    python-2-7-sklearn-0-18)
+    "python-2-7-sklearn-0-18" | "python-3-5-sklearn-0-18")
         pip install scikit-learn==0.18
         ;;
-esac
+esac
diff --git a/skutil/h2o/balance.py b/skutil/h2o/balance.py
@@ -32,7 +32,7 @@ def _validate_x_y_ratio(X, y, ratio):
     X : H2OFrame
         The frame from which to sample
 
-    y_name : str
+    y : str
         The name of the column that is the response class
 
     Returns
@@ -89,7 +89,8 @@ class H2OOversamplingClassBalancer(_BaseH2OBalancer):
 
     target_feature : str
         The name of the response column. The response column must be
-        bi-class, no more or less.
+        composed of more than one class and fewer than
+        ``skutil.preprocessing.balance.BalancerMixin._max_classes`` classes.
 
     ratio : float, optional (default=0.2)
         The target ratio of the minority records to the majority records. If the
@@ -114,8 +115,14 @@ def balance(self, X):
         Parameters
         ----------
 
-        X : H2OFrame, shape [n_samples, n_features]
-            The data to balance
+        X : H2OFrame, shape=[n_samples, n_features]
+            The imbalanced dataset.
+
+        Returns
+        -------
+
+        Xb : H2OFrame
+            The balanced H2OFrame
         """
         # check on state of X
         frame = _check_is_frame(X)
@@ -129,7 +136,8 @@ def balance(self, X):
         # since H2O won't allow us to resample (it's considered rearranging)
         # we need to rbind at each point of duplication... this can be pretty
         # inefficient, so we might need to get clever about this...
-        return reorder_h2o_frame(frame, sample_idcs)
+        Xb = reorder_h2o_frame(frame, sample_idcs)
+        return Xb
 
 
 class H2OUndersamplingClassBalancer(_BaseH2OBalancer):
@@ -153,7 +161,8 @@ class H2OUndersamplingClassBalancer(_BaseH2OBalancer):
 
     target_feature : str
         The name of the response column. The response column must be
-        biclass, no more or less.
+        composed of more than one class and fewer than
+        ``skutil.preprocessing.balance.BalancerMixin._max_classes`` classes.
 
     ratio : float, optional (default=0.2)
         The target ratio of the minority records to the majority records. If the
@@ -181,8 +190,14 @@ def balance(self, X):
         Parameters
         ----------
 
-        X : H2OFrame, shape [n_samples, n_features]
-            The data to balance
+        X : H2OFrame, shape=[n_samples, n_features]
+            The imbalanced dataset.
+
+        Returns
+        -------
+
+        Xb : H2OFrame
+            The balanced H2OFrame
         """
 
         # check on state of X
@@ -196,4 +211,5 @@ def balance(self, X):
         # since there are no feature_names, we can just slice
         # the h2o frame as is, given the indices:
         idcs = partitioner.get_indices(self.shuffle)
-        return frame[idcs, :] if not self.shuffle else reorder_h2o_frame(frame, idcs)
+        Xb = frame[idcs, :] if not self.shuffle else reorder_h2o_frame(frame, idcs)
+        return Xb
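
The `ratio` docstrings above describe a target ratio of minority to majority records. As a rough illustration only (not skutil's implementation), the index selection behind oversampling could be sketched in plain Python as follows. The function name `oversample_indices`, the `seed` argument, and the choice to top up every class that falls below `ratio * majority_count` are assumptions made for this sketch.

```python
import math
import random
from collections import Counter


def oversample_indices(y, ratio=0.2, seed=42):
    """Hypothetical helper: return row indices such that every class has at
    least ``ratio * majority_count`` records, duplicating rows as needed."""
    counts = Counter(y)
    majority_count = max(counts.values())
    # smallest class size that satisfies the requested ratio
    target = int(math.ceil(ratio * majority_count))

    rng = random.Random(seed)
    idcs = list(range(len(y)))  # keep every original row once
    for label, n in counts.items():
        if n < target:
            # sample with replacement from this class until it hits the target
            pool = [i for i, lbl in enumerate(y) if lbl == label]
            idcs.extend(rng.choice(pool) for _ in range(target - n))
    return idcs


labels = ["a"] * 100 + ["b"] * 5
balanced = oversample_indices(labels, ratio=0.5)
print(Counter(labels[i] for i in balanced))
```

The returned indices contain duplicates, which is why the code above them needs `reorder_h2o_frame` rather than a plain slice.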
diff --git a/skutil/h2o/encode.py b/skutil/h2o/encode.py
@@ -19,8 +19,24 @@ def _val_vec(y):
 
 class _H2OVecSafeOneHotEncoder(BaseH2OTransformer):
     """Safely one-hot encodes an H2OVec into an H2OFrame of
-    one-hot encoded dummies. Skips previously unseen levels
-    in the transform section.
+    one-hot encoded dummies. Whereas H2O's default behavior for
+    previously-unseen factor levels is to error, the 
+    ``_H2OVecSafeOneHotEncoder`` skips previously-unseen levels
+    in the ``transform`` section, returning 'nan' (which H2O
+    interprets as ``NA``).
+
+    Parameters
+    ----------
+
+    feature_names : array_like (str), optional (default=None)
+        The list of names on which to fit the transformer.
+
+    target_feature : str, optional (default None)
+        The name of the target feature (is excluded from the fit)
+        for the estimator.
+
+    exclude_features : iterable or None, optional (default=None)
+        Any names that should be excluded from ``feature_names``
     """
 
     _min_version = '3.8.2.9'
@@ -34,6 +50,19 @@ def __init__(self):
                                                        max_version=self._max_version)
 
     def fit(self, y):
+        """Fit the encoder.
+
+        Parameters
+        ----------
+
+        y : H2OFrame
+            The frame to fit
+
+        Returns
+        -------
+
+        self
+        """
         # validate y
         y = _val_vec(y)
 
@@ -52,6 +81,20 @@ def fit(self, y):
         return self
 
     def transform(self, y):
+        """Transform a new 1d frame after fit.
+
+        Parameters
+        ----------
+
+        y : H2OFrame, 1d
+            The 1d frame to transform
+
+        Returns
+        -------
+
+        output : H2OFrame, 1d
+            The transformed H2OFrame
+        """
         # make sure is fitted, validate y
         check_is_fitted(self, 'classes_')
         y = _val_vec(y)
@@ -157,7 +200,7 @@ def transform(self, X):
         Returns
         -------
 
-        X_transform : H2OFrame
+        X : H2OFrame
             The transformed H2OFrame
         """
         check_is_fitted(self, 'encoders_')
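
The `_H2OVecSafeOneHotEncoder` docstring above says unseen levels are skipped in `transform` and returned as 'nan'. A minimal pure-Python sketch of that idea follows, with no H2O dependency. The helper names `fit_levels` and `safe_one_hot` are hypothetical, and filling the whole row with NaN for an unseen level is an assumption about the intended behavior rather than something the diff confirms.

```python
def fit_levels(values):
    """Hypothetical helper: collect the sorted, distinct levels seen at fit time."""
    return sorted(set(values))


def safe_one_hot(values, levels):
    """Hypothetical helper: one-hot encode ``values`` against ``levels``.

    A value that was not seen at fit time yields a row of NaN
    instead of raising an error.
    """
    nan = float("nan")
    known = set(levels)
    rows = []
    for v in values:
        if v in known:
            rows.append([1.0 if v == lvl else 0.0 for lvl in levels])
        else:
            # previously-unseen level: emit missing values, do not fail
            rows.append([nan] * len(levels))
    return rows


levels = fit_levels(["red", "blue", "red"])
encoded = safe_one_hot(["blue", "green"], levels)
print(levels)
print(encoded)
```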

diff --git a/skutil/h2o/frame.py b/skutil/h2o/frame.py
@@ -13,14 +13,21 @@
 
 def _check_is_1d_frame(X):
     """Check whether X is an H2OFrame
-    and that it's a 1d column.
+    and that it's a 1d column. If not, will
+    raise an ``AssertionError``
 
     Parameters
     ----------
 
     X : H2OFrame
         The H2OFrame
 
+    Raises
+    ------
+
+    AssertionError
+        If ``X`` is not a 1-dimensional H2OFrame.
+
     Returns
     -------
 

diff --git a/skutil/h2o/grid_search.py b/skutil/h2o/grid_search.py
@@ -22,7 +22,7 @@
 from .base import _check_is_frame, BaseH2OFunctionWrapper, validate_x_y, VizMixin
 from skutil.base import overrides
 from ..utils import report_grid_score_detail
-from ..utils.metaestimators import if_delegate_has_method
+from ..utils.metaestimators import if_delegate_has_method, if_delegate_isinstance
 from skutil.grid_search import _CVScoreTuple, _check_param_grid
 from ..metrics import GainsStatisticalReport
 from .split import *
@@ -632,6 +632,32 @@ def fit_predict(self, frame):
         """
         return self.fit(frame).predict(frame)
 
+    @if_delegate_isinstance(delegate='best_estimator_', instance_type=(H2OEstimator, H2OPipeline))
+    def download_pojo(self, path="", get_jar=True):
+        """This method is injected at runtime if the ``best_estimator_``
+        is an instance of ``H2OEstimator`` or ``H2OPipeline``. It downloads
+        the POJO from the fitted estimator (a pipeline's final estimator).
+
+        Parameters
+        ----------
+
+        path : string, optional (default="")
+            Where to save the POJO.
+
+        get_jar : bool, optional (default=True)
+            Whether to get the jar from the POJO.
+
+        Returns
+        -------
+
+        None or string
+            Returns None if ``path`` is ""; otherwise, the filepath
+            where the POJO was saved.
+        """
+        is_h2o = isinstance(self.best_estimator_, H2OEstimator)
+        return h2o.download_pojo(self.best_estimator_ if is_h2o else self.best_estimator_._final_estimator, 
+                                 path=path, get_jar=get_jar)
+
     @overrides(VizMixin)
     def plot(self, timestep, metric):
         check_is_fitted(self, 'best_estimator_')
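
The body of `if_delegate_isinstance` is not part of this diff. One plausible way to write such a decorator, modeled on the descriptor pattern of scikit-learn's `if_delegate_has_method`, is sketched below. Everything except the decorator name and its `delegate` / `instance_type` arguments is an assumption, and the `Fake*` classes are hypothetical stand-ins for the H2O types.

```python
from functools import update_wrapper


class _IfDelegateIsInstance(object):
    """Descriptor that exposes a method only if the delegate has the right type."""

    def __init__(self, fn, delegate, instance_type):
        self.fn = fn
        self.delegate = delegate
        self.instance_type = instance_type
        update_wrapper(self, fn)

    def __get__(self, obj, owner=None):
        if obj is not None:
            # raises AttributeError if the delegate does not exist yet
            delegate = getattr(obj, self.delegate)
            if not isinstance(delegate, self.instance_type):
                raise AttributeError("%s is unavailable for delegate of type %s"
                                     % (self.fn.__name__, type(delegate).__name__))

        def bound(*args, **kwargs):
            return self.fn(obj, *args, **kwargs)

        update_wrapper(bound, self.fn)
        return bound


def if_delegate_isinstance(delegate, instance_type):
    """Sketch of the decorator factory used in the diff above."""
    return lambda fn: _IfDelegateIsInstance(fn, delegate, instance_type)


class FakeEstimator(object):
    """Hypothetical stand-in for ``H2OEstimator``."""


class FakeSearch(object):
    """Hypothetical stand-in for the grid search class."""

    def __init__(self, best_estimator):
        self.best_estimator_ = best_estimator

    @if_delegate_isinstance(delegate="best_estimator_", instance_type=(FakeEstimator,))
    def export(self):
        return "exported"


print(hasattr(FakeSearch(FakeEstimator()), "export"))
print(hasattr(FakeSearch(object()), "export"))
```

Because the check raises `AttributeError` from `__get__`, `hasattr` reports the method as absent when the delegate has the wrong type, which is what makes the method appear to be "injected at runtime".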