Merge pull request #79 from rodrigo-arenas/0.7.X
GAFeatureSelectionCV
rodrigo-arenas committed Nov 17, 2021
2 parents e733cb7 + bf4e4a6 commit bbd77d7
Showing 26 changed files with 1,526 additions and 40 deletions.
55 changes: 50 additions & 5 deletions README.rst
@@ -25,9 +25,10 @@
Sklearn-genetic-opt
###################

scikit-learn models hyperparameters tuning, using evolutionary algorithms.
scikit-learn models hyperparameters tuning and feature selection, using evolutionary algorithms.

This is meant to be an alternative from popular methods inside scikit-learn such as Grid Search and Randomized Grid Search.
This is meant to be an alternative to popular methods inside scikit-learn, such as Grid Search and Randomized Grid Search,
for hyperparameters tuning, and to RFE and SelectFromModel for feature selection.

Sklearn-genetic-opt uses evolutionary algorithms from the DEAP package to choose the set of hyperparameters that
optimizes (maximizes or minimizes) the cross-validation score; it can be used for both regression and classification problems.
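
As a minimal sketch of that routine (the hyperparameter ranges here are illustrative, and training data ``X_train``/``y_train`` are assumed to be defined):

.. code-block:: python

    from sklearn_genetic import GASearchCV
    from sklearn_genetic.space import Continuous, Integer
    from sklearn.tree import DecisionTreeClassifier

    # Illustrative search space built from the package's space classes
    param_grid = {
        "min_weight_fraction_leaf": Continuous(0.01, 0.5),
        "max_depth": Integer(2, 25),
    }

    evolved_estimator = GASearchCV(
        estimator=DecisionTreeClassifier(),
        param_grid=param_grid,
        cv=3,
        scoring="accuracy",
        population_size=10,
        generations=20,
    )
    evolved_estimator.fit(X_train, y_train)
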
@@ -37,7 +38,8 @@ Documentation is available `here <https://sklearn-genetic-opt.readthedocs.io/>`_
Main Features:
##############

* **GASearchCV**: Principal class of the package, holds the evolutionary cross-validation optimization routine.
* **GASearchCV**: Main class of the package for hyperparameters tuning; holds the evolutionary cross-validation optimization routine.
* **GAFeatureSelectionCV**: Main class of the package for feature selection.
* **Algorithms**: Set of different evolutionary algorithms to use as an optimization procedure.
* **Callbacks**: Custom evaluation strategies to generate early stopping rules,
logging (into TensorBoard, .pkl files, etc) or your custom logic.
@@ -82,8 +84,8 @@ The only optional dependency that the last command does not install, it's Tensor
it is usually advised to look further which distribution works better for you.


Example
#######
Example: Hyperparameters Tuning
###############################

.. code-block:: python
@@ -134,6 +136,49 @@ Example
print("Best k solutions: ", evolved_estimator.hof)
Example: Feature Selection
##########################

.. code:: python3

    import matplotlib.pyplot as plt
    from sklearn_genetic import GAFeatureSelectionCV
    from sklearn.model_selection import train_test_split, StratifiedKFold
    from sklearn.svm import SVC
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    import numpy as np

    data = load_iris()
    X, y = data["data"], data["target"]

    # Add random non-important features
    noise = np.random.uniform(0, 10, size=(X.shape[0], 5))
    X = np.hstack((X, noise))

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    clf = SVC(gamma='auto')

    evolved_estimator = GAFeatureSelectionCV(
        estimator=clf,
        scoring="accuracy",
        population_size=30,
        generations=20,
        n_jobs=-1)

    # Train and select the features
    evolved_estimator.fit(X_train, y_train)

    # Features selected by the algorithm
    features = evolved_estimator.best_features_
    print(features)

    # Predict only with the subset of selected features
    y_predict_ga = evolved_estimator.predict(X_test[:, features])
    print(accuracy_score(y_test, y_predict_ga))

Changelog
#########

23 changes: 23 additions & 0 deletions docs/api/featureselectioncv.rst
@@ -0,0 +1,23 @@

FeatureSelectionCV
------------------

.. currentmodule:: sklearn_genetic

.. autosummary:: GAFeatureSelectionCV
     GAFeatureSelectionCV.decision_function
     GAFeatureSelectionCV.fit
     GAFeatureSelectionCV.get_params
     GAFeatureSelectionCV.inverse_transform
     GAFeatureSelectionCV.predict
     GAFeatureSelectionCV.predict_proba
     GAFeatureSelectionCV.score
     GAFeatureSelectionCV.score_samples
     GAFeatureSelectionCV.set_params
     GAFeatureSelectionCV.transform

.. autoclass:: sklearn_genetic.GAFeatureSelectionCV
   :members:
   :inherited-members:
   :exclude-members: evaluate, mutate, n_features_in_, classes_
   :undoc-members: True
2 changes: 1 addition & 1 deletion docs/api/mlflow.rst
@@ -3,6 +3,6 @@ MLflow

.. currentmodule:: sklearn_genetic

.. autoclass:: sklearn_genetic.mlflow.MLflowConfig
.. autoclass:: sklearn_genetic.mlflow_log.MLflowConfig
   :members:
   :undoc-members: False
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -55,7 +55,7 @@
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "**.ipynb_checkpoints"]

# -- Options for HTML output -------------------------------------------------

Binary file added docs/images/basic_usage_accuracy_6.PNG
Binary file added docs/images/basic_usage_fitness_plot_7.PNG
Binary file added docs/images/basic_usage_train_log_5.PNG
11 changes: 8 additions & 3 deletions docs/index.rst
@@ -5,10 +5,13 @@
sklearn-genetic-opt
==================
scikit-learn models hyperparameters tuning, using evolutionary algorithms.
##########################################################################
scikit-learn models hyperparameters tuning and feature selection,
using evolutionary algorithms.
#################################################################

This is meant to be an alternative from popular methods inside scikit-learn such as Grid Search and Randomized Grid Search.
This is meant to be an alternative to popular methods inside scikit-learn, such as Grid Search and Randomized Grid Search,
for hyperparameters tuning, and to RFE and SelectFromModel for feature selection.

Sklearn-genetic-opt uses evolutionary algorithms from the DEAP package to choose a set of hyperparameters
that optimizes (maximizes or minimizes) the cross-validation score; it can be used for both regression and classification problems.
@@ -73,6 +76,7 @@ as it is usually advised to look further which distribution works better for you

notebooks/sklearn_comparison.ipynb
notebooks/Boston_Houses_decision_tree.ipynb
notebooks/Iris_feature_selection.ipynb
notebooks/Digits_decision_tree.ipynb
notebooks/MLflow_logger.ipynb

@@ -87,6 +91,7 @@ as it is usually advised to look further which distribution works better for you
:caption: API Reference:

api/gasearchcv
api/featureselectioncv
api/callbacks
api/plots
api/mlflow
260 changes: 260 additions & 0 deletions docs/notebooks/Iris_feature_selection.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/notebooks/MLflow_logger.ipynb
@@ -23,7 +23,7 @@
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.datasets import load_digits\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn_genetic.mlflow import MLflowConfig"
"from sklearn_genetic.mlflow_log import MLflowConfig"
]
},
{
24 changes: 24 additions & 0 deletions docs/release_notes.rst
@@ -3,6 +3,30 @@ Release Notes

Some notes on new features in various releases


What's new in 0.7.0dev0
-----------------------

This is the current in-development version; these features are not yet
available via PyPI.

^^^^^^^^^
Features:
^^^^^^^^^

* :class:`~sklearn_genetic.GAFeatureSelectionCV` for feature selection along
  with any scikit-learn classifier or regressor. It optimizes the cv-score
  while minimizing the number of features to select.
  This class is compatible with the MLflow and TensorBoard integrations,
  the callbacks, and the ``plot_fitness_evolution`` function.

^^^^^^^^^^^^
API Changes:
^^^^^^^^^^^^

* The module :mod:`~sklearn_genetic.mlflow` was renamed to :mod:`~sklearn_genetic.mlflow_log`
  to avoid unexpected errors in name resolution.
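
  For existing code, only the import path changes (a quick sketch; the old module path is assumed to stop working after this rename):

  .. code:: python3

      # Before 0.7.0dev0
      # from sklearn_genetic.mlflow import MLflowConfig

      # From 0.7.0dev0 onwards
      from sklearn_genetic.mlflow_log import MLflowConfig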

What's new in 0.6.1
-------------------

112 changes: 106 additions & 6 deletions docs/tutorials/basic_usage.rst
@@ -6,7 +6,8 @@ How to Use Sklearn-genetic-opt
Introduction
------------

Sklearn-genetic-opt uses evolutionary algorithms to fine-tune scikit-learn machine learning algorithms.
Sklearn-genetic-opt uses evolutionary algorithms to fine-tune scikit-learn machine learning algorithms
and perform feature selection.
It is designed to accept a `scikit-learn <http://scikit-learn.org/stable/index.html>`__
regression or classification model (or a pipeline containing one of those).

@@ -23,8 +24,8 @@ Then by using evolutionary operators as the mating, mutation, selection and eval
it generates new candidates looking to improve the cross-validation score in each generation.
It'll continue with this process until a number of generations is reached or until a callback criterion is met.

Example
-------
Fine-tuning Example
-------------------

First, let's import a dataset and other standard scikit-learn modules; we'll use
the `digits dataset <https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html>`__.
@@ -165,10 +166,109 @@ sklearn-genetic-opt comes with a plot function to analyze this log:
.. image:: ../images/basic_usage_plot_space_4.png

What this plot shows us, is the distributione of the sampled values for each hyperparameter.
What this plot shows us, is the distribution of the sampled values for each hyperparameter.
We can see for example in the *'min_weight_fraction_leaf'* that the algorithm mostly sampled values below 0.15.
You can also check every single combination of variables and the contour plot that represents the sampled values.


Feature Selection Example
-------------------------

For this example, we are going to use the well-known Iris dataset; it's a classification problem with four features.
We are also going to simulate some random noise to represent non-important features:

.. code:: python3

    import matplotlib.pyplot as plt
    from sklearn_genetic import GAFeatureSelectionCV
    from sklearn_genetic.plots import plot_fitness_evolution
    from sklearn.model_selection import train_test_split, StratifiedKFold
    from sklearn.svm import SVC
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    import numpy as np

    data = load_iris()
    X, y = data["data"], data["target"]

    # Add random non-important features
    noise = np.random.uniform(0, 10, size=(X.shape[0], 10))
    X = np.hstack((X, noise))

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

This should give us 10 extra noisy features with our train and test set.

Now we can create the GAFeatureSelectionCV object; it's very similar to GASearchCV and they share
most of their parameters. The main difference is that GAFeatureSelectionCV doesn't run hyperparameters optimization,
so the ``param_grid`` parameter is not available, and the estimator should be defined with its hyperparameters already set.

The feature selection is performed by training models on subsets of the features
and evaluating their cv-scores; the subsets are generated with the available evolutionary algorithms.
It also tries to minimize the number of selected features, so it's a multi-objective optimization.
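
Conceptually, each candidate is a boolean mask over the feature columns; a minimal sketch of evaluating one candidate (an illustration of the idea, not the package's actual internals) could look like this:

.. code:: python3

    import numpy as np
    from sklearn.model_selection import cross_val_score

    def evaluate_candidate(estimator, X, y, mask):
        # mask: boolean vector with one entry per column of X
        cv_score = cross_val_score(estimator, X[:, mask], y, cv=3).mean()
        # Two objectives: maximize the cv-score, minimize the number of features
        return cv_score, int(np.sum(mask))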

Let's create the feature selection object; the estimator we're going to use is an SVM:

.. code:: python3

    clf = SVC(gamma='auto')

    evolved_estimator = GAFeatureSelectionCV(
        estimator=clf,
        cv=3,
        scoring="accuracy",
        population_size=30,
        generations=20,
        n_jobs=-1,
        verbose=True,
        keep_top_k=2,
        elitism=True,
    )

We are ready to run the optimization routine:

.. code:: python3

    # Train and select the features
    evolved_estimator.fit(X_train, y_train)

During the training, the same log format is displayed as before:

.. image:: ../images/basic_usage_train_log_5.PNG

After fitting the model, we have some extra methods to use the model right away. By default it will use the best set of
features it found; remember that since the algorithm used only a subset of features, you have to select those same
columns from the ``X_test`` array, which is done like this:

.. code:: python3

    features = evolved_estimator.best_features_

    # Predict only with the subset of selected features
    y_predict_ga = evolved_estimator.predict(X_test[:, features])
    accuracy = accuracy_score(y_test, y_predict_ga)

.. image:: ../images/basic_usage_accuracy_6.PNG

In this case, we got an accuracy score in the test set of 0.98.

Notice that ``best_features_`` is a vector of bool values; each
position represents the index of the feature (column) and the value indicates
whether that feature was selected (True) or not (False) by the algorithm.
In this example, the algorithm discarded all the noisy random variables we created
and selected the original variables.
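
As a quick illustration (using the ``features`` mask from the snippet above), the indices of the selected columns can be recovered with NumPy:

.. code:: python3

    import numpy as np

    selected_columns = np.flatnonzero(features)
    print(selected_columns)  # e.g. [0 1 2 3] when only the original Iris features survive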

We can also plot the fitness evolution:

.. code:: python3

    from sklearn_genetic.plots import plot_fitness_evolution

    plot_fitness_evolution(evolved_estimator)
    plt.show()

.. image:: ../images/basic_usage_fitness_plot_7.PNG

This concludes our introduction to the basic sklearn-genetic-opt usage.
Further tutorials will cover the GASearchCV parameters, callbacks,
different optimization algorithms and more advanced use cases.
Further tutorials will cover the GASearchCV and GAFeatureSelectionCV parameters, callbacks,
different optimization algorithms and more advanced use cases.

2 changes: 1 addition & 1 deletion docs/tutorials/callbacks.rst
Expand Up @@ -8,7 +8,7 @@ Callbacks can be defined to take actions or decisions over the optimization
process while it is still running.
Common callbacks include different rules to stop the algorithm or log artifacts.
The callbacks are passed to the ``.fit`` method
of the :class:`~sklearn_genetic.GASearchCV` class.
of the :class:`~sklearn_genetic.GASearchCV` or :class:`~sklearn_genetic.GAFeatureSelectionCV` class.
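
For instance, a minimal sketch (assuming an estimator instance ``evolved_estimator`` and training data are already defined):

.. code:: python3

    from sklearn_genetic.callbacks import ConsecutiveStopping

    # Stop if the 'fitness' metric doesn't improve over 5 consecutive generations
    callback = ConsecutiveStopping(generations=5, metric='fitness')
    evolved_estimator.fit(X_train, y_train, callbacks=callback)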

The callbacks are evaluated at the start of the training using the `on_start` method,
at the end of each generation fit using `on_step` method and at the
Expand Down
2 changes: 1 addition & 1 deletion docs/tutorials/custom_callback.rst
@@ -34,7 +34,7 @@ below a threshold value.

The callback must have three parameters: `record`, `logbook` and `estimator`.
Those are a dictionary, a DEAP Logbook object and the
current :class:`~sklearn_genetic.GASearchCV` respectively
current :class:`~sklearn_genetic.GASearchCV` (or :class:`~sklearn_genetic.GAFeatureSelectionCV`) respectively
with the current iteration metrics, all the past iterations metrics
and all the properties saved in the estimator.
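
As a sketch of that signature (the ``record`` key used here, ``fitness_std``, is an assumption based on the default logged metrics):

.. code:: python3

    def fitness_std_stopping(record, logbook, estimator):
        # Returning True stops the optimization
        return record["fitness_std"] < 0.01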

4 changes: 2 additions & 2 deletions docs/tutorials/mlflow.rst
@@ -5,7 +5,7 @@ In this post, we are going to explain how to set up the built-in integration
of sklearn-genetic-opt with MLflow.
To use this feature, we must set the parameters that will include
the tracking server, experiment name, run name, tags and others,
the full implementation is here: :class:`~sklearn_genetic.mlflow.MLflowConfig`
the full implementation is here: :class:`~sklearn_genetic.mlflow_log.MLflowConfig`

Configuration
-------------
@@ -29,7 +29,7 @@ trained models.

.. code:: python3

    from sklearn_genetic.mlflow import MLflowConfig
    from sklearn_genetic.mlflow_log import MLflowConfig

    mlflow_config = MLflowConfig(
        tracking_uri="http://localhost:5000",
2 changes: 1 addition & 1 deletion docs/tutorials/understand_cv.rst
@@ -2,7 +2,7 @@ Understanding the evaluation process
====================================

In this post, we are going to explain how the evaluation process works
and how to use different validation strategies.
on hyperparameters tuning and how to use different validation strategies.
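
For example, a sketch of swapping in a different validation strategy (assuming ``clf`` and ``param_grid`` are defined as in the other tutorials):

.. code:: python3

    from sklearn.model_selection import StratifiedKFold
    from sklearn_genetic import GASearchCV

    # Any scikit-learn splitter (or an int) can be passed as cv
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    evolved_estimator = GASearchCV(estimator=clf, param_grid=param_grid,
                                   cv=cv, scoring="accuracy")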

Parameters
----------
3 changes: 2 additions & 1 deletion sklearn_genetic/__init__.py
@@ -1,4 +1,4 @@
from .genetic_search import GASearchCV
from .genetic_search import GASearchCV, GAFeatureSelectionCV

from .callbacks import (
ThresholdStopping,
@@ -12,6 +12,7 @@

__all__ = [
"GASearchCV",
"GAFeatureSelectionCV",
"ThresholdStopping",
"ConsecutiveStopping",
"DeltaThreshold",
2 changes: 1 addition & 1 deletion sklearn_genetic/_version.py
@@ -1 +1 @@
__version__ = "0.6.1"
__version__ = "0.7.0dev0"
