Merge pull request #15 from rodrigo-arenas/docs: Understanding evaluation process tutorial

Showing 6 changed files with 132 additions and 4 deletions.
Understanding the evaluation process
====================================

In this post, we are going to explain how the evaluation process happens
and how to use different validation strategies.

Parameters
----------

The :class:`~sklearn_genetic.GASearchCV` class expects a parameter named `cv`.
It stands for cross-validation, and it accepts any of the scikit-learn
strategies, such as K-Fold, Repeated K-Fold, Stratified K-Fold, and so on.
You can find more about this in the `scikit-learn documentation <https://scikit-learn.org/stable/modules/cross_validation.html>`_.

A second parameter is `scoring`, the evaluation metric the model uses
to decide which candidate is better. It could be, for example, accuracy,
precision, or recall for a classification problem, or r2, max_error, or
neg_root_mean_squared_error for a regression problem.
To see the full list of metrics, check `here <https://scikit-learn.org/stable/modules/model_evaluation.html>`_.

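To make the role of `cv` concrete, here is a minimal, library-free sketch of how a k-fold split partitions the data indices; the `k_fold_indices` helper is hypothetical and only illustrates the idea, since `GASearchCV` delegates the real splitting to scikit-learn:

.. code:: python3

    def k_fold_indices(n_samples, n_splits):
        """Yield (train_idx, test_idx) pairs, holding out one fold per round."""
        indices = list(range(n_samples))
        fold_size = n_samples // n_splits
        for k in range(n_splits):
            start = k * fold_size
            # the last fold absorbs any remainder
            stop = (k + 1) * fold_size if k < n_splits - 1 else n_samples
            test_idx = indices[start:stop]
            train_idx = indices[:start] + indices[stop:]
            yield train_idx, test_idx

    folds = list(k_fold_indices(10, 5))
    print(len(folds))   # 5 rounds, one held-out fold each
    print(folds[0])     # first round: train on [2..9], test on [0, 1]

Each candidate's score is then the average of the per-fold scores, which is what the `scoring` metric reports for that candidate.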
Steps
-----

The way `GASearchCV` evaluates the candidates is as follows:

* It starts by selecting random sets of hyperparameters according to the `param_grid` definition;
  the total number of sets is determined by the `population_size` parameter.

* It fits a model per each set of hyperparameters and calculates the cross-validation score
  according to the `cv` and `scoring` setup.

* After evaluating each candidate, the fitness, fitness_std, fitness_max and fitness_min are computed
  and logged to the console if ``verbose=True``.
  `Fitness` refers to the selected metric, averaged over all the candidates of the current
  generation: if there are 10 different sets of hyperparameters, the `fitness` value is the
  average score of those 10 evaluated candidates. The same goes for the other metrics.

* It then creates a new set (generation) of hyperparameters by combining the last
  generation with different strategies; those strategies depend on the selected
  :mod:`~sklearn_genetic.algorithms`.

* It repeats steps 2, 3 and 4 until the number of generations is met, or until a callback stops the process.

* At the end, the algorithm selects the best hyperparameters: the set that achieved the
  best individual cross-validation score.

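The steps above can be sketched in plain Python. Here, `sample_params` and `cv_score` are hypothetical stand-ins for drawing from `param_grid` and for the cross-validated `scoring` metric of a fitted model, and the survive-and-refill rule is a deliberate simplification of the real operators in :mod:`~sklearn_genetic.algorithms`:

.. code:: python3

    import random

    random.seed(0)

    def sample_params():
        # Step 1: draw one random candidate, mimicking the param_grid definition.
        return {"max_depth": random.randint(2, 20), "ccp_alpha": random.uniform(0, 1)}

    def cv_score(params):
        # Toy stand-in for the cross-validated scoring metric of a fitted model.
        return params["max_depth"] / 20 - params["ccp_alpha"] / 2

    population_size, generations = 10, 5
    population = [sample_params() for _ in range(population_size)]

    for gen in range(generations):
        scores = [cv_score(p) for p in population]          # step 2: evaluate each candidate
        fitness = sum(scores) / len(scores)                 # step 3: generation average
        fitness_max, fitness_min = max(scores), min(scores)
        # step 4 (simplified): keep the better half, refill with new random candidates
        ranked = sorted(zip(scores, population), key=lambda t: t[0], reverse=True)
        survivors = [p for _, p in ranked[: population_size // 2]]
        population = survivors + [sample_params()
                                  for _ in range(population_size - len(survivors))]

    best = max(population, key=cv_score)                    # final selection
    print(best)

The real algorithms replace the refill step with crossover and mutation of the survivors, which is why later generations tend to concentrate around the best-scoring regions of the search space.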
These steps can be represented like this; each line represents one of several possible
natural processes such as mating, crossover, selection and mutation:

.. image:: ../images/genetic_cv.png

Inside each set, the cross-validation takes place, for example using a 5-fold strategy:

.. image:: ../images/k-folds.png

Image taken from `scikit-learn <https://scikit-learn.org/stable/modules/cross_validation.html>`_.

Example
-------

This example uses a regression problem from the Boston house prices dataset.
We are going to use a K-Fold strategy with 5 splits, taking r-squared as the evaluation metric.

At the end, we print the top 4 solutions and the r-squared
on the test set for the best set of hyperparameters.

.. code:: python3

    from sklearn_genetic import GASearchCV
    from sklearn_genetic.space import Integer, Categorical, Continuous
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split, KFold
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import r2_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    data = load_boston()
    y = data["target"]
    X = data["data"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)

    # 5-fold cross-validation strategy used to score each candidate
    cv = KFold(n_splits=5, shuffle=True)

    clf = DecisionTreeRegressor()
    pipe = Pipeline([('scaler', StandardScaler()), ('clf', clf)])

    param_grid = {
        "clf__ccp_alpha": Continuous(0, 1),
        "clf__criterion": Categorical(["mse", "mae"]),
        "clf__max_depth": Integer(2, 20),
        "clf__min_samples_split": Integer(2, 30),
    }

    evolved_estimator = GASearchCV(
        estimator=pipe,
        cv=cv,
        scoring="r2",
        population_size=15,
        generations=20,
        tournament_size=3,
        elitism=True,
        keep_top_k=4,
        crossover_probability=0.9,
        mutation_probability=0.05,
        param_grid=param_grid,
        criteria="max",
        algorithm="eaMuCommaLambda",
        n_jobs=-1,
    )

    evolved_estimator.fit(X_train, y_train)
    y_predict_ga = evolved_estimator.predict(X_test)
    r_squared = r2_score(y_test, y_predict_ga)

    print(evolved_estimator.best_params_)
    print("r-squared: ", "{:.2f}".format(r_squared))