# Main parameters

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rxavier/poniard/blob/master/examples/02._main_parameters.ipynb)

This notebook outlines the most relevant options for Poniard estimators.

If you don't have Poniard installed, install it from PyPI.

In [1]:
# %pip install poniard

At the core of Poniard lies the choice of estimators, metrics and CV strategy. While the defaults might work for most cases, we try to keep all three flexible.

## `estimators`

Estimators can be passed as a dict of `estimator_name: estimator_instance` or as a list of `estimator_instance`. In the latter case, names are taken directly from each estimator's class.

Using a dictionary allows passing multiple instances of the same estimator with different hyperparameters.
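To make the naming rule concrete, here is a minimal sketch (not Poniard's actual code) of what deriving names from `__class__.__name__` implies when two instances share a class. `Wrapper` and `names_from_list` are hypothetical names used only for illustration.

```python
class Wrapper:
    """Hypothetical stand-in for any estimator class (not a scikit-learn object)."""

    def __init__(self, label):
        self.label = label


def names_from_list(estimators):
    # Assumed rule: key each instance by its class name.
    # Instances of the same class share a key, so later ones overwrite earlier ones.
    return {est.__class__.__name__: est for est in estimators}


first, second = Wrapper("C=1.0"), Wrapper("C=0.1")

as_list = names_from_list([first, second])
as_dict = {"wrapper_default": first, "wrapper_strong_reg": second}

print(list(as_list))  # ['Wrapper'] -> only the last instance survives
print(list(as_dict))  # ['wrapper_default', 'wrapper_strong_reg']
```

With explicit dictionary keys both instances are kept, which is why a dict is the safer choice for same-class estimators.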

In [2]:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from poniard import PoniardClassifier

X, y = make_classification(n_classes=3, n_informative=3)
pnd = PoniardClassifier(
    estimators={
        "lr": LogisticRegression(max_iter=5000),
        # penalty="none" is deprecated since scikit-learn 1.2; newer versions expect penalty=None.
        "lr_no_penalty": LogisticRegression(max_iter=5000, penalty="none"),
        "lda": LinearDiscriminantAnalysis(),
    }
)
pnd.setup(X, y)
pnd.fit()

Target info
-----------
Type: multiclass
Shape: (100,)
Unique values: 3

Main metric
-----------
roc_auc_ovr

Thresholds
----------
Minimum unique values to consider a feature numeric: 10
Minimum unique values to consider a categorical high cardinality: 20

Inferred feature types
----------------------


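|   | numeric | categorical_high | categorical_low | datetime |
|---|---------|------------------|-----------------|----------|
| 0 | 0.0 |  |  |  |
| 1 | 1.0 |  |  |  |
| 2 | 2.0 |  |  |  |
| 3 | 3.0 |  |  |  |
| 4 | 4.0 |  |  |  |
| 5 | 5.0 |  |  |  |
| 6 | 6.0 |  |  |  |
| 7 | 7.0 |  |  |  |
| 8 | 8.0 |  |  |  |
| 9 | 9.0 |  |  |  |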


PoniardClassifier(estimators={'lr': LogisticRegression(max_iter=5000, random_state=0), 'lr_no_penalty': LogisticRegression(max_iter=5000, penalty='none', random_state=0), 'lda': LinearDiscriminantAnalysis()}, metrics=['roc_auc_ovr', 'accuracy', 'precision_macro', 'recall_macro', 'f1_macro'],
    preprocess=True, scaler=standard, numeric_imputer=simple,
    custom_preprocessor=None, numeric_threshold=10,
    cardinality_threshold=20, cv=StratifiedKFold(n_splits=5, random_state=0, shuffle=True), verbose=0,
    random_state=0, n_jobs=None, plugins=None,
    plot_options=PoniardPlotFactory())
            

Since we are in scikit-learn-land, most of what you expect to work still works. Multilabel classification is one example.

The example below has to use a dictionary: when a list is passed, each estimator is named after `estimator.__class__.__name__`, which would be `OneVsRestClassifier` for both estimators, so one would overwrite the other.

In [3]:
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_multilabel_classification(n_samples=1000)
pnd = PoniardClassifier(
    estimators={
        "rf": OneVsRestClassifier(RandomForestClassifier()),
        # "nb" is only a label here; the wrapped model is a logistic regression.
        "nb": OneVsRestClassifier(LogisticRegression()),
    }
)
pnd.setup(X, y)
pnd.fit()

Target info
-----------
Type: multilabel-indicator
Shape: (1000, 5)
Unique values: 2

Main metric
-----------
roc_auc

Thresholds
----------
Minimum unique values to consider a feature numeric: 100
Minimum unique values to consider a categorical high cardinality: 20

Inferred feature types
----------------------


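|   | numeric | categorical_high | categorical_low | datetime |
|---|---------|------------------|-----------------|----------|
| 0 |  |  | 0.0 |  |
| 1 |  |  | 1.0 |  |
| 2 |  |  | 2.0 |  |
| 3 |  |  | 3.0 |  |
| 4 |  |  | 4.0 |  |
| 5 |  |  | 5.0 |  |
| 6 |  |  | 6.0 |  |
| 7 |  |  | 7.0 |  |
| 8 |  |  | 8.0 |  |
| 9 |  |  | 9.0 |  |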


PoniardClassifier(estimators={'rf': OneVsRestClassifier(estimator=RandomForestClassifier()), 'nb': OneVsRestClassifier(estimator=LogisticRegression())}, metrics=['roc_auc', 'accuracy', 'precision_macro', 'recall_macro', 'f1_macro'],
    preprocess=True, scaler=standard, numeric_imputer=simple,
    custom_preprocessor=None, numeric_threshold=100,
    cardinality_threshold=20, cv=KFold(n_splits=5, random_state=0, shuffle=True), verbose=0,
    random_state=0, n_jobs=None, plugins=None,
    plot_options=PoniardPlotFactory())
            

In [4]:
pnd.get_results()

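|   | test_roc_auc | test_accuracy | test_precision_macro | test_recall_macro | test_f1_macro | fit_time | score_time |
|---|--------------|---------------|----------------------|-------------------|---------------|----------|------------|
| nb | 0.782694 | 0.385 | 0.597681 | 0.571633 | 0.57987 | 0.138852 | 0.021902 |
| rf | 0.759198 | 0.338 | 0.623584 | 0.496908 | 0.507446 | 1.887072 | 0.051547 |
| DummyClassifier | 0.5 | 0.096 | 0.23 | 0.4 | 0.291 | 0.012758 | 0.008101 |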


As you may have noticed, a dummy estimator is always included even if not explicitly set during initialization.

## `metrics`

Metrics can be passed as a list of strings, following the familiar scikit-learn nomenclature, or as a dict of `str: callable`. For convenience, a single string is also accepted.

Callables have to be passed in a dict so that each one has a name; these names are used for the columns returned by `get_results()`.
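The three accepted shapes can be illustrated with a small sketch of how metric names might map to result columns. This is an assumption-laden illustration, not Poniard's implementation: `metric_names` and `result_columns` are hypothetical helpers, and the `test_<name>` pattern is simply what the results tables in this notebook show.

```python
def metric_names(metrics):
    """Return one name per metric for the three accepted input shapes (hypothetical helper)."""
    if isinstance(metrics, str):
        return [metrics]  # single string -> one-element list
    if isinstance(metrics, dict):
        return list(metrics)  # dict keys name the callables
    names = list(metrics)
    if not all(isinstance(name, str) for name in names):
        # A bare callable has no obvious column name, hence the dict requirement.
        raise TypeError("callables need a name: pass them as a dict of str: callable")
    return names


def result_columns(metrics):
    # Score columns in this notebook's results tables follow the pattern test_<name>.
    return [f"test_{name}" for name in metric_names(metrics)]


print(result_columns("accuracy"))
print(result_columns(["neg_median_absolute_error", "explained_variance"]))
print(result_columns({"scaled_r2": lambda y_true, y_pred: 0.0}))
```

A list containing a bare callable raises `TypeError` in this sketch, which is one way to enforce that every metric ends up with a usable column name.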

In [5]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from poniard import PoniardRegressor

X, y = make_regression(n_samples=500, n_features=20, n_informative=5)
pnd = PoniardRegressor(
    metrics=["neg_median_absolute_error", "explained_variance"],
    estimators=[LinearRegression()],
)
pnd.setup(X, y)
pnd.fit()

Target info
-----------
Type: continuous
Shape: (500,)
Unique values: 500

Main metric
-----------
neg_median_absolute_error

Thresholds
----------
Minimum unique values to consider a feature numeric: 50
Minimum unique values to consider a categorical high cardinality: 20

Inferred feature types
----------------------


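|   | numeric | categorical_high | categorical_low | datetime |
|---|---------|------------------|-----------------|----------|
| 0 | 0.0 |  |  |  |
| 1 | 1.0 |  |  |  |
| 2 | 2.0 |  |  |  |
| 3 | 3.0 |  |  |  |
| 4 | 4.0 |  |  |  |
| 5 | 5.0 |  |  |  |
| 6 | 6.0 |  |  |  |
| 7 | 7.0 |  |  |  |
| 8 | 8.0 |  |  |  |
| 9 | 9.0 |  |  |  |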


PoniardRegressor(estimators=[LinearRegression()], metrics=['neg_median_absolute_error', 'explained_variance'],
    preprocess=True, scaler=standard, numeric_imputer=simple,
    custom_preprocessor=None, numeric_threshold=50,
    cardinality_threshold=20, cv=KFold(n_splits=5, random_state=0, shuffle=True), verbose=0,
    random_state=0, n_jobs=None, plugins=None,
    plot_options=PoniardPlotFactory())
            

In [6]:
pnd.get_results()

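|   | test_neg_median_absolute_error | test_explained_variance | fit_time | score_time |
|---|--------------------------------|-------------------------|----------|------------|
| LinearRegression | -1.946887e-13 | 1.0 | 0.00523 | 0.000489 |
| DummyRegressor | -122.0752 | 2.2204460000000003e-17 | 0.000993 | 0.000293 |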


In [7]:
from sklearn.metrics import r2_score, make_scorer


def scaled_r2(y_true, y_pred):
    return round(r2_score(y_true, y_pred) * 100, 1)


pnd = PoniardRegressor(
    metrics={
        "scaled_r2": make_scorer(scaled_r2, greater_is_better=True),
        "usual_r2": make_scorer(r2_score, greater_is_better=True),
    },
    estimators=[LinearRegression()],
)
pnd.setup(X, y).fit().get_results()

Target info
-----------
Type: continuous
Shape: (500,)
Unique values: 500

Main metric
-----------
scaled_r2

Thresholds
----------
Minimum unique values to consider a feature numeric: 50
Minimum unique values to consider a categorical high cardinality: 20

Inferred feature types
----------------------


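|   | numeric | categorical_high | categorical_low | datetime |
|---|---------|------------------|-----------------|----------|
| 0 | 0.0 |  |  |  |
| 1 | 1.0 |  |  |  |
| 2 | 2.0 |  |  |  |
| 3 | 3.0 |  |  |  |
| 4 | 4.0 |  |  |  |
| 5 | 5.0 |  |  |  |
| 6 | 6.0 |  |  |  |
| 7 | 7.0 |  |  |  |
| 8 | 8.0 |  |  |  |
| 9 | 9.0 |  |  |  |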


|   | test_scaled_r2 | test_usual_r2 | fit_time | score_time |
|---|----------------|---------------|----------|------------|
| LinearRegression | 100.0 | 1.0 | 0.001717 | 0.000355 |
| DummyRegressor | -0.3 | -0.002793 | 0.000985 | 0.000298 |


## `cv`

Cross-validation can be anything that scikit-learn accepts. By default, classification tasks are paired with a `StratifiedKFold` if the target is binary or multiclass, and with `KFold` otherwise (for example, the multilabel target in the outputs above). Regression tasks use `KFold` by default.

`cv=int` or `cv=None` are internally converted to one of the above classes so that Poniard's `random_state` parameter can be passed on.
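Why passing a seed to the splitter matters can be seen in a simplified, pure-Python sketch of a shuffled k-fold split. This is not scikit-learn's `KFold` (the actual indices will differ) and not Poniard's code; `resolve_n_splits` and `shuffled_kfold` are hypothetical helpers. The only behavior taken from this notebook is that `cv=None` ends up with 5 splits and an integer is used as the number of splits.

```python
import random


def resolve_n_splits(cv):
    # Matches the printed output in this notebook: None -> 5 folds, int -> that many folds.
    return 5 if cv is None else int(cv)


def shuffled_kfold(n_samples, cv=None, random_state=0):
    """Return a list of (train, test) index lists for a shuffled k-fold split."""
    n_splits = resolve_n_splits(cv)
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)  # the seed fixes the shuffle order
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more sample
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds_a = shuffled_kfold(10, cv=4, random_state=0)
folds_b = shuffled_kfold(10, cv=4, random_state=0)
print(len(folds_a), folds_a == folds_b)  # 4 True
print(len(shuffled_kfold(10)))  # 5
```

Because the shuffle is seeded, two runs with the same `random_state` produce identical folds, so estimators are compared on the same splits.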

In [8]:
from IPython.utils import io
from sklearn.model_selection import RepeatedKFold

with io.capture_output() as c:
    pnd_4 = PoniardRegressor(cv=4).setup(X, y)
    pnd_none = PoniardRegressor(cv=None).setup(X, y)
    pnd_k = PoniardRegressor(cv=RepeatedKFold(n_splits=3)).setup(X, y)

In [9]:
print(pnd_4.cv, pnd_none.cv, pnd_k.cv, sep="\n")

KFold(n_splits=4, random_state=0, shuffle=True)
KFold(n_splits=5, random_state=0, shuffle=True)
RepeatedKFold(n_repeats=10, n_splits=3, random_state=0)


Note that even though we didn't specify `random_state` for the `RepeatedKFold` splitter in the third case, it gets injected during setup.
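One way such an injection could work is sketched below with a stand-in class. `SplitterStandIn` and `inject_random_state` are hypothetical, and the rule of leaving an explicitly set seed untouched is an assumption of this sketch; the notebook only shows the case where no seed was given.

```python
class SplitterStandIn:
    """Hypothetical splitter with a random_state attribute (not a scikit-learn class)."""

    def __init__(self, n_splits=3, random_state=None):
        self.n_splits = n_splits
        self.random_state = random_state


def inject_random_state(cv, random_state):
    # Assumption: fill in a seed only if the splitter has the attribute and it is unset.
    if getattr(cv, "random_state", "missing") is None:
        cv.random_state = random_state
    return cv


unseeded = inject_random_state(SplitterStandIn(), random_state=0)
seeded = inject_random_state(SplitterStandIn(random_state=42), random_state=0)
print(unseeded.random_state, seeded.random_state)  # 0 42
```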