# Main parameters

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rxavier/poniard/blob/master/examples/02._main_parameters.ipynb)

This notebook outlines the most relevant options for Poniard estimators.

If Poniard is not installed yet, install it from PyPI.

In [1]:
# %pip install poniard

At the core of Poniard lies the choice of estimators, metrics and CV strategy. The defaults should work for most cases, but each of these can be customized.

## `estimators`

Estimators can be passed as a dict of `estimator_name: estimator_instance` or as a list of `estimator_instance`. In the latter, names will be obtained directly from the class.

Using a dictionary allows passing multiple instances of the same estimator with different hyperparameters.
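To see why a list cannot hold two instances of the same class, here is a minimal sketch of class-based naming. `name_estimators` is a hypothetical helper written for this illustration, not Poniard's actual code; it only assumes that names come from the class name.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression


def name_estimators(estimators):
    """Hypothetical helper: map each estimator to its class name."""
    return {type(est).__name__: est for est in estimators}


# Different classes get different names.
distinct = name_estimators([LogisticRegression(), LinearDiscriminantAnalysis()])
# Two instances of the same class share a name, so the last one wins.
clashing = name_estimators([LogisticRegression(C=1.0), LogisticRegression(C=0.1)])

print(list(distinct))  # two entries, one per class
print(list(clashing))  # a single entry: the second instance replaced the first
```

With a dictionary you choose the keys yourself, so both instances survive.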

In [2]:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from poniard import PoniardClassifier

X, y = make_classification(n_classes=3, n_informative=3)
pnd = PoniardClassifier(estimators={"lr": LogisticRegression(max_iter=1000),
                                    # On scikit-learn >= 1.2, pass penalty=None instead of "none".
                                    "lr_no_penalty": LogisticRegression(max_iter=1000, penalty="none"),
                                    "lda": LinearDiscriminantAnalysis()})
pnd.setup(X, y)
pnd.fit()

Main metric: roc_auc_ovr
Minimum unique values to consider a number feature numeric: 10
Minimum unique values to consider a non-number feature high cardinality: 20

Inferred feature types:
    numeric categorical_high categorical_low datetime
0       0.0                                          
1       1.0                                          
2       2.0                                          
3       3.0                                          
4       4.0                                          
5       5.0                                          
6       6.0                                          
7       7.0                                          
8       8.0                                          
9       9.0                                          
10     10.0                                          
11     11.0                                          
12     12.0                                          
13     13.0                                          
1

Completed: 100%|██████████| 4/4 [00:00<00:00, 24.21it/s]      


PoniardClassifier(estimators={'lr': LogisticRegression(max_iter=1000, random_state=0), 'lr_no_penalty': LogisticRegression(max_iter=1000, penalty='none', random_state=0), 'lda': LinearDiscriminantAnalysis()}, metrics=None,
    preprocess=True, scaler=standard, numeric_imputer=simple,
    custom_preprocessor=None, numeric_threshold=0.1,
    cardinality_threshold=20, cv=None, verbose=0,
    random_state=0, n_jobs=None, plugins=None,
    plot_options=PoniardPlotFactory())
            

Because Poniard lives in scikit-learn-land, most of what you expect to work still works. Multilabel classification is one example.

The example below has to use a dictionary. When a list is passed, each estimator is named after `estimator.__class__.__name__`, which would be the same for both `MultiOutputClassifier` instances, so one would overwrite the other.

In [3]:
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier

X, y = make_multilabel_classification()
pnd = PoniardClassifier(estimators={"rf": MultiOutputClassifier(RandomForestClassifier()),
                                    "nb": MultiOutputClassifier(RidgeClassifier())})
pnd.setup(X, y)
pnd.fit()

Main metric: accuracy
Minimum unique values to consider a number feature numeric: 10
Minimum unique values to consider a non-number feature high cardinality: 20

Inferred feature types:
   numeric categorical_high  categorical_low datetime
0                                        0.0         
1                                        1.0         
2                                        2.0         
3                                        3.0         
4                                        4.0         
5                                        5.0         
6                                        6.0         
7                                        7.0         
8                                        8.0         
9                                        9.0         
10                                      10.0         
11                                      11.0         
12                                      12.0         
13                                      13.0         
14  

Completed: 100%|██████████| 3/3 [00:01<00:00,  2.14it/s]      


PoniardClassifier(estimators={'rf': MultiOutputClassifier(estimator=RandomForestClassifier()), 'nb': MultiOutputClassifier(estimator=RidgeClassifier())}, metrics=None,
    preprocess=True, scaler=standard, numeric_imputer=simple,
    custom_preprocessor=None, numeric_threshold=0.1,
    cardinality_threshold=20, cv=None, verbose=0,
    random_state=0, n_jobs=None, plugins=None,
    plot_options=PoniardPlotFactory())
            

In [4]:
pnd.show_results()

|  | test_accuracy | train_accuracy | test_precision_macro | train_precision_macro | test_recall_macro | train_recall_macro | test_f1_macro | train_f1_macro | fit_time | score_time |
|---|---|---|---|---|---|---|---|---|---|---|
| rf | 0.19 | 1.0 | 0.403381 | 1.0 | 0.377483 | 1.0 | 0.382205 | 1.0 | 0.216844 | 0.017341 |
| nb | 0.14 | 1.0 | 0.466974 | 1.0 | 0.423013 | 1.0 | 0.43044 | 1.0 | 0.010243 | 0.003137 |
| DummyClassifier | 0.09 | 0.1475 | 0.15 | 0.155 | 0.28 | 0.28 | 0.194611 | 0.199421 | 0.002559 | 0.003074 |


As you may have noticed, a dummy estimator is always included even if not explicitly set during initialization.
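For reference, this is roughly what such a baseline looks like in plain scikit-learn. It is a sketch only: the `strategy="prior"` setting and the 5-fold CV are assumptions made for this illustration and may not match what Poniard configures internally.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

X_demo, y_demo = make_classification(n_classes=3, n_informative=3, random_state=0)

# A dummy classifier ignores the features and predicts from the class
# distribution alone, which gives a floor to compare real models against.
baseline = DummyClassifier(strategy="prior")
scores = cross_validate(baseline, X_demo, y_demo, cv=5, scoring="accuracy")

mean_accuracy = scores["test_score"].mean()
print(mean_accuracy)
```

Any estimator that cannot beat this score is not learning anything useful from the features.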

## `metrics`

Metrics can be passed as a single string, as a list of strings following the familiar scikit-learn nomenclature, or as a dict of `str: callable`.

Callables have to be passed in a dict so that each one has a name, which is used to label the columns in the `show_results()` method.
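The naming requirement mirrors how scikit-learn's own `cross_validate` labels its output. The sketch below uses scikit-learn directly and does not rely on Poniard; whether Poniard calls `cross_validate` internally is an assumption, suggested by the `test_*`, `fit_time` and `score_time` columns in its results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import cross_validate

X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=1.0, random_state=0)

# Strings already carry a name, so a list is enough.
by_string = cross_validate(LinearRegression(), X_demo, y_demo,
                           scoring=["r2", "explained_variance"])
# A callable has no canonical name, so the dict key supplies one.
by_dict = cross_validate(LinearRegression(), X_demo, y_demo,
                         scoring={"usual_r2": make_scorer(r2_score)})

print(sorted(by_string))
print(sorted(by_dict))
```

Each result key is the metric name prefixed with `test_`, which is why every scorer needs a name.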

In [5]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from poniard import PoniardRegressor

X, y = make_regression()
pnd = PoniardRegressor(metrics=["neg_median_absolute_error", "explained_variance"],
                       estimators=[LinearRegression()])
pnd.setup(X, y)
pnd.fit()

Main metric: neg_median_absolute_error
Minimum unique values to consider a number feature numeric: 10
Minimum unique values to consider a non-number feature high cardinality: 20

Inferred feature types:
    numeric categorical_high categorical_low datetime
0       0.0                                          
1       1.0                                          
2       2.0                                          
3       3.0                                          
4       4.0                                          
..      ...              ...             ...      ...
95     95.0                                          
96     96.0                                          
97     97.0                                          
98     98.0                                          
99     99.0                                          

[100 rows x 4 columns]


Completed: 100%|██████████| 2/2 [00:00<00:00, 28.62it/s]


PoniardRegressor(estimators=[LinearRegression()], metrics=['neg_median_absolute_error', 'explained_variance'],
    preprocess=True, scaler=standard, numeric_imputer=simple,
    custom_preprocessor=None, numeric_threshold=0.1,
    cardinality_threshold=20, cv=None, verbose=0,
    random_state=0, n_jobs=None, plugins=None,
    plot_options=PoniardPlotFactory())
            

In [6]:
pnd.show_results()

|  | test_neg_median_absolute_error | train_neg_median_absolute_error | test_explained_variance | train_explained_variance | fit_time | score_time |
|---|---|---|---|---|---|---|
| LinearRegression | -75.725797 | -3.410605e-13 | 0.752295 | 1.0 | 0.005511 | 0.001264 |
| DummyRegressor | -151.725912 | -147.4922 | 2.2204460000000003e-17 | 0.0 | 0.00133 | 0.000587 |


In [7]:
from sklearn.metrics import r2_score, make_scorer

def scaled_r2(y_true, y_pred):
    return round(r2_score(y_true, y_pred) * 100, 1)

pnd = PoniardRegressor(metrics={"scaled_r2": make_scorer(scaled_r2, greater_is_better=True),
                                "usual_r2": make_scorer(r2_score, greater_is_better=True)},
                       estimators=[LinearRegression()])
pnd.setup(X, y).fit().show_results()

Main metric: scaled_r2
Minimum unique values to consider a number feature numeric: 10
Minimum unique values to consider a non-number feature high cardinality: 20

Inferred feature types:
    numeric categorical_high categorical_low datetime
0       0.0                                          
1       1.0                                          
2       2.0                                          
3       3.0                                          
4       4.0                                          
..      ...              ...             ...      ...
95     95.0                                          
96     96.0                                          
97     97.0                                          
98     98.0                                          
99     99.0                                          

[100 rows x 4 columns]


Completed: 100%|██████████| 2/2 [00:00<00:00, 63.18it/s]


|  | test_scaled_r2 | train_scaled_r2 | test_usual_r2 | train_usual_r2 | fit_time | score_time |
|---|---|---|---|---|---|---|
| LinearRegression | 72.78 | 100.0 | 0.727786 | 1.0 | 0.002391 | 0.000471 |
| DummyRegressor | -4.32 | 0.0 | -0.043417 | 0.0 | 0.000905 | 0.000398 |


## `cv`

The cross-validation strategy can be anything that scikit-learn accepts. By default, classification tasks will be paired with a `StratifiedKFold` if the target is binary, and `KFold` otherwise. Regression tasks use `KFold` by default.

`cv=int` or `cv=None` are internally converted to one of the above classes so that Poniard's `random_state` parameter can be passed on.
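The conversion can be pictured with the sketch below. `resolve_cv` and its `stratified` flag are hypothetical names used for illustration, not Poniard's implementation; the 5-fold fallback for `cv=None` follows scikit-learn's own default.

```python
from sklearn.model_selection import KFold, StratifiedKFold


def resolve_cv(cv=None, *, stratified=False, random_state=0):
    """Hypothetical helper: turn `cv=None` or `cv=int` into a seeded splitter."""
    if cv is None:
        cv = 5  # scikit-learn's default number of folds
    if isinstance(cv, int):
        splitter = StratifiedKFold if stratified else KFold
        # Shuffling is required for random_state to have any effect.
        return splitter(n_splits=cv, shuffle=True, random_state=random_state)
    return cv  # splitter objects and iterables pass through unchanged


print(resolve_cv(4))
print(resolve_cv(None))
print(resolve_cv(3, stratified=True))
```

Building an explicit splitter object is what makes it possible to attach a seed; a bare integer has nowhere to store one.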

In [8]:
from IPython.utils import io
from sklearn.model_selection import RepeatedKFold

with io.capture_output() as c:
    pnd_4 = PoniardRegressor(cv=4).setup(X, y)
    pnd_none = PoniardRegressor(cv=None).setup(X, y)
    pnd_k = PoniardRegressor(cv=RepeatedKFold(n_splits=3)).setup(X, y)

In [9]:
print(pnd_4.cv_, pnd_none.cv_, pnd_k.cv_, sep="\n")

KFold(n_splits=4, random_state=0, shuffle=True)
KFold(n_splits=5, random_state=0, shuffle=True)
RepeatedKFold(n_repeats=10, n_splits=3, random_state=0)


Note that even though we didn't specify `random_state` for the `RepeatedKFold` splitter in the third example, it gets injected during setup.
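One way such an injection could work is sketched below. `inject_random_state` is a hypothetical helper, and copying the splitter before modifying it is a design choice made for this example; Poniard's actual mechanism may differ.

```python
import copy

import numpy as np
from sklearn.model_selection import RepeatedKFold


def inject_random_state(cv, random_state=0):
    """Hypothetical helper: seed a splitter that was created without a seed."""
    cv = copy.deepcopy(cv)  # leave the caller's object untouched
    # Only fill in the seed when the attribute exists and is unset.
    if getattr(cv, "random_state", "missing") is None:
        cv.random_state = random_state
    return cv


seeded = inject_random_state(RepeatedKFold(n_splits=3))
kept = inject_random_state(RepeatedKFold(n_splits=3, random_state=42))
print(seeded.random_state, kept.random_state)

# With a fixed seed, repeated calls to split() yield identical folds.
X_demo = np.zeros((12, 1))
first = [test.tolist() for _, test in seeded.split(X_demo)]
second = [test.tolist() for _, test in seeded.split(X_demo)]
```

A seed set explicitly by the user is left alone, so only unseeded splitters are affected.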