# Poniard basics

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rxavier/poniard/blob/master/examples/01._basics.ipynb)

This notebook outlines the simplest Poniard workflow.

If you don't have it installed, please install from PyPI.

In [1]:
# %pip install poniard

## Outline

Poniard offers two main interfaces: a classifier and a regressor. Both fit multiple estimators with cross-validation and compare them on one or more metrics.

Data undergoes light preprocessing, which includes type inference: numeric features with few unique values are assumed to be categorical, while categorical features are treated as high or low cardinality depending on their number of unique values.
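The inference idea can be sketched in plain Python. This is only an illustration under assumed rules, not Poniard's actual implementation: the helper `infer_feature_type` and both threshold parameters are hypothetical, and the `>=` comparisons are a guess at how such thresholds might be applied.

```python
from numbers import Number


def infer_feature_type(values, numeric_threshold, cardinality_threshold):
    """Classify one column as numeric, categorical_high or categorical_low.

    Hypothetical helper: the rules below are assumptions for illustration.
    """
    n_unique = len(set(values))
    is_number_column = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    # A number-typed column with enough distinct values is kept as numeric.
    if is_number_column and n_unique >= numeric_threshold:
        return "numeric"
    # Everything else is categorical; split by number of distinct values.
    if n_unique >= cardinality_threshold:
        return "categorical_high"
    return "categorical_low"


measurements = [x / 10 for x in range(100)]  # 100 distinct floats
two_level_flag = [0.05, -0.04] * 50  # floats, but only 2 distinct values
identifiers = [f"id_{i}" for i in range(100)]  # 100 distinct strings

print(infer_feature_type(measurements, 50, 20))
print(infer_feature_type(two_level_flag, 50, 20))
print(infer_feature_type(identifiers, 50, 20))
```

Under these assumed rules, a float column with only two distinct values ends up as low-cardinality categorical, even though its dtype is numeric.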

## Classification

As with all examples, we'll use scikit-learn toy datasets.

The first step is initializing the appropriate class (in this case `PoniardClassifier`) and calling the `setup()` method with the features and target. This prints information about the main metric and the type inference thresholds. By default several metrics are computed, and the first one in the list is the one used by methods other than `fit()`.

Crucially, it also shows what Poniard inferred for feature types. This is key, since it determines what kind of preprocessing will be used for each variable.

However, as will be seen, preprocessing can be disabled completely or set directly without any assumptions by Poniard.

In [2]:
from sklearn.datasets import load_breast_cancer
from poniard import PoniardClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
pnd = PoniardClassifier().setup(X, y)

Target info
-----------
Type: binary
Shape: (569,)
Unique values: 2

Main metric
-----------
roc_auc

Thresholds
----------
Minimum unique values to consider a feature numeric: 56
Minimum unique values to consider a categorical high cardinality: 20

Inferred feature types
----------------------


| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | mean radius | | | |
| 1 | mean texture | | | |
| 2 | mean perimeter | | | |
| 3 | mean area | | | |
| 4 | mean smoothness | | | |
| 5 | mean compactness | | | |
| 6 | mean concavity | | | |
| 7 | mean concave points | | | |
| 8 | mean symmetry | | | |
| 9 | mean fractal dimension | | | |

`setup()` also prepares the preprocessor, cross-validation strategy, metrics and estimators. The values shown below are defaults and can be changed during initialization.

Poniard also prunes empty preprocessing steps, that is, steps with no features assigned to them.
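Pruning can be pictured as filtering a list of (name, transformer, columns) steps. The following is a minimal sketch with hypothetical step names and plain strings standing in for transformer objects; Poniard's internals may differ.

```python
def prune_empty_steps(steps):
    """Keep only the preprocessing steps that have at least one column assigned."""
    return [
        (name, transformer, columns)
        for name, transformer, columns in steps
        if columns
    ]


# Hypothetical steps: strings stand in for real transformer objects.
steps = [
    ("numeric_preprocessor", "scaler + imputer", ["mean radius", "mean texture"]),
    ("categorical_low_preprocessor", "one-hot encoder", []),
    ("datetime_preprocessor", "datetime encoder", []),
]

pruned = prune_empty_steps(steps)
print([name for name, _, _ in pruned])
```

With a dataset that only has numeric features, only the numeric step would survive.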

In [3]:
pnd.preprocessor_

In [4]:
pnd.cv_

StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

In [5]:
pnd.metrics_

['roc_auc', 'accuracy', 'precision', 'recall', 'f1']

In [6]:
pnd.estimators_.keys()

dict_keys(['LogisticRegression', 'GaussianNB', 'LinearSVC', 'KNeighborsClassifier', 'DecisionTreeClassifier', 'RandomForestClassifier', 'HistGradientBoostingClassifier', 'XGBClassifier', 'DummyClassifier'])

After the data has been set up, calling `fit()` (without any arguments, since `X` and `y` are saved as attributes) cross-validates each estimator.

In [7]:
pnd.fit()

Completed: 100%|██████████| 9/9 [00:11<00:00,  1.32s/it]                     


PoniardClassifier(estimators=None, metrics=None,
    preprocess=True, scaler=standard, numeric_imputer=simple,
    custom_preprocessor=None, numeric_threshold=0.1,
    cardinality_threshold=20, cv=None, verbose=0,
    random_state=0, n_jobs=None, plugins=None,
    plot_options=PoniardPlotFactory())
            

Finally, `show_results()` concatenates the results of each estimator's sklearn `cross_validate()` run. Rows are sorted by the first column, which is always the main metric evaluated on the held-out folds.

All the numbers in this table are averages over CV folds.
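The aggregation can be sketched without any libraries. The fold scores below are made-up illustrative numbers, and `summarize` is a hypothetical helper, not part of Poniard.

```python
from statistics import mean


def summarize(fold_scores, main_metric):
    """Average each metric over folds, then sort estimators by the main metric."""
    averaged = {
        estimator: {metric: mean(values) for metric, values in metrics.items()}
        for estimator, metrics in fold_scores.items()
    }
    # Higher is better, so the best estimator comes first.
    return sorted(
        averaged.items(), key=lambda item: item[1][main_metric], reverse=True
    )


# Made-up scores for three folds; not real results.
fold_scores = {
    "model_a": {"test_roc_auc": [0.90, 0.92, 0.94], "test_accuracy": [0.85, 0.80, 0.90]},
    "model_b": {"test_roc_auc": [0.96, 0.98, 0.97], "test_accuracy": [0.90, 0.95, 0.85]},
}

ranking = summarize(fold_scores, "test_roc_auc")
for estimator, metrics in ranking:
    print(estimator, round(metrics["test_roc_auc"], 3))
```

Because sklearn scorers follow a higher-is-better convention (hence names such as `neg_mean_squared_error`), a single descending sort works for both classification and regression.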

In [8]:
pnd.show_results()

| | test_roc_auc | train_roc_auc | test_accuracy | train_accuracy | test_precision | train_precision | test_recall | train_recall | test_f1 | train_f1 | fit_time | score_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LogisticRegression | 0.995456 | 0.997511 | 0.978916 | 0.988137 | 0.975411 | 0.98613 | 0.991549 | 0.995095 | 0.983351 | 0.990589 | 0.492887 | 0.005178 |
| HistGradientBoostingClassifier | 0.994128 | 1.0 | 0.970129 | 1.0 | 0.967263 | 1.0 | 0.985955 | 1.0 | 0.976433 | 1.0 | 1.458177 | 0.033147 |
| XGBClassifier | 0.994123 | 1.0 | 0.970129 | 1.0 | 0.967554 | 1.0 | 0.985915 | 1.0 | 0.976469 | 1.0 | 0.05545 | 0.004636 |
| LinearSVC | 0.992901 | 0.998985 | 0.968359 | 0.989895 | 0.974993 | 0.98751 | 0.974765 | 0.996496 | 0.974783 | 0.991982 | 0.007783 | 0.005104 |
| RandomForestClassifier | 0.992264 | 1.0 | 0.964881 | 1.0 | 0.964647 | 1.0 | 0.980282 | 1.0 | 0.972192 | 1.0 | 0.081701 | 0.009922 |
| GaussianNB | 0.98873 | 0.988861 | 0.9297 | 0.939369 | 0.940993 | 0.941821 | 0.949413 | 0.962883 | 0.9443 | 0.952219 | 0.005137 | 0.0029 |
| KNeighborsClassifier | 0.98061 | 0.998064 | 0.964881 | 0.978472 | 0.955018 | 0.97003 | 0.991628 | 0.996501 | 0.972746 | 0.983079 | 0.003238 | 0.056205 |
| DecisionTreeClassifier | 0.920983 | 1.0 | 0.926223 | 1.0 | 0.941672 | 1.0 | 0.94108 | 1.0 | 0.941054 | 1.0 | 0.005963 | 0.002693 |
| DummyClassifier | 0.5 | 0.5 | 0.627418 | 0.627417 | 0.627418 | 0.627417 | 1.0 | 1.0 | 0.771052 | 0.771058 | 0.002701 | 0.003293 |


## Regression

Now we'll load another dataset.

In [9]:
from sklearn.datasets import load_diabetes
from poniard import PoniardRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
pnd = PoniardRegressor().setup(X, y)

Target info
-----------
Type: continuous
Shape: (442,)
Unique values: 214

Main metric
-----------
neg_mean_squared_error

Thresholds
----------
Minimum unique values to consider a feature numeric: 44
Minimum unique values to consider a categorical high cardinality: 20

Inferred feature types
----------------------


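| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | age | | sex | |
| 1 | bmi | | | |
| 2 | bp | | | |
| 3 | s1 | | | |
| 4 | s2 | | | |
| 5 | s3 | | | |
| 6 | s4 | | | |
| 7 | s5 | | | |
| 8 | s6 | | | |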

This time most features are assumed to be numeric, except the "sex" variable, which is correctly inferred to be categorical despite having a float dtype.

In [10]:
X["sex"].unique()

array([ 0.05068012, -0.04464164])

We can transform the dataset with the preprocessor to check that the transformations work as expected.

Depending on your version of sklearn, you might not be able to obtain the feature names since `SimpleImputer()` did not have a `get_feature_names_out()` method before 1.1.0.
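When checking for a minimum version, it is safer to compare numeric tuples than raw strings, since string comparison is lexicographic. The following is a small sketch, where `version_tuple` is a hypothetical helper that only reads the leading major and minor components and assumes both are plain integers:

```python
def version_tuple(version):
    """Return (major, minor) as integers, ignoring patch and suffixes."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


# Lexicographic comparison misorders these two releases.
print("1.10.0" >= "1.2.0")
# Numeric comparison orders them correctly.
print(version_tuple("1.10.0") >= version_tuple("1.2.0"))
# A release older than 1.1 fails the minimum-version check.
print(version_tuple("1.0.2") >= (1, 1))
```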

In [11]:
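import pandas as pd
import sklearn

# Compare (major, minor) numerically; comparing version strings directly is
# lexicographic and can misorder releases.
sklearn_version = tuple(int(part) for part in sklearn.__version__.split(".")[:2])

transformed = pd.DataFrame(pnd.preprocessor_.fit_transform(X))
if sklearn_version >= (1, 1):
    # Keep only the part of each name after the last "__" separator.
    columns = [x.split("__")[-1] for x in pnd.preprocessor_.get_feature_names_out()]
    transformed.columns = columns
transformed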

| | age | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | sex_0.05068011873981862 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.800500 | 1.297088 | 0.459841 | -0.929746 | -0.732065 | -0.912451 | -0.054499 | 0.418531 | -0.370989 | 1.0 |
| 1 | -0.039567 | -1.082180 | -0.553505 | -0.177624 | -0.402886 | 1.564414 | -0.830301 | -1.436589 | -1.938479 | 0.0 |
| 2 | 1.793307 | 0.934533 | -0.119214 | -0.958674 | -0.718897 | -0.680245 | -0.054499 | 0.060156 | -0.545154 | 1.0 |
| 3 | -1.872441 | -0.243771 | -0.770650 | 0.256292 | 0.525397 | -0.757647 | 0.721302 | 0.476983 | -0.196823 | 0.0 |
| 4 | 0.113172 | -0.764944 | 0.459841 | 0.082726 | 0.327890 | 0.171178 | -0.054499 | -0.672502 | -0.980568 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 437 | 0.876870 | 0.413360 | 1.256040 | -0.119769 | -0.053957 | -0.602843 | -0.054499 | 0.655787 | 0.151508 | 1.0 |
| 438 | -0.115937 | -0.334410 | -1.422086 | 1.037341 | 1.664355 | -0.602843 | 0.721302 | -0.380819 | 0.935254 | 1.0 |
| 439 | 0.876870 | -0.334410 | 0.363573 | -0.785107 | -0.290965 | -0.525441 | -0.232934 | -0.985649 | 0.325674 | 1.0 |
| 440 | -0.956004 | 0.821235 | 0.025550 | 0.343075 | 0.321306 | -0.602843 | 0.558384 | 0.936163 | -0.545154 | 0.0 |


In [12]:
pnd.fit()
pnd.show_results()

Completed: 100%|██████████| 9/9 [00:06<00:00,  1.45it/s]                    


| | test_neg_mean_squared_error | train_neg_mean_squared_error | test_neg_mean_absolute_percentage_error | train_neg_mean_absolute_percentage_error | test_neg_median_absolute_error | train_neg_median_absolute_error | test_r2 | train_r2 | fit_time | score_time |
|---|---|---|---|---|---|---|---|---|---|---|
| LinearRegression | -2977.598515 | -2846.654523 | -0.396566 | -0.386732 | -39.009146 | -37.9475 | 0.489155 | 0.51942 | 0.005739 | 0.002251 |
| ElasticNet | -3159.017211 | -3086.503123 | -0.422912 | -0.419648 | -42.619546 | -41.368389 | 0.46074 | 0.47898 | 0.004493 | 0.002173 |
| RandomForestRegressor | -3431.823331 | -484.350401 | -0.419956 | -0.154994 | -42.203 | -14.963 | 0.414595 | 0.918252 | 0.119276 | 0.006072 |
| HistGradientBoostingRegressor | -3544.069433 | -433.802058 | -0.407417 | -0.139296 | -40.39639 | -12.975536 | 0.391633 | 0.92675 | 0.931234 | 0.01623 |
| KNeighborsRegressor | -3615.195398 | -2444.122063 | -0.418674 | -0.338658 | -38.98 | -32.74 | 0.379625 | 0.587469 | 0.005044 | 0.002922 |
| XGBRegressor | -3923.48886 | -0.024397 | -0.426471 | -0.000925 | -39.031309 | -0.059937 | 0.329961 | 0.999996 | 0.072652 | 0.003645 |
| LinearSVR | -4268.314411 | -4109.886065 | -0.374296 | -0.361547 | -43.388592 | -41.509301 | 0.271443 | 0.306186 | 0.004373 | 0.002242 |
| DummyRegressor | -5934.577616 | -5929.903934 | -0.62154 | -0.621228 | -61.775921 | -63.713209 | -0.000797 | 0.0 | 0.004098 | 0.002207 |
| DecisionTreeRegressor | -6728.423034 | 0.0 | -0.591906 | 0.0 | -59.7 | 0.0 | -0.14546 | 1.0 | 0.005346 | 0.002262 |


That's the simplest Poniard usage example: with the default settings, `setup()`, `fit()` and `show_results()` are enough to compare a set of estimators.