# Preprocessing parameters

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rxavier/poniard/blob/master/examples/03._preprocessing_parameters.ipynb)

This notebook outlines preprocessing parameters for Poniard estimators.

If Poniard is not installed, install it from PyPI.

In [9]:
# %pip install poniard

Poniard applies minimal preprocessing to data: just enough for models to fit correctly, without introducing significant transformation overhead. In particular, there is no anomaly detection, dimensionality reduction, clustering, resampling, feature creation from polynomial interactions or feature selection.

This is so the user always knows what's going on.

However, the default options may not be suitable for your data or objectives, so preprocessing options can be set during initialization or modified afterwards.

## Basics

The list of default transformations is:
* Missing data imputation.
* Z-score scaling for numeric variables.
* One-hot encoding for low cardinality categorical variables.
* Target encoding for the remaining categorical variables. This is a custom transformer based on Micci-Barreca, 2001, with implementation heavily based on [Dirty Cat](https://github.com/dirty-cat/dirty_cat/blob/master/dirty_cat/target_encoder.py).
* Datetime encoding for datetime variables. This also uses a custom transformer that extracts multiple datetime levels.
* Zero-variance feature elimination.

These defaults rely on type inference logic that decides whether a given feature is numeric, high cardinality categorical, low cardinality categorical or datetime.
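For orientation, the defaults listed above resemble the following plain scikit-learn pipeline. This is a minimal sketch under stated assumptions, not Poniard's actual implementation: the column names and toy data are made up, the numeric imputation strategy is scikit-learn's default, and target and datetime encoding are left out because Poniard uses custom transformers for them.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical column names (not from the Poniard docs).
toy = pd.DataFrame(
    {
        "income": [1.0, 2.5, np.nan, 4.0, 3.2, 0.7],
        "constant": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
        "color": pd.Series(["red", "blue", "red", np.nan, "blue", "red"], dtype=object),
    }
)

# Numeric features: impute missing values, then z-score scale.
numeric = Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())])

# Low cardinality categorical features: impute the mode, then one-hot encode.
categorical_low = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = Pipeline(
    [
        (
            "by_type",
            ColumnTransformer(
                [
                    ("numeric", numeric, ["income", "constant"]),
                    ("categorical_low", categorical_low, ["color"]),
                ],
                sparse_threshold=0,  # always return a dense array
            ),
        ),
        # Drop features that ended up with zero variance.
        ("drop_constant", VarianceThreshold()),
    ]
)

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```

The `constant` column is removed by the last step, so the result keeps the scaled `income` column plus one indicator column per `color` value.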

In [10]:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from poniard import PoniardRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
pnd = PoniardRegressor(estimators=[Ridge()])
pnd.setup(X, y)
pnd.preprocessor_

Main metric: neg_mean_squared_error
Minimum unique values to consider a number feature numeric: 44
Minimum unique values to consider a non-number feature high cardinality: 20

Inferred feature types:
  numeric categorical_high categorical_low datetime
0     age                              sex         
1     bmi                                          
2      bp                                          
3      s1                                          
4      s2                                          
5      s3                                          
6      s4                                          
7      s5                                          
8      s6                                          


`preprocess=False` disables preprocessing altogether, while `custom_preprocessor` accepts a scikit-learn transformer (or pipeline/column transformer) that replaces the default Poniard transformation pipeline.

Naturally, no type inference takes place when either option is used, and the user has full control over preprocessing.

In the following example, we use `TfidfVectorizer()` to process the 20 Newsgroups dataset.

In [11]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from poniard import PoniardClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

X, y = fetch_20newsgroups(return_X_y=True, remove=("headers", "footers", "quotes"),
                          categories=("alt.atheism", "sci.space"))
pnd = PoniardClassifier(estimators=[LogisticRegression()], custom_preprocessor=TfidfVectorizer())
pnd.setup(X, y)
pnd.preprocessor_

Main metric: roc_auc


In [12]:
pnd.fit()
pnd.show_results()

Completed: 100%|██████████| 2/2 [00:05<00:00,  2.75s/it]         


| Estimator | test_roc_auc | train_roc_auc | test_accuracy | train_accuracy | test_precision | train_precision | test_recall | train_recall | test_f1 | train_f1 | fit_time | score_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LogisticRegression | 0.966728 | 0.996531 | 0.890002 | 0.968779 | 0.853208 | 0.949418 | 0.967968 | 0.996628 | 0.906863 | 0.972445 | 0.219359 | 0.071576 |
| DummyClassifier | 0.5 | 0.5 | 0.552654 | 0.552656 | 0.552654 | 0.552656 | 1.0 | 1.0 | 0.711882 | 0.711885 | 0.167682 | 0.070055 |


Type inference is governed by the input data types and two thresholds included in the estimator constructor.

Number features (as defined by `numpy`) with more unique values than `numeric_threshold` will be treated as numeric, with the remainder being treated as non-numeric. If this parameter is a float, the actual threshold is `numeric_threshold * samples`. For example, with `numeric_threshold=0.1` and the 20,640 rows of the California housing data used below, a number feature needs 2,064 unique values to be treated as numeric.

Non-numeric features (either number features that fall below `numeric_threshold` or non-number features like strings) with more unique values than `cardinality_threshold` will be considered high cardinality. Likewise, in the case of a float value, the threshold is `cardinality_threshold * samples`.
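The two rules above can be sketched in plain pandas. This is an illustration only: `infer_feature_types` is a hypothetical helper, not part of Poniard, and both the truncation of float thresholds to integers and the strict "greater than" comparison are assumptions based on the description above.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def infer_feature_types(X, numeric_threshold=0.1, cardinality_threshold=20):
    """Hypothetical helper that mirrors the threshold rules described above."""
    n_samples = X.shape[0]
    # Float thresholds are read as a fraction of the number of samples.
    if isinstance(numeric_threshold, float):
        numeric_threshold = int(numeric_threshold * n_samples)
    if isinstance(cardinality_threshold, float):
        cardinality_threshold = int(cardinality_threshold * n_samples)

    types = {"numeric": [], "categorical_high": [], "categorical_low": [], "datetime": []}
    for column in X.columns:
        n_unique = X[column].nunique()
        if is_datetime64_any_dtype(X[column]):
            types["datetime"].append(column)
        elif is_numeric_dtype(X[column]) and n_unique > numeric_threshold:
            types["numeric"].append(column)
        elif n_unique > cardinality_threshold:
            # Low-variety number features and non-number features end up here.
            types["categorical_high"].append(column)
        else:
            types["categorical_low"].append(column)
    return types


# Toy data with made-up column names.
toy = pd.DataFrame(
    {
        "x": [float(i) for i in range(100)],      # 100 unique numbers
        "rating": [i % 5 for i in range(100)],    # 5 unique numbers
        "zip": [str(i % 30) for i in range(100)], # 30 unique strings
        "when": pd.date_range("2020-01-01", periods=100),
    }
)
inferred = infer_feature_types(toy)
print(inferred)
```

With 100 rows and `numeric_threshold=0.1`, the numeric cutoff is 10 unique values, so `rating` is treated as categorical even though it holds numbers.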

Defaults are set at reasonable limits, but pay attention to the output of `setup()`, as it might expose misclassified features. In that case there are three options: initialize the estimator with different thresholds that better accommodate the dataset, use a `custom_preprocessor` that applies appropriate transformations to different sets of features, or use the `reassign_types()` method to explicitly assign features to each category.

In [13]:
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer
from poniard import PoniardRegressor

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
reg = PoniardRegressor(numeric_imputer=KNNImputer(), scaler="minmax")
reg.setup(X, y)
reg.preprocessor_

Main metric: neg_mean_squared_error
Minimum unique values to consider a number feature numeric: 2064
Minimum unique values to consider a non-number feature high cardinality: 20

Inferred feature types:
      numeric categorical_high categorical_low datetime
0      MedInc         HouseAge                         
1    AveRooms         Latitude                         
2   AveBedrms        Longitude                         
3  Population                                          
4    AveOccup                                          


In [14]:
reg.reassign_types(numeric=["AveRooms", "AveBedrms", "Population", "AveOccup", "HouseAge", "Latitude", "Longitude"])

Assigned feature types:
      numeric categorical_high categorical_low datetime
0    AveRooms                                          
1   AveBedrms                                          
2  Population                                          
3    AveOccup                                          
4    HouseAge                                          
5    Latitude                                          
6   Longitude                                          


PoniardRegressor(estimators=None, metrics=None,
    preprocess=True, scaler=minmax, numeric_imputer=KNNImputer(),
    custom_preprocessor=None, numeric_threshold=0.1,
    cardinality_threshold=20, cv=None, verbose=0,
    random_state=0, n_jobs=None, plugins=None,
    plot_options=PoniardPlotFactory())
            

## Imputation, numeric scaling and categorical encoding

Different options for imputation, scaling and encoding are available on initialization. As a rule, `scaler`, `numeric_imputer` and `high_cardinality_encoder` accept appropriate scikit-learn preprocessors, e.g. `RobustScaler()` and `KNNImputer()` in the first two cases, and something like `OrdinalEncoder()` in the last case. These objects are used as is.

They also accept strings that represent scikit-learn preprocessors. The scaler can be `"standard"`, `"minmax"` or `"robust"`, the imputer can be `"simple"` or `"iterative"`, and the encoder can be `"target"` or `"ordinal"`.
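One way to picture the "string or transformer" behavior is a small lookup. The sketch below is an assumption about what the strings plausibly correspond to in scikit-learn, not Poniard's own code: `resolve`, `SCALERS` and `IMPUTERS` are hypothetical names, and the encoder is left out because `"target"` refers to a custom Poniard transformer.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Assumed correspondence between string options and scikit-learn classes.
SCALERS = {"standard": StandardScaler, "minmax": MinMaxScaler, "robust": RobustScaler}
IMPUTERS = {"simple": SimpleImputer, "iterative": IterativeImputer}


def resolve(option, aliases):
    """Hypothetical helper: turn a string alias into a transformer instance.

    Anything that is not a string is assumed to be a transformer and is
    returned unchanged, i.e. it is used as is.
    """
    if isinstance(option, str):
        try:
            return aliases[option]()
        except KeyError:
            raise ValueError(f"Unknown option: {option!r}") from None
    return option


scaler = resolve("minmax", SCALERS)
custom_imputer = KNNImputer()
imputer = resolve(custom_imputer, IMPUTERS)
print(type(scaler).__name__, type(imputer).__name__)
```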

We are deliberately not providing options for the categorical imputer (a `SimpleImputer(strategy="most_frequent")` is used) or the low cardinality categorical encoder (always `OneHotEncoder()`). While this is not set in stone, we feel that these are less debatable.

In [15]:
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer
from poniard import PoniardRegressor

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
reg = PoniardRegressor(numeric_imputer=KNNImputer(), scaler="minmax")
reg.setup(X, y)
reg.reassign_types(numeric=["AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "Longitude"],
                   categorical_high=["HouseAge"])
reg.preprocessor_

Main metric: neg_mean_squared_error
Minimum unique values to consider a number feature numeric: 2064
Minimum unique values to consider a non-number feature high cardinality: 20

Inferred feature types:
      numeric categorical_high categorical_low datetime
0      MedInc         HouseAge                         
1    AveRooms         Latitude                         
2   AveBedrms        Longitude                         
3  Population                                          
4    AveOccup                                          
Assigned feature types:
      numeric categorical_high categorical_low datetime
0    AveRooms         HouseAge                         
1   AveBedrms                                          
2  Population                                          
3    AveOccup                                          
4    Latitude                                          
5   Longitude                                          


In [16]:
reg.remove_estimators(["LinearSVR", "RandomForestRegressor", "ElasticNet"])
reg.fit()
reg.show_results()

Completed: 100%|██████████| 6/6 [00:17<00:00,  2.93s/it]                    


| Estimator | test_neg_mean_squared_error | train_neg_mean_squared_error | test_neg_mean_absolute_percentage_error | train_neg_mean_absolute_percentage_error | test_neg_median_absolute_error | train_neg_median_absolute_error | test_r2 | train_r2 | fit_time | score_time |
|---|---|---|---|---|---|---|---|---|---|---|
| XGBRegressor | -0.237771 | -0.09412031 | -0.185607 | -0.1233918 | -0.216376 | -0.155068 | 0.821449 | 0.929319 | 1.102309 | 0.012352 |
| HistGradientBoostingRegressor | -0.239658 | -0.1821939 | -0.190946 | -0.1707084 | -0.226846 | -0.206955 | 0.82002 | 0.86317 | 1.621701 | 0.018214 |
| KNeighborsRegressor | -0.403042 | -0.2622197 | -0.23354 | -0.1882364 | -0.28416 | -0.228 | 0.697173 | 0.803084 | 0.049478 | 0.062409 |
| DecisionTreeRegressor | -0.465499 | -8.310265e-32 | -0.233377 | -7.41969e-18 | -0.245401 | 0.0 | 0.650196 | 1.0 | 0.137257 | 0.00889 |
| LinearRegression | -0.791995 | -0.7848231 | -0.424454 | -0.4235226 | -0.528726 | -0.527934 | 0.405098 | 0.410588 | 0.073061 | 0.012835 |
| DummyRegressor | -1.33161 | -1.331544 | -0.621102 | -0.621098 | -0.762638 | -0.76134 | -0.000124 | 0.0 | 0.040143 | 0.0066 |


Preprocessing steps can be added to an existing preprocessor in any position.

In [17]:
from sklearn.feature_selection import SelectKBest, f_regression

reg.add_preprocessing_step(("feature_selection", SelectKBest(f_regression, k=3)), position="end")
reg.preprocessor_
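
To make the idea of inserting a step at a given position concrete, here is a sketch using a plain scikit-learn `Pipeline`. It does not show how Poniard implements `add_preprocessing_step`: `with_step` is a hypothetical helper, the data is synthetic, and accepting `"start"` or an integer index as a position is an assumption made for the illustration (the example above only uses `position="end"`).

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def with_step(pipeline, step, position="end"):
    """Hypothetical helper: return a new Pipeline with `step` inserted.

    `position` may be "start", "end" or an integer index. The existing
    transformers are reused, not cloned.
    """
    steps = list(pipeline.steps)
    index = {"start": 0, "end": len(steps)}.get(position, position)
    steps.insert(index, step)
    return Pipeline(steps)


# Synthetic regression data with 8 features.
X_toy, y_toy = make_regression(n_samples=50, n_features=8, random_state=0)

base = Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())])
extended = with_step(
    base, ("feature_selection", SelectKBest(f_regression, k=3)), position="end"
)

step_names = [name for name, _ in extended.steps]
selected = extended.fit_transform(X_toy, y_toy)
print(step_names)
print(selected.shape)
```

Placing the selector last means it scores features after imputation and scaling, and only the three best-scoring columns are passed on to the estimators.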