# Preprocessing

> Learn how Poniard preprocessors can be modified to fit different use cases and datasets

In [None]:
#| hide
from nbdev.showdoc import *

## Introduction

Poniard applies minimal preprocessing to data by default. The goal is only to make sure that models fit correctly without introducing significant transformation overhead. In particular, there is no anomaly detection, dimensionality reduction, clustering, resampling, feature creation from polynomial interactions, feature selection, etc.

This is so the user always knows what's going on.

However, the default options may not be suitable for your data or objectives, so these can be set during initialization or modified afterwards.

## Default preprocessing pipeline

The list of default transformations is:

* Missing data imputation.
* Z-score scaling for numeric variables.
* One-hot encoding for low cardinality categorical variables.
* Target encoding for the remaining categorical variables. This is a custom transformer based on Micci-Barreca, 2001, with implementation heavily based on [Dirty Cat](https://github.com/dirty-cat/dirty_cat/blob/master/dirty_cat/target_encoder.py). If the task is multilabel or multioutput, ordinal encoding will be used instead.
* Datetime encoding for datetime variables. This also uses a custom transformer that extracts multiple datetime levels.
* Zero-variance feature elimination.

This includes some type inference logic that decides whether a given feature is numeric, high cardinality categorical, low cardinality categorical or datetime (see [`Type inference`](#type-inference)).
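To make the target encoding step in the list above more concrete, here is a minimal sketch of smoothed target encoding in plain Python. It blends each category's target mean with the global mean, giving more weight to the category mean as the category gets more samples. The function names and the smoothing parameter `m` are assumptions made for illustration; Poniard's own transformer may weight categories differently.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, m=10.0):
    """Map each category to a smoothed mean of the target.

    `m` is a hypothetical smoothing strength: the larger it is, the more
    rare categories are pulled towards the global target mean.
    """
    prior = sum(targets) / len(targets)  # global target mean
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    encoding = {}
    for cat, n in counts.items():
        weight = n / (n + m)  # approaches 1 for frequent categories
        encoding[cat] = weight * (sums[cat] / n) + (1 - weight) * prior
    return encoding, prior


def apply_target_encoding(categories, encoding, prior):
    """Encode categories, falling back to the prior for unseen values."""
    return [encoding.get(cat, prior) for cat in categories]


cats = ["a", "a", "a", "b"]
ys = [1, 1, 0, 0]
encoding, prior = fit_target_encoding(cats, ys, m=1.0)
encoded = apply_target_encoding(["a", "b", "c"], encoding, prior)
```

The unseen category `"c"` is encoded with the global mean, which is one common way of handling values that were not present during fitting.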



In [None]:
import random

import pandas as pd
import numpy as np
from poniard import PoniardClassifier

In [None]:
random.seed(0)
rng = np.random.default_rng(0)

data = pd.DataFrame({"type": random.choices(["house", "apartment"], k=500),
                     "age": rng.uniform(1, 200, 500).astype(int),
                     "date": pd.date_range("2022-01-01", freq="M", periods=500),
                     "rating": random.choices(range(50), k=500),
                     "target": random.choices([0, 1], k=500)})
X, y = data.drop("target", axis=1), data["target"]
pnd = PoniardClassifier().setup(X, y)
pnd.preprocessor

Target info
-----------
Type: binary
Shape: (500,)
Unique values: 2

Main metric
-----------
roc_auc

Thresholds
----------
Minimum unique values to consider a feature numeric: 50
Minimum unique values to consider a categorical high cardinality: 20

Inferred feature types
----------------------


| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | age | rating | type | date |






:::{.callout-note}
## Empty subpreprocessors
If no features are assigned to a subpreprocessor (like `datetime_preprocessor` or `categorical_low_preprocessor`), then it will be dropped. This does not affect results as scikit-learn effectively ignores transformers with no assigned features, but it makes the HTML representation cleaner.
:::

## Type inference

Type inference is governed by the input data types and two thresholds included in the estimator constructor.

Features with a numeric dtype (as defined by `numpy`) that have more than `numeric_threshold` unique values are treated as numeric; the rest are treated as non-numeric. If this parameter is a float, the actual threshold is `numeric_threshold * samples`.

Non-numeric features (either numeric-dtype features that fall below `numeric_threshold` or features of other dtypes, like strings) that have more than `cardinality_threshold` unique values are considered high cardinality. Likewise, in the case of a float value, the threshold is `cardinality_threshold * samples`.
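The two rules above can be summarized in a few lines of plain Python. This is only a sketch of the logic as described here, not Poniard's actual implementation: the function names are hypothetical, the defaults (`0.1` and `20`) are inferred from the thresholds printed by `setup` on this page, and truncating a float threshold with `int` is an assumption.

```python
def resolve_threshold(threshold, n_samples):
    """Turn a float threshold into an absolute count of unique values."""
    if isinstance(threshold, float):
        return int(threshold * n_samples)  # assumption: truncate to int
    return threshold


def infer_feature_type(n_unique, n_samples, is_number, is_datetime=False,
                       numeric_threshold=0.1, cardinality_threshold=20):
    """Classify one feature from its dtype family and unique value count."""
    if is_datetime:
        return "datetime"
    numeric_min = resolve_threshold(numeric_threshold, n_samples)
    cardinality_min = resolve_threshold(cardinality_threshold, n_samples)
    # Number features need strictly more unique values than the threshold.
    if is_number and n_unique > numeric_min:
        return "numeric"
    # Everything else is categorical, split by cardinality.
    if n_unique > cardinality_min:
        return "categorical_high"
    return "categorical_low"


# With 500 samples the numeric threshold resolves to 50 unique values,
# so an integer feature with exactly 50 unique values is not numeric.
rating_type = infer_feature_type(n_unique=50, n_samples=500, is_number=True)
# A string feature with two unique values is low cardinality.
type_type = infer_feature_type(n_unique=2, n_samples=500, is_number=False)
```

Under these assumptions `rating_type` is `"categorical_high"` and `type_type` is `"categorical_low"`, which matches the inferred types shown for the `rating` and `type` columns in the first example.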

Defaults are set at reasonable limits, but do pay attention to the output of `PoniardBaseEstimator.setup` as it might expose misclassified features. In that scenario there are three options: initialize the estimator with different thresholds that better accommodate the dataset, use a `custom_preprocessor` that applies appropriate transformations to different sets of features, or use the `PoniardBaseEstimator.reassign_types` method to explicitly assign features to each type.

In the following example, `PoniardBaseEstimator.reassign_types` is used to make every feature numeric as far as preprocessing goes.

In [None]:
from sklearn.datasets import fetch_california_housing
from poniard import PoniardRegressor

In [None]:
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
reg = PoniardRegressor()
reg.setup(X, y)
reg.preprocessor

Target info
-----------
Type: continuous
Shape: (20640,)
Unique values: 3842

Main metric
-----------
neg_mean_squared_error

Thresholds
----------
Minimum unique values to consider a feature numeric: 2064
Minimum unique values to consider a categorical high cardinality: 20

Inferred feature types
----------------------


| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | MedInc | HouseAge | | |
| 1 | AveRooms | Latitude | | |
| 2 | AveBedrms | Longitude | | |
| 3 | Population | | | |
| 4 | AveOccup | | | |






In [None]:
reg.reassign_types(numeric=["MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup", "HouseAge", "Latitude", "Longitude"])
reg.preprocessor

Assigned feature types
----------------------


| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | MedInc | | | |
| 1 | AveRooms | | | |
| 2 | AveBedrms | | | |
| 3 | Population | | | |
| 4 | AveOccup | | | |
| 5 | HouseAge | | | |
| 6 | Latitude | | | |
| 7 | Longitude | | | |






:::{.callout-warning}
## Undefined features in `reassign_types`
Any feature that is not included in any of the `PoniardBaseEstimator.reassign_types` parameters will be effectively dropped, which is why already-numeric features had to be included in the `numeric` parameter. This behavior will be changed in the future. 
:::
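Because unlisted features are dropped silently, it can help to check the assignment before calling `reassign_types`. The helper below is a hypothetical sketch in plain Python and is not part of Poniard. Its keyword names mirror the columns of the feature type table, and treating `categorical_low` and `datetime` as valid `reassign_types` parameters is an assumption.

```python
def unassigned_features(columns, numeric=None, categorical_high=None,
                        categorical_low=None, datetime=None):
    """Return the columns that appear in none of the type lists."""
    assigned = set()
    for group in (numeric, categorical_high, categorical_low, datetime):
        assigned.update(group or [])
    # Preserve the original column order in the result.
    return [col for col in columns if col not in assigned]


# Column names taken from the first example on this page.
columns = ["type", "age", "date", "rating"]
missing = unassigned_features(columns,
                              numeric=["age", "rating"],
                              categorical_low=["type"])
```

Here `missing` is `["date"]`, so passing these lists as they are would drop the `date` column.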

## Modifying the default preprocessor



Combining properly set up feature types with the `scaler`, `numeric_imputer` and `high_cardinality_encoder` parameters allows almost complete customization of the default preprocessing pipeline.

These three parameters take strings representing transformers (for example, `scaler="minmax"` uses scikit-learn's `MinMaxScaler`; see the [reference](./core.ipynb)), and also accept scikit-learn transformers and pipelines.

For now, we are deliberately not providing options for the categorical imputer (a `SimpleImputer(strategy="most_frequent")` is used) or the low cardinality categorical encoder (always `OneHotEncoder(drop="if_binary", handle_unknown="ignore", sparse=False)`). While this is not set in stone, we feel that these are less debatable.



In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer
from poniard import PoniardRegressor

In [None]:
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
reg = PoniardRegressor(numeric_imputer=KNNImputer(), scaler="robust")
reg.setup(X, y)
reg.reassign_types(numeric=["AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "Longitude"],
                   categorical_high=["HouseAge"])
reg.preprocessor

Target info
-----------
Type: continuous
Shape: (20640,)
Unique values: 3842

Main metric
-----------
neg_mean_squared_error

Thresholds
----------
Minimum unique values to consider a feature numeric: 2064
Minimum unique values to consider a categorical high cardinality: 20

Inferred feature types
----------------------


| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | MedInc | HouseAge | | |
| 1 | AveRooms | Latitude | | |
| 2 | AveBedrms | Longitude | | |
| 3 | Population | | | |
| 4 | AveOccup | | | |




Assigned feature types
----------------------


| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | AveRooms | HouseAge | | |
| 1 | AveBedrms | | | |
| 2 | Population | | | |
| 3 | AveOccup | | | |
| 4 | Latitude | | | |
| 5 | Longitude | | | |






## Use a custom preprocessor

When initializing either `PoniardRegressor` or `PoniardClassifier` (see the docs for `PoniardBaseEstimator`, which sets up most of the functionality), `preprocess=False` disables preprocessing altogether, while `custom_preprocessor` accepts a scikit-learn transformer (or pipeline/column transformer) that replaces the default Poniard transformation pipeline.

Logically, there is no type inference involved when these options are used and full control is given to the user.

In the following example, we use `TfidfVectorizer` and `Normalizer` to process the [20 News Groups dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups).

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

from poniard import PoniardClassifier

In [None]:
X, y = fetch_20newsgroups(return_X_y=True, remove=("headers", "footers", "quotes"),
                          categories=("sci.crypt", "sci.electronics", "sci.med"))
preprocessor = make_pipeline(TfidfVectorizer(), Normalizer())
pnd = PoniardClassifier(estimators=[LogisticRegression()], custom_preprocessor=preprocessor)
pnd.setup(X, y)
pnd.preprocessor

Target info
-----------
Type: multiclass
Shape: (1780,)
Unique values: 3

Main metric
-----------
roc_auc_ovr



In [None]:
pnd.fit()
pnd.get_results()

Completed: 100%|██████████████████████████████████| 2/2 [00:12<00:00,  6.26s/it]


| | test_roc_auc_ovr | test_accuracy | test_precision_macro | test_recall_macro | test_f1_macro | fit_time | score_time |
|---|---|---|---|---|---|---|---|
| LogisticRegression | 0.976337 | 0.888202 | 0.89617 | 0.888335 | 0.888822 | 0.732272 | 0.148423 |
| DummyClassifier | 0.5 | 0.33427 | 0.111423 | 0.333333 | 0.167018 | 0.343316 | 0.142217 |


In [None]:
#| hide
import nbdev; nbdev.nbdev_export()