# cuML Preprocessing
Users of cuML are certainly familiar with its ability to run machine learning models on GPUs and the significant training and inference speedup that can entail, but the models themselves are only part of the story. In this notebook, we will demonstrate how cuML allows you to develop an entire machine learning _pipeline_ in order to preprocess and prepare your data without _ever_ leaving the GPU.

We will use the [BNP Paribas Cardif Claims Management dataset](https://www.kaggle.com/c/bnp-paribas-cardif-claims-management) to showcase a few of the many methods that cuML offers for GPU-accelerated feature engineering. This dataset offers an interesting challenge because:
1. It is somewhat messy, including missing data of various kinds.
2. It includes both quantitative data (represented as floating point values) and categorical data (represented as both integers and strings).
3. It is anonymized, so we cannot use _a priori_ domain-specific knowledge to guide our approach.

Our goal here is not necessarily to achieve the best possible model performance but to showcase the cuML features that you could use to improve model performance on your own. For a deeper dive into how to maximize performance on this dataset, check out the solutions and associated discussion for [the top Kaggle entries](https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/leaderboard).

## 1. Data Ingest
Our first step is to acquire the data and read it into a data frame for subsequent processing. This process should be quite familiar for Pandas users, though we will be making use of cuDF, the equivalent GPU-accelerated module.

In [None]:
# To acquire the dataset, we will make use of the Kaggle CLI tool.
# If you do not have this tool set up, you can download the data directly
# from the Kaggle competition page: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/data
# Note that you may still need to visit this page even if you have the CLI
# tool in order to agree to the terms of data usage.

!kaggle competitions download -c bnp-paribas-cardif-claims-management
!unzip -o bnp-paribas-cardif-claims-management.zip

In [None]:
import cudf

data_cudf = cudf.read_csv("./train.csv.zip")
data_pd = data_cudf.to_pandas()

data_cudf.head()

Looking at the first few rows of these data, we can already understand some of the problems we might expect in working with the full dataset. We have a "target" column representing a binary classification target that we would like to predict with our model. As input to that model, we have over a hundred features, some represented as floats, some as ints, and some as strings. We can also see that quite a bit of the data is missing, as denoted by the numerous "\<NA\>" entries.

## 2. Evaluation Procedure
As a general principle, it is helpful to clearly define an evaluation procedure before jumping into model building and training. In this case, we are interested in finding a robust preprocessing protocol to apply to unseen data, so we will perform [k-fold cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation) and average performance across folds.
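
To make the fold structure concrete, here is a minimal pure-Python sketch of how unshuffled k-fold splitting partitions sample indices. The helper name `kfold_indices` is hypothetical and is not part of sklearn or cuML; the rule used for uneven folds (the first folds absorb the remainder) is an assumption chosen to keep fold sizes within one sample of each other.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Every sample lands in exactly one test fold
folds = list(kfold_indices(n_samples=10, n_splits=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each fold serves once as the held-out test set while the remaining folds are used for training, and the per-fold scores are then averaged.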

Because the RAPIDS packages have maintained such close compatibility with their non-GPU-accelerated counterparts, sklearn's k-fold cross-validation implementation can be directly applied to our data on the GPU. Moreover, this is one of several sklearn algorithms that can be applied without incurring any device-to-host copies, so we will use it directly in our evaluation protocol.

For demonstration purposes, we will use accuracy (the default scoring metric for random forest models in sklearn) as our metric, but remember that accuracy [should](https://en.wikipedia.org/wiki/Accuracy_paradox) [not](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0084217) [be](https://www.fharrell.com/post/class-damage/) [used](https://medium.com/@limavallantin/why-you-should-not-trust-only-in-accuracy-to-measure-machine-learning-performance-a72cf00b4516) as a model-selection metric for any serious application.
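
As a small illustration of why accuracy can mislead, consider the following sketch. The labels are made up for this example (they are not drawn from the BNP Paribas data), and the "model" simply predicts the majority class for every sample.

```python
# Hypothetical labels: 95 negatives and 5 positives
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the positive class: fraction of true positives that were found
positive_preds = [p for t, p in zip(y_true, y_pred) if t == 1]
recall = sum(positive_preds) / len(positive_preds)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The accuracy looks excellent even though the model never identifies a single positive case, which is why other metrics deserve attention on imbalanced data.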

In [None]:
import warnings

import numpy
from sklearn.model_selection import KFold

def evaluate(pipeline, data, n_splits=5, target_col='target'):
    """Return the pipeline's score averaged over k cross-validation folds"""
    x = data[data.columns.difference([target_col])]
    y = data[[target_col]]

    folds = KFold(n_splits=n_splits, shuffle=False)
    scores = numpy.empty(folds.get_n_splits(x), dtype=numpy.float32)
    for i, (train_indices, test_indices) in enumerate(folds.split(x)):
        x_train, x_test = x.iloc[train_indices], x.iloc[test_indices]
        y_train, y_test = y.iloc[train_indices], y.iloc[test_indices]
        pipeline.fit(x_train, y_train)
        scores[i] = pipeline.score(x_test, y_test)

    return numpy.average(scores)

def cu_evaluate(pipeline):
    """Convenience wrapper for evaluating cuML-based pipelines"""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return evaluate(pipeline, data_cudf)

def sk_evaluate(pipeline):
    """Convenience wrapper for evaluating sklearn-based pipelines"""
    # Suppress sklearn data conversion warnings
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return evaluate(pipeline, data_pd)

With these two convenience functions, we can quickly assess performance of full processing-and-classification pipelines with a single call.

## 3. The Model
For the moment, we are focusing on the preprocessing portion of our pipeline, so we will stick with a random forest model with a fixed set of hyperparameters. We will set `n_jobs` to `-1` for the sklearn model in order to make use of all available CPU processors, but we will otherwise stick with defaults.

You will probably notice a small difference in the accuracy achieved by the cuML random forest implementation and that achieved by sklearn. RAPIDS is in the process of transitioning to a new random forest implementation that performs much more comparably to sklearn. If you'd like to try out this (currently experimental) implementation, uncomment the indicated lines below.

In [None]:
from cuml.ensemble import RandomForestClassifier as cuRandomForestClassifier
from sklearn.ensemble import RandomForestClassifier as skRandomForestClassifier

cu_classifier = cuRandomForestClassifier()
sk_classifier = skRandomForestClassifier(n_jobs=-1)


# Uncomment the following lines to try out the new experimental RF
# implementation in cuML

# cu_classifier = cuRandomForestClassifier(max_features=1.0,
#                                          max_depth=13,
#                                          use_experimental_backend=True)

## Intermezzo: Helper Code

One of the standout features of sklearn is its consistent API for algorithms that fill the same role. Introducing a new algorithm that can be slotted into an sklearn pipeline is as easy as defining a class that fits that API. In this section, we'll define a few helper classes that will help us easily apply whatever preprocessing transformations we desire as part of our pipeline.

Feel free to skip over the details of these implementations; the docstrings should give a sufficient sense of their purpose and usage.
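
As a reference point for the `fit`/`transform` convention that these helpers follow, here is a deliberately tiny sketch written in plain Python. The class `RangeClipper` is hypothetical and does not inherit from any sklearn base class; it only mimics the method names and the convention that `fit` learns state and returns `self`.

```python
class RangeClipper:
    """Hypothetical transformer following the fit/transform convention."""

    def fit(self, X, y=None):
        # Learn state from the training data only
        self.min_ = min(X)
        self.max_ = max(X)
        return self  # returning self allows calls to be chained

    def transform(self, X, y=None):
        # Apply the learned state to any data, seen or unseen
        return [min(max(value, self.min_), self.max_) for value in X]

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)


clipper = RangeClipper()
train_out = clipper.fit_transform([1.0, 2.0, 3.0])
test_out = clipper.transform([0.0, 2.5, 5.0])
print(train_out, test_out)
```

The important property is that `transform` on unseen data reuses what was learned in `fit`, rather than recomputing anything from the new data.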

In [None]:
import pandas
from sklearn.base import BaseEstimator, TransformerMixin

class LambdaTransformer(BaseEstimator, TransformerMixin):
    """An sklearn-compatible class for simple transformation functions
    
    This helper class is useful for transforming data with a straightforward
    function requiring no fitting
    """
    def __init__(self, transform_function):
        self.transform_function = transform_function

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return self.transform_function(X)


# Workaround for https://github.com/rapidsai/cuml/issues/3041

class PerFeatureTransformer(BaseEstimator, TransformerMixin):
    """An sklearn-compatible class for fitting and transforming on
    each feature independently
    
    Some preprocessing algorithms need to be applied independently to
    each feature. This wrapper facilitates that process.
    """
    def __init__(self,
                 transformer_class,
                 transformer_args=(),
                 transformer_kwargs={},
                 copy=True):
        self.transformer_class = transformer_class
        self.transformer_args = transformer_args
        self.transformer_kwargs = transformer_kwargs
        self.transformers = {}
        self.copy = copy
        
    def fit(self, X, y=None):
        for col in X.columns:
            self.transformers[col] = self.transformer_class(
                *self.transformer_args,
                **self.transformer_kwargs
            )
            try:
                self.transformers[col].fit(X[col], y=y)
            except TypeError:  # https://github.com/rapidsai/cuml/issues/3053
                self.transformers[col].fit(X[col])
        return self
    
    def transform(self, X, y=None):
        if self.copy:
            X = X.copy()
        for col in X.columns:
            try:
                X[col] = self.transformers[col].transform(X[col], y=y)
            except TypeError:  # https://github.com/rapidsai/cuml/issues/3053
                X[col] = self.transformers[col].transform(X[col])
            
        return X
    
    def fit_transform(self, X, y=None):
        # Honor the copy flag so that the caller's data is not modified
        if self.copy:
            X = X.copy()
        for col in X.columns:
            self.transformers[col] = self.transformer_class(
                *self.transformer_args,
                **self.transformer_kwargs
            )
            try:
                X[col] = self.transformers[col].fit_transform(X[col], y=y)
            except TypeError:  # https://github.com/rapidsai/cuml/issues/3053
                X[col] = self.transformers[col].fit_transform(X[col])
        return X



class FeatureGenerator(BaseEstimator, TransformerMixin):
    """An sklearn-compatible class for adding new features to existing
    data
    """
    def __init__(self,
                 generator,
                 include_dtypes=None,
                 exclude_dtypes=None,
                 columns=None,
                 copy=True):
        self.include_dtypes = include_dtypes
        self.exclude_dtypes = exclude_dtypes
        self.columns = columns
        self.copy = copy
        self.generator = generator
        
    def _get_subset(self, X):
        subset = X
        if self.columns is not None:
            subset = X[self.columns]
        if self.include_dtypes or self.exclude_dtypes:
            subset = subset.select_dtypes(
                include=self.include_dtypes,
                exclude=self.exclude_dtypes
            )
        return subset
    
    def fit(self, X, y=None):
        subset = self._get_subset(X)
        try:
            self.generator.fit(subset, y=y)
        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053
            self.generator.fit(subset)
        return self
    
    def transform(self, X, y=None):
        subset = self._get_subset(X)
        try:
            new_features = self.generator.transform(subset, y=y)
        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053
            new_features = self.generator.transform(subset)
        if isinstance(X, cudf.DataFrame):
            return cudf.concat((X.reset_index(), new_features), axis=1)
        else:
            new_features = pandas.DataFrame(
                new_features,
                columns=["new_{}".format(i) for i in range(new_features.shape[1])]
            )
            return pandas.concat((X.reset_index(), new_features), axis=1)
    
    def fit_transform(self, X, y=None):
        subset = self._get_subset(X)
        try:
            new_features = self.generator.fit_transform(subset, y=y)
        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053
            new_features = self.generator.fit_transform(subset)
        if isinstance(X, cudf.DataFrame):
            return cudf.concat((X.reset_index(), new_features), axis=1)
        else:
            new_features = pandas.DataFrame(
                new_features,
                columns=["new_{}".format(i) for i in range(new_features.shape[1])]
            )
            return pandas.concat((X.reset_index(), new_features), axis=1)


class SubsetTransformer(BaseEstimator, TransformerMixin):
    """An sklearn-compatible class for fitting and transforming on
    a subset of features
    
    This allows a transformation to be applied to only data in a
    specific column of a dataframe or only data of a particular dtype.
    """
    def __init__(self,
                 transformer,
                 include_dtypes=None,
                 exclude_dtypes=None,
                 columns=None,
                 copy=True):
        self.transformer = transformer
        self.include_dtypes = include_dtypes
        self.exclude_dtypes = exclude_dtypes
        self.columns = columns
        self.copy = copy
        
    def _get_subset(self, X):
        subset = X
        if self.columns is not None:
            subset = X[self.columns]
        if self.include_dtypes or self.exclude_dtypes:
            subset = subset.select_dtypes(
                include=self.include_dtypes,
                exclude=self.exclude_dtypes
            )
        return subset
        
    def fit(self, X, y=None):
        subset = self._get_subset(X)
        try:
            self.transformer.fit(subset, y=y)
        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053
            self.transformer.fit(subset)
        return self
    
    def transform(self, X, y=None):
        if self.copy:
            X = X.copy()
        subset = self._get_subset(X)
        try:
            X[subset.columns] = self.transformer.transform(subset, y=y)
        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053
            X[subset.columns] = self.transformer.transform(subset)
        
        return X
    
    def fit_transform(self, X, y=None):
        if self.copy:
            X = X.copy()
        subset = self._get_subset(X)
        try:
            X[subset.columns] = self.transformer.fit_transform(subset, y=y)
        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053
            X[subset.columns] = self.transformer.fit_transform(subset)
        
        return X


class DeviceSpecificTransformer(BaseEstimator, TransformerMixin):
    """An sklearn-compatible class for performing different
    transformations based on whether it receives a cuDF or Pandas
    dataframe"""
    def __init__(self, pandas_transformer, cudf_transformer):
        self.pandas_transformer = pandas_transformer
        self.cudf_transformer = cudf_transformer
        self.transformer = None
        self.is_cuml_transformer = None

    def fit(self, X, y=None):
        if hasattr(X, 'to_pandas'):
            self.transformer = self.cudf_transformer
            self.is_cuml_transformer = True
        else:
            self.transformer = self.pandas_transformer
            self.is_cuml_transformer = False
            
        try:
            self.transformer.fit(X, y=y)
        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053
            self.transformer.fit(X)
        return self

    def transform(self, X, y=None):
        try:
            return self.transformer.transform(X, y=y)
        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053
            return self.transformer.transform(X)

    def fit_transform(self, X, y=None):
        if hasattr(X, 'to_pandas'):
            self.transformer = self.cudf_transformer
            self.is_cuml_transformer = True
        else:
            self.transformer = self.pandas_transformer
            self.is_cuml_transformer = False

        try:
            return self.transformer.fit_transform(X, y=y)
        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053
            return self.transformer.fit_transform(X)

Note that much of the logic here is necessary because of the relative messiness of the dataset we intend to work with or because we will be using these transformers in both cuML and sklearn pipelines. Simpler, cleaner datasets may not require any of this helper logic if they are processed solely with cuML.

## 4. Feature Engineering
With an evaluation protocol in place, a fixed model defined, and helper classes written, we can now turn to the actual task of cleaning up our data and exploring the available tools for creating useful features.

### 4.1 A Naive Approach

We'll start by defining a few cleaning steps that will be needed simply to pass off the data to our classifiers. Specifically, we will:
1. Drop the `ID` column, since we do not want to take the arbitrarily-assigned ID into account in our training.
2. Replace null and NaN values with something our classifier can work with.
3. Drop any non-numeric features, since our classifier does not currently support such data.
4. Convert remaining (numeric) features to 32-bit floats, since cuML's random forest implementation requires this.

This approach is quite naive. Categorical integer data is treated in the same way as quantitative float data. Categorical strings are ignored entirely, and missing data is replaced with a constant value that may not be appropriate in the context of the full dataframe. We will address all of these concerns and more as we build up more complex preprocessing pipelines.

In [None]:
drop_id = LambdaTransformer(lambda x: x[x.columns.difference(['ID'])])

In [None]:
replace_numeric_na = SubsetTransformer(
    LambdaTransformer(lambda x: x.fillna(0)),
    include_dtypes=['integer', 'floating']
)

In [None]:
replace_string_na = SubsetTransformer(
    LambdaTransformer(lambda x: x.fillna('UNKNOWN')),
    include_dtypes=['object']
)

In [None]:
filter_numeric = LambdaTransformer(lambda x: x.select_dtypes('number'))

In [None]:
convert_to_float32 = LambdaTransformer(lambda x: x.astype('float32'))

In [None]:
preprocessing_steps = [
    ("Drop ID", drop_id),
    ("Replace numeric NA", replace_numeric_na),
    ("Replace string NA", replace_string_na),
    ("Numeric filter", filter_numeric),
    ("32-bit Conversion", convert_to_float32)
]

With these naive preprocessing steps defined, let's create an sklearn `Pipeline` for both the cuML classifier and the sklearn classifier. We can then apply our previously-defined evaluation protocol to each and assess both runtime and accuracy performance.
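
For readers unfamiliar with how a pipeline chains its steps, the following plain-Python sketch shows the general idea: every step except the last is fitted and used to transform the data before it reaches the final estimator, and at prediction time the already-fitted transformers are applied without refitting. All class and function names here are hypothetical stand-ins, not sklearn's actual implementation.

```python
class AddOne:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x + 1 for x in X]


class Double:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x * 2 for x in X]


class ThresholdClassifier:
    """Hypothetical estimator: predicts 1 for values above the training mean."""

    def fit(self, X, y=None):
        self.threshold_ = sum(X) / len(X)
        return self

    def predict(self, X):
        return [int(x > self.threshold_) for x in X]


def fit_pipeline(steps, X, y=None):
    *transformers, final = steps
    for step in transformers:
        X = step.fit(X, y).transform(X)  # fit, then pass transformed data on
    final.fit(X, y)


def predict_pipeline(steps, X):
    *transformers, final = steps
    for step in transformers:
        X = step.transform(X)  # no refitting at prediction time
    return final.predict(X)


steps = [AddOne(), Double(), ThresholdClassifier()]
fit_pipeline(steps, [1, 2, 3, 4])
predictions = predict_pipeline(steps, [1, 4])
print(predictions)
```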

In [None]:
from sklearn.pipeline import Pipeline
cuml_pipeline = Pipeline(
    preprocessing_steps + [("Classifier", cu_classifier)],
    verbose=1  # Detailed timing information
)
sklearn_pipeline = Pipeline(
    preprocessing_steps + [("Classifier", sk_classifier)],
    verbose=1  # Detailed timing information
)

In [None]:
%time cu_evaluate(cuml_pipeline)

In [None]:
%%script false --no-raise-error
# WARNING: Takes several minutes
%time sk_evaluate(sklearn_pipeline)

Given the known runtime improvement of cuML's GPU-accelerated random forest implementation, it is no surprise that the cuML pipeline executed faster than its CPU-only equivalent. Digging into the timings of individual pipeline steps, we do indeed see that the majority of our performance gain with cuML comes from the classifier itself, but we also see some improvement in runtimes for the preprocessing steps. We'll take a closer look at that once we have a slightly more interesting pipeline in place.

Since the sklearn pipeline takes several minutes to run and the observed accuracy is similar to what we see with cuML, most of the remaining sklearn cells in this notebook will be disabled with the `%%script false --no-raise-error` magic tag. You can simply delete this tag from the cell if you wish to run the sklearn version of a particular section of code.

### 4.2 Data Imputation

As a marginal improvement on our initial approach, let's use a slightly more sophisticated method for dealing with missing values. Specifically, let's fill in missing quantitative features with the mean value for that feature in our training data. For this, we will make use of the `SimpleImputer` class, newly available in RAPIDS v0.16 through the `cuml.experimental.preprocessing` module.
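
The idea behind mean imputation can be sketched in a few lines of plain Python. This is only an illustration of the technique on a single made-up column, with hypothetical helper names; it is not how `SimpleImputer` is implemented.

```python
import math


def fit_column_mean(values):
    """Mean of the non-missing entries in one feature column."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present)


def impute(values, fill_value):
    """Replace NaN entries with a previously learned fill value."""
    return [fill_value if math.isnan(v) else v for v in values]


nan = float("nan")
train_col = [1.0, nan, 3.0, 5.0]
test_col = [nan, 2.0]

# The fill value is learned from the training split only
train_mean = fit_column_mean(train_col)
train_filled = impute(train_col, train_mean)
test_filled = impute(test_col, train_mean)
print(train_filled, test_filled)
```

Note that the test split is filled with the mean learned from the training split, which keeps information from unseen data out of the fitted preprocessing step.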

#### Aside: cuML's Experimental Preprocessing
It is no secret that cuML stands on the shoulders of the sklearn giant and benefits enormously from sklearn's brilliant design, thoughtful implementation, and enthusiastic community. In v0.16, cuML has benefitted even more directly through its new (and currently experimental) preprocessing features.

Because cuML has maintained such strong compatibility with sklearn, the RAPIDS team was able to incorporate sklearn code (still distributed under the terms of the sklearn license, of course) directly into cuML with only minor modifications. This became cuML's experimental preprocessing module. So if you appreciate having these features available in cuML, remember that it is thanks to the consistently stellar work of the sklearn developers and community, and be sure to [cite sklearn](https://scikit-learn.org/stable/about.html#citing-scikit-learn) in any scientific publications based on these features.

As an experimental feature, we are actively seeking feedback on these newly-introduced preprocessing algorithms. Please do report any problems you encounter via the [cuML issue tracker](https://github.com/rapidsai/cuml/issues).

In [None]:
from sklearn.impute import SimpleImputer as skSimpleImputer
from cuml.experimental.preprocessing import SimpleImputer as cuSimpleImputer

sk_mean_imputer = SubsetTransformer(
    skSimpleImputer(missing_values=numpy.nan, strategy='mean'),
    include_dtypes=['floating']
)
cu_mean_imputer = SubsetTransformer(
    cuSimpleImputer(missing_values=numpy.nan, strategy='mean'),
    include_dtypes=['floating']
)
mean_imputer = DeviceSpecificTransformer(sk_mean_imputer, cu_mean_imputer)

Because cupy does not currently support null values, we will need to add one other step to our pipeline: converting null data to NaNs or another known invalid value before performing imputation.

In [None]:
def _replace_nulls(data):
    data = data.copy()
    replacements = [
        (numpy.floating, numpy.nan),
        (numpy.integer, -1),
        (object, 'UNKNOWN')
    ]
    for col_type, value in replacements:
        subset = data.select_dtypes(col_type)
        data[subset.columns] = subset.fillna(value)
    return data

null_filler = LambdaTransformer(_replace_nulls)

In [None]:
preprocessing_steps = [
    ("Drop ID", drop_id),
    ("Replace nulls", null_filler),
    ("Imputation", mean_imputer),
    ("Numeric filter", filter_numeric),
    ("32-bit Conversion", convert_to_float32)
]

In [None]:
cuml_pipeline = Pipeline(preprocessing_steps + [("Classifier", cu_classifier)])
sklearn_pipeline = Pipeline(preprocessing_steps + [("Classifier", sk_classifier)])

In [None]:
cu_evaluate(cuml_pipeline)

In [None]:
%%script false --no-raise-error
sk_evaluate(sklearn_pipeline)

We see an almost negligible increase in accuracy using mean imputation, but you can try experimenting with other imputation strategies, including "median" and "most_frequent", to see what impact they have on performance.

### 4.3 Scaling
For some machine learning algorithms, it is helpful to adjust the average value of a feature and scale it so that its "spread" is comparable to other features. There are a few strategies for doing this, but one of the most common is to subtract the mean and then divide by the standard deviation. We can do precisely this using the `StandardScaler` algorithm.
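
As a worked illustration of standardization, the sketch below centers a made-up feature on its training mean and divides by its training standard deviation. It uses the population standard deviation (dividing by `n`), which is the estimator sklearn documents for `StandardScaler`; the data and the helper name `standardize` are assumptions for this example.

```python
import statistics

train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

# Statistics are learned from the training split only
mean = statistics.fmean(train)
std = statistics.pstdev(train)  # population standard deviation


def standardize(values):
    """Center on the training mean and scale by the training std."""
    return [(v - mean) / std for v in values]


train_scaled = standardize(train)
test_scaled = standardize(test)
print(train_scaled, test_scaled)
```

After scaling, the training feature has zero mean and unit standard deviation, while test values are expressed in the same units.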

In [None]:
from sklearn.preprocessing import StandardScaler as skStandardScaler
from cuml.experimental.preprocessing import StandardScaler as cuStandardScaler

sk_scaler = SubsetTransformer(
    skStandardScaler(),
    include_dtypes=['floating']
)
cu_scaler = SubsetTransformer(
    cuStandardScaler(),
    include_dtypes=['floating']
)
scaler = DeviceSpecificTransformer(sk_scaler, cu_scaler)

In [None]:
preprocessing_steps = [
    ("Drop ID", drop_id),
    ("Replace nulls", null_filler),
    ("Imputation", mean_imputer),
    ("Scaling", scaler),
    ("Numeric filter", filter_numeric),
    ("32-bit Conversion", convert_to_float32)
]
cuml_pipeline = Pipeline(preprocessing_steps + [("Classifier", cu_classifier)])
sklearn_pipeline = Pipeline(preprocessing_steps + [("Classifier", sk_classifier)])

In [None]:
cu_evaluate(cuml_pipeline)

In [None]:
%%script false --no-raise-error
sk_evaluate(sklearn_pipeline)

In general, random forest models do not benefit from this kind of scaling, but other model types, especially logistic regression and neural networks, can see improved accuracy or better convergence with this sort of preprocessing.

### 4.4 Encoding Categorical Data
Up to this point, we have not taken advantage of the categorical features in our data at all. In order to do so, we must encode them in some numeric representation. cuML offers a number of strategies for doing this, including one-hot encoding, label encoding, and target encoding. We will demonstrate just one of these algorithms (label encoding) here.

Using encoders on different training and testing data can be tricky because our training split may be missing some labels from our testing split. cuML's `LabelEncoder` includes the `handle_unknown` param which allows us to mark previously-unseen categories as null. Since all integer entries in our dataset are whole numbers, we can then replace these nulls with a value of -1 using two quick helper transformations.

In sklearn, we must use a slightly different workaround.
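
The underlying problem of unseen categories can be illustrated with a plain-Python sketch of label encoding. The helper names are hypothetical, and mapping unseen categories directly to `-1` is a simplification of the two-step approach (null first, then fill) described above.

```python
def fit_label_map(values):
    """Assign consecutive integer codes to the sorted unique categories."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def encode(values, label_map, unknown=-1):
    """Encode values, mapping categories unseen during fitting to `unknown`."""
    return [label_map.get(v, unknown) for v in values]


train = ["B", "A", "C", "A"]
test = ["A", "D", "C"]  # "D" never appears in the training split

label_map = fit_label_map(train)
train_codes = encode(train, label_map)
test_codes = encode(test, label_map)
print(train_codes, test_codes)
```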

In [None]:
from cuml.preprocessing import LabelEncoder as cuLabelEncoder
from sklearn.preprocessing import LabelEncoder as skLabelEncoder

cu_encoder = SubsetTransformer(
    PerFeatureTransformer(cuLabelEncoder, transformer_kwargs={'handle_unknown': 'ignore'}),
    include_dtypes=['integer', 'object']
)

In [None]:
# cuML workarounds for unseen data
def standard_ints(data):
    subset = data.select_dtypes('integer')
    data[subset.columns] = subset.astype('int32')
    return data

int_standardizer = LambdaTransformer(standard_ints)
replace_unknown_labels = LambdaTransformer(lambda x: x.fillna(-1))

In [None]:
# sklearn workarounds for unseen data
class SKUnknownEncoder(BaseEstimator, TransformerMixin):
    UNKNOWN = 'UNKNOWN'
    
    def __init__(self, base_encoder, copy=True):
        self.base_encoder = base_encoder
        self.copy = copy
        
    def fit(self, X, y=None):
        self.base_encoder.fit(list(X) + [self.UNKNOWN])
        return self
    
    def transform(self, X):
        if self.copy:
            X = X.copy()
        missing = set(X.unique()) - set(self.base_encoder.classes_)
        X = X.replace(list(missing), self.UNKNOWN)
        return self.base_encoder.transform(X)
    
    def fit_transform(self, X, y=None):
        return self.base_encoder.fit_transform(X)

In [None]:
sk_encoder = SubsetTransformer(
    PerFeatureTransformer(SKUnknownEncoder, transformer_args=(skLabelEncoder(),)),
    include_dtypes=['integer', 'object']
)

In [None]:
label_encoder = DeviceSpecificTransformer(sk_encoder, cu_encoder)

In [None]:
preprocessing_steps = [
    ("Drop ID", drop_id),
    ("Replace nulls", null_filler),
    ("Encoding", label_encoder),
    ("Imputation", mean_imputer),
    ("Standardize ints", int_standardizer),
    ("Handle unknown labels", replace_unknown_labels),
    ("Scaling", scaler),
    ("Numeric filter", filter_numeric),
    ("32-bit Conversion", convert_to_float32)
]
cuml_pipeline = Pipeline(preprocessing_steps + [("Classifier", cu_classifier)])
sklearn_pipeline = Pipeline(preprocessing_steps + [("Classifier", sk_classifier)])

In [None]:
cu_evaluate(cuml_pipeline)

In [None]:
%%script false --no-raise-error
sk_evaluate(sklearn_pipeline)

### 4.5 Discretization
While encoding gives us a way of converting discrete labels into numeric values, it is sometimes useful to do the reverse. When quantitative data falls into obviously useful categories (like "zero" vs "non-zero") or when the noise in quantitative data does not yield meaningful information about our prediction target, it can help our model to preprocess that quantitative data by converting it into categorical "bins". We will give just one example of this (`KBinsDiscretizer`), which we will naively apply across all quantitative (floating-point) data. For more serious feature engineering, we would perform a more careful analysis of the meaning and distribution of each quantitative feature.
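
To show what ordinal binning does to a quantitative feature, here is a small plain-Python sketch. It uses equal-width bins, which is only one possible binning strategy and was chosen here for simplicity; the helper names and data are hypothetical.

```python
import bisect


def fit_bin_edges(values, n_bins):
    """Interior edges of n_bins equal-width bins spanning the training range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def discretize(values, edges):
    """Map each value to the ordinal index of the bin it falls in."""
    return [bisect.bisect_right(edges, v) for v in values]


train = [0.0, 1.0, 2.5, 4.0, 7.5, 10.0]
edges = fit_bin_edges(train, n_bins=4)
binned = discretize(train, edges)
print(edges, binned)
```

Values outside the training range fall into the first or last bin, so unseen data never produces an out-of-range bin index.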

In [None]:
from sklearn.preprocessing import KBinsDiscretizer as skKBinsDiscretizer
from cuml.experimental.preprocessing import KBinsDiscretizer as cuKBinsDiscretizer

sk_discretizer = SubsetTransformer(
    skKBinsDiscretizer(encode='ordinal'),
    include_dtypes=['floating']
)
cu_discretizer = SubsetTransformer(
    cuKBinsDiscretizer(encode='ordinal'),
    include_dtypes=['floating']
)
discretizer = DeviceSpecificTransformer(sk_discretizer, cu_discretizer)

In [None]:
preprocessing_steps = [
    ("Drop ID", drop_id),
    ("Replace nulls", null_filler),
    ("Encoding", label_encoder),
    ("Imputation", mean_imputer),
    ("Standardize ints", int_standardizer),
    ("Handle unknown labels", replace_unknown_labels),
    ("Scaling", scaler),
    ("Discretization", discretizer),
    ("Numeric filter", filter_numeric),
    ("32-bit Conversion", convert_to_float32)
]
cuml_pipeline = Pipeline(preprocessing_steps + [("Classifier", cu_classifier)])
sklearn_pipeline = Pipeline(preprocessing_steps + [("Classifier", sk_classifier)])

In [None]:
cu_evaluate(cuml_pipeline)

In [None]:
%%script false --no-raise-error
sk_evaluate(sklearn_pipeline)

### 4.6 Generating New Features
We have looked at several ways of processing existing features that may help a machine learning model converge faster or perform better, but we can also generate new features from the existing data to help create the best possible representation of those data.

One of the most straightforward examples of this technique is the `PolynomialFeatures` algorithm. This algorithm works by looking at the products of existing features up to a certain order. Thus, if we have features `a`, `b`, and `c`, it might be useful to let the model see `ab`, `ac`, `bc` and potentially even `a**2`, `b**2`, and `c**2`.

In our case, we will again take a fairly naive approach, adding all of the interaction terms of order 2 (corresponding to `ab`, `ac`, and `bc` in the above example) as new features.
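
The interaction terms described above can be sketched in plain Python with `itertools.combinations`. The helper `add_pairwise_products` is hypothetical and keeps only the original columns plus their pairwise products, so its output layout may differ from what `PolynomialFeatures` returns.

```python
from itertools import combinations


def add_pairwise_products(row):
    """Append the product of every distinct pair of features to a row."""
    products = [a * b for a, b in combinations(row, 2)]
    return list(row) + products


rows = [[1, 2, 3], [4, 5, 6]]  # columns play the roles of a, b, c
expanded = [add_pairwise_products(r) for r in rows]
print(expanded)
```

With `n` input features this adds `n * (n - 1) / 2` new columns, so the feature count grows quickly as more columns are included.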

In [None]:
from cuml.experimental.preprocessing import PolynomialFeatures as cuPolynomialFeatures
from sklearn.preprocessing import PolynomialFeatures as skPolynomialFeatures

sk_generator = FeatureGenerator(
    skPolynomialFeatures(interaction_only=True, degree=2),
    include_dtypes=['integer']
)
cu_generator = FeatureGenerator(
    cuPolynomialFeatures(interaction_only=True, degree=2),
    include_dtypes=['integer']
)
generator = DeviceSpecificTransformer(sk_generator, cu_generator)

In [None]:
preprocessing_steps = [
    ("Drop ID", drop_id),
    ("Replace nulls", null_filler),
    ("Encoding", label_encoder),
    ("Imputation", mean_imputer),
    ("Standardize ints", int_standardizer),
    ("Handle unknown labels", replace_unknown_labels),
    ("Generate products", generator),
    ("Scaling", scaler),
    ("Discretization", discretizer),
    ("Numeric filter", filter_numeric),
    ("32-bit Conversion", convert_to_float32)
]
cuml_pipeline = Pipeline(preprocessing_steps + [("Classifier", cu_classifier)])
sklearn_pipeline = Pipeline(preprocessing_steps + [("Classifier", sk_classifier)])

In [None]:
cu_evaluate(cuml_pipeline)

In [None]:
%%script false --no-raise-error
sk_evaluate(sklearn_pipeline)

## 5. Final Assessment
Blindly applying the techniques presented thus far, we have seen a very modest increase in accuracy due solely to preprocessing. As evidenced by the ingenious solutions presented for the Kaggle competition associated with this dataset, a more careful and thorough exploration of preprocessing can yield much more impressive performance.

A key factor in finding an effective preprocessing protocol is how long it takes to iterate through possibilities and assess their impact. Indeed, this is one of the key benefits of cuML's new preprocessing tools. Using them, we can load data onto the GPU, then tweak, transform, and use it for training and inference without ever incurring the cost of device-to-host transfers.

With this in mind, let's take one final look at execution time for our final pipeline, breaking it down and analyzing the specific benefits of GPU-accelerated preprocessing.
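
As a rough illustration of what a per-step timing breakdown involves, the sketch below times each stage of a toy sequence of transformations using only the standard library. The `time_steps` helper and the two stand-in steps are hypothetical and exist only for this example; the actual pipeline steps are transformer objects rather than plain functions.

```python
import time


def time_steps(steps, data):
    """Apply each (name, function) step in order, recording elapsed seconds."""
    timings = []
    for name, func in steps:
        start = time.perf_counter()
        data = func(data)
        timings.append((name, time.perf_counter() - start))
    return data, timings


# Hypothetical stand-in steps operating on a plain list of values
toy_steps = [
    ("Fill missing", lambda rows: [0.0 if v is None else v for v in rows]),
    ("Scale", lambda rows: [v / 10.0 for v in rows]),
]

result, timings = time_steps(toy_steps, [1.0, None, 3.0])
for name, elapsed in timings:
    print(f"{name}: {elapsed:.6f}s")
```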

In [None]:
# Increase verbosity to provide timing details
cuml_pipeline = Pipeline(
    preprocessing_steps + [("Classifier", cu_classifier)],
    verbose=1
)
sklearn_pipeline = Pipeline(
    preprocessing_steps + [("Classifier", sk_classifier)],
    verbose=1
)

In [None]:
%time cu_evaluate(cuml_pipeline)

In [None]:
preprocessing_steps = [
    ("Drop ID", drop_id),
    ("Replace nulls", null_filler),
    ("Encoding", label_encoder),
    ("Imputation", mean_imputer),
    ("Standardize ints", int_standardizer),
    ("Handle unknown labels", replace_unknown_labels),
    ("Generate products", generator),
    ("Scaling", scaler),
    ("Discretization", discretizer),
    ("Numeric filter", filter_numeric),
    ("32-bit Conversion", convert_to_float32)
]
sklearn_pipeline = Pipeline(
    preprocessing_steps + [("Classifier", sk_classifier)],
    verbose=1
)
%time sk_evaluate(sklearn_pipeline)

In [None]:
preproc_only_pipeline = Pipeline(preprocessing_steps)

In [None]:
%%time
# Suppress warnings from naive application of discretizer to
# all features
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    preproc_only_pipeline.fit_transform(data_cudf[data_cudf.columns.difference(['target'])], data_cudf.target)

In [None]:
%%time
# Suppress warnings from naive application of discretizer to
# all features
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    preproc_only_pipeline.fit_transform(data_pd[data_pd.columns.difference(['target'])], data_pd.target)

Looking at these results, we can see the runtime benefit of GPU acceleration both in the full preprocessing and classification pipeline and in the preprocessing portion alone. For feature engineering, this means faster iteration, lower compute costs, and the possibility of conducting more systematic hyperparameter optimization (HPO) over even the preprocessing steps themselves. Those with an interest in HPO might check out our [detailed walkthroughs](https://rapids.ai/hpo) on performing HPO with RAPIDS in the cloud. The techniques explored there could easily be combined with those demonstrated in this notebook to rapidly search the space of available preprocessing and model hyperparameters.

## 6. Conclusions
Thanks to the newly expanded cuML preprocessing features in RAPIDS v0.16, it is now possible to keep your entire machine learning pipeline on the GPU, without copying data back to the host to make use of CPU-only algorithms. This offers substantial benefits in terms of runtime, which can in turn lead to more thorough exploration of the feature engineering space and dramatically lower compute times and costs.

While this notebook primarily offers a high-level demonstration of available preprocessing features rather than an in-depth optimization of features on a particular dataset, you may be interested in using it to play more with the BNP dataset yourself to engineer the perfect combination of curated features. Or better yet, try it with your own data.

If you like what you see here, there is plenty more to explore in our [other demo notebooks](https://github.com/rapidsai/notebooks). Please feel free to report any problems you find or ask questions via [the cuML issue tracker](https://github.com/rapidsai/cuml/issues), and keep an eye out for the next release of cuML (v0.17), which we expect to offer an even smoother preprocessing experience as we start to transition the new preprocessing features out of experimental status.