# Improving model performance with xfeat, RAPIDS and Optuna


## Introduction

Feature Engineering is the processing of transforming raw data into features that can represent the underlying patterns of the data better. They can help boost the accuracy by a great deal and improve the ability of the model to generalise on unseen data. Every data scientist knows the importance feature engineering. Spending some time thinking about how best to apply and combine the available features can be very meaningful. 

Hyper parameter Optimisation is another such process which can help complement a good model by tuning it's hyperparameters, which can have a tremendous impact on the accuracy of the model. The time and resources required for these processes are generally the reason they're overlooked. 

With xfeat, RAPIDS and Optuna - we aim to bridge these gaps and elevate the performance. 

## What is Optuna?
[Optuna](https://github.com/optuna/optuna) s a lightweight framework for automatic hyperparameter optimization. It provides a define-by-run API, which makes it easy to adapt to any already existing code that we have and enables high modularity and the flexibility to construct hyperparameter spaces dynamically. By simply wrapping the objective function with Optuna can help perform a parallel-distributed HPO search over a search space. As we'll see in this notebook.

## What is xfeat?
[xfeat](https://github.com/pfnet-research/xfeat) is a feature engineering & exploration library using GPUs and Optuna. It provides a scikit-learn-like API for feature engineering with support for pandas and cuDF dataframes and cuPy arrays. 

## What is RAPIDS?
[RAPIDS](https://rapids.ai/about.html) framework  provides a library suite that can execute end-to-end data science pipelines entirely on GPUs.  The libraries in the framework include [cuDF](https://github.com/rapidsai/cudf) - a GPU Dataframe with pandas-like API, [cuML](https://github.com/rapidsai/cuml) - implement machine learning algorithms that provide a scikit-learn-like API and many more. You can learn more [here](https://github.com/rapidsai).

In this notebook, we'll show how one can use these tools together to develop and improve a machine learning model. We'll use Airlines dataset (20M rows) to predict if a flight will be delayed or not. We'll explore how to use Optuna with RAPIDS and the speedups that we can achieve with the integration of these.

In [1]:
from cuml import LogisticRegression
import cudf
import optuna
from cuml.metrics import accuracy_score
from cuml.preprocessing.model_selection import train_test_split
import numpy as np
from xfeat.pipeline import Pipeline
from xfeat.num_encoder import SelectNumerical
from xfeat.selector import ChiSquareKBest
from xfeat.optuna_selector import KBestThresholdExplorer, GroupCombinationExplorer
from cuml import LogisticRegression
from functools import partial
import xfeat
from sklearn.model_selection import KFold

from xfeat import ArithmeticCombinations, Pipeline, SelectNumerical, LabelEncoder, SelectCategorical, TargetEncoder



### Feature Engineering

We'll be using the following functions to perform a few featur engineering tasks on the data. The `feature_engineering` function is called on the dataframe `df`, in this function we perform a simple Arithmetic Combinations on the numerical columns that adds two columns to create a new one. We specify the `operator` and `r` - r is used to indicate how many columns need to be combined.

Then we call `categorical_encoding` which converts the categorical columns to numerical ones and then performs `target_encoding`. Target Encoding replaces the value with the target mean. This is helpful in classification problem to boost the model accuracy. Find more resources at the end of the notebook.

You'll also notice we use `Pipeline` from xfeat to combine two or more feature engineering tasks together. This is useful to concatenate encoders sequentially.

Read more about Feature Encoding and Pipelining with xfeat [here](https://github.com/pfnet-research/xfeat/blob/master/_docs/feature_encoding.md).

In [2]:
def feature_engineering(df):
    """
    Perform feature engineering and return a new df with engineered features
    """
    df_train, df_test, y_train, y_test = train_test_split(df, "ArrDelayBinary", train_size=0.7, random_state=np.random.seed(0), shuffle=True)

    # Need to do this to ensure we are appropriately assigning the split values
    # It introduces nulls when done directly as df["col"] = x
    df_train["ArrDelayBinary"] = np.nan
    df_train.loc[:, "ArrDelayBinary"] = y_train
    df_test["ArrDelayBinary"] = np.nan
    df_test.loc[:, "ArrDelayBinary"] = y_test
    
    # combine into one pipeline
    encoder = Pipeline([
                        LabelEncoder(output_suffix=""),
                        TargetEncoder(target_col="ArrDelayBinary", output_suffix=""),
                        ArithmeticCombinations(exclude_cols=["ArrDelayBinary"],
                                               drop_origin=False,
                                               operator="+",
                                               r=2,
                                               output_suffix="_plus")
                        
                         
    
                    ])
    df_train = encoder.fit_transform(df_train)
    df_test = encoder.transform(df_test)
    df = cudf.concat([df_train, df_test], sort=False)
    return df

### Feature Selection and Hyper parameter Optimisation

Now that we have some new features, how do we know they are relevant for the task or represent anything meaningful? We use the feature selection process to do this. This helps in selection of a subset of features that are  most informative. This helps in simplifying the problem and ensures that we aren't overloading the system with unimportant features. Optuna provides a way to choose a `selector` which accepts a `Pipeline` object from xfeat. You can see in the `feature_selection` function we define a Pipeline that takes in an Explorer and a Selection Algorithm (`ChiSquareKBest`). We pass this to an Optuna Study object, along with an Objective function

### Objective Function
The objective function will be the one we optimize in Optuna Study. Objective funciton tries out different values for the parameters that we are tuning and saving the results in `study.trials_dataframes()`.

Let's define the objective function for this HPO task by making use of the `train_and_eval()`. You can see that we simply choose a value for the parameters and call the `train_and_eval` method, making Optuna very easy to use in an existing workflow.

The objective remains constant over different samplers, which are built-in options in Optuna to enable the selection of different sampling algorithms that optuna provides. Some of the available ones include - GridSampler, RandomSampler, TPESampler, etc. We'll use TPESampler for this demo, but feel free to try different samplers to notice the chnages in performance.


### HPO Trials and Study
Optuna uses [study](https://optuna.readthedocs.io/en/stable/reference/study.html) and [trials](https://optuna.readthedocs.io/en/stable/reference/trial.html) to keep track of the HPO experiments. Put simply, a trial is a single call of the objective function while a set of trials make up a study. We will pick the optimal performing trial from a study to get the best parameters that were used in that run.

In [3]:
def train_and_eval(df, penalty='l2', C=1.0, l1_ratio='None', fit_intercept='True', selector=None):

    # Splitting data and prepping for selector fit
    X_train,  X_test, y_train, y_test = train_test_split(df, "ArrDelayBinary",random_state=np.random.seed(0), shuffle=True)

    if selector:
        # For the selector, the label also needs to be in the DF
        
        # Need to do this to ensure we are appropriately assigning the split values
        # It introduces nulls when done directly as df["col"] = x
        X_train["ArrDelayBinary"] = np.nan
        X_train.loc[:, "ArrDelayBinary"] = y_train
        X_test["ArrDelayBinary"] = np.nan
        X_test.loc[:, "ArrDelayBinary"] = y_test
        
        X_train = selector.fit_transform(X_train)
        X_test = selector.transform(X_test)
    
    # Train and get accuracy
    classifier = LogisticRegression(penalty=penalty,
                                    C=C,
                                    l1_ratio=l1_ratio,
                                    fit_intercept=fit_intercept,
                                    max_iter=10000)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict_proba(X_test.values)[:, 1]
    score = accuracy_score(y_test, y_pred)
    return score

def objective(df, selector, trial):
    """
    Performs the training and evaluation of the set of parameters and subset of features using selector.
    """
    selector.set_trial(trial)
    
    # Select Params
    C = trial.suggest_uniform("C", 0 , 7.0)
    penalty = trial.suggest_categorical("penalty", ['l1', 'none', 'l2'])
    l1_ratio = trial.suggest_uniform("l1_ratio", 0 , 1.0)
    fit_intercept = trial.suggest_categorical("fit_intercept", [True, False])
    
    score = train_and_eval(df,
                           penalty=penalty,
                           C=C,
                           l1_ratio=l1_ratio,
                           fit_intercept=fit_intercept,
                          selector=selector)
    return score

def feature_selection(df):
    """
    Defines the Pipeline and performs the optuna opt
    """
    selector = Pipeline(
        [
            SelectNumerical(),
            KBestThresholdExplorer(ChiSquareKBest(target_col="ArrDelayBinary")),
        ]
    )

    
    study = optuna.create_study(direction="maximize")
    study.optimize(partial(objective, df, selector), n_trials=N_TRIALS)

    selector.from_trial(study.best_trial)
    selected_cols = selector.get_selected_cols()
    
    return study, df[selected_cols]

In [4]:
INPUT_FILE = "/home/hyperopt/hyperopt/data/air_par.parquet"

N_ROWS = 10000000
N_TRIALS = 100

In [5]:
df_ = cudf.read_parquet(INPUT_FILE)[:N_ROWS]

df_ = df_.drop(["ActualElapsedTime"], axis=1)
# Can't handle nagative values, yet
# indices = df_.loc[df_["ActualElapsedTime"] < 0].index
# df_.loc[indices, "ActualElapsedTime"] = -1 * df_.loc[indices, "ActualElapsedTime"]

# cuML can't handle object types
df_["ArrDelayBinary"] = df_["ArrDelayBinary"].astype('int32')
print("Default performance: ", train_and_eval(df_))

Default performance:  0.1875987946987152


In [6]:
# We cast to objects for categorical  and target encoding
df_["UniqueCarrier"] = df_["UniqueCarrier"].astype("object")
df_["Origin"] = df_["Origin"].astype("object")
df_["Dest"] = df_["Dest"].astype("object")
df_["ArrDelayBinary"] = df_["ArrDelayBinary"].astype('int32')

In [7]:
df_feature_eng = feature_engineering(df_)
print("After feature eng: ", train_and_eval(df_feature_eng))

After feature eng:  0.8115019798278809


In [8]:
study, df_select = feature_selection(df_feature_eng)

[I 2020-07-16 16:01:52,631] Finished trial#0 with value: 0.8119096159934998 with parameters: {'C': 1.1704604735458646, 'penalty': 'none', 'l1_ratio': 0.3996027625828089, 'fit_intercept': True, 'KBestThresholdExplorer.k': 46.0}. Best is trial#0 with value: 0.8119096159934998.
[I 2020-07-16 16:02:00,544] Finished trial#1 with value: 0.8118315935134888 with parameters: {'C': 0.5449011514426064, 'penalty': 'l1', 'l1_ratio': 0.7106452051766206, 'fit_intercept': False, 'KBestThresholdExplorer.k': 71.0}. Best is trial#0 with value: 0.8119096159934998.
[I 2020-07-16 16:02:08,445] Finished trial#2 with value: 0.8119671940803528 with parameters: {'C': 6.220744902299951, 'penalty': 'l1', 'l1_ratio': 0.10636759985961242, 'fit_intercept': False, 'KBestThresholdExplorer.k': 72.0}. Best is trial#2 with value: 0.8119671940803528.
[I 2020-07-16 16:02:16,046] Finished trial#3 with value: 0.8116515874862671 with parameters: {'C': 6.680750967256805, 'penalty': 'l2', 'l1_ratio': 0.5882468499527762, 'fit_in

In [9]:
df_select["ArrDelayBinary"] = df_["ArrDelayBinary"]

In [10]:
df_select.columns

Index(['CRSDepTime', 'CRSArrTime', 'Origin', 'Distance', 'YearCRSArrTime_plus',
       'YearDistance_plus', 'MonthCRSDepTime_plus', 'MonthCRSArrTime_plus',
       'MonthFlightNum_plus', 'MonthOrigin_plus', 'MonthDest_plus',
       'MonthDistance_plus', 'DayofMonthCRSArrTime_plus',
       'DayofMonthDistance_plus', 'DayofWeekCRSDepTime_plus',
       'DayofWeekCRSArrTime_plus', 'DayofWeekDistance_plus',
       'CRSDepTimeCRSArrTime_plus', 'CRSDepTimeUniqueCarrier_plus',
       'CRSDepTimeFlightNum_plus', 'CRSDepTimeOrigin_plus',
       'CRSDepTimeDest_plus', 'CRSDepTimeDistance_plus',
       'CRSArrTimeUniqueCarrier_plus', 'CRSArrTimeFlightNum_plus',
       'CRSArrTimeOrigin_plus', 'CRSArrTimeDest_plus',
       'CRSArrTimeDistance_plus', 'UniqueCarrierDistance_plus',
       'FlightNumOrigin_plus', 'FlightNumDest_plus', 'FlightNumDistance_plus',
       'OriginDest_plus', 'OriginDistance_plus', 'DestDistance_plus',
       'DistanceDiverted_plus', 'ArrDelayBinary'],
      dtype='object')

In [11]:
params = study.best_params
print("After feature selection and paramter tuning: ", train_and_eval(df_select,
                                                                      C=params['C'],
                                                                      penalty=params['penalty'],
                                                                      l1_ratio=params['l1_ratio'],
                                                                      fit_intercept=params['fit_intercept']))

After feature selection and paramter tuning:  0.8124812245368958


In [12]:
# study.trials_dataframe().to_csv("xfeat_chi2_100Trials_run3_catenc.csv", header=True)

In [13]:
print(params)

{'C': 1.6862352497863953, 'penalty': 'none', 'l1_ratio': 0.04097929441331606, 'fit_intercept': False, 'KBestThresholdExplorer.k': 36.0}



|Run No.|Default|After FE| Optuna Best| After Selection and HPO|Best Params| Cols|
|-|-|-|-|-|-|-|
|1|0.18747319281101227|0.18743759393692017|0.812965989112854|0.8126764297485352| {'C': 2.7250515887031064,'penalty': 'l2','l1_ratio':0.1403560039741595, 'fit_intercept': True, 'KBestThresholdExplorer.k': 1.0} |['UniqueCarrierFlightNum_plus', 'ArrDelayBinary']|
|2|0.18741360306739807|0.1874103993177414|0.812959611415863|0.8120356202125549|{'C': 3.398344321379011,'penalty': 'none','l1_ratio': 0.5549805266380305,'fit_intercept': True,'KBestThresholdExplorer.k': 1.0}|['FlightNumDistance_plus', 'ArrDelayBinary']|
|3|0.18733720481395721|0.18729199469089508|0.8128544092178345|0.812743604183197|{'C': 4.202429766198819, 'penalty': 'none', 'l1_ratio': 0.6571582748501704, 'fit_intercept': False, 'KBestThresholdExplorer.k': 1.0}|['UniqueCarrierFlightNum_plus', 'ArrDelayBinary']|

## Additional Resources
[How to Win a DS Kaggle competition](https://www.coursera.org/learn/competitive-data-science)

[Target Encoding and Bayesian Target Encoding](https://towardsdatascience.com/target-encoding-and-bayesian-target-encoding-5c6a6c58ae8c)
