# Modeling

In this tutorial, we show how to use `zephyr_ml` to train models with the `Zephyr` class. It builds on the previous tutorial, where we created EntitySets, generated label times, and performed automated feature engineering. For any of those steps, please refer to the `feature_engineering` notebook.

## 1) Load the Feature Matrix

Load the feature matrix produced by the `feature_engineering` notebook. For the purpose of this tutorial, we use a dummy feature matrix stored in the `data/` folder.

In [1]:
import pandas as pd

feature_matrix = pd.read_csv('data/feature_matrix.csv')

## 2) Preparing Model Inputs

Prepare the data for modeling. Depending on the data, you might need to normalize it, impute missing values, create one-hot encodings for categorical values, and so on.

In this part of the notebook, we do the following:
* create `X` and `y` variables from the feature matrix
* impute missing values using a `SimpleImputer`
* split the data into training and testing sets

In [2]:
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# pop the target labels
y = list(feature_matrix.pop('label'))
X = feature_matrix.values

# impute missing values (SimpleImputer defaults to the column mean)
imputer = SimpleImputer()
X = imputer.fit_transform(X)

# create train and test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=33)
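
The cell above only imputes missing values. If a feature matrix also contained categorical columns or features on very different scales, the other steps mentioned earlier (normalization and one-hot encoding) could be combined along the following lines. This is a minimal sketch using scikit-learn's `ColumnTransformer`; the toy frame and its column names (`wind_speed`, `turbine_type`) are hypothetical and are not part of the tutorial's dummy data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; the column names are illustrative only.
toy = pd.DataFrame({
    "wind_speed": [5.0, np.nan, 7.0, 6.0],
    "turbine_type": ["A", "B", "A", "C"],
})

# Numeric columns: fill gaps with the column mean, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric, ["wind_speed"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["turbine_type"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_toy = preprocess.fit_transform(toy)
print(X_toy.shape)  # one scaled numeric column plus three one-hot columns
```

Whether any of this is needed depends on the feature matrix at hand; the dummy matrix in this tutorial only requires imputation.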

## 3) Train a Model

We train a model using the `Zephyr` interface, which lets you train a pipeline, run inference with it, and evaluate it.
In this notebook, we use the `xgb_classifier` pipeline, which consists of two primitives:

```
        "xgboost.XGBClassifier"
        "zephyr_ml.primitives.postprocessing.FindThreshold"
```

The `XGBClassifier` primitive is an XGBoost model that returns the probability of each class. The `FindThreshold` primitive then turns those probabilities into binary labels by choosing the threshold that yields the best metric value (F1 score by default).
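
To make the thresholding idea concrete, here is a minimal sketch in plain Python. It is not the `zephyr_ml` implementation: how `FindThreshold` actually searches for the threshold is not described here, so the sketch assumes a simple scan over the observed positive-class probabilities. The helper names (`f1_score`, `find_threshold`) and the toy data are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 score for binary labels, returning 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def find_threshold(y_true, y_proba):
    """Return the candidate threshold with the highest F1 score, and that score."""
    best_threshold, best_score = 0.5, -1.0
    # Assumption: every observed probability is tried as a candidate threshold.
    for threshold in sorted(set(y_proba)):
        y_pred = [int(p >= threshold) for p in y_proba]
        score = f1_score(y_true, y_pred)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy labels and positive-class probabilities (illustrative values only).
y_true = [0, 0, 1, 1, 0, 1]
y_proba = [0.10, 0.40, 0.35, 0.80, 0.55, 0.70]

threshold, score = find_threshold(y_true, y_proba)
print(threshold, round(score, 3))  # 0.7 0.8
```

On this toy data, the default cut-off of 0.5 would produce one false positive (the sample scored 0.55), whereas 0.70 removes it at the cost of one missed positive and gives the best F1 of the candidates.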

To use a pipeline, we simply pass the name of the pipeline to `Zephyr`.
Optionally, you can change the default settings of a primitive by passing a hyperparameter dictionary. For example, we can set the number of trees in the classifier to 50 instead of the default value of 100.

In [3]:
from zephyr_ml import Zephyr

hyperparameters = {
    "xgboost.XGBClassifier#1": {
        "n_estimators": 50
    }
}

zephyr = Zephyr('xgb_classifier', hyperparameters)

Next, we train the pipeline by calling the `fit` function and passing it the training data.

In [4]:
zephyr.fit(X_train, y_train)

Now that the pipeline is trained, we can use it to predict the labels of the test data using the `predict` function.

In [5]:
zephyr.predict(X_test)

[1, 0, 1]

Lastly, we can evaluate the performance of the pipeline using the `evaluate` function.

In [6]:
zephyr.evaluate(X_test, y_test)

accuracy     0.666667
f1           0.666667
recall       1.000000
precision    0.500000
dtype: float64
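
To see how scores like these relate to the predictions, the sketch below recomputes the four metrics by hand. The notebook does not show `y_test`, so the true labels used here are an assumption, chosen only because they are consistent with the predictions `[1, 0, 1]` and the reported scores.

```python
y_pred = [1, 0, 1]  # predictions printed by zephyr.predict above
y_true = [1, 0, 0]  # hypothetical labels; y_test itself is not shown

# Count the four confusion-matrix cells.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)   # 2 of 3 predictions are correct
precision = tp / (tp + fp)           # 1 of 2 predicted positives is correct
recall = tp / (tp + fn)              # the single true positive is found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 6), round(f1, 6), recall, precision)
# 0.666667 0.666667 1.0 0.5
```

With only three test samples, a single misclassification moves accuracy by a third, so these numbers say little about real-world performance.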