# ValidMind for model validation 3 — Developing potential challenger models

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this third notebook, develop potential challenger models and then pass your models and their predictions to ValidMind.

A *challenger model* is an alternate model that attempts to outperform the champion model, ensuring that the best-performing, fit-for-purpose model is always considered for deployment. Challenger models also help avoid over-reliance on a single model, and allow testing of new features, algorithms, or data sources without disrupting the production lifecycle.

## Setting up

## Split the preprocessed dataset

With our dummy model imported and our raw dataset rebalanced with highly correlated features removed, let's now **split our dataset into train and test sets** in preparation for training potential challenger models.

To start, let's grab the first few rows from the `balanced_raw_no_age_df` dataset we initialized earlier:

In [None]:
balanced_raw_no_age_df.head()

Before training the models, we need to encode the categorical features in the dataset:

- Use the `get_dummies` function from `pandas` to one-hot encode the categorical features, passing `drop_first=True` so that the first category of each feature is dropped and serves as the baseline.
- The categorical features in the dataset are `Geography` and `Gender`.

In [None]:
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
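To make the effect of `drop_first=True` concrete, here is a minimal sketch on a small made-up frame. The `toy_df` rows and the `Balance` values are hypothetical and only illustrate the encoding; they are not taken from the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only (not from the notebook's dataset)
toy_df = pd.DataFrame(
    {
        "Geography": ["France", "Germany", "Spain", "France"],
        "Gender": ["Female", "Male", "Male", "Female"],
        "Balance": [0.0, 1200.5, 830.0, 15.75],
    }
)

# One-hot encode the two categorical columns, dropping the first
# (alphabetically sorted) category of each to avoid redundant columns
toy_encoded = pd.get_dummies(
    toy_df, columns=["Geography", "Gender"], drop_first=True
)

encoded_columns = list(toy_encoded.columns)
print(encoded_columns)
# ['Balance', 'Geography_Germany', 'Geography_Spain', 'Gender_Male']
```

`France` and `Female` become the implicit baseline categories: a row with zeros in both `Geography_Germany` and `Geography_Spain` represents a customer in France.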

Splitting our dataset into training and testing is essential for proper validation testing, as this helps assess how well the model generalizes to unseen data:

- We begin by dividing our preprocessed dataset into feature variables (`X`) and the target variable (`y`), which indicates whether a customer exited.
- Using `train_test_split`, we randomly allocate 80% of the data to the training set (`X_train`, `y_train`) and 20% to the test set (`X_test`, `y_test`). Random sampling makes both sets likely to be representative of the full distribution, and fixing `random_state` makes the split reproducible.

In [None]:
from sklearn.model_selection import train_test_split

# Split the input and target variables
X = balanced_raw_no_age_df.drop("Exited", axis=1)
y = balanced_raw_no_age_df["Exited"]
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)
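A quick way to confirm that a split behaves as intended is to check the resulting set sizes and class proportions. The sketch below does this on hypothetical toy data (`toy_X`, `toy_y`) rather than on the notebook's dataset. It also passes `stratify`, an optional `train_test_split` argument that this notebook does not use, to show how class proportions can be preserved exactly instead of relying on random sampling alone.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 100 rows, 50 per class
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"feature": rng.normal(size=100)})
toy_y = pd.Series([0, 1] * 50, name="Exited")

# Same 80/20 split as above, plus stratification on the target
# (stratify is an addition for illustration, not part of the notebook's split)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X,
    toy_y,
    test_size=0.2,
    random_state=42,
    stratify=toy_y,
)

# Check set sizes and the share of positive cases in each set
print(len(toy_X_train), len(toy_X_test))        # 80 20
print(toy_y_train.mean(), toy_y_test.mean())    # 0.5 0.5
```

Without `stratify`, the proportions in each set would typically be close to, but not exactly, those of the full dataset.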