# SageMaker Demo: Employee Attrition Prediction Using Feature Store and XGBoost

This notebook demonstrates how to use Amazon SageMaker's Feature Store and the built-in XGBoost algorithm to predict employee attrition.

In [1]:
import pandas as pd

# Load the dataset
file_path = 'Employee.csv'  # Replace with your actual file path in S3 if needed
employee_df = pd.read_csv(file_path)
employee_df.head()

Unnamed: 0,Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain,LeaveOrNot
0,Bachelors,2017,Bangalore,3,34,Male,No,0,0
1,Bachelors,2013,Pune,1,28,Female,No,3,1
2,Bachelors,2014,New Delhi,3,38,Female,No,2,0
3,Masters,2016,Bangalore,3,27,Male,No,5,1
4,Masters,2017,Pune,3,24,Male,Yes,2,1


In [2]:
# Step 2: Data Preparation
# Convert categorical columns to numeric
employee_df['Education'] = employee_df['Education'].astype('category').cat.codes
employee_df['City'] = employee_df['City'].astype('category').cat.codes
employee_df['Gender'] = employee_df['Gender'].astype('category').cat.codes
employee_df['EverBenched'] = employee_df['EverBenched'].map({'Yes': 1, 'No': 0})

# Drop rows with NaN values in the target column
employee_df = employee_df.dropna(subset=['LeaveOrNot'])  # assign the result; dropna is not in-place

# Convert target column to numeric if needed
employee_df['LeaveOrNot'] = employee_df['LeaveOrNot'].astype(int)

# Ensure no missing values in feature columns
employee_df = employee_df.dropna()

# Verify all columns are numeric
print(employee_df.dtypes)

# Define features and target
feature_columns = [
    'Education', 'JoiningYear', 'City', 'PaymentTier', 'Age',
    'Gender', 'EverBenched', 'ExperienceInCurrentDomain'
]
target_column = 'LeaveOrNot'

employee_df = employee_df[[target_column] + feature_columns]

# Display the transformed dataset
employee_df.head()

Education                     int8
JoiningYear                  int64
City                          int8
PaymentTier                  int64
Age                          int64
Gender                        int8
EverBenched                  int64
ExperienceInCurrentDomain    int64
LeaveOrNot                   int64
dtype: object


Unnamed: 0,LeaveOrNot,Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain
0,0,0,2017,0,3,34,1,0,0
1,1,0,2013,2,1,28,0,0,3
2,0,0,2014,1,3,38,0,0,2
3,1,1,2016,0,3,27,1,0,5
4,1,1,2017,2,3,24,1,1,2
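
The `cat.codes` conversion above replaces each label with an integer but does not keep the mapping. A minimal sketch of how the mapping could be recorded alongside the codes, assuming a toy frame built from values seen in the preview (the helper name `encode_with_mapping` is hypothetical and not part of the notebook):

```python
import pandas as pd

# Toy frame using category values that appear in the preview above.
toy = pd.DataFrame({
    "Education": ["Bachelors", "Masters", "Bachelors", "Masters"],
    "City": ["Bangalore", "Pune", "New Delhi", "Pune"],
})

def encode_with_mapping(series):
    """Return integer codes plus the label -> code mapping that produced them."""
    cat = series.astype("category")
    # For string labels, pandas orders the categories lexicographically,
    # so the code is the label's position in that sorted order.
    mapping = {label: code for code, label in enumerate(cat.cat.categories)}
    return cat.cat.codes, mapping

education_codes, education_map = encode_with_mapping(toy["Education"])
city_codes, city_map = encode_with_mapping(toy["City"])

print(education_map)
print(city_map)
```

Keeping the mapping makes it possible to decode predictions or to encode new records consistently later on.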


In [3]:
from sklearn.model_selection import train_test_split 

# Check if we have any retrieved records
if not employee_df.empty:
    # Split the data into training and test sets
    train_df, test_df = train_test_split(employee_df, test_size=0.2, random_state=42)
    print("Training and test data split after retrieval from Feature Store.")
else:
    print("No records retrieved. Please check the feature group and identifiers.")

Training and test data split after retrieval from Feature Store.
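
The split above is purely random. If the target classes are imbalanced, a stratified split keeps the class ratio the same in both parts. A minimal sketch on synthetic data (the frame `demo_df` and its 30% positive rate are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared frame: 100 rows, 30 of them positive.
demo_df = pd.DataFrame({
    "LeaveOrNot": [1] * 30 + [0] * 70,
    "Age": list(range(100)),
})

train_part, test_part = train_test_split(
    demo_df,
    test_size=0.2,
    random_state=42,
    stratify=demo_df["LeaveOrNot"],  # preserve the class ratio in both parts
)

# Both parts should show the same share of positive labels.
print(train_part["LeaveOrNot"].mean(), test_part["LeaveOrNot"].mean())
```

In the notebook, the equivalent change would be passing `stratify=employee_df['LeaveOrNot']` to the existing call.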


## Train the Model Using Local Data with S3 Mode (Default)

In [4]:
# Save the data locally first
train_file = 'train.csv'
validation_file = 'validation.csv'
train_df.to_csv(train_file, index=False)
test_df.to_csv(validation_file, index=False)


In [5]:
train_df = pd.read_csv("train.csv")
valid_df = pd.read_csv("validation.csv")

target_col = "LeaveOrNot"

X_train = train_df.drop(columns=[target_col])
y_train = train_df[target_col]

X_valid = valid_df.drop(columns=[target_col])
y_valid = valid_df[target_col]

In [6]:
%pip install xgboost


Note: you may need to restart the kernel to use updated packages.


In [7]:
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss'
)

model.fit(X_train, y_train)


Parameter,Value
objective,'binary:logistic'
base_score,
booster,
callbacks,
colsample_bylevel,
colsample_bynode,
colsample_bytree,0.8
device,
early_stopping_rounds,
enable_categorical,False
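
The classifier is fitted above but never scored on the held-out data. A minimal sketch of an evaluation helper follows. To keep the block runnable without `xgboost`, it uses scikit-learn's `GradientBoostingClassifier` on synthetic data as a stand-in; the helper name `evaluate_classifier` is hypothetical. `XGBClassifier` follows the scikit-learn estimator interface, so the same helper should accept it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_classifier(fitted_model, X_valid, y_valid):
    """Return accuracy and ROC AUC for an estimator exposing predict/predict_proba."""
    labels = fitted_model.predict(X_valid)
    positive_proba = fitted_model.predict_proba(X_valid)[:, 1]
    return {
        "accuracy": accuracy_score(y_valid, labels),
        "roc_auc": roc_auc_score(y_valid, positive_proba),
    }

# Synthetic data and a stand-in model (assumptions, not the notebook's data or model).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
stand_in = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)

metrics = evaluate_classifier(stand_in, X_va, y_va)
print(metrics)
```

In the notebook this would be called as `evaluate_classifier(model, X_valid, y_valid)`.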


In [8]:
%pip install optuna xgboost
    

Note: you may need to restart the kernel to use updated packages.


In [9]:
import optuna
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

# Features (X) and target (y) taken from the prepared DataFrame
X = employee_df.drop(columns=['LeaveOrNot'])
y = employee_df['LeaveOrNot']

# Train/Val split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

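# Note: the 0/1 target is modeled here as a regression problem and scored by RMSE,
# so the predictions are continuous values rather than class labels.
def objective(trial):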
    params = {
        "objective": "reg:squarederror",
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "eta": trial.suggest_float("eta", 0.1, 0.5),
        "gamma": trial.suggest_float("gamma", 0, 5),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }

    model = xgb.XGBRegressor(**params)
    model.fit(X_train, y_train)

    preds = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, preds))
    return rmse

# Run tuning
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)

print("Best Trial:")
print("  Value (RMSE):", study.best_value)
print("  Params:", study.best_params)

# Train final model on full data
best_params = study.best_params
best_model = xgb.XGBRegressor(
    objective="reg:squarederror",
    **best_params
)
best_model.fit(X, y)

print("\nBest model trained on full data.")


[I 2026-01-24 14:56:37,978] A new study created in memory with name: no-name-77bca2cf-b6f6-4d41-8294-f6e904de70e5
[I 2026-01-24 14:56:38,018] Trial 0 finished with value: 0.3449163481808489 and parameters: {'max_depth': 7, 'eta': 0.16561548007509394, 'gamma': 4.000802921398796, 'min_child_weight': 7, 'subsample': 0.9476077428700564}. Best is trial 0 with value: 0.3449163481808489.
[I 2026-01-24 14:56:38,053] Trial 1 finished with value: 0.34156956334676597 and parameters: {'max_depth': 4, 'eta': 0.23137850239459778, 'gamma': 2.0937647641876014, 'min_child_weight': 5, 'subsample': 0.6492047419029391}. Best is trial 1 with value: 0.34156956334676597.
[I 2026-01-24 14:56:38,104] Trial 2 finished with value: 0.3168449774557602 and parameters: {'max_depth': 7, 'eta': 0.19440657050724608, 'gamma': 0.38681782446152224, 'min_child_weight': 10, 'subsample': 0.6868916893886097}. Best is trial 2 with

Best Trial:
  Value (RMSE): 0.3133865417330095
  Params: {'max_depth': 7, 'eta': 0.42045975968915594, 'gamma': 0.3739582362036293, 'min_child_weight': 1, 'subsample': 0.8823607501753465}

Best model trained on full data.
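
The RMSE reported above is computed against a 0/1 target, so its square is the mean squared gap between prediction and label. If the regressor's outputs are read as probabilities, that is the Brier score; this reading is loose, since `reg:squarederror` does not constrain outputs to [0, 1]. A minimal sketch of turning such scores into class labels and an accuracy figure, using made-up values and a hypothetical helper name:

```python
import numpy as np

def regression_scores_as_classification(y_true, scores, threshold=0.5):
    """Summarize continuous scores against 0/1 labels: RMSE, its square, and accuracy."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    rmse = float(np.sqrt(np.mean((scores - y_true) ** 2)))
    labels = (scores >= threshold).astype(int)  # threshold is an assumption, not tuned
    accuracy = float(np.mean(labels == y_true))
    return {"rmse": rmse, "brier": rmse ** 2, "accuracy": accuracy}

# Made-up labels and scores for illustration only.
y_demo = [0, 1, 0, 1, 1]
score_demo = [0.1, 0.8, 0.4, 0.3, 0.9]

result = regression_scores_as_classification(y_demo, score_demo)
print(result)
```

Applied to the notebook, `scores` would be `best_model.predict(X_val)` from a model fitted on the training split only; the final model above is fitted on all of `X`, so scoring it on `X_val` would be optimistic.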
