# Vehicle Sales Price Predictions Workshop - Part 2 of 3

## Training Pipeline

To build a machine learning system from this dataset, we have structured the service into three pipelines:

1. feature engineering pipeline notebook (see Part 1)
2. training pipeline notebook (i.e. this Part 2)
3. inferencing pipeline notebook (see Part 3)

This notebook covers the second step, i.e. the training pipeline.


## <span style="color:#ff5f27">📝 Imports </span>

In [None]:
# Install XGBoost
!pip install xgboost -q

# Install the Hopsworks client library
!pip install --quiet hopsworks

In [None]:
import os
import time
import joblib
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, FunctionTransformer
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## <span style="color:#ff5f27"> 📡 Connecting to Hopsworks Feature Store </span>

In [None]:
# Connect to the Hopsworks Feature store and get the feature group
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

car_prices_fg = fs.get_feature_group(
    name="car_prices_xgboost", 
    version=1,
)

## <span style="color:#ff5f27">⚙️ Feature View Creation </span>


In [None]:
# Create a feature view for the training
feature_view = fs.get_or_create_feature_view(
    name="car_prices_xgboost",
    version=1,
    query=car_prices_fg.select_except(["seller", "saledate"]),
    labels=["sellingprice"],
)
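As a rough mental model of the cell above, the feature view keeps every feature group column except `seller` and `saledate`, and marks `sellingprice` as the label so that it is returned separately from the features. The pandas sketch below mimics that behaviour on a toy DataFrame. It does not use the Hopsworks API, and the rows and the `year` and `make` columns are made up for illustration.

```python
import pandas as pd

# Made-up rows; only "seller", "saledate" and "sellingprice" are column
# names taken from the notebook. "year" and "make" are hypothetical.
fg_df = pd.DataFrame({
    "year": [2014, 2015],
    "make": ["Ford", "BMW"],
    "seller": ["dealer a", "dealer b"],
    "saledate": ["2015-01-10", "2015-02-03"],
    "sellingprice": [12000, 30500],
})

# Analog of select_except([...]): drop the excluded columns.
selected = fg_df.drop(columns=["seller", "saledate"])

# Analog of labels=[...]: separate the label column from the features.
toy_labels = selected[["sellingprice"]]
toy_features = selected.drop(columns=["sellingprice"])

print(list(toy_features.columns))
print(list(toy_labels.columns))
```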

## <span style="color:#ff5f27">👩🏻‍🍳 Data Preparation </span>

A machine learning model is, at its core, a mathematical function that operates on numbers, so categorical (string) columns must be encoded as numeric values before training. The fitted encoders must also be saved: the same mapping is needed to encode new inputs at prediction time and to decode values back to their original categories.

In [None]:
features_df, labels_df = feature_view.training_data()
labels_df
features_df

In [None]:
# Now we will encode the dataset
def encode_categorical_data(dataset, label_encoders):
    """Label-encode object columns in place and collect the fitted encoders."""
    # Iterate over the columns of the DataFrame
    for column in dataset.columns:
        # Check if the column is of type 'object' (categorical)
        if dataset[column].dtype == 'object':
            # Create a LabelEncoder instance
            label_encoder = LabelEncoder()

            # Fit the encoder and replace the column with its integer codes
            dataset[column] = label_encoder.fit_transform(dataset[column])

            # Store the fitted encoder under its column name
            label_encoders[column] = label_encoder
    return dataset

# Create a dictionary to store label encoders
clf = {}
df_encoded = encode_categorical_data(features_df, clf)
df_encoded

The cell above transforms the categorical values of the dataset ('dataset_cleaned.csv') into numeric values and keeps the fitted encoders, which are later saved to a file for use during prediction.
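To illustrate why the encoders need to be kept, here is a minimal sketch of the full round trip: fit one `LabelEncoder` per column, persist the dictionary with `joblib`, load it back, and use it to encode a new row. The column names, values, and temporary file location are assumptions made for this example and are not taken from the workshop dataset.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy training data; column names and values are made up for illustration.
train = pd.DataFrame({
    "make": ["Ford", "BMW", "Ford"],
    "body": ["SUV", "Sedan", "Sedan"],
})

# Fit one encoder per categorical column and keep it in a dictionary.
encoders = {}
for column in train.columns:
    encoder = LabelEncoder()
    train[column] = encoder.fit_transform(train[column])
    encoders[column] = encoder

# Persist the encoders and load them back, as the inference side would.
path = os.path.join(tempfile.mkdtemp(), "label_encoders.pkl")
joblib.dump(encoders, path)
loaded = joblib.load(path)

# Encode a new raw row with the loaded encoders (transform, not fit_transform).
new_row = pd.DataFrame({"make": ["BMW"], "body": ["SUV"]})
for column, encoder in loaded.items():
    new_row[column] = encoder.transform(new_row[column])

# Map an encoded value back to its original category.
decoded_make = loaded["make"].inverse_transform(new_row["make"])[0]
print(new_row)
print(decoded_make)
```

Note that `LabelEncoder.transform` raises a `ValueError` for a category it did not see during fitting, so inputs at prediction time have to be restricted to known categories or handled separately.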

Split the encoded dataset into two parts, train and test, keeping 1000 rows in the test set.

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_encoded, labels_df, test_size=1000, random_state=42)

# Show training and test set sizes
print("⛳️ Size of the training dataset :", len(X_train))
print("⛳️ Size of the test dataset :", len(X_test))
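The split above passes an integer as `test_size`. In scikit-learn, an integer is read as an absolute number of test rows, while a float between 0 and 1 is read as a fraction of the dataset. The sketch below shows both forms on a small synthetic array that stands in for the encoded features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and labels: 50 rows.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Integer test_size: an absolute number of rows in the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=10, random_state=42)

# Float test_size: a fraction of the dataset (0.2 of 50 rows is also 10 rows).
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=42)

print(len(X_tr), len(X_te))
print(len(X_tr2), len(X_te2))
```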

## <span style="color:#ff5f27">🏃🏻‍♂️ Model Training </span>


In [None]:
# Train the XGBoost model
model = xgb.XGBRegressor(
    objective='reg:squarederror', 
    n_estimators=100, 
    learning_rate=0.1,
)
model.fit(X_train, y_train)

In [None]:
# Evaluate the model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
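The mean squared error is expressed in squared price units, which makes it hard to read directly. As a sketch of two easier-to-interpret alternatives, the example below computes the root mean squared error (RMSE) and the mean absolute error (MAE), both in the same units as the price. The values are made up for illustration and are not results from the workshop model.

```python
import numpy as np

# Made-up prices for illustration only (not results from the workshop model).
y_true = np.array([10000.0, 15000.0, 20000.0])
y_hat = np.array([11000.0, 14000.0, 23000.0])

errors = y_hat - y_true
mse_example = np.mean(errors ** 2)      # squared price units
rmse_example = np.sqrt(mse_example)     # same units as the price
mae_example = np.mean(np.abs(errors))   # average absolute error, price units

print(f"MSE:  {mse_example:.2f}")
print(f"RMSE: {rmse_example:.2f}")
print(f"MAE:  {mae_example:.2f}")
```

In the notebook itself, the same idea would amount to taking the square root of `mse`.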

In [None]:
# Save the model
MODEL_NAME = "car_prices_model_xgboost"
os.makedirs(MODEL_NAME, exist_ok=True)

# Save the model as a JSON file
model.save_model(MODEL_NAME + '/xgboost_model.json')

# Save the fitted label encoders alongside the model
joblib.dump(clf, MODEL_NAME + '/label_encoders.pkl')

In [None]:
# Feature importance plot
fig, ax = plt.subplots(figsize=(10, 6))
xgb.plot_importance(model, ax=ax)
plt.title('Feature Importance')

os.makedirs(MODEL_NAME + "/images", exist_ok=True)
plt.savefig(MODEL_NAME + '/images/feature_importance.png')

plt.show()

## <span style="color:#ff5f27">📝 Model Registry</span>


The trained model has been saved to a local directory so that it can be reused later and put into production. We will also upload the model to the Hopsworks Model Registry.

In [None]:
# This step will upload the model to the Hopsworks Model Registry
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(features_df)
output_schema = Schema(labels_df)

model_schema = ModelSchema(
    input_schema=input_schema, 
    output_schema=output_schema,
)

mr = project.get_model_registry()

car_prices_xgboost_model = mr.python.create_model(
    MODEL_NAME,
    description="Car Price Predictor",
    input_example=X_train.sample(), 
    model_schema=model_schema,
    metrics={'test_loss': mse},
)

# Upload the contents of the local "car_prices_model_xgboost" directory to the Model Registry
car_prices_xgboost_model.save(MODEL_NAME)

## <span style="color:#ff5f27">🚀 Model Deployment</span>


### <span style="color:#ff5f27">📎 Predictor script for Python models</span>


In [None]:
%%writefile predict_example.py
import os
import numpy as np
import hsfs
import joblib
import xgboost as xgb

class Predict(object):

    def __init__(self):
        """ Initializes the serving state, reads a trained model"""        
        # Get feature store handle
        fs_conn = hsfs.connection()
        self.fs = fs_conn.get_feature_store()

        # Load the model from the JSON file
        self.model = xgb.XGBRegressor()
        self.model.load_model(os.environ["ARTIFACT_FILES_PATH"] + "/xgboost_model.json")
        print("Initialization Complete")

    def predict(self, inputs):
        """ Serves a prediction request usign a trained model"""
        return self.model.predict(inputs).tolist()


In [None]:
# Get the dataset API for the current project
dataset_api = project.get_dataset_api()

# Specify the local file path of the Python script to be uploaded
local_script_path = "predict_example.py"

# Upload the Python script to the "Models" dataset, overwriting it if it already exists
uploaded_file_path = dataset_api.upload(local_script_path, "Models", overwrite=True)

# Create the full path to the uploaded script for future reference
predictor_script_path = os.path.join("/Projects", project.name, uploaded_file_path)

### <span style="color:#ff5f27">⚙️ Create the deployment</span>


In [None]:
# Deploy the car price model
deployment = car_prices_xgboost_model.deploy(
    name="carpricexgboostmodeldeployment",  # Specify a name for the deployment
    script_file=predictor_script_path,  # Provide the path to the Python script for prediction
)

In [None]:
# Print the name of the deployment
print("Deployment: " + deployment.name)

# Display information about the deployment
deployment.describe()

In [None]:
print("Deployment is warming up...")
time.sleep(15)

#### The deployment is now registered. To start it, run the following cell:

In [None]:
# Start the deployment and wait for it to be in a running state for up to 300 seconds
deployment.start(await_running=300)

In [None]:
# Get the current state of the deployment
deployment.get_state().describe()

In [None]:
# To troubleshoot, you can use the `get_logs()` method
deployment.get_logs(component='predictor')

### <span style="color:#ff5f27">🔮 Inference</span>


In [None]:
# Use the deployed model to make predictions on the provided input example
prediction = deployment.predict(
    inputs=car_prices_xgboost_model.input_example,
)
prediction

#### Stop Deployment

To stop the deployment, run:

In [None]:
# Stop the deployment and wait for it to be in a stopped state for up to 180 seconds
deployment.stop(await_stopped=180)

Now we can proceed to the Inference Pipeline of the workshop demo example.

---