# CoffeeVolume Prediction Demo - Training Pipeline (2)

This is the second part of a short demonstration of how you can use Hopsworks to build a Machine Learning system that makes predictions. The hypothetical use case is that of a **barista** who wants to predict how much coffee will be consumed in their bar, based on past trends and behaviour.

![](https://blogstudio.s3.theshoppad.net/coffeeheroau/d4459a5d44905ff2cf3c245e7a931675.jpg)

In this second part of the Machine Learning system that we are creating, we are going to focus on the **Training Pipeline** required to make the predictions that we want.

## <span style="color:#ff5f27">📝 Code Library Imports and installations</span>

### Installing Supporting libraries
We need to install a few supporting libraries, including Kaleido, which Plotly uses to export images from this notebook:

In [1]:
!pip install -U kaleido # For Plotly Image export
!pip install scikit-learn
!pip install hopsworks==3.5.0rc1



### Other imports
We are importing a number of other libraries, as well as helper functions from `coffeevolume.py` and `functions.py`, which we will use later in this notebook.

In [2]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import os
import joblib
from coffeevolume import plot_prediction_test
from functions import predict_id
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>
As we have already installed Hopsworks on this system, all we now need to do is import the library into this notebook, and start establishing the connection to the Hopsworks feature store.

In [4]:
import hopsworks

# #To connect to Managed:
# import hsfs
# conn = hsfs.connection(
#     host="172f2800-9e76-11ee-ba4c-277d56d9f8e7.cloud.hopsworks.ai",                                # DNS of your Feature Store instance
#     project="RixCoffeevolumeDemo",                      # Name of your Hopsworks Feature Store project
#     hostname_verification=False,                     # Disable for self-signed certificates
#     api_key_value="Q0sPuOSFpsuwdIa0.pfBCpgAAnPr3C3J49BvEdeJvfoqTkwQihEotXupzz23FPzDdJpexwHmXyRB8ACDf"          # Feature store API key value 
# )
# fs = conn.get_feature_store()           # Get the project's default feature store

project = hopsworks.login(
    host="172f2800-9e76-11ee-ba4c-277d56d9f8e7.cloud.hopsworks.ai", 
    project="RixCoffeevolumeDemo",        
    api_key_value="Q0sPuOSFpsuwdIa0.pfBCpgAAnPr3C3J49BvEdeJvfoqTkwQihEotXupzz23FPzDdJpexwHmXyRB8ACDf" 
)
fs = project.get_feature_store()  

#To connect to Serverless:
#import hopsworks
# project = hopsworks.login()
#fs = project.get_feature_store() 

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://172f2800-9e76-11ee-ba4c-277d56d9f8e7.cloud.hopsworks.ai:443/p/4215
Connected. Call `.close()` to terminate connection gracefully.
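
Hard-coding the API key in a notebook makes it easy to leak. A minimal sketch of one alternative, assuming the connection details are exported as environment variables. The variable names `HOPSWORKS_HOST`, `HOPSWORKS_PROJECT` and `HOPSWORKS_API_KEY` and the helper `login_kwargs_from_env` are illustrative choices for this sketch, not something this notebook defines:

```python
import os


def login_kwargs_from_env(env=None):
    """Collect hopsworks.login() keyword arguments from environment variables.

    The environment variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    mapping = {
        "host": "HOPSWORKS_HOST",
        "project": "HOPSWORKS_PROJECT",
        "api_key_value": "HOPSWORKS_API_KEY",
    }
    # Only pass arguments that are actually set, so unset ones fall back to the defaults
    return {arg: env[name] for arg, name in mapping.items() if env.get(name)}


# Usage (requires a reachable Hopsworks instance):
# import hopsworks
# project = hopsworks.login(**login_kwargs_from_env())
# fs = project.get_feature_store()
```

With this approach the key never appears in the notebook itself, so the notebook can be shared without editing it first.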


Since we have already created the necessary feature groups in our previous notebook, we can now just import these into our training pipeline. Remember we had two feature groups:
1. a `coffeevolume` feature group that contained the synthetic data that we had created,
2. a `coffeevolume_averages` feature group that contained a number of _engineered_ features that we derived from the `coffeevolume` feature group in order to have better predictions later on.

We retrieve both feature groups:

In [5]:
# Retrieve feature groups
coffeevolume_fg = fs.get_feature_group(
    name='coffeevolume',
    version=1,
)
coffeevolume_averages_fg = fs.get_feature_group(
    name='coffeevolume_averages',
    version=1,
)

## <span style="color:#ff5f27">🔪 Preparing our Training pipeline: Feature Selection </span>
Using the Hopsworks feature store, we can very easily select the specific subset that we need for our training pipeline, and be confident that we will be using the correct training dataset in our Machine Learning process:

In [6]:
# Select features for training dataset
selected_features = coffeevolume_fg.select_all() \
    .join(coffeevolume_averages_fg.select_except(['date']))
selected_features.show(5)

Finished: Reading data from Hopsworks, using ArrowFlight (29.74s) 


Unnamed: 0,date,id,coffeevolume,ma_7,ma_14,ma_30,daily_rate_of_change,volatility_30_day,ema_02,ema_05,rsi
0,2023-09-01,0,200.0,0.0,0.0,0.0,0.0,0.0,200.0,200.0,0.0
1,2023-09-02,0,201.0,0.0,0.0,0.0,0.5,0.0,200.555556,200.666667,0.0
2,2023-09-03,0,193.1,0.0,0.0,0.0,-3.930348,0.0,197.5,196.342857,0.0
3,2023-09-04,0,206.7,0.0,0.0,0.0,7.042983,0.0,200.616531,201.866667,0.0
4,2023-09-05,0,204.7,0.0,0.0,0.0,-1.539202,0.0,204.237744,205.219355,0.0
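
The engineered columns are computed in the previous notebook, so their exact definitions are not shown here. As a rough sketch, assuming `ma_7` is a trailing simple moving average that stays at 0.0 until a full window is available, and `daily_rate_of_change` is the percentage change versus the previous day (the first two non-zero values in the preview, 0.5 and -3.930348, are consistent with that reading), the calculations could look like this. The function names are hypothetical, and a window of 3 is used only to keep the example short:

```python
def simple_moving_average(values, window):
    """Trailing mean over `window` points; 0.0 until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(0.0)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def daily_rate_of_change(values):
    """Percentage change versus the previous day; 0.0 for the first day."""
    out = [0.0] if values else []
    for prev, cur in zip(values, values[1:]):
        out.append((cur - prev) / prev * 100)
    return out


volumes = [200.0, 201.0, 193.1, 206.7, 204.7]
print(simple_moving_average(volumes, window=3))
print(daily_rate_of_change(volumes))
```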


## <span style="color:#ff5f27">🤖 Preparing our Training pipeline: Transformation Functions </span>
Using the Hopsworks feature store, we can easily attach transformation functions to individual features, so that they are applied consistently in our training pipeline:

In [7]:
# Load transformation function
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")

# Define a list of feature names
feature_names = [
    'ma_7', 'ma_14', 'ma_30', 'daily_rate_of_change', 'volatility_30_day', 'ema_02', 'ema_05', 'rsi'
]

# Map features to transformations
transformation_functions = {
    feature_name: min_max_scaler
    for feature_name in feature_names
}
transformation_functions

{'ma_7': <hsfs.transformation_function.TransformationFunction at 0x159556bd0>,
 'ma_14': <hsfs.transformation_function.TransformationFunction at 0x159556bd0>,
 'ma_30': <hsfs.transformation_function.TransformationFunction at 0x159556bd0>,
 'daily_rate_of_change': <hsfs.transformation_function.TransformationFunction at 0x159556bd0>,
 'volatility_30_day': <hsfs.transformation_function.TransformationFunction at 0x159556bd0>,
 'ema_02': <hsfs.transformation_function.TransformationFunction at 0x159556bd0>,
 'ema_05': <hsfs.transformation_function.TransformationFunction at 0x159556bd0>,
 'rsi': <hsfs.transformation_function.TransformationFunction at 0x159556bd0>}
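
To make the effect of the scaler concrete, here is a minimal sketch of min-max scaling in plain Python. It is an illustration of the technique, not the Hopsworks implementation: Hopsworks computes the statistics it needs itself, whereas the hypothetical `min_max_scale` helper below simply takes the minimum and maximum from the list it is given:

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to 0.0 to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_scale([200.0, 201.0, 193.1, 206.7, 204.7]))
```

After scaling, every feature lies in the range 0 to 1, which is why the transformed training features shown later no longer resemble the raw values.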

## <span style="color:#ff5f27">⚙️ Feature View Creation leveraging feature groups, feature selection and feature transformations </span>
Using the feature groups, the selection that we made above and finally the associated transformation functions, we can create a **feature view** in the Hopsworks feature store:

In [8]:
# Get or create the 'coffeevolume_fv' feature view
feature_view = fs.get_or_create_feature_view(
    name='coffeevolume_fv',
    version=1,
    query=selected_features,
    labels=["coffeevolume"],
    transformation_functions=transformation_functions,
)


# <span style="color:#ff5f27">🏋️ Training Dataset Creation based on the Feature View</span>
The next part of our process will focus on the creation of a Training dataset based on the feature view that we have prepared above.

In [9]:
# Get training and testing sets
X_train, X_test, y_train, y_test = feature_view.train_test_split(
    description='Coffee Volume Dataset',  # Provide a description for the dataset split
    train_start='2023-09-01',      # Start date for the training set
    train_end='2023-11-01',        # End date for the training set
    test_start='2023-11-01',       # Start date for the testing set
    test_end=datetime.today().strftime("%Y-%m-%d"),  # End date for the testing set (current date)
)

Finished: Reading data from Hopsworks, using ArrowFlight (27.13s) 
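
Splitting by date rather than at random keeps the test data later in time than the training data, which matches how the model will be used. The idea can be sketched in plain Python as follows. The `time_split` helper, the sample rows and the half-open `[start, end)` boundary convention are assumptions of this sketch; how Hopsworks treats a row that falls exactly on a boundary date is not shown in this notebook:

```python
from datetime import date


def time_split(rows, train_start, train_end, test_start, test_end):
    """Split (date, value) rows into train and test sets by date range.

    Both ranges are treated as half-open [start, end) in this sketch.
    """
    train = [r for r in rows if train_start <= r[0] < train_end]
    test = [r for r in rows if test_start <= r[0] < test_end]
    return train, test


# Made-up sample rows around the split date
rows = [
    (date(2023, 10, 30), 198.2),
    (date(2023, 10, 31), 202.4),
    (date(2023, 11, 1), 205.0),
    (date(2023, 11, 2), 199.7),
]
train, test = time_split(
    rows,
    train_start=date(2023, 9, 1),
    train_end=date(2023, 11, 1),
    test_start=date(2023, 11, 1),
    test_end=date(2023, 11, 3),
)
print(len(train), len(test))
```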




In [10]:
X_train.head(3)

Unnamed: 0,date,id,ma_7,ma_14,ma_30,daily_rate_of_change,volatility_30_day,ema_02,ema_05,rsi
0,2023-09-01,0,0.0,0.0,0.0,0.449107,0.0,0.207295,0.184405,0.361737
1,2023-09-02,0,0.0,0.0,0.0,0.467924,0.0,0.221873,0.199966,0.361737
2,2023-09-03,0,0.0,0.0,0.0,0.30119,0.0,0.141695,0.099038,0.361737


In [11]:
y_train.head(3)

Unnamed: 0,coffeevolume
0,200.0
1,201.0
2,193.1


In [12]:
# Sort the training features by the 'date' column
X_train = X_train.sort_values("date")

# Reindex the target 'y_train' to match the sorted order of 'X_train'
y_train = y_train.reindex(X_train.index)

# Sort the testing features by the 'date' column
X_test = X_test.sort_values("date")

# Reindex the target 'y_test' to match the sorted order of 'X_test'
y_test = y_test.reindex(X_test.index)

# Extract and store the 'date' column as a separate DataFrame for both training and testing sets
train_date = pd.DataFrame(X_train.pop("date"))
test_date = pd.DataFrame(X_test.pop("date"))
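
The important point in the cell above is that features and labels must be reordered with the same permutation, otherwise each row would be paired with the wrong target. A minimal sketch of that idea without pandas, using a hypothetical `sort_with_labels` helper and made-up values:

```python
def sort_with_labels(dates, features, labels):
    """Sort rows by date and apply the same permutation to features and labels."""
    order = sorted(range(len(dates)), key=lambda i: dates[i])
    return (
        [dates[i] for i in order],
        [features[i] for i in order],
        [labels[i] for i in order],
    )


# ISO-formatted date strings sort chronologically
dates = ["2023-09-03", "2023-09-01", "2023-09-02"]
features = [[0.30], [0.45], [0.47]]
labels = [193.1, 200.0, 201.0]

sorted_dates, sorted_features, sorted_labels = sort_with_labels(dates, features, labels)
print(sorted_dates)
print(sorted_labels)
```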

# <span style="color:#ff5f27">🧬 Using the training dataset to create a Machine Learning Model</span>

In order to now create a predictive model out of our data, we will use the **XGBoost Regressor** to examine the training dataset and figure out a predictive model. XGBoost regressor is a powerful and highly effective machine learning algorithm for regression problems. XGBoost is known for its ability to handle complex relationships in the data, handle missing values, and provide accurate predictions. It's a popular choice in the data science community due to its robustness and excellent predictive performance, making it well-suited for our specific problem.

In [13]:
# Initialize the XGBoost regressor
model = xgb.XGBRegressor()

# Train the model on the training data
model.fit(X_train, y_train)

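# Make predictions on the test set
y_test_pred = model.predict(X_test)

# Calculate the RMSE on the test set; taking the square root of the MSE works across scikit-learn versions
# (the variable keeps the name `mse` because later cells reference it)
mse = np.sqrt(mean_squared_error(y_test, y_test_pred))
print(f"Root Mean Squared Error (RMSE): {mse}")

Root Mean Squared Error (RMSE): 1.6051051026030378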
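
Note that the value reported above is the square root of the mean squared error, so it is expressed in the same unit as the coffee volume itself. The difference between the two metrics can be sketched in a few lines of plain Python. The helper names and the sample numbers are made up for illustration:

```python
import math


def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error_manual(y_true, y_pred):
    """Square root of the MSE, in the same unit as the target."""
    return math.sqrt(mean_squared_error_manual(y_true, y_pred))


y_true = [200.0, 201.0, 193.0]
y_pred = [201.0, 199.0, 195.0]
print(mean_squared_error_manual(y_true, y_pred))       # errors -1, 2, -2 give an MSE of 3.0
print(root_mean_squared_error_manual(y_true, y_pred))  # square root of 3.0
```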


In [14]:
# Make predictions for a specific ID (ID=1) using the 'predict_id' function
prediction_for_id = predict_id(
    1, 
    X_test, 
    model,
)

# Generate a Plotly figure for visualizing the predictions
fig = plot_prediction_test(
    1, 
    X_train, 
    X_test, 
    y_train, 
    y_test, 
    train_date, 
    test_date, 
    prediction_for_id,
)

# Display the generated Plotly figure
fig.show()

## <span style="color:#ff5f27">⚙️ Creating the Model Schema </span>
Now that we know what the input training features and the target/output variables will look like, we can create a schema for this and save this to a dictionary.

In [15]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

# Create an input schema using the training features
input_schema = Schema(X_train.values)

# Create an output schema using the target variable
output_schema = Schema(y_train)

# Create a model schema using the input and output schemas
model_schema = ModelSchema(
    input_schema=input_schema, 
    output_schema=output_schema,
)

# Convert the model schema to a dictionary
model_schema.to_dict()

{'input_schema': {'tensor_schema': {'shape': '(263652, 9)',
   'type': 'float64'}},
 'output_schema': {'columnar_schema': [{'name': 'coffeevolume',
    'type': 'float64'}]}}

## <span style="color:#ff5f27">📝 Registering the model in the Hopsworks registry</span>
We will also save and persist the model in Hopsworks, in case future decisions would need to be retraced to a specific version of the prediction system:

In [16]:
# Specify the directory for saving the model
model_dir = "coffeevolume_model"

# Create the directory if it does not exist yet
os.makedirs(model_dir, exist_ok=True)

# Save the trained XGBoost model using joblib
joblib.dump(model, f'{model_dir}/xgboost_coffeevolume_model.pkl')

# Write the generated Plotly figure image to the specified directory
fig.write_image(f'{model_dir}/model_prediction.png')


setDaemon() is deprecated, set the daemon attribute instead



In [17]:
# Get the model registry from the project
mr = project.get_model_registry()

# Create a Python model in the model registry named 'xgboost_coffeevolume_model'
coffeevolume_model = mr.python.create_model(
    name="xgboost_coffeevolume_model", 
    metrics={"RMSE": mse},          # The value stored in `mse` is the root mean squared error
    model_schema=model_schema,      # Provide the model schema
    input_example=X_train.sample(), # Provide an example of the input data
    description="Coffeevolume Predictor",  # Add a description for the model
)

# Upload the contents of the model directory to the model registry
coffeevolume_model.save(model_dir)

Connected. Call `.close()` to terminate connection gracefully.


Model created, explore it at https://172f2800-9e76-11ee-ba4c-277d56d9f8e7.cloud.hopsworks.ai:443/p/4215/models/xgboost_coffeevolume_model/2


Model(name: 'xgboost_coffeevolume_model', version: 2)

# <span style="color:#ff5f27">🚀 Coffeevolume Consumption Prediction Model Deployment</span>

**About Model Serving**

Models can be served via KFServing or "default" serving, which means a Docker container exposing a Flask server. For KFServing models, or models written in Tensorflow, we do not need to write a prediction file. However, for sklearn models using default serving, you do need to proceed to write a prediction file.

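We do not run the deployment as part of this demo, and wrap up the second part of our prediction system with the training example above. The remaining cells, which have not been executed, outline what a predictor script and a deployment would look like.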

----

## <span style="color:#ff5f27">📎 Predictor script for Python models</span>

Scikit-learn and XGBoost models are deployed as Python models, in which case you need to provide a Predict class that implements the predict method. The `predict()` method invokes the model on the inputs and returns the prediction as a list.

The `__init__()` method runs when the predictor is loaded into memory. It loads the model from the local directory it is materialized to, which is given by the `ARTIFACT_FILES_PATH` environment variable.

The directive **"%%writefile"** writes the contents of the cell below to the given Python file. We will use the **predict_example.py** file to create a deployment for our model.

In [None]:
%%writefile predict_example.py
import os
import numpy as np
import pandas as pd
import hsfs
import joblib


class Predict(object):

    def __init__(self):
        """ Initializes the serving state, reads a trained model"""        
        # get feature store handle
        fs_conn = hsfs.connection()
        self.fs = fs_conn.get_feature_store()
        
        # get feature view
        self.fv = self.fs.get_feature_view("coffeevolume_fv", 1)
        
        # initialize serving
        self.fv.init_serving(1)

        # load the trained model
        self.model = joblib.load(os.environ["ARTIFACT_FILES_PATH"] + "/xgboost_coffeevolume_model.pkl")
        print("Initialization of Coffeevolume model Complete")

    
    def predict(self, id_value):
        """ Serves a Coffeevolume prediction request using a trained model"""
        # Retrieve feature vectors
        feature_vector = self.fv.get_feature_vector(
            entry = {'id': id_value[0]}
        )
        return self.model.predict(np.asarray(feature_vector[1:]).reshape(1, -1)).tolist()

This script needs to be put into a known location in the Hopsworks file system. Let's call the file predict_example.py and put it in the Models directory.

In [None]:
# Get the dataset API from the project
dataset_api = project.get_dataset_api()

# Upload the file "predict_example.py" to the "Models" dataset, overwriting if it already exists
uploaded_file_path = dataset_api.upload("predict_example.py", "Models", overwrite=True)

# Create the full path to the uploaded predictor script
predictor_script_path = os.path.join("/Projects", project.name, uploaded_file_path)
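
Before deploying, the shape of the predictor can be checked locally without a feature store or a trained model. The sketch below mirrors the `predict()` logic of the script, but with stand-ins injected: `StubFeatureView`, `StubModel` and `LocalPredict` are hypothetical names, and the feature values are made up. It only verifies the plumbing (look up the vector by id, drop the first element, pass the rest to the model), not the quality of any prediction:

```python
class StubFeatureView:
    """Stand-in for the feature view: returns a fixed vector per id (made-up data)."""

    def __init__(self, vectors):
        self._vectors = vectors

    def get_feature_vector(self, entry):
        return self._vectors[entry["id"]]


class StubModel:
    """Stand-in for the trained regressor: 'predicts' the sum of the features."""

    def predict(self, rows):
        return [sum(row) for row in rows]


class LocalPredict:
    """Same shape as the Predict class, with its dependencies injected."""

    def __init__(self, feature_view, model):
        self.fv = feature_view
        self.model = model

    def predict(self, id_value):
        feature_vector = self.fv.get_feature_vector(entry={"id": id_value[0]})
        # Drop the first element, as the deployed script does, before calling the model
        return list(self.model.predict([feature_vector[1:]]))


predictor = LocalPredict(StubFeatureView({100: [100, 0.5, 0.25]}), StubModel())
print(predictor.predict([100]))
```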

---

## <span style="color:#ff5f27">🚀 Create the deployment</span>

Here, you fetch the model you want from the model registry and define a configuration for the deployment. For the configuration, you need to specify the serving type (default or KFserving).

In [None]:
# Deploy the 'coffeevolume_model'
deployment = coffeevolume_model.deploy(
    name="coffeevolumeonlinemodeldeployment",  # Specify the deployment name
    script_file=predictor_script_path,  # Provide the path to the predictor script
)

In [None]:
# Start the deployment and wait for it up to 180 seconds
deployment.start(await_running=180)

In [None]:
# Get the current state of the deployment and describe it
deployment_state = deployment.get_state().describe()

In [None]:
# Predict the coffee volume for a single ID (here, ID 100)
deployment.predict({'instances': [100]})

---