# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 02: Model training</span>

<span style="font-width:bold; font-size: 1.4rem;"> This notebook explains how to read from a feature group and create a training dataset within the feature store. You will then train a model on that training dataset using TensorFlow, although it could just as well be trained with other machine learning frameworks such as Scikit-learn, Keras, and PyTorch. You will also see some of the exploration that can be done in Hopsworks, notably the search functions and the lineage.</span>

## **🗒️ This notebook is divided into the following steps:** 

1. **Feature Selection**: Select the features you want to train your model on.
2. **Feature Transformation**: Define how the features should be preprocessed.
3. **Training Dataset Creation**: Create a dataset for training the anomaly detection model.
4. **Model Training**: Train your anomaly detection model.
5. **Model Registry**: Register the model in the Hopsworks model registry.
6. **Model Deployment**: Deploy the model for real-time inference.

![tutorial-flow](../../images/02_training-dataset.png) 

## <span style='color:#ff5f27'> 📝 Imports </span>

In [1]:
import ast
import numpy as np
import pandas as pd
import tensorflow as tf
import os

from anomaly_detection import GanEncAnomalyDetector

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [2]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Multiple projects found. 

	 (1) rixdemo
	 (2) BeerVolumePrediction

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/189590
Connected. Call `.close()` to terminate connection gracefully.


## <span style="color:#ff5f27;"> 🔪 Feature Selection </span>

You start by selecting all the features you want to include for model training/inference.

In [3]:
# Retrieve Feature Groups
transactions_monthly_fg = fs.get_feature_group(
    name="transactions_monthly", 
    version=1,
)

graph_embeddings_fg = fs.get_feature_group(
    name="graph_embeddings",
    version=1,
) 

party_fg = fs.get_feature_group(
    name="party_labels", 
    version=1,
)

In [4]:
# Select features for training data
selected_features = party_fg.select(["type", "is_sar"]).join(
    transactions_monthly_fg.select(
        [
            "monthly_in_count", 
            "monthly_in_total_amount", 
            "monthly_in_mean_amount", 
            "monthly_in_std_amount", 
            "monthly_out_count", 
            "monthly_out_total_amount", 
            "monthly_out_mean_amount", 
            "monthly_out_std_amount",
        ]
    )).join(
        graph_embeddings_fg.select(["party_graph_embedding"]),
    )


In [None]:
# Uncomment this if you would like to view your selected features
# selected_features.show(5)
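
Conceptually, the query above resembles joining three tables on a shared key. The sketch below mimics that with pandas on made-up data. It assumes the feature groups share a primary key called `id` (the key used later for online lookups) and uses an inner join purely for illustration; the join Hopsworks performs may differ.

```python
import pandas as pd

# Toy stand-ins for the three feature groups (all values are made up).
party = pd.DataFrame({"id": ["a", "b", "c"], "type": [0, 1, 0], "is_sar": [0, 0, 1]})
monthly = pd.DataFrame({"id": ["a", "b", "c"], "monthly_in_count": [2, 0, 1]})
embeddings = pd.DataFrame(
    {"id": ["a", "b"], "party_graph_embedding": [[0.1, 0.2], [0.3, 0.4]]}
)

# Inner joins on the shared key: rows without a match in every table are dropped.
selected = party.merge(monthly, on="id").merge(embeddings, on="id")

print(selected.columns.tolist())
print(len(selected))
```

Party `c` has no embedding in this toy data, so it does not appear in the joined result.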

## <span style="color:#ff5f27;"> 🤖 Transformation Functions </span>

Transformation functions are mathematical mappings of input data. They may be stateful, meaning they require statistics from the parent feature view (such as the number of instances of a category, or the mean value of a numerical feature).

In this notebook you will preprocess the numerical features using *min-max scaling*; categorical features could be handled the same way with *label encoding*. To do this, you simply define a mapping between features and transformation functions. Hopsworks then fits stateful transformations such as *min-max scaling* only on the training data (and not on the validation/test data), which prevents data leakage.

In [5]:
# Load built in transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")

# Map features to transformations.
transformation_functions = {
    "monthly_in_count": min_max_scaler,
    "monthly_in_total_amount": min_max_scaler,
    "monthly_in_mean_amount": min_max_scaler,
    "monthly_in_std_amount": min_max_scaler,
    "monthly_out_count": min_max_scaler,
    "monthly_out_total_amount": min_max_scaler,
    "monthly_out_mean_amount": min_max_scaler,
    "monthly_out_std_amount": min_max_scaler,
}
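
To make the "fit on training data only" idea concrete, here is a minimal NumPy sketch of min-max scaling. The helper names `fit_min_max` and `apply_min_max` are hypothetical and this is not the Hopsworks implementation; it only illustrates that the statistics come from the training split and are then reused for other data.

```python
import numpy as np

def fit_min_max(train_values):
    """Return the (min, max) statistics computed on the training split only."""
    train_values = np.asarray(train_values, dtype=float)
    return train_values.min(), train_values.max()

def apply_min_max(values, stats):
    """Scale values using previously fitted (min, max) statistics."""
    lo, hi = stats
    values = np.asarray(values, dtype=float)
    if hi == lo:  # constant feature: avoid division by zero
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

train = [10.0, 20.0, 30.0]
test = [15.0, 40.0]

stats = fit_min_max(train)          # statistics come from the training data
scaled_train = apply_min_max(train, stats)
scaled_test = apply_min_max(test, stats)  # reuses the training statistics

print(scaled_train)
print(scaled_test)
```

Note that unseen values can fall outside the `[0, 1]` range (here `40.0` maps to `1.5`), which is expected when the statistics are not refitted on test data.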

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

In Hopsworks, you write features to feature groups (where the features are stored) and you read features from feature views. A feature view is a logical view over features stored in feature groups, and it typically contains the features used by a specific model. This way, feature views enable features stored in different feature groups to be reused across many different models. A feature view defines its schema as a query (optionally with filters), the target feature/label of the model, and any additional transformation functions.

To create a feature view, use `fs.create_feature_view()`.

In [6]:
# Create the 'aml_feature_view' feature view
feature_view = fs.create_feature_view(
    name='aml_feature_view',
    query=selected_features,
    labels=["is_sar"],
    transformation_functions=transformation_functions,
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/189590/fs/189509/fv/aml_feature_view/version/1


## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query where the projection (set of features) is determined by the parent feature view, with an optional snapshot on disk of the data returned by the query.

**A training dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

A training dataset is created using the `feature_view.training_data()` method.

**With the feature view APIs you can also create training datasets based on event-time filters by specifying `start_time` and `end_time`.**
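
As a rough illustration of what an event-time filter does, the sketch below splits a toy DataFrame by timestamp with pandas. The helper `filter_by_event_time`, the `event_time` column, and the inclusive bounds are all assumptions made for this example; it is not the Hopsworks API and does not claim to match its boundary semantics.

```python
import pandas as pd

def filter_by_event_time(df, time_col, start_time=None, end_time=None):
    """Keep rows whose event time lies between start_time and end_time (inclusive)."""
    mask = pd.Series(True, index=df.index)
    if start_time is not None:
        mask &= df[time_col] >= pd.Timestamp(start_time)
    if end_time is not None:
        mask &= df[time_col] <= pd.Timestamp(end_time)
    return df[mask]

# Toy data with a hypothetical event-time column.
events = pd.DataFrame({
    "event_time": pd.to_datetime(
        ["2023-01-15", "2023-02-15", "2023-03-15", "2023-04-15"]
    ),
    "monthly_in_count": [3, 5, 2, 7],
})

# Older rows form the training set, newer rows the test set.
train_df = filter_by_event_time(events, "event_time", end_time="2023-02-28")
test_df = filter_by_event_time(events, "event_time", start_time="2023-03-01")

print(len(train_df), len(test_df))
```

Splitting by time rather than at random keeps future observations out of the training set, which matters when features are computed over time windows.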


In [7]:
# Get training data
X_train, y_train = feature_view.training_data(
    description='AML training dataset'
)

Finished: Reading data from Hopsworks, using ArrowFlight (40.90s) 




In [8]:
# Displaying the first three rows of the training data
X_train.head(3)

| | type | monthly_in_count | monthly_in_total_amount | monthly_in_mean_amount | monthly_in_std_amount | monthly_out_count | monthly_out_total_amount | monthly_out_mean_amount | monthly_out_std_amount | party_graph_embedding |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.2 | 0.113415 | 0.203202 | 0.159068 | 0.066667 | 0.033702 | 0.268579 | 0.0 | [0.8572240471839905, 0.6891205310821533, 0.035... |
| 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.333333 | 0.116609 | 0.185859 | 0.187547 | [0.2513226568698883, 0.21294060349464417, -0.0... |
| 2 | 0 | 0.1 | 0.027209 | 0.097499 | 0.139264 | 0.0 | 0.0 | 0.0 | 0.0 | [-0.030080951750278473, 0.013454589061439037, ... |


In [9]:
# Displaying the first three rows of the target data
y_train.head(3)

| | is_sar |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |


In [10]:
X_tmp = X_train
y_tmp = y_train

## <span style="color:#ff5f27;">🤖 Model Building</span>


In [None]:
# Uncomment if the 'party_graph_embedding' column holds string representations of lists
# and needs to be converted into actual Python objects
# X_train['party_graph_embedding'] = X_train['party_graph_embedding'].apply(ast.literal_eval)

In [11]:
# Convert each element in the 'party_graph_embedding' column to a NumPy array
X_train['party_graph_embedding'] = X_train['party_graph_embedding'].apply(np.array)

In [12]:
# Merge the original DataFrame with a DataFrame of exploded embeddings
X_train = X_train.merge(
    pd.DataFrame(X_train['party_graph_embedding'].to_list()).add_prefix('emb_'), 
    left_index=True,
    right_index=True,
).drop('party_graph_embedding', axis=1)

# Display the first three rows of the modified DataFrame
X_train.head(3)

| | type | monthly_in_count | monthly_in_total_amount | monthly_in_mean_amount | monthly_in_std_amount | monthly_out_count | monthly_out_total_amount | monthly_out_mean_amount | monthly_out_std_amount | emb_0 | ... | emb_6 | emb_7 | emb_8 | emb_9 | emb_10 | emb_11 | emb_12 | emb_13 | emb_14 | emb_15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.2 | 0.113415 | 0.203202 | 0.159068 | 0.066667 | 0.033702 | 0.268579 | 0.0 | 0.857224 | ... | 0.39559 | -0.16993 | 0.383786 | -0.118911 | -0.068757 | -0.034441 | -0.33836 | 0.388147 | 0.131358 | -0.124771 |
| 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.333333 | 0.116609 | 0.185859 | 0.187547 | 0.251323 | ... | 0.124021 | -0.043839 | 0.169117 | -0.043344 | -0.03889 | -0.034858 | -0.114783 | 0.173731 | 0.037775 | -0.025678 |
| 2 | 0 | 0.1 | 0.027209 | 0.097499 | 0.139264 | 0.0 | 0.0 | 0.0 | 0.0 | -0.030081 | ... | -0.008507 | -0.023562 | 0.025372 | 0.004972 | 0.007744 | -0.004733 | 0.004641 | 0.006481 | 0.002612 | 0.003756 |
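
The index-based merge used for this step relies on `X_train` having a default `0..n-1` index, because the DataFrame built from the embedding lists gets a fresh one. The sketch below shows a variant on toy data that passes the original index explicitly, so rows stay aligned even when the index is not the default. The data and the 3-dimensional embeddings are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"type": [0, 1], "party_graph_embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]},
    index=[10, 20],  # deliberately not 0..n-1
)

# Build the expanded columns on the SAME index so the join aligns row by row.
expanded = pd.DataFrame(
    toy["party_graph_embedding"].to_list(),
    index=toy.index,
).add_prefix("emb_")

# Replace the list-valued column with one column per embedding dimension.
result = toy.drop(columns="party_graph_embedding").join(expanded)

print(result.columns.tolist())
print(result.shape)
```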


You are going to train a [GAN for anomaly detection](https://arxiv.org/pdf/1905.11034.pdf). During the training step you will provide only features of accounts that have never been reported for suspicious activity. You will disclose previously reported accounts to the model only in the evaluation step.

In [13]:
# Filter non-suspicious transactions from X_train based on y_train values equal to 0
non_sar_transactions = X_train[y_train.values == 0]

# Drop any rows with missing values from the non-suspicious transactions DataFrame
non_sar_transactions = non_sar_transactions.dropna()
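
The same filtering can be sketched on toy data as follows. This variant builds a 1-D boolean mask from the label column, which is an assumption made here for readability rather than the exact expression used above; the values are made up.

```python
import pandas as pd

features = pd.DataFrame({"monthly_in_count": [0.2, 0.0, None, 0.4]})
labels = pd.DataFrame({"is_sar": [0, 1, 0, 0]})

# Flatten the single-column label frame into a 1-D boolean mask.
mask = labels["is_sar"].to_numpy() == 0

# Keep unreported rows only, then drop rows with missing feature values.
normal_only = features[mask].dropna()

print(normal_only.index.tolist())
```

Row 1 is removed because it is labelled as reported, and row 2 because it contains a missing value.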

Now let's define a TensorFlow Dataset, since you are going to train a Keras model with TensorFlow.

In [14]:
def windowed_dataset(dataset, window_size, batch_size):
    # Create a windowed dataset using the specified window_size and shift of 1
    # Drop any remaining elements that do not fit in complete windows
    ds = dataset.window(window_size, shift=1, drop_remainder=True)

    # Flatten the nested datasets into a single dataset of windows
    ds = ds.flat_map(lambda x: x.batch(window_size))

    # Batch the windows into batches of the specified batch_size
    # Use drop_remainder=True to ensure that all batches have the same size
    # Prefetch one batch to improve performance
    return ds.batch(batch_size, drop_remainder=True).prefetch(1)

In [15]:
# Convert non_sar_transactions to a TensorFlow dataset, casting the values to float32
training_dataset = tf.data.Dataset.from_tensor_slices(
    tf.cast(non_sar_transactions.astype('float32'), tf.float32)
)

# Use the windowed_dataset function to create a windowed dataset
# Parameters: window_size=2 (sequence length), batch_size=16 (number of sequences in each batch)
training_dataset = windowed_dataset(
    training_dataset, 
    window_size=2, 
    batch_size=16,
)

training_dataset


<_PrefetchDataset element_spec=TensorSpec(shape=(16, None, 25), dtype=tf.float32, name=None)>
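
To see what the windowing produces without running TensorFlow, here is a NumPy analogue. The function `windowed_batches` is a hypothetical helper written for this illustration; it mirrors the `window_size`, shift of 1, and `drop_remainder=True` behaviour described above, but materialises everything in memory instead of streaming.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def windowed_batches(rows, window_size, batch_size):
    """Overlapping windows with shift 1, grouped into full batches only."""
    # Shape (n - window_size + 1, n_features, window_size): window axis comes last.
    windows = sliding_window_view(rows, window_shape=window_size, axis=0)
    # Reorder to (n_windows, window_size, n_features).
    windows = windows.transpose(0, 2, 1)
    # Drop the trailing windows that do not fill a complete batch.
    n_batches = len(windows) // batch_size
    windows = windows[: n_batches * batch_size]
    return windows.reshape(n_batches, batch_size, window_size, rows.shape[1])

# 40 toy rows with 25 features each.
rows = np.arange(40 * 25, dtype=np.float32).reshape(40, 25)
batches = windowed_batches(rows, window_size=2, batch_size=16)

print(batches.shape)
```

With 40 rows there are 39 windows of length 2, which fill two complete batches of 16; the remaining 7 windows are dropped. Each batch has shape `(16, 2, 25)`, matching the element spec shown above apart from the statically unknown window dimension.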

## <span style="color:#ff5f27;"> 🏃 Train Model</span>

Next we'll train the model on the windowed dataset of unreported accounts built above. Since only the negative class is used for training, no class weights are set.

## <span style="color:#ff5f27;">🧬 Model architecture</span>

Key components:

- `Encoder`(encoder_model) takes input data and compresses it into a latent representation. The encoder consists of two Convolutional 1D layers with Batch Normalization and Leaky ReLU activation functions.

- `Generator`(generator_model) takes a latent vector and generates synthetic data. The generator consists of two Convolutional 1D layers with Batch Normalization and Leaky ReLU activation functions. The last layer produces data with the same shape as the input data.

- `Discriminator`(discriminator_model) distinguishes between real and generated (fake) data. It comprises two Convolutional 1D layers with Batch Normalization and Leaky ReLU activation functions, followed by a fully connected layer. The output is a single value representing the probability that the input is real.

![tutorial-flow](images/model_architecture.png)

In [16]:
# Create an instance of the GanEncAnomalyDetector model with input dimensions [2, n_features]
model = GanEncAnomalyDetector([2, training_dataset.element_spec.shape[-1]])

# Compile the model
model.compile()



In [17]:
# Iterate through each layer in the model
for layer in model.layers:
    # Print the name and output shape of each layer
    print(layer.name, layer.output_shape)

encoder_model (None, 1, 25)
generator_model (None, 2, 25)
discriminator_model (None, 1)


In [18]:
# Train the model using the training_dataset
# Set the number of epochs to 2 and suppress verbose output during training
history = model.fit(
    training_dataset,  # Training dataset used for model training
    epochs=2,          # Number of training epochs
    verbose=0,         # Verbosity mode (0: silent, 1: progress bar, 2: one line per epoch)
)


In [19]:
# Create a dictionary to store metrics
# The key is 'loss'; the value is the generator loss recorded for the first epoch (index 0)
metrics = {
    'loss': history.history["g_loss"][0],
}

### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/3.0/user_guides/mlops/registry/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [20]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

# Define the input schema using the values of X_train
input_schema = Schema(X_train)

# Define the output schema using y_train
output_schema = Schema(y_train)

# Create a ModelSchema object specifying the input and output schemas
model_schema = ModelSchema(
    input_schema=input_schema, 
    output_schema=output_schema,
)

# Convert the model schema to a dictionary for further inspection or serialization
model_schema.to_dict()

{'input_schema': {'columnar_schema': [{'name': 'type', 'type': 'int64'},
   {'name': 'monthly_in_count', 'type': 'float64'},
   {'name': 'monthly_in_total_amount', 'type': 'float64'},
   {'name': 'monthly_in_mean_amount', 'type': 'float64'},
   {'name': 'monthly_in_std_amount', 'type': 'float64'},
   {'name': 'monthly_out_count', 'type': 'float64'},
   {'name': 'monthly_out_total_amount', 'type': 'float64'},
   {'name': 'monthly_out_mean_amount', 'type': 'float64'},
   {'name': 'monthly_out_std_amount', 'type': 'float64'},
   {'name': 'emb_0', 'type': 'float64'},
   {'name': 'emb_1', 'type': 'float64'},
   {'name': 'emb_2', 'type': 'float64'},
   {'name': 'emb_3', 'type': 'float64'},
   {'name': 'emb_4', 'type': 'float64'},
   {'name': 'emb_5', 'type': 'float64'},
   {'name': 'emb_6', 'type': 'float64'},
   {'name': 'emb_7', 'type': 'float64'},
   {'name': 'emb_8', 'type': 'float64'},
   {'name': 'emb_9', 'type': 'float64'},
   {'name': 'emb_10', 'type': 'float64'},
   {'name': 'emb_11
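
The columnar part of the schema is essentially a list of column names and types. The sketch below derives a similar structure from a toy DataFrame with plain pandas; `columnar_schema` is a hypothetical helper for illustration and is not how `hsml` builds its schema internally.

```python
import pandas as pd

def columnar_schema(df):
    """List each column's name and dtype as a dictionary."""
    return [
        {"name": str(name), "type": str(dtype)}
        for name, dtype in df.dtypes.items()
    ]

# Toy frame with a few of the columns used in this notebook (values are made up).
toy = pd.DataFrame(
    {"type": [0, 1], "monthly_in_count": [0.2, 0.0], "emb_0": [0.86, 0.25]}
)

print(columnar_schema(toy))
```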

## <span style="color:#ff5f27;">📝 Register model</span>

One of the features in Hopsworks is the model registry. This is where we can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [21]:
# Set the path for exporting the trained model
export_path = "aml_model"
print('Exporting trained model to: {}'.format(export_path))

# Get the concrete function for serving the model
call = model.serve_function.get_concrete_function(tf.TensorSpec([None, None, None], tf.float32))

# Save the model to the specified export path with the serving signature
tf.saved_model.save(
    model, 
    export_path, 
    signatures=call,
)

# Access the model registry in your project

mr = project.get_model_registry()

# Create a TensorFlow model in the model registry with specified metadata
mr_model = mr.tensorflow.create_model(
    name="aml_model",                                    # Specify the model name
    metrics=metrics,                                     # Include model metrics
    model_schema=model_schema,                           # Include model schema
    description="Adversarial anomaly detection model.",  # Model description
    input_example=["70408aef"],                          # Input example
)

# Save the registered model to the model registry
mr_model.save(export_path)

Exporting trained model to: aml_model
2024-03-01 09:16:37,567 INFO: Function `serve_function` contains input name(s) resource with unsupported characters which will be renamed to generator_model_batch_normalization_4_batchnorm_readvariableop_2_resource in the SavedModel.
INFO:tensorflow:Assets written to: aml_model/assets
2024-03-01 09:16:40,400 INFO: Assets written to: aml_model/assets
Connected. Call `.close()` to terminate connection gracefully.


  0%|          | 0/6 [00:00<?, ?it/s]

Uploading: 0.000%|          | 0/57 elapsed<00:00 remaining<?

Uploading: 0.000%|          | 0/2381779 elapsed<00:00 remaining<?

Uploading: 0.000%|          | 0/233601 elapsed<00:00 remaining<?

Uploading: 0.000%|          | 0/5498 elapsed<00:00 remaining<?

Uploading: 0.000%|          | 0/12 elapsed<00:00 remaining<?

Uploading: 0.000%|          | 0/2011 elapsed<00:00 remaining<?

Model created, explore it at https://c.app.hopsworks.ai:443/p/189590/models/aml_model/1


Model(name: 'aml_model', version: 1)

## <span style="color:#ff5f27;"> 🚀 Model Deployment</span>


In [22]:
%%writefile aml_model_transformer.py

import os
import hsfs
import numpy as np
import tensorflow as tf

class Transformer(object):
    
    def __init__(self):        
        # Get feature store handle
        fs_conn = hsfs.connection()
        self.fs = fs_conn.get_feature_store()
        
        # Get feature views
        self.fv = self.fs.get_feature_view(
            name="aml_feature_view", 
            version=1,
        )
        
        # Initialise serving
        self.fv.init_serving(1)
    
    def preprocess(self, inputs):
        # Retrieve feature vector using the feature vector provider
        feature_vector = self.fv.get_feature_vector({"id": inputs["inputs"][0]})

        # Explode embeddings (flatten the list of embeddings)
        feature_vector_exploded_emb = [*self.flat2gen(feature_vector)]

        # Reshape feature vector to match the model's input shape
        feature_vector_reshaped = np.array(feature_vector_exploded_emb).reshape(1, 25)

        # Convert the feature vector to a TensorFlow constant
        input_vector = tf.constant(feature_vector_reshaped, dtype=tf.float32)

        # Add a time dimension (axis=1) to the input vector
        input_vector = tf.expand_dims(input_vector, axis=1)

        # Duplicate the input vector to create a pair
        input_vector = tf.tile(input_vector, [1, 2, 1])
        
        # Convert the tensor to a nested Python list so it can be serialized
        input_vector = input_vector.numpy().tolist()

        # Return the preprocessed input dictionary
        return {
            'inputs': input_vector
        }

    def postprocess(self, outputs):
        return outputs

    def flat2gen(self, alist):
        # Yield scalars as-is and expand nested lists by one level
        for item in alist:
            if isinstance(item, list):
                yield from item
            else:
                yield item

Overwriting aml_model_transformer.py
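
The shape handling in `preprocess` can be checked offline with NumPy alone. The sketch below reproduces the flatten, reshape, and duplicate steps on a made-up feature vector; `to_model_input` and `flatten_once` are hypothetical helpers, and `np.repeat` stands in for `tf.expand_dims` followed by `tf.tile`.

```python
import numpy as np

def flatten_once(items):
    """Yield scalars, expanding any nested list by one level."""
    for item in items:
        if isinstance(item, list):
            yield from item
        else:
            yield item

def to_model_input(feature_vector, n_features):
    """Flatten, reshape to (1, n_features), then repeat along a new window axis."""
    flat = np.array(list(flatten_once(feature_vector)), dtype=np.float32)
    row = flat.reshape(1, n_features)
    # Insert a window axis and duplicate the row: shape (1, 2, n_features).
    window = np.repeat(row[:, np.newaxis, :], 2, axis=1)
    return window.tolist()

# Hypothetical feature vector: 3 scalar features plus a 2-dimensional embedding.
vector = [0, 0.2, 0.1, [0.5, -0.5]]
model_input = to_model_input(vector, n_features=5)

print(np.array(model_input).shape)
```

In the transformer above the same steps yield an input of shape `(1, 2, 25)`: one scalar `type`, eight monthly statistics, and a 16-dimensional embedding, duplicated to form a window of length 2.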


In [23]:
from hsml.transformer import Transformer

# Get the dataset API from the project
dataset_api = project.get_dataset_api()

# Upload the transformer script file to the "Resources" dataset
uploaded_file_path = dataset_api.upload(
    "aml_model_transformer.py",   # Name of the script file
    "Resources",                  # Destination dataset in the project
    overwrite=True,               # Overwrite the file if it already exists
)

# Construct the full path to the uploaded transformer script file
transformer_script_path = os.path.join(
    "/Projects", 
    project.name, 
    uploaded_file_path,
)

# Create a Transformer object using the uploaded script
transformer_script = Transformer(
    script_file=transformer_script_path,
)

Uploading: 0.000%|          | 0/1804 elapsed<00:00 remaining<?

In [24]:
# Retrieve the "aml_model" from the model registry
model = mr.get_model(
    name="aml_model", 
    version=1,
)

# Deploy the model with the specified name ("amlmodeldeployment") and associated transformer
deployment = model.deploy(
    name="amlmodeldeployment",      # Specify the deployment name
    transformer=transformer_script, # Associate the transformer script with the deployment
)

Deployment created, explore it at https://c.app.hopsworks.ai:443/p/189590/deployments/215042
Before making predictions, start the deployment by using `.start()`


In [None]:
print("Deployment: " + deployment.name)
deployment.describe()

> The deployment has now been registered. However, to start it you need to run:

In [None]:
deployment.start(await_running=300)

> For troubleshooting, you can use the `get_logs()` method:

In [None]:
deployment.get_logs()

> To stop the deployment you simply run:

In [None]:
deployment.stop()

---
## <span style="color:#ff5f27;"> ⏭️ Next: Part 03: Online Inference </span>
    
In the next notebook you will use your deployment for online inference.