# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 02: Training Data & Feature views</span>

<span style="font-width:bold; font-size: 1.4rem;">This is the second part of the quick start series of tutorials about Hopsworks Feature Store. This notebook explains how to read from a feature group and create training dataset within the feature store</span>

## **🗒️ In this notebook we will see how to create a training dataset from the feature groups:** 
1. **Select the features** we want to train our model on,
2. **How the features should be preprocessed,**
3. **Create a dataset split** for training and validation data.

![tutorial-flow](images/02_training-dataset.png) 

In [1]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


## <span style="color:#ff5f27;"> 🔪 Feature Selection </span>

We start by selecting all the features we want to include for model training/inference.

In [2]:
# Load feature groups.
trans_fg = fs.get_feature_group('transactions', version=1)
window_aggs_fg = fs.get_feature_group('transactions_4h_aggs', version=1)

# Select features for training data.
ds_query = trans_fg.select(["fraud_label", "loc_delta"])

ds_query.show(5)

2022-06-16 22:54:39,296 INFO: USE `fraud_online_featurestore`
2022-06-16 22:54:40,086 INFO: SELECT `fg0`.`fraud_label` `fraud_label`, `fg0`.`loc_delta` `loc_delta`
FROM `fraud_online_featurestore`.`transactions_1` `fg0`


Unnamed: 0,fraud_label,loc_delta
0,0,0.44428
1,0,0.0152
2,0,0.573758
3,0,0.26414
4,0,0.372482


Recall that we computed the features in `transactions_4h_aggs` using 4-hour aggregates. If we had created multiple feature groups with identical schema for different window lengths, and wanted to include them in the join we would need to include a prefix argument in the join to avoid feature name clash. See the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/query_api/#join) for more details.

v🤖 Transformation Functions </span>

We will preprocess our data using *min-max scaling* on numerical features and *label encoding* on categorical features. To do this we simply define a mapping between our features and transformation functions. This ensures that transformation functions such as *min-max scaling* are fitted only on the training data (and not the validation/test data), which ensures that there is no data leakage.

In [3]:
# Load transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

# Map features to transformations.
transformation_functions = {
    "loc_delta": min_max_scaler,
}

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

The Feature Views allows schema in form of a query with filters, define a model target feature/label and additional transformation functions.
In order to create a Feature View we may use `fs.create_feature_view()`

In [4]:
feature_view = fs.create_feature_view(
    name='transactions_view',
    query=ds_query,
    labels=["fraud_label"],
    transformation_functions=transformation_functions
)

Feature view created successfully, explore it at https://73bbb040-ed8b-11ec-a289-2979fdcaf1e8.cloud.hopsworks.ai/p/120/fs/68/fv/transactions_view/version/1


To view and explore data in the feature view we can retrieve batch data using `get_batch_data()` method 

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks training data is a query where the projection (set of features) is determined by the parent FeatureView with an optional snapshot on disk of the data returned by the query.

**Training Dataset  may contain splits such as:** 
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hparams when training a model
* Test set - the holdout subset of training data used to evaluate a mode

Training dataset is created using `fs.create_training_dataset()` method.

**From feature view APIs we can also create training datasts based on even time filters specifing `start_time` and `end_time`** 



In [5]:
from datetime import datetime
date_format = "%Y-%m-%d %H:%M:%S"

In [6]:
# Create training datasets based event time filter
start_time = int(float(datetime.strptime("2022-01-01 00:00:01", date_format).timestamp()) * 1000)
end_time = int(float(datetime.strptime("2022-02-28 23:59:59", date_format).timestamp()) * 1000)

td_jan_feb_version, td_job = feature_view.create_training_dataset(
        description = 'transactions_dataset_jan_feb',
        data_format = "csv",
        coalesce = True,
        write_options = {'wait_for_job': True},
    )

Training dataset job started successfully, you can follow the progress at https://73bbb040-ed8b-11ec-a289-2979fdcaf1e8.cloud.hopsworks.ai/p/120/jobs/named/transactions_view_1_1_create_fv_td_16062022225447/executions




In [7]:
start_time = int(float(datetime.strptime("2022-03-01 00:00:01", date_format).timestamp()) * 1000)
end_time = int(float(datetime.strptime("2022-03-31 23:59:59", date_format).timestamp()) * 1000)

td_mar_version, td_job = feature_view.create_training_dataset(
        start_time = start_time,
        end_time = end_time,
        description = 'transactions_dataset_mar',
        data_format = "csv",
        coalesce = True,
        write_options = {'wait_for_job': True},
 )


Training dataset job started successfully, you can follow the progress at https://73bbb040-ed8b-11ec-a289-2979fdcaf1e8.cloud.hopsworks.ai/p/120/jobs/named/transactions_view_1_2_create_fv_td_16062022225545/executions




## <span style="color:#ff5f27;"> 🪝 Training Dataset retreival </span>

To retrieve training data from storage (already materialised) or from feature groups direcly we can use `get_training_dataset_splits` or `get_training_dataset` methods. If version is not provided or provided version has not already existed, it creates a new version of training data according to given arguments and returns a dataframe. If version is provided and has already existed, it reads training data from storage or feature groups and returns a dataframe. If split is provided, it reads the specific split.

In [8]:
train_jan_feb_x, train_jan_feb_y = feature_view.get_training_data(td_jan_feb_version)



In [9]:
train_mar_x, train_mar_y = feature_view.get_training_data(td_mar_version)

In [10]:
train_jan_feb_x

Unnamed: 0,loc_delta
0,0.153145
1,0.005240
2,0.197777
3,0.091050
4,0.128397
...,...
365107,0.147441
365108,0.176873
365109,0.039796
365110,0.074346


In [11]:
train_jan_feb_y[train_jan_feb_y.fraud_label==1]

Unnamed: 0,fraud_label
809,1
988,1
1225,1
1278,1
1289,1
...,...
363018,1
363200,1
363705,1
364078,1


In [12]:
train_mar_x

Unnamed: 0,loc_delta
0,0.207684
1,0.062811
2,0.030848
3,0.041706
4,0.114363
...,...
84604,0.219972
84605,0.097360
84606,0.206802
84607,0.013869


In [13]:
train_mar_y[train_mar_y.fraud_label==1]

Unnamed: 0,fraud_label
969,1
2065,1
2168,1
3281,1
6500,1
...,...
77362,1
77956,1
79616,1
83123,1


## <span style="color:#ff5f27;">⏭️ **Next:** Part 03 </span>

In the following notebook, we will train a model on the dataset we created in this notebook and have quick overview of the lineage.