🗒️ This notebook is divided into the following sections:

1. Fetch Feature Groups
2. Define Transformation Functions
3. Create Feature Views
4. Create Training Dataset with training, validation and test splits

In [1]:
!pip install -U hopsworks 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting hopsworks
  Downloading hopsworks-3.0.5.tar.gz (35 kB)
Collecting hsfs[python]<3.1.0,>=3.0.0
  Downloading hsfs-3.0.5.tar.gz (120 kB)
Collecting hsml<3.1.0,>=3.0.0
  Downloading hsml-3.0.3.tar.gz (50 kB)
Collecting pyhumps==1.6.1
  Downloading pyhumps-1.6.1-py3-none-any.whl (5.0 kB)
Collecting furl
  Downloading furl-2.1.3-py2.py3-none-any.whl (20 kB)
Collecting boto3
  Downloading boto3-1.26.41-py3-none-any.whl (132 kB)
Collecting pyjks
  Downloading pyjks-20.0.0-py2.py3-none-any.whl (45 kB)
Collecting mock
  Downloading mock-5.0.0-py3-none-any.whl (29 kB)
Collecting avro==1.10.2
  Downloading avro-1.10.2.tar.gz (68 kB

In [1]:
import hopsworks
project = hopsworks.login()
fs = project.get_feature_store() 

Copy your Api Key (first register/login): https://c.app.hopsworks.ai/account/api/generated

Paste it here: ··········
Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/5315
Connected. Call `.close()` to terminate connection gracefully.




In [2]:
air_quality_fg = fs.get_or_create_feature_group(
    name = 'air_quality_fg',
    version = 1
)
weather_fg = fs.get_or_create_feature_group(
    name = 'weather_fg',
    version = 1
)

In [3]:
query = air_quality_fg.select_all().join(weather_fg.select_all())
query.read()

Unnamed: 0,date,pm25,pm10,o3,no2,so2,co,aqi,tempmax,tempmin,...,snowdepth,windgust,windspeed,winddir,sealevelpressure,cloudcover,visibility,solarradiation,solarenergy,moonphase
0,1671494400000,46.0,0.0,0.0,0.0,0.0,0.0,46.0,12.3,10.2,...,0.0,37.4,36.0,193.6,1012.9,98.6,12.3,5.8,0.4,0.95
1,1671408000000,60.0,16.0,13.0,19.0,0.0,0.0,60.0,12.3,4.3,...,0.0,37.4,18.5,180.8,1017.7,98.7,9.9,9.2,0.7,0.91
2,1671321600000,103.0,28.0,14.0,29.0,0.0,0.0,103.0,3.6,-6.0,...,0.0,26.3,14.8,139.5,1022.0,81.3,7.1,10.8,1.0,0.86
3,1671235200000,93.0,44.0,4.0,24.0,0.0,0.0,93.0,-0.6,-4.6,...,0.0,23.0,12.4,121.6,1025.1,76.4,7.2,12.0,1.1,0.81
4,1671148800000,85.0,42.0,10.0,31.0,0.0,0.0,85.0,1.3,-2.0,...,0.0,29.3,14.5,19.6,1015.8,46.1,9.8,12.4,1.0,0.76
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
349,1641340800000,37.0,31.0,17.0,25.0,0.0,0.0,37.0,6.5,2.0,...,0.0,48.0,18.5,291.2,1014.3,26.0,23.2,27.2,2.3,0.05
350,1641254400000,26.0,22.0,19.0,19.0,0.0,0.0,26.0,11.5,4.6,...,0.0,59.4,26.6,280.8,1002.0,98.9,17.8,13.3,1.1,0.02
351,1641168000000,38.0,12.0,24.0,23.0,0.0,0.0,38.0,12.7,10.8,...,0.0,52.3,22.8,224.0,1015.4,97.9,17.0,18.9,1.7,0.00
352,1641081600000,49.0,23.0,27.0,19.0,0.0,0.0,49.0,14.0,9.5,...,0.0,53.6,24.5,221.0,1018.9,89.8,24.3,18.2,1.5,1.00


--- 

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieving </span>

In [4]:
# Reuse the `query` object built in the previous cell; there is no need to rebuild the join.
query_show = query.show(5)
col_names = query_show.columns

query_show

Unnamed: 0,date,pm25,pm10,o3,no2,so2,co,aqi,tempmax,tempmin,...,snowdepth,windgust,windspeed,winddir,sealevelpressure,cloudcover,visibility,solarradiation,solarenergy,moonphase
0,1671494400000,46.0,0.0,0.0,0.0,0.0,0.0,46.0,12.3,10.2,...,0.0,37.4,36.0,193.6,1012.9,98.6,12.3,5.8,0.4,0.95
1,1671408000000,60.0,16.0,13.0,19.0,0.0,0.0,60.0,12.3,4.3,...,0.0,37.4,18.5,180.8,1017.7,98.7,9.9,9.2,0.7,0.91
2,1671321600000,103.0,28.0,14.0,29.0,0.0,0.0,103.0,3.6,-6.0,...,0.0,26.3,14.8,139.5,1022.0,81.3,7.1,10.8,1.0,0.86
3,1671235200000,93.0,44.0,4.0,24.0,0.0,0.0,93.0,-0.6,-4.6,...,0.0,23.0,12.4,121.6,1025.1,76.4,7.2,12.0,1.1,0.81
4,1671148800000,85.0,42.0,10.0,31.0,0.0,0.0,85.0,1.3,-2.0,...,0.0,29.3,14.5,19.6,1015.8,46.1,9.8,12.4,1.0,0.76


### <span style="color:#ff5f27;">🧑🏻‍🔬 Transformation functions</span>

Hopsworks Feature Store provides functionality to attach transformation functions to training datasets.

Hopsworks Feature Store also comes with built-in transformation functions such as `min_max_scaler`, `standard_scaler`, `robust_scaler` and `label_encoder`.

In [None]:
[t_func.name for t_func in fs.get_transformation_functions()]

['min_max_scaler', 'standard_scaler', 'robust_scaler', 'label_encoder']
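To make the built-in names above concrete, here is a minimal plain-Python sketch of what three of them conventionally compute. It is not the Hopsworks implementation: the helper names are made up for illustration, the use of the population standard deviation is an assumption, and the weather condition labels are hypothetical. The `pm25` values are the first five rows of the query output shown earlier.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def label_encode(values):
    """Map each distinct label to an integer code, in sorted label order."""
    codes = {label: i for i, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


pm25 = [46.0, 60.0, 103.0, 93.0, 85.0]            # from the query output
conditions = ["Rain", "Clear", "Rain", "Overcast"]  # hypothetical labels

scaled_min_max = min_max_scale(pm25)
scaled_standard = standard_scale(pm25)
encoded = label_encode(conditions)

print(scaled_min_max)
print(scaled_standard)
print(encoded)
```

Both scalers assume the column is not constant; a constant column would need a guard against division by zero.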

You can retrieve the transformation function you need by name with `fs.get_transformation_function()`.

To attach transformation functions to a training dataset, provide them as a dict in which each key is a feature name and each value is the transformation function to apply to that feature.

The training dataset must also be created from a Query object. Once attached, the transformation functions are applied whenever the `save`, `insert` and `get_serving_vector` methods are called on the training dataset object.

In [None]:
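# Columns that must not be standard-scaled: timestamp, categorical column, label.
category_cols = ['date', 'conditions', 'aqi']

# Standard-scale every remaining (numeric) column.
mapping_transformers = {col_name: fs.get_transformation_function(name='standard_scaler') for col_name in col_names if col_name not in category_cols}
# Label-encode the categorical columns, skipping the timestamp and the label.
label_encoders = {col_name: fs.get_transformation_function(name='label_encoder') for col_name in category_cols if col_name not in ['date', 'aqi']}

mapping_transformers.update(label_encoders)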
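The resulting mapping is easier to check when printed as plain names. The sketch below rebuilds the same plan without a feature store connection; the column list is a hypothetical subset of the joined query's columns, and transformer names stand in for the function objects returned by `fs.get_transformation_function()`.

```python
# Hypothetical subset of the joined query's columns (assumption for illustration).
col_names = ["date", "pm25", "pm10", "aqi", "tempmax", "conditions"]

excluded = ["date", "aqi"]          # timestamp and label: left untouched
label_encoded = ["conditions"]      # categorical column

# Build {feature name: transformer name}, mirroring the dict built in the cell above.
plan = {}
for col in col_names:
    if col in excluded:
        continue
    plan[col] = "label_encoder" if col in label_encoded else "standard_scaler"

for col, transformer in plan.items():
    print(f"{col:12s} -> {transformer}")
```

Printing the plan before creating the feature view is a cheap way to confirm that the label and the timestamp were left out.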

A `Feature View` sits between **Feature Groups** and a **Training Dataset**. By combining **Feature Groups** we create a **Feature View**, which stores the metadata that describes our data. From a **Feature View** we can then create a **Training Dataset**.

A Feature View defines a schema in the form of a query with filters, the model's target feature (label), and optional transformation functions.

To create a Feature View, use the `FeatureStore.create_feature_view()` method.

You can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - our target variable.

- `transformation_functions` - functions to transform our features.

- `query` - query object that selects the data.

In [9]:
feature_view = fs.create_feature_view(
    name = 'air_quality_fv',
    version = 1,
    labels = ['aqi'],
    # transformation_functions = mapping_transformers,  # left disabled in this run
    query = query  # join of the two feature groups
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/5315/fs/5235/fv/air_quality_fv/version/1


The `Feature View` is now saved in Hopsworks, and you can retrieve it with `FeatureStore.get_feature_view()`.

In [10]:
feature_view = fs.get_feature_view(
    name = 'air_quality_fv',
    version = 1
)

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query whose projection (set of features) is determined by the parent FeatureView, with an optional snapshot on disk of the data returned by the query.

**A Training Dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

To create a training dataset, use the `FeatureView.create_training_data()` method, or `FeatureView.create_train_validation_test_split()` when you also want validation and test splits, as in the cell below.

Here are some important things to know:

- The training dataset inherits the name of the FeatureView.

- The feature store currently supports the following data formats for training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- Choose the format with the **data_format** parameter.

- Use **start_time** and **end_time** to restrict the dataset to a specific time range.
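The `date` column in the query output holds epoch timestamps in milliseconds, while the time range in the next cell is written as `YYYYMMDD` strings. The sketch below shows how the two relate, using three timestamps taken from the query output. The helper names are hypothetical, and treating both bounds as inclusive is an assumption of this sketch, not a statement about how Hopsworks filters.

```python
from datetime import datetime, timezone


def epoch_ms_to_date(ms):
    """Convert an epoch timestamp in milliseconds to a UTC calendar date."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date()


def in_range(ms, start, end):
    """Check whether a timestamp falls between two 'YYYYMMDD' strings (inclusive here)."""
    start_date = datetime.strptime(start, "%Y%m%d").date()
    end_date = datetime.strptime(end, "%Y%m%d").date()
    return start_date <= epoch_ms_to_date(ms) <= end_date


# Timestamps copied from the first two rows and the last row of the query output.
sample = [1671494400000, 1671408000000, 1641081600000]

dates = [epoch_ms_to_date(ms).isoformat() for ms in sample]
kept = [ms for ms in sample if in_range(ms, "20220101", "20220228")]

print(dates)
print(kept)
```

Converting a few timestamps this way is a quick check that a chosen time range actually overlaps the data before the training dataset job is started.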

In [16]:
# Create the training dataset together with its train/validation/test splits.
td, td_job = feature_view.create_train_validation_test_split(
        # start_time="20210101",
        # end_time="20220228",
        description='Data set to train the air-prediction model',
        data_format="csv",
        validation_size = 0.2,
        test_size = 0.1
    )

# hsfs documents the return order as (X_train, X_val, X_test, y_train, y_val, y_test).
x_train, x_validate, x_test, y_train, y_validate, y_test = feature_view.get_train_validation_test_split(td)

Training dataset job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/5315/jobs/named/air_quality_fv_1_1_create_fv_td_03012023010106/executions
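With `validation_size = 0.2` and `test_size = 0.1`, the remaining 70% of the rows go to the training set. The sketch below illustrates that arithmetic with a seeded random split over row indices. It is not how Hopsworks performs the split internally: the function name, the rounding rule and the seed are assumptions, and the row count of 353 is taken from the index range 0 to 352 in the query output.

```python
import random


def split_indices(n_rows, validation_size, test_size, seed=0):
    """Shuffle row indices and cut them into train, validation and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    n_val = round(n_rows * validation_size)
    n_test = round(n_rows * test_size)

    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]   # everything left over
    return train, val, test


train, val, test = split_indices(n_rows=353, validation_size=0.2, test_size=0.1)
print(len(train), len(val), len(test))
```

Comparing these counts with the lengths of the returned dataframes is a simple way to confirm that the features and labels were unpacked in the intended order.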


