# <a class="anchor" id="1.5_bullet" style="color:#533a7b"> **Dataset Creation** </a>
---


In this notebook, we will create the actual dataset that we will train our model on. In particular, we will:
1. Select the features we want to train our model on.
2. Specify how the features should be preprocessed.
3. Create a dataset split for training and validation data.

![tutorial-flow](images/create_training_dataset.png)

### <a class="anchor" id="1.5_bullet" style="color:#e363a3"> **📝 Importing Libraries** </a>

In [1]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


### <a class="anchor" id="1.5_bullet" style="color:#3772ff"> **⬇️ Retrieving Data from Feature Groups** </a>

We start by selecting all the features we want to include for model training/inference.

In [2]:
# Load feature groups.
fares_fg = fs.get_feature_group("fares_fg")
rides_fg = fs.get_feature_group("rides_fg")

# Select features for training data.
fg_query = fares_fg.select(['total_fare', 'pickup_datetime', 'month_of_the_ride']).join(
    rides_fg.select_except(['taxi_id', 'driver_id']),
    on=['ride_id', 'pickup_datetime', 'month_of_the_ride'],
)
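
To make the selection and join concrete, here is a minimal plain-Python sketch of what the query does conceptually. It is not how Hopsworks executes the query: the helper functions, the `tip` column, and the `taxi_id`/`driver_id` values are hypothetical, and the sketch ignores any event-time alignment that the real query may perform.

```python
# Toy rows standing in for the two feature groups.
# The "tip" column and the taxi_id/driver_id values are made up for illustration.
fares_rows = [
    {"ride_id": 1, "pickup_datetime": 1577880020000, "month_of_the_ride": 202001,
     "total_fare": 118.0, "tip": 3.0},
    {"ride_id": 2, "pickup_datetime": 1577880040000, "month_of_the_ride": 202001,
     "total_fare": 48.0, "tip": 1.0},
]
rides_rows = [
    {"ride_id": 1, "pickup_datetime": 1577880020000, "month_of_the_ride": 202001,
     "taxi_id": 7, "driver_id": 11, "distance": 7.794438},
    {"ride_id": 2, "pickup_datetime": 1577880040000, "month_of_the_ride": 202001,
     "taxi_id": 8, "driver_id": 12, "distance": 5.525897},
]

JOIN_KEYS = ("ride_id", "pickup_datetime", "month_of_the_ride")


def select(rows, columns):
    """Keep only the named columns."""
    return [{c: row[c] for c in columns} for row in rows]


def select_except(rows, excluded):
    """Keep every column except the excluded ones."""
    return [{c: v for c, v in row.items() if c not in excluded} for row in rows]


def join_on_keys(left_rows, right_rows, keys):
    """Combine rows whose key columns match; unmatched left rows are dropped."""
    index = {tuple(row[k] for k in keys): row for row in right_rows}
    joined = []
    for left in left_rows:
        match = index.get(tuple(left[k] for k in keys))
        if match is not None:
            joined.append({**left, **match})
    return joined


# The toy keeps ride_id on the left side so that the join can see it.
left = select(fares_rows, ["total_fare", *JOIN_KEYS])
right = select_except(rides_rows, {"taxi_id", "driver_id"})
joined = join_on_keys(left, right, JOIN_KEYS)

for row in joined:
    print(row)
```

Each joined row carries `total_fare` from the fares side and the remaining ride features from the rides side, without `taxi_id` or `driver_id`.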



In [3]:
fg_query.show(2)

2022-05-22 21:12:14,950 INFO: USE `nyc_taxi_fares_featurestore`
2022-05-22 21:12:15,710 INFO: WITH right_fg0 AS (SELECT *
FROM (SELECT `fg1`.`total_fare` `total_fare`, `fg1`.`pickup_datetime` `pickup_datetime`, `fg1`.`month_of_the_ride` `month_of_the_ride`, `fg1`.`ride_id` `join_pk_ride_id`, `fg1`.`pickup_datetime` `join_evt_pickup_datetime`, `fg0`.`ride_id` `ride_id`, `fg0`.`pickup_longitude` `pickup_longitude`, `fg0`.`pickup_latitude` `pickup_latitude`, `fg0`.`dropoff_longitude` `dropoff_longitude`, `fg0`.`dropoff_latitude` `dropoff_latitude`, `fg0`.`passenger_count` `passenger_count`, `fg0`.`distance` `distance`, `fg0`.`pickup_distance_to_jfk` `pickup_distance_to_jfk`, `fg0`.`dropoff_distance_to_jfk` `dropoff_distance_to_jfk`, `fg0`.`pickup_distance_to_ewr` `pickup_distance_to_ewr`, `fg0`.`dropoff_distance_to_ewr` `dropoff_distance_to_ewr`, `fg0`.`pickup_distance_to_lgr` `pickup_distance_to_lgr`, `fg0`.`dropoff_distance_to_lgr` `dropoff_distance_to_lgr`, `fg0`.`year` `year`, `fg0`.`

Unnamed: 0,total_fare,pickup_datetime,month_of_the_ride,ride_id,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,distance,pickup_distance_to_jfk,dropoff_distance_to_jfk,pickup_distance_to_ewr,dropoff_distance_to_ewr,pickup_distance_to_lgr,dropoff_distance_to_lgr,year,weekday,hour
0,118.0,1577880020000,202001,1,-73.76764,40.88664,-73.843834,40.78967,3,7.794438,16.960192,10.813943,25.269297,18.642949,9.402573,1.808076,2020,2,12
1,48.0,1577880040000,202001,2,-73.85604,40.77413,-73.80203,40.84287,3,5.525897,10.044588,13.983367,17.669196,22.185668,0.959006,5.911146,2020,2,12
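
The `pickup_datetime` values in the preview look like Unix timestamps in milliseconds. Assuming that is the case, and assuming UTC, the short check below converts the first value and compares it with the derived `year`, `weekday`, `hour`, and `month_of_the_ride` columns of the same row.

```python
from datetime import datetime, timezone

# pickup_datetime of ride_id 1, taken from the preview output.
pickup_ms = 1577880020000

# Assumption: the value is a Unix timestamp in milliseconds, interpreted as UTC.
pickup = datetime.fromtimestamp(pickup_ms / 1000, tz=timezone.utc)

# Rebuild the YYYYMM value in the same shape as month_of_the_ride.
month_of_the_ride = pickup.year * 100 + pickup.month

print(pickup.isoformat())  # 2020-01-01T12:00:20+00:00
print(pickup.year, pickup.weekday(), pickup.hour, month_of_the_ride)
```

With Python's convention of Monday being weekday 0, the result agrees with the row shown above (`year` 2020, `weekday` 2, `hour` 12, `month_of_the_ride` 202001).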


#### <a class="anchor" id="1.5_bullet" style="color:#772f1a"> **〰️ Transformation Functions** </a>


We can preprocess the data with encoding methods such as *min-max scaling* for numerical features and *label encoding* for categorical features. To do this, we define a mapping from feature names to transformation functions. The transformation functions are then fitted on the training data only (not on the validation/test data), which prevents data leakage. In this notebook the mapping is left commented out, so the features are used untransformed.
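
As an illustration of why fitting on the training data alone avoids leakage, here is a minimal plain-Python sketch of min-max scaling. It is not the implementation behind the built-in `min_max_scaler`; the function names are hypothetical, and the validation fare of `130.0` is an invented value chosen to fall outside the training range.

```python
def fit_min_max(values):
    """Compute the scaling statistics from the training values only."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values with previously fitted statistics."""
    span = hi - lo
    if span == 0:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# total_fare values from the sample rows; 130.0 is a made-up validation fare.
train_fares = [118.0, 48.0, 41.0, 44.0]
validation_fares = [46.0, 130.0]

# Fit on the training split, then reuse the same statistics everywhere.
lo, hi = fit_min_max(train_fares)
scaled_train = apply_min_max(train_fares, lo, hi)
scaled_validation = apply_min_max(validation_fares, lo, hi)

print(scaled_train)
print(scaled_validation)
```

Because the minimum and maximum come from the training split, a validation value above the training maximum is scaled to more than 1. That is expected: the validation data did not influence the statistics.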

In [4]:
[t_func.name for t_func in fs.get_transformation_functions()]

['robust_scaler', 'label_encoder', 'min_max_scaler', 'standard_scaler']

In [4]:
# # Load transformation functions.
# min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
# label_encoder = fs.get_transformation_function(name="label_encoder")

# # Map features to transformations.
# transformation_functions = {
#     "total_fare": min_max_scaler,
#     "distance": min_max_scaler
# }

## <a class="anchor" id="1.5_bullet" style="color:#2db3f0"> **🔮 Feature View Creation** </a>


With the query defined, we create a corresponding `Feature View` using `fs.create_feature_view()`. The feature view records which features the model uses and which of them is the label (here `total_fare`).

In [5]:
nyc_fares_fv = fs.create_feature_view(
    name='nyc_taxi_fares',
    query=fg_query,
    label=["total_fare"]
#     transformation_functions=transformation_functions
)

In [6]:
nyc_fares_fv.version

14

In [7]:
nyc_fares_fv.preview_feature_vector()

[]

## <a class="anchor" id="1.5_bullet" style="color:#525252"> **📦 Dataset Creation** </a>


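Finally, we create the training dataset using `FeatureView.create_training_dataset()`. The `splits` argument assigns 80% of the rows to `train` and 20% to `validation`. With `wait_for_job` set to `False`, the call returns as soon as the job has been started.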

In [8]:
td_metadata = nyc_fares_fv.create_training_dataset(
    description='NYC taxi fares dataset.',
    data_format='csv',
    splits={'train': 80, 'validation': 20},
    train_split="train",
    write_options={'wait_for_job': False},
)

Training dataset job started successfully, you can follow the progress at https://hopsworks.glassfish.service.consul:8182/p/122/jobs/named/nyc_taxi_fares_14_1_create_fv_td_22052022211314/executions




Once the job has finished, we can sanity check the result by loading the training data. Because the transformation functions are commented out above, the values appear untransformed.

The dataset is now split into two parts: **train** (80% of the original dataset) and **validation** (20% of the original dataset).

- To get the training dataset, use the `FeatureView.get_training_dataset()` method.

- To retrieve a specific split of the training dataset, use the `FeatureView.get_training_dataset_splits()` method.
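
To show what an 80/20 split means in practice, here is a minimal plain-Python sketch that shuffles rows and cuts them in proportion to the weights. How Hopsworks performs the split inside the job is not shown in this notebook, so the `split_rows` helper, the seed, and the 100 toy ride IDs are all assumptions made for illustration.

```python
import random


def split_rows(rows, splits, seed=42):
    """Shuffle rows and cut them into named parts proportional to the weights."""
    total = sum(splits.values())
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    names = list(splits)
    parts, start, cumulative = {}, 0, 0
    for position, name in enumerate(names):
        cumulative += splits[name]
        if position == len(names) - 1:
            end = len(shuffled)  # the last part takes whatever remains
        else:
            end = round(len(shuffled) * cumulative / total)
        parts[name] = shuffled[start:end]
        start = end
    return parts


# 100 toy ride IDs standing in for the rows of the training dataset.
ride_ids = list(range(1, 101))
parts = split_rows(ride_ids, {"train": 80, "validation": 20})

print(len(parts["train"]), len(parts["validation"]))
```

Every row lands in exactly one part, so the model is never validated on rows it was trained on.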

In [10]:
td_version, df = nyc_fares_fv.get_training_dataset()

df.head()

2022-05-22 21:18:29,774 INFO: USE `nyc_taxi_fares_featurestore`
2022-05-22 21:18:30,559 INFO: WITH right_fg0 AS (SELECT *
FROM (SELECT `fg1`.`total_fare` `total_fare`, `fg1`.`pickup_datetime` `pickup_datetime`, `fg1`.`month_of_the_ride` `month_of_the_ride`, `fg1`.`ride_id` `join_pk_ride_id`, `fg1`.`pickup_datetime` `join_evt_pickup_datetime`, `fg0`.`ride_id` `ride_id`, `fg0`.`pickup_longitude` `pickup_longitude`, `fg0`.`pickup_latitude` `pickup_latitude`, `fg0`.`dropoff_longitude` `dropoff_longitude`, `fg0`.`dropoff_latitude` `dropoff_latitude`, `fg0`.`passenger_count` `passenger_count`, `fg0`.`distance` `distance`, `fg0`.`pickup_distance_to_jfk` `pickup_distance_to_jfk`, `fg0`.`dropoff_distance_to_jfk` `dropoff_distance_to_jfk`, `fg0`.`pickup_distance_to_ewr` `pickup_distance_to_ewr`, `fg0`.`dropoff_distance_to_ewr` `dropoff_distance_to_ewr`, `fg0`.`pickup_distance_to_lgr` `pickup_distance_to_lgr`, `fg0`.`dropoff_distance_to_lgr` `dropoff_distance_to_lgr`, `fg0`.`year` `year`, `fg0`.`



Unnamed: 0,total_fare,pickup_datetime,month_of_the_ride,ride_id,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,distance,pickup_distance_to_jfk,dropoff_distance_to_jfk,pickup_distance_to_ewr,dropoff_distance_to_ewr,pickup_distance_to_lgr,dropoff_distance_to_lgr,year,weekday,hour
0,118.0,1577880020000,202001,1,-73.76764,40.88664,-73.843834,40.78967,3,7.794438,16.960192,10.813943,25.269297,18.642949,9.402573,1.808076,2020,2,12
1,48.0,1577880040000,202001,2,-73.85604,40.77413,-73.80203,40.84287,3,5.525897,10.044588,13.983367,17.669196,22.185668,0.959006,5.911146,2020,2,12
2,41.0,1577880060000,202001,3,-73.86453,40.763325,-73.84797,40.7844,3,1.694445,9.569711,10.542662,17.013063,18.309169,1.060797,1.457111,2020,2,12
3,44.0,1577880080000,202001,4,-73.86093,40.767902,-73.781784,40.868633,3,8.097295,9.764157,15.708376,17.288908,23.986612,0.924238,7.963623,2020,2,12
4,46.0,1577880100000,202001,5,-73.85884,40.77057,-73.75468,40.903137,3,10.65566,9.882264,18.132646,17.450841,26.459162,0.905815,10.722712,2020,2,12


### Next Steps

In the next notebook, we will train a model on the dataset we created in this notebook.