# <span style="font-weight:bold; font-size: 3rem; color:#2656a3;">**MSc BDS Module - Data Engineering and Machine Learning Operations in Business (MLOPs)** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Feature Backfill</span>

The project uses [Hopsworks](https://www.hopsworks.ai) as the platform to store features in the **Hopsworks Feature Store** and to save a trained model in the **Hopsworks Model Registry**.

## <span style='color:#2656a3'> 🗒️ The notebook is divided into the following sections:</span>
1. Loading the data and processing features.
2. Connecting to Hopsworks Feature Store.
3. Creating feature groups and uploading them to the feature store.

## <span style='color:#2656a3'> ⚙️ Import of Libraries and Packages</span>

We start by moving to the parent folder so that we can import the functions we have created (including live API calls and data preprocessing) for electricity prices, weather measures, and the Danish calendar. We then import the remaining libraries and suppress warnings to keep the output clean.

In [1]:
# First we move one level up in the directory tree to access the folder with our functions
%cd ..

# Now we import the functions from the features folder
# These are the functions we have created to generate features for electricity prices, weather measures, and the Danish calendar
from features import electricity_prices, weather_measures, calendar

# We go back into the notebooks folder
%cd notebooks

/Users/camillahannesbo/Documents/AAU/Master - BDS/2. semester/Data Engineering and Machine learning operations in Business/MLOPs-Assignment-
/Users/camillahannesbo/Documents/AAU/Master - BDS/2. semester/Data Engineering and Machine learning operations in Business/MLOPs-Assignment-/notebooks


In [2]:
# Importing pandas for data handling
import pandas as pd

# Ignore warnings
import warnings 
warnings.filterwarnings('ignore')

## <span style="color:#2656a3;"> 💽 Loading the Historical Data</span>

The data comes from the following sources:

- Hourly electricity prices for the Danish price area DK1 from [Energinet](https://www.energidataservice.dk). Located in the folder `features/electricity_prices`.
- Meteorological observations for Aalborg, Denmark from [Open Meteo](https://www.open-meteo.com). Located in the folder `features/weather_measures`.
- Weather forecast for Aalborg, Denmark from [Open Meteo](https://www.open-meteo.com). Located in the folder `features/weather_measures`. (This data is used later to bring in new real-time weather data.)
- A Danish calendar that categorizes dates by whether they are workdays or not. This file was made manually by the group and is located in the folder `data` inside this repository.


### <span style="color:#2656a3;">💸 Electricity Prices per day from Energinet</span>
The first dataset we load contains the hourly electricity prices from Energinet's Energi Data Service.

In [3]:
# Fetching historical electricity prices for area DK1 from January 1, 2022
# Note: The end date is currently left out to retrieve data up to the day before present date 
# Today is not included in the data as it is not historical data
electricity_df = electricity_prices.electricity_prices(
    historical=True, 
    area=["DK1"], 
    start='2022-01-01'
)
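The helper `electricity_prices.electricity_prices` lives in the `features` folder and is not shown in this notebook. As a rough sketch of the kind of transformation that could produce the columns displayed in the next cells, the example below assumes raw records with the fields `HourDK` and `SpotPriceDKK` (a price per MWh). The function name `to_hourly_price_frame`, the field names, and the sample values are assumptions for illustration, and the actual helper may work differently.

```python
import pandas as pd

def to_hourly_price_frame(records, area="DK1"):
    """Turn raw price records into a tidy hourly dataframe (illustrative only)."""
    df = pd.DataFrame(records)
    df["datetime"] = pd.to_datetime(df["HourDK"])
    # Milliseconds since the Unix epoch, treating the naive datetime as UTC
    df["timestamp"] = (df["datetime"] - pd.Timestamp("1970-01-01")) // pd.Timedelta(milliseconds=1)
    df["date"] = df["datetime"].dt.strftime("%Y-%m-%d")
    df["hour"] = df["datetime"].dt.hour
    # Assumed unit conversion: DKK per MWh to DKK per kWh
    price_col = f"{area.lower()}_spotpricedkk_kwh"
    df[price_col] = df["SpotPriceDKK"] / 1000
    return df[["timestamp", "datetime", "date", "hour", price_col]]

# Illustrative sample records, not real API output
sample_records = [
    {"HourDK": "2022-01-01T00:00:00", "SpotPriceDKK": 372.2},
    {"HourDK": "2022-01-01T01:00:00", "SpotPriceDKK": 307.35},
]
sample_prices = to_hourly_price_frame(sample_records)
```

Computing the timestamp by subtracting the epoch avoids depending on the internal resolution of the datetime column.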

In [4]:
# Display the first 5 rows of the electricity dataframe
electricity_df.head(5)

Unnamed: 0,timestamp,datetime,date,hour,dk1_spotpricedkk_kwh
0,1640995200000,2022-01-01 00:00:00,2022-01-01,0,0.3722
1,1640998800000,2022-01-01 01:00:00,2022-01-01,1,0.30735
2,1641002400000,2022-01-01 02:00:00,2022-01-01,2,0.32141
3,1641006000000,2022-01-01 03:00:00,2022-01-01,3,0.33806
4,1641009600000,2022-01-01 04:00:00,2022-01-01,4,0.28013


In [5]:
# Display the last 5 rows of the electricity dataframe
electricity_df.tail(5)

Unnamed: 0,timestamp,datetime,date,hour,dk1_spotpricedkk_kwh
20536,1714935600000,2024-05-05 19:00:00,2024-05-05,19,0.71783
20537,1714939200000,2024-05-05 20:00:00,2024-05-05,20,0.83478
20538,1714942800000,2024-05-05 21:00:00,2024-05-05,21,0.80204
20539,1714946400000,2024-05-05 22:00:00,2024-05-05,22,0.73647
20540,1714950000000,2024-05-05 23:00:00,2024-05-05,23,0.66136


In [6]:
# Show the information for the electricity dataframe
electricity_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20541 entries, 0 to 20540
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   timestamp             20541 non-null  int64         
 1   datetime              20541 non-null  datetime64[ns]
 2   date                  20541 non-null  object        
 3   hour                  20541 non-null  int64         
 4   dk1_spotpricedkk_kwh  20541 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 802.5+ KB
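The row count can be compared with the number of hours in the covered period. For 2022-01-01 00:00 to 2024-05-05 23:00, a complete hourly series would have 856 × 24 = 20,544 rows, three more than the 20,541 reported above. The sketch below shows one way to list the missing hours; the helper name `missing_hours` is hypothetical and the demo frame is made up.

```python
import pandas as pd

def missing_hours(df, column="datetime"):
    """Return the hourly timestamps absent between the first and last observation."""
    observed = pd.DatetimeIndex(df[column])
    expected = pd.date_range(observed.min(), observed.max(), freq=pd.Timedelta(hours=1))
    return expected.difference(observed)

# Made-up frame with one hour missing
gap_demo = pd.DataFrame({"datetime": pd.to_datetime([
    "2022-03-27 00:00", "2022-03-27 01:00", "2022-03-27 03:00",
])})
gaps = missing_hours(gap_demo)
```

In the notebook this could be called as `missing_hours(electricity_df)`. The gaps may correspond to daylight-saving transitions, although that is not verified here.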


### <span style="color:#2656a3;"> 🌤 Weather measurements from Open Meteo</span>
Next, weather measurements from Open Meteo are fetched.

#### <span style="color:#2656a3;"> 🕰️ Historical Weather Measures</span>

In [7]:
# Fetching historical weather measurements from January 1, 2022
# Note: The end date is currently left out to retrieve data up to the day before present date 
# Today is not included in the data as it is not historical data
historical_weather_df = weather_measures.historical_weather_measures(
    historical=True, 
    start = '2022-01-01'
)

In [8]:
# Display the first 5 rows of the weather dataframe
historical_weather_df.head(5)

Unnamed: 0,timestamp,datetime,date,hour,temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code,cloud_cover,wind_speed_10m,wind_gusts_10m
0,1640995200000,2022-01-01 00:00:00,2022-01-01,0,6.7,100.0,0.0,0.0,0.0,3.0,100.0,16.2,36.0
1,1640998800000,2022-01-01 01:00:00,2022-01-01,1,6.6,100.0,0.0,0.0,0.0,3.0,100.0,16.2,30.2
2,1641002400000,2022-01-01 02:00:00,2022-01-01,2,6.7,99.0,0.0,0.0,0.0,3.0,100.0,15.5,30.6
3,1641006000000,2022-01-01 03:00:00,2022-01-01,3,6.7,100.0,0.0,0.0,0.0,3.0,100.0,12.7,28.8
4,1641009600000,2022-01-01 04:00:00,2022-01-01,4,6.7,99.0,0.0,0.0,0.0,3.0,100.0,10.6,23.8


In [9]:
# Display the last 5 rows of the weather dataframe
historical_weather_df.tail(5)

Unnamed: 0,timestamp,datetime,date,hour,temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code,cloud_cover,wind_speed_10m,wind_gusts_10m
20515,1714849200000,2024-05-04 19:00:00,2024-05-04,19,12.2,88.0,0.0,0.0,0.0,3.0,100.0,1.6,4.3
20516,1714852800000,2024-05-04 20:00:00,2024-05-04,20,11.4,92.0,0.0,0.0,0.0,2.0,70.0,1.5,2.2
20517,1714856400000,2024-05-04 21:00:00,2024-05-04,21,10.7,96.0,0.0,0.0,0.0,2.0,64.0,0.4,2.5
20518,1714860000000,2024-05-04 22:00:00,2024-05-04,22,10.1,100.0,0.0,0.0,0.0,3.0,100.0,2.4,3.2
20519,1714863600000,2024-05-04 23:00:00,2024-05-04,23,9.9,100.0,0.0,0.0,0.0,3.0,100.0,2.9,4.0


In [10]:
# Show the information for the weather dataframe
historical_weather_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20520 entries, 0 to 20519
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   timestamp             20520 non-null  int64         
 1   datetime              20520 non-null  datetime64[ns]
 2   date                  20520 non-null  object        
 3   hour                  20520 non-null  int64         
 4   temperature_2m        20520 non-null  float64       
 5   relative_humidity_2m  20520 non-null  float64       
 6   precipitation         20520 non-null  float64       
 7   rain                  20520 non-null  float64       
 8   snowfall              20520 non-null  float64       
 9   weather_code          20520 non-null  float64       
 10  cloud_cover           20520 non-null  float64       
 11  wind_speed_10m        20520 non-null  float64       
 12  wind_gusts_10m        20520 non-null  float64       
dtypes: datetime64[ns](1), float64(9), int64(2), object(1)

#### <span style="color:#2656a3;"> 🌈  Forecast Weather Measures</span>
The weather forecast from Open Meteo is now fetched. This data is used in `2_feature_pipeline` to bring in new real-time weather data.

In [11]:
# Fetching weather forecast measures for the next 5 days
weather_forecast_df = weather_measures.forecast_weather_measures(
    forecast_length=5
)
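The helper `forecast_weather_measures` is not shown here. As a hedged sketch of what a request for a 5-day hourly forecast might look like, the example below only builds a request URL for the Open-Meteo forecast endpoint and sends nothing. The function name `build_forecast_url`, the approximate Aalborg coordinates, and the mapping of `forecast_length` to the `forecast_days` parameter are assumptions.

```python
from urllib.parse import urlencode

# Hourly variables matching the weather columns shown in this notebook
HOURLY_VARIABLES = [
    "temperature_2m", "relative_humidity_2m", "precipitation", "rain",
    "snowfall", "weather_code", "cloud_cover", "wind_speed_10m", "wind_gusts_10m",
]

def build_forecast_url(forecast_length, latitude=57.048, longitude=9.9187):
    """Build an Open-Meteo forecast request URL (no request is sent)."""
    params = {
        "latitude": latitude,    # approximate latitude of Aalborg (assumption)
        "longitude": longitude,  # approximate longitude of Aalborg (assumption)
        "hourly": ",".join(HOURLY_VARIABLES),
        "forecast_days": forecast_length,
    }
    return "https://api.open-meteo.com/v1/forecast?" + urlencode(params)

forecast_url = build_forecast_url(forecast_length=5)
# Five days of hourly values give 5 * 24 rows
expected_rows = 5 * 24
```

The expected 120 rows match the size of `weather_forecast_df` reported by `info()` further down.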

In [12]:
# Display the first 5 rows of the weather forecast dataframe
weather_forecast_df.head(5)

Unnamed: 0,timestamp,datetime,date,hour,temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code,cloud_cover,wind_speed_10m,wind_gusts_10m
0,1714953600000,2024-05-06 00:00:00,2024-05-06,0,9.6,93.0,0.2,0.2,0.0,51.0,100.0,14.4,24.8
1,1714957200000,2024-05-06 01:00:00,2024-05-06,1,9.7,93.0,0.0,0.0,0.0,3.0,100.0,14.0,24.8
2,1714960800000,2024-05-06 02:00:00,2024-05-06,2,9.5,91.0,0.0,0.0,0.0,3.0,100.0,14.0,24.8
3,1714964400000,2024-05-06 03:00:00,2024-05-06,3,9.5,91.0,0.0,0.0,0.0,3.0,100.0,13.0,23.4
4,1714968000000,2024-05-06 04:00:00,2024-05-06,4,9.6,92.0,0.0,0.0,0.0,3.0,100.0,14.0,24.1


In [13]:
# Display the last 5 rows of the weather forecast dataframe
weather_forecast_df.tail(5)

Unnamed: 0,timestamp,datetime,date,hour,temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code,cloud_cover,wind_speed_10m,wind_gusts_10m
115,1715367600000,2024-05-10 19:00:00,2024-05-10,19,11.5,68.0,0.0,0.0,0.0,3.0,89.0,5.2,13.0
116,1715371200000,2024-05-10 20:00:00,2024-05-10,20,10.5,71.0,0.0,0.0,0.0,3.0,88.0,3.4,8.6
117,1715374800000,2024-05-10 21:00:00,2024-05-10,21,9.5,74.0,0.0,0.0,0.0,3.0,87.0,2.5,4.3
118,1715378400000,2024-05-10 22:00:00,2024-05-10,22,8.6,78.0,0.0,0.0,0.0,3.0,91.0,2.6,4.3
119,1715382000000,2024-05-10 23:00:00,2024-05-10,23,7.8,81.0,0.0,0.0,0.0,3.0,96.0,2.5,4.3


In [14]:
# Show the information for the weather forecast dataframe
weather_forecast_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   timestamp             120 non-null    int64         
 1   datetime              120 non-null    datetime64[ns]
 2   date                  120 non-null    object        
 3   hour                  120 non-null    int64         
 4   temperature_2m        120 non-null    float64       
 5   relative_humidity_2m  120 non-null    float64       
 6   precipitation         120 non-null    float64       
 7   rain                  120 non-null    float64       
 8   snowfall              120 non-null    float64       
 9   weather_code          120 non-null    float64       
 10  cloud_cover           120 non-null    float64       
 11  wind_speed_10m        120 non-null    float64       
 12  wind_gusts_10m        120 non-null    float64       
dtypes: datetime64[ns](1), float64(9), int64(2), object(1)

### <span style="color:#2656a3;"> 🗓️ Calendar of Danish workdays and holidays</span>
Lastly, the calendar data is loaded. It includes a `workday` attribute indicating whether the date is a workday or not. This column has been encoded from categorical to numerical form in the folder `features/calendar`: `1` indicates a workday and `0` indicates a non-workday.
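The encoding step itself is not shown in this notebook. A minimal sketch of how such a calendar could be prepared is given below, assuming a raw file with a `date` column and a hypothetical label column `type` holding the text "Workday" or "Not a Workday". The function name `encode_calendar` and the label values are assumptions.

```python
import pandas as pd

def encode_calendar(raw):
    """Add date parts and a binary workday flag (illustrative only)."""
    out = pd.DataFrame({"date": pd.to_datetime(raw["date"])})
    out["dayofweek"] = out["date"].dt.dayofweek  # Monday is 0, Sunday is 6
    out["day"] = out["date"].dt.day              # day of the month
    out["month"] = out["date"].dt.month
    out["year"] = out["date"].dt.year
    # Hypothetical label column: "Workday" maps to 1, anything else to 0
    out["workday"] = (raw["type"] == "Workday").astype(int)
    # Store the date as a plain string, as in the dataframes shown in this notebook
    out["date"] = out["date"].dt.strftime("%Y-%m-%d")
    return out

raw_calendar = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-03"],
    "type": ["Not a Workday", "Workday"],
})
encoded_calendar = encode_calendar(raw_calendar)
```

Here 2022-01-01 is a Saturday, so `dayofweek` is 5 and `workday` is 0, while Monday 2022-01-03 gets `dayofweek` 0 and `workday` 1.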

In [15]:
# Fetching the Danish calendar from January 1, 2022 to December 31, 2024 
calender_df = calendar.dk_calendar()

In [16]:
# Display the first 5 rows of the calendar dataframe
calender_df.head(5)

Unnamed: 0,date,dayofweek,day,month,year,workday
0,2022-01-01,5,1,1,2022,0
1,2022-01-02,6,2,1,2022,0
2,2022-01-03,0,3,1,2022,1
3,2022-01-04,1,4,1,2022,1
4,2022-01-05,2,5,1,2022,1


In [17]:
# Display the last 5 rows of the calendar dataframe
calender_df.tail(5)

Unnamed: 0,date,dayofweek,day,month,year,workday
1091,2024-12-27,4,27,12,2024,1
1092,2024-12-28,5,28,12,2024,0
1093,2024-12-29,6,29,12,2024,0
1094,2024-12-30,0,30,12,2024,1
1095,2024-12-31,1,31,12,2024,1


In [18]:
# Show the information for the calendar dataframe
calender_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1096 entries, 0 to 1095
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date       1096 non-null   object
 1   dayofweek  1096 non-null   int64 
 2   day        1096 non-null   int64 
 3   month      1096 non-null   int64 
 4   year       1096 non-null   int64 
 5   workday    1096 non-null   int64 
dtypes: int64(5), object(1)
memory usage: 51.5+ KB


## <span style="color:#2656a3;"> 📡 Connecting to Hopsworks Feature Store</span>

We connect to Hopsworks Feature Store so we can access and create feature groups.

In [19]:
# Importing the hopsworks module for interacting with the Hopsworks platform
import hopsworks

# Logging into the Hopsworks project
project = hopsworks.login()

# Getting the feature store from the project
fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/550040
Connected. Call `.close()` to terminate connection gracefully.


### <span style="color:#2656a3;"> 🪄 Creating Feature Groups</span>
A feature group can be seen as a collection of conceptually related features. In this case, we create feature groups for the
- electricity price data,
- weather data,
- calendar data.

We specify `date` and `timestamp` as the `primary_key` (the calendar feature group uses `date` only), so we are able to join the feature groups when we create a dataset for training later in part `3_training_pipeline`.
We also define a name, a short description of the feature group's contents, and a version number.
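To illustrate why shared keys matter, here is a pandas analogue of such a join on tiny frames built from the first rows shown earlier. It is only a sketch: the training pipeline presumably expresses the join through Hopsworks rather than `merge`, and the frame names are hypothetical.

```python
import pandas as pd

prices = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-01"],
    "timestamp": [1640995200000, 1640998800000],
    "dk1_spotpricedkk_kwh": [0.3722, 0.30735],
})
weather = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-01"],
    "timestamp": [1640995200000, 1640998800000],
    "temperature_2m": [6.7, 6.6],
})
calendar_days = pd.DataFrame({"date": ["2022-01-01"], "workday": [0]})

# The hourly tables share both keys; the daily calendar only shares the date
joined = (
    prices
    .merge(weather, on=["date", "timestamp"], how="inner")
    .merge(calendar_days, on="date", how="inner")
)
```

Each hourly row picks up the calendar attributes of its date, which is why the calendar only needs `date` as its key.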

`event_time` is specified as `timestamp`. If `event_time` is set, the feature group can be used for point-in-time joins.

We've set `online_enabled` to `True` to enable accessing the feature group through the Online API for a Feature View.
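Because rows are identified by their primary key, it can be worth confirming that each key combination is unique before inserting a dataframe. A minimal pandas sketch follows; the helper name `duplicated_keys` is hypothetical and the demo values are made up.

```python
import pandas as pd

def duplicated_keys(df, primary_key):
    """Return all rows whose primary-key combination occurs more than once."""
    return df[df.duplicated(subset=primary_key, keep=False)]

# Made-up frame in which the last two rows share the same key
key_demo = pd.DataFrame({
    "date": ["2022-10-30", "2022-10-30", "2022-10-30"],
    "timestamp": [1667091600000, 1667095200000, 1667095200000],
    "value": [1.0, 2.0, 3.0],
})
clashes = duplicated_keys(key_demo, ["date", "timestamp"])
```

In the notebook this could be called as `duplicated_keys(electricity_df, ["date", "timestamp"])`, where an empty result means the keys are unique.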

In [20]:
# Creating the feature group for the electricity prices
electricity_fg = fs.get_or_create_feature_group(
    name="electricity_prices",
    version=1,
    description="Electricity prices from Energidata API",
    primary_key=["date","timestamp"], 
    online_enabled=True,
    event_time="timestamp",
)

We have now outlined metadata for the feature group. Data hasn't been stored yet, and there's no schema defined. To store data persistently for the feature group, we populate it with its associated data using the `insert` function.

In [21]:
# Inserting the electricity_df into the feature group named electricity_fg
electricity_fg.insert(electricity_df)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/550040/fs/545863/fg/787801


Uploading Dataframe: 100.00% |██████████| Rows 20541/20541 | Elapsed Time: 00:08 | Remaining Time: 00:00


Launching job: electricity_prices_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/550040/jobs/named/electricity_prices_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x12fc0d450>, None)

We add a description for each feature in the feature group. This gives users more information and documentation.

In [22]:
# List of descriptions for electricity features
electricity_feature_descriptions = [
    {"name": "timestamp", "description": "Timestamp of the event time"},
    {"name": "date", "description": "Date of the electricity measurement"},
    {"name": "datetime", "description": "Date and time of the electricity measurement"},
    {"name": "hour", "description": "Hour of the day"},
    {"name": "dk1_spotpricedkk_kwh", "description": "Spot price in DKK per kWh"}, 
]

# Updating feature descriptions
for desc in electricity_feature_descriptions: 
    electricity_fg.update_feature_description(desc["name"], desc["description"])
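A small check can catch columns that were left without a description. The sketch below compares a list of column names with a description list in the format used above; the helper name `undocumented_columns` and the shortened example list are hypothetical.

```python
def undocumented_columns(columns, descriptions):
    """Return the column names that have no entry in the description list."""
    described = {d["name"] for d in descriptions}
    return [c for c in columns if c not in described]

# Shortened example: only two of the six columns are described
example_columns = ["date", "dayofweek", "day", "month", "year", "workday"]
example_descriptions = [
    {"name": "date", "description": "Date in the calendar"},
    {"name": "workday", "description": "Workday is 1 and not a workday is 0"},
]
missing = undocumented_columns(example_columns, example_descriptions)
```

In the notebook this could be called as `undocumented_columns(electricity_df.columns, electricity_feature_descriptions)`.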

We repeat the process for `weather_fg` and `danish_calendar_fg` by creating the feature groups and inserting the dataframes into them.

In [23]:
# Creating the feature group for the weather data
weather_fg = fs.get_or_create_feature_group(
    name="weather_measurements",
    version=1,
    description="Weather measurements from Open Meteo API",
    primary_key=["date", "timestamp"], 
    online_enabled=True,
    event_time="timestamp",
)

In [24]:
# Inserting the historical_weather_df into the feature group named weather_fg
weather_fg.insert(historical_weather_df)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/550040/fs/545863/fg/786783


Uploading Dataframe: 100.00% |██████████| Rows 20520/20520 | Elapsed Time: 00:08 | Remaining Time: 00:00


Launching job: weather_measurements_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/550040/jobs/named/weather_measurements_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x12fbb3b50>, None)

In [25]:
# List of descriptions for weather features
weather_feature_descriptions = [
    {"name": "timestamp", "description": "Timestamp for the weather measurement"},
    {"name": "date", "description": "Date of the weather measurement"},
    {"name": "datetime", "description": "Date and time of the weather measurement"},
    {"name": "hour", "description": "Hour of the day"},
    {"name": "temperature_2m", "description": "Temperature at 2m above ground"},
    {"name": "relative_humidity_2m", "description": "Relative humidity at 2m above ground"},
    {"name": "precipitation", "description": "Precipitation"},
    {"name": "rain", "description": "Rain"},
    {"name": "snowfall", "description": "Snowfall"},   
    {"name": "weather_code", "description": "Weather code"},   
    {"name": "cloud_cover", "description": "Cloud cover"},   
    {"name": "wind_speed_10m", "description": "Wind speed at 10m above ground"},   
    {"name": "wind_gusts_10m", "description": "Wind gusts at 10m above ground"},   
]

# Updating feature descriptions
for desc in weather_feature_descriptions: 
    weather_fg.update_feature_description(desc["name"], desc["description"])

In [26]:
# Creating the feature group for the Danish calendar
danish_calendar_fg = fs.get_or_create_feature_group(
    name="dk_calendar",
    version=1,
    description="Danish calendar",
    primary_key=["date"],
    online_enabled=True,
)

In [27]:
# Inserting the calender_df into the feature group named danish_calendar_fg
danish_calendar_fg.insert(calender_df)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/550040/fs/545863/fg/786784


Uploading Dataframe: 100.00% |██████████| Rows 1096/1096 | Elapsed Time: 00:05 | Remaining Time: 00:00


Launching job: dk_calendar_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/550040/jobs/named/dk_calendar_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x12fe04690>, None)

In [28]:
# List of descriptions for danish_calendar features
danish_calendar_feature_descriptions = [
    {"name": "date", "description": "Date in the calendar"},
    {"name": "dayofweek", "description": "Day number of the week. Monday is 0 and Sunday is 6"},
    {"name": "day", "description": "Day number of the month"},
    {"name": "month", "description": "Month number of the year"},
    {"name": "year", "description": "Year"},
    {"name": "workday", "description": "Workday or not a workday. Workday is 1 and not a workday is 0"},
]

# Updating feature descriptions
for desc in danish_calendar_feature_descriptions: 
    danish_calendar_fg.update_feature_description(desc["name"], desc["description"])

---
## <span style="color:#2656a3;">⏭️ **Next:** Part 02: Feature Pipeline </span>

Next we will generate new data for the Feature Groups.