# BeerVolume Prediction Demo
This is a short demonstration of how to use Hopsworks to build a machine learning system that generates predictions. The hypothetical use case is a **bar owner** who wants to predict how much beer will be consumed in their bar, based on past trends and behaviour.

![](https://lh3.googleusercontent.com/blogger_img_proxy/ALY8t1uqu0YUTdfoFJYGV2r9a2iHEewpP3daVa9J3qzCzV3rZm8EX8YyhHhOHbfG450AhHYQXu6Hgf8pj2fTpSzg4uio4X_qv9TTEfMnEtO6rYLevnGBxF6sO97tGeYyzaAkGSyVBnw8WtWS1P_2RLY=s0-d)

## <span style="color:#ff5f27">📝 Code Library Imports </span>
We import two helper scripts into this notebook, `beervolume.py` and `averages.py`. We will use them later to generate the historical data, plot it, and compute the second-order features for our predictions.

In [2]:
import datetime
from beervolume import generate_historical_data, to_wide_format, plot_historical_id
from averages import calculate_second_order_features

import great_expectations as ge
from great_expectations.core import ExpectationSuite, ExpectationConfiguration

import warnings
warnings.filterwarnings('ignore')

## <span style="color:#ff5f27">⚙️ Beer Consumption Data Import </span>

In this demo, we will be generating synthetic beer consumption data. For every day after a certain `START_DATE`, we will be generating a supposed volume of beer that has been consumed.

Therefore, let's define the `START_DATE` variable (format: %Y-%m-%d) which will indicate the start date for our beer data generation.

In [3]:
# Define a constant START_DATE with a specific date (September 1, 2023)
START_DATE = datetime.date(2023, 9, 1)

In [4]:
# Generate synthetic historical data using the generate_historical_data function from START_DATE till current date
data_generated = generate_historical_data(
    START_DATE,  # Start date for data generation (September 1, 2023)
)

# Display the first 3 rows of the generated data
data_generated.head(3)

Generating Data: 100%|██████████| 151/151 [00:11<00:00, 13.70it/s]


,date,id,beervolume
0,2023-09-01,1687,200.0
1,2023-09-01,1817,200.0
2,2023-09-01,3457,200.0
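
The source of `generate_historical_data` is not shown in this notebook. For readers without `beervolume.py`, the sketch below shows one way such a generator could work: a clipped random walk per ID, starting at 200.0 as in the output above. The function name `generate_synthetic_beervolume`, its parameters, and the noise model are all assumptions made for illustration, not the actual implementation.

```python
import datetime

import numpy as np
import pandas as pd


def generate_synthetic_beervolume(start_date, end_date, ids, seed=0):
    """Hypothetical stand-in for generate_historical_data: one row per (date, id)."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range(start_date, end_date, freq="D")
    frames = []
    for id_ in ids:
        # Daily steps of a random walk; the first day stays at the base volume
        steps = rng.normal(loc=0.0, scale=2.0, size=len(dates))
        steps[0] = 0.0
        # Keep volumes inside the 0..1000 range that the validation step expects
        volume = np.clip(200.0 + np.cumsum(steps), 0.0, 1000.0).round(1)
        frames.append(pd.DataFrame({"date": dates, "id": id_, "beervolume": volume}))
    return pd.concat(frames).sort_values(["date", "id"], ignore_index=True)


sample = generate_synthetic_beervolume(
    datetime.date(2023, 9, 1), datetime.date(2023, 9, 7), ids=[1, 2]
)
```

The result has the same three columns (`date`, `id`, `beervolume`) as the generated data above, so it can serve as a small test fixture for the later steps.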


Let's look at the historical values for IDs 1 and 2.

In [5]:
plot_historical_id([1,2], data_generated)

## <span style="color:#ff5f27"> 👮🏻‍♂️ Beer Volume data validation using Great Expectations</span>

To make sure that the generated data is in the right format, we will run it through an "Expectation Suite" from the [Great Expectations](https://github.com/great-expectations/great_expectations) framework, which is integrated into [Hopsworks](https://www.hopsworks.ai).

In [6]:
# Convert the generated historical data DataFrame to a Great Expectations DataFrame
ge_beervolume_df = ge.from_pandas(data_generated)

# Retrieve the expectation suite associated with the ge DataFrame
expectation_suite_beervolume = ge_beervolume_df.get_expectation_suite()

# Set the expectation suite name to "beervolume_suite"
expectation_suite_beervolume.expectation_suite_name = "beervolume_suite"

Now that we have the "Expectation Suite", we can add specific "Expectations" to it: every column that we generated will have to comply with a specific expectation.

In [7]:
# Add expectation for the 'id' column values to be between 0 and 5000
expectation_suite_beervolume.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={
            "column": "id",
            "min_value": 0,
            "max_value": 5000,
        }
    )
)

# Add expectation for the 'beervolume' column values to be between 0 and 1000
expectation_suite_beervolume.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={
            "column": "beervolume",
            "min_value": 0,
            "max_value": 1000,
        }
    )
)

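# Loop through the columns ('date', 'id', 'beervolume') and add a null-value expectation for each.
# Note: with "mostly": 0.0 this expectation is satisfied by any data; to enforce the absence
# of nulls, "expect_column_values_to_not_be_null" would be the stricter choice.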
for column in ['date', 'id', 'beervolume']:
    expectation_suite_beervolume.add_expectation(
        ExpectationConfiguration(
            expectation_type="expect_column_values_to_be_null",
            kwargs={
                "column": column,
                "mostly": 0.0,
            }
        )
    )
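
For intuition, the checks in the suite can be expressed in plain pandas. The sketch below is not how Great Expectations evaluates them internally; the helper names and the sample frame are hypothetical. It also illustrates the `mostly` parameter, which is the minimum fraction of rows that must satisfy an expectation.

```python
import pandas as pd


def fraction_between(series, min_value, max_value):
    """Fraction of non-null values inside [min_value, max_value] (nulls are skipped here)."""
    non_null = series.dropna()
    if non_null.empty:
        return 1.0
    return float(non_null.between(min_value, max_value).mean())


def fraction_null(series):
    """Fraction of values that are null."""
    return float(series.isna().mean())


def passes(fraction, mostly=1.0):
    """An expectation passes when at least `mostly` of the rows satisfy it."""
    return fraction >= mostly


# Hypothetical sample with one out-of-range value and one missing value
df = pd.DataFrame({"id": [1, 2, 3], "beervolume": [200.0, 1200.0, None]})

in_range = passes(fraction_between(df["beervolume"], 0, 1000))    # 1200.0 is out of range
null_check = passes(fraction_null(df["beervolume"]), mostly=0.0)  # a 0.0 threshold is always met
not_null = passes(1.0 - fraction_null(df["beervolume"]))          # one value is missing
```

In this sample the range check and the strict not-null check fail, while the null check with `mostly=0.0` passes regardless of the data.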


## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>
As we have already installed Hopsworks in a previous cell, all we need to do now is import the library into this notebook and establish the connection to the Hopsworks feature store.

In [8]:
import hopsworks

# To connect to a managed Hopsworks instance:
import hsfs
conn = hsfs.connection(
    host="172f2800-9e76-11ee-ba4c-277d56d9f8e7.cloud.hopsworks.ai",                                # DNS of your Feature Store instance
    project="RixBeervolumeDemo",                      # Name of your Hopsworks Feature Store project
    hostname_verification=False,                     # Disable for self-signed certificates
    api_key_value="Q0sPuOSFpsuwdIa0.pfBCpgAAnPr3C3J49BvEdeJvfoqTkwQihEotXupzz23FPzDdJpexwHmXyRB8ACDf"          # Feature store API key value 
)
fs = conn.get_feature_store()           # Get the project's default feature store

# To connect to Serverless:
#import hopsworks
#project = hopsworks.login()
#fs = project.get_feature_store() 



Connected. Call `.close()` to terminate connection gracefully.
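
Instead of pasting the API key into the notebook, the connection settings can be read from the environment. This is a minimal sketch: the environment variable names `HOPSWORKS_HOST`, `HOPSWORKS_PROJECT`, and `HOPSWORKS_API_KEY` and the helper `connection_kwargs_from_env` are assumptions made for illustration. The returned dictionary mirrors the keyword arguments passed to `hsfs.connection` in the cell above.

```python
import os

# Hypothetical variable names; pick whatever fits your deployment
REQUIRED_VARS = ("HOPSWORKS_HOST", "HOPSWORKS_PROJECT", "HOPSWORKS_API_KEY")


def connection_kwargs_from_env(env=None):
    """Build keyword arguments for the feature store connection from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "host": env["HOPSWORKS_HOST"],
        "project": env["HOPSWORKS_PROJECT"],
        "api_key_value": env["HOPSWORKS_API_KEY"],
        "hostname_verification": False,  # same setting as in the cell above
    }


# Usage (requires a reachable Hopsworks instance):
# conn = hsfs.connection(**connection_kwargs_from_env())
```

Keeping the key out of the notebook means it does not end up in exported copies or version control.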


Now that we have a connection to the feature store (called `fs`) we can start working with it in the notebook. Specifically, we first want to start creating feature groups for our Beer Consumption machine learning assignment.

## <span style="color:#ff5f27">🪄 Creating the **beervolume** Feature Group </span>
Based on the synthetic data that we generated in previous notebook cells, we are now going to create the first feature group for our Beer Volume dataset.

In [9]:
# Get or create the 'beervolume' feature group
beervolume_fg = fs.get_or_create_feature_group(
    name='beervolume',
    description='Beer Volume Consumption Data',
    version=1,
    primary_key=['id'],
    event_time='date',
    online_enabled=True,
    expectation_suite=expectation_suite_beervolume,
)
# Insert data
beervolume_fg.insert(data_generated, write_options={"wait": True})

Validation succeeded.
Validation Report saved successfully, explore a summary at https://172f2800-9e76-11ee-ba4c-277d56d9f8e7.cloud.hopsworks.ai:443/p/120/fs/68/fg/19


Uploading Dataframe: 0.00% |          | Rows 0/742858 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: beervolume_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://172f2800-9e76-11ee-ba4c-277d56d9f8e7.cloud.hopsworks.ai/p/120/jobs/named/beervolume_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x157ad5ad0>,
 {
   "statistics": {
     "evaluated_expectations": 5,
     "successful_expectations": 5,
     "unsuccessful_expectations": 0,
     "success_percent": 100.0
   },
   "success": true,
   "evaluation_parameters": {},
   "meta": {
     "great_expectations_version": "0.15.12",
     "expectation_suite_name": "beervolume_suite",
     "run_id": {
       "run_name": null,
       "run_time": "2024-01-29T16:22:37.484888+00:00"
     },
     "batch_kwargs": {
       "ge_batch_id": "9dc39524-bec2-11ee-a048-a2df1663ba57"
     },
     "batch_markers": {},
     "batch_parameters": {},
     "validation_time": "20240129T162237.484801Z",
     "expectation_suite_meta": {
       "great_expectations_version": "0.15.12"
     }
   },
   "results": [
     {
       "expectation_config": {
         "kwargs": {
           "column": "id",
           "mostly": 0.0
         },
         "meta": {
           "expectationId": 19
         },
         "expectation_type": "expect_colu

Now that this is done, we can move on to creating derived features that we will _engineer_ from the feature group above.

## <span style="color:#ff5f27">⚙️ **beervolume** Feature Engineering  </span>

To predict the volume of beer that our hypothetical bar owner needs to plan for, we will calculate a set of derived features that serve as meaningful input to our prediction pipelines later on. We will therefore *engineer* the following features:

- `ma_7`: This feature represents the **7-day moving average** of the 'beervolume' data, providing a smoothed representation of short-term beervolume trends.

- `ma_14`: This feature represents the **14-day moving average** of the 'beervolume' data, offering a slightly longer-term smoothed beervolume trend.

- `ma_30`: This feature represents the **30-day moving average** of the 'beervolume' data, providing a longer-term smoothed representation of beervolume trends.

- `daily_rate_of_change`: This feature calculates the **daily rate of change** in beer volumes as a percentage change, indicating how much the beervolume has changed from the previous day.

- `volatility_30_day`: This feature measures the **volatility of beer volume over a 30-day window** using the standard deviation. Higher values indicate greater beer volume fluctuations.

- `ema_02`: This feature calculates the **exponential moving average (EMA)** of 'beervolume' with a _smoothing factor of 0.2_, giving more weight to recent data points in the calculation.

- `ema_05`: Similar to `ema_02`, this feature calculates the **EMA of 'beervolume'** with a _smoothing factor of 0.5_, making it more responsive to recent data.

- `rsi`: The **Relative Strength Index (RSI)** is a momentum oscillator that measures the speed and change of beer volume movements. It ranges from 0 to 100, with values above 70 indicating higher-than-normal beervolume and values below 30 indicating lower-than-normal beervolume.
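
The implementation of `calculate_second_order_features` lives in `averages.py` and is not shown here. As an illustration of the definitions above, the following pandas sketch computes the same kinds of features per `id`. The function name `second_order_features_sketch`, the 14-day RSI window, and the simple-average RSI variant are assumptions; rows without enough history are left as `NaN`.

```python
import pandas as pd


def second_order_features_sketch(df):
    """Illustrative per-id features; not the actual calculate_second_order_features."""
    df = df.sort_values(["id", "date"]).copy()
    grouped = df.groupby("id")["beervolume"]

    # Moving averages over 7, 14 and 30 days
    for window in (7, 14, 30):
        df[f"ma_{window}"] = grouped.transform(lambda s, w=window: s.rolling(w).mean())

    # Day-over-day change in percent
    df["daily_rate_of_change"] = grouped.pct_change() * 100

    # Rolling 30-day standard deviation
    df["volatility_30_day"] = grouped.transform(lambda s: s.rolling(30).std())

    # Exponential moving averages with smoothing factors 0.2 and 0.5
    df["ema_02"] = grouped.transform(lambda s: s.ewm(alpha=0.2).mean())
    df["ema_05"] = grouped.transform(lambda s: s.ewm(alpha=0.5).mean())

    # RSI: 100 - 100 / (1 + average gain / average loss), assumed 14-day window
    delta = grouped.diff()
    avg_gain = delta.clip(lower=0).groupby(df["id"]).transform(lambda s: s.rolling(14).mean())
    avg_loss = (-delta.clip(upper=0)).groupby(df["id"]).transform(lambda s: s.rolling(14).mean())
    df["rsi"] = 100 - 100 / (1 + avg_gain / avg_loss)
    return df


# Hypothetical input: 40 days of steadily rising volume for two ids
dates = pd.date_range("2023-09-01", periods=40, freq="D")
demo = pd.concat(
    [pd.DataFrame({"date": dates, "id": i, "beervolume": [float(v) for v in range(1, 41)]}) for i in (1, 2)],
    ignore_index=True,
)
features = second_order_features_sketch(demo)
```

Every feature is computed within a single `id`, so the history of one series never leaks into another.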

In [10]:
# Read the beer volume data from the 'beervolume' feature group
beervolume_df = beervolume_fg.read(online=True)
beervolume_df.head(3)

,date,id,beervolume
0,2024-01-29,2781,217.3
1,2024-01-29,4749,217.8
2,2024-01-29,2851,217.1
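
All rows returned above carry the most recent date. This is consistent with the Hopsworks online store keeping only the latest row for each primary key, while the full history is kept in the offline store. The effect can be sketched in plain pandas; the data below is made up for illustration.

```python
import pandas as pd

# Hypothetical history for two ids
history = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2024-01-27", "2024-01-28", "2024-01-29", "2024-01-28", "2024-01-29"]
        ),
        "id": [1, 1, 1, 2, 2],
        "beervolume": [215.0, 216.2, 217.3, 216.9, 217.8],
    }
)

# Keep only the most recent row per primary key ('id'), ordered by event time ('date')
latest_per_id = (
    history.sort_values("date")
    .drop_duplicates(subset="id", keep="last")
    .sort_values("id")
    .reset_index(drop=True)
)
```

Rolling features such as `ma_30` need the full history rather than one row per `id`, which may be why the next cell computes them from `data_generated` instead of `beervolume_df`.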


In [11]:
# Calculate second-order features
#beervolume_averages_df = calculate_second_order_features(beervolume_df)
beervolume_averages_df = calculate_second_order_features(data_generated)

# Display the first 3 rows of the resulting DataFrame
beervolume_averages_df.head(3)

id,,date,id,ma_7,ma_14,ma_30,daily_rate_of_change,volatility_30_day,ema_02,ema_05,rsi
0,7038,2023-09-02,0,0.0,0.0,0.0,0.0,0.0,199.4,199.4,0.0
0,8765,2023-09-02,0,0.0,0.0,0.0,2.056169,0.0,201.677778,202.133333,0.0
0,9264,2023-09-02,0,0.0,0.0,0.0,0.589681,0.0,202.916393,203.6,0.0


## <span style="color:#ff5f27">🪄 Creating feature groups for the newly engineered features </span>

Now that we have calculated the newly engineered features, we will add them to a separate Hopsworks feature group:

In [12]:
# Get or create the 'beervolume_averages' feature group
beervolume_averages_fg = fs.get_or_create_feature_group(
    name='beervolume_averages',
    description='Calculated second-order beervolume features',
    version=1,
    primary_key=['id'],
    event_time='date',
    online_enabled=True,
    parents=[beervolume_fg],
)
# Insert data
beervolume_averages_fg.insert(beervolume_averages_df)

Uploading Dataframe: 0.00% |          | Rows 0/742858 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: beervolume_averages_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://172f2800-9e76-11ee-ba4c-277d56d9f8e7.cloud.hopsworks.ai/p/120/jobs/named/beervolume_averages_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x157ad7e10>, None)

---