# CoffeeVolume Prediction Demo
This is a short demonstration of how you can use Hopsworks to build a Machine Learning system that makes predictions. The hypothetical use case is that of a **barista** who wants to predict how much coffee will be consumed in their bar, based on past trends and behaviour.

![](https://blogstudio.s3.theshoppad.net/coffeeheroau/d4459a5d44905ff2cf3c245e7a931675.jpg)

## <span style="color:#ff5f27">📝 Code Library Imports </span>
This notebook imports helpers from two scripts, `coffeevolume.py` and `averages.py`, which we will use later on to generate the historical data, plot it, and compute the second-order features for our predictions.

In [1]:
import datetime
from coffeevolume import generate_historical_data, to_wide_format, plot_historical_id
from averages import calculate_second_order_features

import great_expectations as ge
from great_expectations.core import ExpectationSuite, ExpectationConfiguration

import warnings
warnings.filterwarnings('ignore')

## <span style="color:#ff5f27">⚙️ Coffee Consumption Data Import </span>

In this demo, we generate synthetic coffee consumption data: for every day from a given `START_DATE` onward, we generate a hypothetical volume of coffee consumed.

Therefore, let's define the `START_DATE` variable (a `datetime.date`), which marks the first day of our coffee data generation.

In [2]:
# Define a constant START_DATE with a specific date (September 1, 2023)
START_DATE = datetime.date(2023, 9, 1)

In [3]:
# Generate synthetic historical data using the generate_historical_data function from START_DATE till current date
data_generated = generate_historical_data(
    START_DATE,  # Start date for data generation (September 1, 2023)
)

# Display the first 3 rows of the generated data
data_generated.head(3)

Generating Data: 100%|██████████| 151/151 [00:10<00:00, 14.04it/s]


| | date | id | coffeevolume |
|---|---|---|---|
| 0 | 2023-09-01 | 1788 | 200.0 |
| 1 | 2023-09-01 | 2248 | 200.0 |
| 2 | 2023-09-01 | 4539 | 200.0 |
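
As a rough sanity check on the progress bar above: its 151 iterations are consistent with one step per calendar day from `START_DATE` through 2024-01-29 (the date of the validation run recorded later in this notebook), both ends included. This is an inference from the output, not documented behaviour of `generate_historical_data`, and the helper name `days_inclusive` below is hypothetical.

```python
import datetime


def days_inclusive(start: datetime.date, end: datetime.date) -> list:
    """Return every calendar day from start to end, both ends included."""
    n_days = (end - start).days + 1
    return [start + datetime.timedelta(days=offset) for offset in range(n_days)]


# Assumed run date: 2024-01-29, taken from the validation timestamp in the output
days = days_inclusive(datetime.date(2023, 9, 1), datetime.date(2024, 1, 29))
print(len(days))  # 151
```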


Let's look at the historical values for IDs 1 and 2.

In [4]:
plot_historical_id([1,2], data_generated)

## <span style="color:#ff5f27"> 👮🏻‍♂️ Coffee Volume data validation using Great Expectations</span>

To make sure that the generated data is in the right format, we run it through an "Expectation Suite" from the [Great Expectations](https://github.com/great-expectations/great_expectations) framework, which is integrated into [Hopsworks](https://www.hopsworks.ai).

In [5]:
# Convert the generated historical data DataFrame to a Great Expectations DataFrame
ge_coffeevolume_df = ge.from_pandas(data_generated)

# Retrieve the expectation suite associated with the ge DataFrame
expectation_suite_coffeevolume = ge_coffeevolume_df.get_expectation_suite()

# Set the expectation suite name to "coffeevolume_suite"
expectation_suite_coffeevolume.expectation_suite_name = "coffeevolume_suite"

Now that we have the "Expectation Suite", we can add specific "Expectations" to it: every column that we generated will have to comply with a specific expectation.

In [6]:
# Add expectation for the 'id' column values to be between 0 and 5000
expectation_suite_coffeevolume.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={
            "column": "id",
            "min_value": 0,
            "max_value": 5000,
        }
    )
)

# Add expectation for the 'coffeevolume' column values to be between 0 and 1000
expectation_suite_coffeevolume.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={
            "column": "coffeevolume",
            "min_value": 0,
            "max_value": 1000,
        }
    )
)

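# Loop over the 'date', 'id' and 'coffeevolume' columns and add a null-value expectation for each.
# Note: with "mostly": 0.0 this expectation passes no matter how many values are null;
# use expect_column_values_to_not_be_null to enforce that no nulls are present.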
for column in ['date', 'id', 'coffeevolume']:
    expectation_suite_coffeevolume.add_expectation(
        ExpectationConfiguration(
            expectation_type="expect_column_values_to_be_null",
            kwargs={
                "column": column,
                "mostly": 0.0,
            }
        )
    )
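
To make the `mostly` parameter concrete, here is a minimal plain-Python sketch of the pass/fail rule that Great Expectations applies to column value checks: an expectation succeeds when the fraction of values meeting the condition is at least `mostly`. The helper `expectation_passes` is a hypothetical illustration, not part of the Great Expectations API, and it glosses over details such as how the library treats nulls in range checks.

```python
from typing import Callable, Iterable


def expectation_passes(
    values: Iterable,
    condition: Callable[[object], bool],
    mostly: float = 1.0,
) -> bool:
    """Return True if at least a `mostly` fraction of values satisfy `condition`."""
    values = list(values)
    if not values:
        return True  # nothing to check (an assumption made for this sketch)
    hits = sum(1 for v in values if condition(v))
    return hits / len(values) >= mostly


volumes = [200.0, 213.9, 214.7]  # illustrative values, none missing

in_range = expectation_passes(volumes, lambda v: 0 <= v <= 1000)
null_mostly_zero = expectation_passes(volumes, lambda v: v is None, mostly=0.0)
not_null = expectation_passes(volumes, lambda v: v is not None)
not_null_with_gap = expectation_passes(volumes + [None], lambda v: v is not None)

print(in_range, null_mostly_zero, not_null, not_null_with_gap)
# True True True False
```

Because a threshold of 0.0 is always met, a "values are null" check with `mostly=0.0` cannot fail, even when no value is null. The not-null check with the default threshold is the one that flags the missing value.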


## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>
As we have already installed Hopsworks in a previous cell, all we now need to do is import the library into this notebook, and start establishing the connection to the Hopsworks feature store.

In [7]:
import os

import hopsworks
import hsfs

# To connect to a managed Hopsworks cluster:
conn = hsfs.connection(
    host="172f2800-9e76-11ee-ba4c-277d56d9f8e7.cloud.hopsworks.ai",  # DNS of your Feature Store instance
    project="RixCoffeevolumeDemo",                                   # Name of your Hopsworks Feature Store project
    hostname_verification=False,                                     # Disable for self-signed certificates
    # Read the API key from an environment variable instead of hard-coding it;
    # the variable name HOPSWORKS_API_KEY is an assumption, adjust it to your setup.
    api_key_value=os.environ["HOPSWORKS_API_KEY"],
)
fs = conn.get_feature_store()  # Get the project's default feature store

# To connect to Serverless instead:
# project = hopsworks.login()
# fs = project.get_feature_store()



Connected. Call `.close()` to terminate connection gracefully.


Now that we have a connection to the feature store (called `fs`) we can start working with it in the notebook. Specifically, we first want to start creating feature groups for our Coffee Consumption machine learning assignment.

## <span style="color:#ff5f27">🪄 Creating the **coffeevolume** Feature Group </span>
Based on the synthetic data that we generated in previous notebook cells, we are now going to create the first feature group for our Coffee Volume dataset.

In [8]:
# Get or create the 'coffeevolume' feature group
coffeevolume_fg = fs.get_or_create_feature_group(
    name='coffeevolume',
    description='Coffee Volume Consumption Data',
    version=1,
    primary_key=['id'],
    event_time='date',
    online_enabled=True,
    expectation_suite=expectation_suite_coffeevolume,
)

# Insert data
coffeevolume_fg.insert(data_generated, write_options={"wait": True})

Feature Group created successfully, explore it at 
https://172f2800-9e76-11ee-ba4c-277d56d9f8e7.cloud.hopsworks.ai:443/p/4215/fs/4163/fg/3085
Validation succeeded.
Validation Report saved successfully, explore a summary at https://172f2800-9e76-11ee-ba4c-277d56d9f8e7.cloud.hopsworks.ai:443/p/4215/fs/4163/fg/3085


Uploading Dataframe: 0.00% |          | Rows 0/742735 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: coffeevolume_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://172f2800-9e76-11ee-ba4c-277d56d9f8e7.cloud.hopsworks.ai/p/4215/jobs/named/coffeevolume_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x15f6a7f10>,
 {
   "evaluation_parameters": {},
   "statistics": {
     "evaluated_expectations": 5,
     "successful_expectations": 5,
     "unsuccessful_expectations": 0,
     "success_percent": 100.0
   },
   "results": [
     {
       "exception_info": {
         "raised_exception": false,
         "exception_message": null,
         "exception_traceback": null
       },
       "expectation_config": {
         "kwargs": {
           "column": "coffeevolume",
           "mostly": 0.0
         },
         "expectation_type": "expect_column_values_to_be_null",
         "meta": {
           "expectationId": 3077
         }
       },
       "result": {
         "element_count": 742735,
         "unexpected_count": 742735,
         "unexpected_percent": 100.0,
         "unexpected_percent_total": 100.0,
         "partial_unexpected_list": []
       },
       "meta": {
         "ingestionResult": "INGESTED",
         "validationTime": "2024-01-29T05:01:04.000763Z"


Now that this is done, we can move on to creating derived features that we will _engineer_ from the feature group above.

## <span style="color:#ff5f27">⚙️ **coffeevolume** Feature Engineering  </span>

To predict how much coffee our hypothetical bar owner should expect to serve, we calculate a set of derived features that will be meaningful input to our prediction pipelines later on. We will therefore *engineer* the following features:

- `ma_7`: This feature represents the **7-day moving average** of the 'coffeevolume' data, providing a smoothed representation of short-term coffeevolume trends.

- `ma_14`: This feature represents the **14-day moving average** of the 'coffeevolume' data, offering a slightly longer-term smoothed coffeevolume trend.

- `ma_30`: This feature represents the **30-day moving average** of the 'coffeevolume' data, providing a longer-term smoothed representation of coffeevolume trends.

- `daily_rate_of_change`: This feature calculates the **daily rate of change** in coffee volumes as a percentage change, indicating how much the coffeevolume has changed from the previous day.

- `volatility_30_day`: This feature measures the **volatility of coffee volume over a 30-day window** using the standard deviation. Higher values indicate greater coffee volume fluctuations.

- `ema_02`: This feature calculates the **exponential moving average (EMA)** of 'coffeevolume' with a _smoothing factor of 0.2_, giving more weight to recent data points in the calculation.

- `ema_05`: Similar to `ema_02`, this feature calculates the **EMA of 'coffeevolume'** with a _smoothing factor of 0.5_, which makes it more responsive to recent data.

- `rsi`: The **Relative Strength Index (RSI)** is a momentum oscillator that measures the speed and change of coffee volume movements. It ranges from 0 to 100, with values above 70 indicating higher-than-normal coffeevolume and values below 30 indicating lower-than-normal coffeevolume.
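
The feature definitions above can be sketched with pandas as follows. This is not the code in `averages.py`, which is not shown here. The function name `second_order_features_sketch`, the 14-day RSI window, the simple-average RSI variant, and the choice to fill rows without enough history with 0.0 are all assumptions made for illustration. The EMA columns use pandas' default `adjust=True`, which happens to be consistent with the `ema_02` and `ema_05` values in the sample output later in this notebook.

```python
import pandas as pd


def second_order_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for calculate_second_order_features (assumed logic)."""
    df = df.sort_values(["id", "date"]).copy()
    volume = df.groupby("id")["coffeevolume"]

    # Simple moving averages over 7, 14 and 30 days, computed per id
    for window in (7, 14, 30):
        df[f"ma_{window}"] = volume.transform(lambda s, w=window: s.rolling(w).mean())

    # Percentage change versus the previous day
    df["daily_rate_of_change"] = volume.diff() / volume.shift() * 100

    # Rolling 30-day standard deviation
    df["volatility_30_day"] = volume.transform(lambda s: s.rolling(30).std())

    # Exponential moving averages with smoothing factors 0.2 and 0.5
    df["ema_02"] = volume.transform(lambda s: s.ewm(alpha=0.2).mean())
    df["ema_05"] = volume.transform(lambda s: s.ewm(alpha=0.5).mean())

    # RSI from 14-day simple averages of gains and losses (window is an assumption).
    # 100 * gain / (gain + loss) is equivalent to 100 - 100 / (1 + gain / loss).
    delta = volume.diff()
    avg_gain = delta.clip(lower=0).groupby(df["id"]).transform(lambda s: s.rolling(14).mean())
    avg_loss = (-delta).clip(lower=0).groupby(df["id"]).transform(lambda s: s.rolling(14).mean())
    df["rsi"] = 100 * avg_gain / (avg_gain + avg_loss)

    # Rows without enough history yet are filled with 0.0 (assumption)
    feature_cols = [c for c in df.columns if c not in ("date", "id", "coffeevolume")]
    df[feature_cols] = df[feature_cols].fillna(0.0)
    return df


# Made-up data: id 1 rises by one unit per day, id 2 falls by one unit per day
demo = pd.DataFrame({
    "date": list(pd.date_range("2023-09-01", periods=40)) * 2,
    "id": [1] * 40 + [2] * 40,
    "coffeevolume": [200.0 + i for i in range(40)] + [300.0 - i for i in range(40)],
})
features = second_order_features_sketch(demo)
print(features[features["id"] == 1].iloc[[0, 1, 6]])
```

Grouping by `id` before every rolling or exponential calculation keeps one series from leaking into the next, which matters because the long-format data interleaves many IDs.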

In [9]:
# Read the coffee volume data from the 'coffeevolume' feature group
coffeevolume_df = coffeevolume_fg.read(online=True)
coffeevolume_df.head(3)

| | date | id | coffeevolume |
|---|---|---|---|
| 0 | 2024-01-28 | 1716 | 213.9 |
| 1 | 2024-01-29 | 4383 | 214.7 |
| 2 | 2024-01-28 | 1091 | 213.8 |
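
The rows returned by the online read all carry the most recent dates, which is consistent with an online store that keeps only the latest row per primary key (`id`). That would also explain why the next cell computes the features from `data_generated`, which holds the full history, although the notebook does not say so explicitly. Here is a minimal pandas sketch of that "latest row per key" behaviour, using made-up values:

```python
import pandas as pd

# Illustrative history for two ids; the volumes are invented for this sketch
history = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-27", "2024-01-28", "2024-01-28", "2024-01-29"]),
    "id": [1716, 1716, 4383, 4383],
    "coffeevolume": [212.5, 213.9, 214.1, 214.7],
})

# Keep only the most recent row for each id
latest = (
    history.sort_values("date")
    .groupby("id")
    .tail(1)
    .reset_index(drop=True)
)
print(latest)
```

Moving averages and other windowed features need the full time series per `id`, so a one-row-per-key snapshot like `latest` would not be enough to compute them.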


In [10]:
# Calculate second-order features
#coffeevolume_averages_df = calculate_second_order_features(coffeevolume_df)
coffeevolume_averages_df = calculate_second_order_features(data_generated)

# Display the first 3 rows of the resulting DataFrame
coffeevolume_averages_df.head(3)

| id (index) | | date | id | ma_7 | ma_14 | ma_30 | daily_rate_of_change | volatility_30_day | ema_02 | ema_05 | rsi |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 298 | 2023-09-01 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 200.0 | 200.0 | 0.0 |
| 0 | 6160 | 2023-09-02 | 0 | 0.0 | 0.0 | 0.0 | -1.35 | 0.0 | 198.5 | 198.2 | 0.0 |
| 0 | 14213 | 2023-09-03 | 0 | 0.0 | 0.0 | 0.0 | 5.017739 | 0.0 | 202.065574 | 203.342857 | 0.0 |


## <span style="color:#ff5f27">🪄 Creating feature groups for the newly engineered features </span>

Now that we have calculated the newly engineered features, we will add them to a separate Hopsworks feature group:

In [11]:
# Get or create the 'coffeevolume_averages' feature group
coffeevolume_averages_fg = fs.get_or_create_feature_group(
    name='coffeevolume_averages',
    description='Calculated second-order coffeevolume features',
    version=1,
    primary_key=['id'],
    event_time='date',
    online_enabled=True,
    parents=[coffeevolume_fg],
)
# Insert data
coffeevolume_averages_fg.insert(coffeevolume_averages_df)

Feature Group created successfully, explore it at 
https://172f2800-9e76-11ee-ba4c-277d56d9f8e7.cloud.hopsworks.ai:443/p/4215/fs/4163/fg/3086


Uploading Dataframe: 0.00% |          | Rows 0/742735 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: coffeevolume_averages_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://172f2800-9e76-11ee-ba4c-277d56d9f8e7.cloud.hopsworks.ai/p/4215/jobs/named/coffeevolume_averages_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x1605a7510>, None)

---