# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="https://github.com/xrisaD/ScalableMLProject/images/icon102.png?raw=1" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 04: Batch Predictions</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/{project_name}/{notebook_name}.ipynb)


## 🗒️ This notebook is divided into the following sections:

1. Load the training data
2. Train the model
3. Register the model in the Hopsworks model registry


### <span style='color:#ff5f27'> 📝 Imports </span>

In [40]:
import pandas as pd

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import f1_score

import warnings
warnings.filterwarnings("ignore")

---

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [2]:
import hopsworks

project = hopsworks.login() 

fs = project.get_feature_store() 

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/5318




Connected. Call `.close()` to terminate connection gracefully.


---

## <span style="color:#ff5f27;">🪝 Feature View and Training Dataset Retrieval</span>

In [3]:
feature_view = fs.get_feature_view(
    name = 'streamflow_fv',
    version = 1
)

In [5]:
train_data = feature_view.get_training_data(1)[0]

train_data.head()

| | date | streamflow | place | temperature_2m_max | temperature_2m_min | precipitation_sum | rain_sum | snowfall_sum | precipitation_hours | windspeed_10m_max | windgusts_10m_max | winddirection_10m_dominant | et0_fao_evapotranspiration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1661810400000 | 123.9365 | Abisko | 8.6 | 5.6 | 4.8 | 4.8 | 0.0 | 16.0 | 10.3 | 34.9 | 158.0 | 0.93 |
| 1 | 1615417200000 | 4.1078 | Spånga | 3.7 | 0.1 | 15.3 | 5.7 | 6.79 | 22.0 | 29.5 | 66.6 | 160.0 | 0.52 |
| 2 | 1625522400000 | 1.398 | Spånga | 24.1 | 15.5 | 4.8 | 4.8 | 0.0 | 9.0 | 15.8 | 38.2 | 170.0 | 3.67 |
| 3 | 1635372000000 | 11.8726 | Spånga | 12.6 | 9.2 | 0.8 | 0.8 | 0.0 | 3.0 | 22.4 | 49.7 | 210.0 | 0.34 |
| 4 | 1639868400000 | 7.0708 | Spånga | 4.0 | 0.7 | 0.0 | 0.0 | 0.0 | 0.0 | 22.0 | 41.4 | 331.0 | 0.6 |
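The `date` column holds Unix timestamps in milliseconds. As a minimal sketch, the first two values from the preview above can be converted into readable timestamps as follows. The `Europe/Stockholm` time zone is an assumption based on the Swedish place names; the notebook itself does not state it.

```python
import pandas as pd

# Two `date` values copied from the training data preview (epoch milliseconds).
sample = pd.DataFrame({
    "date": [1661810400000, 1615417200000],
    "place": ["Abisko", "Spånga"],
})

# Interpret the integers as milliseconds since the Unix epoch (UTC).
sample["date_utc"] = pd.to_datetime(sample["date"], unit="ms", utc=True)

# Assumption: the values represent local Swedish dates, so convert for display.
sample["date_local"] = sample["date_utc"].dt.tz_convert("Europe/Stockholm")

print(sample[["place", "date_utc", "date_local"]])
```

Under that assumption, both timestamps fall on local midnight, which suggests one row per place and calendar day.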


---

## <span style="color:#ff5f27;">🧬 Modeling</span>

In [8]:
# Sort newest first so that, within each place, the row above is the following day.
train_data = train_data.sort_values(by=["date", "place"], ascending=[False, True]).reset_index(drop=True)

train_data.head(5)

| | date | streamflow | place | temperature_2m_max | temperature_2m_min | precipitation_sum | rain_sum | snowfall_sum | precipitation_hours | windspeed_10m_max | windgusts_10m_max | winddirection_10m_dominant | et0_fao_evapotranspiration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1671231600000 | 33.2854 | Abisko | -7.2 | -9.9 | 0.3 | 0.0 | 0.21 | 3.0 | 6.9 | 26.6 | 293.0 | 0.0 |
| 1 | 1671231600000 | 5.3238 | Spånga | 0.7 | -7.9 | 1.7 | 0.1 | 1.12 | 8.0 | 21.3 | 48.6 | 257.0 | 0.08 |
| 2 | 1671231600000 | 3.0196 | Uppsala | -0.9 | -12.4 | 1.1 | 0.0 | 0.77 | 5.0 | 19.8 | 40.0 | 255.0 | 0.06 |
| 3 | 1671145200000 | 32.8515 | Abisko | -4.5 | -9.7 | 2.0 | 0.0 | 1.4 | 9.0 | 9.6 | 35.3 | 243.0 | 0.0 |
| 4 | 1671145200000 | 5.1158 | Spånga | -4.9 | -14.7 | 0.0 | 0.0 | 0.0 | 0.0 | 14.8 | 28.4 | 253.0 | 0.04 |


In [9]:
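# With the newest-first sort, shift(1) within each place pulls in the following day's streamflow as the target.
train_data["streamflow_next_day"] = train_data.groupby('place')['streamflow'].shift(1)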

train_data.head(5)

| | date | streamflow | place | temperature_2m_max | temperature_2m_min | precipitation_sum | rain_sum | snowfall_sum | precipitation_hours | windspeed_10m_max | windgusts_10m_max | winddirection_10m_dominant | et0_fao_evapotranspiration | streamflow_next_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1671231600000 | 33.2854 | Abisko | -7.2 | -9.9 | 0.3 | 0.0 | 0.21 | 3.0 | 6.9 | 26.6 | 293.0 | 0.0 | NaN |
| 1 | 1671231600000 | 5.3238 | Spånga | 0.7 | -7.9 | 1.7 | 0.1 | 1.12 | 8.0 | 21.3 | 48.6 | 257.0 | 0.08 | NaN |
| 2 | 1671231600000 | 3.0196 | Uppsala | -0.9 | -12.4 | 1.1 | 0.0 | 0.77 | 5.0 | 19.8 | 40.0 | 255.0 | 0.06 | NaN |
| 3 | 1671145200000 | 32.8515 | Abisko | -4.5 | -9.7 | 2.0 | 0.0 | 1.4 | 9.0 | 9.6 | 35.3 | 243.0 | 0.0 | 33.2854 |
| 4 | 1671145200000 | 5.1158 | Spånga | -4.9 | -14.7 | 0.0 | 0.0 | 0.0 | 0.0 | 14.8 | 28.4 | 253.0 | 0.04 | 5.3238 |
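The target construction relies on the sort order: with the newest rows first, `shift(1)` within each place copies the value from the row above, which is the following day. A small sketch on made-up data (the places `A` and `B` and all values are hypothetical) makes this visible:

```python
import pandas as pd

# Toy data: two places, three consecutive days each (values are made up).
toy = pd.DataFrame({
    "date": [3, 3, 2, 2, 1, 1],
    "place": ["A", "B", "A", "B", "A", "B"],
    "streamflow": [30.0, 5.0, 20.0, 4.0, 10.0, 3.0],
})

# Newest rows first, mirroring the sort used in the notebook.
toy = toy.sort_values(by=["date", "place"], ascending=[False, True]).reset_index(drop=True)

# Within each place, shift(1) copies the streamflow from the row above,
# which is the following day because of the descending sort.
toy["streamflow_next_day"] = toy.groupby("place")["streamflow"].shift(1)

print(toy)
```

The most recent day of each place has no following day, so its target is `NaN`, just as in the first three rows of the output above.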


In [76]:
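# `place` is dropped for now. TODO: add it back once it is one-hot encoded in the pipeline.
# Note: fillna(0) also sets the unknown next-day target of the most recent rows to 0.
X = train_data.drop(columns=["date", "place"]).fillna(0)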
y = X.pop("streamflow_next_day")
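Because `fillna(0)` is applied before the target is popped, the most recent row of each place, which has no next-day value yet, is trained with a target of 0. One possible alternative, sketched here on made-up data, is to drop those rows and fill only the features. This is an illustration, not what the notebook does, and applying it would shift the row positions used in the outputs further down.

```python
import numpy as np
import pandas as pd

# Made-up rows: the first two have no known next-day streamflow.
toy = pd.DataFrame({
    "streamflow": [30.0, 5.0, 20.0, 4.0],
    "rain_sum": [0.0, 0.5, np.nan, 0.4],
    "streamflow_next_day": [np.nan, np.nan, 30.0, 5.0],
})

# Drop rows whose target is unknown instead of treating it as 0.
labelled = toy.dropna(subset=["streamflow_next_day"])

# Missing feature values can still be filled; 0 is kept here to match the notebook.
X_toy = labelled.fillna(0)
y_toy = X_toy.pop("streamflow_next_day")

print(X_toy)
print(y_toy)
```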

In [46]:
X.head()

| | streamflow | temperature_2m_max | temperature_2m_min | precipitation_sum | rain_sum | snowfall_sum | precipitation_hours | windspeed_10m_max | windgusts_10m_max | winddirection_10m_dominant | et0_fao_evapotranspiration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33.2854 | -7.2 | -9.9 | 0.3 | 0.0 | 0.21 | 3.0 | 6.9 | 26.6 | 293.0 | 0.0 |
| 1 | 5.3238 | 0.7 | -7.9 | 1.7 | 0.1 | 1.12 | 8.0 | 21.3 | 48.6 | 257.0 | 0.08 |
| 2 | 3.0196 | -0.9 | -12.4 | 1.1 | 0.0 | 0.77 | 5.0 | 19.8 | 40.0 | 255.0 | 0.06 |
| 3 | 32.8515 | -4.5 | -9.7 | 2.0 | 0.0 | 1.4 | 9.0 | 9.6 | 35.3 | 243.0 | 0.0 |
| 4 | 5.1158 | -4.9 | -14.7 | 0.0 | 0.0 | 0.0 | 0.0 | 14.8 | 28.4 | 253.0 | 0.04 |


### <span style='color:#ff5f27'> Create Pipeline </span>
1. Transformations
2. Model Selection

In [117]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Numeric columns to standardize.
numeric_features = [
    'streamflow', 'temperature_2m_max', 'temperature_2m_min',
    'precipitation_sum', 'rain_sum', 'snowfall_sum', 'precipitation_hours',
    'windspeed_10m_max', 'windgusts_10m_max', 'winddirection_10m_dominant',
    'et0_fao_evapotranspiration',
]

column_trans = ColumnTransformer(
    [
        # ('categories', OneHotEncoder(), ['place']),  # disabled while `place` is dropped from X
        ('numeric_scaler', StandardScaler(), numeric_features),
    ],
    verbose_feature_names_out=True,
)
pipe = Pipeline([('transformations', column_trans), ('gb', GradientBoostingRegressor())])

pipe.fit(X, y)

### <span style='color:#ff5f27'> 👨🏻‍⚖️ Model Validation </span>

In [118]:
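# Note: F1 is a classification metric. Targets and predictions are truncated to integers here,
# so the micro-averaged score is the share of exact integer matches on the training data.
f1 = f1_score(y.astype('int'), [int(pred) for pred in pipe.predict(X)], average='micro')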
f1

0.5322033898305085
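The F1 score above is computed on integer-truncated values from the training data itself. For a regression target such as streamflow, error-based metrics on held-out data are usually more informative. The following is a sketch under stated assumptions: the data is synthetic, and the rows are assumed to be ordered oldest to newest so that the last 20% can serve as a chronological hold-out.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic regression data (made up for illustration).
rng = np.random.default_rng(42)
n = 200
X_demo = rng.uniform(0, 10, size=(n, 3))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(0, 0.5, size=n)

# Assumption: rows are ordered oldest to newest, so the tail is the hold-out set.
split = int(n * 0.8)
X_train, X_test = X_demo[:split], X_demo[split:]
y_train, y_test = y_demo[:split], y_demo[split:]

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

MAE and RMSE are expressed in the same unit as the target, which makes them easier to relate to the streamflow values than an exact-match rate.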

In [111]:
y.iloc[4:10].values

array([ 5.3238,  3.0196, 32.8515,  5.1158,  2.98  , 32.9945])

In [121]:
pred_df = pd.DataFrame({
    'streamflow_real': y.iloc[4:10].values,
    'streamflow_pred': list(map(int, pipe.predict(X.iloc[4:10]))),  # truncate to whole numbers
})
pred_df

| | streamflow_real | streamflow_pred |
|---|---|---|
| 0 | 5.3238 | 4 |
| 1 | 3.0196 | 2 |
| 2 | 32.8515 | 29 |
| 3 | 5.1158 | 3 |
| 4 | 2.98 | 3 |
| 5 | 32.9945 | 32 |
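To put a number on the comparison, the absolute error per row can be computed directly from the six values shown above. This sketch copies those values by hand. Note that part of the error comes from truncating the predictions to integers, and that these rows belong to the training data.

```python
import pandas as pd

# Values copied from the comparison table above.
pred_df = pd.DataFrame({
    "streamflow_real": [5.3238, 3.0196, 32.8515, 5.1158, 2.98, 32.9945],
    "streamflow_pred": [4, 2, 29, 3, 3, 32],
})

# Absolute difference between the real value and the truncated prediction.
pred_df["abs_error"] = (pred_df["streamflow_real"] - pred_df["streamflow_pred"]).abs()
mae_sample = pred_df["abs_error"].mean()

print(pred_df)
print(f"Mean absolute error on these six rows: {mae_sample:.4f}")
```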


---

## <span style='color:#ff5f27'>🗄 Model Registry</span>

The Hopsworks model registry stores versioned models together with their metrics, so that different versions can be compared. Models in the registry can then be served as API endpoints.

In [122]:
mr = project.get_model_registry()

Connected. Call `.close()` to terminate connection gracefully.


### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/machine-learning-api/latest/generated/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [123]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X)
output_schema = Schema(y)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()

{'input_schema': {'columnar_schema': [{'name': 'streamflow',
    'type': 'float64'},
   {'name': 'temperature_2m_max', 'type': 'float64'},
   {'name': 'temperature_2m_min', 'type': 'float64'},
   {'name': 'precipitation_sum', 'type': 'float64'},
   {'name': 'rain_sum', 'type': 'float64'},
   {'name': 'snowfall_sum', 'type': 'float64'},
   {'name': 'precipitation_hours', 'type': 'float64'},
   {'name': 'windspeed_10m_max', 'type': 'float64'},
   {'name': 'windgusts_10m_max', 'type': 'float64'},
   {'name': 'winddirection_10m_dominant', 'type': 'float64'},
   {'name': 'et0_fao_evapotranspiration', 'type': 'float64'}]},
 'output_schema': {'columnar_schema': [{'name': 'streamflow_next_day',
    'type': 'float64'}]}}
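The dictionary above lists each column's name and dtype. To illustrate the shape of that structure, a similar dictionary can be built with plain pandas. The helper `columnar_schema` below is hypothetical. It is not part of `hsml` and is not how `Schema` is implemented; it only mirrors the layout of the output.

```python
import pandas as pd

def columnar_schema(data):
    """Hypothetical helper: list column names and dtypes in the layout shown above."""
    frame = data.to_frame() if isinstance(data, pd.Series) else data
    return [{"name": str(col), "type": str(dtype)} for col, dtype in frame.dtypes.items()]

# Tiny made-up stand-ins for X and y.
X_demo = pd.DataFrame({"streamflow": [33.2854, 5.3238], "rain_sum": [0.0, 0.1]})
y_demo = pd.Series([0.0, 33.2854], name="streamflow_next_day")

schema_dict = {
    "input_schema": {"columnar_schema": columnar_schema(X_demo)},
    "output_schema": {"columnar_schema": columnar_schema(y_demo)},
}
print(schema_dict)
```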

In [125]:
import joblib

joblib.dump(pipe, 'model.pkl')

['model.pkl']
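Before uploading a serialized pipeline, it can be worth checking that it loads back and predicts identically. A minimal round-trip sketch on a small synthetic pipeline follows. The data and the step names are made up, and a temporary directory is used so that no `model.pkl` in the working directory is overwritten.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic pipeline standing in for `pipe` (values are made up).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1]

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("gb", GradientBoostingRegressor(random_state=0)),
]).fit(X_demo, y_demo)

# Serialize to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(demo_pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same_predictions = np.allclose(demo_pipe.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```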

In [127]:
model = mr.sklearn.create_model(
    name="gradient_boost_pipeline",
    metrics={"f1": f1}, 
    description="Transformations and Gradient Boost Regressor.",
    input_example=X.sample(),
    model_schema=model_schema
)

model.save('model.pkl')

Model created, explore it at https://c.app.hopsworks.ai:443/p/5318/models/gradient_boost_pipeline/1


Model(name: 'gradient_boost_pipeline', version: 1)