# Iris Flower - Feature Pipeline

In this notebook we will:

1. Run in either "Backfill" or "Normal" operation.
2. If *BACKFILL==True*, load our DataFrame with data from the iris.csv file.

   Otherwise (*BACKFILL==False*), load our DataFrame with one synthetic Iris Flower sample.
3. Write our DataFrame to a Feature Group.

In [None]:
!pip install -U hopsworks --quiet

Set **BACKFILL=True** if you want to create features from the iris.csv file containing historical data.

In [1]:
import random
import pandas as pd
import hopsworks

BACKFILL=True

### Synthetic Data Functions

These synthetic data functions can be used to create a DataFrame containing a single Iris Flower sample.

In [2]:
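def generate_flower(name, sepal_len_max, sepal_len_min, sepal_width_max, sepal_width_min,
                    petal_len_max, petal_len_min, petal_width_max, petal_width_min):
    """
    Returns a single iris flower as a single row in a DataFrame.
    Each feature is drawn uniformly between its min and max bound
    (random.uniform accepts the bounds in either order).
    """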
    df = pd.DataFrame({ "sepal_length": [random.uniform(sepal_len_max, sepal_len_min)],
                       "sepal_width": [random.uniform(sepal_width_max, sepal_width_min)],
                       "petal_length": [random.uniform(petal_len_max, petal_len_min)],
                       "petal_width": [random.uniform(petal_width_max, petal_width_min)]
                      })
    df['variety'] = name
    return df


def get_random_iris_flower():
    """
    Returns a DataFrame containing one random iris flower
    """
    virginica_df = generate_flower("Virginica", 8, 5.5, 3.8, 2.2, 7, 4.5, 2.5, 1.4)
    versicolor_df = generate_flower("Versicolor", 7.5, 4.5, 3.5, 2.1, 5.5, 3.1, 1.8, 1.0)
    setosa_df = generate_flower("Setosa", 6, 4.5, 4.5, 2.3, 2, 1.2, 0.7, 0.3)

    # randomly pick one of these 3 varieties with equal probability and return it
    pick_random = random.uniform(0,3)
    if pick_random >= 2:
        iris_df = virginica_df
    elif pick_random >= 1:
        iris_df = versicolor_df
    else:
        iris_df = setosa_df

    return iris_df
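
As a quick sanity check on the functions above, the sketch below illustrates two details they rely on: `random.uniform` accepts its bounds in either order, and a uniform draw on the interval from 0 to 3, split at 1 and 2, selects each variety with roughly equal probability. The `pick_variety` helper, the sample sizes, and the fixed seed are illustrative assumptions, not part of the original notebook.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed only so the sketch is repeatable

# random.uniform(a, b) also works when a > b, which is why generate_flower
# can pass (max, min) rather than (min, max).
samples = [random.uniform(8, 5.5) for _ in range(1000)]
in_range = all(5.5 <= s <= 8 for s in samples)


def pick_variety(u):
    """Hypothetical helper mirroring the thresholds in get_random_iris_flower."""
    if u >= 2:
        return "Virginica"
    elif u >= 1:
        return "Versicolor"
    return "Setosa"


# Count how often each variety is selected over many draws.
counts = Counter(pick_variety(random.uniform(0, 3)) for _ in range(3000))
print(in_range)
print(counts)
```

With enough draws, the three counts should be close to one another, since each variety covers one third of the interval.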

## Backfill or create new synthetic input data

You can run this pipeline in either *backfill* or *synthetic-data* mode.

In [3]:
if BACKFILL:
    iris_df = pd.read_csv("https://repo.hops.works/master/hopsworks-tutorials/data/iris.csv")
else:
    iris_df = get_random_iris_flower()
    
iris_df.head()

| | sepal_length | sepal_width | petal_length | petal_width | variety |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |


In [8]:
import great_expectations as ge

expectation_suite = ge.core.ExpectationSuite(expectation_suite_name="iris_dimensions")

def value_between(expectation_suite, column_name, min_value, max_value):
    """
    Adds an expectation that every value in column_name lies between min_value and max_value
    """
    expectation_suite.add_expectation(
        ge.core.ExpectationConfiguration(
            expectation_type="expect_column_values_to_be_between",
            kwargs={"column": column_name, "min_value": min_value, "max_value": max_value},
        )
    )
    
value_between(expectation_suite, "sepal_length", 4.3, 8.0)
value_between(expectation_suite, "sepal_width", 2.0, 4.5)
value_between(expectation_suite, "petal_length", 1.0, 7)
value_between(expectation_suite, "petal_width", 0.1, 4.5)
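
To make the intent of these four expectations concrete, here is a rough pandas-only equivalent of the range checks. It is a sketch for illustration, not how Great Expectations or Hopsworks evaluates the suite. The `bounds` dictionary, the `out_of_range` helper, and the two sample rows are assumptions introduced here.

```python
import pandas as pd

# Same ranges as the expectation suite above, treated as inclusive.
bounds = {
    "sepal_length": (4.3, 8.0),
    "sepal_width": (2.0, 4.5),
    "petal_length": (1.0, 7),
    "petal_width": (0.1, 4.5),
}


def out_of_range(df, bounds):
    """Return, per column, how many values fall outside [min, max].

    Note: missing values are counted as violations in this sketch.
    """
    return {
        col: int((~df[col].between(lo, hi)).sum())
        for col, (lo, hi) in bounds.items()
    }


# Two made-up rows; the second has a sepal_length above the 8.0 maximum.
sample_df = pd.DataFrame({
    "sepal_length": [5.1, 9.2],
    "sepal_width": [3.5, 3.0],
    "petal_length": [1.4, 4.5],
    "petal_width": [0.2, 1.5],
})

violations = out_of_range(sample_df, bounds)
print(violations)
```

A check like this can be handy for inspecting a DataFrame locally before inserting it, while the expectation suite remains the mechanism that the feature group applies on insert.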

## Authenticate with Hopsworks using your API Key

If you have not already stored your API key securely, Hopsworks will prompt you to paste it in and give you a link to where you can find it.

In [9]:
project = hopsworks.login()
fs = project.get_feature_store()

Connection closed.
Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/398
Connected. Call `.close()` to terminate connection gracefully.


## Create and write to a feature group - primary keys

To prevent duplicate entries, Hopsworks requires that each DataFrame has a *primary_key*.
A *primary_key* is one or more columns that uniquely identify a row. Here, we assume
that each Iris flower has a unique combination of ("sepal_length","sepal_width","petal_length","petal_width")
feature values. If you randomly generate a sample that already exists in the feature group, the insert operation will fail.

The *feature group* will create its online schema using the schema of the Pandas DataFrame.

In [10]:
iris_fg = fs.get_or_create_feature_group(name="iris",
                                  version=7,
                                  primary_key=["sepal_length","sepal_width","petal_length","petal_width"],
                                  description="Iris flower dataset",
                                  expectation_suite=expectation_suite
                                 )
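
Because the primary key here is a combination of four measured values, it can be useful to check a DataFrame for repeated key combinations before inserting it. The sketch below uses plain pandas on a small made-up frame. The `df`, `dupes`, and `deduped` names and the sample rows are assumptions for illustration, not part of the Hopsworks API.

```python
import pandas as pd

primary_key = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Made-up sample: the third row repeats the key combination of the first.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.1],
    "sepal_width": [3.5, 3.0, 3.5],
    "petal_length": [1.4, 1.4, 1.4],
    "petal_width": [0.2, 0.2, 0.2],
    "variety": ["Setosa", "Setosa", "Setosa"],
})

# Rows whose primary-key combination already appeared earlier in the frame.
dupes = df[df.duplicated(subset=primary_key, keep="first")]

# Keep only the first occurrence of each primary-key combination.
deduped = df.drop_duplicates(subset=primary_key, keep="first")

print(len(dupes), len(deduped))
```

Whether to drop such rows or treat them as an error depends on the data; this sketch only shows how to detect them.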

In [11]:
iris_fg.insert(iris_df, write_options={"wait_for_job" : False})

#iris_fg.delete_expectation_suite()

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/398/fs/335/fg/8606
2022-12-19 19:34:59,891 INFO: 	4 expectation(s) included in expectation_suite.
Validation Report saved successfully, explore a summary at https://c.app.hopsworks.ai:443/p/398/fs/335/fg/8606


Uploading Dataframe: 0.00% |          | Rows 0/150 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/398/jobs/named/iris_7_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7f1e733ef2b0>,
 {
   "evaluation_parameters": {},
   "success": true,
   "statistics": {
     "evaluated_expectations": 4,
     "successful_expectations": 4,
     "unsuccessful_expectations": 0,
     "success_percent": 100.0
   },
   "results": [
     {
       "expectation_config": {
         "expectation_type": "expect_column_values_to_be_between",
         "meta": {
           "expectationId": 7213
         },
         "kwargs": {
           "column": "petal_width",
           "min_value": 0.1,
           "max_value": 4.5
         }
       },
       "success": true,
       "result": {
         "element_count": 150,
         "missing_count": 0,
         "missing_percent": 0.0,
         "unexpected_count": 0,
         "unexpected_percent": 0.0,
         "unexpected_percent_total": 0.0,
         "unexpected_percent_nonmissing": 0.0,
         "partial_unexpected_list": []
       },
       "meta": {},
       "exception_info": {
         "raised_exception": false,
