# Getting started

This guide walks you through creating and configuring a new model in the platform, giving you model observability in 5 minutes.

[1. Install Superwise package](#Install-Superwise-package)

[2. Simulate model training flow](#Simulate-model-training-flow)

[3. Let's integrate with Superwise](#Lets-integrate-with-Superwise)
- [3.1 Create a Model](#Create-a-Model)
- [3.2 Create a Version](#Create-a-Version)
- [3.3 Log production data](#Log-production-data)


# Install Superwise package

Superwise's SDK is a standard Python package that simplifies the integration with Superwise and streams data to the Superwise platform.
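
Before running the imports below, it can help to confirm that the package is available. This is a minimal sketch, assuming the SDK is published on PyPI under the name `superwise` (check the Superwise documentation for the exact install command). `is_installed` is a hypothetical helper and not part of the SDK.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Assumption: the SDK is distributed as "superwise" (e.g. `pip install superwise`).
if not is_installed("superwise"):
    print("superwise is not installed; try: pip install superwise")
```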

In [1]:
import os

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

from superwise import Superwise
from superwise.models.task import Task
from superwise.models.version import Version
from superwise.resources.superwise_enums import DataEntityRole

## Initialize the Superwise client

There are 2 environment variables used to identify and authenticate your connection:
- SUPERWISE_CLIENT_ID
- SUPERWISE_SECRET

Read [here](https://docs.superwise.ai/docs/authentication) on how to generate them.

In [2]:
os.environ['SUPERWISE_CLIENT_ID'] = 'REPLACE_WITH_YOUR_CLIENT'
os.environ['SUPERWISE_SECRET'] = 'REPLACE_WITH_YOUR_SECRET'
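
Hardcoding credentials in a notebook makes them easy to leak, so you may prefer to set the two variables outside the notebook and only verify them here. The sketch below shows one way to do that check; `missing_env_vars` is a hypothetical helper, not part of the Superwise SDK.

```python
import os

REQUIRED_VARS = ("SUPERWISE_CLIENT_ID", "SUPERWISE_SECRET")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Set these variables before creating the client:", ", ".join(missing))
```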

Create an instance of the Superwise object to interact with the Superwise APIs. All APIs are then accessible through the `sw` instance.


In [3]:
sw = Superwise()

# Simulate model training flow

For this tutorial we will use the public Diamonds dataset. This classic dataset contains the prices and other attributes of almost 54,000 diamonds.

## Load data

In [4]:
diamonds = pd.read_csv('https://www.openml.org/data/get_csv/21792853/dataset')
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


Dataset properties:
- carat: weight of the diamond (0.2 to 5.01)
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from J (worst) to D (best)
- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43 to 79)
- table: width of top of diamond relative to widest point (43 to 95)
- price: price in US dollars ($326 to $18,823); this is the label
- x: length in mm (0 to 10.74)
- y: width in mm (0 to 58.9)
- z: depth in mm (0 to 31.8)
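
As a quick check of the depth formula, the sketch below recomputes it for the first row of the dataset. The factor of 100 is an assumption added here to turn the ratio into a percentage. The result is close to, but not identical to, the recorded value of 61.5, presumably because the dimensions in the table are rounded.

```python
def depth_percentage(x, y, z):
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    return 100 * 2 * z / (x + y)


# First row of the dataset: x=3.95, y=3.98, z=2.43, recorded depth 61.5
computed = depth_percentage(3.95, 3.98, 2.43)
print(round(computed, 1))  # 61.3
```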

## Split Train-Test

The training dataset will be used later on as a reference dataset (a.k.a. the baseline). The test dataset will simulate production data that is fed into the model. You can read more about the baseline dataset concept [here](https://docs.superwise.ai/docs/baseline).

In [5]:
X = diamonds.drop(columns="price")
y = diamonds["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z
19497,1.21,Ideal,H,VVS2,61.3,57.0,6.92,6.87,4.23
31229,0.31,Ideal,E,VS2,62.0,56.0,4.38,4.36,2.71
22311,1.21,Ideal,E,VS1,62.4,57.0,6.75,6.83,4.24
278,0.81,Ideal,F,SI2,62.6,55.0,5.92,5.96,3.72
6646,0.79,Ideal,I,VVS2,61.7,56.0,5.94,5.95,3.67


## Pre-processing

In order to use categorical features in our model, we will transform them into numeric values using One Hot Encoding.

In [6]:
categorical_cols = ['cut','clarity','color']

preprocessor = ColumnTransformer(
    transformers=[
        ('categorical',  OneHotEncoder(), categorical_cols)
    ], remainder='passthrough')
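
To make the idea concrete, here is a plain-Python sketch of one-hot encoding on a few `cut` values. It only illustrates the concept and is not how scikit-learn implements `OneHotEncoder`; `one_hot` is a hypothetical helper.

```python
def one_hot(values):
    """Encode a list of categories as 0/1 rows, one column per sorted category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


categories, rows = one_hot(["Ideal", "Premium", "Good", "Premium"])
print(categories)  # ['Good', 'Ideal', 'Premium']
print(rows)        # [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
```

Each category becomes its own column, so a linear model never treats the categories as ordered numbers.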

## Train your model

In this example we will train a simple Linear Regression model.

In [7]:
diamond_price_model=LinearRegression()

my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', diamond_price_model)
])

my_pipeline.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('categorical',
                                                  OneHotEncoder(),
                                                  ['cut', 'clarity',
                                                   'color'])])),
                ('model', LinearRegression())])

## Predict

Use the model we trained to predict the price for the training data. These predictions will later become part of the baseline dataset.

In [8]:
y_pred_train =  my_pipeline.predict(X_train)

# Lets integrate with Superwise

Now that we have a trained model, we can log it into Superwise and start monitoring it.

## Create a Model

Models, or machine learning-based decision processes, are the basic atomic component that Superwise observes. In the SDK code below, a model is created as a `Task`.

In [9]:
diamond_task = Task(
    name="Diamond Model",
    description="Regression model that predicts the diamond price"
)

diamond_task = sw.task.create(diamond_task)
print(f"New task Created - {diamond_task.id}")

New task Created - 54


## Create a Version

Deploying a model to production is only step 1, as models require iterative improvement and ongoing updates. Differences between versions may be ad hoc schema changes or retraining on a new data set to refit the model hyperparameters under the same given schema.

This process has 3 main steps:
- Prepare the baseline dataset based on the training data
- Define the version schema
- Log the version in the Superwise app

### Prepare Baseline Dataset

Here we add 4 columns to our dataset:

- ID - Unique identifier per row or prediction (Required)
- Timestamp - Indicates when the prediction took place (Required)
- Model prediction
- Label - the real diamond price

In [11]:
baseline_data = X_train.assign(
    id=X_train.index,
    ts=pd.Timestamp.now(),
    prediction=y_pred_train,
    price=y_train
)
baseline_data["prediction"] = baseline_data["prediction"].astype(float)
os.makedirs('data', exist_ok=True)  # make sure the output directory exists
baseline_data.to_csv('data/baseline.csv', index=False)
baseline_data.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,id,ts,prediction,price
19497,1.21,Ideal,H,VVS2,61.3,57.0,6.92,6.87,4.23,19497,2021-12-27 17:27:56.433781,8060.0,8131
31229,0.31,Ideal,E,VS2,62.0,56.0,4.38,4.36,2.71,31229,2021-12-27 17:27:56.433781,613.0,756
22311,1.21,Ideal,E,VS1,62.4,57.0,6.75,6.83,4.24,22311,2021-12-27 17:27:56.433781,8568.0,10351
278,0.81,Ideal,F,SI2,62.6,55.0,5.92,5.96,3.72,278,2021-12-27 17:27:56.433781,3037.5,2795
6646,0.79,Ideal,I,VVS2,61.7,56.0,5.94,5.95,3.67,6646,2021-12-27 17:27:56.433781,3866.5,4092


### Infer schema

Because each model version could introduce new input formats, each version requires an explicit schema definition. A schema is a collection of data entities (a.k.a. columns) that are part of a specific version of the machine learning decision process. Each data entity has its own data type and a specific role in the ML process.

You can read more about the data entity roles and supported data types [here](https://docs.superwise.ai/docs/version).

In [12]:
entities_collection = sw.data_entity.summarise(
    data=baseline_data,
    specific_roles={
        'id': DataEntityRole.ID,
        'ts': DataEntityRole.TIMESTAMP,
        'prediction': DataEntityRole.PREDICTION_VALUE,
        'price': DataEntityRole.LABEL
    }
)

Here are the schema's main properties (roles, types, feature importance, and descriptive statistics):

In [13]:
# Collect the properties of every data entity into one table
ls = [entity.get_properties() for entity in entities_collection]

pd.DataFrame(ls).head()

Unnamed: 0,data_type,dimension_start_ts,feature_importance,id,name,role,secondary_type,summary,type
0,number,,61.74,,carat,feature,Num_centered,"{'statistics': {'missing_values': 0.0, 'outlie...",Numeric
1,text,,0.01,,cut,feature,Cat_dense,"{'statistics': {'missing_values': 0.0, 'new_va...",Categorical
2,text,,6.03,,color,feature,Cat_dense,"{'statistics': {'missing_values': 0.0, 'new_va...",Categorical
3,text,,11.16,,clarity,feature,Cat_dense,"{'statistics': {'missing_values': 0.0, 'new_va...",Categorical
4,number,,0.01,,depth,feature,Num_centered,"{'statistics': {'missing_values': 0.0, 'outlie...",Numeric


### Activate a Version

Now that we have a model and a schema, let's combine them into a version.

In [14]:
new_version = Version(
    task_id=diamond_task.id,
    name="1.0.0",
    data_entities=entities_collection,
)

new_version = sw.version.create(new_version)

In [15]:
sw.version.activate(new_version.id)

<Response [204]>

## Log production data

We will use the test data to simulate production data.

### Predictions

The prediction data should include every data entity that has one of the following roles, as defined in the version schema above: ID, Timestamp, Metadata, Feature, Prediction probability, Prediction value, and Label weight.
In our diamonds example, we add the following columns: id, timestamp, and model prediction.

Note that if any data entities are missing, you will get a schema skew error, and the data will not be streamed into Superwise.
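
One way to catch missing columns before sending anything is a local check against the expected column names. This is a minimal sketch under stated assumptions: `missing_columns` is a hypothetical helper, and the `required` list is only an illustrative subset of the schema. It does not reproduce the validation Superwise performs.

```python
def missing_columns(record, required_columns):
    """Return the required column names that are absent from a record dict."""
    return [name for name in required_columns if name not in record]


# Hypothetical subset of the schema used in this guide
required = ["id", "ts", "prediction", "carat"]
record = {"id": 1388, "ts": "2021-12-27 17:00:00", "carat": 0.24}
print(missing_columns(record, required))  # ['prediction']
```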

In [16]:
y_test_pred= my_pipeline.predict(X_test)

In [38]:
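# Spread the test records over roughly the last 30 days, one batch per day
records_per_day = X_test.shape[0] // 30
prediction_time_vector = pd.Timestamp.now().floor('h') - \
    pd.to_timedelta(X_test.reset_index(drop=True).index // records_per_day, unit='D')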

ongoing_prediction = X_test.assign(
    id=X_test.index,
    ts=prediction_time_vector,
    prediction=y_test_pred
)
ongoing_prediction["prediction"] = ongoing_prediction["prediction"].astype(float)
ongoing_prediction.to_csv('data/ongoing_prediction.csv', index=False)
ongoing_prediction.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,id,ts,prediction
1388,0.24,Ideal,G,VVS1,62.1,56.0,3.97,4.0,2.47,1388,2021-12-27 17:00:00,718.0
50052,0.58,'Very Good',F,VVS2,60.0,57.0,5.44,5.42,3.26,50052,2021-12-27 17:00:00,3192.0
41645,0.4,Ideal,E,VVS2,62.1,55.0,4.76,4.74,2.95,41645,2021-12-27 17:00:00,1951.5
42377,0.43,Premium,E,VVS2,60.8,57.0,4.92,4.89,2.98,42377,2021-12-27 17:00:00,2083.0
17244,1.55,Ideal,E,SI2,62.3,55.0,7.44,7.37,4.61,17244,2021-12-27 17:00:00,9876.0
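
The timestamp column above is built by integer-dividing each row position by the number of records per day, so that consecutive batches of rows fall on successively earlier days. The plain-Python sketch below shows the same arithmetic on 10 records and 3 days; `spread_over_days` is a hypothetical helper written for illustration.

```python
from datetime import datetime, timedelta


def spread_over_days(n_records, n_days, now):
    """Step back one day every n_records // n_days records (assumes n_records >= n_days)."""
    per_day = n_records // n_days
    return [now - timedelta(days=i // per_day) for i in range(n_records)]


now = datetime(2021, 12, 27, 17)
stamps = spread_over_days(10, 3, now)
offsets = [(now - ts).days for ts in stamps]
print(offsets)  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
```

When the record count is not an exact multiple of the number of days, the remainder spills into one extra day, as the last offset shows.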


Once the data fits the schema, it is ready to be sent. You must provide the model and the version of the model that was used to predict.

Note that each chunk of data should contain no more than 1000 records.

In [39]:
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
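
As a small illustration of how the helper behaves, the sketch below splits 2,500 placeholder records into chunks of at most 1000. The integer list is only a stand-in for a list of record dictionaries.

```python
import math


def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


records = list(range(2500))  # stand-in for a list of record dicts
sizes = [len(chunk) for chunk in chunks(records, 1000)]
print(sizes)  # [1000, 1000, 500]
assert len(sizes) == math.ceil(len(records) / 1000)
```

Only the last chunk can be smaller than the limit, and the number of chunks is the record count divided by the chunk size, rounded up.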

In [40]:
ongoing_prediction_chunks = chunks(ongoing_prediction.to_dict(orient='records'), 1000)

# Send each chunk as a separate transaction and keep the returned IDs
transaction_ids = list()
for ongoing_prediction_chunk in ongoing_prediction_chunks:
    transaction_id = sw.transaction.log_records(
        task_id=diamond_task.id,
        version_id=new_version.name,
        records=ongoing_prediction_chunk
    )
    transaction_ids.append(transaction_id)
    print(transaction_id)

{'transaction_id': 'eee1ee98-672a-11ec-920b-5acaded3d43d'}
{'transaction_id': 'f02a559c-672a-11ec-b09e-426f0aedd514'}
{'transaction_id': 'f16ed13a-672a-11ec-950c-426f0aedd514'}
{'transaction_id': 'f2b3be48-672a-11ec-a5d6-5acaded3d43d'}
{'transaction_id': 'f42dc7c8-672a-11ec-ab12-5acaded3d43d'}
{'transaction_id': 'f5a29638-672a-11ec-9e51-5acaded3d43d'}
{'transaction_id': 'f6d2ac46-672a-11ec-a331-426f0aedd514'}
{'transaction_id': 'f81f50a4-672a-11ec-95d6-5acaded3d43d'}
{'transaction_id': 'f95c221c-672a-11ec-8e99-5acaded3d43d'}
{'transaction_id': 'faaa0f9e-672a-11ec-bf37-426f0aedd514'}
{'transaction_id': 'fbfc7198-672a-11ec-8b33-5acaded3d43d'}
{'transaction_id': 'fd384fdc-672a-11ec-afee-426f0aedd514'}
{'transaction_id': 'fe81af14-672a-11ec-b8bf-5acaded3d43d'}
{'transaction_id': 'ffe799e0-672a-11ec-abcc-426f0aedd514'}
{'transaction_id': '012c8a72-672b-11ec-a204-5acaded3d43d'}
{'transaction_id': '027987ae-672b-11ec-98ca-426f0aedd514'}
{'transaction_id': '0387ac20-672b-11ec-9775-5acaded3d43d'}

In [41]:
# Look up the first transaction and check its processing status
transaction = sw.transaction.get(transaction_id=transaction_ids[0]['transaction_id'])
transaction.get_properties()['status']

'Passed'
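
If a transaction has not reached its final status when you first check it, you could poll for it. The sketch below assumes that this can happen; the only status shown in this guide is `'Passed'`, so the `"Pending"` value is a made-up placeholder. `wait_for_status` is a hypothetical helper, not part of the SDK.

```python
import time


def wait_for_status(fetch_status, expected="Passed", attempts=5, delay_seconds=1.0):
    """Call fetch_status() until it returns `expected` or attempts run out.

    Returns the last status seen.
    """
    status = None
    for attempt in range(attempts):
        status = fetch_status()
        if status == expected:
            break
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return status


# Stand-in for a real lookup; "Pending" is a made-up placeholder status.
fake_statuses = iter(["Pending", "Pending", "Passed"])
print(wait_for_status(lambda: next(fake_statuses), delay_seconds=0))  # Passed
```

With the client from this guide, `fetch_status` could wrap the `sw.transaction.get(...).get_properties()['status']` call shown above.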

### Ground truth

In most ML scenarios, the ground truth arrives with a delay. To log it, the data should include the data entities with the ID and Label roles (as defined in the schema). Superwise automatically joins them to the predictions that were sent earlier in order to calculate the model performance metrics.

In [43]:
ongoing_labels = y_test.reset_index().copy().rename(columns={"index": "id"})
ongoing_labels.to_csv('data/ongoing_labels.csv', index=False)
ongoing_labels.head()

Unnamed: 0,id,price
0,1388,559
1,50052,2201
2,41645,1238
3,42377,1304
4,17244,6901
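
To illustrate what joining labels to predictions by id makes possible, the sketch below matches the five sample rows shown in this guide and computes their mean absolute error. It is a plain-Python illustration only and not Superwise's implementation; `mean_absolute_error_by_id` is a hypothetical helper.

```python
def mean_absolute_error_by_id(predictions, labels):
    """Join predictions and labels on id and return the MAE over matched ids."""
    matched = [abs(predictions[i] - labels[i]) for i in predictions if i in labels]
    if not matched:
        raise ValueError("no ids in common")
    return sum(matched) / len(matched)


# Values taken from the sample rows shown in this guide
predictions = {1388: 718.0, 50052: 3192.0, 41645: 1951.5, 42377: 2083.0, 17244: 9876.0}
labels = {1388: 559, 50052: 2201, 41645: 1238, 42377: 1304, 17244: 6901}
print(mean_absolute_error_by_id(predictions, labels))  # 1123.5
```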


Note that when you log the ground truth, the version_id is optional. For example, if you use 2 different versions to predict the same data record, the ground truth is relevant to both of them.

In [44]:
ongoing_labels_chunks = chunks(ongoing_labels.to_dict(orient='records'), 1000)

# Labels are logged per task; no version_id is passed here
transaction_ids = list()
for ongoing_labels_chunk in ongoing_labels_chunks:
    transaction_id = sw.transaction.log_records(
        task_id=diamond_task.id,
        records=ongoing_labels_chunk
    )

    transaction_ids.append(transaction_id)
    print(transaction_id)

{'transaction_id': '3cc42a36-672b-11ec-86e5-5acaded3d43d'}
{'transaction_id': '3da1622a-672b-11ec-8865-426f0aedd514'}
{'transaction_id': '3e7cedf4-672b-11ec-ad4d-426f0aedd514'}
{'transaction_id': '3f3a8d0a-672b-11ec-9684-426f0aedd514'}
{'transaction_id': '4000adf0-672b-11ec-ac56-5acaded3d43d'}
{'transaction_id': '40be0972-672b-11ec-9506-5acaded3d43d'}
{'transaction_id': '41814f4a-672b-11ec-a560-5acaded3d43d'}
{'transaction_id': '42368946-672b-11ec-b907-426f0aedd514'}
{'transaction_id': '42eaef12-672b-11ec-a5d6-5acaded3d43d'}
{'transaction_id': '43acedba-672b-11ec-b09e-426f0aedd514'}
{'transaction_id': '44660994-672b-11ec-8210-426f0aedd514'}
{'transaction_id': '451437ee-672b-11ec-ab12-5acaded3d43d'}
{'transaction_id': '45c12d64-672b-11ec-b1fe-426f0aedd514'}
{'transaction_id': '468b220e-672b-11ec-8a85-426f0aedd514'}
{'transaction_id': '474cc67a-672b-11ec-a4e3-426f0aedd514'}
{'transaction_id': '48100e00-672b-11ec-a755-5acaded3d43d'}
{'transaction_id': '489f2928-672b-11ec-ac82-5acaded3d43d'}

In [135]:
# Look up the first label transaction and check its processing status
transaction = sw.transaction.get(transaction_id=transaction_ids[0]['transaction_id'])
transaction.get_properties()['status']

'Passed'