<a href="https://colab.research.google.com/github/raptor-ml/raptor/blob/master/docs/guides/getting-started-with-labsdk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




# LabSDK

With the LabSDK, data scientists can build production-ready features directly from the notebook while developing their models.

When you're done, you can "export" the features as Kubernetes manifests and deploy them like any other service in your cluster. This way, you benefit from the "serverless approach" and can **focus on the business logic** while RaptorML takes care of the production concerns.

## Getting started
In this quickstart tutorial, we'll build a model that predicts the probability of closing a deal.

Our CRM allows us to track every email communication and the history of previous deals for each customer. We'll use these data sources to predict whether the customer is ready for closure or not.

To do that, we're going to build a few features from the data:
1. `emails_10h` - the number of email exchanges over the last 10 hours
1. `deals_10h[sum]` - the sum of the deal amounts over the last 10 hours
1. `emails_deals` - the ratio between the number of emails in the last 10 hours (`emails_10h`) and the average deal amount in the last 10 hours (`deals_10h[avg]`)
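
To make these definitions concrete, here is a minimal plain-Python sketch (it does not use the LabSDK) that computes the three values for a single account at a given point in time. The event lists, the `timestamp` and `amount` field names, and the helper names are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=10)  # the 10-hour window used by all three features


def in_window(events, now, window=WINDOW):
    """Return the events whose timestamp falls in (now - window, now]."""
    return [e for e in events if now - window < e["timestamp"] <= now]


def compute_features(emails, deals, now):
    """Compute the three tutorial features for one account (illustrative only)."""
    emails_count = len(in_window(emails, now))
    amounts = [d["amount"] for d in in_window(deals, now)]
    deals_sum = sum(amounts)
    deals_avg = deals_sum / len(amounts) if amounts else None
    # The ratio is undefined when there are no deals (or the average is zero).
    ratio = emails_count / deals_avg if deals_avg else None
    return {
        "emails_10h[count]": emails_count,
        "deals_10h[sum]": deals_sum,
        "emails_deals": ratio,
    }


# Made-up events for a single account.
now = datetime(2022, 1, 1, 13, 10)
emails = [
    {"timestamp": datetime(2021, 12, 31, 20, 0)},  # older than 10 hours: ignored
    {"timestamp": datetime(2022, 1, 1, 12, 0)},
    {"timestamp": datetime(2022, 1, 1, 13, 10)},
]
deals = [
    {"timestamp": datetime(2022, 1, 1, 12, 0), "amount": 458},
    {"timestamp": datetime(2022, 1, 1, 13, 10), "amount": 1269},
]
features = compute_features(emails, deals, now)
print(features)
```

In the rest of this guide, the LabSDK decorators take over the windowing and aggregation, so the feature functions themselves stay much smaller than this sketch.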

## Installing the SDK
Yalla, let's go! In the following two blocks, we install the LabSDK and import it.

In [1]:
!pip install --upgrade raptor-labsdk



In [2]:
import raptor
from raptor.stub import *  #<-- this prevents the IDE/Notebook from flagging PyExp built-ins as errors

## Writing our first features
Our first feature counts how many emails an account received over the last 10 hours.

It uses the `emails` data-connector that our DevOps team configured for us.

In [3]:
@raptor.register(int, freshness='1m', staleness='10h')
@raptor.connector("emails")  #<-- we are decorating our feature with our production data-connector! 😎 
@raptor.aggr([raptor.AggrFn.Count])
def emails_10h(**req: RaptorRequest):
    """email over 10 hours"""
    return 1
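
The function above returns `1` for every email event and leaves the counting to the `Count` aggregation. As a rough mental model only (this is not how Raptor is implemented), a windowed count can be sketched in plain Python as below. The `WindowedCount` class is hypothetical, and it assumes events arrive in timestamp order.

```python
from collections import deque
from datetime import datetime, timedelta


class WindowedCount:
    """Counts events seen within a trailing time window (illustrative sketch)."""

    def __init__(self, window):
        self.window = window
        self._timestamps = deque()  # oldest event on the left

    def add(self, timestamp):
        """Record one event; assumes timestamps arrive in ascending order."""
        self._timestamps.append(timestamp)

    def value(self, now):
        """Drop events that left the window, then return how many remain."""
        cutoff = now - self.window
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


counter = WindowedCount(timedelta(hours=10))
t0 = datetime(2022, 1, 1, 12, 0)
for offset_minutes in (0, 70, 80):
    counter.add(t0 + timedelta(minutes=offset_minutes))

count_now = counter.value(t0 + timedelta(minutes=80))             # all three events
count_later = counter.value(t0 + timedelta(hours=10, minutes=30))  # first one expired
print(count_now, count_later)
```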

---
> ## 😎 *Cool tip* 
>
> Although it looks like regular Python, it's not 😲
>
> The feature definition above is actually written in [PyExp](https://raptor.ml/docs/reference/pyexp), a dialect of Python that allows us to compile the code and run it in production (in a safe and performant manner).
>
> [Learn more about PyExp »](https://raptor.ml/docs/reference/pyexp)

Let's create another feature that calculates various aggregations against the deal amount.

Notice that we've specified the `deals` data-connector here. In production, this data-connector will be used to retrieve the "real" data.

In [4]:
@raptor.register(int, freshness='1m', staleness='10h')
@raptor.connector("deals")
@raptor.aggr([raptor.AggrFn.Sum, raptor.AggrFn.Avg, raptor.AggrFn.Max, raptor.AggrFn.Min])
def deals_10h(**req: RaptorRequest):
    """sum/avg/min/max of deal amount over 24 hours"""
    return req['payload']["amount"]
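
Here the function only extracts the deal amount; the four aggregations are applied to the values collected in the window. A minimal plain-Python sketch of that idea follows. The `AGGREGATIONS` table and the `aggregate` helper are hypothetical names, and the amounts are made up.

```python
# Map each aggregation name to a function over a list of numbers.
AGGREGATIONS = {
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
    "max": max,
    "min": min,
}


def aggregate(values, functions=("sum", "avg", "max", "min")):
    """Apply every requested aggregation to the values in the window."""
    if not values:
        # Nothing in the window: no aggregation has a meaningful value.
        return {name: None for name in functions}
    return {name: AGGREGATIONS[name](values) for name in functions}


amounts = [458, 1269]  # made-up deal amounts inside one window
result = aggregate(amounts)
print(result)
```

Each aggregated value is then addressed separately, which is why later cells refer to `deals_10h` together with a function name in brackets.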

Now we can create a *derived feature* that defines the ratio between these two features. Since the feature is calculated only at request time, we can set its type to `headless`.

Notice that we used the Fully Qualified Name (*FQN*) of the feature, which includes the feature's namespace (*default*).
When querying a feature with an aggregation function, we need to specify the function in the brackets.
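
To illustrate the shape of these names, the sketch below splits an FQN such as `deals_10h.default[avg]` into its name, namespace, and optional aggregation function. The pattern is inferred only from the FQNs used in this guide; the actual grammar accepted by Raptor may differ, and `parse_fqn` and `FeatureRef` are hypothetical helpers, not part of the LabSDK.

```python
import re
from typing import NamedTuple, Optional


class FeatureRef(NamedTuple):
    name: str
    namespace: str
    aggr: Optional[str]  # None when no aggregation function is given


# <name>.<namespace> optionally followed by [<aggregation>]
_FQN_PATTERN = re.compile(
    r"^(?P<name>\w+)\.(?P<namespace>[\w-]+)(?:\[(?P<aggr>\w+)\])?$"
)


def parse_fqn(fqn):
    """Split an FQN string into its parts, or raise ValueError."""
    match = _FQN_PATTERN.match(fqn)
    if match is None:
        raise ValueError(f"not a valid FQN: {fqn!r}")
    return FeatureRef(match["name"], match["namespace"], match["aggr"])


print(parse_fqn("deals_10h.default[avg]"))
print(parse_fqn("emails_deals.default"))
```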

In [5]:
@raptor.register('headless', freshness='-1', staleness='-1')
def emails_deals(**req: RaptorRequest):
    """emails/deal[avg] rate over 10 hours"""
    e, _ = f("emails_10h.default[count]", req['entity_id'])
    d, _ = f("deals_10h.default[avg]", req['entity_id'])
    if e == None or d == None:
        return None
    return e / d
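
The `None` check matters because either input can be missing when no events fall inside the window. The plain-Python sketch below (regular Python, not PyExp) shows the same guard, plus a zero-denominator check that is an addition of this sketch and not part of the feature above. `safe_ratio` is a hypothetical helper name.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    if numerator is None or denominator is None:
        return None  # one of the inputs has no value for this entity
    if denominator == 0:
        return None  # assumption: treat a zero average as "no ratio"
    return numerator / denominator


# One email and an average deal amount of 458.0:
print(round(safe_ratio(1, 458.0), 6))
# No deals recorded yet:
print(safe_ratio(3, None))
```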

Finally, we group the features into a feature set:

In [6]:
@raptor.feature_set(register=True)
def deal_prediction():
    return "emails_10h.default[count]", "deals_10h.default[sum]", emails_deals

## Historical Replay
For development purposes, we can "replay" historical records against the production-ready features we wrote above.

The SDK runs this code locally, which allows us to iterate on it quickly.

In [7]:
!pip install pandas pyarrow



In [8]:
import pandas as pd

# first, calculate the "root" features
df = pd.read_parquet("https://gist.github.com/AlmogBaku/a1b331615eaf1284432d2eecc5fe60bc/raw/emails.parquet")
emails_10h.replay(df, entity_id_field="account_id")

df = pd.read_csv("https://gist.githubusercontent.com/AlmogBaku/a1b331615eaf1284432d2eecc5fe60bc/raw/deals.csv")
deals_10h.replay(df, entity_id_field="account_id")

# then, we can calculate the derived features
emails_deals.replay(df, entity_id_field="account_id")

| | timestamp | entity_id | fqn | value |
|---|---|---|---|---|
| 0 | 2022-01-01 12:00:10+00:00 | msft | emails_deals.default | 0.002183 |
| 1 | 2022-01-01 13:10:00+00:00 | msft | emails_deals.default | 0.002316 |
| 2 | 2022-01-01 13:21:00+00:00 | msft | emails_deals.default | 0.002938 |
| 3 | 2022-01-01 14:03:00+00:00 | msft | emails_deals.default | 0.002106 |
| 4 | 2022-01-01 14:10:00+00:00 | msft | emails_deals.default | 0.001714 |
| 5 | 2022-01-01 14:20:00+00:00 | msft | emails_deals.default | 0.001556 |
| 6 | 2022-01-01 14:30:00+00:00 | msft | emails_deals.default | 0.001764 |
| 7 | 2022-01-01 14:40:00+00:00 | msft | emails_deals.default | 0.00198 |
| 8 | 2022-01-01 15:33:00+00:00 | msft | emails_deals.default | 0.002219 |
| 9 | 2022-01-01 12:00:00+00:00 | tesla | emails_deals.default | 0.000113 |
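
Conceptually, a replay walks the historical records in time order, calls the feature function once per record, and collects one output row per call. The sketch below is a simplified assumption about that flow and not the LabSDK's implementation: the `replay` helper, the `deal_amount` function, and the `timestamp` request key are hypothetical, the records are made up, and aggregations are left out entirely.

```python
def replay(feature_fn, records, entity_id_field, fqn):
    """Call feature_fn once per historical record and collect the results."""
    rows = []
    # ISO-8601 strings in the same timezone sort chronologically.
    for record in sorted(records, key=lambda r: r["timestamp"]):
        request = {
            "entity_id": record[entity_id_field],
            "timestamp": record["timestamp"],
            "payload": record,
        }
        rows.append({
            "timestamp": record["timestamp"],
            "entity_id": record[entity_id_field],
            "fqn": fqn,
            "value": feature_fn(**request),
        })
    return rows


def deal_amount(**req):
    """A feature function in the same style as the ones defined earlier."""
    return req["payload"]["amount"]


records = [
    {"account_id": "msft", "timestamp": "2022-01-01T13:10:00+00:00", "amount": 1269},
    {"account_id": "msft", "timestamp": "2022-01-01T12:00:00+00:00", "amount": 458},
]
rows = replay(deal_amount, records, entity_id_field="account_id", fqn="deal_amount.default")
for row in rows:
    print(row)
```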


---
> ℹ️ **Looking to run Replay at *scale*?** try [Raptor Enterprise](mailto:contact@raptor.ml) 🦖


## Building our model
To use our feature set in our model, we need to query it:

In [9]:
df = deal_prediction.historical_get(since='2020-1-1', until='2022-12-31')
df
# model.fit(df)

| | timestamp | entity_id | emails_10h.default[count] | deals_10h.default[sum] | emails_deals.default |
|---|---|---|---|---|---|
| 0 | 2022-01-01 12:00:00+00:00 | msft | 1.0 | 458.0 | 0.002183 |
| 1 | 2022-01-01 12:00:00+00:00 | tesla | 1.0 | 8837.0 | 0.000113 |
| 2 | 2022-01-01 12:20:00+00:00 | tesla | 2.0 | 103502.0 | 3.9e-05 |
| 3 | 2022-01-01 13:10:00+00:00 | msft | 2.0 | 1727.0 | 0.002316 |
| 4 | 2022-01-01 13:20:00+00:00 | msft | 3.0 | 3063.0 | 0.002938 |
| 5 | 2022-01-01 13:40:00+00:00 | tesla | 3.0 | 109966.0 | 8.2e-05 |
| 6 | 2022-01-01 14:00:00+00:00 | msft | 4.0 | 7599.0 | 0.002106 |
| 7 | 2022-01-01 14:10:00+00:00 | msft | 5.0 | 14583.0 | 0.001714 |
| 8 | 2022-01-01 14:20:00+00:00 | msft | 6.0 | 23132.0 | 0.001556 |
| 9 | 2022-01-01 14:30:00+00:00 | msft | 7.0 | 27775.0 | 0.001764 |


## Deployment
That's the fun part! 🤗 Making our features run at scale in production couldn't be easier.

We only need to deploy our feature definitions to the Raptor Platform.
You can do that with the CI/CD tool of your choice, manually via `kubectl`, or directly from your Jupyter Notebook (but that's not recommended for real production environments 🤪).

### Manifest deployment (use this for production)
We *copy-and-paste* the generated manifests to git, and use the organization's preferred method for deploying Kubernetes manifests (e.g. GitOps, Jenkins, Kustomize, Helm, etc.)

In [10]:
raptor.manifests()

apiVersion: k8s.raptor.ml/v1alpha1
kind: Feature
metadata:
  name: emails-10h
  namespace: default
  annotations:
    a8r.io/description: "email over 10 hours"
spec:
  primitive: int
  freshness: 1m
  staleness: 10h
  aggr:
    - count
  builder:
    kind: expression
    pyexp: |
      def emails_10h(**req):
          'email over 10 hours'
          return 1
---
apiVersion: k8s.raptor.ml/v1alpha1
kind: Feature
metadata:
  name: deals-10h
  namespace: default
  annotations:
    a8r.io/description: "sum/avg/min/max of deal amount over 24 hours"
spec:
  primitive: int
  freshness: 1m
  staleness: 10h
  aggr:
    - sum
    - avg
    - max
    - min
  builder:
    kind: expression
    pyexp: |
      def deals_10h(**req):
          'sum/avg/min/max of deal amount over 24 hours'
          return req['payload']['amount']
---
apiVersion: k8s.raptor.ml/v1alpha1
kind: Feature
metadata:
  name: emails-deals
  namespace: default
  annotations:
    a8r.io/description: "emails/deal[avg] rate over 10 ho
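
Instead of copy-pasting, the manifests can also be written to one file per feature before committing them. The sketch below assumes the multi-document YAML string returned by `raptor.manifests(return_str=True)` uses `---` separator lines, as in the output above. The helper names are hypothetical, and the name lookup is a plain text match (to avoid a YAML dependency), so treat it as a starting point only.

```python
import re
from pathlib import Path


def split_manifests(manifests):
    """Split a multi-document YAML string on '---' separator lines."""
    documents = re.split(r"^---[ \t]*$", manifests, flags=re.MULTILINE)
    return [doc.strip() + "\n" for doc in documents if doc.strip()]


def manifest_name(document):
    """Find the 'name:' entry indented by two spaces (under 'metadata:')."""
    match = re.search(r"^  name:\s*(\S+)\s*$", document, flags=re.MULTILINE)
    return match.group(1) if match else None


def write_manifests(manifests, directory):
    """Write each manifest to <directory>/<name>.yaml and return the paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, document in enumerate(split_manifests(manifests)):
        name = manifest_name(document) or f"manifest-{index}"
        path = directory / f"{name}.yaml"
        path.write_text(document)
        paths.append(path)
    return paths
```

In the notebook, this could be used as `write_manifests(raptor.manifests(return_str=True), "features/")`, where `features/` is an arbitrary directory inside your git repository.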

### Direct deployment (for local development)
Alternatively, we can deploy the features ourselves, directly from Python:

1. Make sure `kubectl` is installed and configured on the machine running the notebook.
2. Then, you can create and upload your manifests directly from the notebook:
```python
manifests = raptor.manifests(return_str=True)
```
```
!echo "$manifests" | kubectl apply -f -
```

> We didn't make these blocks executable, since you first need to configure access to your cluster.
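
If you prefer to stay in Python rather than use the `!` shell escape, the same step can be sketched with `subprocess`, piping the manifests to `kubectl apply -f -` over standard input. The helper names below are hypothetical, and the sketch assumes `kubectl` is on the `PATH` and already pointed at the right cluster. Passing `dry_run=True` adds kubectl's `--dry-run=client` flag, so nothing is sent to the cluster.

```python
import subprocess


def kubectl_apply_command(dry_run=False):
    """Build the kubectl command that reads manifests from standard input."""
    command = ["kubectl", "apply", "-f", "-"]
    if dry_run:
        command.append("--dry-run=client")
    return command


def apply_manifests(manifests, dry_run=False):
    """Pipe the manifests string to kubectl; raises CalledProcessError on failure."""
    return subprocess.run(
        kubectl_apply_command(dry_run),
        input=manifests,
        text=True,
        capture_output=True,
        check=True,
    )


# Example (requires a configured cluster, so it is left commented out):
# result = apply_manifests(raptor.manifests(return_str=True), dry_run=True)
# print(result.stdout)
```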

## Voilà! 🪄
**From now on**, our cluster will collect and build features in production and record the values for historical purposes (so you'll be able to retrain against the production data).