# Sumatra Sandbox Tutorial

This notebook demonstrates how to engineer features with Sumatra, backfill them offline, and serve them online. The notebook includes keys to a sandbox environment, set up just for you.

## Install Python packages
Along with `sumatra-client`, we use the `faker` package to generate test data and `tqdm` to display progress.

In [None]:
!pip install sumatra-client
!pip install faker tqdm

## Connect to Sandbox Environment

__*To request your keys, email: hello@sumatra.ai*__

* __Client__ provides access to Sumatra's platform APIs
* __APIClient__ is a simple wrapper around `requests` to call Sumatra's event-ingestion REST endpoint

In [None]:
%env SUMATRA_SDK_KEY=
%env SUMATRA_API_KEY=

from sumatra import Client, APIClient
sumatra = Client('console.qa.sumatra.ai')
api = APIClient('console.qa.sumatra.ai')
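
A blank key is an easy mistake to make in the cell above, so it can help to check both variables before going further. This is only a sketch: `missing_keys` is a hypothetical helper, not part of `sumatra-client`, and it assumes the client reads `SUMATRA_SDK_KEY` and `SUMATRA_API_KEY` from the environment, as the `%env` lines suggest.

```python
import os

REQUIRED_KEYS = ("SUMATRA_SDK_KEY", "SUMATRA_API_KEY")

def missing_keys(env=None):
    """Return the names of required keys that are unset or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k, "").strip()]

# Example with an explicit mapping instead of the real environment
print(missing_keys({"SUMATRA_SDK_KEY": "abc", "SUMATRA_API_KEY": ""}))
# → ['SUMATRA_API_KEY']
```

Calling `missing_keys()` with no argument checks the real environment; an empty list means both keys are set.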

## Key Concepts

When an __event__ is sent to Sumatra, a user-defined __topology__ processes the event to:
* __Extract__ payload elements into a statically-typed dataframe
* __Enrich__ the data with stateless and stateful feature transformations, computed efficiently on-demand

Topologies are defined in a DSL called __SCOWL__. You will find it useful to refer to the __documentation__ throughout the tutorial: https://scowl.sumatra.ai

## Hello World

Let's start with a simple example:
* extract a `name` from the payload
* compute a single new feature: `greeting`.

Notice that features are defined within the scope of an __event type__—in this case: `test`.

In [None]:
scowl = """
event test
name := $.name as string
greeting := 'Hello, ' + name
"""
sumatra.publish_scowl(scowl)

event = {'_type': 'test', 'name': 'World'}
response = api.send(event)
response

Let's look at __what we just did__. We:
* published a very simple topology to LIVE
* sent a single event to our API to enrich the data with features

Pretty neat, huh? 😄
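
To make the extract and enrich steps concrete, here is a plain-Python analogue of the `test` topology. It only illustrates the idea and is not how Sumatra evaluates SCOWL. `enrich_test_event` is a name made up for this sketch.

```python
def enrich_test_event(payload):
    """Plain-Python analogue of the `test` event type (illustrative only)."""
    # Extract: pull `name` out of the payload and enforce its type
    name = payload["name"]
    if not isinstance(name, str):
        raise TypeError("name must be a string")
    # Enrich: derive a stateless feature from the extracted value
    greeting = "Hello, " + name
    return {"name": name, "greeting": greeting}

print(enrich_test_event({"_type": "test", "name": "World"}))
# → {'name': 'World', 'greeting': 'Hello, World'}
```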

### Exercises For You

* Change the `name` in the `event` payload from `World` to `Universe`
* Change the definition of `greeting` in the `scowl` code to `'Howdy, ' + name`
* Add a new feature: `gibberish_score := GibberishNameScore(name)`. _Can you change the name to make the score positive?_
* Try breaking the code to get a compile-time error

## Payment Example

Now let's try something a bit more realistic. We'll create a simplified payload for a payment transaction.

The following function will generate random payloads for us:

In [None]:
import pandas as pd
from tqdm.notebook import trange
from random import random, randrange
from faker import Faker
faker = Faker()

def payment(**args):
    return {
        '_type': 'payment',
        'card': {
            'hash': args.get('hash', faker.uuid4()),
            'bin': args.get('bin', randrange(100000, 999999)),
        },
        'amount': args.get('amount', round(1 + random() * 100, 2)),
        'merchant': {
            'email': args.get('email', faker.free_email()),
            'create_ts': args.get('create_ts', faker.date_time_this_year().isoformat()),
        }
    }

### Scenario

Imagine we need to compute some on-demand __transformations__ and __aggregates__ over the payment data to pass to our real-time payment risk model. We may want to compute features such as:
* time-windowed sums
* unique counts
* ratios, etc. 

In the scowl code below, you'll see some example features that could be useful.

Let's publish this topology and send some events through.

In [None]:
scowl = """
event payment

amount := $.amount as float
card_bin := $.card.bin as int
card_hash := $.card.hash as string
merchant_create_ts := $.merchant.create_ts as time
merchant_email := $.merchant.email as string

amount_by_card_2w := Sum(amount by card_hash last 2 weeks)

merchant_age_days := Days(EventTime() - merchant_create_ts)
merchants_by_card_5d := CountUnique(merchant_email by card_hash last 5 days)
young_merchants_by_card_5d := CountUnique(merchant_email by card_hash where merchant_age_days < 120 last 5 days)
young_merchant_ratio_5d := young_merchants_by_card_5d / Maximum(merchants_by_card_5d , 1)
"""

sumatra.publish_scowl(scowl)

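features = []
# Named card_bin / card_hash so the builtins `bin` and `hash` are not shadowed
card_bin = randrange(100000, 999999)
card_hash = faker.uuid4()
for i in trange(10):
    resp = api.send(payment(bin=card_bin, hash=card_hash))
    features.append(resp['features'])
    if i == 4:
        # Switch to a new card halfway through, so per-card aggregates start over
        card_hash = faker.uuid4()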

df = pd.DataFrame(features)
df

Let's look at __what we just did__. We:
* published a simple but realistic topology to LIVE
* sent 10 generated events to the API
* viewed the online feature responses as a dataframe

Boom. 🤯
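
For intuition about the time-windowed aggregates in the topology above, here is a pure-Python sketch of what `Sum(amount by card_hash last 2 weeks)` and `CountUnique(merchant_email by card_hash last 5 days)` express. It reflects our reading of the semantics and is not Sumatra's implementation. Whether the current event counts toward its own window is an assumption here (we include it), and the helper names are made up.

```python
from datetime import datetime, timedelta

def in_window(event, now, window):
    # ASSUMPTION: the window is (now - window, now], so the current event is included
    return now - window < event["time"] <= now

def sum_by(events, field, key_field, key, now, window):
    """Sum `field` over events with a matching key inside the time window."""
    return sum(e[field] for e in events
               if e[key_field] == key and in_window(e, now, window))

def count_unique_by(events, field, key_field, key, now, window):
    """Count distinct values of `field` for a matching key inside the time window."""
    return len({e[field] for e in events
                if e[key_field] == key and in_window(e, now, window)})

t0 = datetime(2021, 4, 12, 12, 0)
events = [
    {"time": t0 - timedelta(days=20), "card_hash": "A", "amount": 50.0, "merchant_email": "m1@example.com"},
    {"time": t0 - timedelta(days=3), "card_hash": "A", "amount": 10.0, "merchant_email": "m1@example.com"},
    {"time": t0 - timedelta(days=1), "card_hash": "A", "amount": 5.5, "merchant_email": "m2@example.com"},
    {"time": t0, "card_hash": "B", "amount": 99.0, "merchant_email": "m3@example.com"},
]

# The 20-day-old payment falls outside the 2-week window
print(sum_by(events, "amount", "card_hash", "A", t0, timedelta(weeks=2)))  # → 15.5
print(count_unique_by(events, "merchant_email", "card_hash", "A", t0, timedelta(days=5)))  # → 2
```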

### Exercises For You
* Add a new feature: `num_cards_same_bin_1d := CountUnique(card_hash by card_bin last day)`
* Modify the event generator to drive up some aggregate counts


## Offline "Replay"

For obvious reasons, testing in LIVE is not the best way to experiment with new features. In a production setting, topologies are developed and validated on historical data before going through the change management process for LIVE deployment.

One of the key benefits of Sumatra is that __no modification or reimplementation__ is required between offline and online feature definitions.

### Stitching Events Together

In addition to demonstrating offline functionality, we'll also introduce one of Sumatra's most powerful capabilities: aggregating data across event types. Here we'll define three event types:
* `login` - merchant logged into their portal
* `update_payout` - merchant changed the bank that receives their payouts
* `payment` - as before, a payment is made

__Motivation:__ At payment time, we want to add risk if it appears that the merchant account was taken over, because a fraudster may be using the merchant account to funnel money from stolen cards.

Let's start by creating (but not yet publishing) a scowl topology:

In [None]:
scowl = """
event login

email := $.email as string
ip := $.ip as string
correct_password := $.correct_password as bool
ipc := IPCBlock(ip)
failures_by_ipc_10m := CountUnique(email by ipc where not correct_password last 10 minutes)
possible_credential_stuffing := correct_password and failures_by_ipc_10m > 5

event update_payout

email := $.email as string
ip := $.ip as string
bank_hash := $.bank.hash as string
risky_logins_2d := Count<login>(by email where possible_credential_stuffing last 2 days)

event payment

amount := $.amount as float
card_bin := $.card.bin as int
card_hash := $.card.hash as string
merchant_email := $.merchant.email as string
merchant_risky_bank_updates_12h := Count<update_payout>(by merchant_email as email where risky_logins_2d >= 1 last 12 hours)
"""
sumatra.create_branch_from_scowl(scowl)
sumatra.get_branch()

The above code defines features for each of the three event types. Notice that `payment` computes an aggregate over the `update_payout` event type, using the `<>` syntax: `Count<update_payout>`.

Likewise, `update_payout` aggregates data over the `login` event type. In this way, we can propagate risk forward through the customer journey to provide maximum context at decision time.
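
Here is a rough pure-Python picture of the cross-event lookup behind `merchant_risky_bank_updates_12h`. It is a sketch under assumptions, not Sumatra's implementation: we read `by merchant_email as email` as matching the payment's `merchant_email` against the `email` of each `update_payout` event, we assume the window includes the current instant, and the log values are made up.

```python
from datetime import datetime, timedelta

def count_events(log, email, where, now, window):
    """Count events in another event type's log that match `email`, pass `where`, and fall in the window."""
    # ASSUMPTION: the window covers (now - window, now]
    return sum(1 for e in log
               if e["email"] == email and where(e) and now - window < e["time"] <= now)

now = datetime(2021, 4, 12, 22, 45, 29)

# Already-enriched `update_payout` events (hypothetical values)
update_payout_log = [
    {"email": "eric26@yahoo.com", "risky_logins_2d": 1, "time": now - timedelta(seconds=13)},
    {"email": "other@example.com", "risky_logins_2d": 0, "time": now - timedelta(hours=1)},
]

payment_event = {"merchant_email": "eric26@yahoo.com", "amount": 67.39}

# The payment's merchant_email is looked up under the update_payout key `email`
merchant_risky_bank_updates_12h = count_events(
    update_payout_log,
    email=payment_event["merchant_email"],
    where=lambda e: e["risky_logins_2d"] >= 1,
    now=now,
    window=timedelta(hours=12),
)
print(merchant_risky_bank_updates_12h)  # → 1
```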

### Timelines

To evaluate our candidate features, we'll use a saved event log, called a _timeline_, of 17 events contrived to demonstrate an __attack__.

In [None]:
attack_jsonl = """{"_type": "payment", "card": {"hash": "e37acc45-0b0b-4de3-9d11-9b8501f61616", "bin": 502053}, "amount": 28.5, "merchant": {"email": "eric26@yahoo.com", "create_ts": "2021-05-18T23:40:43"}, "_time": "2021-04-12T22:43:10Z"}
{"_type": "payment", "card": {"hash": "4b54c4d9-0ef1-4827-a8ff-30fb259c6ad6", "bin": 418246}, "amount": 14.69, "merchant": {"email": "eric26@yahoo.com", "create_ts": "2021-05-18T23:40:43"}, "_time": "2021-04-12T22:43:21Z"}
{"_type": "payment", "card": {"hash": "541a49eb-3c68-43bf-a481-05198b4ed6d7", "bin": 952656}, "amount": 24.48, "merchant": {"email": "eric26@yahoo.com", "create_ts": "2021-05-18T23:40:43"}, "_time": "2021-04-12T22:43:33Z"}
{"_type": "login", "email": "annsmith@hotmail.com", "ip": "82.177.157.108", "correct_password": false, "_time": "2021-04-12T22:43:46Z"}
{"_type": "login", "email": "rcarroll@yahoo.com", "ip": "82.177.157.108", "correct_password": false, "_time": "2021-04-12T22:43:51Z"}
{"_type": "login", "email": "nancy87@hotmail.com", "ip": "82.177.157.108", "correct_password": false, "_time": "2021-04-12T22:44:05Z"}
{"_type": "login", "email": "brendansmith@hotmail.com", "ip": "82.177.157.108", "correct_password": false, "_time": "2021-04-12T22:44:19Z"}
{"_type": "login", "email": "dale56@gmail.com", "ip": "82.177.157.108", "correct_password": false, "_time": "2021-04-12T22:44:23Z"}
{"_type": "login", "email": "tanyawarren@hotmail.com", "ip": "82.177.157.108", "correct_password": false, "_time": "2021-04-12T22:44:35Z"}
{"_type": "login", "email": "nicolemarshall@hotmail.com", "ip": "82.177.157.108", "correct_password": false, "_time": "2021-04-12T22:44:43Z"}
{"_type": "login", "email": "billy04@hotmail.com", "ip": "82.177.157.108", "correct_password": false, "_time": "2021-04-12T22:44:47Z"}
{"_type": "login", "email": "mary95@yahoo.com", "ip": "82.177.157.108", "correct_password": false, "_time": "2021-04-12T22:44:59Z"}
{"_type": "login", "email": "eric26@yahoo.com", "ip": "82.177.157.108", "correct_password": true, "_time": "2021-04-12T22:45:06Z"}
{"_type": "update_payout", "email": "eric26@yahoo.com", "ip": "82.177.157.108", "bank": {"hash": "3326a480-8030-4e27-8784-bfc85304366b"}, "_time": "2021-04-12T22:45:16Z"}
{"_type": "payment", "card": {"hash": "f4c42a6f-d99d-426d-a8dc-b9fd0aeb03e0", "bin": 713244}, "amount": 67.39, "merchant": {"email": "eric26@yahoo.com", "create_ts": "2021-05-18T23:40:43"}, "_time": "2021-04-12T22:45:29Z"}
{"_type": "payment", "card": {"hash": "82a6c562-1995-46e4-9a13-b73347e080c7", "bin": 670712}, "amount": 32.35, "merchant": {"email": "eric26@yahoo.com", "create_ts": "2021-05-18T23:40:43"}, "_time": "2021-04-12T22:45:34Z"}
{"_type": "payment", "card": {"hash": "07415d87-e05c-43d4-ba61-7c3669af4648", "bin": 508544}, "amount": 29.6, "merchant": {"email": "eric26@yahoo.com", "create_ts": "2021-05-18T23:40:43"}, "_time": "2021-04-12T22:45:38Z"}"""
sumatra.create_timeline_from_jsonl('attack', attack_jsonl)
sumatra.get_timeline('attack')
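
Before relying on a timeline, it can be worth sanity-checking the JSONL locally. The sketch below uses only the standard library. `summarize_jsonl` is a hypothetical helper, not part of `sumatra-client`, and the two-line `sample` is made up for illustration.

```python
import json
from collections import Counter

def summarize_jsonl(text):
    """Parse JSONL text; return (event count, count per `_type`, whether `_time` is non-decreasing)."""
    events = [json.loads(line) for line in text.splitlines() if line.strip()]
    by_type = Counter(e["_type"] for e in events)
    times = [e["_time"] for e in events]
    # ISO-8601 UTC strings in the same format sort chronologically as plain strings
    ordered = times == sorted(times)
    return len(events), dict(by_type), ordered

sample = """{"_type": "login", "email": "a@example.com", "_time": "2021-04-12T22:43:46Z"}
{"_type": "payment", "amount": 1.0, "_time": "2021-04-12T22:45:29Z"}"""
print(summarize_jsonl(sample))
# → (2, {'login': 1, 'payment': 1}, True)
```

Run on the timeline above, `summarize_jsonl(attack_jsonl)` should report 17 events in time order: 6 `payment`, 10 `login`, and 1 `update_payout`.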

😃 __Now the fun part.__ 😃 Let's run that historical timeline through our saved topology to enrich all of the events.

For each event type, we end up with a dataframe. We'll start by viewing `login`.

In [None]:
enriched = sumatra.materialize(timeline='attack')
enriched.get_events(event_type='login')

As you can see, one IP is hammering away, trying credentials until it finally succeeds with one merchant account. Our `possible_credential_stuffing` feature becomes true.
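
The logic behind `possible_credential_stuffing` can be traced by hand. The sketch below mimics the two `login` features in plain Python on a made-up history shaped like the attack. It rests on assumptions we can't confirm from the topology alone: `ipc_block` stands in for `IPCBlock` and is taken to mean the first three octets of the address, and the current event is assumed to count toward its own window.

```python
from datetime import datetime, timedelta

def ipc_block(ip):
    # ASSUMPTION: stand-in for IPCBlock, taken here to mean the /24 (first three octets)
    return ".".join(ip.split(".")[:3])

def enrich_login(history, event, window=timedelta(minutes=10)):
    """Return (failures_by_ipc_10m, possible_credential_stuffing) for `event`."""
    now, block = event["time"], ipc_block(event["ip"])
    # ASSUMPTION: the current event counts toward its own window
    failed_emails = {
        e["email"] for e in history + [event]
        if not e["correct_password"]
        and ipc_block(e["ip"]) == block
        and now - window < e["time"] <= now
    }
    failures = len(failed_emails)
    return failures, event["correct_password"] and failures > 5

start = datetime(2021, 4, 12, 22, 43, 46)
# Nine failed logins for distinct (made-up) accounts from one IP, ten seconds apart
history = [
    {"email": f"user{i}@example.com", "ip": "82.177.157.108",
     "correct_password": False, "time": start + timedelta(seconds=10 * i)}
    for i in range(9)
]
success = {"email": "eric26@yahoo.com", "ip": "82.177.157.108",
           "correct_password": True, "time": start + timedelta(seconds=90)}

print(enrich_login(history, success))  # → (9, True)
```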

At this point, the attacker updates the payout, which is recognized to be risky (by the `risky_logins_2d` feature):

In [None]:
enriched.get_events('update_payout')

Last, we can see that the three payments preceding the attack are not deemed risky, while the three that follow it are (see `merchant_risky_bank_updates_12h`):

In [None]:
enriched.get_events('payment')

Let's look at __what we just did__. We:
* saved a draft topology with our candidate features
* created a _timeline_ of historical test events
* ran the timeline through our topology to backfill features
* viewed the enriched output

At this point, we can perform exploratory data analysis, model training, or any other data science activities on the enriched `payment` data set.

👉 __Data Science Goes Here__ 😄

Satisfied with our feature set, we can publish it to LIVE:

In [None]:
diff = sumatra.publish_branch()

and just like that, we have a scaled-out API to serve up fresh, real-time risk signals: 🔥🔥🔥

In [None]:
response = api.send({
    "_type": "payment",
    "amount": 28.5,
    "card": {"hash": "e37acc45-0b0b-4de3-9d11-9b8501f61616", "bin": 502053},
    "merchant": {"email": "eric26@yahoo.com", "create_ts": "2021-05-18T23:40:43"}
})
response['features']