# Real-Time User Segmentation

- [Overview](#user-seg-demo-overview)
- [Generate Stream Data](#user-seg-demo-gen-data)
- [Create a Stream](#user-seg-demo-create-stream)
- [Write Data to the Stream](#user-seg-demo-write-to-stream)
- [Define a Nuclio Function for Processing Stream Events](#user-seg-demo-nuclio-func-def)
- [Execute the Event-Processor Nuclio Job](#user-seg-demo-nuclio-execute-job)
- [Verify the Nuclio-Job Execution](#user-seg-demo-nuclio-verify-job-execution)
- [Cleanup](#user-seg-demo-cleanup)

<a id="user-seg-demo-overview"></a>
## Overview

This tutorial implements an application that builds a stream-event processor on a sliding time window for tagging and untagging users based on programmatic rules of user behavior.
The application demonstrates how easily you can get high throughput with just a simple data stream and a compatible Nuclio configuration.<br>

The application generates simulated slot-machine data and writes it to an Iguazio Data Science Platform ("platform") data stream ("a platform stream") named **slots_stream**.
Then, it creates a Nuclio function that tracks the in-app purchases and slot-machine spins per user; writes this data to a NoSQL table named **slots_users**, which stores a 2-hour window (`TRACK_MINUTES`) of the latest data and related timestamps; and saves tagged-users data in the function's context and prints it to the Kubernetes logs.
The stream and NoSQL table are created in the **/User/examples** directory.

The user-behavior rule in this example tags any user who, in the last two hours (`TRACK_MINUTES`), spun the slot machine 10 or more times (`SPINS_THRESHOLD`) and made at least 2 in-app purchases (`PURCHASES_THRESHOLD`).
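
The rule can be sketched as a small predicate. This is an illustration only: `should_tag` is a hypothetical helper name, and the sketch assumes, as the handler later in this tutorial does, that purchases are compared as a count and that spins are stored as per-minute counts for the tracked window.

```python
# Illustrative sketch of the tagging rule; `should_tag` is a hypothetical helper,
# not a function from the tutorial's code.
SPINS_THRESHOLD = 10      # minimum number of spins within the tracked window
PURCHASES_THRESHOLD = 2   # minimum number of in-app purchases


def should_tag(spin_array, purchases):
    """Return True if the user meets both thresholds.

    spin_array holds per-minute spin counts for the tracked window.
    """
    return sum(spin_array) >= SPINS_THRESHOLD and purchases >= PURCHASES_THRESHOLD


window = [0] * 120
window[3] = 6
window[45] = 5
print(should_tag(window, 2))  # True: 11 spins and 2 purchases
print(should_tag(window, 1))  # False: too few purchases
```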

You can configure the duration of the sliding window, the user-behavior rule, and other aspects of the application's logic by changing the values of the relevant [workflow constants](#user-seg-demo-workflow-constants) and/or by changing the implementation of the [Nuclio function](#user-seg-demo-nuclio-func-def).

> **Note:** You need to edit the code in the [Configure Nuclio](#user-seg-demo-nuclio-init-nuclio-config) section to set the `spec.triggers.slotStream.password` variable to your user password.

<a id="user-seg-demo-initialization"></a>
## Initialization

- [Install Python Parameters](#user-seg-demo-install-python-params)
- [Define Workflow Constants](#user-seg-demo-workflow-constants)

<a id="user-seg-demo-install-python-params"></a>
### Install Python Parameters

In [1]:
!pip install params



<a id="user-seg-demo-workflow-constants"></a>
### Define Workflow Constants

In [2]:
STREAM = 'slots_stream'
CONTAINER = 'users'
NOSQL_TABLE = 'slots_users'
EVENT_SPIN = 'spin'
EVENT_PURCHASE = 'in_app_purchase'
PATH = "/User/demos/slots-stream"
EVENTS = [EVENT_SPIN, EVENT_PURCHASE]
# Total number of users to test
MAX_USERS = 5000
# Upper limit of generated events
MAX_RECORDS = 10000000
# Limits for users actions: top 20% (1/5) have more than 200 spins, top 2% (1/50) have more than 2 purchases
USER_ID_LIMITS = {EVENT_SPIN: MAX_USERS // 5, EVENT_PURCHASE: MAX_USERS // 50}
# Limits for all low-grade users
EVENTS_LIMITS = {EVENT_SPIN: 199, EVENT_PURCHASE: 1}
# Total duration of the run
DURATION_HOURS = 3
# Parallel processing
SHARDS_COUNT = 5
# Spin tracking
TRACK_MINUTES = 120
# Time to keep user activity after it lost active events, minutes
KEEP_ACTIVITY = 3
# Aggregation delay
CHECK_SECONDS = 60
# Spin threshold to toggle the activity; (a low threshold increases the toggle rate)
SPINS_THRESHOLD = 10
# Purchases threshold to toggle the activity
PURCHASES_THRESHOLD = 2

<a id="user-seg-demo-gen-data"></a>
## Generate Stream Data

In [3]:
# nuclio: ignore
from random import choices, randint, uniform
import datetime
import tqdm
import time
import random
import math
import v3io_frames as v3f
import pandas as pd
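import os
# Used later to send PutRecords requests and to Base64-encode the record data
import requests
from base64 import b64encode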
# a.py
from params import *

LAST_TIME = datetime.datetime.now()

event_records = {user_id: {EVENT_SPIN: 0, EVENT_PURCHASE: 0} for user_id in range(MAX_USERS+1)}

purchase_sequence = []
spin_sequence = []

event = EVENT_PURCHASE
for _ in range(MAX_USERS * 2):
    user_id = randint(1, MAX_USERS)
    if user_id > USER_ID_LIMITS[event]:
        user_limit = EVENTS_LIMITS[event]
    else:
        user_limit = math.inf
    if event_records[user_id][event] < user_limit:
        event_records[user_id][event] += 1
        purchase_sequence.append({user_id: event})

event = EVENT_SPIN
for _ in range(MAX_RECORDS):
    user_id = randint(1, MAX_USERS)
    if user_id > USER_ID_LIMITS[event]:
        user_limit = EVENTS_LIMITS[event]
    else:
        user_limit = math.inf
    if event_records[user_id][event] < user_limit:
        event_records[user_id][event] += 1
        spin_sequence.append({user_id: event})

len_purchases = len(purchase_sequence)
len_spins = len(spin_sequence)
print('Events generated:', len_purchases + len_spins)

Events generated: 2799643
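
The per-user caps applied by the two generation loops can be read off from the constants. The sketch below restates that logic in isolation; `event_cap` is a hypothetical helper name introduced only for this illustration, and the constants are copied from the workflow-constants cell.

```python
import math

MAX_USERS = 5000
EVENT_SPIN = 'spin'
EVENT_PURCHASE = 'in_app_purchase'
USER_ID_LIMITS = {EVENT_SPIN: MAX_USERS // 5, EVENT_PURCHASE: MAX_USERS // 50}
EVENTS_LIMITS = {EVENT_SPIN: 199, EVENT_PURCHASE: 1}


def event_cap(user_id, event):
    """Hypothetical helper: maximum number of events of this type for a user."""
    if user_id > USER_ID_LIMITS[event]:
        return EVENTS_LIMITS[event]   # low-grade user: capped
    return math.inf                   # top user: no cap


print(event_cap(100, EVENT_PURCHASE))   # inf: IDs 1-100 are uncapped for purchases
print(event_cap(101, EVENT_PURCHASE))   # 1
print(event_cap(1000, EVENT_SPIN))      # inf: IDs 1-1000 are uncapped for spins
print(event_cap(1001, EVENT_SPIN))      # 199
```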


<a id="user-seg-demo-create-stream"></a>
## Create a Stream

Create a new stream object with 5 shards (`SHARDS_COUNT`).<br>
Each stream shard will be consumed by an instance of a Nuclio function that is triggered by stream events.
For example, creating a stream with 5 shards will spawn 5 instances of a Nuclio function when data is available on the respective shards.
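
The write cells later in this tutorial set each record's partition key to the user ID, which is intended to keep all events of a user on one shard and therefore on one function instance. The sketch below only illustrates that idea: the platform's actual key-to-shard mapping is not described in this tutorial, so CRC32 is used here as a stand-in hash, and `illustrative_shard` is a hypothetical name.

```python
import zlib

SHARDS_COUNT = 5


def illustrative_shard(partition_key, shard_count=SHARDS_COUNT):
    """Map a partition key to a shard ID.

    CRC32 is a stand-in hash for illustration, not the platform's algorithm.
    """
    return zlib.crc32(partition_key.encode('utf-8')) % shard_count


# The same key always maps to the same shard
for user_id in ('57', '4479', '57'):
    print(user_id, '->', illustrative_shard(user_id))
```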

In [4]:
# nuclio: ignore
from v3io_frames import frames_pb2 as fpb

# Create a Frames client object
client = v3f.Client("framesd:8081", container="users")

TABLE = os.getenv('V3IO_USERNAME') + '/examples/' + STREAM

# Delete the stream if it exists
client.delete(backend="stream", table=TABLE, if_missing=fpb.IGNORE)
# Create a new stream
client.create(backend="stream", table=TABLE, shards=SHARDS_COUNT, retention_hours=DURATION_HOURS)

url = 'http://v3io-webapi:8081/users/' + os.getenv('V3IO_USERNAME') + '/examples/' + STREAM + '/'
headers = {
            "Content-Type": "application/json",
            "X-v3io-function": "PutRecords",
            "X-v3io-session-key": os.getenv('V3IO_ACCESS_KEY')
          }

records = []

def send_payload(records):
    if (len(records) > 0):
        payload = {
            "Records": records
        }
        requests.post(url, json=payload, headers=headers, verify=False)

<a id="user-seg-demo-write-to-stream"></a>
## Write Data to the Stream

Write (ingest) data into the stream in bulk batches of 1,000 items to improve performance.

- [Generate Timestamps](#user-seg-demo-write-to-stream-gen-timestamps)
- [Populate the Stream with Purchase Data](#user-seg-demo-write-to-stream-purchases)
- [Populate the Stream with Spin Data](#user-seg-demo-write-to-stream-spins)
- [Verify the Data Ingestion](#user-seg-demo-write-to-stream-verify)
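
The batching described above can also be expressed as a generic chunking helper. This is a minimal sketch rather than the code that the following cells use; `batches` is a hypothetical name.

```python
def batches(items, size=1000):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


sizes = [len(batch) for batch in batches(list(range(2500)))]
print(sizes)  # [1000, 1000, 500]
```

Each yielded batch could then be passed to a function such as `send_payload`.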

<a id="user-seg-demo-write-to-stream-gen-timestamps"></a>
### Generate Timestamps

Generate random timestamps to be used for the ingested stream data.

In [5]:
# nuclio: ignore
# Generate random timestamps
timestamps_list = []
for i in range(len_spins+len_purchases):
    timestamps_list.append((LAST_TIME - datetime.timedelta(hours = random.random() * DURATION_HOURS)).strftime('%Y-%m-%d %H:%M:%S.%f UTC'))

<a id="user-seg-demo-write-to-stream-purchases"></a>
### Populate the Stream with Purchase Data

In [6]:
# nuclio: ignore
# Prepare purchases data to write
for i in tqdm.tqdm(range(len_purchases)):
    key = list(purchase_sequence[i].keys())[0]
    data = str(b64encode(f'{purchase_sequence[i][key]},{timestamps_list[i]},{key}'.encode("UTF-8")), "UTF-8")
    records.append({"Data": data, "PartitionKey": str(key)})
    if len(records) == 1000:
        send_payload(records)
        records = []
send_payload(records)
records = []

100%|██████████| 4458/4458 [00:00<00:00, 450298.80it/s]
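
Each record carries a Base64-encoded message of the form `event,timestamp,user_id`, which the Nuclio handler later splits on commas. The following sketch shows the round trip in isolation; `encode_record` and `decode_record` are hypothetical helper names, and the sample values are taken from the verification output in this tutorial.

```python
from base64 import b64decode, b64encode


def encode_record(event, timestamp, user_id):
    """Build one record entry in the same shape as the cells in this section."""
    message = f'{event},{timestamp},{user_id}'
    data = b64encode(message.encode('utf-8')).decode('utf-8')
    return {'Data': data, 'PartitionKey': str(user_id)}


def decode_record(record):
    """Reverse the encoding and split the message into its three fields."""
    message = b64decode(record['Data']).decode('utf-8')
    event, timestamp, user_id = message.split(',')
    return event, timestamp, user_id


record = encode_record('in_app_purchase', '2020-01-18 19:03:22.388396 UTC', 57)
print(decode_record(record))
# ('in_app_purchase', '2020-01-18 19:03:22.388396 UTC', '57')
```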


<a id="user-seg-demo-write-to-stream-spins"></a>
### Populate the Stream with Spin Data

In [7]:
# nuclio: ignore
# Prepare spins data to write
for i in tqdm.tqdm(range(len_spins)):
    key = list(spin_sequence[i].keys())[0]
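    # Spin timestamps start after the entries that were used for the purchase events
    data = str(b64encode(f'{spin_sequence[i][key]},{timestamps_list[len_purchases + i]},{key}'.encode("UTF-8")), "UTF-8")
    records.append({"Data": data, "PartitionKey": str(key)})
    if len(records) == 1000: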
        send_payload(records)
        records = []
send_payload(records)
records = []

100%|██████████| 2795185/2795185 [01:27<00:00, 31892.85it/s]


<a id="user-seg-demo-write-to-stream-verify"></a>
### Verify the Data Ingestion

Read (consume) data from the stream to verify the previous write operations.

In [8]:
# nuclio: ignore
# Read data from shard 0 of the stream into a pandas DataFrame
client.read(backend="stream", table=TABLE, seek="earliest", shard_id="0")

| seq_number | data | stream_time |
|---|---|---|
| 1 | in_app_purchase,2020-01-18 16:43:29.067447 UTC... | 2020-01-18 19:22:15.329781652 |
| 2 | in_app_purchase,2020-01-18 16:29:02.037484 UTC... | 2020-01-18 19:22:15.329781652 |
| 3 | in_app_purchase,2020-01-18 17:17:07.810692 UTC... | 2020-01-18 19:22:15.329781652 |
| 4 | in_app_purchase,2020-01-18 16:54:44.035642 UTC... | 2020-01-18 19:22:15.329781652 |
| 5 | in_app_purchase,2020-01-18 16:55:07.114582 UTC... | 2020-01-18 19:22:15.329781652 |
| 6 | in_app_purchase,2020-01-18 17:33:56.903202 UTC... | 2020-01-18 19:22:15.329781652 |
| 7 | in_app_purchase,2020-01-18 19:03:22.388396 UTC,57 | 2020-01-18 19:22:15.329781652 |
| 8 | in_app_purchase,2020-01-18 17:13:46.155450 UTC... | 2020-01-18 19:22:15.329781652 |
| 9 | in_app_purchase,2020-01-18 17:18:16.206886 UTC... | 2020-01-18 19:22:15.329781652 |
| 10 | in_app_purchase,2020-01-18 17:41:09.312284 UTC... | 2020-01-18 19:22:15.329781652 |


In [9]:
# Delete the slots_users NoSQL table if it exists from a previous run
!rm -rf /User/examples/slots_users

<a id="user-seg-demo-nuclio-init"></a>
## Nuclio Initialization

- [Import Nuclio](#user-seg-demo-nuclio-init-import-nuclio)
- [Install Required Packages](#user-seg-demo-nuclio-init-install-pkgs)
- [Configure Nuclio](#user-seg-demo-nuclio-init-nuclio-config)
- [Define Nuclio Environment Variables](#user-seg-demo-define-envars)
- [Import Required Libraries](#user-seg-demo-import-libs)

<a id="user-seg-demo-nuclio-init-import-nuclio"></a>
### Import Nuclio

In [10]:
# nuclio: ignore
import nuclio

<a id="user-seg-demo-nuclio-init-install-pkgs"></a>
### Install Required Packages

In [11]:
%%nuclio cmd -c
pip install --upgrade pip
pip install v3io_frames

<a id="user-seg-demo-nuclio-init-nuclio-config"></a>
### Configure Nuclio

> **Note:** You need to edit the definition of the `spec.triggers.slotStream.password` variable to set the password for the running platform user.

In [12]:
%%nuclio config
spec.build.baseImage = "python:3.6-jessie"
spec.resources.requests.memory = "4G"
# ERROR: {"error":"json: cannot unmarshal string into Go struct field Build.noCache of type bool","errorStackTrace":"\nError - json: cannot unmarshal string into Go struct field Build.noCache of type bool\n    .../nuclio/pkg/dashboard/resource/function.go:344\n\nCall stack:\nFailed to parse JSON body\n    .../nuclio/pkg/dashboard/resource/function.go:344\n"}
# spec.build.noCache = "true"
spec.readinessTimeoutSeconds = 10
spec.loggerLevel = "info"
spec.triggers.slotStream.kind = "v3ioStream"
spec.triggers.slotStream.url = "http://${V3IO_API}/users/${V3IO_USERNAME}/examples/slots_stream"
spec.triggers.slotStream.username = "${V3IO_USERNAME}"
###### TODO: SET YOUR PASSWORD HERE ######
spec.triggers.slotStream.password = "<SET PASSWORD>"
##########################################
spec.triggers.slotStream.attributes.seekTo = "earliest"
spec.triggers.slotStream.attributes.partitions = [0, 1, 2, 3, 4]

%nuclio: setting spec.build.baseImage to 'python:3.6-jessie'
%nuclio: setting spec.resources.requests.memory to '4G'
%nuclio: setting spec.readinessTimeoutSeconds to 10
%nuclio: setting spec.loggerLevel to 'info'
%nuclio: setting spec.triggers.slotStream.kind to 'v3ioStream'
%nuclio: setting spec.triggers.slotStream.url to 'http://v3io-webapi.default-tenant.svc:8081/users/iguazio/examples/slots_stream'
%nuclio: setting spec.triggers.slotStream.username to 'iguazio'
%nuclio: setting spec.triggers.slotStream.password to '██████████'
%nuclio: setting spec.triggers.slotStream.attributes.seekTo to 'earliest'
%nuclio: setting spec.triggers.slotStream.attributes.partitions to [0, 1, 2, 3, 4]


<a id="user-seg-demo-define-envars"></a>
### Define Nuclio Environment Variables

In [13]:
%nuclio env -c V3IO_USERNAME=${V3IO_USERNAME}
%nuclio env -c CONTAINER_NAME=users
%nuclio env -c TABLE_NAME=slot_data
%nuclio env -c WEB_API_HOST_AND_PORT=v3io-webapi:8081
%nuclio env -c SESSION_KEY=${V3IO_ACCESS_KEY}
%nuclio env -c FRAMESD_URL=framesd:8081

<a id="user-seg-demo-import-libs"></a>
### Import Required Libraries

In [14]:
import os
import json
import time
import http.client
import uuid
import v3io_frames as v3f
import pandas as pd
from urllib3 import HTTPConnectionPool

<a id="user-seg-demo-nuclio-func-def"></a>
## Define a Nuclio Function for Processing Stream Events

Define a Nuclio function that implements a stream-event processor on a sliding time window for tagging and untagging users based on programmatic rules of user behavior.
The processed data is written to a NoSQL table.
The function implements the following flow:

- An instance of a Nuclio function is spawned for each stream shard and handles the incoming events on that shard.
- Slot-machine spin and in-app purchase data is written to the stream, triggering the Nuclio function.
- The function saves to a NoSQL table a two-hour sliding window (`TRACK_MINUTES`) of the latest user data.
- The user data is aggregated and written to the NoSQL table as batch jobs in 1-minute intervals (`CHECK_SECONDS`) along with the latest aggregation update time.<br>
  Using batch writes improves the throughput, because there are fewer network calls and disk reads and writes.
- Tagged-user data is stored in the context of the Nuclio function and written to Kubernetes logs.<br>
  Each time new user data is written, the old data for this user is read from the table and the condition for tagging users is checked.
  Updated tagged-users data is saved in the function's context (`context.user_data.tagged_users`).

> **Note:** In the [**Verify the Nuclio-Job Execution**](#user-seg-demo-nuclio-verify-job-execution) section of the tutorial you can find examples for reading tagged user data from the NoSQL table and reading the Kubernetes execution logs.
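
The sliding-window bookkeeping in the flow above comes down to two list operations: aging the stored per-minute counts by the number of minutes that passed since the last update, and adding the freshly collected counts bucket by bucket. The sketch below shows both steps in isolation under the assumption that index 0 is the most recent minute; `shift_window` and `merge_windows` are hypothetical helper names, not functions from the handler.

```python
TRACK_MINUTES = 120


def shift_window(spin_array, minutes):
    """Age the window by `minutes`: prepend zeros and drop the oldest buckets."""
    minutes = max(0, min(minutes, len(spin_array)))
    if minutes == 0:
        return list(spin_array)
    return [0] * minutes + spin_array[:len(spin_array) - minutes]


def merge_windows(stored, fresh):
    """Add fresh per-minute counts to the stored counts, bucket by bucket."""
    return [a + b for a, b in zip(stored, fresh)]


stored = [0] * TRACK_MINUTES
stored[0] = 4                      # 4 spins in the most recent minute
stored[TRACK_MINUTES - 1] = 7      # 7 spins that are about to age out

fresh = [0] * TRACK_MINUTES
fresh[1] = 3                       # 3 new spins that happened one minute ago

aged = shift_window(stored, 2)     # two minutes passed since the last update
merged = merge_windows(aged, fresh)
print(merged[:4], sum(merged))     # [0, 3, 4, 0] 7
```

The 7 old spins fall out of the window after the shift, so only the 4 shifted spins and the 3 fresh spins remain.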

In [15]:
# Initializes the context on instance creation
# The function initializes user data that is persisted among function invocations.
def init_context(context):
    context.user_data.TABLE_NAME = os.getenv('V3IO_USERNAME') + '/examples/' + NOSQL_TABLE
    context.user_data.CONTAINER_NAME = os.getenv('CONTAINER_NAME')
    context.user_data.WEB_API_HOST_AND_PORT = os.getenv('WEB_API_HOST_AND_PORT')
    context.user_data.SESSION_KEY = os.getenv('SESSION_KEY')
    context.user_data.FRAMESD_URL = os.getenv('FRAMESD_URL')
    
    # V3IO Frames client for packet writes (NoSQL table)
    context.user_data.client = v3f.Client(context.user_data.FRAMESD_URL, container=context.user_data.CONTAINER_NAME, token=context.user_data.SESSION_KEY)

    # Nuclio instance identifier
    context.user_data.worker_id = uuid.uuid4()

    # Records of active users (within the aggregation delay)
    context.user_data.active_users = {}

    # Users with tagged state
    context.user_data.tagged_users = {}

    # Time of last aggregation, seconds of the Unix epoch (Unix timestamp)
    context.user_data.last_aggregation = -1

    # Aggregation lock for the active thread
    context.user_data.aggregation_lock = False

    # Instance start time
    context.user_data.start_time = -1

    # Number of processed messages
    context.user_data.processed_events = 0
    
    headers = {'Content-Type': 'application/json',
           'X-v3io-function': 'GetItem',
           'cache-control': 'no-cache',
           'X-v3io-session-key': context.user_data.SESSION_KEY}
    context.user_data.pool = HTTPConnectionPool(context.user_data.WEB_API_HOST_AND_PORT, maxsize=100, headers = headers)


# Handles stream events
# The function reads per-user slot-machine events (spins and in-app purchases) from a platform stream.
# The data for each user is aggregated in memory and written to a NoSQL table in 1-minute intervals (batch writes).
# The data is stored in a user_id array of 1-minute slot aggregates and the timestamp for the last array update.
# Data for inactive users is discarded.
# Each time new user data is written, the old data for this user is read from the table, the tagged-users condition
# is checked, and the tagged-users data in the function's context is updated accordingly.
def handler(context, event):
    try:
        TABLE_NAME = context.user_data.TABLE_NAME
        CONTAINER_NAME = context.user_data.CONTAINER_NAME
        WEB_API_HOST_AND_PORT = context.user_data.WEB_API_HOST_AND_PORT
        SESSION_KEY = context.user_data.SESSION_KEY
        FRAMESD_URL = context.user_data.FRAMESD_URL
        # Account the fresh event
        context.user_data.processed_events += 1

        # Initialize start time during the first run
        if context.user_data.start_time < 0:
            context.user_data.start_time = time.time()

        # Extract action parameters
        msg = event.body.decode().strip()
        action, timestamp, user_id = msg.split(',')

        if context.user_data.processed_events % 10000 == 0:
            context.logger.debug("sample event: " + msg)
            context.logger.debug("context.user_data.processed_events=" + str(context.user_data.processed_events) + "\n" + str(context))

        # Current time in seconds of epoch
        current_time_seconds = int(time.time())

        # Current time in minutes of epoch
        current_time_minutes = current_time_seconds // 60

        # Check for a new user
        if user_id not in context.user_data.active_users:
            context.user_data.active_users[user_id] = {}
            context.user_data.active_users[user_id]['spin_array'] = [0] * TRACK_MINUTES
            context.user_data.active_users[user_id]['purchases'] = 0

        # Process the action
        if action == 'in_app_purchase':
            context.user_data.active_users[user_id]['purchases'] += 1
        elif action == 'spin':
            event_time = int(time.mktime(time.strptime(timestamp, '%Y-%m-%d %H:%M:%S.%f UTC'))) // 60
            time_lag = current_time_minutes - event_time
            time_lag = max(time_lag, 0)
            if time_lag < TRACK_MINUTES:
                context.user_data.active_users[user_id]['spin_array'][time_lag] += 1

        # Exit if aggregation is already in progress
        if context.user_data.aggregation_lock:
            return event.body
        # Start data aggregation during the first run or after the predefined period
        if (context.user_data.last_aggregation == -1) or (current_time_seconds - context.user_data.last_aggregation > CHECK_SECONDS):
            # Set the aggregation trigger to avoid overlays
            context.user_data.aggregation_lock = True

            # Reset aggregation time
            context.user_data.last_aggregation = time.time()

            # Copy all active users and purge the previous records
            active_users_local = dict(context.user_data.active_users)
            context.user_data.active_users = {}

            # Prepare lists of fresh-tagged and untagged users for the report
            fresh_tagged = set()
            fresh_untagged = set()

            # Global payload for the NoSQL table update
            global_payload = []

            # Process list of recent users
            for user_id in active_users_local:

                # Set aggregation structure subset
                aggregation = {}

                # Read stored user data
                response_status, response_data = read_item(WEB_API_HOST_AND_PORT, CONTAINER_NAME, TABLE_NAME, user_id, context.user_data.pool, SESSION_KEY)
                if response_status == 404:

                    # Initialize a new user (not already found in the NoSQL table)
                    if user_id in active_users_local:
                        aggregation['spin_array'] = active_users_local[user_id]['spin_array']
                        aggregation['purchases'] = active_users_local[user_id]['purchases']
                    else:
                        aggregation['spin_array'] = [0] * TRACK_MINUTES
                        aggregation['purchases'] = 0
                    aggregation['updated'] = current_time_minutes
                else:

                    # Load existing data
                    loaded = json.loads(response_data.decode())
                    aggregation['updated'] = int(loaded['Item']['updated']['N'])
                    aggregation['spin_array'] = list(map(int, loaded['Item']['spin_array']['S'].split()))
                    aggregation['purchases'] = int(loaded['Item']['purchases']['N'])

                    # Update existing data
                    time_delta = int(current_time_minutes - aggregation['updated'])
                    time_delta = min(time_delta, TRACK_MINUTES)
                    if time_delta > 0:
                        aggregation['updated'] = current_time_minutes
                        for _ in range(time_delta):
                            aggregation['spin_array'] = [0] + aggregation['spin_array']
                            aggregation['spin_array'].pop()

                # Account fresh data
                if user_id in active_users_local:
                    aggregation['purchases'] += active_users_local[user_id]['purchases']
                    aggregation['spin_array'] = [sum(x) for x in zip(aggregation['spin_array'], active_users_local[user_id]['spin_array'])]
                    aggregation['updated'] = current_time_minutes

                # Check activity conditions and tag the active user
                if (sum(aggregation['spin_array']) >= SPINS_THRESHOLD) and (aggregation['purchases'] >= PURCHASES_THRESHOLD):
                    if user_id not in context.user_data.tagged_users:
                        context.user_data.tagged_users[user_id] = {}
                    context.user_data.tagged_users[user_id]['expire'] = current_time_minutes + KEEP_ACTIVITY
                    fresh_tagged.add(user_id)

                # Prepare a combined payload
                if aggregation['updated'] == current_time_minutes:
                    global_payload.append({'user_id': user_id,
                                           'spin_array': ' '.join(list(map(str, aggregation['spin_array']))),
                                           'purchases': aggregation['purchases'],
                                           'updated': aggregation['updated']})

            # Untag expired tagged users
            for user_id in context.user_data.tagged_users:
                if context.user_data.tagged_users[user_id]['expire'] <= current_time_minutes:
                    fresh_untagged.add(user_id)
            for user_id in fresh_untagged:
                del context.user_data.tagged_users[user_id]

            tagged_users_ids = list(dict(context.user_data.tagged_users).keys())
            runtime = time.time() - context.user_data.start_time
            context.logger.debug(f'Worker ID: {context.user_data.worker_id}\n',
                  f'Runtime: {runtime} s\n',
                  f'Processed events: {context.user_data.processed_events} \n',
                  f'Events per second: {int(context.user_data.processed_events / runtime)}\n',
                  f'Toggled users: {len(fresh_tagged) + len(fresh_untagged)}\n',
                #  f'Tagged users ({len(fresh_tagged)}): {fresh_tagged} \n',
                  f'Active users ({len(active_users_local)}) \n',
                  f'Tagged users ({len(context.user_data.tagged_users)}): {str(tagged_users_ids)} \n',
                  f'Fresh Tagged users ({len(fresh_tagged)}): {fresh_tagged}  \n',
                  f'Fresh Untagged users ({len(fresh_untagged)}): {fresh_untagged}')

            # Write the combined payload to the NoSQL table
            df = pd.DataFrame(global_payload)
            df.set_index('user_id', inplace=True)
            context.user_data.client.write(backend='kv', table=TABLE_NAME, dfs=df, save_mode="overwriteItem")
            
            context.user_data.aggregation_lock = False
        return event.body
    except BaseException as e:
        context.logger.debug("EXCEPTION: " + str(e))
        return event.body


# Reads a single item from a NoSQL table
def read_item(url, container, table, key, pool, session_key):
    headers = {'Content-Type': 'application/json',
               'X-v3io-function': 'GetItem',
               'cache-control': 'no-cache',
               'X-v3io-session-key': session_key}
    payload = {'Key': {'user_id': {'S': key}}, 'AttributesToGet': '*'}
    # conn = http.client.HTTPConnection(url)
    # conn.request('POST', f'/{container}/{table}/', bytes(json.dumps(payload), encoding='utf-8'), headers=headers)

    response = pool.urlopen('POST', f'/{container}/{table}/', body=bytes(json.dumps(payload), encoding='utf-8'), headers=headers)
    # response = conn.getresponse()
    data = response.data#.decode('utf-8')
    # data = response.read()
    # conn.close()
    return response.status, data
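
The handler decodes the `GetItem` response by reading the `N` (number) and `S` (string) attribute values of the returned item. The sketch below isolates that parsing step so that it can be tried without a platform connection; `parse_item` is a hypothetical helper name, and the sample response is made up to match the attribute layout that the handler expects.

```python
import json


def parse_item(response_data):
    """Convert a GetItem response body (bytes) into plain Python values."""
    item = json.loads(response_data.decode('utf-8'))['Item']
    return {
        'updated': int(item['updated']['N']),
        'purchases': int(item['purchases']['N']),
        # spin_array is stored as a space-separated string of per-minute counts
        'spin_array': [int(count) for count in item['spin_array']['S'].split()],
    }


# Hypothetical response body, shaped like the attributes the handler reads
sample = json.dumps({
    'Item': {
        'updated': {'N': '26322929'},
        'purchases': {'N': '2'},
        'spin_array': {'S': '0 0 2 1'},
    }
}).encode('utf-8')

print(parse_item(sample))
# {'updated': 26322929, 'purchases': 2, 'spin_array': [0, 0, 2, 1]}
```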


<a id="user-seg-demo-nuclio-execute-job"></a>
## Execute the Event-Processor Nuclio Job

- [Deploy the Nuclio Function](#user-seg-demo-nuclio-func-deploy)
- [Wait for the Job to Start](#user-seg-demo-nuclio-job-wait-start)
- [Wait for the Job to Finish](#user-seg-demo-nuclio-job-wait-finish)

<a id="user-seg-demo-nuclio-func-deploy"></a>
### Deploy the Nuclio Function

Deploy the stream event-processor Nuclio function.

In [16]:
%nuclio deploy -n slot-machine-v3iostream -p examples -v

[nuclio.deploy] 2020-01-18 19:26:59,284 creating slot-machine-v3iostream
[nuclio.deploy] 2020-01-18 19:26:59,403 deploying ...
[nuclio.deploy] 2020-01-18 19:27:00,445 (info) Building processor image
[nuclio.deploy] 2020-01-18 19:27:00,446 {'imageName': 'nuclio/processor-slot-machine-v3iostream:latest', 'level': 'info', 'message': 'Building processor image', 'name': 'deployer', 'time': 1579375620173.0205}
[nuclio.deploy] 2020-01-18 19:27:08,651 (info) Build complete
[nuclio.deploy] 2020-01-18 19:27:08,652 {'level': 'info', 'message': 'Build complete', 'name': 'deployer', 'result': {'Image': 'nuclio/processor-slot-machine-v3iostream:latest', 'UpdatedFunctionConfig': {'metadata': {'annotations': {'nuclio.io/generated_by': 'function generated at 18-01-2020 by iguazio from /User/demos/slots-stream/real-time-user-segmentation.ipynb'}, 'labels': {'nuclio.io/project-name': 'examples'}, 'name': 'slot-machine-v3iostream', 'namespace': 'default-tenant'}, 'spec': {'build': {'baseImage': 'python:3.

<a id="user-seg-demo-nuclio-job-wait-start"></a>
### Wait for the Job to Start

In [17]:
# nuclio: ignore
import pandas as pd
import v3io_frames as v3f
import os
from sqlalchemy.engine import create_engine

In [18]:
# nuclio: ignore
while True:
    if os.path.isdir("/User/examples/" + NOSQL_TABLE):
        break
    time.sleep(5)
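
The loop above waits indefinitely if the function never creates the table. A bounded variant is sketched below as one possible alternative; `wait_for` is a hypothetical helper, and the timeout and interval values are arbitrary assumptions.

```python
import time


def wait_for(predicate, timeout_seconds=600, interval_seconds=5):
    """Poll `predicate` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_seconds)
    # One last check at the deadline
    return predicate()


# Demonstration with a predicate that succeeds on the third check
attempts = {'count': 0}


def ready_after_three_checks():
    attempts['count'] += 1
    return attempts['count'] >= 3


print(wait_for(ready_after_three_checks, timeout_seconds=5, interval_seconds=0.01))  # True
```

In this notebook, the predicate could be `lambda: os.path.isdir("/User/examples/" + NOSQL_TABLE)`.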

<a id="user-seg-demo-nuclio-job-wait-finish"></a>
### Wait for the Job to Finish

In [19]:
# nuclio: ignore
client = v3f.Client("framesd:8081", container="users")

In [20]:
# nuclio: ignore
TABLE_NAME = os.getenv('V3IO_USERNAME') + '/examples/' + NOSQL_TABLE

In [21]:
# nuclio: ignore
engine = create_engine(os.getenv('DATABASE_URL'))
table_path = os.path.join('v3io.users."'+str(os.getenv('V3IO_USERNAME')) + '/examples/slots_users"')
query = 'select count(1) count from ' + table_path

def count():
    df = pd.read_sql(query, engine)
    return df.loc[0, 'count']

timeStart = time.time()
while count() < MAX_USERS:
    time.sleep(1)
totalTime = int(round((time.time() - timeStart) * 1000))
print("Total run time: " + str(totalTime) + " ms")

Total run time: 58830 ms


<a id="user-seg-demo-nuclio-verify-job-execution"></a>
## Verify the Nuclio-Job Execution

- [Read from the NoSQL Table](#user-seg-demo-nuclio-read-nosql-data)
- [Print Logs](#user-seg-demo-nuclio-print-logs)

<a id="user-seg-demo-nuclio-read-nosql-data"></a>
### Read from the NoSQL Table

Verify the stream data by using V3IO Frames to read arrays of spins, purchases, and last updated timestamps from the NoSQL table into a pandas DataFrame and display the results.

In [22]:
# nuclio: ignore
client = v3f.Client("framesd:8081", container="users")
no_spins = ' '.join(list(map(str, TRACK_MINUTES * [0])))
df = client.read(backend="kv", table=TABLE_NAME, filter="purchases>0 AND spin_array!='" + no_spins + "'")
display(df.head(20))

| user_id | purchases | spin_array | updated |
|---|---|---|---|
| 4479 | 2 | 0 0 0 0 0 0 0 0 0 0 2 0 2 1 0 0 0 2 0 0 0 0 2 ... | 26322929 |
| 3685 | 2 | 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 2 0 0 0 1 ... | 26322929 |
| 1767 | 2 | 0 0 0 0 0 0 0 0 0 2 0 2 0 1 0 2 0 0 0 0 2 2 0 ... | 26322929 |
| 79 | 4 | 0 0 0 0 0 0 0 0 0 0 1 2 1 2 0 0 0 0 0 0 2 0 0 ... | 26322929 |
| 2101 | 2 | 0 0 0 0 0 0 0 0 1 0 0 0 0 4 0 0 0 0 2 0 0 0 0 ... | 26322929 |
| 3166 | 2 | 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 2 ... | 26322929 |
| 3695 | 2 | 0 0 0 0 0 0 0 0 0 2 0 0 0 3 0 3 2 0 1 2 0 2 0 ... | 26322929 |
| 890 | 2 | 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 4 0 2 0 0 2 0 2 ... | 26322929 |
| 4253 | 2 | 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 3 1 2 1 0 2 0 ... | 26322929 |
| 3911 | 2 | 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 2 ... | 26322929 |


<a id="user-seg-demo-nuclio-print-logs"></a>
### Print Logs

Print the Kubernetes logs for the executed Nuclio job.
The logs should contain metrics such as the number of events per second.
The total throughput is the sum of the events-per-second data for all workers (function instances).
For example, for the default application configuration, which uses a stream with 5 shards, on an m5.8xlarge EC2 instance you should get about 20,000 events per second.
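
Summing the per-worker figures can be sketched as follows. The sketch assumes that you have collected one latest report line per worker and that each line contains the text `Events per second: <number>`, as in the handler's report; `total_throughput` is a hypothetical helper, and the sample lines and numbers are made up for illustration.

```python
import re

# Matches the "Events per second" figure that the handler's report includes
EVENTS_PER_SECOND = re.compile(r'Events per second: (\d+)')


def total_throughput(report_lines):
    """Sum the events-per-second figures found in the given report lines."""
    total = 0
    for line in report_lines:
        match = EVENTS_PER_SECOND.search(line)
        if match:
            total += int(match.group(1))
    return total


# Hypothetical sample: one latest report line per worker (made-up numbers)
sample_reports = [
    'Worker A | Events per second: 4100',
    'Worker B | Events per second: 3900',
    'Worker C | Events per second: 4050',
    'Worker D | Events per second: 3950',
    'Worker E | Events per second: 4000',
    'Runtime: 58.8 s',  # no figure on this line, so it is ignored
]
print(total_throughput(sample_reports))  # 20000
```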

In [23]:
!kubectl logs $(kubectl get pods | grep slot | cut -d ' ' -f 1) | grep -e Runtime -e Worker

{"level":"debug","time":"2020-01-18T19:27:10.294Z","name":"processor","message":"Read configuration","more":"config=&{Config:{Meta:{Name:slot-machine-v3iostream Namespace:default-tenant Labels:map[] Annotations:map[nuclio.io/generated_by:function generated at 18-01-2020 by iguazio from /User/demos/slots-stream/real-time-user-segmentation.ipynb]} Spec:{Description: Disabled:false Publish:false Handler:real-time-user-segmentation:handler Runtime:python:3.6 Env:[{Name:V3IO_USERNAME Value:iguazio ValueFrom:nil} {Name:CONTAINER_NAME Value:users ValueFrom:nil} {Name:TABLE_NAME Value:slot_data ValueFrom:nil} {Name:WEB_API_HOST_AND_PORT Value:v3io-webapi:8081 ValueFrom:nil} {Name:SESSION_KEY Value:21f3db74-0b72-46bb-897f-d097d3802000 ValueFrom:nil} {Name:FRAMESD_URL Value:framesd:8081 ValueFrom:nil}] Resources:{Limits:map[] Requests:map[memory:{i:{value:4 scale:9} d:{Dec:<nil>} s:4G Format:DecimalSI}]} Image:docker-registry.default-tenant.app.██████████:80/nuclio/processor-slot-machine-v3iostr

<a id="user-seg-demo-cleanup"></a>
## Cleanup

You can optionally delete any of the directories or files that you created.
See the instructions in the [Creating and Deleting Container Directories](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/containers/#create-delete-container-dirs) tutorial.
For example, the following code uses a local file-system command to delete the entire **&lt;running user&gt;/examples/** directory in the "users" container.
Edit the path, as needed, then remove the comment mark (`#`) and run the code.

In [None]:
#!rm -rf /User/examples/