# 2a. Stream to Parquet
--------------------------------------------------------------------

Store the input stream in a set of Parquet files, which serve as a log of raw events.

![Model deployment with streaming Real-time operational Pipeline](../../assets/images/model-deployment-with-streaming.png)

Each batch (1024 records by default) is stored in a separate Parquet file. The default output location is `/User/examples/model-deployment-with-streaming/data/events-pq`.

## Initialize

Load the project

In [1]:
from mlrun import load_project
from os import path

project_path = path.abspath('conf')
project = load_project(project_path)

Get the path of the generated stream. This stream is the source of the data that is stored to Parquet.

In [2]:
input_stream = project.params.get('STREAM_CONFIGS').get('generated-stream')
input_stream_path = input_stream.get('path')

Nuclio leverages consumer groups. When one or more Nuclio replicas join a consumer group, each replica receives an equal share of the shards, based on the number of replicas that are defined for the function.
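As a rough illustration of what an "equal share" of shards can look like, the sketch below splits shard IDs across replicas in round-robin order. The `assign_shards` helper and the round-robin policy are assumptions made for this example only; the actual assignment is performed by the platform and may follow a different scheme.

```python
from typing import Dict, List


def assign_shards(num_shards: int, num_replicas: int) -> Dict[int, List[int]]:
    """Distribute shard IDs 0..num_shards-1 across replicas round-robin (illustrative only)."""
    if num_replicas <= 0:
        raise ValueError("num_replicas must be positive")
    assignment: Dict[int, List[int]] = {replica: [] for replica in range(num_replicas)}
    for shard in range(num_shards):
        # shard i goes to replica i modulo the number of replicas
        assignment[shard % num_replicas].append(shard)
    return assignment


# Hypothetical stream with 8 shards that is consumed by 3 replicas
for replica, shards in assign_shards(8, 3).items():
    print(f"replica {replica}: shards {shards}")
```

With 8 shards and 3 replicas the shares cannot be exactly equal, so in this sketch two replicas receive three shards each and one receives two.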

We set up the input stream URL below. A consumer-group URL is in the form of `http://v3io-webapi:8081/<container name>/<stream path>@<consumer group name>`. In this case we use `WEB_API_USERS` for URL prefix `http://v3io-webapi:8081/<container name>` and a consumer group named **`stream2pq`**.

For more information, refer to the [Nuclio v3iostream trigger reference documentation](https://nuclio.io/docs/latest/reference/triggers/v3iostream/).
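The URL format described above can be sketched as a small helper. The `consumer_group_url` function is hypothetical, and the way the example splits the URL into a prefix and a stream path is an assumption based on the format shown in this section; the real values come from the project parameters.

```python
def consumer_group_url(web_api_prefix: str, stream_path: str, consumer_group: str) -> str:
    """Build '<prefix>/<stream path>@<consumer group>' without duplicate slashes."""
    prefix = web_api_prefix.rstrip("/")
    stream = stream_path.strip("/")
    return f"{prefix}/{stream}@{consumer_group}"


# Assumed example values: the prefix is 'http://v3io-webapi:8081/<container name>'
url = consumer_group_url(
    "http://v3io-webapi:8081/users",
    "/iguazio/examples/model-deployment-with-streaming/data/generated-stream",
    "stream2pq",
)
print(url)
```

Building the URL with plain string formatting, rather than a file-system path helper, keeps the separator a forward slash regardless of the operating system.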

In [3]:
WEB_API_USERS = project.params.get('WEB_API_USERS')
input_stream_url = path.join(WEB_API_USERS, input_stream_path) + "@stream2pq"
print(f'Input stream URL: {input_stream_url}')

Input stream URL: http://v3io-webapi:8081/users/iguazio/examples/model-deployment-with-streaming/data/generated-stream@stream2pq


Define `target_path`, the directory to which the Parquet files are written.

In [4]:
import os
target_path = path.join(os.sep, 'v3io', project.params.get('CONTAINER'), project.params.get('PARQUET_TARGET_PATH'))
print(f'Target path: {target_path}')

Target path: /User/examples/model-deployment-with-streaming/data/events-pq


## Create and Test a Local Function 

[Nuclio](https://nuclio.io/) is a high-performance open-source and managed serverless framework, which is available as a predefined tenant-wide platform service (`nuclio`).
The demo uses Nuclio to create and deploy serverless functions.
Therefore, you need to import the Nuclio package and configure Nuclio for your project.

The platform's Jupyter Notebook service preinstalls the [nuclio-jupyter SDK](https://github.com/nuclio/nuclio-jupyter/blob/master/README.md) for creating and deploying Nuclio functions with Python and Jupyter Notebook.
The tutorial uses the Nuclio magic commands and annotation comments of this SDK to automate function code generation.
The magic commands are initialized when you import the `nuclio` package.<br>
The `%nuclio` magic commands are used to run Nuclio commands from Jupyter notebooks (`%nuclio <Nuclio command>`).
You can also use `%%nuclio` at the start of a cell to identify the entire cell as containing Nuclio code.
The `# nuclio: start-code`, `# nuclio: end-code`, and `# nuclio: ignore` section-marker annotations notify Nuclio of the beginning or end of code sections.
Nuclio ignores all notebook code before a `# nuclio: start-code` marker or after a `# nuclio: end-code` marker.
Nuclio translates all other notebook code sections into function code, except for sections that are marked with the `# nuclio: ignore` marker.
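The marker rules above can be sketched as a simple cell filter. This is only an illustration of the described behavior, not the SDK's actual implementation; the `select_function_cells` helper and the representation of cells as plain strings are assumptions.

```python
from typing import List

START = "# nuclio: start-code"
END = "# nuclio: end-code"
IGNORE = "# nuclio: ignore"


def select_function_cells(cells: List[str]) -> List[str]:
    """Return the cells that would become function code under the marker rules."""
    # Begin after the start-code marker cell if one exists, otherwise at the first cell
    start = next((i + 1 for i, cell in enumerate(cells) if START in cell), 0)
    selected = []
    for cell in cells[start:]:
        if END in cell:
            break  # everything after the end-code marker is ignored
        if IGNORE in cell:
            continue  # cells marked with the ignore annotation are skipped
        selected.append(cell)
    return selected


notebook_cells = [
    "import nuclio",
    "# nuclio: start-code",
    "import json",
    "# nuclio: ignore\nprint('debug only')",
    "def handler(context, event):\n    return 'ok'",
    "# nuclio: end-code",
    "handler(None, None)",
]
print(select_function_cells(notebook_cells))
```

In this example only the `import json` cell and the `handler` definition are selected.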

### Import Nuclio

The following code imports the `nuclio` Python package.

In [5]:
import nuclio

### Configure Nuclio

The following cells perform basic Nuclio function configuration: they set the function kind, run additional package installation commands, define the function's container image (`mlrun/ml-models`), and mount the `/User` volume path. The `# nuclio: start-code` marker that follows instructs Nuclio to start processing code only from that location.

> **Note:** You can add code to define function dependencies and perform additional configuration after the `# nuclio: start-code` marker.

In [6]:
# Define function spec
%nuclio config kind = "nuclio"

%nuclio: setting kind to 'nuclio'


In [7]:
%%nuclio cmd -c

python -m pip install pandas
python -m pip install pyarrow

In [8]:
%%nuclio config
spec.build.baseImage = "mlrun/ml-models"
spec.readinessTimeoutSeconds = 200

%nuclio: setting spec.build.baseImage to 'mlrun/ml-models'
%nuclio: setting spec.readinessTimeoutSeconds to 200


In [9]:
%nuclio mount /User ~/

mounting volume path /User as ~/


In [10]:
# nuclio: start-code

In [11]:
import os
import pandas as pd
import numpy as np
import json
import datetime

In [12]:
def init_context(context):
    setattr(context, 'batch', [])
    setattr(context, 'batch_size', int(os.getenv('BATCH_SIZE', 1024)))
    setattr(context, 'batch_count', int(os.getenv('BATCH_COUNT', 0)))
    
    pq_partitions = os.getenv('PQ_PARTITIONS')
    if pq_partitions:
        setattr(context, 'pq_partitions', pq_partitions.split(','))
    else:
        setattr(context, 'pq_partitions', pq_partitions)
    
    setattr(context, 'target_path', os.getenv('TARGET_PATH'))
    os.makedirs(context.target_path, exist_ok=True)

In [13]:
def handler(context, event):
    if isinstance(event.body, dict):
        event_dict = event.body
    else:
        event_dict = json.loads(event.body)
        
    context.logger.info_with('Got invoked',
                             trigger_kind=event.trigger.kind,
                             event_body=event_dict)
    
    # add the incoming event to the current batch
    context.batch.append(event_dict)
    
    # check whether the batch size has been reached
    if len(context.batch) >= context.batch_size:
        context.logger.info_with('Writing batch',
                                 batch_count=context.batch_count,
                                 batch_size=len(context.batch))
        write_batch(context)
        context.logger.info_with('Written batch',
                                 batch_count=context.batch_count,
                                 batch_size=len(context.batch))
        
def write_batch(context):
    # name the file by worker ID and batch counter so that workers do not overwrite each other
    file_name = f'{context.worker_id}_{context.batch_count}'
    df = pd.DataFrame.from_records(context.batch)
    df.to_parquet(path=os.path.join(context.target_path, file_name),
                  partition_cols=context.pq_partitions)
    # post-write cleanup and counter update
    context.batch = []
    context.batch_count += 1

The following cell uses the `# nuclio: end-code` marker to mark the end of a Nuclio code section and instruct Nuclio to stop parsing the notebook at this point.<br>
> **IMPORTANT:** Do not remove the end-code cell.

In [14]:
# nuclio: end-code
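The handler writes a file only when a batch is full, so events in a partially filled batch remain in memory. The following standard-library sketch shows the same accumulate-and-write pattern with an explicit `flush` step. The `Batcher` class and its `sink` callback are hypothetical names used for illustration; here the callback stands in for the Parquet write.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


class Batcher:
    """Accumulate events and hand each full batch to a sink callback."""

    def __init__(self, batch_size: int, sink: Callable[[List[Event], int], None]):
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self.batch_size = batch_size
        self.sink = sink
        self.batch: List[Event] = []
        self.batch_count = 0

    def add(self, event: Event) -> bool:
        """Add an event; return True if this call caused a batch to be written."""
        self.batch.append(event)
        if len(self.batch) >= self.batch_size:
            self.flush()
            return True
        return False

    def flush(self) -> None:
        """Write the current batch, even if it is only partially filled."""
        if not self.batch:
            return
        self.sink(self.batch, self.batch_count)
        self.batch = []
        self.batch_count += 1


written = []
batcher = Batcher(10, lambda events, count: written.append((count, len(events))))
for user_id in range(1, 13):
    batcher.add({"user_id": user_id, "event_type": "spin"})
batcher.flush()  # write the two events that did not fill a batch
print(written)
```

Whether a partial batch should be flushed, for example on a timer, depends on how much delay the raw-event log can tolerate.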

## Environment Variables

Define a dictionary of the environment variables that the function uses:

In [15]:
envs = {'TARGET_PATH' : target_path,
        'BATCH_SIZE': 1024}
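In addition to `TARGET_PATH` and `BATCH_SIZE`, `init_context` reads the optional `BATCH_COUNT` and `PQ_PARTITIONS` variables. The sketch below shows how a comma-separated `PQ_PARTITIONS` value can be turned into a list of partition columns. The `parse_partitions` helper is hypothetical, and stripping whitespace around column names is an assumption that goes beyond the plain `split(',')` used in the function.

```python
from typing import List, Optional


def parse_partitions(raw: Optional[str]) -> Optional[List[str]]:
    """Convert a value such as 'event_type,user_id' into a list of column names."""
    if not raw:
        return None  # no partitioning requested
    columns = [column.strip() for column in raw.split(",") if column.strip()]
    return columns or None


print(parse_partitions("event_type, user_id"))
print(parse_partitions(None))
```

To partition the output, you could add an entry such as `'PQ_PARTITIONS': 'event_type'` to the `envs` dictionary; the column name here is taken from the sample events used in the local test.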

## Test Locally

In [16]:
for key, value in envs.items():
    os.environ[key] = str(value)
init_context(context)
# reduce the batch size to 10 for the local test
context.batch_size = 10

# trigger with 9 events:

nine_events = [b'{"user_id" : 1 , "event_type": "spin"}',
              b'{"user_id" : 2 , "event_type": "spin"}',
              b'{"user_id" : 3 , "event_type": "spin"}',
              b'{"user_id" : 4 , "event_type": "spin"}',
              b'{"user_id" : 5 , "event_type": "spin"}',
              b'{"user_id" : 6 , "event_type": "spin"}',
              b'{"user_id" : 7 , "event_type": "spin"}',
              b'{"user_id" : 8 , "event_type": "spin"}',
              b'{"user_id" : 9 , "event_type": "spin"}']

for e in nine_events:
    event = nuclio.Event(body=e)
    handler(context, event)

Python> 2020-08-20 11:14:52,815 [info] Got invoked: {'trigger_kind': '', 'event_body': {'user_id': 1, 'event_type': 'spin'}}
Python> 2020-08-20 11:14:52,816 [info] Got invoked: {'trigger_kind': '', 'event_body': {'user_id': 2, 'event_type': 'spin'}}
Python> 2020-08-20 11:14:52,816 [info] Got invoked: {'trigger_kind': '', 'event_body': {'user_id': 3, 'event_type': 'spin'}}
Python> 2020-08-20 11:14:52,817 [info] Got invoked: {'trigger_kind': '', 'event_body': {'user_id': 4, 'event_type': 'spin'}}
Python> 2020-08-20 11:14:52,817 [info] Got invoked: {'trigger_kind': '', 'event_body': {'user_id': 5, 'event_type': 'spin'}}
Python> 2020-08-20 11:14:52,818 [info] Got invoked: {'trigger_kind': '', 'event_body': {'user_id': 6, 'event_type': 'spin'}}
Python> 2020-08-20 11:14:52,818 [info] Got invoked: {'trigger_kind': '', 'event_body': {'user_id': 7, 'event_type': 'spin'}}
Python> 2020-08-20 11:14:52,819 [info] Got invoked: {'trigger_kind': '', 'event_body': {'user_id': 8, 'event_type': 'spin'}}


In [17]:
# check whether a parquet file has been created
!ls -l {target_path}

total 0


In [18]:
# trigger the tenth event which should trigger the creation of the parquet file.
tenth_event = b'{"user_id" : 10 , "event_type": "spin"}'
event = nuclio.Event(body=tenth_event)
handler(context, event)

Python> 2020-08-20 11:14:53,441 [info] Got invoked: {'trigger_kind': '', 'event_body': {'user_id': 10, 'event_type': 'spin'}}
Python> 2020-08-20 11:14:53,442 [info] Writing batch: {'batch_count': 0, 'batch_size': 10}
Python> 2020-08-20 11:14:53,497 [info] Written batch: {'batch_count': 1, 'batch_size': 0}


In [19]:
# check whether a parquet file has been created
!ls -l {target_path}

total 3
-rw-r--r-- 1 51 nogroup 2268 Aug 20 11:14 None_0


In [20]:
# cleanup
!rm {target_path}/None_0

## Nuclio Deploy

### Convert code to function

We use the MLRun `code_to_function` method to convert the Python code to a Nuclio function. We then set the relevant environment variables and the streaming trigger.

In [21]:
from mlrun import code_to_function

gen_func = code_to_function(name='stream2pq', kind='nuclio')
project.set_function(gen_func)
stream2pq = project.func('stream2pq')
stream2pq.set_envs(envs)
stream2pq.add_trigger('stream2pq',
                      nuclio.triggers.V3IOStreamTrigger(url=input_stream_url,
                                                        access_key=os.getenv('V3IO_ACCESS_KEY'),
                                                        maxWorkers=10))

<mlrun.runtimes.function.RemoteRuntime at 0x7ff360001450>

In [22]:
project.save()

### Deploy

In [23]:
stream2pq.deploy()

> 2020-08-20 11:14:56,230 [info] deploy started
[nuclio] 2020-08-20 11:14:57,308 (info) Build complete
[nuclio] 2020-08-20 11:15:01,363 (info) Function deploy complete
[nuclio] 2020-08-20 11:15:01,373 done creating model-deployment-with-streaming-iguazio-stream2pq, function address: 3.131.87.251:32385


'http://3.131.87.251:32385'