# conversion of (large) files and streams

The Python SDK supports importing large sets of time series from files and streams in CSV format.

The SDK converts the input to a local ETL import file and uploads this file to Waylay for an asynchronous import.

This notebook illustrates the various CSV import formats that are supported by the SDK (without actually uploading the series).

<img src="ETL File Import.png">

In [1]:
import waylay
waylay.__version__

'v0.7.0'

We use a previously created connection profile, and use the `timeseries.etl_tool` for this workflow.

A dedicated temporary directory makes it easier to clean up afterwards.

In [2]:
import tempfile

waylay_client = waylay.WaylayClient.from_profile()
etl_tool = waylay_client.timeseries.etl_tool
etl_tool.temp_dir = tempfile.mkdtemp(suffix='etl-import')

waylay_client.config.to_dict()

{'credentials': {'type': 'client_credentials',
  'api_key': 'fc29ca8f37544723fc39d908',
  'api_secret': '********',
  'gateway_url': 'https://api-aws-dev.waylay.io',
  'accounts_url': None},
 'profile': '_default_',
 'settings': {}}
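
As a minimal sketch of the clean-up that the temporary directory makes possible, the snippet below creates a scratch directory with the standard library, writes a file into it, and removes everything again. The names `scratch_dir` and `sample.csv` are illustrative assumptions and are not part of the SDK.

```python
import os
import shutil
import tempfile

# Create a scratch directory, similar to the cell above (the suffix is illustrative).
scratch_dir = tempfile.mkdtemp(suffix='etl-import')

# Stand-in for the input files and import files that end up in the directory.
sample_path = os.path.join(scratch_dir, 'sample.csv')
with open(sample_path, 'wt') as sample:
    sample.write('timestamp,temperature\n')

# Remove the directory and everything in it once the work is done.
shutil.rmtree(scratch_dir)
dir_removed = not os.path.exists(scratch_dir)
```

In this notebook, the equivalent call would presumably be `shutil.rmtree(etl_tool.temp_dir)` after the last cell has run.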

### a simple data file
This simple example uses a CSV file where each row is a timestamped event containing a few measurements.

In [3]:

with open(f'{etl_tool.temp_dir}/hvac_demo.csv','wt') as hvac:
    hvac.write(
"timestamp,temperature,occupancy,humidity,snr,event_id\n"
"2021-02-22T14:35:10+00:00,23,0,2304,0.0001,#xji-98904\n"
"2021-02-22T14:35:20+00:00,21,3,2200,0.0002,#xji-98905\n"
)

Let's check the file exists and that we can read the first two lines:


In [4]:
import csv
with open(f'{etl_tool.temp_dir}/hvac_demo.csv', 'rt') as csv_file:
    csv_reader = csv.reader(csv_file)
    print(next(csv_reader))
    print(next(csv_reader))

['timestamp', 'temperature', 'occupancy', 'humidity', 'snr', 'event_id']
['2021-02-22T14:35:10+00:00', '23', '0', '2304', '0.0001', '#xji-98904']
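
Beyond printing the first rows, it can be useful to check that every timestamp parses before preparing an import. The following is a minimal sketch using only the standard library; the helper `check_rows` is hypothetical and is not part of the SDK, and the sample text repeats the file content written above.

```python
import csv
import io
from datetime import datetime

SAMPLE = (
    "timestamp,temperature,occupancy,humidity,snr,event_id\n"
    "2021-02-22T14:35:10+00:00,23,0,2304,0.0001,#xji-98904\n"
    "2021-02-22T14:35:20+00:00,21,3,2200,0.0002,#xji-98905\n"
)


def check_rows(csv_text, timestamp_column='timestamp'):
    """Return (timestamp, row) pairs; raise ValueError on a missing column or bad timestamp."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if timestamp_column not in (reader.fieldnames or []):
        raise ValueError(f'missing column: {timestamp_column}')
    checked = []
    for row in reader:
        # datetime.fromisoformat raises ValueError for malformed ISO 8601 strings.
        checked.append((datetime.fromisoformat(row[timestamp_column]), row))
    return checked


rows = check_rows(SAMPLE)
```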


### preparing the import
Before uploading it into waylay, `etl_tool.prepare_import()` will convert your input source into a local CSV file that
has the right format.

In this case, we provide the following arguments:
* the path to `'hvac_demo.csv'`, our CSV file,
* an (optional) `name` for the import action itself (to track its progress),
* the `resource` setting to indicate the _resource id_ to use for the upload (as it is not in the data set),
* the (optional) `metrics` setting to filter the measurements we want to include in the data set.

The call returns a data object that keeps track of the import file and settings, and is the main argument for the other `etl_tool` utility functions.
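
To make the role of the `resource` and `metrics` settings more concrete, here is a conceptual sketch that turns the wide CSV rows into one `(resource, metric, timestamp, value)` record per measurement. It is an illustration only: the function `wide_to_long` and the resource id `r_demo` are assumptions, and the sketch does not reproduce the SDK's implementation or the layout of the actual import file.

```python
import csv
import io

SAMPLE = (
    "timestamp,temperature,occupancy,humidity,snr,event_id\n"
    "2021-02-22T14:35:10+00:00,23,0,2304,0.0001,#xji-98904\n"
    "2021-02-22T14:35:20+00:00,21,3,2200,0.0002,#xji-98905\n"
)


def wide_to_long(csv_text, resource, metrics, timestamp_column='timestamp'):
    """Yield (resource, metric, timestamp, value) for the selected metric columns only."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        for metric in metrics:
            # Columns that are not listed in `metrics` (snr, event_id) are skipped.
            yield (resource, metric, row[timestamp_column], float(row[metric]))


long_rows = list(
    wide_to_long(SAMPLE, 'r_demo', ['temperature', 'humidity', 'occupancy'])
)
```

With two input rows and three selected metrics, this produces six records, all carrying the fixed resource id.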

In [5]:
import random

IMPORT_NAME = f'hvac{random.randint(0,10000):04d}'
RESOURCE_NAME=f'r_{IMPORT_NAME}'
display(f'''
  IMPORT_NAME={IMPORT_NAME}
  RESOURCE_NAME={RESOURCE_NAME}
''')
hvac_import = etl_tool.prepare_import(
    f'{etl_tool.temp_dir}/hvac_demo.csv',
    name=IMPORT_NAME,
    resource=RESOURCE_NAME,
    metrics=['temperature', 'humidity', 'occupancy']
)

'\n  IMPORT_NAME=hvac2491\n  RESOURCE_NAME=r_hvac2491\n'

1.00csv_files [00:00, 341csv_files/s]
 50%|███████████▌           | 3.00/6.00 [00:00<00:00, 2.55krows/s]


In [6]:
hvac_import

WaylayETLSeriesImport(import_file=ETLFile(directory='/var/folders/07/zn347xhn33z8m79l8xtz1hn80000gp/T/tmpg3g3cigtetl-import', prefix='hvac2491'), settings=SeriesSettings(metrics=[temperature, humidity, occupancy], metric_column=None, metric=None, resources=None, resource_column=None, resource='r_hvac2491', value_column=None, timestamp_column='timestamp', timestamp_offset=None, timestamp_first=None, timestamp_last=None, timestamp_interval=None, timestamp_constructor=<function TimestampFormat.parser.<locals>._parse_with_tz at 0x121b33040>, timestamp_timezone=None, name=None, timestamp_from=None, timestamp_until=None, timestamp_formatter=None, timestamp_format=None, write_csv_header=True, per_resource=False, per_metric=False), storage_bucket='etl-import')

In [7]:
hvac_import.settings.resource

'r_hvac2491'

You can use `etl_tool.read_import_as_dataframe` to validate the content of the *etl import file*:

In [8]:
etl_tool.read_import_as_dataframe(hvac_import)

resource,r_hvac2491,r_hvac2491,r_hvac2491
metric,temperature,occupancy,humidity
timestamp,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
2021-02-22 14:35:10+00:00,23.0,0.0,2304.0
2021-02-22 14:35:20+00:00,21.0,3.0,2200.0


### initiating the import

The function `etl_tool.initiate_import` uploads the *etl import file*, which initiates the import process.

In [9]:
hvac_import = etl_tool.initiate_import(hvac_import)

### checking the import process
Use `etl_tool.check_import` to follow up on the import process. The returned object has a `to_html` method that you can use to visualize the status in this notebook:

> KNOWN ISSUE: the `check_import` can no longer check the currently running import process on the gateway endpoint

In [11]:
# IPython.display is the public import location for HTML
from IPython.display import HTML

HTML(
    etl_tool.check_import(hvac_import).to_html()
)

In [12]:
# rerun until status is ok
HTML(
    etl_tool.check_import(hvac_import).to_html()
)
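
Instead of rerunning the cell by hand, the check can be wrapped in a small polling loop. The sketch below is generic and makes no assumption about the object returned by `check_import`: the helper `poll_until` and its parameters are hypothetical, and the demonstration uses a fake sequence of statuses rather than a real import.

```python
import time


def poll_until(check, is_done, interval=5.0, timeout=300.0, sleep=time.sleep):
    """Call check() repeatedly until is_done(result) is true or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('import did not finish in time')
        sleep(interval)


# Demonstration with fake statuses; no waiting is done because sleep is stubbed out.
fake_statuses = iter(['busy', 'busy', 'done'])
final = poll_until(
    lambda: next(fake_statuses),
    lambda status: status == 'done',
    interval=0,
    sleep=lambda _seconds: None,
)
```

In this notebook, `check` could be `lambda: etl_tool.check_import(hvac_import)`. The `is_done` predicate depends on how the returned object exposes its status, which is not shown here and would need to be looked up in the SDK.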

You can also list all previous import jobs.

The results are kept in the `etl-import` storage bucket, in folders according to the job status.

In [13]:
for import_job in etl_tool.list_import(name_filter='hvac'):
    display(HTML(import_job.to_html()))

Once the import process is successful, you can explore and use the data.

### Provisioning a _waylay resource_ or _waylay query_ for the dataset
The `etl_tool` gives you two provisioning utilities that let you create waylay entities related to your dataset:
    
* `etl_tool.update_query` will create a **waylay query** for the dataset. This will allow you to refer to and query this dataset using a query name.

* `etl_tool.update_resources` will create a **waylay resource** for each resource referenced in your dataset. 
   You can also use `etl_tool.list_import_resources` to only render the resource definitions, without actually
   creating them on the platform.


#### provisioning a _waylay query_

In [14]:
etl_tool.update_query(hvac_import)

{'from': '2021-02-22T14:35:10Z',
 'data': [{'resource': 'r_hvac2491', 'metric': 'temperature'},
  {'resource': 'r_hvac2491', 'metric': 'occupancy'},
  {'resource': 'r_hvac2491', 'metric': 'humidity'}]}

Once created, we can use the named query in the console or through the `queries.query` API.

In [15]:
waylay_client.queries.query.get(IMPORT_NAME)

{'data': [{'metric': 'temperature', 'resource': 'r_hvac2491'},
  {'metric': 'occupancy', 'resource': 'r_hvac2491'},
  {'metric': 'humidity', 'resource': 'r_hvac2491'}],
 'from': '2021-02-22T14:35:10Z'}

In [16]:
waylay_client.queries.query.execute(IMPORT_NAME)

resource,r_hvac2491,r_hvac2491,r_hvac2491
metric,temperature,occupancy,humidity
timestamp,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
2021-02-22 14:35:10+00:00,23,0,2304
2021-02-22 14:35:20+00:00,21,3,2200


In [17]:
waylay_client.queries.query.execute(IMPORT_NAME, params={
    'from': '2021-02-22T14:00:00Z',
    'freq':'PT1H', 
    'aggregation':'mean',
    'periods':2
})

resource,r_hvac2491,r_hvac2491,r_hvac2491
metric,temperature,occupancy,humidity
aggregation,mean,mean,mean
timestamp,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3
2021-02-22 14:00:00+00:00,22.0,1.5,2252.0
2021-02-22 15:00:00+00:00,,,
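
The aggregated values in the first row can be checked by hand. The sketch below groups the two uploaded observations into hourly buckets and averages them with the standard library only. It illustrates what an hourly `mean` aggregation computes and is not how the query service is implemented.

```python
from collections import defaultdict
from datetime import datetime

observations = [
    ('2021-02-22T14:35:10+00:00', {'temperature': 23, 'occupancy': 0, 'humidity': 2304}),
    ('2021-02-22T14:35:20+00:00', {'temperature': 21, 'occupancy': 3, 'humidity': 2200}),
]

# Collect the values per hour bucket and per metric.
buckets = defaultdict(lambda: defaultdict(list))
for stamp, values in observations:
    hour = datetime.fromisoformat(stamp).replace(minute=0, second=0, microsecond=0)
    for metric, value in values.items():
        buckets[hour][metric].append(value)

# Average each bucket; keys are ISO 8601 strings for readability.
hourly_mean = {
    hour.isoformat(): {
        metric: sum(vals) / len(vals) for metric, vals in metrics.items()
    }
    for hour, metrics in buckets.items()
}
```

Both observations fall in the hour starting at 14:00, so only that bucket has values. This matches the empty 15:00 row in the query result above.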


#### provisioning a _waylay resource_

In [18]:
etl_tool.update_resources(hvac_import)

[r_hvac2491]

You can validate on the console, or using the `resources.resource` API, that the resource has been created:

In [19]:
waylay_client.resources.resource.get(RESOURCE_NAME)

{'id': 'r_hvac2491',
 'name': 'r_hvac2491',
 'metrics': [{'name': 'temperature'},
  {'name': 'occupancy'},
  {'name': 'humidity'}]}

#### cleanup

In [20]:
waylay_client.data.events.remove(RESOURCE_NAME)

{'message': 'Deleted messages, series and all metrics for r_hvac2491'}

In [21]:
waylay_client.resources.resource.remove(RESOURCE_NAME)

In [22]:
waylay_client.queries.query.remove(IMPORT_NAME)

{'messages': [],
 '_links': {'self': {'href': 'http://api-aws-dev.waylay.io/queries/v1/query/hvac2491',
   'method': 'DELETE'}}}

## import with additional metadata

When preparing an import you can specify additional metadata on the resources and metrics in the dataset, which will be taken into account when creating the _waylay resource_. 
This also supports renaming the metrics and resources as they appear in the data set (`Metric.key`, `Resource.key`) to the names used in the Waylay upload itself (`Metric.name`, `Resource.id`).
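
The key-to-name renaming can be pictured as a simple column mapping. The sketch below uses a hypothetical `MetricSpec` dataclass as a stand-in. It is not the SDK's `Metric` class and only mirrors its `name` and `key` attributes for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    """Hypothetical stand-in for illustration; not the SDK's Metric class."""
    name: str  # name used in the upload
    key: str   # column name in the source data


specs = [
    MetricSpec(name='temp', key='temperature'),
    MetricSpec(name='humi', key='humidity'),
    MetricSpec(name='occu', key='occupancy'),
]


def rename_row(row, metric_specs):
    """Keep only the listed metrics and store each value under its upload name."""
    return {spec.name: row[spec.key] for spec in metric_specs if spec.key in row}


source_row = {
    'timestamp': '2021-02-22T14:35:10+00:00',
    'temperature': '23',
    'occupancy': '0',
    'humidity': '2304',
    'snr': '0.0001',
}
renamed = rename_row(source_row, specs)
```

Columns without a matching specification, such as `snr`, are dropped, just as the `metrics` filter does in the earlier example.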

In [23]:
IMPORT_NAME = f'hvac{random.randint(0,10000):04d}'
RESOURCE_NAME=f'r_{IMPORT_NAME}'
display(f'''
  IMPORT_NAME={IMPORT_NAME}
  RESOURCE_NAME={RESOURCE_NAME}
''')

from waylay.service.timeseries import Resource, Metric
hvac_demo_with_metadata = etl_tool.prepare_import(
    f'{etl_tool.temp_dir}/hvac_demo.csv',
    name=IMPORT_NAME,
    resource=RESOURCE_NAME,
    resources=[
        Resource(
            RESOURCE_NAME, 
            name=f'Home Office ${IMPORT_NAME}', 
            description='Example Resource for the Waylay SDK etl_tool demo'
        )
    ],
    metrics=[
        Metric(name="temp", key='temperature', value_type="float", metric_type="gauge", unit="°C", description='Home office temperature.'), 
        Metric(name="humi", key="humidity", value_type= "float",  metric_type="gauge",  unit="%", description= "Relative Humidity at my desk."), 
        Metric(name="occu", key="occupancy", value_type="integer",  metric_type="gauge",  unit="items", description="Number of cups on my desk.")
  
    ]
)

'\n  IMPORT_NAME=hvac2475\n  RESOURCE_NAME=r_hvac2475\n'

1.00csv_files [00:00, 507csv_files/s]
 50%|███████████▌           | 3.00/6.00 [00:00<00:00, 3.40krows/s]


In [24]:
etl_tool.update_resources(hvac_demo_with_metadata)

[r_hvac2475]

In [25]:
waylay_client.resources.resource.get(RESOURCE_NAME)

{'id': 'r_hvac2475',
 'name': 'Home Office $hvac2475',
 'metrics': [{'name': 'temp',
   'valueType': 'float',
   'metricType': 'gauge',
   'unit': '°C',
   'description': 'Home office temperature.'},
  {'name': 'occu',
   'valueType': 'integer',
   'metricType': 'gauge',
   'unit': 'items',
   'description': 'Number of cups on my desk.'},
  {'name': 'humi',
   'valueType': 'float',
   'metricType': 'gauge',
   'unit': '%',
   'description': 'Relative Humidity at my desk.'}],
 'description': 'Example Resource for the Waylay SDK etl_tool demo'}

In [26]:
waylay_client.data.events.remove(IMPORT_NAME)

{'message': 'Deleted messages, series and all metrics for hvac2475'}

In [27]:
waylay_client.resources.resource.remove(RESOURCE_NAME)