> Under Construction: this notebook uses the [waylay-py-internal](https://github.com/waylayio/waylay-py-internal) extension for APIs that are not yet public. It requires current versions of both waylay-py and waylay-py-internal:

```
pip install git+https://github.com/waylayio/waylay-py
pip install git+https://github.com/waylayio/waylay-py-internal
```

# HVAC occupancy detection

This notebook illustrates how to interact with the Waylay Platform APIs for an HVAC data science use case.

## References
* The [kaggle](https://www.kaggle.com) notebook [HVAC Occupancy Detection with ML and DL Methods](https://www.kaggle.com/turksoyomer/hvac-occupancy-detection-with-ml-and-dl-methods/notebook), and related [dataset](https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+), on which this example is based.
* The [Waylay api documentation](https://docs.waylay.io/api/)
* The [Waylay python SDK](https://docs.waylay.io/api/sdk/python/)
* [Setup instructions](https://github.com/waylayio/demo-general/tree/master/python-sdk) for a python notebook using the Waylay Python SDK.


## Parameters
Please review and adapt the following parameters for this demo.

In [1]:
class HVACDemo:
    """parametrization for this demo"""
    
    # original location of the data set
    data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00357/occupancy_data.zip'
    
    # the profile name under which waylay credentials are stored
    waylay_client_profile='rules'
    
    # the id of the resource under which this demo is run
    resource_id = 'demo_energy_hvac_occupancy'
    
    
    
    

## Setup

In [2]:
import pandas as pd
import waylay
from datetime import datetime

In [3]:
# NEEDED FOR NOW

import waylay_internal
dict(
    waylay=waylay.__version__,
    waylay_internal=waylay_internal.__version__,
)

{'waylay': 'v0.1.2+14.g8a561d0', 'waylay_internal': '0+untagged.16.g380f6a6'}

In [4]:
# if the profile does not exist, this will interactively ask for credentials and optionally let you store them.
waylay_client = waylay.WaylayClient.from_profile(HVACDemo.waylay_client_profile)

## Data retrieval

### download the data set
We download the dataset (a zipped set of csv files), inspect its content, and read out the csv files into a pandas data structure.

In [5]:
import os
import os.path
import zipfile
from urllib.request import urlretrieve

os.makedirs('input', exist_ok=True)
os.makedirs('output', exist_ok=True)

# download the data set (from the UCI archive referenced by the kaggle notebook)
if not os.path.isfile('input/occupancy.zip'):
    urlretrieve(HVACDemo.data_url, 'input/occupancy.zip')
    
with zipfile.ZipFile('input/occupancy.zip') as occ_zip:
    for file_name in occ_zip.namelist():
        print(file_name)

datatest.txt
datatest2.txt
datatraining.txt


In [6]:
with zipfile.ZipFile('input/occupancy.zip') as occ_zip:
    datatest = pd.read_csv(occ_zip.open('datatest.txt'))
    datatest2 = pd.read_csv(occ_zip.open('datatest2.txt'))
    datatraining = pd.read_csv(occ_zip.open('datatraining.txt'))
    


In [7]:
datatraining.describe()

Unnamed: 0,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy
count,8143.0,8143.0,8143.0,8143.0,8143.0,8143.0
mean,20.619084,25.731507,119.519375,606.546243,0.003863,0.21233
std,1.016916,5.531211,194.755805,314.320877,0.000852,0.408982
min,19.0,16.745,0.0,412.75,0.002674,0.0
25%,19.7,20.2,0.0,439.0,0.003078,0.0
50%,20.39,26.2225,0.0,453.5,0.003801,0.0
75%,21.39,30.533333,256.375,638.833333,0.004352,0.0
max,23.18,39.1175,1546.333333,2028.5,0.006476,1.0


In [8]:
datatraining.head()

Unnamed: 0,date,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy
1,2015-02-04 17:51:00,23.18,27.272,426.0,721.25,0.004793,1
2,2015-02-04 17:51:59,23.15,27.2675,429.5,714.0,0.004783,1
3,2015-02-04 17:53:00,23.15,27.245,426.0,713.5,0.004779,1
4,2015-02-04 17:54:00,23.15,27.2,426.0,708.25,0.004772,1
5,2015-02-04 17:55:00,23.1,27.2,426.0,704.5,0.004757,1


### convert to etl format
To upload bulk data into waylay, the data should be converted into an optimized format.
The `timeseries.etl_tool.prepare_import` method helps you to create these _import files_.

In this case, we provide the tool with additional information:
 * `timestamp_timezone='UTC'` as timestamps do not contain a timezone component
 * `resource=HVACDemo.resource_id` as the resource id is not provided in the input
 * `timestamp_key='date'`, as timestamps are in the `date` column. In this case this is not required as `date` will be recognised as a timestamp column if not specified otherwise.
 * `directory='output'` because we want the resulting import file to reside in that directory

The first two settings are required for this dataset. Try omitting them to see which errors are raised.

In [9]:
etl_import = waylay_client.timeseries.etl_tool.prepare_import(
    datatraining, 
    timestamp_timezone='UTC',
    resource=HVACDemo.resource_id,
    timestamp_key='date',
    directory='output'
)
etl_import

WaylayETLSeriesInput(path=PosixPath('output/import-20210125.150419-timeseries.csv.gz'), spec=SeriesSpec(metrics=['Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio', 'Occupancy'], resources=['demo_energy_hvac_occupancy'], timestamp_key='date', resource_key=None, metric_key=None, value_key=None, resource='demo_energy_hvac_occupancy', timestamp_offset=None, timestamp_first=None, timestamp_last=None, timestamp_interval=None, timestamp_constructor=None, timestamp_timezone='UTC'))
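
The `timestamp_timezone='UTC'` setting matters because the `date` column holds naive timestamps. A minimal standard-library sketch of the ambiguity and how declaring a timezone resolves it is given below. It only illustrates the idea and makes no claim about how the Waylay tool handles this internally.

```python
from datetime import datetime, timezone

# first timestamp of the training set: no timezone information
raw = '2015-02-04 17:51:00'

naive = datetime.strptime(raw, '%Y-%m-%d %H:%M:%S')
assert naive.tzinfo is None  # ambiguous: could be any local time

# declaring the timezone explicitly turns it into an unambiguous instant
aware = naive.replace(tzinfo=timezone.utc)
iso = aware.isoformat()
print(iso)  # 2015-02-04T17:51:00+00:00
```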

Because it is easier to work with recent data, we instruct the tool to shift timestamps
(with `timestamp_offset`, `timestamp_first` or `timestamp_last`).

In [10]:
etl_import = waylay_client.timeseries.etl_tool.prepare_import(
    datatraining, 
    timestamp_timezone='UTC',
    resource=HVACDemo.resource_id,
    timestamp_key='date',
    timestamp_last=datetime.utcnow(), # shift all timestamps so that last one is now
    directory='output'
)
etl_import

WaylayETLSeriesInput(path=PosixPath('output/import-20210125.150421-timeseries.csv.gz'), spec=SeriesSpec(metrics=['Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio', 'Occupancy'], resources=['demo_energy_hvac_occupancy'], timestamp_key='date', resource_key=None, metric_key=None, value_key=None, resource='demo_energy_hvac_occupancy', timestamp_offset=None, timestamp_first=None, timestamp_last=datetime.datetime(2021, 1, 25, 15, 4, 21, 216343), timestamp_interval=None, timestamp_constructor=None, timestamp_timezone='UTC'))
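
The effect of `timestamp_last` can be pictured as adding one constant offset to every timestamp, so that the spacing between measurements is preserved. The sketch below assumes that behaviour. The helper `shift_to_last` is hypothetical and not part of the Waylay SDK.

```python
from datetime import datetime, timedelta, timezone


def shift_to_last(timestamps, new_last):
    """Shift all timestamps by one constant offset so the latest equals `new_last`."""
    offset = new_last - max(timestamps)
    return [ts + offset for ts in timestamps]


# first three timestamps of the training set, interpreted as UTC
original = [
    datetime(2015, 2, 4, 17, 51, 0, tzinfo=timezone.utc),
    datetime(2015, 2, 4, 17, 51, 59, tzinfo=timezone.utc),
    datetime(2015, 2, 4, 17, 53, 0, tzinfo=timezone.utc),
]
new_last = datetime(2021, 1, 25, 15, 4, 21, tzinfo=timezone.utc)

shifted = shift_to_last(original, new_last)

# the last timestamp lands on the requested moment, intervals are unchanged
assert shifted[-1] == new_last
assert shifted[1] - shifted[0] == timedelta(seconds=59)
```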

The resulting file is a `gzip` compressed csv file in fully normalized _waylay timeseries ETL_ format.

In [11]:
import gzip
with gzip.open(etl_import.path, 'rt') as csv_file:
    etl_series_df = pd.read_csv(csv_file)

etl_series_df.head()

Unnamed: 0,resource,metric,timestamp,value
0,demo_energy_hvac_occupancy,waylay.resourcemessage.metric.Temperature,2021-01-19T23:22:21.216343Z,23.18
1,demo_energy_hvac_occupancy,waylay.resourcemessage.metric.Temperature,2021-01-19T23:23:20.216343Z,23.15
2,demo_energy_hvac_occupancy,waylay.resourcemessage.metric.Temperature,2021-01-19T23:24:21.216343Z,23.15
3,demo_energy_hvac_occupancy,waylay.resourcemessage.metric.Temperature,2021-01-19T23:25:21.216343Z,23.15
4,demo_energy_hvac_occupancy,waylay.resourcemessage.metric.Temperature,2021-01-19T23:26:21.216343Z,23.1
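
"Fully normalized" here means one row per `(resource, metric, timestamp, value)` instead of one column per metric. The standard-library sketch below shows that wide-to-long reshaping on two hand-written rows and writes the result as a gzip-compressed csv. The helper `to_long_rows` is hypothetical. The sketch also omits the `waylay.resourcemessage.metric.` prefix seen in the output above and does not reproduce the row ordering of the real tool.

```python
import csv
import gzip
import io


def to_long_rows(rows, resource, timestamp_key='date'):
    """Yield one (resource, metric, timestamp, value) tuple per measurement."""
    for row in rows:
        for metric, value in row.items():
            if metric != timestamp_key:
                yield (resource, metric, row[timestamp_key], value)


# two wide rows: one column per metric
wide_rows = [
    {'date': '2015-02-04T17:51:00Z', 'Temperature': 23.18, 'CO2': 721.25},
    {'date': '2015-02-04T17:51:59Z', 'Temperature': 23.15, 'CO2': 714.0},
]
long_rows = list(to_long_rows(wide_rows, 'demo_energy_hvac_occupancy'))

# write as gzip-compressed csv (in memory here), then read it back
buffer = io.BytesIO()
with gzip.open(buffer, 'wt', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['resource', 'metric', 'timestamp', 'value'])
    writer.writerows(long_rows)

buffer.seek(0)
with gzip.open(buffer, 'rt', newline='') as csv_file:
    read_back = list(csv.reader(csv_file))

print(len(read_back))  # header + 4 measurement rows
```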


### create or update waylay resource
Timeseries in Waylay are best associated with a Waylay resource. The resource documents the entity that is represented by the timeseries data.

In [12]:
hvac_resource_repr = {
    "id": HVACDemo.resource_id,
    "name": HVACDemo.resource_id,
    "description": (
        "Experimental data used for binary classification (room occupancy) "
        "from Temperature,Humidity,Light and CO2.\n"
        "Ground-truth occupancy was obtained from time stamped pictures that were taken every minute.\n"
        "See https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+#"
    ),
    "metrics" : [ { "name": name } for name in etl_import.spec.metrics ]
}

In [13]:
# use `update` (PATCH method) to upsert the resource
hvac_resource_resp = waylay_client.api.resource.update(HVACDemo.resource_id, body=hvac_resource_repr)

# validate it is stored correctly
waylay_client.api.resource.get(HVACDemo.resource_id)

{'id': 'demo_energy_hvac_occupancy',
 'name': 'demo_energy_hvac_occupancy',
 'metrics': [{'name': 'Temperature'},
  {'name': 'Humidity'},
  {'name': 'Light'},
  {'name': 'CO2'},
  {'name': 'HumidityRatio'},
  {'name': 'Occupancy'}],
 'description': 'Experimental data used for binary classification (room occupancy) from Temperature,Humidity,Light and CO2.\nGround-truth occupancy was obtained from time stamped pictures that were taken every minute.\nSee https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+#'}

In [14]:
# maybe add some more metadata
metrics_metadata = [
    { "name": "Temperature", "valueType": "float", "metricType": "gauge", "unit": "°C" }, 
    { "name": "Humidity", "valueType": "float", "metricType": "gauge", "unit": "%", "description": "Relative Humidity" }, 
    { "name": "Light", "valueType": "float", "metricType": "gauge", "unit": "Lux" }, 
    { "name": "CO2", "valueType": "float", "metricType": "gauge", "unit": "ppm" }, 
    { "name": "HumidityRatio", "valueType": "float", "metricType": "gauge", "unit": "kgwater-vapor/kg-air", "description": "Derived quantity from temperature and relative humidity."},
    { "name": "Occupancy", "valueType": "integer", "metricType": "gauge", "unit": "boolean", "description": "0 for not occupied, 1 for occupied status" } 
]
hvac_resource_resp = waylay_client.api.resource.update(HVACDemo.resource_id, body=dict(metrics=metrics_metadata))
waylay_client.api.resource.get(HVACDemo.resource_id)
      

{'id': 'demo_energy_hvac_occupancy',
 'name': 'demo_energy_hvac_occupancy',
 'metrics': [{'name': 'Temperature',
   'valueType': 'float',
   'metricType': 'gauge',
   'unit': '°C'},
  {'name': 'Humidity',
   'valueType': 'float',
   'metricType': 'gauge',
   'unit': '%',
   'description': 'Relative Humidity'},
  {'name': 'Light',
   'valueType': 'float',
   'metricType': 'gauge',
   'unit': 'Lux'},
  {'name': 'CO2', 'valueType': 'float', 'metricType': 'gauge', 'unit': 'ppm'},
  {'name': 'HumidityRatio',
   'valueType': 'float',
   'metricType': 'gauge',
   'unit': 'kgwater-vapor/kg-air',
   'description': 'Derived quantity from temperature and relative humidity.'},
  {'name': 'Occupancy',
   'valueType': 'integer',
   'metricType': 'gauge',
   'unit': 'boolean',
   'description': '0 for not occupied, 1 for occupied status'}],
 'description': 'Experimental data used for binary classification (room occupancy) from Temperature,Humidity,Light and CO2.\nGround-truth occupancy was obtained f

### upload the etl-import data


In [15]:
upload_bucket, upload_prefix = waylay_client.timeseries.etl_tool.initiate_import(etl_import)

Uploading content to etl-import/upload/import-20210125.150421-timeseries.csv.gz using
    https://object-storage.waylay.io/ws-etl-import-631a70c1-3065-4059-aa7b-dbc20da43c3c/upload/import-20210125.150421-timeseries.csv.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=631a70c1-3065-4059-aa7b-dbc20da43c3c.etl-import_readwritedelete%2F20210125%2Feurope-west1%2Fs3%2Faws4_request&X-Amz-Date=20210125T150430Z&X-Amz-Expires=300&X-Amz-SignedHeaders=host&X-Amz-Signature=ba84da19eb20b9426f8298f655ff93ea65eb8214dd6bb2c7b97424340c99477f
 ...
... done.


The etl file is uploaded to the `etl-import/upload` storage folder.
Any upload in this folder will initiate an etl process.

This can be monitored as follows:
* the file is moved from `etl-import/upload` to a timestamped folder in `etl-import/busy`
* the etl process is kicked off
* on completion, the file (and a result statement) is copied to a folder in `etl-import/done`
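
The folder transitions above can be sketched as a small helper that infers the stage of an import from a listing of object paths. The helper name `import_stage`, the timestamped folder names and the example listing are hypothetical and not part of the Waylay SDK. Only the `etl-import/upload`, `etl-import/busy` and `etl-import/done` folder names come from the description above.

```python
from typing import Iterable, Optional

# stages in the order an import file moves through them
STAGE_FOLDERS = ('upload', 'busy', 'done')


def import_stage(object_paths: Iterable[str], file_name: str) -> Optional[str]:
    """Return the most advanced stage folder that holds `file_name`, or None."""
    found = set()
    for path in object_paths:
        parts = path.split('/')
        # assumed layout: etl-import/<stage>/[<timestamped folder>/]<file name>
        if len(parts) >= 3 and parts[0] == 'etl-import' and parts[-1] == file_name:
            found.add(parts[1])
    for stage in reversed(STAGE_FOLDERS):
        if stage in found:
            return stage
    return None


# hypothetical listing of the storage bucket
listing = [
    'etl-import/busy/2021-01-25T15:04:30/import-example-timeseries.csv.gz',
    'etl-import/done/2021-01-25T15:05:10/import-example-timeseries.csv.gz',
    'etl-import/done/2021-01-25T15:05:10/result.json',
]
stage = import_stage(listing, 'import-example-timeseries.csv.gz')
print(stage)  # done
```

Because the file is copied (not moved) to `etl-import/done`, the helper reports the most advanced stage in which the file name appears.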


In [18]:
from IPython.display import HTML
resp = waylay_client.timeseries.etl_tool.check_import(etl_import)

HTML(resp.to_html())


### query the timeseries data

In [19]:
query = dict(
    resource=HVACDemo.resource_id,
    data=[
        dict(metric=metric) for metric in etl_import.spec.metrics
    ]
)
# test query
waylay_client.analytics.query.execute(
    body=query,
    params=dict(until=datetime.utcnow().isoformat()),
)

resource,demo_energy_hvac_occupancy,demo_energy_hvac_occupancy,demo_energy_hvac_occupancy,demo_energy_hvac_occupancy,demo_energy_hvac_occupancy,demo_energy_hvac_occupancy
metric,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy
timestamp,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2021-01-24 05:46:21.216000+00:00,19.50,27.0000,0.0,456.000000,0.003781,0.0
2021-01-24 05:47:21.216000+00:00,19.50,27.0000,0.0,461.000000,0.003781,0.0
2021-01-24 05:48:20.216000+00:00,19.50,27.0000,0.0,458.000000,0.003781,0.0
2021-01-24 05:49:20.216000+00:00,19.50,27.0000,0.0,460.000000,0.003781,0.0
2021-01-24 05:50:21.216000+00:00,19.50,27.0000,0.0,462.000000,0.003781,0.0
...,...,...,...,...,...,...
2021-01-25 15:00:21.216000+00:00,21.05,36.0975,433.0,787.250000,0.005579,1.0
2021-01-25 15:01:20.216000+00:00,21.05,35.9950,433.0,789.500000,0.005563,1.0
2021-01-25 15:02:20.216000+00:00,21.10,36.0950,433.0,798.500000,0.005596,1.0
2021-01-25 15:03:21.216000+00:00,21.10,36.2600,433.0,820.333333,0.005621,1.0


In [20]:
# save query
query_name = f'example_{HVACDemo.resource_id}'
waylay_client.analytics.query.create(body=dict(name=query_name, query=query))


{'data': [{'metric': 'Temperature'},
  {'metric': 'Humidity'},
  {'metric': 'Light'},
  {'metric': 'CO2'},
  {'metric': 'HumidityRatio'},
  {'metric': 'Occupancy'}],
 'resource': 'demo_energy_hvac_occupancy'}

In [21]:
# test saved query
waylay_client.analytics.query.data(query_name)

resource,demo_energy_hvac_occupancy,demo_energy_hvac_occupancy,demo_energy_hvac_occupancy,demo_energy_hvac_occupancy,demo_energy_hvac_occupancy,demo_energy_hvac_occupancy
metric,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy
timestamp,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2021-01-24 05:46:21.216000+00:00,19.50,27.0000,0.0,456.000000,0.003781,0.0
2021-01-24 05:47:21.216000+00:00,19.50,27.0000,0.0,461.000000,0.003781,0.0
2021-01-24 05:48:20.216000+00:00,19.50,27.0000,0.0,458.000000,0.003781,0.0
2021-01-24 05:49:20.216000+00:00,19.50,27.0000,0.0,460.000000,0.003781,0.0
2021-01-24 05:50:21.216000+00:00,19.50,27.0000,0.0,462.000000,0.003781,0.0
...,...,...,...,...,...,...
2021-01-25 15:00:21.216000+00:00,21.05,36.0975,433.0,787.250000,0.005579,1.0
2021-01-25 15:01:20.216000+00:00,21.05,35.9950,433.0,789.500000,0.005563,1.0
2021-01-25 15:02:20.216000+00:00,21.10,36.0950,433.0,798.500000,0.005596,1.0
2021-01-25 15:03:21.216000+00:00,21.10,36.2600,433.0,820.333333,0.005621,1.0


Use the query in the console on either
- https://beta.waylay.io/analytics/queries?query=example_demo_energy_hvac_occupancy
- https://preview.waylay.io/analytics/queries?query=example_demo_energy_hvac_occupancy













##### cleanup

In [30]:
from waylay import RestResponseError
def cleanup(filter='demo_energy_hvac_occupancy', query_name_prefix='example_'):
    resource_ids = [ r['id'] for r in waylay_client.api.resource.search(params=dict(filter=filter)) ]
    if not resource_ids:
        print('No resources to clean.')
        return
    print('removing data and resources with ids:' + ''.join(f"\n  - {resource_id}" for resource_id in resource_ids))
    answer = input('OK? [Y/N] ')
        
    if not answer or answer[0].upper() != 'Y':
        print('Cleanup cancelled.')
        return
    
    # delete data
    for resource_id in resource_ids:
        try:
            print(waylay_client.data.series.remove(resource_id)  or f'removed series   {resource_id}')
            print(waylay_client.api.resource.remove(resource_id) or f'removed resource {resource_id}')
            query_name = f'{query_name_prefix}{resource_id}'
            print(waylay_client.analytics.query.remove(query_name) or f'removed query {query_name}')
        except RestResponseError as exc:
            print(f'stopped processing resource {resource_id} because of:')
            print(exc)

In [31]:
cleanup()

removing data and resources with ids:
  - demo_energy_hvac_occupancy
OK? [Y/N] N
Cleanup cancelled.
