# SDK Examples - Querying Data Models

This notebook walks through some of the most common query operators in the Sight Machine SDK. We'll use the demo environment (https://demo.sightmachine.io).

In [None]:
from smsdk import client
from datetime import datetime, timedelta
import pandas as pd
import os

In [None]:
env_var_tenant = 'ENV_SDK_VAR_TENANT'
env_var_api_key = 'ENV_SDK_VAR_API_KEY'
env_var_api_secret = 'ENV_SDK_VAR_API_SECRET'

# Read each setting from its environment variable,
# falling back to a default value if the variable is not present
tenant = os.environ.get(env_var_tenant, 'demo')
api_key = os.environ.get(env_var_api_key, '')
api_secret = os.environ.get(env_var_api_secret, '')

## Initialize the SDK client and get a list of all machine types

In [None]:
cli = client.Client(tenant)
cli.login('apikey', 
          key_id = api_key, 
          secret_id = api_secret)

types = cli.get_machine_type_names()
types

# Working with Cycles

Cycles are the core data set in the SM Platform.  A cycle represents a unit of work on a machine and contains a variety of data from sensors, quality management systems, ERP, MES, etc.

Each cycle is associated with a Machine and a range of time.  Each Machine has a machine type, which determines the data schema.  So to query for cycle data, the first step is to look up the machine type and then look up the specific machine(s) of that type.

In [None]:
# list machines of a specific type
machine_type = types[0]
machines = cli.get_machine_names(source_type=machine_type)
machines

In [None]:
# retrieve the schema for a particular machine (more on this at end of notebook)
# extract only a list of tag display names
columns = cli.get_machine_schema(machines[0])['display'].to_list()

# going to skip the first 8 fields since those are our internal / common fields
columns = columns[8:]
columns

### Selecting a Particular Development Pipeline schema

You can select a development pipeline schema using the following code example. This works much like the 'in-use' feature in MA: we can select an alternate pipeline to treat as the production one. As in MA, the setting will persist until you change it back or create a new client.

*Note: By default, the production pipeline schema will be used (just like in MA).*

In [None]:
db_schema = 'pipeline_id' 
cli.select_db_schema(schema_name=db_schema)

## A basic starting query.

Once you have a machine type and machine, you can start to query for cycle data.  We'll use variations on this theme to demonstrate different query options and their effects.

Note that this baseline query already demonstrates:
- Basic filter rules formatted as key value pairs
- Filtering for greater than or less than values
    - It uses `__gte` for greater than or equal.  Use `__gt` for greater than.  Similarly `__lte` is less than or equal vs. `__lt` for less than.
- Sorting returned results
    - Note the `-` prefix before `End Time` means to sort descending.  To sort ascending, do not place a prefix in front of the variable name.
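
As a small illustration of the rules above, the sketch below builds the same kind of baseline query from an end time and a window length. The helper name `time_window_query` and its parameters are hypothetical and not part of the SDK; it only assembles a dictionary, which you could then pass to `cli.get_cycles(**query)`.

```python
from datetime import datetime, timedelta


def time_window_query(machine, end, days=1, inclusive=True, descending=True):
    """Build a cycle query covering the `days` before `end` (hypothetical helper)."""
    start = end - timedelta(days=days)
    # __gte / __lte include the boundary values, __gt / __lt exclude them
    lower, upper = ('__gte', '__lte') if inclusive else ('__gt', '__lt')
    return {
        'Machine': machine,
        f'End Time{lower}': start,
        f'End Time{upper}': end,
        # a leading '-' sorts descending, no prefix sorts ascending
        '_order_by': '-End Time' if descending else 'End Time',
    }


# 'Oven1' is a placeholder machine name
query = time_window_query('Oven1', datetime(2023, 4, 2))
print(query)
```

Keeping the window logic in one place makes it easier to vary only the operator or the sort direction between experiments.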

In [None]:
query = {'Machine': machines[0],
         'End Time__gte' : datetime(2023, 4, 1), 
         'End Time__lte' : datetime(2023, 4, 2), 
         '_order_by': '-End Time'}
df = cli.get_cycles(**query)

print(f'Size of returned data: {df.shape}')
df.head()

# Selecting columns and silencing the `_only` Warning

To select a specific set of columns, provide a list of column names as a value for the key _only.  For example, `'_only': ['column1', 'column2', 'column3']`

If you do not use _only, the SDK will automatically select the first 50 stats in the machine's configuration, plus common metadata fields for the query.  

Note that you can also pass `'_only': '*'`, which will return everything, including a large number of internal fields.  Since this includes many fields you probably will not need, expect the resulting queries to be quite slow.

**IMPORTANT** If a selected column is all null, it will not be included in the returned data frame.  If you are getting fewer columns returned than expected, this most likely means that there was only null data for that column.
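
Because all-null columns are silently dropped, it can help to compare what you asked for with what came back. The sketch below is one way to do that; `dropped_columns` is a hypothetical helper, not an SDK function, and the column lists are made-up stand-ins for `select_columns` and `df.columns`.

```python
def dropped_columns(requested, returned):
    """Return the requested column names that are missing from the result."""
    returned = set(returned)
    return [name for name in requested if name not in returned]


# Stand-ins for the `_only` list and for `df.columns` after a query
requested = ['Machine', 'End Time', 'Alarms', 'output']
returned = ['Machine', 'End Time', 'output']

missing = dropped_columns(requested, returned)
print(f'Columns with no data in this time range: {missing}')
```

With a real result you would call `dropped_columns(select_columns, df.columns)`.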



In [None]:
# Get the first 5 columns, plus Machine and End Time
select_columns = ['Machine', 'End Time'] + columns[:5]

query = {'Machine': machines[0],
         'End Time__gte' : datetime(2023, 4, 1), 
         'End Time__lte' : datetime(2023, 4, 2),  
         '_order_by': '-End Time',
         '_only': select_columns}
df = cli.get_cycles(**query)

print(f'Size of returned data: {df.shape}')
df.head()

## Restricting the number of rows returned with `_limit` and `_offset`

To restrict the number of rows, use the _limit query option.  For example, `'_limit': 500`.  This will then return at most 500 rows.  

To skip over a specified number of rows, use the _offset query option.  For example `'_offset': 50`.

It is fairly common to use a combination of `_limit` and `_offset` together for applications such as paginating data.  For example, if a query would normally return 100 rows and you want to break it into two queries, you could return the first 50 rows with `'_offset': 0, '_limit': 50` and then return the second 50 rows with `'_offset': 50, '_limit': 50`.
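
The pagination pattern described above can be sketched as a small generator. This is a minimal illustration, not an SDK feature: `iter_pages` is a hypothetical helper, and `fake_get_cycles` stands in for `cli.get_cycles` so the example runs without a connection.

```python
def iter_pages(fetch, base_query, page_size=50, max_pages=1000):
    """Yield successive pages from fetch(**query) using _offset and _limit."""
    for page_number in range(max_pages):
        query = {**base_query,
                 '_offset': page_number * page_size,
                 '_limit': page_size}
        page = fetch(**query)
        if len(page) == 0:
            break
        yield page
        if len(page) < page_size:
            break  # a short page means there is nothing left to fetch


# Stand-in for cli.get_cycles: serves slices of 120 fake rows
_fake_rows = list(range(120))


def fake_get_cycles(**query):
    start = query['_offset']
    return _fake_rows[start:start + query['_limit']]


pages = list(iter_pages(fake_get_cycles,
                        {'Machine': 'Oven1', '_order_by': '-End Time'}))
page_sizes = [len(page) for page in pages]
print(page_sizes)
```

With the real client you would pass `cli.get_cycles` as `fetch` and could combine the resulting data frames with `pd.concat(pages, ignore_index=True)`. Keeping an `_order_by` in the base query is advisable so that consecutive pages do not overlap.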

In [None]:
query = {'Machine': machines[0],
         'End Time__gte' : datetime(2023, 4, 1), 
         'End Time__lte' : datetime(2023, 4, 2),  
         '_order_by': '-End Time',
         '_offset': 10,
         '_limit': 500}
df = cli.get_cycles(**query)

print(f'Size of returned data: {df.shape}')

# Notice that the first row of the returned data set is at 23:49 - 23:50 instead of midnight, because of the offset
df.head()

# Data from more than one Machine or filtering by a list of values using `__in`

Filters can specify a list of acceptable values.  This is most commonly used when selecting data from more than one machine, though it can be used on any field name.  This is done by appending `__in` (*note two underscores*) to the column name and then specifying the list of options.  For example:

    'Machine__in': ['Oven1', 'Oven2']

or

    'Status__in': ['Idle', 'Maintenance', 'Down']

**Important** Selecting multiple machines of different types can result in sparse and confusing data frames.  It is strongly recommended to only pick multiple machines of the same type.

You can also query for values that are not in a list by using `__nin` with the same format as `__in`.  For example:

    'Product_Code__nin': ['SuperMax 5000', 'MegaValue 6000']
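
One way to follow the same-type recommendation above is to group machine names by type first and build one `Machine__in` query per group. The sketch below assumes a lookup callable that maps a machine name to its type; with the real client that could be `cli.get_type_from_machine`. The helper name, the machine names, and the type names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def queries_by_machine_type(machine_names, type_lookup, common_filters):
    """Build one Machine__in query per machine type (hypothetical helper)."""
    grouped = defaultdict(list)
    for name in machine_names:
        grouped[type_lookup(name)].append(name)
    return {machine_type: {'Machine__in': names, **common_filters}
            for machine_type, names in grouped.items()}


# Stand-in for cli.get_type_from_machine
fake_types = {'Oven1': 'Oven', 'Oven2': 'Oven', 'Press1': 'Press'}

per_type = queries_by_machine_type(
    ['Oven1', 'Press1', 'Oven2'],
    fake_types.get,
    {'End Time__gte': datetime(2023, 4, 1),
     'End Time__lte': datetime(2023, 4, 2)},
)
for machine_type, query in per_type.items():
    print(machine_type, query['Machine__in'])
```

Each resulting query can then be run separately, which keeps every data frame to a single schema.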

In [None]:
# Note: taking the first three machines' data, so will result in three times as many records returned
query = {'Machine__in': machines[0:3],
         'End Time__gte' : datetime(2023, 4, 1), 
         'End Time__lte' : datetime(2023, 4, 2),  
         '_order_by': '-End Time'}
df = cli.get_cycles(**query)

print(f'Size of returned data: {df.shape}')
# Notice the Machine column now has three different values
df.head()


# Filtering to only rows where a specified field exists with `__exists`

Some data fields, such as inspection data, are often quite sparse.  To filter to only the rows where a field has (or does not have) a non-null value, append `__exists` to the name of the field and pass a boolean: `True` keeps rows where the field exists, and `False` keeps rows where it does not.  For example:

    'Inspection_Value__exists': True

or

    'Failure_Code__exists': False

In [None]:
query = { 'Machine': machines[0],
         'stats__Alarms__val__exists': False,
         'End Time__gte' : datetime(2023, 4, 1), 
         'End Time__lte' : datetime(2023, 4, 2), 
         '_order_by': '-End Time',
         '_only': ['Machine', 'End Time', 'Alarms']}
df = cli.get_cycles(**query)

print(f'Size of returned data: {df.shape}')
# Result now only has the subset of rows where Alarms has no value
df.head()

# Testing for inequality with `__ne`

The standard `key: value` format tests whether the key equals the value.  To test for inequality instead, add a `__ne` suffix to the key.  For example, `'StatusCode__ne': 0`
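
To make the suffix operators covered so far concrete, here is a local emulation that applies them to plain dictionaries. It is only an illustration of how the filters read, under the assumption that a missing or `None` value counts as "does not exist"; it is not how the SDK or the platform evaluates queries, and `matches` is a hypothetical name.

```python
OPERATORS = {
    'eq': lambda value, target: value == target,
    'ne': lambda value, target: value != target,
    'gt': lambda value, target: value is not None and value > target,
    'gte': lambda value, target: value is not None and value >= target,
    'lt': lambda value, target: value is not None and value < target,
    'lte': lambda value, target: value is not None and value <= target,
    'in': lambda value, target: value in target,
    'nin': lambda value, target: value not in target,
    'exists': lambda value, target: (value is not None) == target,
}


def matches(row, filters):
    """Return True if a dict row satisfies every filter (local emulation only)."""
    for key, target in filters.items():
        if key.startswith('_'):
            continue  # skip options such as _order_by, _limit, _only
        field, separator, operator = key.rpartition('__')
        if not separator or operator not in OPERATORS:
            field, operator = key, 'eq'  # a plain key means equality
        if not OPERATORS[operator](row.get(field), target):
            return False
    return True


# Made-up rows for the demonstration
rows = [
    {'Machine': 'Oven1', 'output': 0, 'Alarms': None},
    {'Machine': 'Oven2', 'output': 5, 'Alarms': 'E42'},
    {'Machine': 'Oven3', 'output': 7, 'Alarms': None},
]
filters = {'Machine__in': ['Oven1', 'Oven2'], 'output__ne': 0}
kept = [row['Machine'] for row in rows if matches(row, filters)]
print(kept)
```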



In [None]:
query = {'Machine__in': machines[0:3],
         'output__ne': 0,
         'End Time__gte' : datetime(2023, 4, 3), 
         'End Time__lte' : datetime(2023, 4, 6), 
         '_order_by': '-End Time',
         '_only': ['Machine', 'End Time', 'output']}
df = cli.get_cycles(**query)

print(f'Size of returned data: {df.shape}')
df.head()

# Working with Downtimes
As with Cycles, the Downtime data model can be queried for a given machine. Everything from the above section still applies, but the main function is `get_downtimes()` instead of `get_cycles()`.

In [None]:
query = {'Machine': machines[0],
         'End Time__gte' : datetime(2023, 4, 1), 
         'End Time__lte' : datetime(2023, 4, 2), 
         '_order_by': '-End Time'}
df = cli.get_downtimes(**query)

print(f'Size of returned data: {df.shape}')
df.head()

# Working with Parts

Whereas Cycles contain data about what is happening on a particular machine, Parts track an object across multiple machines.  The general structure for querying parts is similar to that for working with cycles, though slightly simpler.  With a Cycle, the pattern is to find the Machine Type, then the Machine, then get the Cycle data associated with the machine.  With Parts, you only need a two-step process: look up Part Types and then Part data.


In [None]:
part_types = cli.get_part_type_names()
part_type = part_types[0]
part_types

In [None]:
# look at parts schema, same as we did above for machine schema
columns = cli.get_part_schema(part_type)['display'].to_list()
columns[:10]

The options for querying parts are similar to querying for cycles - use the same operators described above.

In [None]:
query = {'Part': part_type,
         'End Time__gte' : datetime(2023, 4, 1), 
         'End Time__lte' : datetime(2023, 4, 2),
         'DefectReason__exists': True,
         '_limit': 10,
         '_only': columns[:30]}

df = cli.get_parts(**query)

print(f'Size of returned data: {df.shape}')
df.head()

# Machines and Machine Types
There is additional information about machines and machine types that can be queried from the SDK. This info can help you format or transform your queries to fit your needs. Examples are included below.

## Machine-Level Info

### Timezones

By default, all timestamps are in UTC.  To find the local timezone associated with a machine, use the ```get_machine_timezone``` function and provide the machine name.  This will then return the name of the timezone, which can be used with libraries such as pytz to convert time zones.
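
As a sketch of the conversion step, the helper below turns a UTC timestamp into local time. It uses the standard-library `zoneinfo` module (Python 3.9+) rather than pytz, and assumes the name returned by `get_machine_timezone` is an IANA time zone name such as `America/New_York`. The helper name `utc_to_local` is hypothetical. The demo uses a fixed UTC-4 offset only so that it runs without any time zone database installed.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def utc_to_local(ts, tz):
    """Convert a UTC timestamp to local time.

    `tz` may be a tzinfo object or a time zone name string
    (assumed to be an IANA name).
    """
    if isinstance(tz, str):
        tz = ZoneInfo(tz)  # needs time zone data on the system or the tzdata package
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # treat naive timestamps as UTC
    return ts.astimezone(tz)


# Fixed offset used only to keep the demo dependency-free
local = utc_to_local(datetime(2023, 4, 1, 12, 0), timezone(timedelta(hours=-4)))
print(local.isoformat())
```

With a real machine you would pass the name directly, for example `utc_to_local(timestamp, cli.get_machine_timezone(machines[0]))`.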

In [None]:
print(machines[0])
tz = cli.get_machine_timezone(machines[0])
print(tz)

### Get Machine Type from Machine
All machines can be grouped into machine types. You may need to programmatically look up the type of a machine given its name. To get the machine type from a machine name (or display name), use ```cli.get_type_from_machine(machine_name)```.

In [None]:
cli.get_type_from_machine(machines[0])

### Get Machine Data Schema
The machine schema is a table containing metadata about the tags included in cycle data for a particular machine. It can be retrieved with ```cli.get_machine_schema(machine_name)```. There are a few additional optional parameters that can be passed to the function:
- ```types```: list of strings specifying a subset of column data types that you want to see.
    - ```cli.get_machine_schema(machine_name, types=['continuous'])```
- ```show_hidden```: (default = False) set to True to see the few additional fields that are hidden by default both here and in MA.

In [None]:
# retrieve the schema for a particular machine
schema = cli.get_machine_schema(machines[0])
schema.head()

In [None]:
# example: look at all the various data types for this model
print(schema['type'].unique())

In [None]:
# example: extract list of tags with numeric data types
schema_numeric = schema[schema['type'].isin(['int', 'float'])]
# take the display names from the filtered schema, not the full one
cols = schema_numeric['display'].to_list()
cols[:10]

In [None]:
# alternate method
schema_str = cli.get_machine_schema(machines[0], types=['continuous'])
schema_str.head()

## Machine-Type-Level Info

### Get Machine Type Schema (```get_fields_of_machine_type```)
This function is very similar to ```get_machine_schema``` above, except that it gets the fields that are part of a machine type definition. It has the same optional parameters as above:
- ```types```: list of strings specifying a subset of column data types that you want to see.
    - ```cli.get_fields_of_machine_type(machine_type, types=['continuous'])```
- ```show_hidden```: (default = False) set to True to see the few additional fields that are hidden by default both here and in MA.

In [None]:
type_dict = cli.get_fields_of_machine_type(types[0])
type_schema = pd.DataFrame(type_dict)
type_schema.head()