In [4]:
import os
#os.environ['http_proxy'] = ""
#os.environ['https_proxy'] = ""

## Overview
The SensiML Python library builds models with a TVO pipeline step, which stands for train, validate, and optimize. After building a sandbox pipeline, it is good practice to test the models against a holdout data set. In this tutorial, we demonstrate how to perform this final validation using the model's recognize_signal function.

* Connect to the server, get the project, and instantiate a sandbox
* Split the data by subject into training and holdout sets and upload both
* Build the activity recognition pipeline and train the model
* Use the model's recognize_signal function to test the performance of the model on the holdout data

In this tutorial we will use the activity data set. It is a compilation of recordings of users wearing a device while walking, running, climbing up, climbing down, and crawling. The goal of our model is to recognize which activity the user is performing.



## Training a Model

### Connect to the server, get the project, instantiate a sandbox.

In [6]:
import pandas as pd

from sensiml import SensiML


client = SensiML()
client.project = 'Activity'
client.pipeline = 'Sandbox_Hold_Out'

Sandbox Sandbox_Hold_Out does not exist, creating a new sandbox.


### Upload Train and Test Data

In [8]:
df = pd.read_csv('Support/activities_combinedSignalsWithLabel_small.csv')

df_train = df[df['Subject'].isin(['U001', 'U002', 'U003', 'U004', 'U005', 'U006', 'U008', 'U009'])]
df_test = df[df['Subject'].isin(['U010', 'U011', 'U012'])]
df.head()

Unnamed: 0,Subject,Activity,AccelerometerX,AccelerometerY,AccelerometerZ
0,U001,0,-317,-3000,925
1,U001,0,-284,-2968,903
2,U001,0,-243,-2987,933
3,U001,0,-193,-3051,936
4,U001,0,-150,-3059,915
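
Because the holdout set is defined by subject, it is worth checking that no subject ends up in both sets and that no subject is dropped by accident. The following is a minimal sketch of such a check on a tiny, hypothetical stand-in DataFrame (`df_demo` and its rows are invented for illustration; the subject lists are the ones used above).

```python
import pandas as pd

# Hypothetical miniature stand-in for the activity data (not the real file).
df_demo = pd.DataFrame({
    "Subject": ["U001", "U002", "U010", "U011", "U012", "U007"],
    "Activity": [0, 1, 0, 2, 4, 1],
})

train_subjects = ["U001", "U002", "U003", "U004", "U005", "U006", "U008", "U009"]
test_subjects = ["U010", "U011", "U012"]

demo_train = df_demo[df_demo["Subject"].isin(train_subjects)]
demo_test = df_demo[df_demo["Subject"].isin(test_subjects)]

# A subject present in both sets would contaminate the holdout evaluation.
overlap = set(demo_train["Subject"]) & set(demo_test["Subject"])

# Subjects in neither list are silently excluded by isin().
unassigned = set(df_demo["Subject"]) - set(train_subjects) - set(test_subjects)

print("overlap:", overlap)
print("unassigned:", unassigned)
```

An empty `overlap` set means the split is clean; anything listed in `unassigned` is data that neither trains nor tests the model.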


In [9]:
client.upload_dataframe('activity_data_train.csv', df_train)
client.upload_dataframe('activity_data_test.csv', df_test)

Uploading file "activity_data_train.csv" to SensiML Cloud.
Upload of file "activity_data_train.csv"  to SensiMLKB  Cloud completed.
Uploading file "activity_data_test.csv" to SensiML Cloud.
Upload of file "activity_data_test.csv"  to SensiMLKB  Cloud completed.


<sensiml.client.SensiML at 0x1093b1ac8>

In [27]:
df.columns

Index(['Subject', 'Activity', 'AccelerometerX', 'AccelerometerY',
       'AccelerometerZ'],
      dtype='object')

### Train a model 

In [28]:
sensor_columns = ['AccelerometerX', 'AccelerometerY', 'AccelerometerZ']

In [57]:
client.pipeline.reset()

client.pipeline.set_input_data('activity_data_train.csv', data_columns=sensor_columns,
                                                       group_columns=['Subject', 'Activity'],
                                                       label_column='Activity')

client.pipeline.add_transform('Windowing')


client.pipeline.add_feature_generator(["Mean", 'Standard Deviation', 'Skewness', 'Kurtosis',
                                    '25th Percentile', '75th Percentile', '100th Percentile',
                                    'Zero Crossing Rate'],
                                    function_defaults = {"columns":[u'AccelerometerY']})

client.pipeline.add_transform('Min Max Scale')

client.pipeline.set_validation_method('Recall')

client.pipeline.set_classifier('PME', params={"classification_mode":'RBF','distance_mode':'L1'})

client.pipeline.set_training_algorithm('Hierarchical Clustering with Neuron Optimization', params = {'number_of_neurons':7})

client.pipeline.set_tvo({'validation_seed':0})

In [58]:
result, stats = client.pipeline.execute()

Executing Pipeline with Steps:

------------------------------------------------------------------------
 0.     Name: activity_data_train.csv   		Type: featurefile              
------------------------------------------------------------------------
------------------------------------------------------------------------
 1.     Name: Windowing                 		Type: segmenter                
------------------------------------------------------------------------
------------------------------------------------------------------------
 2.     Name: generator_set             		Type: generatorset             
------------------------------------------------------------------------
------------------------------------------------------------------------
 3.     Name: Min Max Scale             		Type: transform                
------------------------------------------------------------------------
------------------------------------------------------------------------
 4.     Name: t

In [59]:
result.summarize()

TRAINING ALGORITHM: Hierarchical Clustering with Neuron Optimization
VALIDATION METHOD:  Recall
CLASSIFIER:         PME

AVERAGE METRICS:
F1-SCORE:    98.1   sigma 0.00
SENSITIVITY: 98.6   sigma 0.00
PRECISION:   97.6   sigma 0.00

--------------------------------------

RECALL MODEL RESULTS : SET VALIDATION

MODEL INDEX: Fold 0
ACCURACY: 98.04
NEURONS: 6



### Description of pipeline with all parameters

Run the following to print a description of the pipeline together with all of its parameters.

In [61]:
client.pipeline.describe(show_params=True, show_set_params=True)

------------------------------------------------------------------------
 0.     Name: activity_data_train.csv   		Type: featurefile              
------------------------------------------------------------------------
------------------------------------------------------------------------
 1.     Name: Windowing                 		Type: segmenter                
------------------------------------------------------------------------
	window_size: 250
	delta: 250
	train_delta: 0
	return_segment_index: False
------------------------------------------------------------------------
 2.     Name: generator_set             		Type: generatorset             
------------------------------------------------------------------------
	 0. Name: Mean                     
		columns: ['AccelerometerY']

	 1. Name: Standard Deviation       
		columns: ['AccelerometerY']

	 2. Name: Skewness                 
		columns: ['AccelerometerY']

	 3. Name: Kurtosis                 
		columns: ['Acceleromet

## Evaluate the performance of the model on our holdout data set

We would like to test our model against the holdout data set. In this tutorial the holdout set is selected by Subject, but any metadata tag could serve as the filter.


The results returned from the sandbox execution contain all of the models that were built during the pipeline. Select the model that you would like to test; in our case there is only a single model because we used Recall. Then pass the holdout data file that we uploaded earlier to the model's recognize_signal function.

recognize_signal has several modes. The main choice is the platform on which the model runs:

* emulator (default): runs the C code directly in emulation
* cloud: allows advanced manipulation of the pipeline through additional parameters
    * stop_step (int): the execution of the pipeline is stopped at a particular step and results are returned
    * compare_labels (bool): the results of the pipeline are compared against the actual labels passed in
            

Let's first run using the default option, the emulator.

In [40]:
model = result.configurations[0].models[0]
recog_results, summary = model.recognize_signal(datafile="activity_data_test.csv")


Checking Pipeline Status:


Status: Running  Time:   0 sec.   ..

Results Retrieved... Execution Time: 0 min. 45 sec.


Description of returned results for emulator platform:
* Classification: Category of Fired Neuron, this is the integer value that the firmware device will return.
* ClassificationName: The name of the category in the class map
* FeatureVector: Generated Feature Vector for this segment
* ModelName: The name of the model that returned the final result
* SegmentEnd: The end index of the segment
* SegmentID: The ID of the segment
* SegmentLength: The length of the segment
* SegmentStart: The start index of the segment


In [46]:
recog_results.head()

Unnamed: 0,Classification,ClassificationName,FeatureVector,ModelName,SegmentEnd,SegmentID,SegmentLength,SegmentStart
0,1,0,"[20, 36, 88, 36, 92, 16, 8, 0]",0,250,0,250,0
1,1,0,"[18, 41, 105, 19, 87, 20, 13, 0]",0,500,1,250,250
2,1,0,"[18, 34, 80, 53, 92, 13, 8, 0]",0,750,2,250,500
3,1,0,"[22, 37, 97, 21, 91, 22, 10, 0]",0,1000,3,250,750
4,1,0,"[20, 38, 102, 10, 89, 19, 11, 0]",0,1250,4,250,1000
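
The SegmentStart, SegmentEnd, and SegmentLength values above are consistent with the Windowing parameters printed in the pipeline description (window_size: 250, delta: 250). The sketch below reproduces those boundaries in plain Python. It is an illustration only: `fixed_windows` is a hypothetical helper, and dropping a trailing partial window is an assumption rather than confirmed SensiML behavior.

```python
def fixed_windows(n_samples, window_size=250, delta=250):
    """Return (segment_id, start, end) tuples for full windows only.

    Assumption: a trailing window shorter than window_size is dropped.
    """
    segments = []
    start = 0
    segment_id = 0
    while start + window_size <= n_samples:
        segments.append((segment_id, start, start + window_size))
        start += delta  # delta == window_size means no overlap
        segment_id += 1
    return segments


# 1300 samples is an arbitrary illustrative length.
segments = fixed_windows(1300)
for segment_id, start, end in segments:
    print(segment_id, start, end, end - start)
```

With delta equal to window_size the windows do not overlap, so each sample contributes to exactly one classification.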



Next, let's look at platform='cloud' and also ask it to compare the labels for us. When we set compare_labels=True and pass labeled data, model.recognize_signal returns a confusion matrix.

In [47]:
model = result.configurations[0].models[0]
recog_results_cloud, summary_cloud = model.recognize_signal(datafile="activity_data_test.csv", platform='cloud', compare_labels=True)


Checking Pipeline Status:


Status: Running  Time:   0 sec.   .

Results Retrieved... Execution Time: 0 min. 30 sec.


Description of returned results for cloud platform:
* DistanceVector: Distance of closest Neuron
* NIDVector: Index of Fired Neuron
* CategoryVector: Category of Fired Neuron, this is the integer value that the firmware device will return.
* MappedCategoryVector: The name of the category in the class map



In [50]:
recog_results_cloud.head()

Unnamed: 0,DistanceVector,NIDVector,CategoryVector,MappedCategoryVector,Activity,SegmentID,Subject
0,[42],[1],[1],[0],0,0,U010
1,[55],[1],[1],[0],0,0,U011
2,[59],[1],[1],[0],0,0,U012
3,[39],[1],[1],[0],0,1,U010
4,[49],[1],[1],[0],0,1,U011


The confusion matrix is included in the summary.

In [52]:
summary_cloud['confusion_matrix']

CONFUSION MATRIX:
                   1         0         2         4       UNK       UNC   Support   Sens(%)
         1      32.0       0.0       0.0       0.0       1.0       0.0      33.0      97.0
         0       0.0      34.0       0.0       0.0       0.0       0.0      34.0     100.0
         2       0.0       0.0       5.0       0.0       5.0       0.0      10.0      50.0
         4       0.0       0.0       0.0      45.0       1.0       0.0      46.0      97.8

     Total        32        34         5        45         7         0       123          

PosPred(%)     100.0     100.0     100.0     100.0                        Acc(%)      94.3
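
The Sens(%), PosPred(%), and Acc(%) figures in the matrix above follow directly from the raw counts. The sketch below recomputes them in plain Python from numbers copied by hand from that output. The variable names are illustrative and not part of the SensiML API, and the code assumes that rows are true labels and columns are predictions, which agrees with the Support column.

```python
# Counts copied from the confusion matrix above.
# Rows: true label. Columns: predicted label, then UNK and UNC.
columns = [1, 0, 2, 4, "UNK", "UNC"]
matrix = {
    1: [32, 0, 0, 0, 1, 0],
    0: [0, 34, 0, 0, 0, 0],
    2: [0, 0, 5, 0, 5, 0],
    4: [0, 0, 0, 45, 1, 0],
}
labels = list(matrix)

# Support: number of segments that truly belong to each class.
support = {label: sum(row) for label, row in matrix.items()}

# Correct: diagonal entries (true label equals predicted label).
correct = {label: matrix[label][columns.index(label)] for label in labels}

# Sensitivity: share of each true class that was classified correctly.
sensitivity = {label: 100.0 * correct[label] / support[label] for label in labels}

# Positive predictivity: share of each predicted class that was correct.
predicted = {
    label: sum(matrix[true][columns.index(label)] for true in labels)
    for label in labels
}
pos_pred = {label: 100.0 * correct[label] / predicted[label] for label in labels}

# Accuracy: all correct segments over all segments.
accuracy = 100.0 * sum(correct.values()) / sum(support.values())

for label in labels:
    print(label, support[label], round(sensitivity[label], 1), round(pos_pred[label], 1))
print("accuracy", round(accuracy, 1))
```

Note that segments landing in the UNK column count against sensitivity and accuracy but not against positive predictivity, which is why class 2 shows 50.0 sensitivity with 100.0 positive predictivity.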

The summary of the steps that were executed is also included. Notice that recognize_signal does not train a model again; it uses the previously trained model to classify the input data. Additionally, the pipeline's input step is removed because recognize_signal uses the data passed to it.

In [53]:
summary_cloud['execution_summary']

Unnamed: 0,cached,name,runtime,step #,type
0,False,Windowing,1.018689,0,segmenter
1,False,generator_set,2.131407,1,generatorset
2,False,Min Max Scale,1.034444,2,transform


Let's go ahead and run the same thing, but let compare_labels default to False.

In [54]:
recog_result_no_labels, summary_no_label = model.recognize_signal(datafile="activity_data_test.csv", platform='cloud')


Checking Pipeline Status:


Status: Running  Time:   0 sec.   .

Results Retrieved... Execution Time: 0 min. 30 sec.


In [55]:
recog_result_no_labels.head()

Unnamed: 0,DistanceVector,NIDVector,CategoryVector,MappedCategoryVector,Activity,SegmentID,Subject
0,[42],[1],[1],[0],0,0,U010
1,[55],[1],[1],[0],0,0,U011
2,[59],[1],[1],[0],0,0,U012
3,[39],[1],[1],[0],0,1,U010
4,[49],[1],[1],[0],0,1,U011


The confusion matrix is not returned because we didn't request a comparison of the labels.

In [56]:
summary_no_label

{'execution_summary':    cached           name   runtime  step #          type
 0   False      Windowing  1.047432       0     segmenter
 1   False  generator_set  2.134541       1  generatorset
 2   False  Min Max Scale  2.037516       2     transform,
 'summary': Empty DataFrame
 Columns: []
 Index: []}
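
Even without a confusion matrix, the results table shown above still carries the Activity label next to each prediction, so a comparison can be made by hand with pandas. The sketch below does this on the five rows displayed above, re-typed into a small DataFrame. It assumes that each MappedCategoryVector entry is a one-element Python list whose value is directly comparable to Activity; in real results the column may need parsing first.

```python
import pandas as pd

# First five rows of the cloud results shown above, re-typed by hand.
rows = pd.DataFrame({
    "MappedCategoryVector": [[0], [0], [0], [0], [0]],
    "Activity": [0, 0, 0, 0, 0],
    "SegmentID": [0, 0, 0, 1, 1],
    "Subject": ["U010", "U011", "U012", "U010", "U011"],
})

# Assumption: each vector holds exactly one prediction, so unwrap it.
rows["Predicted"] = rows["MappedCategoryVector"].apply(lambda v: v[0])
rows["Correct"] = rows["Predicted"] == rows["Activity"]

# Fraction of segments classified correctly, overall and per subject.
accuracy = rows["Correct"].mean()
per_subject = rows.groupby("Subject")["Correct"].mean()

print("accuracy:", accuracy)
print(per_subject)
```

Grouping by Subject, or by any other metadata column in the results, shows whether errors are concentrated in particular held-out users.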