#### Overview
The knowledgebuilder uses a TVO (train, validate and optimize) pipeline step to build models. After building a sandbox pipeline, it is good practice to test the resulting models against a hold-out data set that was not used for training. In this tutorial, we demonstrate how to perform this final validation using the model's recognize_signal function.

* Connect to the server, get the project and instantiate a sandbox
* Load the data and select a subset of subjects for training
* Build the activity recognition pipeline and train the model
* Select the remaining subjects as the hold-out test data
* Use the model's recognize_signal function to test the performance of the model

In this tutorial we will use the activity data set. It contains recordings of users wearing a device while performing the following activities: walking, running, climbing up, climbing down and crawling. The goal of our model is to recognize which activity the user is performing.



#### Connect to the server, get the project, instantiate a sandbox.

In [1]:
import pandas as pd

from sensiml import SensiML


dsk = SensiML()
dsk.project = 'Activity'
dsk.pipeline = 'Sandbox_Hold_Out'

Project Activity does not exist, creating a new project.
Sandbox Sandbox_Hold_Out does not exist, creating a new sandbox.


#### For this Case Study we will use a feature file containing our data

In [2]:
df = pd.read_csv('Support/activities_combinedSignalsWithLabel_small.csv')
# Train on eight subjects and hold out three other subjects for the final test
df_train = df[df['Subject'].isin(['U001', 'U002', 'U003', 'U004', 'U005', 'U006', 'U008', 'U009'])]
df_test = df[df['Subject'].isin(['U010', 'U011', 'U012'])]
df.head()

Unnamed: 0,Subject,Activity,AccelerometerX,AccelerometerY,AccelerometerZ
0,U001,0,-317,-3000,925
1,U001,0,-284,-2968,903
2,U001,0,-243,-2987,933
3,U001,0,-193,-3051,936
4,U001,0,-150,-3059,915
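
The split above holds out whole subjects rather than random rows, so no person contributes data to both sets. A minimal sketch of that idea in plain Python follows. The helper `split_by_subject` and the sample records are hypothetical and only illustrate the principle. Unlike the cell above, which names the training subjects explicitly, this helper treats every non-test subject as training data.

```python
def split_by_subject(rows, test_subjects):
    """Split rows so that each subject lands in exactly one of the two sets."""
    test_subjects = set(test_subjects)
    train = [r for r in rows if r["Subject"] not in test_subjects]
    test = [r for r in rows if r["Subject"] in test_subjects]
    return train, test


# Illustrative records only; the U010/U011 values are made up for this sketch.
rows = [
    {"Subject": "U001", "Activity": 0, "AccelerometerY": -3000},
    {"Subject": "U001", "Activity": 0, "AccelerometerY": -2968},
    {"Subject": "U010", "Activity": 0, "AccelerometerY": -2950},
    {"Subject": "U011", "Activity": 1, "AccelerometerY": -1200},
]

train_rows, test_rows = split_by_subject(rows, ["U010", "U011", "U012"])

# A subject-wise hold-out is only valid if the two subject sets do not overlap.
train_subjects = {r["Subject"] for r in train_rows}
held_out_subjects = {r["Subject"] for r in test_rows}
assert train_subjects.isdisjoint(held_out_subjects)
```

Checking that the subject sets are disjoint before training is a cheap guard against leaking test subjects into the training data.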


#### Build a pipeline that can do our recognition

In [3]:
dsk.pipeline.reset()

dsk.pipeline.set_input_data('activity_data', df_train, data_columns=['AccelerometerY'],
                                                       group_columns=['Subject','Activity'],
                                                       label_column='Activity')

dsk.pipeline.add_transform('Windowing')

dsk.pipeline.add_transform('MSE Filter', params={'input_column':'AccelerometerY'})

dsk.pipeline.add_feature_generator(["Mean", 'Standard Deviation', 'Skewness', 'Kurtosis',
                                    '25th Percentile', '75th Percentile', '100th Percentile',
                                    'Zero Crossing Rate'],
                                    function_defaults = {"columns":[u'AccelerometerY']})

dsk.pipeline.add_transform('Min Max Scale')

dsk.pipeline.set_validation_method('Stratified K-Fold Cross-Validation', params={'number_of_folds':5})

dsk.pipeline.set_classifier('PVP', params={"classification_mode":'RBF','distance_mode':'L1'})

dsk.pipeline.set_training_algorithm('Hierarchical Clustering with Neuron Optimization', params = {'number_of_neurons':7})

dsk.pipeline.set_tvo({'validation_seed':0})

Uploading file "activity_data.csv" to KB Cloud.
Upload of file "activity_data.csv"  to KB Cloud completed.
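
To make the pipeline steps above more concrete, here is a rough plain-Python sketch of what windowing, a few of the listed feature generators, and min-max scaling compute. It is not SensiML's implementation. The window size, step, output range, and function names are assumptions chosen for illustration, and only a subset of the features is shown (the 100th percentile is simply the window maximum).

```python
import math


def windows(signal, size, step):
    """Yield fixed-length windows; a trailing partial window is dropped."""
    for start in range(0, len(signal) - size + 1, step):
        yield signal[start:start + size]


def zero_crossing_rate(window):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(window, window[1:]) if (a < 0) != (b < 0))
    return crossings / (len(window) - 1)


def features(window):
    """Compute a few per-window statistics (population standard deviation)."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    return {"mean": mean, "std": std, "max": max(window),
            "zcr": zero_crossing_rate(window)}


def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly map values onto [lo, hi]; a constant input maps to lo."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


signal = [1, -1, 2, -2, 3, 3, 4, 4]  # toy signal, not taken from the data set
feature_rows = [features(w) for w in windows(signal, size=4, step=4)]
scaled_means = min_max_scale([row["mean"] for row in feature_rows])
```

Each window becomes one feature vector, which is why the classifier later reports one result per segment rather than one per sample.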


In [4]:
result, stats = dsk.pipeline.execute()

Executing Pipeline with Steps:

------------------------------------------------------------------------
 0.     Name: activity_data.csv         		Type: featurefile              
------------------------------------------------------------------------
------------------------------------------------------------------------
 1.     Name: Windowing                 		Type: segmenter                
------------------------------------------------------------------------
------------------------------------------------------------------------
 2.     Name: MSE Filter                		Type: transform                
------------------------------------------------------------------------
------------------------------------------------------------------------
 3.     Name: generator_set             		Type: generatorset             
------------------------------------------------------------------------
------------------------------------------------------------------------
 4.     Name: M

In [5]:
result.summarize()

TRAINING ALGORITHM: Hierarchical Clustering with Neuron Optimization
VALIDATION METHOD:  Stratified K-Fold Cross-Validation
CLASSIFIER:         PVP

AVERAGE METRICS:
F1-SCORE:    98.5   sigma 1.19
SENSITIVITY: 99.6   sigma 0.47
PRECISION:   97.6   sigma 1.75

--------------------------------------

STRATIFIED K-FOLD CROSS-VALIDATION MODEL RESULTS

MODEL INDEX: Fold 4
ACCURACY: 100.00
NEURONS: 4

MODEL INDEX: Fold 1
ACCURACY: 98.61
NEURONS: 4

MODEL INDEX: Fold 2
ACCURACY: 97.22
NEURONS: 4

MODEL INDEX: Fold 0
ACCURACY: 97.22
NEURONS: 4

MODEL INDEX: Fold 3
ACCURACY: 97.18
NEURONS: 4
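
The summary lists one model per fold because stratified k-fold cross-validation trains on k-1 folds and validates on the remaining one, while keeping the class proportions roughly equal across folds. The sketch below shows one simple way to build such folds in plain Python. It is an illustration under assumptions: it is deterministic and skips the shuffling that a seeded implementation would normally perform, and `stratified_folds` is a hypothetical helper, not a SensiML function.

```python
from collections import defaultdict


def stratified_folds(labels, number_of_folds):
    """Assign sample indices to folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(number_of_folds)]
    for label in sorted(by_class):
        for position, index in enumerate(by_class[label]):
            folds[position % number_of_folds].append(index)
    return folds


# Toy labels: 10 samples of class 0 and 5 samples of class 1.
labels = [0] * 10 + [1] * 5
folds = stratified_folds(labels, number_of_folds=5)

# Count how many samples of each class ended up in every fold.
fold_class_counts = [
    {label: sum(1 for i in fold if labels[i] == label) for label in (0, 1)}
    for fold in folds
]
```

With these toy labels every fold receives two samples of class 0 and one of class 1, so each validation fold mirrors the 2:1 class ratio of the full set.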



We would like to test our model against the hold-out data set. In this tutorial we use the Subject column to separate the hold-out data, but any metadata tag could serve as the filter.

#### Evaluate the performance of the model on our hold out data set

The results returned from the sandbox execution contain all of the models that were built by the pipeline. Select the model that you would like to test. In our case, the 5-fold cross-validation produced one model per fold, and we take the first model in the list. Then pass the hold-out DataFrame (df_test) to the model's recognize_signal function.

model.recognize_signal returns a confusion matrix if you pass labeled data. Otherwise it returns only the classifications.

First, let's pass in our labeled data set.

In [6]:
model = result.configurations[0].models[0]
recog_results, summary = model.recognize_signal(df_test)

Loading Data.

Checking for Results:

Pipeline Waiting in the Queue. Wait Time: 0.00 seconds
.
Results Retrieved.


In [7]:
recog_results.head()

Unnamed: 0,DistanceVector,NIDVector,CategoryVector,MappedCategoryVector,Activity,SegmentID,Subject
0,[42],[1],[1],[0],0,0,U010
1,[55],[1],[1],[0],0,0,U011
2,[59],[1],[1],[0],0,0,U012
3,[35],[1],[1],[0],0,1,U010
4,[49],[1],[1],[0],0,1,U011
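
The pipeline configured the classifier with `classification_mode` 'RBF' and `distance_mode` 'L1', and each result row reports a distance, a neuron ID and a category. As a rough illustration of how a classifier of this general kind can produce such values, here is a minimal sketch. It assumes that each neuron stores a center, a category and an influence field, and that a feature vector outside every field is reported as unknown. These details, the neuron values and the helper names are assumptions for illustration only, not SensiML's implementation.

```python
def l1_distance(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def classify_rbf(vector, neurons):
    """Return (distance, neuron_id, category) for the nearest neuron whose
    influence field contains the vector, or (None, None, 'UNK') if none does."""
    best = None
    for neuron in neurons:
        distance = l1_distance(vector, neuron["center"])
        inside_field = distance <= neuron["influence"]
        if inside_field and (best is None or distance < best[0]):
            best = (distance, neuron["id"], neuron["category"])
    return best if best is not None else (None, None, "UNK")


# Hypothetical neurons; real models are produced by the training algorithm.
neurons = [
    {"id": 1, "center": [10, 200, 30], "influence": 60, "category": 1},
    {"id": 2, "center": [200, 20, 180], "influence": 50, "category": 2},
]

hit = classify_rbf([20, 180, 42], neurons)      # falls inside neuron 1's field
miss = classify_rbf([120, 120, 120], neurons)   # outside every field
```

Under these assumptions, a segment that falls outside every influence field is the kind of case that would be counted as unknown rather than assigned to a class.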


#### The confusion matrix is included in the summary

In [8]:
summary['confusion_matrix']

CONFUSION MATRIX:
                   0         1         2         4       UNK       UNC   Support   Sens(%)
         0      34.0       0.0       0.0       0.0       0.0       0.0      34.0     100.0
         1       0.0      33.0       0.0       0.0       0.0       0.0      33.0     100.0
         2       0.0       0.0      10.0       0.0       0.0       0.0      10.0     100.0
         4       0.0       0.0      12.0      33.0       1.0       0.0      46.0      71.7

     Total        34        33        22        33         1         0       123          

PosPred(%)     100.0     100.0      45.5     100.0                        Acc(%)      89.4
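
The derived figures in the matrix can be reproduced by hand. The sketch below recomputes sensitivity, positive predictivity and accuracy from the counts printed above, using plain Python. It assumes that the Support column is the row total including the UNK and UNC columns, and that accuracy is the number of correct classifications divided by total support, which is consistent with the printed values.

```python
class_labels = ["0", "1", "2", "4"]
columns = class_labels + ["UNK", "UNC"]

# Rows are true labels, columns are predicted labels (counts from the output).
matrix = {
    "0": [34, 0, 0, 0, 0, 0],
    "1": [0, 33, 0, 0, 0, 0],
    "2": [0, 0, 10, 0, 0, 0],
    "4": [0, 0, 12, 33, 1, 0],
}


def diagonal(label):
    """Number of segments of this class that were classified correctly."""
    return matrix[label][columns.index(label)]


support = {label: sum(row) for label, row in matrix.items()}
total = sum(support.values())

predicted_totals = {
    column: sum(matrix[label][i] for label in class_labels)
    for i, column in enumerate(columns)
}

sensitivity = {l: 100 * diagonal(l) / support[l] for l in class_labels}
positive_predictivity = {
    l: 100 * diagonal(l) / predicted_totals[l] for l in class_labels
}
accuracy = 100 * sum(diagonal(l) for l in class_labels) / total
```

Rounded to one decimal, this gives a sensitivity of 71.7 for class 4 (33 of 46), a positive predictivity of 45.5 for class 2 (10 of 22), and an overall accuracy of 89.4 (110 of 123), matching the printed matrix. The drop from the cross-validation accuracy comes almost entirely from class 4 segments being classified as class 2.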

The summary also includes the steps that were executed. Notice that recognize_signal does not train a model again; it uses the previously trained model to classify the input data. Additionally, the input data step is omitted because the DataFrame passed into recognize_signal is used directly.

In [9]:
summary['execution_summary']

Unnamed: 0,cached,name,runtime,step #,type
0,False,Windowing,1.002701,0,segmenter
1,False,MSE Filter,1.002665,1,transform
2,False,generator_set,1.00522,2,generatorset
3,False,Min Max Scale,1.002717,3,transform


#### Classify Data without Labels

Let's remove the label column from our data set and pass the unlabeled data in.

In [10]:
recog_result_no_labels, summary_no_label = model.recognize_signal(df_test.drop('Activity', axis=1))

Loading Data.

Checking for Results:

Pipeline Running... Run Time: 0 sec.
.
Results Retrieved.


In [11]:
recog_result_no_labels.head()

Unnamed: 0,DistanceVector,NIDVector,CategoryVector,MappedCategoryVector,SegmentID,Subject
0,[42],[1],[1],[0],0,U010
1,[55],[1],[1],[0],0,U011
2,[59],[1],[1],[0],0,U012
3,[35],[1],[1],[0],1,U010
4,[49],[1],[1],[0],1,U011


The confusion matrix is not returned because no ground truth was provided.

In [12]:
summary_no_label

{u'execution_summary':    cached           name   runtime  step #          type
 0   False      Windowing  1.002980       0     segmenter
 1   False     MSE Filter  1.002711       1     transform
 2   False  generator_set  1.005179       2  generatorset
 3   False  Min Max Scale  1.002769       3     transform,
 u'summary': Empty DataFrame
 Columns: []
 Index: []}