# Automation tools for searching the parameter space

In this tutorial we will cover tools that implement the following methods for searching the parameter space and finding the best model in SensiML cloud.

1. Feature Explosion
2. Feature Selection
3. Grid Search
4. Survival Search


In [1]:
import pandas as pd

from sensiml import SensiML
from sensiml.widgets import QueryWidget, AutoSenseWidget, DownloadWidget

dsk = SensiML()
dsk.project = 'Parameter Optimization Tutorial'
dsk.pipeline = 'Easy_Pipeline_Button'

In [5]:
df = pd.read_csv('Support/grid_search_tutorial_activity_data.csv')
sensor_columns = ['AccelerometerX', 'AccelerometerY', 'AccelerometerZ']
df.head()

Unnamed: 0,Subject,Class,AccelerometerX,AccelerometerY,AccelerometerZ
0,U001,0,-317,-3000,925
1,U001,0,-284,-2968,903
2,U001,0,-243,-2987,933
3,U001,0,-193,-3051,936
4,U001,0,-150,-3059,915


In [31]:
dsk.pipeline.reset()
dsk.pipeline.set_input_data('grid_dataframe', df, data_columns=sensor_columns,
                                                  group_columns=['Class','Subject'],
                                                  label_column='Class',
                                                  force=True)

dsk.pipeline.add_transform('Windowing')

dsk.pipeline.add_transform('MSE Filter', params={'input_column':sensor_columns[0]})

Uploading file "grid_dataframe.csv" to KB Cloud.
Upload of file "grid_dataframe.csv"  to KB Cloud completed.


## 1. Feature Explosion

The feature generation step is key here. Notice that we are using subtype calls, each of which encompasses a large group of feature generators. Feature explosion is a powerful technique in which we generate a large number of features without guessing in advance which will be best suited to the classification task. The downside of feature explosion is the danger of overfitting. To avoid that, we use feature selectors as well as cross-validation when building the model.

In [32]:
# Feature Generation
dsk.pipeline.add_feature_generator([{'subtype_call':'Time', 'params':{'sample_rate':100}},
                                    {'subtype_call':'Rate of Change'},
                                    {'subtype_call':'Statistical'},
                                    {'subtype_call':'Energy'},
                                    {'subtype_call':'Amplitude', 'params':{'smoothing_factor':9}}
                                    ],
                                    function_defaults={'columns':sensor_columns},
                                    )


# Scale to 8 bit representation for classification 
dsk.pipeline.add_transform('Min Max Scale')
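
To make the idea of feature explosion concrete, the sketch below applies a handful of statistics to every sensor column of one window, so the feature count grows as (number of columns) × (number of functions). It is plain Python for illustration only, not the SensiML generator implementation: the table `FEATURE_FUNCS`, the helper `explode_features` and the feature names are hypothetical, and the window holds only the five rows shown by `df.head()` above.

```python
import statistics

# The five samples shown by df.head() above, one list per sensor column.
window = {
    "AccelerometerX": [-317, -284, -243, -193, -150],
    "AccelerometerY": [-3000, -2968, -2987, -3051, -3059],
    "AccelerometerZ": [925, 903, 933, 936, 915],
}

# Hypothetical stand-ins for one group of statistical generators.
FEATURE_FUNCS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "minimum": min,
    "maximum": max,
    "std": statistics.pstdev,
}


def explode_features(window, funcs):
    """Apply every function to every column and return one flat feature dict."""
    return {
        f"{column}_{name}": func(values)
        for column, values in window.items()
        for name, func in funcs.items()
    }


features = explode_features(window, FEATURE_FUNCS)
print(len(features))  # 3 columns x 5 functions = 15
for name, value in features.items():
    print(name, value)
```

Each additional function adds one feature per sensor column, so whole groups of generators quickly produce far more features than a device should compute.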

## 2. Feature Selection <a id='Feature_selection_intro'></a>

Now that the pipeline has generated a large set of candidate features, we need to select those that are best at discriminating between our labels, or "y-values". That is where the selection process comes into play.

Imagine computing all of the dozens of features you've generated on hardware. The results may be less accurate, and the operation may take a long time to run. Selecting the best features from that set can improve accuracy and greatly improve efficiency on hardware.

Much like the generators, selectors are function calls that are placed into a <b>selector call set</b>.

#### Selector Calls <a id='selectorcalls'></a>

Much like the generator calls before them, selector calls can be created by retrieving existing functions from the server. We then populate their expected inputs and add them to a selector call set. Order matters here, because the output of one selector automatically feeds into the next.

In [33]:
dsk.pipeline.add_feature_selector([{"name":"Recursive Feature Elimination","params":{"method":"Log R", "number_of_features":20 }},
                                   {"name": "Correlation Threshold","params":{'threshold':0.85}},
                                   {"name": "Variance Threshold", "params":{'threshold':0.05}}],
                                  params = {"number_of_features":20,})
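
The effect of chaining selectors can be sketched in plain Python. The helpers below are hypothetical stand-ins, not the server-side selectors: `variance_threshold` drops near-constant features, and `correlation_threshold` then drops any feature that is highly correlated with one already kept. The toy feature table is made up for illustration, and the order used in this sketch was chosen for the toy data; it differs from the order in the cell above.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def variance_threshold(features, threshold):
    """Keep features whose population variance exceeds the threshold."""
    return {k: v for k, v in features.items()
            if statistics.pvariance(v) > threshold}


def correlation_threshold(features, threshold):
    """Keep a feature only if |r| with every already-kept feature is <= threshold."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, other)) <= threshold
               for other in kept.values()):
            kept[name] = values
    return kept


# Toy feature vectors (one value per segment); values are illustrative only.
features = {
    "f_a": [0.10, 0.40, 0.35, 0.80, 0.90],
    "f_b": [0.20, 0.80, 0.70, 1.60, 1.80],  # 2 * f_a, so perfectly correlated
    "f_c": [0.50, 0.50, 0.50, 0.50, 0.50],  # constant, zero variance
    "f_d": [0.90, 0.10, 0.60, 0.20, 0.70],
}

# Each selector receives the output of the previous one.
selected = features
for selector, threshold in [(variance_threshold, 0.05),
                            (correlation_threshold, 0.85)]:
    selected = selector(selected, threshold)

print(sorted(selected))
```

In this sketch the variance step removes the constant column before the correlation step sees it. If the two steps were swapped, the constant column would reach `pearson`, where its correlation is undefined. This is one reason the order of a selector set deserves attention.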

In [34]:
dsk.pipeline.set_validation_method('Stratified K-Fold Cross-Validation', params={'number_of_folds':3})

dsk.pipeline.set_classifier('PVP', params={"classification_mode":'RBF','distance_mode':'L1'})

dsk.pipeline.set_training_algorithm('Hierarchical Clustering with Neuron Optimization', params = {'number_of_neurons':10})

dsk.pipeline.set_tvo({'validation_seed':0})

In [None]:
dsk.pipeline.describe()

## 3. Grid Search Optimization

Optimizing model performance often requires searching over a large parameter space. A common method of performing this search is grid search. In this tutorial we will demonstrate how to use grid search in Knowledge Builder to build better-optimized models. On the server side, we take advantage of the parallelizable nature of the pipelines, as well as optimizations for training algorithms, to speed up the computation. This makes it possible to search large parameter spaces quickly and efficiently. After performing the grid search, we rank each pipeline based on its F1 score, precision and sensitivity so that you can choose the best-performing combination to build a knowledge pack with.


#### Grid Search Syntax

Now that we have a pipeline that works, we would like to search the parameter space to further optimize the model. To do this, we call the pipeline's grid search function, "dsk.pipeline.grid_search()", and pass in a dictionary, grid_params, describing the parameters to search over.

grid_params is a nested Python dictionary.

    grid_params = {"Name Of Function":{"Name of Parameter":[ A, B, C]}} 

Here A, B and C are the values to search over for that parameter. For each step you may also want to search over more than one of a function's configurable parameters. To do this, simply add another element to the function's dictionary.

    grid_params = {"Name Of Function":{"Name of Parameter 1":[ A, B, C],
                                       "Name of Parameter 2":[ D, E]}}
                                   
This tells grid search to evaluate 3 × 2 = 6 different parameter combinations.
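
That count is the size of the Cartesian product of the value lists. The sketch below expands one function's entry with `itertools.product`. The helper `expand_grid` and the placeholder values are hypothetical; they only illustrate how the combinations are enumerated, not how the server schedules them.

```python
from itertools import product


def expand_grid(param_lists):
    """Return one dict per combination of the listed parameter values."""
    names = list(param_lists)
    return [dict(zip(names, values))
            for values in product(*(param_lists[name] for name in names))]


# Placeholder values standing in for A, B, C and D, E.
grid_params = {"Name Of Function": {"Name of Parameter 1": ["A", "B", "C"],
                                    "Name of Parameter 2": ["D", "E"]}}

combinations = expand_grid(grid_params["Name Of Function"])
print(len(combinations))  # 3 * 2 = 6
for combo in combinations:
    print(combo)
```

Because the sizes multiply, each additional parameter list multiplies the number of pipelines to run, so it pays to keep each list short.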

You can also specify more than one step to search over in grid_params. This is done by adding another element to the function level of the grid_params dictionary.

    grid_params = {"Name Of Function":{"Name of Parameter 1":[ A, B, C],
                                       "Name of Parameter 2":[ D, E]},
                   "Name of Function 2":{"Name of Parameter":[1, 2, 3, 4, 5, 6]}}
                   
For the TVO step we currently only allow modification of the parameters of the training algorithm. To access the grid parameter space, use the name of the training algorithm as shown in the example below.

In [36]:
grid_params = {'Windowing':{"window_size": [100,200],'delta':[100]},
              'selector_set': {"Recursive Feature Elimination":{'number_of_features':[10, 20]}},
               'Hierarchical Clustering with Neuron Optimization': {'number_of_neurons':[5,10]}
              }

results, stats = dsk.pipeline.grid_search(grid_params)

Executing Pipeline with Steps:

------------------------------------------------------------------------
 0.     Name: grid_dataframe.csv        		Type: featurefile              
------------------------------------------------------------------------
------------------------------------------------------------------------
 1.     Name: Windowing                 		Type: segmenter                
------------------------------------------------------------------------
------------------------------------------------------------------------
 2.     Name: MSE Filter                		Type: transform                
------------------------------------------------------------------------
------------------------------------------------------------------------
 3.     Name: generator_set             		Type: generatorset             
------------------------------------------------------------------------
------------------------------------------------------------------------
 4.     Name: M

#### F1, Precision and Sensitivity Scores for Each Grid Point

The output from grid search is a dataframe containing the F1, precision and sensitivity scores for each permutation of the pipeline. With cross-validation, these are averages over all models. Below we use the pandas DataFrame function for sorting by multiple columns, each in ascending or descending order.

In [39]:
results.sort_values(['f1_score','training_method.number_of_neurons'], ascending=[False, True]).head()

Unnamed: 0,Recursive Feature Elimination.number_of_features,delta,f1_score,f1_score_std,precision,precision_std,sensitivity,sensitivity_std,training_method.number_of_neurons,window_size
0,10,100,99.112631,0.627684,100.0,0.0,98.271667,1.222532,5,200
1,10,100,99.112631,0.627684,100.0,0.0,98.271667,1.222532,10,200
2,20,100,98.999246,0.865153,100.0,0.0,98.057726,1.672219,5,100
3,20,100,98.999246,0.865153,100.0,0.0,98.057726,1.672219,10,100
4,10,100,98.983325,0.88563,100.0,0.0,98.041517,1.692985,5,100
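
Once the table is sorted, the top row can be turned back into a plain parameter dictionary. The sketch below rebuilds a small DataFrame from the rows shown above (the precision and standard deviation columns are omitted) rather than calling the server. The names `score_columns` and `best_params` are hypothetical.

```python
import pandas as pd

# Rows copied from the grid search output above (some score columns omitted).
results = pd.DataFrame({
    "Recursive Feature Elimination.number_of_features": [10, 10, 20, 20, 10],
    "delta": [100, 100, 100, 100, 100],
    "f1_score": [99.112631, 99.112631, 98.999246, 98.999246, 98.983325],
    "sensitivity": [98.271667, 98.271667, 98.057726, 98.057726, 98.041517],
    "training_method.number_of_neurons": [5, 10, 5, 10, 5],
    "window_size": [200, 200, 100, 100, 100],
})

score_columns = ["f1_score", "sensitivity"]

# Highest F1 first; among ties, prefer fewer neurons (a smaller model).
ranked = results.sort_values(
    ["f1_score", "training_method.number_of_neurons"],
    ascending=[False, True],
)

# Drop the scores from the winning row so only parameter values remain.
best = ranked.iloc[0]
best_params = best.drop(score_columns).astype(int).to_dict()
print(best_params)
```

Breaking ties on the neuron count is a design choice for this sketch: when two grid points score the same, the smaller model is usually the cheaper one to deploy.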


## 4. Optimizing Parameters Using the Automation Genetic Algorithm

Another way to find optimal parameters is with a genetic algorithm. Instead of searching a large parameter space exhaustively to find the single best combination of parameters, the genetic algorithm starts with a small randomized population of parameter combinations, generates models from them and tests them, keeps a subset of high-performing combinations, and then recombines those "survivors" in different ways (see: crossover and mutation) and repeats the process over again. The offspring of good parameter combinations are usually also good and sometimes are significantly better than their parents. As the algorithm repeats each successive generation, it often finds a near-optimal model without trying as many configurations as grid search.
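
The loop described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the server's algorithm: the search space `SPACE`, the made-up optimum `TARGET` and the `fitness` function stand in for "train a model and score it", and keeping the survivors unchanged in the next generation is a design choice of this sketch.

```python
import random

# Hypothetical search space; names and values are for illustration only.
SPACE = {
    "window_size": [50, 100, 150, 200, 250],
    "number_of_features": [5, 10, 15, 20, 25, 30],
    "number_of_neurons": [5, 10, 15, 20, 25],
}
# Made-up optimum standing in for the result of training and scoring a model.
TARGET = {"window_size": 200, "number_of_features": 10, "number_of_neurons": 5}


def fitness(candidate):
    """Toy fitness: 0 at the target, more negative the further away."""
    return -sum(abs(SPACE[k].index(candidate[k]) - SPACE[k].index(TARGET[k]))
                for k in SPACE)


def random_candidate(rng):
    """Draw one random legal value for every parameter."""
    return {k: rng.choice(values) for k, values in SPACE.items()}


def crossover(a, b, rng):
    """Take each parameter from one of the two parents at random."""
    return {k: rng.choice([a[k], b[k]]) for k in SPACE}


def mutate(candidate, rng, rate=0.2):
    """With probability `rate`, replace a parameter by a random legal value."""
    return {k: rng.choice(SPACE[k]) if rng.random() < rate else v
            for k, v in candidate.items()}


def evolve(generations=10, population_size=8, survivors=3, seed=0):
    """Run the select / recombine / mutate loop and track the best fitness."""
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[:survivors]  # survivors carry over unchanged
        children = [mutate(crossover(*rng.sample(parents, 2), rng), rng)
                    for _ in range(population_size - survivors)]
        population = parents + children
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history


best, history = evolve()
print(best)
print(history)
```

The full grid for this toy space has 5 × 6 × 5 = 150 combinations, while a run with these defaults creates at most 8 + 10 × 5 = 58 candidates. Because the survivors are carried over, the best fitness in `history` never decreases from one generation to the next, although the sketch does not guarantee that the optimum is reached.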

In this tutorial we will demonstrate how to use the Auto command to apply the genetic algorithm to your custom pipeline. On the server side, the pipelines are run in parallel and results are ranked by a fitness score which takes into account the model's F1 score, precision, sensitivity, and other metrics. You can learn more about these performance metrics and the many automation options in KB Basics Tutorial 7.

#### Set the Auto Command with the Custom seed
To automate parameter selection on this pipeline, generate an Auto call with Snippets. Select the "Custom" seed to tell the server to use your defined pipeline instead of a preset template. 

        dsk.snippets.Auto.Custom()
        
Note: Running this snippet only generates the code; it does not execute it. Execute the cell a second time to start the automated pipeline.

#### Inspect the Fitness Results

The fitness summary contains everything you need to evaluate the best models found by the algorithm. By default, the first run only does one iteration, so the models may not be very impressive. Take a look at the KB Basics tutorial about automation for more information about the fitness summary and all of its metrics.

In [41]:
summary['fitness_summary']

Unnamed: 0,accuracy,best_model,f1_score,features,fitness,flash,iteration,knowledgepack,latency,neurons,pipeline,positive_predictive_rate,precision,sensitivity,specificity,sram,stack
0,99.459459,"Fold 0, Iteration 0",99.485456,31.0,2.401729,6074.0,0,7e90912f-1dbf-4f28-b1a5-6d6165916e8e,626179500.0,4.0,"[{""name"": ""grid_dataframe.csv"", ""outputs"": [""t...",99.537037,99.537037,99.444444,99.810606,2404.0,420.0
1,100.0,Fold 0,100.0,34.0,2.4,6246.0,0,882b610b-0af1-489c-b7ac-b8ba239b30df,626179500.0,5.0,"[{""name"": ""grid_dataframe.csv"", ""outputs"": [""t...",100.0,100.0,100.0,100.0,2404.0,420.0
2,98.918919,Fold 0,98.686512,61.0,2.287257,8458.0,0,d222be27-dde8-4f80-99cf-a3d1f1f4550d,626854900.0,12.0,"[{""name"": ""grid_dataframe.csv"", ""outputs"": [""t...",98.93617,98.93617,98.486635,99.642857,2404.0,420.0
3,100.0,Fold 0,100.0,67.0,2.27874,8842.0,0,a871994c-20be-4d15-b7ec-3ffb21cfca40,626333700.0,16.0,"[{""name"": ""grid_dataframe.csv"", ""outputs"": [""t...",100.0,100.0,100.0,100.0,2404.0,420.0
4,100.0,"Fold 0, Iteration 4",100.0,15.0,2.275591,4666.0,0,2be20c96-d6b6-4926-bc1c-cd0185f4def7,626151800.0,48.0,"[{""name"": ""grid_dataframe.csv"", ""outputs"": [""t...",100.0,100.0,100.0,100.0,2404.0,420.0
5,99.459459,Fold 0,99.485456,53.0,2.271021,7886.0,0,,626848700.0,24.0,"[{""name"": ""grid_dataframe.csv"", ""outputs"": [""t...",99.537037,99.537037,99.444444,99.810606,2404.0,420.0
6,93.310811,"Fold 0, Iteration 12",94.301101,11.0,2.073728,5030.0,0,,626349300.0,73.0,"[{""name"": ""grid_dataframe.csv"", ""outputs"": [""t...",95.312984,95.312984,93.429334,98.172694,2400.0,400.0
7,72.972973,Fold 0,71.76924,14.0,1.903957,5162.0,0,,626555900.0,24.0,"[{""name"": ""grid_dataframe.csv"", ""outputs"": [""t...",73.029279,73.029279,71.6837,88.802836,2400.0,400.0
8,58.378378,Fold 0,69.339945,53.0,1.558167,7886.0,0,,626848700.0,24.0,"[{""name"": ""grid_dataframe.csv"", ""outputs"": [""t...",96.25,96.25,56.295485,96.590909,2404.0,420.0
9,51.351351,Fold 0,67.192376,61.0,1.449569,8458.0,0,,626854900.0,19.0,"[{""name"": ""grid_dataframe.csv"", ""outputs"": [""t...",100.0,100.0,50.707547,100.0,2404.0,420.0
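
Because the summary is a pandas DataFrame, it can be filtered against your own requirements before choosing a model. The sketch below rebuilds the first six rows of the table above with a few of its columns rather than querying the server. The thresholds `MIN_F1` and `MAX_NEURONS` are hypothetical requirements chosen for illustration.

```python
import pandas as pd

# First six rows of the fitness summary above, limited to a few columns.
fitness_summary = pd.DataFrame({
    "f1_score": [99.485456, 100.0, 98.686512, 100.0, 100.0, 99.485456],
    "features": [31, 34, 61, 67, 15, 53],
    "fitness": [2.401729, 2.4, 2.287257, 2.27874, 2.275591, 2.271021],
    "flash": [6074, 6246, 8458, 8842, 4666, 7886],
    "neurons": [4, 5, 12, 16, 48, 24],
})

# Hypothetical requirements for the target device and application.
MIN_F1 = 99.9
MAX_NEURONS = 16

# Keep models that meet both requirements, smallest flash footprint first.
candidates = fitness_summary[
    (fitness_summary["f1_score"] >= MIN_F1)
    & (fitness_summary["neurons"] <= MAX_NEURONS)
].sort_values("flash")

print(candidates)
```

The row labels of `candidates` match the index of the summary table, so the first row points back to the model that satisfies the constraints with the smallest flash footprint.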


#### Additional Iterations
To let the algorithm do a few more iterations, set 'iterations' equal to 3 and set the 'reset' option to False (this tells the server you do NOT want to re-initialize).

In [None]:
results, summary = dsk.pipeline.auto({'seed': 'Custom', 
                                      'params': {'iterations': 3, 
                                                 'reset': False}})

#### Look at the Optimal Features and Parameters
When you think you have found an interesting model, you can request the knowledgepack object by its ID from the summary table and view its features and its pipeline, which contains the optimized parameters.

In [46]:
kp_uuid = summary['fitness_summary'].iloc[0]['knowledgepack']
kp = dsk.get_knowledgepack(kp_uuid)
pd.DataFrame(kp.feature_summary)

Unnamed: 0,Category,Context01Index,EliminatedBy,Feature,Generator,GeneratorIndex,Sensors
0,Time,7,,gen_0008_AccelerometerYPctTimeOverZero,Percent Time Over Zero,2,[AccelerometerY]
1,Statistical,31,,gen_0032_AccelerometerYMean,Mean,10,[AccelerometerY]
2,Statistical,34,,gen_0035_AccelerometerY100Percentile,100th Percentile,11,[AccelerometerY]
3,Statistical,37,,gen_0038_AccelerometerYMedian,Median,12,[AccelerometerY]
4,Statistical,40,,gen_0041_AccelerometerY75Percentile,75th Percentile,13,[AccelerometerY]
5,Statistical,42,,gen_0043_AccelerometerXIQR,Interquartile Range,14,[AccelerometerX]
6,Statistical,43,,gen_0044_AccelerometerYIQR,Interquartile Range,14,[AccelerometerY]
7,Statistical,44,,gen_0045_AccelerometerZIQR,Interquartile Range,14,[AccelerometerZ]
8,Statistical,46,,gen_0047_AccelerometerYminimum,Minimum,15,[AccelerometerY]
9,Statistical,47,,gen_0048_AccelerometerZminimum,Minimum,15,[AccelerometerZ]


In [None]:
dsk.pipeline.rehydrate(kp)