## Basic Introduction to Jupyter Notebook and Pandas DataFrames

Jupyter Notebook provides a Python interface that has become very popular with data scientists because it blends text, code, and code execution into a single interface. It has a bit of a learning curve, but has proven to be very powerful in practice. A pandas DataFrame can be thought of as a data object that acts like an Excel spreadsheet. In this tutorial we use pandas DataFrames to manipulate our data.

#### Note: If you are already familiar with Jupyter Notebook and pandas, go ahead and skip to Getting Started with Knowledge Builder


### A few things to get you started with Jupyter Notebook:

Look to the left of the cell below. Do you see the "In [ ]:"?

* In cells that have not been executed, you will see In [ ]:
* In cells that are executing or waiting to be executed, you will see In [*]:
* In cells that have already been executed, you will see In [ <some number> ]:, which indicates the order in which the cell was executed.
    
Let's go ahead and execute the cell below.

* To run a cell, click the play button in the toolbar or press Shift+Enter on the keyboard.

* To stop an executing cell, click the square button in the toolbar.

Try executing the cell below. It will keep running until you hit the stop button. Notice that there is a delay in stopping. The cell will not stop until the sleep has finished. 

In [None]:
import time
while True:
    print('Going to sleep... Click the Stop button to exit this loop.')
    time.sleep(5)

Note: The KeyboardInterrupt exception is the result of you terminating the cell. A message like that is usually a Python exception. Skip past the traceback and look at the bottom of the message for information about what caused the error. If you don't understand the error message, searching Google for the exact text will often help you understand what went wrong.
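If you would rather have a loop like this stop cleanly instead of printing a traceback, you can catch the exception yourself. The sketch below is only an illustration: the function name `run_until_interrupted` and its `sleep_fn` and `max_iterations` parameters are hypothetical helpers introduced here so the loop can be exercised without waiting.

```python
import time


def run_until_interrupted(sleep_fn=time.sleep, seconds=5, max_iterations=None):
    """Loop until interrupted and return the number of completed iterations."""
    completed = 0
    try:
        while max_iterations is None or completed < max_iterations:
            print('Going to sleep... Click the Stop button to exit this loop.')
            sleep_fn(seconds)
            completed += 1
    except KeyboardInterrupt:
        # The Stop button raises KeyboardInterrupt inside the running cell.
        print('Interrupted after {} completed iterations.'.format(completed))
    return completed


# Bounded demonstration that does not actually sleep.
run_until_interrupted(sleep_fn=lambda s: None, max_iterations=2)
```

Calling `run_until_interrupted()` with no arguments behaves like the cell above, except that pressing Stop prints a short message rather than a traceback.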

Great, so far so good! Next, let's add a new cell. You will often want to add a cell to the notebook to try out a function or explore the data. To add a cell, click Insert in the menu bar and select Insert Cell Below. In the cell that you create, go ahead and create a test DataFrame for us to play with. Copy the following code into the new cell and run it.

    import pandas as pd
    df = pd.DataFrame(list(zip(range(10), range(0, 20, 2))), columns=['X', 'Y'])
    df['Z'] = 5


Often you don't want the entire DataFrame to display. To show only the first 5 rows, use
        
        df.head()
        
Go ahead and try that in the cell below. The output should show only the first 5 rows.

In [None]:
df.head()

Next, let's visualize the DataFrame. Below are examples of a few different plots. You can refer back to these when you make visualizations yourself.

Note: The %matplotlib inline command is a Jupyter magic that makes plots render inline. It only needs to be executed once per notebook session, after which plots will always be shown. The figsize argument is optional, but a larger figure makes the plot much easier to read.

In [None]:
%matplotlib inline
df.plot(figsize=(12,6))

In [None]:
df.plot(x='Y', y='X',figsize=(12,6))

In [None]:
df['X'].plot(figsize=(12,6))

Two other useful methods are .unique(), which returns the distinct values in a column, and .describe(), which reports summary statistics for each numeric column.

In [None]:
print(df['X'].unique())
print(df['Z'].unique())

In [None]:
df.describe()
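As a quick illustration of what these two methods return, here is a self-contained sketch that rebuilds the same test DataFrame and pulls a few values out of the results. The variable names `x_values`, `z_values`, and `summary` are just illustrative.

```python
import pandas as pd

# Rebuild the test DataFrame used earlier in this introduction.
df = pd.DataFrame({'X': list(range(10)), 'Y': list(range(0, 20, 2))})
df['Z'] = 5

# unique() is called on a single column and returns its distinct values.
x_values = df['X'].unique()
z_values = df['Z'].unique()

# describe() returns a new DataFrame indexed by statistic name.
summary = df.describe()

print(len(x_values))            # number of distinct X values
print(list(z_values))           # Z holds a single repeated value
print(summary.loc['mean', 'Y'])  # mean of the Y column
```

Because describe() returns an ordinary DataFrame, you can index it with .loc to pick out a single statistic instead of reading it off the printed table.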

That's all for the basic introduction. If you have questions about pandas DataFrames, there is excellent documentation online covering all sorts of data manipulation. See: http://pandas.pydata.org/pandas-docs/stable/dsintro.html

For more information about Jupyter Notebook, click the Help menu at the top.

### Setting Proxy

If you require a proxy to access the internet, you will need to set it here in order to access our servers. See your IT department for more information.

In [1]:
import os
#os.environ['http_proxy'] = ''
#os.environ['https_proxy'] = ''
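For illustration only, a filled-in version of the cell above might look like the following. The address `http://proxy.example.com:8080` is a made-up placeholder, not a real server; substitute whatever your IT department gives you.

```python
import os

# Hypothetical proxy address; replace with the value from your IT department.
proxy_url = 'http://proxy.example.com:8080'

# Both variables are usually set so that plain and encrypted requests are routed.
os.environ['http_proxy'] = proxy_url
os.environ['https_proxy'] = proxy_url
```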

# Getting started with Knowledge Builder

In this tutorial we are going to walk you through setting up the KnowledgeBuilder Data Science Kit, selecting data, and building a pipeline. The data you are going to use was collected from multiple subjects wearing a device with 6 sensors (Accelerometer x,y,z and Gyroscope x,y,z) and formatted using the Data Capture Lab. The goal is to build a model that is able to classify what type of activity the subjects were performing. By the end of this tutorial you should be able to query data, transform data streams into a feature vector, train a model, understand the quality of the model, and make sure the pipeline you have built will run on the hardware device.

### Initialize a KB project

The data for this project has already been labeled and uploaded to the project "Activity Case Study". To access the data, first connect to the KnowledgeBuilder cloud service using the Data Science Kit (DSK), then set the project to "Activity Case Study".

Note: If you are running this tutorial outside of our workshop, you will want to use the project that you created in the getting started guide using the Data Capture Lab.

In [2]:
from kbclient.kb_dsk_basic.kb import KB

dsk = KB()

Login Successful


In [3]:
dsk.project = 'Activity Case Study'

The next step is to initialize a pipeline space to work in. The work you do in the pipeline will be stored in KB Cloud so that you can share pipelines with collaborators and come back to stored work in the future. Go ahead and add a pipeline to the project using the following code snippet. 
        
        dsk.pipeline = "Name of your pipeline"

# Part 1. Selecting the Data Set.

There are two ways of selecting data to use in a pipeline. The first is through a query against project data uploaded via the Data Capture Lab (DCL). The second is by uploading a pandas DataFrame or CSV data file. The DSK has useful functions for manipulating data in both ways. In this tutorial we will build a query against data that was uploaded through the Data Capture Lab.

#### Query
* query_name: What we want to name our query. This name is also how you will retrieve the query in the future. 
* columns: The data columns that you would like to include. In our case, these columns are the sensor data from the device, such as
        'AccelerometerZ'
        'AccelerometerY'
        'AccelerometerX'
* metadata_columns: This is additional information about your data set that is useful for separating out individual datastreams. In this example we have Subject and Activity. Subject relates to the individual user. Activity provides a ground truth about what type of activity the user was performing (running, walking, etc.).
* metadata_filter: This allows you to select a subset of the data by filtering against metadata.

You can use dsk to print the sensor columns and metadata columns that the project contains.

In [None]:
print(dsk.project.columns())

In [None]:
print(dsk.project.metadata_columns())

In [None]:
query_name = 'query_activity_demo'
q = dsk.create_query(query_name, columns = ['AccelerometerX', 'AccelerometerY'],
                                 metadata_columns = ['Subject','Activity'],
                                 metadata_filter = '[Subject] >= [5] AND [Subject] <= [10] AND [Activity] = [Crawling]',
                                 force=False
                                 )

data, _ = q.data()
data.head() 

The data variable that is returned is a DataFrame, which means we can use any pandas function to explore our data.

Note: Because we are calling head() on the data, only the first few rows will be displayed.
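Since the result is an ordinary DataFrame, a few pandas calls are enough to check what a query returned. The sketch below uses a small made-up DataFrame as a stand-in for the output of q.data(); the values are invented, and only the column names come from the query above.

```python
import pandas as pd

# Stand-in for the DataFrame returned by q.data(); the values are made up.
data = pd.DataFrame({
    'AccelerometerX': [10, 12, 9, 30, 28, 31],
    'AccelerometerY': [1, 2, 1, 5, 6, 4],
    'Subject': [5, 5, 5, 6, 6, 6],
    'Activity': ['Crawling'] * 6,
})

# Which activities and subjects made it through the metadata filter?
activities = sorted(data['Activity'].unique())
subjects = sorted(data['Subject'].unique())

# How many rows belong to each (Subject, Activity) data stream?
rows_per_stream = data.groupby(['Subject', 'Activity']).size()

print(activities)
print(subjects)
print(rows_per_stream)
```

The same three calls work unchanged on the real query result, as long as it contains the Subject and Activity metadata columns.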

Q. Create a query to select only data with the Activity "Running".

In [None]:
query_name = 'query_activity_running'
# ENTER CODE HERE


Q. Does the query look how you would expect? Can you verify that only the running data was returned from the query?

In [None]:
# ENTER CODE HERE



Q. Make a plot of the 'AccelerometerX' sensor data for Subject 1 during the Running activity.

In [None]:
# ENTER CODE HERE



Q. Now make one final query for all of the sensor columns to use in our pipeline selecting only subjects 1 through 9. Including more subjects will make the model better, but will take longer to run. Since this is only an introduction, subjects 1 through 9 should be enough. After you have built a pipeline, retraining on all of the data is recommended.

Note: Make sure to select all of the Activities. To do this, modify the metadata filter to use [Activity] instead of [Activity] = "Something". Without that filter you will end up with unlabeled data in your dataset.

In [None]:
query_name = 'query_activity'
# ENTER CODE HERE



# Part 2. Feature engineering - Creating a feature vector.

The second part of the pipeline involves transforming data streams into feature vectors that are then used to train a model. The features in a feature vector must be integers between 0 and 255. The feature vector can be any length, but in practice you will be limited by the space on the device.

DSK provides a way to define a pipeline for feature vector and model building. The feature vector generation part of the pipeline can be broken down into a few different parts:

* transforms - typically thought of as operations on the data stream from the device that normalize it and prepare it for feature vector generation.
* feature_generators - These functions extract relevant features from the data streams in preparation for model building.
* feature_selectors - These functions remove features that are expected to perform poorly based on a set of selection criteria.
* feature_scalers - These functions scale the features in the feature vector to values between 0 and 255, which is necessary for the hardware-accelerated pattern matching.

The DSK allows you to string together a pipeline composed of these individual steps. The pipeline is sent to our servers where we can take advantage of optimizations to speed up the pipeline processing.

### Exploring the functions available on KB Cloud.

KB Cloud has a variety of transforms, feature generators, feature selectors and training algorithms available to use. To see a list of available functions simply do as follows: 

In [None]:
print(dsk.functions)

To get the documentation for any of the functions, you can look at the KB documentation under core functions or use dsk.function_description("Function Name").

In [None]:
dsk.function_description('Scale Factor')

### Building the pipeline -  specifying the input data

Using the DSK, you will add sequential steps to the pipeline. It is important to note that no work is performed on the data until you execute the pipeline with dsk.pipeline.execute(). In other words, adding a transform to the pipeline does not immediately act on the data. Calling dsk.pipeline.execute() initiates the pipeline execution on KB Cloud, which then returns the results. Because steps must be added sequentially, if you get things out of order or make an error, use the reset command to start fresh.
        
        dsk.pipeline.reset()
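The idea of recording steps first and running them later can be sketched in a few lines of plain Python. The `LazyPipeline` class below is a hypothetical stand-in written for this explanation; it is not how the DSK is implemented, and its method names are not DSK API.

```python
class LazyPipeline(object):
    """Hypothetical stand-in that records steps and runs them only on execute()."""

    def __init__(self):
        self.steps = []

    def add_step(self, name, func):
        # Adding a step only records it; no data is touched yet.
        self.steps.append((name, func))

    def reset(self):
        # Discard every recorded step so the pipeline can be rebuilt in order.
        self.steps = []

    def execute(self, data):
        # Apply the recorded steps one after another, in the order they were added.
        for _name, func in self.steps:
            data = func(data)
        return data


pipeline = LazyPipeline()
pipeline.add_step('double', lambda values: [v * 2 for v in values])
pipeline.add_step('to_int', lambda values: [int(v) for v in values])

result = pipeline.execute([0.5, 1.5, 2.0])
print(result)
```

Nothing is computed by the two add_step calls; the work happens only inside execute(), which is why the order in which steps were added matters.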
        


The first step in the pipeline must always be the data that you are going to work with. For us this is going to be the query you created with all of the sensor data. In the cell below, place the cursor at the end of the line and press Tab. Use the arrow keys to select set_input_query and press Enter. Then pass the name you gave your query as input to the function.
    
        dsk.pipeline.set_input_query('<the name of your query>')

In [None]:
dsk.pipeline.set_input_

If you want to see documentation about any of the functions, you can add a ? to the end of the function name. Not all functions have documentation yet, but many of them do.

In [79]:
dsk.pipeline.add_transform?

Note: To exit the information box, click on the small x at the top right of the box.

### Function snippets and description

You can also use dsk.snippets to autogenerate code and see what parameter values are accepted. 

        dsk.snippets.Transform.Scale_Factor()
        
This will replace the current cell with autogenerated code. Be careful not to execute a line like this in a pipeline cell that contains other code, as it will replace the entire cell. If you do this by accident, you can always press Ctrl+Z to undo.

Go ahead and execute the cell below.

In [None]:
dsk.snippets.Transform.Scale_Factor()

This is a function that will add Scale Factor to your pipeline. Any parameters with < > brackets you will need to fill in. If there are default values associated with the function those will be filled in, but you may need to change them to suit your particular project. 

Note: In order to add the step to the pipeline you will need to re-execute the cell. Snippets only provides the code; it doesn't execute it.

Another function that performs the same action is
        
        dsk.function_help('Scale Factor')
        
In this case you can do code inspection using tab to explore available functions as in snippets. 

In [None]:
dsk.function_help('Scale Factor')

### Building the pipeline -  Adding a Transform

Now that we have specified the input data, let's go ahead and add a transform to the pipeline and execute the steps we've added so far on the cloud. We are going to first scale the data by a factor and transform it to integer values. This is because values coming off of the device streams are integers and cannot be manipulated as floats until the feature generation steps. 

To add a transform we use dsk.pipeline.add_transform("function name", params={"parameters to pass to function on server"}). In the function description you will see the parameters that need to be filled in.

In [7]:
dsk.pipeline.add_transform('Scale Factor', params={'scale_factor':4096.,
                                                   'input_columns':['AccelerometerY']})

You can use the following function at any time to see all of the steps and parameters that you currently have in your pipeline.

In [None]:
dsk.pipeline.describe()

In [None]:
#Go ahead and change the values in scale factor and re-execute the dsk.pipeline.describe. 
#You should see the scale_factor parameter change, but a new step shouldn't be added.







### Pipeline Execution

When executing the pipeline, there will always be two results returned. The first is the actual data and the second contains information about the pipeline execution on the server. Feel free to explore the information in the second variable. It is often very useful.

In [None]:
scaled_data, stats = dsk.pipeline.execute()
scaled_data.head()

### Building the pipeline -  Performing Segmentation.

The next step is to segment our data into windows on which we can perform recognition. For the activity data we will use the Windowing transform. We have left empty the parameters that need to be filled in. Go ahead and look at the function description. You will need to fill in the parameters for group_columns, window_size and delta. Set window_size and delta to 250. The actual time that the window size represents is device specific; for this data set it corresponds to roughly 2.5 seconds.

Can you tell from the scaled_data what columns need to be added to group_columns? 

Hint: group_columns is a description of how your data is put together. We will want to group this data set by the Subject and the Activity they are performing. 
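To make window_size and delta concrete, here is a minimal sketch of windowing a single data stream in plain Python. It assumes that delta is the number of samples between the starts of consecutive windows, so that delta equal to window_size gives non-overlapping windows; that reading is an assumption, so check the function description. The helper name `sliding_windows` is hypothetical, and on real data the same slicing would be applied separately within each group.

```python
def sliding_windows(samples, window_size, delta):
    """Return successive windows of `window_size` samples, advancing by `delta`."""
    windows = []
    start = 0
    # Stop as soon as a full window no longer fits in the remaining samples.
    while start + window_size <= len(samples):
        windows.append(samples[start:start + window_size])
        start += delta
    return windows


stream = list(range(10))

# delta == window_size: windows sit side by side without overlap.
non_overlapping = sliding_windows(stream, window_size=4, delta=4)

# delta < window_size: each window shares samples with the previous one.
overlapping = sliding_windows(stream, window_size=4, delta=2)

print(non_overlapping)
print(overlapping)
```

In this sketch the trailing samples that do not fill a whole window are simply dropped.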

In [8]:
dsk.pipeline.add_transform('Windowing', params={})

It is often good practice to pair a segmentation algorithm with a filter. This will add a step in the pipeline that will drop segments, saving battery life on the device by ignoring segments that don't contain useful information. For this pipeline we want to use the "MSE Filter" transform. Go ahead and add this step to the pipeline. Use the function_description to figure out what parameters need to be filled in.

Note: Where default values are defined you don't need to set them as inputs unless you would like to change them.

In [9]:
### Enter CODE HERE

After adding the MSE Filter, execute the pipeline.

In [None]:
mse_data, stats = dsk.pipeline.execute()

Can you make plots that visualize the data segments for Subject 1 and the Running Activity that passed the MSE Filter? 

Hint: the "index" column of the filtered data provides information about which sequence of the data stream the data belongs to.

Note: Don't spend too much time trying to make this plot. Code examples for making this plot are at the bottom of this notebook. Go ahead and check them if you are having trouble with this or any future plots.

In [None]:
# ENTER CODE HERE




At this point we are ready to generate a feature vector from our segments. Feature generators are all added into a single step and run in parallel against the same input data. We have added two feature generators from the subtype "Statistical". The more features, the better chance you have of building a successful model, so add at least 6 more feature generators with the same subtype. 
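As a rough picture of what a statistical feature generator does, the sketch below reduces each window of samples to a mean and a standard deviation. It is plain Python rather than DSK code, the helper names are hypothetical, and the use of the population standard deviation is an assumption, since this tutorial does not say which variant KB Cloud computes.

```python
import math


def mean(values):
    return sum(values) / float(len(values))


def standard_deviation(values):
    # Population standard deviation (divide by N); an assumption for this sketch.
    center = mean(values)
    return math.sqrt(sum((v - center) ** 2 for v in values) / float(len(values)))


def generate_features(window):
    """Collapse one window of samples into a small dictionary of features."""
    return {
        'Mean': mean(window),
        'Standard Deviation': standard_deviation(window),
    }


# Two made-up windows of sensor samples.
windows = [[2, 4, 4, 4, 5, 5, 7, 9], [1, 1, 1, 1]]
feature_vectors = [generate_features(w) for w in windows]
print(feature_vectors)
```

Each window becomes one row of features, which is why adding more generators widens the feature vector rather than adding rows.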

In [10]:
dsk.pipeline.add_feature_generator(["Mean", 'Standard Deviation',],
                                   function_defaults = {"columns":[u'AccelerometerY']})

Next, add the Min Max Scale transform to the pipeline. This function will scale the features in the feature vector to have values between 0 and 255.  
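For intuition, min-max scaling of a single feature to the 0 to 255 range can be sketched as follows. This is a generic illustration of the technique, not the server-side implementation; the helper name `min_max_scale` and the rounding choice are assumptions.

```python
def min_max_scale(values, new_min=0, new_max=255):
    """Linearly map values so the smallest becomes new_min and the largest new_max."""
    low, high = min(values), max(values)
    if high == low:
        # A constant feature carries no information; map it to the lower bound.
        return [new_min for _ in values]
    span = float(high - low)
    return [int(round(new_min + (v - low) / span * (new_max - new_min)))
            for v in values]


# One made-up feature observed across four feature vectors.
scaled = min_max_scale([-2.0, 0.0, 2.0, 8.0])
print(scaled)
```

Every feature is scaled independently, so features with very different natural ranges end up on the same integer scale.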

In [12]:
dsk.pipeline.add_transform('Min Max Scale')

In [None]:
feature_vectors, stats = dsk.pipeline.execute()
feature_vectors.head()

Let us take a look at the feature vectors that you have generated. Can you make a plot of the average of all feature vectors grouped by Activity? We are looking for feature vectors that are separable in space. How do the ones you've generated look? 


In [None]:
# ENTER CODE HERE




# Part 3. Model Building - Creating a model to put onto a device.

### Model TVO description

* train_validate_optimize (tvo): This step defines the model validation method, the classifier, and the training algorithm used to build the model. On KB Cloud the model is first trained using the selected training algorithm, then loaded into the hardware simulator (currently we only support PVP) and tested using the specified validation method.

This pipeline uses the validation method "Stratified K-Fold Cross-Validation". This is a standard validation method that tests the performance of a model by splitting the data into k folds, training on k-1 folds, and testing against the excluded fold. It then switches which fold is tested on and repeats until all of the folds have been used as a test set. Stratified means that each fold keeps roughly the same proportion of each class as the full data set. The average of the metrics across the models provides a good estimate of how a model trained on the full data set will perform.
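The fold construction can be sketched in plain Python. This is a simplified illustration: it deals samples into folds in order and does no shuffling, whereas a real implementation would normally randomize using a seed. The function name `stratified_k_fold` is hypothetical.

```python
from collections import defaultdict


def stratified_k_fold(labels, number_of_folds):
    """Yield (train_indices, test_indices) with each class spread across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(number_of_folds)]
    for label in sorted(by_class):
        # Deal each class's samples round-robin so folds keep similar class ratios.
        for position, index in enumerate(by_class[label]):
            folds[position % number_of_folds].append(index)

    for test_fold in range(number_of_folds):
        test = sorted(folds[test_fold])
        train = sorted(i for f, fold in enumerate(folds)
                       if f != test_fold for i in fold)
        yield train, test


# Made-up labels: twice as many Running samples as Walking samples.
labels = ['Running'] * 6 + ['Walking'] * 3
splits = list(stratified_k_fold(labels, number_of_folds=3))
for train, test in splits:
    print(train, test)
```

With three folds, each test set here holds two Running samples and one Walking sample, matching the 2:1 ratio of the full label list.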

The training algorithm attempts to optimize the number of neurons and their locations in order to create the best model. We are using the training algorithm "Hierarchical Clustering with Neuron Optimization," which uses a clustering algorithm to optimize neuron placement in feature space.

The only classifier currently available in KB Cloud is PVP. PVP has two classification modes, RBF and KNN, and two distance modes, L1 and LSUP. See the documentation for further descriptions of the classifier.
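The two distance modes can be illustrated with a short sketch. It assumes that L1 means the sum of absolute differences and LSUP means the largest absolute difference, which is the usual meaning of those names; check the classifier documentation for the exact definitions. The nearest-neuron lookup is a simplified picture of KNN with a single neighbor, not the PVP implementation, and the neuron values are made up.

```python
def l1_distance(a, b):
    # Sum of the absolute differences across all features.
    return sum(abs(x - y) for x, y in zip(a, b))


def lsup_distance(a, b):
    # Largest single absolute difference across all features.
    return max(abs(x - y) for x, y in zip(a, b))


def nearest_neuron_category(vector, neurons, distance):
    """Return the category of the neuron closest to `vector`."""
    best = min(neurons, key=lambda neuron: distance(vector, neuron['Vector']))
    return best['Category']


# Made-up neurons with feature values in the 0-255 range.
neurons = [
    {'Vector': [10, 200, 30], 'Category': 1},
    {'Vector': [120, 90, 100], 'Category': 2},
]
sample = [20, 180, 60]

print(l1_distance(sample, neurons[0]['Vector']))
print(lsup_distance(sample, neurons[0]['Vector']))
print(nearest_neuron_category(sample, neurons, l1_distance))
```

L1 accumulates small differences over every feature, while LSUP reacts only to the single worst-matching feature, so the two modes can rank neurons differently on longer feature vectors.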

In [14]:
dsk.pipeline.set_validation_method('Stratified K-Fold Cross-Validation', params={'number_of_folds':3,})

dsk.pipeline.set_classifier('PVP', params={"classification_mode":'RBF','distance_mode':'L1'})

dsk.pipeline.set_training_algorithm('Hierarchical Clustering with Neuron Optimization', params = {'number_of_neurons':7})


dsk.pipeline.set_tvo({'validation_seed':1})

Go ahead and execute the full pipeline now.

In [None]:
model_results, stats = dsk.pipeline.execute()

The model_results object returned after a TVO step contains a wealth of information about the models that were generated and their performance. A simple view is to use the summarize function to see the performance of our model.

In [None]:
model_results.summarize()

Let's grab the fold with the best performing model to compare with our features.

In [19]:
model = model_results.configurations[0].models[0]

The neurons are contained in model.neurons. Can you plot these over the feature_vector plot that you created earlier? This step is often useful for debugging.

In [None]:
# ENTER CODE HERE



# Part 4. The power of KnowledgeBuilder - KnowledgePacks 
The most important objective of KnowledgeBuilder is to allow users to instantly turn their PME models into downloadable KnowledgePacks that can be flashed to devices to perform the pattern matching tasks they were optimized to do. In this tutorial, we will explore the KB commands for configuring your sandbox to deliver the right binary for your use case, determine whether the code can be successfully flashed to the device (i.e., the cost budget has not been exceeded), and finally, generate the C files and download the code.


### Setting the Device Configuration and Checking the budget.
In the current release, KnowledgeBuilder supports ISPC 3.1, QMSI 1.1, and Zephyr 1.5. Before downloading a KnowledgePack, you will need to specify the target device platform and version number. This ensures that the right code will be generated. The DSK lets you set this before downloading a KnowledgePack.

In [None]:
model.knowledgepack.cost_report

This KnowledgePack comes in well under the code size limit. If it had exceeded it, you would see an entry in the "Warnings" section at the end of the report. If the execution time per classification (in microseconds) is acceptable to you, this KnowledgePack is a good candidate for downloading.

#### A Word about Feature Generators
Perhaps the biggest risk of exceeding the code size limit comes from too many (or too many expensive) features in the model. The feature selection algorithms are available to help reduce features to a minimal well-performing set. Refer to the tutorials on "Transforming Data, Generating Features, and Selecting Features" and "Optimizing Parameters using Grid Search" to learn more about feature selection.

### Reducing features with a feature selector

As an example, let's pretend that your KnowledgePack exceeded the device budget. To bring it back within budget, you will need to build a pipeline with the additional step of a feature_selector.

        dsk.pipeline.add_feature_selector([{"name":"Recursive Feature Elimination","params":{"method":"Log R"}}],
                                   params = {"number_of_features": 10})

Add the feature selector step to your pipeline after the feature generator and rerun it. Don't forget to set the number of features to reduce to. After running the pipeline, check the feature part of the stats output to see which features were eliminated.
        

Note: Currently we can only add steps in a linear fashion to the pipeline, so you will need to reset the pipeline and sequentially re-add the steps. Use the function below to clear the pipeline and start fresh.

        dsk.pipeline.reset()

In [None]:
# ENTER CODE HERE



# Part 5. Optimizing the feature search - Mass feature generation and grid searches

Up to this point you have been doing a lot of mental labor to put together a pipeline. The vision for KnowledgeBuilder is to provide a powerful tool for digging into the gritty details when you need to optimize, but also to provide the ability to search the feature space programmatically. So let's have some fun. Instead of adding just a few feature generators, let's add many of them. Using subtype_call, you can add all of the feature generators within a specific subtype. Let's also go ahead and create a query with more users. Select Subjects 1 through 20. For the sake of brevity, select only the sensor column "AccelerometerY".

Let's start by replacing the feature generator step in the pipeline you created at the end of Part 4 with the feature generator below. The snippet assumes that sensor_columns is a list of the sensor columns you selected, for example ['AccelerometerY'].

        dsk.pipeline.add_feature_generator([{'subtype_call':'Time', 'params':{'sample_rate':100}},
                                            {'subtype_call':'Rate of Change'},
                                            {'subtype_call':'Statistical'},
                                            {'subtype_call':'Energy'},
                                            {'subtype_call':'Amplitude', 'params':{'smoothing_factor':9}}
                                            ],
                                            function_defaults={'columns':sensor_columns}
                                            )

Go ahead and execute the pipeline and compare the results to the previous model. Look and see which features were eliminated as well.

In [None]:
# ENTER CODE HERE


Next, using the same pipeline you just created, instead of running just a single pipeline we are going to do a grid search across the parameter space. The code below submits the grid search job to the server, which will create over 600 models and return the average metrics for each combination of parameters.

        grid_params = {'Windowing': {'delta': [150, 250, 350]},
                       'selector_set_0': {'number_of_features': [10, 20, 30, 50]},
                       'Neuron Optimization': {'neuron_range': [2, 30]}
                       }

        dsk.pipeline.grid_search(grid_params)
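To see why a grid search multiplies quickly, the sketch below enumerates every combination of the list-valued parameters with itertools.product. It is a local illustration only: `expand_grid` is a hypothetical helper, and neuron_range is left out because how the server expands that range is not described here.

```python
import itertools


def expand_grid(grid):
    """Return one flat dict per combination of parameter values."""
    keys = []
    value_lists = []
    for step in sorted(grid):
        for param in sorted(grid[step]):
            keys.append((step, param))
            value_lists.append(grid[step][param])
    # product() yields one tuple per combination, in the same order as `keys`.
    return [dict(zip(keys, combo)) for combo in itertools.product(*value_lists)]


grid_params = {
    'Windowing': {'delta': [150, 250, 350]},
    'selector_set_0': {'number_of_features': [10, 20, 30, 50]},
}

combinations = expand_grid(grid_params)
print(len(combinations))
print(combinations[0])
```

Three delta values and four feature counts already give twelve pipelines, and every further parameter multiplies that number again.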

In [None]:
# ENTER CODE HERE


To poll the server for the results, execute the cell below.

In [None]:
grid_results, stats = dsk.pipeline.get_results(lock=True, wait_time=20)
grid_results.sort_values(['f1_score','Neurons'], ascending=[False, True])

This actually takes a long time to run since we only have a single server up; in production we parallelize across multiple servers. So instead, below we load saved grid search results and display them sorted by f1_score.

In [40]:
import pandas as pd
grid_results = pd.read_csv('grid_results.csv')
grid_results.sort_values(['f1_score','Neurons'], ascending=[False, True]).head()

| Unnamed: 0 | Classification | Neurons | delta | f1_score | f1_score_std | number_of_features | precision | precision_std | sensitivity | sensitivity_std |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | knn | 30 | 350 | 81.635179 | 3.152807 | 30 | 84.797277 | 2.229195 | 82.034915 | 3.084795 |
| 1 | knn | 30 | 350 | 81.635179 | 3.152807 | 50 | 84.797277 | 2.229195 | 82.034915 | 3.084795 |
| 112 | knn | 30 | 350 | 81.635179 | 3.152807 | 20 | 84.797277 | 2.229195 | 82.034915 | 3.084795 |
| 223 | rbf | 30 | 350 | 77.882393 | 4.164667 | 50 | 85.555149 | 1.653244 | 75.769198 | 5.185293 |
| 334 | rbf | 30 | 350 | 77.882393 | 4.164667 | 30 | 85.555149 | 1.653244 | 75.769198 | 5.185293 |


If you have time left, build a pipeline using the parameters that returned the best performance. Use the "Recall" method to train the model using all of the data. What is the device cost of this model? Is it over budget? If it is, try removing some of the features to get the pipeline within budget.

In [None]:
# ENTER CODE HERE




Example code for prompts regarding generating visualizations:
        
        # Graph for mse filter data
        mse_a_R_s_1 = mse_data[(mse_data['Subject']==1) & (mse_data['Activity']=='Running')]
        mse_a_R_s_1.sort_values('index').plot(x='index', y='AccelerometerX', figsize=(16,8))
        
        # This is a graph of all the individual segments of the mse filter data
        mse_groups = mse_a_R_s_1.groupby(['window_id'])
        for g in mse_groups.groups.keys():
            mse_groups.get_group(g).plot(x='index', y='AccelerometerX')

        # Graph for feature vectors
        import numpy as np
        hist = feature_vectors.drop(['Subject','window_id'], axis=1)
        group = hist.groupby(['Activity']).apply(np.mean)
        group.transpose().plot(figsize=(16,8), lw=2)
        
        # Graph for feature vectors and neurons
        ## The neurons that are created here are drawn in dashed lines.
        import matplotlib.pyplot as plt  # needed for plt.plot and plt.legend below
        hist = feature_vectors.drop(['Subject','window_id'], axis=1)
        group = hist.groupby(['Activity']).apply(np.mean)
        group.transpose().plot(figsize=(16,8), lw=2)
        for neuron in model.neurons:
            plt.plot(neuron['Vector'], ls='--', lw=2, label=neuron['Category'])
        plt.legend(loc='best')