In [1]:
import os
#os.environ['http_proxy'] = ''
#os.environ['https_proxy'] = ''

In [2]:
import pandas as pd
from kbclient.kb_dsk_basic.kb  import KB

dsk = KB()
dsk.project='Model Bulding Demo'
dsk.pipeline='feature_vector_to_model'

### Prepare the feature vectors

This data set consists of many reps, each described by 128 features, along with a label column containing the ground truth and a column holding the rep number. We drop the rep number because it should not be used as a feature. Note: All feature values must be integers scaled between 0 and 254.

The model generator requires that the ground truth column has the name "label", so we rename our ground truth column.
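If your own features are not yet in the 0 to 254 integer range, something like the following could be applied to each feature column before upload. This is only a minimal sketch: the scaling that produced the CSV used here is not shown in this notebook, so min-max scaling is an assumption, and `scale_to_0_254` is a hypothetical helper, not part of the KB client.

```python
def scale_to_0_254(values):
    """Min-max scale a sequence of numbers to integers in [0, 254].

    Hypothetical helper for illustration; not part of the KB client.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread to scale; map every value to 0.
        return [0 for _ in values]
    return [int(round((v - lo) / (hi - lo) * 254)) for v in values]


# Made-up raw values standing in for one feature column.
raw_column = [0.5, 1.0, 2.5, 4.5]
scaled_column = scale_to_0_254(raw_column)
print(scaled_column)
```

The smallest value maps to 0 and the largest to 254, with everything in between rounded to the nearest integer.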

In [3]:
df = pd.read_csv('Support/scaled_reducedFeatures_128.csv')
# Rename column '0' to 'label' and drop the columns we don't need
df = df.rename(columns={'0':'label'}).drop(['1','130'], axis=1)
dsk.pipeline.reset()

### Load the feature vectors into the pipeline

In [4]:
dsk.pipeline.set_input_data('feature_vectors_128.csv', df,  label_column='label')

Uploading file "feature_vectors_128.csv" to KB Cloud.
Upload of file "feature_vectors_128.csv"  to KB Cloud completed.
Group columns must be a list of strings.


### Model TVO description

* train_validate_optimize (tvo): This step defines the model validation methodology, the classification method to use, and the training algorithm used to train the classifier.

A tvo step is composed of:
* Classifier
* Training algorithm
* Validation method

For this pipeline we use the validation method "Stratified K-Fold Cross-Validation", which splits the data into 5 folds and iteratively trains on 4 folds and tests on the remaining fold. The training algorithm, Hierarchical Clustering with Neuron Optimization, uses a clustering algorithm to optimize neuron placement in feature space. After the model has been trained, the neurons are loaded into the PVP classifier and model validation is performed using the selected validation method.
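To make the fold structure concrete, here is a minimal pure-Python sketch of a stratified k-fold split. It is an illustration of the general technique, not the implementation that runs on KB Cloud: the function name `stratified_k_fold` and the toy labels are assumptions, and the sketch does not shuffle, whereas the real validation method may (the pipeline accepts a `validation_seed`).

```python
from collections import defaultdict


def stratified_k_fold(labels, number_of_folds=5):
    """Yield (train_indices, test_indices) pairs, one per fold.

    Each class's samples are dealt round-robin into the folds, so every
    fold keeps roughly the same class proportions as the full data set.
    """
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(number_of_folds)]
    for indices in indices_by_class.values():
        for position, index in enumerate(indices):
            folds[position % number_of_folds].append(index)

    for test_fold in range(number_of_folds):
        test_indices = sorted(folds[test_fold])
        train_indices = sorted(
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != test_fold
            for index in fold
        )
        yield train_indices, test_indices


# Toy labels: 10 samples of class "a" and 5 of class "b".
labels = ["a"] * 10 + ["b"] * 5
splits = list(stratified_k_fold(labels, number_of_folds=5))
for fold_number, (train, test) in enumerate(splits):
    print(fold_number, len(train), len(test))
```

With 5 folds, each sample lands in the test set exactly once and in the training set the other four times.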

In [5]:
dsk.pipeline.set_validation_method('Stratified K-Fold Cross-Validation', params={'number_of_folds':5})

dsk.pipeline.set_classifier('PVP', params={"classification_mode":'RBF','distance_mode':'L1'})

dsk.pipeline.set_training_algorithm('Hierarchical Clustering with Neuron Optimization', params = {'number_of_neurons':7})

dsk.pipeline.set_tvo({'validation_seed':0})

Execute the pipeline on KB Cloud.

In [6]:
results, stats = dsk.pipeline.execute()

Executing Pipeline with Steps:

------------------------------------------------------------------------
 0.     Name: feature_vectors_128.csv   		Type: featurefile              
------------------------------------------------------------------------
------------------------------------------------------------------------
 1.     Name: tvo                       		Type: tvo                      
------------------------------------------------------------------------
	Classifier: PVP
		Param: distance_mode: L1
		Param: classification_mode: RBF

	Training Algo: Hierarchical Clustering with Neuron Optimization
		Param: number_of_neurons: 7

	Validation Method: Stratified K-Fold Cross-Validation
		Param: number_of_folds: 5

------------------------------------------------------------------------


Checking for Results:

Pipeline is Running. Run Time: 5.3648250103
.Retrieving page 1 of 1.

Results Retrieved.


In [7]:
results.summarize()

TRAINING ALGORITHM: Hierarchical Clustering with Neuron Optimization
VALIDATION METHOD:  Stratified K-Fold Cross-Validation
CLASSIFIER:         PVP

AVERAGE METRICS:
F1-SCORE:    70.3   sigma 3.18
SENSITIVITY: 70.6   sigma 3.51
PRECISION:   87.1   sigma 2.08

--------------------------------------

STRATIFIED K-FOLD CROSS-VALIDATION MODEL RESULTS

MODEL INDEX: Fold 4
ACCURACY: 54.55
NEURONS: 7

MODEL INDEX: Fold 2
ACCURACY: 49.09
NEURONS: 7

MODEL INDEX: Fold 3
ACCURACY: 40.00
NEURONS: 7

MODEL INDEX: Fold 0
ACCURACY: 40.00
NEURONS: 7

MODEL INDEX: Fold 1
ACCURACY: 38.18
NEURONS: 7
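The per-fold accuracies above differ noticeably from fold to fold, so it can help to summarize them. The following is a minimal sketch using only the standard library, with the five accuracy values copied by hand from the output above. The `results` object may expose these values directly, but that API is not shown in this notebook, so none is assumed here.

```python
import statistics

# Accuracies copied from the summary output above, keyed by fold index.
fold_accuracy = {
    0: 40.00,
    1: 38.18,
    2: 49.09,
    3: 40.00,
    4: 54.55,
}

mean_accuracy = statistics.mean(fold_accuracy.values())
# Population standard deviation across the five folds.
spread = statistics.pstdev(fold_accuracy.values())
best_fold = max(fold_accuracy, key=fold_accuracy.get)

print(f"mean accuracy: {mean_accuracy:.2f}")
print(f"std deviation: {spread:.2f}")
print(f"best fold:     {best_fold}")
```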

