# Khiva's feature extraction application
This interactive application shows some of the capabilities of the Khiva library for time-series feature extraction and machine learning.
This use case analyses 100 time series, recorded at commercial sites during 2012 and tagged by subindustry.
This exercise focuses on:
1. Extracting the time-series features.
2. Predicting the subindustry of some sites.

## Module importing 
We are using Khiva’s array, features and library modules. 

In [43]:
from khiva.features import *
from khiva.library import *
from khiva.array import *

import pandas as pd
import numpy as np

import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=Warning)


from sklearn.utils import shuffle
from sklearn.preprocessing import scale

from sklearn.model_selection import GridSearchCV
from sklearn import svm

import time

## Metadata load 
This is anonymised 5-minute energy usage data for 100 commercial/industrial sites for 2012.
The site metadata contains the site ID, the industry, the square footage, the latitude/longitude and the time zone.

In [44]:
all_sites = pd.read_csv("../../energy/data/data-enerNoc/all-data/meta/all_sites.csv")

## Load the site ID for each site
The site IDs are needed to check the accuracy of the predictive modelling step.

In [45]:
# Collect the site IDs in a plain list, in the same order as the metadata rows
file_names = list(all_sites["SITE_ID"].values)

## Backend
Prints the backend in use. Khiva provides CPU, CUDA and OpenCL backends.

> This interactive application is being executed on **hub.mybinder**, which does not provide a GPU and has a quite limited CPU, so the feature extraction is going to take some time. On macOS High Sierra with a 2.9 GHz Intel Core i7 processor, extracting the features takes 8.34 seconds.

In [46]:
print(get_backend())

KHIVABackend.KHIVA_BACKEND_CPU


## Data load
The original dataset contains 100 time series with a total of 10,531,288 data points. It was re-dimensioned using Khiva, resulting in a dataset of 100 time series with a total of 1,666,600 data points. After the re-dimensioning, the dataset was stored in a binary file, which is loaded here into a Khiva Array.

In [47]:
arr_tmp = Array(np.load("../../energy/electric-consumption-rate-python/time-series-redimension-applied.npy"))
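
The text does not say which reduction method produced the smaller dataset. Purely as an illustration, the sketch below uses a PAA-style block average in plain NumPy, which replaces each block of consecutive samples with its mean. The function name `paa_reduce` and the toy sizes are assumptions, not the pre-processing actually applied to this dataset.

```python
import numpy as np

def paa_reduce(series, n_segments):
    """Reduce a 1-D series to n_segments points by averaging near-equal blocks."""
    series = np.asarray(series, dtype=float)
    # np.array_split tolerates lengths that are not a multiple of n_segments
    blocks = np.array_split(series, n_segments)
    return np.array([block.mean() for block in blocks])

# Toy example: 12 samples reduced to 4 points (block means 1, 4, 7 and 10)
raw = np.arange(12.0)
reduced = paa_reduce(raw, 4)
print(reduced)
```

Averaging keeps the overall level of each block while discarding short-term variation, which is one reason such reductions are used before computing summary features.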

## Features extraction 
In this step, we use Khiva to extract 28 features from the time series and build a features matrix for the predictive modelling. As explained above, this interactive application runs on **hub.mybinder**, without GPUs and with a limited CPU, so this step takes a while because the time series contain a large number of data points.

In [48]:
start = time.time()
features = np.stack([abs_energy(arr_tmp).to_numpy(),
                    absolute_sum_of_changes(arr_tmp).to_numpy(),
                    count_above_mean(arr_tmp).to_numpy(),
                    count_below_mean(arr_tmp).to_numpy(),
                    first_location_of_maximum(arr_tmp).to_numpy(),
                    first_location_of_minimum(arr_tmp).to_numpy(),
                    has_duplicates(arr_tmp).to_numpy(),
                    has_duplicate_max(arr_tmp).to_numpy(),
                    kurtosis(arr_tmp).to_numpy(),
                    last_location_of_maximum(arr_tmp).to_numpy(),
                    last_location_of_minimum(arr_tmp).to_numpy(),
                    has_duplicate_min(arr_tmp).to_numpy(),
                    longest_strike_above_mean(arr_tmp).to_numpy(),
                    longest_strike_below_mean(arr_tmp).to_numpy(),
                    maximum(arr_tmp).to_numpy(),
                    mean_absolute_change(arr_tmp).to_numpy(),
                    minimum(arr_tmp).to_numpy(),
                    number_crossing_m(arr_tmp, 0).to_numpy(),
                    mean(arr_tmp).to_numpy(),
                    median(arr_tmp).to_numpy(),
                    mean_change(arr_tmp).to_numpy(),
                    ratio_value_number_to_time_series_length(arr_tmp).to_numpy(),
                    skewness(arr_tmp).to_numpy(),
                    standard_deviation(arr_tmp).to_numpy(),
                    sum_of_reoccurring_values(arr_tmp).to_numpy(),
                    sum_values(arr_tmp).to_numpy(),
                    variance(arr_tmp).to_numpy(),
                    variance_larger_than_standard_deviation(arr_tmp).to_numpy()
                            ])
print("Time to extract the features : " + str(time.time() - start) + " seconds." )
features = features.transpose()

Time to extract the features : 8.75836968421936 seconds.
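
To make a few of these feature names concrete, here is a NumPy sketch of four of them, following the definitions commonly used for features of the same name (for example in tsfresh). It is an illustration under that assumption, not Khiva's implementation, and the `_np` helper names are hypothetical. The second half mirrors the stack-then-transpose step used above to obtain one row per series.

```python
import numpy as np

def abs_energy_np(x):
    # Sum of squared values
    return float(np.sum(x ** 2))

def absolute_sum_of_changes_np(x):
    # Sum of absolute differences between consecutive values
    return float(np.sum(np.abs(np.diff(x))))

def count_above_mean_np(x):
    # Number of values strictly greater than the mean
    return int(np.sum(x > x.mean()))

def mean_absolute_change_np(x):
    # Mean of absolute differences between consecutive values
    return float(np.mean(np.abs(np.diff(x))))

series = np.array([1.0, 3.0, 2.0, 6.0])
row = [abs_energy_np(series),
       absolute_sum_of_changes_np(series),
       count_above_mean_np(series),
       mean_absolute_change_np(series)]
print(row)

# Three toy series, one per row
batch = np.array([[1.0, 3.0, 2.0, 6.0],
                  [2.0, 2.0, 2.0, 2.0],
                  [0.0, 1.0, 0.0, 1.0]])
# One vector per feature, stacked to shape (n_features, n_series)
per_feature = np.stack([
    np.array([abs_energy_np(s) for s in batch]),
    np.array([absolute_sum_of_changes_np(s) for s in batch]),
])
# Transpose to shape (n_series, n_features), the layout a classifier expects
feature_matrix = per_feature.transpose()
print(feature_matrix.shape)
```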


## Features matrix and target definition
The features matrix and the target vector are defined as follows:
1. **Features matrix**: composed of the 28 extracted features.
2. **Target vector**: composed of the subindustry tag of each time series.

In [49]:
y = all_sites["SUB_INDUSTRY"].values
X = features

## Features matrix pre-processing
A simple pre-processing step scales the features matrix.

In [50]:
X = scale(X)
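
`scale` standardises each column to zero mean and unit variance. The NumPy sketch below reproduces that arithmetic on a toy matrix; the `standardize` helper is hypothetical and is shown only to make the transformation explicit. A constant column, such as a boolean feature that is identical for every site, has zero standard deviation and needs a guard.

```python
import numpy as np

def standardize(X):
    """Column-wise (x - mean) / std, leaving constant columns at zero."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)        # population standard deviation (ddof=0)
    std[std == 0.0] = 1.0      # avoid dividing by zero for constant columns
    return (X - mean) / std

toy = np.array([[1.0, 10.0, 5.0],
                [2.0, 20.0, 5.0],
                [3.0, 30.0, 5.0]])
scaled = standardize(toy)
print(scaled.mean(axis=0))     # column means after scaling
print(scaled.std(axis=0))      # column standard deviations after scaling
```

Note that the notebook scales all rows before splitting them, so the held-out rows contribute to the column statistics; computing the mean and standard deviation on the training rows only is a common alternative.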

## Shuffle 
Several shuffles are applied to distribute the samples; the features, labels and site IDs are shuffled together so that they stay aligned.

In [51]:
for i in range(15):
    X, y, file_names = shuffle(X, y, file_names, random_state=0)
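
Because `random_state=0` is fixed, every pass of the loop applies the same permutation again, so the final order is deterministic. The sketch below shows the underlying idea in NumPy: a single permutation applied to all three containers keeps rows, labels and site IDs aligned. The toy data is made up for illustration.

```python
import numpy as np

X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array(["a", "b", "c", "d"])
ids_toy = [10, 11, 12, 13]

rng = np.random.RandomState(0)        # fixed seed for reproducibility
perm = rng.permutation(len(y_toy))    # one shared permutation

X_shuf = X_toy[perm]
y_shuf = y_toy[perm]
ids_shuf = [ids_toy[i] for i in perm]

# Each row still carries its own label and ID
for row, label, site in zip(X_shuf, y_shuf, ids_shuf):
    print(row[0], label, site)
```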

## Predictive modelling 
In this step, we create a model, fit it and predict a subset of samples. 

The reasons for choosing SVC as the classifier are the following:
* There are more than 50 training samples and fewer than 100k in total.
* The goal is to predict a category.
* All the data is labelled.

We can conclude that the results obtained with this classifier are quite decent.

The parameters are chosen by a grid search with cross-validation (CV), scored on the accuracy of the model and using 10-fold CV.
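
To give a sense of what such a search involves, the sketch below enumerates a parameter grid and builds plain k-fold index splits using only the standard library. It is a simplified illustration: `k_fold_indices` is a hypothetical helper, and for classifiers scikit-learn's `GridSearchCV` with an integer `cv` uses stratified folds rather than the contiguous folds shown here.

```python
from itertools import product

param_grid = {"degree": [3, 4],
              "shrinking": [True, False],
              "probability": [True, False]}

# Every combination of the listed values
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]

def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

# Assumption: 85 training rows, i.e. 100 sites minus 15 held-out sites
folds = list(k_fold_indices(85, 10))
n_fits = len(candidates) * len(folds)
print(len(candidates), "candidates x", len(folds), "folds =", n_fits, "fits")
```

Each candidate is trained once per fold and scored on the fold left out, and the candidate with the best mean score is kept.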


In [52]:
# Site IDs held out as the test set
test_site_ids = [92, 45, 761, 10, 766, 400, 673, 49, 144, 496, 731, 281, 213, 197, 399]

files_test = []
list_test_indices = []
for i in range(len(file_names)):
    if file_names[i] in test_site_ids:
        list_test_indices.append(i)
        files_test.append(file_names[i])

# Split features and labels into training and test sets
X_train = np.delete(X, list_test_indices, 0)
X_test = np.take(X, list_test_indices, 0)
y_train = np.delete(y, list_test_indices)
y_test = np.take(y, list_test_indices)

# Parameter grid explored by the grid search
k_range_parameter = {'degree': [3, 4], 'shrinking': [True, False], 'probability': [True, False]}

clf = svm.SVC()

# Grid search with 10-fold cross-validation, scored on accuracy
mygridsearch = GridSearchCV(clf, k_range_parameter, cv=10, scoring='accuracy')
mygridsearch.fit(X_train, y_train)
bestclassifier = mygridsearch.best_estimator_
y_pred = bestclassifier.predict(X_test)

accuracy = sum(y_pred == y_test) / float(len(y_pred))
print("TEST VECTOR: " + str(y_test))
print("PREDICTION VECTOR: " + str(y_pred))
print("NUMBER OF ERRORS: " + str(sum(y_pred != y_test)))
# Multiply by 100 so that the values match the "%" suffix
print("ERROR RATE: " + str(100 * (1 - accuracy)) + "%")
print("ACCURACY: " + str(100 * accuracy) + "%")
print("PARAMETERS USED: " + str(mygridsearch.best_params_))

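
The hold-out selection and the reported metrics can be illustrated on toy data. In this sketch the arrays, IDs, labels and predictions are all made up, `np.isin` builds the test mask from a list of site IDs, and accuracy and error rate are expressed as percentages. It is a sketch of the bookkeeping only, not a reproduction of the notebook's results.

```python
import numpy as np

site_ids = np.array([10, 45, 92, 100, 213])
labels = np.array(["A", "B", "A", "C", "B"])
held_out = [45, 213]

test_mask = np.isin(site_ids, held_out)   # True for rows in the test set
y_train_toy = labels[~test_mask]
y_test_toy = labels[test_mask]

# Pretend predictions for the two held-out rows (illustrative values only)
y_pred_toy = np.array(["B", "A"])

n_errors = int(np.sum(y_pred_toy != y_test_toy))
accuracy = 100.0 * np.mean(y_pred_toy == y_test_toy)   # percentage of correct predictions
error_rate = 100.0 - accuracy
print("errors:", n_errors, "accuracy:", accuracy, "% error rate:", error_rate, "%")
```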