# Time series clustering using Khiva

Clustering is an important method for analysing time series, since it groups them according to their characteristics. Given the clustering result, we can determine, for instance, which forecasting algorithms best fit each group.

Thanks to the feature extraction capabilities of [Khiva](http://khiva-python.readthedocs.io/en/latest/), we can generate a [features](http://khiva-python.readthedocs.io/en/latest/khiva.html#module-khiva.features) matrix to use as input to any of the available clustering methods.

In [135]:
%config IPCompleter.greedy=True
%matplotlib inline

import os
import warnings

from khiva.features import *
from khiva.array import *
from khiva.library import * 
from khiva.dimensionality import *

import numpy as np
import pandas as pd

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

import matplotlib.pyplot as plt
from ipywidgets import interact, IntSlider
from mpl_toolkits.mplot3d import Axes3D

plt.rcParams['figure.figsize'] = [20, 10]
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=Warning)

## Why do we want to cluster the time series?

Time series clustering has a wide variety of uses and applications, depending on the context in which it is used.

As an example, it can be used to find out which time series share the same characteristics and, from that, which of them are suitable for particular forecasting methods (moving average, exponential smoothing, Box-Jenkins, X-11, etc.).

## Backend
The next cell prints the backend in use. The CPU, CUDA and OPENCL backends are available in Khiva.  
  
> This interactive application runs on **hub.mybinder**, which doesn't provide a GPU and has a quite limited CPU, so it is going to take some time.

In [136]:
print(get_backend())

KHIVABackend.KHIVA_BACKEND_OPENCL


## Features extraction

In the next notebook cell, 51 features are extracted from 100 time series. Each time series has first been reduced from its original length to 1000 points to speed up the computation.

The reduced time series are stored in a file called `time_series_redimensioned.npy`; the code that produced it is left commented out in the next cell.
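
The commented-out code in the next cell calls Khiva's `visvalingam` to perform this reduction. As a rough illustration of the idea only (this is not Khiva's implementation), the sketch below is a simplified NumPy version of Visvalingam-style simplification: it repeatedly drops the interior point whose triangle with its two neighbours has the smallest area, until the requested number of points remains. The names `triangle_area` and `visvalingam_sketch` are hypothetical.

```python
import numpy as np


def triangle_area(p, q, r):
    # Area of the triangle defined by three (x, y) points.
    return 0.5 * abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1]))


def visvalingam_sketch(x, y, num_points):
    # Simplified, quadratic-time illustration; the two endpoints are always kept.
    points = [(float(a), float(b)) for a, b in zip(x, y)]
    while len(points) > max(num_points, 2):
        areas = [triangle_area(points[i - 1], points[i], points[i + 1])
                 for i in range(1, len(points) - 1)]
        # Remove the interior point that contributes least to the shape.
        del points[1 + int(np.argmin(areas))]
    return np.array(points)


# Toy series with a single peak at x = 3.
reduced = visvalingam_sketch(np.arange(6), [0, 0, 0, 5, 0, 0], 3)
print(reduced)
```

On this toy input the peak and both endpoints survive, which is the property that makes this kind of reduction attractive before extracting features.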

In [137]:
path = '../../energy/data/data-enerNoc/all-data/csv'
file_names = []
# array_list = []
# Collect the CSV file names in directory order; this order must match
# the order in which the reduced series were saved.
for filename in os.listdir(path):
    if ".csv" in filename:
        file_names.append(filename)
#         data = pd.read_csv(path + "/" + filename)
#         a = visvalingam(Array([range(len(data["value"])), data["value"].values]), int(1000))
#         arr_tmp = a.get_col(1)
#         array_list.append(arr_tmp.to_numpy())
# np.save("time_series_redimensioned", np.array(array_list))

arr_tmp = Array(np.load('./time_series_redimensioned.npy'))

features = np.stack([abs_energy(arr_tmp).to_numpy(),
                     absolute_sum_of_changes(arr_tmp).to_numpy(),
                     aggregated_autocorrelation(arr_tmp, 0).to_numpy(),
                     aggregated_autocorrelation(arr_tmp, 1).to_numpy(),
                     aggregated_autocorrelation(arr_tmp, 2).to_numpy(),
                     aggregated_autocorrelation(arr_tmp, 3).to_numpy(),
                     aggregated_autocorrelation(arr_tmp, 4).to_numpy(),
                     aggregated_autocorrelation(arr_tmp, 5).to_numpy(),
                     approximate_entropy(arr_tmp, 4, 0.5).to_numpy(),
                     binned_entropy(arr_tmp, 5).to_numpy(),
                     c3(arr_tmp, True).to_numpy(),
                     count_above_mean(arr_tmp).to_numpy(),
                     count_below_mean(arr_tmp).to_numpy(),
                     cwt_coefficients(arr_tmp, Array([1, 2, 3], dtype.s32), 2, 2).to_numpy(),
                     energy_ratio_by_chunks(arr_tmp, 2, 0).to_numpy(),
                     first_location_of_maximum(arr_tmp).to_numpy(),
                     first_location_of_minimum(arr_tmp).to_numpy(),
                     has_duplicates(arr_tmp).to_numpy(),
                     has_duplicate_max(arr_tmp).to_numpy(),
                     has_duplicate_min(arr_tmp).to_numpy(),
                     index_mass_quantile(arr_tmp, 0.5).to_numpy(),
                     kurtosis(arr_tmp).to_numpy(),
                     large_standard_deviation(arr_tmp, 0.4).to_numpy(),
                     last_location_of_maximum(arr_tmp).to_numpy(),
                     last_location_of_minimum(arr_tmp).to_numpy(),
                     length(arr_tmp).to_numpy(),
                     longest_strike_above_mean(arr_tmp).to_numpy(),
                     longest_strike_below_mean(arr_tmp).to_numpy(),
                     max_langevin_fixed_point(arr_tmp, 7, 2).to_numpy(),
                     maximum(arr_tmp).to_numpy(),
                     mean(arr_tmp).to_numpy(),
                     mean_absolute_change(arr_tmp).to_numpy(),
                     mean_change(arr_tmp).to_numpy(),
                     mean_second_derivative_central(arr_tmp).to_numpy(),
                     median(arr_tmp).to_numpy(),
                     minimum(arr_tmp).to_numpy(),
                     number_crossing_m(arr_tmp, 0).to_numpy(),
                     number_cwt_peaks(arr_tmp, 2).to_numpy(),
                     number_peaks(arr_tmp, 2).to_numpy(),
                     percentage_of_reoccurring_datapoints_to_all_datapoints(arr_tmp, False).to_numpy(),
                     percentage_of_reoccurring_values_to_all_values(arr_tmp, False).to_numpy(),
                     quantile(arr_tmp, Array([0.6], dtype.f32)).to_numpy(),
                     ratio_beyond_r_sigma(arr_tmp, 0.5).to_numpy(),
                     ratio_value_number_to_time_series_length(arr_tmp).to_numpy(),
                     skewness(arr_tmp).to_numpy(),
                     standard_deviation(arr_tmp).to_numpy(),
                     sum_of_reoccurring_values(arr_tmp).to_numpy(),
                     sum_values(arr_tmp).to_numpy(),
                     symmetry_looking(arr_tmp, 0.1).to_numpy(),
                     time_reversal_asymmetry_statistic(arr_tmp, 2).to_numpy(),
                     variance(arr_tmp).to_numpy(),
                    ])
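
To make the layout of the stacked result concrete, here is a small NumPy-only sketch of three of the features above. It assumes the usual tsfresh-style definitions (sum of squares, sum of absolute consecutive differences, number of values above the mean) and a layout with one time series per row; the `*_np` helper names are hypothetical and are not part of Khiva.

```python
import numpy as np

# Hypothetical toy data: 2 time series (rows) with 3 points each.
series = np.array([[1.0, 2.0, 4.0],
                   [3.0, 3.0, 3.0]])


def abs_energy_np(ts):
    # Sum of the squared values of each series.
    return np.sum(ts ** 2, axis=1)


def absolute_sum_of_changes_np(ts):
    # Sum of the absolute differences between consecutive values.
    return np.sum(np.abs(np.diff(ts, axis=1)), axis=1)


def count_above_mean_np(ts):
    # Number of values strictly greater than the mean of each series.
    return np.sum(ts > ts.mean(axis=1, keepdims=True), axis=1)


# One row per feature, as in the np.stack call above ...
feature_rows = np.stack([abs_energy_np(series),
                         absolute_sum_of_changes_np(series),
                         count_above_mean_np(series)])
# ... transposed so that each row describes one time series.
feature_matrix = feature_rows.transpose()
print(feature_matrix)
```

After the transpose, every row is the feature vector of one time series, which is the shape clustering algorithms expect.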




The 51 extracted features are shown next.

In [138]:
features = features.transpose()
pandasDF = pd.DataFrame(data=features, columns=['abs_energy(arr)',
                                                     'absolute_sum_of_changes(arr)',
                                                     'aggregated_autocorrelation(arr, 0)',
                                                     'aggregated_autocorrelation(arr, 1)',
                                                     'aggregated_autocorrelation(arr, 2)',
                                                     'aggregated_autocorrelation(arr, 3)',
                                                     'aggregated_autocorrelation(arr, 4)',
                                                     'aggregated_autocorrelation(arr, 5)',
                                                     'approximate_entropy(arr, 4, 0.5)',
                                                     'binned_entropy(arr, 5)',
                                                     'c3(arr, True)',
                                                     'count_above_mean(arr)',
                                                     'count_below_mean(arr)',
                                                     'cwt_coefficients(arr, Array([1, 2, 3], dtype.s32), 2, 2)',
                                                     'energy_ratio_by_chunks(arr, 2, 0)',
                                                     'first_location_of_maximum(arr)',
                                                     'first_location_of_minimum(arr)',
                                                     'has_duplicates(arr)',
                                                     'has_duplicate_max(arr)',
                                                     'has_duplicate_min(arr)',
                                                     'index_mass_quantile(arr, 0.5)',
                                                     'kurtosis(arr)',
                                                     'large_standard_deviation(arr, 0.4)',
                                                     'last_location_of_maximum(arr)',
                                                     'last_location_of_minimum(arr)',
                                                     'length(arr)',
                                                     'longest_strike_above_mean(arr)',
                                                     'longest_strike_below_mean(arr)',
                                                     'max_langevin_fixed_point(arr, 7, 2)',
                                                     'maximum(arr)',
                                                     'mean(arr)',
                                                     'mean_absolute_change(arr)',
                                                     'mean_change(arr)',
                                                     'mean_second_derivative_central(arr)',
                                                     'median(arr)',
                                                     'minimum(arr)',
                                                     'number_crossing_m(arr, 0)',
                                                     'number_cwt_peaks(arr, 2)',
                                                     'number_peaks(arr, 2)',
                                                     'percentage_of_reoccurring_datapoints_to_all_datapoints(arr, False)',
                                                     'percentage_of_reoccurring_values_to_all_values(arr, False)',
                                                     'quantile(arr, Array([0.6], dtype.f32))',
                                                     'ratio_beyond_r_sigma(arr, 0.5)',
                                                     'ratio_value_number_to_time_series_length(arr)',
                                                     'skewness(arr)',
                                                     'standard_deviation(arr)',
                                                     'sum_of_reoccurring_values(arr)',
                                                     'sum_values(arr)',
                                                     'symmetry_looking(arr, 0.1)',
                                                     'time_reversal_asymmetry_statistic(arr, 2)',
                                                     'variance(arr)'])
pandasDF.head(5)

|  | abs_energy(arr) | absolute_sum_of_changes(arr) | aggregated_autocorrelation(arr, 0) | aggregated_autocorrelation(arr, 1) | aggregated_autocorrelation(arr, 2) | aggregated_autocorrelation(arr, 3) | aggregated_autocorrelation(arr, 4) | aggregated_autocorrelation(arr, 5) | approximate_entropy(arr, 4, 0.5) | binned_entropy(arr, 5) | ... | quantile(arr, Array([0.6], dtype.f32)) | ratio_beyond_r_sigma(arr, 0.5) | ratio_value_number_to_time_series_length(arr) | skewness(arr) | standard_deviation(arr) | sum_of_reoccurring_values(arr) | sum_values(arr) | symmetry_looking(arr, 0.1) | time_reversal_asymmetry_statistic(arr, 2) | variance(arr) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1088199.0 | 15740.010742 | 0.016608 | -0.001612 | -0.274271 | 4.113612 | 0.176879 | 0.031286 | 0.827971 |  | ... | 34.275211 | 0.781 | 0.701 | 0.076788 | 11.519623 | 7109.591797 | 30911.117188 | 1.0 | -83.594093 | 132.701721 |
| 1 | 952773.9 | 13398.524414 | -1.837614 | -0.562199 | -26.116846 | 11.289761 | 6.073333 | 36.88538 | 0.726339 |  | ... | 31.268089 | 0.82 | 0.121 | 0.883933 | 12.635367 | 3860.458496 | 28162.412109 | 0.0 | -28.823009 | 159.652512 |
| 2 | 77613270.0 | 27890.644531 | 3.591948 | 1.296992 | -15.567692 | 289.58963 | 15.692007 | 246.239105 | 0.650522 |  | ... | 288.694885 | 0.532 | 0.987 | -0.5836 | 56.365086 | 3642.331543 | 272830.0625 | 1.0 | -9175.567383 | 3177.022705 |
| 3 | 215405.6 | 8028.245117 | -0.67707 | -0.70826 | -57.885098 | 16.710022 | 7.081564 | 50.148556 | 0.688054 |  | ... | 16.199238 | 0.797 | 0.113 | 1.007511 | 8.879622 | 1835.268188 | 11685.800781 | 0.0 | 43.162006 | 78.847702 |
| 4 | 56283.08 | 4691.344727 | -0.29207 | -0.014299 | -10.321704 | 2.325413 | 1.470899 | 2.163545 | 0.751111 |  | ... | 2.970965 | 0.748 | 0.115 | 1.239205 | 4.341112 | 830.674866 | 6118.645996 | 1.0 | -6.979472 | 18.845257 |


## Clustering 

The 51 features calculated in the previous cell are now used to cluster the time series. Both the desired number of clusters and the number of time series shown in the plot can be chosen.
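
As a minimal, non-interactive sketch of this step, the snippet below runs scikit-learn's `KMeans` on a tiny made-up feature matrix and groups the series indices by cluster label. The toy values and the names `toy_features` and `groups` are assumptions for illustration; they are not taken from the dataset used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature matrix: 6 series x 2 features, forming two obvious groups.
toy_features = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                         [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])

# n_clusters plays the role of the "Clusters" slider in the interactive cell.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_features)
labels = model.labels_

# Map each cluster label to the indices of the series assigned to it.
groups = {int(k): np.flatnonzero(labels == k).tolist() for k in np.unique(labels)}
print(groups)
```

The numeric label given to each cluster is arbitrary; only the grouping itself is meaningful.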

To plot the results, a PCA is applied to reduce the features to 3 principal components. The clustering itself is computed on the full scaled features matrix.
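
The preprocessing and projection can be sketched with NumPy alone. This is an illustration under stated assumptions, not the scikit-learn implementation: `standardize` and `pca_project` are hypothetical helpers that replace NaN with 0, scale each column to zero mean and unit variance (leaving constant columns at zero), and project onto the leading principal directions obtained from an SVD.

```python
import numpy as np


def standardize(X):
    # Replace NaN with 0, then centre and scale every column.
    X = np.nan_to_num(np.asarray(X, dtype=float))
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0  # avoid dividing by zero for constant columns
    return (X - mean) / std


def pca_project(X, n_components):
    # Project centred data onto its first principal directions via SVD.
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_ratio = (S ** 2)[:n_components] / np.sum(S ** 2)
    return Xc @ Vt[:n_components].T, explained_ratio


# Hypothetical toy features: 10 series x 5 features, one NaN and one constant column.
rng = np.random.default_rng(0)
toy = rng.normal(size=(10, 5))
toy[0, 1] = np.nan
toy[:, 4] = 7.0

Z = standardize(toy)
components, explained_ratio = pca_project(Z, 3)
print(components.shape, explained_ratio)
```

The explained variance ratio gives a rough idea of how much structure the three plotted components retain.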


In [134]:
# Convert the pandas DataFrame to a NumPy array
# (`.values` replaces `as_matrix()`, which was removed in pandas 1.0).
X = pandasDF.values
# Replace NaN with 0.
X = np.nan_to_num(X)
# Standardize each feature column (zero mean, unit variance).
X = scale(X)
# Apply a Principal Component Analysis with three components (used only for plotting).
pca = PCA(n_components=3)
principal_components = pca.fit_transform(X)
# Build a pandas DataFrame with the principal components.
principal_df = pd.DataFrame(data=principal_components,
                            columns=['PC-1', 'PC-2', 'PC-3'])

# Plot the result of applying k-means with the desired number of clusters,
# showing only the first TS time series.
def run_prediction(Clusters, TS):
    kmeans = KMeans(n_clusters=int(Clusters), random_state=0).fit(X)
    fig = plt.figure()
    # add_subplot works across matplotlib versions; Axes3D(fig) no longer
    # attaches the axes to the figure in recent releases.
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(xs=principal_df['PC-1'].values[0:TS],
               ys=principal_df['PC-2'].values[0:TS],
               zs=principal_df['PC-3'].values[0:TS],
               c=kmeans.labels_[0:TS], s=250)

    # Label each plotted point with the name of its source file.
    for i, txt in enumerate(file_names[0:TS]):
        ax.text(principal_df['PC-1'][i],
                principal_df['PC-2'][i], principal_df['PC-3'][i],
                '%s' % (str(txt)), size=20, zorder=1, color='k')

    ax.set_xlabel('PC-1')
    ax.set_ylabel('PC-2')
    ax.set_zlabel('PC-3')
    plt.tight_layout()  # the original omitted the call parentheses
    plt.show()

interact(run_prediction,
         TS=IntSlider(min=1, max=len(X), step=1, value=1, continuous_update=False),
         Clusters=IntSlider(min=1, max=len(X), step=1, continuous_update=False));
