## ERDC metrics data example

This notebook shows how the data in the ERDCmetrics dataset can be imported into a Jupyter notebook on the compute portal. It also performs some basic exploration and clustering of the data, to give you an idea of how the data could be worked with.

Let's start by importing the required packages.

In [None]:
import json

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn import cluster

from dataportal import DataportalClient

plt.rcParams['figure.figsize'] = [10, 6]
np.random.seed(123)

To import data from the dataset, we must first list the available files. Enter your token and run the segment below to do so.

In [None]:
dataset = 'ERDCmetrics' # Enter the name of a dataset you have access to
token = '' # Enter your token from the dataportal here

client = DataportalClient(token)
fileList = client.fromDataset(dataset).listFiles()

The data in the ERDCmetrics dataset is organized into files, where each file holds the events of one data type within a given timespan. In this example notebook we will only consider the files for a single timespan. Run the segment below to retrieve those files and load them into pandas dataframes.

In [None]:
fileids = {
    'float': 1,
    'str': 112,
    'uint': 331
}

fetched_dataframes = {}
for mtype, file_id in fileids.items():
    # file_id is used instead of id to avoid shadowing the Python builtin
    fetched_dataframes[mtype] = client.getData(file_id, compression='bz2')
    print(client.currentLoadedFile())
    print(f'From file id {file_id}')
    fetched_dataframes[mtype].info(memory_usage='deep')
    print('')

The events in the dataframes are organized as rows, where each event contains a timestamp (clock) for a measurement point (name) with the corresponding value (value) for a particular computer (host).

Let's inspect the dataframes a bit closer. In particular, we are interested in some statistics of the host and name columns, to see how many unique computers and measurement points the dataset contains in the given timespan.

Choose a metric type to consider and run the segment below.

In [None]:
metric_type = 'float'
#metric_type = 'str'
#metric_type = 'uint'

nbr_of_cols = 10

hosts = fetched_dataframes[metric_type].host.value_counts()
measurements = fetched_dataframes[metric_type].name.value_counts()

print(f'In total there are {len(hosts)} unique hosts in the {metric_type} file.')
print(f'Number of entries per host, the {nbr_of_cols} largest:')
print(f'{hosts[:nbr_of_cols].to_markdown()}\n')

print(f'In total there are {len(measurements)} unique measurement points in the {metric_type} file.')
print(f'Number of entries per measurement point, the {nbr_of_cols} largest:')
print(f'{measurements[:nbr_of_cols].to_markdown()}\n')

print(f'The dataframe from the {metric_type} file')
display(fetched_dataframes[metric_type])
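
To make the event layout concrete, here is a minimal sketch on made-up data. The column names (clock, ns, name, value, host) follow the notebook, while the host names, measurement point names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy events mimicking the column layout described above;
# all hosts, names and values are made up for illustration.
events = pd.DataFrame({
    'clock': [1700000000, 1700000005, 1700000060, 1700000061, 1700000120],
    'ns': [0, 500, 0, 250, 0],
    'host': ['host-a', 'host-a', 'host-b', 'host-a', 'host-b'],
    'name': ['cpu load', 'free memory', 'cpu load', 'cpu load', 'cpu load'],
    'value': [0.5, 61.0, 1.2, 0.7, 1.1],
})

# Number of events per host and per measurement point, largest first
events_per_host = events['host'].value_counts()
events_per_name = events['name'].value_counts()

print(events_per_host)
print(events_per_name)
```

Each row is one event, so `value_counts()` on the host or name column counts events rather than distinct timestamps.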

As the dataframe for the str metric type contains very few entries, and its values are non-numeric, it will be disregarded in the remainder of this example.

Run the segment below to merge the float and uint dataframes.

In [None]:
df_raw = pd.concat([fetched_dataframes['float'], fetched_dataframes['uint']])
df_raw.info(memory_usage='deep')

Our raw dataframe contains a list of events from different hosts and measurement points, a format that is not well suited to data analysis with standard Python tools. Instead, we will transform it into a more manageable form, with the measurement points as columns and each row holding the values of those measurement points in a timespan.

This will decrease the granularity of the data, as for some timespans and measurement points there could be multiple events that will have to be aggregated into a single value. Also, for some measurement points there might not be any events in the timespan, which will introduce NaN values that we will have to deal with.

For simplicity, we will only consider a single host.

In [None]:
host = 'eselda06u13'

# Extract one host to look at (copy so later assignments do not act on a view of df_raw)
df = df_raw.loc[df_raw['host'] == host, :].copy()

# Sort by time
df = df.sort_values(by=['clock', 'ns'], axis=0)

# Transform clock from timestamp to datetime, and bin it to minutes
df['clock'] = pd.to_datetime(df['clock'], unit='s').dt.floor('min')

# Extract the columns corresponding to measurement points, timestamps and values
df = df.loc[:,['name', 'clock', 'value']]

# Pivot the dataframe to put the measurement points as columns, timestamps as row indices
# and the rows the values for each measurement point for a timestamp.
# Multiple events in the same minute are aggregated by their mean (the pivot_table default).
df = pd.pivot_table(df, index='clock', columns='name', values='value', aggfunc='mean')

# Keep only the columns where fewer than half of the rows are NaN
keep = df.isna().sum() < df.shape[0]/2
df = df.loc[:, df.columns[keep]]
print(f'Remaining NaNs: {df.isna().sum().sum()}')

display(df)
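
The long-to-wide transformation above can be illustrated on a few made-up events. The sketch below shows how several events in the same minute are averaged into one value and how a missing measurement becomes NaN. The final interpolation step is only one possible way of handling the gaps and is an assumption, not something this notebook prescribes.

```python
import pandas as pd

# Hypothetical toy events for a single host; names and values are made up.
toy = pd.DataFrame({
    'clock': pd.to_datetime([
        '2023-01-01 00:00:10', '2023-01-01 00:00:40',  # two 'load' events in minute 0
        '2023-01-01 00:00:20',                         # one 'mem' event in minute 0
        '2023-01-01 00:01:15',                         # only 'load' in minute 1
        '2023-01-01 00:02:05', '2023-01-01 00:02:30',  # both in minute 2
    ]),
    'name': ['load', 'load', 'mem', 'load', 'load', 'mem'],
    'value': [1.0, 3.0, 50.0, 2.0, 4.0, 70.0],
})

# Bin timestamps to whole minutes, then pivot from long to wide format.
# Events sharing a minute and a name are aggregated by their mean.
toy['clock'] = toy['clock'].dt.floor('min')
wide = pd.pivot_table(toy, index='clock', columns='name', values='value', aggfunc='mean')
print(wide)

# One possible way to fill the gaps (an assumption): interpolate linearly
# in time, then fill any leading or trailing gaps with the nearest value.
filled = wide.interpolate(method='time').ffill().bfill()
print(filled)
```

Whether interpolation, forward filling or dropping rows is appropriate depends on the measurement point, so treat the choice here as a starting point.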

Let's view some of the columns to get a feel for the data characteristics.

Feel free to choose another timespan and measurement points.

In [None]:
t_start = 0 # Should be int in [0, 23] 
t_end = 6 # Should be int in [1, 24]

mps = [
    'Load average (1m avg)',
    'Available memory in %',
    'sda: Disk write rate'
]

t0 = df.index[0] + pd.Timedelta(hours=t_start)
tf = df.index[0] + pd.Timedelta(hours=t_end) 
for mp in mps:
    plt.figure()
    df.loc[t0:tf, mp].plot()
    plt.title(mp)


The dataset contains a lot of different measurement points for the selected host. Let's perform some dimensionality reduction using [Principal Component Analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) to make it more manageable.

In [None]:
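# Standardize the columns (zero mean, unit variance).
# Note: PCA does not accept NaN values, so any NaNs remaining in df must be filled or dropped first.
scale = StandardScaler()
X = scale.fit_transform(df)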

# Perform PCA
pca = PCA()
pca.fit(X)

plt.figure()
plt.title('Explained variance')
plt.stem(pca.explained_variance_)
plt.xlabel('component')
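
As a complement to reading the stem plot by eye, the number of components can also be picked from the cumulative explained variance ratio. The sketch below does this on synthetic stand-in data. The data, the names `X_demo` and `pca_demo`, and the 95 % threshold are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 10 columns driven by 3 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

X_demo = StandardScaler().fit_transform(X_demo)
pca_demo = PCA().fit(X_demo)

# Cumulative fraction of the total variance explained by the first k components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components that reaches a (hypothetical) 95 % threshold
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(f'{n_keep} components explain {cumulative[n_keep - 1]:.1%} of the variance')
```

On the real data the same calculation can be applied to the fitted `pca` object, with a threshold chosen to suit the analysis.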

The plot above shows how much of the variance in the data each principal component explains, which gives an indication of how many components are actually needed to represent the data. By removing components with low variance we can reduce the number of columns to make the data more manageable, while still retaining a good representation of the original data.

We will keep 15 components; feel free to change this.

The resulting reduced dataset is still high dimensional. To get a feel for how the data is clustered, we use [t-distributed Stochastic Neighbor Embedding (t-SNE)](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) to visualize it in 2 dimensions.

In [None]:
components_to_keep = 15

pca = PCA(n_components=components_to_keep)
X_red = pca.fit_transform(X)

# Use TSNE to plot the datapoints and see if there are any obvious clusters
tsne = TSNE(init='random', random_state=8)
X_tsne = tsne.fit_transform(X_red)

plt.figure()
plt.plot(X_tsne[:, 0], X_tsne[:, 1], 'C0.')
plt.title('t-SNE embedding of datapoints in 2D')

As can be seen, there seem to be some distinct clusters in the data. However, although t-SNE is good for visualizing high dimensional data, it can give an overly optimistic picture of the clustering that is actually present.

The next step is thus to run a clustering algorithm to label the datapoints. We have provided three clustering algorithms from the [scikit-learn clustering](https://scikit-learn.org/stable/modules/clustering.html#clustering) module that you could consider. Follow the link to learn more about these, and about which other algorithms from the module could be used.

In [None]:
#clustering = cluster.DBSCAN(eps=5, min_samples=2).fit(X_red)
#clustering = cluster.KMeans(n_clusters=8, n_init='auto').fit(X_red)
clustering = cluster.AgglomerativeClustering(n_clusters=8).fit(X_red)

# Sort labels so that cluster 0 has the most datapoints, cluster 1 second most etc.
cluster_labels = np.zeros(len(clustering.labels_), dtype=int)
labels, counts = np.unique(clustering.labels_, return_counts=True)
U = sorted(tuple(zip(labels, counts)), key=lambda x: x[1], reverse=True)
for (new_grp, (old_grp, _)) in enumerate(U):
    if old_grp == -1:
        # Don't rename the label on datapoints that are not assigned to any cluster
        cluster_labels[clustering.labels_ == -1] = -1
    else:
        cluster_labels[clustering.labels_ == old_grp] = new_grp

# Plot the TSNE clustering with the discovered cluster labels
clusters_idx = {}
for c in set(cluster_labels):
    clusters_idx[c] = [i for (i, label) in enumerate(cluster_labels) if label == c]
plt.figure()
for c, grp_inds in clusters_idx.items():
    plt.plot(X_tsne[grp_inds, 0], X_tsne[grp_inds, 1], '.', label=f'cluster {c}')
plt.legend(loc='lower right')
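
Since the t-SNE picture alone can be misleading, a numeric check in the space where the clustering was actually run can help. One common option, sketched below, is the silhouette score from `sklearn.metrics`, evaluated for several values of `n_clusters`. The synthetic blobs stand in for `X_red`, and the range of cluster counts is an arbitrary assumption.

```python
import numpy as np
from sklearn import cluster
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_red: three well separated blobs in 5 dimensions
centers = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [10.0, 10.0, 10.0, 10.0, 10.0],
    [-10.0, 10.0, -10.0, 10.0, -10.0],
])
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

# Silhouette score (between -1 and 1, higher is better) per number of clusters
scores = {}
for k in range(2, 7):
    labels = cluster.AgglomerativeClustering(n_clusters=k).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)
    print(f'n_clusters={k}: silhouette={scores[k]:.3f}')

best_k = max(scores, key=scores.get)
print(f'Highest silhouette score at n_clusters={best_k}')
```

If DBSCAN is used instead, the noise points labelled -1 should probably be excluded before scoring, since `silhouette_score` would otherwise treat them as one more cluster.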

It is interesting to see how the discovered clusters differ from each other. One way to do this is to check how the cluster means differ from the dataset mean. In the segment below we rank the measurement points for each cluster by the cluster mean of the normalized data (normalized over the entire dataset, so the dataset average is 0). Hence, the higher the value, the further the cluster lies above the dataset average for that measurement point. Note that the ranking uses the signed value, so measurement points that lie far below the average end up at the bottom of the list. This shows us whether there are any obvious groups of measurement points in the cluster that strongly differ from the rest of the data.

In [None]:
df_norm = (df-df.mean())/df.std()
df_stats = df.describe()

clusters = {}
for c, grp_inds in clusters_idx.items():
    df_cluster = df.iloc[grp_inds]
    diff = pd.DataFrame({
        'Norm. Mean': df_norm.iloc[grp_inds].mean(),
        'mean_all': df_stats.loc['mean'],
        'mean_cluster': df_cluster.mean().abs(),
        'CV': df_cluster.std() / df_cluster.mean()  # coefficient of variation
    })
    # Rank the measurement points by their normalized cluster mean, largest first
    diff.sort_values('Norm. Mean', ascending=False, inplace=True)
    diff.index.rename('Column name', inplace=True)
    clusters[f'cluster-{c}'] = {
        'dataframe': df_cluster,
        'stats': df_cluster.describe(),
        'diff': diff
    }

n_imp_cols = 5
index = pd.MultiIndex.from_product([clusters.keys(), range(1, 1+n_imp_cols)], names=['Group', 'Rank'])

cluster_info = pd.concat([v['diff'].iloc[:n_imp_cols] for v in clusters.values()])
cluster_info.reset_index(inplace=True)
cluster_info.set_index(index, inplace=True)

display(pd.DataFrame({
    'datapoints': [v['dataframe'].shape[0] for v in clusters.values()]
}, index=clusters.keys()))

with pd.option_context('display.max_colwidth', 100):
    display(cluster_info)
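
The idea behind the ranking can be seen on a tiny made-up dataset. The sketch below standardizes each column over the whole dataset, averages the standardized values per cluster, and ranks the measurement points of one cluster. Unlike the segment above, it sorts by absolute value so that deviations in both directions are ranked. The data, the labels and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 6 time steps, 3 measurement points (values made up)
toy = pd.DataFrame({
    'load': [1.0, 1.0, 1.0, 1.0, 9.0, 9.0],
    'mem': [50.0, 52.0, 48.0, 50.0, 51.0, 49.0],
    'disk': [5.0, 5.0, 4.0, 5.0, 1.0, 2.0],
})
toy_labels = pd.Series([0, 0, 0, 0, 1, 1], index=toy.index)

# Standardize each column over the whole dataset (mean 0, std 1)
toy_norm = (toy - toy.mean()) / toy.std()

# Mean of the standardized values within each cluster
cluster_means = toy_norm.groupby(toy_labels).mean()

# Rank the measurement points of cluster 1 by absolute deviation from 0
ranking = cluster_means.loc[1].sort_values(key=abs, ascending=False)
print(ranking)
```

In this toy case 'disk' is far below the dataset average in cluster 1, so a signed descending sort would place it last even though it separates the clusters almost as well as 'load'.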


Let's plot some timeseries again, but now with the cluster labels for the datapoints. Feel free to change the timespan and the measurement points to consider. It is particularly interesting to look at the highly ranked measurement points for the different clusters.

In [None]:
t_start = 0 # Should be int in [0, 23]
t_end = 24 # Should be int in [1, 24]

mps = [
    'Load average (1m avg)',
    'Available memory in %',
    'sda: Disk write rate'
]

t0 = df.index[0] + pd.Timedelta(hours=t_start)
tf = df.index[0] + pd.Timedelta(hours=t_end)
for mp in mps:
    plt.figure()
    plt.title(mp)
    plt.plot(df.loc[t0:tf][mp], 'k--')
    for c, c_data in clusters.items():
        plt.plot(c_data['dataframe'].loc[t0:tf][mp], 'o', label=c)
    plt.legend()