# Data

To fetch the data, we created a separate repository that houses the fetching script.<br>
The repository is available here:<br>
https://github.com/sirandreww/operate_first_prometheus_data.git

That repository describes in more detail which data is fetched and how.
In short, we pull the three kinds of data listed in the next section.

## How Is Data Fetched?

1. Memory-usage data for each container using this Prometheus query `sum(container_memory_working_set_bytes{name!~".*prometheus.*", image!="", container!="POD", cluster="moc/smaug"}) by (container, pod, namespace, node)`.
   
2. CPU-usage data for each container using this Prometheus query `sum(rate(container_cpu_usage_seconds_total{name!~".*prometheus.*", image!="", container!="POD", cluster="moc/smaug"}[5m])) by (container, pod, namespace, node)`.
   
3. Memory-usage percentage data for each node using this Prometheus query `node_memory_Active_bytes/node_memory_MemTotal_bytes*100`.
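
The actual fetching script lives in the repository linked above and is not reproduced here. As a rough sketch only, a query like the ones above could be sent to the standard Prometheus HTTP API endpoint `/api/v1/query_range`. The server address, the timestamps, and the helper name `build_range_query_url` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Hypothetical server address; the real one is not given in this document.
PROMETHEUS_URL = "https://prometheus.example.com"

# The node memory-usage query quoted in item 3 above.
NODE_MEMORY_QUERY = "node_memory_Active_bytes/node_memory_MemTotal_bytes*100"


def build_range_query_url(base_url, query, start, end, step):
    """Build the URL of a Prometheus range query (GET /api/v1/query_range)."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


url = build_range_query_url(
    PROMETHEUS_URL,
    NODE_MEMORY_QUERY,
    start=1640995200,  # arbitrary example Unix timestamp
    end=1640998800,    # one hour after the start
    step="60s",        # one sample per minute
)
print(url)
```

Sending a GET request to this URL against a real Prometheus server would return the matching time series as JSON.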


## How Is The Data Processed After Fetching?

The data is then merged and converted into JSON files in that repository. The code in this project takes those JSON files and imports them into custom datasets. Let's take a look.

In [2]:
import src.framework__data_set as ds

Getting the dataset of container memory data for the container bridge-marker:

In [3]:
dataset = ds.get_data_set(
    metric="container_mem",
    application_name="bridge-marker",
    path_to_data="./data/"
)

Now let's plot some samples in the data to get a visual on what we're looking at.

In [None]:
dataset.plot_dataset(number_of_samples=10)

As you can see, the data for each application is split into many time series. Each one is continuous, with no interruptions in the middle.

## What Applications Will We Consider?

As for the applications we're going to be learning on, let's take a look at the applications with the most data for each metric.

First let's take a look at node memory usage data:

In [None]:
hist = ds.get_amount_of_data_per_application(
    metric="node_mem",
    path_to_data="./data/"
)
print(hist[:10])

We'll look at the following nodes:
1. moc/smaug
2. emea/balrog

Now let's take a look at container memory usage data:

In [None]:
hist = ds.get_amount_of_data_per_application(
    metric="container_mem",
    path_to_data="./data/"
)
print(hist[:10])

We'll look at the following applications:
1. nmstate-handler
2. coredns
3. keepalived

Now let's take a look at container CPU usage data:

In [None]:
hist = ds.get_amount_of_data_per_application(
    metric="container_cpu",
    path_to_data="./data/"
)
print(hist[:10])

We'll look at the following applications:
1. kube-rbac-proxy
2. dns
3. collector

## How Is Data Pre-Processed?

In [None]:
dataset = ds.get_data_set(
    metric="container_mem",
    application_name="kube-rbac-proxy",
    path_to_data="./data/"
)

Plot the data of this data set:

In [None]:
dataset.plot_dataset(number_of_samples=10)

As we can see, there is one sample per minute. We can subsample the data further to make it easier to generalize from. Here we change the dataset so that its samples are 5 minutes apart:

In [None]:
dataset.sub_sample_data(sub_sample_rate=5)
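
How `sub_sample_data` works internally is not shown here. One simple way to subsample, assumed purely for illustration, is to keep every fifth sample of each series. The helper `sub_sample` below is hypothetical and not part of the project's code.

```python
def sub_sample(series, rate):
    """Keep every `rate`-th sample of a series, starting from the first one."""
    return series[::rate]


# Stand-in for 12 samples taken one minute apart.
one_minute_series = list(range(12))
five_minute_series = sub_sample(one_minute_series, rate=5)
print(five_minute_series)  # [0, 5, 10]
```

Another common choice would be to average each group of five samples instead of discarding four of them.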

Dropping series that are shorter than 10 samples (less than 5 * 10 minutes long):

In [None]:
print("Data set size before:", len(dataset))
dataset.filter_data_that_is_too_short(data_length_limit=10)
print("Data set size after:", len(dataset))
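
As a minimal sketch of the idea, assuming the dataset behaves like a list of series, the filter amounts to keeping only series with at least the required number of samples. The helper `drop_short_series` is hypothetical.

```python
def drop_short_series(all_series, length_limit):
    """Keep only the series that contain at least `length_limit` samples."""
    return [series for series in all_series if len(series) >= length_limit]


# Three made-up series with 3, 10, and 12 samples.
example_series = [[0.1] * 3, [0.2] * 10, [0.3] * 12]
kept_series = drop_short_series(example_series, length_limit=10)
print(len(example_series), "->", len(kept_series))  # 3 -> 2
```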

Let's plot again to see how the samples look now.

In [None]:
dataset.plot_dataset(number_of_samples=10)

The data is highly variable and is not scaled, so let's scale it.

In [None]:
dataset.scale_data()
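
Which scaling method `scale_data` applies is not stated here. As an illustration of one common option, and only as an assumption, min-max scaling maps the values of a series into the range [0, 1]. The helper `min_max_scale` is hypothetical.

```python
def min_max_scale(series):
    """Linearly rescale a series so its minimum maps to 0 and its maximum to 1."""
    lowest, highest = min(series), max(series)
    span = highest - lowest
    if span == 0:
        # A constant series has no spread, so map every value to 0.
        return [0.0 for _ in series]
    return [(value - lowest) / span for value in series]


scaled = min_max_scale([200.0, 250.0, 300.0, 400.0])
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```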

Let's plot again to see how the samples look now.

In [None]:
dataset.plot_dataset(number_of_samples=10)

Now we can split the data into train and test sets.

In [None]:
train, test = dataset.split_to_train_and_test(test_percentage=0.2)
print(f"Amount of train data is {len(train)}")
print(f"Amount of test data is {len(test)}")
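
How `split_to_train_and_test` chooses the test portion is not shown here. A minimal sketch, assuming the split is made over whole series and that the last 20% become the test set, could look like this. The helper `split_train_test` is hypothetical.

```python
def split_train_test(all_series, test_percentage):
    """Split a list of series, using the last `test_percentage` of them as test data."""
    test_size = int(len(all_series) * test_percentage)
    split_index = len(all_series) - test_size
    return all_series[:split_index], all_series[split_index:]


# Ten made-up series of three samples each.
series_list = [[i, i + 1, i + 2] for i in range(10)]
train_part, test_part = split_train_test(series_list, test_percentage=0.2)
print(len(train_part), len(test_part))  # 8 2
```

Splitting over whole series, rather than over individual samples, keeps every time series intact in exactly one of the two sets.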

This is the preprocessing applied to the data each time we use it for training.