## Project Description:
This notebook examines Madiba client data and builds machine learning models for time series prediction of a variety of metrics. The project aims to demonstrate the power and flexibility of open source tools (specifically TensorFlow) in comparison with the predictive machinery provided by SAP.

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling

import pandas as pd


import pathlib
import datetime
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots


import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import os
import seaborn as sns
mpl.rcParams['figure.figsize'] = (8, 6)
mpl.rcParams['axes.grid'] = False

In [2]:
from Madiba_ml_fucntions import *

In [3]:
df = configure_csv('/Users/jeremywayland/Desktop/Madiba/AI.csv','/Users/jeremywayland/Desktop/Madiba/S_AI.csv',['CPU_UTILIZATION_5MIN'])

In [73]:
Feb = df[df['0CALDAY'] > 20200131]
systems = Feb['0SMD_LUID'].unique()
fig = make_subplots(rows=len(systems), cols=1)
for index,system in enumerate(systems):
    splice,metrics = choose_LUIDs(Feb,[system])
    fig.add_trace(go.Scatter(x=splice['TIMESTAMP'], y=splice['0SMD_MAX'],mode='lines+markers',name=system),row=index+1,col=1)

The following dataframe tracks ['HDB_0020986084'] and the following metrics are recorded: ['CPU_UTILIZATION_5MIN']
The following dataframe tracks ['_'] and the following metrics are recorded: ['CPU_UTILIZATION_5MIN']
The following dataframe tracks ['HDB00001_'] and the following metrics are recorded: ['CPU_UTILIZATION_5MIN']
The following dataframe tracks ['SL1_0020986084'] and the following metrics are recorded: ['CPU_UTILIZATION_5MIN']
The following dataframe tracks ['ECC_0020986084'] and the following metrics are recorded: ['CPU_UTILIZATION_5MIN']
The following dataframe tracks ['WLY_'] and the following metrics are recorded: ['CPU_UTILIZATION_5MIN']
The following dataframe tracks ['SLM_0020986084'] and the following metrics are recorded: ['CPU_UTILIZATION_5MIN']
The following dataframe tracks ['HDB_'] and the following metrics are recorded: ['CPU_UTILIZATION_5MIN']


In [74]:
fig.update_layout(height=1600, width=800, title_text="CPU Utilization 5mins")
fig.show()

In [19]:
example = systems[0]
example,metrics = choose_LUIDs(df,[example])
example.shape

The following dataframe tracks ['WLY_'] and the following metrics are recorded: ['CPU_UTILIZATION_5MIN']


(938, 6)

### Model 1
First, we try to predict the CPU utilization (5 min) of a single system (WLY_) using only its own metrics.

In [48]:
# Work on a copy so the assignment below does not write into a slice of the original DataFrame
normalized = example.copy()
# tf.keras.utils.normalize applies L2 normalization along the last axis by default
normalized['0SMD_MAX'] = np.transpose(tf.keras.utils.normalize([float(x) for x in example['0SMD_MAX'].values]))
normalized.index = normalized['TIMESTAMP']
normalized = normalized[['0SMD_MAX']]
normalized

TIMESTAMP,0SMD_MAX
2019-08-01 00:00:00,0.052993
2019-08-01 06:00:00,0.038089
2019-08-01 07:00:00,0.046369
2019-08-01 08:00:00,0.052993
2019-08-01 09:00:00,0.018216
...,...
2020-02-13 18:00:00,0.049681
2020-02-13 19:00:00,0.006624
2020-02-13 20:00:00,0.011592
2020-02-13 21:00:00,0.004968
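
Note that `tf.keras.utils.normalize` performs L2 normalization by default: it divides the series by its Euclidean norm rather than rescaling it to [0, 1] or to zero mean and unit variance, which is consistent with the uniformly small values above. A minimal NumPy sketch of the same operation, using made-up sample values (`l2_normalize` is a hypothetical helper, not part of the notebook):

```python
import numpy as np

def l2_normalize(values):
    """Divide a 1-D series by its Euclidean (L2) norm."""
    values = np.asarray(values, dtype=float)
    norm = np.linalg.norm(values)
    # Guard against an all-zero series to avoid division by zero
    return values / norm if norm > 0 else values

sample = [40.0, 10.0, 20.0, 40.0]  # illustrative readings, not taken from the data
scaled = l2_normalize(sample)
print(np.linalg.norm(scaled))  # the scaled series has unit L2 norm (up to rounding)
```

Because the divisor depends on every point in the series, this scaling uses information from the validation rows as well as the training rows.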


In [49]:
TRAIN_SPLIT = 850
tf.random.set_seed(13)

In [53]:
uni_data = normalized

In [54]:
train_mean = normalized[:TRAIN_SPLIT].mean()
train_std = normalized[:TRAIN_SPLIT].std()
(train_mean,train_std)

(0SMD_MAX    0.023062
 dtype: float64,
 0SMD_MAX    0.0234
 dtype: float64)
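
The training-set mean and standard deviation are typically used to standardize the whole series, so that no information from the validation rows leaks into the scaling. The cells shown here do not apply this step. A minimal sketch of what it could look like, using NumPy and made-up values (`standardize` is a hypothetical helper):

```python
import numpy as np

def standardize(values, train_split):
    """Scale a series using statistics from its first `train_split` points only."""
    values = np.asarray(values, dtype=float)
    train = values[:train_split]
    mean = train.mean()
    std = train.std(ddof=1)  # ddof=1 matches the pandas default for .std()
    return (values - mean) / std

series = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])  # illustrative values
scaled = standardize(series, train_split=4)
print(scaled[:4].mean())  # training portion is centred on zero (up to rounding)
```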

In this step we create the data for the univariate model: we choose how many past points the model uses as input and how far ahead of that window the predicted value lies. In the TensorFlow tutorial, which predicts temperature, the observations are evenly spaced in time. Our timestamps are not (the table above jumps from 00:00 to 06:00), so we will most likely need to fix a unit of time on which to make predictions.
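
One way to fix such a unit of time is to resample the series onto a regular grid before windowing. The sketch below assumes an hourly grid and time-based interpolation of the gaps. Both are illustrative choices rather than the approach this notebook settles on, and `to_regular_grid` is a hypothetical helper.

```python
import pandas as pd

def to_regular_grid(series, step=pd.Timedelta(hours=1)):
    """Resample a time-indexed series onto a fixed grid and fill the gaps."""
    # Average any readings that fall into the same bin; empty bins become NaN
    regular = series.resample(step).mean()
    # Fill empty bins by linear interpolation weighted by elapsed time
    return regular.interpolate(method="time")

# Values copied from the first rows of the normalized table above
idx = pd.to_datetime([
    "2019-08-01 00:00", "2019-08-01 06:00",
    "2019-08-01 07:00", "2019-08-01 08:00",
])
irregular = pd.Series([0.052993, 0.038089, 0.046369, 0.052993], index=idx)
regular = to_regular_grid(irregular)
print(len(regular))  # one value per hour from 00:00 to 08:00 inclusive
```

Interpolating across a six-hour gap invents values the system never reported, so dropping sparse periods or predicting on a coarser grid may be preferable depending on the use case.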

In [57]:
# Function taken from https://www.tensorflow.org/tutorials/structured_data/time_series

def univariate_data(dataset, start_index, end_index, history_size, target_size):
    data = []
    labels = []

    start_index = start_index + history_size
    if end_index is None:
        end_index = len(dataset) - target_size

    for i in range(start_index, end_index):
        indices = range(i-history_size, i)
        # Reshape data from (history_size,) to (history_size, 1)
        data.append(np.reshape(dataset[indices], (history_size, 1)))
        labels.append(dataset[i+target_size])
    return np.array(data), np.array(labels)

# For each window of history_size consecutive points (indices i-history_size to i-1),
# the function records the value at index i+target_size as the label to predict.

In [72]:
uni_data.values

array([[0.05299328],
       [0.03808892],
       [0.04636912],
       [0.05299328],
       [0.01821644],
       [0.00993624],
       [0.01324832],
       [0.01821644],
       [0.0082802 ],
       [0.01324832],
       [0.01490436],
       [0.01490436],
       [0.02815268],
       [0.00496812],
       [0.00331208],
       [0.00993624],
       [0.00165604],
       [0.00165604],
       [0.01324832],
       [0.0082802 ],
       [0.0082802 ],
       [0.00662416],
       [0.01324832],
       [0.00331208],
       [0.00165604],
       [0.00165604],
       [0.00496812],
       [0.00496812],
       [0.00496812],
       [0.01324832],
       [0.04636912],
       [0.04636912],
       [0.03808892],
       [0.04636912],
       [0.00993624],
       [0.00662416],
       [0.00993624],
       [0.01324832],
       [0.00993624],
       [0.0082802 ],
       [0.01159228],
       [0.01490436],
       [0.00993624],
       [0.02815268],
       [0.00496812],
       [0.00331208],
       [0.03808892],
       [0.049

In [70]:
univariate_past_history = 20
univariate_future_target = 1

x_train_uni, y_train_uni = univariate_data(uni_data.values, 0, TRAIN_SPLIT,
                                           univariate_past_history,
                                           univariate_future_target)
x_val_uni, y_val_uni = univariate_data(uni_data.values, TRAIN_SPLIT, None,
                                       univariate_past_history,
                                       univariate_future_target)
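
As a sanity check, the number of windows each call produces follows directly from the loop bounds in `univariate_data`. A short sketch, assuming the series has the 938 rows reported by `example.shape` earlier (`window_count` is a hypothetical helper):

```python
def window_count(n_rows, start_index, end_index, history_size, target_size):
    """Number of (window, label) pairs univariate_data yields for these arguments."""
    start = start_index + history_size
    end = n_rows - target_size if end_index is None else end_index
    return max(0, end - start)

n_train = window_count(938, 0, 850, history_size=20, target_size=1)
n_val = window_count(938, 850, None, history_size=20, target_size=1)
print(n_train, n_val)  # 830 67
```

Under that assumption, `x_train_uni` would have shape (830, 20, 1) and `x_val_uni` shape (67, 20, 1).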

In [71]:
print('Single window of past history')
print(x_train_uni[0])
print('\n Target CPU Utilization to predict')
print(y_train_uni[0])

Single window of past history
[[0.05299328]
 [0.03808892]
 [0.04636912]
 [0.05299328]
 [0.01821644]
 [0.00993624]
 [0.01324832]
 [0.01821644]
 [0.0082802 ]
 [0.01324832]
 [0.01490436]
 [0.01490436]
 [0.02815268]
 [0.00496812]
 [0.00331208]
 [0.00993624]
 [0.00165604]
 [0.00165604]
 [0.01324832]
 [0.0082802 ]]

 Target CPU Utilization to predict
[0.00662416]
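
With `univariate_future_target = 1`, the label is not the point immediately after the window: the function reads `dataset[i+target_size]`, which is one step further on. In the output above, the window covers the first 20 observations and the target 0.00662416 is the 22nd. The toy run below reproduces this on a simple ramp, using a compact copy of the tutorial function so the snippet is self-contained.

```python
import numpy as np

def univariate_data(dataset, start_index, end_index, history_size, target_size):
    """Compact copy of the tutorial function, for illustration only."""
    data, labels = [], []
    start_index = start_index + history_size
    if end_index is None:
        end_index = len(dataset) - target_size
    for i in range(start_index, end_index):
        data.append(np.reshape(dataset[i - history_size:i], (history_size, 1)))
        labels.append(dataset[i + target_size])
    return np.array(data), np.array(labels)

toy = np.arange(10.0)  # 0.0, 1.0, ..., 9.0

x1, y1 = univariate_data(toy, 0, None, history_size=3, target_size=1)
print(x1[0].ravel(), y1[0])  # [0. 1. 2.] 4.0  (the value 3.0 is skipped)

x0, y0 = univariate_data(toy, 0, None, history_size=3, target_size=0)
print(x0[0].ravel(), y0[0])  # [0. 1. 2.] 3.0  (the very next point)
```

If the intent is to predict the reading that directly follows the history window, `target_size=0` is the setting that does so.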
