# Intelligent Management

## Table of content

1.[Introduction](#Introduction)  
2.[Generator](#Generator)  
    2.1.[Metric baseline](#Metric-baseline)  
    2.2.[Location (Deployment-site)](#Location-\(Deployment-site\))  
    2.3.[Company](#Company)  
    2.4.[Deployment](#Deployment)  
    2.5.[Deployment Configuration](#Deployment-Configuration)  
    2.6.[Deployment Example](#Deployment-Example)  
3.[Model Training](#Model-Training)  
    3.1.[Get The Data](#Get-The-Data)  
    3.2.[Transform raw data to Feature Vectors](#Transform-raw-data-to-Feature-Vectors)  
    3.3.[Prepare traning and test datasets](#Prepare-traning-and-test-datasets)  
    3.4.[Train Model](#Train-Model)  
    3.5.[Save Model](#Save-Model)  
4.[Predict](#Predict)

The goal of this demo is to showcase a machine learning based usecase around Fault Prediction for Network Infrastructure.

This use case follows the next steps:
* Data generation for Hardware Performance metrics
* Netcool Alarms sent through SMOD when appropriate
* Join with generated EOS data

* Fault Identification from data Using XGBoost ( Machine Learning )

In [121]:
!pip install pandas
!pip install xgboost
!pip install bokeh
!pip install sklearn
!pip install bkzep

import scipy as sp
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt; plt.rcdefaults()
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.layouts import column, row, gridplot
from bokeh.models import ColumnDataSource
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
import os
import re
import itertools
import random
import bkzep
import pprint
import xgboost as xgb
from xgboost import plot_importance
output_notebook()
pp = pprint.PrettyPrinter(indent=4)

[33mYou are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m
[33mYou are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m
[33mYou are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m
[33mYou are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m
[33mYou are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


## Introduction

The goal of this demo is to showcase a machine learning based usecase around Fault Prediction for Network Infrastructure.

This use case follows the next steps:

Data generation for Hardware Performance metrics
Netcool Alarms sent through SMOD when appropriate
Join with generated EOS data

Fault Identification from data Using XGBoost ( Machine Learning )

## Generator  
### Metric baseline
[top](#Table-of-content)

In [6]:
def Normal(mu=0, sigma=0.5, noise=0):
    '''
    The function used to create the actual device performance data

    :param mu: Mean of the normal distribution to draw the metric from
    :param sigma: Deviation from the mean of the distribution
    :param noise: Sigma for the noise, (Drawn from distribution of Normal(0, noise)
    :return: Metric
    '''
    # while True:
    added_noise = 0 
    if noise is not 0:
        added_noise = np.random.normal(loc=0, scale=noise, size=1)
    tick = np.random.normal(loc=mu, scale=sigma, size=1) + added_noise
    return tick

In [7]:
print(f'A sample output from Normal(noise=1): {Normal(noise=1)}')


# Showcase a sample of performance generation with Normal base function
sample_component_performance = list(map(lambda x: Normal(mu=20, sigma=0, noise=3), np.arange(0, 100, 0.2)))
p = figure(title="Sample Performance Generation",
            x_axis_label="ticks",
            y_axis_label="Metric",
            width=1200)
p.line(x=range(len(sample_component_performance)), 
    y=sample_component_performance, 
    legend="Utilization (%)")
show(p)

A sample output from Normal(noise=1): [-0.72290584]


In [11]:
class Metric:

    def __init__(self,
                 name: str,
                 threshold_alerts_dict: dict,
                 mu: float,
                 sigma: float,
                 noise: float,
                 max: float,
                 min: float):
        '''

        :param name: Metric name
        :param threshold_alerts_dict:
            Dictionary of: (Threshold -> Alert)

            Note: StatReplace is the Stub to replace for the current generated metric to the log line
            Example:
                - CPU Utilization metric
                    - Threshold: 80%
                    - Alert: "Operation - Chassis CPU Utilization (StatReplace) exceed Critical threshold (80.0)"
        :param mu: Mean of the metric distribution
        :param sigma: Deviation of the metric distribution
        :param noise: Noise added to the metric distribution (As Normal(mu=0, sigma=noise))
        :return:
        '''

        self.name = name
        self.mu = mu
        self.sigma = sigma
        self.noise = noise
        self.max = max
        self.min = min

        self.alert_threshold = threshold_alerts_dict.get('threshold') if len(threshold_alerts_dict) > 0 else -1
        self.threshold_alerts_dict = threshold_alerts_dict
        self.is_threshold_below = threshold_alerts_dict.get('type') if 'type' in threshold_alerts_dict.keys() else True

        self.is_error = False
        self.is_in_peak_error = False
        self.steps = 0
        self.error_length = 80

        self.peaks = 0
        self.error_peak_length = 0
        self.peak_chance = 0
        self.r = Random()
        self.r.seed(42)

        self.generator_metric = Normal
        self.error_metric = self.Peak_error()
        self.current_metric = self.generator_metric

    def Peak_error(self, target_peaks: int = 4, error_peak_ratio: float = 0.5):
        def return_peak():
            return self.max if self.is_threshold_below else self.min

        while True:
            # Are we in Pre-Error?
            if self.steps <= self.error_length - self.error_peak_length:
                # Will it peak?
                is_peak = True if self.r.uniform(0, 1) <= self.peak_chance else False
                yield return_peak() if is_peak else self.generator_metric(mu=self.mu,
                                                                          sigma=self.sigma,
                                                                          noise=self.noise)[0]

            # Are we in Peak-Error?
            else:
                self.is_in_peak_error = True
                yield return_peak()

    def generator(self):
        '''
            Produces the metric from normal distribution as defined by the user
        :return: One metric sample
        '''
        if self.is_error:
            self.steps += 1
            return next(self.error_metric)
        else:
            return self.generator_metric(mu=self.mu,
                                         sigma=self.sigma,
                                         noise=self.noise)[0]

    def get_alert(self, metric):
        '''
            Checks weather an alert should be made
        :param metric: Current sample
        :return: A Metric Alert if needed
        '''
        return self.threshold_alerts_dict.get('alert').replace('StatReplace', str(
            metric)) if ((self.is_threshold_below and metric >= self.alert_threshold) or
                         (not self.is_threshold_below and metric <= self.alert_threshold)) and \
                        self.alert_threshold is not -1 else ''

    def validate_value(self, metric):
        '''
            Validates the metric values are within valid range as defined by min / max
        '''
        # Need to switch to by parameters
        metric = metric if metric > self.min else self.min
        metric = metric if metric < self.max else self.max
        return metric

    def start_error(self, error_length: int, target_peaks: int = 4, error_peak_ratio: float = 0.5):
        r = Random()
        r.seed(42)

        # Pick one error scenario
        self.error_length = error_length
        self.is_error = True
        self.error_metric = self.Peak_error()
        self.peaks = int(r.gauss(mu=target_peaks, sigma=0.5 * target_peaks))
        self.error_peak_length = int(
            r.gauss(mu=self.error_length * error_peak_ratio, sigma=self.error_length * 0.1))
        self.peak_chance = self.peaks / (self.error_length - self.error_peak_length)
        return 0

    def stop_error(self):
        # Return generator to Normal
        self.current_metric = self.generator_metric
        self.is_error = False
        self.is_in_peak_error = False
        self.steps = 0
        self.error_length = 0
        return 0

    def get_metric(self):
        while True:
            metric = self.validate_value(self.generator())
            yield {
                'value': metric,
                'alert': self.get_alert(metric=metric),
                'is_error': 1 if self.is_in_peak_error else 0
            }


In [12]:
class Device:

    def __init__(self, metrics: dict, error_scenarios: [], error_rate: float):
        '''
            Component Manager:
            Receives configuration dictionary and -
                - Creates metrics
                - Runs scenarios
        :param metrics: Configuration dictionary
        '''
        self.metrics = [Metric(name=metric,
                               mu=metrics[metric]['metric']['mu'],
                               sigma=metrics[metric]['metric']['sigma'],
                               noise=metrics[metric]['metric']['noise'],
                               max=metrics[metric]['metric']['max'],
                               min=metrics[metric]['metric']['min'],
                               threshold_alerts_dict=metrics[metric]['alerts']) for metric in metrics.keys()]

        self.error_rate = error_rate
        self.error_scenarios = error_scenarios

        self.is_error = False
        self.steps = 0
        self.error_length = -1
        self.scenario = []

        self.r = Random()
        self.r.seed(42)

    def select_error(self):
        '''
            Chooses randomly an error scenario from
            the given scenarios
        :return: an error scenario dict
        '''
        return self.r.choice(self.error_scenarios)

    def notify_metric_of_error(self):
        [component.start_error(self.error_length-self.steps) for component in self.metrics
         if self.steps == self.scenario[component.name]]

    def notify_metrics_of_normalization(self):
        [component.stop_error() for component in self.metrics]

    def generate(self):
        # Initialize state

        # Main generator loop
        while True:
            # Check if we are in an error state (Prev or New)
            self.is_error = True if (
                    (self.is_error is False) and self.r.uniform(0, 1) <= self.error_rate) else self.is_error

            # If we are in error
            if self.is_error:

                # If this is the first error step
                if self.steps == 0:
                    # Initialize error
                    self.scenario = self.select_error()
                    self.error_length = int(
                        self.r.gauss(mu=self.scenario['length'], sigma=0.1 * self.scenario['length']))

                    # Do we need to notify a metric to start an error state?
                    self.notify_metric_of_error()

                    # Advance steps
                    self.steps += 1

                # If we are already in an error state, do we need to stop?
                elif self.steps == self.error_length:
                    # Change internal state
                    self.is_error = False
                    self.steps = 0
                    # Notify metrics
                    self.notify_metrics_of_normalization()

                # Normal in-error step
                else:
                    self.notify_metric_of_error()
                    self.steps += 1

            # If we are not in an error state
            # else:
            yield {component.name: next(component.get_metric()) for component in self.metrics}

### Location (Deployment site)
[top](#Table-of-content)

In [8]:
from faker.providers import BaseProvider
from random import Random
from ast import literal_eval as make_tuple

class LocationProvider(BaseProvider):
    '''
    Creates locations for company deployments within given GPS Coordinates rectangle
    '''
    def location(self, within: dict = {}):
        '''

        :param within: GPS rectangle Coordinates containing:
                nw: ()
                se: ()
        :return: GPS Coordinate within the given rectangle
        '''
        nw = make_tuple(within['nw'])
        se = make_tuple(within['se'])

        width = abs(nw[1] - se[1])
        height = abs(nw[0] - se[0])

        r = Random()

        location = (se[0] + r.uniform(0, height), se[1] + r.uniform(0, width))

        return location

### Company
[top](#Table-of-content)

In [10]:
class Company:
    '''
    Creates a company with locations
    '''

    def __init__(self, num_devices: int, num_locations: int, within: dict, metrics: dict, error_scenarios: [],
                 error_rate: float):
        # Init
        self.f = Faker('en_US')
        self.f.add_provider(LocationProvider)

        # Set parameters
        self.name = self.f.company()
        self.locations = {i:self.f.location(within) for i in range(num_locations)}
        self.devices = [Device(metrics=metrics,
                               error_scenarios=error_scenarios,
                               error_rate=error_rate) for l in self.locations for d in range(num_devices)]
        self.components = {l: {'location': self.locations[l],
                               'devices': [Device(metrics=metrics,
                                                  error_scenarios=error_scenarios,
                                                  error_rate=error_rate) for d in range(num_devices)]}
                           for l in range(num_locations)}

## Deployment
[top](#Table-of-content)

In [13]:
class Deployment:

    def __init__(self, configuration: dict):
        '''

        :param locations:
        :param companies:
        :param metrics:
        :param error_scenarios:
        :param error_rate:
        '''

        # Init
        deployment_configuration = configuration['deployment']
        self.configuration = configuration

        self.companies = [Company(num_devices=deployment_configuration['num_devices_per_site'],
                             num_locations=deployment_configuration['num_sites_per_company'],
                             within=deployment_configuration['site_locations_bounding_box'],
                             metrics=configuration['metrics'],
                             error_scenarios=configuration['error_scenarios'],
                             error_rate=configuration['error_rate']) for _ in
                     range(deployment_configuration['num_companies'])]

    def generate(self):

        while True:
            tick = {}

            for company in self.companies:
                tick[company.name] = {}
                for l, location in enumerate(company.components.values()):
                    tick[company.name][l] = {
                        'location': location['location'],
                        'devices': {}
                    }
                    for d, device in enumerate(location['devices']):
                        tick[company.name][l]['devices'][d] = next(device.generate())

            yield tick

### Deployment Configuration 
[top](#Table-of-content)

In [16]:
deplyoment_configuration = {
    "metrics": {
        "cpu_utilization": {
            "labels": {
                "ver": 1,
                "unit": "percent",
                "target_type": "gauge"
            },
            "metric": {
                "mu": 75,
                "sigma": 4,
                "noise": 1,
                "max": 100,
                "min": 0
            },
            "alerts": {
                "threshold": 80,
                "alert": "Operation - Chassis CPU Utilization (StatReplace) exceed Critical threshold (80.0)"
            }
        },
        "throughput": {
            "labels": {
                "ver": 1,
                "unit": "mbyte_sec",
                "target_type": "gauge"
            },
            "metric": {
                "mu": 200,
                "sigma": 50,
                "noise": 50,
                "max": 300,
                "min": 0
            },
            "alerts": {
                "threshold": 30,
                "alert": "Low Throughput (StatReplace) below threshold (3.0)",
                "type": False
            }
        },
        "latency": {
            "labels": {
                "ver": 1,
                "unit": "ms",
                "target_type": "gauge"
            },
            "metric": {
                "mu": 3,
                "sigma": 2,
                "noise": 1,
                "max": 20,
                "min": 0
            },
            "alerts": {
                "threshold": 5,
                "alert": "Latency (StatReplace) above threshold (5.0)",
                "type": True
            }
        },
        "packet_loss": {
            "labels": {
                "ver": 1,
                "unit": "percent",
                "target_type": "gauge"
            },
            "metric": {
                "mu": 3,
                "sigma": 2,
                "noise": 1,
                "max": 100,
                "min": 0
            },
            "alerts": {
                "threshold": 5,
                "alert": "Packet Loss (StatReplace) above threshold (5.0)",
                "type": True
            }
        }
    },
    "error_scenarios": [{
        "cpu_utilization": 0,
        "throughput": 30,
        "latency": 50,
        "packet_loss": 20,
        "length": 80
    }],
    "errors": [],
    "error_rate": 0.05,
    "deployment": {
        "num_companies": 3,
        "num_sites_per_company": 2,
        "num_devices_per_site": 2,
        "site_locations_bounding_box": {
            "nw": "(51.520249, -0.071591)",
            "se": "(51.490988, -0.188702)"
        }
    }
}

## Deployment Example
[top](#Table-of-content)

In [38]:
ex_dep = Deployment(deplyoment_configuration)
generator = ex_dep.generate()
num_samples = 2
pp.pprint([(f'Sample {sample}:', next(generator)) for sample in range(num_samples)])

[   (   'Sample 0:',
        {   'Brown Ltd': {   0: {   'devices': {   0: {   'cpu_utilization': {   'alert': 'Operation '
                                                                                          '- '
                                                                                          'Chassis '
                                                                                          'CPU '
                                                                                          'Utilization '
                                                                                          '(81.18646858816678) '
                                                                                          'exceed '
                                                                                          'Critical '
                                                                                          'threshold '
                                                            

## Model Training
### Intelligent management (Netops) model training - Create CSV
[top](#Table-of-content)

In [90]:
FILENAME = 'netops_data.csv'
NUM_SAMPLES = 1000

# Make sure it's a new file
os.remove(f.name) if os.path.exists(f.name) else None

with open(FILENAME, 'w') as f:
    # Add header columns
    f.write('timestamp\tcompany\tlocation\tLat_long\tdevice\tmetric\tvalue\tis_error\talert\n')
    
    # Create and write samples
    for i in range(NUM_SAMPLES):
        generator_sample = next(generator)
        for company, locations in generator_sample.items():
            for location, devices in locations.items():
                current_coordinates = devices['location']
                for device, metrics in devices['devices'].items():
                    for metric_name, data in metrics.items():
                        f.write(f'{i}\t{company}\t{location}\t({current_coordinates})\t{device}\t{metric_name}\t{data["value"]}\t{data["is_error"]}\t{data["alert"]}\n')

### Get The Data
[top](#Table-of-content)

In [91]:
df = pd.read_csv(FILENAME, sep='\t')
df['id'] = df[['company', 'location', 'device']].astype(str).apply('_'.join, 1)
df.head()

Unnamed: 0,timestamp,company,location,Lat_long,device,metric,value,is_error,alert,id
0,0,Brown Ltd,0,"((51.51462563504477, -0.1557846123865522))",0,cpu_utilization,100.0,1,Operation - Chassis CPU Utilization (100) exce...,Brown Ltd_0_0
1,0,Brown Ltd,0,"((51.51462563504477, -0.1557846123865522))",0,throughput,0.0,1,Low Throughput (0) below threshold (3.0),Brown Ltd_0_0
2,0,Brown Ltd,0,"((51.51462563504477, -0.1557846123865522))",0,latency,20.0,1,Latency (20) above threshold (5.0),Brown Ltd_0_0
3,0,Brown Ltd,0,"((51.51462563504477, -0.1557846123865522))",0,packet_loss,100.0,1,Packet Loss (100) above threshold (5.0),Brown Ltd_0_0
4,0,Brown Ltd,0,"((51.51462563504477, -0.1557846123865522))",1,cpu_utilization,100.0,1,Operation - Chassis CPU Utilization (100) exce...,Brown Ltd_0_1


### Transform raw data to Feature Vectors
[top](#Table-of-content)

In [108]:
# Prepare DF to feature extraction form
# - Pivot
# - Fill NAs
# - Drop duplicates
# - Set Index
X = df.pivot_table(index=['timestamp', 'id'], columns='metric', values='value').sort_index().reset_index()
X = X.sort_values(['id', 'timestamp']).fillna(method='ffill')
X = X.fillna(method='bfill')
X = X.join(df.set_index(['timestamp', 'id']), on=['timestamp', 'id'], how='left')[['timestamp', 'id', 'cpu_utilization', 'latency', 'packet_loss', 'throughput', 'is_error']]
X = X.drop_duplicates(subset=['id', 'timestamp', 'cpu_utilization', 'latency', 'packet_loss', 'throughput'])
X = X.set_index(['id'])

# Create Features
X["cpu_1h_mean"] = X.cpu_utilization.rolling(window=12).mean()
X["latency_1h_mean"] = X.latency.rolling(window=12).mean()
X["packet_loss_1h_mean"] = X.packet_loss.rolling(window=12).mean()
X["throughput_1h_mean"] = X.throughput.rolling(window=12).mean()

# Drop first 'Window' samples due to no featuers
# (Dont want to confuse the ML algorithm)
feature_vectors = X.dropna()
feature_vectors.head(n=20)

Unnamed: 0_level_0,timestamp,cpu_utilization,latency,packet_loss,throughput,is_error,cpu_1h_mean,latency_1h_mean,packet_loss_1h_mean,throughput_1h_mean
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Brown Ltd_0_0,11,73.361973,7.571321,5.320947,204.669315,0,80.561849,6.683322,19.887976,159.772701
Brown Ltd_0_0,12,77.749475,0.0,3.569248,209.826407,0,78.707638,5.016655,11.85208,177.258235
Brown Ltd_0_0,13,70.6847,4.388502,4.491987,289.715713,0,76.264696,3.715697,3.893079,201.401211
Brown Ltd_0_0,14,72.812505,0.603203,0.0,232.639139,0,76.413648,3.612961,3.473818,197.309892
Brown Ltd_0_0,15,77.164672,5.273686,2.741906,120.726469,0,76.120238,3.331137,3.177568,188.996403
Brown Ltd_0_0,16,71.860744,6.459257,0.0,196.298832,0,75.930209,3.505764,3.153419,191.665879
Brown Ltd_0_0,17,76.128518,1.254543,4.372429,226.684887,0,75.377307,3.247267,3.048915,196.815058
Brown Ltd_0_0,18,100.0,3.722897,5.236687,57.278658,0,77.135649,3.533766,3.248067,184.865948
Brown Ltd_0_0,19,76.549653,6.781552,2.902568,222.568074,0,77.202119,3.63561,3.34388,185.911795
Brown Ltd_0_0,20,72.998339,4.359591,4.103284,172.167095,0,76.551409,3.905932,3.359096,181.794731


### Prepare traning and test datasets
[top](#Table-of-content)

In [125]:
X = feature_vectors[['cpu_1h_mean', 'latency_1h_mean', 'packet_loss_1h_mean', 'throughput_1h_mean']].reset_index(drop=True)
y = feature_vectors['is_error']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)



### Train Model
[top](#Table-of-content)

In [126]:
model = GradientBoostingClassifier(n_estimators=10)
model.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=10,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)

In [127]:
model.score(X_test, y_test)

0.9813733666944676

### Save Model
[top](#Table-of-content)

In [123]:
MODEL_FILENAME='netops.model'
with open(MODEL_FILENAME, 'wb+') as f:
    pickle.dump(model, f)

## Predict
[top](#Table-of-content)