Sree Ganeshaaya Namaha

# This notebook gives a glimpse of how we can use NLP on network events.
Natural language processing (NLP) is a subfield of artificial intelligence used for processing text and speech.

# We achieve the following here using NLP techniques.
  1) Word2Vec: Convert each word in the event messages into a numerical vector. Vectors are formed based on the relationships between adjacent words.
  <br>2) Group similar events using Latent Semantic Analysis (LSA), which uses Singular Value Decomposition (SVD).
  <br>3) Create a top-event list for each device using word-weight scores. TF-IDF is one way of measuring the weight of a word in a text corpus; the summary cell in this notebook uses normalized word counts. This could also be done with a SQL query, but the query tends to hang as the number of events grows.
 

# Import Libraries

In [32]:
import pandas as pd
import numpy as np
import re
from tqdm import tqdm
from gensim.models import Word2Vec
# Use tqdm to show the progress of the pandas functions we use
tqdm.pandas()


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

# Tokenize function - splits each event message into an array of words.
Replace "." with "_", because "." is treated as a separator when we use gensim Word2Vec.
Also, let us eliminate the following:
 1. numbers
 2. words containing a single character

In [33]:


def tokenize(sentences):
    """
    :param sentences: list of strings
    :returns tok_sentences: list of lists of tokens
    """
    tok_sentences = []
    for sent in sentences:
        sent = sent.lower()
        #sent=re.sub("\d+\.\d+\.\d+\.\d+",ipv4_repl,sent)
        # Replace every "." with "_" (raw string avoids an invalid-escape warning)
        sent = re.sub(r"\.", "_", sent)
        toks = sent.split(" ")
        revised = []
        for tok in toks:
            tok = tok.strip()
            # Drop purely numeric tokens and single-character tokens
            if not tok.isnumeric() and len(tok) > 1:
                revised.append(tok)

        tok_sentences.append(revised)
    return tok_sentences
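
As a quick check of the rules above, the sketch below applies a compact restatement of the same logic to a single made-up event message. The helper name `tokenize_one` and the sample text are illustrative assumptions, not part of the sample file.

```python
import re

def tokenize_one(sentence):
    """Compact restatement of the tokenize() rules for a single string."""
    sentence = re.sub(r"\.", "_", sentence.lower())
    stripped = [tok.strip() for tok in sentence.split(" ")]
    # Keep tokens that are not purely numeric and are longer than one character
    return [tok for tok in stripped if not tok.isnumeric() and len(tok) > 1]

# Hypothetical event message, invented for illustration
sample = "Device1 %BGP-3-NOTIFICATION: sent to 10.1.1.1 4 times x"
sample_tokens = tokenize_one(sample)
print(sample_tokens)
```

The standalone number `4` and the single letter `x` are dropped, while the dotted address survives as one token with underscores.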



# Read the .csv containing event messages.
We use the 'HOST' and 'NORM_MSG' fields.

In [34]:
#df=pd.read_csv("faults/isat/isat_filtered_1.csv")
df=pd.read_csv("input/events.csv")
print(df["HOST"].size, df.columns)



36 Index(['TIMESTAMP', 'HOST', 'NORM_MSG', 'EPOCH'], dtype='object')


# Let us concatenate the host with the message and tokenize each result into an array of words.

In [35]:
#concatenate device with message
msgs=[]
for i,row in df.iterrows():    
    #msgs.append("{} {}".format(row['HOST'],row['EMESSAGE']))
    msgs.append("{} {}".format(row['HOST'],row['NORM_MSG']))
                
tokens=tokenize(msgs)
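
The row-by-row loop above is fine for a small file. For larger inputs, the same concatenation can be sketched with vectorized string operations. The two-row frame below is a hypothetical stand-in for `input/events.csv`, and `demo_df` and `demo_msgs` are illustrative names.

```python
import pandas as pd

# Hypothetical frame with the same column names as the events file
demo_df = pd.DataFrame({
    "HOST": ["device1", "device3"],
    "NORM_MSG": [
        "%l2-bm-6-active interface4 is no longer active",
        "%bgp-4-vpn_nh_if: nexthop may not be reachable",
    ],
})

# Element-wise "HOST NORM_MSG" concatenation, without an explicit Python loop
demo_msgs = (demo_df["HOST"].astype(str) + " " + demo_df["NORM_MSG"].astype(str)).tolist()
print(demo_msgs[0])
```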


# Word2Vec:
# Each word in the event corpus is converted to a numeric vector here

In [36]:
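EMBEDDING_DIM=10
# Note: gensim >= 4.0 renamed the `size` argument to `vector_size`
w2c = Word2Vec(tokens, size=EMBEDDING_DIM, window=5, min_count=1, workers=4)

# Note: gensim >= 4.0 renamed `index2word` to `index_to_key`
tmparr=w2c.wv.index2word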
word2index={}
index2word={}
for i,w in enumerate(tmparr):
    word2index[w]=i
    index2word[i]=w
num_words=len(word2index)

print("num_words=",num_words)

num_words= 95
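
To make "relationship between adjacent words" concrete, here is a minimal sketch of which (target, context) word pairs fall inside a given window. It only illustrates the windowing idea; it is not gensim's training procedure, which adds further details that are omitted here. The function name `context_pairs` and the toy sentence are assumptions for illustration.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Toy tokenized event, invented for illustration
toy = ["device1", "interface4", "is", "down"]
toy_pairs = context_pairs(toy, window=1)
print(toy_pairs)
```

With `window=5`, as used in the cell above, most words in a short event message end up in each other's context.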


# Print the numeric vector generated for a sample word.

In [37]:
#w="bgp"
w="%bgp-3-notification:"
i=word2index[w]
print (w, " is represented in vector form as :", w2c.wv[w])
print("\nClosest words to ", w, " are :", w2c.wv.similar_by_word(w))

%bgp-3-notification:  is represented in vector form as : [-0.03698418  0.04435233  0.0117104   0.01732238  0.04707624  0.04330237
 -0.02627918  0.02136641  0.02805902 -0.03972169]

Closest words to  %bgp-3-notification:  are : [('%l2-bm-6-active', 0.5415895581245422), ('high', 0.43139877915382385), ('4:', 0.4216136932373047), ('reachable', 0.4019451439380646), ('interface11', 0.39195916056632996), ('hmac', 0.31841742992401123), ('ifmgr', 0.3095794916152954), ('(not', 0.29967406392097473), ('from', 0.26347410678863525), ('%xxxx-3-platform:', 0.25406414270401)]


# Latent semantic analysis (LSA) - Group Similar Events using LSA
LSA is a technique in natural language processing for analyzing relationships between a set of documents and the terms they contain, by producing a set of concepts related to those documents and terms.

LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). 

A matrix of word counts per paragraph (rows represent unique words and columns represent paragraphs) is constructed from a large piece of text. A mathematical technique called singular value decomposition (SVD) is then used to reduce the number of rows while preserving the similarity structure among the columns.
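
The row reduction described above can be sketched with NumPy on a toy matrix. The words and counts are invented, and the variable names are illustrative; this uses a plain full SVD rather than scikit-learn's `TruncatedSVD`.

```python
import numpy as np

# Toy term-by-document count matrix: rows are words, columns are documents.
# Words and counts are invented for illustration.
toy_terms = ["bgp", "neighbor", "interface", "bundle"]
A = np.array([
    [2, 1, 0, 0],   # bgp
    [1, 2, 0, 0],   # neighbor
    [0, 0, 2, 1],   # interface
    [0, 0, 1, 2],   # bundle
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent concepts to keep
# Each column is one document expressed in the k-dimensional concept space
doc_concepts = np.diag(s[:k]) @ Vt[:k, :]
# Rank-k approximation of the original matrix
A_k = U[:, :k] @ doc_concepts

print(np.round(s, 3))
print(doc_concepts.shape)
```

The four word rows are reduced to two concept rows, while each document keeps its own column.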

Words are then compared by taking the cosine of the angle between the two vectors (or the dot product between the normalizations of the two vectors) formed by any two rows. Values close to 1 represent very similar words while values close to 0 represent very dissimilar words.[1]
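
The cosine comparison can be written in a few lines. This is a minimal sketch with made-up vectors; `cosine_similarity` here is a local helper, not a library call.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score close to 1; perpendicular vectors score 0
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```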

Let us group events into two broad classes using LSA.

In [38]:
# Latent Semantic Analysis using Python

data=[" ".join(sent_tokens)  for sent_tokens in tokens]

# Creating Tfidf Model
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data)

# Visualizing the Tfidf Model
#print(X[0])

# Creating the SVD
lsa = TruncatedSVD(n_components = 2)#, n_iter = 5 by default.
lsa.fit(X)


# First Column of V
#row1 = lsa.components_[3]


concept_words={}#{"Group 0": {word1: score1, word2: score2, ...}, ...}

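# Note: scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2
terms = vectorizer.get_feature_names()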
for i,comp in enumerate(lsa.components_):
    componentTerms = zip(terms,comp)
    sortedTerms = sorted(componentTerms,key=lambda x:x[1],reverse=True)
    #let us consider topmost 1000 terms only.
    #sortedTerms = sortedTerms[:1000]
    term_and_score={}
    for t in sortedTerms:
        if t[1] < 0:
            break
        term_and_score[t[0]]=t[1]
    
    concept_words["Group "+str(i)] = term_and_score
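
The TF-IDF step in the cell above weights a term by how often it occurs in a document, discounted by how many documents contain it. The sketch below computes the textbook form, tf × log(N/df), on three invented token lists. scikit-learn's `TfidfVectorizer` defaults use a smoothed IDF and L2-normalize each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy tokenized documents, invented for illustration
docs = [
    ["bgp", "neighbor", "down"],
    ["bgp", "neighbor", "up"],
    ["interface", "down"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Textbook TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}

tfidf_weights = [tfidf(doc) for doc in docs]
print(tfidf_weights[2])
```

In the third document, `interface` appears in only one document and so outweighs `down`, which appears in two.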





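Create a data structure containing each sentence and its score for each group: {group: {sentence: score, ...}}. A sentence's score for a group is the sum of that group's weights for the words in the sentence.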

In [39]:


group_and_sent_scores={}#{group: {sentence: score, ...}}
for group,term_and_score in concept_words.items():
    
    sentence_scores={}
    for i,words in enumerate(tokens):
        score = 0
        sent=" ".join(words)
        if sent not in sentence_scores:
            for word in words:
                if word in term_and_score:
                    score += term_and_score[word]
            sentence_scores[sent]=score
    group_and_sent_scores[group]=sentence_scores
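
The scoring rule in the cell above reduces to a sum of per-word weights. Here is a minimal sketch on toy data; the weights in `toy_term_scores` are hypothetical (in the notebook they come from `lsa.components_`), and `score_sentence` is an illustrative helper name.

```python
# Hypothetical term weights for one group
toy_term_scores = {"bgp": 0.6, "neighbor": 0.3, "reachable": 0.1}

def score_sentence(words, term_scores):
    """Sum the group weight of every word; words without a weight contribute 0."""
    return sum(term_scores.get(word, 0.0) for word in words)

# A repeated word is counted each time it occurs, as in the loop above
toy_score = score_sentence(["device3", "bgp", "neighbor", "bgp"], toy_term_scores)
print(toy_score)
```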

Invert the structure so that each sentence maps to its per-group scores, then assign each sentence to its highest-scoring group.

In [40]:
print ("Total groups=",group_and_sent_scores.keys())

 
#create data structure
#sentences_and_group_scores : {sent:{"group1":score1,"group2":score2,..},..}
#ex. -{ rmf_svr %ha-redcon-1-standby_not_ready standby card is not ready
# {'Group 6': 0.47417499636854393, 'Group 9': 0.27952190650844383, 'Group 3': 0.19261210099664636..}..}

sentences_and_group_scores={}
for key,sentence_and_scores in group_and_sent_scores.items():
    grp=key
    for sent,score in sentence_and_scores.items():
        if sent not in sentences_and_group_scores:
            sentences_and_group_scores[sent]={}
        sentences_and_group_scores[sent][grp]=score
    

group_and_sentences_and_scores={}
for sentence,grp_and_scores in sentences_and_group_scores.items():
    #sort by scores
    sorted_scores=sorted(grp_and_scores.items(), key=lambda x: x[1], reverse=True)
    #the top group for this sentence
    top_grp_and_score_for_me=sorted_scores[0]
    grp=top_grp_and_score_for_me[0]
    score=top_grp_and_score_for_me[1]
    if grp not in group_and_sentences_and_scores:
        group_and_sentences_and_scores[grp]=[]
    group_and_sentences_and_scores[grp].append((sentence,score))
    


Total groups= dict_keys(['Group 0', 'Group 1'])
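
The assignment step above picks, for every sentence, the group with the highest score. A minimal sketch of that selection on invented scores (the sentences and numbers in `toy_scores` are hypothetical):

```python
# Hypothetical per-sentence group scores
toy_scores = {
    "device1 interface4 down": {"Group 0": 1.2, "Group 1": 0.4},
    "device3 bgp neighbor down": {"Group 0": 0.3, "Group 1": 0.9},
}

assignments = {}
for sentence, grp_scores in toy_scores.items():
    # The group with the largest score wins this sentence
    best_group = max(grp_scores, key=grp_scores.get)
    assignments.setdefault(best_group, []).append((sentence, grp_scores[best_group]))

print(assignments)
```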


# Print the top five events belonging to each group.
Note: Because the sample file contains very few events, the output below places all of them in a single group.

In [44]:
#sort the sentences by descending score for each group
for grp, sents_and_scores in group_and_sentences_and_scores.items():
    sorted_sents=sorted(sents_and_scores, key=lambda x: x[1], reverse=True)
    group_and_sentences_and_scores[grp]=sorted_sents
    print ("\n", grp, ":")
    for sent,score in sorted_sents[0:5]:
        print (sent, score)
    



 Group 0 :
device3 %bgp-4-vpn_nh_if: nexthop device3 may not be reachable from neigbor device241 not loopback 2.7559453136850975
device1 bm-distrib %l2-bm-6-active interface4 is no longer active as part of bundle1 (not enough links available to meet minimum-active threshold) 1.5474842393805452
device1 bm-distrib %l2-bm-6-active interface7 is no longer active as part of bundle1 (not enough links available to meet minimum-active threshold) 1.5282141346112346
device1 bm-distrib %l2-bm-6-active interface3 is no longer active as part of bundle1 (not enough links available to meet minimum-active threshold) 1.5282141345355682
device1 bm-distrib %l2-bm-6-active interface10 is no longer active as part of bundle1 (not enough links available to meet minimum-active threshold) 1.5282141345311626


# Event Summary for each device using NLP
The event summary for a device is simply its top-scoring events, where each event is scored by summing the normalized frequencies of its words.
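
The idea can be sketched on toy data: weight each word by its count relative to the most frequent word, score each event as the sum of its word weights, and keep the highest-scoring events. The events in `toy_events` are invented, and the variable names are illustrative.

```python
import heapq
from collections import Counter

# Toy tokenized events for a single device, invented for illustration
toy_events = [
    ["device1", "link", "down"],
    ["device1", "link", "flap", "detected"],
    ["device1", "fan", "ok"],
]

# Word weight = count divided by the count of the most frequent word
counts = Counter(word for event in toy_events for word in event)
max_count = max(counts.values())
word_weights = {word: c / max_count for word, c in counts.items()}

# Event score = sum of the weights of its words
event_scores = {" ".join(event): sum(word_weights[w] for w in event) for event in toy_events}
top_two = heapq.nlargest(2, event_scores, key=event_scores.get)
print(top_two)
```

Because the score is a sum, longer events that share common words tend to rank higher.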


In [45]:
# Importing the libraries
import re
import heapq

# Word counts 
word2count = {}
for sent in tokens:
    for word in sent:# nltk.word_tokenize(clean_text):
        if word not in word2count:
            word2count[word]=0
        word2count[word] += 1

# Converting counts to weights
max_count = max(word2count.values())
for key in word2count.keys():
    word2count[key] = word2count[key]/max_count
    
# Compute sentence scores per device: sum the weights of each sentence's words
sent2score = {}
for sentence in tokens:
    for word in sentence:#nltk.word_tokenize(sentence.lower()):
        if word in word2count.keys():
            if len(sentence) < 25:
                device=sentence[0]
                sentence_text=" ".join(sentence)
                if device not in sent2score:
                    sent2score[device]={}
                if sentence_text not in sent2score[device].keys():
                    sent2score[device][sentence_text] = word2count[word]
                else:
                    sent2score[device][sentence_text] += word2count[word]
                    
# Getting the best 10 lines for each device
for device, sents_scores in sent2score.items():
    best_sentences = heapq.nlargest(10, sents_scores, key=sents_scores.get)
    print('---------------------------------------------------------')
    print("\n",device,":")
    for sent in best_sentences:
        print(sent)

print('---------------------------------------------------------')
#for sentence in best_sentences:
#print(sentence)

---------------------------------------------------------

 device1 :
device1 interface1 bfd_agent %l2-bfd-6-adjacency_delete adjacency to neighbor device66 on interface bundle2 was deleted
device1 bm-distrib %l2-bm-6-active interface4 is no longer active as part of bundle1 (not enough links available to meet minimum-active threshold)
device1 bm-distrib %l2-bm-6-active interface11 is no longer active as part of bundle1 (not enough links available to meet minimum-active threshold)
device1 bm-distrib %l2-bm-6-active interface5 is no longer active as part of bundle1 (not enough links available to meet minimum-active threshold)
device1 bm-distrib %l2-bm-6-active interface3 is no longer active as part of bundle1 (not enough links available to meet minimum-active threshold)
device1 bm-distrib %l2-bm-6-active interface10 is no longer active as part of bundle1 (not enough links available to meet minimum-active threshold)
device1 bm-distrib %l2-bm-6-active interface7 is no longer active as part