## Key Idea:
In IT Operations (production), operations teams handle both high-priority and low-priority incidents. Sometimes an incident tagged as low priority causes an impact similar to a high-priority one. The problem is compounded by the high volume of low-priority incidents, which makes such outliers very difficult to identify.

The idea here is to find low-priority incidents that look like high-priority incidents. The operations team can then be alerted to these cases and take the required action, instead of waiting and acting late as permitted by the signed SLA.

In [174]:
from collections import defaultdict
from gensim import corpora
from gensim.parsing.preprocessing import remove_stopwords
import numpy as np
import os
import pandas as pd

#Read the input CSV into a Pandas dataframe
incident_data_raw = pd.read_csv("incident_dataset.csv")


In [175]:
print(incident_data_raw.loc[3,:])
print(incident_data_raw.head())

Incidents    When my Mac boots, it shows an unsupporterd so...
Name: 3, dtype: object
                                           Incidents
0              My Mac fails to boot, what can I do ?
1                    Mac Air got infected by a Virus
2   My Mac is having boot problems, how do I fix it?
3  When my Mac boots, it shows an unsupporterd so...
4    I see a flicker in my monitor. Is that a virus?


In [176]:
incident_data = incident_data_raw["Incidents"]

In [177]:
def process_document(document):

    #Remove stopwords, convert to lower case and remove "?" character
    cleaned_document = remove_stopwords(document.lower()).replace("?","")  
    return cleaned_document.split()


In [178]:
#Create a document vector (list of tokens) for each known P1 incident
doc_vectors_P1=[process_document(document)
            for document in incident_data]


#Print the document and the corresponding document vector to compare
print(incident_data[1])
print(doc_vectors_P1[1])


Mac Air got infected by a Virus
['mac', 'air', 'got', 'infected', 'virus']
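Note that `process_document` only strips the "?" character, so tokens such as `boot,` and `boot` end up as separate dictionary entries. A minimal sketch of a punctuation-aware tokenizer is shown here for comparison. It uses only the standard library; the function name `tokenize_without_punctuation` and the tiny `DEMO_STOPWORDS` set are assumptions made for illustration and are not gensim's stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# gensim's remove_stopwords relies on a much larger built-in list.
DEMO_STOPWORDS = {
    "my", "is", "a", "by", "to", "what", "can", "i",
    "do", "how", "it", "the", "in", "that", "when",
}

def tokenize_without_punctuation(document):
    """Lower-case the text, keep only alphanumeric runs, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    return [token for token in tokens if token not in DEMO_STOPWORDS]

print(tokenize_without_punctuation("My Mac fails to boot, what can I do ?"))
# ['mac', 'fails', 'boot']
```

With this variant, "boot," and "boot" map to the same token, which would give the model fewer, denser features. The outputs in the rest of this notebook were produced with the original `process_document`.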


# LSI Model

In [179]:
# Create the dictionary (maps each distinct token to an integer id)

dictionary = corpora.Dictionary(doc_vectors_P1)
dictionary.token2id

{'boot,': 0,
 'fails': 1,
 'mac': 2,
 'air': 3,
 'got': 4,
 'infected': 5,
 'virus': 6,
 'boot': 7,
 'fix': 8,
 'having': 9,
 'it': 10,
 'problems,': 11,
 'boots,': 12,
 'error': 13,
 'shows': 14,
 'software': 15,
 'unsupporterd': 16,
 'flicker': 17,
 'monitor.': 18,
 'production': 19,
 'serv234': 20,
 'server': 21,
 'affected': 22,
 'multiple': 23,
 'users': 24,
 'working': 25,
 'stopped': 26,
 'booting': 27,
 'laptop': 28}

In [180]:
#Create a corpus
corpus = [dictionary.doc2bow(doc_vector) 
          for doc_vector in doc_vectors_P1]

#Review the corpus generated
print(doc_vectors_P1[1])
print(corpus[1])
#The first value in each tuple is the token's id in the dictionary; the second is how many times that token appears in the document

['mac', 'air', 'got', 'infected', 'virus']
[(2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]
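To make the bag-of-words format concrete, here is a rough pure-Python sketch of what a `doc2bow`-style conversion does. The helper names `build_token_ids` and `to_bag_of_words` are hypothetical, and the ids they assign will not necessarily match the ones gensim's `Dictionary` produces; only the shape of the result (sorted `(token_id, count)` pairs) is the point.

```python
from collections import Counter

def build_token_ids(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bag_of_words(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens not in the mapping are ignored."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# Small made-up example documents (not the incident dataset)
docs = [["mac", "fails", "boot"], ["mac", "virus", "mac"]]
token2id = build_token_ids(docs)

print(token2id)
print(to_bag_of_words(["mac", "virus", "mac", "unknown"], token2id))
# [(0, 2), (3, 1)]
```

Tokens that were never seen when the mapping was built are simply dropped, which is why an unseen incident can translate to a very short vector.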


In [181]:
## Building LSI Model
from gensim import models,similarities

#Create the model, representing each document in a 2-dimensional topic space
lsi = models.LsiModel(corpus, id2word=dictionary,num_topics = 2)


#Create a similarity Index
index = similarities.MatrixSimilarity(lsi[corpus])

#Use a loop variable that does not shadow the imported `similarities` module
for doc_similarities in index:
    print(doc_similarities)
## Similarity matrix: each document scores ~1 when compared to itself

for doc in lsi[corpus]:

    print(doc)

[ 1.          0.9995185   0.99861073  0.99861073  0.994354   -0.00579155
  0.04979836  0.9094494  -0.01700222]
[ 0.9995185   0.99999994  0.99976486  0.99976486  0.99716777 -0.03681697
  0.01878416  0.8961092  -0.04801827]
[ 0.99861073  0.99976486  1.          1.          0.99856406 -0.05847588
 -0.00289869  0.88627523 -0.06966422]
[ 0.99861073  0.99976486  1.          1.          0.99856406 -0.05847588
 -0.00289869  0.88627523 -0.06966422]
[ 0.994354    0.99716777  0.99856406  0.99856406  1.         -0.11187086
 -0.05646493  0.8601909  -0.12300466]
[-0.00579155 -0.03681697 -0.05847588 -0.05847588 -0.11187086  1.
  0.9984541   0.41054055  0.9999371 ]
[ 0.04979836  0.01878416 -0.00289869 -0.00289869 -0.05646493  0.9984541
  0.99999994  0.46058783  0.99776816]
[0.9094494  0.8961092  0.88627523 0.88627523 0.8601909  0.41054055
 0.46058783 0.99999994 0.40029186]
[-0.01700222 -0.04801827 -0.06966422 -0.06966422 -0.12300466  0.9999371
  0.99776816  0.40029186  0.99999994]
[(0, 0.9245531995434
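The scores in the matrix above are cosine similarities between documents in the 2-dimensional LSI space. As a minimal sketch of that measure, the following standard-library function computes it for two dense vectors. The function name and the example vectors are made up for illustration and are not taken from the model output.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical 2-topic vectors
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # exactly 0
print(same_direction, orthogonal)
```

Because the measure depends only on direction, two incidents with different lengths but the same mix of topics score close to 1, while incidents that load on different topics score near 0.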

# Score Unseen P2 incidents and see if they match with a P1

In [182]:
p2_example1 = 'My Mac does not boot, what can I do ?'
p2_example2 = 'Backup Failed on Serv1'
p2_example3 = 'My Monitor does not show in proper resolution when connected to my Mac. How do I fix it?'
p2_example4 = 'Laptop not booting'

In [183]:
#Pre Process the Question 
p2_corpus = dictionary.doc2bow(process_document(p2_example4))
print("Question translated to :", p2_corpus)

#Create an LSI Representation
vec_lsi = lsi[p2_corpus]  

#Find similarity of the question with existing documents
sims = index[vec_lsi]  
print("Similarity scores :",list(enumerate(sims)))

Question translated to : [(27, 1), (28, 1)]
Similarity scores : [(0, -0.03821995), (1, -0.069207594), (2, -0.0908216), (3, -0.0908216), (4, -0.14404042), (5, 0.9994739), (6, 0.99612623), (7, 0.3807517), (8, 0.9997747)]


In [184]:
#sort an array in reverse order and get indexes
matches=np.argsort(sims)[::-1] 
print("Sorted Document index :", matches)

print("\n", "-"*60, "\n")
for i in matches:
    print(sims[i], " -> ", incident_data_raw.iloc[i]["Incidents"])



Sorted Document index : [8 5 6 7 0 1 3 2 4]

 ------------------------------------------------------------ 

0.9997747  ->  Laptop not booting for multiple users
0.9994739  ->  Production server down Serv234
0.99612623  ->  Production not working multiple users affected
0.3807517  ->  Mac stopped working
-0.03821995  ->  My Mac fails to boot, what can I do ?
-0.069207594  ->  Mac Air got infected by a Virus
-0.0908216  ->  When my Mac boots, it shows an unsupporterd software error
-0.0908216  ->  My Mac is having boot problems, how do I fix it?
-0.14404042  ->  I see a flicker in my monitor. Is that a virus?


In [185]:
## Identify the maximum similarity score. If the score is > 0.5, flag the P2 as a potential P1

maxSimilarityScoreP1 = max(sims)
print(maxSimilarityScoreP1)

if maxSimilarityScoreP1 > 0.5:
    print('Beware : This looks like a possible P1 - solve it ASAP')
else:
    print('Just a regular p2')

0.9997747
Beware : This looks like a possible P1 - solve it ASAP
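
The threshold check can be wrapped in a small reusable function so that each incoming P2 is scored the same way. This is only a sketch: the name `flag_possible_p1`, its return shape, and the default threshold of 0.5 (taken from the cell above) are assumptions, and the right threshold would need tuning on real data.

```python
def flag_possible_p1(similarity_scores, threshold=0.5):
    """Return (is_possible_p1, best_index, best_score) for a sequence of scores."""
    scores = list(similarity_scores)
    if not scores:
        return False, None, None
    # Index of the known P1 incident that is most similar to the new incident
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    best_score = scores[best_index]
    return best_score > threshold, best_index, best_score

# Similarity scores printed earlier for 'Laptop not booting'
example_scores = [-0.03821995, -0.069207594, -0.0908216, -0.0908216,
                  -0.14404042, 0.9994739, 0.99612623, 0.3807517, 0.9997747]

print(flag_possible_p1(example_scores))
# (True, 8, 0.9997747)
```

Returning the index of the best match as well as the flag lets the alert include the P1 incident that triggered it.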
