<a href="https://colab.research.google.com/github/rajivsam/ITSM/blob/master/shallow_baseline_ITSM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Shallow Baseline to Learn ITSM Embeddings Based on SLA Violations

Install Required Packages

In [2]:
!pip install pandas
!pip install scikit-learn


Read the data file from GitHub

In [0]:
import pandas as pd
url = 'https://raw.githubusercontent.com/rajivsam/ITSM/master/pp_incident_event_log.csv'
df = pd.read_csv(url)

List the datatypes in the dataset

In [20]:
df.dtypes

number                     object
incident_state             object
active                       bool
reassignment_count          int64
reopen_count                int64
sys_mod_count               int64
made_sla                     bool
caller_id                  object
opened_by                  object
opened_at                  object
sys_created_by             object
sys_created_at             object
sys_updated_by             object
sys_updated_at             object
contact_type               object
location                   object
category                   object
subcategory                object
u_symptom                  object
cmdb_ci                    object
impact                     object
urgency                    object
priority                   object
assignment_group           object
assigned_to                object
knowledge                    bool
u_priority_confirmation      bool
notify                     object
problem_id                 object
rfc           



1.   Isolate the categorical variables.
2.   Remove the timestamp and record ID variables. The ID variable has very high cardinality (one value per record), and the timestamps serve record keeping rather than describing the incident.
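
As a rough illustration of the high-cardinality argument above, the sketch below flags columns whose number of distinct values is close to the number of rows. The helper name `id_like_columns`, the `min_unique_ratio` threshold, and the toy data are assumptions made for this example and are not part of the original notebook.

```python
import pandas as pd


def id_like_columns(frame: pd.DataFrame, min_unique_ratio: float = 0.9) -> list:
    """Return columns whose distinct-value count is close to the row count.

    Hypothetical helper: the 0.9 default is an arbitrary illustrative threshold.
    """
    n_rows = len(frame)
    if n_rows == 0:
        return []
    return [
        col for col in frame.columns
        if frame[col].nunique() / n_rows >= min_unique_ratio
    ]


# Toy data for illustration only: one ID-like column and one ordinary category
toy = pd.DataFrame({
    "number": ["INC001", "INC002", "INC003", "INC004"],
    "priority": ["3 - Moderate", "3 - Moderate", "2 - High", "3 - Moderate"],
})

print(id_like_columns(toy))
```

A column flagged this way would contribute roughly one indicator column per record after one-hot encoding, which is why such variables are removed before encoding.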




In [0]:
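attributes = df.columns.tolist()
# Columns excluded from the categorical set: the target, counters, the record ID,
# timestamps, and bookkeeping fields
remove = ['made_sla', 'sys_mod_count', 'reopen_count', 'reassignment_count', 'number', 'sys_updated_at',
          'opened_at', 'resolved_at', 'sys_created_at', 'caller_id', 'closed_at', 'notify', 'sys_updated_by',
          'sys_created_by']
# Set difference, so the original column order is not preserved
keep = list(set(attributes) - set(remove))
df_cat_vars = df[keep]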

Determine the number of distinct values for each categorical variable

In [22]:
cols = df_cat_vars.columns.tolist()
for c in cols:
    print(f"Num unique vals for category {c} = {df_cat_vars[c].nunique()}")


Num unique vals for category rfc = 175
Num unique vals for category vendor = 3
Num unique vals for category caused_by = 4
Num unique vals for category incident_state = 1
Num unique vals for category priority = 4
Num unique vals for category impact = 3
Num unique vals for category u_symptom = 398
Num unique vals for category closed_code = 18
Num unique vals for category problem_id = 245
Num unique vals for category urgency = 3
Num unique vals for category knowledge = 2
Num unique vals for category category = 53
Num unique vals for category location = 225
Num unique vals for category cmdb_ci = 48
Num unique vals for category assigned_to = 221
Num unique vals for category resolved_by = 217
Num unique vals for category assignment_group = 71
Num unique vals for category active = 1
Num unique vals for category subcategory = 246
Num unique vals for category contact_type = 5
Num unique vals for category opened_by = 208
Num unique vals for category u_priority_confirmation = 2


Recode the unknown-value indicator '?' as 'UNKNOWN'

In [0]:
df_cat_vars = df_cat_vars.replace(to_replace = '?', value = 'UNKNOWN')

One-hot encode the categorical variables


In [0]:
df_recoded = pd.get_dummies(df_cat_vars)
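
To make the effect of the encoding step concrete, here is a minimal sketch of `pd.get_dummies` on a toy frame. The values are made up for illustration; only the column names echo the dataset.

```python
import pandas as pd

# Toy frame for illustration only: one string-valued column and one boolean column
toy = pd.DataFrame({
    "priority": ["1 - Critical", "3 - Moderate", "UNKNOWN"],
    "knowledge": [True, False, True],
})

encoded = pd.get_dummies(toy)

# The string column becomes one indicator column per distinct value,
# named "<column>_<value>"; the boolean column is passed through unchanged.
print(encoded.columns.tolist())
```

Each row has exactly one `priority_*` indicator set, so a variable with many distinct values (for example `u_symptom` with 398) expands into that many columns.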

Examine target variable imbalance

In [27]:
# count the SLA outcomes to check class balance
df['made_sla'].value_counts()

True     15803
False     9115
Name: made_sla, dtype: int64
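
The counts above give a simple reference point: the accuracy a trivial model would reach by always predicting the majority class. The arithmetic below uses only the two counts shown in the output; the variable names are chosen for this sketch.

```python
# Class counts taken from the value_counts() output above
made_sla_true = 15803
made_sla_false = 9115

total = made_sla_true + made_sla_false
# Accuracy of a model that always predicts the more frequent class
majority_share = max(made_sla_true, made_sla_false) / total

print(f"Total incidents: {total}")
print(f"Majority-class accuracy: {majority_share:.3f}")
```

Any learned model should be compared against this figure rather than against 50%.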

Create the dataset for learning

In [0]:
from_old = ['sys_mod_count', 'made_sla']
df_from_old = df[from_old]
dfc = pd.concat([df_recoded, df_from_old], axis = 1)

Create a baseline model: a linear classifier trained with stochastic gradient descent, using log loss and an L1 penalty.

Note: The L1 penalty shrinks the coefficients of low-impact features to zero, which effectively drops those features from the model.

In [29]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
preds = dfc.columns.tolist()
preds.remove('made_sla')

X = dfc[preds]
y = dfc['made_sla']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
clf = SGDClassifier(loss="log", penalty="l1", max_iter=500)
clf.fit(X_train, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=500,
              n_iter_no_change=5, n_jobs=None, penalty='l1', power_t=0.5,
              random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)
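
To illustrate why an L1 penalty can set coefficients exactly to zero, the sketch below applies soft-thresholding, the shrinkage step associated with L1 regularization in proximal-gradient methods. It illustrates the idea only and is not claimed to be the update that `SGDClassifier` performs internally; the weights and the threshold are made-up values.

```python
def soft_threshold(weight: float, threshold: float) -> float:
    """Shrink a weight toward zero by `threshold`; small weights become exactly 0."""
    if weight > threshold:
        return weight - threshold
    if weight < -threshold:
        return weight + threshold
    return 0.0


# Made-up coefficients for illustration only
weights = [1.5, -0.25, 0.125, -1.0]
shrunk = [soft_threshold(w, 0.5) for w in weights]

# Weights with magnitude at or below the threshold are removed entirely,
# while larger ones are kept with reduced magnitude.
print(shrunk)
```

On the fitted model, `(clf.coef_ != 0).sum()` gives the number of features that kept a non-zero coefficient.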

Examine the accuracy on the held-out test set. The score gives an indication of the following:


1.   Dataset quality
2.   Feature quality for the learning task
3.   A baseline level of accuracy for the learning task
4.   Whether the features retained by the model are likely to be relevant to the learning task




In [30]:
from sklearn.metrics import accuracy_score
ypred_test = clf.predict(X_test)
accuracy_score(y_test, ypred_test)

0.8683788121990369
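
Because the two classes are imbalanced, a single accuracy figure can hide weak performance on the minority class. The sketch below computes accuracy and per-class recall by hand on made-up labels; the helper functions and the toy labels are assumptions for illustration and do not reproduce the notebook's results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of examples whose true label is `label` that were predicted correctly."""
    relevant = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
    return sum(t == p for t, p in relevant) / len(relevant)


# Made-up labels for illustration only (True = SLA met, False = SLA missed)
toy_true = [True, True, True, True, False, False]
toy_pred = [True, True, True, True, True, False]

print("accuracy:", accuracy(toy_true, toy_pred))
print("recall for True:", recall(toy_true, toy_pred, True))
print("recall for False:", recall(toy_true, toy_pred, False))
```

For the real predictions, `sklearn.metrics.classification_report(y_test, ypred_test)` reports the same kind of per-class breakdown.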

Extract up to 50 features selected by the model to see which features are relevant to the problem

In [40]:
from sklearn.feature_selection import SelectFromModel
model = SelectFromModel(clf, prefit=True, max_features = 50)
attribs = dfc.columns.tolist()
attribs.remove('made_sla')
feature_idx = model.get_support()
dfc[attribs].columns[feature_idx].tolist()

['vendor_UNKNOWN',
 'caused_by_UNKNOWN',
 'incident_state_Closed',
 'priority_1 - Critical',
 'priority_2 - High',
 'priority_3 - Moderate',
 'priority_4 - Low',
 'impact_1 - High',
 'impact_2 - Medium',
 'impact_3 - Low',
 'u_symptom_Symptom 101',
 'u_symptom_Symptom 207',
 'u_symptom_Symptom 311',
 'u_symptom_Symptom 607',
 'problem_id_UNKNOWN',
 'urgency_1 - High',
 'urgency_2 - Medium',
 'urgency_3 - Low',
 'cmdb_ci_UNKNOWN',
 'assigned_to_Resolver 132',
 'assigned_to_Resolver 136',
 'assigned_to_Resolver 138',
 'assigned_to_Resolver 219',
 'assigned_to_Resolver 224',
 'assigned_to_Resolver 26',
 'assigned_to_Resolver 39',
 'resolved_by_Resolved by 118',
 'resolved_by_Resolved by 122',
 'resolved_by_Resolved by 135',
 'resolved_by_Resolved by 171',
 'resolved_by_Resolved by 181',
 'resolved_by_Resolved by 184',
 'resolved_by_Resolved by 200',
 'resolved_by_Resolved by 208',
 'resolved_by_Resolved by 24',
 'assignment_group_Group 14',
 'assignment_group_Group 20',
 'assignment_group