Problem Statement
The Chief Operating Officer of a Software Service Provider company is looking to evaluate the efficacy of the company's incident/issue resolution process and has shared the details of the incident log data. The COO would like the following analysis performed to improve operations and drive better customer satisfaction.

    1. Understand the distribution of incidents to identify the spread by key attributes. 
    2. Understand the alignment between urgency/priority of incidents against the resolution parameters/statistics. 
    3. Build a predictive model using the data that can estimate the resolution time for incidents raised in the future.
    4. Build a classification model that would bucket the incidents into high-priority/urgency buckets.
    5. Suggest recommendations to reduce resolution time.
    
Response: The candidate is expected to respond through a Jupyter notebook or a .py file, with answers to the qualitative questions embedded in the code as comments.
Data: Incident Log Data; Data Dictionary 
Evaluation Criteria: The candidate will be evaluated on the following parameters:

    1. Python Knowledge and Quality of code (20%)
    2. Data exploration and processing – with a primary focus on points 1 and 2 above (20%)
    3. Modeling (20%)
    4. Model Evaluation (20%)
    5. Presentation of Results (20%)

# Data column details

1. number: incident identifier (24,918 different values);
2. incident state: eight levels controlling the incident management process transitions from opening until closing the case;
3. active: boolean attribute that shows whether the record is active or closed/canceled;
4. reassignment_count: number of times the incident has the group or the support analysts changed;
5. reopen_count: number of times the incident resolution was rejected by the caller;
6. sys_mod_count: number of incident updates until that moment;
7. made_sla: boolean attribute that shows whether the incident exceeded the target SLA;
8. caller_id: identifier of the user affected;
9. opened_by: identifier of the user who reported the incident;
10. opened_at: incident user opening date and time;
11. sys_created_by: identifier of the user who registered the incident;
12. sys_created_at: incident system creation date and time;
13. sys_updated_by: identifier of the user who updated the incident and generated the current log record;
14. sys_updated_at: incident system update date and time;
15. contact_type: categorical attribute that shows by what means the incident was reported;
16. location: identifier of the location of the place affected;
17. category: first-level description of the affected service;
18. subcategory: second-level description of the affected service (related to the first level description, i.e., to category);
19. u_symptom: description of the user perception about service availability;
20. cmdb_ci: (configuration item) identifier used to report the affected item (not mandatory);
21. impact: description of the impact caused by the incident (values: 1–High; 2–Medium; 3–Low);
22. urgency: description of the urgency informed by the user for the incident resolution (values: 1–High; 2–Medium; 3–Low);
23. priority: calculated by the system based on 'impact' and 'urgency';
24. assignment_group: identifier of the support group in charge of the incident;
25. assigned_to: identifier of the user in charge of the incident;
26. knowledge: boolean attribute that shows whether a knowledge base document was used to resolve the incident;
27. u_priority_confirmation: boolean attribute that shows whether the priority field has been double-checked;
28. notify: categorical attribute that shows whether notifications were generated for the incident;
29. problem_id: identifier of the problem associated with the incident;
30. rfc: (request for change) identifier of the change request associated with the incident;
31. vendor: identifier of the vendor in charge of the incident;
32. caused_by: identifier of the RFC responsible for the incident;
33. close_code: identifier of the resolution of the incident;
34. resolved_by: identifier of the user who resolved the incident;
35. resolved_at: incident user resolution date and time (dependent variable);
36. closed_at: incident user close date and time (dependent variable).

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib 

In [2]:
pd.set_option("display.max_columns", 40)

In [3]:
#loading the data set
data = pd.read_csv('incident_response_data/incident_event_log.csv', parse_dates=['opened_at', 'closed_at'])
data.head(50)

Unnamed: 0,number,incident_state,active,reassignment_count,reopen_count,sys_mod_count,made_sla,caller_id,opened_by,opened_at,sys_created_by,sys_created_at,sys_updated_by,sys_updated_at,contact_type,location,category,subcategory,u_symptom,cmdb_ci,impact,urgency,priority,assignment_group,assigned_to,knowledge,u_priority_confirmation,notify,problem_id,rfc,vendor,caused_by,closed_code,resolved_by,resolved_at,closed_at
0,INC0000045,New,True,0,0,0,True,Caller 2403,Opened by 8,2016-02-29 01:16:00,Created by 6,29/2/2016 01:23,Updated by 21,29/2/2016 01:23,Phone,Location 143,Category 55,Subcategory 170,Symptom 72,?,2 - Medium,2 - Medium,3 - Moderate,Group 56,?,True,False,Do Not Notify,?,?,?,?,code 5,Resolved by 149,29/2/2016 11:29,2016-05-03 12:00:00
1,INC0000045,Resolved,True,0,0,2,True,Caller 2403,Opened by 8,2016-02-29 01:16:00,Created by 6,29/2/2016 01:23,Updated by 642,29/2/2016 08:53,Phone,Location 143,Category 55,Subcategory 170,Symptom 72,?,2 - Medium,2 - Medium,3 - Moderate,Group 56,?,True,False,Do Not Notify,?,?,?,?,code 5,Resolved by 149,29/2/2016 11:29,2016-05-03 12:00:00
2,INC0000045,Resolved,True,0,0,3,True,Caller 2403,Opened by 8,2016-02-29 01:16:00,Created by 6,29/2/2016 01:23,Updated by 804,29/2/2016 11:29,Phone,Location 143,Category 55,Subcategory 170,Symptom 72,?,2 - Medium,2 - Medium,3 - Moderate,Group 56,?,True,False,Do Not Notify,?,?,?,?,code 5,Resolved by 149,29/2/2016 11:29,2016-05-03 12:00:00
3,INC0000045,Closed,False,0,0,4,True,Caller 2403,Opened by 8,2016-02-29 01:16:00,Created by 6,29/2/2016 01:23,Updated by 908,5/3/2016 12:00,Phone,Location 143,Category 55,Subcategory 170,Symptom 72,?,2 - Medium,2 - Medium,3 - Moderate,Group 56,?,True,False,Do Not Notify,?,?,?,?,code 5,Resolved by 149,29/2/2016 11:29,2016-05-03 12:00:00
4,INC0000047,New,True,0,0,0,True,Caller 2403,Opened by 397,2016-02-29 04:40:00,Created by 171,29/2/2016 04:57,Updated by 746,29/2/2016 04:57,Phone,Location 165,Category 40,Subcategory 215,Symptom 471,?,2 - Medium,2 - Medium,3 - Moderate,Group 70,Resolver 89,True,False,Do Not Notify,?,?,?,?,code 5,Resolved by 81,1/3/2016 09:52,2016-06-03 10:00:00
5,INC0000047,Active,True,1,0,1,True,Caller 2403,Opened by 397,2016-02-29 04:40:00,Created by 171,29/2/2016 04:57,Updated by 21,29/2/2016 05:30,Phone,Location 165,Category 40,Subcategory 215,Symptom 471,?,2 - Medium,2 - Medium,3 - Moderate,Group 24,Resolver 31,True,False,Do Not Notify,?,?,?,?,code 5,Resolved by 81,1/3/2016 09:52,2016-06-03 10:00:00
6,INC0000047,Active,True,1,0,2,True,Caller 2403,Opened by 397,2016-02-29 04:40:00,Created by 171,29/2/2016 04:57,Updated by 21,29/2/2016 05:33,Phone,Location 165,Category 40,Subcategory 215,Symptom 471,?,2 - Medium,2 - Medium,3 - Moderate,Group 24,Resolver 31,True,False,Do Not Notify,?,?,?,?,code 5,Resolved by 81,1/3/2016 09:52,2016-06-03 10:00:00
7,INC0000047,Active,True,1,0,3,True,Caller 2403,Opened by 397,2016-02-29 04:40:00,Created by 171,29/2/2016 04:57,Updated by 804,29/2/2016 11:31,Phone,Location 165,Category 40,Subcategory 215,Symptom 471,?,2 - Medium,2 - Medium,3 - Moderate,Group 24,Resolver 31,True,False,Do Not Notify,?,?,?,?,code 5,Resolved by 81,1/3/2016 09:52,2016-06-03 10:00:00
8,INC0000047,Active,True,1,0,4,True,Caller 2403,Opened by 397,2016-02-29 04:40:00,Created by 171,29/2/2016 04:57,Updated by 703,29/2/2016 11:32,Phone,Location 165,Category 40,Subcategory 215,Symptom 471,?,2 - Medium,2 - Medium,3 - Moderate,Group 24,Resolver 31,True,False,Do Not Notify,?,?,?,?,code 5,Resolved by 81,1/3/2016 09:52,2016-06-03 10:00:00
9,INC0000047,Active,True,1,0,5,True,Caller 2403,Opened by 397,2016-02-29 04:40:00,Created by 171,29/2/2016 04:57,Updated by 332,1/3/2016 09:14,Phone,Location 165,Category 40,Subcategory 215,Symptom 471,?,2 - Medium,2 - Medium,3 - Moderate,Group 24,Resolver 31,True,False,Do Not Notify,?,?,?,?,code 5,Resolved by 81,1/3/2016 09:52,2016-06-03 10:00:00
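
In the preview above, the `Closed` record of INC0000045 has `sys_updated_at` equal to `5/3/2016 12:00`, while `closed_at` was parsed as `2016-05-03 12:00:00`. This suggests the raw timestamps are day-first and that the default parser swapped day and month wherever both readings are valid. The following is a minimal sketch of explicit day-first parsing, assuming the raw columns use the `d/m/Y H:M` layout; the two-row CSV is a made-up stand-in and its raw values are inferred from the preview, not read from the file.

```python
import io
import pandas as pd

# Stand-in for incident_event_log.csv; raw values are assumed, not taken from the file.
raw = io.StringIO(
    "number,opened_at,closed_at\n"
    "INC0000045,29/2/2016 01:16,5/3/2016 12:00\n"
    "INC0000047,29/2/2016 04:40,6/3/2016 10:00\n"
)
sample = pd.read_csv(raw)

# Parse explicitly as day-first so 5/3/2016 is read as 5 March, not 3 May.
for col in ["opened_at", "closed_at"]:
    sample[col] = pd.to_datetime(sample[col], format="%d/%m/%Y %H:%M")

# Whole days between opening and closing.
sample["resolving_days"] = (sample["closed_at"] - sample["opened_at"]).dt.days
print(sample)
```

Under this reading, the first incident is closed after 5 days rather than 64, so the parsing choice changes the regression target substantially and is worth verifying against the raw file.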


In [4]:
len(data.number.unique()) #this is the total no. of incidents identified

20769

# data exploration for selecting the columns for modelling

In [43]:
data.u_priority_confirmation.value_counts()

False    87564
True     32434
Name: u_priority_confirmation, dtype: int64

In [44]:
data.incident_state.value_counts()

Active                33582
New                   30229
Resolved              21500
Closed                20825
Awaiting User Info    12884
Awaiting Vendor         557
Awaiting Problem        400
Awaiting Evidence        19
-100                      2
Name: incident_state, dtype: int64

In [5]:
data.contact_type.value_counts() #it can also be removed as it is skewed

Phone    119879
Email       119
Name: contact_type, dtype: int64

In [6]:
len(data.opened_by.unique()) #could be kept, as it identifies which user opened the incident

157

In [7]:
len(data.caller_id.unique()) #caller_id can be removed as well: so many distinct values would cause the curse of dimensionality after one hot encoding

4829

In [8]:
data.shape

(119998, 36)

In [9]:
len(data.assignment_group.unique()) #we should keep it to get the performance of the group

73

In [10]:
data.reopen_count.value_counts() #reopen_count can be removed: it is almost always 0, so it is unlikely to affect the model much

0    118044
1      1661
2       141
3        86
4        38
5        11
6         8
7         5
8         4
Name: reopen_count, dtype: int64

In [11]:
data.reassignment_count.value_counts() #it can be removed, as reassignments are partly captured by one hot encoding every assignment group an incident passed through

0     57377
1     32183
2     12961
3      7274
4      4091
5      2310
6      1315
7       900
8       506
9       341
10      272
11      165
12      100
13       57
14       38
15       21
17       16
20       16
16       13
18       13
22        9
19        8
21        3
27        3
23        2
26        2
24        1
25        1
Name: reassignment_count, dtype: int64

In [12]:
data.made_sla.value_counts() #it can also be removed as it is skewed

True     111738
False      8260
Name: made_sla, dtype: int64

In [13]:
len(data.location.unique()) #it is also to be kept to account for the affected location for the incident

203

In [14]:
len(data.category.unique()) #it is also to be kept to account for the affected service

45

In [15]:
data.knowledge.value_counts() #it can be kept: the split is not extreme, and it records whether a knowledge base document was used

False    94664
True     25334
Name: knowledge, dtype: int64

# Removing the less important columns from the data set (identifier and timestamp columns not related to resolution time or priority, and heavily skewed columns)

In [16]:
df = data[['number', 'opened_at', 'location', 'impact', 'urgency', 'priority','assignment_group', 'knowledge', 'closed_at']]
df.head(50)

Unnamed: 0,number,opened_at,location,impact,urgency,priority,assignment_group,knowledge,closed_at
0,INC0000045,2016-02-29 01:16:00,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,2016-05-03 12:00:00
1,INC0000045,2016-02-29 01:16:00,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,2016-05-03 12:00:00
2,INC0000045,2016-02-29 01:16:00,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,2016-05-03 12:00:00
3,INC0000045,2016-02-29 01:16:00,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,2016-05-03 12:00:00
4,INC0000047,2016-02-29 04:40:00,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 70,True,2016-06-03 10:00:00
5,INC0000047,2016-02-29 04:40:00,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 24,True,2016-06-03 10:00:00
6,INC0000047,2016-02-29 04:40:00,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 24,True,2016-06-03 10:00:00
7,INC0000047,2016-02-29 04:40:00,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 24,True,2016-06-03 10:00:00
8,INC0000047,2016-02-29 04:40:00,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 24,True,2016-06-03 10:00:00
9,INC0000047,2016-02-29 04:40:00,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 24,True,2016-06-03 10:00:00


# Data processing

In [17]:
#finding the incident resolving time
df['resolving_time'] = df.closed_at - df.opened_at
df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['resolving_time'] = df.closed_at - df.opened_at


Unnamed: 0,number,opened_at,location,impact,urgency,priority,assignment_group,knowledge,closed_at,resolving_time
0,INC0000045,2016-02-29 01:16:00,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,2016-05-03 12:00:00,64 days 10:44:00
1,INC0000045,2016-02-29 01:16:00,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,2016-05-03 12:00:00,64 days 10:44:00
2,INC0000045,2016-02-29 01:16:00,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,2016-05-03 12:00:00,64 days 10:44:00
3,INC0000045,2016-02-29 01:16:00,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,2016-05-03 12:00:00,64 days 10:44:00
4,INC0000047,2016-02-29 04:40:00,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 70,True,2016-06-03 10:00:00,95 days 05:20:00
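
The `SettingWithCopyWarning` printed above appears because `df` was created as a column selection of `data`, and pandas cannot tell whether the assignment is meant to affect the original frame. One common way to avoid it, sketched here on a made-up frame, is to take an explicit `.copy()` and to prefer non-inplace operations.

```python
import pandas as pd

# Made-up stand-in for the loaded incident log.
data = pd.DataFrame({
    "number": ["INC1", "INC2"],
    "opened_at": pd.to_datetime(["2016-02-29 01:16", "2016-02-29 04:40"]),
    "closed_at": pd.to_datetime(["2016-03-05 12:00", "2016-03-06 10:00"]),
    "location": ["Location 143", "Location 165"],
})

# .copy() makes df an independent frame, so later assignments are unambiguous.
df = data[["number", "opened_at", "closed_at"]].copy()
df["resolving_time"] = df["closed_at"] - df["opened_at"]

# A non-inplace drop returns a new frame and leaves `data` untouched.
df = df.drop(columns=["opened_at", "closed_at"])
print(df)
```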


In [18]:
df.drop(['opened_at', 'closed_at'], axis=1, inplace=True)
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


Unnamed: 0,number,location,impact,urgency,priority,assignment_group,knowledge,resolving_time
0,INC0000045,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,64 days 10:44:00
1,INC0000045,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,64 days 10:44:00
2,INC0000045,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,64 days 10:44:00
3,INC0000045,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,64 days 10:44:00
4,INC0000047,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 70,True,95 days 05:20:00


In [19]:
#number of whole days between opening and closing the incident
df.resolving_time = df['resolving_time'].dt.days
df.head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


Unnamed: 0,number,location,impact,urgency,priority,assignment_group,knowledge,resolving_time
0,INC0000045,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,64
1,INC0000045,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,64
2,INC0000045,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,64
3,INC0000045,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,64
4,INC0000047,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 70,True,95
5,INC0000047,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 24,True,95
6,INC0000047,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 24,True,95
7,INC0000047,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 24,True,95
8,INC0000047,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 24,True,95
9,INC0000047,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 24,True,95
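
Truncating to whole days discards everything below one day (64 days 10:44 becomes 64), which matters for incidents resolved within hours. A small sketch of keeping the duration in fractional hours instead, using made-up durations:

```python
import pandas as pd

durations = pd.Series(pd.to_timedelta(
    ["64 days 10:44:00", "95 days 05:20:00", "0 days 03:30:00"]))

whole_days = durations.dt.days                   # integer days, sub-day part dropped
hours = durations.dt.total_seconds() / 3600.0    # fractional hours, nothing dropped

print(pd.DataFrame({"days": whole_days, "hours": hours}))
```

Whether days or hours is the better target depends on how the SLA is defined, which the data dictionary does not state.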


In [20]:
df.priority.value_counts()

3 - Moderate    112115
4 - Low           3549
2 - High          2499
1 - Critical      1835
Name: priority, dtype: int64

Here my <strong>dependent variables</strong> are <strong>resolving_time</strong>, for which a <strong>regression model</strong> will be built, and <strong>priority</strong>, for which a <strong>classification model</strong> will be built.
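
The second objective in the problem statement, the alignment between priority and resolution statistics, can be checked with a grouped summary before any modelling. The sketch below uses made-up numbers and assumes one row per incident with `priority` and `resolving_time` columns.

```python
import pandas as pd

# Made-up per-incident data; resolving_time is in days.
incidents = pd.DataFrame({
    "priority": ["1 - Critical", "1 - Critical", "3 - Moderate",
                 "3 - Moderate", "3 - Moderate", "4 - Low"],
    "resolving_time": [1, 3, 5, 7, 9, 20],
})

# Count, median, mean and maximum resolution time for each priority level.
summary = (incidents.groupby("priority")["resolving_time"]
                    .agg(["count", "median", "mean", "max"]))
print(summary)
```

If higher priorities do not show shorter median resolution times on the real data, that would point to a misalignment worth reporting.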

In [21]:
#dropping duplicate rows
df.drop_duplicates(inplace=True)
df.head(10)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)


Unnamed: 0,number,location,impact,urgency,priority,assignment_group,knowledge,resolving_time
0,INC0000045,Location 143,2 - Medium,2 - Medium,3 - Moderate,Group 56,True,64
4,INC0000047,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 70,True,95
5,INC0000047,Location 165,2 - Medium,2 - Medium,3 - Moderate,Group 24,True,95
13,INC0000057,Location 204,2 - Medium,2 - Medium,3 - Moderate,Group 70,True,94
20,INC0000060,Location 204,2 - Medium,2 - Medium,3 - Moderate,Group 25,True,125
24,INC0000062,Location 93,2 - Medium,2 - Medium,3 - Moderate,Group 70,True,64
25,INC0000062,Location 93,1 - High,2 - Medium,2 - High,Group 70,True,64
26,INC0000062,Location 93,1 - High,2 - Medium,2 - High,Group 23,True,64
32,INC0000063,Location 93,2 - Medium,2 - Medium,3 - Moderate,Group 70,True,64
36,INC0000063,Location 93,2 - Medium,2 - Medium,3 - Moderate,Group 23,True,64


In [22]:
df.shape

(36779, 8)

# Now we will one-hot encode all categorical features and separate out the dependent variables

In [23]:
issue_resolving_time = df[['number','resolving_time']]
issue_resolving_time.drop_duplicates(inplace=True)
issue_resolving_time.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)


Unnamed: 0,number,resolving_time
0,INC0000045,64
4,INC0000047,95
13,INC0000057,94
20,INC0000060,125
24,INC0000062,64


In [24]:
issue_priority = df[['number', 'priority']]
issue_priority.drop_duplicates(inplace=True)
issue_priority.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)


Unnamed: 0,number,priority
0,INC0000045,3 - Moderate
4,INC0000047,3 - Moderate
13,INC0000057,3 - Moderate
20,INC0000060,3 - Moderate
24,INC0000062,3 - Moderate
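
Deduplicating `(number, priority)` pairs keeps one row per distinct priority, so an incident whose priority changed (INC0000062 appears earlier with both `3 - Moderate` and `2 - High`) ends up with several labels, and the later merge repeats its features under each of them. This may explain why the test split further down has 4,223 rows, slightly more than 20% of 20,769 incidents. A minimal sketch of keeping a single label per incident, assuming the last logged priority is the one to predict:

```python
import pandas as pd

# Made-up extract of the log, in log order.
df_log = pd.DataFrame({
    "number": ["INC0000062", "INC0000062", "INC0000063"],
    "priority": ["3 - Moderate", "2 - High", "3 - Moderate"],
})

# Approach used above: one row per distinct (number, priority) pair.
pairs = df_log[["number", "priority"]].drop_duplicates()

# Alternative: exactly one row per incident, holding its last priority.
final_priority = df_log.groupby("number", as_index=False)["priority"].last()

print(len(pairs), len(final_priority))
```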


In [25]:
df.drop(['priority', 'resolving_time'], axis=1, inplace=True)
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


Unnamed: 0,number,location,impact,urgency,assignment_group,knowledge
0,INC0000045,Location 143,2 - Medium,2 - Medium,Group 56,True
4,INC0000047,Location 165,2 - Medium,2 - Medium,Group 70,True
5,INC0000047,Location 165,2 - Medium,2 - Medium,Group 24,True
13,INC0000057,Location 204,2 - Medium,2 - Medium,Group 70,True
20,INC0000060,Location 204,2 - Medium,2 - Medium,Group 25,True


Now we will do one-hot encoding for df

In [26]:
#the function one hot encodes every column after the first one (here location, impact, urgency,
#assignment_group and knowledge) and returns one row per incident number

def one_hot_encoding(df):
    columns = df.columns
    key = columns[0]  #incident identifier ('number')
    ohe_df = df[key]
    ohe_df.drop_duplicates(inplace=True)
    for col in columns[1:]:
        sub_df = df[[key, col]]
        sub_df.drop_duplicates(inplace=True)
        #after dropping duplicates, crosstab gives a 0/1 indicator per value seen for the incident
        sub_ohe_df = pd.crosstab(sub_df[key], sub_df[col])
        ohe_df = pd.merge(ohe_df, sub_ohe_df, on=key)
    return ohe_df


In [27]:
ohe_df = one_hot_encoding(df)
ohe_df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(result)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)


Unnamed: 0,number,?_x,Location 10,Location 100,Location 101,Location 102,Location 107,Location 108,Location 109,Location 11,Location 110,Location 111,Location 112,Location 113,Location 114,Location 115,Location 117,Location 118,Location 12,Location 120,...,Group 66,Group 67,Group 68,Group 69,Group 70,Group 71,Group 72,Group 73,Group 74,Group 75,Group 76,Group 77,Group 78,Group 79,Group 80,Group 81,Group 82,Group 9,False,True
0,INC0000045,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,INC0000047,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,INC0000057,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,INC0000060,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,INC0000062,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [28]:
ohe_df.shape

(20769, 285)
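
Column names such as `?_x`, `False` and `True` in the output above come from merging crosstabs whose value labels collide (for example `?` occurs in more than one feature), so the merge appends suffixes and the origin of a column is hard to read. A sketch of an alternative that prefixes every indicator with its feature name, using `pd.get_dummies` followed by a per-incident maximum to keep the same multi-hot layout; the small frame is made up.

```python
import pandas as pd

# Made-up extract: INC1 passed through two assignment groups.
df_small = pd.DataFrame({
    "number": ["INC1", "INC1", "INC2"],
    "location": ["Location 143", "Location 143", "?"],
    "assignment_group": ["Group 56", "Group 70", "?"],
    "knowledge": [True, True, False],
})

features = ["location", "assignment_group", "knowledge"]

# One indicator column per (feature, value), named "<feature>_<value>".
dummies = pd.get_dummies(df_small[features].astype(str), prefix=features, dtype=int)

# Collapse to one row per incident: 1 if the value was ever seen for it.
multi_hot = dummies.groupby(df_small["number"]).max()
print(multi_hot)
```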

# Modelling

# Now let's build a regression model to predict resolving time

In [29]:
X = pd.merge(ohe_df,issue_resolving_time, on='number')
X.head()

Unnamed: 0,number,?_x,Location 10,Location 100,Location 101,Location 102,Location 107,Location 108,Location 109,Location 11,Location 110,Location 111,Location 112,Location 113,Location 114,Location 115,Location 117,Location 118,Location 12,Location 120,...,Group 67,Group 68,Group 69,Group 70,Group 71,Group 72,Group 73,Group 74,Group 75,Group 76,Group 77,Group 78,Group 79,Group 80,Group 81,Group 82,Group 9,False,True,resolving_time
0,INC0000045,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,64
1,INC0000047,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,95
2,INC0000057,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,94
3,INC0000060,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,125
4,INC0000062,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,64


In [30]:
y = X.resolving_time
y.head()

0     64
1     95
2     94
3    125
4     64
Name: resolving_time, dtype: int64

In [31]:
X.drop(['number', 'resolving_time'], axis=1, inplace=True)
X.head()

Unnamed: 0,?_x,Location 10,Location 100,Location 101,Location 102,Location 107,Location 108,Location 109,Location 11,Location 110,Location 111,Location 112,Location 113,Location 114,Location 115,Location 117,Location 118,Location 12,Location 120,Location 121,...,Group 66,Group 67,Group 68,Group 69,Group 70,Group 71,Group 72,Group 73,Group 74,Group 75,Group 76,Group 77,Group 78,Group 79,Group 80,Group 81,Group 82,Group 9,False,True
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [32]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=10)

In [33]:
from sklearn.linear_model import LinearRegression
lr_clf = LinearRegression()
lr_clf.fit(X_train,y_train)
lr_clf.score(X_test,y_test)



-2.1713042418444416e+20
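
An R² of this magnitude is consistent with ordinary least squares being fitted on a nearly singular design: every group of one-hot columns is (almost) collinear with the others, so some coefficients can become enormous and dominate predictions on unseen rows. One possible remedy, sketched here on synthetic data rather than the incident log, is a regularised model such as `Ridge`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: two complete sets of one-hot columns, so the columns of
# each set sum to 1 and the design matrix is exactly collinear.
a = rng.integers(0, 5, size=400)
b = rng.integers(0, 4, size=400)
X_demo = np.hstack([np.eye(5)[a], np.eye(4)[b]])
y_demo = 3.0 * a + 2.0 * b + rng.normal(0, 1.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10)

# The L2 penalty keeps coefficients bounded despite the collinearity.
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("largest |coefficient|:", np.abs(ridge.coef_).max())
print("test R^2:", ridge.score(X_te, y_te))
```

Regularisation only stabilises the fit; it does not add predictive signal that the features lack.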

In [34]:
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor()
dt.fit(X_train,y_train)
dt.score(X_test,y_test)



-0.14958780422717832

In [35]:
from sklearn.ensemble import RandomForestRegressor

# create regressor object
regressor = RandomForestRegressor(n_estimators = 100, random_state = 0)

# fit the regressor with x and y data
regressor.fit(X_train, y_train)
regressor.score(X_test,y_test)



-0.003010794279248108
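
The `score` method of a regressor returns R², and a value at or below zero means the model does no better on the test set than always predicting a constant mean. A sketch of making that comparison explicit with a `DummyRegressor` baseline and the mean absolute error, on synthetic data:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up binary features; only the first two influence the target.
X_demo = rng.integers(0, 2, size=(500, 10)).astype(float)
y_demo = 5 * X_demo[:, 0] + 3 * X_demo[:, 1] + rng.normal(0, 1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10)

baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Report R^2 and MAE side by side for the baseline and the model.
for name, model in [("mean baseline", baseline), ("random forest", forest)]:
    pred = model.predict(X_te)
    print(name,
          "R^2 =", round(model.score(X_te, y_te), 3),
          "MAE =", round(mean_absolute_error(y_te, pred), 3))
```

Reporting the error in days (MAE) alongside R² also makes the result easier to interpret for an operations audience.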

# Now build a classification model to predict the priority of an incident

In [36]:
X = pd.merge(ohe_df,issue_priority, on='number')
X.head()

Unnamed: 0,number,?_x,Location 10,Location 100,Location 101,Location 102,Location 107,Location 108,Location 109,Location 11,Location 110,Location 111,Location 112,Location 113,Location 114,Location 115,Location 117,Location 118,Location 12,Location 120,...,Group 67,Group 68,Group 69,Group 70,Group 71,Group 72,Group 73,Group 74,Group 75,Group 76,Group 77,Group 78,Group 79,Group 80,Group 81,Group 82,Group 9,False,True,priority
0,INC0000045,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3 - Moderate
1,INC0000047,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3 - Moderate
2,INC0000057,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3 - Moderate
3,INC0000060,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3 - Moderate
4,INC0000062,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3 - Moderate


In [37]:
y = X.priority
y.head()

0    3 - Moderate
1    3 - Moderate
2    3 - Moderate
3    3 - Moderate
4    3 - Moderate
Name: priority, dtype: object

In [38]:
X.drop(['number', 'priority'], axis=1, inplace=True)
X.head()

Unnamed: 0,?_x,Location 10,Location 100,Location 101,Location 102,Location 107,Location 108,Location 109,Location 11,Location 110,Location 111,Location 112,Location 113,Location 114,Location 115,Location 117,Location 118,Location 12,Location 120,Location 121,...,Group 66,Group 67,Group 68,Group 69,Group 70,Group 71,Group 72,Group 73,Group 74,Group 75,Group 76,Group 77,Group 78,Group 79,Group 80,Group 81,Group 82,Group 9,False,True
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [39]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=10)

In [40]:
from sklearn.ensemble import RandomForestClassifier
randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
randomclassifier.fit(X_train,y_train)



RandomForestClassifier(criterion='entropy', n_estimators=200)

In [41]:
predictions = randomclassifier.predict(X_test)



In [42]:
# Import library to check accuracy
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

matrix=confusion_matrix(y_test,predictions)
print(matrix)
score=accuracy_score(y_test,predictions)
print(score)
report=classification_report(y_test,predictions)
print(report)

[[  31    2   18    3]
 [   5   48   18    0]
 [  31   21 3885    2]
 [   4    1    3  151]]
0.9744257636751125
              precision    recall  f1-score   support

1 - Critical       0.44      0.57      0.50        54
    2 - High       0.67      0.68      0.67        71
3 - Moderate       0.99      0.99      0.99      3939
     4 - Low       0.97      0.95      0.96       159

    accuracy                           0.97      4223
   macro avg       0.77      0.80      0.78      4223
weighted avg       0.98      0.97      0.98      4223
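
Since 3,939 of the 4,223 test rows belong to `3 - Moderate`, accuracy mostly reflects that class. A sketch of two common adjustments for imbalanced labels, a stratified split and `class_weight="balanced"`, evaluated with the macro-averaged F1 score; the data is synthetic, and whether this helps on the incident log would have to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic four-class problem with one dominant class.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=6, n_classes=4,
    weights=[0.02, 0.02, 0.92, 0.04], random_state=0)

# stratify keeps the class proportions equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=10)

# class_weight="balanced" weights classes inversely to their frequency.
clf = RandomForestClassifier(n_estimators=200, criterion="entropy",
                             class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

macro_f1 = f1_score(y_te, pred, average="macro")    # every class counts equally
per_class_f1 = f1_score(y_te, pred, average=None)   # one value per class
print(macro_f1, per_class_f1)
```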



# The classification model for incident priority reaches about 97% accuracy, with an f1-score of at least 0.50 for every class.
# Accuracy is dominated by the majority class (3,939 of 4,223 test rows are 3 - Moderate), so the model could still be improved to raise the f1-score of the Critical (0.50) and High (0.67) classes.

--------------------------------------------------------------------------------------------------------------------

# The regression model for incident resolution time performs poorly: all three models have a negative R² score on the test set (about -0.003 for the random forest), i.e. no better than always predicting the mean.
# Independent variables that account for more of the variation in resolution time need to be selected.
# PCA could also be applied to reduce the number of dimensions.
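
As an illustration of the PCA idea mentioned above, the sketch below chains `PCA` and a random forest in a pipeline and scores it with cross-validation. The data is synthetic and the choice of 10 components is arbitrary, so this shows only the mechanics, not an expected gain.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Made-up binary feature matrix standing in for the one-hot encoded incidents.
X_demo = rng.integers(0, 2, size=(300, 50)).astype(float)
y_demo = X_demo[:, :5].sum(axis=1) + rng.normal(0, 0.5, size=300)

# Fitting PCA inside the pipeline keeps it restricted to each training fold.
pipe = make_pipeline(
    PCA(n_components=10, random_state=0),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
print(scores.mean())
```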

Sorry, I was not able to improve the regression model due to lack of time.

--------------------------------------------------------------------------------------------------------------------

# Some recommendations to reduce resolution time:
1 Ask staff to double-check the priority of an incident before resolving it. According to the data this is mostly not done: u_priority_confirmation is False in 87,564 of 119,998 log records.

2 Avoid changing the resolver within an assignment group too often. The resolver already working on an incident learns more about it over time and can solve it faster, whereas a new resolver has to spend time catching up on the incident's history and progress.

3 Minimise the reassignment of incidents from one group to another so that they are resolved sooner.

4 Resolution time often increases because of external waits such as Awaiting User Info, Awaiting Vendor, Awaiting Problem and Awaiting Evidence rather than the actual work on the issue. Try to complete these steps beforehand or in parallel with solving the incident so that they do not add extra time.
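
The fourth recommendation could be quantified from the event log by measuring how long incidents sit in each state. The sketch below assumes that `sys_updated_at` is parsed as a datetime and that a state lasts until the next update of the same incident; the log is made up.

```python
import pandas as pd

# Made-up event log with two incidents.
log = pd.DataFrame({
    "number": ["INC1"] * 4 + ["INC2"] * 3,
    "incident_state": ["New", "Awaiting User Info", "Active", "Closed",
                       "New", "Active", "Closed"],
    "sys_updated_at": pd.to_datetime([
        "2016-02-29 01:00", "2016-02-29 03:00",
        "2016-03-01 03:00", "2016-03-01 05:00",
        "2016-02-29 04:00", "2016-02-29 06:00", "2016-02-29 10:00"]),
})

log = log.sort_values(["number", "sys_updated_at"])

# Time spent in a state = gap until the next update of the same incident.
next_update = log.groupby("number")["sys_updated_at"].shift(-1)
log["hours_in_state"] = (next_update - log["sys_updated_at"]).dt.total_seconds() / 3600

# Total hours per state; the final record of each incident has no next update.
hours_by_state = log.groupby("incident_state")["hours_in_state"].sum()
print(hours_by_state)
```

A large share of hours in the awaiting states on the real log would support the recommendation with numbers.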