# **Suggest estimated time to resolve an incident**

Use Case: Predict the estimated time required to resolve an incident assigned to a developer, based on previously resolved incidents.

Link to documentation:
1. https://iwiki.sse.in.tum.de/display/PIT21/%5BSaury%5D+Time+Estimation+to+Resolve+an+Incident


# Installation of Sentence Transformers

In [2]:
!pip install -U sentence-transformers


# Import Libraries

In [3]:
from sentence_transformers import SentenceTransformer, util
from sentence_transformers import models, losses
import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
# Mounting google drive where dataset is present
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Import all Jira CSV files from a directory and concatenate them into one dataframe
# --Need to provide location of the dataset--

import os

path="/content/drive/MyDrive/Colab_Notebooks/jira_data/"
os.chdir(path)
frames=[]
for file in os.listdir():
    # Only read files that are in CSV format
    if file.endswith(".csv"):
        file_path = os.path.join(path, file)
        frames.append(pd.read_csv(file_path))
df = pd.concat(frames)

In [6]:
# Read Jira Dataset --Need to provide location of the dataset--
df=pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/jira_data/AggregateJiraData.csv')
df.head(3)

Unnamed: 0.1,Unnamed: 0,Summary,Issue id,Issue key,Issue Type,Description,Assignee,Reporter,Created,Resolved,Updated,Resolution,Priority,Creator
0,29,Sourcetree does not list changes to added file,1133838,SRCTREEWIN-11416,Bug,"After you stage a new file, if you make change...",mminns,neoscorpe,13/Mar/2019 10:21 AM,21/Mar/2019 9:22 PM,26/Aug/2019 5:16 AM,Fixed,Highest,neoscorpe
1,60,Implement a dark theme/skin,1124345,SRCTREEWIN-11379,Suggestion,"I use dark Gmail, dark IntelliJ IDEA, dark bro...",sstreeting,natharuk04,22/Feb/2019 7:17 AM,27/Feb/2019 2:09 PM,19/Sep/2019 6:03 AM,Duplicate,Low,natharuk04
2,161,Moving Mouse after Double-Clicking local Branc...,1097122,SRCTREEWIN-11108,Bug,If a user double-clicks on a local branch name...,mcorsaro,deckblad191727226,18/Dec/2018 6:03 PM,14/Feb/2019 11:07 AM,26/Aug/2019 5:17 AM,Fixed,Low,deckblad191727226


# Data Preprocessing


*   Merging data frames
*   Selecting essential features while removing the rest
*   Creating new features: estimated time and incident quarter
*   Cleaning data: finding and eliminating null values






In [7]:
# Keep only the essential columns

df = df.filter(['Summary', 'Issue id','Issue key','Issue Type','Description','Assignee','Reporter','Created','Resolved','Updated','Resolution','Priority','Creator'])

In [8]:
# Create the estimated_days column, which is the target value

df['estimated_days']=pd.to_datetime(df['Resolved'])-pd.to_datetime(df['Created'])
df['estimated_days'] = df['estimated_days'].dt.days
# Cap the target at maxVal days
maxVal = 60
df.loc[df['estimated_days'] >= maxVal, 'estimated_days'] = maxVal
print(df['estimated_days'].max())
# Data summary
df.info()

60.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7435 entries, 0 to 7434
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Summary         7435 non-null   object 
 1   Issue id        7435 non-null   int64  
 2   Issue key       7435 non-null   object 
 3   Issue Type      7435 non-null   object 
 4   Description     6877 non-null   object 
 5   Assignee        7435 non-null   object 
 6   Reporter        7427 non-null   object 
 7   Created         7435 non-null   object 
 8   Resolved        6310 non-null   object 
 9   Updated         7435 non-null   object 
 10  Resolution      6310 non-null   object 
 11  Priority        6803 non-null   object 
 12  Creator         7424 non-null   object 
 13  estimated_days  6310 non-null   float64
dtypes: float64(1), int64(1), object(12)
memory usage: 813.3+ KB
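
The target computation above can be sketched on the three sample rows shown earlier. This is a minimal illustration, assuming the `Created` and `Resolved` strings follow the `13/Mar/2019 10:21 AM` pattern seen in the preview. The explicit format string and the helper name `resolution_days` are assumptions and not part of the original notebook.

```python
import pandas as pd

# Assumed timestamp layout, inferred from the preview rows
# (day/abbreviated month/year, 12-hour clock).
JIRA_DATE_FORMAT = "%d/%b/%Y %I:%M %p"

def resolution_days(frame, cap=60):
    """Whole days between Created and Resolved, capped at `cap` (hypothetical helper)."""
    created = pd.to_datetime(frame["Created"], format=JIRA_DATE_FORMAT)
    resolved = pd.to_datetime(frame["Resolved"], format=JIRA_DATE_FORMAT)
    days = (resolved - created).dt.days
    # clip(upper=cap) replaces every value above the cap with the cap itself
    return days.clip(upper=cap)

sample = pd.DataFrame({
    "Created": ["13/Mar/2019 10:21 AM", "22/Feb/2019 7:17 AM", "18/Dec/2018 6:03 PM"],
    "Resolved": ["21/Mar/2019 9:22 PM", "27/Feb/2019 2:09 PM", "14/Feb/2019 11:07 AM"],
})
sample["estimated_days"] = resolution_days(sample)
print(sample["estimated_days"].tolist())
```

Passing an explicit format avoids ambiguity between day-first and month-first parsing, which matters for dates such as `01/02/2019` if the export format ever changes.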


In [9]:
# Drop rows that contain null values and rebuild the index
df = df.dropna().reset_index(drop=True)
df.head(3)

Unnamed: 0,Summary,Issue id,Issue key,Issue Type,Description,Assignee,Reporter,Created,Resolved,Updated,Resolution,Priority,Creator,estimated_days
0,Sourcetree does not list changes to added file,1133838,SRCTREEWIN-11416,Bug,"After you stage a new file, if you make change...",mminns,neoscorpe,13/Mar/2019 10:21 AM,21/Mar/2019 9:22 PM,26/Aug/2019 5:16 AM,Fixed,Highest,neoscorpe,8.0
1,Implement a dark theme/skin,1124345,SRCTREEWIN-11379,Suggestion,"I use dark Gmail, dark IntelliJ IDEA, dark bro...",sstreeting,natharuk04,22/Feb/2019 7:17 AM,27/Feb/2019 2:09 PM,19/Sep/2019 6:03 AM,Duplicate,Low,natharuk04,5.0
2,Moving Mouse after Double-Clicking local Branc...,1097122,SRCTREEWIN-11108,Bug,If a user double-clicks on a local branch name...,mcorsaro,deckblad191727226,18/Dec/2018 6:03 PM,14/Feb/2019 11:07 AM,26/Aug/2019 5:17 AM,Fixed,Low,deckblad191727226,57.0


In [10]:
# Modifying and merging similar Priority Level
PriorityLevelColumn = {'Highest':'P0: Blocker',
                             'High':'P1: Critical',
                             'Medium':'P2: Important',
                             'Low':'P3: Somewhat important',
                             'Lowest':'P4: Low',
                             'Not Evaluated':'P5: Not important',
                             np.nan:'P4: Low'}

df['PriorityLevel'] = df['Priority'].replace(PriorityLevelColumn)

# Total categories of Priority Level
print("Total category of incidents",df['PriorityLevel'].nunique())

# Unique values in the Priority Level
print(df.PriorityLevel.unique())

# No of incidents of each type
print(df['PriorityLevel'].value_counts())

Total category of incidents 6
['P0: Blocker' 'P3: Somewhat important' 'P1: Critical' 'P2: Important'
 'P4: Low' 'P5: Not important']
P3: Somewhat important    2006
P2: Important             1511
P5: Not important          973
P1: Critical               588
P0: Blocker                260
P4: Low                     97
Name: PriorityLevel, dtype: int64


In [11]:
# 1-Hot Encoding for the Priority Level

priority_dummy = pd.get_dummies(df['PriorityLevel'])
df = pd.concat([df, priority_dummy], axis=1)
df.head(1)

Unnamed: 0,Summary,Issue id,Issue key,Issue Type,Description,Assignee,Reporter,Created,Resolved,Updated,Resolution,Priority,Creator,estimated_days,PriorityLevel,P0: Blocker,P1: Critical,P2: Important,P3: Somewhat important,P4: Low,P5: Not important
0,Sourcetree does not list changes to added file,1133838,SRCTREEWIN-11416,Bug,"After you stage a new file, if you make change...",mminns,neoscorpe,13/Mar/2019 10:21 AM,21/Mar/2019 9:22 PM,26/Aug/2019 5:16 AM,Fixed,Highest,neoscorpe,8.0,P0: Blocker,1,0,0,0,0,0
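
`pd.get_dummies` only creates columns for labels that actually occur in the data, so a small prediction set can end up with fewer columns than the training set. The sketch below shows one possible way to keep the column set stable by declaring the categories up front. The helper name `encode_priority` is hypothetical; the category list is copied from the mapping above.

```python
import pandas as pd

# Category list taken from the priority mapping; fixing it keeps the dummy columns stable.
PRIORITY_LEVELS = [
    "P0: Blocker", "P1: Critical", "P2: Important",
    "P3: Somewhat important", "P4: Low", "P5: Not important",
]

def encode_priority(levels):
    """One-hot encode priority labels with a fixed column set (hypothetical helper)."""
    categorical = pd.Series(levels).astype(pd.CategoricalDtype(PRIORITY_LEVELS))
    # Unused categories still get an all-zero column
    return pd.get_dummies(categorical).astype(int)

encoded = encode_priority(["P0: Blocker", "P3: Somewhat important"])
print(encoded.columns.tolist())
print(encoded.iloc[0].tolist())
```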


In [12]:
# Unique values in the Issue Type
print(df['Issue Type'].unique())

# No of incidents of each type
print(df['Issue Type'].value_counts())

issuetypes = {'User Story':'Epic',
            'New Feature':'Epic',
            'Technical task':'Improvement',
            'Sub-task':'Improvement',
            'Research':'Improvement' }

df['IssuesType'] = df['Issue Type'].replace(issuetypes)

print(df['IssuesType'].value_counts())

# Priority, Issue Type and PriorityLevel are removed in a later cell

['Bug' 'Suggestion' 'Improvement' 'New Feature' 'Task' 'Sub-task' 'Epic'
 'User Story' 'Research' 'Technical task']
Bug               3733
Suggestion         874
Task               575
Improvement         81
User Story          48
New Feature         40
Technical task      34
Epic                26
Sub-task            23
Research             1
Name: Issue Type, dtype: int64
Bug            3733
Suggestion      874
Task            575
Improvement     139
Epic            114
Name: IssuesType, dtype: int64


In [13]:
# 1-Hot Encoding for the Issue Type

issues_dummy = pd.get_dummies(df['IssuesType'])
df = pd.concat([df, issues_dummy], axis=1)

In [14]:
# Removing unnecessary columns

df = df.drop(['Priority','PriorityLevel','Issue Type'],axis=1)          

In [15]:
# Derive the creation month, then map it to one of three four-month periods (column 'Quater')
df['months']=pd.DatetimeIndex(df['Created']).month

qtr_months = {1:1,2:1,3:1,4:1,5:2,6:2,7:2,8:2,9:3,10:3,11:3,12:3}

df['Quater'] = df['months'].replace(qtr_months)
print(df['Quater'].value_counts())

1    1945
2    1831
3    1659
Name: Quater, dtype: int64
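
The `qtr_months` dictionary groups the twelve months into three periods of four months each, which is why only the values 1 to 3 appear in the counts above. The sketch below expresses the same mapping arithmetically and contrasts it with conventional calendar quarters. Both function names are hypothetical and only serve as illustration.

```python
# Mapping copied from the notebook: three four-month periods, not calendar quarters.
qtr_months = {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2, 9: 3, 10: 3, 11: 3, 12: 3}

def four_month_period(month):
    """Return 1, 2 or 3 for the four-month period containing `month` (hypothetical helper)."""
    return (month - 1) // 4 + 1

def calendar_quarter(month):
    """Return the conventional calendar quarter (1 to 4), shown for comparison only."""
    return (month - 1) // 3 + 1

print([four_month_period(m) for m in range(1, 13)])
print([calendar_quarter(m) for m in range(1, 13)])
```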


In [16]:
# 1-Hot Encoding for the Quater column

qtr_dummy = pd.get_dummies(df['Quater'])
df = pd.concat([df, qtr_dummy], axis=1)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5435 entries, 0 to 5434
Data columns (total 29 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Summary                 5435 non-null   object 
 1   Issue id                5435 non-null   int64  
 2   Issue key               5435 non-null   object 
 3   Description             5435 non-null   object 
 4   Assignee                5435 non-null   object 
 5   Reporter                5435 non-null   object 
 6   Created                 5435 non-null   object 
 7   Resolved                5435 non-null   object 
 8   Updated                 5435 non-null   object 
 9   Resolution              5435 non-null   object 
 10  Creator                 5435 non-null   object 
 11  estimated_days          5435 non-null   float64
 12  P0: Blocker             5435 non-null   uint8  
 13  P1: Critical            5435 non-null   uint8  
 14  P2: Important           5435 non-null   

In [17]:
# Rename the period dummy columns to Qtr1, Qtr2 and Qtr3
df = df.rename(columns = {1:'Qtr1', 2:'Qtr2',3:'Qtr3'})


In [18]:
df = df.drop(['Summary','Issue id','Issue key','Description','Assignee','Reporter','Created','Resolved',
                  'Updated','Resolution','Creator','Quater','months','IssuesType'],axis=1)

print(df.columns)


Index(['estimated_days', 'P0: Blocker', 'P1: Critical', 'P2: Important',
       'P3: Somewhat important', 'P4: Low', 'P5: Not important', 'Bug', 'Epic',
       'Improvement', 'Suggestion', 'Task', 'Qtr1', 'Qtr2', 'Qtr3'],
      dtype='object')


# Model: XGBoost Implementation


In [19]:
# Importing necessary Libraries for the model: XGBoost

import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X = np.array(df.drop(['estimated_days'],axis=1).values)
y = np.array(df['estimated_days'].values)


In [20]:
# Training and testing the model

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

#Instantiating an XGBoost regressor object
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 10)

#Fitting the regressor to the training set 
xg_reg.fit(X_train,y_train)

#Making predictions on the test set
preds = xg_reg.predict(X_test)

#Calculating RMSE
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

RMSE: 28.135299
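
An RMSE of about 28 days is easier to judge next to a simple reference point, since the target is capped at 60 days. The sketch below computes RMSE by hand and compares it with a constant predictor that always returns the training mean. The function names and the toy numbers are assumptions for illustration and are not taken from the Jira data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mean_baseline_rmse(y_train, y_test):
    """RMSE of always predicting the training mean (hypothetical reference point)."""
    constant = np.full(len(y_test), np.mean(y_train))
    return rmse(y_test, constant)

# Toy numbers for illustration only
toy_train = [2.0, 4.0, 6.0]
toy_test = [1.0, 7.0]
print(rmse(toy_test, [4.0, 4.0]))
print(mean_baseline_rmse(toy_train, toy_test))
```

In the notebook, the same comparison could be made with `mean_baseline_rmse(y_train, y_test)`; a model RMSE close to that value would suggest the one-hot features carry little signal.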


**Reading and Transforming Jira data needed for Prediction**

In [24]:
# Reading prediction data
data = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/jira_dataset.csv')

# Filtering essential columns from the data
data = data.filter(['Summary', 'Issue id','Issue key','Issue Type','Description','Assignee','Reporter','Created','Resolved','Updated','Resolution','Priority','Creator'])

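# Splitting the data set: resolved incidents form the reference corpus,
# unresolved incidents are the ones to predict
querydata = data[data['Resolved'].notna()].copy()
data = data[data['Resolved'].isnull()].copy()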

PriorityLevelColumn = {'Highest':'P0: Blocker',
                             'High':'P1: Critical',
                             'Medium':'P2: Important',
                             'Low':'P3: Somewhat important',
                             'Lowest':'P4: Low',
                             'Not Evaluated':'P5: Not important',
                             np.nan:'P4: Low'}

data['PriorityLevel'] = data['Priority'].replace(PriorityLevelColumn)

priority_dummy = pd.get_dummies(data['PriorityLevel'])
data = pd.concat([data, priority_dummy], axis=1)

issuetypes = {'User Story':'Epic',
            'New Feature':'Epic',
            'Technical task':'Improvement',
            'Sub-task':'Improvement',
            'Research':'Improvement' }

data['IssuesType'] = data['Issue Type'].replace(issuetypes)

issues_dummy = pd.get_dummies(data['IssuesType'])
data = pd.concat([data, issues_dummy], axis=1)

data['months']=pd.DatetimeIndex(data['Created']).month

qtr_months = {1:1,2:1,3:1,4:1,5:2,6:2,7:2,8:2,9:3,10:3,11:3,12:3}

data['Quater'] = data['months'].map(qtr_months)

qtr_dummy = pd.get_dummies(data['Quater'])
data = pd.concat([data, qtr_dummy], axis=1)

data = data.rename(columns = {1:'Qtr1', 2:'Qtr2',3:'Qtr3'})

# Filling null values in the Description column with Summary data
data["Description"] = data["Description"].fillna(data['Summary'])
data.reset_index(inplace=True, drop=True)

querydata.reset_index(inplace=True, drop=True)
querydata["Description"] = querydata["Description"].fillna(querydata['Summary'])

# Jira prediction_data for XGBoost 
prediction_data = data.drop(['Summary','Issue id','Issue key','Description','Assignee','Reporter','Created','Resolved','Priority','PriorityLevel','Issue Type',
                  'Updated','Resolution','Creator','Quater','months','IssuesType'],axis=1)

# Jira data used for training and testing for BERT
querydata['estimated_days']=(pd.to_datetime(querydata['Resolved'])-pd.to_datetime(querydata['Created'])).dt.days


**Predicted estimated time results for incidents from XGBoost**

In [25]:
# Providing incidents to the XGBoost model for predicting estimated time

predictiondf=df.iloc[0]
predictiondf=predictiondf.drop(predictiondf.index[0])
predictiondf=predictiondf.add(prediction_data)
predictiondf = np.array(predictiondf.fillna(0))

print("Input data size",len(predictiondf))

predicted_result_xgboost=0.3*xg_reg.predict(predictiondf)
print(predicted_result_xgboost)


Input data size 24
[5.88544   5.88544   5.88544   5.88544   5.88544   5.88544   5.88544
 5.88544   5.6285505 5.88544   5.88544   5.88544   5.88544   5.6285505
 5.6285505 5.6285505 5.9560614 5.6285505 7.295043  7.295043  6.9681
 7.295043  7.295043  7.295043 ]
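
The prediction frame must have the same columns, in the same order, as the training features. One alternative way to achieve this is `DataFrame.reindex`, sketched below on a toy frame. This is not what the cell above does: that cell adds the first training row to the prediction data, so this sketch may produce different numbers. The helper name `align_features` is hypothetical; the column list is copied from the training columns printed earlier, without the target.

```python
import pandas as pd

# Feature order copied from the training columns printed earlier (target removed).
FEATURE_COLUMNS = [
    "P0: Blocker", "P1: Critical", "P2: Important", "P3: Somewhat important",
    "P4: Low", "P5: Not important", "Bug", "Epic", "Improvement",
    "Suggestion", "Task", "Qtr1", "Qtr2", "Qtr3",
]

def align_features(frame, columns=FEATURE_COLUMNS):
    """Reorder to the training layout and fill absent dummy columns with 0 (hypothetical helper)."""
    return frame.reindex(columns=columns, fill_value=0)

# Toy prediction frame that only contains some of the dummy columns
toy = pd.DataFrame({"Task": [1, 0], "P2: Important": [1, 1], "Qtr2": [1, 1], "Bug": [0, 1]})
aligned = align_features(toy)
print(aligned.shape)
print(aligned.columns.tolist())
```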


# BERT - Model Selection and Initialization

1. Model Name : stsb-roberta-base-v2, STSb Performance: 87.21,  Size: ~460MB

In [26]:
model = SentenceTransformer('stsb-roberta-base-v2')





In [27]:
#Converting the incidents description text to contextual embeddings
query_embedding = model.encode(data['Description'], batch_size = len(data['Description']), show_progress_bar = True)
text_embeddings = model.encode(querydata['Description'], batch_size = len(querydata['Description']), show_progress_bar = True)





In [28]:
#Converting the incidents summary text to contextual embeddings
summary_query_embedding = model.encode(data['Summary'], batch_size = len(data['Summary']), show_progress_bar = True)
summary_text_embeddings = model.encode(querydata['Summary'], batch_size = len(querydata['Summary']), show_progress_bar = True)





In [29]:
#Total number of values in the embedding matrix (documents x embedding dimension)
print("Embedding Size:", text_embeddings.size)

#Total number of documents present
print("Total no of documents", len(text_embeddings))

Embedding Size: 56832
Total no of documents 74
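
The `size` attribute of a NumPy array counts all elements, so 56832 is the product of the number of documents and the embedding dimension. The sketch below uses a zero-filled stand-in array to show the difference between `size` and `shape`. The dimension of 768 is an assumption derived from 56832 / 74.

```python
import numpy as np

# Stand-in for the encoder output: 74 documents, assumed 768 dimensions each (56832 / 74)
toy_embeddings = np.zeros((74, 768), dtype=np.float32)

print("Total values:", toy_embeddings.size)
print("Documents:", toy_embeddings.shape[0])
print("Embedding dimension:", toy_embeddings.shape[1])
```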


# Evaluating estimated time of similar resolved incidents

In [30]:
print("Query Sentence:", data['Description'][10])
# Compute similarity scores of the sentence with the corpus
desc_cos_scores=[]
desc_cos_scores = util.pytorch_cos_sim(query_embedding[10], text_embeddings)[0]


Query Sentence: Create Confluence page with model description & your research + ideas
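
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below reproduces that calculation in plain NumPy for one query vector against a small corpus. The function name `cosine_scores` and the toy vectors are assumptions for illustration; the notebook itself relies on `util.pytorch_cos_sim`.

```python
import numpy as np

def cosine_scores(query, corpus):
    """Cosine similarity of one query vector against every row of `corpus`."""
    query = np.asarray(query, dtype=float)
    corpus = np.asarray(corpus, dtype=float)
    dots = corpus @ query
    norms = np.linalg.norm(corpus, axis=1) * np.linalg.norm(query)
    return dots / norms

toy_query = [1.0, 0.0]
toy_corpus = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
print(cosine_scores(toy_query, toy_corpus))
```

The first row points in the same direction as the query (score 1), the second is orthogonal (score 0), and the third lies at 45 degrees.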


In [39]:
# Compute similarity scores of the incidents with respect to summary 
cos_scores_summary = util.pytorch_cos_sim(summary_query_embedding[10], summary_text_embeddings)[0]

# Merge both the description scores and summary scores
total_scores=(0.7)*desc_cos_scores+(0.3)*cos_scores_summary

result=[]
top_k=3
days=0
scores=0

# Get the indices of the highest-scoring incidents in decreasing order of score;
# the loop below skips the first entry and uses the following top_k entries
top_results = np.argpartition(-total_scores, range(top_k))[0:top_k+1]

for idx in top_results[1:top_k+1]:
    #print(int(idx), data[idx], "(Score: %.4f)" % (total_scores[idx]), data['Assignee'][int(idx)])
    #print()
    days=querydata['estimated_days'][int(idx)]*total_scores[idx]+days
    scores=scores+total_scores[idx]

# Merging results from both the models
time = float((0.7)*(days/scores)+(0.3)*predicted_result_xgboost[10])

result.append({
    "issue_key":str(data['Issue key'][10]),
    "username":str(data['Assignee'][10]),
    "summary":str(data['Summary'][10]),
    "estimatedays":"%.1f"%float(time)
    })

print(result)

[{'issue_key': 'PITL1-88', 'username': 'saurypande', 'summary': 'Create Confluence page with model description & your research + ideas', 'estimatedays': '5.2'}]
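
The blending logic can be read as two weighted averages: a similarity-weighted mean of the resolution times of the nearest resolved incidents, followed by a 0.7 / 0.3 mix with the model prediction. The sketch below is a simplified NumPy version under stated assumptions. The helper name `blended_estimate` is hypothetical, and unlike the cell above it keeps the best-ranked match instead of skipping it, so its results may differ from the notebook output.

```python
import numpy as np

def blended_estimate(desc_scores, summary_scores, resolved_days, model_days,
                     top_k=3, desc_weight=0.7, neighbour_weight=0.7):
    """Blend a similarity-weighted neighbour average with a model prediction (hypothetical helper)."""
    desc_scores = np.asarray(desc_scores, dtype=float)
    summary_scores = np.asarray(summary_scores, dtype=float)
    resolved_days = np.asarray(resolved_days, dtype=float)

    # Combine description and summary similarity with the 0.7 / 0.3 weights from the notebook
    total = desc_weight * desc_scores + (1.0 - desc_weight) * summary_scores

    # Indices of the top_k highest combined scores, best first
    top = np.argsort(-total)[:top_k]

    # Similarity-weighted mean of the neighbours' resolution times
    neighbour_days = np.sum(total[top] * resolved_days[top]) / np.sum(total[top])

    return neighbour_weight * neighbour_days + (1.0 - neighbour_weight) * model_days

# Toy values for illustration only
toy_scores = [0.9, 0.1, 0.5, 0.3]
toy_days = [10, 100, 20, 40]
print(round(blended_estimate(toy_scores, toy_scores, toy_days, model_days=5.0, top_k=2), 2))
```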


# Export aggregated result as JSON

In [38]:
import json

aggregate_result=[]
for i in range(len(data)):
  top_k=3
  # Compute similarity scores of the incidents with respect to description
  desc_cos_scores = util.pytorch_cos_sim(query_embedding[i], text_embeddings)[0]
  # Compute similarity scores of the incidents with respect to summary
  cos_scores_summary = util.pytorch_cos_sim(summary_query_embedding[i], summary_text_embeddings)[0]

  # Merge both the description scores and summary scores
  total_scores=(0.7)*desc_cos_scores+(0.3)*cos_scores_summary

  # Get the indices of the highest-scoring incidents in decreasing order of score;
  # the loop below skips the first entry and uses the following top_k entries
  top_results = np.argpartition(-total_scores, range(top_k))[0:top_k+1]
  days=0
  scores=0
  for idx in top_results[1:top_k+1]:
    #print(int(idx), data[idx], "(Score: %.4f)" % (total_scores[idx]), data['Assignee'][int(idx)])
    #print()
    days=querydata['estimated_days'][int(idx)]*total_scores[idx]+days
    scores=scores+total_scores[idx]
  # Merging results from both the models
  time = float((0.7)*(days/scores)+(0.3)*predicted_result_xgboost[i])

  aggregate_result.append({
    "issue_key":str(data['Issue key'][int(i)]),
    "username":str(data['Assignee'][int(i)]),
    "summary":str(data['Summary'][int(i)]),
    "estimatedays":"%.1f"%float(time)
    })

response=json.dumps(aggregate_result)
with open("response.json", "w") as outfile:
    outfile.write(response)
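
A consumer of `response.json` can load the file back with the standard `json` module. The sketch below writes one record shaped like the printed result above to a temporary location and reads it again. The temporary path and the `estimates` dictionary are assumptions for illustration.

```python
import json
import os
import tempfile

# Record shaped like the notebook output; values copied from the printed result above
records = [{
    "issue_key": "PITL1-88",
    "username": "saurypande",
    "summary": "Create Confluence page with model description & your research + ideas",
    "estimatedays": "5.2",
}]

# Write to a temporary directory so the sketch does not overwrite the real export
path = os.path.join(tempfile.mkdtemp(), "response.json")
with open(path, "w") as outfile:
    json.dump(records, outfile, indent=2)

with open(path) as infile:
    loaded = json.load(infile)

# "estimatedays" is stored as a string, so convert it before doing arithmetic
estimates = {row["issue_key"]: float(row["estimatedays"]) for row in loaded}
print(estimates)
```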