# Project overview

**• Domain :** IT Ticketing system

**• Problem statement :**
One of the key activities of any IT function is to “keep the lights on” so that business operations are not disrupted. IT relies on the Incident Management process to achieve this objective.

An incident is an unplanned interruption to an IT service, or a reduction in the quality of an IT service, that affects users and the business.

The main goal of the Incident Management process is to provide a quick fix, workaround or solution that resolves the interruption and restores the service to full capacity, so that there is no business impact.

In most organizations, incidents are created by business and IT users, by end users and vendors who have access to the ticketing system, and by integrated monitoring systems and tools.

Assigning each incident to the appropriate person or unit in the support team is critical: it improves user satisfaction and makes better use of support resources.

The assignment of incidents to the appropriate IT groups is still a manual process in many IT organizations. Manual assignment is time-consuming and requires human effort. Human error leads to misaddressed incidents, which wastes resources. Manual assignment also increases response and resolution times, which lowers user satisfaction and results in poor customer service.

Currently, incidents are created by various stakeholders (business users, IT users and monitoring tools) within the IT Service Management tool and are assigned to Service Desk teams (L1 / L2 teams). These teams review each incident for correct categorization and priority, then carry out an initial diagnosis to see whether they can resolve it. About 54% of incidents are resolved by the L1 / L2 teams.

Before assigning an incident to a functional team, L1 / L2 needs to review Standard Operating Procedures (SOPs); at least ~25-30% of incidents need an SOP review before assignment. About 15 minutes are spent on SOP review for each such incident. At least ~1 FTE of effort is needed just for assigning incidents to L3 teams.

When L1 / L2 teams assign incidents to functional groups, incidents are often routed to the wrong group: about 25% of incidents are wrongly assigned. Functional teams then need additional effort to reassign them to the right group. Meanwhile, some incidents wait in a queue and are not addressed in time, resulting in poor customer service.

**• DATA DESCRIPTION :** 
Each ticket has 
1.   Short Description - Short description about ticket
2.   Description - Problem explained in detail by user
3.   Caller - User for whom ticket is created
4.   Assignment group - IT group to which ticket needs to be assigned 


**• PROJECT OBJECTIVE :**  Using NLP-based AI techniques, build a classifier that automatically assigns incidents to the right functional groups. This can help organizations reduce issue resolution time and let support teams focus on more productive tasks.

# Steps and Tasks

## 1. Import and analyse the data set.

In [1]:
import pandas as pd # read data file, data processing
import numpy as np # linear algebra
import matplotlib.pyplot as plt # plotting graph for EDA , Metrics analysis
%matplotlib inline
import seaborn as sns # plotting graph for EDA , Metrics analysis

### Load the data 

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# The input data file has already been processed for:
# 1. Carriage-return markers such as '_x000D_' and \n
# 2. Accented encoding character like äº§å“æ‰€åœ¨ä»“åº“å‡ºé”™ã€ , è¿žæŽ¥åŽè‡ªåŠ¨æ–­å¼€ï¼Œæ
# 3. Translation of non-English text (mainly German, Italian and French) into English
# Steps 1-3 are done separately, and their output is used for further processing in Part 2.
# 4. Update of assignment groups: groups with few tickets are merged into a single "others" group
# 5. Keeping only English text after translation, and removing extra spaces
# 6. Treatment of null values
# Steps 4-6 are done in Part 2, and the processed data is stored in input_data_trans_preprocess.csv

data_dir = "/content/drive/MyDrive/AIML/projects/Capstone-NLP-Ticketing/"
data_file_name='input_data_trans_preprocess.csv'
data_file_path = data_dir+data_file_name
data_file_path

'/content/drive/MyDrive/AIML/projects/Capstone-NLP-Ticketing/input_data_trans_preprocess.csv'
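
The preprocessing itself happens outside this notebook, so the following is only a minimal sketch of what steps 1 and 4 could look like, run on toy data. The helper names `clean_text` and `group_rare_labels`, the threshold `min_count=2` and the label `"GRP_others"` are illustrative assumptions, not the values used to produce `input_data_trans_preprocess.csv`.

```python
import pandas as pd

# Toy data standing in for the raw ticket export (hypothetical values).
raw = pd.DataFrame({
    "Description": [
        "printer not working_x000D_\nplease help",
        "vpn  down",
        "reset password",
        "erp error",
    ],
    "Assignment group": ["GRP_0", "GRP_0", "GRP_0", "GRP_7"],
})


def clean_text(text: str) -> str:
    """Replace '_x000D_' markers and newlines with spaces, then collapse whitespace."""
    text = text.replace("_x000D_", " ").replace("\n", " ")
    return " ".join(text.split())


def group_rare_labels(labels: pd.Series, min_count: int, other_label: str) -> pd.Series:
    """Map every label seen fewer than min_count times to other_label."""
    counts = labels.value_counts()
    rare = counts[counts < min_count].index
    return labels.where(~labels.isin(rare), other_label)


raw["Description"] = raw["Description"].apply(clean_text)
# min_count=2 and "GRP_others" are assumptions made for this example only.
raw["Assignment group"] = group_rare_labels(
    raw["Assignment group"], min_count=2, other_label="GRP_others")
print(raw)
```

Merging sparse groups trades some routing precision for more training examples per class, so the threshold is worth tuning against the evaluation metrics.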

In [4]:
#df_data = pd.read_excel(data_file_path)
df_data = pd.read_csv(data_file_path)

In [5]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8467 entries, 0 to 8466
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Short description       8467 non-null   object
 1   Description             8467 non-null   object
 2   Caller                  8467 non-null   object
 3   Assignment group        8467 non-null   object
 4   orig_desc               8466 non-null   object
 5   orig_short_desc         8459 non-null   object
 6   Lang                    8467 non-null   object
 7   Translated_ShortDesc    8450 non-null   object
 8   Translated_Description  8467 non-null   object
 9   orig_assign_group       8467 non-null   object
dtypes: object(10)
memory usage: 661.6+ KB


**Merging both Description and Short description**

In [6]:
df_data['Desc_All'] = df_data['Short description'] + ' '+ df_data['Description']
# Strip unwanted spaces
df_data['Desc_All'] = df_data['Desc_All'].apply(lambda x: x.strip())

In [7]:
# Import stop words list from NLTK
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords # Import stop words
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [12]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [13]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

● Text preprocessing, including lemmatization

In [14]:

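from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess_vocab(df_column):
    """Return (corpus, cleaned_column) for a Series of ticket texts.

    corpus: one token list per ticket, with stop words removed, tokens of
            fewer than 3 characters dropped, and the rest lemmatized.
    cleaned_column: the same Series with stop words removed only
                    (not lemmatized), joined back into strings.
    """
    stop_words = set(stopwords.words('english'))
    lem = WordNetLemmatizer()
    corpus = []
    for ticket in df_column:
        # Tokenize and drop stop words
        words = [w for w in word_tokenize(ticket) if w not in stop_words]
        # Lemmatize, keeping only tokens longer than 2 characters
        words = [lem.lemmatize(w) for w in words if len(w) > 2]
        corpus.append(words)

    # Stop-word removal on the raw strings; this column is used for model training
    df_column = df_column.apply(
        lambda x: ' '.join(word for word in x.split() if word not in stop_words))

    return corpus, df_column

tickets_list, df_column = preprocess_vocab(df_data['Desc_All'])
df_data['desc_processed'] = df_column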


In [15]:
print(tickets_list[0])
print(df_data['Desc_All'][0])

['login', 'issue', 'verified', 'user', 'detail', 'employee', 'manager', 'name', 'checked', 'user', 'name', 'reset', 'password', 'advised', 'user', 'login', 'check', 'caller', 'confirmed', 'able', 'login', 'issue', 'resolved']
login issue verified user details employee manager name checked the user name in ad and reset the password advised the user to login and check caller confirmed that he was able to login issue resolved


## Train a simple ML Model - Logistic Regression

In [16]:
X = df_data['desc_processed'] 
y = df_data['Assignment group'].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

In [17]:
print(X.size)
print(X[0])
print(y.size)
print(y[0])

8467
login issue verified user details employee manager name checked user name ad reset password advised user login check caller confirmed able login issue resolved
8467
GRP_0


In [187]:
# convert X_train to BOW values 

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(binary=True)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

In [20]:
from sklearn import preprocessing
from tensorflow.keras.utils import to_categorical

le = preprocessing.LabelEncoder()
le.fit(y)
y_train_mdl_lbl_enc = le.transform(y_train)
y_train_mdl_cat = to_categorical(y_train_mdl_lbl_enc)
y_test_mdl_lbl_enc = le.transform(y_test)
y_test_mdl_cat = to_categorical(y_test_mdl_lbl_enc)

In [21]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs', max_iter=250)
clf = OneVsRestClassifier(clf)

In [22]:
clf.fit(X_train_bow, y_train_mdl_cat)

OneVsRestClassifier(estimator=LogisticRegression(max_iter=250))

In [23]:
y_pred_bow = clf.predict(X_test_bow)

In [24]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report

actual = y_test_mdl_cat
predicted = y_pred_bow

print('Accuracy score: ', accuracy_score(actual, predicted))
print("precision_weighted:", precision_score(actual, predicted,average='weighted', zero_division=1 ))
print("recall_weighted:", recall_score(actual, predicted,average='weighted', zero_division=1 ))
print("f1_weighted:", f1_score(actual, predicted,average='weighted', zero_division=1 ))
print("Classification Report:")
print(classification_report(y_test_mdl_cat, y_pred_bow,zero_division=1))

Accuracy score:  0.5667060212514758
precision_weighted: 0.8221614218035566
recall_weighted: 0.5767414403778041
f1_weighted: 0.621616558459306
Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.85      0.85       832
           1       1.00      0.40      0.57         5
           2       1.00      0.43      0.60        21
           3       1.00      0.00      0.00         4
           4       0.81      0.45      0.58        47
           5       0.42      0.20      0.27        25
           6       0.60      0.12      0.21        24
           7       1.00      0.25      0.40         4
           8       1.00      0.06      0.12        16
           9       0.89      1.00      0.94        17
          10       1.00      0.36      0.53        22
          11       0.82      0.20      0.32        46
          12       0.84      0.39      0.53        41
          13       1.00      0.00      0.00         7
          14       1.00 
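
Accuracy (0.57) sitting well below weighted precision (0.82) is consistent with how scikit-learn handles a one-hot target. Given a 2-D indicator matrix, `OneVsRestClassifier` treats the task as multilabel: each binary classifier decides independently, a ticket can receive zero or several groups, and `accuracy_score` then counts a row as correct only if the whole row matches. The sketch below, on toy data with hypothetical group names, contrasts that with fitting on integer labels, where every ticket gets exactly one group. It is an illustration of the two modes, not a re-run of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Toy tickets and groups (hypothetical values).
texts = [
    "login issue password reset", "unable to login account locked",
    "password expired reset", "printer not printing",
    "printer paper jam", "printer offline error",
    "vpn connection drops", "vpn not connecting", "cannot connect vpn",
]
groups = ["GRP_0"] * 3 + ["GRP_1"] * 3 + ["GRP_2"] * 3

X = CountVectorizer(binary=True).fit_transform(texts)

# Multilabel mode: a one-hot (indicator) target, as produced by to_categorical.
y_onehot = LabelBinarizer().fit_transform(groups)
ovr_multilabel = OneVsRestClassifier(LogisticRegression(max_iter=250))
ovr_multilabel.fit(X, y_onehot)
pred_onehot = ovr_multilabel.predict(X)        # indicator matrix
labels_per_row = pred_onehot.sum(axis=1)       # can be 0, 1 or more per ticket

# One way to force a single group from the multilabel model: take the argmax.
pred_argmax = ovr_multilabel.predict_proba(X).argmax(axis=1)

# Multiclass mode: integer labels, exactly one predicted group per ticket.
le = LabelEncoder()
y_int = le.fit_transform(groups)
ovr_multiclass = OneVsRestClassifier(LogisticRegression(max_iter=250))
ovr_multiclass.fit(X, y_int)
pred_groups = le.inverse_transform(ovr_multiclass.predict(X))

print("labels predicted per ticket (multilabel):", labels_per_row)
print("predicted groups (multiclass):", list(pred_groups))
```

In the notebook, the multiclass variant would correspond to fitting on `y_train_mdl_lbl_enc` instead of `y_train_mdl_cat` and scoring against `y_test_mdl_lbl_enc`.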

# Interim Delivery checklist

1. Summary of problem statement, data and findings
Every good abstract describes briefly what was intended at the outset, and summarizes findings and implications.

2. Summary of the Approach to EDA and Pre-processing
Include any insightful visualization you have teased out of the data. If you’ve identified particularly meaningful features, interactions or summary data, share them and explain what you noticed. Visual displays are powerful when used well, so think carefully about what information the display conveys.

3. Deciding Models and Model Building
Based on the nature of the problem, decide what algorithms will be suitable and why?
Experiment with different algorithms and get the performance of each algorithm.

4. How to improve your model performance?
What approaches can you take to improve your model? Can you do some feature selection, data manipulation and model improvements?
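
One candidate improvement follows from the class supports in the classification report, which are heavily skewed towards a single group. A minimal sketch, assuming toy data and hypothetical group names, of a stratified split combined with class weighting:

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy, imbalanced data (hypothetical): 8 tickets for GRP_0, 4 for GRP_8.
texts = [
    "login issue password reset", "unable to login account locked",
    "password expired reset needed", "account locked after login attempts",
    "reset password for user", "user cannot login to portal",
    "login fails with wrong password", "unlock account and reset password",
    "job scheduler failed in batch", "batch job abended overnight",
    "scheduler job failed again", "nightly batch job did not run",
]
groups = ["GRP_0"] * 8 + ["GRP_8"] * 4

# stratify keeps the class ratio the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, groups, test_size=0.25, random_state=0, stratify=groups)

vectorizer = CountVectorizer(binary=True)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# class_weight="balanced" weights samples inversely to their class frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=250)
clf.fit(X_train_bow, y_train)
y_pred = clf.predict(X_test_bow)

print(Counter(y_train), Counter(y_test))
```

Note that `stratify` requires at least two tickets in every group, which is one more reason to merge very small groups before splitting.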


Which embedding should be used: bag of words, TF-IDF, Word2vec or GloVe?
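
As a starting point for that comparison, here is a minimal sketch contrasting the binary bag of words used above with TF-IDF on three toy tickets. The `ngram_range` and `sublinear_tf` settings are illustrative assumptions, not tuned values. Word2vec and GloVe need trained or pretrained vectors and are not shown.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy tickets (hypothetical values).
tickets = [
    "unable to login to erp",
    "password reset for erp login",
    "printer not printing",
]

# Binary bag of words: 1 if the term occurs in the ticket, else 0.
bow = CountVectorizer(binary=True)
X_bow = bow.fit_transform(tickets)

# TF-IDF with unigrams and bigrams: terms found in many tickets are down-weighted.
tfidf = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X_tfidf = tfidf.fit_transform(tickets)

# "erp" occurs in two tickets, "printer" in one, so "erp" gets the lower idf.
idf_erp = tfidf.idf_[tfidf.vocabulary_["erp"]]
idf_printer = tfidf.idf_[tfidf.vocabulary_["printer"]]

print("BoW shape:", X_bow.shape, "TF-IDF shape:", X_tfidf.shape)
print("idf(erp) =", round(idf_erp, 3), "idf(printer) =", round(idf_printer, 3))
```

Either matrix can be passed to the same classifier, so the two representations can be compared on identical train and test splits.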


Which models should be tried: classical ML classifiers, LSTM, or state-of-the-art models such as BERT and XLNet?
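
For the classical ML candidates, a minimal sketch of a like-for-like comparison using cross-validation on toy data. The three models, `cv=2` and the `f1_weighted` metric are assumptions chosen for illustration; LSTM, BERT and XLNet need a deep learning framework and are out of scope here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy tickets and groups (hypothetical values), four tickets per group.
texts = [
    "login issue password reset", "unable to login account locked",
    "password expired reset needed", "user cannot login to portal",
    "printer not printing", "printer paper jam",
    "printer offline error", "cannot print from printer",
    "vpn connection drops", "vpn not connecting",
    "cannot connect to vpn", "vpn client error",
]
groups = ["GRP_0"] * 4 + ["GRP_1"] * 4 + ["GRP_2"] * 4

candidates = {
    "logistic_regression": LogisticRegression(max_iter=250),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

scores = {}
for name, model in candidates.items():
    # The vectorizer sits inside the pipeline so it is refit on each training fold.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    fold_scores = cross_val_score(pipeline, texts, groups, cv=2, scoring="f1_weighted")
    scores[name] = fold_scores.mean()

for name, score in scores.items():
    print(f"{name}: mean weighted F1 = {score:.2f}")
```

Keeping the vectorizer inside the pipeline avoids leaking vocabulary or idf statistics from the validation fold into training.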

Is there a required format for the interim report?

# Summary of problem statement, data and findings 


The corpus has about 8,500 tickets (8,467 rows in the preprocessed file loaded above), created by business and IT users, end users and vendors through the ticketing system.

The corpus has over <> words, or approximately <> and <> words per person.
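
The word counts left open above can be computed directly from the ticket text. A minimal sketch on a toy Series; in the notebook the same expressions would presumably be applied to `df_data['Desc_All']`, which is an assumption about which column the summary should describe.

```python
import pandas as pd

# Toy ticket texts (hypothetical values).
tickets = pd.Series([
    "login issue user unable to login",
    "printer not working",
    "password reset",
])

# Whitespace-separated word count for each ticket.
words_per_ticket = tickets.str.split().str.len()

total_words = int(words_per_ticket.sum())
mean_words = words_per_ticket.mean()
# Vocabulary size: number of distinct whitespace-separated tokens.
vocab_size = len(set(" ".join(tickets).split()))

print("total words:", total_words)
print("mean words per ticket:", round(mean_words, 2))
print("vocabulary size:", vocab_size)
```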

Each ticket has 
1.   Short Description - Short description about ticket
2.   Description - Problem explained in detail by user
3.   Caller - User for whom ticket is created
4.   Assignment group - IT group to which ticket needs to be assigned 


Each blog in the corpus includes at least <> occurrences of common English words. All formatting has been stripped; any exceptions, such as links within a post, are denoted by the label url link.