## Overview
This notebook sets up the data representation needed to analyze resolution time characteristics.
These characteristics are illustrated through two plots:
1. A violin plot that shows the distribution and values of resolution time. This plot helps identify effects such as clustering of resolution times and outliers. It summarizes the resolution activity of the help desk during Q2 2016.
2. A cumulative distribution function that provides a probabilistic view of resolution time at the help desk. It helps administrators set SLAs: it can tell the help desk that x% of tickets will be resolved within y hours. This can be done for every group.
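
The "x% of tickets within y hours" reading of the cumulative distribution can be sketched with an empirical CDF. This is a minimal illustration on made-up numbers, not the notebook's data; the group names, the values, and the helper functions `ecdf` and `fraction_resolved_within` are assumptions introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical resolution times in hours; the real values come from the notebook's data.
tickets = pd.DataFrame({
    "assignment_group": ["network", "network", "network", "network",
                         "desktop", "desktop", "desktop", "desktop"],
    "resolution_time": [2.0, 4.0, 6.0, 8.0, 10.0, 20.0, 30.0, 40.0],
})


def ecdf(values):
    """Return the sorted values and the fraction of observations <= each value."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


def fraction_resolved_within(values, hours):
    """Empirical probability that a ticket is resolved within `hours`."""
    return float((np.asarray(values, dtype=float) <= hours).mean())


# Overall view: what share of tickets closes within 8 hours?
share_within_8h = fraction_resolved_within(tickets["resolution_time"], 8.0)

# Per-group view: the time within which 75% of each group's tickets are resolved.
p75_by_group = tickets.groupby("assignment_group")["resolution_time"].quantile(0.75)
```

Plotting the `x` and `y` arrays returned by `ecdf` as a step curve gives the cumulative distribution; the per-group quantile is one possible starting point for a group-specific SLA target.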


An additional representation provides the number of help desk tickets resolved by each group in the support desk. This is done separately.

The recipe to set up the data representation is straightforward: it involves creating the resolution time attribute, which is a feature engineering activity.
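
Although the per-group ticket count is produced separately, the idea can be sketched in a few lines. The sample frame below is hypothetical; only the column name `assignment_group` is taken from the data set.

```python
import pandas as pd

# Hypothetical sample; None stands for a ticket with no assignment group recorded.
sample = pd.DataFrame({
    "number": ["T1", "T2", "T3", "T4", "T5"],
    "assignment_group": ["network", "desktop", "network", None, "network"],
})

# Count tickets per group; dropna=False keeps unassigned tickets visible in the result.
tickets_per_group = sample["assignment_group"].value_counts(dropna=False)
```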

## Read the data

In [1]:
import pandas as pd
fp = "../../kmds/examples/q2_2016_ticket_resolution_data.csv"
df = pd.read_csv(fp)


## Verify Quality

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3807 entries, 0 to 3806
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   number            3807 non-null   object
 1   sys_created_at    3807 non-null   object
 2   closed_at         3807 non-null   object
 3   assignment_group  3380 non-null   object
dtypes: object(4)
memory usage: 119.1+ KB
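
The summary above shows 3380 non-null values out of 3807 for `assignment_group`, so 427 tickets have no group recorded. A quick way to quantify such gaps is sketched below on a hypothetical sample; the duplicate check is an extra assumption about what a quality pass might include, not something the notebook does.

```python
import pandas as pd

# Hypothetical sample with one missing assignment group.
sample = pd.DataFrame({
    "number": ["T1", "T2", "T3", "T4"],
    "assignment_group": ["network", None, "desktop", "network"],
})

# Missing values per column, as a count and as a share of all rows.
missing_count = sample.isna().sum()
missing_share = sample.isna().mean()

# Duplicate ticket identifiers would distort per-ticket statistics.
duplicate_tickets = int(sample["number"].duplicated().sum())
```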


## Define the data types

In [3]:

fpdtypes = "../../kmds/examples/ticket_resolution_dtypes.csv"
dtypes_df = pd.read_csv(fpdtypes)
dtypes_dict = {row["attribute"]: row["type"] for index, row in dtypes_df.iterrows()}
df = df.astype(dtypes_dict)
df = df.reset_index()
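
The contents of the type-mapping file are not shown in this notebook. The sketch below assumes it has one row per column with `attribute` and `type` fields, and guesses the type values from the physical types reported later by woodwork. It also builds the mapping with `dict(zip(...))`, which avoids iterating over rows.

```python
import io
import pandas as pd

# Hypothetical contents for the type-mapping file; the real file is not shown.
dtypes_csv = io.StringIO(
    "attribute,type\n"
    "number,string\n"
    "sys_created_at,datetime64[ns]\n"
    "closed_at,datetime64[ns]\n"
    "assignment_group,category\n"
)
dtypes_df = pd.read_csv(dtypes_csv)

# Build the column -> dtype mapping without iterating over rows.
dtypes_dict = dict(zip(dtypes_df["attribute"], dtypes_df["type"]))

# Hypothetical raw rows, all held as plain strings like a freshly read CSV.
raw = pd.DataFrame({
    "number": ["T1", "T2"],
    "sys_created_at": ["2016-04-01 08:00:00", "2016-04-02 09:15:00"],
    "closed_at": ["2016-04-01 12:30:00", "2016-04-03 09:15:00"],
    "assignment_group": ["network", None],
})

# Apply all conversions in one call.
typed = raw.astype(dtypes_dict)
```

Keeping the mapping in a file rather than in code means the types can be reviewed and changed without editing the notebook.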

## Feature Engineering
Define the resolution time attribute as the time between ticket creation and ticket closure, in hours:

In [4]:
# Elapsed time between creation and closure, converted from seconds to hours
df["resolution_time"] = (df["closed_at"] - df["sys_created_at"]).dt.total_seconds() / 3600

In [5]:
df["resolution_time"]

0       1546.616667
1       2285.050000
2       1544.566667
3       1545.716667
4       2282.716667
           ...     
3802     126.283333
3803    3649.166667
3804    2929.383333
3805    3648.233333
3806    1475.616667
Name: resolution_time, Length: 3807, dtype: float64
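
The values are in hours, so the first ticket's 1546.6 hours corresponds to roughly 64 days. The conversion can be checked by hand on a small example; the timestamps below are hypothetical and chosen so the arithmetic is easy to verify. The negative-duration check is an added suggestion, not part of the original notebook.

```python
import pandas as pd

# Hypothetical timestamps chosen so the result is easy to check by hand.
sample = pd.DataFrame({
    "sys_created_at": pd.to_datetime(["2016-04-01 08:00:00", "2016-04-10 09:00:00"]),
    "closed_at": pd.to_datetime(["2016-04-02 10:30:00", "2016-04-10 09:45:00"]),
})

# Timedelta -> seconds -> hours, vectorised over the whole column.
elapsed = sample["closed_at"] - sample["sys_created_at"]
sample["resolution_time"] = elapsed.dt.total_seconds() / 3600

# A ticket closed before it was created would signal a data problem.
negative_durations = int((sample["resolution_time"] < 0).sum())
```

The first row spans one day and two and a half hours (26.5 hours); the second spans 45 minutes (0.75 hours).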

## Write the representation to disk

In [6]:
fp_q2_2016 = "../../kmds/examples/example_analytics_post_data_rep1_data.csv"
df = df.drop("index", axis = 1)
df.to_csv(fp_q2_2016, index=False)


## Capturing and Tagging Metadata in Data Representations
After creating the data representation required for modelling, the metadata describing it can be captured to facilitate understanding. The [woodwork library](https://woodwork.alteryx.com/en/v0.7.1/start.html) provides this feature. The generated metadata can be reviewed and updated. Note how semantic tags for ticket creation time and ticket closing time are added to the metadata. The resulting metadata can then be published to a tool like [ckan](https://ckan.org/) for enterprise-wide dissemination.

In [7]:
import woodwork as ww
df.ww.init(name="q2_2016_itsm_data_rep1")



In [8]:
df.ww

| Column | Physical Type | Logical Type | Semantic Tag(s) |
| --- | --- | --- | --- |
| number | string | Unknown | [] |
| sys_created_at | datetime64[ns] | Datetime | [] |
| closed_at | datetime64[ns] | Datetime | [] |
| assignment_group | category | Categorical | ['category'] |
| resolution_time | float64 | Double | ['numeric'] |


In [9]:
df.ww.set_types(logical_types={
    'number': 'Categorical'
})

In [10]:
df.ww

| Column | Physical Type | Logical Type | Semantic Tag(s) |
| --- | --- | --- | --- |
| number | category | Categorical | ['category'] |
| sys_created_at | datetime64[ns] | Datetime | [] |
| closed_at | datetime64[ns] | Datetime | [] |
| assignment_group | category | Categorical | ['category'] |
| resolution_time | float64 | Double | ['numeric'] |


In [11]:
df.ww.set_types(semantic_tags={'sys_created_at':'ticket_creation_time', 'closed_at': 'ticket_closing_time'})

In [12]:
df.ww

| Column | Physical Type | Logical Type | Semantic Tag(s) |
| --- | --- | --- | --- |
| number | category | Categorical | ['category'] |
| sys_created_at | datetime64[ns] | Datetime | ['ticket_creation_time'] |
| closed_at | datetime64[ns] | Datetime | ['ticket_closing_time'] |
| assignment_group | category | Categorical | ['category'] |
| resolution_time | float64 | Double | ['numeric'] |


In [13]:
fp_q2_2016_md = "../../kmds/examples/example_analytics_data_rep1_meta_data.csv"
df.ww.to_csv(fp_q2_2016_md, index=False)
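
For readers who want to inspect or publish the column-level metadata as a plain table, one library-agnostic approach is sketched below. It uses only pandas and is not woodwork's own serialization; the sample frame and the column names of the `metadata` table (`column`, `physical_type`, `semantic_tag`) are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the notebook's data representation.
sample = pd.DataFrame({
    "sys_created_at": pd.to_datetime(["2016-04-01 08:00:00", "2016-04-02 09:00:00"]),
    "closed_at": pd.to_datetime(["2016-04-01 12:00:00", "2016-04-02 10:30:00"]),
    "resolution_time": [4.0, 1.5],
})

# Semantic tags mirroring the ones set in the notebook.
semantic_tags = {
    "sys_created_at": "ticket_creation_time",
    "closed_at": "ticket_closing_time",
}

# One row per column: its name, physical dtype, and semantic tag (empty if none).
metadata = pd.DataFrame({
    "column": list(sample.columns),
    "physical_type": [str(dtype) for dtype in sample.dtypes],
    "semantic_tag": [semantic_tags.get(col, "") for col in sample.columns],
})

# Write to an in-memory buffer here; a file path would work the same way.
buffer = io.StringIO()
metadata.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```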

## Log Data Representation Observations to KMDS Knowledge Base

In [14]:
from tagging.tag_types import DataRepresentationTags
from owlready2 import *
from utils.load_utils import *
#from utils.path_utils import *
KNOWLEDGE_BASE = "../../kmds/examples/example_analytics_kb_app_workflow.xml"

In [15]:
onto2 = load_kb(KNOWLEDGE_BASE)

In [16]:
with onto2:
    insts = Workflow.instances()
the_workflow_instance = insts[0]

In [17]:
insts

[example_analytics_kb_app_workflow.xml.ITSM modelling]

In [18]:
dr_obs_list = []
observation_count = 1

dr1 = DataRepresentationObservation(namespace=onto2)
dr1.finding = (
    "The resolution time attribute is defined. It is calculated as the time difference "
    "between closing and creation times of the ticket."
)
dr1.finding_sequence = observation_count
dr1.data_representation_observation_type = DataRepresentationTags.FEATURE_ENGG_OBSERVATION.value
dr_obs_list.append(dr1)
the_workflow_instance.has_data_representation_observations = dr_obs_list

onto2.save(file=KNOWLEDGE_BASE, format="rdfxml")
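
When more than one observation is logged, each needs its own `finding_sequence`. The numbering pattern can be sketched as below with a plain dataclass. `ObservationStub` and `build_observations` are hypothetical stand-ins written for this illustration; they are not KMDS or owlready2 classes, and the type label is a placeholder rather than the `DataRepresentationTags` value.

```python
from dataclasses import dataclass


@dataclass
class ObservationStub:
    """Hypothetical stand-in for a KMDS observation; not the real class."""
    finding: str
    finding_sequence: int
    observation_type: str


def build_observations(findings, observation_type):
    """Number the findings 1, 2, 3, ... in the order they were recorded."""
    return [
        ObservationStub(finding=text, finding_sequence=seq, observation_type=observation_type)
        for seq, text in enumerate(findings, start=1)
    ]


observations = build_observations(
    [
        "The resolution time attribute is defined.",
        "It is the difference between closing and creation times, in hours.",
    ],
    observation_type="feature engineering",  # hypothetical label, not the KMDS enum value
)
```

Deriving the sequence from the list position avoids incrementing a counter by hand for each new observation.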