## Overview
This notebook prepares the data for the analytics exercise. See the [example narrative](example_narrative.md) for details of the analytics task and the process used to develop the solution. The first step in that process is exploratory data analysis, which consists of the following steps:
1. Read the raw dataset
2. Select the subset of attributes that are needed for this use case
3. Analyze the (subsetted) dataset for data quality
4. Fix attribute noise
5. Write the denoised data for the next phase in the workflow
6. Log the relevance and noise-processing details from exploratory data analysis to KMDS


## Read Data

In [1]:
import pandas as pd
fp = "../../kmds/examples/incident_event_log_02.csv"
df = pd.read_csv(fp)



## Select Columns Needed

In [2]:
df.head()

Unnamed: 0,number,incident_state,active,reassignment_count,reopen_count,sys_mod_count,made_sla,caller_id,opened_by,opened_at,...,u_priority_confirmation,notify,problem_id,rfc,vendor,caused_by,closed_code,resolved_by,resolved_at,closed_at
0,INC0000045,New,True,0,0,0,True,Caller 2403,Opened by 8,29/2/2016 01:16,...,False,Do Not Notify,,,,,code 5,Resolved by 149,29/2/2016 11:29,5/3/2016 12:00
1,INC0000045,Resolved,True,0,0,2,True,Caller 2403,Opened by 8,29/2/2016 01:16,...,False,Do Not Notify,,,,,code 5,Resolved by 149,29/2/2016 11:29,5/3/2016 12:00
2,INC0000045,Resolved,True,0,0,3,True,Caller 2403,Opened by 8,29/2/2016 01:16,...,False,Do Not Notify,,,,,code 5,Resolved by 149,29/2/2016 11:29,5/3/2016 12:00
3,INC0000045,Closed,False,0,0,4,True,Caller 2403,Opened by 8,29/2/2016 01:16,...,False,Do Not Notify,,,,,code 5,Resolved by 149,29/2/2016 11:29,5/3/2016 12:00
4,INC0000047,New,True,0,0,0,True,Caller 2403,Opened by 397,29/2/2016 04:40,...,False,Do Not Notify,,,,,code 5,Resolved by 81,1/3/2016 09:52,6/3/2016 10:00


In [3]:
SELECT_COLS = ['number', 'sys_created_at', 'closed_at', 'assignment_group']
closed_tickets = df.incident_state == "Closed"
df_closed_tickets = df[closed_tickets][SELECT_COLS].copy()
del df
df_closed_tickets = df_closed_tickets.reset_index(drop=True)
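
As a small illustration of the selection step, the sketch below filters a toy frame to closed tickets and keeps only the needed columns in one `.loc` call. The toy rows and the reduced column list are made up for the example; only the column names `number`, `incident_state` and `closed_at` come from the dataset.

```python
import pandas as pd

# Toy stand-in for the raw incident log (values are illustrative only).
toy = pd.DataFrame({
    "number": ["INC1", "INC1", "INC2"],
    "incident_state": ["New", "Closed", "Resolved"],
    "closed_at": ["5/3/2016 12:00", "5/3/2016 12:00", "6/3/2016 10:00"],
    "notify": ["Do Not Notify"] * 3,
})

cols = ["number", "closed_at"]

# Row mask and column list applied in a single indexing step.
is_closed = toy["incident_state"] == "Closed"
closed = toy.loc[is_closed, cols].reset_index(drop=True)

print(closed)
```

Each ticket appears once per state change in the raw log, so filtering on the `Closed` state leaves one row per closed ticket in this toy example.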

In [4]:
import dateutil
from dateutil.parser import parse


## Inspect Null Information in Dataset

In [5]:
df_closed_tickets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24985 entries, 0 to 24984
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   number            24985 non-null  object
 1   sys_created_at    13466 non-null  object
 2   closed_at         24985 non-null  object
 3   assignment_group  22816 non-null  object
dtypes: object(4)
memory usage: 780.9+ KB
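
In the output above, `sys_created_at` is populated for only 13,466 of the 24,985 closed tickets, and `assignment_group` for 22,816. A minimal sketch of getting the same information as counts and fractions of missing values, using a made-up toy frame rather than the real data:

```python
import pandas as pd

# Toy frame with some missing values (illustrative only).
toy = pd.DataFrame({
    "number": ["INC1", "INC2", "INC3", "INC4"],
    "sys_created_at": ["29/2/2016 01:16", None, None, "1/3/2016 09:00"],
    "assignment_group": ["Group 56", "Group 24", None, "Group 23"],
})

null_counts = toy.isna().sum()    # number of missing values per column
null_share = toy.isna().mean()    # fraction of missing values per column

print(null_counts)
print(null_share)
```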


## Noise Observations

1. This use case computes the time to resolve a ticket, that is, the time elapsed between the ticket creation and ticket closed timestamps. Only closed tickets with valid values for both timestamps are used.
2. The datetime format is inconsistent for both the ticket creation and ticket closed times. The `parse` function from the dateutil library is used to parse the times into a consistent ISO form, which is then used to calculate the time to resolution.
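
To make the format issue concrete, the sketch below parses two raw values in the style of the sample rows. By default `dateutil.parser.parse` reads an ambiguous string such as `5/3/2016` month-first (3 May), while `dayfirst=True` reads it as 5 March. That the source system writes day-first dates is an assumption suggested by values such as `29/2/2016`; this notebook does not verify it, so check it against the data before relying on either reading.

```python
from dateutil.parser import parse

opened_raw = "29/2/2016 01:16"   # unambiguous: 29 cannot be a month
closed_raw = "5/3/2016 12:00"    # ambiguous: 5 March or 3 May?

opened = parse(opened_raw, dayfirst=True)
closed_default = parse(closed_raw)                  # month-first reading
closed_dayfirst = parse(closed_raw, dayfirst=True)  # day-first reading

print(closed_default.isoformat())
print(closed_dayfirst.isoformat())

# Time to resolve under the day-first assumption.
time_to_resolve = closed_dayfirst - opened
print(time_to_resolve)
```

The two readings differ by almost two months, which matters when a later filter selects tickets by the quarter in which they were closed.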
   

## Define Noise Filter #1

In [6]:
def noise_filter_1(row):
    """Return True when both timestamps of a ticket row are present and parseable."""
    for col in ("sys_created_at", "closed_at"):
        value = row[col]
        # A missing timestamp makes the row unusable.
        if pd.isna(value):
            return False
        try:
            parse(value)
        except (ValueError, OverflowError, TypeError):
            # dateutil could not interpret the value as a date.
            return False
    return True
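
A minimal usage sketch of this kind of row filter on a toy frame. The function `timestamps_ok` is a simplified, hypothetical stand-in for `noise_filter_1`, and the rows are made up; the point is only to show which rows survive.

```python
import pandas as pd
from dateutil.parser import parse


def timestamps_ok(row):
    """Hypothetical stand-in: True when both timestamps are present and parseable."""
    for col in ("sys_created_at", "closed_at"):
        value = row[col]
        if pd.isna(value):
            return False
        try:
            parse(value)
        except (ValueError, OverflowError, TypeError):
            return False
    return True


toy = pd.DataFrame({
    "sys_created_at": ["29/2/2016 01:16", None, "unknown"],
    "closed_at": ["5/3/2016 12:00", "5/3/2016 12:00", "6/3/2016 10:00"],
})

# One boolean per row; rows with a missing or unparseable timestamp are dropped.
mask = toy.apply(timestamps_ok, axis=1)
clean = toy[mask]
print(clean)
```

Only the first row is kept: the second has no creation time and the third has a creation value that dateutil cannot parse.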


## Apply Noise Filter #1

In [7]:
df_closed_tickets = df_closed_tickets[df_closed_tickets.apply(noise_filter_1, axis=1)]

In [8]:
df_closed_tickets["sys_created_at"] = df_closed_tickets["sys_created_at"].apply(parse) 

In [9]:
df_closed_tickets["closed_at"] = df_closed_tickets["closed_at"].apply(parse) 

## Define Noise Filter #2

In [10]:
dict_types = {"number": 'str', "sys_created_at": 'datetime64[ns]',
              "closed_at": "datetime64[ns]" , "assignment_group": 'str'}
df_closed_tickets = df_closed_tickets.astype(dict_types)
valid_closing_dates = df_closed_tickets["closed_at"] > df_closed_tickets["sys_created_at"]
q2_2016 = (df_closed_tickets.closed_at.dt.quarter == 2) & \
        (df_closed_tickets.closed_at.dt.year == 2016)
in_range_good_tickets = q2_2016 & valid_closing_dates
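
The sketch below shows how the two conditions combine on a toy frame with already-parsed timestamps. The rows are made up to cover each case: closed in Q1, a valid Q2 2016 ticket, a Q2 ticket closed before it was created, a Q2 ticket from 2015, and a Q3 ticket.

```python
import pandas as pd

toy = pd.DataFrame({
    "sys_created_at": pd.to_datetime([
        "2016-03-30 09:00", "2016-04-01 08:00", "2016-06-30 10:00",
        "2015-05-01 09:00", "2016-07-01 09:00",
    ]),
    "closed_at": pd.to_datetime([
        "2016-03-31 09:00", "2016-04-02 08:00", "2016-06-30 09:00",
        "2015-05-02 09:00", "2016-07-02 09:00",
    ]),
})

# A ticket must be closed strictly after it was created.
closes_after_creation = toy["closed_at"] > toy["sys_created_at"]

# Quarter 2 covers April through June.
in_q2_2016 = (toy["closed_at"].dt.quarter == 2) & (toy["closed_at"].dt.year == 2016)

keep = closes_after_creation & in_q2_2016
print(toy[keep])
```

Only the second row passes both conditions.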

## Apply Noise Filter #2

In [11]:

df_closed_tickets = df_closed_tickets[in_range_good_tickets].reset_index(drop=True)

In [12]:
df_closed_tickets

Unnamed: 0,number,sys_created_at,closed_at,assignment_group
0,INC0000045,2016-02-29 01:23:00,2016-05-03 12:00:00,Group 56
1,INC0000047,2016-02-29 04:57:00,2016-06-03 10:00:00,Group 24
2,INC0000062,2016-02-29 07:26:00,2016-05-03 16:00:00,Group 23
3,INC0000063,2016-02-29 07:17:00,2016-05-03 17:00:00,Group 23
4,INC0000071,2016-02-29 08:17:00,2016-06-03 11:00:00,Group 24
...,...,...,...,...
3802,INC0041652,2016-06-24 08:43:00,2016-06-29 15:00:00,Group 55
3803,INC0082574,2016-01-11 10:57:00,2016-06-11 12:07:00,Group 3
3804,INC0082685,2016-01-11 14:37:00,2016-05-12 16:00:00,Group 31
3805,INC0091855,2016-01-12 10:46:00,2016-06-12 11:00:00,Group 22


## Relevance Observations
1. Only ('number', 'sys_created_at', 'closed_at', 'assignment_group') are needed for this analysis
2. Only ticket activity in the second quarter of 2016 is needed for this analysis

## Write Report Data

In [13]:
fp = "../../kmds/examples/q2_2016_ticket_resolution_data.csv"
df_closed_tickets.to_csv(fp, index=False)

In [14]:
df_closed_tickets.columns

Index(['number', 'sys_created_at', 'closed_at', 'assignment_group'], dtype='object')

In [15]:
dtypes_meta = {"attribute": [], "type": []}
for k, v in dict_types.items():
    dtypes_meta["attribute"].append(k)
    dtypes_meta["type"].append(v)
    
df_dtypes = pd.DataFrame.from_dict(dtypes_meta)
fp_types = "../../data/ticket_resolution_dtypes.csv"
df_dtypes.to_csv(fp_types, index=False)
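
A CSV file does not carry column types, so a later phase has to reapply them when it reads the data back. The sketch below shows one way to do that with an attribute/type table like the one written above. It uses in-memory buffers and a single made-up row instead of the real files, and how the next notebook actually consumes the metadata is an assumption.

```python
import io

import pandas as pd

dict_types = {"number": "str", "sys_created_at": "datetime64[ns]",
              "closed_at": "datetime64[ns]", "assignment_group": "str"}

# In-memory stand-ins for the two CSV files (illustrative row only).
data_csv = io.StringIO(
    "number,sys_created_at,closed_at,assignment_group\n"
    "INC0000001,2016-04-01 09:00:00,2016-04-02 10:30:00,Group 1\n"
)
df_dtypes = pd.DataFrame({"attribute": list(dict_types),
                          "type": list(dict_types.values())})
types_csv = io.StringIO(df_dtypes.to_csv(index=False))

# Read the metadata back and split datetime columns from the rest.
meta = pd.read_csv(types_csv)
is_date = meta["type"].str.startswith("datetime")
date_cols = meta.loc[is_date, "attribute"].tolist()
other_types = dict(zip(meta.loc[~is_date, "attribute"], meta.loc[~is_date, "type"]))

# Datetime columns go through parse_dates; the others through dtype.
df = pd.read_csv(data_csv, dtype=other_types, parse_dates=date_cols)

resolution_time = df["closed_at"] - df["sys_created_at"]
print(df.dtypes)
print(resolution_time)
```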

## Log Exploratory Data Analysis Observations to KMDS Knowledge Base

In [16]:
from ontology.kmds_ontology import *
from tagging.tag_types import ExploratoryTags

## Note:
1. In this example, the base ontology that comes with the application (`onto`) is used as the namespace into which all observations are logged.
2. This ontology is saved to a directory in the repo. It could instead be saved to a URL or another location; adding this functionality is quite straightforward, but a directory is used here for illustration.
3. The saved ontology is used in all subsequent sessions by **loading** it and using it as the namespace to log all observations.

   If you choose to work through these examples, please take a minute to review how ontologies are saved and loaded.

In [17]:
kaw = KnowledgeApplicationWorkflow("ITSM modelling", namespace=onto)

In [18]:
exp_obs_list = []
observation_count: int = 1
e1 = ExploratoryObservation(namespace=onto)


In [19]:
from kmds.ontology.intent_types import IntentType

In [20]:
e1.finding = "Only ('number', 'sys_created_at', 'closed_at', 'assignment_group') are needed for this analysis"
e1.finding_sequence = observation_count
e1.exploratory_observation_type = ExploratoryTags.RELEVANCE_OBSERVATION.value
e1.intent = IntentType.DATA_UNDERSTANDING.value
exp_obs_list.append(e1)

In [21]:
observation_count += 1
e2 = ExploratoryObservation(namespace=onto)
e2.finding = "Only ticket activity in the second quarter of 2016 is needed for this analysis"
e2.finding_sequence = observation_count
e2.exploratory_observation_type = ExploratoryTags.RELEVANCE_OBSERVATION.value
e2.intent = IntentType.DATA_UNDERSTANDING.value
exp_obs_list.append(e2)

In [22]:
observation_count += 1
e3 = ExploratoryObservation(namespace=onto)
e3.finding = (
    "This use case computes the time to resolve a ticket, that is, the time elapsed between the ticket creation "
    "and ticket closed timestamps. Only closed tickets with valid values for both timestamps are used."
)
e3.finding_sequence = observation_count
e3.exploratory_observation_type = ExploratoryTags.DATA_QUALITY_OBSERVATION.value
e3.intent = IntentType.DATA_UNDERSTANDING.value
exp_obs_list.append(e3)

In [23]:
observation_count += 1
e4 = ExploratoryObservation(namespace=onto)
e4.finding = (
    "The datetime format is inconsistent for both the ticket creation and ticket closed times. "
    "The parse function from the dateutil library is used to parse the times into a consistent ISO form, "
    "which is then used to calculate the time to resolution."
)

e4.finding_sequence = observation_count
e4.exploratory_observation_type = ExploratoryTags.DATA_QUALITY_OBSERVATION.value
e4.intent = IntentType.DATA_UNDERSTANDING.value
exp_obs_list.append(e4)

In [24]:
kaw.has_exploratory_observations = exp_obs_list
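
The cells above repeat the same four assignments for every observation. A minimal sketch of the same pattern as a loop follows. `ObservationStub` is a hypothetical stand-in defined here so the example runs without KMDS, and the tag strings are placeholders rather than real `ExploratoryTags` or `IntentType` values. With KMDS installed, the loop body would create `ExploratoryObservation(namespace=onto)` instead.

```python
from dataclasses import dataclass


@dataclass
class ObservationStub:
    """Hypothetical stand-in for ExploratoryObservation; not part of KMDS."""
    finding: str = ""
    finding_sequence: int = 0
    exploratory_observation_type: str = ""
    intent: str = ""


# (finding text, observation type) pairs; the type strings are placeholders.
findings = [
    ("Only four columns are needed for this analysis", "relevance"),
    ("Only Q2 2016 ticket activity is needed for this analysis", "relevance"),
    ("Datetime formats are inconsistent", "data quality"),
]

observations = []
for sequence, (text, obs_type) in enumerate(findings, start=1):
    obs = ObservationStub()
    obs.finding = text
    obs.finding_sequence = sequence          # 1-based, in logging order
    obs.exploratory_observation_type = obs_type
    obs.intent = "data understanding"        # placeholder intent value
    observations.append(obs)

print([o.finding_sequence for o in observations])
```

Using `enumerate(..., start=1)` keeps the sequence numbers in step with the list order, so there is no separate counter to increment by hand.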

In [25]:
type(e4.intent)

str

In [26]:
from owlready2 import *
from utils.path_utils import get_package_kb_path
KNOWLEDGE_BASE = "example_analytics_kb_app_workflow.xml"
storage_path = "../../kmds/examples/" + KNOWLEDGE_BASE
onto.save(file=storage_path, format="rdfxml")