# Unsupervised Learning - Clustering Analysis of Eloquence LOG File

Attached are two data sets:

1) QuoteData Spreadsheet (CRM data) outlining the quotes and quote lines with products being configured

2) Engine logs for Eloquence (10 application log files).

The candidate may use any tool of their choice, on their own laptop or ours, to mine and profile the data and present insights for both CPQ and Eloquence (both are proprietary software products). Specific areas could be:

1) the type of Components Used, Processing Time, Documents Processed, Type of File, etc. for the Eloquence data and

2) product and pricing on quote and quote lines; pricing related to account, contact, product for the CPQ related Quote Data

In summary, we want the candidate to look at the two different sets of data and provide us with meaningful data insights and whatever other useful information that we could glean from the data. Though some may view the attached case study as preparation for a data reporting role, it is not. This case study is just one of the small tools (data analysis 101) that we utilize in determining the right candidate for the role. This is, in fact, a true data analyst/scientist role and will encompass a mix of strategy, advisory and implementation (though there will be business intelligence and analytics reporting for all cloud products as part of the job profile as well).

Since the business asks for profiling and no target variable is present, this is best treated as an unsupervised learning problem, with clustering analysis as the approach.

1) Data dictionary not provided?

2) The clusters will be formed from the variables named in the business problem, e.g. Components Used from the Eloquence log file.

3) Data Standardization: convert prices of the products into percentages, so as to compare different products based on percent price.

4) Scree (elbow) plot: x = number of cluster centers (centroids), y = WCSS (within-cluster sum of squares); then identify the bend (elbow) in the plot.

5) Dimensionality Reduction PCA & LCA
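The scree plot idea in note 4 can be sketched without any clustering library. The block below is a minimal illustration, not the analysis itself: `kmeans_1d` is a hypothetical helper written for this note, and the `times` values are made-up processing times, not figures from the Eloquence logs. It computes WCSS for k = 1 to 5, which is the series one would put on the y-axis.

```python
def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm on 1-D data; returns (centers, wcss)."""
    data = sorted(values)
    # deterministic start: spread the initial centers across the sorted data
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: (x - centers[i]) ** 2)
            clusters[nearest].append(x)
        # update step: move each center to the mean of its cluster
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    # WCSS: squared distance from each point to its nearest center
    wcss = sum(min((x - c) ** 2 for c in centers) for x in data)
    return centers, wcss


# hypothetical processing times (seconds) with three visible groups
times = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 9.9, 10.2, 10.0]
wcss_by_k = {k: kmeans_1d(times, k)[1] for k in range(1, 6)}

for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 3))
```

Plotting `wcss_by_k` with `plt.plot` would give the scree plot; on this toy data the curve flattens after k = 3, which is where the bend would be read off.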


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings 
warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
# Run multiple commands and get multiple outputs within a single cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
logdf = pd.read_fwf('Engine_main_01.log.txt')
logdf.head()

Unnamed: 0,2015-05-22,"15:13:36,003",DEBUG,[Engine_main_01],[pool-2-thread-144],[cincom.eloquence.Format.reader.CSSAXFormatterHandler] imageParameters.getImagesFolder() = C:\Cincom Eloquence\EngineServer\Instances\EloqInstance01\images,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,,,,,,,,,,,
1,2015-05-22,"15:13:36,009",DEBUG,[Engine_main_01],[pool-2-thread-144],[cincom.eloquence.Format.reader.CSSAXFormatter...,,,,,
2,,,,,,,,,,,
3,2015-05-22,"15:13:36,011",DEBUG,[Engine_main_01],[pool-2-thread-144],[cincom.eloquence.Format.reader.CSSAXFormatter...,,,,,
4,,,,,,,,,,,


In [5]:
logdf.shape

(64258, 11)
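The `head()` output above suggests each log entry carries a date, a time, a level, an engine name, a thread and a logger class, followed by a message. As an alternative to fixed-width parsing, a single regular expression can split a line into those fields. This is a sketch under an assumption: the fields are taken to be separated by whitespace, which the preview does not confirm. `LOG_LINE` and `split_log_line` are names introduced here for illustration.

```python
import re

# assumed layout: date time LEVEL [engine] [thread] [logger] message
LOG_LINE = re.compile(
    r'^(?P<date>\d{4}-\d{2}-\d{2})\s+'
    r'(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+'
    r'(?P<level>[A-Z]+)\s+'
    r'\[(?P<engine>[^\]]+)\]\s+'
    r'\[(?P<thread>[^\]]+)\]\s+'
    r'\[(?P<logger>[^\]]+)\]\s*'
    r'(?P<message>.*)$'
)


def split_log_line(line):
    """Return a dict of fields, or None if the line does not fit the layout."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


# first log line from the preview, rejoined with single spaces (assumption)
sample = (r'2015-05-22 15:13:36,003 DEBUG [Engine_main_01] [pool-2-thread-144] '
          r'[cincom.eloquence.Format.reader.CSSAXFormatterHandler] '
          r'imageParameters.getImagesFolder() = '
          r'C:\Cincom Eloquence\EngineServer\Instances\EloqInstance01\images')
fields = split_log_line(sample)
print(fields['level'], fields['thread'])
```

Blank lines, which show up as all-NaN rows in the preview, simply return `None` and can be skipped.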

## Parsing Text using REGULAR EXPRESSION - re

<b> Step 1: Understand the input format </b>

In [None]:
# use the `with open(...) as file:` statement to access files
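A quick way to understand the input format is to print the first few non-blank lines of the raw file. The helper below is a small sketch; `peek_lines` is a hypothetical name, and the demo runs on a throwaway file so the block is self-contained.

```python
import os
import tempfile
from itertools import islice


def peek_lines(path, n=5):
    """Return the first n non-blank lines of a text file."""
    with open(path, 'r', errors='replace') as handle:
        non_blank = (line.rstrip('\n') for line in handle if line.strip())
        return list(islice(non_blank, n))


# demo on a temporary file with one blank line in the middle
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('first\n\nsecond\nthird\n')
preview = peek_lines(tmp.name, n=2)
os.remove(tmp.name)
print(preview)
```

For the actual log, the call would be `peek_lines('Engine_main_01.log.txt')`.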


<b> Step 2: Import the required packages </b>

In [1]:
# import REGULAR EXPRESSION package
import re

<b> Step 3: Define regular expressions </b>

In [6]:
# catch_phrases = ["component", "documents>", "FileType"]
rx_dict = {
    'component': re.compile(r'component = (?P<component>.*)\n'),
    'document': re.compile(r'document = (?P<document>.*)\n'),
    'FileType': re.compile(r'FileType = (?P<FileType>.*)\n'),
}
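Each pattern above uses a named group, so the captured value can be read back by name, and each ends in `\n`, so it only matches lines that still carry their trailing newline. The snippet below illustrates both points on a made-up line; the value `Formatter` is hypothetical and not taken from the logs.

```python
import re

component_rx = re.compile(r'component = (?P<component>.*)\n')

# a line as read from a file keeps its newline, so the pattern matches
with_newline = component_rx.search('component = Formatter\n')
print(with_newline.group('component'))

# the same text without the newline does not match
without_newline = component_rx.search('component = Formatter')
print(without_newline)
```

This matters if lines are stripped before matching, or if the last line of a file has no newline: in both cases the patterns as written would miss the line.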

<b> Step 4: Write a line parser </b>

In [7]:
def _parse_line(line):
    """Return (key, match) for the first regex in rx_dict that matches the line."""
    for key, rx in rx_dict.items():
        match = rx.search(line)
        if match:
            return key, match
    # if there are no matches
    return None, None

<b> Step 5: Write a file parser </b>

In [None]:
def parse_file(filepath):
    """Parse a log file and return the extracted fields as a DataFrame."""
    data = []  # create an empty list to collect the data
    # most recent value seen for each field (None until the first match)
    current = {'component': None, 'document': None, 'FileType': None}
    # open the file and read through it line by line
    with open(filepath, 'r') as file_object:
        for line in file_object:
            # at each line check for a match with a regex
            key, match = _parse_line(line)
            if key is None:
                continue
            # extract the matched component, document or FileType value
            current[key] = match.group(key).strip()
            # record a snapshot of the fields seen so far
            data.append(dict(current))
    # create a pandas DataFrame from the list of dicts
    data = pd.DataFrame(data, columns=['component', 'document', 'FileType'])
    # drop repeated rows so each combination appears once
    data = data.drop_duplicates().reset_index(drop=True)
    return data

<b> Step 6: Test the parser </b>