In [None]:
%matplotlib inline

import re
import os.path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reading logs generated from WhyLogs CLI

Running WhyLogs will produce the following four files:
1. A flat summary file;
2. A histograms file;
3. A frequency file; and
4. A binary file containing the raw data objects.

We'll start by exploring the files generated by the command line interface. Once downloaded, they can be read into pandas as DataFrames.

In [None]:
# This cell is generated using the `whylogs init` command and will not run as-is.
# If this notebook has not been created as a result of that command, replace commented variables
# with strings containing the appropriate values for your use case.
project_dir = ###PROJECT_DIR###

WhyLogs calculates and displays a number of metrics for the data that passes through it. These metrics are chosen to balance efficient storage with in-depth analysis of your data.

In [None]:
# TODO: After generated cell, must confirm with Andy's CLI experience
flat_summary = pd.read_csv(os.path.join(project_dir, ""))
flat_summary

The flat summary file contains one row per variable in the dataset, with metrics specific to numeric, text, and categorical variables. The inferred variable types themselves can tell us a lot about errors that may occur in the process.

For example, in the packaged loan dataset, 74% of the records for the *mths_since_last_record* variable fall into the majority variable type. The type count metrics that we'll view below give further detail.
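
To make that kind of percentage concrete, here is a minimal sketch of how a majority-type share can be derived from type counts. The toy DataFrame, its column names, and its counts are hypothetical; they only follow the `type_*_count` naming pattern used in this notebook, and whether WhyLogs computes its own ratio in exactly this way is an assumption.

```python
import re

import pandas as pd

# Toy stand-in for a flat summary; the column names and counts are
# hypothetical and chosen only to match the "type_*_count" pattern.
toy_summary = pd.DataFrame({
    "column": ["mths_since_last_record"],
    "type_null_count": [26],
    "type_integral_count": [74],
    "type_string_count": [0],
})

# Select the type count columns by name.
pattern = re.compile(r"type_(.*)_count")
type_columns = [c for c in toy_summary.columns if pattern.match(c)]

# Share of records that fall into the most common type.
counts = toy_summary.loc[0, type_columns].astype(int)
majority_type = counts.idxmax()
majority_share = counts.max() / counts.sum()

print(majority_type, round(majority_share, 2))
```

A share well below 100% suggests that a noticeable portion of the records are nulls or values of an unexpected type, which is worth a closer look.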

First, we'll need to pull a particular variable's row from the summary data. We'll then display a few metrics related to data types.

In [None]:
# List all type count metrics
regex = re.compile("type_(.*)_count")
metrics = list(filter(regex.match, flat_summary.columns))

In [None]:
# Filter to the desired variable
variable = "mths_since_last_record"
data = flat_summary[flat_summary["column"]==variable]

# Print the inferred data type
print("Inferred data type:", data["inferred_dtype"].values)

In [None]:
# Display all type count metrics using matplotlib
x = np.arange(len(metrics))
fig, ax = plt.subplots()
plt.bar(x, np.squeeze(data[metrics].values))
plt.title("Type counts for "+variable)
plt.ylabel("Count")
plt.xticks(x, metrics)
plt.setp(ax.get_xticklabels(), rotation=-30, horizontalalignment='left')
plt.show()

In addition to the type metrics, there are loads of other useful metrics in the WhyLogs summaries. These include but are not limited to descriptive statistics, estimations with error intervals, and metrics related to missing values.

The histogram file contains information for numeric variables that allows us to create histograms and analyze their distributions.

We'll grab the data for the *fico_range_high* variable and plot it.

In [None]:
histograms = pd.read_csv(os.path.join(project_dir, ""))

In [None]:
# See one of the inspected histograms
variable = "fico_range_high"
bins = histograms[variable]['bin_edges']
n = histograms[variable]['counts']
bin_width = np.diff(bins)

plt.bar(bins[0:-1], n, bin_width, align='edge')
plt.title("Histogram for "+variable)
plt.ylabel("Count")
plt.show()
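
The plotting cell above relies on the usual relationship between bin edges and counts: there is one more edge than there are bins, the left edges position the bars, and `np.diff` gives each bar's width. The sketch below illustrates this with `np.histogram` on synthetic values. The data is made up, and the assumption that the WhyLogs histogram file follows the same edge/count layout is inferred only from the `bin_edges` and `counts` keys used above.

```python
import numpy as np

# Synthetic scores; these values are made up for illustration only.
rng = np.random.default_rng(0)
scores = rng.normal(loc=700, scale=30, size=1_000)

counts, bin_edges = np.histogram(scores, bins=10)

# There is always one more edge than there are bins.
assert len(bin_edges) == len(counts) + 1

# Bar positions and widths, as they would be passed to
# plt.bar(left_edges, counts, bin_width, align='edge').
left_edges = bin_edges[:-1]
bin_width = np.diff(bin_edges)

print(len(left_edges), len(bin_width), counts.sum())
```

Passing the left edges together with `align='edge'` keeps each bar inside its own bin, so bins of unequal width would still be drawn correctly.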

Finally, we have more detailed information on the frequencies of many variables in the dataset. These can be accessed through the generated frequencies file.

In [None]:
frequencies = pd.read_csv(os.path.join(project_dir, ""))