In [None]:
%matplotlib inline

import re
import os.path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In this notebook, we will explore the files produced by `whylogs init` in the command line interface. This notebook was generated during that process and should include helpful metadata to ease exploration.


# Reading logs generated from whylogs CLI

Running whylogs will produce the following four files:
1. A flat summary file;
2. A histograms file;
3. A frequency file; and
4. A binary file containing the raw data objects

To interact with these files, we will read the generated profile files into pandas DataFrames.

First, let's collect needed metadata from the command line process.

In [None]:
# This cell is generated using the `whylogs init` command and may not run.
# If this notebook has not been created as a result of that command, replace commented variables
# with strings containing the appropriate values for your use case.
project_dir = "<your project dir>"
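datetime_column = None
# Assumption: the profile files sit directly in the project directory.
# Point profile_dir at the folder that actually holds them if yours differ.
profile_dir = project_dir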
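
Before reading anything, it can help to confirm which of the files listed earlier are present. The sketch below is only an illustration: `find_profile_files` is a hypothetical helper, not part of whylogs, and the file names are the ones used by the read calls in this notebook. The binary file is left out because its name is not given here. The demo runs against a temporary directory so it does not depend on your project.

```python
import os.path
import tempfile

# File names taken from the read calls in this notebook.
# The binary file is omitted because its name is not stated here.
EXPECTED_FILES = [
    "summary_summary.csv",
    "summary_histogram.json",
    "summary_strings.json",
]


def find_profile_files(profile_dir):
    """Return a dict mapping each expected file name to True if it exists."""
    return {
        name: os.path.isfile(os.path.join(profile_dir, name))
        for name in EXPECTED_FILES
    }


# Demo on a throwaway directory that contains only the flat summary file.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "summary_summary.csv"), "w").close()
    found = find_profile_files(demo_dir)

print(found)
```

To check a real profile, call `find_profile_files` with the directory that holds your own output.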

whylogs calculates and displays a number of metrics for the data that passes through. The carefully chosen metrics balance efficient storage and in-depth analysis of your data.

In [None]:
# TODO: After generated cell, must confirm with Andy's CLI experience
flat_summary = pd.read_csv(os.path.join(profile_dir, "summary_summary.csv"))
flat_summary

The flat summary file contains a summary of each variable of the dataset. It contains metrics that include descriptive statistics as well as metrics specifically for numeric, text, and categorical variables.

Let's look at the available variables from the dataset that are logged in the profile's flat_summary.

In [None]:
# Print available variables for flat_summary
print(flat_summary["column"].unique())

Choose one variable to do a deep dive.

In [None]:
# Filter flat_summary to the desired variable
variable = "mths_since_last_record"
data = flat_summary[flat_summary["column"]==variable]

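The inferred type metrics can reveal errors introduced earlier in the process. For example, a variable you expect to be numeric but whose values are mostly inferred as strings suggests a parsing or formatting problem.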

In [None]:
# Print the data type inferred for this variable
print("Inferred data type:", data["inferred_dtype"].values)

Let's look at some metrics that hold type count information.

In [None]:
# List all type count metrics
regex = re.compile("type_(.*)_count")
metrics = list(filter(regex.match, flat_summary.columns))
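
To see what this filter keeps, here is a small standalone sketch using made-up column names (they are illustrative and may not match your summary exactly). It also uses the capture group to pull out the bare type name, which can make shorter axis labels.

```python
import re

# Hypothetical column names, for illustration only.
columns = [
    "column",
    "count",
    "type_integral_count",
    "type_string_count",
    "type_unknown_count",
    "max",
]

pattern = re.compile(r"type_(.*)_count")

# match() anchors at the start of the string, so plain "count" is not selected.
type_metrics = [c for c in columns if pattern.match(c)]

# The capture group holds the part between "type_" and "_count".
type_labels = [pattern.match(c).group(1) for c in type_metrics]

print(type_metrics)
print(type_labels)
```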

You can display this information with whichever visualization tool you prefer. Below is a simple bar chart created in `matplotlib`.

In [None]:
# Display all type count metrics using matplotlib
x = list(range(len(metrics)))
fig, ax = plt.subplots()
ax.bar(x, np.squeeze(data[metrics].values))
ax.set_title("Type counts for " + variable)
ax.set_ylabel("Count")
ax.set_xticks(x)
ax.set_xticklabels(metrics, rotation=-30, horizontalalignment="left")
plt.show()

In addition to the type metrics, there are loads of other useful metrics in the whylogs summaries. These include but are not limited to descriptive statistics, estimations with error intervals, and metrics related to missing values.

In [None]:
metrics = flat_summary.columns
print(metrics)
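
With this many columns, a single variable is often easier to read as a vertical list of only the metrics that are populated. The sketch below shows one way to do that on a made-up DataFrame; `demo_summary` and its column names are stand-ins for illustration, not the real summary.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for a flat summary; column names are illustrative.
demo_summary = pd.DataFrame({
    "column": ["age", "city"],
    "count": [100, 100],
    "mean": [41.5, np.nan],
    "max": [90.0, np.nan],
})

# Select one variable's row as a Series, then drop metrics with no value.
row = demo_summary[demo_summary["column"] == "city"].iloc[0]
populated = row.dropna()

print(populated)
```

The same two lines should work on `flat_summary` once you substitute a variable name from your own data.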

There are many more visualizations one might generate from the flat_summary file.

Let's move on to the histogram file. The histogram file contains information for numeric variables that allows us to create histograms and analyze distributions.

We'll pick a variable and plot its histogram.

In [None]:
histograms = pd.read_json(os.path.join(profile_dir, "summary_histogram.json"))

In [None]:
# Print valid variables for histograms
print(histograms.keys())

In [None]:
# Choose the variable whose histogram we want to inspect
variable = "mths_since_last_record"

We can now plot the histogram from its bin edges and counts.

In [None]:
# See one of the inspected histograms
bins = histograms[variable]['bin_edges']
n = histograms[variable]['counts']
bin_width = np.diff(bins)

plt.bar(bins[0:-1], n, bin_width, align='edge')
plt.title("Histogram for "+variable)
plt.ylabel("Count")
plt.show()
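
Bin edges and counts also support rough numeric summaries without the raw data. As a minimal sketch on made-up numbers, the code below approximates the mean from bin midpoints and converts counts to proportions. Treating every value as sitting at its bin midpoint is an assumption, so the result is only an estimate.

```python
import numpy as np

# Made-up histogram data: four bins need five edges.
bin_edges = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
counts = np.array([5, 15, 20, 10])

# Midpoint of each bin, used as a stand-in for the values inside it.
centers = (bin_edges[:-1] + bin_edges[1:]) / 2

# Count-weighted average of the midpoints approximates the mean.
approx_mean = np.average(centers, weights=counts)

# Share of observations that fall in each bin.
proportions = counts / counts.sum()

print(approx_mean)
print(proportions)
```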

Finally, we have more detailed information on the frequencies of many variables in the dataset. These can be accessed through the generated frequencies file.

In [None]:
# The frequencies file is JSON, so read it with read_json rather than read_csv
frequencies = pd.read_json(os.path.join(profile_dir, "summary_strings.json"))
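
Once the frequencies are loaded, a common next step is to rank the most frequent values of a variable. The layout of the frequencies file is not described here, so the sketch below assumes, purely for illustration, that one variable's frequencies can be reduced to a mapping from value to count. `demo_counts` is made-up data.

```python
import pandas as pd

# Hypothetical value counts for one variable; the real file layout may differ.
demo_counts = {"RENT": 120, "MORTGAGE": 200, "OWN": 45, "OTHER": 3}

# Sort from most to least frequent and keep the top three values.
top = pd.Series(demo_counts).sort_values(ascending=False).head(3)

print(top)
```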