In [None]:
%matplotlib inline

import re
import os.path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reading logs generated from WhyLogs CLI

Running WhyLogs will produce the following four files:
1. A flat summary file;
2. A histograms file;
3. A frequency file; and
4. A binary file containing the raw data objects

We'll first explore the files generated by the command line interface by downloading them and reading them into Pandas dataframes.

In [None]:
# GENERATED CELL, assumes files have expected names directly inside of stated directory
summary_dir = ###SUMMARY_DIR###
data_dir = ###DATA_DIR###
data_file = ###DATA_FILE###

WhyLogs calculates and displays a number of metrics for the data that passes through. The carefully chosen metrics balance efficient storage and in-depth analysis of your data.

In [None]:
# Must confirm with Andy's CLI experience
flat_summary = pd.read_csv(os.path.join(summary_dir, ""))
flat_summary

The flat summary file contains one row per variable in the dataset, with metrics specific to numeric, text, and categorical variables. The inferred variable types themselves can tell us a lot about errors that may occur in the process.

For example, in the packaged loan dataset, the *mths_since_last_record* variable has 74% of records in the majority variable type. We can see further detail with the type count variables that we'll view below.

First, we'll need to pull a particular variable's row from the summary data. We'll then display a few metrics related to data types.

In [None]:
# List all type count metrics (a list, so it can be reused below)
regex = re.compile("type_(.*)_count")
metrics = [col for col in flat_summary.columns if regex.match(col)]

# Filter to the desired variable
data = flat_summary[flat_summary["index"]=="mths_since_last_record"]

# Print the inferred data type reported for this variable
print("Inferred data type:", data["inferred_dtype"].iloc[0])

# Display all type count metrics as a bar chart, one bar per type
x = list(range(len(metrics)))
plt.bar(x, data[metrics].iloc[0])
plt.xticks(x, metrics, rotation=45, ha="right")
plt.show()
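
To make the "majority type" idea concrete, here is a minimal sketch on made-up counts. The dataframe below is a hypothetical stand-in for the flat summary: the column names follow the `type_<name>_count` pattern used above, but the exact columns and values in a real WhyLogs summary may differ.

```python
import re

import pandas as pd

# Hypothetical flat summary with made-up counts; real column names may differ
flat_summary_example = pd.DataFrame(
    {
        "index": ["mths_since_last_record", "fico_range_high"],
        "type_null_count": [26, 0],
        "type_integral_count": [74, 100],
        "type_string_count": [0, 0],
    }
)

# Select every column that follows the type_<name>_count pattern
type_count_pattern = re.compile(r"type_(.*)_count")
type_columns = [
    col for col in flat_summary_example.columns if type_count_pattern.match(col)
]

# Pull the single row for one variable as a Series of counts
row = flat_summary_example.loc[
    flat_summary_example["index"] == "mths_since_last_record", type_columns
].iloc[0]

# Share of records that fall in the most common type
majority_type = row.idxmax()
majority_ratio = row.max() / row.sum()
print(majority_type, f"{majority_ratio:.0%}")
```

A low majority share like this one is a hint that a column mixes types or has many missing values, which is worth checking before any modeling.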

In addition to the type metrics, the WhyLogs summaries include many other useful metrics, such as descriptive statistics, estimates with error intervals, and metrics related to missing values.

The histogram file contains information for numeric variables that allows us to create histograms and analyze their distributions.

We'll grab the data for the *fico_range_high* variable and plot it using `matplotlib`.

In [None]:
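import json

# Histogram summaries are stored as JSON, so load them with the json module
with open(os.path.join(summary_dir, "")) as f:
    histograms = json.load(f)

# See one of the inspected histograms
bins = histograms['fico_range_high']['bin_edges']
n = histograms['fico_range_high']['counts']
bin_width = np.diff(bins)

# Draw each bar from its left edge so the bars line up with the bins
plt.bar(bins[0:-1], n, bin_width, align='edge')
plt.show()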
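
The plotting call above relies on the usual relationship between bin edges and counts: N bins are described by N + 1 edges. The sketch below illustrates this with `numpy.histogram` on made-up values. It is not how WhyLogs computes its histograms, only a way to see why the left edges (`bins[0:-1]`) and `np.diff(bins)` are the right inputs for the bar chart.

```python
import numpy as np

# Made-up scores standing in for a numeric column
rng = np.random.default_rng(0)
values = rng.normal(loc=700, scale=30, size=1000)

# np.histogram returns the per-bin counts and the bin edges
counts, bin_edges = np.histogram(values, bins=10)

# Left edge and width of each bin, as used by plt.bar(..., align='edge')
left_edges = bin_edges[:-1]
bin_widths = np.diff(bin_edges)

# Binned data still supports rough statistics: approximate the mean
# by weighting each bin midpoint with its count
midpoints = left_edges + bin_widths / 2
approx_mean = np.average(midpoints, weights=counts)
print(len(counts), len(bin_edges))
```

Because every value lies within half a bin width of its bin's midpoint, the approximate mean cannot be off by more than half the widest bin.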

Finally, we have more detailed information on the frequencies of many variables in the dataset. These can be accessed through the generated frequencies file.

In [None]:
import json
with open(os.path.join(summary_dir, "")) as f:
    frequencies = json.load(f)  # frequency summaries are stored as JSON

# Generating logs from WhyLogs Python library

These same files can also be produced with the `whylabs` Python library, which also includes tools to send its logs to other outputs, such as `stdout`.

To do so, we will import the WhyLogs library, initialize a logging session with WhyLogs, read our raw data, and pass this data to our session.

In [None]:
from whylabs.logs import get_or_create_session, get_logger

In [None]:
session = get_or_create_session(
    output_to_disk=True
)

In [None]:
logger = get_logger()

In [None]:
data = pd.read_csv(os.path.join(data_dir, data_file))
data

Now that we have the raw data, we can pass it into the WhyLogs logger. We can pass a label "test.data" along with the dataset for future reference.

When we capture the logger response, we can interact with the generated profiles.

In [None]:
# Log the dataframe read above under the label 'test.data'
response = logger.log_dataframe(data, 'test.data')
profile = response['profile']
summary = profile.flat_summary()

The flat summary, histograms, and frequency information can be found inside this summary object.

In [None]:
flat_summary = summary['summary']
flat_summary

In [None]:
histograms = summary['hist']
histograms

In [None]:
frequencies = summary['freq']
frequencies
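
If you want to keep these pieces around outside the session, one option is to write them out yourself. The sketch below uses a hypothetical stand-in for the summary object (a dataframe under `'summary'` and plain dictionaries under `'hist'` and `'freq'`) and illustrative file names. The real objects and the file names WhyLogs writes may differ.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the object returned by profile.flat_summary()
summary_example = {
    "summary": pd.DataFrame({"index": ["fico_range_high"], "count": [4]}),
    "hist": {
        "fico_range_high": {"bin_edges": [660.0, 700.0, 740.0], "counts": [1, 3]}
    },
    "freq": {"grade": {"A": 3, "B": 1}},
}

# File names below are illustrative, not the names WhyLogs uses
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "flat_summary_example.csv")
hist_path = os.path.join(out_dir, "histograms_example.json")

# Tabular summary goes to CSV, nested histogram data goes to JSON
summary_example["summary"].to_csv(csv_path, index=False)
with open(hist_path, "w") as f:
    json.dump(summary_example["hist"], f)

# Read both files back to confirm the round trip
flat_roundtrip = pd.read_csv(csv_path)
with open(hist_path) as f:
    hist_roundtrip = json.load(f)
```

CSV suits the flat summary because it is a plain table, while the histogram and frequency data are nested and fit JSON more naturally.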

## Additional options for our WhyLogs session
We chose the simplest configuration above, but there are a number of convenient options that can be set.

**Cloud storage:** You may set an AWS S3 bucket to have these logs automatically pushed to the cloud. You must have valid AWS configuration settings to do so.

**Binary file:** By default, we produce a binary file that contains the raw objects used to summarize the data passed in. This is listed under the *output_protobuf* option. Navigating this file is beyond the scope of this notebook.

**Flat and JSON summaries:** By default, we produce a flat summary in the CSV format along with histogram and frequency summaries in the JSON format.

You can see these configuration options and others paired with the session in the `session.config` object.

In [None]:
session.config

## Displaying logs and resetting the session

There is also a convenience function to send the internal Python logs to stdout.

\#TODO Does this replicate the `output_to_stdout` functionality for the session?

In [None]:
from whylabs.logs import display_logging
display_logging('debug')
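
For readers unfamiliar with what "sending internal Python logs to stdout" means, the sketch below shows the general idea using only the standard `logging` module. The `stream_logger` helper is hypothetical and written for illustration. It is not the implementation of `display_logging`, which may behave differently.

```python
import logging
import sys


def stream_logger(name, level="debug", stream=None):
    """Hypothetical helper: route a named logger's records to a stream."""
    log = logging.getLogger(name)
    # Translate a level name such as 'debug' into logging.DEBUG
    log.setLevel(getattr(logging, level.upper()))
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    log.addHandler(handler)
    # Avoid duplicate lines if the root logger also has handlers
    log.propagate = False
    return log


example_logger = stream_logger("example.session")
example_logger.debug("profile written")
```

Records below the chosen level are dropped, so passing `'info'` instead of `'debug'` would silence the debug message above.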

When you are done with your session, run the `reset_session` function.

In [None]:
from whylabs.logs.app.session import reset_session
reset_session()