In [1]:
import os.path
import pandas as pd
import numpy as np

In this notebook, we will explore how to generate logs using the WhyLogs Python library. 

The resulting profile can also be produced from the command-line interface. The workflow for working with these files, along with deeper analysis and visualization examples, can be found in the `Analysis.ipynb` notebook that is generated with `whylogs init`.

# Generating logs with the WhyLogs Python library

To generate logs using Python, we will import the WhyLogs library, initialize a logging session with WhyLogs, read in our raw data from file, and pass this data to our session.

First, import the function that creates (or retrieves) a logging session.

In [5]:
from whylogs import get_or_create_session

In [6]:
session = get_or_create_session()

We will now load an example dataset from Lending Club, an online financial lending platform. The dataset is located in the package's `notebooks/` folder for now.

Feel free to use the cell below to check your working directory and point `data_file` at the correct file path.

In [7]:
print("Current working directory:", os.getcwd())

Current working directory: /Volumes/Workspace/whylogs-examples/python
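
If the CSV is not in the working directory, a small helper can look for it in a few candidate folders. This is a minimal sketch, not part of WhyLogs: the function name `find_data_file` and the list of search directories are hypothetical, and you should adjust them to your own checkout.

```python
import os
from typing import Iterable, Optional


def find_data_file(filename: str, search_dirs: Iterable[str]) -> Optional[str]:
    """Return the first existing path to `filename` among `search_dirs`, or None."""
    for directory in search_dirs:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


# Hypothetical search locations: the working directory and a notebooks/ subfolder.
candidates = [os.getcwd(), os.path.join(os.getcwd(), "notebooks")]
found = find_data_file("lending_club_1000.csv", candidates)
print(found if found else "lending_club_1000.csv not found; adjust the search paths")
```

The returned path, when one is found, can be assigned to `data_file` in the next cell.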


In [8]:
data_file = "lending_club_1000.csv"

In [9]:
data = pd.read_csv(data_file)
data

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,90671227,,4800.0,4800.0,4800.0,36 months,13.49,162.87,C,C2,...,,,Cash,N,,,,,,
1,90060135,,21600.0,21600.0,21600.0,60 months,9.49,453.54,B,B2,...,,,Cash,N,,,,,,
2,90501423,,24200.0,24200.0,24200.0,36 months,9.49,775.09,B,B2,...,,,Cash,N,,,,,,
3,90186302,,3600.0,3600.0,3600.0,36 months,11.49,118.70,B,B5,...,,,Cash,N,,,,,,
4,90805192,,8000.0,8000.0,8000.0,36 months,10.49,259.99,B,B3,...,,,Cash,N,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,88985880,,40000.0,40000.0,40000.0,60 months,10.49,859.56,B,B3,...,,,Cash,N,,,,,,
996,88224441,,24000.0,24000.0,24000.0,60 months,14.49,564.56,C,C4,...,,,Cash,Y,Mar-2019,ACTIVE,Mar-2019,10000.0,44.82,1.0
997,88215728,,14000.0,14000.0,14000.0,60 months,14.49,329.33,C,C4,...,,,Cash,N,,,,,,
998,Total amount funded in policy code 1: 1465324575,,,,,,,,,,...,,,,,,,,,,


We should see a Pandas dataframe containing the 1000 rows of our Lending Club data sample.
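
Note that the last row shown above is a footer line from the export ("Total amount funded in policy code 1: ...") rather than a loan record. The notebook profiles the data as-is, but if you would rather exclude such rows first, one possible approach is to keep only rows whose `id` parses as a number. The helper name `drop_non_numeric_ids` and the tiny sample frame below are illustrative assumptions, not part of the dataset or of WhyLogs.

```python
import pandas as pd


def drop_non_numeric_ids(df: pd.DataFrame, id_col: str = "id") -> pd.DataFrame:
    """Keep only the rows whose id column parses as a number."""
    numeric_ids = pd.to_numeric(df[id_col], errors="coerce")
    return df[numeric_ids.notna()].reset_index(drop=True)


# Tiny illustrative frame that mimics the footer row; not the real dataset.
sample = pd.DataFrame(
    {
        "id": ["90671227", "90060135", "Total amount funded in policy code 1: 1465324575"],
        "loan_amnt": [4800.0, 21600.0, None],
    }
)
cleaned = drop_non_numeric_ids(sample)
print(len(sample), "rows before,", len(cleaned), "rows after")
```

On the real data this would be called as `drop_non_numeric_ids(data)` before logging, if you decide the footer row should not be profiled.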

Now that we have the raw data, we can pass it into the WhyLogs logger. It is often useful to pass a string label such as "test.data" along with the dataset for future reference.

The `log_dataframe` function will profile the given dataset using the WhyLogs library. It returns the generated profile directly, so we can capture it and interact with it.

In [10]:
# log_dataframe returns the DatasetProfile itself, so it must not be indexed
profile = session.log_dataframe(data, 'test.data')

The flat summary, histograms, and frequency information can be obtained from the profile by calling its `flat_summary()` method.

For more information about the contents of these objects, consult the `Analysis.ipynb` notebook.

In [None]:
summary = profile.flat_summary()
flat_summary = summary['summary']
flat_summary

In [None]:
print(flat_summary["column"].unique())

In [None]:
histograms = summary['hist']
histograms["delinq_amnt"]
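
A histogram entry is easier to read as a table of bins. The sketch below assumes each entry is a dictionary with `bin_edges` and `counts` keys, where there is one more edge than there are counts; that layout is an assumption about the summary format, so check it against your own output first. The helper name `histogram_to_frame` and the example values are made up for illustration.

```python
import pandas as pd


def histogram_to_frame(hist: dict) -> pd.DataFrame:
    """Turn a {'bin_edges': [...], 'counts': [...]} dict into one row per bin."""
    edges = list(hist["bin_edges"])
    counts = list(hist["counts"])
    if len(edges) != len(counts) + 1:
        raise ValueError("expected exactly one more bin edge than counts")
    return pd.DataFrame(
        {"bin_start": edges[:-1], "bin_end": edges[1:], "count": counts}
    )


# Hand-written stand-in values, not real WhyLogs output.
example_hist = {"bin_edges": [0.0, 10.0, 20.0, 30.0], "counts": [5, 2, 1]}
hist_frame = histogram_to_frame(example_hist)
print(hist_frame)
```

If the assumed layout holds, `histogram_to_frame(histograms["delinq_amnt"])` would give the same kind of table for a real column.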

In [None]:
frequencies = summary['frequent_strings']
frequencies.update(summary['frequent_numbers'])
frequencies['num_sats']

## Additional options for our WhyLogs session
We chose the simplest configuration above, but a number of convenient options can be set.

**Cloud storage:** You may set an AWS S3 bucket to have these logs automatically pushed to the cloud. You must have valid AWS configuration settings to do so.

**Binary file:** By default, we produce a binary file that contains raw objects used to summarize the data passed in. Navigating this file is beyond the scope of this notebook, however. This is listed under the *output_protobuf* option.

**Flat and JSON summaries:** By default, we produce a flat summary in the CSV format along with histogram and frequency summaries in the JSON format.

You can see these configuration options and others paired with the session in the `session.config` object.

In [None]:
session.config

## Displaying logs and resetting the session

There is also a convenience function to send the internal Python logs to stdout.

In [14]:
from whylogs.logs import display_logging
display_logging('debug')

2020-09-22 16:42:18,033 - whylogs.logs - DEBUG - whylogs.logs logging -> stdout at level DEBUG


When you are done with your session, run the `reset_default_session` function.

In [16]:
from whylogs import reset_default_session
reset_default_session()

2020-09-22 16:42:33,590 - whylogs.app.config - DEBUG - Attempting to load config file: None
2020-09-22 16:42:33,591 - whylogs.app.config - DEBUG - Attempting to load config file: .whylogs.yaml
