In [5]:
import os
import os.path
import pandas as pd
import numpy as np

In this notebook, we will explore how to generate logs using the whylogs Python library. 

The resulting profile can also be produced from the command line interface. The workflow for working with these files, along with deeper analysis and visualization examples, is covered in the `Analysis.ipynb` notebook that is generated by `whylogs init`.

# Generating logs from whylogs Python library

To generate logs using Python, we will import the whylogs library, initialize a logging session with whylogs, read in our raw data from file, and pass this data to our session.

First, import the function that creates or retrieves a logging session.

In [28]:
from whylogs import get_or_create_session

In [3]:
session = get_or_create_session()

WARN: Missing config


We will now load an example dataset from Lending Club, an online financial lending platform. For now, the dataset lives in the package's `notebooks/` folder.

Use the cell below to check your current working directory, then point `data_file` at the correct file path.

In [7]:
print("Current working directory:", os.getcwd())

Current working directory: /Users/leandro/Dropbox/Whylab/projects/whylogs-python/notebooks


In [8]:
data_file = "lending_club_1000.csv"
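If the file is not in the current working directory, a small helper can search a few likely locations. The following is a minimal sketch using only the standard library; the function name `resolve_data_file` and the candidate directories are hypothetical and not part of whylogs.

```python
import os


def resolve_data_file(filename, search_dirs):
    """Return the first existing absolute path to `filename`, or None."""
    for directory in search_dirs:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return os.path.abspath(candidate)
    return None


# Hypothetical candidate locations; adjust them to wherever the repository lives.
candidates = [os.getcwd(), os.path.join(os.getcwd(), "notebooks")]
resolved = resolve_data_file("lending_club_1000.csv", candidates)
print(resolved if resolved else "lending_club_1000.csv not found; update data_file")
```

When the helper returns a path, that path can be assigned to `data_file` directly.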

In [9]:
data = pd.read_csv(data_file)
data

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,90671227,,4800.0,4800.0,4800.0,36 months,13.49,162.87,C,C2,...,,,Cash,N,,,,,,
1,90060135,,21600.0,21600.0,21600.0,60 months,9.49,453.54,B,B2,...,,,Cash,N,,,,,,
2,90501423,,24200.0,24200.0,24200.0,36 months,9.49,775.09,B,B2,...,,,Cash,N,,,,,,
3,90186302,,3600.0,3600.0,3600.0,36 months,11.49,118.70,B,B5,...,,,Cash,N,,,,,,
4,90805192,,8000.0,8000.0,8000.0,36 months,10.49,259.99,B,B3,...,,,Cash,N,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,88985880,,40000.0,40000.0,40000.0,60 months,10.49,859.56,B,B3,...,,,Cash,N,,,,,,
996,88224441,,24000.0,24000.0,24000.0,60 months,14.49,564.56,C,C4,...,,,Cash,Y,Mar-2019,ACTIVE,Mar-2019,10000.0,44.82,1.0
997,88215728,,14000.0,14000.0,14000.0,60 months,14.49,329.33,C,C4,...,,,Cash,N,,,,,,
998,Total amount funded in policy code 1: 1465324575,,,,,,,,,,...,,,,,,,,,,


We should see a pandas DataFrame containing the 1000 rows of our Lending Club data sample.

Now that we have the raw data, we can pass it into the whylogs logger. It is often useful to pass a string label such as "test.data" along with the dataset for future reference.

The `log_dataframe` function profiles the given dataset using the whylogs library. By capturing the value it returns, we can interact with the generated profile.

In [12]:
profile = session.log_dataframe(data, 'test.data')


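Calling `flat_summary()` on the profile returns a dictionary that holds the flat summary (`'summary'`), the histograms (`'hist'`), and the frequency information (`'frequent_strings'` and `'frequent_numbers'`).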

For more information about the contents of these objects, consult the `Analysis.ipynb` notebook.

In [13]:
summary = profile.flat_summary()
flat_summary = summary['summary']
flat_summary

Unnamed: 0,column,count,null_count,bool_count,numeric_count,max,mean,min,stddev,nunique_numbers,...,nunique_str_upper,quantile_0.0000,quantile_0.0100,quantile_0.0500,quantile_0.2500,quantile_0.5000,quantile_0.7500,quantile_0.9500,quantile_0.9900,quantile_1.0000
0,initial_list_status,1000.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,...,2.0,,,,,,,,,
1,sec_app_fico_range_low,1000.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.0,,,,,,,,,
2,sec_app_open_acc,1000.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.0,,,,,,,,,
3,bc_util,1000.0,0.0,0.0,989.0,102.2,58.756724,0.0,26.431414,625.0,...,0.0,0.0,3.2,13.7,37.700001,59.400002,80.699997,97.699997,100.099998,102.199997
4,num_sats,1000.0,0.0,0.0,998.0,44.0,12.381764,2.0,6.104370,37.0,...,0.0,2.0,3.0,5.0,8.000000,11.000000,15.000000,24.000000,33.000000,44.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
146,sec_app_mort_acc,1000.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.0,,,,,,,,,
147,mths_since_rcnt_il,1000.0,0.0,0.0,978.0,147.0,19.132924,0.0,21.459126,96.0,...,0.0,0.0,1.0,2.0,6.000000,13.000000,24.000000,64.000000,112.000000,147.000000
148,num_bc_tl,1000.0,0.0,0.0,998.0,31.0,8.022044,0.0,4.824245,28.0,...,0.0,0.0,1.0,2.0,5.000000,7.000000,11.000000,18.000000,22.000000,31.000000
149,zip_code,1000.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,...,414.0,,,,,,,,,
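Because the flat summary is an ordinary pandas DataFrame, it can be filtered like any other. The sketch below rebuilds a few rows from the output above (only a subset of the columns) and keeps the features that recorded at least one numeric value. The name `flat_summary_sample` is illustrative only.

```python
import pandas as pd

# A few rows copied from the flat summary output above (subset of columns).
flat_summary_sample = pd.DataFrame(
    {
        "column": ["initial_list_status", "bc_util", "num_sats", "zip_code"],
        "count": [1000.0, 1000.0, 1000.0, 1000.0],
        "numeric_count": [0.0, 989.0, 998.0, 0.0],
        "mean": [0.0, 58.756724, 12.381764, 0.0],
    }
)

# Keep features with at least one value counted as numeric.
numeric_features = flat_summary_sample[flat_summary_sample["numeric_count"] > 0]

# Share of records in each remaining feature that were counted as numeric.
numeric_share = (numeric_features["numeric_count"] / numeric_features["count"]).round(3)
print(numeric_features[["column", "mean"]].assign(numeric_share=numeric_share))
```

The same filter applied to the full `flat_summary` would separate numeric features from string-valued ones such as `zip_code`.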


In [14]:
print(flat_summary["column"].unique())

['initial_list_status' 'sec_app_fico_range_low' 'sec_app_open_acc'
 'bc_util' 'num_sats' 'total_rec_prncp' 'loan_amnt' 'disbursement_method'
 'total_rec_late_fee' 'revol_bal_joint' 'all_util' 'out_prncp_inv'
 'mths_since_recent_bc' 'collections_12_mths_ex_med' 'int_rate'
 'tot_hi_cred_lim' 'num_bc_sats' 'revol_util' 'chargeoff_within_12_mths'
 'hardship_loan_status' 'next_pymnt_d' 'member_id' 'hardship_type'
 'bc_open_to_buy' 'total_cu_tl' 'num_accts_ever_120_pd' 'term'
 'open_rv_12m' 'num_rev_tl_bal_gt_0' 'funded_amnt_inv' 'acc_now_delinq'
 'total_il_high_credit_limit' 'sec_app_num_rev_accts' 'dti_joint'
 'sec_app_fico_range_high' 'num_tl_120dpd_2m' 'tax_liens'
 'annual_inc_joint' 'open_il_12m' 'hardship_length' 'loan_status'
 'mths_since_recent_inq' 'il_util' 'num_tl_90g_dpd_24m'
 'sec_app_inq_last_6mths' 'last_pymnt_d' 'open_acc_6m' 'fico_range_high'
 'issue_d' 'num_actv_bc_tl' 'deferral_term' 'delinq_amnt'
 'mths_since_recent_bc_dlq' 'total_pymnt' 'pub_rec' 'last_fico_range_low'
 '

In [15]:
histograms = summary['hist']
histograms["delinq_amnt"]


{'bin_edges': [0.0,
  2166.6668833333333,
  4333.333766666667,
  6500.00065,
  8666.667533333333,
  10833.334416666667,
  13000.0013,
  15166.668183333333,
  17333.335066666667,
  19500.001949999998,
  21666.668833333333,
  23833.33571666667,
  26000.0026,
  28166.66948333333,
  30333.336366666666,
  32500.00325,
  34666.67013333333,
  36833.337016666665,
  39000.003899999996,
  41166.670783333335,
  43333.337666666666,
  45500.00455,
  47666.67143333334,
  49833.33831666667,
  52000.0052,
  54166.67208333333,
  56333.33896666666,
  58500.00585,
  60666.67273333333,
  62833.339616666664,
  65000.0065],
 'counts': [998,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0]}
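Each histogram is a plain dictionary in which `bin_edges` has one more entry than `counts`, so bin `i` spans `bin_edges[i]` to `bin_edges[i + 1]`. As a rough sketch, the helper below pairs each count with its edges and drops empty bins; `describe_histogram` and the shortened example values are hypothetical, chosen only to mirror the shape of the output above.

```python
def describe_histogram(hist):
    """Pair each non-zero count with its (lower, upper) bin edges."""
    edges, counts = hist["bin_edges"], hist["counts"]
    if len(edges) != len(counts) + 1:
        raise ValueError("expected one more bin edge than counts")
    return [
        (edges[i], edges[i + 1], count)
        for i, count in enumerate(counts)
        if count > 0
    ]


# Shortened, hypothetical histogram with the same structure as the output above.
example_hist = {"bin_edges": [0.0, 10.0, 20.0, 30.0], "counts": [998, 0, 2]}
for lower, upper, count in describe_histogram(example_hist):
    print(f"{lower} to {upper}: {count}")
```

For a heavily skewed feature such as `delinq_amnt`, this makes it easy to see that nearly all values fall in the first bin.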

In [16]:
frequencies = summary['frequent_strings']
frequencies.update(summary['frequent_numbers'])
frequencies['num_sats']

{'value': [9.0,
  8.0,
  11.0,
  7.0,
  10.0,
  12.0,
  6.0,
  13.0,
  14.0,
  17.0,
  15.0,
  5.0,
  21.0,
  16.0,
  22.0,
  20.0,
  24.0,
  19.0,
  18.0,
  26.0,
  39.0,
  25.0,
  4.0],
 'count': [91,
  84,
  82,
  75,
  70,
  70,
  66,
  57,
  51,
  46,
  42,
  42,
  41,
  40,
  40,
  40,
  40,
  39,
  38,
  38,
  38,
  38,
  38]}
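The frequency entries are parallel `value` and `count` lists, which can be zipped into pairs for ranking. The sketch below uses the first few entries of the `num_sats` output; `top_values` is a hypothetical helper. It also illustrates a caveat of merging with `dict.update`: on a key collision the later dictionary wins, so a column present in both summaries would keep only its numeric entry.

```python
def top_values(freq, k=3):
    """Return the k most frequent (value, count) pairs from a frequency dict."""
    pairs = zip(freq["value"], freq["count"])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


# First entries of the `num_sats` output above.
num_sats_freq = {"value": [9.0, 8.0, 11.0, 7.0], "count": [91, 84, 82, 75]}
print(top_values(num_sats_freq))

# dict.update overwrites existing keys with the values from its argument.
merged = {"a": "strings"}
merged.update({"a": "numbers", "b": "numbers"})
print(merged)
```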

## Additional options for our whylogs session
We chose the simplest configuration above, but there are a number of convenient options that can be set.

**Cloud storage:** You may set an AWS S3 bucket to have these logs automatically pushed to the cloud. You must have valid AWS configuration settings to do so.

**Binary file:** By default, we produce a binary file that contains raw objects used to summarize the data passed in. Navigating this file is beyond the scope of this notebook, however. This is listed under the *output_protobuf* option.

**Flat and JSON summaries:** By default, we produce a flat summary in the CSV format along with histogram and frequency summaries in the JSON format.

You can see these configuration options and others paired with the session in the `session.config` object.

## Displaying logs and resetting the session

There is also a convenience function to send the internal Python logs to stdout.

In [29]:
from whylogs.logs import display_logging
display_logging('debug')

2020-12-09 09:06:33,433 - whylogs.logs - DEBUG - whylogs.logs logging -> stdout at level DEBUG
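For readers curious about what such a helper involves, a comparable effect can be sketched with the standard `logging` module alone. This is not the whylogs implementation; `log_to_stdout` is a hypothetical function, and the format string is an assumption inferred from the log line above.

```python
import logging
import sys


def log_to_stdout(logger_name, level_name="debug"):
    """Attach a stdout handler to the named logger at the given level."""
    level = getattr(logging, level_name.upper())
    logger = logging.getLogger(logger_name)
    handler = logging.StreamHandler(sys.stdout)
    # Assumed format, chosen to resemble the log line shown above.
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


# Logger name taken from the output above.
logger = log_to_stdout("whylogs.logs")
logger.debug("logging -> stdout at level DEBUG")
```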


When you are done with your session, run the `reset_default_session` function.

In [None]:
from whylogs.app.session import reset_default_session
reset_default_session()