In this notebook, we will explore how to use whylogs from Python in a streaming and distributed manner.

## Loading the dataset

To simulate streaming data, we will load the data into a Pandas dataframe. Then, we will iterate over the dataframe row by row; each row is a Pandas `Series`, which can be read like a dictionary keyed by column name.

The `whylogs.DatasetProfile.track` method accepts a dictionary of `[feature_name, value]` pairs.
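
The `[feature_name, value]` shape can be illustrated without whylogs or Pandas. The following is a minimal sketch that uses the standard library's `csv.DictReader` in place of the dataframe; the helper name `stream_rows` and the three-column inline sample are assumptions made for illustration, with values copied from the first two rows of the dataset preview shown later in this notebook.

```python
import csv
import io

# Illustrative sample only: three columns taken from the first two preview rows.
SAMPLE_CSV = """loan_amnt,int_rate,grade
4800.0,13.49,C
21600.0,9.49,B
"""


def stream_rows(text):
    """Yield each CSV row as a {feature_name: value} dictionary."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        # csv yields every value as a string; Pandas would parse numbers for us.
        yield dict(row)


rows = list(stream_rows(SAMPLE_CSV))
for row in rows:
    print(row)
```

Each yielded dictionary is the kind of record that a streaming consumer would hand to a profiler one at a time.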

In [1]:
import datetime
import os.path
import pandas as pd

In [2]:
data_file = "data/lending_club_1000.csv"
full_data = pd.read_csv(data_file)
full_data['issue_d'].describe()

data = full_data[full_data['issue_d'] == 'Oct-2016']
data

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,90671227,,4800.0,4800.0,4800.0,36 months,13.49,162.87,C,C2,...,,,Cash,N,,,,,,
1,90060135,,21600.0,21600.0,21600.0,60 months,9.49,453.54,B,B2,...,,,Cash,N,,,,,,
2,90501423,,24200.0,24200.0,24200.0,36 months,9.49,775.09,B,B2,...,,,Cash,N,,,,,,
3,90186302,,3600.0,3600.0,3600.0,36 months,11.49,118.70,B,B5,...,,,Cash,N,,,,,,
4,90805192,,8000.0,8000.0,8000.0,36 months,10.49,259.99,B,B3,...,,,Cash,N,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,89885898,,24000.0,24000.0,24000.0,60 months,12.79,543.50,C,C1,...,,,Cash,N,,,,,,
994,88977788,,24000.0,24000.0,24000.0,60 months,10.49,515.74,B,B3,...,,,Cash,N,,,,,,
995,88985880,,40000.0,40000.0,40000.0,60 months,10.49,859.56,B,B3,...,,,Cash,N,,,,,,
996,88224441,,24000.0,24000.0,24000.0,60 months,14.49,564.56,C,C4,...,,,Cash,Y,Mar-2019,ACTIVE,Mar-2019,10000.0,44.82,1.0


## Creating a whylogs session

Let's now import a function from whylogs that allows us to create a logging session.

This session can be connected to multiple writers that output the profiling results locally as JSON, flat CSV, or binary protobuf, as well as writers that target an AWS S3 bucket in the cloud. Further writing functionality will be added as well.

Let's create a default session below.

In [3]:
from whylogs import get_or_create_session

session = get_or_create_session()

## Creating a logger

We can create a logger for a specific dataset timestamp. This often represents a window of data or a batch of data.
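
One way to derive such a timestamp is to truncate each event time to the start of its window. The sketch below assumes daily windows, which matches the midnight timestamps used in this notebook; the helper name `batch_timestamp` and the sample event times are hypothetical.

```python
import datetime


def batch_timestamp(event_time):
    """Truncate an event time to midnight so that all events from one day share a timestamp."""
    return event_time.replace(hour=0, minute=0, second=0, microsecond=0)


# Hypothetical event times spread over two days.
events = [
    datetime.datetime(2020, 9, 22, 8, 15),
    datetime.datetime(2020, 9, 22, 23, 59),
    datetime.datetime(2020, 9, 21, 12, 0),
]

# One dataset timestamp per distinct day, in chronological order.
batches = sorted({batch_timestamp(e) for e in events})
for b in batches:
    print(b.isoformat())
```

Each distinct value in `batches` would correspond to one logger, so the three events above would be split across two batches.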


In [4]:
logger = session.logger(dataset_name="dataset", dataset_timestamp=datetime.datetime(2020, 9, 22, 0, 0))

## Log streaming data
We'll stream through the dataframe and call `logger.log`.

In practice, you'll call this on individual data points as they arrive.

In [None]:
for i, r in data.iterrows():
    logger.log(r)

In [None]:
# close the logger to write to disk
logger.close()

## Another logger
We'll create another logger for the same dataset, but with a different timestamp, and write the data to it.

In [None]:
with session.logger(dataset_name="dataset", dataset_timestamp=datetime.datetime(2020, 9, 21, 0, 0)) as logger:
    for i, r in data.iterrows():
        logger.log(r)

## Merging data
Once data is written to disk, we can then merge the entries together to get a summary view.

If you run a distributed system, this means that you can collect your `whylogs` data in cloud storage such as S3 and then aggregate it.

In [None]:
import glob

In [None]:
binaries = glob.glob('whylogs-output/dataset/**/*.bin', recursive=True)
binaries
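
The `**` pattern combined with `recursive=True` matches files at any directory depth, so it does not depend on how the writer nests its output. The sketch below shows this on a temporary directory; the folder names `batch_a`, `batch_b/nested` and the file name `profile.bin` are hypothetical and do not reflect the actual layout that whylogs writes.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: .bin files at two different depths under "dataset".
    for sub in ("batch_a", os.path.join("batch_b", "nested")):
        folder = os.path.join(root, "dataset", sub)
        os.makedirs(folder)
        open(os.path.join(folder, "profile.bin"), "wb").close()

    # "**" with recursive=True descends into every sub-directory.
    pattern = os.path.join(root, "dataset", "**", "*.bin")
    found = sorted(os.path.relpath(p, root) for p in glob.glob(pattern, recursive=True))

for path in found:
    print(path)
```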

In [None]:
from whylogs import DatasetProfile
# currently, the whylogs writer writes non-delimited files
profiles = [DatasetProfile.read_protobuf(x, delimited_file=False) for x in binaries]

In [None]:
from functools import reduce
merged = reduce(lambda x, y: x.merge(y), profiles)

## Quick check with the merged data
We can compare the counters to check that the merge worked: the merged count should equal the sum of the counts from the individual profiles.

In [None]:
print("First DTI count: ", profiles[0].columns['dti'].counters.count)
print("Second DTI count: ", profiles[1].columns['dti'].counters.count)
print("Merged count: ", merged.columns['dti'].counters.count)
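
To see why the counts should add up, here is a toy sketch of a mergeable profile. `ToyProfile` is a hypothetical stand-in written for this illustration, not a whylogs class: it only counts how many values were seen per feature, whereas a real profile tracks much more. The sample rows are made up.

```python
from collections import Counter
from functools import reduce


class ToyProfile:
    """Hypothetical stand-in for a profile: counts values seen per feature."""

    def __init__(self, counts=None):
        self.counts = Counter(counts or {})

    def track(self, row):
        # row is a {feature_name: value} dictionary
        for name in row:
            self.counts[name] += 1

    def merge(self, other):
        # Merging returns a new profile whose counts are the per-feature sums.
        return ToyProfile(self.counts + other.counts)


# Two profiles built independently, e.g. on different machines or batches.
first, second = ToyProfile(), ToyProfile()
first.track({"dti": 12.5, "grade": "B"})
first.track({"dti": 30.1, "grade": "C"})
second.track({"dti": 8.0, "grade": "B"})

merged_toy = reduce(lambda x, y: x.merge(y), [first, second])
print("First:", first.counts["dti"])
print("Second:", second.counts["dti"])
print("Merged:", merged_toy.counts["dti"])
```

Because merging only needs the summaries and not the raw rows, each worker can profile its own slice of the data and the results can be combined afterwards.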