# Install basic requirements

In [1]:
pip install -U whylogs pandas

Collecting argparse
  Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
import whylogs
import pandas as pd

# Load example data batches

The example data is hosted in our public S3 bucket. You can use your own data instead, as long as you have multiple batches.
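
If you bring your own data, the batches should be loaded in a consistent order. Here is a minimal sketch, assuming your batches are local CSV files named like `input_batch_<n>.csv`. The naming pattern and the helper `list_batch_files` are hypothetical. It sorts files by their numeric suffix, so that `input_batch_10.csv` comes after `input_batch_2.csv`.

```python
import re
from pathlib import Path


def list_batch_files(directory, prefix="input_batch_"):
    """Return CSV paths in `directory` ordered by the number after `prefix`.

    The naming scheme is an assumption for this sketch; adapt it to your files.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)\.csv$")
    numbered = []
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    # Sort by the numeric suffix, not lexicographically
    numbered.sort(key=lambda item: item[0])
    return [path for _, path in numbered]
```

Each returned path can then be passed to `pd.read_csv` in place of the S3 URL.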

In [3]:
pdfs = []
for i in range(1, 8):
    path = f"https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_{i}.csv"
    print(f"Loading data from {path}")
    df = pd.read_csv(path)
    pdfs.append(df)

Loading data from https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_1.csv
Loading data from https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_2.csv
Loading data from https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_3.csv
Loading data from https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_4.csv
Loading data from https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_5.csv
Loading data from https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_6.csv
Loading data from https://whylabs-public.s3.us-west-2.amazonaws.com/demo_batches/input_batch_7.csv


In [4]:
pdfs[0].describe()

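| | Unnamed: 0 | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | int_rate | installment | annual_inc | desc | ... | hardship_loan_status | orig_projected_additional_accrued_interest | hardship_payoff_balance_amount | hardship_last_payment_amount | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 407.0 | 407.0 | 0.0 | 407.0 | 407.0 | 407.0 | 407.0 | 407.0 | 407.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mean | 12548.717445 | 115863100.0 | NaN | 14203.746929 | 14203.746929 | 14202.948403 | 13.514054 | 418.020344 | 78818.956069 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| std | 125.354772 | 1207642.0 | NaN | 9351.142374 | 9351.142374 | 9350.997874 | 5.446881 | 271.096531 | 55864.939403 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| min | 12325.0 | 112153800.0 | NaN | 1000.0 | 1000.0 | 1000.0 | 5.32 | 34.22 | 0.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | 12442.5 | 115076900.0 | NaN | 7000.0 | 7000.0 | 7000.0 | 9.93 | 235.58 | 43325.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | 12550.0 | 115700400.0 | NaN | 12000.0 | 12000.0 | 12000.0 | 12.62 | 357.25 | 63300.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | 12653.5 | 116824500.0 | NaN | 20000.0 | 20000.0 | 20000.0 | 16.02 | 553.515 | 95000.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| max | 12862.0 | 118159200.0 | NaN | 40000.0 | 40000.0 | 40000.0 | 30.99 | 1417.71 | 495000.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |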


# Configure whylogs

`whylogs`, by default, does not send statistics to WhyLabs.

A few small setup steps are needed first. If you don't have an API key yet, please onboard with WhyLabs.

**WhyLabs only requires the whylogs API: your raw data never leaves your premises.**

In [5]:
from whylogs.app import Session
from whylogs.app.writers import WhyLabsWriter
import os
import datetime

In [6]:
import getpass

# set your org-id here
print("Enter your WhyLabs Org ID")
os.environ["WHYLABS_DEFAULT_ORG_ID"] = input()
# set your API key here
print("Enter your WhyLabs API key")
os.environ["WHYLABS_API_KEY"] = getpass.getpass()
print("Using API Key ID: ", os.environ["WHYLABS_API_KEY"][0:10])

Enter your WhyLabs Org ID


 org-5953


Enter your WhyLabs API key


 ································································


Using API Key ID:  naGzCisIJt
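
The cell above stores the credentials in environment variables, so it can help to fail early if one of them is missing or empty. Here is a minimal sketch using only the standard library. The helper name `missing_env_vars` is hypothetical; the variable names are the two set above.

```python
import os

REQUIRED_VARS = ("WHYLABS_DEFAULT_ORG_ID", "WHYLABS_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or blank in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name, "").strip()]
```

Calling `missing_env_vars()` before creating the session, and raising an error when the returned list is non-empty, catches an accidentally empty input early.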


## Creating session

Once the environment variables are set, let's create a whylogs session with a WhyLabs writer.

Note that you can also add a local writer or an S3 writer here. Check out the API docs for more information.

In [7]:
# create WhyLabs session
writer = WhyLabsWriter("", formats=[])
session = Session(project="demo-project", pipeline="demo-pipeline", writers=[writer])

## Logging to WhyLabs

Ensure you have a **model ID** (also called **dataset ID**) before you start!

### Dataset Timestamp
* To avoid confusion, we recommend using UTC
* If you don't set the `dataset_timestamp` parameter, it defaults to the current time in UTC
* WhyLabs supports real-time visualization when the timestamp is **within the last 7 days**. Anything older than that will be picked up when we run our batch processing
* **If you log two profiles for the same day with different timestamps (12:00 vs 12:01), they are merged into the same batch**
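
To illustrate the last point, here is a small sketch of daily bucketing, assuming batches are keyed by the UTC calendar day. The function `daily_batch_key` is hypothetical and is not how WhyLabs implements merging. It only shows why 12:00 and 12:01 on the same day land in the same batch, and why a timezone-aware UTC timestamp avoids ambiguity.

```python
import datetime


def daily_batch_key(ts):
    """Map a timezone-aware timestamp to the start of its UTC day."""
    if ts.tzinfo is None:
        raise ValueError("use a timezone-aware timestamp to avoid ambiguity")
    utc = ts.astimezone(datetime.timezone.utc)
    return utc.replace(hour=0, minute=0, second=0, microsecond=0)


noon = datetime.datetime(2021, 9, 30, 12, 0, tzinfo=datetime.timezone.utc)
one_past = datetime.datetime(2021, 9, 30, 12, 1, tzinfo=datetime.timezone.utc)
next_day = noon + datetime.timedelta(days=1)

# Same UTC day: both timestamps map to the same key
same_batch = daily_batch_key(noon) == daily_batch_key(one_past)
# A day later: the key differs, so it would be a separate batch
different_batch = daily_batch_key(noon) != daily_batch_key(next_day)
```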

### Logging Different Batches of Data
* We'll give the profiles different **dates**
* Create a new logger for each date. Note that the logger needs to be closed to flush out the data
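
The `with` statement in the next cell takes care of closing the logger. The toy class below is a hypothetical stand-in, not the whylogs logger. It is a minimal sketch of the general pattern: records stay in a buffer until `close()` runs, and the context manager guarantees that call even if an exception is raised.

```python
class BufferedLogger:
    """Toy stand-in for a logger that buffers records until closed."""

    def __init__(self, sink):
        self._sink = sink      # list that receives flushed records
        self._buffer = []
        self.closed = False

    def log(self, record):
        if self.closed:
            raise RuntimeError("logger is closed")
        self._buffer.append(record)

    def close(self):
        # Flush buffered records exactly once
        if not self.closed:
            self._sink.extend(self._buffer)
            self._buffer.clear()
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions


sink = []
with BufferedLogger(sink) as log:
    log.log({"rows": 407})
    flushed_inside = list(sink)   # snapshot taken before the logger is closed
flushed_after = list(sink)        # snapshot taken after the with block exits
```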

In [8]:
print("Enter your model ID from WhyLabs:")
model_id = input()
for i, df in enumerate(pdfs):
    # walking backwards. Each dataset has to map to a date to show up as a different batch
    # in WhyLabs
    dt = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=i)
    
    # Create new logger for date
    with session.logger(tags={"datasetId": model_id}, dataset_timestamp=dt) as ylog:
        print("Log data frame for ", dt)
        ylog.log_dataframe(df)

Enter your model ID from WhyLabs:


 model-5


Log data frame for  2021-09-30 04:30:22.845881+00:00
Log data frame for  2021-09-29 04:30:25.273786+00:00


Using API key ID: naGzCisIJt


Log data frame for  2021-09-28 04:30:27.638109+00:00
Log data frame for  2021-09-27 04:30:29.872950+00:00
Log data frame for  2021-09-26 04:30:32.003965+00:00
Log data frame for  2021-09-25 04:30:33.789872+00:00
Log data frame for  2021-09-24 04:30:36.016256+00:00


In [9]:
# Ensure everything is flushed
session.close()

## Voilà

* Now check the application to see if your **statistics** are in!
* Also, run the above cell again for the same model ID. Do you see the statistics change in WhyLabs? Especially the counters?
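
As a hint for the last question: the note on dataset timestamps says that profiles logged for the same day are merged into one batch. Here is a toy sketch, assuming that merging counters simply adds them up. The `merge_counts` helper and the dictionary layout are hypothetical, and this is not the internal whylogs representation.

```python
from collections import Counter


def merge_counts(a, b):
    """Merge two per-column row counters by adding them."""
    return Counter(a) + Counter(b)


# Row counts of the first batch, as reported by describe() above
first_run = {"loan_amnt": 407, "int_rate": 407}
second_run = {"loan_amnt": 407, "int_rate": 407}  # same data logged again

merged = merge_counts(first_run, second_run)
```

Under that assumption, logging the same batch twice for the same day doubles its counts, while statistics such as the mean stay the same.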