# Rolling Logs with Streaming Data
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/advanced/Streaming_Data_with_Log_Rotation.ipynb)

Now that you're familiar with the ["Getting Started"](https://github.com/whylabs/whylogs/blob/mainline/python/examples/basic/Getting_Started.ipynb) notebook and the basic examples, let's see what else whylogs can be used for! So far, you've seen it ingest rows and dataframes during the logging process. Now let's look at a way to handle large amounts of changing data, such as streams: rolling logs (sometimes also called log rotation).

Instead of planning how to batch your logging into intervals yourself, you can let whylogs handle all of that for you. The logger will create your session, log information at the requested interval (in seconds, minutes, hours, or days), and at the end of each interval write your profile out to a .bin file and flush the log so it is ready to receive more data.
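To make the rollover idea concrete, here is a minimal standard-library sketch of interval-based rotation. It is not how whylogs implements it: `RollingCounter`, its `clock` parameter, and the JSON files it writes are hypothetical stand-ins for the rolling logger and its .bin profiles. A fake clock is used so the example runs instantly.

```python
import json
import os
import tempfile


class RollingCounter:
    """Toy rolling logger: counts rows and writes one JSON file per interval."""

    def __init__(self, base_dir, interval_seconds, clock):
        self.base_dir = base_dir
        self.interval = interval_seconds
        self.clock = clock              # callable returning the current time in seconds
        self.window_start = clock()
        self.rows = 0
        self.files_written = 0

    def log(self, row):
        # Roll over first if the current interval has elapsed.
        if self.clock() - self.window_start >= self.interval:
            self.flush()
        self.rows += 1                  # a real profile would update statistics here

    def flush(self):
        # Write what we have, then reset so the next interval starts empty.
        if self.rows:
            path = os.path.join(self.base_dir, f"profile_{self.files_written}.json")
            with open(path, "w") as f:
                json.dump({"rows": self.rows}, f)
            self.files_written += 1
        self.rows = 0
        self.window_start = self.clock()


fake_time = [0.0]                       # simulated clock, advanced by hand
out_dir = tempfile.mkdtemp()
logger = RollingCounter(out_dir, interval_seconds=20, clock=lambda: fake_time[0])

for second in range(50):                # one row per simulated second
    fake_time[0] = float(second)
    logger.log({"value": second})
logger.flush()                          # write the final, partial interval

written = sorted(os.listdir(out_dir))
print(written)
```

Fifty simulated seconds with a 20-second interval produce two full files and one partial file, which is the same pattern you should expect from any time-based rollover.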

#### Why would you want this?
Logging data throughout a given time period gives your statistical profiles finer granularity, and writing these logs out regularly not only keeps them safe but also gives you more options for merging profiles when it comes time for analysis. We'll go into that in depth in the ["Merging Profiles"](https://github.com/whylabs/whylogs/blob/9618e5dd6570bc484579ec1325f2f512ff56977f/python/examples/basic/Merging_Profiles.ipynb) notebook, but you can also see a simple example of it at the end of this notebook.

We recommend having multiple intervals within the period you want to analyze. For example, if you want to look at daily changes, logging at least hourly will help you get a good profile estimate. Rolling over so frequently that a profile only contains a couple of rows is not recommended, so experiment to find the balance that is right for your needs.
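The trade-off can be worked out with simple arithmetic. The sketch below assumes a daily analysis window and one row arriving every 5 minutes; both numbers and both helper names are illustrative assumptions, not whylogs settings.

```python
def profiles_per_window(window_hours, roll_interval_minutes):
    """How many profiles one analysis window produces."""
    return (window_hours * 60) // roll_interval_minutes


def rows_per_profile(roll_interval_minutes, minutes_between_rows):
    """Roughly how many rows land in each profile."""
    return roll_interval_minutes // minutes_between_rows


# Daily analysis, data arriving every 5 minutes (an assumed cadence).
for roll_minutes in (60, 15, 5):
    print(roll_minutes,
          profiles_per_window(24, roll_minutes),
          rows_per_profile(roll_minutes, 5))
```

Under these assumptions an hourly rollover gives 24 profiles of about 12 rows each, while rolling every 5 minutes gives 288 profiles of a single row, which is the "too frequent" case described above.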

## Simple Example using Bitcoin Ticker
To start off, let's see how logging works; this will be an extremely basic example to show the syntax. We'll get data from BlockChain's ticker as this Jupyter notebook runs. So that you don't have to wait too long, I'll have it gather data continuously and roll over the file every 20 seconds. This gives enough data for an example without a long wait.

The data is simply the JSON API of the given website, pulled repeatedly over time. This gives the notebook a stream that is quick and constantly changing, but in reality this is where you'd hook up your predictive models, larger data, CSV files, etc.

#### Imports
First let's make sure we have everything installed and ready for input. We will be using the file structure to record the .bin files, and "psutil" to get the CPU information.

In [None]:
%pip install psutil;
%pip install whylogs;

In [10]:
import os
from os import listdir
from os.path import isfile

import pandas as pd
import random
import time
import datetime
import whylogs as why

tmp_path = "example_output"

if not os.path.isdir(tmp_path):
    os.makedirs(tmp_path)

Here is a super simple function to count the files that are here before and after the logging.

In [11]:
def count_files(tmp_path):
    only_files = [f for f in listdir(tmp_path) if isfile(os.path.join(tmp_path, f))]
    return len(only_files)

print(count_files(tmp_path))

2


Now it's on to the actual logging! We will first create the logger, mark it as "rolling", and set the interval in terms of Seconds, Minutes, Hours, or Days. Lastly we want to make sure we give it the base file name, and create a writer. For this example we will be using the local writer to put files on the local system. The following will be broken into two sections: **Production** and **Playground**.

In **Production** you'll see code that is more in line with what you'd see in an everyday environment. It will still need to be customized for your use case, as the time period of a log depends on how often your data is pulled and how often you'll be observing it. You're more than welcome to run this, but it will take quite a while, since typically you'd be logging over a dedicated time span such as hours, days, or longer.

In **Playground** you'll get to use our example at fast speed. This will be modified to run continuously. This is the best place to try things out and learn more about how the logging works.

In both examples you'll see a `with` which enables your data to be written on exit even if it's not at the interval time.
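The "write on exit" behaviour of a `with` block can be illustrated with a plain Python context manager. This is a minimal sketch of the pattern only; `FlushOnExit` and its attributes are hypothetical names, and the in-memory `written` list stands in for files on disk.

```python
class FlushOnExit:
    """Toy logger that writes out whatever is pending when the `with` block ends."""

    def __init__(self):
        self.pending = []   # rows logged since the last write
        self.written = []   # stand-in for files written to disk

    def log(self, row):
        self.pending.append(row)

    def flush(self):
        if self.pending:
            self.written.append(list(self.pending))
            self.pending.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs on normal exit and on exceptions, even mid-interval.
        self.flush()
        return False        # do not swallow exceptions


with FlushOnExit() as logger:
    logger.log({"USD": 20270.23})
    logger.log({"EUR": 19170.13})

print(len(logger.written), len(logger.pending))
```

Because `__exit__` always runs, the rows logged inside the block are written out even though no interval boundary was reached.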


## Data Set
Alright, I know blockchain is big right now, but that's not why we picked it. We wanted a very fast, always-up ticker so the playground can be used at any time of day, and this public data source lets us do just that. You don't need to be a blockchain user or enthusiast at all. This ticker is just like one for US stocks or currency exchange: all it does is show the exchange rate for bitcoin in USD and other currencies. The code block below shows an example of one of the messages.

In the comments you'll see a placeholder where you'd add your ML model and log its output into whylogs as well!

*Please note, we don't do anything directly with block chains or bitcoins in any way.*

In [14]:
example_path = os.path.join("mock_input", "moc_1655923486.322933.json")
example_df = pd.read_json(example_path)
example_df

| | ARS | AUD | BRL | CAD | CHF | CLP | CNY | CZK | DKK | EUR | ... | NZD | PLN | RON | RUB | SEK | SGD | THB | TRY | TWD | USD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15m | 4629222.76 | 29167.28 | 105202.49 | 26189.14 | 19424.96 | 18163033.49 | 146442.74 | 481208.62 | 155741.55 | 19170.13 | ... | 32247.65 | 89876.17 | 72010.47 | 1115075.7 | 204322.49 | 28201.47 | 715312.44 | 354510.87 | 26353962.8 | 20270.23 |
| last | 4629222.76 | 29167.28 | 105202.49 | 26189.14 | 19424.96 | 18163033.49 | 146442.74 | 481208.62 | 155741.55 | 19170.13 | ... | 32247.65 | 89876.17 | 72010.47 | 1115075.7 | 204322.49 | 28201.47 | 715312.44 | 354510.87 | 26353962.8 | 20270.23 |
| buy | 4629222.76 | 29167.28 | 105202.49 | 26189.14 | 19424.96 | 18163033.49 | 146442.74 | 481208.62 | 155741.55 | 19170.13 | ... | 32247.65 | 89876.17 | 72010.47 | 1115075.7 | 204322.49 | 28201.47 | 715312.44 | 354510.87 | 26353962.8 | 20270.23 |
| sell | 4629222.76 | 29167.28 | 105202.49 | 26189.14 | 19424.96 | 18163033.49 | 146442.74 | 481208.62 | 155741.55 | 19170.13 | ... | 32247.65 | 89876.17 | 72010.47 | 1115075.7 | 204322.49 | 28201.47 | 715312.44 | 354510.87 | 26353962.8 | 20270.23 |
| symbol | ARS | AUD | BRL | CAD | CHF | CLP | CNY | CZK | DKK | EUR | ... | NZD | PLN | RON | RUB | SEK | SGD | THB | TRY | TWD | USD |
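If you'd rather inspect a message without pandas, the standard `json` module is enough. The sketch below assumes the message shape implied by the table above (one object per currency with `15m`, `last`, `buy`, `sell`, and `symbol` fields); the two-currency message is a trimmed, hand-written stand-in using the USD and EUR values shown, and the `<currency>_last` column naming is just an illustration.

```python
import json

# Trimmed stand-in for one ticker message (assumed shape, two currencies only).
raw_message = """
{
  "USD": {"15m": 20270.23, "last": 20270.23, "buy": 20270.23, "sell": 20270.23, "symbol": "USD"},
  "EUR": {"15m": 19170.13, "last": 19170.13, "buy": 19170.13, "sell": 19170.13, "symbol": "EUR"}
}
"""

ticker = json.loads(raw_message)

# Flatten to one row per message, keeping only the last traded price per currency.
row = {f"{currency}_last": fields["last"] for currency, fields in ticker.items()}
print(row)
```

A flat row like this is one possible input for row-level logging; the notebook itself logs the full dataframe instead.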


##  Example
This example is closer to what you'd see in a real environment. Imagine we want to review our data **every day**: you'll want logs at least **every thirty**, but you don't want to roll the logger over with only one data point, so we might roll over every 5 hours, for instance.

We'll use mocked-out data that was gathered every 5 minutes for 24 hours. We'll use the rolling log every hour. This is a very basic app structure that shows how you would use log rotation when consuming data, whether you get thousands of messages in a second and then none for a while, or a continuous stream.

In [60]:
class MyApp:
    def __init__(self):
        self.logger = why.logger(mode="rolling", interval=15, when="M", 
                                base_name="bitcoin_profile_")
        
        # write to our local path, there are other writers though
        self.logger.append_writer("local", base_dir="example_output")       
        self.dataset_logged=0        # this is simple for our logging

    def exit(self):
        self.logger.flush()

    def consume(self, data_df):
        self.logger.log(data_df)     # log it into our data set profile
        self.dataset_logged += 1

        ## fancy_output = fancy_ml.predict(data_df),    use your ML model
        ## self.logger.log(fancy_output),               log your ML output as well

        # We are printing the log to stdout for the example, substitute how you work with logging
        # Use self rather than the global app so the method works on any instance
        print("Inputs Processed: " + str(self.dataset_logged) +
              "    Dataset Files Written to Local: " + str(count_files(tmp_path)))

In [61]:
def data_feeder(live_feed=False):
    # Feel free to turn this on to play with live data
    if live_feed:
        url = "https://blockchain.info/ticker"
        data_df = pd.read_json(url)

    # brings in mock messages as shown in the Data Set section
    else:
        message = ['moc_1655923486.322933.json', 'moc_1655926186.340832.json']
        example_path = os.path.join("mock_input", random.choice(message))
        data_df = pd.read_json(example_path)
    return data_df


def test_driver():
    for i in range(100):
        data_df = data_feeder()
        app.consume(data_df)   # consume is a method of the global app created below
        time.sleep(random.randrange(0, 10))

In [None]:
app = MyApp()
test_driver()
app.exit()

Inputs Processed: 1    Dataset Files Written to Local: 2
Inputs Processed: 2    Dataset Files Written to Local: 2
Inputs Processed: 3    Dataset Files Written to Local: 2
Inputs Processed: 4    Dataset Files Written to Local: 2
Inputs Processed: 5    Dataset Files Written to Local: 2
Inputs Processed: 6    Dataset Files Written to Local: 2


## Next steps - the .bin
Congrats! Now you've got data safely stored away, but what exactly are these .bin files? As you log datasets, the session tracks many inputs (done through `why.log()`) in a dataset profile. The rolling logger writes the dataset profile out to a .bin file, then flushes it to start logging again. This stores your data safely and incrementally, and you can later look at the profiles individually, merge any number of them, or merge them all back into one piece.
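Incremental files can be merged back together because the statistics inside a profile are mergeable. The sketch below shows the idea with a deliberately tiny summary (count, total, min, max); `Summary` and `summarize` are hypothetical names, and real whylogs profiles track far richer statistics than this. Small integers are used so the totals compare exactly.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def track(self, value):
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other):
        # Returns a new summary; neither input is modified.
        return Summary(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )


def summarize(values):
    summary = Summary()
    for value in values:
        summary.track(value)
    return summary


values = [3, 7, 1, 9, 4]
first_interval = summarize(values[:2])    # like one rolled-over file
second_interval = summarize(values[2:])   # like the next file
merged = first_interval.merge(second_interval)
whole = summarize(values)

print(merged == whole)
```

Merging the two partial summaries gives the same result as summarizing all the data at once, which is why writing small files at each interval loses nothing for later analysis.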

For example, let's bring up just one of the files to see what's in it.

In [18]:
# Get the first file
first_file = listdir(tmp_path)[0]
path = os.path.join(tmp_path, first_file)

# This .bin can be read using the path
result_view = why.read(path).view()
result_view.to_pandas()

IndexError: list index out of range
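Indexing `listdir(tmp_path)[0]` raises `IndexError` whenever the directory is empty, for example if no rollover has happened yet. A small guard avoids that. This is a standard-library sketch; `first_profile_path` and the demo file names are hypothetical, and the empty files created here are placeholders rather than real profiles.

```python
import os
import tempfile


def first_profile_path(directory):
    """Return the first visible file in `directory` (sorted by name), or None."""
    names = sorted(
        name for name in os.listdir(directory)
        if not name.startswith(".") and os.path.isfile(os.path.join(directory, name))
    )
    return os.path.join(directory, names[0]) if names else None


demo_dir = tempfile.mkdtemp()
empty_result = first_profile_path(demo_dir)   # nothing written yet
print(empty_result)

# Create placeholder files, including a hidden one that should be skipped.
for name in (".hidden", "bitcoin_profile_b.bin", "bitcoin_profile_a.bin"):
    open(os.path.join(demo_dir, name), "w").close()

first = first_profile_path(demo_dir)
print(os.path.basename(first))
```

Checking for `None` before reading the file lets the notebook report "no profiles yet" instead of failing.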

## Merging Profiles from .bin
OK, so we have saved .bin files! Huzzah! ... And what do we do with them?

Let's read them in from our local file system and merge them in a couple of ways. Please check out the ["Merging Profiles"](https://github.com/whylabs/whylogs/blob/9618e5dd6570bc484579ec1325f2f512ff56977f/python/examples/basic/Merging_Profiles.ipynb) notebook for an in-depth look.

In [9]:
merged_profiles_view = None

# Let's go through all files in the directory
for f in listdir(tmp_path):
    path = os.path.join(tmp_path, f)

    # We know we don't want any hidden files or dir
    if isfile(path) and f[0] != ".":

        # Read the file and store the view
        reading_result = why.read(path)
        result_view =  reading_result.view()

        # Let's merge the views together; merge() returns a new view, so keep the result
        if merged_profiles_view is not None:
            merged_profiles_view = merged_profiles_view.merge(result_view)
        else:
            merged_profiles_view = result_view

merged_profiles_view.to_pandas()

| column | types/integral | types/fractional | types/boolean | types/string | types/object | counts/n | counts/null | frequent_items/frequent_strings | cardinality/est | cardinality/upper_1 | cardinality/lower_1 | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARS | 0 | 652 | 0 | 163 | 0 | 815 | 0 | [FrequentItem(value='4818920.680000', est=652,... | 2.0 | 2.0001 | 2.0 | SummaryType.COLUMN |
| AUD | 0 | 652 | 0 | 163 | 0 | 815 | 0 | [FrequentItem(value='29368.890000', est=652, u... | 2.0 | 2.0001 | 2.0 | SummaryType.COLUMN |
| BRL | 0 | 652 | 0 | 163 | 0 | 815 | 0 | [FrequentItem(value='106313.070000', est=652, ... | 2.0 | 2.0001 | 2.0 | SummaryType.COLUMN |
| CAD | 0 | 652 | 0 | 163 | 0 | 815 | 0 | [FrequentItem(value='26768.750000', est=652, u... | 2.0 | 2.0001 | 2.0 | SummaryType.COLUMN |
| CHF | 0 | 652 | 0 | 163 | 0 | 815 | 0 | [FrequentItem(value='19943.030000', est=652, u... | 2.0 | 2.0001 | 2.0 | SummaryType.COLUMN |
| CLP | 0 | 652 | 0 | 163 | 0 | 815 | 0 | [FrequentItem(value='18140556.340000', est=652... | 2.0 | 2.0001 | 2.0 | SummaryType.COLUMN |
| CNY | 0 | 652 | 0 | 163 | 0 | 815 | 0 | [FrequentItem(value='209945.680000', est=652, ... | 2.0 | 2.0001 | 2.0 | SummaryType.COLUMN |
| CZK | 0 | 652 | 0 | 163 | 0 | 815 | 0 | [FrequentItem(value='492436.890000', est=652, ... | 2.0 | 2.0001 | 2.0 | SummaryType.COLUMN |
| DKK | 0 | 652 | 0 | 163 | 0 | 815 | 0 | [FrequentItem(value='160250.110000', est=652, ... | 2.0 | 2.0001 | 2.0 | SummaryType.COLUMN |
| EUR | 0 | 652 | 0 | 163 | 0 | 815 | 0 | [FrequentItem(value='19629.200000', est=652, u... | 2.0 | 2.0001 | 2.0 | SummaryType.COLUMN |


# What's next?
- Get to know ["Merging Profiles"](https://github.com/whylabs/whylogs/blob/9618e5dd6570bc484579ec1325f2f512ff56977f/python/examples/basic/Merging_Profiles.ipynb) and how to use them.
- See how all this can be visualized in ["Notebook Profile Visualizer"](https://github.com/whylabs/whylogs/blob/9618e5dd6570bc484579ec1325f2f512ff56977f/python/examples/basic/Notebook_Profile_Visualizer.ipynb)
