# Profiling with `whylogs` from a Kafka topic

This example shows how to profile messages from a Kafka topic and combine the results into a single profile. To keep it simple and reproducible anywhere, we will create a Kafka topic, produce messages from an existing CSV file, consume those messages from the topic, and then profile them.

>**NOTE**: To run this example, we will use Apache ZooKeeper and Apache Kafka locally with Docker Compose, so make sure Docker Compose is installed and ready in your environment. If you want to read more about how the Docker Compose YAML file was built, check out [this blog post](https://medium.com/better-programming/your-local-event-driven-environment-using-dockerised-kafka-cluster-6e84af09cd95).

To get started, we will bring the services up and create the topic in Kafka with the following commands:

```bash
$ docker-compose up -d

$ docker exec -ti kafka bash

root@kafka: kafka-topics --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic whylogs-stream
```

If you haven't already, install `kafka-python` and `whylogs` in your environment by uncommenting and running the following cell.

In [None]:
# %pip install -q whylogs
# %pip install -q kafka-python

## Generating Data

In [176]:
import json
import os.path
import warnings

import pandas as pd
from kafka import KafkaProducer


warnings.simplefilter("ignore")

producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         value_serializer=lambda v: json.dumps(v).encode('utf-8'))


data_file = "lending_club_demo.csv"
full_data = pd.read_csv(os.path.join(data_file))

data = full_data[full_data['issue_d'] == 'Jan-2017']

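# Send each row to the topic as a JSON-encoded dict
for _, row in data.iterrows():
    producer.send('whylogs-stream', row.to_dict())

# send() is asynchronous; flush() blocks until the buffered messages are sent
producer.flush()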
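
The serializer above relies on `json.dumps`, which writes missing values (pandas represents them as float `NaN`) using the non-standard `NaN` token. Python's `json.loads` reads that token back, but stricter JSON consumers may reject it. The following is a minimal sketch of one way to guard against that, assuming it is acceptable to send missing values as `null`. The helper names `clean_record`, `serialize`, and `deserialize` are hypothetical, and the sample values are illustrative only.

```python
import json
import math


def clean_record(record):
    """Replace float NaN values with None so the payload is strict JSON."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }


def serialize(record):
    # allow_nan=False makes json.dumps raise instead of emitting a NaN token,
    # so any value that slipped past clean_record fails loudly
    return json.dumps(clean_record(record), allow_nan=False).encode("utf-8")


def deserialize(payload):
    return json.loads(payload.decode("utf-8"))


# Illustrative row: column names come from the dataset, values are made up
row = {
    "total_rec_prncp": 5000.0,
    "settlement_percentage": float("nan"),
    "verification_status": "Not Verified",
}

payload = serialize(row)
restored = deserialize(payload)
print(restored)
```

If you adopt something like this, `serialize` could be passed as `value_serializer` and `deserialize` as `value_deserializer` in place of the lambdas used in this notebook.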

## Consuming the messages with KafkaConsumer

In [177]:
from kafka import KafkaConsumer, TopicPartition


consumer = KafkaConsumer(bootstrap_servers='localhost:9092', 
                         value_deserializer=lambda x: json.loads(x.decode('utf-8')))

assignments = []
topics=['whylogs-stream']

for topic in topics:
    partitions = consumer.partitions_for_topic(topic)
    for p in partitions:
        print(f'topic {topic} - partition {p}')
        assignments.append(TopicPartition(topic, p))
consumer.assign(assignments)

topic whylogs-stream - partition 0


## Profiling with `whylogs`

For the sake of simplicity, we will build a `pandas.DataFrame` from each batch of messages we read, then profile it and fold it into a single profile until there are no more messages in the topic.

In [178]:
import whylogs as why
import pandas as pd 


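# Start reading from the earliest offset of every assigned partition
consumer.seek_to_beginning()

total = 0
counter = 0
while True:
    finished = True
    # poll() returns a dict mapping each TopicPartition to its list of records
    record = consumer.poll(timeout_ms=500, max_records=100, update_offsets=True)
    for k, v in record.items():
        print(f'{k} - {len(v)}')
        df = pd.DataFrame([row.value for row in v])
        if counter == 0:
            # First batch: create the profile
            profile = why.log(df).profile()
        else:
            # Later batches: update the same profile with the new rows
            profile.track(df)
        total += len(v)
        finished = False
        counter += 1

    # An empty poll means no messages arrived within the timeout, so we stop
    if finished:
        print(f"total {total}")
        break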

TopicPartition(topic='whylogs-stream', partition=0) - 100
TopicPartition(topic='whylogs-stream', partition=0) - 100
TopicPartition(topic='whylogs-stream', partition=0) - 68
TopicPartition(topic='whylogs-stream', partition=0) - 41
total 309
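
The polling loop above mixes two concerns: reading batches and profiling them. As a hedged sketch, the batching part could be pulled out into a generator that stops at the first empty poll. `iter_batches` is a hypothetical helper, and `FakeConsumer` and `FakeRecord` are stand-ins written only so the sketch runs without a broker. With a real `KafkaConsumer`, batch sizes can be smaller than `max_records`, as the output above shows, and an empty poll only means nothing arrived within the timeout.

```python
from collections import namedtuple


def iter_batches(consumer, timeout_ms=500, max_records=100):
    """Yield lists of message values until a poll returns nothing."""
    while True:
        polled = consumer.poll(timeout_ms=timeout_ms, max_records=max_records)
        if not polled:
            return
        # One list of records per partition that returned data
        for records in polled.values():
            yield [record.value for record in records]


# Stand-ins for the broker, used only to exercise the loop locally
FakeRecord = namedtuple("FakeRecord", "value")


class FakeConsumer:
    def __init__(self, values):
        self._values = list(values)

    def poll(self, timeout_ms=0, max_records=None):
        batch = self._values[:max_records]
        self._values = self._values[max_records:]
        if not batch:
            return {}
        return {"whylogs-stream-0": [FakeRecord(v) for v in batch]}


fake = FakeConsumer({"id": i} for i in range(309))
batch_sizes = [len(batch) for batch in iter_batches(fake)]
print(batch_sizes)
```

Each yielded batch is a plain list of dicts, so it can be turned into a `pandas.DataFrame` and profiled exactly as in the cell above.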


In [180]:
profile.view().to_pandas()

column,counts/n,counts/null,types/integral,types/fractional,types/boolean,types/string,types/object,cardinality/est,cardinality/upper_1,cardinality/lower_1,frequent_items/frequent_strings,type,distribution/mean,distribution/stddev,distribution/n,distribution/max,distribution/min,distribution/q_01,distribution/q_05,distribution/q_10,distribution/q_25,distribution/median,distribution/q_75,distribution/q_90,distribution/q_95,distribution/q_99,ints/max,ints/min
verification_status,309,0,0,0,0,309,0,3.000000,3.000150,3.0,"[FrequentItem(value='Not Verified', est=112, u...",SummaryType.COLUMN,,,,,,,,,,,,,,,,
settlement_percentage,309,309,0,0,0,0,0,0.000000,0.000000,0.0,,SummaryType.COLUMN,0.000000,0.000000,0.0,,,,,,,,,,,,,
total_rec_prncp,309,0,0,309,0,0,0,276.000188,276.013969,276.0,,SummaryType.COLUMN,5266.577896,6502.059928,309.0,35000.0,262.7,349.44,848.91,1228.39,1697.63,2965.6,5597.33,11347.67,20000.0,35000.0,,
num_accts_ever_120_pd,309,0,0,309,0,0,0,9.000000,9.000450,9.0,,SummaryType.COLUMN,0.488673,1.283272,309.0,9.0,0.0,0.00,0.00,0.00,0.00,0.0,0.00,2.00,4.0,7.0,,
all_util,309,0,0,309,0,0,0,87.000019,87.004362,87.0,,SummaryType.COLUMN,56.757282,21.046084,309.0,117.0,2.0,8.00,22.00,29.00,43.00,58.0,72.00,83.00,89.0,102.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
collections_12_mths_ex_med,309,0,0,309,0,0,0,3.000000,3.000150,3.0,,SummaryType.COLUMN,0.035599,0.270932,309.0,4.0,0.0,0.00,0.00,0.00,0.00,0.0,0.00,0.00,0.0,1.0,,
mths_since_rcnt_il,309,5,0,304,0,0,0,70.000012,70.003507,70.0,,SummaryType.COLUMN,23.013158,27.996225,304.0,228.0,1.0,1.00,3.00,4.00,7.00,14.0,27.00,52.00,88.0,164.0,,
mths_since_last_record,309,250,0,59,0,0,0,46.000005,46.002302,46.0,,SummaryType.COLUMN,64.932203,25.625193,59.0,111.0,17.0,17.00,19.00,27.00,44.00,68.0,84.00,95.00,102.0,111.0,,
sec_app_num_rev_accts,309,309,0,0,0,0,0,0.000000,0.000000,0.0,,SummaryType.COLUMN,0.000000,0.000000,0.0,,,,,,,,,,,,,
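
Profiling batch by batch works because the statistics in a profile can be combined without revisiting the raw rows. The toy sketch below illustrates that idea for a few of the metrics in the table (`counts/n`, `counts/null`, `distribution/min`, `distribution/max`). `ColumnSummary` is a hypothetical class written for this illustration, not how `whylogs` is implemented, and the sample values are made up.

```python
from dataclasses import dataclass
import math


@dataclass
class ColumnSummary:
    """Toy mergeable summary of one numeric column."""
    n: int = 0
    nulls: int = 0
    minimum: float = math.inf
    maximum: float = -math.inf

    def track(self, values):
        # Update the summary in place with a new batch of values
        for value in values:
            self.n += 1
            if value is None:
                self.nulls += 1
            else:
                self.minimum = min(self.minimum, value)
                self.maximum = max(self.maximum, value)
        return self

    def merge(self, other):
        # Combine two summaries without access to the original rows
        return ColumnSummary(
            n=self.n + other.n,
            nulls=self.nulls + other.nulls,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )


values = [17.0, None, 44.0, 111.0, None, 68.0]

# Summarize everything at once, then summarize two batches and merge them
whole = ColumnSummary().track(values)
merged = ColumnSummary().track(values[:3]).merge(ColumnSummary().track(values[3:]))
print(merged == whole)
```

For these simple metrics, tracking more rows into one summary and merging two separate summaries give the same result, which is why either approach fits a streaming loop.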


And voilà! With just a few lines of code we were able to profile and track incoming messages from a Kafka topic.
Hopefully this tutorial will help you get started with your existing streaming pipelines. If there are other integrations you would like to see, or you want to learn how other users are getting the most out of `whylogs`, please check out our [community Slack](https://bit.ly/rsqrd-slack).