# Kafi - Your Swiss Army Knife for Kafka Scripting
### Ralph Debusmann, Migros-Genossenschafts-Bund, Zürich, Switzerland

***

Welcome to a typical Kafka administrator nightmare...

...and how Kafi, your Swiss army knife for Kafka, will help us get out of it.

***

The situation: We have a JSONSchema-serialized topic "products" and a qualification test is looming for an application consuming from that topic.

***

Let's run the qualification test...

***

What's wrong with the topic?

In [None]:
from kafi.kafi import *

# Connect to Kafka
c = Cluster("local")

# Read the first message of the topic
c.head("products", type="bytes", n=1)

***

How many message values do not start with the magic byte (0)?

In [None]:
x = c.filter("products", type="bytes", filter_function=lambda x: x["value"][0] != 0)
print(len(x[0]))
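
The check above relies on the Confluent Schema Registry wire format, in which a serialized value starts with a zero magic byte followed by a 4-byte big-endian schema ID. Below is a minimal standalone sketch of the same test on raw bytes. The helper name `is_framed` and the sample values are made up for illustration and are not part of Kafi.

```python
MAGIC_BYTE = 0  # first byte of a Schema Registry framed value

def is_framed(value: bytes) -> bool:
    """Return True if the value is long enough for the 5-byte header
    (magic byte + 4-byte schema ID) and starts with the magic byte."""
    return len(value) >= 5 and value[0] == MAGIC_BYTE

# Hypothetical sample values, not taken from the real "products" topic
samples = [
    b'{"name": "cheese"}',                                         # plain JSON, no header
    b"\x00\x00\x00\x00\x07" + b'{"name": "milk", "price": 1.5}',  # framed, schema ID 7
    b"",                                                           # empty value
]

faulty = [value for value in samples if not is_framed(value)]
print(len(faulty))
```

Unlike the one-line lambda, this variant counts an empty value as faulty instead of raising an IndexError on `value[0]`.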

***

Let's back up these faulty messages to a topic on Kafi's Kafka emulator.

In [None]:
# Connect to Kafi's Kafka emulator
l = Local("local")

# (Re)create the backup topic
l.retouch("products_backup")

# Copy the first 100 messages from the Kafka topic "products" to the topic "products_backup" on the Kafka emulator
c.cp("products", l, "products_backup", source_type="json", target_type="json", n=100)

***

And then delete the first 100 messages on the real Kafka topic.

In [None]:
c.delete_records({"products": {0: 100}})

In [None]:
c.watermarks("products")
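
To make the expected effect concrete, here is a toy model of what deleting records does to a partition's watermarks. It reads the argument `{"products": {0: 100}}` as topic, then partition, then offset, and follows Kafka's semantics of removing all records before the given offset. The function name, the dictionary shapes and the high watermark of 250 are assumptions for illustration, not Kafi's actual return format.

```python
def apply_delete_records(watermarks, deletions):
    """Toy model: raise the low watermark of each partition to the requested offset.

    watermarks: {topic: {partition: (low, high)}}  (assumed shape)
    deletions:  {topic: {partition: offset}}
    """
    result = {}
    for topic, partitions in watermarks.items():
        result[topic] = {}
        for partition, (low, high) in partitions.items():
            offset = deletions.get(topic, {}).get(partition)
            if offset is None:
                # Partition not mentioned: nothing is deleted
                new_low = low
            elif offset > high:
                raise ValueError("offset beyond the high watermark")
            else:
                # Records before `offset` are gone; the low watermark never moves back
                new_low = max(low, offset)
            result[topic][partition] = (new_low, high)
    return result

# Hypothetical state: 250 messages in partition 0 of "products"
before = {"products": {0: (0, 250)}}
after = apply_delete_records(before, {"products": {0: 100}})
print(after)
```

In this model the high watermark stays untouched, and only the low watermark moves up to 100.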

***

Let's run the qualification test again...

***

However, we still have to restore the first 100 messages, because the producers are not available to send them again.

In [None]:
# Get the schema ID of the first good message value
z = c.head("products", type="bytes", n=1)
sid = int.from_bytes(z[0]["value"][1:5], "big")

# Try to copy the backup to Kafka (this time correctly JSONSchema-serialized)
l.cp("products_backup", c, "products", target_value_type="jsonschema", target_value_schema_id=sid)
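
The slice `[1:5]` works because, in the Confluent wire format, bytes 1 to 4 of a framed value hold the schema ID as a big-endian unsigned integer, right after the magic byte. The following sketch shows that framing with the standard library only. The helpers `frame` and `unframe` and the schema ID 42 are hypothetical and not part of Kafi, which handles this internally when writing with a schema ID.

```python
import json
import struct

def frame(payload: dict, schema_id: int) -> bytes:
    """Prefix a JSON payload with the magic byte 0 and a 4-byte big-endian schema ID."""
    header = struct.pack(">bI", 0, schema_id)  # ">" means big-endian, no padding: 5 bytes
    return header + json.dumps(payload).encode("utf-8")

def unframe(value: bytes):
    """Split a framed value into (schema_id, payload dict)."""
    magic, schema_id = struct.unpack(">bI", value[:5])
    if magic != 0:
        raise ValueError("value does not start with the magic byte")
    return schema_id, json.loads(value[5:])

# Hypothetical message and schema ID
framed = frame({"name": "milk", "price": 1.5}, 42)

# Same extraction as in the cell above
sid_from_slice = int.from_bytes(framed[1:5], "big")
print(sid_from_slice, unframe(framed))
```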

***

SerializationError: 'price' is a required property...

Let's check in Excel...

In [None]:
# Copy the backup to an Excel file
l.to_file("products_backup", l, "products_backup.xlsx", n=100)
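
Alongside the spreadsheet, the offending records could also be located programmatically. This is a minimal sketch that checks only the `required` keyword of a JSON Schema against already-decoded message values. The schema fragment, the sample messages and the helper `missing_required` are assumptions for illustration; the real schema lives in the Schema Registry.

```python
# Hypothetical schema fragment; only the "required" keyword is checked here
schema = {"type": "object", "required": ["name", "price"]}

def missing_required(value: dict, schema: dict) -> list:
    """Return the required property names that are absent from the value."""
    return [key for key in schema.get("required", []) if key not in value]

# Hypothetical decoded message values from the backup topic
backup = [
    {"name": "milk", "price": 1.5},
    {"name": "cheese"},  # 'price' is missing
]

for offset, value in enumerate(backup):
    missing = missing_required(value, schema)
    if missing:
        print(offset, missing)
```

A full validator would also check types and nested objects; this sketch only reproduces the kind of failure reported in the error message.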

***

Because the topic is on Kafi's Kafka emulator, we can just fix the messages in place.

***

And let's try to bring back the messages to the Kafka topic once again...

In [None]:
l.cp("products_backup", c, "products", target_value_type="jsonschema", target_value_schema_id=sid)

In [None]:
c.watermarks("products")

In [None]:
c.tail("products", type="bytes", n=1)

***

And let's run the qualification test one (hopefully last) time...

***

Very last step: Let's create a copy of that fixed topic in Parquet format for the analytics team - on S3.

In [None]:
# Connect to S3
s = S3("local")

# Copy the Kafka topic to a Parquet file on S3
c.to_file("products", s, "products.parquet", type="jsonschema")

***

That's really it.

Thanks to all my colleagues from Migros in Zürich, in particular the Data Integration team, especially Martin Muggli and Jason Nguyen - the Kafka guys. 

Get your copy of Kafi from GitHub: https://github.com/xdgrulez/kafi or just install it from PyPI ("pip install kafi").

***

Blatant advertising follows.

Get your copy of the new O'Reilly book "Streaming Databases" by Hubert Dulay and me.

And submit exciting abstracts to the new, non-vendor-centric conference about everything events and streaming: EventCentric 2025 (Antwerp, Belgium, June 2-5, 2025) https://aardling.eu/en/eventcentric-2025-coming-soon
