<h1>Real-Time Event Monitoring Dataset From Kafka</h1>

<h3>Introduction</h3>

The Real-Time Event Monitoring use case shows how to use SingleStore to ingest and analyze streaming data from a Kafka source. The demo loads real-time events, such as application logs or user activity, into a table through a pipeline and queries them while data is still arriving. By working through this example, new users will learn how to set up a Kafka pipeline, ingest streaming data into SingleStore, and run queries that break events down by country, advertiser, and visitor demographics.

<img src="https://singlestoreloaddata.s3.ap-south-1.amazonaws.com/images/LoadDataKafka.png" width="100%" height="50%"/>

<h3>Create Table</h3>

In [115]:
%%sql
CREATE TABLE `eventsdata` (
  `user_id` varchar(120) DEFAULT NULL,
  `event_name` varchar(128) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `advertiser` varchar(128) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `campaign` varchar(110) DEFAULT NULL,
  `gender` varchar(128) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `income` varchar(128) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `page_url` varchar(512) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `region` varchar(128) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `country` varchar(128) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL
) 

<h3>Load Data</h3>

In [116]:
%%sql
CREATE PIPELINE `eventsdata`
AS LOAD DATA KAFKA 'public-kafka.memcompute.com:9092/ad_events'
BATCH_INTERVAL 500
ENABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
INTO TABLE `eventsdata`
FIELDS TERMINATED BY '\t' ENCLOSED BY '' ESCAPED BY '\\'
LINES TERMINATED BY '\n' STARTING BY ''
(
    `events`.`user_id`,
    `events`.`event_name`,
    `events`.`advertiser`,
    `events`.`campaign`,
    `events`.`gender`,
    `events`.`income`,
    `events`.`page_url`,
    `events`.`region`,
    `events`.`country`
)
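
The `FIELDS TERMINATED BY '\t'` and `LINES TERMINATED BY '\n'` clauses tell the pipeline to treat each Kafka message as one tab-separated line whose fields map, in order, to the nine listed columns. The following is a minimal Python sketch of that positional mapping, not the pipeline's actual implementation. The sample record is made up for illustration (it is not taken from the topic), `parse_event` is a hypothetical helper, and `ESCAPED BY` handling is ignored.

```python
# Column order as listed in the CREATE PIPELINE statement.
COLUMNS = [
    "user_id", "event_name", "advertiser", "campaign", "gender",
    "income", "page_url", "region", "country",
]


def parse_event(line: str) -> dict:
    """Split one tab-separated line into a column-name -> value mapping."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        # This sketch simply rejects malformed lines.
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


# Hypothetical sample record, invented for this example.
sample = "u-123\tClick\tSubway\t7\tFemale\t50k - 75k\t/some/page\tCalifornia\tUS\n"
event = parse_event(sample)
print(event["advertiser"], event["country"])
```

The sketch raises an error when the field count is wrong; how the pipeline itself treats such lines is not covered here.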

In [117]:
%%sql
START PIPELINE `eventsdata`

In [124]:
%%sql

-- Check for ingestion errors; a header-only result means none were logged.
SELECT * FROM information_schema.pipelines_errors
WHERE pipeline_name = 'eventsdata';

DATABASE_NAME,PIPELINE_NAME,ERROR_UNIX_TIMESTAMP,ERROR_TYPE,ERROR_CODE,ERROR_MESSAGE,ERROR_KIND,STD_ERROR,LOAD_DATA_LINE,LOAD_DATA_LINE_NUMBER,BATCH_ID,ERROR_ID,BATCH_SOURCE_PARTITION_ID,BATCH_EARLIEST_OFFSET,BATCH_LATEST_OFFSET,HOST,PORT,PARTITION


In [85]:
%%sql
select count(*) from `eventsdata`

count(*)
18605603


<h3>Queries</h3>

Events by Top 5 Countries

In [55]:
%%sql
SELECT
    events.country AS `events.country`,
    COUNT(events.country) AS `events.countofevents`
FROM eventsdata AS events
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;

events.country,events.countofevents
US,11559770
CA,908737
AU,782307
DE,528278
ES,349873


Events by Top 5 Advertisers

In [150]:
%%sql
SELECT
    events.advertiser AS `events.advertiser`,
    COUNT(*) AS `events.count`
FROM eventsdata AS events
WHERE
    (events.advertiser LIKE '%Subway%'
     OR events.advertiser LIKE '%McDonalds%'
     OR events.advertiser LIKE '%Starbucks%'
     OR events.advertiser LIKE '%Dollar General%'
     OR events.advertiser LIKE '%YUM! Brands%')
GROUP BY 1
ORDER BY 2 DESC;

events.advertiser,events.count
Subway,1104981
YUM! Brands,687723
McDonalds,568016
Starbucks,514165
Dollar General,466456


Ad Visitors by Gender and Income

In [149]:
%%sql
SELECT * FROM (
SELECT *, DENSE_RANK() OVER (ORDER BY z___min_rank) as z___pivot_row_rank, RANK() OVER (PARTITION BY z__pivot_col_rank ORDER BY z___min_rank) as z__pivot_col_ordering, CASE WHEN z___min_rank = z___rank THEN 1 ELSE 0 END AS z__is_highest_ranked_cell FROM (
SELECT *, MIN(z___rank) OVER (PARTITION BY `events.income`) as z___min_rank FROM (
SELECT *, RANK() OVER (ORDER BY CASE WHEN z__pivot_col_rank=1 THEN (CASE WHEN `events.count` IS NOT NULL THEN 0 ELSE 1 END) ELSE 2 END, CASE WHEN z__pivot_col_rank=1 THEN `events.count` ELSE NULL END DESC, `events.count` DESC, z__pivot_col_rank, `events.income`) AS z___rank FROM (
SELECT *, DENSE_RANK() OVER (ORDER BY CASE WHEN `events.gender` IS NULL THEN 1 ELSE 0 END, `events.gender`) AS z__pivot_col_rank FROM (
SELECT
    events.gender AS `events.gender`,
    events.income AS `events.income`,
    COUNT(*) AS `events.count`
FROM eventsdata AS events
WHERE
    (events.income <> 'unknown' OR events.income IS NULL)
GROUP BY 1,2) ww
) bb WHERE z__pivot_col_rank <= 16384
) aa
) xx
) zz
WHERE (z__pivot_col_rank <= 50 OR z__is_highest_ranked_cell = 1) AND (z___pivot_row_rank <= 500 OR z__pivot_col_ordering = 1) ORDER BY z___pivot_row_rank;

events.gender,events.income,events.count,z__pivot_col_rank,z___rank,z___min_rank,z___pivot_row_rank,z__pivot_col_ordering,z__is_highest_ranked_cell
unknown,50k - 75k,570557,3,9,1,1,1,0
Female,50k - 75k,766350,1,1,1,1,1,1
Male,50k - 75k,1176804,2,6,1,1,1,0
unknown,25k - 50k,385439,3,12,2,2,2,0
Female,25k - 50k,510886,1,2,2,2,2,1
Male,25k - 50k,788583,2,7,2,2,2,0
unknown,75k - 99k,376522,3,13,3,3,3,0
Female,75k - 99k,500799,1,3,3,3,3,1
Male,75k - 99k,774418,2,8,3,3,3,0
unknown,25k and below,195162,3,14,4,4,4,0
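
The query returns one row per gender and income pair, followed by ranking columns that only control pivot layout. As a rough illustration of how to read the result, the Python sketch below reshapes the first three columns of the ten rows shown above into an income-by-gender grid. The `pivot` helper is a name introduced here for illustration, and income bands with fewer than three rows among those shown print `-` for the missing cells.

```python
# (gender, income, count) triples copied from the result rows shown above.
rows = [
    ("unknown", "50k - 75k", 570557),
    ("Female", "50k - 75k", 766350),
    ("Male", "50k - 75k", 1176804),
    ("unknown", "25k - 50k", 385439),
    ("Female", "25k - 50k", 510886),
    ("Male", "25k - 50k", 788583),
    ("unknown", "75k - 99k", 376522),
    ("Female", "75k - 99k", 500799),
    ("Male", "75k - 99k", 774418),
    ("unknown", "25k and below", 195162),
]


def pivot(rows):
    """Return {income: {gender: count}}, keeping first-seen income order."""
    grid = {}
    for gender, income, count in rows:
        grid.setdefault(income, {})[gender] = count
    return grid


grid = pivot(rows)
genders = ["Female", "Male", "unknown"]

# Print one line per income band, one column per gender.
print("income".ljust(16) + "".join(g.rjust(10) for g in genders))
for income, cells in grid.items():
    print(income.ljust(16) + "".join(str(cells.get(g, "-")).rjust(10) for g in genders))
```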


The pipeline keeps ingesting data from the Kafka topic until it is stopped. Once enough data has loaded, stop the pipeline with the following command:

In [86]:
%%sql
STOP PIPELINE eventsdata

Drop the pipeline with the following command:

In [None]:
%%sql
DROP PIPELINE eventsdata

Run the following statement if you also want to drop the table and its data:

In [None]:
%%sql
DROP TABLE eventsdata