# Streaming Project

Implement the following Flink SQL query using MySQL and Python. The goal is to emulate Flink SQL behavior as closely as possible.
-  you can tune the ingestion rate
-  you can implement the window semantics in the WHERE clause or by micro-batching
-  you can use triggers or any SQL construct that you know
-  you can assume that processing time and event time coincide
-  you can create as many supporting structures as you need
-  DON'T FORGET TO REMOVE OLD DATA (USING ANOTHER THREAD?)
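
The last hint can be approached with a small background thread. The following is only a minimal sketch under stated assumptions, not the expected solution: `start_cleaner` and `make_mysql_deleter` are hypothetical helper names, the `DELETE` statement assumes a `clicks` table with an integer epoch-seconds `timestamp` column, and the retention and interval values are arbitrary. It is probably safest to give the cleaner its own MySQL connection rather than sharing one with the ingesting code.

```python
import threading
import time


def start_cleaner(delete_before, retention_s=120, interval_s=60,
                  clock=time.time):
    """Start a daemon thread that periodically calls delete_before(cutoff).

    Returns (thread, stop_event); set the event to stop the loop.
    """
    stop_event = threading.Event()

    def loop():
        # wait() returns False on timeout (keep going) and True once stopped.
        while not stop_event.wait(interval_s):
            delete_before(int(clock()) - retention_s)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread, stop_event


def make_mysql_deleter(connection):
    """Build a delete_before callback for a dedicated DB-API connection."""
    def delete_before(cutoff):
        cursor = connection.cursor()
        # Assumed schema: clicks(user, country, timestamp INT epoch seconds).
        cursor.execute("DELETE FROM clicks WHERE `timestamp` < %s", (cutoff,))
        connection.commit()
        cursor.close()
    return delete_before
```

With a second connection `cleaner_db` opened the same way as the main one, `start_cleaner(make_mysql_deleter(cleaner_db))` would remove rows older than two minutes once per minute. The retention should stay longer than the window size so that a window is never deleted before it has been counted.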

## FLINK SQL Query

```sql 
SELECT
  TUMBLE_START(rowTime, INTERVAL '1' MINUTE) AS t,
  country,
  COUNT(*) AS cnt
FROM click
GROUP BY country, TUMBLE(rowTime, INTERVAL '1' MINUTE)
```
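
One possible way to express the same tumbling-window count in MySQL is to restrict the `WHERE` clause to the window that has just closed and group by country. The sketch below is illustrative rather than a reference solution: it assumes a `clicks` table that stores epoch seconds in an integer `timestamp` column, and `window_bounds` and `closed_window_counts` are hypothetical helper names. `cursor` can be any DB-API cursor, such as one obtained from `mysql.connector`.

```python
WINDOW_SECONDS = 60  # matches INTERVAL '1' MINUTE in the Flink query

# Counts per country for one window [start, end); the %s placeholders are
# filled in by the DB-API driver (mysql.connector uses this style).
WINDOW_SQL = (
    "SELECT %s AS t, country, COUNT(*) AS cnt "
    "FROM clicks "
    "WHERE `timestamp` >= %s AND `timestamp` < %s "
    "GROUP BY country"
)


def window_bounds(now, size=WINDOW_SECONDS):
    """Return (start, end) of the most recent window fully closed at `now`."""
    end = int(now) - int(now) % size
    return end - size, end


def closed_window_counts(cursor, now, size=WINDOW_SECONDS):
    """Run the window query for the last closed window and return its rows."""
    start, end = window_bounds(now, size)
    cursor.execute(WINDOW_SQL, (start, start, end))
    return cursor.fetchall()
```

Calling `closed_window_counts(mycursor, time.time())` once per minute, shortly after each window boundary, gives a micro-batch approximation of the Flink result. Because processing time and event time are assumed to coincide, no late rows are expected for a window that has already closed.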

## Data Ingestion

In [None]:
! pip install mysql-connector-python 

In [39]:
! pip install confluent_kafka

Collecting confluent_kafka
  Downloading confluent_kafka-1.7.0-cp39-cp39-manylinux2010_x86_64.whl (2.7 MB)
Installing collected packages: confluent-kafka
Successfully installed confluent-kafka-1.7.0


## Kafka Producer

In [49]:
from confluent_kafka import Producer
import sys, json
topic = "clicks"
brokers = "kafka1:9092" # Brokers act as cluster entry points
conf = {'bootstrap.servers': brokers}

In [41]:
p = Producer(**conf)

In [42]:
def delivery_callback(err, msg):
    """Report the delivery result of each produced message on stderr."""
    if err:
        sys.stderr.write('%% Message failed delivery: %s\n' % err)
    else:
        sys.stderr.write('%% Message delivered to %s [%d] @ %d\n' %
                         (msg.topic(), msg.partition(), msg.offset()))

In [55]:
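def to_kafka(ms):
    """Publish each (user, country, ts) tuple to the topic as a JSON message."""
    for m in ms:
        k = {'user': m[0], 'country': m[1], 'ts': m[2]}
        p.produce(topic, json.dumps(k), callback=delivery_callback)
        p.poll(0)  # serve delivery callbacks for messages sent earlier
    p.flush()  # block until every queued message has been delivered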

## MySQL ingestion

In [None]:
import mysql.connector

mydb = mysql.connector.connect(
  host="mysql",
  user="root",
  password="pass1234"
)


mycursor = mydb.cursor()

In [None]:
mycursor.execute("CREATE DATABASE IF NOT EXISTS stream")

In [None]:
mycursor.execute("SHOW DATABASES")
for x in mycursor:
  print(x) 

In [None]:
mycursor.execute("USE stream")

In [None]:
mycursor.execute("CREATE TABLE IF NOT EXISTS clicks (user VARCHAR(255), country VARCHAR(255), timestamp INT)")
for x in mycursor:
  print(x) 

In [None]:
userNames = ["Andy", "Bob", "Carl", "Dave", "Esther", "Fanny", "Gabe", "Imogen", "John", "Louis", "Monica"]

In [None]:
regionNames = ["America", "Europe", "Asia", "Australia", "Africa"]

In [43]:
sql = "INSERT INTO clicks (user, country, timestamp) VALUES (%s, %s, %s)"
def to_my_sql(val):
    mycursor.executemany(sql, val)
    mydb.commit()

## Data Generator
The data generator below sends data to MySQL and to a Kafka topic for Flink testing.

In [None]:
import random
import time

batch_window = 60  # seconds to sleep after each burst of batches
batches_per_window = 1000
batch_size = 500

while True:
    for _ in range(batches_per_window):
        val = []
        for _ in range(batch_size):
            # one click event: (user, country, epoch seconds)
            val.append((random.choice(userNames), random.choice(regionNames), int(time.time())))
        # to_my_sql(val)
        to_kafka(val)
    time.sleep(batch_window)
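
The loop above sends all of its batches back to back and then sleeps, so events arrive in bursts. If a steadier ingestion rate is wanted, one option is to spread the batches evenly over the window. This is a minimal sketch, not part of the original assignment code: `make_batch` and `paced_batches` are hypothetical names, and the `clock` and `sleep` parameters exist only so the pacing can be swapped out or tested.

```python
import random
import time


def make_batch(users, regions, size, now):
    """Build `size` click tuples of the form (user, country, epoch seconds)."""
    return [(random.choice(users), random.choice(regions), int(now))
            for _ in range(size)]


def paced_batches(users, regions, batches_per_window, batch_size,
                  window_s=60, clock=time.time, sleep=time.sleep):
    """Yield batches forever, pausing so `batches_per_window` fit in `window_s`."""
    pause = window_s / batches_per_window
    while True:
        yield make_batch(users, regions, batch_size, clock())
        sleep(pause)
```

A loop such as `for val in paced_batches(userNames, regionNames, 1000, 500): to_kafka(val)` would then replace the nested loops. The pause does not account for the time spent sending each batch, so the actual rate will be somewhat lower than the nominal one.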

# COPY THIS NOTEBOOK TO CONTINUE INGESTING IN PARALLEL

<img src="copy.png" alt="5" border="0">

In [None]:
import mysql.connector

mydb = mysql.connector.connect(
  host="mysql",
  user="root",
  password="pass1234"
)

mycursor = mydb.cursor()
mycursor.execute("USE stream")

In [None]:
mycursor.execute("SELECT * FROM clicks LIMIT 10")
for x in mycursor:
  print(x) 