# Data Science Infrastructure - Kafka

This IPython notebook contains several Kafka producers for demonstration purposes.

**Important remark: Not to be published, as it contains my access data to the Twitter developer API.**

## Kafka Producer for self-defined streams

We need to install this library (the PyPI package is named kafka-python, not kafka):
* python3 -m pip install kafka-python

In [1]:
!pip install kafka-python

Collecting kafka-python
  Downloading kafka_python-2.0.2-py2.py3-none-any.whl (246 kB)
Installing collected packages: kafka-python
Successfully installed kafka-python-2.0.2


In [2]:
# load important libraries
from time import sleep
from json import dumps
from kafka import KafkaProducer

In [9]:
# create producer
producer = KafkaProducer(bootstrap_servers=['kafka:9092'],
                         value_serializer=lambda x: dumps(x).encode('utf-8'))
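
To see what the value_serializer above hands to Kafka, the following minimal sketch applies the same lambda to a sample record without contacting a broker. The names serialize and deserialize and the sample record are illustrative assumptions; only dumps and the lambda itself come from the cell above.

```python
from json import dumps, loads

# Same function as the value_serializer passed to KafkaProducer above
serialize = lambda x: dumps(x).encode('utf-8')

# Hypothetical inverse, as a consumer might use for value_deserializer
deserialize = lambda b: loads(b.decode('utf-8'))

record = {'number': 7}          # sample record, same shape as in the loop below
payload = serialize(record)     # bytes that would be written to the topic

print(payload)                  # b'{"number": 7}'
print(deserialize(payload))     # {'number': 7}
```

Because the serializer runs inside the producer, producer.send() can be given plain dicts and the topic still receives UTF-8 encoded JSON.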

In [10]:
# now write the numbers 0 to 999 to an existing topic
# (with a delay of 3 seconds between messages)
for e in range(1000):
    data = {'number' : e}
    print('streaming: ', data)
    producer.send('testTopic', value=data)
    sleep(3)

streaming:  {'number': 0}
streaming:  {'number': 1}
streaming:  {'number': 2}
streaming:  {'number': 3}
streaming:  {'number': 4}
streaming:  {'number': 5}
streaming:  {'number': 6}
streaming:  {'number': 7}
streaming:  {'number': 8}
streaming:  {'number': 9}


KeyboardInterrupt: 

In [8]:
# close connection
producer.close()

## Kafka Producer for Twitter streams

This requires the tweepy package. The code below uses the StreamListener class, which exists only in tweepy releases before 4.0:
* python3 -m pip install "tweepy<4"

In [None]:
# import libraries
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from kafka import KafkaProducer

In [None]:
# you need a developer account at Twitter...
#access_token = (get your own)
#access_token_secret = (get your own)
#consumer_key = (get your own)
#consumer_secret = (get your own)

# listener that forwards each raw tweet to the Kafka topic 'twitter'
class StdOutListener(StreamListener):
    def on_data(self, data):
        # data is the raw JSON string of the tweet; Kafka expects bytes
        producer.send("twitter", data.encode('utf-8'))
        return True
    def on_error(self, status):
        print(status)

In [None]:
# connect to Kafka broker and stream all 'trump' tweets
producer = KafkaProducer(bootstrap_servers=['localhost:29092'])
l = StdOutListener()
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
stream = Stream(auth, l)
# track expects a list of phrases, not a single string
stream.filter(track=["trump"])

In [None]:
stream.disconnect()

We can check the stream using:
- bin/kafka-console-consumer.sh --bootstrap-server 10.64.0.45:9092 --topic twitter --from-beginning
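
The same check can be done from Python. This is a minimal sketch, assuming kafka-python is installed and the broker is reachable at the address used in the producer cell (localhost:29092). The function names decode_tweet and read_tweets and the max_messages parameter are hypothetical helpers, not part of the notebook.

```python
import json


def decode_tweet(raw):
    """Turn the raw bytes stored in the topic back into a dict."""
    return json.loads(raw.decode('utf-8'))


def read_tweets(bootstrap_servers='localhost:29092', max_messages=5):
    """Print up to max_messages tweets from the 'twitter' topic."""
    # imported here so that decode_tweet can be used without kafka-python
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        'twitter',
        bootstrap_servers=bootstrap_servers,
        auto_offset_reset='earliest',   # start at the oldest message, like --from-beginning
        consumer_timeout_ms=10000,      # stop iterating after 10 s without messages
        value_deserializer=decode_tweet,
    )
    try:
        for count, message in enumerate(consumer, start=1):
            print(message.value)
            if count >= max_messages:
                break
    finally:
        consumer.close()
```

Calling read_tweets() while the stream is running should print the first few tweets as dictionaries; if the topic is empty, the loop ends after the timeout.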

## Kafka Producer for stock prices (exam scenario)

We have to install these libraries:
* python3 -m pip install yahoo_fin
* python3 -m pip install requests_html

Also create the Kafka topic task1:
* bin/kafka-topics.sh --create --zookeeper 10.64.0.45:2181 --replication-factor 1 --partitions 1 --topic task1

In [None]:
# load important libraries
from datetime import datetime
from time import sleep
from json import dumps
from kafka import KafkaProducer

# import stock_info module via yahoo_fin
from yahoo_fin import stock_info as si

In [None]:
# connect to Kafka broker and stream the Apple share price every 10 seconds
producer = KafkaProducer(bootstrap_servers=['localhost:29092'],
                         value_serializer=lambda x: dumps(x).encode('utf-8'))
for e in range(1000):
    data = {'datetime': datetime.now().strftime("%d/%m/%Y %H:%M:%S"), 'apple' : si.get_live_price("aapl")}
    print(data)
    producer.send('task1', value=data)
    sleep(10)

In [None]:
producer.close()

In [None]:
# Exam: Store 20 entries of the stock price at an interval of 1 minute in
#       a Spark DataFrame and calculate the five-number summary of the share price
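
As a reference for the exam task, the five-number summary consists of the minimum, first quartile, median, third quartile and maximum. The sketch below computes it in plain Python on made-up prices. It is not the Spark solution, and both the function name five_number_summary and the sample values are assumptions for illustration. Quartile conventions differ between tools, so Spark may report slightly different quartiles for the same data.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, median, q3 = quantiles(values, n=4, method='inclusive')
    return (min(values), q1, median, q3, max(values))


# Hypothetical share prices, not real market data
prices = [150.0, 151.0, 152.0, 153.0, 154.0]
summary = five_number_summary(prices)
print(summary)   # (150.0, 151.0, 152.0, 153.0, 154.0)
```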

## Producing events from Wikipedia

To simulate an application generating events for Kafka, I created this simple Python producer application that reads events from Wikipedia and sends them to the wiki-changes Kafka topic.

Each message is serialized as JSON before it is sent to the Kafka topic. The sseclient library reads the source events, and kafka-python lets Python produce Kafka messages.

* pip install sseclient
* pip install kafka-python

To observe the data stream, use the command-line tool:

* /usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic wiki-changes

In [None]:
## Event Producer
import json
from IPython.display import clear_output
from time import sleep
from sseclient import SSEClient as EventSource
from kafka import KafkaProducer

# Create producer
producer = KafkaProducer(
    bootstrap_servers='localhost:9092', #Kafka server
    value_serializer=lambda v: json.dumps(v).encode('utf-8') #json serializer
    )

# Read streaming event
url = 'https://stream.wikimedia.org/v2/stream/recentchange'
try:
    for event in EventSource(url):
        if event.event == 'message':
            try:
                change = json.loads(event.data)
            except ValueError:
                pass
            else:
                #Send msg to topic wiki-changes and wait 5 seconds
                clear_output(wait=True)
                print("captured a change: ", change)
                producer.send('wiki-changes', change)
                sleep(5)

except KeyboardInterrupt:
    print("process interrupted")
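
The parsing step of the loop above can be tried without a network connection or a broker. This minimal sketch mirrors the event-type check and the try/except around json.loads; the helper name parse_change and the sample payloads are hypothetical.

```python
import json


def parse_change(event_type, data):
    """Return the decoded change, or None if the event should be skipped."""
    if event_type != 'message':
        return None
    try:
        return json.loads(data)
    except ValueError:
        # malformed or empty payload: skip it, as the loop above does
        return None


print(parse_change('message', '{"title": "Apache Kafka"}'))  # {'title': 'Apache Kafka'}
print(parse_change('message', ''))                           # None
print(parse_change('error', '{"title": "Apache Kafka"}'))    # None
```

Keeping the parsing in a small function like this makes it possible to test the filtering rules separately from the producer.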