## Stream Processing: Complete Architecture

Goals:

* Perform real-time analytics on stock prices
* For this we use two data sources:
    * [IEX Cloud](https://iexcloud.io/)
    * [Twitter](https://developer.twitter.com/)
* We implement the following data pipeline:
    * Data ingestion with NiFi and Kafka
    * Data processing with Spark Streaming
    * Data storage in Elasticsearch
    * Data visualization in Kibana


Let's create the data processing job for stock prices

In [None]:
import json
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.streaming import StreamingContext

# Note: pyspark.streaming.kafka exists only in Spark 2.x (removed in Spark 3.0)
from pyspark.streaming.kafka import KafkaUtils

# Let the Elasticsearch connector create missing indices automatically
conf = SparkConf().setAll([("es.index.auto.create", "true")])

sc = SparkContext(appName="QuotesConsumer", conf=conf)
# Micro-batch interval of 1 second
ssc = StreamingContext(sc, 1)

kafkaParams = {'bootstrap.servers': 'broker:29092', 'group.id': 'quotesConsumer'}
topicQuotes = ['quotes']

# Each record arrives as a (key, value) tuple; the value is a JSON string
streamQuotes = KafkaUtils.createDirectStream(ssc, topicQuotes, kafkaParams)

Load the companies JSON reference data into a dictionary keyed by symbol

In [None]:
companies = {}
# Parse companies data: the file holds one JSON object per line
with open('/tmp/data/companies.json') as json_file:
    for line in json_file:
        if not line.strip():
            continue  # skip blank lines
        data = json.loads(line)
        companies[data["symbol"]] = data
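
The loader above assumes `companies.json` is in JSON Lines form: one object per line, each with a `symbol` field. A minimal sketch of the same parsing on an in-memory sample is shown below. The sample records, the `companyName` and `sector` fields, and the `load_companies` helper name are hypothetical, made up for illustration only.

```python
import io
import json

def load_companies(lines):
    """Build a symbol -> company record dict from JSON Lines input."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        data = json.loads(line)
        result[data["symbol"]] = data
    return result

# Hypothetical sample standing in for /tmp/data/companies.json
sample = io.StringIO(
    '{"symbol": "AAA", "companyName": "Example A", "sector": "Tech"}\n'
    '\n'
    '{"symbol": "BBB", "companyName": "Example B", "sector": "Energy"}\n'
)
companies_sample = load_companies(sample)
print(sorted(companies_sample))
```

Keying the dictionary by symbol makes the later per-quote lookup a constant-time operation.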

Create an enrichment function that adds the company data to every incoming quote

In [None]:
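def enrichQuote(quote, companies):
    # quote is a (key, value) Kafka record; the value is a JSON string
    quoteJson = json.loads(quote[1])
    # Fall back to an empty dict so an unknown symbol does not fail the batch
    companyJson = companies.get(quoteJson["symbol"], {})
    quoteJson.update(companyJson)
    return (quote[0], json.dumps(quoteJson))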
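
To see what the enrichment produces, here is a small standalone sketch that runs the same kind of merge on one hand-made record, with no Spark involved. The sample quote, the `latestPrice` and `sector` fields, and the `enrich_quote_demo` name are assumptions for illustration; the real field names depend on the payload received from the data source.

```python
import json

def enrich_quote_demo(quote, companies):
    """Mirror the merge done in enrichQuote on a (key, json_string) pair."""
    quote_json = json.loads(quote[1])
    # dict.update copies company fields in; same-named fields are overwritten
    quote_json.update(companies[quote_json["symbol"]])
    return (quote[0], json.dumps(quote_json))

# Hypothetical reference data and Kafka-style record
demo_companies = {"AAA": {"symbol": "AAA", "sector": "Tech"}}
demo_record = (None, '{"symbol": "AAA", "latestPrice": 10.5}')

key, value = enrich_quote_demo(demo_record, demo_companies)
enriched = json.loads(value)
print(enriched)
```

The function returns a JSON string rather than a dictionary because the sink is later configured to receive documents that are already serialized.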

Now apply a map to add these company reference fields to each quote

In [None]:
streamQuotesEnriched = streamQuotes.map(lambda record: enrichQuote(record, companies))

# Print the first records of each batch for inspection
streamQuotesEnriched.pprint()

Finally, we add a sink to store the enriched quotes in Elasticsearch

In [None]:
es_write_quotes = {
    "es.nodes": "elasticsearch",
    "es.port": "9200",
    "es.resource": "quotes/_doc",   # target index/type
    "es.input.json": "yes",         # values are already JSON strings
}


streamQuotesEnriched.foreachRDD(lambda rdd: rdd.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_write_quotes))

Let's create the data processing job for Twitter

In [None]:
topicTwitter = ['tweets']

streamTwitter = KafkaUtils.createDirectStream(ssc, topicTwitter, kafkaParams)

# Print the first records of each batch for inspection
streamTwitter.pprint()


es_write_twitter = {
        "es.nodes" : "elasticsearch",
        "es.port" : "9200",
        "es.resource" : 'tweets/_doc',
        "es.input.json": "yes",
    }

streamTwitter.foreachRDD(lambda rdd: rdd.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_write_twitter))
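
The two sink configurations above differ only in their `es.resource` value. As a sketch, a small helper could build both from the index name. The `es_write_conf` function and its parameter names are hypothetical and not part of the Elasticsearch connector; only the configuration keys come from the cells above.

```python
def es_write_conf(index, nodes="elasticsearch", port="9200"):
    """Return a write configuration dict for one Elasticsearch index."""
    return {
        "es.nodes": nodes,
        "es.port": port,
        "es.resource": f"{index}/_doc",
        "es.input.json": "yes",  # values are already JSON strings
    }

es_quotes_conf = es_write_conf("quotes")
es_tweets_conf = es_write_conf("tweets")
print(es_tweets_conf["es.resource"])
```

Either dictionary can then be passed as the `conf` argument of `saveAsNewAPIHadoopFile`, in place of the hand-written ones.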

Finally, let's launch the streaming job

In [None]:

ssc.start()
ssc.awaitTermination()