# Real Time Big Data Analytics Using Apache Spark

# Introduction

This tutorial introduces the concepts of real-time big data analytics and shows how it is achieved with Apache Spark, along with a high-level architectural view of the data pipeline created when dealing with a live stream of data. Before that, it sheds some light on the business requirements and the motivation behind companies moving from "store and process" data models towards "stream and compute" data models. Finally, with a real-world example, I show how to analyse an incoming stream of tweets on the fly to find the topics that are trending at that moment.

So far in this course, we have dealt with static datasets stored in CSV files. We have also requested up-to-date data from APIs, but essentially we were dealing with pieces of data that did not change in real time. This is called batch analytics. In this setting, a typical e-commerce company has transactions occurring throughout the day that are stored in the order management system, and once a day the data is aggregated and put into HDFS. Every day, the analysis is performed on the data from the previous day. However, many business goals in today's world are highly time-sensitive and require more than static dataset analysis. Consider a logging system that records events such as system errors, warnings and security-related events. Such events should immediately trigger a set of preventive or curative measures instead of being reported the next day, when it is too late!

# What is RTDA?

RTDA stands for Real Time Data Analytics. It is the analysis of data as soon as it is produced, i.e. with as little latency as possible: the number of customers visiting a page link, customers buying a particular product on an e-commerce website, the number of tweets produced every second. Near Real Time Analytics analyses the data within a given timeframe (for example 5 seconds or 1 minute) that is close to the time it is produced. The timeframe is defined by the time sensitivity of the business requirement. For instance, for an Intensive Care Unit (ICU) sub-system monitoring the vitals of a patient, it is not acceptable to raise an alarm 1 minute after the data is produced. Similarly, finance and cyber-security require fraud detection in real or 'near real' time.

RTDA is useful in:

- Application health monitoring: reporting errors, warnings, number of clicks on a link
- User sentiment analysis: response on social media regarding a new feature/technology in your product
- High-frequency stock trading: decisions refer to present data instead of historical data

Find more about real time analytics here https://www.em360tech.com/tech-news/real-world-use-cases-real-time-data-analytics/. Find more business applications of RTDA here https://www.quora.com/What-are-some-good-examples-of-realtime-analytics.

# Modelling Real Time Data For Analytics

Real-time data can be seen as a stream of events, where each event has a timestamp indicating when it occurred and some data indicating what occurred. This stream can be used for aggregating statistics, for applying existing Machine Learning models to the events as they occur, or for building Machine Learning models on the fly from the incoming stream and applying them to the data that follows immediately afterwards. The third type of analysis is used when we do not want the model to be a day old, or even an hour old; instead, we want the model to be built on the most recent data at any time. How do we accomplish this?
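Before turning to the tooling, the third kind of analysis can be illustrated in miniature. The sketch below is plain Python, not Spark, and everything in it (the `Event` class, the `RunningStats` model, the threshold and warm-up values) is a hypothetical choice made for illustration. It keeps a running mean and standard deviation, scores each incoming event against the model built from the events seen so far, and only then updates the model with that event.

```python
import math
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # when the event occurred
    value: float      # what was observed, e.g. a response time in ms


class RunningStats:
    """Mean and variance maintained incrementally (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def flag_outliers(events, threshold=3.0, warmup=5):
    """Score each event against the model built from earlier events, then learn from it."""
    stats = RunningStats()
    flagged = []
    for ev in events:
        std = stats.std()
        if stats.n >= warmup and std > 0 and abs(ev.value - stats.mean) > threshold * std:
            flagged.append(ev)
        stats.update(ev.value)  # the model is never older than the previous event
    return flagged


stream = [Event(t, v) for t, v in enumerate([100, 102, 98, 101, 99, 100, 250, 101])]
for ev in flag_outliers(stream):
    print("unusual event at t =", ev.timestamp, "value =", ev.value)
```

The point of the design is the ordering inside the loop: score first, update second, so every event is judged by a model that has seen everything before it and nothing after.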

# Introducing Spark!

[<img src="https://spark.apache.org/docs/2.2.0/img/streaming-arch.png">](https://spark.apache.org/docs/2.2.0/img/streaming-arch.png)

Spark is a distributed computing framework: a fast, general-purpose big data processing engine with built-in modules for streaming, machine learning and graph processing. Datasets today are often too large for a single machine and need to be distributed across a cluster of servers. Spark Streaming converts a continuous stream of data into a timed sequence of discrete datasets called RDDs (Resilient Distributed Datasets). The amount of data that goes into each dataset is governed by the timeframe (the batch interval) over which you want Spark to collect the data. A dataset becomes ready for analysis as soon as its time window has elapsed, and the sequence of these datasets is called a DStream, or discretized stream.
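The idea of cutting a continuous stream into timed datasets can be sketched without Spark. The helper below is a plain-Python illustration and not Spark's implementation; the function name `discretize` and the sample events are made up. It groups timestamped events into consecutive batches of a fixed interval. Unlike this sketch, a real streaming engine would typically also emit an empty batch for an interval in which nothing arrived.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, payload) pairs into consecutive batches of batch_interval seconds.

    Returns a list of (batch_start_time, [payloads]) ordered by time, which is
    roughly what a DStream is: a timed sequence of small datasets.
    """
    batches = defaultdict(list)
    for timestamp, payload in events:
        batch_start = int(timestamp // batch_interval) * batch_interval
        batches[batch_start].append(payload)
    return sorted(batches.items())


events = [(0.5, "a"), (3.2, "b"), (9.9, "c"), (10.1, "d"), (27.0, "e")]
for start, payloads in discretize(events, batch_interval=10):
    print(f"batch [{start}, {start + 10}): {payloads}")
```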

This is what the data pipeline looks like:



[<img src="https://spark.apache.org/docs/2.2.0/img/streaming-flow.png">](https://spark.apache.org/docs/2.2.0/img/streaming-flow.png)


We can see the continuous stream of data flowing into the Spark Streaming engine and being converted into a sequence of discretized datasets. We can apply transformations such as map, filter, union and reduce to these DStreams, run arbitrary logic on each dataset with foreachRDD, save them as text or Hadoop files, or simply print them. Each operation takes the current elements of the DStream as input and produces a new DStream whose elements are the output of that operation.

[<img src="https://3.bp.blogspot.com/-ICu9V1GM9z4/Wij6KutEZfI/AAAAAAAABQs/xEjJ3IAWKW8mKSolTrRYMLeJxdduY80BwCLcBGAs/s1600/figure5-28.png">](https://3.bp.blogspot.com/-ICu9V1GM9z4/Wij6KutEZfI/AAAAAAAABQs/xEjJ3IAWKW8mKSolTrRYMLeJxdduY80BwCLcBGAs/s1600/figure5-28.png)

The documentation for Spark is provided here https://spark.apache.org/docs/0.9.1/scala-programming-guide.html. 

Below is a real-life use case for analysing real-time data using Spark. The example takes in the live stream of tweets from the Twitter API and analyses it to find the top 10 hottest topics on Twitter. We will go over the whole process step by step.

# Setting up Spark

# Setting up the stream of events

First of all, we need to provide a stream of events, and we need to make sure that the packages we are going to use are available to the notebook. Python provides a package called 'tweepy', which stands for 'twitter for python' and enables us to use the streaming API that Twitter provides. The documentation for tweepy can be found here: http://docs.tweepy.org/en/v3.4.0/streaming_how_to.html. Use one of the following pip or conda commands to install the package.

In [None]:
# Installation cells have been purposely left without running.
# The first time, this step installs the packages. On subsequent runs it prints 'Requirement already satisfied'.

! pip install tweepy
# or
! conda install -c conda-forge tweepy

In order to get a live stream of tweets, register on Twitter Apps and create your own Twitter application in the developer options, as we did previously for the Yelp API in hw1. The Twitter API requires you to authenticate yourself in order to make any calls for data streaming. Go to the 'Keys and Access Tokens' tab in your Twitter app. You must store your authentication credentials (consumer key, consumer secret and access tokens) in a .txt file kept out of version control. Now you are fully equipped to proceed. Below is the logic of the first part, i.e. process 1, followed by its code.

 - In the main program, first authenticate yourself.
 - Then, create a stream of tweets and filter it to contain only those tweets that pertain to the United States, i.e. have 'u.s.' in their text. This way we will get the top 10 hot topics related to the United States.
 - Whenever data arrives, tweepy calls the **on_data()** method. In on_data(self, data), the argument data is the whole tweet object, which contains all the information about the tweet, e.g. who created it, when it was created, all its hashtags and the tweet text itself. on_data() is called for each tweet that contains 'u.s.', and the tweet is printed by the **print(data.rstrip())** command.
 - In case of an error, tweepy calls the **on_error()** method, which prints the status of the call.
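To make the shape of the data argument more concrete, here is a small sketch of how the fields mentioned above could be picked out of one raw tweet string. It uses only the standard json module. The helper name `summarize_tweet` and the sample tweet are made up for illustration, and the `entities.hashtags` layout is an assumption about the tweet object that you should check against a real tweet.

```python
import json


def summarize_tweet(raw):
    """Pick a few fields out of one raw tweet string as delivered to on_data()."""
    tweet = json.loads(raw)
    # Assumed layout: hashtags live under entities -> hashtags -> [{"text": ...}, ...]
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    return {
        "created_at": tweet.get("created_at"),
        "user": tweet.get("user", {}).get("screen_name"),
        "text": tweet.get("text", ""),
        "hashtags": [h["text"] for h in hashtags],
    }


# A hand-written, heavily shortened stand-in for a real tweet object.
sample = json.dumps({
    "created_at": "Fri Mar 30 00:47:14 +0000 2018",
    "text": "Snow again in the u.s. #snow #cold",
    "user": {"screen_name": "example_user"},
    "entities": {"hashtags": [{"text": "snow"}, {"text": "cold"}]},
})
print(summarize_tweet(sample))
```

Using `.get()` with defaults keeps the helper from failing on messages that lack some of these fields.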

Below is the code for the complete process; its output is the tweet objects containing the word 'u.s.'. Since you are receiving a live stream of incoming data, you will have to manually interrupt the kernel for this cell in order to terminate the process.

In [1]:
import os
import tweepy
from tweepy import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

# Put your own credentials here. After running the cell, I replaced mine with xxx...xxx.
consumer_key = 'xxx........xxx'
consumer_secret = 'xxx............xxx'
access_token = 'xxx.............xxx'
access_token_secret = 'xxx............xxx'

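class StdOutListener(StreamListener):
    """Prints every raw tweet received from the stream to standard output."""

    def on_data(self, data):
        # data is the raw JSON string of one tweet; strip the trailing newline before printing.
        print(data.rstrip())
        return True  # keep the stream open

    def on_error(self, status):
        # status is the HTTP status code returned by the streaming API.
        print(status)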

try:
    if __name__ == '__main__':
        listener = StdOutListener()
        authentic = OAuthHandler(consumer_key, consumer_secret)
        authentic.set_access_token(access_token, access_token_secret)

        # This is a tweepy Stream, not yet a Spark DStream; filter() blocks while tweets arrive.
        tweet_stream = Stream(authentic, listener)
        tweet_stream.filter(track=['u.s.'])

except KeyboardInterrupt:
    print("\n --------------------------------------------------------------------------------------")
    print("********************Process interrupted manually**********************")
    print("\n --------------------------------------------------------------------------------------")

{"created_at":"Fri Mar 30 00:47:14 +0000 2018","id":979520390891540486,"id_str":"979520390891540486","text":"RT @ThisWeekABC: EXCLUSIVE: EPA Administrator Scott Pruitt has spent much of his first year in Washington living in a townhouse one block f\u2026","source":"\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":801078727899774976,"id_str":"801078727899774976","name":"Gary L Loving","screen_name":"GaryLLoving","location":"1100 N. Stonewall Ave, Oklahoma City, OK","url":"http:\/\/nursing.ouhsc.edu","description":"Interim Dean, OU Fran & Earl Ziegler College of Nursing (Tweets are my own and don't represent the position of the University or the College)","translator_type":"none","protected":false,"verified":false,"followers_count":53,"friends_count":141,"listed_c
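The cell above hard-codes the four credentials as placeholders. Since the text recommends keeping them in a .txt file outside version control, one possible way to load them is sketched below. The `key = value` file layout and the helper name `load_credentials` are assumptions made for this example, and the temporary file only stands in for your real credentials file.

```python
import os
import tempfile


def load_credentials(path):
    """Read key = value lines from a text file into a dict, skipping blanks and # comments."""
    credentials = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            credentials[key.strip()] = value.strip()
    return credentials


# Demonstration with a throwaway file; a real file would live outside the repository.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("# Twitter app credentials\nconsumer_key = xxx\nconsumer_secret = yyy\n")
demo_path = tmp.name

creds = load_credentials(demo_path)
os.remove(demo_path)
print(sorted(creds))
```

With such a helper, process 1 could read `creds["consumer_key"]` and the other three values instead of containing them as string literals.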

In process 1 as written, the tweets are only printed. Instead of just printing them, we want to send them to the Spark engine, which reads from port 9999 on localhost, so that it can perform some computation on them. It is important to understand that we need two parallel processes: one that produces the stream of tweets and sends it to the Spark engine, and another that receives the stream in Spark as a DStream, performs computation on it and stores the results in a Spark SQL table. These two processes must run simultaneously in order to accomplish data analytics in real time, which is, unfortunately, not possible using the Jupyter notebook. Hence, save the above code (process 1) separately as a Python file, tutorial_part_1.py, and run it at the same time as process 2. To serve the output on localhost port 9999, use the following command in a terminal (bash):

In [None]:
# Run this in a terminal, not in the notebook.
# nc listens on port 9999 (-l), keeps listening after a client disconnects (-k) and forwards every line printed by process 1.

python tutorial_part_1.py | nc -lk 9999
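The shell pipeline above relies on nc to turn printed lines into a network stream: one side listens on a port and writes newline-terminated text, the other side connects and reads it line by line. The sketch below reproduces that hand-off with Python's standard socket module, purely to illustrate the mechanism. It is not how Spark's receiver is implemented, the function names are made up, and it uses a free port chosen by the operating system rather than 9999.

```python
import socket
import threading


def serve_lines(server_socket, lines):
    """Accept one client and send it each line followed by a newline."""
    connection, _ = server_socket.accept()
    with connection:
        for line in lines:
            connection.sendall((line + "\n").encode("utf-8"))


def read_lines(host, port):
    """Connect to host:port and return every newline-terminated line received."""
    with socket.create_connection((host, port)) as client:
        with client.makefile("r", encoding="utf-8") as reader:
            return [line.rstrip("\n") for line in reader]


# Port 0 asks the operating system for any free port, so the demo cannot clash with 9999.
server = socket.create_server(("127.0.0.1", 0))
port = server.getsockname()[1]
sender = threading.Thread(target=serve_lines, args=(server, ['{"text": "one"}', '{"text": "two"}']))
sender.start()
received = read_lines("127.0.0.1", port)
sender.join()
server.close()
print(received)
```

Because each tweet is sent as one line, the reader can treat every line as one complete JSON string.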

The output of process 1 is a stream of JSON strings. We only need the 'text' part of each string.

- To convert the stream of JSON strings into a stream of tweet texts, the **map()** function is used. For each tweet in the stream, the extractTweetText function is called, which returns the text part of the tweet if one exists. Since each tweet arrives as a JSON string, we first parse it using **json.loads()**. The map function thus takes each element of the DStream, which is a JSON string, and replaces it with the text component of that tweet.
- Next, split the text of each tweet into words using the **flatMap()** function, in order to find the hashtagged words among them. The difference between map() and flatMap() is that flatMap() can produce multiple outputs from each element of the stream. Hence, every DStream element yields multiple outputs, one per word in the tweet text.
- The **filter()** function removes the undesired elements, i.e. it keeps only those elements that pass the filter function. We use it to keep only the words that start with a hashtag.
- The next **map()** function outputs, for every element, a tuple containing the lowercase hashtag and the count 1, so that we can add up these 1s per hashtag.
- The **reduceByKey()** function adds up the 1s of all tuples that share the same key (hashtag), and thus reduces the stream by its key.
- The third **map()** function turns each element into a TagCount object containing the hashtag and the number of times it occurs. The transformed DStream can now be saved into a table so that SQL queries can be run on it.
- The **foreachRDD()** function applies a piece of logic to each RDD in the DStream. The logic here converts each RDD into a DataFrame and registers it as the table 'tag_counts'.
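To see what each of these steps does without a running Spark engine, the same chain can be applied to a single batch in plain Python. This is only an emulation of the logic on an ordinary list, with made-up sample tweets and helper names; on a DStream the corresponding calls are map, flatMap, filter and reduceByKey.

```python
import json
from collections import Counter


def extract_text(raw):
    """map step: JSON string -> tweet text ('' when the tweet has no text field)."""
    return json.loads(raw).get("text", "")


def hashtag_counts(batch):
    """Apply the map/flatMap/filter/map/reduceByKey chain to one batch of JSON strings."""
    texts = [extract_text(raw) for raw in batch]                  # map: one output per input
    words = [word for text in texts for word in text.split(" ")]  # flatMap: many outputs per input
    hashtags = [word for word in words if word.startswith("#")]   # filter: keep hashtags only
    pairs = [(tag.lower(), 1) for tag in hashtags]                # map: (key, 1) tuples
    counts = Counter()
    for tag, one in pairs:                                        # reduceByKey: add the 1s per key
        counts[tag] += one
    return counts


batch = [
    json.dumps({"text": "Cold front over the u.s. #Snow #cold"}),
    json.dumps({"text": "More #snow tomorrow"}),
    json.dumps({"delete": {"status": {"id": 1}}}),  # a message without a 'text' field
]
for tag, count in hashtag_counts(batch).most_common(10):
    print(tag, count)
```

Lowercasing before counting is what lets `#Snow` and `#snow` end up under the same key.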

Spark also provides a union function. Readers are encouraged to try it as well, for example by taking two streams, Twitter and Facebook, combining them with union and performing a similar analysis.
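As a rough picture of what a union does to the data, the sketch below concatenates two batches in plain Python and counts hashtags over the combined batch. The second source and all sample texts are hypothetical. In PySpark the corresponding call on two DStreams would be along the lines of `firstStream.union(secondStream)`; check the documentation of your Spark version for its requirements.

```python
from collections import Counter
from itertools import chain


def union_batches(first, second):
    """Plain-Python stand-in for union: every element of both batches, duplicates kept."""
    return list(chain(first, second))


twitter_batch = ["#snow in boston", "so #cold"]
other_batch = ["#snow day", "no tags here"]  # hypothetical second source

combined = union_batches(twitter_batch, other_batch)
tags = Counter(
    word.lower()
    for text in combined
    for word in text.split(" ")
    if word.startswith("#")
)
print(tags.most_common())
```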

To start collecting data, start the streaming context. Then connect to the SQL context and query the tag_counts table with a SQL select query to get the top 10 hashtags, printing them 100 times. Below is the code for process 2. To get it working, start process 2 while process 1 is running from the tutorial_part_1.py file. Below that is the output showing the hottest topics related to 'u.s.' according to the tweets that occurred at the very time I ran these cells.


The procedure for process 2 is explained below alongside the code. The output lists each hot topic next to its count.

In [None]:
# Installation cells have been purposely left without running.
# Once run, they generate the output 'Requirement already satisfied' in the subsequent runs.

! pip install pandas==0.19.1
# or
! conda install pandas=0.19.2 --yes

In [2]:
from __future__ import print_function

import sys
import json
import time
import pyspark
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext
import  datetime

from collections import namedtuple

# First of all, connect to the port where we are sending our stream of events.
# In the main program, the host ("localhost") and the port number (9999) are provided as arguments to the connection.
# Create a Spark context and give it an application name, which can be seen on the Spark monitoring dashboard.
# The Spark context is just the object through which the connection to the Spark engine is made.

TagCount = namedtuple('TagCount', ("tag","count"))

def extractTweetText(tweetJson):
    if('text' in tweetJson):
        return tweetJson['text']
    else:
        return ''

host = "localhost"
port = 9999

if __name__ == '__main__':
    if len(sys.argv) !=3:
        print("Usage: trending_tags.py localhost 9999", file = sys.stderr)
        exit(-1)

    sc = SparkContext(appName="SparkStreamingTwitterHotTopics")
    sc.setLogLevel("WARN")
    windowTimeFrame=10

In [4]:
# The process 2 code borrows the implementation of this method.

def functionToCreateContext():
    sc = SparkContext(...)  # new context
    ssc = StreamingContext(...)
    lines = ssc.socketTextStream(...)  # create DStreams
    
    ssc.checkpoint(checkpointDirectory)  # set checkpoint directory
    return ssc

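# Alternatively, configure the context through a SparkConf object. Use only one of the two methods.
# appName and master are placeholders for your own values.
from pyspark import SparkConf

conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)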

In [5]:
#This is a small example of the spark reduce() function to show how it works

data = [10, 20, 30, 40, 50]
distData = sc.parallelize(data)
reducedRDD = distData.reduce(lambda a, b: a + b)
print(reducedRDD)

150


In [6]:
#Text file RDDs can be created using SparkContext’s textFile method. 

lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
print(totalLength)
# To reuse lineLengths later without recomputing it, ask Spark to keep it with persist()
lineLengths.persist()

20


PythonRDD[5] at RDD at PythonRDD.scala:48

In [7]:
# This is a small example of the Spark saveAsSequenceFile() function to show how it works

rdd = sc.parallelize(range(1, 4)).map(lambda x: (x, "a" * x))
rdd.saveAsSequenceFile("dataRdd.txt")
sortedRDD = sorted(sc.sequenceFile("dataRdd.txt").collect())
print(sortedRDD)

[(1, 'a'), (2, 'aa'), (3, 'aaa')]


In [8]:
# Now we continue our process 2...

# Create a streaming context from the Spark context, passing the batch interval (in seconds)
# at which the stream is processed.
ssc = StreamingContext(sc, windowTimeFrame)

# Point Spark Streaming at localhost:9999, where process 1 serves the tweets.
# The returned DStream is stored in tweetsDStream so that stream operations can be applied to it.
tweetsDStream = ssc.socketTextStream(host, port)

jsonTweets = tweetsDStream.map(lambda whole_tweet: extractTweetText(json.loads(whole_tweet)))
    
splitStream = jsonTweets.flatMap(lambda text: text.split(" "))
    
hashTweets = splitStream.filter(lambda  text: text.startswith("#"))
    
tupleTweets = hashTweets.map(lambda word: (word.lower(),1))
    
reducedTweets = tupleTweets.reduceByKey(lambda a,b: a+b)
    
countTweets = reducedTweets.map(lambda rec: TagCount(rec[0], rec[1]))
    
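# Create the SQLContext before starting the stream so that rdd.toDF() is available inside foreachRDD.
sqlContext = SQLContext(sc)

# Register each non-empty batch as the table "tag_counts"; toDF() cannot infer a schema from an empty RDD.
countTweets.foreachRDD(lambda rdd: rdd.toDF().registerTempTable("tag_counts") if not rdd.isEmpty() else None)

ssc.start()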
count = 0
try:
    while count < 100:
        time.sleep(15)
        count=count + 1
        hotTopics = sqlContext.sql('select tag, count from tag_counts order by count desc limit 10')
        print(hotTopics['tag'], hotTopics['count'])
        for row in hotTopics.collect():
            print(row.tag, row['count'])
        print('--------------')
        print (datetime.datetime.now())
        print('--------------')
    ssc.awaitTermination()
except KeyboardInterrupt:
    print("\n --------------------------------------------------------------------------------------")
    print("********************Process interrupted manually**********************")
    print("\n --------------------------------------------------------------------------------------")


Column<b'tag'> Column<b'count'>
#russia's 3
#putin 1
#ukraine 1
#russia 1
#sjubase 1
#puertoricans 1
#snow 1
#cold 1
#hellno 1
#no 1
--------------
2018-03-29 20:48:00.783929
--------------
Column<b'tag'> Column<b'count'>
#russia's 3
#ukraine 1
#cold 1
#sjubase 1
#hellno 1
#no 1
#https://www.pol… 1
#alisadr 1
#puertoricans 1
#putin 1
--------------
2018-03-29 20:48:15.922416
--------------
Column<b'tag'> Column<b'count'>
#russia's 3
#putin 1
#ukraine 1
#russia 1
#sjubase 1
#puertoricans 1
#snow 1
#cold 1
#hellno 1
#no 1
--------------
2018-03-29 20:48:31.050300
--------------
Column<b'tag'> Column<b'count'>
#russia's 3
#ukraine 1
#snow 1
#cold 1
#sjubase 1
#hellno 1
#no 1
#https://www.pol… 1
#alisadr 1
#putin 1
--------------
2018-03-29 20:48:46.259474
--------------
Column<b'tag'> Column<b'count'>
#russia's 3
#puertoricans 1
#putin 1
#ukraine 1
#russia 1
#no 1
#snow 1
#cold 1
#sjubase 1
#hellno 1
--------------
2018-03-29 20:49:01.391971
--------------
Column<b'tag'> Column<b'count'>
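The query that process 2 runs every 15 seconds can also be tried on its own. The sketch below uses Python's built-in sqlite3 module as a stand-in for the Spark SQL table, with made-up counts, so it only illustrates the `order by ... desc limit 10` logic and not Spark itself. The column name count is quoted here as a precaution because it is also the name of a SQL function.

```python
import sqlite3

rows = [("#snow", 4), ("#cold", 2), ("#putin", 1), ("#ukraine", 3)]  # made-up counts

connection = sqlite3.connect(":memory:")
connection.execute('CREATE TABLE tag_counts (tag TEXT, "count" INTEGER)')
connection.executemany("INSERT INTO tag_counts VALUES (?, ?)", rows)

# Same shape as the query in process 2: highest counts first, at most ten rows.
top_tags = connection.execute(
    'SELECT tag, "count" FROM tag_counts ORDER BY "count" DESC LIMIT 10'
).fetchall()
connection.close()

for tag, count in top_tags:
    print(tag, count)
```

In process 2 the table is registered again for every batch, so the query reflects the most recently processed batch rather than a running total.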


# Summary

This tutorial demonstrated one application of what is possible with Real Time Data Analytics using Spark. Please visit the references below for more information.

# References

1) https://cloudxlab.com/blog/real-time-analytics-dashboard-with-apache-spark-kafka/

2) https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IMW14704USEN

3) https://static1.squarespace.com/static/55007c24e4b001deff386756/t/55f590a9e4b044a1a342598f/1442156713857/B131+-+Patel,+Nixon.pdf

4) https://spark.apache.org/docs/0.9.1/scala-programming-guide.html

5) https://spark.apache.org/docs/latest/streaming-programming-guide.html

6) https://www.youtube.com/watch?v=MS4-CjZCGsM