# Chapter 10: Spark Streaming

In this chapter, we investigate the streaming capabilities of Spark.

To run the exercises in this notebook, messages must be sent to port 9999. A Python script (`send_messages.py`) does this automatically. To run it, open a terminal from the notebook interface, change the working directory to `~/work/chapter10-spark-streaming`, and run the following command: `nohup python send_messages.py &`. Take note of the process ID shown in the terminal. To stop the process later, run `kill -9 <process-id>`.
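
For readers curious about what such a sender involves, here is a minimal sketch. It is not the actual `send_messages.py`, whose contents are not shown in this chapter. The names `make_log_line` and `serve_lines`, the `[debug]` level and the message text are all assumptions made for illustration. The one thing it relies on is that `socketTextStream` connects as a TCP client, so the sender has to listen on the port and write newline-terminated lines.

```python
import random
import socket
import time

# Hypothetical levels: "[debug]" is included only so that some lines end up
# classified as "unknown" by the examples later in the chapter.
LEVELS = ["info", "notice", "error", "debug"]


def make_log_line(level, text):
    """Build one newline-terminated log line, e.g. '[info] started\\n'."""
    return "[{}] {}\n".format(level, text)


def serve_lines(host="127.0.0.1", port=9999, n_lines=100, delay=0.5):
    """Listen on host:port and send n_lines random log lines to the first client."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        conn, _ = server.accept()  # blocks until a client (Spark) connects
        with conn:
            for i in range(n_lines):
                level = random.choice(LEVELS)
                line = make_log_line(level, "message {}".format(i))
                conn.sendall(line.encode("utf-8"))
                time.sleep(delay)  # pace the messages across batches
```

Calling `serve_lines()` blocks until a client connects, which is why the real script is started in the background with `nohup ... &`.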

In [None]:
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

## Stateless Transformations

In this section, we look at a simple example of stateless transformations, where each batch is processed independently of the previous ones.

### Conventional Stateless Transformations

Most stateless transformations are almost the same as the ones we can apply to conventional RDDs (`map()`, `flatMap()`, `filter()`, ...). To see how to use them, we will work through an example that counts the number of occurrences of different types of messages ("info", "notice", "error" and "unknown") coming from a streaming log input. We will perform that count over batch intervals of 10 seconds.

In [None]:
def get_message_type(line):
    """
    Returns the message type of an input line from the log data, depending on whether
    the line contains the "[info]" (type "info"), "[notice]" (type "notice") or
    "[error]" (type "error") keyword. If none of them is found, the returned type
    is "unknown".

    :param line: input line from the log data
    :return: message type
    """
    
    if "[info]" in line:
        message_type = "info"
    elif "[notice]" in line:
        message_type = "notice"
    elif "[error]" in line:
        message_type = "error"
    else:
        message_type = "unknown"
    return message_type

In [None]:
spark = SparkSession.builder.appName("Spark Streaming - Stateless").master("local[*]").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 10)
lines = ssc.socketTextStream("127.0.0.1", 9999)
codes = lines.map(get_message_type).map(lambda x: (x, 1))
codes_count = codes.reduceByKey(lambda x, y: x + y)
codes_count.pprint()
ssc.start()
ssc.awaitTermination()

In [None]:
ssc.stop()
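
To see what the pipeline above computes for a single batch, the same logic can be sketched in plain Python, without Spark or a socket. The sample lines are made up, and `classify` is a compact stand-in written for this sketch that applies the same rules as `get_message_type`.

```python
from collections import Counter


def classify(line):
    """Compact stand-in for get_message_type, applying the same rules."""
    for message_type in ("info", "notice", "error"):
        if "[{}]".format(message_type) in line:
            return message_type
    return "unknown"


# Made-up lines standing in for one 10-second batch
batch = [
    "[info] server started",
    "[error] disk full",
    "[info] user logged in",
    "connection reset",
]

# Plays the role of map(get_message_type), map(lambda x: (x, 1)) and reduceByKey
counts = Counter(classify(line) for line in batch)
print(sorted(counts.items()))
# prints [('error', 1), ('info', 2), ('unknown', 1)]
```

In the streaming version, this computation is repeated from scratch for every batch, which is what makes it stateless.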

### Additional Stateless Transformations

There are also some transformations that are specific to streaming. One worth highlighting is `updateStateByKey`, which keeps a running value per key across batches. Strictly speaking, this makes it a stateful operation, and it requires a checkpoint directory. As an example, we will repeat the previous count while maintaining the cumulative totals.

In [None]:
def updateFunction(newValues, runningCount):
    """
    Adds the counts of the current batch (newValues) to the running total
    (runningCount, which is None the first time a key is seen)
    """
    
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount) 

In [None]:
spark = SparkSession.builder.appName("Spark Streaming - Stateless").master("local[*]").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 10)
ssc.checkpoint("checkpoint")
lines = ssc.socketTextStream("127.0.0.1", 9999)
codes = lines.map(get_message_type).map(lambda x: (x, 1))
codes_count = codes.reduceByKey(lambda x, y: x + y)
codes_count_cumu = codes_count.updateStateByKey(updateFunction)
codes_count_cumu.pprint()
ssc.start()
ssc.awaitTermination()

In [None]:
ssc.stop()
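
The way the update function is applied across batches can be sketched in plain Python. The per-batch counts are made up, and `update_function` is a copy of `updateFunction` so that the sketch is self-contained. The loop only imitates what `updateStateByKey` does; it is not how Spark implements it.

```python
def update_function(new_values, running_count):
    """Same logic as updateFunction above: add new counts to the running total."""
    if running_count is None:
        running_count = 0
    return sum(new_values, running_count)


# Made-up per-batch counts, as produced by reduceByKey for each batch
batches = [
    {"info": 3, "error": 1},
    {"info": 2},
    {"error": 4, "notice": 1},
]

state = {}
for batch in batches:
    # The function is called for every key that has previous state or new data;
    # a key without new data receives an empty list of new values.
    for key in set(state) | set(batch):
        new_values = [batch[key]] if key in batch else []
        state[key] = update_function(new_values, state.get(key))
    print(sorted(state.items()))
# prints
# [('error', 1), ('info', 3)]
# [('error', 1), ('info', 5)]
# [('error', 5), ('info', 5), ('notice', 1)]
```

Unlike the per-batch counts, the totals never reset, which is why Spark needs the checkpoint directory to persist the state.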

## Stateful Transformations

Stateful operations take into account not only the values of the current batch but also those of previous batches. Many stateless transformations have a stateful, windowed counterpart, identified by the suffix "AndWindow" (for example, `reduceByKey` and `reduceByKeyAndWindow`). We will repeat the earlier counting example using this approach, computing the counts over a window of 20 seconds that slides every 10 seconds.

In [None]:
spark = SparkSession.builder.appName("Spark Streaming - Stateful").master("local[*]").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 10)
ssc.checkpoint("checkpoint")

In [None]:
lines = ssc.socketTextStream("127.0.0.1", 9999)
codes = lines.map(get_message_type).map(lambda x: (x, 1))
codes_count = codes.reduceByKeyAndWindow(func=lambda x, y: x + y,
                                         invFunc=lambda x, y: x - y,
                                         windowDuration=20,
                                         slideDuration=10)
codes_count.pprint()
ssc.start()
ssc.awaitTermination()

In [None]:
ssc.stop()
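
The roles of `func` and `invFunc` in the windowed count above can be illustrated with a plain-Python sketch. The per-batch counts are made up, and the loop only imitates the incremental idea (add the batch that enters the window, subtract the one that leaves it). It is not Spark's actual implementation.

```python
# Made-up per-batch counts, one dict per 10-second batch
batches = [
    {"info": 3, "error": 1},
    {"info": 2},
    {"error": 4},
    {"info": 1},
]
BATCHES_PER_WINDOW = 2  # windowDuration (20 s) / batch interval (10 s)


def add(x, y):  # plays the role of func
    return x + y


def subtract(x, y):  # plays the role of invFunc
    return x - y


window = {}
results = []
for i, batch in enumerate(batches):
    # func: merge the counts of the batch entering the window
    for key, value in batch.items():
        window[key] = add(window.get(key, 0), value)
    # invFunc: remove the counts of the batch that just left the window
    if i >= BATCHES_PER_WINDOW:
        for key, value in batches[i - BATCHES_PER_WINDOW].items():
            window[key] = subtract(window[key], value)
    results.append(dict(window))
    print(sorted(window.items()))
# prints
# [('error', 1), ('info', 3)]
# [('error', 1), ('info', 5)]
# [('error', 4), ('info', 2)]
# [('error', 4), ('info', 1)]
```

Each result covers only the last two batches, so the third line no longer includes the counts of the first batch.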