# Chapter 10: Spark Streaming

In this chapter, we investigate the streaming capabilities of Spark.

To run the exercises in this notebook, messages must be sent to port 9999. A Python script (`send_messages.py`) does this automatically. To run it, open a terminal from the notebook interface, change the working directory to `~/work/chapter10-spark-streaming`, and type the following command: `nohup python send_messages.py &`. Take note of the process ID shown in the terminal. To stop the process later, type `kill -9 <process-id>`.

## Stateless Transformations

In this section, we look at a simple example of stateless transformations.

### Conventional Stateless Transformations

Most stateless transformations are almost the same as the ones we can apply to conventional RDDs (`map()`, `flatMap()`, `filter()`, ...). To see how to use them, we work through an example that counts the number of occurrences of different types of messages ("info", "notice", "error" and "unknown") coming from a streaming log input. The count is computed over batch intervals of 10 seconds.

In [None]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{StreamingContext, Seconds}

/**
Returns the message type of an input line from the log data, depending on whether the line
contains the "[info]" (type "info"), "[notice]" (type "notice") or "[error]" (type "error")
keyword. If none of them is found, the returned type is "unknown".

@param line input line
@return message type

**/
def getMessageType(line: String): String = {
    if(line.contains("[info]")) "info"
    else if(line.contains("[notice]")) "notice"
    else if(line.contains("[error]")) "error"
    else "unknown"
}


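val spark = SparkSession.builder.appName("Spark Streaming - Stateless").master("local[*]").getOrCreate()
// Streaming context with a batch interval of 10 seconds
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
// Read text lines from the socket fed by send_messages.py
val lines = ssc.socketTextStream("127.0.0.1", 9999)
// Classify each line and pair its type with a count of 1
val codes = lines.map(getMessageType).map(x => (x, 1))
// Sum the counts per message type within each batch
val codesCount = codes.reduceByKey((x, y) => x + y)
codesCount.print()
// Start receiving data and block until the computation terminates
ssc.start()
ssc.awaitTermination()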

In [None]:
ssc.stop()
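
To make the per-batch computation concrete, the sketch below applies the same `map` and per-key sum to an ordinary Scala collection instead of a DStream. The sample lines in `batch` are made up for illustration, and `groupBy` followed by a sum is only a local stand-in for `reduceByKey`, not what Spark executes internally.

```scala
// Same classifier as in the cell above, repeated so this snippet runs on its own.
def getMessageType(line: String): String = {
    if (line.contains("[info]")) "info"
    else if (line.contains("[notice]")) "notice"
    else if (line.contains("[error]")) "error"
    else "unknown"
}

// Hypothetical contents of a single 10-second batch (made-up sample lines).
val batch = Seq(
    "[info] service started",
    "[error] disk full",
    "[info] request served",
    "heartbeat"
)

// map -> (type, 1), then sum the ones per key: a local analogue of reduceByKey.
val counts: Map[String, Int] = batch
    .map(getMessageType)
    .map(x => (x, 1))
    .groupBy(_._1)
    .map { case (k, pairs) => (k, pairs.map(_._2).sum) }

println(counts)
```

Each batch is processed independently in this way, so the counts printed for one 10-second interval carry no information from earlier intervals.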

### Additional Stateless Transformations

Beyond the RDD-like operations, DStreams offer some transformations of their own. One worth highlighting is `updateStateByKey`, which maintains a running value per key across batches. Strictly speaking, this makes it a stateful operation, and it requires a checkpoint directory to be set. As an example, we repeat the previous count while maintaining the running total for each message type.

In [None]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{StreamingContext, Seconds}

/**
Returns the message type of an input line from the log data, depending on whether the line
contains the "[info]" (type "info"), "[notice]" (type "notice") or "[error]" (type "error")
keyword. If none of them is found, the returned type is "unknown".

@param line input line
@return message type
**/
def getMessageType(line: String): String = {
    if(line.contains("[info]")) "info"
    else if(line.contains("[notice]")) "notice"
    else if(line.contains("[error]")) "error"
    else "unknown"
}


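/**
Adds the counts of the current batch to the running count of a key.

@param newValues counts of the key in the current batch
@param runningCount total accumulated so far, if any
@return updated total
**/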
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = runningCount.getOrElse(0) + newValues.sum
    Some(newCount)
}


val spark = SparkSession.builder.appName("Spark Streaming - Stateless").master("local[*]").getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
// updateStateByKey requires a checkpoint directory to persist the state
ssc.checkpoint("checkpoint")
val lines = ssc.socketTextStream("127.0.0.1", 9999)
val codes = lines.map(getMessageType).map(x => (x, 1))
val codesCount = codes.reduceByKey((x, y) => x + y)
val codesCountCumu = codesCount.updateStateByKey(updateFunction)
codesCountCumu.print()
ssc.start()
ssc.awaitTermination()

In [None]:
ssc.stop()
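
The way `updateFunction` carries a total from one batch to the next can be sketched without Spark by threading the state through a sequence of batches by hand. The per-batch values in `perBatch` are made up, and `scanLeft` is used here only as a stand-in for the bookkeeping that `updateStateByKey` performs for each key.

```scala
// Same update function as in the cell above, repeated so this snippet runs on its own.
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = runningCount.getOrElse(0) + newValues.sum
    Some(newCount)
}

// Hypothetical per-batch counts for one key, e.g. "error" (made-up numbers).
// An empty Seq models a batch in which the key did not appear.
val perBatch: Seq[Seq[Int]] = Seq(Seq(3), Seq.empty[Int], Seq(2))

// Thread the state through the batches; None means "no state yet".
// scanLeft also emits the initial state, which .tail drops.
val states: Seq[Option[Int]] =
    perBatch.scanLeft(Option.empty[Int])((state, values) => updateFunction(values, state)).tail

println(states)
```

The running totals after the three batches are 3, 3 and 5: the second batch contributes nothing, yet the key keeps its previous total.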

## Stateful transformations

Stateful operations are those that take into account the values of both the current batch and previous batches. Many stateless transformations have a stateful counterpart, typically identified by the suffix "AndWindow". We now rework the previous counting example using this approach.

In [None]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{StreamingContext, Seconds}

/**
Returns the message type of an input line from the log data, depending on whether the line
contains the "[info]" (type "info"), "[notice]" (type "notice") or "[error]" (type "error")
keyword. If none of them is found, the returned type is "unknown".

@param line input line
@return message type

**/
def getMessageType(line: String): String = {
    if(line.contains("[info]")) "info"
    else if(line.contains("[notice]")) "notice"
    else if(line.contains("[error]")) "error"
    else "unknown"
}


val spark = SparkSession.builder.appName("Spark Streaming - Stateful").master("local[*]").getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
// Checkpointing is required when an inverse reduce function is used
ssc.checkpoint("checkpoint")
val lines = ssc.socketTextStream("127.0.0.1", 9999)
val codes = lines.map(getMessageType).map(x => (x, 1))
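// Count per message type over a 20-second window that slides every 10 seconds.
// The first function adds values entering the window; the second removes values leaving it.
// Parameter types are written out to help overload resolution.
val codesCount = codes.reduceByKeyAndWindow((x: Int, y: Int) => x + y,
                                            (x: Int, y: Int) => x - y,
                                            windowDuration = Seconds(20),
                                            slideDuration = Seconds(10))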
codesCount.print()
ssc.start()
ssc.awaitTermination()

In [None]:
ssc.stop()
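
The role of the two functions passed to `reduceByKeyAndWindow` can be illustrated on plain numbers. The sketch below assumes a single key whose counts in consecutive 10-second batches are the made-up values in `batchCounts`. It computes each 20-second window twice: once by re-adding every batch in the window, and once incrementally by adding the batch that enters and subtracting the batch that leaves. It is a simplified model of the idea, not Spark's actual implementation.

```scala
// Hypothetical counts of one key in five consecutive 10-second batches (made-up numbers).
val batchCounts = Vector(4, 1, 0, 3, 2)

// A 20-second window sliding every 10 seconds spans the current batch and the previous one.
val batchesPerWindow = 2

// Direct computation: re-add every batch that falls inside each window.
val direct = batchCounts.indices.map { i =>
    batchCounts.slice(math.max(0, i - batchesPerWindow + 1), i + 1).sum
}

// Incremental computation: previous window + entering batch - leaving batch.
val incremental = batchCounts.indices.scanLeft(0) { (prev, i) =>
    val leaving = if (i >= batchesPerWindow) batchCounts(i - batchesPerWindow) else 0
    prev + batchCounts(i) - leaving
}.tail

println(direct)
println(incremental)
```

Both approaches yield the window totals 4, 5, 1, 3 and 5. The incremental form does a constant amount of work per slide regardless of the window length, which is the motivation for supplying an inverse function when one exists.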