# Spark Streaming Lecture 2 Examples

These are the examples used in the slides of lecture 2.

The code assumes that the log stream is available on host `logsender`, port 7777. See the setup of lab 1.

## Example 1

List the 3 source IPs with the most accesses in the last 30 seconds. Update the list every 10 seconds.
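
To make the window parameters concrete, here is a minimal plain-Python sketch (no Spark needed) of which batches each window covers. It assumes the 5-second batch interval used in the code of these examples; the helper name `batches_in_window` is hypothetical. In Spark Streaming, both the window length and the slide interval must be multiples of the batch interval.

```python
BATCH_INTERVAL = 5    # seconds; batch interval assumed from the examples
WINDOW_LENGTH = 30    # seconds; "the last 30 seconds"
SLIDE_INTERVAL = 10   # seconds; "every 10 seconds"

def batches_in_window(window_end):
    """Return the end times (in seconds) of the batches covered by the
    window that ends at window_end. Hypothetical helper for illustration."""
    start = max(0, window_end - WINDOW_LENGTH)
    return list(range(start + BATCH_INTERVAL, window_end + 1, BATCH_INTERVAL))

# One window is emitted per slide interval
for end in range(SLIDE_INTERVAL, 50, SLIDE_INTERVAL):
    print(end, batches_in_window(end))
```

Each window after the first 30 seconds covers six batches, and consecutive windows share four of them, so a single request can contribute to up to three consecutive results.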

In [None]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WebLogExample")
# Batch interval of 5 seconds
ssc = StreamingContext(sc, 5)
lines = ssc.socketTextStream("logsender", 7777)

# Window over the last 30 seconds, recomputed every 10 seconds
# (both values must be multiples of the batch interval)
lines = lines.window(30, 10)
# Count accesses per IP (second space-separated field), then keep the top 3
result = lines.filter(lambda line: len(line) > 0) \
            .map(lambda line: (line.split(' ')[1], 1)) \
            .reduceByKey(lambda a, b: a + b) \
            .transform(lambda rdd:
                       rdd.sortBy(lambda x: x[1], ascending=False)
                          .zipWithIndex()
                          .filter(lambda v: v[1] < 3)
                          .map(lambda v: v[0]))

result.pprint()

ssc.start()
# Run for 60 seconds so that several windows are printed, then stop
ssc.awaitTermination(60)
ssc.stop()
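
The per-window logic can be checked without a running stream. The following is a rough plain-Python sketch of the same computation on a single window of lines. The sample lines and the helper `top_ips` are hypothetical; the only assumption about the log format is that the source IP is the second space-separated field, as in the code above.

```python
from collections import Counter

# Hypothetical sample of one window of log lines (format assumed:
# "<timestamp> <ip> <path>"); the empty line is filtered out.
sample_lines = [
    "t01 10.0.0.1 /index.html",
    "t02 10.0.0.2 /index.html",
    "t03 10.0.0.1 /about.html",
    "t04 10.0.0.3 /index.html",
    "",
    "t05 10.0.0.1 /index.html",
    "t06 10.0.0.2 /about.html",
    "t07 10.0.0.4 /index.html",
    "t08 10.0.0.1 /index.html",
    "t09 10.0.0.2 /index.html",
    "t10 10.0.0.3 /about.html",
]

def top_ips(lines, n=3):
    """Count accesses per IP and return the n most frequent (ip, count) pairs."""
    counts = Counter(line.split(' ')[1] for line in lines if len(line) > 0)
    return counts.most_common(n)

print(top_ips(sample_lines))
# [('10.0.0.1', 4), ('10.0.0.2', 3), ('10.0.0.3', 2)]
```

`Counter.most_common` plays the role of the `sortBy` / `zipWithIndex` / `filter` chain, which is how the streaming version takes the first 3 elements of a sorted RDD without collecting it to the driver.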

## Example 2

List the 3 source IPs with the most accesses in the last 30 seconds. Update the list every 10 seconds.

Print the country of each IP, assuming there is a CSV file that maps each IP to its country.


In [None]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WebLogExample")
# Static RDD of (IP, country) pairs read from the CSV file
countries = sc.textFile('countries.csv') \
     .filter(lambda l: len(l) > 0) \
     .map(lambda l: l.split(',')) \
     .map(lambda l: (l[0], l[1]))

# Batch interval of 5 seconds
ssc = StreamingContext(sc, 5)
lines = ssc.socketTextStream("logsender", 7777)

# Window over the last 30 seconds, recomputed every 10 seconds
lines = lines.window(30, 10)
# Count accesses per IP, then keep the top 3
result = lines.filter(lambda line: len(line) > 0) \
            .map(lambda line: (line.split(' ')[1], 1)) \
            .reduceByKey(lambda a, b: a + b) \
            .transform(
                lambda rdd: rdd.sortBy(lambda x: x[1], ascending=False)
                       .zipWithIndex().filter(lambda v: v[1] < 3).map(lambda v: v[0]))
# Join each windowed RDD with the static RDD: (IP, (count, country)).
# IPs that are missing from the CSV are dropped by the inner join.
result = result.transform(lambda rdd: rdd.join(countries))
result.pprint()

ssc.start()
# Run for 60 seconds so that several windows are printed, then stop
ssc.awaitTermination(60)
ssc.stop()
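
The shape of the joined records is easy to get wrong, so here is a small plain-Python sketch that mimics what `rdd.join(countries)` returns for one window. The helper `inner_join` and the sample data are hypothetical and only illustrate the semantics of an inner join on (key, value) pairs.

```python
def inner_join(left, right):
    """Mimic RDD.join on two lists of (key, value) pairs:
    emit (key, (left_value, right_value)) for every matching key."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]

# Hypothetical top-3 result of one window and hypothetical CSV content
top3 = [("10.0.0.1", 4), ("10.0.0.2", 3), ("10.0.0.3", 2)]
countries = [("10.0.0.1", "PT"), ("10.0.0.2", "FR"), ("10.0.0.9", "DE")]

print(inner_join(top3, countries))
# [('10.0.0.1', (4, 'PT')), ('10.0.0.2', (3, 'FR'))]
```

Here `10.0.0.3` disappears because it has no country entry. If such IPs should still be listed, `leftOuterJoin` is the usual alternative; it yields `None` as the country. Also note that, unlike this sketch, an RDD join does not guarantee that the sorted order of the top-3 list is preserved.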

## Example 3

List the 3 source IPs with the most accesses in the last 30 seconds. Update the list every 10 seconds.

Print the country of each IP, assuming there is a CSV file that maps each IP to its country.

Also print the total number of requests for each IP over time.


In [None]:
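from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def updateFunction(newValues, runningCount):
    # Add the new counts of a key to its running total (None the first time)
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)

sc = SparkContext("local[2]", "WebLogExample")
# Static RDD of (IP, country) pairs read from the CSV file
countries = sc.textFile('countries.csv') \
     .filter(lambda l: len(l) > 0) \
     .map(lambda l: l.split(',')) \
     .map(lambda l: (l[0], l[1]))

# Batch interval of 5 seconds
ssc = StreamingContext(sc, 5)
# updateStateByKey requires a checkpoint directory
ssc.checkpoint('checkpoint')
lines = ssc.socketTextStream("logsender", 7777)

# One (IP, 1) pair per request
pairs = lines.filter(lambda line: len(line) > 0) \
            .map(lambda line: (line.split(' ')[1], 1))

# Accesses per IP in the last 30 seconds, recomputed every 10 seconds
result = pairs.window(30, 10) \
            .reduceByKey(lambda a, b: a + b)

# Running total per IP. The state is fed from non-overlapping 10-second
# windows, so each request is counted once and both streams slide together.
runningCounts = pairs.window(10, 10).updateStateByKey(updateFunction)

# Keep the top 3 IPs of the window
result = result.transform(
                lambda rdd: rdd.sortBy(lambda x: x[1], ascending=False)
                       .zipWithIndex().filter(lambda v: v[1] < 3).map(lambda v: v[0]))
# (IP, (window count, running total))
result = result.join(runningCounts)
# (IP, ((window count, running total), country))
result = result.transform(lambda rdd: rdd.join(countries))
result.pprint()

ssc.start()
# Run for 60 seconds so that several windows are printed, then stop
ssc.awaitTermination(60)
ssc.stop()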
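
To see how the state evolves, the update function can be exercised by hand. The sketch below replays, for a single key, the calls that `updateStateByKey` would make over four consecutive intervals; the input values are hypothetical. An empty list models an interval in which the key received no new values, a case in which Spark still calls the function for every key that already has state.

```python
def updateFunction(newValues, runningCount):
    # Same function as in the example: add new counts to the running total
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)

state = None      # no state yet for this key
history = []
# Hypothetical new counts for one IP over four consecutive intervals
for new_values in [[3], [2], [], [5]]:
    state = updateFunction(new_values, state)
    history.append(state)

print(history)  # [3, 5, 5, 10]
```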