## CS431/631 Data Intensive Distributed Computing
---

Let's first install Spark. This will take a minute to finish.

In [2]:
!apt-get update -qq > /dev/null
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!pip install -q findspark

Now that Spark and Java are installed in Colab, I set the environment paths that let me run PySpark in my Colab environment. I then define a function that creates the SparkContext and StreamingContext.

In [3]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

import findspark
findspark.init()

from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
import time

# This function creates SparkContext and StreamingContext
# Do not change this function
def initStreamingContext():
    try:
      ssc.end()
    except:
      pass
    finally:
      spark_conf = SparkConf()\
            .setAppName("IrTest")\
            .setMaster("local[*]")
      sc = SparkContext.getOrCreate(spark_conf)
      # Creating Streaming Context with batch window size of 1 second
      ssc = StreamingContext(sc, 1)
      return ssc

### Overview

The data I use was collected from the sensors installed on a wall-navigating robot. The robot uses 24 ultrasound sensors arranged circularly around its "waist". The numbering of the ultrasound sensors starts at the front of the robot and increases in the clockwise direction. To make the data streaming scenario realistic, I have developed a server that streams the robot's data to my program (as if I were really getting the data live from the robot). I will use Spark Streaming to perform a few simple tasks on this data.

Every line of data transmitted by the server corresponds to a measurement done by the robot. Here is one line of such data:

```
0.438,0.498,3.625,3.645,5.000,2.918,5.000,2.351,2.332,2.643,1.698,1.687,1.698,1.717,1.744,0.593,0.502,0.493,0.504,0.445,0.431,0.444,0.440,0.429,Slight-Right-Turn
```
Each line holds the readings of all 24 ultrasound sensors, followed by the corresponding movement type, which is one of the following:
Move-Forward, Slight-Right-Turn, Sharp-Right-Turn, and Slight-Left-Turn.
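
To make the line format concrete, here is a minimal sketch of parsing one line in plain Python, outside Spark. The helper name `parse_measurement` and the constant `NUM_SENSORS` are my own illustrative choices and are not part of the server or of Spark.

```python
from typing import List, Tuple

NUM_SENSORS = 24  # the text states 24 ultrasound sensors per measurement


def parse_measurement(line: str) -> Tuple[List[float], str]:
    """Split one CSV line into 24 float readings and the movement label.

    Hypothetical helper for illustration only.
    """
    fields = line.strip().split(",")
    if len(fields) != NUM_SENSORS + 1:
        raise ValueError(f"expected {NUM_SENSORS + 1} fields, got {len(fields)}")
    readings = [float(v) for v in fields[:NUM_SENSORS]]
    movement = fields[NUM_SENSORS]
    return readings, movement


# The sample line shown above
sample = ("0.438,0.498,3.625,3.645,5.000,2.918,5.000,2.351,2.332,2.643,"
          "1.698,1.687,1.698,1.717,1.744,0.593,0.502,0.493,0.504,0.445,"
          "0.431,0.444,0.440,0.429,Slight-Right-Turn")
readings, movement = parse_measurement(sample)
print(len(readings), min(readings), movement)  # 24 0.429 Slight-Right-Turn
```

Converting the readings to `float` up front keeps later comparisons numeric rather than string-based.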

It is normal for the data to be slightly shifted in time from run to run, because the timing depends on the delay in receiving the data from the server across the Internet. Therefore, each 1-second batch may contain a different number of measurements, and that number can vary across runs.
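
The effect of a delay on batch sizes can be illustrated without Spark. The sketch below assigns arrival times to 1-second batches; the arrival times and the 350 ms delay are made-up values for illustration, and `batch_sizes` is a hypothetical helper, not how Spark forms batches internally.

```python
from collections import Counter

BATCH_MS = 1000  # 1-second batches, matching StreamingContext(sc, 1)


def batch_sizes(arrival_ms, delay_ms=0):
    """Count how many measurements land in each 1-second batch."""
    counts = Counter((t + delay_ms) // BATCH_MS for t in arrival_ms)
    last = max(counts) if counts else -1
    return [counts.get(i, 0) for i in range(last + 1)]


# Hypothetical arrival times in milliseconds since the stream started
arrivals = [200, 700, 1100, 1400, 1900, 2300]
print(batch_sizes(arrivals))       # [2, 3, 1]
print(batch_sizes(arrivals, 350))  # [1, 3, 2]
```

The same six measurements end up distributed differently across batches once a network delay is added, which is why the per-batch output differs between runs.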

In [4]:
# Let's create ssc.
ssc = initStreamingContext()
# I initialize a DStream by connecting it to a TCP socket. 
# The server will start sending data which goes to the robotData DStream.
robotData = ssc.socketTextStream("datasci.cs.uwaterloo.ca", 4321)
robotData.pprint()
ssc.start()
# Just wait 5 seconds before I stop the stream.
time.sleep(5)
ssc.stop()

-------------------------------------------
Time: 2021-03-31 19:53:39
-------------------------------------------

-------------------------------------------
Time: 2021-03-31 19:53:40
-------------------------------------------

-------------------------------------------
Time: 2021-03-31 19:53:41
-------------------------------------------
0.438,0.498,3.625,3.645,5.000,2.918,5.000,2.351,2.332,2.643,1.698,1.687,1.698,1.717,1.744,0.593,0.502,0.493,0.504,0.445,0.431,0.444,0.440,0.429,Slight-Right-Turn
0.438,0.498,3.625,3.648,5.000,2.918,5.000,2.637,2.332,2.649,1.695,1.687,1.695,1.720,1.744,0.592,0.502,0.493,0.504,0.449,0.431,0.444,0.443,0.429,Slight-Right-Turn
0.438,0.498,3.625,3.629,5.000,2.918,5.000,2.637,2.334,2.643,1.696,1.687,1.695,1.717,1.744,0.593,0.502,0.493,0.504,0.449,0.431,0.444,0.446,0.429,Slight-Right-Turn
0.437,0.501,3.625,3.626,5.000,2.918,5.000,2.353,2.334,2.642,1.730,1.687,1.695,1.717,1.744,0.593,0.502,0.493,0.504,0.449,0.431,0.444,0.444,0.429,Slight-Right-Turn

-------

An important factor for a navigating robot is avoiding obstacles. This is why there are so many sensors on this robot: they measure the distance to surrounding obstacles in all directions. In this task, the program reports, every second, the minimum distance measured by any sensor in the last 3 seconds.

For example, suppose the robot performs the following two measurements in the last 3 seconds:
```
0.482,0.512,0.524,3.665,2.953,2.940,2.940,2.629,1.709,2.311,1.660,1.640,1.635,1.654,1.755,0.563,0.545,0.475,0.475,0.485,0.464,0.459,0.468,0.478,Slight-Right-Turn
0.484,0.514,0.525,3.667,2.954,2.938,2.941,2.957,1.707,2.310,1.658,1.638,1.633,1.652,1.753,0.682,0.535,0.475,0.475,0.544,0.465,0.457,0.469,0.483,Slight-Right-Turn

```
The program prints:
```
-------------------------------------------
Time: 2020-11-27 23:56:24
-------------------------------------------
0.457
```
Note that this is the output for one 3-second window.
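
To show what a 3-second window sliding every second computes, here is a minimal sketch in plain Python that imitates the windowed minimum over a list of per-second batches. The function `window_minimums` and the placement of the two example lines into specific batches are assumptions for illustration; the actual task uses Spark Streaming.

```python
from collections import deque


def window_minimums(batches, window_len=3):
    """For each batch, return the minimum reading over the last
    `window_len` batches, or None if the window holds no data."""
    recent = deque(maxlen=window_len)  # keeps only the newest batches
    results = []
    for batch in batches:
        recent.append(batch)
        readings = [float(v)
                    for b in recent
                    for line in b
                    for v in line.split(",")[:24]]
        results.append(min(readings) if readings else None)
    return results


line1 = ("0.482,0.512,0.524,3.665,2.953,2.940,2.940,2.629,1.709,2.311,"
         "1.660,1.640,1.635,1.654,1.755,0.563,0.545,0.475,0.475,0.485,"
         "0.464,0.459,0.468,0.478,Slight-Right-Turn")
line2 = ("0.484,0.514,0.525,3.667,2.954,2.938,2.941,2.957,1.707,2.310,"
         "1.658,1.638,1.633,1.652,1.753,0.682,0.535,0.475,0.475,0.544,"
         "0.465,0.457,0.469,0.483,Slight-Right-Turn")

# Assumed arrival pattern: one empty batch, then one line per second
batches = [[], [line1], [line2], [], [], []]
print(window_minimums(batches))
# [None, 0.459, 0.457, 0.457, 0.457, None]
```

A measurement keeps contributing to the result for three consecutive outputs and then drops out of the window.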


In [5]:
ssc = initStreamingContext()
robotData = ssc.socketTextStream("datasci.cs.uwaterloo.ca", 4321)

# flatten the 24 sensor values as floats (numeric, not string, comparison),
# window back to 3 seconds (sliding every second) and reduce to the minimum
robotData_min = robotData.flatMap(lambda x: [float(v) for v in x.split(',')[:24]])\
      .window(3, 1).reduce(lambda x, y: x if x < y else y)
robotData_min.pprint()

ssc.start()
# Let's wait for 10 seconds before I stop the program.
time.sleep(10)
ssc.stop()

-------------------------------------------
Time: 2021-03-31 19:53:46
-------------------------------------------

-------------------------------------------
Time: 2021-03-31 19:53:47
-------------------------------------------
0.429

-------------------------------------------
Time: 2021-03-31 19:53:48
-------------------------------------------
0.429

-------------------------------------------
Time: 2021-03-31 19:53:49
-------------------------------------------
0.429

-------------------------------------------
Time: 2021-03-31 19:53:50
-------------------------------------------
0.432

-------------------------------------------
Time: 2021-03-31 19:53:51
-------------------------------------------
0.453

-------------------------------------------
Time: 2021-03-31 19:53:52
-------------------------------------------
0.453

-------------------------------------------
Time: 2021-03-31 19:53:53
-------------------------------------------
0.467

--------------------------------------

Next, I characterize the robot's movements. The last field in every line indicates the movement type. Every second, the program reports which movements the robot performed in the last 3 seconds, together with the ratio of each movement. For example, if 10 out of 50 movements in the last 3 seconds are "Slight-Right-Turn", its ratio is 0.2. The movements are reported in descending order of their ratios.

Here is an example of the expected output:
```
Slight-Right-Turn 0.6666666666666666
Sharp-Right-Turn 0.3333333333333333
----------
Sharp-Right-Turn 0.5384615384615384
Slight-Right-Turn 0.46153846153846156
----------
Slight-Right-Turn 0.6590909090909091
Sharp-Right-Turn 0.3409090909090909
----------
Slight-Right-Turn 0.75
Sharp-Right-Turn 0.19642857142857142
Move-Forward 0.05357142857142857
----------
```
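
The ratio computation for a single window can be sketched in plain Python as follows. The function `movement_ratios` and the sample window contents are hypothetical and only illustrate the arithmetic, not the streaming solution.

```python
from collections import Counter


def movement_ratios(movements):
    """Return (movement, ratio) pairs sorted by ratio, largest first."""
    counts = Counter(movements)
    total = sum(counts.values())
    ratios = [(name, n / total) for name, n in counts.items()]
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)


# Hypothetical window: 10 of 50 movements are Slight-Right-Turn
window = ["Slight-Right-Turn"] * 10 + ["Sharp-Right-Turn"] * 40
for name, ratio in movement_ratios(window):
    print(name, ratio)
# Sharp-Right-Turn 0.8
# Slight-Right-Turn 0.2
```

An empty window yields an empty list, so no division by zero occurs.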

In [8]:
ssc = initStreamingContext()
robotData = ssc.socketTextStream("datasci.cs.uwaterloo.ca", 4321)

# extract movements and perform windowing
robotData = robotData.map(lambda x: x.split(',')[24]).window(3, 1).cache()

# counting total and adding temporary key for joining
count = robotData.count().map(lambda x: ('windowKey', x))

# counting by value and adding temporary key for joining
robotData = robotData.countByValue().map(lambda x: ('windowKey', x))

# joining to get ratios
merged = robotData.join(count).map(lambda x: (x[1][0][1] / x[1][1], x[1][0][0]))

def print_rdd(rdd):
  "Custom print function for individual RDDs."
  rdd = rdd.sortByKey(False).map(lambda x: (x[1], x[0]))
  result = rdd.collect()
  print("----------")
  for record in result:
    print(record[0], record[1])

# perform action for each RDD
merged.foreachRDD(print_rdd)

ssc.start()
# Let's wait for 10 seconds before I stop the program.
time.sleep(10)
ssc.stop()

----------
----------
Slight-Right-Turn 1.0
----------
Slight-Right-Turn 0.5
Sharp-Right-Turn 0.5
----------
Sharp-Right-Turn 0.56
Slight-Right-Turn 0.44
----------
Slight-Right-Turn 0.5483870967741935
Sharp-Right-Turn 0.45161290322580644
----------
Slight-Right-Turn 0.7741935483870968
Sharp-Right-Turn 0.22580645161290322
