# Training a Batch Model on Streaming Data with Spark Streaming and Online Predictions with Structured Streaming

Apache Spark was built to analyze big data at high speed. One of its most important features is the ability to run computations in memory.
PySpark is the Python API for Spark; it lets you write Spark applications in idiomatic Python.

In this notebook, we will mostly work with the PySpark ml machine learning library, which provides logistic regression and other machine learning models.

What is Google Colab?
Google Colab is a hosted Jupyter Notebook environment that integrates with Google Drive, so notebooks can be shared much like Google Docs. Because it runs on Google's servers, nothing needs to be installed locally, whether that is Spark or a deep learning framework. Its most attractive feature is free GPU and TPU support; the hosted GPUs can be faster than some consumer cards such as the Nvidia 1050Ti.

Let’s create a simple logistic regression model with PySpark in Google Colab. To open Colab Jupyter Notebook, click on this link.

# Running PySpark in Colab

In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

/bin/sh: apt-get: command not found


In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

In [None]:
import findspark
findspark.init()
#from pyspark.sql import SparkSession
#spark = SparkSession.builder.master("local[*]").getOrCreate()

In [1]:
import os
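# directories watched by the streaming queries for new training / test files
train_csv_dir = os.path.join(os.getcwd(), "module11_cs1", "train_data_csv")

test_csv_dir = os.path.join(os.getcwd(), "module11_cs1", "test_data_csv")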

In [2]:

# create the input directories if they do not exist yet
os.makedirs(train_csv_dir, exist_ok=True)

os.makedirs(test_csv_dir, exist_ok=True)

In [3]:
import os
import pyspark
import sys
import re

import pandas as pd
import numpy as np
import time


from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# pyspark.streaming.kafka exists in Spark 2.x only (it was removed in Spark 3.0)
from pyspark.streaming.kafka import KafkaUtils

from pyspark.sql.session import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType
from pyspark.sql import functions as f

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import HashingTF
from pyspark.ml.feature import OneHotEncoder, StringIndexer
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import RandomForestClassificationModel, RandomForestClassifier
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors, SparseVector, DenseVector


In [4]:

def predict(df, epoch_id):
    # foreachBatch sink: score each micro-batch of test data with the latest model
    global model

    if model is not None:
        print("Predictions:")
        predictions = model.transform(df).select('spam', 'message', 'label', 'prediction', 'probability')
        # print predictions to the console
        predictions.show()

    else:
        print("Model has not seen training data yet, therefore - no model exists")



In [5]:
def sendPartition(df, epoch_id):
    # foreachBatch sink: append each non-empty micro-batch to the accumulated training set
    global dataset_df

    if len(df.head(1)) > 0:
        dataset_df = df.union(dataset_df)



In [6]:


def train(rdd):
    # Called by foreachRDD once per batch interval; the rdd argument only
    # acts as a timer tick, and training uses the accumulated dataset_df.
    global dataset_df
    global model
    global prev_length
    global evaluate
    global crossval_full

    # retrain only if new records have arrived since the last fit
    current_length = dataset_df.count()
    if current_length > prev_length:
        prev_length = current_length

        if evaluate:
            # Split to train and test
            (trainingData, testData) = dataset_df.randomSplit([0.7, 0.3], seed=0)
        else:
            trainingData = dataset_df

        my_train_df = trainingData

        print("\n\nStarting to fit a model on " + str(my_train_df.count()) + " records")
        # cross-validating on full_pipeline
        model = crossval_full.fit(my_train_df)
        print("Model fit compleeted\n")

        if evaluate:
            predictions = model.transform(testData)
            # The metric is area under the ROC curve, computed here from the hard 0/1
            # "prediction" column, so the value printed as "Accuracy" is not classification accuracy.
            evaluator = BinaryClassificationEvaluator().setLabelCol("label").setRawPredictionCol("prediction").setMetricName("areaUnderROC")
            accuracy = evaluator.evaluate(predictions)
            print("Evalluated on " + str(predictions.count()) + " records")
            print("Accuracy", accuracy)


In [7]:
# Create DataFrame representing the stream of input lines from connection to localhost:9999
#userSchema = StructType().add("spam", "string").add("message", "string")
#df = spark \
#    .readStream \
#    .format("socket") \
#    .option("host", "localhost") \
#    .option("port", 9999) \
#    .load()
#df = df.selectExpr( "CAST(value AS STRING)")


#import pyspark.sql.functions as f
# split text lines into two fields
#split_col = f.split(df['value'], '\t')
#df = df.withColumn('spam', split_col.getItem(0))
#df = df.withColumn('message', split_col.getItem(1))
#df = df.select(['spam', 'message'])


## Training model in time intervals using Spark Streaming

For a Spark Streaming application running on a cluster to be stable, the system must process data as fast as it is received; that is, the batch processing time should be less than the batch interval.
The batch interval should be set so that the expected production data rate can be sustained.
We will test with a conservative batch interval (say, 5-20 seconds) and a low data rate to verify that the system can keep up with the data rate.
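
The stability condition can be checked with a few lines of plain Python. This is only a sketch: the helper names `is_stable` and `utilization` and the sample timings are hypothetical, and in practice the per-batch processing times would be read from the Streaming tab of the Spark UI.

```python
from statistics import mean


def is_stable(processing_times_s, batch_interval_s):
    """Return True if every batch finished within the batch interval."""
    return all(t < batch_interval_s for t in processing_times_s)


def utilization(processing_times_s, batch_interval_s):
    """Average fraction of the batch interval spent processing."""
    return mean(processing_times_s) / batch_interval_s


# Hypothetical per-batch processing times (seconds) for a 20-second interval
observed = [4.0, 6.0, 5.0, 9.0]

print(is_stable(observed, 20))    # True
print(utilization(observed, 20))  # 0.3
```

A utilization close to 1.0 leaves little headroom: a single slow batch delays the batches queued behind it.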

In [8]:


# Function to create and set up a new StreamingContext
def functionToCreateContext():
    global spark
    global dataset_df
    global userSchema

    sc = SparkContext("local[2]", "Batch_Model_on_Stream_Data")  # new context
    ssc = StreamingContext(sc, train_duration)  # batch interval = train_duration seconds

    spark = SparkSession(sc)
    sqlContext = SQLContext(sc)
    # create an empty DataFrame that will accumulate the training records
    dataset_df = sqlContext.createDataFrame(sc.emptyRDD(), userSchema)

    # create a DStream that emits an empty RDD every batch interval;
    # it serves only as a timer that triggers train()
    empty_stream = ssc.queueStream([dataset_df.limit(1).rdd], oneAtATime=True, default=dataset_df.limit(1).rdd)
    empty_stream.foreachRDD(train)
    ssc.start()

    ssc.checkpoint(checkpointDirectory)  # set checkpoint directory
    return ssc



In [9]:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 pyspark-shell'

userSchema = StructType().add("spam", "string").add("message", "string")

dataset_df = None
# alternatively, start with an initial dataset
# dataset_df = sc.textFile("gs://drive3/data/spark/8_cs1_dataset/SMSSpamCollection").map(lambda line: re.split('\t', line)).toDF(["spam", "message"])  

checkpointDirectory = "chkpnt"

spark = None

model = None

prev_length = 0

evaluate = True
# interval between retraining runs on the entire accumulated dataset
train_duration = 20  # train a model every n seconds


In [10]:
# Get StreamingContext from checkpoint data or create a new one
context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext)

# Start the context
#context.start()
#context.awaitTermination()

In [11]:


# Convert the string label in the "spam" column into a numeric "label" column
indexer = StringIndexer().setInputCol("spam").setOutputCol("label")
# Extract words
tokenizer = Tokenizer().setInputCol("message").setOutputCol("words")
# Remove stopwords (the default list plus a custom "-" token)
stopwords = StopWordsRemover().getStopWords() + ["-"]
remover = StopWordsRemover().setStopWords(stopwords).setInputCol("words").setOutputCol("filtered")
# Hash the filtered words into a fixed-length term-frequency vector
hashingTF = HashingTF(numFeatures=10, inputCol="filtered", outputCol="features")
# Alternative classifier; defined here but not used in the pipeline below
rf = RandomForestClassifier().setFeaturesCol("features").setNumTrees(10)
#dt = DecisionTreeClassifier()
lr = LogisticRegression(maxIter=10)

full_pipeline = Pipeline().setStages([ indexer, tokenizer, remover, hashingTF, lr])
############################################################

paramGrid = ParamGridBuilder() \
.addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
.addGrid(lr.regParam, [0.1, 0.01]) \
.build()


crossval_full = CrossValidator(estimator=full_pipeline,
                            estimatorParamMaps=paramGrid,
                            evaluator=BinaryClassificationEvaluator(),
                            numFolds=2)  # use 3+ folds in practice
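
Because the model is retrained on the whole accumulated dataset at every interval, the size of the parameter grid directly affects whether training fits inside the batch interval. As a rough sketch in plain Python (no Spark required), the grid above can be enumerated to count the pipeline fits per retraining run; the variable names are illustrative, and the extra fit reflects that `CrossValidator` refits the best parameter combination on the full training data.

```python
from itertools import product

# Values copied from the parameter grid above
num_features_values = [10, 100, 1000]
reg_param_values = [0.1, 0.01]
num_folds = 2

# Every (numFeatures, regParam) combination that is tried
combinations = list(product(num_features_values, reg_param_values))

# One fit per combination per fold, plus one final refit of the best
# combination on all of the training data
total_fits = len(combinations) * num_folds + 1

print(len(combinations))  # 6
print(total_fits)         # 13
```

Raising `numFolds` to 3 or adding grid values multiplies this count, so the retraining interval may need to grow with it.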



Structured Streaming is a fast, scalable, fault-tolerant stream processing engine built on the Spark SQL engine.

In [12]:

# watch the training and test directories for new tab-separated files

df = spark \
    .readStream \
    .option("sep", "\t") \
    .schema(userSchema)  \
    .csv(train_csv_dir) \
    .writeStream.foreachBatch(sendPartition)\
    .start()

time.sleep(train_duration*4)

df_test = spark \
    .readStream \
    .option("sep", "\t") \
    .schema(userSchema)  \
    .csv(test_csv_dir) \
    .writeStream.foreachBatch(predict).start()



Starting to fit a model on 7804 records
Model fit compleeted

Evalluated on 3344 records
Accuracy 0.9227208060884111


In [13]:
#ssc.start() 
# Start the computation
context.awaitTermination()  # Wait for the computation to terminate
#print(flag)

Predictions:
+----+--------------------+-----+----------+--------------------+
|spam|             message|label|prediction|         probability|
+----+--------------------+-----+----------+--------------------+
| ham|Go until jurong p...|  0.0|       0.0|[0.97816853950859...|
| ham|Ok lar... Joking ...|  0.0|       0.0|[0.98108950298959...|
|spam|Free entry in 2 a...|  1.0|       1.0|[0.01695877557495...|
| ham|U dun say so earl...|  0.0|       0.0|[0.98054351127301...|
| ham|Nah I don't think...|  0.0|       0.0|[0.94069917507176...|
|spam|FreeMsg Hey there...|  1.0|       1.0|[0.42880366386633...|
| ham|Even my brother i...|  0.0|       0.0|[0.96179340372824...|
| ham|As per your reque...|  0.0|       0.0|[0.94980051764340...|
|spam|WINNER!! As a val...|  1.0|       1.0|[0.05484354055332...|
|spam|Had your mobile 1...|  1.0|       1.0|[0.06083398293550...|
| ham|I'm gonna be home...|  0.0|       0.0|[0.99268464855744...|
|spam|SIX chances to wi...|  1.0|       1.0|[0.12597586452428..

Py4JJavaError: An error occurred while calling o29.awaitTermination.
: org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/Users/val/opt/anaconda3/envs/beam/lib/python3.7/site-packages/pyspark/streaming/util.py", line 68, in call
    r = self.func(t, *rdds)
  File "/Users/val/opt/anaconda3/envs/beam/lib/python3.7/site-packages/pyspark/streaming/dstream.py", line 161, in <lambda>
    func = lambda t, rdd: old_func(rdd)
  File "<ipython-input-6-827f7ab298b2>", line 8, in train
    if dataset_df.count() > prev_length:
  File "/Users/val/opt/anaconda3/envs/beam/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 523, in count
    return int(self._jdf.count())
  File "/Users/val/opt/anaconda3/envs/beam/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Users/val/opt/anaconda3/envs/beam/lib/python3.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/Users/val/opt/anaconda3/envs/beam/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o74.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 615.0 failed 1 times, most recent failure: Lost task 1.0 in stage 615.0 (TID 998, localhost, executor driver): java.io.FileNotFoundException: File file:/Users/val/Documents/code/spark/module11_cs1/train_data_csv/SMSSpamCollection copy does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2835)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
	at org.apache.spark.sql.Dataset.count(Dataset.scala:2835)
	at sun.reflect.GeneratedMethodAccessor178.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)


	at org.apache.spark.streaming.api.python.TransformFunction.callPythonTransformFunction(PythonDStream.scala:95)
	at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:78)
	at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:179)
	at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:179)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
	at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
	at scala.util.Try$.apply(Try.scala:192)
	at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


# Streaming from a Kafka topic

In [None]:
#flume-ng agent --conf conf --conf-file /usr/local/Cellar/flume/1.9.0/libexec/conf/flume-sample.conf  -Dflume.root.logger=DEBUG,console --name a1 -Xmx512m -Xms256m

In [None]:
#!flume-ng agent --conf conf --conf-file /Users/val/Documents/code/spark/m11_to_Upload/netcat.conf.txt  -Dflume.root.logger=DEBUG,console --name NetcatAgent -Xmx512m -Xms256m

In [3]:
# Subscribe to 1 topic
#df = spark \
#    .readStream \
#    .format("kafka") \
#    .option("kafka.bootstrap.servers", "localhost:9092") \
#    .option("subscribe", "sample-topic") \
#    .option("startingOffsets", "earliest") \
#    .load()

#df = df.selectExpr( "CAST(value AS STRING)")


#import pyspark.sql.functions as f
# split text lines into two fields
#split_col = f.split(df['value'], '\t')
#df = df.withColumn('spam', split_col.getItem(0))
#df = df.withColumn('message', split_col.getItem(1))
#df = df.select(['spam', 'message'])

# convert label: spam = 1, ham = 0 
#from pyspark.sql import functions as f
#df = df.withColumn("label", f.when(f.col('spam') == "spam", 1).otherwise(0))