This project illustrates the use of Kafka together with Spark Streaming: event messages are sent to a Kafka broker, then received and processed with the Spark Streaming module.
There are two Jupyter notebooks here:
- KafkaSendMessage.ipynb -> Creates a Kafka topic and starts sending messages to the Kafka broker (see the producer sketch below).
- SparkReceiveKafkaStream.ipynb -> 1) Creates a Spark stream on the Kafka topic. 2) Groups the events (Kafka messages) within a sliding window interval using reduceByWindow. 3) Calls the processing function on each group of events (see the streaming sketch below).
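For reference, a minimal sketch of the sending side, assuming the kafka-python client (the notebook may use a different library); the broker address, topic name "events", and JSON payload are illustrative choices, not taken from the notebook:

```python
import json
import time

from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError

BROKER = "localhost:9092"   # assumed local broker
TOPIC = "events"            # illustrative topic name

# Create the topic explicitly; brokers with auto-create enabled would also do this lazily.
admin = KafkaAdminClient(bootstrap_servers=BROKER)
try:
    admin.create_topics([NewTopic(name=TOPIC, num_partitions=1, replication_factor=1)])
except TopicAlreadyExistsError:
    pass

# Send one small JSON event per second.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
for i in range(100):
    producer.send(TOPIC, {"event_id": i, "ts": time.time()})
    time.sleep(1)
producer.flush()
```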
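And a sketch of the receiving side with the sliding window, assuming the spark-streaming-kafka-0-8 assembly is already on the classpath (see the jar setup below). createDirectStream is one way to attach to the topic (the notebook may use the receiver-based createStream instead), and the batch interval (5 s), window (10 s), and slide (5 s) are illustrative values:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaSparkStreamingDemo")
ssc = StreamingContext(sc, 5)  # 5-second batch interval (illustrative)

# Direct stream over the Kafka 0.8 API; each record arrives as a (key, value) pair.
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "localhost:9092"}
)

def process_events(events):
    # Placeholder for the real processing logic applied to one window's worth of events.
    print("window contains {0} events".format(len(events)))

def handle_window(rdd):
    # After reduceByWindow the RDD holds at most one element:
    # the merged list of messages for the current window.
    for events in rdd.collect():
        process_events(events)

# Wrap every message value in a one-element list, merge the lists of each
# 10-second window (sliding every 5 seconds), then hand the result to the handler.
(stream.map(lambda kv: [kv[1]])
       .reduceByWindow(lambda a, b: a + b, None, 10, 5)
       .foreachRDD(handle_window))

ssc.start()
ssc.awaitTermination()
```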
I used Kafka 2.11-0.11.0.2, Spark 2.4.0, and JDK 8 on Windows. I initially tried JDK 11, but it caused issues when calling collect/count on the RDD; after downgrading to JDK 8 everything worked fine.
To use PySpark 2.4.0 with Kafka, I had to do the following:
- download http://central.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-8-assembly_2.11/2.4.0/spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar to the SPARK_HOME\jars directory.
- add this jar to the classpath with the following command, executed before creating the stream in the code flow (see the sketch below): os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages <SPARK_HOME>\jars\spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar pyspark-shell'
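For reference, a sketch of how that can look at the top of the notebook. Two assumptions on my part: PYSPARK_SUBMIT_ARGS is read when the SparkContext is created, so it must be set before that point, and spark-submit's --jars option is the one that takes a local jar path (--packages expects Maven coordinates), so the sketch uses --jars; the path shown is illustrative.

```python
import os

# Must be set before the SparkContext (and therefore the JVM gateway) is created,
# so the Kafka assembly jar ends up on the driver and executor classpath.
# Replace the path with your actual SPARK_HOME location.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars C:\\spark-2.4.0\\jars\\spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar "
    "pyspark-shell"
)

from pyspark import SparkContext
sc = SparkContext(appName="KafkaSparkStreamingDemo")
```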
To use Spark 2.4.0 on Windows, I had to apply the following tweak:
- cd SPARK_HOME\python\lib
- unzip pyspark.zip
- edit pyspark\worker.py and comment out the lines shown below
- zip the pyspark directory back up

The resource library doesn't exist on Windows, and this block only checks and sets memory limits, so there is no harm in commenting it out.
#import resource
# set up memory limits
#memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
#total_memory = resource.RLIMIT_AS
#try:
#    if memory_limit_mb > 0:
#        (soft_limit, hard_limit) = resource.getrlimit(total_memory)
#        msg = "Current mem limits: {0} of max {1}\n".format(soft_limit, hard_limit)
#        print(msg, file=sys.stderr)
#
#        # convert to bytes
#        new_limit = memory_limit_mb * 1024 * 1024
#
#        if soft_limit == resource.RLIM_INFINITY or new_limit < soft_limit:
#            msg = "Setting mem limits to {0} of max {1}\n".format(new_limit, new_limit)
#            print(msg, file=sys.stderr)
#            resource.setrlimit(total_memory, (new_limit, new_limit))
#except (resource.error, OSError, ValueError) as e:
#    # not all systems support resource limits, so warn instead of failing
#    print("WARN: Failed to set memory limit: {0}\n".format(e), file=sys.stderr)