# Kafka with Spark Structured Streaming using Python

This notebook shows how to read from Kafka with Spark Structured Streaming, using a word count example. First, open a console producer and a console consumer on the `wordcount` topic:

```
# Terminal 1: console producer
docker-compose exec broker bash
kafka-console-producer --topic wordcount --bootstrap-server broker:29092

# Terminal 2: console consumer
docker-compose exec broker bash
kafka-console-consumer --topic wordcount --bootstrap-server broker:29092

# If not using Confluent Kafka
export TOPIC=wordcount
kafka-topics.sh --create --zookeeper zookeeper:2181 --replication-factor 1 --partitions 1 --topic $TOPIC
kafka-topics.sh --list --zookeeper zookeeper:2181
kafka-console-producer.sh --topic $TOPIC --bootstrap-server localhost:9092

```



First, we need a SparkSession with Kafka support (plus Hive and Avro support).

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json,col
from pyspark.sql.types import *
from os.path import abspath

spark = SparkSession\
        .builder\
        .appName("kafka-wordcount")\
        .master("spark://spark-master:7077")\
        .config("hive.metastore.uris", "thrift://hive-metastore:9083")\
        .config("spark.sql.warehouse.dir", "hdfs://namenode:8020/user/hive/warehouse")\
        .config("spark.executor.memory", "1g")\
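        .config(
            # One comma-separated list: setting this key twice keeps only the last value.
            # Both artifact versions should match the Spark version of the cluster.
            "spark.jars.packages",
            "org.apache.spark:spark-avro_2.12:3.2.0,"
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1")\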
        .config('spark.sql.shuffle.partitions', '1')\
        .enableHiveSupport()\
        .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
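
The `spark.jars.packages` setting takes a single comma-separated list of Maven coordinates, and setting the same key twice on the builder keeps only the last value. As a rough sketch, a small helper can build that list so that both artifacts share one Spark and Scala version. The helper name `spark_packages` and the version `3.2.0` are assumptions for illustration and are not part of this notebook; use the version of your own cluster.

```python
def spark_packages(spark_version: str, scala_version: str = "2.12") -> str:
    """Build one comma-separated value for spark.jars.packages.

    Hypothetical helper: the artifact names come from the session setup
    above, the versions are whatever the caller passes in.
    """
    artifacts = ["spark-sql-kafka-0-10", "spark-avro"]
    return ",".join(
        f"org.apache.spark:{name}_{scala_version}:{spark_version}"
        for name in artifacts
    )


# Assumed version, for illustration only
packages = spark_packages("3.2.0")
print(packages)
```

The resulting string could then be passed once to the builder, as in `.config("spark.jars.packages", packages)`.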

Next, we start the streaming application. Once it is running, type sentences into the producer terminal (Terminal 1) and the running word counts appear in the output here.

In [1]:
from pyspark.sql.functions import split, explode, desc

topic = "wordcount"

lines = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "broker:29092") \
  .option("subscribe", topic) \
  .option("startingOffsets", "latest") \
  .load()

# Kafka delivers the value as binary, so cast it to a string before splitting into words
words = lines.select(
   explode(
       split(lines.value.cast("string"), " ")
   ).alias("word")
)

# Generate running word count, dropping blank words before aggregating
wordCounts = words \
    .filter("trim(word) <> ''") \
    .groupBy("word").count() \
    .sort(desc("count"))
                                                                        
# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()

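A `NameError: name 'spark' is not defined` at this point means the cell that creates the SparkSession has not been run in the current kernel. Run that cell first, then rerun this one.

When the query starts without errors, the stream is ready to receive sentences from the producer.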
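
To make the logic of the query concrete, here is a minimal pure-Python sketch of what it computes. It is not Spark code. The class name `RunningWordCount` and the sample sentences are assumptions made for illustration. Each call to `process_batch` plays the role of one micro-batch, and the full, re-sorted table is returned every time, which is roughly what `outputMode("complete")` does.

```python
from collections import Counter
from typing import Iterable, List, Tuple


class RunningWordCount:
    """Keeps word totals across batches (hypothetical stand-in for the query)."""

    def __init__(self) -> None:
        self.totals: Counter = Counter()

    def process_batch(self, values: Iterable[bytes]) -> List[Tuple[str, int]]:
        for raw in values:
            line = raw.decode("utf-8")        # Kafka values arrive as bytes
            for word in line.split(" "):      # split(...) followed by explode(...)
                if word.strip(" ") != "":     # filter "trim(word) <> ''"
                    self.totals[word] += 1    # groupBy("word").count()
        # sort(desc("count")): the whole table is emitted again on every batch
        return self.totals.most_common()


counter = RunningWordCount()
first = counter.process_batch([b"to be or not to be"])
second = counter.process_batch([b"be  quick"])  # the double space yields a blank word

print(first)
print(second)
```

The second batch does not start from zero: the count for `be` carries over from the first batch, and the blank word produced by the double space is filtered out.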