# Tweets consumer

This notebook shows how to use Spark Structured Streaming with Kafka through a hashtag count example. The producer must be running and sending tweets from user `@sgioldasis` to the Kafka topic `tweets`.

First, we need a SparkSession with Kafka (plus Hive and Avro) support:

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json,col
from pyspark.sql.types import *
from os.path import abspath

spark = SparkSession\
        .builder\
        .appName("tweets-consumer")\
        .master("spark://spark-master:7077")\
        .config("hive.metastore.uris", "thrift://hive-metastore:9083")\
        .config("spark.sql.warehouse.dir", "hdfs://namenode:8020/user/hive/warehouse")\
        .config("spark.executor.memory", "1g")\
        .config("spark.jars.packages",  # one comma-separated list: setting this key twice keeps only the last value
                "org.apache.spark:spark-avro_2.12:3.2.0,"
                "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1")\
        .config('spark.sql.shuffle.partitions', '1')\
        .enableHiveSupport()\
        .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
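
`spark.jars.packages` takes a single comma-separated list of Maven coordinates, and calling `.config()` twice with the same key keeps only the last value. As a minimal sketch, a small helper can build that string; `maven_packages` is a hypothetical name used here for illustration, not part of PySpark, and the coordinates are the ones used in this notebook.

```python
def maven_packages(*coordinates):
    """Join Maven coordinates into one comma-separated string."""
    # Skip empty entries so a stray "" does not produce a dangling comma
    return ",".join(c.strip() for c in coordinates if c.strip())


packages = maven_packages(
    "org.apache.spark:spark-avro_2.12:3.2.0",
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1",
)
print(packages)
```

The resulting string can then be passed once to the builder, for example `.config("spark.jars.packages", packages)`.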

Next, we start the streaming application. Once it is running, user `@sgioldasis` can send tweets from that account and the hashtag counts appear below.

In [4]:
from pyspark.sql.functions import explode, regexp_replace, split

topic = "savas"
checkpoint = f"file:///tmp/checkpoint_{topic}"

# Create a streaming DataFrame that reads records from the Kafka topic
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:29092") \
    .option("subscribe", topic) \
    .load()

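# Kafka values arrive as bytes: cast them to strings and name the column "value".
# Newlines become spaces so that words on adjacent lines are not glued together.
lines = df \
    .selectExpr("CAST(value AS STRING) AS value") \
    .withColumn('value', regexp_replace('value', '\n', ' '))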

# Split the lines into words
words = lines.select(
   explode(
       split(lines.value, " ")
   ).alias("word")
)

# Keep only words containing '#' and maintain a running count per hashtag
wordCounts = words \
    .filter("word like '%#%'") \
    .groupBy("word") \
    .count() \
    .orderBy('count', ascending=False)

# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .option("checkpointLocation", checkpoint) \
    .format("console") \
    .start()

query.awaitTermination()

-------------------------------------------
Batch: 0
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
+----+-----+



-------------------------------------------
Batch: 1
-------------------------------------------
+------+-----+
|  word|count|
+------+-----+
|#savas|    1|
+------+-----+



KeyboardInterrupt: 

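Batch 1 shows the hashtag `#savas` with a count of 1, so the stream works end to end. The `KeyboardInterrupt` comes from stopping the cell manually, since `query.awaitTermination()` blocks until the query ends.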
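
To see what the query computes without a cluster, the same split, filter and count steps can be sketched in plain Python. This is only an illustration under stated assumptions: `extract_hashtags` and `run_batches` are hypothetical helpers, the sample tweets are made up, and splitting on any whitespace is a simplification of the notebook's `split` on a single space. Like `outputMode("complete")`, each batch emits the full table of running counts, sorted by count in descending order.

```python
from collections import Counter


def extract_hashtags(text):
    """Return the whitespace-separated words that contain '#'."""
    return [word for word in text.split() if "#" in word]


def run_batches(batches):
    """Yield the complete, sorted count table after each micro-batch."""
    totals = Counter()  # running state kept across batches
    for batch in batches:
        for tweet in batch:
            totals.update(extract_hashtags(tweet))
        # Emit every row seen so far, highest count first
        yield sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up input: an empty first batch, then two batches of tweets
batches = [
    [],
    ["hello #savas"],
    ["more #savas and #kafka"],
]

results = list(run_batches(batches))
for number, table in enumerate(results):
    print(f"Batch: {number}", table)
```

The running `Counter` plays the role of the state that Spark keeps between micro-batches; in the real query that state is stored under the checkpoint location.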