### Task

Using the same server as above, subscribe to the Kafka topic "tweets".

Create the following schema for the JSON payload:  
root  
 |-- hashTags: array (nullable = true)  
 |    |-- element: string (containsNull = true)  
 |-- text: string (nullable = true)  
 |-- id: long (nullable = true)  
 |-- createdAt: long (nullable = true)  
 |-- retweetCount: integer (nullable = true)  
 |-- favoriteCount: integer (nullable = true)  
 |-- user: string (nullable = true)  
 |-- userScreenName: string (nullable = true)  
 
Parse the payload and count the tweets grouped by the _size_ of the hashTags array. Name the size column "amountOfHashtags".  

Example output:<br>
<table>
  <tr>
    <th>amountOfHashtags</th>
    <th>count</th>
  </tr>
  <tr>
    <td>0</td>
    <td>500</td>
  </tr>
  <tr>
    <td>1</td>
    <td>120</td>
  </tr>
  <tr>
    <td>3</td>
    <td>15</td>
  </tr>
</table>
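
The aggregation described above can be sanity-checked without Spark. The following is a minimal plain-Python sketch, not the streaming solution itself: the sample payloads and the helper name `count_by_hashtag_amount` are hypothetical, and treating a missing or null `hashTags` field as an empty array is a simplifying assumption that may differ from how Spark's `size` handles nulls.

```python
import json
from collections import Counter

# Hypothetical sample payloads; real messages arrive as bytes in the Kafka `value` column.
sample_payloads = [
    b'{"hashTags": [], "text": "no tags", "id": 1}',
    b'{"hashTags": ["spark"], "text": "one tag", "id": 2}',
    b'{"hashTags": ["spark", "kafka", "delta"], "text": "three tags", "id": 3}',
    b'{"hashTags": [], "text": "still no tags", "id": 4}',
]


def count_by_hashtag_amount(payloads):
    """Return {amountOfHashtags: count} for an iterable of JSON-encoded tweets."""
    counts = Counter()
    for raw in payloads:
        tweet = json.loads(raw.decode("utf-8"))
        # Simplification (assumption): a missing or null hashTags counts as an empty array.
        hashtags = tweet.get("hashTags") or []
        counts[len(hashtags)] += 1
    return dict(counts)


result = count_by_hashtag_amount(sample_payloads)
for amount, count in sorted(result.items()):
    print(amount, count)
```

Each key in the result corresponds to one row of the "amountOfHashtags" column, and each value to the matching "count".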

Write the output to a Delta table called "hashtags", completely overwriting the result on every trigger.
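
To illustrate what "completely overwriting the result on every trigger" means for an aggregation, here is a small plain-Python simulation. It is only a sketch of the idea: the function name `run_complete_mode` and the micro-batch contents are hypothetical, and each batch is assumed to be a list of already-computed hashtag amounts.

```python
from collections import Counter


def run_complete_mode(micro_batches):
    """Yield the full result table after each trigger, as a complete overwrite would."""
    running = Counter()  # aggregation state carried across triggers
    for batch in micro_batches:
        running.update(batch)  # each batch is a list of hashtag amounts
        # The whole aggregate is emitted every time, not only the rows that changed.
        yield sorted(running.items())


# Hypothetical micro-batches of hashtag amounts
batches = [[0, 0, 1], [0, 3], [1, 1]]
snapshots = list(run_complete_mode(batches))
for trigger, table in enumerate(snapshots, start=1):
    print(trigger, table)
```

Every snapshot contains all groups seen so far, which is why the table can simply be replaced on each trigger rather than appended to.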

Make sure to display what data the table contains. Then shut down the stream.

In [0]:
# ANSWER

from pyspark.sql.types import (ArrayType, IntegerType, LongType,
                               StringType, StructField, StructType)
import pyspark.sql.functions as F

kafka_server = "server1.databricks.training:9092"   # US (Oregon)

tweet_df = (spark.readStream                        # Get the DataStreamReader
  .format("kafka")                                 # Specify the source format as "kafka"
  .option("kafka.bootstrap.servers", kafka_server) # Configure the Kafka server name and port
  .option("subscribe", "tweets")                   # Subscribe to the "tweets" Kafka topic
  .option("startingOffsets", "earliest")           # The start point when a query is started
  .option("maxOffsetsPerTrigger", 100)             # Rate limit on max offsets per trigger interval
  .load()                                          # Load the DataFrame
)


tweet_schema = StructType([
  StructField("hashTags", ArrayType(StringType(), True), True),
  StructField("text", StringType(), True),
  StructField("id", LongType(), True),
  StructField("createdAt", LongType(), True),
  StructField("retweetCount", IntegerType(), True),
  StructField("favoriteCount", IntegerType(), True),
  StructField("user", StringType(), True),
  StructField("userScreenName", StringType(), True)
])


tweet_payload_df = (tweet_df
                  .select(F.col("value").cast(StringType()))
                  )

tweet_json_df = (tweet_payload_df
                .select(F.from_json("value", tweet_schema).alias("json"))
               )

counts_df = (tweet_json_df
             .select(F.size("json.hashTags").alias("amountOfHashtags"))  # Number of hashtags per tweet
             .groupBy("amountOfHashtags")                                # One group per distinct array size
             .count()
            )

(counts_df
 .writeStream
 .format("delta")
 .outputMode("complete")                                          # Rewrite the full aggregate on every trigger
 .option("checkpointLocation", "/tmp/countstest/checkpointing")
 .table("hashtags")                                               # Table name required by the task
)

In [0]:
display(spark.table("hashtags"))

In [0]:
for stream in spark.streams.active:
  stream.stop()