
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

<img src="https://files.training.databricks.com/images/Apache-Spark-Logo_TM_200px.png" style="float: left; margin: 20px"/>

# Structured Streaming with Kafka 

In this lesson we consume a live stream of Wikipedia edits, in many different languages, that a separate server publishes to Kafka.

**What you will learn:**
* About Kafka
* How to establish a connection with Kafka
* More examples 
* More visualizations

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Software Engineers

## Prerequisites
* Web browser: current versions of Google Chrome, Firefox, Safari, Microsoft Edge, and Internet Explorer 11 on Windows 7, 8, or 10 (see <a href="https://docs.databricks.com/user-guide/supported-browsers.html#supported-browsers#" target="_blank">Supported Web Browsers</a>)
* Databricks Runtime 4.2 or greater
* Completed courses Spark-SQL, DataFrames or ETL-Part 1 from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>, or have similar knowledge

<iframe  
src="//fast.wistia.net/embed/iframe/p5v3fw7auc?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/p5v3fw7auc?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Getting Started</h2>

Run the following cell to configure our "classroom."

In [5]:
%run "./Includes/Classroom-Setup"


<img style="float:right" src="https://files.training.databricks.com/images/eLearning/Structured-Streaming/kafka.png"/>

<div>
  <h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> The Kafka Ecosystem</h2>
  <p>Kafka is software designed upon the <b>publish/subscribe</b> messaging pattern.
     Publish/subscribe messaging is where a sender (publisher) sends a message that is not specifically directed to a receiver (subscriber). 
     The publisher classifies the message somehow and the receiver subscribes to receive certain categories of messages.
     There are other usage patterns for Kafka, but this is the pattern we focus on in this course.
  </p>
  <p>Publisher/subscriber systems typically have a central point where messages are published, called a <b>broker</b>. 
     The broker receives messages from publishers, assigns offsets to them and commits messages to storage.
  </p>

  <p>The Kafka unit of data is an array of bytes called a <b>message</b>.</p>

  <p>A message can also contain a bit of information related to partitioning called a <b>key</b>.</p>

  <p>In Kafka, messages are categorized into <b>topics</b>.</p>
</div>
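
To make the vocabulary concrete, here is a minimal in-memory sketch of the publish/subscribe pattern. It is not how Kafka is implemented: `ToyBroker` and its methods are hypothetical names invented for this illustration, and the "es" payload is made up. It only shows how topics, keys, messages (arrays of bytes), and offsets relate to one another.

```python
from collections import defaultdict


class ToyBroker:
    """Hypothetical in-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)   # topic -> list of (key, value) messages

    def publish(self, topic, value, key=None):
        """Append a message (bytes) to the topic's log and return its offset."""
        log = self._logs[topic]
        log.append((key, value))
        return len(log) - 1              # the offset is the position in the log

    def read(self, topic, offset=0):
        """Return (offset, key, value) for every message at or after `offset`."""
        log = self._logs[topic]
        return [(i, k, v) for i, (k, v) in enumerate(log) if i >= offset]


broker = ToyBroker()
broker.publish("en", b'{"page": "Lupton family"}')
broker.publish("es", b'{"page": "Madrid"}')              # made-up payload
broker.publish("en", b'{"page": "Alberbury Castle"}', key=b"k1")

en_messages = broker.read("en")   # a subscriber to "en" never sees the "es" message
print(en_messages)
```

The publisher never names a receiver; it only chooses a topic, and each subscriber decides which topics (and which starting offset) it wants to read.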

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> The Kafka Server</h2>


The Kafka server is fed by a separate TCP server that reads the Wikipedia edits, in real time, from the various language-specific IRC channels to which Wikimedia posts them. 

That server parses the IRC data, converts the results to JSON, and sends the JSON to
a Kafka server, with the edits segregated by language. The various languages are <b>topics</b>.

For example, the Kafka topic "en" corresponds to edits for en.wikipedia.org.

### Required Options

When consuming from a Kafka source, you **must** specify at least two options:

<p>1. The Kafka bootstrap servers, for example:</p>
<p>`dsr.option("kafka.bootstrap.servers", "server1.databricks.training:9092")`</p>
<p>2. Some indication of the topics you want to consume.</p>

#### Specifying a topic

There are three mutually exclusive ways to specify the topics for consumption:

| Option        | Value                                          | Description                            | Example |
| ------------- | ---------------------------------------------- | -------------------------------------- | ------- |
| **subscribe** | A comma-separated list of topics               | A list of topics to which to subscribe | `dsr.option("subscribe", "topic1")` <br/> `dsr.option("subscribe", "topic1,topic2,topic3")` |
| **assign**    | A JSON string indicating topics and partitions | Specific topic-partitions to consume.  | `dsr.option("assign", '{"topic1": [1,3], "topic2": [2,5]}')` |
| **subscribePattern**   | A (Java) regular expression           | A pattern to match desired topics      | `dsr.option("subscribePattern", "e[ns]")` <br/> `dsr.option("subscribePattern", "topic[123]")`|
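
The "exactly one topic selector" rule can be checked before the options ever reach Spark. The helper below is a plain-Python sketch; `kafka_source_options` is a hypothetical name, not part of Spark or of this course's setup code.

```python
def kafka_source_options(bootstrap_servers, subscribe=None, assign=None,
                         subscribe_pattern=None):
    """Hypothetical helper: build Kafka source options with exactly one topic selector."""
    selectors = {
        "subscribe": subscribe,
        "assign": assign,
        "subscribePattern": subscribe_pattern,
    }
    # Keep only the selectors the caller actually supplied.
    chosen = {name: value for name, value in selectors.items() if value is not None}
    if len(chosen) != 1:
        raise ValueError(
            "Specify exactly one of subscribe, assign, subscribePattern; got %d"
            % len(chosen))
    options = {"kafka.bootstrap.servers": bootstrap_servers}
    options.update(chosen)
    return options


opts = kafka_source_options("server1.databricks.training:9092", subscribe="en")
print(opts)
```

A dictionary like `opts` could then be handed to the reader in one call with `options(**opts)` instead of chaining several `option(...)` calls.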


<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> In the example to follow, we're using the "subscribe" option to select the topics we're interested in consuming. 
We've selected only the "en" topic, corresponding to edits for the English Wikipedia. 
If we wanted to consume multiple topics (multiple Wikipedia languages, in our case), we could just specify them as a comma-separated list:

```dsr.option("subscribe", "en,es,it,fr,de,eo")```
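
If the languages live in a Python list, the option value can be built rather than typed by hand. The sketch below also shows which of those topics a pattern such as `e[ns]` would select. It uses Python's `re` module as a stand-in for the Java regular expression that Kafka applies, on the assumption that such a simple pattern behaves the same way in both and that the whole topic name must match.

```python
import re

languages = ["en", "es", "it", "fr", "de", "eo"]

# Value for the "subscribe" option: topic names joined by commas, no spaces.
subscribe_value = ",".join(languages)

# Topics that a pattern such as "e[ns]" would select from the same list.
pattern = re.compile(r"e[ns]")
pattern_matches = [lang for lang in languages if pattern.fullmatch(lang)]

print(subscribe_value)   # en,es,it,fr,de,eo
print(pattern_matches)   # ['en', 'es']
```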

#### Other options

There are other, optional, arguments you can give the Kafka source. 

For more information, see the <a href="https://people.apache.org//~pwendell/spark-nightly/spark-branch-2.1-docs/latest/structured-streaming-kafka-integration.html#" target="_blank">Structured Streaming and Kafka Integration Guide</a>

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> The Kafka Schema</h2>

Reading from Kafka returns a `DataFrame` with the following fields:

| Field             | Type   | Description |
|------------------ | ------ |------------ |
| **key**           | binary | The key of the record (not needed) |
| **value**         | binary | Our JSON payload. We'll need to cast it to STRING |
| **topic**         | string | The topic this record is received from (not needed) |
| **partition**     | int    | The Kafka topic partition from which this record is received (not needed). This server only has one partition. |
| **offset**        | long   | The position of this record in the corresponding Kafka topic partition (not needed) |
| **timestamp**     | timestamp | The timestamp of this record  |
| **timestampType** | int    | The timestamp type of a record (not needed) |

In the example below, the only column we want to keep is `value`.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The default of `spark.sql.shuffle.partitions` is 200.
This setting is used in operations like `groupBy`.
In this case, we should be setting this value to match the current number of cores.

In [13]:
from pyspark.sql.functions import col
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)

kafkaServer = "server1.databricks.training:9092"   # US (Oregon)
# kafkaServer = "server2.databricks.training:9092" # Singapore

editsDF = (spark.readStream                             # Get the DataStreamReader
  .format("kafka")                                      # Specify the source format as "kafka"
  .option("kafka.bootstrap.servers", kafkaServer)       # Configure the Kafka server name and port
  .option("subscribe", "en")                            # Subscribe to the "en" Kafka topic
  .option("startingOffsets", "earliest")                # Rewind stream to beginning when we restart notebook
  .option("maxOffsetsPerTrigger", 1000)                 # Throttle: read at most 1000 offsets per trigger
  .load()                                               # Load the DataFrame
  .select(col("value").cast("STRING").alias("value"))   # Cast the "value" column to STRING
)
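
The cast on `value` in the cell above can be pictured in plain Python. This is only an analogy, not Spark code: a Kafka record's value arrives as raw bytes, the cast turns those bytes into text, and only then can the text be parsed as JSON. The payload here is a shortened, illustrative version of an edit record.

```python
import json

# One record's value as the Kafka source delivers it: raw bytes, not text.
raw_value = b'{"wikipedia": "en", "page": "Lupton family", "isAnonymous": true}'

as_string = raw_value.decode("utf-8")   # rough analogue of cast("STRING")
edit = json.loads(as_string)            # later steps parse this JSON text

print(type(raw_value).__name__, "->", type(as_string).__name__)
print(edit["page"])
```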

Let's display some data.

In [15]:
myStream = "my_python_stream"
display(editsDF,  streamName = myStream)

value
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-01T06:47:33.681Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=943326518&oldid=943326400"",""isUnpatrolled"":false,""page"":""Lupton family"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""/* Clergy, farmers, clothiers and merchants */"",""userURL"":""http://en.wikipedia.org/wiki/User:175.32.251.147"",""pageURL"":""http://en.wikipedia.org/wiki/Lupton_family"",""delta"":9,""flag"":"""",""isNewPage"":false,""isAnonymous"":true,""geocoding"":{""countryCode2"":""AU"",""city"":""Mill Park"",""latitude"":-37.66669845581055,""country"":""Australia"",""longitude"":-37.66669845581055,""stateProvince"":""VIC"",""countryCode3"":""AUS""},""user"":""175.32.251.147"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-01T06:47:34.513Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=943326519&oldid=943326000"",""isUnpatrolled"":false,""page"":""Alberbury Castle"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":"""",""userURL"":""http://en.wikipedia.org/wiki/User:MistyGraceWhite"",""pageURL"":""http://en.wikipedia.org/wiki/Alberbury_Castle"",""delta"":1113,""flag"":"""",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""MistyGraceWhite"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-01T06:47:34.976Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=943326520&oldid=920439319"",""isUnpatrolled"":false,""page"":""List of mountain passes in North Carolina"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""/* top */added short description"",""userURL"":""http://en.wikipedia.org/wiki/User:Lepricavark"",""pageURL"":""http://en.wikipedia.org/wiki/List_of_mountain_passes_in_North_Carolina"",""delta"":45,""flag"":""M"",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""Lepricavark"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-01T06:47:35.519Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=943326521&oldid=940815222"",""isUnpatrolled"":false,""page"":""Soong Mei-ling"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":"""",""userURL"":""http://en.wikipedia.org/wiki/User:112.200.34.96"",""pageURL"":""http://en.wikipedia.org/wiki/Soong_Mei-ling"",""delta"":16,""flag"":"""",""isNewPage"":false,""isAnonymous"":true,""geocoding"":{""countryCode2"":""PH"",""city"":null,""latitude"":14.58329963684082,""country"":""Philippines"",""longitude"":14.58329963684082,""stateProvince"":null,""countryCode3"":""PHL""},""user"":""112.200.34.96"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-01T06:47:36.345Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=943326522&oldid=943326471"",""isUnpatrolled"":false,""page"":""The Get Up Kids"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":"""",""userURL"":""http://en.wikipedia.org/wiki/User:VampireofMisery"",""pageURL"":""http://en.wikipedia.org/wiki/The_Get_Up_Kids"",""delta"":0,""flag"":"""",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""VampireofMisery"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-01T06:47:36.974Z"",""url"":"""",""isUnpatrolled"":false,""page"":""Special:Log/newusers"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""New user account"",""userURL"":""http://en.wikipedia.org/wiki/User:Sanbansal"",""pageURL"":""http://en.wikipedia.org/wiki/Special:Log/newusers"",""delta"":null,""flag"":""create"",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""Sanbansal"",""namespace"":""special""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-01T06:47:37.041Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=943326523&oldid=943326396"",""isUnpatrolled"":false,""page"":""Büsingen am Hochrhein"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":"""",""userURL"":""http://en.wikipedia.org/wiki/User:88.213.128.175"",""pageURL"":""http://en.wikipedia.org/wiki/Büsingen_am_Hochrhein"",""delta"":-97,""flag"":"""",""isNewPage"":false,""isAnonymous"":true,""geocoding"":{""countryCode2"":""IT"",""city"":null,""latitude"":42.83330154418945,""country"":""Italy"",""longitude"":42.83330154418945,""stateProvince"":null,""countryCode3"":""ITA""},""user"":""88.213.128.175"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-01T06:47:37.661Z"",""url"":"""",""isUnpatrolled"":false,""page"":""Special:Log/upload"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""uploaded \""[[\u000302File:Logo of Instituto Questao de Ciencia, Brazil.svg\u000310]]\"": Uploading a non-free logo using [[Wikipedia:File_Upload_Wizard|File Upload Wizard]]"",""userURL"":""http://en.wikipedia.org/wiki/User:Wyatt Tyrone Smith"",""pageURL"":""http://en.wikipedia.org/wiki/Special:Log/upload"",""delta"":null,""flag"":""upload"",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""Wyatt Tyrone Smith"",""namespace"":""special""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-01T06:47:37.979Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=943326525&oldid=939952214"",""isUnpatrolled"":false,""page"":""Hazel Crest, Illinois"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""/* Village Board */"",""userURL"":""http://en.wikipedia.org/wiki/User:2600:1700:B250:4A10:1CA0:B1EA:B3EA:169"",""pageURL"":""http://en.wikipedia.org/wiki/Hazel_Crest,_Illinois"",""delta"":-5,""flag"":"""",""isNewPage"":false,""isAnonymous"":true,""geocoding"":{""countryCode2"":""US"",""city"":null,""latitude"":39.7599983215332,""country"":""United States"",""longitude"":39.7599983215332,""stateProvince"":null,""countryCode3"":""USA""},""user"":""2600:1700:B250:4A10:1CA0:B1EA:B3EA:169"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-01T06:47:38.231Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=943326526&oldid=943325950"",""isUnpatrolled"":false,""page"":""Wikipedia:Administrators' noticeboard/Incidents"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""/* Continuous unsourced info */ Reply"",""userURL"":""http://en.wikipedia.org/wiki/User:Robvanvee"",""pageURL"":""http://en.wikipedia.org/wiki/Wikipedia:Administrators'_noticeboard/Incidents"",""delta"":521,""flag"":"""",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""Robvanvee"",""namespace"":""Wikipedia""}"


Wait until stream is done initializing...

In [17]:
untilStreamIsReady("my_python_stream")

Make sure to stop the stream before continuing.

In [19]:
for s in spark.streams.active:  # Iterate over all active streams
  if (s.name == myStream):      # Look for our specific stream
    print("Stopping "+s.name)   # A little extra feedback
    s.stop()                    # Stop the stream

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Use Kafka to display the raw data</h2>

The Kafka server acts as a sort of "firehose" (or asynchronous buffer) and displays raw data.

Since raw data coming in from a stream is transient, we'd like to save it to a more permanent data structure.

The first step is to define the schema for the JSON payload.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Only those fields of future interest are commented below.

In [21]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, BooleanType
from pyspark.sql.functions import from_json, unix_timestamp

schema = StructType([
  StructField("channel", StringType(), True),
  StructField("comment", StringType(), True),
  StructField("delta", IntegerType(), True),
  StructField("flag", StringType(), True),
  StructField("geocoding", StructType([                 # (OBJECT): Added by the server, field contains IP address geocoding information for anonymous edit.
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
    StructField("countryCode2", StringType(), True),
    StructField("countryCode3", StringType(), True),
    StructField("stateProvince", StringType(), True),
    StructField("latitude", DoubleType(), True),
    StructField("longitude", DoubleType(), True),
  ]), True),
  StructField("isAnonymous", BooleanType(), True),      # (BOOLEAN): Whether or not the change was made by an anonymous user
  StructField("isNewPage", BooleanType(), True),
  StructField("isRobot", BooleanType(), True),
  StructField("isUnpatrolled", BooleanType(), True),
  StructField("namespace", StringType(), True),         # (STRING): Page's namespace. See https://en.wikipedia.org/wiki/Wikipedia:Namespace 
  StructField("page", StringType(), True),              # (STRING): Printable name of the page that was edited
  StructField("pageURL", StringType(), True),           # (STRING): URL of the page that was edited
  StructField("timestamp", StringType(), True),         # (STRING): Time the edit occurred, in ISO-8601 format
  StructField("url", StringType(), True),
  StructField("user", StringType(), True),              # (STRING): User who made the edit or the IP address associated with the anonymous editor
  StructField("userURL", StringType(), True),
  StructField("wikipediaURL", StringType(), True),
  StructField("wikipedia", StringType(), True),         # (STRING): Short name of the Wikipedia that was edited (e.g., "en" for the English)
])
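
Note that the schema keeps `timestamp` as a string. As a plain-Python illustration of what an ISO-8601 value of this shape contains (this is not the conversion Spark performs, just a sketch using the standard library), the timestamp from one of the records shown earlier can be parsed like this:

```python
from datetime import datetime, timezone

raw = "2020-03-01T06:47:33.681Z"   # trailing "Z" means UTC

# Parse date, time, and fractional seconds, then mark the result as UTC.
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

print(parsed.isoformat())   # 2020-03-01T06:47:33.681000+00:00
```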

Next we can use the function `from_json` to parse out the full message with the schema specified above.

In [23]:
from pyspark.sql.functions import col, from_json

jsonEdits = editsDF.select(
  from_json("value", schema).alias("json"))  # Parse the column "value" and name it "json"

When parsing a value from JSON, we end up with a single column containing a complex object.

We can clearly see this by simply printing the schema.

In [25]:
jsonEdits.printSchema()

The fields of a complex object can be referenced with a "dot" notation as in:

`col("json.geocoding.countryCode3")` 
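
The same dotted-path idea can be sketched over nested Python dictionaries. `get_path` is a hypothetical helper written for this illustration; returning `None` for a missing key is a choice of this sketch, not a statement about how Spark treats fields that are absent from the schema.

```python
def get_path(record, dotted_path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    current = record
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


row = {"json": {"page": "Lupton family",
                "geocoding": {"country": "Australia", "countryCode3": "AUS"}}}

print(get_path(row, "json.geocoding.countryCode3"))   # AUS
print(get_path(row, "json.geocoding.city"))           # None
```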
 

A large number of these fields/columns can become unwieldy.

For that reason, it is common to extract the sub-fields and represent them as first-level columns as seen below:

In [27]:
from pyspark.sql.functions import col, isnull

anonDF = (jsonEdits
  .select(col("json.wikipedia").alias("wikipedia"),      # Promoting from sub-field to column
          col("json.isAnonymous").alias("isAnonymous"),  #     "       "      "      "    "
          col("json.namespace").alias("namespace"),      #     "       "      "      "    "
          col("json.page").alias("page"),                #     "       "      "      "    "
          col("json.pageURL").alias("pageURL"),          #     "       "      "      "    "
          col("json.geocoding").alias("geocoding"),      #     "       "      "      "    "
          col("json.user").alias("user"),                #     "       "      "      "    "
          col("json.timestamp").cast("timestamp").alias("timestamp"))  # Promoting and converting to a timestamp
  .filter(col("namespace") == "article")                 # Limit result to just articles
  .filter(~isnull(col("geocoding.countryCode3")))        # We only want results that are geocoded
)
display(anonDF)

wikipedia,isAnonymous,namespace,page,pageURL,geocoding,user,timestamp
en,True,article,Lupton family,http://en.wikipedia.org/wiki/Lupton_family,"List(Mill Park, Australia, AU, AUS, VIC, -37.66669845581055, -37.66669845581055)",175.32.251.147,2020-03-01T06:47:33.681+0000
en,True,article,Soong Mei-ling,http://en.wikipedia.org/wiki/Soong_Mei-ling,"List(null, Philippines, PH, PHL, null, 14.58329963684082, 14.58329963684082)",112.200.34.96,2020-03-01T06:47:35.519+0000
en,True,article,Büsingen am Hochrhein,http://en.wikipedia.org/wiki/Büsingen_am_Hochrhein,"List(null, Italy, IT, ITA, null, 42.83330154418945, 42.83330154418945)",88.213.128.175,2020-03-01T06:47:37.041+0000
en,True,article,"Hazel Crest, Illinois","http://en.wikipedia.org/wiki/Hazel_Crest,_Illinois","List(null, United States, US, USA, null, 39.7599983215332, 39.7599983215332)",2600:1700:B250:4A10:1CA0:B1EA:B3EA:169,2020-03-01T06:47:37.979+0000
en,True,article,Pit orchestra,http://en.wikipedia.org/wiki/Pit_orchestra,"List(null, United States, US, USA, null, 39.7599983215332, 39.7599983215332)",2601:282:B02:81E0:E9F0:D35D:3D82:789,2020-03-01T06:47:39.972+0000
en,True,article,Qatar SC,http://en.wikipedia.org/wiki/Qatar_SC,"List(null, Saudi Arabia, SA, SAU, null, 25.0, 25.0)",62.149.77.155,2020-03-01T06:47:40.723+0000
en,True,article,Büsingen am Hochrhein,http://en.wikipedia.org/wiki/Büsingen_am_Hochrhein,"List(null, Italy, IT, ITA, null, 42.83330154418945, 42.83330154418945)",88.213.128.175,2020-03-01T06:47:54.240+0000
en,True,article,Joseph Minala,http://en.wikipedia.org/wiki/Joseph_Minala,"List(null, Spain, ES, ESP, null, 40.400001525878906, 40.400001525878906)",80.103.137.135,2020-03-01T06:47:56.724+0000
en,True,article,Yousef Aymen,http://en.wikipedia.org/wiki/Yousef_Aymen,"List(null, Saudi Arabia, SA, SAU, null, 25.0, 25.0)",62.149.77.155,2020-03-01T06:47:59.285+0000
en,True,article,Anubhav Sinha,http://en.wikipedia.org/wiki/Anubhav_Sinha,"List(Kundan, India, IN, IND, JK, 33.80419921875, 33.80419921875)",106.200.146.28,2020-03-01T06:47:59.344+0000
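
The two filters can be mimicked on a handful of plain Python dictionaries. The records below are trimmed to three fields and loosely based on the raw edits displayed earlier. This sketches the filtering logic only, not the streaming query itself.

```python
edits = [
    {"page": "Lupton family", "namespace": "article",
     "geocoding": {"countryCode3": "AUS"}},
    {"page": "Alberbury Castle", "namespace": "article",
     "geocoding": {"countryCode3": None}},
    {"page": "Special:Log/newusers", "namespace": "special",
     "geocoding": {"countryCode3": None}},
    {"page": "Soong Mei-ling", "namespace": "article",
     "geocoding": {"countryCode3": "PHL"}},
]

# Keep only article edits that carry a country code (that is, geocoded edits).
kept = [e for e in edits
        if e["namespace"] == "article"
        and e["geocoding"]["countryCode3"] is not None]

print([e["page"] for e in kept])   # ['Lupton family', 'Soong Mei-ling']
```

Edits by logged-in users have no geocoding information, so the second condition effectively restricts the result to anonymous edits.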


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Mapping Anonymous Editors' Locations</h2>

When you run the query, the default display is a [live] HTML table.

The geocoded information allows us to associate an anonymous edit with a country.

We can then use that geocoded information to plot edits on a [live] world map.

In order to create a slick world map visualization of the data, you'll need to click on the item below.

Under <b>Plot Options</b>, use the following:
* <b>Keys:</b> `countryCode3`
* <b>Values:</b> `count`

In <b>Display type</b>, use <b>World map</b> and click <b>Apply</b>.

<img src="https://files.training.databricks.com/images/eLearning/Structured-Streaming/plot-options-map-04.png"/>

By invoking a `display` action on a DataFrame created from a `readStream` transformation, we can generate a LIVE visualization!

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Keep an eye on the plot for a minute or two and watch the colors change.

In [29]:
from pyspark.sql.functions import window, col
mappedDF = (anonDF
  .groupBy("geocoding.countryCode3")   # Aggregate by country (code)
  .count()                             # Produce a count of each aggregate
)
display(mappedDF, streamName = "SS04-mapped-p")

countryCode3,count
AZE,30
CHL,9
CRI,2
BHR,4
URY,1
TUR,68
PRT,19
IND,786
KGZ,1
GRC,41


Wait until stream is done initializing...

In [31]:
untilStreamIsReady("SS04-mapped-p")

Stop the streams.

In [33]:
for s in spark.streams.active:  # Iterate over all active streams
    s.stop()                    # Stop the stream

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Review Questions</h2>

**Q:** What `format` should you use with Kafka?<br>
**A:** `format("kafka")`

**Q:** How do you specify a Kafka server?<br>
**A:** `.option("kafka.bootstrap.servers", "server1.databricks.training:9092")`

**Q:** What verb should you use in conjunction with `readStream` and Kafka to create the streaming DataFrame?<br>
**A:** `load()`, but with no parameters since we are pulling from a Kafka server. The streaming job itself starts once an action such as `display` runs on the result.

**Q:** What fields are returned in a Kafka DataFrame?<br>
**A:** Reading from Kafka returns a DataFrame with the following fields:
key, value, topic, partition, offset, timestamp, timestampType

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Next Steps</h2>

Start the next lab, [Using Kafka Lab]($./Labs/SS 04 - Using Kafka Lab).

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Additional Topics &amp; Resources</h2>

* <a href="http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-stream#" target="_blank">Create a Kafka Source Stream</a>
* <a href="https://kafka.apache.org/documentation/" target="_blank">Official Kafka Documentation</a>
* <a href="https://www.confluent.io/blog/okay-store-data-apache-kafka/" target="_blank">Use Kafka to store data</a>

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>