
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

<img src="https://files.training.databricks.com/images/Apache-Spark-Logo_TM_200px.png" style="float: left; margin: 20px"/>

# Structured Streaming with Kafka 

We have another server that reads Wikipedia edits in real time, across a multitude of languages, and feeds them into Kafka.

**What you will learn:**
* About Kafka
* How to establish a connection with Kafka
* More examples 
* More visualizations

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Software Engineers

## Prerequisites
* Web browser: **Chrome**
* A cluster configured with **8 cores** and **DBR 6.3**
* External Services
  - Familiarity with Kafka is helpful, but not required
* Suggested Courses from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>:
  - ETL Part 1
  - Spark-SQL

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup & Classroom-Cleanup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [4]:
%run "./Includes/Classroom-Setup"

<iframe  
src="//fast.wistia.net/embed/iframe/p5v3fw7auc?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/p5v3fw7auc?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>


<img style="float:right" src="https://files.training.databricks.com/images/eLearning/Structured-Streaming/kafka.png"/>

<div>
  <h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> The Kafka Ecosystem</h2>
  <p>Kafka is software designed upon the <b>publish/subscribe</b> messaging pattern.
     Publish/subscribe messaging is where a sender (publisher) sends a message that is not specifically directed to a receiver (subscriber). 
     The publisher classifies the message somehow and the receiver subscribes to receive certain categories of messages.
     There are other usage patterns for Kafka, but this is the pattern we focus on in this course.
  </p>
  <p>Publisher/subscriber systems typically have a central point where messages are published, called a <b>broker</b>. 
     The broker receives messages from publishers, assigns offsets to them and commits messages to storage.
  </p>

  <p>The Kafka version of a unit of data is an array of bytes called a <b>message</b>.</p>

  <p>A message can also contain a bit of information related to partitioning called a <b>key</b>.</p>

  <p>In Kafka, messages are categorized into <b>topics</b>.</p>
</div>
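To make the vocabulary above concrete, here is a minimal in-memory sketch of a broker that keeps one append-only log per topic and assigns an offset to each message. The `TinyBroker` class and its methods are hypothetical names made up for illustration; this is not Kafka's API, and it ignores partitions, persistence, and networking.

```python
from collections import defaultdict


class TinyBroker:
    """Toy in-memory stand-in for a broker (hypothetical, not Kafka's real API)."""

    def __init__(self):
        # topic name -> append-only list of (key, value) messages
        self._log = defaultdict(list)

    def publish(self, topic, value, key=None):
        """Append a message to a topic and return the offset assigned to it."""
        self._log[topic].append((key, value))
        return len(self._log[topic]) - 1

    def read(self, topic, start_offset=0):
        """Return (offset, key, value) tuples from start_offset onward."""
        messages = self._log[topic][start_offset:]
        return [(start_offset + i, k, v) for i, (k, v) in enumerate(messages)]


broker = TinyBroker()
broker.publish("en", b'{"page": "Alternative rock"}')       # offset 0 in topic "en"
broker.publish("de", b'{"page": "Berlin"}')                 # offset 0 in topic "de"
last = broker.publish("en", b'{"page": "Conservatism"}', key=b"en-2")

# A subscriber to "en" sees only the "en" messages, in offset order
en_messages = broker.read("en")
```

The publisher never addresses a subscriber directly: it only names a topic, and any reader of that topic can replay messages from whatever offset it chooses.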

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> The Kafka Server</h2>


The Kafka server is fed by a separate TCP server that reads the Wikipedia edits, in real time, from the various language-specific IRC channels to which Wikimedia posts them. 

That server parses the IRC data, converts the results to JSON, and sends the JSON to
a Kafka server, with the edits segregated by language. The various languages are <b>topics</b>.

For example, the Kafka topic "en" corresponds to edits for en.wikipedia.org.

### Required Options

When consuming from a Kafka source, you **must** specify at least two options:

<p>1. The Kafka bootstrap servers, for example:</p>
<p>`dsr.option("kafka.bootstrap.servers", "server1.databricks.training:9092")`</p>
<p>2. Some indication of the topics you want to consume.</p>


#### Specifying a Topic

There are three mutually exclusive ways to specify the topics for consumption:

| Option        | Value                                          | Description                            | Example |
| ------------- | ---------------------------------------------- | -------------------------------------- | ------- |
| **subscribe** | A comma-separated list of topics               | A list of topics to which to subscribe | `dsr.option("subscribe", "topic1")` <br/> `dsr.option("subscribe", "topic1,topic2,topic3")` |
| **assign**    | A JSON string indicating topics and partitions | Specific topic-partitions to consume.  | `dsr.option("assign", '{"topic1": [1,3], "topic2": [2,5]}')` |
| **subscribePattern**   | A (Java) regular expression           | A pattern to match desired topics      | `dsr.option("subscribePattern", "e[ns]")` <br/> `dsr.option("subscribePattern", "topic[123]")`|

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> In the example to follow, we're using the "subscribe" option to select the topics we're interested in consuming. 
We've selected only the "en" topic, corresponding to edits for the English Wikipedia. 
If we wanted to consume multiple topics (multiple Wikipedia languages, in our case), we could just specify them as a comma-separated list:

```dsr.option("subscribe", "en,es,it,fr,de,eo")```
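As a rough illustration of how the three topic options differ, the plain-Python sketch below resolves an options dictionary to a list of topic names and rejects anything other than exactly one topic option. The `resolve_topics` helper and the list of available topics are hypothetical; Spark and Kafka do this work internally, and Python's `re` module is only assumed to behave like Java regular expressions for simple patterns such as these.

```python
import json
import re

TOPIC_OPTIONS = ("subscribe", "assign", "subscribePattern")


def resolve_topics(options, available_topics):
    """Return the topic names selected by exactly one topic option."""
    chosen = [name for name in TOPIC_OPTIONS if name in options]
    if len(chosen) != 1:
        raise ValueError("specify exactly one of: " + ", ".join(TOPIC_OPTIONS))
    name = chosen[0]
    value = options[name]
    if name == "subscribe":
        # Comma-separated list of topic names
        return [topic.strip() for topic in value.split(",")]
    if name == "subscribePattern":
        # The whole topic name must match the pattern
        return [t for t in available_topics if re.fullmatch(value, t)]
    # "assign": JSON object mapping topic name -> list of partitions
    return sorted(json.loads(value))


topics = ["en", "es", "de", "eo"]  # hypothetical topics on the server
by_list = resolve_topics({"subscribe": "en,es,it"}, topics)
by_pattern = resolve_topics({"subscribePattern": "e[ns]"}, topics)
by_assign = resolve_topics({"assign": '{"en": [0], "de": [0]}'}, topics)
```

Passing two of the options at once, for example both `subscribe` and `assign`, raises a `ValueError` in this sketch, mirroring the mutual exclusivity described above.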

There are other, optional, arguments you can give the Kafka source. 

For more information, see the <a href="https://people.apache.org//~pwendell/spark-nightly/spark-branch-2.1-docs/latest/structured-streaming-kafka-integration.html#" target="_blank">Structured Streaming and Kafka Integration Guide</a>

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> The Kafka Schema</h2>

Reading from Kafka returns a `DataFrame` with the following fields:

| Field             | Type   | Description |
|------------------ | ------ |------------ |
| **key**           | binary | The key of the record (not needed) |
| **value**         | binary | Our JSON payload. We'll need to cast it to STRING |
| **topic**         | string | The topic this record is received from (not needed) |
| **partition**     | int    | The Kafka topic partition from which this record is received (not needed). This server only has one partition. |
| **offset**        | long   | The position of this record in the corresponding Kafka topic partition (not needed) |
| **timestamp**     | timestamp | The timestamp of this record  |
| **timestampType** | int    | The timestamp type of a record (not needed) |

In the example below, the only column we want to keep is `value`.
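To see why `value` needs a cast, consider what one record looks like outside Spark. The sketch below uses a hypothetical, hand-written record shaped like a row of the schema above; it assumes the producer encoded the JSON as UTF-8, which is what casting the binary column to STRING relies on as well.

```python
import json

# Hypothetical record shaped like one row from the Kafka source
record = {
    "key": None,
    "value": b'{"wikipedia": "en", "isAnonymous": true, "delta": 5}',
    "topic": "en",
    "partition": 0,
    "offset": 42,
}

# The payload arrives as raw bytes; decode it before treating it as JSON text
payload_text = record["value"].decode("utf-8")   # plain-Python analogue of cast("STRING")
payload = json.loads(payload_text)               # now an ordinary dict
```

Only after decoding can the payload be parsed; the other fields describe where the record came from rather than what it contains.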

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The default of `spark.sql.shuffle.partitions` is 200.
This setting is used in operations like `groupBy`.
In this case, we should be setting this value to match the current number of cores.

In [11]:

from pyspark.sql.functions import col
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)

kafkaServer = "server1.databricks.training:9092"   # US (Oregon)
# kafkaServer = "server2.databricks.training:9092" # Singapore

editsDF = (spark.readStream                        # Get the DataStreamReader
  .format("kafka")                                 # Specify the source format as "kafka"
  .option("kafka.bootstrap.servers", kafkaServer)  # Configure the Kafka server name and port
  .option("subscribe", "en")                       # Subscribe to the "en" Kafka topic
  .option("startingOffsets", "earliest")           # Rewind stream to beginning when we restart notebook
  .option("maxOffsetsPerTrigger", 1000)            # Limit each micro-batch to at most 1000 offsets (throttles how fast Spark reads)
  .load()                                          # Load the DataFrame
  .select(col("value").cast("STRING"))             # Cast the "value" column to STRING
)
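The effect of combining `startingOffsets` set to `earliest` with `maxOffsetsPerTrigger` can be pictured with a small plain-Python model. This is a simplified sketch, assuming a single partition (as on this server) and a fixed backlog; the `plan_batches` function and the offset numbers are hypothetical, and the real source also accounts for multiple partitions and for records that arrive while the query runs.

```python
def plan_batches(earliest, latest, max_offsets_per_trigger):
    """Split the offset range [earliest, latest) into per-trigger ranges."""
    batches = []
    start = earliest
    while start < latest:
        # Each trigger reads at most max_offsets_per_trigger offsets
        end = min(start + max_offsets_per_trigger, latest)
        batches.append((start, end))
        start = end
    return batches


# A backlog of 3500 records, read from the beginning, 1000 offsets at a time
batches = plan_batches(0, 3500, 1000)
```

Without the cap, the first micro-batch would try to read the entire backlog at once.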



In [12]:
dir(editsDF)

<br><br><br>

Let's display some data.

In [15]:
myStreamName = "lesson04a_ps"
display(editsDF,  streamName = myStreamName)

value
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-29T06:47:32.387Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=947928855&oldid=947926779"",""isUnpatrolled"":false,""page"":""North Macedonia–NATO relations"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""Added better source for deposition process!"",""userURL"":""http://en.wikipedia.org/wiki/User:95.155.45.168"",""pageURL"":""http://en.wikipedia.org/wiki/North_Macedonia–NATO_relations"",""delta"":null,""flag"":"""",""isNewPage"":false,""isAnonymous"":true,""geocoding"":{""countryCode2"":""ME"",""city"":""Podgorica"",""latitude"":42.44110107421875,""country"":""Montenegro"",""longitude"":42.44110107421875,""stateProvince"":""16"",""countryCode3"":""MNE""},""user"":""95.155.45.168"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-29T06:47:32.323Z"",""url"":"""",""isUnpatrolled"":false,""page"":""Special:Log/abusefilter"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""95.155.45.168 triggered [[Special:AbuseFilter/61|filter 61]], performing the action \""edit\"" on [[\u000302North Macedonia–NATO relations\u000310]]. Actions taken: Tag ([[Special:AbuseLog/26352859|details]])"",""userURL"":""http://en.wikipedia.org/wiki/User:95.155.45.168"",""pageURL"":""http://en.wikipedia.org/wiki/Special:Log/abusefilter"",""delta"":null,""flag"":""hit"",""isNewPage"":false,""isAnonymous"":true,""geocoding"":{""countryCode2"":""ME"",""city"":""Podgorica"",""latitude"":42.44110107421875,""country"":""Montenegro"",""longitude"":42.44110107421875,""stateProvince"":""16"",""countryCode3"":""MNE""},""user"":""95.155.45.168"",""namespace"":""special""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-29T06:47:32.434Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=947928854&oldid=947928748"",""isUnpatrolled"":false,""page"":""2019–20 coronavirus pandemic"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""/* Signs and symptoms */ Added a carriage return to aid editing and a \""The\""."",""userURL"":""http://en.wikipedia.org/wiki/User:DocWatson42"",""pageURL"":""http://en.wikipedia.org/wiki/2019–20_coronavirus_pandemic"",""delta"":5,""flag"":""M"",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""DocWatson42"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-29T06:47:33.973Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=947928859&oldid=943151803"",""isUnpatrolled"":false,""page"":""Template:Director's Cut Awards for Best New Actor"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":"""",""userURL"":""http://en.wikipedia.org/wiki/User:St3095"",""pageURL"":""http://en.wikipedia.org/wiki/Template:Director's_Cut_Awards_for_Best_New_Actor"",""delta"":-42,""flag"":""M"",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""St3095"",""namespace"":""template""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-29T06:47:34.768Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=947928857&oldid=929861470"",""isUnpatrolled"":false,""page"":""Martin Ferdinand Quadal"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""clean up, [[WP:AWB/T|typo(s) fixed]]: 1787-9 → 1787–9 (3)"",""userURL"":""http://en.wikipedia.org/wiki/User:I dream of horses"",""pageURL"":""http://en.wikipedia.org/wiki/Martin_Ferdinand_Quadal"",""delta"":6,""flag"":""M"",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""I dream of horses"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-29T06:47:34.827Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=947928858&oldid=927998175"",""isUnpatrolled"":false,""page"":""Dún na Rí Forest Park"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""Adding more information"",""userURL"":""http://en.wikipedia.org/wiki/User:Cwmhiraeth"",""pageURL"":""http://en.wikipedia.org/wiki/Dún_na_Rí_Forest_Park"",""delta"":1133,""flag"":"""",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""Cwmhiraeth"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-29T06:47:36.778Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=947928860&oldid=946150947"",""isUnpatrolled"":false,""page"":""Mumbai Central–Rajkot Duronto Express"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""/* Train Details */"",""userURL"":""http://en.wikipedia.org/wiki/User:Pratik Bhatade"",""pageURL"":""http://en.wikipedia.org/wiki/Mumbai_Central–Rajkot_Duronto_Express"",""delta"":60,""flag"":"""",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""Pratik Bhatade"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-29T06:47:37.324Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=947928861&oldid=897556030"",""isUnpatrolled"":false,""page"":""Master of Arguis"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""/* References */clean up, [[WP:AWB/T|typo(s) fixed]]: 330919-8 → 330919–8 (3)"",""userURL"":""http://en.wikipedia.org/wiki/User:I dream of horses"",""pageURL"":""http://en.wikipedia.org/wiki/Master_of_Arguis"",""delta"":12,""flag"":""M"",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""I dream of horses"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-29T06:47:39.070Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=947928862&oldid=930111710"",""isUnpatrolled"":false,""page"":""Best in the World 2006"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""wasn't a ppv"",""userURL"":""http://en.wikipedia.org/wiki/User:104.184.182.119"",""pageURL"":""http://en.wikipedia.org/wiki/Best_in_the_World_2006"",""delta"":-17,""flag"":"""",""isNewPage"":false,""isAnonymous"":true,""geocoding"":{""countryCode2"":""US"",""city"":""Olive Branch"",""latitude"":34.917999267578125,""country"":""United States"",""longitude"":34.917999267578125,""stateProvince"":""MS"",""countryCode3"":""USA""},""user"":""104.184.182.119"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-03-29T06:47:39.129Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=947928863&oldid=947928778"",""isUnpatrolled"":false,""page"":""Paravai Muniyamma"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""C e"",""userURL"":""http://en.wikipedia.org/wiki/User:GorgeCustersSabre"",""pageURL"":""http://en.wikipedia.org/wiki/Paravai_Muniyamma"",""delta"":-8,""flag"":"""",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""GorgeCustersSabre"",""namespace"":""article""}"


Wait until stream is done initializing...

In [17]:
untilStreamIsReady(myStreamName)

Make sure to stop the stream before continuing.

In [19]:
stopAllStreams()

<br><br><br>

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Use Kafka to display the raw data</h2>

The Kafka server acts as a sort of "firehose" (or asynchronous buffer) that delivers the raw data.

Since raw data coming in from a stream is transient, we'd like to save it to a more permanent data structure.

The first step is to define the schema for the JSON payload.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Only those fields of future interest are commented below.

In [22]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, BooleanType
from pyspark.sql.functions import from_json, unix_timestamp

schema = StructType([
  StructField("channel", StringType(), True),
  StructField("comment", StringType(), True),
  StructField("delta", IntegerType(), True),
  StructField("flag", StringType(), True),
  StructField("geocoding", StructType([                 # (OBJECT): Added by the server, field contains IP address geocoding information for anonymous edit.
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
    StructField("countryCode2", StringType(), True),
    StructField("countryCode3", StringType(), True),
    StructField("stateProvince", StringType(), True),
    StructField("latitude", DoubleType(), True),
    StructField("longitude", DoubleType(), True),
  ]), True),
  StructField("isAnonymous", BooleanType(), True),      # (BOOLEAN): Whether or not the change was made by an anonymous user
  StructField("isNewPage", BooleanType(), True),
  StructField("isRobot", BooleanType(), True),
  StructField("isUnpatrolled", BooleanType(), True),
  StructField("namespace", StringType(), True),         # (STRING): Page's namespace. See https://en.wikipedia.org/wiki/Wikipedia:Namespace 
  StructField("page", StringType(), True),              # (STRING): Printable name of the page that was edited
  StructField("pageURL", StringType(), True),           # (STRING): URL of the page that was edited
  StructField("timestamp", StringType(), True),         # (STRING): Time the edit occurred, in ISO-8601 format
  StructField("url", StringType(), True),
  StructField("user", StringType(), True),              # (STRING): User who made the edit or the IP address associated with the anonymous editor
  StructField("userURL", StringType(), True),
  StructField("wikipediaURL", StringType(), True),
  StructField("wikipedia", StringType(), True),         # (STRING): Short name of the Wikipedia that was edited (e.g., "en" for the English)
])

Next we can use the function `from_json` to parse out the full message with the schema specified above.

In [24]:
from pyspark.sql.functions import col, from_json

jsonEdits = editsDF.select(
  from_json("value", schema).alias("json"))  # Parse the column "value" and name it "json"


In [25]:

display(jsonEdits)


json
"List(#en.wikipedia, Added better source for deposition process!, null, , List(Podgorica, Montenegro, ME, MNE, 16, 42.44110107421875, 42.44110107421875), true, false, false, false, article, North Macedonia–NATO relations, http://en.wikipedia.org/wiki/North_Macedonia–NATO_relations, 2020-03-29T06:47:32.387Z, https://en.wikipedia.org/w/index.php?diff=947928855&oldid=947926779, 95.155.45.168, http://en.wikipedia.org/wiki/User:95.155.45.168, http://en.wikipedia.org, en)"
"List(#en.wikipedia, 95.155.45.168 triggered [[Special:AbuseFilter/61|filter 61]], performing the action ""edit"" on [[02North Macedonia–NATO relations10]]. Actions taken: Tag ([[Special:AbuseLog/26352859|details]]), null, hit, List(Podgorica, Montenegro, ME, MNE, 16, 42.44110107421875, 42.44110107421875), true, false, false, false, special, Special:Log/abusefilter, http://en.wikipedia.org/wiki/Special:Log/abusefilter, 2020-03-29T06:47:32.323Z, , 95.155.45.168, http://en.wikipedia.org/wiki/User:95.155.45.168, http://en.wikipedia.org, en)"
"List(#en.wikipedia, /* Signs and symptoms */ Added a carriage return to aid editing and a ""The""., 5, M, List(null, null, null, null, null, null, null), false, false, false, false, article, 2019–20 coronavirus pandemic, http://en.wikipedia.org/wiki/2019–20_coronavirus_pandemic, 2020-03-29T06:47:32.434Z, https://en.wikipedia.org/w/index.php?diff=947928854&oldid=947928748, DocWatson42, http://en.wikipedia.org/wiki/User:DocWatson42, http://en.wikipedia.org, en)"
"List(#en.wikipedia, , -42, M, List(null, null, null, null, null, null, null), false, false, false, false, template, Template:Director's Cut Awards for Best New Actor, http://en.wikipedia.org/wiki/Template:Director's_Cut_Awards_for_Best_New_Actor, 2020-03-29T06:47:33.973Z, https://en.wikipedia.org/w/index.php?diff=947928859&oldid=943151803, St3095, http://en.wikipedia.org/wiki/User:St3095, http://en.wikipedia.org, en)"
"List(#en.wikipedia, clean up, [[WP:AWB/T|typo(s) fixed]]: 1787-9 → 1787–9 (3), 6, M, List(null, null, null, null, null, null, null), false, false, false, false, article, Martin Ferdinand Quadal, http://en.wikipedia.org/wiki/Martin_Ferdinand_Quadal, 2020-03-29T06:47:34.768Z, https://en.wikipedia.org/w/index.php?diff=947928857&oldid=929861470, I dream of horses, http://en.wikipedia.org/wiki/User:I dream of horses, http://en.wikipedia.org, en)"
"List(#en.wikipedia, Adding more information, 1133, , List(null, null, null, null, null, null, null), false, false, false, false, article, Dún na Rí Forest Park, http://en.wikipedia.org/wiki/Dún_na_Rí_Forest_Park, 2020-03-29T06:47:34.827Z, https://en.wikipedia.org/w/index.php?diff=947928858&oldid=927998175, Cwmhiraeth, http://en.wikipedia.org/wiki/User:Cwmhiraeth, http://en.wikipedia.org, en)"
"List(#en.wikipedia, /* Train Details */, 60, , List(null, null, null, null, null, null, null), false, false, false, false, article, Mumbai Central–Rajkot Duronto Express, http://en.wikipedia.org/wiki/Mumbai_Central–Rajkot_Duronto_Express, 2020-03-29T06:47:36.778Z, https://en.wikipedia.org/w/index.php?diff=947928860&oldid=946150947, Pratik Bhatade, http://en.wikipedia.org/wiki/User:Pratik Bhatade, http://en.wikipedia.org, en)"
"List(#en.wikipedia, /* References */clean up, [[WP:AWB/T|typo(s) fixed]]: 330919-8 → 330919–8 (3), 12, M, List(null, null, null, null, null, null, null), false, false, false, false, article, Master of Arguis, http://en.wikipedia.org/wiki/Master_of_Arguis, 2020-03-29T06:47:37.324Z, https://en.wikipedia.org/w/index.php?diff=947928861&oldid=897556030, I dream of horses, http://en.wikipedia.org/wiki/User:I dream of horses, http://en.wikipedia.org, en)"
"List(#en.wikipedia, wasn't a ppv, -17, , List(Olive Branch, United States, US, USA, MS, 34.917999267578125, 34.917999267578125), true, false, false, false, article, Best in the World 2006, http://en.wikipedia.org/wiki/Best_in_the_World_2006, 2020-03-29T06:47:39.070Z, https://en.wikipedia.org/w/index.php?diff=947928862&oldid=930111710, 104.184.182.119, http://en.wikipedia.org/wiki/User:104.184.182.119, http://en.wikipedia.org, en)"
"List(#en.wikipedia, C e, -8, , List(null, null, null, null, null, null, null), false, false, false, false, article, Paravai Muniyamma, http://en.wikipedia.org/wiki/Paravai_Muniyamma, 2020-03-29T06:47:39.129Z, https://en.wikipedia.org/w/index.php?diff=947928863&oldid=947928778, GorgeCustersSabre, http://en.wikipedia.org/wiki/User:GorgeCustersSabre, http://en.wikipedia.org, en)"


When parsing a value from JSON, we end up with a single column containing a complex object.

We can clearly see this by simply printing the schema.

In [27]:
jsonEdits.printSchema()

The fields of a complex object can be referenced with a "dot" notation as in:

`col("json.geocoding.countryCode3")` 
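For intuition, the same dotted lookup can be written against an ordinary nested dictionary. The `get_path` helper below is a hypothetical plain-Python analogue, not part of Spark; it simply walks one level per path segment and yields `None` when a segment is missing.

```python
from functools import reduce


def get_path(record, path):
    """Follow a dotted path through nested dicts; return None if a step is missing."""
    def step(node, name):
        return node.get(name) if isinstance(node, dict) else None
    return reduce(step, path.split("."), record)


# A row with a single "json" column holding a nested object
row = {
    "json": {
        "wikipedia": "en",
        "geocoding": {"countryCode3": "MNE", "city": "Podgorica"},
    }
}

code = get_path(row, "json.geocoding.countryCode3")
missing = get_path(row, "json.user.name")
```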
 

A large number of these fields/columns can become unwieldy.

For that reason, it is common to extract the sub-fields and represent them as first-level columns as seen below:

In [29]:
from pyspark.sql.functions import isnull, unix_timestamp

anonDF = (jsonEdits
  .select(col("json.wikipedia").alias("wikipedia"),      # Promoting from sub-field to column
          col("json.isAnonymous").alias("isAnonymous"),  #     "       "      "      "    "
          col("json.namespace").alias("namespace"),      #     "       "      "      "    "
          col("json.page").alias("page"),                #     "       "      "      "    "
          col("json.pageURL").alias("pageURL"),          #     "       "      "      "    "
          col("json.geocoding").alias("geocoding"),      #     "       "      "      "    "
          col("json.user").alias("user"),                #     "       "      "      "    "
          col("json.timestamp").cast("timestamp"))       # Promoting and converting to a timestamp
  .filter(col("namespace") == "article")                 # Limit result to just articles
  .filter(~isnull(col("geocoding.countryCode3")))        # We only want results that are geocoded
)
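The promote-then-filter logic above can be mirrored on plain dictionaries to check what survives each filter. This is a sketch only: `flatten_and_filter` is a hypothetical helper, and the three sample edits are trimmed-down versions of records shown earlier in this lesson.

```python
def flatten_and_filter(edits):
    """Promote selected sub-fields and keep only geocoded article edits."""
    rows = []
    for edit in edits:
        geocoding = edit.get("geocoding") or {}
        if edit.get("namespace") != "article":
            continue  # limit the result to articles
        if geocoding.get("countryCode3") is None:
            continue  # keep only edits that were geocoded
        rows.append({
            "wikipedia": edit.get("wikipedia"),
            "page": edit.get("page"),
            "user": edit.get("user"),
            "countryCode3": geocoding["countryCode3"],
        })
    return rows


sample_edits = [
    {"wikipedia": "en", "namespace": "article", "page": "Best in the World 2006",
     "user": "104.184.182.119", "geocoding": {"countryCode3": "USA"}},
    {"wikipedia": "en", "namespace": "special", "page": "Special:Log/abusefilter",
     "user": "95.155.45.168", "geocoding": {"countryCode3": "MNE"}},
    {"wikipedia": "en", "namespace": "article", "page": "Paravai Muniyamma",
     "user": "GorgeCustersSabre", "geocoding": {"countryCode3": None}},
]

rows = flatten_and_filter(sample_edits)
```

Only the first edit passes both filters: the second is not in the article namespace, and the third has no geocoding because the editor was logged in.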


In [30]:
display(anonDF)

wikipedia,isAnonymous,namespace,page,pageURL,geocoding,user,timestamp
en,True,article,North Macedonia–NATO relations,http://en.wikipedia.org/wiki/North_Macedonia–NATO_relations,"List(Podgorica, Montenegro, ME, MNE, 16, 42.44110107421875, 42.44110107421875)",95.155.45.168,2020-03-29T06:47:32.387+0000
en,True,article,Best in the World 2006,http://en.wikipedia.org/wiki/Best_in_the_World_2006,"List(Olive Branch, United States, US, USA, MS, 34.917999267578125, 34.917999267578125)",104.184.182.119,2020-03-29T06:47:39.070+0000
en,True,article,Sulli,http://en.wikipedia.org/wiki/Sulli,"List(null, Philippines, PH, PHL, null, 14.58329963684082, 14.58329963684082)",112.200.34.96,2020-03-29T06:47:47.086+0000
en,True,article,Conservatism,http://en.wikipedia.org/wiki/Conservatism,"List(null, United Kingdom, GB, GBR, null, 54.75843811035156, 54.75843811035156)",2A02:C7D:5D63:1300:CDE2:6A13:D94A:9236,2020-03-29T06:47:49.182+0000
en,True,article,Flow (psychology),http://en.wikipedia.org/wiki/Flow_(psychology),"List(null, United States, US, USA, null, 39.7599983215332, 39.7599983215332)",2601:281:C400:E9F6:981:19A3:8D21:9779,2020-03-29T06:47:50.733+0000
en,True,article,Troy Reeder,http://en.wikipedia.org/wiki/Troy_Reeder,"List(Los Angeles, United States, US, USA, CA, 34.00360107421875, 34.00360107421875)",2600:1012:B023:3A7:DF3:E4D0:3A8F:8B6D,2020-03-29T06:47:58.847+0000
en,True,article,Hira Mani,http://en.wikipedia.org/wiki/Hira_Mani,"List(Hāsal, Pakistan, PK, PAK, PB, 33.14789962768555, 33.14789962768555)",39.57.115.19,2020-03-29T06:48:01.439+0000
en,True,article,List of JoJo's Bizarre Adventure characters,http://en.wikipedia.org/wiki/List_of_JoJo's_Bizarre_Adventure_characters,"List(Petaling Jaya, Malaysia, MY, MYS, 10, 3.076200008392334, 3.076200008392334)",2001:E68:5420:EF2E:DC85:EC51:3A3:F58E,2020-03-29T06:48:08.884+0000
en,True,article,Alternative rock,http://en.wikipedia.org/wiki/Alternative_rock,"List(null, United Kingdom, GB, GBR, null, 54.75843811035156, 54.75843811035156)",2A02:C7F:C651:8A00:E8A6:FA26:9FF:48A5,2020-03-29T06:48:18.645+0000
en,True,article,1992 Macedonian Albanian autonomy referendum,http://en.wikipedia.org/wiki/1992_Macedonian_Albanian_autonomy_referendum,"List(Warrington, United Kingdom, GB, GBR, WRT, 53.429500579833984, 53.429500579833984)",82.8.27.130,2020-03-29T06:48:25.324+0000


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Mapping Anonymous Editors' Locations</h2>

When you run the query, the default display is a [live] HTML table.

The geocoded information allows us to associate an anonymous edit with a country.

We can then use that geocoded information to plot edits on a [live] world map.

In order to create a slick world map visualization of the data, you'll need to click on the item below.

Under <b>Plot Options</b>, use the following:
* <b>Keys:</b> `countryCode3`
* <b>Values:</b> `count`

In <b>Display type</b>, use <b>World map</b> and click <b>Apply</b>.

<img src="https://files.training.databricks.com/images/eLearning/Structured-Streaming/plot-options-map-04.png"/>

By invoking a `display` action on a DataFrame created from a `readStream` transformation, we can generate a LIVE visualization!

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Keep an eye on the plot for a minute or two and watch the colors change.

In [32]:
mappedDF = (anonDF
  .groupBy("geocoding.countryCode3") # Aggregate by country (code)
  .count()                           # Produce a count of each aggregate
)
display(mappedDF, streamName = myStreamName)

countryCode3,count
KAZ,1
BGR,11
ITA,30
IND,194
MKD,20
THA,16
CAN,73
AZE,3
ARE,40
UKR,7
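The counts above keep changing because the aggregation is a running total that is updated as each micro-batch arrives. A minimal plain-Python sketch of that behavior follows; the micro-batches of country codes are hypothetical sample data, and `Counter` merely stands in for the state that the streaming aggregation maintains.

```python
from collections import Counter

# Hypothetical micro-batches of country codes extracted from incoming edits
micro_batches = [
    ["IND", "CAN", "IND"],
    ["ITA", "IND"],
    ["CAN"],
]

running = Counter()   # aggregation state carried across micro-batches
snapshots = []        # what a live table would show after each batch

for batch in micro_batches:
    running.update(batch)            # fold the new records into the totals
    snapshots.append(dict(running))  # copy the current totals
```

Each snapshot corresponds to one refresh of the live table or map: existing countries accumulate higher counts and new countries appear as their first edit arrives.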


Wait until stream is done initializing...

In [34]:
untilStreamIsReady(myStreamName)

Stop the streams.

In [36]:
stopAllStreams()

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Review Questions</h2>

**Q:** What `format` should you use with Kafka?<br>
**A:** `format("kafka")`

**Q:** How do you specify a Kafka server?<br>
**A:** `.option("kafka.bootstrap.servers", "server1.databricks.training:9092")`

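**Q:** What verb should you use in conjunction with `readStream` and Kafka to create the streaming DataFrame?<br>
**A:** `load()`, with no parameters, since we are pulling from a Kafka server rather than from a path. The stream itself only starts running once the DataFrame is displayed or written out.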

**Q:** What fields are returned in a Kafka DataFrame?<br>
**A:** Reading from Kafka returns a DataFrame with the following fields:
key, value, topic, partition, offset, timestamp, timestampType

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [39]:
%run "./Includes/Classroom-Cleanup"

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Next Steps</h2>

Start the next lab, [Using Kafka Lab]($./Labs/SS 04a - Using Kafka Lab).

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Additional Topics &amp; Resources</h2>

* <a href="http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-stream#" target="_blank">Create a Kafka Source Stream</a>
* <a href="https://kafka.apache.org/documentation/" target="_blank">Official Kafka Documentation</a>
* <a href="https://www.confluent.io/blog/okay-store-data-apache-kafka/" target="_blank">Use Kafka to store data</a>

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>