# Retrieve Chicago traffic data from the City of Chicago

This notebook retrieves Chicago traffic data, which is provided by the City of Chicago. The data is stored in Kafka and then used to demonstrate writing the data into Cosmos DB.

The data set used by this notebook is from Chicago Traffic Tracker - Congestion by Segments (https://data.cityofchicago.org/resource/8v9j-bter.json).

## Requirements

* An Azure Virtual Network
* A Spark (2.2.0) on HDInsight 3.6 cluster, inside the virtual network
* A Kafka on HDInsight 3.6 cluster, inside the virtual network

## Load packages
Run the next cell to load packages used by this notebook:

* spark-streaming-kafka-0-8_2.10, version 2.2.0 - Used to write data to Kafka.
* gson version 2.4 - Used for JSON parsing.

In [5]:
%%configure -f
{
    "conf": {
        "spark.jars.packages": "org.apache.spark:spark-streaming-kafka-0-8_2.10:2.2.0,com.google.code.gson:gson:2.4",
        "spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.11"
    }
}

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
7,application_1543170503042_0008,spark,idle,Link,Link,✔


SparkSession available as 'spark'.


## Create the Kafka topic
Provide the Zookeeper host information for the Kafka cluster. To get this information, use the following steps:
* From __Azure PowerShell__:

    ```powershell
$creds = Get-Credential -UserName "admin" -Message "Enter the HDInsight login"
$clusterName = Read-Host -Prompt "Enter the Kafka cluster name"
$resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.net/api/v1/clusters/$clusterName/services/ZOOKEEPER/components/ZOOKEEPER_SERVER" `
    -Credential $creds `
    -UseBasicParsing
$respObj = ConvertFrom-Json $resp.Content
$zkHosts = $respObj.host_components.HostRoles.host_name[0..1]
($zkHosts -join ":2181,") + ":2181"
    ```

The command returns a value similar to the following:

'zk0-kafka.yr21tghw1lbedojd2c1yzdkfqb.dx.internal.cloudapp.net:2181,zk1-kafka.yr21tghw1lbedojd2c1yzdkfqb.dx.internal.cloudapp.net:2181'

Replace the Zookeeper host string in the next cell with the value returned for your cluster, and then run the cell.

In [None]:
%%bash 
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --replication-factor 3 --partitions 8 --topic trafficdata --zookeeper 'zk0-kafka.yr21tghw1lbedojd2c1yzdkfqb.dx.internal.cloudapp.net:2181,zk1-kafka.yr21tghw1lbedojd2c1yzdkfqb.dx.internal.cloudapp.net:2181'

## Retrieve data from the City of Chicago

Run the next cell to load traffic congestion data from the City of Chicago website. The site refreshes the data every 15 minutes, so call `loaddata()` again after that interval to pick up new readings.

In [6]:
// Load the data from the City of Chicago API for traffic congestion data
import com.google.gson.Gson

// Defined at the top level so that later cells can reuse it
val gson = new Gson()

def loaddata(): Array[Object] = {
    val url = "https://data.cityofchicago.org/resource/8v9j-bter.json"
    val source = scala.io.Source.fromURL(url)
    // Read the whole response, then release the connection
    val result = try source.mkString finally source.close()

    // Since the REST API returns an array of items,
    // it's easier to use as an array than deal with streaming
    val rows = gson.fromJson(result, classOf[Array[Object]])
    println("Retrieved " + rows.length + " rows of Traffic data.")
    rows
}

// Keep the rows at the top level; the cell that writes to Kafka reads them
val jsonDataArray = loaddata()

Retrieved 1000 rows of Traffic data.
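
Because the source data changes every 15 minutes, one option is to poll on a background thread so that the notebook stays usable between calls. The following is a minimal sketch using the JDK's scheduled executor; `pollEvery` is a hypothetical helper that is not part of this notebook, and the task passed to it stands in for whatever load function you use.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

// Hypothetical helper: runs `task` immediately and then once per interval
// on a background thread. Returns the scheduler so the caller can stop it.
def pollEvery(intervalMs: Long)(task: () => Unit): ScheduledExecutorService = {
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  val runnable = new Runnable {
    // Catch exceptions here: an uncaught exception would cancel all later runs
    override def run(): Unit =
      try task() catch {
        case e: Exception => println("Poll failed: " + e.getMessage)
      }
  }
  scheduler.scheduleAtFixedRate(runnable, 0L, intervalMs, TimeUnit.MILLISECONDS)
  scheduler
}
```

In this notebook, a call such as `pollEvery(900000L)(() => loaddata())` could refresh the data every 15 minutes. Call `shutdown()` on the returned scheduler to stop polling.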

## Set the Kafka broker hosts information

Set the Kafka broker hosts for the Kafka cluster. This value is used to write data to the Kafka cluster. To get the broker host information, use the following steps:

* From __Azure PowerShell__:

    ```powershell
$creds = Get-Credential -UserName "admin" -Message "Enter the HDInsight login"
$clusterName = Read-Host -Prompt "Enter the Kafka cluster name"
$resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.net/api/v1/clusters/$clusterName/services/KAFKA/components/KAFKA_BROKER" `
  -Credential $creds `
  -UseBasicParsing
$respObj = ConvertFrom-Json $resp.Content
$brokerHosts = $respObj.host_components.HostRoles.host_name[0..1]
($brokerHosts -join ":9092,") + ":9092"
    ```
    
The command returns a value similar to the following:

"wn0-kafka.yr21tghw1lbedojd2c1yzdkfqb.dx.internal.cloudapp.net:9092,wn1-kafka.yr21tghw1lbedojd2c1yzdkfqb.dx.internal.cloudapp.net:9092"

In [7]:
// The Kafka broker hosts and topic used to write to Kafka
val kafkaBrokers="wn0-kafka.yr21tghw1lbedojd2c1yzdkfqb.dx.internal.cloudapp.net:9092,wn1-kafka.yr21tghw1lbedojd2c1yzdkfqb.dx.internal.cloudapp.net:9092"
val kafkaTopic="trafficdata"

kafkaTopic: String = trafficdata
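
The broker string is a comma-separated list of `host:port` pairs, which is what the PowerShell `-join` expression produces. The same construction can be sketched in Scala; `hostList` is a hypothetical helper and the host names below are placeholders, not real cluster hosts. The Zookeeper string follows the same pattern with port 2181.

```scala
// Hypothetical helper: append the port to each host and join with commas
def hostList(hosts: Seq[String], port: Int): String =
  hosts.map(h => s"$h:$port").mkString(",")

// Placeholder host names, for illustration only
val exampleBrokers = hostList(
  Seq("wn0-kafka.example.internal", "wn1-kafka.example.internal"), 9092)
println(exampleBrokers)
```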

## Send the data to Kafka

Begin streaming data to Kafka. There is a delay of 1 second (1000 ms) after each send, so the cell stays active for several minutes (about 17 minutes for the 1,000 rows retrieved earlier).

In [8]:
// Import classes used to write to Kafka via a producer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import java.util.HashMap

// Create the Kafka producer
val producerProperties = new HashMap[String, Object]()
producerProperties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaBrokers)
producerProperties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                           "org.apache.kafka.common.serialization.StringSerializer")
producerProperties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                           "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](producerProperties)

// Iterate over data and emit to Kafka
jsonDataArray.foreach { row =>
                // Get the row as a JSON string
                val jsonData = gson.toJson(row)
                // Create the message for Kafka
                val message = new ProducerRecord[String, String](kafkaTopic, null, jsonData)
                // Send the message
                producer.send(message)
                // Sleep a bit between sends to simulate streaming data
                Thread.sleep(1000)
             }
producer.close()
println("Finished writing to Kafka")

Finished writing to Kafka
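
If a send throws partway through the loop, `producer.close()` is never reached. One way to guard against that is a small close-on-exit wrapper. The sketch below is illustrative only: `withResource` and `DemoResource` are hypothetical names, and it assumes the resource implements `java.io.Closeable`, as the producer interface does in recent Kafka client versions.

```scala
import java.io.Closeable

// Hypothetical helper (not part of the notebook): run `body` with the
// resource and close it afterwards, even if `body` throws.
def withResource[R <: Closeable, A](resource: R)(body: R => A): A =
  try body(resource) finally resource.close()

// Stand-in resource for illustration; in the notebook this role
// would be played by the Kafka producer
class DemoResource extends Closeable {
  var closed = false
  def send(message: String): Unit = println("sent " + message)
  override def close(): Unit = { closed = true }
}

val demo = new DemoResource
withResource(demo) { r =>
  Seq("row-1", "row-2").foreach(r.send)
}
println("closed: " + demo.closed)
```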

## Load the data into Cosmos DB and stream it