<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="../../resources/logo.png" alt="Intellinum Bootcamp" style="width: 400px; height: 200px">
</div>

# Project: Exploratory Data Analysis
Perform exploratory data analysis (EDA) to gain insights from a data lake.


## Instructions

In `s3://data.intellinum.co/bootcamp/common/crime-data-2016`, there are a number of Parquet files containing 2016 crime data from seven United States cities:

* New York
* Los Angeles
* Chicago
* Philadelphia
* Dallas
* Boston
* New Orleans


The data has been cleaned up a little but has not been normalized. Each city reports crime data slightly differently, so
examine the data for each city to determine how to query it properly.

Your job is to use some of this data to gain insights about certain kinds of crimes.

In [None]:
#MODE = "LOCAL"
MODE = "CLUSTER"

import os
import sys
from pyspark import SparkConf
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.storagelevel import StorageLevel
from matplotlib import interactive
interactive(True)
import matplotlib.pyplot as plt
%matplotlib inline
import json
import math
import numbers
import numpy as np
import plotly
plotly.offline.init_notebook_mode(connected=True)

sys.path.insert(0,'../../src')
from settings import *

# Download the packaged Python environment only if it is not already present locally.
if not os.path.exists('../../libs/pyspark24_py36.zip'):
    !AWS_ACCESS_KEY_ID={AWS_ACCESS_KEY} AWS_SECRET_ACCESS_KEY={AWS_SECRET_KEY} aws s3 cp s3://yuan.intellinum.co/bins/pyspark24_py36.zip ../../libs/pyspark24_py36.zip

try:
    spark.stop()
    print("Stopped a SparkSession")
except Exception as e:
    print("No existing SparkSession detected")
    print("Creating a new SparkSession")

SPARK_DRIVER_MEMORY= "1G"
SPARK_DRIVER_CORE = "1"
SPARK_EXECUTOR_MEMORY= "1G"
SPARK_EXECUTOR_CORE = "1"
SPARK_EXECUTOR_INSTANCES = 6



conf = None
if MODE == "LOCAL":
    os.environ["PYSPARK_PYTHON"] = "/home/yuan/anaconda3/envs/pyspark24_py36/bin/python"
    conf = SparkConf().\
            setAppName("pyspark_day03_querying_json").\
            setMaster('local[*]').\
            set('spark.driver.maxResultSize', '0').\
            set('spark.jars', '../../libs/mysql-connector-java-5.1.45-bin.jar').\
            set('spark.jars.packages','net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1')
else:
    os.environ["PYSPARK_PYTHON"] = "./MN/pyspark24_py36/bin/python"
    conf = SparkConf().\
            setAppName("pyspark_day03_querying_json").\
            setMaster('yarn-client').\
            set('spark.executor.cores', SPARK_EXECUTOR_CORE).\
            set('spark.executor.memory', SPARK_EXECUTOR_MEMORY).\
            set('spark.driver.cores', SPARK_DRIVER_CORE).\
            set('spark.driver.memory', SPARK_DRIVER_MEMORY).\
            set("spark.executor.instances", SPARK_EXECUTOR_INSTANCES).\
            set('spark.sql.files.ignoreCorruptFiles', 'true').\
            set('spark.yarn.dist.archives', '../../libs/pyspark24_py36.zip#MN').\
            set('spark.sql.shuffle.partitions', '5000').\
            set('spark.default.parallelism', '5000').\
            set('spark.driver.maxResultSize', '0').\
            set('spark.jars.packages','net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1').\
            set('spark.jars', 's3://yuan.intellinum.co/bins/mysql-connector-java-5.1.45-bin.jar')
        

spark = SparkSession.builder.\
    config(conf=conf).\
    getOrCreate()


sc = spark.sparkContext

sc.addPyFile('../../src/settings.py')

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", AWS_ACCESS_KEY)
hadoop_conf.set("fs.s3a.secret.key", AWS_SECRET_KEY)
hadoop_conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

def display(df, limit=10):
    return df.limit(limit).toPandas()

def dfTest(id, expected, result):
    assert str(expected) == str(result), "{} does not equal expected {}".format(result, expected)

No existing SparkSession detected
Creating a new SparkSession
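
Note that `dfTest` compares the string forms of its two values, and its `id` argument is only a label. The short sketch below restates the helper so it runs on its own and shows what that means in practice; the test ids used here are made up for illustration.

```python
def dfTest(id, expected, result):
    # Same helper as above: compares str(expected) with str(result).
    assert str(expected) == str(result), "{} does not equal expected {}".format(result, expected)

# Values of different types pass as long as they print identically.
dfTest("example-int-vs-str", 217945, "217945")

# A mismatch raises AssertionError with both values in the message.
try:
    dfTest("example-mismatch", 1, 2)
    mismatch_message = None
except AssertionError as err:
    mismatch_message = str(err)

print(mismatch_message)
```

Because the comparison is on strings, a list of `Row` objects must match the expected list in both content and order for a test to pass.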


## Step 1

Start by creating DataFrames for Los Angeles, Philadelphia, and Dallas data.

Use `spark.read.parquet` to create a named DataFrame for each file.

For example, to read the New York file, use `crimeDataNewYorkDF = spark.read.parquet("s3a://data.intellinum.co/bootcamp/common/crime-data-2016/Crime-Data-New-York-2016.parquet")`.


Use the following DataFrame names:

| City          | DataFrame Name            | Path to S3 file
| ------------- | ------------------------- | -----------------
| Los Angeles   | `crimeDataLosAngelesDF`   | `s3://data.intellinum.co/bootcamp/common/crime-data-2016/Crime-Data-Los-Angeles-2016.parquet`
| Philadelphia  | `crimeDataPhiladelphiaDF` | `s3://data.intellinum.co/bootcamp/common/crime-data-2016/Crime-Data-Philadelphia-2016.parquet`
| Dallas        | `crimeDataDallasDF`       | `s3://data.intellinum.co/bootcamp/common/crime-data-2016/Crime-Data-Dallas-2016.parquet`

In [None]:
!AWS_ACCESS_KEY_ID={AWS_ACCESS_KEY} AWS_SECRET_ACCESS_KEY={AWS_SECRET_KEY} aws s3 ls s3://data.intellinum.co/bootcamp/common/crime-data-2016/

#### Los Angeles

In [None]:
# TODO

In [None]:
# ANSWER
crimeDataLosAngelesDF = spark.read.parquet("s3a://data.intellinum.co/bootcamp/common/crime-data-2016/Crime-Data-Los-Angeles-2016.parquet")

In [None]:
crimeDataLosAngelesDF.printSchema()

In [None]:
display(crimeDataLosAngelesDF)

In [None]:
# TEST - Run this cell to test your solution.

rowsLosAngeles  = crimeDataLosAngelesDF.count()
dfTest("DF-L7-crimeDataLA-count", 217945, rowsLosAngeles)

print("Tests passed!")

#### Philadelphia

In [None]:
# TODO

In [None]:
# ANSWER

crimeDataPhiladelphiaDF = spark.read.parquet("s3a://data.intellinum.co/bootcamp/common/crime-data-2016/Crime-Data-Philadelphia-2016.parquet")

In [None]:
# TEST - Run this cell to test your solution.

rowsPhiladelphia  = crimeDataPhiladelphiaDF.count()
dfTest("DF-L7-crimeDataPA-count", 168664, rowsPhiladelphia)

print("Tests passed!")

#### Dallas

In [None]:
# TODO

In [None]:
# ANSWER

crimeDataDallasDF = spark.read.parquet("s3a://data.intellinum.co/bootcamp/common/crime-data-2016/Crime-Data-Dallas-2016.parquet")
crimeDataDallasDF.printSchema()

In [None]:
# TEST - Run this cell to test your solution.

rowsDallas  = crimeDataDallasDF.count()
dfTest("DF-L7-crimeDataDAL-count", 99642, rowsDallas)

print("Tests passed!")

## Step 2

For each table, examine the data to figure out how to extract _robbery_ statistics.

Each city uses different values to indicate robbery; commonly used terms include "larceny", "burglary", and "robbery". These challenges are common in data lakes. To simplify things, restrict yourself to only the word "robbery" (and not attempted robbery, larceny, or burglary).

Explore the data for the three cities until you understand how each city records robbery information. If you don't want to worry about upper- or lower-case,
remember to use the `lower()` function from `pyspark.sql.functions` to convert column values to lowercase.

Create a DataFrame containing only the robbery-related rows, as shown in the table below.

**Hint:** For each table, focus your efforts on the column listed below.

| DataFrame Name            | Robbery DataFrame Name  | Column
| ------------------------- | ----------------------- | -------------------------------
| `crimeDataLosAngelesDF`   | `robberyLosAngelesDF`   | `crimeCodeDescription`
| `crimeDataPhiladelphiaDF` | `robberyPhiladelphiaDF` | `ucr_general_description`
| `crimeDataDallasDF`       | `robberyDallasDF`       | `typeOfIncident`

#### Los Angeles

In [None]:
# TODO

In [None]:
# ANSWER
from pyspark.sql.functions import col, lower
robberyLosAngelesDF = crimeDataLosAngelesDF.filter(lower(col("crimeCodeDescription")) == "robbery")

In [None]:
# TEST - Run this cell to test your solution.

totalLosAngeles  = robberyLosAngelesDF.count()
dfTest("DF-L7-robberyDataLA-count", 9048, totalLosAngeles)

print("Tests passed!")

#### Philadelphia

In [None]:
# TODO

In [None]:
# ANSWER
robberyPhiladelphiaDF  = crimeDataPhiladelphiaDF.filter(lower(col("ucr_general_description")) == "robbery")

In [None]:
# TEST - Run this cell to test your solution.

totalPhiladelphia  = robberyPhiladelphiaDF.count()
dfTest("DF-L7-robberyDataPA-count", 6149, totalPhiladelphia)

print("Tests passed!")

#### Dallas

In [None]:
# TODO

In [None]:
# ANSWER
robberyDallasDF  = crimeDataDallasDF.filter(lower(col("typeOfIncident")).startswith("robbery"))

In [None]:
# TEST - Run this cell to test your solution.

totalDallas = robberyDallasDF.count()
dfTest("DF-L7-robberyDataDAL-count", 6824, totalDallas)

print("Tests passed!")

## Step 3

Now that you have DataFrames of only the robberies in each city, create DataFrames for each city summarizing the number of robberies in each month.

Your DataFrames must contain two columns:
* `month`: The month number (e.g., 1 for January, 2 for February, etc.).
* `robberies`: The total number of robberies in the month.

Use the following DataFrame names and date columns:


| City          | DataFrame Name                   | Date Column
| ------------- | -------------------------------- | --------------------
| Los Angeles   | `robberiesByMonthLosAngelesDF`   | `timeOccurred`
| Philadelphia  | `robberiesByMonthPhiladelphiaDF` | `dispatch_date_time`
| Dallas        | `robberiesByMonthDallasDF`       | `startingDateTime`

For each city, figure out which column contains the date of the incident. Then, extract the month from that date.

#### Los Angeles

In [None]:
# ANSWER
from pyspark.sql.functions import month, col, count
robberiesByMonthLosAngelesDF = (robberyLosAngelesDF
                                .select(month(robberyLosAngelesDF["timeOccurred"]).alias("month"))
                                .groupBy("month")
                                .agg(count("month").alias("robberies"))
                                .orderBy("month"))

In [None]:
# TEST - Run this cell to test your solution.
from pyspark.sql import Row
la = list(robberiesByMonthLosAngelesDF.collect())

dfTest("DF-L7-robberyByMonthLA-counts", [Row(month=1, robberies=719), Row(month=2, robberies=675), Row(month=3, robberies=709), Row(month=4, robberies=713), Row(month=5, robberies=790), Row(month=6, robberies=698), Row(month=7, robberies=826), Row(month=8, robberies=765), Row(month=9, robberies=722), Row(month=10, robberies=814), Row(month=11, robberies=764), Row(month=12, robberies=853)], la)

print("Tests passed!")

#### Philadelphia

In [None]:
# ANSWER
robberiesByMonthPhiladelphiaDF = (robberyPhiladelphiaDF
                                 .select(month(robberyPhiladelphiaDF["dispatch_date_time"]).alias("month"))
                                 .groupBy("month")
                                 .agg(count("month").alias("robberies"))
                                 .orderBy("month"))

robberiesByMonthPhiladelphiaDF.printSchema()

In [None]:
# TEST - Run this cell to test your solution.
# convert to list so that we get deep compare (Array would be a shallow compare)
philadelphia  = list(robberiesByMonthPhiladelphiaDF.collect())

dfTest("DF-L7-robberyByMonthPA-counts", [Row(month=1, robberies=520), Row(month=2, robberies=416), Row(month=3, robberies=432), Row(month=4, robberies=466), Row(month=5, robberies=533), Row(month=6, robberies=509), Row(month=7, robberies=537), Row(month=8, robberies=561), Row(month=9, robberies=514), Row(month=10, robberies=572), Row(month=11, robberies=545), Row(month=12, robberies=544)], philadelphia )

print("Tests passed!")

#### Dallas

In [None]:
# TODO

In [None]:
# ANSWER
robberiesByMonthDallasDF = (robberyDallasDF 
                              .select(month(robberyDallasDF["startingDateTime"]).alias("month")) 
                              .groupBy("month") 
                              .agg(count("month").alias("robberies"))
                              .orderBy("month"))
robberiesByMonthDallasDF.printSchema()

In [None]:
# TEST - Run this cell to test your solution.

dallas  = list(robberiesByMonthDallasDF.collect())
dfTest("DF-L7-robberyByMonthDAL-counts", [Row(month=1, robberies=743), Row(month=2, robberies=435), Row(month=3, robberies=412), Row(month=4, robberies=594), Row(month=5, robberies=615), Row(month=6, robberies=495), Row(month=7, robberies=535), Row(month=8, robberies=627), Row(month=9, robberies=512), Row(month=10, robberies=603), Row(month=11, robberies=589), Row(month=12, robberies=664)], dallas)

print("Tests passed!")

## Step 4

Plot the robberies per month for each of the three cities, producing a plot similar to the following:

<img src="../../resources/robberies-by-month.png" style="max-width: 700px; border: 1px solid #aaaaaa; border-radius: 10px 10px 10px 10px"/>

**Hint:** You may want to use `matplotlib` or `plotly`. If you have your own way of plotting data, feel free to show off here. : )

In [None]:
import plotly.graph_objs as go

def plot_cityRobberies(dataframes, cities):
    data = []
    for dataframe, city in zip(dataframes, cities):
        # Convert to pandas once per DataFrame so the Spark job is not run twice.
        pdf = dataframe.toPandas()
        trace = go.Bar(
                    x = pdf.month,
                    y = pdf.robberies,
                    name = city,
                )
        data.append(trace)

    if len(cities) < 2:
        title = cities[0]
    else:
        title = "Cities"
    layout = go.Layout(
                title="Robberies/month in "+str(title),
                xaxis={
                    "title" : "Month",
                    "tickfont" : {
                        "size" : 14,
                    }
                },
                yaxis={
                    "title" : "Robberies (Count)",
                    "tickfont" : {
                        "size" : 10,
                    }
                },
                legend = {
                    "x" : 1,
                    "y" : 1
                },
                barmode="group",
                bargap=0.15,
                bargroupgap=0.1
            )

    fig = go.Figure(data=data, layout=layout)

    plotly.offline.iplot(fig)

#### Los Angeles

In [None]:
plot_cityRobberies([robberiesByMonthLosAngelesDF], ["LA"])

#### Philadelphia

In [None]:
# TODO
plot_cityRobberies([robberiesByMonthPhiladelphiaDF], ["PA"])

#### Dallas

In [None]:
# TODO
plot_cityRobberies([robberiesByMonthDallasDF], ["Dallas"])

#### All three Cities side by side

In [None]:
# The order of the DataFrames must match the order of the names in cityList.
cityDataframesList = [robberiesByMonthLosAngelesDF, robberiesByMonthPhiladelphiaDF, robberiesByMonthDallasDF]
cityList = ["Los Angeles", "Philadelphia", "Dallas"]
plot_cityRobberies(cityDataframesList, cityList)

## Step 5

Create another DataFrame called `combinedRobberiesByMonthDF` that combines all three robberies-per-month DataFrames into one.
When creating this DataFrame, add a new column called `city` that identifies the city associated with each row.
The final DataFrame will have the following columns:

* `city`: The name of the city associated with the row. (Use the strings "Los Angeles", "Philadelphia", and "Dallas".)
* `month`: The month number associated with the row.
* `robberies`: The number of robberies in that month (for that city).

**Hint:** You may want to apply the `union()` method in this example to combine the three datasets.

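**Hint:** It's easy to add new columns to DataFrames. For example, to add a new column called `newColumn` to `originalDF`, call the `withColumn()` method with the column name and a column expression (here a constant built with `lit` from `pyspark.sql.functions`), as follows:

```originalDF.withColumn("newColumn", lit("some value"))``` 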

In [None]:
# TODO

In [None]:
# ANSWER

from pyspark.sql.functions import lit, desc, asc

combinedRobberiesByMonthDF = (robberiesByMonthLosAngelesDF.withColumn("city", lit("Los Angeles")).select("*")
                            .union(robberiesByMonthPhiladelphiaDF.withColumn("city", lit("Philadelphia")).select("*"))
                            .union(robberiesByMonthDallasDF.withColumn("city", lit("Dallas")).select("*"))
                            )

In [None]:
combinedRobberiesByMonthDF.printSchema()

In [None]:
# display(combinedRobberiesByMonthDF, 36)

In [None]:
# TEST - Run this cell to test your solution.

results = [ (r.city, r.month, r.robberies) for r in combinedRobberiesByMonthDF.collect() ]
expectedResults =  [ 
(u'Los Angeles', 1, 719), 
(u'Los Angeles', 2, 675), 
(u'Los Angeles', 3, 709), 
(u'Los Angeles', 4, 713),   
(u'Los Angeles', 5, 790), 
(u'Los Angeles', 6, 698), 
(u'Los Angeles', 7, 826), 
(u'Los Angeles', 8, 765), 
(u'Los Angeles', 9, 722), 
(u'Los Angeles', 10, 814), 
(u'Los Angeles', 11, 764), 
(u'Los Angeles', 12, 853), 
(u'Philadelphia', 1, 520), 
(u'Philadelphia', 2, 416), 
(u'Philadelphia', 3, 432), 
(u'Philadelphia', 4, 466),
(u'Philadelphia', 5, 533), 
(u'Philadelphia', 6, 509), 
(u'Philadelphia', 7, 537), 
(u'Philadelphia', 8, 561), 
(u'Philadelphia', 9, 514), 
(u'Philadelphia', 10, 572), 
(u'Philadelphia', 11, 545), 
(u'Philadelphia', 12, 544), 
(u'Dallas', 1, 743),
(u'Dallas', 2, 435), 
(u'Dallas', 3, 412), 
(u'Dallas', 4, 594), 
(u'Dallas', 5, 615), 
(u'Dallas', 6, 495), 
(u'Dallas', 7, 535), 
(u'Dallas', 8, 627), 
(u'Dallas', 9, 512), 
(u'Dallas', 10, 603), 
(u'Dallas', 11, 589), 
(u'Dallas', 12, 664)] 

dfTest("DF-L7-combinedRobberiesByMonth-counts", expectedResults, results)

print("Tests passed!")

## Step 6

Graph the contents of `combinedRobberiesByMonthDF`, producing a graph similar to the following. (The diagram below deliberately
uses different data.)

<img src="../../resources/combined-homicides.png" style="width: 800px; border: 1px solid #aaaaaa; border-radius: 10px 10px 10px 10px"/>

**Hint:** Order your results by `month`, then `city`.

**Hint:** You may want to use `matplotlib` or `plotly`. If you have your own way of plotting data, feel free to show off here. : )

In [None]:
# ANSWER

display(combinedRobberiesByMonthDF.orderBy("month", "city"))

In [None]:
traces = [go.Bar(x=subset.month, 
                 y=subset.robberies,
                 name=city) 
          for city, subset in combinedRobberiesByMonthDF.toPandas().groupby("city")]

layout = go.Layout(
            title="Robberies/month",
            xaxis={
                "title" : "Month",
                "tickfont" : {
                    "size" : 14,
                }
            },
            yaxis={
                "title" : "Robberies (Count)",
                "tickfont" : {
                    "size" : 10,
                }
            },
            legend = {
                "x" : 1,
                "y" : 1
            },
            barmode="group",
            bargap=0.15,
            bargroupgap=0.1
        )

fig = go.Figure(data=traces, layout=layout)

plotly.offline.iplot(fig)


## Step 7

While the above graph is interesting, it's flawed: it's comparing the raw numbers of robberies, not the per capita robbery rates.

The `cityDataDF` DataFrame, loaded below from `City-Data.parquet`, contains, among other data, estimated 2016 population values for all United States cities
with populations of at least 100,000. (The data is from [Wikipedia](https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population).)

* Use the population values in that DataFrame to normalize the robberies so they represent per-capita values (total robberies divided by population).
* Save your results in a DataFrame called `robberyRatesByCityDF`.
* The robbery rate value must be stored in a new column, `robberyRate`.

Next, graph the results, as above.
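
The per-capita calculation itself is a single division. The sketch below works through it in plain Python with made-up figures (500 robberies in a city of 1,000,000 people) and formats the result to six decimal places; the function name `robbery_rate` is hypothetical.

```python
def robbery_rate(robberies, population):
    """Per-capita rate: total robberies divided by population."""
    return robberies / population

# Hypothetical figures for illustration only; not taken from the data set.
rate = robbery_rate(500, 1000000)

# Six decimal places keeps very small rates readable and comparable.
formatted = "{:.6f}".format(rate)
print(formatted)
```

In the DataFrame version, the same division is applied per row after joining each city's monthly counts with its population.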

In [None]:
combinedRobberiesByMonthDF.printSchema()

In [None]:
display(combinedRobberiesByMonthDF)

In [None]:
cityDataDF = spark.read.parquet("s3://data.intellinum.co/bootcamp/common/City-Data.parquet").withColumnRenamed("city", "cities")

In [None]:
cityDataDF.printSchema()

In [None]:
display(cityDataDF)

In [None]:
robberyRatesByCityDF = (combinedRobberiesByMonthDF.select("month", "robberies", "city")
                           .join(cityDataDF, cityDataDF.cities == combinedRobberiesByMonthDF.city)
                           .withColumn("robberyRate", col('robberies')/col('estPopulation2016')))

robberyRatesByCityDF.printSchema()

In [None]:
traces = [go.Bar(x=subset.month, 
                 y=subset.robberyRate,
                 name=city) 
          for city, subset in robberyRatesByCityDF.toPandas().groupby("city")]

layout = go.Layout(
            title="Robbery Rate/month",
            xaxis={
                "title" : "Month",
                "tickfont" : {
                    "size" : 14,
                }
            },
            yaxis={
                "title" : "Robbery Rate",
                "tickfont" : {
                    "size" : 10,
                }
            },
            legend = {
                "x" : 1,
                "y" : 1
            },
            barmode="group",
            bargap=0.15,
            bargroupgap=0.1
        )

fig = go.Figure(data=traces, layout=layout)

plotly.offline.iplot(fig)


In [None]:
# TEST - Run this cell to test your solution.
results = [ (r.city, r.month, '{:.6f}'.format(r.robberyRate)) for r in robberyRatesByCityDF.orderBy("city", "month").collect() ]
expectedResults = [
  (u'Dallas',  1, '0.000564'),
  (u'Dallas',  2, '0.000330'),
  (u'Dallas',  3, '0.000313'),
  (u'Dallas',  4, '0.000451'),
  (u'Dallas',  5, '0.000467'),
  (u'Dallas',  6, '0.000376'),
  (u'Dallas',  7, '0.000406'),
  (u'Dallas',  8, '0.000476'),
  (u'Dallas',  9, '0.000388'),
  (u'Dallas', 10, '0.000458'),
  (u'Dallas', 11, '0.000447'),
  (u'Dallas', 12, '0.000504'),
  (u'Los Angeles',  1, '0.000181'),
  (u'Los Angeles',  2, '0.000170'),
  (u'Los Angeles',  3, '0.000178'),
  (u'Los Angeles',  4, '0.000179'),
  (u'Los Angeles',  5, '0.000199'),
  (u'Los Angeles',  6, '0.000176'),
  (u'Los Angeles',  7, '0.000208'),
  (u'Los Angeles',  8, '0.000192'),
  (u'Los Angeles',  9, '0.000182'),
  (u'Los Angeles', 10, '0.000205'),
  (u'Los Angeles', 11, '0.000192'),
  (u'Los Angeles', 12, '0.000215'),
  (u'Philadelphia',  1, '0.000332'),
  (u'Philadelphia',  2, '0.000265'),
  (u'Philadelphia',  3, '0.000276'),
  (u'Philadelphia',  4, '0.000297'),
  (u'Philadelphia',  5, '0.000340'),
  (u'Philadelphia',  6, '0.000325'),
  (u'Philadelphia',  7, '0.000343'),
  (u'Philadelphia',  8, '0.000358'),
  (u'Philadelphia',  9, '0.000328'),
  (u'Philadelphia', 10, '0.000365'),
  (u'Philadelphia', 11, '0.000348'),
  (u'Philadelphia', 12, '0.000347')]
dfTest("DF-L7-roberryRatesByCity-counts", expectedResults, results)

print("Tests passed!")

## Challenge

Congratulations! You have just finished the `pyspark` section of the `DE-200` course. Before you move on to the next course (`DE-210: ETL Part 1: Data Extraction`), let's learn to write Spark code in Scala! And I will make this fun!

## Step 1

Zeppelin is the BEST open-source JVM-based interactive notebook environment. We'll be using Zeppelin for this part of the exercise. Bye, Jupyter!

You can find the Zeppelin endpoint URL in this [Trello card](https://trello.com/c/LptF6oaI/13-lab-environment).

## Step 2

Once you're there, please take a look at the Zeppelin Tutorial first.

<img src="../../resources/zeppelin-tutorial-3.png" style="width: 800px; border: 1px solid #aaaaaa; border-radius: 10px 10px 10px 10px"/>

## Step 3

Next, take a look at the template notebook in my workspace. You should create the same notebook in your own workspace.

<img src="../../resources/zeppelin-tutorial-1.png" style="width: 800px; border: 1px solid #aaaaaa; border-radius: 10px 10px 10px 10px"/>

## Step 4

It's super easy to create visualizations in Zeppelin.

<img src="../../resources/zeppelin-tutorial-2.png" style="width: 800px; border: 1px solid #aaaaaa; border-radius: 10px 10px 10px 10px"/>

## Step 5

You're all set. Please go ahead and rewrite this project in Scala and visualize your results in Zeppelin. If you want to learn more about Zeppelin, take a look at these links. I'll cover these topics in greater detail in future courses.
- https://zeppelin.apache.org/docs/0.8.1/quickstart/explore_ui.html
- https://zeppelin.apache.org/docs/0.8.1/quickstart/tutorial.html
- https://zeppelin.apache.org/docs/0.8.1/quickstart/spark_with_zeppelin.html
- https://zeppelin.apache.org/docs/0.8.1/setup/deployment/spark_cluster_mode.html#spark-on-mesos-mode

## References

The crime data used in this notebook comes from the following locations:

| City          | Original Data 
| ------------- | -------------
| Boston        | <a href="https://data.boston.gov/group/public-safety" target="_blank">https&#58;//data.boston.gov/group/public-safety</a>
| Chicago       | <a href="https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2" target="_blank">https&#58;//data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2</a>
| Dallas        | <a href="https://www.dallasopendata.com/Public-Safety/Police-Incidents/tbnj-w5hb/data" target="_blank">https&#58;//www.dallasopendata.com/Public-Safety/Police-Incidents/tbnj-w5hb/data</a>
| Los Angeles   | <a href="https://data.lacity.org/A-Safe-City/Crime-Data-From-2010-to-Present/y8tr-7khq" target="_blank">https&#58;//data.lacity.org/A-Safe-City/Crime-Data-From-2010-to-Present/y8tr-7khq</a>
| New Orleans   | <a href="https://data.nola.gov/Public-Safety-and-Preparedness/Electronic-Police-Report-2016/4gc2-25he/data" target="_blank">https&#58;//data.nola.gov/Public-Safety-and-Preparedness/Electronic-Police-Report-2016/4gc2-25he/data</a>
| New York      | <a href="https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i" target="_blank">https&#58;//data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i</a>
| Philadelphia  | <a href="https://www.opendataphilly.org/dataset/crime-incidents" target="_blank">https&#58;//www.opendataphilly.org/dataset/crime-incidents</a>

&copy; 2019 [Intellinum Analytics, Inc](http://www.intellinum.co). All rights reserved.<br/>