# San Francisco Fire Calls

This notebook is the end-to-end example from Chapter 3. It shows how to use the DataFrame API and Spark SQL for common data analytics patterns and operations on the [San Francisco Fire Department Calls](https://data.sfgov.org/Public-Safety/Fire-Department-Calls-for-Service/nuek-vuh3) dataset.

Inspect the location where the SF Fire Department calls dataset is stored in the public dataset S3 bucket

In [0]:
%fs ls /databricks-datasets/learning-spark-v2/sf-fire/sf-fire-calls.csv

In [0]:
%fs ls /databricks-datasets/

Define the location of the public dataset on the S3 bucket

In [0]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

sf_fire_file = "/databricks-datasets/learning-spark-v2/sf-fire/sf-fire-calls.csv"

Inspect what the data looks like before defining a schema

In [0]:
%fs head /databricks-datasets/learning-spark-v2/sf-fire/sf-fire-calls.csv

Define the schema explicitly, since the file has 4 million records and inferring the schema is expensive for large files.

In [0]:
fire_schema = StructType([StructField('CallNumber', IntegerType(), True),
                     StructField('UnitID', StringType(), True),
                     StructField('IncidentNumber', IntegerType(), True),
                     StructField('CallType', StringType(), True),                  
                     StructField('CallDate', StringType(), True),      
                     StructField('WatchDate', StringType(), True),
                     StructField('CallFinalDisposition', StringType(), True),
                     StructField('AvailableDtTm', StringType(), True),
                     StructField('Address', StringType(), True),       
                     StructField('City', StringType(), True),       
                     StructField('Zipcode', IntegerType(), True),       
                     StructField('Battalion', StringType(), True),                 
                     StructField('StationArea', StringType(), True),       
                     StructField('Box', StringType(), True),       
                     StructField('OriginalPriority', StringType(), True),       
                     StructField('Priority', StringType(), True),       
                     StructField('FinalPriority', IntegerType(), True),       
                     StructField('ALSUnit', BooleanType(), True),       
                     StructField('CallTypeGroup', StringType(), True),
                     StructField('NumAlarms', IntegerType(), True),
                     StructField('UnitType', StringType(), True),
                     StructField('UnitSequenceInCallDispatch', IntegerType(), True),
                     StructField('FirePreventionDistrict', StringType(), True),
                     StructField('SupervisorDistrict', StringType(), True),
                     StructField('Neighborhood', StringType(), True),
                     StructField('Location', StringType(), True),
                     StructField('RowID', StringType(), True),
                     StructField('Delay', FloatType(), True)])

In [0]:
fire_df = spark.read.csv(sf_fire_file, header=True, schema=fire_schema)

Cache the DataFrame since we will be performing some operations on it.

In [0]:
fire_df.cache()

In [0]:
fire_df.count()

In [0]:
fire_df.printSchema()

In [0]:
fire_df.show(3)

In [0]:
display(fire_df)

Filter out "Medical Incident" call types

Note that the `filter()` and `where()` methods on a DataFrame are equivalent: `where()` is an alias for `filter()`. Check the relevant documentation for the argument types they accept.

In [0]:
few_fire_df = (fire_df.select("IncidentNumber", "AvailableDtTm", "CallType") 
              .where(col("CallType") != "Medical Incident"))

few_fire_df.show(5, truncate=False)

In [0]:

few_fire_df.count()

**Q-1) How many distinct types of calls were made to the Fire Department?**

To be sure, let's exclude null values in that column from the count.

In [0]:
fire_df.select("CallType").where(col("CallType").isNotNull()).distinct().count()

In [0]:
fire_df.select("CallType").where(col("CallType").isNull()).show()

**Q-2) What distinct types of calls were made to the Fire Department?**

These are all the distinct types of calls made to the SF Fire Department

In [0]:
fire_df.select("CallType").where(col("CallType").isNotNull()).distinct().show(32, False)

**Q-3) Which response delay times were greater than 5 mins?**

1. Rename the column `Delay` to `ResponseDelayedinMins`
2. This returns a new DataFrame
3. Find all calls where the response time to the fire site was delayed for more than 5 mins

In [0]:
new_fire_df = fire_df.withColumnRenamed("Delay", "ResponseDelayedinMins")
new_fire_df.select("ResponseDelayedinMins").where(col("ResponseDelayedinMins") > 5).show(5, False)

In [0]:
new_fire_df.select("ResponseDelayedinMins").where(col("ResponseDelayedinMins") > 15).count()

Let's do some ETL:

1. Transform the string dates to Spark Timestamp data type so we can make some time-based queries later
2. This returns a transformed DataFrame
3. Cache the new DataFrame

In [0]:
fire_ts_df = (new_fire_df
              .withColumn("IncidentDate", to_timestamp(col("CallDate"), "MM/dd/yyyy")).drop("CallDate") 
              .withColumn("OnWatchDate",   to_timestamp(col("WatchDate"), "MM/dd/yyyy")).drop("WatchDate")
              .withColumn("AvailableDtTS", to_timestamp(col("AvailableDtTm"), "MM/dd/yyyy hh:mm:ss a")).drop("AvailableDtTm"))          

In [0]:
fire_ts_df.cache()
fire_ts_df.columns

Check the transformed columns with Spark Timestamp type

In [0]:
fire_ts_df.select("IncidentDate", "OnWatchDate", "AvailableDtTS").show(5, False)

In [0]:
mydf = fire_ts_df.select("IncidentDate", "OnWatchDate", "AvailableDtTS")
mydf.printSchema()

**Q-4) What were the most common call types?**

List them in descending order

In [0]:
(fire_ts_df
 .select("CallType").where(col("CallType").isNotNull())
 .groupBy("CallType")
 .count()
 .orderBy("count", ascending=False)
 .show(n=10, truncate=False))

**Q-4a) What zip codes accounted for the most common calls?**

Let's investigate which zip codes in San Francisco accounted for the most fire calls and what types of calls they were.

1. Filter out rows with a null CallType
2. Group them by CallType and Zip code
3. Count them and display them in descending order

It seems like the most common calls were all related to Medical Incident, and the top two zip codes are 94102 and 94103.

In [0]:
(fire_ts_df
 .select("CallType", "Zipcode")
 .where(col("CallType").isNotNull())
 .groupBy("CallType", "Zipcode")
 .count()
 .orderBy("count", ascending=False)
 .show(10, truncate=False))

**Q-4b) What San Francisco neighborhoods are in the zip codes 94102 and 94103?**

Let's find out the neighborhoods associated with these two zip codes. In all likelihood, these are some of the contested neighborhoods with high reported crime.

In [0]:
fire_ts_df.select("Neighborhood", "Zipcode").where((col("Zipcode") == 94102) | (col("Zipcode") == 94103)).distinct().show(10, truncate=False)

**Q-5) What were the sum of all alarms and the average, min, and max of the response times for calls?**

Let's use the built-in Spark SQL functions to compute the sum, avg, min, and max of a few columns:

* Total number of alarms
* The average, min, and max delay in response time before the Fire Dept arrived at the scene of the call

In [0]:
fire_ts_df.select(sum("NumAlarms"), avg("ResponseDelayedinMins"), min("ResponseDelayedinMins"), max("ResponseDelayedinMins")).show()

**Q-6a) How many distinct years of data are in the CSV file?**

We can use the `year()` Spark SQL function on the Timestamp column `IncidentDate`.

In all, we have fire calls from the years 2000-2018.

In [0]:
fire_ts_df.select(year('IncidentDate')).distinct().orderBy(year('IncidentDate')).show()

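**Q-6b) What week of the year in 2018 had the most fire calls?**

**Note**: `weekofyear()` follows ISO 8601 week numbering, so in 2018 week 1 is New Year's week (January 1-7) and week 25 is June 18-24, about two weeks before July 4 (which falls in week 27). Fireworks around New Year's and in the run-up to July 4 may help explain the higher number of calls.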

In [0]:
fire_ts_df.filter(year('IncidentDate') == 2018).groupBy(weekofyear('IncidentDate')).count().orderBy('count', ascending=False).show()

**Q-7) What neighborhoods in San Francisco had the worst response time in 2018?**

It appears that if you live in Presidio Heights, the Fire Dept arrived in less than 3 mins, while Mission Bay took more than 6 mins.

In [0]:
fire_ts_df.select("Neighborhood", "ResponseDelayedinMins").filter(year("IncidentDate") == 2018).show(10, False)

**Q-8a) How can we use Parquet files or a SQL table to store data and read it back?**

In [0]:
fire_ts_df.write.format("parquet").mode("overwrite").save("/tmp/fireServiceParquet/")

In [0]:
%fs ls /tmp/fireServiceParquet/

In [0]:
fire_ts_df.createOrReplaceTempView("fire")

In [0]:
spark.sql("select count(*) from fire").show()

**Q-8b) How can we use a Parquet SQL table to store data and read it back?**

In [0]:
fire_ts_df.write.format("parquet").mode("overwrite").saveAsTable("Firetbl")

In [0]:
%sql
CACHE TABLE Firetbl

In [0]:
%sql
SELECT * FROM Firetbl LIMIT 10

**Q-8c) How can we read data from a Parquet file?**

Note that we don't have to specify the schema here, since it is stored as part of the Parquet metadata.

In [0]:
file_parquet_df = spark.read.format("parquet").load("/tmp/fireServiceParquet/")

In [0]:
display(file_parquet_df.limit(10))
