## Example: San Francisco Fire Calls

#### Import the required packages and define the path to the .csv file

In [0]:
%scala
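import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
// Enables the $"column" syntax used below (Databricks notebooks already import it).
import spark.implicits._

val file = "/databricks-datasets/learning-spark-v2/sf-fire/sf-fire-calls.csv"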

#### Create the schema and read the file using that schema

In [0]:
%scala
val schema = StructType(Array(
  StructField("CallNumber", IntegerType, true),
  StructField("UnitID", StringType, true),
  StructField("IncidentNumber", IntegerType, true),
  StructField("CallType", StringType, true),
  StructField("CallDate", StringType, true),
  StructField("WatchDate", StringType, true),
  StructField("CallFinalDisposition", StringType, true),
  StructField("AvailableDtTm", StringType, true),
  StructField("Address", StringType, true),
  StructField("City", StringType, true),
  StructField("Zipcode", IntegerType, true),
  StructField("Battalion", StringType, true),
  StructField("StationArea", StringType, true),
  StructField("Box", StringType, true),
  StructField("OriginalPriority", StringType, true),
  StructField("Priority", StringType, true),
  StructField("FinalPriority", IntegerType, true),
  StructField("ALSUnit", BooleanType, true),
  StructField("CallTypeGroup", StringType, true),
  StructField("NumAlarms", IntegerType, true),
  StructField("UnitType", StringType, true),
  StructField("UnitSequenceInCallDispatch", IntegerType, true),
  StructField("FirePreventionDistrict", StringType, true),
  StructField("SupervisorDistrict", StringType, true),
  StructField("Neighborhood", StringType, true),
  StructField("Location", StringType, true),
  StructField("RowID", StringType, true),
  StructField("Delay", FloatType, true)
))

val fireDf = spark.read.schema(schema).option("header", "true").csv(file)

#### Cache the DataFrame to speed up the operations that follow

In [0]:
%scala
fireDf.cache()

#### Count the rows in the file

In [0]:
%scala
fireDf.count()

#### Print the DataFrame schema

In [0]:
%scala
fireDf.printSchema()

#### Show the first 5 rows

In [0]:
%scala
display(fireDf.limit(5))

CallNumber,UnitID,IncidentNumber,CallType,CallDate,WatchDate,CallFinalDisposition,AvailableDtTm,Address,City,Zipcode,Battalion,StationArea,Box,OriginalPriority,Priority,FinalPriority,ALSUnit,CallTypeGroup,NumAlarms,UnitType,UnitSequenceInCallDispatch,FirePreventionDistrict,SupervisorDistrict,Neighborhood,Location,RowID,Delay
20110014,M29,2003234,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 01:58:43 AM,10TH ST/MARKET ST,SF,94103,B02,36,2338,1,1,2,True,,1,MEDIC,1,2,6,Tenderloin,"(37.7765408927183, -122.417501464907)",020110014-M29,5.233333
20110015,M08,2003233,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 02:10:17 AM,300 Block of 5TH ST,SF,94107,B03,8,2243,1,1,2,True,,1,MEDIC,1,3,6,South of Market,"(37.7792841462441, -122.402061300134)",020110015-M08,3.0833333
20110016,B02,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,6,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-B02,3.05
20110016,B04,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:51:54 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,3,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-B04,2.3166666
20110016,D2,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,4,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-D2,3.0166667


#### Create another DataFrame with three columns, excluding calls of type _Medical Incident_, then show its first 5 rows

In [0]:
%scala
val fewFireDf = fireDf.select("IncidentNumber", "AvailableDtTm", "CallType").where($"CallType" =!= "Medical Incident")

fewFireDf.show(5, false)

#### Count the distinct call types, excluding nulls

In [0]:
%scala
fireDf.select("CallType").where(col("CallType").isNotNull).distinct().count()

#### Show the first 10 distinct call types, again excluding nulls

In [0]:
%scala
fireDf.select("CallType").where(col("CallType").isNotNull).distinct().show(10, false)

#### Create a new DataFrame renaming the _Delay_ column to _ResponseDelayedinMins_, and show the first 5 rows with a delay of more than 5 minutes

In [0]:
%scala
val newFireDf = fireDf.withColumnRenamed("Delay", "ResponseDelayedinMins")
newFireDf.select("ResponseDelayedinMins").where($"ResponseDelayedinMins" > 5).show(5, false)

#### Create a new DataFrame in which _CallDate_, _WatchDate_ and _AvailableDtTm_ are converted from String to Timestamp as _IncidentDate_, _OnWatchDate_ and _AvailableDtTS_

In [0]:
%scala
val fireTSDf = newFireDf.withColumn("IncidentDate", to_timestamp(col("CallDate"), "MM/dd/yyyy")).drop("CallDate")
  .withColumn("OnWatchDate", to_timestamp(col("WatchDate"), "MM/dd/yyyy")).drop("WatchDate")
  .withColumn("AvailableDtTS", to_timestamp(col("AvailableDtTm"), "MM/dd/yyyy hh:mm:ss a")).drop("AvailableDtTm")
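
The pattern letters in the last `to_timestamp` call are easy to misread: `hh` is the 12-hour clock and `a` is the AM/PM marker. As a rough illustration outside Spark, the sketch below applies the same pattern string with `java.time`. It assumes that Spark's datetime patterns agree with `DateTimeFormatter` for these letters. `TimestampPatternSketch` and `parse12h` are hypothetical names that are not part of the notebook, and the second input string is an invented value in the style of the _AvailableDtTm_ column.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale

// Hypothetical helper, not part of the notebook.
object TimestampPatternSketch {
  // Same pattern string as the AvailableDtTm conversion:
  // hh = hour on the 12-hour clock, a = AM/PM marker.
  private val formatter =
    DateTimeFormatter.ofPattern("MM/dd/yyyy hh:mm:ss a", Locale.US)

  // Parses one AvailableDtTm-style string into a LocalDateTime.
  def parse12h(text: String): LocalDateTime =
    LocalDateTime.parse(text, formatter)

  def main(args: Array[String]): Unit = {
    // Value taken from the first row of the sample output above.
    println(parse12h("01/11/2002 01:58:43 AM")) // 2002-01-11T01:58:43
    // Invented value: a PM time is shifted to the 24-hour clock.
    println(parse12h("04/15/2011 11:27:08 PM")) // 2011-04-15T23:27:08
  }
}
```

Using `HH` (24-hour clock) instead of `hh` with these strings would conflict with the AM/PM marker, so the pattern has to match the source format exactly.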

#### Cache the new DataFrame and list its columns

In [0]:
%scala
fireTSDf.cache()
fireTSDf.columns

#### Show the first 5 rows of the timestamp columns created above

In [0]:
%scala
fireTSDf.select("IncidentDate", "OnWatchDate", "AvailableDtTS").show(5, false)

#### Show the 10 least frequent call types: count the rows per non-null call type and sort by count in ascending order

In [0]:
%scala
fireTSDf.select("CallType").where(col("CallType").isNotNull).groupBy("CallType").count().orderBy("count").show(10, false)

#### Show the 10 most frequent combinations of call type and zip code, excluding null call types and sorting by count in descending order

In [0]:
%scala
fireTSDf.select("CallType", "Zipcode").where(col("CallType").isNotNull).groupBy("CallType","Zipcode").count().orderBy(desc("count")).show(10, false)

#### Show the first 10 distinct pairs of _Neighborhood_ and _Zipcode_ where _Zipcode_ is 94102 or 94103

In [0]:
%scala
fireTSDf.select("Neighborhood", "Zipcode").where((col("Zipcode") === 94102) || (col("Zipcode") === 94103)).distinct().show(10, false)

#### Show the total number of alarms and the average, minimum and maximum response delay

In [0]:
%scala
fireTSDf.select(sum("NumAlarms"), avg("ResponseDelayedinMins"), min("ResponseDelayedinMins"), max("ResponseDelayedinMins")).show()

#### Show the distinct years in which calls were made

In [0]:
%scala
fireTSDf.select(year($"IncidentDate")).distinct().orderBy(year($"IncidentDate")).show()

#### Show the number of calls per week of the year in 2018, in descending order of count

In [0]:
%scala
fireTSDf.filter(year($"IncidentDate") === 2018).groupBy(weekofyear($"IncidentDate")).count().orderBy(desc("count")).show()
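
One detail worth keeping in mind when reading the weekly counts: Spark documents `weekofyear` as following ISO 8601, where weeks start on Monday and week 1 is the first week with more than 3 days in the year. Under that assumption, a date late in December can belong to week 1 of the following year. The sketch below illustrates the numbering with `java.time` rather than Spark. `IsoWeekSketch` and `isoWeek` are hypothetical names that are not part of the notebook.

```scala
import java.time.LocalDate
import java.time.temporal.IsoFields

// Hypothetical helper, not part of the notebook.
object IsoWeekSketch {
  // ISO 8601 week number of the given date.
  def isoWeek(date: LocalDate): Int =
    date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR)

  def main(args: Array[String]): Unit = {
    println(isoWeek(LocalDate.of(2018, 1, 1)))   // 1  (a Monday)
    println(isoWeek(LocalDate.of(2018, 12, 30))) // 52 (a Sunday)
    println(isoWeek(LocalDate.of(2018, 12, 31))) // 1  (belongs to the first week of 2019)
  }
}
```

If the same rule applies in Spark, calls made on 31 December 2018 pass the `year(...) === 2018` filter but are counted together with the first week of January 2018.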

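#### Show the 10 neighborhoods with the longest average response delay in 2018

In [0]:
%scala
// Average the delay per neighborhood and sort so that the slowest come first.
fireTSDf.filter(year($"IncidentDate") === 2018)
  .groupBy("Neighborhood")
  .agg(avg("ResponseDelayedinMins").alias("AvgResponseDelayedinMins"))
  .orderBy(desc("AvgResponseDelayedinMins"))
  .show(10, false)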

#### Save the DataFrame in Parquet format

In [0]:
%scala
fireTSDf.write.format("parquet").mode("overwrite").save("/tmp/fireServiceParquet")

#### Save it as a table in Parquet format

In [0]:
%scala
fireTSDf.write.format("parquet").mode("overwrite").saveAsTable("FireServiceCalls")

#### Read the Parquet files back into a DataFrame

In [0]:
%scala
val fileParquetDf = spark.read.format("parquet").load("/tmp/fireServiceParquet/")

In [0]:
%scala
display(fileParquetDf.limit(10))

CallNumber,UnitID,IncidentNumber,CallType,CallFinalDisposition,Address,City,Zipcode,Battalion,StationArea,Box,OriginalPriority,Priority,FinalPriority,ALSUnit,CallTypeGroup,NumAlarms,UnitType,UnitSequenceInCallDispatch,FirePreventionDistrict,SupervisorDistrict,Neighborhood,Location,RowID,ResponseDelayedinMins,IncidentDate,OnWatchDate,AvailableDtTS
111050354,E14,11034920,Medical Incident,Other,500 Block of 21ST AVE,SF,94121,B07,14,7171,3,3,3,True,,1,ENGINE,1,7,1,Outer Richmond,"(37.7774255992901, -122.480311994328)",111050354-E14,4.7833333,2011-04-15T00:00:00.000+0000,2011-04-15T00:00:00.000+0000,2011-04-15T23:27:08.000+0000
111050355,E03,11034921,Structure Fire,Other,HYDE ST/BUSH ST,SF,94109,B04,3,1561,3,3,3,True,,1,ENGINE,1,4,3,Nob Hill,"(37.7891101748937, -122.417016879226)",111050355-E03,1.9166666,2011-04-15T00:00:00.000+0000,2011-04-15T00:00:00.000+0000,2011-04-15T23:10:54.000+0000
111050355,T03,11034921,Structure Fire,Other,HYDE ST/BUSH ST,SF,94109,B04,3,1561,3,3,3,False,,1,TRUCK,2,4,3,Nob Hill,"(37.7891101748937, -122.417016879226)",111050355-T03,2.4333334,2011-04-15T00:00:00.000+0000,2011-04-15T00:00:00.000+0000,2011-04-15T23:10:54.000+0000
111050356,73,11034922,Structure Fire,Other,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,True,,1,MEDIC,10,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-73,2.0666666,2011-04-15T00:00:00.000+0000,2011-04-15T00:00:00.000+0000,2011-04-15T23:24:56.000+0000
111050356,B06,11034922,Structure Fire,Other,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,False,,1,CHIEF,6,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-B06,2.6,2011-04-15T00:00:00.000+0000,2011-04-15T00:00:00.000+0000,2011-04-15T23:22:46.000+0000
111050356,B10,11034922,Structure Fire,Other,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,False,,1,CHIEF,4,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-B10,3.25,2011-04-15T00:00:00.000+0000,2011-04-15T00:00:00.000+0000,2011-04-15T23:25:00.000+0000
111050356,D3,11034922,Structure Fire,Other,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,False,,1,CHIEF,7,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-D3,3.5,2011-04-15T00:00:00.000+0000,2011-04-15T00:00:00.000+0000,2011-04-15T23:23:01.000+0000
111050356,E29,11034922,Structure Fire,Other,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,True,,1,ENGINE,8,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-E29,2.6,2011-04-15T00:00:00.000+0000,2011-04-15T00:00:00.000+0000,2011-04-15T23:22:50.000+0000
111050356,E37,11034922,Structure Fire,Other,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,False,,1,ENGINE,2,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-E37,2.6666667,2011-04-15T00:00:00.000+0000,2011-04-15T00:00:00.000+0000,2011-04-15T23:25:10.000+0000
111050356,RS2,11034922,Structure Fire,Other,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,False,,1,RESCUE SQUAD,5,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-RS2,3.05,2011-04-15T00:00:00.000+0000,2011-04-15T00:00:00.000+0000,2011-04-15T23:24:11.000+0000
