# Transformations and actions
Now that we have a distributed DataFrame in memory, we can examine our data.

**NOTE:** DataFrames are immutable objects, so every transformation actually creates a new DataFrame and leaves the original untouched.

In [0]:
# file path
sf_fire_file = "dbfs:/tmp/learning-spark/fireParquet/"
# read as df
fire_df = spark.read.parquet(sf_fire_file)
# show
display(fire_df.limit(20))

CallNumber,UnitID,IncidentNumber,CallType,CallDate,WatchDate,CallFinalDisposition,AvailableDtTm,Address,City,Zipcode,Battalion,StationArea,Box,OriginalPriority,Priority,FinalPriority,ALSUnit,CallTypeGroup,NumAlarms,UnitType,UnitSequenceInCallDispatch,FirePreventionDistrict,SupervisorDistrict,Neighborhood,Location,RowID,Delay
111050354,E14,11034920,Medical Incident,04/15/2011,04/15/2011,Other,04/15/2011 11:27:08 PM,500 Block of 21ST AVE,SF,94121,B07,14,7171,3,3,3,True,,1,ENGINE,1,7,1,Outer Richmond,"(37.7774255992901, -122.480311994328)",111050354-E14,4.7833333
111050355,E03,11034921,Structure Fire,04/15/2011,04/15/2011,Other,04/15/2011 11:10:54 PM,HYDE ST/BUSH ST,SF,94109,B04,3,1561,3,3,3,True,,1,ENGINE,1,4,3,Nob Hill,"(37.7891101748937, -122.417016879226)",111050355-E03,1.9166666
111050355,T03,11034921,Structure Fire,04/15/2011,04/15/2011,Other,04/15/2011 11:10:54 PM,HYDE ST/BUSH ST,SF,94109,B04,3,1561,3,3,3,False,,1,TRUCK,2,4,3,Nob Hill,"(37.7891101748937, -122.417016879226)",111050355-T03,2.4333334
111050356,73,11034922,Structure Fire,04/15/2011,04/15/2011,Other,04/15/2011 11:24:56 PM,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,True,,1,MEDIC,10,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-73,2.0666666
111050356,B06,11034922,Structure Fire,04/15/2011,04/15/2011,Other,04/15/2011 11:22:46 PM,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,False,,1,CHIEF,6,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-B06,2.6
111050356,B10,11034922,Structure Fire,04/15/2011,04/15/2011,Other,04/15/2011 11:25:00 PM,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,False,,1,CHIEF,4,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-B10,3.25
111050356,D3,11034922,Structure Fire,04/15/2011,04/15/2011,Other,04/15/2011 11:23:01 PM,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,False,,1,CHIEF,7,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-D3,3.5
111050356,E29,11034922,Structure Fire,04/15/2011,04/15/2011,Other,04/15/2011 11:22:50 PM,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,True,,1,ENGINE,8,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-E29,2.6
111050356,E37,11034922,Structure Fire,04/15/2011,04/15/2011,Other,04/15/2011 11:25:10 PM,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,False,,1,ENGINE,2,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-E37,2.6666667
111050356,RS2,11034922,Structure Fire,04/15/2011,04/15/2011,Other,04/15/2011 11:24:11 PM,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,False,,1,RESCUE SQUAD,5,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-RS2,3.05


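### Projections and filters
A projection returns only the chosen columns; a filter returns only the rows that satisfy a condition:
- *select()*: projection (picks columns)
- *where()*: filter (keeps the rows that match a condition)
- *filter()*: filter; *where()* is an alias for it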

In [0]:
from pyspark.sql.functions import col, count, countDistinct

In [0]:
# DataFrame with only the columns picked by select() and only the rows that satisfy the where() condition
few_fire_df = (fire_df
                    .select("IncidentNumber", "AvailableDtTm", "CallType")
                    .where(col("CallType") != "Medical Incident"))

few_fire_df.show(5, truncate=False)

+--------------+----------------------+--------------+
|IncidentNumber|AvailableDtTm         |CallType      |
+--------------+----------------------+--------------+
|11034921      |04/15/2011 11:10:54 PM|Structure Fire|
|11034921      |04/15/2011 11:10:54 PM|Structure Fire|
|11034922      |04/15/2011 11:24:56 PM|Structure Fire|
|11034922      |04/15/2011 11:22:46 PM|Structure Fire|
|11034922      |04/15/2011 11:25:00 PM|Structure Fire|
+--------------+----------------------+--------------+
only showing top 5 rows



In [0]:
# How many distinct call types are there?
display(
            (fire_df.select("CallType")
                   .where(col("CallType").isNotNull())
                   .agg(countDistinct("CallType").alias("DistinctCallTypes")))
)

DistinctCallTypes
32


In [0]:
# Show the distinct call types:
display(
            (fire_df.select("CallType")
                    .distinct()
                    .where(col("CallType").isNotNull())
            )
)

CallType
Elevator / Escalator Rescue
Marine Fire
Aircraft Emergency
Confined Space / Structure Collapse
Administrative
Alarms
Odor (Strange / Unknown)
Lightning Strike (Investigation)
Citizen Assist / Service Call
HazMat


In [0]:
# Show the distinct call types with the number of times each one occurs, sorted from most to least frequent:
display(
            (fire_df.select("CallType")
                    .where(col("CallType").isNotNull())
                    .groupBy("CallType")
                    .agg(count(col("CallType")).alias("NumCallType"))
                    .orderBy("NumCallType", ascending=False))
)

CallType,NumCallType
Medical Incident,2843475
Structure Fire,578998
Alarms,483518
Traffic Collision,175507
Citizen Assist / Service Call,65360
Other,56961
Outside Fire,51603
Vehicle Fire,20939
Water Rescue,20037
Gas Leak (Natural and LP Gases),17284


### Renaming, adding, and dropping columns
Sometimes columns need to be renamed for reasons of style, convention, or readability.

**NOTE:** column names containing spaces are a problem when writing Parquet files (Spark has traditionally rejected them), so it is best to remove them.

**NOTE:** defining the schema with *StructField()* already lets us name the columns however we want.

In [0]:
# Show the DataFrame columns: these are already fine (DataFrame read from Parquet)
fire_df.columns

Out[67]: ['CallNumber',
 'UnitID',
 'IncidentNumber',
 'CallType',
 'CallDate',
 'WatchDate',
 'CallFinalDisposition',
 'AvailableDtTm',
 'Address',
 'City',
 'Zipcode',
 'Battalion',
 'StationArea',
 'Box',
 'OriginalPriority',
 'Priority',
 'FinalPriority',
 'ALSUnit',
 'CallTypeGroup',
 'NumAlarms',
 'UnitType',
 'UnitSequenceInCallDispatch',
 'FirePreventionDistrict',
 'SupervisorDistrict',
 'Neighborhood',
 'Location',
 'RowID',
 'Delay']

In [0]:
# DataFrame read from the CSV file
# NOTE: schema inference is explicitly turned off, so every column is read as a string

# file path csv
sf_fire_file_csv = "dbfs:/databricks-datasets/learning-spark-v2/sf-fire/sf-fire-calls.csv"
# read as df
fire_df_csv = spark.read.csv(sf_fire_file_csv, header=True, inferSchema=False)
# display the first 100 rows
display(fire_df_csv.limit(100))
# print schema df (printSchema() prints by itself and returns None)
fire_df_csv.printSchema()

Call Number,Unit ID,Incident Number,CallType,Call Date,Watch Date,Call Final Disposition,Available DtTm,Address,City,Zipcode of Incident,Battalion,Station Area,Box,OrigPriority,Priority,Final Priority,ALS Unit,Call Type Group,NumAlarms,UnitType,Unit sequence in call dispatch,Fire Prevention District,Supervisor District,Neighborhood,Location,RowID,Delay
20110014,M29,2003234,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 01:58:43 AM,10TH ST/MARKET ST,SF,94103,B02,36,2338,1,1,2,True,,1,MEDIC,1,2,6,Tenderloin,"(37.7765408927183, -122.417501464907)",020110014-M29,5.233333333333333
20110015,M08,2003233,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 02:10:17 AM,300 Block of 5TH ST,SF,94107,B03,8,2243,1,1,2,True,,1,MEDIC,1,3,6,South of Market,"(37.7792841462441, -122.402061300134)",020110015-M08,3.083333333333333
20110016,B02,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,6,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-B02,3.05
20110016,B04,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:51:54 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,3,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-B04,2.316666666666667
20110016,D2,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,4,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-D2,3.0166666666666666
20110016,E03,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,7,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E03,2.683333333333333
20110016,E38,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:51:17 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,1,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E38,2.1
20110016,E41,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,8,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E41,2.716666666666667
20110016,M03,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:46:38 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,True,,1,MEDIC,10,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-M03,2.7666666666666666
20110016,RS1,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:46:57 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,RESCUE SQUAD,9,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-RS1,3.2666666666666666


root
 |-- Call Number: string (nullable = true)
 |-- Unit ID: string (nullable = true)
 |-- Incident Number: string (nullable = true)
 |-- CallType: string (nullable = true)
 |-- Call Date: string (nullable = true)
 |-- Watch Date: string (nullable = true)
 |-- Call Final Disposition: string (nullable = true)
 |-- Available DtTm: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode of Incident: string (nullable = true)
 |-- Battalion: string (nullable = true)
 |-- Station Area: string (nullable = true)
 |-- Box: string (nullable = true)
 |-- OrigPriority: string (nullable = true)
 |-- Priority: string (nullable = true)
 |-- Final Priority: string (nullable = true)
 |-- ALS Unit: string (nullable = true)
 |-- Call Type Group: string (nullable = true)
 |-- NumAlarms: string (nullable = true)
 |-- UnitType: string (nullable = true)
 |-- Unit sequence in call dispatch: string (nullable = true)
 |-- Fire Prevention District: string

#### Problem:
We want to replace the current column names (which contain spaces) with column names without spaces.

In [0]:
# Python code to clean up the column names
colonne_vecchie = fire_df_csv.columns

colonne_nuove = []
for nome in colonne_vecchie:
    colonne_nuove.append(nome.replace(" ", ""))
print(colonne_nuove)

# assign the new names to a new DataFrame
fire_df_csv_new = fire_df_csv.toDF(*colonne_nuove) # .toDF() returns a new DataFrame with the given column names; * unpacks the list of names
display(fire_df_csv_new.limit(10))

['CallNumber', 'UnitID', 'IncidentNumber', 'CallType', 'CallDate', 'WatchDate', 'CallFinalDisposition', 'AvailableDtTm', 'Address', 'City', 'ZipcodeofIncident', 'Battalion', 'StationArea', 'Box', 'OrigPriority', 'Priority', 'FinalPriority', 'ALSUnit', 'CallTypeGroup', 'NumAlarms', 'UnitType', 'Unitsequenceincalldispatch', 'FirePreventionDistrict', 'SupervisorDistrict', 'Neighborhood', 'Location', 'RowID', 'Delay']


CallNumber,UnitID,IncidentNumber,CallType,CallDate,WatchDate,CallFinalDisposition,AvailableDtTm,Address,City,ZipcodeofIncident,Battalion,StationArea,Box,OrigPriority,Priority,FinalPriority,ALSUnit,CallTypeGroup,NumAlarms,UnitType,Unitsequenceincalldispatch,FirePreventionDistrict,SupervisorDistrict,Neighborhood,Location,RowID,Delay
20110014,M29,2003234,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 01:58:43 AM,10TH ST/MARKET ST,SF,94103,B02,36,2338,1,1,2,True,,1,MEDIC,1,2,6,Tenderloin,"(37.7765408927183, -122.417501464907)",020110014-M29,5.233333333333333
20110015,M08,2003233,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 02:10:17 AM,300 Block of 5TH ST,SF,94107,B03,8,2243,1,1,2,True,,1,MEDIC,1,3,6,South of Market,"(37.7792841462441, -122.402061300134)",020110015-M08,3.083333333333333
20110016,B02,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,6,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-B02,3.05
20110016,B04,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:51:54 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,3,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-B04,2.316666666666667
20110016,D2,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,4,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-D2,3.0166666666666666
20110016,E03,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,7,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E03,2.683333333333333
20110016,E38,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:51:17 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,1,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E38,2.1
20110016,E41,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,8,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E41,2.716666666666667
20110016,M03,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:46:38 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,True,,1,MEDIC,10,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-M03,2.7666666666666666
20110016,RS1,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:46:57 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,RESCUE SQUAD,9,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-RS1,3.2666666666666666
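
Removing only the spaces leaves names such as `ZipcodeofIncident` and `Unitsequenceincalldispatch`, whereas the Parquet version of the data uses `UnitSequenceInCallDispatch`. A possible variant, assuming CamelCase names are wanted, capitalizes each word before joining. The helper `to_camel_case` is hypothetical and plain Python, so it can be tried without Spark.

```python
def to_camel_case(name):
    # Capitalize the first letter of each word and leave the rest untouched,
    # so that acronyms such as "ALS" or "ID" are preserved
    return "".join(word[:1].upper() + word[1:] for word in name.split(" "))

# A few of the original CSV column names
old_names = ["Call Number", "Zipcode of Incident",
             "Unit sequence in call dispatch", "ALS Unit"]
new_names = [to_camel_case(n) for n in old_names]
print(new_names)
# ['CallNumber', 'ZipcodeOfIncident', 'UnitSequenceInCallDispatch', 'ALSUnit']
```

The resulting list can be passed to `toDF(*new_names)` in the same way as `colonne_nuove`.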


In [0]:
# a single column, or just a few columns, can be handled with withColumnRenamed():
fire_df_csv_new_2 = (fire_df_csv.select("Call Number", "Unit ID")
                                .withColumnRenamed("Call Number", "CallNumber")
                                .withColumnRenamed("Unit ID", "UnitId"))
display(fire_df_csv_new_2.limit(10))

CallNumber,UnitId
20110014,M29
20110015,M08
20110016,B02
20110016,B04
20110016,D2
20110016,E03
20110016,E38
20110016,E41
20110016,M03
20110016,RS1


#### Problem:
In our DataFrame the columns *CallDate*, *WatchDate*, and *AvailableDtTm* are strings; we want to convert them to timestamps or SQL dates, both of which Spark supports.

In [0]:
# import the functions we need
from pyspark.sql.functions import col, to_date, to_timestamp

In [0]:
# check that they really are strings
fire_df.select("CallDate", "WatchDate", "AvailableDtTm").printSchema()

# withColumn(name, expression) adds a new column with the given name, computed from an existing one
# drop() removes the original column we converted from

fire_df_ts = (fire_df.withColumn("IncidentDate", to_date(col("CallDate"), "MM/dd/yyyy"))
                     .drop("CallDate")
                     .withColumn("OnWatchDate", to_date(col("WatchDate"), "MM/dd/yyyy"))
                     .drop("WatchDate")
                     .withColumn("AvailableDtTs", to_timestamp(col("AvailableDtTm"), "MM/dd/yyyy hh:mm:ss a"))
                     .drop("AvailableDtTm"))

display(fire_df_ts.limit(10))

root
 |-- CallDate: string (nullable = true)
 |-- WatchDate: string (nullable = true)
 |-- AvailableDtTm: string (nullable = true)



CallNumber,UnitID,IncidentNumber,CallType,CallFinalDisposition,Address,City,Zipcode,Battalion,StationArea,Box,OriginalPriority,Priority,FinalPriority,ALSUnit,CallTypeGroup,NumAlarms,UnitType,UnitSequenceInCallDispatch,FirePreventionDistrict,SupervisorDistrict,Neighborhood,Location,RowID,Delay,IncidentDate,OnWatchDate,AvailableDtTs
111050354,E14,11034920,Medical Incident,Other,500 Block of 21ST AVE,SF,94121,B07,14,7171,3,3,3,True,,1,ENGINE,1,7,1,Outer Richmond,"(37.7774255992901, -122.480311994328)",111050354-E14,4.7833333,2011-04-15,2011-04-15,2011-04-15T23:27:08.000+0000
111050355,E03,11034921,Structure Fire,Other,HYDE ST/BUSH ST,SF,94109,B04,3,1561,3,3,3,True,,1,ENGINE,1,4,3,Nob Hill,"(37.7891101748937, -122.417016879226)",111050355-E03,1.9166666,2011-04-15,2011-04-15,2011-04-15T23:10:54.000+0000
111050355,T03,11034921,Structure Fire,Other,HYDE ST/BUSH ST,SF,94109,B04,3,1561,3,3,3,False,,1,TRUCK,2,4,3,Nob Hill,"(37.7891101748937, -122.417016879226)",111050355-T03,2.4333334,2011-04-15,2011-04-15,2011-04-15T23:10:54.000+0000
111050356,73,11034922,Structure Fire,Other,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,True,,1,MEDIC,10,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-73,2.0666666,2011-04-15,2011-04-15,2011-04-15T23:24:56.000+0000
111050356,B06,11034922,Structure Fire,Other,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,False,,1,CHIEF,6,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-B06,2.6,2011-04-15,2011-04-15,2011-04-15T23:22:46.000+0000
111050356,B10,11034922,Structure Fire,Other,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,False,,1,CHIEF,4,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-B10,3.25,2011-04-15,2011-04-15,2011-04-15T23:25:00.000+0000
111050356,D3,11034922,Structure Fire,Other,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,False,,1,CHIEF,7,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-D3,3.5,2011-04-15,2011-04-15,2011-04-15T23:23:01.000+0000
111050356,E29,11034922,Structure Fire,Other,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,True,,1,ENGINE,8,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-E29,2.6,2011-04-15,2011-04-15,2011-04-15T23:22:50.000+0000
111050356,E37,11034922,Structure Fire,Other,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,False,,1,ENGINE,2,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-E37,2.6666667,2011-04-15,2011-04-15,2011-04-15T23:25:10.000+0000
111050356,RS2,11034922,Structure Fire,Other,1000 Block of POTRERO AVE,SF,94110,B10,7,2553,3,3,3,False,,1,RESCUE SQUAD,5,10,10,Potrero Hill,"(37.7565080013216, -122.40654101432)",111050356-RS2,3.05,2011-04-15,2011-04-15,2011-04-15T23:24:11.000+0000
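
To make the pattern `MM/dd/yyyy hh:mm:ss a` used above more concrete, here is a plain-Python analogue with `datetime.strptime`. The `strptime` directives are my own translation of the Spark pattern, not something taken from the notebook, and an English locale is assumed for the AM/PM marker.

```python
from datetime import datetime

# First AvailableDtTm value shown in the data above
raw = "04/15/2011 11:27:08 PM"

# MM -> %m, dd -> %d, yyyy -> %Y, hh (12-hour clock) -> %I, mm -> %M, ss -> %S, a (AM/PM) -> %p
parsed = datetime.strptime(raw, "%m/%d/%Y %I:%M:%S %p")
print(parsed.isoformat())  # 2011-04-15T23:27:08
```

The 12-hour value `11:27:08 PM` becomes `23:27:08`, which matches the `AvailableDtTs` value in the first row of the output.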


In [0]:
# from this DataFrame we can now use functions such as year(), month(), dayofmonth()
from pyspark.sql.functions import year

display(
    (fire_df_ts.select(year("IncidentDate").alias("IncidentDateYear"))
               .distinct()
               .orderBy(year("IncidentDate"), ascending=False))
)

IncidentDateYear
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009


In [0]:
# total number of calls for each year:

display(
    (fire_df_ts.groupBy(year("IncidentDate").alias("IncidentDateYear"))
               .agg(count("CallNumber").alias("NumCalls"))
               .orderBy("NumCalls", ascending=False))
)

IncidentDateYear,NumCalls
2017,301449
2016,292526
2015,285281
2014,268074
2018,254602
2013,248796
2011,242121
2012,241714
2010,228567
2008,221652


### Aggregations
Grouping rows and running computations on the groups are typical DataFrame transformations and actions.

In [0]:
# What are the 3 most common call types?
display(
    (fire_df_ts.select("CallType")
               .where(col("CallType").isNotNull())
               .groupBy("CallType")
               .count()
               .orderBy("count", ascending=False)
               .limit(3))
)

CallType,count
Medical Incident,2843475
Structure Fire,578998
Alarms,483518


In [0]:
# import the PySpark functions under a namespace so they do not shadow Python built-ins such as sum(), min(), max()
import pyspark.sql.functions as F

display(
    fire_df_ts.select(F.sum("NumAlarms").alias("sommaAllarmi"),
                      F.avg("Delay").alias("mediaRitardo"),
                      F.min("Delay").alias("minimoRitardo"),
                      F.max("Delay").alias("massimoRitardo"))
)

sommaAllarmi,mediaRitardo,minimoRitardo,massimoRitardo
4403441,3.902170335891614,0.016666668,1879.6167
