# Crime Analysis in New York City using Spark DataFrame Example 

__Learning Objective:__

* [Load the data and get quick sense of data](#Load-the-data-and-get-quick-sense-of-data)
* [Exploring and analyzing data with DataFrames](#Exploring-and-analyzing-data-with-DataFrames)
    - Transformations and actions on DataFrames
    - Aggregation, grouping, sampling, ordering data with DataFrame
    - Working with Joins in DataFrames + Using Broadcast Variables and Accumulators with DataFrame    

## Load the data and get quick sense of data

__Read the compressed file content using Python's lzma library__

In [1]:
filePath="../02/NYPD_7_Major_Felony_Incidents.xz"
## For Linux "file:///Users/tirthalp/something/gs-spark-python/notebooks/02/NYPD_7_Major_Felony_Incidents.xz"
## For Windows "C:\\Users\\tirthalp\\something\\gs-spark-python\\notebooks\\02\\NYPD_7_Major_Felony_Incidents.xz"

import lzma
with lzma.open(filePath, 'rt') as f:
    file_content = list(f)

print("Type of file_content variable = ", type(file_content))
print(file_content[0])
print(file_content[1])

Type of file_content variable =  <class 'list'>
OBJECTID,Identifier,Occurrence Date,Day of Week,Occurrence Month,Occurrence Day,Occurrence Year,Occurrence Hour,CompStat Month,CompStat Day,CompStat Year,Offense,Offense Classification,Sector,Precinct,Borough,Jurisdiction,XCoordinate,YCoordinate,Location 1

1,f070032d,09/06/1940 07:30:00 PM,Friday,Sep,6,1940,19,9,7,2010,BURGLARY,FELONY,D,66,BROOKLYN,N.Y. POLICE DEPT,987478,166141,"(40.6227027620001, -73.9883732929999)"



__Convert Python list into Spark's DataFrame__

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
                    .appName("Crime Analysis in New York City using Spark DataFrame")\
                    .getOrCreate()

### ---> Parse header and unstructured data, and prepare RDD of Crime objects

data = spark.sparkContext.parallelize(file_content)

# Clean '\n' at the end of each line
data = data.map(lambda x:x.replace("\n",""))

# Get the header row
header = data.first()

# Filter the header row
dataWoHeader = data.filter(lambda x: x!=header)

# How to transform records of string to named tuples / Parse the rows to extract fields?
import csv
from io import StringIO
from collections import namedtuple

fields = header.replace(" ","_").replace("/","_").split(",")

# Note: the 'verbose' keyword was removed from namedtuple in Python 3.7, so it is omitted here
Crime = namedtuple('Crime', fields)

def parse(row):
    reader = csv.reader(StringIO(row))
    row = next(reader)
    return Crime(*row)

# Transform String to Crime object
crimesRdd=dataWoHeader.map(parse)

type(crimesRdd)

pyspark.rdd.PipelinedRDD
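
Why does `parse` use `csv.reader` rather than a plain `split(",")`? The `Location 1` field holds a quoted value with a comma inside it. The following is a minimal, Spark-free sketch of the difference. It assumes a shortened four-column header for brevity (the real file has 20 columns) and reuses values from the first data row printed above.

```python
import csv
from collections import namedtuple
from io import StringIO

# Shortened header and row (assumption: only 4 of the 20 real columns are kept)
header = "OBJECTID,Identifier,Occurrence Date,Location 1"
line = '1,f070032d,09/06/1940 07:30:00 PM,"(40.6227027620001, -73.9883732929999)"'

fields = header.replace(" ", "_").split(",")
Crime = namedtuple("Crime", fields)

# Naive split breaks the quoted coordinate pair into two pieces
naive = line.split(",")

# csv.reader honours the quotes and keeps the pair together
parsed = next(csv.reader(StringIO(line)))
crime = Crime(*parsed)

print(len(naive), len(parsed))  # 5 4
print(crime.Location_1)         # (40.6227027620001, -73.9883732929999)
```

With the naive split, `Crime(*naive)` would fail because five values do not fit four fields.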

In [3]:
### ---> RDD to DataFrame

crimesDf = crimesRdd.toDF()

type(crimesDf)

pyspark.sql.dataframe.DataFrame

## Exploring and analyzing data with DataFrames

### Do data exploration and simple transformations

In [4]:
# What's the schema?
crimesDf.printSchema()

root
 |-- OBJECTID: string (nullable = true)
 |-- Identifier: string (nullable = true)
 |-- Occurrence_Date: string (nullable = true)
 |-- Day_of_Week: string (nullable = true)
 |-- Occurrence_Month: string (nullable = true)
 |-- Occurrence_Day: string (nullable = true)
 |-- Occurrence_Year: string (nullable = true)
 |-- Occurrence_Hour: string (nullable = true)
 |-- CompStat_Month: string (nullable = true)
 |-- CompStat_Day: string (nullable = true)
 |-- CompStat_Year: string (nullable = true)
 |-- Offense: string (nullable = true)
 |-- Offense_Classification: string (nullable = true)
 |-- Sector: string (nullable = true)
 |-- Precinct: string (nullable = true)
 |-- Borough: string (nullable = true)
 |-- Jurisdiction: string (nullable = true)
 |-- XCoordinate: string (nullable = true)
 |-- YCoordinate: string (nullable = true)
 |-- Location_1: string (nullable = true)



In [5]:
# Explore data by taking subset of DataFrame
crimesDf.limit(3).show()

+--------+----------+--------------------+-----------+----------------+--------------+---------------+---------------+--------------+------------+-------------+-------------+----------------------+------+--------+---------+----------------+-----------+-----------+--------------------+
|OBJECTID|Identifier|     Occurrence_Date|Day_of_Week|Occurrence_Month|Occurrence_Day|Occurrence_Year|Occurrence_Hour|CompStat_Month|CompStat_Day|CompStat_Year|      Offense|Offense_Classification|Sector|Precinct|  Borough|    Jurisdiction|XCoordinate|YCoordinate|          Location_1|
+--------+----------+--------------------+-----------+----------------+--------------+---------------+---------------+--------------+------------+-------------+-------------+----------------------+------+--------+---------+----------------+-----------+-----------+--------------------+
|       1|  f070032d|09/06/1940 07:30:...|     Friday|             Sep|             6|           1940|             19|             9|         

In [6]:
# Use dropna to drop rows that contain nulls (i.e. values that are not available)
# crimesDf.dropna()

# For large analyses, drop the columns the analysis does not need; this improves performance
crimesDf = crimesDf.drop("Occurrence_Date", "Day_of_Week", "Occurrence_Month", "Occurrence_Day", "Occurrence_Hour", "CompStat_Month", "CompStat_Day", "Offense_Classification", "Sector", "XCoordinate", "YCoordinate", "Location_1")
crimesDf.cache()

crimesDf.show(3)

+--------+----------+---------------+-------------+-------------+--------+---------+----------------+
|OBJECTID|Identifier|Occurrence_Year|CompStat_Year|      Offense|Precinct|  Borough|    Jurisdiction|
+--------+----------+---------------+-------------+-------------+--------+---------+----------------+
|       1|  f070032d|           1940|         2010|     BURGLARY|      66| BROOKLYN|N.Y. POLICE DEPT|
|       2|  c6245d4d|           1968|         2008|GRAND LARCENY|      28|MANHATTAN|N.Y. POLICE DEPT|
|       3|  716dbc6f|           1970|         2008|     BURGLARY|      84| BROOKLYN|N.Y. POLICE DEPT|
+--------+----------+---------------+-------------+-------------+--------+---------+----------------+
only showing top 3 rows



In [7]:
# How to see unique values on particular column?
crimesDf.select('Offense').distinct().show()

+--------------------+
|             Offense|
+--------------------+
|      FELONY ASSAULT|
|                  NA|
|MURDER & NON-NEGL...|
|             ROBBERY|
|GRAND LARCENY OF ...|
|                RAPE|
|       GRAND LARCENY|
|            BURGLARY|
+--------------------+



In [8]:
# How to filter records with "NA" Offense type?
crimesDf = crimesDf.filter(crimesDf['offense'] != "NA").distinct()
crimesDf.select('Offense').distinct().show(10, False)

+------------------------------+
|Offense                       |
+------------------------------+
|FELONY ASSAULT                |
|MURDER & NON-NEGL. MANSLAUGHTE|
|ROBBERY                       |
|GRAND LARCENY OF MOTOR VEHICLE|
|RAPE                          |
|GRAND LARCENY                 |
|BURGLARY                      |
+------------------------------+



In [10]:
# Total records?
crimesDf.count()

1123464

### Sampling Data

In [11]:
# How to see certain fraction of data for the particular types of offense using sample(fraction=n)?
crimesDf.filter(crimesDf['offense'].isin(["BURGLARY", "ROBBERY"])).sample(fraction=0.1).limit(5).show()

+--------+----------+---------------+-------------+--------+--------+--------+----------------+
|OBJECTID|Identifier|Occurrence_Year|CompStat_Year| Offense|Precinct| Borough|    Jurisdiction|
+--------+----------+---------------+-------------+--------+--------+--------+----------------+
|    4830|  70ec8ef3|           2006|         2006|BURGLARY|      67|BROOKLYN|N.Y. POLICE DEPT|
|   13444|  bd764603|           2006|         2006| ROBBERY|     107|  QUEENS|N.Y. POLICE DEPT|
|   13578|  77c74c85|           2005|         2006|BURGLARY|      72|BROOKLYN|N.Y. POLICE DEPT|
|   16834|  591e423e|           2006|         2006|BURGLARY|      90|BROOKLYN|N.Y. POLICE DEPT|
|   35291|  7cd909b4|           2006|         2006|BURGLARY|      43|   BRONX|N.Y. POLICE DEPT|
+--------+----------+---------------+-------------+--------+--------+--------+----------------+
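
Note that `sample(fraction=0.1)` is not guaranteed to return exactly 10% of the rows. The sketch below illustrates the underlying idea in plain Python, under the assumption that each row is kept independently with probability `fraction`. The helper name `bernoulli_sample` and the seed value are hypothetical and not part of the Spark API.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(10_000))
sampled = bernoulli_sample(rows, fraction=0.1, seed=42)

# The size is close to, but usually not exactly, 10% of the input
print(len(sampled))

# The same seed reproduces the same sample
print(sampled == bernoulli_sample(rows, fraction=0.1, seed=42))
```

In Spark, passing `seed=` to `sample` serves the same purpose when a reproducible sample is needed.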



### Grouping, aggregation and ordering data

##### Year-on-Year Offenses growth pattern?

In [12]:
from pyspark.sql.functions import desc

crimesDf.groupBy("Occurrence_Year").count().withColumnRenamed("count", "offenses").orderBy(desc("Occurrence_Year")).show(50)

+---------------+--------+
|Occurrence_Year|offenses|
+---------------+--------+
|           2015|  102657|
|           2014|  106849|
|           2013|  111286|
|           2012|  111798|
|           2011|  107206|
|           2010|  105643|
|           2009|  106018|
|           2008|  117375|
|           2007|  120554|
|           2006|  127887|
|           2005|    3272|
|           2004|     692|
|           2003|     490|
|           2002|     368|
|           2001|     343|
|           2000|     282|
|           1999|     124|
|           1998|      74|
|           1997|      40|
|           1996|      34|
|           1995|      27|
|           1994|      19|
|           1993|      23|
|           1992|      12|
|           1991|      12|
|           1990|      17|
|           1989|      12|
|           1988|       6|
|           1987|       6|
|           1986|       2|
|           1985|       8|
|           1984|       4|
|           1983|       1|
|           1982|       5|
|

In [13]:
# Counts jump sharply from 2006 onwards; the sparse records for 2005 and earlier look like an anomaly
# So keep only occurrence years 2006 to 2015 (i.e. filter out 2005 or prior)

crimesDf = crimesDf.filter(crimesDf["Occurrence_Year"].isin(["2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015"]))

crimesDf.count()

1117273

##### Total number of offenses by type? Which offense has the highest count?

In [14]:
crimesDf.groupBy('Offense').count().orderBy(desc('count')).show(10, False)

+------------------------------+------+
|Offense                       |count |
+------------------------------+------+
|GRAND LARCENY                 |424635|
|ROBBERY                       |198569|
|BURGLARY                      |191045|
|FELONY ASSAULT                |183879|
|GRAND LARCENY OF MOTOR VEHICLE|101728|
|RAPE                          |12974 |
|MURDER & NON-NEGL. MANSLAUGHTE|4443  |
+------------------------------+------+



##### Sum of precincts by Offense?

In [15]:
# agg({"Precinct":"sum"}) = perform sum aggregation on 'Precinct' column using built-in agg function of Spark DF
# withColumnRenamed = returns a new DataFrame by renaming an existing column

crimesDf_offense_precinct_sum = crimesDf.filter(crimesDf["Precinct"] != "NA")\
                                        .groupBy("Offense")\
                                        .agg({"Precinct":"sum"})\
                                        .withColumnRenamed("sum(Precinct)","Precincts")

crimesDf_offense_precinct_sum.show()

+--------------------+-----------+
|             Offense|  Precincts|
+--------------------+-----------+
|      FELONY ASSAULT|1.1921206E7|
|MURDER & NON-NEGL...|   297252.0|
|             ROBBERY|1.2970707E7|
|GRAND LARCENY OF ...|  7586978.0|
|                RAPE|   855586.0|
|       GRAND LARCENY|2.3469435E7|
|            BURGLARY|1.3247918E7|
+--------------------+-----------+



In [16]:
# How to cast the scientific notation above (e.g. the 2.3469435E7 value in the Precincts column) to a plain number?

# withColumn = returns a new DataFrame by adding a column or replacing the existing column that has the same name

from pyspark.sql.types import DecimalType

crimesDf_offense_precinct_sum = crimesDf_offense_precinct_sum\
                                        .withColumn('Precincts', crimesDf_offense_precinct_sum.Precincts.cast(DecimalType(18, 0)))\
                                        .orderBy("Offense", desc("Precincts"))

crimesDf_offense_precinct_sum.show(10, False)

+------------------------------+---------+
|Offense                       |Precincts|
+------------------------------+---------+
|BURGLARY                      |13247918 |
|FELONY ASSAULT                |11921206 |
|GRAND LARCENY                 |23469435 |
|GRAND LARCENY OF MOTOR VEHICLE|7586978  |
|MURDER & NON-NEGL. MANSLAUGHTE|297252   |
|RAPE                          |855586   |
|ROBBERY                       |12970707 |
+------------------------------+---------+
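
The `E7` notation is only how a double is displayed; the cast does not change the value. A quick plain-Python check using the GRAND LARCENY figure from the output above (here `Decimal.quantize` stands in for Spark's `DecimalType(18, 0)` as a rough analogy, not an exact equivalent):

```python
from decimal import Decimal

raw = "2.3469435E7"  # value as displayed by show() before the cast

as_float = float(raw)
as_decimal = Decimal(raw).quantize(Decimal("1"))  # zero digits after the decimal point

print(int(as_float))  # 23469435
print(as_decimal)     # 23469435
```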



##### Top 3 offenses by total incident occurrences and % of incidents?

In [17]:
# Incidents by Offense?

crimesDf_offense_incidents = crimesDf.groupBy("Offense").count().withColumnRenamed("count","Incidents")

crimesDf_offense_incidents.show()

+--------------------+---------+
|             Offense|Incidents|
+--------------------+---------+
|      FELONY ASSAULT|   183879|
|MURDER & NON-NEGL...|     4443|
|             ROBBERY|   198569|
|GRAND LARCENY OF ...|   101728|
|                RAPE|    12974|
|       GRAND LARCENY|   424635|
|            BURGLARY|   191045|
+--------------------+---------+



In [18]:
# Total number of incidents across all Offenses?

crimesDf_offense_total = crimesDf_offense_incidents.agg({"Incidents": "sum"})

crimesDf_offense_total.show()

+--------------+
|sum(Incidents)|
+--------------+
|       1117273|
+--------------+



In [19]:
total_offense_incidents = crimesDf_offense_total.collect()[0][0]

total_offense_incidents

1117273

In [20]:
# Alias Spark's round so that it does not shadow Python's built-in round()
from pyspark.sql.functions import round as spark_round

# Adding new column for % of Incident calculation
crimesDf_offense_percentage_incident = crimesDf_offense_incidents.withColumn(
    "% Incident",
    spark_round(crimesDf_offense_incidents.Incidents / total_offense_incidents * 100, 2))

crimesDf_offense_percentage_incident.printSchema()

root
 |-- Offense: string (nullable = true)
 |-- Incidents: long (nullable = false)
 |-- % Incident: double (nullable = true)



In [21]:
# Order by % Incident column to get top 3 offense with highest incidents

crimesDf_offense_percentage_incident.orderBy(crimesDf_offense_percentage_incident[2].desc()).limit(3).show()

+-------------+---------+----------+
|      Offense|Incidents|% Incident|
+-------------+---------+----------+
|GRAND LARCENY|   424635|     38.01|
|      ROBBERY|   198569|     17.77|
|     BURGLARY|   191045|      17.1|
+-------------+---------+----------+
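
The percentages can be cross-checked by hand. The sketch below repeats the calculation in plain Python, using the incident counts printed earlier in this section. It is only an arithmetic check, not a replacement for the DataFrame code.

```python
# Incident counts copied from the "Incidents by Offense" output
incidents = {
    "FELONY ASSAULT": 183879,
    "MURDER & NON-NEGL. MANSLAUGHTE": 4443,
    "ROBBERY": 198569,
    "GRAND LARCENY OF MOTOR VEHICLE": 101728,
    "RAPE": 12974,
    "GRAND LARCENY": 424635,
    "BURGLARY": 191045,
}

total = sum(incidents.values())

# % Incident = Incidents / total * 100, rounded to 2 decimals
percentages = {offense: round(n / total * 100, 2) for offense, n in incidents.items()}

# Order by percentage, highest first, and keep the top 3
top3 = sorted(percentages.items(), key=lambda kv: kv[1], reverse=True)[:3]

print(total)  # 1117273
print(top3)   # [('GRAND LARCENY', 38.01), ('ROBBERY', 17.77), ('BURGLARY', 17.1)]
```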



##### More Aggregation functions

In [22]:
# Multiple aggregation operations can be applied
crimesDf.agg({"Precinct" : "sum", "Occurrence_Year" : "max"}).show()

+--------------------+-------------+
|max(Occurrence_Year)|sum(Precinct)|
+--------------------+-------------+
|                2015|  7.0349082E7|
+--------------------+-------------+



In [23]:
# Usage of describe
crimesDf.select("Precinct").filter(crimesDf["Precinct"] != "NA").describe().show()

+-------+-----------------+
|summary|         Precinct|
+-------+-----------------+
|  count|          1117256|
|   mean|62.96594692711429|
| stddev|34.90093348021095|
|    min|                1|
|    max|               94|
+-------+-----------------+
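
One caveat when reading this output: `Precinct` is still a string column, so `min` and `max` appear to be compared as text rather than as numbers. Precinct 107 shows up in the sampling output earlier, yet `max` reports 94. A small plain-Python sketch of the same effect, using a handful of precinct values seen in the outputs above:

```python
precincts = ["1", "66", "94", "107"]

# String comparison is character by character, so "94" beats "107"
print(max(precincts))                  # 94

# Numeric comparison after converting to int
print(max(int(p) for p in precincts))  # 107
```

Casting the column to an integer type before calling `describe` would likely give the numeric maximum instead.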



##### How to view information in matrix form using crosstab function?

In [24]:
# Total occurrences of each Offense per Borough?

crimesDf.filter(crimesDf["Borough"] != "")\
        .filter(crimesDf["Borough"] != "(null)")\
        .crosstab("Borough", "Offense")\
        .select("Borough_Offense", "FELONY ASSAULT", "ROBBERY", "RAPE", "GRAND LARCENY", "BURGLARY")\
        .show()

+---------------+--------------+-------+----+-------------+--------+
|Borough_Offense|FELONY ASSAULT|ROBBERY|RAPE|GRAND LARCENY|BURGLARY|
+---------------+--------------+-------+----+-------------+--------+
|  STATEN ISLAND|          5505|   4442| 500|        11244|    6926|
|         QUEENS|         35240|  40785|2975|        80704|   49648|
|      MANHATTAN|         32979|  38478|2803|       165896|   35044|
|       BROOKLYN|         62406|  69603|3850|       112922|   65841|
|          BRONX|         47748|  45260|2846|        53867|   33584|
+---------------+--------------+-------+----+-------------+--------+
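
Conceptually, `crosstab` builds a pairwise frequency table: one row per distinct value of the first column, one column per distinct value of the second, and zero where a pair never occurs. A minimal plain-Python sketch of that idea, using a few made-up rows (the data here is hypothetical and only for illustration):

```python
from collections import Counter

# Hypothetical (Borough, Offense) pairs
rows = [
    ("BROOKLYN", "BURGLARY"),
    ("BROOKLYN", "ROBBERY"),
    ("QUEENS", "BURGLARY"),
    ("BROOKLYN", "BURGLARY"),
]

pair_counts = Counter(rows)
boroughs = sorted({borough for borough, _ in rows})
offenses = sorted({offense for _, offense in rows})

# Counter returns 0 for pairs that never occur, which fills the empty cells
table = {b: {o: pair_counts[(b, o)] for o in offenses} for b in boroughs}

for borough, counts in table.items():
    print(borough, counts)
```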



### Working with Joins in DataFrames + Using Broadcast Variables and Accumulators with DataFrame 

##### Broadcast Variables and Accumulators concepts in Spark

* Spark is written in Scala and relies heavily on __closures__. (_A closure is a function that uses one or more free variables and whose return value depends on them. A free variable is defined outside the function and is not passed in as a parameter; the function itself holds no value for it and picks it up from the enclosing scope. The free variable is what distinguishes a closure from a normal function._) A closure also retains its copies of the captured variables, even after the outer scope ceases to exist.
* In the Spark 2.x architecture, the tasks that run on individual workers are closures, and every task carries its own copy of the variables it references. As a result, many copies are shipped from the master to the worker nodes, one per task, and shuffling adds further cost. Shared variables across the tasks on a worker node address this problem. Spark provides two kinds of shared variables: (1) Broadcast variables (2) Accumulators
* __Broadcast Variables__:     
    - Shared, read-only variables
    - One copy per node (Not one copy per task)
    - Distributed efficiently by Spark
    - All nodes in the cluster take part in the distribution (i.e. peer-to-peer copying too)
    - No shuffling
    - Cached in memory on each node, so can be large, but not too large
    - Use whenever tasks across stages need the same data, e.g. to share a dataset such as ML training data or a static lookup table with all nodes
* __Accumulators__:    
    - Read-write shared variables
    - Updated through an operation that is commutative { e.g. A + B = B + A } and associative { e.g. A + (B + C) = (A + B) + C }
    - Spark natively supports accumulators of type Long, Double and Collections; custom types can be added by subclassing AccumulatorV2
    - Tasks on workers can only modify (add to) the state
    - Only the driver program can read state
    - Use for counters or sums
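
The two ideas above can be sketched in plain Python, without Spark. The first part shows a closure capturing a free variable; the second shows why an accumulator's operation must be commutative and associative, so that partial results can be merged in any order. The function name `make_penalty_lookup` and all values are hypothetical, chosen only for illustration.

```python
from functools import reduce
from itertools import permutations

def make_penalty_lookup(penalties):
    # `penalties` is a free variable for the inner function:
    # it is neither defined inside penalty_for nor passed to it
    def penalty_for(offense):
        return penalties[offense]
    return penalty_for

penalty_for = make_penalty_lookup({"ROBBERY": 150, "RAPE": 750})

# make_penalty_lookup has already returned, yet the closure still sees the dict
print(penalty_for("RAPE"))  # 750

# Partial sums as they might arrive from three tasks, in no fixed order
partial_sums = [300.0, 750.0, 150.0]

# Merging in every possible order yields a single distinct total
totals = {reduce(lambda a, b: a + b, order) for order in permutations(partial_sums)}
print(totals)  # {1200.0}
```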

##### Load another data set and get quick sense of the data

In [26]:
# Dataset 1: Offenses

offensesDf = spark.read.format("csv").option("header", "true").load("Offense.csv")

type(offensesDf)

pyspark.sql.dataframe.DataFrame

In [27]:
offensesDf.printSchema()

root
 |-- Offense: string (nullable = true)
 |-- Penalty_USD: string (nullable = true)
 |-- Punishment_Severity: string (nullable = true)



In [28]:
offensesDf.show(10, False)

+------------------------------+-----------+-------------------+
|Offense                       |Penalty_USD|Punishment_Severity|
+------------------------------+-----------+-------------------+
|BURGLARY                      |100        |LOW                |
|FELONY ASSAULT                |500        |MEDIUM             |
|GRAND LARCENY                 |250        |LOW                |
|GRAND LARCENY OF MOTOR VEHICLE|250        |LOW                |
|MURDER & NON-NEGL. MANSLAUGHTE|1000       |HIGH               |
|RAPE                          |750        |HIGH               |
|ROBBERY                       |150        |LOW                |
+------------------------------+-----------+-------------------+



In [39]:
# Drop unwanted columns

crimesDf = crimesDf.drop("OBJECTID", "Identifier", "CompStat_Year", "Precinct", "Jurisdiction")

crimesDf.printSchema()

root
 |-- Occurrence_Year: string (nullable = true)
 |-- Offense: string (nullable = true)
 |-- Borough: string (nullable = true)



In [36]:
crimesDf.show(5, False)

+--------+----------+---------------+-------------+---------+
|OBJECTID|Identifier|Occurrence_Year|Offense      |Borough  |
+--------+----------+---------------+-------------+---------+
|271     |ee300b6c  |2006           |GRAND LARCENY|BRONX    |
|485     |9a3f0b4a  |2006           |GRAND LARCENY|MANHATTAN|
|719     |f714294a  |2006           |GRAND LARCENY|MANHATTAN|
|787     |77b2fd2d  |2006           |BURGLARY     |QUEENS   |
|904     |5ec3445c  |2006           |GRAND LARCENY|QUEENS   |
+--------+----------+---------------+-------------+---------+
only showing top 5 rows



In [38]:
offensesDf.count(), crimesDf.count()

(7, 1117273)

##### Join two datasets  - Inner join

In [40]:
crime_details_temp = crimesDf.join(offensesDf, crimesDf.Offense == offensesDf.Offense)

crime_details_temp.columns
# Note: the Offense column appears twice, because one comes from crimesDf and the other from offensesDf

['Occurrence_Year',
 'Offense',
 'Borough',
 'Offense',
 'Penalty_USD',
 'Punishment_Severity']

In [41]:
crime_details_temp.count()

1117273

In [42]:
crime_details_temp.limit(3).show()

+---------------+-------------+---------+-------------+-----------+-------------------+
|Occurrence_Year|      Offense|  Borough|      Offense|Penalty_USD|Punishment_Severity|
+---------------+-------------+---------+-------------+-----------+-------------------+
|           2006|GRAND LARCENY|    BRONX|GRAND LARCENY|        250|                LOW|
|           2006|GRAND LARCENY|MANHATTAN|GRAND LARCENY|        250|                LOW|
|           2006|GRAND LARCENY|MANHATTAN|GRAND LARCENY|        250|                LOW|
+---------------+-------------+---------+-------------+-----------+-------------------+



In [43]:
# Another join syntax: pass just the name(s) of the column(s) to join on

# use "how" to choose the join type = inner | left | right | full

crimesDf.join(offensesDf, ["Offense"], how="inner").show(3)

+-------------+---------------+---------+-----------+-------------------+
|      Offense|Occurrence_Year|  Borough|Penalty_USD|Punishment_Severity|
+-------------+---------------+---------+-----------+-------------------+
|GRAND LARCENY|           2006|    BRONX|        250|                LOW|
|GRAND LARCENY|           2006|MANHATTAN|        250|                LOW|
|GRAND LARCENY|           2006|MANHATTAN|        250|                LOW|
+-------------+---------------+---------+-----------+-------------------+
only showing top 3 rows



##### Joins using Broadcast variables

Joining huge datasets is an expensive operation. In such cases, it is recommended to broadcast one of the DataFrames so that its data is shared across tasks.

Always broadcast the smaller DataFrame to all nodes, and only when it is small enough to fit in memory on each node. For example, offensesDf (7 rows) is a good candidate for broadcasting, whereas crimesDf (1117273 rows) is not.

In [45]:
from pyspark.sql.functions import broadcast

crime_details = crimesDf.select("Offense", "Occurrence_Year")\
                        .join(broadcast(offensesDf), ["Offense"], "inner")

crime_details.count()

1117273

In [55]:
crime_details.sort(crime_details.Offense.desc()).show(5)

+-------+---------------+-----------+-------------------+
|Offense|Occurrence_Year|Penalty_USD|Punishment_Severity|
+-------+---------------+-----------+-------------------+
|ROBBERY|           2006|        150|                LOW|
|ROBBERY|           2006|        150|                LOW|
|ROBBERY|           2006|        150|                LOW|
|ROBBERY|           2006|        150|                LOW|
|ROBBERY|           2006|        150|                LOW|
+-------+---------------+-----------+-------------------+
only showing top 5 rows
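
The intuition behind a broadcast join can be sketched in plain Python: the small table is turned into an in-memory lookup that every task holds, and the large table is streamed past it without being shuffled. This is a simplified illustration under that assumption, not Spark's actual implementation; the sample rows are hypothetical, with penalty values taken from the offenses table above.

```python
# Small side: fits in memory, so it is kept as a dict keyed on the join column
offenses = {
    "ROBBERY": ("150", "LOW"),
    "RAPE": ("750", "HIGH"),
}

# Large side: streamed row by row (hypothetical rows)
crimes = [
    ("ROBBERY", "2006"),
    ("RAPE", "2007"),
    ("UNKNOWN", "2008"),  # no match, so an inner join drops it
]

# Inner join: keep only rows whose key exists in the lookup
joined = [
    (offense, year, *offenses[offense])
    for offense, year in crimes
    if offense in offenses
]

for row in joined:
    print(row)
```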



##### Sum using Accumulators

In [57]:
################# Approach 1: Punishment Penalty sum by Severity using Accumulator

# declare accumulator

low_severity_penalty_sum = spark.sparkContext.accumulator(0.0)
medium_severity_penalty_sum = spark.sparkContext.accumulator(0.0)
high_severity_penalty_sum = spark.sparkContext.accumulator(0.0)

In [61]:
def cal_penalty_by_severity(row):
    severity = row.Punishment_Severity
    penalty = float(row.Penalty_USD)

    # write / add to accumulator
    if severity == "HIGH":
        high_severity_penalty_sum.add(penalty)
    elif severity == "MEDIUM":
        medium_severity_penalty_sum.add(penalty)
    elif severity == "LOW":
        low_severity_penalty_sum.add(penalty)

In [62]:
crime_details.foreach(lambda x: cal_penalty_by_severity(x))

In [63]:
# read accumulators

all_penalty_sum_by_severity = (low_severity_penalty_sum.value, medium_severity_penalty_sum.value, high_severity_penalty_sum.value)

all_penalty_sum_by_severity

(180480600.0, 91939500.0, 14173500.0)

In [68]:
################# Approach 2: Punishment Penalty sum by Severity using inbuilt agg function of Spark DataFrame

all_penalty_sum = crime_details.groupby("Punishment_Severity")\
                               .agg({"Penalty_USD": "sum"})\
                               .withColumnRenamed("sum(Penalty_USD)","Total_Penalty")

all_penalty_sum.withColumn('Total_Penalty', all_penalty_sum.Total_Penalty.cast(DecimalType(18, 1)))\
               .orderBy(desc("Total_Penalty"))\
               .show()

+-------------------+-------------+
|Punishment_Severity|Total_Penalty|
+-------------------+-------------+
|                LOW|  180480600.0|
|             MEDIUM|   91939500.0|
|               HIGH|   14173500.0|
+-------------------+-------------+
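
Both approaches agree, and the totals can also be verified by hand: each offense contributes its incident count multiplied by its penalty. The plain-Python sketch below repeats that arithmetic with the incident counts and the offenses table printed earlier in this notebook.

```python
# Incident counts per offense (from the "Incidents by Offense" output)
incidents = {
    "FELONY ASSAULT": 183879,
    "MURDER & NON-NEGL. MANSLAUGHTE": 4443,
    "ROBBERY": 198569,
    "GRAND LARCENY OF MOTOR VEHICLE": 101728,
    "RAPE": 12974,
    "GRAND LARCENY": 424635,
    "BURGLARY": 191045,
}

# (Penalty_USD, Punishment_Severity) per offense (from the offensesDf output)
offense_info = {
    "BURGLARY": (100, "LOW"),
    "FELONY ASSAULT": (500, "MEDIUM"),
    "GRAND LARCENY": (250, "LOW"),
    "GRAND LARCENY OF MOTOR VEHICLE": (250, "LOW"),
    "MURDER & NON-NEGL. MANSLAUGHTE": (1000, "HIGH"),
    "RAPE": (750, "HIGH"),
    "ROBBERY": (150, "LOW"),
}

# Sum count * penalty into one bucket per severity
totals = {}
for offense, count in incidents.items():
    penalty, severity = offense_info[offense]
    totals[severity] = totals.get(severity, 0) + count * penalty

# Expected: LOW 180480600, MEDIUM 91939500, HIGH 14173500
for severity in ("LOW", "MEDIUM", "HIGH"):
    print(severity, totals[severity])
```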

