In [6]:
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.functions._ 
import org.apache.spark.ml.feature.StringIndexer 

## Data processing

In [7]:
val dataFile = "nassCDS.csv"
// Read the CSV with a header row and let Spark infer the column types
val data = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .option("sep", ",")
  .csv(dataFile)
data.printSchema


root
 |-- _c0: integer (nullable = true)
 |-- dvcat: string (nullable = true)
 |-- weight: double (nullable = true)
 |-- dead: string (nullable = true)
 |-- airbag: string (nullable = true)
 |-- seatbelt: string (nullable = true)
 |-- frontal: integer (nullable = true)
 |-- sex: string (nullable = true)
 |-- ageOFocc: integer (nullable = true)
 |-- yearacc: integer (nullable = true)
 |-- yearVeh: string (nullable = true)
 |-- abcat: string (nullable = true)
 |-- occRole: string (nullable = true)
 |-- deploy: integer (nullable = true)
 |-- injSeverity: string (nullable = true)
 |-- caseid: string (nullable = true)



## Accidents By Speed

In [8]:
// Distinct cases per (speed category, year); the aggregate column keeps Spark's default name
val accidentsBySpeed = data.groupBy(col("dvcat"), col("yearacc")).agg(countDistinct("caseid"))
accidentsBySpeed.groupBy(col("dvcat")).agg(sum("count(DISTINCT caseid)")).show

+-------+---------------------------+
|  dvcat|sum(count(DISTINCT caseid))|
+-------+---------------------------+
|  10-24|                      10112|
|1-9km/h|                        536|
|  25-39|                       6446|
|  40-54|                       2371|
|    55+|                       1205|
+-------+---------------------------+



Analysis: 
---------------
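Most recorded accidents fall in the lower speed categories: 10-24 has the most cases (10112), followed by 25-39 (6446). The 1-9km/h category is the exception, with the fewest (536).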

## Injury Risk based on Speeds

In [9]:
val injuryRisk = data.groupBy(col("dvcat"), col("injSeverity"))
  .agg(count(col("injSeverity")))
  .where(col("injSeverity") =!= "NA")  // =!= replaces the deprecated !== operator
  .sort(col("dvcat"), col("injSeverity"))
injuryRisk.show

+-------+-----------+------------------+
|  dvcat|injSeverity|count(injSeverity)|
+-------+-----------+------------------+
|1-9km/h|          0|               366|
|1-9km/h|          1|               144|
|1-9km/h|          2|                65|
|1-9km/h|          3|                90|
|1-9km/h|          4|                 4|
|1-9km/h|          5|                 7|
|  10-24|          0|              4530|
|  10-24|          1|              3377|
|  10-24|          2|              1863|
|  10-24|          3|              2817|
|  10-24|          4|               111|
|  10-24|          5|                68|
|  10-24|          6|                 1|
|  25-39|          0|              1365|
|  25-39|          1|              1642|
|  25-39|          2|              1623|
|  25-39|          3|              3220|
|  25-39|          4|               278|
|  25-39|          5|                36|
|  25-39|          6|                 1|
+-------+-----------+------------------+
only showing top

Analysis:
-----------------
1. The previous table showed that most recorded accidents are in the lower speed categories. This table shows that injury severity is low at the lowest speeds: in the 1-9km/h category most occupants (366 of 676) have severity 0.
2. The previous table also showed fewer accidents at higher speeds, but severity rises with speed: in the 25-39 category severity 3 is the most common level (3220 of 8165 occupants).

As speed increases, injury severity levels increase.
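
The trend can be made concrete by turning the counts into shares. The following is a minimal sketch in plain Scala, with the counts copied by hand from the output above. Only the three speed categories visible in the truncated output are included. The names `severityCounts` and `shareAtOrAbove` are illustrative, and the cutoff at level 3 is an assumption chosen for the example, not a definition from the dataset.

```scala
// Counts copied by hand from the injuryRisk output above (severity levels 0 to 6).
// Assumption: 1-9km/h has no level-6 row in the output, so that count is taken as 0.
val severityCounts: Seq[(String, Vector[Int])] = Seq(
  "1-9km/h" -> Vector(366, 144, 65, 90, 4, 7, 0),
  "10-24"   -> Vector(4530, 3377, 1863, 2817, 111, 68, 1),
  "25-39"   -> Vector(1365, 1642, 1623, 3220, 278, 36, 1)
)

// Percentage of occupants at severity level `threshold` or higher.
// The default cutoff of 3 is an illustrative choice, not something the dataset defines.
def shareAtOrAbove(counts: Vector[Int], threshold: Int = 3): Double =
  counts.drop(threshold).sum * 100.0 / counts.sum

severityCounts.foreach { case (category, counts) =>
  val share = shareAtOrAbove(counts)
  println(f"$category%-8s $share%.1f%% at level 3 or higher (n=${counts.sum})")
}
```

Comparing shares rather than raw counts keeps the much larger 10-24 category from dominating the comparison.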

## Survival Rate based on Age

In [10]:
import org.apache.spark.ml.feature.{Bucketizer, StringIndexer}
import org.apache.spark.sql.functions._

// Bucket ages into 5-year ranges: [0, 5), [5, 10), ..., [95, 100]
val splits = (0 to 20).map(_ * 5.0).toArray
val bucketizer = new Bucketizer().setInputCol("ageOFocc").setOutputCol("ageRange").setSplits(splits)
val bucketed = bucketizer.transform(data)

// Turn a bucket index into a label such as "[15.0-20.0)"
val ageGroupLabel = udf { (bucket: Double) =>
  val lower = bucket * 5.0
  s"[$lower-${lower + 5.0})"
}

// Occupants per (dead, ageRange); after the "alive" filter below these are survivor counts
val countsByStatus = bucketed.groupBy(col("dead"), col("ageRange"))
  .agg(count(col("ageOFocc")).as("No of Survivors"))
  .withColumn("ageGroup", ageGroupLabel(col("ageRange")))

// Total occupants per age bucket
val totals = countsByStatus.groupBy(col("ageRange"))
  .agg(sum(col("No of Survivors")).as("No of Passengers in Each group"))

val answer = countsByStatus.join(totals, Seq("ageRange"))
  .where(col("dead") === "alive")
  .withColumn("Percentage of Survivors", col("No of Survivors") * 100 / col("No of Passengers in Each group"))
  .sort("ageRange")
answer.select(col("ageGroup"), col("No of Passengers in Each group"), col("No of Survivors"), col("Percentage of Survivors")).show

// Correlation between age and the indexed survival status
val ageAndStatus = data.select("ageOFocc", "dead").na.drop("any", Seq("ageOFocc"))
val deadIndexer = new StringIndexer().setInputCol("dead").setOutputCol("deadIndex")
val indexed = deadIndexer.fit(ageAndStatus).transform(ageAndStatus)

indexed.stat.corr("ageOFocc", "deadIndex")

+------------+------------------------------+---------------+-----------------------+
|    ageGroup|No of Passengers in Each group|No of Survivors|Percentage of Survivors|
+------------+------------------------------+---------------+-----------------------+
| [15.0-20.0)|                          4163|           4035|      96.92529425894787|
| [20.0-25.0)|                          4107|           3963|      96.49379108838568|
| [25.0-30.0)|                          3033|           2929|       96.5710517639301|
| [30.0-35.0)|                          2568|           2475|       96.3785046728972|
| [35.0-40.0)|                          2474|           2376|      96.03880355699272|
| [40.0-45.0)|                          2167|           2079|      95.93908629441624|
| [45.0-50.0)|                          1776|           1684|      94.81981981981981|
| [50.0-55.0)|                          1409|           1351|      95.88360539389637|
| [55.0-60.0)|                          1004|         

0.0898135301590321

Analysis:  
--------------------------
The correlation between age and deadIndex is about 0.09, which is close to zero, but the table shows a small effect of age on survival: the survival rate decreases slightly as age increases.


## Accidents By Year

In [20]:
// One row per distinct (caseid, yearacc) pair, i.e. one row per case and year
val accidentsByYear = data.groupBy(col("caseid"), col("yearacc")).agg(countDistinct("caseid"))

accidentsByYear.groupBy("yearacc").agg(count(col("caseid"))).show

+-------+-------------+
|yearacc|count(caseid)|
+-------+-------------+
|   1997|         3128|
|   1998|         3483|
|   2001|         3238|
|   2000|         3488|
|   1999|         3565|
|   2002|         3768|
+-------+-------------+



Analysis:
-----------------
In the given data the number of cases is higher in 2002 (3768) than in 1997 (3128), but it does not rise every year: it falls in 2000 and 2001. Before drawing a conclusion we need some additional information, such as:
1. The total number of vehicles on the road for each year from 1997 to 2002
2. The fraction of those vehicles that were involved in accidents

So we cannot say that accidents are increasing with each passing year, because we do not know the accident count relative to the number of vehicles. A proper analysis requires this additional data.
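
The year-to-year change in the raw counts can be checked directly from the table above. This is a minimal sketch in plain Scala with the counts copied by hand. The names `casesByYear` and `changes` are illustrative and not part of the notebook.

```scala
// Case counts copied by hand from the accidentsByYear output above.
val casesByYear = Seq(
  1997 -> 3128, 1998 -> 3483, 1999 -> 3565,
  2000 -> 3488, 2001 -> 3238, 2002 -> 3768
)

// Pair each year with the previous one and compute the change in case count.
val changes: List[(Int, Int)] = casesByYear.sortBy(_._1).sliding(2).collect {
  case Seq((_, previous), (year, current)) => year -> (current - previous)
}.toList

changes.foreach { case (year, delta) => println(f"$year: $delta%+d") }
```

These are changes in recorded cases only. They say nothing about accident rates, which would need the number of vehicles on the road in each year.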


## Injury Severity Comparison for Driver and Passenger

In [21]:
val injurySeverityByRole = data.groupBy(col("occRole"), col("injSeverity"))
  .agg(count("occRole"))
  .sort(col("occRole"), col("injSeverity"))
// =!= replaces the deprecated !== operator
injurySeverityByRole.select("occRole", "injSeverity", "count(occRole)").where(col("injSeverity") =!= "NA").show

+-------+-----------+--------------+
|occRole|injSeverity|count(occRole)|
+-------+-----------+--------------+
| driver|          0|          5183|
| driver|          1|          4363|
| driver|          2|          3254|
| driver|          3|          6785|
| driver|          4|           854|
| driver|          5|           101|
| driver|          6|             2|
|   pass|          0|          1296|
|   pass|          1|          1232|
|   pass|          2|           988|
|   pass|          3|          1710|
|   pass|          4|           264|
|   pass|          5|            32|
+-------+-----------+--------------+



Analysis:
----------------
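In absolute terms, drivers account for more injuries than passengers at every severity level. The dataset also contains far more drivers than passengers (20542 versus 5522 rows with a known severity), so the raw counts alone do not show that drivers are at greater risk.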
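
Because the two groups differ so much in size, it helps to compare the severity distribution within each role. This is a minimal sketch in plain Scala with the counts copied by hand from the table above. The names `byRole` and `distributionPct` are illustrative.

```scala
// Counts copied by hand from the output above, ordered by severity level starting at 0.
// The passenger group has no level-6 row in the output.
val byRole: Map[String, Vector[Int]] = Map(
  "driver" -> Vector(5183, 4363, 3254, 6785, 854, 101, 2),
  "pass"   -> Vector(1296, 1232, 988, 1710, 264, 32)
)

// Percentage of a role's occupants at each severity level.
def distributionPct(counts: Vector[Int]): Vector[Double] = {
  val total = counts.sum.toDouble
  counts.map(_ * 100.0 / total)
}

byRole.foreach { case (role, counts) =>
  val formatted = distributionPct(counts).map(p => f"$p%.1f").mkString(" ")
  println(s"$role (n=${counts.sum}): $formatted")
}
```

Each printed row lists the percentage at severity 0, 1, 2 and so on, which puts the two roles on the same scale.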

## Survival Rate by Sex

In [22]:
val survivalRateBysex=data.groupBy(col("sex"),col("dead")).agg(count(col("dead")))
val percentage=data.groupBy("sex").agg(count("sex"))

val join=survivalRateBysex.join(percentage,percentage.col("sex") === survivalRateBysex.col("sex"))
join.withColumn("Percentage",col("count(dead)")*100/col("count(sex)")).show


+---+-----+-----------+---+----------+------------------+
|sex| dead|count(dead)|sex|count(sex)|        Percentage|
+---+-----+-----------+---+----------+------------------+
|  f| dead|        464|  f|     12248|3.7883736120182885|
|  f|alive|      11784|  f|     12248| 96.21162638798171|
|  m| dead|        716|  m|     13969| 5.125635335385496|
|  m|alive|      13253|  m|     13969|  94.8743646646145|
+---+-----+-----------+---+----------+------------------+



Analysis
-------------
There is not much relation between sex and survival rate: about 96.2% of female occupants survived, compared with about 94.9% of male occupants. Ignoring how many men and women are on the road, males are involved in more of the recorded accidents than females (13969 versus 12248 occupants).


## Effect of seatbelt on survival rate

In [23]:
val seatBeltEffect=data.groupBy(col("seatBelt"),col("dead")).agg(count(col("seatBelt")))
seatBeltEffect.show


+--------+-----+---------------+
|seatBelt| dead|count(seatBelt)|
+--------+-----+---------------+
|    none| dead|            680|
|  belted| dead|            500|
|  belted|alive|          18073|
|    none|alive|           6964|
+--------+-----+---------------+



Analysis
--------------
The table above shows the effect of seatbelt use on occupant survival.
1. Belted occupants have a much lower death rate (500 of 18573, about 2.7%) than occupants with no seatbelt (680 of 7644, about 8.9%).
2. Occupants with no seatbelt account for more deaths in absolute terms (680 versus 500), even though they are the smaller group.
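
The death rate for each group follows directly from the four counts in the table. This is a minimal sketch in plain Scala with the counts copied by hand. The `Outcome` class and the `riskRatio` value are illustrative names, and the ratio is unadjusted: it ignores speed, airbags, and every other column.

```scala
// Dead and alive counts for one group of occupants.
final case class Outcome(dead: Int, alive: Int) {
  def total: Int = dead + alive
  def deathRatePct: Double = dead * 100.0 / total
}

// Counts copied by hand from the seatBeltEffect output above.
val bySeatbelt: Map[String, Outcome] = Map(
  "belted" -> Outcome(dead = 500, alive = 18073),
  "none"   -> Outcome(dead = 680, alive = 6964)
)

bySeatbelt.foreach { case (group, outcome) =>
  val rate = outcome.deathRatePct
  println(f"$group%-6s $rate%.2f%% of ${outcome.total} occupants died")
}

// Unadjusted ratio of the two death rates (unbelted relative to belted).
val riskRatio: Double = bySeatbelt("none").deathRatePct / bySeatbelt("belted").deathRatePct
println(f"death rate ratio, none vs belted: $riskRatio%.2f")
```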

## Effect of airbag and speed on survival rate

In [24]:
val airBagEffect=data.groupBy(col("airbag"),col("dead"),col("dvcat")).agg(count(col("dead"))).sort(col("dvcat"))
airBagEffect.show

+------+-----+-------+-----------+
|airbag| dead|  dvcat|count(dead)|
+------+-----+-------+-----------+
|airbag|alive|1-9km/h|        457|
|  none|alive|1-9km/h|        226|
|airbag| dead|1-9km/h|          3|
|airbag|alive|  10-24|       7747|
|airbag| dead|  10-24|         63|
|  none|alive|  10-24|       4987|
|  none| dead|  10-24|         51|
|  none|alive|  25-39|       3857|
|airbag| dead|  25-39|        130|
|  none| dead|  25-39|        174|
|airbag|alive|  25-39|       4053|
|airbag|alive|  40-54|       1208|
|airbag| dead|  40-54|        147|
|  none| dead|  40-54|        197|
|  none|alive|  40-54|       1425|
|airbag| dead|    55+|        168|
|  none|alive|    55+|        634|
|airbag|alive|    55+|        443|
|  none| dead|    55+|        247|
+------+-----+-------+-----------+



Analysis
------------
The table above shows the effect of airbags and speed on occupant survival.
1. In the two lowest speed categories (1-9km/h and 10-24), deaths are rare with or without an airbag: roughly 1% of occupants or fewer in each group.
2. From the 25-39 category upward, the death rate rises steeply with speed for both groups (from roughly 3-4% at 25-39 to roughly 28% at 55+) and is somewhat lower for occupants with an airbag in every category.
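
The per-category death rates can be derived from the rows of the table. This is a minimal sketch in plain Scala with the rows copied by hand. The names `rows`, `counts` and `deathRatePct` are illustrative. The output has no row for occupants without an airbag who died at 1-9km/h, so the sketch assumes that count is 0.

```scala
// Rows copied by hand from the airBagEffect output above: (airbag, dead, dvcat, count).
val rows: Seq[(String, String, String, Int)] = Seq(
  ("airbag", "alive", "1-9km/h", 457), ("none", "alive", "1-9km/h", 226),
  ("airbag", "dead", "1-9km/h", 3),
  ("airbag", "alive", "10-24", 7747), ("airbag", "dead", "10-24", 63),
  ("none", "alive", "10-24", 4987), ("none", "dead", "10-24", 51),
  ("none", "alive", "25-39", 3857), ("airbag", "dead", "25-39", 130),
  ("none", "dead", "25-39", 174), ("airbag", "alive", "25-39", 4053),
  ("airbag", "alive", "40-54", 1208), ("airbag", "dead", "40-54", 147),
  ("none", "dead", "40-54", 197), ("none", "alive", "40-54", 1425),
  ("airbag", "dead", "55+", 168), ("none", "alive", "55+", 634),
  ("airbag", "alive", "55+", 443), ("none", "dead", "55+", 247)
)

// Total count per (dvcat, airbag, status).
val counts: Map[(String, String, String), Int] =
  rows.groupBy { case (airbag, status, dvcat, _) => (dvcat, airbag, status) }
    .map { case (key, group) => key -> group.map(_._4).sum }

// Death rate in percent; a combination missing from the output is assumed to be 0.
def deathRatePct(dvcat: String, airbag: String): Double = {
  val dead = counts.getOrElse((dvcat, airbag, "dead"), 0)
  val alive = counts.getOrElse((dvcat, airbag, "alive"), 0)
  dead * 100.0 / (dead + alive)
}

for (dvcat <- Seq("1-9km/h", "10-24", "25-39", "40-54", "55+")) {
  val withAirbag = deathRatePct(dvcat, "airbag")
  val withoutAirbag = deathRatePct(dvcat, "none")
  println(f"$dvcat%-8s airbag $withAirbag%.2f%%   none $withoutAirbag%.2f%%")
}
```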

## Effect of airbag on survival rate

In [25]:
val airbagEffect=data.groupBy(col("airbag"),col("dead")).agg(count(col("airbag")))
airbagEffect.show

+------+-----+-------------+
|airbag| dead|count(airbag)|
+------+-----+-------------+
|airbag| dead|          511|
|airbag|alive|        13908|
|  none| dead|          669|
|  none|alive|        11129|
+------+-----+-------------+



Analysis:
---------------
1. The benefit of airbags is not consistent across speed ranges in the NASS CDS data.
2. At lower speeds, airbags have no identifiable benefit.
3. At higher speeds, airbags show a benefit.


## Frontal Effect

In [26]:
val frontalEffect = data.groupBy(col("frontal"), col("occRole"), col("injSeverity"))
  .agg(count(col("injSeverity")))
  .where(col("injSeverity") =!= "NA")  // =!= replaces the deprecated !== operator
  .where(col("occRole") === "driver")
  .sort(col("frontal"), col("injSeverity"))
frontalEffect.show

+-------+-------+-----------+------------------+
|frontal|occRole|injSeverity|count(injSeverity)|
+-------+-------+-----------+------------------+
|      0| driver|          0|              1720|
|      0| driver|          1|              1602|
|      0| driver|          2|               988|
|      0| driver|          3|              2398|
|      0| driver|          4|               404|
|      0| driver|          5|                36|
|      1| driver|          0|              3463|
|      1| driver|          1|              2761|
|      1| driver|          2|              2266|
|      1| driver|          3|              4387|
|      1| driver|          4|               450|
|      1| driver|          5|                65|
|      1| driver|          6|                 2|
+-------+-------+-----------+------------------+



Analysis
------------------------------
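In absolute counts, drivers in frontal impacts (frontal = 1) outnumber drivers in non-frontal impacts at every severity level. Injury severity level 3 is the most common level in both groups and is especially frequent for drivers in frontal impacts (4387). These are raw counts, not rates, and there are more frontal cases overall.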

Future Work
-------------------
Our analysis is based on what the NASS CDS dataset provides. Adding the following features would support a more complete analysis:
1. New data elements such as the direction of impact (front, rear, right, left), to help improve the corresponding safety measures
2. More detail on vehicle types such as motorcycles, medium and heavy trucks, motorcoaches, bicyclists, school buses, and low-speed vehicles
3. The number of vehicles on the road per year, to allow a more detailed analysis of the increase or decrease in accidents
4. More information about advanced vehicle technologies