<a href="https://colab.research.google.com/github/pepe54aguilar/EDEM_MDA2324/blob/Spark/school_shootings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DataFrames Basics

## Prerequisites

Install Java and Spark on the VM

In [4]:
# install Java8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download spark 3.5.0
!wget -q https://apache.osuosl.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

In [5]:
ls -l # check the .tgz is there

total 392824
-rw-r--r-- 1 root root   1432942 Jan 15 18:14 INCIDENT.csv
drwxr-xr-x 1 root root      4096 Jan 11 17:02 sample_data/
-rw-r--r-- 1 root root    215204 Jan 15 18:14 SHOOTER.csv
-rw-r--r-- 1 root root 400395283 Sep  9 02:10 spark-3.5.0-bin-hadoop3.tgz
-rw-r--r-- 1 root root    131060 Jan 15 18:14 VICTIM.csv
-rw-r--r-- 1 root root     68620 Jan 15 18:14 WEAPON.csv


In [6]:
# extract the archive
!tar xf spark-3.5.0-bin-hadoop3.tgz

In [7]:
!pip install -q findspark

In [8]:

!pip install py4j

# For maps
!pip install folium
!pip install plotly



Define the environment variables

In [9]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"
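
A wrong path in `JAVA_HOME` or `SPARK_HOME` tends to surface only later, as an opaque error when the session starts, so it can help to check both first. The following is a minimal sketch in plain Python; `missing_env_dirs` is a hypothetical helper name introduced here, not part of Spark or findspark.

```python
import os

def missing_env_dirs(names, environ=None):
    """Return the variable names that are unset or do not point to a directory."""
    environ = os.environ if environ is None else environ
    missing = []
    for name in names:
        path = environ.get(name)
        # An unset variable, an empty value, or a non-existent path all count as missing
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# In the notebook, an empty list would mean both paths exist:
# missing_env_dirs(["JAVA_HOME", "SPARK_HOME"])
```

Passing a plain dict as `environ` makes the check easy to try without touching the real environment.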

Start Spark Session

---

In [10]:
import findspark
findspark.init("spark-3.5.0-bin-hadoop3")  # SPARK_HOME

from pyspark.sql import SparkSession

# create the session
spark = SparkSession \
        .builder \
        .appName("DataFrames Basics") \
        .master("local[*]") \
        .getOrCreate()

spark.version

'3.5.0'

In [11]:
spark

In [12]:
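# Enable Arrow to speed up conversion between Spark and pandas DataFrames
# (in Spark 3.x the setting is named spark.sql.execution.arrow.pyspark.enabled)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")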

In [13]:
# Import sql functions
from pyspark.sql.functions import *

Download datasets

In [14]:
!mkdir -p /dataset
!wget -q https://raw.githubusercontent.com/paponsro/spark_edem_2324/master/dataset/cars.json -P /dataset
!wget -q https://raw.githubusercontent.com/paponsro/spark_edem_2324/master/dataset/movies.json -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/bank.csv -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/vehicles.csv -P /dataset
!ls /dataset

bank.csv  cars.json  movies.json  vehicles.csv


In [15]:
ls -l /dataset

total 1784
-rw-r--r-- 1 root root  461474 Jan 15 18:18 bank.csv
-rw-r--r-- 1 root root   74910 Jan 15 18:18 cars.json
-rw-r--r-- 1 root root 1274347 Jan 15 18:18 movies.json
-rw-r--r-- 1 root root    4370 Jan 15 18:18 vehicles.csv


In [16]:
from pyspark.sql.types import Row
from pyspark.sql.functions import *

Load the datasets

In [77]:
IncidentDF = spark.read.option("header", "true").option("delimiter", ",").csv("/content/INCIDENT.csv")
ShooterDF = spark.read.option("header", "true").option("delimiter", ",").csv("/content/SHOOTER.csv")
VictimDF = spark.read.option("header", "true").option("delimiter", ",").csv("/content/VICTIM.csv")
WeaponDF = spark.read.option("header", "true").option("delimiter", ",").csv("/content/WEAPON.csv")

IncidentDF.show(1)
ShooterDF.show(1)
VictimDF.show(1)
WeaponDF.show(1)


+-------------+--------------------+-----------+---------------+-----------+--------+-------+--------------------+-----------+-----+------------+---------------+--------------------+-------------+-----------+----------+--------------------+--------------------+--------------------+----------------+----------+--------+---------+----------------+-------+-----------------+------------+----------+-----------+------------------+
|  Incident_ID|             Sources|Number_News|Media_Attention|Reliability|    Date|Quarter|              School|       City|State|School_Level|       Location|       Location_Type|During_School|Time_Period|First_Shot|             Summary|           Narrative|           Situation|         Targets|Accomplice|Hostages|Barricade|Officer_Involved|Bullied|Domestic_Violence|Gang_Related|Preplanned|Shots_Fired|Active_Shooter_FBI|
+-------------+--------------------+-----------+---------------+-----------+--------+-------+--------------------+-----------+-----+------------

We are going to study US school shootings (1970-2022).

## Exercise 1.

Number of incidents in 52 years.

In [78]:
IncidentDF.select(countDistinct(IncidentDF.Incident_ID)).show()

+---------------------------+
|count(DISTINCT Incident_ID)|
+---------------------------+
|                       2088|
+---------------------------+



Number of incidents per year.

In [79]:
IncidentDF = IncidentDF.withColumn("year", substring(col("Date"), 1, 4))
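
The cell above relies on Spark's `substring(str, pos, len)` counting positions from 1, so `substring(col("Date"), 1, 4)` keeps the first four characters. As a rough illustration in plain Python (not Spark code), the same extraction might look like the sketch below. `spark_like_substring` is a hypothetical name, and the sample strings are made-up formats rather than values read from INCIDENT.csv.

```python
def spark_like_substring(text, pos, length):
    """Mimic Spark's substring(str, pos, len) for a positive, 1-based pos."""
    if text is None:
        return None  # Spark returns NULL for NULL input
    start = pos - 1  # convert the 1-based position to a 0-based index
    return text[start:start + length]

# Both made-up formats yield the same year as long as the year comes first
print(spark_like_substring("2012-12-14", 1, 4))  # 2012
print(spark_like_substring("20121214", 1, 4))    # 2012
```

The result is still a string, which is why `year` sorts lexicographically unless it is cast to an integer.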

In [80]:
shootings_per_year = IncidentDF.groupBy("year").agg(
    count("Incident_ID").alias("total_shootings"))
shootings_per_year.orderBy("total_shootings", ascending=False).show(15)

+----+---------------+
|year|total_shootings|
+----+---------------+
|2021|            251|
|2022|            147|
|2019|            119|
|2018|            118|
|2020|            113|
|2006|             59|
|2017|             58|
|2016|             49|
|2005|             47|
|1993|             47|
|2014|             46|
|2007|             44|
|2015|             40|
|1994|             39|
|1988|             38|
+----+---------------+
only showing top 15 rows



## Exercise 2.

Incident with the most deaths

In [81]:
totVictimsDF = VictimDF.groupBy("incidentid").agg(
    sum(when(col("injury") == "Fatal", 1).otherwise(0)).alias("total_fatalities"))

joinCondition = IncidentDF.Incident_ID == totVictimsDF.incidentid
deathsDF = IncidentDF.join(totVictimsDF, joinCondition, "inner")

selected_columns = ["Incident_ID", "total_fatalities", "Media_Attention", "State", "year", "Preplanned"]
select_deathsDF = deathsDF.select(*selected_columns)

select_deathsDF.orderBy("total_fatalities", ascending=False).show(10)


+-------------+----------------+---------------+-----+----+----------+
|  Incident_ID|total_fatalities|Media_Attention|State|year|Preplanned|
+-------------+----------------+---------------+-----+----+----------+
|20121214CTSAN|              26|  International|   CT|2012|        No|
|20220524TXROU|              20|  International|   TX|2022|       Yes|
|20180214FLMAP|              17|           NULL|   FL|2018|       Yes|
|19990420COCOL|              13|           NULL|   CO|1999|       Yes|
|20180518TXSAS|              10|       National|   TX|2018|       Yes|
|20050321MNRER|               8|           NULL|   MN|2005|       Yes|
|20061002PAWEN|               5|           NULL|   PA|2006|       Yes|
|19980324ARWEJ|               5|           NULL|   AR|1998|        No|
|19890117CACLS|               5|       National|   CA|1989|        No|
|20141024WAMAM|               4|       National|   WA|2014|        No|
+-------------+----------------+---------------+-----+----+----------+
only showing top 10 rows
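
The aggregation `sum(when(col("injury") == "Fatal", 1).otherwise(0))` is a conditional count: each victim row contributes 1 if its injury is exactly `"Fatal"` and 0 otherwise. A plain-Python sketch of the same idea, using made-up rows and a hypothetical function name, could be:

```python
from collections import defaultdict

def fatalities_per_incident(victims):
    """Count rows whose injury is exactly "Fatal", grouped by incident id."""
    totals = defaultdict(int)
    for incident_id, injury in victims:
        # Mirrors when(injury == "Fatal", 1).otherwise(0): non-fatal rows add 0,
        # so every incident with at least one victim row still gets an entry.
        totals[incident_id] += 1 if injury == "Fatal" else 0
    return dict(totals)

# Made-up rows, not taken from VICTIM.csv
rows = [("A", "Fatal"), ("A", "Wounded"), ("B", "Wounded"), ("A", "Fatal")]
print(fatalities_per_incident(rows))  # {'A': 2, 'B': 0}
```

The comparison is case-sensitive, so a value spelled differently (for example in lower case) would be counted as 0.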

Characteristics of the perpetrator of the attack with the most deaths.

In [82]:
joinCondition = deathsDF.Incident_ID == ShooterDF.incidentid
deaths_authorDF = ShooterDF.join(deathsDF, joinCondition, "inner")

deaths_authorDF.orderBy("total_fatalities", ascending=False).show(10)

+-------------+---+------+--------------------+-----------------+--------------------+-----------+-------+--------------------+--------------------+-----------------+--------------------+-------------+--------------------+-----------+---------------+-----------+----------+-------+--------------------+------------+-----+------------+--------------------+--------------------+-------------+-----------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------+----------+-----------+------------------+----+-------------+----------------+
|   incidentid|age|gender|                race|schoolaffiliation|      shooteroutcome|shooterdied| injury|        chargesfiled|             verdict|minorchargedadult|     criminalhistory|  Incident_ID|             Sources|Number_News|Media_Attention|Reliability|      Date|Qua

The average age of the victims per incident.

In [83]:
totVictims_ageDF = VictimDF.groupBy("incidentid").agg(
    sum(when(col("injury") == "Fatal", 1).otherwise(0)).alias("total_fatalities"),
    # round() runs on each age before avg(), so avg_age itself is left unrounded
    avg(round(col("age"), 2)).alias("avg_age")
)

totVictims_ageDF.orderBy("total_fatalities", ascending=False).show(10)

+-------------+----------------+------------------+
|   incidentid|total_fatalities|           avg_age|
+-------------+----------------+------------------+
|20121214CTSAN|              26|14.038461538461538|
|20220524TXROU|              20|              NULL|
|20180214FLMAP|              17|             19.92|
|19990420COCOL|              13|18.852941176470587|
|20180518TXSAS|              10|              25.3|
|20050321MNRER|               8|19.692307692307693|
|20061002PAWEN|               5|               9.4|
|19980324ARWEJ|               5|              15.4|
|19890117CACLS|               5|             7.125|
|20141024WAMAM|               4|              14.2|
+-------------+----------------+------------------+
only showing top 10 rows
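
Values such as `14.038461538461538` in `avg_age` appear because the rounding is applied to each age before the average is taken, not to the average itself. The plain-Python sketch below, with made-up ages, shows how the order of the two operations changes the result. In PySpark the reordered expression would presumably be `round(avg(col("age")), 2)`.

```python
from statistics import mean

ages = [7.0, 7.0, 8.0]  # made-up values, not from VICTIM.csv

# Rounding each value first, then averaging: the mean itself stays unrounded
round_then_avg = mean([round(a, 2) for a in ages])

# Averaging first, then rounding: the result has at most two decimals
avg_then_round = round(mean(ages), 2)

print(round_then_avg)
print(avg_then_round)  # 7.33
```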



The weapon used in the attack with the most deaths

In [84]:
joinCondition = deathsDF.Incident_ID == WeaponDF.incidentid
deaths_weaponDF = WeaponDF.join(deathsDF, joinCondition, "inner")

deaths_weaponDF.orderBy("total_fatalities", ascending=False).show(10)

+-------------+-------------+--------------------+---------------+-------------+--------------------+-----------+---------------+-----------+----------+-------+--------------------+---------+-----+------------+--------------------+--------------------+-------------+-----------------+----------+--------------------+--------------------+--------------------+-----------------+----------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------+----------+-----------+------------------+----+-------------+----------------+
|   incidentid|weaponcaliber|       weapondetails|     weapontype|  Incident_ID|             Sources|Number_News|Media_Attention|Reliability|      Date|Quarter|              School|     City|State|School_Level|            Location|       Location_Type|During_School|      Time_Period|First_Shot|             Summary|           Narrative|           Situation|          Targets|      Accomplice|            Ho

## Exercise 3.

Incidents per state

In [66]:
stateDF = IncidentDF.groupBy("State").agg(
    count("Incident_ID").alias("Incident_per_state"))

stateDF.orderBy("Incident_per_state", ascending=False).show(10)

+-----+------------------+
|State|Incident_per_state|
+-----+------------------+
|   CA|               216|
|   TX|               176|
|   FL|               119|
|   IL|               111|
|   PA|                87|
|   MI|                86|
|   OH|                86|
|   NY|                76|
|   GA|                71|
|   NC|                69|
+-----+------------------+
only showing top 10 rows



Comparison of incidents and deaths per state

In [76]:
deaths_stateDF = deathsDF.groupBy("State").agg(
    sum("total_fatalities").alias("deaths_per_state"))

deaths_stateDF = deaths_stateDF.withColumnRenamed("State", "id")

joinCondition = stateDF.State == deaths_stateDF.id
total_stateDF = stateDF.join(deaths_stateDF, joinCondition, "inner")

total_stateDF = total_stateDF.drop(col("id"))

total_stateDF = total_stateDF.withColumn(
    "death_per_incident", round(col("deaths_per_state") / col("Incident_per_state"), 2))

total_stateDF.orderBy("Incident_per_state", ascending=False).show(11)

+-----+------------------+----------------+------------------+
|State|Incident_per_state|deaths_per_state|death_per_incident|
+-----+------------------+----------------+------------------+
|   CA|               216|              85|              0.39|
|   TX|               176|              73|              0.41|
|   FL|               119|              48|               0.4|
|   IL|               111|              33|               0.3|
|   PA|                87|              29|              0.33|
|   MI|                86|              19|              0.22|
|   OH|                86|              20|              0.23|
|   NY|                76|              21|              0.28|
|   GA|                71|              16|              0.23|
|   NC|                69|               8|              0.12|
|   TN|                67|              22|              0.33|
+-----+------------------+----------------+------------------+
only showing top 11 rows
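
The `death_per_incident` column is the state's total fatalities divided by its incident count, rounded to two decimals, computed only for states present on both sides of the inner join. A minimal plain-Python sketch of that logic follows. The function name and the `"XX"` state are hypothetical, while the CA and TX figures are copied from the table above.

```python
def deaths_per_incident(incidents_by_state, deaths_by_state):
    """Inner-join two per-state dicts and compute deaths / incidents."""
    result = {}
    for state, incidents in incidents_by_state.items():
        if state not in deaths_by_state:
            continue  # inner join: states missing on either side are dropped
        result[state] = round(deaths_by_state[state] / incidents, 2)
    return result

# "XX" is a made-up state with no entry on the deaths side,
# included only to show that the inner join drops it.
incidents = {"CA": 216, "TX": 176, "XX": 3}
deaths = {"CA": 85, "TX": 73}
print(deaths_per_incident(incidents, deaths))  # {'CA': 0.39, 'TX': 0.41}
```

If states without any victim rows should be kept with a ratio of 0, a left join with the missing totals filled in would be the alternative to consider.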

