# **Analysis of Airplane Crashes**

## Prerequisites

Install Spark and Java on the VM

In [4]:
# install Java8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download spark 3.5.1
!wget -q https://apache.osuosl.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

In [5]:
ls -l # check the .tgz is there

total 391068
drwxr-xr-x 1 root root      4096 Feb 22 14:24 sample_data/
-rw-r--r-- 1 root root 400446614 Feb 15 11:39 spark-3.5.1-bin-hadoop3.tgz


In [6]:
# extract the archive
!tar xf spark-3.5.1-bin-hadoop3.tgz

In [7]:
!pip install -q findspark

In [8]:

!pip install py4j

# For maps
!pip install folium
!pip install plotly



Define the environment

In [9]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"
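
Before starting Spark it can help to confirm that the directories configured above actually exist, since a wrong `JAVA_HOME` or `SPARK_HOME` tends to surface later as a much less readable startup error. A minimal sketch using only the standard library; `missing_dirs` is a hypothetical helper name, not part of findspark or PySpark:

```python
import os

def missing_dirs(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point to a directory."""
    missing = []
    for name in names:
        path = env.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# In the notebook this would be called on os.environ;
# an empty list means both directories were found.
print(missing_dirs(os.environ))
```

If the list is not empty, the usual cause is that the download or extraction cell was skipped or that the notebook is running from a different working directory.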

Start Spark Session

---

In [10]:
import findspark
findspark.init()  # with no argument, findspark reads SPARK_HOME from the environment

from pyspark.sql import SparkSession

# create the session
spark = SparkSession \
        .builder \
        .appName("DataFrames Basics") \
        .master("local[*]") \
        .getOrCreate()

spark.version

'3.5.1'

In [11]:
spark

In [12]:
# For Pandas conversion optimization (the older key "spark.sql.execution.arrow.enabled" is deprecated in Spark 3.x)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

In [40]:
# Import the SQL functions used below (explicit names instead of a wildcard import)
from pyspark.sql.functions import col, desc, to_date, year

**Load the dataset**

The dataset contains a historical record of airplane crashes. For each accident it provides the date and time, location, operator, route or type of operation, aircraft type, number of people aboard, number of fatalities, and a short summary. The number of survivors is not a column of its own; it is derived later as `Aboard - Fatalities`.

In [19]:
AirplaneDF = spark.read.option("header", "true").option("delimiter", ",").csv("/content/Airplane_crashes.csv")

AirplaneDF.show(20)

+----------+-----+--------------------+--------------------+--------+-------------+--------------------+------------+-----+------+----------+------+--------------------+
|      Date| Time|            Location|            Operator|Flight #|        Route|                Type|Registration|cn/In|Aboard|Fatalities|Ground|             Summary|
+----------+-----+--------------------+--------------------+--------+-------------+--------------------+------------+-----+------+----------+------+--------------------+
|09/17/1908|17:18| Fort Myer, Virginia|Military - U.S. Army|    NULL|Demonstration|    Wright Flyer III|        NULL|    1|     2|         1|     0|During a demonstr...|
|07/12/1912|06:30|AtlantiCity, New ...|Military - U.S. Navy|    NULL|  Test flight|           Dirigible|        NULL| NULL|     5|         5|     0|First U.S. dirigi...|
|08/06/1913| NULL|Victoria, British...|             Private|       -|         NULL|    Curtiss seaplane|        NULL| NULL|     1|         1|     0|Th

*How has the frequency of airplane crashes changed over the years? Which years had the most accidents?*

In [26]:
AirplaneDF2 = AirplaneDF.withColumn("Date", to_date(AirplaneDF["Date"], "MM/dd/yyyy"))

AirplaneDF3 = AirplaneDF2.withColumn("Year", year("Date"))

#AirplaneDF3.show(20)

Frecuencia = AirplaneDF3.groupBy("Year").count().orderBy("Year")

Frecuencia.show()

+----+-----+
|Year|count|
+----+-----+
|1908|    1|
|1912|    1|
|1913|    3|
|1915|    2|
|1916|    5|
|1917|    6|
|1918|    4|
|1919|    6|
|1920|   17|
|1921|   13|
|1922|   11|
|1923|   12|
|1924|    7|
|1925|   11|
|1926|   12|
|1927|   21|
|1928|   37|
|1929|   37|
|1930|   24|
|1931|   32|
+----+-----+
only showing top 20 rows
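
Yearly counts are noisy, so grouping them into decades can make the trend easier to read. The sketch below does this in plain Python on the 20 rows displayed above; it is an illustration of the bucketing, not the notebook's own method. In Spark the same bucket could presumably be computed with an expression such as `floor(col("Year") / 10) * 10` before the `groupBy`, which is an assumption and is not run here.

```python
from collections import Counter

# (year, count) pairs copied from the first 20 rows shown above.
yearly_counts = [
    (1908, 1), (1912, 1), (1913, 3), (1915, 2), (1916, 5),
    (1917, 6), (1918, 4), (1919, 6), (1920, 17), (1921, 13),
    (1922, 11), (1923, 12), (1924, 7), (1925, 11), (1926, 12),
    (1927, 21), (1928, 37), (1929, 37), (1930, 24), (1931, 32),
]

def counts_by_decade(pairs):
    """Sum yearly counts into decade buckets, e.g. 1913 falls into 1910."""
    totals = Counter()
    for year, count in pairs:
        totals[year // 10 * 10] += count
    return dict(sorted(totals.items()))

decades = counts_by_decade(yearly_counts)
for decade, total in decades.items():
    print(f"{decade}s: {total}")
```

Only 1930 and 1931 are visible for the 1930s, so that bucket is partial in this sample.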



In [32]:
topfive = Frecuencia.orderBy(desc("count")).limit(5)

topfive.show()

+----+-----+
|Year|count|
+----+-----+
|1972|  104|
|1968|   96|
|1989|   95|
|1967|   91|
|1979|   89|
+----+-----+



*Do some geographic areas experience more accidents than others?*

In [34]:
AirplaneDF_sin_nulos = AirplaneDF3.na.drop(subset=["Location"])

accidentes_location = AirplaneDF_sin_nulos.groupBy("Location").count().orderBy(desc("count"))

accidentes_location.show()


+--------------------+-----+
|            Location|count|
+--------------------+-----+
|      Moscow, Russia|   15|
|   Sao Paulo, Brazil|   15|
|Rio de Janeiro, B...|   14|
| Manila, Philippines|   13|
|   Anchorage, Alaska|   13|
|    Bogota, Colombia|   13|
|  New York, New York|   12|
|        Cairo, Egypt|   12|
|   Chicago, Illinois|   11|
|        Tehran, Iran|    9|
| Near Moscow, Russia|    9|
|        AtlantiOcean|    9|
|      Ankara, Turkey|    8|
|Amsterdam, Nether...|    8|
|    Denver, Colorado|    8|
|         Rome, Italy|    8|
|       Paris, France|    8|
|Near Medellin, Co...|    7|
|  Bucharest, Romania|    7|
|     London, England|    7|
+--------------------+-----+
only showing top 20 rows
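
Grouping on the raw `Location` string treats "Moscow, Russia" and "Near Moscow, Russia" as different places, which fragments the counts. One possible refinement is to aggregate on the text after the last comma (country or U.S. state). The sketch below shows the idea in plain Python on untruncated rows from the output above; `region_of` is a hypothetical helper. A Spark version might combine `split`, `element_at` and `trim` on the `Location` column, which is an assumption and is not run here.

```python
from collections import Counter

def region_of(location):
    """Return the text after the last comma, or the whole string if there is none."""
    return location.rsplit(",", 1)[-1].strip()

# Untruncated (location, count) rows copied from the output above.
rows = [
    ("Moscow, Russia", 15), ("Sao Paulo, Brazil", 15),
    ("Manila, Philippines", 13), ("Anchorage, Alaska", 13),
    ("Bogota, Colombia", 13), ("Near Moscow, Russia", 9),
    ("AtlantiOcean", 9),
]

by_region = Counter()
for location, count in rows:
    by_region[region_of(location)] += count

print(by_region.most_common(3))
```

Entries without a comma, such as "AtlantiOcean", are kept unchanged, so they would still need manual cleaning.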



*Which operators are involved in the most accidents?*

In [35]:
aeronaves = AirplaneDF3.groupBy("Operator").count().orderBy(desc("count"))

aeronaves.show()

+--------------------+-----+
|            Operator|count|
+--------------------+-----+
|            Aeroflot|  179|
|Military - U.S. A...|  176|
|          Air France|   70|
|  Deutsche Lufthansa|   65|
|    United Air Lines|   44|
|            Air Taxi|   44|
|China National Av...|   44|
|Military - U.S. A...|   43|
|Pan American Worl...|   41|
|Military - U.S. Navy|   36|
|Military - Royal ...|   36|
|US Aerial Mail Se...|   36|
|   American Airlines|   36|
|     Indian Airlines|   34|
|KLM Royal Dutch A...|   33|
|Philippine Air Lines|   33|
|             Private|   31|
|         Aeropostale|   26|
|Northwest Orient ...|   25|
|British Overseas ...|   25|
+--------------------+-----+
only showing top 20 rows



Building on this, we can check whether certain aircraft types concentrate a larger share of their accidents in specific locations.

In [44]:
accidentes_tipo_location = AirplaneDF3.groupBy("Type", "Location").count()

accidentes_tipo = AirplaneDF3.groupBy("Type").count()

accidentes_tipo_location = accidentes_tipo_location.withColumnRenamed("count", "count_location")
accidentes_tipo = accidentes_tipo.withColumnRenamed("count", "count_total")

tasa_accidentes_tipo_location = accidentes_tipo_location.join(accidentes_tipo, "Type") \
    .withColumn("Tasa de Accidentes", col("count_location") / col("count_total"))


tasa_accidentes_tipo_location.show()


+--------------------+--------------------+--------------+-----------+--------------------+
|                Type|            Location|count_location|count_total|  Tasa de Accidentes|
+--------------------+--------------------+--------------+-----------+--------------------+
|   De Havilland DH-4|  New Paris, Indiana|             1|         28| 0.03571428571428571|
|            Potez IX|   Budapest, Hungary|             1|          2|                 0.5|
|Liore et Olivier 190| Off Corsica, France|             1|          1|                 1.0|
|Curtiss-Wright C-...|   Near Osaka, Japan|             1|          2|                 0.5|
|       Douglas C-54B|La Guardia Airpor...|             1|          5|                 0.2|
|        Canadair C-4|        Idris, Libya|             1|          1|                 1.0|
|       Douglas DC-6B|       Paris, France|             1|         27|0.037037037037037035|
|Goodyear ZPG-3W (...|Off Barnegat City...|             1|          1|          
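
The ratio above is the share of a type's recorded accidents that occurred at one location, so a type with a single accident always scores 1.0. A hedged way to make the ranking more informative is to ignore types with very few accidents. The sketch below applies that filter in plain Python to rows copied from the output above; the threshold `MIN_TOTAL = 5` is an arbitrary assumption for illustration.

```python
# (type, location, count_location, count_total) rows copied from the output above.
rows = [
    ("De Havilland DH-4", "New Paris, Indiana", 1, 28),
    ("Potez IX", "Budapest, Hungary", 1, 2),
    ("Liore et Olivier 190", "Off Corsica, France", 1, 1),
    ("Canadair C-4", "Idris, Libya", 1, 1),
    ("Douglas DC-6B", "Paris, France", 1, 27),
]

MIN_TOTAL = 5  # hypothetical minimum number of accidents per type

def location_shares(rows, min_total=MIN_TOTAL):
    """Return (type, location, share) for types with enough accidents, highest share first."""
    kept = []
    for aircraft, location, n_location, n_total in rows:
        if n_total >= min_total:
            kept.append((aircraft, location, n_location / n_total))
    return sorted(kept, key=lambda row: row[2], reverse=True)

for aircraft, location, share in location_shares(rows):
    print(f"{aircraft} @ {location}: {share:.4f}")
```

On the Spark DataFrame the equivalent step would be a filter on `count_total` before sorting by the ratio column.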

*Which aircraft type has the highest survival rate?*

In [47]:
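# Aboard and Fatalities were read as strings; Spark casts them to double for the subtraction
AirplaneDF4 = AirplaneDF3.withColumn("Survivors", col("Aboard") - col("Fatalities"))

#AirplaneDF4.show()

# alias the import so it does not shadow Python's built-in sum
from pyspark.sql.functions import sum as spark_sum

survivors_by_type = AirplaneDF4.groupBy("Type").agg(spark_sum("Survivors").alias("TotalSurvivors"))

survivors_by_type = survivors_by_type.orderBy("TotalSurvivors", ascending=False)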

survivors_by_type.show()

+--------------------+--------------+
|                Type|TotalSurvivors|
+--------------------+--------------+
|        Douglas DC-3|        1336.0|
|        Boeing B-747|        1070.0|
|McDonnell Douglas...|        1033.0|
|McDonnell Douglas...|         838.0|
|    Boeing B-747-122|         739.0|
|    Boeing B-747-121|         586.0|
|McDonnell Douglas...|         574.0|
|McDonnell Douglas...|         527.0|
|   Boeing B-707-321B|         498.0|
|McDonnell Douglas...|         477.0|
|     Lockheed L-1011|         464.0|
|    Boeing B-747-200|         402.0|
|     Tupolev TU-134A|         401.0|
|McDonnell Douglas...|         338.0|
|de Havilland Cana...|         337.0|
|   Tupolev TU-154B-1|         335.0|
|McDonnell Douglas...|         332.0|
|        Airbus A-340|         309.0|
|    Airbus A-330-243|         304.0|
|    Airbus A.330-301|         297.0|
+--------------------+--------------+
only showing top 20 rows
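
The table above ranks types by the total number of survivors, which favors aircraft that carried many passengers or appear in many records. A survival rate would instead divide survivors by people aboard for each type. The sketch below shows that calculation in plain Python on the first three rows of the dataset preview; `to_number` and `survival_rate_by_type` are hypothetical helpers. In Spark the same ratio could likely be obtained by aggregating both the sum of `Survivors` and the sum of `Aboard` per `Type` and dividing them, which is not run here.

```python
from collections import defaultdict

def to_number(value):
    """Convert a CSV string to float; return None for missing or non-numeric values."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def survival_rate_by_type(rows):
    """Return {type: survivors / aboard}, skipping rows with missing or zero Aboard."""
    aboard_total = defaultdict(float)
    survivors_total = defaultdict(float)
    for aircraft, aboard, fatalities in rows:
        a, f = to_number(aboard), to_number(fatalities)
        if a is None or f is None or a <= 0:
            continue
        aboard_total[aircraft] += a
        survivors_total[aircraft] += a - f
    return {t: survivors_total[t] / aboard_total[t] for t in aboard_total}

# (Type, Aboard, Fatalities) copied from the first three rows of the dataset preview.
sample = [
    ("Wright Flyer III", "2", "1"),
    ("Dirigible", "5", "5"),
    ("Curtiss seaplane", "1", "1"),
]
rates = survival_rate_by_type(sample)
for aircraft, rate in rates.items():
    print(f"{aircraft}: {rate:.2f}")
```

As with the location shares, a rate computed from one or two records says little, so a minimum number of accidents per type is worth considering before ranking.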

