## Initialize Spark

We will use local mode, where all processing runs on a single machine. If you need to install Apache Spark, there are two options: download it from the [Spark download page](https://spark.apache.org/downloads.html), choosing "Pre-built for Apache Hadoop 3.3 and later", or install only the PySpark Python library and its dependencies with `pip install "pyspark[sql,ml,mllib]"` (quote the argument and leave no spaces inside the brackets, so the shell passes it to pip as a single requirement).

You will also need Java 8 or later installed on your system.
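Before starting a session, it can help to confirm that both prerequisites are visible from Python. The following is a minimal sketch, not part of the original notebook: the helper name `check_prerequisites` and its dictionary keys are hypothetical. It only reports what it finds and does not verify the Java version.

```python
import importlib.util
import os
import shutil


def check_prerequisites():
    """Report whether Java and PySpark can be located from this interpreter."""
    return {
        # Full path of the `java` executable on PATH, or None if not found
        "java_on_path": shutil.which("java"),
        # Value of JAVA_HOME, or None if the variable is not set
        "java_home": os.environ.get("JAVA_HOME"),
        # True if the pyspark package is importable
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
    }


status = check_prerequisites()
print(status)
```

If both `java_on_path` and `java_home` are `None`, Spark will most likely fail to start its JVM; if `pyspark_installed` is `False`, either install it with pip or use `findspark` as described in the next cell.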

In [4]:
# The purpose of findspark is to make it easier to find and use Spark from Python,
# especially if you have not set the SPARK_HOME environment variable or
# if your Spark and PySpark setup is not in your system's PATH.
# If pyspark was NOT installed with pip, uncomment the next two lines
#import findspark
#findspark.init()

from pyspark.sql import SparkSession
# Start a local session with 4 worker threads (this machine has 8 cores)
spark = SparkSession.builder \
    .appName("bcn_traffic_incidents") \
    .master("local[4]") \
    .getOrCreate()
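The value `local[4]` is hard-coded for an 8-core machine. As a rough sketch, the thread count could instead be derived from the machine it runs on. The helper `local_master` and its default of using half the cores are assumptions made for illustration, not something the notebook defines.

```python
import os


def local_master(fraction=0.5):
    """Build a Spark master URL such as 'local[4]' from the machine's core count."""
    cores = os.cpu_count() or 1              # cpu_count() may return None
    threads = max(1, int(cores * fraction))  # always use at least one thread
    return f"local[{threads}]"


print(local_master())  # e.g. 'local[4]' on an 8-core machine
```

The result can be passed to the builder as `.master(local_master())`. Spark also accepts `local[*]`, which uses as many worker threads as there are logical cores.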

In [2]:
# Uncomment to stop the session and release its resources
# spark.stop()

## Load datasets as DataFrames

- df_incidents
- df_incidents_type
- df_incidents_vehicle
- df_incidents_person
- df_incidents_cause
- df_incidents_driver

In [7]:
incidents_file_path = "../data/2023_accidents_gu_bcn.csv"
incidents_type_file_path = "../data/2023_accidents_tipus_gu_bcn.csv"
incidents_vehicle_file_path = "../data/2023_accidents_vehicles_gu_bcn.csv"
incidents_person_file_path = "../data/2023_accidents_persones_gu_bcn.csv"
incidents_cause_file_path = "../data/2023_accidents_causes_gu_bcn.csv"
incidents_driver_file_path = "../data/2023_accidents_causa_conductor_gu_bcn.csv"

df_incidents = spark.read.csv(incidents_file_path, header=True, inferSchema=True)
df_incidents_type = spark.read.csv(incidents_type_file_path, header=True, inferSchema=True)
df_incidents_vehicle = spark.read.csv(incidents_vehicle_file_path, header=True, inferSchema=True)
df_incidents_person = spark.read.csv(incidents_person_file_path, header=True, inferSchema=True)
df_incidents_cause = spark.read.csv(incidents_cause_file_path, header=True, inferSchema=True)
df_incidents_driver = spark.read.csv(incidents_driver_file_path, header=True, inferSchema=True)
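Since all six files are read with the same options, the loading step could also be written as a single loop. This is a minimal sketch of that alternative: the names `DATA_DIR`, `CSV_FILES` and `load_csvs` are hypothetical, while the file names and read options are taken from the cell above.

```python
from pathlib import Path

DATA_DIR = Path("../data")

# Dataset name -> CSV file name (same files as in the cell above)
CSV_FILES = {
    "incidents": "2023_accidents_gu_bcn.csv",
    "incidents_type": "2023_accidents_tipus_gu_bcn.csv",
    "incidents_vehicle": "2023_accidents_vehicles_gu_bcn.csv",
    "incidents_person": "2023_accidents_persones_gu_bcn.csv",
    "incidents_cause": "2023_accidents_causes_gu_bcn.csv",
    "incidents_driver": "2023_accidents_causa_conductor_gu_bcn.csv",
}


def load_csvs(spark, data_dir=DATA_DIR, files=CSV_FILES):
    """Read each CSV with a header row and inferred schema, keyed by dataset name."""
    return {
        name: spark.read.csv(str(data_dir / file_name), header=True, inferSchema=True)
        for name, file_name in files.items()
    }


# Usage, assuming the `spark` session created earlier:
# dfs = load_csvs(spark)
# df_incidents = dfs["incidents"]
```

Keeping the DataFrames in a dictionary makes it easy to apply the same check to all of them, for example printing every schema in one loop. The trade-off is that the individual `df_incidents_*` variables used in the rest of the notebook would have to be assigned from the dictionary.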

## Data exploration



In [8]:
df_incidents.printSchema()

root
 |-- Numero_expedient: string (nullable = true)
 |-- Codi_districte: integer (nullable = true)
 |-- Nom_districte: string (nullable = true)
 |-- Codi_barri: integer (nullable = true)
 |-- Nom_barri: string (nullable = true)
 |-- Codi_carrer: integer (nullable = true)
 |-- Nom_carrer: string (nullable = true)
 |-- Num_postal : string (nullable = true)
 |-- Descripcio_dia_setmana: string (nullable = true)
 |-- NK_Any: integer (nullable = true)
 |-- Mes_any: integer (nullable = true)
 |-- Nom_mes: string (nullable = true)
 |-- Dia_mes: integer (nullable = true)
 |-- Hora_dia: integer (nullable = true)
 |-- Descripcio_torn: string (nullable = true)
 |-- Descripcio_causa_vianant: string (nullable = true)
 |-- Numero_morts: integer (nullable = true)
 |-- Numero_lesionats_lleus: integer (nullable = true)
 |-- Numero_lesionats_greus: integer (nullable = true)
 |-- Numero_victimes: integer (nullable = true)
 |-- Numero_vehicles_implicats: integer (nullable = true)
 |-- Coordenada_UTM_Y_ED5