# Apache Spark - A Unified Engine for Large-Scale Data Analytics

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

# Spark DataFrames

A DataFrame is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. 

# findspark

- findspark is a Python library that helps you configure your Python environment to use PySpark. 
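- It sets `SPARK_HOME` (locating the Spark installation if the variable is not already set) and adds PySpark to `sys.path`, so that `import pyspark` works in a plain Python session. It does not configure Java: a Java runtime must already be installed and discoverable, for example through `JAVA_HOME`.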
- It acts as a bridge between your Python environment and the Apache Spark installation, making it a convenient tool for working with PySpark in Jupyter Notebook.
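
To make the "bridge" idea concrete, here is a minimal sketch of the kind of setup involved, using only the standard library. It is not findspark's actual implementation: `configure_spark_env` is a hypothetical helper, and the `python/` and `python/lib/py4j-*.zip` locations are assumed from the layout of a typical Spark distribution.

```python
import glob
import os
import sys


def configure_spark_env(spark_home):
    """Hypothetical helper: expose a Spark installation to the current Python process."""
    # Record where Spark lives so that PySpark can locate the installation.
    os.environ["SPARK_HOME"] = spark_home

    # Assumed layout: PySpark sources under <spark_home>/python,
    # with py4j bundled as a zip file under <spark_home>/python/lib.
    spark_python = os.path.join(spark_home, "python")
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))

    # Put those locations at the front of sys.path, skipping any already present.
    added = []
    for path in [spark_python, *py4j_zips]:
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added
```

In practice `findspark.init()` takes care of this, so a helper like this is only useful for understanding what changes in the environment.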

In [1]:
pip install findspark

Note: you may need to restart the kernel to use updated packages.


In [2]:
import findspark
findspark.init()

# Starting Point: SparkSession

In [3]:
from pyspark.sql import SparkSession

# Create (or reuse) a session that runs Spark locally, with one worker thread per logical core.
spark = SparkSession \
    .builder \
    .appName("ZikSpark") \
    .master("local[*]") \
    .getOrCreate()

# Pre-requisite: 2023 Yellow Taxi Trip Data

[Yellow Taxi](data/yellow-taxi) is a large dataset that can be used to effectively demonstrate various Spark features. It is 3.51 GB with 38,310,226 rows. The data file cannot be version controlled in GitHub due to size limitations. As a pre-requisite, the dataset should be downloaded and placed locally. Refer to the [README](data/yellow-taxi/README.md) file.
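
Because the file has to be placed manually, a quick check before starting a long Spark read can save time. The sketch below assumes the notebook is run from the repository root and uses the file name from the load call in this notebook; `dataset_status` is a hypothetical helper, not part of Spark.

```python
from pathlib import Path

# File name taken from the load call in this notebook; adjust if your download differs.
DATASET_PATH = Path("data") / "yellow-taxi" / "2023_Yellow_Taxi_Trip_Data_20240909.csv"


def dataset_status(path):
    """Hypothetical helper: describe whether the dataset file is in place."""
    if not path.is_file():
        return f"Missing: {path} (see data/yellow-taxi/README.md)"
    # Report the size in decimal gigabytes for comparison with the figure quoted above.
    size_gb = path.stat().st_size / 1e9
    return f"Found: {path} ({size_gb:.2f} GB)"


print(dataset_status(DATASET_PATH))
```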

In [4]:
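# Read the CSV using its header row for column names.
# inferSchema makes Spark take an extra pass over the data to guess column types.
# Forward slashes in the path work on Windows, Linux and macOS alike.
rawdata_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data/yellow-taxi/2023_Yellow_Taxi_Trip_Data_20240909.csv")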

rawdata_df.show(5)

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|airport_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|       2|01/01/2023 12:32:...| 01/01/2023 12:40:...|              1|         0.97|         1|                 N|         161|         141|           2|        9.3|  1.0|    0.5|       0.