# Needed imports!

In [4]:
from pyspark.sql import SparkSession

# Creating Spark Session

In [5]:
spark = (SparkSession
         .builder
         .appName("Basics")
         .master("local[*]")
         .getOrCreate())


*Driver* - maintains information about the Spark application (SparkContext), distributes and schedules work across executors, and returns the final processing output to the user


*Executors* - carry out the work the driver assigns to them and periodically report their execution state back to the driver


*Workers* - worker nodes of the cluster where the executors run



# Infer Schema vs. provide Schema

**Infer schema:**
* Spark reads some of the data up front and derives the column types from it, which adds overhead
* No need to type out complex schemas for nested JSON
* OK for exploring data when you cannot inspect the files directly

**Provide Schema:**
* No need for an additional read
* Writing out the proper nested schema by hand is sometimes hard

**Parquet files store the column names and types in their own file metadata (the footer), so there is no need to specify a schema!**

# Reading Data in Spark

General documentation on reading data with Spark can be found at:
http://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html



Spark is lazy by default: it doesn't compute any results until an action is called. We'll cover this in more detail later on. For a short read, you can check [my short post](https://medium.com/uncle-data/tale-of-two-lazy-and-the-eager-9dc468a3055e)

In [9]:
import os
extracted_data_dir = '/home/iceberg/notebooks/PyCon_LT_Workshop/data/'

In [10]:
os.listdir(extracted_data_dir)

['yellow_taxi_2022-02.parquet',
 'dim_taxi_zones.csv',
 'yellow_taxi_2022-03.parquet',
 'dim_payments.csv',
 'yellow_taxi_2022-01.parquet',
 'dim_vendor.csv',
 'dim_rates.csv']

Running a simple read command only reads the metadata: Spark infers the schema, or reads it directly when the data files provide it

In [11]:
spark.read.parquet(extracted_data_dir+"*.parquet")

DataFrame[VendorID: bigint, tpep_pickup_datetime: timestamp_ntz, tpep_dropoff_datetime: timestamp_ntz, passenger_count: double, trip_distance: double, RatecodeID: double, store_and_fwd_flag: string, PULocationID: bigint, DOLocationID: bigint, payment_type: bigint, fare_amount: double, extra: double, mta_tax: double, tip_amount: double, tolls_amount: double, improvement_surcharge: double, total_amount: double, congestion_surcharge: double, airport_fee: double]

In [13]:
yellow_taxi_data = spark.read.parquet(extracted_data_dir+"*.parquet")

In [16]:
dim_taxi_zones = spark.read.option("header",True).csv(extracted_data_dir+"dim_taxi_zones.csv")