
2. Getting Started with Spark

Simon Renauld edited this page Oct 21, 2021 · 13 revisions

This section walks through a first Spark example.

2.1. Apache Spark Example and Core Concepts

Read CSV Data:

Reading is a lazy operation: load only defines a DataFrame over the CSV file. Calling take(5) is an action; it runs the read and returns the first five rows to the driver as a local list of Row objects.

flightData2015 = spark.read.format("csv").option("header","true").load("C:/Users/renau/OneDrive/02-Data Projects/09-Apache-Spark/Spark-The-Definitive-Guide/data/flight-data/csv/2015-summary.csv")

flightData2015.take(5)

We can now call explain() on a query, here a sort on count, to print the physical plan Spark will use to run it:

flightData2015.sort("count").explain() 

>>>  FileScan csv [DEST_COUNTRY_NAME#38,ORIGIN_COUNTRY_NAME#39,count#40] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/C:/Users/renau/OneDrive/02-Data Projects/09-Apache-Spark/Spark-The-Defini..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:string>

By default, when we perform a shuffle, Spark outputs 200 shuffle partitions. Changing this value does not manipulate the physical data; it only configures a physical execution characteristic. This fits Spark's functional programming model: the same inputs always result in the same outputs when the transformations on that data stay constant.

Let's set this value to 5 to reduce the number of output partitions from the shuffle:

>>> spark.conf.set("spark.sql.shuffle.partitions", "5")
>>> flightData2015.sort("count").take(2)
[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count='1'), Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count='1')]

2.2. Working with Spark DataFrames and SQL
