2. Getting Started with Spark

This section walks through a first example with Spark.

2.1. Apache Spark Example and Core Concepts

Read CSV Data:

Reading the CSV is a lazy operation: Spark defines a DataFrame over the file, and the rows are only converted into a local array or list of rows when an action such as take() is called.

flightData2015 = spark.read.format("csv").option("header", "true").load("C:/Users/renau/OneDrive/02-Data Projects/09-Apache-Spark/Spark-The-Definitive-Guide/data/flight-data/csv/2015-summary.csv")

flightData2015.take(5)
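The difference between defining the read and calling take() can be illustrated without Spark. The following is a minimal plain-Python analogy, not Spark's actual mechanism: `lazy_rows`, `take`, and `read_log` are hypothetical names, and the sample lines are illustrative values in the same three-column shape as the CSV.

```python
from itertools import islice

read_log = []  # records which lines were actually parsed


def lazy_rows(lines):
    """Generator: describes the parsing work but does nothing until iterated."""
    for line in lines:
        read_log.append(line)
        yield tuple(line.split(","))


def take(rows, n):
    """Action: pulls the first n rows into a local list."""
    return list(islice(rows, n))


# Illustrative sample lines (destination, origin, count).
lines = [
    "United States,Romania,15",
    "United States,Croatia,1",
    "United States,Ireland,344",
]

rows = lazy_rows(lines)        # nothing has been parsed yet
parsed_before = len(read_log)  # still 0 at this point
first_two = take(rows, 2)      # parsing happens only now, and only for 2 lines
```

After `take(rows, 2)`, `read_log` holds two entries: the third line was never touched, which mirrors how an action only triggers the work it needs.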

We can now call explain() to see the plan Spark will use to execute the query:

flightData2015.sort("count").explain() 

>>>  FileScan csv [DEST_COUNTRY_NAME#38,ORIGIN_COUNTRY_NAME#39,count#40] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/C:/Users/renau/OneDrive/02-Data Projects/09-Apache-Spark/Spark-The-Defini..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:string>
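The ReadSchema in the plan above lists count as a string, since the file was loaded with only the header option. Sorting on that column therefore compares text rather than numbers. The plain-Python sketch below uses made-up values to show the difference; in Spark, the reader option `inferSchema` set to `"true"` is one way to get a numeric column instead.

```python
# Made-up count values, held as text the way the CSV reader returned them.
counts_as_text = ["15", "1", "344", "2"]

# Text comparison goes character by character, so "15" sorts before "2".
sorted_as_text = sorted(counts_as_text)
# → ['1', '15', '2', '344']

# Converting to integers for the comparison gives the numeric order.
sorted_as_numbers = sorted(counts_as_text, key=int)
# → ['1', '2', '15', '344']
```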

By default, when we perform a shuffle, Spark outputs 200 shuffle partitions. We do not manipulate the physical data; we only configure physical execution characteristics. This follows Spark's programming model, which is functional: the same inputs always result in the same outputs as long as the transformations on that data stay constant.

Let's set this value to 5 to reduce the number of output partitions from the shuffle:

>>> spark.conf.set("spark.sql.shuffle.partitions", "5")
>>> flightData2015.sort("count").take(2)
[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count='1'), Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count='1')]
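To get a feel for what the partition count controls, here is a toy model in plain Python. It is a simplified sketch, not Spark's partitioner (a sort in Spark distributes rows by key ranges rather than by hash): `partition_rows` is a hypothetical helper that spreads rows over a fixed number of buckets using a stable checksum of the key, and the rows are made-up values.

```python
import zlib


def partition_rows(rows, num_partitions):
    """Assign each (key, value) row to a bucket by a stable hash of its key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in rows:
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        buckets[index].append((key, value))
    return buckets


# Made-up rows: (country, count).
rows = [
    ("United States", 1),
    ("Moldova", 1),
    ("Singapore", 1),
    ("United States", 5),
]

five = partition_rows(rows, 5)           # at most 5 output buckets
two_hundred = partition_rows(rows, 200)  # 200 buckets, almost all of them empty

non_empty_default = sum(1 for bucket in two_hundred if bucket)
```

With a small dataset, most of the 200 buckets stay empty, which is the motivation for lowering the setting in a local example. The data itself is the same in both cases; only its distribution across buckets changes.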
