<h1 style="text-align:center"> INFO 323: Cloud Computing and Big Data</h1>
<h2 style="text-align:center"> College of Computing and Informatics</h2>
<h2 style="text-align:center">Drexel University</h2>

<h3 style="text-align:center"> Introduction to Spark Programming (ch 2)</h3>
<h3 style="text-align:center"> Yuan An, PhD</h3>
<h3 style="text-align:center">Associate Professor</h3>

## RDD
Data objects in Spark are immutable.

A Resilient Distributed Dataset (RDD) is the most fundamental data object used in Spark programming.

RDDs are datasets within a Spark application, including the initial dataset(s) loaded, any intermediate dataset(s), and the final resultant dataset(s).

Most Spark applications load an RDD with external data and then create new RDDs by performing operations on the existing RDDs; these operations are **transformations**.

This process is repeated until an output operation is ultimately required, for instance, to write the results of an application to a filesystem; these types of operations are **actions**.
![Programming in Spark](info323-programming-in-spark.png)

## Narrow Transformation
Transformations consisting of narrow dependencies (we’ll call them narrow transformations) are
those for which each input partition will contribute to only one output partition.
## Wide Transformation
A transformation with wide dependencies (a wide transformation) has input partitions that
contribute to many output partitions. This is often referred to as a shuffle, in which Spark
exchanges partitions across the cluster. With narrow transformations, Spark automatically
performs an operation called pipelining: if we specify multiple filters on DataFrames,
they are all performed in memory. The same cannot be said for shuffles. When we perform a shuffle,
Spark writes the results to disk.
![narrow vs wide](info323-narrow-vs-wide.png)
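
The difference between the two kinds of dependencies can be modeled without Spark. The following plain-Python sketch treats a dataset as a list of partitions. The helper names `narrow_map` and `wide_group_by_key` are hypothetical and only mimic the data movement; this is not how Spark is implemented.

```python
from collections import defaultdict

# A "dataset" modeled as a list of partitions (each partition is a list).
partitions = [[1, 2, 3], [4, 5, 6]]


def narrow_map(parts, fn):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[fn(x) for x in part] for part in parts]


def wide_group_by_key(parts, key_fn, num_out):
    """Wide: any input partition may send records to any output partition."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for x in part:
            k = key_fn(x)
            # Records are routed by key, so they cross partition boundaries.
            out[hash(k) % num_out][k].append(x)
    return [dict(d) for d in out]


doubled = narrow_map(partitions, lambda x: x * 2)
by_parity = wide_group_by_key(partitions, lambda x: x % 2, num_out=2)
print(doubled)    # partition boundaries are unchanged
print(by_parity)  # every input partition contributed to both outputs
```

In the narrow case the two partitions never exchange data. In the wide case, both input partitions feed both output partitions, which is the exchange that a shuffle performs.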

## Lazy Evaluation
Lazy evaluation means that Spark will wait until the very last moment to execute the graph of
computation instructions. 

In Spark, instead of modifying the data immediately when you express some
operation, you build up a plan of transformations that you would like to apply to your source data. 

By waiting until the last minute to execute the code, Spark compiles this plan from your raw DataFrame
transformations to a streamlined physical plan that will run as efficiently as possible across the
cluster. 

This provides immense benefits because Spark can optimize the entire data flow from end to
end. 

An example of this is something called predicate pushdown on DataFrames. If we build a large
Spark job but specify a filter at the end that only requires us to fetch one row from our source data,
the most efficient way to execute this is to access the single record that we need. Spark will actually
optimize this for us by pushing the filter down automatically.
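
Python generators give a rough feel for lazy evaluation. The sketch below is only an analogy: building the pipeline does no work, and asking for a result triggers execution. Unlike Spark, a generator does not rewrite or optimize the plan. The names `source`, `executed`, and `pipeline` are made up for this illustration.

```python
executed = []  # records which source rows were actually read


def source():
    for row in range(1_000_000):
        executed.append(row)
        yield row


# "Transformations": defining the filter and the map runs nothing yet.
pipeline = (row * 10 for row in source() if row == 3)
rows_read_before_action = len(executed)

# "Action": requesting a result forces the computation.
result = next(pipeline)
rows_read_after_action = len(executed)

print(rows_read_before_action, result, rows_read_after_action)  # 0 30 4
```

Only four source rows are read, because the computation stops as soon as the requested result is available.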


One thing that you might use RDDs for is to parallelize raw data that you have stored in memory on
the driver machine. For instance, let’s parallelize some simple numbers and then convert the
resulting RDD to a DataFrame so that it can be used with other DataFrames:

In [3]:
from pyspark.sql import Row
mydf = spark.sparkContext.parallelize([Row(1), Row(2), Row(3)]).toDF()

In [4]:
mydf.collect()

[Row(_1=1), Row(_1=2), Row(_1=3)]

### Create RDD from Files
Spark provides API methods to create RDDs from a file, multiple files, or the contents of a directory.

Files can be of various formats, from unstructured text files, to semi-structured files such as JSON files, to structured data sources such as CSV files.

```
sc.textFile(name, minPartitions=None, use_unicode=True)
```

The textFile() method is used to create RDDs from files (compressed or
uncompressed), directories, or glob patterns (file patterns with wildcards).


In [7]:
words_rdd = sc.textFile("words-short.txt")

In [8]:
words_rdd

words-short.txt MapPartitionsRDD[5] at textFile at NativeMethodAccessorImpl.java:0

In [9]:
words_rdd.take(1)

['  For nimble thought can jump both sea and land,']

In [36]:
words_rdd.schema

AttributeError: 'RDD' object has no attribute 'schema'

### Other Methods for Creating RDD

**wholeTextFiles()**
```
sc.wholeTextFiles(path, minPartitions=None, use_unicode=True)
```
The wholeTextFiles() method lets you read a directory containing multiple files. Each file is returned as a single (path, content) record.

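Note that the `spark.read` methods below return DataFrames rather than RDDs; the underlying RDD is available through the DataFrame's `rdd` attribute.

**read.jdbc()**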
```
spark.read.jdbc(url, table,
                column=None,
                lowerBound=None,
                upperBound=None,
                numPartitions=None,
                predicates=None,
                properties=None)
```

**read.json()**
```
spark.read.json(path, schema=None)
```

### Creating RDD Programmatically
It is possible to create an RDD programmatically from data in your program,
whether the data is in lists, arrays, or collections.

In [10]:
parall_rdd = sc.parallelize([0, 1, 2, 3, 4, 5])

In [11]:
parall_rdd.collect()

[0, 1, 2, 3, 4, 5]

In [12]:
parall_rdd.count()

6

In [13]:
# sc.range(start, end, step, numSlices): the numbers 0..999 spread over 2 partitions
range_rdd = sc.range(0, 1000, 1, 2)

In [14]:
range_rdd.getNumPartitions()

2

In [15]:
range_rdd.take(5)

[0, 1, 2, 3, 4]

## Transformation
Transformations are operations performed against RDDs that result in the creation of new RDDs. Common transformations include map and filter functions. The following example shows a new RDD created from a transformation of an existing RDD:

In [17]:
even_num_rdd = parall_rdd.filter(lambda x: x % 2 ==0)

In [19]:
even_num_rdd.collect()

[0, 2, 4]

In [47]:
# Map
words_rdd.map(lambda x: x.split()).collect()

[['For', 'nimble', 'thought', 'can', 'jump', 'both', 'sea', 'and', 'land,'],
 ['As', 'soon', 'as', 'think', 'the', 'place', 'where', 'he', 'would', 'be.'],
 ['But', 'ah,', 'thought', 'kills', 'me', 'that', 'I', 'am', 'not', 'thought'],
 ['To',
  'leap',
  'large',
  'lengths',
  'of',
  'miles',
  'when',
  'thou',
  'art',
  'gone,'],
 ['But', 'that', 'so', 'much', 'of', 'earth', 'and', 'water', 'wrought,'],
 ['I', 'must', 'attend,', "time's", 'leisure', 'with', 'my', 'moan.'],
 ['Receiving', 'nought', 'by', 'elements', 'so', 'slow,'],
 ['But', 'heavy', 'tears,', 'badges', 'of', "either's", 'woe.'],
 [],
 []]

In [52]:
# flatmap
words_rdd.flatMap(lambda x: x.split()).collect()[:10]

['For',
 'nimble',
 'thought',
 'can',
 'jump',
 'both',
 'sea',
 'and',
 'land,',
 'As']

In [56]:
# filter
words_rdd.flatMap(lambda x: x.split()).filter(lambda x: len(x) < 3).collect()

['As',
 'as',
 'he',
 'me',
 'I',
 'am',
 'To',
 'of',
 'so',
 'of',
 'I',
 'my',
 'by',
 'so',
 'of']

In [61]:
words_rdd.flatMap(lambda x: x.split()).count()

69

In [60]:
# Use flatMap() before distinct(): map() would yield one list per line, and lists are unhashable, so distinct() fails on them
words_rdd.flatMap(lambda x: x.split()).distinct().count()

59
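
The difference between `map`, `flatMap`, and `distinct` can be reproduced in plain Python. This is a sketch on a small made-up list of lines rather than on the RDD itself; it assumes, as the comment above suggests, that `distinct()` needs hashable elements.

```python
from itertools import chain

lines = ["For nimble thought", "As soon as", ""]

# map-like: one output element per input line, so the result is nested.
mapped = [line.split() for line in lines]

# flatMap-like: the per-line lists are flattened into one sequence of words.
flat = list(chain.from_iterable(line.split() for line in lines))

# distinct-like: deduplication needs hashable elements.
unique_words = set(flat)      # strings are hashable, so this works
try:
    set(mapped)               # lists are not hashable
    lists_are_hashable = True
except TypeError:
    lists_are_hashable = False

print(mapped)
print(flat)
print(len(unique_words), lists_are_hashable)
```

Note that `'As'` and `'as'` count as two distinct words, since no case normalization is applied.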

In [64]:
words_rdd.flatMap(lambda x: x.split()).groupBy(lambda x: x[0].lower()).collect()

[('f', <pyspark.resultiterable.ResultIterable at 0x7fa750a82e48>),
 ('n', <pyspark.resultiterable.ResultIterable at 0x7fa750a902b0>),
 ('t', <pyspark.resultiterable.ResultIterable at 0x7fa750a906d8>),
 ('c', <pyspark.resultiterable.ResultIterable at 0x7fa750a90198>),
 ('j', <pyspark.resultiterable.ResultIterable at 0x7fa7507860f0>),
 ('b', <pyspark.resultiterable.ResultIterable at 0x7fa7507862b0>),
 ('s', <pyspark.resultiterable.ResultIterable at 0x7fa750786668>),
 ('a', <pyspark.resultiterable.ResultIterable at 0x7fa750a909b0>),
 ('l', <pyspark.resultiterable.ResultIterable at 0x7fa7507869b0>),
 ('p', <pyspark.resultiterable.ResultIterable at 0x7fa750786b00>),
 ('w', <pyspark.resultiterable.ResultIterable at 0x7fa750786b70>),
 ('h', <pyspark.resultiterable.ResultIterable at 0x7fa750786cf8>),
 ('k', <pyspark.resultiterable.ResultIterable at 0x7fa750786da0>),
 ('m', <pyspark.resultiterable.ResultIterable at 0x7fa750786e10>),
 ('i', <pyspark.resultiterable.ResultIterable at 0x7fa750786f9
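
The `ResultIterable` objects in the output above wrap the values of each group. In PySpark they can be turned into ordinary lists by applying `mapValues(list)` before `collect()`. What `groupBy` computes can be sketched in plain Python on the first ten words of the text; the variable names here are illustrative only.

```python
from collections import defaultdict

words = ["For", "nimble", "thought", "can", "jump",
         "both", "sea", "and", "land,", "As"]

# Group the words by their lowercased first letter.
groups = defaultdict(list)
for w in words:
    groups[w[0].lower()].append(w)

grouped = sorted(groups.items())
for key, values in grouped:
    print(key, values)
```

Each key appears once, paired with all the words that share that first letter, for example `'a'` with `['and', 'As']`.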

In [68]:
words_rdd.flatMap(lambda x: x.split()).sortBy(lambda x: x[0].lower()).take(5)

['and', 'As', 'as', 'ah,', 'am']

## Actions
Actions in Spark either return values, as is the case with count(); return data, as is the case with collect(); or save data externally, as is the
case with saveAsTextFile(). In all cases, actions force computation of an
RDD and all of its parents. Some actions return either a count, an aggregation of
the data, or part or all of the data in an RDD. In contrast, foreach() is an
action that performs a function on each element of an RDD.

In [69]:
# reduce
parall_rdd.collect()

[0, 1, 2, 3, 4, 5]

In [71]:
parall_rdd.reduce(lambda x, y: x + y)

15
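
Because an RDD is split into partitions, a reduce is carried out in two steps: within each partition, and then across the partial results. The following plain-Python sketch assumes the six numbers are split into two partitions (the actual split depends on the cluster configuration). It also shows why the function passed to `reduce()` should be commutative and associative: the grouping of the operands is not under your control.

```python
from functools import reduce
from operator import add

# Assumed layout: the data [0, 1, 2, 3, 4, 5] in two partitions.
partitions = [[0, 1, 2], [3, 4, 5]]

# Step 1: reduce inside each partition independently.
partials = [reduce(add, part) for part in partitions]

# Step 2: combine the per-partition results.
total = reduce(add, partials)

print(partials, total)  # [3, 12] 15
```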

## WordCount in Spark

In [89]:
lines = sc.textFile("hdfs://quickstart.cloudera/user/cloudera/shakespeare.txt")

In [90]:
lines.take(1)

['This is the 100th Etext file presented by Project Gutenberg, and']

In [91]:
lines.count()

124456

In [92]:
words = lines.flatMap(lambda x: x.split(" "))

In [93]:
tuples = words.map(lambda word: (word, 1))

The reduceByKey() method merges the values of all tuples that share the same key (here, the same word) by repeatedly applying the lambda expression. The lambda expression takes two arguments, x and y, which are the count values of two such tuples.

In [94]:
counts = tuples.reduceByKey(lambda x, y: (x+y))

In [95]:
counts.take(5)

[('This', 1105), ('is', 7851), ('the', 23242), ('100th', 1), ('Etext', 4)]
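
The three steps of the word count (flatMap, map, reduceByKey) can be traced on a tiny input in plain Python. The helper `reduce_by_key` is a hypothetical stand-in that only mimics the result of `reduceByKey()`, not its distributed execution. The input deliberately contains a double space to show that `split(" ")` yields an empty string there, which is then counted as a word.

```python
lines = ["to be or  not", "to be"]

# flatMap: split every line and flatten the results.
words = [w for line in lines for w in line.split(" ")]

# map: pair every word with the count 1.
tuples = [(w, 1) for w in words]


def reduce_by_key(pairs, fn):
    """Merge the values that share a key by applying fn pairwise."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


counts = reduce_by_key(tuples, lambda x, y: x + y)
print(counts)
```

Using `split()` without an argument would drop the empty string, because it splits on runs of whitespace.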

The coalesce() method combines all the RDD partitions into a single partition since we want a single output file, and saveAsTextFile() writes the RDD to the specified location.

In [98]:
counts.coalesce(1).saveAsTextFile("hdfs://quickstart.cloudera/user/cloudera/wordcount/outputDir")

## DataFrame
A DataFrame consists of a series of records (like rows in a table) that are of type Row,
and a number of columns (like columns in a spreadsheet) that represent a computation expression that
can be performed on each individual record in the Dataset.

Schemas define the name as well as the
type of data in each column. 

Partitioning of the DataFrame defines the layout of the DataFrame or
Dataset’s physical distribution across the cluster. 

The partitioning scheme defines how the data is
allocated to partitions. You can set this to be based on values in a certain column or nondeterministically.

In [25]:
range_df = spark.range(500).toDF("number")

In [27]:
range_df.take(2)

[Row(number=0), Row(number=1)]

In [31]:
summary_df = spark.read.format("json").load("2015-summary.json")

In [32]:
summary_df.schema

StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,LongType,true)))

In [119]:
summary_df.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows



In [121]:
summary_df.count()

256

In [33]:
summary_df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



In [43]:
from pyspark.sql.functions import col

In [117]:
summary_df["count"]

Column<b'count'>

In [37]:
summary_df.first()

Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15)

In [41]:
from pyspark.sql.functions import col
summary_df.filter(col("count") < 2).show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



**select and selectExpr**

select and selectExpr allow you to do the DataFrame equivalent of SQL queries on a table of
data:

In [99]:
summary_df.select("DEST_COUNTRY_NAME").show(2)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows



In [101]:
summary_df.selectExpr("*", "(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry").show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows



In [103]:
from pyspark.sql.functions import lit
summary_df.withColumn("numberOne", lit(1)).show(2)

+-----------------+-------------------+-----+---------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|numberOne|
+-----------------+-------------------+-----+---------+
|    United States|            Romania|   15|        1|
|    United States|            Croatia|    1|        1|
+-----------------+-------------------+-----+---------+
only showing top 2 rows



In [105]:
summary_df.withColumn("numberOne", lit(1)).withColumnRenamed("numberOne", "NumberOne").columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count', 'NumberOne']

In [108]:
summary_df.where("count < 2").show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



In [109]:
seed = 5
withReplacement = False
fraction = 0.5
summary_df.sample(withReplacement, fraction, seed).count()

126
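
The sample contains 126 rows rather than exactly half of the 256 rows, because `fraction` is a per-row probability and not an exact proportion. A minimal sketch of this kind of sampling, assuming each row is kept independently with probability `fraction` (the function name `bernoulli_sample` is made up, and Python's random number generator will not reproduce Spark's exact selection):

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(256))
sampled = bernoulli_sample(rows, fraction=0.5, seed=5)

# The size is close to 128 but is generally not exactly 128.
print(len(sampled))
```

Fixing the seed makes the sample reproducible: calling the function again with the same arguments returns the same rows.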

In [110]:
dataFrames = summary_df.randomSplit([0.25, 0.75], seed)
dataFrames[0].count() > dataFrames[1].count() # False


False
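
A random split can be pictured as drawing one random number per row and assigning the row to the split whose weight interval contains it. The sketch below is a simplified plain-Python model under that assumption; `random_split` is a hypothetical helper, and the weights are normalized so that they sum to 1.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def random_split(rows, weights, seed):
    total = sum(weights)
    # Cumulative upper bounds of each split's interval within [0, 1).
    bounds = list(accumulate(w / total for w in weights))
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        # Find the interval containing u; min() guards against rounding.
        idx = min(bisect_right(bounds, u), len(weights) - 1)
        splits[idx].append(row)
    return splits


small, large = random_split(list(range(256)), [0.25, 0.75], seed=5)
print(len(small), len(large))
```

Every row lands in exactly one split, and the split sizes only approximate the requested weights.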

In [111]:
summary_df.sort("count").show(5)


+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|               Malta|      United States|    1|
|Saint Vincent and...|      United States|    1|
|       United States|            Croatia|    1|
|       United States|          Gibraltar|    1|
|       United States|          Singapore|    1|
+--------------------+-------------------+-----+
only showing top 5 rows



In [112]:
summary_df.orderBy("count", "DEST_COUNTRY_NAME").show(5)


+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|     Burkina Faso|      United States|    1|
|    Cote d'Ivoire|      United States|    1|
|           Cyprus|      United States|    1|
|         Djibouti|      United States|    1|
|        Indonesia|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



In [113]:
summary_df.orderBy(col("count"), col("DEST_COUNTRY_NAME")).show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|     Burkina Faso|      United States|    1|
|    Cote d'Ivoire|      United States|    1|
|           Cyprus|      United States|    1|
|         Djibouti|      United States|    1|
|        Indonesia|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows
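
Sorting by several columns means that later columns only break ties in earlier ones. In PySpark, `sort` and `orderBy` are aliases of each other. The same ordering rule can be sketched in plain Python with a tuple as sort key, using four rows taken from the outputs above:

```python
rows = [
    ("United States", "Croatia", 1),
    ("Egypt", "United States", 15),
    ("Malta", "United States", 1),
    ("United States", "Romania", 15),
]

# Sort by count first, then by destination name to break ties.
ordered = sorted(rows, key=lambda r: (r[2], r[0]))
for row in ordered:
    print(row)
```

Rows with the same count are ordered alphabetically by destination, so Malta comes before United States among the rows with count 1.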



## Aggregation Functions