In [2]:
# Setup (optional if you're using a hosted distribution)
import findspark
import pyspark

findspark.init()

spark = (pyspark.sql.SparkSession.builder
         .master('local')
         .appName('Introduction to PySpark')
         .config("spark.some.config.option", "some-value")
         .getOrCreate())

sc = spark.sparkContext

# Introduction to PySpark
This notebook explores the main data models of PySpark: `RDDs` and `DataFrames`.

## RDDs

---
> An immutable distributed collection of objects. Each RDD is split into multiple *partitions*, which may be computed on different nodes of the cluster.  
> -- Learning Spark, page 23 (Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia)

---

**R**esilient **D**istributed **D**atasets (RDDs) are the primary data abstraction in Apache Spark. They are:
- **Resilient**: fault tolerant, they can recompute missing or damaged partitions.
- **Distributed**: the data is spread across multiple nodes of a cluster.
- **Dataset**: an RDD is a collection of objects.

🚧 -> should we complete with the content from the following link?  
More details here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd.html


MORE:
RDDs are immutable and their operations are lazy; fault-tolerance is achieved by keeping track of the "lineage" of each RDD (the sequence of operations that produced it) so that it can be reconstructed in the case of data loss. RDDs can contain any type of Python, Java, or Scala objects.

### Creating RDDs

Spark provides two ways to create RDDs: loading an external dataset and "parallelizing" a collection in your driver program.  

In practice, loading from an external dataset is the most frequent usage, because the collection usually won't fit into the memory of a single machine.

#### Parallelizing an existing collection

In [263]:
numbers_rdd = sc.parallelize(range(50))
numbers_rdd

PythonRDD[404] at RDD at PythonRDD.scala:53

### Loading from a file
So far we've been building RDDs from existing Python objects. We can also load one directly from a file.

#### `sc.textFile(...)`

In [258]:
from pathlib import Path

filepath = Path('..', '_input', 'tears_in_rain.txt')
text_rdd = sc.textFile(str(filepath))
text_rdd

../_input/tears_in_rain.txt MapPartitionsRDD[398] at textFile at NativeMethodAccessorImpl.java:0

### Playing with RDDs (🚧 TODO: bad copy)

Some resources: https://www.analyticsvidhya.com/blog/2016/10/using-pyspark-to-perform-transformations-and-actions-on-rdd/  

We can perform two types of operations on RDDs: **actions** and **transformations**. All transformations are lazy: nothing is computed until we apply an action.
RDDs are **immutable**, so we cannot change them in place. Instead, we apply a **transformation**, which returns a new, **not yet computed** RDD.

We will start by performing some **actions** on our RDDs.

#### `.take(num)`
Returns the first `num` elements of the RDD as a Python list. Like all actions, it triggers the computation.

In [259]:
text_rdd.take(1)

["I've seen things you people wouldn't believe."]

#### `.collect()`
Like `.take(...)`, but returns all the elements of the RDD.

In [260]:
text_rdd.collect()

["I've seen things you people wouldn't believe.",
 'Attack ships on fire off the shoulder of Orion.',
 'I watched c-beams glitter in the dark near the Tannhäuser Gate.',
 'All those moments will be lost in time, like tears in rain.',
 'Time to die.']

If this monologue sounds familiar, that's because **[it is](https://www.youtube.com/watch?v=NoAzpa1x7jU)**.

#### `.mean()`
Compute the average of the RDD (requires numerical values).

In [261]:
numbers_rdd.mean()

24.5

And now we will apply some **transformations**.

#### `.map(func)`
Applies `func` to every element of the RDD. Won't compute anything until an action is called.

In [264]:
text_rdd.map(lambda s: s.lower())

PythonRDD[405] at RDD at PythonRDD.scala:53

How do I get my result? -> `.take(...)` or `.collect()`

In [265]:
text_rdd.map(lambda s: s.lower()).take(3)

["i've seen things you people wouldn't believe.",
 'attack ships on fire off the shoulder of orion.',
 'i watched c-beams glitter in the dark near the tannhäuser gate.']

### Chaining operations

In [279]:
text_rdd.map(lambda s: s.lower()).map(lambda s: len(s))

PythonRDD[422] at RDD at PythonRDD.scala:53

Here as well, you need to call an action (like `take` or `collect`) for the computation to be performed.

In [281]:
text_rdd.map(lambda s: s.lower()).map(lambda s: len(s)).collect()

[45, 47, 63, 59, 12]

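But don't call an action in the middle of a chain: it returns a plain Python object, not an RDD.

In [44]:
# This will fail: `.take(3)` returns a Python list, which has no `.filter(...)`
rdd_lower = text_rdd.map(lambda s: s.lower()).take(3)
rdd_lower.filter(lambda s: len(s) > 8)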

AttributeError: 'list' object has no attribute 'filter'

Let's add a new **transformation**: `filter`. It keeps only the elements for which the given function returns `True`.  

_Note: when a chain of operations spans several lines, we end each line with a backslash (`\`), Python's line-continuation character._
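
As an alternative to backslashes, Python also lets an expression span several lines when it is wrapped in parentheses. The sketch below illustrates both styles on a plain string rather than an RDD, so that it runs without Spark; the variable names are only illustrative.

```python
line = "  All those moments will be lost in time, like tears in rain.  "

# Backslash style: each continued line must end with a backslash,
# and nothing (not even a comment) may follow it.
words_backslash = line \
    .strip() \
    .lower() \
    .split()

# Parenthesized style: no backslashes needed, and comments are allowed
# between the steps of the chain.
words_parens = (
    line
    .strip()   # remove the surrounding whitespace
    .lower()   # normalize the case
    .split()   # split on whitespace
)
```

Both styles produce the same result; the parenthesized form tends to be less fragile because a stray space after a backslash is a syntax error.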

In [285]:
text_rdd \
    .map(lambda s: s.lower()) \
    .map(lambda s: len(s)) \
    .filter(lambda c: c > 50) \
    .collect()  # Try replacing this with `.count()`

[63, 59]

### Key-value tuples
It's common to use tuples as key-value pairs, like so:

In [290]:
tuples_rdd = sc.parallelize([
    ('banana', 4), ('orange', 12), ('apple', 3),
    ('pineapple', 1), ('banana', 3), ('orange', 6)])
tuples_rdd

ParallelCollectionRDD[441] at parallelize at PythonRDD.scala:195

In [298]:
# Group the values by key, then sum each group (maybe a bit too advanced for now...)
tuples_rdd.groupByKey().map(lambda t: (t[0], sum(t[1]))).collect()

[('orange', 18), ('pineapple', 1), ('banana', 7), ('apple', 3)]

## DataFrames

A distributed collection of data grouped into named columns.  
A DataFrame is equivalent to a relational table in Spark SQL.


---
> ⚠️ Although they're called DataFrames, Spark DataFrames are actually closer to SQL tables than pandas'

---

---
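> 💡 If you want an API closer to pandas while maintaining fast big data processing capabilities, take a look at [koalas](https://github.com/databricks/koalas), which has since been merged into PySpark itself as the pandas API on Spark (`pyspark.pandas`, Spark 3.2 and later).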
---

Spark DataFrames actually have richer optimizations than both SQL tables and pandas DataFrames (cf [doc](https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#overview)).

### Differences vs RDDs

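Unlike Spark's RDDs, DataFrames are not schema-less: every column has a name and a type.

Spark therefore has to infer (or be given) a schema when a DataFrame is created, which will be the cause of some of the errors below.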

#### Pros of DataFrames vs RDDs

- they enforce a schema
- you can run SQL queries against them
- usually faster than the equivalent RDD code, because Spark can optimize the query plan
- typically much smaller than RDDs when stored in a columnar format such as Parquet

### Creation

#### From a RDD

In [62]:
numbers = [i for i in range(10)]
numbers_rdd = sc.parallelize(numbers)

In [64]:
# This will fail: Spark cannot infer a schema from bare values, it needs an RDD of rows (e.g. tuples) or a pandas DataFrame
spark.createDataFrame(numbers_rdd)

ValueError: The first row in RDD is empty, can not infer schema

We know how to transform the values of an RDD: `.map(...)`. Let's wrap each value in a one-element tuple.

In [74]:
spark.createDataFrame(numbers_rdd.map(lambda k: (k, )))

DataFrame[_1: bigint]

#### From a pandas DataFrame

In [75]:
import pandas as pd
data_dict = {'a': 1, 'b': 2, 'c': 3}
pandas_df = pd.DataFrame.from_dict(
    data_dict, orient='index', columns=['position'])
pandas_df

   position
a         1
b         2
c         3


In [77]:
spark_df = spark.createDataFrame(pandas_df)
spark_df

DataFrame[position: bigint]

### Running SQL queries against DataFrames

In [108]:
spark_df.createOrReplaceTempView('my_table')

In [109]:
spark.sql("SELECT * FROM my_table")

DataFrame[position: bigint]

This returns a `DataFrame` which, just like an RDD, **is not computed until an action is called**.

### Actions
All actions perform computations. Some, like `show`, print to stdout; others, like `count`, return a value.

#### `.show(...)`
Prints out the first 20 rows of the DataFrame.

In [86]:
spark_df.show()

+--------+
|position|
+--------+
|       1|
|       2|
|       3|
+--------+



The default number of rows can be changed.

In [87]:
spark_df.show(5)

+--------+
|position|
+--------+
|       1|
|       2|
|       3|
+--------+



#### `.printSchema()`
Prints out the schema of the DataFrame. The schema is metadata, so unlike `show` this does not need to process any rows.

In [90]:
spark_df.printSchema()

root
 |-- position: long (nullable = true)



In [106]:
# TODO: explain the concept of schema, and say something about columns

In [117]:
spark_df.columns  # not an `action` (nor a transformation)

['position']

#### `.take(...)`
Returns the first `num` rows of the DataFrame as a list.

In [89]:
spark_df.take(5)

[Row(position=1), Row(position=2), Row(position=3)]

As you can see, a PySpark `DataFrame` is a collection of [`Row`](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.Row) objects.

#### `.collect(...)`
Like `.take(...)`, but returns all the rows of the DataFrame.

In [85]:
spark_df.collect()

[Row(position=1), Row(position=2), Row(position=3)]

---
⚠️ `.collect()` brings every row back to the driver: do **NOT** perform this action on a full DataFrame, only on small DataFrames such as aggregate results.

---

#### `.count(...)`
Return the number of `Rows` in the DataFrame

In [195]:
spark_df.count()

3

### Transformations
- [PySpark Cookbook](https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781788835367/3/ch03lvl1sec34/overview-of-dataframe-transformations)

#### `.take(...)`

In [103]:
spark_df.take(3)

[Row(position=1), Row(position=2), Row(position=3)]

In [133]:
# Equivalent: `.head(n)` returns the same list of rows as `.take(n)`
spark_df.head(5)

[Row(position=1), Row(position=2), Row(position=3)]


### Missing values: `.na`

In [224]:
spark_df.na

<pyspark.sql.dataframe.DataFrameNaFunctions at 0x119525590>

#### `.fill(...)`

In [223]:
spark_df.na.fill(0).show()

+--------+
|position|
+--------+
|       1|
|       2|
|       3|
+--------+



#### `.drop()`

In [226]:
spark_df.na.drop().show()

+--------+
|position|
+--------+
|       1|
|       2|
|       3|
+--------+



Equivalent to `.dropna()`

In [227]:
spark_df.dropna().show()

+--------+
|position|
+--------+
|       1|
|       2|
|       3|
+--------+



Both accept an optional `subset` parameter to only consider some of the columns.

#### `.replace(...)`

In [229]:
spark_df.na.replace(2, 4).show() 

+--------+
|position|
+--------+
|       1|
|       4|
|       3|
+--------+



#### `.select()`

In [116]:
spark_df.printSchema()

root
 |-- position: long (nullable = true)



In [119]:
spark_df.select('position')

DataFrame[position: bigint]

Similar to `spark.sql("SELECT position FROM my_table")`.  
To claim equivalence, we would have to compare the execution plans of both (which is beyond the scope of this course).

### Some differences with pandas' DataFrames

- Accessor: `df.features`, `df['features']` vs `df.select('features')` -> more later
- Also, most (if not all) transformations in PySpark are not `inplace`

Accessors return a `Column` object.

In [124]:
# This runs, but returns `Column` objects rather than the data itself
spark_df.position, spark_df['position']

(Column<b'position'>, Column<b'position'>)

Not very useful by themselves, but can be passed to a `.select(...)`.

In [126]:
spark_df.select(spark_df.position).show()

+--------+
|position|
+--------+
|       1|
|       2|
|       3|
+--------+



In a case like this, just as in SQL, Spark can resolve the column name against the DataFrame, so a plain string works too:

In [137]:
spark_df.select('position')

DataFrame[position: bigint]

In [138]:
spark_df.select('position').show()

+--------+
|position|
+--------+
|       1|
|       2|
|       3|
+--------+



#### `.alias(...)`

In [211]:
spark_df.select(spark_df.position.alias('aliased_column')).show()

+--------------+
|aliased_column|
+--------------+
|             1|
|             2|
|             3|
+--------------+



In [212]:
# This will fail: `.alias(...)` is a method of `Column`, not of `str`
spark_df.select('position'.alias('aliased_column')).show()

AttributeError: 'str' object has no attribute 'alias'

#### `.drop(...)`

In [135]:
spark_df.drop('position')

DataFrame[]

In [136]:
spark_df.drop('position').show()

++
||
++
||
||
||
++



### A spark of SQL

As we've seen before, we can run SQL queries against a registered view.

In [165]:
spark.sql("SELECT * FROM my_table LIMIT 5").show()

+--------+
|position|
+--------+
|       1|
|       2|
|       3|
+--------+



Multi-line statements need the use of triple quotes `"""`

In [166]:
spark.sql("""
    SELECT position
    FROM my_table
    LIMIT 5
""").show()

+--------+
|position|
+--------+
|       1|
|       2|
|       3|
+--------+



That's convenient, but we can also use the PySpark DataFrame API to perform the same operations.

#### `.limit(num)`
Like SQL's `LIMIT`.  
Limit the DataFrame to `num` rows.

In [167]:
spark_df.limit(5).show()

+--------+
|position|
+--------+
|       1|
|       2|
|       3|
+--------+



#### `.filter(...)`

In [130]:
spark_df.filter(spark_df.position < 3)

DataFrame[position: bigint]



In [139]:
spark_df.filter(spark_df.position < 3).show()

+--------+
|position|
+--------+
|       1|
|       2|
+--------+



--- 
> 💡 We can even mix both APIs

---

In [171]:
spark_df.limit(5).selectExpr("position * 2", "abs(position)").show()

+--------------+-------------+
|(position * 2)|abs(position)|
+--------------+-------------+
|             2|            1|
|             4|            2|
|             6|            3|
+--------------+-------------+



#### `.dropDuplicates(...)`

In [173]:
spark_df.dropDuplicates().show()

+--------+
|position|
+--------+
|       1|
|       3|
|       2|
+--------+



#### `.distinct()`

In [176]:
spark_df.distinct().show()

+--------+
|position|
+--------+
|       1|
|       3|
|       2|
+--------+



#### `.orderBy(...)`
Alias to `.sort(...)`

In [218]:
spark_df.orderBy('position').show()

+--------+
|position|
+--------+
|       1|
|       2|
|       3|
+--------+



We can call `.desc()` to get a descending order, but that means we need an actual `Column` object to call it on. 

In [220]:
# This will fail
spark_df.orderBy(('position').desc()).show()

AttributeError: 'str' object has no attribute 'desc'

In [221]:
# This won't
spark_df.orderBy(spark_df.position.desc()).show()

+--------+
|position|
+--------+
|       3|
|       2|
|       1|
+--------+




Knowing when a plain string is enough and when a `Column` object is required is actually one of the keys to Spark SQL fluency, but it requires some practice.

---

⭐️ No worries, we will review all this later.

---

#### `.groupBy(...)`

In [203]:
spark_df.groupBy('position')

<pyspark.sql.group.GroupedData at 0x119516910>

Returns a `GroupedData` object. We need to apply an aggregation to it before we can display anything.

In [193]:
# This won't work: `GroupedData` is not a DataFrame
spark_df.groupBy('position').show()

AttributeError: 'GroupedData' object has no attribute 'show'

In [196]:
# An aggregation, this one works
spark_df.groupBy('position').count()

DataFrame[position: bigint, count: bigint]

⚠️ When applied to a DataFrame, `.count()` is an action. Applied to `GroupedData`, it is an aggregation that returns a `DataFrame`, i.e. it is still waiting for an action.

In [197]:
spark_df.groupBy('position').count().show()

+--------+-----+
|position|count|
+--------+-----+
|       1|    1|
|       3|    1|
|       2|    1|
+--------+-----+



#### Chaining everything together

In [198]:
spark_df \
    .filter(spark_df.position < 2) \
    .groupBy('position') \
    .count() \
    .orderBy('count') \
    .limit(5) \
    .show()

+--------+-----+
|position|count|
+--------+-----+
|       1|    1|
+--------+-----+



Question: what if we want to order by descending count?

### Adding columns
Using pure select is possible, but can feel tedious

In [214]:
spark_df.select('*', spark_df.position.alias('newColumn')).show()

+--------+---------+
|position|newColumn|
+--------+---------+
|       1|        1|
|       2|        2|
|       3|        3|
+--------+---------+



#### `.withColumn(...)`
It's usually easier to use `.withColumn` for the same effect.

In [213]:
spark_df.withColumn('newColumn', spark_df.position).show()

+--------+---------+
|position|newColumn|
+--------+---------+
|       1|        1|
|       2|        2|
|       3|        3|
+--------+---------+



#### `.withColumnRenamed(...)`

In [217]:
spark_df.withColumnRenamed('position', 'newName').show()

+-------+
|newName|
+-------+
|      1|
|      2|
|      3|
+-------+



### Displaying the DataFrame
For when `.show()` won't cut it...

#### `display(...)`

In [None]:
display(spark_df)

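This can be slow, and it only renders a rich table in some environments (such as Databricks notebooks). Elsewhere it may simply print the DataFrame's short representation.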

#### Alternative: converting to pandas'
Using `toPandas()`: this is an action, it will compute and bring the result back to the driver.  
Hence, do **NOT** forget to `limit` or you'll exhaust the driver's memory (unless the DataFrame is small, like the result of an aggregate).

In [146]:
spark_df.limit(5).toPandas()

   position
0         1
1         2
2         3


In [200]:
spark_df.sort('position')

DataFrame[position: bigint]

## Resources
- The [official documentation for RDDs](https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations)
- The [official documentation for `pyspark.sql`](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html) (DataFrames)
- The part about RDDs in [Mastering Apache Spark](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd.html) (Scala based)
- The part about DataFrames in [Mastering Spark SQL](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-DataFrame.html) (Scala based)
- [Learning Apache Spark with PySpark & Databricks](https://hackersandslackers.com/learning-to-use-apache-spark-pyspark/)