# Basic Structured Operations

- This lab focuses on fundamental DataFrame operations.
- We will use Spark to manipulate DataFrames and the data within them.
- Aggregations, window functions, and joins are out of scope for this lab.

### Key Terminology Related to DataFrames

- A DataFrame consists of rows (records) and columns.
- A schema defines the name and data type of each column.
- Partitioning defines the physical layout of a DataFrame and how it is distributed across the cluster.
- The partitioning scheme defines how that distribution is allocated.


In [34]:
from pyspark.sql import SparkSession

In [35]:
spark = SparkSession.builder \
    .appName("StructuredOpsLab") \
    .getOrCreate()

In [36]:
spark

In [37]:
# Let's create a DataFrame to work with:
df = spark.read.format("json").load("/Users/satkarkarki/spark_the_definitive_guide/data/flight-data/json/2015-summary.json")

In [38]:
# Let's take a look at the schema of the current DataFrame:
df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



### Schema:
- A schema defines the column names and types of a DataFrame.
- We can either let the data source define the schema (called schema-on-read) or define it explicitly ourselves.
- A schema is a StructType made up of a number of fields, called StructFields.
- Each StructField has a name, a type, a Boolean flag (whether the column can contain null values), and optional metadata.
- A schema in Spark can contain other complex types, including nested StructTypes, to be featured in later chapters.
    
### The cell below shows the schema that Spark infers when it reads the JSON file.


In [39]:
spark.read.format("json").load("/Users/satkarkarki/spark_the_definitive_guide/data/flight-data/json/2015-summary.json").schema

StructType([StructField('DEST_COUNTRY_NAME', StringType(), True), StructField('ORIGIN_COUNTRY_NAME', StringType(), True), StructField('count', LongType(), True)])

### The example that follows shows how to create and enforce a specific schema on a DataFrame.

In [40]:
from pyspark.sql.types import StructField, StructType, StringType, LongType

myManualSchema = StructType([
    StructField("DEST_COUNTRY_NAME", StringType(), True),
    StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
    StructField("count", LongType(), False, metadata={"hello":"world"})
])
# Apply the manual schema instead of letting Spark infer one
df = spark.read.format("json").schema(myManualSchema) \
    .load("/Users/satkarkarki/spark_the_definitive_guide/data/flight-data/json/2015-summary.json")

## Columns and Expressions

- Columns in Spark are similar to columns in a spreadsheet, an R data frame, or a pandas DataFrame.
- The operations used to manipulate the columns of a DataFrame are represented as expressions.
- The two simplest ways to construct and refer to columns are the col() and column() functions.


In [41]:
from pyspark.sql.functions import col, column
col("someColumnName")
# column("someColumnName")

Column<'someColumnName'>

In [42]:
(((col("someCol") + 5) * 200) - 6) < col("otherCol")

Column<'<(-(*(+(someCol, 5), 200), 6), otherCol)'>

In [43]:
from pyspark.sql.functions import expr
expr("(((someCol + 5) * 200) - 6) < otherCol")

Column<'(((someCol + 5) * 200) - 6) < otherCol'>

In [44]:
## accessing a DataFrame's columns:

spark.read.format("json")\
    .load("/Users/satkarkarki/spark_the_definitive_guide/data/flight-data/json/2015-summary.json")\
    .columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count']

In [45]:
# Let's see a row by calling first() on our DataFrame:


df.first()

Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15)

In [46]:
## Creating rows:

from pyspark.sql import Row
myRow = Row("Hello", None, 1, False)

In [47]:
## To access data in a row, specify the position of the value you want:

myRow[0]

'Hello'

In [48]:
myRow[2]

1

## DataFrame Transformations:

The most common DataFrame transformations are:

- Remove columns or rows
- Transform a row into a column or a column into a row
- Add rows or columns
- Sort data by the values in rows

### Creating DataFrames

In [49]:
## Read a JSON file into df, then register it as a temporary view named "dfTable" so it can be queried with SQL:

df = spark.read.format("json").load("/Users/satkarkarki/spark_the_definitive_guide/data/flight-data/json/2015-summary.json")
df.createOrReplaceTempView("dfTable")

In [50]:
# After registering the DataFrame as a temp view, let's run a basic query against dfTable:
spark.sql("SELECT * FROM dfTable LIMIT 10").show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|          Singapore|    1|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|          Senegal|      United States|   40|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+



In [51]:
## The following code walks through how a DataFrame can be constructed on the fly:

from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, LongType
myManualSchema = StructType([
    StructField("some", StringType(), True),
    StructField("col", StringType(), True),
    StructField("names", LongType(), False)    
])
myRow = Row("Hello", None, 1)
myDf = spark.createDataFrame([myRow], myManualSchema)
myDf.show()

+-----+----+-----+
| some| col|names|
+-----+----+-----+
|Hello|NULL|    1|
+-----+----+-----+



### select() and selectExpr()

- **select** and **selectExpr** are the DataFrame equivalents of SELECT in SQL. select takes column names or Column objects, while selectExpr takes SQL expression strings.
- In simpler terms, they allow us to manipulate the columns of a DataFrame.

In [52]:
df.select("DEST_COUNTRY_NAME").show(2)

## In SQL fashion, the same results can be queried through:

## SELECT DEST_COUNTRY_NAME 
## FROM dfTable 
## LIMIT 2;

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows


In [53]:
## Selecting multiple columns using select:

df.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME").show(2)

# The SQL equivalent code for this would be:

## SELECT DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME 
## FROM dfTable 
## LIMIT 2;

+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
|    United States|            Romania|
|    United States|            Croatia|
+-----------------+-------------------+
only showing top 2 rows


In [54]:
## In Spark, columns can be referred to in a number of different ways:
## expr --> parses a SQL-style string expression
## col --> functional way to reference a column (the cleanest and most common method in DataFrame transformations)
## column --> an alias of col; it behaves the same way but is used less often

from pyspark.sql.functions import expr, col, column
df.select(
    expr("DEST_COUNTRY_NAME"),
    col("DEST_COUNTRY_NAME"),
    column("DEST_COUNTRY_NAME"))\
   .show()

+--------------------+--------------------+--------------------+
|   DEST_COUNTRY_NAME|   DEST_COUNTRY_NAME|   DEST_COUNTRY_NAME|
+--------------------+--------------------+--------------------+
|       United States|       United States|       United States|
|       United States|       United States|       United States|
|       United States|       United States|       United States|
|               Egypt|               Egypt|               Egypt|
|       United States|       United States|       United States|
|       United States|       United States|       United States|
|       United States|       United States|       United States|
|          Costa Rica|          Costa Rica|          Costa Rica|
|             Senegal|             Senegal|             Senegal|
|             Moldova|             Moldova|             Moldova|
|       United States|       United States|       United States|
|       United States|       United States|       United States|
|              Guyana|   

In [55]:
## Using expr to alias:

df.select(expr("DEST_COUNTRY_NAME AS destination")).show(2)

+-------------+
|  destination|
+-------------+
|United States|
|United States|
+-------------+
only showing top 2 rows


In [56]:
## We can further manipulate the result of an expression as another expression.
## Here the outer .alias() call overrides the alias set inside expr(), which is another way to rename a column.

df.select(expr("DEST_COUNTRY_NAME AS destination").alias("Destination Country")) \
    .show(2)

+-------------------+
|Destination Country|
+-------------------+
|      United States|
|      United States|
+-------------------+
only showing top 2 rows


In [57]:
## We can use selectExpr as a simple way to build up complex expressions that create new DataFrames.
## In this example, we create a new column (boolean) which checks if the flight is within country or not.

df.selectExpr(
    "*",
    "(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry") \
    .show(2)

## The SQL equivalent for this code will be:
## SELECT *, (DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) AS withinCountry
## FROM dfTable
## LIMIT 2

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows


In [58]:
## With select expression, we can also specify aggregations over the entire DataFrame:
df.selectExpr("avg(count)", "count(distinct(DEST_COUNTRY_NAME))").show(2)

## The SQL equivalent for this code will be:
## SELECT AVG(count), COUNT(DISTINCT(DEST_COUNTRY_NAME))
## FROM dfTable
## LIMIT 2;

+-----------+---------------------------------+
| avg(count)|count(DISTINCT DEST_COUNTRY_NAME)|
+-----------+---------------------------------+
|1770.765625|                              132|
+-----------+---------------------------------+



## Literals in Spark pass a constant value from Python to Spark by converting it into a Spark type

#### This comes up when checking whether a value is greater than some constant or another programmatically created variable.

In [59]:
from pyspark.sql.functions import lit
df.select(expr("*"), lit(1).alias("One")).show(2)

## This is how literals are passed in SQL:
## SELECT *, 1 AS One FROM dfTable LIMIT 2;

+-----------------+-------------------+-----+---+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|One|
+-----------------+-------------------+-----+---+
|    United States|            Romania|   15|  1|
|    United States|            Croatia|    1|  1|
+-----------------+-------------------+-----+---+
only showing top 2 rows


### Adding Columns

- The formal way to add a new column to a DataFrame is using the withColumn() method.

In [60]:
df.withColumn("numberOne", lit(1)).show(2)

## SQL Equivalent Code:
## SELECT *, 1 AS numberOne FROM dfTable LIMIT 2

+-----------------+-------------------+-----+---------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|numberOne|
+-----------------+-------------------+-----+---------+
|    United States|            Romania|   15|        1|
|    United States|            Croatia|    1|        1|
+-----------------+-------------------+-----+---------+
only showing top 2 rows


In [61]:
## In this example, we'll set a Boolean flag when the origin country = destination country
df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME")) \
    .show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows


#### Renaming Columns:
- Use the withColumnRenamed method to rename existing columns

In [62]:
df.withColumnRenamed("DEST_COUNTRY_NAME", "dest").columns

['dest', 'ORIGIN_COUNTRY_NAME', 'count']

### Reserved Characters and Keywords

- **Use backticks** when a column name has spaces, dashes, or is a SQL keyword **and** you are writing SQL-style expressions.
- **Don't use backticks** when you're simply passing string names in Python methods like withColumn

In [68]:
## Using withColumn() -- No escaping required:

dfWithLongColName = df.withColumn(
    "This Long Column-Name",
    expr("ORIGIN_COUNTRY_NAME")
)

In [69]:
## Using selectExpr() -- Escaping Required:

dfWithLongColName.selectExpr(
    "`This Long Column-Name`",
    "`This Long Column-Name` AS `new col`").show(2)

+---------------------+-------+
|This Long Column-Name|new col|
+---------------------+-------+
|              Romania|Romania|
|              Croatia|Croatia|
+---------------------+-------+
only showing top 2 rows


In [70]:
dfWithLongColName.createOrReplaceTempView("dfTableLong")
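# Inside expr() the column name is parsed as SQL, so it must be escaped with backticks
dfWithLongColName.select(expr("`This Long Column-Name`")).columns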

### By default, Spark is case-insensitive; it can be made case-sensitive by setting the following configuration in SQL:
SET spark.sql.caseSensitive=true

### Changing a Column's Type (cast)


In [72]:

df.withColumn("count2", col("count").cast("string"))

## In SQL:
## SELECT *, CAST(count AS STRING) AS count2 FROM dfTable

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint, count2: string]

### Filtering Rows

- Two equivalent methods support filter operations: 'where' and 'filter'.
- While 'filter' is valid, it's common practice to stick with 'where' because of its familiarity from SQL.
- Spark performs all filtering operations at the same time, regardless of the order in which the filters are written.
- To apply multiple AND filters, simply chain them sequentially and let Spark handle the rest.

In [73]:
# Filter values where count is less than 2 (using 'filter'):

df.filter(col("count") < 2).show(2)

## In SQL:
## SELECT * FROM dfTable WHERE count < 2 LIMIT 2

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+
only showing top 2 rows


In [75]:
# Filter values where count is less than 2 (using 'where'):

df.where("count < 2").show(2)

## In SQL:
## SELECT * FROM dfTable WHERE count < 2 LIMIT 2

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+
only showing top 2 rows


In [78]:
# Applying multiple filters:

df.where(col("count") < 2).where(col("ORIGIN_COUNTRY_NAME") != "Croatia") \
    .show(2)

## In SQL:
## SELECT * FROM dfTable WHERE count < 2 AND ORIGIN_COUNTRY_NAME != "Croatia" LIMIT 2;

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|          Singapore|    1|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+
only showing top 2 rows


### Getting Unique Rows

- Use the 'distinct' method to get unique rows; this helps with deduplication.
        

In [80]:
df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").distinct().count()

## In SQL:
## SELECT COUNT(DISTINCT(ORIGIN_COUNTRY_NAME, DEST_COUNTRY_NAME)) FROM dfTable;

256

In [82]:
df.select("ORIGIN_COUNTRY_NAME").distinct().count()

## In SQL:
## SELECT COUNT(DISTINCT(ORIGIN_COUNTRY_NAME)) FROM dfTable;

125

## Random Samples

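- Spark supports sampling random records using the sample method on a DataFrame.
- withReplacement is a Boolean flag that specifies whether sampling is done with replacement, that is, whether the same row can be picked more than once.
- fraction is the expected share of rows to keep, so the size of the sample is approximate rather than exact.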

In [84]:
seed = 5
withReplacement = False
fraction = 0.5
df.sample(withReplacement, fraction, seed).count()

138

## Random Splits

- This feature is mostly useful in ML, to split a dataset into training, validation, and test sets.
- The weights are normalized if they do not sum to 1.

In [89]:
dataFrames = df.randomSplit([0.25, 0.75], seed)
dataFrames[0].count() > dataFrames[1].count()

False

### Concatenating and Appending Rows (Union)

- DataFrames are immutable, so we cannot directly add rows to an existing DataFrame as you might in pandas using .append().
- To append to a DataFrame, we must union the original DataFrame with the new DataFrame.
- Use .union() to combine the rows of two DataFrames, but only if their schemas match. union() matches columns by position, not by name.

In [96]:
from pyspark.sql import Row
schema = df.schema
newRows = [
    Row("New Country", "Other Country", 5),
    Row("New Country 2", "Other Country 3", 1)
]
parallelizedRows = spark.sparkContext.parallelize(newRows) ## parallelize rows across Spark cluster
newDF = spark.createDataFrame(parallelizedRows, schema)

In [98]:
df.union(newDF) \
    .where("count = 5") \
    .where(col("ORIGIN_COUNTRY_NAME") != "United States") \
    .show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|      New Country|      Other Country|    5|
+-----------------+-------------------+-----+



### Sorting Rows

- sort and orderBy are two equivalent operations for sorting rows; both sort in ascending order by default.


In [109]:
df.sort("count").show(50)

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|       United States|            Croatia|    1|
|       United States|          Singapore|    1|
|             Moldova|      United States|    1|
|               Malta|      United States|    1|
|       United States|          Gibraltar|    1|
|Saint Vincent and...|      United States|    1|
|            Suriname|      United States|    1|
|       United States|             Cyprus|    1|
|        Burkina Faso|      United States|    1|
|            Djibouti|      United States|    1|
|       United States|            Estonia|    1|
|              Zambia|      United States|    1|
|              Cyprus|      United States|    1|
|       United States|          Lithuania|    1|
|       United States|           Bulgaria|    1|
|       United States|            Georgia|    1|
|       United States|            Bahrain|    1|
|       Cote d'Ivoir

In [104]:
df.orderBy("count", "DEST_COUNTRY_NAME").show(20)

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|        Burkina Faso|      United States|    1|
|       Cote d'Ivoire|      United States|    1|
|              Cyprus|      United States|    1|
|            Djibouti|      United States|    1|
|           Indonesia|      United States|    1|
|                Iraq|      United States|    1|
|              Kosovo|      United States|    1|
|               Malta|      United States|    1|
|             Moldova|      United States|    1|
|       New Caledonia|      United States|    1|
|Saint Vincent and...|      United States|    1|
|            Suriname|      United States|    1|
|       United States|            Estonia|    1|
|       United States|             Cyprus|    1|
|       United States|          Singapore|    1|
|       United States|   Papua New Guinea|    1|
|       United States|            Bahrain|    1|
|       United State

In [108]:
df.orderBy(col("count"), col("DEST_COUNTRY_NAME")).show(25)

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|        Burkina Faso|      United States|    1|
|       Cote d'Ivoire|      United States|    1|
|              Cyprus|      United States|    1|
|            Djibouti|      United States|    1|
|           Indonesia|      United States|    1|
|                Iraq|      United States|    1|
|              Kosovo|      United States|    1|
|               Malta|      United States|    1|
|             Moldova|      United States|    1|
|       New Caledonia|      United States|    1|
|Saint Vincent and...|      United States|    1|
|            Suriname|      United States|    1|
|       United States|          Gibraltar|    1|
|       United States|            Croatia|    1|
|       United States|          Singapore|    1|
|       United States|             Cyprus|    1|
|       United States|            Estonia|    1|
|       United State

#### To specify the sort direction more explicitly, we can use the asc or desc functions when operating on a column

In [113]:
from pyspark.sql.functions import desc, asc
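# desc("count") sorts by the count column in descending order.
# Note: expr("count desc") is parsed as the column count with the alias "desc", not as a sort direction.
df.orderBy(desc("count")).show(2)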

+-----------------+-------------------+------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-----------------+-------------------+------+
|    United States|      United States|370002|
|    United States|             Canada|  8483|
+-----------------+-------------------+------+
only showing top 2 rows


In [116]:
## Use asc_nulls_first, asc_nulls_last, desc_nulls_first, or desc_nulls_last to specify where null values should appear:

df.orderBy(col("count").desc_nulls_first()).show(2)

+-----------------+-------------------+------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-----------------+-------------------+------+
|    United States|      United States|370002|
|    United States|             Canada|  8483|
+-----------------+-------------------+------+
only showing top 2 rows


In [118]:
# For optimization purposes, it can help to sort within each partition before another set of transformations:

spark.read.format("json").load("/Users/satkarkarki/spark_the_definitive_guide/data/flight-data/json/2015-summary.json") \
    .sortWithinPartitions("count")


DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

### Repartition and Coalesce

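- Another important optimization opportunity is to partition the data according to some frequently filtered columns.
- repartition() always incurs a full shuffle of the data, so it is typically worth using only when increasing the number of partitions or when partitioning by a set of columns.
- If we often run queries like WHERE country = 'USA', partitioning by country can make those queries faster.
- When data is stored partitioned by that column, Spark can skip partitions that cannot match a filter, a concept known as partition pruning.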
        
    

In [119]:
df.rdd.repartition(5)

MapPartitionsRDD[404] at coalesce at NativeMethodAccessorImpl.java:0

In [123]:
# partitioning based on a certain column i.e. destination country name for this example:

df.repartition(col("DEST_COUNTRY_NAME"))

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

In [124]:
# Optionally, the number of partitions to be applied can also be specified:

df.repartition(5, col("DEST_COUNTRY_NAME"))

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

### Coalesce

- coalesce(n) reduces the number of partitions **without moving all the data around**.
- It's a **narrow transformation** - only combines existing partitions together.
- Because Spark **avoids a full shuffle across the cluster**, coalesce() is usually **much cheaper than** repartition(). It can only reduce the number of partitions, not increase it.

In [125]:
df.repartition(5, col("DEST_COUNTRY_NAME")).coalesce(2)

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

## Collecting Rows to the Driver

The **driver** is the **coordinator process** in Spark. It holds:

- The logic of the Spark app
- The results of transformations and actions (when pulled from the cluster)

When we **collect data to the driver**, we are **moving data from the executors (cluster) to the driver's local memory**. Collecting a very large dataset can crash the driver.

In [130]:
collectDF = df.limit(10) # applies transformation to limit the DataFrame to 10 rows
collectDF.show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|          Singapore|    1|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|          Senegal|      United States|   40|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+



In [131]:
collectDF.take(5) # take works with an Integer count and returns the first 5 rows from collectDF

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344),
 Row(DEST_COUNTRY_NAME='Egypt', ORIGIN_COUNTRY_NAME='United States', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='India', count=62)]

In [136]:
collectDF.show(5, False) ## False tells Spark not to truncate long strings in cells

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India              |62   |
+-----------------+-------------------+-----+
only showing top 5 rows


In [137]:
collectDF.collect()

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344),
 Row(DEST_COUNTRY_NAME='Egypt', ORIGIN_COUNTRY_NAME='United States', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='India', count=62),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Grenada', count=62),
 Row(DEST_COUNTRY_NAME='Costa Rica', ORIGIN_COUNTRY_NAME='United States', count=588),
 Row(DEST_COUNTRY_NAME='Senegal', ORIGIN_COUNTRY_NAME='United States', count=40),
 Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count=1)]

In [138]:
## The method toLocalIterator collects partitions to the driver one at a time and returns them as an iterator
collectDF.toLocalIterator()

<generator object _local_iterator_from_socket.<locals>.PyLocalIterable.__iter__ at 0x166099fc0>