<h1>The DataFrame API</h1>

Inspired by pandas DataFrames in structure, format, and a few specific operations, Spark DataFrames are like distributed in-memory tables with named columns and schemas, where each column has a specific data type: integer, string, array, map, real, date, timestamp, etc.

When data is visualized as a structured table, it’s not only easy to digest but also easy to work with when it comes to common operations you might want to execute on rows and columns. DataFrames are immutable and Spark keeps a lineage of all transformations. You can add or change
the names and data types of the columns, creating new DataFrames while the previous versions are preserved. A named column in a DataFrame and its associated Spark data type can be declared in the schema.

Spark supports basic <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/data_types.html">Python data types</a>, as enumerated below:

| Data type      | Value in Python | API to instantiate     |
| :----:      |    :----:   |    :----:     |
| ByteType      | int       | ByteType()   |
| ShortType   | int        | ShortType()      |
| IntegerType   | int        | IntegerType()      |
| LongType   | int        | LongType()     |
| FloatType   | float        | FloatType()      |
| DoubleType   | float        | DoubleType()      |
| StringType   | str        | StringType()     |
| BooleanType   | bool        | BooleanType()      |
| DecimalType   | decimal.Decimal        | DecimalType()      |

<h2>Spark’s Structured and Complex Data Types</h2>

For complex data analytics, you won’t deal only with simple or basic data types. Your data will be complex, often structured or nested, and you’ll need Spark to handle these complex data types. They come in many forms: maps, arrays, structs, dates, timestamps, fields, etc. The <a href="https://spark.apache.org/docs/latest/sql-ref-datatypes.html">Spark SQL data types reference</a> lists them in full; the table below shows how they map to Python:

| Data type      | Value in Python | API to instantiate     |
| :----:      |    :----   |    :----:     |
| BinaryType      | bytearray       | BinaryType()   |
| TimestampType   | datetime.datetime        | TimestampType()     |
| DateType   | datetime.date        | DateType()      |
| ArrayType   | List, tuple, or array        | ArrayType(dataType, [nullable])     |
| MapType   | dict        | MapType(keyType, valueType, [nullable])      |
| StructType   | List or tuple        | StructType([fields])    |
| StructField   | A value type corresponding to the type of this field        | StructField(name, dataType, [nullable])     |

## Schemas and Creating DataFrames

A schema in Spark defines the column names and associated data types for a DataFrame. Most often, schemas come into play when you are reading structured data from an external data source. Defining a schema up front as opposed to taking a schema-on-read approach offers three benefits:
- You relieve Spark from the complexity of inferring data types.
- You prevent Spark from creating a separate job just to read a large portion of your file to ascertain the schema, which for a large data file can be expensive and time-consuming.
- You can detect errors early if data doesn’t match the schema.

So, I encourage you to always define your schema up front whenever you want to read a large file from a data source.

### Two ways to define a schema

Spark allows you to define a schema in two ways. One is to define it programmatically, and the other is to employ a Data Definition Language (DDL) string, which is much simpler and easier to read.

In [1]:
import findspark

# If you know the Spark installation path, you can pass it to findspark.init() as a parameter
findspark.init()

In [2]:
from pyspark.sql.types import *

ex_schema = StructType([StructField("author", StringType(), False),
                    StructField("title", StringType(), False),
                    StructField("pages", IntegerType(), False)])

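Defining a schema using DDL is much simpler. The next example defines a schema for a small set of blog data this way and uses it to create a DataFrame: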

In [3]:
from pyspark.sql import SparkSession

# Define schema for our data using DDL
schema = "`Id` INT, `First` STRING, `Last` STRING, `Url` STRING, `Published` STRING, `Hits` INT, `Campaigns` ARRAY<STRING>"

# Create our static data
data = [
            [1, "Jules", "Damji", "https://tinyurl.1", "1/4/2016", 4535, ["twitter", "LinkedIn"]],
            [2, "Brooke","Wenig", "https://tinyurl.2", "5/5/2018", 8908, ["twitter", "LinkedIn"]],
            [3, "Denny", "Lee", "https://tinyurl.3", "6/7/2019", 7659, ["web", "twitter", "FB", "LinkedIn"]],
            [4, "Tathagata", "Das", "https://tinyurl.4", "5/12/2018", 10568, ["twitter", "FB"]],
            [5, "Matei","Zaharia", "https://tinyurl.5", "5/14/2014", 40578, ["web", "twitter", "FB", "LinkedIn"]],
            [6, "Reynold", "Xin", "https://tinyurl.6", "3/2/2015", 25568, ["twitter", "LinkedIn"]]
        ]

# Main program
if __name__ == "__main__":
    # Create a SparkSession
    spark = (SparkSession
                    .builder
                    .appName("Example-3_6")
                    .getOrCreate())

    # Create a DataFrame using the schema defined above
    blogs_df = spark.createDataFrame(data, schema)
    
    # Show the DataFrame; it should reflect our table above
    blogs_df.show()
    
    # Print the schema used by Spark to process the DataFrame
    blogs_df.printSchema()
    
    # spark.stop()

+---+---------+-------+-----------------+---------+-----+--------------------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|
+---+---------+-------+-----------------+---------+-----+--------------------+
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|
+---+---------+-------+-----------------+---------+-----+--------------------+

root
 |-- Id: integer (nullable = true)
 |-- First: string (nullable = true)
 |-- Last: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- Published: string (nullable = true)
 |-- Hits: integer (nullable = true)
 |-- Campaigns: array (nullable = true)
 |    |-- element: string (containsNull = true)

If you want to use this schema elsewhere in your code, simply execute `blogs_df.schema` and it will return the schema definition:

In [4]:
blogs_df.schema

StructType([StructField('Id', IntegerType(), True), StructField('First', StringType(), True), StructField('Last', StringType(), True), StructField('Url', StringType(), True), StructField('Published', StringType(), True), StructField('Hits', IntegerType(), True), StructField('Campaigns', ArrayType(StringType(), True), True)])

## Columns and Expressions
As mentioned previously, named columns in DataFrames are conceptually similar to named columns in pandas or R DataFrames or in an RDBMS table: they describe a type of field. You can list all the columns by their names, and you can perform operations on their values using relational or computational expressions. In Spark’s supported languages, columns are objects with public methods.

Let’s take a look at some examples of what we can do with columns in Spark. Each example is followed by its output:

In [5]:
from pyspark.sql.functions import *

blogs_df.columns

['Id', 'First', 'Last', 'Url', 'Published', 'Hits', 'Campaigns']

In [6]:
# Access a particular column with col; it returns a Column object
col("Id")

Column<'Id'>

In [7]:
# Use col to compute a value from a column
# Equivalent, using a SQL expression string: blogs_df.select(expr("Hits * 2")).show(2)
# Expressions can also compare two columns, e.g. expr("columnName - 5") > col("anotherColumnName")
blogs_df.select(col("Hits") * 2).show(2)

+----------+
|(Hits * 2)|
+----------+
|      9070|
|     17816|
+----------+
only showing top 2 rows



In [8]:
# Use an expression to compute big hitters for blogs
# This adds a new column, Big Hitters, based on the conditional expression
blogs_df.withColumn("Big Hitters", (expr("Hits > 10000"))).show()

+---+---------+-------+-----------------+---------+-----+--------------------+-----------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|Big Hitters|
+---+---------+-------+-----------------+---------+-----+--------------------+-----------+
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|      false|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|      false|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|      false|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|       true|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|       true|
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|       true|
+---+---------+-------+-----------------+---------+-----+--------------------+-----------+



In [9]:
blogs_df.withColumn("AuthorsId", (concat(expr("First"), expr("Last"), expr("Id")))) \
    .select(col("AuthorsId")) \
    .show(4)

+-------------+
|    AuthorsId|
+-------------+
|  JulesDamji1|
| BrookeWenig2|
|    DennyLee3|
|TathagataDas4|
+-------------+
only showing top 4 rows



In [10]:
# The following three statements are equivalent
# blogs_df.select(expr("Hits")).show(2)
# blogs_df.select(col("Hits")).show(2)
blogs_df.select("Hits").show(2)

+----+
|Hits|
+----+
|4535|
|8908|
+----+
only showing top 2 rows




In [12]:
# Sort by column "Id" in descending order
# blogs_df.sort(col("Id"), ascending=False).show()
blogs_df.sort("Id", ascending=False).show()


+---+---------+-------+-----------------+---------+-----+--------------------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|
+---+---------+-------+-----------------+---------+-----+--------------------+
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|
+---+---------+-------+-----------------+---------+-----+--------------------+



## Rows
A row in Spark is a generic Row object, containing one or more columns. Each column may be of the same data type (e.g., integer or string), or they can have different types (integer, string, map, array, etc.). Because Row is an object in Spark and an ordered collection of fields, you can instantiate a Row in each of Spark’s supported languages:

In [13]:
from pyspark.sql import Row

rows = [Row("Matei Zaharia", "CA"), Row("Reynold Xin", "CA")]

authors_df = spark.createDataFrame(rows, ["Authors", "State"])

authors_df.show()

+-------------+-----+
|      Authors|State|
+-------------+-----+
|Matei Zaharia|   CA|
|  Reynold Xin|   CA|
+-------------+-----+



## Common DataFrame Operations
To perform common data operations on DataFrames, you’ll first need to load a DataFrame from a data source that holds your structured data. Spark provides an interface, `DataFrameReader`, that enables you to read data into a DataFrame from myriad data sources in formats such as JSON, CSV, Parquet, Text, Avro, ORC, etc. Likewise, to write a DataFrame back to a data source in a particular format, Spark uses `DataFrameWriter`.

### Using DataFrameReader and DataFrameWriter

Reading and writing are simple in Spark because of these high-level abstractions and contributions from the community to connect to a wide variety of data sources, including common NoSQL stores, RDBMSs, streaming engines such as Apache Kafka and Kinesis, and more.

In [14]:
# In Python, define a schema
from pyspark.sql.types import *

# Programmatic way to define a schema for the GOOG price data
goog_schema = StructType([StructField('Date', DateType(), True),
                          StructField('Open', FloatType(), True),
                          StructField('High', FloatType(), True),
                          StructField('Low', FloatType(), True),
                          StructField('Close', FloatType(), True),
                          StructField('AdjClose', FloatType(), True),
                          StructField('Volume', IntegerType(), True)])

# Use the DataFrameReader interface to read a CSV file
goog_file = "../data/GOOG.csv"
goog_df = spark.read.csv(goog_file, header=True, schema=goog_schema)

goog_df.show()

+----------+-------+--------+--------+-------+--------+-------+
|      Date|   Open|    High|     Low|  Close|AdjClose| Volume|
+----------+-------+--------+--------+-------+--------+-------+
|2020-05-20|1389.58| 1410.42| 1387.25|1406.72| 1406.72|1655400|
|2020-05-21| 1408.0| 1415.49| 1393.45| 1402.8|  1402.8|1385000|
|2020-05-22|1396.71| 1412.76| 1391.83|1410.42| 1410.42|1309400|
|2020-05-26|1437.27|  1441.0| 1412.13|1417.02| 1417.02|2060600|
|2020-05-27|1417.25| 1421.74| 1391.29|1417.84| 1417.84|1685800|
|2020-05-28|1396.86| 1440.84|  1396.0|1416.73| 1416.73|1692200|
|2020-05-29|1416.94| 1432.57| 1413.35|1428.92| 1428.92|1838100|
|2020-06-01|1418.39| 1437.96|  1418.0|1431.82| 1431.82|1217100|
|2020-06-02|1430.55| 1439.61| 1418.83|1439.22| 1439.22|1278100|
|2020-06-03| 1438.3|1446.552|1429.777|1436.38| 1436.38|1256200|
|2020-06-04| 1430.4| 1438.96| 1404.73|1412.18| 1412.18|1484300|
|2020-06-05|1413.17| 1445.05|  1406.0|1438.39| 1438.39|1734900|
|2020-06-08|1422.34| 1447.99| 1422.34|14

To write the DataFrame into an external data source in your format of choice, you can use the `DataFrameWriter` interface. Like `DataFrameReader`, it supports multiple data sources. Parquet, a popular columnar format, is the default format; it uses snappy compression to compress the data. If the DataFrame is written as Parquet, the schema is preserved as part of the Parquet metadata. In this case, subsequent reads back into a DataFrame do not require you to manually supply a schema.

In [15]:
# To save as a Parquet file
parquet_path = "../data/goog.parquet"
goog_df.write.format("parquet").save(parquet_path)

Alternatively, you can save it as a table, which registers metadata with the Hive metastore (we will cover SQL managed and unmanaged tables, metastores, and DataFrames in the next chapter):

In [17]:
parquet_table = "demoTable" # name of the table
goog_df.write.format("parquet").saveAsTable(parquet_table)

### Transformations and actions
Now that you have a distributed DataFrame composed of GOOG data in memory, the first thing you as a developer will want to do is examine your data to see what the columns look like. Are they of the correct types? Do any of them need to be converted to different types? Do they have `null` values?

### Projections and filters
In relational parlance, a projection returns only a chosen subset of columns, while a filter (a selection) returns only the rows matching a certain relational condition. In Spark, projections are done with the `select()` method, while filters can be expressed using the `filter()` or `where()` method.

In [18]:
few_goog_df = (goog_df.select("Date", "AdjClose", "Volume")
                    .where(col("Volume") >= 1384000))

few_goog_df.show(5, truncate=False)

+----------+--------+-------+
|Date      |AdjClose|Volume |
+----------+--------+-------+
|2020-05-20|1406.72 |1655400|
|2020-05-21|1402.8  |1385000|
|2020-05-26|1417.02 |2060600|
|2020-05-27|1417.84 |1685800|
|2020-05-28|1416.73 |1692200|
+----------+--------+-------+
only showing top 5 rows



What if we want to know how many distinct `Date` values there are? This simple and expressive query does the job:

In [19]:
(few_goog_df.select("Date")
            .where(col("Date").isNotNull())
            .agg(countDistinct("Date").alias("DistinctDates"))
            .show())

+-------------+
|DistinctDates|
+-------------+
|           51|
+-------------+



We can list the distinct dates in the data set using this query:

In [20]:
# Keep only the distinct non-null Date values from all the rows
(few_goog_df.select("Date")
            .where(col("Date").isNotNull())
            .distinct()
            .show(10, False))

+----------+
|Date      |
+----------+
|2020-07-24|
|2020-08-05|
|2020-06-04|
|2020-06-05|
|2020-08-04|
|2020-06-17|
|2020-07-02|
|2020-06-12|
|2020-08-07|
|2020-08-13|
+----------+
only showing top 10 rows



### Renaming, adding, and dropping columns
Sometimes you want to rename particular columns for reasons of style or convention, and at other times for readability or brevity. The original column names in a data set may have spaces in them. Spaces in column names can be problematic, especially when you want to write or save a DataFrame as a Parquet file (which prohibits this).

By specifying the desired column names in the schema with StructField, as we did, we effectively changed all names in the resulting DataFrame. Alternatively, you could selectively rename columns with the `withColumnRenamed()` method.

In [21]:
new_goog_df = goog_df.withColumnRenamed("Volume", "NewVolume")

(new_goog_df.select("NewVolume")
            .where(col("NewVolume") > 1384000)
            .show(5, False))

+---------+
|NewVolume|
+---------+
|1655400  |
|1385000  |
|2060600  |
|1685800  |
|1692200  |
+---------+
only showing top 5 rows



<b>Note!</b>  Because DataFrames are immutable, renaming a column with `withColumnRenamed()` returns a new DataFrame while the original, with the old column name, is retained.

Modifying the contents of a column or its type are common operations during data exploration. In some cases the data is raw or dirty, or its types are not amenable to being supplied as arguments to relational operators. For example, in some data sets, columns that contain date values are stored as strings rather than as Unix timestamps or SQL dates, both of which Spark supports and can easily manipulate during transformations or actions (e.g., during a date- or time-based analysis of the data).

In [22]:
# Example
# fire_ts_df = (new_fire_df
#                 .withColumn("IncidentDate", to_timestamp(col("CallDate"), "MM/dd/yyyy"))
#                 .drop("CallDate")
#                 .withColumn("OnWatchDate", to_timestamp(col("WatchDate"), "MM/dd/yyyy"))
#                 .drop("WatchDate")
#                 .withColumn("AvailableDtTS", to_timestamp(col("AvailableDtTm"), "MM/dd/yyyy hh:mm:ss a"))
#                 .drop("AvailableDtTm"))

# Select the converted columns
# (fire_ts_df.select("IncidentDate", "OnWatchDate", "AvailableDtTS")
#     .show(5, False))

The code above does the following:
1. Converts the existing column’s data type from string to a Spark-supported timestamp.
2. Uses the format specified in the format string "MM/dd/yyyy" or "MM/dd/yyyy hh:mm:ss a" where appropriate.
3. After converting to the new data type, calls `drop()` on the old column, leaving the new one named in the first argument to `withColumn()`.
4. Assigns the new modified DataFrame to `fire_ts_df`.

In [23]:
spark.stop()