A **schema** is information about the data contained in a DataFrame: the number of columns, the column names, each column's data type, and whether each column can contain NULLs. Without a schema, a DataFrame would be a group of disorganized things. The schema gives the DataFrame structure and meaning.

In the previous section we introduced the createDataFrame() method. In PySpark, the signature of this method is createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True).

After the required data parameter the first optional parameter is schema. The most useful options for the schema parameter include: None (or not included), a list of column names, or a StructType.

If schema is None or left out, then Spark will try to infer the column names and the column types from the data. If schema is a list of column names, then Spark will add the column names in the order specified and will try to infer the column types from the data. In both of these cases the second optional parameter, **samplingRatio**, controls how much of the data Spark examines to infer the schema. It is a ratio (the fraction of rows to sample), not a number of rows. If it is not included or is None, then only the top row is used to infer the schema.
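
To see why the amount of data examined matters, here is a toy, pure-Python illustration of type inference. It is not Spark's implementation: the helper infer_column_types and its sample_size parameter are hypothetical, and the helper only records which Python types appear in each column of the rows it is allowed to look at.

```python
def infer_column_types(rows, sample_size=None):
    """Toy inference (hypothetical helper, not Spark's algorithm).

    Returns, for each column position, the sorted names of the Python
    types seen in the sampled rows.
    """
    sampled = rows if sample_size is None else rows[:sample_size]
    seen = {}
    for row in sampled:
        for position, value in enumerate(row):
            seen.setdefault(position, set()).add(type(value).__name__)
    return {position: sorted(names) for position, names in seen.items()}


rows = [("A123", None), ("B456", 150)]

# Looking only at the first row, the second column seems to hold nothing but None.
first_row_only = infer_column_types(rows, sample_size=1)

# Looking at every row reveals that the second column also holds integers.
all_rows = infer_column_types(rows)

print(first_row_only)
print(all_rows)
```

When the first rows of a data set are not representative, examining more rows gives the inference more to work with, at the cost of reading more data.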

To illustrate, say we had some data in a variable named computer_sales with the fields "product_code", "computer_name", and "sales". The following illustrates the options the createDataFrame() method can handle in PySpark.

The following code is used when only the data parameter is provided, or when the schema is set to None or left out:

In [2]:
import findspark
findspark.init()
findspark.find()

'C:\\Spark\\sparkhome'

In [3]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.appName("Vamsi_App").getOrCreate()

In [5]:
computer_sales = [
    {'product_code': 'A123', 'computer_name': 'Laptop A', 'sales': 100},
    {'product_code': 'B456', 'computer_name': 'Desktop B', 'sales': 150}
]

In [6]:
df1 = spark.createDataFrame(computer_sales) 

In [7]:
df2 = spark.createDataFrame(computer_sales, None)

In [8]:
df1.show()

+-------------+------------+-----+
|computer_name|product_code|sales|
+-------------+------------+-----+
|     Laptop A|        A123|  100|
|    Desktop B|        B456|  150|
+-------------+------------+-----+



In [9]:
df2.show()

+-------------+------------+-----+
|computer_name|product_code|sales|
+-------------+------------+-----+
|     Laptop A|        A123|  100|
|    Desktop B|        B456|  150|
+-------------+------------+-----+



Both DataFrames are equivalent.

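The following is used when the data parameter is specified along with a Python list of column names. The names are applied by position, so they must be listed in the order in which Spark arranges the columns. Note that in the output below the names do not line up with the values (Laptop A appears under product_code): as the df1 output shows, Spark ordered the dictionary keys alphabetically as computer_name, product_code, sales, while the list supplies product_code first: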

In [10]:
computer_sales

[{'product_code': 'A123', 'computer_name': 'Laptop A', 'sales': 100},
 {'product_code': 'B456', 'computer_name': 'Desktop B', 'sales': 150}]

In [11]:
df3 = spark.createDataFrame(computer_sales, ['product_code', 'computer_name', 'sales'])

In [12]:
df3.show()

+------------+-------------+-----+
|product_code|computer_name|sales|
+------------+-------------+-----+
|    Laptop A|         A123|  100|
|   Desktop B|         B456|  150|
+------------+-------------+-----+
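
In the df3 output, Laptop A sits under product_code. The df1 output lists the dictionary keys in alphabetical order, so a plausible reading is that the supplied names are applied by position to alphabetically ordered values. The following pure-Python sketch reproduces that pairing under this assumption and shows one way to avoid it, by converting each dictionary to a tuple in the intended order. It does not call Spark, and the variable names are illustrative.

```python
computer_sales = [
    {'product_code': 'A123', 'computer_name': 'Laptop A', 'sales': 100},
    {'product_code': 'B456', 'computer_name': 'Desktop B', 'sales': 150},
]
names = ['product_code', 'computer_name', 'sales']

# Assumption: values are ordered by sorted key, then names are applied by position.
first_row = computer_sales[0]
values_by_sorted_key = [value for _, value in sorted(first_row.items())]
mislabeled = dict(zip(names, values_by_sorted_key))
print(mislabeled)

# Building tuples in the same order as the names keeps values and labels aligned.
rows_as_tuples = [tuple(row[name] for name in names) for row in computer_sales]
print(rows_as_tuples)
```

Passing rows_as_tuples together with the names list to spark.createDataFrame() labels each value by position, in the same way as the nested list in Exercise 6.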



In [13]:
df4 = spark.createDataFrame(computer_sales, ["product_code", "computer_name", "sales"], 2)

In [14]:
df4.show()

+------------+-------------+-----+
|product_code|computer_name|sales|
+------------+-------------+-----+
|    Laptop A|         A123|  100|
|   Desktop B|         B456|  150|
+------------+-------------+-----+



In [15]:
df5 = spark.createDataFrame(computer_sales, ["product_code", "computer_name", "sales"], samplingRatio=1.0)

The preceding call asks Spark to infer the schema from every row. samplingRatio is the fraction of rows to sample rather than a row count, so 1.0 covers all of the rows, whereas a count such as len(computer_sales), or the 2 passed for df4, is not a meaningful ratio:

In [16]:
df5.show()

+------------+-------------+-----+
|product_code|computer_name|sales|
+------------+-------------+-----+
|    Laptop A|         A123|  100|
|   Desktop B|         B456|  150|
+------------+-------------+-----+



#### Exercise 6: Creating a DataFrame in PySpark with only named columns

Create a nested list called home_computers as shown in the following code:

In [17]:
home_computers = [["Honeywell", "Honeywell 316#Kitchen Computer", "DDP 16 Minicomputer", 1969], ["Apple Computer", "Apple II series", "6502", 1977], ["Bally Consumer Products", "Bally Astrocade", "Z80", 1977]]

In [18]:
home_computers

[['Honeywell', 'Honeywell 316#Kitchen Computer', 'DDP 16 Minicomputer', 1969],
 ['Apple Computer', 'Apple II series', '6502', 1977],
 ['Bally Consumer Products', 'Bally Astrocade', 'Z80', 1977]]

Create a DataFrame, but this time give the column names of the DataFrame explicitly as a list in the second parameter, as shown in the following code:

In [19]:
computers_df = spark.createDataFrame(home_computers, ["Manufacturer", "Model", "Processor", "Year"])

Since the third parameter **samplingRatio** is not included, Spark uses the first row of data to infer the data types of the columns.

Show the contents of the DataFrame and display the schema with the following code:

In [20]:
computers_df.show()
computers_df.printSchema()

+--------------------+--------------------+-------------------+----+
|        Manufacturer|               Model|          Processor|Year|
+--------------------+--------------------+-------------------+----+
|           Honeywell|Honeywell 316#Kit...|DDP 16 Minicomputer|1969|
|      Apple Computer|     Apple II series|               6502|1977|
|Bally Consumer Pr...|     Bally Astrocade|                Z80|1977|
+--------------------+--------------------+-------------------+----+

root
 |-- Manufacturer: string (nullable = true)
 |-- Model: string (nullable = true)
 |-- Processor: string (nullable = true)
 |-- Year: long (nullable = true)



Column names make DataFrames exceptionally useful. The PySpark API makes adding column names to a DataFrame very easy.

### Schemas, StructTypes, and StructFields

The most rigid and defined option for schema is the **StructType**. It is important to note that the schema of a DataFrame is a StructType. **If a DataFrame is created without column names and Spark infers the data types based upon the data, a StructType is still created in the background by Spark.**

A manually created PySpark DataFrame, like the following example, still has a StructType schema:

In [21]:
computers_df_1 = spark.createDataFrame(home_computers)
computers_df_1.show()

+--------------------+--------------------+-------------------+----+
|                  _1|                  _2|                 _3|  _4|
+--------------------+--------------------+-------------------+----+
|           Honeywell|Honeywell 316#Kit...|DDP 16 Minicomputer|1969|
|      Apple Computer|     Apple II series|               6502|1977|
|Bally Consumer Pr...|     Bally Astrocade|                Z80|1977|
+--------------------+--------------------+-------------------+----+



The schema can be displayed in PySpark by accessing the schema attribute of a DataFrame:

In [22]:
computers_df_1.schema

StructType([StructField('_1', StringType(), True), StructField('_2', StringType(), True), StructField('_3', StringType(), True), StructField('_4', LongType(), True)])

To recap, **the schema of a DataFrame is stored as a StructType object. The StructType object consists of a list of StructFields**. The StructFields are the information about the columns of a DataFrame.

In [23]:
from pyspark.sql.types import StructType, StructField, StringType, LongType
  
schema = StructType([
  StructField("Manufacturer", StringType(), True),
  StructField("Model", StringType(), True),
  StructField("Processor", StringType(), True),
  StructField("Year", LongType(), True)
])

StructFields are objects that correspond to each column of the DataFrame and are constructed with the name, the data type, and a boolean value indicating whether the column can contain NULLs. The second parameter of a StructField is the column's data type: string, integer, decimal, datetime, and so on.

To use data types in Spark, the types module must be imported. Imports in Scala and Python bring in code that is not built into the main module. For example, DataFrames are part of the main code class, but ancillary things like data types and functions are not and must be imported to be used in your file. In Scala, the import for all of the data types is import org.apache.spark.sql.types._ and the following code is the Python import for all of the data types:

In [24]:
from pyspark.sql.types import *

**Note:** StructType and StructField are actually Spark data types themselves. They are included in the preceding wildcard import, which imports every member of the types module. To import StructType and StructField individually use the following code for PySpark:

In [25]:
from pyspark.sql.types import StructType, StructField

#### Exercise 7: Creating a DataFrame in PySpark with a Defined Schema

Import all the PySpark data types at once (which includes both StructType and StructField) and make a nested list of data with the following code:

In [26]:
from pyspark.sql.types import *
 
customer_list = [[111, "Jim", 45.51], [112, "Fred", 87.3], [113, "Jennifer", 313.69], [114, "Lauren", 28.78]]

Construct the schema using the StructType and StructField. First make a StructType which holds a Python list as shown in the following code:

In [27]:
customer_schema = StructType([
    StructField("customer_id",LongType(),True),
    StructField("first_name",StringType(),True),
    StructField("avg_shopping_cart",DoubleType(),True)
])

In [28]:
customer_df = spark.createDataFrame(customer_list,schema=customer_schema)

In [29]:
customer_df.show()
customer_df.printSchema()

+-----------+----------+-----------------+
|customer_id|first_name|avg_shopping_cart|
+-----------+----------+-----------------+
|        111|       Jim|            45.51|
|        112|      Fred|             87.3|
|        113|  Jennifer|           313.69|
|        114|    Lauren|            28.78|
+-----------+----------+-----------------+

root
 |-- customer_id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- avg_shopping_cart: double (nullable = true)



Spark schemas are the structure or the scaffolding of a DataFrame. Just like a building would collapse without structure, so too would a DataFrame. Without structure Spark wouldn’t be able to scale to trillions and trillions of rows.

#### The add() Method

The add() method can be used interchangeably with, or in addition to, StructField objects when building a StructType. The add() method takes the same parameters as the StructField object, or a StructField object itself. The following schemas are all equivalent representations:

In [30]:
from pyspark.sql.types import StructType, StructField, StringType, LongType
  
schema1 = StructType([
  StructField("id_column", LongType(), True),
  StructField("product_desc", StringType(), True)
])
  
schema2 = StructType().add("id_column", LongType(), True).add("product_desc", StringType(), True)
  
schema3 = StructType().add(StructField("id_column", LongType(), True)).add(StructField("product_desc", StringType(), True))

We can confirm that the three schemas are equivalent by comparing the schema variables to each other and printing the results in Python:

In [31]:
print(schema1 == schema2) 
print(schema1 == schema3)

True
True


#### Exercise 8: Using the add() Method

Create a schema of a StructType named sales_schema that has two columns. The first column “user_id” is a long data type and cannot be nullable. The second column “product_item” is a string data type and can be nullable. Following is the code for PySpark:

In [32]:
from pyspark.sql.types import StructType,StructField, StringType, LongType

sales_schema = StructType([
  StructField("user_id", LongType(), False),
  StructField("product_item", StringType(), True)
])

Create a StructField called sales_field that has a column name of "total_sales" with a long data type that can be nullable. Then add it to sales_schema with the add() method and print the result. Following is the code for PySpark:

In [33]:
sales_field = StructField("total_sales",LongType(),True)

In [34]:
another_schema = sales_schema.add(sales_field)

In [35]:
print(another_schema)

StructType([StructField('user_id', LongType(), False), StructField('product_item', StringType(), True), StructField('total_sales', LongType(), True)])


#### Use the add() method when adding columns to a DataFrame

You can use the schema attribute of a DataFrame together with the add() method to extend the schema of an already existing DataFrame. In the preceding exercise we specified the schema manually as a StructType. Spark has a shortcut: calling schema on an existing DataFrame returns its schema, which is a StructType. So in Spark Scala or PySpark you would call some_df.schema to output the StructType schema.

In [36]:
print(customer_df.schema)

StructType([StructField('customer_id', LongType(), True), StructField('first_name', StringType(), True), StructField('avg_shopping_cart', DoubleType(), True)])


So, with the schema attribute you don't have to create the StructType manually to extend it. Just call schema on the DataFrame and then use the add() method to append a StructField. This changes only the schema object; it does not add a column of data to the DataFrame. Following is the code for PySpark:

In [37]:
final_schema = customer_df.schema.add(StructField("new_column", StringType(), True)) 
print(final_schema)

StructType([StructField('customer_id', LongType(), True), StructField('first_name', StringType(), True), StructField('avg_shopping_cart', DoubleType(), True), StructField('new_column', StringType(), True)])


#### Return column names from a schema

Use the fieldNames method on a StructType to return a list or array of the column names. This is an easy way to return the column names of a DataFrame. The fieldNames method can be called on a StructType or on the schema of a DataFrame. The only difference between Spark Scala and PySpark is that PySpark requires trailing parentheses () and Spark Scala omits them.

In PySpark call the fieldNames() method on a schema and on the DataFrame's schema to return the column names. Note that new_column appears in both results below even though customer_df has only three columns of data: the earlier add() call modified the schema object in place, and the output indicates that customer_df.schema and customer_schema refer to that same object:

In [38]:
print( customer_df.schema.fieldNames() ) 
print( customer_schema.fieldNames() )

['customer_id', 'first_name', 'avg_shopping_cart', 'new_column']
['customer_id', 'first_name', 'avg_shopping_cart', 'new_column']


### Nested Schemas

So far, we have dealt with flat and orderly DataFrame schemas. But Spark supports nested columns, where a column can contain further sets of data. Suppose we had a data set that looked like the following Python dictionary or JSON object:

{"id":101,"name":"Jim","orders":[{"id":1,"price":45.99,"userid":101},{"id":2,"price":17.35,"userid":101}]},{"id":102,"name":"Christina","orders":[{"id":3,"price":245.86,"userid":102}]},{"id":103,"name":"Steve","orders":[{"id":4,"price":7.45,"userid":103},{"id":5,"price":8.63,"userid":103}]}

This data set would be the result of an imaginary sales table that was joined to an orders table. We will look at joining DataFrames together in Chapter 3, SQL with Spark.

It is difficult to see from the nested dictionary, but there are three columns: id, name, and orders. The orders column is special because it is a list of dictionaries, one per order. In Python we can use this data directly by wrapping it in brackets, as mentioned earlier.

In [39]:
nested_sales_data = [{"id":101,"name":"Jim","orders":[{"id":1,"price":45.99,"userid":101},{"id":2,"price":17.35,"userid":101}]},{"id":102,"name":"Christina","orders":[{"id":3,"price":245.86,"userid":102}]},{"id":103,"name":"Steve","orders":[{"id":4,"price":7.45,"userid":103},{"id":5,"price":8.63,"userid":103}]}]

In [40]:
nested_sales_data

[{'id': 101,
  'name': 'Jim',
  'orders': [{'id': 1, 'price': 45.99, 'userid': 101},
   {'id': 2, 'price': 17.35, 'userid': 101}]},
 {'id': 102,
  'name': 'Christina',
  'orders': [{'id': 3, 'price': 245.86, 'userid': 102}]},
 {'id': 103,
  'name': 'Steve',
  'orders': [{'id': 4, 'price': 7.45, 'userid': 103},
   {'id': 5, 'price': 8.63, 'userid': 103}]}]
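
Before handing this data to Spark, it can help to confirm its shape in plain Python. The short sketch below lists the top-level keys of each record and counts the entries in its orders list; the summary variable is an illustrative name.

```python
nested_sales_data = [
    {"id": 101, "name": "Jim", "orders": [
        {"id": 1, "price": 45.99, "userid": 101},
        {"id": 2, "price": 17.35, "userid": 101}]},
    {"id": 102, "name": "Christina", "orders": [
        {"id": 3, "price": 245.86, "userid": 102}]},
    {"id": 103, "name": "Steve", "orders": [
        {"id": 4, "price": 7.45, "userid": 103},
        {"id": 5, "price": 8.63, "userid": 103}]},
]

# Each record has three top-level keys; orders holds one dictionary per order.
summary = [
    (record["name"], sorted(record.keys()), len(record["orders"]))
    for record in nested_sales_data
]

for name, keys, order_count in summary:
    print(name, keys, order_count)
```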

If we used this list and made a DataFrame without specifying a schema, the output would not be very usable or readable. The following PySpark code uses the preceding nested JSON data to make a Spark DataFrame. The DataFrame and schema are displayed to demonstrate what can happen when you make a DataFrame with nested data without a schema:

In [41]:
sales_df = spark.createDataFrame(nested_sales_data) 
sales_df.show(20, False) 
sales_df.printSchema()

+---+---------+----------------------------------------------------------------------------------+
|id |name     |orders                                                                            |
+---+---------+----------------------------------------------------------------------------------+
|101|Jim      |[{id -> 1, userid -> 101, price -> null}, {id -> 2, userid -> 101, price -> null}]|
|102|Christina|[{id -> 3, userid -> 102, price -> null}]                                         |
|103|Steve    |[{id -> 4, userid -> 103, price -> null}, {id -> 5, userid -> 103, price -> null}]|
+---+---------+----------------------------------------------------------------------------------+

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: long (valueContainsNull = true)



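The output is not readable or user friendly: each order is shown with "->" characters because Spark is trying to make a map of the data, and every price is null, apparently because the map values were inferred as long (see the schema above), which the decimal prices do not fit. Let's add a schema to tell Spark exactly how we want to structure the DataFrame. The following PySpark code demonstrates the results of nested data when using a schema: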

In [42]:
from pyspark.sql.types import *
  
orders_schema = [
  StructField("id", IntegerType(), True),
  StructField("price", DoubleType(), True),
  StructField("userid", IntegerType(), True)
]
  
sales_schema = StructType([
  StructField("id", IntegerType(), True),
  StructField("name", StringType(), True),
  StructField("orders", ArrayType(StructType(orders_schema)), True) #ArrayType() applied in StructField
])

Here we nested **orders_schema** inside **sales_schema**. This shows how versatile schemas can be and how easy it is in Spark to construct complex schemas. Now let's make a DataFrame that is readable and well structured as shown in the following code:

In [43]:
nested_df = spark.createDataFrame(nested_sales_data, sales_schema)

In [44]:
nested_df.show(20, False)
nested_df.printSchema()

+---+---------+----------------------------------+
|id |name     |orders                            |
+---+---------+----------------------------------+
|101|Jim      |[{1, 45.99, 101}, {2, 17.35, 101}]|
|102|Christina|[{3, 245.86, 102}]                |
|103|Steve    |[{4, 7.45, 103}, {5, 8.63, 103}]  |
+---+---------+----------------------------------+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: integer (nullable = true)
 |    |    |-- price: double (nullable = true)
 |    |    |-- userid: integer (nullable = true)



In the next section we will move on from manually created DataFrames to creating DataFrames from files stored in Hadoop.