### Creating a DataFrame in Spark

The main way to create DataFrames in Spark is the createDataFrame() method, which is called on the SparkSession. Since Spark 2.0, the SparkSession has been the main entry point to all of Spark's capabilities. The SparkSession creates and exposes the SparkConf, SparkContext, and SQLContext to the entire application. By convention the SparkSession is assigned to a variable named "spark", but any name can be used.

In [2]:
import findspark
findspark.init()
findspark.find()

'C:\\Spark\\sparkhome'

In [3]:
from pyspark.sql import SparkSession

A basic SparkSession in PySpark looks like this:

In [4]:
spark = SparkSession \
    .builder \
    .appName("Vamsi_App") \
    .getOrCreate()

The createDataFrame() method in PySpark has a structure like:

###### createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)

### List

#### Exercise 1: Creating a DataFrame in PySpark from a Nested List

In [5]:
hadoop_list = [[1, "MapReduce"], [2, "YARN"], [3, "Hive"], [4, "Pig"], [5, "Spark"], [6, "Zookeeper"]]

In [7]:
hadoop_list

[[1, 'MapReduce'],
 [2, 'YARN'],
 [3, 'Hive'],
 [4, 'Pig'],
 [5, 'Spark'],
 [6, 'Zookeeper']]

To create the DataFrame named hadoop_df, call the createDataFrame() method on the SparkSession variable spark created earlier, passing only the nested list:

In [8]:
hadoop_df = spark.createDataFrame(hadoop_list)

Finally, display the contents of the DataFrame using hadoop_df.show(), and display its schema in a tree structure using hadoop_df.printSchema(), as shown in the following code:

In [11]:
hadoop_df.show()

+---+---------+
| _1|       _2|
+---+---------+
|  1|MapReduce|
|  2|     YARN|
|  3|     Hive|
|  4|      Pig|
|  5|    Spark|
|  6|Zookeeper|
+---+---------+



In [10]:
hadoop_df.printSchema()

root
 |-- _1: long (nullable = true)
 |-- _2: string (nullable = true)



You have now created your first Spark DataFrame. In this exercise the DataFrame has only six rows, but a Spark DataFrame is distributed across a cluster, so the same code can scale to datasets with billions of rows or more. This is the power of Spark.

Did anything in the output stand out? There are two things to note:

1. The column names are _1 and _2. ***This is because no column names were supplied when creating the DataFrame.*** Spark didn't know what to call the columns, so it named each one after its position, counting from left to right.
2. Spark correctly inferred the data type of each column, as the printSchema() output shows: the first column is a long, which is a 64-bit integer type, and the second is a string. When printSchema() is called on a DataFrame, the output displays the schema in a tree format showing the column hierarchy, column names, column data types, and whether each column is nullable. ***In the Spark version used here, the printSchema() method takes no parameters.***

To display the contents of the DataFrame we call the method show() on the newly created DataFrame. The show() method looks like this:

show(n=20, truncate=True)

By default, the show() method displays the top twenty rows and truncates any cell longer than twenty characters. To display more or fewer than twenty rows, set the first parameter to the desired integer. To show the full contents of every cell, set the second parameter to False.

#### Displaying the contents of Spark DataFrames in PySpark

In [14]:
address_data = [["Bob", "1348 Central Park Avenue"], ["Nicole", "734 Southwest 46th Street"], ["Jordan", "3786 Ocean City Drive"]] 

In [15]:
address_data

[['Bob', '1348 Central Park Avenue'],
 ['Nicole', '734 Southwest 46th Street'],
 ['Jordan', '3786 Ocean City Drive']]

In [17]:
address_df = spark.createDataFrame(address_data)

In [18]:
address_df.show()

+------+--------------------+
|    _1|                  _2|
+------+--------------------+
|   Bob|1348 Central Park...|
|Nicole|734 Southwest 46t...|
|Jordan|3786 Ocean City D...|
+------+--------------------+



Because show() truncates cells longer than twenty characters by default, the addresses are cut off and end in an ellipsis.

To display all the characters in each cell, set the second parameter to False; to limit the output to the first two rows, set the first parameter to 2, as shown in the following code:

In [21]:
address_df.show(2,False)

+------+-------------------------+
|_1    |_2                       |
+------+-------------------------+
|Bob   |1348 Central Park Avenue |
|Nicole|734 Southwest 46th Street|
+------+-------------------------+
only showing top 2 rows



**Note:** The second parameter, truncate, can also take an integer. If it is set to an integer, each cell is truncated to at most that many characters.
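To see what an integer truncate value does to a cell, the plain-Python sketch below approximates the truncation rule. It is based on the output above, where a truncated cell keeps seventeen characters followed by three dots; the helper `truncate_cell` is hypothetical and not part of PySpark, and the handling of widths below four is an assumption. The corresponding real call would be `address_df.show(truncate=10)`.

```python
def truncate_cell(value, truncate=20):
    """Approximate how show() shortens one cell (hypothetical helper)."""
    text = str(value)
    if truncate <= 0 or len(text) <= truncate:
        return text  # short cells are left untouched
    if truncate < 4:
        return text[:truncate]  # assumed: too narrow to fit an ellipsis
    return text[:truncate - 3] + "..."  # keep room for the three dots


for address in ["1348 Central Park Avenue", "734 Southwest 46th Street"]:
    print(truncate_cell(address), "|", truncate_cell(address, 10))
```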

### Tuple

In Python, a tuple is similar to a list except that it is written with parentheses instead of square brackets and is immutable (it cannot be changed after it is created). For the purpose of creating a DataFrame, lists and tuples behave the same way. A nested tuple is a tuple inside another tuple.
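The difference in mutability can be checked in plain Python without Spark. In this small sketch the variable names are made up for illustration.

```python
languages_list = [1, "Java", "Scalable"]
languages_tuple = (1, "Java", "Scalable")

# A list can be modified in place
languages_list[2] = "Portable"

# A tuple rejects the same assignment with a TypeError
try:
    languages_tuple[2] = "Portable"
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(languages_list)
print(tuple_changed)
```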

#### Exercise 2: Creating a DataFrame in PySpark from a nested tuple

Create a nested tuple called programming_languages with the following code:

In [22]:
programming_languages = ((1, "Java", "Scalable"), (2, "C", "Portable"), (3, "Python", "Big Data, ML, AI, Robotics"), (4, "JavaScript", "Web Browsers"), (5, "Ruby", "Web Apps"))

In [23]:
programming_languages

((1, 'Java', 'Scalable'),
 (2, 'C', 'Portable'),
 (3, 'Python', 'Big Data, ML, AI, Robotics'),
 (4, 'JavaScript', 'Web Browsers'),
 (5, 'Ruby', 'Web Apps'))

In [24]:
prog_lang_df = spark.createDataFrame(programming_languages)

Display the five rows with the truncate parameter set to False so that the entire contents of each cell are shown:

In [25]:
prog_lang_df.show(5,False)

+---+----------+--------------------------+
|_1 |_2        |_3                        |
+---+----------+--------------------------+
|1  |Java      |Scalable                  |
|2  |C         |Portable                  |
|3  |Python    |Big Data, ML, AI, Robotics|
|4  |JavaScript|Web Browsers              |
|5  |Ruby      |Web Apps                  |
+---+----------+--------------------------+



### Dictionary

In Python, a dictionary is a collection of key-value pairs wrapped in curly braces. A dictionary is similar to a list in that it is mutable, can grow or shrink in size, and can be nested. Each data element in a dictionary has a key and a value. **To create a DataFrame out of a dictionary, all that is required is to wrap it in a list.**

#### Exercise 3: Creating a DataFrame in PySpark from a list of dictionaries

Create a list of dictionaries called top_mobile_phones. Inside the list, make three comma-separated dictionaries, each with the keys "Manufacturer", "Model", "Year", and "Million_Units", as shown in the following code:

In [26]:
top_mobile_phones = [{"Manufacturer": "Nokia", "Model": "1100", "Year": 2003, "Million_Units": 250}, {"Manufacturer": "Nokia", "Model": "1110", "Year": 2005, "Million_Units": 250}, {"Manufacturer": "Apple", "Model": "iPhone 6 & 6+", "Year": 2014, "Million_Units": 222}]

In [27]:
top_mobile_phones

[{'Manufacturer': 'Nokia',
  'Model': '1100',
  'Year': 2003,
  'Million_Units': 250},
 {'Manufacturer': 'Nokia',
  'Model': '1110',
  'Year': 2005,
  'Million_Units': 250},
 {'Manufacturer': 'Apple',
  'Model': 'iPhone 6 & 6+',
  'Year': 2014,
  'Million_Units': 222}]

Create a DataFrame called mobile_phones_df from the list of dictionaries, then display its contents and schema:

In [28]:
mobile_phones_df = spark.createDataFrame(top_mobile_phones)

In [29]:
mobile_phones_df.show()

+------------+-------------+-------------+----+
|Manufacturer|Million_Units|        Model|Year|
+------------+-------------+-------------+----+
|       Nokia|          250|         1100|2003|
|       Nokia|          250|         1110|2005|
|       Apple|          222|iPhone 6 & 6+|2014|
+------------+-------------+-------------+----+



In [30]:
mobile_phones_df.printSchema()

root
 |-- Manufacturer: string (nullable = true)
 |-- Million_Units: long (nullable = true)
 |-- Model: string (nullable = true)
 |-- Year: long (nullable = true)



Notice that **we didn’t supply the column names to the DataFrame but they still appear. That is because dictionaries have “keys” and these keys make up the columns of the DataFrame**. Likewise, the dictionary “values” are the cells in the DataFrame. So, by using dictionaries, Spark can display the DataFrame column names.
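One more detail in the output above is worth noting: the columns appear as Manufacturer, Million_Units, Model, Year, which is alphabetical order rather than the order in which the keys were written. The plain-Python sketch below only compares the two orderings and does not call Spark. Whether Spark orders the columns this way may depend on the version in use, and the variable names `written_order` and `sorted_order` are made up for illustration.

```python
top_mobile_phones = [
    {"Manufacturer": "Nokia", "Model": "1100", "Year": 2003, "Million_Units": 250},
    {"Manufacturer": "Apple", "Model": "iPhone 6 & 6+", "Year": 2014, "Million_Units": 222},
]

# Keys in the order they were written in the dictionary
written_order = list(top_mobile_phones[0].keys())

# The same keys sorted alphabetically, matching the header of show() above
sorted_order = sorted(written_order)

print(written_order)
print(sorted_order)
```

If a specific column order matters, selecting the columns explicitly after creating the DataFrame is a safer choice than relying on either ordering.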