<font color=red>
Spark_Version:2.0.0<br/>
Python_Version:Python 3.5.2 | Anaconda4.1.1(64-bit)<br/>
Jupyter_Version:4.2.1<br/>
System:Ubuntu 16.04 LTS(64-bit)
</font>

In [1]:
import platform
print("Spark_Version:",sc.version)
print("Python_Version:",platform.python_version())
print("System:",platform.system())

Spark_Version: 2.0.0
Python_Version: 3.5.2
System: Linux


## Creating DataFrames from RDDs
The following code creates an RDD from a list of colors, maps each color to a tuple containing the color name and its length, and then converts the resulting RDD into a DataFrame with the `toDF` method. The `toDF` method takes a list of column labels as an optional argument:

In [2]:
#Create a list of colours
colors = ['white', 'green', 'yellow', 'red', 'brown', 'pink']
#Distribute a local collection to form an RDD
#Apply map function on that RDD to get another RDD containing colour,length tuples
color_df = sc.parallelize(colors).map(lambda x: (x,len(x))).toDF(['color', 'length'])

In [3]:
color_df

DataFrame[color: string, length: bigint]

In [4]:
color_df.dtypes

[('color', 'string'), ('length', 'bigint')]

In [5]:
color_df.show()

+------+------+
| color|length|
+------+------+
| white|     5|
| green|     5|
|yellow|     6|
|   red|     3|
| brown|     5|
|  pink|     4|
+------+------+



## Creating DataFrames from JSON
JavaScript Object Notation, or JSON, is a language-independent, self-describing, lightweight data-exchange format, and it has become ubiquitous. In addition to `JavaScript` and RESTful interfaces, databases such as `MySQL` have accepted `JSON` as a data type, and `MongoDB` stores all data as `JSON` documents in binary form (BSON). Conversion of data to and from `JSON` is essential for any modern data analysis workflow. The Spark DataFrame API lets developers convert `JSON` objects into DataFrames and vice versa. Let's have a close look at the following example for a better understanding:

In [6]:
#Pass the source json data file path
df = sqlContext.read.json("resource/authors.json")
df.show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|      Mark|    Twain|
|   Charles|  Dickens|
|    Thomas|    Hardy|
+----------+---------+
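
In Spark 2.0, `read.json` expects line-delimited JSON by default: each line of the file is a separate, self-contained JSON object rather than one enclosing array. The following is a minimal sketch, in plain Python, of a file in that shape that would yield the table above. The exact contents of `resource/authors.json` are not shown in this notebook, so the records are reconstructed from the output, and the temporary path is a placeholder chosen for illustration.

```python
import json
import os
import tempfile

# Records reconstructed from the table shown above.
authors = [
    {"first_name": "Mark", "last_name": "Twain"},
    {"first_name": "Charles", "last_name": "Dickens"},
    {"first_name": "Thomas", "last_name": "Hardy"},
]

# Hypothetical location; the notebook itself reads resource/authors.json.
path = os.path.join(tempfile.mkdtemp(), "authors.json")

# One JSON object per line, with no enclosing array and no trailing commas.
with open(path, "w") as f:
    for record in authors:
        f.write(json.dumps(record) + "\n")

# Reading the file back line by line recovers the same records.
with open(path) as f:
    parsed = [json.loads(line) for line in f]

# In a pyspark shell, sqlContext.read.json(path) would load this file.
```

A pretty-printed file that spreads one object over several lines does not fit this layout, so it is worth checking the file shape first when a JSON load produces unexpected columns.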



## Creating DataFrames from databases using JDBC
Spark allows developers to create DataFrames from other databases using `JDBC`, provided you ensure that the JDBC driver for the intended database is accessible. A `JDBC` driver is a software component that allows a Java application to interact with a database. Different databases require different drivers. Usually, database providers such as `MySQL` supply these driver components to access their databases. You have to ensure that you have the right driver for the database you want to work with.<br/><br/>

The following example assumes that you already have a `MySQL` database running at the given URL and valid credentials to log in. It reads the `information_schema.CHARACTER_SETS` table named in the code. There is an additional step of relaunching the `REPL` shell with the appropriate `JAR` file:<br/><br/>

``pyspark --driver-class-path /usr/mySofter/mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar``<br/><br/>

If you do not already have the JAR file in your system, download it from the MySQL site at the following link:<br/><br/>
https://dev.mysql.com/downloads/connector/j/

In [7]:
characterSetDF = sqlContext.read.format("jdbc").options(
                            url = "jdbc:mysql://localhost",
                            dbtable = 'information_schema.CHARACTER_SETS',
                            user = 'root',
                            password = '2038793').load()

In [8]:
characterSetDF.show(truncate = False)

+------------------+--------------------+---------------------------+------+
|CHARACTER_SET_NAME|DEFAULT_COLLATE_NAME|DESCRIPTION                |MAXLEN|
+------------------+--------------------+---------------------------+------+
|big5              |big5_chinese_ci     |Big5 Traditional Chinese   |2     |
|dec8              |dec8_swedish_ci     |DEC West European          |1     |
|cp850             |cp850_general_ci    |DOS West European          |1     |
|hp8               |hp8_english_ci      |HP West European           |1     |
|koi8r             |koi8r_general_ci    |KOI8-R Relcom Russian      |1     |
|latin1            |latin1_swedish_ci   |cp1252 West European       |1     |
|latin2            |latin2_general_ci   |ISO 8859-2 Central European|1     |
|swe7              |swe7_swedish_ci     |7bit Swedish               |1     |
|ascii             |ascii_general_ci    |US ASCII                   |1     |
|ujis              |ujis_japanese_ci    |EUC-JP Japanese            |3     |
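
The JDBC cell in this section passes its connection settings as keyword arguments to `options`. One variation, sketched here under stated assumptions, builds those settings as a dictionary and reads the password from an environment variable instead of typing it into the notebook. The helper name `mysql_jdbc_options` and the variable name `MYSQL_PASSWORD` are hypothetical and chosen only for this illustration; the option keys are the same ones used in the cell.

```python
import os


def mysql_jdbc_options(table, host="localhost", user="root"):
    """Build the options for the JDBC reader (hypothetical helper)."""
    return {
        "url": "jdbc:mysql://{}".format(host),
        "dbtable": table,
        "user": user,
        # MYSQL_PASSWORD is an assumed variable name, not a Spark or MySQL convention.
        "password": os.environ.get("MYSQL_PASSWORD", ""),
    }


opts = mysql_jdbc_options("information_schema.CHARACTER_SETS")

# In a pyspark shell started with the connector JAR on the class path:
#   characterSetDF = sqlContext.read.format("jdbc").options(**opts).load()
```

Because `options` takes keyword arguments, unpacking the dictionary with `**opts` is equivalent to spelling the settings out one by one.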

## Creating DataFrames from Apache Parquet
Apache Parquet is an efficient, compressed columnar data representation available to any project in the Hadoop ecosystem. Columnar data representations store data by column, as opposed to the traditional approach of storing data row by row. Use cases that require frequent querying of two to three columns from several columns benefit greatly from such an arrangement because columns are stored contiguously on the disk and you do not have to read unwanted columns in row-oriented storage. Another advantage is in compression. Data in a single column belongs to a single type. The values tend to be similar, and sometimes identical. These qualities greatly enhance compression and encoding efficiency. Parquet allows compression schemes to be specified on a per-column level and allows adding more encodings as they are invented and implemented.<br/><br/>
Apache Spark provides support for both reading and writing Parquet files, and it automatically preserves the schema of the original data. The following example writes the character set data loaded into a DataFrame in the previous example into Parquet format and then re-reads it into another DataFrame:

In [9]:
import shutil

try:
    shutil.rmtree('outputs')
except FileNotFoundError:
    pass

#Write DataFrame contents into Parquet format
characterSetDF.write.parquet('outputs/writers.parquet')
#Read Parquet data into another DataFrame
writersDF = sqlContext.read.parquet("outputs/writers.parquet")

writersDF.show(truncate=False)

+------------------+--------------------+---------------------------+------+
|CHARACTER_SET_NAME|DEFAULT_COLLATE_NAME|DESCRIPTION                |MAXLEN|
+------------------+--------------------+---------------------------+------+
|big5              |big5_chinese_ci     |Big5 Traditional Chinese   |2     |
|dec8              |dec8_swedish_ci     |DEC West European          |1     |
|cp850             |cp850_general_ci    |DOS West European          |1     |
|hp8               |hp8_english_ci      |HP West European           |1     |
|koi8r             |koi8r_general_ci    |KOI8-R Relcom Russian      |1     |
|latin1            |latin1_swedish_ci   |cp1252 West European       |1     |
|latin2            |latin2_general_ci   |ISO 8859-2 Central European|1     |
|swe7              |swe7_swedish_ci     |7bit Swedish               |1     |
|ascii             |ascii_general_ci    |US ASCII                   |1     |
|ujis              |ujis_japanese_ci    |EUC-JP Japanese            |3     |
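
To make the row-versus-column distinction described in this section concrete, here is a toy model in plain Python. It is only an illustration of the two layouts, using four of the rows shown above, and does not represent Parquet's actual file format.

```python
# Four of the rows shown above, stored row by row (row-oriented layout).
rows = [
    ("big5", "big5_chinese_ci", "Big5 Traditional Chinese", 2),
    ("dec8", "dec8_swedish_ci", "DEC West European", 1),
    ("cp850", "cp850_general_ci", "DOS West European", 1),
    ("ujis", "ujis_japanese_ci", "EUC-JP Japanese", 3),
]
names = ["CHARACTER_SET_NAME", "DEFAULT_COLLATE_NAME", "DESCRIPTION", "MAXLEN"]

# The same data stored column by column (column-oriented layout).
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}

# Row layout: every full row is visited just to extract one field.
maxlen_from_rows = [row[3] for row in rows]

# Column layout: the wanted column is already one contiguous list,
# and all of its values share a single type.
maxlen_from_columns = columns["MAXLEN"]
```

Both approaches return the same values, but only the column layout avoids touching the three columns the query does not need.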

## Creating DataFrames from other data sources
Spark provides built-in support for multiple data sources such as `JSON`, `JDBC`, `HDFS`, `Parquet`, `MySQL`, `Amazon S3`, and so on. In addition, its Data Source API offers a pluggable mechanism for accessing structured data through Spark SQL. Several libraries are built on top of this pluggable component, for example, `CSV`, `Avro`, `Cassandra`, and `MongoDB`. These libraries are not part of the Spark code base; they are built for individual data sources and hosted on a community site, Spark Packages.

## DataFrame operations
In the previous sections of this chapter, we learned several ways of creating DataFrames. In this section, we focus on the operations that can be performed on DataFrames. Developers chain multiple operations to `filter`, `transform`, `aggregate`, and `sort` data in DataFrames. The underlying Catalyst optimizer ensures efficient execution of these operations. The functions shown here are similar to those you commonly find in `SQL` operations on tables:

In [10]:
#Create a local collection of colors first
colors = ['white', 'green', 'yellow', 'red', 'brown', 'pink']
#Distribute the local collection to form an RDD, map it to (color, length) tuples, and convert that RDD to a DataFrame
color_df = sc.parallelize(colors).map(lambda x:(x,len(x))).toDF(['color', 'length'])
color_df.show()

+------+------+
| color|length|
+------+------+
| white|     5|
| green|     5|
|yellow|     6|
|   red|     3|
| brown|     5|
|  pink|     4|
+------+------+



In [11]:
color_df.dtypes

[('color', 'string'), ('length', 'bigint')]

In [12]:
color_df.count()

6

In [13]:
color_df.columns

['color', 'length']

In [14]:
#drop a column. The source DataFrame color_df remains the same. 
#Spark returns a new DataFrame which is being passed to show
color_df.drop('length').show()

+------+
| color|
+------+
| white|
| green|
|yellow|
|   red|
| brown|
|  pink|
+------+



In [15]:
#Convert to JSON format
color_df.toJSON().collect()

['{"color":"white","length":5}',
 '{"color":"green","length":5}',
 '{"color":"yellow","length":6}',
 '{"color":"red","length":3}',
 '{"color":"brown","length":5}',
 '{"color":"pink","length":4}']

In [16]:
#filter operation is similar to WHERE clause in SQL
#You specify conditions to keep only the desired rows; select then picks the columns

#Output of filter operation is another DataFrame object that is usually passed on to some more operations
#The following example selects the colors having a length of four or five only and labels the column as "mid_length"
color_df.filter(color_df.length.between(4, 5)).select(color_df.color.alias("mid_length")).show()

+----------+
|mid_length|
+----------+
|     white|
|     green|
|     brown|
|      pink|
+----------+



In [17]:
#This example uses multiple filter criteria
color_df.filter(color_df.length > 4)\
        .filter(color_df[0] != "white").show()

+------+------+
| color|length|
+------+------+
| green|     5|
|yellow|     6|
| brown|     5|
+------+------+



In [18]:
#Sort the data on one or more columns
color_df.sort("color").show()

+------+------+
| color|length|
+------+------+
| brown|     5|
| green|     5|
|  pink|     4|
|   red|     3|
| white|     5|
|yellow|     6|
+------+------+



In [19]:
#First filter colors of length 4 or more and then sort on multiple columns
#The filtered rows are sorted first on the column length and then on color
#ascending=False applies to both columns, so both are sorted in descending order
color_df.filter(color_df['length'] >= 4).sort("length", "color", ascending=False).show()

+------+------+
| color|length|
+------+------+
|yellow|     6|
| white|     5|
| green|     5|
| brown|     5|
|  pink|     4|
+------+------+
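
`sort` also accepts a list of booleans for `ascending`, one per column, for example `ascending=[True, False]` to order by ascending length and then descending color. As a hedged illustration of that mixed ordering, the sketch below reproduces it in plain Python so that it can be checked without a Spark session; the PySpark call appears only as a comment.

```python
colors = ['white', 'green', 'yellow', 'red', 'brown', 'pink']
rows = [(c, len(c)) for c in colors]

# Same filter as the cell above: keep colors of length 4 or more.
kept = [r for r in rows if r[1] >= 4]

# Python's sort is stable, so sort on the secondary key first
# (color, descending) and then on the primary key (length, ascending).
by_color_desc = sorted(kept, key=lambda r: r[0], reverse=True)
ordered = sorted(by_color_desc, key=lambda r: r[1])

# PySpark counterpart of this ordering:
#   color_df.filter(color_df['length'] >= 4) \
#           .sort("length", "color", ascending=[True, False]).show()
```

Here `ordered` starts with `pink`, followed by the three five-letter colors in reverse alphabetical order, and ends with `yellow`.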



In [20]:
#You can use orderBy instead, which is an alias of sort
color_df.orderBy("length", "color").take(4)

[Row(color='red', length=3),
 Row(color='pink', length=4),
 Row(color='brown', length=5),
 Row(color='green', length=5)]

In [21]:
#Alternative syntax, for single or multiple columns.
color_df.sort(color_df.length.desc(), color_df.color.asc()).show()

+------+------+
| color|length|
+------+------+
|yellow|     6|
| brown|     5|
| green|     5|
| white|     5|
|  pink|     4|
|   red|     3|
+------+------+



In [22]:
#All the examples until now have been acting on one row at a time,
#filtering or transforming or reordering.

#The following example deals with regrouping the data
#These operations require "wide dependency" and often involve shuffling.

color_df.groupBy("length").count().show()

+------+-----+
|length|count|
+------+-----+
|     6|    1|
|     5|    3|
|     3|    1|
|     4|    1|
+------+-----+
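
The reason a grouped count involves a wide dependency can be seen in a simplified model: rows with the same key may sit in different partitions, so partial results have to be brought together before the final count is known. The sketch below imitates that in plain Python with an arbitrary two-way split of the data; it is a conceptual illustration, not a description of Spark's internals.

```python
from collections import Counter

colors = ['white', 'green', 'yellow', 'red', 'brown', 'pink']

# Pretend the data sits in two partitions (an arbitrary split for illustration).
partitions = [colors[:3], colors[3:]]

# Stage 1: each partition counts its own rows by length, with no data movement.
partial_counts = [Counter(len(c) for c in part) for part in partitions]

# Stage 2: partial results for the same key are brought together and merged.
# This cross-partition step is what corresponds to the shuffle.
total_counts = Counter()
for partial in partial_counts:
    total_counts.update(partial)
```

Length 5 appears in both partitions (`white` and `green` in the first, `brown` in the second), so neither partition alone can report the final count of 3.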



In [23]:
#Data often contains missing information or null values. 
#We may want to drop such rows or replace with some filler information. 
#dropna is provided for dropping such rows
#The following JSON file has names of famous authors. First name data is missing in one row.
df1 = sqlContext.read.json('resource/authors_missing.json')
df1.show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|      Mark|    Twain|
|   Charles|  Dickens|
|      null|    Hardy|
+----------+---------+



In [24]:
#Let us drop the row with incomplete information
df2 = df1.dropna()
df2.show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|      Mark|    Twain|
|   Charles|  Dickens|
+----------+---------+
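
Replacing missing values with filler information, the alternative to dropping rows, is available in PySpark through `fillna`. The sketch below models both behaviors in plain Python on the three rows shown earlier, with `None` standing in for null; the filler string `"unknown"` is an arbitrary choice for illustration.

```python
# Rows as shown above; None plays the role of null.
rows = [
    {"first_name": "Mark", "last_name": "Twain"},
    {"first_name": "Charles", "last_name": "Dickens"},
    {"first_name": None, "last_name": "Hardy"},
]

# Counterpart of df1.dropna(): discard any row that contains a null.
dropped = [r for r in rows if all(v is not None for v in r.values())]

# Counterpart of df1.fillna("unknown"): replace nulls with a filler string.
filled = [
    {k: ("unknown" if v is None else v) for k, v in r.items()}
    for r in rows
]
```

Dropping loses the `Hardy` row entirely, while filling keeps all three rows at the cost of a placeholder first name.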

