### Ref.: DataCamp

In [1]:
import findspark
findspark.init('/home/sushant/spark-2.1.0-bin-hadoop2.7')
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('haveFun').getOrCreate()

In [2]:
spark

<pyspark.sql.session.SparkSession at 0x7f2ee8069e10>

In [3]:
spark.version

'2.1.0'

In [4]:
Spark

NameError: name 'Spark' is not defined

### Using DataFrames

Spark's core data structure is the Resilient Distributed Dataset (RDD), a low-level object that lets Spark split data across multiple nodes in the cluster. RDDs are hard to work with directly, however, so Spark provides the DataFrame abstraction, which is built on top of RDDs.

Besides being easier to understand, DataFrames are usually better optimized for complicated operations than hand-written RDD code, because Spark can plan and optimize DataFrame queries before running them.

To start working with Spark DataFrames, you first have to create a SparkSession object from your SparkContext. You can think of the SparkContext as your connection to the cluster and the SparkSession as your interface with that connection.

SparkSession has an attribute called catalog, which lists all the data inside the cluster. This attribute has a few methods for extracting different pieces of information.

One of the most useful is the .listTables() method, which returns a list of Table objects, one for each table in the cluster. Each object carries the table's name along with details such as its database and whether it is temporary.

In [5]:
spark.catalog.listTables()

[]

**Adding data to Spark**

The .createDataFrame() method of the SparkSession takes a pandas DataFrame and returns a Spark DataFrame.

The output of this method is stored locally, not in the SparkSession catalog. This means that you can use all the Spark DataFrame methods on it, but you can't access the data in other contexts.

For example, a SQL query (using the .sql() method) that references your DataFrame will throw an error. To access the data in this way, you have to save it as a temporary table.

You can do this using the .createTempView() Spark DataFrame method, which takes as its only argument the name of the temporary table you'd like to register. This method registers the DataFrame as a table in the catalog, but as this table is temporary, it can only be accessed from the specific SparkSession used to create the Spark DataFrame.

There is also the method .createOrReplaceTempView(). This safely creates a new temporary table if nothing was there before, or updates an existing table if one was already defined. We'll use this method to avoid running into problems with duplicate tables.

In [6]:
import pandas as pd
import numpy as np

In [7]:
pdTemp = pd.DataFrame(np.random.random(10))

In [8]:
pdTemp

|   | 0 |
|---|---|
| 0 | 0.002693 |
| 1 | 0.161249 |
| 2 | 0.287142 |
| 3 | 0.957138 |
| 4 | 0.298248 |
| 5 | 0.993036 |
| 6 | 0.693297 |
| 7 | 0.396290 |
| 8 | 0.109099 |
| 9 | 0.697257 |


In [9]:
sparkTemp = spark.createDataFrame(pdTemp)

In [10]:
sparkTemp.show()

+--------------------+
|                   0|
+--------------------+
|0.002692894057628...|
| 0.16124870300395344|
| 0.28714188354713266|
|  0.9571379312472855|
| 0.29824829776334294|
|  0.9930363942070299|
|  0.6932968201346377|
|  0.3962902479621606|
| 0.10909899511934784|
|  0.6972574584479467|
+--------------------+



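Note that the Spark DataFrame has no index, unlike the pandas DataFrame it was created from. This makes sense in a distributed paradigm, where the rows are split across nodes and a single global row label is not a natural fit.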

In [11]:
spark.catalog.listTables()

[]

Now let's add sparkTemp to the catalog as a temporary table named tmp.

In [12]:
sparkTemp.createOrReplaceTempView("tmp")

In [13]:
spark.catalog.listTables()

[Table(name='tmp', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]