In [5]:
### Libraries import
from pyspark.sql import SparkSession
from pyspark.sql import types as T

**SparkSession**

The SparkSession is the entry point to a Spark application. This is where you provide configuration for your Spark program, such as the master URL and the application name.

**pyspark.sql.functions**

You will find that most of your data wrangling and analysis is done by chaining together multiple functions. If you can't get your desired transformation with the built-in functions, you should:

    - Look through the API docs again.
    - Ask Google.
    - Write a user-defined function (UDF).

**pyspark.sql.types**

When working with Spark, you will often need to define the type of data for each column you are working with.

The possible types that Spark accepts are listed in the Spark SQL data types documentation.

In [30]:
# Create SparkSession
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("test")
    .getOrCreate()
)

In [31]:
spark

<img src="ways-to-create-df-in-spark.png">

### From RDDs

In [28]:
iphones_RDD = spark.sparkContext.parallelize([
    ("XS", 2018, 5.65, 2.79, 6.24),
    ("XR", 2018, 5.94, 2.98, 6.84),
    ("X10", 2017, 5.65, 2.79, 6.13),
    ("8Plus", 2017, 6.23, 3.07, 7.12)
])

names = ['Model', 'Year', 'Height', 'Width', 'Weight']

# Method 1
df = iphones_RDD.toDF(schema=names)
df.show()

# Method 2
df = spark.createDataFrame(iphones_RDD, schema=names)
df.show()

+-----+----+------+-----+------+
|Model|Year|Height|Width|Weight|
+-----+----+------+-----+------+
|   XS|2018|  5.65| 2.79|  6.24|
|   XR|2018|  5.94| 2.98|  6.84|
|  X10|2017|  5.65| 2.79|  6.13|
|8Plus|2017|  6.23| 3.07|  7.12|
+-----+----+------+-----+------+

+-----+----+------+-----+------+
|Model|Year|Height|Width|Weight|
+-----+----+------+-----+------+
|   XS|2018|  5.65| 2.79|  6.24|
|   XR|2018|  5.94| 2.98|  6.84|
|  X10|2017|  5.65| 2.79|  6.13|
|8Plus|2017|  6.23| 3.07|  7.12|
+-----+----+------+-----+------+



### Programmatically specifying schema

In [32]:
from pyspark.sql import types as T

schema = T.StructType([
    T.StructField("pet_id", T.IntegerType(), False),
    T.StructField("name", T.StringType(), True),
    T.StructField("age", T.IntegerType(), True),  
])

data = [
    (1, "tommy", 5),
    (2, "chewy", 10),
    (3, "roger", 3)
]

new_df = spark.createDataFrame(data=data, schema=schema)
new_df.show()
new_df.printSchema()

+------+-----+---+
|pet_id| name|age|
+------+-----+---+
|     1|tommy|  5|
|     2|chewy| 10|
|     3|roger|  3|
+------+-----+---+

root
 |-- pet_id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)



### Various data sources (CSV, JSON, TXT) using SparkSession's read method (See 'DataFrames_example' notebook)