In [15]:
### Libraries import
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

**SparkSession**

The SparkSession is the entry point to a Spark application. It is also where you set the configuration for your Spark program.

**pyspark.sql.functions**

You will find that most of your data wrangling and analysis is done by chaining together multiple functions. If you find that you can't get your desired transformation with the built-in functions, you should:

    - Look through the API docs again.
    - Ask Google.
    - Write a user defined function (udf).

**pyspark.sql.types**

When working with Spark, you will need to define the data type of each column you are working with.

The possible types that Spark accepts are listed here: Spark types

In [6]:
# Create SparkSession
spark = SparkSession.\
        builder.\
        master("local[*]").\
        appName('test').\
        getOrCreate()

In [5]:
spark

In [10]:
### Create a DataFrame
from pyspark.sql import types as T

schema = T.StructType([
    T.StructField("pet_id", T.IntegerType(), False),
    T.StructField("name", T.StringType(), True),
    T.StructField("age", T.IntegerType(), True),  
])

data = [
    (1, "tommy", 5),
    (2, "chewy", 10),
    (3, "roger", 3)
]

new_df = spark.createDataFrame(data = data, schema = schema)
new_df.show()

+------+-----+---+
|pet_id| name|age|
+------+-----+---+
|     1|tommy|  5|
|     2|chewy| 10|
|     3|roger|  3|
+------+-----+---+



In [13]:
# Convert to Pandas
new_df.toPandas()

   pet_id   name  age
0       1  tommy    5
1       2  chewy   10
2       3  roger    3


## What Happened?

For any *DataFrame (df)* that you create in Spark, you should provide two things:

1. A schema for the data. Providing a schema explicitly makes the code clearer to the reader and can sometimes improve performance, because Spark does not have to infer the types and knows up front whether a column is nullable. A schema specifies three things for each column:
    - the name of the column
    - the datatype of the column
    - the nullability of the column

2. The data. 

Normally you would read data from local storage, GCS, AWS, etc. into a df, but occasionally you will need to create one yourself.