# Spark DataFrame Basics



## Creating a DataFrame

First, install PySpark if it is not already available, then import `SparkSession`:

In [None]:
%pip install pyspark


In [1]:
from pyspark.sql import SparkSession

Then start the SparkSession:

In [2]:
# May take a little while on a local computer
spark = SparkSession.builder.appName("Basics").getOrCreate()

In [4]:
# Other data sources and read options are covered later.
# This dataset is from Spark's examples.

# Reading may be a little slow locally
df = spark.read.json('../input/people/people.json')

### Showing the data

In [5]:
# Note that some values are missing; they are displayed as null
df.show()

In [6]:
df.printSchema()

In [7]:
df.columns

In [8]:
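# describe() returns a new DataFrame of summary statistics;
# call .show() on it to display the values
df.describe()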

In [9]:
from pyspark.sql.types import StructField,StringType,IntegerType,StructType

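So far Spark has inferred the schema for us. To control the column types ourselves, we can define the schema explicitly. Next we need to create the list of `StructField` objects. Each one takes:

* `name`: string, name of the field.
* `dataType`: `DataType` of the field.
* `nullable`: boolean, whether the field can be null (None) or not.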

In [10]:
data_schema = [
    StructField("age", IntegerType(), True),
    StructField("name", StringType(), True),
]

In [11]:
final_struc = StructType(fields=data_schema)

In [13]:
df = spark.read.json('../input/people/people.json', schema=final_struc)

In [14]:
df.printSchema()

### Grabbing the data

In [15]:
df['age']

In [16]:
type(df['age'])

In [17]:
df.select('age')

In [18]:
type(df.select('age'))

In [19]:
df.select('age').show()

In [20]:
# Returns a list of Row objects
df.head(2)

Multiple Columns:

In [21]:
df.select(['age','name'])

In [22]:
df.select(['age','name']).show()

### Creating new columns

In [23]:
# Adding a new column with a simple copy
df.withColumn('newage',df['age']).show()

In [24]:
# The original DataFrame is unchanged: withColumn returns a new DataFrame
df.show()

In [25]:
# Simple Rename
df.withColumnRenamed('age','supernewage').show()

More complicated operations to create new columns:

In [26]:
df.withColumn('doubleage',df['age']*2).show()

In [27]:
df.withColumn('add_one_age',df['age']+1).show()

In [28]:
df.withColumn('half_age',df['age']/2).show()

In [29]:
# Without .show(), the cell just returns the new DataFrame object
df.withColumn('half_age',df['age']/2)

### Using SQL

To use SQL queries directly with the DataFrame, we need to register it as a temporary view:

In [30]:
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

In [31]:
sql_results = spark.sql("SELECT * FROM people")

In [32]:
sql_results

In [33]:
sql_results.show()

In [34]:
spark.sql("SELECT * FROM people WHERE age=30").show()