# Introduction to Spark
Using Spark, we are going to read in this data and calculate the average age. First, we need to initialize a `SparkSession`:

In [1]:
from pyspark.sql import SparkSession
# Create a SparkSession, or reuse the existing one if it is already running
spark = SparkSession \
    .builder \
    .appName("Spark Example") \
    .getOrCreate()


Let’s go ahead and create a Spark Dataset from our Lord of the Rings age data. Included in the Spark directory for this chapter is a file called `ages.json`, which contains the age data in JSON Lines format (one JSON object per line). It looks like:

```
{"Name": "Bilbo", "Age": 28}
{"Name": "Frodo", "Age": 26}
{"Name": "Gandalf", "Age": 62}
{"Name": "Samwise", "Age": 30}
{"Name": "Sauron", "Age": 72}
{"Name": "Aragorn", "Age": 31}
```
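
To make the format concrete, here is a minimal plain-Python sketch (no Spark involved) that parses the same six records line by line. The inlined `raw` string and the `records` name are illustrative assumptions, not part of the chapter's files:

```python
import json

# The same six records as ages.json, inlined so the sketch is self-contained.
raw = """\
{"Name": "Bilbo", "Age": 28}
{"Name": "Frodo", "Age": 26}
{"Name": "Gandalf", "Age": 62}
{"Name": "Samwise", "Age": 30}
{"Name": "Sauron", "Age": 72}
{"Name": "Aragorn", "Age": 31}
"""

# JSON Lines: every non-empty line is a complete JSON document on its own,
# so the file can be parsed (and split across workers) one line at a time.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

print(len(records))  # 6
print(records[0])    # {'Name': 'Bilbo', 'Age': 28}
```

Because each line stands alone, a reader never needs to load the whole file to parse one record, which is what makes the format convenient for distributed tools.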

Now, we can read in `ages.json` as a Spark Dataset:

In [2]:
df = spark.read.json('ages.json').repartition(10).cache()

Now we have a Dataset (also called a DataFrame, by analogy with Pandas) representing our data. We can use the Spark SQL API to compute an aggregation over the dataset, which in our case is an average:

In [3]:
df.agg({"Age": "avg"}).collect()

[Row(avg(Age)=41.5)]
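
As a sanity check on this result, the same average can be reproduced in plain Python from the six ages listed earlier. This sketch does not use Spark; the `ages` list is simply copied by hand from `ages.json`:

```python
from statistics import mean

# Ages copied from ages.json, in file order.
ages = [28, 26, 62, 30, 72, 31]

# 249 / 6
average_age = mean(ages)
print(average_age)  # 41.5
```

The value matches the `avg(Age)` column that Spark returned, which is a quick way to confirm that an aggregation on a small sample does what we expect before running it on a large dataset.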

We can also execute calculations at the row level. For example, let’s calculate each character’s age in dog years (age times 7):

In [4]:
df.withColumn('dog_years', df.Age*7).collect()

[Row(Age=72, Name='Sauron', dog_years=504),
 Row(Age=31, Name='Aragorn', dog_years=217),
 Row(Age=62, Name='Gandalf', dog_years=434),
 Row(Age=28, Name='Bilbo', dog_years=196),
 Row(Age=26, Name='Frodo', dog_years=182),
 Row(Age=30, Name='Samwise', dog_years=210)]

Best of all, this calculation would have scaled automatically across our computing cluster if we had more than one node. Notice something at the end of each of the commands above? If you are wondering what `.collect()` does, then you’re onto something.

Spark executes code lazily. This means that *transformations*, such as calculating the characters’ ages in dog years, are only executed once an *action* is called. The `.withColumn()` command is a *transformation*, while `.collect()` is the *action* that causes the *transformation* to be executed. Often, the *action* that triggers execution of our *transformations* is writing the job’s output to disk, HDFS, or S3.
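
The idea of lazy execution can be illustrated without Spark at all. The sketch below is only an analogy, using Python's built-in lazy `map` in place of a Spark transformation and `list()` in place of an action; the `to_dog_years` helper and the `evaluated` list are hypothetical names introduced for illustration. Spark's actual mechanism (building a plan and distributing it across executors) is considerably more involved:

```python
ages = [28, 26, 62, 30, 72, 31]
evaluated = []  # records which ages have actually been processed


def to_dog_years(age):
    """Convert an age to dog years, noting that the work was really done."""
    evaluated.append(age)
    return age * 7


# "Transformation": describe the computation. map() is lazy, so no age
# has been processed at this point.
dog_years = map(to_dog_years, ages)
processed_before = len(evaluated)

# "Action": consuming the iterator forces every element to be computed.
result = list(dog_years)
processed_after = len(evaluated)

print(processed_before, processed_after)  # 0 6
print(result)  # [196, 182, 434, 210, 504, 217]
```

Nothing is computed when the pipeline is defined; all six conversions happen only when the result is requested, which mirrors the relationship between `.withColumn()` and `.collect()`.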

Let’s try to create a new Dataset which includes the characters’ ages in dog years, then let’s write this out to disk:

In [5]:
df_new = df.withColumn('dog_years', df.Age*7)

Now we have a new Dataset called `df_new`. Note that nothing has been calculated yet; Spark has only recorded the transformation we asked for. When we call an action on `df_new`, such as `.collect()`, or write the output to disk, the transformation will be executed.

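We can write `df_new` to disk with the following. Spark creates a directory named `dog_years.json` containing one or more part files rather than a single file, and `mode('append')` adds to any output already there: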

In [6]:
df_new.write.mode('append').json("dog_years.json")

We can also filter rows. Like `.withColumn()`, `.filter()` is a transformation, so nothing runs until we call `.collect()`:

In [7]:
filtered = df.filter("Name = 'Bilbo'")

In [8]:
filtered.collect()

[Row(Age=28, Name='Bilbo')]