## Launching Spark

Spark's Python console can be launched directly from the command line with `pyspark`. Inside the console, the SparkSession is available as the `spark` object. The Spark SQL console can be launched with `spark-sql`. We will experiment with these in the upcoming sessions.

If `pyspark` and the other required packages are installed, we can also launch a SparkSession from a Python notebook environment. To do this, we need to import the `pyspark` package.

Databricks and Google Dataproc notebooks already have `pyspark` installed, so we can simply access the SparkSession through the `spark` object.

## The SparkSession

You control your Spark Application through a driver process called the SparkSession. The SparkSession instance is the way Spark executes user-defined manipulations across the cluster. There is a one-to-one correspondence between a SparkSession and a Spark Application. In Scala and Python, the variable is available as `spark` when you start the console. Let’s go ahead and look at the SparkSession:

In [2]:
spark

<img src="https://github.com/soltaniehha/Big-Data-Analytics-for-Business/blob/master/figs/04-02-SparkSession-JVM.png?raw=true" width="700" align="center"/>

## Transformations

Let’s now perform the simple task of creating a range of numbers. This range of numbers is just like a named column in a spreadsheet:

In [3]:
myRange = spark.range(1000).toDF("number")

We created a DataFrame with one column containing 1,000 rows with values from 0 to 999. This range of numbers represents a distributed collection. When run on a cluster, each part of this range of numbers exists on a different executor. This is a Spark DataFrame.

In [4]:
myRange

DataFrame[number: bigint]

Evaluating `myRange` does not display any data, only a description of the object (its column names and types). That is because we have only defined the recipe for creating the DataFrame; we have not yet materialized it.

Core data structures in Spark are immutable, meaning they cannot be changed after they’re created.

To “change” a DataFrame, you need to instruct Spark how you would like to modify it to do what you want. 

These instructions are called **transformations**. Transformations are lazy operations, meaning that they won’t do any computation or return any output until they are asked to by an action.

Let’s perform a simple transformation to find all even numbers in our current DataFrame:

In [5]:
divisBy2 = myRange.where("number % 2 = 0")

In [6]:
divisBy2

DataFrame[number: bigint]

The `where` statement specifies a narrow dependency, in which each input partition contributes to at most one output partition. Transformations are the core of how you express your business logic using Spark. Spark will not act on transformations until we call an **action**.

### Lazy Evaluation

Lazy evaluation means that Spark will wait until the very last moment to execute the graph of computation instructions. In Spark, instead of modifying the data immediately when you express some operation, you build up a plan of transformations that you would like to apply to your source data. By waiting until the last minute to execute the code, Spark compiles this plan from your raw DataFrame transformations to a streamlined physical plan that will run as efficiently as possible across the cluster. This provides immense benefits because Spark can optimize the entire data flow from end to end.

An example of this is predicate pushdown on DataFrames. If we build a large Spark job but specify a filter at the end that only requires us to fetch one row from our source data, the most efficient way to execute this is to access the single record that we need. Spark will optimize this for us by pushing the filter down automatically.

## Actions

Transformations allow us to build up our logical transformation plan. To trigger the computation, we run an action. An action instructs Spark to compute a result from a series of transformations. The simplest action is `count`, which gives us the total number of records in the DataFrame:

In [7]:
divisBy2.count()

500

There are three kinds of actions:

* Actions to view data in the console

* Actions to collect data to native objects in the respective language

* Actions to write to output data sources