# What is Spark, anyway?

Spark is a platform for cluster computing. Spark lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer). Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data.

As each node works on its own subset of the total data, it also carries out a part of the total calculations required, so that both data processing and computation are performed in parallel over the nodes in the cluster. Parallel computation can make certain types of programming tasks much faster.
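The split-then-combine pattern described above can be sketched on a single machine without Spark. This is only an illustration in plain Python: the helpers `split` and `parallel_sum` are hypothetical names, and threads stand in for cluster nodes. It shows the shape of the idea, not how Spark is implemented, and threads in CPython will not actually speed up a sum like this.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_parts):
    """Split a list into n_parts roughly equal chunks (one per pretend node)."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks each take one leftover element
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum(data, n_parts=4):
    """Sum each chunk on its own worker thread, then combine the partial sums."""
    chunks = split(data, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


numbers = list(range(1, 101))
total = parallel_sum(numbers)
print(total)
```

Each worker only ever sees its own chunk, and the final answer comes from combining the small partial results, which is the same division of labor the nodes in a cluster follow.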

However, with greater computing power comes greater complexity.

Deciding whether or not Spark is the best solution for your problem takes some experience, but you can consider questions like:

- Is my data too big to work with on a single machine?
- Can my calculations be easily parallelized?

# Using Spark in Python

The first step in using Spark is connecting to a cluster.

In practice, the cluster will be hosted on a remote machine that's connected to all other nodes. There will be one computer, called the master, that manages splitting up the data and the computations. The master is connected to the rest of the computers in the cluster, which are called workers. The master sends the workers data and calculations to run, and they send their results back to the master.

Creating the connection is as simple as creating an instance of the `SparkContext` class. The class constructor takes a few optional arguments that allow you to specify the attributes of the cluster you're connecting to.

An object holding all these attributes can be created with the `SparkConf()` constructor. Take a look at the documentation for all the details!

How do you connect to a Spark cluster from PySpark?
- Create an instance of the SparkContext class.

Get to know the SparkContext.

- Call print() on sc to verify there's a SparkContext in your environment.
- Call print() on sc.version to see what version of Spark is running on your cluster.

In [1]:
# # Verify SparkContext
# print(sc)

# # Print Spark version
# print(sc.version)

# Using DataFrames

Spark's core data structure is the Resilient Distributed Dataset (RDD). This is a low level object that lets Spark work its magic by splitting data across multiple nodes in the cluster. However, RDDs are hard to work with directly, so in this course you'll be using the Spark DataFrame abstraction built on top of RDDs.

The Spark DataFrame was designed to behave a lot like a SQL table (a table with variables in the columns and observations in the rows). Not only are DataFrames easier to understand, they are also better optimized for complicated operations than RDDs.

When you start modifying and combining columns and rows of data, there are many ways to arrive at the same result, but some take much longer than others. When using RDDs, it's up to the data scientist to figure out the right way to optimize the query, but the DataFrame implementation has much of this optimization built in!

To start working with Spark DataFrames, you first have to create a `SparkSession` object from your `SparkContext`. You can think of the `SparkContext` as your connection to the cluster and the `SparkSession` as your interface with that connection.

Which of the following is an advantage of Spark DataFrames over RDDs?
- Operations using DataFrames are automatically optimized.

- Import SparkSession from pyspark.sql.
- Make a new SparkSession called my_spark using SparkSession.builder.getOrCreate().
- Print my_spark to the console to verify it's a SparkSession.

In [1]:
# # Import SparkSession from pyspark.sql
# from pyspark.sql import SparkSession

# # Create my_spark
# my_spark = SparkSession.builder.getOrCreate()

# # Print my_spark
# print(my_spark)

- See what tables are in your cluster by calling spark.catalog.listTables() and printing the result!

In [2]:
# # Print the tables in the catalog
# print(spark.catalog.listTables())

- Use the .sql() method to get the first 10 rows of the flights table and save the result to flights10. The variable query contains the appropriate SQL query.
- Use the DataFrame method .show() to print flights10.

In [3]:
# # Don't change this query
# query = "FROM flights SELECT * LIMIT 10"

# # Get the first 10 rows of flights
# flights10 = spark.sql(query)

# # Show the results
# flights10.show()

- Run the query using the .sql() method. Save the result in flight_counts.
- Use the .toPandas() method on flight_counts to create a pandas DataFrame called pd_counts.
- Print the .head() of pd_counts to the console.

In [4]:
# # Don't change this query
# query = "SELECT origin, dest, COUNT(*) as N FROM flights GROUP BY origin, dest"

# # Run the query
# flight_counts = spark.sql(query)

# # Convert the results to a pandas DataFrame
# pd_counts = flight_counts.toPandas()

# # Print the head of pd_counts
# print(pd_counts.head())

- The code to create a pandas DataFrame of random numbers has already been provided and saved under pd_temp.
- Create a Spark DataFrame called spark_temp by calling the Spark method .createDataFrame() with pd_temp as the argument.
- Examine the list of tables in your Spark cluster and verify that the new DataFrame is not present. Remember you can use spark.catalog.listTables() to do so.
- Register the spark_temp DataFrame you just created as a temporary table using the .createOrReplaceTempView() method. The temporary table should be named "temp". Remember that the table name is set by passing it as the only argument to the method!
- Examine the list of tables again.

In [5]:
# # Create pd_temp (needs numpy and pandas)
# import numpy as np
# import pandas as pd
# pd_temp = pd.DataFrame(np.random.random(10))

# # Create spark_temp from pd_temp
# spark_temp = spark.createDataFrame(pd_temp)

# # Examine the tables in the catalog
# print(spark.catalog.listTables())

# # Add spark_temp to the catalog
# spark_temp.createOrReplaceTempView("temp")

# # Examine the tables in the catalog again
# print(spark.catalog.listTables())

- Use the .read.csv() method to create a Spark DataFrame called airports.
- The first argument is file_path.
- Pass the argument header=True so that Spark knows to take the column names from the first line of the file.
- Print out this DataFrame by calling .show().

In [6]:
# # Don't change this file path
# file_path = "/usr/local/share/datasets/airports.csv"

# # Read in the airports data
# airports = spark.read.csv(file_path, header=True)

# # Show the data
# airports.show()