# KEY NOTE

This notebook is a guide to getting started with PySpark. Since most technologies and industries nowadays run on the cloud, it is good to know how to apply ML and data science to "Big Data". I will be modifying code and content that is already available so that it works with a simple dataset like "Titanic", which is the hello world of ML.

These are just my personal notes. I am sharing them so that they help others who are trying to learn similar concepts. Any feedback is appreciated.

This is the first notebook in the PySpark series.

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home">Notebook Navigation</h3>

[1. What is Spark?](#1)     
[2. Using Spark in Python](#2)    
[3. Using DataFrames](#3)     
[4. Creating a SparkSession](#4)     
[5. Viewing Tables](#5)     
[6. Querying the Data](#6)     
[7. Pandafy a Spark DataFrame](#7)         
[8. Upside Down with Spark](#8)     
[9. Dropping the Middle Man](#9)     
[10. Creating Columns](#10)         

<a id="1"></a>
# 1. What is Spark?

Spark is a platform for cluster computing. Spark lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer). Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data.

As each node works on its own subset of the total data, it also carries out a part of the total calculations required, so that both data processing and computation are performed in parallel over the nodes in the cluster. Parallel computation can make certain types of programming tasks much faster.
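
To make the split, compute, and combine idea concrete, here is a minimal single-machine sketch in plain Python. It is not Spark: threads stand in for worker nodes, and the helper names `split_into_partitions` and `partial_sum_of_squares` are made up for this illustration. Each "worker" only sees its own slice of the data, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, n_partitions):
    """Split a list into n_partitions chunks of (almost) equal size."""
    size, extra = divmod(len(data), n_partitions)
    partitions, start = [], 0
    for i in range(n_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def partial_sum_of_squares(partition):
    """The work a single 'node' does on its own slice of the data."""
    return sum(x * x for x in partition)


data = list(range(1, 101))
partitions = split_into_partitions(data, n_partitions=4)

# Each partition is processed independently, then the results are combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(partial_sum_of_squares, partitions))

total = sum(partial_results)
print(total)  # 338350
```

Spark applies the same pattern, but the partitions live on different machines and Spark handles the scheduling and the combining for you.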

However, with greater computing power comes greater complexity.

If you are deciding whether or not Spark is the best solution for your problem, you can consider questions like:
* Is my data too big to work with on a single machine?
* Can my calculations be easily parallelized?

<a id="2"></a>
# 2. Using Spark in Python

The first step in using Spark is connecting to a cluster.

In practice, the cluster will be hosted on a remote machine that's connected to all the other nodes. One computer, called the master, manages splitting up the data and the computations. The master is connected to the rest of the computers in the cluster, which are called workers. The master sends the workers data and calculations to run, and they send their results back to the master.

When just getting started with Spark, it's simpler to run a cluster locally. So for now, instead of connecting to another computer, all computations will run on Kaggle's servers in a simulated cluster.

Creating the connection is as simple as creating an instance of the SparkContext class. The class constructor takes a few optional arguments that allow us to specify the attributes of the cluster we're connecting to.

An object holding all these attributes can be created with the SparkConf() constructor. 

In [None]:
# installing pyspark in container
!pip install pyspark

<a id="3"></a>
# 3. Using DataFrames

Spark's core data structure is the Resilient Distributed Dataset (RDD). This is a low level object that splits data across multiple nodes in the cluster. However, RDDs are hard to work with directly, so we'll be using the Spark DataFrame abstraction built on top of RDDs.

The Spark DataFrame was designed to behave a lot like a SQL table (a table with variables in the columns and observations in the rows). Not only are they easier to understand, DataFrames are also more optimized for complicated operations than RDDs.

When we start modifying and combining columns and rows of data, there are many ways to arrive at the same result, but some often take much longer than others. When using RDDs, it's up to the data scientist to figure out the right way to optimize the query, but the DataFrame implementation has much of this optimization built in!

To start working with Spark DataFrames, we first have to create a SparkSession object from our SparkContext. You can think of the SparkContext as the connection to the cluster and the SparkSession as the interface to that connection.

<a id="4"></a>
# 4. Creating a SparkSession

Creating multiple SparkSessions and SparkContexts can cause issues, so it's best practice to use the SparkSession.builder.getOrCreate() method. This returns an existing SparkSession if there's already one in the environment, or creates a new one if required.

In [None]:
# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# Create spark_ex
spark_ex = SparkSession.builder.getOrCreate()

# Print spark_ex
print(spark_ex)

<a id="5"></a>
# 5. Viewing Tables
Once we've created a SparkSession, we can start poking around to see what data is in the cluster!

SparkSession has an attribute called catalog, which exposes the tables and views registered in the session. Its .listTables() method returns a list describing each table the session can see (name, database, type, and so on).

In [None]:
# Print the tables in the catalog
print(spark_ex.catalog.listTables())

Here there are no predefined tables in our local cluster, so the list is empty.

<a id="6"></a>
# 6. Querying the Data

One of the advantages of the DataFrame interface is that you can run SQL queries on the tables in your Spark cluster.

*If we had tables in our cluster, we could start firing queries right away. Since we have none, I will create one here. You can skip that part and focus on the queries for now; we'll cover it later. (I have hidden those code lines, but you can always expand them to see it!)*

Running a query on this table is as easy as using the .sql() method on your SparkSession. Let's try this.

In [None]:
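# Register a table in the catalog
import pandas as pd

# Load the Titanic training data with pandas and keep the first four columns
df1 = pd.read_csv("../input/titanic/train.csv")
df2 = df1.iloc[:, 0:4]

# Convert to a Spark DataFrame and register it as a temporary view
# (createOrReplaceTempView replaces the deprecated registerTempTable)
spark_df = spark_ex.createDataFrame(df2)
spark_df.createOrReplaceTempView("sample_table")
# spark_df.show()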

In [None]:
# Select the first 10 rows of sample_table
query = "SELECT * FROM sample_table LIMIT 10"

# Run the query
titanic10 = spark_ex.sql(query)

# Show the results
titanic10.show()

<a id="7"></a>
# 7. Pandafy a Spark DataFrame

Suppose we've run a query on a huge dataset and aggregated it down to something a little more manageable. Sometimes it makes sense to then take that table and work with it locally using a tool like pandas. We can do that with the .toPandas() method.

This time the query counts the number of passengers who survived, grouped by Pclass.

In [None]:
# Count the survivors (Survived = 1) in each passenger class
query = "SELECT Pclass, COUNT(*) AS Survived_Count FROM sample_table WHERE Survived = 1 GROUP BY Pclass"

# Run the query
titanic_counts = spark_ex.sql(query)

# Convert the results to a pandas DataFrame
pd_counts = titanic_counts.toPandas()

# Print the head of pd_counts
print(pd_counts.head())

<a id="8"></a>
# 8. Upside Down with Spark

Let's now put a pandas DataFrame into a Spark cluster. The SparkSession class has a method for this as well.

The .createDataFrame() method takes a pandas DataFrame and returns a Spark DataFrame.

The output of this method is stored locally, not in the SparkSession catalog, so a SQL query cannot reference it yet. To make it queryable, we can use the .createTempView() Spark DataFrame method, which takes as its only argument the name of the temporary table you'd like to register. This method registers the DataFrame as a table in the catalog, but as this table is temporary, it can only be accessed from the specific SparkSession used to create the Spark DataFrame.

There is also the method .createOrReplaceTempView(). This safely creates a new temporary table if nothing was there before, or updates an existing table if one was already defined. We'll use this method to avoid running into problems with duplicate tables.

Check out the diagram to see all the different ways your Spark data structures interact with each other.
![](https://s3.amazonaws.com/assets.datacamp.com/production/course_4452/datasets/spark_figure.png)

In [None]:
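import numpy as np
# Build a small pandas DataFrame of random numbers and convert it to Spark
pd_temp = pd.DataFrame(np.random.random(10))
spark_temp = spark_ex.createDataFrame(pd_temp)
# Examine the tables in the catalog: spark_temp is not listed yet
print(spark_ex.catalog.listTables())
# Register spark_temp as a temporary view named "temp"
spark_temp.createOrReplaceTempView("temp")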

# Examine the tables in the catalog again
print(spark_ex.catalog.listTables())

<a id="9"></a>
# 9. Dropping the Middle Man

We can also read a CSV file straight into Spark instead of first loading it into a pandas DataFrame.

SparkSession has a .read attribute which has several methods for reading different data sources into Spark DataFrames. Using these you can create a DataFrame from a .csv file.

In [None]:
# Path to the Titanic test set
file_path = "../input/titanic/test.csv"

# Read in the titanic data, using the first line as column names
titanic = spark_ex.read.csv(file_path, header=True)

# Show the data
titanic.show()

<a id="10"></a>
# 10. Creating Columns

We will now use the methods defined by Spark's DataFrame class to perform common data operations.

Let's look at performing column-wise operations. In Spark we can do this using the .withColumn() method, which takes two arguments: first, a string with the name of the new column, and second, the new column itself.

The new column must be an object of class Column. Creating one of these is as easy as extracting a column from your DataFrame using df.colName.

Updating a Spark DataFrame is somewhat different from working in pandas because the **Spark DataFrame is immutable**. This means that it can't be changed, and so columns can't be updated in place. Hence all these methods return a new DataFrame. To overwrite the original DataFrame, you must reassign the returned DataFrame, like so:

`df = df.withColumn("newCol", df.oldCol + 1)`

In [None]:
# Add a "Fare x 100" column; Fare was read from the CSV as a string, so cast it to a number first
titanic = titanic.withColumn("Fare x 100", titanic.Fare.cast("double") * 100)
titanic.show()

### Please check out the next notebook in continuation [here](https://www.kaggle.com/amritvirsinghx/scalable-data-science-pyspark-dataframes-nb2)