# Hello Spark
In this tutorial, you will run your first Spark application in Python. This notebook is therefore our Spark driver process.
We will cover the following items:
- Build a SparkSession
- Use the high-level DataFrame API to read a CSV file
- Compare the running time of Spark and pandas when reading a large CSV file
- Explore basic Spark DataFrame functionality
- Explore the Spark UI

# Setup
Import required Python packages

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import count
import time
from functools import wraps
import pandas as pd

# Create a SparkSession
As we mentioned during the lecture, the main way to interact with Spark is through the SparkSession object. Therefore, the very first step when writing a Spark application is to instantiate a SparkSession. A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. Only one SparkContext can be active per JVM, so an application normally shares a single SparkSession, which is why we call `getOrCreate()` instead of always building a new one. When creating a SparkSession, there are several options you can pass. For instance, the master URL `local[n]` tells Spark to run locally with `n` worker threads, ideally one per core. We can use ```*``` in place of `n` to instruct Spark to use all available cores.

In [5]:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Hello Spark") \
    .getOrCreate()

# Read Data into a Spark DataFrame
We will get into the details of the Spark DataFrame API later, but for now it is enough to understand that we can use the SparkSession to read data from different sources, such as a CSV file, as shown below.

In [9]:
# Please use the activity_log_raw.csv file here
large_csv = "/Users/dmatekenya/wbg/cuebic-raw-data/processed/input.csv"
# Read the file defined above (the original referenced an undefined name, cdr_file)
df = spark.read.csv(large_csv)
print(df.count())

# Compare running time for Spark and Pandas

**EXERCISE-1**: Complete the parts that say "YOUR CODE HERE".

In [1]:
def timefn(fn):
    """
    Function for recording running time of a function
    """
    @wraps(fn)
    def measure_time(*args, **kwargs):
        t1 = time.time()
        result = fn(*args, **kwargs)
        t2 = time.time()
        print("@timefn:" + fn.__name__ + " took " + str(t2 - t1) + " seconds")
        return result
    return measure_time
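To see how the decorator behaves before using it with Spark, here is a minimal sketch that applies it to a toy function. The function `sum_of_squares` is hypothetical and only stands in for a slow workload. The decorator is repeated so that the snippet runs on its own.

```python
import time
from functools import wraps


def timefn(fn):
    """Print how long each call to fn takes (same decorator as above)."""
    @wraps(fn)
    def measure_time(*args, **kwargs):
        t1 = time.time()
        result = fn(*args, **kwargs)
        t2 = time.time()
        print("@timefn:" + fn.__name__ + " took " + str(t2 - t1) + " seconds")
        return result
    return measure_time


@timefn
def sum_of_squares(n):
    """Toy workload (hypothetical), unrelated to Spark."""
    return sum(i * i for i in range(n))


# The decorated function still returns its normal result;
# the timing message is printed as a side effect.
total = sum_of_squares(1000)
print(total)
```

Because the decorator uses `wraps`, the decorated function keeps its original name, which is what appears in the printed timing message.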

In [None]:
@timefn
def load_big_csv_with_spark(big_csv=None):
    """
    A simple function which loads a CSV file using Apache Spark and
    then counts how many rows are in the file
    """
    # create a Spark Session here
    spark = YOUR CODE HERE
    # read the CSV file into a Spark DataFrame
    df = YOUR CODE HERE
    # Get the number of rows in the dataset using the count() function
    cnt = YOUR CODE HERE
    print('Number of rows in big CSV file: {:,}'.format(YOUR CODE HERE))

In [7]:
@timefn
def load_big_csv_with_pandas(big_csv=None):
    """
    Use pandas library to load the large CSV
    """
    # Read CSV as a pandas DataFrame (df) here
    df = YOUR CODE HERE
    
    # Get the total number of rows
    cnt = YOUR CODE HERE
   
    print('Number of rows in big CSV file: {:,}'.format(YOUR CODE HERE))

In [None]:
# Now call the two functions above here
YOUR CODE HERE
YOUR CODE HERE

# Exploring DataFrames in Spark
We can explore the data in much the same way as we did with pandas DataFrames.
- Use the `printSchema()` function to check which columns we have and their data types.
- Use the `columns` attribute, just like in pandas, to get a list of all column names.
- Use the `head()` function, just like in pandas, to get the top `n` observations from the data. Pass `n` within the brackets to specify the number of rows you want to see.
- Count the number of rows in the DataFrame using the `count()` function.
- Get the unique rows using the `distinct()` function. If you want the number of unique elements in a single column, first select the column using the syntax `df.select("column").distinct().count()`.

## Exercise-2: Explore the data
1. Use the ```head()``` function to view the first 5 observations
2. Check column data types
3. How many unique categories are there for the STATUS variable?
4. Using the documentation for [PySpark DataFrame](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame), find a function to get a sample of the data. Get a sample of 10% of the data and convert it to a pandas DataFrame using the `toPandas()` function

#  Explore Spark UI
In order to explore the Spark UI, we will run a function that takes time to run.
For this, let's run the ```summary()``` function. Once you call the function, open the Spark UI by going to *http://localhost:4040* in your browser. Once there, explore the different tabs:
- **Jobs:** On the Jobs tab, you can check the status of your current job. You can see the exact functions, such as count, that Spark is running
- **Executors:** In this tab, you can check how many cores Spark is using and how tasks are being run on each core
- **Stages:** You can see which stage the job is at and how many tasks have completed
- **Environment:** It is important to know which environment variables Spark is using; you can check them in this tab.
- **Storage, SQL:** Explore these tabs to see what information they contain

**EXERCISE-3:** If the summary job is taking too long, please kill it using the ```kill``` option in the Spark UI

# EXERCISE-4:
Run Spark using ```spark-submit``` and compare Spark's running time based on the number of executors assigned to it. Unfortunately, setting the number of executors does not work well in a Jupyter notebook, so we will do it in the terminal instead. Note that in local mode the number inside `local[...]` sets how many worker threads Spark runs, and that is the value we vary here.

## Step-1: Identify the location of your Spark installation
- Run the code below and note down the base folder of your Spark installation
- Make sure you identify the root of the Spark folder
- In the terminal, navigate to that folder using ```cd``` and then navigate to the ```bin``` folder
- If you run the ```ls``` command while in the ```bin``` folder, you should see a ```spark-submit``` executable
- ```spark-submit``` is used to submit standalone Spark applications to a cluster, but we can use it in local mode too

In [13]:
# Run the code below and note down the base folder of your Spark installation
# Make sure you identify the root of the Spark folder
import pyspark
pyspark.__file__

'/Users/dmatekenya/spark-3.0.0-bin-hadoop2.7/python/pyspark/__init__.py'
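The Spark root can also be derived from that path in code. This is a minimal sketch, assuming PySpark comes from a downloaded Spark distribution laid out as `<spark root>/python/pyspark/`, as in the output above. A pip-installed PySpark has a different layout, so the same arithmetic would not apply there.

```python
from pathlib import PurePosixPath

# Path printed by pyspark.__file__ in the cell above (your own path will differ)
pyspark_file = PurePosixPath(
    "/Users/dmatekenya/spark-3.0.0-bin-hadoop2.7/python/pyspark/__init__.py"
)

# Walk up three levels: __init__.py -> pyspark -> python -> Spark root
spark_root = pyspark_file.parents[2]

# spark-submit lives in the bin folder under the Spark root
spark_submit = spark_root / "bin" / "spark-submit"

print(spark_root)
print(spark_submit)
```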

## Step-2: Create a ```.py``` file in VS Code or any text editor
- Copy all the necessary imports and add them at the top of the Python file
- Next, add the line below. It is the standard entry-point guard for Python scripts: the code under it runs only when the file is executed directly:
```if __name__ == "__main__":```
- Copy the code from just after the heading ```Compare running time for Spark and Pandas``` up to just before the heading ```Exploring DataFrames in Spark```, and paste it underneath the statement above, indented one level.
- Make sure your Python file has no errors
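The overall layout of such a file can be sketched as follows. The file name `hello_spark.py`, the `main` function, and the `input.csv` fallback are hypothetical placeholders. The body of `main` only marks where the code from Exercise 1 would be called.

```python
# hello_spark.py (hypothetical file name)
import sys


def main(csv_path):
    """Placeholder: call the functions from Exercise 1 here."""
    print("Would process: " + csv_path)
    return csv_path


if __name__ == "__main__":
    # Runs only when the file is executed as a script (for example through
    # spark-submit), not when it is imported from another module.
    # Take the CSV path from the command line, else fall back to a placeholder.
    main(sys.argv[1] if len(sys.argv) > 1 else "input.csv")
```

Wrapping the work in a function and calling it from the guard is one common layout. Pasting the code directly under the guard, as described above, works as well.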

## Step-3: Run the Spark Application
1. Note the full path of your Python file
2. In the terminal, use ```cd``` to navigate to the root of the Spark folder
3. Within the Spark folder, run this command:
```./bin/spark-submit --name "Hello Spark" --master "local[num_executors]" --conf spark.eventLog.enabled=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" --conf spark.hadoop.abc.def=xyz --conf spark.hive.abc=xyz path_to_your_python_file```
4. To avoid errors, first copy the command above into a text editor and make sure everything is on one line
5. Replace ```num_executors``` with a number such as ```4``` for a start, keeping the quotes so that the shell does not interpret the square brackets. Press Enter to run the program.
6. As the program runs, take note of how many executors are created and note the running time of the Spark function only
7. Now increase ```num_executors``` by 2 or 4 and run the program again. See if you notice a reduction in running time.
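To avoid retyping the command for each run, the variations can be generated with a small helper. This is a minimal sketch: `build_submit_command` and the script name `hello_spark.py` are hypothetical, and only a subset of the `--conf` options from the command above is included. It prints the commands and does not run them.

```python
import shlex


def build_submit_command(script_path, n_threads, app_name="Hello Spark"):
    """Return the spark-submit argument list for a local run with n_threads."""
    return [
        "./bin/spark-submit",
        "--name", app_name,
        "--master", f"local[{n_threads}]",
        "--conf", "spark.eventLog.enabled=false",
        script_path,
    ]


# hello_spark.py is a hypothetical file name; use the path of your own script
for n in (2, 4, 8):
    # shlex.join quotes arguments that contain spaces or square brackets
    print(shlex.join(build_submit_command("hello_spark.py", n)))
```

Each printed line can be pasted into the terminal from the Spark root folder. Note down the running time reported by `@timefn` for each value.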