# Hello Spark
In this tutorial, you will run your first Spark application in Python. The Python process behind this notebook is our Spark driver process.
We will cover the following items:
- Build a SparkSession
- Use the high-level DataFrame API to read a CSV file
- Compare running time between Spark and pandas when reading a large CSV file
- Explore basic Spark DataFrame functionality
- Explore the Spark UI

# Setup
Import required Python packages

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import count
import time
from functools import wraps
import pandas as pd

# Create a SparkSession
As we mentioned during the lecture, the main way to interact with Spark is through the SparkSession object. Therefore, the very first step when writing a Spark application is to instantiate a SparkSession. A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. Only one SparkContext can be active per JVM, so ```getOrCreate()``` returns the existing session if one is already running instead of starting a second one. When creating a SparkSession, there are several options you can pass. For instance, you can set the number of cores through the master URL: ```local[4]``` uses four cores, and ```local[*]``` instructs Spark to use all available cores.

## Initializing Spark

In [2]:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Hello Spark") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/02/17 05:57:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
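
The SparkSession capabilities described earlier in this section (creating a DataFrame, registering it as a table, and running SQL over it) can be sketched as follows. This is a minimal illustration rather than part of the original notebook: the function name `session_tour`, the view name `sites`, and the two sample rows are made up for the example, and it assumes `spark` is the SparkSession created in the cell above.

```python
def session_tour(spark):
    """Create a small DataFrame, register it as a view, and query it with SQL."""
    # Hypothetical sample rows, used only for illustration
    rows = [("S231", 12221), ("S232", 12301)]
    sites = spark.createDataFrame(rows, schema=["site_id", "cell_id"])

    # Register the DataFrame as a temporary view so SQL can refer to it by name
    sites.createOrReplaceTempView("sites")

    # Run SQL over the view; collect() brings the result back as a list of Rows
    return spark.sql("SELECT site_id FROM sites WHERE cell_id > 12250").collect()


# Usage, assuming `spark` exists:
# session_tour(spark)
```

The function takes the session as an argument instead of building one, so it reuses whatever session the notebook already has.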


# Read Data into a Spark DataFrame
We will get into the details of the Spark DataFrame API later, but for now it's enough to understand that we can use the SparkSession to read data from different sources, such as CSV files, as shown below.

In [3]:
sdf = spark.read.csv("../DATA/raw/simulated_cdrs/", header=True)

                                                                                

In [5]:
sim_locs = pd.read_csv("/Users/dmatekenya/Downloads/simulated_locs.csv")
sim_locs.head()

Unnamed: 0,site_id,cell_id,lat,lon
0,S231,12221,-8.66928,26.9279
1,S231,12222,-8.66928,26.9279
2,S231,12223,-8.66928,26.9279
3,S231,12227,-8.66928,26.9279
4,S231,12228,-8.66928,26.9279


23/02/17 08:56:39 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 778800 ms exceeds timeout 120000 ms
23/02/17 08:56:39 WARN SparkContext: Killing executors is not supported by current scheduler.


In [4]:
sdf.columns

['cdr type', 'cdr datetime', 'call duration', 'last calling cellid', 'user_id']

In [9]:
sdf.head()  # with no argument, head() returns the first row as a Row object

Row(cdr type='MtSMSRecord', cdr datetime='20180710084407', call duration=None, last calling cellid=None, user_id='7566424924061690786')

In [3]:
# Path to a large CSV file (e.g., kaggle_expedia_train.csv); adjust it to match your data folder
large_csv = "../DATA/raw/activity_log_raw.csv"
df = spark.read.csv(large_csv, header=True)

                                                                                

# Explore Spark UI
Let's run something that takes a long time (e.g., ```summary```) and then check what's happening at ```localhost:4040/``` in your browser. 
If ```localhost:4040/``` doesn't work, try ```localhost:4041/```; Spark moves to the next free port when 4040 is already in use.
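
Rather than guessing the port, you can ask the running session for the address of its UI. A minimal sketch, assuming `spark` is the SparkSession created earlier; `spark_ui_url` is a helper name introduced here for illustration.

```python
def spark_ui_url(spark):
    """Return the address of the Spark UI for this session."""
    # uiWebUrl is a property of the SparkContext that backs the session
    return spark.sparkContext.uiWebUrl


# Usage, assuming `spark` exists:
# print(spark_ui_url(spark))
```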

# Partitions
You can check and change the number of partitions of a DataFrame.

In [4]:
num_partitions = df.rdd.getNumPartitions()

print(num_partitions)

df2 = df.repartition(10)

63


# Transformations vs. Actions

In [None]:
# sample() is a transformation, so nothing is computed when you execute this cell
df_sample = df.sample(fraction=0.1)

In [None]:
# count() is an action because Spark has to actually count the rows
# So, Spark first executes the sample and then does the count
df_sample.count()

In [None]:
# show() and head() are also actions
df.show()
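
One way to see the difference for yourself is to time the two steps separately. The sketch below is an illustration under stated assumptions: `lazy_demo` is a name introduced here, and `df` is any Spark DataFrame such as the one loaded earlier. On a large file you would expect the `sample()` step to return almost immediately, while the `count()` step takes as long as the actual job.

```python
import time


def lazy_demo(df, fraction=0.1):
    """Time a transformation and an action separately on a Spark DataFrame."""
    t0 = time.perf_counter()
    sampled = df.sample(fraction=fraction)  # transformation: only extends the plan
    t1 = time.perf_counter()
    rows = sampled.count()                  # action: triggers a Spark job
    t2 = time.perf_counter()
    return {
        "sample_seconds": t1 - t0,
        "count_seconds": t2 - t1,
        "rows": rows,
    }


# Usage, assuming `df` exists:
# print(lazy_demo(df))
```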

# Compare running time for Spark and Pandas

In [None]:
def timefn(fn):
    """
    Function for recording running time of a function
    """
    @wraps(fn)
    def measure_time(*args, **kwargs):
        t1 = time.time()
        result = fn(*args, **kwargs)
        t2 = time.time()
        print("@timefn:" + fn.__name__ + " took " + str(t2 - t1) + " seconds")
        return result
    return measure_time

In [None]:
@timefn
def load_big_csv_with_spark(big_csv=None, spark_session=None):
    """
    A simple function which loads a CSV file using Apache Spark and
    then counts how many rows are in the file
    """
    # The SparkSession is passed in through the spark_session argument
    # Read the CSV file into a Spark DataFrame
    df = spark_session.read.csv(big_csv, header=True)
    # Get the number of rows in the dataset using the count() function
    cnt = df.count()
    print('Number of rows in big CSV file: {:,}'.format(cnt))

In [None]:
@timefn
def load_big_csv_with_pandas(big_csv=None):
    """
    Use pandas library to load the large CSV
    """
    # Read CSV as a pandas DataFrame (df) here
    df = pd.read_csv(big_csv)
    
    # Get the total number of rows
    cnt = df.shape[0]
   
    print('Number of rows in big CSV file: {:,}'.format(cnt))

**EXERCISE-1**: Compare the running time of Spark and pandas. 

In [None]:
# Now call the two functions above here
#load_big_csv_with_pandas(big_csv=large_csv)
load_big_csv_with_spark(big_csv=large_csv, spark_session=spark)
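
The decorator above prints each running time. If you would rather collect the numbers and compare them side by side, a small helper like the one below is one option. It is a sketch, not part of the original notebook: `time_call` and `compare_runtimes` are names introduced here, and the commented usage assumes `large_csv`, `spark`, and the two loader functions defined earlier.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_runtimes(candidates):
    """Time each zero-argument callable and return {label: seconds}."""
    return {label: time_call(fn)[1] for label, fn in candidates.items()}


# Usage with the two loaders defined above:
# timings = compare_runtimes({
#     "pandas": lambda: load_big_csv_with_pandas(big_csv=large_csv),
#     "spark": lambda: load_big_csv_with_spark(big_csv=large_csv, spark_session=spark),
# })
# print(timings)
```

A single run can be noisy, so treat one measurement as a rough indication and repeat it a few times before drawing conclusions.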

# Exploring DataFrames in Spark
We can explore the data in much the same way as we did with pandas DataFrames.
- Use the `printSchema()` function to check which columns we have and their data types.
- Use the same `columns` attribute as in pandas to get a list of all columns.
- Use the `head()` function, just like in pandas, to get the top `n` observations from the data. Pass `n` within the brackets to specify the number of rows you want to see.
- Count the number of rows in the DataFrame using the `count()` function.
- Get the unique rows using the `distinct()` function. If you want the number of unique values in a single column, first select the column using the syntax `df.select("column").distinct().count()`.
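
The calls listed above can be combined into one small helper. This is a sketch under stated assumptions: `explore` is a name introduced here, and the commented usage relies on the `sdf` DataFrame and its `user_id` column from earlier in the notebook.

```python
def explore(df, column, n=5):
    """Print the schema and return a few basic facts about a Spark DataFrame."""
    df.printSchema()  # prints column names and data types
    return {
        "columns": df.columns,        # list of column names
        "first_rows": df.head(n),     # list with the first n Row objects
        "row_count": df.count(),      # total number of rows
        "unique_values": df.select(column).distinct().count(),  # distinct values in one column
    }


# Usage, assuming `sdf` exists:
# explore(sdf, "user_id")
```

Note that `count()` and `distinct().count()` each trigger a Spark job, so this helper can take a while on a large file.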

**EXERCISE-2**: Exploring a Spark DataFrame
1. Use ```head()``` function to view the first 5 observations
2. Check column data types
3. How many unique categories are there for the STATUS variable?
4. Using the documentation for [PySpark DataFrame](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame), find a function to get a sample of the data. Get a sample of 10% of the data and convert it to a pandas DataFrame using the `toPandas()` function.

```Although there is a pandas API for Spark, we will not get into it just yet; let's explore the core functionality of Spark first.```

# Explore Spark UI
In order to explore the Spark UI, we will run a function that takes a while to run. 
For this, let's run the ```summary()``` function. Once you call the function, open the Spark UI by going to this URL: *localhost:4040*. Once there, explore the different tabs:
- **Jobs:** On the Jobs tab, you can check the status of your current job. You can see the exact functions, such as count, that Spark is running.
- **Executors:** In this tab, you can check how many cores Spark is using and how tasks are being run on each core.
- **Stages:** You can see which stage the job is at and how many tasks have completed.
- **Environment:** It's important to know which environment variables Spark is using; you can check them in this tab.
- **Storage, SQL:** Explore these tabs to see what information they contain.

In [None]:
df = spark.read.csv(large_csv, header=True)
df.summary().show()  # computes column statistics; watch the job in the Spark UI

If the summary job is taking too long, you can kill it using the ```kill``` option in the Spark UI, because in some cases it's impossible to stop it from the notebook.

# Other ways to run spark
1. **spark-shell**: For quick, interactive work with Spark.
2. **spark-submit**: Run from the terminal, often to submit jobs to a cluster, but you can also use it in local mode.

**EXERCISE-3**: Run Spark in the shell
1. Locate the Spark folder using Step-1 below.
2. Navigate to the ```bin``` directory
3. Run the ```pyspark``` command in there

**EXERCISE-4**: Running Spark using ```spark-submit```

Run Spark using ```spark-submit``` and compare Spark's running time based on the number of executors assigned to Spark. Unfortunately, setting the number of executors in a Jupyter notebook isn't working well, so we will have to do it in the terminal to explore this.

## Step-1: Identify location of your Spark installation. 
- Run the code below to note down the base folder of your Spark installation
- Make sure you identify the root of the Spark folder
- In the terminal, navigate to that folder using ```cd``` and then navigate to the ```bin``` folder
- If you run the ```ls``` command while in the ```bin``` folder, you should see a ```spark-submit``` executable
- ```spark-submit``` is used to submit standalone Spark applications to a cluster, but we can use it in local mode too

In [None]:
# Run the code below to note down the base folder of your Spark installation
# Make sure you identify the root of the Spark folder
import pyspark

In [None]:
pyspark.__file__
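
If PySpark was installed with pip, the launcher scripts usually sit in a `bin` folder next to the file reported by `pyspark.__file__`. The sketch below derives that folder from the path. `spark_bin_dir` is a helper name introduced here, and the pip-style layout is an assumption; for a standalone Spark download, use the `bin` folder at the root of that installation instead.

```python
from pathlib import Path


def spark_bin_dir(pyspark_file):
    """Return the bin folder next to pyspark/__init__.py (pip-style layout assumed)."""
    package_dir = Path(pyspark_file).parent  # e.g. .../site-packages/pyspark
    return package_dir / "bin"


# Usage in the notebook:
# import pyspark
# print(spark_bin_dir(pyspark.__file__))
```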

## Step-2: Create a ```.py``` file in VS Code or any text editor
- Copy all the necessary imports and add them at the top of the Python file
- Next, add this line of code, the conventional entry-point guard for Python scripts: 
```if __name__ == "__main__":```
- Copy the code from just after the heading ```Compare running time for Spark and Pandas``` up to just before the heading ```Exploring DataFrames in Spark```, and paste it underneath the statement above.
- Make sure your Python file has no errors
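
For orientation, a stripped-down version of such a file might look like the sketch below. It is an illustration, not the exact script the steps above produce: the file name `hello_spark.py` in the usage message is hypothetical, and the script takes the CSV path as a command-line argument, which goes after the script path when you run it with ```spark-submit```. The session is built without `.master(...)` because a master set in code takes precedence over the `--master` flag given on the command line.

```python
import sys
import time


def main(csv_path):
    """Count the rows of a CSV file with Spark and print the running time."""
    # Imported inside main() only so this sketch can be loaded without Spark
    # installed; in your own script a top-level import is the usual style.
    from pyspark.sql import SparkSession

    # No .master() here: spark-submit supplies it through --master
    spark = SparkSession.builder.appName("Hello Spark").getOrCreate()

    start = time.time()
    row_count = spark.read.csv(csv_path, header=True).count()
    print("Rows: {:,} in {:.2f} seconds".format(row_count, time.time() - start))

    spark.stop()
    return row_count


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: spark-submit hello_spark.py <path_to_csv>")
```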

## Step-3: Run the Spark Application
1. Note the full path of your Python file
2. In the terminal, navigate to the Spark folder using ```cd```
3. Within the Spark folder, run this command:
```./bin/spark-submit --name "Hello Spark" --master local[num_executors] --conf spark.eventLog.enabled=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" --conf spark.hadoop.abc.def=xyz --conf spark.hive.abc=xyz path_to_your_python_file```
4. To avoid errors, copy the command above into a text editor so that everything is on one line
5. Replace ```num_executors``` with a number such as ```4``` for a start. In local mode, this number sets how many worker threads Spark runs in parallel. Press enter to run the program.
6. As the program runs, take note of how many executors are created, and note the running time for the Spark function only
7. Now, increase ```num_executors``` by 2 or 4 and run the program again. See if you notice a reduction in running time.
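
To avoid retyping the command for every setting, you can generate it. The sketch below builds a shortened form of the command from the steps above, leaving out the `--conf` options for brevity. `build_submit_command` is a helper name introduced here, and the values 2, 4, and 8 are only example settings.

```python
import shlex


def build_submit_command(script_path, num_threads, app_name="Hello Spark"):
    """Return the spark-submit arguments for a local run with num_threads threads."""
    return [
        "./bin/spark-submit",
        "--name", app_name,
        "--master", "local[{}]".format(num_threads),
        script_path,
    ]


# Print one command per setting. shlex.join quotes arguments that need it,
# such as the square brackets in local[4], so each line can be pasted into a shell.
for threads in (2, 4, 8):
    print(shlex.join(build_submit_command("path_to_your_python_file", threads)))
```

Quoting the master URL matters in some shells, which otherwise try to interpret the square brackets as a file pattern.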