# Downloading Spark

- Spark is easy to download from the [Spark homepage](https://spark.apache.org)
- click the ["Download Spark"](https://spark.apache.org/downloads.html) button
- ![Download options](http://pashabd.com/wp-content/uploads/2016/02/apache-spark-download-for-windows.png "Download Options")


## Click the link to initiate the download:

![Mirror site](https://cdn-images-1.medium.com/max/1600/1*SMoJi0KJZJ5i50hLRoAHnw.jpeg "Link to mirror site")

- Once downloaded, double-click the archive (a `.tgz` file) to expand it
- Move the folder to the desired location on your local machine
- Alternatively, for PySpark, simply type `pip install pyspark` in your command line or terminal
  - this assumes you have pip installed (if not, see https://pip.pypa.io/en/stable/installing/)
  - pip installs the package into your current Python environment, so there is no need to pick a download folder first
  - a link is also available on the [main download website](https://spark.apache.org/downloads.html)
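
After a pip install, one quick way to confirm that Python can see the package is to query its installed version. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


pyspark_version = installed_version("pyspark")
if pyspark_version is None:
    print("pyspark is not installed in this environment; try `pip install pyspark`")
else:
    print(f"pyspark {pyspark_version} is installed")
```

Run it with the same Python interpreter that you plan to use for the notebook, since each environment has its own set of installed packages.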

# Step through the ["Quick Start Guide"](https://spark.apache.org/docs/latest/quick-start.html) provided by Spark
- Can be run in Scala or Python

Step 1: Navigate to the Spark directory from the terminal or command line and type `./bin/spark-shell` for Scala or `./bin/pyspark` for Python (or simply `pyspark` if PySpark was installed with pip).

This will open Spark's interactive shell, which will present a welcome message similar to the following:


![Spark Shell](https://databricks.com/wp-content/uploads/2014/01/simrshell.png "Spark Shell")


# Let's check it out with a coding Demo :)

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.getOrCreate()

In [4]:
textFile = spark.read.text("HMP_Dataset/README.txt")

In [5]:
textFile.count()  # Number of rows in this DataFrame

73

In [6]:
textFile.first()  # First row in this DataFrame

Row(value='Public Dataset of Accelerometer Data for Human Motion Primitives Detection')

In [7]:
linesWithMotion = textFile.filter(textFile.value.contains("Motion"))

In [8]:
linesWithMotion.count() # How many lines contain "Motion"

4

In [9]:
linesWithMotion.collect()

[Row(value='Public Dataset of Accelerometer Data for Human Motion Primitives Detection'),
 Row(value='The Public Dataset of Accelerometer Data for Human Motion Primitives Detection is a public collection of labelled accelerometer data recordings to be used for the creation and validation of acceleration models of human motion primitives.'),
 Row(value='A description of the Human Motion Primitives detection system that we have designed to work with the provided dataset can be found at:'),
 Row(value='The authors allow the users of the Public Dataset of Accelerometer Data for Human Motion Primitives Detection to use and modify it for their own research. Any commercial application, redistribution, etc... has to be arranged between users and authors individually.')]

In [10]:
linesWithMotion.take(1)

[Row(value='Public Dataset of Accelerometer Data for Human Motion Primitives Detection')]

## Quick Look at Dataset Operations

In [11]:
from pyspark.sql.functions import *

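Let's make use of the helpful `functions` module within the `sql` package. It provides many convenient functions for DataFrame exploration and manipulation. Note that the wildcard import above also replaces Python built-ins such as `max` and `sum` with their Spark column versions for the rest of the session.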

For example, let's split each row of `textFile` into words and map the row to an integer value (the number of words in the row); `agg` is then called to find the largest word count.

In [12]:
textFile.select(size(split(textFile.value, r"\s+")).name("numWords")).agg(max(col("numWords"))).collect()  # raw string avoids Python's invalid-escape warning for \s

[Row(max(numWords)=44)]
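
To make the chained expression above easier to follow, here is the same calculation in plain Python on three made-up lines, with no Spark involved. It is only a sketch of the logic: `re.split` stands in for Spark's `split`, `len` for `size`, and Python's built-in `max` for `agg(max(...))`. The two regex engines can differ on edge cases.

```python
import re

# Made-up sample lines standing in for the rows of `textFile`.
sample_lines = [
    "Public Dataset of Accelerometer Data",
    "for Human Motion",
    "Primitives Detection",
]

# Step 1 (split + size): map each line to its number of whitespace-separated words.
words_per_line = [len(re.split(r"\s+", line)) for line in sample_lines]

# Step 2 (agg + max): reduce the per-line counts to the largest one.
max_words = max(words_per_line)

print(words_per_line)  # [5, 3, 2]
print(max_words)       # 5
```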

One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:

In [13]:
wordCounts = textFile.select(explode(split(textFile.value, r"\s+")).alias("word")).groupBy("word").count()  # raw string avoids Python's invalid-escape warning for \s

Above, we use the `explode` function in `select` to transform a Dataset of lines into a Dataset of words, and then combine `groupBy` and `count` to compute the per-word counts in the file as a DataFrame of 2 columns: “word” and “count”. To collect the word counts in our shell, we can call `collect`:

In [14]:
wordCounts.collect()

[Row(word='installation.', count=1),
 Row(word='hope', count=1),
 Row(word='If', count=1),
 Row(word='used', count=1),
 Row(word='documentation', count=2),
 Row(word='dept.', count=3),
 Row(word='motion', count=3),
 Row(word='Mastrogiovanni,', count=2),
 Row(word='Data', count=3),
 Row(word='implied', count=1),
 Row(word='within', count=1),
 Row(word='Science', count=1),
 Row(word='----------', count=1),
 Row(word='acceleration', count=2),
 Row(word='that,', count=1),
 Row(word='Bruno', count=1),
 Row(word='not', count=1),
 Row(word='will', count=2),
 Row(word='pp.', count=2),
 Row(word='recognition:', count=1),
 Row(word='code', count=1),
 Row(word='Antonio', count=1),
 Row(word='Engineering', count=1),
 Row(word='Fulvio', count=1),
 Row(word='based', count=1),
 Row(word='you', count=1),
 Row(word='new', count=1),
 Row(word='PARTICULAR', count=1),
 Row(word='Public', count=3),
 Row(word='own', count=1),
 Row(word='more', count=1),
 Row(word='collection', count=1),
 Row(word='(2012)', 
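
The `explode`, `groupBy`, and `count` pipeline above has the classic MapReduce shape. As a rough illustration of that shape (plain Python on made-up lines, not a description of how Spark executes the query), the map step emits one `(word, 1)` pair per word and the reduce step sums the pairs for each word:

```python
import re
from collections import defaultdict

# Made-up sample lines standing in for the rows of `textFile`.
sample_lines = [
    "human motion data",
    "motion primitives",
    "accelerometer data for human motion",
]

# Map: turn every line into (word, 1) pairs, one per word (the `explode` step).
pairs = [(word, 1) for line in sample_lines for word in re.split(r"\s+", line)]

# Shuffle + reduce: group the pairs by word and sum the ones (`groupBy` + `count`).
word_counts = defaultdict(int)
for word, one in pairs:
    word_counts[word] += one

for word, count in sorted(word_counts.items()):
    print(word, count)
```

In Spark the same grouping is distributed across partitions, which is why the rows returned by `collect` come back in no particular order.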

## Caching

`.cache()` can be applied to a Dataset to pull it into a cluster-wide in-memory cache, which is useful when the same data is accessed repeatedly, for example when several queries are run against it.
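
As a small sketch of when caching pays off, the hypothetical helper below (the name `count_matches` is made up here) caches a DataFrame once and then runs several filters against it, so the later counts can reuse the in-memory copy. It assumes a DataFrame with a single string column named `value`, like the one returned by `spark.read.text`.

```python
def count_matches(df, needles):
    """Count the rows of `df` whose `value` column contains each needle.

    `df` is assumed to be a DataFrame with a string column named `value`.
    """
    cached = df.cache()  # lazy: only marks the DataFrame for caching
    cached.count()       # the first action reads the data and fills the cache

    # Each count below can reuse the cached rows instead of re-reading the source.
    counts = {n: cached.filter(cached.value.contains(n)).count() for n in needles}

    cached.unpersist()   # release the cached data once it is no longer needed
    return counts
```

In the session above, `count_matches(textFile, ["Motion", "Data"])` would return a dictionary with one count per search string. Caching matters most for large inputs or expensive upstream transformations; on a 73-line file the difference is negligible.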

## Self-Contained Applications

Let's write a simple application to count the number of lines containing 'a' and the number of lines containing 'b' in a text file. 

- Use a `SparkSession` to create datasets (as we did in this Jupyter notebook)
- Use this command to run (for Python): 

`YOUR_SPARK_HOME/bin/spark-submit \
  --master local[4] \
  SimpleApp.py`
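
A minimal sketch of what `SimpleApp.py` could look like, modeled on the example in Spark's Quick Start Guide. The file-path argument, the helper name `count_lines_containing`, and the `try`/`finally` cleanup are choices made for this sketch rather than anything Spark requires.

```python
"""SimpleApp.py: count the lines containing 'a' and the lines containing 'b'."""
import sys


def count_lines_containing(df, text):
    """Return how many rows of `df` have `text` somewhere in their `value` column."""
    return df.filter(df.value.contains(text)).count()


def main(path):
    # Imported inside main() so the helper above can be reused and tested without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
    try:
        lines = spark.read.text(path).cache()  # cached because it is scanned twice
        num_a = count_lines_containing(lines, "a")
        num_b = count_lines_containing(lines, "b")
        print(f"Lines with a: {num_a}, lines with b: {num_b}")
    finally:
        spark.stop()  # shut the session down even if a step above fails


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: spark-submit SimpleApp.py <path-to-text-file>")
    else:
        main(sys.argv[1])
```

Because this version takes the input file as an argument, append a path after the script name when submitting, for example `SimpleApp.py HMP_Dataset/README.txt` on the last line of the command above.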