<h1>Installing Anaconda</h1>

<h2>On Linux</h2>

We have already discussed this in Section 00.

<h2>On Windows</h2>

The first step is to download Anaconda from https://www.anaconda.com/. Individuals can download the Individual Edition, which is free of charge. I recommend using Python 3.8, since all our examples use this version. If you have a different version installed, make sure to adjust your code accordingly. Once Anaconda is installed, run the following command in the Terminal/Anaconda Prompt to confirm that the installation is complete.

In [None]:
conda info

<h2>Create Conda Environment</h2>

Similar to what we did in Section 00, you can use the following commands.

In [None]:
# Activate the base environment
conda activate

In [None]:
conda create --name spark38 python=3.8

In [None]:
# List available anaconda environments
conda env list

<h2>Download and Unpack Apache Spark</h2>

You can download the latest version of Apache Spark at https://spark.apache.org/downloads.html. We are going to use Spark version 3.2 with Hadoop 2.7.

<h3>On Linux</h3>

<li>Unzip the Spark File</li>

In [None]:
# tar -xzf spark-3.0.1-bin-hadoop2.7.tgz
# mv spark-3.0.1-bin-hadoop2.7 /opt/spark-3.0.1
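If you would rather script the unpacking step, for example on a machine without `tar`, the following is a minimal Python sketch that uses the standard-library `tarfile` module. The function name `unpack_spark` and the paths in the usage comment are assumptions made for illustration. Writing to `/opt` usually requires elevated permissions, and you should only extract archives from a source you trust.

```python
import tarfile
from pathlib import Path


def unpack_spark(archive: Path, dest_dir: Path) -> Path:
    """Extract a Spark .tgz archive into dest_dir and return the unpacked folder."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # The name of the top-level folder is read from the first archive member.
        top_level = tar.getnames()[0].split("/")[0]
        tar.extractall(dest_dir)
    return dest_dir / top_level


# Hypothetical usage (adjust the file name to the version you downloaded):
# spark_dir = unpack_spark(Path("spark-3.0.1-bin-hadoop2.7.tgz"), Path("/opt"))
```

The returned path is the directory that `SPARK_HOME` should later point to.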

<li>Set Environment Variables</li>

Now we must update the `~/.bash_profile` or `~/.bashrc` file so that the shell can find the Spark installation. We use `~/.bash_profile` in our example here. In the Terminal, type the following command to open the file:

In [None]:
vi ~/.bash_profile

Add the following lines:

In [None]:
export SPARK_HOME=/opt/spark-<version>
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3

In [None]:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

With the two `PYSPARK_DRIVER_PYTHON` variables set, invoking `pyspark` should automatically start a Jupyter Notebook session. Reload the profile so that the changes take effect in the current shell:

In [None]:
source ~/.bash_profile
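To confirm that the variables are visible to Python, a small check such as the sketch below can help. The helper names `missing_spark_vars` and `spark_bin_on_path` are hypothetical, and the list of required variables is an assumption based on the exports shown above.

```python
import os
from typing import List, Mapping

# Variables exported in ~/.bash_profile above (assumed to be the required set).
REQUIRED_VARS = ("SPARK_HOME", "PYSPARK_PYTHON")


def missing_spark_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def spark_bin_on_path(env: Mapping[str, str] = os.environ) -> bool:
    """Check whether $SPARK_HOME/bin appears as an entry in PATH."""
    spark_home = env.get("SPARK_HOME", "")
    if not spark_home:
        return False
    expected = os.path.join(spark_home, "bin")
    return expected in env.get("PATH", "").split(os.pathsep)


# Hypothetical usage in a fresh shell or notebook:
# print(missing_spark_vars(), spark_bin_on_path())
```

An empty list and `True` suggest that the profile was loaded as intended.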

<h3>On Windows</h3>

<li>Unzip the Spark File</li>

Extract the downloaded `*.tgz` file to the path `C:\Users\<username>\spark-<version>`. Create a `hadoop\bin` directory under the SPARK_HOME path. Once the directory is created, copy the `winutils.exe` file from our repository under `training-big-data/winutils/hadoop-2.7/` and place it there to add it to the Spark installation.
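The directory creation and copy can also be scripted. The sketch below assumes that `winutils.exe` belongs inside the new `hadoop\bin` directory; the function name `install_winutils` and the paths in the usage comment are hypothetical.

```python
import shutil
from pathlib import Path


def install_winutils(winutils_exe: Path, spark_home: Path) -> Path:
    """Create <spark_home>/hadoop/bin and copy winutils.exe into it."""
    target_dir = spark_home / "hadoop" / "bin"
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves file metadata such as timestamps.
    return Path(shutil.copy2(winutils_exe, target_dir / "winutils.exe"))


# Hypothetical usage on Windows:
# install_winutils(
#     Path(r"training-big-data\winutils\hadoop-2.7\winutils.exe"),
#     Path(r"C:\Users\<username>\spark-<version>"),
# )
```

Hadoop on Windows typically looks for `winutils.exe` under `%HADOOP_HOME%\bin`, so a `HADOOP_HOME` variable pointing at the new `hadoop` directory is likely needed as well; treat this as an assumption to verify against your setup.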

<li>Set Environment Variables</li>

Search for `Environment Variables` in Windows search, or open `System Properties ➤ Advanced ➤ Environment Variables`. Here, we will add a few variables in the User variables section.

Then add the value `C:\Users\<username>\spark-<version>\bin` to the end of the `Path` variable.

In your Anaconda environment, install the `findspark` package, which initializes the values needed to run `PySpark` in a Jupyter notebook.

In [None]:
conda install findspark

Then use it in your code:

In [2]:
import findspark

# If you know the Spark path, you can pass it as an argument to init()
findspark.init()

Alternatively, replace the `findspark` code with the code below:

In [4]:
# import os
# import sys

# os.environ["PYSPARK_PYTHON"] = r"c:\Users\dm\anaconda3\envs\spark38\python.exe"
# os.environ["SPARK_HOME"] = r"C:\Users\dm\spark-3.3.1"
# os.environ["PYLIB"] = os.environ["SPARK_HOME"] + r"\python\lib"
# sys.path.insert(0, os.environ["PYLIB"] + r"\py4j-0.10.9.5-src.zip")
# sys.path.insert(0, os.environ["PYLIB"] + r"\pyspark.zip")
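The manual alternative above hard-codes the py4j version in the zip file name, which changes between Spark releases. A possible variation, sketched below, looks the file up with a glob pattern instead. The helper names are hypothetical, and the `python/lib` layout is assumed to match the paths used in the commented code.

```python
import glob
import os
import sys
from typing import List


def spark_python_paths(spark_home: str) -> List[str]:
    """Return pyspark.zip and the bundled py4j-*-src.zip under python/lib."""
    lib_dir = os.path.join(spark_home, "python", "lib")
    py4j_zips = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not py4j_zips:
        raise FileNotFoundError(f"no py4j-*-src.zip found in {lib_dir}")
    return [os.path.join(lib_dir, "pyspark.zip"), py4j_zips[-1]]


def add_spark_to_sys_path(spark_home: str) -> None:
    """Prepend the Spark Python archives to sys.path if they are not there yet."""
    for path in spark_python_paths(spark_home):
        if path not in sys.path:
            sys.path.insert(0, path)


# Hypothetical usage, once SPARK_HOME is set:
# add_spark_to_sys_path(os.environ["SPARK_HOME"])
```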

In [3]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.getOrCreate()

df = spark.sql("select 'spark' as hello ")

df.show()

+-----+
|hello|
+-----+
|spark|
+-----+



In [6]:
spark.stop()

<h2>Spark’s Directories and Files</h2>

Let’s briefly summarize the intent and purpose of some of these files and directories.

- <b>README.md</b><br />
    This file contains new detailed instructions on how to use Spark shells, build Spark from source, run standalone Spark examples, peruse links to Spark documentation and configuration guides, and contribute to Spark.
- <b>bin</b><br />
    This directory, as the name suggests, contains most of the scripts you’ll employ to interact with Spark, including the Spark shells (spark-sql, pyspark, spark-shell, and sparkR). You will use these shells and executables in this directory later to submit a standalone Spark application using spark-submit, and write a script that builds and pushes Docker images when running Spark with Kubernetes support.
- <b>sbin</b><br />
    Most of the scripts in this directory are administrative in purpose, for starting and stopping Spark components in the cluster.
- <b>kubernetes</b><br />
    Since the release of Spark 2.4, this directory contains Dockerfiles for creating Docker images for your Spark distribution on a Kubernetes cluster. It also contains a file providing instructions on how to build the Spark distribution before building your Docker images.
- <b>data</b><br />
    This directory is populated with *.txt files that serve as input for Spark’s components: MLlib, Structured Streaming, and GraphX.
- <b>examples</b><br />
    For any developer, two imperatives that ease the journey to learning any new platform are loads of “how-to” code examples and comprehensive documentation. Spark provides examples for Java, Python, R, and Scala, and you’ll want to employ them when learning the framework.

<h2>Understanding Spark Application Concepts</h2>

To understand what’s happening under the hood with our sample code, you’ll need to be familiar with some of the key concepts of a Spark application and how the code is transformed and executed as tasks across the Spark executors. We’ll begin by defining some important terms:

- <b>Application</b><br />
    A user program built on Spark using its APIs. It consists of a driver program and executors on the cluster.
- <b>SparkSession</b><br />
    An object that provides a point of entry to interact with underlying Spark functionality and allows programming Spark with its APIs. In an interactive Spark shell, the Spark driver instantiates a SparkSession for you, while in a Spark application, you create a SparkSession object yourself.
- <b>Job</b><br />
    A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g., save(), collect()).
- <b>Stage</b><br />
    Each job gets divided into smaller sets of tasks called stages that depend on each other.
- <b>Task</b><br />
    A single unit of work or execution that will be sent to a Spark executor.
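The relationship between transformations, actions, and jobs can be illustrated with a toy model in plain Python. This is not Spark code and does not reflect how Spark is implemented; `ToyDataset` is a hypothetical class that only mimics the idea that transformations are recorded and that an action triggers the computation as a job.

```python
from typing import Callable, Iterable, List


class ToyDataset:
    """Toy stand-in for a lazily evaluated dataset (not a Spark class)."""

    def __init__(self, rows: Iterable[int]):
        self._rows = list(rows)
        self._pending: List[Callable[[List[int]], List[int]]] = []
        self.jobs_run = 0

    def filter(self, predicate: Callable[[int], bool]) -> "ToyDataset":
        # Transformation: only recorded, nothing is computed yet.
        self._pending.append(lambda rows: [r for r in rows if predicate(r)])
        return self

    def map(self, func: Callable[[int], int]) -> "ToyDataset":
        # Transformation: also just recorded.
        self._pending.append(lambda rows: [func(r) for r in rows])
        return self

    def collect(self) -> List[int]:
        # Action: runs every recorded transformation as one "job".
        self.jobs_run += 1
        rows = self._rows
        for step in self._pending:
            rows = step(rows)
        return rows


ds = ToyDataset(range(6)).filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(ds.jobs_run)   # 0: no action has been called yet
print(ds.collect())  # [0, 20, 40]
print(ds.jobs_run)   # 1: the action triggered one job
```

In real Spark, each job is further split into stages and tasks that run on the executors, which this toy model leaves out.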

<h2>Extensive Example</h2>
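Before reading the PySpark program in this section, it may help to see in plain Python what its aggregation computes. The sketch below uses a handful of made-up rows, not values from `mnm_dataset.csv`, and hypothetical function names. Note that `count("Count")` in the program counts the rows in each group rather than summing the `Count` column, and the sketch does the same.

```python
from collections import Counter
from typing import List, Tuple

Row = Tuple[str, str, int]  # (State, Color, Count)

# Made-up sample rows for illustration; not taken from mnm_dataset.csv.
SAMPLE_ROWS: List[Row] = [
    ("CA", "Yellow", 10),
    ("CA", "Yellow", 25),
    ("CA", "Red", 7),
    ("TX", "Green", 40),
    ("CA", "Yellow", 3),
    ("TX", "Green", 12),
]


def count_rows_per_group(rows: List[Row]) -> List[Tuple[Tuple[str, str], int]]:
    """Count rows per (State, Color) pair, largest group first."""
    counts = Counter((state, color) for state, color, _ in rows)
    # most_common() sorts by count in descending order.
    return counts.most_common()


def count_rows_for_state(rows: List[Row], state: str) -> List[Tuple[Tuple[str, str], int]]:
    """Keep only one state before grouping, like the where() filter."""
    return count_rows_per_group([row for row in rows if row[0] == state])


print(count_rows_per_group(SAMPLE_ROWS))
# [(('CA', 'Yellow'), 3), (('TX', 'Green'), 2), (('CA', 'Red'), 1)]
print(count_rows_for_state(SAMPLE_ROWS, "CA"))
# [(('CA', 'Yellow'), 3), (('CA', 'Red'), 1)]
```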

In [8]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

if __name__ == "__main__":
    # Build a SparkSession using the SparkSession APIs.
    # If one does not exist, then create an instance. There
    # can only be one SparkSession per JVM.
    spark = (SparkSession
        .builder
        .appName("PythonMnMCount")
        .getOrCreate())

    mnm_file = "../data/mnm_dataset.csv"

    # Read the file into a Spark DataFrame using the CSV
    # format by inferring the schema and specifying that the
    # file contains a header, which provides column names for comma-
    # separated fields.
    mnm_df = (spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(mnm_file))

    # We use the DataFrame high-level APIs. Note
    # that we don't use RDDs at all. Because some of Spark's
    # functions return the same object, we can chain function calls.
    # 1. Select from the DataFrame the fields "State", "Color", and "Count"
    # 2. Since we want to group each state and its M&M color count,
    # we use groupBy()
    # 3. Aggregate counts of all colors and groupBy() State and Color
    # 4. orderBy() in descending order
    count_mnm_df = (mnm_df
        .select("State", "Color", "Count")
        .groupBy("State", "Color")
        .agg(count("Count").alias("Total"))
        .orderBy("Total", ascending=False))
    
    # Show the resulting aggregations for all the states and colors;
    # a total count of each color per state.
    # Note show() is an action, which will trigger the above
    # query to be executed.
    count_mnm_df.show(n=60, truncate=False)
    
    print("Total Rows = %d" % (count_mnm_df.count()))
    
    # While the above code aggregated and counted for all
    # the states, what if we just want to see the data for
    # a single state, e.g., CA?
    # 1. Select from all rows in the DataFrame
    # 2. Filter only CA state
    # 3. groupBy() State and Color as we did above
    # 4. Aggregate the counts for each color
    # 5. orderBy() in descending order
    # Find the aggregate count for California by filtering
    ca_count_mnm_df = (mnm_df
        .select("State", "Color", "Count")
        .where(mnm_df.State == "CA")
        .groupBy("State", "Color")
        .agg(count("Count").alias("Total"))
        .orderBy("Total", ascending=False))

    # Show the resulting aggregation for California.
    # As above, show() is an action that will trigger the execution of the
    # entire computation.
    ca_count_mnm_df.show(n=10, truncate=False)

    # Stop the SparkSession
    spark.stop()

+-----+------+-----+
|State|Color |Total|
+-----+------+-----+
|CA   |Yellow|1807 |
|WA   |Green |1779 |
|OR   |Orange|1743 |
|TX   |Green |1737 |
|TX   |Red   |1725 |
|CA   |Green |1723 |
|CO   |Yellow|1721 |
|CA   |Brown |1718 |
|CO   |Green |1713 |
|NV   |Orange|1712 |
|TX   |Yellow|1703 |
|NV   |Green |1698 |
|AZ   |Brown |1698 |
|WY   |Green |1695 |
|CO   |Blue  |1695 |
|NM   |Red   |1690 |
|AZ   |Orange|1689 |
|NM   |Yellow|1688 |
|NM   |Brown |1687 |
|UT   |Orange|1684 |
|NM   |Green |1682 |
|UT   |Red   |1680 |
|AZ   |Green |1676 |
|NV   |Yellow|1675 |
|NV   |Blue  |1673 |
|WA   |Red   |1671 |
|WY   |Red   |1670 |
|WA   |Brown |1669 |
|NM   |Orange|1665 |
|WY   |Blue  |1664 |
|WA   |Yellow|1663 |
|WA   |Orange|1658 |
|CA   |Orange|1657 |
|NV   |Brown |1657 |
|CA   |Red   |1656 |
|CO   |Brown |1656 |
|UT   |Blue  |1655 |
|AZ   |Yellow|1654 |
|TX   |Orange|1652 |
|AZ   |Red   |1648 |
|OR   |Blue  |1646 |
|UT   |Yellow|1645 |
|OR   |Red   |1645 |
|CO   |Orange|1642 |
|TX   |Brown 