# Installation
This guide walks through setting up Spark on your own computer and integrating PySpark with Jupyter Notebook. We can use Spark in two modes:

* **Local mode**: the entire Spark application runs on a single machine. You'll use local mode to prototype Spark code on your own computer. (This is the easier setup.)
* **Cluster mode**: the Spark application runs across multiple machines. You'll use cluster mode to run your Spark application in a cloud environment like Amazon Web Services, Microsoft Azure, or DigitalOcean.

We'll cover the instructions for installing Spark in local mode on Windows, Mac, and Linux. We'll cover how to install Spark in cluster mode in the data engineering track.

<Img src="https://github.com/rhnyewale/Apache-Spark/blob/main/Images/spark_components.jpg?raw=true">
    


Spark runs on the Java Virtual Machine (JVM), which comes in the Java SE Development Kit (JDK). We recommend installing Java SE Development Kit version 7 or higher, which you can download from Oracle’s website:
    
* http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
   
Any version after JDK 7 works, so you can download any of the versions on this page. Select the appropriate installation file for your operating system.
    
If you're on Windows or Linux, choose the correct instruction set architecture (x86 or x64) for your computer. Each computer chip has a specific instruction set architecture that determines the maximum amount of memory it can work with. The two main types are x86 (32-bit) and x64 (64-bit). If you're not sure which one your computer has, you can find out in this guide if you're on Windows or this one if you're on Linux.
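
If Python is already installed, you can also ask it for a hint. The sketch below uses only the standard library; `describe_architecture` is a hypothetical helper name. The pointer-size check reports the bitness of the Python interpreter itself, which can be 32-bit even on a 64-bit operating system, so treat the result as a hint rather than a definitive answer.

```python
import platform
import struct


def describe_architecture():
    """Return the machine type reported by the OS and the interpreter's bitness."""
    machine = platform.machine()             # e.g. 'x86_64', 'AMD64', 'arm64', 'i386'
    pointer_bits = struct.calcsize("P") * 8  # size of a C pointer in bits: 32 or 64
    return machine, pointer_bits


machine, bits = describe_architecture()
print(f"Machine type: {machine or 'unknown'}")
print(f"Python interpreter: {bits}-bit")
```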

To verify that the installation worked, launch your command line application (Command Prompt for Windows and Terminal for Mac and Linux) and run the following:

In [2]:
#java -version

The output should be similar to this:

In [10]:
#java version "1.7.0_79"
#Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
#Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)

While the exact numbers probably won't match, the important thing to verify is that the version is 1.7 or higher. The number 1.7 actually represents Version 7. If you're interested, you can read why at Oracle's website.
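
The version check can also be done programmatically. The following is a minimal sketch, assuming the first line of the output has the `version "..."` shape shown above; `java_major_version` is a hypothetical helper, not part of any Java or Spark tooling. It maps the legacy `1.x` scheme to its major version and treats any other leading number as the major version itself.

```python
import re


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output.

    Handles the legacy "1.x" scheme (1.7 means Java 7) and the newer
    scheme where the first number is the major version ("11.0.2").
    Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    if first == 1 and second is not None:
        return int(second)  # legacy scheme: "1.7.0_79" -> 7
    return first            # newer scheme: "11.0.2" -> 11


sample = 'java version "1.7.0_79"'
major = java_major_version(sample)
print(major)       # 7
print(major >= 7)  # True
```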

If running java -version returned an error or a different version than the one you just installed, your Java JDK installation most likely wasn't added to your PATH properly. Read this post to learn more about how to properly add the Java executable to your PATH.
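
To see which `java` executable your PATH currently resolves to, if any, a short standard-library check can help. This is only a diagnostic sketch; `locate_executable` is a hypothetical wrapper around `shutil.which`.

```python
import os
import shutil


def locate_executable(name):
    """Return the full path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


java_path = locate_executable("java")
if java_path is None:
    print("java was not found on PATH. Directories currently searched:")
    for entry in os.environ.get("PATH", "").split(os.pathsep):
        print("  ", entry)
else:
    print("java resolves to:", java_path)
```

If the path printed belongs to an older installation, that installation's folder appears earlier in PATH than the one you just installed.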

Now that we have the JVM set up, let's move on to Spark.

Because you've installed the JDK, you could technically download the original source code and build Spark on your computer. Building from source is the process of generating an executable program for your machine. It involves [many steps](https://stackoverflow.com/questions/1622506/programming-definitions-what-exactly-is-building/1622520#1622520). While there are some performance benefits to building Spark from source, it takes a while to do, and it's hard to debug if the build fails.

We'll download and work with a pre-built version of Spark instead. Navigate to the [Spark downloads page](http://spark.apache.org/downloads.html) and select the following options:

1. 1.6.2<br/>
Note: Any Spark version prior to 2.0.0 is incompatible with Python 3.6. If you have Python 3.6, we recommend downloading one of the newer versions of Spark.
2. Pre-built for Hadoop 2.6
3. Direct Download

Next, click the link that appears in Step 4 to download Spark as a .tgz file to your computer. Open your command line application and navigate to the folder where you downloaded it. Unzip the file and move the resulting folder into your home directory. Older versions of Windows don't have a built-in utility that can unzip .tgz files; if yours doesn't, we recommend downloading and using [7-Zip](https://www.7-zip.org/).
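
If you'd rather not depend on a platform-specific unzip tool, Python's standard `tarfile` module can unpack the archive on any operating system. The sketch below is one way to do it; `extract_tgz` is a hypothetical helper, and the file name in the usage comment is an assumption that you should adjust to match your download. Only extract archives from sources you trust.

```python
import tarfile
from pathlib import Path


def extract_tgz(archive_path, destination):
    """Unpack a .tgz archive into `destination`; return the top-level paths created."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        top_level = sorted({m.name.split("/")[0] for m in archive.getmembers()})
        # Newer Python versions offer extraction filters; use the safer
        # "data" filter when this interpreter supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(destination, **kwargs)
    return [destination / name for name in top_level]


# Hypothetical usage; adjust the file name and location to match your download:
# extract_tgz(Path.home() / "Downloads" / "spark-1.6.2-bin-hadoop2.6.tgz", Path.home())
```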

<Img src="https://github.com/rhnyewale/Apache-Spark/blob/main/Images/cmd_bin_pyspark.JPG?raw=true">
 
Running bin/pyspark from inside the Spark folder launches the PySpark shell. While this results in a lot of output, you can see that the shell automatically initialized the SparkContext object and assigned it to the variable sc.

You don't have to run bin/pyspark from the folder that contains it. Because it's in your home directory, you can use "~/spark-1.6.2-bin-hadoop2.6/bin/pyspark" to launch the PySpark shell from other directories on your machine (adjust the folder name if you downloaded a different version). This way, you can switch to the directory that contains the data you want to use, launch the PySpark shell, and read the data in without using its full path. The folder you're in when you launch the PySpark shell will be the local context for working with files in Spark.
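
The idea of a "local context" is ordinary relative-path resolution: a bare file name is looked up in the current working directory. The sketch below illustrates this with plain Python rather than Spark; `resolve_from_cwd` is a hypothetical helper, and a temporary folder stands in for your data directory.

```python
import os
import tempfile
from pathlib import Path


def resolve_from_cwd(relative_name):
    """Return the absolute path a relative file name refers to right now."""
    return Path.cwd() / relative_name


original_dir = Path.cwd()
with tempfile.TemporaryDirectory() as data_dir:
    os.chdir(data_dir)      # like running `cd` before launching the shell
    print(resolve_from_cwd("recent-grads.csv"))
    os.chdir(original_dir)  # restore the previous working directory
```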



## Connecting with Jupyter Notebook

You can make your Jupyter Notebook application aware of Spark in a few different ways:
* One is to create a configuration file and launch Jupyter Notebook with that configuration
* Another is to import PySpark at runtime

We'll focus on the latter approach so you won't have to restart Jupyter Notebook each time you want to use Spark.

First, you'll need to copy the full path to the pre-built Spark folder and set it as a shell environment variable. This way, you can specify Spark's location a single time, and every Python program you write will have access to it. If you move the Spark folder, you can change the path specification once and your code will work fine.

### Mac / Linux

* Use nano or another text editor to open your shell environment's configuration file. If you're using the default Terminal application, the file should be ~/.bash_profile. If you're using zsh instead, your configuration file will be ~/.zshrc.

* Add the following line to the end of the file, replacing {full path to Spark} with the actual path to Spark:

In [4]:
# export SPARK_HOME="{full path to Spark, eg /users/home/jeff/spark-2.0.1-bin-hadoop2.7/}"

* Exit the text editor and run either source ~/.bash_profile or source ~/.zshrc so the shell reads in and applies the update you made.

### Windows

* If you've never added environment variables, read this tutorial before you proceed.
* Set the SPARK_HOME environment variable to the full path of the Spark folder (e.g., c:/Users/Rohan/spark-2.0.1-bin-hadoop2.7/).
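
On any operating system, you can confirm from Python that the variable is visible and points at a real Spark folder. The sketch below is a sanity check under the assumption that the pre-built download contains a `bin/pyspark` launcher script; `check_spark_home` is a hypothetical helper. Open a new terminal (or restart Jupyter) first so the updated environment is picked up.

```python
import os
from pathlib import Path


def check_spark_home(environ=os.environ):
    """Return (ok, message) describing whether SPARK_HOME looks usable."""
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        return False, "SPARK_HOME is not set"
    root = Path(spark_home)
    if not root.is_dir():
        return False, f"SPARK_HOME points to a missing folder: {root}"
    # Assumption: the pre-built download ships a launcher at bin/pyspark.
    if not (root / "bin" / "pyspark").exists():
        return False, f"No bin/pyspark found under {root}"
    return True, f"SPARK_HOME looks good: {root}"


ok, message = check_spark_home()
print(message)
```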

Next, let's install the findspark Python library, which looks up the location of PySpark using the environment variable we just set. Use pip to install the findspark library:

In [5]:
pip install findspark

Collecting findspark
  Downloading findspark-1.4.2-py2.py3-none-any.whl (4.2 kB)
Installing collected packages: findspark
Successfully installed findspark-1.4.2
Note: you may need to restart the kernel to use updated packages.

To test the setup, download a sample data set, FiveThirtyEight's recent-grads.csv, into the current folder:


In [6]:
from urllib.request import urlretrieve

In [7]:
urlretrieve("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv", "recent-grads.csv")

('recent-grads.csv', <http.client.HTTPMessage at 0x2f28132f408>)

In [1]:
# Find path to Spark
import findspark
findspark.init()
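
Conceptually, this step makes `import pyspark` work by adding folders from the Spark installation to Python's module search path. The sketch below is a simplified illustration of that idea, not findspark's actual implementation; `add_pyspark_to_path` is a hypothetical helper, and the folder layout (`python/` and `python/lib/py4j-*.zip` under SPARK_HOME) is an assumption based on the pre-built Spark download.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Prepend Spark's bundled Python packages to sys.path; return the added entries."""
    spark_python = os.path.join(spark_home, "python")
    # Assumption: Spark bundles the py4j bridge library as a zip under python/lib.
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))
    added = [spark_python] + py4j_zips
    sys.path[:0] = added
    return added


spark_home = os.environ.get("SPARK_HOME")
if spark_home:
    print(add_pyspark_to_path(spark_home))
else:
    print("SPARK_HOME is not set; nothing to add")
```

In practice, calling `findspark.init()` is the simpler choice; the sketch only shows why the SPARK_HOME variable matters.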

In [2]:
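# Import PySpark and initialize the SparkContext object
import pyspark
sc = pyspark.SparkContext()

# Read 'recent-grads.csv' into an RDD (textFile yields one element per line)
f = sc.textFile('recent-grads.csv')
# Each element is already a single line, so split('\n') wraps it in a one-item list
data = f.map(lambda line: line.split('\n'))
data.take(10)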

[['Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen,Sample_size,Employed,Full_time,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs'],
 ['1,2419,PETROLEUM ENGINEERING,2339,2057,282,Engineering,0.120564344,36,1976,1849,270,1207,37,0.018380527,110000,95000,125000,1534,364,193'],
 ['2,2416,MINING AND MINERAL ENGINEERING,756,679,77,Engineering,0.101851852,7,640,556,170,388,85,0.117241379,75000,55000,90000,350,257,50'],
 ['3,2415,METALLURGICAL ENGINEERING,856,725,131,Engineering,0.153037383,3,648,558,133,340,16,0.024096386,73000,50000,105000,456,176,0'],
 ['4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,1258,1123,135,Engineering,0.107313196,16,758,1069,150,692,40,0.050125313,70000,43000,80000,529,102,0'],
 ['5,2405,CHEMICAL ENGINEERING,32260,21239,11021,Engineering,0.341630502,289,25694,23170,5180,16697,1672,0.061097712,65000,50000,75000,18314,4440,972'],
 ['6,2418,NUCLEAR ENGINEERING,2573,2200,373,En