# Running Spark with Notebooks

- toc: false
- badges: true
- hide_binder_badge: true
- comments: true
- sticky_rank: 1
- author: "<a href='https://twitter.com/rajkstats'>Raj Kumar</a>"
- description: "A quick tutorial on how to run spark in Jupyter and Colab Notebooks"
- image: /images/copied_from_nb/spark-on-notebook/blog-head.png
- categories: [Jupyter,Spark,Notebooks]

<img src="spark-on-notebook/blog-head.png" width="500" height="200"/>

## Motivation
I recently needed Spark to put together an analysis of a large dataset. At work, I would usually just run the code on [Databricks Notebooks](https://docs.databricks.com/notebooks/index.html) or [AWS EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html), which come with Spark installed and pre-configured, so you can start running this kind of ad hoc analysis in notebooks right away.

In this blog, we will briefly cover how to work interactively with Spark in notebooks using commonly used languages such as **Python**, **R** and **Scala**.

## Some Solutions

We will go through the following approaches in this blog:

- **Without Docker**
- **With Docker**
- **Running Spark with Google Colab Notebooks**


### **Without Docker**

Let's get started without using a Docker image first:

While browsing, I came across this [medium blog by Roshini](https://medium.com/@roshinijohri/spark-with-jupyter-notebook-on-macos-2-0-0-and-higher-c61b971b5007). There are multiple ways to do this, but this seems to be the easiest and quickest way to get started. I am pretty sure I will be revisiting this again and again, so it is useful to document the steps.

The following steps, **inspired by Roshini's blog**, show how to run Spark on Jupyter notebooks (for macOS users):

In [None]:
brew install apache-spark
brew info apache-spark


Since I've already installed Spark on my system, I will just print the output of `brew info apache-spark`, which should look like the following if Spark installed successfully:

![spark info](spark-on-notebook/spark-info.png)

>Important: The Spark version in the path depends on when you installed it, so change the version in the export command below to match the one on your system

In [None]:
# Only needed if SPARK_HOME is already set from an earlier Spark installation
unset SPARK_HOME
export SPARK_HOME="/usr/local/Cellar/apache-spark/3.1.2/libexec/"
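
Before moving on, it can help to confirm that `SPARK_HOME` really points at a Spark installation. The sketch below is only a rough sanity check: `check_spark_home` is a hypothetical helper name, and the two files it looks for (`bin/pyspark` and `python/pyspark/shell.py`) are my assumption about what a usable Spark directory should contain, based on the commands used in this post.

```python
import os


def check_spark_home(spark_home=None):
    """Return a list of problems; an empty list means the layout looks usable.

    Hypothetical helper: falls back to the SPARK_HOME environment variable
    when no path is passed in.
    """
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]
    if not os.path.isdir(spark_home):
        return [f"{spark_home} is not a directory"]

    problems = []
    # Assumed markers of a Spark install: the pyspark launcher and the
    # shell.py startup script that is executed later from the notebook.
    for rel in ("bin/pyspark", "python/pyspark/shell.py"):
        if not os.path.isfile(os.path.join(spark_home, *rel.split("/"))):
            problems.append(f"missing {rel}")
    return problems


if __name__ == "__main__":
    print(check_spark_home())
```

An empty list suggests the export worked; anything else usually means the version in the path does not match what Homebrew installed.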

Also, make sure you have the `pyspark` Python package installed:

In [None]:
pip3.9 install pyspark

![pyspark](spark-on-notebook/pyspark.png)

Run `pyspark` or `spark-shell` in your terminal to check that Spark is configured correctly:

![spark install](spark-on-notebook/spark-install.png)

This means that Spark is configured. Now let's move on to running Spark interactively in Jupyter notebooks:

In [None]:
jupyter notebook

Now copy the following code into the first cell of your Jupyter notebook:

In [None]:
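import os
# Run PySpark's own shell startup script, which defines `spark` (SparkSession) and `sc` (SparkContext)
exec(open(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py')).read())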

![spark install](spark-on-notebook/spark-nb.png)

To access the **Spark Application UI**, click the link in the output of the first cell of the Jupyter notebook:

![spark install](spark-on-notebook/spark-app.png)

### **With Docker**

Check out this cool project on GitHub called [Jupyter Docker Stacks](https://github.com/jupyter/docker-stacks). You can set up environments to work with **Python**, **R** and **Scala** with just the following two steps, which pull the all-spark-notebook image from Docker Hub and run it.

Look at the detailed documentation [here](https://jupyter-docker-stacks.readthedocs.io/en/latest/index.html)

Run the following Docker commands to pull the latest all-spark-notebook image and start a container:

In [None]:
docker pull jupyter/all-spark-notebook:latest
docker run -p 8888:8888 jupyter/all-spark-notebook

![spark install](spark-on-notebook/docker-spark.png)
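
One thing that can trip up the `docker run -p 8888:8888` step: if the local `jupyter notebook` from the previous section is still running, it is probably already using port 8888. Here is a small sketch for checking that before starting the container. `port_is_free` is a hypothetical helper, and it only tests whether something accepts connections on the port, which is an approximation of "free".

```python
import socket


def port_is_free(port, host="127.0.0.1", timeout=1.0):
    """Return True if nothing accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 when the connection succeeds, which means
        # some process is already listening on that port.
        return sock.connect_ex((host, port)) != 0


if __name__ == "__main__":
    print("port 8888 free:", port_is_free(8888))
```

If the port is taken, either stop the other notebook server or map a different host port, for example `-p 8889:8888`.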

Copy the localhost link along with the token at the bottom of the output and open it in your browser:

![spark update](spark-on-notebook/jupyter-start.png)

Now you should be able to launch Python, R and Scala (**spylon-kernel**) notebooks, start a Spark session within each of them, and work interactively.

![all spark](spark-on-notebook/all-spark.png)

#### **PySpark** 

In [None]:
from pyspark.sql import SparkSession
# Create a Spark session and keep a reference to it for later use
spark = SparkSession.builder.appName('docker-pyspark').getOrCreate()
spark

![python sc](spark-on-notebook/pyspark-session.png)

#### **SparkR**

In [None]:
library(SparkR)
sparkR.session()

![R sc](spark-on-notebook/sparkr-session.png)

#### **Spark Scala**

In [None]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
    .builder()
    .appName("Spark SQL basic example")
    .config("spark.some.config.option","some-value")
    .getOrCreate()

![Spark scala](spark-on-notebook/scala-session.png)

### **Running Spark with Google Colab Notebooks**

To run Spark in Colab, you just need to install Java and Spark and set the required environment variables.

>Tip: This blog is built with [fastpages](https://github.com/fastai/fastpages). Click the Google Colab badge at the top right corner to run this section in Colab without copying and pasting the code.

In [None]:
!pip install wget
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz
!tar xf spark-3.1.2-bin-hadoop2.7.tgz
!pip install -q findspark
import os,wget
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop2.7"
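
The Spark version shows up in three places in the cell above (the download URL, the archive name and `SPARK_HOME`), so it is easy to update one and forget the others. As a rough sketch, the strings can be derived from a single version number. `spark_release` is a hypothetical helper, and the naming pattern is simply copied from the 3.1.2 / hadoop2.7 values used here; other releases may name their archives differently, so check the Apache archive listing before relying on it.

```python
def spark_release(version="3.1.2", hadoop="2.7", root="/content"):
    """Build the download URL, archive name and SPARK_HOME for one release.

    Assumes the archive naming pattern used in this post:
    spark-<version>-bin-hadoop<hadoop>.tgz
    """
    name = f"spark-{version}-bin-hadoop{hadoop}"
    return {
        "url": f"https://archive.apache.org/dist/spark/spark-{version}/{name}.tgz",
        "archive": f"{name}.tgz",
        "spark_home": f"{root}/{name}",
    }


release = spark_release()
print(release["url"])
print(release["spark_home"])
```

With the defaults, the printed URL and path match the ones hard-coded in the cell above.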

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName('ColabPyspark').getOrCreate()
spark
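
Seeing the session object print is a good sign, but it does not prove that Spark can actually run a job. A minimal smoke test, assuming `spark` is the session created above, could look like the sketch below. `smoke_test` is a hypothetical helper name; it only relies on `spark.range`, `DataFrame.count` and `spark.version`.

```python
def smoke_test(spark, n=5):
    """Run a tiny Spark job and return (Spark version, row count).

    `spark` is expected to be an existing SparkSession.
    """
    # spark.range(n) builds a single-column DataFrame with ids 0..n-1;
    # count() forces Spark to actually execute a job.
    row_count = spark.range(n).count()
    return spark.version, row_count
```

In the notebook, calling `smoke_test(spark)` should return the Spark version string together with `5`. If the call hangs or raises, the Java or `SPARK_HOME` setup is the first place I would look.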

>Tip: Try this section in Google Colab and share your experience in the comments. Thank you for reading!