[Apache Spark](https://spark.apache.org) is an open source distributed computing framework that runs on the JVM and has seen [a rapid rise in popularity in recent years](http://fortune.com/2015/09/25/apache-spark-survey/). I will be using Spark through [pyspark](https://spark.apache.org/docs/latest/api/python/index.html) at my next job, so I want to get some practice with it first. However, installing Spark from scratch is daunting because I'm not familiar with the JVM.

So in this article, I'll use Docker to set up a Spark and pyspark environment.

## Set Up an Environment

There is a Docker image named [jupyter/pyspark-notebook](https://hub.docker.com/r/jupyter/pyspark-notebook/) published by Project Jupyter. Let's pull it, pinned to a specific tag (the latest at the time of writing):

```bash
docker pull jupyter/pyspark-notebook:87210526f381
```

Run it:

```bash
docker run --rm -w /app -p 8888:8888 \
    --mount type=bind,src=$(pwd),dst=/app \
    jupyter/pyspark-notebook:87210526f381
```

The container prints several log messages, one of which contains a URL. Open that URL in a browser and you will get a Jupyter Notebook in which pyspark is available:

```python
import pyspark
pyspark.version.__version__
```

```
'2.4.0'
```

Note that this article itself was written in a Jupyter Notebook launched in exactly this way.

## Launching a Spark Cluster

Spark normally runs on a cluster in a distributed environment, but setting up a distributed cluster just for development is overkill, so Spark provides [a local mode](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-local.html) that runs everything on a single machine.

To start Spark in local mode via pyspark, call [`pyspark.SparkContext`](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext):

```python
sc = pyspark.SparkContext('local[*]')
```

The string specifies the number of threads to use:

- `local` - 1 thread
- `local[n]` - `n` threads (`n` is a number)
- `local[*]` - as many threads as there are processors available to the JVM ([`Runtime.getRuntime.availableProcessors()`](https://docs.oracle.com/javase/8/docs/api/java/lang/Runtime.html#availableProcessors--) is used internally)

It seems that `local[*]` is commonly used.
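
To make the three master strings above concrete, here is a minimal pure-Python sketch of how they map to a thread count. This is not Spark's own parsing code: the function name `local_thread_count` is hypothetical, `os.cpu_count()` is only a stand-in for the JVM's `availableProcessors()`, and other forms that Spark may accept are deliberately left out.

```python
import os
import re


def local_thread_count(master: str) -> int:
    """Return the thread count implied by a local-mode master string.

    Illustrative only: covers just `local`, `local[n]`, and `local[*]`.
    """
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a local-mode master string: {master!r}")
    spec = match.group(1)
    if spec == "*":
        # Assumption: os.cpu_count() approximates what the JVM would report.
        return os.cpu_count() or 1
    return int(spec)


for master in ("local", "local[4]", "local[*]"):
    print(master, "->", local_thread_count(master))
```

The value printed for `local[*]` depends on the machine (or on the CPU limits of the Docker container) the code runs on.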

Let's calculate the sum of the numbers from 0 to 9.

```python
rdd = sc.parallelize(range(10))
rdd.sum()
```

```
45
```
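
Conceptually, `parallelize` splits the data into partitions, and `sum` adds up each partition separately before combining the partial results. The following is a rough pure-Python sketch of that idea, not Spark's actual implementation: the helper names `split_into_partitions` and `parallel_sum` are hypothetical, and a thread pool stands in for the local-mode worker threads.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a sequence into contiguous slices of near-equal size."""
    items = list(data)
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


def parallel_sum(data, num_partitions=4):
    """Sum each partition on its own thread, then combine the partial sums."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(sum, partitions))
    return sum(partial_sums)


print(parallel_sum(range(10)))  # 45
```

With four partitions, `range(10)` is split into `[0, 1]`, `[2, 3, 4]`, `[5, 6]`, and `[7, 8, 9]`, whose partial sums 1, 9, 11, and 24 add up to 45.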

Stop the cluster when you are done using it.

```python
sc.stop()
```

## Conclusion

In this article, I created a pyspark environment using Docker and launched a Spark cluster in local mode. I don't know much about Spark yet, but I'll keep trying out more of it little by little.

## References

- [jupyter/pyspark-notebook - Docker Hub](https://hub.docker.com/r/jupyter/pyspark-notebook/)
- [Image Specifics — docker-stacks latest documentation](https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html#apache-spark)
- [pyspark package — PySpark 2.4.0 documentation](https://spark.apache.org/docs/latest/api/python/pyspark.html)
- [Get Started with PySpark and Jupyter Notebook in 3 Minutes](https://blog.sicara.com/get-started-pyspark-jupyter-guide-tutorial-ae2fe84f594f)
- [Spark local (pseudo-cluster) · Mastering Apache Spark](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-local.html)