# Workshop

## Introduction

The rapid growth of next-generation sequencing technologies such as single-cell RNA sequencing (scRNA-seq) demands efficient parallel processing and analysis of big data. Hadoop and Spark are the go-to open-source frameworks for storing and processing massive datasets. The most significant advantage of Spark is its iterative analytics capability combined with its in-memory computing architecture. Calling `.cache()` on a resilient distributed dataset (RDD) tells Spark to keep it in memory once it has been computed, so subsequent filter, map, and reduce tasks can reuse it instead of recomputing it from the source. Spark has its own SQL module, known as Spark SQL, and its MLlib library covers common machine learning tasks.
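
As a rough illustration of the caching behaviour described above, the sketch below caches one RDD and runs two actions on it. The helper name `summarize_lengths`, the `min_len` threshold, and the sample sequences in the usage note are made up for this example; `sc` is assumed to be an existing `SparkContext`, such as the one the `pyspark` shell provides.

```python
from operator import add


def summarize_lengths(sc, sequences, min_len=4):
    """Count sequences of at least min_len characters and sum their lengths.

    `sc` is assumed to be an existing SparkContext (hypothetical helper).
    """
    # cache() only marks the RDD for in-memory storage; nothing runs yet.
    rdd = sc.parallelize(sequences).cache()
    long_reads = rdd.filter(lambda s: len(s) >= min_len)
    # The first action computes the RDD and stores its partitions in memory.
    count = long_reads.count()
    # The second action reads the cached partitions instead of rebuilding them.
    # Note: reduce() raises an error on an empty RDD.
    total = long_reads.map(len).reduce(add)
    return count, total
```

In the `pyspark` shell, `summarize_lengths(sc, ["ACGT", "AC", "ACGTAC"])` should return `(2, 10)`. With a small in-memory list the benefit of caching is negligible; it matters when the RDD is expensive to build, for example when it is read from disk.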

## Create virtual environment

There are plenty of ways to create a Python virtual environment, but the easiest is to run the built-in `venv` module.

```sh
python3 -m venv .venv
```

Once the environment is created you can activate it by running the following command:

```sh
source .venv/bin/activate
```

## Installing pyspark

To install pyspark, run:

```sh
pip install pyspark
```

Then, to start the REPL, just type `pyspark`.
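
If you want to confirm which version was installed without starting the REPL, something like the following sketch should work. It assumes Python 3.8 or newer (for `importlib.metadata`); the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package="pyspark"):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("pyspark version:", installed_version())
```

A result of `None` means the package is not visible from the current interpreter, which usually indicates that the virtual environment is not active.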

## Configuring pyspark to use ipython as the command shell in the REPL

The default REPL is fairly basic, and IPython is much better (TBD: expand on this).

```sh
pip install ipython
```

Then, to enable the ipython interpreter in the pyspark REPL, you need to set the environment variable `PYSPARK_DRIVER_PYTHON`.

To do so run the following command in the active environment:

```sh
export PYSPARK_DRIVER_PYTHON=ipython
```

In some environments this may not work, and the `pyspark` command may fail with an error like:

```sh
.../bin/load-spark-env.sh: No such file or directory
```

To fix the issue, set the `SPARK_HOME` variable:

```sh
export SPARK_HOME=.../venv/lib/python3.7/site-packages/pyspark
```

Now if you run the `pyspark` command, it will use the `ipython` command shell, and you will have code completion and syntax highlighting in the REPL.

## Using Jupyter notebook


First of all you need to install Jupyter.

```sh
pip install jupyter
```

Then, as with `ipython`, you need to set the `PYSPARK_DRIVER_PYTHON` and `PYSPARK_DRIVER_PYTHON_OPTS` variables.

```sh
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
```

To check that the integration with Jupyter notebook works as expected, run the `pyspark` command and evaluate the `sc` variable in a notebook code cell. This is the default Spark context.

```python
sc
```

### Use jupyterlab

There is also the JupyterLab project, which is intended as the successor to the classic notebook interface. To use it, just install it:

```sh
pip install jupyterlab
```

Then set the environment variables:

```sh
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=lab
```

### Spark context


If you run a standalone script, you are supposed to create the Spark context yourself. Typically you do it like this:

```python
from pyspark import SparkConf, SparkContext


conf = SparkConf() \
  .setAppName("appName") \
  .setMaster("local") \
  .set("spark.sample.config", "value")
sc = SparkContext(conf=conf)
```

or

```python
from pyspark import SparkContext

sc = SparkContext(master="local", appName="test_name")
```



Starting from Spark 2.0, the separate Spark context, Hive context, and SQL context are all encapsulated in a Spark session. You can create a Spark session yourself:

```python
from pyspark.sql import SparkSession

sc2 = SparkSession.builder \
  .master("local[*]") \
  .appName("name of the script") \
  .config("spark.some.config.option", "some-value") \
  .getOrCreate()
sc2
```

You can also create a new SparkSession from an existing SparkContext by passing it into the initializer.

```python
sc3 = SparkSession(sc)
sc3
```

Note that we don't have to create a Spark session object when using the Spark shell. It is already created for us and available as the variable `spark`.

```python
spark
```
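
To see how a session relates to its underlying context, you can inspect a few of its attributes. The helper below is a sketch; the name `describe_session` is invented here, and `spark` is assumed to be an existing `SparkSession` such as the one the shell provides.

```python
def describe_session(spark):
    """Summarize a SparkSession and the SparkContext it wraps."""
    sc = spark.sparkContext  # the underlying SparkContext
    return {
        "spark_version": spark.version,
        "app_name": sc.appName,
        "master": sc.master,
    }
```

Calling `describe_session(spark)` in the shell, or `describe_session(sc2)` after the builder example, shows which application name and master the session actually ended up with.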

Please see https://spark.apache.org/docs/2.4.4/api/python/pyspark.sql.html?highlight=sparksession for the interface of the builder.

As you can see in the example, we have set the `spark.some.config.option` setting to `some-value`. There are lots of such properties that control important aspects of Spark.
Consider, for example, the `spark.submit.pyFiles` or `spark.jars` properties.

You can find the complete list of all available settings here: https://spark.apache.org/docs/2.4.4/configuration.html
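
To check which properties are set on a running context, you can read them back from its configuration. This is a minimal sketch: the helper name `effective_conf` and its `prefix` parameter are made up, and `sc` is assumed to be an existing `SparkContext`.

```python
def effective_conf(sc, prefix="spark."):
    """Return the properties set on a SparkContext as a key-sorted dict."""
    # getConf().getAll() yields (key, value) pairs.
    pairs = sc.getConf().getAll()
    return {k: v for k, v in sorted(pairs) if k.startswith(prefix)}
```

For instance, `effective_conf(sc, prefix="spark.driver.")` narrows the output to driver-related properties. Only properties that have been set show up here.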

## Using Google Colab

Though it is not that difficult to set up a development environment on your local computer, it is even easier to use Google Colab notebooks. You do not need to have a powerful computer.

Installing pyspark on Colab used to require additional components (a Java runtime) to be installed first; those steps are kept below as comments.

In the first Colab cell, type:

```sh
# !apt-get --quiet install -y openjdk-8-jdk-headless
# !update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
!pip install pyspark
```

Nowadays it is enough to just install `pyspark`, because Java is already installed on Colab.
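
If you want to verify that Java is really available before starting Spark, a check along these lines can be run in a cell. It is only a sketch; the function name `java_version_banner` is invented for this example.

```python
import shutil
import subprocess


def java_version_banner():
    """Return the output of `java -version`, or None when java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


print(java_version_banner() or "java not found on PATH")
```

If the check reports that Java is missing, the commented-out `apt-get` lines from the cell above are the place to start.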

## Some Spark settings

```python
from pyspark import SparkConf

conf = SparkConf() \
  .set('spark.executor.memory', '4G') \
  .set('spark.driver.memory', '16G') \
  .set('spark.driver.maxResultSize', '8G')
```
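
The same three settings can also be applied through a `SparkSession` builder. The sketch below keeps them in a dictionary and applies them one by one; `MEMORY_SETTINGS` and `with_settings` are names made up for this example, and `builder` is assumed to be something like `SparkSession.builder`.

```python
MEMORY_SETTINGS = {
    "spark.executor.memory": "4G",
    "spark.driver.memory": "16G",
    "spark.driver.maxResultSize": "8G",
}


def with_settings(builder, settings):
    """Apply each key/value pair to a SparkSession builder and return it."""
    for key, value in settings.items():
        # builder.config(key, value) returns the builder, so calls can chain.
        builder = builder.config(key, value)
    return builder
```

Usage would look like `spark = with_settings(SparkSession.builder.appName("demo"), MEMORY_SETTINGS).getOrCreate()`. Be aware that memory settings may not take effect if a session is already running, as in the `pyspark` shell or a notebook started through it, because the driver process has already been launched.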

## Download FASTQ file

```sh
# /content already exists on Colab, so use -p to avoid an error
!mkdir -p /content/data
# Each `!` command runs in its own subshell, so `!cd` would not persist; use %cd
%cd /content

# Download and unpack the SRA Toolkit
!wget 'https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.6-1/sratoolkit.2.9.6-1-ubuntu64.tar.gz'
!tar -xzf sratoolkit.2.9.6-1-ubuntu64.tar.gz

# Download the run and convert it to FASTQ
!wget https://sra-download.ncbi.nlm.nih.gov/traces/era6/ERR/ERR3014/ERR3014700
!/content/sratoolkit.2.9.6-1-ubuntu64/bin/fastq-dump /content/ERR3014700 -O /content/data
```
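
Once the conversion finishes, it can be useful to peek at the result. The sketch below reads FASTQ records in plain Python, assuming the common four-line layout (header, sequence, separator, quality) with unwrapped sequences. The function name `read_fastq` and the two sample records are made up for illustration and are not taken from ERR3014700.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input (or a truncated trailing record)
        header, sequence, separator, quality = record
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


# Made-up two-record sample, not real data.
sample = ["@read1\n", "ACGT\n", "+\n", "IIII\n", "@read2\n", "GGCA\n", "+\n", "FFFF\n"]
for read_id, sequence, quality in read_fastq(sample):
    print(read_id, sequence, len(quality))
```

To try it on the downloaded data, pass an open file handle instead of `sample`. The output file is assumed to be `/content/data/ERR3014700.fastq`; check the contents of `/content/data` if the name differs.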