<a href="https://colab.research.google.com/drive/1q6hcMHXWR1Cy-tJV2NuKtn603PA-u_PK" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up PySpark in a Colab Notebook

You can run Spark both locally and on a cluster. Here, I'll demonstrate how you can set up Spark to run in a Colab notebook for debugging purposes.

You can also set up Spark locally in a similar way if you want to take advantage of multiple CPU cores (and/or a GPU) on your laptop. The setup varies slightly by operating system, and you'll need to work out those specifics on your own; for what it's worth, this setup works for me in WSL if I run the following bash script in my terminal window using `sudo`. That said, the local option should be used for testing on sample datasets only. If you want to run big PySpark jobs, run them in an EMR notebook (with an EMR cluster as your backend) or on the Midway Cluster.

First, we need to install Java, Spark, and PySpark (plus the `findspark` helper) by running the following commands:

In [1]:
%%bash
apt-get update -qq
apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
wget -q "https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz" > /dev/null
tar -xvf spark-3.1.1-bin-hadoop2.7.tgz > /dev/null

pip install pyspark findspark --quiet



OK, now that we have Spark, we need to set the `SPARK_HOME` environment variable so PySpark knows where to find it. We do this with Python's `os` module below.

On my machine (WSL, Ubuntu 20.04), where I unpacked Spark in my home directory, this can be achieved with:
```
os.environ["SPARK_HOME"] = "/home/USERNAME/spark-3.1.1-bin-hadoop2.7"
```
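
If the path is mistyped or the download step failed, the error tends to show up later and less clearly. As a minimal sketch (the helper name `set_spark_home` is hypothetical, not part of `findspark` or PySpark), you could check that the directory exists before setting the variable:

```python
import os

def set_spark_home(path):
    """Set SPARK_HOME to `path`, failing early if the directory is missing.

    Hypothetical helper for illustration; not part of findspark or PySpark.
    """
    if not os.path.isdir(path):
        raise FileNotFoundError(
            f"No Spark directory at {path!r}; did the download and tar steps succeed?"
        )
    os.environ["SPARK_HOME"] = path
    return path

# Example call (path assumed from the snippet above; adjust USERNAME):
# set_spark_home("/home/USERNAME/spark-3.1.1-bin-hadoop2.7")
```

Calling this before `findspark.init()` should surface a wrong path right away rather than at session start-up.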

In Colab, the bash cell above downloads and unpacks Spark in the `/content` directory, so we indicate that as its location here. Then, we run `findspark` to locate Spark on the machine, and finally start up a SparkSession running on all available cores (`local[4]` means your code will run on 4 threads locally, while `local[*]` means your code will run with as many threads as there are logical cores on your machine).

In [2]:
# Set path to Spark
import os
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"

# Find Spark so that we can access session within our notebook
import findspark
findspark.init()

# Start SparkSession on all available cores
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
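
To make the `local[N]` versus `local[*]` distinction concrete, here is a small sketch that builds the master string from a thread count. The helper name `local_master` is hypothetical; only the `local[...]` strings themselves are Spark syntax.

```python
import os

def local_master(n_threads=None):
    """Build a Spark master URL for local mode (hypothetical helper).

    n_threads=None -> "local[*]" (one worker thread per logical core)
    n_threads=k    -> "local[k]" (exactly k worker threads)
    """
    if n_threads is None:
        return "local[*]"
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return f"local[{n_threads}]"

# Logical cores visible to Python on this machine
print("logical cores:", os.cpu_count())
print(local_master())   # local[*]
print(local_master(4))  # local[4]
```

You could then pass the result to `SparkSession.builder.master(local_master(2))`. Note that `getOrCreate()` returns an already-running session if one exists, so a new master setting only takes effect when no session has been started yet. In local mode, `spark.sparkContext.defaultParallelism` should report the thread count that was actually used.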

Now that we've installed everything and set up our paths correctly, we can run (small) Spark jobs both in Colab notebooks and locally. For bigger jobs, though, you will want to use an EMR cluster. Remember, for instance, that Google only allocates us one CPU core and up to one GPU for free!

Let's make sure our setup is working by doing a couple of simple things with the `pyspark.sql` package on the Amazon Customer Review Sample Dataset we looked at last week with `mrjob`.

In [8]:
! pip install wget
import wget

wget.download('https://css-uchicago.s3.us-east-1.amazonaws.com/sample_us.tsv')

In [9]:
# Read TSV file
data = spark.read.csv('sample_us.tsv',
                      sep="\t",
                      header=True,
                      inferSchema=True)

In [10]:
data.printSchema()

root
 |-- marketplace: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: integer (nullable = true)
 |-- product_title: string (nullable = true)
 |-- product_category: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verified_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: string (nullable = true)
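
Schema inference costs Spark an extra pass over the file. As a sketch of an alternative, the inferred schema above can be written out explicitly as a DDL string and passed to the reader. The names `REVIEW_COLUMNS`, `REVIEW_SCHEMA_DDL`, and `read_reviews` are hypothetical; the column names and types are copied from the `printSchema()` output.

```python
# (column name, Spark SQL type) pairs, copied from the inferred schema above
REVIEW_COLUMNS = [
    ("marketplace", "STRING"),
    ("customer_id", "INT"),
    ("review_id", "STRING"),
    ("product_id", "STRING"),
    ("product_parent", "INT"),
    ("product_title", "STRING"),
    ("product_category", "STRING"),
    ("star_rating", "INT"),
    ("helpful_votes", "INT"),
    ("total_votes", "INT"),
    ("vine", "STRING"),
    ("verified_purchase", "STRING"),
    ("review_headline", "STRING"),
    ("review_body", "STRING"),
    ("review_date", "STRING"),
]

# DDL-formatted schema string, e.g. "marketplace STRING, customer_id INT, ..."
REVIEW_SCHEMA_DDL = ", ".join(f"{name} {dtype}" for name, dtype in REVIEW_COLUMNS)

def read_reviews(spark, path):
    """Read the reviews TSV with a fixed schema instead of inferSchema=True.

    `spark` is an existing SparkSession (such as the one created earlier).
    """
    return spark.read.csv(path, sep="\t", header=True, schema=REVIEW_SCHEMA_DDL)

print(REVIEW_SCHEMA_DDL)
```

In the notebook this would be used as `data = read_reviews(spark, 'sample_us.tsv')`. Keeping `review_date` as a string mirrors what inference produced here; parsing it into a date type is a separate choice.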



In [12]:
data.groupBy('star_rating') \
     .sum('total_votes') \
     .sort('star_rating', ascending=False) \
     .show()

+-----------+----------------+
|star_rating|sum(total_votes)|
+-----------+----------------+
|          5|              13|
|          4|               3|
|          3|               8|
|          2|               2|
|          1|               8|
+-----------+----------------+
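
Because the sample file is small, one way to sanity-check the aggregation is to recompute it without Spark. The sketch below uses only the standard library (it is a cross-check, not how Spark does it), and the function name `votes_by_rating` is hypothetical. It assumes every row has integer `star_rating` and `total_votes` values.

```python
import csv
from collections import defaultdict

def votes_by_rating(path):
    """Sum total_votes per star_rating from a TSV file, highest rating first.

    Plain-Python cross-check of groupBy('star_rating').sum('total_votes').
    Assumes both columns parse as integers on every row.
    """
    totals = defaultdict(int)
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: treat quote characters in review text as ordinary characters
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            totals[int(row["star_rating"])] += int(row["total_votes"])
    return sorted(totals.items(), reverse=True)

# Example call in the notebook, after the download cell has run:
# votes_by_rating("sample_us.tsv")
```

If the list of `(star_rating, total)` pairs matches the table above, the Spark job and the file agree. Differences could come from how each reader treats quotes or malformed rows.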

