
# 02 - Simple Example
This example uses a tiny list of four numbers to illustrate the concept of a Resilient Distributed Dataset (RDD). The list is distributed across the cluster (potentially many nodes or computers). In practice, such a list can hold millions of objects and must be distributed across many nodes.
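
To make the idea of distribution concrete without a cluster, the following plain-Python sketch splits a list into chunks that stand in for partitions on different nodes. The helper `split_into_partitions` and the choice of two partitions are illustrative assumptions, not part of Spark's API; Spark chooses the actual partitioning itself.

```python
# Plain-Python stand-in for a partitioned dataset (no Spark required).
# split_into_partitions is a hypothetical helper, not a Spark function.

def split_into_partitions(items, num_partitions):
    """Split items into num_partitions contiguous chunks of near-equal size."""
    items = list(items)
    size, extra = divmod(len(items), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions

partitions = split_into_partitions(range(4), 2)   # [[0, 1], [2, 3]]

# Each "node" works on its own chunk; the partial results are then combined.
partial_counts = [len(p) for p in partitions]     # [2, 2]
total_count = sum(partial_counts)                 # 4
print(partitions, total_count)
```

Counting per chunk and summing the partial counts gives the same answer as counting the whole list, which is the pattern a distributed count relies on.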

## Step 1 - Install Spark-related dependencies
- JDK
- Apache Spark
- PySpark

The JDK, Apache Spark, and PySpark will be installed on the Google Cloud instance. Because a cloud instance is not a permanent machine, we need to install them each time a new instance is used.


In [3]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

!wget -q http://apache.osuosl.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz

!tar xf spark-2.4.5-bin-hadoop2.7.tgz

!pip install -q findspark

!pip install pyspark

print("Installation complete!")

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/9a/5a/271c416c1c2185b6cb0151b29a91fff6fcaed80173c8584ff6d20e46b465/pyspark-2.4.5.tar.gz (217.8MB)
     |████████████████████████████████| 217.8MB 60kB/s
Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
     |████████████████████████████████| 204kB 49.0MB/s
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.5-py2.py3-none-any.whl size=218257927 sha256=8c022eb3e66e8d83f34c242467a5f5e521272f17bfa8d1305b5ce4c988514eff
  Stored in directory: /root/.cache/pip/wheels/bf/db/04/61d66a5939364e756eb1c1be4ec5bdce6e04047fc7929a3c3c
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 pyspark-2.4.5


## Step 2 - Import libraries and set up required environment variables

In [0]:
import os
from pyspark import SparkContext, SparkConf

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"
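
If Spark later fails to start, a wrong path in one of these variables is a common cause. The sketch below is one way to check them; `missing_dirs` is a hypothetical helper written for this illustration, and it only tests that each variable points to an existing directory, not that the directory holds a working installation.

```python
import os

def missing_dirs(var_names, environ=os.environ):
    """Return the variables that are unset or do not point to an existing directory."""
    return [name for name in var_names
            if not os.path.isdir(environ.get(name, ""))]

problems = missing_dirs(["JAVA_HOME", "SPARK_HOME"])
if problems:
    print("Check these variables before starting Spark:", problems)
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories.")
```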



## Step 3 - Establish Spark Context 

In [0]:
APP_NAME = "CollegeScorecard"
SPARK_URL = "local[*]"

conf = SparkConf().setAppName(APP_NAME).setMaster(SPARK_URL)

sc = SparkContext.getOrCreate(conf)                 

## Step 4 - Perform Simple Examples

In [11]:
# Create an RDD

data = range(4)                 # a tiny dataset for experimenting with big data concepts

a = sc.parallelize(data)        # Initialize the Resilient Distributed Dataset



[0, 1, 2, 3]

In [13]:
# Bring the data from the worker nodes back to the driver

a.collect()                     

[0, 1, 2, 3]

In [14]:
# Get the total number of elements in the list 

a.count()    

4

In [7]:
# Perform a transformation using the map function, similar to Python's map

b = a.map(lambda x: x**2)       # here we square the numbers

b.collect()

[0, 1, 4, 9]

In [12]:
# Perform a transformation using the map function, similar to Python's map

b = a.map(lambda x:(x, x**2))   # transformed into a list of tuples

b.collect()

[(0, 0), (1, 1), (2, 4), (3, 9)]
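
For comparison, here is a sketch of the same two transformations with Python's built-in `map`. Like an RDD transformation, the built-in `map` is lazy: nothing is computed until the result is consumed, and `list()` plays the role that `collect()` plays above. The variable names are chosen for this illustration only.

```python
numbers = range(4)

squares = map(lambda x: x**2, numbers)        # lazy: no work done yet
pairs = map(lambda x: (x, x**2), numbers)     # lazy as well

squares_list = list(squares)                  # forces evaluation
pairs_list = list(pairs)

print(squares_list)   # [0, 1, 4, 9]
print(pairs_list)     # [(0, 0), (1, 1), (2, 4), (3, 9)]
```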

In [8]:
# Perform an action to get a summary statistic using the reduce function

a.reduce(lambda x, y : x + y)    # Sum up the numbers in the list

6

In [9]:
# Perform an action to get a summary statistic using the reduce function

a.reduce(lambda x, y : x * y)    # Multiply the numbers in the list

0
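
Spark's `reduce` expects a function that is commutative and associative, because each partition is reduced on its own before the partial results are combined. The plain-Python sketch below imitates that two-stage process with `functools.reduce`. The split into two partitions and the helper `partitioned_reduce` are assumptions made for illustration.

```python
from functools import reduce
from operator import add, mul, sub

partitions = [[0, 1], [2, 3]]   # assumed split of the four numbers

def partitioned_reduce(func, parts):
    """Reduce each partition separately, then reduce the partial results."""
    partials = [reduce(func, p) for p in parts]
    return reduce(func, partials)

flat = [x for p in partitions for x in p]

print(reduce(add, flat), partitioned_reduce(add, partitions))   # 6 6
print(reduce(mul, flat), partitioned_reduce(mul, partitions))   # 0 0
print(reduce(sub, flat), partitioned_reduce(sub, partitions))   # -6 0
```

Addition and multiplication give the same answer either way, while subtraction does not, which is why a non-associative function is a poor fit for a distributed reduce.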

In [15]:
# Create another RDD - a list of tuples
a = sc.parallelize([(1, -5), (3,400), (1, 3), (2, 30)])

print(a.collect())

[(1, -5), (3, 400), (1, 3), (2, 30)]


In [16]:
print(a.countByKey())

print(a.lookup(1))

print(a.collectAsMap())

defaultdict(<class 'int'>, {1: 2, 3: 1, 2: 1})
[-5, 3]
{1: 3, 3: 400, 2: 30}
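
The three results above can be reproduced on an ordinary list of tuples, which helps explain why the map keeps only `3` for key `1`: building a dictionary from pairs lets a later pair overwrite an earlier one with the same key. This is a plain-Python approximation for illustration, not Spark's implementation.

```python
from collections import Counter

pairs = [(1, -5), (3, 400), (1, 3), (2, 30)]

counts_by_key = Counter(k for k, _ in pairs)      # comparable to countByKey()
values_for_1 = [v for k, v in pairs if k == 1]    # comparable to lookup(1)
as_map = dict(pairs)                              # comparable to collectAsMap()

print(dict(counts_by_key))   # {1: 2, 3: 1, 2: 1}
print(values_for_1)          # [-5, 3]
print(as_map)                # {1: 3, 3: 400, 2: 30}
```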


In [17]:
b = a.reduceByKey(lambda x, y: x*y)

print(b.collect())

[(2, 30), (1, -15), (3, 400)]
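
Conceptually, `reduceByKey` merges all values that share a key using the supplied function. A minimal plain-Python sketch of that idea follows; `reduce_by_key` is a hypothetical helper, and the result is sorted only to make the printed order predictable, since the order of the Spark output above is not something to rely on.

```python
from operator import mul

pairs = [(1, -5), (3, 400), (1, 3), (2, 30)]

def reduce_by_key(func, key_value_pairs):
    """Combine all values that share a key using func (hypothetical helper)."""
    merged = {}
    for key, value in key_value_pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

products = reduce_by_key(mul, pairs)
print(sorted(products.items()))   # [(1, -15), (2, 30), (3, 400)]
```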


In [0]:
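# A two-argument lambda does not work here: map() passes each (key, values)
# pair to the function as a single tuple. mapValues(list) turns each group's
# iterable of values into a list instead.

c = a.groupByKey().mapValues(list)

c.collect()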
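
The reason a two-argument lambda fails with `map` can be reproduced without Spark: a map-style function is called with one argument per element, and after grouping each element is a single `(key, values)` tuple. The sketch below groups the pairs with a `defaultdict` as a stand-in for `groupByKey` and then calls both styles of function. All names are illustrative.

```python
from collections import defaultdict

pairs = [(1, -5), (3, 400), (1, 3), (2, 30)]

# Group values by key, as groupByKey() does conceptually.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)
grouped = sorted(groups.items())     # [(1, [-5, 3]), (2, [30]), (3, [400])]

# A map-style function receives ONE argument per element: the whole tuple.
two_args = lambda k, values: (k, list(values))
one_arg = lambda kv: (kv[0], list(kv[1]))

try:
    [two_args(element) for element in grouped]
    two_arg_error = None
except TypeError as exc:
    two_arg_error = exc              # the second positional argument is missing

print(type(two_arg_error).__name__)
print([one_arg(element) for element in grouped])
```

Unpacking the tuple inside a one-argument function, or using `mapValues` in Spark, avoids the problem.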