
# Introduction

This tutorial introduces PySpark, the Python API for Apache Spark, which can be used for a wide range of analysis tasks. It is a flexible tool for exploratory big data analysis because it integrates with the rest of the Python data analysis ecosystem, including pandas (DataFrames), NumPy (arrays), and Matplotlib (visualization).

Spark is a fast, general-purpose distributed computing framework (like MapReduce) that provides efficient in-memory computation for large-scale data processing. Spark extends the MapReduce model to support more types of computations using a functional programming paradigm. Spark itself is written in Scala and offers APIs in several languages, including Scala, Java, and Python.

PySpark is built on top of Spark's Java API. The figure below gives an overview of the data flow in a Spark application:


[<img src="http://i.imgur.com/YlI8AqEl.png" style="width: 400px;">](http://i.imgur.com/YlI8AqEl.png)

The driver program creates a SparkContext object when the Spark shell starts; this object manages the cluster connections and coordinates the running processes. In the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. Py4J is only used on the driver for local communication between the Python and Java SparkContext objects. Data is processed in Python and cached/shuffled in the JVM.

# Content

In this tutorial, we will start by setting up Spark on a local system and then explore the basics of Spark programming, i.e. RDDs, transformations and actions, and lazy evaluation. We will also look at the Spark SQL library, Spark DataFrames, and the machine learning library (MLlib) with the help of examples.

The following topics will be covered in the tutorial:
    
- [Installation and Configuration](#Installation-and-Configuration)
- [Resilient Distributed Datasets (RDDs)](#Resilient-Distributed-Datasets)
- [RDD Operations](#RDD-Operations)
 - [Creating RDDs](#Creating-RDDs)
 - [RDD Actions](#RDD-Actions)
 - [RDD Transformations](#RDD-Transformations)
- [Using Spark SQL and DataFrames](#Using-Spark-SQL-and-DataFrames)
- [Using Spark MLlib: Decision Trees](#Using-Spark-MLlib:-Decision-Trees)

<br>
Data sets used for illustration:
- [UCI Sentiment Labelled Sentences Data Set](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#)
- [UCI Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Wine+Quality)

## Installation and Configuration

For this tutorial, we will use the most recent stable Spark release available at the time of writing, 2.0.1, in a standalone mode configuration. This tutorial has been written using Ubuntu 16.04. Please note that some of the configuration commands might vary based on the user's operating system.

Software Prerequisites:
- Apache Spark ([Download link](http://d3kbcqa49mib13.cloudfront.net/spark-2.0.1-bin-hadoop2.6.tgz)) 
- Python (v2.6+) 
- Java (v7.0+)

<i>Note: Please ensure that java and python are on your environment PATH and that the JAVA_HOME environment variable is set.</i>

Unzip the downloaded Spark file and, based on the location of the extracted directory, set the SPARK_HOME environment variable and add Spark to your PATH using the commands below (note that the shell does not allow spaces around `=`):

    $ export SPARK_HOME=~/spark-2.0.1-bin-hadoop2.6
    
    $ export PATH=$SPARK_HOME/bin:$PATH

This completes the Spark installation; Spark is now ready to use on your local machine in "standalone mode". Spark comes with an interactive Python shell, which can be launched using the command below:

    $ $SPARK_HOME/bin/pyspark

<img src="spark.png">

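In order to interact with Spark from a Jupyter notebook, execute the command below (`PYSPARK_DRIVER_PYTHON_OPTS=notebook` tells Jupyter to start the notebook server):

    $ PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook $SPARK_HOME/bin/pyspark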
    
<br>
Refer to the Spark documentation [here](http://spark.apache.org/docs/latest/) in case of any issues with the download link or with launching pyspark.
    

## Resilient Distributed Datasets

RDDs are the core data structure in Spark. An RDD is essentially a read-only, fault-tolerant collection of objects that is partitioned across machines and can be operated on in parallel. In a local configuration, Spark simulates distributing the computation over many machines by splitting the data into partitions that are processed by local threads. RDDs can be created from and written to the local file system, distributed storage such as HDFS or S3, and other data sources. RDDs can be cached in memory, which makes Spark very effective for iterative applications where the data is reused throughout the course of an algorithm.

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. 

Spark applications are essentially the manipulation of RDDs through transformations and actions. 


<img src="http://blog.appliedinformaticsinc.com/wp-content/uploads/2015/07/Screen-Shot-2015-07-03-at-3.34.57-PM.png">

An important thing to know is that all transformations in Spark are lazy, i.e. they do not compute their results right away. Instead, Spark records the transformations applied to an RDD and executes them only when an action is called. This design lets Spark avoid unnecessary work and run more efficiently.
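The idea of lazy evaluation can be illustrated without Spark. The following is a plain-Python analogy, not Spark code: building a generator pipeline plays the role of declaring a transformation, and consuming it plays the role of calling an action. The names `calls`, `traced_double`, and `pipeline` are made up for this sketch.

```python
# Plain-Python analogy for lazy evaluation (this is not Spark code).
calls = []  # records every element that actually gets processed


def traced_double(x):
    """Double x and remember that the function was invoked."""
    calls.append(x)
    return 2 * x


# Building the pipeline is like declaring a transformation: nothing runs yet.
pipeline = (traced_double(x) for x in [1, 2, 3, 4, 5])
calls_before = len(calls)

# Consuming the pipeline is like calling an action: the work happens now.
result = list(pipeline)
calls_after = len(calls)

print("processed before consuming: %d" % calls_before)
print("processed after consuming: %d" % calls_after)
print("result: %s" % result)
```

Spark applies the same principle at cluster scale: a chain of transformations only describes the computation, and the first action triggers it.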

In our case, the SparkContext is created by the driver program, the PySpark shell. It is usually referenced as the variable <b>sc</b>. We will use this SparkContext object to instantiate RDDs in the following sections.

In [1]:
import os
import sys

# get SPARK_HOME to check which Spark version is used in case multiple versions are installed
spark_home = os.environ.get('SPARK_HOME', None)

# str() avoids a TypeError when SPARK_HOME is not set (None)
print "SPARK_HOME = " + str(spark_home)

#checks if the sparkContext object is available or not.
print "SparkContext Object = " + str(sc)

SPARK_HOME = /home/agoyal3/apps/spark-2.0.1
SparkContext Object = <pyspark.context.SparkContext object at 0x7fcabdc5ef50>


### RDD Operations

Now, we will go through some of the most common RDD operations to understand them better. For the complete list of available RDD operations, refer to this [link](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD).

#### Creating RDDs 

From existing collection

- <b>parallelize()</b> - Creates a parallelized collection (an RDD) from an existing collection in the driver program

In [2]:
#creating RDD from python list. 
listData = [1,2,3,4,5,6,7,8,9]

rddList = sc.parallelize(listData)

#creating RDD from python tuples. 
tupleData = [('a',1), ('b', 2), ('c', 3), ('a',4), ('c',3), ('d',2), ('a',1)]

rddTuple = sc.parallelize(tupleData)

From file

- <b>textFile()</b> - Creates an RDD from a file on a local or remote system

In this example, we will use the "yelp_labelled.txt" file from the Sentiment Labelled Sentences data set, which contains 1000 instances.



In [3]:
#getting the file from URL and then extracting it
import urllib
import zipfile

f = urllib.urlretrieve ("https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip", "sentiment_labelled_sentences.zip")

dataset = zipfile.ZipFile('sentiment_labelled_sentences.zip')

dataset.extractall()

data_file = "./sentiment labelled sentences/yelp_labelled.txt"

#creates a RDD from file
rddFile = sc.textFile(data_file)

#### RDD Actions

Now, we will apply some actions on the 3 RDDs (rddList, rddTuple, rddFile) created above.

- <i><b>collect()</b> - Returns all the elements of the RDD to the driver as a list.</i>

In [4]:
rddList.collect()

[1, 2, 3, 4, 5, 6, 7, 8, 9]

- <i><b>take(n)</b> - Return the first n elements of the dataset.</i>

In [5]:
rddList.take(5)

[1, 2, 3, 4, 5]

- <i><b>first()</b> - Return the first element of the dataset (similar to take(1)).</i>

In [6]:
rddList.first()

1

- <i><b>reduce()</b> - Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).</i>

In [7]:
#sums up all the elements in the RDD
rddList.reduce(lambda t1, t2: t1+t2)

45
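Because an RDD is split into partitions, a reduce can be thought of as reducing each partition locally and then combining the partial results, which is why the function passed to reduce() should be commutative and associative. Below is a plain-Python sketch of that idea, not Spark code. The helper `split_into_partitions` and the partition count of 4 are assumptions made for illustration; Spark's actual partitioning may differ.

```python
from functools import reduce

list_data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # same values as listData above
num_partitions = 4  # hypothetical partition count chosen for illustration


def split_into_partitions(items, n):
    """Slice items into n contiguous chunks of nearly equal size."""
    size = len(items)
    return [items[i * size // n:(i + 1) * size // n] for i in range(n)]


def add(a, b):
    return a + b


partitions = split_into_partitions(list_data, num_partitions)

# Step 1: reduce every partition independently (could run in parallel).
partial_sums = [reduce(add, part) for part in partitions]

# Step 2: combine the partial results into the final value.
total = reduce(add, partial_sums)

print("partitions: %s" % partitions)
print("partial sums: %s" % partial_sums)
print("total: %d" % total)
```

The final total matches the value returned by `rddList.reduce` above regardless of how the elements are grouped, precisely because addition is commutative and associative.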

- <i><b>groupByKey()</b> - Groups all the tuples in the RDD by key. Strictly speaking this is a transformation; collect() is the action that triggers it.</i>

In [8]:
rddTuple.groupByKey().collect()

[('a', <pyspark.resultiterable.ResultIterable at 0x7fcabc98e1d0>),
 ('d', <pyspark.resultiterable.ResultIterable at 0x7fcabc2ae350>),
 ('c', <pyspark.resultiterable.ResultIterable at 0x7fcabc2ae3d0>),
 ('b', <pyspark.resultiterable.ResultIterable at 0x7fcabc2ae410>)]
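The output above shows opaque ResultIterable objects rather than the grouped values. In PySpark, one common way to inspect them is to convert each group to a list, e.g. `rddTuple.groupByKey().mapValues(list).collect()`. To make clear what the grouping computes, here is a plain-Python model of it (not Spark code); the variable names are made up for this sketch, and Spark does not guarantee the order of values within a group.

```python
from collections import defaultdict

# Same tuples as tupleData in the tutorial.
tuple_data = [('a', 1), ('b', 2), ('c', 3), ('a', 4), ('c', 3), ('d', 2), ('a', 1)]

# Collect every value under its key, keeping the values in input order.
grouped = defaultdict(list)
for key, value in tuple_data:
    grouped[key].append(value)

for key in sorted(grouped):
    print("%s -> %s" % (key, grouped[key]))
```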

- <i><b>sortByKey( ascending=True|False )</b> - Sorts the input RDD by key. Like groupByKey(), this is a transformation that is materialized here with collect().</i>

In [9]:
rddTuple.sortByKey(True).collect()

[('a', 1), ('a', 4), ('a', 1), ('b', 2), ('c', 3), ('c', 3), ('d', 2)]

- <i><b>countByValue()</b> - Returns the count of each unique element in the RDD as a dictionary of (element, count) pairs. For an RDD of tuples, the whole tuple is the element being counted. </i>

In [10]:
rddTuple.countByValue()

defaultdict(int,
            {('a', 1): 2, ('a', 4): 1, ('b', 2): 1, ('c', 3): 2, ('d', 2): 1})
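The result above counts whole (key, value) tuples, which is different from counting how often each key occurs (PySpark offers countByKey() for that). The difference can be sketched in plain Python with `collections.Counter`; this is an illustration of the semantics, not Spark code.

```python
from collections import Counter

# Same tuples as tupleData in the tutorial.
tuple_data = [('a', 1), ('b', 2), ('c', 3), ('a', 4), ('c', 3), ('d', 2), ('a', 1)]

# countByValue-style: every whole element (here a tuple) is counted.
by_value = Counter(tuple_data)

# countByKey-style: only the first item of each pair is counted.
by_key = Counter(key for key, _ in tuple_data)

print("by value: %s" % sorted(by_value.items()))
print("by key: %s" % sorted(by_key.items()))
```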

- <i><b>count()</b> - Displays the count of elements in RDD.</i>

In [11]:
#number of lines in the file
rddFile.count()

1000

- <i><b>takeSample(withReplacement, num, [seed])</b> - Returns a random sample of num elements from the dataset, with or without replacement (True or False). The optional seed initializes the random number generator, which makes the sample reproducible.</i>

In [12]:
rddFile.takeSample(False, 8, 1)

[u'A great touch.\t1',
 u'I vomited in the bathroom mid lunch.\t0',
 u'Check it out.\t1',
 u'The problem I have is that they charge $11.99 for a sandwich that is no bigger than a Subway sub (which offers better and more amount of vegetables).\t0',
 u'There is nothing authentic about this place.\t0',
 u'It lacked flavor, seemed undercooked, and dry.\t0',
 u'Nargile - I think you are great.\t1',
 u'!....THE OWNERS REALLY REALLY need to quit being soooooo cheap let them wrap my freaking sandwich in two papers not one!\t0']

- <i><b>saveAsTextFile()</b> - Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. </i>

    Please note that multiple output files will be created because the computation is done in parallel. If the command below fails when executed again, delete the previously generated output directory first; in its default configuration, Spark does not overwrite existing output.

In [13]:
rddFile.saveAsTextFile("output")

#### RDD Transformations

In this word count example, we will use the file loaded above and look at some of the transformation operations.

<i>
- <b>map()</b> - Return a new distributed dataset formed by passing each element of the source through a function func. 
- <b>flatMap()</b> - Similar to map, but each input item can be mapped to 0 or more output items 
- <b>reduceByKey</b> - When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V.
</i>

Please note that due to lazy evaluation, Spark performs no computation until the take() action is executed.

In [14]:
# returns the word count in the file.
rddWordCount = rddFile \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

#lazy evaluation, no action takes place until this line
rddWordCount.take(20)

[(u'better!', 1),
 (u'polite,', 1),
 (u'ever!', 2),
 (u'sushi.', 1),
 (u'yellow', 1),
 (u"friend's", 2),
 (u'hate', 2),
 (u'callings.', 1),
 (u'up.', 5),
 (u'over-hip', 1),
 (u'contained', 1),
 (u'Host', 1),
 (u'blanket', 1),
 (u'returning.', 1),
 (u'attentive,', 2),
 (u'attentive.', 2),
 (u'every', 8),
 (u'GO', 2),
 (u'selection.', 3),
 (u"we'll", 3)]
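To make the difference between map, flatMap, and reduceByKey concrete, here is a plain-Python sketch of the same word count pipeline (not Spark code). The two input lines are hypothetical and only mimic the "sentence, tab, label" layout of the data file.

```python
# Hypothetical lines in the same "sentence<TAB>label" layout as the data file.
lines = ["good food good service\t1", "bad food\t0"]

# map-style: one output item per input line (here, a list of tokens per line).
mapped = [line.split() for line in lines]

# flatMap-style: the per-line lists are flattened into one list of tokens.
flat_mapped = [word for line in lines for word in line.split()]

# map each token to a (word, 1) pair, as in the tutorial.
pairs = [(word, 1) for word in flat_mapped]

# reduceByKey-style: values sharing a key are combined with a + b.
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print("map result has %d items" % len(mapped))
print("flatMap result has %d items" % len(flat_mapped))
print("counts: %s" % sorted(counts.items()))
```

Note that `split()` without arguments splits on any whitespace, including the tab, so the trailing labels are counted as words too, and punctuation stays attached to the words (as in `'ever!'` in the output above). A real analysis would probably strip the label and punctuation first.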

- <i><b>filter()</b> - Return a new dataset formed by selecting those elements of the source on which func returns true.</i> 

In [15]:
# keeps only the lines that contain the substring 'restaurant'; count() then returns the number of such lines
rddWordFilter = rddFile.filter(lambda x: 'restaurant' in x)

rddWordFilter.count()

26

- <i><b>union()</b> - Return a new dataset that contains the union of the elements in the source dataset and the argument. </i>

In [16]:
rddUnion = rddList.union(rddTuple)

rddUnion.collect()

[1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 ('a', 1),
 ('b', 2),
 ('c', 3),
 ('a', 4),
 ('c', 3),
 ('d', 2),
 ('a', 1)]

- <i><b>intersection()</b> - Return a new RDD that contains the intersection of elements in the source dataset and the argument. </i>

In [17]:
rddIntersection = rddUnion.intersection(rddTuple)

rddIntersection.collect()

[('b', 2), ('a', 1), ('d', 2), ('a', 4), ('c', 3)]

- <i><b>distinct()</b> - Return a new dataset that contains the distinct elements of the source dataset.</i>

In [18]:
rddDistinct = rddTuple.distinct()

rddDistinct.collect()

[('a', 1), ('d', 2), ('c', 3), ('a', 4), ('b', 2)]

- <i><b>join()</b> - Returns a new dataset joining two RDDs based on a common key.</i>

In [19]:
rddJoin = rddTuple.join(rddIntersection)

rddJoin.collect()

[('a', (1, 1)),
 ('a', (1, 4)),
 ('a', (4, 1)),
 ('a', (4, 4)),
 ('a', (1, 1)),
 ('a', (1, 4)),
 ('c', (3, 3)),
 ('c', (3, 3)),
 ('b', (2, 2)),
 ('d', (2, 2))]
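The repeated rows in the join output come from the fact that an inner join pairs every left value with every right value that shares the same key. A plain-Python sketch of these semantics (not Spark code), using the contents of rddTuple and rddIntersection shown above:

```python
# Contents of rddTuple and rddIntersection as shown earlier in the tutorial.
left = [('a', 1), ('b', 2), ('c', 3), ('a', 4), ('c', 3), ('d', 2), ('a', 1)]
right = [('b', 2), ('a', 1), ('d', 2), ('a', 4), ('c', 3)]

# Inner join on the key: every matching (left, right) combination is emitted.
joined = [(lk, (lv, rv))
          for lk, lv in left
          for rk, rv in right
          if lk == rk]

a_rows = [row for row in joined if row[0] == 'a']

print("total joined rows: %d" % len(joined))
print("rows for key 'a': %d" % len(a_rows))
```

Key `'a'` appears three times on the left and twice on the right, which yields 3 x 2 = 6 rows for that key alone. The row order differs from Spark's output, but the set of rows is the same.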

### Using Spark SQL and DataFrames

In this section, we will look at Spark's SQL library and its ability to handle data in a structured way. Like a table in a relational database or a data frame in R or pandas, Spark has the concept of DataFrames, which can be queried using SQL. A Spark DataFrame is a distributed collection of data organized into named columns.

To get started, we will first download the white wine quality dataset file.

In [20]:
import urllib

wine_dataset_file = urllib.urlretrieve ("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", "white_wine.csv")

raw_data = sc.textFile("white_wine.csv")

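The entry point into SQL functionality used in this tutorial is the SQLContext class. (Spark 2.0 also introduces SparkSession as a unified entry point, but SQLContext remains available.) We will use the global context object sc to create a SQLContext instance.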

In [21]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

Spark SQL can convert an RDD of Row objects to a DataFrame. Rows are constructed by passing a list of key/value pairs as kwargs to the Row class. The keys define the column names, and the types are inferred by looking at the first row. Therefore, it is important that there is no missing data in the first row of the RDD in order to properly infer the schema.

The data in our dataset is delimited by ";" and includes headers. We will now create the RDD from the given data and specify the schema.

In [22]:
# splits the data based on the delimiter.
data = raw_data.map(lambda x: x.split(";"))

header = data.first()

# removes the header from the RDD
data = data.filter(lambda l: l!=header)

In [23]:
from pyspark.sql import Row

# specify the schema    
row_data = data.map(lambda p: Row(
    fixed_acidity=float(p[0]),
    volatile_acidity=float(p[1]),
    citric_acid=float(p[2]),
    residual_sugar=float(p[3]),
    chlorides=float(p[4]),
    free_fulfur_dioxide=float(p[5]),
    total_sulfur_dioxide=float(p[6]),
    density=float(p[7]),
    pH=float(p[8]),
    sulphates=float(p[9]),
    alcohol=float(p[10]),
    quality=int(p[11]))   
)

Once we have our RDD of Row objects, we can infer and register the schema to create the Spark DataFrame.

In [24]:
wineDataFrame = sqlContext.createDataFrame(row_data)

We can look at our DataFrame schema using printSchema. In addition, we can get basic statistics for numerical columns using describe.

In [25]:
#print the schema
wineDataFrame.printSchema()

# Number of records/columns
print "Number of records: %s" % wineDataFrame.count()
print "Number of columns: %s" % len(wineDataFrame.columns)

#view some summary of the columns
wineDataFrame.describe("alcohol").show()
wineDataFrame.describe("fixed_acidity").show()
wineDataFrame.describe("quality").show()

root
 |-- alcohol: double (nullable = true)
 |-- chlorides: double (nullable = true)
 |-- citric_acid: double (nullable = true)
 |-- density: double (nullable = true)
 |-- fixed_acidity: double (nullable = true)
 |-- free_fulfur_dioxide: double (nullable = true)
 |-- pH: double (nullable = true)
 |-- quality: long (nullable = true)
 |-- residual_sugar: double (nullable = true)
 |-- sulphates: double (nullable = true)
 |-- total_sulfur_dioxide: double (nullable = true)
 |-- volatile_acidity: double (nullable = true)

Number of records: 4898
Number of columns: 12
+-------+------------------+
|summary|           alcohol|
+-------+------------------+
|  count|              4898|
|   mean| 10.51426704777462|
| stddev|1.2306205677573196|
|    min|               8.0|
|    max|              14.2|
+-------+------------------+

+-------+------------------+
|summary|     fixed_acidity|
+-------+------------------+
|  count|              4898|
|   mean|  6.85478766843609|
| stddev|0.84386822768751

Now, we will register the DataFrame as a SQL table and use SQL language to query the data.

In [26]:
# Register the DataFrame as a table.
wineDataFrame.registerTempTable("white_wine")

In [27]:
# run SQL queries
queryResult = sqlContext.sql("SELECT alcohol,chlorides,citric_acid,pH,residual_sugar,sulphates,quality FROM white_wine where quality=9")
queryResult.show()

+-------+---------+-----------+----+--------------+---------+-------+
|alcohol|chlorides|citric_acid|  pH|residual_sugar|sulphates|quality|
+-------+---------+-----------+----+--------------+---------+-------+
|   10.4|    0.035|       0.45| 3.2|          10.6|     0.46|      9|
|   12.4|    0.021|       0.29|3.41|           1.6|     0.61|      9|
|   12.5|    0.031|       0.36|3.28|           2.0|     0.48|      9|
|   12.7|    0.018|       0.34|3.28|           4.2|     0.36|      9|
|   12.9|    0.032|       0.49|3.37|           2.2|     0.42|      9|
+-------+---------+-----------+----+--------------+---------+-------+



Because Spark DataFrames are distributed across the cluster and their queries are optimized before execution, querying stays efficient even for large data sets.

For detailed information about DataFrame operations and data sources, please refer to the [Spark SQL programming guide](https://spark.apache.org/docs/latest/sql-programming-guide.html#dataframe-operations).

### Using Spark MLlib: Decision Trees

In this section, we will see how to use MLlib to perform a classification task using decision trees. The standard Spark package comes with a built-in machine learning library (MLlib), which supports many high-quality algorithms and utilities. Thanks to distributed, in-memory computation, Spark excels at iterative workloads, which enables MLlib to run fast.

Some of the available algorithms are:
- classification:  logistic regression, linear support vector machine(SVM), naive Bayes 
- regression: generalized linear regression (GLM) 
- collaborative filtering: alternating least squares (ALS) 
- clustering: k-means 
- decomposition: singular value decomposition (SVD), principal component analysis (PCA)

We will treat the prediction of wine quality as a classification problem, using the white wine quality dataset from the example above.


#### Preparing the data

Get the dataset file and store the ";"-separated data in an RDD without the header.

In [28]:
import urllib

wine_dataset_file = urllib.urlretrieve ("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", "white_wine.csv")

raw_data = sc.textFile("white_wine.csv")

# splits the data based on the delimiter.
data = raw_data.map(lambda x: x.split(";"))

header = data.first()

# removes the header from the RDD
data = data.filter(lambda l: l!=header)

In order to use decision trees in MLlib, we need to translate our data into a collection of labelled points, which are key-value pairs. The key, or label, is the target class of the observation, and the value is a feature vector stored as an array.

In [29]:
from pyspark.mllib.regression import LabeledPoint
import numpy as np

# first 11 columns include the features data
def extract_features_dt(record):
    return np.array([float(x) for x in record[0:11]])

# last column is our class label 
def extract_label(record):
    return float(record[-1])

# create labelled points from the data RDD
data_dt = data.map(lambda r: LabeledPoint(extract_label(r), extract_features_dt(r)))

print "Decision Tree feature vector: " + str(data_dt.first().features)
print "Decision Tree feature vector length: " + str(len(data_dt.first().features))

Decision Tree feature vector: [7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8]
Decision Tree feature vector length: 11
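The parsing step behind a labelled point can be sketched in plain Python, without Spark or NumPy. In the sample line below, the eleven feature values match the first feature vector printed above; the trailing label `6` is an assumption made for illustration.

```python
# One record in the ";"-separated layout of the wine file.
# The features match the vector printed above; the label 6 is assumed.
line = "7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6"

record = line.split(";")

# The first 11 columns are the features, the last column is the class label.
features = [float(x) for x in record[0:11]]
label = float(record[-1])

print("label: %s" % label)
print("number of features: %d" % len(features))
```

A LabeledPoint simply bundles these two pieces, the label and the feature vector, into one object that MLlib's training methods accept.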


We will use a 70-30 split of our data for training and testing.

In [30]:
# split the data into training and test sets
train_data_dt, test_data_dt = data_dt.randomSplit([0.7, 0.3])

#### Training a classifier

We will now proceed with building our decision tree model. For this, we will use the trainClassifier method of DecisionTree, which takes the labelled training data, the number of classes, categorical features information (none in our case), the impurity measure, the maximum tree depth, and the maximum number of bins.

In [31]:
from pyspark.mllib.tree import DecisionTree
from time import time

startTime = time()

#build the model
dt_model = DecisionTree.trainClassifier(train_data_dt, numClasses=11, categoricalFeaturesInfo={}, impurity="gini", maxDepth=4, maxBins=42)

print "Classifier trained in {} seconds".format(round(time()-startTime,3))

Classifier trained in 2.096 seconds
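The "gini" argument passed to trainClassifier selects Gini impurity as the measure used to pick splits. For a set of labels with class proportions p_i, it is 1 minus the sum of the squared proportions. The following is a minimal plain-Python sketch of the formula, not MLlib's implementation; the function name and the example label lists are made up for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i ** 2) over the class proportions p_i in labels."""
    if not labels:
        return 0.0
    total = float(len(labels))
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


pure = gini_impurity([6, 6, 6, 6])      # a node containing a single class
mixed = gini_impurity([5, 5, 6, 6])     # two classes in equal proportion

print("pure node impurity: %.2f" % pure)
print("mixed node impurity: %.2f" % mixed)
```

A node containing only one class has impurity 0, and the value grows as the classes become more evenly mixed, so the tree prefers splits whose child nodes have lower impurity.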


#### Interpreting the model

Using the toDebugString method of our tree model, we can obtain detailed information about its splits and nodes.

In [32]:
print "Learned classification tree model:\n"
print dt_model.toDebugString()

Learned classification tree model:

DecisionTreeModel classifier of depth 4 with 31 nodes
  If (feature 10 <= 10.8)
   If (feature 1 <= 0.27)
    If (feature 1 <= 0.205)
     If (feature 10 <= 9.1)
      Predict: 6.0
     Else (feature 10 > 9.1)
      Predict: 6.0
    Else (feature 1 > 0.205)
     If (feature 10 <= 9.8)
      Predict: 5.0
     Else (feature 10 > 9.8)
      Predict: 6.0
   Else (feature 1 > 0.27)
    If (feature 10 <= 10.3)
     If (feature 8 <= 3.24)
      Predict: 5.0
     Else (feature 8 > 3.24)
      Predict: 5.0
    Else (feature 10 > 10.3)
     If (feature 5 <= 17.0)
      Predict: 5.0
     Else (feature 5 > 17.0)
      Predict: 6.0
  Else (feature 10 > 10.8)
   If (feature 10 <= 11.7)
    If (feature 5 <= 10.0)
     If (feature 1 <= 0.35)
      Predict: 5.0
     Else (feature 1 > 0.35)
      Predict: 4.0
    Else (feature 5 > 10.0)
     If (feature 6 <= 152.0)
      Predict: 6.0
     Else (feature 6 > 152.0)
      Predict: 6.0
   Else (feature 10 > 11.7)
    If (

#### Evaluating the model

To measure the classification error on our test data, we apply the model to the features of the test_data_dt RDD to predict the class of each test point, and then pair each prediction with the actual label.

In [33]:
preds = dt_model.predict(test_data_dt.map(lambda p: p.features))
actual_and_preds = test_data_dt.map(lambda p: p.label).zip(preds)

print "Decision Tree depth: " + str(dt_model.depth())
print "Decision Tree number of nodes: " + str(dt_model.numNodes())
print "Decision Tree predictions: \n" + str(actual_and_preds.take(50))

Decision Tree depth: 4
Decision Tree number of nodes: 31
Decision Tree predictions: 
[(6.0, 5.0), (6.0, 6.0), (7.0, 7.0), (5.0, 5.0), (6.0, 6.0), (5.0, 6.0), (5.0, 5.0), (5.0, 5.0), (5.0, 5.0), (5.0, 6.0), (7.0, 6.0), (6.0, 6.0), (6.0, 5.0), (6.0, 6.0), (6.0, 5.0), (6.0, 6.0), (6.0, 5.0), (8.0, 6.0), (6.0, 5.0), (6.0, 5.0), (5.0, 5.0), (5.0, 5.0), (7.0, 7.0), (7.0, 7.0), (6.0, 5.0), (4.0, 5.0), (6.0, 5.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (5.0, 5.0), (5.0, 5.0), (5.0, 5.0), (5.0, 5.0), (6.0, 6.0), (5.0, 5.0), (6.0, 6.0), (5.0, 5.0), (6.0, 5.0), (6.0, 6.0), (6.0, 5.0), (6.0, 5.0), (8.0, 7.0), (7.0, 7.0), (6.0, 5.0), (6.0, 5.0), (4.0, 4.0), (7.0, 6.0), (6.0, 6.0), (6.0, 6.0)]


Classification results are returned in pairs, with the actual test label first and the predicted one second. We use filter and count to compute the fraction of misclassified test points, i.e. the test error.

In [34]:
t0 = time()

# fraction of test points whose predicted label differs from the actual label
test_error = actual_and_preds.filter(lambda vp: vp[0] != vp[1]).count() / float(test_data_dt.count())

print "Prediction made in {} seconds".format(round(time()-t0,3))

print "Test error is {}".format(round(test_error,5))

Prediction made in 0.809 seconds
Test error is 0.49112


We can see that the test error of our model is around 49%, which corresponds to an accuracy of around 51%. You can further experiment with different depths and maxBins to see how they affect the complexity, training time, and accuracy of the model.
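Error rate and accuracy are complementary: the first counts the pairs where the labels differ, the second the pairs where they agree. Here is a minimal plain-Python sketch (not Spark code) using the first ten (actual, predicted) pairs from the predictions printed above. Ten pairs are far too few to be representative of the full test set; they only illustrate the arithmetic.

```python
# First ten (actual, predicted) pairs from the predictions printed above.
pairs = [(6.0, 5.0), (6.0, 6.0), (7.0, 7.0), (5.0, 5.0), (6.0, 6.0),
         (5.0, 6.0), (5.0, 5.0), (5.0, 5.0), (5.0, 5.0), (5.0, 6.0)]

# Accuracy: fraction of pairs where the prediction equals the actual label.
correct = sum(1 for actual, predicted in pairs if actual == predicted)
accuracy = correct / float(len(pairs))

# Error rate: fraction of pairs where they differ, i.e. 1 - accuracy.
error = 1.0 - accuracy

print("correct predictions: %d of %d" % (correct, len(pairs)))
print("accuracy: %.2f, error: %.2f" % (accuracy, error))
```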

# Summary and References

This tutorial highlighted some of the basic operations and analysis tasks that can be performed with Apache Spark using PySpark. Much more detailed information about the libraries, packages, and operations supported in Apache Spark is available at the links below:

- [Apache Spark](http://spark.apache.org/docs/latest/index.html)
- [PySpark (API Documentation)](http://spark.apache.org/docs/2.0.1/api/python/index.html)
- [RDDs](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD)
- [SPARK SQL and Dataframes](https://spark.apache.org/docs/latest/sql-programming-guide.html#dataframe-operations)
- [Machine Learning Library (MLlib)](http://spark.apache.org/docs/latest/ml-guide.html)
- [PySpark Internals](https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals)
- [Decision Tree Example](https://xscale10.blogspot.com/2015/08/machine-learning-using-spark-mllib-part.html)
