# Introduction to PySpark

## Have Some Spark in Your Life (Motivations for using Spark)

Apache Spark is a cluster computing platform designed to be fast and general-purpose.

Speed matters because it can determine whether we wait minutes or hours during data exploration. Spark can run computations in memory, and even for complex applications that run on disk it is faster than MapReduce.

Spark is general-purpose because it covers a wide range of workloads that previously required separate technologies, such as data pre-processing, SQL queries and machine learning pipelines. This makes it easy to combine different kinds of processing and spares us from maintaining separate tools in a data analysis pipeline.


## How Spark works

Spark is a "computational engine" that schedules, distributes and monitors applications consisting of many computational tasks across the machines of a computing cluster. It gives us the power of a distributed system without having to worry about what goes on under the hood.

Resilient Distributed Datasets (RDDs) are the core of Spark. On top of the core sit higher-level components specialized for particular workloads, such as SQL or machine learning. Because these components all build on the core, optimizing the core improves performance for all of them.

Spark is mainly written in Scala but supports Python through the PySpark package.


## Overview of Content
In this tutorial, we cover Spark basics and the Spark machine learning package using PySpark. We will also run a Spark standalone cluster locally.

Topics:

* [Downloading and Installing PySpark](#Downloading-and-Installing-PySpark)
* [Starting our first SparkContext and SparkSession](#Starting-our-first-SparkContext-and-SparkSession)
* [Creating or Loading Sample Data](#Creating-or-Loading-Sample-Data)
* [RDD and DataFrame Basics](#RDD-and-DataFrame-Basics)
* [Machine Learning in PySpark](#Machine-Learning-in-PySpark)
* [Shutting off a SparkContext](#Shutting-off-a-SparkContext)
* [PySpark in Action: Will It Be On Sale?](#PySpark-in-Action:-Will-It-Be-On-Sale?)
* [Starting a Spark cluster](#Starting-a-Spark-cluster)
* [Useful Resources](#Useful-Resources)



## Downloading and Installing PySpark

It is easy to install PySpark using Anaconda:
 
    $ conda install -c conda-forge pyspark
      
* Note: This version only works with Python 3.6. If you are running an older version of Python, please consult the documentation.
* Note: the above procedure may not work on Windows. 

After installing PySpark, make sure the following import works:

In [1]:
import pyspark

## Starting our first SparkContext and SparkSession

Every Spark application starts by creating a SparkContext. The SparkContext sets up internal services and establishes a connection to a Spark execution environment. Once a SparkContext is created, we can use it for RDD operations.
There are several ways to create a SparkContext, depending on the use case. A SparkSession, which we need to create DataFrames, is then built on top of the SparkContext.

We will use ```getOrCreate()``` to create a local SparkContext. ```getOrCreate()``` either returns the existing SparkContext or creates a new one. When connecting to a cluster, we could instead first initialize a SparkConf with extra parameters and pass it to the SparkContext. Please check the Spark documentation for more info.

In [2]:
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sparkContext=sc)

## Creating or Loading Sample Data

To start, let us introduce the two PySpark data structures: RDDs and DataFrames.

#### RDD 
The RDD was the primary user-facing API at Spark's inception. An RDD is an immutable distributed collection of data elements, partitioned across the nodes of the cluster, that can be operated on in parallel through a low-level API of transformations and actions. With the advent of the DataFrame, we use RDDs only when we need low-level control over the dataset or when our data is unstructured.


#### DataFrame
A DataFrame is also an immutable distributed collection of data, but the data is organized into named columns, like a table in a relational database. We can also manipulate it using SQL queries.

To sum up, we should use DataFrames whenever our data permits, because they offer a higher-level abstraction with optimization and performance benefits. Below, we show how to create and import data into RDDs and DataFrames. Spark can import many data formats; we use CSV as an example.

##### Creating a simple RDD using SparkContext's parallelize on an existing iterable or collection.

In [13]:
# let's create a list of (name,age)
l = [('John',3),('Mary',5),('Grace',6)]
l_rdd = sc.parallelize(l)
l_rdd.collect()

[('John', 3), ('Mary', 5), ('Grace', 6)]

##### Creating an RDD from an external dataset


In [5]:
shoes_rdd = sc.textFile('shoes.csv')
shoes_rdd.take(5) # as there are many columns in the csv, this is not very pretty

['id,asins,brand,categories,colors,count,dimension,manufacturerNumber,name,prices.amountMin,prices.amountMax,prices.availability,prices.color,prices.condition,prices.count,prices.currency,prices.isSale,prices.merchant,prices.offer,prices.returnPolicy,prices.shipping,prices.size,,,,',
 'AVpe__eOilAPnD_xSt-H,,Novica,"Access.,Clothing,Shoes,Women\'s Clothing",Purple,,,231516,Handcrafted Alpaca Blend \'Purple Charisma\' Sweater (Peru),62.99,62.99,,,,,USD,FALSE,Overstock.com,,,,L,,,,',
 'AVpe__eOilAPnD_xSt-H,,Novica,"Access.,Clothing,Shoes,Women\'s Clothing",Purple,,,231516,Handcrafted Alpaca Blend \'Purple Charisma\' Sweater (Peru),62.99,62.99,,,,,USD,FALSE,Overstock.com,,,,M,,,,',
 'AVpe__eOilAPnD_xSt-H,,Novica,"Access.,Clothing,Shoes,Women\'s Clothing",Purple,,,231516,Handcrafted Alpaca Blend \'Purple Charisma\' Sweater (Peru),62.99,62.99,,,,,USD,FALSE,Overstock.com,,,,S,,,,',
 'AVpe__eOilAPnD_xSt-H,,Novica,"Access.,Clothing,Shoes,Women\'s Clothing",Purple,,,231516,Handcrafted Alpaca Ble

##### Creating a dataframe with createDataFrame() using an RDD, a list or a pandas.DataFrame

In [14]:
df = spark.createDataFrame(l,['name','age'])
df.show()

+-----+---+
| name|age|
+-----+---+
| John|  3|
| Mary|  5|
|Grace|  6|
+-----+---+



##### Creating a dataframe from an external dataset. 

In [12]:
#import csv, tell the function that the csv has a header and ask it to infer the data type of each column.
shoes_df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("shoes.csv")
shoes_df.show(3) # also not pretty

+--------------------+-----+------+--------------------+------+-----+---------+------------------+--------------------+----------------+----------------+-------------------+------------+----------------+------------+---------------+-------------+---------------+------------+-------------------+---------------+-----------+----+----+----+----+
|                  id|asins| brand|          categories|colors|count|dimension|manufacturerNumber|                name|prices.amountMin|prices.amountMax|prices.availability|prices.color|prices.condition|prices.count|prices.currency|prices.isSale|prices.merchant|prices.offer|prices.returnPolicy|prices.shipping|prices.size|_c22|_c23|_c24|_c25|
+--------------------+-----+------+--------------------+------+-----+---------+------------------+--------------------+----------------+----------------+-------------------+------------+----------------+------------+---------------+-------------+---------------+------------+-------------------+---------------+-

## RDD and DataFrame Basics

### RDD
As mentioned before, the RDD is the basic unit of Spark. While DataFrames save us from dealing with RDDs for most data preprocessing, it is still helpful to understand the basics so that we can write RDD operations when a pre-built function is not available.

There are two types of RDD operations: transformations and actions. A transformation constructs a new RDD from an existing one and always returns another RDD. An action returns a non-RDD value, such as an integer, a float or a Python list.

Transformations are lazy: they are not executed until an action is called on the transformed RDD. When many transformations are chained one after another, this laziness lets Spark optimize the whole chain before executing it.
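
To make the idea of lazy transformations concrete, here is a minimal plain-Python sketch. It does not use Spark at all: ```LazyDataset``` and ```add_one_year``` are hypothetical names invented for this illustration, and the class only mimics the record-now, run-later behaviour described above.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, steps=()):
        self._source = source      # the original data, never modified
        self._steps = steps        # pending ("map" | "filter", func) steps

    def map(self, func):
        # Transformation: returns a new dataset with one more recorded step.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, func):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._steps + (("filter", func),))

    def collect(self):
        # Action: only now are the recorded steps applied, in order.
        items = iter(self._source)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return list(items)


calls = []

def add_one_year(person):
    calls.append(person)           # record that real work happened
    name, age = person
    return (name, age + 1)

people = LazyDataset([("John", 3), ("Mary", 5), ("Grace", 6)])
older = people.map(add_one_year).filter(lambda p: p[1] > 4)

calls_before_action = len(calls)   # still 0: nothing has run yet
result = older.collect()           # the action triggers the work
calls_after_action = len(calls)    # now one call per element
```

Because nothing runs until ```collect()```, a real engine like Spark has the chance to inspect the whole chain of steps first; this toy class simply replays them.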

[<img src="https://github.com/minxuancao/688tutorial/blob/master/RDD.png?raw=true">](https://github.com/minxuancao/688tutorial/blob/master/RDD.png?raw=true)

#### Some Transformation Operations

##### filter(): 
Use ```filter()``` to keep only the RDD elements that satisfy a condition.

In [15]:
grace_rdd = l_rdd.filter(lambda x: "Grace" in x)
grace_rdd.collect()

[('Grace', 6)]

##### union() 
Use ```union()``` to combine two RDDs into one.

In [16]:
girl_rdd = l_rdd.filter(lambda x: "Mary" in x).union(grace_rdd)
girl_rdd.collect()

[('Mary', 5), ('Grace', 6)]

##### map()
Use ```map()``` to apply an operation to every element of the RDD.

In [17]:
#increase the age of each child
l_rdd_ageInc = l_rdd.map(lambda x:(x[0],x[1]+1))
l_rdd_ageInc.collect()

[('John', 4), ('Mary', 6), ('Grace', 7)]

##### reduceByKey()
Use ```reduceByKey()``` to combine the values of all RDD elements that share the same key.

In [19]:
#reduceByKey to calculate average. 
l_rdd_remapped = l_rdd.map(lambda x:('people',(x[1],1))) # map all people to a common group "people"
l_rdd_remapped.collect()

[('people', (3, 1)), ('people', (5, 1)), ('people', (6, 1))]

In [20]:
average = l_rdd_remapped.reduceByKey(lambda x,y:(x[0]+y[0],x[1]+y[1])) # sum the ages and the counts per key
average.collect()

[('people', (14, 3))]

In [21]:
average.map(lambda x:x[1][0]/x[1][1]).collect() # calculate average

[4.666666666666667]
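
The example above puts everyone under a single key. The plain-Python sketch below shows what the same (sum, count) trick does when there are several keys. It is an illustration of the semantics only, not Spark code: ```reduce_by_key``` is a hypothetical helper and the ```records``` list is made-up data.

```python
def reduce_by_key(pairs, func):
    """Combine the values of equal keys pairwise with func (plain-Python sketch)."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


# Hypothetical data: (group, age) pairs with two groups instead of one.
records = [("girl", 5), ("boy", 3), ("girl", 6)]

# Same trick as above: carry (sum, count) so the average can be derived afterwards.
sum_count = reduce_by_key(
    [(group, (age, 1)) for group, age in records],
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
averages = {group: total / count for group, (total, count) in sum_count.items()}
print(averages)
```

Carrying the count alongside the sum matters because averages themselves cannot be combined pairwise, while sums and counts can.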

#### Some Action Operations

##### collect()
We have already used ```collect()``` several times above. It returns all the elements of the dataset as a list.

In [23]:
l_rdd.collect()

[('John', 3), ('Mary', 5), ('Grace', 6)]

##### count() and countByValue()
We use ```count()``` to count the number of elements in an RDD. ```countByValue()``` counts the occurrences of each distinct element.

In [27]:
l_rdd.count()

3

In [28]:
l_rdd.countByValue()

defaultdict(int, {('Grace', 6): 1, ('John', 3): 1, ('Mary', 5): 1})

##### reduce()
```reduce()``` aggregates the elements using the provided function and returns a single value.

In [29]:
#calculate average using reduce
#first, get the age as a separate RDD
sum_rdd = l_rdd.map(lambda x:x[1])
sum_rdd.collect()

[3, 5, 6]

In [31]:
# map each element to one to calculate number of elements
num_rdd = l_rdd.map(lambda x:(1))
num_rdd.collect()

[1, 1, 1]

In [35]:
#calculate average
sumVal = sum_rdd.reduce(lambda accum,n: accum +n)
print("sum is: ",sumVal)
numVal = num_rdd.reduce(lambda accum,n: accum +n)
print("number of element is: ",numVal)
avg = sumVal/numVal
print("average is: ",avg)

sum is:  14
number of element is:  3
average is:  4.666666666666667
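
Spark's ```reduce``` expects a function that is commutative and associative, because each partition is reduced on its own before the partial results are combined. The plain-Python sketch below illustrates why; ```partitioned_reduce``` and the two hand-written partitionings are assumptions made for this illustration and are not part of the Spark API.

```python
from functools import reduce


def partitioned_reduce(partitions, func):
    """Reduce each partition separately, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions]
    return reduce(func, partials)


# The same ages as above, split in two different (hypothetical) ways.
one_partition = [[3, 5, 6]]
two_partitions = [[3], [5, 6]]

# Addition is associative and commutative: the split does not matter.
sum_one = partitioned_reduce(one_partition, lambda a, b: a + b)
sum_two = partitioned_reduce(two_partitions, lambda a, b: a + b)

# Subtraction is neither, so the answer depends on how the data was split.
diff_one = partitioned_reduce(one_partition, lambda a, b: a - b)
diff_two = partitioned_reduce(two_partitions, lambda a, b: a - b)

print(sum_one, sum_two)
print(diff_one, diff_two)
```

Since we do not control how Spark partitions the data, a function like subtraction would give results that change from run to run or cluster to cluster.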


This introduction is not exhaustive, so feel free to check the API for more examples.

### DataFrame

Spark DataFrames are very similar to pandas DataFrames. Read the [PySpark in Action: Will It Be On Sale?](#PySpark-in-Action:-Will-It-Be-On-Sale?) section for a quick example.

## Machine Learning in PySpark

Spark provides a general machine learning library (MLlib). It is designed to be simple and fast by taking advantage of distributed computing. In this section, we first introduce StringIndexer and OneHotEncoder, which are used to preprocess categorical data. Then we move on to Spark's machine learning pipeline.

### ```StringIndexer``` and ```OneHotEncoderEstimator```
In Spark, ```StringIndexer``` maps a column of categorical labels to a column of numeric indices that Spark then treats as a categorical variable. The indices start at 0 and are ordered by label frequency, so the most frequent label gets index 0.

Three steps to implementing the ```StringIndexer```:
1. Build the StringIndexer: specify the input and output column names
2. Learn the StringIndexer model: fit the indexer to your data
3. Execute the indexing: call transform on the fitted model

Immediately after ```StringIndexer```, we follow up with ```OneHotEncoderEstimator```. ```OneHotEncoderEstimator``` converts each category of a string-indexed column into a sparse vector.

Let's look at a small example:

In [36]:
import pandas as pd
pdf = pd.DataFrame({
        'Day of the week':['Monday','Friday','Tuesday','Thursday','Friday'],
        'weather':['Sunny','Cloudy','Snow','Rain','Rain'],
        'temperature':['80','70','40','60','50'],
    })
df = spark.createDataFrame(pdf)
df.show()

+---------------+-----------+-------+
|Day of the week|temperature|weather|
+---------------+-----------+-------+
|         Monday|         80|  Sunny|
|         Friday|         70| Cloudy|
|        Tuesday|         40|   Snow|
|       Thursday|         60|   Rain|
|         Friday|         50|   Rain|
+---------------+-----------+-------+



In [37]:
from pyspark.ml.feature import StringIndexer

#build indexer
string_indexer = StringIndexer(inputCol='Day of the week',outputCol='indexed_day')

#learn the model
string_indexer_model = string_indexer.fit(df)

#transform the data
df_stringindexer = string_indexer_model.transform(df)

df_stringindexer.show()

+---------------+-----------+-------+-----------+
|Day of the week|temperature|weather|indexed_day|
+---------------+-----------+-------+-----------+
|         Monday|         80|  Sunny|        1.0|
|         Friday|         70| Cloudy|        0.0|
|        Tuesday|         40|   Snow|        2.0|
|       Thursday|         60|   Rain|        3.0|
|         Friday|         50|   Rain|        0.0|
+---------------+-----------+-------+-----------+
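
The frequency ordering explains why Friday, the only day that appears twice, received index 0.0. The plain-Python sketch below reproduces that ordering for the same five days. ```frequency_index``` is a hypothetical helper, and breaking ties by first appearance is an assumption chosen for this sketch; Spark's own tie-breaking may differ.

```python
from collections import Counter


def frequency_index(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    first_seen = {}
    for position, label in enumerate(labels):
        first_seen.setdefault(label, position)
    # Assumption: ties in frequency are broken by first appearance.
    ordered = sorted(counts, key=lambda label: (-counts[label], first_seen[label]))
    return {label: float(i) for i, label in enumerate(ordered)}


days = ['Monday', 'Friday', 'Tuesday', 'Thursday', 'Friday']
mapping = frequency_index(days)
indexed = [mapping[day] for day in days]
print(indexed)
```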



Next, we use ```OneHotEncoderEstimator```, which follows the same build, fit and transform steps as detailed above. (In Spark 3.0 and later, this class is named ```OneHotEncoder```.)

In [43]:
from pyspark.ml.feature import OneHotEncoderEstimator
OneHotEncoderEstimator(inputCols=['indexed_day'],outputCols=['encoded_day']).fit(df_stringindexer).transform(df_stringindexer).show()

+---------------+-----------+-------+-----------+-------------+
|Day of the week|temperature|weather|indexed_day|  encoded_day|
+---------------+-----------+-------+-----------+-------------+
|         Monday|         80|  Sunny|        1.0|(3,[1],[1.0])|
|         Friday|         70| Cloudy|        0.0|(3,[0],[1.0])|
|        Tuesday|         40|   Snow|        2.0|(3,[2],[1.0])|
|       Thursday|         60|   Rain|        3.0|    (3,[],[])|
|         Friday|         50|   Rain|        0.0|(3,[0],[1.0])|
+---------------+-----------+-------+-----------+-------------+



The result of ```OneHotEncoderEstimator``` is read as (vector size, \[index of the category\], \[1.0\]). It is a sparse representation of a binary vector. Note that, by default, Spark drops the last category to ensure linear independence, which is why Thursday is encoded as the all-zero vector (3,\[\],\[\]). We can keep the last category explicitly, as demonstrated below. Before training a model, we should one-hot encode all categorical columns and merge them using ```VectorAssembler```, as demonstrated in [PySpark in Action: Will It Be On Sale?](#PySpark-in-Action:-Will-It-Be-On-Sale?).

In [39]:
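#specify not to drop the last category
#(OneHotEncoder here is the Spark 2.x transformer, so no fit step is needed)
from pyspark.ml.feature import OneHotEncoder
OneHotEncoder(dropLast=False,inputCol='indexed_day',outputCol='encoded_day').transform(df_stringindexer).show()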

+---------------+-----------+-------+-----------+-------------+
|Day of the week|temperature|weather|indexed_day|  encoded_day|
+---------------+-----------+-------+-----------+-------------+
|         Monday|         80|  Sunny|        1.0|(4,[1],[1.0])|
|         Friday|         70| Cloudy|        0.0|(4,[0],[1.0])|
|        Tuesday|         40|   Snow|        2.0|(4,[2],[1.0])|
|       Thursday|         60|   Rain|        3.0|(4,[3],[1.0])|
|         Friday|         50|   Rain|        0.0|(4,[0],[1.0])|
+---------------+-----------+-------+-----------+-------------+
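
To read the sparse notation in the tables above, it can help to expand a few entries into ordinary dense lists. The sketch below does this in plain Python; ```to_dense``` is a hypothetical helper written for this illustration, not a Spark function.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a plain list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


monday = to_dense(3, [1], [1.0])           # (3,[1],[1.0])
thursday_dropped = to_dense(3, [], [])     # (3,[],[]): the dropped last category
thursday_kept = to_dense(4, [3], [1.0])    # (4,[3],[1.0]) with dropLast=False

print(monday)
print(thursday_dropped)
print(thursday_kept)
```

With the last category dropped, Thursday is the row of all zeros; with ```dropLast=False``` it gets its own position in a vector that is one element longer.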



### Spark pipeline

A Pipeline is a sequence of stages, each of which is an Estimator or a Transformer. It lets us fit and transform in one go without saving the intermediate results.
In the code below, we pipeline the string-indexing and one-hot-encoding stages for the two categorical variables, "Day of the week" and "weather".

In [44]:
from pyspark.ml import Pipeline
categorical_columns = ['Day of the week', 'weather']
stringindexer_stages = [StringIndexer(inputCol = column,outputCol='stringindexed_'+column)\
                        for column in categorical_columns]
onehotencoder_stages = [OneHotEncoder(dropLast=False,inputCol = 'stringindexed_'+column,outputCol='onehotencoded_'+column)\
                        for column in categorical_columns]

all_stages = stringindexer_stages + onehotencoder_stages

#build pipeline model
pipeline = Pipeline(stages=all_stages)

#fit and transform
df_coded = pipeline.fit(df).transform(df)
#only select interested columns
selected_columns = ['onehotencoded_' + c for c in categorical_columns] + ['temperature']
df_coded = df_coded.select(selected_columns)
df_coded.show()

+-----------------------------+---------------------+-----------+
|onehotencoded_Day of the week|onehotencoded_weather|temperature|
+-----------------------------+---------------------+-----------+
|                (4,[1],[1.0])|        (4,[3],[1.0])|         80|
|                (4,[0],[1.0])|        (4,[1],[1.0])|         70|
|                (4,[2],[1.0])|        (4,[2],[1.0])|         40|
|                (4,[3],[1.0])|        (4,[0],[1.0])|         60|
|                (4,[0],[1.0])|        (4,[0],[1.0])|         50|
+-----------------------------+---------------------+-----------+
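
Conceptually, fitting a pipeline walks through the stages in order: each estimator is fit on the data produced by the stages before it, and the fitted model then transforms the data for the next stage. The plain-Python sketch below mimics that flow. All names in it (```ToyIndexer```, ```ToyIndexerModel```, ```fit_pipeline```, ```transform_pipeline```) are hypothetical, and the indexer assigns indices alphabetically for simplicity rather than by frequency.

```python
class ToyIndexer:
    """Estimator: learns a label -> index mapping for one column."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        # Simplification: alphabetical order, not frequency order.
        labels = sorted({row[self.input_col] for row in rows})
        mapping = {label: i for i, label in enumerate(labels)}
        return ToyIndexerModel(self.input_col, self.output_col, mapping)


class ToyIndexerModel:
    """Transformer: applies a learned mapping and adds the output column."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        return [{**row, self.output_col: self.mapping[row[self.input_col]]}
                for row in rows]


def fit_pipeline(stages, rows):
    """Fit the stages in order; each stage sees the output of the ones before it."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply every fitted stage in order."""
    for model in fitted:
        rows = model.transform(rows)
    return rows


rows = [{"day": "Monday", "weather": "Sunny"},
        {"day": "Friday", "weather": "Rain"}]
stages = [ToyIndexer("day", "day_idx"), ToyIndexer("weather", "weather_idx")]
fitted = fit_pipeline(stages, rows)
coded = transform_pipeline(fitted, rows)
print(coded)
```

The intermediate rows never need to be stored by the caller, which is the convenience a pipeline provides.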



### Other ML package in Spark

Spark also offers a very intuitive machine learning API that is similar to scikit-learn. Please check the MLlib API for the models and functionality you are interested in.

## Shutting off a SparkContext

A SparkContext needs to be shut off manually before a new one can be opened.

In [47]:
sc.stop()

## PySpark in Action: Will It Be on Sale?

In the sections below, we work through an example that uses Spark DataFrames and MLlib to predict whether a pair of shoes will be on sale given its brand, seller, color and price. We will use Spark to read in and preprocess the data, apply StringIndexer, OneHotEncoder and VectorAssembler, and train a logistic regression model with cross-validation. Along the way, we will also use Plot.ly for data visualization.

We obtained the Women's Shoe Price data from Kaggle (https://www.kaggle.com/datafiniti/womens-shoes-prices) and modified it for this tutorial. If you want to follow along, please download the dataset here: [shoes.csv](https://github.com/minxuancao/688tutorial/blob/master/shoes.csv.zip)

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
# start a sparkContext
sc = SparkContext.getOrCreate()
spark = SparkSession(sparkContext=sc)

#read in data
data = spark.read.format("csv").option("header","true").option("inferSchema","true").load("shoes.csv")

#cache for faster access
data.cache() 

DataFrame[id: string, asins: string, brand: string, categories: string, colors: string, count: string, dimension: string, manufacturerNumber: string, name: string, prices.amountMin: double, prices.amountMax: double, prices.availability: string, prices.color: string, prices.condition: string, prices.count: string, prices.currency: string, prices.isSale: boolean, prices.merchant: string, prices.offer: string, prices.returnPolicy: string, prices.shipping: string, prices.size: string, _c22: string, _c23: string, _c24: string, _c25: string]

First, let's extract the features and labels from the DataFrame.

In the csv, the relevant features are brand, prices.amountMax, colors, and prices.merchant.

The label is prices.isSale.

Side note: as there is a "." in column names such as prices.amountMax and prices.isSale, we wrap them in backticks to escape it.

In [2]:
# import only what we need; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import col, when
#select relevant data and rename the columns
selected_data = data.select(col('`prices.amountMax`').alias('price'),
                            col('`prices.isSale`').alias('isSale'),
                            col('`prices.merchant`').alias('merchant'),'brand','colors')
#drop rows that have N/A
selected_data = selected_data.dropna()

#change data type of price and isSale
df = selected_data.withColumn('price',selected_data.price.cast('float'))
df = df.withColumn('isSale',when(col('isSale')=='True',1.0).otherwise(0.0))
#drop rows where the cast produced N/A
df = df.dropna()

Before moving on, let us do some visualizations to understand the data using Plot.ly. See https://plot.ly/ for more info.

In [3]:
import plotly
#import plotly.plotly as py
import plotly.figure_factory as ff
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from plotly import tools
import pandas as pd

init_notebook_mode(connected=True)


In [4]:
#lets look at the first 10 lines of the dataFrame
pdf = df.limit(10).toPandas()
table = ff.create_table(pdf)
iplot(table)


In [5]:
# sale vs regular price data 
df.groupBy("isSale").count().collect()

[Row(isSale=0.0, count=14691), Row(isSale=1.0, count=2450)]

There are about six times as many non-sale rows as sale rows. This imbalance is still workable for logistic regression. Let's move on to see how our features are distributed.
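
As a quick arithmetic check of the class balance, the sketch below works only with the two counts printed above. The "always predict non-sale" baseline is an illustration added here, not something the tutorial's model computes.

```python
non_sale, sale = 14691, 2450          # counts from the groupBy output above
total = non_sale + sale

ratio = non_sale / sale
# Accuracy of a trivial classifier that always predicts "not on sale".
majority_baseline = non_sale / total

print(f"non-sale : sale ratio = {ratio:.2f}")
print(f"always-predict-non-sale accuracy = {majority_baseline:.3f}")
```

Because such a trivial classifier is already right most of the time, plain accuracy is a weak yardstick on this dataset, which is worth keeping in mind when reading the evaluation results later on.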

In [6]:
from plotly.graph_objs import *
#visualize brand distribution
brand_pdf = df.groupBy("brand").count().toPandas()
brand_trace = Bar(x=brand_pdf['brand'],
                y=brand_pdf['count'],
                 name = 'brand')

#visualize merchant distribution
merchant_pdf = df.groupBy("merchant").count().toPandas()
merchant_trace = Bar(x=merchant_pdf['merchant'],
                    y=merchant_pdf['count'],
                    name = 'merchant')

#visualize color representation
color_pdf = df.groupBy("colors").count().toPandas()
color_trace = Bar(x=color_pdf['colors'],
                y=color_pdf['count'],
                 name = 'colors')
#data = [brand_trace, merchant_trace, color_trace]

fig = tools.make_subplots(rows=1, cols=3, subplot_titles=('brand distribution', 'merchant distribution','color distribution'))

fig.append_trace(brand_trace, 1, 1)

fig.append_trace(merchant_trace, 1, 2)

fig.append_trace(color_trace, 1, 3)

fig['layout']['xaxis1'].update(title='brand')
fig['layout']['xaxis2'].update(title='merchant')
fig['layout']['xaxis3'].update(title='color')
fig['layout']['yaxis1'].update(title='count')
fig['layout']['yaxis2'].update(title='count')
fig['layout']['yaxis3'].update(title='count')

fig['layout'].update(height=500, width=1000)
iplot(fig) #feel free to toggle the graph

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]  [ (1,3) x3,y3 ]



In [7]:

trace_sale = Scatter(
    x = df.filter(df['isSale']==1).toPandas()['brand'],
    y = df.filter(df['isSale']==1).toPandas()['price'],
    name = 'sale',
    mode = 'markers',
    marker = dict(
        size = 10,
        color = 'rgba(152, 0, 0, .8)',
        line = dict(
            width = 2,
            color = 'rgb(0, 0, 0)'
        )
    ))

trace_nonsale = Scatter(
    x = df.filter(df['isSale']==0).toPandas()['brand'],
    y = df.filter(df['isSale']==0).toPandas()['price'],
    name = 'non_sale',
    mode = 'markers',
    marker = dict(
        size = 10,
        color = 'rgba(255, 182, 193, .9)',
        line = dict(
            width = 2,
        )
    ))

layout = Layout(
    hovermode= 'closest',
    xaxis= dict(
        title= 'brand',   # x values are brands
        ticklen= 5,
        zeroline= False,
        gridwidth= 2,
    ),
    yaxis=dict(
        title= 'price',   # y values are prices
    ),
    showlegend= True
    )

data = [trace_sale,trace_nonsale]

fig = dict(data=data,layout=layout)
iplot(fig) #feel free to toggle the graph

We notice that the distributions are heavily skewed, especially for merchant and color. It might be better to drop the merchants that have only a small amount of data. We also cannot see a clear boundary between sale and non-sale items in the plot above. This suggests that either the problem is not well posed or more rigorous feature engineering is needed. As our purpose is to demonstrate Spark MLlib rather than to train a good model, we will continue with the model.

In [75]:
#create onehotencoder
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder

categorical_columns = list(set(df.columns)-set(['isSale','price']))
#build stages
stringindexer_stages = [StringIndexer(inputCol = column,outputCol='stringindexed_'+column)\
                        for column in categorical_columns]
onehotencoder_stages = [OneHotEncoder(inputCol = 'stringindexed_'+column,outputCol='onehotencoded_'+column)\
                        for column in categorical_columns]

all_stages = stringindexer_stages + onehotencoder_stages

#build pipeline model
pipeline = Pipeline(stages=all_stages)

#fit pipeline model
pipeline_model = pipeline.fit(df)

#transform data
df_coded = pipeline_model.transform(df)

#remove uncoded columns
selected_columns = ['onehotencoded_' + c for c in categorical_columns] + ['price','isSale']
df_coded = df_coded.select(selected_columns)

df_coded.show()

+-------------------+--------------------+----------------------+-----+------+
|onehotencoded_brand|onehotencoded_colors|onehotencoded_merchant|price|isSale|
+-------------------+--------------------+----------------------+-----+------+
|    (833,[2],[1.0])|   (1593,[29],[1.0])|       (194,[0],[1.0])|62.99|   0.0|
|    (833,[2],[1.0])|   (1593,[29],[1.0])|       (194,[0],[1.0])|62.99|   0.0|
|    (833,[2],[1.0])|   (1593,[29],[1.0])|       (194,[0],[1.0])|62.99|   0.0|
|    (833,[2],[1.0])|   (1593,[29],[1.0])|       (194,[0],[1.0])|62.99|   0.0|
|    (833,[2],[1.0])|   (1593,[29],[1.0])|       (194,[0],[1.0])|62.99|   1.0|
|    (833,[2],[1.0])|   (1593,[29],[1.0])|       (194,[0],[1.0])|62.99|   0.0|
|    (833,[2],[1.0])|   (1593,[29],[1.0])|       (194,[0],[1.0])|56.69|   0.0|
|  (833,[632],[1.0])|  (1593,[911],[1.0])|      (194,[23],[1.0])| 10.0|   0.0|
|  (833,[632],[1.0])|  (1593,[911],[1.0])|      (194,[23],[1.0])|  8.0|   0.0|
|  (833,[114],[1.0])|    (1593,[4],[1.0])|       (19

In [76]:
#create VectorAssembler
from pyspark.ml.feature import VectorAssembler

#feature columns
feature_columns = df_coded.columns[0:4]
#build VectorAssembler instance
vectorassembler = VectorAssembler(inputCols=feature_columns,outputCol='features')
#transform data
df_features = vectorassembler.transform(df_coded)
df_features = df_features.withColumn('label',col('isSale'))
df_features.show()

+-------------------+--------------------+----------------------+-----+------+--------------------+-----+
|onehotencoded_brand|onehotencoded_colors|onehotencoded_merchant|price|isSale|            features|label|
+-------------------+--------------------+----------------------+-----+------+--------------------+-----+
|    (833,[2],[1.0])|   (1593,[29],[1.0])|       (194,[0],[1.0])|62.99|   0.0|(2621,[2,862,2426...|  0.0|
|    (833,[2],[1.0])|   (1593,[29],[1.0])|       (194,[0],[1.0])|62.99|   0.0|(2621,[2,862,2426...|  0.0|
|    (833,[2],[1.0])|   (1593,[29],[1.0])|       (194,[0],[1.0])|62.99|   0.0|(2621,[2,862,2426...|  0.0|
|    (833,[2],[1.0])|   (1593,[29],[1.0])|       (194,[0],[1.0])|62.99|   0.0|(2621,[2,862,2426...|  0.0|
|    (833,[2],[1.0])|   (1593,[29],[1.0])|       (194,[0],[1.0])|62.99|   1.0|(2621,[2,862,2426...|  1.0|
|    (833,[2],[1.0])|   (1593,[29],[1.0])|       (194,[0],[1.0])|62.99|   0.0|(2621,[2,862,2426...|  0.0|
|    (833,[2],[1.0])|   (1593,[29],[1.0])|    
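
The ```features``` column is the concatenation of the input columns, so each one-hot index is shifted by the combined size of the vectors before it. The plain-Python sketch below reproduces the first row of the output above; ```assemble``` is a hypothetical helper for this illustration, and treating ```price``` as a length-1 vector is an assumption about how a scalar column joins the others.

```python
def assemble(parts):
    """Concatenate (size, indices, values) sparse vectors into one."""
    offset, indices, values = 0, [], []
    for size, part_indices, part_values in parts:
        indices.extend(offset + i for i in part_indices)
        values.extend(part_values)
        offset += size
    return offset, indices, values


first_row = [
    (833, [2], [1.0]),     # onehotencoded_brand
    (1593, [29], [1.0]),   # onehotencoded_colors
    (194, [0], [1.0]),     # onehotencoded_merchant
    (1, [0], [62.99]),     # price, treated as a length-1 vector
]
size, indices, values = assemble(first_row)
print(size, indices, values)
```

The total size of 2621 and the leading indices 2, 862 and 2426 agree with the truncated ```(2621,[2,862,2426...``` entries shown in the table.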

In [82]:
#split data into training and test data
training, test = df_features.randomSplit([0.8,0.2],seed=100)

#cross-validation model
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

lr = LogisticRegression(featuresCol='features',labelCol='label')
evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction')
# here we have 3 values each for regParam and elasticNetParam. So, in total, there are 9 models to train and validate.
param_grid = ParamGridBuilder()\
                .addGrid(lr.regParam,[0.0,0.5,1.0])\
                .addGrid(lr.elasticNetParam,[0.0,0.5,1.0])\
                .build()                
# we are using 5-fold cv here. 
cv = CrossValidator(estimator=lr,estimatorParamMaps=param_grid,evaluator=evaluator,numFolds=5)
# do cross validation on training dataset
cv_model = cv.fit(training)
cv_pred = cv_model.transform(training)

#use the cv_model to predict test data
predictions = cv_model.transform(test)

# BinaryClassificationEvaluator reports the area under the ROC curve by default, not accuracy
print("training AUC: ", evaluator.evaluate(cv_pred))
print("test AUC: ",evaluator.evaluate(predictions))


training AUC:  0.9041277431555472
test AUC:  0.8171412334974186
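
To get a feel for how much work the cross-validation above involves, the sketch below counts the model fits implied by the parameter grid and the number of folds. It is plain arithmetic rather than Spark code, and it assumes the usual behaviour of refitting the best setting once on the full training set at the end.

```python
from itertools import product

reg_params = [0.0, 0.5, 1.0]
elastic_net_params = [0.0, 0.5, 1.0]
num_folds = 5

# Every combination of the two parameters is one candidate model.
grid = list(product(reg_params, elastic_net_params))

# Each candidate is trained once per fold during cross-validation.
fits_during_cv = len(grid) * num_folds

# Assumption: the best candidate is refit once on all the training data.
total_fits = fits_during_cv + 1

print(len(grid), fits_during_cv, total_fits)
```

Adding one more value to either parameter list, or one more fold, multiplies the training cost, so grids are best kept small while experimenting.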


## Starting a Spark cluster

Notice that we have only scratched the surface of Spark. Spark is powerful because of its parallel computing and fault tolerance features. To exploit its full power, we need to start a Spark cluster. A Spark cluster has one master (the cluster manager) and one or more workers, which the Spark 2.2 scripts call slaves. The master distributes the work and the workers carry it out. A single machine can act as both master and worker.

[<img src="https://github.com/minxuancao/688tutorial/blob/master/cluster2.png?raw=true">](https://github.com/minxuancao/688tutorial/blob/master/cluster2.png?raw=true)

Spark can run on three cluster managers: standalone, Hadoop YARN and Apache Mesos. We will focus only on starting a standalone cluster. A standalone cluster means that Spark is installed on every computer in the cluster and that the master (cluster manager) is provided by Spark itself.

We will start a standalone cluster on our local machine. That is, our personal computer will act as both the master and the workers. This will not be very powerful, but it is a good way to get a flavor of Spark clusters.

Before we start, we need to download a separate copy of Spark, because the Anaconda version cannot be used to start a master from the terminal.

To be safe, check that you have Java installed:
    
    $ java -version

Download Spark:

    $ curl -O https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz

Untar:

    $ tar zxvf spark-2.2.0-bin-hadoop2.7.tgz

Now, cd into ```spark-2.2.0-bin-hadoop2.7/conf```.

Write the following in spark-env.sh (create the file if it does not exist, for example by copying spark-env.sh.template):

    export SPARK_WORKER_MEMORY=1g
    export SPARK_EXECUTOR_MEMORY=512m
    export SPARK_WORKER_INSTANCES=2
    export SPARK_WORKER_CORES=2
    export SPARK_WORKER_DIR=<some dir to save running logs>

SPARK_WORKER_DIR specifies where the running logs are stored, so point it at a directory of your choice.
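
The settings above translate into the resources we should expect to see for the cluster. The sketch below is plain arithmetic on those values; the variable names are made up for this illustration, and the last figure is only an upper bound by memory, not a statement about how many executors Spark will actually launch.

```python
worker_instances = 2            # SPARK_WORKER_INSTANCES=2
cores_per_worker = 2            # SPARK_WORKER_CORES=2
memory_per_worker_mb = 1024     # SPARK_WORKER_MEMORY=1g
executor_memory_mb = 512        # SPARK_EXECUTOR_MEMORY=512m

total_cores = worker_instances * cores_per_worker
total_memory_mb = worker_instances * memory_per_worker_mb

# Upper bound by memory alone: how many 512m executors fit in one worker.
max_executors_per_worker = memory_per_worker_mb // executor_memory_mb

print(total_cores, total_memory_mb, max_executors_per_worker)
```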

After this, open (or create) the ```slaves``` file in the same directory and write:
    
    localhost 

since all of our workers run on the local machine. On a multi-machine cluster, this file would list the hostname of each worker machine, one per line.

Go back to the Spark root directory and start the master with the following command:

    $ sbin/start-master.sh

Now, go to localhost:8080 in a browser to see the Spark UI.

[<img src="https://github.com/minxuancao/688tutorial/blob/master/before3.png?raw=true">](https://github.com/minxuancao/688tutorial/blob/master/before.png?raw=true)

Start the workers:

    $ sbin/start-slaves.sh

Refresh the page to see the changes in the UI. If you get a connection refused error, enable "Remote Login" for your machine, since the script reaches each worker over SSH.

Now that we have the cluster running, let's open up a shell to run some operations. Unfortunately, it is a Scala shell.

    $ bin/spark-shell --master URL_displaying_on_the_Left_Corner_of_UI_page

[<img src="https://github.com/minxuancao/688tutorial/blob/master/after3.png?raw=true">](https://github.com/minxuancao/688tutorial/blob/master/after2.png?raw=true)

After starting the shell, run these commands and see changes in the UI.

    val file=sc.textFile("README.md")
    file.count()
    file.take(3)

Use Ctrl-C to exit the shell. Use

    $ sbin/stop-master.sh
    $ sbin/stop-slave.sh

to stop the processes.

## Useful Resources

PySpark API: http://spark.apache.org/docs/2.2.0/api/python/index.html

RDD Cheatsheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_Cheat_Sheet_Python.pdf

DataFrame Cheatsheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf

Helpful tutorials with examples:
1. https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/ (also recommends relevant books)
2. https://github.com/MingChen0919/learning-apache-spark