# Spark Fundamentals

Basic RDDs - Don't freak out, these tables are just a reference for later
======================================

Common RDD Constructors
-----------------------

Expression                               |Meaning
----------                               |-------
`sc.parallelize(iterable)`               |Create RDD of elements of some iterable
`sc.textFile(path)`                      |Create RDD of lines from file

Common Transformations
----------------------

Expression                               |Meaning
----------                               |-------
`filter(predicate)`                      |Keep only the elements for which the predicate returns True
`map(some function)`                     |Apply some function to every element
`flatMap(some function)`                 |Apply some function that returns an iterable to every element, then flatten the results
`sample(withReplacement, fraction)`      |Sample roughly the given fraction of the data, with or without replacement
`distinct()`                             |Remove duplicate elements
`sortBy(key function, ascending=True)`   |Sort elements by the key the function returns, in the given order
`randomSplit([weight1, weight2], seed)`  |Randomly split the data into one RDD per weight
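
To make a few of these rows concrete, here is a plain-Python analogy that runs on an ordinary list rather than an RDD, so it needs no Spark installation. It only mirrors what `filter`, `map`, `flatMap`, and `distinct` return; Spark evaluates them lazily and in parallel, and `distinct()` makes no promise about order. The sample `lines` are made up for illustration.

```python
from itertools import chain

# Hypothetical sample data standing in for the elements of an RDD.
lines = ["a b", "c", "a b"]

# filter(f): keep the elements for which f returns True.
filtered = [x for x in lines if " " in x]

# map(f): exactly one output element per input element.
mapped = [x.split(" ") for x in lines]

# flatMap(f): f returns an iterable per element; the results are flattened.
flat_mapped = list(chain.from_iterable(x.split(" ") for x in lines))

# distinct(): drop duplicates (sorted here only to get a stable order).
distinct = sorted(set(lines))

print(filtered)     # ['a b', 'a b']
print(mapped)       # [['a', 'b'], ['c'], ['a', 'b']]
print(flat_mapped)  # ['a', 'b', 'c', 'a', 'b']
print(distinct)     # ['a b', 'c']
```

The difference between `map` and `flatMap` is the one to remember: `map` keeps one (possibly nested) result per element, while `flatMap` merges all the returned items into a single flat collection.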

Common Key Pair RDD Transformations
----------------------------------

Expression                               |Meaning
----------                               |-------
`groupByKey()`                           |Group the values of each key into an iterable
`reduceByKey(some function)`             |Merge the values of each key using some function (it should be associative and commutative)
`mapValues(some function)`               |Apply some function to each value, leaving the keys unchanged
`flatMapValues(some function)`           |Apply some function that returns an iterable to each value, and emit one key value pair per returned item
`keys()`                                 |Returns the keys of a key value RDD
`values()`                               |Returns the values of a key value RDD
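
The difference between `groupByKey` and `reduceByKey` is easier to see on a small example. The sketch below is a plain-Python model of the two results, not Spark code; the helper names `group_by_key` and `reduce_by_key` and the sample `pairs` are invented for illustration.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (key, value) pairs standing in for a key value RDD.
pairs = [("KS", 1), ("OH", 0), ("KS", 1), ("OH", 1), ("NJ", 0)]

def group_by_key(pairs):
    """Model of groupByKey(): collect every value of a key into a list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_by_key(func, pairs):
    """Model of reduceByKey(func): fold the values of each key with func."""
    grouped = group_by_key(pairs)
    return {key: reduce(func, values) for key, values in grouped.items()}

grouped = group_by_key(pairs)
reduced = reduce_by_key(lambda v1, v2: v1 + v2, pairs)
print(grouped)  # {'KS': [1, 1], 'OH': [0, 1], 'NJ': [0]}
print(reduced)  # {'KS': 2, 'OH': 1, 'NJ': 0}
```

In Spark itself, `reduceByKey` combines values inside each partition before shuffling data between workers, which is why it is usually preferred over `groupByKey` followed by a manual reduction.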

Common Multiple RDD Transformations
----------------------------------

Expression                               |Meaning
----------                               |-------
`union(another rdd)`                     |Append the elements of another RDD to the current RDD (duplicates are kept)
`join(another rdd)`                      |Inner join: keep the keys present in both RDDs, pairing their values
`leftOuterJoin(another rdd)`             |Keep every key of the current RDD; the other RDD's value is `None` where it has no match
`rightOuterJoin(another rdd)`            |Keep every key of the other RDD; the current RDD's value is `None` where it has no match
`zip(another rdd)`                       |Pair elements by position into a key value RDD (both RDDs need the same number of partitions and of elements per partition)
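
The three joins differ only in which keys survive. The following is a plain-Python model of their results on two tiny made-up datasets; the function names mimic the RDD methods, but this is an illustrative sketch rather than Spark code.

```python
# Hypothetical key value data standing in for two pair RDDs.
left = [("KS", 1), ("OH", 2)]
right = [("OH", "x"), ("NJ", "y")]

def join(left, right):
    """Model of join(): keep only the keys present on both sides."""
    return [(k, (lv, rv)) for k, lv in left for rk, rv in right if k == rk]

def left_outer_join(left, right):
    """Model of leftOuterJoin(): keep every left key, None where right has no match."""
    right_keys = {k for k, _ in right}
    unmatched = [(k, (lv, None)) for k, lv in left if k not in right_keys]
    return join(left, right) + unmatched

def right_outer_join(left, right):
    """Model of rightOuterJoin(): keep every right key, None where left has no match."""
    left_keys = {k for k, _ in left}
    unmatched = [(k, (None, rv)) for k, rv in right if k not in left_keys]
    return join(left, right) + unmatched

print(join(left, right))                      # [('OH', (2, 'x'))]
print(sorted(left_outer_join(left, right)))   # [('KS', (1, None)), ('OH', (2, 'x'))]
print(sorted(right_outer_join(left, right)))  # [('NJ', (None, 'y')), ('OH', (2, 'x'))]
```

Results are sorted here only to make the printed order predictable; Spark does not guarantee the order of joined elements.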

Common Actions
--------------

Expression                             |Meaning
----------                             |-------
`collect()`                            |Return all elements as an in-memory list on the driver
`take(n)`                              |First n elements of RDD
`top(n)`                               |Largest n elements of RDD, in descending order
`takeSample(withReplacement, n)`       |Random sample of n elements, with or without replacement
`sum()`                                |Find element sum (assumes numeric elements)
`mean()`                               |Find element mean (assumes numeric elements)
`stdev()`                              |Find element standard deviation (assumes numeric elements)
`takeOrdered(n, key function)`         |Smallest n elements, ordered by the value the key function returns
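
`take`, `top`, and `takeOrdered` are easy to mix up, so here is a plain-Python analogy using the standard library on a made-up list. It is a sketch of what each action returns, not Spark code. Note that `stdev()` on an RDD is the population standard deviation, which corresponds to `statistics.pstdev` here.

```python
import heapq
import statistics

# Hypothetical numeric elements standing in for an RDD.
values = [3, 1, 4, 1, 5]

first_two = values[:2]                      # like take(2): first elements, no sorting
top_two = heapq.nlargest(2, values)         # like top(2): largest, descending
smallest_two = heapq.nsmallest(2, values)   # like takeOrdered(2): smallest, ascending
# like takeOrdered(2, lambda x: -x): the key function reverses the order
largest_by_key = heapq.nsmallest(2, values, key=lambda x: -x)

total = sum(values)                         # like sum()
mean = statistics.mean(values)              # like mean()
stdev = statistics.pstdev(values)           # like stdev(): population standard deviation

print(first_two, top_two, smallest_two, largest_by_key)  # [3, 1] [5, 4] [1, 1] [5, 4]
print(total, mean)                                       # 14 2.8
```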

### Step 1: import pyspark

In [1]:
import pyspark as ps

### Step 2: initialize a spark context (RDD manager)

In [4]:
# to check # of cores on computer
import multiprocessing
multiprocessing.cpu_count()

4

In [2]:
sc = ps.SparkContext('local[4]')  #use all 4 cores
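
Rather than hard-coding the core count, the master string can be built from the value reported by `multiprocessing`. This is a small sketch; the variable names `n_cores` and `master` are just illustrative.

```python
import multiprocessing

# Build the master URL from the detected core count instead of hard-coding 4.
n_cores = multiprocessing.cpu_count()
master = 'local[{}]'.format(n_cores)
print(master)  # e.g. local[4] on a four-core machine
```

The resulting string would be passed as `ps.SparkContext(master)`. Spark also accepts `'local[*]'`, which uses as many worker threads as there are logical cores.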

### Step 3: Construct an RDD with the data (we will be using churn.csv)

In [5]:
churn_rdd = sc.textFile('churn.csv')

### Step 4: Let's look at the first two lines to understand the format that textFile creates

In [6]:
churn_rdd.take(2)

[u"State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?",
 u'KS,128,415,382-4657,no,yes,25,265.100000,110,45.070000,197.400000,99,16.780000,244.700000,91,11.010000,10.000000,3,2.700000,1,False.']

### Step 5: We need to split the data by commas.

In [10]:
churn_rdd = churn_rdd.map(lambda x: x.split(','))

In [11]:
churn_rdd.take(2)

[[u'State',
  u'Account Length',
  u'Area Code',
  u'Phone',
  u"Int'l Plan",
  u'VMail Plan',
  u'VMail Message',
  u'Day Mins',
  u'Day Calls',
  u'Day Charge',
  u'Eve Mins',
  u'Eve Calls',
  u'Eve Charge',
  u'Night Mins',
  u'Night Calls',
  u'Night Charge',
  u'Intl Mins',
  u'Intl Calls',
  u'Intl Charge',
  u'CustServ Calls',
  u'Churn?'],
 [u'KS',
  u'128',
  u'415',
  u'382-4657',
  u'no',
  u'yes',
  u'25',
  u'265.100000',
  u'110',
  u'45.070000',
  u'197.400000',
  u'99',
  u'16.780000',
  u'244.700000',
  u'91',
  u'11.010000',
  u'10.000000',
  u'3',
  u'2.700000',
  u'1',
  u'False.']]
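
Splitting on `','` works for this file because no field contains a comma. If a field could be quoted and contain one, a named helper built on Python's `csv` module is a safer sketch. The name `parse_line` and the quoted sample value are assumptions for illustration.

```python
import csv

def parse_line(line):
    """Split one CSV line into fields, honoring double-quoted fields."""
    return next(csv.reader([line]))

print(parse_line('KS,128,415,382-4657,no,yes'))
# ['KS', '128', '415', '382-4657', 'no', 'yes']

# Hypothetical quoted field containing a comma: a plain split(',') would break it in two.
print(parse_line('"Smith, John",128'))
# ['Smith, John', '128']
```

It could be passed to `map` in place of the lambda, as in `sc.textFile('churn.csv').map(parse_line)`.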

### Step 6: Extract the headers

In [16]:
header = churn_rdd.first()

### Step 7: Remove the header from the data

In [18]:
# churn_rdd.filter(lambda x: x != header).take(2)
churn_rdd = churn_rdd.filter(lambda x: x != header)
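
Once the header is separated from the data, it can double as a lookup table so that columns are addressed by name instead of by position such as `x[-1]`. The sketch below uses the header and first data row shown in the output above; the dictionary name `col` is an assumption.

```python
# Header and first data row, copied from the take(2) output above.
header = ("State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,"
          "VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,"
          "Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,"
          "Intl Calls,Intl Charge,CustServ Calls,Churn?").split(',')
row = ("KS,128,415,382-4657,no,yes,25,265.100000,110,45.070000,197.400000,99,"
       "16.780000,244.700000,91,11.010000,10.000000,3,2.700000,1,False.").split(',')

# Map each column name to its position so rows can be indexed by name.
col = {name: i for i, name in enumerate(header)}

print(row[col['State']])           # KS
print(row[col['CustServ Calls']])  # 1
print(row[col['Churn?']])          # False.
```

With the `header` list from Step 6, the same dictionary could be used inside the later lambdas, for example `x[col['Churn?']]` instead of `x[-1]`.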

### Step 8: Finding total churn

In [20]:
# Convert the churn label to 1 (churned) or 0 (did not churn), then sum
churn_rdd.map(lambda x: 0 if x[-1]=='False.' else 1) \
          .sum()

483
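
The same rule is reused in the following steps, so it can help to give it a name. This is a sketch of a hypothetical helper, `churn_flag`, that applies exactly the rule of the lambda above; the three sample rows are invented and trimmed to three fields.

```python
def churn_flag(row):
    """Return 0 when the last field is 'False.', otherwise 1."""
    return 0 if row[-1] == 'False.' else 1

# Hypothetical rows. 'True.' is an assumed label: any value other than
# 'False.' counts as churn, matching the lambda in Step 8.
rows = [['KS', '1', 'False.'], ['OH', '4', 'True.'], ['NJ', '0', 'False.']]
print(sum(churn_flag(r) for r in rows))  # 1
```

On the RDD this would read `churn_rdd.map(churn_flag).sum()`. Note the trailing period in `'False.'`: comparing against `'False'` would silently count every row as churned.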

### Step 9: Finding total churn per State

In [23]:
churn_rdd.map(lambda x: (x[0], 0 if x[-1]=='False.' else 1 )) \
         .reduceByKey(lambda v1, v2: v1 + v2) \
         .take(10)

[(u'WA', 14),
 (u'WI', 7),
 (u'FL', 8),
 (u'WY', 9),
 (u'NH', 9),
 (u'NJ', 18),
 (u'TX', 18),
 (u'ND', 6),
 (u'TN', 5),
 (u'CA', 9)]
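
Keep in mind that `take(10)` returns ten arbitrary states, not the ten with the most churn. To rank them, the pairs have to be ordered by count. The sketch below does that in plain Python on the ten pairs printed above.

```python
# The ten (state, churn count) pairs printed above.
state_churn = [('WA', 14), ('WI', 7), ('FL', 8), ('WY', 9), ('NH', 9),
               ('NJ', 18), ('TX', 18), ('ND', 6), ('TN', 5), ('CA', 9)]

# Rank by churn count, highest first; ties are broken by state code.
ranked = sorted(state_churn, key=lambda kv: (-kv[1], kv[0]))
print(ranked[:3])  # [('NJ', 18), ('TX', 18), ('WA', 14)]
```

On the RDD itself, replacing `.take(10)` with `.takeOrdered(3, key=lambda kv: (-kv[1], kv[0]))` would apply the same ordering across all states rather than just these ten.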

### Step 10: Finding customer service calls per churned customer, by state

In [27]:
state_churn = churn_rdd.map(lambda x: (x[0], 0 if x[-1]=='False.' else 1 )) \
                       .reduceByKey(lambda v1, v2: v1 + v2)
state_call = churn_rdd.map(lambda x: (x[0], int(x[-2]))) \
                      .reduceByKey(lambda v1, v2: v1 + v2)

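# Each joined value is a (churn, call) tuple. Index into it rather than
# unpacking it in the lambda signature, which Python 3 does not allow.
state_churn.join(state_call) \
           .mapValues(lambda v: 1. * v[1] / v[0]) \
           .take(10)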

[(u'WA', 7.214285714285714),
 (u'WI', 15.857142857142858),
 (u'FL', 12.375),
 (u'WY', 12.333333333333334),
 (u'NH', 9.444444444444445),
 (u'NJ', 6.333333333333333),
 (u'TX', 6.444444444444445),
 (u'ND', 15.0),
 (u'TN', 14.0),
 (u'CA', 5.555555555555555)]
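
The division above fails with a `ZeroDivisionError` for any state that has no churned customers. A guarded version of the ratio is sketched below; the helper name `calls_per_churn` and the sample counts are assumptions for illustration.

```python
def calls_per_churn(pair):
    """pair is (churn_count, call_count); return calls per churned customer,
    or None when the state has no churned customers."""
    churn, calls = pair
    return float(calls) / churn if churn else None

# Hypothetical counts.
print(calls_per_churn((14, 101)))  # 7.214285714285714
print(calls_per_churn((0, 12)))    # None
```

It would be used as `state_churn.join(state_call).mapValues(calls_per_churn)`.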

### Caching RDDs to keep reused data in memory

In [28]:
cached_churn_rdd = churn_rdd.persist()
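
Why this helps: without `persist()`, every action re-runs the whole chain of transformations, starting from reading the file. The toy class below is not Spark, just a minimal plain-Python model of that behavior, with invented names (`ToyRDD`, `load`, `loads`).

```python
class ToyRDD:
    """Toy stand-in for an RDD: recomputes its data on every action unless persisted."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the data from scratch
        self._persisted = False
        self._cache = None

    def persist(self):
        self._persisted = True    # only marks the data for caching; nothing runs yet
        return self

    def collect(self):
        if self._persisted:
            if self._cache is None:
                self._cache = self._compute()  # computed once, on the first action
            return self._cache
        return self._compute()                 # recomputed on every action

loads = []  # records each time the expensive load runs

def load():
    loads.append(1)
    return [1, 2, 3]

plain = ToyRDD(load)
plain.collect()
plain.collect()
print(len(loads))  # 2

del loads[:]
cached = ToyRDD(load).persist()
cached.collect()
cached.collect()
print(len(loads))  # 1
```

As in the toy model, calling `persist()` on a real RDD is lazy: the data is only stored when the first action runs. `unpersist()` releases it again.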

### Practice #1: What's the min, mean, and max night charge for users that churned?

### Practice #2: How many of the churned users have a VMail plan?

### Practice #3: Which state has the most day calls?