# Spark Fundamentals

Basic RDDs - Don't freak out, these tables are just a reference for later
======================================

Common RDD Constructors
-----------------------

Expression                               |Meaning
----------                               |-------
`sc.parallelize(iterable)`               |Create RDD of elements of some iterable
`sc.textFile(path)`                      |Create RDD of lines from file

Common Transformations
----------------------

Expression                               |Meaning
----------                               |-------
`filter(some function)`                  |Keep only the elements for which the function returns True
`map(some function)`                     |Apply some function to each element, producing one output per input
`flatMap(some function)`                 |Apply some function that returns an iterable to each element and flatten the results
`sample(withReplacement, fraction)`      |Randomly sample roughly the given fraction of the data, with or without replacement
`distinct()`                             |Remove duplicates in RDD
`sortBy(key function, ascending=True)`   |Sort elements by the key returned by the function, in the designated order
`randomSplit([ratio1, ratio2], seed)`    |Randomly split your data into one RDD per ratio, sized in proportion to the ratio array
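
The difference between `map` and `flatMap` is the one that trips people up most often, so here is a minimal sketch of the semantics in plain Python (no Spark needed). The list comprehensions are stand-ins for the RDD methods, not the PySpark API itself, and the sample `lines` are made up for illustration.

```python
from itertools import chain

# Hypothetical input, standing in for the elements of an RDD.
lines = ["a b", "c", "a b"]

# map: exactly one output element per input element (here, a list per line).
mapped = [line.split(" ") for line in lines]

# flatMap: each input yields an iterable, and the results are flattened.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

# filter: keep only the elements for which the function returns True.
filtered = [line for line in lines if " " in line]

# distinct: drop duplicates. Spark gives no ordering guarantee, so sort here.
distinct = sorted(set(lines))

print(mapped)       # [['a', 'b'], ['c'], ['a', 'b']]
print(flat_mapped)  # ['a', 'b', 'c', 'a', 'b']
print(filtered)     # ['a b', 'a b']
print(distinct)     # ['a b', 'c']
```

On a real RDD the equivalent calls would be along the lines of `rdd.map(lambda l: l.split(" "))` and `rdd.flatMap(lambda l: l.split(" "))`.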

Common Key Pair RDD Transformations
----------------------------------

Expression                               |Meaning
----------                               |-------
`groupByKey()`                           |Collapse a key value RDD by the key, keeping each key's values in an iterable
`reduceByKey(some function)`             |Collapse a key value RDD by the key, combining each key's values with some function
`mapValues(some function)`               |Apply some function to the values of a key value RDD, leaving the keys unchanged
`flatMapValues(some function)`           |Apply some function that returns an iterable to each value, emitting one key value pair per item
`keys()`                                 |Returns the keys of a key value RDD
`values()`                               |Returns the values of a key value RDD
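
To make the difference between `groupByKey` and `reduceByKey` concrete, here is a rough plain-Python sketch of what each one produces. The helpers `group_by_key` and `reduce_by_key` are hypothetical stand-ins written for this example, not PySpark functions, and the pairs are illustrative data.

```python
from collections import defaultdict

# Hypothetical (key, value) pairs, standing in for a key value RDD.
pairs = [("WA", 1), ("WI", 2), ("WA", 3)]

def group_by_key(pairs):
    """Stand-in for groupByKey: collect every value for a key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_by_key(pairs, func):
    """Stand-in for reduceByKey: fold each key's values together with func."""
    reduced = {}
    for key, value in pairs:
        reduced[key] = func(reduced[key], value) if key in reduced else value
    return reduced

grouped = group_by_key(pairs)
reduced = reduce_by_key(pairs, lambda v1, v2: v1 + v2)

# Stand-in for mapValues: transform the value, keep the key as it is.
doubled = [(key, value * 2) for key, value in pairs]

print(grouped)  # {'WA': [1, 3], 'WI': [2]}
print(reduced)  # {'WA': 4, 'WI': 2}
print(doubled)  # [('WA', 2), ('WI', 4), ('WA', 6)]
```

In Spark itself, `reduceByKey` is usually the better choice when you only need a combined value, because values are combined within each partition before being shuffled. The function you pass should be associative and commutative.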

Common Multiple RDD Transformations
----------------------------------

Expression                               |Meaning
----------                               |-------
`union(another rdd)`                     |Append another RDD to current RDD (duplicates are kept)
`join(another rdd)`                      |Inner join: keep only the keys present in both RDDs
`leftOuterJoin(another rdd)`             |Keep every key of the current RDD; values missing from the other RDD become None
`rightOuterJoin(another rdd)`            |Keep every key of the other RDD; values missing from the current RDD become None
`zip(another rdd)`                       |Pair up elements by position to form a key value pair RDD (both RDDs need the same number of partitions and elements per partition)
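
The three join flavors differ only in which keys survive. The sketch below mimics that in plain Python; `join_pairs` is a hypothetical helper written for this example (not part of PySpark), and the numbers are illustrative.

```python
def join_pairs(left, right, how="inner"):
    """Stand-in for join / leftOuterJoin / rightOuterJoin on (key, value) lists."""
    left_keys = {key for key, _ in left}
    out = []
    for key, left_value in left:
        matches = [rv for rk, rv in right if rk == key]
        if matches:
            out.extend((key, (left_value, rv)) for rv in matches)
        elif how == "left":
            # Key only exists on the left: pad the right side with None.
            out.append((key, (left_value, None)))
    if how == "right":
        # Keys that only exist on the right: pad the left side with None.
        out.extend((key, (None, rv)) for key, rv in right if key not in left_keys)
    return sorted(out, key=lambda kv: kv[0])

# Hypothetical per-state values.
left = [("WA", 14), ("WI", 7)]
right = [("WA", 101), ("FL", 55)]

inner = join_pairs(left, right)
left_outer = join_pairs(left, right, how="left")
right_outer = join_pairs(left, right, how="right")

print(inner)        # [('WA', (14, 101))]
print(left_outer)   # [('WA', (14, 101)), ('WI', (7, None))]
print(right_outer)  # [('FL', (None, 55)), ('WA', (14, 101))]
```

Note that the joined value is always a `(left value, right value)` tuple, which is why a `mapValues` after a join typically indexes into a pair.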

Common Actions
--------------

Expression                             |Meaning
----------                             |-------
`collect()`                            |Return all elements of the RDD to the driver as an in-memory list
`take(n)`                              |First n elements of RDD
`top(n)`                               |Largest n elements of RDD, in descending order
`takeSample(withReplacement, n)`       |Create random sample of n elements, with or without replacement
`sum()`                                |Find element sum (assumes numeric elements)
`mean()`                               |Find element mean (assumes numeric elements)
`stdev()`                              |Find element standard deviation (assumes numeric elements)
`takeOrdered(n, function)`             |Returns the smallest n elements in ascending order, ranked by the value returned by the function
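
`take`, `top`, and `takeOrdered` are easy to mix up. As a rough guide, this plain-Python sketch uses `heapq` to mirror what each action returns; it is an analogy for the results, not the Spark implementation, and `data` is made up.

```python
import heapq

# Hypothetical RDD contents.
data = [5, 1, 4, 2, 3]

# take(2): the first two elements in their existing order, no sorting.
first_two = data[:2]

# top(2): the two largest elements, in descending order.
top_two = heapq.nlargest(2, data)

# takeOrdered(2): the two smallest elements, in ascending order.
ordered_two = heapq.nsmallest(2, data)

# takeOrdered(2, lambda x: -x): rank by the negated value to get the largest.
ordered_desc = heapq.nsmallest(2, data, key=lambda x: -x)

print(first_two)     # [5, 1]
print(top_two)       # [5, 4]
print(ordered_two)   # [1, 2]
print(ordered_desc)  # [5, 4]
```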

### Step 1: import pyspark

In [1]:
import pyspark as ps

### Step 2: initialize a spark context (RDD manager)

In [2]:
sc = ps.SparkContext('local[4]')

### Step 3: Construct an RDD with the data (we will be using churn.csv)

In [4]:
churn_rdd = sc.textFile('churn.csv')

### Step 4: Let's look at the first two lines to understand the format that textFile creates

In [7]:
churn_rdd.take(2)

[u"State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?",
 u'KS,128,415,382-4657,no,yes,25,265.100000,110,45.070000,197.400000,99,16.780000,244.700000,91,11.010000,10.000000,3,2.700000,1,False.']

### Step 5: We need to split the data by commas.

In [8]:
churn_rdd = churn_rdd.map(lambda x: x.split(','))
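
A plain `split(',')` works for this file because none of the fields shown contain commas. If a CSV has quoted fields with embedded commas, it silently produces the wrong columns. A hedged alternative is to parse each line with the standard `csv` module; `parse_csv_line` is a hypothetical helper name and the sample line is invented for illustration.

```python
import csv

def parse_csv_line(line):
    """Parse one CSV line into a list of fields, honoring quoted commas."""
    return next(csv.reader([line]))

# Made-up line with a quoted field that contains a comma.
sample = 'KS,"Smith, Jane",128'

naive = sample.split(',')
parsed = parse_csv_line(sample)

print(naive)   # ['KS', '"Smith', ' Jane"', '128']
print(parsed)  # ['KS', 'Smith, Jane', '128']
```

Assuming the helper is defined in the driver session, it could be passed straight to the RDD, e.g. `sc.textFile('churn.csv').map(parse_csv_line)`, in place of the lambda.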

### Step 6: Extract the headers

In [11]:
headers = churn_rdd.first() # first row: a list of the column names

### Step 7: Remove the header from the data

In [14]:
churn_rdd = churn_rdd.filter(lambda x: x != headers)

### Step 8: Finding total churn

In [22]:
(
    churn_rdd.map(lambda x: x[-1] != 'False.')
             .sum()
)

483

### Step 9: Finding total churn per State

In [29]:
(
    churn_rdd.map(lambda x: (x[0], x[-1] != 'False.'))
             .reduceByKey(lambda v1, v2: v1 + v2)
             .take(5)
)

[(u'WA', 14), (u'WI', 7), (u'FL', 8), (u'WY', 9), (u'NH', 9)]

### Step 10: Finding Customer Service Calls per churned customer for each State

##### Tuple of tuples

In [38]:
(
    churn_rdd.map(lambda x: (x[0], (int(x[-2]), x[-1] != 'False.')))
             .reduceByKey(lambda val1, val2: (val1[0] + val2[0], val1[1] + val2[1]))
             .mapValues(lambda v: 1. * v[0] / v[1])  # index the pair; lambda (x, y) is Python 2 only
             .take(2)
)

[(u'WA', 7.214285714285714), (u'WI', 15.857142857142858)]
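
The tuple-of-tuples trick carries two running totals through a single `reduceByKey`. The plain-Python sketch below shows the same combiner on one state's rows using `functools.reduce`. The rows are hypothetical, and the zero check is an added assumption for states where nobody churned (the pipeline above would raise a division error there).

```python
from functools import reduce

# Hypothetical (customer service calls, churned?) pairs for a single state.
rows = [(1, False), (4, True), (3, True), (0, False)]

# Same combiner as the reduceByKey call: add the pairs component-wise.
# Booleans add as integers, so the second slot counts churned customers.
calls, churned = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), rows)

# Guard against dividing by zero when a state has no churned customers.
ratio = float(calls) / churned if churned else None

print(calls, churned, ratio)  # 8 2 4.0
```

One thing worth noticing: the numerator sums calls from all customers in the state, not just the churned ones, so the result is total calls divided by the number of churned customers.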

#### Joins

In [39]:
cust_rdd = ( 
                churn_rdd.map(lambda x: (x[0], int(x[-2])))
                         .reduceByKey(lambda v1, v2: v1 + v2)
           )


In [43]:
state_rdd = (
                churn_rdd.map(lambda x: (x[0], x[-1] != 'False.'))
                         .reduceByKey(lambda v1, v2: v1 + v2)
            )

In [48]:
(
    state_rdd.join(cust_rdd)
             .mapValues(lambda v: 1. * v[1] / v[0])  # v is (churn count, total calls)
             .take(2)
)

[(u'WA', 7.214285714285714), (u'WI', 15.857142857142858)]

### Caching RDDs in memory so they are not recomputed on every action

In [49]:
cached_churn_rdd = churn_rdd.persist()

### Practice #1: What's the min, mean, and max night charge for users that churned?

### Practice #2: How many of the churned users have a VMail plan?

### Practice #3: Which state has the most day calls?