# PySpark Interactive Shell Quickstart

These notes follow the official [Spark quick start tutorial](http://spark.apache.org/docs/latest/quick-start.html).

## Basics

In [1]:
sc

<pyspark.context.SparkContext at 0x10f032cd0>

In [2]:
text_file = sc.textFile("README.md")

In [3]:
text_file.count()

32

In [4]:
text_file.first()

u'# Apache Spark'

In [5]:
lines_with_spark = text_file.filter(lambda line: "Spark" in line)
lines_with_spark.count()

11

## More on RDD Operations

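We can do some old school `map` and `reduce` on the text file to find the line with the most words.  
Note that `len(line.split())` counts words, so the result is a word count, not a character count.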

In [6]:
text_file.map(lambda line: len(line.split())).reduce(lambda a, b: a if a > b else b)

66

Another way to do this is to define named functions and pass them to `map` and `reduce` instead of lambdas.  
This makes it much easier to follow what is happening.

First we map the `len_line` function over `text_file`.  
`map` applies our function to every line and returns a new RDD holding one word count per line.

Then we pass the `max` function to the `reduce` step.  
`reduce` takes two elements at a time, keeps whatever the function returns, and repeats until a single value is left.

I don't fully understand how `map` and `reduce` are executed under the hood yet, but the following code successfully finds the word count of the longest line.

In [7]:
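# Note: this shadows Python's built-in max, which would also work with reduce.
def max(a, b):
    if a > b:
        return a
    else:
        return b

# Count the whitespace-separated words in a line.
def len_line(line):
    return len(line.split())

text_file.map(len_line).reduce(max)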

66
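
For anyone equally unsure what `map` and `reduce` are doing, here is a minimal plain-Python sketch of the same computation with no Spark involved. The `lines` list is made-up sample data standing in for the RDD, `larger` is a hypothetical name chosen to avoid shadowing the built-in `max`, and `functools.reduce` stands in for the RDD's `reduce`. Spark spreads the work across partitions, so this mirrors only the logic, not how it is executed.

```python
from functools import reduce  # a built-in on Python 2, an import on Python 3

# Hypothetical sample data standing in for the lines of README.md.
lines = [
    "# Apache Spark",
    "Spark is a fast and general cluster computing system",
    "## Building Spark",
]

def len_line(line):
    """Number of whitespace-separated words in a line."""
    return len(line.split())

def larger(a, b):
    """Return the larger of two values."""
    return a if a > b else b

# map: apply len_line to every element independently.
word_counts = list(map(len_line, lines))  # [3, 9, 3]

# reduce: fold the list pairwise, i.e. larger(larger(3, 9), 3).
most_words = reduce(larger, word_counts)  # 9
```

One point carries over to Spark: because partitions can be combined in any order, the function given to `reduce` should be commutative and associative, which taking a maximum is.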

Aw yeah, the same chain can also be written one call per line, using backslash continuations, instead of as one long line.  
I like this much better for readability.

In [8]:
text_file \
    .map(lambda line: len(line.split())) \
    .reduce(lambda a, b: a if a > b else b)

66

Next up is a word count: `flatMap` splits each line into words, `map` pairs each word with a 1, and `reduceByKey` adds up the 1s for each distinct word.

In [11]:
wordCounts = text_file \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

wordCounts.collect()


[(u'all', 1),
 (u'Requirements', 1),
 (u'R,', 1),
 (u'including', 2),
 (u'computation', 1),
 (u'standalone,', 1),
 (u'PySpark.', 1),
 (u'Scala,', 1),
 (u'only', 1),
 (u'rich', 1),
 (u'Apache', 1),
 (u'guide,', 1),
 (u'Mesos)', 1),
 (u'matches', 1),
 (u'not', 2),
 (u'Spark', 13),
 (u'documentation,', 1),
 (u'cluster.', 1),
 (u'compatibility).', 1),
 (u'GraphX', 1),
 (u'[project', 1),
 (u'##', 3),
 (u'related', 1),
 (u'see', 1),
 (u'download', 1),
 (u'[Apache', 1),
 (u'odd', 1),
 (u'best', 1),
 (u'#', 1),
 (u'0.10.4),', 1),
 (u'processing.', 1),
 (u'(be', 1),
 (u'please', 1),
 (u'provides', 1),
 (u'supports', 2),
 (u'we', 1),
 (u'This', 3),
 (u'packaged', 1),
 (u'change', 1),
 (u'programming', 1),
 (u'experience', 1),
 (u'Packaging', 1),
 (u'ensure', 1),
 (u'experimental', 1),
 (u'errors.', 1),
 (u'packaging', 2),
 (u'tools', 2),
 (u'use', 1),
 (u'from', 2),
 (u'fast', 1),
 (u'<http://spark.apache.org/>', 1),
 (u'README', 1),
 (u'numpy', 1),
 (u'minor', 1),
 (u'engine', 1),
 (u'building'
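
To make the `flatMap`, `map`, `reduceByKey` chain less magical, here is a rough plain-Python sketch of what each step does to the data. The `lines` list is made-up sample data, and the loop with a `defaultdict` is only an illustration of the grouping logic, not how Spark implements `reduceByKey`.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the RDD of lines.
lines = ["Spark is fast", "Spark is general"]

# flatMap: split every line and flatten the results into one list of words.
words = [word for line in lines for word in line.split()]

# map: pair each word with a count of 1.
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values that share a key using a + b.
totals = defaultdict(int)
for word, count in pairs:
    totals[word] = totals[word] + count

word_counts = dict(totals)
```

With this sample input, `word_counts` ends up mapping `"Spark"` and `"is"` to 2 and the other two words to 1.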

Now let's do this again, but with the results sorted by count.  
Unfortunately, sorting by value with `sortByKey` in PySpark is a bit of a hack. :(

`sortByKey` only sorts on the key, so to sort by the values I have to swap key and value with the `map` function... twice.  
Swap, sort, then swap back.

In [13]:
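# Tuple-unpacking lambdas such as `lambda (a, b): ...` only parse on Python 2,
# so index into the pair instead; this form works on Python 2 and 3.
wordCounts = text_file \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .map(lambda kv: (kv[1], kv[0])) \
    .sortByKey(ascending=False, numPartitions=1) \
    .map(lambda kv: (kv[1], kv[0]))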

wordCounts.collect()

[(u'Spark', 13),
 (u'the', 9),
 (u'and', 8),
 (u'for', 8),
 (u'is', 4),
 (u'Python', 4),
 (u'you', 4),
 (u'a', 4),
 (u'version', 4),
 (u'of', 4),
 (u'to', 4),
 (u'##', 3),
 (u'This', 3),
 (u'cluster', 3),
 (u'including', 2),
 (u'not', 2),
 (u'supports', 2),
 (u'packaging', 2),
 (u'tools', 2),
 (u'from', 2),
 (u'but', 2),
 (u'with', 2),
 (u'this', 2),
 (u'You', 2),
 (u'may', 2),
 (u'PySpark', 2),
 (u'are', 2),
 (u'standalone', 2),
 (u'(including', 2),
 (u'or', 2),
 (u'own', 2),
 (u'SQL', 2),
 (u'that', 2),
 (u'an', 2),
 (u'can', 2),
 (u'general', 2),
 (u'in', 2),
 (u'on', 2),
 (u'It', 2),
 (u'all', 1),
 (u'Requirements', 1),
 (u'R,', 1),
 (u'computation', 1),
 (u'standalone,', 1),
 (u'PySpark.', 1),
 (u'Scala,', 1),
 (u'only', 1),
 (u'rich', 1),
 (u'Apache', 1),
 (u'guide,', 1),
 (u'Mesos)', 1),
 (u'matches', 1),
 (u'documentation,', 1),
 (u'cluster.', 1),
 (u'compatibility).', 1),
 (u'GraphX', 1),
 (u'[project', 1),
 (u'related', 1),
 (u'see', 1),
 (u'download', 1),
 (u'[Apache', 1),
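
The swap, sort, swap back trick is easier to see on a small list. Below is a plain-Python sketch using made-up `(word, count)` pairs, next to the same ordering produced with a key function. As far as I know, RDDs also offer `sortBy`, which takes a key function in the same spirit as `sorted(..., key=...)`; treat that as a pointer to check against the API docs rather than something these notes tested.

```python
# Hypothetical (word, count) pairs standing in for the reduceByKey output.
pairs = [("Spark", 13), ("all", 1), ("the", 9)]

# Swap, sort, swap back: the same three steps as the RDD pipeline.
swapped = [(count, word) for word, count in pairs]
swapped.sort(reverse=True)
by_swapping = [(word, count) for count, word in swapped]

# The same ordering with a key function and no swaps.
by_key_function = sorted(pairs, key=lambda kv: kv[1], reverse=True)
```

Both lists come out ordered by descending count. The two approaches can order ties differently: the swapped version falls back to comparing the words, while `sorted` keeps tied items in their original order.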
 

Alright, that's probably good for this initial exploration.  
I'm not sure how I will use Spark the most in the future.  
I may use Spark SQL or SparkR much more.  
I can't really see myself relying on map and reduce functions in PySpark.  
There must be a better way!