PySpark and Data Movement
=========================

In this notebook we inspect GitHub JSON records with PySpark.  This serves two objectives:

1.  We learn about using collections APIs to filter through JSON record data
2.  We learn the value of avoiding data transfer between workers by comparing `groupBy` with `combineByKey` and `reduceByKey`, which solve the same problem in different ways

### GitHub Data on Spark

We read some GitHub JSON data with Spark.

In [1]:
from pyspark import SparkContext

In [None]:
sc = SparkContext('spark://schedulers:7077')

In [None]:
rdd = sc.textFile("s3a://githubarchive-data/2015-01-01-*.json.gz")

In [None]:
rdd.take(2)

In [None]:
import json
js = rdd.map(json.loads)

In [None]:
js.take(1)
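To see what `rdd.map(json.loads)` does to each line, here is a minimal local sketch that needs no Spark cluster. The three input lines are hypothetical stand-ins, shaped loosely like archive events; the only field the rest of this notebook relies on is `type`.

```python
import json

# Hypothetical lines, one JSON object per line, as in the archive files.
lines = [
    '{"type": "PushEvent", "actor": {"login": "alice"}}',
    '{"type": "WatchEvent", "actor": {"login": "bob"}}',
    '{"type": "PushEvent", "actor": {"login": "carol"}}',
]

# Same parsing step as rdd.map(json.loads), applied to a plain list.
records = list(map(json.loads, lines))

# Filter on a field of the parsed dictionaries, then pull out a nested value.
pushes = [d for d in records if d["type"] == "PushEvent"]
logins = [d["actor"]["login"] for d in pushes]
print(logins)
```

On an RDD the same filter would be written with `js.filter(...)` and `js.map(...)`, and it would run lazily until an action such as `take` or `collect` is called.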

In [None]:
js.persist()

In [None]:
js.count()

### Count the number of records, grouped by type

### ... with groupBy

In [None]:
%%time
js.groupBy(lambda d: d['type']).map(lambda kv: (kv[0], len(kv[1]))).collect()

In [None]:
js.keyBy(lambda d: d['type']).take(2)

### ... with combineByKey

In [None]:
%%time
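# Within a partition: each further record for a key bumps that key's count
def add(acc, x):
    return acc + 1

# Across partitions: sum the partial counts for the same key
def global_add(x, y):
    return x + y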

js.keyBy(lambda d: d['type']).combineByKey(lambda x: 1, add, global_add).collect()
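The roles of the three functions passed to `combineByKey` can be illustrated without a cluster. The sketch below is a pure-Python stand-in, not Spark's implementation: `combine_by_key` is a hypothetical helper, and the partitions are made-up lists of `(key, record)` pairs like those produced by `keyBy`.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Pure-Python stand-in for the combineByKey contract (not Spark's code)."""
    # Step 1: combine inside each partition; nothing leaves the partition yet.
    per_partition = []
    for part in partitions:
        combiners = {}
        for key, value in part:
            if key in combiners:
                combiners[key] = merge_value(combiners[key], value)
            else:
                combiners[key] = create_combiner(value)
        per_partition.append(combiners)

    # Step 2: merge the small per-partition results for each key.
    merged = {}
    for combiners in per_partition:
        for key, comb in combiners.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], comb)
            else:
                merged[key] = comb
    return merged


def add(acc, x):
    return acc + 1


def global_add(x, y):
    return x + y


# Hypothetical keyed records split across two partitions.
partitions = [
    [("PushEvent", {"id": 1}), ("PushEvent", {"id": 2}), ("WatchEvent", {"id": 3})],
    [("PushEvent", {"id": 4}), ("WatchEvent", {"id": 5})],
]

counts = combine_by_key(partitions, lambda x: 1, add, global_add)
print(counts)
```

Only the per-partition dictionaries of partial counts take part in step 2, which is why this approach moves far less data than grouping whole records.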

<table>
    <tr>
      <td>
        <img src="https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/images/group_by.png" width="400">
      </td>
      <td>
        <img src="https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/images/reduce_by.png" width="400">
      </td>
    </tr>
</table>



Image source: [Databricks Spark Knowledge Base](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html)
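The difference in the two diagrams can be put into rough numbers. The sketch below uses hypothetical partitions and a simplified cost model: it treats every item handed to the shuffle as "moved", even though in practice some of it stays on the same machine.

```python
from collections import Counter

# Hypothetical event types, already split across three partitions.
partitions = [
    ["PushEvent"] * 5 + ["WatchEvent"] * 2,
    ["PushEvent"] * 3 + ["IssuesEvent"] * 4,
    ["WatchEvent"] * 6,
]

# Group first: every record is handed to the shuffle.
moved_by_grouping = sum(len(part) for part in partitions)

# Combine first: each partition hands over one (key, partial count) pair
# per distinct key it holds.
partials = [Counter(part) for part in partitions]
moved_by_combining = sum(len(p) for p in partials)

# Either route ends with the same totals.
totals = sum(partials, Counter())

print(moved_by_grouping, moved_by_combining)
print(dict(totals))
```

With grouping, the amount moved grows with the number of records. With combining, it grows with the number of distinct keys per partition, which for event types is small.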

### ... with reduceByKey

In [None]:
%%time
js.map(lambda d: (d['type'], 1)).reduceByKey(lambda acc, x: acc + x).collect()
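`reduceByKey` takes a single function and applies it both within each partition and when merging partial results, so that function should be associative and commutative. A minimal pure-Python sketch of that behaviour follows; `reduce_by_key` is a hypothetical helper and the pairs are made up.

```python
def reduce_by_key(partitions, func):
    """Pure-Python stand-in: apply func inside each partition, then across them."""
    merged = {}
    for part in partitions:
        # Reduce locally first.
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        # Then fold the local result into the running total with the same func.
        for key, value in local.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical (type, 1) pairs, as produced by the map step above.
pairs = [("PushEvent", 1), ("WatchEvent", 1), ("PushEvent", 1), ("PushEvent", 1)]

one_partition = reduce_by_key([pairs], lambda acc, x: acc + x)
two_partitions = reduce_by_key([pairs[:2], pairs[2:]], lambda acc, x: acc + x)

print(one_partition, two_partitions)
```

Because addition is associative and commutative, the result does not depend on how the pairs are split into partitions.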