# Basic setup

Here we import the `pyspark` module and set up a `SparkSession`.  By default, we'll use a `SparkSession` running locally, with one worker thread for every core on the local machine.  You can change this to run against a Spark cluster by replacing `local[*]` with the URL of the Spark master.


In [1]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# Creating a resilient distributed dataset (RDD)

One of the easiest ways to create a resilient distributed dataset is from a local collection, with the `parallelize` method on the `SparkContext` object.

In [2]:
numberRDD = sc.parallelize(range(1, 10000))

You can also create RDDs from files, S3 objects, and other external data sources.  See [the documentation](https://spark.apache.org/docs/latest/programming-guide.html#external-datasets) for more information. 

# Basic RDD transformations 

RDDs are _immutable_, so to transform an RDD, we create a new one.  RDDs are also _lazy_, so instead of transforming the elements when we create the new RDD, we store a reference to the original and the operation we'd need to apply to it to construct the transformed RDD.
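
To make the idea of laziness concrete, here is a minimal pure-Python sketch of a dataset that only records its parent and an operation. The `LazyDataset` class and its methods are hypothetical names invented for this illustration; this is a toy model of the concept, not how Spark implements RDDs.

```python
class LazyDataset:
    """Toy stand-in for an RDD: remembers a parent and an operation,
    and computes nothing until collect() is called."""

    def __init__(self, source=None, parent=None, op=None):
        self.source = source   # concrete data, only set on the root dataset
        self.parent = parent   # the dataset this one was derived from
        self.op = op           # function from an iterator to an iterator

    def map(self, f):
        # returns a NEW dataset; the original is left untouched
        return LazyDataset(parent=self, op=lambda it: (f(x) for x in it))

    def filter(self, pred):
        return LazyDataset(parent=self, op=lambda it: (x for x in it if pred(x)))

    def _iterate(self):
        if self.parent is None:
            return iter(self.source)
        return self.op(self.parent._iterate())

    def collect(self):
        # the only method that actually runs the recorded operations
        return list(self._iterate())


calls = []

def double(x):
    calls.append(x)            # record every time the function really runs
    return x * 2

numbers = LazyDataset(source=range(1, 6))
doubled = numbers.map(double)
evens = numbers.filter(lambda x: x % 2 == 0)

calls_before = len(calls)      # nothing has been computed yet
result = doubled.collect()     # this forces the computation
calls_after = len(calls)
```

Here `double` is not called when `doubled` is defined, only when `collect()` asks for the elements, and `numbers` itself never changes.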

In [3]:
# filter numberRDD, keeping only even numbers
evens = numberRDD.filter(lambda x: x % 2 == 0)

# produce an RDD by doubling every element in numberRDD
doubled = numberRDD.map(lambda x: x * 2)

# filter numberRDD, keeping only multiples of five
fives = numberRDD.filter(lambda x: x % 5 == 0)

# return an RDD of the elements in both evens and fives
tens = evens.intersection(fives)
sortedTens = tens.sortBy(lambda x: x)

You can see other RDD transformations in the [Spark documentation](https://spark.apache.org/docs/latest/programming-guide.html#transformations).

# RDD actions

Since RDDs are lazy and RDD transformations don't compute anything on their own, we need some way to force Spark to schedule a computation.  RDD _actions_ are operations that schedule the graph of computations implied by an RDD and return a result to the main program.  Here are a few examples:

In [4]:
(evens.count(), doubled.count())

(4999, 9999)

In [5]:
# note that we may not get results in order!
tens.take(5)

[6240, 8320, 2080, 2320, 8560]

In [6]:
# ...unless we sort
sortedTens.take(5)

[10, 20, 30, 40, 50]

In [7]:
# we can take a sample from an RDD (with or without replacement)
sortedTens.takeSample(False, 10)

[3760, 1040, 4800, 1210, 4170, 4250, 5660, 1470, 3100, 3440]

In [8]:
sortedTens.reduce(lambda x, y: max(x, y))

9990
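
The deterministic results above can be cross-checked on the driver with ordinary Python collections. This is only a sanity check for a data set small enough to fit in local memory; the variable names mirror the RDDs above but are plain lists here.

```python
numbers = range(1, 10000)   # same bounds as numberRDD: 1 through 9999

evens = [x for x in numbers if x % 2 == 0]
doubled = [x * 2 for x in numbers]
fives = [x for x in numbers if x % 5 == 0]

# numbers that are both even and multiples of five, in ascending order
tens = sorted(set(evens) & set(fives))

print(len(evens), len(doubled))   # 4999 9999
print(tens[:5])                   # [10, 20, 30, 40, 50]
print(max(tens))                  # 9990
```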

You can see some other RDD actions in the [Spark documentation](https://spark.apache.org/docs/latest/programming-guide.html#actions).

# Structured query and data frames

Spark also includes support for structured queries, including SQL and pandas- or R-like "data frame" operations through a query DSL.

Let's see structured queries in action by loading a [Parquet](http://parquet.apache.org/) file with some simplified [fedmsg](https://fedora-fedmsg.readthedocs.io/en/latest/) log messages.

In [9]:
df = spark.read.load("/data/msgs.parquet")
df.printSchema()

root
 |-- category: string (nullable = true)
 |-- i: long (nullable = true)
 |-- id: long (nullable = true)
 |-- msg: string (nullable = true)
 |-- msg_id: string (nullable = true)
 |-- source_name: string (nullable = true)
 |-- source_version: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- topic: string (nullable = true)



We can use the data frame DSL to do RDBMS-style queries on these data, which is great for characterizing or exploring them.  Because these queries can execute in parallel or across multiple machines, aggregations over large data sets can be _much_ faster in Spark than they are on a traditional single-node RDBMS.

It is simple to do basic aggregations, like this:

In [10]:
df.groupBy('category').count().orderBy('count', ascending=False).show()

+------------+-------+
|    category|  count|
+------------+-------+
|    buildsys|3494565|
|         git|  92611|
|        copr|  76774|
|       pkgdb|  43328|
|       bodhi|  33624|
|fedoratagger|  30835|
|   fedbadges|  29494|
|        wiki|  17609|
|      askbot|  15278|
|         fas|  13022|
|  summershum|   6996|
|        trac|   6496|
|     compose|   5340|
|        null|   5095|
|     ansible|   4333|
|      github|   4291|
|      planet|   3926|
|     meetbot|   3016|
|      anitya|   1300|
|         fmn|    817|
+------------+-------+
only showing top 20 rows



(This file isn't huge, but it's in the Docker image.)

In [11]:
df.count()

3889881

We can use the `show` method to quickly inspect a few rows of a data frame (not just the results of a query).  This is often helpful to sanity-check a new data source.

In [12]:
df.show(10)

+--------+---+--------+--------------------+--------------------+-----------+--------------+--------------------+--------------------+
|category|  i|      id|                 msg|              msg_id|source_name|source_version|           timestamp|               topic|
+--------+---+--------+--------------------+--------------------+-----------+--------------+--------------------+--------------------+
|buildsys|  1|14261348|{"build_id":22449...|2014-9c2aa45d-5e8...| datanommer|         0.6.4|2014-10-10T21:11:...|org.fedoraproject...|
|buildsys|  2|14261349|{"build_id":22456...|2014-7db189aa-0ff...| datanommer|         0.6.4|2014-10-10T21:11:...|org.fedoraproject...|
|buildsys| 12|14261359|{"build_id":22312...|2014-14a4b71c-d5a...| datanommer|         0.6.4|2014-10-10T21:11:...|org.fedoraproject...|
|buildsys| 17|14261364|{"build_id":22731...|2014-e0cb85f1-22c...| datanommer|         0.6.4|2014-10-10T21:11:...|org.fedoraproject...|
|buildsys| 32|14261379|{"build_id":22297...|2014-9e430f

# Data cleaning

Uh oh!  It looks like the `msg` field of this data frame is JSON-encoded message structures instead of actual message structures.  While we'd _never_ see messy data in the real world, this really throws a wrench into our tutorial.  Let's fix that by asking Spark to infer a schema for the JSON fields.

In [13]:
msgRDD = df.select("msg").rdd.map(lambda x: x[0])
# structs = spark.read.json(msgRDD)
# structs.printSchema()

You'll notice that the last two lines are commented out there, and with good reason.  You can uncomment them and run them, but only if you're patient and willing to scroll. You'll get a huge schema with objects that have one field for (as one example) every Fedora user who has ever participated in an IRC meeting! (Alternatively, [click here](https://gist.github.com/willb/ede22cdcd25b64e8cda952f927701d96) to see a rendered version of the inferred schema.) 

Spark can't infer a useful schema for these JSON records, because their schemas diverge and because of some of the unusual ways that `fedmsg` data uses JSON to encode maps.  While there are a few reasons for the schema divergence (see [a practical treatment](http://chapeau.freevariable.com/2014/10/fedmsg-and-spark.html) or a more [type-theoretic one](http://chapeau.freevariable.com/2014/11/algebraic-types.html)), in this case one problem is that different `fedmsg` messages use the `branches` field to refer to values with incompatible types.
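
As a small illustration of this kind of divergence, the sketch below inspects the type of a `branches` field in three JSON strings. The messages are invented for this example and are much simpler than real `fedmsg` payloads, so the specific types shown are an assumption, not a description of the actual data.

```python
import json

# Hypothetical messages, made up for illustration only
records = [
    '{"branches": ["f20", "f21"]}',
    '{"branches": {"f20": "approved"}}',
    '{"branches": "master"}',
]

def type_of_field(js, field):
    """Return the Python type name of `field` in a JSON-encoded object."""
    return type(json.loads(js).get(field)).__name__

# collect the distinct types seen for the same field name
branch_types = sorted({type_of_field(r, "branches") for r in records})
print(branch_types)   # ['dict', 'list', 'str']
```

A single column cannot hold an array, a struct, and a string at once, which is why schema inference struggles when one field name carries values like these.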

Fortunately, we can fix that with a pretty quick hack.  We'll just go through every record and retain only a few fields that we know are not going to give us grief.  (You'd probably want to do something more sophisticated in a real application.)  We'll use Spark's _user-defined function_ mechanism to achieve this.

In [14]:
import json

# define the fields we want to keep
interesting_fields = ['agent', 'author', 'copr', 'user', 'msg', 'meeting_topic', 'name', 'owner', 'package']

# describe the return type of our user-defined function
from pyspark.sql.types import StructType, StructField, StringType
resultType = StructType([StructField(f, StringType(), nullable=True) for f in interesting_fields])

# this is the body of our first user-defined function, to restrict us to a subset of fields
def trimFieldImpl(js):
    try:
        d = json.loads(js)
        return [d.get(f) for f in interesting_fields]
    except Exception:
        # return a struct of nulls if we fail to parse this message
        return [None] * len(interesting_fields)
    
from pyspark.sql.functions import udf

# register trimFieldImpl as a user-defined function
trimFields = udf(trimFieldImpl, resultType)

trimmedDF = df.withColumn("msg", trimFields("msg"))
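
Since the body of the user-defined function is ordinary Python, one way to sanity-check it is to call it on a string without involving Spark at all. The sketch below repeats the trimming logic under the hypothetical name `trim_fields` so that it runs on its own; the sample message is invented for illustration.

```python
import json

# same field list as in the cell above
interesting_fields = ['agent', 'author', 'copr', 'user', 'msg',
                      'meeting_topic', 'name', 'owner', 'package']

def trim_fields(js):
    """Standalone copy of the trimming logic, for testing outside Spark."""
    try:
        d = json.loads(js)
        return [d.get(f) for f in interesting_fields]
    except Exception:
        # unparseable input yields one None per field
        return [None] * len(interesting_fields)

# a made-up message with two fields we keep and one we drop
sample = '{"agent": "alice", "package": "spark", "branches": ["f21"]}'
as_dict = dict(zip(interesting_fields, trim_fields(sample)))

broken = trim_fields("not json")
```

In `as_dict`, `agent` and `package` keep their values, every other retained field is `None`, and `branches` is gone entirely.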

In [15]:
trimmedDF.printSchema()

root
 |-- category: string (nullable = true)
 |-- i: long (nullable = true)
 |-- id: long (nullable = true)
 |-- msg: struct (nullable = true)
 |    |-- agent: string (nullable = true)
 |    |-- author: string (nullable = true)
 |    |-- copr: string (nullable = true)
 |    |-- user: string (nullable = true)
 |    |-- msg: string (nullable = true)
 |    |-- meeting_topic: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- owner: string (nullable = true)
 |    |-- package: string (nullable = true)
 |-- msg_id: string (nullable = true)
 |-- source_name: string (nullable = true)
 |-- source_version: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- topic: string (nullable = true)



Data frames are a great way to explore structured data, but you can also train models against them (either by converting to RDDs and using [MLlib](https://spark.apache.org/docs/latest/mllib-guide.html) or by using [ML Pipelines](https://spark.apache.org/docs/latest/ml-guide.html) to define learning pipelines directly on data frames).

Let's extract bug and update comments from our `fedmsg` data and use those to train a word2vec model.

In [16]:
def getComments(js):
    try:
        # construct a dict from the json string
        # we care about the following paths:  
        # * /comment/text 
        # * /notes/
        # * /update/comments//text
        d = json.loads(js)
        comment = [d.get('comment', {})]
        notes = d.get('notes', [])
        update_comments =  d.get('update', {}).get('comments', [])
        comment_texts = [c['text'] for c in comment + update_comments if 'text' in c]
        return comment_texts + notes
    except Exception:
        return []

commentsRDD = msgRDD.flatMap(lambda js: getComments(js))

# Turn comments into sequences of words.  Convert everything 
# to lowercase first to avoid spurious "synonyms" between differently-
# capitalized words (but try this also without `w.lower()` and see how 
# your results change!)
#
# We won't bother stripping punctuation or stemming but while we're in
# #YOLO territory for a tutorial and demo, you'd surely want to do 
# something more sensible in a real application.
wordSeqs = commentsRDD.map(lambda s: [w.lower() for w in s.split()])

In [17]:
# actually train a model

from pyspark.mllib.feature import Word2Vec

w2v = Word2Vec()
model = w2v.fit(wordSeqs)

# find synonyms for a given word; each result pairs a word with
# its cosine similarity to the query
synonyms = model.findSynonyms('works', 5)

for word, similarity in synonyms:
    print("{}: {}".format(word, similarity))

working: 0.7936919460617579
work: 0.7433684004230725
looks: 0.7173046140849485
worked: 0.6833483441153587
as: 0.6710281464056327


In [18]:
# see some of the words in the model

list(model.getVectors().keys())[:20]

['breaks',
 '(fedora)',
 'yaneti.',
 'immanetize.',
 'hpejakle.',
 'submitted',
 'plugin',
 'looks',
 'alpha',
 'kkeithle.',
 'tgl.',
 'packages,',
 'jdunn.',
 'nalin',
 'used',
 'automatic',
 'great.',
 'mystro256.',
 'oget.',
 'frafra.']
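
Tokens like `'great.'` and `'(fedora)'` appear in the vocabulary because the pipeline splits on whitespace only, so punctuation stays attached to words. The sketch below shows this on an invented comment, along with one possible cleanup; `strip_punct` is a hypothetical helper, and a real application may want a proper tokenizer instead.

```python
import re

# an invented comment, for illustration only
comment = "Works great. Thanks, kkeithle. (Fedora)"

# what the pipeline above does: lowercase and split on whitespace
naive = [w.lower() for w in comment.split()]

def strip_punct(word):
    """Remove non-word characters from both ends of a token."""
    return re.sub(r"^\W+|\W+$", "", word)

# drop tokens that are empty once punctuation is removed
cleaned = [t for t in (strip_punct(w) for w in naive) if t]

print(naive)     # ['works', 'great.', 'thanks,', 'kkeithle.', '(fedora)']
print(cleaned)   # ['works', 'great', 'thanks', 'kkeithle', 'fedora']
```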

# Some exercises:

1.  Try running the word2vec pipeline above without converting words to lowercase (you can eliminate the list comprehension or change `w.lower()` to `w`).  How does this change your results?
2.  Try eliminating punctuation from words (use a regular expression).  You may also want to remove _stopwords_, or extremely common words (like articles, prepositions, and conjunctions).  The `StopWordsRemover` class in Spark will remove stopwords from prose in several languages; check the documentation for details on how to use it.  Note that `StopWordsRemover` is part of `spark.ml` and operates on data frames, so since we're working with RDDs here you'll need to either convert the word sequences to a data frame first or filter the RDD against a stopword list yourself.  Run your model and some queries after each step.  Do the additional data cleaning steps improve your results?  If not, why do you suppose the results change?
3.  You may also be interested in stemming your words (e.g. converting "works", "worked", "working", "worker" to "work").  While you can also use a regular expression for this, you may find it's more productive to use an external library.  Consider using the [NLTK](http://nltk.org) package, which provides a sensible text tokenizer and a stemmer, among other tools for natural language processing.
4.  The synonym query is interesting, but the word2vec model also supports finding analogies through vector arithmetic.  For example, in a word2vec model trained on general English text, the vector for "king" plus the vector for "woman" minus the vector for "man" will be very similar to the vector for "queen".  While the training corpus we have with a subset of fedmsg messages is far, far smaller than we'd want to use for a general model, you may want to play with this and see how it works.  You can use `getVectors()` to get the vectors for particular words and `numpy` for the vector arithmetic, and then you can pass a vector to the `findSynonyms()` method to see which words have vectors closest to a given vector.
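
The arithmetic behind the analogy in the last exercise can be sketched with plain Python lists before trying it on the real model. The three-dimensional vectors below are invented so that the example works out; they are not output from the model trained above, and the `cosine` and `analogy` helpers are hypothetical names for this illustration.

```python
import math

# Toy vectors, made up for illustration; real ones would come from getVectors()
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word whose vector is closest to vec(a) + vec(b) - vec(c),
    ignoring the three query words themselves."""
    target = [x + y - z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

best = analogy("king", "woman", "man")
print(best)   # queen
```

With the real model you would build the target vector the same way and pass it to `findSynonyms()`, which does the nearest-neighbour search for you.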