# Connecting to Spark

This is an IPython notebook.  You can execute a cell by clicking on it and pressing Shift-Enter.

We can execute Spark commands here directly and get immediate results.

We're going to be using Python with DataFrames, which are only available in Spark 1.3 or later.  We're going to be using a recent version of open source Spark.  To use DataFrames, you'll have to import the `SQLContext`.

In [None]:
from pyspark.sql import SQLContext
sql = SQLContext(sc)

# Reading a Cassandra Table

In [None]:
user = sql.read.format("org.apache.spark.sql.cassandra").\
               load(keyspace="training", table="user")

# Displaying results

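DataFrames are evaluated lazily: if we never perform an action, our DataFrame is never read in.  We can force the read and see the data by calling `collect()`, which returns the rows to the driver as a list, or `show()`, which prints the first rows as a table.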

In [None]:
user.collect()

In [None]:
user.show()
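
To make the "nothing is read until we ask" idea concrete, here's a toy model in plain Python.  It is only an analogy, not how Spark is implemented: `LazyRows` and `load_users` are made-up names, and the two rows are invented sample data.

```python
class LazyRows(object):
    """Toy stand-in for a DataFrame: records work, runs it only on collect()."""

    def __init__(self, source, predicates=()):
        self._source = source                  # zero-argument callable that loads rows
        self._predicates = tuple(predicates)   # filters recorded so far

    def filter(self, predicate):
        # "Transformation": return a new plan; nothing is read here.
        return LazyRows(self._source, self._predicates + (predicate,))

    def collect(self):
        # "Action": load the rows now and apply every recorded filter.
        rows = self._source()
        return [r for r in rows if all(p(r) for p in self._predicates)]


reads = []  # one entry is appended each time the data is actually loaded


def load_users():
    reads.append("read")
    return [{"name": "Larry", "age": 101}, {"name": "Jon", "age": 30}]


plan = LazyRows(load_users).filter(lambda r: r["age"] > 100)
reads_before = len(reads)   # the filter was recorded, but no data was loaded
result = plan.collect()     # the load happens here
reads_after = len(reads)
```

Building `plan` is cheap; only `collect()` touches the data.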

# Basic Filtering

If we're going to do anything with our data, we need to be able to do one simple task: filtering.

Here's the syntax for filtering:

In [None]:
user.filter(user.age > 2)

There's an alternative syntax for filtering:

In [None]:
user[user.age > 2]

And of course, there's a third syntax, a SQL expression string, which is handy for filters that have a degree of complexity.

In [None]:
user.filter("age > 100 or name = 'Larry'").collect()

Try filtering for users named "Jon".

When you refer to `user.age`, you're looking at a `Column`.  The API for `Column` is here: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column
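
`Column` comparisons can also be combined with `|` (or), `&` (and) and `~` (not).  Here's a small sketch; the function names are made up for illustration, and `user` is assumed to be the DataFrame loaded earlier, with the `age` and `name` columns used in the filters above.

```python
def over_100_or_larry(user):
    # Same rows as the string filter "age > 100 or name = 'Larry'".
    # Each comparison needs its own parentheses, because | binds more
    # tightly than > and == in Python.
    return user.filter((user.age > 100) | (user.name == "Larry"))


def adults_not_named_larry(user):
    # & means "and", ~ means "not".
    return user.filter((user.age > 18) & ~(user.name == "Larry"))
```

In the notebook you would call, for example, `over_100_or_larry(user).show()`.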

**Advanced Query:** Try selecting the users who have the favorite food "Bacon"

# A nicer reader

Personally, I find needing to type `org.apache.spark.sql.cassandra` everywhere a little annoying.  Here are a couple of convenience functions.  Each one returns a function (slightly tricky) that can be used to read or write tables in a keyspace.  Execute the block below.  You can then refer to tables like so:

`user = reader("user")`

In [None]:
def create_reader(sql, keyspace):
    def reader(table):
        df = sql.read.format("org.apache.spark.sql.cassandra").\
               load(keyspace=keyspace, table=table)
        return df
    return reader

def create_writer(sql, keyspace, mode="append"):
    def writer(df, table):
        # Use the mode passed to create_writer instead of hard-coding "append".
        df.write.format("org.apache.spark.sql.cassandra").\
                 options(table=table, keyspace=keyspace).save(mode=mode)
    return writer

writer = create_writer(sql, "training")
reader = create_reader(sql, "training")
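
If the "function that returns a function" part feels tricky, here's the same pattern with Spark taken out.  `create_describer` is a made-up helper and `"staging"` is a hypothetical keyspace name; the point is only that the inner function remembers the `keyspace` it was created with, just as `reader` remembers `sql` and `keyspace`.

```python
def create_describer(keyspace):
    def describe(table):
        # `keyspace` is captured from the enclosing call to create_describer.
        return "{}.{}".format(keyspace, table)
    return describe


training = create_describer("training")
staging = create_describer("staging")   # hypothetical second keyspace

training_user = training("user")
staging_user = staging("user")
```

Each call to `create_describer` produces an independent function, so `training` and `staging` never interfere with each other.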

# Data Migrations

One thing Spark is useful for is performing data migrations.  We frequently need to take a table and write it out with a new structure.  Here's an example where we take the `user` table and construct a new table from it.  The `writer()` function takes a DataFrame and a table name.  Currently the fields need to be in the correct order in the DataFrame.  Let's build an index of age -> user, adults only.  After you execute the cell below, verify that the correct data is in the table `adults`.

In [None]:
adults = user[user.age > 18].select('age', 'user_id', 'name')
writer(adults, "adults")
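
One way to do that check from the notebook, rather than from cqlsh, is to read the table back and count rows.  This is a rough sketch: `count_matching` and `looks_like_adults_only` are made-up helpers, and they assume the `reader` function defined earlier and an `age` column in the `adults` table.

```python
def count_matching(reader, table, condition):
    """Count rows of `table` that satisfy a SQL-style filter string."""
    return reader(table).filter(condition).count()


def looks_like_adults_only(reader, table="adults"):
    """True if the table has rows and none of them has age <= 18."""
    total = reader(table).count()
    minors = count_matching(reader, table, "age <= 18")
    return total > 0 and minors == 0
```

Calling `looks_like_adults_only(reader)` after the migration should give `True` if the write worked and the filter did its job.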

Now it's your turn.  This migration may be a little tricky.  What we want to do is map foods people like to users.  We're going to want to save this in the table `favorite_foods_index`.  Look at its structure using cqlsh or DevCenter.  Hint: take a look at the documentation for `explode()`:

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode
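
If it helps to picture what `explode()` does before using it, here's a plain-Python imitation of its effect on rows.  This is not the Spark API: `explode_rows`, the field names `foods` and `food`, and the sample rows are all made up, so check the real column names in your tables.

```python
def explode_rows(rows, list_field, out_field):
    """Yield one output row per element of row[list_field].

    Rows whose list is empty or None produce no output, which mirrors
    how explode() treats empty and null collections.
    """
    for row in rows:
        for item in (row[list_field] or []):
            new_row = {k: v for k, v in row.items() if k != list_field}
            new_row[out_field] = item
            yield new_row


sample = [
    {"user_id": 1, "foods": ["Bacon", "Eggs"]},
    {"user_id": 2, "foods": []},
    {"user_id": 3, "foods": None},
]
exploded = list(explode_rows(sample, "foods", "food"))
```

One user with two foods becomes two rows, one per food, which is the shape an index keyed by food needs.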

# SparkSQL

# Pandas and Plotting

In [None]:
%matplotlib inline