# Connecting to Spark

This is an IPython notebook. You can execute a cell by clicking on it and pressing Shift-Enter.

We can execute Spark commands here directly and get immediate results.

We're going to be using Python with DataFrames, which are only available in Spark 1.3 or later. We'll use a recent version of open source Spark. To use DataFrames, you'll have to import `SQLContext` and create one from the existing SparkContext, `sc`.

In [1]:
from pyspark.sql import SQLContext
sql = SQLContext(sc)

# Reading a Cassandra Table

In [2]:
user = sql.read.format("org.apache.spark.sql.cassandra").\
               load(keyspace="training", table="user")

# Displaying results

If we never perform an operation, our DataFrame is never read in. We can force the read and see the data by calling `collect()`, which returns every row to the driver as a list, or `show()`, which prints the first rows as a table.

In [3]:
user.collect()

[Row(user_id=1, age=34, favorite_foods=[u'Bacon', u'Cheese'], name=u'Jon'),
 Row(user_id=2, age=22, favorite_foods=[u'Kale', u'Pizza', u'Wine'], name=u'Dani'),
 Row(user_id=4, age=1, favorite_foods=[u'Candy', u'Fear'], name=u'Baby Luke'),
 Row(user_id=3, age=108, favorite_foods=[u'Muffins', u'Pie', u'Steak'], name=u'Patrick'),
 Row(user_id=5, age=10, favorite_foods=[u'Anger'], name=u'Larry')]

In [13]:
user.show()

+-------+---+--------------------+---------+
|user_id|age|      favorite_foods|     name|
+-------+---+--------------------+---------+
|      1| 34|ArrayBuffer(Bacon...|      Jon|
|      2| 22|ArrayBuffer(Kale,...|     Dani|
|      4|  1|ArrayBuffer(Candy...|Baby Luke|
|      3|108|ArrayBuffer(Muffi...|  Patrick|
|      5| 10|  ArrayBuffer(Anger)|    Larry|
+-------+---+--------------------+---------+
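Since `collect()` pulls every row back to the driver, it can be worth reaching for a smaller action on bigger tables. The sketch below wraps two standard DataFrame actions, `take()` and `count()`; the helper names `peek` and `row_count` are made up for this notebook and are not part of Spark.

```python
def peek(df, n=2):
    """Return at most n rows as a list of Row objects.

    take() is an action, so it triggers the read, but only n rows
    are sent back to the driver (collect() sends all of them).
    """
    return df.take(n)


def row_count(df):
    # count() is also an action; only a single number comes back.
    return df.count()
```

With the `user` DataFrame from above, you would call these as `peek(user)` or `row_count(user)`.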



# Basic Filtering

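If we're going to do anything with our data, we need to be able to do a simple task: filtering.

Here's the syntax for filtering. Note that `filter` only returns a new DataFrame without reading anything, so the output shows the schema rather than any rows: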

In [9]:
user.filter(user.age > 2)

DataFrame[user_id: int, age: int, favorite_foods: array<string>, name: string]

There's an alternative syntax for filtering:

In [10]:
user[user.age > 2]

DataFrame[user_id: int, age: int, favorite_foods: array<string>, name: string]

And of course, there's a third syntax, a SQL expression string, which is handy for filters that have a degree of complexity.

In [15]:
user.filter("age > 100 or name = 'Larry'").collect()

[Row(user_id=3, age=108, favorite_foods=[u'Muffins', u'Pie', u'Steak'], name=u'Patrick'),
 Row(user_id=5, age=10, favorite_foods=[u'Anger'], name=u'Larry')]
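The compound condition above can also be written with `Column` expressions instead of a SQL string. This is a minimal sketch; the function name `older_than_or_named` is hypothetical, and it assumes the DataFrame has `age` and `name` columns like our `user` table.

```python
def older_than_or_named(df, min_age, name):
    """Rows where age > min_age OR name equals the given name."""
    # | is "or", & is "and", ~ is "not". Python's own `or` / `and` do not
    # work on Columns, and each comparison needs its own parentheses
    # because | and & bind more tightly than > and ==.
    return df.filter((df["age"] > min_age) | (df["name"] == name))
```

Calling `older_than_or_named(user, 100, "Larry").collect()` should select the same rows as the SQL string version.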

Try filtering for users named "Jon".

When you refer to `user.age`, you're looking at a `Column`. The API for `Column` is here: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column

**Advanced Query:** Try selecting the users who have the favorite food "Bacon"

# A nicer reader

Personally, I find needing to type `org.apache.spark.sql.cassandra` everywhere a little annoying. Here are a couple of convenience functions, each of which returns a function (slightly tricky) that can be used to read or write tables in a keyspace. Execute the block below. You can then refer to tables like so:

`user = reader("user")`

In [11]:
def create_reader(sql, keyspace):
    def reader(table):
        df = sql.read.format("org.apache.spark.sql.cassandra").\
               load(keyspace=keyspace, table=table)
        return df
    return reader

def create_writer(sql, keyspace, mode="append"):
    def writer(df, table):
        # Use the mode passed to create_writer rather than a hard-coded "append".
        df.write.format("org.apache.spark.sql.cassandra").\
                 options(table=table, keyspace=keyspace).save(mode=mode)
    return writer

writer = create_writer(sql, "training")
reader = create_reader(sql, "training")
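If the "function that returns a function" part feels slippery, here is a minimal sketch of the same closure pattern run against a stand-in object, so it works without Spark or Cassandra. `FakeSQL` and `FakeReader` are hypothetical classes invented for this illustration; they only record what a real `sql.read` would have been asked to load.

```python
class FakeReader(object):
    """Stand-in for sql.read: records calls instead of contacting Cassandra."""

    def format(self, source):
        self.source = source
        return self

    def load(self, **options):
        # Return what a real load() would have been asked for.
        return (self.source, options)


class FakeSQL(object):
    """Stand-in for SQLContext, exposing only a .read attribute."""

    @property
    def read(self):
        return FakeReader()


def create_reader(sql, keyspace):
    def reader(table):
        # keyspace is not an argument here; it is remembered from the
        # enclosing create_reader call. That is the closure.
        return sql.read.format("org.apache.spark.sql.cassandra").load(
            keyspace=keyspace, table=table)
    return reader


training = create_reader(FakeSQL(), "training")
other = create_reader(FakeSQL(), "other_keyspace")

# Same table name, but each reader supplies its own keyspace.
result_training = training("user")
result_other = other("user")
print(result_training)
print(result_other)
```

Each reader carries its own keyspace around, which is why the real `reader("user")` only needs the table name.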

# Data Migrations

# SparkSQL

# Pandas and Plotting

In [14]:
%matplotlib inline