Let's start with the basics.

# Connecting to Spark

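We're going to be using Python with DataFrames, which are only available in Spark 1.3 or later (the `sql.read` interface used below was added in Spark 1.4).  We're going to be using a recent version of open-source Spark.  To use DataFrames, you'll have to import `SQLContext` and create an instance from the existing SparkContext, `sc`.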

In [3]:
from pyspark.sql import SQLContext
sql = SQLContext(sc)

Let's set up some common functions.

# Reading a table

In [4]:
user = sql.read.format("org.apache.spark.sql.cassandra").\
               load(keyspace="users", table="user")

Py4JJavaError: An error occurred while calling o32.load.
: java.io.IOException: Couldn't find users.user or any similarly named keyspace and table pairs
	at org.apache.spark.sql.cassandra.CassandraSourceRelation.<init>(CassandraSourceRelation.scala:52)
	at org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:182)
	at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:57)
	at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:265)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:207)
	at java.lang.Thread.run(Thread.java:745)


# Display

# A nicer reader

In [16]:
def create_reader(sql):
    def reader(keyspace, table):
        df = sql.read.format("org.apache.spark.sql.cassandra").\
               load(keyspace=keyspace, table=table)
        return df
    return reader

reader = create_reader(sql)
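
With the factory in place, reading a table is a single call such as `reader("users", "user")`. The sketch below repeats the factory and runs it against a small stand-in object so it can be tried without a cluster. `FakeSQLContext` and `FakeReader` are hypothetical helpers written for this illustration, not part of PySpark; they only record what the closure asks for.

```python
def create_reader(sql):
    def reader(keyspace, table):
        return sql.read.format("org.apache.spark.sql.cassandra").\
               load(keyspace=keyspace, table=table)
    return reader


class FakeReader(object):
    """Hypothetical stand-in for a DataFrameReader; records the request."""
    def __init__(self):
        self.source = None

    def format(self, source):
        self.source = source
        return self

    def load(self, **options):
        # A real reader would return a DataFrame; here we return the request.
        return {"source": self.source, "options": options}


class FakeSQLContext(object):
    """Hypothetical stand-in for SQLContext, exposing only `read`."""
    @property
    def read(self):
        return FakeReader()


reader = create_reader(FakeSQLContext())
request = reader("users", "user")
print(request)
```

The closure captures `sql` once, so every later call only needs the keyspace and table names.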

In [4]:
apd.collect()

[Row(user_id=1, favorite_foods=[u'Bacon', u'Cheese'], name=u'Jon'),
 Row(user_id=2, favorite_foods=[u'Kale', u'Pizza', u'Wine'], name=u'Dani'),
 Row(user_id=3, favorite_foods=[u'Muffins', u'Pie', u'Steak'], name=u'Patrick')]

# A Nicer Writer

In [17]:
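def create_writer(sql, mode="append"):
    # sql is unused here; it is kept so the signature mirrors create_reader.
    def writer(df, keyspace, table):
        # Pass the caller's mode through instead of hard-coding "append".
        df.write.format("org.apache.spark.sql.cassandra").\
                 options(table=table, keyspace=keyspace).save(mode=mode)
    return writer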

writer = create_writer(sql)
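
Spark's DataFrame writer accepts the save modes `"append"`, `"overwrite"`, `"ignore"` and `"error"`. One possible refinement, sketched here under the assumption that you want typos caught early, is to check the mode when the writer is created rather than when the first write runs. `create_checked_writer`, `FakeWriter` and `FakeDataFrame` are hypothetical names for this illustration; the two fake classes stand in for Spark objects so the sketch runs without a cluster.

```python
VALID_MODES = ("append", "overwrite", "ignore", "error")


def create_checked_writer(mode="append"):
    # Fail at construction time if the mode is misspelled.
    if mode not in VALID_MODES:
        raise ValueError("unknown save mode: %r" % (mode,))

    def writer(df, keyspace, table):
        df.write.format("org.apache.spark.sql.cassandra").\
                 options(table=table, keyspace=keyspace).save(mode=mode)
    return writer


class FakeWriter(object):
    """Hypothetical stand-in for a DataFrameWriter; records the request."""
    def __init__(self, log):
        self.log = log
        self.request = {}

    def format(self, source):
        self.request["source"] = source
        return self

    def options(self, **opts):
        self.request.update(opts)
        return self

    def save(self, mode=None):
        self.request["mode"] = mode
        self.log.append(self.request)


class FakeDataFrame(object):
    """Hypothetical stand-in for a DataFrame, exposing only `write`."""
    def __init__(self):
        self.saved = []

    @property
    def write(self):
        return FakeWriter(self.saved)


df = FakeDataFrame()
create_checked_writer("overwrite")(df, "users", "favorite_foods_index")
print(df.saved)

try:
    create_checked_writer("apend")  # deliberate typo
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```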

# Migrating to a new structure

In [22]:
from pyspark.sql.functions import explode
# explode() emits one output row per element of favorite_foods.
result = apd.select(explode(apd.favorite_foods).alias("food"), "user_id")
writer(result, "users", "favorite_foods_index")
result.collect()

[Row(food=u'Bacon', user_id=1),
 Row(food=u'Cheese', user_id=1),
 Row(food=u'Kale', user_id=2),
 Row(food=u'Pizza', user_id=2),
 Row(food=u'Wine', user_id=2),
 Row(food=u'Muffins', user_id=3),
 Row(food=u'Pie', user_id=3),
 Row(food=u'Steak', user_id=3)]
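
To make the reshaping concrete, here is a plain-Python sketch of what `explode` does to these rows: each element of the list column becomes its own row, paired with the `user_id` it came from. The dictionaries and the `explode_foods` helper are illustrative stand-ins for the DataFrame rows above, not Spark API.

```python
rows = [
    {"user_id": 1, "favorite_foods": ["Bacon", "Cheese"]},
    {"user_id": 2, "favorite_foods": ["Kale", "Pizza", "Wine"]},
    {"user_id": 3, "favorite_foods": ["Muffins", "Pie", "Steak"]},
]


def explode_foods(rows):
    # One output row per (food, user_id) pair, in the original order.
    return [
        {"food": food, "user_id": row["user_id"]}
        for row in rows
        for food in row["favorite_foods"]
    ]


index = explode_foods(rows)
for entry in index:
    print(entry)
```

Three users with two, three and three favorite foods yield eight index rows, matching the output above.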
